The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the textbook and to analyze real data and come to informed conclusions. To straighten out which is which:

R: is a programming language specifically designed for statistical analysis. R is open-source, and is developed by a team of statisticians and programmers in both academia and industry. It is updated frequently and has become the de facto industry standard. In the data science realm, alternatives to R include Python with the pandas library, and Julia. In the statistics realm, alternatives include SAS, Stata, and SPSS.

RStudio: is an integrated development environment (IDE) for R. RStudio is also open-source software, and depends upon a valid R installation to function. RStudio is available as both a Desktop and a Server application. Before RStudio, people used R through the command line directly, or through graphical user interfaces like RKWard and Rcmdr, but RStudio is so vastly superior that these alternatives have few users left. RStudio employees are important drivers of R innovation, and currently maintain the rmarkdown, knitr, and dplyr packages, among others.

R Markdown: is a document format that lets you weave together R code and text. R Markdown is an extension of markdown (a general-purpose authoring format) that provides functionality for processing R code and output.

If you are using RStudio on the server, you should be all set (although please let me know if your account does not work), but if you want to work locally on your computer, you need to install both R and RStudio. Follow the links on the course Resources page or just google to find the appropriate sources.
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.
The panel in the upper right contains your environment as well as a history of the commands that you’ve previously entered. Any plots that you generate will show up in the panel in the lower right corner.
The panel on the left is where the action happens. It’s called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request: a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
To get you started, enter all commands at the R prompt (i.e. right after > on the console); you can either type them in manually or copy and paste them from this document.
R has a number of additional packages that help you perform specific tasks. For example:

mosaic: is an R package written by a team of NSF-funded statistics educators. It is specifically designed to streamline R syntax so as to make it more accessible to undergraduate statistics students.

In order to use these packages, they must be installed (you only have to do this once) and loaded (you have to do this every time you start an R session).
If you have not installed the packages already (particularly if you are running R and RStudio on your local machine), run the commands below. It may take a few minutes; you’ll know the packages have finished installing when you see the R prompt (>) return in your console.
install.packages("mosaic")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("Stat2Data")
install.packages("car")
Now that we have some packages, we can start doing some exploratory data analysis.
We are going to be looking at some plots and summary statistics as part of the technique of Exploratory Data Analysis (EDA). EDA was named by John Tukey in the 1960s, and continues to be exceedingly useful today. Essentially, Tukey was advocating getting familiar with data before beginning modeling, so you don’t run into errors that are easy to catch visually but hard to catch numerically.
To begin, we need to load packages to use in our session. We can do this either with library() or require(). (The difference: library() throws an error if the package is not installed, while require() just returns FALSE with a warning.) I try to be consistent, but sometimes change it up.
require(mosaic)
require(car)
Then, we need some data. For this lab I chose the Salaries dataset, because it comes with the car package and has the right number of variables for the lab. Plus, it has to do with college professors. Since the data is already available in R, we can access it with the data() command,
data(Salaries)
Now, we’re going to want to look at it.
str(Salaries)
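If str() is more detail than you want, a few other standard base R commands give quick first looks at a dataset; for example:

head(Salaries)     # the first six rows
dim(Salaries)      # the number of rows and columns
summary(Salaries)  # numeric summaries and category counts for each variable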
Once we have an idea of what the data look like, we can make some plots.
There are three prominent graphics libraries in R:
graphics: often called base graphics, these are the drawing methods that come pre-installed with R. These graphics are the most commonly used, but often the least user-friendly. (e.g. plot())

lattice: a nice-looking and powerful graphics library that is particularly adept at making multivariate comparisons. lattice graphics are very convenient and easy to learn for most common statistical plots, and are the default for most of the mosaic graphing functions. Customization of lattice graphics often involves writing panel functions, which can be tricky, but powerful. (e.g. xyplot())

ggplot2: a very popular graphing library maintained by Hadley Wickham, based on his “Grammar of Graphics” paradigm. Unlike lattice, ggplot2 uses an incremental philosophy towards building graphics. (e.g. ggplot())

We will likely use graphics from all three libraries.
For one quantitative variable, you might want to produce a histogram to show the distribution.
histogram(~salary, data=Salaries)
You could also view a smoothed version of the distribution with a density plot
densityplot(~salary, data=Salaries)
If you have categorical data, a barchart is more appropriate.
barchart(tally(~rank, data=Salaries))
Notice that all these plots use the same syntax: goal ( y ~ x , data = mydata )
We call this the formula syntax, and it is the most common way we will be writing R code in this class.
In this case, we only had an x variable, so we put nothing before the tilde, ~. (Hint: to find the tilde button, look at the upper left of your computer keyboard. It usually shares a key with the backtick.)
This lab (and all the rest of them) is written in RMarkdown. That means I type text, code, and formatting into a document with the file extension .Rmd. That file is available on the course site if you would like to look at it. Some students prefer to download the source code and work through it on their own machine or on the RStudio server. Once I have things typed into my .Rmd file, I click the “Knit” button to knit a formatted HTML version of the lab. If you are doing this on your own machine and the knit button isn’t showing, check out Prof McNamara’s RMarkdown troubleshooting page. When I write the labs, I use the option eval=FALSE, which means that the output of the code doesn’t show up in my knitted document. For your homework, you need to use the option eval=TRUE so that the output shows up in your HTML. If you are working through this lab on your own computer, go to the top of the document and change this option. Then try re-knitting. See how all the plots and output show up in the document?
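As a quick sketch of what that looks like: many .Rmd files set this option once in a setup chunk near the top of the document (your file may instead have eval=FALSE written into individual chunk headers, but the idea is the same):

knitr::opts_chunk$set(eval = TRUE)  # evaluate all code chunks so output appears when knitting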
For homework, you need to submit the knitted HTML document. So, it is important to make sure your document has knitted correctly. Again, the RMarkdown troubleshooting page has some hints for how to do this. I usually just open it in my web browser to check it looks right.
ggplot2 can create all the same plot types as lattice. A histogram,
require(ggplot2)
ggplot(Salaries) + geom_histogram(aes(x=salary))
A density plot,
ggplot(Salaries) + geom_density(aes(x=salary))
Or, for categorical data, a barchart.
ggplot(Salaries) + geom_bar(aes(x=rank))
Notice that the syntax is different with ggplot. It’s now ggplot(mydata) + goal(aes(x=x, y=y)), where your goal is probably a geom (short for geometric object), like geom_histogram(), geom_density(), or geom_bar(). There are actually other ways to write ggplot2 code, but we won’t think about those for now.
For two quantitative variables, a scatterplot is appropriate.
xyplot(salary~yrs.since.phd, data=Salaries)
For one quantitative and one categorical variable, you can make side-by-side boxplots (boxplot(), bwplot()),
bwplot(salary~sex, data=Salaries)
For two categorical variables, a mosaic plot is most appropriate
mosaicplot(sex~rank, data=Salaries, las=1)
We’re back to that goal ( y ~ x , data = mydata ) syntax.
A scatterplot,
ggplot(Salaries) + geom_point(aes(x=yrs.since.phd, y=salary))
Side-by-side boxplots,
ggplot(Salaries) + geom_boxplot(aes(x=sex, y=salary))
Again, we’re using ggplot(mydata) + goal(aes(x=x, y=y)), where your goal is probably a geom (short for geometric object), like geom_boxplot(). Mosaic plots in ggplot2 are more complicated and require an additional package, so we’ll leave those for now.
Most R plotting functions can take many arguments that will modify their behavior. Read the documentation for more information.
help(xyplot)
?geom_histogram
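For instance, here is a sketch of a few common optional arguments added to a lattice scatterplot; the particular title, labels, and color are just illustrative choices:

xyplot(salary ~ yrs.since.phd, data=Salaries,
       main="Professor salaries",      # plot title
       xlab="Years since PhD",         # x-axis label
       ylab="Nine-month salary ($)",   # y-axis label
       col="steelblue", pch=19)        # point color and shape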
Again, there are several possible syntaxes for summary statistics. I mostly use the mosaic package syntax, which looks like this:
mean(~salary, data=Salaries)
sd(~salary, data=Salaries)
This is the same formula syntax as lattice plotting, goal ( y ~ x , data = mydata ), and your goal can be a summary statistic like mean() or sd(). Most summary statistics you would expect are in R: median(), max(), min(), IQR(), and many more all work. There is also a function that gives you a bunch of my favorite statistics, favstats()
favstats(~salary, data=Salaries)
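The formula syntax also lets you compute a statistic separately for each level of a categorical variable, by putting that variable on the right of the tilde. For example:

mean(salary ~ rank, data=Salaries)     # mean salary for each rank
favstats(salary ~ sex, data=Salaries)  # favorite statistics, split by sex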
For categorical data, you might want to tally()
tally(~rank, data=Salaries)
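tally() can also cross-tabulate two categorical variables, or report proportions instead of counts. For example:

tally(rank ~ sex, data=Salaries)                  # counts of rank, by sex
tally(~rank, data=Salaries, format="proportion")  # proportions rather than counts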
The syntax for linear models is very similar to the syntax for lattice graphics and mosaic summary statistics: goal ( y ~ x , data = mydata ). We’ll use it for modeling the data we collected about ourselves.
In order to read in the data, we need to use an assignment operator so we can store the data in our environment. Without an assignment operator, the data would all print out into the Console but not be usable again.
We’ll try to be consistent and use = as our assignment operator, but there is another way to do it (of course!) that I sometimes use, which looks like this: <-.
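Both operators do exactly the same thing; here is a tiny illustration you can try in your console:

n = 10    # store the value 10 in an object named n
n <- 10   # the same assignment, written with the arrow
n         # typing the object's name prints its value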
# install.packages("googlesheets")
require(googlesheets)
url = gs_url("https://docs.google.com/spreadsheets/d/1tHg_Oo88GIS8E00-lf286-4AzR-wht7ZddUMFM1s5s0/")
ds = gs_read_csv(url)
It’s not crucial that you understand the code above, because most of the data we use in this course is in packages.
Now we can start to do some modeling! We might want to start with some EDA to explore the relationship between how long someone has had a Facebook account and their number of Facebook friends.
xyplot(numFBFriends ~ accountLength, data=ds)
Given how the data looks, it seems reasonable to choose a simple linear regression model. We can then fit it using R.
mod = lm(numFBFriends ~ accountLength, data=ds)
Again, we’re using the assignment operator to store the results of our function into a named object. This is because we might want to do things with our model output later. If you want to skip the assignment operator, try just running lm(numFBFriends ~ accountLength, data=ds) in your console to see what happens.
Now, we want to move on to assessing and using our model. Typically, this means we want to look at the model output (the fitted coefficients, etc.). If we run summary() on our model object, we can look at the output.
summary(mod)
The p-value is significant, which might lead us to conclude that the model is effective. We can use the model for description. To write this model out in our mathematical notation,
\widehat{\verb#numFBFriends#} = -5.264 + 84.164\cdot \verb#accountLength#
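If you want those fitted coefficients without reading them off the summary() printout, R has standard extractor functions for model objects; for example:

coef(mod)     # the intercept and slope
confint(mod)  # confidence intervals for the coefficients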
To assess further, we can compare it to a “null” model, where we don’t use a predictor (instead, we just use the mean as the model for every observation). We can compare the residuals from the null model and the simple linear model.
In order to visualize this, we need a special plotting function. Copy-paste this code into your Console without changing anything:
panel.regplot = function (x, y, mod, ...) {
  panel.xyplot(x, y, type = c("p"), lwd=3, pch=19, ...)  # plot the raw points
  panel.arrows(x0=x, y0=y, x1=x, y1=fitted.values(mod),  # draw each residual as a gray arrow
               code=2, length=0.1, col="darkgray")       # from the observation to its fitted value
  panel.abline(mod)                                      # add the fitted model line
}
Now, we can run two models and visualize them to compare! The first is the average, and the second is the least squares regression line. We can think of the former as the null model where \beta_1 = 0 and \hat{y} = \bar{y}. Which model do you think is better?
mod.mean = lm(numFBFriends ~ 1, data=ds)
mod.line = lm(numFBFriends ~ accountLength, data=ds)
xyplot(numFBFriends ~ accountLength, data=ds, panel=panel.regplot, mod = mod.mean)
xyplot(numFBFriends ~ accountLength, data=ds, panel=panel.regplot, mod = mod.line)
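One numeric way to answer that question is to add up the squared residuals for each model; the model with the smaller sum fits the data more closely. A quick sketch:

sum(residuals(mod.mean)^2)  # total squared error for the null model
sum(residuals(mod.line)^2)  # total squared error for the regression line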
Your turn: use lm() to fit a linear regression model to one pair of variables and examine the output using summary(). What conclusions can you draw?