What is R?
R is a programming language and environment commonly used in statistical computing, data analytics and scientific research. It is one of the most popular languages among statisticians, data analysts, researchers and marketers for retrieving, cleaning, analyzing, visualizing and presenting data. Thanks to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.
Installing R
Get R here: http://www.r-project.org/. Once you've installed it, also install a few add-on packages by running the following commands at the R prompt:
install.packages("reshape")
install.packages("plyr")
install.packages("ggplot2")
A discovery flight: plotting with ggplot.
When you begin lessons to get a private
pilot's license, the first lesson is often a "discovery flight,"
where they take you up and let you fly the airplane and help you land it
safely. It's a fun way to get your feet wet and see if taking more lessons seems like a good idea.
I'd like to begin with a discovery flight
of sorts, introducing ggplot. ggplot2 is a powerful command-line plotting system that should show you some of the power of R without you needing to know all the details of how R works.
ggplot basics
First let's make a little data. Make a hundred normally distributed random
numbers and put them in a data frame under the name input. (I'll explain about
data frames more later.)
> df<-data.frame(input=rnorm(100))
Then add a hundred more random numbers
named output, each perturbing the input numbers a bit.
> df$output<-rnorm(100,df$input,sd=0.5)
A data frame is like a matrix, or a spreadsheet, where each row contains one observation.
> head(df)
        input     output
1 -0.58362979 -0.4540510
2  0.86899831  0.5282722
3 -0.70655426 -0.5075248
4 -0.35125465  0.4221305
5 -1.56632422 -0.6980894
6 -0.01087019  0.3033364
Ok, now let's plot our data.
> ggplot(df,aes(x=input,y=output))+geom_point()
What's a plot? There's a coordinate system and some
scales. But the heart of the plot is the
data. What characteristics does each
point have? Certainly an x and y
coordinate, but also a size, shape and
color that we aren't varying. These visual properties are called "aesthetics"
in ggplot, and the main thing we want to learn first is how to map the
variables in our data onto the aesthetics in the plot.
So let's pull apart that command and see
how it describes the plot. The first bit of the command is a call to the
ggplot() function. ggplot() can operate in many different ways, but the
simplest and most common is to pass it two arguments, the first of which is a
data frame containing the data to be plotted.
The second argument to ggplot is an
aesthetic mapping returned by the function aes(). aes() takes aesthetic-variable pairs, in our
case mapping the variable input to the x-coordinate and output to the y-coordinate
of objects in the plot.
ggplot() called like this can't actually
make a plot just yet, because we haven't told it how to present the data.
Points, bars, lines, text, and tiles are some of the straightforward options,
but there are many other possible choices.
So we add a "layer" to the plot
with geom_point() that says we want our data represented as points, and voila,
we have a scatter plot.
More aesthetics and faceting
So what about some other aesthetics? Let's add a category to each observation in our data (the five letters A through E just get recycled down the 100 rows).
> df$category<-LETTERS[seq(1,5)]
> head(df)
        input     output category
1 -0.58362979 -0.4540510        A
2  0.86899831  0.5282722        B
3 -0.70655426 -0.5075248        C
4 -0.35125465  0.4221305        D
5 -1.56632422 -0.6980894        E
6 -0.01087019  0.3033364        A
And map the color of each point to its category. Note that aes() assumes its first two arguments map x and y position, so we can leave off those names.
> ggplot(df,aes(input,output,color=category))+geom_point()
Another way to show category is with
"faceting," creating a series of smaller plots, one for each category. The ~ in the call to facet_wrap() is an
example of R's formula syntax, something we'll cover later when we get to
statistical modeling.
> ggplot(df,aes(input,output))+geom_point()+facet_wrap(~category)
Since our categories were assigned without regard to the (random) data, all the facets look the same, but if we tweak one category and replot, it sticks out like a sore thumb.
> df[df$category=="E","output"] <- abs(df[df$category=="E","output"])
> ggplot(df,aes(input,output))+geom_point()+facet_wrap(~category)
Let's undo our little tweak before going
forward.
> df$output<-rnorm(100,df$input,sd=0.5)
How about the size aesthetic? Let's add yet another variable to our data frame.
> df$magnitude <- runif(100)
And map magnitude to the size aesthetic, still faceting on category.
> ggplot(df,aes(input,output,size=magnitude))+geom_point()+facet_wrap(~category)
If we have another variable to plot we could add color back into the mix, and then we'd be plotting five dimensions at once. You could add shape and bump it to six. Then you could organize the facets into a grid using facet_grid() instead of facet_wrap(), and now you're up to seven dimensions, all on a 2-D surface.
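Just as a sketch of the syntax (the flag and group columns here are made up purely for illustration):
> df$flag<-sample(c("yes","no"),100,replace=TRUE)     # hypothetical extra column
> df$group<-sample(c("g1","g2"),100,replace=TRUE)     # hypothetical extra column
> ggplot(df,aes(input,output,size=magnitude,color=category,shape=flag))+geom_point()+facet_grid(group~flag)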
layer = geom + stat
Let's see another geom, a bar chart. We'll make a simple data frame with 26
observations, each with a random score.
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> bar_df<-data.frame(letter=LETTERS,score=runif(26))
> head(bar_df)
  letter     score
1      A 0.3526575
2      B 0.5549401
3      C 0.4672510
4      D 0.4784176
5      E 0.9944808
6      F 0.8506076
Now we map x-position to letter and
y-position to score, and make a bar layer.
> ggplot(bar_df,aes(letter,score))+geom_bar()
So we see that the x-scale can deal just
fine with categorical data, and that in a bar layer, y-position maps to the
height of the bar.
In statistics, the classic use of a bar chart is a histogram. Let's fake up some more data: say the average male in the US is about six feet tall, with a standard deviation of 3 inches.
> height_data<-data.frame(height=rnorm(1000,mean=6,sd=0.25))
Hmm.. but how to get a histogram? We need to bucket our thousand observations
and count the number in each bucket.
Surely something like this won't work, right?
> ggplot(height_data,aes(height))+geom_bar()
It worked, but how? The first thing you need to know is that a layer has two parts: a "geom" that controls how the layer looks, and a "stat" that applies some statistical transformation to the data before it's plotted. Every geom_XXX() has a default stat. geom_point()'s default stat is "identity," which just leaves the data alone. But geom_bar()'s default is "bin", which bucketizes or "bins" the data, dividing it up and counting the number of observations in each bin. So these two commands are equivalent.
> ggplot(height_data,aes(height))+geom_bar()
> ggplot(height_data,aes(height))+geom_bar(stat="bin")
But there's one more bit of magic going
on. We never mapped any variable to the
y-position of the bars. stat_bin adds a
variable named ..count.. to the data, and automatically maps it to the y aesthetic. So this is still equivalent.
> ggplot(height_data,aes(height))+geom_bar(aes(y=..count..),stat="bin")
What's that aes() doing in the geom_bar()
call? The aesthetic mapping you supply
in the call to ggplot() just sets the default mapping for the plot's layers,
but each layer you add can override those defaults. This is useful when making
sophisticated plots with many different layers.
Note that stat_bin also adds a ..density.. variable, so if you wanted densities instead of counts you could do...
> ggplot(height_data,aes(height))+geom_bar(aes(y=..density..),stat="bin")
As I said before, each layer has a stat and a geom, so we can make the equivalent plot using stat_bin() instead of geom_bar(). Usually the choice doesn't matter much, but sometimes the defaults make one way shorter, or one way more clearly expresses your intentions.
> ggplot(height_data,aes(height))+stat_bin(aes(y=..density..),geom="bar")
All our plots so far have only used one layer, but we can easily add more. For our histogram, we can add a smoothed density estimate, plotted as a line, with geom_density().
> ggplot(height_data,aes(height))+geom_bar(aes(y=..density..))+geom_density()
Going back to our scatter plot data, we can
easily add a layer showing a smoothed line with a confidence interval.
> ggplot(df,aes(input,output))+geom_point()+stat_smooth()
There are many smoothing methods, and ggplot tries to be smart about which one it chooses based on your data, but you can override its choice by supplying an explicit method argument. To get a linear fit, for example:
> ggplot(df,aes(input,output))+geom_point()+stat_smooth(method="lm")
All the layers are preserved if you add faceting.
> ggplot(df,aes(input,output))+geom_point()+stat_smooth(method="lm")+facet_wrap(~category)
Dang, that line is getting kind of long. Each of those function calls returns an object that you can assign to a variable, and a common way to break up the creation of a chart is to assign the base chart to a variable and then add embellishments:
> p<-ggplot(df,aes(input,output))+geom_point()
> p+stat_smooth(method="lm")+facet_wrap(~category)
Dealing with overplotting
For fun let's have a look at a few of the other capabilities of ggplot. Our little scatter plot of 100 observations was nice, but what happens when you get a lot of data?
> big_df<-data.frame(input=rnorm(10000))
> big_df$output<-rnorm(10000,big_df$input,sd=0.3)
> p<-ggplot(big_df,aes(input,output))
> p+geom_point()
With so many points, there's a lot of overplotting, so it's hard to see what's going on. There are several ways to deal with overplotting; one is to just make the points smaller.
> p+geom_point(size=1)
Note that we're not using aes() here. We're not /mapping/ something to size, we're /setting/ size to a constant. You can do this with all aesthetics, and it can be a source of confusion when you set an aesthetic you meant to map, or vice versa.
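One quick way to see the difference (a small sketch using the same p object): compare what happens when size is set versus mapped.
> p+geom_point(size=1)        # setting: size is just a constant, no legend
> p+geom_point(aes(size=1))   # mapping: the constant 1 is treated as data, so you get a legend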
Another way to deal with overplotting is to
use alpha transparency.
> p+geom_point(alpha=0.1)
That's a little better. Just as we added a smooth density estimate to
our histogram before, we can add a 2-d density estimate to this plot.
> p+geom_point(alpha=0.1)+stat_density2d()
For something a little more fancy, we can
override some of stat_density2d's defaults.
> p+stat_density2d(aes(fill=..level..),geom="polygon")
Normally stat_density2d produces contour
shapes. We can turn that off and use the
tile geom to get a heatmap.
> p+stat_density2d(aes(fill=..density..),geom="tile",contour=FALSE)
More in-depth resources
I've only covered the barest
essentials. If you want to learn more I really
recommend Hadley's book.
Online reference with tons of examples:
http://had.co.nz/ggplot2/
Hadley's Book, _ggplot2: Elegant Graphics
for Data Analysis_: http://www.amazon.com/gp/product/0387981403
Your turn: EXAMPLE
Now it's your turn.
Extremely brief intro to plyr
Split, apply, combine.
Like GROUP BY and aggregate functions in SQL, or pivot tables in Excel, but far more flexible.
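For a taste of what that looks like, here's a minimal sketch using plyr's ddply() on the built-in mtcars dataset (which we'll meet again in Part 3); the column names mean_mpg and mean_wt are just names we're choosing:
> # split mtcars by number of cylinders, apply mean() to two columns, combine the results
> ddply(mtcars,"cyl",summarise,mean_mpg=mean(mpg),mean_wt=mean(wt))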
Your turn part 2: EXAMPLE
My creation:
Part 2: R fundamentals and hypothesis testing
Most R tutorials assume you know some
statistics and teach you just how to do statistics in R. In this part I'm going to assume you know neither
and teach you a (teensy, tiny) bit of statistics at the same time I demonstrate
the R system.
A little background on R
R is an implementation of the S statistical programming language. S has a very long history, going back to 1976. Because the theory of programming language design was undergoing very active exploration throughout the development of S, it is something of a hodgepodge of different paradigms. It's got both functional and procedural aspects, with object-oriented features sprinkled on top.
S's influences include C and Ratfor (brace syntax), APL and APL's cousin PPL (the arrow operator), Lisp (the lambda calculus, first-class functions), JCL (positional or keyword argument notation), and Smalltalk (classes and inheritance). An excellent history of S can be found here:
So R is a full-blown programming language, but in my normal use of it I don't really do much programming; I mostly use the incredible array of built-in functionality at the R command line. R does statistical modeling, analysis of variance (aka ANOVA), and statistical tests like Student's t or chi-squared. It implements techniques from linear algebra like solving systems of linear equations, singular value decomposition, and working with eigensystems. It also implements all kinds of analytical techniques, from clustering to factor analysis.
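Just to give a flavor of the linear-algebra side (a toy sketch; the matrix A and vector b are made up):
> A<-matrix(c(2,1,1,3),nrow=2)   # a small symmetric matrix
> b<-c(1,2)
> solve(A,b)                     # solve the linear system A x = b
> eigen(A)$values                # eigenvalues of A
> svd(A)$d                       # singular values of A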
The absolute minimum you need to know about R to get
started.
But before we get into all that, let's get familiar with R's command line and the most basic of R concepts. R's command line is a read-evaluate-print loop: you type a statement, R evaluates that statement and prints the result. So you can evaluate the number 5 like so:
> 5
[1] 5
and do simple arithmetic like so:
> 2+3
[1] 5
A character string is entered with quotes
and evaluates to itself:
> "Hello"
[1] "Hello"
In addition to numbers and strings, R has a
boolean type with keywords TRUE and FALSE in all-caps.
> TRUE
[1] TRUE
> FALSE
[1] FALSE
There are also a couple more special keywords: NA for missing data and NULL to represent no object.
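A quick look at each (note how NA propagates inside a vector):
> NA
[1] NA
> c(1,NA,3)
[1]  1 NA  3
> NULL
NULL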
Vectors
So what's that funny [1] that comes out in the examples above? It turns out that R's most basic data structure is a vector. R doesn't support the idea of a scalar, a single value. So when you type a single value it's actually being coerced into a vector of length one, and the [1] is a vector index.
The simplest way to make vectors is with the c() function, which just turns its arguments into a vector.
> c(12,23,34,45,56)
[1] 12 23 34 45 56
Lots of functions make vectors. seq() for example makes a sequence of numbers.
> seq(1,30)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30
So now we see that the [1] and [26] tell
you the index of the first element on that row of output.
R can generate random numbers according to many distributions. This command gives you 10 random numbers uniformly distributed from 1 to 100.
> runif(10,1,100)
 [1] 45.12021 93.65181 18.73653 21.81440 14.16285 30.65400 73.82752 14.86346
 [9] 31.08075 14.69531
Most any distribution you can think of is cooked into R: normal, log-normal, exponential, gamma, chi-squared, Poisson, binomial, etc. Here are 10 normally distributed random numbers with mean 0 and standard deviation 1.
> rnorm(10,0,1)
 [1]  0.1248729 -1.6965157  0.1094253 -0.8555028 -0.4061967 -1.0979561
 [7]  1.2266705 -0.8437930 -0.2905176  0.1290491
Vectors (and pretty much everything else in R) are objects. You can give an object a name with the assignment operator '<-'. Using '<-' as the assignment operator was borrowed from APL. You can also just use '='.
> numbers<-rnorm(10,0,1)
> numbers
 [1]  0.86378839 -0.05729522  0.50690025  0.98559770  0.95857035  0.84151008
 [7] -1.36267419  1.68299597 -1.02089036  0.01789479
> more_numbers=rnorm(10,0,1)
> more_numbers
 [1]  0.63656000  1.22249113 -1.46085369  0.29902021 -0.05245649 -0.68435848
 [7] -0.32230916  1.42836638 -0.90796790  2.23426332
You can get a listing of all your currently
named objects with ls()
> ls()
[1] "more_numbers" "numbers"
You can get rid of objects with rm().
> rm(numbers, more_numbers)
> ls()
character(0)
Operations on vectors
Ok. So we know some ways to make vectors; what can we do with them? Well, all the normal sorts of things you'd expect.
> numbers<-rnorm(10)
> numbers
 [1]  0.8893482  0.4200492 -1.2398678 -0.3585673 -1.4215178  0.7752930
 [7]  0.3963262 -1.1906303  0.1747187  2.2828912
> mean(numbers)
[1] 0.07280433
> median(numbers)
[1] 0.2855224
> sum(numbers)
[1] 0.7280433
These functions summarize a vector into one
number. There are also lots of functions
that operate on each element of a vector, returning a new modified vector.
> abs(numbers)
 [1] 0.8893482 0.4200492 1.2398678 0.3585673 1.4215178 0.7752930 0.3963262
 [8] 1.1906303 0.1747187 2.2828912
> numbers*100
 [1]   88.93482   42.00492 -123.98678  -35.85673 -142.15178   77.52930
 [7]   39.63262 -119.06303   17.47187  228.28912
Note that the statements above aren't modifying their argument; they are creating new vectors. So to modify a vector you need to assign the output.
> numbers<-sort(numbers)
> numbers
 [1] -1.4215178 -1.2398678 -1.1906303 -0.3585673  0.1747187  0.3963262
 [7]  0.4200492  0.7752930  0.8893482  2.2828912
Statistical tests
What about that mean that we got earlier?
> mean(numbers)
[1] 0.07280433
We got those numbers by drawing from a
normal distribution with mean of zero, so the mean of our sample should be
close to zero. 0.07 is pretty close, but
should we trust that rnorm() is really giving us numbers with a mean of
zero? Maybe rnorm() is buggy and has a
bias. How can we tell?
Student's t-test is one way to answer this
question. Given a sample, the t-test can
tell you how confident you can be that the population's mean is a particular
value. Here's how you run a t-test in R,
asking if the population's mean is zero.
> t.test(numbers, mu=0)
        One Sample t-test

data:  numbers
t = 0.1992, df = 9, p-value = 0.8465
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.7538657  0.8994743
sample estimates:
 mean of x
0.07280433
The number to look at in the above output is the p-value. Very loosely, the lower the p-value, the less likely it is that the population mean is actually zero. The p-value is a probability, so it ranges from zero to one, and one usually looks for a p-value less than 0.05 or 0.01 in order to conclude that the population mean is not zero.
Our p-value is 0.85, well above the 0.05
cutoff, so we conclude that the population mean really is zero (within the
power of our test to tell us), and that rnorm() isn't buggy.
What does that p-value really mean? Why set the cutoff at 0.05 or 0.01? The cutoff corresponds to the chance that you'll reach the wrong conclusion. If you choose 0.05, there's a 5% chance you'll conclude that the mean isn't zero when it really is. This means that if we draw many samples and run our t-test on each of them, sometimes the test will say that the population mean isn't zero, even when it really is.
We can verify this experimentally very
easily. What t.test returns is actually
an object. We can save that object to a
variable with the assignment operator just like we've done before with
vectors. The output we saw above is just
that object being printed.
> result <- t.test(numbers,mu=0)
> result
        One Sample t-test

data:  numbers
t = 0.1992, df = 9, p-value = 0.8465
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.7538657  0.8994743
sample estimates:
 mean of x
0.07280433
We can extract just the p-value like so:
> t.test(numbers,mu=0)$p.value
[1] 0.8465136
So we can re-sample and re-test a few times
by hand.
> t.test(rnorm(100),mu=0)$p.value
[1] 0.682082
> t.test(rnorm(100),mu=0)$p.value
[1] 0.7112275
> t.test(rnorm(100),mu=0)$p.value
[1] 0.324931
> t.test(rnorm(100),mu=0)$p.value
[1] 0.9596542
So far, so good, no numbers under
0.05. But this is getting tedious; we
can use the replicate function to do this a bunch of times and accumulate the
results in a vector.
> pvalues<-replicate(60,t.test(rnorm(100),mu=0)$p.value)
And if we sort the p-values we get, lo and behold, there are four of them below 0.05. We ran our test 60 times, and 4 of those 60 times we reached the wrong conclusion. With a 0.05 (or one in twenty) cutoff, we expected about three such errors.
> sort(pvalues)
 [1] 0.01182703 0.02263391 0.02990155 0.03108499 0.05414258 0.06868589
 [7] 0.08567086 0.08964473 0.12293828 0.16763739 0.17225885 0.18524312
[13] 0.19448739 0.20541757 0.24862190 0.25007644 0.25691785 0.27851561
[19] 0.28053250 0.30991840 0.32623091 0.36615250 0.36920395 0.37273555
[25] 0.38772763 0.38788907 0.39546547 0.39799577 0.42377867 0.46539131
[31] 0.48143437 0.48794339 0.49729286 0.50076295 0.50924057 0.52898420
[37] 0.54825732 0.57358131 0.59666177 0.59980506 0.61071583 0.62020976
[43] 0.62161798 0.67315114 0.67944879 0.69186076 0.70146278 0.73897408
[49] 0.76402396 0.76638926 0.78694787 0.84759542 0.86765217 0.94199050
[55] 0.95396098 0.95869792 0.96865229 0.97681270 0.98922664 0.99296506
Two-sample t-test and statistical power
The way I'm using the t-test here is somewhat unusual. A far more common case is when you have two samples, and you want to know whether the populations the samples are drawn from have different means. This is how we do weblabs here at Amazon: we cut our user base into two populations and give one of them a different experience. We theorize that this different experience will cause different behavior, so we run t-tests to find out if, for example, one population spends more money on average than the other. This is called a two-sample t-test.
Here's a contrived example, where I draw
one sample with a mean of zero, and another sample with a mean of one, and then
ask the t-test to tell me if the means are indeed different.
> t.test(rnorm(100,mean=0),rnorm(100,mean=1))
        Welch Two Sample t-test

data:  rnorm(100, mean = 0) and rnorm(100, mean = 1)
t = -6.7014, df = 188.741, p-value = 2.315e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.2045547 -0.6566807
sample estimates:
mean of x mean of y
0.1470455 1.0776632
That p-value is tiny. We can safely conclude that the two
populations have different means. But
what if the difference in the means is less extreme?
> t.test(rnorm(100,mean=0),rnorm(100,mean=0.5))$p.value
[1] 0.001426943
> t.test(rnorm(100,mean=0),rnorm(100,mean=0.2))$p.value
[1] 0.3231079
> t.test(rnorm(100,mean=0),rnorm(100,mean=0.1))$p.value
[1] 0.8972577
As the difference in means gets smaller our
test loses the ability to resolve the difference. We can get it back by upping the sample size.
> t.test(rnorm(10000,mean=0),rnorm(10000,mean=0.1))$p.value
[1] 3.305264e-17
This question of how small a difference a
test can detect given a certain sample size is an example of statistical power
analysis.
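R's built-in power.t.test() function will do this kind of calculation for you. For instance (a sketch), to ask how many observations per group you'd need to detect a difference in means of 0.1 with 80% power:
> # leaving n out tells power.t.test() to solve for the per-group sample size
> power.t.test(delta=0.1,sd=1,sig.level=0.05,power=0.8)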
Another statistical test
The t-test is just one of many statistical tests built into R. Let's use another test to empirically verify the central limit theorem, one of the most important results in all of statistics. Roughly, the central limit theorem says that if you take a large number of independent and identically distributed random variables and sum them, that sum will be normally distributed no matter what the underlying distribution is (well, there are some requirements on the distributions, but just about anything you'd encounter in real life will work).
That's still pretty math-y, so what does it mean in practical terms? Recall that we can generate uniformly distributed random numbers with runif().
> runif(10)
 [1] 0.4456587 0.9358769 0.1791569 0.2102465 0.1329581 0.2995354 0.7356315
 [8] 0.1400350 0.3038460 0.1383365
Let's plot a histogram quickly to visually
see that they're uniform. qplot() is a simplified interface to the ggplot2
plotting system.
> qplot(runif(1000), geom="histogram", binwidth=0.1)
Looks pretty good. If we sum 100 random numbers uniformly distributed from zero to one, we expect to get a number close to 50, right?
> sum(runif(100))
[1] 48.79861
48.8 is pretty close to 50, so our intuition seems right. This intuition is at the heart of the central limit theorem. The theorem says that if we compute this sum over a bunch of samples, the results will be normally distributed. So here's a plot of 1,000 such sums.
> qplot(replicate(1000,sum(runif(100))), geom="histogram")
Sweet, looks normal to me. And here's the magic part, it works with pretty
much any distribution, not just the uniform distribution. For example, here's the exponential
distribution:
> qplot(rexp(100), geom="histogram")
and the distribution of the sums of
exponentially distributed random numbers:
> qplot(replicate(1000,sum(rexp(100))), geom="histogram")
Looks pretty normal. There's a formal test that tells you whether or not a bunch of numbers have a certain distribution: the Kolmogorov-Smirnov test, or ks-test.
First, the sum of uniform deviates on the interval [-1,1]. A uniform deviate on [-1,1] has mean 0 and variance (1-(-1))^2/12 = 4/12, so before handing the sums to ks.test() we standardize each one by subtracting mu*n and dividing by sigma*sqrt(n), which should leave us with something that follows the standard normal ("pnorm").
> mu=0
> sigma=sqrt(4/12)
> n=1000
> ks.test(replicate(1000,(sum(runif(n,-1,1))-mu*n)/(sigma*sqrt(n))),"pnorm")
        One-sample Kolmogorov-Smirnov test

data:  replicate(1000, (sum(runif(n, -1, 1)) - mu * n)/(sigma * sqrt(n)))
D = 0.0219, p-value = 0.7251
alternative hypothesis: two-sided
Sum of exponentials:
> mu=1
> sigma=1
> n=1000
> ks.test(replicate(1000,(sum(rexp(n))-mu*n)/(sigma*sqrt(n))),"pnorm")
        One-sample Kolmogorov-Smirnov test

data:  replicate(1000, (sum(rexp(n)) - mu * n)/(sigma * sqrt(n)))
D = 0.0251, p-value = 0.5565
alternative hypothesis: two-sided
Part 3: Linear models in R
R has a powerful, flexible and diverse set of modeling capabilities. We'll just focus on straightforward linear models here, though.
lm() with mtcars
R has many built-in datasets that you can experiment with. One of them is the mtcars dataset. From the R documentation:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
Here are a few rows:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The columns are (again from the R
documentation):
A data frame with 32 observations on 11
variables.
[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (lb/1000)
[, 7]  qsec  1/4 mile time
[, 8]  vs    V/S
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors
What drives mileage? Probably weight, but can we check that conjecture? The lm() function fits linear models. To fit a model predicting mileage from weight you do:
> m<-lm(mpg~wt, mtcars)
The first argument is in R's formula syntax. You can read the tilde as "is modeled by." We'll see more features of the formula syntax in just a bit. lm() returns an object that contains the results of the fit. You can look at the results with summary().
> summary(m)
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
So it's a pretty good fit. We can explain 75% of the variance in mileage
with weight alone. The model says that
we lose about 5 miles per gallon for each 1,000 lbs of weight. We can also see some visualizations of the
model using plot().
> plot(m)
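Once you have a fitted model, you can also use predict() to get estimates for new data. As a sketch, here's the predicted mpg for hypothetical cars weighing 2,000 and 4,000 lbs (remember wt is in units of 1,000 lbs):
> predict(m, newdata=data.frame(wt=c(2,4)))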
Can we do better? What if we add another variable to the model?
Horsepower seems a good candidate. We
can modify the formula to include horsepower.
> m<-lm(mpg~wt+hp, mtcars)
So now our model is: mpg = A*wt + B*hp +
C. You can add more and more variables
to the formula with the + operator. So
how's it look?
> summary(m)
Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
-3.941 -1.600 -0.182  1.050  5.854

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
Better. The r^2 is up to about 0.83, and the p-value on the hp variable is 0.00145, quite small, so it's likely a relevant factor contributing to a car's mileage.
Should we try more variables? What about automatic vs. manual transmission? That's in the am variable, but (as is often
the case), it's coded as a number, 0 or 1, not a factor, so we need to recode it.
> str(mtcars$am)
 num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
> mtcars$am<-factor(mtcars$am)
> str(mtcars$am)
 Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
That's better. Now the model:
> m<-lm(mpg~wt+hp+am, mtcars)
> summary(m)
Call:
lm(formula = mpg ~ wt + hp + am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4221 -1.7924 -0.3788  1.2249  5.5317

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
wt          -2.878575   0.904971  -3.181 0.003574 **
hp          -0.037479   0.009605  -3.902 0.000546 ***
am1          2.083710   1.376420   1.514 0.141268
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.538 on 28 degrees of freedom
Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11
Hrm. The r^2 got a bit better, and the sign on the coefficient is right (a manual transmission makes mileage a bit better), but the p-value is pretty marginal; I think we're overfitting the dataset at this point. Note, however, that R is perfectly happy to fit models to categorical as well as numeric data.
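One way to check whether the extra term earns its keep (a sketch; m2 and m3 are just names for the two fits): fit the models with and without am and compare them with anova(), which runs an F-test on the difference.
> m2<-lm(mpg~wt+hp, mtcars)
> m3<-lm(mpg~wt+hp+am, mtcars)
> anova(m2, m3)   # a small p-value here would justify keeping am in the model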
Your turn: lm() with diamonds
The diamonds dataset has data for nearly 54,000 diamonds. The variables are mostly what you'd expect; table, depth, x, y, and z are some of the physical dimensions of the diamond.
> head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
> nrow(diamonds)
[1] 53940
So what drives price? Use ggplot() to explore the data and form hypotheses,
and use lm() to see how well your hypotheses hold up.
> ggplot(diamonds, aes(carat,price))+geom_point()
> ggplot(diamonds, aes(carat,price))+geom_point()+geom_smooth()
This is mgcv 1.6-1. For overview type `help("mgcv-package")'.
> ggplot(diamonds, aes(log(carat),log(price)))+geom_point()+geom_smooth()
> ggplot(diamonds, aes(log(carat),log(price)))+geom_point()+geom_smooth()+facet_wrap(~color)
> ggplot(diamonds, aes(log(carat),log(price)))+geom_point()+geom_smooth()+facet_grid(clarity~color)
> summary(lm(log(price)~log(carat),diamonds))
> summary(lm(log(price)~log(carat)+color,diamonds))
> summary(lm(log(price)~log(carat)+color+clarity,diamonds))