What is R?
R is a programming language and environment commonly used in statistical computing, data analytics and scientific research. It is one of the most popular languages among statisticians, data analysts, researchers and marketers for retrieving, cleaning, analyzing, visualizing and presenting data. Thanks to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.
Installing R
Get R here: http://www.r-project.org/. Once you've installed it, also install a few add-on packages by running the following commands at the R prompt:
install.packages("reshape")
install.packages("plyr")
install.packages("ggplot2")
A discovery flight: plotting with ggplot.
When you begin lessons to get a private
pilot's license, the first lesson is often a "discovery flight,"
where they take you up and let you fly the airplane and help you land it
safely. It's a fun way to get your feet wet and see if taking more lessons seems like a good idea.
I'd like to begin with a discovery flight
of sorts, introducing ggplot. ggplot2 is a powerful command-line plotting system that should show you some of the power of R without you needing to know all the details of how R works.
ggplot basics
First let's make a little data. Make a hundred normally distributed random
numbers and put them in a data frame under the name input. (I'll explain about
data frames more later.)
> df<-data.frame(input=rnorm(100))
Then add a hundred more random numbers
named output, each perturbing the input numbers a bit.
> df$output<-rnorm(100,df$input,sd=0.5)
A data frame is like a matrix, or a spreadsheet, where each row contains one observation.
> head(df)
        input     output
1 -0.58362979 -0.4540510
2  0.86899831  0.5282722
3 -0.70655426 -0.5075248
4 -0.35125465  0.4221305
5 -1.56632422 -0.6980894
6 -0.01087019  0.3033364
Ok, now let's plot our data.
> ggplot(df,aes(x=input,y=output))+geom_point()
What's a plot? There's a coordinate system and some
scales. But the heart of the plot is the
data. What characteristics does each
point have? Certainly an x and y
coordinate, but also a size, shape and
color that we aren't varying. These visual properties are called "aesthetics"
in ggplot, and the main thing we want to learn first is how to map the
variables in our data onto the aesthetics in the plot.
So let's pull apart that command and see
how it describes the plot. The first bit of the command is a call to the
ggplot() function. ggplot() can operate in many different ways, but the
simplest and most common is to pass it two arguments, the first of which is a
data frame containing the data to be plotted.
The second argument to ggplot is an
aesthetic mapping returned by the function aes(). aes() takes aesthetic-variable pairs, in our
case mapping the variable input to the x-coordinate and output to the y-coordinate
of objects in the plot.
ggplot() called like this can't actually
make a plot just yet, because we haven't told it how to present the data.
Points, bars, lines, text, and tiles are some of the straightforward options,
but there are many other possible choices.
So we add a "layer" to the plot
with geom_point() that says we want our data represented as points, and voila,
we have a scatter plot.
More aesthetics and faceting
So what about some other aesthetics? Let's add a category to each observation in our data (the five letters A through E just get recycled down the 100 rows).
> df$category<-LETTERS[seq(1,5)]
> head(df)
        input     output category
1 -0.58362979 -0.4540510        A
2  0.86899831  0.5282722        B
3 -0.70655426 -0.5075248        C
4 -0.35125465  0.4221305        D
5 -1.56632422 -0.6980894        E
6 -0.01087019  0.3033364        A
And map the color of each point to its category. Note that aes() assumes its first two arguments map x and y position, so we can leave off those names.
> ggplot(df,aes(input,output,color=category))+geom_point()
Another way to show category is with
"faceting," creating a series of smaller plots, one for each category. The ~ in the call to facet_wrap() is an
example of R's formula syntax, something we'll cover later when we get to
statistical modeling.
> ggplot(df,aes(input,output))+geom_point()+facet_wrap(~category)
Since our categories were assigned without regard to the (random) data, all the facets look the same, but if we tweak one category and replot, it sticks out like a sore thumb.
> df[df$category=="E","output"] <- abs(df[df$category=="E","output"])
> ggplot(df,aes(input,output))+geom_point()+facet_wrap(~category)
Let's undo our little tweak before going
forward.
> df$output<-rnorm(100,df$input,sd=0.5)
How about the size aesthetic? Let's add yet another variable to our data frame.
> df$magnitude <- runif(100)
And map magnitude to the size aesthetic, still faceting on category.
> ggplot(df,aes(input,output,size=magnitude))+geom_point()+facet_wrap(~category)
If we have another variable to plot we could add color back into the mix, and then we'd be plotting five dimensions at once. You could add shape and bump it to six. Then you could organize the facets into a grid using facet_grid() instead of facet_wrap(), and now you're up to seven dimensions, all on a 2-D surface.
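Just as a sketch of the syntax (the flag and group columns here are made up purely for illustration):
> df$flag<-sample(c("yes","no"),100,replace=TRUE)     # hypothetical extra column
> df$group<-sample(c("g1","g2"),100,replace=TRUE)     # hypothetical extra column
> ggplot(df,aes(input,output,size=magnitude,color=category,shape=flag))+geom_point()+facet_grid(group~flag)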
layer = geom + stat
Let's see another geom, a bar chart. We'll make a simple data frame with 26
observations, each with a random score.
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> bar_df<-data.frame(letter=LETTERS,score=runif(26))
> head(bar_df)
  letter     score
1      A 0.3526575
2      B 0.5549401
3      C 0.4672510
4      D 0.4784176
5      E 0.9944808
6      F 0.8506076
Now we map x-position to letter and
y-position to score, and make a bar layer.
> ggplot(bar_df,aes(letter,score))+geom_bar()
So we see that the x-scale can deal just
fine with categorical data, and that in a bar layer, y-position maps to the
height of the bar.
In statistics, the classic use of a bar chart is a histogram. Let's fake up some more data: say the average male in the US is about six feet tall, with a standard deviation of 3 inches.
> height_data<-data.frame(height=rnorm(1000,mean=6,sd=0.25))
Hmm.. but how to get a histogram? We need to bucket our thousand observations
and count the number in each bucket.
Surely something like this won't work, right?
> ggplot(height_data,aes(height))+geom_bar()
It worked, but how? The first thing you need to know is that a layer has two parts: a "geom" that controls how the layer looks, and a "stat" that applies some statistical transformation to the data before it's plotted. Every geom_XXX() has a default stat. geom_point()'s default stat is "identity," which just leaves the data alone. But geom_bar()'s default is "bin", which bucketizes or "bins" the data, dividing it up and counting the number of observations in each bin. So these two commands are equivalent.
> ggplot(height_data,aes(height))+geom_bar()
> ggplot(height_data,aes(height))+geom_bar(stat="bin")
But there's one more bit of magic going
on. We never mapped any variable to the
y-position of the bars. stat_bin adds a
variable named ..count.. to the data, and automatically maps it to the y aesthetic. So this is still equivalent.
> ggplot(height_data,aes(height))+geom_bar(aes(y=..count..),stat="bin")
What's that aes() doing in the geom_bar()
call? The aesthetic mapping you supply
in the call to ggplot() just sets the default mapping for the plot's layers,
but each layer you add can override those defaults. This is useful when making
sophisticated plots with many different layers.
Note that stat_bin also adds a ..density.. variable, so if you wanted densities instead of counts you could do...
> ggplot(height_data,aes(height))+geom_bar(aes(y=..density..),stat="bin")
As I said before, each layer has a stat and a geom, so we can make the equivalent plot using stat_bin() instead of geom_bar(). Usually the choice doesn't matter much, but sometimes the defaults make one way shorter, or one way more clearly expresses your intentions.
> ggplot(height_data,aes(height))+stat_bin(aes(y=..density..),geom="bar")
All our plots so far have only used one layer, but we can easily add more. For our histogram, we can add a smoothed density estimate, plotted as a line, with geom_density().
> ggplot(height_data,aes(height))+geom_bar(aes(y=..density..))+geom_density()
Going back to our scatter plot data, we can
easily add a layer showing a smoothed line with a confidence interval.
> ggplot(df,aes(input,output))+geom_point()+stat_smooth()
There are many smoothing methods, and ggplot tries to be smart about which one it chooses based on your data, but you can override its choice by supplying an explicit method argument. To get a linear fit, for example:
> ggplot(df,aes(input,output))+geom_point()+stat_smooth(method="lm")
All the layers are preserved if you add faceting.
> ggplot(df,aes(input,output))+geom_point()+stat_smooth(method="lm")+facet_wrap(~category)
Dang, that line is getting kind of long. Each of those function calls returns an object that you can assign to a variable, and a common way to break up the creation of a chart is to assign the base chart to a variable and then add embellishments:
> p<-ggplot(df,aes(input,output))+geom_point()
> p+stat_smooth(method="lm")+facet_wrap(~category)
Dealing with overplotting
For fun let's have a look at a few of the other capabilities of ggplot. Our little scatter plot of 100 observations was nice, but what happens when you get a lot of data?
> big_df<-data.frame(input=rnorm(10000))
> big_df$output<-rnorm(10000,big_df$input,sd=0.3)
> p<-ggplot(big_df,aes(input,output))
> p+geom_point()
With so many points, there's a lot of overplotting, so it's hard to see what's going on. There are several ways to deal with overplotting; one is to just make the points smaller.
> p+geom_point(size=1)
Note that we're not using aes() here. We're not /mapping/ something to size, we're /setting/ size to a constant. You can do this with all aesthetics, and it can be a source of confusion when you set an aesthetic you meant to map, or vice versa.
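One quick way to see the difference (a small sketch using the same p object): compare what happens when size is set versus mapped.
> p+geom_point(size=1)        # setting: size is just a constant, no legend
> p+geom_point(aes(size=1))   # mapping: the constant 1 is treated as data, so you get a legend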
Another way to deal with overplotting is to
use alpha transparency.
> p+geom_point(alpha=0.1)
That's a little better. Just as we added a smooth density estimate to
our histogram before, we can add a 2-d density estimate to this plot.
> p+geom_point(alpha=0.1)+stat_density2d()
For something a little more fancy, we can
override some of stat_density2d's defaults.
> p+stat_density2d(aes(fill=..level..),geom="polygon")
Normally stat_density2d produces contour
shapes. We can turn that off and use the
tile geom to get a heatmap.
> p+stat_density2d(aes(fill=..density..),geom="tile",contour=FALSE)
More in-depth resources
I've only covered the barest
essentials. If you want to learn more I really
recommend Hadley's book.
Online reference with tons of examples:
http://had.co.nz/ggplot2/
Hadley's Book, _ggplot2: Elegant Graphics
for Data Analysis_: http://www.amazon.com/gp/product/0387981403
Your turn: EXAMPLE
Now it's your turn.
Extremely brief intro to plyr
Split, apply, combine.
Like GROUP BY and aggregate functions in SQL, or pivot tables in Excel, but far more flexible.
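For a taste of what that looks like, here's a minimal sketch using plyr's ddply() on the built-in mtcars dataset (which we'll meet again in Part 3); the column names mean_mpg and mean_wt are just names we're choosing:
> # split mtcars by number of cylinders, apply mean() to two columns, combine the results
> ddply(mtcars,"cyl",summarise,mean_mpg=mean(mpg),mean_wt=mean(wt))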
Your turn part 2: EXAMPLE
My creation:
Part 2: R fundamentals and hypothesis testing
Most R tutorials assume you know some
statistics and teach you just how to do statistics in R. In this part I'm going to assume you know neither
and teach you a (teensy, tiny) bit of statistics at the same time I demonstrate
the R system.
A little background on R
R is an implementation of the S statistical programming language. S has a very long history, going back to 1976. Because the theory of programming language design was undergoing very active exploration throughout the development of S, it is something of a hodgepodge of different paradigms. It's got both functional and procedural aspects, with object-oriented features sprinkled on top.
S's influences include C and Ratfor (brace syntax), APL and APL's cousin PPL (the arrow operator), Lisp (the lambda calculus, first-class functions), JCL (positional or keyword argument notation), and Smalltalk (classes and inheritance). An excellent history of S can be found here:
So R is a full-blown programming language, but in my normal use of it I don't really do much programming; I mostly use the incredible array of built-in functionality at the R command line. R does statistical modeling, analysis of variance (aka ANOVA), and statistical tests like Student's t or chi-squared. It implements techniques from linear algebra like solving systems of linear equations, singular value decomposition, and working with eigensystems. It also implements all kinds of analytical techniques, from clustering to factor analysis.
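Just to give a flavor of the linear-algebra side (a toy sketch; the matrix A and vector b are made up):
> A<-matrix(c(2,1,1,3),nrow=2)   # a small symmetric matrix
> b<-c(1,2)
> solve(A,b)                     # solve the linear system A x = b
> eigen(A)$values                # eigenvalues of A
> svd(A)$d                       # singular values of A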
The absolute minimum you need to know about R to get
started.
But before we get into all that, let's get familiar with R's command line and the most basic of R concepts. R's command line is a read-evaluate-print loop: you type a statement, R evaluates that statement and prints the result. So you can evaluate the number 5 like so:
> 5
[1] 5
and do simple arithmetic like so:
> 2+3
[1] 5
A character string is entered with quotes
and evaluates to itself:
> "Hello"
[1] "Hello"
In addition to numbers and strings, R has a
boolean type with keywords TRUE and FALSE in all-caps.
> TRUE
[1] TRUE
> FALSE
[1] FALSE
There are also a couple more special keywords: NA for missing data and NULL to represent no object.
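A quick look at each (note how NA propagates inside a vector):
> NA
[1] NA
> c(1,NA,3)
[1]  1 NA  3
> NULL
NULL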
Vectors
So what's that funny [1] that comes out in the examples above? It turns out that R's most basic data structure is a vector. R doesn't support the idea of a scalar, a single value. So when you type a single value it's actually being coerced into a vector of length one, and the [1] is a vector index.
The simplest way to make vectors is with the c() function, which just turns its arguments into a vector.
> c(12,23,34,45,56)
[1] 12 23 34 45 56
Lots of functions make vectors. seq() for example makes a sequence of numbers.
> seq(1,30)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30
So now we see that the [1] and [26] tell
you the index of the first element on that row of output.
R can generate random numbers according to many distributions. This command gives you 10 random numbers uniformly distributed from 1 to 100.
> runif(10,1,100)
 [1] 45.12021 93.65181 18.73653 21.81440 14.16285 30.65400 73.82752 14.86346
 [9] 31.08075 14.69531
Most any distribution you can think of is cooked into R: normal, log-normal, exponential, gamma, chi-squared, Poisson, binomial, etc. Here are 10 normally distributed random numbers with mean 0 and standard deviation 1.
> rnorm(10,0,1)
 [1]  0.1248729 -1.6965157  0.1094253 -0.8555028 -0.4061967 -1.0979561
 [7]  1.2266705 -0.8437930 -0.2905176  0.1290491
Vectors (and pretty much everything else in R) are objects. You can give an object a name with the assignment operator '<-'. Using '<-' as the assignment operator was borrowed from APL. You can also just use '='.
> numbers<-rnorm(10,0,1)
> numbers
 [1]  0.86378839 -0.05729522  0.50690025  0.98559770  0.95857035  0.84151008
 [7] -1.36267419  1.68299597 -1.02089036  0.01789479
> more_numbers=rnorm(10,0,1)
> more_numbers
 [1]  0.63656000  1.22249113 -1.46085369  0.29902021 -0.05245649 -0.68435848
 [7] -0.32230916  1.42836638 -0.90796790  2.23426332
You can get a listing of all your currently
named objects with ls()
> ls()
[1] "more_numbers" "numbers"
You can get rid of objects with rm().
> rm(numbers, more_numbers)
> ls()
character(0)
Operations on vectors
Ok. So we know some ways to make vectors; what can we do with them? Well, all the normal sorts of things you'd expect.
> numbers<-rnorm(10)
> numbers
 [1]  0.8893482  0.4200492 -1.2398678 -0.3585673 -1.4215178  0.7752930
 [7]  0.3963262 -1.1906303  0.1747187  2.2828912
> mean(numbers)
[1] 0.07280433
> median(numbers)
[1] 0.2855224
> sum(numbers)
[1] 0.7280433
These functions summarize a vector into one
number. There are also lots of functions
that operate on each element of a vector, returning a new modified vector.
> abs(numbers)
 [1] 0.8893482 0.4200492 1.2398678 0.3585673 1.4215178 0.7752930 0.3963262
 [8] 1.1906303 0.1747187 2.2828912
> numbers*100
 [1]   88.93482   42.00492 -123.98678  -35.85673 -142.15178   77.52930
 [7]   39.63262 -119.06303   17.47187  228.28912
Note that the statements above aren't modifying their argument; they are creating new vectors. So to modify a vector you need to assign the output.
> numbers<-sort(numbers)
> numbers
 [1] -1.4215178 -1.2398678 -1.1906303 -0.3585673  0.1747187  0.3963262
 [7]  0.4200492  0.7752930  0.8893482  2.2828912
Statistical tests
What about that mean that we got earlier?
> mean(numbers)
[1] 0.07280433
We got those numbers by drawing from a
normal distribution with mean of zero, so the mean of our sample should be
close to zero. 0.07 is pretty close, but
should we trust that rnorm() is really giving us numbers with a mean of
zero? Maybe rnorm() is buggy and has a
bias. How can we tell?
Student's t-test is one way to answer this
question. Given a sample, the t-test can
tell you how confident you can be that the population's mean is a particular
value. Here's how you run a t-test in R,
asking if the population's mean is zero.
> t.test(numbers, mu=0)
        One Sample t-test

data:  numbers
t = 0.1992, df = 9, p-value = 0.8465
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.7538657  0.8994743
sample estimates:
 mean of x
0.07280433
The number to look at in the above output is the p-value. Very loosely, the lower the p-value, the less likely it is that the population mean is actually zero. The p-value is a probability, so it ranges from zero to one, and one usually looks for a p-value less than 0.05 or 0.01 in order to conclude that the population mean is not zero.
Our p-value is 0.85, well above the 0.05
cutoff, so we conclude that the population mean really is zero (within the
power of our test to tell us), and that rnorm() isn't buggy.
What does that p-value really mean? Why set the cutoff at 0.05 or 0.01? The cutoff corresponds to the chance that you'll reach the wrong conclusion. If you choose 0.05, there's a 5% chance you'll conclude that the mean isn't zero when it really is. This means that if we draw many samples and run our t-test on each of them, sometimes the test will say that the population mean isn't zero, even when it really is.
We can verify this experimentally very
easily. What t.test returns is actually
an object. We can save that object to a
variable with the assignment operator just like we've done before with
vectors. The output we saw above is just
that object being printed.
> result <- t.test(numbers,mu=0)
> result
        One Sample t-test

data:  numbers
t = 0.1992, df = 9, p-value = 0.8465
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.7538657  0.8994743
sample estimates:
 mean of x
0.07280433
We can extract just the p-value like so:
> t.test(numbers,mu=0)$p.value
[1] 0.8465136
So we can re-sample and re-test a few times
by hand.
> t.test(rnorm(100),mu=0)$p.value
[1] 0.682082
> t.test(rnorm(100),mu=0)$p.value
[1] 0.7112275
> t.test(rnorm(100),mu=0)$p.value
[1] 0.324931
> t.test(rnorm(100),mu=0)$p.value
[1] 0.9596542
So far, so good, no numbers under
0.05. But this is getting tedious; we
can use the replicate function to do this a bunch of times and accumulate the
results in a vector.
> pvalues<-replicate(60,t.test(rnorm(100),mu=0)$p.value)
And if we sort the p-values we get, lo and behold, there are four of them below 0.05. We ran our test 60 times, and 4 of those 60 times we reached the wrong conclusion. With a 0.05 (or one in twenty) cutoff, we expected about three such errors.
> sort(pvalues)
 [1] 0.01182703 0.02263391 0.02990155 0.03108499 0.05414258 0.06868589
 [7] 0.08567086 0.08964473 0.12293828 0.16763739 0.17225885 0.18524312
[13] 0.19448739 0.20541757 0.24862190 0.25007644 0.25691785 0.27851561
[19] 0.28053250 0.30991840 0.32623091 0.36615250 0.36920395 0.37273555
[25] 0.38772763 0.38788907 0.39546547 0.39799577 0.42377867 0.46539131
[31] 0.48143437 0.48794339 0.49729286 0.50076295 0.50924057 0.52898420
[37] 0.54825732 0.57358131 0.59666177 0.59980506 0.61071583 0.62020976
[43] 0.62161798 0.67315114 0.67944879 0.69186076 0.70146278 0.73897408
[49] 0.76402396 0.76638926 0.78694787 0.84759542 0.86765217 0.94199050
[55] 0.95396098 0.95869792 0.96865229 0.97681270 0.98922664 0.99296506
Two-sample t-test and statistical power
The way I'm using the t-test here is somewhat unusual. A far more common case is when you have two samples, and you want to know whether the populations the samples are drawn from have different means. This is how we do weblabs here at Amazon: we cut our user base into two populations and give one of them a different experience. We theorize that this different experience will cause different behavior, so we run t-tests to find out if, for example, one population spends more money on average than the other. This is called a two-sample t-test.
Here's a contrived example, where I draw
one sample with a mean of zero, and another sample with a mean of one, and then
ask the t-test to tell me if the means are indeed different.
> t.test(rnorm(100,mean=0),rnorm(100,mean=1))
        Welch Two Sample t-test

data:  rnorm(100, mean = 0) and rnorm(100, mean = 1)
t = -6.7014, df = 188.741, p-value = 2.315e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.2045547 -0.6566807
sample estimates:
mean of x mean of y
0.1470455 1.0776632
That p-value is tiny. We can safely conclude that the two
populations have different means. But
what if the difference in the means is less extreme?
> t.test(rnorm(100,mean=0),rnorm(100,mean=0.5))$p.value
[1] 0.001426943
> t.test(rnorm(100,mean=0),rnorm(100,mean=0.2))$p.value
[1] 0.3231079
> t.test(rnorm(100,mean=0),rnorm(100,mean=0.1))$p.value
[1] 0.8972577
As the difference in means gets smaller our
test loses the ability to resolve the difference. We can get it back by upping the sample size.
> t.test(rnorm(10000,mean=0),rnorm(10000,mean=0.1))$p.value
[1] 3.305264e-17
This question of how small a difference a
test can detect given a certain sample size is an example of statistical power
analysis.
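R's built-in power.t.test() function will do this kind of calculation for you. For instance (a sketch), to ask how many observations per group you'd need to detect a difference in means of 0.1 with 80% power:
> # leaving n out tells power.t.test() to solve for the per-group sample size
> power.t.test(delta=0.1,sd=1,sig.level=0.05,power=0.8)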
Another statistical test
The t-test is just one of many statistical tests built into R. Let's use another test to empirically verify the central limit theorem, one of the most important results in all of statistics. Roughly, the central limit theorem says that if you take a large number of independent and identically distributed random variables and sum them, that sum will be normally distributed no matter what the underlying distribution is (well, there are some requirements on the distributions, but just about anything you'd encounter in real life will work).
That's still pretty math-y, so what does it mean in practical terms? Recall that we can generate uniformly distributed random numbers with runif().
> runif(10)
 [1] 0.4456587 0.9358769 0.1791569 0.2102465 0.1329581 0.2995354 0.7356315
 [8] 0.1400350 0.3038460 0.1383365
Let's plot a histogram quickly to visually
see that they're uniform. qplot() is a simplified interface to the ggplot2
plotting system.
> qplot(runif(1000), geom="histogram", binwidth=0.1)
Looks pretty good. If we sum 100 random numbers uniformly distributed from zero to one, we expect to get a number close to 50, right?
> sum(runif(100))
[1] 48.79861
48.8 is pretty close to 50, so our intuition seems right. This intuition is at the heart of the central limit theorem. The theorem says that if we compute this sum over a bunch of samples, the results will be normally distributed. So here's a plot of 1,000 such sums.
> qplot(replicate(1000,sum(runif(100))), geom="histogram")
Sweet, looks normal to me. And here's the magic part, it works with pretty
much any distribution, not just the uniform distribution. For example, here's the exponential
distribution:
> qplot(rexp(100), geom="histogram")
and the distribution of the sums of
exponentially distributed random numbers:
> qplot(replicate(1000,sum(rexp(100))), geom="histogram")
Looks pretty normal. There's a formal test that tells you whether or not a bunch of numbers have a certain distribution: the Kolmogorov-Smirnov test, or ks-test.
First, the sum of uniform deviates on the interval [-1,1]. A uniform deviate on [-1,1] has mean 0 and variance (1-(-1))^2/12 = 4/12, so before handing the sums to ks.test() we standardize each one by subtracting mu*n and dividing by sigma*sqrt(n), which should leave us with something that follows the standard normal ("pnorm").
> mu=0
> sigma=sqrt(4/12)
> n=1000
> ks.test(replicate(1000,(sum(runif(n,-1,1))-mu*n)/(sigma*sqrt(n))),"pnorm")
        One-sample Kolmogorov-Smirnov test

data:  replicate(1000, (sum(runif(n, -1, 1)) - mu * n)/(sigma * sqrt(n)))
D = 0.0219, p-value = 0.7251
alternative hypothesis: two-sided
Sum of exponentials:
> mu=1
> sigma=1
> n=1000
> ks.test(replicate(1000,(sum(rexp(n))-mu*n)/(sigma*sqrt(n))),"pnorm")
        One-sample Kolmogorov-Smirnov test

data:  replicate(1000, (sum(rexp(n)) - mu * n)/(sigma * sqrt(n)))
D = 0.0251, p-value = 0.5565
alternative hypothesis: two-sided
Part 3: Linear models in R
R has a powerful, flexible and diverse set of modeling capabilities. We'll just focus on straightforward linear models here, though.
lm() with mtcars
R has many built-in datasets that you can experiment with. One of them is the mtcars dataset. From the R documentation:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
Here are a few rows:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The columns are (again from the R
documentation):
A data frame with 32 observations on 11
variables.
[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (lb/1000)
[, 7]  qsec  1/4 mile time
[, 8]  vs    V/S
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors
What drives mileage? Probably weight, but can we check that conjecture? The lm() function fits linear models. To fit a model predicting mileage from weight you do:
> m<-lm(mpg~wt, mtcars)
The first argument is in R's formula syntax. You can read the tilde as "is modeled by." We'll see more features of the formula syntax in just a bit. lm() returns an object that contains the results of the fit. You can look at the results with summary().
> summary(m)
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
So it's a pretty good fit. We can explain 75% of the variance in mileage
with weight alone. The model says that
we lose about 5 miles per gallon for each 1,000 lbs of weight. We can also see some visualizations of the
model using plot().
> plot(m)
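Once you have a fitted model, you can also use predict() to get estimates for new data. As a sketch, here's the predicted mpg for hypothetical cars weighing 2,000 and 4,000 lbs (remember wt is in units of 1,000 lbs):
> predict(m, newdata=data.frame(wt=c(2,4)))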
Can we do better? What if we add another variable to the model?
Horsepower seems a good candidate. We
can modify the formula to include horsepower.
> m<-lm(mpg~wt+hp, mtcars)
So now our model is: mpg = A*wt + B*hp +
C. You can add more and more variables
to the formula with the + operator. So
how's it look?
> summary(m)
Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
-3.941 -1.600 -0.182  1.050  5.854

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
Better. The r^2 is up to about 0.83, and the p-value on the hp variable is 0.00145, quite small, so it's likely a relevant factor contributing to a car's mileage.
Should we try more variables? What about automatic vs. manual transmission? That's in the am variable, but (as is often
the case), it's coded as a number, 0 or 1, not a factor, so we need to recode it.
> str(mtcars$am)
 num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
> mtcars$am<-factor(mtcars$am)
> str(mtcars$am)
 Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
That's better. Now the model:
> m<-lm(mpg~wt+hp+am, mtcars)
> summary(m)
Call:
lm(formula = mpg ~ wt + hp + am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4221 -1.7924 -0.3788  1.2249  5.5317

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
wt          -2.878575   0.904971  -3.181 0.003574 **
hp          -0.037479   0.009605  -3.902 0.000546 ***
am1          2.083710   1.376420   1.514 0.141268
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.538 on 28 degrees of freedom
Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11
Hrm. The r^2 got a bit better, and the sign on the coefficient is right (a manual transmission makes mileage a bit better), but the p-value is pretty marginal; I think we're overfitting the dataset at this point. Note, however, that R is perfectly happy to fit models to categorical as well as numeric data.
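One way to check whether the extra term earns its keep (a sketch; m2 and m3 are just names for the two fits): fit the models with and without am and compare them with anova(), which runs an F-test on the difference.
> m2<-lm(mpg~wt+hp, mtcars)
> m3<-lm(mpg~wt+hp+am, mtcars)
> anova(m2, m3)   # a small p-value here would justify keeping am in the model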
Your turn: lm() with diamonds
The diamonds dataset has data for nearly 54,000 diamonds. The variables are mostly what you'd expect; table, depth, x, y, and z are some of the physical dimensions of the diamond.
> head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
> nrow(diamonds)
[1] 53940
So what drives price? Use ggplot() to explore the data and form hypotheses,
and use lm() to see how well your hypotheses hold up.
> ggplot(diamonds, aes(carat,price))+geom_point()
> ggplot(diamonds, aes(carat,price))+geom_point()+geom_smooth()
This is mgcv 1.6-1. For overview type `help("mgcv-package")'.
> ggplot(diamonds, aes(log(carat),log(price)))+geom_point()+geom_smooth()
> ggplot(diamonds, aes(log(carat),log(price)))+geom_point()+geom_smooth()+facet_wrap(~color)
> ggplot(diamonds, aes(log(carat),log(price)))+geom_point()+geom_smooth()+facet_grid(clarity~color)
> summary(lm(log(price)~log(carat),diamonds))
> summary(lm(log(price)~log(carat)+color,diamonds))
> summary(lm(log(price)~log(carat)+color+clarity,diamonds))