22 Feb 2019

Hadoop File Formats, when and what to use?



Hadoop is gaining traction and is on a rising adoption curve as a way to liberate data from the clutches of applications and their native formats. This article looks at the file formats supported by the Hadoop (read: HDFS) file system. A quick, broad categorisation of file formats would be:
  • Basic file formats: Text format, Key-Value format, Sequence format
  • Other well-known and widely used formats: Avro, Parquet, RC or Row-Columnar format, ORC or Optimized Row Columnar format
The need…
A file format is simply a way to define how information is stored in the HDFS file system, and the choice is usually driven by the use case and the processing algorithms of the specific domain. A file format should be well-defined and expressive: it should be able to handle a variety of data structures, specifically structs, records, maps and arrays, along with strings, numbers and so on. Ideally, a file format is also simple, binary and compressed.
When dealing with Hadoop's file system you not only have all of these traditional storage formats available to you (you can store PNG and JPG images on HDFS if you like), you also have some Hadoop-focused file formats to use for structured and unstructured data. A huge bottleneck for HDFS-enabled applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are exacerbated by the difficulties of managing large datasets, such as evolving schemas or storage constraints. The various Hadoop file formats have evolved as a way to ease these issues across a number of use cases.
Choosing an appropriate file format can have some significant benefits:
  1. Faster read times
  2. Faster write times
  3. Splittable files (so you don't need to read the whole file, just a part of it)
  4. Schema evolution support (allowing you to change the fields in a dataset)
  5. Advanced compression support (compress the columnar files with a compression codec without sacrificing these features)
Some file formats are designed for general use (like MapReduce or Spark), others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice.
The generic characteristics to look for are Expressive, Simple, Binary, Compressed and Integrity, to name a few. The formats themselves typically fall into text-based, serialized (row-oriented) and columnar types…
Since Protocol Buffers and Thrift are serializable but not splittable, they are not hugely popular for HDFS use cases, and thus Avro becomes the first choice…
Text Input Format
Simple text-based files are common in the non-Hadoop world, and they're super common in the Hadoop world too. Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX fashion. Text files are inherently splittable (just split on \n characters!), but if you want to compress them you'll have to use a file-level compression codec that supports splitting, such as BZIP2. Because these files are just text files you can encode anything you like in a line of the file. One common example is to make each line a JSON document to add some structure. While this can waste space with needless column headers, it is a simple way to start using structured data in HDFS.
  • Default format; JSON and CSV formats are available
  • Slow to read and write
  • Can't split compressed files (leads to huge maps)
  • Need to read/decompress all fields
An input format for plain text files. Files are broken into lines; either a linefeed or a carriage return is used to signal the end of a line. Keys are the positions in the file, and values are the lines of text. Advantages: lightweight. Disadvantages: slow to read and write; can't split compressed files (leads to huge maps).
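To make this concrete, here is a hedged sketch of the text-format round trip from R with sparklyr; the local Spark connection, the HDFS path and the use of mtcars as sample data are all assumptions for illustration, not part of the original article.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # assumption: a local Spark install

# Write a small data set out as plain CSV text; each line is one record.
cars <- copy_to(sc, mtcars, "cars")
spark_write_csv(cars, path = "hdfs:///tmp/cars_csv", header = TRUE)

# Reading it back means splitting every line and parsing every field again,
# which is exactly the read/decompress-everything cost noted above.
cars_txt <- spark_read_csv(sc, name = "cars_txt", path = "hdfs:///tmp/cars_csv")
head(cars_txt)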
Sequence File Input Format
Sequence files were originally designed for MapReduce, so the integration is smooth. They encode a key and a value for each record and nothing more. Records are stored in a binary format that is smaller than a text-based format would be. Like text files, the format does not encode the structure of the keys and values, so if you make schema migrations they must be additive. Typically if you need to store complex data in a sequence file you do so in the value part while encoding the id in the key. The problem with this is that if you add or change fields in your Writable class it will not be backwards compatible with the data stored in the sequence file. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks.
  • Traditional MapReduce binary file format
  • Stores keys and values as classes
  • Not good for Hive, which has SQL types
  • Hive always stores the entire line as a value
  • Default block size is 1 MB
  • Need to read and decompress all the fields
In addition to text files, Hadoop also provides support for binary files. Among these binary file formats, Hadoop Sequence Files are a Hadoop-specific format that stores serialized key/value pairs. Advantages: compact compared to text files, optional compression support, parallel processing, and a good container for huge numbers of small files. Disadvantages: not good for Hive, append-only like the other data formats, and multi-language support is not yet provided. Sequence files are well supported across Hadoop and many other HDFS-enabled projects, and I think they represent the easiest next step away from text files.
RC (Row-Columnar) File Input Format
RCFILE stands for Record Columnar File, another type of binary file format, which offers a high compression rate and is used when we want to perform operations on multiple rows at a time. RCFILEs are flat files consisting of binary key/value pairs, which share much similarity with sequence files. RCFILE stores the columns of a table as records, in a columnar manner: it first partitions rows horizontally into row splits and then vertically partitions each row split in a columnar way. RCFILE stores the metadata of a row split as the key part of a record, and all the data of that row split as the value part. This means that RCFILE encourages column-oriented rather than row-oriented storage, which is very useful when performing analytics: it is much easier to run analytical queries when we have a column-oriented storage type. We cannot load data into an RCFILE directly; first we need to load the data into another table and then overwrite it into our newly created RCFILE table (see the sketch after the list below).
  • Columns stored separately
  • Read and decompress only the columns needed
  • Better compression
  • Columns stored as binary blobs
  • Depends on the metastore to supply data types
  • Large blocks - 4 MB default
  • Still has to search the file for split boundaries
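A minimal sketch of that two-step load, written as HiveQL issued from R through sparklyr's DBI interface; the table names, columns and the assumption of a Spark build with Hive support are mine, purely for illustration.

library(sparklyr)
library(DBI)

sc <- spark_connect(master = "local")   # assumption: Spark with Hive support

# Direct loads into RCFile are not supported, so stage the rows in a
# plain text table first ...
dbGetQuery(sc, "CREATE TABLE staging_sales (id INT, amount DOUBLE) STORED AS TEXTFILE")

# ... then create the RCFile table and overwrite into it from the staging table.
dbGetQuery(sc, "CREATE TABLE sales_rc (id INT, amount DOUBLE) STORED AS RCFILE")
dbGetQuery(sc, "INSERT OVERWRITE TABLE sales_rc SELECT id, amount FROM staging_sales")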
ORC (Optimized Row Columnar) Input Format
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC reduces the size of the original data by up to 75%; as a result the speed of data processing increases, and it shows better performance than the Text, Sequence and RC file formats. An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data. As with RCFILE, we cannot load data into an ORC file directly; first we need to load the data into another table and then overwrite it into our newly created ORC table. In short, ORC provides a much more efficient way to store relational data than the RC file.
  • Columns stored separately
  • Knows the types - uses type-specific encoders
  • Stores statistics (min, max, sum, count)
  • Has a lightweight index
  • Skips over blocks of rows that don't matter
  • Larger blocks - 256 MB by default, with an index for block boundaries
Using ORC files improves performance when Hive is reading, writing and processing data, compared to Text, Sequence and RC. RC and ORC show better performance than the Text and Sequence file formats, and of the two, ORC is the better choice: it takes less time to access the data and less space to store it than RC. However, ORC increases CPU overhead by increasing the time it takes to decompress the relational data. The ORC file format arrived with Hive 0.11 and cannot be used with earlier versions.
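Outside of Hive, Spark can also read and write ORC directly; here is a hedged sparklyr sketch (the connection, HDFS path and sample data are assumptions for illustration).

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Spark writes the stripes, footer, statistics and lightweight index for us.
cars <- copy_to(sc, mtcars, "cars")
spark_write_orc(cars, path = "hdfs:///tmp/cars_orc")

# On read, only the columns we actually touch need to be decoded.
cars_orc <- spark_read_orc(sc, name = "cars_orc", path = "hdfs:///tmp/cars_orc")
cars_orc %>% select(mpg, cyl) %>% head()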
AVRO Format
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop's Writable classes lack language portability, Avro is quite helpful, as it deals with data formats that can be processed by multiple languages, and it is a preferred tool for serializing data in Hadoop.
Avro is an opinionated format which understands that data stored in HDFS is usually not a simple key/value combo like int/string. The format encodes the schema of its contents directly in the file, which allows you to store complex objects natively. Honestly, Avro is not really a file format; it's a file format plus a serialization and deserialization framework. With regular old sequence files you can store complex objects, but you have to manage the process yourself. Avro handles this complexity while providing other tools to help manage data over time. It is a well thought-out format which defines file data schemas in JSON (for interoperability), allows for schema evolution (remove a column, add a column) and multiple serialization/deserialization use cases, and supports block-level compression. For most Hadoop-based use cases Avro is a really good choice.
Avro depends heavily on its schema. It allows data to be written with no prior knowledge of the schema; it serializes fast and the resulting serialized data is smaller in size. The schema is stored along with the Avro data in the file for any further processing. In RPC, the client and the server exchange schemas during the connection; this exchange helps with the resolution of same-named fields, missing fields, extra fields, etc. Avro schemas are defined in JSON, which simplifies implementation in languages with JSON libraries. Like Avro, there are other serialization mechanisms in Hadoop, such as Sequence Files, Protocol Buffers and Thrift.
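Since Avro schemas are plain JSON, here is a minimal, hypothetical record schema; the record and field names are invented for illustration, and the optional email field (a union with null plus a default) is the kind of additive change that keeps schema evolution backwards compatible.

{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}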
Thrift & Protocol Buffers Vs. Avro
Thrift and Protocol Buffers are the libraries most often compared with Avro. Avro differs from these frameworks in the following ways –
  • Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These IDLs are used to generate code for serialization and deserialization.
  • Avro is built into the Hadoop ecosystem, while Thrift and Protocol Buffers are not. Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in a proprietary IDL.
Parquet Format
The latest hotness in file formats for Hadoop is columnar file storage. Basically this means that instead of just storing rows of data adjacent to one another you also store column values adjacent to each other. So datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing framework just needs access to a subset of data that is stored on disk, as it can access all values of a single column very quickly without reading whole records.
  • Design based on Google's Dremel paper
  • Schema segregated into footer
  • Column major format with stripes
  • Simple type-model with logical types
  • All data pushed to leaves of the tree
  • Integrated compression and indexes
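Those bullets describe Parquet, which the paragraph below covers in more detail. As a quick, hedged sparklyr sketch first (the connection, HDFS path, sample data and column choices are assumptions for illustration):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

cars <- copy_to(sc, mtcars, "cars")
spark_write_parquet(cars, path = "hdfs:///tmp/cars_parquet")

# Column pruning: only the mpg and wt column chunks are read and decompressed;
# the other columns are never touched.
cars_pq <- spark_read_parquet(sc, name = "cars_pq", path = "hdfs:///tmp/cars_parquet")
cars_pq %>% select(mpg, wt) %>% head()

spark_disconnect(sc)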
The Parquet file format is also a columnar format: instead of just storing rows of data adjacent to one another, you also store column values adjacent to each other, so datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing framework just needs access to a subset of the data stored on disk, as it can access all values of a single column very quickly without reading whole records. Just like the ORC file, it is great for compression and offers great query performance, and it is especially efficient when querying data from specific columns. The Parquet format is computationally intensive on the write side, but it reduces a lot of I/O cost to deliver great read performance. It enjoys more freedom than the ORC file in schema evolution, in that it can add new columns to the end of the structure.
If you're chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application, but frankly, if you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required. One huge benefit of column-oriented file formats is that data in the same column tends to be compressed together, which can yield some massive storage optimizations (as data in the same column tends to be similar).
Hadoop supports both file-level compression and block-level compression. File-level compression means you compress entire files regardless of the file format, the same way you would compress a file in Linux; some of these codecs are splittable (e.g. bzip2, or LZO if indexed). Block-level compression is internal to the file format, so individual blocks of data within the file are compressed. This means that the file remains splittable even if you use a non-splittable compression codec like Snappy, but it is only an option if the specific file format supports it.
Summary
Overall these formats can drastically optimize workloads, especially for Hive and Spark, which tend to read just segments of records rather than the whole thing (which is more common in MapReduce). Since Avro and Parquet have so much in common, when choosing a file format to use with HDFS we need to consider read performance and write performance. Because the nature of HDFS is to store data that is written once and read multiple times, we want to emphasize read performance. The fundamental difference in how to use the two formats is this: Avro is a row-based format, so if you want to retrieve the data as a whole, use Avro; Parquet is a column-based format, so if your data consists of a lot of columns but you are interested in only a subset of them, use Parquet.
Hopefully by now you've learned a little about what file formats actually are and why you would choose a specific one. We've discussed the main characteristics of common file formats and talked a little about compression. This is not an exhaustive article that can answer all use cases, but it definitely provides pointers for you to explore; if you want to learn more about the particular codecs, I would ask you to visit their respective Apache / Wikipedia pages…

24 Sept 2017

Data Visualization Best Practices

There are four basic presentation types that you can use to present your data:
  • Comparison
  • Composition
  • Distribution
  • Relationship
Unless you are a statistician or a data analyst, you are most likely using only the two most commonly used types of data analysis: comparison or composition.

Selecting the Right Chart

To determine which chart is best suited for each of those presentation types, first you must answer a few questions:
  • How many variables do you want to show in a single chart? One, two, three, many?
  • How many items (data points) will you display for each variable? Only a few or many?
  • Will you display values over a period of time, or among items or groups?
Bar charts are good for comparisons, while line charts work better for trends. Scatter plot charts are good for relationships and distributions, but pie charts should be used only for simple compositions — never for comparisons or distributions.
There is a chart selection diagram created by Dr. Andrew Abela that should help you pick the right chart for your data type. (You can download the PDF version here: Chart Selection diagram.)
Data Visualization Diagram
Let’s dig in and review the most commonly used chart types, some example, and the dos and don’ts for each chart type.

Tables

Data Visualization - Table Charts
Tables are essentially the source for all the charts. They are best used for comparison, composition, or relationship analysis when there are only few variables and data points. It would not make much sense to create a chart if the data can be easily interpreted from the table.
Use tables when:
  • You need to compare or look up individual values.
  • You require precise values.
  • Values involve multiple units of measure.
  • The data has to communicate quantitative information, but not trends.
Use charts when the data presentation:
  • Is used to convey a message that is contained in the shape of the data.
  • Is used to show a relationship between many values.
For example, if you want to show the rate of change, like sudden drop of temperature, it is best to use a chart that shows the slope of a line because rate of change is not easily grasped from a table.

Column Charts

Data Visualization - Column Charts
The column chart is probably the most used chart type. This chart is best used to compare different values when specific values are important, and it is expected that users will look up and compare individual values between each column.
With column charts you could compare values for different categories or compare value changes over a period of time for a single category.
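Since the R posts later on this blog use ggplot2, here is a minimal sketch of such a column chart in that style; the sales data frame is invented for illustration. Note that the bars are sorted by value and the value axis starts at zero, as the best practices below recommend.

library(ggplot2)

# Hypothetical revenue by region, sorted for easier comparison.
sales <- data.frame(region  = c("North", "South", "East", "West"),
                    revenue = c(42, 67, 55, 30))

ggplot(sales, aes(x = reorder(region, -revenue), y = revenue)) +
  geom_bar(stat = "identity") +
  labs(x = "Region", y = "Revenue")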
Best practices for column charts
  • Use column charts for comparison if the number of categories is quite small — up to five, but not more than seven categories.
  • If one of your data dimensions is time — including years, quarters, months, weeks, days, or hours — you should always set time dimension on the horizontal axis.
  • In charts, time should always run from left to right, never from top to bottom.
  • For column charts, the numerical axis must start at zero. Our eyes are very sensitive to the height of columns, and we can draw inaccurate conclusions when those bars are truncated.
  • Avoid using pattern lines or fills. Use border only for highlights.
  • Only use column charts to show trends if there are a reasonably-low number of data points (less than 20) and if every data point has a clearly-visible value.

Column Histograms

Data Visualization - Column Charts
The histogram is a common variation of the column chart, used to present the distribution and relationship of a single variable over a set of categories. A good example of a histogram would be the distribution of grades on a school exam, or the sizes of pumpkins, divided by size group, at a pumpkin festival.

Stacked Column Charts

Data Visualization - Stacked Column Charts
Use stacked column charts to show a composition. Do not use too many composition items (not more than three or four) and make sure the composing parts are relatively similar in size. It can get messy very quickly.
Before moving to the next chart type, I wanted to show you a good example of how to improve the effectiveness of your column chart by simplifying it. Credit: Joey Cherdarchuk
Data Visualization - Data-ink Ratio

Bar Charts

Data Visualization - Bar Charts
Bar charts are essentially horizontal column charts.
If you have long category names, it is best to use bar charts because they give more space for long text. You should also use bar charts, instead of column charts, when the number of categories is greater than seven (but not more than fifteen) or for displaying a set with negative numbers.
  • A typical use of bar charts would be visitor traffic from top referral websites. There are usually more than five to seven referring sites and website names are quite long, so they are better graphed horizontally.
  • Another example could be sales performance by sales representatives. Again, names can be quite long, and there might be more than seven sales reps.

Bar Histogram Charts

Data Visualization - Bar Histogram Charts
Just like column charts, bar charts can be used to present histograms.
  • A good histogram example is a population distribution by the age (and sex). Remember those Christmas-tree graphs?

Stacked Bar Charts

Data Visualization - Stacked Bar Charts
I’m not quite sure about a good application of stacked bar charts — except when there are only a few variables, composition parts, and the emphasis is on composition, not comparison.
Stacked bars are not good for comparison or relationship analysis. The only common baseline is along the left axis of the chart, so you can only reliably compare values in the first series and for the sum of all series.

Line Charts

Data Visualization - Line Charts
Who doesn’t know line charts? We used to draw those on blackboards in school.
Line charts are among the most frequently used chart types. Use lines when you have a continuous data set. These are best suited for trend-based visualizations of data over a period of time, when the number of data points is very high (more than 20).
With line charts, the emphasis is on the continuation or the flow of the values (a trend), but there is still some support for single value comparisons, using data markers (only with less than 20 data points.)
A line chart is also a good alternative to column charts when the chart is small.

Timeline Charts

Data Visualization - Timeline Charts
The timeline chart is a variation of line charts. Obviously, any line chart that shows values over a period of time is a timeline chart. The only difference is in functionality — most timeline charts will let you zoom in and out and compress or stretch the time axis to see more details or overall trends.
The most common examples of a time-line chart might be:
  • stock market price changes over time,
  • website visitors per day for the past 30 days,
  • sales numbers by day for the previous quarter.

The Dos and Don’ts for Line Charts

  • Use lines to present continuous data in an interval scale, where intervals are equal in size.
  • For line charts, the axis may not start from zero if the intended message of the chart is the rate of change or overall trend, not exact values or comparison. It’s best to start the axis with zero for wide audiences because some people may otherwise interpret the chart incorrectly.
  • In line charts, time should always run from left to right.
  • Do not skip values when presenting trend information over consistent data intervals; for example, include days with zero values rather than omitting them.
  • Remove guidelines to emphasize the trend, rate of change, and to reduce distraction.
  • Use a proper aspect ratio to show important information and avoid dramatic slope effects. For the best perception, aim for a 45-degree slope (https://eagereyes.org/basics/banking-45-degrees).

Area Charts

Data Visualization - Area Charts
An area chart is essentially a line chart — good for trends and some comparisons. Area charts will fill up the area below the line, so the best use for this type of chart is for presenting accumulative value changes over time, like item stock, number of employees, or a savings account.
Do not use area charts to present fluctuating values, like the stock market or prices changes.

Stacked Area

Data Visualization - Stacked Area Charts
Stacked area charts are best used to show changes in composition over time. A good example would be the changes of market share among top players or revenue shares by product line over a period of time.
Stacked area charts might be colorful and fun, but you should use them with caution, because they can quickly become a mess. Don’t use them if you need an exact comparison and don’t stack together more than three to five categories.

Pie Charts and Donut Charts

Data Visualization - Pie Charts
Who doesn’t love pies or donuts, right? Not in data visualization, though. These charts are among the most frequently used and also misused charts. The one on the right is a good example of a terrible, useless pie chart - too many components, very similar values.
A pie chart typically represents numbers in percentages, used to visualize a part to whole relationship or a composition. Pie charts are not meant to compare individual sections to each other or to represent exact values (you should use a bar chart for that).
When possible, avoid pie charts and donuts. The human mind thinks linearly but, when it comes to angles and areas, most of us can’t judge them well. Source: Oracle.com
Data Visualization - Pie Chart Angles

Stacked Donut Charts

Data Visualization - Stacked Donut Chart
I would not recommend using stacked donut charts at all! I mean, like, never! You might think that you could use a stacked donut to present composition, while allowing some comparison (with an emphasis on composition), but it would perform badly for both. Use stacked column charts instead.
Here’s a good example of how to use pie chart effectively. Credit: Joey Cherdarchuk
Data Visualization - Pie to Bar

The Dos and Don’ts for Pie charts

For those of you who still feel sentimental about the old PowerPoint Pie charts, and want to keep using them, there are some things to keep in mind.
  • Make sure that the total sum of all segments equals 100 percent.
  • Use pie charts only if you have less than six categories, unless there’s a clear winner you want to focus on.
  • Ideally, there should be only two categories, like men and women visiting your website, or only one category, like a market share of your company, compared to the whole market.
  • Don’t use a pie chart if the category values are almost identical or completely different. You could add labels, but that’s a patch, not an improvement.
  • Don’t use 3D or blow apart effects — they reduce comprehension and show incorrect proportions.

Scatter Charts

Data Visualization - Scatter Plot Charts
Scatter charts are primarily used for correlation and distribution analysis. Good for showing the relationship between two different variables where one correlates to another (or doesn’t).
Scatter charts can also show the data distribution or clustering trends and help you spot anomalies or outliers.
A good example of scatter charts would be a chart showing marketing spending vs. revenue.

Bubble Charts

Data Visualization - Bubble Charts
A bubble chart is a great option if you need to add another dimension to a scatter plot chart. Scatter plots compare two values, but you can add bubble size as the third variable and thus enable comparison. If the bubbles are very similar in size, use labels.
We could in fact add the fourth variable by color-grading those bubbles or displaying them as pie charts, but that’s probably too much.
A good example of a bubble chart would be a graph showing marketing expenditures vs. revenue vs. profit. A standard scatter plot might show a positive correlation for marketing costs and revenue (obviously), when a bubble chart could reveal that an increase in marketing costs is chewing on profits.
Use Scatter and Bubble charts to:
  • Present relationships between two (scatter) or three (bubble) numerical variables,
  • Plot two or three sets of variables on one x-y coordinate plane,
  • Turn the horizontal axis into a logarithmic scale, thus showing the relationships between more widely distributed elements.
  • Present patterns in large sets of data, linear or non-linear trends, correlations, clusters, or outliers.
  • Compare large number of data points without regard to time. The more data you include in a scatter chart, the better comparisons you can make.
  • Present relationships, but not exact values for comparisons.

Map Charts

Data Visualization - Map Charts
Map charts are good for giving your numbers a geographical context to quickly spot best and worst performing areas, trends, and outliers. If you have any kind of location data like coordinates, country names, state names or abbreviations, or addresses, you can plot related data on a map.
Maps won’t be very good for comparing exact values, because map charts are usually color scaled and humans are quite bad at distinguishing shades of colors. Sometimes it’s better to use overlay bubbles or numbers if you need to convey exact numbers or enable comparison.
A good example would be website visitors by country, state, or city, or product sales by state, region or city.
But, don’t use maps for absolutely everything that has a geographical dimension. Today, almost any data has a geographical dimension, but it doesn’t mean that you should display it on a map.
Data Visualization - Bad Map Chart Example
When to use map charts?
  • If you want to display quantitative information on a map.
  • To present spatial relationships and patterns.
  • When a regional context for your data is important.
  • To get an overview of the distribution across geographic locations.
  • Only if your data is standardized (that is, it has the same data format and scale for the whole set).

Gantt Charts

Data Visualization - Gantt Charts
Gantt charts were first devised by Karol Adamiecki in 1896, but the name comes from Henry Gantt, who independently created this type of bar chart much later, in the 1910s.
Gantt charts are good for planning and scheduling projects. Gantt charts are essentially project maps, illustrating what needs to be done, in what order, and by what deadline. You can visualize the total time a project should take, the resources involved, as well as the order and dependencies of tasks.
But project planning is not the only application for a Gantt chart. It can also be used in rental businesses, displaying a list of items for rent (cars, rooms, apartments) and their rental periods.
Data Visualization - Booking Calendar Example
To display a Gantt chart, you would typically need, at least, a start date and an end date. For more advanced Gantt charts, you’d enter a completion percentage and/or a dependency from another task.
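For a rough idea of how little that takes, here is a hedged ggplot2 sketch of a Gantt-style chart; the task names and dates are invented, and one geom_segment() per task stands in for the bars a dedicated Gantt tool would draw.

library(ggplot2)

# Hypothetical tasks, each with a start date and an end date.
tasks <- data.frame(
  task  = c("Design", "Build", "Test"),
  start = as.Date(c("2017-01-02", "2017-01-10", "2017-02-01")),
  end   = as.Date(c("2017-01-09", "2017-01-31", "2017-02-10"))
)

# One horizontal bar per task, spanning its start and end dates.
ggplot(tasks, aes(x = start, xend = end, y = task, yend = task)) +
  geom_segment(size = 5) +
  labs(x = "Date", y = NULL)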

Gauge Charts

Data Visualization - Gauge Charts
Gauge charts are good for displaying KPIs (Key Performance Indicators). They typically display a single key value, comparing it to a color-coded performance level indicator, typically showing green for “good” and red for “trouble.”
A Dashboard would be the most obvious place to use Gauge charts. There, all the KPIs will be in one place and will give a quick “health check” for your project or company.
Gauges are a great choice to:
  • Show progress toward a goal.
  • Represent a percentile measure, like a KPI.
  • Show an exact value and meaning of a single measure.
  • Display a single bit of information that can be quickly scanned and understood.
The bad side of gauge charts is that they take up a lot of space and typically only show a single point of data. If there are many gauge charts compared against a single performance scale, a column chart with threshold indicators would be a more effective and compact option.

Multi Axes Charts

Data Visualization - Multi Axes Charts
There are times when a simple chart just cannot tell the whole story. If you want to show relationships and compare variables on vastly different scales, the best option might be to have multiple axes.
A multi-axes chart will let you plot data using two or more y-axes and one shared x-axis. But it comes at a cost. That is, the charts are much more difficult to read and understand.
Multi-axes charts might be good for presenting common trends, correlations (or the lack thereof) and the relationships between several data sets. But multi-axes charts are not good for exact comparisons (because of different scales) and you should not use this type if you need to show exact values.
Use multi-axes charts if you want to:
  • Display a line chart and a column chart with the same X-axis.
  • Compare multiple measures with different value ranges.
  • Illustrate the relationships, correlation, or the lack thereof between two or more measures in one visualization.
  • Save canvas space (if the chart does not become too complicated).

Data Visualization Do’s and Don’ts – A General Conclusion

  • Time axis. When using time in charts, set it on the horizontal axis. Time should run from left to right. Do not skip values (time periods), even if there are no values.
  • Proportional values. The numbers in a chart (displayed as bar, area, bubble, or other physically measured element in the chart) should be directly proportional to the numerical quantities presented.
  • Data-Ink Ratio. Remove any excess information, lines, colors, and text from a chart that does not add value. More about data-Ink ratio
  • Sorting. For column and bar charts, to enable easier comparison, sort your data in ascending or descending order by the value, not alphabetically. This applies also to pie charts.
  • Legend. You don’t need a legend if you have only one data category.
  • Labels. Use labels directly on the line, column, bar, pie, etc., whenever possible, to avoid indirect look-up.
  • Inflation adjustment. When using monetary values in a long-term series, make sure to adjust for inflation. (See EU inflation rates and US inflation rates.)
  • Colors. In any chart, don’t use more than six colors.
  • Colors. For comparing the same value at different time periods, use the same color in a different intensity (from light to dark).
  • Colors. For different categories, use different colors. The most widely used colors are black, white, red, green, blue, and yellow.
  • Colors. Keep the same color palette or style for all charts in the series, and same axes and labels for similar charts to make your charts consistent and easy to compare.
  • Colors. Check how your charts would look when printed out in grayscale. If you cannot distinguish color differences, you should change hue and saturation of colors.
  • Colors. Seven to 10 percent of men have color deficiency. Keep that in mind when creating charts, ensuring they are readable for color-blind people. Use Vischeck to test your images. Or, try to use color palettes that are friendly to color-blind people.
  • Data Complexity. Don’t add too much information to a single chart. If necessary, split data in two charts, use highlighting, simplify colors, or change chart type. 

26 Feb 2017

Learn R Programming -- The Definitive Guide

What is R



It is a programming language and environment commonly used in statistical computing, data analytics and scientific research.
It is one of the most popular languages used by statisticians, data analysts, researchers and marketers to retrieve, clean, analyze, visualize and present data.
Due to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.

Installing R


Get R here: http://www.r-project.org/.  Once you've installed it, also install a few add-on packages by running the following commands at the R prompt:
install.packages("reshape")
install.packages("plyr")
install.packages("ggplot2")

A discovery flight: plotting with ggplot.


When you begin lessons to get a private pilot's license, the first lesson is often a "discovery flight," where they take you up and let you fly the airplane and help you land it safely.  It's a fun way to get your feet wet and see if taking more lessons seems like a good idea. 
I'd like to begin with a discovery flight of sorts, introducing ggplot.  ggplot is a powerful command-line plotting system that should show you some of the power of R without you needing to know all the details of how R works.

ggplot basics


First let's load ggplot2 and make a little data: a hundred normally distributed random numbers in a data frame under the name input. (I'll explain about data frames more later.)
> library(ggplot2)
> df<-data.frame(input=rnorm(100))
Then add a hundred more random numbers named output, each perturbing the input numbers a bit.
> df$output<-rnorm(100,df$input,sd=0.5)
A data frame is like a matrix, or a spreadsheet, where each row contains one observation.
> head(df)
        input     output
1 -0.58362979 -0.4540510
2  0.86899831  0.5282722
3 -0.70655426 -0.5075248
4 -0.35125465  0.4221305
5 -1.56632422 -0.6980894
6 -0.01087019  0.3033364
Ok, now let's plot our data.
> ggplot(df,aes(x=input,y=output))+geom_point()

What's a plot?  There's a coordinate system and some scales.  But the heart of the plot is the data.  What characteristics does each point have?  Certainly an x and y coordinate, but also a size, shape and
color that we aren't varying.  These visual properties are called "aesthetics" in ggplot, and the main thing we want to learn first is how to map the variables in our data onto the aesthetics in the plot.
So let's pull apart that command and see how it describes the plot. The first bit of the command is a call to the ggplot() function. ggplot() can operate in many different ways, but the simplest and most common is to pass it two arguments, the first of which is a data frame containing the data to be plotted. 
The second argument to ggplot is an aesthetic mapping returned by the function aes().  aes() takes aesthetic-variable pairs, in our case mapping the variable input to the x-coordinate and output to the y-coordinate of objects in the plot.
ggplot() called like this can't actually make a plot just yet, because we haven't told it how to present the data. Points, bars, lines, text, and tiles are some of the straightforward options, but there are many other possible choices.
So we add a "layer" to the plot with geom_point() that says we want our data represented as points, and voila, we have a scatter plot.

More aesthetics and faceting


So what about some other aesthetics?  Let's add a category to each observation in our data.

> df$category<-LETTERS[seq(1,5)]
> head(df)
        input     output category
1 -0.58362979 -0.4540510        A
2  0.86899831  0.5282722        B
3 -0.70655426 -0.5075248        C
4 -0.35125465  0.4221305        D
5 -1.56632422 -0.6980894        E
6 -0.01087019  0.3033364        A

And map the color of each point to its category.  Note that aes() assumes its first two arguments map to x and y position, so we can leave off those names.
> ggplot(df,aes(input,output,color=category))+geom_point()
Another way to show category is with "faceting," creating a series of smaller plots, one for each category.  The ~ in the call to facet_wrap() is an example of R's formula syntax, something we'll cover later when we get to statistical modeling.
> ggplot(df,aes(input,output))+geom_point()+facet_wrap(~category)
Since we applied our categories randomly to random data all the facets look the same, but if we tweak one category and replot, it sticks out like a sore thumb.
> df[df$category=="E","output"] <- abs(df[df$category=="E","output"])
> ggplot(df,aes(input,output))+geom_point()+facet_wrap(~category)
Let's undo our little tweak before going forward.
> df$output<-rnorm(100,df$input,sd=0.5)
How about the size aesthetic?  Let's add yet another variable to our dataframe.
> df$magnitude <- runif(100)
And map magnitude to the size aesthetic, still faceting on category
> ggplot(df,aes(input,output,size=magnitude))+geom_point()+facet_wrap(~category)
If we have another variable to plot we could add color back into the mix and then we'd be plotting five dimensions at once.  You could add shape and bump it to six. Then you could organize the facets into a grid using facet_grid() instead of facet_wrap() and now you're up to seven dimensions, all on a 2-d surface.

layer = geom + stat


Let's see another geom, a bar chart.  We'll make a simple data frame with 26 observations, each with a random score.
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> bar_df<-data.frame(letter=LETTERS,score=runif(26))
> head(bar_df)
  letter     score
1      A 0.3526575
2      B 0.5549401
3      C 0.4672510
4      D 0.4784176
5      E 0.9944808
6      F 0.8506076

Now we map x-position to letter and y-position to score, and make a bar layer.
> ggplot(bar_df,aes(letter,score))+geom_bar(stat="identity")

So we see that the x-scale can deal just fine with categorical data, and that in a bar layer, y-position maps to the height of the bar.

In statistics, the classic use of a bar chart is a histogram.  Let's fake up some more data.  The average male in the US is about six feet tall with a standard deviation of 3 inches.
> height_data<-data.frame(height=rnorm(1000,mean=6,sd=0.25))
Hmm.. but how to get a histogram?  We need to bucket our thousand observations and count the number in each bucket.  Surely something like this won't work, right?
> ggplot(height_data,aes(height))+geom_bar()

It worked, but how?  The first thing you need to know is that a layer has two parts, a "geom" that controls how the layer looks, and a "stat" that applies some statistical transformation to the data before it's plotted.  Every geom_XXX() has a default stat, and geom_point()'s default stat is "identity," which just leaves the data alone.  But geom_bar's default is "bin", which bucketizes or "bins" the data, dividing it up and counting the number of observations in each bin. So these two commands are equivalent.
> ggplot(height_data,aes(height))+geom_bar()
> ggplot(height_data,aes(height))+geom_bar(stat="bin")

But there's one more bit of magic going on.  We never mapped any variable to the y-position of the bars.  stat_bin adds a variable named ..count.. to the data, and automatically maps it to the y aesthetic.  So this is still equivalent.
> ggplot(height_data,aes(height))+geom_bar(aes(y=..count..),stat="bin")

What's that aes() doing in the geom_bar() call?  The aesthetic mapping you supply in the call to ggplot() just sets the default mapping for the plot's layers, but each layer you add can override those defaults. This is useful when making sophisticated plots with many different layers.
Note that stat_bin also adds a ..density.. variable, so if you wanted densities instead of counts you could do...
> ggplot(height_data,aes(height))+geom_bar(aes(y=..density..),stat="bin")

As I said before, each layer has a stat and a geom, so we can make the equivalent plot using stat_bin() instead of geom_bar().  Usually the choice doesn't matter much, but sometimes the various defaults make one way shorter, or one way more clearly expresses your intentions.
> ggplot(height_data,aes(height))+stat_bin(aes(y=..density..),geom="bar")

All our plots so far have only used one layer, but we can easily add more.  For our histogram, we can easily add a smoothed density estimate, plotted as a line, with geom_density().
> ggplot(height_data,aes(height))+geom_bar(aes(y=..density..))+geom_density()
Going back to our scatter plot data, we can easily add a layer showing a smoothed line with a confidence interval.
> ggplot(df,aes(input,output))+geom_point()+stat_smooth()
There are many smoothing methods and ggplot tries to be smart about which one it chooses based on your data, but you can override its choice by supplying an explicit method argument.  To get a linear fit, for example:
> ggplot(df,aes(input,output))+geom_point()+stat_smooth(method="lm")
All the layers are preserved if you add faceting.
> ggplot(df,aes(input,output))+geom_point()+stat_smooth(method="lm")+facet_wrap(~category)
Dang, that line is getting kind of long.  Each of those function calls returns an object that you can assign to a variable, and a common way to break up the creation of a chart is to assign the base chart to a variable and then add embellishments.
> p<-ggplot(df,aes(input,output))+geom_point()
> p+stat_smooth(method="lm")+facet_wrap(~category)

Dealing with overplotting


For fun let's have a look at a few of the other capabilities of ggplot. Our little scatter plot of 100 observations was nice, but what happens when you get a lot of data?
> big_df<-data.frame(input=rnorm(10000))
> big_df$output<-rnorm(10000,big_df$input,sd=0.3)
> p<-ggplot(big_df,aes(input,output))
> p+geom_point()

With so many points, there's a lot of overplotting, so it's hard to see what's going on.  There are several ways to deal with overplotting; one is to just make the points smaller.
> p+geom_point(size=1)

Note that we're not using aes() here.  We're not /mapping/ something to size, we're /setting/ size to a constant.  You can do this with all aesthetics, and it can be a source of confusion when you set an aesthetic you meant to map or vice-versa.
Another way to deal with overplotting is to use alpha transparency.
> p+geom_point(alpha=0.1)
That's a little better.  Just as we added a smooth density estimate to our histogram before, we can add a 2-d density estimate to this plot.
> p+geom_point(alpha=0.1)+stat_density2d()

For something a little more fancy, we can override some of stat_density2d's defaults.

> p+stat_density2d(aes(fill=..level..),geom="polygon")
Normally stat_density2d produces contour shapes.  We can turn that off and use the tile geom to get a heatmap.
> p+stat_density2d(aes(fill=..density..),geom="tile",contour=FALSE)

More in-depth resources


I've only covered the barest essentials.  If you want to learn more I really recommend Hadley's book.
Online reference with tons of examples: http://had.co.nz/ggplot2/
Hadley's Book, _ggplot2: Elegant Graphics for Data Analysis_: http://www.amazon.com/gp/product/0387981403

Your turn: EXAMPLE


Now it's your turn. 

Extremely brief intro to plyr


Split, apply, combine.
Like GROUP BY and aggregate functions in SQL, or pivot tables in Excel, but far more flexible.
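As a tiny illustration, reusing the df data frame with the category column from the ggplot section above (output omitted, since the data are random): plyr splits df by category, applies a summary function to each piece, and combines the results into a new data frame.

> library(plyr)
> ddply(df, "category", summarise, mean_output=mean(output), sd_output=sd(output))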

Your turn part 2: EXAMPLE


My creation:

Part 2: R fundamentals and hypothesis testing


Most R tutorials assume you know some statistics and teach you just how to do statistics in R.  In this part I'm going to assume you know neither and teach you a (teensy, tiny) bit of statistics at the same time I demonstrate the R system.

A little background on R


R is an implementation of the S statistical programming language.  S has a very long history going back to 1976.  Because the theory of programming language design was undergoing very active exploration throughout the development of S, it is something of a hodge podge of different paradigms.  It's got both functional and procedural aspects, with object oriented features sprinkled on top.
S's influences include C and Ratfor (brace syntax), APL and APL's cousin PPL (the arrow operator), lisp (the lambda calculus, first class functions), JCL (positional or keyword argument notation), Smalltalk (classes and inheritance).  An excellent history of S can be found here:
So R is a full blown programming language, but in my normal use of it, I don't really do much programming, I mostly use the incredible array of built in functionality at the R command line.  R does statistical modeling, analysis of variance (aka anova), and statistical tests like Student's-t or chi-squared.  It implements techniques from linear algebra like solving systems of linear equations, singular value decomposition, and working with eigensystems.  It also implements all kinds of analytical techniques from clustering to factor analysis.
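For a flavor of the linear algebra functions just mentioned, here is a tiny sketch using base R; the matrix is an arbitrary example of mine.

> A <- matrix(c(2,1,1,3), nrow=2)
> b <- c(1,2)
> solve(A, b)       # solve the linear system A x = b
> eigen(A)$values   # eigenvalues of A
> svd(A)$d          # singular values of A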

The absolute minimum you need to know about R to get started.


But before we get into all that let's get familiar with R's command line and the most basic of R concepts.
R's command line is a read-evaluate-print loop.  You type a statement, R evaluates that statement and prints the result.  So you can evaluate the number 5 like so:
> 5
[1] 5
and do simple arithmetic like so:
> 2+3
[1] 5
A character string is entered with quotes and evaluates to itself:

> "Hello"
[1] "Hello"
In addition to numbers and strings, R has a boolean type with keywords TRUE and FALSE in all-caps.
> TRUE
[1] TRUE
> FALSE
[1] FALSE
And there are a few more special keywords: NA for missing data and NULL to represent no object.
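Both evaluate at the prompt just like the other values:

> NA
[1] NA
> NULL
NULL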

Vectors


So what's that funny [1] that comes out in the examples above?  It turns out that R's most basic data structure is a vector.  R doesn't support the idea of a scalar, a single value.  So when you type a single value it's actually being coerced into a vector, and the [1] is a vector index.
The simplest way to make vectors is with the c() function, which just turns its arguments into a vector.
> c(12,23,34,45,56)
[1] 12 23 34 45 56
Lots of functions make vectors.  seq() for example makes a sequence of numbers.
> seq(1,30)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30
So now we see that the [1] and [26] tell you the index of the first element on that row of output.
R can generate random numbers according to many distributions.  This command gives you 10 random numbers uniformly distributed from 1 to 100.
> runif(10,1,100)
 [1] 45.12021 93.65181 18.73653 21.81440 14.16285 30.65400 73.82752 14.86346
 [9] 31.08075 14.69531
Most any distribution you can think of is cooked into R: normal, log-normal, exponential, gamma, chi-squared, Poisson, binomial, etc. Here's 10 normally distributed random numbers with mean 0 and standard deviation 1.
> rnorm(10,0,1)
 [1]  0.1248729 -1.6965157  0.1094253 -0.8555028 -0.4061967 -1.0979561
 [7]  1.2266705 -0.8437930 -0.2905176  0.1290491
Vectors (and pretty much everything else in R) are objects.  You can give an object a name with the assignment operator '<-'. Using '<-' as the assignment operator was borrowed from APL.  You can also just use '='.
> numbers<-rnorm(10,0,1)
> numbers
 [1]  0.86378839 -0.05729522  0.50690025  0.98559770  0.95857035  0.84151008
 [7] -1.36267419  1.68299597 -1.02089036  0.01789479
> more_numbers=rnorm(10,0,1)
> more_numbers
 [1]  0.63656000  1.22249113 -1.46085369  0.29902021 -0.05245649 -0.68435848
 [7] -0.32230916  1.42836638 -0.90796790  2.23426332
You can get a listing of all your currently named objects with ls()
> ls()
 [1] "more_numbers" "numbers"
You can get rid of objects with rm().
> rm(numbers, more_numbers)
> ls()
character(0)

Operations on vectors


Ok.  So we know some ways to make vectors; what can we do with them? Well, all the normal sorts of things you'd expect.
> numbers<-rnorm(10)
> numbers
 [1]  0.8893482  0.4200492 -1.2398678 -0.3585673 -1.4215178  0.7752930
 [7]  0.3963262 -1.1906303  0.1747187  2.2828912
> mean(numbers)
[1] 0.07280433
> median(numbers)
[1] 0.2855224
> sum(numbers)
[1] 0.7280433
These functions summarize a vector into one number.  There are also lots of functions that operate on each element of a vector, returning a new modified vector.
> abs(numbers)
 [1] 0.8893482 0.4200492 1.2398678 0.3585673 1.4215178 0.7752930 0.3963262
 [8] 1.1906303 0.1747187 2.2828912
> numbers*100
 [1]   88.93482   42.00492 -123.98678  -35.85673 -142.15178   77.52930
 [7]   39.63262 -119.06303   17.47187  228.28912
Note that the statements above aren't modifying their arguments; they are creating new vectors.  So to modify a vector you need to assign the output.
> numbers<-sort(numbers)
> numbers
 [1] -1.4215178 -1.2398678 -1.1906303 -0.3585673  0.1747187  0.3963262
 [7]  0.4200492  0.7752930  0.8893482  2.2828912

Statistical tests


What about that mean that we got earlier?
> mean(numbers)
[1] 0.07280433
We got those numbers by drawing from a normal distribution with mean of zero, so the mean of our sample should be close to zero.  0.07 is pretty close, but should we trust that rnorm() is really giving us numbers with a mean of zero?  Maybe rnorm() is buggy and has a bias. How can we tell?
Student's t-test is one way to answer this question.  Given a sample, the t-test can tell you how confident you can be that the population's mean is a particular value.  Here's how you run a t-test in R, asking if the population's mean is zero.
> t.test(numbers, mu=0)

       One Sample t-test

data:  numbers
t = 0.1992, df = 9, p-value = 0.8465
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.7538657  0.8994743
sample estimates:
 mean of x
0.07280433
The number to look at in the above output is the p-value.  Very loosely, the lower the p-value, the less likely it is that population mean is actually zero.  The p-value is a probability, so it ranges from zero to one, and one usually looks for a p-value less than 0.05 or 0.01 in order to conclude that the population mean is not zero.
Our p-value is 0.85, well above the 0.05 cutoff, so we conclude that the population mean really is zero (within the power of our test to tell us), and that rnorm() isn't buggy.
What's that p-value really mean?  Why set the cutoff at 0.05 or 0.01? The cutoff corresponds to the chance that you'll reach the wrong conclusion.  If you choose 0.05, there's a 5% chance you'll conclude that the mean isn't zero, when it really is.  This means that if I draw many samples and run my t-test on each of them, sometimes the test will say that the population mean isn't zero, even if it really is.
We can verify this experimentally very easily.  What t.test returns is actually an object.  We can save that object to a variable with the assignment operator just like we've done before with vectors.  The output we saw above is just that object being printed.

> result <- t.test(numbers,mu=0)
> result

       One Sample t-test

data:  numbers
t = 0.1992, df = 9, p-value = 0.8465
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.7538657  0.8994743
sample estimates:
 mean of x
0.07280433
We can extract just the p-value like so:
> t.test(numbers,mu=0)$p.value
[1] 0.8465136
So we can re-sample and re-test a few times by hand.
> t.test(rnorm(100),mu=0)$p.value
[1] 0.682082
> t.test(rnorm(100),mu=0)$p.value
[1] 0.7112275
> t.test(rnorm(100),mu=0)$p.value
[1] 0.324931
> t.test(rnorm(100),mu=0)$p.value
[1] 0.9596542
So far, so good, no numbers under 0.05.  But this is getting tedious; we can use the replicate function to do this a bunch of times and accumulate the results in a vector.
> pvalues<-replicate(60,t.test(rnorm(100),mu=0)$p.value)
And if we sort the p-values we get, lo and behold there are four of them below 0.05.  We ran our test 60 times, and 4 of those 60 times we reached the wrong conclusion.  With a 0.05 (or one in twenty) cutoff, we only expected three of these errors. 
> sort(pvalues)
 [1] 0.01182703 0.02263391 0.02990155 0.03108499 0.05414258 0.06868589
 [7] 0.08567086 0.08964473 0.12293828 0.16763739 0.17225885 0.18524312
[13] 0.19448739 0.20541757 0.24862190 0.25007644 0.25691785 0.27851561
[19] 0.28053250 0.30991840 0.32623091 0.36615250 0.36920395 0.37273555
[25] 0.38772763 0.38788907 0.39546547 0.39799577 0.42377867 0.46539131
[31] 0.48143437 0.48794339 0.49729286 0.50076295 0.50924057 0.52898420
[37] 0.54825732 0.57358131 0.59666177 0.59980506 0.61071583 0.62020976
[43] 0.62161798 0.67315114 0.67944879 0.69186076 0.70146278 0.73897408
[49] 0.76402396 0.76638926 0.78694787 0.84759542 0.86765217 0.94199050
[55] 0.95396098 0.95869792 0.96865229 0.97681270 0.98922664 0.99296506

Two-sample t-test and statistical power


The way I'm using the t-test here is somewhat unusual.  A far more common case is when you have two samples, and you want to know whether the populations the samples are drawn from have different means.
This is the way we do weblabs here at Amazon.  We cut our user base into two populations, and give one of them a different experience.  We theorize that this different experience will cause different behavior, so we run t-tests to find out if, for example, one population spends more money on average than the other. This is called a two-sample t-test.
Here's a contrived example, where I draw one sample with a mean of zero, and another sample with a mean of one, and then ask the t-test to tell me if the means are indeed different.
> t.test(rnorm(100,mean=0),rnorm(100,mean=1))

       Welch Two Sample t-test

data:  rnorm(100, mean = 0) and rnorm(100, mean = 1)
t = -6.7014, df = 188.741, p-value = 2.315e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.2045547 -0.6566807
sample estimates:
mean of x mean of y
0.1470455 1.0776632
That p-value is tiny.  We can safely conclude that the two populations have different means.  But what if the difference in the means is less extreme?
> t.test(rnorm(100,mean=0),rnorm(100,mean=0.5))$p.value
[1] 0.001426943
> t.test(rnorm(100,mean=0),rnorm(100,mean=0.2))$p.value
[1] 0.3231079
> t.test(rnorm(100,mean=0),rnorm(100,mean=0.1))$p.value
[1] 0.8972577
As the difference in means gets smaller our test loses the ability to resolve the difference.  We can get it back by upping the sample size.
> t.test(rnorm(10000,mean=0),rnorm(10000,mean=0.1))$p.value
[1] 3.305264e-17
This question of how small a difference a test can detect given a certain sample size is an example of statistical power analysis. 
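R has a built-in helper for exactly this question, power.t.test() in the stats package.  As a rough sketch, you can ask how likely a two-sample t-test with 100 observations per group is to detect a difference of 0.1 (expect the answer to be low), or solve for the sample size needed to detect it 80% of the time; you can also estimate power empirically by simulation, in the same spirit as the replicate() trick above:
> power.t.test(n=100, delta=0.1, sd=1, sig.level=0.05)       # power for n = 100 per group
> power.t.test(delta=0.1, sd=1, sig.level=0.05, power=0.8)   # n per group needed for 80% power
> mean(replicate(1000, t.test(rnorm(100), rnorm(100, mean=0.1))$p.value < 0.05))  # simulated power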

Another statistical test


The t-test is just one of many statistical tests built into R.  Let's use another test to empirically verify the Central Limit Theorem.  The central limit theorem is one of the most important results in all of statistics.  Roughly, it says that if you take a large number of independent and identically distributed random variables and sum them, then that sum will be approximately normally distributed, no matter what the underlying distribution is (well, there are some requirements on the distributions, such as finite variance, but just about anything you'd encounter in real life will work).
That's still pretty math-y, so what does it mean in practical terms? Recall that we can generate uniformly distributed random numbers with runif().
> runif(10)
 [1] 0.4456587 0.9358769 0.1791569 0.2102465 0.1329581 0.2995354 0.7356315
 [8] 0.1400350 0.3038460 0.1383365
Let's plot a histogram quickly to visually see that they're uniform. qplot() is a simplified interface to the ggplot2 plotting system.
> qplot(runif(1000), geom="histogram", binwidth=0.1)
Looks pretty good.  If we sum 100 random numbers uniformly distributed from zero to one, we expect to get a number close to 50, right?
> sum(runif(100))
[1] 48.79861
48.8 is pretty close to 50.  So our intuition seems right.  This intuition is at the heart of the central limit theorem.  The theorem says that if we compute this sum over a bunch of samples, the results will be normally distributed.  So here's a plot of 1,000 such sums.
> qplot(replicate(1000,sum(runif(100))), geom="histogram")
Sweet, looks normal to me.  And here's the magic part: it works with pretty much any distribution, not just the uniform distribution.  For example, here's the exponential distribution:
> qplot(rexp(100), geom="histogram")
and the distribution of the sums of exponentially distributed random numbers:
> qplot(replicate(1000,sum(rexp(100))), geom="histogram")
Looks pretty normal.  There's a formal test that tells you whether or not a bunch of numbers follow a given distribution: the Kolmogorov-Smirnov test, or ks-test.
First, sums of uniform deviates on the interval [-1,1].  A uniform deviate on [-1,1] has mean 0 and variance 4/12, which is where mu and sigma below come from; each sum is standardized so it should follow the standard normal.
> mu=0
> sigma=sqrt(4/12)
> n=1000
> ks.test(replicate(1000,(sum(runif(n,-1,1))-mu*n)/(sigma*sqrt(n))),"pnorm")

        One-sample Kolmogorov-Smirnov test

data:  replicate(1000, (sum(runif(n, -1, 1)) - mu * n)/(sigma * sqrt(n)))
D = 0.0219, p-value = 0.7251
alternative hypothesis: two-sided

Sums of exponentials (rate 1, so mean 1 and standard deviation 1, hence mu and sigma below):

> mu=1
> sigma=1
> n=1000
> ks.test(replicate(1000,(sum(rexp(n))-mu*n)/(sigma*sqrt(n))),"pnorm")

        One-sample Kolmogorov-Smirnov test

data:  replicate(1000, (sum(rexp(n)) - mu * n)/(sigma * sqrt(n)))
D = 0.0251, p-value = 0.5565
alternative hypothesis: two-sided
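The same standardization works for other distributions too.  As a sketch, here are sums of gamma-distributed deviates (shape 2, rate 1, so mu = 2 and sigma = sqrt(2)), which should also sail through the ks-test against the standard normal:

> mu=2
> sigma=sqrt(2)
> n=1000
> ks.test(replicate(1000,(sum(rgamma(n,shape=2))-mu*n)/(sigma*sqrt(n))),"pnorm")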

Part 3: Linear models in R


R has a powerful, flexible and diverse set of modeling capabilities.  We'll just focus on straightforward linear models here, though.

lm() with mtcars


R has many built in datasets that you can experiment with.  One of them is the mtcars dataset.  From the R documentation:
   The data was extracted from the 1974 Motor Trend US magazine, and
   comprises fuel consumption and 10 aspects of automobile design and
   performance for 32 automobiles (1973-74 models).

Here's a few rows:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The columns are (again from the R documentation):
    A data frame with 32 observations on 11 variables.
    [, 1]  mpg   Miles/(US) gallon
    [, 2]  cyl   Number of cylinders
    [, 3]  disp  Displacement (cu.in.)
    [, 4]  hp    Gross horsepower
    [, 5]  drat  Rear axle ratio
    [, 6]  wt    Weight (lb/1000)
    [, 7]  qsec  1/4 mile time
    [, 8]  vs    V/S
    [, 9]  am    Transmission (0 = automatic, 1 = manual)
    [,10]  gear  Number of forward gears
    [,11]  carb  Number of carburetors

What drives mileage?  Probably weight, but can we check that conjecture?  The lm() function fits linear models.  To fit a model predicting mileage from weight you do:
> m<-lm(mpg~wt, mtcars)
The first argument is in R's formula syntax.  You can read the tilde as "is modeled by."  We'll see more features of the formula syntax in just a bit.  lm() returns an object that contains the results of the fit.  You can look at the results with summary().
> summary(m)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528,  Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
So it's a pretty good fit.  We can explain 75% of the variance in mileage with weight alone.  The model says that we lose about 5 miles per gallon for each 1,000 lbs of weight.  We can also see some visualizations of the model using plot().
> plot(m)
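Once you have the fitted model object, you can also use it to make predictions with predict().  As a sketch, here's the predicted mileage for a hypothetical 3,000 lb car (remember wt is in units of 1,000 lbs), first as a point estimate and then with a 95% prediction interval:
> predict(m, newdata=data.frame(wt=3))                        # predicted mpg for wt = 3
> predict(m, newdata=data.frame(wt=3), interval="prediction") # with a 95% prediction interval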
Can we do better?  What if we add another variable to the model? Horsepower seems a good candidate.  We can modify the formula to include horsepower.
> m<-lm(mpg~wt+hp, mtcars)
So now our model is: mpg = A*wt + B*hp + C.  You can add more and more variables to the formula with the + operator.  So how's it look?
> summary(m)

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
-3.941 -1.600 -0.182  1.050  5.854

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared: 0.8268,  Adjusted R-squared: 0.8148
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Better.  The r^2 is up to about 0.83, and the p-value on the hp variable is 0.00145, which is quite small, so horsepower is likely a relevant factor contributing to a car's mileage.
Should we try more variables?  What about automatic vs. manual transmission?  That's in the am variable, but (as is often the case), it's coded as a number, 0 or 1, not a factor, so we need to recode it.

> str(mtcars$am)
 num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
> mtcars$am<-factor(mtcars$am)
> str(mtcars$am)
 Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
That's better.  Now the model:
> m<-lm(mpg~wt+hp+am, mtcars)
> summary(m)

Call:
lm(formula = mpg ~ wt + hp + am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4221 -1.7924 -0.3788  1.2249  5.5317

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
wt          -2.878575   0.904971  -3.181 0.003574 **
hp          -0.037479   0.009605  -3.902 0.000546 ***
am1          2.083710   1.376420   1.514 0.141268   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.538 on 28 degrees of freedom
Multiple R-squared: 0.8399,   Adjusted R-squared: 0.8227
F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11
Hrm.  The r^2 got a bit better, and the sign on the coefficient is right (a manual transmission improves mileage a bit), but the p-value is pretty marginal; I think we're overfitting the dataset at this point.  Note, however, that R is perfectly happy to fit models to categorical as well as numeric data.
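One way to check that overfitting suspicion is to compare the nested models directly.  As a rough sketch (m2 and m3 are just names introduced here so both fits stay around), anova() runs an F-test on whether the extra term earns its keep, and AIC() penalizes the added parameter:
> m2<-lm(mpg~wt+hp, mtcars)      # the two-variable model from before
> m3<-lm(mpg~wt+hp+am, mtcars)   # the three-variable model with transmission
> anova(m2, m3)                  # F-test: does adding am significantly improve the fit?
> AIC(m2, m3)                    # lower AIC is better once extra parameters are penalized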

Your turn: lm() with diamonds


The diamonds dataset has data for nearly 54,000 diamonds.  The variables are mostly what you'd expect; table, depth, x, y, and z are some of the physical dimensions of the diamond.
> head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
> nrow(diamonds)
[1] 53940
So what drives price?  Use ggplot() to explore the data and form hypotheses, and use lm() to see how well your hypotheses hold up.
> ggplot(diamonds, aes(carat,price))+geom_point()
> ggplot(diamonds, aes(carat,price))+geom_point()+geom_smooth()
This is mgcv 1.6-1. For overview type `help("mgcv-package")'.
> ggplot(diamonds, aes(log(carat),log(price)))+geom_point()+geom_smooth()
> ggplot(diamonds, aes(log(carat),log(price)))+geom_point()+geom_smooth()+facet_wrap(~color)
> ggplot(diamonds, aes(log(carat),log(price)))+geom_point()+geom_smooth()+facet_grid(clarity~color)


> summary(lm(log(price)~log(carat),diamonds))
> summary(lm(log(price)~log(carat)+color,diamonds))
> summary(lm(log(price)~log(carat)+color+clarity,diamonds))