by Sharon Machlis
edited by Johanna Ambrosio
R: a beginner's guide
COMPUTERWORLD.COM
Introduction
R is hot. Whether measured by more than 6,100 add-on packages, the 41,000+ members of LinkedIn's R group or the 170+ R Meetup groups currently in existence, there can be little doubt that interest in the R statistics language, especially for data analysis, is soaring.

Why R? It's free, open source, powerful and highly extensible. "You have a lot of prepackaged stuff that's already available, so you're standing on the shoulders of giants," Google's chief economist told The New York Times back in 2009.

Because it's a programmable environment that uses command-line scripting, you can store a series of complex data-analysis steps in R. That lets you re-use your analysis work on similar data more easily than if you were using a point-and-click interface, notes Hadley Wickham, author of several popular R packages and chief scientist with RStudio.

That also makes it easier for others to validate research results and check your work for errors, an issue that cropped up in the news recently after an Excel coding error was among several flaws found in an influential economics analysis report known as Reinhart/Rogoff.

The error itself wasn't a surprise, blogs Christopher Gandrud, who earned a doctorate in quantitative research methodology from the London School of Economics. "Despite our best efforts we always will make errors," he notes. "The problem is that we often use tools and practices that make it difficult to find and correct our mistakes."

Sure, you can easily examine complex formulas on a spreadsheet. But it's not nearly as easy to run multiple data sets through spreadsheet formulas to check results as it is to put several data sets through a script, he explains.

Indeed, the mantra of "Make sure your work is reproducible!" is a common theme among R enthusiasts.

Who uses R?
Relatively high-profile users of R include:
Facebook: Used by some within the company for tasks such as analyzing user behavior.
Google: There are more than 500 R users at Google, according to David Smith at Revolution Analytics, doing tasks such as making online advertising more effective.
National Weather Service: Flood forecasts.
Orbitz: Statistical analysis to suggest best hotels to promote to its users.
Trulia: Statistical modeling.
Source: Revolution Analytics

Why not R? Well, R can appear daunting at first. That's often because R syntax is different from that of many other languages, not necessarily because it's any more difficult than others.

"I have written software professionally in perhaps a dozen programming languages, and the hardest language for me to learn has been R," writes consultant John D. Cook in a Web post about R programming for those coming from other languages. "The language is actually fairly simple, but it is unconventional."
And so, this guide. Our aim here isn't R mastery, but giving you a path to start using R for basic data work: Extracting key statistics out of a data set, exploring a data set with basic graphics and reshaping data to make it easier to analyze.

Your first step
To begin using R, head to r-project.org to download and install R for your desktop or laptop. It runs on Windows, OS X and a wide variety of Unix platforms, but not yet on Android or iOS.

Installing R is actually all you need to get started. However, I'd suggest also installing the free R integrated development environment (IDE) RStudio. It's got useful features you'd expect from a coding platform, such as syntax highlighting and tab for suggested code auto-completion. I also like its four-pane workspace, which better manages multiple R windows for typing commands, storing scripts, viewing command histories, viewing visualizations and more.

Although you don't need the free RStudio IDE to get started, it makes working with R much easier.

The top left window is where you'll probably do most of your work. That's the R code editor allowing you to create a file with multiple lines of R code (or open an existing file) and then run the entire file or portions of it.

Bottom left is the interactive console where you can type in R statements one line at a time. Any lines of code that are run from the editor window also appear in the console.

The top right window shows your workspace, which includes a list of objects currently in memory. There's also a history tab with a list of your prior commands; what's handy there is that you can select one, some or all of those lines of code and one-click to send them either to the console or to whatever file is active in your code editor.

The window at bottom right shows a plot if you've created a data visualization with your R code. There's a history of previous plots and an option to export a plot to an image file or PDF. This window also shows external packages (R extensions) that are available on your system, files in your working directory and help files when called from the console.

Control + the up arrow (command + up arrow on a Mac) is a similar auto-complete tool. Start typing and hit that key combination, and it shows you a list of every command you've typed starting with those keys. Select the one you want and hit return. This works only in the interactive console, not in the code editor window.

Control + enter (command + enter on a Mac) takes the current line of code in the editor, sends it to the console and executes it. If you select multiple lines of code in the editor and then hit ctrl/cmd + enter, all of them will run.

For more about RStudio features, including a full list of keyboard shortcuts, head to the online documentation.

If you don't want to type the command, in RStudio there's a Packages tab in the lower right window; click that and you'll see a button to "Install Packages." (There's also a menu command; the location varies depending on your operating system.)

To see which packages are already installed on your system, type:
installed.packages()

Or, in RStudio, go to the Packages tab in the lower right window.

To use a package in your work once it's installed, load it with:
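As a sketch of that check-then-load pattern, the snippet below uses the built-in stats package as a stand-in so it runs without a network connection. The variable names (pkg, is_installed, is_attached) are illustrative assumptions, not part of the original guide.

```r
# "stats" ships with every R installation, so it is a safe stand-in here.
pkg <- "stats"

# installed.packages() returns a matrix whose row names are package names.
is_installed <- pkg %in% rownames(installed.packages())

if (!is_installed) {
  # Not reached for "stats"; shown for packages that come from CRAN.
  install.packages(pkg)
}

# library() attaches the package for the current R session only.
# character.only = TRUE lets us pass the name as a string variable.
library(pkg, character.only = TRUE)

# search() lists what is currently attached.
is_attached <- paste0("package:", pkg) %in% search()
print(is_attached)
```

For an add-on package you would normally just type its name directly, for example library(psych), once it has been installed.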
example(functionName)
args(functionName)
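To make the placeholder functionName concrete, here is a small sketch using the built-in sd() function; the choice of sd() is an assumption made only for illustration.

```r
# args() prints a function's argument list.
args(sd)

# formals() returns the same information as a named list,
# which is handy if you want to inspect it in code.
arg_names <- names(formals(sd))
print(arg_names)

# In an interactive session, ?sd or help("sd") opens the full help page,
# and example(sd) runs the examples from that page.
```

Here the argument list shows that sd() accepts an na.rm option for handling missing values.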
If you just want to play with some test data to see how they load and what basic functions you can run, the default installation of R comes with several data sets. Type:
data()
into the R console and you'll get a listing of pre-loaded data sets. Not all of them are useful (body temperature series of two beavers?), but these do give you a chance to try analysis and plotting commands. And some online tutorials use these sample sets.

One of the less esoteric data sets is mtcars, data about various automobile models that come from Motor Trend magazine. (I'm not sure what year the data are from, but given that there are entries for the Valiant and Duster 360, I'm guessing they're not very recent; still, it's a bit more compelling than whether beavers have fevers.)

You'll get a printout of the entire data set if you type the name of the data set into the console, like so:
mtcars

There are better ways of examining a data set, which I'll get into later in this series. Also, R does have a print() function for

(Aside: What's that <- where you expect to see an equals sign? It's the R assignment operator. I said R syntax was a bit quirky. More on this in the section on R syntax quirks.)

And if you're wondering what kind of object is created with this command, mydata is an extremely handy data type called a data frame, basically a table of data. A data frame is organized with rows and columns, similar to a spreadsheet or database table.

The read.csv function assumes that your file has a header row, so row 1 is the name of each column. If that's not the case, you can add header=FALSE to the command:
mydata <- read.csv("filename.txt", header=FALSE)

In this case, R will read the first line as data, not column headers (and assigns default column header names you can change later).

If your data use another character to separate the fields, not a comma, R also has the more general read.table function. So if your separator is a tab, for instance, this would work:
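A minimal sketch of reading a tab-separated file with read.table() follows. The temporary file and its contents are made up for illustration, and header=TRUE is an assumption about the file having a header row.

```r
# Write a small tab-separated file to a temporary location (made-up data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tscore", "Ann\t90", "Raj\t85"), tmp)

# sep = "\t" marks tab as the field separator;
# header = TRUE tells R that row 1 holds the column names.
mydata <- read.table(tmp, sep = "\t", header = TRUE)
str(mydata)

# Clean up the temporary file.
unlink(tmp)
```

Unlike read.csv, read.table does not assume a header row by default, so header=TRUE has to be stated when the file has one.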
and Perl, and in general I'd rather export a spreadsheet to CSV in hopes of not running into Microsoft special-character problems. For more info on other formats, see UCLA's "How to input data into R," which discusses the foreign add-on package for importing several other statistical software file types.

If you'd like to try to connect R with a database, there are several dedicated packages such as RPostgreSQL, RMySQL, RMongo, RSQLite and RODBC.

(You can see the entire list of available R packages at the CRAN website.)

Center data about mobile shopping are available as a CSV file for download. You can store the data in a variable called pew_data like this:
pew_data <- read.csv("http://bit.ly/11I3iuU")

It's important to make sure the file you're downloading is in an R-friendly format first: in other words, that it has a maximum of one header row, with each subsequent row having the equivalent of one data record. Even well-formed government data might include lots of blank rows followed by footnotes -- that's not what you want in an R data table if you plan on running sta-
If you're finished with variable x and want to remove it from your workspace, use the rm() remove function:
rm(x)

save(variablename, file="filename.rda")

Reload it at any time with:
load("filename.rda")
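The three commands above can be tried together in one round trip. This is a sketch only: the vector x and the temporary file path are assumptions chosen so the example cleans up after itself.

```r
# A small object to save (made-up values).
x <- c(7, 9, 23, 5)
rda_path <- file.path(tempdir(), "filename.rda")

save(x, file = rda_path)                 # write the object x to disk
rm(x)                                    # remove it from the workspace
gone <- !exists("x", inherits = FALSE)   # TRUE: x is no longer defined here

load(rda_path)                           # restores x under its original name
print(x)
```

Note that load() brings the object back under the name it had when saved, so there is no need to assign the result to a variable.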
Examine your data object
Before you start analyzing, you might want to take a look at your data object's structure and a few row entries. If it's a 2-dimensional table of data stored in an R data frame object with rows and columns (one of the more common structures you're likely to encounter), here are some ideas. Many of these also work on 1-dimensional vectors.

Many of the commands below assume that your data are stored in a variable called mydata (and not that mydata is somehow part of these functions' names).

If you type:
head(mydata)
tail(mydata)

Tail can be useful when you've read in data from an external source, helping to see if anything got garbled (or there was some footnote row at the end you didn't notice).

To quickly see how your R object is structured, you can use the str() function:
str(mydata)

This will tell you the type of object you have; in the case of a data frame, it will also tell you how many rows (observations in statistical R-speak) and columns (variables to R) it contains, along with the type of data in each column and the first few entries in each column.
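To try these commands without loading a file, here is a sketch that copies the built-in mtcars data set into mydata; using mtcars here is an assumption made for illustration.

```r
# Use a built-in data set as a stand-in for your own data.
mydata <- mtcars

first_rows <- head(mydata)         # first six rows by default
last_rows  <- tail(mydata, n = 3)  # n sets how many rows to return
print(first_rows)
print(last_rows)

# str() reports the object type, its dimensions and each column's type.
str(mydata)
```

Both head() and tail() accept the n argument, so you can widen or narrow the preview as needed.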
Likewise, if you're interested in the row names (in essence, all the values in the first column of your data frame), use:
rownames(mydata)

Pull basic stats from your data frame
Because R is a statistical programming platform, it's got some pretty elegant ways to extract statistical summaries from data. To extract a few basic stats from a data frame, use the summary() function:

load the psych package. Install it with this command:
install.packages("psych")

You need to run this install only once on a system. Then load it with:
library(psych)

You need to run the library command each time you start a new R session if you want to use the psych package.

Now try the command:
describe(mydata)
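For the base R summary() function mentioned above, a minimal sketch looks like this. It assumes mydata is a copy of the built-in mtcars data set, and it needs no add-on packages.

```r
# Use a built-in data set as a stand-in for your own data.
mydata <- mtcars

# On a data frame, summary() reports stats for every column.
summary(mydata)

# On a single numeric column, it returns a named vector:
# minimum, quartiles, median, mean and maximum.
mpg_summary <- summary(mydata$mpg)
print(mpg_summary)
```

Individual values can be pulled out by name, for example mpg_summary[["Median"]].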
mean(myvector, na.rm=TRUE)
?median

The function description should say whether the na.rm argument is needed to exclude missing values.

Checking a function's help files, even for simple functions, can also uncover additional useful options, such as an optional

Use the combine function to see all possible combinations from a group.

Probably most experienced R users would combine these two steps into one like this:
combn(c("Bob", "Joanne", "Sally", "Tim", "Neal"),2)
But separating the two can be more readable for beginners.

data
Maybe you don't need correlations for every column in your data frame and you just want to work with a couple of columns, not 15. Perhaps you want to see data that meets a certain condition, such as within 3 standard deviations. R lets you slice your data sets in various ways, depending on the data type.

To select just certain columns from a data frame, you can either refer to the columns by name or by their location (i.e., column 1, 2, 3, etc.).

For example, the mtcars sample data frame has these column names: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb.

Can't remember the names of all the columns in your data frame? If you just want to see the column names and nothing else, instead of functions such as str(mtcars) and head(mtcars) you can type:
names(mtcars)

That's handy if you want to store the names in a variable, perhaps called mtcars.colnames (or anything else you'd like to call it):
mtcars.colnames <- names(mtcars)

But back to the task at hand. To access only the data in the mpg column in mtcars, you can use R's dollar sign notation:
mtcars$mpg

More broadly, then, the format for accessing a column by name would be:
dataframename$columnname

That will give you a 1-dimensional vector of numbers like this:
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8
[12] 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5
[23] 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

The numbers in brackets are not part of your data, by the way. They indicate what item number each line is starting with. If you've only got one line of data, you'll just see [1]. If there's more than one line of data and only the first 11 entries can fit on the first line, your second line will start with [12], and so on.

Sometimes a vector of numbers is exactly what you want: if, for example, you want to quickly plot mtcars$mpg and don't need item labels, or you're looking for statistical info such as variance and mean.

Chances are, though, you'll want to subset your data by more than one column at a time. That's when you'll want to use bracket notation, what I think of as rows-comma-columns. Basically, you take the name of your data frame and follow it by [rows,columns]. The rows you want come first, followed by a comma, followed by the columns you want. So, if you want all rows but just columns 2 through 4 of mtcars, you can use:
mtcars[,2:4]

Do you see that comma before the 2:4? That's leaving a blank space where the "which rows do you want?" portion of the bracket notation goes, and it means "I'm not asking for any subset, so return all." Although it's not always required, it's not a bad practice to get into the habit of using a comma in bracket notation so that you remember whether you were slicing by columns or rows.

If you want multiple columns that aren't contiguous, such as columns 2 AND 4 but not 3, you can use the notation:
mtcars[,c(2,4)]

A couple of syntax notes here:

R indexes from 1, not 0. So your first column is at [1] and not [0].

R is case sensitive everywhere. mtcars$mpg is not the same as mtcars$MPG.

mtcars[,-1] will not get you the last column of a data frame, the way negative indexing works in many other languages. Instead, negative indexing in R means exclude that item. So, mtcars[,-1] will return every column except the first one.

[28] TRUE FALSE FALSE FALSE TRUE

To turn that into a listing of the data you want, use that logical test condition and row-comma-column bracket notation. Remember that this time you want to select rows by condition, not columns. This:
mtcars[mtcars$mpg>20,]
tells R to get all rows from mtcars where mpg > 20, and then to return all the columns.

If you don't want to see all the column data for the selected rows but are just interested in displaying, say, mpg and horsepower for cars with an mpg greater than 20, you could use the notation:
mtcars[mtcars$mpg>20,c(1,4)]
using column locations, or:
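One way to write the same selection with column names rather than positions is sketched below. The variable names (by_name, by_position, no_mpg) are illustrative assumptions; the data set is the built-in mtcars.

```r
# Rows where mpg > 20; columns picked by name.
by_name <- mtcars[mtcars$mpg > 20, c("mpg", "hp")]

# The same rows; columns picked by position (mpg is column 1, hp is column 4).
by_position <- mtcars[mtcars$mpg > 20, c(1, 4)]

print(by_name)

# Both approaches return the same data frame.
same <- identical(by_name, by_position)
print(same)

# Negative indexes drop columns: everything except the first one.
no_mpg <- mtcars[, -1]
```

Selecting by name tends to be more robust than selecting by position, because the code keeps working if the column order changes.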
are not making any changes to the data that need to be saved, you can attach and detach a copy of the data set temporarily.

The attach() function works like this:
attach(mtcars)
mpg20 <- mtcars$mpg > 20

You can leave out the data set reference and type this instead:
mpg20 <- mpg > 20

After using attach() remember to use the detach function when you're finished:
detach()

Some R users advise avoiding attach() because it can be easy to forget to detach(). If you don't detach() the copy, your variables could end up referencing the wrong data set.

Alternative to bracket

So, in the mtcars example, to find all rows where mpg is greater than 20 and return only those rows with their mpg and hp data, the subset() statement would look like:
subset(mtcars, mpg>20, c("mpg", "hp"))

What if you wanted to find the row with the highest mpg?
subset(mtcars, mpg==max(mpg))

If you just wanted to see the mpg information for the highest mpg:

If you just want to use subset to extract some columns and display all rows, you can either leave the row conditional spot blank with a comma, similar to bracket notation:
subset(mtcars, , c("mpg", "hp"))

Or, indicate your second argument is for columns with select= like this:
subset(mtcars, select=c("mpg", "hp"))

Update: The dplyr package, released in early 2014, is aimed at making manipulation of data frames faster and more rational, with similar syntax for a variety of tasks. To select certain rows based on specific logical criteria, you'd use the filter() function with the syntax filter(dataframename, logi-

You can also combine filter and subset with the dplyr %>% "chaining" operation that allows you to string together multiple commands on a data frame. The chaining syntax in general is:
dataframename %>%
firstfunction(argument
table(diamonds$cut)
table(diamonds$cut, diamonds$color)
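The diamonds data set used above comes with ggplot2. As a sketch that needs no add-on packages, the same two forms of table() can be tried on the built-in mtcars data; the variable names are illustrative assumptions.

```r
# One-way frequency table: how many cars have 4, 6 or 8 cylinders?
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Two-way cross-tabulation: cylinders (rows) by number of gears (columns).
cyl_by_gear <- table(mtcars$cyl, mtcars$gear)
print(cyl_by_gear)
```

The labels in a table are stored as character names, so a single count can be pulled out with, for example, cyl_counts[["4"]].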
plot(mtcars$disp, mtcars$mpg)
Using ggplot2
In particular, the ggplot2 package is quite popular and worth a look for robust visualizations. ggplot2 requires a bit of time to learn its "Grammar of Graphics" approach. But once you've got that down, you have a tool to create many different types of visualizations using the same basic structure.

If ggplot2 isn't installed on your system yet, install it with the command:
install.packages("ggplot2")

You only need to do this once.

To use its functions, load the ggplot2 package into your current R session (you only need to do this once per R session) with the library() function:
library(ggplot2)

Onto some ggplot2 examples.

ggplot2 has a "quick plot" function called qplot() that is similar to R's basic plot() function but adds some options. The basic quick plot code:
generates a scatterplot.

A scatterplot from ggplot2 using the qplot() function.

The qplot default starts the y axis at a value that makes sense to R. However, you might want your y axis to start at 0 so you can better see whether changes are truly meaningful (starting a graph's y axis at your first value instead of 0 can sometimes exaggerate changes).

Use the ylim argument to manually set your lower and upper y axis limits:
qplot(disp, mpg, ylim=c(0,35), data=mtcars)

Bonus intermediate tip: Sometimes on a scatterplot you may not be sure if a point represents just one observation or multiple ones, especially if you've got data points that repeat, such as in this example that ggplot2 creator Hadley Wickham generated with the command:
The code structure for a basic graph with ggplot() is a bit more complicated than in either plot() or qplot(); it goes as follows:
ggplot(mtcars, aes(x=disp, y=mpg)) + geom_point()

The first argument in the ggplot() function, mtcars, is fairly easy to understand: that's

It may be a little confusing here since both the data set and one of its columns are called the same thing: pressure. That first "pressure" represents the name of the data frame; the second, "y=pressure," represents the column named pressure.

In these examples, I set only x and y aesthetics. But there are lots more aesthetics we could add, such as color, axes and more.
barplot(BOD$demand)
barplot(BOD$demand, main="Graph of demand")
Creating a bar plot.
 4  6  8
11  7 14

Now you can create a bar graph of the cylinder count:
barplot(cylcount)

ggplot2's qplot() quick plotting function can also create bar graphs:

Histograms
Histograms work pretty much the same, except you want to specify how many buckets or bins you want your data to be separated into. For base R graphics, use:
hist(mydata$columnName, breaks = n)
qplot(columnName, data=mydata, binwidth=n)
ggplot(mydata, aes(x=columnName)) + geom_histogram(binwidth=n)
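To see what the breaks argument does in base R, here is a sketch using the mpg column of the built-in mtcars data. The choice of column and of breaks = 5 are assumptions for illustration.

```r
# plot = FALSE returns the bin calculations without drawing anything,
# which makes it easy to inspect what hist() decided.
h <- hist(mtcars$mpg, breaks = 5, plot = FALSE)

print(h$breaks)  # the bin edges R chose
print(h$counts)  # how many observations fall in each bin

# Drop plot = FALSE to draw the histogram itself:
# hist(mtcars$mpg, breaks = 5)
```

When breaks is a single number, base R treats it as a suggestion and picks "pretty" bin edges near that count, so the number of bars may not match n exactly.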
boxplot(diamonds$x, diamonds$y, diamonds$z)

Using color
Looking at nothing but black and white graphics can get tiresome after a while. Of course, there are numerous ways of using color in R.

Colors in R have both names and numbers as well as the usual RGB hex code, HSV (hue, saturation and value) specs and others. And when I say "names," I don't mean just the usual "red," "green," "blue," "black" and "white." R has 657 named colors. The colors() or colours() function (R does not discriminate against either American or British English) gives you a list of all of them. If you want to see what they look like, not just their text names, you can get a full, multi-page PDF chart with color numbers, color names and swatches, sorted in various ways. Or you can find just the names and color swatches for each.

There are also R functions that automatically generate a vector of n colors using a specific color palette such as "rainbow" or "heat":
rainbow(n)
heat.colors(n)
terrain.colors(n)
topo.colors(n)
cm.colors(n)

So, if you want five colors from the rainbow palette, use:
rainbow(5)

Using three colors in the R rainbow palette.

Now that you've got a list of colors, how do you get them in your graphic? Here's one way. Say you're drawing a 3-bar barchart using ggplot() and want to use 3 colors from the rainbow palette. You can create a 3-color vector like:
mycolors <- rainbow(3)

Or for the heat.colors palette:
mycolors <- heat.colors(3)

Now instead of using the geom_bar() function without any arguments, add fill=mycolors to geom_bar() like this:
ggplot(mtcars, aes(x=factor(cyl))) + geom_bar(fill=mycolors)

You don't need to put your list of colors in a separate variable, by the way; you can merge it all in a single line of code such as:
ggplot(mtcars, aes(x=factor(cyl))) + geom_bar(fill=rainbow(3))

But it may be easier to separate the colors out if you want to create your own list of colors instead of using one of the defaults.
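The color functions above can be explored directly in the console. In this sketch, the custom vector of color names is an illustrative assumption; any names returned by colors() would work.

```r
# How many named colors does R know about?
n_named <- length(colors())
print(n_named)

# Palette functions return a character vector of hex color codes.
mycolors <- rainbow(3)
print(mycolors)

# A hand-picked palette using built-in color names (illustrative choices).
custom <- c("royalblue3", "tomato", "seagreen")

# Checking names against colors() catches typos before you plot.
all_known <- all(custom %in% colors())
print(all_known)
```

A vector like custom can be passed anywhere a palette vector such as rainbow(3) is accepted.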
The basic R plotting functions can also accept a vector of colors, such as:
barplot(BOD$demand, col=rainbow(6))

You can use a single color if you want all the items to be one color (but not monochrome), such as
barplot(BOD$demand, col="royalblue3")

entry in testscores is greater than or equal to 80, add "blue" to the testcolors vector; otherwise add "red" to the testcolors vector.

Now that you've got the list of colors properly assigned to your list of scores, just add the testcolors vector as your desired color scheme:
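The coloring rule described above (blue for scores of 80 or higher, red otherwise) can be sketched with the vectorized ifelse() function. Using ifelse() is an assumption on my part; the original may have built the vector another way, such as with a loop.

```r
testscores <- c(96, 71, 85, 92, 82, 78, 72, 81, 68, 61, 78, 86, 90)

# ifelse() tests every score and picks a color for each one.
testcolors <- ifelse(testscores >= 80, "blue", "red")
print(testcolors)

# The color vector lines up with the scores one-to-one,
# so it can be passed straight to a plotting function:
# barplot(testscores, col = testcolors)
```

Because testcolors has one entry per score, each bar takes the color that matches its own value.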
testscores <- sort(c(96, 71, 85, 92, 82, 78, 72, 81, 68, 61, 78, 86, 90), decreasing = TRUE)

The sort() function defaults to ascending sort; for descending sort you need the additional argument: decreasing = TRUE.

If that code above is starting to seem unwieldy to you as a beginner, break it into two lines for easier reading, and perhaps also set a new variable for the sorted version:
testscores <- c(96, 71, 85, 92, 82, 78, 72, 81, 68, 61, 78, 86, 90)
testscores_sorted <- sort(testscores, decreasing = TRUE)

If you had scores in a data frame called results with one column of student names called students and another column of scores called testscores, you could use the ggplot2 package's ggplot() function as well:
ggplot(results, aes(x=students, y=testscores)) + geom_bar(fill=testcolors, stat = "identity")

Why stat = "identity"? That's needed here to show that the y axis represents a numerical value as opposed to an item count.

ggplot2's qplot() also has easy ways to color bars by a factor, such as number of cylinders, and then automatically generate a legend. Here's an example of a graph counting the number of 4-, 6- and 8-cylinder cars in the mtcars data set:
qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(cyl))

But, as I said, we're getting somewhat beyond a beginner's overview of R when coloring by factor. For a few more examples and details for many of the themes covered here, you might want to see the online tutorial Producing Simple Graphs with R. For more on graphing with color, check out a source such as the R Graphics Cookbook. The ggplot2 documentation also has a lot of examples, such as this page for bar geometry.

Exporting your graphics
You can save your R graphics to a file for use outside the R environment. RStudio has an export option in the plots tab of the bottom right window.
jpeg("myplot.jpg", width=350, height=420)
barplot(BOD$demand, col=rainbow(6))
dev.off()
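The same open-device, plot, close-device pattern works for other file types. As a sketch, here it is with pdf(), writing to a temporary folder; the file name and dimensions are illustrative assumptions. Note that pdf() sizes are given in inches rather than pixels.

```r
# Build a path in R's temporary folder so the example leaves no clutter.
out_file <- file.path(tempdir(), "myplot.pdf")

pdf(out_file, width = 5, height = 6)    # open a PDF graphics device
barplot(BOD$demand, col = rainbow(6))   # draw into that device
dev.off()                               # close it so the file is written

print(file.exists(out_file))
```

Forgetting dev.off() is a common slip: until the device is closed, the file may be empty or incomplete.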
it out if you're referring to consecutive values in a range with a colon between minimum and maximum, like this:
my_vector <- (1:10)

I bring up this exception because I've run into that style quite a bit in R tutorials and texts, and it can be confusing to see the c required for some multiple values but not others. Note that it won't hurt anything to use the c with a colon-separated range, though, even if it's not required, such as:
my_vector <- c(1:10)

One more very important point about the c() function: It assumes that everything in your vector is of the same data type, that is, all numbers or all characters. If you create a vector such as:
my_vector <- c(1, 4, "hello", TRUE)

You will not have a vector with two numeric objects, one character object and one logical object. Instead, c() will do what it can to convert them all into the same object type, in this case all character objects. So my_vector will contain "1", "4", "hello" and "TRUE". In other words, c() is also for "convert" or "coerce."

To create a collection with multiple object types, you need a list, not a vector. You create a list with the list() function, not c(), such as:
My_list <- list(1,4,"hello", TRUE)

Now you've got a variable that holds the number 1, the number 4, the character object "hello" and the logical object TRUE.

Loopless loops
Iterating through a collection of data with loops like "for" and "while" is a cornerstone of many programming languages. That's not the R way, though. While R does have for, while and repeat loops, you'll more likely see operations applied to a data collection using apply() functions or by using the plyr add-on package functions.

But first, some basics.

If you've got a vector of numbers such as:
my_vector <- c(7,9,23,5)
and, say, you want to multiply each by 0.01 to turn them into percentages, how would you do that? You don't need a for, foreach or while loop. Instead, you can create a new vector called my_pct_vector like this:
my_pct_vector <- my_vector * 0.01

Performing a mathematical operation on a vector variable will automatically loop through each item in the vector.

Typically in data analysis, though, you want to apply functions to subsets of data: Finding the mean salary by job title or the standard deviation of property values by community. The apply() function group and plyr add-on package are designed for that.

There are more than half a dozen functions in the apply family, depending on what type of data object is being acted upon and what sort of data object is returned. "These functions can sometimes be frustratingly difficult to get working exactly as you intended, especially for newcomers to R," says a blog post at Revolution Analytics, which focuses on enterprise-class R.

Plain old apply() runs a function on either every row or every column of a 2-dimensional matrix where all columns are the same data type. For a 2-D matrix, you also need to tell the function whether you're applying by rows or by columns: Add the argument 1 to apply by row or 2 to apply by column. For example:
apply(my_matrix, 1, median) then, yes, youve got to know the ins and
outs of data types. But my assumption is
returns the median of every row in my_
that youre here to try generating quick
matrix and
plots and stats before diving in to create
apply(my_matrix, 2, median) complex code.
calculates the median of every column. So, to start off with the basics, heres what
Id suggest you keep in mind for now: R
Other functions in the apply() family such
has multiple data types. Some of them
as lapply() or tapply() deal with different
are especially important when doing basic
input/output data types. Australian statisti-
data work. And some functions that are
cal bioinformatician Neal F.W. Saunders
quite useful for doing your basic data work
has a nice brief introduction to apply in R
require your data to be in a particular type
in a blog post if youd like to find out more
and structure.
and see some examples. (In case youre
wondering, bioinformatics involves issues More specifically, R has the Is it an inte-
around storing, retrieving and organizing ger or character or true/false? data type,
biological data, not just analyzing it.) the basic building blocks. R has several of
these including integer, numeric, charac-
Many R users who dislike the the apply
ter and logical. Missing values are repre-
functions dont turn to for-loops, but
sented by NaN (if a mathematical function
instead install the plyr package created by
wont work properly) or NA (missing or
Hadley Wickham. He uses what he calls
unavailable).
the split-apply-combine model of dealing
with data: Split up a collection of data the As mentioned in the prior section, you can
way you want to operate on it, apply what- have a vector with multiple elements of the
ever function you want to each of your data same type, such as:
group(s) and then combine them all back
1, 5, 7
together again.
or
The plyr package is probably a step beyond
this basic beginners guide; but if youd like Bill, Bob, Sue
to find out more about plyr, you can head to
>
Wickhams plyr website. Theres also a use-
ful slide presentation on plyr in PDF format A single number or character string is also
from Cosma Shalizi, an associate professor a vector a vector of 1. When you access
of statistics at Carnegie Mellon University, the value of a variable thats got just one
and Vincent Vu. Another PDF presentation value, such as 73 or Learn more about R at
on plyr is from an introduction to R work- Computerworld.com, youll also see this in
shop at Iowa State University. your console before the value:
[1]
R data types in brief (very Thats telling you that your screen print-
brief) out is starting at vector item number one.
If youve got a vector with lots of values
Should you learn about all of Rs data types
so the printout runs across multiple lines,
and how they behave right off the bat, as
each line will start with a number in brack-
a beginner? If your goal is to be an R ninja
28
R: a beginners guide
COMPUTERWORLD.COM
ets, telling you which vector item number R also has special vector and list types
that particular line is starting with. (See the that are of special interest when analyzing
screen shot, below.) data, such as matrices and data frames. A
matrix has rows and columns; you can find
a matrix dimension with dim() such as
dim(my_matrix)
class(3L)
class(as.integer(3))
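The surrounding explanation for these two calls did not survive in this copy, so treat the following as a hedged reading of what they show: R stores a plain number such as 3 as a general numeric value unless you ask for an integer, either with the L suffix or with as.integer(). The variable names are invented for the sketch.

```r
plain   <- 3              # no suffix: stored as a double
suffix  <- 3L             # L suffix: stored as an integer
convert <- as.integer(3)  # explicit conversion to integer

class(plain)    # "numeric"
class(suffix)   # "integer"
class(convert)  # "integer"

# The two integer forms are interchangeable.
identical(suffix, convert)  # TRUE
```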
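Circling back to the split-apply-combine idea described earlier: the sketch below walks through the three steps using only base R functions, not plyr itself, so it is an illustration of the model rather than of Wickham's package. Grouping the built-in mtcars data by cylinder count and averaging mpg is an arbitrary choice made for the example.

```r
# Split: break mtcars into one data frame per cylinder count.
pieces <- split(mtcars, mtcars$cyl)

# Apply: compute the average mpg within each piece.
means <- lapply(pieces, function(d) {
  data.frame(cyl = d$cyl[1], mean_mpg = mean(d$mpg))
})

# Combine: stack the per-group results back into one data frame.
result <- do.call(rbind, means)
print(result)
```

The same pattern works for any grouping column and any summary function; only the function passed to lapply() changes.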
You can.

This will find all rows in the mtcars sample data frame that have an mpg greater than 20, ordered from highest to lowest mpg.

Most R experts will discourage newbies from "cheating" this way: Falling back on SQL makes it less likely you'll power through learning R syntax. However, it's there for you in a pinch, or as a useful way to double-check whether you're getting back the expected results from an R expression.

Examine and edit data with a GUI

This can be useful if you've got a data set with a lot of columns that are wrapping in the small command-line window. However, since there's no way to save your work as you go along (changes are saved only when you close the editing window) and there's no command-history record of what you've done, the edit window probably isn't your best choice for editing data in a project where it's important to repeat/reproduce your work.

In RStudio you can also examine a data object (although not edit it) by clicking on it in the workspace tab in the upper right window.
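For the mtcars query described earlier (rows with mpg above 20, sorted from highest to lowest), the SQL statement itself is not reproduced in this copy. As a cross-check, here is one plain-R way to get the same rows; the name high_mpg and the columns picked for display are just for illustration.

```r
# Keep only the rows with mpg above 20 ...
high_mpg <- mtcars[mtcars$mpg > 20, ]

# ... then sort them from highest to lowest mpg.
high_mpg <- high_mpg[order(high_mpg$mpg, decreasing = TRUE), ]

# Show a few columns of the top rows.
head(high_mpg[, c("mpg", "cyl", "hp")])
```

If a SQL-style query and an expression like this return different rows, that is a good sign one of the two has a mistake worth tracking down.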
write.table(myData, "testfile.txt", sep="\t")
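Here is a small round-trip sketch built around that call. The contents of myData are invented sample values, the file goes to a temporary location rather than testfile.txt so nothing is left behind, and row.names = FALSE is an extra argument added here (an assumption, not part of the original call) to keep R's row labels out of the file.

```r
# Hypothetical sample data standing in for myData.
myData <- data.frame(name = c("Bill", "Bob", "Sue"),
                     score = c(1, 5, 7))

# Write a tab-delimited file to a temporary location.
outfile <- tempfile(fileext = ".txt")
write.table(myData, outfile, sep = "\t", row.names = FALSE)

# Read the file back in to confirm what was saved.
roundtrip <- read.table(outfile, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)
print(roundtrip)

# Clean up the temporary file.
unlink(outfile)
```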
60+ R resources to
improve your data skills
This list was originally published as part of the Computerworld Beginner's Guide to R but has since been expanded to also include resources for advanced beginner and intermediate users.

These websites, videos, blogs, social media/communities, software and books/ebooks can help you do more with R; my favorites are listed in bold.

Books and e-books

R Cookbook. Like the rest of the O'Reilly Cookbook series, this one offers how-to "recipes" for doing lots of different tasks, from the basics of R installation and creating simple data objects to generating probabilities, graphics and linear regressions. It has the added bonus of being well written. If you like learning by example or are seeking a good R reference book, this is well worth adding to your reference library. By Paul Teetor, a quantitative developer working in the financial sector.

R Graphics Cookbook. If you want to do beyond-the-basics graphics in R, this is a useful resource both for its graphics recipes and brief introduction to ggplot2. While this goes way beyond the graphics capabilities that I need in R, I'd recommend this if you're looking to move beyond advanced-beginner plotting. By Winston Chang, a software engineer at RStudio.

R in Action: Data analysis and graphics with R. This book aims at all levels of users, with sections for beginning, intermediate and advanced R ranging from Exploring R data structures to running regressions and conducting factor analyses. The beginner's section may be a bit tough to follow if you haven't had any exposure to R, but it offers a good foundation in data types, imports and reshaping once you've had a bit of experience. There are some particularly useful explanations and examples for aggregating, restructuring and subsetting data, as well as a lot of applied statistics. Note that if your interest in graphics is learning ggplot2, there's relatively little on that here compared with base R graphics and the lattice package. You can see an excerpt from the book online: Aggregation and restructuring data. By Robert I. Kabacoff.

The Art of R Programming. For those who want to move beyond using R "in an ad hoc way ... to develop[ing] software in R." This is best if you're already at least moderately proficient in another programming language. It's a good resource for systematically learning fundamentals such as types of objects, control statements (unlike many R purists, the author doesn't actively discourage for loops), variable scope, classes and debugging; in fact, there's nearly as large a chapter on debugging as there is on graphics. With some robust examples of solving real-world statistical problems in R. By Norman Matloff.

R in a Nutshell. A reasonably readable guide to R that teaches the language's fundamentals (syntax, functions, data structures and so on) as well as how-to statistical and graphics tasks. Useful if you want to start writing robust R programs, as it includes sections on functions, object-oriented programming and high-performance R. By Joseph Adler, a senior data scientist at LinkedIn.

Visualize This. Note: Most of this book is not about R, but there are several examples of visualizing data with R. And there's so much other interesting info here about how to tell stories with data that it's worth a read. By Nathan Yau, who runs the popular Flowing Data blog and whose doctoral dissertation was on personal data collection and how we can use visualization to learn about ourselves.

[...] be a downloadable PDF, but now the only versions are for OS X or iOS.

R for Everyone. Author Jared P. Lander promises to go over 20% of the functionality needed to accomplish 80% of the work. And in fact, topics that are actually covered are covered pretty well; but be warned that some items appearing in the table of contents can be a little thin. This is still a well-organized reference, though, with information that beginning and intermediate users might want to know: importing data, generating graphs, grouping and reshaping data, working with basic stats and more.
Exploring Everyday Things with R and Ruby. This book oddly goes from a couple of basic introductory chapters to some fairly robust, beyond-beginner programming examples; for those who are just starting to code, much of the book may be tough to follow at the outset. However, the intro to R is one of the better ones I've read, including a lot of language fundamentals and basics of graphing with ggplot2. Plus experienced programmers can see how author Sau Sheong Chang splits up tasks between a general language like Ruby and the statistics-focused R.

Online references

4 data wrangling tasks in R for advanced beginners. This follow-up to our Beginner's Guide outlines how to do several specific data tasks in R: add columns to an existing data frame, get summaries, sort results and reshape data. With sample code and explanations.

Data manipulation tricks: Even better in R. From working with dates to reshaping data to if-then-else statements, see how to perform common data munging tasks. You can also download these R tips & tricks as a PDF (free Insider registration required).

Cookbook for R. Not to be confused with the R Cookbook book mentioned above, this website by software engineer Winston Chang (author of the R Graphics Cookbook) offers how-tos for tasks such as data input and output, statistical analysis and creating graphs. It's got a similar format to an O'Reilly Cookbook; and while not as complete, can be helpful for answering some "How do I do that?" questions.

Quick-R. This site has a fair amount of samples and brief explanations grouped by major category and then specific items. For example, you'd head to "Stats" and then "Frequencies and crosstabs" to get an explainer of the table() function. This ranges from basics (including useful how-tos for customizing R startup) through beyond-beginner statistics (matrix algebra, anyone?) and graphics. By Robert I. Kabacoff, author of R in Action.

R Reference Card. If you want help remembering function names and formats for various tasks, this 4-page PDF is quite useful despite its age (2004) and the fact that a link to what's supposed to be the latest version no longer works. By Tom Short, an engineer at the Electric Power Research Institute.

A short list of the most useful R commands. Commands grouped by function such as input, "moving around" and "statistics and transformations." This offers minimal explanations, but there's also a link to a longer guide to Using R for psychological research. HTML format makes it easy to cut and paste commands. Also somewhat old, from 2005. By William Revelle, psychology professor at Northwestern University.

R Graph Catalog. Lots of graph and other plot examples, easily searchable and each with downloadable code. All are made with ggplot2 based on visualization ideas in Creating More Effective Graphs. Maintained by Joanna Zhao and Jennifer Bryan.

Beautiful Plotting in R: A ggplot2 Cheatsheet. Easy to read with a lot of useful information, from starting with default plots to customizing title, axes, legends; creating multi-panel plots and more. By Zev Ross.

Frequently Asked Questions about R. Some basics about reading, writing, sorting and shaping data as well as a lineup of how to do various statistical operations and a few specialized graphics such as spaghetti plots. From UCLA's Institute for Digital Research and Education.

R Reference Card for Data Mining. Includes examples and other documentation, including a substantial portion of his book R and Data Mining published by Elsevier in 2012. By Yanchang Zhao.

Videos

Twotorials. You'll either enjoy these snappy 2-minute "twotorial" videos or find them, oh, corny or over the top. I think they're both informative and fun, a welcome antidote to the typically dry how-tos you often find in statistical programming. Analyst Anthony Damico takes on R in 2-minute chunks, from "how to create a variable with R" to "how to plot residuals from a regression in R"; he also tackles an occasional problem such as how to calculate your ten, fifteen, or twenty thousandth day on earth with R. I'd strongly recommend giving this a look if textbook-style instruction leaves you cold.

This video in the Google Developers R series introduces functions in R.

Up and Running with R. This lynda.com video class covers the basics of topics such as using the R environment, reading in data, creating charts and calculating statistics. The curriculum is limited, but presenter Barton Poulson tries to explain what he's doing and why, not simply run commands. He also has a more in-depth 6-hour class, R Statistics Essential Training. Lynda.com is a subscription service that starts at $25/month, but several of the videos are available free for you to view and see if you like the instruction style, and there's a 7-day free trial available.

[...] Johns Hopkins, posted his lecture videos on YouTube, and Revolution Analytics collected links to them all by week.
[...] and exploratory data analysis including data.table. Videos by Princeton Ph.D. student David Robinson and Neo Christopher Chung, Ph.D., filmed and edited at the Princeton Broadcast Center.

Other online introductions

[...] are still many sections that focus specifically on R.

R Tutorial. A reasonably robust beginning guide that includes sections on data types, probability and plots as well as sections focused on statistical topics such as linear regression, confidence intervals and p-values [...]
How to turn CSV data into interactive visualizations with R and rCharts. 9-page slideshow gives step-by-step instructions on various options for generating interactive graphics. The charts and graphs use jQuery libraries as the underlying technology but only a couple of lines of R code are needed. By Sharon Machlis, Computerworld.

Higher Order Functions in R. If you're at the point where you want to apply functions on multiple vectors and data frames, you may start bumping up against the limits of R's apply family. This post goes over 6 extremely useful base R functions with readable explanations and helpful examples. By John Mules White, soon-to-be scientist at Facebook.

Introduction to Linear Regression Using R and Quandl. While this does indeed promote Quandl as your data source, that data is free, and for those interested in using R for regressions, you'll find several detailed walk-throughs from data import through statistical analysis.

Introduction to dplyr. The dplyr package (by ggplot2 creator Hadley Wickham) significantly speeds up operations like grouping and sorting of data frames. It also aims to rationalize such functions by using a common syntax. In this short introductory vignette, you'll learn about five basic data manipulation functions (filter(), arrange(), select(), mutate() and summarise()) including examples, as well as how to chain them together for more streamlined, readable code. Another useful package for manipulating data in R: doBy.

Applied Time Series Analysis. Text-based online class from Penn State to learn and apply statistical methods for the analysis of data that have been observed over time. Access to the articles is free, although there is no community or instructor participation.

13 resources for time series analysis. A video and 12 slide presentations by Rob J. Hyndman, author of Forecasting time series using R. Also has links to exercises and answers to the exercises.

knitr in a knutshell. knitr is designed to easily create reports and other documents that can combine text, R code and the results of R code; in short, a way to share your R analyses with others. This minimal tutorial by Karl Broman goes over subjects such as creating Markdown documents and adding graphics and tables, along with links to resources for more info.

More free downloads and websites from academia:

Introducing R. Slide presentation from the UCLA Institute for Digital Research and Education, with downloadable data and code.

Introducing R. Although titled for beginners and including sections on getting started and reading data, this also shows how to use R for various types of linear models. By German Rodriguez at Princeton University's Office of Population Research.

R: A self-learn tutorial. Intro PDF from National Center for Ecological Analysis and Synthesis at UC Santa Barbara. While a bit dry, it goes over a lot of fundamentals and includes exercises.

Statistics with R Computing and Graphics. Unlike many PDF downloads from academia, this one is both short (15 pages) and basic, with some suggested informal exercises as well as explanations on things like getting data into R and statistical modeling (understanding statistical concepts like linear modeling is assumed). By Kjell Konis, then at the University of Oxford.
Twitter #rstats hashtag. Level of discourse here ranges from beginner to extremely advanced, with a lot of useful R resources and commentary getting posted.

You can also find R groups on LinkedIn, Reddit and Facebook, among other platforms.

Stack Overflow has a very active R community where people ask and answer coding questions. If you've got a specific coding challenge, it's definitely worth searching here to see if someone else has already asked about something similar.

There are dozens of R User Meetups worldwide. In addition, there are other user groups not connected with Meetup.com. Revolution Analytics has an R User Group Directory.

Post: R programming for those coming from other languages. If you're an experienced programmer trying to learn R, you'll probably find some useful tips here.

Post: A brief introduction to "apply" in R. If you want to learn how the apply() function family works, this is a good primer.

Translating between R and SQL. If you're more experienced (and comfortable) with SQL than R, it can be frustrating and confusing at times to figure out how to do basic data tasks such as subsetting your data. Statistics consultant Patrick Burns shows how to do common data slicing in both SQL and R, making it easier for experienced database users to add R to their toolkit.

Graphs & Charts in base R, ggplot2 and rCharts. There are lots of sample charts with code here, showing how to do similar visualization tasks with basic R, the ggplot2 add-on package and rCharts for interactive HTML visualizations.

When to use Excel, when to use R? For spreadsheet users starting to learn R, this is a useful question to consider. Michael Milton, author of Head First Data Analysis (which discusses both Excel and R), offers practical (and short) advice on when to use each.

A First Step Towards R From Spreadsheets. Some advice on both when and how to start moving from Excel to R, with a link to a follow-up post, From spreadsheet thinking to R thinking.

Using dates and times in R. This post from a presentation by Bonnie Dixon at the Davis R Users group goes over some of the intricacies of dates and times in R, including various date/time classes as well as different options for performing date/time calculations and other manipulations.

Scraping Pro-Football Data and Interactive Charts using rCharts, ggplot2, and shiny. This is a highly useful example of beginning-to-end data analysis with R. You'll see a sample of how to scrape data off a website, clean and restructure the data and then visualize it in several ways, including interactive Web graphics, all with downloadable code. By Vivek Patil, an associate professor at Gonzaga University.

Search

Searching for "R" on a general search engine like Google can be somewhat frustrating, given how many utterly unrelated English words include the letter r. Some search possibilities:

RSeek is a Web search engine that just returns results from certain R-focused websites.

R site search returns results just from R functions, package vignettes (documentation that helps explain how a function works) and task views (focusing on a particular field such as social science or econometrics).

Misc

Google's R Style Guide. Want to write neat code with a consistent style? You'll probably want a style guide; and Google has helpfully posted their internal R style for all to use. If that one doesn't work for you, Hadley Wickham has a fairly abbreviated R style guide based on Google's but with a few tweaks.

RStudio documentation. If you're using RStudio, it's worth taking a look at parts of the documentation at some point so you can take advantage of all it has to offer.

History of R Financial Time Series Plotting. Although, as the name implies, this focuses on financial time-series graphics, it's also a useful look at various options for plotting any data over time. With lots of code samples along with graphics. By Timely Portfolio on GitHub.

Grouping & Summarizing Data in R. There are so many ways to do these tasks in R that it can be a little overwhelming even for those beyond the beginner stage to decide which to use when. This downloadable Slideshare presentation by analyst Jeffrey Breen from the Greater Boston useR Group is a useful overview.

Apps

R Instructor. This app is primarily a well-designed, very thorough index to R, offering snippets on how to import, summarize and plot data, as well as an introductory section. An "I want to..." section gives short how-tos on a variety of tasks such as changing data classes or column/row names, ordering or subsetting data and more. Similar information is available free online; the value-add is if you want the info organized in an attractive mobile app. Extras include instructional videos and a "statistical tests" section explaining when to use various tests as well as R code for each. For iOS and Android, about $5.

Software

Comprehensive R Archive Network (CRAN). The most important of all: home of the R Project for Statistical Computing, including downloading the basic R platform, FAQs and tutorials as well as thousands of add-on packages. Also features detailed documentation and a number of links to more resources.

[...] examples, Show Me Shiny offers a gallery of apps with links to code.

Swirl. This R package for interactive learning teaches basic statistics and R together. See more info on version 2.0.