Mini-course: Introduction to R
Prof. Dr. Henrique Castro
hcastro@usp.br
Faculdade de Economia, Administração e Contabilidade da Universidade de São Paulo (FEA-USP)
2013
Working directory
Get or Set Working Directory

• getwd() returns the absolute file path of the current working
directory of the R process.
• setwd(dir) sets the working directory to dir, for example
setwd("C:/Users/Henrique/Documents").
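A minimal sketch of changing the working directory and restoring it afterwards. The target tempdir() is only an assumption chosen because it exists on every system; in practice you would pass your own project folder.

```r
# Remember where we started so the change can be undone
old_wd <- getwd()

# tempdir() is an illustrative target directory, not part of the course material
setwd(tempdir())
print(getwd())

# Restore the original working directory
setwd(old_wd)
```

Saving the old directory first makes the change easy to undo in scripts that other people run.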
Installing and loading R packages
Install packages
install.packages("foreign")
Loading packages
library(foreign)
Importing data
• Importing data into R is fairly simple.

• For Stata and Systat, use the foreign package.
• For SPSS and SAS I would recommend the Hmisc package for
ease and functionality.
From a Comma Delimited Text File
# first row contains variable names, comma is separator
# assign the variable id to row names
mydata <- read.table("mydata.csv", header=TRUE, sep=",", row.names="id")
Exercise
• Create a csv file and import.
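One possible way to work through the exercise entirely inside R is sketched below. The data frame toy and its column names are made up for illustration, and the file is written to a temporary location rather than to a real project folder.

```r
# Hypothetical toy data; the column names are invented for this example
toy <- data.frame(id = c("a", "b", "c"),
                  y  = c(1.5, 2.5, 3.5),
                  x1 = c(10, 20, 30))

# Write it to a temporary comma-delimited file
csv_file <- tempfile(fileext = ".csv")
write.csv(toy, csv_file, row.names = FALSE)

# Import it: first row holds variable names, id becomes the row names
imported <- read.table(csv_file, header = TRUE, sep = ",", row.names = "id")
print(imported)
str(imported)
```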
Importing data
From Excel
• The best way to read an Excel file is to export it to a comma
delimited file and import it using the method above.
• On Windows systems you can use the RODBC package to access
Excel files.
• The first row should contain variable/column names.
# first row contains variable names

# we will read in workSheet mysheet
library(RODBC)
channel <- odbcConnectExcel("c:/myexel.xls")
mydata <- sqlFetch(channel, "mysheet")
odbcClose(channel)
Importing data
From SPSS
# save SPSS dataset in transport format
get file='c:\mydata.sav'.
export outfile='c:\mydata.por'.
# in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors
From Stata
# input Stata file
library(foreign)
mydata <- read.dta("c:/mydata.dta")
Getting Information on a Dataset
# list objects in the working environment

ls()
# list the variables in mydata

names(mydata)
# list the structure of mydata

str(mydata)
# dimensions of an object
dim(object)
# class of an object (numeric, matrix, data frame, etc)

class(object)
# print mydata
mydata
# print first 10 rows of mydata

head(mydata, n=10)
# print last 5 rows of mydata

tail(mydata, n=5)
Selecting Elements
Identify rows, columns or elements using subscripts

> mydata[,3] # 3rd column of matrix
[1] 0.02138096 0.53024983 0.81832713 0.68361351 0.57439575 0.65139071
[7] 0.77752234 0.78719712 0.99746985 0.91793588 0.08434072 0.22034468
[13] 0.41735257 0.02295936 0.38344190 0.59126108 0.96521616 0.56327819
[19] 0.38097578 0.40297945 0.24743431 0.40990097 0.95087157 NA
[25] 0.86153218
> mydata[3,] # 3rd row of matrix

y x1 x2
3 0.327052 0.2195572 0.8183271
> mydata[2:4,1:3] # rows 2,3,4 of columns 1,2,3

y x1 x2
2 0.6111948 0.3738518 0.5302498
3 0.3270520 0.2195572 0.8183271
4 0.9560805 0.2477098 0.6836135
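The slides do not show how mydata was built, so the sketch below simulates a stand-in data frame of the same shape (an assumption, not the course data) to show what each kind of subscript returns.

```r
# Simulated stand-in for mydata (assumed shape: 25 rows, columns y, x1, x2)
set.seed(123)
sim <- data.frame(y = runif(25), x1 = runif(25), x2 = runif(25))

col3    <- sim[, 3]                 # single column: returned as a numeric vector
row3    <- sim[3, ]                 # single row: still a data frame
block   <- sim[2:4, c("y", "x1")]   # rows 2 to 4, columns selected by name
col3_df <- sim[, 3, drop = FALSE]   # drop = FALSE keeps the data frame structure

print(class(col3))
print(class(row3))
print(dim(block))
print(dim(col3_df))
```

Selecting columns by name is usually safer than by position, because the code keeps working when columns are added or reordered.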
Vectors and Data Frames
Vectors
> d <- c(1,2,3,4)
> e <- c("red", "white", "red", NA)
> f <- c(TRUE,TRUE,TRUE,FALSE)
> d
[1] 1 2 3 4
> e
[1] "red" "white" "red" NA
> f
[1] TRUE TRUE TRUE FALSE
Data Frame
> mydata2 <- data.frame(d,e,f)
> mydata2
  d     e     f
1 1   red  TRUE
2 2 white  TRUE
3 3   red  TRUE
4 4  <NA> FALSE
> names(mydata2) <- c("ID","Color","Passed") # variable names
> mydata2
  ID Color Passed
1  1   red   TRUE
2  2 white   TRUE
3  3   red   TRUE
4  4  <NA>  FALSE
Missing Data
• In R, missing values are represented by the symbol NA (not
available).
Excluding Missing Values from Analyses
• Arithmetic functions on missing values yield missing values.
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2
Function complete.cases()
> complete.cases(mydata)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[25] TRUE
Missing Data
List rows of data that have missing values

> mydata[!complete.cases(mydata),]
y x1 x2
23 0.54610883 NA 0.9508716
24 0.03563499 0.07205155 NA
Function na.omit()
> # create new dataset without missing data

> newdata <- na.omit(mydata)
> dim(newdata)
[1] 23 3
> tail(newdata)
y x1 x2
18 0.3477647 0.1576324 0.5632782
19 0.4663852 0.5410509 0.3809758
20 0.6129595 0.8376189 0.4029794
21 0.6741362 0.1673157 0.2474343
22 0.4829047 0.1548488 0.4099010
25 0.5821938 0.0698297 0.8615322
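Before dropping rows it often helps to count where the missing values are. The sketch below uses a small made-up data frame (not the course data) together with is.na(), colSums() and na.omit().

```r
x <- c(1, 2, NA, 3)
print(is.na(x))        # logical vector marking the missing entries
print(sum(is.na(x)))   # number of missing values in x

# Hypothetical data frame with one NA in each column
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
na_per_column <- colSums(is.na(df))   # missing values per variable
complete_rows <- na.omit(df)          # keeps only fully observed rows
print(na_per_column)
print(complete_rows)
```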
Data Management
Creating new variables

> mydata$sum <- mydata$x1 + mydata$x2
> mydata$mean <- (mydata$x1 + mydata$x2)/2
> head(mydata)
y x1 x2 sum mean
1 0.8122147 0.4219462 0.02138096 0.4433272 0.2216636
2 0.6111948 0.3738518 0.53024983 0.9041016 0.4520508
3 0.3270520 0.2195572 0.81832713 1.0378844 0.5189422
4 0.9560805 0.2477098 0.68361351 0.9313233 0.4656617
5 0.9247670 0.2061416 0.57439575 0.7805373 0.3902687
6 0.8935983 0.6727692 0.65139071 1.3241599 0.6620799
> tail(mydata)
y x1 x2 sum mean
20 0.61295952 0.83761895 0.4029794 1.2405984 0.6202992
21 0.67413620 0.16731569 0.2474343 0.4147500 0.2073750
22 0.48290474 0.15484882 0.4099010 0.5647498 0.2823749
23 0.54610883 NA 0.9508716 NA NA
24 0.03563499 0.07205155 NA NA NA
25 0.58219381 0.06982970 0.8615322 0.9313619 0.4656809
Data Management
Recoding variables
> mydata$x1cat<-ifelse(mydata$x1>0.5,c("high"),c("low"))
> head(mydata)
y x1 x2 sum mean x1cat
1 0.8122147 0.4219462 0.02138096 0.4433272 0.2216636 low
2 0.6111948 0.3738518 0.53024983 0.9041016 0.4520508 low
3 0.3270520 0.2195572 0.81832713 1.0378844 0.5189422 low
4 0.9560805 0.2477098 0.68361351 0.9313233 0.4656617 low
5 0.9247670 0.2061416 0.57439575 0.7805373 0.3902687 low
6 0.8935983 0.6727692 0.65139071 1.3241599 0.6620799 high
> mydata$x1cat[mydata$x1>0.5] <- "High"

> mydata$x1cat[mydata$x1<=0.5] <- "Low"
> head(mydata)
1 0.8122147 0.4219462 0.02138096 0.4433272 0.2216636 Low
2 0.6111948 0.3738518 0.53024983 0.9041016 0.4520508 Low
3 0.3270520 0.2195572 0.81832713 1.0378844 0.5189422 Low
4 0.9560805 0.2477098 0.68361351 0.9313233 0.4656617 Low
5 0.9247670 0.2061416 0.57439575 0.7805373 0.3902687 Low
6 0.8935983 0.6727692 0.65139071 1.3241599 0.6620799 High
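The same two-level recoding can also be written with cut(), which returns a factor and keeps missing values as NA. This is an alternative sketch on a made-up vector, not the approach used in the slides.

```r
# Hypothetical values, including a missing one and the boundary value 0.5
x1 <- c(0.42, 0.67, NA, 0.50, 0.91)

# Intervals are (-Inf, 0.5] and (0.5, Inf), so 0.5 itself is coded "Low"
x1cat <- cut(x1, breaks = c(-Inf, 0.5, Inf), labels = c("Low", "High"))
print(x1cat)
print(table(x1cat, useNA = "ifany"))
```

cut() scales naturally to more than two categories by adding break points and labels.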
Data Management
Merging Data
> x<-data.frame(1:10,rep(1))
> names(x)<-c("id","x1")
> head(x,2)
id x1
1 1 1
2 2 1
> y<-data.frame(1:10,rep(2))
> names(y)<-c("id","y1")
> head(y,2)
id y1
1 1 2
2 2 2
> dataxy<-merge(x,y,by="id")
> head(dataxy)
id x1 y1
1 1 1 2
2 2 1 2
3 3 1 2
4 4 1 2
5 5 1 2
6 6 1 2
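In the example above every id appears in both data frames. When the keys only partly overlap, the all, all.x and all.y arguments of merge() decide which rows survive. The data frames below are made up for illustration.

```r
# Hypothetical inputs whose id values only partly overlap
left  <- data.frame(id = 1:4, x1 = c(10, 20, 30, 40))
right <- data.frame(id = c(2, 4, 5), y1 = c("b", "d", "e"))

inner     <- merge(left, right, by = "id")                # ids present in both
outer     <- merge(left, right, by = "id", all = TRUE)    # ids present in either
left_join <- merge(left, right, by = "id", all.x = TRUE)  # every id from left

print(inner)
print(outer)
print(left_join)
```

Rows without a match are filled with NA in the columns that come from the other data frame.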
Data Management
Reshaping Data
> d1 <- data.frame(subject = c("1", "2"),
+ x0 = c("male", "female"),
+ x1_2000 = 1:2,
+ x1_2005 = 5:6,
+ x2_2000 = 1:2,
+ x2_2005 = 5:6
+ )
> d1
subject x0 x1_2000 x1_2005 x2_2000 x2_2005
1 1 male 1 5 1 5
2 2 female 2 6 2 6
> reshape(d1, dir = "long", varying = 3:6, sep = "_")

subject x0 time x1 x2 id
1.2000 1 male 2000 1 1 1
2.2000 2 female 2000 2 2 2
1.2005 1 male 2005 5 5 1
2.2005 2 female 2005 6 6 2
Subsetting Data
Selecting (Keeping) Variables

> # select variables y, x1
> myvars <- c("y", "x1")
> newdata <- mydata[myvars]
> head(newdata)
y x1
1 0.8122147 0.4219462
2 0.6111948 0.3738518
3 0.3270520 0.2195572
4 0.9560805 0.2477098
5 0.9247670 0.2061416
6 0.8935983 0.6727692
Subsetting Data
Excluding (DROPPING) Variables

> newdata <- mydata[c(-3,-5)]
> head(newdata)
y x1 sum x1cat
1 0.8122147 0.4219462 0.4433272 Low
2 0.6111948 0.3738518 0.9041016 Low
3 0.3270520 0.2195572 1.0378844 Low
4 0.9560805 0.2477098 0.9313233 Low
5 0.9247670 0.2061416 0.7805373 Low
6 0.8935983 0.6727692 1.3241599 High
Subsetting Data
Selecting Observations
# first 5 observations
> newdata <- mydata[1:5,]
> newdata
1 0.8122147 0.4219462 0.02138096 0.4433272 0.2216636 Low
2 0.6111948 0.3738518 0.53024983 0.9041016 0.4520508 Low
3 0.3270520 0.2195572 0.81832713 1.0378844 0.5189422 Low
4 0.9560805 0.2477098 0.68361351 0.9313233 0.4656617 Low
5 0.9247670 0.2061416 0.57439575 0.7805373 0.3902687 Low
# based on variable values

> newdata <- mydata[which(mydata$x1cat=="Low" & mydata$sum > 1), ]
> newdata
3 0.3270520 0.2195572 0.8183271 1.037884 0.5189422 Low
17 0.1994746 0.1669037 0.9652162 1.132120 0.5660599 Low
Subsetting Data
Selection using the Subset Function

> newdata <- subset(mydata, x1>=.6 | x2<.4, select=c(y:x2))
> newdata
y x1 x2
1 0.81221468 0.4219462 0.02138096
6 0.89359831 0.6727692 0.65139071
7 0.59623223 0.7039585 0.77752234
9 0.05417394 0.6776613 0.99746985
11 0.79439514 0.4500744 0.08434072
12 0.21072273 0.3740807 0.22034468
14 0.02172214 0.2773447 0.02295936
15 0.14615300 0.7039209 0.38344190
16 0.89172957 0.8183909 0.59126108
19 0.46638519 0.5410509 0.38097578
20 0.61295952 0.8376189 0.40297945
21 0.67413620 0.1673157 0.24743431
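subset() silently drops rows where the condition evaluates to NA, which is also what which() does in bracket indexing. The sketch below, on a made-up data frame, compares the two and shows what happens when which() is left out.

```r
# Hypothetical data; row 3 has a missing x1 and an x2 that fails the test
df <- data.frame(y  = 1:5,
                 x1 = c(0.7, 0.2, NA, 0.9, 0.1),
                 x2 = c(0.5, 0.3, 0.8, 0.9, 0.8))

via_subset <- subset(df, x1 >= 0.6 | x2 < 0.4, select = y:x2)
via_which  <- df[which(df$x1 >= 0.6 | df$x2 < 0.4), c("y", "x1", "x2")]

# Without which(), the NA condition produces an extra row filled with NA
naive <- df[df$x1 >= 0.6 | df$x2 < 0.4, ]

print(via_subset)
print(via_which)
print(naive)
```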
Subsetting Data
Random Samples
> set.seed(1)
> x<-1:25
> sample(x, 7, replace = F)
[1] 7 9 14 20 5 18 22
> set.seed(1)
> mysample <- mydata[sample(1:nrow(mydata), 7, replace=FALSE),]
> mysample
7 0.59623223 0.7039585 0.77752234 1.4814808 0.7407404 High
9 0.05417394 0.6776613 0.99746985 1.6751312 0.8375656 High
14 0.02172214 0.2773447 0.02295936 0.3003041 0.1501520 Low
20 0.61295952 0.8376189 0.40297945 1.2405984 0.6202992 High
5 0.92476700 0.2061416 0.57439575 0.7805373 0.3902687 Low
18 0.34776470 0.1576324 0.56327819 0.7209106 0.3604553 Low
22 0.48290474 0.1548488 0.40990097 0.5647498 0.2823749 Low
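The role of set.seed() can be checked directly: resetting the seed before each call reproduces the same draw. Note that R 3.6.0 changed the default sampling algorithm, so the particular numbers printed on a current installation may differ from those shown above.

```r
# Same seed, same draw
set.seed(1)
first_draw <- sample(1:25, 7, replace = FALSE)
set.seed(1)
second_draw <- sample(1:25, 7, replace = FALSE)

print(first_draw)
print(identical(first_draw, second_draw))

# Sampling without replacement never repeats an index
print(anyDuplicated(first_draw) == 0)
```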
Numeric Functions
Function                 Description
abs(x)                   absolute value
sqrt(x)                  square root
ceiling(x)               ceiling(3.475) is 4
floor(x)                 floor(3.475) is 3
trunc(x)                 trunc(5.99) is 5
round(x, digits=n)       round(3.475, digits=2) is 3.48
signif(x, digits=n)      signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)   also acos(x), cosh(x), acosh(x), etc.
log(x)                   natural logarithm
log10(x)                 common logarithm
exp(x)                   e^x
Statistical Functions
Function                            Description
mean(x, trim=0, na.rm=FALSE)        mean of object x
                                    # trimmed mean, removing any missing values and
                                    # 5 percent of highest and lowest scores
                                    mx <- mean(x,trim=.05,na.rm=TRUE)
sd(x)                               standard deviation of object x; see also var(x) for
                                    variance and mad(x) for median absolute deviation
median(x)                           median
quantile(x, probs)                  quantiles, where x is the numeric vector whose
                                    quantiles are desired and probs is a numeric vector
                                    with probabilities in [0,1]
                                    # 30th and 84th percentiles of x
                                    y <- quantile(x, c(.3,.84))
range(x)                            range (minimum and maximum)
sum(x)                              sum
diff(x, lag=1)                      lagged differences, with lag indicating which lag to use
min(x)                              minimum
max(x)                              maximum
scale(x, center=TRUE, scale=TRUE)   column center or standardize a matrix
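A small worked example may make the effect of trim and scale() concrete. The vector below is made up and contains one outlier and one missing value.

```r
# Hypothetical scores with an outlier (100) and a missing value
x <- c(2, 4, 4, 5, 7, 9, 100, NA)

plain_mean   <- mean(x, na.rm = TRUE)
# trim = 0.2 drops floor(7 * 0.2) = 1 value from each end of the 7 observed values
trimmed_mean <- mean(x, trim = 0.2, na.rm = TRUE)   # mean of 4, 4, 5, 7, 9
print(c(plain = plain_mean, trimmed = trimmed_mean, median = median(x, na.rm = TRUE)))

# scale() centers and standardizes: the result has mean 0 and sd 1
z <- as.numeric(scale(x))
print(round(z, 3))
```

The trimmed mean sits close to the median because the single outlier is discarded.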

Descriptive Statistics
Summary function
> # min, 1st and 3rd quartiles, median, mean, max
> summary(mydata)
y x1 x2 sum
Min. :0.02172 Min. :0.03389 Min. :0.02138 Min. :0.3003
1st Qu.:0.32603 1st Qu.:0.16459 1st Qu.:0.38283 1st Qu.:0.5796
Median :0.54611 Median :0.32560 Median :0.56884 Median :0.9313
Mean :0.50232 Mean :0.37177 Mean :0.55256 Mean :0.9200
3rd Qu.:0.67414 3rd Qu.:0.57398 3rd Qu.:0.79498 3rd Qu.:1.1864
Max. :0.95608 Max. :0.83762 Max. :0.99747 Max. :1.6751
NA's :1 NA's :1 NA's :2
mean x1cat
Min. :0.1502 Length:25
1st Qu.:0.2898 Class :character
Median :0.4657 Mode :character
Mean :0.4600
3rd Qu.:0.5932
Max. :0.8376
NA's :2
Using the pastecs package
> library(pastecs)
> stat.desc(mydata)
nbr.val 25.00000000 24.00000000 24.00000000 23.00000000 23.00000000 NA
nbr.null 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 NA
nbr.na 0.00000000 1.00000000 1.00000000 2.00000000 2.00000000 NA
min 0.02172214 0.03388894 0.02138096 0.30030407 0.15015204 NA
max 0.95608055 0.83761895 0.99746985 1.67513117 0.83756558 NA
range 0.93435840 0.80373000 0.97608890 1.37482709 0.68741355 NA
sum 12.55789913 8.92258074 13.26137199 21.16102961 10.58051480 NA
median 0.54610883 0.32559825 0.56883697 0.93132330 0.46566165 NA
mean 0.50231597 0.37177420 0.55255717 0.92004477 0.46002238 NA
SE.mean 0.05751775 0.05243628 0.06117871 0.07890179 0.03945090 NA
CI.mean.0.95 0.11871081 0.10847271 0.12655781 0.16363231 0.08181615 NA
var 0.08270730 0.06598952 0.08982804 0.14318634 0.03579659 NA
std.dev 0.28758876 0.25688425 0.29971327 0.37839971 0.18919986 NA
coef.var 0.57252563 0.69096848 0.54241133 0.41128402 0.41128402 NA
Using the psych package
> library(psych)
> describe(mydata[c(-6)])
var n mean sd median trimmed mad min max range skew kurtosis se
y 1 25 0.50 0.29 0.55 0.51 0.33 0.02 0.96 0.93 -0.11 -1.17 0.06
x1 2 24 0.37 0.26 0.33 0.36 0.26 0.03 0.84 0.80 0.39 -1.31 0.05
x2 3 24 0.55 0.30 0.57 0.56 0.32 0.02 1.00 0.98 -0.21 -1.13 0.06
sum 4 23 0.92 0.38 0.93 0.91 0.50 0.30 1.68 1.37 0.13 -1.08 0.08
mean 5 23 0.46 0.19 0.47 0.45 0.25 0.15 0.84 0.69 0.13 -1.08 0.04
Summary Statistics by Group

> library(psych)
> describeBy(mydata[,c(-6)], group = mydata$x1cat)
group: High
y 1 8 0.52 0.31 0.54 0.52 0.32 0.05 0.89 0.84 -0.20 -1.49 0.11
x1 2 8 0.68 0.12 0.69 0.68 0.11 0.51 0.84 0.33 -0.16 -1.44 0.04
x2 3 8 0.62 0.23 0.62 0.62 0.28 0.38 1.00 0.62 0.26 -1.55 0.08
sum 4 8 1.30 0.23 1.31 1.30 0.20 0.92 1.68 0.75 -0.09 -1.13 0.08
mean 5 8 0.65 0.12 0.65 0.65 0.10 0.46 0.84 0.38 -0.09 -1.13 0.04
------------------------------------------------------------------
group: Low
y 1 16 0.49 0.30 0.52 0.49 0.35 0.02 0.96 0.93 -0.01 -1.32 0.07
x1 2 16 0.22 0.13 0.19 0.21 0.15 0.03 0.45 0.42 0.36 -1.24 0.03
x2 3 15 0.49 0.32 0.53 0.49 0.43 0.02 0.97 0.94 -0.04 -1.46 0.08
sum 4 15 0.72 0.26 0.72 0.71 0.31 0.30 1.13 0.83 0.02 -1.55 0.07
mean 5 15 0.36 0.13 0.36 0.36 0.16 0.15 0.57 0.42 0.02 -1.55 0.03
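Group summaries can also be computed in base R without extra packages. The sketch below uses the built-in mtcars data (rather than mydata, which is not reproducible from the slides) with tapply() and aggregate().

```r
# Mean mpg for each number of cylinders, as a named array
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(group_means)

# The same summary returned as a data frame via the formula interface
group_table <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(group_table)
```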
Creating a Graph
Scatter plot
# Creating a Graph
attach(mtcars)
plot(mpg~wt, xlab="Car weight", ylab="Miles per gallon")
title("Regression of MPG on Weight")
[Figure: scatter plot "Regression of MPG on Weight"; x-axis: Car weight, y-axis: Miles per gallon]
Saving Graphs
• You can save the graph in a variety of formats from the
menu File → Save As
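Graphs can also be saved from code by opening a file device, drawing, and closing the device. The sketch below writes a PDF to a temporary folder; the file name is an arbitrary choice for illustration.

```r
# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "mpg_vs_wt.pdf")

pdf(out_file, width = 6, height = 4.5)   # open the file device (size in inches)
plot(mpg ~ wt, data = mtcars, xlab = "Car weight", ylab = "Miles per gallon")
title("Regression of MPG on Weight")
dev.off()                                # close the device so the file is written

print(file.exists(out_file))
```

Other devices such as png() and jpeg() follow the same open, draw, dev.off() pattern.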
Creating a Graph
Histograms
• You can create histograms with the function hist(x), where x is a numeric
vector of values to be plotted.
• The option freq=FALSE plots probability densities instead of
frequencies.
• The option breaks= controls the number of bins.
# Simple Histogram
hist(mtcars$mpg)
[Figure: "Histogram of mtcars$mpg"; x-axis: mtcars$mpg, y-axis: Frequency]
Creating a Graph
Histograms
# Colored Histogram with Bins = 12
hist(mtcars$mpg, breaks=12, col="red")
[Figure: "Histogram of mtcars$mpg" with 12 bins, bars in red; x-axis: mtcars$mpg, y-axis: Frequency]
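The freq=FALSE option is easiest to appreciate together with a density curve, since both are then on the same scale. The sketch below is an illustration on mtcars; the title, colors and line width are arbitrary choices.

```r
# Density-scale histogram: bar areas sum to 1
h <- hist(mtcars$mpg, breaks = 12, freq = FALSE, col = "grey",
          main = "Density-scale histogram of mpg", xlab = "Miles per gallon")

# Overlay a kernel density estimate on the same scale
lines(density(mtcars$mpg), lwd = 2)

# Check that the bars integrate to 1
total_area <- sum(h$density * diff(h$breaks))
print(total_area)
```

When breaks is a single number, R treats it as a suggestion and may choose a nearby number of bins.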
Creating a Graph
Dot Plots
• Create dotplots with the dotchart(x, labels=) function, where x is a numeric
vector and labels is a vector of labels for each point.
• cex controls the size of the labels.
# Simple Dotplot
dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,
main="Gas Mileage for Car Models",
xlab="Miles Per Gallon")
[Figure: dot plot "Gas Mileage for Car Models", one row per car model; x-axis: Miles Per Gallon]
Creating a Graph
Dot Plots
• You can add a groups= option to designate a factor specifying how the
elements of x are grouped.
• If so, the option gcolor= controls the color of the groups label.
# Dotplot: Grouped Sorted and Colored
# Sort by mpg, group and color by cylinder
x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
main="Gas Mileage for Car Models\ngrouped by cylinder",
xlab="Miles Per Gallon", gcolor="black", color=x$color)
[Figure: dot plot "Gas Mileage for Car Models grouped by cylinder", car models sorted by mpg within the groups 4, 6, 8; x-axis: Miles Per Gallon]
Creating a Graph
Bar Plots
• Create barplots with the barplot(height) function, where height is a vector or
matrix.
• If height is a vector, the values determine the heights of the bars in the plot.
# Simple Bar Plot
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution",
xlab="Number of Gears")
[Figure: bar plot "Car Distribution"; x-axis: Number of Gears (3, 4, 5)]
Creating a Graph
Bar Plots
# Simple Horizontal Bar Plot with Added Labels
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"))
[Figure: horizontal bar plot "Car Distribution" with bars labelled 3 Gears, 4 Gears, 5 Gears]
Creating a Graph
Stacked Bar Plot
• If height is a matrix and the option beside=FALSE then each bar of the plot
corresponds to a column of height, with the values in the column giving the
heights of stacked "sub-bars".
# Stacked Bar Plot with Colors and Legend
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts))
[Figure: stacked bar plot "Car Distribution by Gears and VS"; x-axis: Number of Gears (3, 4, 5); legend: 0, 1]
Creating a Graph
Grouped Bar Plot
• If height is a matrix and beside=TRUE, then the values in each column are
juxtaposed rather than stacked.
# Grouped Bar Plot
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts), beside=TRUE)
[Figure: grouped bar plot "Car Distribution by Gears and VS"; x-axis: Number of Gears (3, 4, 5); legend: 0, 1]
Creating a Graph
Line Charts
• Line charts are created with the function lines(x, y, type=),
where x and y are numeric vectors of (x,y) points to connect.
• type= can take the values in the table.
type   description
p      points
l      lines
o      overplotted points and lines
b, c   points (empty if "c") joined by lines
s, S   stair steps
h      histogram-like vertical lines
n      does not produce any points or lines
• The lines( ) function adds information to a graph. It cannot
produce a graph on its own.
• Usually it follows a plot(x, y) command that produces a graph.
• By default, plot( ) plots the (x,y) points. Use the type="n" option in
the plot( ) command to create the graph with axes, titles, etc., but
without plotting the points.
Creating a Graph
Line Charts
x <- 1:5 # create some data
y <- x # create some data
# set up axes and titles without plotting, then add points joined by lines
plot(x, y, type="n")
lines(x, y, type="o")
[Figure: line chart of y against x with overplotted points and lines]
Creating a Graph
Line Charts
# plotting symbol and color
par(pch=22, col="red")
# all plots on one page
par(mfrow=c(2,4))
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
  heading = paste("type=",opts[i])
  plot(x, y, type="n", main=heading)
  lines(x, y, type=opts[i])
}
# return to original setting
par(mfrow=c(1,1))
[Figure: 2 x 4 grid of panels titled type= p, l, o, b, c, s, S, h, each showing only the lines( ) output; axes x and y]
Creating a Graph
Line Charts
par(pch=22, col="blue")
par(mfrow=c(2,4))
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
  heading = paste("type=",opts[i])
  plot(x, y, main=heading)
  lines(x, y, type=opts[i])
}
# return to original setting
par(mfrow=c(1,1))
[Figure: 2 x 4 grid of panels titled type= p, l, o, b, c, s, S, h, each showing the plotted points plus the lines( ) output; axes x and y]
Creating a Graph
Pie Charts
• Pie charts are not recommended in the R documentation, and their features are
somewhat limited.
• The authors recommend bar or dot plots over pie charts because people are able
to judge length more accurately than volume.
Simple Pie Chart
# Simple Pie Chart
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")
[Figure: pie chart "Pie Chart of Countries" with slices US, UK, Australia, Germany, France]
Creating a Graph
Pie Chart with Annotated Percentages
# Pie Chart with Percentages
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Countries")
[Figure: pie chart "Pie Chart of Countries" with labels US 20%, UK 24%, Australia 8%, Germany 32%, France 16%]
Creating a Graph

3D Pie Chart
• The pie3D( ) function in the plotrix package provides 3D
exploded pie charts.
# 3D Exploded Pie Chart
library(plotrix)
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie3D(slices,labels=lbls,explode=0.1,
main="Pie Chart of Countries")
[Figure: 3D exploded pie chart "Pie Chart of Countries" with slices US, UK, Australia, Germany, France]
Creating a Graph
Boxplot
• Boxplots can be created for individual variables or for variables by group.
• The format is boxplot(x, data=), where x is a formula and data=
denotes the data frame providing the data.
# Boxplot of MPG by Car Cylinders
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon")
[Figure: boxplots "Car Mileage Data"; x-axis: Number of Cylinders (4, 6, 8), y-axis: Miles Per Gallon]
Creating a Graph
Simple Scatterplot
• There are many ways to create a scatterplot in R.
• The basic function is plot(x, y), where x and y are numeric vectors
denoting the (x,y) points to plot.
# Simple Scatterplot
attach(mtcars)
plot(wt, mpg, main = "Scatterplot Example",
xlab = "Car Weight",
ylab = "Miles Per Gallon", pch=19)
[Figure: scatter plot "Scatterplot Example"; x-axis: Car Weight, y-axis: Miles Per Gallon]
Creating a Graph
Add Fit Line to Scatterplot
# Add fit lines
abline(lm(mpg~wt), col="red")
[Figure: scatter plot "Scatterplot Example" with the fitted regression line; x-axis: Car Weight, y-axis: Miles Per Gallon]
Creating a Graph
Simple Scatterplot
• The scatterplot( ) function in the car package offers many enhanced
features, including fit lines, marginal box plots, conditioning on a factor,
and interactive point identification.
• Each of these features is optional.
# Enhanced Scatterplot of MPG vs. Weight
# by Number of Car Cylinders
library(car)
scatterplot(mpg ~ wt | cyl, data=mtcars, smoother = FALSE,
xlab="Weight of Car", ylab="Miles Per Gallon",
main="Enhanced Scatter Plot",
labels=row.names(mtcars))
[Figure: "Enhanced Scatter Plot" with points and fit lines by cyl (4, 6, 8); x-axis: Weight of Car, y-axis: Miles Per Gallon]
Creating a Graph
Scatterplot Matrices
• There are at least 4 useful functions for creating scatterplot matrices.
# Basic Scatterplot Matrix
pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")
[Figure: "Simple Scatterplot Matrix" of mpg, disp, drat, and wt]
Creating a Graph
Scatterplot Matrices
• The car package can condition the scatterplot matrix on a factor, and
optionally include lowess and linear best fit lines, and boxplot, densities,
or histograms in the principal diagonal, as well as rug plots in the
margins of the cells.
# Scatterplot Matrices from the car Package
library(car)
scatterplotMatrix(~mpg+disp+drat+wt|cyl,
data=mtcars, smoother=FALSE, by.group=TRUE,
diagonal = "density")
[Figure: scatterplot matrix of mpg, disp, drat, and wt grouped by cyl (4, 6, 8), with density plots on the diagonal]
Creating a Graph
3D Scatterplots
• You can create a 3D scatterplot with the scatterplot3d package.
• Use the function scatterplot3d(x, y, z).
# 3D Scatterplot
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, main="3D Scatterplot")
[Figure: "3D Scatterplot" with axes wt, disp, and mpg]
Creating a Graph
3D Scatterplots with Coloring and Vertical Drop Lines
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, pch=16,
highlight.3d=TRUE,
type="h", main="3D Scatterplot")
[Figure: "3D Scatterplot" with colored points and vertical drop lines; axes wt, disp, and mpg]
Creating a Graph
3D Scatterplots with Coloring and Vertical Drop Lines and Regression Plane
library(scatterplot3d)
attach(mtcars)
s3d <-scatterplot3d(wt,disp,mpg, pch=16,
highlight.3d=TRUE,
type="h", main="3D Scatterplot")
fit <- lm(mpg ~ wt+disp)
s3d$plane3d(fit)
[Figure: "3D Scatterplot" with colored points, vertical drop lines, and the fitted regression plane; axes wt, disp, and mpg]
Creating a Graph
Combining Plots
• R makes it easy to combine multiple plots into one overall graph, using the par()
function.
• With the par() function, you can include the option mfrow=c(nrows, ncols) to
create a matrix of nrows x ncols plots that are filled in by row.
• mfcol=c(nrows, ncols) fills in the matrix by columns.
# 4 figures arranged in 2 rows and 2 columns
attach(mtcars)
par(mfrow=c(2,2))
plot(wt,mpg, main="Scatterplot of wt vs. mpg")
plot(wt,disp, main="Scatterplot of wt vs disp")
hist(wt, main="Histogram of wt")
boxplot(wt, main="Boxplot of wt")
[Figure: 2 x 2 grid with panels "Scatterplot of wt vs. mpg", "Scatterplot of wt vs disp", "Histogram of wt", and "Boxplot of wt"]
Hypothesis Testing
Student’s t-Test
• The t.test() function produces a variety of t-tests.
• Unlike most statistical packages, the default assumes unequal variance and
applies the Welch df modification.
Function Help
t.test(x, ...)
## Default S3 method:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
# independent 2-group t-test
# where y is numeric and x is a binary factor
t.test(y~x)
# independent 2-group t-test
# where y1 and y2 are numeric
t.test(y1,y2)
# paired t-test
# where y1 & y2 are numeric
t.test(y1,y2,paired=TRUE)
# one sample t-test
t.test(y,mu=3) # Ho: mu=3
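To see the Welch default in action, the sketch below compares it with the pooled-variance test on the built-in mtcars data (mpg by transmission type am). The choice of variables is only for illustration.

```r
# Default: Welch test, unequal variances, fractional degrees of freedom
welch <- t.test(mpg ~ am, data = mtcars)

# Classical Student test: pooled variance, df = n1 + n2 - 2
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

print(welch$method)
print(c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter)))
print(c(welch_p = welch$p.value, pooled_p = pooled$p.value))
```

With 32 cars the pooled test has 30 degrees of freedom, while the Welch correction yields a smaller, non-integer value.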
Multiple (Linear) Regression
• R provides comprehensive support for multiple linear regression.
Fitting the Model
# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
Other useful functions
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
fitted(fit) # predicted values
residuals(fit) # residuals
anova(fit) # anova table
vcov(fit) # covariance matrix for model parameters
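The template above refers to generic variables. A runnable version on the built-in mtcars data might look like the sketch below; the choice of predictors (wt, hp, disp) is an assumption made only to have three regressors.

```r
# Illustrative model: fuel economy explained by weight, horsepower, displacement
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

print(summary(fit))               # coefficient table, R-squared, F statistic
print(coefficients(fit))          # model coefficients
print(confint(fit, level = 0.95)) # confidence intervals for the coefficients

# Fitted values plus residuals reproduce the observed response
reconstructed <- fitted(fit) + residuals(fit)
print(head(cbind(observed = mtcars$mpg, reconstructed)))
```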
Linear Regression
Simple Linear Regression
• We begin with a small example to provide a feel for the process.

• The data set Journals is taken from Stock and Watson (2007).
• The data provide some information on subscriptions to economics
journals at US libraries for the year 2000.
• Bergstrom (2001) argues that commercial publishers are charging
excessive prices for academic journals and also suggests ways that
economists can deal with this problem.
• The Journals data frame contains 180 observations (the journals) on 10
variables, among them the number of library subscriptions (subs), the
library subscription price (price), and the total number of citations for the
journal (citations).
Linear Regression

• The data can be loaded, transformed, and summarized via
> library(AER)
> data("Journals")
> names(Journals)
[1] "title" "publisher" "society" "price" "pages" "charpp" "citations"
[8] "foundingyear" "subs" "field"
> journals <- Journals[, c("subs", "price")]
> journals$citeprice <- Journals$price/Journals$citations
> summary(journals)
subs price citeprice
Min. : 2.0 Min. : 20.0 Min. : 0.005223
1st Qu.: 52.0 1st Qu.: 134.5 1st Qu.: 0.464495
Median : 122.5 Median : 282.0 Median : 1.320513
Mean : 196.9 Mean : 417.7 Mean : 2.548455
3rd Qu.: 268.2 3rd Qu.: 540.8 3rd Qu.: 3.440171
Max. :1098.0 Max. :2120.0 Max. :24.459459
• In view of the wide range of the variables, combined with a
considerable amount of skewness, it is useful to take logarithms.
R code
par(mfrow=c(2,2))
hist(journals$subs, main="")
hist(journals$citeprice, main="")
hist(log(journals$subs), main="")
hist(log(journals$citeprice), main="")
par(mfrow=c(1,1))
[Figure: 2 x 2 grid of histograms of journals$subs, journals$citeprice, log(journals$subs), and log(journals$citeprice); y-axis: Frequency]
• The goal is to estimate the effect of the price per citation on the
number of library subscriptions.
> jmodel<-lm(log(subs)~log(citeprice), data = journals)
> summary(jmodel)
Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)
Residuals:
Min 1Q Median 3Q Max
-2.72478 -0.53609 0.03721 0.46619 1.84808
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.76621 0.05591 85.25 <2e-16 ***
log(citeprice) -0.53305 0.03561 -14.97 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7497 on 178 degrees of freedom

Multiple R-squared: 0.5573, Adjusted R-squared: 0.5548
F-statistic: 224 on 1 and 178 DF, p-value: < 2.2e-16
Confidence intervals
• It is good practice to give a measure of error along with every
estimate.
• One way to do this is to provide a confidence interval.
• This is available via the extractor function confint().
> confint(jmodel, level = 0.95)

2.5 % 97.5 %
(Intercept) 4.6558822 4.8765420
log(citeprice) -0.6033319 -0.4627751
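The interval for the slope can be reproduced by hand as estimate ± t quantile × standard error. The sketch below plugs in the rounded values printed in the summary output, so it matches confint() only up to rounding.

```r
# Values copied from the summary(jmodel) output (rounded)
est <- -0.53305   # slope estimate for log(citeprice)
se  <- 0.03561    # its standard error
df  <- 178        # residual degrees of freedom

# Two-sided 95% interval uses the 97.5% quantile of the t distribution
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
print(ci)
```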
Prediction
• Often a regression model is used for prediction.
• There are two types of predictions: the prediction of points on the regression
line and the prediction of a new data value.
• The standard errors of predictions for new data take into account both the
uncertainty in the regression line and the variation of the individual points
about the line.
• Thus, the prediction interval for a new data value is wider than the
confidence interval for a point on the line.
• The function predict() provides both types of standard errors.
> predict(jmodel, newdata=data.frame(citeprice=2.11), interval = "confidence")
fit lwr upr
1 4.368188 4.247485 4.48889
> predict(jmodel, newdata=data.frame(citeprice=2.11), interval = "prediction")
fit lwr upr
1 4.368188 2.883746 5.852629
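Because the model is fitted on the log scale, the fit column is a predicted log(subs). The sketch below recomputes it from the rounded coefficients in the summary output and back-transforms the point prediction and the prediction interval to subscriptions; the back-transformation step is an added illustration, not part of the slides.

```r
# Coefficients copied from the summary(jmodel) output (rounded)
b0 <- 4.76621
b1 <- -0.53305

# Point prediction on the log scale for a price per citation of 2.11
log_subs <- b0 + b1 * log(2.11)
print(log_subs)

# Back-transform to the number of subscriptions
print(exp(log_subs))

# Prediction interval from the output above, on the original scale
pred_interval <- exp(c(lwr = 2.883746, upr = 5.852629))
print(pred_interval)
```

The interval is strongly asymmetric around the point prediction once it is expressed in subscriptions.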
Diagnostic plots
• The plot() method for objects of class "lm" provides four
diagnostic plots.
• The figure depicts the result for the journals regression.
• We set the graphical parameter mfrow to c(2, 2) using the
par() function, creating a 2 × 2 matrix of plotting areas to see all
four plots simultaneously.
[Figure: 2 x 2 grid of diagnostic plots for the journals regression: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); labelled points include MEPiTE, BoIES, IO, RoRPE, and Ecnmt]
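The commands behind such a figure are not shown on the slide. A minimal sketch follows, using a model on the built-in mtcars data so that it runs without the AER package; with the journals regression the same calls would be applied to jmodel.

```r
# Illustrative model (stand-in for jmodel)
fit_diag <- lm(mpg ~ wt, data = mtcars)

# 2 x 2 layout; par() returns the previous settings so they can be restored
old_par <- par(mfrow = c(2, 2))
plot(fit_diag)    # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(old_par)      # restore the original layout
```

Storing the value returned by par() restores every changed setting, not only mfrow.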
Testing a linear hypothesis
• Often it is necessary to test more general hypotheses.

• This is possible using the function linearHypothesis() from the car
package.
> linearHypothesis(jmodel, "log(citeprice)=-0.5")
Linear hypothesis test
Hypothesis:
log(citeprice) = - 0.5
Model 1: restricted model
Model 2: log(subs) ~ log(citeprice)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    179 100.54
2    178 100.06  1   0.48421 0.8614 0.3546
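For a single restriction, the F statistic in such a table is the square of a t statistic, and it also equals the F from comparing the restricted and unrestricted fits. The base R sketch below illustrates this on the built-in cars data; the hypothesised value 1.5 is arbitrary and the object names are illustrative.

```r
fit <- lm(log(dist) ~ log(speed), data = cars)
b0  <- 1.5   # hypothesised slope (arbitrary value for illustration)

# t statistic for H0: slope = b0, and its square
est      <- coef(summary(fit))["log(speed)", ]
t_stat   <- (est["Estimate"] - b0) / est["Std. Error"]
F_from_t <- unname(t_stat^2)

# Same test as a comparison of restricted and unrestricted models:
# the offset() term fixes the slope at b0 in the restricted fit
restricted   <- lm(log(dist) ~ 1 + offset(b0 * log(speed)), data = cars)
cmp          <- anova(restricted, fit)
F_from_anova <- cmp$F[2]
p_value      <- cmp$`Pr(>F)`[2]

c(F_from_t = F_from_t, F_from_anova = F_from_anova, p_value = p_value)
```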
R and LaTeX: texreg package
> library(texreg)
> texreg(jmodel, dcolumn = TRUE, booktabs = TRUE)
\begin{table}
\begin{center}
\begin{tabular}{l D{.}{.}{3.5}@{} }
\toprule
& \multicolumn{1}{c}{Model 1} \\
\midrule
(Intercept) & 4.77^{***} \\
& (0.06) \\
log(citeprice) & -0.53^{***} \\
& (0.04) \\
\midrule
R$^2$ & 0.56 \\
Adj. R$^2$ & 0.55 \\
Num. obs. & 180 \\
\bottomrule
\multicolumn{2}{l}{\scriptsize{
\textsuperscript{***}$p<0.001$,
\textsuperscript{**}$p<0.01$,
\textsuperscript{*}$p<0.05$}}
\end{tabular}
\caption{Statistical models}
\label{table:coefficients}
\end{center}
\end{table}
Rendered result:
                  Model 1
(Intercept)       4.77***
                  (0.06)
log(citeprice)    -0.53***
                  (0.04)
R2                0.56
Adj. R2           0.55
Num. obs.         180
*** p < 0.001, ** p < 0.01, * p < 0.05
Table: Statistical models
R and LaTeX: stargazer package
> library(stargazer)
> stargazer(jmodel, align=T)
\begin{table}[!htbp] \centering
\caption{}
\label{}
\begin{tabular}{@{\extracolsep{5pt}}lD{.}{.}{-3} } \\[-1.8ex]\hline \hline \\[-1.8ex]
& \multicolumn{1}{c}{\textit{Dependent variable:}} \\ \cline{2-2}
\\[-1.8ex] & \multicolumn{1}{c}{log(subs)} \\ \hline \\[-1.8ex]
log(citeprice) & -0.533^{***} \\
& (0.036) \\
& \\
Constant & 4.766^{***} \\
& (0.056) \\
& \\ \hline \\[-1.8ex]
Observations & \multicolumn{1}{c}{180} \\
R$^{2}$ & \multicolumn{1}{c}{0.557} \\
Adjusted R$^{2}$ & \multicolumn{1}{c}{0.555} \\
Residual Std. Error & \multicolumn{1}{c}{0.750 (df = 178)} \\
F Statistic & \multicolumn{1}{c}{224.037$^{***}$ (df = 1; 178)} \\ \hline \hline \\[-1.8ex]
\textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
\normalsize
\end{tabular}
\end{table}
Table: caption
                       Dependent variable:
                       log(subs)
log(citeprice)         -0.533***
                       (0.036)
Constant               4.766***
                       (0.056)
Observations           180
R2                     0.557
Adjusted R2            0.555
Residual Std. Error    0.750 (df = 178)
F Statistic            224.037*** (df = 1; 178)
Note: *p<0.1; **p<0.05; ***p<0.01
Multiple Linear Regression
• To illustrate how to deal with multiple regression in R, we consider a standard task in labor economics, estimation of a wage equation in semilogarithmic form.
• Here, we employ the CPS1988 data frame collected in the March 1988 Current
Population Survey (CPS) by the US Census Bureau and analyzed by Bierens
and Ginther (2001).
> data("CPS1988")
> head(CPS1988)
wage education experience ethnicity smsa region parttime
1 354.9 7 45 cauc yes northeast no
2 123.5 12 1 cauc yes northeast yes
> options(digits=4)
> educmodel <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
+                 data = CPS1988)
> summary(educmodel)
Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education +
    ethnicity, data = CPS1988)
Residuals:
   Min     1Q Median     3Q    Max
-2.943 -0.316  0.058  0.376  4.383
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.321395   0.019174   225.4   <2e-16 ***
experience       0.077473   0.000880    88.0   <2e-16 ***
I(experience^2) -0.001316   0.000019   -69.3   <2e-16 ***
education        0.085673   0.001272    67.3   <2e-16 ***
ethnicityafam   -0.243364   0.012918   -18.8   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F-statistic: 3.54e+03 on 4 and 28150 DF, p-value: <2e-16
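In a semilogarithmic equation a coefficient b is roughly a proportional effect on the wage, with exact value exp(b) - 1, and the quadratic in experience implies a turning point at -b1 / (2 b2). The sketch below works through this arithmetic with the estimates copied from the output above; the variable names are illustrative.

```r
# Coefficient estimates copied from the summary(educmodel) output
b_educ <-  0.085673
b_exp  <-  0.077473
b_exp2 <- -0.001316
b_afam <- -0.243364

# Exact percentage effect of one more year of education on the wage
educ_effect <- 100 * (exp(b_educ) - 1)

# Exact percentage wage difference for ethnicity = afam relative to cauc
afam_effect <- 100 * (exp(b_afam) - 1)

# Years of experience at which the estimated log-wage profile peaks
peak_experience <- -b_exp / (2 * b_exp2)

round(c(educ_effect = educ_effect, afam_effect = afam_effect,
        peak_experience = peak_experience), 2)
```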
Interactions
• Let us consider an interaction between ethnicity and education.
> cps_int <- lm(log(wage) ~ experience + I(experience^2) + education * ethnicity,
+               data = CPS1988)
> coeftest(cps_int)
t test of coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)              4.313059   0.019590  220.17   <2e-16 ***
experience               0.077520   0.000880   88.06   <2e-16 ***
I(experience^2)         -0.001318   0.000019  -69.34   <2e-16 ***
education                0.086312   0.001309   65.94   <2e-16 ***
ethnicityafam           -0.123887   0.059026   -2.10    0.036 *
education:ethnicityafam -0.009648   0.004651   -2.07    0.038 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interactions
• Let us consider only the interaction between ethnicity and education.
> cps_int2 <- lm(log(wage) ~ experience + I(experience^2) + education:ethnicity,
+                data = CPS1988)
> coeftest(cps_int2)
t test of coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             4.303774   0.019085   225.5   <2e-16 ***
experience              0.077567   0.000880    88.1   <2e-16 ***
I(experience^2)        -0.001320   0.000019   -69.5   <2e-16 ***
education:ethnicitycauc 0.086986   0.001269    68.5   <2e-16 ***
education:ethnicityafam 0.067812   0.001642    41.3   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
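The two specifications differ only in the formula operator: a * b expands to both main effects plus the interaction, while a:b keeps the interaction alone. A small base R sketch, using placeholder variable names y, a and b, shows the expansion, followed by the education slope for ethnicity = afam implied by the cps_int estimates.

```r
# Term labels produced by the two formula operators
labels_star  <- attr(terms(y ~ a * b), "term.labels")
labels_colon <- attr(terms(y ~ a:b),   "term.labels")
labels_star    # main effects plus interaction
labels_colon   # interaction only

# Education slope for ethnicity = afam implied by cps_int:
# baseline slope plus the interaction coefficient (values from the output)
slope_afam <- 0.086312 + (-0.009648)
slope_afam
```

This implied slope need not equal the education:ethnicityafam estimate in cps_int2, because that model omits the ethnicity main effect and so forces a common intercept.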
                       Dependent variable: log(wage)
                       (1)                             (2)
education              0.101***                        0.076***
                       (0.001)                         (0.001)
experience             0.020***
                       (0.0003)
Constant               4.489***                        5.178***
                       (0.020)                         (0.019)
Observations           28,155                          28,155
R2                     0.213                           0.095
Adjusted R2            0.213                           0.095
Residual Std. Error    0.635 (df = 28152)              0.681 (df = 28153)
F Statistic            3,806.000*** (df = 2; 28152)    2,942.000*** (df = 1; 28153)
Note: *p<0.1; **p<0.05; ***p<0.01
Linear Regression with Time Series Data
• The package dynlm (Zeileis 2008) provides the function dynlm(), which allows for formulas such as d(y) ~ L(d(y)) + L(x, 4), corresponding to
$\Delta y_t = \beta_0 + \beta_1 \Delta y_{t-1} + \beta_2 x_{t-4} + \varepsilon_t$.
• This describes a regression of the first differences of a variable y on its first difference lagged by one period and on the fourth lag of a variable x.
Example: create data and estimate linear regression
> set.seed(1)
> x <- rnorm(100)
> y <- 0.5*x + rnorm(100)/2
> summary(lm(y ~ x))
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-0.9384 -0.3069 -0.0697  0.2697  1.1731
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0188     0.0485   -0.39      0.7
x             0.4995     0.0539    9.27  4.6e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F-statistic: 86 on 1 and 98 DF, p-value: 4.58e-15
Example: bind data, create time series and estimate dynamic model
> data<-cbind(y,x)
> head(data)
y x
[1,] -0.6234 -0.6265
[2,] 0.1129 0.1836
[3,] -0.8733 -0.8356
[4,] 0.8767 1.5953
[5,] -0.1625 0.3295
[6,] 0.4734 -0.8205
> data.ts<-ts(data,start=c(2005,1), frequency = 12)
> window(data.ts, start=c(2005,1), end=c(2005,6))
y x
Jan 2005 -0.6234 -0.6265
Feb 2005 0.1129 0.1836
Mar 2005 -0.8733 -0.8356
Apr 2005 0.8767 1.5953
May 2005 -0.1625 0.3295
Jun 2005 0.4734 -0.8205
> library(dynlm)
> dynmodel<-dynlm(d(y)~L(d(y))+L(x, 4), data=data.ts)
Example: summarize dynamic linear model
> summary(dynmodel)
Time series regression with "ts" data:
Start = 2005(5), End = 2013(4)
Call:
dynlm(formula = d(y) ~ L(d(y)) + L(x, 4), data = data.ts)
Residuals:
    Min      1Q  Median      3Q     Max
-2.0278 -0.4267  0.0013  0.5882  2.0425
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0143     0.0838   -0.17     0.86
L(d(y))      -0.5087     0.0876   -5.81  8.9e-08 ***
L(x, 4)       0.0214     0.0940    0.23     0.82
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F-statistic: 17.3 on 2 and 93 DF, p-value: 4.06e-07
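What d() and L() do can be sketched in base R by building the differenced and lagged columns by hand. The helper lag_k() below is hypothetical (it is not part of dynlm), and the code is meant as an illustration of the idea rather than a replacement for dynlm().

```r
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100) / 2

dy <- c(NA, diff(y))        # dy[t] = y[t] - y[t-1]

# Hypothetical helper: shift a vector back by k periods, padding with NA
lag_k <- function(v, k) c(rep(NA, k), v[seq_len(length(v) - k)])

d <- data.frame(dy = dy, dy_lag1 = lag_k(dy, 1), x_lag4 = lag_k(x, 4))
manual <- lm(dy ~ dy_lag1 + x_lag4, data = d)   # rows containing NA are dropped

nobs(manual)          # observations left after losing the first four periods
df.residual(manual)   # residual degrees of freedom
coef(manual)
```

The first four periods are lost to differencing and lagging, which is why the dynlm() summary starts at 2005(5) and reports 93 residual degrees of freedom.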
Linear Regression with Panel Data
• There has been considerable interest in panel data econometrics over the last two decades, and hence it is almost mandatory to include a brief discussion of some common specifications in R.
• The package plm (Croissant and Millo 2008) contains the relevant
fitting functions and methods.
• The main difference between cross-sectional data and panel data is
that panel data have an internal structure, indexed by a
two-dimensional array, which must be communicated to the fitting
function.
• We refer to the cross-sectional objects as “individuals” and the
time identifier as “time”.
• For illustrating the basic fixed- and random-effects methods, we use the
well-known Grunfeld data (Grunfeld 1958) comprising 20 annual
observations on the three variables for 11 large US firms for the years
1935-1954.
• The basic one-way panel regression is
$\text{invest}_{it} = \beta_1 \text{value}_{it} + \beta_2 \text{capital}_{it} + \eta_i + \nu_{it}$,
• where invest is the real gross investment, value is the real value of the
firm, and capital is the real value of the capital stock.
• Originally employed in a study of the determinants of corporate
investment in a University of Chicago Ph.D. thesis, these data have been
a textbook classic since the 1970s.
• The package AER provides the full data set comprising all 11 firms.
• We use the Grunfeld data from package plm, which covers 10 of these firms (200 observations).

> data("Grunfeld", package = "plm")
> head(Grunfeld)
firm year inv value capital
1 1 1935 317.6 3078.5 2.8
2 1 1936 391.8 4661.7 52.6
3 1 1937 410.6 5387.1 156.9
4 1 1938 257.7 2792.2 209.2
5 1 1939 330.8 4313.2 203.4
6 1 1940 461.2 4643.9 207.2
• Using pdata.frame() (which replaces plm.data() in current versions of plm), we tell R that the individuals are called “firm”, whereas the time identifier is called “year”:
> library("plm")
> pgr <- pdata.frame(Grunfeld, index = c("firm", "year"))
• Panel data models are estimated using plm(); the model argument selects the estimator:
Pooled OLS (POLS)
> gr_pool <- plm(inv ~ value + capital, data = pgr, model = "pooling")
Fixed effects (FE)
> gr_fe <- plm(inv ~ value + capital, data = pgr, model = "within")
Random effects (RE)
> gr_re <- plm(inv ~ value + capital, data = pgr, model = "random")
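The "within" (fixed-effects) estimator amounts to subtracting individual means before running OLS, which gives the same slope as a regression with one dummy per individual. The base R sketch below checks this on simulated data; all names and parameter values are illustrative assumptions, not the Grunfeld data.

```r
set.seed(42)
n_firms <- 5
n_years <- 8
firm  <- factor(rep(seq_len(n_firms), each = n_years))
alpha <- rep(rnorm(n_firms, sd = 3), each = n_years)   # firm-specific effects
x     <- rnorm(n_firms * n_years) + alpha              # regressor correlated with the effects
y     <- 1 + 2 * x + alpha + rnorm(n_firms * n_years)

# Within transformation: subtract firm means from y and x
y_w <- y - ave(y, firm)
x_w <- x - ave(x, firm)
b_within <- coef(lm(y_w ~ x_w - 1))[["x_w"]]

# Least-squares dummy variables: one intercept per firm
b_lsdv <- coef(lm(y ~ x + firm))[["x"]]

# Pooled OLS ignores the firm effects
b_pooled <- coef(lm(y ~ x))[["x"]]

c(within = b_within, lsdv = b_lsdv, pooled = b_pooled)
```

The within and dummy-variable slopes coincide, while the pooled slope generally differs when the regressor is correlated with the individual effects.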
library(stargazer)
stargazer(gr_pool, gr_fe, gr_re, align=T, column.labels = c("POLS", "FE", "RE"),
dep.var.caption = "Dependent variable: investment")
Dependent variable: investment
                inv
                POLS                        FE                          RE
                (1)                         (2)                         (3)
value           0.116***                    0.110***                    0.110***
                (0.006)                     (0.012)                     (0.010)
capital         0.231***                    0.310***                    0.308***
                (0.025)                     (0.017)                     (0.017)
Constant        -42.714***                                              -57.834**
                (9.512)                                                 (28.899)
Observations    200                         200                         200
R2              0.812                       0.767                       0.770
Adjusted R2     0.800                       0.721                       0.758
F Statistic     426.576*** (df = 2; 197)    309.014*** (df = 2; 188)    328.837*** (df = 2; 197)
Note: *p<0.1; **p<0.05; ***p<0.01
• It is of interest to check whether the fixed effects are really needed.

• This is done by comparing the fixed effects and the pooled OLS
fits by means of pFtest() and yields
> pFtest(gr_fe, gr_pool)
F test for individual effects
data: inv ~ value + capital

F = 49.1766, df1 = 9, df2 = 188, p-value < 2.2e-16
alternative hypothesis: significant effects
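The statistic compares the residual sums of squares of the pooled and fixed-effects fits: F = [(RSS_pooled - RSS_FE) / (n - 1)] / [RSS_FE / df_FE], where n is the number of individuals. A base R sketch on simulated data (illustrative names and values, with one regressor instead of two) is:

```r
set.seed(7)
n_firms <- 10
n_years <- 20
firm <- factor(rep(seq_len(n_firms), each = n_years))
x <- rnorm(n_firms * n_years)
y <- 2 * x + rep(rnorm(n_firms, sd = 2), each = n_years) + rnorm(n_firms * n_years)

pooled <- lm(y ~ x)           # common intercept
lsdv   <- lm(y ~ x + firm)    # firm-specific intercepts (fixed effects)

rss_p <- deviance(pooled)     # residual sum of squares, pooled fit
rss_f <- deviance(lsdv)       # residual sum of squares, fixed-effects fit
df1 <- n_firms - 1
df2 <- df.residual(lsdv)
F_stat <- ((rss_p - rss_f) / df1) / (rss_f / df2)
p_val  <- pf(F_stat, df1, df2, lower.tail = FALSE)

# anova() on the nested fits reports the same statistic
F_anova <- anova(pooled, lsdv)$F[2]
c(F_stat = F_stat, F_anova = F_anova, p_val = p_val)
```

With 10 individuals the numerator has 9 degrees of freedom, as in the pFtest() output above.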
• A comparison of the regression coefficients shows that fixed- and random-effects methods yield rather similar results for these data.
• The Hausman test can also be used to choose between the fixed-effects and random-effects models in panel data.
• Under the null hypothesis, random effects (RE) is preferred because it is more efficient, while under the alternative fixed effects (FE) is at least consistent and thus preferred.
> phtest(gr_re, gr_fe)
Hausman Test
data: inv ~ value + capital

chisq = 2.3304, df = 2, p-value = 0.3119
alternative hypothesis: one model is inconsistent
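In its usual form the Hausman statistic is H = d' (V_FE - V_RE)^{-1} d with d = b_FE - b_RE, compared with a chi-squared distribution whose degrees of freedom equal the number of coefficients. The function hausman_stat() below is a hypothetical helper written for illustration (it is not the plm implementation), and its inputs are made-up numbers.

```r
# Hausman statistic from two coefficient vectors and their covariance matrices
hausman_stat <- function(b_fe, b_re, V_fe, V_re) {
  d    <- b_fe - b_re
  stat <- drop(t(d) %*% solve(V_fe - V_re) %*% d)
  list(statistic = stat,
       df        = length(d),
       p.value   = pchisq(stat, df = length(d), lower.tail = FALSE))
}

# Made-up inputs, purely to exercise the function
res <- hausman_stat(b_fe = c(0.5, 1.0), b_re = c(0.4, 1.1),
                    V_fe = diag(c(0.02, 0.03)), V_re = diag(c(0.01, 0.01)))
res

# p-value implied by the statistic and degrees of freedom reported by phtest()
p_slide <- pchisq(2.3304, df = 2, lower.tail = FALSE)
p_slide
```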
Dynamic Linear Regression with Panel Data
• Now we present a more advanced example, the dynamic panel data model estimated by the method of Arellano and Bond (1991).
• Recall that their estimator is a generalized method of moments
(GMM) estimator utilizing lagged endogenous regressors after a
first-differences transformation.
• plm comes with the original Arellano-Bond data (EmplUK) dealing
with determinants of employment (emp) in a panel of 140 UK firms
for the years 1976–1984.
• The data are unbalanced, with seven to nine observations per firm.
Arellano and Bond (1991): Difference GMM
> ab1991 <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1)
+ + log(capital) + lag(log(output), 0:1) | lag(log(emp), 2:8),
+ data = EmplUK, effect = "twoways", model = "twosteps")
> summary(ab1991)
Twoways effects Two steps model
Unbalanced Panel: n=140, T=7-9, N=1031
Number of Observations Used: 611
Coefficients
Estimate Std. Error z-value Pr(>|z|)
lag(log(emp), 1:2)1 0.4742 0.1854 2.56 0.01054 *
lag(log(emp), 1:2)2 -0.0530 0.0517 -1.02 0.30605
lag(log(wage), 0:1)0 -0.5132 0.1456 -3.53 0.00042 ***
lag(log(wage), 0:1)1 0.2246 0.1419 1.58 0.11353
log(capital) 0.2927 0.0626 4.67 3.0e-06 ***
lag(log(output), 0:1)0 0.6098 0.1563 3.90 9.5e-05 ***
lag(log(output), 0:1)1 -0.4464 0.2173 -2.05 0.03996 *
---
Sargan Test: chisq(25) = 30.11 (p.value=0.22)
Autocorrelation test (1): normal = -1.538 (p.value=0.062)
Wald test for coefficients: chisq(7) = 142 (p.value=<2e-16)
Wald test for time dummies: chisq(6) = 16.97 (p.value=0.00939)
Blundell and Bond (1998): System GMM
> bb1998 <- pgmm(log(emp) ~ lag(log(emp), 1)+ lag(log(wage), 0:1) +
+ lag(log(capital), 0:1) | lag(log(emp), 2:8) + lag(log(wage), 2:8)
+ + lag(log(capital), 2:8), data = EmplUK, effect = "twoways",
+ model = "onestep", transformation = "ld")
> summary(bb1998, robust = TRUE)
Twoways effects One step model
Unbalanced Panel: n=140, T=7-9, N=1031
Number of Observations Used: 1642
Coefficients
Estimate Std. Error z-value Pr(>|z|)
lag(log(emp), 1) 0.9356 0.0263 35.58 < 2e-16 ***
lag(log(wage), 0:1)0 -0.6310 0.1181 -5.34 9.1e-08 ***
lag(log(wage), 0:1)1 0.4826 0.1369 3.53 0.00042 ***
lag(log(capital), 0:1)0 0.4839 0.0539 8.98 < 2e-16 ***
lag(log(capital), 0:1)1 -0.4244 0.0585 -7.26 4.0e-13 ***
---
Sargan Test: chisq(100) = 118.8 (p.value=0.0971)
Autocorrelation test (1): normal = -4.808 (p.value=7.61e-07)
Wald test for coefficients: chisq(5) = 11175 (p.value=<2e-16)
Wald test for time dummies: chisq(7) = 14.71 (p.value=0.0399)
Time Series
The quantmod Package
• quantmod is the short form of Quantitative Financial Modelling Framework, a package written by Jeffrey A. Ryan.
• The quantmod package for R is designed to assist the quantitative trader in the
development, testing, and deployment of statistically based trading models.
• What quantmod IS: A rapid prototyping environment, with comprehensive
tools for data management and visualization, where quant traders can quickly
and cleanly explore and build trading models.
• What quantmod is NOT: A replacement for anything statistical; It has no
‘new’ modelling routines or analysis tool to speak of; It does now offer charting
not currently available elsewhere in R, but most everything else is more of a
wrapper to what you already know and love about the language and packages
you currently use.
Time Series: quantmod Package
• It is possible with one quantmod function to load data from a variety of sources, including
◦ Yahoo! Finance (OHLC data)
◦ Federal Reserve Bank of St. Louis FRED (11,000 economic series)
◦ Google Finance (OHLC data)
◦ Oanda, The Currency Site (FX and Metals)
◦ MySQL databases (Your local data)
◦ R binary formats (.RData and .rda)
◦ Comma Separated Value files (.csv)
Getting data
> # from google finance
> getSymbols("YHOO",src="google")
[1] "YHOO"
> # from yahoo finance
> getSymbols("GOOG",src="yahoo")
[1] "GOOG"
> # FX rates from FRED
> getSymbols("DEXJPUS",src="FRED")
[1] "DEXJPUS"
> # Platinum from Oanda
> getSymbols("XPT/USD",src="Oanda")
[1] "XPTUSD"
Apple stock prices
> library(quantmod)
> getSymbols("AAPL",src="yahoo")
[1] "AAPL"
> head(AAPL)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03 86.29 86.58 81.90 83.80 44225700 81.03
2007-01-04 84.05 85.95 83.82 85.66 30259300 82.83
2007-01-05 85.77 86.20 84.40 85.05 29812200 82.24
2007-01-08 85.96 86.53 85.28 85.47 28468100 82.64
2007-01-09 86.45 92.98 85.15 92.57 119617800 89.51
2007-01-10 94.75 97.80 93.45 97.00 105460000 93.79
> chartSeries(AAPL, theme="white")
[Figure: chartSeries() chart of AAPL from 2007-01-03 to 2013-11-22, with a price panel (last value 519.8) and a volume panel in millions (last value 7,979,500).]
Aggregating to a different time scale

> periodicity(AAPL)
Daily periodicity from 2007-01-03 to 2013-11-22
> head(to.weekly(AAPL))
2007-01-05 86.29 86.58 81.90 85.05 104297200 82.24
2007-01-12 85.96 97.80 85.15 94.62 351865300 91.49
2007-01-19 95.68 97.60 88.12 88.50 236407700 85.57
2007-01-26 89.14 89.16 84.99 85.38 195789700 82.55
2007-02-02 86.30 86.65 83.70 84.75 129342000 81.95
2007-02-09 84.30 86.51 82.86 83.27 144630100 80.51
> head(to.monthly(AAPL))
Jan 2007 86.29 97.80 81.90 85.73 971777900 82.89
Feb 2007 86.23 90.81 82.86 84.61 490084100 81.81
Mar 2007 84.03 96.83 83.75 92.91 568523000 89.84
Apr 2007 94.14 102.50 89.60 99.80 480705000 96.50
May 2007 99.59 122.17 98.55 121.19 620181000 117.18
Jun 2007 121.10 127.61 115.40 122.04 831412200 118.00
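How a lower-frequency OHLC bar is formed can be checked by hand: the open is the first open, the high is the maximum high, the low is the minimum low, the close is the last close, and the volume is the sum. The base R sketch below applies this to the first three trading days shown in the head(AAPL) output; the object names are illustrative.

```r
# First three trading days of AAPL (2007-01-03 to 2007-01-05),
# copied from the head(AAPL) output
week1 <- data.frame(
  Open   = c(86.29, 84.05, 85.77),
  High   = c(86.58, 85.95, 86.20),
  Low    = c(81.90, 83.82, 84.40),
  Close  = c(83.80, 85.66, 85.05),
  Volume = c(44225700, 30259300, 29812200)
)

# One weekly bar: first open, highest high, lowest low, last close, total volume
weekly_bar <- c(
  Open   = week1$Open[1],
  High   = max(week1$High),
  Low    = min(week1$Low),
  Close  = week1$Close[nrow(week1)],
  Volume = sum(week1$Volume)
)
weekly_bar
```

The result agrees with the first row of head(to.weekly(AAPL)) for the week ending 2007-01-05.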
Period Simple Returns
> head(dailyReturn(AAPL[,6]))
           daily.returns
2007-01-03      0.000000
2007-01-04      0.022214
2007-01-05     -0.007123
2007-01-08      0.004864
2007-01-09      0.083132
2007-01-10      0.047816
> head(weeklyReturn(AAPL[,6]))
           weekly.returns
2007-01-05       0.014933
2007-01-12       0.112476
2007-01-19      -0.064707
2007-01-26      -0.035293
2007-02-02      -0.007268
2007-02-09      -0.017572
> head(monthlyReturn(AAPL[,6]))
           monthly.returns
2007-01-31        0.022954
2007-02-28       -0.013029
2007-03-30        0.098154
2007-04-30        0.074132
2007-05-31        0.214301
2007-06-29        0.006998
> head(quarterlyReturn(AAPL[,6]))
           quarterly.returns
2007-03-30            0.1087
2007-06-29            0.3134
2007-09-28            0.2575
2007-12-31            0.2907
2008-03-31           -0.2756
2008-06-30            0.1668
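A simple return is r_t = P_t / P_{t-1} - 1, and the corresponding log return is log(1 + r_t). The base R sketch below recomputes the daily figures from the adjusted prices shown in the head(AAPL) output; the object names are illustrative.

```r
# Adjusted closing prices for the first six trading days,
# copied from the head(AAPL) output
p <- c(81.03, 82.83, 82.24, 82.64, 89.51, 93.79)

simple_ret <- p[-1] / p[-length(p)] - 1   # r_t = P_t / P_{t-1} - 1
log_ret    <- diff(log(p))                # log(P_t) - log(P_{t-1})

round(simple_ret, 6)
round(log_ret, 6)

# Log and simple returns are linked by log(1 + r_t)
all.equal(log_ret, log(1 + simple_ret))
```

The first day has no previous price, which is why dailyReturn() reports 0 for 2007-01-03 in the output above.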
Time Series: Basic Plots
Monthly Log Returns
> r.aapl<-monthlyReturn(AAPL[,6], type = "log")
> head(r.aapl)
           monthly.returns
2007-01-31        0.022695
2007-02-28       -0.013115
2007-03-30        0.093631
2007-04-30        0.071513
2007-05-31        0.194168
2007-06-29        0.006973
ACF and PACF
> library(TSA)
> par(mfrow=c(2,1))
> acf(r.aapl, main = "acf(r.aapl)")
> pacf(r.aapl, main = "pacf(r.aapl)")
> par(mfrow=c(1,1))
[Figure: time plot of r.aapl from Jan 2007 to Jan 2013, followed by the ACF and partial ACF of r.aapl plotted against the lag.]
Time Series: ARIMA models
• The package forecast (Hyndman and Khandakar 2008) contains a function auto.arima() that performs a search over a user-defined set of models.
• auto.arima() returns the best ARIMA model according to the AIC, AICc or BIC value.
> library(forecast)
> auto.arima(r.aapl, ic = "bic")
Series: r.aapl
ARIMA(0,0,0) with zero mean
sigma^2 estimated as 0.0113: log likelihood=68.37

AIC=-134.7 AICc=-134.7 BIC=-132.3
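The idea of selecting a model by an information criterion can be sketched in base R with a small loop over candidate AR orders. This is a simplified illustration on simulated data, not the search algorithm that auto.arima() actually uses; the series, the candidate orders and the object names are assumptions.

```r
set.seed(123)
x <- arima.sim(model = list(ar = 0.6), n = 500)   # simulated AR(1) series

orders <- 0:3
bic <- sapply(orders, function(p) {
  fit <- arima(x, order = c(p, 0, 0))
  AIC(fit, k = log(length(x)))    # BIC: penalty of log(n) per parameter
})
names(bic) <- paste0("AR(", orders, ")")
bic

best <- orders[which.min(bic)]    # order with the smallest BIC
best
```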
Estimate an ARIMA model

> ar.aapl<-arima(r.aapl, order = c(1,0,0), include.mean = T)
> ar.aapl
Series: x
ARIMA(1,0,0) with non-zero mean
Coefficients:
ar1 intercept
0.170 0.022
s.e. 0.107 0.013
sigma^2 estimated as 0.0105: log likelihood=71.49

AIC=-139 AICc=-138.7 BIC=-131.7
> coeftest(ar.aapl)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)

ar1 0.1702 0.1075 1.58 0.113
intercept 0.0223 0.0135 1.66 0.098 .
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Ljung-Box Test of the Residuals of the Estimated Model

> Box.test(resid(ar.aapl), lag = 5, type = "Ljung-Box", fitdf = 1)
Box-Ljung test
data: resid(ar.aapl)
X-squared = 4.592, df = 4, p-value = 0.3318
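The Ljung-Box statistic is Q = n (n + 2) * sum over k = 1..m of rho_k^2 / (n - k), compared with a chi-squared distribution on m - fitdf degrees of freedom. The base R sketch below computes it by hand and compares it with Box.test(); the simulated series stands in for model residuals and is an assumption.

```r
set.seed(1)
e     <- rnorm(80)   # stand-in for model residuals
m     <- 5           # number of lags tested
fitdf <- 1           # parameters estimated in the fitted model

n   <- length(e)
rho <- acf(e, lag.max = m, plot = FALSE)$acf[-1]   # autocorrelations at lags 1..m
Q   <- n * (n + 2) * sum(rho^2 / (n - seq_len(m)))
p_value <- pchisq(Q, df = m - fitdf, lower.tail = FALSE)

# Reference computation with the built-in test
bt <- Box.test(e, lag = m, type = "Ljung-Box", fitdf = fitdf)
c(manual = Q, box_test = unname(bt$statistic))
```

Setting fitdf = 1 removes one degree of freedom for the estimated AR coefficient, which is why the output above reports df = 4 for lag = 5.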
Time Series: ARCH effects
Ljung-Box Test of the Squared Residuals of the Estimated Model
> Box.test(resid(ar.aapl)^2, lag = 5, type = "Ljung-Box")
Box-Ljung test
data: resid(ar.aapl)^2
Time Series: ARCH effects
ARCH LM Test
> library(FinTS) # Necessary package for ArchTest

> ArchTest(r.aapl)
ARCH LM-test; Null hypothesis: no ARCH effects
data: r.aapl
Chi-squared = 20.93, df = 12, p-value = 0.05143
Time Series: GARCH Model
GARCH(1, 1)
> library(fGarch)
> model<-garchFit(~garch(1,1), data = r.aapl, trace=F)
> summary(model)
Title: GARCH Modelling
Conditional Distribution: norm
Coefficient(s):
        Estimate  Std. Error  t value Pr(>|t|)
mu     0.0215996   0.0104962    2.058   0.0396 *
omega  0.0006796   0.0005811    1.169   0.2422
alpha1 0.1014303   0.0807611    1.256   0.2091
beta1  0.8378938   0.0795071   10.539   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
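In a GARCH(1,1) model the conditional variance follows sigma2_t = omega + alpha1 * e_{t-1}^2 + beta1 * sigma2_{t-1}, and when alpha1 + beta1 < 1 the long-run variance is omega / (1 - alpha1 - beta1). The sketch below works through this arithmetic with the estimates copied from the output above; the shock value is hypothetical.

```r
# Estimates copied from the garchFit() output
omega  <- 0.0006796
alpha1 <- 0.1014303
beta1  <- 0.8378938

persistence <- alpha1 + beta1              # below 1 for a stationary variance
uncond_var  <- omega / (1 - persistence)   # long-run (unconditional) variance
uncond_sd   <- sqrt(uncond_var)

# One step of the variance recursion, starting from the long-run variance
# and a hypothetical shock of -20% (illustrative value, not from the data)
shock    <- -0.20
next_var <- omega + alpha1 * shock^2 + beta1 * uncond_var

c(persistence = persistence, uncond_var = uncond_var,
  uncond_sd = uncond_sd, next_var = next_var)
```

The implied long-run variance of about 0.0112 is close to the sigma^2 of 0.0113 reported for the ARIMA(0,0,0) fit earlier.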
References
• Crawley, M. J. (2012). The R book. 2nd ed. Wiley.

• Dalgaard, P. (2008). Introductory statistics with R. 2nd ed. Springer.
• Kleiber, C., & Zeileis, A. (2008). Applied econometrics with R. Springer.
• Ruppert, D. (2011). Statistics and data analysis for financial engineering.
Springer.
• Tsay, R. S. (2010). Analysis of financial time series. 3rd ed. Wiley.
• Tsay, R. S. (2013). An Introduction to Analysis of Financial Data with R.
Wiley.
• Wooldridge, J. M. (2010). Econometric analysis of cross section and
panel data. 2nd ed. MIT.
• Wooldridge, J. M. (2012). Introductory econometrics: a modern
approach. 5th ed. Cengage.