- a quick start -

OLEG NENADIĆ, WALTER ZUCCHINI

September 2004

Contents

1 An Introduction to R
1.1 Downloading and Installing R
1.2 Getting Started
1.3 Statistical Distributions
1.4 Writing Custom R Functions

2 Linear Models
2.1 Fitting Linear Models in R
2.2 Generalized Linear Models
2.3 Extensions

3 Time Series Analysis
3.1 Classical Decomposition
3.2 Exponential Smoothing
3.3 ARIMA Models

4 Advanced Graphics
4.1 Customizing Plots
4.2 Mathematical Annotations
4.3 Three-Dimensional Plots
4.4 RGL: 3D Visualization in R using OpenGL

A R Functions
A.1 Mathematical Expressions (expression())
A.2 The RGL Function Set

Preface

This introduction to the freely available statistical software package R is primarily intended for people already familiar with common statistical concepts. Thus the statistical methods used to illustrate the package are not explained in detail. These notes are not meant to be a reference manual, but rather a hands-on introduction for statisticians who are unfamiliar with R. The intention is to offer just enough material to get started, to motivate beginners by illustrating the power and flexibility of R, and to show how simply it enables the user to carry out sophisticated statistical computations and to produce high-quality graphical displays.

The notes comprise four sections, which build on each other and should therefore be read sequentially. The first section (An Introduction to R) introduces the most basic concepts. Occasionally things are simplified and restricted to the minimum background in order to avoid obscuring the main ideas by offering too much detail. The second and the third section (Linear Models and Time Series Analysis) illustrate some standard R commands pertaining to these two common statistical topics. The fourth section (Advanced Graphics) covers some of the excellent graphical capabilities of the package.

Throughout the text, typewriter font is used for annotating R functions and options. R functions are given with brackets, e.g. plot(), while options are typed in italic typewriter font, e.g. xlab="x label". R commands which are entered by the user are printed in red and the output from R is printed in blue. The datasets used are available from the URI http://134.76.173.220/R_workshop.

An efficient (and enjoyable) way of beginning to master R is to actively use it, to experiment with its functions and options and to write one's own functions. It is not necessary to study lengthy manuals in order to get started; one can get useful work done almost immediately. Thus, the main goal of this introduction is to motivate the reader to actively explore R. Good luck!

Chapter 1

An Introduction to R

1.1 Downloading and Installing R

R is a widely used environment for statistical analysis. The striking difference between R and most other statistical packages is that it is free software and that it is maintained by scientists for scientists. Since its introduction in 1996 by R. Ihaka and R. Gentleman, the R project has gained many users and contributors who continuously extend the capabilities of R by releasing add-ons (packages) that offer new functions and methods, or improve the existing ones.

One disadvantage or advantage, depending on the point of view, is that R is used within a command-line interface, which imposes a slightly steeper learning curve than other software. But once this hurdle has been overcome, R offers almost unlimited possibilities for statistical data analysis.

R is distributed by the Comprehensive R Archive Network (CRAN); it is available from the URI http://cran.r-project.org. The current version of R (1.9.1 as of September 2004, approx. 20 MB) for Windows can be downloaded by selecting R binaries → windows → base and downloading the file rw1091.exe from the CRAN website. R can then be installed by executing the downloaded file. The installation procedure is straightforward; one usually only has to specify the target directory in which to install R. After the installation, R can be started like any other Windows application, that is, by double-clicking on the corresponding icon.

1.2 Getting Started

Since R is a command-line based language, all commands are entered directly into the console. A starting point is to use R as a substitute for a pocket calculator. By typing

2+3

into the console, R adds 3 to 2 and displays the result. Other simple operators include

2-3      # Subtraction
2*3      # Multiplication
2/3      # Division
2^3      # Exponentiation (2 to the power of 3)
sqrt(3)  # Square roots
log(3)   # Logarithms (to the base e)

Operators can also be nested, e.g.

(2 - 3) * 3

first subtracts 3 from 2 and then multiplies the result by 3.

Often it can be useful to store results from operations for later use. This can be done using the assignment operator <-, e.g.

test <- 2 * 3

performs the operation on the right hand side (2*3) and then stores the result as an object named test. (One can also use = or even -> for assignments.) Further operations can be carried out on objects, e.g.

2 * test

multiplies the value stored in test by 2. Note that objects are overwritten without notice. The command ls() outputs the list of currently defined objects.
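The three assignment forms and the silent overwriting can be tried out in a few lines. This is only a small sketch; the object names test2 and test3 are made up for illustration and do not appear elsewhere in these notes.

```r
# Three equivalent ways of storing the value 6 in an object
test <- 2 * 3
test2 = 2 * 3
2 * 3 -> test3

# Objects are overwritten without notice: no warning appears here,
# and test now holds a character string instead of a number
test <- "now a character string"

# List the objects currently defined in the workspace
ls()
```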

Data types

As in other programming languages, there are different data types available in R,

namely numeric, character and logical. As the name indicates, numeric

is used for numerical values (double precision). The type character is used for

characters and is generally entered using quotation marks:

myname <- "what"

myname

However, it is not possible (nor meaningful) to apply arithmetic operators to character data. The data type logical is used for Boolean values (TRUE or T, and FALSE or F).
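As a brief illustration of the three data types, the following sketch checks the type of some example values with class(). The variable names x, s and b are arbitrary, and the use of tryCatch() to capture the error message is just one convenient way of showing the failure without stopping the script.

```r
x <- 3.14      # numeric (double precision)
s <- "3.14"    # character: quotation marks make this text, not a number
b <- TRUE      # logical

class(x)
class(s)
class(b)

# Arithmetic on character data fails; the error message is captured here
result <- tryCatch(s + 1, error = function(e) conditionMessage(e))
result

# An explicit conversion makes the arithmetic possible
as.numeric(s) + 1

# Logical values count as 1 (TRUE) and 0 (FALSE) in arithmetic
sum(c(TRUE, FALSE, TRUE))
```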


Object types

Depending on the structure of the data, R recognises 4 standard object types: vectors, matrices, data frames and lists. Vectors are one-dimensional arrays of data; matrices are two-dimensional data arrays. Data frames and lists are further generalizations and will be covered in a later section.

Creating vectors in R

There are various means of creating vectors in R. For example, in case one wants to save the numbers 3, 5, 6, 7, 1 as mynumbers, one can use the c() command:

mynumbers <- c(3, 5, 6, 7, 1)

Further operations can then be carried out on the R object mynumbers. Note that arithmetic operations on vectors (and matrices) are carried out component-wise, e.g. mynumbers*mynumbers returns the squared value of each component of mynumbers.

Sequences can be created using either : or seq():

1:10

creates a vector containing the numbers 1, 2, 3, . . . , 10. The seq() command allows the increments of the sequence to be specified:

seq(0.5, 2.5, 0.5)

creates a vector containing the numbers 0.5, 1, 1.5, 2, 2.5. Alternatively one can specify the length of the sequence:

seq(0.5, 2.5, length = 100)

creates a sequence from 0.5 to 2.5 with the increments chosen such that the resulting sequence contains 100 equally spaced values.
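The component-wise arithmetic and the two ways of calling seq() can be checked as follows. This is a small sketch; the names squares, s1, s2 and step are chosen only for this example, and the arguments are written out in full as by and length.out.

```r
mynumbers <- c(3, 5, 6, 7, 1)

# Component-wise multiplication: each element is squared
squares <- mynumbers * mynumbers
squares

# Specifying the increment or the length gives the same sequence here
s1 <- seq(0.5, 2.5, by = 0.5)
s2 <- seq(0.5, 2.5, length.out = 5)
all.equal(s1, s2)

# With 100 values there are 99 equal steps between 0.5 and 2.5
step <- diff(seq(0.5, 2.5, length.out = 100))[1]
step
```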

Creating matrices in R

One way of creating a matrix in R is to convert a vector of length n·m into an n × m matrix:

mynumbers <- 1:12
matrix(mynumbers, nrow = 4)

Note that the matrix is created column-wise; for row-wise construction one has to use the option byrow=TRUE:

matrix(mynumbers, nrow = 4, byrow = TRUE)

An alternative way of constructing matrices is to use the functions cbind() and rbind(), which combine vectors (column- or row-wise, respectively) into a matrix:

mynumbers1 <- 1:4
mynumbers2 <- 11:14
cbind(mynumbers1, mynumbers2)
rbind(mynumbers1, mynumbers2)
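The difference between column-wise and row-wise filling is easiest to see by comparing the first rows of the two matrices. The sketch below uses the hypothetical names m_col and m_row; it also checks that the row-wise result equals the transpose of a 3-row matrix filled column-wise.

```r
m_col <- matrix(1:12, nrow = 4)                 # filled column by column
m_row <- matrix(1:12, nrow = 4, byrow = TRUE)   # filled row by row

dim(m_col)    # 4 rows, 3 columns
m_col[1, ]    # first row when filled column-wise
m_row[1, ]    # first row when filled row-wise

# Row-wise filling gives the transpose of a column-wise 3 x 4 matrix
identical(m_row, t(matrix(1:12, nrow = 3)))

# cbind() stacks vectors as columns, rbind() stacks them as rows
dim(cbind(1:4, 11:14))
dim(rbind(1:4, 11:14))
```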

Accessing elements of vectors and matrices

Particular elements of R vectors and matrices can be accessed using square brackets. Assume that we have created the following R objects vector1 and matrix1:

vector1 <- seq(-3, 3, 0.5)
matrix1 <- matrix(1:20, nrow = 5)

Some examples of how to access particular elements are given below:

vector1[5]           # returns the 5th element of vector1
vector1[1:3]         # returns the first three elements of vector1
vector1[c(2, 4, 5)]  # returns the 2nd, 4th and 5th element of vector1
vector1[-5]          # returns all elements except for the 5th one

Elements of matrices are accessed in a similar way. matrix1[a,b] returns the value from the a-th row and the b-th column of matrix1:

matrix1[2,]      # returns the 2nd row of matrix1
matrix1[,3]      # returns the 3rd column of matrix1
matrix1[2, 3]    # returns the value from matrix1 in the 2nd row and 3rd column
matrix1[1:2, 3]  # returns the values from matrix1 in the first two rows and the 3rd column

Example: Plotting functions

Assume that you were to plot a function by hand. One possibility of doing it is to

1. Select some x-values from the range to be plotted
2. Compute the corresponding y = f(x) values
3. Plot x against y
4. Add a (more or less) smooth line connecting the (x, y)-points

Graphs of functions are created in essentially the same way in R, e.g. plotting the function f(x) = sin(x) in the range of $-\pi$ to $\pi$ can be done as follows:

x <- seq(-pi, pi, length = 10)  # defines 10 values from -pi to pi
y <- sin(x)                     # computes the corresponding y-values
plot(x, y)                      # plots x against y
lines(x, y)                     # adds a line connecting the (x, y)-points

Figure 1.1: Plotting sin(x) in R. Panel a): length(x)=10; panel b): length(x)=1000 (axes: x, y).

The output is shown in the left part of figure 1.1. However, the graph does not look very appealing since it lacks smoothness. A simple trick for improving the graph is to increase the number of x-values at which f(x) is evaluated, e.g. to 1000:

x <- seq(-pi, pi, length = 1000)
y <- sin(x)
plot(x, y, type = "l")

The result is shown in the right part of figure 1.1. Note the use of the option type="l", which causes the graph to be drawn with connecting lines rather than points.


1.3 Statistical Distributions

The names of the R functions for distributions comprise two parts. The first part (the first letter) indicates the function group, and the second part (the remainder of the function name) indicates the distribution. The following function groups are available:

- probability density function (d)
- cumulative distribution function (p)
- quantile function (q)
- random number generation (r)

Common distributions have their corresponding R names:

distribution   R name    distribution        R name    distribution     R name
normal         norm      t                   t         $\chi^2$         chisq
exponential    exp       f                   f         uniform          unif
log-normal     lnorm     beta                beta      gamma            gamma
logistic       logis     weibull             weibull   cauchy           cauchy
geometric      geom      binomial            binom     hypergeometric   hyper
poisson        pois      negative binomial   nbinom

E.g., random numbers (r) from the normal distribution (norm) can be drawn using the rnorm() function; quantiles (q) of the $\chi^2$ distribution (chisq) are obtained with qchisq().

The following examples illustrate the use of the R functions for computations involving statistical distributions:

rnorm(10)        # draws 10 random numbers from a standard normal distribution
rnorm(10, 5, 2)  # draws 10 random numbers from a N(mu = 5, sigma = 2) distribution
pnorm(0)         # returns the value of a standard normal cdf at t = 0
qnorm(0.5)       # returns the 50% quantile of the standard normal distribution
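The relation between the four function groups can be verified numerically. The following is a sketch with arbitrarily chosen values (1.96, a sample size of 10000, and a seed of 1 so that the random draws are reproducible); note that the third argument of rnorm() is the standard deviation, not the variance.

```r
# p and q functions are inverses of each other
p <- pnorm(1.96)   # P(Z <= 1.96) for a standard normal Z
q <- qnorm(p)      # recovers 1.96

# The density (d) integrates to 1 over the real line
total <- integrate(dnorm, -Inf, Inf)$value

# Random numbers (r): sample mean and sd should be close to 5 and 2
set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)
mean(x)
sd(x)

# 95% quantile of the chi-squared distribution with 1 degree of freedom
q_chisq <- qchisq(0.95, df = 1)
q_chisq
```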

Examples for handling distributions

Assume that we want to generate 50 (standard) normally distributed random numbers and to display them as a histogram. Additionally, we want to add the pdf of the (fitted) normal distribution to the plot as shown in figure 1.2:

mysample <- rnorm(50)          # generates random numbers
hist(mysample, prob = TRUE)    # draws the histogram
mu <- mean(mysample)           # computes the sample mean
sigma <- sd(mysample)          # computes the sample standard deviation
x <- seq(-4, 4, length = 500)  # defines x-values for the pdf
y <- dnorm(x, mu, sigma)       # computes the normal pdf
lines(x, y)                    # adds the pdf as lines to the plot

Figure 1.2: Histogram of normally distributed random numbers and fitted density.

Another example (figure 1.3) is the visualization of the approximation of the binomial distribution with the normal distribution for e.g. n = 50 and $\pi$ = 0.25:

x <- 0:50                          # defines the x-values
y <- dbinom(x, 50, 0.25)           # computes the binomial probabilities
plot(x, y, type = "h")             # plots binomial probabilities
x2 <- seq(0, 50, length = 500)     # defines x-values (for the normal pdf)
y2 <- dnorm(x2, 50*0.25,
            sqrt(50*0.25*(1-0.25)))  # computes the normal pdf
lines(x2, y2, col = "red")         # draws the normal pdf
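The quality of the approximation can also be checked numerically rather than visually. The sketch below compares the binomial probabilities with the normal density at the integers 0, ..., 50 and compares one cumulative probability; the cut-off 15 and the continuity correction of 0.5 are choices made for this example only.

```r
n <- 50
p <- 0.25
x <- 0:n

binom_prob <- dbinom(x, n, p)
normal_dens <- dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p)))

# Largest absolute difference between the two curves at the integers
max_diff <- max(abs(binom_prob - normal_dens))
max_diff

# P(X <= 15): exact binomial value and normal approximation
# (evaluated at 15.5, i.e. with a continuity correction)
exact  <- pbinom(15, n, p)
approx <- pnorm(15.5, mean = n * p, sd = sqrt(n * p * (1 - p)))
c(exact = exact, approx = approx)
```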

Figure 1.3: Comparing the binomial distribution with n = 50 and $\pi$ = 0.25 with an approximation by the normal distribution ($\mu = n\pi$, $\sigma = \sqrt{n\pi(1-\pi)}$). Plot title: Comparison: Binomial distribution and normal approximation (axes: x, y).

1.4 Writing Custom R Functions

In case R does not offer a required function, it is possible to write a custom one. Assume that we want to compute the geometric mean of a sample:

$$\mu_G = \left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}} = e^{\frac{1}{n} \sum_{i} \log(x_i)}$$

Since R doesn't have a function for computing the geometric mean, we have to write our own function geo.mean():

fix(geo.mean)

opens an editor window where we can enter our function:

function(x)
{
  n <- length(x)
  gm <- exp(mean(log(x)))
  return(gm)
}

Note that R checks the function after closing and saving the editor window. In case of structural errors (most commonly missing brackets), R reports these to the user. In order to fix the error(s), one has to enter

geo.mean <- edit()

since the (erroneous) results are not saved. (Using fix(geo.mean) results in losing the last changes.)
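The same function can also be defined directly at the console, without opening the editor. The following sketch does this and checks the result against the product form of the formula; the test vector is arbitrary, and the function assumes that all values are strictly positive (otherwise log() produces NaN or -Inf).

```r
# Geometric mean via the log form: exp of the mean of the logs.
# Assumes all elements of x are strictly positive.
geo.mean <- function(x) {
  exp(mean(log(x)))
}

x <- c(1, 2, 4, 8)

gm_log  <- geo.mean(x)              # log form
gm_prod <- prod(x)^(1 / length(x))  # product form: n-th root of the product

all.equal(gm_log, gm_prod)
```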

Chapter 2

Linear Models

2.1 Fitting Linear Models in R

This section focuses on the three main types of linear models: Regression, Analysis of Variance and Analysis of Covariance.

Simple regression analysis

The dataset strength, which is stored as strength.dat, contains measurements on the ability of workers to perform physically demanding tasks. It contains the measured variables grip, arm, rating and sims collected from 147 persons. The dataset can be imported into R with

strength <- read.table("C:/R_workshop/strength.dat", header = TRUE)

The command read.table() reads a file into R assuming that the data is structured as a matrix (table). It assumes that the entries of a row are separated by blank spaces (or any other suitable separator) and the rows are separated by line feeds. The option header=TRUE tells R that the first row is used for labelling the columns.
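For readers who do not have the workshop files at hand, the behaviour of read.table() can be tried on a small inline table. The three data rows below are invented for illustration and are not taken from strength.dat; the text argument and the temporary file are only used to keep the sketch self-contained.

```r
# A tiny whitespace-separated table with a header row (made-up values)
raw <- "grip arm rating sims
105 80 31 1.18
106 94 43 -0.17
97 77 29 -0.33"

demo <- read.table(text = raw, header = TRUE)
dim(demo)     # 3 rows, 4 columns
names(demo)   # column labels taken from the first row

# The same data written to a file with a different separator and read back
tmp <- tempfile(fileext = ".dat")
write.table(demo, tmp, sep = ",", row.names = FALSE, quote = FALSE)
demo2 <- read.table(tmp, header = TRUE, sep = ",")
all.equal(demo, demo2)
```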

In order to get an overview of the relation between the 4 (quantitative) variables, one can use

pairs(strength)

which creates a matrix of scatterplots for the variables.

Let's focus on the relation between grip (1st column) and arm (2nd column). The general function for linear models is lm(). Fitting the model

$$\text{grip}_i = \beta_0 + \beta_1 \cdot \text{arm}_i + e_i$$

can be done using

fit <- lm(strength[,1] ~ strength[,2])
fit

The function lm() returns a list object which we have saved under some name, e.g. as fit. As previously mentioned, lists are a generalized object type; a list can contain several objects of different types and modes arranged into a single object. The names of the entries stored in a list can be viewed using

names(fit)

One entry in this list is coefficients, which contains the coefficients of the fitted model. The coefficients can be accessed using the $ sign:

fit$coefficients

returns a vector (in this case of length 2) containing the estimated parameters ($\hat{\beta}_0$ and $\hat{\beta}_1$). Another entry is residuals, which contains the residuals of the fitted model:

res <- fit$residuals
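The list structure of a fitted model can be explored without the workshop data. The sketch below uses the cars dataset that ships with R (stopping distance dist against speed) purely as a stand-in for strength; the names fit_cars and res_cars are chosen for this example, and coef() and residuals() are the accessor functions corresponding to the $ entries.

```r
# Stand-in example with a built-in dataset (not the strength data)
fit_cars <- lm(dist ~ speed, data = cars)

names(fit_cars)   # entries of the list returned by lm()

# Accessor functions give the same values as the list entries
all.equal(coef(fit_cars), fit_cars$coefficients)
res_cars <- residuals(fit_cars)
all.equal(res_cars, fit_cars$residuals)

# With an intercept in the model, least squares residuals sum to zero
sum(res_cars)
```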

Before looking further at our fitted model, let us briefly examine the residuals. A first insight is given by displaying the residuals as a histogram:

hist(res, prob = TRUE, col = "red")

An alternative is to use a kernel density estimate and to display it along with the histogram:

lines(density(res), col = "blue")

The function density() computes the kernel density estimate (other methods for kernel density estimation will be discussed in a later section).

Here one might also wish to add the pdf of the normal distribution to the graph:

mu <- mean(res)
sigma <- sd(res)
x <- seq(-60, 60, length = 500)
y <- dnorm(x, mu, sigma)
lines(x, y, col = 6)

Figure 2.1: Histogram of the model residuals with kernel density estimate and fitted normal distribution.

There also exist alternative ways for (graphically) investigating the normality of a sample, for example QQ-plots:

qqnorm(res)

draws the sample quantiles against the quantiles of a normal distribution as shown in figure 2.2. Without going into detail, the ideal case is given when the points lie on a straight line.

Another option is to specifically test for normality, e.g. using the Kolmogorov-Smirnov test or the Shapiro-Wilk test:

ks.test(res, "pnorm", mu, sigma)

performs the Kolmogorov-Smirnov test on res. Since this test can be used for any distribution, one has to specify the distribution (pnorm) and its parameters (mu and sigma). The Shapiro-Wilk test specifically tests for normality, so one only has to specify the data:

shapiro.test(res)
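Both tests return list objects whose entries can be inspected in the same way as those of a fitted model. The sketch below applies them to simulated data (100 normal random numbers with an arbitrary standard deviation of 18 and a fixed seed), not to the residuals of the strength model. One caveat worth keeping in mind: when the parameters passed to ks.test() are estimated from the same sample, the reported p-value is only approximate and tends to be too large.

```r
set.seed(42)
res_sim <- rnorm(100, mean = 0, sd = 18)   # simulated stand-in for residuals

# Kolmogorov-Smirnov test against a normal distribution with
# parameters estimated from the sample itself
ks <- ks.test(res_sim, "pnorm", mean(res_sim), sd(res_sim))

# Shapiro-Wilk test: only the data have to be specified
sw <- shapiro.test(res_sim)

names(ks)      # entries of the test result
ks$p.value
sw$p.value
```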

Figure 2.2: QQ-plot of residuals (plot title: Normal Q-Q Plot; axes: Theoretical Quantiles, Sample Quantiles).

Now back to our fitted model. In order to display the observations together with the fitted model, one can use the following code which creates the graph shown in figure 2.3:

plot(strength[,2], strength[,1])
betahat <- fit$coefficients
x <- seq(0, 200, length = 500)
y <- betahat[1] + betahat[2]*x
lines(x, y, col = "red")

Another useful function in this context is summary():

summary(fit)

returns an output containing the values of the coefficients and other information:

Figure 2.3: Observations (strength[,2] vs. strength[,1]) and fitted line.

Call:
lm(formula = strength[, 1] ~ strength[, 2])

Residuals:
     Min       1Q   Median       3Q      Max
-49.0034 -11.5574   0.4104  12.3367  51.0541

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   54.70811    5.88572   9.295   <2e-16 ***
strength[, 2]  0.70504    0.07221   9.764   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.42 on 145 degrees of freedom
Multiple R-Squared: 0.3967,     Adjusted R-squared: 0.3925
F-statistic: 95.34 on 1 and 145 DF,  p-value: < 2.2e-16

Analysis of Variance

Consider the dataset miete.dat, which contains rent prices for apartments in a German city. Two factors were also recorded: year of construction, and whether the apartment was on the ground floor, first floor, ..., fourth floor. Again, the data can be imported into R using the read.table() command:

rent <- read.table("C:/R_workshop/miete.dat", header = TRUE)

In this case, rent is a data frame comprising three columns: the first one (Baujahr) indicates the year of construction, the second one (Lage) indicates the floor and the third column contains the rent prices (Miete) per square meter. We can start by translating the German labels into English:

names(rent)

returns the column names of rent. The names can be changed by typing

names(rent) <- c("year", "floor", "price")

into the console.

In order to examine the relationship between price and floor, one can use boxplots for visualization. In order to do so, one needs to extract the rent prices for each floor group:

price <- rent[,3]
fl <- rent[,2]
levels(fl)

Here we have saved the third column of rent as price and the second one as fl. The command levels(fl) shows us the levels of fl (a to e). It is possible to perform queries using square brackets, e.g.

price[fl=="a"]

returns the prices for the apartments on floor a (ground floor in this case). Accordingly,

fl[price<7]

returns the floor levels whose corresponding rent prices (per m²) are less than 7 (Euro). These queries can be further expanded using logical AND (&) or OR (|) operators:

fl[price<7 & price>5]

returns all floor levels whose corresponding rent prices are between 5 and 7 (Euro).
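These queries can be tried on a few invented values. In the sketch below both vectors are made up and merely mimic the structure of the rent data. Note that levels() only works on factors: since R 4.0.0, read.table() no longer converts character columns to factors by default, so with a current R version one may have to convert the column explicitly with factor() (or pass stringsAsFactors = TRUE when reading the file).

```r
# Made-up miniature version of the floor and price columns
fl    <- factor(c("a", "b", "a", "c", "b"))
price <- c(6.5, 7.2, 5.1, 8.0, 6.9)

levels(fl)                        # the distinct floor codes

price_a <- price[fl == "a"]       # prices on floor a
price_a

# Floors with a price strictly between 5 and 7
mid_fl <- fl[price < 7 & price > 5]
as.character(mid_fl)
```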

A convenient function is split(a,b), which splits the data a by the levels given in b. This can be used together with the function boxplot():

boxplot(split(price, fl))

Accordingly, the relation between the year of construction and the price can be visualized with

year <- rent[,1]
boxplot(split(price, year))

Figure 2.4: Boxplots of the rent example. The left boxplot (price vs. floor) displays the relation between price and floor; the right boxplot (price vs. year) shows the relation between price and year.

The analysis of variance can be carried out in two ways, either by treating it as a linear model (lm()) or by using the function aov(), which is more convenient in this case:

fit1a <- lm(price ~ fl)
summary(fit1a)

returns

Call:
lm(formula = price ~ fl)

Residuals:
    Min      1Q  Median      3Q     Max
-4.4132 -1.2834 -0.1463  1.1717  6.2987

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.8593     0.1858  36.925   <2e-16 ***
flb           0.0720     0.2627   0.274    0.784
flc          -0.2061     0.2627  -0.785    0.433
fld           0.0564     0.2627   0.215    0.830
fle          -0.1197     0.2627  -0.456    0.649
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.858 on 495 degrees of freedom
Multiple R-Squared: 0.003348,   Adjusted R-squared: -0.004706
F-statistic: 0.4157 on 4 and 495 DF,  p-value: 0.7974

On the other hand,

fit1b <- aov(price ~ fl)
summary(fit1b)

returns

             Df  Sum Sq Mean Sq F value Pr(>F)
fl            4    5.74    1.43  0.4157 0.7974
Residuals   495 1708.14    3.45

The full model (i.e. including year and floor as well as interactions) is analysed with

fit2 <- aov(price ~ fl + year + fl*year)
summary(fit2)

             Df Sum Sq Mean Sq F value  Pr(>F)
fl            4   5.74    1.43  0.7428  0.5632
year          4 735.26  183.81 95.1808  <2e-16 ***
fl:year      16  55.56    3.47  1.7980  0.0288 *
Residuals   475 917.33    1.93
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interpretation of the tables is left to the reader.
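That lm() and aov() describe the same one-way model can be confirmed by comparing their F statistics. The sketch below uses the built-in PlantGrowth dataset (plant weight by treatment group) as a stand-in for the rent data; the object names are chosen for this example. It also shows how R expands a formula containing *, using placeholder variable names y, a and b.

```r
# One-way layout with a built-in dataset (stand-in for price ~ fl)
fit_lm  <- lm(weight ~ group, data = PlantGrowth)
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# F statistic from the linear model summary ...
f_lm <- unname(summary(fit_lm)$fstatistic["value"])
# ... and from the first row of the ANOVA table
f_aov <- summary(fit_aov)[[1]][["F value"]][1]

all.equal(f_lm, f_aov)

# a * b is shorthand for a + b + a:b, so writing the main effects
# in addition to the product does not change the set of terms
term_labels <- attr(terms(y ~ a + b + a * b), "term.labels")
term_labels
```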

Analysis of covariance

The extension to the analysis of covariance is straightforward. The dataset car is based on data provided by the U.S. Environmental Protection Agency (82 cases). It contains the following variables:

BRAND   Car manufacturer
VOL     Cubic feet of cab space
HP      Engine horsepower
MPG     Average miles per gallon
SP      Top speed (mph)
WT      Vehicle weight (10 lb)

Again, the data is imported using

car <- read.table("C:/R_workshop/car.dat", header = TRUE)
attach(car)

The attach command makes it possible to access columns of car by simply entering their name. The first column can be accessed by either typing BRAND or car[,1] into the console.

The model

$$\text{MPG}_{ijk} = \mu + \alpha_i + \theta \, \text{SP}_j + e_{ijk}; \quad \alpha_i: \text{Effect of BRAND } i$$

can be analysed in R with

fit3 <- aov(MPG ~ BRAND + SP)
summary(fit3)

            Df Sum Sq Mean Sq F value    Pr(>F)
BRAND       28 6206.7   221.7  8.6259 2.056e-11 ***
SP           1  564.5   564.5 21.9667 2.038e-05 ***
Residuals   52 1336.3    25.7
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.2 Generalized Linear Models

Generalized linear models enable one to model response variables that follow any distribution from the exponential family. The R function glm() fits generalized linear models. The model formula is specified in the same way as in lm() or aov(). The distribution of the response needs to be specified, as does the link function, which expresses the monotone function of the conditional expectation of the response variable that is to be modelled as a linear combination of the covariates.

In order to obtain help for an R function, one can use the built-in help system of R:

?glm

or

help(glm)

Typically, the help document contains information on the structure of the function, an explanation of the arguments, references, examples etc. In case of glm(), there are several examples given. The examples can be examined by copying the code and pasting it into the R console. For generalized linear models, the information retrieved by

?family

is also relevant since it contains further information about the specification of the error distribution and the link function.

The following example of how to use the glm() function is given in the help file:

## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
glm.D93 <- glm(counts ~ outcome + treatment,
               family=poisson())
anova(glm.D93)
summary(glm.D93)

The first three lines are used to create an R object with the data. The fourth line (print()) displays the created data; the fitting is done in the fifth line with glm(). The last two lines (anova() and summary()) are used for displaying the results.
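The role of the link function can be made visible with the same data. In the sketch below, the names fit_default and fit_log are chosen for this example; it checks that poisson() uses the log link by default and that the fitted means are the exponentiated linear predictor.

```r
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)

# Default link of the Poisson family versus an explicitly stated log link
fit_default <- glm(counts ~ outcome + treatment, family = poisson())
fit_log     <- glm(counts ~ outcome + treatment, family = poisson(link = "log"))

family(fit_default)$link
all.equal(coef(fit_default), coef(fit_log))

# With the log link: fitted mean = exp(linear predictor)
eta <- predict(fit_log, type = "link")
all.equal(unname(fitted(fit_log)), unname(exp(eta)))
```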

2.3 Extensions

An important feature of R is its extension system. Extensions for R are delivered as packages (libraries), which can be loaded within R using the library() command. Usually, packages contain functions, datasets, help files and other files such as DLLs. (Further information on creating custom packages for R can be found on the R website.)

There exist a number of packages that offer extensions for linear models. The package mgcv contains functions for fitting generalized additive models (gam()); routines for nonparametric density estimation and nonparametric regression are offered by the sm package. An overview of the available R packages is given at http://cran.r-project.org/src/contrib/PACKAGES.html.

For example, fitting a GAM to the car dataset can be carried out as follows:

library(mgcv)
fit <- gam(MPG ~ s(HP))
summary(fit)

The first line loads the package mgcv which contains the function gam(). In the second line the variable MPG (miles per gallon) was modelled as a smooth function of HP (engine horsepower). Note that the structure of GAM formulae is almost identical to the standard ones in R; the only difference is the use of s() for indicating smooth functions. A summary of the fitted model is again given by the summary() command.

Plotting the observations and the fitted model as shown in figure 2.5 can be done in the following way:

plot(HP, MPG)
x <- seq(0, 350, length = 500)
y <- predict(fit, data.frame(HP = x))
lines(x, y, col = "red", lwd = 2)

In this case, the (generic) function predict() was used for predicting (i.e. obtaining y at the specified values of the covariate, here x). Note that the column name in the data frame passed to predict() has to match the name of the covariate used in the model formula.
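The way predict() matches new covariate values to the model can be illustrated without the mgcv package. The sketch below uses lm() on the built-in cars dataset instead of gam() on the car data, so it is a stand-in rather than the model fitted above; the essential point is that the data frame passed as newdata needs a column whose name is the covariate name from the formula.

```r
# Stand-in model: stopping distance as a linear function of speed
fit_cars <- lm(dist ~ speed, data = cars)

# New covariate values; the column must be called speed, as in the formula
new_x <- seq(5, 25, length.out = 5)
pred  <- predict(fit_cars, newdata = data.frame(speed = new_x))

# For a straight line the prediction can be reproduced by hand
manual <- coef(fit_cars)[1] + coef(fit_cars)[2] * new_x
all.equal(unname(pred), unname(manual))
```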

Figure 2.5: Fitting a simple GAM to the car data (axes: HP, MPG).

Chapter 3

Time Series Analysis

3.1 Classical Decomposition

Linear Filtering of Time Series

A key concept in traditional time series analysis is the decomposition of a given time series $X_t$ into a trend $T_t$, a seasonal component $S_t$ and the remainder or residual, $e_t$.

A common method for obtaining the trend is to use linear filters on given time series:

$$T_t = \sum_{i=-\infty}^{\infty} \lambda_i X_{t+i}$$

A simple class of linear filters are moving averages with equal weights:

$$T_t = \frac{1}{2a+1} \sum_{i=-a}^{a} X_{t+i}$$

In this case, the filtered value of a time series at a given period $\tau$ is represented by the average of the values $\{x_{\tau-a}, \ldots, x_{\tau}, \ldots, x_{\tau+a}\}$. The coefficients of the filtering are $\{\frac{1}{2a+1}, \ldots, \frac{1}{2a+1}\}$.
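The moving average formula can be checked against R's filter() function on a simulated series. In the sketch below the series (a random walk of length 30 with a fixed seed) and the choice a = 2 are arbitrary; stats::filter() is called with its package prefix only to avoid clashes with functions of the same name in other packages. With the default settings the filter is centred, so the first and last a values are missing.

```r
set.seed(1)
x <- cumsum(rnorm(30))   # simulated series (random walk)
a <- 2

# Equal weights 1/(2a+1), applied as a centred linear filter
w <- rep(1 / (2 * a + 1), 2 * a + 1)
ma_filter <- stats::filter(x, filter = w)

# The same moving average computed directly from the formula
ma_manual <- rep(NA_real_, length(x))
for (t in (a + 1):(length(x) - a)) {
  ma_manual[t] <- mean(x[(t - a):(t + a)])
}

all.equal(as.numeric(ma_filter), ma_manual)
sum(is.na(ma_filter))    # a missing values at each end
```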

Consider the dataset tui, which contains stock data for the TUI AG from Jan. 3rd, 2000 to May 14th, 2002, namely date (1st column), opening values (2nd column), highest and lowest values (3rd and 4th column), closing values (5th column) and trading volumes (6th column). The dataset has been exported from Excel as a CSV file (comma separated values). CSV files can be imported into R with the function read.csv():

tui <- read.csv("C:/R_workshop/tui.csv", header = TRUE,
                dec = ",", sep = ";")

The option dec specifies the decimal separator (in this case, a comma has been used as a decimal separator; this option is not needed when a dot is used as a decimal separator). The option sep specifies the separator used to separate entries of the rows.

Applying simple moving averages with a = 2, 12, and 40 to the closing values of the tui dataset implies using the following filters:

- a = 2: $\lambda_i = \{\frac{1}{5}, \frac{1}{5}, \frac{1}{5}, \frac{1}{5}, \frac{1}{5}\}$
- a = 12: $\lambda_i = \{\frac{1}{25}, \ldots, \frac{1}{25}\}$ (25 times)
- a = 40: $\lambda_i = \{\frac{1}{81}, \ldots, \frac{1}{81}\}$ (81 times)

The resulting filtered values are (approximately) weekly (a = 2), monthly (a = 12) and quarterly (a = 40) averages of the closing values. Filtering is carried out in R with the filter() command.

Figure 3.1: Closing values and averages for a = 2, 12 and 40.

The following code was used to create figure 3.1, which plots the closing values of the TUI shares and the averages, displayed in different colours.

plot(tui[,5], type = "l")                          # closing values
tui.1 <- filter(tui[,5], filter = rep(1/5, 5))     # a = 2
tui.2 <- filter(tui[,5], filter = rep(1/25, 25))   # a = 12
tui.3 <- filter(tui[,5], filter = rep(1/81, 81))   # a = 40
lines(tui.1, col = "red")
lines(tui.2, col = "purple")
lines(tui.3, col = "blue")
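To see what filter() computes, the following minimal sketch applies a moving average with a = 2 to a short made-up series (the vector x is purely illustrative and not part of the tui data). The call is written as stats::filter() on the assumption that another attached package might mask filter().

```r
# Simple moving average with a = 2 on a small illustrative series
x <- c(2, 4, 6, 8, 10, 12, 14)
a <- 2
ma <- stats::filter(x, filter = rep(1 / (2 * a + 1), 2 * a + 1), sides = 2)

# The filtered value at period 3 is the mean of x[1], ..., x[5]
ma[3]
mean(x[1:5])

# The first and last a values cannot be computed and are NA
is.na(ma)
```

The missing values at both ends explain why the smoothed curves in a plot start later and end earlier than the original series.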

Decomposition of Time Series

Another possibility for evaluating the trend of a time series is to use a nonparametric regression technique (which is also a special type of linear filter). The function stl() performs a seasonal decomposition of a given time series $X_t$ by determining the trend $T_t$ using "loess" regression and then computing the seasonal component $S_t$ (and the residuals $e_t$) from the differences $X_t - T_t$.

Performing the seasonal decomposition for the time series beer (monthly beer

production in Australia from Jan. 1956 to Aug. 1995) is done using the following

commands:

beer <- read.csv("C:/R_workshop/beer.csv", header = TRUE,

dec = ",", sep = ";")

beer <- ts(beer[,1], start = 1956, freq = 12)

plot(stl(log(beer), s.window = "periodic"))

The data is read from C:/R_workshop/beer.csv and then transformed with ts() into a ts object. This "transformation" is required for most of the time series functions, since a time series contains more information than the values themselves, namely information about the dates and frequencies at which the time series has been recorded.
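The components computed by stl() can also be inspected directly. The sketch below uses the built-in monthly series co2 as a stand-in for beer (an assumption made only so that the example runs without the CSV file) and checks that the three components add up to the original series.

```r
# Decompose a built-in monthly series (co2 stands in for beer here)
x <- log(co2)
dec <- stl(x, s.window = "periodic")

# The components are stored as columns of a multivariate time series
colnames(dec$time.series)   # "seasonal" "trend" "remainder"

# Seasonal + trend + remainder reproduces the original series
recon <- rowSums(dec$time.series)
all.equal(as.numeric(recon), as.numeric(x))
```

Individual components can be extracted in the same way, e.g. dec$time.series[, "trend"].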

Figure 3.2: Seasonal decomposition using stl() (panels: data, seasonal, trend, remainder; x-axis: time).

Regression analysis

R offers the functions lsfit() (least squares fit) and lm() (linear models, a more general function) for regression analysis. This section focuses on lm(), since it offers more "features", especially when it comes to testing the significance of the coefficients.

Consider again the beer data. Assume that we want to fit the following model (a parabola) to the logs of beer:

$\log(X_t) = \alpha_0 + \alpha_1 \cdot t + \alpha_2 \cdot t^2 + e_t$

The fitting can be carried out in R with the following commands:

lbeer <- log(beer)
t <- seq(1956, 1995 + 7/12, length = length(lbeer))
t2 <- t^2
plot(lbeer)
lm(lbeer ~ t + t2)
# t is passed as x so that the fit is drawn on the time axis of the plot
lines(t, lm(lbeer ~ t + t2)$fit, col = 2, lwd = 2)

Figure 3.3: Fitting a parabola to lbeer with lm().

In the first command above, logs of beer are computed and stored as lbeer. Explanatory variables ($t$ and $t^2$ as t and t2) are defined in the second and third row. The actual fit of the model is done using lm(lbeer ~ t + t2). The function lm() returns a list object, whose elements can be accessed using the "$" sign: lm(lbeer ~ t + t2)$coefficients returns the estimated coefficients ($\hat{\alpha}_0$, $\hat{\alpha}_1$ and $\hat{\alpha}_2$); lm(lbeer ~ t + t2)$fit returns the fitted values of the model, i.e. the fitted values of $\log(X_t)$.
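As a small illustration of how the elements of an lm() fit can be accessed, the sketch below fits a parabola to made-up, noise-free data (the values of y are an assumption chosen so that the true coefficients are known). It stores the fit in an object instead of calling lm() repeatedly and uses the extractor functions coef() and fitted().

```r
# Noise-free parabola, so lm() should recover the coefficients
t  <- 1:20
t2 <- t^2
y  <- 1 + 0.5 * t - 0.02 * t2

fit <- lm(y ~ t + t2)
coef(fit)          # same as fit$coefficients
head(fitted(fit))  # same as head(fit$fitted.values)
```

Saving the fit once avoids re-estimating the model every time a component is needed.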

Extending the model to

$\log(X_t) = \alpha_0 + \alpha_1 \cdot t + \alpha_2 \cdot t^2 + \beta \cdot \cos\left(\frac{2\pi t}{12}\right) + \gamma \cdot \sin\left(\frac{2\pi t}{12}\right) + e_t$

so that it includes the first Fourier frequency is straightforward. After defining the two additional explanatory variables, cos.t and sin.t, the model can be estimated in the usual way:

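lbeer <- log(beer)
t <- seq(1956, 1995 + 7/12, length = length(lbeer))
t2 <- t^2
sin.t <- sin(2*pi*t)   # first Fourier frequency (t is measured in years)
cos.t <- cos(2*pi*t)
plot(lbeer)
# t is passed as x so that the fit is drawn on the time axis of the plot
lines(t, lm(lbeer ~ t + t2 + sin.t + cos.t)$fit, col = 4)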

Note that in this case sin.t and cos.t do not include 12 in the denominator, since the factor $\frac{1}{12}$ has already been taken into account during the transformation of beer and the construction of t (t is measured in years, so one month corresponds to a step of 1/12).

Figure 3.4: Fitting a parabola and the first Fourier frequency to lbeer.

Another important aspect in regression analysis is to test the significance of the coefficients. In the case of lm(), one can use the summary() command:

summary(lm(lbeer ~ t + t2 + sin.t + cos.t))

which returns the following output:

Call:
lm(formula = lbeer ~ t + t2 + sin.t + cos.t)

Residuals:
       Min         1Q     Median         3Q        Max
-0.2722753 -0.0686953 -0.0006432  0.0695916  0.2370383

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.734e+03  1.474e+02 -25.330  < 2e-16 ***
t            3.768e+00  1.492e-01  25.250  < 2e-16 ***
t2          -9.493e-04  3.777e-05 -25.137  < 2e-16 ***
sin.t       -4.870e-02  6.297e-03  -7.735 6.34e-14 ***
cos.t        1.361e-01  6.283e-03  21.655  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09702 on 471 degrees of freedom
Multiple R-Squared: 0.8668, Adjusted R-squared: 0.8657
F-statistic: 766.1 on 4 and 471 DF, p-value: < 2.2e-16

Apart from the coefficient estimates and their standard errors, the output also includes the corresponding t-statistics and p-values. In the output shown above, all coefficients, i.e. $\alpha_0$ (Intercept), $\alpha_1$ (t), $\alpha_2$ (t2), $\gamma$ (sin.t) and $\beta$ (cos.t), differ significantly from zero. (Even if one of the two trigonometric terms did not seem to be significant, one might include it anyway, since Fourier frequencies are usually taken in pairs of sine and cosine.)
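The coefficient table printed by summary() can also be accessed as a matrix, which is convenient when the p-values are needed for further computations. The following sketch uses the built-in cars dataset as an assumed stand-in for the beer regression.

```r
# Regression on a built-in dataset (cars stands in for the beer model)
fit <- lm(dist ~ speed, data = cars)

# coef() applied to the summary returns the coefficient table as a matrix
tab <- coef(summary(fit))
colnames(tab)

# Which coefficients differ significantly from zero at the 5% level?
tab[, "Pr(>|t|)"] < 0.05
```

Rows of the matrix are named after the terms of the model, so single entries can be addressed as tab["speed", "Estimate"].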

3.2 Exponential Smoothing

Introductory Remarks

One method of forecasting the next value $x_{n+1}$ of a time series $x_t$, $t = 1, 2, \ldots, n$ is to use a weighted average of past observations:

$\hat{x}_n(1) = \lambda_0 \cdot x_n + \lambda_1 \cdot x_{n-1} + \ldots$

The popular method of exponential smoothing assigns geometrically decreasing weights:

$\lambda_i = \alpha (1 - \alpha)^i; \quad 0 < \alpha < 1$

such that

$\hat{x}_n(1) = \alpha \cdot x_n + \alpha (1 - \alpha) \cdot x_{n-1} + \alpha (1 - \alpha)^2 \cdot x_{n-2} + \ldots$

In its basic form exponential smoothing is applicable to time series with no systematic trend and/or seasonal components. It has been generalized to the "Holt-Winters" procedure in order to deal with time series containing trend and seasonal variation. In this case, three smoothing parameters are required, namely $\alpha$ (for the level), $\beta$ (for the trend) and $\gamma$ (for the seasonal variation).
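The weighted average above can be computed recursively. The following minimal sketch is not the HoltWinters() implementation; the function name ses_forecast, the example data and the choice of the first observation as starting value are assumptions made for illustration. It shows that the recursion and the geometrically weighted sum give the same forecast.

```r
# Simple exponential smoothing: level_k = alpha * x_k + (1 - alpha) * level_(k-1)
ses_forecast <- function(x, alpha) {
  level <- x[1]                      # assumed starting value
  for (k in seq_along(x)[-1]) {
    level <- alpha * x[k] + (1 - alpha) * level
  }
  level                              # one-step-ahead forecast
}

x <- c(10, 12, 11, 13, 12, 14)
alpha <- 0.3
n <- length(x)

# Same forecast written as a weighted sum with weights alpha * (1 - alpha)^i;
# the remaining weight (1 - alpha)^(n - 1) falls on the first observation
i <- 0:(n - 2)
weighted <- sum(alpha * (1 - alpha)^i * x[n - i]) + (1 - alpha)^(n - 1) * x[1]

ses_forecast(x, alpha)
weighted
```

For a long series the weight on the starting value becomes negligible, which is why the infinite sum in the text is a good approximation.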

Exponential Smoothing and Prediction of Time Series

The ts package (merged into the stats package in later versions of R) offers the function HoltWinters(x, alpha, beta, gamma), which lets one apply the Holt-Winters procedure to a time series x. One can specify the three smoothing parameters with the options alpha, beta and gamma. Particular components can be excluded by setting the value of the corresponding parameter to zero, e.g. one can exclude the seasonal component by specifying gamma=0 (current versions of R expect gamma=FALSE for this purpose). If one does not specify smoothing parameters, these are computed "automatically" (i.e. by minimizing the mean squared prediction error from the one-step-ahead forecasts).

Thus, the exponential smoothing of the beer dataset can be performed as follows:

beer <- read.csv("C:/beer.csv", header = TRUE, dec = ",",

sep = ";")

beer <- ts(beer[,1], start = 1956, freq = 12)

The above commands load the dataset from the CSV file and transform it to a ts object.

HoltWinters(beer)

This performs the Holt-Winters procedure on the beer dataset. It displays a list with e.g. the smoothing parameters ($\alpha \approx 0.076$, $\beta \approx 0.07$ and $\gamma \approx 0.145$ in this case). Another component of the list is the entry fitted, which can be accessed using HoltWinters(beer)$fitted:

plot(beer)

lines(HoltWinters(beer)$fitted[,1], col = "red")

Figure 3.5: Exponential smoothing of the beer data.

R offers the function predict(), which is a generic function for predictions from various models. In order to use predict(), one has to save the "fit" of a model to an object, e.g.:

beer.hw <- HoltWinters(beer)

In this case, we have saved the fit from the Holt-Winters procedure on beer as beer.hw.

predict(beer.hw, n.ahead = 12)

returns the predicted values for the next 12 periods (i.e. Sep. 1995 to Aug. 1996).

The following commands can be used to create a graph with the predictions for

the next 4 years (i.e. 48 months):

plot(beer, xlim=c(1956, 1999))

lines(predict(beer.hw, n.ahead = 48), col = 2)
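The same steps can be tried without the CSV file. The sketch below uses the built-in monthly series co2 as an assumed stand-in for beer; it extracts the automatically chosen smoothing parameters and checks where the forecast starts.

```r
# Fit Holt-Winters with automatically chosen smoothing parameters
# (co2 stands in for the beer series)
co2.hw <- HoltWinters(co2)
c(alpha = unname(co2.hw$alpha),
  beta  = unname(co2.hw$beta),
  gamma = unname(co2.hw$gamma))

# Forecast the next 12 months; co2 ends in Dec. 1997
co2.pred <- predict(co2.hw, n.ahead = 12)
start(co2.pred)
```

The forecast is itself a time series, so it can be passed to lines() and is drawn at the correct position on the time axis.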

Figure 3.6: Predicting beer with exponential smoothing.

3.3 ARIMA Models

Introductory Remarks

Forecasting based on ARIMA (autoregressive integrated moving average) models, sometimes referred to as the Box-Jenkins approach, comprises the following stages:

i.) Model identification

ii.) Parameter estimation

iii.) Diagnostic checking

These stages are repeated iteratively until a satisfactory model for the given data has been identified (e.g. for prediction). The following three sections show some facilities that R offers for carrying out these three stages.

Analysis of Autocorrelations and Partial Autocorrelations

A first step in analysing time series is to examine the autocorrelations (ACF) and partial autocorrelations (PACF). R provides the functions acf() and pacf() for computing and plotting the ACF and PACF. The order of "pure" AR and MA processes can be identified from the ACF and PACF as shown below:

sim.ar <- arima.sim(list(ar = c(0.4, 0.4)), n = 1000)
sim.ma <- arima.sim(list(ma = c(0.6, -0.4)), n = 1000)
par(mfrow = c(2, 2))
acf(sim.ar, main = "ACF of AR(2) process")
acf(sim.ma, main = "ACF of MA(2) process")
pacf(sim.ar, main = "PACF of AR(2) process")
pacf(sim.ma, main = "PACF of MA(2) process")

The function arima.sim() was used to simulate the ARIMA(p,d,q) models.

Figure 3.7: ACF and PACF of AR and MA models (panels: ACF of AR(2) process, ACF of MA(2) process, PACF of AR(2) process, PACF of MA(2) process; x-axis: Lag).

In the first line, 1000 observations of an ARIMA(2,0,0) model (i.e. an AR(2) model) were simulated and saved as sim.ar. Similarly, the second line simulated 1000 observations from an MA(2) model and saved them to sim.ma.

A useful command for graphical displays is par(mfrow=c(h,v)), which splits the graphics window into (h × v) regions; in this case we have set up 4 separate regions within the graphics window.

The last four lines create the ACF and PACF plots of the two simulated processes. Note that by default the plots include confidence intervals (based on uncorrelated series).
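The identification rule behind these plots is that the PACF of a pure AR(p) process vanishes beyond lag p, while the ACF of a pure MA(q) process vanishes beyond lag q. As a sketch, the theoretical values for the two processes simulated above can be computed with ARMAacf(), so that they can be compared with the sample estimates from acf() and pacf():

```r
# Theoretical PACF of the AR(2) process (lags 1 to 5): zero beyond lag 2
pacf.ar <- ARMAacf(ar = c(0.4, 0.4), lag.max = 5, pacf = TRUE)
round(pacf.ar, 3)

# Theoretical ACF of the MA(2) process (lags 0 to 5): zero beyond lag 2
acf.ma <- ARMAacf(ma = c(0.6, -0.4), lag.max = 5)
round(acf.ma, 3)
```

The sample ACF and PACF of simulated data only approximate these values, so estimates at higher lags are judged against the confidence bands in the plots.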

Estimating Parameters of ARIMA Models

Once the order of the ARIMA(p,d,q) model has been specified, the parameters can be estimated using the function arima() from the ts package (part of the stats package in later versions of R):

arima(data, order = c(p, d, q))

Fitting e.g. an ARIMA(1,0,1) model on the LakeHuron dataset (annual levels of Lake Huron from 1875 to 1972) is done using

data(LakeHuron)
fit <- arima(LakeHuron, order = c(1, 0, 1))

In this case fit is a list containing e.g. the coefficients (fit$coef), residuals (fit$residuals) and the Akaike Information Criterion AIC (fit$aic).
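A short sketch of how the components of the fitted object can be inspected is given below; it only uses the LakeHuron fit from above together with the extractor functions coef() and residuals().

```r
fit <- arima(LakeHuron, order = c(1, 0, 1))

# Names of the estimated coefficients: AR term, MA term and mean level
names(coef(fit))

# One residual per observation (98 annual values)
length(residuals(fit))

# AIC, e.g. for comparing models of different order
fit$aic
```

When several candidate orders are considered, the model with the smallest AIC is a common (though not the only) choice.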

Diagnostic Checking

A first step in diagnostic checking of fitted models is to analyse the residuals from the fit for any signs of non-randomness. R has the function tsdiag(), which produces a diagnostic plot of a fitted time series model:

fit <- arima(LakeHuron, order = c(1, 0, 1))
tsdiag(fit)

It produces the output shown in figure 3.8: a plot of the residuals, the autocorrelation of the residuals and the p-values of the Ljung-Box statistic for the first 10 lags.

The Box-Pierce (and Ljung-Box) test examines the null hypothesis of independently distributed residuals. It is derived from the idea that the residuals of a "correctly specified" model are independently distributed. If the residuals are not, then they come from a misspecified model. The function Box.test() computes the test statistic for a given lag:

Box.test(fit$residuals, lag = 1)
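By default Box.test() computes the Box-Pierce statistic. The sketch below requests the Ljung-Box variant for 10 lags; the setting fitdf = 2, which reduces the degrees of freedom by the number of estimated AR and MA coefficients, is an assumption that reflects common practice rather than something prescribed in the text.

```r
fit <- arima(LakeHuron, order = c(1, 0, 1))

# Ljung-Box test on the residuals for the first 10 lags
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 2)
lb$statistic
lb$parameter   # degrees of freedom: lag - fitdf
lb$p.value
```

A large p-value gives no evidence against independently distributed residuals.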

Figure 3.8: Output from tsdiag() (panels: Standardized Residuals, ACF of Residuals, p values for Ljung-Box statistic).

Prediction of ARIMA Models

Once a model has been identified and its parameters have been estimated, one can predict future values of a time series. Let's assume that we are satisfied with the fit of an ARIMA(1,0,1) model to the LakeHuron data:

fit <- arima(LakeHuron, order = c(1, 0, 1))

As with exponential smoothing, the function predict() can be used for predicting future values of the levels under the model:

LH.pred <- predict(fit, n.ahead = 8)

Here we have predicted the levels of Lake Huron for the next 8 years (i.e. until 1980). In this case, LH.pred is a list containing two entries, the predicted values LH.pred$pred and the standard errors of the prediction LH.pred$se. Using the familiar rule of thumb for an approximate confidence interval (95%) for the prediction, "prediction ± 2·SE", one can plot the Lake Huron data, the predicted values and the corresponding approximate confidence intervals:

plot(LakeHuron, xlim = c(1875, 1980), ylim = c(575, 584))

LH.pred <- predict(fit, n.ahead = 8)


lines(LH.pred$pred, col = "red")

lines(LH.pred$pred + 2*LH.pred$se, col = "red", lty = 3)

lines(LH.pred$pred - 2*LH.pred$se, col = "red", lty = 3)

First, the levels of Lake Huron are plotted. In order to leave some space for adding the predicted values, the x-axis has been set to the interval 1875 to 1980 using the optional argument xlim=c(1875,1980); the use of ylim is purely for visual enhancement. The prediction takes place in the second line using predict() on the fitted model. Adding the prediction and the approximate confidence interval is done in the last three lines. The confidence bands are drawn as red, dotted lines (using the options col="red" and lty=3).
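The interval limits can also be computed as numbers instead of being drawn. The sketch below replaces the factor 2 of the rule of thumb by the normal quantile qnorm(0.975); this assumes approximately normal prediction errors and is meant as an illustration only.

```r
fit <- arima(LakeHuron, order = c(1, 0, 1))
LH.pred <- predict(fit, n.ahead = 8)

z <- qnorm(0.975)   # about 1.96, close to the rule-of-thumb factor 2
lower <- LH.pred$pred - z * LH.pred$se
upper <- LH.pred$pred + z * LH.pred$se

# Table of predictions with approximate 95% limits for 1973 to 1980
cbind(lower, prediction = LH.pred$pred, upper)
```

The standard errors grow with the forecast horizon, so the interval widens for later years.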

Figure 3.9: Lake Huron levels and predicted values.

Chapter 4

Advanced Graphics

4.1 Customizing Plots

Labelling graphs

R offers various means for annotating graphs. Consider a histogram of 100 nor-

mally distributed random numbers given by

hist(rnorm(100), prob = TRUE)

Assume that we want to have a custom title and different labels for the axes as shown in figure 4.1. The relevant options are main (for the title), xlab and ylab (axes labels):

hist(rnorm(100), prob = TRUE, main = "custom title",

xlab = "x label", ylab = "y label")

The title and the labels are entered as characters, i.e. in quotation marks. To include quotation marks in the title itself, a backslash is required before each quotation mark: \". The backslash is also used for some other commands, such as line breaks. Using \n results in a line feed, e.g.

main = "first part \n second part"

within a plot command writes "first part" in the first line of the title and "second part" in the second line.

Figure 4.1: Customizing the main title and the axes labels using main, xlab and ylab.

Setting font face and font size

The option font allows for (limited) control over the font type used for annotations. It is specified by an integer. Additionally, different font types can be used for different parts of the graph:

font.axis # specifies the font for the axis annotations
font.lab # specifies the font for axis labels
font.main # specifies the font for the (main) title
font.sub # specifies the font for the subtitle

The use of the font options is illustrated in the example below:

hist(rnorm(100), sub = "subtitle", font.main = 6,
     font.lab = 7, font.axis = 8, font.sub = 9)

Integer codes used in this example are:

6 : Times font

7 : Times font, italic

8 : Times font, boldface

9 : Times font, italic and boldface

The text size can be controlled using the cex option ("character expansion"). Again, the cex option also has "subcategories" such as cex.axis, cex.lab, cex.main and cex.sub. The size of text is specified using a relative value (e.g. cex=1 doesn't change the size, cex=0.8 reduces the size to 80% and cex=1.2 enlarges the size to 120%).

A complete list of "graphical parameters" is given in the help file for the par() command, i.e. by typing

?par

into the console.

Another useful command for labelling graphs is text(a, b, "content"), which adds "content" to an existing plot at the given coordinates (x = a, y = b):

hist(rnorm(500), prob = TRUE)

text(2, 0.2, "your text here")

Figure 4.2: Adding text to an existing plot using text().

Specification of Colours

There exist various means of specifying colours in R. One way is to use the "R names" such as col="blue", col="red" or even col="mediumslateblue". (A complete list of available colour names is obtained with colours().) Alternatively, one can use numerical codes to specify the colours, e.g. col=2 (for red), col=3 (for green) etc. Colours can also be specified in hexadecimal code (as in HTML), e.g. col="#FF0000" denotes red. Similarly, one can use col=rgb(1,0,0) for red. The rgb() command is especially useful for custom colour palettes.
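The different colour specifications can be converted into each other. The following sketch shows the hexadecimal code returned by rgb(), the RGB components of a named colour obtained with col2rgb(), and a small custom palette (the palette reds is a made-up example):

```r
# rgb() returns the hexadecimal code of a colour
rgb(1, 0, 0)        # "#FF0000"

# col2rgb() returns the red, green and blue components (0 to 255)
col2rgb("red")

# A custom palette of 5 shades from black to pure red
reds <- rgb(seq(0, 1, length.out = 5), 0, 0)
reds
```

Such a vector of colours can be passed to the col argument of plotting functions, e.g. image(volcano, col = reds).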

R offers a few predefined colour palettes. These are illustrated on the volcano data example below:

data(volcano)

par(mfrow = c(2, 2))

image(volcano, main = "heat.colors")

image(volcano, main = "rainbow", col = rainbow(15))

image(volcano, main = "topo", col = topo.colors(15))

image(volcano, main = "terrain.colors",

col = terrain.colors(15))

Figure 4.3: Some predefined colour palettes available in R (panels: heat.colors, rainbow, topo.col, terrain.colors).

The resulting image maps with different colour palettes are shown in figure 4.3. The (internal) dataset volcano, containing topographic information for Maunga Whau on a 10m by 10m grid, is loaded by entering data(volcano). The command par(mfrow=c(a,b)) is used to split the graphics window into a × b regions (a rows and b columns). The image() function creates an image map of a given matrix.

4.2 Mathematical Annotations

Occasionally, it is useful to add mathematical annotations to plots. Let's assume we want to investigate the relationship between HP (horsepower) and MPG (miles per gallon) from the car dataset by fitting the following two models to the data:

$M_1: \quad MPG_i = \beta_0 + \beta_1 \cdot HP_i + e_i$

$M_2: \quad MPG_i = \beta_0 + \beta_1 \cdot HP_i + \beta_2 \cdot HP_i^2 + e_i$

Fitting the models and plotting the observations along with the two fitted models is done with

car <- read.table("C:/R workshop/car.dat", header = TRUE)
attach(car)
M1 <- lm(MPG ~ HP)                # linear model M1
HP2 <- HP^2
M2 <- lm(MPG ~ HP + HP2)          # quadratic model M2
plot(HP, MPG, pch = 16)
x <- seq(0, 350, length = 500)    # grid for drawing the fitted curves
y1 <- M1$coef[1] + M1$coef[2]*x
y2 <- M2$coef[1] + M2$coef[2]*x + M2$coef[3]*x^2
lines(x, y1, col="red")
lines(x, y2, col="blue")

In order to add mathematical expressions, R offers the function expression(), which can be used e.g. in conjunction with the text command:

text(200, 55, expression(bold(M[1])*":"*hat(MPG)==
     hat(beta)[0] + hat(beta)[1]*"HP"),
     col = "red", adj = 0)
text(200, 50, expression(bold(M[2])*":"*hat(MPG)==
     hat(beta)[0] + hat(beta)[1]*"HP" + hat(beta)[2]*"HP"^2),
     col = "blue", adj = 0)

Figure 4.4: Using expression() for mathematical annotations.

Further options for annotating plots can be found in the examples given in the help documentation of the legend() function. A list of available expressions is given in the appendix.

4.3 Three-Dimensional Plots

Perspective plots

The R function persp() can be used to create 3D plots of surfaces. A 3D display of the volcano data can be created with

data(volcano)

persp(volcano)

persp(volcano, theta = 70, phi = 40)

Figure 4.5: 3D plots with persp().

The 3D space can be "navigated" by changing the parameters theta (azimuthal direction) and phi (colatitude).

Further options are illustrated in the example below:

par(mfrow=c(1,2))

# example 1:

persp(volcano, col = "green", border = NA, shade = 0.9,

theta = 70, phi = 40, ltheta = 120, box = FALSE,

axes = FALSE, expand = 0.5)

# example 2:

collut <- terrain.colors(101)

temp <- 1 + 100*(volcano-min(volcano)) /

(diff(range(volcano)))

mapcol <- collut[temp[1:86, 1:61]]

persp(volcano, col = mapcol, border = NA, theta = 70,

phi = 40,shade = 0.9, expand = 0.5, ltheta = 120,


lphi = 30)

Figure 4.6: Advanced options for persp().

Plotting functions of two variables

In order to display two-dimensional functions f(x, y) with persp(), the following R objects are required: x, y (grid mark vectors) and the values z = f(x, y), which are stored as a matrix. A useful function here is outer(x,y,f), which computes the values of z = f(x, y) for all pairs of entries in the vectors x and y. In the example given below, the vectors x and y are specified and then the function f is defined. The command outer() creates a matrix, which is stored as z. It contains the values of f(x, y) for all points on the specified x-y grid. Finally, persp() is used to generate the plot (figure 4.7):

y <- x <- seq(-2, 2, length = 20)

f <- function(x, y)

{

fxy <- -x^2 - y^2

return(fxy)

}

z <- outer(x, y, f)

persp(x, y, z, theta = 30, phi = 30)
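The structure of the matrix returned by outer() can be checked on a small grid. In the sketch below the grid sizes are chosen differently for x and y (an illustrative assumption) so that rows and columns can be told apart; note that f must be vectorized, since outer() calls it once with all pairs of grid points.

```r
x <- seq(-2, 2, length = 5)
y <- seq(-1, 1, length = 3)
f <- function(x, y) -x^2 - y^2

z <- outer(x, y, f)

# Rows correspond to x and columns to y
dim(z)                      # 5 3

# Entry [i, j] holds f(x[i], y[j])
z[2, 3] == f(x[2], y[3])    # TRUE
```

This is the orientation persp() expects: z must have length(x) rows and length(y) columns.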

Figure 4.7: Plotting 2D functions with persp().

4.4 RGL: 3D Visualization in R using OpenGL

RGL is an R package which was designed to overcome some limitations for 3D graphics. It uses OpenGL as the rendering backend and is freely available at the URI

http://134.76.173.220/dadler/rgl/index.html

Further information on RGL can be found on the website and on the slides "RGL: An R-Library for 3D Visualization in R" (rgl.ppt) in your working directory.

Appendix A

R-functions

A.1 Mathematical Expressions (expression())

ARITHMETIC OPERATORS:

Expression                    Result
x+y                           $x + y$
x-y                           $x - y$
x*y                           $xy$
x/y                           $x/y$
x%+-%y                        $x \pm y$
x%/%y                         $x \div y$
x%*%y                         $x \times y$
-x                            $-x$
+x                            $+x$

SUB- AND SUPERSCRIPTS:

Expression                    Result
x[i]                          $x_i$
x^2                           $x^2$

JUXTAPOSITION:

Expression                    Result
x*y                           $xy$
paste(x,y,z)                  $xyz$

LISTS:

Expression                    Result
list(x,y,z)                   $x, y, z$

RADICALS:

Expression                    Result
sqrt(x)                       $\sqrt{x}$
sqrt(x,y)                     $\sqrt[y]{x}$

RELATIONS:

Expression                    Result
x==y                          $x = y$
x!=y                          $x \neq y$
x<y                           $x < y$
x<=y                          $x \leq y$
x>y                           $x > y$
x>=y                          $x \geq y$
x%~~%y                        $x \approx y$
x%=~%y                        $x \cong y$
x%==%y                        $x \equiv y$
x%prop%y                      $x \propto y$

SYMBOLIC NAMES:

Expression                    Result
Alpha-Omega                   $A - \Omega$
alpha-omega                   $\alpha - \omega$
infinity                      $\infty$
32*degree                     $32^{\circ}$
60*minute                     $60'$
30*second                     $30''$

ELLIPSIS:

Expression                    Result
list(x[1],...,x[n])           $x_1, \ldots, x_n$
x[1]+...+x[n]                 $x_1 + \cdots + x_n$
list(x[1],cdots,x[n])         $x_1, \cdots, x_n$
x[1]+ldots+x[n]               $x_1 + \ldots + x_n$

SET RELATIONS:

Expression                    Result
x%subset%y                    $x \subset y$
x%subseteq%y                  $x \subseteq y$
x%supset%y                    $x \supset y$
x%supseteq%y                  $x \supseteq y$
x%notsubset%y                 $x \not\subset y$
x%in%y                        $x \in y$
x%notin%y                     $x \notin y$

ACCENTS:

Expression                    Result
hat(x)                        $\hat{x}$
tilde(x)                      $\tilde{x}$
ring(x)                       $\mathring{x}$
bar(x)                        $\bar{x}$
widehat(xy)                   $\widehat{xy}$
widetilde(xy)                 $\widetilde{xy}$

ARROWS:

Expression                    Result
x%<->%y                       $x \leftrightarrow y$
x%->%y                        $x \rightarrow y$
x%<-%y                        $x \leftarrow y$
x%up%y                        $x \uparrow y$
x%down%y                      $x \downarrow y$
x%<=>%y                       $x \Leftrightarrow y$
x%=>%y                        $x \Rightarrow y$
x%<=%y                        $x \Leftarrow y$
x%dblup%y                     $x \Uparrow y$
x%dbldown%y                   $x \Downarrow y$

SPACING:

Expression                    Result
x~~y                          $x \; y$
x+phantom(0)+y                $x + \phantom{0} + y$
x+over(1,phantom(0))          $x + \frac{1}{\phantom{0}}$

FRACTIONS:

Expression                    Result
frac(x,y)                     $\frac{x}{y}$
over(x,y)                     $\frac{x}{y}$
atop(x,y)                     ${x \atop y}$

STYLE:

Expression                    Result
displaystyle(x)               $\displaystyle x$
textstyle(x)                  $\textstyle x$
scriptstyle(x)                $\scriptstyle x$
scriptscriptstyle(x)          $\scriptscriptstyle x$

TYPEFACE:

Expression                    Result
plain(x)                      $\mathrm{x}$
italic(x)                     $\mathit{x}$
bold(x)                       $\mathbf{x}$
bolditalic(x)                 $\boldsymbol{x}$

BIG OPERATORS:

Expression                    Result
sum(x[i],i==1,n)              $\sum_{i=1}^{n} x_i$
prod(plain(P)(X==x),x)        $\prod_{x} \mathrm{P}(X = x)$
integral(f(x)*dx,a,b)         $\int_a^b f(x)\,dx$
union(A[i],i==1,n)            $\bigcup_{i=1}^{n} A_i$
intersect(A[i],i==1,n)        $\bigcap_{i=1}^{n} A_i$
lim(f(x),x%->%0)              $\lim_{x \rightarrow 0} f(x)$
min(g(x),x>=0)                $\min_{x \geq 0} g(x)$
inf(S)                        $\inf S$
sup(S)                        $\sup S$

GROUPING:

Expression                    Result
(x+y)*z                       $(x + y)z$
x^y+z                         $x^y + z$
x^(y+z)                       $x^{(y+z)}$
x^{y+z}                       $x^{y+z}$
group("(",list(a,b),"]")      $(a, b]$
bgroup("(",atop(x,y),")")     $\left( {x \atop y} \right)$
group(lceil,x,rceil)          $\lceil x \rceil$
group(lfloor,x,rfloor)        $\lfloor x \rfloor$
group("|",x,"|")              $|x|$

A.2 The RGL Functionset

DEVICE MANAGEMENT:

rgl.open()                    Opens a new device.
rgl.close()                   Closes the current device.
rgl.cur()                     Returns the number of the active device.
rgl.set(which)                Sets a device as active.
rgl.quit()                    Shuts down the subsystem and detaches RGL.

SCENE MANAGEMENT:

rgl.clear(type="shapes")      Clears the scene from the stack of specified type ("shapes" or "lights").
rgl.pop(type="shapes")        Removes the last added node from stack.

EXPORT FUNCTIONS:

rgl.snapshot(file)            Saves a screenshot of the current scene in PNG format.

SHAPE FUNCTIONS:

rgl.points(x,y,z,...)         Add points at (x, y, z).
rgl.lines(x,y,z,...)          Add lines with nodes $(x_i, y_i, z_i)$, i = 1, 2.
rgl.triangles(x,y,z,...)      Add triangles with nodes $(x_i, y_i, z_i)$, i = 1, 2, 3.
rgl.quads(x,y,z,...)          Add quads with nodes $(x_i, y_i, z_i)$, i = 1, 2, 3, 4.
rgl.spheres(x,y,z,r,...)      Add spheres with center (x, y, z) and radius r.
rgl.texts(x,y,z,text,...)     Add texts at (x, y, z).
rgl.sprites(x,y,z,r,...)      Add 3D sprites at (x, y, z) and half-size r.
rgl.surface(x,y,z,...)        Add surface defined by two grid mark vectors x and y and a surface height matrix z.

ENVIRONMENT SETUP:

rgl.viewpoint(theta,phi,fov,zoom,interactive)
                              Sets the viewpoint (theta, phi) in polar coordinates with a field-of-view angle fov and a zoom factor zoom. The logical flag interactive specifies whether or not navigation is allowed.
rgl.light(theta,phi,...)      Adds a light source to the scene.
rgl.bg(...)                   Sets the background.
rgl.bbox(...)                 Sets the bounding box.

APPEARANCE FUNCTIONS:

rgl.material(...)             Generalized interface for appearance parameters.
