

1 plot               plot(total_toxic_chemicals/Surface_Area, deaths_total/Population)
2 pairs              pairs(batting.2008[batting.2008$AB>100, c("H","R","SO","BB","HR")])
3 plot (time series) data(turkey.price.ts)
                     plot(turkey.price.ts)
4 barplot            barplot(doctorates.m, beside=TRUE, horiz=TRUE, legend=TRUE, cex.names=.75)
5 pie                pie(acres.harvested, init.angle=100, cex=.6)
6 cdplot             cdplot(bats~AVG, data=batting.w.names.2008,
                       subset=(batting.w.names.2008$AB>100))
7 persp              halfdome <- yosemite[(nrow(yosemite)-ncol(yosemite)+1):562, seq(from=253, to=1)]
                     persp(halfdome, col=grey(.25), border=NA, expand=.15,
                       theta=225, phi=20, ltheta=45, lphi=20, shade=.75)
8 hist               hist(batting.2008$PA)
9 plot(density)      plot(density(batting.2008[batting.2008$PA>25, "PA"]))
10 rug               rug(batting.2008[batting.2008$PA>25, "PA"])
11 qqnorm            qqnorm(batting.2008$AB)
12 boxplot           boxplot(...,
                       data=batting.2008[batting.2008$PA>100 & batting.2008$lgID=="AL",],
                       cex.axis=.7)

lattice plots
1 xyplot             xyplot(y~x|z, data=d)
2 barchart           library(lattice)
                     births.dow <- table(births2006.smpl$DOB_WK)
                     births2006.dm <- transform(
                       births2006.smpl[births2006.smpl$DMETH_REC != "Unknown",],
                       DMETH_REC=as.factor(as.character(DMETH_REC)))
                     dob.dm.tbl <- table(WK=births2006.dm$DOB_WK, MM=births2006.dm$DMETH_REC)
3 dotplot            dob.dm.tbl.alt <- table(WEEK=births2006.dm$DOB_WK,
                       MONTH=births2006.dm$DOB_MM,
                       METHOD=births2006.dm$DMETH_REC)
                     dotplot(as.factor(Speed_At_Failure_km_h)~Time_To_Failure|Tire_Type,
                       data=tires.sus)

4 histogram histogram(~DBWT|DPLURAL, data=births2006.smpl)

5 densityplot        densityplot(~DBWT, groups=DPLURAL, data=births2006.smpl,
                       plot.points=FALSE, auto.key=TRUE)
6 stripplot          stripplot(~DBWT, data=births2006.smpl,
                       subset=(DPLURAL=="5 Quintuplet or higher" |
                               DPLURAL=="4 Quadruplet"),
                       jitter.data=TRUE)

7 qqmath             qqmath(rnorm(100000))
                     qqmath(...,
                       data=births2006.smpl[sample(1:nrow(births2006.smpl), 50000), ],
                       subset=(DPLURAL != "5 Quintuplet or higher"))

8 scatter plot       trellis.par.set(fontsize=list(text=7))
                     xyplot(price~squarefeet|zip, data=sanfrancisco.home.sales,
                       subset=(zip!=94100 & zip!=94104 & zip!=94108 &
                               zip!=94111 & zip!=94133 & zip!=94158 &
                               price<4000000 &
                               ifelse(is.na(squarefeet), FALSE, squarefeet<6000)),
                       strip=strip.custom(strip.levels=TRUE))
9 bwplot             bwplot(log(price)~cut(saledate, "month"),
                       data=sanfrancisco.home.sales,
                       scales=list(x=list(rot=90)))

Statistics with R
1 mean               mean(dow30$Open)
2 min                min(dow30$Open)
3 max                max(dow30$Open)
4 mean (ignoring NAs) mean(c(1,2,3,4,5,NA), na.rm=TRUE)
5 range              range(dow30$Open)
6 quantile           quantile(dow30$Open, probs=c(0,0.25,0.5,0.75,1.0))
7 fivenum            fivenum(dow30$Open)
8 IQR                IQR(dow30$Open)  # the interquartile range (the difference
                                      # between the 25th and 75th percentiles)
9 summary            summary(dow30)

Name of the Test (for discrete and continuous data)

1. t-test (one sample)
2. t-test (two sample)
3. Comparing paired data
4. Comparing variances of two populations
5. Comparing means across more than two groups
6. Pairwise t-tests between multiple groups
7. Testing for normality
8. Testing if a data vector came from an arbitrary distribution

You can use the Kolmogorov-Smirnov test to see if a vector came from an arbitrary
probability distribution (not just a normal distribution):

Testing if two data vectors came from the same distribution

(The Kolmogorov-Smirnov test can also be used to test the probability that two data
vectors came from the same distribution.)
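As a minimal self-contained sketch of both uses of ks.test (with simulated data and made-up variable names, not the book's data sets):

```r
set.seed(42)
x <- rnorm(100, mean = 5, sd = 2)    # sample drawn from N(5, 2)
y <- runif(100, min = 0, max = 10)   # sample drawn from Uniform(0, 10)

# One-sample test: did x come from N(5, 2)?
ks.test(x, "pnorm", mean = 5, sd = 2)

# Two-sample test: did x and y come from the same distribution?
ks.test(x, y)
```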

Correlation tests

Distribution-Free Tests
Although many real data sources can be approximated well by a normal distribution,
there are many cases where you know that the data is not normally distributed, or
do not know the shape of the distribution. A good alternative to the tests described
in “Normal Distribution-Based Tests” on page 344 is distribution-free tests. These
tests can be more computationally intensive than tests based on a normal distribution,
but they may help you make better choices when the data is not normally distributed.
Comparing more than two means

The Kruskal-Wallis rank sum test is a distribution-free equivalent to ANOVA.
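A hedged sketch of kruskal.test with simulated group data (the group names and values are hypothetical):

```r
set.seed(1)
# Three groups whose centers clearly differ; the test needs no normality assumption
age <- c(rnorm(50, mean = 40, sd = 8),
         rnorm(50, mean = 55, sd = 8),
         rnorm(50, mean = 70, sd = 8))
cause <- factor(rep(c("A", "B", "C"), each = 50))
kruskal.test(age ~ cause)
```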

Comparing variances
To compare the variance between different groups using a nonparametric test, R
includes an implementation of the Fligner-Killeen (median) test through the
fligner.test function:
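For example, a small simulated sketch (group names are made up) where two groups share a median but differ strongly in spread:

```r
set.seed(2)
# Two groups with equal centers but very different variance
vals <- c(rnorm(60, mean = 0, sd = 1), rnorm(60, mean = 0, sd = 4))
grp <- factor(rep(c("narrow", "wide"), each = 60))
fligner.test(vals ~ grp)
```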

Proportion Tests

Binomial Tests
Often, an experiment consists of a series of identical trials, each of which has only
two outcomes. For example, suppose that you wanted to test the hypothesis that
the probability that a coin would land on heads was .5. You might design an experiment
where you flipped the coin 50 times and counted the number of heads.
Each coin flip is an example of a Bernoulli trial. The distribution of the number of
heads is given by the binomial distribution.
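The coin-flip experiment above translates directly into a call to binom.test (the counts here are invented for illustration):

```r
# Suppose we observed 31 heads in 50 flips and want to test H0: p = 0.5
binom.test(x = 31, n = 50, p = 0.5)
```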

Tabular Data Tests (Fisher Test)

A common problem is to look at a table of data and determine whether there is a
relationship between two categorical variables. If there were no relationship, the two
variables would be statistically independent. In these tests, the null hypothesis is
that the two variables are independent; the alternative hypothesis is that the two
variables are not independent.
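A minimal sketch of fisher.test on a hypothetical 2x2 table (the counts and labels are invented):

```r
# Hypothetical 2x2 table of counts: group vs. outcome
counts <- matrix(c(15,  5,
                    5, 15),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group   = c("treated", "control"),
                                 outcome = c("improved", "no change")))
fisher.test(counts)
```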

Tabular Data Tests (Chi-square Test)




t.test(subset(SPECint2006, Num.Chips==1 & Num.Cores==2)$Baseline,
       subset(SPECint2006, Num.Chips==1 & Num.Cores==2)$Result,
       paired=TRUE)

var.test(yards~outside, data=field.goals.inout)

> tapply(mort06.smpl$age,INDEX=list(mort06.smpl$Cause),FUN=summary)





cor.test(air_on_site/Surface_Area, deaths_lung/Population)

wilcox.test(times.to.failure.e, times.to.failure.d)


field.goals.table <- table(field.goals.goodbad$play.type,
                           field.goals.goodbad$stadium.type)
field.goals.table.t <- t(field.goals.table[3:4,])


births.july.2006 <- births2006.smpl[births2006.smpl$DMETH_REC != "Unknown" &
                                    births2006.smpl$DOB_MM == 7, ]
method.and.sex <- table(
  births.july.2006$SEX,
  as.factor(as.character(births.july.2006$DMETH_REC)))
births2006.byday <- table(births2006.smpl$DOB_WK)
One Sample t-test
data: times.to.failure.h
t = 0.7569, df = 9, p-value = 0.4684
alternative hypothesis: true mean is not equal to 9
95 percent confidence interval:
6.649536 13.714464
sample estimates:
mean of x

Welch Two Sample t-test

data: yards by outside
t = 1.1259, df = 319.428, p-value = 0.2610
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean in group FALSE mean in group TRUE
35.31707 34.40034

data: subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Baseline

and subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Result
t = -21.8043, df = 111, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of the differences

F test to compare two variances

data: yards by outside
F = 1.2432, num df = 252, denom df = 728, p-value = 0.03098
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.019968 1.530612
sample estimates:
ratio of variances

Cause Residuals
Sum of Squares 15727886 72067515
Deg. of Freedom 9 243034
Residual standard error: 17.22012
Estimated effects may be unbalanced
29 observations deleted due to missingness

Pairwise comparisons using t tests with pooled SD

data: tires.sus$Time_To_Failure and tires.sus$Tire_Type
        B       C       D       E       H
C  0.2219  -       -       -       -
D  1.0000  0.5650  -       -       -
E  1.0000  0.0769  1.0000  -       -
H  2.4e-07 0.0029  2.6e-05 1.9e-08 -
L  0.1147  1.0000  0.4408  0.0291  0.0019

Shapiro-Wilk normality test

data: field.goals$YARDS
W = 0.9728, p-value = 1.307e-12

One-sample Kolmogorov-Smirnov test

data: field.goals$YARDS
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
Warning message:
In ks.test(field.goals$YARDS, pnorm) :
cannot compute correct p-values with ties

D = 0.2143, p-value = 0.01168

alternative hypothesis: two-sided

Pearson's product-moment correlation

data: air_on_site/Surface_Area and deaths_lung/Population
t = 3.4108, df = 39, p-value = 0.001520
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2013723 0.6858402
sample estimates:

Wilcoxon rank sum test with continuity correction

data: times.to.failure.e and times.to.failure.d
W = 14.5, p-value = 0.05054
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(times.to.failure.e, times.to.failure.d) :
cannot compute exact p-value with ties

Kruskal-Wallis rank sum test

data: age by Cause
Kruskal-Wallis chi-squared = 34868.1, df = 9, p-value
< 2.2e-16

Fligner-Killeen test of homogeneity of variances

data: age by Cause
Fligner-Killeen:med chi-squared = 15788, df = 9,
p-value < 2.2e-16

3-sample test for equality of proportions without continuity correction
data: field.goals.table
X-squared = 2.3298, df = 2, p-value = 0.3120
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3
0.7910448 0.8636364 0.8231966

Exact binomial test

data: 110 and 416
number of successes = 110, number of trials = 416, p-value = 0.06174
alternative hypothesis: true probability of success is less than 0.3
95 percent confidence interval:
0.0000000 0.3023771
sample estimates:
probability of success

Fisher's Exact Test for Count Data
data: method.and.sex
p-value = 1.604e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.8678345 0.9485129
sample estimates:
odds ratio

Pearson's Chi-squared test with Yates' continuity correction

data: twins.2006$DMETH_REC and twins.2006$SEX
X-squared = 0.1745, df = 1, p-value = 0.6761

Chi-squared test for given probabilities

data: births2006.byday
X-squared = 15873.20, df = 6, p-value < 2.2e-16
Here’s an explanation of the output from the t.test function. First, the function
shows us the test statistic (t = 0.7569), the degrees of freedom (df = 9), and the
calculated p-value for the test (p-value = 0.4684). The p-value means that, if the
true mean were 9, the probability of observing a sample mean at least this far from
9 was 0.4684. In other words, the data is quite consistent with a true mean time to
failure of 9.
The next line states the alternative hypothesis: the true mean is not equal to 9. Given
the large p-value, we cannot accept this alternative (we fail to reject the null
hypothesis). Next, the t.test function shows the 95% confidence interval for this
test, and, finally, it gives the actual sample mean.
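A self-contained analogue of this test (with simulated failure times, since the book's times.to.failure.h data is not reproduced here):

```r
# Ten simulated failure times, tested against a hypothesized mean of 9
set.seed(8)
times.sim <- rnorm(10, mean = 10, sd = 3)
t.test(times.sim, mu = 9)
```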

Although the average successful field goal length was about a yard longer, the difference
is not significant at a 95% confidence level. The same is true for field goals
that missed.

In this case, we can clearly see that the results were significant at the 95% confidence
interval. (This isn’t a very big surprise. It’s well known that optimizing compiler
settings and system parameters can make a big difference on system performance.
Additionally, submitting optimized results is optional: organizations that could not
tune their systems very well probably would not voluntarily share that fact.)

As you can see from the output above, the p-value is less than .05, indicating that
the difference in variance between the two populations is statistically significant.

As you can tell from the p-value, it is quite unlikely that this data came from a
normal distribution.

Notice the warning message; ties are extremely unlikely for values from a true normal
distribution. If there are ties in the data, that is a good sign that the test data is not
actually normally distributed, so the function prints a warning.

According to the p-value, the two vectors likely came from the same distribution (at
a 95% confidence level).

The test shows that there appears to be a positive correlation between these two
quantities that is statistically significant. However, don’t infer that there is a causal
relationship between the rates of toxins released and the rates of lung cancer deaths.
There are many alternate explanations for this phenomenon. For example, states
with lots of dirty industrial activity may also be states with lower levels of income,
which, in turn, correlates with lower-quality medical care. Or, perhaps, states with
lots of industrial activity may be states with higher rates of smoking. Or, maybe
states with lower levels of industrial activity are less likely to identify cancer as a
cause of death. Whatever the explanation, I thought this was a neat result.

Here’s an explanation of the output. The test function first shows the test statistic
(W = 14.5) and the p-value for the statistic (0.05054). Notice that this is different
from the result for the t-test. With the t-test, there was a significant difference between
the means of the two groups, but with the Wilcoxon rank sum test, the difference
between the two groups is not significant at a 95% confidence level (though
it barely misses).
As you can see, the results are not significant.

Unlike some other test functions, the p-value represents the probability that the
fraction of successes (.26443431) was at least as far from the hypothesized value
(.300) after the experiment. We specified that the alternative hypothesis was “less,”
meaning that the p-value represents the probability that the fraction of successes
was less than .26443431, which in this case was .06174.
In plain English, this means that if David Ortiz was a “true” .300 hitter, the probability
that he actually hit .264 or worse in a season was .06174.

The p-value is the probability of obtaining results that were at least as far removed
from independence as these results. In this case, the p-value is very low, indicating
that the results were very far from what we would expect if the variables were truly
independent. This implies that we should reject the hypothesis that the two variables
are independent.

As above, the p-value is very high, so it is likely that the two variables are independent
for twin births.
Linear regression

runs.mdl <- lm(
  formula=runs~singles+doubles+triples+homeruns+walks+hitbypitch+
    sacrificeflies+stolenbases+caughtstealing,
  data=team.batting.00to08)

This function shows the following plots:
• Residuals against fitted values
• A scale-location plot of sqrt(|residuals|) against fitted values
• A normal Q-Q plot
• (Not plotted by default) A plot of Cook’s distances versus row labels
• A plot of residuals against leverages
• (Not plotted by default) A plot of Cook’s distances against leverage/(1 − leverage)

The model takes the form:

y = c0 + c1x1 + c2x2 + ... + cnxn + ε

where y is the response variable, x1, x2, ..., xn are the predictor variables (or predictors),
c1, c2, ..., cn are the coefficients for the predictor variables, c0 is the intercept,
and ε is the error term.

Suppose that you have a matrix of observed predictor variables X and a vector of
response variables Y. (In this sentence, I’m using the terms “matrix” and “vector”
in the mathematical sense.) We have assumed a linear model, so given a set of coefficients
c, we can calculate a set of estimates ŷ for the input data X by calculating
ŷ = cX. The differences between the estimates ŷ and the actual values Y are called
the residuals. You can think of the residuals as a measure of the prediction error;
small residuals mean that the predicted values are close to the actual values. We
assume that the expected difference between the actual response values and the
residual values (the error term in the model) is 0. This is important to remember: at
best, a model is probabilistic.†

Our goal is to find the set of coefficients c that does the best job of estimating Y given
X; we’d like the estimates ŷ to be as close as possible to Y. In a classical linear regression
model, we find coefficients c that minimize the sum of squared differences
between the estimates ŷ and the observed values Y. Specifically, we want to find
values for c that minimize:

RSS(c) = Σi (yi − ŷi)²
This is called the least squares method for regression. You can use the lm function
in R to estimate the coefficients in a linear model:‡
lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
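For example, a small least squares fit using R's built-in cars data set (not one of the data sets used elsewhere in these notes):

```r
# Predict stopping distance (ft) from speed (mph) with ordinary least squares
cars.mdl <- lm(dist ~ speed, data = cars)
coef(cars.mdl)                 # intercept and slope
summary(cars.mdl)$r.squared    # fraction of variance explained
```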

Assumptions of Least Squares Regression

Linear models fit with the least squares method are one of the oldest statistical
methods, dating back to the age of slide rules. Even today, when computers are
ubiquitous, high-quality statistical software is free, and statisticians have developed
thousands of new estimation methods, they are still popular. One reason why linear
regression is still popular is because linear models are easy to understand. Another
reason is that the least squares method has the smallest variance among all unbiased
linear estimates (proven by the Gauss-Markov theorem).
Technically, linear regression is not always appropriate. Ordinary least squares
(OLS) regression (implemented through lm) is only guaranteed to work when certain
properties of the training data are true. Here are the key assumptions:
1. Linearity. We assume that the response variable y is a linear function of the
predictor variables x1, x2, ..., xn.
2. Full rank. There is no linear relationship between any pair of predictor variables.
(Equivalently, the predictor matrix is not singular.) Technically, ∀ xi, xj, ∄ c
such that xi = cxj.
3. Exogeneity of the predictor variables. The expected value of the error term ε is
0 for all possible values of the predictor variables.
4. Homoscedasticity. The variance of the error term ε is constant and is not correlated
with the predictor variables.
5. Nonautocorrelation. In a sequence of observations, the values of y are not correlated
with each other.
6. Exogenously generated data. The predictor variables x1, x2, ..., xn are generated
independently of the process that generates the error term ε.
7. The error term ε is normally distributed with standard deviation σ and mean 0.
In practice, OLS models often make accurate predictions even when one (or more)
of these assumptions are violated.
By the way, it’s perfectly OK for there to be a nonlinear relationship between some
of the predictor variables. Suppose that one of the variables is age. You could add
age^2, log(age), or other nonlinear mathematical expressions using age to the model
and not violate the assumptions above. You are effectively defining a set of new
predictor variables: w1 = age, w2 = age^2, w3 = log(age). This doesn’t violate the
linearity assumption (because the model is still a linear function of the predictor
variables) or the full rank assumption (as long as the relationship between the new
variables is not linear).

Remedies for violated assumptions:
- Heteroscedasticity: detect with ncv.test (car) or bptest (lmtest); consider robust (rlm()) or resistant (lqs()) regression
- Autocorrelation: detect with durbin.watson (car) or dwtest (lmtest)
- Multicollinearity: ridge regression (lm.ridge())
- Singularity: pass singular.ok=FALSE to lm()

Stepwise Variable Selection

A simple technique for selecting the most important variables is stepwise variable
selection. The stepwise algorithm works by repeatedly adding or removing variables
from the model, trying to “improve” the model at each step. When the algorithm
can no longer improve the model by adding or subtracting variables, it stops and
returns the new (and usually smaller) model.
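A sketch of the idea using step with simulated data (variable names are hypothetical; y depends on two of the three candidate predictors):

```r
# Hypothetical data: y depends on x1 and x2; x3 is pure noise
set.seed(3)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 2 * x1 - 3 * x2 + rnorm(n)

full.mdl <- lm(y ~ x1 + x2 + x3)
# Repeatedly add/drop terms, scored by AIC, until no step improves the model
reduced.mdl <- step(full.mdl, direction = "both", trace = 0)
formula(reduced.mdl)
```

With this much signal, step will typically drop the noise variable x3 and keep x1 and x2.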

Ridge Regression
Stepwise variable selection simply fits a model using lm, but limits the number of
variables in the model. In contrast, ridge regression places constraints on the size of
the coefficients and fits a model using different computations.
Ridge regression can be used to mitigate problems when there are several highly
correlated variables in the underlying data. This condition (called multicollinearity)
causes high variance in the results. Reducing the number, or impact, of regressors
in the data can help reduce these problems.
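A hedged sketch of lm.ridge (from the MASS package) on simulated collinear predictors; the data and lambda grid are invented for illustration:

```r
library(MASS)   # provides lm.ridge and select

set.seed(4)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
y <- x1 + x2 + rnorm(n)

# Fit across a grid of ridge constants; select() reports the GCV-chosen lambda
ridge.mdl <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))
select(ridge.mdl)
```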

Principal Components Regression and Partial Least Squares Regression

Ordinary least squares regression doesn’t always work well with closely correlated
variables. A useful technique for modeling effects in this form of data is principal
components regression. Principal components regression works by first transforming
the predictor variables using principal components analysis. Next, a linear regression
is performed on the transformed variables.
A closely related technique is partial least squares regression. In partial least squares
regression, both the predictor and the response variables are transformed before
fitting a linear regression. In R, principal components regression is available through
the function pcr in the pls package.
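As a minimal base-R sketch of the principal components regression idea (using prcomp plus lm rather than pls::pcr; the simulated data and the choice of two components are assumptions for illustration):

```r
# Step 1: principal components analysis of the (correlated) predictors
set.seed(5)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # highly correlated with x1
x3 <- rnorm(n)
y  <- x1 + x2 + 2 * x3 + rnorm(n)

pca <- prcomp(cbind(x1, x2, x3), scale. = TRUE)

# Step 2: ordinary regression on the leading components
scores  <- pca$x[, 1:2]
pcr.mdl <- lm(y ~ scores)
summary(pcr.mdl)$r.squared
```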
Nonlinear Models

Generalized Linear Models

Generalized linear modeling is a technique developed by John Nelder and Robert
Wedderburn to compute many common types of models using a single framework.
You can use generalized linear models (GLMs) to fit linear regression models, logistic
regression models, Poisson regression models, and other types of models.
As the name implies, GLMs are a generalization of linear models. Like linear models,
there is a response variable y and a set of predictor variables x1, x2, ..., xn. GLMs
introduce a new quantity called the linear predictor. The linear predictor takes the
following form:

η = c0 + c1x1 + c2x2 + ... + cnxn

In a generalized linear model, the predicted value is a function of the linear predictor.
The relationship between the response and predictor variables does not have to be
linear. However, the relationship between the predictor variables and the linear predictor
must be linear. Additionally, the only way that the predictor variables influence
the predicted value is through the linear predictor.
In “Example: A Simple Linear Model” on page 373, we noted that a good way to
interpret the predicted value of a model is as the expected value (or mean) of the
response variable, given a set of predictor variables. This is also true in GLMs, and
the relationships between that mean and the linear predictor is what makes GLMs
so flexible. To be precise, there must be a smooth, invertible function m such that:

μ = E(y|x) = m(η), that is, η = l(μ)

The inverse of m (denoted by l above) is called the link function. You can use many
different function families with a GLM, each of which lets you predict a different
form of model. For GLMs, the underlying probability distribution needs to be part
of the exponential family of probability distributions.

runs.glm <- glm(
  formula=runs~singles+doubles+triples+homeruns+
    walks+hitbypitch+sacrificeflies+
    stolenbases+caughtstealing,
  data=team.batting.00to08)
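The same glm interface fits non-Gaussian models by changing the family. A sketch of a logistic regression with simulated data (names and coefficients are invented):

```r
# Simulated data where the log-odds of success are linear in x
set.seed(6)
n <- 500
x <- rnorm(n)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))   # true model: logit(p) = 0.5 + 2x
z <- rbinom(n, size = 1, prob = p)

logit.mdl <- glm(z ~ x, family = binomial(link = "logit"))
coef(logit.mdl)   # estimates should be near 0.5 and 2
```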

Nonlinear Least Squares

Sometimes, you know the form of a model, even if the model is extremely nonlinear.
To fit nonlinear models (minimizing least squares error), you can use the nls function:
nls(formula, data, start, control, algorithm,
    trace, subset, weights, na.action, model,
    lower, upper, ...)
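For instance, a sketch fitting a known nonlinear form (exponential decay) to simulated data; the parameter names and starting values are assumptions:

```r
# Simulated exponential decay: y = a * exp(-b * t) + noise
set.seed(7)
t.obs <- seq(0, 5, by = 0.1)
y.obs <- 10 * exp(-1.5 * t.obs) + rnorm(length(t.obs), sd = 0.2)

decay.mdl <- nls(y.obs ~ a * exp(-b * t.obs),
                 start = list(a = 5, b = 1))   # rough starting guesses
coef(decay.mdl)   # estimates should be near a = 10, b = 1.5
```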

Survival Models
Survival analysis is concerned with looking at the amount of time that elapses before
an event occurs. An obvious application is to look at mortality statistics (predicting
how long people live), but it can also be applied to mechanical systems (the time
before a failure occurs), marketing (the amount of time before a consumer cancels
an account), or other areas.
In R, there are a variety of functions in the survival library for modeling survival data.
To estimate a survival curve for censored data, you can use the survfit function:

As an example, let’s fit a survival curve for the GSE2034 data set. This data comes
from the Gene Expression Omnibus of the National Center for Biotechnology Information
(NCBI), which is accessible from http://www.ncbi.nlm.nih.gov/geo/. The
experiment examined how the expression of certain genes affected breast cancer
relapse-free survival time. In particular, it tested estrogen receptor binding sites.
(We’ll revisit this example in Chapter 24.)
First, we need to create a Surv object within the data frame. A Surv object is an R
object for representing survival information, in particular, censored data. Censored
data occurs when the outcome of the experiment is not known for all observations.
In this case, the data is censored. There are three possible outcomes for each observation:
the subject had a recurrence of the disease, the subject died without having
a recurrence of the disease, or the subject was still alive without a recurrence at the
time the data was reported. The last outcome—the subject was still alive without a
recurrence—results in the censored values:
> library(survival)
> GSE2034.Surv <- transform(GSE2034,
+   surv=Surv(
+     time=GSE2034$months.to.relapse.or.last.followup,
+     event=GSE2034$relapse))
# show the first 26 observations:
> GSE2034.Surv$surv[1:26,]
[1] 101+ 118+ 9 106+ 37 125+ 109+ 14 99+ 137+ 34 32 128+
[14] 14 130+ 30 155+ 25 30 84+ 7 100+ 30 7 133+ 43
Now, let’s calculate the survival model. We’ll just make it a function of the ER.Status
flag (which stands for “estrogen receptor status”):
> GSE2034.survfit <- survfit(
+   formula=surv~ER.Status,
+   data=GSE2034.Surv)

The easiest way to view a survfit object is graphically. Let’s plot the model:
> plot(GSE2034.survfit,lty=1:2,log=T)
> legend(135,1,c("ER+","ER-"),lty=1:2,cex=0.5)
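Since the GSE2034 data itself is not included here, the same survfit pattern can be sketched with a small hypothetical data frame (all values invented):

```r
library(survival)

# Hypothetical follow-up data: time in months, event = 1 if relapse
# was observed, 0 if the observation was censored
toy <- data.frame(
  months  = c(5, 12, 20, 25, 30, 42, 44, 60, 71, 80),
  relapse = c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0),
  group   = rep(c("ER+", "ER-"), 5))

toy.fit <- survfit(Surv(months, relapse) ~ group, data = toy)
summary(toy.fit)   # Kaplan-Meier estimates per group
```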

Kernel Smoothing
To estimate a probability density function, regression function, or their derivatives
using polynomials, try the function locpoly in the library KernSmooth:
locpoly(x, y, drv = 0L, degree, kernel = "normal",
bandwidth, gridsize = 401L, bwdisc = 25,
range.x, binned = FALSE, truncate = TRUE)
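A brief sketch of locpoly on simulated data (the bandwidth here is an arbitrary choice for illustration):

```r
library(KernSmooth)

set.seed(9)
x <- runif(500, 0, 4 * pi)
y <- sin(x) + rnorm(500, sd = 0.3)

# Local linear fit (degree = 1) with a fixed bandwidth
fit <- locpoly(x, y, degree = 1, bandwidth = 0.5)
str(fit)   # a list with components x (grid points) and y (estimates)
```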