Hrishikesh D. Vinod
August 22, 2018
Abstract
These commands are very basic and are intuitive in most cases.
They are adequate for a beginning statistics course. The material in
red font in this document can be copied and pasted to your R-GUI
(graphical user interface). The material in blue font is the output from
R.
Using R
First we clean up R memory.
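The command that cleans up memory did not survive in this copy of the text. A common way to do it, shown here only as a sketch, is to pass the output of ls() to rm(). The object name tmp_value is made up for illustration.

```r
tmp_value = 5      # a leftover object from earlier work (illustrative)
ls()               # lists the objects currently in memory
rm(list = ls())    # removes every object listed by ls()
ls()               # character(0): nothing is left
```

Note that this removes all visible objects at once, so it belongs at the start of a session rather than in the middle of one.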
Some useful commands are listed next. The most important is “c”, which
allows the user to combine and store a set of numbers or other items in a
vector. Within R commands, note that # starts a comment: everything after
this symbol on a line is ignored by R.
c(1,7,"name")
[1] "1" "7" "name"
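The above output shows that numbers and words can be included in the same
vector. R then stores every element as text, which is why 1 and 7 are
printed inside quotes. Of course, the words must be placed in plain straight
quotes (not the smart quotes of MS Word).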
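The following sketch shows what happens to the numbers when a word is mixed into a vector. The object names nums and mixed are illustrative only.

```r
nums = c(1, 7)            # numbers only
mixed = c(1, 7, "name")   # numbers and a word
class(nums)               # "numeric"
class(mixed)              # "character"
sum(nums)                 # 8
# sum(mixed) would stop with an error, because "1" and "7" are now text
as.numeric(mixed[1:2])    # converts the text "1" "7" back to numbers
```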
R is an object-oriented language. Almost everything is an object with
a name. Both “x =” and “x <-” are assignment operations that create an R
object named x. R purists prefer <- and do not like = as an assignment
operator, but I do. I like = because it requires less space and less typing.
The R object names are almost arbitrary, except that they cannot start
with a digit or contain symbols other than the period and the underscore.
It is not advisable to use common commands as names of R objects (e.g. sum,
mean, sd, c, sin, cos, pi, exp, etc., described later). Everything in R,
including object names, is case-sensitive.
Note that 3x is not a valid name of an R object.
3x=1:4
The object name ‘3x’ in the above code returns an error:
Error: unexpected symbol in "3x"
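For contrast, here is a short sketch of names that R does accept, and of case sensitivity. All object names below are made up for illustration.

```r
x3 = 1:4        # valid: the digit is not the first character
my.data = 5     # valid: a period is allowed inside a name
my_data = 6     # valid: an underscore is allowed inside a name
X = 10          # capital X ...
x = 20          # ... and lower-case x are two different objects
X + x           # 30
```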
For example, x=5 means that 5 is stored under the name x. Also x <- c(1,2,3,4)
defines the variable x as the vector (1,2,3,4). Alternatively use x=1:4.
x=1:4
x #typing the name of an R object asks R to print it to the screen.
> x
[1] 1 2 3 4
The usage sum(..., na.rm = FALSE) shows that sum is a function always
available in R, with the objects to be added (for example a vector x) as its
arguments. ‘na.rm=FALSE’ is an optional argument with default value FALSE,
meaning that if there are missing values (NA’s or not-available data values)
the sum will also be NA. This is a useful warning.
sum(x) # Calculates the sum of elements in vector x.
Now the output of sum command is:
> sum(x)
[1] 10
Now we illustrate the use of sum in the presence of missing data or NA’s.
We create a vector x with four numbers and one NA. To compute the sum
correctly, we need to use the option ‘na.rm=TRUE’. Otherwise the sum is
NA. This is a useful warning that there are missing data, which can
otherwise go unnoticed.
x=c(1:3,NA,4);x
sum(x)
sum(x,na.rm=TRUE)
> x=c(1:3,NA,4)
> x
[1] 1 2 3 NA 4
> sum(x)
[1] NA
> sum(x,na.rm=TRUE)
[1] 10
The above output shows that the sum(x) is NA if we do not recognize the
presence of NA and explicitly ask R to remove it (na.rm means remove NAs)
before computing the sum.
The option ‘na.rm=TRUE’ is available for the computation of the mean, median,
standard deviation, variance, etc. Less sophisticated software can give an
incorrect number of observations and wrong answers in the presence of NA’s.
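As a sketch, the same option can be applied to the other summary functions using the vector defined above. The name n_ok is an illustrative choice for the count of non-missing values.

```r
x = c(1:3, NA, 4)          # same vector as above: four numbers and one NA
mean(x)                    # NA, because of the missing value
mean(x, na.rm = TRUE)      # 2.5
median(x, na.rm = TRUE)    # 2.5
var(x, na.rm = TRUE)       # sample variance of 1, 2, 3, 4
sd(x, na.rm = TRUE)        # its square root
n_ok = sum(!is.na(x))      # number of non-missing observations
n_ok                       # 4
```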
q() #quits a session.
If R is expecting the continuation of a command, it prompts with “+”. This
may be an indication that something is wrong, and it may be better to press
the escape key to get out. The cause can be parentheses that do not match
or other syntax errors.
pi
exp(1)
print(c(pi,exp(1))) #prints to screen values of pi and e symbols
Note that exp is a function in R and exp(1) means e raised to the power 1.
Note also that the ‘c’ function of R combines two or more values into a
single vector. R does not understand a mere list of things without the c
command. The print command of R needs the ‘c’ above, because print accepts
one object at a time and we want to print more than one value.
> pi
[1] 3.141593
> exp(1)
[1] 2.718282
> print(c(pi,exp(1))) #prints to screen values of pi and e symbols
[1] 3.141593 2.718282
Thus the transcendental numbers ‘e’ and π are already defined in R as exp(1)
and pi.
x=123*(10^(-9)) # multiplication is with * and
#raise to power is with the ^ symbol in R
x=123*(10^(-9)); x #semicolon allows two commands on the same line
Now the output of above commands is:
> x=123*(10^(-9)) # multiplication is with * and
#raise to power is with the ^ symbol in R
> x=123*(10^(-9)); x #semicolon allows two commands on the same line
[1] 1.23e-07
Printing x as 1.23e-07 uses scientific notation. If you do not want that,
use ‘format’ instead of print with the option scientific=FALSE as below:
format(x, scientific=FALSE) # print it as "0.000000123"
#this avoids the scientific notation
x #without the option, it prints 1.23e-07 or scientific notation.
Note that ‘format’ returns x as text, which R then prints inside quotes.
Simply typing x prints it in scientific notation (the default).
> format(x, scientific=FALSE) # print it as "0.000000123"
[1] "0.000000123"
> #this avoids the scientific notation
> x #without the option, it prints 1.23e-07 or scientific notation.
[1] 1.23e-07
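A brief sketch of the difference between the formatted text and the number itself. The name s is illustrative.

```r
x = 123 * (10^(-9))
s = format(x, scientific = FALSE)   # "0.000000123", as in the output above
is.character(s)                     # TRUE: s is text, not a number
is.numeric(x)                       # TRUE: x itself is unchanged
x * 2                               # arithmetic still works on x
as.numeric(s) * 2                   # the text must be converted back first
```

The formatted version is therefore best kept for display, with calculations done on x.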
> median(x) # Calculates the median of x elements.
[1] 3
> x=c(2,4,0,12,7,2,7,2);x
[1] 2 4 0 12 7 2 7 2
> quantile(x, probs=c(0.05, 0.45, 0.95), type=1)
5% 45% 95%
0 2 12
> quantile(x, probs=c(0.05, 0.45, 0.95), type=6)
5% 45% 95%
0.0 2.1 12.0
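The two outputs differ because the type argument selects the rule used to estimate a quantile: type=1 returns one of the observed values, while type=6 interpolates between neighbouring values. When type is omitted, R uses type=7. The sketch below compares the three rules at the 45% point; the names q1, q6 and q7 are illustrative.

```r
x = c(2, 4, 0, 12, 7, 2, 7, 2)
sort(x)                                  # 0 2 2 2 4 7 7 12
q1 = quantile(x, probs = 0.45, type = 1) # 2, an observed value
q6 = quantile(x, probs = 0.45, type = 6) # 2.1, interpolated
q7 = quantile(x, probs = 0.45)           # default rule (type = 7)
c(q1, q6, q7)
median(x)                                # 3, midpoint of the two middle values
```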
In general the spread of the data is of interest. The simplest measure is
the overall range. Measurement of the volatility of stock returns is an
important indicator of the risk associated with that investment. Besides the
range, “deviations from the mean” (x − x̄) provide information regarding the
spread of the data with respect to its own mean. However, since
Σ(x − x̄) = 0 always holds, the sum of deviations from the mean is useless
for distinguishing between different data sets. Hence we can compare the
mean of absolute deviations (MAD) from the mean. Statisticians prefer the
variance and standard deviation (sd) of the elements of vector x over MAD
since the variance has convenient mathematical properties (e.g. its
derivative is easy to compute).
n=length(x);n #count how many items are in x
sqrt(16) #should be 4; the square root function is sqrt
max(x)-min(x) #defines the range
dev=x-mean(x);dev #vector of deviations from the mean of x
sum(dev) #must be zero
sum(dev^2)/(n-1) # sample variance definition
var(x) # Calculates the sample variance of x.
#standard deviation is the square root of the variance of x
sqrt(var(x))
sd(x) # direct calculation of sample standard deviation of x
sum(dev^2)/n# population variance definition
#indirect calculation of population variance from var(x)
popvar=var(x)*(n-1)/n;popvar
sqrt(popvar) #computes the square root of population variance
popsd=sd(x)*sqrt(n-1)/sqrt(n);popsd
Output of above code is next.
> sum(dev^2)/(n-1)# sample variance definition
[1] 15.42857
> var(x) # Calculates the sample variance of x.
[1] 15.42857
> #standard deviation is the square root of the variance of x
> sqrt(var(x))
[1] 3.927922
> sd(x) # direct calculation of sample standard deviation of x
[1] 3.927922
> sum(dev^2)/n# population variance definition
[1] 13.5
> #indirect calculation of population variance from var(x)
> popvar=var(x)*(n-1)/n;popvar
[1] 13.5
> sqrt(popvar) #computes the square root of population variance
[1] 3.674235
> popsd=sd(x)*sqrt(n-1)/sqrt(n);popsd
[1] 3.674235
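The mean of absolute deviations mentioned earlier is not computed in the code above. A minimal sketch, assuming the same data vector, is given below; the name mean_abs_dev is illustrative. Note that R's built-in mad() function computes a scaled median absolute deviation, which is a different quantity.

```r
x = c(2, 4, 0, 12, 7, 2, 7, 2)   # data used in the output above
dev = x - mean(x)                # deviations from the mean
sum(dev)                         # zero, as always
mean_abs_dev = mean(abs(dev))    # mean of absolute deviations
mean_abs_dev                     # 3.125
sd(x)                            # 3.927922, for comparison
```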
factorial(n) # Calculates the factorial of integer n.
summary(x) #prints a six-number summary: Min, Q1, median, mean, Q3, Max
set.seed(123) #sets the seed of the random number generator
seq(from=1,to=9,by=2)
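A small sketch of what these two commands do. It uses runif, the uniform random number command, and the names a and b are illustrative.

```r
set.seed(123)           # fix the starting point of the generator
a = runif(3)            # three uniform random numbers
set.seed(123)           # reset to the same seed
b = runif(3)            # the same three numbers again
identical(a, b)         # TRUE
seq(from=1, to=9, by=2) # 1 3 5 7 9
```

Setting the seed makes results that depend on random numbers reproducible from one session to the next.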
The ‘read.table’, ‘read.DIF’, etc. commands are for reading data, but they
can be hard to use. It may be just as good to copy the numbers from an MS
Word file or text file and read them with x=c(.., ..,)
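As one possible middle path, pasted numbers can be read from a text string without typing the commas. This is only a sketch: the values, the column names id and score, and the object names pasted and tab are all made up for illustration.

```r
# numbers pasted from a text file as one string (made-up values)
pasted = "2 4 0 12 7 2 7 2"
x = scan(text = pasted, quiet = TRUE)  # split on white space, read as numbers
x
# a small table with a header row, also pasted as text (made-up values)
tab = read.table(text = "id score
1 90
2 85", header = TRUE)
tab$score                              # the score column as a vector
```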
The rounding done by the R command round is too sophisticated for
Hawkes Learning, which uses the biased method we learned in high school.
For example, the R command round(c(-0.5,0.5,1.5,2.5)) rounds each half to
the nearest even number, giving (0, 0, 2, 2), to avoid bias. This is
different from the rounding we learned in high school, which would give
(−1, 1, 2, 3).
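If the high school answer is needed, it can be reproduced with a small helper. The function round_half_away below is hypothetical, not part of R, and is a sketch intended only for rounding to whole numbers.

```r
round(c(-0.5, 0.5, 1.5, 2.5))    # R's rule: halves go to the even neighbour
# hypothetical helper, not part of R: round halves away from zero
round_half_away = function(x) {
  sign(x) * floor(abs(x) + 0.5)
}
round_half_away(c(-0.5, 0.5, 1.5, 2.5))   # -1 1 2 3
```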
4 Probability Distributions
4.1 Uniform Distribution
How do we create random numbers from the uniform density? In R ‘unif’ means
uniform, and the prefix:
d means density,
p means cumulative probability
q means quantile
r means random numbers from that density. Thus,
plot(dunif) #range is 0 to 1 as default
x=runif(10)#creates 10 uniform random numbers in x
x #print x
punif(1)#area under uniform between 0 to 1
punif(0.5)#area 0 to 0.5
qunif(0.5)# given area=0.5, the quantile of uniform
> x=runif(10)#creates 10 uniform random numbers in x
> x #print x
[1] 0.22820188 0.01532989 0.12898156 0.09338193 0.23688501 0.79114741
[7] 0.59973157 0.91014771 0.56042455 0.75570477
> punif(1)#area under uniform between 0 to 1
[1] 1
> punif(0.5)#area 0 to 0.5
[1] 0.5
> qunif(0.5)# given area=0.5, the quantile of uniform
[1] 0.5
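The same four prefixes also work for a uniform density on an interval other than 0 to 1, through the min and max arguments. The end points 2 and 6 and the seed below are arbitrary choices for this sketch.

```r
# uniform density on the interval from 2 to 6 (illustrative end points)
dunif(3, min = 2, max = 6)      # height of the density: 1/(6-2) = 0.25
punif(3, min = 2, max = 6)      # area from 2 to 3: (3-2)/(6-2) = 0.25
qunif(0.25, min = 2, max = 6)   # quantile with area 0.25 to its left: 3
set.seed(1)                     # arbitrary seed, for reproducibility only
u = runif(5, min = 2, max = 6)  # five random numbers between 2 and 6
range(u)                        # both values lie inside [2, 6]
```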
The Binomial probabilities (1/8, 3/8, 3/8, 1/8) are correctly produced by
dbinom. The graphical output is omitted for brevity.
> db=dbinom(x,prob=p,size=n);db
[1] 0.125 0.375 0.375 0.125
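The values of x, p and n used for this output are not shown in this copy. The sketch below assumes n=3 trials with success probability p=0.5, which reproduces the same four probabilities, and checks dbinom against the binomial formula. The name by_hand is illustrative.

```r
n = 3; p = 0.5                   # assumed values, not shown in the text
x = 0:n                          # possible numbers of successes
db = dbinom(x, prob = p, size = n); db
by_hand = choose(n, x) * p^x * (1 - p)^(n - x)  # binomial formula
all.equal(db, by_hand)           # TRUE
sum(db)                          # probabilities add up to 1
```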
[1] 10 4 3 3
> x=0:minnk
> px=choose(k,x)*choose((N-k),(n-x)) /choose(N,n)
> px
[1] 0.16666667 0.50000000 0.30000000 0.03333333
> names(px)=x;barplot(px)
> sum(px)
[1] 1
[Figure: Normal Density. Vertical axis: dn, from 0.0 to 0.4; horizontal axis from −4 to 4.]
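The code that drew the Normal Density figure is not included in this copy. A figure of this kind could be produced along the following lines; the grid spacing of 0.1 and the line type are assumptions, while the name dn is taken from the axis label.

```r
x = seq(-4, 4, by = 0.1)   # assumed grid of points from -4 to 4
dn = dnorm(x)              # standard normal density at each point
plot(x, dn, type = "l", main = "Normal Density")
max(dn)                    # peak height at x = 0, about 0.4
```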
> pnorm(1) #area under N(0,1) always from minus infinity
[1] 0.8413447
> qnorm(0.5)#gives the quantile z from cumulative probability
[1] 0
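Since pnorm always measures the area from minus infinity, other areas are obtained by subtraction. The following sketch shows a few common cases that would otherwise be looked up in a normal table.

```r
pnorm(1)                     # 0.8413447, area to the left of 1
1 - pnorm(1)                 # area to the right of 1
pnorm(1) - pnorm(-1)         # area between -1 and 1, about 0.68
pnorm(1.96) - pnorm(-1.96)   # area between -1.96 and 1.96, about 0.95
qnorm(0.975)                 # quantile with area 0.975 to its left, about 1.96
qnorm(pnorm(1))              # qnorm undoes pnorm, so this returns 1
```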
#example
permute(4,2)
When you write your own function, it is important to check it against a
known answer. For example, we know that 4P2 is 12, and it checks out.
> permute=function(n,k){
+ out=factorial(n)/factorial(n-k)
+ return(out)
+ }
> #example
> permute(4,2)
[1] 12
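Such checks can also be written into the script itself with stopifnot, which halts with an error if a condition fails. The sketch below repeats the permute function and adds two further checks chosen for illustration.

```r
permute = function(n, k) {
  factorial(n) / factorial(n - k)   # same formula as above
}
stopifnot(all.equal(permute(4, 2), 12))            # known answer: 4P2 = 12
stopifnot(all.equal(permute(5, 5), factorial(5)))  # all 5 items arranged: 5!
# each combination of k items can be ordered in k! ways
stopifnot(all.equal(permute(6, 3), choose(6, 3) * factorial(3)))
permute(6, 3)                                      # 120
```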
Now we illustrate a function to compute an alternative version of the
hypergeometric distribution as follows.
myhyper=function(N,n,k){
x=0:min(n,k)
px=choose(k,x)*choose((N-k),(n-x)) /choose(N,n)
return(px)}
#example
myhyper(N=10,n=4,k=3)
When you write your own function, it is important to check it against a
known answer given above for N=10, n=4, k=3. Abridged output is:
> myhyper(N=10,n=4,k=3)
[1] 0.16666667 0.50000000 0.30000000 0.03333333
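A second check is to compare myhyper with R's built-in dhyper. The sketch below assumes the usual reading of N as the population size, k as the number of items of interest and n as the sample size. Note that dhyper names its arguments differently, so the mapping is spelled out in the comment. The name builtin is illustrative.

```r
myhyper = function(N, n, k) {
  x = 0:min(n, k)
  choose(k, x) * choose(N - k, n - x) / choose(N, n)
}
N = 10; n = 4; k = 3
x = 0:min(n, k)
# dhyper(x, m, n, k): m = items of interest, n = other items, k = sample size
builtin = dhyper(x, m = k, n = N - k, k = n)
all.equal(myhyper(N, n, k), builtin)   # TRUE
```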
6 Final Remarks
We show that R is far more convenient than a calculator, allowing us to give
names to our calculations and implement vast sets of calculations without the
tedium. We can also avoid the use of most probability distribution tables.
Some useful references are Kerns (2013), Vinod (2008), Verzani (2009), and
Kleiber and Zeileis (2008), among others.
References
Kerns, G. J. (2013), IPSUR: Introduction to Probability and Statistics Using
R, R package version 1.5, URL https://CRAN.R-project.org/package=IPSUR.
Verzani, J. (2009), UsingR: Data sets for the text "Using R for Introductory
Statistics", R package version 0.1-12, URL
https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf.