
Introduction to R

What is R?
• R is a programming language, and not a “statistical software” program. However, R is a great
and flexible language for doing statistical analyses.
• R is described as a language and environment for statistical computing and graphics.
• R makes use of both functional programming and object-oriented programming. However, R
is neither a strictly functional nor a strictly object-oriented programming language.

Why R?
• Platform of choice for many statisticians → new methods will most likely be available in
R before they appear in other statistical software
• Default educational platform in university statistics programs, and is also becoming one of
the most popular platforms among statistics authors.
• Offers the largest collection of analytics tools and statistical methods
• Very powerful plotting functions for data visualization
• Built-in facility for report generation
• Backed by an active community of users
• Available on Windows, Mac and Linux → (almost) platform-agnostic
• R is FREE

Why not R?
• Requires mastering the basics of the R language → many simple analyses are cumbersome to
do in R if you don’t know how to manipulate objects. In other statistical software,
procedures and programs for statistical analysis are ready-made recipes for using the
ingredients, while in R you need to come up with your own recipes given the ingredients
• Not a good fit for those who have tried programming and didn’t like it

Installing R to your system


• Download the R installer through https://cran.r-project.org/mirrors.html: select any of the
mirror sites, then download the appropriate installer for your operating system. Look for the
link that says something like “install R for the first time” to make sure
that you are downloading the installer for the base R distribution.
• Install R on your computer.
• After the successful installation of R, download the installer for RStudio through
https://www.rstudio.com/. Look for the download link for RStudio Desktop, the free
version of the software.
• RStudio is an integrated development environment for R, providing its users with a set of
integrated tools (syntax editor, debugger, syntax highlighting, documentation, and packaging
and deployment options) that make working in R more efficient.

Basics of Writing R Programs


• R is an expression language with a very simple syntax.
• Elementary commands or statements consist of either expressions or assignments.
• When an expression is given as a command, it is evaluated and printed, and its value is lost.
• An assignment evaluates an expression and passes the resulting value to a variable. The value
is not automatically printed.
• Assignment operators in R: “=” and “<-”
• Programming in R (in its simplest form) consists of typing and submitting individual lines of
code that R will execute.
o Example: Open RStudio and look for the Console pane. Once this pane is active, type
the following:
x <- 10
y <- 5
z <- x+y
o The three lines of code are assignment statements (similar to the concept of
assignment operation discussed in SAS). In R, the symbol <- is an assignment
operator. If you look very closely, it is in fact an arrow pointing to the left. Hence, x
<- 10 literally means “put 10 in the object named x”, wherein the object x is a
container in the context of the R programming language.
o Note that upon hitting Enter after each line of code, an entry will appear under the
Environment pane. The Environment pane lists all objects stored in the current
R environment. Nothing is printed in the Console after submitting these lines,
because assignments do not print their values. To view the content of an object, type the
name of the object and press Enter.
NOTE: R is case-sensitive. This means that x is different from X, and
Male_Average is different from male_average. (SAS is case-insensitive.)
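The difference between expressions and assignments, and the effect of case sensitivity, can be seen in a short Console session. This is only a minimal sketch that reuses the objects x, y and z from the example above; the object X and its value are made up for illustration.

```r
x <- 10        # assignment: the value is stored, nothing is printed
y <- 5
z <- x + y     # another assignment, so still no output
x + y          # expression: evaluated and printed, but the value is not stored
z              # typing an object's name prints its contents
X <- 99        # names are case-sensitive, so X is a new object (illustrative value)
identical(x, X)
```

After these lines the Environment pane should list four objects: x, y, z and X.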

R Names
• Normally, all alphanumeric symbols can be used, plus “.” and “_”.
• An R name must start with “.” or a letter.
• If an R name starts with “.”, the second character must not be a digit.
o Valid: MyNumber, my.number, my_number, .error
o Invalid: .2var
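One way to check these naming rules is with the base function make.names(), which returns its input unchanged only when the input is already a syntactically valid name. The sketch below applies it to the example names listed above; the helper objects candidates and is.valid are names chosen here for illustration.

```r
candidates <- c("MyNumber", "my.number", "my_number", ".error", ".2var")

# make.names() repairs invalid names, so an unchanged result means "already valid"
is.valid <- make.names(candidates) == candidates
names(is.valid) <- candidates
is.valid
```

Only .2var should be flagged as invalid, because a leading dot may not be followed by a digit.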
R Basic Operators

Basic Objects and Data Structures in R


• An object in R is a generic term that refers to data, functions, or anything else that the R
system processes (in fact, everything in R is an object).
• R’s base structures can be organized based on their dimensionality and the homogeneity of
their contents.

• There is no 0-dimensional object in R → no scalars → only vectors of length one.


• Use the function str() to check the structure of R objects
• For our purposes, we are only going to discuss the three most important objects: vectors,
lists, and data frames.
• The basic data structure in R is the vector.
o atomic vectors
o list
• Vectors have three common properties:
o Type: what the vector is
o Length: how many elements it contains
o Attributes: additional arbitrary metadata
Elements of an atomic vector have the same type; a list can have elements with different
types.
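The three properties can be inspected with typeof(), length() and attributes(). The following is a small sketch; the objects v and l and their values are made up for illustration.

```r
v <- c(a = 1.5, b = 2.5, c = 3.5)   # a named atomic vector (hypothetical values)
typeof(v)       # type of the elements
length(v)       # number of elements
attributes(v)   # metadata; here only the element names

l <- list(1.5, "text", TRUE)        # a list may mix types
sapply(l, typeof)                   # the type of each list element
```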

ATOMIC VECTORS
• A vector (atomic) is defined as a one-dimensional collection of data points of a similar kind
(number or text)
• The objects x, y and z we created earlier are examples of a vector. Specifically, they are
vectors containing a single value (vectors with a single value are not considered as scalars in
R, they are just vectors of length 1).
o Type the following in the R console:
sample.vector <- c(10.4, 5.6, 3.1, 6.4, 21.7)
sample.vector
o The function c() takes vector arguments (even vectors of varying lengths) and returns a
vector by combining them. (c() is referred to as the combine function.)
o Type the following code:
sample.vector <- c(sample.vector, c(30.4,21.4))
sample.vector
1 / sample.vector
NOTE: In the third line, the reciprocal of the elements of the vector is shown, but no
changes to the contents of sample.vector occurred.
• As per the definition, elements of a vector should all be of the same type.
respondent.name <- c("Martin", "Gessy", NA, "Damaso")
respondent.age <- c(21, 18, 10, NA)
logical.vector <- c(TRUE, TRUE, FALSE, FALSE)
typeof(respondent.name)
typeof(respondent.age)
typeof(logical.vector)
is.na(respondent.name)
is.na(respondent.age)
0/0 #NaN means Not a Number which is different from NA
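Two consequences of the same-type rule and of missing values are worth trying out. The sketch below (with an illustrative object called mixed) shows that combining different types forces a single common type, and how NA and NaN are treated by is.na() and is.nan().

```r
mixed <- c(1, "a", TRUE)   # combining types forces a single common type
typeof(mixed)              # everything was coerced to character

is.na(NA)     # TRUE
is.na(NaN)    # TRUE: is.na() also flags NaN
is.nan(NA)    # FALSE: a missing value is not NaN
is.nan(0/0)   # TRUE

mean(c(21, 18, 10, NA))                # NA, because one value is missing
mean(c(21, 18, 10, NA), na.rm = TRUE)  # drops the NA before averaging
```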
• For vectors containing values that are used for grouping purposes, a special “type” of vector
can be created to account for this information: factors. Factors are important especially for R
functions related to statistical modelling. Note that factors are integer vectors with an
additional attribute: levels.
o Type the following code:
sex <- c("male", "male", "female", "female", "male")
sex
sex <- factor(sex)
sex
group <- c(1,1,2,3)
summary(group)
group <- factor(group)
summary(group)
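The statement that factors are integer vectors with a levels attribute can be checked directly. This is a brief sketch using the sex vector from the example above.

```r
sex <- factor(c("male", "male", "female", "female", "male"))
typeof(sex)       # "integer": the underlying storage
levels(sex)       # the levels attribute, sorted alphabetically by default
as.integer(sex)   # integer codes that index into levels(sex)
table(sex)        # counts per level
```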
• VECTOR OPERATIONS
o Arithmetic operations can be applied to a set of vectors in a straightforward manner
x <- c(1, 2, 3, 4, 5)
y <- x^2
o Operations on vectors are done element-by-element (vectorized operations).
o RECYCLING
§ Type the following codes:
a <- c(1, 2, 3, 4, 5, 6)
b <- a+1
length(a)
length(1)
§ The second line of code demonstrates the concept of recycling when R
operates on vectors of varying lengths. The expression a+1 asks R to add a
vector of length 6 and a vector of length 1. To accomplish this, R “needs” to
come up with a set of vectors with equal length. R does this by recycling the
content of the vector with the shorter length to match the longer vector. Hence,
R creates a vector containing six 1s to match the length of the vector a.
§ How about adding the vector a with a vector of length 2:
c <- c(1,2)
d <- a+c
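§ The recycling done by R replicates the contents of the shorter vector until it
matches the length of the longer vector, even if not all elements of the shorter
vector are replicated the same number of times. In that case R still returns a
result, but it issues a warning that the longer object length is not a multiple of the
shorter object length.
e <- c(1, 2, 3, 4)
f <- a+e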
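The recycled values can be made visible with rep(), which is roughly what R does internally before adding. The sketch below repeats the three additions from this section; suppressWarnings() is used only to keep the output clean in the last case.

```r
a <- c(1, 2, 3, 4, 5, 6)
a + 1            # 1 is recycled six times: 2 3 4 5 6 7
a + c(1, 2)      # c(1, 2) is recycled three times: 2 4 4 6 6 8
rep(c(1, 2), length.out = length(a))   # the recycled vector R effectively uses

# Length 6 is not a multiple of 4, so R recycles 1 2 3 4 1 2 and warns
f <- suppressWarnings(a + c(1, 2, 3, 4))
f                # 2 4 6 8 6 8
```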
• Accessing elements of a vector:
o Elements of a vector can be accessed using the [ operator.
§ x[1] extracts the first element of a vector. In general, x[k] extracts the kth
element of the vector x.
§ Multiple elements of a vector can be extracted by using a vector k of index
positions, e.g.
k <- c(2, 4, 6)
x[k] # extracts the 2nd, 4th and 6th elements of x; a position beyond length(x) returns NA
§ Elements of the vector k need not be in ascending or descending order, but
the order of the elements of the resulting vector is determined by the ordering
of the elements of k.
§ k can also be a logical vector with the same length as x. Elements of x that
will coincide (will have the same index number) with the TRUE elements of k
are the ones that will be extracted.
o Note that the [ operator will always return an object of the same class as the original
object.
o If the elements of an atomic vector are named, the name of the elements can be used
with the [ operator to extract them.
x <- c(math = 90, science = 65)
x["math"] # do not forget the quotation marks
x[c("science", "math")]
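The indexing rules above can be tried on a single vector. This is a minimal sketch; the vector scores and its values are hypothetical.

```r
scores <- c(90, 65, 78, 82, 55, 71)   # hypothetical values
scores[1]              # first element
scores[c(6, 2)]        # order follows the index vector: 71 65
scores > 75            # a logical vector with the same length as scores
scores[scores > 75]    # keeps the elements where the test is TRUE: 90 78 82
scores[10]             # a position past the end returns NA
```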
LISTS
• An R list is an object consisting of a (possibly) named collection of objects known as its
components.
• Use the function list(name_1=object_1, …, name_m=object_m) to create a list, where
name_k is the name of the kth object stored in the list.
• The components of a list need not be of the same type of object. The list can contain any
combination of objects.
o Type the following codes:
sample.list <- list(name = "Fred", wife = "Mary", no.children = 3, child.ages = c(4, 7, 9))
o sample.list contains character and numeric vectors of varying lengths
• It is important to know how to access the contents of a list in R since most functions in R
return several outputs stored in a single list.
• Three subsetting operators can be used when dealing with lists:
o [ operator – returns an object with the same class as the original. This operator can
extract multiple elements from a list.
o $ operator – used to extract a single element from a list using its literal name.
The name cannot be computed or passed through another object.
o [[ operator – used to extract a single element from a list. Either an index number or the
name of the element can be used. Computed values and objects holding the name or index are allowed.
x <- list(owner = "Martin", num.pet = 5, name.pet = c("Wacky", "Gamble",
"Bubbles", "Damaso", "Tibor"))
y <- "owner"
x$owner
x$name.pet
x$y # this doesn't work! why?
x[["owner"]]
x[[y]] # this works! why?
x[[c(3, 2)]] # hmm, how come!?
x[c(3,2)] # what!?
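The questions in the comments above can be answered by looking at what each call returns. The sketch below repeats the calls with a short explanation attached to each one.

```r
x <- list(owner = "Martin", num.pet = 5,
          name.pet = c("Wacky", "Gamble", "Bubbles", "Damaso", "Tibor"))
y <- "owner"

x$y                 # NULL: $ looks for a component literally named "y"
x[[y]]              # "Martin": [[ evaluates y first, then looks up "owner"
x[[c(3, 2)]]        # "Gamble": recursive indexing, the same as x[[3]][[2]]
class(x[c(3, 2)])   # "list": [ keeps the list and returns components 3 and 2
names(x[c(3, 2)])   # "name.pet" "num.pet"
```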
DATA FRAMES
• A data frame is a list with the additional class attribute “data.frame”. In short, it is a “special list”.
o The components must be vectors (numeric, character, or logical), factors, lists or other
data frames. These vectors must have the same length.
o A data frame may be regarded as a matrix with columns that are heterogeneous in
terms of data type.
o A data frame is a 2-dimensional structure, hence it shares properties of both a matrix
and a list (note that a list is a vector!).
• Type the following code:
respondent.name <- c("Martin", "Gessy", "Gamble", "Damaso")
respondent.age <- c(21, 18, 10, NA)
respondent.sex <- factor(c(1, 2, 1, 1))
sample.data <- data.frame(name = respondent.name, age = respondent.age, sex = respondent.sex)
summary(sample.data)
tapply(sample.data$age, sample.data$sex, mean)
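Because a data frame behaves like both a list and a matrix, both styles of access work on it. The sketch below rebuilds sample.data and also shows why the tapply() call above returns NA for group 1: that group contains a missing age. The object names raw.means and group.means are chosen here for illustration.

```r
sample.data <- data.frame(
  name = c("Martin", "Gessy", "Gamble", "Damaso"),
  age  = c(21, 18, 10, NA),
  sex  = factor(c(1, 2, 1, 1))
)
str(sample.data)           # one line per column: a list of equal-length vectors
dim(sample.data)           # matrix-like: 4 rows, 3 columns
sample.data[["age"]]       # list-style access to a column
sample.data[2, "name"]     # matrix-style access by row and column

# Group 1 contains an NA, so its plain mean is NA;
# extra arguments such as na.rm = TRUE are passed on to mean()
raw.means   <- tapply(sample.data$age, sample.data$sex, mean)
group.means <- tapply(sample.data$age, sample.data$sex, mean, na.rm = TRUE)
raw.means
group.means
```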
• The fastest way to create data frames in R is to import external data files. For our purposes,
we are only going to focus on importing Excel files.
o Install the xlsx package, then load it by typing library(xlsx). It provides read.xlsx(), used below.
o Install the readxl package, then load it by typing library(readxl)
o Let us load some data into the R environment
body.fat.data <- read.xlsx(file = "body fat data.xlsx", sheetIndex = 1)
diabetes.data <- read.xlsx(file = "diabetes data set.xlsx", sheetIndex = 1)
summary(body.fat.data)
summary(diabetes.data)
Part 2. Basic Statistics using R
• Unlike in SAS where a single PROC step can generate a set of statistics on a given set of
variables, summary statistics in R are best computed using individual functions.
• Here is a list of the most commonly used functions for generating summary statistics:
o mean(), median()
o sd()
o var()
o min() and max()
o kurtosis() and skewness() from the moments package
o quantile(x, probs), where probs is a vector of proportions between 0 and 1
o CV: user defined function
cv <- function(x) {
  temp <- sd(x) / mean(x) * 100
  return(temp)
}
cv(body.fat.data$age)
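The base-R functions in this list can be tried without any data file. The sketch below uses a made-up vector called age (not the body fat data) and a one-line version of the cv() function; kurtosis() and skewness() are left out because they need the moments package.

```r
age <- c(23, 35, 41, 29, 52, 47, 38, 60)   # hypothetical values, not the course data

mean(age); median(age)
sd(age); var(age)      # var() is the square of sd()
range(age)             # min() and max() in one call

# probs takes proportions between 0 and 1, not percentages
quantile(age, probs = c(0.25, 0.50, 0.75))

cv <- function(x) sd(x) / mean(x) * 100    # same formula as in the text
cv(age)
```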
• Normality test: test for normality (Anderson-Darling) is available within the nortest package
using the function ad.test().
library(nortest)
ad.test(body.fat.data$age)
• One-sample t-test: one sample t-test is implemented in R using the function t.test
o A researcher believes that in recent years women have been getting taller.
She knows that 10 years ago the average height of young adult women living in
her city was 63 inches. The standard deviation is unknown. She randomly
samples eight young adult women currently residing in her city and measures
their heights. The following are the observed values:
64, 66, 68, 60, 62, 65, 66, 63

Test the researcher’s claim using a 0.05-level of significance in R.

height <- c(64, 66, 68, 60, 62, 65, 66, 63)
result <- t.test(x = height, mu = 63, alternative = "greater", conf.level = 0.95)
result # print the test output
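To see what t.test() computes, the statistic and p-value can be reproduced by hand from the usual one-sample formula t = (mean - mu0) / (s / sqrt(n)). This is a sketch using the height data above; t.manual and p.manual are names chosen here.

```r
height <- c(64, 66, 68, 60, 62, 65, 66, 63)
result <- t.test(x = height, mu = 63, alternative = "greater", conf.level = 0.95)

# The same statistic by hand: t = (mean - mu0) / (s / sqrt(n))
n <- length(height)
t.manual <- (mean(height) - 63) / (sd(height) / sqrt(n))
p.manual <- pt(t.manual, df = n - 1, lower.tail = FALSE)   # upper-tail p-value

c(t.test = unname(result$statistic), manual = t.manual)
c(t.test = result$p.value, manual = p.manual)
result$p.value < 0.05   # FALSE here, so H0: mu = 63 is not rejected at the 0.05 level
```

Since t.test() returns a list, its components (statistic, p.value, conf.int and so on) are extracted with the $ operator, as described in the section on lists.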
• Independent samples t-test: use the t.test function again, but first check the equality of
variances assumption
o (Lifespans of Rats (in Days) Given Two Diets) This dataset involves
the lifespan of two groups of rats, one group given a restricted diet and the other
an ad libitum diet (i.e. free eating diet). The research aims to determine whether
lifespan is affected by diet.

Set up the hypothesis test for this research problem.

lifespan <- read.xlsx(file = "Lifespan Data.xlsx", sheetIndex = 1) # read.xlsx() is from the xlsx package
View(lifespan)
names(lifespan)[2] <- "Diet Type"
lifespan$`Diet Type` <- factor(lifespan$`Diet Type`, levels = c(1, 2),
labels = c("Restricted Diet", "Ad Libitum"))
library(car) # provides leveneTest()
leveneTest(y = lifespan$Lifespan, group = lifespan$`Diet Type`)
t.test(lifespan$Lifespan ~ lifespan$`Diet Type`, alternative = "two.sided", mu = 0,
var.equal = FALSE, conf.level = 0.95)
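The role of the var.equal argument can be seen without the Excel file. The sketch below uses made-up lifespans in a data frame called rats (these are not the values in Lifespan Data.xlsx) and compares the Welch test (var.equal = FALSE) with the pooled-variance test (var.equal = TRUE).

```r
# Hypothetical lifespans in days (not the contents of Lifespan Data.xlsx)
rats <- data.frame(
  Lifespan = c(1010, 1100, 980, 1150, 1070, 720, 650, 810, 690, 760),
  Diet     = factor(rep(c("Restricted Diet", "Ad Libitum"), each = 5))
)

welch  <- t.test(Lifespan ~ Diet, data = rats, var.equal = FALSE)
pooled <- t.test(Lifespan ~ Diet, data = rats, var.equal = TRUE)

welch$parameter    # Welch degrees of freedom, usually not a whole number
pooled$parameter   # pooled degrees of freedom: n1 + n2 - 2 = 8
```

If the equality of variances assumption looks reasonable, var.equal = TRUE gives the classical pooled test; otherwise the Welch version is the safer choice.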
• Paired samples t-test: also implemented under the t.test function
o A neuroscientist believes that the lateral hypothalamus is involved in eating
behaviour. If so, then electrical stimulation of that area might affect the amount
eaten. To test this possibility, chronic indwelling electrodes are implanted in 10
rats. Each rat has two electrodes: one implanted in the lateral hypothalamus and
the other in the area where electrical stimulation is known to have no effect.
After the animals have recovered from surgery, they each receive 30 minutes of
electrical stimulation to each brain area, and the amount of food eaten during the
stimulation is measured. The data is in Brain Stimulation.xlsx. Use 0.05 level of
significance.

brain <- read.xlsx(file = "Brain Stimulation.xlsx", sheetIndex = 1) # read.xlsx() is from the xlsx package
t.test(x = brain$Lateral.Hypothalamus, y = brain$Neutral.Area, alternative = "two.sided",
mu = 0, paired = TRUE, conf.level = 0.95)
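A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below illustrates this with made-up values (they are not the contents of Brain Stimulation.xlsx); lateral, neutral, paired and one.samp are names chosen here.

```r
# Hypothetical amounts eaten (not the contents of Brain Stimulation.xlsx)
lateral <- c(10, 18, 16, 22, 14, 25, 17, 22, 12, 21)
neutral <- c(6, 8, 11, 14, 10, 20, 10, 18, 14, 13)

paired   <- t.test(x = lateral, y = neutral, paired = TRUE)
one.samp <- t.test(x = lateral - neutral, mu = 0)

# Both approaches give the same statistic and p-value
c(paired = unname(paired$statistic), differences = unname(one.samp$statistic))
c(paired = paired$p.value, differences = one.samp$p.value)
```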
• Test for Proportions: use the prop.test function
o An apparel store is considering a promotion that involves mailing discount coupons to
all their credit card customers. This promotion will be considered a success if more
than 10% of those receiving the coupons use them. Before launching this in the entire
country, coupons were sent to a sample of 100 credit card customers. If 15 credit card
customers used the coupon, will you recommend the full launch of this promotion?
Use a 0.05-level of significance.

prop.test(x = 15,n = 100,p = 0.10,alternative = "greater",conf.level = 0.95)


binom.test(x = 15,n = 100,p = 0.10,alternative = "greater",conf.level = 0.95)
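The two calls above need not agree exactly: prop.test() uses a normal approximation (with a continuity correction by default), while binom.test() is exact. The sketch below compares the three variants for the coupon example; the object names are chosen here for illustration.

```r
with.cc    <- prop.test(x = 15, n = 100, p = 0.10, alternative = "greater")
without.cc <- prop.test(x = 15, n = 100, p = 0.10, alternative = "greater",
                        correct = FALSE)
exact      <- binom.test(x = 15, n = 100, p = 0.10, alternative = "greater")

# Statistic with the correction: (|x - n*p| - 0.5)^2 / (n*p*(1 - p))
(abs(15 - 100 * 0.10) - 0.5)^2 / (100 * 0.10 * 0.90)
with.cc$statistic

c(corrected = with.cc$p.value, uncorrected = without.cc$p.value,
  exact = exact$p.value)
```

For this example the corrected and exact p-values are above 0.05 while the uncorrected one is slightly below it, so the conclusion depends on which version is used.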
• Confidence Intervals
library(Rmisc)
CI(x = height,ci = 0.95)
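The same kind of t-based interval for a mean can be reproduced in base R as mean ± t × s / sqrt(n). This is a sketch using the height data from the one-sample t-test example; se, t.crit and ci.manual are names chosen here, and t.test() serves as a cross-check.

```r
height <- c(64, 66, 68, 60, 62, 65, 66, 63)
n  <- length(height)
se <- sd(height) / sqrt(n)                   # standard error of the mean
t.crit <- qt(0.975, df = n - 1)              # two-sided 95% critical value

ci.manual <- mean(height) + c(-1, 1) * t.crit * se
ci.manual
t.test(height, conf.level = 0.95)$conf.int   # base-R cross-check
```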
• Test for Independence
o The data in the table below was taken from a report on the relationship between
aspirin use and myocardial infarction (heart attacks) by the Physicians’ Health Study
Research Group at Harvard Medical School. The Physicians’ Health Study was a five-
year randomized study testing whether regular intake of aspirin reduces mortality
from cardiovascular disease. Every other day, physicians participating in the study
took either one aspirin tablet or a placebo. The subjects were blind as to what type of
pill they were taking. Are the two variables associated? Use 𝛼 = 0.05.

chisq.test(table.1)
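The call above assumes that table.1 already holds the 2 x 2 table of counts. One way to build such an object is with matrix(). The sketch below uses placeholder counts purely to show the mechanics; they are not the figures from the Physicians' Health Study, and the row and column labels are assumptions.

```r
# Placeholder counts for illustration only; NOT the study's published figures
table.1 <- matrix(c(20, 80,
                    40, 60),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Group = c("Aspirin", "Placebo"),
                                  MI    = c("Yes", "No")))
table.1

test.1 <- chisq.test(table.1)   # Yates' correction is applied to 2 x 2 tables by default
test.1$expected                 # expected counts under independence
test.1$p.value
```

Replace the placeholder counts with the values from the table before interpreting the result.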
