
Basic Introduction to R

By
Dr. Manika Gupta
Assistant Professor
Department of Geology
Delhi University

24-Mar-17 10:00 AM
Benefits of R

• Open source language


• Packages are continuously being developed and made available

Installation
Install R:
https://cran.r-project.org/bin/windows/base/

Install R studio:
https://www.rstudio.com/products/rstudio/download3/

Resources
• http://cran.us.r-project.org/doc/manuals/R-intro.pdf
• https://onlinecourses.science.psu.edu/statprogram/sites/onlinecourses.science.psu.edu.statprogram/files/lesson00/Short-refcard.pdf
• https://www.datacamp.com/courses/free-introduction-to-r
• http://www.cyclismo.org/tutorial/R/
• http://rseek.org/
Help in R

help.start() # general help

help("mean") # help about the function mean

?mean # same thing

apropos("mean") # list all functions whose names contain the string "mean"

example(mean) # run the examples for the function mean


Basics!!!
Variables:
integers
characters
logical

Assigning values to variables:


x <- 5
x
y <- 3
z <- x+y
z

class(X)
Error: object 'X' not found

R is case sensitive
class(x)
[1] "numeric"
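As a minimal sketch of the basic types listed above, class() reports how R stores each value. The variable names are illustrative only and do not come from the slides.

```r
# Illustrative names, chosen only for this sketch
n_int <- 5L         # the L suffix makes an integer
n_num <- 5          # a plain number is stored as "numeric" (double)
txt   <- "geology"  # character string
flag  <- TRUE       # logical

types <- c(class(n_int), class(n_num), class(txt), class(flag))
print(types)
```

Note that a number typed without the L suffix is "numeric", which is why class(x) above reports "numeric" rather than "integer".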
Operators in R
Binary operators work on vectors and matrices as well as scalars.

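Arithmetic Operators include:
• + (addition), - (subtraction), * (multiplication), / (division)
• ^ or ** (exponentiation), %% (modulus), %/% (integer division)

Logical Operators include:
• < , <= , > , >= (comparisons), == (exactly equal to), != (not equal to)
• !x (not x), x | y (x or y), x & y (x and y)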
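A minimal sketch of the common arithmetic, comparison and logical operators applied element by element to vectors. The vectors x and y are illustrative values chosen for this example.

```r
x <- c(7, 2, 9)
y <- c(2, 2, 4)

add_xy  <- x + y     # addition
pow_x   <- x ^ 2     # exponentiation
mod_xy  <- x %% y    # modulus (remainder after division)
idiv_xy <- x %/% y   # integer division

gt_xy  <- x > y               # comparison gives a logical vector
eq_xy  <- x == y              # equality test
and_xy <- (x > 5) & (y < 4)   # logical AND
or_xy  <- (x > 5) | (y < 4)   # logical OR
not_x  <- !(x > 5)            # logical NOT

print(mod_xy)
print(and_xy)
```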


Data Formats
Vector
A vector is a sequence of data elements of the same basic type.
Suppose we have a vector containing the three numeric values 2, 3 and 5 -
> aa <- c(2, 3, 5)
> aa
[1] 2 3 5
And a vector of logical values -
> bb <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
> bb
[1] TRUE FALSE TRUE FALSE FALSE

A vector can contain character strings -

> cc <- c("aa", "bb", "cc", "dd", "ee")
> cc
[1] "aa" "bb" "cc" "dd" "ee"

> length(aa)
[1] 3
Arithmetic operations on vectors are performed member-by-member.

> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)

> 5*a
[1] 5 15 25 35

> a+b
[1] 2 5 9 15

> a-b
[1] 0 1 1 -1

> a*b
[1] 1 6 20 56

> a/b
[1] 1.000 1.500 1.250 0.875

Mathematical functions are also applied member-by-member.
> a <- c(1,2,3,4)

> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000

> exp(a)
[1] 2.718282 7.389056 20.085537 54.598150

> log(a)
[1] 0.0000000 0.6931472 1.0986123 1.3862944

> exp(log(a))
[1] 1 2 3 4

> cc <- (a + sqrt(a))/(exp(2)+1)


> cc
[1] 0.2384058 0.4069842 0.5640743 0.7152175
> a <- c(1,-2,3,-4)
> b <- c(-1,2,-3,4)
> min(a,b) # smallest value across both vectors
[1] -4
> pmin(a,b) # member-by-member (parallel) minimum
[1] -1 -2 -3 -4
> max(a)
[1] 3
> sum(a)
[1] -2

Recycling Rule
If two vectors are of unequal length, the shorter one will be recycled
in order to match the longer vector.
> u = c(10, 20, 30)
> v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> u+v
[1] 11 22 33 14 25 36 17 28 39
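When the longer length is not a multiple of the shorter one, R still recycles but signals a warning. The sketch below captures that warning so the script keeps running; the vector w and the variable caught are illustrative names.

```r
u <- c(10, 20, 30)
w <- c(1, 2, 3, 4)   # length 4 is not a multiple of length 3

caught <- NULL
res <- withCallingHandlers(
  u + w,                                  # u is recycled as 10, 20, 30, 10
  warning = function(cond) {
    caught <<- conditionMessage(cond)     # keep the warning text
    invokeRestart("muffleWarning")        # carry on with the computation
  }
)
print(res)
print(caught)
```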
Vector Index
We retrieve values in a vector by declaring an index inside a single square bracket "[]"
operator.

> s <- c("aa", "bb", "cc", "dd", "ee")


> s[3]
[1] "cc"
> s[c(2, 3)]
[1] "bb" "cc"

Negative Index
If the index is negative, R removes the member whose position equals the absolute
value of the index and returns the rest.
> s[-3]
[1] "aa" "bb" "dd" "ee"

Out-of-Range Index
If an index is out-of-range, a missing value will be reported via the symbol NA.
> s[10]
[1] NA
> length(s) # s has only 5 members
[1] 5
Duplicate Indexes
The index vector allows duplicate values.
> s[c(2, 3, 3)]
[1] "bb" "cc" "cc"

Out-of-Order Indexes
The index vector can even be out-of-order.
> s[c(2, 1, 3)]
[1] "bb" "aa" "cc"

Range Index
To produce a vector slice between two indexes, we can use the colon operator ":".
> s[2:4]
[1] "bb" "cc" "dd"
Matrix
A matrix is a collection of data elements arranged in a two-dimensional rectangular
layout. The following is an example of a matrix with 2 rows and 3 columns.

> A = matrix(
+ c(2, 4, 3, 1, 5, 7), # the data elements
+ nrow=2, # number of rows
+ ncol=3, # number of columns
+ byrow = TRUE) # fill matrix by rows

> A # print the matrix
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7
An element at the mth row, nth column of A can be accessed by the expression
A[m, n]
> A[2, 3] # element at 2nd row, 3rd column
[1] 7
The entire mth row of A can be extracted as A[m, ].
> A[2, ] # the 2nd row
[1] 1 5 7
Similarly, the entire nth column of A can be extracted as A[ ,n].
> A[ ,3] # the 3rd column
[1] 3 7
We can also extract more than one row or column at a time.
> A[ ,c(1,3)] # the 1st and 3rd columns
     [,1] [,2]
[1,]    2    3
[2,]    1    7
If we assign names to the rows and columns of the matrix, then we can access the
elements by name.
> dimnames(A) = list( c("row1", "row2"), # row names
+ c("col1", "col2", "col3")) # column names
> A # print A
     col1 col2 col3
row1    2    4    3
row2    1    5    7
> A["row2", "col3"] # element at 2nd row, 3rd column
[1] 7

Transpose
We construct the transpose of a matrix by interchanging its columns and rows
with the function t.
> t(A) # transpose of A
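A short sketch of what the transpose does to the dimensions, rebuilding the 2 x 3 matrix A from the earlier slide so the example is self-contained. The name At is illustrative.

```r
A <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)
At <- t(A)

print(dim(A))    # rows, columns of A
print(dim(At))   # rows and columns are swapped
print(At)

# the element at row 2, column 3 of A sits at row 3, column 2 of t(A)
same_element <- A[2, 3] == At[3, 2]
```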
Combining Matrices

The columns of two matrices having the same number of rows can be combined into a
larger matrix. For example, suppose we have a matrix B with 3 rows and 2 columns (the
values below are only an example) and another matrix C also with 3 rows.
> B = matrix( c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)
> C = matrix( c(7, 4, 2), nrow=3, ncol=1)

Then we can combine the columns of B and C with cbind.

> cbind(B, C)

Similarly, we can combine the rows of two matrices if they have the same number of
columns with the rbind function.

> D = matrix( c(6, 2), nrow=1, ncol=2)

> rbind(B, D)
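The sketch below checks the dimensions that result from cbind and rbind, and shows what happens when the shapes do not match. The values in B are an assumption made for illustration (any 3 x 2 matrix would do); the names BC, BD and bad are also illustrative.

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)  # assumed 3 x 2 example
C <- matrix(c(7, 4, 2), nrow = 3, ncol = 1)
D <- matrix(c(6, 2), nrow = 1, ncol = 2)

BC <- cbind(B, C)   # same number of rows: columns are placed side by side
BD <- rbind(B, D)   # same number of columns: rows are stacked

print(dim(BC))
print(dim(BD))

# B has 2 columns and C has 1, so stacking them by rows is an error
bad <- tryCatch(rbind(B, C), error = function(e) conditionMessage(e))
print(bad)
```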
List
A list is a generic vector containing other objects.
For example, the following variable x is a list containing copies of three vectors n, s, b, and a
numeric value 3.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3) # x contains copies of n, s, b

List Slicing
We retrieve a list slice with the single square bracket "[]" operator. The following is a slice
containing the second member of x, which is a copy of s.
> x[2]

With an index vector, we can retrieve a slice with multiple members. Here is a slice containing
the second and fourth members of x.
> x[c(2, 4)]
Member Reference
In order to reference a list member directly, we have to use the double square bracket "[[]]"
operator.

> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
We can modify its content directly.
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
> s
[1] "aa" "bb" "cc" "dd" "ee" # s is unaffected
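The difference between the two bracket operators can be checked directly: a single bracket returns a list, while a double bracket returns the member itself. This sketch rebuilds the list x from the slide; slice and member are illustrative names.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3)

slice  <- x[2]     # a list of length 1 that holds a copy of s
member <- x[[2]]   # the character vector itself

print(class(slice))
print(class(member))
print(length(slice))
print(length(member))
```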
Data Frame
A data frame is used for storing data tables. It is a list of vectors of equal length. For
example, the following variable df is a data frame containing three vectors n, s, b.

> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame
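A brief sketch of how the data frame built above can be inspected. It uses only base R functions; the variable true_rows is an illustrative name.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)

print(dim(df))     # number of rows and columns
print(names(df))   # column names are taken from the vector names
str(df)            # compact summary of each column's type

true_rows <- df[df$b, ]   # keep only the rows where b is TRUE
print(true_rows)
```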

Built-in Data Frame

For example, here is a built-in data frame in R, called mtcars.

> mtcars
Here is the cell value from the first row, second column of mtcars.
> mtcars[1, 2]
[1] 6

Moreover, we can use the row and column names instead of the numeric coordinates.
> mtcars["Mazda RX4", "cyl"]
[1] 6

Lastly, the number of data rows in the data frame is given by the nrow function.
> nrow(mtcars) # number of data rows
[1] 32

And the number of columns of a data frame is given by the ncol function.
> ncol(mtcars) # number of columns
[1] 11

> head(mtcars)
> names(mtcars)
> dimnames(mtcars)
Retrieve column vectors

> mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...

> mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...

> mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 ...

> mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
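As with lists, a single bracket on a data frame returns a smaller data frame rather than a vector. A minimal sketch using the built-in mtcars data; col_df, col_vec and two_cols are illustrative names.

```r
col_df   <- mtcars["am"]              # single bracket: still a data frame
col_vec  <- mtcars[["am"]]            # double bracket: the column vector
two_cols <- mtcars[c("mpg", "hp")]    # several columns at once

print(class(col_df))
print(class(col_vec))
print(head(two_cols))
```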
Read and write data files
Working Directory

The data files need to be located in the R working directory, which can
be found with the function getwd.
> getwd() # get current working directory

When we need to set a directory other than the current one -


> setwd("<new path>") # set working directory

Note that the forward slash should be used as the path separator even
on the Windows platform.

> setwd("C:/Users/Geology/Desktop/R_workshop")
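A sketch of a common pattern: remember the current directory, switch to another one, build a file path, and switch back. It uses tempdir() as a stand-in for a real project folder, which is an assumption made so the example runs anywhere.

```r
old_wd <- getwd()                  # remember where we started
setwd(tempdir())                   # stand-in for a real data folder

# file.path joins path components with forward slashes
data_path <- file.path(getwd(), "my_data.csv")
print(data_path)

setwd(old_wd)                      # return to the original directory
```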
Data Import
It is often necessary to import data into R

Excel File
We can use the function read.xlsx from the xlsx package. It reads from an Excel spreadsheet
and returns a data frame.

> library(xlsx) # load xlsx package (require(xlsx) also works)

> mydata <- read.xlsx("my_data.xlsx", 1) # read from first sheet

Alternatively, we can use the function loadWorkbook from the XLConnect package to read
the entire workbook, and then load the worksheets with readWorksheet.
The XLConnect package requires Java to be pre-installed.
> library(XLConnect) # load XLConnect package

> wk <- loadWorkbook("my_data.xlsx")

> df <- readWorksheet(wk, sheet="Sheet1")


Text File

Load txt file into the workspace with the function read.table.

> mydata <- read.table("my_data.txt") # read text file


> mydata # print data frame

V1 V2 V3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3
4 400 a4 b4

For further detail of the function read.table, please consult the R documentation.
> help(read.table)
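When the text file has a header row, header = TRUE makes read.table use it for the column names instead of V1, V2, V3. The sketch writes a small temporary file first so it is self-contained; the column names id, code1 and code2 are made up for illustration.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("id code1 code2",
             "100 a1 b1",
             "200 a2 b2"), tmp)

# whitespace is the default separator for read.table
mydata <- read.table(tmp, header = TRUE, stringsAsFactors = FALSE)
print(mydata)
print(names(mydata))

unlink(tmp)   # remove the temporary file
```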
CSV File

Sample data in .csv file -


100,a1,b1
200,a2,b2
300,a3,b3

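After we copy and paste the data above into a file named "my_data.csv", we can read the data
with the function read.csv. The sample has no header row, so we set header=FALSE;
otherwise the first data row would be used as column names.

> mydata <- read.csv("my_data.csv", header=FALSE) # read csv file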


> mydata

> help(read.csv)
> tree <- read.csv(file="trees91.csv", header=TRUE, sep=",")
> attributes(tree)
> names(tree)

> tree$C
> summary(tree$C)
> tree$C <- factor(tree$C) # treat the codes in column C as categories
> tree$C

> summary(tree$C)
> levels(tree$C)
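The effect of factor() on summary() can be seen without the trees91.csv file. The sketch below uses a made-up vector of numeric codes; the values and the names codes and codes_f are hypothetical.

```r
codes <- c(1, 2, 2, 3, 1, 2)   # hypothetical category codes

print(summary(codes))          # numeric summary: min, quartiles, mean, max

codes_f <- factor(codes)       # treat the codes as categories
print(levels(codes_f))         # the distinct categories
print(summary(codes_f))        # now a count per category
```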
Minitab File
If the data file is in Minitab Portable Worksheet format, it can be opened with the function
read.mtp from the foreign package.
It returns a list of components in the Minitab worksheet.
> library(foreign) # load the foreign package

> help(read.mtp) # documentation

> mydata <- read.mtp("my_data.mtp") # read from .mtp file

SPSS File
Data files in SPSS format can be opened with the function read.spss, also from the
foreign package. There is a "to.data.frame" option for choosing whether a data frame is to
be returned. By default, it returns a list of components instead.
> library(foreign) # load the foreign package

> help(read.spss) # documentation

> mydata <- read.spss("international.sav", to.data.frame=TRUE)


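write.table(mydata, "mydata.txt", sep="\t") # tab-delimited text file

write.csv(mydata, "mydata.csv") # comma-separated file

write.xlsx(mydata, "mydata.xlsx") # Excel file; needs the xlsx package

write.foreign(mydata, "mydata.txt", "mydata.sps", package="SPSS") # needs the foreign package

write.foreign(mydata, "mydata.txt", "mydata.sas", package="SAS")

write.dta(mydata, "mydata.dta") # Stata file; needs the foreign package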
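A minimal round trip with write.csv and read.csv, assuming a small made-up data frame and a temporary file. Setting row.names = FALSE is a choice made here so that the row numbers are not written as an extra column.

```r
mydata <- data.frame(id = c(100, 200, 300),
                     code = c("a1", "a2", "a3"),
                     stringsAsFactors = FALSE)

out <- tempfile(fileext = ".csv")
write.csv(mydata, out, row.names = FALSE)     # write without row numbers

back <- read.csv(out, stringsAsFactors = FALSE)
print(back)

unlink(out)   # remove the temporary file
```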
Missing Values
is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)

View(mydata)
> mydata$m_illit[mydata$m_illit==3.0]<-NA
> head(mydata)

x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2

# list rows of data that have missing values


mydata[!complete.cases(mydata),]

# create new dataset without missing data


newdata <- na.omit(mydata)
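The missing-value tools above can be tried on a small made-up data frame. The column names site, depth and ph are hypothetical and chosen only for this sketch.

```r
mydata <- data.frame(site  = c("A", "B", "C", "D"),
                     depth = c(1.2, NA, 3.4, 2.0),
                     ph    = c(7.1, 6.8, NA, 7.4))

na_counts  <- colSums(is.na(mydata))             # missing values per column
incomplete <- mydata[!complete.cases(mydata), ]  # rows with any NA
newdata    <- na.omit(mydata)                    # rows with no NA
depth_mean <- mean(mydata$depth, na.rm = TRUE)   # mean of the non-missing depths

print(na_counts)
print(incomplete)
print(newdata)
```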
Basic plots
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg ~ wt))
title("Regression of MPG on Weight")

jpeg("myplot.jpg")
plot(wt, mpg)
dev.off()
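An alternative sketch that avoids attach() by passing data = mtcars to the plotting and model functions, which keeps column names from masking other variables. The pdf device and the temporary file name are choices made here so the example runs without a graphical display.

```r
out <- tempfile(fileext = ".pdf")

pdf(out)                                    # open a file-based graphics device
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight", ylab = "Miles Per Gallon",
     main = "Regression of MPG on Weight")
fit <- lm(mpg ~ wt, data = mtcars)          # linear regression of mpg on wt
abline(fit)                                 # add the fitted line
dev.off()                                   # close the device and write the file

print(coef(fit))
print(file.exists(out))
```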
Error Bars

x <- rnorm(10,sd=5,mean=20)
y <- 2.5*x - 1.0 + rnorm(10,sd=9,mean=0)
plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
xHigh <- x
yHigh <- y + abs(rnorm(10,sd=3.5))
xLow <- x
yLow <- y - abs(rnorm(10,sd=3.1))
arrows(xHigh,yHigh,xLow,yLow,col=2,angle=90,length=0.1,code=3)
# Simple Histogram
hist(mtcars$mpg)
# Colored Histogram with Different Number of Bins
hist(mtcars$mpg, breaks=12, col="red")

# Kernel Density Plot


d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results

# Filled Density Plot


d <- density(mtcars$mpg)
plot(d, main="Kernel Density of Miles Per Gallon")
polygon(d, col="red", border="blue")
# Compare MPG distributions for cars with
# 4,6, or 8 cylinders
library(sm)
attach(mtcars)

# create value labels


cyl.f <- factor(cyl, levels= c(4,6,8),
labels = c("4 cylinder", "6 cylinder", "8 cylinder"))

# plot densities
sm.density.compare(mpg, cyl, xlab="Miles Per Gallon")
title(main="MPG Distribution by Car Cylinders")

# add legend via mouse click


colfill<-c(2:(1+length(levels(cyl.f))))
legend(locator(1), levels(cyl.f), fill=colfill)

# Simple Dotplot
dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,
main="Gas Mileage for Car Models",
xlab="Miles Per Gallon")
# Dotplot: Grouped Sorted and Colored
# Sort by mpg, group and color by cylinder

x <- mtcars[order(mtcars$mpg),] # sort by mpg

x$cyl <- factor(x$cyl) # it must be a factor


x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"

dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
main="Gas Mileage for Car Models\ngrouped by cylinder",
xlab="Miles Per Gallon", gcolor="black", color=x$color)
# Simple Bar Plot
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution",
xlab="Number of Gears")

# Simple Horizontal Bar Plot with Added Labels


counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"))

# Stacked Bar Plot with Colors and Legend


counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts))
# Grouped Bar Plot
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts), beside=TRUE)

# Fitting Labels
par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.

counts <- table(mtcars$gear)


barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"), cex.names=0.8)
Line plots

x <- c(1:5); y <- x # create some data


par(pch=22, col="red") # plotting symbol and color
par(mfrow=c(2,4)) # all plots on one page
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
heading = paste("type=",opts[i])
plot(x, y, type="n", main=heading)
lines(x, y, type=opts[i])
}
# Create Line Chart
# convert factor to numeric for convenience

Orange$Tree <- as.numeric(Orange$Tree)


ntrees <- max(Orange$Tree)

# get the range for the x and y axis


xrange <- range(Orange$age)
yrange <- range(Orange$circumference)
# set up the plot
plot(xrange, yrange, type="n", xlab="Age (days)",
ylab="Circumference (mm)" )
colors <- rainbow(ntrees)
linetype <- c(1:ntrees)
plotchar <- seq(18,18+ntrees,1)
# add lines
for (i in 1:ntrees) {
tree <- subset(Orange, Tree==i)
lines(tree$age, tree$circumference, type="b", lwd=1.5,
lty=linetype[i], col=colors[i], pch=plotchar[i])
}
# add a title and subtitle
title("Tree Growth", "example of line plot")

# add a legend
legend(xrange[1], yrange[2], 1:ntrees, cex=0.8, col=colors,
pch=plotchar, lty=linetype, title="Tree")
# Simple Pie Chart
slices <- c(10, 12,4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")

# Pie Chart with Percentages


slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Countries")
# 3D Exploded Pie Chart
library(plotrix)
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie3D(slices,labels=lbls,explode=0.1,
main="Pie Chart of Countries ")
# Boxplot of MPG by Car Cylinders
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon")

# Notched Boxplot of Tooth Growth Against 2 Crossed Factors
# boxes colored for ease of interpretation

boxplot(len~supp*dose, data=ToothGrowth, notch=TRUE,
col=(c("gold","darkgreen")),
main="Tooth Growth", xlab="Supplement and Dose")

# In the notched boxplot, if two boxes' notches do not overlap, this is
# 'strong evidence' that their medians differ (Chambers et al., 1983, p. 62)
# Simple Scatterplot
attach(mtcars)
plot(wt, mpg, main="Scatterplot Example",
xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)

# Add fit lines


abline(lm(mpg~wt), col="red") # regression line (y~x)
lines(lowess(wt,mpg), col="blue") # lowess line (x,y)
# Enhanced Scatterplot of MPG vs. Weight
# by Number of Car Cylinders
library(car)
scatterplot(mpg ~ wt | cyl, data=mtcars,
xlab="Weight of Car", ylab="Miles Per Gallon",
main="Enhanced Scatter Plot",
labels=row.names(mtcars))

# Basic Scatterplot Matrix


pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")
# Scatterplot Matrices from the gclus Package
library(gclus)
dta <- mtcars[c(1,3,5,6)] # get data
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Variables Ordered and Colored by Correlation" )
# 3D Scatterplot
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, main="3D Scatterplot")

# 3D Scatterplot with Coloring and Vertical Drop Lines


library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,
type="h", main="3D Scatterplot")
Thank You!!!
Additional slides
# Display the Student's t distributions with various
# degrees of freedom and compare to the normal distribution
x <- seq(-4, 4, length=100)
hx <- dnorm(x)

degf <- c(1, 3, 8, 30)


colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")

plot(x, hx, type="l", lty=2, xlab="x value",
ylab="Density", main="Comparison of t Distributions")

for (i in 1:4){
lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}

legend("topright", inset=.05, title="Distributions",
labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)
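# A variable created inside a function is local to that function
x_func <- function(a, b)
{
x <- a + b # local x; the value of the last expression is returned (invisibly)
}
ab <- x_func(5, 5) # ab is 10
x # prints the x in the workspace (if one exists), not the local x

# With several expressions, only the value of the last one is returned
x_func <- function(a, b){
x <- a + b
y <- (2*a) - b
}
ab <- x_func(5, 5) # ab is 5, the value of y
x # still the workspace x

# To return several values, combine them and use return()
x_func <- function(a, b){
x <- a + b
y <- a - b
result <- c(x, y)
return(result)}
ab <- x_func(5, 5) # ab is c(10, 0)
x # still the workspace x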
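When the returned values have different meanings, a named list can be easier to read than an unnamed vector. This is one possible variation on the function above, not part of the original slides; sum_diff and its element names are illustrative.

```r
# Hypothetical helper: returns both results with names
sum_diff <- function(a, b) {
  list(sum = a + b, diff = a - b)
}

res <- sum_diff(5, 5)
print(res$sum)    # access a result by name
print(res$diff)
```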
