

ROLE OF ‘R’…IN WHICH STORY


The R language is widely used among statisticians
and data miners for developing statistical
software and data analysis.

Instead of lengthy programming, R makes it easy to visualize statistical
computations (ready-made methods, less code, and many packages included).

R is one of the analytical tools

INTRODUCTION TO R:

R IS A PROGRAMMING LANGUAGE

R IS AN ANALYTICAL TOOL

R IS A SCRIPTING LANGUAGE

R STUDIO IS A SOFTWARE ENVIRONMENT

– A free and open source programming language for statistical computing and graphics.

• Founders of R: Ross Ihaka & Robert Gentleman

• RStudio is an IDE for developing in R, founded by JJ Allaire

• R is an implementation of the S language, a statistical language.

• Latest version of R = R 3.4.2 for Windows 32/64bit.

Features of R:

• More powerful data manipulation capabilities

• Easier automation

• Faster computation

• It reads any type of data

• Easier project organization

• It supports larger data sets

• Reproducibility (important for detecting errors)

• Easier to find and fix errors

• It's free

• It's open source

• Advanced Statistics capabilities

• State-of-the-art graphics

• It runs on many platforms

• Anyone can contribute packages to improve its functionality

R Basics
Control Structures

Conditional Executions
Comparison Operators

 equal: ==

 not equal: !=
 greater/less than: > <
 greater/less than or equal: >= <=

Logical Operators

 and: &
 or: |
 not: !

If Statements
If statements operate on length-one logical vectors.

Syntax

if(cond1) { cmd1 } else { cmd2 }

Example

if(1==0) {
print(1)
} else {
print(2)
}
[1] 2
Avoid inserting a newline between '}' and 'else'.

Ifelse Statements
Ifelse statements operate on vectors of variable length.

Syntax

ifelse(test, true_value, false_value)

Example

x <- 1:10 # Creates sample data


ifelse(x<5 | x>8, x, 0)
[1] 1 2 3 4 0 0 0 0 9 10

Loops
The most commonly used loop structures in R are for, while and apply loops. Less
common are repeat loops. The break function is used to break out of loops,
and next halts the processing of the current iteration and advances the looping index.

For Loop
For loops are controlled by a looping vector. In every iteration of the loop one value in
the looping vector is assigned to a variable that can be used in the statements of the
body of the loop. Usually, the number of loop iterations is defined by the number of
values stored in the looping vector and they are processed in the same order as they are
stored in the looping vector.

Syntax

for(variable in sequence) {
statements
}
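
As a minimal sketch of the syntax above, the loop below adds up the values of a looping vector and also shows next and break in use. The vector `values` and the variable `total` are illustrative names, not part of the original text.

```r
# Looping vector (illustrative values).
values <- c(3, 8, -1, 5, 12)

total <- 0
for (v in values) {
  if (v < 0) next      # skip negative values and move on to the next iteration
  if (v > 10) break    # leave the loop at the first value above 10
  total <- total + v   # accumulate the remaining values
}
print(total)
```

The values are processed in the order in which they are stored, so 3, 8 and 5 are summed before the loop stops at 12.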

While Loop
Similar to for loop, but the iterations are controlled by a conditional statement.

Syntax

while(condition) statements
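
A small sketch of a while loop, assuming an illustrative starting value `n` and a counter `steps` (both hypothetical names): the body keeps running for as long as the condition is TRUE.

```r
# Count how many times a number can be halved before it drops below 1.
n <- 40        # illustrative starting value
steps <- 0
while (n >= 1) {
  n <- n / 2            # the condition is re-checked after every pass
  steps <- steps + 1
}
print(steps)
```

Unlike a for loop, the number of iterations is not known in advance here; it depends on when the condition first becomes FALSE.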

Functions
A very useful feature of the R environment is the possibility to expand existing functions
and to easily write custom functions. In fact, most of the R software can be viewed as a
series of R functions.

Syntax to define functions

myfct <- function(arg1, arg2, ...) {


function_body
}

The value returned by a function is the value of the function body, which is usually
the final unassigned expression. A value can also be returned explicitly with return().
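
The syntax can be illustrated with a short custom function. This is only a sketch; `coef_var` is a hypothetical helper name chosen for this example, and it relies only on the base functions sd() and mean().

```r
# Hypothetical helper: coefficient of variation of a numeric vector.
coef_var <- function(x, na.rm = FALSE) {
  s <- sd(x, na.rm = na.rm)
  m <- mean(x, na.rm = na.rm)
  s / m   # the last evaluated expression becomes the return value
}

cv <- coef_var(c(2, 4, 4, 4, 5, 5, 7, 9))
print(cv)
```

Because the final expression is not assigned to anything, its value is what the function returns; writing return(s / m) would have the same effect.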

Running R Programs
(1) Executing an R script from the R console

source("my_script.R")

PROGRAM 1: IMPLEMENT ALL BASIC COMMANDS

help() : Obtain documentation for a given R command
example() : View some examples on the use of a command
seq() : Make arithmetic progression vector
rep() : Make vector of repeated values
data() : Load a built-in dataset (often into a data.frame)
View() : View dataset in a spreadsheet-type format
library(), require() : Make available an R add-on package
length() : Give length of a vector
ls() : List objects in memory (the workspace)
rm() : Remove an item from memory
names() : List names of variables in a data.frame
hist() : Command for producing a histogram
histogram() : Lattice command for producing a histogram
table() : List all values of a variable with frequencies
mean(), median() : Identify “center” of distribution
by() : Apply function to a column split by factors
summary() : Display 5-number summary and mean
var(), sd() : Find variance, sd of values in vector
sum() : Add up all values in a vector
quantile() : Find the value at a given quantile of a dataset
barplot() : Produces a bar graph
barchart() : Lattice command for producing bar graphs
boxplot() : Produces a boxplot
plot() : Produces a scatterplot

Examples of usage
help (): help(mean)
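
A few more of the commands listed above can be tried in the same way. The sketch below uses the built-in women dataset; the object names `s` and `r` are illustrative.

```r
s <- seq(2, 10, by = 2)            # arithmetic progression: 2 4 6 8 10
r <- rep(c("a", "b"), times = 3)   # repeated values: a b a b a b

print(length(s))                   # number of elements in s
print(sum(s))                      # add up all values in s
print(table(r))                    # frequency of each value in r

data(women)                        # load a built-in dataset
print(names(women))                # column names of the data frame
print(summary(women$height))       # 5-number summary and mean
print(quantile(women$height, probs = 0.5))   # value at the 50% quantile
```

help(seq) or example(rep) can be used to explore any of these commands further.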

The variables are assigned with R-Objects and the data type of the R-
object becomes the data type of the variable. There are many types of R-
objects. The frequently used ones are −

 Vectors

 Lists

 Matrices

 Arrays

 Factors

 Data Frames

Vectors:
When you want to create a vector with more than one element, you should
use the c() function, which combines the elements into a vector.

# Create a vector.

apple <- c('red','green',"yellow")

print(apple)

# Get the class of the vector.

print(class(apple))

When we execute the above code, it produces the following result −

[1] "red" "green" "yellow"


[1] "character"

Lists
A list is an R-object which can contain many different types of elements
inside it like vectors, functions and even another list inside it.

# Create a list.

list1 <- list(c(2,5,3),21.3,sin)

# Print the list.

print(list1)

When we execute the above code, it produces the following result −

[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x) .Primitive("sin")

Matrices
A matrix is a two-dimensional rectangular data set. It can be created
using a vector input to the matrix function.

# Create a matrix.

M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)

print(M)

When we execute the above code, it produces the following result −

[,1] [,2] [,3]


[1,] "a" "a" "b"
[2,] "c" "b" "a"

Arrays
While matrices are confined to two dimensions, arrays can have any
number of dimensions. The array function takes a dim argument which
sets the size of each dimension. In the example below we
create an array with two elements which are 3x3 matrices each.

# Create an array.

a <- array(c('green','yellow'),dim = c(3,3,2))

print(a)

When we execute the above code, it produces the following result −


, , 1

[,1] [,2] [,3]


[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"

, , 2

[,1] [,2] [,3]


[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"

[3,] "yellow" "green" "yellow"

Data Frames
Data frames are tabular data objects. Unlike in a matrix, each column of a
data frame can contain a different mode of data. The first column can be
numeric while the second column can be character and the third column can
be logical. A data frame is a list of vectors of equal length.

Data Frames are created using the data.frame() function.

# Create the data frame.

BMI <- data.frame(

gender = c("Male", "Male","Female"),

height = c(152, 171.5, 165),

weight = c(81,93, 78),

Age = c(42,38,26)

)

print(BMI)

When we execute the above code, it produces the following result −

gender height weight Age


1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26

Data Type of a Variable


In R, a variable itself is not declared with any data type; rather, it gets the
data type of the R-object assigned to it. So R is called a dynamically
typed language, which means that we can change the data type
of the same variable again and again when using it in a program.

var_x <- "Hello"

cat("The class of var_x is ",class(var_x),"\n")

var_x <- 34.5

cat(" Now the class of var_x is ",class(var_x),"\n")

var_x <- 27L

cat(" Next the class of var_x becomes ",class(var_x),"\n")

When we execute the above code, it produces the following result −


The class of var_x is character
Now the class of var_x is numeric
Next the class of var_x becomes integer

Finding Variables
To know all the variables currently available in the workspace we use
the ls()function. Also the ls() function can use patterns to match the
variable names.

print(ls())

When we execute the above code, it produces the following result −

[1] "my var" "my_new_var" "my_var" "var.1"


[5] "var.2" "var.3" "var.name" "var_name2."
[9] "var_x" "varname"

# List the variables whose names contain the pattern "var".

print(ls(pattern = "var"))

When we execute the above code, it produces the following result −

[1] "my var" "my_new_var" "my_var" "var.1"


[5] "var.2" "var.3" "var.name" "var_name2."
[9] "var_x" "varname"

Deleting Variables
Variables can be deleted by using the rm() function. Below we delete the
variable var.3. Printing the variable afterwards throws an error.

rm(var.3)

print(var.3)

When we execute the above code, it produces the following result −

[1] "var.3"
Error in print(var.3) : object 'var.3' not found

All the variables can be deleted by using the rm() and ls() function
together.

rm(list = ls())

print(ls())

To see any dataset in the data viewer, type the following in the Console:

>View(women)

To list the number of rows / columns respectively

>nrow(women)

>ncol(women)

To output a summary about the dataset’s columns.

>summary(women)

To output a summary of a dataset’s structure.

>str(women)

To get the dimensions of a dataset (number of observations and columns)

>dim(women)

To access a column in a dataset

>women$height

To check the type (or class) of a variable, the class function can be used

>class(women)

PROGRAM 2 :
INTERACT DATA THROUGH .csv Files(Import and Export to .csv Files)

Getting and Setting the Working Directory


You can check which directory the R workspace is pointing to using
the getwd() function. You can also set a new working directory
using setwd()function.

# Get and print current working directory.

print(getwd())

# Set current working directory.

setwd("/web/com")

# Get and print current working directory.

print(getwd())

When we execute the above code, it produces the following result −


[1] "/web/com/1441086124_2016"
[1] "/web/com"

This result depends on your OS and your current directory where you are
working.

Input as CSV File


The csv file is a text file in which the values in the columns are separated
by a comma. Let's consider the following data present in the file
named input.csv.

You can create this file using windows notepad by copying and pasting
this data. Save the file as input.csv using the save As All files(*.*)
option in notepad.

id,name,salary,start_date,dept

1,Rick,623.3,2012-01-01,IT

2,Dan,515.2,2013-09-23,Operations

3,Michelle,611,2014-11-15,IT

4,Ryan,729,2014-05-11,HR

,Gary,843.25,2015-03-27,Finance

6,Nina,578,2013-05-21,IT

7,Simon,632.8,2013-07-30,Operations

8,Guru,722.5,2014-06-17,Finance

Reading a CSV File


Following is a simple example of read.csv() function to read a CSV file
available in your current working directory −

data <- read.csv("input.csv")

print(data)

When we execute the above code, it produces the following result −

  id     name salary start_date       dept
1  1     Rick 623.30 2012-01-01         IT
2  2      Dan 515.20 2013-09-23 Operations
3  3 Michelle 611.00 2014-11-15         IT
4  4     Ryan 729.00 2014-05-11         HR
5 NA     Gary 843.25 2015-03-27    Finance
6  6     Nina 578.00 2013-05-21         IT
7  7    Simon 632.80 2013-07-30 Operations
8  8     Guru 722.50 2014-06-17    Finance

Writing into a CSV File


R can create a csv file from an existing data frame. The write.csv() function
is used to create the csv file. This file gets created in the working
directory.

# Create a data frame.

data <- read.csv("input.csv")

retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))

# Write filtered data into a new file.

write.csv(retval,"output.csv")

newdata <- read.csv("output.csv")

print(newdata)

When we execute the above code, it produces the following result −

X id name salary start_date dept


1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 NA Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
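
The extra X column in the output above holds the row names that write.csv() stores by default. A minimal sketch of how to leave them out is shown below; the data frame `staff` is a small stand-in for the filtered data, and a temporary file is used instead of output.csv, both being assumptions made for this example.

```r
# Stand-in data frame (illustrative values taken from the table above).
staff <- data.frame(
  id = c(3, 4, 8),
  name = c("Michelle", "Ryan", "Guru"),
  salary = c(611, 729, 722.5)
)

# Write without row names so that no extra "X" column appears on re-reading.
out_file <- tempfile(fileext = ".csv")
write.csv(staff, out_file, row.names = FALSE)

round_trip <- read.csv(out_file)
print(names(round_trip))
```

With row.names = FALSE the file contains only the data columns, so reading it back gives the same column names that were written.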

PROGRAM 3: Get and Clean data using swirl package

Swirl is a platform for learning (and teaching) statistics and R simultaneously and
interactively. It presents a choice of course lessons and interactively tutors a user
through them. A user may be asked to watch a video, to answer a multiple-choice
or fill-in-the-blanks question, or to enter a command in the R console precisely as
if he or she were using R in practice. Emphasis is on the last, interacting with the R
console. User responses are tested for correctness and hints are given if
appropriate. Progress is automatically saved so that a user may quit at any time and
later resume without losing work.

Swirl leans heavily on exercising a student's use of the R console. A callback
mechanism, suggested and first demonstrated for the purpose by Hadley Wickham,
is used to capture student input and to provide immediate feedback relevant to the
course material at hand.

WHAT IS SWIRL() IN R
• swirl is a software package for the R programming language that turns
the R console into an interactive learning environment. Users receive
immediate feedback as they are guided through self-paced lessons in
data science and R programming.
install.packages("swirl")
library(swirl)
install_from_swirl("Getting and Cleaning Data")

Packages in Swirl()

dplyr()

According to the "Introduction to dplyr" vignette written by the package


authors, "The dplyr philosophy is to have small functions that each do one
thing well."

Specifically, dplyr supplies five 'verbs' that cover most fundamental data
manipulation tasks:

select(), filter(), arrange(), mutate(), and summarize().

Installing Swirl (from CRAN)

The easiest way to install and run swirl is by typing the following from the R
console:
install.packages("swirl")
library(swirl)
swirl()

What is dplyr?
dplyr is a powerful R-package to transform and summarize tabular data with rows
and columns
To install dplyr

install.packages("dplyr")

To load dplyr

library(dplyr)

Data manipulation using dplyr

• install.packages("dplyr") ## install

• You might get asked to choose a CRAN mirror – this is basically


asking you to choose a site to download the package from. The
choice doesn’t matter too much; We recommend the RStudio
mirror.

• library("dplyr") ## load

• You only need to install a package once per computer, but you need
to load it every time you open a new R session and want to use that
package.

select() select columns


filter() filter rows
arrange() re-order or arrange rows
mutate() create new columns
summarise() summarise values
group_by() allows for group operations in the “split-apply-combine” concept

• To select columns of a data frame, use select(). The first argument to this function is
the data frame (ToothGrowth), and the subsequent arguments are the columns to
keep.

• select(ToothGrowth, len, supp, dose)

The result can be stored and plotted:

>aa<-select(ToothGrowth,len,supp,dose)

>plot(aa)

• filter():

To choose rows that satisfy a condition. Comparisons use ==, not =.

• filter(ToothGrowth, len == 5)

Pipes (%>%)

• Without pipes, you have to nest functions (i.e. put one function inside of another).

• Pipes let you take the output of one function and send it directly to the next, which is
useful when you need to do many things to the same data set.

>ToothGrowth %>%

+ filter(len < 5) %>%

+ select(len,supp,dose)

mutate():

Create new columns based on the values in existing columns

>ToothGrowth %>%

+ mutate(len = len/ 4)

• If this runs off your screen and you just want to see the first few rows, you can use a
pipe to view the head() of the data

>ToothGrowth %>%

+ mutate(len=len/4) %>%

+ head()

Groupby():

• group_by() splits the data into groups upon which some operations can be run

>ToothGrowth %>%

+ group_by(len) %>%

+ tally()

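summarize():

• group_by() is often used together with summarize(), which collapses each
group into a single-row summary of that group. The summary column needs a name
different from the grouping variable, so here we group by supp and compute the mean len.

>ToothGrowth %>%

+ group_by(supp) %>%

+ summarize(mean_len = mean(len, na.rm = TRUE))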
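
A grouped mean can be cross-checked without dplyr. The sketch below deliberately swaps in base R's aggregate() rather than the dplyr verbs described above; the result name `by_supp` is illustrative.

```r
data(ToothGrowth)

# Mean tooth length (len) for each supplement type (supp), using base R.
by_supp <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean)
print(by_supp)
```

This returns one row per group, which is the same split-apply-combine idea that group_by() and summarize() express in dplyr.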

PROGRAM 4: Visualizing all statistical measures

Statistical analysis in R is performed by using many in-built functions.


Most of these functions are part of the R base package. These functions
take R vector as an input along with the arguments and give the result.

The functions we are discussing in this chapter are mean, median and
mode.

Mean
It is calculated by taking the sum of the values and dividing with the
number of values in a data series.

The function mean () is used to calculate this in R.

Syntax
The basic syntax for calculating mean in R is −

mean(x, trim = 0, na.rm = FALSE, ...)

Following is the description of the parameters used −

 x is the input vector.

 trim is used to drop some observations from both end of the sorted
vector.

 na.rm is used to remove the missing values from the input vector.

Example

# Create a vector.

x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.

result.mean <- mean(x)

print(result.mean)

When we execute the above code, it produces the following result −

[1] 8.22

Applying the trim option
When trim is supplied, the values are sorted and the given fraction of observations is
dropped from each end before the mean is taken. With trim = 0.3 and 10 values, 3 values
are dropped from each end.

# Create a vector.

x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.

result.mean <- mean(x,trim = 0.3)

print(result.mean)

When we execute the above code, it produces the following result −

[1] 5.55

Median
The middle most value in a data series is called the median.
The median() function is used in R to calculate this value.

Syntax
The basic syntax for calculating median in R is −

median(x, na.rm = FALSE)

Following is the description of the parameters used −

• x is the input vector.

• na.rm is used to remove the missing values from the input vector.

Example

# Create the vector.

x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.

median.result <- median(x)

print(median.result)

When we execute the above code, it produces the following result −

[1] 5.6

Mode
The mode is the value that has the highest number of occurrences in a set of
data. Unlike mean and median, mode can have both numeric and
character data.

R does not have a standard in-built function to calculate mode. So we
create a user function to calculate the mode of a data set in R. This function
takes the vector as input and gives the mode value as output.

# Create the function.

getmode <- function(v) {

uniqv <- unique(v)

uniqv[which.max(tabulate(match(v, uniqv)))]

}

# Create the vector with numbers.

v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

# Calculate the mode using the user function.

result <- getmode(v)

print(result)

# Create the vector with characters.

charv <- c("o","it","the","it","it")

# Calculate the mode using the user function.

result <- getmode(charv)

print(result)

When we execute the above code, it produces the following result −


[1] 2
[1] "it"

RANGE

The range() function returns a vector of the minimum and maximum values.

range(..., na.rm = FALSE, finite = FALSE)

...: numeric vector
na.rm: whether NA should be removed; if not, NA will be returned
finite: whether non-finite elements should be omitted

>x <- c(1,2.3,2,3,4,8,12,43,-4,-1)


>r <- range(x)
>r
[1] -4 43
>diff(r)
[1] 47

Missing values affect the result:


>y<- c(x,NA)
>y
[1] 1.0 2.3 2.0 3.0 4.0 8.0 12.0 43.0 -4.0 -1.0 NA
>range(y)
[1] NA NA

With na.rm=TRUE, the result is meaningful again:


>range(y,na.rm=TRUE)
[1] -4 43

> range(y,finite=TRUE)
[1] -4 43

Range Of Values

range returns a vector containing the minimum and maximum of all the given arguments.

Keywords

arith, univar

Usage
range(..., na.rm = FALSE)

# S3 method for default
range(..., na.rm = FALSE, finite = FALSE)

Arguments

...

any numeric or character objects.

na.rm

logical, indicating if NA's should be omitted.

finite

logical, indicating if all non-finite elements should be omitted.

Histogram

A histogram consists of parallel vertical bars that graphically show the frequency
distribution of a quantitative variable. The area of each bar is equal to the frequency of
items found in each class.
Example
In the data set faithful, the histogram of the eruptions variable is a collection of parallel
vertical bars showing the number of eruptions classified according to their durations.
Problem
Find the histogram of the eruption durations in faithful.
Solution
We apply the hist function to produce the histogram of the eruptions variable.
> duration = faithful$eruptions
> hist(duration, # apply the hist function
+ right=FALSE) # intervals closed on the left

Answer
The histogram of the eruption durations is:
[Figure: histogram of the eruption durations]

Enhanced Solution
To colorize the histogram, we select a color palette and set it in
the col argument of hist. In addition, we update the titles for readability.
> colors = c("red", "yellow", "green", "violet", "orange",
+ "blue", "pink", "cyan")
> hist(duration, # apply the hist function
+ right=FALSE, # intervals closed on the left
+ col=colors, # set the color palette
+ main="Old Faithful Eruptions", # the main title
+ xlab="Duration minutes") # x-axis label
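
The class intervals and frequencies behind the bars can also be inspected without drawing anything. This is a sketch using the plot = FALSE argument of hist(); the names `h` and `freq_table` are illustrative.

```r
duration <- faithful$eruptions

# Compute the histogram without plotting it.
h <- hist(duration, right = FALSE, plot = FALSE)

# One row per class interval: lower bound, upper bound and frequency.
freq_table <- data.frame(
  lower = head(h$breaks, -1),
  upper = h$breaks[-1],
  count = h$counts
)
print(freq_table)
```

The counts add up to the number of observations, which is a quick check that every eruption falls into exactly one interval.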

PROGRAM 5: creation of data frame with the following structure

EMP ID   EMP NAME   SALARY   START DATE
1        Sathish    5000     01-11-2013
2        vani       7500     05-06-2011
3        Ramesh     10000    21-09-1999
4        Praveen    9500     13-09-2005
5        pallavi    4500     23-10-2000

The dates are given as day-month-year, so as.Date() needs format = "%d-%m-%Y";
without it the strings are misread (for example as 0001-11-20).

> emp.data <- data.frame(emp_id = c(1:5),
+ emp_name = c("ratna","kumar","kamala","prajwal","prava"),
+ salary = c(500,750,1000,950,450),
+ start_date = as.Date(c("1-11-2013","5-6-2011","21-9-1999","23-10-2000","13-9-2005"),
+ format = "%d-%m-%Y"),
+ stringsAsFactors = FALSE
+ )

> View(emp.data)

> print(emp.data)

  emp_id emp_name salary start_date
1      1    ratna    500 2013-11-01
2      2    kumar    750 2011-06-05
3      3   kamala   1000 1999-09-21
4      4  prajwal    950 2000-10-23
5      5    prava    450 2005-09-13

> emp.data[1:2,]

  emp_id emp_name salary start_date
1      1    ratna    500 2013-11-01
2      2    kumar    750 2011-06-05

> emp.data[c(3,5),c(2,4)]

  emp_name start_date
3   kamala 1999-09-21
5    prava 2005-09-13
> A = emp.data$emp_id
> B = emp.data$emp_name
> C = data.frame(A,B)

> print(C)

A B
1 1 ratna
2 2 kumar
3 3 kamala
4 4 prajwal
5 5 prava

> D = emp.data$salary

Draw the table of data.

A) Extract two columns using their column names.

A = emp.data$emp_id

B = emp.data$emp_name

C = data.frame(A,B)

print(C)

B) Extract the first two rows with all columns.

emp.data[1:2,]

C) Extract the 3rd and 5th rows with the 2nd and 4th columns.

emp.data[c(3,5),c(2,4)]

PROGRAM 6: Applying Normalization function on each of


columns of iris dataset
There can be instances found in data frame where values for one feature could
range between 1-100 and values for other feature could range from 1-10000000. In
scenarios like these, owing to the mere greater numeric range, the impact on
response variables by the feature having greater numeric range could be more than
the one having less numeric range, and this could, in turn, impact prediction
accuracy. The objective is to improve predictive accuracy and not allow a particular
feature impact the prediction due to large numeric value range. Thus, we may need
to normalize or scale values under different features such that they fall under
common range. Take a look at following example:
# Age vector
age <- c(25, 35, 50)
# Salary vector
salary <- c(200000, 1200000, 2000000)
# Data frame created using age and salary
df <- data.frame( "Age" = age, "Salary" = salary, stringsAsFactors = FALSE)
Min-Max Normalization

Above data frame could be normalized using Min-Max normalization technique which
specifies the following formula to be applied to each value of features to be
normalized. This technique is traditionally used with K-Nearest Neighbors
(KNN) Classification problems.
(X - min(X))/(max(X) - min(X))
Above could be programmed as the following function in R:

normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
In order to apply above normalize function on each of the features of above data
frame, df, following code could be used. Pay attention to usage of lapply function.
dfNorm <- as.data.frame(lapply(df, normalize))
# One could also use sequence such as df[1:2]
dfNorm <- as.data.frame(lapply(df[1:2], normalize))
To normalize only a chosen set of features, such as salary, the following code could
be used:

# Note df[2]
dfNorm <- as.data.frame(lapply(df[2], normalize))
# Note df["Salary"]
dfNorm <- as.data.frame(lapply(df["Salary"], normalize))
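
Since the program title refers to the iris dataset, the same idea can be sketched on its four numeric columns. The name `irisNorm` is an illustrative assumption, and the normalize function is repeated so that the block runs on its own.

```r
# Min-max normalization, as defined in the text.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Columns 1 to 4 of iris are numeric; column 5 (Species) is a factor and is left as is.
irisNorm <- as.data.frame(lapply(iris[1:4], normalize))
irisNorm$Species <- iris$Species

print(summary(irisNorm))
```

After normalization every numeric column has a minimum of 0 and a maximum of 1, so no single feature dominates because of its scale.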

Z-Score Standardization

The disadvantage with min-max normalization technique is that it tends to bring data
towards the mean. If there is a need for outliers to get weighted more than the other
values, z-score standardization technique suits better. In order to achieve z-score
standardization, one could use R’s built-in scale() function. Take a look at following
example where scale function is applied on “df” data frame mentioned above.
dfNormZ <- as.data.frame( scale(df[1:2] ))
The following gets printed as dfNormZ:

         Age      Salary
1 -0.9271726 -1.03490978
2 -0.1324532  0.07392213
3  1.0596259  0.96098765

PROGRAM 7: Implementation of rbind and cbind functions.

rbind() function combines vector, matrix or data frame by rows.

rbind(x1,x2,...)
x1,x2: vector, matrix, data frames

Read in the data from the file:

>x <- read.csv("data1.csv",header=T,sep=",")


>x2 <- read.csv("data2.csv",header=T,sep=",")

>x3 <- rbind(x,x2)


>x3
Subtype Gender Expression
1 A m -0.54
2 A f -0.80
3 B f -1.03
4 C m -0.41
5 D m 3.22
6 D f 1.02
7 D f 0.21
8 D m -0.04
9 D m 2.11
10 B m -1.21
11 A f -0.20

The columns of the two datasets must be the same.

• Matrices can be created by column-binding or row-binding with cbind() and rbind().

• Data frames can also be appended by these functions.

> x <- 1:3

> y <- 10:12


cbind() function combines vector, matrix or data frame by columns.

Read in the data from the file:

>x <- read.csv("data1.csv",header=T,sep=",")


>x2 <- read.csv("data2.csv",header=T,sep=",")

>x3 <- cbind(x,x2)


>x3
Subtype Gender Expression Age City
1 A m -0.54 32 New York
2 A f -0.80 21 Houston
3 B f -1.03 34 Seattle
4 C m -0.41 67 Houston

The row number of the two datasets must be equal.

cbind(x1,x2,...)
x1,x2: vector, matrix, data frames

With the vectors x <- 1:3 and y <- 10:12:

> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12

> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
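
The file-based examples above depend on data1.csv and data2.csv, which are not included here. As a self-contained sketch, the small data frames below (`part1`, `part2` and `extra` are hypothetical stand-ins) show both requirements: matching columns for rbind() and matching row counts for cbind().

```r
# Two data frames with the same columns can be stacked by rows.
part1 <- data.frame(Subtype = c("A", "B"), Expression = c(-0.54, -1.03))
part2 <- data.frame(Subtype = c("C", "D"), Expression = c(-0.41, 3.22))
stacked <- rbind(part1, part2)      # 4 rows, 2 columns

# A data frame with the same number of rows can be attached by columns.
extra <- data.frame(Age = c(32, 21, 34, 67))
widened <- cbind(stacked, extra)    # 4 rows, 3 columns

print(dim(stacked))
print(dim(widened))
```

If the column names differ in rbind(), or the row counts differ in cbind(), R stops with an error instead of combining the data frames.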

PROGRAM 9: Creation of scatter plot using toothgrowth dataset


using ‘dplyr’ library
ToothGrowth dataset using the dplyr library:

dplyr : An R package for fast and easy data manipulation

In dplyr the main verbs (or functions)


are filter, arrange, select, mutate, summarize, and group_by. You can
probably guess what these functions do by their names, but let’s describe them and try
them out:
 filter – select a subset of the rows of a data frame
 arrange – works similarly to filter, except that instead of filtering or selecting
rows, it reorders them
 select – select columns of a data frame
 mutate – add new columns to a data frame that are functions of existing
columns
 summarize – summarize values
 group_by – describe how to break a data frame into groups of rows

> library(dplyr)

> library(ggplot2)
> library(datasets)
> data(ToothGrowth)
> str(ToothGrowth)
> summary(ToothGrowth)
len supp dose
Min. : 4.20 OJ:30 Min. :0.500
1st Qu.:13.07 VC:30 1st Qu.:0.500
Median :19.25 Median :1.000
Mean :18.81 Mean :1.167
3rd Qu.:25.27 3rd Qu.:2.000
Max. :33.90 Max. :2.000
> scatter.smooth(ToothGrowth)

What is dplyr

dplyr is a package for data manipulation, written and maintained by Hadley Wickham. It
provides some great, easy-to-use functions that are very handy when performing
exploratory data analysis and manipulation. Here, I will provide a basic overview of some of
the most useful functions contained in the package.

> install.packages("dplyr")

> library(dplyr)

> library("swirl")

> install.packages("downloader")

> library(downloader)

> url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"

> filename <- "msleep_ggplot2.csv"

> if (!file.exists(filename)) download(url,filename)

> msleep <- read.csv("msleep_ggplot2.csv")

> head(msleep)

select(ToothGrowth, len, supp, dose)

To select columns of a data frame and plot the result:

>aa<-select(ToothGrowth,len,supp,dose)

>plot(aa)

• filter():

To choose rows that satisfy a condition. Comparisons use ==, not =.

• filter(ToothGrowth, len == 5)

Pipes (%>%)

• Without pipes, you have to nest functions (i.e. put one function inside of another).

• Pipes let you take the output of one function and send it directly to the next, which is
useful when you need to do many things to the same data set.

>ToothGrowth %>%

+ filter(len < 5) %>%

+ select(len,supp,dose)

MUTATE ():

Create new columns based on the values in existing columns

>ToothGrowth %>%

+ mutate(len = len/ 4)

• If this runs off your screen and you just want to see the first few rows, you can use a
pipe to view the head() of the data

>ToothGrowth %>%

+ mutate(len=len/4) %>%

+ head()

Groupby():

• group_by() splits the data into groups upon which some operations can be run

>ToothGrowth %>%

+ group_by(len) %>%

+ tally()

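summarize():

• group_by() is often used together with summarize(), which collapses each
group into a single-row summary of that group. The summary column needs a name
different from the grouping variable, so here we group by supp and compute the mean len.

>ToothGrowth %>%

+ group_by(supp) %>%

+ summarize(mean_len = mean(len, na.rm = TRUE))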

PROGRAM 10: Implementation of linear and multiple regression on ‘mtcars’


dataset:
R - Linear Regression

Regression analysis is a very widely used statistical tool to establish a
relationship model between two variables. One of these variables is called the
predictor variable, whose value is gathered through experiments. The
other variable is called the response variable, whose value is derived from the
predictor variable.

In Linear Regression these two variables are related through an
equation, where the exponent (power) of both these variables is 1.
Mathematically a linear relationship represents a straight line when
plotted as a graph. A non-linear relationship where the exponent of any
variable is not equal to 1 creates a curve.

The general mathematical equation for a linear regression is −


y = ax + b

Following is the description of the parameters used −

 y is the response variable.

 x is the predictor variable.

 a and b are constants which are called the coefficients.

R's "mtcars" dataset contains a series of variables relating to motor cars that can
be plotted to explore correlation, with a linear regression model fitted to the
points. The example below first illustrates lm() on a small height and weight sample.

# Create the predictor and response variable.

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(y~x)

print(relation)

# Give the chart file a name.

png(file = "linearregression.png")

# Plot the chart.

plot(y,x,col = "blue",main = "Height & Weight Regression",
cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")

# Add the fitted line for height as a function of weight.

abline(lm(x~y))

# Save the file.

dev.off()
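
Once a model has been fitted it can be used for prediction. The following is a sketch on the same height and weight sample; the new height of 170 and the names `new_height` and `predicted_weight` are illustrative assumptions.

```r
# Predictor (height in cm) and response (weight in kg) from the example above.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

relation <- lm(y ~ x)

# predict() expects a data frame whose column name matches the predictor.
new_height <- data.frame(x = 170)
predicted_weight <- predict(relation, newdata = new_height)
print(predicted_weight)
```

The column in newdata must be called x because that is the name used in the model formula.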

R - Multiple Regression

Multiple regression is an extension of linear regression into relationship
between more than two variables. In simple linear relation we have one
predictor and one response variable, but in multiple regression we have
more than one predictor variable and one response variable.

The general mathematical equation for multiple regression is −


y = a + b1x1 + b2x2 +...bnxn

Following is the description of the parameters used −

 y is the response variable.

 a, b1, b2...bn are the coefficients.

 x1, x2, ...xn are the predictor variables.

We create the regression model using the lm() function in R. The model
determines the value of the coefficients using the input data. Next we
can predict the value of the response variable for a given set of predictor
variables using these coefficients.

We create a subset of variables (mpg, disp, hp and wt) from the mtcars data set
for this purpose.

input <- mtcars[,c("mpg","disp","hp","wt")]

print(head(input))

When we execute the above code, it produces the following result −


                   mpg disp  hp    wt
Mazda RX4         21.0  160 110 2.620
Mazda RX4 Wag     21.0  160 110 2.875
Datsun 710        22.8  108  93 2.320
Hornet 4 Drive    21.4  258 110 3.215
Hornet Sportabout 18.7  360 175 3.440
Valiant           18.1  225 105 3.460

Establishing the relationship between "mpg" as the response variable and "disp" and "hp" as
the predictor variables.

Step1: Load the required data.

This command creates a new data variable with all rows and only the required columns.

data <- mtcars[,c("mpg","disp","hp")]

head(data)

Step2: Create the relationship model and read the coefficients from its summary.

model <- lm(mpg~disp+hp, data=data)

summary(model)

mpg = 30.735904 + (-0.030346)disp + (-0.024840)hp

Using the above equation we can predict the value of mpg based on disp and hp.

Step3: Predicting the output.

predict(model, newdata = data.frame(disp=140, hp=80))

Predicted Output Mileage is 24.50022
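
The prediction can be cross-checked by plugging disp = 140 and hp = 80 into the fitted equation directly. The following is a small sketch of that check; the variable names coefs, by_hand and from_predict are illustrative.

```r
# Fit the same model as above on the built-in mtcars data.
model <- lm(mpg ~ disp + hp, data = mtcars)
coefs <- coef(model)

# Evaluate intercept + b1 * disp + b2 * hp by hand.
by_hand <- coefs[["(Intercept)"]] + coefs[["disp"]] * 140 + coefs[["hp"]] * 80

# The same prediction through predict().
from_predict <- predict(model, newdata = data.frame(disp = 140, hp = 80))

print(round(coefs, 6))
print(by_hand)
print(from_predict)
```

Both values should agree with the predicted mileage quoted above, up to rounding of the printed coefficients.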

plot(model)

The purpose of this analysis was to answer two questions regarding fuel
economy. The answer to the first, “Is an automatic or manual transmission
better for MPG?”, is this:

 In terms of ranges of gas mileage, manual transmissions provide more
MPG than automatic ones. But in terms of how other criteria, such as weight
and horsepower, influence gas mileage, it has been shown that
automatic transmissions are less affected by these factors than manual
ones.
 For the purpose of this analysis we use the mtcars dataset, which
was extracted from the 1974 Motor Trend US magazine and
comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74 models). Below is a brief
description of the variables in the data set:

[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (1000 lbs)
[, 7]  qsec  1/4 mile time
[, 8]  vs    Engine (0 = V-shaped, 1 = straight)
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors
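
The transmission comparison described above can be explored with a few lines of base R. This is only a sketch: the column name transmission and the interaction model are illustrative choices, not necessarily the method used in the original analysis.

```r
# mtcars ships with R; am is coded 0 = automatic, 1 = manual.
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Mean and range of mpg for each transmission type.
mean_mpg  <- tapply(cars$mpg, cars$transmission, mean)
range_mpg <- tapply(cars$mpg, cars$transmission, range)
print(mean_mpg)
print(range_mpg)

# Illustrative model: lets the effect of weight on mpg differ by transmission.
fit <- lm(mpg ~ wt * transmission, data = cars)
print(coef(fit))
```

The interaction coefficient in the last model indicates how much the weight slope changes between the two transmission types, which is one way to examine the "less affected by weight" claim.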



PROGRAM 11. R PROGRAM TO IMPLEMENT K-MEANS CLUSTERING

> data("iris")

> summary(iris)

> install.packages("dplyr")

> library(dplyr)

> virginica <- filter(iris, Species == "virginica")


> head(virginica)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          6.3         3.3          6.0         2.5 virginica
2          5.8         2.7          5.1         1.9 virginica
3          7.1         3.0          5.9         2.1 virginica
4          6.3         2.9          5.6         1.8 virginica
5          6.5         3.0          5.8         2.2 virginica
6          7.6         3.0          6.6         2.1 virginica

> Sepal.Length <- filter(iris, Species == "virginica", Sepal.Length >6)


> head(Sepal.Length)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          6.3         3.3          6.0         2.5 virginica
2          7.1         3.0          5.9         2.1 virginica
3          6.3         2.9          5.6         1.8 virginica
4          6.5         3.0          5.8         2.2 virginica
5          7.6         3.0          6.6         2.1 virginica
6          7.3         2.9          6.3         1.8 virginica

selected <- select(iris, Sepal.Length, Sepal.Width, Petal.Length)

head(selected, 3)

##   Sepal.Length Sepal.Width Petal.Length
## 1          5.1         3.5          1.4
## 2          4.9         3.0          1.4
## 3          4.7         3.2          1.3

select(): This function selects data by column name. You can select any number of columns in a
few different ways.
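
A short sketch of some of those ways is given below, using the built-in iris data. It assumes the dplyr package is installed; the result names (by_name, by_range and so on) are illustrative.

```r
library(dplyr)

# By listing column names explicitly.
by_name <- select(iris, Sepal.Length, Petal.Length)

# By a range of adjacent columns.
by_range <- select(iris, Sepal.Length:Petal.Length)

# By a name prefix, using the starts_with() helper.
by_prefix <- select(iris, starts_with("Petal"))

# By excluding a column with a minus sign.
drop_one <- select(iris, -Species)

print(names(by_name))
print(names(by_range))
print(names(by_prefix))
print(names(drop_one))
```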

mutate()

# Create a new column that stores logical values for Sepal.Width greater
# than half of Sepal.Length.

newCol <- mutate(iris, greater.half = Sepal.Width > 0.5 * Sepal.Length)

tail(newCol)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species greater.half
## 145          6.7         3.3          5.7         2.5 virginica        FALSE
## 146          6.7         3.0          5.2         2.3 virginica        FALSE
## 147          6.3         2.5          5.0         1.9 virginica        FALSE
## 148          6.5         3.0          5.2         2.0 virginica        FALSE
## 149          6.2         3.4          5.4         2.3 virginica         TRUE
## 150          5.9         3.0          5.1         1.8 virginica         TRUE

newCol <- arrange(newCol, Petal.Width)

head(newCol)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species greater.half
## 1          4.9         3.1          1.5         0.1  setosa         TRUE
## 2          4.8         3.0          1.4         0.1  setosa         TRUE
## 3          4.3         3.0          1.1         0.1  setosa         TRUE
## 4          5.2         4.1          1.5         0.1  setosa         TRUE
## 5          4.9         3.6          1.4         0.1  setosa         TRUE
## 6          5.1         3.5          1.4         0.2  setosa         TRUE

Visualization
Any powerful analysis will visualize the data to give a better picture (wink wink) of the data.
Below is a general plot of the iris dataset:
> plot(iris)

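The clustering itself is done with the kmeans() function (note the lowercase k). Its general
form, for 2 clusters and 100 random starts on a numeric data set named data, is:

kmeans(data, 2, nstart=100)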

Generally, K-means algorithms work through an iterative refinement process:

1. Each data point is randomly assigned to a cluster (the number of clusters is given
beforehand).
2. Each cluster's centroid (the mean of the points within the cluster) is calculated.
3. Each data point is reassigned to its nearest centroid, which reduces the within-cluster
variation. Steps 2 and 3 are repeated until no major differences are found.
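
The three steps above can be sketched in plain R as follows. This is a teaching sketch only: the function simple_kmeans and its arguments are hypothetical names, and it is not the algorithm that R's own kmeans() uses by default (that is Hartigan-Wong).

```r
simple_kmeans <- function(points, k, max_iter = 100) {
  points <- as.matrix(points)
  n <- nrow(points)

  # Step 1: random initial assignment. Shuffling a repeated 1..k sequence
  # guarantees that no cluster starts out empty.
  cluster <- sample(rep_len(seq_len(k), n))
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(points),
                    dimnames = list(NULL, colnames(points)))

  for (iter in seq_len(max_iter)) {
    # Step 2: each centroid is the column mean of its cluster's members.
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) {
        centers[j, ] <- colMeans(points[members, , drop = FALSE])
      }
    }

    # Step 3: squared Euclidean distance from every point to every centroid,
    # then reassign each point to the nearest centroid.
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(points, 2, centers[j, ])^2)
    })
    new_cluster <- max.col(-dists, ties.method = "first")

    # Stop when the assignments no longer change.
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
  }

  list(cluster = cluster, centers = centers, iterations = iter)
}

set.seed(1)
result <- simple_kmeans(iris[, 3:4], k = 3)
print(result$centers)
print(table(result$cluster))
```

Because the starting assignment is random, different seeds can end in different solutions, which is why kmeans() offers the nstart argument to try several starts and keep the best one.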


> library(datasets)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> library(ggplot2)
> ggplot(iris, aes(Petal.Length, Petal.Width, color = Species))+ geom_point()
> set.seed(20)
> irisCluster <- kmeans(iris[, 3:4],3,nstart = 20)
> irisCluster
K-means clustering with 3 clusters of sizes 50, 52, 48

Cluster means:
  Petal.Length Petal.Width
1     1.462000    0.246000
2     4.269231    1.342308
3     5.595833    2.037500

Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[46] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 3 2 2
[91] 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3
[136] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3
Within cluster sum of squares by cluster:
[1] 2.02200 13.05769 16.29167
(between_SS / total_SS = 94.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "between
[7] "size" "iter" "ifault"
> table(irisCluster$cluster, iris$Species)
    setosa versicolor virginica
  1     50          0         0
  2      0         48         4
  3      0          2        46
> irisCluster$cluster <- as.factor(irisCluster$cluster)
> ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster))+
+ geom_point()
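
The cross-table of clusters against species can be summarised as a single agreement figure. For the table printed above this is (50 + 48 + 46) / 150 = 0.96. The sketch below computes it from the table; the name agreement is illustrative, and the row maximum is used instead of the diagonal because cluster numbering is arbitrary and may differ between R versions.

```r
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)

# Rows are clusters, columns are the true species.
tab <- table(irisCluster$cluster, iris$Species)

# For each cluster take the count of its most common species,
# then divide by the total number of flowers.
agreement <- sum(apply(tab, 1, max)) / sum(tab)

print(tab)
print(agreement)
```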


PROGRAM 12:

R - Decision Tree
A decision tree is a graph that represents choices and their results in the form of a tree. The nodes
in the graph represent an event or choice and the edges of the graph represent the
decision rules or conditions. It is mostly used in Machine Learning and Data Mining
applications using R.

Examples of the use of decision trees are predicting whether an email is spam or not spam, predicting
whether a tumor is cancerous, or predicting whether a loan is a good or bad credit risk based on the
factors in each of these. Generally, a model is created with observed data, also called
training data. Then a set of validation data is used to verify and improve the model. R has
packages which are used to create and visualize decision trees. For a new set of predictor
variables, we use this model to arrive at a decision on the category (yes/no, spam/not
spam) of the data.

The R package "party" is used to create decision trees.

Install R Package
Use the command below in the R console to install the package. You also have to install the
dependent packages, if any.

install.packages("party")
The package "party" has the function ctree(), which is used to create and analyze a decision
tree.

Syntax
The basic syntax for creating a decision tree in R is −
ctree(formula, data)

Following is the description of the parameters used −

 formula is a formula describing the predictor and response variables.

 data is the name of the data set used.
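
As a small illustration of the formula argument, a formula is an ordinary R object whose left-hand side names the response and whose right-hand side lists the predictors. The sketch below uses the variable names from the readingSkills example in this section; tree_formula, response and predictors are illustrative names.

```r
# A formula object: response on the left of ~, predictors on the right.
tree_formula <- nativeSpeaker ~ age + shoeSize + score

print(class(tree_formula))
print(all.vars(tree_formula))

# all.vars() lists the response first, followed by the predictors.
response   <- all.vars(tree_formula)[1]
predictors <- all.vars(tree_formula)[-1]

print(response)
print(predictors)
```

Building the formula does not require the data to exist yet; the names are only looked up in the data argument when ctree() is called.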



Input Data
We will use the data set named readingSkills, which ships with the party package, to create a
decision tree. For each person it records the variables "age", "shoeSize" and "score" and
whether the person is a native speaker or not.

Here is the sample data.

# Load the party package. It will automatically load other dependent packages.

library(party)

# Print some records from data set readingSkills.

print(head(readingSkills))

When we execute the above code, it produces the following result −
  nativeSpeaker age shoeSize    score
1           yes   5 24.83189 32.29385
2           yes   6 25.95238 36.63105
3            no  11 30.42170 49.60593
4           yes   7 28.66450 40.28456
5           yes  11 31.88207 55.46085
6           yes  10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................

Example
We will use the ctree() function to create the decision tree and see its graph.

# Load the party package. It will automatically load other dependent packages.

library(party)

# Create the input data frame.

input.dat <- readingSkills[c(1:105),]

# Give the chart file a name.



png(file = "decision_tree.png")

# Create the tree.

output.tree <- ctree(

nativeSpeaker ~ age + shoeSize + score,

data = input.dat)

# Plot the tree.

plot(output.tree)

# Save the file.

dev.off()

When we execute the above code, it produces the following result −


null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

as.Date, as.Date.numeric

Loading required package: sandwich



Conclusion
From the decision tree shown above we can conclude that anyone whose readingSkills
score is less than 38.3 and whose age is more than 6 is not a native speaker.
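
The tree in this example was built on rows 1 to 105 of readingSkills, so the remaining rows can serve as validation data. The following sketch, which assumes the party package is installed, predicts the held-out rows and reports how often the prediction matches; the names training, validation, confusion and accuracy are illustrative.

```r
library(party)

# Rows 1-105 train the model (as in the example); the rest are held out.
train_rows <- 1:105
training   <- readingSkills[train_rows, ]
validation <- readingSkills[-train_rows, ]

tree <- ctree(nativeSpeaker ~ age + shoeSize + score, data = training)

# Predict the class of each held-out row and compare with the truth.
predicted <- predict(tree, newdata = validation)
confusion <- table(predicted = predicted, actual = validation$nativeSpeaker)
accuracy  <- mean(predicted == validation$nativeSpeaker)

print(confusion)
print(accuracy)
```

A single split like this gives only a rough estimate; the figure will change if different rows are held out.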

PROGRAM 13: Implementation of decision trees using the "iris" dataset and the package party

> install.packages("party")
Installing package into ‘C:/Users/Linda/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/party_1.0-15.zip'
Content type 'application/zip' length 731049 bytes (713 Kb)
opened URL
downloaded 713 Kb

package ‘party’ successfully unpacked and MD5 sums checked

This page shows how to build a decision tree with R.

> library("party")

> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Call function ctree to build a decision tree. The first parameter is a formula,
which defines a target variable and a list of independent variables.

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length +
+     Petal.Width, data=iris)

> print(iris_ctree)

Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)*  weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)*  weights = 46
    4) Petal.Length > 4.8
      6)*  weights = 8
  3) Petal.Width > 1.7
    7)*  weights = 46

> plot(iris_ctree)

> plot(iris_ctree, type="simple")
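
Besides plotting, the fitted tree can be checked by comparing its predictions with the known species. The sketch below assumes the party package is installed and rebuilds the same tree; note that it measures accuracy on the same data used for training, so the figure is optimistic. The names confusion and training_accuracy are illustrative.

```r
library(party)

iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length +
                      Petal.Width, data = iris)

# Predicted species for the training rows, cross-tabulated with the truth.
predicted <- predict(iris_ctree)
confusion <- table(predicted = predicted, actual = iris$Species)

# Share of flowers on the diagonal (correctly classified).
training_accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(training_accuracy)
```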



> install.packages("party")
> library("party")

> str(iris)

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length +
+     Petal.Width, data=iris)

> print(iris_ctree)

> plot(iris_ctree)
