Read first:
Please email your assignment to stat2450winter@gmail.com before the due date: Friday, 3 Feb 2017, 11:59pm.
R should be used exclusively and the R code must be provided in your assignment.
Show the results clearly; the code must be executable and commented.
This assignment covers R programming basics, data visualization, and the kNN algorithm.
Please remember that all of the work must be your own, and answers must be given in your own words.
Let me know if you run into trouble.
## [1] 89 42 15 60 17 31 22 36 95 25 26 4
[1 point] 2. Return the last 5 elements in this vector
vec[(length(vec)-4):length(vec)]
## [1] 36 95 25 26 4
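The same slice can also be taken with `tail()`, which avoids the index arithmetic. A minimal sketch, assuming `vec` holds the twelve values printed above (they are hard-coded here because the line that created `vec` is not shown); `last5` is an illustrative name:

```r
# Assumption: vec is re-created from the printed output above.
vec <- c(89, 42, 15, 60, 17, 31, 22, 36, 95, 25, 26, 4)

# tail(x, n) returns the last n elements of a vector.
last5 <- tail(vec, 5)
last5
## [1] 36 95 25 26  4
```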
[1 point] 3. Use this vector to create a 3 × 4 matrix.
mvec = matrix(vec,3,4)
mvec
## [1] 36
[1 point] 5. Return the elements in the last 2 rows and last 2 columns
mvec[2:3,3:4]
## [,1] [,2]
## [1,] 36 26
## [2,] 95 4
[2 points] 6. Get all the odd values in this 3 × 4 matrix (Use R!).
mvec[mvec %% 2 ==1]
## [1] 89 15 17 31 95 25
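The odd values come back in column order because `matrix()` fills column by column and logical indexing walks the matrix the same way. A small sketch, assuming `vec` holds the values printed in the first question (hard-coded here); `odd_pos` and `odd_vals` are illustrative names:

```r
# Assumption: vec is re-created from the output printed in the first question.
vec <- c(89, 42, 15, 60, 17, 31, 22, 36, 95, 25, 26, 4)
mvec <- matrix(vec, nrow = 3, ncol = 4)   # filled column by column

# arr.ind = TRUE reports the (row, col) position of each odd entry.
odd_pos <- which(mvec %% 2 == 1, arr.ind = TRUE)
odd_pos

# Indexing with the position matrix returns the same values, in the same order.
odd_vals <- mvec[odd_pos]
odd_vals
## [1] 89 15 17 31 95 25
```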
Data Frame
[1 point] 2. How many rows and columns are in this data set (not including row numbers)?
dim(credit)
## [1] 400 11
[1 point] 3. Choose the rows where Income > 100 and Age < 50.
credit[credit$Income>100 & credit$Age<50,]
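One thing to watch with bracket filtering is missing values: a condition that evaluates to `NA` produces an all-`NA` row, whereas `subset()` drops it. A minimal sketch on a hypothetical data frame `toy` (made up for illustration; it is not the credit data):

```r
# Hypothetical data; the last row has a missing Age.
toy <- data.frame(Income = c(120, 80, 150, 101, 130),
                  Age    = c(45, 30, 60, 49, NA))

cond <- toy$Income > 100 & toy$Age < 50   # TRUE FALSE FALSE TRUE NA

by_bracket <- toy[cond, ]                          # keeps an extra all-NA row
by_subset  <- subset(toy, Income > 100 & Age < 50) # drops the NA case

nrow(by_bracket)
## [1] 3
nrow(by_subset)
## [1] 2
```

If the data may contain missing values, wrapping the condition in `which()` gives the bracket form the same behaviour as `subset()`.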
Create a function
[3 points] 1. Write a function that can find the numbers which are divisible by 7 but are not a multiple of 5.
Input: startNumber, endNumber
Output: return a vector of all such numbers between startNumber and endNumber.
Use this function to find all such numbers from 100 to 200.
func1 = function(start,end){
  results = c()  # collects the matching numbers
  for (i in start:end){
    # keep i when it is a multiple of 7 but not of 5
    if (i %% 7 == 0 && i %% 5 != 0){
      results = c(results,i)
    }
  }
  return(results)
}
func1(100,200)
## [1] 112 119 126 133 147 154 161 168 182 189 196
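The loop can also be written as a single vectorized filter, which is the more common idiom in R. A sketch of that alternative (the function name `divisible_by_7_not_5` is illustrative, not part of the assignment):

```r
# Vectorized alternative: build the whole range, then keep matching entries.
divisible_by_7_not_5 <- function(start, end) {
  x <- start:end
  x[x %% 7 == 0 & x %% 5 != 0]
}

res <- divisible_by_7_not_5(100, 200)
res
## [1] 112 119 126 133 147 154 161 168 182 189 196
```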
[3 points] 2. Create a function that can normalize the input data by $x = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ (hint: input should be a vector).
Use this function to normalize the vector in your first question, and print the results.
func2 = function(x){
  return( (x-min(x))/(max(x)-min(x)) )
}
func2(vec)
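A quick way to check a min-max normalization is that the smallest input maps to 0 and the largest to 1. A minimal sketch, assuming `vec` holds the values printed in the first question (hard-coded here); `normalize` is an illustrative name for the same formula:

```r
# Assumption: vec is re-created from the output printed in the first question.
vec <- c(89, 42, 15, 60, 17, 31, 22, 36, 95, 25, 26, 4)

# Min-max normalization: (x - min) / (max - min).
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

normalized <- normalize(vec)
range(normalized)
## [1] 0 1
```

Note that a constant vector makes the denominator zero, so the result is `NaN`; a production version would need to handle that case.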
[1 point] 1. Use the last 3 digits of your student number to set a seed
set.seed(1)
[1 point] 2. Use rnorm to generate 1000 samples, and assign them to a variable (Choose any mean and sd
you like).
samples = rnorm(1000,mean=3,sd=2)
[Figure: Histogram of samples; x axis: samples, y axis: Density]
[2 points] 4. Add a curve of this normal distribution (using the mean and sd you set) to that histogram. (Be
sure that the y axis is consistent.)
hist(samples,probability = T)
curve(dnorm(x,3,2),col="blue",add=T)
[Figure: Histogram of samples with the normal density curve; x axis: samples, y axis: Density]
[2 points] 5. Add vertical lines at the 0.05 and 0.95 quantiles
hist(samples,probability = T)
curve(dnorm(x,3,2),col="blue",add=T)
abline(v= qnorm(0.05,3,2),col="green")
abline(v= qnorm(0.95,3,2),col="blue")
[Figure: Histogram of samples with the density curve and vertical lines at the two quantiles; x axis: samples, y axis: Density]
[2 points] 6. Add text beside those two vertical lines to show the quantile values (x =).
hist(samples,probability = T)
curve(dnorm(x,3,2),col="blue",add=T)
abline(v= qnorm(0.05,3,2),col="green")
abline(v= qnorm(0.95,3,2),col="blue")
text(qnorm(0.05,3,2),0.16, labels = paste0("x = ",round(qnorm(0.05,3,2),4)))
text(qnorm(0.95,3,2),0.16, labels = paste0("x = ",round(qnorm(0.95,3,2),4)))
[Figure: Histogram of samples with the density curve, quantile lines, and labels "x = -0.2897" and "x = 6.2897"; x axis: samples, y axis: Density]
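The lines and labels in question 6 use the theoretical quantiles from `qnorm()`. As a sanity check, they can be compared with the empirical quantiles of the generated sample. A sketch, assuming the same seed, mean, and sd as above; `theoretical` and `empirical` are illustrative names:

```r
set.seed(1)
samples <- rnorm(1000, mean = 3, sd = 2)

# Quantiles of the N(3, 2^2) distribution itself.
theoretical <- qnorm(c(0.05, 0.95), mean = 3, sd = 2)
round(theoretical, 4)
## [1] -0.2897  6.2897

# Quantiles estimated from the 1000 draws; these should be close
# to the theoretical values but will not match them exactly.
empirical <- quantile(samples, probs = c(0.05, 0.95))
empirical
```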
Part 3 - KNN
[1 point] 1. Install the package ISLR and use the Weekly data set in this ISLR package to do the following
questions. (This question is very similar to the example in our textbook, page 163. You can use it as a reference.)
How many rows and columns are in this data set?
library(ISLR)
library(class)
attach(Weekly)
dim(Weekly)
## [1] 1089 9
[1 point] 2. List all the column names in this data set.
colnames(Weekly)
## test data set
testx = Weekly[Year>=2009 & Year <=2010,c("Lag1","Lag2")]
testy = Direction[Year>=2009 & Year <=2010]
## knn
knn.pred = knn(trainx,testx,trainy,k = 5)
table(knn.pred,testy)
## testy
## knn.pred Down Up
## Down 22 32
## Up 21 29
sum(knn.pred == testy)
## [1] 51
[2 points] 4. Why do we need to set a seed to reproduce the same results when we use the knn() function?
Because knn() breaks ties at random, we need to set the seed to reproduce the results.
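The effect of the seed can be seen on a deliberately tied example. The sketch below uses made-up data (not the Weekly set): one training point of each class sits at equal distance from every test point, so with k = 2 each vote is a 1-1 tie that knn() has to break at random. The names `train_x`, `train_y`, and `test_x` are illustrative.

```r
library(class)

# Hypothetical data: two training points, one per class, both at
# distance 1 from every test point, so every k = 2 vote is a tie.
train_x <- matrix(c(0, 2), ncol = 1)
train_y <- factor(c("Down", "Up"))
test_x  <- matrix(rep(1, 200), ncol = 1)

set.seed(1)
pred_a <- knn(train_x, test_x, train_y, k = 2)
set.seed(1)
pred_b <- knn(train_x, test_x, train_y, k = 2)

identical(pred_a, pred_b)   # same seed, same tie-breaks
table(pred_a)               # both classes appear among the tied votes
```

Without the calls to `set.seed()`, two runs would generally split the ties differently, which is why the accuracy can change from run to run.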
[3 points] 5. Perform this knn classification using different Ks ( k = 1, 3, 5, 10, 15, 20) and calculate the
corresponding accuracy rate of prediction.
ks = c(1,3,5,10,15,20)
accuracyRate = c()
for (i in ks){
  knn.pred = knn(trainx,testx,trainy,k = i)
  accuracyRate = c(accuracyRate,mean(knn.pred == testy))
}
accuracyRate
[Figure: accuracyRate plotted against ks; x axis: ks, y axis: accuracyRate]
k = 15, since it has the largest accuracy rate.
[2 points] 7. If we want to add Volume as another predictor, what is the extra step we need to do and why?
We need to normalize the data first, because Volume is on a different scale from Lag1 and Lag2, and kNN relies on distances between observations.
summary(Weekly[,c("Lag1","Lag2","Volume")])
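One possible way to do the normalization is `scale()`, which centers each column to mean 0 and rescales it to standard deviation 1; the min-max function from the earlier question would be another option. A minimal sketch on a hypothetical data frame `toy` (made-up values, not the Weekly data):

```r
# Hypothetical predictors on very different scales.
toy <- data.frame(Lag1   = c(-2.5, 0.3, 1.8, -0.7),
                  Volume = c(0.15, 1.2, 4.8, 9.3))

# scale() standardizes every column: subtract the mean, divide by the sd.
scaled <- scale(toy)

round(colMeans(scaled), 10)   # each column now has mean 0
apply(scaled, 2, sd)          # and standard deviation 1

# The centering and scaling values are stored as attributes, so the same
# transformation can be applied to a test set:
# scale(test, center = attr(scaled, "scaled:center"),
#             scale  = attr(scaled, "scaled:scale"))
```

After this step, a one-unit difference means the same thing in every column, so no single predictor dominates the distance calculation.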