
STAT-2450 Assignment 1

Name: *** , Student ID: B00***

Read first:
Please email your assignment to stat2450winter@gmail.com before the due date: Friday, 3rd Feb 2017, 11:59pm.
R should be used exclusively, and the R code must be provided in your assignment.
Show the results clearly; the code must be executable and commented.
This assignment covers R programming basics, data visualization and the kNN algorithm.
Please remember that all of the work must be your own, and answers must be given in your own words.
Let me know if you run into trouble.

Part 1 - R programming basics

Vector and Matrix


[1 point] 1. Create a vector vec with 12 random numbers from 1 to 100 (sample() function) and print it.
vec = sample(1:100,12)
print(vec)

## [1] 89 42 15 60 17 31 22 36 95 25 26 4
[1 point] 2. Return the last 5 elements in this vector
vec[(length(vec)-4):length(vec)]

## [1] 36 95 25 26 4
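As a small sketch of an alternative, base R's tail() returns the last n elements without index arithmetic. The vector `v` below is a fixed copy of the values printed above, used only so the example is reproducible; it is not part of the original answer.

```r
# Fixed copy of the printed values (the original vec is random)
v <- c(89, 42, 15, 60, 17, 31, 22, 36, 95, 25, 26, 4)

# Index arithmetic, as in the answer above
last5_index <- v[(length(v) - 4):length(v)]

# tail() returns the last n elements directly
last5_tail <- tail(v, 5)

print(last5_tail)
print(identical(last5_index, last5_tail))
```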
[1 point] 3. Use this vector to create a 3 × 4 matrix.
mvec = matrix(vec,3,4)
mvec

##      [,1] [,2] [,3] [,4]
## [1,]   89   60   22   25
## [2,]   42   17   36   26
## [3,]   15   31   95    4
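The output shows that matrix() fills column by column by default. A brief sketch of the difference from row-wise filling follows; the names `v`, `m_col` and `m_row` are illustrative, and `v` is a fixed copy of the printed vector.

```r
# Fixed copy of the printed values (the original vec is random)
v <- c(89, 42, 15, 60, 17, 31, 22, 36, 95, 25, 26, 4)

# Default: values are placed down the columns
m_col <- matrix(v, nrow = 3, ncol = 4)

# byrow = TRUE: values are placed across the rows instead
m_row <- matrix(v, nrow = 3, ncol = 4, byrow = TRUE)

print(m_col[2, 3])  # element in row 2, column 3 of the column-filled matrix
print(m_row[2, 3])  # same position in the row-filled matrix
```

The same position holds different values in the two layouts, so it is worth checking the fill order before indexing.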
[1 point] 4. Return the element in the 2nd row and 3rd column.
mvec[2,3]

## [1] 36
[1 point] 5. Return the elements in the last 2 rows and last 2 columns.
mvec[2:3,3:4]

##      [,1] [,2]
## [1,]   36   26
## [2,]   95    4
[2 points] 6. Get all the odd values in this 3 × 4 matrix (Use R!).

mvec[mvec %% 2 ==1]

## [1] 89 15 17 31 95 25
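The answer works by logical indexing: the comparison yields a logical matrix, and indexing with it returns the matching values in column order. A minimal sketch, using a fixed copy of the printed matrix and illustrative names, also recovers the positions with which():

```r
# Fixed copy of the printed matrix
m <- matrix(c(89, 42, 15, 60, 17, 31, 22, 36, 95, 25, 26, 4), nrow = 3)

is_odd <- m %% 2 == 1          # logical matrix, TRUE where the value is odd
odd_values <- m[is_odd]        # values are returned column by column

# Row and column positions of the odd values
odd_positions <- which(is_odd, arr.ind = TRUE)

print(odd_values)
print(odd_positions)
```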

Data Frame

[2 points] 1. Download the file Credit.csv from http://www-bcf.usc.edu/~gareth/ISL/data.html and import the data into R.
credit = read.table("Credit.csv",header = TRUE,row.names = 1,sep=",")

[1 point] 2. How many rows and columns are in this data set (not including row numbers)?
dim(credit)

## [1] 400 11
[1 point] 3. Choose the rows where Income > 100 and Age < 50.
credit[credit$Income>100 & credit$Age<50,]

##      Income Limit Rating Cards Age Education Gender Student Married
## 4   148.924  9504    681     3  36        11 Female      No      No
## 29  186.634 13414    949     2  41        14 Female      No     Yes
## 33  134.181  7838    563     2  48        13 Female      No      No
## 67  113.829  9704    694     4  38        13 Female      No     Yes
## 79  110.968  6662    468     3  45        11 Female      No     Yes
## 86  152.298 12066    828     4  41        12 Female      No     Yes
## 194 130.209 10088    730     7  39        19 Female      No     Yes
## 294 140.672 11200    817     7  46         9   Male      No     Yes
## 353 104.483  7140    507     2  41        14   Male      No     Yes
##            Ethnicity Balance
## 4              Asian     964
## 29  African American    1809
## 33         Caucasian     526
## 67             Asian    1388
## 79         Caucasian     391
## 86             Asian    1779
## 194        Caucasian    1426
## 294 African American    1677
## 353 African American     583
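For comparison, the same kind of row filter can be written with subset(). This sketch uses a small made-up data frame `toy` (not the Credit data) so that it runs without the CSV file.

```r
# Hypothetical miniature data set, for illustration only
toy <- data.frame(Income = c(148.9, 80.2, 120.5, 104.5),
                  Age    = c(36, 41, 63, 41))

# Logical indexing, as in the answer above
by_index <- toy[toy$Income > 100 & toy$Age < 50, ]

# subset() evaluates the condition inside the data frame
by_subset <- subset(toy, Income > 100 & Age < 50)

print(by_index)
print(by_subset)
```

One practical difference: if the condition evaluates to NA for some row, logical indexing returns a row of NAs, whereas subset() drops that row.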
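[2 points] 4. Save the data from the above question to your desktop and use ; to separate the columns. (You only need to provide the code.)
## rows selected in question 3; prefix the file name with the path to your own desktop
selected = credit[credit$Income>100 & credit$Age<50,]
write.table(selected,"credit2.txt",sep=";",row.names = F,quote=F)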
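A quick way to check a semicolon-separated export is to read it back. The sketch below assumes a made-up two-row data frame and a temporary file rather than the desktop path, so it can run anywhere.

```r
# Hypothetical two-row data frame, for illustration only
toy <- data.frame(Income    = c(148.924, 186.634),
                  Ethnicity = c("Asian", "African American"))

# Write to a temporary file with ";" as the column separator
path <- tempfile(fileext = ".txt")
write.table(toy, path, sep = ";", row.names = FALSE, quote = FALSE)

# Read it back with the same separator
back <- read.table(path, header = TRUE, sep = ";")
print(back)
```

Because the separator is ";", a value containing a space such as "African American" survives the round trip even with quote = FALSE.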

Create a function

[3 points] 1. Write a function that can find the numbers which are divisible by 7 but are not a multiple of 5.
Input: startNumber, endNumber
Output: return a vector of all such numbers between startNumber and endNumber.
Use this function to find all such numbers from 100 to 200.

func1 = function(start,end){
  results = c()
  for (i in start:end){
    ## keep i if it is divisible by 7 but not a multiple of 5
    if (i %% 7 == 0 && i %% 5 != 0){
      results = c(results,i)
    }
  }
  return(results)
}
func1(100,200)

## [1] 112 119 126 133 147 154 161 168 182 189 196
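The loop can also be expressed with vectorized logical indexing. This is a sketch of an equivalent approach, not the submitted answer; the function name `divisible_by_7_not_5` is illustrative.

```r
# Vectorized variant: test every candidate at once, then keep the matches
divisible_by_7_not_5 <- function(start, end) {
  candidates <- start:end
  candidates[candidates %% 7 == 0 & candidates %% 5 != 0]
}

result <- divisible_by_7_not_5(100, 200)
print(result)
```

Here the element-wise `&` is the right operator, because the condition is evaluated on a whole vector rather than on a single value inside `if`.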
[3 points] 2. Create a function that can normalize the input data by $x = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ (hint: input should be a vector).
Use this function to normalize the vector in your first question, and print the results.
func2 = function(x){
return( (x-min(x))/(max(x)-min(x)) )
}

func2(vec)

## [1] 0.9340659 0.4175824 0.1208791 0.6153846 0.1428571 0.2967033 0.1978022
## [8] 0.3516484 1.0000000 0.2307692 0.2417582 0.0000000
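Min-max normalization divides by max(x) - min(x), which is zero for a constant vector. The sketch below adds a guard for that case; mapping a constant vector to zeros is an assumption made here for illustration, and the name `normalize` is not from the assignment.

```r
# Min-max normalization with a guard for constant input
normalize <- function(x) {
  rng <- range(x)                       # c(min, max)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))           # assumed convention for constant input
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# Worked check against the printed output: (89 - 4) / (95 - 4)
scaled <- normalize(c(89, 4, 95))
print(scaled)
print(normalize(c(5, 5, 5)))
```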

Part 2 - Data visualization

[1 point] 1. Use the last 3 digits of your student number to set a seed
set.seed(1)

[1 point] 2. Use rnorm to generate 1000 samples, and assign them to a variable (Choose any mean and sd
you like).
samples = rnorm(1000,mean=3,sd=2)

[2 points] 3. Use these 1000 samples to draw a histogram.


hist(samples,probability = T)

[Figure: "Histogram of samples"; x-axis: samples, y-axis: Density]


[2 points] 4. Add the curve of this normal distribution (using the mean and sd you set) to the histogram. (Be sure that the y-axis is consistent.)
hist(samples,probability = T)
curve(dnorm(x,3,2),col="blue",add=T)

[Figure: "Histogram of samples" with the normal density curve added; x-axis: samples, y-axis: Density]
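The y-axis is consistent because probability = T rescales the bars to densities, so the total bar area is 1, matching the area under dnorm(). A minimal sketch of that check, using the same seed and parameters as above and no plotting:

```r
set.seed(1)
samples <- rnorm(1000, mean = 3, sd = 2)

# Compute the histogram without drawing it
h <- hist(samples, plot = FALSE)

# Bar area = density * bin width; the areas sum to 1
bar_area <- sum(h$density * diff(h$breaks))
print(bar_area)
```

With the default count scale the bars would be hundreds of units tall and the density curve would appear flat along the x-axis.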
[2 points] 5. Add vertical lines at the 0.05 and 0.95 quantiles.
hist(samples,probability = T)
curve(dnorm(x,3,2),col="blue",add=T)
abline(v= qnorm(0.05,3,2),col="green")
abline(v= qnorm(0.95,3,2),col="blue")

[Figure: "Histogram of samples" with the density curve and vertical lines at the two quantiles; x-axis: samples, y-axis: Density]
[2 points] 6. Add text beside those two vertical lines to show the quantile values (x =).
hist(samples,probability = T)
curve(dnorm(x,3,2),col="blue",add=T)
abline(v= qnorm(0.05,3,2),col="green")
abline(v= qnorm(0.95,3,2),col="blue")
text(qnorm(0.05,3,2),0.16, labels = paste0("x = ",round(qnorm(0.05,3,2),4)))
text(qnorm(0.95,3,2),0.16, labels = paste0("x = ",round(qnorm(0.95,3,2),4)))

[Figure: "Histogram of samples" with the density curve, the two vertical lines, and the labels "x = -0.2897" and "x = 6.2897"; x-axis: samples, y-axis: Density]
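The lines above are drawn at the theoretical quantiles from qnorm(). As a hedged aside, the empirical quantiles of the generated samples can be computed with quantile() and should be close but not identical; the variable names here are illustrative.

```r
set.seed(1)
samples <- rnorm(1000, mean = 3, sd = 2)

# Quantiles of the N(3, 2) distribution itself
theoretical <- qnorm(c(0.05, 0.95), mean = 3, sd = 2)

# Quantiles estimated from the 1000 draws
empirical <- quantile(samples, probs = c(0.05, 0.95))

print(round(theoretical, 4))
print(round(empirical, 4))
```

Which pair to plot depends on whether the question means the quantiles of the distribution or of the sample.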

Part 3 - KNN

[1 point] 1. Install the package ISLR and use the Weekly data set in this ISLR package to do the following questions. (This question is very similar to the example in our textbook, page 163. You can use it as a reference.)
How many rows and columns are in this data set?
library(ISLR)
library(class)
attach(Weekly)
dim(Weekly)

## [1] 1089 9
[1 point] 2. List all the column names in this data set.
colnames(Weekly)

## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
[5 points] 3. Use the KNN method to predict Direction from two predictors, Lag1 and Lag2, in the Weekly data set, and show how many test observations are correctly predicted.
Use the data from 1990 to 2008 as the training data.
Use the data from 2009 and 2010 as the test data.
Choose k = 5.
You can use the knn() function from the class library.
## training data set
trainx = Weekly[Year>=1990 & Year <=2008,c("Lag1","Lag2")]
trainy = Direction[Year>=1990 & Year <=2008]

## test data set
testx = Weekly[Year>=2009 & Year <=2010,c("Lag1","Lag2")]
testy = Direction[Year>=2009 & Year <=2010]

## knn
knn.pred = knn(trainx,testx,trainy,k = 5)
table(knn.pred,testy)

##         testy
## knn.pred Down Up
##     Down   22 32
##     Up     21 29
sum(knn.pred == testy)

## [1] 51
[2 points] 4. Why do we need to set a seed to reproduce the same results when we use this knn() function?
Because knn() breaks ties at random (for example, when the neighbours' votes are split evenly between the classes), we need to set the seed to reproduce the results.
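A minimal sketch of the tie situation, assuming the class package is installed (it ships with standard R distributions). The two training points are made up: they sit at equal distance from the test point and carry different labels, so with k = 2 the vote is split 1 to 1.

```r
library(class)

# Hypothetical one-dimensional data: two equidistant neighbours, one per class
train  <- matrix(c(-1, 1), ncol = 1)
labels <- factor(c("Down", "Up"), levels = c("Down", "Up"))
test   <- matrix(0, ncol = 1)

# Same seed before each call, so the random tie-break repeats
set.seed(1)
first <- knn(train, test, labels, k = 2)
set.seed(1)
second <- knn(train, test, labels, k = 2)

print(first)
print(identical(first, second))
```

Without resetting the seed, repeated calls on tied data are free to return different classes.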
[3 points] 5. Perform this knn classification using different Ks ( k = 1, 3, 5, 10, 15, 20) and calculate the
corresponding accuracy rate of prediction.
ks = c(1,3,5,10,15,20)
accuracyRate = c()

for (i in ks){

knn.pred = knn(trainx,testx,trainy,k = i)
accuracyRate = c(accuracyRate,mean(knn.pred == testy))

}
accuracyRate

## [1] 0.4807692 0.5192308 0.4903846 0.4615385 0.5769231 0.5384615
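The accumulate-in-a-loop pattern can also be written with sapply(), which returns one accuracy per k. The sketch below uses simulated data (not the Weekly data) so that it runs without ISLR; all object names and the data-generating rule are assumptions for illustration.

```r
library(class)
set.seed(1)

# Simulated two-predictor data; the label depends on the sign of the sum
train_x <- matrix(rnorm(200), ncol = 2)
train_y <- factor(ifelse(rowSums(train_x) > 0, "Up", "Down"),
                  levels = c("Down", "Up"))
test_x  <- matrix(rnorm(100), ncol = 2)
test_y  <- factor(ifelse(rowSums(test_x) > 0, "Up", "Down"),
                  levels = c("Down", "Up"))

ks <- c(1, 3, 5, 10, 15, 20)

# One accuracy value per k, without growing a vector inside a loop
accuracy <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, train_y, k = k)
  mean(pred == test_y)
})
names(accuracy) <- ks
print(accuracy)
```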


[2 points] 6. Create a plot to show how the accuracy rate changes with different Ks ( k = 1, 3, 5, 10, 15, 20). (hint: x-axis is k; y-axis is the corresponding accuracy rate). According to the plot, which k would you pick and why?
plot(ks,accuracyRate,type="b")

[Figure: accuracyRate plotted against ks; x-axis: ks, y-axis: accuracyRate]
k = 15, since it has the highest accuracy rate (0.5769).
[2 points] 7. If we want to add Volume as another predictor, what is the extra step we need to do and why?
We need to normalize the data first, because Volume is on a different scale from Lag1 and Lag2. KNN is based on distances, so predictors on different scales would not contribute equally.
summary(Weekly[,c("Lag1","Lag2","Volume")])

##       Lag1               Lag2              Volume
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747
##  1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.:0.33202
##  Median :  0.2410   Median :  0.2410   Median :1.00268
##  Mean   :  0.1506   Mean   :  0.1511   Mean   :1.57462
##  3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:2.05373
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821
## use the second function we created
newLag1 = func2(Lag1)
newLag2 = func2(Lag2)
newVol = func2(Volume)
summary(cbind(newLag1,newLag2,newVol))

##     newLag1          newLag2           newVol
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000
##  1st Qu.:0.5639   1st Qu.:0.5639   1st Qu.:0.02647
##  Median :0.6100   Median :0.6100   Median :0.09904
##  Mean   :0.6070   Mean   :0.6071   Mean   :0.16093
##  3rd Qu.:0.6486   3rd Qu.:0.6487   3rd Qu.:0.21278
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
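The normalization above uses the minimum and maximum of the whole data set. One common variation, sketched here as an assumption rather than as part of the assignment, is to compute the minimum and maximum on the training rows only and apply them to the test rows. The data frames and the helper `rescale_with` are made up for illustration.

```r
# Rescale x using a given minimum (lo) and maximum (hi)
rescale_with <- function(x, lo, hi) (x - lo) / (hi - lo)

# Hypothetical training and test rows
train <- data.frame(Lag1 = c(-2, 0, 3), Volume = c(0.1, 1.0, 9.0))
test  <- data.frame(Lag1 = c(1, -1),    Volume = c(2.0, 0.5))

# Per-column minimum and maximum taken from the training rows only
lo <- sapply(train, min)
hi <- sapply(train, max)

# Apply the same training-based scaling to both sets, column by column
train_scaled <- as.data.frame(Map(rescale_with, train, lo, hi))
test_scaled  <- as.data.frame(Map(rescale_with, test, lo, hi))

print(train_scaled)
print(test_scaled)
```

With this approach, test values outside the training range can fall below 0 or above 1, which is expected.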
