1. Look at a number of spam filters.
2. Try a main effects model.
3. Stepwise model.
4. Build test and training sets for ALL models and choose the model based on the TEST sets!
   a. Use the same test set code from day 3.
5. Consider using PCA results.
6. Insight into scatterplot matrices (any correlation? Not really, may not be needed here - not as much).
7. With and without log transform.
8. Which will do better predicting? Probably the log.
9. Write about the graphic in the situation description.
10. Graphical analysis to guide the modeling analysis.
11. What do we want the predictors to look like in general? Don't really want them to be bell shaped - want them uniform or symmetric, spread out uniformly.
12. The more spread out the better.
13. Create a factorial view.
14. Spread them out as much as you can.
15. Skew brings bias to the model and messes up the prediction.
16. Boxplots - the more separation the better.
17. Logistic regression puts a hyperplane through the data.
18. Use the Nike swoosh biplot.
19. Want a good predictive value.
   a. Interested in error count.
20. Don't care about the actual words, counts, letters, etc., as long as it identifies spam.
21. Not like the train accidents.
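The note above about building test and training sets for all models can be sketched as follows. This is a minimal illustration on simulated data rather than spam.txt; the names `sim`, `train.idx` and `test.err` and the two-thirds split are assumptions for the example, not the course's day 3 code.

```r
# Simulated stand-in for the spam data: two predictors and a 0/1 response.
set.seed(1)
n <- 600
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * sim$x1 - 1.0 * sim$x2))

# Hold out one third of the rows as a test set.
train.idx <- sample(seq_len(n), size = round(2 * n / 3))
train <- sim[train.idx, ]
test  <- sim[-train.idx, ]

# Fit the candidate models on the training rows only.
main.glm <- glm(y ~ x1 + x2, data = train, family = binomial)
null.glm <- glm(y ~ 1, data = train, family = binomial)

# Count test-set errors at a 0.5 threshold for a fitted model.
test.err <- function(model, newdata, threshold = 0.5) {
  pred <- predict(model, newdata = newdata, type = "response") > threshold
  sum(pred != (newdata$y == 1))
}
errors <- c(main = test.err(main.glm, test), null = test.err(null.glm, test))
print(errors)
```

The model with the smaller test-set error count would be the one to keep; the same function can be applied to stepwise, log-transformed or PCA-based fits.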

############################################################################

1. A qualitative response variable violates what assumption that we make in linear regression?
A. The distribution of the response variable is Gaussian.

2. What is the key insight we make to model a categorical response variable with two levels?
A. We model the event occurrence as a {0,1} dummy variable.
3. A categorical response means the Gaussian assumption is not satisfied. What is the impact of this?
A. All inferential results are questionable.
4. Which of the following are properties of the logistic function?

D. All of the above


5. If the probability of an event is 0.6 what are the odds of the event?
C. 3:2
6. The link function used in logistic regression is the logit. What is the logit?
C. Log (base e) of the odds of the event.
7. Why don't we optimize sum of square residuals to obtain parameter estimates in logistic regression?
D. The response variable in logistic regression is not Gaussian and this means we cannot use sum of squares.
8. What test statistic do we use for the model utility test in logistic regression?
D. Chi square
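
The odds and logit answers in questions 5 and 6 can be checked numerically. A small sketch, with variable names chosen only for illustration:

```r
# Odds and logit for a probability p, and the inverse (logistic) map.
p <- 0.6
odds <- p / (1 - p)               # 1.5, i.e. 3:2
logit <- log(odds)                # natural log of the odds
p.back <- 1 / (1 + exp(-logit))   # the logistic function recovers p

# Base R provides the same maps as qlogis() and plogis().
print(c(odds = odds, logit = logit, p.back = p.back))
```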
##### Data Set Load ######
spam <- read.table("spam.txt", header = F)
summary(spam)
> spam <- read.table("spam.txt", header = F)
> summary(spam)
       V1               V2               V3               V4                 V5                V6
 Min.   :0.0000   Min.   : 0.000   Min.   :0.0000   Min.   : 0.00000   Min.   : 0.0000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.:0.0000
 Median :0.0000   Median : 0.000   Median :0.0000   Median : 0.00000   Median : 0.0000   Median :0.0000
 Mean   :0.1046   Mean   : 0.213   Mean   :0.2807   Mean   : 0.06542   Mean   : 0.3122   Mean   :0.0959
 3rd Qu.:0.0000   3rd Qu.: 0.000   3rd Qu.:0.4200   3rd Qu.: 0.00000   3rd Qu.: 0.3800   3rd Qu.:0.0000
 Max.   :4.5400   Max.   :14.280   Max.   :5.1000   Max.   :42.81000   Max.   :10.0000   Max.   :5.8800
       V7               V8                V9                V10               V11               V12
 Min.   :0.0000   Min.   : 0.0000   Min.   :0.00000   Min.   : 0.0000   Min.   :0.00000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.:0.0000
 Median :0.0000   Median : 0.0000   Median :0.00000   Median : 0.0000   Median :0.00000   Median :0.1000
 Mean   :0.1142   Mean   : 0.1053   Mean   :0.09007   Mean   : 0.2394   Mean   :0.05982   Mean   :0.5417
 3rd Qu.:0.0000   3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.: 0.1600   3rd Qu.:0.00000   3rd Qu.:0.8000
 Max.   :7.2700   Max.   :11.1100   Max.   :5.26000   Max.   :18.1800   Max.   :2.61000   Max.   :9.6700
      V13               V14                V15               V16               V17              V18
 Min.   :0.00000   Min.   : 0.00000   Min.   :0.0000   Min.   : 0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.00000   1st Qu.: 0.00000   1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.00000   Median : 0.00000   Median :0.0000   Median : 0.0000   Median :0.0000   Median :0.0000
 Mean   :0.09393   Mean   : 0.05863   Mean   :0.0492   Mean   : 0.2488   Mean   :0.1426   Mean   :0.1847
 3rd Qu.:0.00000   3rd Qu.: 0.00000   3rd Qu.:0.0000   3rd Qu.: 0.1000   3rd Qu.:0.0000   3rd Qu.:0.0000
 Max.   :5.55000   Max.   :10.00000   Max.   :4.4100   Max.   :20.0000   Max.   :7.1400   Max.   :9.0900
      V19              V20                V21               V22               V23              V24
 Min.   : 0.000   Min.   : 0.00000   Min.   : 0.0000   Min.   : 0.0000   Min.   :0.0000   Min.   : 0.00000
 1st Qu.: 0.000   1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.00000
 Median : 1.310   Median : 0.00000   Median : 0.2200   Median : 0.0000   Median :0.0000   Median : 0.00000
 Mean   : 1.662   Mean   : 0.08558   Mean   : 0.8098   Mean   : 0.1212   Mean   :0.1016   Mean   : 0.09427
 3rd Qu.: 2.640   3rd Qu.: 0.00000   3rd Qu.: 1.2700   3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.: 0.00000
 Max.   :18.750   Max.   :18.18000   Max.   :11.1100   Max.   :17.1000   Max.   :5.4500   Max.   :12.50000
      V25               V26               V27               V28              V29                V30
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   :0.0000   Min.   : 0.00000   Min.   :0.0000
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.00000   1st Qu.:0.0000
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median :0.0000   Median : 0.00000   Median :0.0000
 Mean   : 0.5495   Mean   : 0.2654   Mean   : 0.7673   Mean   :0.1248   Mean   : 0.09892   Mean   :0.1029
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.: 0.00000   3rd Qu.:0.0000
 Max.   :20.8300   Max.   :16.6600   Max.   :33.3300   Max.   :9.0900   Max.   :14.28000   Max.   :5.8800
      V31                V32               V33                V34               V35               V36
 Min.   : 0.00000   Min.   :0.00000   Min.   : 0.00000   Min.   :0.00000   Min.   : 0.0000   Min.   :0.00000
 1st Qu.: 0.00000   1st Qu.:0.00000   1st Qu.: 0.00000   1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.:0.00000
 Median : 0.00000   Median :0.00000   Median : 0.00000   Median :0.00000   Median : 0.0000   Median :0.00000
 Mean   : 0.06475   Mean   :0.04705   Mean   : 0.09723   Mean   :0.04784   Mean   : 0.1054   Mean   :0.09748
 3rd Qu.: 0.00000   3rd Qu.:0.00000   3rd Qu.: 0.00000   3rd Qu.:0.00000   3rd Qu.: 0.0000   3rd Qu.:0.00000
 Max.   :12.50000   Max.   :4.76000   Max.   :18.18000   Max.   :4.76000   Max.   :20.0000   Max.   :7.69000
      V37             V38              V39                V40               V41               V42
 Min.   :0.000   Min.   :0.0000   Min.   : 0.00000   Min.   :0.00000   Min.   :0.00000   Min.   : 0.0000
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.: 0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.0000
 Median :0.000   Median :0.0000   Median : 0.00000   Median :0.00000   Median :0.00000   Median : 0.0000
 Mean   :0.137   Mean   :0.0132   Mean   : 0.07863   Mean   :0.06483   Mean   :0.04367   Mean   : 0.1323
 3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.: 0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 0.0000
 Max.   :6.890   Max.   :8.3300   Max.   :11.11000   Max.   :4.76000   Max.   :7.14000   Max.   :14.2800
      V43              V44               V45               V46               V47                V48
 Min.   :0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   :0.000000   Min.   : 0.00000
 1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.000000   1st Qu.: 0.00000
 Median :0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000   Median :0.000000   Median : 0.00000
 Mean   :0.0461   Mean   : 0.0792   Mean   : 0.3012   Mean   : 0.1798   Mean   :0.005444   Mean   : 0.03187
 3rd Qu.:0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.1100   3rd Qu.: 0.0000   3rd Qu.:0.000000   3rd Qu.: 0.00000
 Max.   :3.5700   Max.   :20.0000   Max.   :21.4200   Max.   :22.0500   Max.   :2.170000   Max.   :10.00000
      V49               V50             V51               V52               V53               V54
 Min.   :0.00000   Min.   :0.000   Min.   :0.00000   Min.   : 0.0000   Min.   :0.00000   Min.   : 0.00000
 1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.: 0.00000
 Median :0.00000   Median :0.065   Median :0.00000   Median : 0.0000   Median :0.00000   Median : 0.00000
 Mean   :0.03857   Mean   :0.139   Mean   :0.01698   Mean   : 0.2691   Mean   :0.07581   Mean   : 0.04424
 3rd Qu.:0.00000   3rd Qu.:0.188   3rd Qu.:0.00000   3rd Qu.: 0.3150   3rd Qu.:0.05200   3rd Qu.: 0.00000
 Max.   :4.38500   Max.   :9.752   Max.   :4.08100   Max.   :32.4780   Max.   :6.00300   Max.   :19.82900
      V55                V56                V57                V58
 Min.   :   1.000   Min.   :   1.00   Min.   :    1.0   Min.   :0.000
 1st Qu.:   1.588   1st Qu.:   6.00   1st Qu.:   35.0   1st Qu.:0.000
 Median :   2.276   Median :  15.00   Median :   95.0   Median :0.000
 Mean   :   5.191   Mean   :  52.17   Mean   :  283.3   Mean   :0.394
 3rd Qu.:   3.706   3rd Qu.:  43.00   3rd Qu.:  266.0   3rd Qu.:1.000
 Max.   :1102.500   Max.   :9989.00   Max.   :15841.0   Max.   :1.000

##### Question 9 #####


length(which(spam[,58]==1))
> length(which(spam[,58]==1))
[1] 1813

##### Question 10-12 #####


uva.pairs(spam[,c(48:58)])

##### Question 13-15 #####


spam.transformsubset<-spam
spam.transformsubset[,c(47:57)]<-log(spam.transformsubset[,c(47:57)]+0.01)
uva.pairs(spam.transformsubset[,c(47:58)])

##### Question 16 #####


plot.factor(10, 15, spam, spam[,58], "spam")

##### Question 17 #####


Off <- 0.01
Lspam <- log(spam[,1:57]+ Off)
Lspam[,58]<-spam[,58]
plot.factor(20, 25, Lspam, Lspam[,58], "spam")
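
The offset `Off` matters because most word-frequency values are exactly zero, and the log of zero is not finite. A short sketch on a made-up vector (the values of `x` are illustrative only):

```r
# Word-frequency columns are mostly zero, so a plain log() gives -Inf.
x <- c(0, 0, 0.42, 5.1)
Off <- 0.01              # same offset as in the code above
raw <- log(x)            # first two entries are -Inf
lx <- log(x + Off)       # finite everywhere
print(raw)
print(lx)
```

The size of the offset is a judgment call: a smaller offset pushes the zeros further from the positive values on the log scale.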

##### Question 18-21 #####

spam.pc = princomp(spam[,1:57], cor = T)


cumplot(spam.pc)
biplot.fact(spam.pc, spam[,58])
legend(-1.8, 8, legend = c("spam", "Ham"), pch = c(18, 19), col = c("red", "blue"))

##### Question 22-25 #####


spam.transform<-spam
spam.transform[,c(1:57)]<-log((spam.transform[,c(1:57)])+0.01)
spam.glm <- glm(V58~., data = spam.transform, family=binomial)
spam.null <- glm(V58~1, data = spam.transform, family = binomial)
anova(spam.null, spam.glm, test = "Chi")
Analysis of Deviance Table

Model 1: V58 ~ 1
Model 2: V58 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 +
    V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 +
    V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31 +
    V32 + V33 + V34 + V35 + V36 + V37 + V38 + V39 + V40 + V41 +
    V42 + V43 + V44 + V45 + V46 + V47 + V48 + V49 + V50 + V51 +
    V52 + V53 + V54 + V55 + V56 + V57
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1      4600     6170.2
2      4543     1362.8 57   4807.4 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Note: no family argument is given in the next call, so glm() fits a Gaussian model.
# The 375.94 in the table below is therefore a residual sum of squares, not a binomial
# deviance, and the comparison with spam.glm is not a valid nested-model test.
# Adding family = binomial would give one.
spamreduced.glm<-glm(V58~.,data=spam.transform[,c(1:47,58)])
anova(spamreduced.glm, spam.glm, test = "Chi")

summary(spam.glm)
Analysis of Deviance Table

Model 1: V58 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 +
    V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 +
    V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31 +
    V32 + V33 + V34 + V35 + V36 + V37 + V38 + V39 + V40 + V41 +
    V42 + V43 + V44 + V45 + V46 + V47
Model 2: V58 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 +
    V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 +
    V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31 +
    V32 + V33 + V34 + V35 + V36 + V37 + V38 + V39 + V40 + V41 +
    V42 + V43 + V44 + V45 + V46 + V47 + V48 + V49 + V50 + V51 +
    V52 + V53 + V54 + V55 + V56 + V57
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      4553     375.94
2      4543    1362.78 10  -986.84
> summary(spam.glm)


Call:
glm(formula = V58 ~ ., family = binomial, data = spam.transform)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.2571  -0.1599  -0.0027   0.0827   3.8159

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -26.454477   9.823892  -2.693 0.007084 **
V1           -0.134745   0.059038  -2.282 0.022469 *
V2           -0.041912   0.051712  -0.810 0.417654
V3           -0.110085   0.042267  -2.605 0.009200 **
V4            0.154184   0.179899   0.857 0.391415
V5            0.265274   0.040389   6.568 5.10e-11 ***
V6            0.076853   0.060057   1.280 0.200663
V7            0.519041   0.065965   7.868 3.59e-15 ***
V8            0.194667   0.060615   3.212 0.001320 **
V9            0.023939   0.071789   0.333 0.738788
V10           0.040651   0.044427   0.915 0.360191
V11          -0.095394   0.069185  -1.379 0.167947
V12          -0.094412   0.035459  -2.663 0.007754 **
V13          -0.208426   0.063848  -3.264 0.001097 **
V14           0.209835   0.076827   2.731 0.006310 **
V15           0.296183   0.152815   1.938 0.052602 .
V16           0.273367   0.041746   6.548 5.82e-11 ***
V17           0.253624   0.060092   4.221 2.44e-05 ***
V18          -0.124757   0.049028  -2.545 0.010940 *
V19           0.017058   0.035207   0.485 0.628026
V20           0.156699   0.102193   1.533 0.125188
V21           0.113396   0.037274   3.042 0.002349 **
V22           0.052410   0.088306   0.594 0.552840
V23           0.214966   0.080983   2.654 0.007943 **
V24           0.335243   0.069675   4.812 1.50e-06 ***
V25          -0.760413   0.084451  -9.004  < 2e-16 ***
V26          -0.060826   0.092791  -0.656 0.512137
V27          -1.304266   0.197912  -6.590 4.39e-11 ***
V28           0.369562   0.085567   4.319 1.57e-05 ***
V29          -0.201463   0.148312  -1.358 0.174347
V30           0.008074   0.106451   0.076 0.939538
V31          -0.152699   0.268874  -0.568 0.570090
V32          -0.344677   0.390195  -0.883 0.377049
V33          -0.164589   0.083647  -1.968 0.049107 *
V34          -0.003287   0.359179  -0.009 0.992698
V35          -0.469932   0.161016  -2.919 0.003517 **
V36           0.159342   0.079098   2.014 0.043957 *
V37          -0.222701   0.065705  -3.389 0.000700 ***
V38           0.303385   0.167478   1.811 0.070064 .
V39          -0.154907   0.086075  -1.800 0.071911 .
V40          -0.064607   0.126832  -0.509 0.610481
V41          -2.027127   1.946524  -1.041 0.297686
V42          -0.573184   0.120578  -4.754 2.00e-06 ***
V43          -0.345693   0.149838  -2.307 0.021049 *
V44          -0.387622   0.115830  -3.346 0.000818 ***
V45          -0.250956   0.045212  -5.551 2.85e-08 ***
V46          -0.462600   0.070816  -6.532 6.47e-11 ***
V47          -0.020737   0.266011  -0.078 0.937864
V48          -0.475513   0.158137  -3.007 0.002639 **
V49          -0.146023   0.077857  -1.876 0.060720 .
V50          -0.071762   0.053900  -1.331 0.183059
V51          -0.085938   0.133503  -0.644 0.519762
V52           0.379690   0.040212   9.442  < 2e-16 ***
V53           0.553207   0.072855   7.593 3.12e-14 ***
V54          -0.248230   0.105959  -2.343 0.019145 *
V55           0.621813   0.223583   2.781 0.005417 **
V56           0.098703   0.150473   0.656 0.511854
V57           0.526103   0.123163   4.272 1.94e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6170.2  on 4600  degrees of freedom
Residual deviance: 1362.8  on 4543  degrees of freedom
AIC: 1478.8

Number of Fisher Scoring iterations: 11

##### Afternoon Quiz #####

##### Question 13 #####


spam <- read.table("spam.txt", header = F)
spam.transform<-spam
spam.transform[,c(1:57)]<-log(spam.transform[,c(1:57)]+0.01)
spamtrans.glm <- glm(V58~., data = spam.transform[,c(1:10,58)], family=binomial)
spamtrans.null <- glm(V58~1, data = spam.transform[,c(1:10,58)], family=binomial)
anova(spamtrans.glm,spamtrans.null, test="Chi")
Analysis of Deviance Table

Model 1: V58 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10
Model 2: V58 ~ 1
  Resid. Df Resid. Dev  Df Deviance  Pr(>Chi)
1      4590     3915.2
2      4600     6170.2 -10  -2254.9 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
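
The chi-square model utility test can be reproduced by hand from the two deviances reported above. A sketch, with variable names chosen for illustration; the difference comes out as 2255.0 rather than the printed 2254.9 only because the deviances are rounded to one decimal place:

```r
# Model utility test for the ten-predictor model, from the reported deviances.
null.dev  <- 6170.2; null.df  <- 4600
model.dev <- 3915.2; model.df <- 4590

lrt    <- null.dev - model.dev     # drop in deviance
lrt.df <- null.df - model.df       # number of added parameters
# Upper-tail chi-square probability; it is so small here that it may print as 0.
p.value <- pchisq(lrt, df = lrt.df, lower.tail = FALSE)
print(c(LRT = lrt, df = lrt.df, p = p.value))
```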

##### Question 14 #####


library(MASS)  # dropterm() is provided by the MASS package
dropterm(spamtrans.glm, test = "Chi")
> dropterm(spamtrans.glm, test = "Chi")
Single term deletions

Model:
V58 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10
       Df Deviance    AIC    LRT   Pr(Chi)
<none>      3915.2 3937.2
V1      1   3922.5 3942.5   7.23  0.007156 **
V2      1   3934.2 3954.2  18.95 1.340e-05 ***
V3      1   3963.6 3983.6  48.38 3.509e-12 ***
V4      1   3942.2 3962.2  27.01 2.028e-07 ***
V5      1   4036.7 4056.7 121.51 < 2.2e-16 ***
V6      1   3947.4 3967.4  32.22 1.377e-08 ***
V7      1   4571.7 4591.7 656.45 < 2.2e-16 ***
V8      1   4025.2 4045.2 109.94 < 2.2e-16 ***
V9      1   3950.2 3970.2  34.99 3.309e-09 ***
V10     1   3930.5 3950.5  15.25 9.423e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##### Question 15 #####


summary(spamtrans.glm)
(exp(0.07538)-1)*100
> (exp(0.07538)-1)*100
[1] 7.829382
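
The calculation above turns a logistic regression coefficient into a percent change in the odds for a one-unit increase in the predictor. Because the predictors were log-transformed, one unit here means one unit of log(x + 0.01). A sketch with a hypothetical helper name, `pct.odds.change`; the source does not show which variable the coefficient 0.07538 belongs to:

```r
# Percent change in the odds for a one-unit increase in a predictor,
# given its logistic regression coefficient.
pct.odds.change <- function(beta) (exp(beta) - 1) * 100

beta <- 0.07538                    # coefficient used in the calculation above
print(pct.odds.change(beta))       # about 7.83

# A negative coefficient of the same size lowers the odds, by slightly less.
print(pct.odds.change(-beta))
```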

##### Question 16 ####


small.step <- step(spamtrans.glm)
anova(small.step, spamtrans.glm, test = "Chi")
> small.step <- step(spamtrans.glm)
Start:  AIC=3937.23
V58 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10

       Df Deviance    AIC
<none>      3915.2 3937.2
- V1    1   3922.5 3942.5
- V10   1   3930.5 3950.5
- V2    1   3934.2 3954.2
- V4    1   3942.2 3962.2
- V6    1   3947.4 3967.4
- V9    1   3950.2 3970.2
- V3    1   3963.6 3983.6
- V8    1   4025.2 4045.2
- V5    1   4036.7 4056.7
- V7    1   4571.7 4591.7
> anova(small.step, spamtrans.glm, test = "Chi")
Analysis of Deviance Table

Model 1: V58 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10
Model 2: V58 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      4590     3915.2
2      4590     3915.2  0        0

##### Question 17 #####


spamtrans.inter.glm<-glm(V58~(.)^2, data = spam.transform[,c(1:10,58)], family=binomial)
anova(spamtrans.inter.glm, spamtrans.glm, test = "Chi")
> spamtrans.inter.glm<-glm(V58~(.)^2, data = spam.transform[,c(1:10,58)], family=binomial)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> anova(spamtrans.inter.glm, spamtrans.glm, test = "Chi")
Analysis of Deviance Table

Model 1: V58 ~ (V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10)^2
Model 2: V58 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10
  Resid. Df Resid. Dev  Df Deviance  Pr(>Chi)
1      4545     3735.6
2      4590     3915.2 -45  -179.64 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##### Question 18 #####


AIC(spamtrans.glm)
AIC(spamtrans.inter.glm)
> AIC(spamtrans.glm)
[1] 3937.228
> AIC(spamtrans.inter.glm)
[1] 3847.587
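
Both AIC values can be checked against the residual deviances reported earlier. For a binomial glm with a 0/1 response the residual deviance equals -2 times the log-likelihood, so AIC is the deviance plus twice the number of coefficients. A sketch with a hypothetical helper name:

```r
# AIC for a binomial glm with 0/1 responses:
# residual deviance + 2 * (number of coefficients).
aic.from.deviance <- function(deviance, n.coef) deviance + 2 * n.coef

# Main-effects model: intercept + 10 predictors = 11 coefficients.
aic.main <- aic.from.deviance(3915.2, 11)
# Two-way interaction model: 11 + choose(10, 2) = 56 coefficients.
aic.inter <- aic.from.deviance(3735.6, 11 + choose(10, 2))
print(c(main = aic.main, inter = aic.inter))
```

The small gaps from the printed 3937.228 and 3847.587 come from the deviances being rounded to one decimal place.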

##### Question 19 #####


spamtrans.pred <- predict(spamtrans.glm, type = "response")
score.table(spamtrans.pred , spam[,58], .5)
> score.table(spamtrans.pred , spam[,58], .5)
Actual vs. Predicted
      Pred
Actual FALSE TRUE
     0  2581  207
     1   630 1183

##### Question 20 #####


spamtrans.inter.pred <- predict(spamtrans.inter.glm, type = "response")
score.table(spamtrans.inter.pred , spam[,58], .5)
> spamtrans.inter.pred <- predict(spamtrans.inter.glm, type = "response")
> score.table(spamtrans.inter.pred , spam[,58], .5)
Actual vs. Predicted
      Pred
Actual FALSE TRUE
     0  2591  197
     1   582 1231

##### Question 21 #####


plot.roc(spamtrans.pred, spam[,58], main = "ROC Curve - SPAM Filter")
lines.roc(spamtrans.inter.pred, spam[,58], col = "orange")

##### Question 22 #####


spam.pca <- princomp(spam[,-c(11:58)], cor = T)

var.comp(spam.pca, 98)
> spam.pca <- princomp(spam[,-c(11:58)], cor = T)
> var.comp(spam.pca, 98)
[1] 9
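
The result above suggests that `var.comp` returns the number of principal components needed to reach a given percentage of total variance. That reading is an assumption, since the course helper's source is not shown. A sketch of such a calculation on simulated data, with the hypothetical name `n.comp.for`:

```r
# Hypothetical re-implementation of a "components needed" helper;
# the course's var.comp() may differ in detail.
n.comp.for <- function(pc.obj, percent) {
  var.explained <- pc.obj$sdev^2 / sum(pc.obj$sdev^2)
  unname(which(cumsum(var.explained) * 100 >= percent)[1])
}

# Simulated data: 10 columns driven mostly by 2 underlying factors plus noise.
set.seed(2)
f <- matrix(rnorm(500 * 2), ncol = 2)
x <- f %*% matrix(rnorm(2 * 10), nrow = 2) +
  matrix(rnorm(500 * 10, sd = 0.3), ncol = 10)
pc <- princomp(x, cor = TRUE)

k <- n.comp.for(pc, 98)
print(k)
```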

##### Question 23 #####


spampca.glm98 <- pc.glm(spam.pca, 98, spam[,58])
spampc.null <- pc.null(spam.pca, 98, spam[,58])
anova(spampc.null, spampca.glm98, test = "Chi")
> spampca.glm98 <- pc.glm(spam.pca, 98, spam[,58])
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> spampc.null <- pc.null(spam.pca, 98, spam[,58])
> anova(spampc.null, spampca.glm98, test = "Chi")
Analysis of Deviance Table

Model 1: r ~ 1
Model 2: r ~ Comp.1 + Comp.2 + Comp.3 + Comp.4 + Comp.5 + Comp.6 + Comp.7 +
    Comp.8 + Comp.9
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1      4600     6170.2
2      4591     4389.5  9   1780.6 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##### Question 24 #####


spampca.glm98.pred <- pc.pred(spam.pca, , spampca.glm98)
spampca.glm98.pred2<- predict(spampca.glm98, type = "response")
score.table(spampca.glm98.pred , spam[,58], .5)
> spampca.glm98.pred <- pc.pred(spam.pca, , spampca.glm98)
> spampca.glm98.pred2<- predict(spampca.glm98, type = "response")
> score.table(spampca.glm98.pred , spam[,58], .5)
Actual vs. Predicted
      Pred
Actual FALSE TRUE
     0  2649  139
     1   896  917

##### Question 25 #####


score.table(spamtrans.pred , spam[,58], .5)
score.table(spamtrans.inter.pred , spam[,58], .5)
score.table(spampca.glm98.pred , spam[,58], .5)

> score.table(spamtrans.pred , spam[,58], .5)
Actual vs. Predicted
      Pred
Actual FALSE TRUE
     0  2581  207
     1   630 1183
> score.table(spamtrans.inter.pred , spam[,58], .5)
Actual vs. Predicted
      Pred
Actual FALSE TRUE
     0  2591  197
     1   582 1231
> score.table(spampca.glm98.pred , spam[,58], .5)
Actual vs. Predicted
      Pred
Actual FALSE TRUE
     0  2649  139
     1   896  917
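
Since the notes single out the error count as the quantity of interest, the three tables above can be reduced to error totals. A sketch that re-enters the reported counts by hand; the list name `tabs` and the helper `summarise` are illustrative, and coding 1 as spam follows the count in Question 9:

```r
# Confusion counts at a 0.5 threshold, copied from the score tables above.
# Rows: actual (0 = ham, 1 = spam); columns: predicted FALSE / TRUE.
# matrix() fills column by column, so each vector lists the FALSE column first.
tabs <- list(
  main  = matrix(c(2581, 630, 207, 1183), nrow = 2),
  inter = matrix(c(2591, 582, 197, 1231), nrow = 2),
  pca98 = matrix(c(2649, 896, 139, 917),  nrow = 2)
)

summarise <- function(m) {
  c(errors         = m[1, 2] + m[2, 1],              # all misclassified messages
    error.rate     = (m[1, 2] + m[2, 1]) / sum(m),
    false.positive = m[1, 2],                        # ham flagged as spam
    false.negative = m[2, 1])                        # spam let through
}
res <- t(sapply(tabs, summarise))
print(round(res, 3))
```

These counts are computed on the same data used to fit the models, so they are likely to be optimistic compared with counts on a held-out test set.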