
Data Mining and Knowledge Engineering (COMP723) – Laboratory 3 – Model Answers

Task 1 Establish baseline scores for classification accuracy, F-score and classification time (testing)

Record the classification accuracy, overall F-score and the classification time (i.e. the time taken to test the model on the supplied test set).

Classification accuracy:
Correctly Classified Instances 470 89.8662 %
Overall F-score :
0.871

Classification time (i.e. Time taken to test model on supplied test set):
Time taken to test model on supplied test set: 0.05 seconds*
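The accuracy percentage is just the proportion of correctly classified instances. Assuming a test set of 523 instances (the loop bound used in the Task 3 script), a quick arithmetic check, written here in Python for illustration:

```python
correct = 470   # correctly classified instances reported above
total = 523     # assumed test-set size, matching seq(1, 523) in the Task 3 script
accuracy = 100 * correct / total
print(round(accuracy, 4))  # 89.8662
```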

Task 2 Perform feature selection and investigate effects on accuracy and model building time

(a) Record the classification accuracy, overall F-score and the classification time values with the use
of the CorrelationAttributeEval filter.

Classification accuracy:
Correctly Classified Instances 479 91.587 %
Overall F-score:
0.897
Classification time values:
Time taken to test model on supplied test set: 0.05 seconds*

(b) Record the classification accuracy, overall F-score and the classification time values with the use
of the CfsSubsetEval filter.

Classification accuracy:
Correctly Classified Instances 475 90.8222 %
Overall F-score:
0.886
Classification time:
Time taken to test model on supplied test set: 0.05 seconds*
(c) Record the classification accuracy, overall F-score and the classification time values with the use
of the InfoGainAttributeEval filter.

Classification accuracy:
Correctly Classified Instances 475 90.8222 %
Overall F-score:
0.898
Classification time:
Time taken to test model on supplied test set: 0.04 seconds*

Task 3 Use the RWeka interface to perform feature selection

Study the code snippet below and run it in R.

TraindataSecom<-read.arff("C:/Users/rpears/Desktop/Secom/SecomTrain.arff")
colnames(TraindataSecom)[colnames(TraindataSecom)=="591"] <- "class" # set the target attribute name to "class" in the training file
TestdataSecom<-read.arff("C:/Users/rpears/Desktop/Secom/SecomTest.arff")
colnames(TestdataSecom)[colnames(TestdataSecom)=="591"] <- "class" # set the target attribute name to "class" in the testing file
actual<-TestdataSecom[, 591] # get the class values from the test file
A<- InfoGainAttributeEval(class ~ . , data = TraindataSecom,na.action=NULL ) # rank features by their information gain score
ranked_list<- A[order(A)] # sorting in ascending order
A[order(-A)] # print the features with the highest information gain together with their corresponding gain values

D<-540 # now specify the number of features to be dropped;


s<- ranked_list[1:D]
cols.dont.want <- c(names(s)) # identify names of low ranked features
TraindataSecom1<- TraindataSecom[, !names(TraindataSecom) %in% cols.dont.want, drop = T] # drop low ranked features from the training dataset, retaining the 50 highest ranked attributes

classifier <- J48(class ~ ., data = TraindataSecom1, na.action=NULL) # build the model on the reduced training dataset (the version with 50 attributes)
TestdataSecom1<- TestdataSecom[, !names(TestdataSecom) %in% cols.dont.want, drop = T] # drop low ranked features from the test dataset
pred<-predict(classifier, TestdataSecom1, na.action=NULL, seed=1) # deploy the model on the reduced test dataset to make predictions
P11<-0
P12<-0
P21<-0
P22<-0
for ( K in seq(1,523))
{
if(actual[K]==-1){
if(pred[K]==-1){
P11<-P11+1
}
else
{
P12<-P12+1
}
}
else if (actual[K]==1){
if(pred[K]==1){
P22<-P22+1
}
else
{
P21<-P21+1
}
}
}

Prec_1<-(P11/(P11+P21))
Prec_2<-(P22/(P22+P12))
Recall_1<-(P11/(P11+P12))
Recall_2<-(P22/(P22+P21))

F_1<-(2*Prec_1*Recall_1)/(Prec_1+Recall_1)
F_2<-(2*Prec_2*Recall_2)/(Prec_2+Recall_2)
F_overall<-(F_1*462+F_2*61)/523
paste("This is the F overall score",F_overall)
t<-system.time(predict(classifier,TestdataSecom1, na.action=NULL,seed=1)) # measure the CPU time taken to classify the test set
paste("This is the total classification time", t[[1]])

Remember to alter the pathnames to point to your file locations.


Note that the number of attributes selected is 50, as 540 of the 590 attributes are dropped.
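The final lines of the script compute per-class precision, recall and F-scores from the confusion counts, then weight the two F-scores by the class supports (462 negative and 61 positive test instances). A minimal Python sketch of that same calculation; the function name and the confusion counts in the call are invented for illustration, not taken from the lab output:

```python
def overall_f(p11, p12, p21, p22, n_neg=462, n_pos=61):
    """Support-weighted F-score, mirroring the R snippet's P11/P12/P21/P22 counts."""
    prec_1 = p11 / (p11 + p21)   # precision for class -1
    prec_2 = p22 / (p22 + p12)   # precision for class +1
    rec_1 = p11 / (p11 + p12)    # recall for class -1
    rec_2 = p22 / (p22 + p21)    # recall for class +1
    f_1 = 2 * prec_1 * rec_1 / (prec_1 + rec_1)
    f_2 = 2 * prec_2 * rec_2 / (prec_2 + rec_2)
    return (f_1 * n_neg + f_2 * n_pos) / (n_neg + n_pos)

# hypothetical confusion counts, for illustration only
print(overall_f(p11=450, p12=12, p21=30, p22=31))
```

As a sanity check, a perfect classifier (P11 = 462, P22 = 61, no errors) gives a weighted score of exactly 1.0.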

(a) Run the above code and paste the overall F-score and classification time in your lab report.

The overall F score:


0.903393659122237
Classification time:
0.0399999999999996*

(b) Now use a for loop and perform feature selection with K values in the range [10, 50] in intervals of 5. You will need to look up R help on using a for loop with increments. Paste your code in your submission.

Code:

TraindataSecom<-read.arff("H:/lab class_COMP723/lab 4/2016 Lab 4/SecomTrain.arff")
colnames(TraindataSecom)[colnames(TraindataSecom)=="591"] <- "class" # set the target attribute name to "class" in the training file
TestdataSecom<-read.arff("H:/lab class_COMP723/lab 4/2016 Lab 4/SecomTest.arff")
colnames(TestdataSecom)[colnames(TestdataSecom)=="591"] <- "class" # set the target attribute name to "class" in the testing file
actual<-TestdataSecom[, 591] # get the class values from the test file
A<- InfoGainAttributeEval(class ~ . , data = TraindataSecom, na.action=NULL) # rank features by their information gain score
ranked_list<- A[order(A)] # sort in ascending order
#A[order(-A)] # print the features with the highest information gain together with their corresponding gain values
for(i in seq(10,50,5))
{
cat("Number of features selected:", i, "\n")
D<-(590-i) # number of features to be dropped
s<- ranked_list[1:D]
cols.dont.want <- c(names(s)) # names of low ranked features
TraindataSecom1<- TraindataSecom[, !names(TraindataSecom) %in% cols.dont.want, drop = T] # drop low ranked features from the training dataset

classifier <- J48(class ~ ., data = TraindataSecom1, na.action=NULL) # build the model on the reduced training dataset (the version with the top i attributes)
TestdataSecom1<- TestdataSecom[, !names(TestdataSecom) %in% cols.dont.want, drop = T] # drop low ranked features from the test dataset
pred<-predict(classifier, TestdataSecom1, na.action=NULL, seed=1) # deploy the model on the reduced test dataset to make predictions

P11<-0
P12<-0
P21<-0
P22<-0

for ( K in seq(1,523))
{
if(actual[K]==-1)
{
if(pred[K]==-1)
{
P11<-P11+1
}
else
{
P12<-P12+1
}
}
else if (actual[K]==1)
{
if(pred[K]==1)
{
P22<-P22+1
}
else
{
P21<-P21+1
}
}
}
Prec_1<-(P11/(P11+P21))
Prec_2<-(P22/(P22+P12))
Recall_1<-(P11/(P11+P12))
Recall_2<-(P22/(P22+P21))
F_1<-(2*Prec_1*Recall_1)/(Prec_1+Recall_1)
F_2<-(2*Prec_2*Recall_2)/(Prec_2+Recall_2)
F_overall<-(F_1*462+F_2*61)/523
cat("This is the F overall score:", F_overall, "\n")
t<-system.time(predict(classifier,TestdataSecom1, na.action=NULL,seed=1)) # measure the CPU time taken to classify the test set
cat("This is the total classification time:", t[[1]], "\n\n")
}

Output:

Number of features selected: 10


This is the F overall score: 0.8908404
This is the total classification time: 0.01 *

Number of features selected: 15


This is the F overall score: 0.8861252
This is the total classification time: 0.01 *

Number of features selected: 20


This is the F overall score: 0.8861252
This is the total classification time: 0.03*

Number of features selected: 25


This is the F overall score: 0.9210694
This is the total classification time: 0.01 *

Number of features selected: 30


This is the F overall score: 0.9149383
This is the total classification time: 0.02 *

Number of features selected: 35


This is the F overall score: 0.9149383
This is the total classification time: 0.03 *

Number of features selected: 40


This is the F overall score: 0.9149383
This is the total classification time: 0.03 *

Number of features selected: 45

This is the F overall score: 0.9067615
This is the total classification time: 0.04 *

Number of features selected: 50


This is the F overall score: 0.9033937
This is the total classification time: 0.03 *

* Classification time can vary
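Collecting the F-scores reported above, the best result in this run is obtained with 25 features. A short Python sketch to locate the maximum:

```python
# overall F-scores reported above, keyed by number of selected features
f_scores = {10: 0.8908404, 15: 0.8861252, 20: 0.8861252, 25: 0.9210694,
            30: 0.9149383, 35: 0.9149383, 40: 0.9149383, 45: 0.9067615,
            50: 0.9033937}
best_k = max(f_scores, key=f_scores.get)
print(best_k, f_scores[best_k])  # 25 0.9210694
```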