Aigerim Tulegenova
Part 1 a) i.
Table 1
Summary statistics (Min, 1st quartile, Median, Mean, 3rd quartile, Max, count of NAs, Std. Dev., Variance, IQR and Range) for attributes er, pgr, ar, l1, l2, b1-b4, h1-h3, p1, p2 and o1-o7.
Figure 1
Figure 2
Looking at Figure 4, it can be seen that the data is left-skewed: the median is greater than the mean. The range of this attribute is 300, while the interquartile range is 170; the IQR therefore covers more than half of the range, which indicates that the data is quite spread out. The feature has a high proportion of large values, so the median (200) is closer to the third quartile (270) than to the first quartile (100).
Attribute l2 follows a similar pattern.
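The pattern described above can be checked with a short sketch (the vector `x` below is illustrative, not the actual attribute values):

```r
# For a left-skewed attribute the median exceeds the mean, and an
# IQR that covers much of the range indicates widely spread values.
x <- c(0, 40, 100, 150, 200, 250, 260, 270, 290, 300)  # illustrative data

print(median(x) > mean(x))      # left skew: median sits above the mean
print(IQR(x) / diff(range(x)))  # ratio above 0.5 = data spread out
```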
Figure 5
With unscaled data, PCA will automatically give more weight to variables with larger ranges. For this reason, a transformation technique is required. The study conducted by Ben-Hur and Guyon (2003) concluded that PCA improves the extraction of cluster structure when the data are normalized rather than standardized (please refer to Appendix Part 1 g) iii. for the code). Moreover, standardization is usually required when variables have different units of measurement; otherwise they can be considered already standardized. Since the variables of our dataset all relate to breast cancer, we assume they share the same unit of measurement.
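The difference between the two rescalings can be sketched as follows (the matrix `m` is illustrative; the coursework code in Part 1 f) applies the same min-max formula column by column):

```r
# Min-max normalization maps each column onto [0, 1]; z-score
# standardization gives each column mean 0 and unit variance.
m <- cbind(a = c(10, 20, 30, 40), b = c(100, 300, 400, 700))

normalize <- function(x) (x - min(x)) / (max(x) - min(x))

norm_m <- apply(m, 2, normalize)  # every column now spans [0, 1]
std_m  <- scale(m)                # every column now has mean 0, sd 1

# PCA can then be run on the rescaled matrix, e.g.:
pca <- princomp(norm_m)
summary(pca)
```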
Figure 9

Method    Correctly classified    Average Accuracy
hclust                      86                0.08
kmeans                     187                0.17
PAM                        346                0.32
Figure 11
Figure 12
Average Accuracy: 0.361, 0.282, 0.313, 0.277, 0.281
Classifier     Precision    Recall    F-measure    ROC Area
ZeroR              0.076     0.275        0.119       0.498
OneR               0.34      0.364        0.31        0.602
NaiveBayes         0.751     0.713        0.717       0.937
IBk                0.631     0.633        0.629       0.775
J48                0.868     0.865        0.864       0.941
Figure 13
Figure 14
Dataset                        Precision    Recall    F-measure    ROC Area
With the first 10 PC               0.513     0.522        0.516       0.743
After deletion inst/attr.          0.914     0.914        0.914       0.964
NAs replaced with 0                0.859     0.856        0.855       0.938
NAs replaced with mean             0.857     0.856        0.855       0.939
NAs replaced with median           0.868     0.865        0.864       0.941
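The F-measure columns in the tables above are derived from precision and recall; a small helper function (not part of the Weka output, which additionally averages per-class scores) shows the relationship:

```r
# F-measure (F1) is the harmonic mean of precision and recall, so it
# is pulled towards the lower of the two values.
f_measure <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

print(f_measure(0.8, 0.8))  # equal inputs: F1 equals them (0.8)
print(f_measure(0.9, 0.1))  # unbalanced inputs: F1 stays low (0.18)
```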
References:
1. Adeyemi, T. (2011). The Effective Use of Standard Scores for Research in Educational Management. Research Journal of Mathematics and Statistics, 3(3).
2. Abbas, O. (2008). Comparisons Between Data Clustering Algorithms. The International Arab Journal of Information Technology, 5(3).
3. Ben-Hur, A. and Guyon, I. (2003). Detecting Stable Clusters Using Principal Component Analysis. In Functional Genomics: Methods and Protocols, pp. 159-182.
4. Nabi, A. and Ahmed, S. Survey on Classification Algorithms for Data Mining: (Comparison and Evaluation).
5. Bayat, S., Cuggia, M., Rossille, D., Kessler, M. and Frimat, L. (2009). Comparison of Bayesian Network and Decision
Appendix
Part 1 a) i.
# setwd("/Users/aigerimtulegenova/Desktop/CW DMA")
# install.packages("readxl")
library(readxl)
Cancer <- read_excel("coursework_data.xlsx")
summary(Cancer[, 1:22])
# preallocate one value per attribute
Variance <- numeric(22)
Std.dev <- numeric(22)
Int.qu <- numeric(22)
Range <- numeric(22)
for (i in 1:22)
{
  # [[i]] extracts the column as a vector (read_excel returns a tibble)
  Variance[i] <- var(Cancer[[i]], na.rm = TRUE)
  Std.dev[i] <- sd(Cancer[[i]], na.rm = TRUE)
  Int.qu[i] <- IQR(Cancer[[i]], na.rm = TRUE)
  Range[i] <- diff(range(Cancer[[i]], na.rm = TRUE))
}
Part 1 a) ii.
Histogram <- function()
{
  library(readxl)
  Cancer <- read_excel("coursework_data.xlsx")
  for (i in 1:22)
  {
    pdf(paste("histogram of ", names(Cancer)[i], ".pdf", sep = ""))
    # [[i]] extracts the column as a vector (hist() cannot take a tibble column)
    hist(Cancer[[i]], main = paste("Histogram of ", names(Cancer)[i]),
         xlab = paste("Range of ", names(Cancer)[i]),
         ylab = "Frequency", col = "pink")
    abline(v = mean(Cancer[[i]], na.rm = TRUE), col = "red", lwd = 2)
    abline(v = median(Cancer[[i]], na.rm = TRUE), col = "green", lwd = 2)
    legend("topright", c("Mean", "Median"), col = c("red", "green"), lwd = 8)
    dev.off()
  }
}
Histograms.
Part 1 b) i.
# correlation between er and pgr
cor(Cancer$er, Cancer$pgr)
plot(Cancer$er, Cancer$pgr, main = "Scatterplot between er and pgr", xlab = "er", ylab = "pgr")
# correlation between b1 and b2
cor(Cancer$b1, Cancer$b2)
plot(Cancer$b1, Cancer$b2, main = "Scatterplot between b1 and b2", xlab = "b1", ylab = "b2")
# correlation between p1 and p2
cor(Cancer$p1, Cancer$p2)
plot(Cancer$p1, Cancer$p2, main = "Scatterplot between p1 and p2", xlab = "p1", ylab = "p2")
Part 1 b) ii.
# recode string class labels into numerical codes
Cancer$class[Cancer$class == "B-A"] <- 1
Cancer$class[Cancer$class == "B-B"] <- 2
Cancer$class[Cancer$class == "H-A"] <- 3
Cancer$class[Cancer$class == "H-B"] <- 4
Cancer$class[Cancer$class == "L-A"] <- 5
Cancer$class[Cancer$class == "L-B"] <- 6
Cancer$class[Cancer$class == "L-N"] <- 7
Cancer$class <- as.numeric(Cancer$class)  # the recoded column is still character
plot(Cancer$class, Cancer$er, main = "Scatterplot between class variable and er attribute", xlab = "class variable", ylab = "er")
plot(Cancer$class, Cancer$pgr, main = "Scatterplot between class variable and pgr attribute", xlab = "class variable", ylab = "pgr")
plot(Cancer$class, Cancer$h1, main = "Scatterplot between class variable and h1 attribute", xlab = "class variable", ylab = "h1")
Part 1 d) i.
# replace NAs with 0
Cancer <- read_excel("coursework_data.xlsx")
Cancer$h2[is.na(Cancer$h2)] <- 0
Cancer$o2[is.na(Cancer$o2)] <- 0
Cancer$o5[is.na(Cancer$o5)] <- 0
write.csv(Cancer, file = "Cancer_with_0.csv", row.names = FALSE)
# replace NAs with mean
Cancer <- read_excel("coursework_data.xlsx")
Cancer$h2[is.na(Cancer$h2)] <- round(mean(Cancer$h2, na.rm = TRUE), 0)
Cancer$o2[is.na(Cancer$o2)] <- round(mean(Cancer$o2, na.rm = TRUE), 0)
Cancer$o5[is.na(Cancer$o5)] <- round(mean(Cancer$o5, na.rm = TRUE), 0)
write.csv(Cancer, file = "Cancer_with_mean.csv", row.names = FALSE)
# replace NAs with median
Cancer <- read_excel("coursework_data.xlsx")
Cancer$h2[is.na(Cancer$h2)] <- median(Cancer$h2, na.rm = TRUE)
Cancer$o2[is.na(Cancer$o2)] <- median(Cancer$o2, na.rm = TRUE)
Cancer$o5[is.na(Cancer$o5)] <- median(Cancer$o5, na.rm = TRUE)
write.csv(Cancer, file = "Cancer_with_median.csv", row.names = FALSE)
Part 1 f)
write.csv(Canc_mean, file = "sd_with_mean.csv", row.names = FALSE)
write.csv(Canc_median, file = "sd_with_median.csv", row.names = FALSE)
# normalization of datasets where NAs are replaced with 0, mean and median / without removing NAs
Canc_zero <- read.csv("Cancer_with_0.csv")
Canc_mean <- read.csv("Cancer_with_mean.csv")
Canc_median <- read.csv("Cancer_with_median.csv")
for (i in 1:22)
{
  Canc_zero[, i] <- (Canc_zero[, i] - min(Canc_zero[, i])) / (max(Canc_zero[, i]) - min(Canc_zero[, i]))
  Canc_mean[, i] <- (Canc_mean[, i] - min(Canc_mean[, i])) / (max(Canc_mean[, i]) - min(Canc_mean[, i]))
  Canc_median[, i] <- (Canc_median[, i] - min(Canc_median[, i])) / (max(Canc_median[, i]) - min(Canc_median[, i]))
}
Canc_zero <- Canc_zero[, !(names(Canc_zero) %in% c("o5"))]
Canc_mean <- Canc_mean[, !(names(Canc_mean) %in% c("o5"))]
Canc_median <- Canc_median[, !(names(Canc_median) %in% c("o5"))]
write.csv(Canc_zero, file = "norm_with_0.csv", row.names = FALSE)
write.csv(Canc_mean, file = "norm_with_mean.csv", row.names = FALSE)
write.csv(Canc_median, file = "norm_with_median.csv", row.names = FALSE)
Part 1 g) i.
# dealing with missing values by deleting attributes/instances
Cancer <- read_excel("coursework_data.xlsx")
Cancer <- Cancer[, !(names(Cancer) %in% c("o2", "o5"))]
Cancer <- na.omit(Cancer)
write.csv(Cancer, file = "Cancer_deleted.csv", row.names = FALSE)
Part 1 g) ii.
# correlation between attributes
del_dt <- read.csv("Cancer_deleted.csv")
del_dt2 <- read.csv("Cancer_deleted.csv")
del_dt <- del_dt[, !(names(del_dt) %in% c("b4"))]
tmp <- cor(del_dt[, 1:19])
tmp[upper.tri(tmp)] <- 0
diag(tmp) <- 0
# keep only attributes that are not strongly correlated with an earlier one
data.new <- del_dt[, 1:19][, !apply(tmp, 2, function(x) any(x > 0.4))]
final <- cbind(data.new[, 1:4], del_dt2[, 9], data.new[, 5:15], del_dt[, 20:21])
names(final)[5] <- "b4"
write.csv(final, file = "Cancer_uncor.csv", row.names = FALSE)
Part 1 g) iii.
# pca on normalized data
Norm <- read.csv("norm_with_median.csv")
pcs <- princomp(Norm[, 1:21], cor = TRUE)$scores
pc110 <- pcs[, 1:10]
pc110_cl <- cbind(pc110, Norm[, 22:23])
write.csv(pc110_cl, file = "norm_pca_data.csv", row.names = FALSE)
Part 2 a)
# hclust / for normalized median data
my_data <- read.csv("norm_with_median.csv")
clusters <- hclust(dist(my_data[, 1:21]))
clusterCut <- cutree(clusters, 7)
hclust_cc <- sum(diag(table(clusterCut, my_data$clsn)))
hclust_accur <- hclust_cc / sum(table(clusterCut, my_data$clsn))
print(hclust_cc)
print(hclust_accur)
# kmeans 10 times / for normalized median data
my_data <- read.csv("norm_with_median.csv")
lk <- numeric(10)
for (i in 1:10)
{
  result <- kmeans(my_data[, 1:21], 7, iter.max = 100)
  lk[i] <- sum(diag(table(result$cluster, my_data$clsn)))
}
kmeans_cc <- sum(lk) / 10
kmeans_accur <- kmeans_cc / nrow(my_data)
print(kmeans_cc)
print(kmeans_accur)
# PAM (gives the best result) / for normalized median data
# install.packages("cluster")  # pam() is provided by the cluster package
library(cluster)
my_data <- read.csv("norm_with_median.csv")
result <- pam(my_data[, 1:21], 7, FALSE, "euclidean")
pam_cc <- sum(diag(table(result$clustering, my_data$class)))
pam_accur <- pam_cc / sum(table(result$clustering, my_data$class))
print(pam_cc)
print(pam_accur)
Part 2 b)
# b) optimization of hclust (for median)
library(clusterSim)
min_nc <- 2
max_nc <- 15
res <- array(0, c(max_nc - min_nc + 1, 2))
res[, 1] <- min_nc:max_nc
md <- read.csv("norm_with_median.csv")
clusters <- hclust(dist(md[, 1:21]))  # the dendrogram only needs computing once
for (nc in min_nc:max_nc)
{
  clusterCut <- cutree(clusters, nc)
  res[nc - min_nc + 1, 2] <- index.DB(md[, 1:21], clusterCut, centrotypes = "centroids")$DB
}
print(paste("min DB for", (min_nc:max_nc)[which.min(res[, 2])], "clusters =", min(res[, 2])))
write.table(res, file = "DB_res.csv", sep = ";", dec = ",", row.names = TRUE, col.names = FALSE)
# for dataset where missing values were replaced by 0.
library(cluster)  # pam() is provided by the cluster package
Cancer_zero <- read.csv("Cancer_with_0.csv")
pam_zero <- pam(Cancer_zero[, 1:22], 7, FALSE, "euclidean")
table(Cancer_zero[, 24], pam_zero$clustering)
accur_zero <- sum(diag(table(Cancer_zero[, 24], pam_zero$clustering))) / sum(table(Cancer_zero[, 24], pam_zero$clustering))
print(accur_zero)
# for dataset where missing values were replaced by mean.
library(cluster)
Cancer_mean <- read.csv("Cancer_with_mean.csv")
pam_mean <- pam(Cancer_mean[, 1:22], 7, FALSE, "euclidean")
table(Cancer_mean[, 24], pam_mean$clustering)
accur_mean <- sum(diag(table(Cancer_mean[, 24], pam_mean$clustering))) / sum(table(Cancer_mean[, 24], pam_mean$clustering))
print(accur_mean)
# for dataset where missing values were replaced by median.
library(cluster)
Cancer_median <- read.csv("Cancer_with_median.csv")
pam_median <- pam(Cancer_median[, 1:22], 7, FALSE, "euclidean")
table(Cancer_median[, 24], pam_median$clustering)
accur_median <- sum(diag(table(Cancer_median[, 24], pam_median$clustering))) / sum(table(Cancer_median[, 24], pam_median$clustering))
print(accur_median)
Part 3 b)
# Precision
library(readxl)
ParamK <- read_excel("Tuning_classif.xlsx")
plot(ParamK$kParameter, ParamK$Precision, main = "Plot of the Precision rate according to the K parameter", xlab = "K parameter", ylab = "Precision rate")
subset(ParamK, ParamK$Precision == max(ParamK$Precision), select = 1)
# Number of correctly classified instances
ParamK <- read_excel("Tuning_classif.xlsx")
plot(ParamK$kParameter, ParamK$CorClInst, main = "Plot of correctly classified instances according to K parameters", xlab = "K parameter", ylab = "Number of correctly classified instances")
subset(ParamK, ParamK$CorClInst == max(ParamK$CorClInst), select = 1)