
Genetic classification of breast cancer tumours

Aigerim Tulegenova
Part 1 a) i.
Table 1
Summary statistics (Min, 1st quartile, Median, Mean, 3rd quartile, Max, count of NAs, standard deviation, variance, IQR and range) for the numerical attributes er, pgr, ar, l1, l2, b1, b2, b3, b4, h1, h2, h3, p1, p2 and o1-o8.
Table 1 lists the breast cancer tumour variables together with their measures of centrality and dispersion and the number of missing values each of them has. (Please refer to the Appendix, Part 1 a) i., for the code.) The table shows that there are 22 numerical attributes, three of which (h2, o2 and o5) have a considerable number of missing values. These attributes therefore need to be pre-processed before any information is extracted from the dataset. It can also be observed that, although most variables have a range of 300 or slightly less, attributes such as b3, b4 and p2 have ranges of -1 to 1, 0 to 10 and 0 to 180 respectively. Since the variables are not all on the same scale, they need to be transformed to be useful for building a good (unbiased) model.
ii. Histograms illustrate how the data are distributed. (Please refer to the Appendix, Part 1 a) ii., for the code and all of the histograms.) The data were analysed by looking at the shape of each histogram and by comparing measures of centrality and dispersion.

Figure 1

Figure 1 shows a shape that is close to symmetric: the bars are roughly the same in height and location on either side of the centre of the histogram, and the mean and median have approximately equal values. The interquartile range of this attribute is 110, which shows that the data are fairly consistent (most values lie close to one another). The median is almost equally far from the 1st quartile (90) and the 3rd quartile (200), which also indicates that the data are spread evenly. The standard deviation is relatively small (84), again suggesting that the values are fairly tightly bunched together. Taken together, this indicates that the data are not widely spread and are distributed roughly evenly around the centre.
The attributes b3, b4 and p2 follow the same pattern.

Figure 2

Figure 2 shows that the data are right-skewed. Although the range of this feature is 0 to 300, its interquartile range is 0, indicating that the data are not widely spread. Moreover, the mean (16) is considerably greater than the median (0). This means that o6 has a few extreme values that pull the mean upward but do not affect where the exact middle of the data lies.
The attributes h1, h2, b1, b2, p1, o1, o2, o4, o7 and o8 follow the same pattern.
Figure 3

Figure 3 shows that the range is 300, while the interquartile range of this attribute is 195. This indicates that the data are widely spread out; in other words, there are large distances between individual values. Moreover, the median is closer to the 1st quartile than to the 3rd quartile, which means that the lower half of the values lie close to one another, while the upper half are more scattered and distant from each other.
The attribute er follows the same pattern.
Figure 4

Figure 4 shows that the data are left-skewed, with the median greater than the mean. The range of this attribute is 300 and the interquartile range is 170, which is somewhat more than half of the range and indicates that the data are fairly spread out. This feature has a high proportion of large values, so the median (200) is closer to the 3rd quartile (270) than to the 1st quartile (100).
The attribute l2 follows the same pattern.
Figure 5

Figure 5 differs from the other histograms in terms of skewness: although the histogram looks right-skewed, the mean is not greater than the median. This is because the histogram is multimodal (it has more than one mode) and one of its tails is long while the other is heavy.
(The attribute o3 follows the same pattern.)
b) i. The correlation between the er and pgr attributes is 0.4, which is deemed weak, while the correlation between b1 and b2 is 0.6, which is fairly strong. In both cases the correlation is positive. The correlation coefficient of p1 and p2, however, is -0.007 (negative); this is virtually 0, which indicates that there is essentially no correlation. (Please refer to the Appendix, Part 1 b) i., for the code.)

ii. (Please refer to the Appendix, Part 1 b) ii., for the code.)


Figure 6

Figure 6 suggests that if the value of the er attribute falls between 25 and 275, it is likely to belong to the 3rd class. If the value is higher than 275 it may belong to the 5th, 6th or 7th class, because the threshold of these classes lies between 25 and 300, while values below 25 are more likely to belong to the 1st, 2nd or 4th class. It can also be noticed that very few values of the er attribute fall into the 1st, 2nd and 4th classes.
Figure 7

Figure 7 shows that a value anywhere in the range of this variable might belong to the 3rd class. However, values above 25 are likely to belong to the 5th and 7th classes, while the 6th class accepts values that fall below 25. Furthermore, the density of these latter three classes is higher than that of all the other classes.
Figure 8

The scatterplot above shows that the values of the h3 attribute might belong to any class, and the densities of all the classes are roughly similar.
c) Considering the descriptive statistics, visualisations and correlations produced above, one attribute from each correlated pair (b1 or b2, and er or pgr) is not important, since correlated attributes may actually be measuring the same underlying feature, and keeping both in the modelling data can skew the logic of the algorithm and affect the accuracy of the model. Furthermore, o5 can be considered an insignificant attribute because it has too many missing values (more than half of the data). The attribute h3 is also likely to be insignificant because it does not assist in identifying the ground truth of an instance: its values might belong to any class, whereas an attribute such as pgr influences the outcome for the 5th, 6th and 7th classes.
d) i. (Please refer to the Appendix, Part 1 d) i., for the code.)
ii. Filling the missing values of the skewed attributes (h2 and o2) with mean imputation skews the data more than zero or median imputation does. A better alternative is to replace missing values with the median, because this keeps the data closer to symmetric and helps avoid bias, since the median is not affected by extreme values. Replacing missing values with 0 reduces the skewness of the data the most of all three imputation methods. Moreover, the mode of all three attributes that contain missing values is 0 (it is the second mode of the o2 variable), so this technique can also give a smaller error.
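As a quick check of these claims, the skewness of h2 under each imputation can be compared directly; the sketch below is illustrative only and assumes the e1071 package is available.

library(readxl)
library(e1071)   # assumed available; provides skewness()
Cancer <- read_excel("coursework_data.xlsx")
h2 <- Cancer$h2
imputed <- list(
  zero   = ifelse(is.na(h2), 0, h2),
  mean   = ifelse(is.na(h2), mean(h2, na.rm = TRUE), h2),
  median = ifelse(is.na(h2), median(h2, na.rm = TRUE), h2)
)
sapply(imputed, skewness)   # lower absolute skewness = closer to symmetric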
f) Mean centring, standardization and normalization were applied to each of the three datasets in which missing values had been replaced by 0, the mean and the median respectively. (Please refer to the Appendix, Part 1 f), for the code.)
After mean centring, the mean of each variable became 0; this transformation would be most useful if all of the variables were on the same scale. Standardization produced a mean of 0 and a standard deviation of 1, while normalization rescaled all values to lie between 0 and 1; it changed the values of the variables but kept the same proportions between them.
After standardization and normalization, NaNs were produced in the o5 variable, which means that the standard deviation and variance of this attribute are 0 (its values are constant). Since the main reason for standardizing/normalizing the data is modelling (particularly clustering and classification), the feature o5 was deleted from the dataset, because in many cases the directions with the least variance are the least relevant to the clusters.
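The NaNs can be reproduced on a toy example: a constant column has a standard deviation of 0, so standardization divides by zero. The lines below are a small illustration, not part of the coursework code.

x <- c(5, 5, 5, 5)
scale(x)                                    # every entry becomes NaN (0/0)
df <- data.frame(a = c(1, 2, 3), o5 = c(0, 0, 0))
df[, sapply(df, sd) > 0, drop = FALSE]      # keep only columns with non-zero variance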
g)
i. The attribute h2 has 236 missing values. Since this is only around 20% of the data, we can drop those instances without a substantial loss of statistical power. Moreover, its values are missing at random, so we are not inadvertently removing a particular group of instances. In contrast, the attributes o2 and o5 have considerably more missing values, and those values are not missing at random. If we removed their instances we would lose a large amount of data that might have predictive power. As an illustration, most of the missing values of o2 are in the L-B class; if we deleted whole rows because of NAs in o2, we would lose too much information about the L-B class. It is therefore better to delete these attributes. (Please refer to the Appendix, Part 1 g) i., for the code.)
ii. After the deletion of attributes and instances, the remaining attributes were checked for correlation. Although a correlation of 0.4 is deemed weak, it is still a sign that correlation exists. In our case the attributes er, ar, l1 and b1 were deleted because they are correlated with other variables. As noted in section c), correlated attributes may be measuring the same underlying feature, and keeping them would only add noise to the data and negatively affect model building. (Please refer to the Appendix, Part 1 g) ii., for the code.)
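As an alternative to the hand-written filter in the Appendix, the same idea can be expressed with caret's findCorrelation; this is only a sketch, assuming the caret package and the Cancer_deleted.csv file produced in g) i.

library(caret)   # assumed available; provides findCorrelation()
del_dt  <- read.csv("Cancer_deleted.csv")
num_dt  <- del_dt[, !(names(del_dt) %in% c("b4"))][, 1:19]   # numeric attributes, b4 set aside as in the Appendix
to_drop <- findCorrelation(cor(num_dt), cutoff = 0.4)        # one attribute from each highly correlated pair
names(num_dt)[to_drop]                                       # attributes suggested for removal
reduced <- if (length(to_drop) > 0) num_dt[, -to_drop] else num_dt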
iii. For PCA, the normalized dataset with missing values replaced by the median was chosen, because if the raw data were used, principal component analysis would tend to give more emphasis to variables with high variances than to variables with very low variances. For example, the variable er has a range of 0-300, while the range of b4 is 0-10, so PCA would automatically give more weight to the former. For this reason a transformation is required. Ben-Hur and Guyon (2003) concluded that PCA improves the extraction of cluster structure when the data are normalized rather than standardized. (Please refer to the Appendix, Part 1 g) iii., for the code.) Moreover, standardization is usually required when variables have different units of measurement; otherwise they can be considered already comparable. Since the variables of our dataset all relate to breast cancer, we assume that they are measured in the same unit.
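The effect of unequal variances on PCA can be checked directly by comparing the variance explained on raw and normalized input; the sketch below assumes the column layouts produced by the Appendix code (22 numeric attributes in the raw file, 21 in the normalized one).

Raw  <- read.csv("Cancer_with_median.csv")[, 1:22]
Norm <- read.csv("norm_with_median.csv")[, 1:21]
summary(prcomp(Raw))$importance[2, 1:3]    # PC1 dominated by the wide-range attributes
summary(prcomp(Norm))$importance[2, 1:3]   # variance shared more evenly after normalization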
Figure 9

The biplot shows that the attributes pgr, er, ar, l1 and l2 (negative) and p1 (positive) carry significant weight on PC1, while the features b4, o7 and o8 (negative) are most useful for PC2. o3 (negative for both PCs) and h1 (positive for PC1 and negative for PC2) are roughly equally important for both components.
Part 2 - Clustering
a) The data need to be cleaned and transformed before building a model, so the normalized dataset in which missing values were filled by median imputation was used for clustering.
Three clustering algorithms, hclust, k-means and PAM, were applied to the dataset. (Please refer to the Appendix, Part 2 a), for the code.) Since the result of k-means depends on the initial guess of the cluster means, it was run 10 times and the average result is reported, to make sure that the result obtained was not just a matter of luck.
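An equivalent, more idiomatic way to guard against an unlucky initialisation is the nstart argument of kmeans(), which restarts the algorithm from several random initialisations and keeps the best solution; a minimal sketch (the name of the class column is assumed):

# kmeans keeps the run with the lowest total within-cluster sum of squares
md  <- read.csv("norm_with_median.csv")
fit <- kmeans(md[, 1:21], centers = 7, nstart = 10, iter.max = 100)
table(fit$cluster, md$class)   # cross-tabulate clusters against the known classes (column name assumed)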
Table 2
Result of the hclust, k-means and PAM clustering algorithms
Total number of records in dataset = 1075

Clustering algorithm       Correctly classified   Average accuracy
Hierarchical clustering    86                     0.08
k-means                    187                    0.17
PAM                        346                    0.32

The hierarchical clustering algorithm gave the poorest accuracy of the three. k-means produced a better result, while the best result was produced by the PAM algorithm. This can be explained by the fact that the number of clusters k (7 in our case) is fairly large for this dataset: as the number of clusters grows, hierarchical clustering tends to become less accurate than k-means and PAM, which still produce good results with a larger number of clusters. The superiority of PAM over k-means is likely due to the fact that PAM uses medoids rather than means, which makes it more robust than k-means, whose means can be influenced by extreme values. As seen earlier, our data contain a number of large values that might affect the mean.
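A small numeric illustration of why the medoid-based approach is less sensitive to extreme values:

# a single outlier pulls the mean far more than the median
x <- c(10, 12, 11, 13, 300)
mean(x)     # 69.2, dragged towards the outlier
median(x)   # 12, essentially unchanged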
b) In this section the dataset was clustered with different numbers of clusters (K = 2 to 15) using the three clustering algorithms above. To measure quality, the Davies-Bouldin index was used, which, according to the survey conducted by Rendon et al. (2011), was one of the indexes that was least often wrong. The lower the index value, the better the result. (Please refer to the Appendix, Part 2 b), for the code.)
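For reference, the Davies-Bouldin index averages, over the k clusters, the worst-case ratio of within-cluster scatter to between-centroid separation (standard definition):

$$\mathrm{DB} = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$$

where $\sigma_i$ is the average distance of the points in cluster $i$ to their centroid $c_i$ and $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$.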
Figure 10

Figure 11

Figure 12

The scatterplots above show that the k-means (run 10 times) and PAM algorithms suggested 2 clusters for the dataset, which is very small compared with the original number of classes, while hierarchical clustering suggested 9 clusters, which is closer to 7. This better performance of hierarchical clustering might be related to the fact that it works well when the dataset is relatively small, whereas k-means and PAM tend to achieve better accuracy on much larger datasets (for example, 64,000 or more instances). As a result, there may not have been enough data for the k-means and PAM techniques.
c) PAM was chosen for clustering 5 different datasets. (Please refer to the Appendix, Part 2 c), for the code.)
Table 3
Result of the PAM clustering algorithm on different datasets

Dataset                                  Average accuracy
Dataset with the first 10 PCs            0.361
Dataset after deletion of inst./attr.    0.282
Dataset (NAs replaced with 0)            0.313
Dataset (NAs replaced with mean)         0.277
Dataset (NAs replaced with median)       0.281

Table 3 shows that the dataset with the first 10 principal components produced the best result. Noisy data can hide cluster structure, so clustering can benefit from a pre-processing step of feature selection or filtering. Since PCA discards directions with low variance, it acts as a filter and yields a distance metric that gives more robust clustering. The next most accurate result came from the dataset in which missing values were replaced by 0. This may be related to the fact that these data are less skewed than the datasets in which missing values were filled with the mean or the median, so lower skewness may positively affect clustering. Surprisingly, the dataset after deletion of instances and attributes produced a less accurate result than zero imputation; this may be because the dataset is smaller (fewer instances and attributes), suggesting that clustering does not perform as well on small data. The dataset with mean imputation clearly shows the poorest result, as it skews the data the most, while the dataset with missing values replaced by the median performs better because the median represents the data more faithfully.
Part 3 - Classification
a) The same dataset that was used for the clustering algorithms was used for classification.
Table 4
Result of the ZeroR, OneR, NaiveBayes, IBk and J48 classification algorithms
Total number of records in dataset = 1075

Classification algorithm   Precision   Recall   F-measure   ROC Area
ZeroR                      0.076       0.275    0.119       0.498
OneR                       0.34        0.364    0.31        0.602
NaiveBayes                 0.751       0.713    0.717       0.937
IBk                        0.631       0.633    0.629       0.775
J48                        0.868       0.865    0.864       0.941

Table 4 shows that ZeroR produced the worst result of all the classifiers. This is simply because the algorithm always predicts the majority class. OneR performed slightly better, but its accuracy is still very poor: its working principle is simply to take the single attribute with the least error and predict the class of the test instances from its value, so it cannot be considered a good classifier because it ignores all the remaining variables. Neither of these two algorithms reaches even 50% accuracy, so they can be considered useless for this dataset.
The result of the IBk algorithm is worse than those of the decision tree and Bayesian classifiers. This nearest-neighbour classifier assigns equal weight to each attribute, which usually causes confusion when there are many irrelevant attributes in the data. Our data may therefore contain some irrelevant variables that worsened the result of the IBk classifier.
The best result was produced by the J48 algorithm, which, unlike IBk, is robust to noise and able to deal with redundant attributes. The NaiveBayes algorithm demonstrated lower accuracy than J48. This classifier usually outperforms J48 as the number of classes increases (Al-Nabi and Ahmed, 2013), so for our dataset, which has only 7 classes, J48 may be the most appropriate.
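The Weka classifiers in Table 4 can also be driven from R; the sketch below is illustrative only, assumes the RWeka package is installed, and takes the class column of the normalized file to be named class.

library(RWeka)   # assumed available; provides OneR(), IBk(), J48() and Weka evaluation helpers
md  <- read.csv("norm_with_median.csv")
dat <- data.frame(md[, 1:21], class = as.factor(md$class))   # class column name assumed
models <- list(
  OneR = OneR(class ~ ., data = dat),
  NB   = make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")(class ~ ., data = dat),
  IBk  = IBk(class ~ ., data = dat),
  J48  = J48(class ~ ., data = dat)
)
lapply(models, evaluate_Weka_classifier, numFolds = 10)   # 10-fold cross-validation for each model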
b) The IBk classification algorithm was chosen to explore various parameter settings. A common rule of thumb is to set K to the square root of the number of instances, but we used a wider range (from 1 to 50) to be sure we covered a large enough span of values. (Please refer to the Appendix, Part 3 b), for the code.)
With IBk, the class of a query instance is decided by the majority class among its K nearest neighbours. In the presence of class imbalance, a query instance is often classified as belonging to the majority class, so many minority-class instances are misclassified. We therefore need to find the value of the K parameter that produces the best accuracy.

Figure 13

Figure 14

By repeatedly changing the K parameter, we found that this classifier works best for our dataset when K equals 20. A larger optimal K generally indicates more noise in the dataset: 20 is quite a high value, which suggests that our data contain a considerable amount of noise, and setting the K parameter of IBk to 20 suppresses that noise and improves accuracy.
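A sketch of the tuning loop itself, under the same assumptions as above (RWeka installed, class column named class):

library(RWeka)
md  <- read.csv("norm_with_median.csv")
dat <- data.frame(md[, 1:21], class = as.factor(md$class))   # class column name assumed
accuracy <- sapply(1:50, function(k) {
  model <- IBk(class ~ ., data = dat, control = Weka_control(K = k))   # k nearest neighbours
  evaluate_Weka_classifier(model, numFolds = 10)$details["pctCorrect"]
})
which.max(accuracy)                          # the K giving the best cross-validated accuracy
plot(1:50, accuracy, type = "b", xlab = "K parameter", ylab = "Cross-validated accuracy (%)")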
c)
Table 5
Result of the J48 classification algorithm for different datasets

Dataset                         Precision   Recall   F-measure   ROC Area
With the first 10 PCs           0.513       0.522    0.516       0.743
After deletion of inst./attr.   0.914       0.914    0.914       0.964
NAs replaced with 0             0.859       0.856    0.855       0.938
NAs replaced with mean          0.857       0.856    0.855       0.939
NAs replaced with median        0.868       0.865    0.864       0.941

The worst classification result was obtained from the dataset with the first 10 PCs. PCA-based feature transformations summarize the information from a large number of features into a limited number of components, i.e. linear combinations of the original features; however, the principal components are often difficult to interpret and, as a result, did not improve classification performance here. Although the dataset after deletion of instances and attributes contains less data than all the other datasets, it produced the most accurate result. This shows that keeping the original values, instead of imputing 0, the mean or the median, gives better classification accuracy. The next most accurate dataset is the one with median imputation; this is because, when the data are not normally distributed, the median is deemed the fairest representation of centrality. As in clustering, mean imputation gave the poorest result, since it makes little sense to use this approach on a skewed dataset.

References:
1. Adeyemi, T. (2011). The Effective Use of Standard Scores for Research in Educational Management. Research Journal of Mathematics and Statistics, 3(3).
2. Abbas, O. (2008). Comparisons Between Data Clustering Algorithms. The International Arab Journal of Information Technology, 5(3).
3. Ben-Hur, A. and Guyon, I. (2003). Detecting Stable Clusters Using Principal Component Analysis. In Functional Genomics: Methods and Protocols, pp. 159-182.
4. Al-Nabi, D. L. and Ahmed, S. S. (2013). Survey on Classification Algorithms for Data Mining: (Comparison and Evaluation).
5. Bayat, S., Cuggia, M., Rossille, D., Kessler, M. and Frimat, L. (2009). Comparison of Bayesian Network and Decision Tree Methods for Predicting Access to the Renal Transplant Waiting List. MIE, pp. 600-604.
6. Rendon, E., Abundez, I., Arizmendi, A. and Quiroz, E. (2011). Internal versus External Cluster Validation Indexes. International Journal of Computers and Communications, 5(1).

Appendix
Part 1 a) i.
#setwd("/Users/aigerimtulegenova/Desktop/CW DMA")
#install.packages("readxl")
library(readxl)
Cancer<-read_excel("coursework_data.xlsx")
summary(Cancer[,1:22])
Variance=rbind(1:22)
Std.dev=rbind(1:22)
Int.qu=rbind(1:22)

Range=rbind(1:22)
for (i in 1:22)
{
  # use [[i]] so each column is treated as a numeric vector rather than a one-column tibble
  Variance[,i]=var(Cancer[[i]], na.rm=TRUE)
  Std.dev[,i]=sd(Cancer[[i]], na.rm=TRUE)
  Int.qu[,i]=IQR(Cancer[[i]], na.rm=TRUE)
  Range[,i]=diff(range(Cancer[[i]], na.rm=TRUE))
}
Part 1 a) ii.
Histogram<-function()
{
  library(readxl)
  Cancer<-read_excel("coursework_data.xlsx")
  for (i in 1:22)
  {
    # one PDF per attribute; [[i]] extracts the column as a numeric vector
    pdf(paste("histogram of ",names(Cancer)[i],".pdf",sep=""))
    hist(Cancer[[i]], main=paste("Histogram of ",names(Cancer)[i]),
         xlab=paste("Range of ",names(Cancer)[i]), ylab="Frequency", col="pink")
    abline(v=mean(Cancer[[i]],na.rm=TRUE), col="red", lwd=2)
    abline(v=median(Cancer[[i]],na.rm=TRUE), col="green", lwd=2)
    legend("topright", c("Mean","Median"), col=c("red","green"), lwd=8)
    dev.off()
  }
}

Histograms.

Part 1 b) i.
#correlation between er and pgr
cor(Cancer$er,Cancer$pgr)
plot(Cancer$er, Cancer$pgr, main="Scatterplot between er and
pgr", xlab="er", ylab="pgr")
#correlation between of b1 and b2
cor(Cancer$b1,Cancer$b2)
plot(Cancer$b1, Cancer$b2, main="Scatterplot between b1 and
b2", xlab="b1", ylab="b2")
#correlation between p1 and p2
cor(Cancer$p1,Cancer$p2)
plot(Cancer$p1, Cancer$p2, main="Scatterplot between p1 and
p2", xlab="p1", ylab="p2")
Part 1 b) ii.
#recode string into numerical
Cancer$class[Cancer$class=="B-A"]<-1
Cancer$class[Cancer$class=="B-B"]<-2
Cancer$class[Cancer$class=="H-A"]<-3
Cancer$class[Cancer$class=="H-B"]<-4
Cancer$class[Cancer$class=="L-A"]<-5

Cancer$class[Cancer$class=="L-B"]<-6
Cancer$class[Cancer$class=="L-N"]<-7
plot(Cancer$class,Cancer$er, main="Scatterplot between class
variable and er attribute", xlab="class variable",ylab="er")
plot(Cancer$class,Cancer$pgr,main="Scatterplot between class
variable and pgr attribute",xlab="class variable",ylab="pgr")
plot(Cancer$class,Cancer$h1,main="Scatterplot between class
variable and h1 attribute",xlab="class variable",ylab="h1")
Part 1 d) i.
#replace NAs with 0
Cancer<-read_excel("coursework_data.xlsx")
Cancer$h2[is.na(Cancer$h2)]<-0
Cancer$o2[is.na(Cancer$o2)]<-0
Cancer$o5[is.na(Cancer$o5)]<-0
write.csv(Cancer,file="Cancer_with_0.csv",row.names=FALSE)
#replace NAs with mean
Cancer<-read_excel("coursework_data.xlsx")
Cancer$h2[is.na(Cancer$h2)]<-round(mean(Cancer$h2,na.rm=TRUE),0)
Cancer$o2[is.na(Cancer$o2)]<-round(mean(Cancer$o2,na.rm=TRUE),0)
Cancer$o5[is.na(Cancer$o5)]<-round(mean(Cancer$o5,na.rm=TRUE),0)
write.csv(Cancer,file="Cancer_with_mean.csv",row.names=FALSE)
#replace NAs with median
Cancer<-read_excel("coursework_data.xlsx")
Cancer$h2[is.na(Cancer$h2)]<-median(Cancer$h2,na.rm=TRUE)
Cancer$o2[is.na(Cancer$o2)]<-median(Cancer$o2,na.rm=TRUE)
Cancer$o5[is.na(Cancer$o5)]<-median(Cancer$o5,na.rm=TRUE)
write.csv(Cancer,file="Cancer_with_median.csv",row.names=FALSE)
Part 1 f)

# mean centering of datasets where NAs are replaced with 0,


mean and median
Canc_zero<-read.csv("Cancer_with_0.csv")
Canc_mean<-read.csv("Cancer_with_mean.csv")
Canc_median<-read.csv("Cancer_with_median.csv")
for (i in 1:22)
{
Canc_zero[,i]<-scale(Canc_zero[,i],center=TRUE,scale=FALSE)
Canc_mean[,i]<-scale(Canc_mean[,i],center=TRUE,scale=FALSE)
Canc_median[,i]<-scale(Canc_median[,i],center=TRUE,scale=FALSE)
}
write.csv(Canc_zero,file="mc_with_0.csv",row.names=FALSE)
write.csv(Canc_mean,file="mc_with_mean.csv",row.names=FALSE)
write.csv(Canc_median,file="mc_with_median.csv",row.names=FALSE)
#standardization of dataset where NAs are replaced with 0,
mean and median
Canc_zero<-read.csv("Cancer_with_0.csv")
Canc_mean<-read.csv("Cancer_with_mean.csv")
Canc_median<-read.csv("Cancer_with_median.csv")
for (i in 1:22)
{
Canc_zero[,i]<-scale(Canc_zero[,i],center=TRUE,scale=TRUE)
Canc_mean[,i]<-scale(Canc_mean[,i],center=TRUE,scale=TRUE)
Canc_median[,i]<-scale(Canc_median[,i],center=TRUE,scale=TRUE)
}
Canc_zero<-Canc_zero[,!(names(Canc_zero)%in%c("o5"))]
Canc_mean<-Canc_mean[,!(names(Canc_mean)%in%c("o5"))]
Canc_median<-Canc_median[,!(names(Canc_median)%in%c("o5"))]
write.csv(Canc_zero,file="sd_with_0.csv",row.names=FALSE)
write.csv(Canc_mean,file="sd_with_mean.csv",row.names=FALSE)
write.csv(Canc_median,file="sd_with_median.csv",row.names=FALSE)
#normalization of datasets where NAs are replaced with 0,
mean and median/ without removing NA
Canc_zero<-read.csv("Cancer_with_0.csv")
Canc_mean<-read.csv("Cancer_with_mean.csv")
Canc_median<-read.csv("Cancer_with_median.csv")
for (i in 1:22)
{
Canc_zero[,i]<-(Canc_zero[,i]-min(Canc_zero[,i]))/
(max(Canc_zero[,i])-min(Canc_zero[,i]))
Canc_mean[,i]<-(Canc_mean[,i]-min(Canc_mean[,i]))/
(max(Canc_mean[,i])-min(Canc_mean[,i]))
Canc_median[,i]<-(Canc_median[,i]-min(Canc_median[,i]))/
(max(Canc_median[,i])-min(Canc_median[,i]))
}
Canc_zero<-Canc_zero[,!(names(Canc_zero)%in%c("o5"))]
Canc_mean<-Canc_mean[,!(names(Canc_mean)%in%c("o5"))]
Canc_median<-Canc_median[,!(names(Canc_median)%in%c("o5"))]
write.csv(Canc_zero,file="norm_with_0.csv",row.names=FALSE)
write.csv(Canc_mean,file="norm_with_mean.csv",row.names=FALSE)
write.csv(Canc_median,file="norm_with_median.csv",row.names=FALSE)
Part 1 g) i.
#dealing with missing values by deleting attributes/instances
Cancer<-read_excel("coursework_data.xlsx")
Cancer<-Cancer[,!(names(Cancer)%in%c("o2","o5"))]
Cancer=na.omit(Cancer)
write.csv(Cancer,file="Cancer_deleted.csv",row.names=FALSE)
Part 1 g) ii.
#correlation between attributes
del_dt<-read.csv("Cancer_deleted.csv")

del_dt2<-read.csv("Cancer_deleted.csv")
del_dt<-del_dt[,!(names(del_dt)%in%c("b4"))]
tmp<-cor(del_dt[,1:19])
tmp[upper.tri(tmp)]<-0
diag(tmp)<-0
data.new<-del_dt[,1:19][,!apply(tmp,2,function(x) any(x>0.4))]
final<-cbind(data.new[,1:4],del_dt2[,9],data.new[,5:15],del_dt[,20:21])
names(final)[5]<-"b4"
write.csv(final,file="Cancer_uncor.csv",row.names=FALSE)
Part 1 g) iii.
#pca on normalized data
Norm<-read.csv("norm_with_median.csv")
pca<-princomp(Norm[,1:21],cor=TRUE)   # principal components of the normalized data
pcs<-pca$scores
pc110<-pcs[,1:10]                      # keep the first 10 components
pc110_cl<-cbind(pc110,Norm[,22:23])    # re-attach the class columns
write.csv(pc110_cl,file="norm_pca_data.csv",row.names=FALSE)
Part 2 a)
#hclust / for normalized median data
my_data<-read.csv("norm_with_median.csv")
# cluster on all 21 normalized attributes, then cut the tree into 7 groups
clusters<-hclust(dist(my_data[,1:21]))
clusterCut<-cutree(clusters,7)
hclust_cc<-sum(diag(table(clusterCut,my_data$clsn)))
hclust_accur<-sum(diag(table(clusterCut,my_data$clsn)))/sum(table(clusterCut,my_data$clsn))
print(hclust_cc)
print(hclust_accur)
#kmeans 10 times /for normalized median data
my_data<-read.csv("norm_with_median.csv")
lk<-rbind(1:10)
for (i in 1:10)
{

result<-kmeans(my_data[,1:21],7,iter.max = 100)
lk[,i]<-sum(diag(table(result$cluster,my_data$clsn)))
}
kmeans_cc<-sum(lk)/10
kmeans_accur<-(sum(lk)/10)/nrow(my_data)
print(kmeans_cc)
print(kmeans_accur)
#PAM (gives the best result) /for normalized median data
#install.packages("cluster")
library(cluster)   # pam() is provided by the cluster package
my_data<-read.csv("norm_with_median.csv")
result<-pam(my_data[,1:21],7,FALSE,"euclidean")
pam_cc<-sum(diag(table(result$clustering,my_data$class)))
pam_accur<-sum(diag(table(result$clustering,my_data$class)))/sum(table(result$clustering,my_data$class))
print(pam_cc)
print(pam_accur)
Part 2 b)
#b)optimization of hclust (for median)
library(clusterSim)
min_nc=2
max_nc=15
res <- array(0, c(max_nc-min_nc+1, 2))
res[,1] <- min_nc:max_nc
md<-read.csv("norm_with_median.csv")
for (nc in min_nc:max_nc)
{
  # cluster on all 21 attributes, cut the tree into nc groups and score with the DB index
  result<-hclust(dist(md[,1:21]))
  clusterCut<-cutree(result,nc)
  res[nc-min_nc+1,2]<-index.DB(md[,1:21],clusterCut,centrotypes="centroids")$DB
}
print(paste("min DB for",(min_nc:max_nc)
[which.min(res[,2])],"clusters=",min(res[,2])))
write.table(res,file="DB_res.csv",sep=";",dec=",",row.names=TRUE,col.names=FALSE)

plot(res,type="p",pch=0,main="DB index values for clusters


between 2 and 15 for hclust algorithm",xlab="Number of
clusters",ylab="DB index values",xaxt="n")
axis(1,c(min_nc:max_nc))
#abline(v=(min_nc:max_nc)[which.min(res[,2])],col="red")
#b)optimization of kmeans 10 times (for median)
library(clusterSim)
min_nc=2
max_nc=15
l<-rbind(1:10)
res <- array(0, c(max_nc-min_nc+1, 2))
res[,1] <- min_nc:max_nc
md<-read.csv("norm_with_median.csv")
for (nc in min_nc:max_nc)
{
for (i in 1:10)
{
result<-kmeans(md[,1:21],nc,iter.max=100)
l[,i]<-index.DB(md[,1:21],result$cluster,centrotypes="centroids")$DB
}
res[nc-min_nc+1,2]<-sum(l)/10
}
print(paste("min DB for",(min_nc:max_nc)
[which.min(res[,2])],"clusters=",min(res[,2])))
write.table(res,file="DB_res.csv",sep=";",dec=",",row.names=TRUE,col.names=FALSE)
plot(res,type="p",pch=0,main="DB index values for clusters
between 2 and 15 for kmeans algorithm",xlab="Number of
clusters",ylab="DB index values",xaxt="n")
axis(1,c(min_nc:max_nc))

#b)optimization of pam (for median)


library(clusterSim)
min_nc=2
max_nc=15
res <- array(0, c(max_nc-min_nc+1, 2))
res[,1] <- min_nc:max_nc
md<-read.csv("norm_with_median.csv")

for (nc in min_nc:max_nc)


{
result<-pam(md[,1:21],nc,FALSE,"euclidean")
res[nc-min_nc+1,2]<-index.DB(md[,1:21],result$clustering,centrotypes="centroids")$DB
}
print(paste("min DB for",(min_nc:max_nc)
[which.min(res[,2])],"clusters=",min(res[,2])))
print("clustering for min DB")
print(clusters[which.min(res[,2]),])
write.table(res,file="DB_res.csv",sep=";",dec=",",row.names=TRUE,col.names=FALSE)
plot(res,type="p",pch=0,main="DB index values for clusters
between 2 and 15 for PAM algorithm",xlab="Number of
clusters",ylab="DB index values",xaxt="n")
axis(1,c(min_nc:max_nc))
Part 2 c)
# for pca
library(cluster)   # pam() is provided by the cluster package
Cancer_pca<-read.csv("norm_pca_data.csv")
pam_pca<-pam(Cancer_pca[,1:10],7,FALSE,"euclidean")
table(Cancer_pca[,12],pam_pca$clustering)
accur_pca<-sum(diag(table(Cancer_pca[,12],pam_pca$clustering)))/sum(table(Cancer_pca[,12],pam_pca$clustering))
print(accur_pca)
# for deleted dataset.
library(cluster)
Cancer_del<-read.csv("Cancer_deleted.csv")
pam_del<-pam(Cancer_del[,1:20],7,FALSE,"euclidean")
table(Cancer_del[,22],pam_del$clustering)
accur_del<-sum(diag(table(Cancer_del[,22],pam_del$clustering)))/sum(table(Cancer_del[,22],pam_del$clustering))
print(accur_del)
#for dataset where missing values were replaced by 0.

library(cluster)
Cancer_zero<-read.csv("Cancer_with_0.csv")
pam_zero<-pam(Cancer_zero[,1:22],7,FALSE,"euclidean")
table(Cancer_zero[,24],pam_zero$clustering)
accur_zero<-sum(diag(table(Cancer_zero[,24],pam_zero$clustering)))/sum(table(Cancer_zero[,24],pam_zero$clustering))
print(accur_zero)
#for dataset where missing values were replaced by mean.
library(cluster)
Cancer_mean<-read.csv("Cancer_with_mean.csv")
pam_mean<-pam(Cancer_mean[,1:22],7,FALSE,"euclidean")
table(Cancer_mean[,24],pam_mean$clustering)
accur_mean<-sum(diag(table(Cancer_mean[,24],pam_mean$clustering)))/sum(table(Cancer_mean[,24],pam_mean$clustering))
print(accur_mean)
#for dataset where missing values were replaced by median.
library(cluster)
Cancer_median<-read.csv("Cancer_with_median.csv")
pam_median<-pam(Cancer_median[,1:22],7,FALSE,"euclidean")
table(Cancer_median[,24],pam_median$clustering)
accur_median<-sum(diag(table(Cancer_median[,24],pam_median$clustering)))/sum(table(Cancer_median[,24],pam_median$clustering))
print(accur_median)
Part 3 b)
#Precision
ParamK<-read_excel("Tuning_classif.xlsx")
plot(ParamK$kParameter,ParamK$Precision, main="Plot of the
Precision rate according to the K parameter",xlab="K
parameter", ylab="Precision rate")
subset(ParamK,
ParamK$Precision==max(ParamK$Precision),select=1)
# Number of correctly classified instances
library(readxl)

ParamK<-read_excel("Tuning_classif.xlsx")
plot(ParamK$kParameter,ParamK$CorClInst, main="Plot of
correctly classified instances according to K
parameters",xlab="K parameter", ylab="Number of correctly
classified instances")
subset(ParamK,
ParamK$CorClInst==max(ParamK$CorClInst),select=1)
