
Applied Manufacturing Research

Practical 1 – Data Exploration & Clustering

- sample solutions -

Laura Maruster and Stefano Fazi

Week 18 2019

These practice exercises are meant to let you apply the methods and techniques discussed in
the lectures to various data. In this assignment you use the Iris dataset in KNIME. The Iris
dataset is used to let you practice different data analysis techniques and to get familiar
with the KNIME software. The dataset consists of a sample of 150 instances and refers to the
characteristics of different species of the Iris flower. It contains five variables:
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The variable Species
refers to the class to which each record belongs.

KNIME can be used on the RUG computers as follows: click the Windows icon at the
bottom left of the desktop, and in the “search programs and files” field (marked with a
magnifying-glass icon) type “knime”. Note that KNIME is not accessible under the UWP
(University Work Place) from home (with Citrix Receiver). Alternatively, KNIME can be
downloaded and installed on your own computer from
https://www.knime.org/downloads/overview. As KNIME is open source, there are plenty of
materials available (tutorials, exercises, manuals, etc.) to practice with.

Contents
1. Practical 1 – Data exploration
2. Practical 1 – Clustering
3. Practical 1 – Summarize results

1. Practical 1 – Data exploration
Sample solutions: to follow the sample solutions below, consult the KNIME workflow
available on Nestor under “Assignments -> Assignment 1”.

The objective of this first part of the practice exercises is to give you a first handle on
the data, and to let you assign meaning to the data.

1.1 Importing the dataset


To import the Iris dataset, please watch the “KNIME Workbench” and “KNIME Intro”
tutorials at https://tech.knime.org/screencasts. These tutorials will provide you with a lot of
basic knowledge of KNIME, and allow you to get off to a good start with using the software
and performing these practice exercises.

1.2 Find out more about the data


Let us consider all variables from the Iris dataset and let us explore them.

Question 1 – variable type


Which variables can be considered categorical (nominal, binary and ordinal) and which are
continuous (integer, interval-scaled and ratio-scaled)?

Potential answer:
- We relabel the variables Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and
Species as SL, SW, PL, PW and Species, respectively
- 4 continuous (ratio) variables (SL, SW, PL, PW) and one categorical (nominal)
variable (Species)

Question 2 – variable distribution


How are the variables distributed? Explore the variables graphically (with histograms,
boxplots, etc.), but also look at their statistical properties (such as descriptives).

Potential answer:
- looking at the histograms: the continuous variables show very diverse distributions
- box-plot: for example SL along the levels of Species: we note that the Sepal.Length of
species Iris-setosa is smallest, with a mean of about 5, while that of Iris-virginica is
highest, with a mean of about 6.5
- pie-chart: the proportion of each Species type is equal, 33.33%
- the descriptives
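Outside KNIME, the same descriptives can be sketched in Python. This is an illustration only (the practical itself uses KNIME nodes); it assumes pandas and scikit-learn are available, and reuses the SL/SW/PL/PW abbreviations introduced above:

```python
# Sketch (assumption: pandas and scikit-learn installed); reproduces the
# descriptives and per-species means discussed above.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "SL", "sepal width (cm)": "SW",
    "petal length (cm)": "PL", "petal width (cm)": "PW"})
df["Species"] = iris.target_names[iris.target]

print(df[["SL", "SW", "PL", "PW"]].describe())      # overall descriptives
print(df.groupby("Species")["SL"].mean())           # e.g. setosa ~5.0, virginica ~6.6
print(df["Species"].value_counts(normalize=True))   # each species is one third
```

The KNIME “Statistics” node reports essentially the same summary figures.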

Question 3 – relations among variables


Let us now explore the relationships existing between two or more variables. Use the
scatterplots to investigate the relationships between two variables. Identify two variables
which might be correlated and examine their correlation. Do the scatterplots provide any
indication of low/high correlation between the selected variables?

The correlation between variables can be more reliably assessed by calculating the correlation
coefficients between variables rather than by just examining the scatterplots. Add the “Linear

Correlation” node and examine the correlation matrix it produces. Does this matrix provide
any indication of low/high correlation between the selected variables?

In case you found high correlations: does it make sense that the two variables that are found to
be highly correlated are indeed highly correlated?

Potential answer:
- the 2D/3D scatterplot: by 3D-plotting, for instance, SL, SW and PL, and rotating the
view, we can already see that some groupings arise
- correlations: some variables show high correlations, for example between the petal
measurements PW-PL, but also between sepal and petal measurements; between the
two sepal measurements the correlation is much smaller
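As a sketch of the same check outside KNIME (again assuming pandas/scikit-learn), the matrix of Pearson correlation coefficients that the “Linear Correlation” node reports can be computed as:

```python
# Sketch: Pearson correlation matrix of the four flower measurements.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X = iris.data.rename(columns={
    "sepal length (cm)": "SL", "sepal width (cm)": "SW",
    "petal length (cm)": "PL", "petal width (cm)": "PW"})
corr = X.corr()          # pandas computes Pearson correlation by default
print(corr.round(2))     # PW-PL is high (~0.96); SL-SW is weak (~-0.12)
```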

1.3 Outliers

Before you can carry out data analysis techniques on a dataset, it is important to know
whether there are outliers in the dataset. Why is this important?
Beforehand it might be worthwhile to check whether you think there are any outliers at all.
You already inspected the variables using different graphical methods (e.g. scatterplots,
boxplots), which can already give you an impression about the existence of outliers.

When dealing with categorised data (i.e., the Species in the Iris dataset) it might be
worthwhile checking for outliers within each category instead of the dataset as a whole; this
will help find data that was misclassified upon collection.

Analyse the data to identify potential outliers, by using data exploration techniques, such as
scatter plots, pie charts, box plots, histograms.

Question 4 – outliers
a) Based on the scatterplots for combinations of different variables, do you identify any
potential outliers?
Potential answer:
- There are some suspicious points that may be outliers

b) Based on the histograms, do you identify any potential outliers? What would you
conclude about the extent to which the variables are normally distributed?
Potential answer:
- The variables are not normally distributed
c) Based on the boxplots, do you identify any outliers?
Potential answer:
- Boxplot: SL has an outlier at 4.9, SW one at 2.3, etc.

d) If you have identified any outliers (in scatterplots, histograms, and/or the boxplots): to
which Species do these outliers belong? What are the Row IDs of the outliers?
Potential answer:
- Box plot of SW by Species: the point at 2.3 belongs to species Iris-setosa
e) If you have identified outliers, what do you do with them (e.g. ignore, delete)?
Potential answer:
- There is no simple answer; outliers can distort the analyses, for instance
regressions; however, they can also point to a specific class of points, which can have
an interesting meaning. It is therefore wise, before deleting or ignoring
outliers, to have a good understanding of the meaning of these extraordinary
points; we will not delete the outliers
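The boxplot outliers discussed above follow the usual 1.5 x IQR whisker rule. As an illustration outside KNIME (assuming NumPy/scikit-learn), the per-species check can be sketched as:

```python
# Sketch: flag outliers in Sepal.Width within Iris-setosa using the
# 1.5*IQR boxplot rule, applied per category as suggested above.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sw = iris.data[:, 1]                 # column 1 = sepal width
setosa_sw = sw[iris.target == 0]     # target 0 = Iris-setosa

q1, q3 = np.percentile(setosa_sw, [25, 75])
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outliers = setosa_sw[(setosa_sw < fences[0]) | (setosa_sw > fences[1])]
print(sorted(outliers))              # the 2.3 point noted above is flagged
```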

2. Practical 1 – Clustering
The objective of this clustering assignment is to apply the knowledge acquired in the lecture
about two basic clustering methods: k-means and hierarchical clustering. In this clustering
assignment we again use the Iris dataset. The dataset refers to the characteristics of different
species of the Iris flower. It consists of five variables: Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width, and Species. The variable Species refers to the class to which each
record belongs. For the time being we will not consider the variable Species.

The overall goal is to create clusters based on the flower characteristics. To this aim, we will
use the two clustering methods mentioned above. Ideally, the obtained clusters should be
nicely clear-cut, but also meaningful.

2.1 K-means clustering

Adding colours to the data can facilitate a better understanding of the data. Watch the KNIME
“Hiliting in KNIME” tutorial on https://tech.knime.org/screencasts. One way to add colors is
by using the “Color Manager” node.

Question 5 – k-means clustering


Add the K-means clustering method in KNIME using the node “k-Means”. Use the first four
flower characteristics, namely Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width to
create k=3 clusters. Examine the results.
a) Do you obtain distinct and clear-cut clusters? In order to answer this question, compare
the centres of the three clusters (i.e. the centroid value for every variable used in
clustering), and also the number of items in every obtained cluster.
Potential answer:
 The obtained clusters are not perfectly clear-cut
 The centroid values for every variable nevertheless seem to determine relatively
distinct clusters
 The number of items per cluster seems balanced; if, for instance, only a very
small number of items were in a cluster, we might suspect that that cluster
includes extreme values

b) Inspect also visually the obtained clusters (tip: add the Color Manager after the “k-
means node”, and use “Scatter Matrix”)
c) Try to produce a better clustering by considering fewer variables, that is, only two or
three variables (flower characteristics) for clustering. Are these clustering solutions
better?
Potential answer:
 Using fewer variables in clustering does not seem to produce better clusters
d) Try to produce a better clustering by setting a different number of clusters (k=2, or
k=4, etc.). Are these clustering solutions better?
Potential answer:
o Using k=2 produces 2 clusters; the clusters seem more clear-cut than
with k=3
o Using k=4 ….
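For comparison outside KNIME, a k-means run on the four measurements can be sketched with scikit-learn. Note the assumption: KNIME's k-Means node and scikit-learn's KMeans use different initialisations, so cluster numbering and marginal assignments may differ.

```python
# Sketch: k-means with k=3 on the four flower measurements.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

sizes = np.bincount(km.labels_)
print(sorted(sizes.tolist()))         # cluster sizes are reasonably balanced
print(km.cluster_centers_.round(2))   # centroid value per variable per cluster
```

Comparing the printed centroids per variable is the same check as inspecting the k-Means node output in KNIME.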

2.2 Hierarchical clustering

Question 6 – hierarchical clustering


Add the hierarchical clustering method in KNIME using the node “Hierarchical Clustering”.
Use the first four flower characteristics, namely Sepal.Length, Sepal.Width, Petal.Length, and
Petal.Width. Examine the results.
a) What is the appropriate number of clusters (where should we cut the dendrogram)?
Potential answer:
 Cutting the dendrogram at a value of approx. 0.85 we get 2 clusters. However, if we
cut at a value of approx. 0.76 we get 3 clusters, where one cluster contains only 2
values; these 2 values (rows 117, 131) may be extreme values, and therefore it
is interesting to take a deeper look at them
b) Do you obtain distinct and clear-cut clusters?
Potential answer:
 Yes.
c) Inspect also visually the obtained clusters (Tip: add the Color Manager after the
“Hierarchical Clustering node”)
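A sketch of the same step outside KNIME, using SciPy. The linkage type is an assumption here (KNIME lets you choose single, complete or average linkage), and the distances on the dendrogram axis are scaled differently, so the cut values will not match the 0.85/0.76 figures above:

```python
# Sketch: hierarchical (agglomerative) clustering on the four measurements,
# cut into 2 and into 3 clusters.
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris

iris = load_iris()
Z = linkage(iris.data, method="average")            # build the dendrogram
labels2 = fcluster(Z, t=2, criterion="maxclust")    # cut into 2 clusters
labels3 = fcluster(Z, t=3, criterion="maxclust")    # cut into 3 clusters
print(Counter(labels2), Counter(labels3))           # cluster sizes per cut
```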

2.3 Compare the results of the two methods


Question 7 – comparing clustering methods
a) Are there differences compared with the k-means clustering method?
Potential answer:
 Yes. The 3-cluster solution with the k-means method produces clusters with a
balanced number of items; the hierarchical method with 3 clusters yields one
cluster with only 2 values.
b) If yes, what could have caused these differences?
Potential answer:
 The differences are caused by the fact that different methods are used to
compute the clusters.
c) Which method produces the best clustering model?
Potential answer:
 The k-means model seems to be better. Note also that for the hierarchical model it
is not feasible to produce a dendrogram for a large dataset (in that case a small
sample of the data can be selected)

2.4 Evaluating clustering results
The two clustering methods used above are unsupervised learning methods, which means that
they are used to cluster data when no prior classification exists in the data.

However, the iris flowers are already classified in the variable “Species”. We can check to
what extent the existing Species classification converges with the results of our own clustering
experiments (with the k-means and hierarchical methods). Ideally, the clustering results will
converge with the existing grouping already provided by the Species variable.

Question 8 – evaluating results


Add the “Scorer” node to evaluate the results of the k-means clustering method (with 4 input
variables, k=3). This node creates a confusion matrix, i.e., it specifies to what extent the
clusters produced by the k-means method match the existing Species classes (in the lecture
on Decision Trees, in week 19, we will discuss the confusion matrix in detail). Similarly, add
the “Scorer” node to evaluate the results of the hierarchical clustering method.
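The cross-tabulation the Scorer node produces can be sketched outside KNIME as follows (assuming pandas/scikit-learn; cluster numbering is arbitrary, so the matrix has to be read by matching each cluster to its dominant species):

```python
# Sketch: confusion matrix of true Species against k-means cluster labels.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

species = pd.Series(iris.target_names[iris.target], name="Species")
clusters = pd.Series(km.labels_, name="Cluster")
cm = pd.crosstab(species, clusters)
print(cm)    # rows: true species, columns: k-means clusters
```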
a) Which observations are assigned by the k-means method in a different cluster than by
the Species variable?
Potential answer:
 3 flowers previously classified as “Iris-virginica” are assigned to the Iris-
versicolor cluster
 14 flowers previously classified as “Iris-versicolor” are assigned to the Iris-
virginica cluster

b) Which observations are assigned by the hierarchical clustering method in a different
cluster than by the Species variable? Were these observations identified as outliers
earlier on?
Potential answer:
 rows 117, 131… (here corroborate the answer with the outlier)
c) If differences are detected, what could have caused these differences?
Potential answer:
 It is possible that the existing classification of the flowers was done by specialists
(botanists), who considered more flower characteristics when classifying the given
flowers in the dataset; our analysis is based on measurements of petal and
sepal only
d) Based on the hierarchical cluster dendrogram, would you say that the number of
clusters you chose was appropriate? If not, what would be a more appropriate number
of clusters?
Potential answer:
 A hierarchical clustering solution with 3 clusters leads to a large number of
misclassified items (see the confusion matrix: 48 items are misclassified).
Therefore a hierarchical clustering solution with only 2 clusters would have
resulted in less misclassification.
e) Consider the results of the two clustering techniques (k-means clustering and
hierarchical clustering) applied to four, three or two variables, and with k=2, 3, and 4
clusters. Which clustering approach provides you with the best result? (Hint: compare
the Scorer values to build your argument about which clustering result is the best.)

Potential answer:
 The 3-cluster solution with the k-means method produces clusters with a
balanced number of items, and the clusters are reasonably clear-cut. Also, the
number of misclassified items is smaller (only 17 items). Still, we have to
treat this way of assessing the quality of a clustering model with caution,
because usually we do not know how the data is actually classified (clustering
is an unsupervised learning method).
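Beyond reading confusion matrices by eye, agreement with the Species labels can also be summarised in a single number. As an illustration (not part of the KNIME practical), the adjusted Rand index is a common choice:

```python
# Sketch: adjusted Rand index between the Species labels and the k-means
# clustering (1.0 = perfect agreement, around 0 = random labelling).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)
ari = adjusted_rand_score(iris.target, km.labels_)
print(round(ari, 3))
```

Unlike raw accuracy, the adjusted Rand index is invariant to the arbitrary numbering of clusters, which makes it convenient for comparing different clustering solutions.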

Question 9 – evaluating results with hilite


Insert the “Color Manager” node just before the k-means clustering / hierarchical clustering
nodes. In this way you can check visually whether there is a match between the newly
obtained clusters and the existing Species clusters.

3. Practical 1 – Summarize results


Report in max one A4-page what the advantages and disadvantages of the clustering
techniques are, and which clustering result proves to be the best. Provide argumentation and
evidence (i.e., confusion matrices and performance measures) to build your argument.

See the potential answers given above.
