- sample solutions -
Week 18 2019
These practice exercises are meant to let you apply the methods and techniques discussed in
the lectures to various data. In this assignment you use the Iris dataset in KNIME. The Iris
dataset lets you practice different data analysis techniques and get familiar with the KNIME
software. The dataset is based on a sample of 150 instances, and refers to the characteristics
of different species of the Iris flower. It consists of five variables: Sepal.Length,
Sepal.Width, Petal.Length, Petal.Width, and Species. The variable Species refers to the class
to which each record belongs.
KNIME can be used on the RUG computers as follows: click the Windows icon at the
bottom left of the desktop, and in the “search programs and files” field (symbolized with a
loupe) type “knime”. Note that KNIME is not accessible under the UWP (University Work
Place) from home (with Citrix Receiver). Alternatively, KNIME can be downloaded and
installed on your own computer from https://www.knime.org/downloads/overview. As KNIME
is open source, there are plenty of materials available (tutorials, exercises, manuals, etc.) to
practice with.
Contents
1. Practical 1 – Data exploration
2. Practical 1 – Clustering
3. Practical 1 – Summarize results
1. Practical 1 – Data exploration
- Sample solutions: in order to follow the sample solutions below, consider the KNIME
workflow available on Nestor under “Assignments -> Assignment 1”
The objective of this first part of the practice exercises is to give you a first handle on
the data, and to let you assign meaning to the data.
Potential answer:
- We relabel the variables Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and
Species as SL, SW, PL, PW, and Species, respectively
- Four continuous (ratio) variables (SL, SW, PL, PW) and one categorical (nominal)
variable (Species)
Potential answer:
- Looking at the histograms: the continuous variables show very diverse distributions
- Box plot: for example, SL along the levels of Species: we note that the Sepal.Length of
species Iris-setosa is smallest, with a mean of 5, while for species Iris-virginica it is
highest, with a mean of 6.5
- Pie chart: the proportion of each Species type is equal, 33.33%
- The descriptive statistics
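These descriptives can also be reproduced outside KNIME. Below is a minimal Python sketch, assuming scikit-learn's bundled copy of the Iris data matches the dataset used in the practical:

```python
# Per-variable descriptives and class proportions for the Iris data.
# Assumption: scikit-learn's bundled Iris copy matches the KNIME dataset.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # columns: SL, SW, PL, PW

# Descriptives, analogous to KNIME's Statistics node
for name, col in zip(["SL", "SW", "PL", "PW"], X.T):
    print(f"{name}: mean={col.mean():.2f}, min={col.min():.1f}, max={col.max():.1f}")

# Class proportions, analogous to the pie chart: each species is 50/150
print(np.bincount(y) / len(y))
```

This confirms the pie-chart observation above: the three species are perfectly balanced at one third each.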
The correlation between variables can be assessed more reliably by calculating the correlation
coefficients between variables than by just examining the scatterplots. Add the “Linear
Correlation” node and examine the correlation matrix it produces. Does this matrix provide
any indication of low/high correlation between the selected variables?
In case you found high correlations: does it make sense that the variables found to be
highly correlated are indeed related?
Potential answer:
- The 2D/3D scatterplot: by 3D-plotting, for instance, SL, SW, and PL, and rotating, we
can already see that some groupings arise
- Correlations: some variables show high correlations, for example between the petal
variables PW-PL, but also between sepal and petal variables; among the sepal
variables, the correlations are much smaller
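For reference, the matrix produced by the “Linear Correlation” node can be approximated in Python with a plain Pearson correlation (a sketch, not the KNIME node itself):

```python
# Pearson correlation matrix of the four continuous Iris variables.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data  # columns: SL, SW, PL, PW
corr = np.corrcoef(X, rowvar=False)  # 4x4 matrix

# PL-PW correlate strongly (~0.96); SL-SW hardly at all (~-0.12)
print(np.round(corr, 2))
```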
1.3 Outliers
Before you can carry out data analysis techniques on a dataset, it is important to know
whether there are outliers in the dataset. Why is this important?
Beforehand it might be worthwhile to check whether you think there are any outliers at all.
You have already inspected the variables using different graphical methods (e.g. scatterplots,
boxplots), which can already give you an impression of the existence of outliers.
When dealing with categorised data (i.e., the Species in the Iris dataset) it might be
worthwhile to check for outliers within each category instead of in the dataset as a whole;
this will help find data that were misclassified upon collection.
Analyse the data to identify potential outliers by using data exploration techniques, such as
scatter plots, pie charts, box plots, and histograms.
Question 4 – outliers
a) Based on the scatterplots for combinations of different variables, do you identify any
potential outliers?
Potential answer:
- There are some points that are suspected of being outliers
b) Based on the histograms, do you identify any potential outliers? What would you
conclude about the extent to which the variables are normally distributed?
Potential answer:
- The variables are not normally distributed
c) Based on the boxplots, do you identify any outliers?
Potential answer:
- Boxplot: SL has an outlier at 4.9, SW at the point 2.3, etc.
d) If you have identified any outliers (in scatterplots, histograms, and/or the boxplots): to
which Species do these outliers belong? What are the Row IDs of the outliers?
Potential answer:
- Box plot of SW by Species: the point 2.3 belongs to species Iris-setosa
e) If you have identified outliers, what do you do with them (e.g. ignore, delete)?
Potential answer:
- This is not a simple question; outliers can distort the analyses, for instance
regressions; however, they can also point to a specific class of points, which can have
an interesting meaning. Therefore it is wise, before deleting or ignoring
outliers, to have a good understanding of the meaning of these extraordinary
points; we will not delete the outliers
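The boxplot-based outlier check can also be done programmatically, per species as suggested above. The sketch below uses the standard 1.5×IQR whisker rule (an assumption: this matches the rule KNIME's Box Plot node applies):

```python
# Flag boxplot-style outliers (beyond 1.5*IQR) within each species.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
names = ["SL", "SW", "PL", "PW"]

flagged = []
for s, species in enumerate(iris.target_names):
    sub = X[y == s]  # restrict to one species, per the per-category advice
    for j, name in enumerate(names):
        q1, q3 = np.percentile(sub[:, j], [25, 75])
        lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        for v in sub[:, j]:
            if v < lo or v > hi:
                flagged.append((species, name, float(v)))

for row in flagged:
    print(row)
```

Among other points, the unusually large setosa petal widths (0.5 and 0.6) show up this way.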
2. Practical 1 – Clustering
The objective of this clustering assignment is to apply the knowledge acquired in the lecture
about two basic clustering methods: k-means and hierarchical clustering. In this clustering
assignment we again use the Iris dataset. The dataset refers to the characteristics of different
species of the Iris flower. It consists of five variables: Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width, and Species. The variable Species refers to the class to which each
record belongs. For the time being we will not consider the variable Species.
The overall goal is to create clusters based on the flower characteristics. To this aim, we will
use the two clustering methods mentioned above. Ideally, the obtained clusters should be
nicely clear-cut, but also meaningful.
Adding colours to the data can facilitate a better understanding of data. Watch the KNIME
“Hiliting in KNIME” tutorial on https://tech.knime.org/screencasts. One way to add colors is
by using the “Color Manager” node.
b) Also inspect the obtained clusters visually (tip: add the Color Manager after the
“k-means” node, and use the “Scatter Matrix”)
c) Try to produce a better clustering by considering fewer variables, that is, only two or
three variables (flower characteristics) for clustering. Are these clustering solutions
better?
Potential answer:
Using fewer variables in clustering does not seem to produce better clusters
d) Try to produce a better clustering by setting a different number of clusters (k=2, or
k=4, etc.). Are these clustering solutions better?
Potential answer:
o Using k=2 produces 2 clusters; it seems that the clusters are more clear-cut than
with k=3
o Using k=4 ….
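Beyond visual inspection, different values of k can be compared through the within-cluster sum of squares. A sketch using scikit-learn's KMeans (an assumption: KNIME's k-Means node minimizes the same quantity):

```python
# Compare k = 2, 3, 4 by within-cluster sum of squares (inertia).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
inertias = {}
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    print(k, round(km.inertia_, 1))
# Inertia always drops as k grows, so look for the "elbow"
# (the k after which the drop flattens) rather than the minimum.
```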
However, the Iris flowers are already classified by the variable “Species”. We can check to
what extent the existing Species classification converges with the results of our own
clustering experiments (with the k-means and hierarchical methods). Ideally, the clustering
results will converge with the existing grouping already provided by the Species variable.
Potential answer
The 3-cluster solution with the k-means method produces clusters with a
balanced number of items, and the clusters are reasonably clear-cut. Also the
number of misclassified items is smaller (only 17 items). Still, we have to
treat this way of assessing the quality of a clustering model with caution,
because usually we do not know how the data are actually classified (clustering
is an unsupervised learning method).
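The convergence check described above can be sketched in Python via a contingency table between the k-means labels and the Species variable (the exact misclassification count depends on the initialization, so it is only indicative):

```python
# Cross-tabulate k-means clusters (k=3) against the true Species labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows = true species, columns = cluster ids
table = np.zeros((3, 3), dtype=int)
for t, c in zip(y, labels):
    table[t, c] += 1
print(table)

# Items outside each species' majority cluster count as "misclassified"
misclassified = sum(int(row.sum() - row.max()) for row in table)
print("misclassified:", misclassified)
```

Iris-setosa typically ends up in a cluster of its own, while versicolor and virginica overlap, which is where the misclassified items come from.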