Question 1
1.1) [1 mark]
Briefly describe the general objective of Association Rules mining.
The objective of association rules mining is to discover interesting relationships (frequent co-occurrence patterns) between items in large transactional databases.
1.2) [1 mark] Describe the general Apriori Principle.
If an itemset is frequent, then all of its subsets must also be frequent. The support of an itemset
never exceeds the support of its subsets.
1.3) [1 mark] Explain the benefits of applying the Apriori Principle in the context of the Apriori
Algorithm for Association Rules mining.
The Apriori algorithm discovers association rules by building frequent itemsets level by level, counting the support of each candidate itemset against the database. The Apriori principle, in its contrapositive form, guarantees that if an itemset falls below the minimum support threshold, all of its supersets must also be infrequent. The algorithm can therefore prune those supersets without ever counting their support, which greatly reduces the number of candidate itemsets and the number of database scans.
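To illustrate, here is a minimal Python sketch of the pruning step (the function and variable names are my own, not from any particular library): a candidate is kept only if every one of its (k−1)-subsets is already known to be frequent.

```python
from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    """Keep only candidates whose (k-1)-subsets are all frequent
    (the Apriori principle); the rest can be discarded without
    ever counting their support."""
    pruned = []
    for cand in candidates:
        subsets = combinations(cand, len(cand) - 1)
        if all(frozenset(s) in frequent_prev for s in subsets):
            pruned.append(cand)
    return pruned

# Frequent 2-itemsets (L2) from Question 2 below:
L2 = {frozenset(p) for p in [("Bread", "Butter"), ("Bread", "Milk"),
                             ("Butter", "Milk"), ("Diapers", "Beer")]}
# {Bread, Butter, Milk} survives; {Bread, Butter, Diapers} is pruned
# because its subset {Bread, Diapers} is not frequent.
print(prune_candidates([frozenset(["Bread", "Butter", "Milk"]),
                        frozenset(["Bread", "Butter", "Diapers"])], L2))
```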
Question 2
Consider the market basket transactions shown in the following table. Assume that
min_support=40% and min_confidence=70%. Further, assume that the Apriori algorithm is
used to discover strong association rules among transaction items.
Transaction ID Items Bought
1 {Bread, Butter, Milk}
2 {Bread, Butter}
3 {Beer, Cookies, Diapers}
4 {Milk, Diapers, Bread, Butter}
5 {Beer, Diapers}
2.1) [5 marks] Show step by step the generated candidate itemsets (Ck) and the qualified
frequent itemsets (Lk) until the largest frequent itemset(s) are generated. Use table C1 as a
template. Make sure you clearly identify all the frequent itemsets.
Itemset Support_count
Bread 3
Butter 3
Milk 2
Beer 2
Cookies 1
Diapers 3
C1
min_support_count = min_support × number_of_transactions
min_support_count = 40% × 5
= 2
Itemset Support_count
Bread 3
Butter 3
Diapers 3
Milk 2
Beer 2
L1
Itemset
Bread, Butter
Bread, Diapers
Bread, Milk
Bread, Beer
Butter, Diapers
Butter, Milk
Butter, Beer
Diapers, Milk
Diapers, Beer
Milk, Beer
C2
2nd Scan
Itemset Support_count
Bread, Butter 3
Bread, Diapers 1
Bread, Milk 2
Bread, Beer 0
Butter, Diapers 1
Butter, Milk 2
Butter, Beer 0
Diapers, Milk 1
Diapers, Beer 2
Milk, Beer 0
C2
Itemset Support_count
Bread, Butter 3
Bread, Milk 2
Butter, Milk 2
Diapers, Beer 2
L2
Itemset
Bread, Butter, Milk
C3
Itemset Support_count
Bread, Butter, Milk 2
L3
The largest frequent itemset is {Bread, Butter, Milk}. The complete set of frequent itemsets is L1 ∪ L2 ∪ L3.
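As a cross-check, here is a small Python sketch of the level-wise scan-and-join loop, run on this exact transaction set; it reproduces L1, L2, and L3. (This is a simplified illustration, not a full Apriori implementation: the join step brute-forces unions of frequent k-itemsets instead of the usual prefix join.)

```python
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Beer", "Cookies", "Diapers"},
    {"Milk", "Diapers", "Bread", "Butter"},
    {"Beer", "Diapers"},
]

def frequent_itemsets(transactions, min_count):
    # C1: every individual item is a candidate 1-itemset
    level = list({frozenset([i]) for t in transactions for i in t})
    result, k = {}, 1
    while level:
        # Scan: count the support of each candidate k-itemset
        counts = {c: sum(c <= t for t in transactions) for c in level}
        frequent = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        result.update(frequent)
        # Join: candidate (k+1)-itemsets from pairs of frequent k-itemsets
        level = list({a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1})
        k += 1
    return result

# min_support_count = 40% x 5 = 2
for itemset, count in sorted(frequent_itemsets(transactions, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```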
2.2) [5 marks] Generate all possible association rules from the frequent itemsets obtained in the
previous question. Calculate the confidence of each rule and identify all the strong association
rules.
Confidence (X→Y) = P(Y | X) = P(X ∪ Y) / P(X)
Min Confidence = 70%
Bread → Milk: 2/3 = 67%
Butter → Milk: 2/3 = 67%
Diapers → Beer: 2/3 = 67%
Butter → Bread: 3/3 = 100% (strong)
Milk → Bread: 2/2 = 100% (strong)
Milk → Butter: 2/2 = 100% (strong)
Beer → Diapers: 2/2 = 100% (strong)
Bread → Butter: 3/3 = 100% (strong)
Bread, Butter → Milk: 2/3 = 67%
Bread, Milk → Butter: 2/2 = 100% (strong)
Milk, Butter → Bread: 2/2 = 100% (strong)
Bread → Butter, Milk: 2/3 = 67%
Butter → Bread, Milk: 2/3 = 67%
Milk → Bread, Butter: 2/2 = 100% (strong)
8 strong rules overall (confidence ≥ 70%).
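The same enumeration can be scripted from the support counts above; a short illustrative sketch (the support dictionary simply transcribes the tables from 2.1):

```python
from itertools import combinations

# Support counts of the frequent itemsets found in 2.1
support = {
    frozenset(["Bread"]): 3, frozenset(["Butter"]): 3,
    frozenset(["Diapers"]): 3, frozenset(["Milk"]): 2,
    frozenset(["Beer"]): 2,
    frozenset(["Bread", "Butter"]): 3, frozenset(["Bread", "Milk"]): 2,
    frozenset(["Butter", "Milk"]): 2, frozenset(["Diapers", "Beer"]): 2,
    frozenset(["Bread", "Butter", "Milk"]): 2,
}

for itemset, count in support.items():
    for r in range(1, len(itemset)):          # 1-itemsets yield no rules
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = count / support[lhs]       # support(X U Y) / support(X)
            mark = "strong" if conf >= 0.70 else ""
            print(sorted(lhs), "->", sorted(itemset - lhs),
                  f"{conf:.0%}", mark)
```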
Question 3
3.1) [1 mark] Briefly discuss the main differences between the data mining tasks of: 1)
Classification, and 2) Clustering.
● Classification is a supervised task: a model is learned from records with known class labels and is then used to predict the labels of unseen records.
● Clustering is an unsupervised task: records are grouped into categories by similarity, with no predefined labels, so the groups emerge from the data itself.
3.2) [2 marks] Compare the Precision and Recall metrics for classifier evaluation. Illustrate your
answer using the Confusion Matrix.
● Precision (exactness): what percent of the tuples that the classifier labeled as positive are actually positive?
Precision = True Positive / (True Positive + False Positive)
● Recall (completeness): what percent of the positive tuples did the classifier label as positive?
Recall = True Positive / (True Positive + False Negative)

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

Precision penalizes false positives while recall penalizes false negatives, so a classifier can trade one metric for the other.
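A tiny sketch of the two formulas, using hypothetical confusion-matrix counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: 90 TP, 10 FP, 30 FN
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.75
```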
3.5) [2 marks] In the context of Decision Tree induction, what does overfitting mean? And how
to avoid overfitting?
Overfitting in decision trees occurs when the induced tree fits the training data too closely, including its noise and outliers, so it achieves high accuracy on the training set but generalizes poorly to unseen samples. It is typically caused by too many branches, some of which merely reflect noise or outliers, or by a training set too small to represent the true distribution.
To avoid overfitting, tree growth should be stopped before the fully grown tree is created (pre-pruning). For example: stop at a node if all instances belong to the same class, if all the attribute values are the same, or if the number of instances falls below a threshold. Alternatively, branches can be removed from the fully grown tree afterwards (post-pruning).
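A hedged sketch of pre-pruning, assuming scikit-learn is acceptable here: max_depth and min_samples_leaf are real scikit-learn parameters, but the dataset is synthetic and the thresholds are arbitrary. The fully grown tree typically scores near 100% on the training split yet no better (often worse) than the pruned tree on the test split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real training set
X, y = make_classification(n_samples=300, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fully grown tree: tends to memorize noise in the training set
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Pre-pruned tree: growth stops early, limiting complexity
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                random_state=0).fit(X_tr, y_tr)

print("full  : train", full.score(X_tr, y_tr), "test", full.score(X_te, y_te))
print("pruned: train", pruned.score(X_tr, y_tr), "test", pruned.score(X_te, y_te))
```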
Question 4
Consider the one-dimensional data set shown below:

x   0.5   3.0   4.5   4.6   4.9   5.2   5.3   5.5   7.0
y    −     −     +     +     +     −     −     +     −

[4 marks] Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors
(using majority vote). Clearly justify your classification decisions.
x = 5.0, 1-nearest neighbor:
x   4.9
y    +
Majority vote: +, therefore x = +

x = 5.0, 3-nearest neighbors:
x   4.9   5.2   5.3
y    +     −     −
Majority vote: −, therefore x = −

x = 5.0, 5-nearest neighbors:
x   4.5   4.6   4.9   5.2   5.3
y    +     +     +     −     −
Majority vote: +, therefore x = +

x = 5.0, 9-nearest neighbors:
x   0.5   3.0   4.5   4.6   4.9   5.2   5.3   5.5   7.0
y    −     −     +     +     +     −     −     +     −
Majority vote (4 "+" vs 5 "−"): −, therefore x = −
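A small Python sketch reproducing the votes above. The distance tie at k = 5 (4.5 and 5.5 are both 0.5 away from 5.0) is broken by list order here, which matches the choice of 4.5 in the working:

```python
def knn_classify(x, points, labels, k):
    """Majority vote among the k nearest neighbors of x (1-D)."""
    nearest = sorted(zip(points, labels), key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in nearest]
    return "+" if votes.count("+") > votes.count("-") else "-"

points = [0.5, 3.0, 4.5, 4.6, 4.9, 5.2, 5.3, 5.5, 7.0]
labels = ["-", "-", "+", "+", "+", "-", "-", "+", "-"]
for k in (1, 3, 5, 9):
    print(f"{k}-NN:", knn_classify(5.0, points, labels, k))
# 1-NN: +   3-NN: -   5-NN: +   9-NN: -
```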
Question 5
Consider the data set shown below:
5.1) [3 marks] Estimate the conditional probabilities for
P(A = 1|+), P(B = 1|+), P(C =1|+), P(A = 1|−), P(B = 1|−), and P(C = 1|−)
● P(A = 1 | +) = 3 / 5 = 60%
● P(B = 1 | +) = 2 / 5 = 40%
● P(C = 1 | +) = 4 / 5 = 80%
● P(A = 1 | −) = 2 / 5 = 40%
● P(B = 1 | −) = 2 / 5 = 40%
● P(C = 1 | −) = 1 / 5 = 20%
5.2) [3 marks] Use the conditional probabilities estimated in the previous question to predict the
class label for a test sample (A = 1, B = 1, C = 1) using the naive Bayes approach.
Let X = (A=1, B=1, C=1). Then:
P(X|−) = P(A=1|−) × P(B=1|−) × P(C=1|−) = 0.4 × 0.4 × 0.2 = 0.032
P(X|+) = P(A=1|+) × P(B=1|+) × P(C=1|+) = 0.6 × 0.4 × 0.8 = 0.192
Next we would calculate P(X|−) × P(−) and P(X|+) × P(+), but in this instance P(−) = P(+) = 0.5,
so the priors have no impact on the outcome. The predicted class label for X is '+', since
P(X|+) = 0.192 is higher than P(X|−) = 0.032.
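A minimal sketch of the same computation:

```python
# Class-conditional probabilities estimated in 5.1
p_pos = {"A": 0.6, "B": 0.4, "C": 0.8}
p_neg = {"A": 0.4, "B": 0.4, "C": 0.2}
prior = {"+": 0.5, "-": 0.5}

# Naive Bayes: P(class | X) is proportional to
# P(class) * product over attributes of P(attribute value | class).
# All three attributes are 1 in the test sample X = (A=1, B=1, C=1).
score_pos = prior["+"] * p_pos["A"] * p_pos["B"] * p_pos["C"]  # 0.096
score_neg = prior["-"] * p_neg["A"] * p_neg["B"] * p_neg["C"]  # 0.016
print("+" if score_pos > score_neg else "-")  # predicts '+'
```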
Question 6
6.1) [2 marks] Describe the problems involved in selecting the initial points of the K-means
clustering method. Describe post-processing techniques to solve such problems.
K-means is sensitive to the choice of initial centroids: different initial points can produce different final clusterings, and a poor choice can leave the algorithm stuck in a local optimum where the centroids never readjust in the 'right' way, yielding sub-optimal clusters. Post-processing techniques to overcome this problem consist of eliminating 'small' clusters that may represent outliers, splitting 'loose' clusters with relatively high sum of squared error (SSE), and merging 'close' clusters with relatively low SSE.
6.2) [1 mark] In density-based clustering, what are the necessary conditions for two data
objects to be density-reachable?
In DBSCAN, an object p is directly density-reachable from an object q if (1) p lies within the Eps-neighbourhood of q, and (2) q is a core object, i.e. its Eps-neighbourhood contains at least MinPts objects. More generally, p is density-reachable from q if there is a chain of objects p1 = q, p2, ..., pn = p such that each p(i+1) is directly density-reachable from p(i).
Question 7
Consider the 5 data points shown below:
P1: (1, 2, 3),
P2: (0, 1, 2),
P3: (3, 0, 5),
P4: (4, 1, 3),
P5: (5, 0, 1)
7.1) [5 marks] Apply the K-means clustering algorithm to group those data points into 2 clusters,
using the L1 distance measure. (L1 distance is just the Manhattan distance: the sum of the
absolute differences in each dimension.)
Suppose that the initial centroids are C1: (1, 0, 0) and C2: (0, 1, 1).
Use the following table as a template to illustrate each of the K-means iterations.
Iteration 1:
Cluster#   Old Centroids        Cluster Elements   New Centroids
1          (1, 0, 0)            P3, P5             (4, 0, 3)
2          (0, 1, 1)            P1, P2, P4         (1.67, 1.33, 2.67)

Iteration 2:
Cluster#   Old Centroids        Cluster Elements   New Centroids
1          (4, 0, 3)            P3, P4, P5         (4, 0.33, 3)
2          (1.67, 1.33, 2.67)   P1, P2             (0.5, 1.5, 2.5)

Iteration 3:
Cluster#   Old Centroids        Cluster Elements   New Centroids
1          (4, 0.33, 3)         P3, P4, P5         (4, 0.33, 3)
2          (0.5, 1.5, 2.5)      P1, P2             (0.5, 1.5, 2.5)
7.2) [1 mark] Explain how the K-means termination condition has been reached.
After iteration 3, the new centroids and cluster elements are identical to the old centroids and
cluster elements, so no point changes cluster and the algorithm terminates.
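A sketch of the full run, following the document's variant of K-means: points are assigned with the L1 distance, but centroids are updated with the ordinary mean (a strict L1 version, K-medians, would use the per-dimension median instead):

```python
def l1(a, b):
    """Manhattan distance: sum of absolute differences per dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mean(cluster):
    return tuple(sum(coord) / len(coord) for coord in zip(*cluster))

points = [(1, 2, 3), (0, 1, 2), (3, 0, 5), (4, 1, 3), (5, 0, 1)]
centroids = [(1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]

while True:
    # Assignment step: each point goes to its L1-nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: l1(p, centroids[i]))
        clusters[nearest].append(p)
    # Update step: recompute centroids; stop when nothing changes
    new_centroids = [mean(c) for c in clusters]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(clusters)   # [[(3,0,5), (4,1,3), (5,0,1)], [(1,2,3), (0,1,2)]]
print(centroids)  # [(4.0, 0.333..., 3.0), (0.5, 1.5, 2.5)]
```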
7.3) [2 marks] For your final clustering, calculate the corresponding SSE (Sum of Squared Error)
value.
SSE = Σᵢ Σₓ∈Cᵢ dist(mᵢ, x)², where mᵢ is the centroid of the cluster Cᵢ.
SSE for cluster 1 = SSE P3 + SSE P4 + SSE P5
= 11.0889 + 0.4489 + 11.0889
= 22.6267
SSE for cluster 2 = SSE P1 + SSE P2
= 2.25 + 2.25
= 4.5
Total SSE = 22.6267 + 4.5
= 27.1267
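A quick check of the SSE with exact (unrounded) centroids; the small difference from the hand calculation above comes from rounding 1/3 to 0.33:

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

clusters = [
    ([(3, 0, 5), (4, 1, 3), (5, 0, 1)], (4, 1/3, 3)),   # cluster 1, m1
    ([(1, 2, 3), (0, 1, 2)], (0.5, 1.5, 2.5)),          # cluster 2, m2
]
sse = sum(l1(p, m) ** 2 for cluster, m in clusters for p in cluster)
print(sse)  # ~27.167, vs 27.1267 with the centroid rounded to 0.33
```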
Question 8
Consider the following set of 6 one-dimensional data points:
18, 22, 25, 42, 27, 43
[5 marks] Apply the agglomerative hierarchical clustering algorithm to build the hierarchical
clustering dendrogram. Merge the clusters using Min distance and update the proximity matrix
accordingly. Clearly show the proximity matrix corresponding to each iteration of the algorithm.
Step 1:
18 22 25 27 42 43
18 0 4 7 9 24 25
22 4 0 3 5 20 21
25 7 3 0 2 17 18
27 9 5 2 0 15 16
42 24 20 17 15 0 1
43 25 21 18 16 1 0
Step 2:
18 22 25 27 42, 43
18 0 4 7 9 24
22 4 0 3 5 20
25 7 3 0 2 17
27 9 5 2 0 15
42, 43 24 20 17 15 0
Step 3:
18 22 25, 27 42, 43
18 0 4 7 24
22 4 0 3 20
25, 27 7 3 0 15
42, 43 24 20 15 0
Step 4:
             18   22, 25, 27   42, 43
18            0        4         24
22, 25, 27    4        0         15
42, 43       24       15          0
Step 5:
                 18, 22, 25, 27   42, 43
18, 22, 25, 27          0           15
42, 43                 15            0
Step 6:
                          18, 22, 25, 27, 42, 43
18, 22, 25, 27, 42, 43               0
Dendrogram: {42, 43} merge at height 1, {25, 27} at height 2, 22 joins {25, 27} at height 3, 18 joins {22, 25, 27} at height 4, and the final merge of {18, 22, 25, 27} with {42, 43} occurs at height 15.
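For verification, SciPy's single-linkage clustering (Min distance) reproduces the merge order and heights, assuming scipy and matplotlib are installed:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

data = np.array([[18], [22], [25], [27], [42], [43]])
# method="single" is Min (nearest-neighbour) inter-cluster distance
Z = linkage(data, method="single")
print(Z)  # each row: [cluster_i, cluster_j, merge_distance, new_size]
# Merge distances come out as 1, 2, 3, 4, 15, matching the steps above.
dendrogram(Z, labels=["18", "22", "25", "27", "42", "43"])
plt.show()
```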
Question 9
[4 marks] Handling unstructured data is one of the main challenges in Text Mining. Describe in
detail the necessary steps that are needed to provide a structured representation of text
documents.
To provide a structured representation of text documents, the Vector Space Model (VSM) is used,
with each term weighted by its Term Frequency × Inverse Document Frequency (TF-IDF). The steps are:
● Step 1: Extract the text. (Move all of the text into a database.)
● Step 2: Remove stop words. (Remove words like ‘the’ and ‘a’.)
● Step 3: Convert all words to lowercase. (To ensure word comparison will work.)
● Step 4: Use stemming on all words. (Words like ‘stemming’ become ‘stem’.)
● Step 5: Count the term frequencies. (Get count for each word in each document.)
● Step 6: Create an indexing file. (For each word to create the VSM.)
● Step 7: Create the vector space model. (Uses the index file to represent the term
frequency in each document.)
● Step 8: Compute inverse document frequency. (IDF (word) = log(total documents /
document frequency).)
● Step 9: Compute term weights. (Weight(word) = TF(word) x IDF(word). Where TF(word)
= number of times the word appears in the document.)
● Step 10: Normalize all documents to unit length. (weight(word) = weight(word) /
sqrt(sum of the squared weights of all words in the document).)
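A minimal end-to-end sketch of steps 2-10 on a toy corpus; the suffix-stripping 'stemmer' here is a crude stand-in for a real one such as the Porter stemmer:

```python
import math
import re
from collections import Counter

docs = ["The cat chases the mouse",
        "Dogs chase cats",
        "The mouse eats cheese"]
stop_words = {"the", "a", "an", "is", "are", "was", "by"}  # Step 2

def tokenize(text):
    # Steps 2-4: lowercase, drop stop words, strip common suffixes
    words = re.findall(r"[a-z]+", text.lower())
    words = [w for w in words if w not in stop_words]
    return [re.sub(r"(ing|ed|s)$", "", w) for w in words]

tf = [Counter(tokenize(d)) for d in docs]          # Step 5: term frequencies
vocab = sorted(set().union(*tf))                   # Step 6: term index
df = {w: sum(w in t for t in tf) for w in vocab}   # document frequency
idf = {w: math.log(len(docs) / df[w], 10) for w in vocab}  # Step 8

for t in tf:                                       # Steps 7, 9, 10
    weights = [t[w] * idf[w] for w in vocab]       # Step 9: TF x IDF
    norm = math.sqrt(sum(x * x for x in weights)) or 1.0
    print([round(x / norm, 3) for x in weights])   # Step 10: unit length
```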