Data Mining I
Summer semester 2017
Lecture 7: Classification 3
Lectures: Prof. Dr. Eirini Ntoutsi
Exercises: Tai Le Quy and Damianos Melidis
Outline
kNN classifiers
Evaluation measures
Evaluation setup
DTs partition the feature space into axis-parallel hyper-rectangles, and label each rectangle with one of the class labels.
Decision boundary: the border line between neighboring regions of different classes
What is the decision boundary?
What would a decision-tree-based decision boundary look like for the examples below? Why?
Eager learners
Construct a classification model (based on a training set)
Learned models are ready and eager to classify previously unseen instances
e.g., decision trees, SVM, MNB, NN
Lazy learners
Simply store training data (with labels) and wait until a previously unknown instance arrives
No model is constructed.
Also known as instance-based learners, because they store the training set
e.g., k-NN classifier
Nearest-neighbor classifiers compare a given unknown instance with training tuples that are similar to it
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck
[Figure: a test record is compared, by computing distances, to the surrounding training records.]
Input:
A training set D (with known class labels)
A distance metric to compute the distance between two instances
The number of neighbors k
Pseudocode:
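The procedure can be summarized in a short Python sketch; this is a minimal illustration assuming numeric feature vectors, Euclidean distance as the metric, and an unweighted majority vote (all function and variable names are illustrative, not from the slides):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance metric: Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(D, x, k, dist=euclidean):
    """Classify the unknown instance x via its k nearest neighbors in D.

    D is the training set: a list of (feature_vector, class_label) pairs.
    """
    # 1. Compute the distance from x to every training instance.
    neighbors = sorted(D, key=lambda pair: dist(pair[0], x))
    # 2. Take the class labels of the k closest instances.
    k_labels = [label for _, label in neighbors[:k]]
    # 3. Majority vote among the neighbors decides the predicted class.
    return Counter(k_labels).most_common(1)[0][0]

# Toy usage with two 2-d classes
D = [((1.0, 1.0), 'A'), ((1.2, 0.9), 'A'), ((5.0, 5.1), 'B'), ((4.8, 5.3), 'B')]
print(knn_classify(D, (1.1, 1.0), k=3))  # -> 'A'
```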
[Figure: neighborhoods of an unknown instance x for k = 1, k = 7, and k = 17.]
Pairwise-distances example: a sample of 10^5 instances drawn from a uniform [0, 1] distribution in d dimensions; distances normalized by 1/sqrt(d).
Source: Tutorial on Outlier Detection in High-Dimensional Data, Zimek et al, ICDM 2012
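The concentration effect can be reproduced in a few lines of NumPy/SciPy; this is a hedged sketch (the sample size and dimensions here are illustrative, smaller than in the tutorial's experiment, for speed):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 500  # illustrative; the slide's experiment uses a much larger sample

for d in (2, 10, 100, 1000):
    X = rng.uniform(0.0, 1.0, size=(n, d))
    # All pairwise Euclidean distances, normalized by 1/sqrt(d) as on the slide
    dists = pdist(X) / np.sqrt(d)
    print(f"d={d:4d}  mean={dists.mean():.3f}  std={dists.std():.3f}")

# As d grows, the spread (std) shrinks relative to the mean: all points
# become nearly equidistant, which hurts nearest-neighbor methods.
```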
(±) Lazy learners: no model building is required, but testing is more expensive
(−) Classification is based on local information, in contrast to e.g. DTs that try to find a global model fitting the entire input space: susceptible to noise
(+) Incremental classifiers: new training instances are simply added to the stored set
(−) The choice of distance function and of k is important
(+) Nearest-neighbor classifiers can produce arbitrarily shaped decision boundaries, in contrast to e.g. decision trees, which result in axis-parallel hyper-rectangles
The quality of a classifier is evaluated over a test set, different from the training set
For each instance in the test set, we know its true class label. (Why do we need a different test set?)
Compare the predicted class (by some classifier) with the true class of the test instances
Test set, with true class labels and the classifier's predicted class labels:

Day  Outlook   Temperature  Humidity  Wind  Play Tennis (true class)  Prediction
D16  Overcast  Cool         Normal    Weak  Yes                       Yes
D17  Overcast  High         Normal    Weak  Yes                       No
D18  Sunny     Hot          Normal    Weak  No                        No
D19  Overcast  Cool         Normal    Weak  No                        Yes
A useful tool for analyzing how well a classifier performs is the confusion matrix
The confusion matrix is built from the test set and the predictions of the classifier
For an m-class problem, the matrix is of size m x m
Terminology
Positive tuples: tuples of the main class of interest
Negative tuples: all other tuples
                       Predicted class
                       C1                    C2                    Total
Actual   C1            TP (true positive)    FN (false negative)   P
class    C2            FP (false positive)   TN (true negative)    N
         Total         P′                    N′                    P + N
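As a concrete illustration, the confusion matrix for the four test days above can be counted directly from the true and predicted labels (a minimal sketch; the same counting works for any m-class problem):

```python
from collections import Counter

# True and predicted labels for D16-D19 from the table above
y_true = ['Yes', 'Yes', 'No', 'No']
y_pred = ['Yes', 'No',  'No', 'Yes']

# Count (actual, predicted) pairs -> confusion matrix entries
matrix = Counter(zip(y_true, y_pred))
labels = ['Yes', 'No']
for actual in labels:
    print(actual, [matrix[(actual, pred)] for pred in labels])
# Yes [1, 1]   -> 1 TP, 1 FN (taking 'Yes' as the positive class)
# No  [1, 1]   -> 1 FP, 1 TN
```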
Evaluation measures:
Accuracy
Error rate
Sensitivity
Specificity
Precision
Recall
F-measure
...
Accuracy: % of test set instances correctly classified:
accuracy(M) = (TP + TN) / (P + N)
Example (predicted vs. actual classes over a test set of 10,000 instances):

classes              buy_computer = yes   buy_computer = no   total
buy_computer = yes   6954                 46                  7000
buy_computer = no    412                  2588                3000
total                7366                 2634                10000

Accuracy(M) = (6954 + 2588) / 10000 = 95.42%
!!! Accuracy and error rate are more effective when the class distribution is relatively balanced
Sensitivity / True positive rate / Recall: % of positive tuples that are correctly recognized:
sensitivity(M) = TP / P
Specificity / True negative rate: % of negative tuples that are correctly recognized:
specificity(M) = TN / N
Precision: % of tuples labeled as positive that are actually positive:
precision(M) = TP / (TP + FP)
For the buy_computer example: precision(M) = 6954 / 7366 = 94.41%
F-measure / F1 score / F-score combines both; it is the harmonic mean of precision and recall:
F(M) = 2 · precision(M) · recall(M) / (precision(M) + recall(M))
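All of these measures follow from the four confusion-matrix counts; the sketch below computes them for the buy_computer example, using the numbers from the table above:

```python
# Confusion-matrix counts for the buy_computer example above
TP, FN = 6954, 46     # actual 'yes': predicted yes / predicted no
FP, TN = 412, 2588    # actual 'no':  predicted yes / predicted no
P, N = TP + FN, FP + TN

accuracy    = (TP + TN) / (P + N)
error_rate  = 1 - accuracy
sensitivity = TP / P              # true positive rate = recall
specificity = TN / N              # true negative rate
precision   = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.4f}  precision={precision:.4f}  "
      f"recall={sensitivity:.4f}  F1={f1:.4f}")
# accuracy=0.9542  precision=0.9441  recall=0.9934  F1=0.9681
```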
[Figure: ROC curve. Adapted from https://en.wikipedia.org/wiki/Receiver_operating_characteristic]
Holdout method
Given data is randomly partitioned into two independent sets:
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
(+) Does not take long to compute
(−) The estimate depends on how the data are divided
(−) Less data is available for training, which is problematic for small datasets
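A minimal sketch of the holdout split (the 2/3 vs. 1/3 ratio follows the slide; shuffling before splitting avoids order effects in the original data; names are illustrative):

```python
import random

def holdout_split(D, train_fraction=2/3, seed=0):
    """Randomly partition dataset D into a training set and a test set."""
    data = list(D)
    random.Random(seed).shuffle(data)   # random partition, reproducible
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]       # training set, test set

# Usage: build the model on train, estimate accuracy on test only
# train, test = holdout_split(labeled_data)
```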
Cross-validation: the data is partitioned into k disjoint folds; each fold is used once as the test set while the remaining k − 1 folds form the training set.
Leave-one-out: k-fold cross-validation with k = # of tuples, so only one sample is used as a test set at a time; suitable for small datasets
Source: https://faculty.elgin.edu/dkernler/statistics/ch01/images/strata-sample.gif
Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
Stratified 10-fold cross-validation is recommended!!!
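With scikit-learn (assuming it is available), stratified 10-fold cross-validation of, e.g., a k-NN classifier looks roughly like this; the dataset and the value of k are just placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # placeholder dataset

# Each of the 10 folds preserves the class distribution of the full data
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=skf)

print(scores.mean(), scores.std())  # accuracy averaged over the 10 folds
```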
So on average, 36.8% of the tuples will not be selected for training and thereby end up in the test set; the remaining 63.2% form the training set.
acc_boot(M) = (1/k) · Σ_{i=1..k} [ 0.632 · acc(M_i)_testSet_i + 0.368 · acc(M_i)_train_set ]

where acc(M_i)_testSet_i is the accuracy of the model obtained from bootstrap sample i when applied to test set i, and acc(M_i)_train_set is the accuracy of that model when applied to the full set of labeled data.
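A sketch of this estimate, following the formula above; train_fn and acc_fn are placeholders for whatever classifier and accuracy function are being assessed:

```python
import random

def bootstrap_632(D, train_fn, acc_fn, k=10, seed=0):
    """.632 bootstrap accuracy estimate, following the slide's formula.

    D: list of labeled instances. train_fn(instances) builds a model;
    acc_fn(model, instances) returns its accuracy. Both are placeholders.
    Assumes D is large enough that each out-of-bag test set is non-empty.
    """
    rng = random.Random(seed)
    n, total = len(D), 0.0
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]   # draw n tuples with replacement
        chosen = set(idx)
        test = [D[j] for j in range(n) if j not in chosen]  # ~36.8% out-of-bag
        model = train_fn([D[j] for j in idx])
        # 0.632 * accuracy on unseen test set i + 0.368 * accuracy on all labeled data
        total += 0.632 * acc_fn(model, test) + 0.368 * acc_fn(model, D)
    return total / k
```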
Evaluation measures
accuracy, error rate, sensitivity, specificity, precision, recall, F-measure, ROC
Train/test splitting
Holdout, cross-validation, bootstrap, ...
Other parameters
Speed (construction time, usage time)
kNN classifiers
Evaluation measures
Evaluation setup