Jianlin Cheng
Department of Computer Science
University of Missouri
Slides Adapted from
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Target marketing
Medical diagnosis
Fraud detection: predicting whether fraud will occur
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Process (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)

Process (2): Using the Model
Testing Data / Unseen Data → Classifier
Unseen data: (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Supervised vs. Unsupervised Learning
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Accuracy
classifier accuracy: predicting class label
[Figure: example decision tree for buys_computer. The root tests age? with branches <=30, 31..40, and >40; the leaves shown predict no, yes, and yes]
Basic algorithm (greedy): the tree is constructed in a top-down, recursive, divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Decision Boundary?
[Figure: training points from Class A, Class B, and Class C plotted in the X-Y feature plane; the question is what decision boundary a classifier would draw between the classes]
Computing Information-Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
The point yielding maximum information gain for A is
selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
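To make the split-point search concrete, here is a minimal Python sketch (not part of the original slides); the helper names and the small age/label example are made up. It sorts the attribute values, tries the midpoint of each adjacent pair as a candidate split point, and keeps the one with the largest information gain.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of D."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint of each pair of adjacent sorted values as a split point
    and return the one with maximum information gain (A <= split vs. A > split)."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (a_i, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a_i == a_next:
            continue
        split = (a_i + a_next) / 2             # midpoint (a_i + a_{i+1}) / 2
        left = [y for x, y in pairs if x <= split]
        right = [y for x, y in pairs if x > split]
        info_a = (len(left) / len(pairs)) * entropy(left) + \
                 (len(right) / len(pairs)) * entropy(right)
        gain = base - info_a                    # Gain(A) = Info(D) - Info_A(D)
        if gain > best[1]:
            best = (split, gain)
    return best

# Hypothetical example: ages with buys_computer labels
ages   = [25, 28, 35, 40, 45, 50, 52]
labels = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'no']
print(best_split_point(ages, labels))
```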
Gain Ratio for Attribute Selection (C4.5)
GainRatio(A) = Gain(A)/SplitEntropy(A)
Ex. SplitEntropy_income(D) = -(4/14)·log2(4/14) - (6/14)·log2(6/14) - (4/14)·log2(4/14) = 1.557
gain_ratio(income) = Gain(income)/SplitEntropy_income(D) = 0.029/1.557 = 0.019 (a numeric check appears in the sketch below)
The attribute with the maximum gain ratio is selected as
the splitting attribute
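As a quick numeric check of the gain-ratio formula above, a short sketch (assuming the 14-tuple example, where income splits the data into partitions of size 4, 6, and 4, and taking Gain(income) = 0.029 from the slide):

```python
import math

def split_entropy(partition_sizes):
    """SplitEntropy_A(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|)."""
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes)

# income splits the 14 tuples into partitions of size 4 (low), 6 (medium), 4 (high)
se = split_entropy([4, 6, 4])
gain_income = 0.029                      # Gain(income), taken from the slide
print(round(se, 3))                      # ~1.557
print(round(gain_income / se, 3))        # GainRatio(income) ~ 0.019
```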
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 - Σ_{j=1}^{n} p_j², where p_j is the relative frequency of class j in D
If D is split on attribute A into two subsets D1 and D2, gini_A(D) = (|D1|/|D|)·gini(D1) + (|D2|/|D|)·gini(D2)
Reduction in impurity: Δgini(A) = gini(D) - gini_A(D)
but gini_{medium,high} is 0.30 and is thus the best split, since it is the lowest (see the sketch below)
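A small illustrative sketch (not from the slides) of the gini computations; the 9/5 class split and the arbitrary binary partition are made-up inputs:

```python
def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in D."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def gini_split(left_labels, right_labels):
    """gini_A(D) for a binary split of D into D1 and D2, weighted by partition size."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# Hypothetical example: 9 'yes' and 5 'no' tuples, split into two subsets
D = ['yes'] * 9 + ['no'] * 5
D1, D2 = D[:10], D[10:]                          # an arbitrary binary partition for illustration
print(round(gini(D), 3))                         # gini(D) = 1 - (9/14)^2 - (5/14)^2 ~ 0.459
print(round(gini(D) - gini_split(D1, D2), 3))    # reduction in impurity, Δgini
```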
• Classification
• Clustering
• Regression
• Feature Selection
• Outlier detection
Review paper:
http://www.sciencedirect.com/science/article/pii/S0031320310003973
An Example (from Wikipedia)
If A_k is continuous-valued, P(x_k|C_i) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π)·σ)) · e^(-(x-μ)² / (2σ²))
and P(x_k|C_i) = g(x_k, μ_Ci, σ_Ci)
Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no
Naïve Bayesian Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
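The following sketch (not part of the slides) recomputes these priors and the class-conditional probabilities directly from the 14-tuple table, then classifies the sample X by the naïve Bayes rule; it uses plain counting with no smoothing.

```python
from collections import Counter

# The 14 training tuples: (age, income, student, credit_rating, buys_computer)
data = [
    ('<=30', 'high',   'no',  'fair',      'no'),
    ('<=30', 'high',   'no',  'excellent', 'no'),
    ('31...40', 'high', 'no', 'fair',      'yes'),
    ('>40',  'medium', 'no',  'fair',      'yes'),
    ('>40',  'low',    'yes', 'fair',      'yes'),
    ('>40',  'low',    'yes', 'excellent', 'no'),
    ('31...40', 'low', 'yes', 'excellent', 'yes'),
    ('<=30', 'medium', 'no',  'fair',      'no'),
    ('<=30', 'low',    'yes', 'fair',      'yes'),
    ('>40',  'medium', 'yes', 'fair',      'yes'),
    ('<=30', 'medium', 'yes', 'excellent', 'yes'),
    ('31...40', 'medium', 'no', 'excellent', 'yes'),
    ('31...40', 'high', 'yes', 'fair',     'yes'),
    ('>40',  'medium', 'no',  'excellent', 'no'),
]
X = ('<=30', 'medium', 'yes', 'fair')               # tuple to classify

classes = Counter(row[-1] for row in data)          # {'yes': 9, 'no': 5}
scores = {}
for c, n_c in classes.items():
    rows_c = [row for row in data if row[-1] == c]
    score = n_c / len(data)                         # prior P(C_i)
    for k, x_k in enumerate(X):                     # naive assumption: multiply P(x_k | C_i)
        score *= sum(1 for row in rows_c if row[k] == x_k) / n_c
    scores[c] = score

print(scores)                        # P(X|yes)P(yes) ~ 0.028, P(X|no)P(no) ~ 0.007
print(max(scores, key=scores.get))   # 'yes'
```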
Advantages: easy to implement; good results obtained in most cases, comparable to decision tree and neural network counterparts
Disadvantages
Assumption: class conditional independence, therefore
loss of accuracy
Practically, dependencies exist among variables; such dependencies cannot be modeled by a naïve Bayesian classifier
How to deal with these dependencies?
Bayesian Belief Networks
Weka Demo
Vote Classification
http://people.csail.mit.edu/kersting/profile/PROFILE_nb.html
Classification:
predicts categorical class labels
x1: # of occurrences of the word “homepage”
x2: # of occurrences of the word “welcome”
Mathematically:
x ∈ X = ℝⁿ, y ∈ Y = {+1, –1}
We want a function f: X → Y
• Perceptron: update W
additively
[Figure: perceptron diagram with input x1 (image from images.google.com)]
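A rough sketch (not from the slides) of the additive perceptron update: on every misclassified example, the weight vector is nudged by the example scaled with its label. The toy data and learning-rate choice are made up.

```python
def perceptron_train(X, y, epochs=20, lr=1.0):
    """Additive perceptron update: w <- w + lr * y_i * x_i on each mistake.
    X: list of feature lists, y: labels in {+1, -1}. Bias handled via w[0]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            xi = [1.0] + list(xi)                          # prepend constant input for bias w0
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else -1
            if pred != yi:                                 # misclassified: additive update
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
    return w

# Hypothetical linearly separable data (e.g., word counts x1, x2)
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [+1, +1, -1, -1]
print(perceptron_train(X, y))
```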
A Neuron (= a perceptron)
[Figure: inputs x1, ..., xn with weights w1, ..., wn, plus a constant input 1 with bias weight w0, feed a weighted sum that passes through an activation function f to produce the output y]
For example: y = sign( Σ_{i=1}^{n} w_i x_i + w_0 )
[Figure: a multi-layer feed-forward neural network. The input vector X feeds the input layer; weights wij connect it to a hidden layer, which in turn feeds the output layer that produces the output vector]
Activation / Transfer Function
Sigmoid function: f(x) = 1 / (1 + e^(-x))
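As a small illustration (assumed, not from the slides), the sigmoid above applied as the activation of a single neuron with made-up weights:

```python
import math

def sigmoid(x):
    """Sigmoid (logistic) activation: f(x) = 1 / (1 + e^(-x)), squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(x, w, w0):
    """Output of one unit: apply the activation to the weighted sum w·x + w0."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w0)

# Hypothetical weights and input
print(neuron_output(x=[0.5, -1.0, 2.0], w=[0.8, 0.2, -0.4], w0=0.1))
```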
Weka (Java)
NNClass (C++, good performance, very fast):
http://people.cs.missouri.edu/~chengji/cheng_software.html
Matlab
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples
associated with the class labels yi
There are infinitely many lines (hyperplanes) separating the two classes, but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
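For a concrete run, a minimal sketch assuming scikit-learn (which wraps LIBSVM) is installed; the toy points are made up, and with a linear kernel the fitted model is the maximum-margin hyperplane for this data:

```python
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],      # class -1
     [4, 4], [5, 4], [4, 5]]      # class +1
y = [-1, -1, -1, +1, +1, +1]

clf = SVC(kernel='linear', C=1.0)  # linear SVM: finds the maximum-margin hyperplane
clf.fit(X, y)

print(clf.support_vectors_)        # the tuples that define the margin
print(clf.coef_, clf.intercept_)   # hyperplane w·x + b = 0
print(clf.predict([[3, 3]]))       # classify an unseen point
```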
http://www.youtube.com/watch?v=3liCbRZPrZA
SVM-light: http://svmlight.joachims.org/
LIBSVM:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Gist: http://bioinformatics.ubc.ca/gist/
Weka
SVM Website
http://www.kernel-machines.org/
Representative implementations
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
Efficiency: Uses an enhanced FP-tree that maintains the distribution of
class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
Instance-based learning:
Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
Typical approaches
k-nearest neighbor approach: instances represented as points in a Euclidean space
Locally weighted regression
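A compact k-NN sketch (not from the slides): training examples are simply stored, and a query is classified by the majority class of its k nearest neighbors in Euclidean distance. The example points are made up.

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """Lazy k-NN: store all training examples; at query time, take the majority
    class among the k examples closest to the query in Euclidean distance."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical training examples: (feature vector, class)
examples = [([1.0, 1.0], '-'), ([1.5, 2.0], '-'), ([3.0, 4.0], '+'),
            ([5.0, 5.0], '+'), ([4.5, 4.0], '+')]
print(knn_classify([4.0, 4.5], examples, k=3))   # '+'
```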
[Figure: k-nearest-neighbor illustration. A query point xq is surrounded by positive (+) and negative (_) training examples; the predicted class of xq depends on which examples fall among its k nearest neighbors]
Discussion on the k-NN Algorithm (k-NN)
Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Least-squares estimates for simple linear regression y = w0 + w1·x:
w1 = Σ_{i=1}^{|D|} (x_i - x̄)(y_i - ȳ) / Σ_{i=1}^{|D|} (x_i - x̄)²
w0 = ȳ - w1·x̄
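A short sketch (not from the slides) applying the least-squares formulas above to made-up data:

```python
def least_squares(xs, ys):
    """Estimate w1 and w0 for y = w0 + w1*x using the closed-form least-squares
    formulas: w1 = sum((x_i - x̄)(y_i - ȳ)) / sum((x_i - x̄)^2), w0 = ȳ - w1*x̄."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data roughly following y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
print(least_squares(xs, ys))
```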
X1      X2      ...   Die / Live
0.01    0.004         1
0.001   0.02          0
...
0.003   0.005         1
Binomial distribution:
For one dose level x_i, y_i = P(die|x_i); likelihood = y_i^t_i · (1 - y_i)^(1 - t_i), where t_i is the observed outcome (1 = die, 0 = live)
www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf
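A small sketch (assumed, not from the slides) that evaluates this Bernoulli likelihood, in log form, under a logistic dose-response model y_i = 1/(1 + e^-(b0 + b1·x_i)); the coefficients and data are made up and the sketch does not fit the model.

```python
import math

def p_die(x, b0, b1):
    """Logistic model for a dose level x: P(die | x) = 1 / (1 + e^-(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(doses, outcomes, b0, b1):
    """Sum of log[y_i^t_i * (1 - y_i)^(1 - t_i)] over observations,
    where t_i is 1 if the subject died and 0 otherwise."""
    total = 0.0
    for x, t in zip(doses, outcomes):
        y = p_die(x, b0, b1)
        total += t * math.log(y) + (1 - t) * math.log(1 - y)
    return total

# Hypothetical dose levels and die(1)/live(0) outcomes, with guessed coefficients
doses    = [0.01, 0.001, 0.003, 0.02, 0.05]
outcomes = [1, 0, 1, 0, 1]
print(log_likelihood(doses, outcomes, b0=-0.5, b1=30.0))
```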
Algorithm
1. Start with a single node containing all points. Calculate
mc and S.
2. If all the points in the node have the same value for all
the independent variables, stop. Otherwise, search over
the splits of all variables for the one which will reduce S
as much as possible. If the largest decrease in S would
be less than some threshold δ, or one of the resulting
nodes would contain fewer than q points, stop. Otherwise,
take that split, creating two new nodes.
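To make step 2 concrete, a sketch (not from the slides) of the split search for a single input variable: every candidate threshold is scored by how much it reduces S, the sum of squared deviations from the node mean.

```python
def sse(ys):
    """S for a node: sum of squared deviations of the y values from their mean m_c."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_regression_split(points):
    """One step of the split search: for every threshold on the single input
    variable, measure how much the split reduces S, and keep the best."""
    points = sorted(points)                       # (x, y) pairs
    S = sse([y for _, y in points])
    best = (None, 0.0)
    for i in range(1, len(points)):
        left  = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        reduction = S - (sse(left) + sse(right))
        if reduction > best[1]:
            best = ((points[i - 1][0] + points[i][0]) / 2, reduction)
    return best                                   # (split threshold, decrease in S)

# Hypothetical one-dimensional regression data
points = [(1, 2.0), (2, 2.1), (3, 1.9), (4, 8.0), (5, 8.2), (6, 7.9)]
print(best_regression_split(points))              # splits near x = 3.5
```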
              Predicted C1      Predicted C2
Actual C1     True positive     False negative
Actual C2     False positive    True negative
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets, each of approximately equal size; at the i-th iteration, use subset Di as the test set and the remaining subsets together as the training set
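A minimal sketch (not from the slides) of building the k folds; the base classifier itself is left abstract.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Randomly partition n tuple indices into k mutually exclusive,
    roughly equal-sized folds; fold i serves as the test set in iteration i."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Hypothetical use with 30 tuples and 3 folds
folds = k_fold_indices(30, k=3)
for i, test_idx in enumerate(folds):
    train_idx = [j for f in folds if f is not test_idx for j in f]
    # train a classifier on train_idx, evaluate accuracy on test_idx
    print(i, len(train_idx), len(test_idx))
```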
Ensemble methods
Use a combination of models to increase accuracy
The bagged classifier M* counts the votes and assigns the class with the most votes to X
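A bagging sketch (not from the slides): each model is trained on a bootstrap sample of the data and the bagged classifier returns the majority vote. The trivial majority-class base learner is only there to make the example runnable.

```python
import random
from collections import Counter

def bagging_predict(train, learn, classify, x, n_models=5, seed=0):
    """Bagging: learn each model on a bootstrap sample (drawn with replacement)
    of the training data, then let the models vote on the class of x."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in range(len(train))]   # bootstrap sample D_i
        model = learn(sample)                                     # any base learner
        votes.append(classify(model, x))
    return Counter(votes).most_common(1)[0][0]    # M*: class with the most votes

# Hypothetical base learner: predicts the majority class of its sample
learn = lambda sample: Counter(label for _, label in sample).most_common(1)[0][0]
classify = lambda model, x: model

train = [([1, 2], 'yes'), ([2, 1], 'yes'), ([5, 5], 'no'), ([1, 1], 'yes')]
print(bagging_predict(train, learn, classify, x=[1.5, 1.5]))
```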