What is classification?
Classification is the task of learning a target function f that maps an attribute set x to one of the predefined class labels y.
Why classification?
The target function f is known as a classification model.
- Descriptive modeling: an explanatory tool to distinguish between objects of different classes (e.g., a description of who can pay back a loan)
- Predictive modeling: predict the class of a previously unseen record
Typical applications
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
Confusion matrix for a two-class problem (f_ij = number of records of actual class i predicted as class j):

                     Predicted Class = 1   Predicted Class = 0
Actual Class = 1            f11                  f10
Actual Class = 0            f01                  f00

Accuracy = (# correct predictions) / (total # of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)
Error rate = (# wrong predictions) / (total # of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)
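As a quick check on the formulas above, a minimal sketch (function names are my own) that computes both metrics from the four counts:

```python
# Accuracy and error rate from a binary confusion matrix.
# f11, f00: correctly classified records; f10, f01: misclassified records.
def accuracy(f11, f10, f01, f00):
    total = f11 + f10 + f01 + f00
    return (f11 + f00) / total

def error_rate(f11, f10, f01, f00):
    total = f11 + f10 + f01 + f00
    return (f10 + f01) / total

print(accuracy(40, 10, 5, 45))    # 0.85
print(error_rate(40, 10, 5, 45))  # 0.15
```

Note that the two quantities always sum to 1, since every prediction is either correct or wrong.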
Decision Trees
Decision tree
A flow-chart-like tree structure:
- internal node: a test on an attribute
- branch: an outcome of the test
- leaf node: a class label or class distribution
Training Dataset
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Design issues
How should the training records be split?
How should the splitting procedure stop?
Splitting methods
- Binary attributes
- Nominal attributes
- Ordinal attributes
- Continuous attributes
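For continuous attributes, a common binary-split strategy is to try the midpoints between consecutive sorted values and keep the threshold with the lowest weighted child impurity. A small sketch, assuming Gini as the impurity measure (helper names and the data are illustrative):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return the
    threshold minimizing the weighted impurity of the two children."""
    xs = sorted(set(values))
    best_t, best_imp = None, float("inf")
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        w = (len(left) * gini_impurity(left)
             + len(right) * gini_impurity(right)) / len(values)
        if w < best_imp:
            best_t, best_imp = t, w
    return best_t, best_imp

# a clean separation between 30 and 40 gives threshold 35, impurity 0:
print(best_threshold([20, 25, 30, 40, 45, 50],
                     ["no", "no", "no", "yes", "yes", "yes"]))  # (35.0, 0.0)
```

In practice the candidate thresholds are scanned in one pass over the sorted records, updating class counts incrementally, instead of rebuilding the two partitions for every candidate.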
Gini(t) = 1 − Σ_i [p(i|t)]²

where p(i|t) is the fraction of records of class i at node t.
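A one-function sketch of the Gini computation from per-class counts (the function name is my own):

```python
def gini(counts):
    """Gini(t) = 1 - sum_i p(i|t)^2, from per-class record counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5  (maximal impurity for two classes)
print(gini([10, 0]))  # 0.0  (pure node)
```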
Impurity measures
In general, the different impurity measures are consistent with one another. The gain of a test condition compares the impurity of the parent node with the impurity of the child nodes.
Δ = I(parent) − Σ_{j=1}^{k} (N(v_j) / N) · I(v_j)

where N is the total number of records at the parent node, N(v_j) is the number of records associated with the child node v_j, and k is the number of outcomes of the test condition.
Maximizing the gain Δ is equivalent to minimizing the weighted average impurity of the child nodes. When I() = Entropy(), Δ is known as the information gain, Δ_info.
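A small sketch of the gain computation with I() = Entropy(), i.e. information gain; the example reproduces the split on "age" over the 14-record training dataset (9 yes / 5 no), and the function names are my own:

```python
from math import log2

def entropy(counts):
    """Entropy of a node from per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Delta = I(parent) - sum_j N(v_j)/N * I(v_j), with I = entropy."""
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children_counts)
    return entropy(parent_counts) - weighted

# splitting 14 records (9 yes / 5 no) on age into <=30, 31…40, >40:
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ≈ 0.247
```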
Gain ratio
Gain ratio = Δ_info / SplitInfo

SplitInfo = − Σ_{i=1}^{k} p(v_i) log₂ p(v_i), where k is the total number of splits and p(v_i) = N(v_i)/N.
If each split has the same number of records, SplitInfo = log₂ k.
A large number of splits leads to a large SplitInfo and hence a small gain ratio.
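A sketch of SplitInfo and the gain ratio (function names are my own); the uniform-split case illustrates SplitInfo = log₂ k:

```python
from math import log2

def split_info(children_sizes):
    """SplitInfo = -sum_i p(v_i) log2 p(v_i), with p(v_i) = N(v_i)/N."""
    n = sum(children_sizes)
    return -sum((s / n) * log2(s / n) for s in children_sizes if s > 0)

def gain_ratio(info_gain, children_sizes):
    """Penalize the information gain by the split's own entropy."""
    return info_gain / split_info(children_sizes)

# a uniform 4-way split has SplitInfo = log2(4) = 2:
print(split_info([5, 5, 5, 5]))  # 2.0
```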
TreeGrowth(S, F):
1. if the stopping condition is met, return a leaf node labeled with the majority class of S
2. root = createNode()
3. root.test_condition = findBestSplit(S, F)
4. V = {v | v is a possible outcome of root.test_condition}
5. for each value v ∈ V:
   a. S_v = {s | root.test_condition(s) = v and s ∈ S}
   b. child = TreeGrowth(S_v, F)
   c. add child as a descendant of root and label the edge (root → child) as v
6. return root
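The recursion above can be sketched as runnable Python, assuming entropy as the impurity measure; records are dicts, and the four-record dataset and attribute names are illustrative, not the training table above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def find_best_split(S, F):
    """Pick the attribute with the lowest weighted child entropy."""
    def weighted(f):
        groups = {}
        for s in S:
            groups.setdefault(s[f], []).append(s["label"])
        return sum(len(g) / len(S) * entropy(g) for g in groups.values())
    return min(F, key=weighted)

def tree_growth(S, F):
    labels = [s["label"] for s in S]
    # stopping condition: pure node, or no attributes left to split on
    if len(set(labels)) == 1 or not F:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f = find_best_split(S, F)
    node = {"test": f, "children": {}}
    for v in {s[f] for s in S}:           # one branch per observed outcome
        Sv = [s for s in S if s[f] == v]
        node["children"][v] = tree_growth(Sv, F - {f})
    return node

# hypothetical four-record dataset with two nominal attributes:
S = [{"student": "no",  "credit": "fair",      "label": "no"},
     {"student": "yes", "credit": "fair",      "label": "yes"},
     {"student": "yes", "credit": "excellent", "label": "yes"},
     {"student": "no",  "credit": "excellent", "label": "no"}]
tree = tree_growth(S, {"student", "credit"})
print(tree["test"])  # "student" perfectly separates the two classes
```

Each recursive call removes the attribute it split on, so the depth is bounded by the number of attributes; real implementations also stop early (minimum node size, impurity threshold) to limit overfitting.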
Overfitting
Overfitting: when the model is too complex, the training error is small but the test error is large.
Underfitting: when the model is too simple, both training and test errors are large.
A lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly: an insufficient number of training records in the region causes the decision tree to classify test examples using other training records that are irrelevant to the classification task.
The border line between two neighboring regions of different classes is known as the decision boundary. In decision trees the decision boundary is parallel to the axes, because each test condition involves a single attribute at a time.
The test condition may involve multiple attributes:
- more expressive representation
- not all datasets can be partitioned optimally using test conditions that involve a single attribute
- finding the optimal test condition is computationally expensive