
Study of Decision Trees

August 12, 2012

Contents
1 Introduction
2 Introduction To Decision Tree Classifiers
   2.1 Stopping Criteria
3 Splitting Criteria
   3.1 Information Gain
   3.2 Gain Ratio
   3.3 Gini Index
4 Construction of Decision Trees
5 Missing Attribute Values
6 Noise In Attributes
   6.1 Decision Tree Pruning
   6.2 Sub-Tree Replacement
   6.3 Sub-Tree Raising
   6.4 Error Measure
7 ID3 Algorithm
   7.1 Example
8 C4.5 Algorithm
   8.1 The C4.5 induction system
   8.2 Continuous valued attributes
9 CART
10 SPRINT

1 Introduction

Decision tree learning algorithms can be regarded as general-purpose, knowledge-based learning systems. Their primary task is to synthesize decision trees for classification by inductive inference from examples. The trees are constructed beginning at the root and proceeding down to the leaf nodes, an approach known as top-down construction of decision trees.

The objects from which a classification rule is developed are described by the values of a set of properties or attributes, and the decision trees are expressed in terms of these same attributes. Each attribute measures some important feature of an object, and each object belongs to one of a set of mutually exclusive classes. The induction task is to develop a classification rule that can determine the class of any object from the values of its attributes.

The learning strategy is as follows. The system is presented with a data set consisting of attribute values together with an associated class label indicating the class of the object that the attributes describe. A decision tree is constructed from the top down, guided by frequency information in the examples but not by the particular order in which the examples are given. The product of learning is a piece of procedural knowledge that can assign an unseen object to one of a specified number of disjoint classes; the acquired knowledge is represented as a decision tree. The data set of relevant cases used for the learning task is called the training set. An induction algorithm, or more concisely an inducer (also known as a learner), is an entity that takes a training set and forms a model that generalizes the relationship between the input attributes and the target attribute. Decision tree inducers are algorithms that automatically construct a decision tree from a given dataset; typically the goal is to find the optimal decision tree by minimizing the generalization (classification) error.

2 Introduction To Decision Tree Classifiers

A decision tree is a classification model in the form of a tree consisting of decision nodes and leaves. A leaf specifies a class value. A decision node specifies a test that chooses between a number of alternatives; the test is performed on one of the predictive attributes, called the attribute selected at the node. For each possible outcome of the test there is a child node. The weight of a child is the fraction of the training cases reaching the decision node that satisfy the test outcome of that child, and the largest child is the child node with the largest weight. A case is classified by following the path from the root to a leaf according to the decision node tests evaluated on the case's attribute values.

Consider an arbitrary collection C of objects from a training set described by attributes that take only nominal values, where each object belongs to one of a set of mutually exclusive classes. If C is empty or contains only objects of one class, the simplest decision tree is just a leaf labeled with the class. Otherwise, let T be any test on an object with possible outcomes O1, O2, ..., Ow. Each object in C yields one of these outcomes for T, so T produces a partition C1, C2, ..., Cw of C, with Ci containing those objects having outcome Oi. If each subset Ci could be replaced by a decision tree for Ci, the result would be a decision tree for all of C. As long as two or more of the Ci are nonempty, each Ci is smaller than C, so every split leaves a simpler subproblem. In the worst case the strategy keeps splitting until the partition becomes trivial. Provided that a test can always be found that gives a non-trivial partition of any set of objects, this procedure will always produce a decision tree that correctly classifies each object in C. This phase is referred to as the growing phase of the decision tree, and it continues until a stopping criterion is triggered.

The top-down approach therefore operates by recursively selecting an attribute to split on and partitioning the training examples with respect to that attribute. We start at the root of the tree, select an attribute, and associate it with an internal node. We then split the data according to the evaluation test (splitting criterion) on the values that the attribute takes. Multiple children are obtained after the split, and we recursively repeat the operations of selecting an attribute and splitting until a stopping criterion is satisfied.
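As a concrete illustration of the top-down procedure just described, the following minimal Python sketch grows a tree recursively. The nested-dictionary representation and the names grow_tree, select_attribute and stop are illustrative choices, not taken from any particular system.

```python
# A minimal sketch of top-down decision tree growing (names are illustrative).
from collections import Counter

def grow_tree(examples, attributes, select_attribute, stop, depth=0):
    """Recursively partition `examples` (pairs of attribute-dict and class label)."""
    labels = [label for _, label in examples]
    # Stop: pure node, no attributes left, or an external stopping criterion fires.
    if len(set(labels)) <= 1 or not attributes or stop(examples, depth):
        majority = Counter(labels).most_common(1)[0][0] if labels else None
        return {"leaf": majority}
    best = select_attribute(examples, attributes)          # splitting criterion
    node = {"test": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    for value in {attrs[best] for attrs, _ in examples}:   # one child per outcome
        subset = [(a, c) for a, c in examples if a[best] == value]
        node["children"][value] = grow_tree(subset, remaining,
                                            select_attribute, stop, depth + 1)
    return node
```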

2.1 Stopping Criteria

The growing phase continues until a stopping criterion is triggered. Common stopping rules include the following:

- All instances in the training set reaching the node belong to a single class, so the node becomes a leaf asserting that class.
- The maximum tree depth has been reached.
- The best splitting criterion value is not greater than a certain threshold.

A minimal check of these rules is sketched below.
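The sketch below checks the rules listed above; the thresholds and the name should_stop are illustrative, and a function of this shape could serve as the stop callback in the earlier growing sketch.

```python
def should_stop(examples, depth, max_depth=5, min_gain=1e-3, best_gain=None):
    """Common stopping rules (thresholds are illustrative, not prescribed)."""
    labels = {label for _, label in examples}
    if len(labels) <= 1:                                  # node is pure
        return True
    if depth >= max_depth:                                # maximum depth reached
        return True
    if best_gain is not None and best_gain <= min_gain:   # best split too weak
        return True
    return False
```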

3 Splitting Criteria

A splitting criterion is an evaluation function for selecting the best attribute-based test to form the root of a (sub)tree. In general the best test is the one whose split leaves each resulting node Ci empty or containing only objects of one class. In most decision tree inducers the discrete splitting functions are univariate: an internal node is split according to the value of a test on a single attribute, so the inducer searches for the best attribute upon which to perform the split. There are various univariate criteria, which can be characterized in different ways: impurity-based criteria, normalized impurity-based criteria, and binary criteria, with specific measures such as information gain, gain ratio, the Gini index, dependence, and distance.

3.1 Information Gain

Information gain is an impurity-based criterion that uses entropy as the impurity measure. Let C contain p objects of class P and n objects of class N. An arbitrary object will be determined to belong to class P with probability P1 = p/(p + n) and to class N with probability N1 = n/(p + n). When a decision tree is used to classify an object it returns a class, so a decision tree can be regarded as a source of a message P or N. The entropy of the source, i.e. the expected information needed to generate this message, is

I(p, n) = Entropy(C) = -P1 log2 P1 - N1 log2 N1

If all the objects belong to a single class (P or N) we obtain a low entropy value; if the class labels are highly mixed we obtain a high entropy value. Entropy is bounded between 0 and log2 M, where M is the number of distinct classes, and is therefore a measure of the impurity of the training examples.

Now consider splitting the data set on attribute A. If attribute A with values A1, A2, ..., Av is used for the root of the decision tree, it partitions C into C1, C2, ..., Cv, where Ci contains those objects in C that have value Ai of A. Let Ci contain pi objects of class P and ni of class N. The expected information required for the subtree for Ci, i.e. the entropy of Ci, is I(pi, ni). The tree rooted at A has v branches, one per attribute value, and the expected information required for the tree with A as root is the weighted average of the subtree entropies,

E(C, A) = Σ_{i=1..v} (|Ci| / |C|) I(pi, ni)

where |·| denotes the cardinality of a set. This is the conditional entropy of C given attribute A, also written E(C|A). The information gain is the expected reduction in entropy (uncertainty) due to sorting on A, i.e. due to knowing the value of attribute A:

Gain(C, A) = Entropy(C) - Σ_{v ∈ Values(A)} (|Cv| / |C|) Entropy(Cv)

A low gain indicates that the average entropy after the split is nearly the same as the entropy before the split, so very little information is gained: even after the split, the instances reaching each child node do not predominantly belong to a single class. A high gain indicates that the average entropy of the children after the split is low, so each child contains instances belonging mostly to a single class, as desired. We therefore choose the attribute that yields the highest information gain: we examine all candidate attributes, choose the attribute A that maximizes Gain(C, A), form the tree as above, and then use the same process recursively to form decision trees for the residual subsets C1, C2, ..., Cv. For a given root node I(p, n) is constant, so maximizing Gain(C, A) is equivalent to minimizing E(C, A).
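The entropy and gain formulas above can be computed directly. The sketch below assumes the same (attribute-dictionary, class-label) representation of examples used in the earlier growing sketch; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """I(p, n) generalised to any number of classes: -sum_i f_i log2 f_i."""
    if not labels:
        return 0.0
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(C, A) = Entropy(C) - sum_v |C_v|/|C| * Entropy(C_v)."""
    labels = [label for _, label in examples]
    before = entropy(labels)
    total = len(examples)
    after = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [label for attrs, label in examples if attrs[attribute] == value]
        after += len(subset) / total * entropy(subset)
    return before - after
```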

3.2 Gain Ratio

The gain ratio is also an information-based measure. One disadvantage of the gain criterion used above is that it is biased towards attributes with many values. Let A be an attribute with values A1, A2, ..., Av, and let A' be an attribute formed from A by splitting one of the values into two, so that A' has values A11, A12, A2, ..., Av. If the values of A were sufficiently fine for the induction task at hand, we would not expect this refinement to increase the usefulness of A. If the class proportions are the same in both subdivisions of the original value, the entropy of the subdivisions remains the same and the gains of A and A' are equal. If the proportions are not the same, the uncertainty decreases and the gain of A' will be greater than that of A. Thus attributes with more values tend to be preferred to attributes with fewer. An alternative measure is the gain ratio, which normalizes the gain by the information contained in the split itself:

GainRatio(C, A) = Gain(C, A) / SplitInfo(C, A), where SplitInfo(C, A) = -Σ_{v ∈ Values(A)} (|Cv|/|C|) log2(|Cv|/|C|)

This value is not always defined (the split information may be zero), and it may tend to favor attributes for which the split information is small.
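A corresponding sketch of the gain ratio, reusing the information_gain helper from the Section 3.1 sketch and the same example representation; split_information is an illustrative name for the normalizing term.

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """-sum_v |C_v|/|C| log2(|C_v|/|C|): the entropy of the partition itself."""
    total = len(examples)
    counts = Counter(attrs[attribute] for attrs, _ in examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(examples, attribute):
    si = split_information(examples, attribute)
    if si == 0:                      # all cases share one value: ratio undefined
        return 0.0
    return information_gain(examples, attribute) / si   # helper from Section 3.1
```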

3.3 Gini Index

The Gini index is an impurity-based criterion that measures the divergence between the probability distributions of the target attribute's values. A Gini coefficient of zero expresses perfect equality (all values are the same), and a coefficient of one expresses maximal inequality among values: low coefficients correspond to a more equal distribution and high coefficients to a more unequal one.

Consider the probability distribution of the target attribute's values. The goodness of a split on an attribute can be measured by the reduction in impurity of the target attribute after partitioning the training set S: we measure the Gini index of the target distribution at the node, we measure the weighted average Gini index of the children obtained by splitting on a particular attribute, and we take the difference of the two. A high difference indicates that the split removes much of the impurity; a low difference indicates that the impurity is not reduced significantly by the split. We therefore evaluate this reduction for all candidate attributes and choose the attribute giving the largest reduction in the Gini index.

To compute the Gini impurity of a set of items, suppose the target y takes values in {1, 2, ..., m}, and let fi be the fraction of items labeled with value i in the set. Then

IG(f) = Σ_{i=1}^{m} fi (1 - fi) = Σ_{i=1}^{m} (fi - fi²) = Σ_{i=1}^{m} fi - Σ_{i=1}^{m} fi² = 1 - Σ_{i=1}^{m} fi²

The average Gini index after a split is the weighted average of the Gini indices of the children nodes.
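The Gini impurity and the impurity reduction used to score a split can be sketched in the same style as the entropy-based helpers; names and data representation are illustrative, as before.

```python
from collections import Counter

def gini(labels):
    """I_G(f) = 1 - sum_i f_i^2."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_gain(examples, attribute):
    """Reduction in Gini impurity obtained by splitting on `attribute`."""
    labels = [label for _, label in examples]
    total = len(examples)
    weighted = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [label for attrs, label in examples if attrs[attribute] == value]
        weighted += len(subset) / total * gini(subset)
    return gini(labels) - weighted
```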

4 Construction of Decision Trees

During the construction of decision trees an important question is whether the attributes provide sufficient information to construct the tree. In particular, if the data set contains two objects that have identical values for every attribute and yet belong to different classes, it is clearly impossible to differentiate between these objects with reference only to the given attributes; in such a case the attributes are termed inadequate for the induction task. If the attributes are adequate, it is always possible to construct a decision tree that correctly classifies every object in the provided data set. Such decision trees are not unique: many different trees may produce the same result. The choice of test (splitting criterion) and of a suitable stopping criterion is crucial if the decision tree is to be simple. Tests are restricted to branching on the values of an attribute, so choosing a test comes down to selecting an attribute for the root of the (sub)tree. Different algorithms are constructed from different selection and stopping criteria.

5 Missing Attribute Values

In the training data there may be cases in which some attribute values are missing. Simple approaches to this issue are to fill in the unknown value from information available in the context, or to treat "unknown" as an additional value of the attribute and construct the tree with it. These approaches tend to perform poorly because they inflate the apparent information gain of attributes with missing values. A better strategy is the following: when an attribute has been chosen by the selection criterion, objects with unknown values of that attribute are discarded before forming decision trees for the subsets, which appropriately reduces the information gain attributed to such attributes. When a constructed decision tree is asked to classify an object with a missing attribute value, the value could in principle be anything, but the statistics of the training data tell us that some values are more probable than others. Each branch is therefore explored with a token weight equal to the case's weight multiplied by the branch's observed ratio, so that instead of a single path to a leaf there are multiple weighted paths; the token weights are summed per class and the class with the highest total is chosen.
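A hedged sketch of this classification-time treatment of unknown values: the tree layout follows the earlier growing sketch, and branch_weight is an assumed per-node table holding each branch's share of training cases, not a field of any real system.

```python
from collections import defaultdict

def classify_with_missing(tree, case, weight=1.0, votes=None):
    """Send a case down every branch of a test whose value is missing,
    weighting each branch by its training-frequency ratio
    (stored here as tree['branch_weight'] -- an assumed field)."""
    if votes is None:
        votes = defaultdict(float)
    if "leaf" in tree:
        votes[tree["leaf"]] += weight
        return votes
    attr = tree["test"]
    value = case.get(attr)                        # None means the value is missing
    if value in tree["children"]:
        classify_with_missing(tree["children"][value], case, weight, votes)
    else:                                         # unknown value: explore all outcomes
        for v, child in tree["children"].items():
            classify_with_missing(child, case, weight * tree["branch_weight"][v], votes)
    return votes

# Predicted class = the class with the largest summed token weight:
# max(classify_with_missing(tree, case).items(), key=lambda kv: kv[1])[0]
```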

6 Noise In Attributes

So far it has been assumed that the data provided are ideal. In real situations this assumption does not hold: there may be errors in attribute values or misclassifications in the data set. The two main consequences of noisy data are that the data set may become inadequate and that the decision tree may become unnecessarily complex, so modifications to the algorithm are required for it to operate under noise.

First, the algorithm must be able to work with inadequate attributes. In this case the output is not a class label but a likelihood indicating how probable it is that an unknown object belongs to a specified class.

Second, due to noise the tree may become very large. The algorithm must be able to decide that testing further attributes will not improve the predictive accuracy of the decision tree; it must not increase the tree's complexity just to accommodate a single noisy case. This is also necessary because, apart from providing good accuracy on the training data, the tree must work well on unseen data: an approach that accommodates every training case is not required, and a simple tree generalizes better.

Let C be a collection of objects containing representatives of both classes P and N, and let A be an attribute with random values that produces subsets C1, C2, ..., Cv. Unless the proportion of class P objects in each of the Ci is exactly the same as the proportion of class P objects in C itself, branching on attribute A will give an apparent information gain. It will therefore appear that testing attribute A is a sensible step, even though the values of A are random and cannot help to classify the objects in C.

One solution to this dilemma might be to require that the information gain of any tested attribute exceed some absolute or percentage threshold. Experiments with this approach suggest that a threshold large enough to screen out irrelevant attributes also excludes relevant ones, and the performance of the tree-building procedure is degraded in the noise-free case. If the information gain is not high enough, the class proportions in the children after the split are almost the same as in the parent node, so the split does not improve accuracy, and reaching a non-trivial tree may require a large number of such splits, which is undesirable.

An alternative method based on the chi-square test for stochastic independence has been found to be more useful. This statistic can be used to determine the confidence with which one can reject the hypothesis that A is independent of the class of the objects in C. Suppose attribute A produces subsets C1, C2, ..., Cv of C, where Ci contains pi and ni objects of class P and N, respectively. If the value of A is irrelevant to the class of an object in C, the expected value p'i of pi is p · |Ci|/|C|, and the expected value n'i of ni is n · |Ci|/|C|.

The statistic is defined by

χ² = Σ_{i=1}^{v} [ (pi - p'i)² / p'i + (ni - n'i)² / n'i ]

This statistic measures the relevance of the values of A to the class of the objects in C: a high value gives high confidence that A is not independent of the class, while a low value means the hypothesis of independence cannot be rejected. The tree-building procedure can then be modified to prevent testing any attribute whose irrelevance cannot be rejected with a very high confidence level (e.g. 99%).

A collection C of objects may contain representatives of both classes, yet further testing of C may be ruled out, either because the attributes are inadequate and unable to distinguish among the objects in C, or because each attribute has been judged irrelevant to the class of objects in C. In this situation it is necessary to produce a leaf labelled with class information even though the objects in C are not all of the same class. There are two approaches to the problem: the first assigns to each class a probability (likelihood) that the object belongs to it, and the second assigns the class with the highest probability. The first approach minimizes the sum of the squares of the errors over the objects in C, while the second minimizes the sum of the absolute errors over the objects in C. If the aim is to minimize expected error, the second approach might be anticipated to be superior, and indeed this has been found to be the case.
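The chi-square statistic above is straightforward to compute from the per-branch class counts; the sketch below is illustrative, and the critical value quoted in the comment is the standard 99% point for one degree of freedom.

```python
def chi_square(subsets):
    """subsets: list of (p_i, n_i) class counts per branch.
    Returns sum_i (p_i - p'_i)^2 / p'_i + (n_i - n'_i)^2 / n'_i."""
    p = sum(pi for pi, _ in subsets)
    n = sum(ni for _, ni in subsets)
    total = p + n
    stat = 0.0
    for pi, ni in subsets:
        size = pi + ni
        p_exp = p * size / total            # expected p'_i under independence
        n_exp = n * size / total            # expected n'_i under independence
        if p_exp:
            stat += (pi - p_exp) ** 2 / p_exp
        if n_exp:
            stat += (ni - n_exp) ** 2 / n_exp
    return stat

# Compare against a chi-square critical value with (v - 1) degrees of freedom
# for two classes, e.g. about 6.63 at 99% confidence when v = 2.
```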

6.1 Decision Tree Pruning

The aim of pruning is to prevent overfitting to noise in the data. The growing phase of a decision tree continues until a stopping criterion is triggered. Employing tight stopping criteria tends to create small and underfitted decision trees, while loose stopping criteria tend to generate large decision trees that are overfitted to the training set. To resolve this dilemma, pruning methodologies were developed that use a loose stopping criterion, allow the decision tree to overfit the training set, and then cut the overfitted tree back into a smaller tree by removing sub-branches that do not contribute to generalization accuracy. Various studies have shown that pruning methods can improve the generalization performance of a decision tree, especially in noisy domains.

Pruning decision trees is a fundamental step in optimizing both the computational efficiency and the classification accuracy of a model. Applying pruning methods to a tree usually reduces the size of the tree (the number of nodes), avoiding unnecessary complexity and over-fitting of the data set when classifying new data. The process of pruning traditionally begins at the bottom of the tree (at the leaves) and propagates upwards; a decision tree must be partially or completely induced for pruning to occur. The overlying principle of pruning is to compare the amount of error that a decision tree would suffer before and after each possible prune, and then to decide so as to maximally avoid error. The metric used to describe the possible error, denoted the error estimate E, is calculated as

E = (e + 1) / (N + M)

where e is the number of misclassified examples at the node, N is the number of examples reaching the node, and M is the total number of training examples.

Post-pruning is applied to a fully induced decision tree and sifts through it to remove statistically insignificant nodes. Working from the bottom up, the probability (or relative frequency) of sibling leaf nodes is compared, and overwhelming dominance of a certain leaf node results in pruning of that node in one of several ways.
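A minimal sketch of bottom-up subtree replacement driven by the error estimate E above. The node fields n_reaching, n_errors, n_errors_as_leaf and majority_class are assumed bookkeeping added during tree growth, not part of any published interface.

```python
def error_estimate(e, n, m_total):
    """E = (e + 1) / (N + M), the estimate described above."""
    return (e + 1) / (n + m_total)

def subtree_error(node, m_total):
    """Estimated error of a (sub)tree: leaves use E directly, internal
    nodes take the weighted sum of their children's estimates."""
    if "leaf" in node:
        return error_estimate(node["n_errors"], node["n_reaching"], m_total)
    n = node["n_reaching"]
    return sum(c["n_reaching"] / n * subtree_error(c, m_total)
               for c in node["children"].values())

def replace_subtrees(node, m_total):
    """Bottom-up subtree replacement: collapse a node into a majority-class
    leaf whenever the leaf's estimated error is no worse than the subtree's."""
    if "leaf" in node:
        return node
    node["children"] = {v: replace_subtrees(c, m_total)
                        for v, c in node["children"].items()}
    as_leaf = {"leaf": node["majority_class"],
               "n_reaching": node["n_reaching"],
               "n_errors": node["n_errors_as_leaf"]}
    leaf_err = error_estimate(as_leaf["n_errors"], as_leaf["n_reaching"], m_total)
    if leaf_err <= subtree_error(node, m_total):
        return as_leaf
    return node
```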

6.2 Sub-Tree Replacement

The error estimate of each child node is calculated and used to derive the total error of the parent node. Post-pruning in the C4.5 algorithm evaluates the decision error (estimated percentage of misclassifications) at each decision junction and propagates this error up the tree. At each junction, the algorithm compares the weighted error of the child nodes against the misclassification error that would result if the child nodes were deleted and the decision node were assigned the class label of the majority class. If the data are adequate, the unpruned decision tree produces no error on the training data.

Consider the contact-lenses database. With no pruning:

Number of leaves: 9
Size of the tree: 15
Correctly classified instances: 24 (100%)
Incorrectly classified instances: 0 (0%)

With pruning:

Number of leaves: 4
Size of the tree: 7
Correctly classified instances: 22 (91.6667%)
Incorrectly classified instances: 2 (8.3333%)

The error measure depends on the number of misclassified instances out of all the instances reaching a node; even children with 0 misclassifications on the training data receive a non-zero error estimate. We take the weighted sum of the children's error estimates: if the children have lower error we do not prune, otherwise we prune the subtree and replace it with a single leaf node labelled with the majority class. This pruning method is called subtree replacement. A good example with illustrations of pruning is provided at http://web.cs.wpi.edu/~cs4445/a08/Projects/Project1/khasawneh_proj2_report.pdf

6.3 Sub-Tree Raising

Subtree raising is more complex and is also used by C4.5. With the subtree raising operation we raise a subtree to replace its parent node (that is, a sub-subtree replaces its parent), and we then reclassify the instances from the other branches of the former parent into the leaf nodes of the raised subtree. Subtree raising is time-consuming and is therefore often restricted to raising the subtree of the most popular branch. The general procedure is the same as for subtree replacement: working from the bottom up, we prune until the decision is made not to.

6.4 Error Measure

Both pruning methods require a decision: whether to replace an internal node with a leaf (subtree replacement), or whether to replace an internal node with one of the nodes below it (subtree raising). To make this decision we need to estimate the error at internal nodes and at leaf nodes. In general the error on a test set will be greater than the error on the training set. The misclassification error on the training data is taken as an approximation of the actual error, including error on out-of-sample data, because only the training data are available to us. C4.5 uses the upper limit of a confidence interval for the error on the training data as the error estimate. For a random variable X with zero mean, a confidence interval is given by Pr[-z ≤ X ≤ z] = c, or equivalently c = 1 - 2 Pr[X ≥ z], where the distribution is assumed to be Gaussian with zero mean and unit variance. Consider the misclassification rate f = S/N, where S is the number of misclassifications and N is the number of instances at the node. Modelling each classification as a Bernoulli trial with probability p of being misclassified, the observed error rate f has mean p and variance p(1 - p)/N; for large enough N the distribution of f can be approximated by a normal distribution.


If we take the confidence level to be 25%, we find the value z such that Pr[X ≥ z] = c; from this we can solve for the value of p consistent with the observed f at the given confidence level, assuming a standard normal distribution. This upper limit is the error estimate of the node. The pruning decision is then made by comparing the estimated error of the un-replaced (or un-raised) tree with that of the replaced (or raised) subtree. The smaller the confidence level, the larger (more pessimistic) the estimated error.
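A sketch of the upper confidence limit described above, using the normal approximation. The closed form below is the commonly cited Wilson-style upper bound rather than a formula taken from the C4.5 source, and the 25% default mirrors the confidence level mentioned in the text.

```python
from math import sqrt
from statistics import NormalDist

def pessimistic_error(e, n, confidence=0.25):
    """Upper limit of the confidence interval for the training error f = e/n,
    under the normal approximation to the binomial."""
    f = e / n
    z = NormalDist().inv_cdf(1 - confidence)       # z such that Pr[X >= z] = confidence
    num = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return num / (1 + z * z / n)

# Example: 2 errors out of 6 training cases at the default 25% confidence
# gives an estimate of about 0.47, well above the observed rate of 0.33.
print(pessimistic_error(2, 6))
```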

7 ID3 Algorithm

ID3 is an algorithm for building unpruned decision trees where the attributes take only nominal values and no attribute values are missing. Using information gain as the splitting criterion, ID3 stops growing when all instances belong to a single value of the target feature or when the best information gain is not greater than zero. ID3 is a greedy algorithm that constructs simple decision trees, but the approach it uses cannot guarantee that better trees have not been overlooked.

The basic structure of ID3 is iterative. A subset of the training set called the window is chosen at random and a decision tree is formed from it; this tree correctly classifies all objects in the window. All other objects in the training set are then classified using the tree. If the tree gives the correct answer for all of these objects, it is correct for the entire training set and the process terminates. If not, a selection of the incorrectly classified objects is added to the window and the process continues. In this way, correct decision trees have been found after only a few iterations.

The aim of the learning system is to construct a decision tree that correctly classifies not only the objects in the data set but also unseen objects it may encounter. Given two decision trees that produce identical results on the provided data set but may perform differently on unseen data, a choice must be made as to which tree to select so that it generalizes well. A brute-force approach to the induction task would be to generate all possible trees that correctly classify the data set and select the simplest according to some predefined criterion; the number of such trees is finite but very large, so this is not practical when the number of attributes is large and the data set contains many objects. A correct decision tree is usually found more quickly by the iterative windowing method than by forming a tree directly from the entire training set. It has been noted, however, that the iterative framework cannot be guaranteed to converge on a final tree unless the window can grow to include the entire training set.
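The windowing loop can be sketched as follows; build_tree and classify stand in for the core ID3 routines, and the window size and growth policy are illustrative.

```python
import random

def id3_with_window(training_set, build_tree, classify, window_size=20):
    """Iterative windowing: grow a tree from a random window, then add
    misclassified cases until the tree fits the whole training set."""
    window = random.sample(training_set, min(window_size, len(training_set)))
    while True:
        tree = build_tree(window)
        misclassified = [case for case in training_set
                         if classify(tree, case[0]) != case[1]]
        if not misclassified:
            return tree                              # correct on the entire set
        window += misclassified[:window_size]        # add a selection of the errors
```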

7.1 Example

Consider the following PlayTennis training data set:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

We calculate the information gain of each attribute:

Attribute     Information Gain
Outlook       0.2467
Windy         0.0481
Humidity      0.151
Temperature   0.0292

Since Outlook has the highest gain, it is selected as the root node and a split condition is established on it. This splits the set C[9,5] into three subsets: C1[2,3] (Sunny), C2[4,0] (Overcast) and C3[3,2] (Rain). For each subset we calculate the information gain of the remaining attributes and repeat the same procedure. The final decision tree has Outlook at the root; the Overcast branch is a pure Yes leaf, the Sunny branch splits further on Humidity, and the Rain branch splits further on Wind.
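The Outlook figure of 0.2467 can be checked by hand from the class counts of the three subsets; the small self-contained calculation below uses H as a two-class entropy helper.

```python
import math

def H(p, n):
    """Two-class entropy I(p, n)."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x)

# Root: 9 Yes, 5 No.  Outlook partitions C[9,5] into
# Sunny[2,3], Overcast[4,0], Rain[3,2].
root = H(9, 5)                                          # about 0.940
weighted = 5/14 * H(2, 3) + 4/14 * H(4, 0) + 5/14 * H(3, 2)
print(round(root - weighted, 4))                        # 0.2467 = Gain(C, Outlook)
```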

In the ID3 algorithm the entire training set is not used directly for construction of the decision tree: a subset of the data (the window) is used to build the tree, and the tree classifies all objects in the window correctly. The remaining data are then presented to the tree, and if any misclassifications occur, the misclassified entries are added to the window and the process continues.

8 C4.5 Algorithm

C4.5 is a program for inducing classification rules in the form of decision trees from a set of given examples. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The input to C4.5 consists of a collection of training cases, each having a tuple of values for a fixed set of attributes (or independent variables) A = {A1, A2, ..., Ak} and a class attribute (or dependent variable). An attribute Aa is described as continuous or discrete according to whether its values are numeric or nominal. The class attribute C is discrete and has values C1, C2, ..., Cx. The goal is to learn from the training cases a function that maps attribute values to a predicted class.

C4.5 made a number of improvements to ID3, including:

- Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes.
- Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the cases into those whose attribute value is above the threshold and those whose value is less than or equal to it [3].
- Handling training data with missing attribute values: C4.5 allows attribute values to be marked as "?" for missing; missing attribute values are simply not used in the gain and entropy calculations.
- Handling attributes with differing costs.

8.1 The C4.5 induction system

The C4.5 system consists of the following operations: selecting a C4.5-type split for a given dataset. A parameter of the split condition is the minimum number of instances that must occur in at least two of the subsets induced by the split.

8.2 Continuous valued attributes
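The threshold-based handling of numeric attributes described among the C4.5 improvements can be sketched as follows. Taking candidate thresholds midway between consecutive distinct values is a common simplification rather than C4.5's exact rule, and the helper names are illustrative.

```python
import math
from collections import Counter

def _H(labels):
    """Multi-class entropy of a label list."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values_and_labels):
    """Candidate thresholds lie between consecutive distinct sorted values;
    keep the one whose binary split (<= t vs > t) gives the highest gain."""
    pts = sorted(values_and_labels)                 # [(numeric value, class), ...]
    labels = [label for _, label in pts]
    best_t, best_gain = None, -1.0
    for i in range(len(pts) - 1):
        if pts[i][0] == pts[i + 1][0]:
            continue
        t = (pts[i][0] + pts[i + 1][0]) / 2
        left, right = labels[:i + 1], labels[i + 1:]
        gain = _H(labels) - (len(left) * _H(left) + len(right) * _H(right)) / len(labels)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```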

9 CART

CART stands for Classification and Regression Trees. It was developed by [Breiman et al. (1984)] and is characterized by the fact that it constructs binary trees: each internal node has exactly two outgoing edges. The splits are selected using the twoing criterion and the resulting tree is pruned by cost-complexity pruning. When provided, CART can consider misclassification costs in the tree induction, and it also enables users to supply a prior probability distribution. An important feature of CART is its ability to generate regression trees, in which the leaves predict a real number rather than a class. In the regression case, CART looks for splits that minimize the prediction squared error (the least-squared deviation), and the prediction in each leaf is based on the weighted mean of the target values of the cases reaching that node.
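A least-squared-deviation split of the kind CART's regression trees search for can be sketched for a single numeric attribute; the helper below is illustrative and ignores the twoing criterion and cost-complexity pruning.

```python
def best_regression_split(xs, ys):
    """Choose the threshold on a numeric attribute that minimizes the summed
    squared error around each side's mean (the mean is the leaf prediction)."""
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(xs, ys))
    best_t, best_err = None, float("inf")
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        err = sse(left) + sse(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```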

10 SPRINT

