
Data Structures

Notes for Lecture 13


Techniques of Data Mining
By
Samaher Hussein Ali
2007-2008

Classification: Basic Concepts


1. Classification: Definition

• Given a collection of records (the training set), where each record contains a set of attributes, one of
which is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided
into training and test sets, with the training set used to build the model and the test set used to
validate it (a small illustration of this workflow follows below).
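This workflow can be sketched as follows. The example is not part of the original notes; it assumes scikit-learn is available and uses the iris data set purely as a placeholder for the collection of records.

# Build a model on the training set, then validate it on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # attributes X, class attribute y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))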
2. Illustrating Classification Task

3. Classification Techniques

• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
3.1. Decision Tree
Decision trees are one of the fundamental techniques used in data mining. They are tree-like
structures used for classification, clustering, feature selection, and prediction. Decision trees are easily
interpretable and intuitive for humans. They are well suited for high-dimensional applications. Decision
trees are fast and usually produce high-quality solutions. Decision tree objectives are consistent with the
goals of data mining and knowledge discovery. This lecture reviews the concept of decision trees in data
mining.
A decision tree consists of a root and internal nodes. The root and the internal nodes are labeled with
questions in order to find a solution to the problem under consideration. The root node is the first state of
a DT. All of the examples from the training data are assigned to this node. If all examples belong to the
same group, no further decisions need to be made to split the data set. If the examples in this node
belong to two or more groups, a test is made at the node that results in a split. A DT is binary if each
node is split into two parts, and it is nonbinary (multi-branch) if each node is split into three or more
parts.
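The node structure described here can be written down as a small record type. The sketch below is illustrative only (the names are not from the lecture): each internal node stores the test it asks and one child per outcome, so a binary DT has two children per internal node and a multi-branch DT has more.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    test: Optional[str] = None                               # question asked at this node (None for a leaf)
    children: Dict[str, "Node"] = field(default_factory=dict) # one child per outcome of the test
    label: Optional[str] = None                               # class label if this node is a leaf

    def is_leaf(self) -> bool:
        return not self.children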
A decision tree model consists of two parts: creating the tree and applying the tree to the database. To
achieve this, decision trees use several different algorithms. The most widely used algorithms are
ID3, C4.5, and C5.0. Example:

Another Example of Decision Tree

We can construct a decision tree from a set T of training cases as follows:


Let the classes be denoted by {C1, C2, ..., Cn}. There are three possibilities:
(i) T contains one or more cases, but all belonging to a single class Cj. The decision tree for T is a leaf
identifying class Cj.
(ii) T contains no cases. The decision tree is also a leaf in this case, but the class to be associated with
the leaf must be determined from sources other than T.
(iii) T contains cases that belong to a mixture of classes. T is partitioned into subsets T1, T2, ..., Tk,
where Ti contains all cases in T that have outcome Oi of the chosen test. The decision tree for T consists
of a decision node identifying the test, and one branch for each possible outcome. This process is
applied recursively to each subset of the training cases, so that the ith branch leads to the decision tree
constructed from the subset Ti of the training cases.
Generally, a decision tree algorithm is most appropriate for the third case. In this case, the decision tree
algorithm can be stated as follows:
• From the training data set, identify a target variable and a set of input variables.
• Examine each input variable one at a time:
• Create two or more groupings of the values of the input variables, and measure how similar
items are within each group and how different items are between groups.
• Select the grouping that maximizes similarity within groupings and differences between
groupings.

• Once the groupings have been calculated for each input variable, select the single input variable that
maximizes similarity within groupings and differences between groupings.
This process is repeated in each group that still contains a significant portion of the information in the
original data, and it does not terminate until all divisible groups have been divided. A compact sketch of
this procedure is given below.
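The sketch below is a self-contained toy version of this split-selection procedure (it is not from the lecture). It approximates "similarity within a group" by the fraction of the group's majority class; any other purity measure could be substituted.

from collections import Counter

def purity(group):
    # Fraction of the most frequent class among (record, class_label) pairs.
    counts = Counter(label for _, label in group)
    return max(counts.values()) / len(group)

def split_by(records, attr):
    # Create one grouping of the records for each observed value of the input variable.
    groups = {}
    for rec in records:
        groups.setdefault(rec[0][attr], []).append(rec)
    return groups

def best_attribute(records, attrs):
    # Select the single input variable whose groupings have the highest weighted purity.
    def score(attr):
        groups = split_by(records, attr).values()
        return sum(len(g) / len(records) * purity(g) for g in groups)
    return max(attrs, key=score)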

3.1.1. ID3 Algorithm


Given below is the ID3 decision tree algorithm, which describes the general layout of DT algorithms. This
algorithm uses the gain ratio as its evaluation criterion.
The gain criterion selects a test to maximize the mutual information between the test and the class.
The process of determining the gain for a test is as follows:
Imagine selecting one case at random from a set S of cases and announcing that it belongs to some class
Cj. Let freq(Cj, S) denote the frequency of class Cj cases in set S, so that this message has the
probability

    freq(Cj, S) / |S|
The information the message conveys is defined by

    -log2( freq(Cj, S) / |S| )  bits
The expected information from such a message pertaining to class membership is the sum over the
classes in proportion to their frequencies in S; that is,

    info(S) = - Σj ( freq(Cj, S) / |S| ) × log2( freq(Cj, S) / |S| )
When applied to the set of training cases, info(T) measures the average amount of information needed
to identify the class of a case in set T. This quantity is also known as the entropy of the set T. Now
consider a similar measurement after T has been partitioned into subsets Ti in accordance with the n
outcomes of a test X. The expected information requirement is the weighted sum over the n subsets:

    info_X(T) = Σ(i=1..n) ( |Ti| / |T| ) × info(Ti)
The quantity

    gain(X) = info(T) - info_X(T)

measures the information that is gained by partitioning T in accordance with the test X.
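These measures can be computed directly. The sketch below is an illustrative helper (not from the lecture) that represents a set of cases by its list of class labels; the toy example at the end uses a 14-case set and the printed values are approximate.

import math
from collections import Counter

def info(labels):
    # info(T): average number of bits needed to identify the class of a case in T (the entropy).
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_x(partitions):
    # info_X(T): weighted entropy after partitioning T by a test X.
    # `partitions` is a list of label lists, one per outcome of X.
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * info(p) for p in partitions)

def gain(labels, partitions):
    # gain(X) = info(T) - info_X(T)
    return info(labels) - info_x(partitions)

# Toy example: 9 "yes" and 5 "no" cases, split by a three-outcome test.
T = ["yes"] * 9 + ["no"] * 5
parts = [["yes", "yes", "no", "no", "no"],
         ["yes"] * 4,
         ["yes", "yes", "yes", "no", "no"]]
print(info(T), gain(T, parts))   # about 0.940 and 0.247 bits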


Even though the gain criterion yields good results, it has a serious deficiency: it is biased towards
tests with many outcomes. The bias in the gain criterion can be corrected by normalizing the apparent
gain of tests. By analogy, the definition of split info is given by

    split info(X) = - Σ(i=1..n) ( |Ti| / |T| ) × log2( |Ti| / |T| )
This represents the "potential information generated by dividing T into n subsets, whereas the
information gain measures the information relevant to classification that arises from the same division."
Then,

    gain ratio(X) = gain(X) / split info(X)

expresses the proportion of the information generated by the split that appears useful for classification.
The gain ratio criterion selects a test to maximize the ratio above, subject to the constraint that the
information gain must be large, at least as large as the average gain over all tests examined.
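Continuing the illustrative helpers sketched above (and assuming the info and info_x functions from that sketch are in scope), split info and gain ratio can be written as:

import math

def split_info(partitions):
    # Potential information generated by dividing T into the subsets Ti.
    total = sum(len(p) for p in partitions)
    return -sum((len(p) / total) * math.log2(len(p) / total) for p in partitions if p)

def gain_ratio(labels, partitions):
    # gain(X) / split info(X); returns 0 when the split generates no information.
    si = split_info(partitions)
    return (info(labels) - info_x(partitions)) / si if si > 0 else 0.0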

ID3 Decision Tree Algorithm

Given: Examples (S); Target attribute (C); Attributes (R)

Initialize Root
Function ID3 (S, C, R)
Create a Root node for the tree
IF S is empty, return a single node with value Failure;
IF all records in S have the same value of C, return a single node labeled with that value;
IF R is empty, return a single node labeled with the most frequent value of the target attribute C in S;
ELSE
BEGIN
Let D be the attribute with the largest Gain Ratio(D, S) among the attributes in R;
Let {dj | j = 1, 2, ..., n} be the values of attribute D;
Let {Sj | j = 1, 2, ..., n} be the subsets of S consisting respectively of the records
with value dj for attribute D;
Return a tree with root labeled D and arcs d1, d2, ..., dn going respectively
to the trees ID3(S1, C, R-{D}), ID3(S2, C, R-{D}), ..., ID3(Sn, C, R-{D});
For each branch in the tree,
IF Sj is empty, add a leaf node labeled with the most frequent value of C in S;
END ID3
Return Root
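A rough executable sketch of this procedure (mine, not from the lecture) is given below. Records are represented as dictionaries mapping attribute names to values; the resulting tree is a nested dictionary, and leaves are plain class labels.

import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    return -sum((n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())

def gain_ratio(rows, attr, target):
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    total = len(rows)
    info_x = sum(len(g) / total * entropy(g, target) for g in groups.values())
    split = -sum((len(g) / total) * math.log2(len(g) / total) for g in groups.values())
    return (entropy(rows, target) - info_x) / split if split > 0 else 0.0

def id3(rows, target, attrs):
    if not rows:
        return "Failure"                                         # S is empty
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:
        return classes[0]                                        # all records share one class: a leaf
    if not attrs:
        return Counter(classes).most_common(1)[0][0]             # R is empty: most frequent class
    best = max(attrs, key=lambda a: gain_ratio(rows, a, target)) # attribute D with the largest gain ratio
    tree = {best: {}}
    for value in {r[best] for r in rows}:                        # one arc dj per observed value of D
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attrs if a != best])
    return tree

# Hypothetical call: tree = id3(records, "play", ["outlook", "humidity", "wind"])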

3.1.2. Decision Tree Classification Task

A. Apply Model to Test Data

[Figure: panels (A) through (F), showing the model applied to the test data; images not reproduced.]