
ILA-2: AN INDUCTIVE LEARNING ALGORITHM FOR KNOWLEDGE DISCOVERY

MEHMET R. TOLUN
Department of Computer Engineering, Eastern
Mediterranean University, Gazimagusa, Turkey
HAYRI SEVER
Department of Computer Science, Hacettepe University,
Ankara, Turkey
MAHMUT ULUDAG
Information Security Research Institute, Gebze, Kocaeli,
Turkey
SALEH M. ABU-SOUD
Department of Computer Science, Princess Sumaya
University College for Technology, Amman, Jordan

In this paper we describe the ILA-2 rule induction algorithm, an improved version of the novel inductive learning algorithm ILA. We first outline the basic ILA algorithm and then present how it is improved using a new evaluation metric that handles uncertainty in the data. Through this soft-computing metric, users can reflect their preferences via a penalty factor that controls the performance of the algorithm. ILA-2 also offers a faster pass criterion, not available in basic ILA, which reduces processing time without sacrificing much accuracy.

Address correspondence to Mehmet R. Tolun, Department of Computer Engineering, Eastern Mediterranean University, Gazimagusa, T.R.N.C. Turkey. E-mail: tolun.compenet.emu.edu.tr
Cybernetics and Systems: An International Journal, 30:609-628, 1999
Copyright © 1999 Taylor & Francis
0196-9722/99 $12.00 + .00


We experimentally show that the performance of ILA-2 is comparable
to that of well-known inductive learning algorithms, namely, CN2, OC1,
ID3, and C4.5.

A knowledge discovery process involves extracting valid, previously unknown, potentially useful, and comprehensible patterns from large databases. As described in Fayyad (1996) and Simoudis (1996), this process is typically made up of selection and sampling, preprocessing and cleaning, transformation and reduction, data mining, and evaluation steps. The first step in the data mining process is to select a target data set from a database and possibly to sample the target data. The preprocessing and data cleaning step handles noise and unknown values, as well as accounting for missing data fields, time sequence information, and so forth. Data reduction and transformation involve finding relevant features depending on the goal of the task, performing certain transformations on the data, such as converting one type of data to another (e.g., changing nominal values into numeric ones, or discretizing continuous values), and/or defining new attributes. In the mining step, the user may apply one or more knowledge discovery techniques on the transformed data to extract valuable patterns. Finally, the evaluation step involves interpreting the results (or discovered patterns) with respect to the goal or task at hand. Note that the data mining process is not linear; it involves a variety of feedback loops, as any one step can result in changes in preceding or succeeding steps. Furthermore, the nature of a real-world data set, which may contain noisy, incomplete, dynamic, redundant, continuous, and missing values, makes all steps critical on the path from data to knowledge (Deogun et al., 1997; Matheus et al., 1993).
One of the methods in the data mining step is inductive learning, which is mostly concerned with finding general descriptions of a concept from a set of training examples. Practical data mining tools generally employ a number of inductive learning algorithms. For example, Silicon Graphics' data mining and visualization product MineSet™ uses MLC++ as a base for its induction and classification algorithms (Kohavi et al., 1996). This work focuses on establishing causal relationships from attribute values to class labels via a heuristic search that starts with values of individual attributes and continues to consider double, triple, or further combinations of attribute values in sequence until the example set is covered.


Once a new learning algorithm has been introduced, it is not unusual to see follow-up contributions that extend the algorithm in order to improve it in various ways. These improvements are important in establishing a method and clarifying when it is and is not useful. In this paper, we propose extensions to improve upon such a new algorithm, namely the inductive learning algorithm (ILA) introduced by Tolun and Abu-Soud (1998). ILA is a consistent inductive algorithm that operates on nonconflicting example sets. It extracts a rule set covering all instances in a given example set. A condition (also known as a description) is defined as a pair of an attribute and one of its values. Conditions are the building blocks of the antecedent of a rule, in which one or more conditions may be conjoined with one another; the consequent of the rule is associated with a particular class. The representation of rules in ILA is suitable for data exploration in that a description set in its simplest form is generated to distinguish each class from the others.
The induction bias used in ILA selects a rule for a class from a set of promising rules if and only if the coverage proportion of the rule¹ is maximum. To implement this bias, ILA runs in a stepwise forward iteration that cycles as many times as the number of attributes, until all positive examples of a single class are covered. Each iteration searches for a description (or a combination of descriptions) that covers a relatively larger number of training examples of a single class than the other candidates do. Having found such a description (or combination), ILA generates a rule whose antecedent consists of that description. It then marks the examples covered by the rule just generated so that they are not considered in further iteration steps.
The first modification we propose is the ability to deal with uncertain data. In general, two different sources of uncertainty can be distinguished. One of these is noise, defined as nonsystematic errors in gathering or entering data. Noise may arise from incorrect recording or transcription of data, or from incorrect measurement or perception at an earlier stage. The second situation occurs when the descriptions of examples are insufficient to induce certain rules. In this paper, a data set is called inconsistent if the descriptions of its examples are not sufficient to induce rules. This case is also known as incomplete data, to point out the fact that some relevant features for extracting nonconflicting class descriptions are missing. In real-world problems this often constitutes the greatest source of error, because data are usually organized and collected around the needs of organizational activities, which results in data that are incomplete from the knowledge discovery point of view. Under such circumstances, the knowledge discovery model should have the capability of providing approximate decisions with some confidence level.

¹The coverage proportion of a rule is computed as the proportion of positive (and no negative) examples covered over the size of the description (i.e., the number of conjuncts is maximum amongst the others).
The second modification is a greedy rule generation bias that reduces learning time at the cost of an increased number of generated rules; this feature is discussed in the next section. ILA-2 is an extension of ILA with respect to the modifications stated above. We have empirically compared ILA-2 with ILA using real-world data sets. The results show that ILA-2 is better than ILA in terms of accuracy in classifying unseen instances, size of the classifiers, and learning time. Some well-known inductive learning algorithms are also compared with our own: ID3 (Quinlan, 1986), C4.5 and C4.5rules (Quinlan, 1993), OC1 (Murthy et al., 1994), and CN2 (Clark & Niblett, 1989). Test results with unseen examples also show that ILA-2 is comparable to both the CN2 and C4.5 algorithms.
The organization of this paper is as follows. In the following section we briefly introduce the ILA algorithm, together with its execution on an example task. In the next section, modifications to the ILA algorithm are described, followed by a section on the time complexity analysis of ILA-2. Finally, ILA-2 is empirically compared with five well-known induction algorithms over 19 different domains.
THE INDUCTIVE LEARNING ALGORITHM
ILA works in an iterative fashion. In each iteration the algorithm searches for a rule that covers a large number of training examples of a single class. Having found a rule, ILA first removes the covered examples from further consideration by marking them, and then appends the rule to the end of its rule set. In other words, the algorithm works on a rules-per-class basis: for each class, rules are induced to separate the examples in that class from the examples in the other classes. This produces an ordered list of rules rather than a decision tree. The details of the ILA algorithm are given in Figure 1.

Figure 1. ILA inductive learning algorithm.

A good description is a conjoined pair of attributes and their values such that it covers some positive examples and none of the negative examples for a given class. The goodness measure assesses the extent of goodness by returning the good description with the maximum occurrences in positive examples. ILA constructs production rules in a general-to-specific way, i.e., starting off with the most general rule possible and producing specific rules whenever deemed necessary.


The advantages of ILA can be stated as follows:

- The rules are in a suitable form for data exploration, namely a description of each class in the simplest form that enables it to be distinguished from the other classes.
- The rule set is ordered in a modular fashion, which enables focusing on a single rule at a time. Direct rule extraction is preferred over decision trees, as the latter are hard to interpret, particularly when there is a large number of nodes.

Description of the ILA Algorithm with a Running Example


In describing ILA we shall make use of a simple training set. Consider the training set for object classification given in Table 1, consisting of seven examples with three attributes and a class attribute with two possible values.

Table 1. Object classification training set (Thornton, 1992)

Example no.   Size     Color   Shape    Class
1             medium   blue    brick    yes
2             small    red     wedge    no
3             small    red     sphere   yes
4             large    red     wedge    no
5             large    green   pillar   yes
6             large    red     pillar   no
7             large    green   sphere   yes

Let us trace the execution of the ILA algorithm for this training set. After reading the object data, the algorithm starts with the first class (yes) and generates hypotheses in the form of descriptions, as shown in Table 2. A description is a conjunction of attribute-value pairs; descriptions are used to form the left-hand sides of rules in the rule generation step.

For each description the numbers of positive and negative instances are found. Descriptions 6 and 8 are the ones with the most positives and no negatives. Since description 6 comes first, it is selected by default.


Table 2. The first set of descriptions

No.   Description      True Positive   False Negative
1     size = medium    1               0
2     color = blue     1               0
3     shape = brick    1               0
4     size = small     1               1
5     color = red      1               3
6     shape = sphere   2               0
7     size = large     2               2
8     color = green    2               0
9     shape = pillar   1               1

Hence, the following rule is generated:

Rule 1: IF shape = sphere THEN class is yes.

Upon generation of Rule 1, instances 3 and 7, which are covered by that rule, are marked as classified. These instances are no longer taken into account in hypothesis generation. In the next pass, the algorithm generates the descriptions shown in Table 3.

Table 3. The second set of descriptions

No.   Description      True Positive   False Negative
1     size = medium    1               0
2     color = blue     1               0
3     shape = brick    1               0
4     size = large     1               2
5     color = green    1               0
6     shape = pillar   1               1

In the second hypothesis space there are four descriptions of equivalent quality: 1, 2, 3, and 5. The system selects the first of these for new rule generation:

Rule 2: IF size = medium THEN class is yes.

Example 1, covered by this rule, is marked as classified, and the next set of descriptions is generated (Table 4).


Table 4. The third set of descriptions

No.   Description      True Positive   False Negative
1     size = large     1               2
2     color = green    1               0
3     shape = pillar   1               1

This time only the second description satisfies the ILA quality criterion, and it is used to generate the following rule:

Rule 3: IF color = green THEN class is yes.

Since example 5 is covered by this rule, it is marked as classified. All examples of the current class (yes) are now marked as classified. The algorithm continues with the next class (no), generating the following rules:

Rule 4: IF shape = wedge THEN class is no.
Rule 5: IF color = red AND size = large THEN class is no.

The algorithm stops when all of the examples in the training set are marked as classified, i.e., all the examples are covered by the current rule set.
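The trace above can be reproduced with a short sketch of the covering loop. This is a hedged reconstruction of basic ILA from the description in this section, not the authors' implementation; attribute and example order follow Table 1, so ties resolve the same way as in the tables.

```python
from itertools import combinations

# Training data of Table 1: (attribute-value dict, class label).
examples = [
    ({"size": "medium", "color": "blue",  "shape": "brick"},  "yes"),
    ({"size": "small",  "color": "red",   "shape": "wedge"},  "no"),
    ({"size": "small",  "color": "red",   "shape": "sphere"}, "yes"),
    ({"size": "large",  "color": "red",   "shape": "wedge"},  "no"),
    ({"size": "large",  "color": "green", "shape": "pillar"}, "yes"),
    ({"size": "large",  "color": "red",   "shape": "pillar"}, "no"),
    ({"size": "large",  "color": "green", "shape": "sphere"}, "yes"),
]

def matches(desc, example):
    return all(example[a] == v for a, v in desc)

def ila(examples):
    attributes = list(examples[0][0])
    classes = list(dict.fromkeys(c for _, c in examples))
    rules = []
    for cls in classes:
        neg = [e for e, c in examples if c != cls]
        uncovered = [e for e, c in examples if c == cls]
        size = 1  # number of attribute-value pairs per description
        while uncovered and size <= len(attributes):
            best, best_tp = None, 0
            # enumerate candidate descriptions from the unmarked positives
            for e in uncovered:
                for combo in combinations(attributes, size):
                    desc = tuple((a, e[a]) for a in combo)
                    if any(matches(desc, n) for n in neg):
                        continue  # a good description covers no negatives
                    tp = sum(matches(desc, p) for p in uncovered)
                    if tp > best_tp:  # strict '>' keeps the first among ties
                        best, best_tp = desc, tp
            if best is None:
                size += 1  # nothing qualified: try longer conjunctions
            else:
                rules.append((best, cls))
                uncovered = [e for e in uncovered if not matches(best, e)]
    return rules
```

On this data the sketch generates Rules 1-5 above, in the same order.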
EXTENSIONS TO ILA
Two main problems of ILA are overfitting and long learning times. The overfitting problem is due to ILA's bias toward generating a classifier that is consistent with the training data; training data, however, often include noisy examples, which causes overfitting in the generated classifiers. We have developed a novel heuristic function that counteracts this bias in the presence of noisy examples. We also propose another modification that makes ILA faster by considering the possibility of generating more than one rule per iteration step.

The ILA algorithm and its extensions have been implemented using the source code of the C4.5 programs. Therefore, in addition to the extensions stated above, the algorithm has also been enhanced with some features of C4.5, such as rule sifting and default class selection (Quinlan, 1993). During the classification process, the ILA system first


extracts a set of rules using the ILA algorithm. In the next step, the sets of extracted rules for the classes are ordered to minimize false-positive errors, and then a default class is chosen. The default class is the one with the most instances not covered by any rule; ties are resolved in favor of more frequent classes. Once the class order and the default class have been established, the rule set is subjected to a postpruning process: if there are one or more rules whose omission would actually reduce the number of classification errors on training cases, the first such rule is discarded and the set is checked again. This last step allows a final global scrutiny of the rule set as a whole for the context in which it will be used.
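The default-class choice described here (most uncovered instances, ties broken by class frequency) can be sketched as follows; `covers` is a hypothetical predicate standing in for the induced rule set, not part of the ILA-2 code.

```python
from collections import Counter

def default_class(examples, covers):
    # examples: list of (instance, label); covers(x) -> True if any rule fires on x
    uncovered = Counter(label for x, label in examples if not covers(x))
    frequency = Counter(label for _, label in examples)
    # pick the class with the most uncovered instances;
    # ties are resolved in favor of the more frequent class
    return max(frequency, key=lambda c: (uncovered[c], frequency[c]))
```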
The ILA-2 algorithm can also handle continuous features through discretization, using the entropy-based algorithm of Fayyad and Irani (1993). This algorithm uses a recursive entropy-minimization heuristic for discretization and couples it with a minimum description length criterion (Rissanen, 1986) to control the number of intervals produced over the continuous space. In the original paper by Fayyad and Irani, this method was applied locally at each node during tree generation; it was later found to be quite promising as a global discretization method (Ting, 1994). We have used the implementation of Fayyad and Irani's discretization algorithm provided with the MLC++ library (Kohavi et al., 1996).
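The core splitting step of the Fayyad-Irani method can be illustrated in miniature. This sketch keeps only the entropy-minimizing search for a single cut point and omits both the recursion and the MDL stopping criterion, so it is an illustration of the heuristic, not the full algorithm.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    # try boundaries between adjacent distinct sorted values and keep the
    # cut that minimizes the weighted class entropy of the two sides
    pairs = sorted(zip(values, labels))
    best, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_e = e
            best = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint cut
    return best
```

For two well-separated class clusters the cut lands between them, e.g. `best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])` returns 6.5.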
The Novel Evaluation Function
In general, an evaluation function's score for a description should increase in proportion to both the number of positive instances covered, denoted by TP, and the number of negative instances not covered, denoted by TN. In order to normalize the score, a simple metric takes into account the total numbers of positive instances, P, and negative instances, N. It is given in Langley (1996) as

(TP + TN) / (P + N),

where the resulting value ranges between 0 (when no positives and all negatives are covered) and 1 (when all positives and no negatives are covered). This ratio may be used to measure the overall classification accuracy of a description on the training data.
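As a quick check of the stated range, the metric can be written out directly (a sketch of the formula, not of any implementation from the paper):

```python
# Langley's normalized score: (TP + TN) / (P + N), ranging from 0 to 1.
def normalized_score(tp, tn, p, n):
    return (tp + tn) / (p + n)
```

With P = 4 and N = 3 (the "yes" class of Table 1), a description covering all positives and no negatives scores (4 + 3) / 7 = 1.0, while one covering no positives and all negatives scores 0.0.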


Now we may turn to the description evaluation metric used in ILA, which can be expressed as follows: the description should not occur in any of the negative examples of the current class AND must be the one with the maximum occurrences in positive examples of the current class.

Since this metric assumes no uncertainty to be present in the training data, it searches a given description space to extract a rule set that classifies the training data perfectly. It is, however, a well-known fact that an application targeting real-world domains should address how to handle uncertainty. Generally, uncertainty-tolerant classification requires relaxing the constraint that the induced descriptions must classify the consistent part of the training data (Clark & Niblett, 1989), which is equivalent to saying that classification methods should generate almost-true rules (Matheus et al., 1993). Considering this point, a noise-tolerant version of the ILA metric has been developed.
The above idea is also supported by one of the guiding principles of soft computing: "Exploit the tolerance for imprecision, uncertainty, and partial truth to achieve tractability, robustness, and low solution cost" (Zadeh, 1994). Using these ideas, a new quality criterion is established in which the quality of a description D is calculated by the following heuristic:

Quality(D) = TP − PF × FN,

where TP (true positives) is the number of correctly identified positive examples, FN (false negatives) is the number of wrongly identified negative examples, and PF (penalty factor) is a user-defined parameter. Note that, to reason about uncertain data, the evaluation measure of ILA-2 maximizes the quality value of a rule for a given PF.

The PF determines the negative effect of FN examples on the quality of descriptions. It is related to the well-known sensitivity measure used for accuracy estimation. Usually, sensitivity (Sn) is defined as the proportion of correctly predicted items to the number of items covered, i.e.,

Sn = TP / (TP + FN).


Sensitivity may be equivalently rewritten in terms of the penalty factor (PF) as

Sn = PF / (PF + 1).

User-defined PF values may be converted to the sensitivity measure using the above equation.
As seen in Figure 2, the sensitivity value approaches one as the PF increases. The advantage of the new heuristic may be seen in an example case. With a PF of five, suppose we have two descriptions, one with TP = 70 and FN = 2, the other with TP = 6 and FN = 0. The ILA quality criterion selects the second description. The soft criterion, however, selects the first one, which is intuitively more predictive than the second.
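The heuristic and the PF-to-sensitivity conversion can be checked numerically; this is a sketch of the two formulas above, not of the ILA-2 code itself.

```python
def quality(tp, fn, pf):
    # ILA-2 quality heuristic: Quality(D) = TP - PF * FN
    return tp - pf * fn

def sensitivity(pf):
    # Sn = PF / (PF + 1)
    return pf / (pf + 1)

# The example above, with PF = 5:
d1 = quality(70, 2, 5)   # 70 - 5*2 = 60
d2 = quality(6, 0, 5)    # 6 - 5*0 = 6
# ILA-2 prefers d1, while strict ILA would reject it for covering 2 negatives.
```

Note that `sensitivity(5)` is about 0.83, the moderate setting used later in the experiments.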
When PF approaches the number of training examples, selection with the new formula becomes the same as with the ILA criterion. On the other hand, a zero PF means the number of negative training examples has no importance and only the number of positive examples is considered, which is quite an optimistic choice.
Figure 2. Penalty factor vs. sensitivity values.

Properties of classifiers generated with different values of PF for the splice data set are given in Table 5. The number of rules and the average number of conditions in the rules increase as the PF increases; as seen in Table 5, ILA-2 with smaller penalty factors constructs smaller classifiers.

Table 5. Results for the splice data set with different values of the penalty factor

Penalty   Number     Average number   Total number    Accuracy on     Accuracy on
factor    of rules   of conditions    of conditions   training data   test data
1         13         2.0              26              82.4%           73.4%
2         31         2.3              71              88.7%           81.6%
3         38         2.3              86              94.7%           87.6%
4         50         2.5              125             96.1%           87.2%
5         53         2.4              128             97.9%           86.9%
7         63         2.5              158             98.7%           85.4%
10        66         2.5              167             99.6%           88.5%
30        87         2.6              228             100.0%          71.7%
ILA       91         2.6              240             100.0%          67.9%

Figure 3. Accuracy values on training and test data for the splice data set.

As seen from Figure 3, for all reasonable values of PF (1-30) we get better results than ILA in terms of estimated accuracy. For this data set we get the maximum estimated accuracy when PF is 10; increasing the PF further does not yield better classifiers. However, when we look at the accuracy on training data, we see a proportional increase in accuracy as the PF increases.
The Faster Pass Criterion
In each iteration, after an exhaustive search for good descriptions in the search space, ILA selects only one description to generate a new rule. This is an expensive way of extracting rules. If there exists more than one description with the same quality, then all of these descriptions may be used for rule generation. This second approach tends to decrease the processing time; on the other hand, it might produce redundant rules, increasing the size of the output rule set.

The above idea was implemented in the ILA system, with the option that activates this feature referred to as FastILA. For example, in the case of the promoter data set, FastILA reduced the processing time from 17 seconds to 10 seconds, while the number of final rules decreased by one and the total number of conditions by two. The experiments show that when the size of the rule set is not extremely important and less processing time is desirable, the FastILA option is the more suitable choice.
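A minimal sketch of the difference between the two pass criteria, assuming the candidate descriptions have already been scored with the quality heuristic:

```python
def select_descriptions(scored, fast=False):
    # scored: list of (description, quality) pairs, assumed precomputed
    top = max(q for _, q in scored)
    winners = [d for d, q in scored if q == top]
    # basic ILA asserts a single best description per pass;
    # the faster pass criterion asserts every tied winner at once
    return winners if fast else winners[:1]
```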
As seen in Table 6, ILA-2 (i.e., ILA with the faster pass criterion) generates a higher number of initial rules than ILA, about 5.5 times as many on average over the evaluation data sets, because the faster pass criterion permits more than one rule to be asserted at once. However, after the rule generation step is finished, the sifting process removes all the unnecessary rules.
Table 6. Effect of the FastILA option in terms of four different parameters

              Initial rules    Final rules    Estimated acc. (%)   Time (seconds)
Training set  ILA    ILA-2     ILA   ILA-2    ILA      ILA-2       ILA     ILA-2
Lenses        6      9         6     5        50       62.5        1       1
Monk1         23     30        21    22       100      94.4        1       1
Monk2         61     101       48    48       78.5     81.3        3       2
Monk3         23     37        23    23       88.2     86.3        1       1
Mushroom      24     55        16    15       100      100         1476    1105
Parity5+5     35     78        15    14       50.8     50.8        9       5
Tic-tac-toe   26     31        26    26       98.1     98.1        90      52
Vote          33     185       27    22       96.3     94.8        31      25
Zoo           9      55        9     9        91.2     85.3        1       0
Splice        93     825       91    76       67.9     97.7        1569    745
Coding        141    1073      108   112      68.7     100         1037    345
Promoter      14     226       14    13       100      100         17      10
Totals        488    2703      404   383      85.25    90.58       4236    2290

THE TIME COMPLEXITY ANALYSIS

In the time complexity analysis of ILA, only upper bounds for the critical components of the algorithm are provided, because the overall complexity of any algorithm is domain-dependent. Let us define the following notation:
e is the number of examples in the training set.
Na stands for the number of attributes.
c stands for the number of class attribute values.
j is the number of attributes in the descriptions.
S is the number of descriptions in the current test description set.
The following procedure is applied when ILA-2 runs for a classification task. The ILA-2 algorithm comprises two main loops. The outer loop is performed once for each class attribute value, which takes time O(c). In the loop, first the size of the descriptions for the hypothesis space (also known as the version space) is set to one. A description is a combination of attribute-value pairs; descriptions are used as the left-hand sides of the IF-THEN rules in the rule generation step. Then an inner loop is executed. The execution of the inner loop continues until all examples of the current class are covered by the generated rules or the maximum size for the descriptions is reached; thus it takes time O(Na).

Inside the inner loop, first a set of temporary descriptions is generated for the given description size using the training examples of the current class, O((e/c)·Na). Then occurrences of these descriptions are counted, O(e·S). Next, good descriptions maximizing the ILA quality measure are searched for, O(S). If some good descriptions are found, they are used to generate new rules, and the examples matching these new rules are marked as classified, O(e/c). When no description passes the quality criterion, the size of the descriptions for the next iteration is incremented by one. The steps of the algorithm are given in short notation as follows:
for each class                                            // O(c)
{
    for each attribute                                    // O(Na)
    {
        Hypothesis generation in the form of descriptions;  // O((e/c)·Na)
        Frequency counting for the descriptions;            // O(e·S)
        Evaluation of descriptions;                         // O(S)
        Marking covered instances;                          // O(e/c)
    }
}
Therefore, the overall time complexity is given by

O(c · Na · ((e/c)·Na + e·S + S + e/c))
= O(e·Na² + c·Na·S·(e + 1) + e·Na).

Replacing e + 1 with e, the complexity becomes

O(e·Na² + c·Na·e·S + e·Na)
= O(e·Na·(Na + c·S + 1)).

As the remaining constant term is comparatively small, this is

O(e·Na·(Na + c·S)).

Usually c·S is much larger than Na; in the experiments, for example, we selected S as 1500 while the maximum Na was only 60. In addition, c is comparatively smaller than S. Therefore, we may simplify the complexity to

O(e·Na·S).

The time complexity of ILA is thus linear in the number of attributes and the number of examples. The size of the hypothesis space (S) also affects the processing time linearly.
EVALUATION OF ILA-2
For the evaluation of ILA-2 we have mainly used two parameters: classifier size and accuracy. The classifier size is the total number of conditions of the rules in the classifier. For decision-tree algorithms, classifier size refers to the number of leaf nodes in the decision tree, i.e., the number of regions into which the tree divides the data. Accuracy is the estimated accuracy on test data. We have used the hold-out method to estimate the future prediction accuracy on unseen data.
We have used 19 different training sets from the UCI repository (Merz & Murphy, 1997). Table 7 summarizes the characteristics of these data sets. To test the algorithms' ability to classify unseen examples, a simple practice is to reserve a portion of the data as a separate test set that is not used in building the classifiers. In the experiments we have employed the test sets associated with these training sets in the UCI repository to estimate the accuracy of the classifiers.
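The hold-out protocol is simple enough to state as code; a sketch under the assumption that the classifier is a plain function from instance to label.

```python
def holdout_split(examples, test_fraction=0.3):
    # reserve the tail of the data as a test set not used for training
    cut = int(len(examples) * (1 - test_fraction))
    return examples[:cut], examples[cut:]

def estimated_accuracy(classifier, test_set):
    # fraction of held-out examples the classifier labels correctly
    correct = sum(classifier(x) == y for x, y in test_set)
    return correct / len(test_set)
```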
In selecting PF values of 1 and 5, we have considered the two ends and the middle of the PF spectrum. At the higher end we observe that ILA-2 performs like basic ILA; for this reason we have not included higher PF values in the experiments. The lower end of the spectrum is zero, which eliminates the meaning of the PF; therefore PF = 1 is included in the experiments to show the results for the most relaxed (error-tolerant) case. The PF = 5 case is selected because it represents a moderate setting, corresponding to a sensitivity value of 0.83 in training.

Table 7. Characteristic features of the tested data sets

Domain name   Number of    Examples in     Number of      Examples in
              attributes   training data   class values   test data
Lenses        4            16              3              8
Monk1         6            124             2              432
Monk2         6            169             2              432
Monk3         6            122             2              432
Mushroom      22           5416            2              2708
Parity5+5     10           100             2              1024
Tic-tac-toe   9            638             2              320
Vote          16           300             2              135
Zoo           16           67              7              34
Splice        60           700             3              3170
Coding        15           600             2              5000
Promoter      57           106             2              40
Australia     14           460             2              230
Crx           15           465             2              187
Breast        10           466             2              233
Cleve         13           202             2              99
Diabetes      8            512             2              256
Heart         13           180             2              90
Iris          10           100             3              50
We ran three different versions of the ILA algorithm on these data sets: ILA-2 with the penalty factor set to 1, ILA-2 with the penalty factor set to 5, and the basic ILA algorithm. In addition, we ran three decision-tree algorithms and two rule induction algorithms: ID3 (Quinlan, 1986), C4.5 and C4.5rules (Quinlan, 1993), OC1 (Murthy et al., 1994), and CN2 (Clark & Niblett, 1989). Algorithms other than ILA-2 were run with the default settings supplied with their systems. The estimated accuracies of the generated classifiers on the test sets are given in Table 8, on which our comparative interpretation is based. The ILA algorithms have higher accuracies than OC1 on 13 of the 19 domain test sets. Compared to CN2, ILA-2 performs better on 11 of the 19 domain test sets, and the two have similar accuracies in two other domains. The ILA algorithms also performed marginally better than C4.5 across the 19 test sets, producing higher accuracies in 10 domains.

Table 8. Estimated accuracies of various learning algorithms on selected domains

Domain name   ILA-2   ILA-2   ILA    ID3    C4.5-    OC1    C4.5-   CN2
              PF=1    PF=5                  pruned          rules
Lenses        62.5    50      50     62.5   62.5     37.5   62.5    62.5
Monk1         100     100     100    81.0   75.7     91.2   93.5    98.6
Monk2         59.7    66.7    78.5   69.9   65.0     96.3   66.2    75.4
Monk3         100     87.7    88.2   91.7   97.2     94.2   96.3    90.7
Mushroom      98.2    100     100    100    100      99.9   99.7    100
Parity5+5     50.0    51.1    51.2   50.8   50.0     52.4   50.0    53.0
Tic-tac-toe   84.1    98.1    98.1   80.9   82.2     85.6   98.1    98.4
Vote          97.0    96.3    94.8   94.1   97.0     96.3   95.6    95.6
Zoo           88.2    91.2    91.2   97.1   85.3     73.5   85.3    82.4
Splice        73.4    86.9    67.9   89.0   90.4     91.2   92.7    84.5
Coding        70.0    70.7    68.7   65.7   63.2     65.9   64.0    100
Promoter      97.5    97.5    100    100    95.0     87.5   97.5    100
Australia     83.0    76.5    82.6   81.3   87.0     84.8   88.3    82.2
Crx           80.2    78.1    75.4   72.5   83.0     78.5   84.5    80.0
Breast        95.7    96.1    95.3   94.4   95.7     95.7   94.4    97.0
Cleve         70.3    76.2    76.2   64.4   77.2     79.2   82.2    68.3
Diabetes      71.5    73.8    65.6   62.5   69.1     70.3   73.4    70.7
Heart         60.0    82.2    84.4   75.6   83.3     78.9   84.4    77.8
Iris          96.0    94.0    96.0   94.0   92.0     96.0   92.0    94.0
Table 9 shows the sizes of the output classifiers generated by the same algorithms for the same data sets. The results in the table show that ILA-2 (ILA with the novel evaluation function) is comparable to the C4.5 algorithms in terms of generated classifier size. When the penalty factor is set to 1, ILA-2 usually produced the smallest classifiers for the evaluation sets.

Regarding the results in Table 9, it may be worth pointing out that ILA-2 solves the overfitting problem of basic (certain) ILA in a fashion similar to that in which C4.5 solves the overfitting problem of ID3 (Quinlan, 1986). The sizes of the classifiers generated by the corresponding classification methods show this relationship clearly.

Table 9. Size of the classifiers generated by various algorithms

Domain name   ILA-2   ILA-2   ILA    ID3    C4.5-    C4.5-
              PF=1    PF=5                  pruned   rules
Lenses        9       13      13     9      7        8
Monk1         14      37      58     92     18       23
Monk2         9       115     188    176    31       35
Monk3         5       48      63     42     12       25
Mushroom      9       13      22     29     30       11
Parity5+5     2       67      81     107    23       17
Tic-tac-toe   15      88      88     304    85       66
Vote          18      35      69     67     7        8
Zoo           11      16      17     21     19       14
Splice        26      128     240    201    81       88
Coding        64      256     319    429    109      68
Promoter      7       18      27     41     25       12
Australia     13      69      116    130    30       30
Crx           12      39      111    129    58       32
Breast        5       16      38     37     19       20
Cleve         8       55      64     74     27       20
Diabetes      6       22      33     165    27       19
Heart         1       25      19     57     33       26
Iris          5       16      4      9      7        5
Totals        238     1079    1570   2119   648      527


CONCLUSION
We introduced an extended version of ILA, namely ILA-2, which is a supervised rule induction algorithm. ILA-2 has additional features that are not available in basic ILA. A faster pass criterion that reduces processing time by employing a greedy rule generation strategy is introduced. This feature, called fastILA, is useful for situations where reduced processing time is more important than the size of the classification task performed.
The main contribution of our work is the evaluation metric utilized in ILA-2 for the evaluation of description(s). In other words, users can reflect their preferences via a PF to tune up (or control) the performance of the ILA-2 system with respect to the nature of the domain at hand. This provides a valuable advantage over most current inductive learning algorithms.
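The exact form of the metric is given earlier in the paper; purely as an illustrative sketch of how a penalty factor lets a user trade rule coverage against rule purity, consider a score of the form TP − PF × FP, where TP and FP are the positive and negative training examples a candidate rule covers. The function and data names below are our own illustrative choices, not part of the ILA-2 implementation.

```python
def rule_score(rule, examples, penalty_factor=1.0):
    """Score a candidate rule: reward covered positives, and penalize
    covered negatives by a user-chosen penalty factor (PF)."""
    tp = sum(1 for x, label in examples if rule(x) and label)
    fp = sum(1 for x, label in examples if rule(x) and not label)
    return tp - penalty_factor * fp

# Toy data set: (attribute dictionary, class label) pairs.
data = [({"outlook": "sunny"}, True),
        ({"outlook": "sunny"}, True),
        ({"outlook": "sunny"}, False),
        ({"outlook": "rain"}, False)]
rule = lambda x: x["outlook"] == "sunny"   # covers 2 positives, 1 negative

print(rule_score(rule, data, penalty_factor=1))  # 2 - 1*1 = 1
print(rule_score(rule, data, penalty_factor=5))  # 2 - 5*1 = -3
```

With a small PF the slightly impure rule is still acceptable (useful for noisy domains); raising PF makes its score negative, pushing the induction toward stricter, more specific rules.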
Finally, using a number of machine learning and real-world data sets, we showed that the performance of ILA-2 is comparable to that of well-known inductive learning algorithms, namely CN2, OC1, ID3, and C4.5.
As further work, we plan to embed different feature subset selection (FSS) approaches into the system as a preprocessing step in order to yield better performance. With FSS, the search space requirements and the processing time will probably be reduced, owing to the elimination of irrelevant attribute-value combinations at the very beginning of the rule extraction process.
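As a sketch of what such a filter-style FSS preprocessing step might look like, the fragment below discards nominal attributes whose information gain with respect to the class is zero before any rules are generated. The information-gain criterion and all function names here are our illustrative choices; the paper does not commit to a particular FSS method.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of a nominal attribute over the class labels."""
    n = len(rows)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

def select_features(rows, labels, threshold=0.0):
    """Keep only attributes whose gain exceeds the threshold."""
    return [a for a in rows[0] if info_gain(rows, labels, a) > threshold]

# 'a' determines the class; 'noise' is irrelevant and is filtered out.
rows = [{"a": 1, "noise": 0}, {"a": 1, "noise": 1},
        {"a": 0, "noise": 0}, {"a": 0, "noise": 1}]
labels = [1, 1, 0, 0]
print(select_features(rows, labels))  # ['a']
```

Dropping zero-gain attributes up front shrinks the set of attribute-value combinations the rule inducer must examine, which is exactly the reduction in search space and processing time anticipated above.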

ACKNOWLEDGMENTS
This research was partly supported by the State Planning Organization of the Turkish Republic under research grant 97-K-12330. All of the data sets were obtained from the University of California, Irvine, repository of machine learning databases and domain theories, managed by Patrick M. Murphy. We acknowledge Ross Quinlan and Peter Clark for the implementations of C4.5 and CN2, and also Ron Kohavi, for we used the MLC++ library to execute the OC1, CN2, and ID3 algorithms.

REFERENCES
Clark, P., and T. Niblett. 1989. The CN2 induction algorithm. Machine Learning 3: 261–283.


Deogun, J. S., V. V. Raghavan, A. Sarkar, and H. Sever. 1997. Data mining: Research trends, challenges, and applications. In Rough Sets and Data Mining: Analysis of Imprecise Data, eds. T. Y. Lin and N. Cercone, 9–45. Boston, MA: Kluwer Academic Publishers.
Fayyad, U. M. 1996. Data mining and knowledge discovery: Making sense out of data. IEEE Expert 11(5): 20–25.
Fayyad, U. M., and K. B. Irani. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. 13th International Joint Conference on Artificial Intelligence, ed. R. Bajcsy, 1022–1027. Philadelphia, PA: Morgan Kaufmann.
Kohavi, R., D. Sommerfield, and J. Dougherty. 1996. Data mining using MLC++: A machine learning library in C++. In Proc. Tools with AI: 234–245.
Langley, P. 1996. Elements of Machine Learning. San Francisco: Morgan Kaufmann.
Matheus, C. J., P. K. Chan, and G. Piatetsky-Shapiro. 1993. Systems for knowledge discovery in databases. IEEE Trans. Knowledge and Data Engineering 5(6): 903–912.
Merz, C. J., and P. M. Murphy. 1997. UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.
Murthy, S. K., S. Kasif, and S. Salzberg. 1994. A system for induction of oblique decision trees. J. Artificial Intelligence Research 2: 1–32.
Quinlan, J. R. 1983. Learning efficient classification procedures and their application to chess end games. In Machine Learning: An Artificial Intelligence Approach, eds. R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, 463–482. Palo Alto, CA: Tioga.
Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1: 81–106.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Philadelphia, PA: Morgan Kaufmann.
Rissanen, J. 1986. Stochastic complexity and modeling. Ann. Statist. 14: 1080–1100.
Simoudis, E. 1996. Reality check for data mining. IEEE Expert 11(5): 26–33.
Thornton, C. J. 1992. Techniques in Computational Learning: An Introduction. London: Chapman and Hall.
Ting, K. M. 1994. Discretization of Continuous-Valued Attributes and Instance-Based Learning. Technical Report 491, University of Sydney, Australia.
Tolun, M. R., and S. M. Abu-Soud. 1998. ILA: An inductive learning algorithm for rule discovery. Expert Systems with Applications 14: 361–370.
Zadeh, L. A. 1994. Soft computing and fuzzy logic. IEEE Software 11(6): 48–56.