Related work
There have been several proposals of approaches to feature subset selection. (We discuss only a few of these in this article; our recent work contains a more complete list of references.) Some of these approaches involve searching for an optimal subset of features based on particular criteria of interest.

Feature weighting is a variant of feature selection. It involves assigning a real-valued weight to each feature. The weight associated with a feature measures its relevance or significance in the classification task.2 Feature subset selection is a special case of weighting with binary weights.

Several authors have examined the use of a heuristic search for feature subset selection; this often operates in conjunction with a branch-and-bound search.3 Others have explored randomized, population-based heuristic search techniques such as genetic algorithms to select feature subsets for use with decision-tree or nearest-neighbor classifiers.

Feature subset selection algorithms fall into two categories based on whether or not they perform feature selection independently of the learning algorithm that constructs the classifier. If the technique performs feature selection independently of the learning algorithm, it follows a filter approach. Otherwise, it follows a wrapper approach. The filter approach is generally computationally more efficient. However, its major drawback is that an optimal selection of features may not be independent of the inductive and representational biases of the learning algorithm that constructs the classifier. The wrapper approach, on the other hand, incurs the computational overhead of evaluating candidate feature subsets by executing a selected learning algorithm on the data set using each feature subset under consideration.

Because exhaustive search over all possible combinations of features is not computationally feasible, most current approaches assume monotonicity of some measure of classification performance and then use branch-and-bound search. This ensures that adding features does not worsen performance. Techniques that make this monotonicity assumption in some form appear to work reasonably well with linear classifiers. However, they can exhibit poor performance with nonlinear classifiers such as neural networks. Furthermore, many practical scenarios do not satisfy the monotonicity assumption. For example, irrelevant features (for example, social security numbers in medical records in a diagnosis task) can significantly worsen a decision-tree classifier's generalization accuracy. Also, most of the proposed feature selection techniques (with the exception of those using genetic algorithms) are not designed to handle multiple selection criteria (classification accuracy, feature measurement cost, and so on).

The multicriteria approach that we explore in this article is wrapper-based and uses a genetic algorithm in conjunction with a relatively fast, interpattern distance-based, neural-network learning algorithm. However, this general approach works with any inductive learning algorithm.

References

1. J. Yang and V. Honavar, "Feature Subset Selection Using a Genetic Algorithm," Feature Extraction, Construction and Selection: A Data Mining Perspective, H. Liu and H. Motoda, eds., Kluwer Academic Publishers, Boston, forthcoming, 1998.
2. S. Cost and S. Salzberg, "A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features," Machine Learning, Vol. 10, No. 1, Jan. 1993, pp. 57-78.
3. G. John, R. Kohavi, and K. Pfleger, "Irrelevant Features and the Subset Selection Problem," Proc. 11th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1994, pp. 121-129.
4. H. Liu and R. Setiono, "A Probabilistic Approach to Feature Selection: A Filter Solution," Proc. 13th Int'l Conf. Machine Learning, Morgan Kaufmann, 1996, pp. 319-327.
5. F. Brill, D. Brown, and W. Martin, "Fast Genetic Selection of Features for Neural Network Classifiers," IEEE Trans. Neural Networks, Vol. 3, No. 2, Mar. 1992, pp. 324-328.
6. M. Richeldi and P. Lanzi, "Performing Effective Feature Selection by Investigating the Deep Structure of the Data," Proc. Second Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1996, pp. 379-383.
7. B. Ripley, Pattern Recognition and Neural Networks, Cambridge Univ. Press, New York, 1996.
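The filter/wrapper distinction discussed in the sidebar can be sketched in a few lines of code. This is an illustration only, not code from the article: `filter_select`, `wrapper_select`, and the scoring callbacks are assumed names, and the wrapper here samples random candidate subsets rather than using a genetic algorithm.

```python
# Sketch of the filter vs. wrapper distinction. Illustrative only:
# function names and the candidate-generation scheme are assumptions.

import random

def filter_select(features, relevance_score, k):
    """Filter approach: rank features by a criterion computed
    independently of any learning algorithm, and keep the top k."""
    ranked = sorted(features, key=relevance_score, reverse=True)
    return set(ranked[:k])

def wrapper_select(features, train_and_score, n_candidates=10):
    """Wrapper approach: evaluate each candidate subset by actually
    running the chosen learning algorithm on it and scoring the
    resulting classifier (the expensive step the article describes)."""
    best_subset, best_score = None, float("-inf")
    for _ in range(n_candidates):
        # Candidate generation: here, a uniformly random subset.
        subset = {f for f in features if random.random() < 0.5}
        score = train_and_score(subset)   # trains and evaluates a classifier
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset
```

The wrapper loop makes the computational trade-off concrete: every candidate costs one full training run, which is why the article pairs it with a fast constructive learner.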
Why a genetic algorithm?

Feature subset selection in the context of practical problems such as diagnosis presents a multicriteria optimization problem. The criteria to be optimized include the classification's accuracy, cost, and risk. Evolutionary algorithms offer a particularly attractive approach to multicriteria optimization because they are effective in high-dimensional search spaces.

Neural networks are densely interconnected networks of relatively simple computing elements (for example, threshold or sigmoid neurons). Neural networks' potential for parallelism and their fault and noise tolerance make them an attractive framework for the design of pattern classifiers for real-world, real-time pattern-classification tasks.

The classification function realized by a neural network is determined by the functions computed by the neurons, the connectivity of the network, and the parameters (weights) associated with the connections. Assume C is a finite set of classes, n a finite number of discrete or real-valued attributes, R the set of real numbers, and D a finite set of discrete values. Multilayer networks of nonlinear computing elements (such as threshold neurons) can realize any classification function φ: R^n → C or φ: D^n → C. If the attributes are symbolic, they must first be mapped to numeric values using appropriate coding schemes.

Evolutionary algorithms are generally quite effective for rapid global search of large search spaces in multimodal optimization problems. Neural networks are particularly effective for fine-tuning solutions once promising regions in the search space have been identified. Against this background, genetic algorithms offer an attractive approach to feature subset selection for neural-network pattern classifiers.

However, if we use traditional neural-network training algorithms to train the pattern classifiers, the use of genetic algorithms for subset selection presents some practical problems:

* Traditional neural-network learning algorithms (such as back-propagation) perform an error gradient-guided search for a suitable setting of weights in the weight space determined by a user-specified network architecture. This ad hoc choice of network architecture often inappropriately constrains the search for weight settings. For example, if the network has too few neurons, the learning algorithm will miss the desired classification function. If the network has far more neurons than necessary, it can result in overfitting of the training data, which leads to poor generalization. Either case would make it difficult to evaluate the usefulness of a feature subset describing the training patterns for the neural network.
* Gradient-based learning algorithms, although mathematically well-founded for unimodal search spaces, can get caught in local minima of the error function. This can complicate the evaluation of the usefulness of a feature subset employed to describe the neural networks' training patterns.
* A typical run of a genetic algorithm involves many generations. In each generation, evaluation of an individual (a feature subset) involves training the neural network and computing its accuracy and cost. This can make the fitness evaluation rather expensive, because gradient-based algorithms are typically quite slow. The problem is exacerbated because we must use multiple neural networks to sample the space of ad hoc network architecture choices to get a reliable fitness estimate for each feature subset represented in the population.

Fortunately, constructive neural-network learning algorithms2 eliminate the need for ad hoc and often inappropriate a priori choices of network architectures. In addition, such algorithms can potentially discover near-minimal networks whose size is commensurate with the complexity of the classification task implicitly specified by the training data. Several new, provably convergent, and relatively efficient constructive learning algorithms for multicategory real- and discrete-valued pattern classification tasks have begun to appear in the literature. Many of these have demonstrated very good performance in terms of reduced network size, learning time, and generalization in several experiments with both artificial and fairly large real-world data sets.

The results we present in this article are from experiments using neural networks constructed by DistAl,3 a simple and fast constructive neural-network learning algorithm for pattern classification. DistAl's key feature is to add hidden neurons one at a time, using a greedy strategy that ensures that each hidden neuron correctly classifies a maximal subset of training patterns belonging to a single class. Correctly classified examples can then be eliminated from further consideration. The process terminates when it results in an empty training set, that is, when the network correctly classifies the entire training set. At this point, the training set becomes linearly separable in the transformed space defined by the hidden neurons. In fact, it is possible to set the weights on the hidden-to-output neuron connections without going through an iterative process. DistAl is guaranteed to converge to 100% classification accuracy on any finite training set in time that is polynomial in the number of training patterns. Earlier experiments3 show that DistAl, despite its simplicity, yields classifiers that compare quite favorably with those generated by learning algorithms that are more sophisticated and substantially more demanding computationally. This makes DistAl an attractive choice for experimenting with evolutionary approaches to feature subset selection for neural-network pattern classifiers. Figure 1 shows the key steps in our approach.

MARCH/APRIL 1998 45

Implementation

We ran our experiments using a standard genetic algorithm with a rank-based selection strategy. The probability of selection of the highest ranked individual is p (where 0.5 < p < 1.0 is a user-specified parameter); that of the second highest ranked individual is p(1 - p); that of the third highest ranked individual is p(1 - p)^2; and that of the last ranked individual is 1 - (the sum of the probabilities of selection of all the other individuals).1 Our results are based on ten random partitions for each classification task with the following parameter settings:

* Population size: 50
* Number of generations: 20
* Probability of crossover: 0.6
* Probability of mutation: 0.001
* Probability of selection of the highest ranked individual: 0.6

We based these parameter settings on the results of several preliminary runs. The probabilities of crossover, mutation, and selection of the highest ranked individual are close to the typical values used in standard genetic algorithms.

Each individual in the population represents a candidate solution to the feature subset selection problem. Let m be the total number of features available to choose from to represent the patterns to be classified. In a medical diagnosis task, these would be observable symptoms and a set of possible diagnostic tests that can be performed on the patient. (Given m such features, there exist 2^m possible feature subsets. Thus, for large values of m, an exhaustive search is not feasible.) Each feature subset is represented by a binary vector of dimension m. If a bit is a 1, it means that the corresponding feature is selected. A value of 0 indicates that the corresponding feature is not selected.

We determine an individual's fitness by evaluating the neural network constructed by DistAl using a training set whose patterns are represented using only the selected subset of features. If an individual has n bits turned on, the corresponding neural network has n input nodes.

The fitness function combines two criteria: the accuracy of the classification function realized by the neural network and the cost of performing the classification. We can estimate the classification function's accuracy by calculating the percentage of patterns in a test set that the neural network in question correctly classifies. Several measures of classification cost suggest themselves: the cost of measuring the value of a particular feature needed for classification (the cost of performing the necessary test in a medical diagnosis application), the risk involved, and so on. To keep things simple, we chose this two-criteria fitness function:

fitness(x) = accuracy(x) - cost(x) / (accuracy(x) + 1) + cost_max    (1)

Here, fitness(x) is the fitness of the feature subset represented by x, and accuracy(x) is the test accuracy of the neural-network classifier constructed using only the features selected in x.
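The rank-based selection scheme and the two-criteria fitness function described above can be sketched as follows. This is a minimal illustration: the function names are our own, and the accuracy and cost values are supplied by the caller, whereas in the actual experiments they come from training a DistAl network on the selected features.

```python
# Sketch of the GA machinery described in the text. Illustrative only;
# accuracy/cost values would come from training DistAl in practice.

def rank_selection_probabilities(pop_size, p=0.6):
    """Rank-based selection: the highest-ranked individual is selected
    with probability p, the second with p(1 - p), the third with
    p(1 - p)^2, and so on; the last-ranked individual takes the
    remaining probability so the distribution sums to 1."""
    probs = [p * (1 - p) ** i for i in range(pop_size - 1)]
    probs.append(1 - sum(probs))   # remainder for the last-ranked one
    return probs

def fitness(accuracy, cost, cost_max):
    """Two-criteria fitness of equation (1):
    fitness(x) = accuracy(x) - cost(x) / (accuracy(x) + 1) + cost_max."""
    return accuracy - cost / (accuracy + 1) + cost_max

def subset_cost(bits, feature_costs):
    """Cost of a subset encoded as a binary vector of dimension m:
    the sum of the measurement costs of the selected (bit = 1) features."""
    return sum(c for b, c in zip(bits, feature_costs) if b)
```

With the article's setting p = 0.6, the top-ranked individuals receive probabilities 0.6, 0.24, 0.096, and so on, which concentrates reproduction on the fittest subsets while still giving every individual a nonzero chance.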
3-bit parity data set. We constructed this data set to explore the genetic algorithm's effectiveness in selecting an appropriate subset of relevant features in the presence of redundant features. If successful, the genetic algorithm would minimize the cost and maximize the accuracy of the resulting neural-network pattern classifier.

To introduce redundancy to the training set, we replicated the original features once, thereby doubling the number of features. Then, we generated an additional set of irrelevant features and assigned them random Boolean values. We generated 100 7-bit random vectors and augmented them with the 6-bit vectors (corresponding to the original three bits plus an identical set of three bits). We assigned each feature in the resulting data set a random cost between 0 and 9.

Table 1. Data sets used in the experiments.

DATA SET     DESCRIPTION                   PATTERNS   FEATURES   FEATURE TYPE       CLASSES
3P           3-bit parity problem          100        13         Numeric            2
Annealing    Annealing database            798        38         Numeric, nominal   5
Audiology    Audiology database            200        69         Nominal            24
Bridges      Pittsburgh bridges            105        11         Numeric, nominal   6
Cancer       Breast cancer                 699        9          Numeric            2
CRX          Credit screening              690        15         Numeric, nominal   2
Flag         Flag database                 194        28         Numeric, nominal   8
Glass        Glass identification          214        9          Numeric            6
Heart        Heart disease                 270        13         Numeric, nominal   2
HeartCle     Heart disease (Cleveland)     303        13         Numeric, nominal   2
HeartHun     Heart disease (Hungarian)     294        13         Numeric, nominal   2
HeartLB      Heart disease (Long Beach)    200        13         Numeric, nominal   2
HeartSwi     Heart disease (Swiss)         123        13         Numeric, nominal   2
Hepatitis    Hepatitis domain              155        19         Numeric, nominal   2
Horse        Horse colic                   300        22         Numeric, nominal   2
Ionosphere   Ionosphere structure          351        34         Numeric            2
Liver        Liver disorders               345        6          Numeric            2
Pima         Pima Indians diabetes         768        8          Numeric            2
Promoters    DNA sequences                 106        57         Nominal            2
Sonar        Sonar classification          208        60         Numeric            2
Soybean      Large soybean                 307        35         Nominal            19
Votes        House votes                   435        16         Nominal            2
Vehicle      Vehicle silhouettes           846        18         Numeric            4
Vowel        Vowel recognition             528        10         Numeric            11
Wine         Wine recognition              178        13         Numeric            3
Zoo          Zoo database                  101        16         Numeric, nominal   7

Real-world data sets. Our objective with the real-world data sets was to compare networks constructed using GA-selected feature subsets with networks that use all of the features.
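The construction of the 3-bit parity data set described above (three relevant bits, a duplicated redundant copy, seven random irrelevant bits, and random per-feature costs) can be sketched as follows. The generator below is our own illustration; the article does not give its exact procedure, and the function and parameter names are assumptions.

```python
# Illustrative generator for a 3P-style data set: 3 relevant bits
# (the target is their parity), 3 redundant duplicates, and 7 random
# irrelevant bits, for 13 features total, each with a random cost 0-9.

import random

def make_parity_dataset(n_patterns=100, n_irrelevant=7,
                        cost_range=(0, 9), seed=0):
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_patterns):
        relevant = [rng.randint(0, 1) for _ in range(3)]
        label = sum(relevant) % 2            # 3-bit parity target
        redundant = list(relevant)           # duplicated (redundant) copy
        irrelevant = [rng.randint(0, 1) for _ in range(n_irrelevant)]
        patterns.append((relevant + redundant + irrelevant, label))
    n_features = 3 + 3 + n_irrelevant        # 13 features in total
    costs = [rng.randint(*cost_range) for _ in range(n_features)]
    return patterns, costs
```

An ideal feature selector applied to such data should pay for at most three bits: one copy of each relevant feature, skipping the redundant duplicates and the irrelevant noise entirely.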
Tables 2, 3, and 4 show averaged performance. The table entries correspond to means and standard deviations, shown in the form mean ± standard deviation. (See our recent work6 for more thorough experiments.)

Improving generalization. To study the effect of feature subset selection on generalization, we ran experiments using classification accuracy as the fitness function. The results shown in Table 2 indicate that the networks constructed using a GA-selected subset of features compare quite favorably with networks that use all of the features. In particular, feature subset selection resulted in significant generalization improvement.

Table 3 compares the results of our approach with other GA-based approaches7 and several non-GA-based approaches cited in our recent work.6 (These non-GA approaches use a decision-tree algorithm.) We limited the comparisons to only those data sets for which at least one of the two studies6,7 reported results that could be compared. The fitness function that combined both accuracy and cost outperformed that based on accuracy alone in every respect.

The approach we have described offers an attractive solution to the feature subset selection problem in inductive learning of pattern classifiers in general and of neural-network pattern classifiers in particular. This task finds applications in the cost-sensitive design of classifiers for tasks such as medical diagnosis and computer vision. Other applications of interest include automated data mining and knowledge discovery from data sets with an abundance of features.

Acknowledgments

This research was partially supported by the National Science Foundation (through grants IRI-9409580 and IRI-9643299) and the John Deere Foundation.

References

1. M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, Cambridge, Mass., 1996.
2. V. Honavar and L. Uhr, "Generative Learning Structures and Processes for Connectionist Networks," Information Sciences, Vol. 70, 1993, pp. 75-108.
3. J. Yang, R. Parekh, and V. Honavar, "DistAl,"
Table 4. Performance comparison: neural-network pattern classifiers that use features selected based on accuracy alone compared to those that use features selected based on both accuracy and cost.

              ACCURACY ONLY                      ACCURACY AND COST
DATA SET      FEATURES   ACCURACY    HIDDEN      FEATURES   ACCURACY    COST          HIDDEN
3P            6.6±1.6    100±0.0     9.2±4.9     4.3±1.2    100±0.0     26.7±7.6      7.3±4.2
Hepatitis     9.2±2.3    97.1±4.3    8.1±2.8     8.3±2.4    97.3±3.5    19.0±8.1      7.4±2.8
HeartCle      7.3±1.7    92.9±3.6    7.6±4.2     6.1±1.6    93.0±3.4    261.5±94.4    7.2±5.1
Pima          3.8±1.5    79.5±3.1    20.8±21.2   3.1±1.0    79.5±3.0    22.8±9.7      16.0±11.1