
FEATURE DISCRETIZATION AND SELECTION TECHNIQUES FOR HIGH-DIMENSIONAL DATA
Artur J. Ferreira
Supervisor: Prof. Mário A. T. Figueiredo
Instituto Superior de Engenharia de Lisboa
Instituto de Telecomunicações, Lisboa
Priberam Machine Learning Lunch Seminar
17 April 2012
Outline
Introduction
Background
High-Dimensional Data
Feature Discretization (FD)
Feature Selection (FS)
FD Proposals
Static Unsupervised Proposals
Static Supervised Proposal
FS Proposals
Feature Selection Proposals
Conclusions
Some Resources
2 / 45
Introduction and Motivation
Some well-known facts about machine learning:
1 An adequate (sometimes discrete) representation of the data
is necessary
2 High-dimensional datasets are increasingly common
3 Learning in high-dimensional data is challenging
4 Sometimes, the curse-of-dimensionality problem arises
small number of instances n and large dimensionality d (e.g.,
microarray data or text categorization data)
it must be addressed in order to have effective learning
algorithms
5 Feature discretization (FD) and feature selection (FS)
techniques address these problems
achieve adequate representations
select an adequate subset of features with a convenient
representation
3 / 45
High-Dimensional Data
Some high-dimensional datasets available on the Web, with
different types of problems
Datasets with c classes and n instances shown by increasing
dimensionality d; notice that in some cases d ≫ n
Dataset d c n Type of Data
Colon 2000 2 62 Microarray
SRBCT 2309 4 83 Microarray
AR10P 2400 10 130 Face
PIE10P 2420 10 210 Face
TOX-171 5748 4 171 Microarray
Example1 9947 2 50 Text, BoW
ORL10P 10304 10 100 Face
11-Tumors 12553 11 174 Microarray
Lung-Cancer 12601 5 203 Microarray
SMK-CAN-187 19993 2 187 Microarray
Dexter 20000 2 2600 BoW
GLI-85 22283 2 85 Microarray
Dorothea 1000000 2 1950 Drug Discovery
4 / 45
High-Dimensional Data
These datasets are available on-line:
The well-known University of California at Irvine (UCI)
repository, archive.ics.uci.edu/ml/datasets.html
The Gene Expression Model Selector (GEMS) project, at
www.gems-system.org/
5 / 45
High-Dimensional Data
The recently developed Arizona State University (ASU) repository,
featureselection.asu.edu/datasets.php, has several
high-dimensional datasets
6 / 45
Feature Discretization
Feature Discretization (FD) aims at:
representing a feature with symbols from a finite set
keeping enough information for the learning task
ignoring minor (noisy/irrelevant) fluctuations in the data
FD can be performed by:
unsupervised or supervised methods; the latter use class
labels to compute the discretization intervals
static or dynamic methods
static - the discretization intervals are computed using
solely the training data
dynamic - rely on a wrapper approach with a quantizer and a
classifier
7 / 45
Feature Discretization: Static Unsupervised Approaches
Three static unsupervised techniques are commonly used for FD:
equal-interval binning (EIB) - uniform quantization
equal-frequency binning (EFB) - non-uniform quantization to
attain uniform distribution
proportional k-interval discretization (PkID) - the number and
size of discretized intervals are adjusted to the number of
training instances
As compared to both EIB and EFB, PkID provides naïve Bayes
classifiers with:
competitive classification performance for smaller datasets
better classification performance for larger datasets
(a minimal sketch of EIB and EFB is given below)
8 / 45
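To make the EIB/EFB distinction concrete, here is a minimal MATLAB sketch (illustrative, not code from the talk; function and variable names are my own). EIB places 2^q uniform-width bins over the feature range, while EFB places the bin edges at empirical quantiles so that each bin receives roughly the same number of training values.

% Sketch: EIB and EFB discretization of a single feature x into 2^q bins.
function [xq_eib, xq_efb] = discretize_eib_efb(x, q)
  x = x(:); nbins = 2^q;
  edges_eib = linspace(min(x), max(x), nbins + 1);            % EIB: uniform-width bins
  xs = sort(x);                                               % EFB: edges at empirical quantiles
  edges_efb = xs(round(linspace(1, numel(xs), nbins + 1)));
  xq_eib = bin_index(x, edges_eib, nbins);
  xq_efb = bin_index(x, edges_efb, nbins);
end
function b = bin_index(x, edges, nbins)
  b = ones(size(x));
  for k = 2:nbins
    b(x >= edges(k)) = k;     % assign each value to the last bin it enters
  end
end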
Feature Discretization: Static Supervised Approaches
Some static supervised techniques for FD:
information entropy maximization (IEM), Fayyad and Irani,
1993
minimal description length (MDL), Kononenko, 1995
class-attribute interdependence maximization (CAIM), 2004
class-attribute contingency coefficient (CACC), 2008
correlation maximization (CM), 2011
Empirical evidence shows that:
the IEM and MDL methods have good performance regarding
accuracy and running-time.
the CACC and CAIM methods have higher running-time than
both IEM and MDL. In some cases, they attain better results
than these methods
9 / 45
Feature Discretization: Weka's Environment
We can find IEM (Fayyad and Irani, 1993) and MDL (Kononenko,
1995) in Weka's machine learning package:
weka.filters.supervised.attribute.Discretize
Note: the unsupervised PkID method is also available.
10 / 45
Feature Selection
Feature Selection (FS) is a central problem in Machine Learning
and Pattern Recognition
what is the best subset of features for a given problem?
how many features should we choose?
There are many recent papers regarding FS, several published in 2012 alone
FS can be performed by:
unsupervised or supervised methods; the latter use class
labels
filter, wrapper, or embedded approaches
11 / 45
Feature Selection: Filter Approach
The FS filter approach:
assesses the quality of a given feature subset using solely
characteristics of that subset
it does not rely on the use of any learning algorithm
There are many (successful) filter approaches for FS:
unsupervised: Term-Variance (TV), Laplacian Score (LS),
Laplacian Score Extended (LSE), SPEC, ...
supervised: Relief, ReliefF, CFS, FiR, ...
supervised: FCBF, mRMR, MIM, CMIM, IG, ...
12 / 45
Feature Selection: Information-Theoretic Filters
Claude Shannon's information theory (IT) has been the basis for
the proposal of many filter methods (IT-based filters):
Brown, G., Pocock, A., Zhao, M., Lujan, M., 2012. Conditional likelihood
maximisation: A unifying framework for information theoretic feature
selection. Journal of Machine Learning Research 13, 27–66.
13 / 45
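As a concrete (and deliberately simple) instance of an IT-based filter, the sketch below ranks already-discretized features by their mutual information with the class label (the MIM criterion listed earlier). This is illustrative MATLAB, not code from the talk or from Brown et al.; discrete_mi is a plain plug-in estimator.

% Plug-in estimate of the MI (in bits) between a discrete feature xq and labels y.
function mi = discrete_mi(xq, y)
  xq = xq(:); y = y(:); n = numel(xq); mi = 0;
  xv = unique(xq); yv = unique(y);
  for a = 1:numel(xv)
    for b = 1:numel(yv)
      pxy = sum(xq == xv(a) & y == yv(b)) / n;     % joint probability
      if pxy > 0
        px = sum(xq == xv(a)) / n; py = sum(y == yv(b)) / n;
        mi = mi + pxy * log2(pxy / (px * py));
      end
    end
  end
end
% MIM filter: score each column of the discretized matrix Xq by MI with the class.
function [ranked, score] = mim_rank(Xq, y)
  d = size(Xq, 2); score = zeros(1, d);
  for i = 1:d
    score(i) = discrete_mi(Xq(:, i), y);
  end
  [~, ranked] = sort(score, 'descend');            % most relevant features first
end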
Feature Selection: Wrapper Approach
Wrapper approaches:
use some method to search the space of all possible subsets
of features
assess the quality of a subset by learning and evaluating a classifier
with that feature subset
Combinatorial search makes wrappers inadequate for
high-dimensional data
Some recent wrapper approaches for FS (2009-2011):
Greedy randomized adaptive search procedure (GRASP)
A GRASP-based FS hybrid lter-wrapper method
A sparse model for high-dimensional data, based on linear
programming
Estimation of the redundancy between feature sets, by the
conditional MI between feature subsets and each class
14 / 45
Feature Selection: Embedded Approach
The embedded (or integrated) approach:
simultaneously learns the classifier and chooses a subset of
features
assigns weights to features
the objective function encourages some of these weights to
become zero
Examples of the embedded approach are:
sparse multinomial logistic regression (SMLR*)
sparse logistic regression (SLogReg) method
Bayesian logistic regression (BLogReg), a more adaptive
version of SLogReg
joint classifier and feature optimization (JCFO*), which
applies FS inside a kernel function
15 / 45
Feature Selection on High-Dimensional Data
How difficult/challenging is FS on a given dataset?
The ratio [1]
R_FS = n / (a · c),
with n patterns (instances), c classes, and a being the median arity
of the features, discretized with EFB, measures this difficulty. Low
values imply more difficult FS problems (a small numeric illustration
is given below).
Another question:
Shouldn't the d/n ratio also be taken into account?
[1] Brown, G., Pocock, A., Zhao, M., Lujan, M., 2012. Conditional likelihood
maximisation: A unifying framework for information theoretic feature
selection. Journal of Machine Learning Research 13, 27–66.
16 / 45
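A small numeric illustration of the ratio (the arity value here is hypothetical, only to show the computation; a must be measured on the EFB-discretized data):

rfs = @(n, a, c) n / (a * c);
rfs(62, 8, 2)   % e.g., Colon: n = 62, c = 2; with an assumed median arity a = 8, R_FS = 3.875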
Feature Selection: an (outdated) categorization
Liu, H. and Yu, L., Toward Integrating Feature Selection Algorithms for
Classification and Clustering, IEEE Transactions on Knowledge and Data
Engineering, vol. 17, no. 4, April 2005, pp. 491–502
17 / 45
Feature Discretization: Static U-LBG1 Algorithm
Recently, we have proposed the use of the Linde-Buzo-Gray (LBG)
algorithm for unsupervised static FD [2]
The LBG algorithm is applied individually to each feature
leading to discrete features with minimum mean square error
(MSE)
Rationale: a low MSE(X_i, Q(X_i)) is adequate for learning!
It is stopped when:
the MSE distortion falls below some threshold Δ
or the maximum number of bits q per feature is reached
obtains a variable number of bits per feature
uses (Δ, q) as input parameters; Δ set to 5% of the range of
each feature and q ∈ {4, . . ., 10} are adequate
(a minimal sketch of U-LBG1 is given below)
[2] A. Ferreira and M. Figueiredo, An unsupervised approach to feature
discretization and selection, Elsevier Pattern Recognition Journal, DOI:
10.1016/j.patcog.2011.12.008
18 / 45
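A minimal MATLAB sketch of the U-LBG1 idea (not the authors' implementation; the split perturbation, the number of Lloyd iterations, and the use of Delta as an MSE threshold are assumptions of this sketch):

% U-LBG1 sketch: quantize one feature with an LBG/Lloyd scalar quantizer,
% doubling the codebook (1, 2, ..., q bits) until the MSE falls below Delta
% or q bits are reached.
function [xq, cb] = ulbg1_feature(x, q, Delta)
  x = x(:);
  cb = mean(x);                              % start from a single codeword
  for b = 1:q
    cb = sort([cb - 1e-3; cb + 1e-3]);       % LBG split: double the codebook
    for it = 1:20                            % a few Lloyd iterations
      [~, idx] = min(abs(x - cb'), [], 2);   % nearest-codeword assignment
      for k = 1:numel(cb)
        if any(idx == k), cb(k) = mean(x(idx == k)); end
      end
      cb = sort(cb);
    end
    [~, idx] = min(abs(x - cb'), [], 2);
    if mean((x - cb(idx)).^2) < Delta, break; end   % distortion below threshold: stop
  end
  xq = idx;                                  % discrete symbols in 1..numel(cb)
end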
Feature Discretization: U-LBG2 Algorithm
Similar to U-LBG1 (both aim at obtaining quantizers that
represent the features with a small distortion) with the following
key differences:
each discretized feature will be given the same (maximum)
number of bits q;
only one quantizer is learned for each feature
19 / 45
Experimental Results: Unsupervised Discretization
Discretization performance:
total number of bits per pattern (T. Bits) and Test Set Error
Rate (Err, average of ten runs)
using up to q = 7 bits, with the naïve Bayes classifier
Dataset | Original Err | EFB: T. Bits, Err | U-LBG1: T. Bits, Err | U-LBG2: T. Bits, Err
Phoneme 21.30 30 22.30 9 22.80 30 20.60
Pima 25.30 48 25.20 30 25.20 48 25.80
Abalone 28.00 48 27.60 15 27.20 48 27.70
Contraceptive 34.80 54 31.40 15 38.00 54 34.80
Wine 3.73 78 4.80 27 3.20 78 3.20
Hepatitis 20.50 95 21.50 32 21.00 39 18.00
WBCD 5.87 180 5.13 60 5.87 180 5.67
Ionosphere 10.60 198 9.80 49 17.40 198 11.00
SpamBase 15.27 324 13.40 54 15.73 324 15.67
Lung 35.00 318 35.00 74 35.83 318 35.00
Arrhythmia 32.00 1392 51.56 553 30.22 1392 41.56
20 / 45
Some insight on the accuracy of each feature
Test set error rate of naïve Bayes using only a single feature, discretized
with q = 4 bits by the U-LBG2 algorithm, on the WBCD dataset.
The horizontal dashed line is the test set error rate using all p = 30 features.
21 / 45
Some insight on the accuracy of each feature
Test set error rate of naïve Bayes using only a single feature, discretized
with q = 8 bits by the U-LBG2 algorithm, on the WBCD dataset.
The horizontal dashed line is the test set error rate using all p = 30 features.
22 / 45
Some insight on progressive discretization 1/2
Test set error rate of naïve Bayes using solely feature 17 of the WBCD
dataset (original feature and U-LBG2 discretized with q ∈ {1, . . ., 10}).
The discrete versions do not provide higher accuracy!
23 / 45
Some insight on progressive discretization 2/2
Test set error rate of naïve Bayes using solely feature 25 of the WBCD
dataset (original feature and U-LBG2 discretized with q ∈ {1, . . ., 10}).
With q = 10, we get a small improvement on the accuracy.
24 / 45
Experimental Results Analysis
Some comments on these methods and results:
Often, discretization improves classification accuracy
EFB usually attains better results than EIB
U-LBG2 is usually faster than U-LBG1, but allocates more
bits per feature
Often, one of the LBG methods attains better results than
EFB
The U-LBG procedures are more complex than either EIB or
EFB
PkID attains results close to EFB, when using the naïve Bayes
classifier
25 / 45
Feature Discretization: S-LBG Algorithm (ideas)
Supervised version of the U-LBG counterparts:
each feature is discretized with LBG
it uses the mutual information (MI) between discretized
features and the class label, in order to control the
discretization procedure
it stops at b bits or when the relative increase of MI with
respect to the previous quantizer is less than a given threshold
each feature is discretized with an increasing number of bits,
stopping only when:
there is no significant increase in the relevance of the feature
the maximum number of bits is reached
26 / 45
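A sketch of the S-LBG stopping rule, reusing the discrete_mi and ulbg1_feature sketches shown earlier (the threshold name tau and the exact relative-increase test are assumptions of this sketch, not the authors' code):

% S-LBG idea: keep adding bits while the MI between the discretized feature
% and the class label still grows significantly.
function xq_best = slbg_feature(x, y, q, tau)
  prev_mi = 0; xq_best = [];
  for b = 1:q
    xq = ulbg1_feature(x, b, 0);             % b-bit LBG quantizer (no MSE stop)
    mi = discrete_mi(xq, y);
    if b > 1 && (mi - prev_mi) / max(prev_mi, eps) < tau
      break;                                 % no significant relevance gain: stop
    end
    xq_best = xq; prev_mi = mi;
  end
end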
Feature Discretization: MI Algorithm (ideas)
Ongoing work with the following key ideas:
discretize each feature to maximize MI with the class label
do this in a progressive approach, scanning all features
allocate a variable number of bits per feature
check how the relevance of each feature changes when
discretized with b + 1 bits, as compared to b bits
27 / 45
Experimental Results: Supervised Discretization
MI as a function of the number of bits per feature in the S-LBG
algorithm, using up to q = 10 bits. Discretization of features 1 and
18 of the Hepatitis dataset
28 / 45
Experimental Results
Discretization performance:
total number of bits per pattern (TBits) and Test Set Error
Rate (Err, average of ten runs)
using up to q = 7 bits, with the naïve Bayes classifier
Dataset | Base Err | EFB: TBits, Err | U-LBG1: TBits, Err | U-LBG2: TBits, Err | S-LBG: TBits, Err
Iris 2.6 24 4.5 7 9.0 24 2.6 15 3.4
Phoneme 21.3 30 22.3 9 22.8 30 20.6 20 22.5
Pima 25.3 48 25.2 30 25.2 48 25.8 40 24.4
Abalone 28.0 48 27.6 15 27.2 48 27.7 35 27.8
Contrac. 34.8 54 31.4 15 38.0 54 34.8 29 34.7
Wine 3.7 78 4.8 27 3.2 78 3.2 57 4.5
Hepatitis 20.5 95 21.5 32 21.0 39 18.0 29 19.0
WBCD 5.8 180 5.1 60 5.8 180 5.6 116 5.4
Ionosph. 10.6 198 9.8 49 17.4 198 11.00 177 11.0
SpamBase 15.2 324 13.4 54 15.7 324 15.6 220 14.6
Lung 35.0 318 35.0 74 35.8 318 35.0 135 35.0
Arrhyt. 32.0 1392 51.5 553 30.2 1392 41.5 1050 31.3
29 / 45
Feature Selection
Some comments on existing FS methods:
Usually, wrappers perform better than embedded methods, but
take longer
Embedded methods perform better than filters, but are much
slower
On very high-dimensional datasets:
both wrapper and embedded methods are too expensive
filters are the only applicable option!
even some filter FS methods can take a prohibitive time in the
redundancy analysis and elimination stage
30 / 45
Feature Selection: Relevance-Redundancy (RR)
The Yu and Liu relevance-redundancy framework approach [3]
The optimal subset is given by parts III and IV, i.e., without
irrelevant and redundant features
[3] Yu, L., Liu, H., Dec. 2004. Efficient feature selection via analysis of
relevance and redundancy, JMLR 5, 1205–1224.
31 / 45
Feature Selection: Relevance-Redundancy (RR)
The Yu and Liu relevance-redundancy framework approach [4]
First compute relevance, then redundancy
After this, find the optimal subset (parts III and IV)
[4] Yu, L., Liu, H., Dec. 2004. Efficient feature selection via analysis of
relevance and redundancy, JMLR 5, 1205–1224.
32 / 45
A RR FS approach for high-dimensional data (ideas)
Key observations for filters on high-dimensional data:
Some supervised filter methods (e.g. CFS and mRMR) are
computationally expensive
They waste time on subspace search and redundancy analysis
Redundancy is typically found among the most relevant
features
Our proposal for fast unsupervised and supervised filter RR FS on
high-dimensional data:
sorts the d features by decreasing relevance
computes the redundancy between the most relevant features
computes up to d − 1 pairwise similarities
33 / 45
A RR FS approach for high-dimensional data (ideas)
We keep only features with high relevance and low (i.e., below
some threshold M_S) similarity among themselves
Redundant features are not necessarily consecutive in the
relevance-sorted feature list
However, it is a waste of time to compute the redundancy
between weakly relevant and irrelevant features!
So, we perform the redundancy check only among the top
relevant features
34 / 45
A RR FS approach for high-dimensional data (details)
Input: X: n × d matrix, n patterns of a d-dimensional training set.
m (≤ d): maximum number of features to keep.
M_S: maximum allowed similarity between pairs of features.
Output: FeatKeep: an m′-dimensional array (with m′ ≤ m) containing the indexes
of the selected features.
X~: n × m′ matrix, the reduced-dimensional training set, with features sorted
by decreasing relevance.
1: Compute the relevance r_i of each feature X_i (columns of X), for
i ∈ {1, . . ., d}, using one of the dispersion measures (MAD or MM).
2: Sort the features by decreasing order of r_i. Let i_1, i_2, ..., i_d be the
resulting permutation of 1, ..., d (i.e., r_{i_1} ≥ r_{i_2} ≥ ... ≥ r_{i_d}).
3: FeatKeep[1] = i_1; prev = 1; next = 2;
4: for f = 2 to d do
5:   s = S(X_{i_f}, X_{i_prev});
6:   if s < M_S then
7:     FeatKeep[next] = i_f; X~_next = X_{i_f}; prev = f; next = next + 1;
8:   end if
9:   if next = m then
10:    break; /* We have m features. Break loop. */
11:  end if
12: end for
35 / 45
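A minimal MATLAB sketch of the procedure above (illustrative, not the authors' code). Here relevance is a function handle returning one score per column, e.g. the MAD or MM measures of the next slide, and the similarity S is the absolute cosine defined two slides ahead.

% RR filter sketch: rank features by relevance, keep a candidate only if its
% similarity to the previously kept feature is below M_S; stop at m features.
function [featKeep, Xr] = rr_filter(X, m, MS, relevance)
  d = size(X, 2);
  r = relevance(X);                          % 1 x d relevance scores
  [~, order] = sort(r, 'descend');           % i_1, ..., i_d by decreasing relevance
  featKeep = zeros(1, m); featKeep(1) = order(1);
  kept = 1; prevCol = X(:, order(1));        % last feature kept so far
  for f = 2:d
    cand = X(:, order(f));
    s = abs(cand' * prevCol) / (norm(cand) * norm(prevCol) + eps);  % absolute cosine
    if s < MS                                % low redundancy: keep the feature
      kept = kept + 1; featKeep(kept) = order(f); prevCol = cand;
    end
    if kept == m, break; end
  end
  featKeep = featKeep(1:kept);
  Xr = X(:, featKeep);                       % reduced training set, sorted by relevance
end

A call such as rr_filter(X, 1000, 0.8, @(X) mean(abs(X - mean(X, 1)), 1)) corresponds to the MAD-based unsupervised variant with the M_S = 0.8 threshold used in the running-time experiments later in the talk.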
Relevance Measures
For unsupervised learning, we found relevance proportional to
dispersion
The mean absolute difference,
MAD_i = (1/n) Σ_{j=1}^{n} |X_ij − X̄_i|,
and the mean-median,
MM_i = |X̄_i − median(X_i)|,
i.e., the absolute difference between the mean X̄_i and the median of X_i,
are adequate measures of relevance
They attain better results than variance
36 / 45
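The two dispersion measures are one-liners in MATLAB, computed column-wise over an n x d matrix X (illustrative sketch):

madRel = @(X) mean(abs(X - mean(X, 1)), 1);      % MAD_i for each feature (column)
mmRel  = @(X) abs(mean(X, 1) - median(X, 1));    % MM_i for each feature (column)
% e.g., r = madRel(X); returns a 1 x d vector of relevance scores.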
Redundancy Measure
The redundancy between two features, say X_i and X_j, is computed
by the absolute cosine
|cos(X_i, X_j)| = |⟨X_i, X_j⟩| / (||X_i|| ||X_j||)
               = |Σ_{k=1}^{n} X_ik X_jk| / ( √(Σ_{k=1}^{n} X_ik^2) · √(Σ_{k=1}^{n} X_jk^2) ),   (1)
where ⟨·, ·⟩ denotes the inner product and ||·|| the Euclidean norm
We have 0 ≤ |cos(X_i, X_j)| ≤ 1:
with 0 meaning that the two features are orthogonal
(maximally different)
1 resulting from collinear features
Similar to Pearson's correlation coefficient
37 / 45
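Equation (1) is also a one-liner (illustrative; it is the similarity already used inside the rr_filter sketch above):

abscos = @(xi, xj) abs(xi' * xj) / (norm(xi) * norm(xj) + eps);   % value in [0, 1]
% e.g., s = abscos(X(:, i), X(:, j));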
Experimental Results: Relevance and Redundancy
The relevance and similarity of the consecutive m = 1000
top-ranked features of the Brain-Tumor1 dataset (d = 5920
features)
Among the top-ranked features, we have high redundancy!
38 / 45
Experimental Results: Unsupervised FS
Comparison with other unsupervised approaches for a 10-fold CV
with linear SVM. Best result in bold face and second best is
underlined
Our Approach Unsupervised Baseline
Dataset MAD MM AMGM TV LS SPEC No FS
Colon 21.0 17.7 19.4 17.7 21.0 22.6 19.4
SRBCT 0.0 0.0 0.0 0.0 0.0 0.0 0.0
PIE10P 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Lymphoma 2.2 2.2 2.2 2.2 2.2 2.2 2.2
Leukemia1 4.2 4.2 5.6 4.2 5.6 4.2 4.2
Brain-Tumor1 14.4 12.2 12.2 12.2 13.3 28.9 12.2
Leukemia 2.8 2.8 2.8 2.8 2.8 30.6 2.8
Example1 2.7 2.7 2.6 2.7 3.3 22.9 2.7
ORL10P 2.0 2.0 4.0 5.0 1.0 1.0 1.0
Lung-Cancer 4.9 5.9 4.9 5.9 5.4 6.4 4.9
SMK-CAN-187 41.7 41.7 41.7 41.7 26.2 25.7 26.2
Dexter 5.3 5.2 5.2 5.2 5.2 39.3 4.7
GLI-85 12.9 14.1 15.3 14.1 11.8 9.4 11.8
Dorothea 24.0 24.0 24.0 24.0 25.0 22.0 24.0
39 / 45
Experimental Results: Supervised FS
Comparison with other supervised approaches for a 10-fold CV with
linear SVM. Best result in bold face and second best is underlined
Dataset | Our approach: MM, FiR | Supervised filters: MI, RF, CFS, FCBF, FiR, mRMR | Baseline: No FS
Colon 24.2 22.6 24.2 19.4 25.8 22.6 19.4 21.0 21.0
SRBCT 0.0 0.0 0.0 0.0 0.0 4.8 0.0 4.8 0.0
PIE10P 0.0 0.0 0.5 0.0 * 1.0 0.0 24.8 0.0
Lymph. 2.2 2.2 2.2 2.2 * 3.3 2.2 22.8 2.2
Leuk1 5.6 2.8 6.9 6.9 * 5.6 4.2 9.7 5.6
B-Tum1 13.3 12.2 13.3 11.1 * 18.9 11.1 25.6 10.0
Leuk. 2.8 12.5 2.8 2.8 * 4.2 4.2 8.3 2.8
Example1 2.3 2.2 2.2 3.7 * 6.3 2.1 28.3 2.4
ORL10P 4.0 5.0 2.0 1.0 * 1.0 2.0 68.0 1.0
B-Tum2 34.0 22.0 30.0 22.0 * 36.0 24.0 42.0 26.0
P-Tumor 7.8 5.9 4.9 7.8 * 9.8 7.8 12.7 8.8
L-Cancer 5.9 6.4 4.9 4.9 * 6.4 5.4 11.8 5.9
SMK-187 41.7 40.6 53.5 24.6 * 33.2 23.5 33.2 24.1
Dexter 6.7 6.0 7.7 9.3 * 15.3 6.7 18.0 6.3
GLI-85 14.1 12.9 17.6 11.8 * 20.0 14.1 16.5 14.1
Dorot. 25.0 26.0 25.0 * * * 25.0 * 25.0
40 / 45
Experimental Results: Running-Time
For each dataset:
the first row contains the test set error rates (%) of linear
SVM for a 10-fold CV for our RR method (with M_S = 0.8)
and other FS methods selecting m features
the second row shows the total running time taken by each FS
algorithm to select m features for the 10 folds
Dataset, m | Our RR: MAD, FiR | Filters: LS, SPEC, RF, FCBF, FiR, mRMR | Embedded: BReg | Baseline: No FS
Colon 24.2 30.6 21.0 24.2 24.2 25.8 24.2 17.7 21.0 21.0
m=1000 0.3 7.4 0.4 1.5 9.0 147.0 7.1 19.8 2.2
SRBCT 0.0 0.0 0.0 0.0 0.0 1.2 0.0 3.6 2.4 0.0
m=1800 0.5 9.7 0.4 2.3 11.6 8.0 9.3 15.8 5.7
Lymph. 2.2 2.2 2.2 3.3 2.2 3.3 3.3 25.0 8.7 2.2
m=2000 0.8 29.5 0.8 9.0 21.7 25.2 29.2 24.7 143.7
P-Tumor 8.8 6.9 10.8 9.8 5.9 10.8 9.8 13.7 7.8 8.8
m=4000 1.8 21.6 2.4 13.5 47.8 29.1 21.1 48.8 46.1
L-Cancer 5.9 6.4 6.4 8.4 6.9 7.4 6.9 11.3 7.4 6.4
m=8000 3.8 54.9 7.5 47.5 95.5 208.8 54.1 88.5 207.4
41 / 45
Conclusions
High-dimensional datasets are increasingly common
Learning in high-dimensional data is challenging
Unsupervised and supervised FD and FS methods can
alleviate the inherent complexity
Wrapper and embedded methods are too costly
Our filter FD and FS proposals in this talk:
are both time and space efficient
attain competitive results with the state-of-the-art techniques
can act as pre-processors to wrapper and embedded methods
A promising avenue of research (our ongoing work)?
perform progressive FD and FS simultaneously!
42 / 45
Calling Weka from MATLAB - FD
FD example. Discretize dataset X using Kononenko's MDL method
...
t = weka.filters.supervised.attribute.Discretize();
config = wekaArgumentString('-R first-last', '-K');   % -K: Kononenko's MDL criterion
t.setOptions(config);
wekaData = wekaCategoricalData(X, SY2MY(Y));
t.setInputFormat(wekaData);
t.useFilter(wekaData, t);                % run the filter to compute the cut points
d = size(X,2);
for i = 1:d
  cutPoints{i} = t.getCutPoints(i-1);    % cut points of feature i (Java indexing is 0-based)
end
...
43 / 45
Calling Weka from MATLAB - FS
FS example. Apply ReliefF on dataset X
...
t = weka.attributeSelection.ReliefFAttributeEval();
config = wekaArgumentString('-M', m, '-D', 1, '-K', k);  % m sampled instances, seed 1, k neighbours
t.setOptions(config);
t.buildEvaluator(wekaCategoricalData(X, SY2MY(Y)));
nF = size(X,2);                          % number of features
out.W = zeros(1,nF);
for i = 1:nF
  out.W(i) = t.evaluateAttribute(i-1);   % ReliefF weight of feature i (Java indexing is 0-based)
end
[~, out.fList] = sort(out.W, 'descend'); % features ranked by decreasing weight
...
44 / 45
Some useful resources
Datasets available on-line:
The University of California at Irvine (UCI) repository,
archive.ics.uci.edu/ml/datasets.html
The Gene Expression Model Selector (GEMS) project, at
www.gems-system.org/
The Arizona State University (ASU) repository,
http://featureselection.asu.edu/datasets.php
The World Health Organization (WHO) data,
http://apps.who.int/ghodata/
Machine learning tools on-line:
ENTool, machine learning toolbox,
http://www.j-wichard.de/entool/
PRTools machine learning toolbox, Delft University of
Technology, http://www.prtools.org/
Weka, http://www.cs.waikato.ac.nz/ml/weka/
The ASU FS package with filter and embedded methods
http://featureselection.asu.edu/software.php
45 / 45
