Information technology has produced huge amounts of data, and these data need to be processed to extract the information hidden in them. Data produced nowadays are often high dimensional, i.e., they have a large number of attributes (or features), and they contain many redundant and irrelevant attributes. Processing these useless attributes requires an unnecessarily large amount of the system's computational and storage capacity. Dimensionality reduction techniques are therefore needed to select only the relevant attributes from all the attributes, identifying and removing irrelevant and redundant attributes to reduce computational and storage cost. This paper presents a novel approach for dimensionality reduction, GA-CFS, and also a Subset Approach that modifies it. In both approaches, a Genetic Algorithm is used as the search method for selecting subsets of attributes, and Correlation-based Feature Selection (CFS) is used to measure the quality of those subsets. The correlation measure Chi square has also been used to find the relevance of attributes. Experimental results on nine benchmark data sets from the UCI Machine Learning Repository show that the proposed GA-CFS approach selects good-quality feature sets for six of the nine data sets. For the remaining three data sets, the proposed GA-CFS approach performs the feature selection task poorly. To address this poor performance, another feature selection methodology, named the Subset Approach, is proposed as a modification of the earlier approach. Results show the success of the Subset Approach.
Keywords: Genetic Algorithm, Correlation-based Feature Selection, Chi Square, Symmetric Uncertainty.
1. INTRODUCTION
Recent technological advances have brought an explosive growth in data dimension, where the number of features (also called attributes or variables) is considerably larger than the number of observations, often in the hundreds or thousands. Such high-dimensional data are encountered in a wide range of areas such as engineering, bioinformatics, social networks, geo-science, and text categorization. Examples of these high-dimensional data are gene microarrays, deep sequencing, micro-RNA expression, financial time series data, spatial data, spatio-temporal data, movie rating data, and social networking data. In data sets having an extremely large number of features, many features are found to be non-informative; these features are either redundant or irrelevant. A redundant feature makes no new contribution to the target class, and an irrelevant feature does not affect the target class. Relevance of features can be broadly divided into three categories: strong relevance, weak relevance, and irrelevance [5]. Strong relevance of a feature indicates that the feature is always necessary for determining the target class. Weak relevance indicates that the feature is not always necessary but may become necessary under certain conditions. Irrelevance indicates that the feature is not necessary at all for determining the target class. Processing data full of irrelevant and redundant features poses various difficulties and challenges to data miners. High-dimensional data take up a lot of memory and bring a huge computational burden, even though modern computers have great processing power. In 1957, Richard Bellman coined the term Curse of Dimensionality [7] to indicate the difficulties in processing high-dimensional data: the number of samples needed to estimate a function with a given level of accuracy grows exponentially with the number of features. In addition, high-dimensional data also affect the estimation power of many learning algorithms. When the number of features is extremely large, some learning algorithms, such as Fisher's discriminant rule, are not applicable, while other methods, such as neural networks, k-nearest neighbour and support vector machines, may produce poor accuracy [6]. Also, high-dimensional data usually contain substantial noise, and the increased number of noisy features greatly complicates the data mining or estimation task. All these problems demand some mechanism to reduce the dimensionality of a data set without affecting its predictive power or statistical characteristics. Dimensionality reduction techniques contribute to addressing these problems: they improve the results of data mining tasks by identifying and removing redundant and irrelevant features from the data set. The rest of the paper is organized as follows. Section 2 presents a literature survey of the dimensionality reduction problem. Section 3 discusses the proposed approach for dimensionality reduction. Section 4 discusses the experimental results and analysis of the proposed approach. Finally, Section 5 presents the conclusion and future work.
2. LITERATURE SURVEY
In this section, a review of the feature selection literature is presented, with more focus on work done in recent years. The review starts with an overview of feature selection methods. The use of feature selection techniques in some of the areas where dimensionality reduction is extensively used is also presented.
consideration outputs some feature set. With each feature set generated, a frequency list of features is maintained, because the same feature can be found in many of the feature sets. In the end, the features with the highest frequencies are taken, and an optimal feature set is formed from them. Features selected by most of the learning methods are taken to be important for detecting network intrusions. Nguyen et al. [9] have proposed a feature selection procedure for IDS which finds feature sets using the Correlation-based Feature Selection (CFS) [5] method. Existing search methods used with CFS, such as brute-force search, best-first search [13], and the Genetic Algorithm (GA) [14], have shortcomings. Brute force works only on data sets with few features, so it is not useful for high-dimensional data sets. Best-first search and GA can work on high-dimensional data sets, but they do not always produce globally optimal solutions. The proposed search method to be used with CFS guarantees a globally optimal solution. First, the CFS measures of all the features are calculated, and then a 'polynomial mixed 0-1 fractional programming' (P01FP) [10] problem is constructed from these CFS measures. This problem is then transformed into a 'mixed 0-1 linear programming' (M01LP) [11] problem, whose solution by the branch-and-bound method [12] gives the optimal feature subset. The branch-and-bound method is just a framework for a large family of methods; there can be many variations of it depending on the particular problem, and branch-and-bound methods are often slow, with exponential worst-case performance. Most feature selection methods developed for IDS assume that the data are either categorical or real valued, but in many situations, feature data for IDS can be a mix of both. Yongli et al. [16] have proposed a feature selection method for IDS to handle such mixed data. The proposed method includes a feature ranking method [1] and an exhaustive search method. The feature ranking method [1] ranks features individually based on Mahalanobis distance. Then a feature subset evaluator is applied to a fixed number of attributes based on their ranking. The search method finds the optimal feature set from these ranked features based on an objective function which uses the classification rate and misclassification rate of a learning algorithm as parameters. However, the work is effective only for two-class classification problems, so multi-class classification with the proposed methodologies is left as future work.
work well for English-language text classification do not perform well for Arabic-language text classification. Many works on English-language text classification have combined multiple feature selection methods, but no work had shown the combination of multiple feature selection methods for Arabic-language text classification. So, the authors developed a feature selection approach for Arabic-language text classification. In this regard, they used a multiple-feature-selection approach consisting of combinations of five feature selection methods: IG (information gain) [18], CHI (Chi square) [18], NGL (the Ng, Goh and Low coefficient) [56], GSS (the Galavotti, Sebastiani, Simi coefficient) [57] and RS (relevancy score) [58]. The proposed method shows a slight improvement in classification accuracy for combinations of two or three feature selection methods; for combinations of four and five feature selection methods, no improvement is observed. It is observed that much research is still needed to find efficient feature selection methods for text classification in this language.
uses a filter feature selection method to return a specified number of best features. Then a Shapley-value-based wrapper feature selection method is used to predict which of the selected features have more relevance to the classifier; those relevant features form the optimal feature set. Brahim et al. [39] have proposed an aggregation technique for combining the results of different feature selection methods. In that work, the authors used a different feature selection technique for each data sample. To aggregate the different results produced by the different feature selection techniques, the proposed aggregation technique first calculates each feature selection algorithm's confidence, computed from feature weights and classification error. It then assigns a reliability factor to each algorithm based on this confidence. The result of the proposed method was compared with the results of two existing aggregation techniques, weighted mean aggregation (WMA) [40] and complete linear aggregation (CLA) [40], and it was found that the proposed aggregation technique outperforms both.
The results of various feature evaluation criteria, such as Fisher ratio [2] and Relief [4], are combined in this approach. Fusion of the performance criteria is done by two methods: one based on scores and the other based on rankings. In the score-based approach, each criterion produces a score vector containing the scores of all features, all scores are aggregated by an aggregation method, and finally a feature ranking method ranks the features. In the ranking-based approach, each feature is first assigned a rank based on each individual criterion, and then a ranking aggregation algorithm [31] aggregates the results and produces a final ranking of features. Sample weighting is one of the approaches for dealing with instability; in this approach, higher weights are assigned to relevant samples. Yu et al. [32] have proposed a general framework of sample weighting to improve the stability of feature selection methods. The proposed framework assigns a weight to each sample according to its impact on the estimation of feature relevance. The weighted feature set is then evaluated by a feature selection algorithm. Various algorithms can be developed under this framework, and the authors propose one named Margin-based Sample Weighting: the feature space is projected into a margin vector feature space [32], and the differences in margin vectors are used to weight the samples of the original space.
3. PROPOSED APPROACH
This section presents two approaches to deal with the problem of feature selection: the GA-CFS Approach and the Subset Approach. The Subset Approach is presented as a modification of the GA-CFS Approach to deal with the latter's poor performance on some data sets.
[Figure: Block diagram of the GA-CFS Approach. The input data set is fed both to the GA Model, which produces the reduced feature set of GA-CFS, and to a step that calculates the Chi-square score of all attributes and the median of those scores; the Filter Model then combines the two outputs.]
GA Model: In this model, Correlation-based Feature Selection is used to evaluate the quality of features. The Genetic Algorithm searches for optimal feature subsets based on the score assigned by the fitness function, CFS. CFS assigns a score to each feature subset that the GA explores as a candidate solution; in the last generation of the GA, the subset having the highest CFS score is presented as the optimal solution. The feature-feature and feature-class correlations in CFS can be calculated in various ways, and accordingly CFS can have many variations. In this work, Symmetric Uncertainty [5] has been used to calculate the feature-feature and feature-class correlations of CFS.
Overview of Correlation-based Feature Selection: Correlation-based Feature Selection (CFS) [5] is used to measure the quality of feature subsets. It uses correlation [60] as a measure to find the relevance of feature subsets. Every subset of features is assigned a score based on the following equation:

$$Merit_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

where
$k$ = number of features in the feature subset $S$,
$\overline{r_{cf}}$ = average feature-class correlation in the feature subset,
$\overline{r_{ff}}$ = average feature-feature correlation in the feature subset.
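For illustration, the merit of a subset can be computed directly from these quantities. The following minimal sketch (in Python; the function name and example values are ours, not the authors') evaluates the CFS merit formula above:

import math

def cfs_merit(avg_feat_class_corr, avg_feat_feat_corr, k):
    # CFS merit of a k-feature subset:
    # Merit_S = k * r_cf / sqrt(k + k*(k-1) * r_ff)
    return (k * avg_feat_class_corr) / math.sqrt(k + k * (k - 1) * avg_feat_feat_corr)

# Example: 5 features, average feature-class correlation 0.4 and
# average feature-feature correlation 0.2 give a merit of about 0.67.
print(cfs_merit(0.4, 0.2, 5))

Note that the merit rewards subsets whose features correlate strongly with the class (large numerator) while penalizing redundancy among the features themselves (larger denominator).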
Overview of Symmetric Uncertainty: Symmetric Uncertainty [5] is an information-theoretic measure like Information Gain. The Symmetric Uncertainty of a feature X with another feature (or class) Y is defined as:

$$SU(X,Y) = \frac{2\,IG(X|Y)}{H(X) + H(Y)}$$

where
Information Gain: $IG(X|Y) = H(X) - H(X|Y)$,
Entropy of feature X: $H(X) = -\sum_{x} p(x)\log_2 p(x)$,
Conditional entropy of feature X given another feature or class Y: $H(X|Y) = -\sum_{y} p(y) \sum_{x} p(x|y)\log_2 p(x|y)$.
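These definitions translate directly into code for discrete-valued features. The following sketch (Python; names and data layout are our own assumptions, not the authors' implementation) computes SU(X, Y) from two sequences of discrete values:

from collections import Counter
import math

def entropy(xs):
    # H(X) = -sum_x p(x) log2 p(x)
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    # H(X|Y) = -sum_y p(y) sum_x p(x|y) log2 p(x|y)
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def symmetric_uncertainty(xs, ys):
    # SU(X, Y) = 2 * [H(X) - H(X|Y)] / [H(X) + H(Y)], in [0, 1]
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx - conditional_entropy(xs, ys)) / (hx + hy)

Unlike plain Information Gain, SU is normalized by H(X) + H(Y), which keeps the measure in [0, 1] and compensates for Information Gain's bias toward features with more values.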
Overview of the Chi-Square Model: The Chi-Square Model measures the relevance of each feature X to the class Y using the chi-square statistic computed over their contingency table:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where
$\chi^2$ = chi-square statistic,
$O_i$ = observed frequency of attribute and class values,
$E_i$ = expected frequency of attribute and class values,
and $i$ ranges from 1 to the total number of cells in the contingency table of feature X and class Y.
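As a small illustration (Python; the function name and data layout are ours), the statistic can be computed from the observed and expected cell frequencies of the contingency table:

from collections import Counter

def chi_square_score(feature_values, class_values):
    # chi2 = sum_i (O_i - E_i)^2 / E_i over all cells of the
    # contingency table of feature X and class Y
    n = len(feature_values)
    observed = Counter(zip(feature_values, class_values))
    feat_totals = Counter(feature_values)
    class_totals = Counter(class_values)
    chi2 = 0.0
    for x in feat_totals:
        for y in class_totals:
            expected = feat_totals[x] * class_totals[y] / n  # E_i under independence
            o = observed.get((x, y), 0)
            chi2 += (o - expected) ** 2 / expected
    return chi2

A higher score indicates a stronger deviation from independence between the feature and the class, i.e., a more relevant feature.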
Filter Model: The reduced feature set from the GA Model is filtered again for irrelevant attributes using the ranking of attributes obtained from the Chi-Square Model. All attributes in the output set of the GA Model having a Chi-square score less than the median Chi-square score of all the attributes are filtered out (a code sketch of this step follows Algorithm 1 below). The GA-CFS Approach is presented in algorithmic form as follows.
Algorithm 1 GA-CFS Approach ( )
1: Input: IP1 = input data set
2: IP2 = class attribute
3: Output: OP = final optimized set of attributes
4: A = GA Model (IP1, IP2)
5: B = Chi-Square Model (IP1, IP2)
6: OP = Filter Model (A, B)
7: return final optimized set of attributes OP
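To make the Filter Model step concrete, the following minimal sketch (Python; names, data layout and example values are our own assumptions) drops from the GA Model's output A every attribute whose Chi-square score in B falls below the median score of all attributes:

import statistics

def filter_model(A, B):
    # Keep only attributes of A whose Chi-square score (from B, a
    # dict mapping attribute -> score) is at least the median score
    # of all attributes.
    median_score = statistics.median(B.values())
    return [attr for attr in A if B[attr] >= median_score]

# Hypothetical usage: scores of five attributes; the GA selected three.
B = {"a1": 0.1, "a2": 2.5, "a3": 0.9, "a4": 4.2, "a5": 1.7}
A = ["a1", "a2", "a4"]
print(filter_model(A, B))  # -> ['a2', 'a4'] (the median score is 1.7)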
Algorithm 2 Chi-Square Model ( )
1: Input: IP = input data set
2: Y = class attribute
3: Output: B = list of Chi-square scores of all attributes
4: for each attribute in the input data set IP do
5: calculate its Chi-square score with respect to the class Y
6: end for
7: return the list B
Algorithm 3 GA Model ( )
1: Input: IP = input data set
2: Y = class attribute
3: Output: reduced feature set (the fittest chromosome)
4: sum1, sum2 = 0
5: create the specified number of population (chromosomes encoding feature subsets)
6: for the (specified number of generations + 1) do
7:   for each chromosome (chm_i) in the population do
8:     for each attribute (X_i) in the chromosome do
9:       SU(X_i, Y) = 2[H(X_i) - H(X_i | Y)] / [H(X_i) + H(Y)]
10:      sum1 = sum1 + SU(X_i, Y)
11:    end for
12:    for each attribute (X_i) in the chromosome do
13:      for each attribute (X_j) left in the chromosome do
14:        SU(X_i, X_j) = 2[H(X_i) - H(X_i | X_j)] / [H(X_i) + H(X_j)]
15:        sum2 = sum2 + SU(X_i, X_j)
16:      end for
17:      remove attribute (X_i) from the chromosome (chm_i)
18:    end for
19:    compute the CFS fitness score of chm_i from sum1 and sum2, then reset sum1, sum2 = 0
20:  end for
21:  arrange the chromosomes in sorted order based on their fitness score
22:  if current generation ≠ (last generation + 1) then
23:    apply crossover on each fit pair of parent chromosomes to produce children
24:    perform mutation on the children
25:    replace the least fit chromosomes in the population with all the children
26:  end if
27: end for
28: return the chromosome having the highest fitness score
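The GA search itself can be sketched compactly. The following Python sketch is our own illustration, not the authors' MATLAB implementation: chromosomes are bit masks over the features, the fitness function is assumed to return the CFS merit of the encoded subset (built from the Symmetric Uncertainty values as above), and one-point crossover with truncation selection is an assumption, since the paper does not specify the operators. The parameter defaults follow the experimental setup in Section 4.

import random

def ga_feature_search(n_features, fitness, pop_size=20, generations=20,
                      crossover_rate=0.6, mutation_rate=0.01):
    # Chromosomes are bit masks over features; fitness(mask) is assumed
    # to return the CFS merit of the encoded subset.
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)       # fittest chromosomes first
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            if random.random() < crossover_rate:  # one-point crossover
                cut = random.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation
            child = [1 - b if random.random() < mutation_rate else b
                     for b in child]
            children.append(child)
        pop = parents + children                  # replace the least fit
    return max(pop, key=fitness)                  # mask of the selected subset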
Algorithm 4 Subset Approach ( )
1: Input: IP = set of all attributes with their Chi-square score list B
2: IP2 = set of attributes selected from the data subsets
3: D = attribute set output by the GA-CFS Approach
4: Output = optimized attribute set
5: sort the list B and accordingly sort the set IP
6: remove from the set IP attributes having Chi-square score less than the median of the scores in B
7: for each feature set Z obtained by applying the GA-CFS Approach on a subset of the data do
8: make a frequency table of attributes with the help of Z
9: end for
10: remove attributes from IP2 having frequency below a threshold m (user specified)
11: remove attributes from IP2 by applying Chi square the same way as the algorithm Chi-Square Model of the proposed approach, except on attributes which are present in both output D and output IP2
12: after removal, IP2 is the finally optimized set of attributes returned by the proposed approach
13: return IP2
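The frequency step of the Subset Approach can be illustrated as follows (Python; names, data layout and example values are our own assumptions): GA-CFS is run on each subset of the data, the selected feature sets Z are tallied into a frequency table, and attributes selected fewer than m times are removed.

from collections import Counter

def subset_approach_frequencies(feature_sets, m):
    # feature_sets: the attribute sets Z selected by GA-CFS on each
    # data subset; attributes with frequency below m are removed.
    freq = Counter(attr for Z in feature_sets for attr in Z)
    return {attr for attr, count in freq.items() if count >= m}

# Hypothetical usage: GA-CFS run on three data subsets, threshold m = 2.
Z_per_subset = [{"a1", "a3"}, {"a1", "a2", "a3"}, {"a3", "a4"}]
print(subset_approach_frequencies(Z_per_subset, m=2))  # -> {'a1', 'a3'}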
4. EXPERIMENTAL RESULTS
4.1. Experimental Setup
MATLAB R2013a has been used to implement the code of the proposed approach. The following parameters of the GA are set to standard and commonly used values based on observations made in the literature survey of feature selection.
(1) Population size = 20
(2) Number of generations = 20
(3) Crossover rate = 0.6
(4) Mutation rate = 0.01
Missing values (if any) present in the data set have been replaced by the modes of the training data. The data set is partitioned into subsets after shuffling the instances randomly, to maintain the maximum possible uniformity of classes in each subset, since some data sets have their instances ordered by class. Shuffling of instances is repeated 10 times.
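The preprocessing just described can be sketched as follows (Python; function names and the None-for-missing convention are our own assumptions): missing values are imputed with column modes, and instances are shuffled before partitioning so that class-ordered data sets are mixed across subsets.

import random
from collections import Counter

def impute_modes(rows):
    # Replace missing values (None) in each column by that column's mode;
    # assumes every column has at least one observed value.
    cols = list(zip(*rows))
    modes = [Counter(v for v in col if v is not None).most_common(1)[0][0]
             for col in cols]
    return [[modes[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def shuffle_and_partition(rows, n_subsets, seed=None):
    # Shuffle instances randomly, then deal them round-robin into
    # n_subsets roughly equal, class-mixed parts.
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    return [rows[i::n_subsets] for i in range(n_subsets)]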