
A novel approach for Feature Selection based on correlation measures CFS and Chi Square


ADITYA KUMAR
Central University of Bihar
Patna, Bihar, India
aditya@cub.ac.in
SMITA ROY
Central University of Bihar
Patna, Bihar, India
smitaroy@cub.ac.in
PRABHAT RANJAN
Central University of Bihar
Patna, Bihar, India
prabhatranjan@cub.ac.in
Received (Day Month Year)
Revised (Day Month Year)
Accepted (Day Month Year)

Information technology produces huge amounts of data, and this data must be processed to extract the information hidden in it. Data produced nowadays are often high dimensional, i.e. they have a large number of attributes (or features), many of which are redundant or irrelevant. Processing these useless attributes wastes computational and storage resources, so dimensionality reduction techniques are needed to identify and remove irrelevant and redundant attributes and select only the relevant ones, reducing computational and storage cost. This paper presents a novel approach, GA-CFS, for dimensionality reduction, along with a Subset Approach that modifies it. In both approaches, a Genetic Algorithm is used as the search method for selecting subsets of attributes, and Correlation-based Feature Selection (CFS) is used to measure the quality of those subsets. The correlation measure Chi Square is also used to assess the relevance of attributes. Experimental results on nine benchmark data sets from the UCI Machine Learning repository show that the proposed GA-CFS Approach selects good quality feature sets for six of the nine data sets. For the remaining three data sets, GA-CFS performs the feature selection task poorly; to address this, a second methodology named the Subset Approach is proposed as a modification of the earlier approach. Results show the success of the Subset Approach.
Keywords: Genetic Algorithm, Correlation-based Feature Selection, Chi Square, Symmetric Uncertainty.



1. INTRODUCTION
Recent technological advances have brought explosive growth in data dimensionality, where the number of features (also called attributes or variables) is considerably larger than the number of observations, often running into the hundreds or thousands. Such high dimensional data are encountered in a wide range of areas such as engineering, bioinformatics, social networks, geo-science and text categorization. Examples include gene microarrays, deep sequencing, micro RNA expression, financial time series, spatial and spatio-temporal data, movie rating data and social networking data. In data sets with an extremely large number of features, many features turn out to be non-informative: they are either redundant or irrelevant. A redundant feature contributes no new information about the target class, and an irrelevant feature does not affect the target class at all. The relevance of features can be broadly divided into three categories: strong relevance, weak relevance and irrelevance [5]. A strongly relevant feature is always necessary for determining the target class. A weakly relevant feature is not always necessary but may become necessary under certain conditions. An irrelevant feature is never necessary for determining the target class. Processing data full of irrelevant and redundant features poses various difficulties and challenges to data miners. High-dimensional data occupy a lot of memory and impose a huge computational burden even on modern computers. In 1957, Richard Bellman coined the term Curse of Dimensionality [7] to describe these difficulties: the number of samples needed to estimate a function with a given level of accuracy grows exponentially with the number of features. High dimensional data also weaken the estimation power of many learning algorithms. When the number of features is extremely large, some learning algorithms, such as Fisher's discriminant rule, are not applicable, while other methods, such as neural networks, K-nearest neighbour and support vector machines, may produce poor accuracy [6]. Moreover, high-dimensional data usually contain much noise, and the increased number of noisy features greatly complicates the data mining or estimation task. All of these problems demand a mechanism to reduce the dimensionality of the data set without affecting its predictive power or statistical characteristics. Dimensionality reduction techniques address these problems by identifying and removing redundant and irrelevant features from the data set, thereby improving the results of the data mining task.

The rest of the paper is organized as follows. Section 2 presents a literature survey of the dimensionality reduction problem. Section 3 describes the proposed approach. Section 4 discusses the experimental results and their analysis. Finally, Section 5 presents the conclusion and future work.

2. LITERATURE SURVEY
This section reviews the feature selection literature, with more focus on work done in recent years. The review starts with an overview of feature selection methods, followed by the use of feature selection techniques in some of the areas where dimensionality reduction is extensively applied.

2.1. Overview of Dimensionality Reduction Methods


There are two types of approaches for dimensionality reduction: feature extraction and feature selection [1]. In feature extraction, the number of features is reduced by projecting the data into a lower dimensional space, so the new features differ from the original ones. In feature selection, a small subset of the original features is selected without any projection into a new space. Feature selection is often preferred because its results are more readable and interpretable than those of feature extraction. Existing feature selection methods mainly follow two approaches: individual feature selection and feature subset selection [1]. Individual feature selection methods rank features according to their degree of relevance. Feature subset selection methods search for a minimal subset of features that satisfies some objective function; such a procedure comprises a search strategy to select candidate subsets and an objective function to evaluate them. In every feature subset selection algorithm, a search strategy first proposes some subsets and the objective function then evaluates their quality; the highest quality subset is selected as the desired feature set (see the generic sketch below). The objective functions of feature selection methods fall into three categories: the filter model, the wrapper model and the hybrid model [8]. Filter methods are independent of any classifier; they evaluate feature subsets using general characteristics of the data with respect to the class label, such as distance, information or correlation. Wrapper methods, in contrast, use a classifier to evaluate the quality of the selected features and consider all combinations of the feature set, so they are more time consuming than filter methods, although their predictive accuracy is normally higher. The hybrid model overcomes the limitations of the filter and wrapper models by using different evaluation criteria for different subsets during subset searching and evaluation; it is more efficient than the filter model and less expensive than the wrapper model. Finally, feature selection methods belong to two broad categories: supervised feature selection, applied when the target data set is labelled, and unsupervised feature selection, applied when it is unlabelled [8].
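To make the search/evaluation split concrete, a generic subset-selection loop can be sketched as follows. This is purely illustrative: the names are placeholders, and the objective may be a statistic of the data (filter model) or a classifier's accuracy (wrapper model).

```python
from typing import Callable, Iterable, Tuple

def select_subset(candidates: Iterable[Tuple[int, ...]],
                  objective: Callable[[Tuple[int, ...]], float]) -> Tuple[int, ...]:
    """Generic feature subset selection: a search strategy supplies candidate
    subsets and the objective function scores them; the best subset is kept."""
    best, best_score = (), float('-inf')
    for subset in candidates:
        score = objective(subset)  # filter: data statistic; wrapper: classifier accuracy
        if score > best_score:
            best, best_score = subset, score
    return best
```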

2.2. Dimensionality Reduction in various Domains


Dimensionality reduction techniques are used in areas such as intrusion detection, medical diagnosis, bioinformatics, text categorization, image retrieval, genomic microarray analysis and customer relationship management. All of these areas suffer from the problem of high dimensional data, and various approaches have been developed to reduce data dimensionality for effective data mining. Some of these areas are explored below.

2.2.1. Intrusion Detection


An intrusion detection system (IDS) [15] takes a huge number of traffic features as input. Processing data with such a huge number of features is computationally expensive, so feature selection techniques are needed to reduce the computational cost without reducing the efficiency of the IDS. Khor et al. [15] used a multiple feature selection approach for IDS in which four learning algorithms, namely ID3, J48, the Naive Bayes classifier and a learned Bayesian network, were employed. Each feature selection run produces a feature set, and a frequency list of features is maintained across these sets because the same feature can appear in many of them. In the end, the most frequent features are taken to form an optimal feature set; features selected by most of the learning methods are considered important for detecting network intrusions. Nguyen et al. [9] proposed a feature selection procedure for IDS that finds a feature set using the Correlation-based Feature Selection (CFS) [5] measure. Existing search methods used with CFS, such as brute force search, best first search [13] and the Genetic Algorithm (GA) [14], have shortcomings: brute force only works on data sets with few features, so it is not useful for high dimensional data, while best first search and GA can handle high dimensional data but do not always produce globally optimal solutions. The proposed search method, used with CFS, guarantees a globally optimal solution. First the CFS measures of all the features are calculated, and a polynomial mixed 0-1 fractional programming (P01FP) [10] problem is constructed from them. This problem is then transformed into a mixed 0-1 linear programming (M01LP) [11] problem, whose solution by the Branch and Bound method [12] gives the optimal feature subset. Branch and Bound, however, is only a framework for a large family of methods with many problem-specific variants; such methods are often slow and have exponential worst case performance. Most feature selection methods developed for IDS assume that the data are either categorical or real valued, but in many situations IDS feature data are a mix of both. Yongli et al. [16] proposed a feature selection method for IDS that handles mixed data. It includes a feature ranking method [1], which ranks features individually based on Mahalanobis distance, and an exhaustive search method. A feature subset evaluator is then applied to a fixed number of attributes based on their ranking, and the search method finds the optimal feature set from these ranked features using an objective function whose parameters are the classification and misclassification rates of a learning algorithm. However, the work is effective only for two class classification problems; multi class classification with the proposed methodology is left as future work.

2.2.2. Text Categorization


Text categorization [18] also suffers from the high dimensionality problem. All the unique terms in a document collection serve as the feature set, which normally runs into thousands in any text categorization task, so feature selection for text classification is challenging, and various algorithms have been proposed for it. The text categorization domain has some widely used feature selection methods, though it is not limited to these: Document Frequency (DF) [18], Term Strength (TS) [18], Chi Square (CHI) [18], Mutual Information (MI) [18] and Information Gain (IG) [18]. Text classification tasks are either topic-based, where the categories relate to subject content (e.g. politics, religion), or sentiment-based, where the categories are the attitude expressed in the text (e.g. recommended or not recommended, liked or not liked). Naive Bayes, Maximum Entropy, K-NN and SVM are widely used classification algorithms for text categorization; among them, SVM performs better than the others. Li et al. [19] proposed a framework of feature selection methods for text classification based on two basic measurements: frequency measurement, the frequency of a term in one category of documents, and ratio measurement, the ratio of a term's frequency in one category of documents to its frequency in another. The framework helps in choosing an appropriate feature selection algorithm from a number of candidates when a new text classification task arises; the authors show analytically that other feature selection methods use measurements built upon these two basic ones. Abanumay et al. [20] state that feature selection methods that work well for English text classification do not necessarily perform well for Arabic text classification. Many works on English text classification combine multiple feature selection methods, but no work had done so for Arabic, so the authors developed a multiple feature selection approach for Arabic text classification consisting of combinations of five methods: IG (Information Gain) [18], CHI (Chi Square) [18], NGL (the Ng, Goh and Low coefficient) [56], GSS (the Galavotti, Sebastiani and Simi coefficient) [57] and RS (relevancy score) [58]. The proposed method yields a slight improvement in classification accuracy for combinations of two or three feature selection methods, while combinations of four or five show no improvement. It is observed that considerable research is still needed to find efficient feature selection methods for text classification in this language.

2.2.3. Image Retrieval


Feature selection is an important procedure in image retrieval [21]. An image retrieval system is used for browsing, searching and retrieving images from a large database of digital images. Gigabytes of images are generated every day, and this huge amount of information cannot be accessed unless it is organized to allow efficient browsing, searching and retrieval. Chen et al. [21] proposed a feature selection method for image features based on Ant Colony Optimization [49]. Ant Colony Optimization belongs to the group of nature inspired algorithms [33], which also includes the Genetic Algorithm (GA) [34] and Particle Swarm Optimization (PSO) [36]; all have been used in various feature selection methods. Most Ant Colony Optimization based feature selection algorithms build a complete graph, but since each selected feature is independent of the other selected features, a complete graph with O(n^2) edges is inappropriate for this scenario. The authors proposed a new graph model with O(n) edges, reducing the memory and computational requirements compared to other ACO based algorithms. The ants travel on the graph from one node to the next, deciding at each node which node to choose, and stop at the last node; when an ant completes the search from the first node to the last, the arcs on its route form a solution. Thangavel et al. [23] proposed an unsupervised feature selection algorithm based on Rough Sets [24] for mammogram image classification. As opposed to rough set based supervised feature selection with the Entropy Based Reduct (EBR) [25], which takes two input parameters (conditional attribute and decision attribute), the proposed unsupervised EBR based feature selection (UEBR) takes only one parameter, the conditional attribute. The algorithm was evaluated against three rough set based supervised feature selection algorithms, Quick Reduct (QR) [25], Entropy based Reduct (EBR) [25] and Relative Reduct (RR) [25], and the results show that it performs better than all of them. Lotfabadi et al. [26] carried out a comparative study of a fuzzy rough set [27] based feature selection method against five other methods, namely Relief-F [61], Information Gain, Gain Ratio, OneR and Chi Square, in the area of image retrieval; the results show that the fuzzy rough method outperforms all the others on image data sets for the image retrieval task.

2.3. Enhanced Approaches for Feature Selection


2.3.1. Ensemble Approaches
Ensemble approaches [37] to feature selection have in many cases given better classification accuracy than single approaches. They combine more than one feature selection method and produce their output by aggregating the outputs of the individual methods. Liu et al. [37] used a combination of a mutual information based filter method and a Shapley value [38] based wrapper method. The first step of their algorithm uses the filter method to return a specified number of best features; the Shapley value based wrapper method then determines which of these features are most relevant to the classifier, and those relevant features form the optimal feature set. Brahim et al. [39] proposed an aggregation technique for combining the results of different feature selection methods. In their work, a different feature selection technique is used for each data sample. To aggregate the results, the proposed technique first calculates each feature selection algorithm's confidence, using feature weights and classification error, and then assigns a reliability factor to each algorithm based on that confidence. The results of the proposed method were compared with two existing aggregation techniques, weighted mean aggregation (WMA) [40] and complete linear aggregation (CLA) [40], and the proposed technique was found to outperform both.

2.3.2. Multi Objective Approaches


Multi objective approaches [45] are also commonly used in feature selection. Most feature selection algorithms have the single objective of maximizing classification (or modelling) accuracy while reducing the number of features; although these are two objectives, they are usually treated as one. However, it may be necessary to optimize several objectives during feature selection, for example enhancing specificity [48] and sensitivity [48] in addition to maximizing classification accuracy and reducing the number of features. In such cases, multiple objective functions are needed instead of a single one to evaluate the quality of feature sets. Multi objective optimization assumes that there is no single solution optimal for all objectives, so it outputs a set of solutions. Evolutionary algorithms [49] have proved very effective for multi objective optimization problems and are thus well suited to multi objective feature selection. Karshenas et al. [45] proposed a methodology that uses six objective functions, namely area under the ROC curve [46], sensitivity [48], specificity [48], precision [48], F1 measure [48] and Brier score [47], to evaluate the quality of feature sets during feature selection. Many recent evolutionary optimization approaches use estimation of distribution algorithms (EDAs). An EDA learns a probabilistic model from the set of selected solutions; this model captures the common properties of those solutions and is then used to generate new solutions in the search space. The proposed approach uses a variant of EDA based on joint modelling of objectives and variables.

2.4. Stability of Feature Selection Algorithms


A stable (robust) feature selection method [28] is one whose result remains stable when some data are added to or deleted from the data set. If a feature selection algorithm is not stable, it may give different results even when only a small amount of data changes, so stability is a desirable characteristic of feature selection algorithms. The stability of such algorithms was long ignored but has recently attracted increasing research interest. There are many causes of instability [28, 29]; results are often unstable when a high dimensional data set contains a small number of samples, a situation common in the biomedical field, where DNA data typically contain thousands of features but very few samples. To counter this instability, there are different frameworks for designing feature selection methods, namely ensemble feature selection [28, 29], group feature selection [28, 29] and sample weighting [28, 29]. Ensemble approaches have been found effective for producing stable algorithms. Yang et al. [30] used an ensemble approach called the multi criterion fusion-based recursive feature elimination (MCF-RFE) algorithm.

In this approach the results of several feature evaluation criteria, such as the Fisher ratio [2] and Relief [4], are combined. Fusion of the criteria is done in two ways, one based on scores and the other on rankings. In the score based approach, each criterion produces a score vector containing the scores of all features, all scores are then aggregated by an aggregation method, and finally a feature ranking method ranks the features. In the ranking based approach, each feature is first assigned a rank under each individual criterion, and a ranking aggregation algorithm [31] then combines the results to produce a final ranking of the features. Sample weighting is another approach to dealing with instability; it assigns higher weights to relevant samples. Yu et al. [32] proposed a general framework of sample weighting to improve the stability of feature selection methods. The framework assigns a weight to each sample according to its impact on the estimation of feature relevance, and the weighted data are then evaluated by a feature selection algorithm. Various algorithms can be developed under this framework; the authors proposed one named Margin based Sample Weighting, in which the feature space is projected into a margin vector feature space [32] and the differences in margin vectors are used to weight the samples of the original space.

2.5. Dimensionality Reduction using Genetic Algorithm


Genetic Algorithms are used in feature selection as a search method for exploring feature subsets. At the beginning of a GA based search there is a very large solution space to explore, but unlike other search methods, a GA can reach an optimal solution without exploring the entire space. Its main advantage is that it can find a globally optimal solution even in the presence of many locally optimal ones, although GAs become computationally slow as the number of features increases. Genetic Algorithms have been applied to feature selection using both filter and wrapper approaches. Much previous work on GA based feature selection involves modifying the fitness function: some authors propose classifier based fitness functions, while others use multiple fitness functions. Freitas et al. [50] used a wrapper approach with a Genetic Algorithm in which metrics of the C4.5 decision tree serve as the fitness function; the fitness is based on the error rate of the decision tree classification and the size of the tree. Zhang et al. [51] used multiple fitness functions in the GA, taken as classification scores from multiple classifiers: a decision tree classifier [48], a multi layer perceptron back propagation ANN classifier [48] and a support vector machine (SVM) classifier [48]. When the GA selects parent features for crossover, the resulting feature sets are evaluated on the ensemble of classifiers, and the fitness functions use classification accuracy and consensus (the majority vote for a particular feature subset) as parameters. Ren et al. [22] developed a feature selection method for CBIR CAD (computer-aided detection and diagnosis), where content-based image retrieval (CBIR) is the task of selecting images from a large image database. They used a GA based approach, but a problem with GAs is that different runs can produce different outputs. The authors therefore propose a methodology called F-GA (Frequency GA) that produces stable output: the algorithm is run a specified number of times and the features occurring with the highest frequency are selected as the feature set. Some authors [52, 53] have proposed modifications of GA parameters to improve the feature selection process, such as changing the representation of chromosomes or modifying the crossover parameter.

3. PROPOSED APPROACH
This section presents two approaches to the feature selection problem: the GA-CFS Approach and the Subset Approach. The Subset Approach is a modification of the GA-CFS Approach intended to deal with the poor performance of GA-CFS on some data sets.

3.1. Proposed GA-CFS Approach


In the GA-CFS Approach, the relevance of the attributes in the data set is first measured using Correlation-based Feature Selection with a Genetic Algorithm as the search method. The attribute set produced by this step is then filtered again by another correlation measure, Chi Square: attributes selected by Correlation-based Feature Selection that are not relevant according to their Chi Square score are removed. The attributes on which Correlation-based Feature Selection and Chi Square agree have greater predictive power. Figure 1 shows the architecture of the GA-CFS Approach. The input data set passes through the GA Model and the Chi-Square Model, and the results from these two models are passed to the Filter Model, which produces the final optimized set of features.

[Figure: the data set feeds both the GA Model, which produces the reduced feature set of GA-CFS, and the Chi-Square Model, which computes the Chi Square score of every attribute and the median of those scores; the Filter Model removes the attributes whose Chi Square score is below the median, yielding the optimized feature set of GA-CFS.]

Fig. 1. GA-CFS Approach

GA Model: In this model, Correlation-based Feature Selection is used to evaluate the quality of feature subsets. The Genetic Algorithm searches for optimal feature subsets guided by the score that the fitness function, CFS, assigns to each candidate subset. In the last iteration of the GA, the subset with the highest CFS score is returned as the optimal solution. The feature-feature and feature-class correlations in CFS can be computed in various ways, giving many variants of CFS; in this work, Symmetric Uncertainty [5] is used for both.

Overview of Correlation-based Feature Selection: Correlation-based Feature Selection (CFS) [5] is used to measure the quality of feature subsets. It uses correlation [60] as a measure of the relevance of a feature subset. Every subset of features is assigned a score based on the following equation:

$$ \mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} $$

where
$k$ = number of features in the feature subset,
$\overline{r_{cf}}$ = average feature-class correlation in the feature subset,
$\overline{r_{ff}}$ = average feature-feature correlation in the feature subset.
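As a purely illustrative sketch (the authors' implementation is in MATLAB and is not reproduced in the paper), the merit score can be computed directly from the subset size and the two average correlations:

```python
import math

def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """CFS merit of a subset of k features, given the average
    feature-class and feature-feature correlations."""
    if k == 0:
        return 0.0
    return (k * avg_feature_class_corr) / math.sqrt(
        k + k * (k - 1) * avg_feature_feature_corr)

# A subset with strong class correlation and low redundancy scores higher
# than an equally class-correlated but highly redundant one.
print(cfs_merit(5, 0.6, 0.1))   # ~1.13
print(cfs_merit(5, 0.6, 0.8))   # ~0.65
```

This captures the bias CFS encodes: subsets whose features correlate strongly with the class but weakly with each other receive the highest merit.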
Overview of Symmetric Uncertainty: Symmetric Uncertainty [5] is an information theoretic measure like Information Gain. The Symmetric Uncertainty of feature X with another feature (or the class) Y is defined as:

$$ SU(X, Y) = 2\,\frac{IG(X|Y)}{H(X) + H(Y)} $$

where
Information Gain: $IG(X|Y) = H(X) - H(X|Y)$,
Entropy of feature X: $H(X) = -\sum_{x} p(x)\log_2 p(x)$,
Conditional entropy of feature X given another feature or the class Y: $H(X|Y) = -\sum_{y} p(y) \sum_{x} p(x|y)\log_2 p(x|y)$.
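The following Python sketch shows how Symmetric Uncertainty can be estimated from the empirical entropies above. It is an illustration only, assuming discrete (nominal) attributes as used in this paper; it is not the authors' MATLAB code.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy H(X) of a list of nominal values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X|Y) estimated from paired samples of two nominal variables."""
    n = len(x)
    h = 0.0
    for y_val, c in Counter(y).items():
        x_given_y = [xi for xi, yi in zip(x, y) if yi == y_val]
        h += (c / n) * entropy(x_given_y)
    return h

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    info_gain = hx - conditional_entropy(x, y)
    return 2.0 * info_gain / (hx + hy)

# A feature identical to the class has SU = 1; an unrelated one is near 0.
cls = ['a', 'a', 'b', 'b', 'a', 'b']
print(symmetric_uncertainty(['a', 'a', 'b', 'b', 'a', 'b'], cls))  # 1.0
print(symmetric_uncertainty(['x', 'y', 'x', 'y', 'x', 'y'], cls))  # ~0.08
```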

The GA Model is presented in algorithmic form in Algorithm 3 below.


Chi-Square Model: In this model, the correlation measure Chi Square is used to assign a score to each attribute of the data set. The score indicates the relevance of that attribute: the higher the Chi Square score, the more strongly the attribute is correlated with the class. As with any ranking based attribute quality measure, a threshold must be chosen above which attributes are considered relevant and below which they are considered irrelevant. Since Chi Square only assigns a score to each attribute, the median of the Chi Square scores of all attributes of the data set is used as this threshold. The functioning of the Chi-Square Model is given by Algorithm 2.
Overview of Chi Square: The Chi Square test [18] is used to decide whether an attribute is important for classification. It involves building a contingency table between the class and the attribute. Under the null hypothesis the two variables are independent, i.e. not correlated with each other. If the calculated Chi Square value exceeds the tabulated value, the null hypothesis is rejected; otherwise it is accepted.

The Chi Square statistic is given by the following formula:

$$ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} $$

where
$\chi^2$ = Chi Square statistic,
$O_i$ = observed frequency of attribute and class values,
$E_i$ = expected frequency of attribute and class values,
and $i$ ranges from 1 to the total number $n$ of cells in the contingency table of feature X and class Y.
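A minimal sketch of this per-attribute score, computed over the attribute-class contingency table, is shown below. It is illustrative only and assumes nominal attribute and class values.

```python
from collections import Counter

def chi_square_score(attribute, cls):
    """Chi Square statistic of a nominal attribute against the class labels."""
    n = len(cls)
    attr_counts = Counter(attribute)
    cls_counts = Counter(cls)
    observed = Counter(zip(attribute, cls))
    score = 0.0
    for a_val, a_cnt in attr_counts.items():
        for c_val, c_cnt in cls_counts.items():
            expected = a_cnt * c_cnt / n            # E_i under independence
            o = observed.get((a_val, c_val), 0)     # O_i from the data
            score += (o - expected) ** 2 / expected
    return score

# A perfectly class-aligned attribute gets a high score, a constant one gets 0.
cls = ['yes', 'yes', 'no', 'no']
print(chi_square_score(['a', 'a', 'b', 'b'], cls))  # 4.0
print(chi_square_score(['a', 'a', 'a', 'a'], cls))  # 0.0
```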

Filter Model: The reduced feature set from the GA Model is filtered again for irrelevant attributes using the scores obtained from the Chi-Square Model. Every attribute in the output set of the GA Model whose Chi Square score is less than the median of the Chi Square scores of all attributes is filtered out. The GA-CFS Approach is presented in algorithmic form as follows.

Algorithm 1 GA-CFS Approach ()
1: Input IP1 = Data set
2: Input IP2 = Attribute set
3: Output OP = Optimized attribute set
4: A = GA Model (IP1, IP2)
5: B = Chi-Square Model (IP1, IP2)
6: OP = Filter Model (A, B)
7: return final optimized set of attributes OP


Algorithm 2 Chi-Square Model ()
1: Input IP1 = Data set
2: Input IP2 = Attribute set
3: Output CH = Chi Square score set
4: for each attribute in the input data set IP1 do
5:   calculate its Chi Square score
6:   store the Chi Square score in set CH
7: end for
8: return CH

Algorithm 3 GA Model ()
1: Input IP1 = Data set
2: Input IP2 = Attribute set
3: Output OP2 = Reduced attribute set
4: sum1 = 0, sum2 = 0
5: create the specified number of chromosomes as the initial population
6: for (specified number of generations + 1) iterations do
7:   for each chromosome (chm_i) in the population do
8:     for each attribute (X_i) in the chromosome do
9:       SU(X_i, Y) = 2 [H(X_i) - H(X_i | Y)] / [H(X_i) + H(Y)]
10:      sum1 = sum1 + SU(X_i, Y)
11:    end for
12:    compute the average feature-class correlation from sum1
13:    for each attribute (X_i) in the chromosome do
14:      for each attribute (X_j) left in the chromosome do
15:        SU(X_i, X_j) = 2 [H(X_i) - H(X_i | X_j)] / [H(X_i) + H(X_j)]
16:        sum2 = sum2 + SU(X_i, X_j)
17:      end for
18:      remove attribute (X_i) from the chromosome (chm_i)
19:    end for
20:    compute the average feature-feature correlation from sum2
21:    assign the CFS merit obtained from these averages as the fitness score of chromosome (chm_i)
22:  end for
23:  arrange the chromosomes in sorted order based on their fitness score
24:  if the current generation is not (specified number of generations + 1) then
25:    apply crossover on each fit pair of parent chromosomes to produce children
26:    perform mutation on the children
27:    replace the least fit chromosomes in the population with all the children
28:  end if
29: end for
30: return the chromosome having the highest fitness score
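For concreteness, the following compact bit-string GA uses the CFS merit as its fitness, reusing the symmetric_uncertainty and cfs_merit helpers sketched earlier. It is a simplified illustration of the GA Model, not the authors' MATLAB implementation; the default population size, number of generations, crossover rate and mutation rate follow the values reported in Section 4.1.

```python
import random

def subset_merit(mask, data, cls):
    """CFS merit of the feature subset encoded by a 0/1 chromosome `mask`.
    `data` is a list of columns (one list of nominal values per attribute)."""
    chosen = [col for col, bit in zip(data, mask) if bit]
    k = len(chosen)
    if k == 0:
        return 0.0
    avg_fc = sum(symmetric_uncertainty(col, cls) for col in chosen) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    avg_ff = (sum(symmetric_uncertainty(chosen[i], chosen[j]) for i, j in pairs)
              / len(pairs)) if pairs else 0.0
    return cfs_merit(k, avg_fc, avg_ff)

def ga_cfs_search(data, cls, pop_size=20, generations=20,
                  crossover_rate=0.6, mutation_rate=0.01):
    """Simple generational GA over bit-string chromosomes with CFS fitness."""
    n = len(data)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: subset_merit(m, data, cls), reverse=True)
        next_pop = pop[: pop_size // 2]               # keep the fitter half
        while len(next_pop) < pop_size:
            p1, p2 = random.sample(pop[: pop_size // 2], 2)
            if random.random() < crossover_rate:      # one-point crossover
                cut = random.randrange(1, n)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [1 - b if random.random() < mutation_rate else b
                     for b in child]                  # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=lambda m: subset_merit(m, data, cls))
    return [i for i, bit in enumerate(best) if bit]   # indices of selected attrs
```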

Algorithm 4 Filter Model ()
1: Input A = result of GA Model ()
2: Input B = Chi Square score set of the attributes
3: Input IP = Attribute set
4: Output = Optimized attribute set
5: sort the list B and sort the set IP accordingly
6: remove from the set IP the attributes having Chi Square score equal to zero, and remove the corresponding Chi Square scores from set B
7: calculate the median of the set B
8: remove from the set IP those attributes having Chi Square score less than the median
9: remove from the set A those attributes which are not present in set IP
10: return final optimized set of attributes A
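A minimal sketch of the median-based filtering in Algorithm 4, assuming the GA output is a list of selected attribute indices and the Chi Square scores are given as one value per attribute (illustrative names, not from the paper):

```python
import statistics

def filter_by_chi_square(ga_selected, chi_scores):
    """Keep only GA-selected attributes whose Chi Square score reaches the
    median of the non-zero scores, mirroring the Filter Model."""
    nonzero = {i: s for i, s in enumerate(chi_scores) if s > 0}
    if not nonzero:
        return []
    median = statistics.median(nonzero.values())
    relevant = {i for i, s in nonzero.items() if s >= median}
    return [i for i in ga_selected if i in relevant]

# Attributes 0, 2 and 5 survive the GA; attribute 5 has a low Chi Square score.
print(filter_by_chi_square([0, 2, 5], [8.1, 0.0, 6.4, 3.2, 5.0, 0.7]))  # [0, 2]
```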

3.2. Proposed Subset Approach


Experimental analysis of the GA-CFS Approach showed that it failed to optimize three of the nine data sets used. Keeping in view the fact that some data sets have attributes that are locally predictive but not globally predictive [59], a modification of the approach is suggested. Locally predictive attributes have predictive power over only part of the data set, whereas globally predictive attributes are predictive over all instances of the data set. The idea behind the modification is to run the GA Model over each subset of the data set instead of over the whole data set, with the Chi-Square Model used to filter attributes as before. After running the algorithm over all the subsets, the results are aggregated to produce the final result: a frequency list records how often each attribute is selected across the runs over the subsets, and attributes with low frequency, i.e. those selected as relevant by very few runs, are removed from this list. The Subset Approach is described in the following algorithm.

Algorithm 5 Subset Approach ()
1: Input IP1 = Data set
2: Input IP2 = Attribute set
3: Output OP = Optimized attribute set
4: D = GA-CFS Approach (IP1, IP2)
5: partition the data set into g (user specified) subsets D1, D2, ..., Dg
6: for each data subset (Di) of the g subsets do
7:   Z = GA Model (Di, IP2)
8:   update the frequency table of attributes with the help of Z
9: end for
10: remove from IP2 the attributes having frequency below a threshold m (user specified)
11: remove attributes from IP2 by applying Chi Square in the same way as in the Chi-Square Model of the proposed approach, except for the attributes which are present in both the output D and IP2
12: after these removals, IP2 is the final optimized set of attributes returned by the proposed approach
13: return IP2
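The partition-and-aggregate core of Algorithm 5 (steps 5-10) can be sketched as follows. It is illustrative only: ga_model stands for a per-subset run of the GA Model (for example the ga_cfs_search sketch given after Algorithm 3), the final Chi Square filtering and intersection with D of step 11 are omitted, and the parameters g and min_freq are user supplied.

```python
from collections import Counter
import random

def subset_approach(rows, cls, n_attributes, ga_model, g=5, min_freq=2):
    """Run the GA Model on g partitions of the data and keep the attributes
    selected in at least `min_freq` of those runs."""
    paired = list(zip(rows, cls))
    random.shuffle(paired)                       # mix class-ordered instances
    freq = Counter()
    for part in range(g):
        chunk = paired[part::g]                  # g roughly equal subsets
        sub_rows = [r for r, _ in chunk]
        sub_cls = [c for _, c in chunk]
        # columns of the subset, one list of values per attribute
        cols = [[row[i] for row in sub_rows] for i in range(n_attributes)]
        freq.update(ga_model(cols, sub_cls))
    return sorted(a for a, count in freq.items() if count >= min_freq)

# e.g. subset_approach(rows, cls, n_attributes, ga_cfs_search, g=5, min_freq=3)
```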

4. EXPERIMENTAL RESULTS
4.1. Experimental Setup
MATLAB R2013a has been used to implement the proposed approach. The following GA parameters are set to standard and commonly used values, based on observations from the literature survey of feature selection.
(1) Population size = 20
(2) Number of generation = 20
(3) Crossover rate = 0.6
(4) Mutation rate = 0.01
Any missing values in the data sets have been replaced by the modes of the training data. Each data set is partitioned into subsets after the instances have been shuffled randomly, so that the class distribution in each subset is as uniform as possible (some data sets have their instances ordered by class). The shuffling of instances is repeated 10 times.
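A small sketch of this preprocessing, assuming missing nominal values are marked with '?' (an assumption; the paper does not state the marker) and that instances are shuffled before being split into g subsets:

```python
from collections import Counter
import random

def impute_modes(rows, missing='?'):
    """Replace missing nominal values in each column with that column's mode."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] != missing]
        modes.append(Counter(present).most_common(1)[0][0] if present else missing)
    return [[modes[j] if row[j] == missing else row[j]
             for j in range(n_cols)] for row in rows]

def shuffled_partitions(rows, cls, g, seed=None):
    """Shuffle instances (shuffling is repeated in the paper) then split into g subsets."""
    rng = random.Random(seed)
    paired = list(zip(rows, cls))
    rng.shuffle(paired)
    return [paired[i::g] for i in range(g)]
```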

4.2. Data set Description


Experiments have been conducted on 9 benchmark data sets from the UCI machine learning repository [55]. Table 1 shows the characteristics of these data sets. All of them have nominal attributes and are intended for classification tasks.

