
A hybrid feature selection scheme for unsupervised learning and its application in bearing fault diagnosis

Yang Yang a,*, Yinxia Liao b, Guang Meng a, Jay Lee b

a State Key Laboratory of Mechanical System and Vibration, Shanghai Jiaotong University, Shanghai 200240, PR China
b NSF I/UCR Center for Intelligent Maintenance Systems, 560 Rhodes Hall, University of Cincinnati, Cincinnati, OH 45221, USA
* Corresponding author. Tel.: +86 21 34206831x322. E-mail address: emma002@sjtu.edu.cn (Y. Yang).
Keywords: Feature selection; Unsupervised learning; Fault diagnostics

Abstract
With the development of condition-based maintenance techniques and the consequent requirement for good machine learning methods, new challenges arise in unsupervised learning. In real-world situations, because the relevant features that reflect the true machine condition are often unknown a priori, condition monitoring systems built on unimportant features, e.g. noise, may suffer high false-alarm rates, especially when the characteristics of failures are costly or difficult to learn. It is therefore important to select the most representative features for unsupervised learning in fault diagnostics. In this paper, a hybrid feature selection scheme (HFS) for unsupervised learning is proposed to improve the robustness and the accuracy of fault diagnostics. It provides a general framework for feature selection based on significance evaluation and similarity measurement with respect to multiple clustering solutions. The effectiveness of the proposed HFS method is demonstrated by a bearing fault diagnostics application and by comparison with other feature selection methods.
© 2011 Elsevier Ltd. All rights reserved.
1. Introduction
As sensing and signal processing technologies advance rapidly, an increasing number of features have become involved in condition monitoring systems and fault diagnosis. A challenge in this area is to select the most sensitive parameters for the various types of fault, especially when the characteristics of failures are costly or difficult to learn (Malhi & Gao, 2004). In reality, since the relevant or important features are often not available a priori, numerous candidate features have been proposed to achieve a better representation of the machine health condition (Dash & Liu, 1997; Jardine, Lin, & Banjevic, 2006; Peng & Chu, 2004). Because of the irrelevant and redundant features in the original feature space, employing all features might lead to high complexity and low performance of fault diagnosis. Moreover, most unsupervised learning methods assume that all features have a uniform degree of importance during clustering operations (Dash & Koot, 2009). Even in the optimal feature set, each feature is also assumed to have the same sensitivity throughout the clustering operations. In fact, it is known that an important feature facilitates creating clusters while an unimportant feature, on the contrary, may jeopardize the clustering operation by blurring the clusters. Thereby, it is better to select only the most representative features (Xu, Xuan, Shi, & Wu, 2009) rather than simply reducing the number of features. Hence, it is significant to develop a systematic and automatic feature selection method that is capable of selecting the prominent features to achieve a better insight into the underlying machine performance. In summary, feature selection is one of the essential and frequently used techniques in machine learning (Blum & Langley, 1997; Dash & Koot, 2009; Dash & Liu, 1997; Ginart, Barlas, & Goldin, 2007; Jain, Duin, & Mao, 2000; Kwak & Choi, 2002); its aim is to select the most representative features, which brings immediate benefits to mining performance such as predictive accuracy and solution comprehensibility (Guyon & Elisseeff, 2003; Liu & Yu, 2005).
However, traditional feature selection algorithms for classification do not work for unsupervised learning since no class information is available. Dimensionality reduction or feature extraction methods are frequently used for unsupervised data, such as Principal Component Analysis (PCA), the Karhunen–Loeve transformation, or Singular Value Decomposition (SVD) (Dash & Koot, 2009). Malhi and Gao (2004) presented a PCA-based feature selection model for bearing defect classification in a condition monitoring system. Compared to using all features initially considered relevant to the classification results, it provided more accurate classification for both supervised and unsupervised purposes with fewer feature inputs. The drawback, however, is the difficulty of understanding the data and the found clusters through the extracted features (Dash & Koot, 2009). Given sufficient computation time, feature subset selection investigates all candidate feature subsets and selects the optimal one satisfying the cost function. Greedy search algorithms like sequential forward feature selection (SFFS) (or backward search feature selection (BSFS)) and random feature selection
were commonly used. Oduntan, Toulouse, and Baumgartner (2008) developed a multilevel tabu search algorithm combined with a hierarchical search framework, and compared it with sequential forward feature selection, random feature selection and tabu search feature selection (Zhang & Sun, 2002). Feature subset selection requires intensive computation time and shows poor performance for non-monotonic indices. In order to overcome these drawbacks, feature selection methods tend to rank features or select a subset of the original features (Guyon & Elisseeff, 2003). Feature ranking techniques re-sort the features according to cost functions and select a subset from the ordered features. Hong, Kwong, and Chang (2008a) introduced an effective method, feature ranking from multiple views (FRMV). It scores each feature using a ranking criterion by considering multiple clustering results, and selects the first several features with the best quality as the optimal feature subset. However, FRMV favours the importance of features that achieve better classification over the redundancy of the selected features, so the selected features are not necessarily the optimal subset. Considering the approaches for evaluating the cost function of feature selection techniques, feature selection algorithms broadly fall into three categories: the filter model, the wrapper model and the hybrid model (Liu & Yu, 2005). The filter model discovers the general characteristics of the data and treats feature selection as a preprocessing step that is independent of any mining algorithm. The filter method is less time-consuming but less effective. The wrapper model incorporates one predetermined learning algorithm and selects the feature subset that improves its mining performance according to certain criteria. It is more time-consuming but more effective compared with the filter model. Moreover, the predetermined learning algorithm remains biased towards the shape of the cluster, that is, it is sensitive to the data structure according to its operation concept. The hybrid model tends to take advantage of the two models in different search stages according to different criteria. Mitra, Murthy, and Pal (2002) described a filter feature selection algorithm for high-dimensional data sets based on measuring the similarity between features, whereby the redundancy therein was removed; a maximum information compression index was also introduced in their work to estimate the similarity between features. Li, Dong, and Hua (2008) proposed a novel filter feature selection algorithm through feature clustering (FFC) that groups the features into different clusters based on feature similarity and selects the representative features in each cluster to reduce the feature redundancy. Wei and Billings (2007) introduced a forward orthogonal search feature selection algorithm that maximizes the overall dependency to find significant variables, and also provides a ranked list of selected features ordered according to their percentage contribution to representing the overall structure. Liu, Ma, Zhang, and Mathew (2006) presented a wrapper model based on the fuzzy c-means (FCM) algorithm for rolling element bearing fault diagnostics. Sugumaran and Ramachandran (2007) employed a wrapper approach based on a decision tree with information gain and entropy reduction as criteria to select representative features that could discriminate bearing faults. Hong, Kwong, and Chang (2008b) described a novel feature selection algorithm based on unsupervised learning ensembles and a population-based incremental learning algorithm. It searches all candidate feature subsets for the subset whose clustering result is most similar to the one obtained by an unsupervised learning ensemble method. Huang, Cai, and Xu (2007a) developed a two-stage hybrid genetic algorithm to find a subset of features. In the first stage, the mutual information between the predicted labels and the true class labels served as a fitness function for the genetic algorithm to conduct the global search in a wrapper way. Then, in the second stage, the conditional mutual information served as an independent measure for the feature ranking considering both the relevance and the redundancy of features.
As mentioned above, these techniques either require the available features to be initially independent, which is the opposite of realistic situations, or remain biased towards the shape of the cluster due to their fundamental concept. This paper introduces a hybrid feature selection scheme for unsupervised learning that can overcome those deficiencies. The proposed scheme generates two randomly selected subspaces for further clustering, combines different genres of clustering analysis to obtain a population of sub-decisions of feature selection based on a significance measurement, and removes redundant features based on a feature similarity measurement to improve the quality of the selected features. The effectiveness of the proposed scheme is validated by an application to bearing defect classification, and the experimental results illustrate that the proposed method is able to (a) identify the features that are relevant to the bearing defects, and (b) maximize the performance of unsupervised learning models with fewer features.
The rest of this paper is arranged as follows. Section 2 illustrates the proposed HFS scheme in detail. Section 3 discusses the application of the proposed feature selection scheme in bearing fault diagnosis. Finally, Section 4 concludes this paper.
2. Hybrid feature selection scheme (HFS) for unsupervised classification
It is time-consuming, or even a difficult mission for an experienced fault diagnosis engineer, to determine which feature among all available features is able to distinguish the characteristics of various failures, especially when no prior knowledge (class information) is available. To tackle the problem of absent class information, FRMV (Hong et al., 2008a) extended the feature ranking methodology to unsupervised data clustering. It offered a generic approach to boost the performance of clustering analysis. A stable and robust unsupervised feature ranking approach was proposed based on ensembles of multiple feature rankings obtained from different views of the same data set. When conducting FRMV, data instances are first classified in a randomly selected feature subspace to obtain a clustering solution, and then all features are ranked according to their relevance to the obtained clustering solution. These two steps iterate until a population of feature rankings is achieved. Thereby, all obtained feature rankings are combined by a consensus function into a single consensus ranking.
However, FRMV clusters the data instances in a subspace that consists of a randomly selected half of the features each time. It is likely that some valuable features might be ignored from the beginning, that is, some features might never be included in any iteration. Besides, FRMV only focuses on the ensemble of one unsupervised learning algorithm's results, which obviously overlooks the reality that a learning algorithm is likely to hold a bias towards the natural structure of the data, such as the hyper-spherical structure or the hierarchical structure (Frigui, 2008; Greene, Cunningham, & Mayer, 2008). As simply illustrated in Fig. 1, the data set consists of 11 points. If two clusters are contained in the data set, a classifier based on the hierarchical concept tends to assign points 1, 2, 3, 4 and 5 to one cluster and the remaining points to the other, whereas a classifier based on the hyper-spherical concept tends to assign points 1, 2, 3, 4, 6, 7 and 8 to one cluster and the remaining points to the other.
Furthermore, in FRMV all features are assumed to be independent before the selection process, which is usually the opposite in the real world. Thereby, well-ranked features are related to their neighbours with high probability; in other words, some top-ranked features might turn out to be redundant.
Since the abovementioned shortcomings of FRMV are obstacles to higher classification performance and constraints on wider real-world applications, a hybrid feature selection scheme (HFS) is proposed to overcome these deficiencies. The remaining part of this section presents the HFS scheme for unsupervised learning and introduces the criteria used in HFS.
2.1. Procedure of the hybrid unsupervised feature selection method
The HFS is inspired by FRMV, which ranks each feature according to the relevance between the feature and the combined clustering solutions. Moreover, the HFS is developed to combine different genres of clustering analysis into a consensus decision and to rank the features according to both their relevance to the consensus decision and their independence from each other.
Generally, HFS involves two aspects: (1) significance evaluation, which determines the contribution of each feature on behalf of multiple clustering results; (2) redundancy evaluation, which retains the most significant and independent features concerning the feature similarity.
Some notations used throughout this paper are given as follows. The input vector x of the original feature space X with D candidate features is denoted as X_i = {x_i^1, x_i^2, ..., x_i^D}, i = 1, ..., M, in which i denotes the ith instance and M is the number of instances. Let RF^(k) = {rank^(k)(F_1), rank^(k)(F_2), ..., rank^(k)(F_n)} (1 <= rank^(k)(F_i) <= D) be the kth sub-decision of the feature ranking, where rank^(k)(F_i) denotes the rank of the ith feature F_i in the kth sub-decision. Assuming there are P sub-decisions of the feature ranking {RF^(1), RF^(2), ..., RF^(P)}, a combine function determines a final decision by combining the P sub-decisions into a single consensus feature decision RF_pre-final, which is thereafter processed according to the feature similarity to obtain RF_final. Details of the scheme are described as follows.
Algorithm: Hybrid feature selection scheme for unsupervised learning
Input: feature space X, the number of clusters N, maximum iteration L
Output: decision of feature selection
(1) Iterate until a population of sub-decisions is obtained
    For k = 1:L, Do:
    (1.1) Randomly divide the original feature space into two subspaces X_1, X_2
    (1.2) Group the data with the first and the second group of unsupervised learning algorithms in subspaces X_1 and X_2 separately
    (1.3) Evaluate the significance of each feature based on the significance measurement to obtain the kth sub-decision of feature selection RF^(k)
    End
    // combine all rankings into a single consensus one
(2) RF_pre-final = combiner{RF^(1), RF^(2), ..., RF^(2L)}
(3) Redundancy evaluation based on feature similarities
(4) Return RF_final
Fig. 1. An example of data structures in 2D; (a) hierarchical cluster, (b) spherical cluster.
At the beginning, the original feature space X is randomly divided into two feature subspaces X_1 and X_2, in which the instances are clustered correspondingly. In step 1.2, two different genres of clustering analysis, e.g. hyper-spherical clustering and hierarchical clustering, are used to classify the data instances in the two subspaces respectively. Thereby, a clustering solution is obtained in each subspace. Then all features are ranked with respect to their relevance to the obtained clustering solutions in step 1.3, named the significance evaluation. These two steps iterate until a population of feature rankings, named sub-decisions, is achieved. In step 2, a consensus function is utilized to combine all sub-decisions into a pre-final decision. Thereafter, the final decision of feature selection is accomplished by re-ranking the pre-final decision according to the re-ranking scheme based on feature similarity in step 3, named the redundancy evaluation. The details of HFS are introduced in Sections 2.2 and 2.3. Fig. 2 illustrates the framework of the HFS.
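To make the procedure concrete, the following minimal Python sketch outlines the HFS loop; it is an illustration under assumptions, not the authors' implementation. The significance criterion (LCC, SU or the DB index, Section 2.2) and the redundancy step (Section 2.3) are passed in as functions, and k-means and Ward hierarchical clustering stand in for the two genres of clustering analysis.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans  # stand-in for a hyper-spherical clusterer

def hfs(X, n_clusters, n_iter, significance, remove_redundant):
    """Hybrid feature selection sketch; X is an (instances x features) array."""
    D = X.shape[1]
    scores = np.zeros((2 * n_iter, D))            # one sub-decision per clustering run
    for k in range(n_iter):
        idx = np.random.permutation(D)
        sub1, sub2 = idx[:D // 2], idx[D // 2:]   # two random subspaces covering all features
        # genre 1: hyper-spherical clustering on subspace X1
        labels1 = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[:, sub1])
        # genre 2: hierarchical clustering on subspace X2
        labels2 = fcluster(linkage(X[:, sub2], method='ward'), n_clusters, criterion='maxclust')
        # significance evaluation of every original feature against each solution
        scores[2 * k] = [significance(X[:, j], labels1) for j in range(D)]
        scores[2 * k + 1] = [significance(X[:, j], labels2) for j in range(D)]
    pre_final = scores.mean(axis=0)               # simple-average combiner, Eq. (9)
    ranking = np.argsort(-pre_final)              # most significant features first
    return remove_redundant(X, ranking)           # redundancy evaluation, Section 2.3

In this sketch both clustering genres are applied in every iteration, so L iterations yield 2L sub-decisions, matching step (2) of the algorithm above.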
Table 1 lists the differences between FRMV and the proposed HFS. First of all, in order to make sure that every feature in the original feature set is able to contribute to the decision making, HFS makes use of both randomly divided subspaces of the original feature space, instead of ignoring some features by randomly selecting half of the original feature space as in FRMV. Secondly, HFS considers the bias of the individual unsupervised learning algorithm; thereby, different genres of clustering methods are used to cluster the data in the subspaces. Moreover, HFS provides a redundancy evaluation according to the feature similarity and re-ranks the features. It is therefore more appropriate for real-world applications than FRMV.
2.2. Significance evaluation
The goal of unsupervised feature selection is to find as few features as possible that best uncover the interesting natural clusters in the data, which can be found by an unsupervised learning algorithm. Therefore, the relationship between the clustering solution and a feature is considered as the significance of the feature to the clustering solution. In step 1.3, the features are ranked with respect to their relevance to the obtained clustering solutions, named the significance evaluation. The sub-decisions serve as the target and each feature is considered as a variable. In this research, the widely used linear correlation coefficient (LCC) (Hong et al., 2008a), the symmetrical uncertainty (SU) (Yu & Liu, 2004) and the Davies–Bouldin index (DB) (Davies & Bouldin, 1979) are used for significance evaluation. The details of each criterion are introduced as follows. For convenience, F_k and R^(k) denote the kth feature and the kth sub-decision, respectively.
First, the linear correlation coefficient measures the correlation between the variables and the target, and is calculated as follows:

\mathrm{LCC}(F_k, R^{(k)}) = \frac{\mathrm{cov}(F_k, R^{(k)})}{\sigma(F_k)\,\sigma(R^{(k)})}    (1)

where \sigma(R^{(k)}) is the standard deviation of the kth target and \mathrm{cov}(F_k, R^{(k)}) is the covariance between F_k and R^{(k)}.
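As a rough illustration (not the authors' code), the LCC of Eq. (1) between one feature column and a clustering solution encoded as integer labels could be computed as follows; treating the label vector directly as a numeric target is an assumption of this sketch.

import numpy as np

def lcc(feature, labels):
    """Linear correlation coefficient between a feature and a clustering solution, Eq. (1)."""
    f = np.asarray(feature, dtype=float)
    r = np.asarray(labels, dtype=float)            # cluster labels treated as a numeric target
    cov = np.mean((f - f.mean()) * (r - r.mean()))
    return abs(cov) / (f.std() * r.std() + 1e-12)  # absolute value: only the strength matters for ranking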
Secondly, the symmetrical uncertainty is defined as follows:

\mathrm{SU}(F_k, R^{(k)}) = 2\left[\frac{\mathrm{IG}(F_k\,|\,R^{(k)})}{H(F_k) + H(R^{(k)})}\right]    (2)

with

\mathrm{IG}(F_k\,|\,R^{(k)}) = H(F_k) - H(F_k\,|\,R^{(k)})    (3)

H(F_k) = -\sum_{F'_k \in X(F_k)} P(F'_k)\,\log P(F'_k)    (4)

H(F_k\,|\,R^{(k)}) = -\sum_{R'_k \in X(R^{(k)})} P(R'_k) \sum_{F'_k \in X(F_k)} P(F'_k\,|\,R'_k)\,\log P(F'_k\,|\,R'_k)    (5)

P(F'_k) = \frac{\sum_{i=1}^{N} \delta(d_i, F'_k)}{N}    (6)

\delta(d_i, F'_k) = \begin{cases} 1, & \text{if } d_i = F'_k \\ 0, & \text{otherwise} \end{cases}    (7)
Fig. 2. Flowchart of HFS for unsupervised learning.
Table 1
Comparison between FRMV and HFS.
Subspace: FRMV randomly selects n/2 features; HFS randomly divides the feature space into two subspaces X_1 and X_2.
Clustering analysis: FRMV retains the bias towards a single data structure; HFS considers the bias of each individual algorithm towards the data structure.
Independence evaluation: FRMV has none; HFS re-ranks the features according to the similarity measurement.
Note: n denotes the number of all features.
where H(F_k) is the entropy of F_k and H(F_k | R^(k)) is the conditional entropy of F_k. X(F_k) denotes all possible values of F_k and X(R^(k)) denotes all possible values of R^(k). P(F'_k) is the probability that F_k equals F'_k, and P(F'_k | R'_k) is the probability that F_k equals F'_k under the condition that the instances are assigned to the group R'_k. In addition, a symmetrical uncertainty SU(F_k, R^(k)) of 1 indicates that F_k is completely related to R^(k), whereas a value of 0 means that F_k is absolutely irrelevant to the target (Hong et al., 2008a; Shao & Nezu, 2000).
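A minimal sketch of Eqs. (2)-(7), assuming the feature has already been discretised (e.g. binned) so that the entropies are taken over finitely many values:

import numpy as np

def entropy(values):
    """Empirical entropy H(.) over the observed discrete values, Eq. (4)."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(feature, labels):
    """SU(F_k, R^(k)) of Eq. (2) with IG = H(F) - H(F|R), Eqs. (3) and (5)."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    h_f, h_r = entropy(feature), entropy(labels)
    h_f_given_r = 0.0
    for r in np.unique(labels):                    # H(F|R): entropy within each cluster, weighted by P(R)
        mask = labels == r
        h_f_given_r += mask.mean() * entropy(feature[mask])
    ig = h_f - h_f_given_r
    return 2.0 * ig / (h_f + h_r + 1e-12)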
Thirdly, the DB index is a function of the ratio of the sum of the within-cluster scatter to the between-cluster separation, computed as follows:

\mathrm{DB} = \frac{1}{n} \sum_{i=1}^{n} \max_{j \ne i} \left[\frac{S_n(Q_i) + S_n(Q_j)}{S(Q_i, Q_j)}\right]    (8)

where n is the number of clusters, Q_i stands for the ith cluster, S_n(Q_i) denotes the average distance of all objects in the cluster to their cluster centre, and S(Q_i, Q_j) is the distance between the cluster centres. The DB index is small if the clusters are compact and far from each other; in other words, a small DB index means a good clustering.
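A compact sketch of the DB index of Eq. (8), evaluating a clustering solution over a feature matrix X (a single feature can be passed as one column); recent versions of scikit-learn also offer an equivalent davies_bouldin_score.

import numpy as np

def davies_bouldin(X, labels):
    """DB index of Eq. (8): small values indicate compact, well-separated clusters."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    clusters = np.unique(labels)
    centres = np.array([X[labels == c].mean(axis=0) for c in clusters])
    scatter = np.array([np.mean(np.linalg.norm(X[labels == c] - centres[i], axis=1))
                        for i, c in enumerate(clusters)])
    ratios = []
    for i in range(len(clusters)):
        worst = max((scatter[i] + scatter[j]) / np.linalg.norm(centres[i] - centres[j])
                    for j in range(len(clusters)) if j != i)
        ratios.append(worst)
    return float(np.mean(ratios))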
2.3. Combination and similarity measurement
Besides maximization of the clustering performance, the other important purpose is the selection of features based on the feature dependency, or similarity: any feature carrying little or no additional information beyond that subsumed by the remaining features is redundant and should be eliminated (Mitra et al., 2002). That is, if a highly ranked feature carrying valuable information is very similar to a lower-ranked feature, the latter should be eliminated because it carries no additional valuable information. Therefore, the similarities between features are taken as the reference for the redundancy evaluation.
In step 2, a consensus function, named the combiner, is utilized to combine all sub-decisions into a pre-final decision. A large number of combiners for combining classifier results were discussed in Dietrich, Palm, and Schwenker (2003). The most common combiners are the majority vote, the simple average and the weighted average. In the simple average, the average of the learning model results is calculated and the variable with the largest average value is selected as the final decision. The weighted average follows the same concept as the simple average except that the weights are selected heuristically. The majority vote assigns the kth variable a rank j if more than half of the sub-decisions vote it to rank j.
Practically, the determination of the weights in the weighted-average combiner relies on experience. On the other hand, the majority vote could lead to confusion in decision making, e.g. one feature could be nominated with two ranks at the same time. Therefore, the simple average combiner is applied in this study to combine the sub-decisions, which is computed as follows:

AR(j) = \frac{\sum_{k=1}^{M} rank^{(k)}(j)}{M}    (9)

where M is the population of sub-decisions and rank^(k)(j) is the significance measurement of feature j in the kth sub-decision RF^(k).
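A minimal sketch of the simple average combiner of Eq. (9); each sub-decision is a row of per-feature significance values, and the direction of the final sort depends on whether scores (higher is better) or ranks (lower is better) are averaged.

import numpy as np

def simple_average_combiner(sub_decisions):
    """sub_decisions: (M x D) array, one row of per-feature significance values per sub-decision."""
    avg = np.mean(np.asarray(sub_decisions, dtype=float), axis=0)   # AR(j) of Eq. (9)
    order = np.argsort(-avg)        # pre-final decision: features sorted by average significance
    return avg, order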
Thereafter, in step 3, in order to reduce the redundancy, those highly ranked but less independent features with respect to the obtained pre-final decision are eliminated. The similarity between features can be utilized to estimate the redundancy. There is a broad class of criteria for measuring the similarity between two random variables based on the linear dependency between them. The reason for choosing the linear dependency as the feature similarity measure is that, if the data is linearly separable in the original representation, it remains linearly separable when all but one of the linearly dependent features are eliminated. In this research, the most well-known measure of similarity between two random variables, the correlation coefficient, is adopted. The correlation coefficient \rho between two variables x and y is defined as

\rho(x, y) = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}    (10)

where var(x) denotes the variance of x and cov(x, y) is the covariance between the two variables x and y.
The elimination procedure is then conducted according to the pre-final decision and the similarity measure between features. For example, the most significant (top-ranked) feature is retained, the features most related to it according to the similarity measure are considered redundant and removed, and the successive features are processed likewise until the retained well-ranked features are linearly independent.
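The elimination step can be sketched as follows: walk down the pre-final ranking and keep a feature only if its correlation, Eq. (10), with every already-kept feature stays below a threshold; the threshold value is an assumption of this illustration.

import numpy as np

def remove_redundant(X, ranking, threshold=0.95):
    """Keep top-ranked features whose |rho| (Eq. (10)) with all previously kept features is below threshold."""
    X = np.asarray(X, dtype=float)
    kept = []
    for j in ranking:                               # ranking: feature indices, most significant first
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < threshold for k in kept):
            kept.append(j)
    return kept                                     # final decision RF_final, in ranked order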
3. HFS's application in bearing fault diagnosis
This section applies the HFS to bearing fault diagnostics. The comparison between HFS and other feature selection methods is demonstrated and discussed.
To validate that the proposed feature selection scheme can improve the classification accuracy, a comparison between the proposed hybrid feature selection scheme and five other feature selection approaches was carried out. The eight feature selection algorithms are listed as follows:
(1) HFS with symmetrical uncertainty (HFS_SU);
(2) HFS with linear correlation coefficient (HFS_LCC);
(3) HFS with DB index (HFS_DB);
(4) PCA-based feature selection (Malhi & Gao, 2004);
(5) FRMV based on k-means clustering with symmetrical uncertainty (FRMV_KM) (Hong et al., 2008a);
(6) Sequential forward feature selection (SFFS) (Oduntan et al., 2008);
(7) Forward orthogonal search feature selection algorithm by maximizing the overall dependency (fosmod) (Wei & Billings, 2007);
(8) Feature selection through feature clustering (FFC) (Li, Hu, Shen, Chen, & Li, 2008).
The comparisons among them were in terms of classification accuracy. According to Hong et al. (2008a), the iteration of FRMV_KM was set to 100, k-means clustering was used to obtain the population of clustering solutions and SU was adopted as the evaluation criterion. In order to get a comparable population of sub-decisions, the iteration of the proposed algorithm was set to 50. The threshold of the fosmod was set to 0.2. Two commonly used clustering algorithms were adopted in the HFS: fuzzy c-means clustering and hierarchical clustering. In this research, the result of FCM was defuzzified as follows:

R(k) = \begin{cases} 1, & \text{if } P(k) = \max(P) \\ 0, & \text{otherwise} \end{cases}    (11)

where P and P(k) denote the membership of an instance belonging to each cluster and the possibility of the instance belonging to the kth cluster, respectively.
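A minimal sketch of this defuzzification step, operating on an FCM membership matrix whose rows hold each instance's memberships to the clusters:

import numpy as np

def defuzzify(membership):
    """Crisp assignment from an (instances x clusters) FCM membership matrix, Eq. (11)."""
    U = np.asarray(membership, dtype=float)
    labels = np.argmax(U, axis=1)                   # the cluster with the largest membership wins
    crisp = np.zeros_like(U)
    crisp[np.arange(len(labels)), labels] = 1.0     # R(k) = 1 for the winning cluster, 0 otherwise
    return labels, crisp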
The features discussed in this section for bearing defects include features extracted from the time domain, frequency domain, time-frequency domain and empirical mode decomposition (EMD). Firstly, in the time domain, statistical parameters were extracted directly from the waveform of the vibration signals. A wide set of statistical parameters, such as rms, kurtosis, skewness, crest factor and normalized higher-order central moments, has been developed (Jack & Nandi, 2002; Lei, He, & Zi, 2008; Samanta & Nataraj, 2009; Samanta, Al-Balushi, & Al-Araimi, 2003). Second, the characteristic frequencies related to the bearing components were located, e.g. ball spin frequency (BSF), ball-pass frequency of the inner ring (BPFI), and ball-pass frequency of the outer ring (BPFO). Besides, in order to interpret real-world signals effectively, the envelope technique for the frequency spectrum was used to extract the features of the modulated carrier frequency signals (Patil, Mathew, & RajendraKumar, 2008). In addition, a new signal feature proposed by Huang from the envelope signal (Huang, Xi, & Li, 2007b), the power ratio of the maximal defective frequency to the mean, or PMM for short, was calculated as follows:
\mathrm{PMM} = \frac{\max\{p(f_{po}),\, p(f_{pi}),\, p(f_{bc})\}}{\mathrm{mean}(p)}    (12)

where p(f_po), p(f_pi) and p(f_bc) are the average powers at the defective frequencies of the outer-race, inner-race and ball defects, respectively, and mean(p) is the average of the overall frequency power.
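A hedged sketch of this PMM feature: the average spectral power in a narrow band around each of the three defect frequencies is compared with the mean power of the whole spectrum; the band half-width is an assumption of this illustration, not a value from the paper.

import numpy as np

def pmm(signal, fs, f_bpfo, f_bpfi, f_bsf, half_band=5.0):
    """Power ratio of the maximal defective frequency to the mean, Eq. (12)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    def band_power(f0):                             # average power in a band around one defect frequency
        band = (freqs >= f0 - half_band) & (freqs <= f0 + half_band)
        return spectrum[band].mean()
    defect_powers = [band_power(f) for f in (f_bpfo, f_bpfi, f_bsf)]
    return max(defect_powers) / spectrum.mean()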
Thirdly, Yen introduced the wavelet packet transform (WPT) in Yen (2000) as follows:

e_{j,n} = \sum_{k} w_{j,n,k}^{2}    (13)

where w_{j,n,k} is the wavelet packet coefficient, j is the scaling parameter, k is the translation parameter, and n is the oscillation parameter. Each wavelet packet energy e_{j,n} measures the content of a specific frequency sub-band. In addition, EMD was used to decompose the signal into several intrinsic mode functions (IMFs) and a residual. The EMD energy entropy in Yu, Yu, and Cheng (2006) was computed from the first several IMFs of the signal.
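Using the PyWavelets package (an implementation assumption; the paper does not name a toolbox), the sub-band energies of Eq. (13) at a given decomposition level could be computed as:

import numpy as np
import pywt

def wpt_energies(signal, wavelet='db4', level=4):
    """Energy of each wavelet packet node at the given level, Eq. (13)."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, mode='symmetric', maxlevel=level)
    nodes = wp.get_level(level, order='freq')       # 2**level frequency-ordered sub-bands (16 for level=4)
    return np.array([np.sum(np.square(node.data)) for node in nodes])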
In this research, a self-organizing map (SOM) was used to validate the classification performance based on the selected features. The theoretical background of the unsupervised SOM has been extensively studied in the literature; a brief introduction of the SOM for bearing fault diagnosis can be found in Liao and Lee (2009). With available data from different bearing failure modes, the SOM can be applied to build a health map in which different regions indicate different defects of a bearing. Each input vector is represented by a BMU (best matching unit) in the SOM. After training, the input vectors of a specific bearing defect are represented by a cluster of BMUs in the map, which forms a region indicating the defect. If the input vectors are labeled, each region can be defined to represent a defect.
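As a sketch of this validation step (using the MiniSom package as an assumed SOM implementation), the map is trained on the selected features, each unit is labelled by the majority defect label of the training vectors it wins, and a test vector is classified by the label of its BMU; train_x and test_x are (instances x features) arrays and train_y holds defect labels used only to name the map regions and score accuracy.

import numpy as np
from collections import Counter, defaultdict
from minisom import MiniSom                         # assumed third-party SOM implementation

def som_health_map(train_x, train_y, test_x, map_size=(10, 10), n_iter=5000):
    """Train a SOM health map on selected features and classify test vectors by their BMU label."""
    som = MiniSom(map_size[0], map_size[1], train_x.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(train_x, n_iter)
    # label each map unit by the majority class of the training vectors it wins
    votes = defaultdict(Counter)
    for x, y in zip(train_x, train_y):
        votes[som.winner(x)][y] += 1
    unit_label = {unit: counter.most_common(1)[0][0] for unit, counter in votes.items()}
    # classify each test vector by its BMU; fall back to the nearest labelled unit
    predictions = []
    for x in test_x:
        bmu = som.winner(x)
        if bmu not in unit_label:                   # unit never won by any training vector
            bmu = min(unit_label, key=lambda u, b=bmu: (u[0] - b[0]) ** 2 + (u[1] - b[1]) ** 2)
        predictions.append(unit_label[bmu])
    return np.array(predictions)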
3.1. Experiments
In this research, two tests were conducted on two types of bearings, and the class information was treated as unknown in both cases.
In the first test, bearings were artificially seeded with a roller defect, an inner-race defect, an outer-race defect and four different combinations of these single failures, respectively. In this case, an SKF 32208 bearing was tested, with an accelerometer installed in the vertical direction on its housing. The sampling rate of the vibration signal was 50 kHz. The BPFI, BPFO and BSF for this case were calculated as 131.73 Hz, 95.2 Hz and 77.44 Hz, respectively. Fig. 3 shows the vibration signals of all defects as well as the normal condition in the first test.

Fig. 3. Vibration signals of the first test, including the normal pattern and seven failure patterns (roller defect, inner-race defect, outer-race defect, inner-race & roller defect, outer & inner-race defect, outer & inner-race & roller defect, outer-race & roller defect); each panel shows acceleration (g) over 0-3 s.
In the second test, a set of 6308-2R single-row deep groove ball bearings were run to failure, resulting in roller defects, inner-race defects and outer-race defects (Huang et al., 2007b). In total, 10 bearings were involved in the experiment. The data sampling frequency was 20 kHz. The BPFI, BPFO and BSF in this case were calculated as 328.6 Hz, 205.3 Hz and 274.2 Hz, respectively. It should be pointed out that the beginning of the second test was not stable, after which the signal fell into a long normal period. Hence, two separate segments from the stable normal period were selected as the baselines for training and testing, respectively. On the other hand, the data that exceeded the mean value before the end of the test were taken as potential failure patterns. Therefore, 70% of the faulty patterns and half of the good patterns were used for training the unsupervised learning model, while all the faulty patterns and the other half of the good patterns were used for testing. Fig. 4 shows part of the data segments of one bearing from the run-to-failure experiment in the second test.

Fig. 4. Vibration signal of one bearing in the second test: (1) unstable beginning of the test; (2) first stable segment; (3) second stable segment; (4) failure pattern (inner-race defect).
3.2. Analysis and results
In the first test, a total of 24 features were computed, as listed below; a brief computation sketch is given after the list. Half of the data was used for training the SOM and the remaining part for testing.
- Energies centered at 1xBPFO, 2xBPFO, 1xBPFI, 2xBPFI, 1xBSF and 2xBSF.
- 6 statistics of the raw signal (mean, rms, kurtosis, crest factor, skewness, entropy).
- 6 statistics of the envelope signal obtained by the Hilbert transform.
- 6 statistics of the spectrum of the waveform obtained by FFT.
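As an illustration only (not the authors' extraction code), the six statistics and the envelope and spectrum signals for one vibration segment could be obtained as follows; the histogram-based entropy estimate and the bin count are assumptions.

import numpy as np
from scipy.signal import hilbert
from scipy.stats import kurtosis, skew

def six_statistics(x):
    """mean, rms, kurtosis, crest factor, skewness and entropy of one signal."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    crest = np.max(np.abs(x)) / rms
    hist, _ = np.histogram(x, bins=64)
    p = hist[hist > 0] / hist.sum()
    ent = -np.sum(p * np.log2(p))                   # histogram-based entropy (an assumption)
    return [np.mean(x), rms, kurtosis(x), crest, skew(x), ent]

segment = np.random.randn(150000)                   # placeholder for a 3 s segment sampled at 50 kHz
envelope = np.abs(hilbert(segment))                 # envelope signal via the Hilbert transform
spectrum = np.abs(np.fft.rfft(segment))             # FFT magnitude spectrum of the waveform
features = six_statistics(segment) + six_statistics(envelope) + six_statistics(spectrum)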
Fig. 5 shows the results of the first test, with the x axis representing the number of selected features fed into the unsupervised SOM for clustering and the y axis representing the corresponding classification accuracy. The first 12 features selected by each algorithm are shown for convenience. Taking Fig. 5a as an example, the classification accuracies based on HFS_SU, HFS_LCC and HFS_DB with only the top-ranked feature as input were 92.11%, 92.11% and 85.59%, respectively. When the first three ranked features were used, accuracies of 97.19%, 99.77% and 97.03% were achieved. Compared with HFS_SU and HFS_DB, the features selected by HFS_LCC achieved the higher classification accuracy of 99.77%; in other words, HFS_LCC apparently selected the most representative features for this specific application. As shown in Fig. 5b, the highest classification accuracy for PCA was 99.38% with 5 features. In Fig. 5c, the classification accuracies based on HFS_SU, HFS_DB and HFS_LCC were higher than those based on FRMV_KM, whose highest classification accuracy of 98.36% was achieved with 12 features. Fig. 5d compares SFFS and the three HFS methods: HFS_LCC selected the most representative features, and the accuracies reached by HFS were higher; for SFFS, the highest classification accuracy of 98.43% was achieved with 9 features. As shown in Fig. 5e, although the first 11 features selected by fosmod ultimately reached an accuracy of 99.14%, HFS not only obtained higher accuracy but also ranked the features with higher reliability. Compared with FFC, as shown in Fig. 5f, the features selected by HFS provided better classification accuracy with fewer features; for FFC, the highest accuracy of 98.43% was reached with 9 features. The performance improvement of the proposed model over FRMV_KM, SFFS, fosmod and FFC was mainly due to making use of every feature, combining the clustering solutions and evaluating the feature independence, which overcomes the deficiencies of limited diversity in the clustering solutions and the harm caused by correlated features.

Fig. 5a. Comparison of classification accuracy of HFS_LCC, HFS_SU and HFS_DB in the first test (x axis: number of features according to the rankings; y axis: accuracy, validated by SOM).
Fig. 5b. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and the PCA-based method.
Fig. 5c. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FRMV_KM.
Fig. 5d. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and SFFS.
Fig. 5e. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and fosmod.
Fig. 5f. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FFC.
In order to illustrate the effect of the redundancy of the feature set on the classification performance and to demonstrate the robustness of HFS, more candidate features were involved in the second test. In total, 40 features were calculated, as given below:
- 10 statistics of the raw signal (variance, rms, skewness, kurtosis, crest factor, and the 5th to 9th central moments).
- Energies centered at 1xBPFO, 1xBPFI and 1xBSF for both the raw signal and the envelope.
- PMMs for both the raw signal and the envelope.
- 16 wavelet packet node (WPN) energies.
- 6 IMF energy entropies.
The results in Fig. 6 show the classification accuracy of the second test. As shown in Fig. 6a, HFS_LCC reached the highest classification accuracy of 88.56% with the first 10 features, while HFS_DB and HFS_SU achieved their highest classification accuracies of 87.29% and 85.17% with the first 8 and 11 features, respectively. Fig. 6b shows that, compared with the PCA-based feature selection method (highest accuracy 86.02%), HFS_LCC and HFS_DB achieved higher accuracy with the same number of features or fewer features. In the comparison with FRMV_KM (as shown in Fig. 6c; highest accuracy 83.90%), the HFS group showed apparently better classification accuracy with fewer features. As shown in Figs. 6d and 6e, the features selected by SFFS and fosmod resulted in accuracies of 84.75% and 85.17%, which were worse than that of the single feature selected by HFS_DB. Compared with FFC (as shown in Fig. 6f), HFS_LCC showed better performance, since an accuracy of 86.86% was reached by FFC with 6 features selected.
Fig. 6a. Comparison of classification accuracy of HFS_LCC, HFS_SU and HFS_DB in the second test (x axis: number of features according to the rankings; y axis: accuracy, validated by SOM).
Fig. 6b. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and the PCA-based method.
Fig. 6c. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FRMV_KM.
Fig. 6d. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and SFFS.
Fig. 6e. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and fosmod.
Fig. 6f. Comparison of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FFC.

From the results of the two tests, it can be concluded that the proposed HFS is robust and effective in selecting the most representative features, which maximizes the unsupervised classification performance. It should also be noted that, for both tests, FRMV_KM and HFS_SU shared the same evaluation criterion, yet the decision provided by HFS_SU was always better than that of FRMV_KM, which indicates that the proposed HFS scheme is superior to FRMV with respect to the same evaluation criterion. Besides, it is worth noticing that in both tests the proposed HFS based on the three evaluation criteria, i.e. SU, LCC and the DB index, generated slightly different results. This suggests that the effectiveness of the features selected by the proposed HFS depends on the applied evaluation criterion, and LCC was considered more appropriate for both cases. Nonetheless, it is still appropriate to conclude that the overall performance based on the features selected by HFS was better compared with the other five methods.
4. Conclusion
This paper presented a hybrid unsupervised feature selection (HFS) approach to select the most representative features for unsupervised learning and used two sets of experimental bearing data to demonstrate the effectiveness of HFS. The performance of the HFS approach was compared with five other feature selection methods with respect to the accuracy improvement of the unsupervised learning algorithm SOM. The results showed that the proposed model could (a) identify the features that are relevant to the bearing defects, and (b) maximize the performance of unsupervised learning models with fewer features. Moreover, the results suggested that the performance of HFS depends on the evaluation criterion chosen for a given application. Therefore, further research will focus on expanding HFS to broader applications and to online machinery defect diagnostics and prognostics.
Acknowledgement
The authors gratefully acknowledge the support of 863 Program
(No. 50821003), PR China, for this work.
References
Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 12, 245–271.
Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 224–227.
Dash, M., & Koot, P. W. (2009). Feature selection for clustering. In Encyclopedia of database systems (pp. 1119–1125).
Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 14, 131–156.
Dietrich, C., Palm, G., & Schwenker, F. (2003). Decision templates for the classification of bioacoustic time series. Information Fusion, 2, 101–109.
Frigui, H. (2008). Clustering: Algorithms and applications. In 2008 1st international workshops on image processing theory, tools and applications, IPTA 2008, Sousse.
Ginart, A., Barlas, I., & Goldin, J. (2007). Automated feature selection for embeddable prognostic and health monitoring (PHM) architectures. In AUTOTESTCON (Proceedings), Anaheim, CA (pp. 195–201).
Greene, D., Cunningham, P., & Mayer, R. (2008). Unsupervised learning and clustering. Lecture Notes in Applied and Computational Mechanics, 51–90.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 1157–1182.
Hong, Y., Kwong, S., & Chang, Y. (2008a). Consensus unsupervised feature ranking from multiple views. Pattern Recognition Letters, 5, 595–602.
Hong, Y., Kwong, S., & Chang, Y. (2008b). Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm. Pattern Recognition, 9, 2742–2756.
Huang, J., Cai, Y., & Xu, X. (2007a). A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognition Letters, 13, 1825–1844.
Huang, R., Xi, L., & Li, X. (2007b). Residual life predictions for ball bearings based on self-organizing map and back propagation neural network methods. Mechanical Systems and Signal Processing, 1, 193–207.
Jack, L. B., & Nandi, A. K. (2002). Fault detection using support vector machines and artificial neural networks, augmented by genetic algorithms. Mechanical Systems and Signal Processing, 23, 373–390.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 4–37.
Jardine, A. K. S., Lin, D., & Banjevic, D. (2006). A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing, 7, 1483–1510.
Kwak, N., & Choi, C. H. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 1, 143–159.
Lei, Y. G., He, Z. J., & Zi, Y. Y. (2008). A new approach to intelligent fault diagnosis of rotating machinery. Expert Systems with Applications, 4, 1593–1600.
Li, G., Hu, X., Shen, X., et al. (2008). A novel unsupervised feature selection method for bioinformatics data sets through feature clustering. In IEEE international conference on granular computing, GRC 2008, Hangzhou (pp. 41–47).
Li, Y., Dong, M., & Hua, J. (2008). Localized feature selection for clustering. Pattern Recognition Letters, 10–18.
Liao, L., & Lee, J. (2009). A novel method for machine performance degradation assessment based on fixed cycle features test. Journal of Sound and Vibration, 326, 894–908.
Liu, X., Ma, L., Zhang, S., & Mathew, J. (2006). Feature group optimisation for machinery fault diagnosis based on fuzzy measures. Australian Journal of Mechanical Engineering, 2, 191–197.
Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 4, 491–502.
Malhi, A., & Gao, R. X. (2004). PCA-based feature selection scheme for machine defect classification. IEEE Transactions on Instrumentation and Measurement, 6, 1517–1525.
Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3, 301–312.
Oduntan, I. O., Toulouse, M., & Baumgartner, R. (2008). A multilevel tabu search algorithm for the feature selection problem in biomedical data. Computers & Mathematics with Applications, 5, 1019–1033.
Patil, M. S., Mathew, J., & RajendraKumar, P. K. (2008). Bearing signature analysis as a medium for fault detection: A review. Journal of Tribology, 1.
Peng, Z. K., & Chu, F. L. (2004). Application of the wavelet transform in machine condition monitoring and fault diagnostics: A review with bibliography. Mechanical Systems and Signal Processing, 2, 199–221.
Samanta, B., Al-Balushi, K. R., & Al-Araimi, S. A. (2003). Artificial neural networks and support vector machines with genetic algorithm for bearing fault detection. Engineering Applications of Artificial Intelligence, 7-8, 657–665.
Samanta, B., & Nataraj, C. (2009). Use of particle swarm optimization for machinery fault detection. Engineering Applications of Artificial Intelligence, 2, 308–316.
Shao, Y., & Nezu, K. (2000). Prognosis of remaining bearing life using neural networks. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 3, 217–230.
Sugumaran, V., & Ramachandran, K. I. (2007). Automatic rule learning using decision tree for fuzzy classifier in fault diagnosis of roller bearing. Mechanical Systems and Signal Processing, 5, 2237–2247.
Wei, H. L., & Billings, S. A. (2007). Feature subset selection and ranking for data dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 162–166.
Xu, Z., Xuan, J., Shi, T., & Wu, B. (2009). Application of a modified fuzzy ARTMAP with feature-weight learning for the fault diagnosis of bearing. Expert Systems with Applications, 6, 9961–9968.
Yen, G. G. (2000). Wavelet packet feature extraction for vibration monitoring. IEEE Transactions on Industrial Electronics, 3, 650–667.
Yu, Y., Yu, D., & Cheng, J. (2006). A roller bearing fault diagnosis method based on EMD energy entropy and ANN. Journal of Sound and Vibration, 12, 269–277.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 1205–1224.
Zhang, H., & Sun, G. (2002). Feature selection using tabu search method. Pattern Recognition, 35, 701–711.
