A Survey on Statistical Pattern Feature Extraction

Shifei Ding1,2, Weikuan Jia3, Chunyang Su1, Fengxiang Jin4, and Zhongzhi Shi2
1 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221008
2 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
3 College of Plant Protection, Shandong Agricultural University, Taian 271018
4 College of Geoinformation Science and Engineering, Shandong University of Science and Technology, Qingdao 266510
dingsf@cumt.edu.cn, dingshifei@sina.com

Abstract: The goal of statistical pattern feature extraction (SPFE) is 'low loss dimension reduction'. As a key link in pattern recognition, dimension reduction has become a research hot spot and a difficult problem in the fields of pattern recognition, machine learning, data mining and so on. Pattern feature extraction is one of the most challenging research fields and has attracted the attention of many scholars. This paper briefly introduces the basic principle of SPFE, and discusses the latest progress of SPFE from aspects such as classical statistical theories and their modifications, kernel-based methods, wavelet analysis and its modifications, the integration of algorithms, and so on. Finally, we discuss the development trend of SPFE.

1 Introduction
With the development of science and technology, research objects are becoming more and more complex. Complex systems are characterized by high dimensionality and salient nonlinearity. Large amounts of data provide usable information, but they also make it difficult to use the data effectively. Useful knowledge may be inundated by a large number of redundant data; this occupies a lot of storage space and computation time, makes the training process time-consuming, ultimately affects the precision of recognition, and leads to the curse of dimensionality. How to make use of these huge volumes of data, analyze them, extract useful information, and exclude the influence of correlated or repeated factors is the problem that feature extraction needs to solve: to reduce the feature dimension, as far as possible, without affecting the solution of the problem. This provides a good precondition for pattern recognition [1]. Feature extraction is the key link of a pattern recognition system; it largely determines the final results of the recognition system.
The essence of pattern feature extraction is how to extract an efficient and reasonable reduced representation from mass data while keeping the information as complete as possible. The basic task of feature extraction is to find a group of features that are most effective for classification, so that a classifier can be designed effectively. However, in practical problems it is


often hard to find the most effective features; this makes feature extraction one of the most important, difficult and challenging tasks in the fields of pattern recognition, data mining, machine learning and so on. A large number of domestic and foreign scholars have been attracted to this field and have obtained some good results.
Various compression algorithms solve the problem of information feature extraction to some extent, but they also have disadvantages. Many scholars have proposed new ideas that have greatly advanced the research of pattern feature extraction. In the following we discuss the research progress of Statistical Pattern Feature Extraction (SPFE).

2 Research on SPFE
The aim of SPFE is to use the existing feature parameters to construct a lower-dimensional feature space, mapping the useful information contained in the original features onto a small number of features and ignoring redundant and irrelevant information [2,3]. This is the process of pattern feature extraction, which can be summarized as the 'low loss dimension reduction' of the original information; the mapping itself is the feature extraction algorithm. For different problems we select different feature extraction algorithms, which can be roughly divided into linear and nonlinear feature extraction. Linear combinations are easy to compute, and early methods for processing high-dimensional data used linear methods to reduce the dimension. However, in practice we often meet nonlinear, time-varying systems, so research on nonlinear feature extraction is more active.

2.1 Methods Based on Statistical Analysis and Their Improvements

Statistical analysis theory provides the most frequently used methods for data feature extraction. It can analyze the statistical laws when several objects and several indices are interrelated, and is a comprehensive analysis approach. Statistical methods rest on a solid theoretical foundation, offer many algorithms, and can effectively analyze and process data. Analyzing data features or classifying data subsets is usually subject to assumptions of statistical independence.
Principal Component Analysis (PCA) is a statistical method that, from the perspective of feature effectiveness, turns a variety of feature indicators into a small number of indicators that describe the data set. PCA seeks several comprehensive factors to replace the original mass of variables; it is required that the principal components reflect the information of the original data as much as possible and that they be mutually independent, so as to achieve the goal of simplification. PCA represents the principal components as linear combinations of the single variables, putting the emphasis on explaining the total variance of the variables; when the eigenvalues of the given covariance or correlation matrix are distinct, the principal components are in general unique. However, in PCA the variance cannot fully reflect the amount of information. Two-dimensional PCA (2DPCA) was proposed on the basis of PCA and used to extract statistical features of palmprint images [4]; its generalization ability was shown to be better than that of traditional PCA. Based on this,
the paper proposed and defined an improved two-dimensional PCA, and proved that it could keep the total scatter of the training sample images while extracting the sample features more effectively. It improved the recognition rate, drastically reduced the feature dimension and computational complexity of the original algorithm, and made the system more practical.
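As an illustration of the classical PCA procedure described above (not of the improved 2DPCA variant of [4]), the following minimal numpy sketch extracts principal components by eigendecomposition of the sample covariance matrix; the function name, toy data and the 95% retained-variance threshold are illustrative assumptions rather than values from the cited work.

```python
import numpy as np

def pca_extract(X, retain=0.95):
    """Classical PCA: project centered data onto the leading eigenvectors of the
    covariance matrix, keeping enough components to explain `retain` of the total
    variance (the threshold is an illustrative choice)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    cov = np.cov(Xc, rowvar=False)               # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), retain) + 1)
    return Xc @ eigvecs[:, :k], eigvals[:k]      # scores and retained variances

# toy usage: 200 samples of 10 correlated features with intrinsic dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 10))
scores, variances = pca_extract(X)
print(scores.shape)                              # (200, k) with k <= 4 for this rank-4 data
```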
Linear Discriminant Analysis (LDA) is a typical representative of linear feature extraction methods; it is widely applied but is restricted by small sample size problems. A margin-based feature extraction algorithm applicable to small sample size problems was proposed in [5]. The algorithm exploits the fact that when the sample size of high-dimensional data is small, the probability that the data are linearly separable increases and low-dimensional projections of the data tend towards a normal distribution. It defines new classification margins, taking into account not only the within-class and between-class scatter used by LDA but also the variance difference of each category. It obtains the optimal projection vectors by maximizing the margins, and at the same time avoids the small sample size problem caused by a singular within-class scatter matrix.
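For concreteness, here is a minimal sketch of classical Fisher discriminant feature extraction via within-class and between-class scatter matrices (the standard LDA referred to above, not the margin-based variant of [5]); the small ridge term added to keep the within-class scatter invertible is our own assumption for the small sample case.

```python
import numpy as np

def lda_directions(X, y, n_dims=1, reg=1e-6):
    """Classical LDA: find directions maximizing between-class over within-class
    scatter. `reg` is a small ridge term (an assumption) so that Sw stays
    invertible when samples are few, the singularity problem noted above."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                        # within-class scatter
    Sb = np.zeros((d, d))                        # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * diff @ diff.T
    Sw += reg * np.eye(d)
    # generalized eigenproblem Sb w = lambda Sw w, solved via Sw^{-1} Sb
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_dims]]       # optimal projection vectors

# toy usage: two Gaussian classes in 5 dimensions, 30 samples each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(1.5, 1, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
W = lda_directions(X, y, n_dims=1)
print((X @ W).shape)                             # (60, 1) discriminant scores
```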
Partial Least Squares (PLS) is a multivariate data processing method that has emerged and developed in recent years; it has wide applicability, is built on the basis of PCA, and is an extension of the Ordinary Least Squares method. An information feature compression algorithm based on PLS was proposed in [6]; the algorithm can better deal with the difficult situation in which observation samples are few but explanatory variables are many. If the 'directions' in the explanatory space are selected suitably, the data fitting and forecasting will be robust and reliable. When there is a high level of correlation among the explanatory variables, PLS can use the data systematically to analyze and sift them, extract the integrated variables that best explain the response variables, and establish an appropriate model. Therefore, when the method compresses the explanatory variable data, it takes the degree of correlation with the response variables into account, and its compression results are more meaningful.
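As a hedged illustration of how PLS compresses many correlated explanatory variables while taking the response into account, the sketch below uses scikit-learn's PLSRegression; the component count and toy data are our own assumptions, not values from [6].

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
# few observations (20) but many highly correlated explanatory variables (50)
latent = rng.normal(size=(20, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.05 * rng.normal(size=(20, 50))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# compress the 50 explanatory variables to 3 PLS components chosen to be
# maximally relevant to the response y (3 components is an illustrative choice)
pls = PLSRegression(n_components=3)
pls.fit(X, y)
X_compressed = pls.transform(X)                  # (20, 3) score matrix
print(X_compressed.shape, pls.score(X, y))       # shape and R^2 of the fitted model
```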
Projection Pursuit (PP) [7] is used to analyze and process high-dimensional observation data, especially data that do not come from a normal population. Its basic idea is to project the high-dimensional data onto a one- to three-dimensional subspace and to search for the projection that best reflects the structure or features of the high-dimensional data, so that the data can be studied and analyzed in that low-dimensional space. The method is not restricted by the assumption of a normal population; in practice many data sets do not follow a normal distribution and people do not have enough prior knowledge of the data distribution. It overcomes the problems brought about by the 'curse of dimensionality', increases the visibility of the data, and can exclude the interference of variables that have little or no relationship with the data structure or features. M. T. Gao used a genetic algorithm to search for the best projection direction, and used the projection matrix of the optimized projection direction to represent the linear and nonlinear structure and feature projection of the original data [8]. Gao applied this method to text clustering, compared it with K-means clustering, and showed that the algorithm was effective.
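The method of [8] searches for projection directions with a genetic algorithm; as a simplified, hedged stand-in, the sketch below performs a random search over unit directions and scores each one with a kurtosis-based projection index, a common choice for highlighting non-Gaussian structure. The index and the search strategy are our own illustrative assumptions, not the algorithm of [8].

```python
import numpy as np

def projection_pursuit_1d(X, n_trials=2000, seed=0):
    """Crude projection pursuit: sample random unit directions and keep the one
    whose 1-D projection deviates most from normality, measured here by the
    absolute excess kurtosis (an illustrative projection index)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    best_dir, best_score = None, -np.inf
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)                   # unit-length candidate direction
        z = Xc @ w
        z = (z - z.mean()) / z.std()
        score = abs(np.mean(z ** 4) - 3.0)       # |excess kurtosis| of the projection
        if score > best_score:
            best_dir, best_score = w, score
    return best_dir, best_score

# toy usage: 5-D data whose interesting (bimodal) structure lies along axis 0
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
X[:150, 0] += 4.0                                # hidden two-cluster structure on axis 0
w, s = projection_pursuit_1d(X)
print(np.round(w, 2), round(s, 2))               # best direction tends to load on axis 0
```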
Independent Component Analysis (ICA) [9] is a new statistical method developed in recent years; its goal is to apply some kind of linear decomposition to the observation data so that the data are decomposed into statistically independent components. The basis of ICA is a hidden statistical
variable model x = As; it represents how the observation data are produced by mixing the independent components. The independent components are hidden variables, which means they cannot be observed directly, and the mixing matrix A is assumed to be unknown. Only the random vector x can be observed; A and s must be estimated, and they must be estimated under as few assumptions as possible. ICA assumes that the components are statistically independent and that the independent components are non-Gaussian; the unknown mixing matrix is assumed to be square. If the inverse of A can be computed, denoted W, then the independent components can be obtained by s = Wx. There are two inherent ambiguities in the ICA model: the variances of the independent components cannot be determined, and the order of the independent components cannot be determined.
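To make the x = As / s = Wx model concrete, here is a hedged sketch using scikit-learn's FastICA as one possible estimator of the unmixing matrix W; the toy sources, mixing matrix and parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(4)
n = 2000
# two non-Gaussian, statistically independent sources s
s = np.column_stack([np.sign(np.sin(np.linspace(0, 40, n))),   # square-like wave
                     rng.laplace(size=n)])                      # heavy-tailed source
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                       # "unknown" square mixing matrix
x = s @ A.T                                      # observed mixtures, x = A s

ica = FastICA(n_components=2, random_state=0)    # estimates W so that s_hat = W x
s_hat = ica.fit_transform(x)
# the recovered sources match the true ones only up to scale and permutation,
# which are exactly the two ambiguities of the ICA model noted above
print(s_hat.shape)                               # (2000, 2)
```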
An ICA algorithm improved with basis functions was used to extract image features [10]; by analyzing the Laplacian prior of the images, the ICA problem is simplified to a minimum-norm solution, so the algorithm does not need to optimize a higher-order nonlinear contrast function, yields sparser representations and converges faster. J. Karhunen et al. used ICA to extract image pattern features [11], and P. C. Yuen et al. used ICA for face recognition [12]; these studies show the wide application prospects of ICA. Because ICA has appeared only in recent years, its theories and algorithms are not yet mature, and much remains to be added and perfected. The rising ICA theories and methods will start another upsurge in the study of pattern feature extraction.
Moreover, there are incremental PCA (IPCA) and incremental discriminant analysis; a new incremental face feature extraction method, incremental weighted average sample analysis, was used in real-time face recognition [13], and a feature extraction method based on the singular value decomposition of a matrix remedied disadvantages of classical mathematical methods [14].

2.2 Methods Based on Kernel Functions

The kernel idea [15] is to introduce a kernel function into other algorithms, transforming a nonlinear problem in the original space into a linear problem in a feature space, while the actual computation is still carried out in the original space. The use of kernel functions thus opens a new way of thinking for solving nonlinear problems; it can be applied to many linear data analysis algorithms, especially algorithms that can be expressed in terms of inner products. The method is based on selecting a function K(x_i, x_j) that is symmetric, continuous and satisfies Mercer's theorem, where x_i and x_j are two sample points of the input space. The method achieves a mapping Φ: R^{d_L} → H from the d_L-dimensional input space to the d_H-dimensional feature space, such that
K(x_i, x_j) = ∑_{n=1}^{d_H} Φ_n(x_i) Φ_n(x_j)    (1)
The aim of the mapping is to carry problems that are difficult to solve into the feature space and process them there. At present, the most commonly used kernel functions are the linear kernel, the polynomial kernel of order p, the Gaussian radial basis function (RBF) kernel, the multi-layer perceptron (MLP) kernel, and so on.
A typical example is Kernel Principal Component Analysis (Kernel PCA, KPCA) [16]. Its main idea is to map the input data x via a nonlinear mapping Φ(x) into a feature space F, and then perform linear PCA in F. For the computation of eigenvalues and vector projections in the feature space, KPCA does not require an explicit form of the mapping Φ(x); only dot products of mapped points are needed, and these dot products can be computed using the kernel function

K_{ij} = k(x_i, x_j) = (Φ(x_i) · Φ(x_j))    (2)

The nonlinearity of KPCA is achieved by the kernel transformation, which transforms the input space into a Hilbert feature space, so it can be said that PCA is computed in the input space, while Kernel PCA is computed in the feature space.
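The following minimal numpy sketch shows the KPCA computation described above with an RBF kernel: build the kernel matrix of Eq. (2), center it in feature space, and take its leading eigenvectors as the nonlinear components. The kernel width, component count and toy data are illustrative assumptions.

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    """Kernel PCA with an RBF kernel: linear PCA carried out implicitly in the
    feature space via the kernel matrix K_ij = k(x_i, x_j)."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # RBF kernel matrix
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center the mapped data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    alphas = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return Kc @ alphas                           # projections onto the kernel components

# toy usage: two concentric rings, a structure linear PCA cannot separate
rng = np.random.default_rng(5)
t = rng.uniform(0, 2 * np.pi, 400)
r = np.repeat([1.0, 3.0], 200)
X = np.column_stack([r * np.cos(t), r * np.sin(t)]) + 0.05 * rng.normal(size=(400, 2))
Z = kernel_pca(X, n_components=2, gamma=2.0)
print(Z.shape)                                   # (400, 2) nonlinear features
```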
Kernel-based Fisher Discriminant Analysis (Kernel FDA, KFDA) [17], Kernel-based Canonical Correlation Discriminant Analysis (Kernel CCDA, KCCDA) [18] and related methods are other well-known kernelized algorithms. They overcome the weakness of being able to solve only linear problems; although they are somewhat more complicated in form, they turn nonlinear problems into linear ones and make the problems easier to solve. As a bridge from the linear to the nonlinear, the kernel function generalizes methods that can only solve linear problems into methods that can solve nonlinear problems.

2.3 Methods Based on the Integration of Several Algorithms

While each method has its advantages, it also has certain disadvantages, and different methods generally suit different settings; it is hard to obtain good robustness and high precision using only one feature extraction method. By combining various methods organically, exploiting their strengths and avoiding their weaknesses, features can be compressed better, better feature information can be provided, and the accuracy of recognition can thereby be improved. R. W. Swiniarski combined PCA and rough sets [19]: after reducing the dimension by PCA, he used the attribute reduction algorithms of rough set theory to compress the dimension further, applied the combination to neural network recognition of face images, and achieved good results.
In the PCA feature compression algorithm based on information theory [20], the concept of a generalized information function was put forward according to the information function in Shannon's information theory, combined with the intrinsic behavior of the eigenvalues. It was applied to feature compression in PCA, the concepts of information rate and cumulative information rate were introduced, and a PCA feature compression algorithm based on information theory was thus established. The algorithm describes the degree of information compression better and retains more of the information content of the original features than the principal components obtained by standard PCA. It combines the advantages of principal component analysis and information theory.
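The generalized information function of [20] is not reproduced here; as a rough, hedged analogue only, the sketch below computes a cumulative 'information rate' from the PCA eigenvalues (simply the normalized cumulative eigenvalue sum) and keeps components until a threshold is reached. This is our own simplification, not the algorithm of [20].

```python
import numpy as np

def cumulative_information_rate(X, threshold=0.90):
    """Rough analogue of an information-rate criterion: treat the normalized
    eigenvalues of the correlation matrix as per-component 'information rates'
    and keep components until their cumulative rate exceeds `threshold`.
    (Simplified stand-in, not the generalized information function of [20].)"""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize, so PCA uses correlations
    eigvals = np.linalg.eigvalsh(np.corrcoef(Xs, rowvar=False))[::-1]
    rate = eigvals / eigvals.sum()               # per-component information rate
    cum_rate = np.cumsum(rate)                   # cumulative information rate
    k = int(np.searchsorted(cum_rate, threshold) + 1)
    return k, rate, cum_rate

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 3)) @ rng.normal(size=(3, 8))   # 8 correlated features, rank ~3
k, rate, cum = cumulative_information_rate(X)
print(k, np.round(cum, 3))                       # number of retained components
```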
Fei Zuo et al. proposed cascading the three methods CGD, CTF and CFR and used this cascade in facial expression recognition [21]; the results showed that the performance of the cascaded method was far better than that of the individual methods. A two-way feature compression method based on PCA and immune clustering effectively removed the correlation between the various feature parameters [22]; the algorithm performs more effectively and has wider applicability when a step normalizing the antigen data and a step directly removing similar samples are added. Combining iris techniques with multi-dimensional scaling analysis to extract features can further improve the accuracy of iris recognition [23].

2.4 Other Methods

Pattern feature extraction has always been an active study area; many scholars have made great efforts and great contributions. Besides the improvements of the classical methods described above, many scholars have proposed new theories and methods, such as Non-negative Matrix Factorization (NMF) [24], Locally Linear Embedding (LLE) [25], Manifold Learning (ML) [26] and so on.
The NMF method stems from research on non-negative matrices published in the famous journal Nature in 1999, in which D. D. Lee and H. S. Seung proposed a new matrix decomposition idea, Non-negative Matrix Factorization. NMF is a matrix decomposition method under the constraint that all elements of the factor matrices are non-negative. LLE is a nonlinear dimensionality reduction method; it constructs the reconstruction relationship between each sample point and its neighboring sample points, keeps this reconstruction relationship unchanged in the process of dimensionality reduction, and thus retains important features of the high-dimensional measurement space. ML is a type of unsupervised statistical learning problem. A manifold can be simply thought of as a topological space that is locally Euclidean, and the main goal of ML is to find a low-dimensional smooth manifold embedded in a high-dimensional observation data space. The research content of ML mainly includes the dimensionality reduction of finite data sets in a way that preserves or highlights special features of the original data; density estimation for finite samples of high-dimensional points drawn from a distribution; and the establishment of hidden variable models of high-dimensional observation data that are influenced by a small number of latent factors. ML methods can be divided into local and global methods, and sometimes into spectral and non-spectral methods. Although ML is a basic research direction, because of its broad application prospects Manifold Learning has increasingly become a hot issue in recent years.
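As a brief hedged illustration of LLE's neighborhood-preserving dimensionality reduction and NMF's non-negativity constraint, the sketch below uses scikit-learn's implementations on toy data; the neighbor count, component numbers and data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.decomposition import NMF

# LLE: unroll a 3-D swiss roll into 2 dimensions while preserving the local
# linear reconstruction relationships between neighboring sample points
X, _ = make_swiss_roll(n_samples=1000, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_2d = lle.fit_transform(X)
print(X_2d.shape)                                # (1000, 2) embedded coordinates

# NMF: factor a non-negative data matrix V into non-negative W and H, so that
# each sample is an additive combination of non-negative parts
rng = np.random.default_rng(7)
V = np.abs(rng.normal(size=(100, 40)))
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(V)                         # (100, 5) coefficients
H = nmf.components_                              # (5, 40) non-negative basis parts
print(W.shape, H.shape)
```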

3 Prospects
The main significance of feature extraction lies in 'low loss dimensionality reduction', which simplifies problems and makes them easier to compute, or increases the computation speed so that the learning and training of the system become easier. Compared with pattern recognition itself, it is a fundamental and antecedent line of research. In recent years pattern recognition has been applied in various areas. Pattern feature extraction is the basis of recognition and learning, and it plays a key role in
improving the recognition accuracy. Pattern feature extraction is also an important part of data mining and machine learning. For different problems, reasonable, reliable and feasible feature extraction methods should be chosen.

Although the theoretical methods of feature extraction and selection have achieved a great deal, some methods remain purely theoretical and have not yet been put into practice. The purpose of studying theory is to apply it in practice, so in the future one research hot spot will be how to apply mature theory to practical problems. From the viewpoint of applications, the systems encountered in practice are generally nonlinear, time-varying systems, so current research focuses on the theories and algorithms of high-dimensional nonlinear feature extraction.

Acknowledgements. This work is supported by the National Natural Science Foundation of China under Grant no. 40574001, the 863 National High-Tech Program under Grant no. 2006AA01Z128, and the Opening Foundation of the Key Laboratory of Intelligent Information Processing of the Chinese Academy of Sciences under Grant no. IIP2006-2.

References
1. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, New York
(1973)
2. Jensen, D.D., Cohen, P.R.: Multiple Comparisons in Induction Algorithms. Machine Learning 3, 309–338 (2000)
3. Forman, G.: An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research 3, 1289–1305 (2003)
4. Li, Q., Qiu, Z.D., Sun, D.M., et al.: Online Palmprint Identification Based on Improved 2D
PCA. Acta Electronica Sinica 10, 1886–1889 (2005)
5. Huang, R., He, M.Y., Yang, S.J.: A Margin Based Feature Extraction Algorithm for the
Small Sample Size Problem. Chinese Journal of Computers 7, 1173–1178 (2007)
6. Ding, S.F., Jin, F.X., Shi, Z.Z.: Information Feature Compression Based on Partial Least
Squares. Journal of Computer Aided Design & Computer Graphics 2, 368–371 (2005)
7. Friedman, J.H., Tukey, J.W.: A Projection Pursuit Algorithm for Exploratory Data Analy-
sis. IEEE Trans. of Computers 9, 881–890 (1974)
8. Gao, M.T., Wang, Z.O.: A New Algorithm for Text Clustering Based on Projection Pur-
suit. In: Proceedings of the Sixth International Conference on Machine Learning and Cy-
bernetics, pp. 3401–3405 (2007)
9. Comon, P.: Independent Component Analysis: A New Concept. Signal Processing 3, 287–
314 (1994)
10. Huang, Q.H., Wang, S., Liu, Z.: Improved Algorithm of Image Feature Extraction Based
on Independent Component Analysis. Opto-Electronic Engineering 1, 121–125 (2007)
11. Karhunen, J., Hyvarinen, A., Vigario, R., et al.: Applications of Neural Blind Separation to
Signal and Image Processing. In: Proceedings of the IEEE 1997 International Conference
on Acoustics, Speech, and Signal Processing, pp. 131–134 (1997)
12. Yuen, P.C., Lai, J.H.: Face Representation Using Independent Component Analysis. Pat-
tern Recognition 3, 545–553 (2001)
13. Song, F.X., Gao, X.M., Liu, S.H.: Dimensionality Reduction in Statistical Pattern Recog-
nition and Low Loss Dimensionality Reduction. Chinese Journal of Computers 11, 1915–
1922 (2005)
14. Wiener, E., Pedersen, J.O., Weigend, A.S.: A Neural Network Approach to Topic Spot-
ting. In: Proceedings of the 4th Annual Symposium on Document Analysis and Informa-
tion Retrieval, pp. 317–332 (1995)
15. Aizerman, M., Braverman, E., Rozonoer, L.: Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control 25, 821–837 (1964)
16. Taylor, J.S., Holloway, R., Williams, C.: The Stability of Kernel Principal Components Analysis and Its Relation to the Process Eigenspectrum. In: Advances in Neural Information Processing Systems, vol. 15, pp. 383–389 (2003)
17. Mika, S., Rätsch, G., et al.: Fisher Discriminant Analysis with Kernels. Neural Networks for Signal Processing IX, 41–48 (1999)
18. Sun, P., Xu, Z.B., Shen, J.Z.: Nonlinear Canonical Correlation Analysis for Discrimination
Based on Kernel Methods. Chinese Journal of Computers 6, 789–795 (2004)
19. Swiniarski, R.W., Skowron, A.: Rough Set Methods in Feature Selection and Recognition.
Pattern Recognition Letters 6, 833–849 (2003)
20. Ding, S.F., Jin, F.X., Wang, J., et al.: New PCA Feature Compression Algorithm Based on
Information Theory. Mini-micro Systems 4, 694–697 (2004)
21. Zuo, F., De With, P.H.N.: Facial Feature Extraction Using a Cascade of Model-Based Al-
gorithms. In: Proceedings IEEE Conference on Advanced Video and Signal Based Surveil-
lance, pp. 348–353 (2005)
22. Fan, Y.P., Chen, Y.P., Sun, W.S., et al.: Algorithm for Bi-directional Reduce Feature Data
Based on the Principal Component Analysis and Immune Clustering. Acta Simulata Sys-
tematica Sinica 1, 148–153 (2005)
23. Makram, N., Bouridane, A.: An Effective and Fast Iris Recognition System Based on a
Combined Multiscale Feature Extraction Technique. Pattern Recognition 3, 868–879
(2008)
24. Lee, D.D., Seung, H.S.: Learning the Parts of Objects by Non-negative Matrix Factoriza-
tion. Nature 6755, 788–791 (1999)
25. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embed-
ding. Science 5500, 2323–2326 (2000)
26. Seung, H.S., Lee, D.D.: The Manifold Ways of Perception. Science 12, 2268–2269 (2000)
