Pattern Recognition
journal homepage: www.elsevier.com/locate/pr
article info

Article history:
Received 19 November 2014
Received in revised form 7 July 2016
Accepted 7 July 2016
Available online 8 July 2016

Keywords:
Principal component analysis
Dimensionality reduction
L1-norm
Trace lasso
L2-norm

abstract

Recently, some L1-norm-based principal component analysis algorithms with sparsity have been proposed for robust dimensionality reduction and the processing of multivariate data. The L1-norm regularization used in these methods encounters stability problems when there are various correlation structures among the data. To overcome this drawback, in this paper we propose a novel L1-norm-based principal component analysis with adaptive regularization (PCA-L1/AR) which can consider sparsity and correlation simultaneously. PCA-L1/AR is adaptive to the correlation structure of the training samples and can benefit from both the L2-norm and the L1-norm. An iterative procedure for solving PCA-L1/AR is also proposed. Experimental results on several data sets demonstrate the effectiveness of the proposed method.

© 2016 Elsevier Ltd. All rights reserved.

http://dx.doi.org/10.1016/j.patcog.2016.07.014
0031-3203/© 2016 Elsevier Ltd. All rights reserved.
902 G.-F. Lu et al. / Pattern Recognition 60 (2016) 901–907
et al. [17], respectively, proposed the non-greedy versions of 2DPCA-L1 and TPCA-L1.

In order to improve the interpretation of the basis vectors of PCA-L1, Meng et al. [14] proposed a sparse PCA-L1 method called PCA-L1 with sparsity (PCA-L1S). Not only is the objective function of PCA-L1S based on the L1-norm, but the basis vectors are also penalized by the L1-norm. Similarly, Wang et al. [7] proposed 2DPCA-L1 with sparsity (2DPCA-L1S).

The L1-norm regularization works best on high-dimensional, low-correlation data [19–22]. However, many data sets exhibit various correlation structures, and in this situation the L1-norm regularization encounters instability problems. Recently, trace Lasso [19,23–25] has been proposed to remedy this instability: it is adaptive and interpolates between the L1-norm and the L2-norm.

In this paper, we use trace Lasso to regularize the basis vectors of PCA-L1 and propose a novel L1-norm-based principal component analysis, called PCA-L1 with adaptive regularization (PCA-L1/AR). PCA-L1/AR, which can consider sparsity and correlation simultaneously, is adaptive to the correlation structure and can benefit from both the L2-norm and the L1-norm. We also present an iterative algorithm for solving PCA-L1/AR. Experiments on several publicly available data sets confirm the effectiveness of the proposed method.

The remainder of the paper is organized as follows. In Section 2, we briefly review the PCA and PCA-L1 techniques. In Section 3, we propose the PCA-L1/AR approach, including its objective function and algorithmic procedure. The experimental results are reported in Section 4. Finally, we conclude the paper in Section 5.

2. Outline of PCA, PCA-L1 and PCA-L1S

Let X = {x_1, x_2, ..., x_n} ∈ R^{d×n} be a d-dimensional sample set with n elements. Without loss of generality, we assume that X has been centered. The classical PCA method (termed PCA-L2) aims to maximize the variance of the data points in the projected subspace. The optimal projection vector w ∈ R^d can be obtained by solving the following criterion function:

  max_{w^T w = 1} w^T S_t w   (1)

where S_t = (1/n) X X^T is the covariance matrix. The optimal subspace of PCA is spanned by the eigenvectors of S_t corresponding to the largest m eigenvalues. Eq. (1) can be reformulated as

  max_{w^T w = 1} (1/n) ‖w^T X‖_2^2   (2)

where ‖·‖_2 denotes the L2-norm of a vector.

Obviously, the conventional PCA is based on the L2-norm. In [11], Kwak proposed PCA-L1, in which the L2-norm of PCA-L2 is replaced with the L1-norm, improving the robustness to noise and outliers. PCA-L1 aims to maximize the following objective function:

  max_{w^T w = 1} ‖w^T X‖_1   (3)

where ‖·‖_1 denotes the L1-norm of a vector. Kwak proposed a greedy iterative procedure to compute w since it is difficult to solve Eq. (3) directly.

In order to improve the interpretation of the basis vectors of PCA-L1, Meng et al. [14] proposed PCA-L1S, where the L1-norm is not only used in the objective function but is also used to regularize the basis vector of PCA-L1. PCA-L1S aims to solve the following optimization problem:

  max ‖w^T X‖_1, subject to w^T w = 1, ‖w‖_1 < k   (4)

where k is a positive integer. An efficient iterative procedure to solve Eq. (4) is also presented in [14].

3. L1-norm-based principal component analysis with adaptive regularization (PCA-L1/AR)

3.1. Problem formulation

In this subsection, we present our proposed L1-norm-based principal component analysis with adaptive regularization (PCA-L1/AR).

The L1-norm regularization encounters stability problems if the data samples exhibit strong correlations [19]. In this paper, inspired by [19], we impose the trace norm on w. Specifically, we integrate the trace norm into the objective function of PCA-L1, and the objective function of PCA-L1/AR is formulated as

  arg max_w ‖X^T w‖_1 − λ‖X^T Diag(w)‖_*   (5)

or

  arg min_w λ‖X^T Diag(w)‖_* − ‖X^T w‖_1   (6)

where ‖·‖_* denotes the trace norm of a matrix, i.e., the sum of its singular values, and Diag(·) converts a vector into a diagonal matrix. In Section 3.2, we introduce how to solve the objective function of PCA-L1/AR, i.e., Eq. (6).

The main difference between the trace norm ‖X^T Diag(w)‖_* and other norms, e.g. the L1-norm and the L2-norm, is that ‖X^T Diag(w)‖_* contains the data sample matrix X. ‖X^T Diag(w)‖_* is adaptive to the correlation structure and interpolates between the L1-norm and the L2-norm [19]. If X X^T = I, i.e., the data are uncorrelated, then we have

  ‖X^T Diag(w)‖_* = tr[((X^T Diag(w))^T (X^T Diag(w)))^{1/2}] = tr[(Diag(w) X X^T Diag(w))^{1/2}] = ‖w‖_1   (7)

Thus, the trace norm regularization ‖X^T Diag(w)‖_* is equal to the L1-norm. If X = 1 x^1, i.e., the data are highly correlated, where x^1 denotes the first row of X and 1 ∈ R^d is a column vector with one at each entry, then we have

  ‖X^T Diag(w)‖_* = ‖(x^1)^T w^T‖_* = ‖x^1‖_2 ‖w‖_2 = ‖w‖_2   (8)

(assuming, as in [19], that the rows of X are normalized to unit norm). Thus, in this case ‖X^T Diag(w)‖_* is equal to the L2-norm. For other cases, trace Lasso interpolates between the L1-norm and the L2-norm depending on the correlations [19], i.e.,

  ‖w‖_2 ≤ ‖X^T Diag(w)‖_* ≤ ‖w‖_1   (9)

This means that trace Lasso can benefit from both the L2-norm and the L1-norm according to the correlations among the data.

3.2. The solution of PCA-L1/AR

Motivated by the optimization method used in [26], we use the augmented Lagrange multiplier (ALM) method [27] to solve Eq. (6). In [27], the ALM method is introduced for solving the following constrained optimization problem:

  min f(X) s.t. h(X) = 0   (10)

where f: R^n → R and h: R^n → R^m. We can define the augmented Lagrangian function to solve Eq. (10):

  L(X, Λ, μ) = f(X) + Λ^T h(X) + (μ/2)‖h(X)‖_2^2   (11)

where Λ is the Lagrange multiplier and μ > 0 is a penalty parameter.
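Before turning to the solver, the interpolation property of Eqs. (7)–(9) can be verified numerically. The penalty ‖X^T Diag(w)‖_* is simply the sum of the singular values of the n × d matrix X^T Diag(w). The following sketch (assuming NumPy; the data and function name are ours, for illustration only) checks both limiting cases and the general bounds:

```python
import numpy as np

def trace_lasso(X, w):
    # ||X^T Diag(w)||_* : sum of singular values of X^T Diag(w),
    # where X is d x n and w is a d-vector.  Multiplying X.T by w
    # scales the j-th column of X^T by w_j, which is X^T Diag(w).
    return np.linalg.svd(X.T * w, compute_uv=False).sum()

w = np.array([1.0, -2.0, 3.0])

# Uncorrelated data (X X^T = I): the penalty reduces to ||w||_1, Eq. (7).
X_uncorr = np.eye(3)
assert np.isclose(trace_lasso(X_uncorr, w), np.abs(w).sum())

# Perfectly correlated data (X = 1 x^1 with a unit-norm row x^1):
# the penalty reduces to ||w||_2, Eq. (8).
x1 = np.array([3.0, 4.0]) / 5.0
X_corr = np.outer(np.ones(3), x1)
assert np.isclose(trace_lasso(X_corr, w), np.linalg.norm(w))

# General case with unit-norm rows: the bounds of Eq. (9) hold.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)
t = trace_lasso(X, w)
assert np.linalg.norm(w) - 1e-9 <= t <= np.abs(w).sum() + 1e-9
```

The lower bound follows because the trace norm dominates the Frobenius norm, and the upper bound from the triangle inequality applied to the rank-one terms w_i x^i e_i^T.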
By introducing auxiliary variables e and J, Eq. (6) can be rewritten as the constrained problem

  min_{w, e, J} λ‖J‖_* − ‖e‖_1  s.t. e = X^T w, J = X^T Diag(w)   (12)

Then, motivated by the ALM method, we can use the following augmented Lagrangian function to solve Eq. (12):

  L(w, e, J) = λ‖J‖_* − ‖e‖_1 + y_1^T (e − X^T w) + (μ_1/2)‖e − X^T w‖_2^2 + trace(Y_2^T (J − X^T Diag(w))) + (μ_2/2)‖J − X^T Diag(w)‖_F^2   (13)

where y_1 and Y_2 are the Lagrange multipliers, and μ_1 > 0 and μ_2 > 0 are the penalty parameters. Eq. (13) is now unconstrained and can be optimized with respect to w, e and J in turn by considering the other variables as constants.

First, we compute J when w and e are fixed. Then Eq. (13) is equivalent to solving the following optimization problem:

  J* = arg min_J L(J, w, e)
     = arg min_J λ‖J‖_* + trace(Y_2^T J) + (μ_2/2)‖J − X^T Diag(w)‖_F^2
     = arg min_J (λ/μ_2)‖J‖_* + (1/2)‖J − (X^T Diag(w) − (1/μ_2) Y_2)‖_F^2   (14)

From [28], we know that Eq. (14) has a closed-form solution via the singular value thresholding (SVT) operator. The optimal J* is computed as

  J* = SVT_{λ/μ_2}(X^T Diag(w) − (1/μ_2) Y_2)   (15)

where SVT_δ(A) = U Diag((σ_i − δ)_+) V^T, the singular value decomposition (SVD) of A is given by A = U Diag({σ_i}_{1≤i≤r}) V^T, and t_+ = max(0, t).

Second, we compute e when w and J are fixed. Then Eq. (13) is equivalent to solving the following optimization problem:

  e* = arg max_e ‖e‖_1 − y_1^T e − (μ_1/2)‖e − X^T w‖_2^2
     = arg max_e (1/μ_1)‖e‖_1 − (1/2)‖e − (X^T w − (1/μ_1) y_1)‖_2^2   (16)

To solve the maximization problem of Eq. (16), we prove the following theorem.

Theorem 1. For λ > 0, the optimization problem

  max_x λ|x| − (1/2)(x − y)^2   (17)

has the optimal solution x* = sgn(y)(|y| + λ) when y ≠ 0, and x* = ±λ when y = 0.

Proof. To maximize Eq. (17), the value of |x| should be as great as possible and the value of (x − y)^2 should be as small as possible. If y is greater than 0, then the value of (x − y)^2 can be minimized only when x is greater than 0; taking the partial derivative of Eq. (17) with respect to x and setting it to zero then gives λ − x + y = 0, i.e., x = y + λ = sgn(y)(|y| + λ). Similarly, if y is smaller than 0, then the value of (x − y)^2 can be minimized only when x is smaller than 0. That is, the optimal x should be x < 0 when y < 0. Taking the partial derivative of Eq. (17) with respect to x and setting it to zero, we have

  −λ − x + y = 0   (21)

Then we have

  x = y − λ = sgn(y)(|y| + λ)   (22)

Suppose y = 0. Differentiating Eq. (17) with respect to x and setting the derivative to zero, we have

  x = ±λ   (23)  ☐

Then Eq. (16), which is separable at the element level, can be solved by using Theorem 1. Let z = X^T w − (1/μ_1) y_1; the optimal e* is

  e_i* = sgn(z_i)(|z_i| + 1/μ_1) if z_i ≠ 0;  e_i* = ±1/μ_1 if z_i = 0   (24)

where e_i, i = 1, ..., n, is the i-th entry of e and z_i, i = 1, ..., n, is the i-th entry of z.

Third, we compute w when e and J are fixed. Then Eq. (13) is equivalent to solving the following optimization problem:

  w* = arg min_w −y_1^T X^T w + (μ_1/2)‖X^T w − e‖_2^2 − trace(Y_2^T X^T Diag(w)) + (μ_2/2)‖J − X^T Diag(w)‖_F^2
     = arg min_w −y_1^T X^T w + (μ_1/2)(w^T X X^T w − 2 e^T X^T w) − diag(Y_2^T X^T)^T w + (μ_2/2) w^T Diag(diag(X X^T)) w − μ_2 diag(J^T X^T)^T w
     = arg min_w (−X y_1 − μ_1 X e − diag(Y_2^T X^T) − μ_2 diag(J^T X^T))^T w + w^T ((μ_1/2) X X^T + (μ_2/2) Diag(diag(X X^T))) w   (25)

where diag(·) denotes the diagonal of a matrix, arranged as a vector. Eq. (25) is a convex quadratic problem; setting its derivative with respect to w to zero gives

  w* = (μ_1 X X^T + μ_2 Diag(diag(X X^T)))^{−1} (X y_1 + μ_1 X e + diag(Y_2^T X^T) + μ_2 diag(J^T X^T))

Letting A = μ_2 Diag(diag(X X^T)), the required inverse can be computed efficiently via the Woodbury identity:

  (μ_1 X X^T + μ_2 Diag(diag(X X^T)))^{−1} = A^{−1} − μ_1 A^{−1} X (I + μ_1 X^T A^{−1} X)^{−1} X^T A^{−1}   (28)
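The two closed-form updates, the SVT step of Eq. (15) and the element-wise e-step of Eq. (24), can be sketched as follows (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def svt(A, delta):
    # Singular value thresholding (Eq. (15)): for the SVD
    # A = U Diag(sigma_i) V^T, return U Diag((sigma_i - delta)_+) V^T.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - delta, 0.0)) @ Vt

def update_e(w, X, y1, mu1):
    # Element-wise solution of Eq. (16) via Theorem 1 / Eq. (24):
    # z = X^T w - y1 / mu1 and e_i = sgn(z_i)(|z_i| + 1/mu1).
    # The measure-zero tie z_i = 0 (where either +-1/mu1 is optimal)
    # is left at 0 here for simplicity.
    z = X.T @ w - y1 / mu1
    return np.sign(z) * (np.abs(z) + 1.0 / mu1)

# SVT shrinks each singular value by the threshold and clips at zero:
# thresholding diag(3, 1) at delta = 2 leaves diag(1, 0).
assert np.allclose(svt(np.diag([3.0, 1.0]), 2.0), np.diag([1.0, 0.0]))
```

With delta = 0, `svt` returns its input unchanged; larger thresholds progressively zero out the smaller singular values, which is exactly how the trace-norm penalty promotes low rank in J.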
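The w-step, Eq. (25) solved through the Woodbury inverse of Eq. (28), can be sketched as follows (a minimal NumPy sketch under our own variable names; it assumes μ_2 > 0 and nonzero rows of X so that A is invertible):

```python
import numpy as np

def update_w(X, e, J, y1, Y2, mu1, mu2):
    """w-step of the ALM iteration: minimize Eq. (25).

    Setting the gradient of Eq. (25) to zero gives
        (mu1 X X^T + mu2 Diag(diag(X X^T))) w = b,
        b = X y1 + mu1 X e + diag(Y2^T X^T) + mu2 diag(J^T X^T),
    and the inverse is applied via the Woodbury identity (Eq. (28))
    with A = mu2 Diag(diag(X X^T)), costing O(n^3) instead of O(d^3)
    when n << d.
    """
    b = (X @ y1 + mu1 * (X @ e)
         + np.diag(Y2.T @ X.T) + mu2 * np.diag(J.T @ X.T))
    a = mu2 * np.sum(X * X, axis=1)        # diagonal of A
    Ainv_b = b / a
    Ainv_X = X / a[:, None]
    n = X.shape[1]
    M = np.linalg.inv(np.eye(n) + mu1 * (X.T @ Ainv_X))
    return Ainv_b - mu1 * (Ainv_X @ (M @ (X.T @ Ainv_b)))
```

For small problems, the direct solve `np.linalg.solve(mu1 * X @ X.T + mu2 * np.diag(np.sum(X * X, axis=1)), b)` produces the same w up to numerical precision, which is a convenient sanity check on Eq. (28).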
The Lagrange multipliers and penalty parameters are then updated by

  y_1 = y_1 + μ_1 (e − X^T w)
  Y_2 = Y_2 + μ_2 (J − X^T Diag(w))
  μ_1 = ρμ_1   (31)

and

  μ_2 = ρμ_2   (32)

where ρ > 1 is a constant. The whole procedure is summarized in Algorithm 1.

Algorithm 1. Solving PCA-L1/AR for one projection vector.

  while not converged do
    1) update J by J* = SVT_{λ/μ_2}(X^T Diag(w) − (1/μ_2) Y_2);
    2) update e by e* = arg max_e (1/μ_1)‖e‖_1 − (1/2)‖e − (X^T w − (1/μ_1) y_1)‖_2^2;
    3) update w by solving Eq. (25), i.e.,
       w* = (μ_1 X X^T + μ_2 Diag(diag(X X^T)))^{−1} (X y_1 + μ_1 X e + diag(Y_2^T X^T) + μ_2 diag(J^T X^T));
    4) update the Lagrange multipliers and penalty parameters by
       y_1 = y_1 + μ_1 (e − X^T w)
       Y_2 = Y_2 + μ_2 (J − X^T Diag(w))
       μ_1 = ρμ_1
       μ_2 = ρμ_2
  end while

3.3. Extension to multiple basis vectors

By using Algorithm 1, one optimal projection vector can be obtained. One projection vector, however, is generally not enough for feature extraction. In this section, we introduce how to obtain multiple projection vectors. If we have learned the first r − 1 (r > 1) projection vectors w_1, w_2, ..., w_{r−1}, then we can use the deflated data to compute the r-th projection vector w_r:

  (x_i)_deflated = x_i − Σ_{l=1}^{r−1} w_l (w_l^T x_i)   (33)

After the deflated x_i are obtained, we use them to form the data matrix X. Then we can use Algorithm 1 to compute w_r.

4. Experiments and results

In this section, we compare our proposed PCA-L1/AR with PCA-L2, PCA-L1 and PCA-L1S.

4.1. Experiments on artificial datasets

In this experiment, we first generate a data set X = {x_1, x_2, ..., x_n} ∈ R^{d×n} of n = 100 two-dimensional observation points drawn from the Gaussian distribution N(0, [10 8; 8 10]). In Fig. 1, we plot the nominal data with black "*". For reference purposes, we also plot the first principal component learned by PCA-L2 on the artificial dataset without any outlier.

Then, we corrupt the nominal data with four outlier measurements, i.e., [33, 4], [28, 7], [30, 12] and [25, 16], depicted by red "o" in Fig. 1. Finally, we use the PCA-L2, PCA-L1, PCA-L1S and PCA-L1/AR methods to extract the principal component from the above nominal data and four outlier measurements. In Fig. 1, we also depict the first principal component extracted by PCA-L2, PCA-L1, PCA-L1S and PCA-L1/AR, respectively.

From Fig. 1, we can see that the principal component learned by PCA-L2 on the artificial dataset with outliers deviates severely from the one learned by PCA-L2 on the dataset without outliers. In contrast, the principal component learned by our proposed PCA-L1/AR deviates only slightly from the latter, and it is much closer to it than those of PCA-L1 and PCA-L1S, which indicates that PCA-L1/AR is more robust to outliers than the other methods.

4.2. Experiments on Yale face database

The Yale face database contains 165 gray-scale images of 15 individuals; each individual has 11 images. The images demonstrate variations in lighting condition and facial expression (normal, happy, sad, sleepy, surprised, and wink). In our experiments, each image in the Yale database was manually cropped and resized to 64 × 64. Some images of one person in the Yale database are shown in Fig. 2.

In the experiments, we randomly choose i (i = 4 and 5) images of each person for training, and the remaining ones are used for testing.
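In every multi-component experiment, the projection vectors are extracted one at a time with the deflation step of Eq. (33) in between. That step can be sketched as follows (a minimal NumPy sketch; the function name and the convention that W collects the learned vectors as columns are ours):

```python
import numpy as np

def deflate(X, W):
    # Eq. (33): subtract from every sample x_i (columns of X) its
    # components along the already-learned projection vectors
    # w_1..w_{r-1} (columns of W):  x_i <- x_i - sum_l w_l (w_l^T x_i)
    return X - W @ (W.T @ X)
```

After deflating with a unit-norm w, the new data matrix satisfies w^T X = 0, so Algorithm 1 applied to the deflated data is forced to pick up a direction carrying new information.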
We repeat the procedure 10 times and report the average recognition rate and the standard deviation in Table 1. The plots of recognition rate vs. the dimension of the reduced space are shown in Fig. 3.

Table 1
Comparison of recognition rates for the different methods on the Yale database.

In order to further investigate the robustness of PCA-L1/AR, we conduct experiments on polluted images. We first intentionally contaminated 20% of the training samples with rectangle noise. The rectangle noise consists of white or black dots, its location in the face image is random, and its size is 20 × 20. Some face images with rectangle noise are shown in Fig. 4. Then, we randomly choose i (i = 4 and 5) images of each person for training, and the remaining ones are used for testing. Third, we repeat the procedure 10 times and report the average recognition rate and the standard deviation in Table 2.

Table 2
Comparison of recognition rates for the different methods on the Yale database with contaminated images.

  Sample size | PCA-L2     | PCA-L1     | PCA-L1S    | PCA-L1/AR
  4           | 58.4 ± 3.6 | 61.2 ± 3.7 | 59.9 ± 3.1 | 65.5 ± 3.8
  5           | 65.0 ± 4.3 | 67.0 ± 4.0 | 66.2 ± 3.5 | 71.9 ± 4.2

For visual perception, we illustrate the first ten projection vectors of PCA-L2, PCA-L1, PCA-L1S, and PCA-L1/AR in Fig. 5. We can find that the basis vectors learned by the PCA-L1/AR method are more robust to noise than those of the other three methods.

4.3. Experiments on AR face database

The AR face database contains over 4000 gray-scale face images of 126 people, including frontal views with different facial expressions, lighting conditions and occlusions. The face images of 120 individuals (26 images per person) were taken in two sessions. The images of these 120 persons (3120 images in total) are used in our experiments. All images were manually cropped and resized to 50 × 40. Some example images of one person are shown in Fig. 6.

In the experiments, we randomly choose four images of each person for training, and the remaining ones are used for testing. We repeat the procedure 10 times and report the average recognition rate and the standard deviation in Table 3. The plots of recognition rate vs. the dimension of the reduced space are shown in Fig. 7.

4.4. Experiments on COIL-20 image database

The COIL-20 data set contains 1440 images of 20 objects. For each object, 72 images were captured against a black background from varying angles; the camera moves in intervals of five degrees. Each image is resized to 32 × 32 in our experiment.

In the experiments, we randomly choose ten images of each object for training, and the remaining ones are used for testing. To test the robustness of the proposed PCA-L1/AR against outliers, we randomly choose 50% of the training samples to be contaminated by rectangle noise. The rectangle noise consists of white or black dots, its location in the image is random, and its size is 16 × 16.

Fig. 3. Recognition rate vs. dimension of reduced space on the Yale database. (a) 4 Train and (b) 5 Train.
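The rectangle-noise contamination used in Sections 4.2 and 4.4 can be reproduced with a sketch like the following (our own minimal implementation, assuming NumPy; the text specifies only the block size, the white/black dot values, and the random location, so the generator details are ours):

```python
import numpy as np

def add_rectangle_noise(img, size, rng):
    # Overwrite a randomly located size x size block of a gray-scale
    # image with white (255) or black (0) dots, as in the occlusion
    # experiments of Sections 4.2 and 4.4.
    h, w = img.shape
    top = rng.integers(0, h - size + 1)    # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    out = img.copy()
    out[top:top + size, left:left + size] = rng.choice(
        [0, 255], size=(size, size))
    return out

rng = np.random.default_rng(0)
face = np.full((64, 64), 128, dtype=np.uint8)   # stand-in for a Yale image
noisy = add_rectangle_noise(face, 20, rng)
```

Any pixel the noise touched is either 0 or 255, and the rest of the image is untouched, so 20% (or 50%) of the training set can be contaminated by applying the function to a random subset of the images.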
Fig. 5. The first ten basis vectors calculated by (a) PCA-L2, (b) PCA-L1 (c) PCA-L1S, (d) PCA-L1/AR using the polluted Yale database.
Table 3
Comparison of recognition rates for the different methods on AR database.
Gui-Fu Lu received the B.S. degree in 1997 from Hefei University of Technology, P.R. China, the M.S. degree in 2004 from Hangzhou Institute of Electronics Engineering, and the Ph.D. degree in 2012 from Nanjing University of Science and Technology, P.R. China. Since 2004, he has been teaching in the School of Computer Science and Information, Anhui Polytechnic University, Wuhu, Anhui, China. His research interests include computer vision, digital image processing and pattern recognition. E-mail: luguifu_jsj@163.com.
Jian Zou received the M.S. degree in applied mathematics from the Department of Mathematics of Nanjing University of Information Science & Technology, Nanjing, China, in 2006. He received the Ph.D. degree in 2013 from Nanjing University of Science and Technology, P.R. China. His scientific interests are in the fields of pattern recognition, manifold learning and information statistics.
Yong Wang received the B.S. and M.S. degrees in computer science from Anhui University of Technology and Science, Wuhu, Anhui, China, in 2001 and 2007, respectively. Currently, he is with the School of Computer Science and Information, Anhui Polytechnic University, Wuhu, Anhui, China. His research interests include software engineering and machine learning.
Zhongqun Wang is a professor in the School of Management Engineering, Anhui Polytechnic University, Wuhu, Anhui, China. His research interests include software
engineering and machine learning.