
Pattern Recognition 60 (2016) 901–907


L1-norm-based principal component analysis with adaptive regularization

Gui-Fu Lu*, Jian Zou, Yong Wang, Zhongqun Wang
School of Computer and Information, Anhui Polytechnic University, Wuhu, Anhui 241000, China

* Corresponding author. E-mail address: luguifu_tougao@163.com (G.-F. Lu).
http://dx.doi.org/10.1016/j.patcog.2016.07.014

article info

Article history:
Received 19 November 2014
Received in revised form 7 July 2016
Accepted 7 July 2016
Available online 8 July 2016

Keywords:
Principal component analysis
Dimensionality reduction
L1-norm
Trace lasso
L2-norm

abstract

Recently, some L1-norm-based principal component analysis algorithms with sparsity have been proposed for robust dimensionality reduction and for processing multivariate data. The L1-norm regularization used in these methods encounters stability problems when there are various correlation structures among the data. To overcome this drawback, in this paper we propose a novel L1-norm-based principal component analysis with adaptive regularization (PCA-L1/AR) which can consider sparsity and correlation simultaneously. PCA-L1/AR is adaptive to the correlation structure of the training samples and can benefit from both the L2-norm and the L1-norm. An iterative procedure for solving PCA-L1/AR is also proposed. Experimental results on several data sets demonstrate the effectiveness of the proposed method.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Dimensionality reduction [1] is of great importance in many applications, e.g., pattern recognition, text categorization and computer vision, where the dimensionality of the data is often very high. It can reduce the computational complexity and discover the intrinsic manifold structure of high-dimensional data. Principal component analysis (PCA) [1,2] is perhaps the most famous dimensionality reduction technique due to its simplicity and effectiveness. Generally, PCA finds a set of projection vectors such that the variance of the projected data points is maximized. By projecting the data onto this set of projection vectors, the data structure in the original input space can be discovered.

Every projection vector obtained by PCA is a nonzero linear combination of all the data, and thus each variable in a data point is regarded as equally important in dimensionality reduction. Hence, the extracted features obtained by PCA are difficult to interpret. The original variables in the high-dimensional data, however, have meaningful physical interpretations in many applications. In this case, the interpretability of the obtained projection vectors can be enhanced if they involve more zero entries.

As a consequence, sparse PCA (SPCA) [3], which reformulates the conventional PCA as a regression-type optimization problem with the elastic net regularization, has been proposed and can achieve good experimental results. Several different implementations of SPCA have also been proposed [4,5]. SPCA has gained success in many applications for extracting interpretable principal components. By using structured regularization, Jenatton et al. [6] generalized SPCA to structured sparse PCA.

However, the objective functions of the above-mentioned PCA and SPCA are both based on the L2-norm, which makes these methods sensitive to noise and outliers, since the square operation in the L2-norm exaggerates the effect of noise and outliers. It is generally believed that the L1-norm is more robust to noise and outliers than the L2-norm. Hence, in recent years, some L1-norm-based principal component analysis methods have been developed in the literature [7-17]. Due to the use of the absolute value operator in the L1-norm, however, it is much more difficult to obtain the optimal projection vectors of L1-norm-based PCA than those of L2-norm-based PCA.

Recently, Kwak [11] proposed the PCA-L1 method, which is based on the L1-norm and is rotationally invariant. A greedy iterative algorithm for solving PCA-L1 is also presented in [11]. Experimental results on several data sets show the effectiveness of PCA-L1. Nie et al. [13] proposed a non-greedy procedure to calculate the projection vectors of PCA-L1. Nie's method can obtain all the projection vectors of PCA-L1 simultaneously, while the original PCA-L1 method obtains the projection vectors one by one. Li et al. [8] proposed the L1-norm-based 2DPCA algorithm (2DPCA-L1), which is a robust version of the 2DPCA method [18]. Further, Pang et al. [12] proposed the L1-norm-based tensor PCA method (TPCA-L1). Motivated by Nie's non-greedy PCA-L1, Wang et al. [16] and Cao et al. [17], respectively, proposed non-greedy versions of 2DPCA-L1 and TPCA-L1.


In order to improve the interpretation of the basis vectors of PCA-L1, Meng et al. [14] proposed a sparse PCA-L1 method called PCA-L1 with sparsity (PCA-L1S). Not only is the objective function of PCA-L1S based on the L1-norm, but the basis vectors are also penalized by the L1-norm. Similarly, Wang et al. [7] proposed 2DPCA-L1 with sparsity (2DPCA-L1S).

The L1-norm regularization works well on high-dimensional, low-correlation data [19-22]. However, many real data sets exhibit various correlation structures, and in this situation the L1-norm regularization encounters instability problems. Recently, trace Lasso [19,23-25] has been proposed to remedy this instability problem. Trace Lasso is adaptive and interpolates between the L1-norm and the L2-norm.

In this paper, we use trace Lasso to regularize the basis vectors of PCA-L1 and propose a novel L1-norm-based principal component analysis, called PCA-L1 with adaptive regularization (PCA-L1/AR). PCA-L1/AR, which can consider sparsity and correlation simultaneously, is adaptive to the correlation structure and can benefit from both the L2-norm and the L1-norm. We also present an iterative algorithm for solving PCA-L1/AR. Experiments on several publicly available data sets confirm the effectiveness of the proposed method.

The remainder of the paper is organized as follows. In Section 2, we briefly review the PCA and PCA-L1 techniques. In Section 3, we propose the PCA-L1/AR approach, including its objective function and algorithmic procedure. The experimental results are reported in Section 4. Finally, we conclude the paper in Section 5.

2. Outline of PCA, PCA-L1 and PCA-L1S

Let X = {x1, x2, ..., xn} ∈ R^(d×n) be a d-dimensional sample set with n elements. Without loss of generality, we assume that X has been centered. The classical PCA method (termed PCA-L2) aims to maximize the variance of the data points in the projected subspace. The optimal projection vector w ∈ R^d can be obtained by solving the following criterion function:

max_{w^T w = 1} w^T S_t w    (1)

where S_t = (1/n) X X^T is the covariance matrix. The optimal subspace of PCA is spanned by the eigenvectors of S_t corresponding to the largest m eigenvalues. Eq. (1) can be reformulated as

max_{w^T w = 1} (1/n) ‖w^T X‖_2^2    (2)

where ‖·‖_2 denotes the L2-norm of a vector.

Obviously, the conventional PCA is based on the L2-norm. In [11], Kwak proposed PCA-L1, where the L2-norm in PCA-L2 is replaced with the L1-norm, which improves the robustness to noise and outliers. PCA-L1 aims to maximize the following objective function:

max_{w^T w = 1} ‖w^T X‖_1    (3)

where ‖·‖_1 denotes the L1-norm of a vector. Kwak proposed a greedy iterative procedure to compute w since it is difficult to solve Eq. (3) directly.

In order to improve the interpretation of the basis vectors of PCA-L1, Meng et al. [14] proposed PCA-L1S, where the L1-norm is not only used in the objective function but also used to regularize the basis vector of PCA-L1. PCA-L1S aims to solve the following optimization problem:

max ‖w^T X‖_1, subject to w^T w = 1, ‖w‖_1 < k    (4)

where k is a positive integer. An efficient iterative procedure to solve Eq. (4) is also presented in [14].

3. L1-norm-based principal component analysis with adaptive regularization (PCA-L1/AR)

3.1. Problem formulation

In this subsection, we present our proposed L1-norm-based principal component analysis with adaptive regularization (PCA-L1/AR).

The L1-norm regularization will encounter stability problems if the data samples exhibit strong correlations [19]. In this paper, we impose the trace norm on w, inspired by [19]. Specifically, we integrate the trace norm into the objective function of PCA-L1, and the objective function of PCA-L1/AR is formulated as

arg max_w ‖X^T w‖_1 − λ‖X^T Diag(w)‖_*    (5)

or

arg min_w λ‖X^T Diag(w)‖_* − ‖X^T w‖_1    (6)

where ‖·‖_* denotes the trace norm of a matrix, i.e., the sum of its singular values, and Diag(·) converts a vector into a diagonal matrix. In Section 3.2, we will introduce how to solve the objective function of PCA-L1/AR, i.e., Eq. (6). The main difference between the trace norm ‖X^T Diag(w)‖_* and other norms, e.g. the L1-norm and the L2-norm, is that ‖X^T Diag(w)‖_* contains the data sample matrix X. ‖X^T Diag(w)‖_* is adaptive to the correlation structure and interpolates between the L1-norm and the L2-norm [19]. If X X^T = I, i.e., the data are uncorrelated, then we have

‖X^T Diag(w)‖_* = tr[((X^T Diag(w))^T (X^T Diag(w)))^(1/2)] = tr[(Diag(w) X X^T Diag(w))^(1/2)] = ‖w‖_1    (7)

Thus, the trace norm regularization, i.e., ‖X^T Diag(w)‖_*, is equal to the L1-norm. If X = 1x1, i.e., the data are highly correlated, where x1 denotes the first row of X and 1 ∈ R^d is a column vector with one at each entry, then we have

‖X^T Diag(w)‖_* = ‖(x1)^T w^T‖_* = ‖x1‖_2 ‖w‖_2 = ‖w‖_2    (8)

Thus, in this case ‖X^T Diag(w)‖_* is equal to the L2-norm. For other cases, trace Lasso interpolates between the L1-norm and the L2-norm depending on the correlations [19], i.e.,

‖w‖_2 ≤ ‖X^T Diag(w)‖_* ≤ ‖w‖_1    (9)

This means that trace Lasso can benefit from both the L2-norm and the L1-norm according to the correlations among the data.
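As a quick numerical check of the interpolation property in Eqs. (7)-(9), the following NumPy sketch compares ‖X^T Diag(w)‖_* with ‖w‖_1 and ‖w‖_2 for uncorrelated, fully correlated and intermediate data. It is our own illustration rather than anything from the paper; the helper name trace_lasso, the toy matrices and the random seed are all assumptions.

import numpy as np

def trace_lasso(X, w):
    # trace (nuclear) norm of X^T Diag(w): the sum of its singular values
    return np.linalg.norm(X.T @ np.diag(w), ord='nuc')

d = n = 4
w = np.array([0.5, -1.0, 2.0, 0.0])

# Uncorrelated case (X X^T = I): the regularizer reduces to the L1-norm, Eq. (7)
X_uncorr = np.eye(d)
print(trace_lasso(X_uncorr, w), np.abs(w).sum())        # both 3.5

# Fully correlated case (X = 1 x1 with a unit-norm row x1): reduces to the L2-norm, Eq. (8)
x1 = np.ones(n) / np.sqrt(n)
X_corr = np.outer(np.ones(d), x1)
print(trace_lasso(X_corr, w), np.linalg.norm(w))        # both about 2.29

# Intermediate correlation: the value falls between the two bounds of Eq. (9)
rng = np.random.default_rng(0)
X_mid = rng.standard_normal((d, n))
X_mid /= np.linalg.norm(X_mid, axis=1, keepdims=True)   # unit-norm rows
val = trace_lasso(X_mid, w)
print(np.linalg.norm(w) <= val <= np.abs(w).sum())      # True

The inequality printed in the last line is exactly Eq. (9); it holds whenever the rows of X have unit L2-norm, which is the normalization implicitly assumed in Eq. (8).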

3.2. Optimization procedure for PCA-L1/AR

Motivated by the optimization method used in [26], we use the augmented Lagrange multiplier (ALM) method [27] to solve Eq. (6). In [27], the ALM method is introduced for solving the following constrained optimization problem:

min f(X)  s.t.  h(X) = 0    (10)

where f: R^n → R and h: R^n → R^m. We can define the augmented Lagrangian function to solve Eq. (10):

L(X, Y, μ) = f(X) + trace(Y^T h(X)) + (μ/2)‖h(X)‖_F^2    (11)

where trace(·) denotes the trace operator of a matrix, ‖·‖_F denotes the Frobenius norm of a matrix, Y is the Lagrange multiplier and μ > 0 is the penalty parameter.

We first convert Eq. (6) into the following equivalent problem:

arg min_{w, J, e} λ‖J‖_* − ‖e‖_1  s.t.  e = X^T w,  J = X^T Diag(w)    (12)

Then, motivated by the ALM method, we can use the following augmented Lagrangian function to solve Eq. (12):

L(w, e, J) = λ‖J‖_* − ‖e‖_1 + y1^T (e − X^T w) + (μ1/2)‖e − X^T w‖_2^2 + trace(Y2^T (J − X^T Diag(w))) + (μ2/2)‖J − X^T Diag(w)‖_F^2    (13)

where y1 and Y2 are Lagrange multipliers and μ1 > 0 and μ2 > 0 are the penalty parameters. Now Eq. (13) is unconstrained and can be optimized with respect to w, e and J in turn by considering the other variables as constants.

First, we compute J when w and e are fixed. Then Eq. (13) is equivalent to solving the following optimization problem:

J* = arg min_J L(J, w, e)
   = arg min_J λ‖J‖_* + trace(Y2^T J) + (μ2/2)‖J − X^T Diag(w)‖_F^2
   = arg min_J (λ/μ2)‖J‖_* + (1/2)‖J − (X^T Diag(w) − (1/μ2) Y2)‖_F^2    (14)

From [28], we know that Eq. (14) has a closed-form solution via the singular value thresholding (SVT) operator. The optimal J* is computed as

J* = SVT_{λ/μ2}(X^T Diag(w) − (1/μ2) Y2)    (15)

where SVT_δ(A) = U Diag((σ_i − δ)_+) V^T, the singular value decomposition (SVD) of A is given by A = U Diag({σ_i}_{1≤i≤r}) V^T, and t_+ = max(0, t).

Second, we compute e when w and J are fixed. Then Eq. (13) is equivalent to solving the following optimization problem:

e* = arg max_e ‖e‖_1 − y1^T e − (μ1/2)‖e − X^T w‖_2^2
   = arg max_e (1/μ1)‖e‖_1 − (1/2)‖e − (X^T w − (1/μ1) y1)‖_2^2    (16)

To solve the maximization problem of Eq. (16), we will prove the following theorem.

Theorem 1. The solution to the following maximization problem

arg max_x λ‖x‖_1 − (1/2)(x − y)^2    (17)

is

x = sgn(y)(|y| + λ) if y ≠ 0, and x = ±λ if y = 0,    (18)

where λ > 0 and sgn(·) is the sign function.

Proof. To maximize Eq. (17), the value of ‖x‖_1 should be as large as possible and the value of (x − y)^2 should be as small as possible. If y is greater than 0, then the value of (x − y)^2 can be minimized only when x is greater than 0. That is, the optimal x should satisfy x > 0 when y > 0. Differentiating Eq. (17) with respect to x and setting the derivative to zero, we have

λ − x + y = 0    (19)

Then we have

x = λ + y = sgn(y)(|y| + λ)    (20)

Similarly, if y is smaller than 0, then the value of (x − y)^2 can be minimized only when x is smaller than 0. That is, the optimal x should satisfy x < 0 when y < 0. Taking the partial derivative of Eq. (17) with respect to x and setting it to zero, we have

−λ − x + y = 0    (21)

Then we have

x = y − λ = sgn(y)(|y| + λ)    (22)

Finally, suppose y = 0. Differentiating Eq. (17) with respect to x and setting the derivative to zero, we have

x = ±λ    (23)  ☐

Then Eq. (16), which is separable at the element level, can be solved by using Theorem 1. Let z = X^T w − (1/μ1) y1; then the optimal e* is given element-wise by

e_i* = sgn(z_i)(|z_i| + 1/μ1) if z_i ≠ 0, and e_i* = ±1/μ1 if z_i = 0,    (24)

where e_i, i = 1, ..., n, is the i-th entry of e and z_i, i = 1, ..., n, is the i-th entry of z.
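To make the two closed-form updates above concrete, the following NumPy sketch (our own illustration, not the authors' released code; the function and variable names are ours) implements the SVT step of Eq. (15) and the element-wise rule of Eq. (24).

import numpy as np

def svt(A, tau):
    # Singular value thresholding: U Diag((sigma_i - tau)_+) V^T, cf. Eq. (15)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def update_J(X, w, Y2, mu2, lam):
    # Step 1 of the ALM iteration: J* = SVT_{lam/mu2}(X^T Diag(w) - Y2/mu2), Eq. (15)
    return svt(X.T @ np.diag(w) - Y2 / mu2, lam / mu2)

def update_e(X, w, y1, mu1):
    # Step 2: element-wise solution of Eq. (16) via Theorem 1, Eq. (24)
    z = X.T @ w - y1 / mu1
    s = np.where(z >= 0.0, 1.0, -1.0)   # ties at z_i = 0 take the +1/mu1 branch
    return s * (np.abs(z) + 1.0 / mu1)

Because the L1 term in Eq. (16) is maximized rather than minimized, the rule of Eq. (24) pushes every entry away from zero by 1/μ1 instead of shrinking it toward zero as ordinary soft-thresholding would; in floating-point practice the tie-breaking branch for z_i = 0 is almost never reached.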

Third, we compute w when e and J are fixed. Then Eq. (13) is equivalent to solving the following optimization problem:

w* = arg min_w −y1^T X^T w + (μ1/2)‖X^T w − e‖_2^2 − trace(Y2^T X^T Diag(w)) + (μ2/2)‖J − X^T Diag(w)‖_F^2
   = arg min_w −y1^T X^T w + (μ1/2)(w^T X X^T w − 2 e^T X^T w) − diag(Y2^T X^T)^T w + (μ2/2) w^T Diag(diag(X X^T)) w − μ2 diag(J^T X^T)^T w
   = arg min_w −(X y1 + μ1 X e + diag(Y2^T X^T) + μ2 diag(J^T X^T))^T w + w^T ((μ1/2) X X^T + (μ2/2) Diag(diag(X X^T))) w    (25)

where diag(·) denotes the vector of diagonal entries of a matrix.

Eq. (25) can be easily solved by

w = (μ1 X X^T + μ2 Diag(diag(X X^T)))^(−1) (X y1 + μ1 X e + diag(Y2^T X^T) + μ2 diag(J^T X^T))    (26)

It is very time-consuming to directly compute the inverse of the matrix μ1 X X^T + μ2 Diag(diag(X X^T)), since it is a d × d matrix and d is often very large for high-dimensional data. By using the Sherman–Morrison–Woodbury formula [29] for matrix manipulations, i.e.,

(A + UV)^(−1) = A^(−1) − A^(−1) U (I + V A^(−1) U)^(−1) V A^(−1)    (27)

we have

(μ1 X X^T + μ2 Diag(diag(X X^T)))^(−1) = A^(−1) − μ1 A^(−1) X (I + μ1 X^T A^(−1) X)^(−1) X^T A^(−1)    (28)

where A = μ2 Diag(diag(X X^T)). Note that I + μ1 X^T A^(−1) X is an n × n matrix, which is much smaller than the d × d matrix μ1 X X^T + μ2 Diag(diag(X X^T)) when we deal with high-dimensional data where d is much larger than n. Therefore the computational cost is dramatically reduced.

Fourth, we update the Lagrange multipliers and penalty parameters using the following equations:

y1 = y1 + μ1(e − X^T w)    (29)

Y2 = Y2 + μ2(J − X^T Diag(w))    (30)

μ1 = ρ μ1    (31)

and

μ2 = ρ μ2    (32)

where ρ > 1 is a constant.

Finally, the whole algorithm for solving Eq. (13) is summarized in Algorithm 1.

Algorithm 1. Solving Eq. (13) via ALM.

Input: data matrix X, parameter λ.
Initialize: J, w, e, y1, Y2, μ1, μ2 and ρ.
Output: the optimal projection vector w.
while not converged do
  1) update J by J* = arg min_J (λ/μ2)‖J‖_* + (1/2)‖J − (X^T Diag(w) − (1/μ2) Y2)‖_F^2, i.e., by Eq. (15);
  2) update e by e* = arg max_e (1/μ1)‖e‖_1 − (1/2)‖e − (X^T w − (1/μ1) y1)‖_2^2, i.e., by Eq. (24);
  3) update w by solving Eq. (25), i.e., by Eq. (26) together with the Woodbury identity of Eq. (28);
  4) update the Lagrange multipliers and penalty parameters by
     y1 = y1 + μ1(e − X^T w)
     Y2 = Y2 + μ2(J − X^T Diag(w))
     μ1 = ρ μ1
     μ2 = ρ μ2
end while

3.3. Extension to multiple basis vectors

By using Algorithm 1, one optimal projection vector can be obtained. One projection vector, however, is generally not enough for feature extraction. In this section, we introduce how to obtain multiple projection vectors. If we have learned the first r − 1 (r ≥ 1) projection vectors w1, w2, ..., w_(r−1), then we can use the deflated data to compute the r-th projection vector w_r:

(x_i)_deflated = x_i − Σ_{l=1}^{r−1} w_l (w_l^T x_i)    (33)

After the deflated x_i are obtained, we use them to form the data matrix X. Then we can use Algorithm 1 to compute w_r.
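For concreteness, the following NumPy sketch assembles the four updates into the loop of Algorithm 1 and adds the deflation step of Eq. (33) for multiple basis vectors. It is our own illustrative implementation, not the authors' MATLAB code: the initialization from the leading PCA-L2 direction, the default values of λ, μ1, μ2 and ρ, the fixed iteration count and the final normalization of w are assumptions on our part.

import numpy as np

def pca_l1_ar(X, lam=0.1, mu1=1e-2, mu2=1e-2, rho=1.1, n_iter=200):
    # One projection vector of PCA-L1/AR via the ALM scheme of Algorithm 1.
    # X is d x n and assumed centered; a convergence test is replaced by a
    # fixed number of iterations to keep the sketch short.
    d, n = X.shape
    w = np.linalg.svd(X, full_matrices=False)[0][:, 0]   # start from the PCA-L2 direction
    e, J = X.T @ w, X.T @ np.diag(w)
    y1, Y2 = np.zeros(n), np.zeros((n, d))
    for _ in range(n_iter):
        # step 1, Eq. (15): singular value thresholding of X^T Diag(w) - Y2/mu2
        U, s, Vt = np.linalg.svd(X.T @ np.diag(w) - Y2 / mu2, full_matrices=False)
        J = U @ np.diag(np.maximum(s - lam / mu2, 0.0)) @ Vt
        # step 2, Eq. (24): element-wise rule from Theorem 1
        z = X.T @ w - y1 / mu1
        e = np.where(z >= 0.0, 1.0, -1.0) * (np.abs(z) + 1.0 / mu1)
        # step 3, Eq. (26) solved via the Woodbury identity of Eq. (28)
        b = X @ y1 + mu1 * (X @ e) + np.diag(Y2.T @ X.T) + mu2 * np.diag(J.T @ X.T)
        A_inv = np.diag(1.0 / (mu2 * np.sum(X * X, axis=1)))   # A = mu2 Diag(diag(X X^T))
        t = np.linalg.solve(np.eye(n) + mu1 * X.T @ A_inv @ X, X.T @ (A_inv @ b))
        w = A_inv @ b - mu1 * (A_inv @ (X @ t))
        # step 4, Eqs. (29)-(32): multipliers and penalty parameters
        y1 = y1 + mu1 * (e - X.T @ w)
        Y2 = Y2 + mu2 * (J - X.T @ np.diag(w))
        mu1, mu2 = rho * mu1, rho * mu2
    return w / np.linalg.norm(w)   # normalize the direction for use as a projection vector

def pca_l1_ar_multi(X, r, **kwargs):
    # r projection vectors via the deflation of Eq. (33): x_i <- x_i - w (w^T x_i)
    W, Xd = [], X.copy()
    for _ in range(r):
        w = pca_l1_ar(Xd, **kwargs)
        W.append(w)
        Xd = Xd - np.outer(w, w @ Xd)
    return np.column_stack(W)

Calling pca_l1_ar_multi(X, m) on a centered d × n training matrix returns a d × m basis whose columns can be used in the same way as the PCA-L2 eigenvectors to project samples, as in the experiments of Section 4.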

4. Experiments and results

In this section, we compare our proposed PCA-L1/AR with PCA-L2, PCA-L1 [11] and PCA-L1S [14] on the Yale, AR and COIL-20 image databases. The source code of PCA-L1S was downloaded from the authors' homepage. In the experiments we use the nearest neighbor classifier (1-NN) for classification, which assigns a test sample to the class of its closest neighbor among the training samples. The programming environment is MATLAB 2008.

4.1. Experiments on artificial datasets

In this experiment, we first generate a data set X = {x1, x2, ..., xn} ∈ R^(d×n) of n = 100 two-dimensional observation points drawn from the Gaussian distribution N(0, [10 8; 8 10]). In Fig. 1, we plot the nominal data with black "*". For reference purposes, we also plot the first principal component learned by PCA-L2 on the artificial dataset without any outlier.

Then, we corrupt the nominal data with four outlier measurements, i.e., [−33, 4], [−28, 7], [−30, 12] and [−25, 16], depicted by red "o" in Fig. 1. Finally, we use the PCA-L2, PCA-L1, PCA-L1S and PCA-L1/AR methods to extract the principal component from the above nominal data plus the four outlier measurements. In Fig. 1, we also depict the first principal component extracted by PCA-L2, PCA-L1, PCA-L1S and PCA-L1/AR, respectively.

Fig. 1. The first principal component learned by PCA-L2, PCA-L1, PCA-L1S, and PCA-L1/AR on the artificial dataset. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

From Fig. 1, we can see that the principal component learned by PCA-L2 on the artificial dataset with outliers deviates severely from the one learned by PCA-L2 on the dataset without outliers. However, the principal component learned by our proposed PCA-L1/AR deviates only slightly from the outlier-free PCA-L2 solution and is much closer to it than those of PCA-L1 and PCA-L1S, which indicates that PCA-L1/AR is more robust to outliers than the other methods.

4.2. Experiments on Yale face database

The Yale face database contains 165 gray-scale images of 15 individuals; each individual has 11 images. The images demonstrate variations in lighting condition and facial expression (normal, happy, sad, sleepy, surprised, and wink). In our experiments, each image in the Yale database was manually cropped and resized to 64 × 64. Some images of one person in the Yale database are shown in Fig. 2.

Fig. 2. Some face images in the Yale database.

In the experiments, we randomly choose i (i = 4 and 5) images of each person for training, and the remaining ones are used for testing. We repeat the procedure 10 times and report the average recognition rate and the standard deviation in Table 1. The plots of recognition rate vs. the dimension of the reduced space are shown in Fig. 3.

Table 1
Comparison of recognition rates (%) for the different methods on the Yale database.

Sample size  PCA-L2      PCA-L1      PCA-L1S     PCA-L1/AR
4            62.8 ± 3.5  66.3 ± 3.8  63.1 ± 3.0  69.3 ± 3.8
5            65.8 ± 4.8  69.1 ± 3.8  67.7 ± 4.7  74.8 ± 4.2

Fig. 3. Recognition rate vs. dimension of reduced space on the Yale database. (a) 4 Train and (b) 5 Train.

In order to further investigate the robustness of PCA-L1/AR, we conduct experiments on polluted images. We first intentionally contaminate 20% of the training samples with rectangle noise. The rectangle noise consists of white or black dots, its location in the face image is random, and its size is 20 × 20. Some face images with rectangle noise are shown in Fig. 4. Then, we randomly choose i (i = 4 and 5) images of each person for training, and the remaining ones are used for testing. Third, we repeat the procedure 10 times and report the average recognition rate and the standard deviation in Table 2.

Fig. 4. Some face images with occlusion in the Yale database.

Table 2
Comparison of recognition rates (%) for the different methods on the Yale database with contaminated images.

Sample size  PCA-L2      PCA-L1      PCA-L1S     PCA-L1/AR
4            58.4 ± 3.6  61.2 ± 3.7  59.9 ± 3.1  65.5 ± 3.8
5            65.0 ± 4.3  67.0 ± 4.0  66.2 ± 3.5  71.9 ± 4.2

For visual perception, we illustrate the first ten projection vectors of PCA-L2, PCA-L1, PCA-L1S, and PCA-L1/AR in Fig. 5. We can find that the basis vectors learned by the PCA-L1/AR method are more robust to noise than those of the other three methods.

Fig. 5. The first ten basis vectors calculated by (a) PCA-L2, (b) PCA-L1, (c) PCA-L1S, (d) PCA-L1/AR using the polluted Yale database.
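The rectangle-noise corruption and the 1-NN evaluation used in the Yale experiment above (and reused with a 16 × 16 patch and 50% contamination for COIL-20 in Section 4.4) can be sketched as follows. This is our own illustration: only the patch sizes and contamination fractions come from the text, while the function names, the random seed, the uniform choice of patch position and the 8-bit gray-level assumption are ours.

import numpy as np

def add_rectangle_noise(img, patch=20, rng=None):
    # Occlude one random patch x patch block with an all-white or all-black square,
    # mimicking the "rectangle noise" described above (assumes 8-bit gray images).
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    r = rng.integers(0, h - patch + 1)
    c = rng.integers(0, w - patch + 1)
    out = img.copy()
    out[r:r + patch, c:c + patch] = 255 if rng.random() < 0.5 else 0
    return out

def corrupt_training_set(train_imgs, fraction=0.2, patch=20, seed=0):
    # Contaminate a given fraction of the training images (20% for Yale, 50% for COIL-20).
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_imgs), size=int(fraction * len(train_imgs)), replace=False)
    out = [img.copy() for img in train_imgs]
    for i in idx:
        out[i] = add_rectangle_noise(out[i], patch, rng)
    return out

def nn_classify(W, train_X, train_y, test_X):
    # 1-NN in the reduced space: columns of train_X/test_X are vectorized images,
    # W (d x m) holds the learned projection vectors.
    P_train, P_test = train_X.T @ W, test_X.T @ W
    d2 = ((P_test[:, None, :] - P_train[None, :, :]) ** 2).sum(axis=-1)
    return train_y[np.argmin(d2, axis=1)]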

4.3. Experiments on AR face database

The AR face database contains over 4000 gray-scale face images of 126 people, including frontal views with different facial expressions, lighting conditions and occlusions. The face images of 120 individuals (26 images per person) were taken in two sessions. The images of these 120 persons (3120 images in total) are used in our experiments. All images were manually cropped and resized to 50 × 40. Some example images of one person are shown in Fig. 6.

Fig. 6. Images of one person in the AR database.

In the experiments, we randomly choose four images of each person for training, and the remaining ones are used for testing. We repeat the procedure 10 times and report the average recognition rate and the standard deviation in Table 3. The plots of recognition rate vs. the dimension of the reduced space are shown in Fig. 7.

Table 3
Comparison of recognition rates (%) for the different methods on the AR database.

Sample size  PCA-L2      PCA-L1      PCA-L1S     PCA-L1/AR
4            58.0 ± 1.0  58.5 ± 0.8  58.4 ± 1.0  62.5 ± 1.1

Fig. 7. Recognition rate vs. dimension of reduced space on the AR database.

4.4. Experiments on COIL-20 image database

The COIL-20 data set contains 1440 images of 20 objects. For each object, 72 images were captured against a black background from varying angles; the moving interval of the camera is five degrees. Each image is resized to 32 × 32 in our experiment.

In the experiments, we randomly choose ten images of each object for training, and the remaining ones are used for testing. To test the robustness of the proposed PCA-L1/AR against outliers, we randomly choose 50% of the training samples to be contaminated by rectangle noise. The rectangle noise consists of white or black dots, its location in the image is random, and its size is 16 × 16. Some images with or without rectangle noise are shown in Fig. 8. The procedure is repeated 10 times and the average recognition rates as well as the standard deviations are reported in Table 4.

Fig. 8. Some images with/without occlusion in the COIL-20 database.

Table 4
Comparison of recognition rates (%) for the different methods on the COIL-20 database.

Sample size  PCA-L2      PCA-L1      PCA-L1S     PCA-L1/AR
10           58.8 ± 1.8  59.8 ± 1.8  59.7 ± 1.7  65.4 ± 1.7

From the experimental results we find that PCA-L2 generally has lower classification rates than the other three L1-norm-based methods. The reason may be that PCA-L2 is based on the L2-norm, which is sensitive to noise and outliers, and the image samples contain noise such as variations in lighting, expression, pose and rotation. The experimental results also show that the L1-norm is less sensitive to the negative effects of noise and outliers.

The proposed PCA-L1/AR achieves the best classification performance in our experiments. The reason may be that we use the trace norm, which contains the data sample matrix X and is adaptive to the correlation structure, to regularize the basis vectors of PCA-L1. Our model can consider sparsity and correlation simultaneously, both of which have been demonstrated to be critical in pattern recognition problems.

5. Conclusions

In this paper, we propose a novel dimensionality reduction method, called PCA-L1 with adaptive regularization (PCA-L1/AR). PCA-L1/AR is adaptive to the correlation structure and considers sparsity and correlation simultaneously, since it uses trace Lasso, which interpolates between the L1-norm and the L2-norm, to regularize the basis vectors of PCA-L1. An iterative algorithm for solving PCA-L1/AR is also developed in this paper. The experimental results on several publicly available data sets confirm the effectiveness of the proposed PCA-L1/AR.

Acknowledgments

This research is supported by the NSFC of China (Nos. 61572033 and 71371012), the Natural Science Foundation of the Education Department of Anhui Province of China (No. KJ2015ZD08), the 2014 Program for Excellent Youth Talents in University, the Social Science and Humanity Foundation of the Ministry of Education of China (No. 13YJA630098), and the Anhui Provincial Natural Science Foundation (No. 1608085MF147). The authors would like to thank the anonymous reviewers and the editor for their helpful comments and suggestions to improve the quality of this paper.

References

[1] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, New York, 2000.
[2] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, Boston, USA, 1990.
[3] H. Zou, T. Hastie, R. Tibshirani, Sparse principal component analysis, J. Comput. Graph. Stat. 15 (2) (2006) 265–286.
[4] Z. Lu, Y. Zhang, An augmented Lagrangian approach for sparse principal component analysis, Math. Program. 135 (2012) 149–193.
[5] M. Journée, Y. Nesterov, P. Richtárik, R. Sepulchre, Generalized power method for sparse principal component analysis, J. Mach. Learn. Res. 11 (2010) 451–487.
[6] R. Jenatton, G. Obozinski, F. Bach, Structured sparse principal component analysis, in: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010, pp. 366–373.
[7] H. Wang, J. Wang, 2DPCA with L1-norm for simultaneously robust and sparse modelling, Neural Netw. 46 (10) (2013) 190–198.
[8] X. Li, Y. Pang, Y. Yuan, L1-norm-based 2DPCA, IEEE Trans. Syst. Man Cybern. B: Cybern. 40 (4) (2009) 1170–1175.
[9] C. Ding, D. Zhou, X. He, H. Zha, R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization, in: Proceedings of the 23rd International Conference on Machine Learning, June 2006, pp. 281–288.
[10] Q. Ke, T. Kanade, Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2005, pp. 1–8.
[11] N. Kwak, Principal component analysis based on L1-norm maximization, IEEE Trans. Pattern Anal. Mach. Intell. 30 (9) (2008) 1672–1680.
[12] Y. Pang, X. Li, Y. Yuan, Robust tensor analysis with L1-norm, IEEE Trans. Circuits Syst. Video Technol. 20 (2) (2010) 172–178.
[13] F. Nie, H. Huang, C. Ding, D. Luo, H. Wang, Principal component analysis with non-greedy L1-norm maximization, in: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, 2011, pp. 1–6.
[14] D. Meng, Q. Zhao, Z. Xu, Improve robustness of sparse PCA by L1-norm maximization, Pattern Recognit. 45 (1) (2012) 487–497.
[15] Q. Yu, R. Wang, X. Yang, B.N. Li, M. Yao, Diagonal principal component analysis with non-greedy L1-norm maximization for face recognition, Neurocomputing 171 (1) (2016) 57–62.
[16] R. Wang, F. Nie, X. Yang, F. Gao, M. Yao, Robust 2DPCA with non-greedy L1-norm maximization for image analysis, IEEE Trans. Cybern. 45 (5) (2015) 1108–1112.
[17] X. Cao, X. Wei, Y. Han, D. Lin, Robust face clustering via tensor decomposition, IEEE Trans. Cybern. 45 (11) (2015) 2546–2557.
[18] J. Yang, D. Zhang, A.F. Frangi, J.Y. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (1) (2004) 131–137.
[19] E. Grave, G. Obozinski, F. Bach, Trace Lasso: a trace norm regularization for correlated designs, in: Advances in Neural Information Processing Systems, 2011, pp. 2187–2195.
[20] P.J. Bickel, Y. Ritov, A.B. Tsybakov, Simultaneous analysis of Lasso and Dantzig selector, Ann. Stat. 37 (4) (2009) 1705–1732.
[21] P. Zhao, B. Yu, On model selection consistency of Lasso, J. Mach. Learn. Res. 7 (2006) 2541–2563.
[22] M.J. Wainwright, Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso), IEEE Trans. Inf. Theory 55 (5) (2009) 2183–2202.
[23] C. Lu, J. Feng, Z. Lin, S. Yan, Correlation adaptive subspace segmentation by trace Lasso, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, 2013, pp. 1345–1352.
[24] J. Wang, C. Lu, M. Wang, P. Li, S. Yan, X. Hu, Robust face recognition via adaptive sparse representation, IEEE Trans. Cybern. 44 (12) (2014) 2368–2378.
[25] J. Lai, X. Jiang, Supervised trace lasso for robust face recognition, in: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Chengdu, 2014, pp. 1–6.
[26] Z. Lin, M. Chen, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, UIUC Technical Report UILU-ENG-09-2215, 2009.
[27] D.P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, Academic Press, Boston, USA, 1982.
[28] J.-F. Cai, E.J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (4) (2010) 1956–1982.
[29] G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd ed., The Johns Hopkins University Press, Baltimore, Maryland, 1996.

Gui-Fu Lu received the B.S. degree in 1997 from Hefei University of Technology, P.R. China, the M.S. degree in 2004 from Hangzhou Institute of Electronics Engineering, and the Ph.D. degree in 2012 from Nanjing University of Science and Technology, P.R. China. Since 2004, he has been teaching in the School of Computer Science and Information, Anhui Polytechnic University, Wuhu, Anhui, China. His research interests include computer vision, digital image processing and pattern recognition. E-mail: luguifu_jsj@163.com.

Jian Zou received the M.S. degree in applied mathematics from the Department of Mathematics, Nanjing University of Information Science & Technology, Nanjing, China, in 2006, and the Ph.D. degree in 2013 from Nanjing University of Science and Technology, P.R. China. His scientific interests are in the fields of pattern recognition, manifold learning and information statistics.

Yong Wang received the B.S. and M.S. degrees in computer science from Anhui University of Technology and Science, Wuhu, Anhui, China, in 2001 and 2007, respectively. Currently, he is with the School of Computer Science and Information, Anhui Polytechnic University, Wuhu, Anhui, China. His research interests include software engineering and machine learning.

Zhongqun Wang is a professor in the School of Management Engineering, Anhui Polytechnic University, Wuhu, Anhui, China. His research interests include software
engineering and machine learning.
