
Analysis of Two Partial-Least-Squares Algorithms for Multivariate Calibration
ROLF MANNE
Department of Chemistry, University of Bergen, N-5007 Bergen (Norway)
Abstract
Manne, R., 1987. Analysis of two partial-least-squares algorithms for multivariate calibration. Chemometrics and Intelligent Laboratory Systems, 2: 187-197.
Two algorithms for multivariate calibration are analysed in terms of standard linear
regression theory. The matrix inversion problem of linear regression is shown to be solved
by transformations to a bidiagonal form in PLS1 and to a triangular form in PLS2. PLS1
gives results identical with the bidiagonalization algorithm by Golub and Kahan, similar to
the method of conjugate gradients. The general efficiency of the algorithms is discussed.
1 INTRODUCTION
Partial least squares (PLS) is the name of a set of algorithms developed by Wold for use in
econometrics [1,2]. They have in common that no a priori assumptions are made about the model
structure, a fact which has given rise to the name soft modelling for the PLS approach. Instead,
estimates of reliability may be made using the jack-knife or cross-validation (for a review of these
techniques, see ref. 3). Although such reliability estimates seem to be essential in the description
of the PLS approach [4], they are not considered here.
The PLS approach has been used in chemometrics for extracting chemical information from
complex spectra which contain interference effects from other factors (noise) than those of primary
interest [4-10]. This problem can also be solved by using more or less standard least-squares
methods, provided that the collinearity problem is considered [11]. What is required by these
methods is a proper method for calculating generalized inverses of matrices. The PLS approach,
however, is so far only described through its algorithms, and appears to have an intuitive
character. There has been some confusion about what the PLS algorithms do, what their
underlying mathematics are, and how they relate to other formulations of linear regression theory.
The purpose of this paper is to place the two PLS algorithms that have been used in
chemometrics in relation to a more conventional description of linear regression. The two PLS
algorithms are called, in the terminology of chemometrics, PLS1 and PLS2 [6]. PLS2 considers the case when several chemical variables are to be fitted to the spectrum and has PLS1, which fits one such variable, as a special case. A third algorithm, which is equivalent to PLS1, has been suggested by Martens and Næs [7]. It differs from the latter in its orthogonality relations, but has not been used for actual calculations. It will be referred to here only in passing.
Some properties of the PLS1 algorithm have previously been described by Wold et al. [4], in particular its equivalence to a conjugate-gradient algorithm. This equivalence, however, has not been used further in the literature, e.g., in comparisons with other methods. Such comparisons have been made, particularly with the method of principal components regression (PCR), by Næs and Martens [12] and Helland [13]. A recent tutorial article by Geladi and Kowalski [14] also attempts to present the PLS algorithms in relation to PCR but, unfortunately, suffers from a certain lack of precision.
The outline of this paper is as follows. After the notation has been established, the solution
to the problem of linear regression is developed using the Moore-Penrose generalized inverse. The
bidiagonalization algorithm of Golub and Kahan [15] is sketched and shown to be equivalent to
the PLS1 algorithm. Properties of this solution are discussed.
The Ulvik workshop made it clear that within chemistry and the geosciences there is growing interest in the methods of multivariate statistics but, at the same time, the theoretical background of workers in the field is highly variable. With this situation in mind, an attempt has been made to make the presentation reasonably self-contained.
2 NOTATION
The notation used for the PLS method varies from publication to publication. Further confusion is caused by the use of different conventions for normalization. In the following, we arrange the measurements of the calibration set in a matrix X = {X_{ij}}, where each row contains the measurements for a given sample and each column the measurements for a given variable. The number of samples (or rows) is given as n and the number of variables (or columns) as p. With the experimental situation in mind, we shall call the p measurements for sample i a spectrum, which we denote by x_i. For each sample in the calibration set there is, in addition, a chemical variable y_i, and these values are collected in a column vector y. In the prediction step the spectrum x of a new sample is used to predict the value ŷ for the sample.
We use bold-face capital letters for matrices and bold-face lower-case letters for vectors. The transpose of vectors and matrices is denoted by a prime, e.g., y' and X'. A scalar product of two (column) vectors is thus written (a'b). Whenever possible, scalar quantities obtained from vector or matrix multiplication are enclosed in parentheses. The Euclidean norm of a column vector a is written ||a|| = (a'a)^{1/2}. The Kronecker delta, \delta_{ij}, is used to describe orthogonality relations. It takes the values 1 for i = j and 0 for i ≠ j.
3 CALIBRATION AND THE LEAST-SQUARES METHOD
The relationship between a set of spectra X and the known values y of the chemical variable is
assumed to be
y_i = b_0 + \sum_{j=1}^{p} X_{ij} b_j + noise    (i = 1, 2, ..., n)    (1)
If all variables are measured relative to their averages, the natural estimate of b_0 is zero, and in this sense b_0 may be eliminated from eq. 1. We thus assume that the zero points of the variables y, x_1, x_2, ..., x_p are chosen so that

\sum_{i=1}^{n} y_i = 0  and  \sum_{i=1}^{n} X_{ij} = 0    (j = 1, ..., p)    (2)
Eq. 1 may therefore be written as

y_i = \sum_{j=1}^{p} X_{ij} b_j    (3)

or, in matrix notation,

y = Xb    (4)
For the estimation of b, eq. 4 in general does not have an exact solution. In standard least squares one instead minimizes the residual error ||e||^2 defined by the relationship

e = y - Xb    (5)

This leads to the normal equations

X'Xb = X'y    (6)

which reduce to eq. 4 for non-singular square matrices X. Provided that the inverse of X'X exists, the solution may be written as

b = (X'X)^{-1} X'y    (7)
If the inverse does not exist, there will be non-zero vectors c_j which fulfil

X'Xc_j = 0    (8)

Then, if b is a solution to eq. 6, so is b + \sum_j \alpha_j c_j (\alpha_j arbitrary scalars). This may be expressed by substituting for (X'X)^{-1} any generalized inverse of X'X. A particular such generalized inverse is chosen as follows. Write X as a product of three matrices:

X = URW'    (9)
where U and W are orthogonal and of dimensions n × a and p × a, respectively, R is of dimension a × a and non-singular, and a is the rank of the matrix X. That is,

U'U = W'W = 1    (10)

where 1 is the unit matrix of dimension a × a. A generalized inverse may be written:

X^+ = WR^{-1}U'    (11)
Truncations are commonly introduced by choosing the dimension of R equal to r, smaller than the rank a of X. As written (no truncation implied), X^+ fulfils not only the defining condition for a generalized inverse, i.e.,

XX^+X = X    (12)

but also

X^+XX^+ = X^+;   XX^+ = (XX^+)';   (X^+X)' = X^+X    (13)
which define the Moore-Penrose generalized inverse (see, e.g., ref. 17). Insertion of eq. 9 into eq. 6 gives

WW'b = WR^{-1}U'y    (14)
It may be shown [17] that the Moore-Penrose generalized inverse gives the minimum-norm solution to the least-squares problem, i.e.,

b = WW'b = X^+y = WR^{-1}U'y    (15)

is the solution which minimizes (b'b) in the case that X'X is singular and eq. 6 has multiple solutions.
The orthogonal decomposition 9 gives, together with eq. 15, a simple expression for the residual error:

e = y - Xb = (1 - URW'WR^{-1}U')y = (1 - UU')y    (16)
There are many degrees of freedom in the orthogonal decomposition 9. From the computational point of view, it is important that the inverse R^{-1} is simple to calculate. This may be achieved, e.g., with R diagonal or triangular. For a right triangular matrix one has R_{ij} = 0 for i > j. In that case, from the definition of the inverse it follows that

\sum_{k<j} (R^{-1})_{ik} R_{kj} + (R^{-1})_{ij} R_{jj} = \delta_{ij}    (17)
or

(R^{-1})_{ij} = (\delta_{ij} - \sum_{k<j} (R^{-1})_{ik} R_{kj}) / R_{jj}    (18)
A bidiagonal matrix is a special case of a triangular matrix with R_{ij} = 0 except for i = j and either i = j - 1 (right bidiagonal) or i = j + 1 (left bidiagonal). From successive application of eq. 18 it follows that the inverse of a right triangular matrix (including the bidiagonal case) is itself right triangular.
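To make eq. 18 concrete, the back-substitution can be written out in a few lines of code. The sketch below is not part of the original paper; it uses plain numpy with hypothetical names, applies eq. 18 column by column, and also illustrates the statement above that the inverse of a right triangular (here right bidiagonal) matrix is itself right triangular.

```python
import numpy as np

def right_triangular_inverse(R):
    """Invert a right (upper) triangular matrix column by column using eq. 18."""
    a = R.shape[0]
    R_inv = np.zeros((a, a))
    for j in range(a):                      # work through the columns in order
        for i in range(j + 1):              # only elements with i <= j can be non-zero
            delta_ij = 1.0 if i == j else 0.0
            s = sum(R_inv[i, k] * R[k, j] for k in range(i, j))   # the sum over k < j
            R_inv[i, j] = (delta_ij - s) / R[j, j]
    return R_inv

# A right bidiagonal example: its inverse comes out right triangular, as stated above.
R = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 0.5],
              [0.0, 0.0, 1.5]])
assert np.allclose(right_triangular_inverse(R), np.linalg.inv(R))
```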
With a diagonal matrix R, i.e., with R_{ij} = 0 for i ≠ j, the decomposition 9 is known as the singular-value decomposition. The singular values, which are chosen to be ≥ 0, are the diagonal elements R_{ii}. Another decomposition of interest for matrix inversion is the QR decomposition, with R right triangular and W = 1, the unit matrix.
4 BIDIAGONALIZATION AND PLS1
Wold et al. [4] mention that the transformation step of the PLS1 algorithm is equivalent to a
conjugate gradient algorithm given by Paige and Saunders [18]. This algorithm, which is applied
to X, however, is not the general conjugate gradient algorithm, which requires a symmetric X
[19,20], but a bidiagonalization algorithm developed by Golub and Kahan [15] and called Bidiag2
by Paige and Saunders [18]. The detailed properties of Bidiag2, which are in our view the key to the understanding of the PLS1 algorithm, have so far not been exploited in the literature.
The Bidiag2 algorithm gives a decomposition of the type described in eq. 9. It can be described as follows: given w_1 = X'y/||X'y|| and u_1 = Xw_1/||Xw_1||, one obtains successive column vectors of the matrices U and W through

w_i = k_i [X'u_{i-1} - w_{i-1}(w'_{i-1} X'u_{i-1})]    (19)

u_i = k'_i [Xw_i - u_{i-1}(u'_{i-1} Xw_i)]    (20)

where k_i and k'_i are appropriate normalization constants.
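A minimal numpy sketch of the Bidiag2 recursion, eqs. 19 and 20, started from w_1 and u_1 as defined above (an illustrative transcription rather than code from the paper; r is assumed to be at most the rank of X):

```python
import numpy as np

def bidiag2(X, y, r):
    """Golub-Kahan Bidiag2 started from w1 = X'y/||X'y||; returns W and U (eqs. 19-20)."""
    n, p = X.shape
    W, U = np.zeros((p, r)), np.zeros((n, r))
    w = X.T @ y
    w /= np.linalg.norm(w)
    u = X @ w
    u /= np.linalg.norm(u)
    W[:, 0], U[:, 0] = w, u
    for i in range(1, r):
        w = X.T @ u - w * (w @ X.T @ u)      # eq. 19, before normalization
        w /= np.linalg.norm(w)
        u = X @ w - u * (u @ X @ w)          # eq. 20, before normalization
        u /= np.linalg.norm(u)
        W[:, i], U[:, i] = w, u
    return W, U
```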
Eqs. 19 and 20 may be obtained from the general expressions

Xw_j = \sum_{i=1}^{j} u_i (u'_i X w_j)    (21)

X'u_i = \sum_{j=1}^{i+1} w_j (w'_j X' u_i)    (22)

which define the vectors u_j and w_{i+1}, respectively.
The following properties are easily shown: (i) the sets {u_i} and {w_i} are orthonormal, (ii) eq. 21 gives (u'_i X w_j) = 0 for i > j and (iii) eq. 22 gives (u'_i X w_j) = 0 for j > i + 1. The decomposition of X according to eq. 9 therefore makes R_{ij} = (u'_i X w_j) a right bidiagonal matrix. Eqs. 21 and 22 may therefore be reformulated as

Xw_i = u_i (u'_i X w_i) + u_{i-1} (u'_{i-1} X w_i)    (23)

X'u_i = w_{i+1} (w'_{i+1} X' u_i) + w_i (w'_i X' u_i)    (24)

which are equivalent to eqs. 20 and 19, respectively.
There is a close relationship between the present bidiagonalization scheme and the Lanczos iteration scheme for making a symmetric matrix tridiagonal [16]. The latter is obtained by multiplying the two bidiagonalization equations together, e.g.,

w_{i+1} = N[X'Xw_i - w_i (w'_i X'X w_i) - w_{i-1} (w'_{i-1} X'X w_i)]    (25)

where N is a normalization constant. A similar equation applies to u_{i+1}. In the Lanczos basis {w_i} the matrix X'X is tridiagonal. It is further obvious from eq. 25 that the vectors {w_i} represent a Schmidt orthogonalization of the Krylov sequence {k_i; k_i = (X'X)^{i-1} w_1}. The relationship between the Krylov sequence and the vectors w_i generated by the PLS1 algorithm has previously been pointed out by Helland [13].
In order to show the equivalence between the bidiagonalization algorithm and PLS1, the calibration step of the latter can be written as follows:

Step 1: X_1 = X; y_1 = y
Step 2: For i = 1, 2, ..., r (r ≤ a) do Steps 2.1-2.4:
Step 2.1: w_i = X'_i y_i / ||X'_i y_i||
Step 2.2: u_i = X_i w_i / ||X_i w_i||
Step 2.3: X_{i+1} = (1 - u_i u'_i) X_i
Step 2.4: y_{i+1} = (1 - u_i u'_i) y_i
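Steps 1-2.4 above translate almost line by line into numpy. The following sketch is only an illustration (the function name and the choice to return W and U as matrices are ours); by the equivalence shown in this section, its w_i agree with those of the Bidiag2 sketch given earlier, apart from possible sign factors.

```python
import numpy as np

def pls1_calibration(X, y, r):
    """PLS1 calibration, Steps 1-2.4, with normalized score vectors u_i."""
    n, p = X.shape
    W, U = np.zeros((p, r)), np.zeros((n, r))
    Xi, yi = X.copy(), y.copy()              # Step 1
    for i in range(r):                       # Step 2
        w = Xi.T @ yi
        w /= np.linalg.norm(w)               # Step 2.1
        u = Xi @ w
        u /= np.linalg.norm(u)               # Step 2.2
        Xi = Xi - np.outer(u, u @ Xi)        # Step 2.3: (1 - u u') X_i
        yi = yi - u * (u @ yi)               # Step 2.4: (1 - u u') y_i
        W[:, i], U[:, i] = w, u
    return W, U
```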
The iteration in Step 2 may be continued until the rank of X_i equals zero (a = rank of X) or may be interrupted earlier using, e.g., a stopping criterion from cross-validation. The present description differs from that given, e.g., by Martens and Næs [7] only by the introduction of the normalized vectors {u_i}. The latter write

Step 2.2a: t_i = X_i w_i

but have equations for the other steps that give results identical with our formulation. The PLS decomposition of the X matrix into
X = TP    (26)

where T is orthogonal and of dimension n × a and P is of dimension a × p, therefore corresponds in our notation to

X = U(U'X) = U(RW')    (27)

The alternative algorithm of Martens and Næs gives a decomposition of X according to eq. 26, but with orthogonal rows in the matrix P [7]. These rows are, apart from a normalization factor, identical with those of the matrix W' defined here. In our notation this algorithm corresponds to the decomposition

X = (XW)W' = (UR)W'    (28)

The equivalence of this algorithm with the ordinary PLS1 algorithm has been pointed out by Næs and Martens [12] and shown in detail by Helland [13].
In the PLS1 algorithm the orthogonality of {u_i} follows from Steps 2.2 and 2.3 through induction. One may then reformulate Step 2.3 as

Step 2.3b: X_{i+1} = (1 - \sum_{k=1}^{i} u_k u'_k) X

Step 2.1 may be simplified into

Step 2.1b: w_i = X'_i y / ||X'_i y||
Step 2.4 is thus shown to be unnecessary. Sjöström et al. [5] updated y by

y_{i+1} = (1 - c_i u_i u'_i) y_i    (29)

with specifications for the parameter c_i, called the inner PLS relation. This updating therefore has no effect on the result. Later publications which use the inner PLS relationship [6,14] have, in fact, the same updating expression as used here in Step 2.4.
Comparing PLS1 with Bidiag2, one finds that Steps 2.2 and 2.3b of the former give the second step of the latter, eq. 20, since (u'_k X w_i) = 0 for k < i - 1. In order to show the equivalence of the first step of Bidiag2, eq. 19, and Step 2.1b, we write the latter as

w_{i+1} = X'_i (1 - u_i u'_i) y / ||X'_{i+1} y||    (30)
The orthogonality between w_i and w_{i+1} is then obtained from

w'_i w_{i+1} ∝ w'_i X'_i (1 - u_i u'_i) y ∝ u'_i (1 - u_i u'_i) y = 0    (31)
From

w_{i+1} ||X'_{i+1} y|| = X'_{i+1} y = X'_i (1 - u_i u'_i) y = w_i ||X'_i y|| - X'u_i (u'_i y)    (32)

one thus obtains, by multiplication with w'_i,

||X'_i y|| = (w'_i X'u_i)(u'_i y)    (33)

which, inserted back in eq. 30, gives the bidiagonalization equation 19 apart from, possibly, a sign factor.
5 MATRIX TRANSFORMATIONS - PLS2
The PLS2 algorithm was designed for the case when several chemical variable vectors y_k are to be fitted using the same measured spectra X. The chemical vectors are collected as columns in a matrix Y. The algorithm may be described as follows:

Step 1: X_1 = X; Y_1 = Y
Step 2: For i = 1, 2, ..., r (r ≤ a) do Steps 2.1-2.4:
Step 2.1: v_i = first column of Y_i
Step 2.2: Repeat until convergence of w_i:
Step 2.2.1: w_i = X'_i v_i / ||X'_i v_i||
Step 2.2.2: u_i = X_i w_i / ||X_i w_i||
Step 2.2.3: z_i = Y'_i u_i / ||Y'_i u_i||
Step 2.2.4: v_i = Y_i z_i / ||Y_i z_i||
Step 2.3: X_{i+1} = (1 - u_i u'_i) X_i
Step 2.4: Y_{i+1} = (1 - u_i u'_i) Y_i
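A numpy sketch of this loop (again illustrative rather than authoritative: the convergence test on w_i and its tolerance are our own choices, and the factors are returned as columns of W and U):

```python
import numpy as np

def pls2_calibration(X, Y, r, tol=1e-10):
    """PLS2 calibration loop, Steps 1-2.4, returning the orthonormal sets W and U."""
    n, p = X.shape
    W, U = np.zeros((p, r)), np.zeros((n, r))
    Xi, Yi = X.copy(), Y.copy()                       # Step 1
    for i in range(r):                                # Step 2
        v = Yi[:, 0]                                  # Step 2.1
        w_old = np.zeros(p)
        while True:                                   # Step 2.2
            w = Xi.T @ v
            w /= np.linalg.norm(w)                    # Step 2.2.1
            u = Xi @ w
            u /= np.linalg.norm(u)                    # Step 2.2.2
            z = Yi.T @ u
            z /= np.linalg.norm(z)                    # Step 2.2.3
            v = Yi @ z
            v /= np.linalg.norm(v)                    # Step 2.2.4
            if np.linalg.norm(w - w_old) < tol:
                break
            w_old = w
        Xi = Xi - np.outer(u, u @ Xi)                 # Step 2.3
        Yi = Yi - np.outer(u, u @ Yi)                 # Step 2.4
        W[:, i], U[:, i] = w, u
    return W, U
```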
As for PLS1, the iteration may be continued until the rank of X_i is zero or may be stopped earlier (r < a).

Several simplifications of the algorithm can be made. What is important in the present context, however, is that the u_i and w_i vectors form two orthonormal sets, and that the transformed matrix U'XW is right triangular. Also, it may be shown that PLS2 reduces to PLS1 when the Y matrix has only one column. In the latter case z_i from Step 2.2.3 has only a single element = 1. Convergence of w_i is then obtained in the first iteration.
The orthogonality u'_i u_j = \delta_{ij} follows from Steps 2.2.2 and 2.3 in the same way as in PLS1. This in turn makes it possible to show that eq. 21 is valid also for PLS2, which proves the triangularity of the transformed matrix R = U'XW. Finally, the orthogonality w'_i w_j = \delta_{ij} (i > j) may be established from

w'_i w_j ∝ v'_i X_i X'_j v_j = v'_i (1 - \sum_{k=j}^{i-1} u_k u'_k) X_j X'_j v_j ∝ v'_i (1 - \sum_{k=j}^{i-1} u_k u'_k) u_j = 0    (34)
The orthogonality relationships of PLS2 are well established in the literature.
6 MATRIX INVERSION AND PREDICTION
For a sample with the spectrum x (row vector) but with unknown chemical composition, the predicted value of the chemical variable is obtained as (cf. eqs. 4 and 14)

ŷ = (xb) = (xWR^{-1}U'y)    (35)

We write here this expression as for PLS1. The extension to PLS2, however, is trivial. Utilizing the fact that R^{-1} is right triangular both in PLS1 and in PLS2, the vector of regression coefficients can be written as

b = \sum_{ij} w_i (R^{-1})_{ij} (u'_j y) = \sum_j d_j (u'_j y)    (36)
where w_i and u_j are columns of the matrices W and U, respectively. The substitution

d_j = \sum_i w_i (R^{-1})_{ij}    (37)

or

\sum_{k≤j} d_k R_{kj} = w_j    (38)

d_j = (w_j - \sum_{k<j} d_k R_{kj}) / R_{jj}    (39)
simplifies for a bidiagonal matrix R to

d_j = (w_j - d_{j-1} R_{j-1,j}) / R_{jj}    (40)

This equation makes it possible to calculate b with little use of computer memory, especially since also

(u'_j y) = -(u'_{j-1} y) R_{j-1,j} / R_{jj}    (41)
The proof of eq. 41 is given in the next section.
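For the bidiagonal (PLS1) case, eqs. 36 and 40 can be combined into a short accumulation of the regression vector. The sketch below is not from the paper; it assumes W and U come from a calibration such as the pls1_calibration sketch given earlier (a hypothetical helper) and, for clarity, recomputes the needed elements of R from X, W and U instead of storing them during calibration.

```python
import numpy as np

def regression_vector(X, W, U, y):
    """Accumulate b = sum_j d_j (u_j'y) using the bidiagonal recursion of eq. 40."""
    r = W.shape[1]
    b = np.zeros(W.shape[0])
    d_prev = np.zeros(W.shape[0])
    for j in range(r):
        R_jj = U[:, j] @ X @ W[:, j]                   # diagonal element of R = U'XW
        if j == 0:
            d = W[:, 0] / R_jj
        else:
            R_j1j = U[:, j - 1] @ X @ W[:, j]          # superdiagonal element R_{j-1,j}
            d = (W[:, j] - d_prev * R_j1j) / R_jj      # eq. 40
        b += d * (U[:, j] @ y)                         # eq. 36
        d_prev = d
    return b
```

For a new (centred) spectrum x, the prediction of eq. 35 is then simply x @ b.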
In the following, we develop eq. 35 to yield the equations for prediction used in the PLS literature. A basic feature of these equations is that the regression vector b is never explicitly calculated. For this reason, the predicted value is written as

ŷ = (xb) = \sum_{j=1}^{r} (xd_j)(u'_j y) = \sum_{j=1}^{r} h_j (u'_j y)    (42)

where, from eq. 39, one obtains

h_j = [(xw_j) - \sum_{k<j} h_k R_{kj}] / R_{jj}    (43)
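A sketch of this prediction scheme for a single chemical variable y, with R recomputed from X, W and U for clarity (a hypothetical helper, not code from the paper); it only assumes that R = U'XW is right triangular, as established above for both PLS1 and PLS2.

```python
import numpy as np

def predict(x, X, W, U, y):
    """Predict y-hat for a centred spectrum x from eqs. 42 and 43, without forming b."""
    r = W.shape[1]
    R = U.T @ X @ W                 # right bidiagonal for PLS1, right triangular for PLS2
    h = np.zeros(r)
    y_hat = 0.0
    for j in range(r):
        h[j] = (x @ W[:, j] - h[:j] @ R[:j, j]) / R[j, j]    # eq. 43
        y_hat += h[j] * (U[:, j] @ y)                        # eq. 42
    return y_hat
```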
These expressions (eqs. 42 and 43) differ from those given by Martens and Næs [7] only in the normalization. The latter write

ŷ = \sum_{j=1}^{r} \hat{t}_j q_j    (44)

with (cf. PLS1 Step 2.2a)

q_j = (t'_j y)/(t'_j t_j) = (u'_j y)/(u'_j X w_j)    (45)

\hat{t}_j = (x - \sum_{k<j} \hat{t}_k p_k) w_j    (46)

and

p_k = t'_k X_k/(t'_k t_k) = u'_k X/(u'_k X w_k)    (47)

i.e.,

\hat{t}_j = (xw_j) - \sum_{k<j} \hat{t}_k R_{kj}/R_{kk}    (48)
From the identification \hat{t}_j = h_j R_{jj} in eq. 48, it follows that the prediction equation 44 of Martens and Næs [7] gives the same result as eq. 42.
Wold et al. [6] and Geladi and Kowalski [14] give expressions for prediction containing the inner PLS relationship. They write for the PLS1 case

ŷ = \sum_{j=1}^{r} c_j \hat{t}_j q_j    (49)

The need for the inner relationship comes from the normalization of q_j to |q_j| = 1 used by these authors. A detailed calculation shows that c_j q_j in eq. 49 equals q_j as defined by Martens and Næs [7]. Also for PLS2, the inner relationship of eq. 48 gives results identical with those of Martens and Næs [7].

On the other hand, we believe that Sjöström et al. [5] make an erroneous use of the inner PLS relationship in the prediction step. These authors make still another choice of normalization. Also in other respects their prediction equations indicate an early stage of development.
Geladi and Kowalski, in their tutorial [14], discuss a procedure for obtaining orthogonal t values which we, so far, have chosen to overlook. This procedure, however, is said to be not absolutely necessary. What is described is a scaling procedure for the vectors p_j, t_j and w_j so that

p_j^{new} = p_j/||p_j||,   t_j^{new} = t_j ||p_j||,   w_j^{new} = w_j ||p_j||    (50)

It should be noted that both before and after this scaling the vectors t_j are orthogonal. The replacements in eq. 50 also scale the values of c_j and \hat{t}_j appearing in eq. 49 but have no effect upon the predicted value ŷ.
7 STOPPING RULES FOR BIDIAGONALIZATION
It is well known that the Krylov sequence converges to the eigenvector with the numerically largest eigenvalue. This is the basis of the power method for finding eigenvalues of symmetric matrices. In the chemometrics literature the power method is often called NIPALS and attributed to Wold [21]. The method, however, is well established in numerical analysis and can, according to Householder [22], be traced back at least to the work of Müntz in 1913 [23]. A Krylov or, for that matter, a Lanczos basis thus becomes linearly dependent if the iteration is continued too far. In exact arithmetic this linear dependence would at some point give a vector w_{s+1} or u_s exactly equal to zero. Rounding errors prevent this from happening, however.
Like the Bidiag1 algorithm discussed in detail by Paige and Saunders [18], the Bidiag2 algorithm produces simple estimates of quantities that can be used in stopping rules for the iteration scheme. We write e_s for the residual error (5) obtained with a matrix R of s dimensions, i.e.,

e_s = (1 - \sum_{j=1}^{s} u_j u'_j) y = y_{s+1}    (51)

defined in Step 2.4, and

||y_{s+1}||^2 = ||y_s||^2 - (u'_s y)^2    (52)
As the iteration proceeds, ||y_s||^2 becomes smaller, and the stability of this quantity may be taken as a stopping criterion. Using Step 2.1 and eq. 21 we write

(y'Xw_s) = ||X'y||(w'_1 w_s) = (y'u_s)(u'_s X w_s) + (y'u_{s-1})(u'_{s-1} X w_s)    (53)
Since (w'_1 w_s) = 0 for s ≠ 1, one obtains the iteration equation (s > 1)

(y'u_s) = -(y'u_{s-1})(u'_{s-1} X w_s)/(u'_s X w_s)    (54)

which may be used to evaluate the residual error (eq. 52). As mentioned above (eq. 41), one may also use eq. 54 in the evaluation of regression coefficients.
Another quantity of interest is the normalization integral ||X'e_s|| = ||X'_{s+1} y||, which appears in the denominator of Step 2.1 of PLS1. When this quantity approaches zero the iteration scheme becomes unstable. The equation

||X'_{s+1} y|| = ||X'_s y||(u'_s X w_{s+1})/(u'_s X w_s)    (55)

may be derived from eq. 30. The use of eqs. 52 and 55 and further criteria for stopping was discussed by Paige and Saunders [18]. These criteria are simple to evaluate and relate directly to the numerical properties of the PLS1 iteration scheme. For this reason, they may be of advantage as a complement to the cross-validation currently used.
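A sketch of how these quantities might be monitored inside the PLS1 loop given earlier (the tolerances and the exact form of the tests are arbitrary placeholders, not prescriptions from the paper or from ref. 18):

```python
import numpy as np

def pls1_with_stopping(X, y, r_max, tol=1e-4):
    """PLS1 loop that monitors the stopping quantities of eqs. 52 and 55 (a sketch)."""
    Xi, yi = X.copy(), y.copy()
    y_norm2 = float(yi @ yi)                 # ||y_1||^2
    g0 = np.linalg.norm(X.T @ y)             # ||X_1'y||, reference for a relative test
    factors = []
    for s in range(r_max):
        g = Xi.T @ yi                        # X_s'y, the numerator of Step 2.1
        g_norm = np.linalg.norm(g)
        if g_norm < tol * g0:                # normalization integral near zero: unstable
            break
        w = g / g_norm
        u = Xi @ w
        u /= np.linalg.norm(u)
        drop = float(u @ yi) ** 2            # (u_s'y)^2 in eq. 52
        if drop < tol * y_norm2:             # ||y_s||^2 has stabilized
            break
        factors.append((w, u))
        y_norm2 -= drop                      # eq. 52: ||y_{s+1}||^2 = ||y_s||^2 - (u_s'y)^2
        Xi = Xi - np.outer(u, u @ Xi)        # Step 2.3
        yi = yi - u * (u @ yi)               # Step 2.4
    return factors
```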
8 COMPARISON OF BIDIAG2/PLS1 AND PCR/SINGULAR-VALUE DECOMPOSITION
It is frequently stated that with the same number of factors w_i, PLS1 has a better predictive ability than principal components regression (PCR)/singular-value decomposition. This may be understood by expanding PLS1 results in terms of the factors of the singular-value decomposition (= principal components). We write the latter as

X = \sum_i g_i d_i f'_i    (56)
and obtain

X'X = \sum_i d_i^2 f_i f'_i    (57)
The first PLS1 factor becomes

w_1 = X'y/||X'y|| = \sum_i f_i d_i (g'_i y)/||X'y|| = \sum_i f_i c_i    (58)
(no contribution from vectors f_i with d_i = 0). Application of the Lanczos equation (25) expanded according to eq. 57 yields

w_2 ∝ X'Xw_1 - w_1 (w'_1 X'X w_1) = \sum_i f_i c_i [d_i^2 - (w'_1 X'X w_1)]    (59)
Partial summations over functions f_i with degenerate eigenvalues d_i give the same results for w_2 as for w_1. One may therefore show that the number of linearly independent terms in the sequence {w_i} is no greater than the number of eigenvectors with distinct eigenvalues contributing to the expansion of w_1. Almost degenerate or clustered eigenvalues coupled with finite numerical accuracy may make the expansion even shorter in practice. Compare this with PCR, where all eigenvectors with large eigenvalues are used irrespective of their degeneracy. These properties of PLS1 have been pointed out by Næs and Martens [12] and by Helland [13]. They are also well established in the literature on the conjugate gradient method (e.g., ref. 20).
The exclusion of d_i = 0 not only for all w_j but also for all u_j is the main advantage of Bidiag2 over the Bidiag1 algorithm used by Paige and Saunders in their LSQR algorithm [18]. The latter starts the bidiagonalization with u_1 = y_1, w_1 = X'y, and obtains a left bidiagonal matrix along similar lines as Bidiag2. The two algorithms generate the same set of vectors {w_i}, but Bidiag1 runs into singularity problems for least-squares problems, which, however, are solved by the application of the QR algorithm. The resulting right bidiagonal matrix is the same as that obtained directly from Bidiag2. We have not found any obvious advantage of this algorithm over the direct use of Bidiag2 as in PLS1.
In PCR the stopping or truncation criterion is usually the magnitude of the eigenvalue d_i. As discussed by Jolliffe [24], this may lead to omission of vectors f_i which are important for reducing the residual error ||e_s||^2, eq. 51.
On the other hand, the Bidiag2/PLS1 algorithms or, equivalently, the Lanczos algorithm do not favour only the principal components with large eigenvalues d_i. Instead, it is our own experience from eigenvalue calculations with the Lanczos algorithm in the initial tridiagonalization step [25] that convergence is first reached for eigenvalues at the ends of the spectrum. That means in the present context that principal components with large and small eigenvalues are favoured over those with eigenvalues in the middle. The open question is the extent to which the principal components with small eigenvalues represent noise and therefore should be excluded from model building. Without additional information about the data there is no simple solution to this problem.
9 UNDERSTANDING PLS2
In this section we consider the triangularization algorithm in PLS2. At convergence, the iteration loop contained in Step 2.2 of the algorithm (see Matrix transformations - PLS2) leads to the eigenvalue relationships

X'_i Y_i Y'_i X_i w_i = k_i^2 w_i    (60)

Y'_i X_i X'_i Y_i z_i = k_i^2 z_i    (61)
where k_i^2, which are the numerically largest eigenvalues, may be evaluated from the product of normalization integrals in Steps 2.2.1-2.2.4. We interpret these relationships as principal component relationships of the matrix X'_i Y_i. For each matrix one thus obtains the principal component with largest variance. As mentioned before, the vectors w_i obtained in this way are mutually orthogonal, but the vectors z_i are not.
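This interpretation can be checked numerically: at convergence, the w_i of Step 2.2 satisfies eq. 60 with k_i^2 equal to the largest eigenvalue of X'_i Y_i Y'_i X_i. The sketch below uses random data of hypothetical size and a fixed, generous iteration count in place of a convergence test; it is an illustration only, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
Xi = rng.standard_normal((20, 8))     # hypothetical deflated spectral block X_i
Yi = rng.standard_normal((20, 3))     # hypothetical deflated chemical block Y_i

# The inner loop of Steps 2.2.1-2.2.4, iterated many times instead of testing convergence
v = Yi[:, 0]
for _ in range(5000):
    w = Xi.T @ v
    w /= np.linalg.norm(w)
    u = Xi @ w
    u /= np.linalg.norm(u)
    z = Yi.T @ u
    z /= np.linalg.norm(z)
    v = Yi @ z
    v /= np.linalg.norm(v)

# Eq. 60: the Rayleigh quotient of the converged w equals the largest eigenvalue k_i^2
M = Xi.T @ Yi @ Yi.T @ Xi
k2 = w @ M @ w
assert np.isclose(k2, np.linalg.eigvalsh(M)[-1])   # eigvalsh sorts eigenvalues ascending
```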
The matrix X'_i Y_i may be simplified to X'Y_i. Hence for each iteration, those parts of the column vectors of Y which overlap with u_i = X_i w_i/||X_i w_i|| are removed. Eventually, a u_i is obtained that has zero (or a small) overlap with all the columns of Y, and the iteration has to be stopped.
As for PLS1, it is of interest to relate the vectors w_i to the singular-value decomposition of X, eq. 56. We write eq. 60 as

X'A_i X w_i = k_i^2 w_i    (62)

where A_i = Y_i Y'_i (cf. the simplification X'_i Y_i = X'Y_i noted above).
Inserting eq. 56 and

w_i = \sum_j f_j c_{ji}    (63)

we obtain

\sum_j f_j (d_j g'_j A_i X w_i) = k_i^2 \sum_j f_j c_{ji}    (64)
or, using the orthogonality of the f_j's,

k_i^2 c_{ji} = d_j (g'_j A_i X w_i)    (65)

From k_i > 0 and d_j = 0 it follows that c_{ji} = 0.
Hence there is no contribution to w_i from a vector f_j with the singular value d_j = 0. In this way, we are assured that the vectors w_i of PLS2 form a subspace of the space spanned by {f_j; d_j ≠ 0}. The transformed matrix R = U'XW is therefore invertible.
10 DISCUSSION
The first result of this study is that both PLS algorithms, as given by Martens and Næs [7], yield the ordinary least-squares solution for invertible matrices X'X. The algorithms correspond to standard methods for inverting matrices or solving systems of linear equations, and the various steps of these methods are identified in the PLS algorithms. This result is likely to be known to those who know the method in detail. However, as parts of the PLS literature are obscure, and as even recent descriptions of the algorithms in refereed publications contain errors, it is felt necessary to make this statement. There are, however, no reasons to believe that the errors mentioned carry over into current computer codes.
The close relationship with conjugate gradient techniques makes it possible to speculate about the computational utility of PLS methods relative to other methods of linear regression. As pointed out also by Wold et al. [4], the matrix transformations of Bidiag2 are computationally simpler than those of the original PLS1 method. Further, in the prediction step some saving would be possible using the equations given here. For problems of moderate size this saving will not, however, be large. For small matrices a still faster procedure for bi- or tridiagonalization is Householder's method. Savings by using this method would be important for both matrix inversion and matrix diagonalization (principal components regression). The real saving with methods of the conjugate gradient type discussed here is for large and sparse matrices where the elements can only be accessed in a fixed order.
On the other hand, with present technology neither matrix inversion nor matrix diagonalization is particularly difficult, even on a small computer. The cost of obtaining high-quality chemical data for the calibration is likely to be much higher than the cost of computing. This puts a limit on the amount of effort one may want to invest in program refinement.
Compared with principal components regression/singular-value decomposition it is clear that PLS1/Bidiag2 manages with fewer latent vectors. Like PCR, the PLS methods avoid exact linear dependences, i.e., the zero eigenvalues of the X'X matrix. On the other hand, there is room for uncertainty in how PLS treats approximate linear dependences, i.e., small positive eigenvalues of X'X. Is it desirable to include such eigenvalues irrespective of the data considered? Detailed studies of this problem in a PCR procedure might lead to a cut-off criterion where the smallness of the eigenvalue is compared with the importance of the eigenvector for reducing the residual error.
The points where the PLS algorithms depart most from standard regression methods are the use of latent vectors (PLS factors) instead of regression coefficients in the prediction step, and the fact that the matrix inversion of standard regression methods is actually performed anew for each prediction sample. As is clear from the present work and also from that of Helland [13], the latter procedure is by no means a requirement. Once the latent vectors are obtained they may be combined into regression coefficients (eq. 36), i.e., into one vector giving the same predicted value as obtained with several PLS or PCR vectors. A possible use of the PLS factors would then be for the detection of outliers among samples supplied for prediction. For this purpose, a regression vector is insufficient as it spans only one dimension. On the other hand, there seems to be no guarantee that the space spanned by the PLS vectors is more suitable for this purpose than that spanned by principal components.
It seems as if the PLS2 method has few numerical or computational advantages, both relative to PLS1/Bidiag2 performed for each dependent variable y and relative to PCR. The power method of extracting eigenvalues, although simple to program, is inefficient, especially for near-degenerate eigenvalues. In contrast to principal components analysis, the PLS2 eigenvalue problem changes from iteration to iteration, which makes the saving small if matrix diagonalization is used instead. As long as the number of dependent variables is relatively small, the use of PLS1 for each dependent variable may well be worth the effort.
In conclusion, it can be stated that the PLS1 algorithm provides one solution to the calibration
problem using collinear data. This solution has a number of attractive features, some of which
have not yet been exploited. It is an open question, however, whether this method is the optimal
solution to the problem or not. For an answer one would have to consider the structure of the
input data in greater detail than has been done so far.
ACKNOWLEDGEMENTS
Numerous discussions with Olav M. Kvalheim are gratefully acknowledged. Thanks are also
due to John Birks, Inge Helland, Terje V. Karstang, H.J.H. MacFie, Harald Martens and an
unnamed referee for valuable comments.
References
[1] H. Wold, Soft modelling. The basic design and some extensions, in K. Jöreskog and H. Wold (Editors), Systems under Indirect Observation, North-Holland, Amsterdam, 1982, Vol. II, pp. 1-54.
[2] H. Wold, Partial least squares, in S. Kotz and N.L. Johnson (Editors), Encyclopedia of Statistical Sciences, Vol. 6, Wiley, New York, 1985, pp. 581-591.
[3] B. Efron and G. Gong, A leisurely look at the bootstrap, jackknife and cross-validation, The
American Statistician, 37(1983)37-48
[4] S. Wold, A. Ruhe, H. Wold and W.J. Dunn III, The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses, SIAM Journal on Scientific and Statistical Computing, 5(1984)735-743.
[5] M. Sjöström, S. Wold, W. Lindberg, J.-A. Persson and H. Martens, A multivariate calibration problem in analytical chemistry solved by partial least-squares models in latent variables, Analytica Chimica Acta, 150(1983)61-70.
[6] S. Wold, C. Albano, W.J. Dunn III, K. Esbensen, S. Hellberg, E. Johansson and M. Sjöström, Pattern recognition: finding and using regularities in multivariate data, in H. Martens and H. Russwurm, Jr. (Editors), Food Research and Data Analysis, Applied Science Publishers, London, 1983, pp. 147-188.
[7] H. Martens and T. Næs, Multivariate calibration by data compression, in H.A. Martens,
Multivariate Calibration. Quantitative Interpretation of Non-selective Chemical Data, Dr.
techn. thesis, Technical University of Norway, Trondheim, 1985, pp. 167-286; K. Norris
and P.C. Williams (Editors), Near Infrared Technology in Agricultural and Food Industries,
American Cereal Association, St. Paul, MN, in press.
[8] T.V. Karstang and R. Eastgate, Multivariate calibration of an X-ray diffractometer by partial
least squares regression, Chemometrics and Intelligent Laboratory Systems, 2(1987)209-219.
[9] A.A. Christy, R.A. Velapoldi, T.V. Karstang, O.M. Kvalheim, E. Sletten and N. Telnæs, Multivariate calibration of diffuse reflectance infrared spectra of coals as an alternative to rank determination by vitrinite reflectance, Chemometrics and Intelligent Laboratory Systems, 2(1987)221-232.
[10] K.H. Esbensen and H. Martens, Predicting oil-well permeability and porosity from wire-line
geophysical logs-a feasibility study using partial least squares regression, Chemometrics and
Intelligent Laboratory Systems, 2(1987)221-232.
[11] P.J. Brown, Multivariate calibration, Journal of the Royal Statistical Society, Series B,
44(1982) 287-321
[12] T. Næs and H. Martens, Comparison of prediction methods for multicollinear data, Communications in Statistics - Simulations and Computations, 14(1985)545-576.
[13] I.S. Helland, On the structure of partial least squares regression, Reports from the Department of Mathematics and Statistics, Agricultural University of Norway, 21(1986)44.
[14] P. Geladi and B.R. Kowalski, Partial least-squares regression: A tutorial, Analytica Chimica
Acta, 185(1986)1-17
[15] G.H. Golub and W. Kahan, Calculating the singular values and pseudo-inverse of a matrix,
SIAM Journal of Numerical Analysis, Series B, 2(1965)205-224
[16] C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear
differential and integral operators, Journal of Research of the National Bureau of Standards,
45(1950)255-282.
[17] C.R. Rao and S.K. Mitra, Generalized Inverse of Matrices and its Applications, Wiley, New
York, 1971
[18] C.C. Paige and M.A. Saunders, A bidiagonalization algorithm for sparse linear equations
and least squares problems, ACM Transactions on Mathematical Software, 8(1982)43-71.
[19] M.R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems,
Journal of Research of the National Bureau of Standards, 49(1952)409-436.
[20] M.R. Hestenes, Conjugate Direction Methods in Optimization, Springer, New York, 1980, p. 247 ff.
[21] H. Wold, Estimation of principal components and related models by iterative least squares,
in P.R. Krishnaiah (Editor), Multivariate Analysis, Academic Press, New York, 1966, pp.391-
420.
[22] A.S. Householder, The Theory of Matrices in Numerical Analysis, Blaisdell Publ. Corp., New
York, 1964, reprinted by Dover Publications, New York, 1975, p. 198.
[23] C.R. Müntz, Solution directe de l'équation séculaire et des problèmes analogues transcendentes, Comptes Rendus de l'Académie des Sciences, Paris, 156(1913)443-46.
[24] I.T. Jolliffe, A note on the use of principal components in regression, Applied Statistics,
31(1982)300-303.
[25] R. Arneberg, J. Müller and R. Manne, Configuration interaction calculations of satellite structure in photoelectron spectra of H2O, Chemical Physics, 64(1982)249-258.