
Multiobjective evolutionary algorithm NSGA-II for variables selection in multivariate calibration problems

Daniel Vitor de Lucena (dvlucena@gmail.com), Telma Woerle de Lima Soares (telma@inf.ufg.br),
Anderson da Silva Soares (anderson@inf.ufg.br), Clarimar José Coelho (clarimarc@gmail.com)
Instituto de Informática, UFG; Departamento de Computação, PUC-GO
Abstract
This paper proposes the use of the multiobjective genetic algorithm NSGA-II for variable selection in
multivariate calibration problems. NSGA-II is used to select variables for a Multiple Linear Regression
(MLR) model under two conflicting objectives: the prediction error and the number of variables used in
the MLR. The case study uses wheat data obtained by NIR spectrometry, with the goal of determining a
subset of variables that carries information about protein concentration. Results of traditional
multivariate calibration techniques, namely Partial Least Squares (PLS) and the Successive Projections
Algorithm (SPA) for MLR, are presented for comparison. The proposed approach obtained better results
than a mono-objective evolutionary algorithm and the traditional multivariate calibration techniques.
Keywords: Multivariate Calibration, Genetic Algorithm, Multiobjective Optimization

Introduction
The need to obtain relevant information about the
concentration of chemical species called analytes,
collected by analytical instruments, has stimulated
studies in chemometrics, defined as the application
of mathematical and statistical techniques to
chemical data analysis (Beebe, et al., 1998). Because
the analyte concentration is rarely provided directly
by the instrument (Nunes, 2008), chemometrics
uses calibration to extract this information through
regression models.
According to Martens (1989), calibration is the
process of constructing a mathematical model that
relates the output of an instrument to a certain
property of the sample, and prediction is the process
of using the model to forecast the properties of a
given sample from the instrument output. For
example, the absorbance at a given wavelength can
be related to the concentration of an analyte.
The term multivariate calibration refers to the
construction of a mathematical model that predicts
the value of a quantity of interest from measured
values of a set of explanatory variables. Well-known
techniques for building such regression models are
Multiple Linear Regression (MLR) (Martens, 1989),
Principal Component Regression (PCR) (Jolliffe,
1982) and Partial Least Squares Regression (PLS)
(Beebe, et al., 1998; Wold, et al., 2001; Martens &
Naes, 1989).
It is not always necessary to use all the data
collected from a sample during calibration, for
instance when only some features of the sample are
to be analyzed.
Selecting variables with information related to
these features of interest makes it possible to build
models that are more robust, simpler and easier to
interpret, and avoids processing irrelevant
information. Other problems found in calibration
are collinearity, where two or more variables carry
correlated information, and sensitivity to noise.
Both impair calibration efficiency and the
prediction of the sample compounds, particularly in
MLR (Martens & Naes, 1989; Draper & Smith,
1998).
One solution to collinear variables is to eliminate
them through selection (Guyon, 2003).
Evolutionary algorithms, in particular genetic
algorithms (GAs), are an option for this process. An
optimization algorithm such as an evolutionary
algorithm can be used to choose a robust subset of
variables with little redundancy and with
information related to the characteristics of interest.
This work studies the use of the multiobjective
genetic algorithm NSGA-II in the variable selection
process under two conflicting objectives:
minimizing the residual error between the
concentration predicted by the regression model
and the real protein concentration of the grain, and
reducing the number of selected variables, which
lowers the computational cost and simplifies the
calibration model.
Multivariate Calibration
Multiple Linear Regression
Regression analysis is a statistical methodology
for predicting the values of one or more response
(dependent) variables from a set of predictor
(independent) variables (Johnson & Wichern, 2002).
The classical multiple linear regression model is
given by:
$$Y = X\beta + \varepsilon \quad (1)$$
where X is the (n x p) data matrix of instrumental
responses, with n the number of samples and p the
number of variables of each sample, $\beta$ is the
(p x 1) vector of regression coefficients calculated
by least squares from the pseudo-inverse of X,
$\varepsilon$ is the (n x 1) residual error vector, and
Y is the (n x 1) vector with the values of the
property of interest obtained by a standard method.
Each element of the dependent vector Y is a linear
combination of the independent variables in the
data matrix X (Johnson & Wichern, 2002; Nunes,
2008).
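Model (1) in matrix notation, with a column of ones
added to X for the intercept term $\beta_0$:
$$
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} =
\begin{bmatrix}
1 & X_{11} & X_{12} & \cdots & X_{1p} \\
1 & X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & X_{n2} & \cdots & X_{np}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\quad (2)
$$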
The regression coefficients are estimated by least
squares:
$$\hat{\beta} = (X'X)^{-1}(X'Y) \quad (3)$$
and the regression coefficients $\beta$ are taken to
be equal to these estimates (Johnson & Wichern,
2002).
According to Rencher (2002), the estimated
response variable is the linear combination of the
data matrix X and the estimated regression
coefficients $\hat{\beta}$:
$$\hat{Y} = X\hat{\beta} \quad (4)$$
The residual error is the difference between the
reference response variable Y and the estimated
response variable $\hat{Y}$ (Rencher, 2002):
$$\varepsilon = Y - \hat{Y} \quad (5)$$
The Root Mean Square Error of Prediction
(RMSEP) evaluates how closely the concentration
predicted by the model approximates the expected
concentration. RMSEP is expressed by
$$RMSEP = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \quad (6)$$
where $y_i$ is the reference concentration of the i-th
sample and $\hat{y}_i$ is the i-th concentration
predicted by the regression model. This measure
helps in evaluating the performance of the
calibration model and allows us to choose the
models most suitable for prediction. The regression
parameters may be estimated with some noise,
because the data are measured with error and
because the parameters are estimated from data
samples (Varmuza & Filzmoser, 2000; Gemperline,
2006).
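As an illustration of equations (3) to (6), the following minimal Python sketch fits an MLR model by least squares and computes the RMSEP on a separate prediction set. It assumes NumPy and synthetic, noise-free data. The function names `fit_mlr`, `predict_mlr` and `rmsep` are hypothetical and are not taken from the Matlab implementation used in this work.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares coefficients for y ~ [1, X]; the intercept is beta[0]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    # lstsq solves the same problem as (X'X)^-1 X'y, but more stably.
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict_mlr(X, beta):
    """Estimated response, as in equation (4)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta

def rmsep(y_ref, y_pred):
    """Root mean square error of prediction, as in equation (6)."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_ref - y_pred) ** 2)))

# Synthetic example (assumed data, not the wheat spectra).
rng = np.random.default_rng(0)
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
X_cal = rng.normal(size=(40, 3))
y_cal = true_beta[0] + X_cal @ true_beta[1:]
X_pred = rng.normal(size=(10, 3))
y_pred_ref = true_beta[0] + X_pred @ true_beta[1:]

beta_hat = fit_mlr(X_cal, y_cal)
error = rmsep(y_pred_ref, predict_mlr(X_pred, beta_hat))
```

Because the synthetic data contain no noise, the estimated coefficients recover `true_beta` and the RMSEP is numerically zero; with real spectra the error would reflect both noise and model inadequacy.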
Multicollinearity Problem and Variable Selection
In statistics, the existence of linear correlation
between two or more independent variables in a
multiple regression model is called
multicollinearity. This problem may seriously
affect the reliability of the estimates of the model
coefficients and make the values obtained for the
response variable difficult to interpret (Alin, 2010;
Chong & Jun, 2005).
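One way to see why multicollinearity harms the coefficient estimates is to look at the conditioning of the matrix X'X that is inverted in equation (3). The sketch below, which assumes NumPy and synthetic data, compares the condition number for two independent variables with that for two nearly collinear variables. The helper `condition_number` is a hypothetical name used only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
# Nearly a copy of x1: the two columns carry almost the same information.
x2_collinear = x1 + 1e-3 * rng.normal(size=n)

def condition_number(*columns):
    """Condition number of X'X for the given columns of X."""
    X = np.column_stack(columns)
    return float(np.linalg.cond(X.T @ X))

cond_independent = condition_number(x1, x2_independent)
cond_collinear = condition_number(x1, x2_collinear)
```

A very large condition number means that small perturbations in the data (for example, measurement noise) can produce large changes in the estimated coefficients, which is the instability described above.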
In prediction problems where the regression model
has many variables, most of them may contribute
little or nothing to prediction accuracy. Selecting a
reduced set containing the variables that do
improve the regression model is therefore crucial,
but how many and which variables should compose
this subset? (Snedecor & Cochran, 1972; Hocking,
1976).
Defining a smaller set of independent explanatory
variables to be included in the final regression
model is a frequent problem in regression analysis
(Hocking, 1976).
The problem of determining an
appropriate equation based on a subset of the
original set of variables contains three basic
ingredients, namely (a) the computational
technique used to provide the information for the
analysis, (b) the criterion used to analyze the
variables and select a subset, if that is appropriate,
and (c) the estimation of the coefficients in the
equation (3) (Hocking, 1976).
According to Miller (1984) the reasons for
using only some of the available or possible
predictor variables include: a) to estimate or
predict at lower cost by reducing the number of
variables on which data are collected, b) to predict
accurately by eliminating uninformative variables,
c) to describe a multivariate data set
parsimoniously, d) to estimate regression
coefficients with small standard errors
(particularly when some of the predictors are
highly correlated).
The strategy proposed here for variable selection
in multiple linear regression is to use a genetic
algorithm to address the multicollinearity problem,
reduce cost by reducing the number of variables,
and minimize the residual errors.
Partial Least Squares Regression
Partial Least Squares (PLS) is a method
for constructing predictive models when the
factors are many and highly collinear. Note that
the emphasis is on predicting the responses and not
necessarily on trying to understand the underlying
relationship between the variables (Tobias, 1997).
PLS regression is a technique that generalizes and
combines features from principal component
analysis and multiple linear regression. The goal
of PLS regression is to predict Y from X and to
describe their common structure (Abdi, 2003).
Successive Projections Algorithm
The successive projections algorithm
(SPA) is a variable selection technique designed to
minimize collinearity problems in multiple linear
regression (MLR) (Galvão, 2008).
SPA comprises three main phases. The first
consists of projection operations carried out on the
matrix X of instrumental responses. These
projections are used to generate chains of variables
with successively more elements. Each element in
a chain is selected so as to show the least
collinearity with the previous ones. In the second
phase, the candidate subsets of variables are
evaluated according to their RMSEP predictive
performance in the MLR model. The last phase
consists of a variable elimination procedure aimed
at improving the parsimony of the model.
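The first SPA phase can be sketched as follows. This is a simplified illustration that builds a single chain from a given starting column, assuming NumPy; it does not cover the RMSEP-based evaluation or the elimination phase, and the function name `spa_chain` is hypothetical.

```python
import numpy as np

def spa_chain(X, start, length):
    """Build one chain of column indices by successive projections.

    At each step every column is projected onto the subspace orthogonal
    to the last selected column, and the column whose projection has the
    largest norm (least collinear with the chain so far) is selected.
    """
    Xp = np.asarray(X, dtype=float).copy()
    selected = [start]
    for _ in range(length - 1):
        ref = Xp[:, selected[-1]]
        # Remove from every column its component along the reference column.
        Xp = Xp - np.outer(ref, ref @ Xp) / (ref @ ref)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0  # never re-select a column already in the chain
        selected.append(int(np.argmax(norms)))
    return selected

# Toy matrix: column 1 is almost collinear with column 0, column 2 is not.
X_toy = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.01, 0.5],
                  [0.0, 0.0, 0.0]])
chain = spa_chain(X_toy, start=0, length=2)
```

Starting from column 0, the chain picks column 2 rather than column 1, because column 1 has almost no component orthogonal to column 0.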
The most recent results in the multivariate
calibration literature show that SPA-MLR achieves
better results in terms of RMSEP and parsimony
than the classical genetic algorithm and PLS
(Soares(a), Galvao Filho, Galvao, & Araujo, 2010;
Soares(b), Galvao Filho, Galvao, & Araujo, 2010).
Genetic Algorithm
Genetic Algorithms (GAs) were proposed by
Holland in the 1970s. He studied natural evolution
and regarded it as a simple and powerful process
that could be adapted to obtain efficient
computational solutions to optimization problems.
Their strength in this context is that GAs generally
produce appropriate solutions regardless of the
initial parameters (Goldberg, 1989). The main
distinguishing feature of GAs is the creation of
descendants through the recombination operator
called crossover (De Jong, 2006).
Another important feature of GAs is the use of
mutation and recombination operators to balance
two possibly conflicting objectives: preserving the
best solutions and exploring the search space. The
search process is therefore multidimensional,
preserving candidate solutions and inducing the
exchange of information between the explored
solutions (Michalewicz, 1996; Von Zuben, 2000).
According to Michalewicz (1996), the main steps
of a GA are:
1. During iteration gen, a GA maintains a
population of potential solutions
$P(gen) = \{x_1^{gen}, \ldots, x_n^{gen}\}$;
2. Each individual $x_i^{gen}$ is evaluated,
producing a fitness measure;
3. New individuals are generated from
individuals of the current population, which
are selected for reproduction by a process
that tends to choose individuals with higher
fitness;
4. Some individuals undergo changes by
recombination and mutation, forming new
potential solutions;
5. Among the old and new solutions
($\mu + \lambda$), the survivors are selected
for the next generation (gen + 1).
This process is repeated until a stopping condition
is satisfied. This condition can be an expected level
of solution quality or a maximum number of
iterations.
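The five steps and the stopping condition above can be sketched as a small single-objective GA. The sketch uses only the Python standard library and, as assumptions made purely for illustration, binary tournament selection, one-point crossover, bit-flip mutation and a toy fitness function (the number of ones in the chromosome). The name `run_ga` and all parameter values are hypothetical.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, generations=40, seed=0):
    rng = random.Random(seed)
    p_mut = 1.0 / n_genes
    # Step 1: population of potential solutions P(gen).
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):  # stop condition: maximum number of iterations
        # Step 2: evaluate each individual.
        scores = [fitness(ind) for ind in pop]

        # Step 3: selection that tends to choose fitter individuals.
        def tournament():
            a, b = rng.sample(range(pop_size), 2)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        # Step 4: recombination (one-point crossover) and mutation (bit flip).
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randint(1, n_genes - 1)
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)

        # Step 5: survivors chosen among old and new solutions (mu + lambda).
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return pop[0]

best = run_ga(fitness=sum, n_genes=16)
```

Because survivors are taken from the union of parents and children, the best fitness in the population never decreases from one generation to the next.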
In the context of multivariate calibration, the
population is the set of possible solutions, where
each individual is a candidate solution. Using a
binary representation for the chromosome, the
number of genes is the total number of variables
available for selection. Each gene determines
whether the corresponding variable is selected: a
value of 1 means the variable is selected and a
value of 0 means it is not.
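A binary chromosome of this kind can be decoded into a column mask and evaluated on the two objectives used in this study, the prediction error and the number of selected variables. The sketch below assumes NumPy, synthetic data, and an error computed on a separate validation set; the function name `evaluate` and the handling of an empty selection are assumptions made for illustration.

```python
import numpy as np

def evaluate(chromosome, X_cal, y_cal, X_val, y_val):
    """Return (validation RMSE, number of selected variables)."""
    mask = np.asarray(chromosome, dtype=bool)  # gene 1 = variable selected
    n_selected = int(mask.sum())
    if n_selected == 0:
        return float("inf"), 0  # assumption: an empty model is the worst case
    A_cal = np.column_stack([np.ones(len(X_cal)), X_cal[:, mask]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, mask]])
    beta, *_ = np.linalg.lstsq(A_cal, y_cal, rcond=None)
    rmse = float(np.sqrt(np.mean((y_val - A_val @ beta) ** 2)))
    return rmse, n_selected

# Synthetic data: only variables 0 and 3 carry information about y.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3]
X_cal, y_cal, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

good = evaluate([1, 0, 0, 1, 0], X_cal, y_cal, X_val, y_val)
poor = evaluate([0, 1, 1, 0, 1], X_cal, y_cal, X_val, y_val)
```

In this toy case the chromosome that selects the two informative variables achieves a lower error with fewer variables than the one that selects the three uninformative ones.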
Multiobjective Optimization
Many real-world decision-making problems require
the simultaneous optimization of multiple
objectives (Michalewicz, 1996), so analyzing an
optimization problem with respect to a single
objective is insufficient for finding satisfactory
solutions. Most of these problems involve a
collection of objectives to be optimized that are not
always in harmony: the objectives may conflict, so
that improving one causes another to deteriorate.
Multiobjective optimization (MOO) problems
require techniques that are quite different from the
standard optimization techniques for mono-objective
problems. If there are two objectives to be
optimized, it may be possible to find one solution
that is best for the first objective and another
solution that is best for the second (Michalewicz,
1996).
Takahashi (2004) describes a general MOO
problem as:
$$X^{*} = \left\{ x^{*} \in \mathbb{R}^{n} \;\middle|\; x^{*} = \arg\min_{x} f(x), \ \text{subject to } x \in F_{x} \right\} \quad (7)$$
where $X^{*}$ is the set of efficient solutions,
$\mathbb{R}^{n}$ is the space of optimization
parameters, $x^{*}$ is a point belonging to $X^{*}$,
$f(x)$ is the vector of objective functions of the
problem and $F_{x}$ is the set of feasible points
belonging to the optimization parameter space.
It is convenient to classify all possible solutions of
a MOO problem into dominated and non-dominated
(Pareto-optimal) solutions. A solution x is
dominated if there exists a feasible solution y that
is not worse than x in all coordinates, in other
words, for all objectives $f_i$ (i = 1, ..., k)
(Michalewicz, 1996). For the minimization
problem (7) this reads:
$$f_i(x) \ge f_i(y), \quad \text{for } 1 \le i \le k \quad (8)$$
If no such relation holds, in other words, if the
solution is not dominated by any other feasible
solution, it is called a non-dominated (or
Pareto-optimal) solution. All Pareto-optimal
solutions can be of some interest and, ideally, the
system should report the set of all Pareto-optimal
points (Michalewicz, 1996).
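The dominance relation can be sketched in a few lines of Python. The sketch assumes that all objectives are minimized and uses the common strict form of dominance (no worse in every objective and strictly better in at least one). The objective pairs are illustrative values only, not results from this study, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def non_dominated(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Illustrative (prediction error, number of variables) pairs.
candidates = [(0.30, 5), (0.10, 20), (0.35, 40), (0.12, 25)]
front = non_dominated(candidates)
```

Here `(0.35, 40)` is dominated by `(0.30, 5)` and `(0.12, 25)` is dominated by `(0.10, 20)`, so only the two remaining trade-off solutions are kept.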
Schaffer, in 1984, implemented the first GA for
MOO, called VEGA (Vector Evaluated Genetic
Algorithm), an extension of the GENESIS program
to include multi-criteria functions. Srinivas & Deb
(1994) proposed a new technique called NSGA
(Non-dominated Sorting Genetic Algorithm), based
on classifying the individuals into several groups
called fronts. This grouping is performed on the
basis of non-domination, before selection
(Michalewicz, 1996).
In multivariate calibration, the two conflicting
objectives are to minimize the error in predicting
an analyte property and to minimize the number of
selected variables. These objectives conflict
because the smaller the number of selected
variables, the higher the prediction error can be.
Non-Dominated Sorting Genetic Algorithm II (NSGA-II)
Developed by Deb, Pratap, Agarwal, & Meyarivan
(2002), NSGA-II, like the first version of NSGA,
implements the dominance concept, classifying the
population into fronts according to dominance
level. The best solutions of each generation are
located in the first front, while the worst are located
in the last front. The classification process
continues until every individual in the population
has been assigned to a front. Once classification is
finished, the individuals in the first front are
non-dominated and dominate individuals of the
second front, which in turn dominate individuals of
the third front, and so on (Deb, Pratap, Agarwal, &
Meyarivan, 2002). Fig. 1 illustrates the steps of
NSGA-II:

Fig. 1: NSGA-II working
The main difference between NSGA-II and a
simple GA is the way the selection operator is
applied. This operator is subdivided into two
processes: Fast Non-Dominated Sorting and
Crowding Distance. The other operators are applied
in the traditional way (Deb, Pratap, Agarwal, &
Meyarivan, 2002).
Fast Non-Dominated Sorting is executed in two
parts. First, all individuals in the population are
compared with each other in order to calculate
their dominance level, that is, the number of
individuals that dominate them. When this first
part is finished, the individuals with dominance
level equal to zero are classified as non-dominated
and inserted in the first front (Pareto-optimal)
(Deb, Pratap, Agarwal, & Meyarivan, 2002).
The second part of the Fast Non-Dominated
Sorting process treats the individuals whose
dominance level is different from zero. In this step
each individual is removed from the population,
classified and inserted in the front corresponding
to its dominance level. Fast Non-Dominated
Sorting ends when the population is empty (Deb,
Pratap, Agarwal, & Meyarivan, 2002).
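The two parts of the sorting procedure can be sketched as follows, in the spirit of the procedure described by Deb, Pratap, Agarwal, & Meyarivan (2002). The sketch uses only the Python standard library and assumes minimization of all objectives; the variable names and the example objective values are illustrative.

```python
def dominates(a, b):
    """True if a dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_non_dominated_sort(objs):
    """Return a list of fronts, each a list of indices into objs."""
    n = len(objs)
    dominated = [[] for _ in range(n)]  # solutions that each p dominates
    count = [0] * n                     # dominance level: how many dominate p
    fronts = [[]]
    # Part 1: pairwise comparisons; level-zero individuals form the first front.
    for p in range(n):
        for q in range(n):
            if dominates(objs[p], objs[q]):
                dominated[p].append(q)
            elif dominates(objs[q], objs[p]):
                count[p] += 1
        if count[p] == 0:
            fronts[0].append(p)
    # Part 2: peel off one front at a time until no individual is left.
    i = 0
    while fronts[i]:
        next_front = []
        for p in fronts[i]:
            for q in dominated[p]:
                count[q] -= 1
                if count[q] == 0:
                    next_front.append(q)
        fronts.append(next_front)
        i += 1
    return fronts[:-1]  # drop the trailing empty front

objectives = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
fronts = fast_non_dominated_sort(objectives)
```

In the example, the first three points are mutually non-dominated, `(3, 4)` is dominated only by `(2, 3)`, and `(5, 5)` is dominated by all the others, which yields three fronts.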
The search for a set of Pareto-optimal solutions
tends to make the solutions converge to the same
region. However, a desired feature in a GA is a
good spread of the solutions found, and this is
where the second process of the NSGA-II selection
operator, crowding distance, comes in. Crowding
distance is a crowded-comparison approach
proposed to replace the sharing-function approach
of NSGA, eliminating two known problems: the
performance of the sharing method depends
strongly on the value of the parameter
$\sigma_{share}$ chosen by the user, and the
overall complexity of the approach is $O(N^2)$,
because each solution is compared with all other
solutions. Crowding distance is also used to order
the solutions within the same front (Deb, Pratap,
Agarwal, & Meyarivan, 2002).
To better understand the crowding-distance
approach, it is necessary to define the
density-estimation metric and the comparison
operator.
The density of solutions surrounding a particular
solution in the population is estimated by
calculating the average distance between the two
points on either side of this point along each of the
objectives. This quantity, $i_{distance}$, serves as
an estimate of the perimeter of the cuboid formed
by using the nearest neighbors as vertices, as
shown in Fig. 2 (Deb, Pratap, Agarwal, &
Meyarivan, 2002):


Fig. 2: Crowding-distance calculation. Marked
points in filled circles are non-dominated front
solutions (Deb, Pratap, Agarwal, & Meyarivan,
2002).
In the crowding-distance calculation, I[i].m refers
to the m-th objective function value of the i-th
individual in the set I, and the parameters
$f_m^{max}$ and $f_m^{min}$ are the maximum
and minimum values of the m-th objective
function. Since M independent sortings of at most
N solutions are involved, the complexity is
O(MN log N). The calculation requires sorting the
population according to each objective function
value in ascending order of magnitude. Each
objective function is normalized before the
calculation. After all population members in the
set I have been assigned a distance metric, two
solutions can be compared by their degree of
proximity to other solutions: the lower this distance
value, the closer the solution is to other solutions
(Deb, Pratap, Agarwal, & Meyarivan, 2002).
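A minimal sketch of the crowding-distance calculation for a single front is given below. It uses only the Python standard library; boundary solutions receive an infinite distance, and each objective is normalized by its range within the front. The function name and the example values are illustrative assumptions.

```python
def crowding_distance(front_objs):
    """Crowding distance of each solution in one front (list of tuples)."""
    n = len(front_objs)
    distance = [0.0] * n
    if n == 0:
        return distance
    n_objectives = len(front_objs[0])
    for m in range(n_objectives):
        # Sort the front in ascending order of the m-th objective.
        order = sorted(range(n), key=lambda i: front_objs[i][m])
        f_min = front_objs[order[0]][m]
        f_max = front_objs[order[-1]][m]
        # Boundary solutions are always preferred.
        distance[order[0]] = distance[order[-1]] = float("inf")
        if f_max == f_min:
            continue  # all values equal: this objective adds nothing
        for pos in range(1, n - 1):
            gap = front_objs[order[pos + 1]][m] - front_objs[order[pos - 1]][m]
            distance[order[pos]] += gap / (f_max - f_min)
    return distance

front = [(0.0, 8.0), (2.0, 4.0), (4.0, 2.0), (8.0, 0.0)]
distances = crowding_distance(front)
```

For the two interior points, each objective contributes the normalized gap between its two neighbors (4/8 and 6/8 in this example), so both receive a distance of 1.25.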
The crowded comparison operator ($\prec_n$)
guides the selection process at the various stages
of the algorithm toward a uniformly spread
Pareto-optimal front. Assume that each individual i
in the population has two attributes:
1. Non-domination rank ($i_{rank}$);
2. Crowding distance ($i_{distance}$).
A partial order $\prec_n$ is defined by:
$$i \prec_n j \ \text{ if } \ (i_{rank} < j_{rank}) \ \text{ or } \ \big((i_{rank} = j_{rank}) \ \text{and} \ (i_{distance} > j_{distance})\big) \quad (9)$$
That is, between two solutions from different
non-dominated fronts, the solution with the better
(lower) rank is preferred; otherwise, when both
belong to the same front, the solution located in
the less crowded region is chosen (Deb, Pratap,
Agarwal, & Meyarivan, 2002).
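Relation (9) translates directly into a comparison function, which can then drive a binary tournament. The sketch below uses only the Python standard library; the `Individual` record and the function names are hypothetical, and returning the second candidate when neither is preferred is an assumption made for simplicity.

```python
from collections import namedtuple

# Hypothetical record holding the two attributes used by the operator.
Individual = namedtuple("Individual", ["rank", "distance"])

def crowded_less(i, j):
    """True if i is preferred to j under the crowded comparison (9)."""
    return i.rank < j.rank or (i.rank == j.rank and i.distance > j.distance)

def binary_tournament(a, b):
    """Return the preferred individual; b is returned when neither is preferred."""
    return a if crowded_less(a, b) else b

first_front_crowded = Individual(rank=1, distance=0.1)
second_front_isolated = Individual(rank=2, distance=9.0)
first_front_isolated = Individual(rank=1, distance=0.5)

winner_by_rank = binary_tournament(first_front_crowded, second_front_isolated)
winner_by_distance = binary_tournament(first_front_crowded, first_front_isolated)
```

Rank takes priority over distance: a crowded solution in the first front still beats an isolated solution in the second front, and distance is used only to break ties within a front.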

Multiobjective Decision Maker
The NSGA-II algorithm presents a set of solutions
for the multiobjective problem in its first front. To
help choose a solution within this set, the t-test
was applied as a multiobjective decision maker.
The t-test evaluates the statistical significance of
the difference between the means of two
independent samples, and is appropriate whenever
the means of two groups need to be compared
(Trochim & Donnelly, 2007). In the context of this
problem, the t-test is used to verify, at the 5%
significance level, whether the RMSEP values of
two solutions differ, taking into account the
increase in the number of variables in the model.
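The t-test statistic is given by:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{var_1}{n_1} + \dfrac{var_2}{n_2}}} \quad (10)$$
where $\bar{X}_1$ and $\bar{X}_2$ are the sample
means, $var_1$ and $var_2$ are the sample
variances, and $n_1$ and $n_2$ are the numbers of
elements in each group. The numerator of (10) is
the difference between the means of samples
$X_1$ and $X_2$. The denominator is the standard
error of the difference, obtained by dividing the
variance of each group by the number of elements
of that group, adding the two quotients and taking
the square root (Trochim & Donnelly, 2007).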
The null hypothesis states that the solutions are
random samples from independent normal
distributions with equal means and equal but
unknown variances, against the alternative that the
means are not equal. A t-test result of 1 indicates
rejection of the null hypothesis at the $\alpha$
significance level, and 0 indicates failure to reject
the null hypothesis at the $\alpha$ significance
level.
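The statistic in equation (10) can be computed with the Python standard library as sketched below. The two samples are arbitrary illustrative numbers, not values from this study, and the function name `t_statistic` is hypothetical. The sketch computes only the statistic; deciding whether to reject the null hypothesis at the 5% level additionally requires comparing it with the critical value of the t distribution for the appropriate degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(sample1, sample2):
    """t statistic of equation (10) for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    # Standard error of the difference between the two means.
    standard_error = sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    return (mean(sample1) - mean(sample2)) / standard_error

sample_a = [2.0, 4.0, 6.0]  # mean 4, sample variance 4
sample_b = [1.0, 2.0, 3.0]  # mean 2, sample variance 1
t_value = t_statistic(sample_a, sample_b)
```

For these samples the difference of means is 2 and the standard error is the square root of 4/3 + 1/3, so `t_value` is approximately 1.549.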
Materials and Methods
Wheat Data
All the samples are whole-grain wheat, obtained
from plant material from western Canadian
producers. The reference data were determined at
the Grain Research Laboratory, in Winnipeg. The
reference data are: protein content (%); test weight
(kg/hl); PSI (wheat kernel texture) (%); farinograph
water absorption (%); farinograph dough
development time (minutes) and farinograph
mixing tolerance index (Brabender units). The data
set for the multivariate calibration study consists of
775 VIS-NIR spectra of whole-kernel wheat
samples, which were used as shoot-out data in the
2008 International Diffuse Reflectance Conference
(http://www.idrc-chambersburg.org/shootout.html).
Protein content was chosen as the property of
interest. The spectra were acquired in the range
400-2500 nm with a resolution of 2 nm. In the
present work, only the NIR region in the range
1100-2500 nm was employed. In order to remove
undesirable baseline features, first-derivative
spectra were calculated using a Savitzky-Golay
filter with a 2nd-order polynomial and an 11-point
window. Only the reference data for protein
concentration were used in these tests.
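The derivative pre-processing step can be sketched with NumPy alone by deriving the Savitzky-Golay coefficients from a local polynomial least-squares fit. The sketch assumes an 11-point window and a 2nd-order polynomial, as stated above; dropping the edge points where the window does not fit, the `delta` spacing argument and the function name are assumptions made for illustration and may differ from the original Matlab processing.

```python
import numpy as np

def savgol_first_derivative(y, window=11, order=2, delta=1.0):
    """First derivative by local polynomial fitting (Savitzky-Golay)."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    # Design matrix with columns 1, k, k^2, ... for window offsets k.
    A = np.vander(offsets, order + 1, increasing=True)
    # Row 1 of the pseudo-inverse gives the fitted slope at the window centre.
    coeffs = np.linalg.pinv(A)[1] / delta
    # Slide the window over the signal; edge points are dropped ("valid").
    return np.correlate(np.asarray(y, dtype=float), coeffs, mode="valid")

# Check on a quadratic signal, whose exact derivative is 2x.
x = np.arange(20.0)
derivative = savgol_first_derivative(x ** 2)
```

Since a 2nd-order polynomial fits a quadratic signal exactly, the result equals 2x at the ten window centres (x = 5 to 14).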
The Kennard-Stone (KS) algorithm
(Kennard & Stone, 1969) was applied to the
resulting spectra to divide the data into calibration,
validation and prediction sets with 389, 193 and
193 samples, respectively. The validation set was
employed to guide the selection of variables in
SPA-MLR, MONO-GA-MLR and MULTI-GA-
MLR. The prediction set was only employed in the
final performance assessment of the resulting
MLR models. In the PLS study, the calibration and
validation sets were joined into a single modeling
set, which was used in the leave-one-out
cross-validation procedure. The number of latent
variables was selected on the basis of the
cross-validation error by using the F-test criterion of
Haaland and Thomas with $\alpha$ = 0.25, as
suggested elsewhere (Haaland & Thomas, 1988).
The prediction set was only employed in the final
evaluation of the PLS model.
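The Kennard-Stone selection used for the data split can be sketched as follows, assuming NumPy and Euclidean distances between samples. How the selected and remaining samples were assigned to the calibration, validation and prediction sets is not detailed here, so the sketch only shows the selection of a representative subset; the function name is hypothetical.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Indices of n_select samples chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances computed from the Gram matrix.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    dist = np.sqrt(np.maximum(d2, 0.0))
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

toy = [[0.0], [1.0], [2.0], [10.0], [5.0]]
chosen = kennard_stone(toy, n_select=3)
```

In the toy example the two extremes (0 and 10) are taken first, followed by the sample at 5, which is the one farthest from both.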
Environment and Tools
Matlab version 7.10 (R2010a) was used to execute
the proposed NSGA-II algorithm, as well as to
compute the MLR regression and the RMSEP used
as the fitness value for the objective of determining
wheat protein concentration.
The algorithm was executed with the parameters in
Table 1:

Table 1 - NSGA-II settings.
Parameter                Value
Population Size          100
Number of Generations    30, 50 and 100
Selection Operator       Binary Tournament
Mutation Operator        Polynomial Mutation
Mutation Probability     1/Population Size
Crossover Operator       Simulated Binary Crossover (SBX)
Crossover Probability    0.9
Results and Discussion
Fig. 3 presents the derivative spectra of the wheat
samples.


Fig. 3: Wheat sample spectrum.
NSGA-II was executed 30 times with the
parameters shown in Table 1 for each number of
generations (30, 50 and 100), changing only the
number of generations between sets of 30
executions. The executions used the wheat data set
presented in Fig. 3, with 389 samples and 690
spectral variables per sample. Each NSGA-II
execution aims to produce a set of solutions on the
Pareto-optimal front. Fig. 4 shows the legend for
the Pareto front plots. Figs. 5, 6 and 7 show the
arrangement of the solutions after NSGA-II
execution, among the 30 executions performed for
each number of generations:

Fig. 4 - Pareto fronts label


Fig. 5 - Pareto fronts for 30 generations

Fig. 6 - Pareto fronts for 50 generations

Fig. 7 - Pareto fronts for 100 generations
Figs. 8, 9 and 10 show bar graphs of the number of
times each spectral variable was selected in a
Pareto-optimal front solution for each set of
NSGA-II executions. These graphs show that some
regions, such as spectral variables 0-50 and
110-150, are important for building the MLR
model across the distinct GA runs. On the other
hand, other regions, such as spectral variables
190-220, are not frequently used. These
observations are relevant for reducing the hardware
cost of acquiring the spectral variables.
Fig. 8 - Selected spectra at Pareto-optimal front
solutions for 30 generations
Fig. 9 - Selected spectra at Pareto-optimal front
solutions for 50 generations

Fig. 10 - Selected spectra at Pareto-optimal front
solutions for 100 generations
To choose one of the possible solutions from the
Pareto-optimal front, the t-test was used as a
multiobjective decision maker with a 5%
significance level. Table 2 presents the results of
the choices made by the multiobjective decision
maker.






Table 2 - Solutions chosen from the Pareto-optimal front by the t-test at the 5% significance level.
                                 30 generations   50 generations   100 generations
Average RMSEP                    0.08             0.08             0.08
Largest RMSEP                    0.13             0.16             0.13
Smallest RMSEP                   0.06             0.06             0.06
Average number of variables      21               20               19
Largest number of variables      25               26               24
Smallest number of variables     17               14               12

Table 3 - Results of the traditional techniques PLS,
SPA-MLR, and MONO-GA-MLR.
Technique        RMSEP
PLS              0.21 (15*)
SPA-MLR          0.20 (13#)
MONO-GA-MLR      0.21 (146#)
Range of protein content in the prediction set: 10.2-16.2 %
m/m. *Number of latent variables. #Number of spectral
variables selected.
Table 3 shows the prediction results for PLS,
SPA-MLR and MONO-GA-MLR, which are
considered the main techniques in multivariate
calibration. Comparing Tables 2 and 3, the
proposed algorithm obtained a lower RMSEP (0.08
on average, against 0.20 to 0.21) while using a
slightly larger number of variables than SPA-MLR
(19 to 21 on average, against 13).
Conclusion
This paper presented a study that proposed the use
of a genetic algorithm and multiobjective
optimization to select a set of variables for
multivariate calibration using MLR for the
prediction of wheat protein concentration.
The results show that the technique is effective in
reducing the RMSEP with a small number of
spectral variables relative to the number of spectral
variables available in the samples.
From the results, it was possible to observe that the
spectral variables selected in the solutions found by
the GA are similar. The GA also converged within
a few generations, which may indicate convergence
to local optima rather than global optima.
When compared with traditional algorithms, such
as PLS, SPA-MLR and MONO-GA-MLR, the
results obtained by the proposed algorithm were
better in all cases.
As future work, we suggest the implementation of
other genetic algorithms for multiobjective
optimization, such as SPEA II (Strength Pareto
Evolutionary Algorithm).
References
Abdi, H. (2003). Partial least squares (PLS)
regression. In M. Lewis-Beck, A. Bryman, T.
Futing (Eds): Encyclopedia for research methods
for the social sciences. Thousand Oaks (CA):
Sage. (pp. 792-795).
Alin, A. (2010). Multicollinearity. WIREs Comp
Stat, 2: 370-374. doi: 10.1002/wics.84.
Beebe, K. R., Pell R. J., & Seasholtz, M. B.
(1998). Chemometrics: A Practical Guide. John
Wiley & Sons, INC. New York.
Chong, Il-Gyo. Jun, Chi-Hyuck. (2005).
Performance of some variable selection methods
when multicollinearity is present, Chemometrics
and Intelligent Laboratory Systems, v. 78, n. 1-2,
(pp. 103-112).
De Jong, K. A. (2006). Evolutionary computation :
a unified approach. Massachusetts Institute of
Technology.
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T.
(2002). A Fast and Elitist Multiobjective Genetic
Algorithm: NSGA-II. IEEE Transactions on
Evolutionary Computation, vol. 6, no. 2.

Draper, N. R. & Smith, H. (1998). Applied
regression analysis, Wiley series in probability
and statistics.
Galvão, R. K. H. (2008). A variable elimination
method to improve the parsimony of MLR models
using the successive projections algorithm.
Chemometrics and Intelligent Laboratory Systems,
Volume 92, Issue 1, (pp. 83-91).
Gemperline, P. (2006). Practical Guide to
Chemometrics, CRC Taylor & Francis, Boca
Raton, 2006.
Goldberg, D. E. (1989). Genetic algorithms in
search, optimization, and machine learning. New
York: Addison-Wesley.
Guyon, I. (2003). An Introduction to Variable and
Feature Selection. Journal of Machine Learning
Research, [S.l.], v. 3, (pp. 1157-1182).
Haaland, D. M. & Thomas, E. V. (1988). Partial
Least-Squares Methods for Spectral Analysis 1.
Relation to Other Quantitative Calibration
Methods and the Extraction of Quantitative
Information, Anal. Chem, 60, (pp 1193).
Hardle, W. & Simar, L. (2003). Applied
Multivariate Statistical Analysis. [S.l.]: Tech.
Hocking, R. R. (1976). The analysis and selection
of variables in linear regression. Biometrics, 32,
(pp. 1-49).
Johnson, R. A. & Wichern, D. W. (2002). Applied
Multivariate Statistical Analysis, Prentice Hall.
Jolliffe, I. (1982). A Note on the Use of Principal
Components in Regression, Journal of the Royal
Statistical Society. Series C (Applied Statistics), v.
31, n. 3, (pp. 300-303).
Kennard, R.W. & Stone L. A. (1969). Computer
aided design of experiments, Technometrics, 11,
(pp 137-148).
Martens, H. & Naes, T. (1989). Multivariate
Calibration, John Wiley & Sons, Chichester.
Michalewicz, Z. (1996). Genetic algorithms +
data structures = evolution programs. 3 ed.
Berlin: Springer.
Miller, A.J. (1984). Selection of Subsets of
Regression Variable. Journal of the Royal
Statistical Society. Series A (General), Vol. 147,
No. 3, (pp.389-425).
Nunes, P. G. A. (2008). Uma nova técnica para
seleção de variáveis em calibração multivariada
aplicada às espectrometrias UV-VIS e NIR, Tese
de Doutorado, UFPB/CCEN, João Pessoa.
Rencher, A. C. (2002). Methods of Multivariate
Analysis, Wiley-Interscience.
Snedecor, G. W. & Cochran, W. G. (1972).
Statistical Methods. 6 ed., Iowa: Ames.
Soares(a), A. S., Galvao Filho, A. R., Galvao, R.
K. H. & Araujo, M. C. U. (2010). Improving the
computational efficiency of the successive
projections algorithm by using a sequential
regression implementation: a case study involving
NIR spectrometric analysis of wheat samples. J.
Braz. Chem. Soc., São Paulo, v. 21, n. 4.
Soares(b), A. S., Galvao Filho, A. R., Galvao, R.
K. H. & Araujo, M. C. U. (2010). Multi-core
computation in chemometrics: case studies of
voltammetric and NIR spectrometric analyses. J.
Braz. Chem. Soc., São Paulo, v. 21, n. 9.
Srinivas, N. & Deb, K.. (1994). Multiobjective
Optimization Using Nondominated Sorting
Genetic Algorithms, Evolutionary Computation,
Vol.2, No.3.
Takahashi, R. H. C. (2004). Otimização Escalar e
Vetorial. Universidade Federal de Minas
Gerais, [S.l.].
Tobias, R. (1997). An Introduction to Partial
Least Squares Regression, TS-509. SAS Institute
Inc., Cary, NC.
Trochim, W. & Donnelly, J. P. (2007). The
Research Methods Knowledge Base, 3ed.
Thompson Publishing, Mason, OH.
Von Zuben, F. J. (2000). Computação evolutiva:
Uma abordagem pragmática. In: Anais da I
Jornada de Estudos em Computação de Piracicaba
e Região (1a. JECOMP), Piracicaba, SP, (pp.
25-45).
Varmuza, K. & Filzmoser, P. (2000). Introduction
to Multivariate Statistical Analysis in
Chemometrics, [S.l.]: CRC Press.
Wold, S., Sjöström, M., & Eriksson, L. (2001).
PLS-regression: a basic tool of chemometrics,
Chemometrics and Intelligent Laboratory Systems,
v. 58, (pp. 109-130).