Constantin Cranganu · Henri Luchian · Mihaela Elena Breaban (Editors)

Artificial Intelligent Approaches in Petroleum Geosciences

Editors:
Constantin Cranganu, Brooklyn College, Brooklyn, NY, USA
Henri Luchian, University of Iaşi, Iaşi, Romania
Mihaela Elena Breaban, University of Iaşi, Iaşi, Romania
Preface

Integration, handling data of immense size and uncertainty, and dealing with risk
management are among crucial issues in petroleum geosciences. The problems one
has to solve in this domain are becoming too complex to rely on a single discipline
for effective solutions, and the costs associated with poor predictions (e.g., dry
holes) increase. Therefore, there is a need to establish new approaches aimed at
proper integration of disciplines (such as petroleum engineering, geology, geophysics, and geochemistry), data fusion, risk reduction, and uncertainty
management.
This book presents several artificial intelligent approaches¹ for tackling and solving challenging practical problems from the petroleum geosciences and petroleum industry. Written by experienced academics, this book offers state-of-the-art working examples and provides the reader with exposure to the latest developments in the field of artificial intelligent methods applied to oil and gas research, exploration, and production. It also analyzes the strengths and weaknesses of each method presented using benchmarking, while also emphasizing essential parameters such as robustness, accuracy, speed of convergence, computer time, overlearning, or the role of normalization.
The reader of this book will beneﬁt from exposure to the latest developments in
the ﬁeld of modern heuristics applied to oil and gas research, exploration, and
production. These approaches can be used for uncertainty analysis, risk assessment,
data fusion and mining, data analysis and interpretation, and knowledge discovery,
from diverse data such as 3D seismic, geological data, well logging, and production data. Thus, the book is intended for petroleum scientists, data miners, data
scientists and professionals, and postgraduate students involved in the petroleum
industry.
¹ Artificial Intelligence methods, some of which are grouped together in various ways, under names such as Computational Intelligence, Soft Computing, Metaheuristics, or Modern heuristics.

Petroleum Geosciences are—like many other fields—a paradigmatic realm of difficult optimization and decision-making real-world problems. As the number,
difficulty, and scale of such specific problems increase steadily, the need for diverse, adjustable problem-solving tools can hardly be satisfied by the necessarily limited number of approaches typically included in a curriculum/syllabus from academic fields other than Computer Science (such as Petroleum Geology). Therefore, the first three chapters of this volume aim at providing working information about modern problem-solving tools, in particular in machine learning and in data mining, and also at inciting the reader to look further into this thriving topic.
Traditionally, solving a given problem in mathematics and in sciences at large implies the construction of an abstract model, the process of proving theoretical results valid in that model, and eventually, based on those theoretical results, the design of a method for solving the problem. This problem-solving paradigm has been and will continue to be immensely successful. Nevertheless, an abstract model is an approximation of the real-world problem; there have been failures triggered by a tiny mismatch between the original problem and the proposed model for it. Furthermore, a problem-solving method developed in this manner is likely to be useful only for the problem at hand. While, ultimately, any problem-solving technique may be—in various degrees—subject to these two observations, some relatively new approaches illustrate alternative lines of attack; it is the editors' hope that the first three chapters of the book illustrate this idea in a way that will prove to be useful to the readers.
In the first chapter, Simovici presents some of the main paradigms of intelligent data analysis provided by machine learning and data mining. After discussing several types of learning (supervised, unsupervised, semi-supervised, active, and reinforcement learning), he examines several classes of learning algorithms (naïve Bayes classifiers, decision trees, support vector machines, and neural networks) and the modalities to evaluate their performance. Examples of specific applications of algorithms are given using System R.
The second and third chapters, by Luchian, Breaban, and Bautu, are dedicated to metaheuristics. After a rather simple introduction to the topic, the second chapter presents, based on working examples, evolutionary computing in general and, in particular, genetic algorithms and differential evolution; particle swarm optimization is also extensively discussed. Topics of particular importance, such as multimodal and multiobjective problems, hybridization, and also applications in petroleum geosciences, are discussed based on concrete examples. The third chapter gives a compact presentation of genetic programming and gene expression programming, and also discusses an R package for genetic programming and applications of GP for solving specific problems from the oil and gas industry.
Ashena and Thonhauser discuss artificial neural networks (ANNs), which have the potential to increase the ability of problem solving in geosciences and in the petroleum industry, particularly in case of limited availability or lack of input data. ANN applications have become widespread because they proved to be able to produce reasonable outputs for inputs they have not learned how to deal with. The following subjects are presented: artificial neural network basics (neurons, activation function, ANN structure), feed-forward ANNs, back-propagation and learning, perceptrons and back-propagation, multilayer ANNs and back-propagation
three parameters, namely mean square error (MSE), mean relative error (MRE), and Pearson product-moment correlation coefficient (R). The authors employed both the measured and simulated sonic log DT to predict the presence and estimate the depth intervals where an overpressured fluid zone may develop in the Anadarko Basin, Oklahoma. Based on interpretation of the sonic log trends, they inferred that overpressure regions are developing between ~1,250 and 2,500 m depth and the overpressured intervals have thicknesses varying between ~700 and 1,000 m. These results match very well previously published results reported in the Anadarko Basin, using the same wells, but different artificial intelligent approaches.
Second, Bahrpeyma et al. employed ALM to estimate another missing log in hydrocarbon reservoirs, namely the density log. The regression coefficient and normalized mean squared error (nMSE) for estimating the density log using ALM were equal to 0.9 and 0.042, respectively. The results, including errors and regression coefficients, showed that ALM was successful in estimating the density log. In their chapter, the authors illustrated ALM with an example from a petroleum field in the NW Persian Gulf.
Third, Bahrpeyma et al. tackled the common situation in which reservoir engineers must analyze reservoirs with small sets of measurements (known as the small sample size problem). Because of the small sample size problem, modeling techniques commonly fail to accurately extract the true relationships between the inputs and the outputs used for reservoir property prediction or modeling. In this chapter, the small sample size problem is addressed for modeling carbonate reservoirs by using the active learning method (ALM). The noise injection technique, which is a popular solution to the small sample size problem, is employed to recover the impact of separating the validation and test sets from the entire sample set in the process of ALM. The proposed method is used to model hydraulic flow units (HFUs). HFUs are defined as correlatable and mappable zones within a reservoir that control fluid flow. This research presents a quantitative formulation between flow units and well log data in one of the heterogeneous carbonate reservoirs in the Persian Gulf. The results for R and nMSE are 85 % and 0.0042, respectively, which reflect the ability of the proposed method to improve the generalization ability of ALM when facing the small sample size problem.
Dobróka and Szabó carried out well log analysis by a global optimization-based interval inversion method. Global optimization procedures, such as genetic algorithms and simulated annealing methods, offer robust and highly accurate solutions to several problems in petroleum geosciences. The authors argue that these methods can be used effectively in the solution of well-logging inverse problems. Traditional inversion methods are used to process the borehole geophysical data collected at a given depth point. As there are barely more types of probes than unknowns at a given depth, a set of marginally overdetermined inverse problems has to be solved along a borehole. This single inversion scheme represents a relatively noise-sensitive interpretation procedure. To reduce the noise, the degree of overdetermination of the inverse problem must be increased. This condition can be achieved by using a so-called interval inversion method, which inverts all data from a greater depth interval jointly to estimate the petrophysical parameters of hydrocarbon reservoirs over the same interval. The chapter gives a detailed description of the interval inversion problem, which is then solved by a series expansion-based discretization technique.
The high degree of overdetermination signiﬁcantly increases the accuracy of
parameter estimation. The quality improvement in the accuracy of estimated model
parameters often leads to a more reliable calculation of hydrocarbon reserves. The
knowledge of formation boundaries is also required for reserve calculation. Well
logs contain information about layer thicknesses, which cannot be extracted by the
traditional local inversion approach. The interval inversion method is applicable to
derive the layer boundary coordinates and certain zone parameters involved in the
interpretation problem automatically. In this chapter, the authors analyzed how to
apply a fully automated procedure for the determination of rock interfaces and
petrophysical parameters of hydrocarbon formations. Cluster analysis of well
logging data is performed as a preliminary dataprocessing step before inversion.
The analysis of cluster number log allows the separation of formations and gives an
initial estimate for layer thicknesses. In the global inversion phase, the model
including petrophysical parameters and layer boundary coordinates is progressively
reﬁned to achieve an optimal solution. The very fast simulated reannealing method
ensures the best fit between the measured data and theoretical data calculated on the model. The inversion methodology is demonstrated by a hydrocarbon field example, with an application for shaly sand reservoirs.
Finally, Mohebbi and Kaydani undertake a detailed review of metaheuristics dealing with permeability estimation in petroleum reservoirs. They argue that a proper permeability distribution in reservoir models is very important for the determination of oil and gas reservoir quality. In fact, it is not possible to have accurate solutions in many petroleum engineering problems without having accurate values for this key parameter of hydrocarbon reservoirs. Permeability estimation by individual techniques within the various porous media can vary with the state of the in situ environment, fluid distribution, and the scale of the medium under investigation. Recently, attempts have been made to utilize metaheuristics for the identification of the relationship that may exist between well log data and core permeability. This chapter overviews the different metaheuristics in permeability prediction, indicating the advantages of each method. In the end, some suggestions and comments about how to choose the best method are presented.
Intelligent Data Analysis Techniques—Machine Learning and Data Mining

Dan Simovici
Keywords Supervised learning · Unsupervised learning · Clustering · Generalization · Overfitting · Active learning · Classifiers · A priori probabilities · A posteriori probabilities · Decision trees · Entropy · Impurity · Naive Bayes classifiers · Perceptrons · Neural networks
1 Introduction
Machine learning and its applied counterpart, data mining, deal with problems for which it is difficult, due to their complexity, to formulate algorithms that can be readily translated into programs. Examples of such problems are finding diagnoses for patients starting from a series of their symptoms, or determining the creditworthiness of customers based on their demographics and credit history. In each of these problems, the challenge is to compute a label for each analyzed piece of data that depends on the characteristics of the data.
The general approach known as supervised learning is to begin with a number of labeled examples (where the answers are known) in order to generate an algorithm that computes the function that gives the answers, starting from these examples.
D. Simovici, Department of Computer Science, University of Massachusetts Boston, Boston, MA, USA. e-mail: dsim@cs.umb.edu
csv format, which is one of the most common modalities for uploading data. For
example, to create a data frame d by reading the ﬁle d.csv, one could use
d <- read.csv("d.csv")
To learn the basics of R, the reader is invited to consult one of the basic
references (Lander 2014; Maindonald and Braun 2004) or seek help on the Web.
2 Simple Classiﬁers
We present now several types of classiﬁers using two of the most popular data sets,
namely Fisher’s iris data and the tennis data.
Example 2.1 The iris data were collected by Anderson (1936), an American botanist who was interested in the study of variations in three species of iris flowers in the Gaspé Peninsula in northeastern Canada; the data were made popular in statistics by Fisher (1936).
Fisher's iris data consist of measurements on 150 iris specimens and include
measurements of sepal length, sepal width, petal length, and petal width, as well as
the species of the plants. The attributes that are distinct from the class are numerical,
so each plant is represented by a point in R4 . The species identiﬁed are iris setosa,
iris versicolor, and iris virginica, and there are 50 specimens from each of these
species, as shown in Table 1.
We will use various types of classiﬁers as they are implemented in system R,
one of the most used pieces of software for data analysis, which is freely available
on the Internet.
The iris data set is a part of the basic R package and can be loaded using
> data(iris)
Example 2.2 The tennis data set shown in Table 2 is a fictitious small data set that specifies conditions for playing an outdoor game. It contains five attributes: outlook, temperature, humidity, windy, and play.
Suppose that a data set D consists of n nonempty and mutually disjoint classes C_1, …, C_n.
The probabilities P(C_i | x) are known as a posteriori probabilities, since they are evaluated after the datum x is observed, and the class C_k that maximizes this probability is occasionally referred to as the maximum a posteriori class.
By Bayes' law, we have

P(C_i | x) = P(x | C_i) P(C_i) / P(x)

and, under the naive assumption that the attribute values are independent within each class,

P(x | C_i) = ∏_{j=1}^{m} P(x_j | C_i)

for 1 ≤ i ≤ n. The probabilities P(x_j | C_i) are usually estimated from the training examples, and the estimation method depends on the nature of each of the attributes A_1, …, A_m that define these components. The classifier will assign x to the most likely class, that is, to the C_i that corresponds to the maximum value of P(C_i | x) and therefore to the class C_i for which P(C_i) ∏_{j=1}^{m} P(x_j | C_i) is maximal.
If A_j is a categorical attribute, P(x_j | C_i) can be estimated as f_{ji} / c_i, where c_i is the number of training examples in class C_i and f_{ji} is the number of training examples in the class C_i having the value of the A_j component equal to x_j.
If A_j is continuous, P(x_j | C_i) can be approximated with the normal distribution. If μ_i and σ_i are the mean and the standard deviation of the examples of the class C_i, then we may adopt as an estimate of P(x_j | C_i) the value

(1 / (√(2π) σ_i)) e^{−(x_j − μ_i)² / (2σ_i²)}.
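These two estimation rules are easy to state programmatically. The sketch below uses Python rather than the chapter's R, and the function names are ours:

```python
import math

def categorical_estimate(f_ji, c_i):
    """P(x_j | C_i) for a categorical attribute: relative frequency f_ji / c_i."""
    return f_ji / c_i

def gaussian_estimate(x_j, mu_i, sigma_i):
    """P(x_j | C_i) for a continuous attribute via the normal density
    with the class mean mu_i and standard deviation sigma_i."""
    return math.exp(-((x_j - mu_i) ** 2) / (2 * sigma_i ** 2)) / (
        math.sqrt(2 * math.pi) * sigma_i)

print(categorical_estimate(3, 9))          # 3 matching examples out of 9
print(gaussian_estimate(5.0, 5.0, 0.35))   # density peaks at the class mean
```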
Example 2.3 In the tennis data set, there are two classes determined by the attribute play: C_Yes and C_No, which contain 9 and 5 records, respectively. If the probabilities of these classes are estimated by their frequencies, we will have P(C_Yes) = 9/14 and P(C_No) = 5/14. Since all attributes in this example are categorical, the probabilities P(x_j | C_i) are estimated as f_{ji} / c_i, where c_i is the number of training examples in class C_i and f_{ji} is the number of training examples in the class C_i having the value of the A_j component equal to x_j. In this case, the frequencies are computed in Table 3.
A naive Bayes classifier for this categorical data set is created in R with the package e1071. After installing this package, e1071 is loaded using the directive

> library(e1071)

The classifier nbc itself is built with

> nbc <- naiveBayes(Play ~ .,data=tennis)

In this formula, Play is clearly the class variable; the period "." replaces all other variables. If several variables participate in the list of explanatory variables, they are linked by +.
Displaying the components of nbc gives us the prior probabilities and the
conditional probabilities Pð xjC Þ:
Apriori probabilities:
No Yes
0.3571429 0.6428571
Conditional probabilities:
Outlook
Overcast Rainy Sunny
No 0.0000000 0.4000000 0.6000000
Yes 0.4444444 0.3333333 0.2222222
Temp
Cool Hot Mild
No 0.2000000 0.4000000 0.4000000
Yes 0.3333333 0.2222222 0.4444444
Humidity
High Normal
No 0.8000000 0.2000000
Yes 0.3333333 0.6666667
Windy
NO YES
No 0.4000000 0.6000000
Yes 0.6666667 0.3333333
We seek to predict the value of the attribute Play when the values of the other attributes form a tuple that is absent from the table. This happens when we have the datum x given below, for which

P(C_yes | x) = P(x | C_yes) P(C_yes) / P(x) = (0.0082 × 0.6428571) / P(x) = 0.00527 / P(x),

P(C_no | x) = P(x | C_no) P(C_no) / P(x) = (0.0768 × 0.3571429) / P(x) = 0.02742 / P(x).

Since P(C_no | x) > P(C_yes | x), the classifier will predict "no" for x.
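The comparison of the two numerators can be replayed in code. The sketch below, in Python rather than the chapter's R, applies the priors and class-conditional frequencies from Table 3 to a hypothetical datum of our choosing (outlook = Sunny, temp = Cool, humidity = High, windy = YES); only the decision logic mirrors the text:

```python
# Priors and class-conditional frequencies taken from the tables above.
priors = {"Yes": 9 / 14, "No": 5 / 14}
cond = {
    "Yes": {"Sunny": 2 / 9, "Cool": 3 / 9, "High": 3 / 9, "Windy=YES": 3 / 9},
    "No": {"Sunny": 3 / 5, "Cool": 1 / 5, "High": 4 / 5, "Windy=YES": 3 / 5},
}

def score(cls):
    """Numerator P(x | C) * P(C); the common denominator P(x) can be ignored."""
    p = priors[cls]
    for v in cond[cls].values():
        p *= v
    return p

scores = {cls: score(cls) for cls in priors}
print(max(scores, key=scores.get))  # -> No
```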
Note that there is no example in the data set where outlook = "Overcast" and Play = "No". Therefore, P(outlook = "Overcast" | Play = "No") = 0 and any product of probabilities that includes this factor will be 0. This problem can be fixed by using a technique known as the Laplace correction. Namely, if the fractions

p_1/q_1, …, p_m/q_m

are m probabilities such that ∑_{i=1}^{m} p_i/q_i = 1, we replace these fractions by

(p_1 + k)/(q_1 + mk), …, (p_m + k)/(q_m + mk),

where each corrected fraction lies between the original fraction and 1/m:

min{p_i/q_i, 1/m} ≤ (p_i + k)/(q_i + mk) ≤ max{p_i/q_i, 1/m}.

The parameter k is, in general, a small positive number and determines how influential the a priori values are compared to the knowledge extracted from the training set.
To apply a Laplace correction with k = 1, we need to write

> nbc <- naiveBayes(Play ~ .,data=tennis,laplace=1)
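As a numeric check of the correction with k = 1: applying (p_i + k)/(q_i + mk), with m = 3 outlook values, to the class-No outlook counts (0 Overcast, 2 Rainy, 3 Sunny out of 5 examples) reproduces the corrected values shown next. A Python sketch:

```python
def laplace_correct(counts, k=1):
    """Replace frequencies p_i/q with (p_i + k)/(q + m*k), m = number of values."""
    m = len(counts)
    q = sum(counts)
    return [(p + k) / (q + m * k) for p in counts]

# Outlook counts in class No: Overcast = 0, Rainy = 2, Sunny = 3.
print(laplace_correct([0, 2, 3]))  # [0.125, 0.375, 0.5]
```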
Note that the conditional probabilities are modified and there are no null values:
Apriori probabilities:
No Yes
0.3571429 0.6428571
Conditional probabilities:
Outlook
Overcast Rainy Sunny
No 0.1250000 0.3750000 0.5000000
Yes 0.4166667 0.3333333 0.2500000
Temp
Cool Hot Mild
No 0.2500000 0.3750000 0.3750000
Yes 0.3333333 0.2500000 0.4166667
Humidity
High Normal
No 0.7142857 0.2857143
Yes 0.3636364 0.6363636
Windy
NO YES
No 0.4285714 0.5714286
Yes 0.6363636 0.3636364
Example 2.4 In this example, we seek to construct a Bayes classiﬁer for a data set
that has numerical attributes using the iris data set and the package e1071.
> nbc <- naiveBayes(iris[,1:4],iris[,5])
The structure of the classiﬁer returned can be inspected using the statement
> str(nbc)
which returns
List of 4
$ apriori: ’table’ int [1:3(1d)] 50 50 50
.. attr(*, "dimnames")=List of 1
.. ..$ iris[, 5]: chr [1:3] "setosa" "versicolor" "virginica"
$ tables :List of 4
..$ Sepal.Length: num [1:3, 1:2] 5.006 5.936 6.588 0.352 0.516 ...
.. .. attr(*, "dimnames")=List of 2
.. .. ..$ iris[, 5] : chr [1:3] "setosa" "versicolor" "virginica"
.. .. ..$ Sepal.Length: NULL
..$ Sepal.Width : num [1:3, 1:2] 3.428 2.77 2.974 0.379 0.314 ...
.. .. attr(*, "dimnames")=List of 2
.. .. ..$ iris[, 5] : chr [1:3] "setosa" "versicolor" "virginica"
.. .. ..$ Sepal.Width: NULL
..$ Petal.Length: num [1:3, 1:2] 1.462 4.26 5.552 0.174 0.47 ...
.. .. attr(*, "dimnames")=List of 2
.. .. ..$ iris[, 5] : chr [1:3] "setosa" "versicolor" "virginica"
.. .. ..$ Petal.Length: NULL
..$ Petal.Width : num [1:3, 1:2] 0.246 1.326 2.026 0.105 0.198 ...
.. .. attr(*, "dimnames")=List of 2
.. .. ..$ iris[, 5] : chr [1:3] "setosa" "versicolor" "virginica"
.. .. ..$ Petal.Width: NULL
$ levels : chr [1:3] "setosa" "versicolor" "virginica"
$ call : language naiveBayes.default(x = iris[, 1:4], y = iris[, 5])
 attr(*, "class")= chr "naiveBayes"
Decision trees are algorithms that build classiﬁcation models based on a chain of
partitions of the training set. Depending on the nature of data (categorical or
numerical), we need to choose a particular type of decision tree.
Decision trees are built through recursive data partitioning, where in each iter
ation, the training data are split according to the values of a selected attribute. Each
node n corresponds to a subset D(n) of the training data set D and to a partition
π(n) of D(n). If n0 is the root of the decision tree, then D(n0) = D. If n is a node that has the descendants n_1, …, n_k, then

π(n) = {D(n_1), …, D(n_k)}.

In other words, the blocks of the partition π(n) are the data sets that correspond to the descendant nodes n_1, …, n_k. Partitioning of a set D(n) is done, in general, on
the basis of the values of the attributes of the objects assigned to the node n.
Suppose that the training data are labeled by c_1, …, c_m. This, in turn, determines a partition σ = {C_1, …, C_m} of the training set, where the block C_j contains the data records labeled c_j for 1 ≤ j ≤ m. If E is a subset of D, the purity of E equals the entropy of the trace partition σ_E (see Sect. 8B). The set E is pure if σ_E consists of exactly one block, that is, H(σ_E) = 0; in other words, E is pure if its elements belong to exactly one class.
The recursive splitting of the nodes stops at nodes that correspond to “pure” or
“almost pure” data subsets, that is, when the data of the node consist of instances of
the same class, or when a class is strongly predominant at that node. Nodes where
splitting stops are the leaves of the decision trees.
There are three issues in constructing a decision tree (Breiman et al. 1998):
(i) choosing a splitting criterion that generates a partition of DðnÞ;
(ii) deciding when a node should not be split further, that is, when a node is
terminal;
(iii) the assignment of each terminal node to a class.
Splitting the data set D(n) aims to produce nodes with increasing purity. Assume that n is split k ways to generate the descendants n_1, …, n_k that contain the data sets D(n_1), …, D(n_k), and let σ_n = {D(n_1), …, D(n_k)} be the splitting partition at n. The quality of the split is measured by the decrease in entropy

H_a(σ_{D(n)}) − ∑_{j=1}^{k} (|D(n_j)| / |D(n)|)^a H_a(σ_{D(n_j)}) = H_a(σ_{D(n)}) − H_a(σ | σ_n).

This quantity is known as the information gain caused by σ_n, and it is the basis of one of the best-known methods for constructing decision trees, namely the C5.0 algorithm of Quinlan (1993). Variants of this algorithm are also popular [e.g., the J48 of the WEKA software package (Witten et al. 2011)].
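For the Shannon case (a = 1), the gain computation can be sketched as follows; Python is used here for illustration, and the helper names are ours:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class distribution of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    """Parent entropy minus the size-weighted entropies of the children."""
    n = len(parent_labels)
    weighted = sum(len(ch) / n * entropy(ch) for ch in child_label_lists)
    return entropy(parent_labels) - weighted

# A pure split of a balanced two-class node recovers the full bit of entropy:
labels = ["a"] * 4 + ["b"] * 4
print(information_gain(labels, [["a"] * 4, ["b"] * 4]))  # 1.0
```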
The construction of a C5.0 tree in the C50 package can be achieved by writing
C5.0(trainData,classVector, trials = t, costs = c)
where the ﬁrst parameter speciﬁes the data set on which the classiﬁer is constructed
and the second parameter is a factor vector which contains the class for each row of
the training data; the remaining parameters are optional and will be discussed in the
sequel.
Example 2.5 To generate a decision tree for the iris data set, we split this data
into a training data set, trainIris, and a test data set, testIris by writing
About 90 % of the entries in this index have value 1 and about 10 % contain the
value 2, which correspond to the training set and the test set, respectively.
The classiﬁer dt is built using the syntax
dt <- C5.0(trainIris[,1:4],trainIris[,5])
The classes predicted for the test set are obtained with
> pred <- predict(dt,testIris[,1:4],type="class")
> pred
setosa setosa setosa setosa setosa
versicolor versicolor versicolor versicolor versicolor
versicolor versicolor virginica virginica virginica
virginica virginica virginica virginica
Levels: setosa versicolor virginica
A summary of the classifier, summary(dt), returns the specifics of the decision tree:

Decision tree:

Size      Errors
   4   4( 3.1%)   <<
Note that the classifier generated in Example 2.5 produced four erroneous predictions. A matrix of costs can be associated with these mistakes such that the costs depend on the nature of the errors. For instance, since we have three classes designated as (a), (b), and (c), we could consider the cost matrix

        ( 0  2  0 )
costs = ( 4  0  5 )
        ( 0  1  0 )

The entries of this matrix assign a cost to the mistakes made during the classification. Rows correspond to predicted values and columns to actual values; the diagonal elements are 0. Thus, the costliest error of the classifier is to predict (b) for an object in the class (c).
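Given such a cost matrix and class probabilities for an object, a cost-sensitive classifier picks the class with the least expected cost rather than the most probable class. A minimal Python sketch using the matrix above and hypothetical probabilities:

```python
costs = [
    [0, 2, 0],  # predicted (a): cost against actual (a), (b), (c)
    [4, 0, 5],  # predicted (b)
    [0, 1, 0],  # predicted (c)
]

def least_cost_class(probs):
    """Pick the prediction i minimizing the expected cost sum_j probs[j]*costs[i][j]."""
    expected = [sum(p * c for p, c in zip(probs, row)) for row in costs]
    return min(range(len(costs)), key=expected.__getitem__)

# With P(a), P(b), P(c) = 0.2, 0.5, 0.3, predicting (b) risks the heavy
# (b)-for-(c) penalty of 5, so the least-cost prediction can differ from
# the most probable class.
print(least_cost_class([0.2, 0.5, 0.3]))  # 2, i.e., class (c)
```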
Fig. 1 Relative positions of distributions of test results. a Well-separated results. b Positive and negative results overlap
Note that the set of test values of individuals who have the disease overlaps with the set of test values of those who do not have the disease. These sets are represented in Fig. 1b by the areas P and N located under each of the two curves.
The diagnosis is determined by the value of a test threshold: Patients whose test values exceed the threshold are deemed to be positive (that is, to have the disease); patients whose test values are lower than the threshold are deemed to be negative. Some patients who have the disease but whose test results are lower than the threshold will be classified by this simple test among the negative cases (these are the false-negative cases); others, who do not have the disease but whose test values are larger than the threshold, will be classified among the positive cases (these are the false-positive cases). The numbers of elements of these sets are denoted by FN(t) and FP(t), respectively.
The set of patients who have the disease and are correctly identified by the test forms the set of true-positive cases; the number of elements of this set is denoted by TP(t). Also, the set of patients who do not have the disease and are correctly identified forms the set of true-negative cases; the number of elements of this set is TN(t). Clearly, we have

N = TN(t) + FP(t),
P = TP(t) + FN(t).

Note that the total numbers of cases N and P do not depend on t. These definitions are summarized in Table 4, known as the confusion matrix or confusion table.
Among these cases, the number of incorrectly classified cases is FP(t) + FN(t); this motivates the introduction of the error rate error(t) as

error(t) = (FP(t) + FN(t)) / (N + P),

and of the accuracy acc(t) as

acc(t) = 1 − error(t) = (TP(t) + TN(t)) / (P + N).
The specificity at t (also known as the true-negative rate) is

specificity(t) = TN(t) / N,

that is, the fraction of negative cases that the test identifies correctly:

specificity(t) = P(TN(t) | N).
Similarly, the sensitivity at t (also known as the true-positive rate) or the recall is given by

sensitivity(t) = TP(t) / P,

and the precision at t is

precision(t) = TP(t) / (TP(t) + FP(t)).
Note that

specificity(t) = TN(t) / (TN(t) + FP(t)),
sensitivity(t) = TP(t) / (TP(t) + FN(t)),
precision(t) = TP(t) / (TP(t) + FP(t)).
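All of these measures derive from the four confusion-matrix counts; a compact Python sketch with hypothetical counts:

```python
def metrics(tp, fp, tn, fn):
    """Error rate, accuracy, specificity, sensitivity, and precision at a threshold."""
    p, n = tp + fn, tn + fp  # total positive and negative cases
    return {
        "error": (fp + fn) / (p + n),
        "accuracy": (tp + tn) / (p + n),
        "specificity": tn / n,
        "sensitivity": tp / p,
        "precision": tp / (tp + fp),
    }

m = metrics(tp=40, fp=10, tn=45, fn=5)
print(m["accuracy"])  # 0.85
```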
It is easy to verify that for any four positive numbers a, b, c, d, we have the double inequality

min{a/b, c/d} ≤ (a + c)/(b + d) ≤ max{a/b, c/d}.

This implies

min{sensitivity(t), specificity(t)} ≤ (TP(t) + TN(t))/(P + N) ≤ max{sensitivity(t), specificity(t)},

which is equivalent to

min{sensitivity(t), specificity(t)} ≤ acc(t) ≤ max{sensitivity(t), specificity(t)}.

In other words, the accuracy at t always lies between the sensitivity and the specificity at t.
Note that 1 − specificity(t) = 1 − TN(t)/N = FP(t)/N. This justifies referring to 1 − specificity(t) as the false-positive rate.
The F1 score considers both the precision and the sensitivity rates and is defined as their harmonic mean:

F_1(t) = 2 · (precision(t) · sensitivity(t)) / (precision(t) + sensitivity(t)).

More generally, the F_b score is defined as

F_b(t) = (1 + b²) · (precision(t) · sensitivity(t)) / (b² · precision(t) + sensitivity(t)).

Note that F_2 weighs sensitivity higher than precision, while F_0.5 weighs precision higher than sensitivity.
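A direct transcription of the two formulas into Python (the illustrative values are ours):

```python
def f_beta(precision, sensitivity, beta=1.0):
    """F_b score: weighted harmonic mean of precision and sensitivity."""
    b2 = beta ** 2
    return (1 + b2) * precision * sensitivity / (b2 * precision + sensitivity)

p, s = 0.8, 0.6
print(f_beta(p, s))       # F1, the plain harmonic mean
print(f_beta(p, s, 2.0))  # F2 leans toward sensitivity (closer to 0.6)
print(f_beta(p, s, 0.5))  # F0.5 leans toward precision (closer to 0.8)
```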
Suppose that the training set T consists of the pairs

(x_1, y_1), …, (x_m, y_m),

where y_i ∈ {−1, 1}; the examples with y_i = 1 and those with y_i = −1 form the set of positive examples and the set of negative examples, respectively.
T is linearly separable if there exists a hyperplane H_{v,a} : v'x = a (called the separating hyperplane) such that all positive examples lie in one half-space determined by H_{v,a} and all negative examples lie in the other half-space, as shown in Fig. 2. In other words, v and a can be chosen such that for all positive examples we shall have v'x_i − a > 0 and for all negative examples we shall have v'x_i − a < 0. Both conditions can be stated as

y_i(v'x_i − a) > 0    (1)

for 1 ≤ i ≤ m.
The distance between a point x_i and the hyperplane H_{v,a} is

d_i = |v'x_i − a| / ‖v‖ = y_i(v'x_i − a) / ‖v‖,

for 1 ≤ i ≤ m. Suppose now that every example lies at distance at least l from the hyperplane. This would imply that for the positive examples, we shall have

v'x_i − a − l‖v‖ ≥ 0,

and for the negative examples,

v'x_i − a + l‖v‖ ≤ 0.

Dividing by l‖v‖, these conditions become

w'x_i − b − 1 ≥ 0    (2)

for the positive examples and

w'x_i − b + 1 ≤ 0    (3)

for the negative examples, where w = (1/l) · v/‖v‖ and b = a/(l‖v‖). In a unified form, these restrictions can now be written as

y_i(w'x_i − b) ≥ 1

for 1 ≤ i ≤ m.
The distance between the hyperplanes w'x − b = 1 and w'x − b = −1 is 2/‖w‖, and we seek to maximize this distance in order to obtain a good separation between the classes. Thus, we need to minimize ‖w‖ subject to the restrictions y_i(w'x_i − b) ≥ 1 for 1 ≤ i ≤ m. An equivalent formulation brings this to a quadratic optimization problem, namely seeking w that is a solution of the problem:

minimize (1/2)‖w‖², where w ∈ R^n,
subject to 1 − y_i(w'x_i − b) ≤ 0 for 1 ≤ i ≤ m.
The Lagrangian of this problem is

L(w, b, u) = (1/2)‖w‖² + ∑_{i=1}^{m} u_i (1 − y_i(w'x_i − b)),

where u_i ≥ 0 are the Lagrange multipliers. The dual objective function is obtained as g(u) = inf_{w,b} L(w, b, u). This requires the stationarity conditions

∂L/∂w_j = 0 for 1 ≤ j ≤ n and ∂L/∂b = 0,

which amount to

∂L/∂w_j = w_j − ∑_{i=1}^{m} y_i u_i x_{ji} = 0 for 1 ≤ j ≤ n,
∂L/∂b = ∑_{i=1}^{m} u_i y_i = 0.

Since the first group of conditions gives

w = ∑_{i=1}^{m} y_i u_i x_i    (4)

and ∑_{i=1}^{m} u_i y_i = 0, the dual objective function is

g(u) = ∑_{i=1}^{m} u_i − (1/2) ∑_{i=1}^{m} ∑_{j=1}^{m} y_i y_j u_i u_j x_i'x_j.
When the classes are not linearly separable, nonnegative slack variables ξ_i are introduced, and the restrictions (2) and (3) are relaxed to

w'x_i − b − 1 ≥ −ξ_i    (5)

for the positive examples and

w'x_i − b + 1 ≤ ξ_i    (6)

for the negative examples, respectively. In turn, in a unified form these restrictions can be written as

1 − y_i(w'x_i − b) ≤ ξ_i.

The optimization problem becomes

minimize (1/2)‖w‖² + C ∑_{i=1}^{m} ξ_i, where w ∈ R^n,
subject to 1 − y_i(w'x_i − b) ≤ ξ_i for 1 ≤ i ≤ m,

and its dual is

maximize g(u) = ∑_{i=1}^{m} u_i − (1/2) ∑_{i=1}^{m} ∑_{j=1}^{m} y_i y_j u_i u_j x_i'x_j,
subject to 0 ≤ u_i ≤ C and ∑_{i=1}^{m} u_i y_i = 0.
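The soft-margin problem can also be attacked directly by subgradient descent on the hinge-loss form of the primal. The following toy Python sketch (not the dual quadratic program that packages such as kernlab solve; the data and step sizes are ours) separates two small 2-D clusters:

```python
def train_linear_svm(points, labels, C=1.0, lr=0.01, epochs=2000):
    """Minimize (1/2)*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i - b))
    by full-batch subgradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        gw, gb = [w[0], w[1]], 0.0  # gradient of the regularizer term
        for x, y in zip(points, labels):
            if y * (w[0] * x[0] + w[1] * x[1] - b) < 1:  # margin violated
                gw[0] -= C * y * x[0]  # subgradient of the hinge term
                gw[1] -= C * y * x[1]
                gb += C * y
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

def classify(w, b, x):
    """The sign of w.x - b decides the class."""
    return 1 if w[0] * x[0] + w[1] * x[1] - b > 0 else -1

# Two separable 2-D clusters: positives near (2, 2), negatives near the origin.
pts = [(2, 2), (2, 3), (3, 2), (0, 0), (0, 1), (1, 0)]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(pts, ys)
print([classify(w, b, x) for x in pts])  # [1, 1, 1, -1, -1, -1]
```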
Example 4.1 The kernlab library, described in Karatzoglou et al. (2004), provides users with essential access to support vector machine techniques. After installing the package, its loading is achieved using

> library(kernlab)

We split the data set iris into a training set trainIris and a test set testIris in the same manner used in Example 2.5.
The classifier is created by writing

> svm <- ksvm(Species ~ .,data=trainIris,kernel="vanilladot",
    C = 1,prob.model=TRUE)

and is used to generate distribution probabilities for the entries of the test set by writing

> pred_p <- predict(svm,testIris,type = "probabilities")
Note the use of the parameter kernel = “vanilladot”. We will explain later
the use of kernels.
These distributions can be examined:
> pred_p
setosa versicolor virginica
[1,] 0.948669677 0.0365527398 0.014777583
[2,] 0.971508823 0.0190805740 0.009410603
[3,] 0.987012019 0.0080849105 0.004903071
[4,] 0.950002416 0.0357236471 0.014273937
[5,] 0.659161885 0.2879288429 0.052909272
[6,] 0.017947111 0.9594198514 0.022633038
[7,] 0.012561988 0.9829687166 0.004469296
[8,] 0.017910234 0.9784817276 0.003608038
[9,] 0.008436607 0.9467301478 0.044833245
[10,] 0.012126227 0.9815816669 0.006292106
[11,] 0.028265376 0.9660266137 0.005708011
[12,] 0.052250902 0.9359484109 0.011800687
[13,] 0.001837466 0.0003496850 0.997812849
[14,] 0.006546816 0.0065769958 0.986876188
[15,] 0.005543471 0.0006948435 0.993761686
[16,] 0.001242060 0.0002903663 0.998467574
[17,] 0.012187320 0.0324955786 0.955317101
[18,] 0.019265185 0.3263600533 0.654374762
[19,] 0.005646642 0.0255939953 0.968759363
Note that in each case, one of the numbers strongly dominates the others, a
consequence of the linear separability of this data set. Alternatively, a prediction
that returns directly the class of various objects can be generated by
pred <- predict(svm, testIris, type = "response")
and generates
> pred
[1] setosa setosa setosa setosa setosa versicolor versicolor versicolor
[9] versicolor versicolor versicolor versicolor virginica virginica virginica virginica
[17] virginica virginica virginica
In many situations, data are not linearly separable; that is, there is no separating hyperplane between classes. Consider, for example, the set of points shown in Fig. 3, which are separated into positive and negative examples by a nonlinear surface rather than a hyperplane (in our two-dimensional case, by a curve rather than a line). The solution is to transform the data into another space, where the separating surface is transformed into a hyperplane such that the positive and negative examples inhabit the two half-spaces determined by the hyperplane. The data transformation is defined by a function $\phi : \mathbb{R}^n \to H$, where $H$ is a new linear space referred to as the feature space. The intention is to use a linear classifier in the new space to achieve separation between the representations of the positive and the negative examples in this new space.
We assume that the feature space $H$ is equipped with an inner product $(\cdot,\cdot) : H \times H \to \mathbb{R}$. In view of Equality (4), if the data are approximately linearly separable in the new space, the classification decision is based on computing
$$\sum_{i=1}^m y_i u_i\, \phi(x_i)'\phi(x) - b.$$
Fig. 3 A separating curve for the original data and the corresponding separating line in the feature space; negative examples are designated by ◦, positive examples by □
Example 4.2 The two-dimensional data set shown in Fig. 4 is clearly not linearly separable because no line can be drawn such that all positive points are on one side of the line and all negative points on the other.
Again, we use the kernlab package and its function ksvm. We apply a Gaussian kernel, which can be requested using the rbfdot value:
> svmrbf <- ksvm(class ~ x + y, data = points,
+ kernel = "rbfdot", C = 1)
If the data frame testdata contains the vectors $\binom{6}{6}$, $\binom{7}{8}$, and $\binom{8}{11}$, then the predictions of the classifier svmrbf, obtained with
> pred_points <- predict(svmrbf, testdata, type = "response")
are
> pred_points
[,1]
[1,] -0.03084342
[2,] -1.03816317
[3,] 1.21256792
Note that the first two test vectors, which are close to negative training examples, get negative predictions; the remaining test vector, which is close to positive examples, gets a positive prediction.
24 D. Simovici
5 Regression
Regression seeks functions that model data with minimal errors. It aims to describe
the relationships between dependent variables and independent variables and to
estimate values of dependent variables starting from values of independent
variables.
There are several types of regression: linear regression, logistic regression,
nonlinear regression, Cox regression, etc. We present here an introduction to linear
regression.
Linear regression considers models that assume that a variable Y is estimated to
be a linear function of the independent variables X1 ; . . .; Xn :
$$Y = a_0 + a_1X_1 + \cdots + a_nX_n.$$
For example, paired observations of heights and weights can be displayed using the usual plot function, as shown in Fig. 5. To produce the regression line, we call the linear modeling function lm:
lm.r <- lm(formula = weight ~ height)
Fig. 5 A plot of weight against height, together with the regression line
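The lm call above can be sketched in a self-contained way; the height and weight vectors below are made up for this illustration (the actual data behind Fig. 5 is not reproduced here):

```r
# Illustrative data: weight depends exactly linearly on height
# (weight = height - 102), so the fit is exact.
height <- c(160, 165, 170, 175, 180, 185)
weight <- c(58, 63, 68, 73, 78, 83)

lm.r <- lm(weight ~ height)   # fit the linear model
coefs <- coef(lm.r)           # intercept and slope of the regression line
```

Since the data are exactly linear, the fitted slope is 1 and the intercept is -102; calling abline(lm.r) would add the regression line to an existing plot.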
The use of support vector machines for regression was proposed in (Drucker et al. 1996). The model produced by support vector classification depends only on a subset of the training data, because the cost function for building the model ignores training points that lie beyond the margin. Another SVM version, known as the least squares support vector machine, was proposed in (Suykens and Vandewalle 1999).
6 Active Learning
Learning, as has been discussed up to this point, involves passive learners, that is,
learning algorithms where the information flows from data to learner.
A machine learning algorithm can achieve greater accuracy with fewer training
labels if it is allowed to choose the data from which it learns, that is, to apply active
learning. An active learner may pose queries, usually in the form of unlabeled data
instances to be labeled by a human operator. The flow of information between data
and the learner is bidirectional as shown in Fig. 6.
Since unlabeled data are abundant and, in many cases, easily obtainable, there
are good reasons to use this learning paradigm.
The training processes that allow us to construct data mining models often require a large volume of labeled data. For example, to produce a topic-based text classifier through text mining, a large number of documents must be labeled with the pertinent topics. This is an expensive process that requires numerous human readers capable of understanding these topics and attaching appropriate labels to the documents. Similarly, speech recognition requires labeling of a large number of speech fragments by specialized linguists, which is time consuming and prone to errors.
Active learning requires a querying strategy (see Settles 2012). One such
strategy is query by uncertainty (also known as uncertainty sampling), in which a
single classiﬁer is learned from labeled data and is subsequently utilized for
examining the unlabeled data. Those instances in the unlabeled data set that the
classiﬁer is least certain about are subject to classiﬁcation by a human annotator.
Query by uncertainty has been realized using a range of learners, such as logistic
regression (Lewis and Gale 1994), support vector machines (Schohn and Cohn
2000), and Markov models (Scheffer et al. 2001). The amount of data that require
annotation in order to reach a given performance, compared to passively learning
from examples provided in a random order, is signiﬁcantly reduced using query by
uncertainty.
Fig. 6 Passive learning versus active learning: in passive learning, information flows from the data set S through the learning algorithm to the model; in active learning, the flow between the data set and the learner is bidirectional
There are several modalities to implement query by uncertainty, and they require determining the data item $x_{lc}$ about whose labeling the learner is least confident.
The most common approach for selecting $x_{lc}$ is the use of entropy as a measure of uncertainty. If $Y$ is a random variable that ranges over all possible labels, then we shall seek $x_{lc}$ as $x_{lc} = \mathrm{argmax}_x\, H(Y|x)$.
Another approach requires the learner $C$ to evaluate the degree of confidence in its predictions. Let $x$ be a data item and let $\hat{y}$ be the label with the highest posterior probability according to $C$, that is, $\hat{y} = \mathrm{argmax}_y\, P_C(y|x)$. Then, $1 - P_C(\hat{y}|x)$ is the lack of confidence of $C$ in the label $\hat{y}$, and $x_{lc} = \mathrm{argmax}_x\, (1 - P_C(\hat{y}|x))$ is a data item for which $C$ is the least confident. The intervention of the human annotator will be required for $x_{lc}$.
Yet another strategy makes use of the output margin of a data item $x$, defined as the difference $P(\hat{y}_1|x) - P(\hat{y}_2|x)$ between the probability of the most likely label $\hat{y}_1$ and that of the second most likely label $\hat{y}_2$. For items with large margins, there is little uncertainty in the choice of the most likely label; therefore, items with small margins benefit most from an external annotation, and so, an external annotation will be required for $x_m$ defined by
$$x_m = \mathrm{argmin}_x \left(P(\hat{y}_1|x) - P(\hat{y}_2|x)\right).$$
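The three querying strategies can be sketched in a few lines of R; the matrix P of posterior probabilities below is an assumed toy input, standing in for the output of some classifier:

```r
# P[i, y] = P(y | x_i): assumed posteriors for three items and two labels.
P <- matrix(c(0.9, 0.1,
              0.6, 0.4,
              0.5, 0.5), ncol = 2, byrow = TRUE)

# Least confidence: maximize 1 - P(yhat | x).
least_confident <- which.max(1 - apply(P, 1, max))

# Smallest margin: minimize P(yhat1 | x) - P(yhat2 | x).
margins <- apply(P, 1, function(p) {
  s <- sort(p, decreasing = TRUE)
  s[1] - s[2]
})
smallest_margin <- which.min(margins)

# Maximum entropy: maximize H(Y | x).
entropies <- apply(P, 1, function(p) -sum(p[p > 0] * log(p[p > 0])))
max_entropy <- which.max(entropies)
```

On this toy input, all three strategies select the third item, whose label distribution is the most uncertain.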
Active learning may run into difficulties because, as shown in (Schütze et al. 2006; Velipasaoglu et al. 2007), learnable and unlearnable classes may co-occur in a data set. A class can be regarded as learnable if there exists a learning procedure that generates a classifier whose performance (e.g., the F1 measure) exceeds a certain threshold with a certain level of confidence.
For small classes, it is difﬁcult or impossible to create reliable classiﬁers. For
example, if a class contains 1 % of 1000 records, we have just ten examples for that
class and this is often not sufﬁcient for creating a classiﬁer.
In Dasgupta (2011), the following simple but paradigmatic example is used to describe the effect of active learning. Suppose that we have a data set $S = \{(x_i, y_i) \mid 1 \leq i \leq n\}$, where $x_i \in \mathbb{R}$ and $y_i \in \{-1, 1\}$, and we use a collection $H$ of simple thresholding classifiers of the form $h_t : \mathbb{R} \to \{-1, 1\}$, where
$$h_t(x) = \begin{cases} -1 & \text{if } x < t,\\ 1 & \text{if } x \geq t, \end{cases}$$
and $t$ is the threshold that defines the classifier $h_t$. The empirical error of the classifier $h_t$ is the fraction of examples it misclassifies, $\mathrm{err}(h_t) = \frac{1}{n}\,|\{i \mid h_t(x_i) \neq y_i\}|$.
The data are separable if a value $t_0$ exists such that $\mathrm{err}(h_{t_0}) = 0$. Note that if $n = 2$, the data are separable.
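Dasgupta's thresholding classifiers and their empirical error can be sketched directly; the sample below is an illustrative separable data set, not one from the text:

```r
# h_t(x) = -1 if x < t and 1 if x >= t.
ht <- function(t, x) ifelse(x >= t, 1, -1)

# Empirical error: the fraction of examples misclassified by h_t.
err <- function(t, x, y) mean(ht(t, x) != y)

x <- c(0.1, 0.4, 0.6, 0.9)
y <- c(-1, -1, 1, 1)
# Any threshold t with 0.4 < t <= 0.6 separates these data.
```

Here err(0.5, x, y) is 0, so the data are separable in the sense above, while a poorly chosen threshold such as t = 0 misclassifies half of the sample.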
7 Neural Networks
Artificial neural networks (NN) aim to emulate cognitive processes that take place in the human brain. Research in this direction started in the 1940s with the work of McCulloch and Pitts (1943) and Pitts and McCulloch (1947), who developed a computational model of the brain.
The human brain is a highly organized collection of a large number of interconnected and specialized cells called neurons. Neurons are engaged in computing activities that are carried out using chemical and electrical signals; connections between neurons are referred to as synapses, and the brain, as a large collection of simple computing units, has a high degree of parallelism.
The current model of an NN consists of a series of layers $L_1, \ldots, L_\ell$ of computing units. Units on the first layer $L_1$ are referred to as input units; those on the last layer $L_\ell$ are the output units, and the units in each layer beyond the first are neurons. Connections exist only between units that belong to consecutive layers.
A simple example of a NN is a perceptron that consists of n input units and one
neuron. Perceptrons can be trained to perform classiﬁcation on sets of objects of the
form ðx1 ; y1 Þ; . . .; ðxm ; ym Þ, where xi 2 Rn and yi 2 f1; 1g, and they achieve this
by constructing a separating hyperplane between the set of positive examples and
the set of negative examples whenever these sets are linearly separable. In this
respect, perceptrons are similar to support vector machines. However, the model
building is done in an iterative, speciﬁc way proposed by Rosenblatt (1958).
Several variants of this algorithm exist (Freund and Shapire 1999; Novikoff 1962).
A perceptron intended to analyze vectors x 2 Rn is deﬁned by n þ 1 numbers:
the weights w1 ; . . .; wn of the input units and a bias b as shown in Fig. 7.
In the simplest case (discussed next), the neuron itself is characterized by a transfer function that computes the answer $y = \mathrm{sign}(\mathit{net}(x))$, where $\mathit{net}(x) = w'x + b$.
The hyperplane defined by this perceptron is $w'x + b = 0$.
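Rosenblatt's iterative model building can be sketched as follows; the data set and the learning rate are illustrative assumptions:

```r
# A small linearly separable sample: rows of x are examples, y are labels.
x <- matrix(c( 2,  2,
               1,  3,
              -1, -2,
              -2, -1), ncol = 2, byrow = TRUE)
y <- c(1, 1, -1, -1)

w <- c(0, 0); b <- 0; eta <- 1   # initial weights, bias, learning rate
repeat {
  mistakes <- 0
  for (i in seq_len(nrow(x))) {
    if (y[i] * (sum(w * x[i, ]) + b) <= 0) {  # example misclassified
      w <- w + eta * y[i] * x[i, ]            # perceptron update rule
      b <- b + eta * y[i]
      mistakes <- mistakes + 1
    }
  }
  if (mistakes == 0) break  # a full pass without mistakes: training is done
}
```

For linearly separable data, the loop is guaranteed to terminate; a bound on the number of updates is established next.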
Figure 8a shows a sequence of examples that is linearly separable, while the sequence shown in Fig. 8b is not linearly separable.
Fig. 8 A linearly separable sequence and a sequence that is not linearly separable; positive examples are designated by square symbols, while circle symbols correspond to negative examples
Suppose there exists an optimal weight vector $w_{opt}$ with $\|w_{opt}\| = 1$ and an optimal bias $b_{opt}$ such that
$$y_i (w_{opt}' x_i + b_{opt}) \geq \gamma$$
for $1 \leq i \leq m$. Then, we claim that the number of mistakes made by the algorithm is at most
$$\left(\frac{2R}{\gamma}\right)^2,$$
where $R$ is a bound on the norms of the examples, $\|x_i\| \leq R$ for $1 \leq i \leq m$.
The algorithm begins with an augmented vector $\hat{w}_0 = 0$ and updates it at each mistake.
Let $\hat{w}_{t-1}$ be the augmented weight vector prior to the $t$th mistake. The $t$th update is performed when
$$y_i \hat{w}_{t-1}'\hat{x}_i = y_i (w_{t-1}' x_i + b_{t-1}) \leq 0,$$
where
$$\hat{w}_{t-1} = \begin{pmatrix} w_{t-1} \\ b_{t-1}/R \end{pmatrix} \quad\text{and}\quad \hat{x}_i = \begin{pmatrix} x_i \\ R \end{pmatrix}.$$
The update is
$$\hat{w}_t = \begin{pmatrix} w_t \\ b_t/R \end{pmatrix} = \begin{pmatrix} w_{t-1} + \eta y_i x_i \\ (b_{t-1} + \eta y_i R^2)/R \end{pmatrix} = \begin{pmatrix} w_{t-1} \\ b_{t-1}/R \end{pmatrix} + \eta y_i \begin{pmatrix} x_i \\ R \end{pmatrix} = \hat{w}_{t-1} + \eta y_i \hat{x}_i,$$
so we have
$$\hat{w}_{opt}'\hat{w}_t = \hat{w}_{opt}'\hat{w}_{t-1} + \eta y_i \hat{w}_{opt}'\hat{x}_i \geq \hat{w}_{opt}'\hat{w}_{t-1} + \eta\gamma.$$
By repeated application of the inequality $\hat{w}_{opt}'\hat{w}_t \geq \hat{w}_{opt}'\hat{w}_{t-1} + \eta\gamma$, we obtain
$$\hat{w}_{opt}'\hat{w}_t \geq t\eta\gamma.$$
Since $\hat{w}_t = \hat{w}_{t-1} + \eta y_i \hat{x}_i$, we have
$$\|\hat{w}_t\|^2 = \hat{w}_t'\hat{w}_t = (\hat{w}_{t-1} + \eta y_i \hat{x}_i)'(\hat{w}_{t-1} + \eta y_i \hat{x}_i)$$
$$= \|\hat{w}_{t-1}\|^2 + 2\eta y_i \hat{w}_{t-1}'\hat{x}_i + \eta^2\|\hat{x}_i\|^2$$
$$\leq \|\hat{w}_{t-1}\|^2 + \eta^2\|\hat{x}_i\|^2 \quad (\text{because } y_i\hat{w}_{t-1}'\hat{x}_i \leq 0 \text{ when an update occurs})$$
$$= \|\hat{w}_{t-1}\|^2 + \eta^2(\|x_i\|^2 + R^2) \leq \|\hat{w}_{t-1}\|^2 + 2\eta^2 R^2,$$
so, by repeated application of this inequality, $\|\hat{w}_t\|^2 \leq 2t\eta^2 R^2$;
hence, we have
$$\sqrt{2t}\,\eta R\,\|\hat{w}_{opt}\| \geq \|\hat{w}_{opt}\|\,\|\hat{w}_t\| \geq \hat{w}_{opt}'\hat{w}_t \geq t\eta\gamma,$$
which implies
$$t \leq 2\left(\frac{\|\hat{w}_{opt}\|\, R}{\gamma}\right)^2 \leq \left(\frac{2R}{\gamma}\right)^2,$$
since $\|\hat{w}_{opt}\|^2 = 1 + b_{opt}^2/R^2 \leq 2$ when $|b_{opt}| \leq R$.
In the case of the perceptron considered above, the transfer function is the signum function
$$\mathrm{sign}(x) = \begin{cases} 1 & \text{if } x \geq 0,\\ -1 & \text{if } x < 0 \end{cases}$$
for $x \in \mathbb{R}$. We mention a few other choices that exist for the transfer function:
• the sigmoid or the logistic function $h(x) = \frac{1}{1 + e^{-x}}$,
• the hyperbolic tangent $h(x) = \tanh(x)$,
• the Gaussian function $h(x) = a e^{-x^2/2}$,
for $x \in \mathbb{R}$. The advantage of these last three choices is their differentiability, which enables us to apply optimization techniques to more complex NNs. Note, in particular, that if $h$ is the sigmoid transfer function, then
$$h'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = h(x)\left(1 - h(x)\right), \qquad (7)$$
which turns out to be a very useful property. To emphasize the choices that we have
for the transfer function, it is useful to think that a neuron has the structure shown in
Fig. 9.
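Equality (7) can be checked numerically; this quick sketch compares the analytic derivative with a central finite difference:

```r
h <- function(x) 1 / (1 + exp(-x))   # the sigmoid (logistic) function

x0 <- 0.7
numeric_deriv  <- (h(x0 + 1e-6) - h(x0 - 1e-6)) / (2e-6)
analytic_deriv <- h(x0) * (1 - h(x0))   # Equality (7)
```

The two values agree to many decimal places, which is the property exploited when differentiating the error of a network built from sigmoid neurons.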
A multilayer NN is a much more capable classifier than the perceptron. It is, however, more complex because of the topology of the network, which entails multiple connection weights, multiple outputs, and more complex neurons.
The speciﬁcation of the architecture of a NN encompasses the following ele
ments (see Fig. 10):
(i) the choice of $\ell$, the number of levels; the first level $L_1$ contains the input units, the last level $L_\ell$ contains the output units, and the intermediate levels $L_2, \ldots, L_{\ell-1}$ contain the hidden units;
Fig. 9 The structure of a neuron: the inputs $x_1, \ldots, x_n$ are combined using the weights $w_1, \ldots, w_n$ into $\mathit{net} = \sum_{i=1}^n w_i x_i$, and the output is $h(\mathit{net})$
(ii) the connection from unit $N_i$ on level $L_k$ to unit $N_j$ on level $L_{k+1}$ has the weight $w_{ji}$; the set of units on level $L_{k+2}$ that are connected to unit $N_j$ is the downstream set of $N_j$, denoted by $\mathit{ds}(N_j)$;
(iii) the type of neurons used in the network, as defined by their transfer functions.
Let $X$ be the set of examples that are used in training the network. For $x \in X$, we have a vector of target outputs $t(x)$ and a vector of actual outputs $o(x)$, both in $\mathbb{R}^p$, where $p$ is the number of output units. The output that corresponds to a unit $N_j$ is denoted by $o_{x,j}$. For a weight vector $w$ of the network, the total error is
$$E(w) = \frac{1}{2}\sum_{x \in X} \|t(x) - o(x)\|^2.$$
The information is propagated from the input to the output layer. This justiﬁes
referring to the architecture of this network as a feedforward network.
The weights are adjusted in the direction of the negative gradient of the error,
$$\Delta w_{ji} = -\eta \frac{\partial E(w)}{\partial w_{ji}},$$
where the learning rate $\eta$ is a small positive number. Initially, the weights of the edges are randomly set as numbers having small absolute values (e.g., between −0.05 and 0.05) (cf. Mitchell 1997). These weights are successively modified as we show next.
To evaluate the partial derivatives of the form $\frac{\partial E(w)}{\partial w_{ji}}$, we need to take into account that $E(w)$ depends on $w_{ji}$ through $\mathit{net}_j$; therefore,
$$\frac{\partial E(w)}{\partial w_{ji}} = \frac{\partial E(w)}{\partial \mathit{net}_j} \cdot \frac{\partial \mathit{net}_j}{\partial w_{ji}}.$$
(i) If $N_j$ is an output neuron, then $E(w)$ depends on $\mathit{net}_j$ through the output $o_j$ of the unit $N_j$, where $o_j = h(\mathit{net}_j)$. Thus,
$$\frac{\partial E(w)}{\partial \mathit{net}_j} = -(t_j - o_j)\, h(\mathit{net}_j)\left(1 - h(\mathit{net}_j)\right).$$
(ii) When $N_j$ is a hidden unit, $E(w)$ depends on $\mathit{net}_j$ via the functions $\mathit{net}_k$ for all neurons $N_k$ situated downstream from $N_j$. In turn, each $\mathit{net}_k$ depends on $o_j$, which depends on $\mathit{net}_j$. This allows us to write:
Observe that
$$\frac{\partial o_j}{\partial \mathit{net}_j} = h'(\mathit{net}_j) = h(\mathit{net}_j)\left(1 - h(\mathit{net}_j)\right)$$
yields
$$\frac{\partial E(w)}{\partial \mathit{net}_j} = o_j(1 - o_j) \sum_{N_k \in \mathit{ds}(N_j)} w_{kj} \frac{\partial E(w)}{\partial \mathit{net}_k}.$$
If $d_i = \frac{\partial E(w)}{\partial \mathit{net}_i}$ for every neuron $N_i$, then
$$d_j = \begin{cases} -(t_j - o_j)\, h(\mathit{net}_j)\left(1 - h(\mathit{net}_j)\right) & \text{if } N_j \text{ is an output neuron,}\\ o_j(1 - o_j) \sum_{N_k \in \mathit{ds}(N_j)} w_{kj}\, d_k & \text{if } N_j \text{ is a hidden neuron,} \end{cases}$$
and $\frac{\partial E(w)}{\partial w_{ji}} = d_j\, o_i$, where $o_i$ is the $i$th input of unit $N_j$.
Observe that the weight updates proceed from the output layer toward the inner
layers, which justiﬁes the name of the algorithm.
Next, we present an example of NN construction using the package neuralnet developed in (Günther and Fritsch 2010). The package computes NNs with
one hidden layer with a prescribed number of neurons. The computation of the NN
model is achieved by calling
nnmodel <- neuralnet(target ~ predictors, data = inputdata,
+ hidden = h)
where target ~ predictors is the formula that specifies the model, and hidden gives the number of neurons in the hidden layer.
Example 7.2 We use the data set Concrete_Compressive_Strength (CCS) that is available from the data mining repository at UCI. The ingredients of concrete include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate. The data set records 1030 observations and has nine numerical attributes. Data are presented in raw form (not scaled), and various attributes have distinct ranges (see Table 5).
The first seven attributes are expressed in kg/m³. The data (originally in xls format) are read in R using the csv format as
> CCS <- read.csv("CCS.csv")
> head(CCS)
cem blast ash water plast coarse fine age strength
1 540.0 0.0 0 162 2.5 1040.0 676.0 28 79.99
2 540.0 0.0 0 162 2.5 1055.0 676.0 28 61.89
3 332.5 142.5 0 228 0.0 932.0 594.0 270 40.27
4 332.5 142.5 0 228 0.0 932.0 594.0 365 41.05
5 198.6 132.4 0 192 0.0 978.4 825.5 360 44.30
6 266.0 114.0 0 228 0.0 932.0 670.0 90 47.03
Since the scale of the attributes is quite distinct, the data are normalized using the
function normalize deﬁned in (Lantz 2013) as
normalize <- function(x) {
+ return((x - min(x)) / (max(x) - min(x)))
+ }
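A quick check that normalize behaves as intended on an illustrative vector (here, a few of the age values from the listing above):

```r
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

v  <- c(28, 90, 270, 365)
nv <- normalize(v)   # each value is mapped into [0, 1]
```

The minimum of the input is mapped to 0, the maximum to 1, and the other values are scaled linearly in between.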
Applying this function to each column of CCS yields the normalized data set CCSN; its first few records (truncated to two decimals) are:
> head(CCSN)
cem blast ash water plast coarse fine age strength
1 1.00 0.00 0 0.32 0.07 0.69 0.20 0.07 0.96
2 1.00 0.00 0 0.32 0.07 0.73 0.20 0.07 0.74
3 0.52 0.39 0 0.84 0.00 0.38 0.00 0.73 0.47
4 0.52 0.39 0 0.84 0.00 0.38 0.00 1.00 0.48
5 0.22 0.36 0 0.56 0.00 0.51 0.58 0.98 0.52
6 0.37 0.31 0 0.84 0.00 0.38 0.19 0.24 0.55
The resulting neural net nnet4 can be displayed using plot(nnet4) and is shown in Fig. 11.
Fig. 11 The neural network computed for the CCSN data set; the input units correspond to the attributes cem, blast, ash, water, plast, coarse, fine, and age, the output unit corresponds to strength, and the edge labels show the computed connection weights
Once a neural net is created, the compute function of neuralnet can be used to calculate and summarize the output of each neuron; it can be used to predict outputs for new combinations of values of the attributes.
Example 7.3 Consider some new combinations of values for the eight predictive
attributes of CCSN deﬁned by
newconc <- matrix(c(1.00, 0.2, 0.1, 0.1, 0.1, 0.8, 0.8, 0.9,
0.9, 0.5, 0.1, 0.4, 0.1, 0.5, 0.5, 0.2),
byrow = TRUE, ncol = 8)
Using
new.output <- compute(nnet4, newconc)
new.output$net.result
8 Bibliographic Guide
Data mining and machine learning have generated a vast collection of references. Among more advanced texts, we recommend (Abu-Mostafa et al. 2012; Bishop 2007; Murphy 2012; Shalev-Shwartz and Ben-David 2014; Zaki and Meira 2014; Mohri et al. 2012).
A large number of books deal with the R system and its applications to machine learning and data mining. We mention (Lander 2014; Maindonald and Braun 2004; Matloff 2011; Wickham 2009) as general references on R; books specialized in machine learning applications are (Lantz 2013; Zhao 2013; Shao and Cen 2014).
A very lucid and helpful survey of active learning is (Settles 2012).
The current literature dedicated to support vector machines includes books written at various levels of mathematical sophistication, ranging from accessible titles (Cristianini and Shawe-Taylor 2000; Kung 2014; Statnikov et al. 2011; Suykens et al. 2005) to more advanced ones (Shawe-Taylor and Cristianini 2005, 2008).
A comprehensive discussion related to the implementation of SVM in the kernlab package of R is presented in (Karatzoglou et al. 2004; Karatzoglu et al. 2006).
A Hyperplanes
$$H_{v,a} = \{x \in \mathbb{R}^n \mid v'(x - x_0) = 0\},$$
where $x_0 \in H_{v,a}$.
Any hyperplane $H_{v,a}$ partitions $\mathbb{R}^n$ into three sets:
$$H_{v,a}^{>} = \{x \in \mathbb{R}^n \mid v'x > a\},$$
$$H_{v,a}^{0} = H_{v,a},$$
$$H_{v,a}^{<} = \{x \in \mathbb{R}^n \mid v'x < a\}.$$
The sets $H_{v,a}^{>}$ and $H_{v,a}^{<}$ are the positive and negative open half-spaces determined by $H_{v,a}$, respectively. The sets
$$H_{v,a}^{\geq} = \{x \in \mathbb{R}^n \mid v'x \geq a\},$$
$$H_{v,a}^{\leq} = \{x \in \mathbb{R}^n \mid v'x \leq a\}$$
are the positive and negative closed half-spaces determined by $H_{v,a}$, respectively.
If $x_1, x_2 \in H_{v,a}$, then $v \perp (x_1 - x_2)$. This justifies referring to $v$ as the normal to the hyperplane $H_{v,a}$. Observe that a hyperplane is fully determined by a vector $x_0 \in H_{v,a}$ and by $v$.
Let $x_0 \in \mathbb{R}^n$ and let $H_{v,a}$ be a hyperplane. We seek $x \in H_{v,a}$ such that $\|x - x_0\|_2$ is minimal. Finding $x$ amounts to minimizing the function $f(x) = \|x - x_0\|_2^2 = \sum_{i=1}^n (x_i - x_{0i})^2$ subject to the constraint $v_1x_1 + \cdots + v_nx_n - a = 0$. Using the Lagrangean $L(x) = f(x) + \lambda(v'x - a)$ with the multiplier $\lambda$, we impose the conditions
$$\frac{\partial L}{\partial x_i} = 0 \text{ for } 1 \leq i \leq n,$$
which amount to
$$\frac{\partial f}{\partial x_i} + \lambda v_i = 0,$$
that is, $2(x_i - x_{0i}) + \lambda v_i = 0$ for $1 \leq i \leq n$, so $x = x_0 - \frac{1}{2}\lambda v$. Since $x \in H_{v,a}$,
$$v'x = v'x_0 - \frac{1}{2}\lambda v'v = a.$$
Thus,
$$\lambda = 2\,\frac{v'x_0 - a}{v'v} = 2\,\frac{v'x_0 - a}{\|v\|_2^2},$$
which yields
$$x = x_0 - \frac{v'x_0 - a}{\|v\|_2^2}\, v.$$
The smallest distance between $x_0$ and a point in the hyperplane $H_{v,a}$ is therefore given by
$$\|x_0 - x\| = \frac{|v'x_0 - a|}{\|v\|_2^2}\,\|v\| = \frac{|v'x_0 - a|}{\|v\|_2}.$$
If we define the distance $d(H_{v,a}, x_0)$ between $x_0$ and $H_{v,a}$ as this smallest distance, we have:
$$d(H_{v,a}, x_0) = \frac{|v'x_0 - a|}{\|v\|_2}. \qquad (8)$$
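Equality (8) translates directly into a short function; the hyperplane and point below are illustrative:

```r
# Distance from x0 to the hyperplane H_{v,a} = { x : v'x = a }, as in (8).
dist_hyperplane <- function(v, a, x0) abs(sum(v * x0) - a) / sqrt(sum(v^2))

# Distance from the origin to the line x1 + x2 = 2 in R^2: 2/sqrt(2) = sqrt(2).
d <- dist_hyperplane(c(1, 1), 2, c(0, 0))
```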
If $f$ is a convex function defined on an interval $I$ and $t_1, \ldots, t_n \in [0,1]$ are such that $\sum_{i=1}^n t_i = 1$, then Jensen's inequality states that
$$f\left(\sum_{i=1}^n t_i x_i\right) \leq \sum_{i=1}^n t_i f(x_i)$$
for every $x_1, \ldots, x_n \in I$.
Proof The argument is by induction on $n$, where $n \geq 2$. The basis step, $n = 2$, follows immediately from the definition of convex functions.
Suppose that the statement holds for $n$, and let $t_1, \ldots, t_n, t_{n+1}$ be $n + 1$ numbers such that $\sum_{i=1}^{n+1} t_i = 1$. We have
Combining this inequality with the previous inequality gives the desired con
clusion. h
Example B.2 It is easy to verify that the function $f(x) = x^a$ is convex on $\mathbb{R}_{>0}$ if $a > 1$ because $f''(x) = a(a-1)x^{a-2} > 0$ for $x > 0$. Therefore, if $t_1, \ldots, t_n \in [0,1]$ and $\sum_{i=1}^n t_i = 1$, by applying Jensen's inequality to $f$, we obtain the inequality:
$$\left(\sum_{i=1}^n t_i x_i\right)^a \leq \sum_{i=1}^n t_i x_i^a.$$
In particular, taking $t_1 = \cdots = t_n = \frac{1}{n}$ gives
$$\left(\frac{1}{n}\sum_{i=1}^n x_i\right)^a \leq \frac{1}{n}\sum_{i=1}^n x_i^a,$$
so
$$\sum_{i=1}^n x_i^a \geq n^{1-a}\left(\sum_{i=1}^n x_i\right)^a.$$
When $\sum_{i=1}^n x_i = 1$, the previous inequality implies $\sum_{i=1}^n x_i^a \geq n^{1-a}$.
Let $\mathrm{part}(S)$ be the set of partitions of the set $S$. A partial order "$\leq$" can be defined on $\mathrm{part}(S)$ by $\pi \leq \pi'$ if each block $B'$ of $\pi'$ is a union of blocks of the partition $\pi$.
Example B.3 For $S = \{x_i \mid 1 \leq i \leq 6\}$, consider the partitions
The partition $\alpha_S$ whose blocks are the singletons $\{x\}$, where $x \in S$, is the least partition defined on $S$. The partition $\omega_S$ that consists of a single block equal to $S$ is the largest partition on $S$.
Let $\pi, \sigma$ be two partitions of a set $S$, where $\pi = \{B_1, \ldots, B_n\}$ and $\sigma = \{C_1, \ldots, C_m\}$. The partition $\pi \wedge \sigma$ of $S$ consists of all nonempty intersections of the form $B_i \cap C_j$, where $1 \leq i \leq n$ and $1 \leq j \leq m$. Clearly, we have $\pi \wedge \sigma \leq \pi$ and $\pi \wedge \sigma \leq \sigma$. Moreover, if $\tau$ is a partition of $S$ such that $\tau \leq \pi$ and $\tau \leq \sigma$, then $\tau \leq \pi \wedge \sigma$.
If $T \subseteq S$ is a nonempty subset of $S$, then any partition $\pi = \{B_1, \ldots, B_n\}$ of $S$ determines a partition $\pi_T$ on $T$ (the trace of $\pi$ on $T$) defined by
$$\pi_T = \{B_i \cap T \mid 1 \leq i \leq n \text{ and } B_i \cap T \neq \emptyset\}.$$
For example, if $\pi = \{\{x_1, x_2\}, \{x_6\}, \{x_3, x_5\}, \{x_4\}\}$, the trace of $\pi$ on the set $T = \{x_1, x_2, x_5, x_6\}$ is the partition $\pi_T = \{\{x_1, x_2\}, \{x_6\}, \{x_5\}\}$.
A subset $T$ of $S$ is $\pi$-pure if $T$ is included in a block of $\pi$ or, equivalently, if $\pi_T = \omega_T$.
Let $\pi = \{B_1, \ldots, B_n\}$ be a partition of a finite set $S$ and let $x_i = \frac{|B_i|}{|S|}$ for $1 \leq i \leq n$. Since $\sum_{i=1}^n x_i = 1$, we have the inequality
$$1 - \sum_{i=1}^n \left(\frac{|B_i|}{|S|}\right)^a \leq 1 - n^{1-a}.$$
Definition B.4 The $a$-entropy $H_a(\pi)$ of the partition $\pi = \{B_1, \ldots, B_n\}$ of the set $S$ is given by
$$H_a(\pi) = \frac{1}{1 - 2^{1-a}}\left(1 - \sum_{i=1}^n \left(\frac{|B_i|}{|S|}\right)^a\right).$$
Example B.5 Starting with the convex function $g(x) = x \ln x$ (whose second derivative $g''(x) = \frac{1}{x}$ is positive for $x > 0$), Jensen's inequality implies:
$$\left(\sum_{i=1}^n t_i x_i\right) \ln\left(\sum_{i=1}^n t_i x_i\right) \leq \sum_{i=1}^n t_i x_i \ln x_i.$$
Taking $t_1 = \cdots = t_n = \frac{1}{n}$, this becomes
$$(x_1 + \cdots + x_n) \ln \frac{x_1 + \cdots + x_n}{n} \leq \sum_{i=1}^n x_i \ln x_i.$$
For $x_i = \frac{|B_i|}{|S|}$, where $\pi = \{B_1, \ldots, B_n\}$ is a partition of $S$, the values $x_i$ sum to 1 and we obtain
$$\ln n \geq -\sum_{i=1}^n \frac{|B_i|}{|S|} \ln \frac{|B_i|}{|S|}.$$
The quantity $H(\pi) = -\sum_{i=1}^n \frac{|B_i|}{|S|} \ln \frac{|B_i|}{|S|}$ is the Shannon entropy of $\pi$. Its maximum value $\ln n$ is obtained when the blocks of $\pi$ have equal size.
Note that $\lim_{a \to 1} H_a(\pi) = H(\pi)$. In other words, Shannon's entropy is a limit case of the $H_a$ entropy.
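The Shannon entropy and the $a$-entropy of a partition can be computed from the block sizes; in this sketch the Shannon entropy uses base-2 logarithms, the base matched by the normalization factor $1 - 2^{1-a}$ of Definition B.4 (a choice made for this illustration):

```r
# Shannon entropy (base 2) of a partition given by its block sizes.
shannon <- function(sizes) {
  p <- sizes / sum(sizes)
  -sum(p * log2(p))
}

# The a-entropy of Definition B.4.
halpha <- function(sizes, a) {
  p <- sizes / sum(sizes)
  (1 - sum(p^a)) / (1 - 2^(1 - a))
}

H      <- shannon(c(5, 5, 5, 5))          # four equal blocks
Hnear1 <- halpha(c(5, 5, 5, 5), 1.000001) # a close to 1 approximates H
```

For four equal blocks both values are close to 2, illustrating that the $a$-entropy tends to the Shannon entropy as $a \to 1$.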
Let $\pi, \sigma$ be two partitions of a set $S$, where $\pi = \{B_1, \ldots, B_n\}$ and $\sigma = \{C_1, \ldots, C_m\}$. The conditional entropy $H_a(\pi|\sigma)$ is defined by
$$H_a(\pi|\sigma) = \sum_{j=1}^m H_a(\pi_{C_j}) \left(\frac{|C_j|}{|S|}\right)^a.$$
Since $H_a(\pi_{C_j}) = \frac{1}{1 - 2^{1-a}}\left(1 - \sum_{i=1}^n \left(\frac{|B_i \cap C_j|}{|C_j|}\right)^a\right)$, it follows that
$$H_a(\pi \wedge \sigma) = H_a(\pi|\sigma) + H_a(\sigma).$$
Various types of entropies are used to evaluate the impurity of a set relative to a partition. Namely, for a partition $\kappa$ of $S$, $H_a(\kappa)$ ranges from 0 (when the partition $\kappa$ consists of one block and, therefore, is pure) to $\frac{1 - n^{1-a}}{1 - 2^{1-a}}$ (when the partition consists of $n$ singletons and, therefore, has the highest degree of impurity).
C Optimization with Constraints
An optimization problem consists of finding a local minimum or a local maximum of a function $f : \mathbb{R}^n \to \mathbb{R}$, when such an extremum exists. The function $f$ is referred to as the objective function. Note that finding a local minimum of a function $f$ is equivalent to finding a local maximum of the function $-f$.
In constrained optimization, additional conditions are imposed on the argument of the objective function. A typical formulation of a constrained optimization problem is to minimize $f(x)$ subject to constraints that define a feasible region $R \subseteq \mathbb{R}^n$.
If the feasible region $R$ is nonempty and bounded, then, under certain conditions, a solution exists. If $R = \emptyset$, we say that the constraints are inconsistent.
Note that equality constraints can be replaced in a constrained optimization problem by inequality constraints. Indeed, a constraint of the form $c(x) = 0$ can be replaced by the pair of constraints $c(x) \geq 0$ and $-c(x) \geq 0$.
Let $x \in R$ be a feasible solution and let $c(x) \geq 0$ be an inequality constraint used to define $R$. If $x \in R$ and $c(x) = 0$, we say that $c$ is an active constraint.
Consider the following optimization problem for an objective function $f : \mathbb{R}^n \to \mathbb{R}$, a compact set $S \subseteq \mathbb{R}^n$, and the constraint functions $c : \mathbb{R}^n \to \mathbb{R}^m$ and $d : \mathbb{R}^n \to \mathbb{R}^p$: minimize $f(x)$ for $x \in S$, subject to $c(x) \leq 0_m$ and $d(x) = 0_p$.
Both the objective function $f$ and the constraint functions $c, d$ are assumed to be continuously differentiable. We shall refer to this optimization problem as the primal problem.
Definition C.1 The Lagrangean associated with this optimization problem is the function $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ given by
$$L(x, u, v) = f(x) + u'c(x) + v'd(x).$$
The dual optimization problem starts with the Lagrange dual function $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ defined by
$$g(u, v) = \inf\{L(x, u, v) \mid x \in S\}$$
and consists of maximizing $g(u, v)$ subject to $u \geq 0_m$.
Example C.3 Consider the linear optimization problem
$$\text{minimize } a'x, \text{ where } x \in \mathbb{R}^n,$$
$$\text{subject to } x \geq 0_n \text{ and } Ax - b = 0_p.$$
The constraint functions are $c(x) = -x$ and $d(x) = Ax - b$, and the Lagrangean $L$ is
$$L(x, u, v) = a'x - u'x + v'(Ax - b) = -v'b + (a' - u' + v'A)x.$$
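The example can be completed by evaluating the dual function explicitly; the following derivation is a routine sketch under the minimization convention used above:
$$g(u, v) = \inf_{x \in \mathbb{R}^n}\left[-v'b + (a' - u' + v'A)x\right] = \begin{cases} -v'b & \text{if } a' - u' + v'A = 0_n',\\ -\infty & \text{otherwise.} \end{cases}$$
Thus, the dual problem amounts to maximizing $-b'v$ subject to $a + A'v = u \geq 0_n$, which is the familiar dual of a linear program.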
Example C.4 Let us consider a variant of the problem discussed in Example C.3. The objective function is again $f(x) = a'x$. However, now we have only the inequality constraints $c(x) \leq 0_m$, where $c(x) = b - Ax$, $A \in \mathbb{R}^{m \times n}$, and $b \in \mathbb{R}^m$. Thus, the primal problem can be stated as
$$\text{minimize } a'x, \text{ where } x \in \mathbb{R}^n,$$
$$\text{subject to } Ax \geq b.$$
The Lagrangean $L$ is
$$L(x, u) = a'x + u'(b - Ax) = u'b + (a' - u'A)x,$$
so $g(u) = \inf_{x \in \mathbb{R}^n} L(x, u)$ is finite only when $a' - u'A = 0_n'$, and the dual problem is
$$\text{maximize } b'u \text{ subject to } a' - u'A = 0_n' \text{ and } u \geq 0_m.$$
The problem
$$\text{minimize } \frac{1}{2}x'Qx - r'x, \text{ where } x \in \mathbb{R}^n,$$
$$\text{subject to } Ax \geq b,$$
where $Q \in \mathbb{R}^{n \times n}$ is a positive definite matrix, $r \in \mathbb{R}^n$, $A \in \mathbb{R}^{p \times n}$, and $b \in \mathbb{R}^p$, is known as a quadratic optimization problem.
The Lagrangean $L$ is
$$L(x, u) = \frac{1}{2}x'Qx - r'x + u'(Ax - b) = \frac{1}{2}x'Qx + (u'A - r')x - u'b,$$
and the dual function is $g(u) = \inf_{x \in \mathbb{R}^n} L(x, u)$ subject to $u \geq 0_p$. Since $x$ is unconstrained in the definition of $g$, the minimum is attained when we have the equalities
$$\frac{\partial}{\partial x_i}\left(\frac{1}{2}x'Qx + (u'A - r')x - u'b\right) = 0 \text{ for } 1 \leq i \leq n.$$
because $u \geq 0_m$, $c(x_0) \leq 0_m$, and $d(x_0) = 0_p$, which yields the desired inequality. □
Corollary C.7 For the functions involved in the primal and dual problems, we have
$$\sup\{g(u, v) \mid u \geq 0_m\} \leq \inf\{f(x) \mid x \in S, c(x) \leq 0_m, d(x) = 0_p\}.$$
Proof This inequality follows immediately from the proof of Theorem C.6. □
Corollary C.8 If $f(x^*) \leq g(u^*, v^*)$, where $u^* \geq 0_m$ and $c(x^*) \leq 0_m$, then $x^*$ is a solution of the primal problem and $u^*$ is a solution of the dual problem.
Furthermore, if $\sup\{g(u, v) \mid u \geq 0_m\} = \infty$, then there is no solution of the primal problem.
Proof These statements are an immediate consequence of Corollary C.7. □
Example C.9 Consider the primal problem of minimizing $f(x) = x_1^2 + x_2^2$ subject to $x_1 \geq 1$.
It is clear that the minimum of $f(x)$ is obtained for $x_1 = 1$ and $x_2 = 0$ and this minimum is 1. The Lagrangean leads to the dual function
$$g(u_1) = \inf\{x_1^2 + x_2^2 + u_1(x_1 - 1) \mid x \in \mathbb{R}^2\} = -u_1 - \frac{u_1^2}{4}.$$
Then, $\sup\{g(u_1) \mid u_1 \geq 0\} = 0$, and a gap exists between the minimal value of the primal function and the maximal value of the dual function.
The possible gap that exists between $\inf\{f(x) \mid x \in S, c(x) \leq 0_m\}$ and $\sup\{g(u, v) \mid u \geq 0_m\}$ is known as the duality gap.
A stronger result holds if certain conditions involving the restrictions are satisfied:
Theorem C.10 (Strong Duality Theorem) Let $C$ be a nonempty convex subset of $\mathbb{R}^n$, let $f : \mathbb{R}^n \to \mathbb{R}$ and $c : \mathbb{R}^n \to \mathbb{R}^m$ be convex functions, and let $d : \mathbb{R}^n \to \mathbb{R}^p$ be given by $d(x) = Ax - b$, where $A \in \mathbb{R}^{p \times n}$ and $b \in \mathbb{R}^p$.
Consider the primal problem of minimizing $f(x)$ for $x \in C$ subject to $c(x) \leq 0_m$ and $d(x) = 0_p$.
Suppose that there exists $z \in C$ such that $c(z) < 0_m$ and $d(z) = 0_p$; additionally, $0_p$ belongs to the interior of $d(C)$. Then, the duality gap is zero.
References
Abu-Mostafa YS, Magdon-Ismail M, Lin HT (2012) Learning from data. AMLBook, AMLbook.com
Anderson E (1936) The species problem in iris. Ann Mo Bot Gard 23:457–509
Bishop CM (2007) Pattern recognition and machine learning. Springer, New York
Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK (1989) Learnability and the Vapnik–Chervonenkis dimension. J ACM 36(4):929–965
Breiman L, Friedman JH, Olshen RO, Stone CS (1998) Classiﬁcation and regression trees.
Chapman and Hall, Boca Raton (reprint edition)
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Cristianini N, Shawe-Taylor J (2000) Support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
Dasgupta S (2011) Two faces of active learning. Theoret Comput Sci 412:1767–1781
Drucker H, Burges CJC, Kaufman L, Smola AJ, Vapnik V (1996) Support vector regression
machines. In: Advances in neural information processing systems 9, NIPS, Denver, CO, USA,
2–5 Dec 1996, pp 155–161
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics
7:179–188
Freund Y, Shapire RE (1999) Large margin classiﬁcation using the perceptron algorithm. Mach
Learn 37:277–296
Günther F, Fritsch S (2010) Neuralnet: training of neural networks. R J 2(1):30–38
Karatzoglou A, Smola A, Hornik K, Zeileis A (2004) kernlab—an S4 package for kernel methods in R. J Stat Softw 11:1–20
Karatzoglu A, Meyer DM, Hornik K (2006) Support vector machines in R. J Stat Softw 15:1–28
Kung SY (2014) Kernel methods and machine learning. Cambridge University Press, Cambridge
Lander J (2014) R for everyone. Addison-Wesley, Upper Saddle River
Lantz B (2013) Machine learning with R. PACKT Publishing, Birmingham
Lewis DD, Gale WA (1994) A sequential algorithm for training text classiﬁers. In: Proceedings of
the 17th annual international ACM SIGIR conference on research and development in
information retrieval, SIGIR'94, pp 3–12. Springer-Verlag New York, Inc, New York
Maindonald J, Braun J (2004) Data analysis and graphics using R—an example-based approach. Cambridge University Press, Cambridge
Matloff N (2011) The art of R programming—a tour of statistical software design. No Starch Press, San Francisco
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophy 5:115–133
Mitchell TM (1997) Machine learning. McGrawHill, Boston
Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations of machine learning. MIT Press,
Cambridge
Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press, Cambridge
Novikoff ABJ (1962) On convergence proofs on perceptrons. In: Proceedings of the symposium
on mathematical theory of automata 12:615–622
Pitts W, McCulloch WS (1947) How we know universals—the perception of auditory and visual
forms. Bull Math Biophys 9:127–147
Quinlan JR (1993) C 4.5 programs for machine learning. Morgan Kaufmann Publ., San Mateo
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and
organization in the brain. Psychol Rev 65:386–407
Scheffer T, Decomain C, Wrobel S (2001) Active hidden Markov models for information
extraction. In: Advances in intelligent data analysis, 4th international conference, IDA 2001.
Cascais, Portugal, Sept 13–15, 2001. Proceedings, pp 309–318
Schohn G, Cohn D (2000) Less is more: active learning with support vector machines. In:
Proceedings of the seventeenth international conference on machine learning (ICML 2000),
Stanford University, Stanford, CA, June 29–July 2, 2000, pp 839–846
Schütze H, Velipasaoglu E, Pedersen JO (2006) Performance thresholding in practical text
classiﬁcation. In: Proceedings of the 2006 ACM CIKM international conference on
information and knowledge management, Arlington, 6–11 Nov 2006, pp 662–671
Settles B (2012) Active learning. Morgan and Claypool
ShalevShwartz S, BenDavid S (2014) Understanding machine learning. Cambridge University
Press, Cambridge
Shao Y, Cen Y (2014) Data mining applications with R. Academic Press, San Diego
ShaweTaylor J, Cristianini N (2005) Kernel methods for pattern analysis. Cambridge University
Press, Cambridge
Simovici DA, Djeraba C (2014) Mathematical tools for data mining, 2nd edn. Springer, London
Intelligent Data Analysis Techniques … 51
Statnikov A, Aliferis CF, Hardin DP, Guyon I (2011) A gentle introduction to support vector
machines in biomedicine. World Scientiﬁc, Singapore
Steinwart I, Christman A (2008) Support vector machines. Springer, Berlin
Suykens JAK, van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2005) Least squares
support vector machines. World Scientiﬁc, New Jersey
Suykens JAK, Vandewalle J (1999) Least squares support vector machine classiﬁers. Neural
Process Lett 9(3):293–300
Velipasaoglu E, Schütze H, Pedersen JO (2007) Improving active learning recall via disjunctive
boolean constraints. In SIGIR 2007: proceedings of the 30th annual international ACM SIGIR
conference on research and development in information retrieval, Amsterdam, The Nether
lands, July 23–27, 2007, pp 893–894
Wickham H (2009) ggplot2—Elegant graphics for data analysis. Springer, Dordrecht
Witten IH, Frank E, Hall MA (2011) Data mining—practical machine learning tools and
techniques, 3rd edn. Elsevier (Morgan Kaufmann), Amsterdam
Zaki MJ, Meira WM (2014) Data mining and analysis. Cambrige University Press, Cambrige
Zhao Y (2013) R and data mining—example and case studies. Academic Press, San Diego
On Metaheuristics in Optimization and Data Analysis. Application to Geosciences
Keywords Metaheuristics · Numerical and combinatorial optimization · Genetic algorithms · Differential evolution · Particle swarm optimization · Hyperparameter optimization · Problems in geosciences
1 A Painless Introduction
How many problem-solving methods does one need to master? Indeed, many
new methods for solving problems have been invented (some may say discovered) lately.
As opposed to exact deterministic algorithms, many of these new methods are weak
methods; a weak method is not rigidly tied to one specific problem, but rather
can be applied to solve various problems. At times, one or another such
problem-solving technique appears to be most fashionable. To an outsider, genetic
algorithms (GAs), artificial neural networks, particle swarm optimization, and
support vector machines, to name just a few, seemed to successively take the
proscenium by storm over the last decades. Is each new method better than the previous
ones and, consequently, is the choice of the method to solve one's specific problem a
matter of keeping pace with fashion? Is there one particular method that solves best,
among all existing methods and over all problems? A positive answer to either question
would mean that we actually have a free lunch when trying to solve a given
problem: we could spare the time needed to identify the best method for finding
solutions to the problem. However, a theorem proven in 1995 by Wolpert and
Macready (1997), called the No Free Lunch Theorem for optimization (NFLTO), shows that
the answer to both questions above is negative. Informally (and leaving aside
details and nuances of the theorem), the NFLTO states that, averaging over all
problems, all solving methods have the same performance, no matter what indicator
of performance is used. Obviously, the common average is obtained from various
sets of values of the performance indicator for each method and various levels of
each method's performance when applied to each specific problem. This means that,
in general, two different methods perform at their respective best on different
problems and, consequently, each of them performs more poorly on the remaining
problems. It follows that there is no problem-solving method which is the "best"
method for all problems (indeed, if a method M had equally good
performance on all problems, then this would also be M's average performance; then,
any method with scattered values of the performance indicator would outperform M
on some problems). Therefore, for each problem-solving method, there is a subset
of all problems for which it is the best solving method; in some cases, that subset
may consist of only one problem, or even zero problems. Conversely, given a
problem to be solved, one has to find the particular method that works best for that
problem, which proves that the meta-problem mentioned above is non-trivial.
Actually, it may be a very difficult problem; similar to the way some problem-solving
methods are widely used even if they are not guaranteed to provide the
exact solution, an approximate but acceptably good solution to the meta-problem
may be useful.
Optimization problems There is an informal conjecture stating that, in anything we
do, we optimize something; or, as Clerc put it in (2006), iterative optimization
is as old as life itself. While each of these two statements may be the subject
of subtle philosophical debates, it is true that many problems can be stated as
optimization problems. Finding the average of n real numbers is an optimization
problem (find the number a which minimizes the sum of the squared differences
between a and each of the given numbers); the same goes for decision-making
problems, for machine learning ones, and many others.
organize the search for the optimum following a tree structure—or any other
structure. Therefore, it requires little computer memory. Hill climbing starts with an
initial candidate solution and iteratively aims at improving the current candidate
solution by replacing it with any (or the best) neighbor solution which is better than
the current one; when no further improvement is possible, the search stops.
The neighborhood can be considered either in the set over which the function is
defined (a neighbor can be obtained through a slight modification of a number
which is a component of the candidate solution) or in the set of computer
representations of candidate solutions (a neighbor there is reached by flipping one bit).
While the procedure sketched above is very effective for any monomodal
function (informally, a function whose graph has only one hilltop), it may get stuck
in local optima if the function is multimodal. In the latter case, the graph of the
function will also have a second-highest hill, a third-highest one, etc.; one run of the
hill-climbing procedure starting with the initial solution at the shoulder of the
second-highest hill will find the second-highest hilltop (a local optimum), but then it will
get stuck there, since no improvement is possible anymore in the neighborhood.
This is why, for multimodal functions, iterated hill climbing is used instead of
one-iteration hill climbing: the method is applied several times in a row, with different
initial candidate solutions, thus increasing the chance that one run of the method
will start at the foot of the hill which contains the global optimum.
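A compact sketch of the iterated hill-climbing idea in base R (the multimodal test function, the step size, and the number of restarts are our own illustrative choices, not from the chapter):

```r
# Hill climbing: repeatedly replace the current solution with the best
# neighbor, stopping when no neighbor improves on it (a local optimum).
f <- function(x) 4 * cos(3 * x) - (x - 1)^2        # multimodal, to be maximized
hill_climb <- function(x, step = 0.01) {
  repeat {
    neighbors <- c(x - step, x + step)             # slight modifications of x
    best <- neighbors[which.max(f(neighbors))]
    if (f(best) <= f(x)) return(x)                 # stuck: no improvement left
    x <- best
  }
}
# Iterated hill climbing: several runs from different initial solutions,
# hoping one of them starts on the hill holding the global optimum.
set.seed(1)
starts  <- runif(10, min = -4, max = 4)
results <- sapply(starts, hill_climb)
best_x  <- results[which.max(f(results))]
```
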
Simulated Annealing The problem described above (optimization methods
getting stuck in local optima) was actually impairing potential advances in
optimization methods. A breakthrough was the Metropolis algorithm (Metropolis
et al. 1953). The new idea was to occasionally allow candidate solutions which
are worse than the current one to replace the current solution. This is compatible
with the hill-climbing metaphor: indeed, when one wanders through a hilly
landscape aiming at reaching the top of the highest hill, he/she may occasionally have to
climb down a hill in order to reach a higher one.
of annealing, the temperature starts at a (relatively) high value and decreases at each
iteration of the current-solution-changing process. Simulated annealing has been
successfully applied to solve many discrete and continuous optimization problems,
including optimal design.
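Base R already ships a simulated annealing variant in `optim()` (method "SANN"); a minimal sketch on a toy multimodal function of our own choosing:

```r
# optim(method = "SANN") minimizes fn by simulated annealing: worse
# candidates are occasionally accepted, with a probability that shrinks
# as the temperature is lowered across iterations.
f <- function(x) (x^2 - 4)^2 / 10 + sin(5 * x)    # several local minima
set.seed(42)
res <- optim(par = 3, fn = f, method = "SANN",
             control = list(maxit = 5000, temp = 10))
res$par     # candidate minimizer found
res$value   # final function value
```
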
The rest of this chapter and Chapter "Genetic Programming Techniques with
Applications in the Oil and Gas Industry" present several population-based
metaheuristics: GAs and genetic programming, DE, and particle swarm optimization. We
briefly introduce each of them in the following paragraphs. Four particular topics of
interest, in particular for the metaheuristics under discussion, are then briefly
touched upon.
Many more metaheuristics have been proposed, and new ones continue to
appear. Monographs and surveys on metaheuristics such as Glover (1986), Talbi
(2009), and Voß (2001) give comprehensive insights into the topic. The International
Journal of Metaheuristics publishes both theoretical and application papers on
methods including neighborhood search algorithms, evolutionary algorithms, ant
systems, particle swarms, variable neighborhood search, artificial neural networks,
and artificial immune systems. Those interested in approaches to solving the
meta-problem above may wish to read about hyper-heuristics, a term coined by Burke; a
survey is provided in Burke et al. (2013).
1.2 What Will the Rest of This Chapter and the Next
One Elaborate On?
GAs simulate a few basic factors of natural evolution: mutation, crossover, and
selection. The implementation of each of these simulated factors involves
generating random numbers: like all evolutionary methods, GAs are non-deterministic.
Adaptation, which is instrumental in natural evolution, is simulated by calculating
values of a function (the environment) and, on this basis, making candidate
solutions compete for survival into the next generation. The evolution of the population
of solutions can be seen as a learning process where candidate solutions learn
collectively.
More sophisticated variants of GAs simulate further factors of natural evolution,
such as the integrated evolution of two species (coevolution; see Hillis (1990) for
the host–parasite model).
One particular feature of GAs is that the whole computation process takes place
in two dual spaces: the space of candidate solutions to the given problem, where the
evaluation and the subsequent selection for survival take place (the phenotype), and
the space of the representations of such solutions, where genetic operators such as
mutation and crossover are applied (the genotype). This characteristic is also
borrowed from natural evolution, where the genetic code and the actual being evolved
from that code are instantiations of the two-space paradigm: in natural evolution,
the genetic code is altered through mutations and through crossover between
parents; subsequently, the being evolved from the genetic code is evaluated with
respect to its adaptation to the environment.
The genetic code in GAs is actually the way candidate solutions are represented
in the computer. The standard GA (Michalewicz 1992) works with chromosomes
(representations of candidate solutions) which are strings of bits. When applied to
solve real-world problems, GAs evolved toward sophisticated representations of
candidate solutions, including varying-length chromosomes and multidimensional
chromosomes. One particular representation of candidate solutions has been
groundbreaking: trees, in the sense of graph theory.
Genetic Programming emerged as a distinct area of GA research. In his seminal book (Koza
1992), Koza uses a particular definition for the solution to a problem: a solution is a
computer program which solves the problem. Adding to this the idea that such
computer programs can be developed automatically, in particular through genetic
programming, a flourishing field of research and applications emerged. As Poli et al.
put it, genetic programming automatically solves problems without requiring the user
to know or specify the form or structure of the solution in advance (Poli et al. 2008).
The seminal paper for the latter metaheuristic is Kennedy and Eberhart (1995);
a textbook dedicated to PSO is Clerc (2006). Bird flocking or fish schooling can be
considered the inspiring metaphors from nature. The core idea is that at
each iteration, each particle (candidate solution) moves through the search space
according to a (linear) combination of the particle's current move, of the best
personal previous position, and of the best previous position of the neighbors (what
neighbors means is a parameter of the procedure). This powerful combination of
the backtracking flavor (keeping track somehow of the previous personal best) and
collective learning (partly aiming at the regional/global previous best) makes PSO
well suited for optimization problems with a moving optimum.
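The update rule can be sketched in a few lines of R. This is a generic textbook PSO with a global-best neighborhood; the inertia and acceleration coefficients are our own illustrative choices, not values from the chapter:

```r
# Minimal PSO minimizing f over [lo, hi]^d: each particle's velocity is a
# linear combination of its current move, the pull toward its personal
# best, and the pull toward the global best.
pso <- function(f, d = 2, n = 20, iters = 100, lo = -5, hi = 5,
                w = 0.7, c1 = 1.5, c2 = 1.5) {
  x <- matrix(runif(n * d, lo, hi), n, d)   # positions
  v <- matrix(0, n, d)                      # velocities
  pbest <- x                                # personal best positions
  pval  <- apply(x, 1, f)                   # personal best values
  g <- pbest[which.min(pval), ]             # global best position
  for (t in 1:iters) {
    r1 <- matrix(runif(n * d), n, d)
    r2 <- matrix(runif(n * d), n, d)
    gmat <- matrix(g, n, d, byrow = TRUE)
    v <- w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gmat - x)
    x <- pmin(pmax(x + v, lo), hi)          # keep particles in the box
    val <- apply(x, 1, f)
    better <- val < pval
    pbest[better, ] <- x[better, ]
    pval[better]    <- val[better]
    g <- pbest[which.min(pval), ]
  }
  list(par = g, value = min(pval))
}
set.seed(7)
res <- pso(function(x) sum(x^2))   # sphere function, minimum 0 at the origin
```
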
Parameter Control A key element in the successful design of any metaheuristic
is a proper adjustment of its parameters. It suffices to think of the number of
candidate GAs one has to select from when designing a GA for a given problem:
mutation rates and crossover rates can, at least theoretically, take on any value
between 0 and 1; there are tens of choices for the population size; the selection
procedure can be any of at least ten popular ones (and new ones can be invented), etc.
This makes the search space for properly designing a GA for a given problem contain
at least hundreds of thousands of candidate GAs; of these, only a few will
probably have good performance, and finding these among all possible GAs for
that problem is a non-trivial task.
There are three main ways of tackling unfeasible solutions. A first approach is to
penalize unfeasible solutions and otherwise let them continue to be part of the
search process. In this way, an unfeasible solution becomes even less competitive
than it actually is with respect to the search-for-the-optimum process (see fitness
function in the GAs section of this chapter). A second approach is to repair a new
solution in case it is unfeasible (repairing means changing the solution in such a
way that it becomes feasible); the fact that repairing may have the same complexity
as the original problem makes this approach the least recommendable. The best
approach seems to be that of including the constraints (or at least some of them) in
the representation of solutions. This idea is convincingly illustrated for numerical
problems in Michalewicz (1992), where bit string representations are used: any bit
string is decoded into a feasible solution. This approach has the decisive advantage
that there is no need to check whether or not candidate solutions obtained from
existing ones are feasible. When including the problem constraints into the
codification of candidate solutions, one actually uses hybridisation with the problem,
which is mentioned in the next paragraph.
62 H. Luchian et al.
Hybridisation According to one of the definitions in Blum and Roli (2003), a basic
idea of metaheuristics in general is to combine two or more heuristics in one
problem-tailored procedure and use it as a specific metaheuristic. Hybridisation has
even more subtle aspects. Hybridisation happens when inserting an element from one
metaheuristic into another metaheuristic (e.g., using crossover, a defining operator
for GAs, in an evolution strategy which, in its standard form, uses only mutations).
Another form of hybridisation could be called hybridisation with the problem:
problem-specific properties can be used for defining particular operators in a
metaheuristic. An example can be found in Michalewicz (1992): for the transportation
problem, a feasible solution remains feasible after a certain transformation is applied
to it; this transformation is then used to define the mutation operator. An example
of hybridisation is illustrated in this book, in the chapter on genetic programming.
The first optimization problem, known as Six Hump Camel Back, is commonly
used as a benchmark function to assess the performance of optimization algorithms,
to which its complex multimodal landscape poses serious difficulties. It is
formulated as a minimization problem over two continuous variables. The problem is
defined as follows:
Minimize f(x1, x2) = (4 − 2.1·x1^2 + x1^4/3)·x1^2 + x1·x2 + (−4 + 4·x2^2)·x2^2,   (1)
where −3 ≤ x1 ≤ 3, −2 ≤ x2 ≤ 2.
The landscape of the function is illustrated in Fig. 1 with the aid of perspective
and contour plots in R.
As visible on the plots, the function has six local minima, two of which are global.
The two global minima lie at locations (x1, x2) = (−0.0898, 0.7126) and
(x1, x2) = (0.0898, −0.7126); the value attained at these locations is
f(x1, x2) = −1.0316.
The R code deﬁning the Six Hump Camel Back function is shown below.
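The listing did not survive extraction here; a definition consistent with Eq. 1 is sketched below. The maximization wrapper SixHumpMax, used by the ga() calls later in the chapter, simply negates the function, since the GA package maximizes fitness:

```r
# Six Hump Camel Back (Eq. 1), to be minimized over x1 in [-3, 3],
# x2 in [-2, 2].
SixHump <- function(x1, x2) {
  (4 - 2.1 * x1^2 + x1^4 / 3) * x1^2 + x1 * x2 + (-4 + 4 * x2^2) * x2^2
}

# ga() maximizes its fitness argument and passes the chromosome as one
# vector, hence the negated wrapper used later in the chapter:
SixHumpMax <- function(x) -SixHump(x[1], x[2])

SixHump(-0.0898, 0.7126)   # approximately -1.0316 (a global minimum)
```
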
Fig. 1 Perspective and contour plots for Six Hump Camel Back: a for the entire domain of
definition, x1 ∈ [−3, 3], x2 ∈ [−2, 2]; b restricted to x1 ∈ [−1.9, 1.9], x2 ∈ [−1.1, 1.1]. The two
global optima are illustrated as blue triangles at locations (x1, x2) = (−0.0898, 0.7126) and
(x1, x2) = (0.0898, −0.7126)
Subject to: Σ_{i=1}^{n} c_i·x_i ≤ b,   (3)
x_i ∈ {0, 1}, i = 1, …, n,   (4)
x1 + x2 ≥ 1,   (5)
x3 + x5 ≥ 1,   (6)
x3 + x4 + x5 ≤ 2.   (7)
The variables x_i represent the decision to select project i (x_i = 0 means the project
is not selected, whereas x_i = 1 means the project gets selected for implementation),
the constraint expressed by Eq. 4. The total costs of the selected projects must not
exceed the total budget of the firm (Eq. 3). Other constraints may be imposed on
the problem (especially in a real-world context): Eq. 5 expresses the condition
that either project 1 or project 2 gets implemented; Eq. 6 expresses the condition that
either project 3 or project 5 gets implemented; Eq. 7 expresses the condition that at
most 2 of the 3 projects 3, 4, and 5 may get implemented.
2 Evolutionary Algorithms
2.1 Terminology
Evolutionary algorithms use a vocabulary borrowed from genetics. They simulate the
evolution, across a sequence of generations (iterations of an iterative process), of a
population (set) of candidate solutions. A candidate solution is internally represented
as a string of genes and is called a chromosome or individual. The position of a gene in a
chromosome is called its locus, and all the possible values for the gene form the set of
alleles of the respective gene. The internal representation (encoding) of a candidate
solution in an evolutionary algorithm forms the genotype; this is the information
processed by the evolutionary algorithm. Each chromosome corresponds to a candidate
solution in the search space of the problem, which represents its phenotype. A
decoding function is necessary to translate the genotype into the phenotype. If the search
space is finite, it is desirable that this function satisfy the bijection property, in
order to avoid redundancy in chromosome encoding (which would slow down
convergence) and to ensure coverage of the entire search space.
The population maintained by an evolutionary algorithm evolves with the aid of
genetic operators that simulate the fundamental elements of genetics: mutation
consists in a random perturbation of a gene, while crossover aims at exchanging
genetic information among several chromosomes. The chromosome subjected to a
genetic operator is called a parent, and the resulting chromosome is called an offspring.
A process called selection, involving some degree of randomness, selects the
individuals to breed and create offspring, mainly based on individual merit. The
individual merit is measured using a fitness function, which quantifies how fitted the
candidate solution encoded by the chromosome is for the problem being solved. The
fitness function is formulated based on the mathematical function to be optimized.
The solution returned by an evolutionary algorithm is usually the most fitted
chromosome in the last generation.
GAs (Holland 1998) are the best known and the most intensively used class of
evolutionary algorithms.
A GA performs a multidimensional search by means of a population of
candidate solutions which exchange information and evolve during an iterative process.
The process is illustrated by the pseudocode in Fig. 2.
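The generational loop of such a pseudocode can be written as a generic R sketch in which the encoding-dependent pieces (initialization, fitness, selection, operators) are supplied as functions; the OneMax usage example and all names below are ours, not the chapter's:

```r
# Generic generational GA skeleton; assumes an even population size and
# a crossover operator returning a list of two offspring.
run_ga <- function(init, fitness, select, crossover, mutate,
                   pop_size = 20, generations = 50, pc = 0.8, pm = 0.1) {
  pop <- replicate(pop_size, init(), simplify = FALSE)
  for (g in 1:generations) {
    fit <- sapply(pop, fitness)
    mating <- select(pop, fit, pop_size)          # selection for reproduction
    offspring <- list()
    for (i in seq(1, pop_size, by = 2)) {
      pair <- mating[c(i, i + 1)]
      if (runif(1) < pc) pair <- crossover(pair[[1]], pair[[2]])
      offspring <- c(offspring, pair)
    }
    pop <- lapply(offspring,                      # mutation with rate pm
                  function(ch) if (runif(1) < pm) mutate(ch) else ch)
  }
  fit <- sapply(pop, fitness)
  pop[[which.max(fit)]]        # most fitted chromosome in the last generation
}

# Example: maximize the number of 1s in a 10-bit string ("OneMax").
set.seed(3)
best <- run_ga(
  init      = function() sample(0:1, 10, replace = TRUE),
  fitness   = sum,                                 # count the 1s
  select    = function(pop, fit, n)                # fitness-proportional
                pop[sample(length(pop), n, replace = TRUE, prob = fit + 1)],
  crossover = function(a, b) {                     # one-point crossover
    k <- sample(1:9, 1)
    list(c(a[1:k], b[(k + 1):10]), c(b[1:k], a[(k + 1):10]))
  },
  mutate    = function(ch) { i <- sample(10, 1); ch[i] <- 1 - ch[i]; ch }
)
```
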
In order to solve a problem with a GA, one must deﬁne the following elements:
• an encoding for candidate solutions (the genotype);
• an initialization procedure to generate the initial population of candidate
solutions;
• a ﬁtness function which deﬁnes the environment and measures the quality of the
candidate solutions;
• a selection scheme;
• genetic operators (mutation and crossover);
• numerical parameters.
The encoding is considered to be the main factor determining the success or
failure of a GA.
The standard encoding in GAs consists of binary strings of fixed length. The
main advantage of this encoding is the existence of a theoretical model
(the Schema theorem) explaining the search process until convergence. Another
advantage, shown by Holland, is the high implicit parallelism in the GA. A widely
used extension of the binary encoding is Gray coding.
Unfortunately, for many problems this encoding is not a natural one and is
difficult to adapt. However, GAs themselves evolved, and the encoding was
extended to strings of integer and real numbers, permutations, trees, and
multidimensional structures. Decoding the chromosome into a candidate solution to the
problem sometimes necessitates problem-specific heuristics.
Important factors that need to be analyzed with regard to the encoding are the
size of the search space induced by a representation and the coverage of the
phenotype space: whether the phenotype space is entirely covered and/or reachable,
whether the mapping from genotype to phenotype is injective or "degenerate," and
p_i = f_i / Σ_{j=1}^{N} f_j,
where N is the number of individuals in the population (see, for a simple example,
Fig. 3, which assumes a population of 5 individuals). On each application of the
selection scheme, a random number r ∈ [0, 1) is generated, and the first individual i
whose cumulative selection probability reaches this random r is selected to survive
into the next generation:
Fig. 3 Fitness values in a population of 5 individuals. The bottom row contains the ﬁtness values
of the individuals. Their associated probabilities are the labels of the circular sectors
i = min{ k = 1, …, N : Σ_{j=1}^{k} p_j ≥ r }.
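Read together, the formulas above give the classical roulette wheel procedure; a small base-R sketch (the population fitness values are our own example):

```r
# Roulette wheel selection: p_i = f_i / sum(f_j); return the smallest
# index whose cumulative probability reaches the random draw r in [0, 1).
roulette_select <- function(fitness) {
  cum <- cumsum(fitness / sum(fitness))
  cum[length(cum)] <- 1          # guard against floating-point round-off
  r <- runif(1)
  which(cum >= r)[1]
}
set.seed(11)
fit <- c(10, 40, 25, 5, 20)      # a population of 5 individuals
picks <- replicate(10000, roulette_select(fit))
table(picks) / 10000             # frequencies approximate fit / sum(fit)
```
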
Ordinal-based selection takes into account only the relative order of individuals
according to their fitness values. The most used procedures of this kind are
linear ranking selection (Baker 1985) and tournament selection (Goldberg
1989).
New individuals are created in the population with the aid of two genetic operators:
crossover and mutation. The classical crossover operator aims at exchanging
genetic material between two chromosomes in two steps: a locus is chosen
randomly to play the role of a cut point and splits each of the two chromosomes into two
segments; then, two new chromosomes are generated by merging the first segment
from the first chromosome with the second segment from the second chromosome,
and vice versa. This operator is called in the literature one-point crossover and is
illustrated in Fig. 4. Generalizations exist to two or more cut points. Uniform
crossover builds the offspring sequentially, copying at each locus the allele
randomly chosen from one of the two parents.
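One-point crossover takes only a few lines of R; the complementary bit-string parents below are our own toy example:

```r
# One-point crossover: a random cut point splits both parents into two
# segments; the two offspring are the cross-merged segments.
one_point_crossover <- function(p1, p2) {
  n <- length(p1)
  cut <- sample(1:(n - 1), 1)               # the locus acting as cut point
  list(c(p1[1:cut], p2[(cut + 1):n]),
       c(p2[1:cut], p1[(cut + 1):n]))
}
set.seed(5)
offspring <- one_point_crossover(rep(0, 8), rep(1, 8))
offspring   # a block of 0s followed by 1s, and its mirror image
```
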
Various constraints imposed by real-world problems led to various encodings for
candidate solutions; these problem-specific encodings subsequently necessitate the
redefinition of crossover. Thus, algebraic operators are employed in the case of
numerical optimization with real encoding; an impressive number of papers focused
on permutation-based encodings, proposing various operators and performing
comparative studies. It is now a common procedure to wrap a problem-specific
heuristic within the crossover operator [in Ionita et al. (2006), the authors propose
new operators for constraint satisfaction; Luchian et al. (1994) presents new
operators in the context of clustering]. Crossover in GAs stands at the moment for
any procedure which combines the information encoded within two or several
chromosomes to create new and, hopefully, better individuals.
The lines of code below call the ga function to execute a GA which maximizes
our newly deﬁned function with a population of 20 chromosomes using real
encoding and arithmetic operators for 50 iterations:
> library("GA")
> GA.sols <- ga(type = "real-valued", fitness = SixHumpMax,
+   min = c(-3, -2), max = c(3, 2), maxiter = 50, popSize = 20)
Iter = 1 | Mean = -20.10513 | Best = 0.3900806
Iter = 2 | Mean = -8.679598 | Best = 0.3900806
Iter = 3 | Mean = -1.909435 | Best = 0.3900806
Iter = 4 | Mean = -0.7739577 | Best = 0.521566
Iter = 5 | Mean = -0.4207289 | Best = 0.521566
...
Iter = 50 | Mean = 0.9275536 | Best = 1.020383
During its execution, the ga function prints at each iteration the mean of the
ﬁtness in population and the best ﬁtness value. To show the ﬁnal results, we call the
summary function:
> summary(GA.sols)
+-----------------------------------+
|         Genetic Algorithm         |
+-----------------------------------+

GA settings:
Type                  = real-valued
Population size       = 20
Number of generations = 50
Elitism               = 1
Crossover probability = 0.8
Mutation probability  = 0.1
Search domain
     x1  x2
Min  -3  -2
Max   3   2

GA results:
Iterations = 50
Fitness function value = 1.020383
Solution =
             x1        x2
[1,] -0.1262185 0.6870156
Fig. 7 The evolution of the population in GA during one run of the algorithm: the distribution of
the candidate solutions at iterations 1, 2, 5, 10, and 15
> t.test(fitness)
One Sample t-test
data: fitness
t = -1503.688, df = 29, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -1.031762 -1.028960
sample estimates:
mean of x
 -1.030361
Although the reported results are satisfactory, GAs are usually enhanced in
practice by hybridizing them with local search algorithms.
With their standard binary encoding, GAs are the most appropriate candidates
when attempting to solve the portfolio selection problem by means of metaheuristics.
In order to illustrate such an approach, we consider the problem defined in
Sect. 1.1 with the following instantiation: the number of projects n = 6, the budget
of the firm b = 1000, and the costs and the utilities of the projects as in Table 1. An
optimal solution to this problem involves the selection of projects 1, 4, 5, and 6; it
has total cost 850 and utility 1700.
One way to deal with the constraints imposed by the problem within a GA is to
encourage the search in the feasible region of the search space by penalizing the
infeasible candidate solutions. Under this approach, any solution that violates a
constraint gets a lower fitness. Identifying the most appropriate penalization scheme
is, by itself, an optimization problem. The code below implements one
possible fitness function for our problem:
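In Python terms (the chapter works in R), such a penalized fitness can be sketched as follows; the cost and utility vectors here are illustrative placeholders, not the values of Table 1, and the penalty factor is an assumption:

```python
# Hypothetical data: NOT the chapter's Table 1 values, chosen only so that
# selecting projects 1, 4, 5, and 6 is feasible (cost 850 <= budget 1000).
costs = [200, 300, 250, 150, 250, 250]
utilities = [400, 350, 300, 450, 400, 450]
budget = 1000

def fitness(bits):
    """Total utility of the selected projects; solutions exceeding the
    budget are penalized proportionally to the amount of violation."""
    cost = sum(c for c, b in zip(costs, bits) if b)
    utility = sum(u for u, b in zip(utilities, bits) if b)
    if cost <= budget:
        return utility
    return utility - 10 * (cost - budget)  # penalty factor 10 is an assumption
```

With this penalty, the infeasible all-ones portfolio (cost 1400) scores 2350 - 10*400 = -1650, well below any feasible selection, so the GA is steered back into the feasible region.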
> summary(GA)
+-----------------------------------+
|         Genetic Algorithm         |
+-----------------------------------+
GA settings:
Type                  = binary
Population size       = 20
Number of generations = 50
Elitism               = 1
Crossover probability = 0.8
Mutation probability  = 0.1
GA results:
Iterations = 50
Fitness function value = 1700
Solution =
     x1 x2 x3 x4 x5 x6
[1,]  1  0  0  1  1  1
$$y_i = \lambda\, x_*^{(t-1)} + (1-\lambda)\, x_{I_i}^{(t-1)} + \sum_{l=1}^{L} F_l \left( x_{J_i^l}^{(t-1)} - x_{K_i^l}^{(t-1)} \right) \qquad (8)$$
where \lambda is a numerical value in the range [0, 1] controlling the influence of the best
element in the current population, which is x_*^{(t-1)}; x_{I_i}^{(t-1)} is a chromosome from the
On Metaheuristics in Optimization and Data Analysis. Application to Geosciences 77
$$y_i = x_*^{(t-1)} + F \left( x_{J_i}^{(t-1)} - x_{K_i}^{(t-1)} \right) \qquad (9)$$
It must be noted that the mutation mechanism described above does not alter the
current/selected chromosome xi . It is the role of crossover to build an offspring of
the current chromosome, by combining its genetic material with the one encoded by
the mutant chromosome. From this perspective, DE is not entirely compliant with
the general speciﬁcations of the two genetic operators.
Two versions of crossover are proposed in DE. A ﬁrst one, called binomial
crossover, is similar to the uniform crossover in GAs: It is a binary operator that
mixes the components of the two chromosomes based on a given probability CR:
$$z_{i,d} = \begin{cases} y_{i,d} & \text{if } r_d < CR \text{ or } d = d_0 \\ x_{i,d} & \text{otherwise} \end{cases} \qquad d = 1 \ldots D \qquad (11)$$
of the division of a by b; k is the first trial for which a random number uniformly
generated in [0, 1] is higher than CR, thus following a truncated geometric
distribution. For example, considering d_0 = 6 and D = 10, H could be the series 6,
7, 8 or 6, 7, 8, 9, 10, 1, 2, depending on the parameter CR; these two examples
clearly illustrate the similarity of the exponential crossover in DE with the 2-point
crossover in GAs.
In both versions of the crossover operator, CR is a parameter deciding the
influence of the mutant on the structure of the offspring. A theoretical analysis of
the two crossover variants and of their influence on the sensitivity of DE to different
values of CR is presented in Zaharie (2007).
An elitist replacement strategy guarantees survival of the ﬁttest chromosome
among the parent and the offspring.
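The mutation and crossover operators described above can be sketched as follows (a Python illustration of the operators, with function names of our own choosing; the chapter's experiments use the R implementations):

```python
import random

def mutant(base, xb, xc, F):
    """Difference-based DE mutation: base + F * (xb - xc), cf. Eq. (9)."""
    return [bd + F * (b - c) for bd, b, c in zip(base, xb, xc)]

def binomial_crossover(x, y, CR, rng=random):
    """Eq. (11): take component d from the mutant y when r_d < CR or d == d0,
    so at least one mutant component always survives."""
    D = len(x)
    d0 = rng.randrange(D)
    return [y[d] if (d == d0 or rng.random() < CR) else x[d] for d in range(D)]

def exponential_crossover(x, y, CR, rng=random):
    """Copy a contiguous (circular) block of mutant components starting at a
    random index; the block length follows a truncated geometric distribution."""
    D = len(x)
    z = list(x)
    d = rng.randrange(D)
    copied = 0
    while copied < D:
        z[d] = y[d]
        copied += 1
        d = (d + 1) % D
        if rng.random() >= CR:
            break
    return z
```

With CR = 0 both variants copy exactly one mutant component; with CR close to 1 the offspring tends toward the full mutant.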
To simulate a run of the DE algorithm on our minimization problem, we use the
R package called DEoptim (Mullen et al. 2011).1 The following code calls the
DEoptim function, which executes the DE/rand/1/bin algorithm (the variant
implementing mutation based on a random candidate and one difference, and binomial
crossover) to minimize the Six-Hump function with a population consisting of 20
candidate solutions over 50 iterations; with the trace parameter set to TRUE, the
best candidate solution (its value for the objective function and its components) in
each iteration is shown during the run:
> library("DEoptim")
> DE.sols <- DEoptim(SixHumpV, lower = c(-3, -2), upper = c(3, 2),
+ control = list(strategy = 1, NP=20, itermax=50, storepopfrom = 1,
+ trace = TRUE))
Iteration: 1 bestvalit: -0.343676 bestmemit: 0.424858 0.515384
Iteration: 2 bestvalit: -0.343676 bestmemit: 0.424858 0.515384
Iteration: 3 bestvalit: -0.343676 bestmemit: 0.424858 0.515384
Iteration: 4 bestvalit: -0.722848 bestmemit: 0.090842 0.885970
Iteration: 5 bestvalit: -0.811161 bestmemit: 0.138414 0.742059
...
1 The package can be freely downloaded from http://cran.r-project.org/web/packages/DEoptim/index.html.
Fig. 9 The evolution of the population in DE during one run of the algorithm: (a) the evolution of
the best fitness value in the population and (b) the distribution of the candidate solutions (the genotype)
> DE.sols$optim
$bestmem
par1 par2
0.08984226 0.71265649
$bestval
[1] -1.031628
...
> plot(DE.sols, plot.type = "bestvalit", col="red", pch=1)
> plot(DE.sols, plot.type = "storepop")
Figure 9 clearly illustrates the convergence toward the optimal solution in DE. In
our run, the optimum is found after 31 iterations, as indicated by Fig. 9a. The
diversity in the population decreases significantly during the run according to Fig. 9b,
which presents in two distinct plots the distribution of the values in each iteration
for each parameter of the objective function. This plot indicates an interesting
behavior: convergence toward two distinct regions in the search space.
In order to get more insight into the dynamics of the population within DE,
Fig. 10 illustrates the candidate solutions in the population at distinct moments
during the run, distributed over the contour plot illustrating the landscape of the
objective function. The series (a) of plots shows the distribution of the candidate
solutions at iterations 1, 5, 10, and 15. The series (b) offers a zoomed-in perspective
of the landscape (restricted to x1 ∈ [−1.9, 1.9] and x2 ∈ [−1.1, 1.1]) showing the
distribution of the candidate solutions at iterations 15, 20, 30, and 50. In the first
iteration of the algorithm, the population is spread at random in the search space. At
iteration number 10 (Fig. 10a, 3rd plot), groups of individuals have formed around
local and global optima. Toward the end of our run, all the candidate solutions
migrate to the regions corresponding to the two global optima.
80 H. Luchian et al.
Fig. 10 The evolution of the population in DE during one run of the algorithm: (a) the distribution
of the candidate solutions at iterations 1, 5, 10, and 15 and (b) a zoomed-in landscape showing the
distribution of the candidate solutions at iterations 15, 20, 30, and 50
The mean of the objective values after 30 runs is −1.031615, with a standard
deviation of 3.74e−05.
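For reference, the Six-Hump Camel Back function minimized throughout these experiments is a standard benchmark; a direct evaluation (here a Python sketch, whereas the chapter defines it in R as SixHumpV) confirms the objective value the algorithms converge to:

```python
def six_hump(x1, x2):
    """Six-Hump Camel Back benchmark; two global minima with f ~= -1.0316
    at approximately (0.0898, -0.7126) and (-0.0898, 0.7126)."""
    return ((4 - 2.1 * x1**2 + x1**4 / 3) * x1**2
            + x1 * x2
            + (-4 + 4 * x2**2) * x2**2)
```

The function is symmetric under (x1, x2) → (−x1, −x2), which explains the convergence toward two distinct regions observed in Figs. 9 and 10.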
Variations have been brought to the classical EAs not only at the encoding and operator
level. In order to face the challenges imposed by real-world problems, modifications
are also recorded in the general scheme of the algorithms.
EAs are generally preferred to trajectory-based metaheuristics (e.g., hill climbing,
simulated annealing, Tabu Search) in multimodal environments, mostly due
to their increased exploration capabilities. However, a standard EA can still be
trapped in a local optimum if the entire population is prematurely attracted into
its basin of attraction. Therefore, the main concern of EAs for multimodal optimization
is to maintain diversity for a longer time in order to detect multiple (local)
optima. To discover the global optima, the EA must be able to intensify the search
in several promising regions and eventually encourage simultaneous convergence
toward several local optima. This strategy is called niching: The algorithm forces
the population to preserve subpopulations, each subpopulation corresponding to a
niche in the search space, and different niches represent different (local) optimal
regions.
Several strategies exist in the literature to introduce niching capabilities into
evolutionary algorithms. Deb and Goldberg (1989) propose ﬁtness sharing: The
fitness of each individual is modified by taking into account the number and fitness
of its closely ranged individuals. This strategy makes the number of individuals
in the attraction basin of an optimum dependent on the height of that peak.
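A minimal sketch of this sharing scheme (in Python; the triangular sharing function, its parameters, and the one-dimensional distance are simplifying assumptions, not taken from Deb and Goldberg's paper):

```python
def shared_fitness(fits, pop, sigma=1.0, alpha=1.0):
    """Divide each raw fitness by its niche count: crowded peaks are
    down-weighted, so subpopulations can persist on several optima."""
    def sh(d):
        # Sharing function: decays to 0 for distances beyond sigma.
        return 1.0 - (d / sigma) ** alpha if d < sigma else 0.0
    return [f / sum(sh(abs(x - y)) for y in pop) for x, f in zip(pop, fits)]
```

Two individuals sitting on the same peak halve each other's shared fitness, while an isolated individual keeps its raw value.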
Another widely used strategy is to arrange the candidate solutions into groups of
individuals that can only interact among themselves. The island model evolves
several populations of candidate solutions independently; after a number of
generations, individuals in neighboring populations migrate between the islands
(Whitley et al. 1998).
There are also techniques that divide the population based on the distances
between individuals (the so-called radii-based multimodal search GAs). Genetic
chromodynamics (Dumitrescu 2000) introduces a set of restrictions with regard to
the way selection is applied or the way recombination takes place. A merging
operator is introduced, which merges very similar individuals after perturbation
takes place. In Stoean et al. (2010), the best successive local individuals are conserved,
while subpopulations are topologically separated.
De Jong introduced a new scheme for inserting the descendants into the population,
called the crowding method (Kenneth 1975). To preserve diversity, the
offspring replace only similar individuals in the population.
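A sketch of this replacement step (Python; comparing the child against the whole population rather than a sampled crowding factor, and the maximization convention, are simplifications of ours):

```python
def crowding_insert(pop, fits, child, child_fit, dist):
    """Crowding-style replacement: the offspring competes with (and may
    replace) the most similar member of the population, preserving diversity."""
    i = min(range(len(pop)), key=lambda j: dist(pop[j], child))
    if child_fit > fits[i]:  # maximization assumed
        pop[i], fits[i] = child, child_fit
```

A fit child near an occupied niche displaces only its nearest neighbor, so individuals on distant peaks are never pushed out.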
A field of intensive research within the evolutionary computation (EC) community
is multiobjective optimization. Most real-world problems necessitate the
optimization of several, often conflicting, objectives. Population-based optimization
methods offer an elegant and very efficient approach to this kind of problem: With
small modifications of the basic algorithmic scheme, they are able to offer an
approximation of the Pareto optimal solution set. While moving from one Pareto
solution to another, there is always a certain amount of sacrifice in one or more objectives
to achieve a certain amount of gain in the others. Pareto optimal solution sets are
often preferred to single solutions in practice, because the trade-off between
objectives can be analyzed and optimal decisions can be made for the specific
problem instance.
Zitzler et al. (2000) formulate three goals to be achieved by multiobjective
search algorithms:
• the Pareto solution set should be as close as possible to the true Pareto front,
• the Pareto solution set should be uniformly distributed and diverse over the
Pareto front in order to provide the decision maker with a true picture of the trade-offs,
• the set of solutions should capture the whole spectrum of the Pareto front; this
requires investigating solutions at the extreme ends of the objective function
space.
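The notion of Pareto dominance underlying these goals can be made concrete with a short sketch (Python; the minimization convention is an assumption):

```python
def dominates(a, b):
    """a dominates b: a is no worse in every objective and strictly
    better in at least one (minimization convention)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the non-dominated points: the approximation of the Pareto set."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For two objectives, (1, 2) and (2, 1) are mutually non-dominated (each trades one objective for the other), while (2, 2) is dominated by both.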
GAs have been the most popular heuristic approach to multiobjective design
and optimization problems mostly because of their ability to simultaneously search
different regions of a solution space and ﬁnd a diverse set of solutions. The
crossover operator may exploit the structures of good solutions with respect to different
objectives to create new non-dominated solutions in unexplored parts of the Pareto
front. In addition, most multiobjective GAs do not require the user to prioritize,
scale, or weigh objectives. There are many variations of multiobjective GAs in the
literature and several comparative studies. As in multimodal environments, the
main concern in multiobjective GA optimization is to maintain diversity
throughout the search in order to cover the whole Pareto front. Konak et al. (2006)
provide a survey of the best-known multiobjective GAs, describing common
techniques used in multiobjective GAs to attain the three above-mentioned goals.
3 Swarm Intelligence
The particle swarm optimization (PSO) model was introduced in 1995 by Kennedy
and Eberhart (1995); it was discovered through the simulation of simplified social
models such as fish schooling or bird flocking. It was originally conceived as a
method for the optimization of continuous nonlinear functions. Later studies showed
that PSO can be successfully adapted to solve combinatorial problems.
The evolutionary cultural model proposed by Boyd and Richerson (1985) stands
as the basic principle of PSO. According to this model, individuals of a society have
two learning sources: individual learning and cultural transmission. Individual
learning is efficient only in homogeneous environments: The patterns acquired
through local interactions with the environment are generally applicable. For
heterogeneous environments, social learning, the essential feature of cultural
transmission, is necessary.
In line with the evolutionary cultural model, the PSO algorithm uses a set of
simple agents which collaborate in order to ﬁnd solutions of a given optimization
problem.
In the PSO paradigm, the environment corresponds to the search space of the
optimization problem to be solved. A swarm of particles is placed in this environment;
the location of each particle therefore corresponds to a candidate solution
to the problem. A fitness function is formulated in accordance with the optimization
criterion to measure the quality of each location. The particles move in their
environment, collect information on the quality of the solutions they visit, and
share this information with the neighboring particles in the swarm. Each particle is
endowed with memory to store the information gathered by individual interactions
with the environment, thus simulating individual learning. The information
acquired from neighboring particles corresponds to the social learning component.
Eventually, the swarm is likely to move toward more promising locations of the
search space, similar to a flock of birds that collectively forage for food.
Unlike GAs, in PSO there exist no evolution operators and no competition for
survival; all particles survive and share information for the welfare of the swarm.
The driving force is the emergent swarm intelligence (SI), attained by the sharing of
local information between particles in order to produce global knowledge. It is
important to note that problem solving is a population-wide phenomenon, because a
particle by itself is probably incapable of solving even simple problems (Poli et al. 2007).
Usually, the swarm is composed of particles that share the same structural and
behavioral features. Each particle is characterized by its current position in the
search space, its velocity, and one or more of its best positions in the past (usually,
only one position). Each particle uses the objective (fitness) function to
find out how good its current status is. The particles use a communication channel
in order to exchange information with (some of) their peers. The topology of the
swarm's social network is defined by the structure of the communication channel,
where cliques of interconnected particles form neighborhoods.
In the classical PSO algorithm, the position of a particle in the search space is
updated in each iteration depending on the position and velocity of the particle in
the previous iteration. The formulas used to update the particles and the procedures
are inspired from and conceived for continuous spaces. Therefore, each particle is
represented by a vector x of length n indicating the position in the ndimensional
search space and has a velocity vector v used to update the current position. The
velocity vector is computed following the rules:
• every particle tends to keep its current direction (an inertia term);
• every particle is attracted to the best position p it has achieved so far (implements
the individual learning component);
• every particle is attracted to the best particle g in the neighborhood (implements
the social learning component).
84 H. Luchian et al.
The velocity vector is computed as a weighted sum of the three terms above.
Two random multipliers r1, r2 are used to gain stochastic exploration capability,
while w, c1, c2 are weights, usually determined empirically. The formula used to
update each of the individuals in the population at iteration t is as follows:
$$v_i^{t} = w\, v_i^{t-1} + c_1 r_1 \left( p_i^{t-1} - x_i^{t-1} \right) + c_2 r_2 \left( g_i^{t-1} - x_i^{t-1} \right) \qquad (13a)$$
As a side effect of these changes, the velocity of the particle could enter a
divergence process, throwing the particle further and further away from p. To
prevent this behavior, Kennedy and Eberhart clamped the amplitude of the velocity
to a maximum value, denoted by vmax:
velocity which determines the maximum change each particle can take during one
iteration. This parameter is usually proportional to the size of the search domain.
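One particle update combining Eq. (13a) with velocity clamping can be sketched as follows (Python; the default weights and the position update x ← x + v follow the usual PSO conventions and are assumptions here):

```python
import random

def pso_step(x, v, p, g, w=0.72, c1=1.19, c2=1.19, vmax=None, rng=random):
    """Inertia + individual-learning + social-learning terms (cf. Eq. 13a),
    then clamp the velocity to [-vmax, vmax] and move the particle."""
    new_x, new_v = [], []
    for xd, vd, pd, gd in zip(x, v, p, g):
        vd = w * vd + c1 * rng.random() * (pd - xd) + c2 * rng.random() * (gd - xd)
        if vmax is not None:
            vd = max(-vmax, min(vmax, vd))  # velocity clamping
        new_v.append(vd)
        new_x.append(xd + vd)  # position update
    return new_x, new_v
```

With all weights set to 1 and both random draws equal to 1, a particle at 0 with velocity 1, personal best 2, and neighborhood best 4 would get velocity 7, clamped to vmax = 5.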
One run of the PSO algorithm can be illustrated using the R package pso, which
is consistent with standard PSO as described in Bratton and Kennedy (2007):
> library(pso)
> PSO.sols <- psoptim(rep(NA,2), SixHumpV, lower=c(-3,-2), upper=c(3,2),
+ control=list(maxit=50, s=20, trace=1, REPORT=1))
S=20, K=3, p=0.1426, w0=0.7213, w1=0.7213, c.p=1.193, c.g=1.193
v.max=NA, d=7.211, vectorize=FALSE, hybrid=off
It 1: fitness=-0.3635
It 2: fitness=-0.8261
It 3: fitness=-0.8261
It 4: fitness=-0.8623
It 5: fitness=-0.9337
...
> show(PSO.sols)
$par
[1] 0.09041749 0.71296641
$value
[1] -1.031627
The algorithm quickly reaches the global optimum, as shown in Fig. 12.
Figure 13 illustrates the distribution of the individuals in the population during one
run, at iterations 1, 2, 5, 10, 20, and 50.
Although PSO was conceived for continuous optimization, efforts have been made to
adapt the algorithm to solve a wide range of combinatorial
and binary optimization problems. A short discussion of the binary version of PSO
is presented in this section, following the presentation from Bautu (2010).
Kennedy and Eberhart (1997) introduced a first variant of binary PSO, combining
the evolutionary cultural model with the reasoned action model. According
to the latter, the action performed by an individual is the stochastic result of the
intention to do that action. The strength of the intention results from the interaction
of the personal attitude and the social attitude on the matter (Hale et al. 2002).
Fig. 12 The evolution of the best value of the objective function for one run of PSO
Fig. 13 The evolution of the population in PSO during one run of the algorithm: the distribution of
the candidate solutions at iterations 1, 2, 5, 10, 20, and 50
The PSO algorithm for real-valued optimization updates the positions of particles
based on a function that depends (indirectly) on various personal and social
factors. In the binary domain, the intention of particles to move between the two
allowed positions, 0 and 1, is modeled in a similar manner. The probability that the
particle will move to position 1 is computed by:
The individual learning factor and the social learning factor act as personal and
social attitudes that help to select one of the two binary options.
In particular, with respect to classical PSO, in binary PSO:
• the domain of particle positions in the context of binary optimization problems
is P = {0, 1}^n;
• the cost function that describes the optimization problem is hence defined as
c : {0, 1}^n → R;
• the position of a particle consists of its responses to the n binary
queries of the problem. The position in the search space is updated during each
iteration depending on the particle's velocity.
Let p^t ∈ P and v^t ∈ R denote the position and the velocity of a particle at
iteration t. The update equation for the particle's position in binary PSO is as
follows:
$$p = \begin{cases} 1, & \text{if } \phi_3 < \left(1 + \exp(-v)\right)^{-1} \\ 0, & \text{otherwise} \end{cases} \qquad (16)$$
where \phi_3 is a random uniformly distributed variable in [0, 1). It follows that a higher
velocity induces a higher probability for the particle to choose 1. The equation
ensures that the particle stays within the search space domain; hence, no
relocation procedure is required.
The velocity of the particle is updated using the same equation as in classical
PSO. The semantics of each term in (13a) for binary PSO are special cases of their
original meaning. For example, if the best position of the particle (p_i^t) is 1 and the
current position (p^t) is 0, then p_i^t − p^t = 1. In this case, the second term in (13a) will
increase the value of v^t; hence, the probability that the particle will choose 1 will
also increase. Similarly, the velocity will decrease if p_i^t = 0 and p^t = 1. If the two
positions are the same, the individual learning term will not change the velocity, in
order to try to maintain the current choice. The same is true for the velocity updates
produced by the social learning term. The position of the particle may change due to
the stochastic nature of (16), even if the velocity does not change between
iterations.
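The stochastic position choice of Eq. (16) can be sketched as follows (a Python illustration; the function name is ours):

```python
import math
import random

def binary_position(v, rng=random):
    """Eq. (16): the particle answers 1 with probability sigmoid(v),
    so larger velocities bias the choice toward 1."""
    return 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
```

At v = 0 the sigmoid equals 0.5, so the two answers are equally likely; strongly positive or negative velocities make the choice nearly deterministic.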
The complete PSO algorithm for binary optimization problems is presented in
vector form in Fig. 14.
Other PSO variants can also be successfully used on binary spaces. In Wang
et al. (2008), the authors propose that the outcome of the binary queries be established
randomly based on the position of the particle within a continuous space.
Khanesar et al. (2009) present a variation of binary PSO in which the particle
toggles its binary position with a probability depending on its velocity.
Metaheuristics stand as a basis for the design of efficient algorithms for various data
analysis tasks. Such approaches are either extensions of conventional techniques, obtained
as hybridizations with metaheuristics, or evolve as new self-contained data
analysis methods.
There is a large variety of approaches to data clustering based on GAs (Breaban
et al. 2012; Hruschka et al. 2009; Luchian et al. 1994), DE (Zaharie 2005), PSO
(Breaban and Luchian 2011; Rana et al. 2011), and ACO (Shelokar et al. 2004).
Learning Classifier Systems (Lanzi et al. 2000) are one of the major families of
techniques that apply EC to machine learning; these systems evolve a set of
condition–action rules able to solve classification problems. Decision trees (Turney
1995) and support vector machines (Stoean et al. 2009, 2011) are also evolved with
GAs. The representative application example of EAs in regression analysis is the
use of genetic programming for symbolic regression, a topic covered in detail in
Chapter “Genetic Programming Techniques with Applications in the Oil and Gas
Industry” of this book. Many algorithms based on metaheuristics tackle feature
selection and feature extraction.
> show(rock)
  area   peri     shape  perm
1 4990 2791.900 0.0903296   6.3
2 7002 3892.600 0.1486220   6.3
3 7558 3930.660 0.1833120   6.3
4 7352 3869.320 0.1170630   6.3
...
48 9718 1485.580 0.2004470 580.0
> library(e1071)
> svr <- svm(perm ~ area+peri+shape, data=rock,
+ type="eps-regression", kernel = "radial")
> predicted <- predict(svr, newdata=rock, type="response")
> MSE(predicted, rock$perm)
[1] 35316.21
> cor(predicted, rock$perm)
[1] 0.9040716
> plot(predicted, rock$perm)
The default settings of the three hyperparameters used in the run above can be
inspected next: cost is the regularization parameter, gamma is a parameter of the
kernel function, and epsilon is the size of the insensitive tube.
> summary(svr)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 1
gamma: 0.3333333
epsilon: 0.1
Any of the metaheuristics presented in this chapter can be used to tackle this
minimization problem. We illustrate here the use of DE:
> DEparams <- DEoptim(trainingError, lower = c(0, 0, 0), upper = c(4, 4, 1),
+   control = list(strategy = 1, NP = 20, itermax = 20, trace = TRUE))
Iteration: 1 bestvalit: 5937.692186 bestmemit: 1.929174 2.872409 0.012022
Iteration: 2 bestvalit: 5630.575260 bestmemit: 3.110530 3.717773 0.166768
Iteration: 3 bestvalit: 3623.210268 bestmemit: 2.818071 3.682759 0.077892
...
Iteration: 20 bestvalit: 1473.135923 bestmemit: 3.983884 3.812688 0.046011
The solution obtained by DE is stored next in the vector params and is used to
train a new SVM.
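For readers who prefer to see the mechanics outside R, the DE/rand/1/bin scheme selected above (strategy = 1 in DEoptim) can be sketched in Python; this is an illustrative re-implementation under our own naming, not the DEoptim code:

```python
import random

def de_rand_1_bin(f, lower, upper, np_=20, itermax=100, F=0.8, CR=0.9, seed=1):
    """Illustrative DE/rand/1/bin minimizer of f over a box [lower, upper]."""
    rng = random.Random(seed)
    d = len(lower)
    pop = [[rng.uniform(lower[j], upper[j]) for j in range(d)] for _ in range(np_)]
    fit = [f(x) for x in pop]
    for _ in range(itermax):
        for i in range(np_):
            # Mutation: v = x_a + F * (x_b - x_c), with a, b, c distinct from i.
            a, b, c = rng.sample([k for k in range(np_) if k != i], 3)
            jrand = rng.randrange(d)           # ensure at least one mutated gene
            trial = [pop[a][j] + F * (pop[b][j] - pop[c][j])
                     if (rng.random() < CR or j == jrand) else pop[i][j]
                     for j in range(d)]
            # Keep the trial vector inside the box constraints.
            trial = [min(max(t, lower[j]), upper[j]) for j, t in enumerate(trial)]
            ft = f(trial)
            if ft <= fit[i]:                   # greedy one-to-one selection
                pop[i], fit[i] = trial, ft
    best = min(range(np_), key=fit.__getitem__)
    return pop[best], fit[best]

# Sanity check: minimize the sphere function on [-5, 5]^3.
x, fx = de_rand_1_bin(lambda v: sum(t * t for t in v), [-5.0] * 3, [5.0] * 3)
print(fx)
```

In the hyperparameter-tuning setting above, f would play the role of the training-error function and the box bounds those passed to DEoptim (lower = c(0, 0, 0), upper = c(4, 4, 1)).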
Figure 15 illustrates the predicted values compared to the real values for the case of
SVR with default settings (a) and for the case of SVR with optimized
hyperparameters (b).
The optimized model gives much better results with regard to the prediction error,
but it is prone to overfitting: a single dataset was used both for training and testing;
in this situation, the model is highly adapted to the dataset and may suffer from
poor generalization power. We can avoid overfitting by using distinct sets for
training and testing. The new function to be optimized should be formulated as
shown below. Although very similar to the previous version in its definition, this
function is significantly different in behavior: it invokes a “training” dataset in the
learning phase but computes the prediction error on a “testing” dataset:
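Such a train/test objective can be sketched as follows (an illustrative Python analogue; a toy one-dimensional ridge model stands in for the SVR, and all data and names are ours):

```python
# Hedged sketch: a hyperparameter objective that fits on one dataset
# and scores on another, mirroring the train/test split described above.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [1.5, 2.5, 3.5], [3.0, 5.1, 6.9]

def training_error(lam):
    # Fit the slope a of y ~ a*x on the training set, shrunk by lam ...
    a = sum(x * y for x, y in zip(train_x, train_y)) / (
        sum(x * x for x in train_x) + lam)
    # ... but report the mean squared error on the held-out testing set.
    return sum((a * x - y) ** 2 for x, y in zip(test_x, test_y)) / len(test_x)

print(training_error(0.0), training_error(100.0))
```

The optimizer then searches over the hyperparameter (here, the shrinkage lam) while every candidate is scored on data the model never saw during fitting.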
Fig. 15 Predicted versus expected values in regression analysis with SVR using: a default
hyperparameter settings and b optimized settings
On Metaheuristics in Optimization and Data Analysis. Application to Geosciences 93
The validation of the regression model obtained with the optimized
hyperparameters requires in this case a third dataset, the validation set. This phase
closes the analysis which, as recommended in the case of any supervised learning
task, is composed of three phases: training, testing, and validation. If the accuracy/
error obtained in the validation phase is satisfactory, the model can be used in
production.
In Singh et al. (2013), an evolutionary algorithm solves an oil production planning
problem that arises for wells with insufficient oil pressure: identifying the amount
of gas that should be injected into a well in order to maximize the amount of oil
extracted from that well. The problem is made more difficult by a constraint on the
total amount of gas available daily. The authors propose a multiobjective approach
to the problem and also formulate a single-objective version, focused on the
maximization of profit instead of the oil quantity. The problem of gas allocation
among oil wells is also tackled in Ghaedi et al. (2013), by means of a hybrid GA,
and in Abdel Rasoul et al. (2014).
The optimal well type and location are determined with PSO in Onwunalu and
Durlofsky (2010), in a study involving vertical, deviated, and dual-lateral wells.
Comparisons with a GA over multiple runs of both algorithms show that PSO
outperforms, on average, the GA, yet the advantage of using PSO over GA
varies among the cases surveyed. Driven by the goal of maximizing the total
hydrocarbon recovery, a well placement problem is tackled in Nwankwor et al.
(2013), where a hybrid PSO–DE algorithm is proposed. The hybrid is
compared to basic variants of PSO and DE on three problem cases concerning the
placement of vertical wells in 2D and 3D reservoir models. Optimal well placement
under uncertainty is tackled in a two-stage approach in Lyons and Nasrabadi
(2013). First, an ensemble Kalman filter is used to perform history matching on the
reservoir data. Then, well placement is solved by a GA combined with
pseudo-history matching.
Carbon dioxide (CO2) sequestration is of great interest for oil engineers. In
recent years, the idea of storing CO2 in deep geological formations, such as
depleted oil and gas reservoirs (with impermeable rocks), has gained considerable
attention from the community as a solution for greenhouse gas mitigation,
preventing CO2 from being emitted into the atmosphere. CO2 sequestration also
helps by enhancing methods for oil or gas recovery (Zangeneh et al. 2013).
Evolutionary algorithms are
used in order to identify carbon dioxide seepage areas in Cortis et al. (2008). In
Zangeneh et al. (2013), the parameters of a CO2 storage model are optimized using
a GA. A multiobjective GA (NSGA) is implemented for optimizing gas storage
alongside oil recovery in Safarzadeh and Motahhari (2014). Based on the results
from the GA, the authors are able to propose some production scenarios.
In Fichter et al. (2000), a portfolio optimization problem for the oil and gas
industry is tackled by means of a GA. GAs are chosen for this task both due to their
scalability to extremely large portfolios and because they allow the analysis of
portfolios from the point of view of value and risk measures.
GA and PSO are used to find the optimal parameters of a linear and an exponential
model for the demand of oil in Iran in Assareh et al. (2010). The models use
as input variables the population, the gross domestic product, import, and export
data; they are used to forecast demand of oil up to 2030.
PSO emerged as a powerful algorithm for geophysical inverse problems when
compared to GAs and simulated annealing in Martínez et al. (2010) and Shaw and
Srivastava (2007).
References
Abdel Rasoul RR, Daoud A, El Tayeb ESA (2014) Production allocation in multi-layers gas
producing wells using temperature measurements with the application of a genetic algorithm.
Pet Sci Technol 32(3):363–370
Ahmadi MA, Ebadi M (2014) Robust intelligent tool for estimation dew point pressure in
retrograded condensate gas reservoirs: application of particle swarm optimization. J Pet Sci
Eng 123:7–19
Ahmadi MA, Zendehboudi S, Lohi A, Elkamel A, Chatzis I (2013) Reservoir permeability
prediction by neural networks combined with hybrid genetic algorithm and particle swarm
optimization. Geophys Prospect 61(3):582–598
Alkazemi B, Mohan CK (2002) Multiphase generalization of the particle swarm optimization
algorithm. In: Proceedings of the IEEE congress on evolutionary computation. IEEE Press
Angeline PJ (1998) Using selection to improve particle swarm optimization. In: Proceedings of the
IEEE international conference on evolutionary computation. IEEE Press, pp 84–89.
ISBN 0780348699
Assareh E, Behrang MA, Assari MR, Ghanbarzadeh A (2010) Application of PSO (particle swarm
optimization) and GA (genetic algorithm) techniques on demand estimation of oil in Iran.
Energy 35(12):5223–5229
Bäck T (1996) Evolutionary algorithms in theory and practice. Oxford University Press, New York
Baker JD (1985) Adaptive selection methods for genetic algorithms. In: Proceedings of an
International Conference on Genetic Algorithms and their applications. Hillsdale, New Jersey,
pp 101–111
Baker JD (1987) Reducing bias and inefﬁciency in the selection algorithm. In: Proceedings of the
second international conference on genetic algorithms. pp 14–21
Bautu A (2010) Generalizations of Particle Swarm Optimization: applications of particle swarm
algorithms to statistical physics and bioinformatics problems. PhD Thesis, Department of
Computer Science, Al. I. Cuza University, Lambert Academic Publishing. ISBN 9783848417315
Blum C, Roli A (2003) Metaheuristics in combinatorial optimization: overview and conceptual
comparison. ACM Comput Surv 35(3):268–308. ISSN 0360-0300.
doi:10.1145/937503.937505
Boyd R, Richerson PJ (1985) Culture and the evolutionary process. The University of Chicago
Press, Chicago
Bratton D, Kennedy J (2007) Deﬁning a standard for particle swarm optimization. In: Swarm
intelligence symposium, 2007. SIS 2007, IEEE, pp 120–127
the 2002 Congress—vol 02, CEC’02. IEEE Computer Society, Washington, pp 1474–1479.
ISBN 0780372824. http://portal.acm.org/citation.cfm?id=1251972.1252447
Lanzi PL, Stolzmann W, Wilson SW (2000) Learning classiﬁer systems: from foundations to
applications (No. 1813). Springer, Berlin
Løvbjerg M, Rasmussen TK, Krink T (2001) Hybrid particle swarm optimiser with breeding and
subpopulations. In: Proceedings of the genetic and evolutionary computation conference
(GECCO-2001). Morgan Kaufmann, pp 469–476
Luchian S, Luchian H, Petriuc M (1994) Evolutionary automated classiﬁcation. In: Proceedings of
1st congress on evolutionary computation, pp 585–588
Lyons J, Nasrabadi H (2013) Well placement optimization under time-dependent uncertainty using
an ensemble Kalman filter and a genetic algorithm. J Petrol Sci Eng 109:70–79
Martínez JLF, Gonzalo EG, Álvarez JPF, Kuzma HA, Pérez COM (2010) PSO: a powerful
algorithm to solve geophysical inverse problems: application to a 1D-DC resistivity case.
J Appl Geophys 71(1):13–25
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state
calculations by fast computing machines. J Chem Phys 21(6):1087–1092
Michalewicz Z (1992) Genetic algorithms + data structures = evolution programs (3rd edn).
Springer, Berlin. ISBN 3540606769
Mitchell M (1996) An introduction to genetic algorithms. MIT Press, Cambridge.
ISBN 0262133164
Mitchell M, Forrest S, Holland JH (1992) The royal road for genetic algorithms: ﬁtness landscapes
and ga performance. In: Proceedings of the ﬁrst European conference on artiﬁcial life,
pp 245–254. The MIT Press, Cambridge
Mohaghegh SD (2005) A new methodology for the identiﬁcation of best practices in the oil and
gas industry, using intelligent systems. J Pet Sci Eng 49(3):239–260
Mohaghegh SD et al (2005) Recent developments in application of artiﬁcial intelligence in
petroleum engineering. J Pet Technol 57(4):86–91
Mullen KM, Ardia D, Gil DL, Windover D, Cline J (2011) DEoptim: an R package for global
optimization by differential evolution. J Stat Softw 40(6):1–26
Madavan NK (2002) Multiobjective optimization using a Pareto differential evolution
approach. In: Proceedings of the world congress on computational intelligence, vol 2. IEEE,
pp 1145–1150
Nguyen NT, Kowalczyk R (2012) Transactions on computational collective intelligence III.
Springer, Berlin
Nwankwor E, Nagar AK, Reid DC (2013) Hybrid differential evolution and particle swarm
optimization for optimal well placement. Comput Geosci 17(2):249–268
Onwunalu JE, Durlofsky LJ (2010) Application of a particle swarm optimization algorithm for
determining optimum well location and type. Comput Geosci 14(1):183–198
Park HY, Datta-Gupta A, King MJ (2014) Handling conflicting multiple objectives using
Pareto-based evolutionary algorithm during history matching of reservoir performance. J Pet Sci Eng
Piotrowski AP, Osuch M, Napiorkowski MJ, Rowinski PM, Napiorkowski JJ (2014) Comparing
large number of metaheuristics for artiﬁcial neural networks training to predict water
temperature in a natural river. Comput Geosci 64:136–151
Poli R, Kennedy J, Blackwell T (2007) Particle swarm optimization. Swarm Intell 1(1):33–57
Poli R, Langdon WB, McPhee NF (2008) A field guide to genetic programming. http://www.gp-field-guide.org.uk
(with contributions by JR Koza)
Poormirzaee R, Moghadam RH, Zarean A (2014) Inversion seismic refraction data using particle
swarm optimization: a case study of Tabriz, Iran. Arab J Geosci 1–9
Radcliffe NJ, Surry PD (1995) Fitness variance of formae and performance prediction. In:
Foundations of genetic algorithms, pp 51–72
Raidl GR, Gottlieb J (2005) Empirical analysis of locality, heritability and heuristic bias in
evolutionary algorithms: a case study for the multidimensional knapsack problem. Evol
Comput 13(4):441–475
Rana S, Jasola S, Kumar R (2011) A review on particle swarm optimization algorithms and their
applications to data clustering. Artif Intell Rev 35(3):211–222
Rechenberg I (1973) Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der
biologischen Evolution. Frommann-Holzboog Verlag, Stuttgart
Riget J, Vesterstrøm JS (2002) A diversity-guided particle swarm optimizer: the ARPSO.
Department of Computer Science, University of Aarhus, Aarhus, Denmark, Technical Report,
vol 2. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.2929
Safarzadeh MA, Motahhari SM (2014) Co-optimization of carbon dioxide storage and enhanced
oil recovery in oil reservoirs using a multiobjective genetic algorithm (NSGA-II). Pet Sci
11(3):460–468
Schwefel HPP (1993) Evolution and optimum seeking. Wiley, Hoboken
Scrucca L (2013) GA: a package for genetic algorithms in R. J Stat Softw 53(4):1–37. http://www.
jstatsoft.org/v53/i04/
Shakhsi-Niaei M, Iranmanesh SH, Torabi SA (2013) A review of mathematical optimization
applications in oil-and-gas upstream & midstream management. Int J Energy Stat
1(02):143–154
Shaw R, Srivastava S (2007) Particle swarm optimization: a new tool to invert geophysical data.
Geophysics 72(2):F75–F83
Shelokar PS, Jayaraman VK, Kulkarni BD (2004) An ant colony approach for clustering.
Analytica Chimica Acta 509(2):187–195
Shi Y, Eberhart RC (1998) Parameter selection in particle swarm optimization. In: EP’98:
proceedings of the 7th international conference on evolutionary programming VII. Springer,
London, pp 591–600. ISBN 3540648917
Simon HA (1969) The sciences of the artiﬁcial, vol 136. MIT Press, Cambridge
Singh HK, Ray T, Sarker R (2013) Optimum oil production planning using infeasibility driven
evolutionary algorithm. Evolut Comput 21(1):65–82
Stoean R, Preuss M, Stoean C, ElDarzi E, Dumitrescu D (2009) Support vector machine learning
with an evolutionary engine. J Oper Res Soc 60(8):1116–1122
Stoean C, Preuss M, Stoean R, Dumitrescu D (2010) Multimodal optimization by means of a
topological species conservation algorithm. IEEE Trans Evolut Comput 14(6):842–864
Stoean R, Stoean C, Lupsor M, Stefanescu H, Badea R (2011) Evolutionary-driven support vector
machines for determining the degree of liver fibrosis in chronic hepatitis C. Artif Intell Med
51:53–65. ISSN 0933-3657
Storn R, Price K (1997) Differential evolution: a simple and efficient heuristic for global
optimization over continuous spaces. J Glob Optim 11(4):341–359. ISSN 0925-5001.
doi:10.1023/A:1008202821328
Sun J, Feng B, Xu W (2004) Particle swarm optimization with particles having quantum behavior.
In Proceedings of the IEEE congress on evolutionary computation. IEEE Press, pp 325–331
Talbi EG (2009) Metaheuristics: from design to implementation, vol 74. Wiley, Hoboken
Thander B, Sircar A, Karmakar GP (2014) Hydrocarbon resource estimation: a stochastic
approach. J Pet Explor Prod Technol 1–8
Tronicke J, Paasche H, Böniger U (2012) Crosshole traveltime tomography using particle swarm
optimization: a nearsurface ﬁeld example. Geophysics 77(1):R19–R32
Turney P (1995) Costsensitive classiﬁcation: empirical evaluation of a hybrid genetic decision
tree induction algorithm. J Artif Intell Res 2:369–409
Voß S (2001) Metaheuristics: the state of the art. In: Local search for planning and scheduling.
Springer, Berlin, pp 1–23
Wang L, Wang X, Fu J, Zhen L (2008) A novel probability binary particle swarm optimization
algorithm and its application. J Softw 3(9):28–35
Whitley D, Rana S, Heckendorn RB (1998) The island model genetic algorithm:
on separability, population size and convergence. J Comput Inf Technol 7:33–47
Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol
Comput 1(1):67–82
Zaharie D (2005) Density based clustering with crowding differential evolution. In: International
symposium on symbolic and numeric algorithms for scientiﬁc computing, pp 343–350
Zaharie D (2007) A comparative analysis of crossover variants in differential evolution. In:
Proceedings of IMCSIT 2007, pp 171–181
Zangeneh H, Jamshidi S, Soltanieh M (2013) Coupled optimization of enhanced gas recovery and
carbon dioxide sequestration in natural gas reservoirs: case study in a real gas ﬁeld in the south
of Iran. Int J Greenhouse Gas Control 17:515–522
Zitzler E, Deb K, Thiele L (2000) Comparison of multiobjective evolutionary algorithms:
empirical results. Evol Comput 8:173–195
Genetic Programming Techniques
with Applications in the Oil and Gas
Industry
Keywords Genetic programming · Regression · Gene expression programming ·
RGP · Petroleum engineering problems
This chapter presents the theoretical background behind the evolutionary algorithm
variant known as genetic programming (GP). Details on the features that make GP a
remarkable algorithm for data analysis are provided. Gene Expression Programming
(GEP) is a GP variant proposed by Ferreira (2001), which has since gained a
lot of interest from researchers for applications in various fields of science.
We chose to present it in this chapter since it is a good example of a hybrid
evolutionary algorithm that combines advantages from both GAs and GP, and it is
among the most used flavors of GP in applications. Insight into the inner workings
H. Luchian
Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi, Romania
A. Băutu
Faculty of Navigation and Naval Management, Romanian Naval Academy,
Constanta, Romania
E. Băutu (&)
Faculty of Mathematics and Computer Science, Ovidius University, Constanta, Romania
email: ebautu@gmail.com
1 Genetic Programming
Nichael Cramer’s work from 1985 stands at the root of the genetic programming
paradigm; he proposed a type of genetic algorithm with individuals represented by
computer programs (Cramer 1985). Cramer used the proposed algorithm to
automatically evolve simple mathematical expressions. His work was followed by
Schmidhuber’s idea of using Prolog and Lisp as support for evolutionary algorithms,
which led to a meta-learning algorithm based on GP (Dickmanns et al.
1987; Schmidhuber 1987). The inventor of modern GP is considered to be John
Koza, a former professor at Stanford University, who laid the foundation of what
is currently known as GP in his first book on the topic (Koza 1992). He envisioned
a genetic algorithm that evolves Lisp S-expressions in order to solve problems
automatically. Recent accounts on the topic of GP are provided in Poli and Koza
(2014) and Poli (2008); insights into the theoretical foundations of GP are provided
in Langdon and Poli (2002). In the following, we briefly describe the main traits of
GP that differentiate it from GAs, following the description provided in Bautu
(2010) and Bautu and Bautu (2009).
any adaptive (or learning-based) system. GP individuals are computer programs,
encoded as syntax trees (e.g., Fig. 1). The nodes in the tree are labeled with
symbols. The leaves of the tree are labeled with terminal symbols (the variables and
the constants in the program—in our example, x, 2), while the internal nodes are
labeled with functional symbols (e.g., algebraic operators, trigonometric functions,
or other common mathematical functions). During evolution, the sizes and
shapes of the trees are changing in order to adapt to the environment provided by
the problem. The search space for the GP algorithm is graphically depicted in
Fig. 2.
It is important for the symbol set of the algorithm, comprising all the functions
and terminals, to be carefully selected prior to running the GP algorithm, in order to
provide the prerequisites to model the proposed problem (Koza 1992). We refer, in
the following, to two properties that must be met by the symbol set: closure and
completeness.
The closure property requires each function in the function set to be well
defined and closed with respect to any combination of parameters it may receive during
evolution. This is usually achieved by the special treatment of a relatively small
Fig. 2 Graphical representation of the search space for GA (left) and GP (right)
number of situations. For example, for divide operations, which are not allowed to
receive zero as the second parameter, it is clear that the closure property is not
satisﬁed; likewise, the logarithm function should not receive negative parameters.
Examples of closed symbol sets (i.e., it is guaranteed that all syntactically valid
expressions formed with these symbols are also semantically valid):
• C = {AND, OR, NOT, x, y, TRUE, FALSE}, where x and y are Boolean
variables, and TRUE and FALSE are Boolean constants;
• C = {+, −, ×, x, y, 0, 1}, where x and y are integer variables.
Examples of function sets that are not closed are:
• C = {+, −, ×, /, x, y, 0, 1}, where x and y are real variables—the set is not
closed because it is possible to generate expressions which are semantically
invalid due to division by 0;
• C = {+, −, log, x}, where x is a real variable—the set is not closed; in case the
log function receives negative or null parameters, the resulting expression is not
semantically valid.
A possible solution for achieving closure of the symbol set is the definition of
protected functions. A protected function returns a special value from the terminal
set whenever an exceptional situation is detected. For example, in the case of the
division operator, a protected function can return 0 if the second parameter is 0:

/prot(x, y) = x/y, if y ≠ 0; 0, otherwise   (2)

In this way, the protected divide operation has a well-defined result for any
values of its parameters. The advantage of this approach is its simplicity from the
implementation point of view.
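Written as code, the protected division of Eq. (2) is a one-line guard (a trivial Python sketch; the function name is ours):

```python
def protected_div(x, y):
    """Protected division (Eq. 2): return 0 when the denominator is 0."""
    return x / y if y != 0 else 0

print(protected_div(6, 3), protected_div(6, 0))  # 2.0 0
```

Protected variants of log, sqrt, and similar partial functions follow the same pattern: test the exceptional argument and fall back to a fixed value.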
In order to meet the completeness property, one must make sure that the symbol
set for the algorithm is sufﬁcient in order to express a solution to the problem; in
general, expert knowledge is needed to implement this part. This property is
guaranteed only for some problem cases where there exist theoretical arguments or
empirical evidence favoring a particular choice of symbols.
The selection of the input variables necessary for a given problem can be
straightforward, or it may be solved by a feature extraction algorithm
(Veeramachaneni et al. 2010). Similarly, the function set that is sufficient to express a
problem solution is highly dependent on the problem to be solved.
For example, the function set {AND, OR, NOT} is sufficient to express any
Boolean function. By removing the AND function, the remaining set still meets the
sufficiency condition, because the AND function can be simulated via De Morgan’s
law: x AND y = NOT((NOT x) OR (NOT y)).
In case of removing the NOT function, the remaining set no longer meets the
sufficiency condition, because its effect cannot be simulated with the functions left in
the set. Thus, functions such as XOR cannot be expressed. As with the terminals,
the responsibility to establish the set of functions appropriate for the problem
remains with the user.
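The sufficiency of {OR, NOT} claimed above can be verified exhaustively over the truth table (a small Python check; the helper name is ours):

```python
from itertools import product

def and_via_or_not(x, y):
    """Simulate AND using only OR and NOT (De Morgan's law)."""
    return not ((not x) or (not y))

# Exhaustive check over all truth assignments.
print(all((x and y) == and_via_or_not(x, y)
          for x, y in product((False, True), repeat=2)))  # True
```

No analogous construction exists for {AND, OR} alone, since every expression over these functions is monotone, while NOT and XOR are not.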
If the symbols included in the symbol set are not sufficient to express a solution
to the problem, GP builds approximations of the real solution. For this
reason, the general set of symbols used in GP to express a solution to a given
problem usually does not coincide with the minimal set of symbols required to
express the solution; it typically contains additional symbols.
symbols may have on the quality of solutions identiﬁed by the algorithm is difﬁcult
to assess a priori. For example, the presence of additional variables in the set of
terminals may lead to a decrease in the algorithm performance in ﬁnding solutions
(Fig. 3); in this case, the GP algorithm also performs a feature selection task,
identifying automatically the variables that are signiﬁcant for the model.
For example, suppose GP is used to infer a formula for the exponential function
e^x. This function cannot be expressed exactly by a finite algebraic expression. If GP
uses the symbol set

C = {+, −, ×, /, x, y, 0, 1, 2},

it will, most likely, provide finite approximations for this function, such as 1,
1 + x, 1 + x + x²/2!, and 1 + x + x²/2! + x³/3!.
Fig. 3 Completeness of the symbols set and its effect on the solution
Fig. 5 Trees with depth 4, generated by the full method (left) and the grow method (right). Gray
nodes are terminal nodes
where N is the total number of cases for assessing individuals, S(i, j) is the value
obtained by assessing the individual i of the population for the variables in case j of
the input data, and C(j) is the correct (expected) value for case j. For the sake of
comparing individuals across different generations and algorithm runs, John Koza
introduced several types of ﬁtness which offer different abstraction degrees of the
individual performances, all of them based on the distance between the input data
and the estimations made by the GP individual (Koza 1992).
of the offspring. This process is illustrated in the algorithm (as shown in Fig. 8) and
exempliﬁed in Fig. 9.
When the cut point for subtree mutation is close to the root of the syntax tree, the
operator has a highly destructive effect; conversely, a mutation point near the
leaves of the tree has little chance of completely altering the expression encoded by
the individual. A practical solution to this problem is to assign variable mutation
probabilities to nodes on different levels of the tree, e.g., a mutation probability that
increases from the root to the frontier of the tree.
Permutation This operator randomly selects an internal node of the syntax tree.
Assume this node is labeled with a function of arity k. The permutation operator
generates a random permutation of the k children and swaps the children nodes
according to this permutation. In case the label of the target node is a commutative
function, the effect of this operator on the phenotype encoded by the tree is actually
null.
Editing The editing operator provides a way to reduce the complexity of
individuals’ chromosomes dynamically, at runtime. For example, the editing operator
might evaluate functions that are context-free and have only constants as parameters
and then replace these functions with the result of the evaluation. Complex editing
rules might require large computing resources. The use of this operator is justiﬁed
by the necessity of limiting code bloat (Luke 2000a, b), or if individuals need to be
made more readable (for example, one might process the solution of the algorithm
in order to obtain a more userfriendly solution).
Encapsulation Reusability of code may be implemented in GP by means of the
encapsulation operator. This operator assigns names to subtrees of
chosen individuals, so that they can later be referred to in GP chromosomes as
symbols. Encapsulation operates on a single individual by extracting parts of its
chromosome and mapping them to a new symbol name: it randomly selects an
internal node of the tree encoded in the individual, saves the subtree rooted at that
node under a new symbol name, and replaces the subtree with the new symbol.
The new symbol points to the original subtree and is included in the terminal set,
because it is a complete subtree and does not require any parameters to be
evaluated. The main benefit of this operator is that it protects the subtree used to
define the new symbol from the destructive effects of genetic operators. This
operator stands at the base of the automatically defined functions idea in GP
(Poli 2008).
where √ denotes the square root function. This encoding is obtained by the
breadth-first traversal of the expression tree in Fig. 10. The expression is different from the
prefix notation, as well as from the postfix notation obtained by depth-first traversal,
which are used by some vector-based or stack-based variants of GP (Keith and
Martin 1994).
Decoding the genotype into the equivalent phenotype follows the same rules.
For example, the genotype √ / − x y x is equivalent to the following expression
tree: the start symbol (√) has arity 1; hence, it is linked with the following
symbol (/); / has arity 2, and it is linked with the following two symbols, − and x. The
process continues until each symbol is linked with a number of symbols equal to its
arity. The symbols with arity 0 are leaf nodes in the phenotype’s expression tree.
The translation process builds the expression tree corresponding to √((y − x)/x).
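The breadth-first (Karva) decoding just described can be sketched in a few lines of Python (an illustrative implementation; Q stands for the square-root symbol, and all names are ours):

```python
import math

ARITY = {'Q': 1, '/': 2, '-': 2, '+': 2, '*': 2}
OPS = {
    'Q': lambda a: math.sqrt(a),
    '/': lambda a, b: a / b,
    '-': lambda a, b: a - b,
    '+': lambda a, b: a + b,
    '*': lambda a, b: a * b,
}

def decode(genome):
    """Breadth-first (Karva) decoding of a linear genome into a nested tree."""
    nodes = [[sym, []] for sym in genome]
    queue = [nodes[0]]                 # nodes still waiting for their children
    i = 1
    while queue:
        node = queue.pop(0)
        for _ in range(ARITY.get(node[0], 0)):
            child = nodes[i]
            i += 1
            node[1].append(child)
            queue.append(child)
    return nodes[0]

def evaluate(node, env):
    sym, children = node
    if sym in OPS:
        return OPS[sym](*(evaluate(c, env) for c in children))
    return env[sym]                    # terminal: look up the variable value

tree = decode(['Q', '/', '-', 'x', 'y', 'x'])
print(evaluate(tree, {'x': 1.0, 'y': 2.0}))  # sqrt((y - x) / x) = 1.0
```

Symbols beyond the point where all arities are satisfied are simply never linked, which is exactly how the inactive tail region of a GEP gene is ignored.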
GEP genes are divided into two structural units: head and tail. The head may
contain functions and terminals, while the tail is constrained to contain only
terminals. The tail size depends on the head size and on the set of symbols used in the
gene:

t = h(n − 1) + 1,

where t is the required minimum size of the tail, h is the size of the head, and n is
the maximum arity of the symbols that may appear inside the gene. In this
organization, GEP genes are padded at the end with symbols that may not be used in the
decoding (they are inactive). This structural organization of GEP genes ensures the
syntactic validity of all obtained programs. Also, GEP genetic operators always
produce syntactically correct expressions.
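The tail-sizing rule t = h(n − 1) + 1 is easy to check numerically (an illustrative sketch; names are ours):

```python
def tail_size(h, n):
    """Minimum GEP tail size for head size h and maximum symbol arity n."""
    return h * (n - 1) + 1

# With a head of 10 symbols and only binary operators (n = 2),
# the tail needs 11 terminals, for a gene of length 21.
h, n = 10, 2
t = tail_size(h, n)
print(t, h + t)  # 11 21
```

The rule guarantees enough terminals to close every expression even in the worst case, where all h head positions hold functions of maximal arity n.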
GEP individuals are multigenic chromosomes, where each gene encodes a valid
expression tree which interacts with the other genes to creat