
Constantin Cranganu · Henri Luchian · Mihaela Elena Breaban (Editors)

Artificial Intelligent Approaches in Petroleum Geosciences

Springer
Editors

Constantin Cranganu
Brooklyn College
Brooklyn, NY, USA

Henri Luchian
University of Iaşi
Iaşi, Romania

Mihaela Elena Breaban
University of Iaşi
Iaşi, Romania

ISBN 978-3-319-16530-1 ISBN 978-3-319-16531-8 (eBook)


DOI 10.1007/978-3-319-16531-8

Library of Congress Control Number: 2015933823

Springer Cham Heidelberg New York Dordrecht London


© Springer International Publishing Switzerland 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained
herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media


(www.springer.com)
Preface

Integration, handling data of immense size and uncertainty, and dealing with risk
management are among crucial issues in petroleum geosciences. The problems one
has to solve in this domain are becoming too complex to rely on a single discipline
for effective solutions, and the costs associated with poor predictions (e.g., dry
holes) increase. Therefore, there is a need to establish new approaches aimed at
proper integration of disciplines (such as petroleum engineering, geology, geo-
physics, and geochemistry), data fusion, risk reduction, and uncertainty
management.
This book presents several artificial intelligent approaches1 for tackling and
solving challenging practical problems from the petroleum geosciences and
petroleum industry. Written by experienced academics, this book offers state-of-
the-art working examples and provides the reader with exposure to the latest
developments in the field of artificial intelligent methods applied to oil and gas
research, exploration, and production. It also analyzes the strengths and weaknesses
of each method presented using benchmarking, while also emphasizing essential
parameters such as robustness, accuracy, speed of convergence, computer time,
overlearning, or the role of normalization.
The reader of this book will benefit from exposure to the latest developments in
the field of modern heuristics applied to oil and gas research, exploration, and
production. These approaches can be used for uncertainty analysis, risk assessment,
data fusion and mining, data analysis and interpretation, and knowledge discovery,
from diverse data such as 3-D seismic, geological data, well logging, and pro-
duction data. Thus, the book is intended for petroleum scientists, data miners, data
scientists and professionals, and postgraduate students involved in the petroleum
industry.
Petroleum Geosciences are—like many other fields—a paradigmatic realm of
difficult optimization and decision-making real-world problems. As the number,

1
Artificial Intelligence methods, some of which are grouped together in various ways, under
names such as Computational Intelligence, Soft Computing, Meta-heuristics, or Modern heuristics.


difficulty, and scale of such specific problems increase steadily, the need for
diverse, adjustable problem-solving tools can hardly be satisfied by the necessarily
limited number of approaches typically included in a curriculum/syllabus from
academic fields other than Computer Science (such as Petroleum Geology).
Therefore, the first three chapters of this volume aim at providing working infor-
mation about modern problem-solving tools, in particular in machine learning and
in data mining, and also at inciting the reader to look further into this thriving topic.
Traditionally, solving a given problem in mathematics and in sciences at large
implies the construction of an abstract model, the process of proving theoretical
results valid in that model, and eventually, based on those theoretical results, the
design of a method for solving the problem. This problem-solving paradigm has
been and will continue to be immensely successful. Nevertheless, an abstract model
is an approximation of the real-world problem; there have been failures triggered by
a tiny mismatch between the original problem and the proposed model for it.
Furthermore, a problem-solving method developed in this manner is likely to be
useful only for the problem at hand. While, ultimately, any problem-solving
technique may be—in various degrees—subject to these two observations, some
relatively new approaches illustrate alternative lines of attack; it is the editors’ hope
that the first three chapters of the book illustrate this idea in a way that will prove to
be useful to the readers.
In the first chapter, Simovici presents some of the main paradigms of intelligent
data analysis provided by machine learning and data mining. After discussing
several types of learning (supervised, unsupervised, semi-supervised, active, and
reinforcement learning), he examines several classes of learning algorithms (naïve
Bayes classifiers, decision trees, support vector machines, and neural networks) and
the modalities to evaluate their performance. Examples of specific applications of
algorithms are given using System R.
The second and third chapters, by Luchian, Breaban, and Bautu, are dedicated to
meta-heuristics. After a rather simple introduction to the topic, the second chapter
presents, based on working examples, evolutionary computing in general and, in
particular, genetic algorithms and differential evolution; particle swarm optimiza-
tion is also extensively discussed. Topics of particular importance, such as multi-
modal and multi-objective problems, hybridization, and also applications in
petroleum geosciences are discussed based on concrete examples. The third chapter
gives a compact presentation of genetic programming, gene expression program-
ming, and also discusses an R package for genetic programming and applications of
GP for solving specific problems from the oil and gas industry.
Ashena and Thonhauser discuss Artificial Neural Networks (ANNs), which have the potential to increase problem-solving ability in the geosciences and in the petroleum industry, particularly in cases of limited availability or lack of input data.
ANN applications have become widespread because they proved to be able to
produce reasonable outputs for inputs they have not learned how to deal with. The
following subjects are presented: artificial neural networks basics (neurons, acti-
vation function, ANN structure), feed-forward ANN, back-propagation and learn-
ing, perceptrons and back-propagation, multilayer ANNs and back-propagation
Preface vii

algorithm, data processing by ANN (training, overfitting, testing, validation), ANN,


and statistical parameters. An applied example of ANN, followed by applications of
ANN in geosciences and petroleum industry complete the chapter.
Al-Anazi and Gates present the use of support vector regression to accurately
estimate two important geomechanical rock properties, Poisson’s ratio and Young’s
modulus. Accurate prediction of rock elastic properties is essential for wellbore
stability analysis, hydraulic fracturing design, sand production prediction and
management, and other geomechanical applications. The two most common
required material properties are Poisson’s ratio and Young’s modulus. These elastic
properties are often reliably determined from laboratory tests by using cores
extracted from wells under simulated reservoir conditions. Unfortunately, most
wells have limited core data. On the other hand, wells typically have log data. By
using suitable regression models, the log data can be used to extend knowledge of
core-based elastic properties to the entire field. Artificial neural networks (ANN)
have proven to be successful in many reservoir characterization problems. Although
nonlinear problems can be well resolved by ANN-based models, extensive
numerical experiments (training) must be done to optimize the network structure. In
addition, generated regression models from ANNs may not perfectly generalize to
unseen input data. Recently, support vector machines (SVMs) have proven successful in several real-world applications owing to their potential to generalize and converge to a global optimal solution. SVM models are based on the structural risk minimization principle, which minimizes the generalization error by striking a balance between empirical training errors and learning machine capacity. This has proven superior in several applications to the empirical risk minimization principle adopted by ANNs, which aims to reduce the training error only. Here, support vector regression (SVR) to predict Poisson's ratio and Young's modulus is described. The method uses a fuzzy-based ranking algorithm to select the most significant input variables and filter out dependency. The learning and predictive capabilities of the SVR method are compared to those of a back-propagation neural network (BPNN). The results demonstrate that SVR has similar or superior learning and prediction capabilities to those of the BPNN. Parameter sensitivity analysis was performed to investigate the effect of the SVM regularization parameter, the regression tube radius, and the type of kernel function used. The results show that the capability of the SVM approximation depends strongly on these parameters.
The next three chapters introduce the active learning method (ALM) and present
various applications of it in petroleum geosciences.
First, Cranganu and Bahrpeyma use ALM to predict a missing log (DT, or sonic log) when only two other logs (GR and REID) are present. In their approach,
applying ALM involves three steps: (1) supervised training of the model, using
available GR, REID, and DT logs; (2) confirmation and validation of the model by
blind-testing the results in a well containing both the predictors (GR, REID) and the
target (DT) values; and (3) applying the predicted model to wells containing the
predictor data and obtaining the synthetic (simulated) DT values. Their results
indicate that the performance of the algorithm is satisfactory, while the performance
time is significantly low. The quality of the simulation procedure was assessed by three parameters, namely mean square error (MSE), mean relative error (MRE), and the Pearson product-moment correlation coefficient (R). The authors employed both the measured and simulated sonic log DT to predict the presence and estimate the depth intervals where an overpressured fluid zone may develop in the Anadarko Basin, Oklahoma. Based on interpretation of the sonic log trends, they inferred that overpressure regions are developing between ~1,250 and 2,500 m depth and that the overpressured intervals have thicknesses varying between ~700 and 1,000 m. These results match very well previously published results reported in the Anadarko Basin, using the same wells, but different artificial intelligent approaches.
Second, Bahrpeyma et al. employed ALM to estimate another missing log in hydrocarbon reservoirs, namely the density log. The regression coefficient and the normalized mean squared error (nMSE) for estimating the density log using ALM were equal to 0.9 and 0.042, respectively. The results, including errors and regression coefficients, proved that ALM was successful in processing the density estimation. In their chapter, the authors illustrate ALM with an example from a petroleum field in the NW Persian Gulf.
Third, Bahrpeyma et al. tackled the common issue of reservoir engineers having to analyze reservoirs with small sets of measurements (known as the small sample size problem). Because of the small sample size problem, modeling techniques commonly fail to accurately extract the true relationships between the inputs and the outputs used for reservoir property prediction or modeling. In this chapter, the small sample size problem is addressed for modeling carbonate reservoirs by using the active learning method (ALM). The noise injection technique, a popular solution to the small sample size problem, is employed to recover the impact of separating the validation and test sets from the entire sample set in the process of ALM. The proposed method is used to model hydraulic flow units (HFUs). HFUs are defined as correlatable and mappable zones within a reservoir that control fluid flow. This research presents a quantitative formulation between flow units and well log data in one of the heterogeneous carbonate reservoirs in the Persian Gulf. The results for R and nMSE are 85 % and 0.0042, respectively, which reflect the ability of the proposed method to improve the generalization ability of the ALM when facing the small sample size problem.
Dobróka and Szabó carried out well log analysis by a global optimization-based interval inversion method. Global optimization procedures, such as genetic algorithms and simulated annealing methods, offer robust and highly accurate solutions to several problems in petroleum geosciences. The authors argue that these methods can be used effectively in the solution of well-logging inverse problems. Traditional
inversion methods are used to process the borehole geophysical data collected at a given depth point. Since there are barely more types of probes than unknowns at a given depth, a set of marginally overdetermined inverse problems has to be solved along a borehole. This single inversion scheme represents a relatively noise-sensitive interpretation procedure. To reduce the noise, the degree of overdetermination of the inverse problem must be increased. This condition can be achieved by using a so-called interval inversion method, which inverts all data from a greater depth interval jointly to estimate the petrophysical parameters of hydrocarbon reservoirs in the same interval. The chapter gives a detailed description of the interval inversion
problem, which is then solved by a series expansion-based discretization technique.
The high degree of overdetermination significantly increases the accuracy of
parameter estimation. The quality improvement in the accuracy of estimated model
parameters often leads to a more reliable calculation of hydrocarbon reserves. The
knowledge of formation boundaries is also required for reserve calculation. Well
logs contain information about layer thicknesses, which cannot be extracted by the
traditional local inversion approach. The interval inversion method is applicable to
derive the layer boundary coordinates and certain zone parameters involved in the
interpretation problem automatically. In this chapter, the authors analyzed how to
apply a fully automated procedure for the determination of rock interfaces and
petrophysical parameters of hydrocarbon formations. Cluster analysis of well-
logging data is performed as a preliminary data-processing step before inversion.
The analysis of cluster number log allows the separation of formations and gives an
initial estimate for layer thicknesses. In the global inversion phase, the model
including petrophysical parameters and layer boundary coordinates is progressively
refined to achieve an optimal solution. The very fast simulated reannealing method
ensures the best fit between the measured data and theoretical data calculated on the
model. The inversion methodology is demonstrated by a hydrocarbon field exam-
ple, with an application for shaly sand reservoirs.
Finally, Mohebbi and Kaydani undertake a detailed review of meta-heuristics
dealing with permeability estimation in petroleum reservoirs. They argue that
proper permeability distribution in reservoir models is very important for the
determination of oil and gas reservoir quality. In fact, it is not possible to obtain accurate solutions in many petroleum engineering problems without having accurate values for this key parameter of hydrocarbon reservoirs. Permeability estimation
by individual techniques within the various porous media can vary with the state of
in situ environment, fluid distribution, and the scale of the medium under investi-
gation. Recently, attempts have been made to utilize meta-heuristics for the iden-
tification of the relationship that may exist between the well log data and core
permeability. This chapter overviews the different meta-heuristics in permeability
prediction, indicating the advantages of each method. In the end, some suggestions
and comments about how to choose the best method are presented.

December 2014 Constantin Cranganu


Henri Luchian
Mihaela Elena Breaban
Contents

Intelligent Data Analysis Techniques—Machine Learning and Data Mining . . . 1
Dan Simovici

On Meta-heuristics in Optimization and Data Analysis. Application to Geosciences . . . 53
Henri Luchian, Mihaela Elena Breaban and Andrei Bautu

Genetic Programming Techniques with Applications in the Oil and Gas Industry . . . 101
Henri Luchian, Andrei Băutu and Elena Băutu

Application of Artificial Neural Networks in Geoscience and Petroleum Industry . . . 127
Rahman Ashena and Gerhard Thonhauser

On Support Vector Regression to Predict Poisson's Ratio and Young's Modulus of Reservoir Rock . . . 167
A.F. Al-Anazi and I.D. Gates

Use of Active Learning Method to Determine the Presence and Estimate the Magnitude of Abnormally Pressured Fluid Zones: A Case Study from the Anadarko Basin, Oklahoma . . . 191
Constantin Cranganu and Fouad Bahrpeyma

Active Learning Method for Estimating Missing Logs in Hydrocarbon Reservoirs . . . 209
Fouad Bahrpeyma, Constantin Cranganu and Behrouz Zamani Dadaneh

Improving the Accuracy of Active Learning Method via Noise Injection for Estimating Hydraulic Flow Units: An Example from a Heterogeneous Carbonate Reservoir . . . 225
Fouad Bahrpeyma, Constantin Cranganu and Bahman Golchin

Well Log Analysis by Global Optimization-based Interval Inversion Method . . . 245
Mihály Dobróka and Norbert Péter Szabó

Permeability Estimation in Petroleum Reservoir by Meta-heuristics: An Overview . . . 269
Ali Mohebbi and Hossein Kaydani

Index . . . 287
Intelligent Data Analysis
Techniques—Machine Learning
and Data Mining

Dan Simovici

Abstract This introductory chapter presents some of the main paradigms of intelligent data analysis provided by machine learning and data mining. After discussing several types of learning (supervised, unsupervised, semi-supervised, active, and reinforcement learning), we examine several classes of learning algorithms (naïve Bayes classifiers, decision trees, support vector machines, and neural networks) and the modalities to evaluate their performance. Examples of specific applications of algorithms are given using System R.

Keywords Supervised learning · Unsupervised learning · Clustering · Generalization · Overfitting · Active learning · Classifiers · A priori probabilities · A posteriori probabilities · Decision trees · Entropy · Impurity · Naive Bayes classifiers · Perceptrons · Neural networks

1 Introduction

Machine learning and its applied counterpart, data mining, deal with problems that, due to their complexity, present difficulties in formulating algorithms that can be readily translated into programs. Examples of such problems are finding diagnoses for patients starting from a series of their symptoms, or determining the creditworthiness of customers based on their demographics and credit history. In each of these problems, the challenge is to compute a label for each analyzed piece of data that depends on the characteristics of the data.
The general approach known as supervised learning is to begin with a number of
labeled examples (where answers are known) in order to generate an algorithm that
computes the function that gives the answers starting from these examples.

D. Simovici (✉)
Department of Computer Science, University of Massachusetts Boston, Boston, MA, USA
e-mail: dsim@cs.umb.edu


In other approaches in machine learning, the challenge is to identify structure that is hidden in data, e.g., identifying groups of data such that strong similarity exists between objects that belong to the same group and objects that belong to different groups are sufficiently distinct. This activity is known as clustering and belongs to the category of unsupervised learning. The term "unsupervised" refers to the fact that this type of learning does not require operator intervention. Other machine learning activities of this type include outlier identification and density estimation.

An intermediate type of activity, referred to as semi-supervised learning, requires a limited involvement of the operator. For example, in the case of clustering, this may allow the operator to specify pairs of objects that must belong to the same group and pairs of objects that may not belong to the same group.
The quality of the learning process is assessed through its capability for generalization, that is, the capacity of the produced algorithm for computing correct labels for yet unseen examples. It is important to note that the correct behavior of an algorithm relative to the training data is no guarantee, in general, for its generalization prowess. Indeed, it is sometimes the case that the pursuit of a perfect fit of the learning algorithm to the training data leads to overfitting. This term describes the situation when the algorithm acts correctly on the training data but is unable to predict unseen data. In an extreme case, a rote learner will memorize the labels of its training data and nothing else. Such a learner will be perfectly accurate on its training data but completely lack any generalization capability.
A machine learning algorithm can achieve greater accuracy with fewer training
labels if it is allowed to choose the data from which it learns, that is, to apply active
learning. An active learner may pose queries soliciting a human operator to label a
data instance. Since unlabeled data are abundant and, in many cases, easily
obtained, there are good reasons to use this learning paradigm.
Reinforcement learning is a machine learning paradigm inspired by psychology
which emphasizes learning by an agent from its direct interaction with the data in
order to attain certain goals of learning, e.g., accuracy of label prediction. The
framework of this type of learning makes use of states and actions of an agent, and
the rewards and deals with uncertainty and non-determinism.
Machine learning techniques can be applied to a wide variety of problems and tend
to avoid the difficulties of standard problem-solving techniques where a complete
understanding of data is required at the beginning of the problem-solving process.
We have selected System R to provide examples of applications of the algorithms presented in this chapter. R is one of the most popular, freely available software systems for statistics and machine learning, and it is continuously expanded by a large community of developers who have created packages that address specific problems. The basic software is available from http://www.r-project.org/. Packages can be obtained from many mirrors of the software, which can be easily accessed after the basic system is installed.
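For instance, a package used later in this chapter can be installed and loaded as follows; this is a minimal sketch of the standard workflow (the package name e1071 anticipates the naive Bayes examples below):

# Install the package from a CRAN mirror (one-time), then load it.
install.packages("e1071")
library(e1071)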
Data sets used in R are either part of the basic software or can be downloaded
from the University of California Irvine machine learning repository whose URL is
http://archive.ics.uci.edu/ml/. The basic R system is capable of reading files in the
csv format, which is one of the most common modalities for uploading data. For
example, to create a data frame d by reading the file d.csv, one could use

d <- read.csv("d.csv")

To learn the basics of R, the reader is invited to consult one of the basic
references (Lander 2014; Maindonald and Braun 2004) or seek help on the Web.

2 Simple Classifiers

We present now several types of classifiers using two of the most popular data sets,
namely Fisher’s iris data and the tennis data.
Example 2.1 The iris data were collected by Anderson (1936), an American botanist who was interested in the study of variations in three species of iris flowers in the Gaspé peninsula in northeastern Canada; the data set was made popular in statistics by Fisher (1936).

Fisher's iris data consist of measurements on 150 iris specimens and include measurements of sepal length, sepal width, petal length, and petal width, as well as the species of the plants. The attributes that are distinct from the class are numerical, so each plant is represented by a point in $\mathbb{R}^4$. The species identified are iris setosa, iris versicolor, and iris virginica, and there are 50 specimens from each of these species, as shown in Table 1.
We will use various types of classifiers as they are implemented in system R,
one of the most used pieces of software for data analysis, which is freely available
on the Internet.
The iris data set is a part of the basic R package and can be loaded using
> data(iris)

The structure of this data set can be obtained using


> str(iris)

which returns a summary description:


’data.frame’: 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..:1 1 ...

Example 2.2 The tennis data set shown in Table 2 is a fictitious small data set that specifies conditions for playing an outdoor game. It contains five attributes: outlook, temperature, humidity, windy, and play.

Table 1 Fisher's iris data set

Sepal length (SL)  Sepal width (SW)  Petal length (PL)  Petal width (PW)  Species
5.1                3.5               1.4                0.2               Setosa
4.9                3.0               1.4                0.2               Setosa
…                  …                 …                  …                 …
5.3                3.7               1.5                0.2               Setosa
5.0                3.3               1.4                0.2               Setosa
7.0                3.2               4.7                1.4               Versicolor
6.4                3.2               4.5                1.5               Versicolor
…                  …                 …                  …                 …
5.1                2.5               3.0                1.1               Versicolor
5.7                2.8               4.1                1.3               Versicolor
6.3                3.3               6.0                2.5               Virginica
5.8                2.7               5.1                1.9               Virginica
…                  …                 …                  …                 …
6.2                3.4               5.4                2.3               Virginica
5.9                3.0               5.1                1.8               Virginica

Table 2 Tennis data set


outlook temperature humidity windy play
Sunny Hot High No No
Sunny Hot High Yes No
Overcast Hot High No Yes
Rainy Mild High No Yes
Rainy Cool Normal No Yes
Rainy Cool Normal Yes No
Overcast Cool Normal Yes Yes
Sunny Mild High No No
Sunny Cool Normal No Yes
Rainy Mild Normal No Yes
Sunny Mild Normal Yes Yes
Overcast Mild High Yes Yes
Overcast Hot Normal No Yes
Rainy Mild High Yes No

The data can be placed in a comma-separated EXCEL file tennis.csv and


then loaded in R using a statement of the form
tennis <- read.csv("tennis.csv")

2.1 Bayes Classification and Naive Bayesian Classifiers

Suppose that a data set $D$ consists of $n$ non-empty and mutually disjoint classes $C_1, \ldots, C_n$. For $1 \le i \le n$, let $P(C_i)$ be the probability that a datum $x$ in $D$ belongs to $C_i$.

Bayes classifiers determine the class of $x \in D$ as one of the classes $C_1, \ldots, C_n$ by computing the conditional probabilities $P(C_i \mid x)$ and assigning $x$ to the class $C_k$ where

$$k = \arg\max_i P(C_i \mid x).$$

The probabilities $P(C_i \mid x)$ are known as a posteriori probabilities, since they are evaluated after the datum $x$ is observed, and the class $C_k$ is occasionally referred to as the maximum a posteriori class.

By Bayes' law, we have

$$P(C_i \mid x) = \frac{P(x \mid C_i)\,P(C_i)}{P(x)}$$

for $1 \le i \le n$. Note that $P(x)$ does not influence the selection of $C_k$.

Generally, the probabilities $P(C_i)$ of the classes are referred to as the prior or a priori probabilities of the classes, and they may be estimated using one of the following methods:

(i) they may be assumed to be equal, $P(C_1) = \cdots = P(C_n) = \frac{1}{n}$, or
(ii) they can be estimated as the frequencies of the classes $C_i$ in the training population, or
(iii) estimations can be obtained from general domain knowledge.

Another challenge in Bayesian classification is to evaluate probabilities of the form $P(x \mid C_i)$. Naïve Bayes classifiers add a supplementary independence hypothesis. Namely, if $x = (x_1, \ldots, x_m)$, we assume that the components $x_1, \ldots, x_m$ are independent of each other, which allows us to write

$$P(x \mid C_i) = \prod_{j=1}^{m} P(x_j \mid C_i)$$

for $1 \le i \le n$. The probabilities $P(x_j \mid C_i)$ are usually estimated from the training examples, and the estimation method depends on the nature of each of the attributes $A_1, \ldots, A_m$ that define these components. The classifier will assign $x$ to the most likely class, that is, to the $C_i$ that corresponds to the maximum value of $P(C_i \mid x)$ and therefore to the class $C_i$ for which $P(x \mid C_i) = \prod_{j=1}^{m} P(x_j \mid C_i)$ is maximal.

If $A_j$ is a categorical attribute, $P(x_j \mid C_i)$ can be estimated as $\frac{f_{ji}}{c_i}$, where $c_i$ is the number of training examples in class $C_i$ and $f_{ji}$ is the number of training examples in class $C_i$ having the value of the $A_j$ component equal to $x_j$.

If $A_j$ is continuous, $P(x_j \mid C_i)$ can be approximated with the normal distribution. If $\mu_i$ and $\sigma_i$ are the mean and the standard deviation of the examples of the class $C_i$, then we may adopt as an estimate of $P(x_j \mid C_i)$ the value

$$\frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(x_j - \mu_i)^2}{2\sigma_i^2}}.$$
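As an aside (this snippet is not part of the original text), the estimate above is just the normal density, which R provides as dnorm; here mu_i and sigma_i stand for the class-conditional mean and standard deviation computed from the training examples:

# Gaussian estimate of P(x_j | C_i) for a continuous attribute A_j.
p_xj_given_ci <- function(xj, mu_i, sigma_i) {
  dnorm(xj, mean = mu_i, sd = sigma_i)
}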

Example 2.3 In the tennis data set, there are two classes determined by the attribute play: $C_{yes}$ and $C_{no}$, which contain 9 and 5 records, respectively. If the probabilities of these classes are estimated by their frequencies, we will have $P(C_{yes}) = \frac{9}{14}$ and $P(C_{no}) = \frac{5}{14}$. Since all attributes in this example are categorical, the probabilities $P(x_j \mid C_i)$ are estimated as $\frac{f_{ji}}{c_i}$, where $c_i$ is the number of training examples in class $C_i$ and $f_{ji}$ is the number of training examples in class $C_i$ having the value of the $A_j$ component equal to $x_j$. In this case, the frequencies are given in Table 3.
A naive Bayes classifier for this categorical data set is created in R with the package e1071. After installing this package, e1071 is loaded using the directive

> library(e1071)

The naive Bayes classifier nbc is created by writing:


nbc <- naiveBayes(Play ~ ., data = tennis)

In the definition of nbc, the expression


Play ~ .

Table 3 Frequencies in the tennis data set

Attribute     Value     P(x|C_yes)  P(x|C_no)
outlook       Sunny     2/9         3/5
              Overcast  4/9         0/5
              Rainy     3/9         2/5
temperature   Hot       2/9         2/5
              Mild      4/9         2/5
              Cool      3/9         1/5
humidity      High      3/9         4/5
              Normal    6/9         1/5
windy         No        6/9         2/5
              Yes       3/9         3/5

is a model formula that has the general form

class variable  list of explanatory variables:

In this case, Play is clearly the class variable; the period “.” replaces all other
variables. If several variables participate in the list of explanatory variables, they are
linked by +.
Displaying the components of nbc gives us the prior probabilities and the
conditional probabilities Pð xjC Þ:
A-priori probabilities:
No Yes
0.3571429 0.6428571

Conditional probabilities:
Outlook
Overcast Rainy Sunny
No 0.0000000 0.4000000 0.6000000
Yes 0.4444444 0.3333333 0.2222222

Temp
Cool Hot Mild
No 0.2000000 0.4000000 0.4000000
Yes 0.3333333 0.2222222 0.4444444

Humidity
High Normal
No 0.8000000 0.2000000
Yes 0.3333333 0.6666667

Windy
NO YES
No 0.4000000 0.6000000
Yes 0.6666667 0.3333333

We seek to predict the value of the attribute Play when the values of the other attributes form a tuple that is absent from the table. This happens when we have the datum x given below:

outlook  temperature  humidity  windy
Rainy    Hot          High      YES

We need to compute the conditional probabilities

$$\begin{aligned}
P(x \mid C_{yes}) &= P(\text{outlook} = \text{Rainy} \mid C_{yes}) \cdot P(\text{temperature} = \text{Hot} \mid C_{yes})\\
&\quad \cdot P(\text{humidity} = \text{High} \mid C_{yes}) \cdot P(\text{windy} = \text{YES} \mid C_{yes})\\
&= 3/9 \cdot 2/9 \cdot 3/9 \cdot 3/9 = 54/6561 = 0.0082,
\end{aligned}$$

and

$$\begin{aligned}
P(x \mid C_{no}) &= P(\text{outlook} = \text{Rainy} \mid C_{no}) \cdot P(\text{temperature} = \text{Hot} \mid C_{no})\\
&\quad \cdot P(\text{humidity} = \text{High} \mid C_{no}) \cdot P(\text{windy} = \text{YES} \mid C_{no})\\
&= 2/5 \cdot 2/5 \cdot 4/5 \cdot 3/5 = 48/625 = 0.0768.
\end{aligned}$$

The a posteriori probabilities are given by

$$P(C_{yes} \mid x) = \frac{P(x \mid C_{yes})\,P(C_{yes})}{P(x)} = \frac{0.0082 \cdot 0.6428571}{P(x)} = \frac{0.00527}{P(x)},$$

$$P(C_{no} \mid x) = \frac{P(x \mid C_{no})\,P(C_{no})}{P(x)} = \frac{0.0768 \cdot 0.3571429}{P(x)} = \frac{0.02742}{P(x)}.$$

Since $P(C_{no} \mid x) > P(C_{yes} \mid x)$, the classifier will predict "no" for $x$.
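The same prediction can be obtained directly from the nbc model built earlier; the sketch below assumes the column names and value spellings (Outlook, Temp, Humidity, Windy; "YES") that appear in the printed model output:

# Classify the unseen datum x with the naive Bayes classifier nbc.
x <- data.frame(Outlook = "Rainy", Temp = "Hot",
                Humidity = "High", Windy = "YES")
predict(nbc, x)                # predicted class: No
predict(nbc, x, type = "raw")  # a posteriori probabilities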
Note that there is no example in the data set where outlook = "Overcast" and Play = "No". Therefore, P(outlook = "Overcast" | Play = "No") = 0, and any product of probabilities that includes this factor will be 0. This problem can be fixed by using a technique known as Laplace correction. Namely, if the fractions

$$\frac{p_1}{q_1}, \ldots, \frac{p_m}{q_m}$$

are $m$ probabilities such that $\sum_{i=1}^{m} \frac{p_i}{q_i} = 1$, we replace these fractions by

$$\frac{p_1 + k}{q_1 + mk}, \ldots, \frac{p_m + k}{q_m + mk},$$

respectively. None of the newly defined numbers is 0, and each corrected fraction lies between the original value and $\frac{1}{m}$:

$$\min\left\{\frac{p_i}{q_i}, \frac{1}{m}\right\} \le \frac{p_i + k}{q_i + mk} \le \max\left\{\frac{p_i}{q_i}, \frac{1}{m}\right\}.$$

The parameter $k$ is, in general, a small positive number, and it determines how influential the prior values are compared to the knowledge extracted from the training set. For instance, with $k = 1$ the estimate of P(outlook = "Overcast" | Play = "No") becomes $(0 + 1)/(5 + 3) = 0.125$, since outlook takes $m = 3$ values and the class $C_{no}$ contains 5 examples; this is the value seen in the output below.

To apply a Laplace correction with $k = 1$, we need to write

> nbc <- naiveBayes(Play ~ ., data=tennis, laplace=1)

Note that the conditional probabilities are modified and there are no null values:
A-priori probabilities:

No Yes
0.3571429 0.6428571

Conditional probabilities:
Outlook
Overcast Rainy Sunny
No 0.1250000 0.3750000 0.5000000
Yes 0.4166667 0.3333333 0.2500000

Temp
Cool Hot Mild
No 0.2500000 0.3750000 0.3750000
Yes 0.3333333 0.2500000 0.4166667

Humidity
High Normal
No 0.7142857 0.2857143
Yes 0.3636364 0.6363636

Windy
NO YES
No 0.4285714 0.5714286
Yes 0.6363636 0.3636364

Example 2.4 In this example, we seek to construct a Bayes classifier for a data set
that has numerical attributes using the iris data set and the package e1071.
> nbc <- naiveBayes(iris[,1:4],iris[,5])

> table(predict(nbc,iris[,1:4]), iris[,5],


+ dnn=list("predicted","actual"))

This will return


actual
predicted setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47

The structure of the classifier returned can be inspected using the statement
> str(nbc)

which returns

List of 4
$ apriori: ’table’ int [1:3(1d)] 50 50 50
..- attr(*, "dimnames")=List of 1
.. ..$ iris[, 5]: chr [1:3] "setosa" "versicolor" "virginica"
$ tables :List of 4
..$ Sepal.Length: num [1:3, 1:2] 5.006 5.936 6.588 0.352 0.516 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ iris[, 5] : chr [1:3] "setosa" "versicolor" "virginica"
.. .. ..$ Sepal.Length: NULL
..$ Sepal.Width : num [1:3, 1:2] 3.428 2.77 2.974 0.379 0.314 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ iris[, 5] : chr [1:3] "setosa" "versicolor" "virginica"
.. .. ..$ Sepal.Width: NULL
..$ Petal.Length: num [1:3, 1:2] 1.462 4.26 5.552 0.174 0.47 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ iris[, 5] : chr [1:3] "setosa" "versicolor" "virginica"
.. .. ..$ Petal.Length: NULL
..$ Petal.Width : num [1:3, 1:2] 0.246 1.326 2.026 0.105 0.198 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ iris[, 5] : chr [1:3] "setosa" "versicolor" "virginica"
.. .. ..$ Petal.Width: NULL
$ levels : chr [1:3] "setosa" "versicolor" "virginica"
$ call : language naiveBayes.default(x = iris[, 1:4], y = iris[, 5])
- attr(*, "class")= chr "naiveBayes"

2.2 Decision Trees

Decision trees are algorithms that build classification models based on a chain of
partitions of the training set. Depending on the nature of data (categorical or
numerical), we need to choose a particular type of decision tree.
Decision trees are built through recursive data partitioning, where in each iteration, the training data are split according to the values of a selected attribute. Each node $n$ corresponds to a subset $D(n)$ of the training data set $D$ and to a partition $\pi(n)$ of $D(n)$. If $n_0$ is the root of the decision tree, then $D(n_0) = D$. If $n$ is a node that has the descendants $n_1, \ldots, n_k$, then

$$\pi(n) = \{D(n_1), \ldots, D(n_k)\}.$$

In other words, the blocks of the partition $\pi(n)$ are the data sets that correspond to the descendant nodes $n_1, \ldots, n_k$. Partitioning of a set $D(n)$ is done, in general, on the basis of the values of the attributes of the objects assigned to the node $n$.

Suppose that the training data are labeled by $c_1, \ldots, c_m$. This, in turn, determines a partition $\sigma = \{C_1, \ldots, C_m\}$ of the training set, where the block $C_j$ contains the data records labeled $c_j$ for $1 \le j \le m$. If $E$ is a subset of $D$, the purity of $E$ equals the entropy of the trace partition $\sigma_E$ (see Sect. 8B). The set $E$ is pure if $\sigma_E$ consists of exactly one block, that is, $H(\sigma_E) = 0$; in other words, $E$ is pure if its elements belong to exactly one class.

The recursive splitting of the nodes stops at nodes that correspond to "pure" or "almost pure" data subsets, that is, when the data of the node consist of instances of the same class, or when a class is strongly predominant at that node. Nodes where splitting stops are the leaves of the decision tree.
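For intuition, the entropy used here as a purity measure can be computed directly from the class labels of a subset; the following helper is a sketch added for illustration and is not part of the chapter:

# Shannon entropy of a set of labeled records, from its label vector.
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class frequencies
  -sum(p * log2(p))
}
entropy(c("a", "a", "b"))  # mixed set: entropy > 0
entropy(c("a", "a", "a"))  # pure set: entropy = 0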
There are three issues in constructing a decision tree (Breiman et al. 1998):

(i) choosing a splitting criterion that generates a partition of $D(n)$;
(ii) deciding when a node should not be split further, that is, when a node is terminal;
(iii) the assignment of each terminal node to a class.

Splitting the data set $D(n)$ aims to produce nodes with increasing purity. Assume that $n$ is split $k$ ways to generate the descendants $n_1, \ldots, n_k$ that contain the data sets $D(n_1), \ldots, D(n_k)$. The splitting partition $\sigma_n$ at $n$ is defined as

$$\sigma_n = \{D(n_1), \ldots, D(n_k)\}.$$

If $\kappa$ is the partition of the original data set in classes, the $\alpha$-impurity at $n$ is $H_\alpha(\kappa_{D(n)})$. The aggregate $\alpha$-impurity of the descendants of $n$ is

$$\sum_{j=1}^{k} \left(\frac{|D(n_j)|}{|D(n)|}\right)^{\alpha} H_\alpha(\kappa_{D(n_j)}) = H_\alpha(\kappa_{D(n)} \mid \sigma_n)$$

and, therefore, the decrease in impurity afforded by the splitting $\sigma_n$ is

$$H_\alpha(\kappa_{D(n)}) - H_\alpha(\kappa_{D(n)} \mid \sigma_n).$$

This quantity is known as the information gain caused by $\sigma_n$, and it is the basis of one of the best known methods for constructing decision trees, namely the C5.0 algorithm of Quinlan (1993). Variants of this algorithm are also popular [e.g., the J48 of the WEKA software package (Witten et al. 2011)].
J48 of the WEKA software package (Witten et al. 2011)].
The construction of a C5.0 tree in the C50 package can be achieved by writing
C5.0(trainData,classVector, trials = t, costs = c)

where the first parameter specifies the data set on which the classifier is constructed
and the second parameter is a factor vector which contains the class for each row of
the training data; the remaining parameters are optional and will be discussed in the
sequel.
Example 2.5 To generate a decision tree for the iris data set, we split this data
into a training data set, trainIris, and a test data set, testIris by writing

> index <- sample(2,nrow(iris),replace=TRUE,prob=c(0.9,0.1))


> trainIris <- iris[index==1,]
> testIris <- iris[index==2,]

About 90 % of the entries in this index have value 1 and about 10 % contain the
value 2, which correspond to the training set and the test set, respectively.
The classifier dt is built using the syntax
dt <- C5.0(trainIris[,1:4],trainIris[,5])

The classes predicted for the test set are obtained with
> pred <- predict(dt,testIris[,1:4],type="class")
> pred
setosa setosa setosa setosa setosa
versicolor versicolor versicolor versicolor versicolor
versicolor versicolor virginica virginica virginica
virginica virginica virginica virginica
Levels: setosa versicolor virginica

A summary of the classifier, summary(dt), returns the specifics of the decision tree
Decision tree:

Petal.Length <= 1.9: setosa (45)


Petal.Length > 1.9:
:...Petal.Width > 1.7: virginica (39/1)
Petal.Width <= 1.7:
:...Petal.Length <= 4.9: versicolor (41/1)
Petal.Length > 4.9: virginica (6/2)
Evaluation on training data (131 cases):

Decision Tree
----------------
Size Errors
4 4( 3.1%) <<

(a)   (b)   (c)    <-classified as
----  ----  ----
 45                (a): class setosa
       40     3    (b): class versicolor
        1    42    (c): class virginica

The parameter trials refers to a very important technique in machine learning


called boosting. Boosting refers to a method of producing a very accurate classifier
by combining moderately inaccurate classifiers. Using trials, we can specify the
number of boosting iterations.
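For instance, a boosted version of the tree from Example 2.5 can be requested as follows (the choice of 10 iterations is arbitrary, used here only as a sketch):

# Build a boosted C5.0 classifier with 10 boosting iterations.
dtb <- C5.0(trainIris[,1:4], trainIris[,5], trials = 10)
summary(dtb)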

Note that the classifier generated in Example 2.5 produced four erroneous predictions. A matrix of costs can be associated with these mistakes such that the costs depend on the nature of the errors. For instance, since we have three classes designated as (a), (b), and (c), we could consider the cost matrix

$$\text{costs} = \begin{pmatrix} 0 & 2 & 0 \\ 4 & 0 & 5 \\ 0 & 1 & 0 \end{pmatrix}$$

The entries of this matrix assign a cost to mistakes made during the classification. Rows correspond to predicted values and columns to actual values; the diagonal elements are 0. Thus, the costliest error of the classifier is to predict (b) for an object in the class (c).
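A sketch of passing such a matrix to C5.0 is shown below; the convention that rows are predicted classes and the use of class names as dimnames follow the description above and should be checked against the C50 documentation:

# Cost matrix: rows are predicted classes, columns are actual classes.
cm <- matrix(c(0, 2, 0,
               4, 0, 5,
               0, 1, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(levels(iris$Species), levels(iris$Species)))
dtc <- C5.0(trainIris[,1:4], trainIris[,5], costs = cm)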

3 Evaluation of Performance of Classifiers

Consider a simple classification algorithm involving the diagnosis of a condition based on the value of a test result. A disease is predicted when the value of the test result is greater than 5; patients who satisfy this condition constitute the positive set, which contains $P(t)$ elements; the other patients form the set of negative cases, which consists of $N(t)$ elements, as we show in Fig. 1. Suppose initially that the distribution of cases is the one shown in Fig. 1a. In this case, the test results are decisive: Patients with test values of at least 5 have a positive diagnosis, while patients with values lower than 5 have a negative diagnosis. Such well-delimited situations are infrequent. More likely, the curves that give the probability densities intersect, as we show in Fig. 1b.
Fig. 1 Relative positions of distributions of test results. a Well-separated results. b Positive and negative results overlap

Note that the set of test values of individuals who have the disease overlaps with
the set of test values of those who do not have the disease. These sets are repre-
sented in Fig. 1b by the areas P and N located under each of the two curves.
The diagnosis is determined by the value of a test threshold: Patients whose test
values exceed the threshold are deemed to be positive (that is, have the disease);
patients whose test values are lower than the threshold are deemed to be negative.
Some patients who have the disease but whose test results are lower than the threshold will be classified by this simple test among the negative cases (these are the false-negative cases); others, who do not have the disease but whose test values are larger than the threshold, will be classified among the positive cases (they are the false-positive cases). The numbers of elements of these sets are denoted by $FN(t)$ and $FP(t)$, respectively.

The set of patients who have the disease and are correctly identified by the test forms the set of true-positive cases; the number of elements of this set is denoted by $TP(t)$. Also, the set of patients who do not have the disease and are correctly identified forms the set of true-negative cases; the number of elements of this set is $TN(t)$. Clearly, we have

$$N = TN(t) + FP(t), \qquad P = TP(t) + FN(t).$$

Note that the total numbers of cases $N$ and $P$ do not depend on $t$. The definitions are summarized in Table 4, known as the confusion matrix or confusion table.

Among these cases, the number of incorrectly classified cases is $FP(t) + FN(t)$; this motivates the introduction of the error rate $\mathrm{error}(t)$ as

$$\mathrm{error}(t) = \frac{FP(t) + FN(t)}{N + P}.$$

Note that $\mathrm{error}(t) \in [0, 1]$ for every value of $t$. The accuracy at $t$ is

$$\mathrm{acc}(t) = 1 - \mathrm{error}(t) = \frac{TP(t) + TN(t)}{P + N}.$$

The specificity at $t$ (also known as the true-negative rate) is defined as:

Table 4 Confusion table


True class
Positive Negative
Classifier result for Positive TPðtÞ FPðtÞ
threshold t Negative FNðtÞ TNðtÞ
Totals P N
$$\mathrm{specificity}(t) = \frac{TN(t)}{N}.$$

Specificity can be regarded as a conditional probability, namely,

$$\mathrm{specificity}(t) = P(TN(t) \mid N).$$

Similarly, the sensitivity at $t$ (also known as the true-positive rate), or the recall, is given by

$$\mathrm{sensitivity}(t) = \frac{TP(t)}{P},$$

and can be expressed as the conditional probability $\mathrm{sensitivity}(t) = P(TP(t) \mid P)$.

High values of specificity occur when there are few false positives; low sensitivity indicates the presence of many false negatives.

The precision at $t$ is

$$\mathrm{precision}(t) = \frac{TP(t)}{TP(t) + FP(t)}.$$

Note that

$$0 \le \mathrm{specificity}(t), \mathrm{sensitivity}(t), \mathrm{precision}(t) \le 1$$

for every value of $t$. Also, we have

$$\mathrm{specificity}(t) = \frac{TN(t)}{TN(t) + FP(t)}, \quad \mathrm{sensitivity}(t) = \frac{TP(t)}{TP(t) + FN(t)}, \quad \mathrm{precision}(t) = \frac{TP(t)}{TP(t) + FP(t)}.$$

It is easy to verify that for any four positive numbers $a, b, c, d$, we have the double inequality

$$\min\left\{\frac{a}{b}, \frac{c}{d}\right\} \le \frac{a + c}{b + d} \le \max\left\{\frac{a}{b}, \frac{c}{d}\right\}.$$

This implies

$$\min\left\{\frac{TP(t)}{P}, \frac{TN(t)}{N}\right\} \le \frac{TP(t) + TN(t)}{P + N} \le \max\left\{\frac{TP(t)}{P}, \frac{TN(t)}{N}\right\},$$
which is equivalent to

$$\min\{\mathrm{specificity}(t), \mathrm{sensitivity}(t)\} \le \mathrm{acc}(t) \le \max\{\mathrm{specificity}(t), \mathrm{sensitivity}(t)\}.$$

In other words, the accuracy at $t$ always lies between the sensitivity and the specificity at $t$.

Note that $1 - \mathrm{specificity}(t) = 1 - \frac{TN(t)}{N} = \frac{FP(t)}{N}$. This justifies referring to $1 - \mathrm{specificity}(t)$ as the false-positive rate.

The $F_1$ score considers both the precision and the sensitivity rates and is defined as their harmonic mean

$$F_1(t) = 2\,\frac{\mathrm{precision}(t) \cdot \mathrm{sensitivity}(t)}{\mathrm{precision}(t) + \mathrm{sensitivity}(t)}.$$

A more general measure is $F_\beta$, given by

$$F_\beta(t) = (1 + \beta^2)\,\frac{\mathrm{precision}(t) \cdot \mathrm{sensitivity}(t)}{\beta^2\,\mathrm{precision}(t) + \mathrm{sensitivity}(t)}.$$

Note that $F_2$ weighs sensitivity higher than precision, while $F_{0.5}$ weighs precision higher than sensitivity.
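The measures discussed in this section are easily computed from the entries of a confusion table; the helper function below is a sketch added for illustration and is not part of the chapter:

# Performance measures from the confusion table at a fixed threshold t.
measures <- function(TP, FP, FN, TN) {
  acc  <- (TP + TN) / (TP + FP + FN + TN)
  sens <- TP / (TP + FN)  # sensitivity (true-positive rate, recall)
  spec <- TN / (TN + FP)  # specificity (true-negative rate)
  prec <- TP / (TP + FP)  # precision
  f1   <- 2 * prec * sens / (prec + sens)
  c(accuracy = acc, sensitivity = sens,
    specificity = spec, precision = prec, F1 = f1)
}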

4 Support Vector Machines

Support vector machines (SVMs) represent a powerful technique in classification, regression, and outlier detection. SVMs were developed by Cortes and Vapnik (1995) for binary classification.

The simplest application of these algorithms is solving the binary classification problem, which seeks to separate two classes of vectors in $\mathbb{R}^n$ by determining an optimum separating hyperplane for the classes involved. The two classes of vectors involved are known as the positive examples and the negative examples, and the separating hyperplane must be determined such that the separation between the closest representatives of the two classes is maximized.

Building a separating hyperplane amounts to building a classifier model, and the process begins, as is customary in classification, with a training set $T$ that consists of $m$ pairs of the form

$$(x_1, y_1), \ldots, (x_m, y_m),$$

where $x_1, \ldots, x_m \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$ for $1 \le i \le m$. The sets

$$T^{+} = \{x_i \mid (x_i, 1) \in T\}, \qquad T^{-} = \{x_i \mid (x_i, -1) \in T\}$$

are the set of positive examples and the set of negative examples, respectively.

$T$ is linearly separable if there exists a hyperplane $H_{v,a} : v'x = a$ (called the separating hyperplane) such that all positive examples lie in one half-space determined by $H_{v,a}$ and all negative examples lie in the other half-space, as shown in Fig. 2. In other words, $v$ and $a$ can be chosen such that for all positive examples we shall have $v'x_i - a > 0$ and for all negative examples we shall have $v'x_i - a < 0$. Both conditions can be stated as

$$y_i(v'x_i - a) > 0 \tag{1}$$

for $1 \le i \le m$.

The distance between a point $x_i$ and the hyperplane $H_{v,a}$ is

$$d_i = \frac{|v'x_i - a|}{\|v\|} = \frac{y_i(v'x_i - a)}{\|v\|},$$

and we refer to this distance as the geometric margin of $x_i$.

We need to ensure that the geometric margins have a guaranteed minimum $\mu$, that is,

$$y_i(v'x_i - a) \ge \mu\|v\|$$

Fig. 2 Linearly separable data set. Negative examples are shown as ◦ (support vectors as •) and positive examples as □ (support vectors as ■); the margin separates the two classes

for $1 \le i \le m$. This would imply that for the positive examples, we shall have

$$v'x_i - a - \mu\|v\| \ge 0,$$

and for the negative examples,

$$v'x_i - a + \mu\|v\| \le 0.$$

These conditions can be written equivalently as

$$w'x_i - b - 1 \ge 0 \tag{2}$$

for the positive examples, and

$$w'x_i - b + 1 \le 0 \tag{3}$$

for the negative examples, where $w = \frac{v}{\mu\|v\|}$ and $b = \frac{a}{\mu\|v\|}$. In a unified form, these restrictions can now be written as

$$y_i(w'x_i - b) \ge 1$$

for $1 \le i \le m$.

The distance between the hyperplanes $w'x - b = 1$ and $w'x - b = -1$ is $\frac{2}{\|w\|}$, and we seek to maximize this distance in order to obtain a good separation between the classes. Thus, we need to minimize $\|w\|$ subject to the restrictions $y_i(w'x_i - b) \ge 1$ for $1 \le i \le m$. An equivalent formulation brings this problem to a quadratic optimization problem, namely seeking $w$ that is a solution of the problem:

$$\begin{aligned} &\text{minimize } \tfrac{1}{2}\|w\|^2,\ \text{where } w \in \mathbb{R}^n,\\ &\text{subject to } 1 - y_i(w'x_i - b) \le 0 \ \text{for } 1 \le i \le m. \end{aligned}$$

The separating hyperplane is $H_{w,b}$.

To obtain the dual of this problem (see Sect. 8C), we start from the Lagrangean

$$L(w, b, u) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} u_i\left(1 - y_i(w'x_i - b)\right) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} u_i - \sum_{i=1}^{m} u_i y_i \left(\sum_{k=1}^{n} w_k x_{ki} - b\right),$$

where $u_i \ge 0$ are the Lagrange multipliers. The dual objective function is obtained as $g(u) = \inf_{w,b} L(w, b, u)$. This requires the stationarity conditions

@L @L
¼0 for 1 6 i 6n and ¼ 0;
@wi @a

which amount to

@L Xm
¼ wj  yi ui xji ¼0 for 16 j 6n;
@wj i¼1
@L X n
¼ ui yi ¼0;
@a i¼1

InPa vectorial form, the first n stationary conditions can be written as


w  ni¼1 yi ui xi ¼ 0n . Since

X
n
w¼ y i ui xi ð4Þ
i¼1

Pn
and i¼1 ui yi ¼ 0, the dual objective function is

X
m
1X n X n
gðuÞ ¼ ui  y i y i ui ui x0 i xj ;
i¼1
2 i¼1 i¼1

which is a quadratic function subject to u > 0m .


There are several important aspects of the dual formulation of the SVM:
(i) Equality (4) shows that the weight vector w is located in the hyperplane
determined by the vectors x1 ; . . .; xn . Moreover, one can show that ui 6¼ 0 if
and only if xi is a support vector and, therefore, w is determined by the support
vectors.
(ii) The advantage of the dual formulation is that the number m of variables ui may
be a lot smaller that the original number of variables n.
(iii) The dual optimization problem needs no access to the original data
fxi j1 6 i 6 ng. Instead, only the inner products x0i xj are necessary in the
construction of the dual objective function.
A data point x 2 Rn is classified by the SVM determined here based on the sign
of the expression w0 x  a; in other words, the class y of an yet unseen point is given
by y ¼ signðw0 x  aÞ.
If the data are “almost” linearly separable, a separation hyperplane exists such
that the majority (but not all) of the positive examples inhabit the positive half-
space of the hyperplane and the majority (but not all) of the negative examples
inhabit the negative half-space. In this case, we shall seek a “separating hyperplane”
that separates the two classes with the smallest error. This is achieved by assigning
to each object xi in the data set a slack variable ni , where ni >0, by relaxing
Inequalities 2 and 3 as

$$w'x_i - b - 1 \ge -\xi_i \tag{5}$$

for the positive examples and

$$w'x_i - b + 1 \le \xi_i \tag{6}$$

for the negative examples, respectively, where $w = \frac{v}{\mu\|v\|}$ and $b = \frac{a}{\mu\|v\|}$. In turn, in a unified form, these restrictions can be written as

$$1 - y_i(w'x_i - b) \le \xi_i$$

for $1 \le i \le m$. The soft-margin SVM primal problem is

$$\begin{aligned} &\text{minimize } \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \xi_i,\ \text{where } w \in \mathbb{R}^n,\\ &\text{subject to } 1 - y_i(w'x_i - b) \le \xi_i \ \text{for } 1 \le i \le m, \end{aligned}$$

where $C$ is a user-defined parameter usually referred to as a hyper-parameter.

The dual of the soft-margin problem is similar to the previous dual, and it amounts to

$$\begin{aligned} &\text{maximize } g(u) = \sum_{i=1}^{m} u_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j u_i u_j x_i' x_j,\\ &\text{subject to } 0 \le u_i \le C \ \text{and}\ \sum_{i=1}^{m} u_i y_i = 0. \end{aligned}$$

Example 4.1 The kernlab library, described in Karatzoglou et al. (2004), provides users with essential access to support vector machine techniques. After installing the package, its loading is achieved using
> library(kernlab)

We split the data set iris into a training set trainIris and a test set
testIris in the same manner used in Example 2.5.
The classifier is created by writing
> svm <- ksvm(Species ~ ., data=trainIris, kernel="vanilladot",
C = 1, prob.model=TRUE)

and is used to generate distribution probabilities for each of the 19 entries of the test set by writing

Note the use of the parameter kernel = “vanilladot”. We will explain later
the use of kernels.
These distributions can be examined:
> pred_p
setosa versicolor virginica
[1,] 0.948669677 0.0365527398 0.014777583
[2,] 0.971508823 0.0190805740 0.009410603
[3,] 0.987012019 0.0080849105 0.004903071
[4,] 0.950002416 0.0357236471 0.014273937
[5,] 0.659161885 0.2879288429 0.052909272
[6,] 0.017947111 0.9594198514 0.022633038
[7,] 0.012561988 0.9829687166 0.004469296
[8,] 0.017910234 0.9784817276 0.003608038
[9,] 0.008436607 0.9467301478 0.044833245
[10,] 0.012126227 0.9815816669 0.006292106
[11,] 0.028265376 0.9660266137 0.005708011
[12,] 0.052250902 0.9359484109 0.011800687
[13,] 0.001837466 0.0003496850 0.997812849
[14,] 0.006546816 0.0065769958 0.986876188
[15,] 0.005543471 0.0006948435 0.993761686
[16,] 0.001242060 0.0002903663 0.998467574
[17,] 0.012187320 0.0324955786 0.955317101
[18,] 0.019265185 0.3263600533 0.654374762
[19,] 0.005646642 0.0255939953 0.968759363

Note that in each case, one of the numbers strongly dominates the others, a
consequence of the linear separability of this data set. Alternatively, a prediction
that returns directly the class of various objects can be generated by
pred <- predict(svm,testIris,type="response")

and generates
> pred
[1] setosa setosa setosa setosa setosa
versicolor versicolor versicolor
[9] versicolor versicolor versicolor versicolor virginica
virginica virginica virginica
[17] virginica virginica virginica

Levels: setosa versicolor virginica.

A contingency table can be obtained with


table(pred,testIris$Species)

pred setosa versicolor virginica


setosa 5 0 0
versicolor 0 7 0
virginica 0 0 7

In many situations, data are not linearly separable; that is, there is no separating hyperplane between the classes. Consider, for example, the set of points shown in Fig. 3, which are separated into positive and negative examples by a nonlinear surface rather than a hyperplane (in our two-dimensional case, by a curve rather than a line). The solution is to transform the data into another space, where the separating surface is transformed into a hyperplane such that the positive and negative examples will inhabit the two half-spaces determined by the hyperplane. The data transformation is defined by a function $\phi : \mathbb{R}^n \to H$, where $H$ is a new linear space referred to as the feature space. The intention is to use a linear classifier in the new space to achieve separation between the representations of the positive and the negative examples in this new space.
We assume that the feature space $H$ is equipped with an inner product $(\cdot,\cdot) : H \times H \to \mathbb{R}$. In view of Equality (4), if the data are approximately linearly separable in the new space, the classification decision is based on computing

$\sum_{i=1}^{n} y_i u_i\, \phi(x_i)' \phi(x) - a.$

Let $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ be the function defined by $K(u,v) = (\Phi(u), \Phi(v))$; this function is referred to as a kernel function, and the decision in the new space is based on the sign of the expression $\sum_{i=1}^{n} y_i u_i K(x_i, x) - a$. Thus, we need to specify only the kernel function rather than the explicit transformation $\phi$. In Example 4.1, $\phi$ is the identity transformation and the corresponding kernel, $K(u,v) = u'v$, is known as the vanilla kernel. Among the most frequently used kernels, we mention the Gaussian kernel defined by $K(u,v) = e^{-c\|u-v\|^2}$, the exponential kernel given by $K(u,v) = e^{-c\|u-v\|}$, and the polynomial kernel $K(u,v) = (k + u'v)^p$.
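To make the role of the kernel concrete, the following sketch (our addition; it assumes the trainIris/testIris split of Example 4.1 is still in memory) trains classifiers with the Gaussian and polynomial kernels offered by kernlab and compares their test accuracies:

> # Gaussian kernel; the width parameter sigma is estimated automatically
> svm_rbf <- ksvm(Species ~ ., data = trainIris, kernel = "rbfdot", C = 1)
> # polynomial kernel of degree 2
> svm_poly <- ksvm(Species ~ ., data = trainIris, kernel = "polydot",
+                  kpar = list(degree = 2), C = 1)
> mean(predict(svm_rbf, testIris) == testIris$Species)   # fraction correct
> mean(predict(svm_poly, testIris) == testIris$Species)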

separating separating
curve line
2 2
x2 6 2 y2 6 2
2
2 2 2 w ◦
2 ◦
2 ◦ 22 ◦

◦ ◦ ◦
◦ ◦ ◦
◦ ◦

- -
x1 y1
negative examples ◦, positive examples 2

Fig. 3 Inseparable data set



Example 4.2 The two-dimensional data set shown in Fig. 4 is clearly not linearly
separable because no line can be drawn such that all positive points will be on one
side of the line and all negative points on the other.
Again, we use the kernlab and the function ksvm of this package. We apply a
Gaussian kernel, which can be called using the rbfdot value:
> svmrbf <- ksvm(class ˜ x + y, data=points,
+ kernel="rbfdot", C = 1)
     
If the data frame testdata contains the vectors $\binom{6}{6}$, $\binom{7}{8}$, and $\binom{8}{11}$, then the predictions of the classifier svmrbf obtained with

> pred_points <- predict(svmrbf,testdata,type="response")

are
> pred_points
[,1]
[1,] -0.03084342
[2,] -1.03816317
[3,] 1.21256792

Note that the first two test data that are close to negative training examples get
negative predictions; the remaining test data that are close to positive examples get
a positive prediction.

Fig. 4 Data set that is not linearly separable; positive examples are shown as white squares and negative examples as asterisks

5 Regression

Regression seeks functions that model data with minimal errors. It aims to describe
the relationships between dependent variables and independent variables and to
estimate values of dependent variables starting from values of independent
variables.
There are several types of regression: linear regression, logistic regression,
nonlinear regression, Cox regression, etc. We present here an introduction to linear
regression.
Linear regression considers models that assume that a variable $Y$ is estimated to be a linear function of the independent variables $X_1, \ldots, X_n$:

$Y = a_0 + a_1 X_1 + \cdots + a_n X_n.$

$Y$ must be continuous, while $X_1, \ldots, X_n$ may be continuous or discrete (categorical).


Example 5.1 Consider a data set that records the height and weight of several individuals. We seek a linear dependency of the weight on the height (the regressor), specified by the model formula weight ~ height. Data can be placed in R using
height <- c(1.6,1.62,1.65,1.72,1.74,1.74,1.76,1.77,1.79,1.8,1.8,1.81,
1.83,1.84,1.86,1.87,1.9,1.91,1.91,1.92);
weight <- c(55,53,54,57,64,69,73,65,80,72,77,81,73,80,84,86,84,88,
91,89);

and can be displayed using the usual plot function, as shown in Fig. 5. To
produce the regression line, we call the linear modeling function lm:
lm.r <- lm(formula = weight ˜ height)

The coefficients of the regression line are


Coefficients:
(Intercept) height
-148.0 123.8

and abline(lm.r) places the regression line on the plot.
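Once the model is built, the generic predict function can be used to estimate the weight of new individuals; a small sketch (our addition):

> new.heights <- data.frame(height = c(1.70, 1.88))
> predict(lm.r, newdata = new.heights)   # estimated weights for the two heights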

In a multivariable regression, we seek a similar linear dependency that involves


several regressors.
Example 5.2 Suppose that we collect a data set that shows the dependency of systolic blood pressure on the body mass index (BMI), sex, and age by writing

Fig. 5 Data and regression line (weight plotted against height, with the fitted line superimposed)

BMI <- c(21.48,19.83,19.26,21.13,22.79,23.56,20.74,24.96,22.22,23.76,
         24.72,21.79,23.62,24.28,24.59,23.26,24.12,24.94,24.14);
sex <- c(0,0,1,0,1,1,1,1,1,1,0,1,0,1,0,1,1,1,1);
age <- c(21,31,20,32,41,25,40,38,50,45,41,65,37,60,51,65,40,55,40);
sys <- c(125,120,110,130,141,155,110,120,130,140,130,120,135,167,
         130,150,140,145,120);

The linear model is obtained with


lm.r = lm(sys ˜ BMI + sex + age)

This results in the linear function defined by the following coefficients:


(Intercept) BMI sex age
31.5458 4.0081 2.4310 0.1791

In other words, the linear model is


sys = 31.5458 + 4.0081 ∗ BMI + 2.4310 ∗ sex + 0.1791 ∗ age.
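As a quick illustration (our addition), the fitted model can be used to estimate the systolic pressure of a new patient; the values below are hypothetical, and we assume that sex = 0 codes the first category:

> predict(lm.r, newdata = data.frame(BMI = 23, sex = 0, age = 45))
> # 31.5458 + 4.0081*23 + 2.4310*0 + 0.1791*45 = 131.79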

The use of support vector machines for regression was proposed in (Drucker et al. 1996). The model produced by support vector classification depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Another SVM version, known as the least squares support vector machine, has been proposed in (Suykens and Vandewalle 1999).
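kernlab's ksvm function also implements support vector regression through its "eps-svr" (epsilon-insensitive) type; a minimal sketch (our addition), reusing the height/weight data of Example 5.1 with illustrative values for C and epsilon:

> hw <- data.frame(height, weight)
> svr <- ksvm(weight ~ height, data = hw, type = "eps-svr",
+             kernel = "rbfdot", C = 10, epsilon = 0.1)
> predict(svr, data.frame(height = c(1.70, 1.88)))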

6 Active Learning

Learning, as has been discussed up to this point, involves passive learners, that is,
learning algorithms where the information flows from data to learner.
A machine learning algorithm can achieve greater accuracy with fewer training
labels if it is allowed to choose the data from which it learns, that is, to apply active
learning. An active learner may pose queries, usually in the form of unlabeled data
instances to be labeled by a human operator. The flow of information between data
and the learner is bidirectional as shown in Fig. 6.
Since unlabeled data are abundant and, in many cases, easily obtainable, there
are good reasons to use this learning paradigm.
The training processes that allow us to construct data mining models often require a large volume of labeled data. For example, to produce a topic-based text classifier through text mining, a large number of documents must be labeled with the pertinent topics. This is an expensive process that requires numerous human readers capable of understanding these topics and of attaching appropriate labels to the documents. Similarly, speech recognition requires labeling of a large number of speech fragments by specialized linguists, which is time consuming and prone to errors.
Active learning requires a querying strategy (see Settles 2012). One such
strategy is query by uncertainty (also known as uncertainty sampling), in which a
single classifier is learned from labeled data and is subsequently utilized for
examining the unlabeled data. Those instances in the unlabeled data set that the
classifier is least certain about are subject to classification by a human annotator.
Query by uncertainty has been realized using a range of learners, such as logistic
regression (Lewis and Gale 1994), support vector machines (Schohn and Cohn
2000), and Markov models (Scheffer et al. 2001). The amount of data that require
annotation in order to reach a given performance, compared to passively learning
from examples provided in a random order, is significantly reduced using query by
uncertainty.

Fig. 6 Information flow in passive versus active learning (in passive learning, information flows one way from the data set S through the learning algorithm to the model; in active learning, the flow between the data set and the learning algorithm is bidirectional)



There are several modalities to implement query by uncertainty, and they require determining the data item $x_{lc}$ for which the learner is the least confident about its labeling.
The most common approach for selecting $x_{lc}$ is the use of entropy as a measure of uncertainty. If $Y$ is a random variable that ranges over all possible labels, then we shall seek $x_{lc}$ as $x_{lc} = \operatorname{argmax}_x H(Y|x)$.
Another approach requires the learner $C$ to evaluate the degree of confidence in its predictions. Let $x$ be a data item and let $\hat{y}$ be the label with the highest posterior probability according to $C$, that is, $\hat{y} = \operatorname{argmax}_y P_C(y|x)$. Then, $1 - P(\hat{y}|x)$ is the lack of confidence of $C$ in the label $\hat{y}$, and $x_{lc} = \operatorname{argmax}_x (1 - P(\hat{y}|x))$ is a data item for which $C$ is the least confident. The intervention of the human annotator will be required for $x_{lc}$.
Yet another strategy makes use of the output margin of a data item $x$, defined as the difference $P(\hat{y}_1|x) - P(\hat{y}_2|x)$ between the probability of the most likely label $\hat{y}_1$ and that of the second most likely label $\hat{y}_2$ of the item $x$. For items with large margins, there is little uncertainty on the choice of the most likely label; therefore, items with small margins benefit most from an external annotation, and so, an external annotation will be required for $x_m$ defined by

$x_m = \operatorname{argmin}_x (P(\hat{y}_1|x) - P(\hat{y}_2|x)).$
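All three query strategies are easy to express over a matrix of posterior probabilities such as the one produced by predict(..., type = "probabilities") in Example 4.1. A sketch (our addition; probs is assumed to be a matrix with one row per unlabeled item):

query.scores <- function(probs) {
  # entropy of the label distribution of each item
  entropy <- apply(probs, 1, function(p) -sum(p[p > 0] * log(p[p > 0])))
  least.conf <- 1 - apply(probs, 1, max)           # 1 - P(yhat | x)
  margin <- apply(probs, 1, function(p) {
    s <- sort(p, decreasing = TRUE)
    s[1] - s[2]                                    # P(y1 | x) - P(y2 | x)
  })
  list(x.entropy = which.max(entropy),   # item maximizing H(Y | x)
       x.lc = which.max(least.conf),     # least-confidence choice
       x.m = which.min(margin))          # smallest-margin choice
}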

Active learning may run into difficulties because, as shown in (Schütze et al. 2006; Velipasaoglu et al. 2007), a mix of learnable and unlearnable classes may co-occur in a data set. A class can be regarded as learnable if there exists a learning procedure that generates a classifier with a performance (e.g., the F1 measure) that exceeds a certain threshold with a certain level of confidence.
For small classes, it is difficult or impossible to create reliable classifiers. For example, if a class contains 1 % of 1000 records, we have just ten examples for that class and this is often not sufficient for creating a classifier.
In Dasgupta (2011), the following simple but paradigmatic example is used to describe the effect of active learning. Suppose that we have a data set $S = \{(x_i, y_i) \mid 1 \le i \le n\}$, where $x_i \in \mathbb{R}$ and $y_i \in \{-1, 1\}$, and we use a collection $H$ of simple thresholding classifiers of the form $h_t : \mathbb{R} \to \{-1, 1\}$, where

$h_t(x) = \begin{cases} -1 & \text{if } x < t, \\ 1 & \text{if } x \ge t, \end{cases}$

where $t$ is the threshold that defines the classifier $h_t$. The empirical error of the classifier $h_t$ is

$\operatorname{err}(h_t) = \frac{|\{x_i \mid h_t(x_i) \neq y_i\}|}{n}.$

The data are separable if a value $t_0$ exists such that $\operatorname{err}(h_{t_0}) = 0$. Note that if $n = 2$, the data are separable.

To determine effectively a threshold classifier that achieves an approximative separation of the data (in case the data are not separable), with an error less than $\epsilon$, we need approximately $\frac{1}{\epsilon}$ randomly drawn examples [see, for example, Blumer et al. (1989)].
If $n = \frac{1}{\epsilon}$ unlabeled examples are drawn at random, finding a classifier involves asking for $\log_2 n$ labels by a binary process that begins by asking for the label of the median point, then for the label of the 25th percentile point (or for the label of the 75th percentile point), and so on. This opens the possibility that active learning reduces exponentially the number of labels needed to construct a classifier (Dasgupta 2011).
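The logarithmic label complexity comes from a plain binary search; a sketch of the querying process (our addition; the function label stands in for the human annotator and is assumed to return -1 or 1):

find.threshold <- function(x, label) {
  x <- sort(x)
  if (label(x[1]) == 1) return(x[1])           # all points are positive
  if (label(x[length(x)]) == -1) return(Inf)   # all points are negative
  lo <- 1; hi <- length(x)                     # label(x[lo]) = -1, label(x[hi]) = 1
  while (hi - lo > 1) {                        # about log2(n) queries
    mid <- (lo + hi) %/% 2
    if (label(x[mid]) == -1) lo <- mid else hi <- mid
  }
  (x[lo] + x[hi]) / 2                          # threshold separating the labels
}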

7 Perceptrons and Neural Networks

Artificial neural networks (NN) aim to emulate cognitive processes that take place
in the human brain. Research in this direction started in the 1940s with the work of
McCulloch and Pitts (1943), Pitts and McCulloch (1947) who developed a com-
putational model of the brain.
The human brain is a highly organized collection of a large number of interconnected and specialized cells called neurons. Neurons are engaged in certain computing activities that are carried out using chemical and electrical signals; connections between neurons are referred to as synapses, and the brain, as a large collection of simple computers, has a high degree of parallelism.
The current model of a NN consists of a series of layers $L_1, \ldots, L_\ell$ of computing units. Units on the first layer $L_1$ are referred to as input units; those on the last layer $L_\ell$ are the output units, and the units in each layer beyond the first layer are neurons. Connections exist only between neurons that belong to consecutive layers.
A simple example of a NN is a perceptron that consists of n input units and one
neuron. Perceptrons can be trained to perform classification on sets of objects of the
form ðx1 ; y1 Þ; . . .; ðxm ; ym Þ, where xi 2 Rn and yi 2 f1; 1g, and they achieve this
by constructing a separating hyperplane between the set of positive examples and
the set of negative examples whenever these sets are linearly separable. In this
respect, perceptrons are similar to support vector machines. However, the model
building is done in an iterative, specific way proposed by Rosenblatt (1958).
Several variants of this algorithm exist (Freund and Shapire 1999; Novikoff 1962).
A perceptron intended to analyze vectors $x \in \mathbb{R}^n$ is defined by $n + 1$ numbers: the weights $w_1, \ldots, w_n$ of the input units and a bias $b$, as shown in Fig. 7.
In the simplest case (discussed next), the neuron itself is characterized by a transfer function that computes the answer $y = \operatorname{sign}(net(x))$, where $net(x) = w'x + b$.
The hyperplane defined by this perceptron is $w'x + b = 0$.

Fig. 7 Perceptron acting on n-dimensional inputs (the inputs $x_1, \ldots, x_n$ with weights $w_1, \ldots, w_n$ feed a single neuron that outputs $y$)

Example 7.1 Let

$x_1 = \binom{0}{0},\; x_2 = \binom{0}{1},\; x_3 = \binom{1}{1},\; x_4 = \binom{1}{0}.$

The sequence

$S_1 = ((x_1, 1), (x_2, 1), (x_3, -1), (x_4, -1))$

is linearly separable, as shown in Fig. 8a. On the other hand, the sequence

$S_2 = ((x_1, 1), (x_2, -1), (x_3, 1), (x_4, -1))$

shown in Fig. 8b is not linearly separable.

Let $R$ be the minimum radius of a closed ball centered in $0$ that contains all the examples, that is, $R = \max\{\|x_i\| \mid 1 \le i \le m\}$.
If $(x_i, y_i)$ is a member of the sequence $S$ and $H$ is the target hyperplane $w'x + b = 0$, where $\|w\| = 1$, define the functional margin of $(x_i, y_i)$ as $\gamma_i = y_i(w'x_i + b)$. As before, if $y_i$ and $w'x_i + b$ have the same sign, then $(x_i, y_i)$ is classified correctly; otherwise, it is incorrectly classified and we say that a mistake occurred.

Fig. 8 A linearly separable sequence (a) and a sequence that is not linearly separable (b); positive examples are designated by squares, while circle symbols correspond to negative examples

A perceptron is constructed starting from the sequence $S$ and from a parameter $\eta \in (0, 1)$ known as the learning rate.

Input: labelled training sequence S and learning rate η
Output: weight vector w and parameter b defining the classifier

initialize w0 = 0, b0 = 0, k = 0
define R = max{ ||xi|| | 1 ≤ i ≤ m }
repeat until (no mistakes are made in the for loop)
  for i = 1 to m do
    if (yi (w'k xi + bk) ≤ 0)
      wk+1 = wk + η yi xi;
      bk+1 = bk + η yi R²;
      k = k + 1;
    end if
  end for
end repeat
return k, (wk, bk), where k is the number of mistakes;
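A direct transcription of this algorithm in R (our addition; X is the m × n matrix of examples, y the vector of labels, and max.epochs guards against non-separable data) may look as follows:

perceptron <- function(X, y, eta = 0.1, max.epochs = 1000) {
  w <- rep(0, ncol(X)); b <- 0; k <- 0
  R <- max(sqrt(rowSums(X^2)))                 # R = max ||x_i||
  for (epoch in 1:max.epochs) {
    mistakes <- 0
    for (i in 1:nrow(X)) {
      if (y[i] * (sum(w * X[i, ]) + b) <= 0) { # mistake on (x_i, y_i)
        w <- w + eta * y[i] * X[i, ]
        b <- b + eta * y[i] * R^2
        k <- k + 1; mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) break                   # separating hyperplane found
  }
  list(w = w, b = b, mistakes = k)
}
# e.g., on the separable sequence S1 of Example 7.1:
# X <- rbind(c(0, 0), c(0, 1), c(1, 1), c(1, 0)); y <- c(1, 1, -1, -1)
# perceptron(X, y)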

Suppose there exists an optimal weight vector $w_{opt}$ and an optimal bias $b_{opt}$ such that

$\|w_{opt}\| = 1$ and $y_i(w'_{opt} x_i + b_{opt}) \ge \gamma$

for $1 \le i \le m$. Then, we claim that the number of mistakes made by the algorithm is at most

$\left(\frac{2R}{\gamma}\right)^2.$

Indeed, let $t$ be the update counter and define the augmented vectors

$\hat{w} = \begin{pmatrix} w \\ b/R \end{pmatrix}$ and $\hat{x}_i = \begin{pmatrix} x_i \\ R \end{pmatrix}$

for $1 \le i \le m$.
The algorithm begins with an augmented vector $\hat{w}_0 = 0$ and updates it at each mistake.
Let $\hat{w}_{t-1}$ be the augmented weight vector prior to the $t$th mistake. The $t$th update is performed when

$y_i \hat{w}'_{t-1} \hat{x}_i = y_i(w'_{t-1} x_i + b_{t-1}) \le 0,$

where $(x_i, y_i)$ is the example incorrectly classified by

$\hat{w}_{t-1} = \begin{pmatrix} w_{t-1} \\ b_{t-1}/R \end{pmatrix}.$

The update is

$\hat{w}_t = \begin{pmatrix} w_t \\ b_t/R \end{pmatrix} = \begin{pmatrix} w_{t-1} + \eta y_i x_i \\ (b_{t-1} + \eta y_i R^2)/R \end{pmatrix} = \begin{pmatrix} w_{t-1} \\ b_{t-1}/R \end{pmatrix} + \begin{pmatrix} \eta y_i x_i \\ \eta y_i R \end{pmatrix} = \hat{w}_{t-1} + \eta y_i \hat{x}_i,$

where we used the fact that $b_t = b_{t-1} + \eta y_i R^2$.


Since
    
b xi
^ 0opt ^xi ¼ yi w
yi w ^ 0opt ¼ yi w^ 0opt xi þ b > c;
R R

we have

^ 0opt w
w ^ 0opt w
^t ¼ w ^ 0t1 þ gyi w
^ 0opt ^xi > w
^ 0opt w
^ t1 þ gc:

^ 0opt w
By repeated application of the inequality w ^ t > gc, we obtain

^ 0opt w
w ^ t > tgc:

Since $\hat{w}_t = \hat{w}_{t-1} + \eta y_i \hat{x}_i$, we have

$\|\hat{w}_t\|^2 = \hat{w}'_t \hat{w}_t = (\hat{w}_{t-1} + \eta y_i \hat{x}_i)'(\hat{w}_{t-1} + \eta y_i \hat{x}_i)$
$= \|\hat{w}_{t-1}\|^2 + 2\eta y_i \hat{w}'_{t-1}\hat{x}_i + \eta^2 \|\hat{x}_i\|^2$
$\le \|\hat{w}_{t-1}\|^2 + \eta^2 \|\hat{x}_i\|^2$ (because $y_i \hat{w}'_{t-1}\hat{x}_i \le 0$ when an update occurs)
$\le \|\hat{w}_{t-1}\|^2 + \eta^2 (\|x_i\|^2 + R^2)$
$\le \|\hat{w}_{t-1}\|^2 + 2\eta^2 R^2,$

which implies $\|\hat{w}_t\|^2 \le 2t\eta^2 R^2$. By combining the inequalities

$\hat{w}'_{opt} \hat{w}_t \ge t\eta\gamma$ and $\|\hat{w}_t\|^2 \le 2t\eta^2 R^2$

we have

$\|\hat{w}_{opt}\| \sqrt{2t}\,\eta R \ge \|\hat{w}_{opt}\| \|\hat{w}_t\| \ge \hat{w}'_{opt} \hat{w}_t \ge t\eta\gamma,$

which implies

$t \le 2\left(\frac{R}{\gamma}\right)^2 \|\hat{w}_{opt}\|^2 \le \left(\frac{2R}{\gamma}\right)^2$

because $b_{opt} \le R$ for a non-trivial separation of the data and hence

$\|\hat{w}_{opt}\|^2 \le \|w_{opt}\|^2 + 1 = 2.$

In the case of the perceptron considered above, the transfer function is the signum function

$\operatorname{sign}(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ -1 & \text{if } x < 0 \end{cases}$

for $x \in \mathbb{R}$. We mention a few other choices that exist for the transfer function:
• the sigmoid or the logistic function $h(x) = \frac{1}{1 + e^{-x}}$,
• the hyperbolic tangent $h(x) = \tanh(x)$,
• the Gaussian function $h(x) = a e^{-\frac{x^2}{2}}$,
for $x \in \mathbb{R}$. The advantage of these last three choices is their differentiability, which enables us to apply optimization techniques to more complex NNs. Note, in particular, that if $h$ is a sigmoid transfer function, then

$h'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = h(x)(1 - h(x)), \qquad (7)$

which turns out to be a very useful property. To emphasize the choices that we have
for the transfer function, it is useful to think that a neuron has the structure shown in
Fig. 9.
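These transfer functions and the derivative property (7) are immediate to check numerically; a small sketch (our addition):

sigmoid <- function(x) 1 / (1 + exp(-x))
gauss <- function(x, a = 1) a * exp(-x^2 / 2)   # tanh is built into R
# numerical check of Equality (7): h'(x) = h(x)(1 - h(x))
x <- 0.7; eps <- 1e-6
(sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # finite-difference derivative
sigmoid(x) * (1 - sigmoid(x))                      # analytic form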
A multilayer NN is a much more capable classifier compared to the perceptron. It has, however, a higher degree of complexity because of the topology of the network, which entails multiple connection weights, multiple outputs, and more complex neurons.
The specification of the architecture of a NN encompasses the following elements (see Fig. 10):
(i) the choice of $\ell$, the number of levels; the first level $L_1$ contains the input units, the last level $L_\ell$ contains the output units, and the intermediate levels $L_2, \ldots, L_{\ell-1}$ contain the hidden units;

Fig. 9 Summation and activation components of a neuron (the summation component computes $net = \sum_{i=1}^n w_i x_i$, and the activation component applies the transfer function $h$ to it)

Fig. 10 Structure of a neural network (input layer $L_1$, intermediate layers $L_k$, $L_{k+1}$, output layer $L_\ell$; a unit $N_j$ and its downstream set $ds(N_j)$)

(ii) the connection from unit $N_i$ on level $L_k$ to unit $N_j$ on level $L_{k+1}$ has the weight $w_{ji}$; the set of units on level $L_{k+2}$ that are connected to unit $N_j$ is the downstream set of $N_j$, denoted by $ds(N_j)$;
(iii) the type of neurons used in the network as defined by their transfer functions.
Let $X$ be the set of examples that are used in training the network. For $x \in X$, we have a vector of target outputs $t(x)$ and a vector of actual outputs $o(x)$, both in $\mathbb{R}^p$, where $p$ is the number of output units. The outputs that correspond to a unit $N_j$ are denoted by $o_{x,j}$. For a weight vector $w$ of the network, the total error is

$E(w) = \frac{1}{2} \sum_{x \in X} \|t(x) - o(x)\|^2.$

The information is propagated from the input to the output layer. This justifies
referring to the architecture of this network as a feed-forward network.

We discuss here the backpropagation training algorithm for a feed-forward NN.


The training process consists in readjusting the weights of the connections, taking into account the error rate.
Note that $E(w)$ depends on a large number of parameters of the form $w_{ji}$, which poses a significant challenge as an optimization problem. Finding a local minimum of $E(w)$ can be achieved by applying a gradient descent algorithm. It is known that the fastest decrease of a function is in the direction opposite to its gradient, and the components of the gradient of $E(w)$ are given by $\frac{\partial E(w)}{\partial w_{ji}}$. The learning algorithm will modify the weights $w_{ji}$ of the network in the direction opposite to the gradient of the error $E(w)$. Consequently, the change in $w_{ji}$ will be given by

$\Delta w_{ji} = -\eta \frac{\partial E(w)}{\partial w_{ji}},$

where the learning rate $\eta$ is a small positive number. Initially, the weights of the edges are randomly set as numbers having small absolute values (e.g., between −0.05 and 0.05) (cf. Mitchell 1997). These weights are successively modified as we show next.
To evaluate the partial derivatives of the form $\frac{\partial E(w)}{\partial w_{ji}}$, we need to take into account that $E(w)$ depends on $w_{ji}$ through $net_j$ and, therefore,

$\frac{\partial E(w)}{\partial w_{ji}} = \frac{\partial E(w)}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E(w)}{\partial net_j}\, x_{ji}.$

The position of the neuron $N_j$ in the network must be considered in computing $\frac{\partial E(w)}{\partial net_j}$, and we have two cases.

(i) If $N_j$ is an output neuron, then $E(w)$ depends on $net_j$ through the output $o_j$ of the unit $N_j$, where $o_j = h(net_j)$. Thus,

$\frac{\partial E(w)}{\partial net_j} = \frac{\partial E(w)}{\partial o_j} \frac{\partial o_j}{\partial net_j} = \frac{\partial E(w)}{\partial o_j}\, h'(net_j).$

Since $N_j$ is an output neuron, we have $\frac{\partial E(w)}{\partial o_j} = -(t_j - o_j)$ for $1 \le j \le p$. If we assume that $N_j$ has a sigmoidal transfer function, then [by Equality (7)] we have

$\frac{\partial E(w)}{\partial net_j} = -(t_j - o_j)\, h(net_j)(1 - h(net_j)).$

(ii) When $N_j$ is a hidden unit, $E(w)$ depends on $net_j$ via the functions $net_k$ for all neurons $N_k$ situated downstream from $N_j$. In turn, each $net_k$ depends on $o_j$, which depends on $net_j$. This allows us to write:

$\frac{\partial E(w)}{\partial net_j} = \sum_{N_k \in ds(N_j)} \frac{\partial E(w)}{\partial net_k} \frac{\partial net_k}{\partial net_j} = \sum_{N_k \in ds(N_j)} \frac{\partial E(w)}{\partial net_k} \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j}.$

Observe that

$\frac{\partial o_j}{\partial net_j} = h'(net_j) = h(net_j)(1 - h(net_j))$

because $h$ is a sigmoid function, and $\frac{\partial net_k}{\partial o_j} = w_{kj}$ because $N_k \in ds(N_j)$. This yields

$\frac{\partial E(w)}{\partial net_j} = o_j(1 - o_j) \sum_{N_k \in ds(N_j)} \frac{\partial E(w)}{\partial net_k}\, w_{kj}.$

If $\delta_i = -\frac{\partial E(w)}{\partial net_i}$ for every neuron $N_i$, then

$\delta_j = \begin{cases} (t_j - o_j)\, h(net_j)(1 - h(net_j)) & \text{if } N_j \text{ is an output neuron} \\ o_j(1 - o_j) \sum_{N_k \in ds(N_j)} \delta_k w_{kj} & \text{if } N_j \text{ is a hidden neuron.} \end{cases}$

The changes in the weights can now be written as

$\Delta w_{ji} = \eta\, \delta_j\, x_{ji} = \begin{cases} \eta (t_j - o_j)\, h(net_j)(1 - h(net_j))\, x_{ji} & \text{if } N_j \text{ is an output neuron} \\ \eta\, o_j(1 - o_j) \left(\sum_{N_k \in ds(N_j)} \delta_k w_{kj}\right) x_{ji} & \text{if } N_j \text{ is a hidden neuron,} \end{cases}$

where $x_{ji}$ is the input that unit $N_j$ receives from unit $N_i$.

The backpropagation algorithm consists of the following steps:

for each training example (x, t), where x ∈ X do
  input x in the network and obtain o_{x,j} for each unit N_j;
  for each output unit N_j compute δ_j = (t_j − o_j) h(net_j)(1 − h(net_j));
  for each hidden unit N_j compute δ_j = o_j(1 − o_j) Σ_{N_k ∈ ds(N_j)} δ_k w_{kj};
  update each weight w_{ji} by Δw_{ji} = η δ_j x_{ji};
end for

Observe that the weight updates proceed from the output layer toward the inner
layers, which justifies the name of the algorithm.
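For concreteness, one backpropagation step for a tiny 2-2-1 network with sigmoid units and no biases can be written directly in R; the following sketch (our addition) applies the update rules above to a single training example:

sigmoid <- function(x) 1 / (1 + exp(-x))
backprop.step <- function(W1, W2, x, target, eta = 0.5) {
  o1 <- sigmoid(W1 %*% x)                  # hidden layer outputs
  o2 <- sigmoid(W2 %*% o1)                 # output layer output
  d2 <- (target - o2) * o2 * (1 - o2)      # delta of the output unit
  d1 <- o1 * (1 - o1) * (t(W2) %*% d2)     # deltas of the hidden units
  list(W1 = W1 + eta * d1 %*% t(x),        # Delta w_ji = eta * delta_j * x_ji
       W2 = W2 + eta * d2 %*% t(o1))
}
W1 <- matrix(runif(4, -0.05, 0.05), 2, 2)  # small random initial weights
W2 <- matrix(runif(2, -0.05, 0.05), 1, 2)
backprop.step(W1, W2, x = c(1, 0), target = 1)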
Next, we present an example for NN construction using the package neu-
ralnet developed in (Günther and Fritsch 2010). The package computes NN with

one hidden layer with a prescribed number of neurons. The computation of the NN
model is achieved by calling
nnmodel <- neuralnet(target ˜ predictors, data = inputdata,
+ hidden = h)

where target ~ predictors is the formula that specifies the model, and hidden gives the number of neurons in the hidden layer.
Example 7.2 We use the data set Concrete_Compressive_Strength (CCS) that is available from the data mining repository at UCI. The ingredients of concrete include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate. The data set records 1030 observations and has nine numerical attributes. Data are presented in a raw form (they are not scaled), and various attributes have distinct ranges (see Table 5).
The first seven attributes are expressed in kg/m³. Data (originally in the xls format) are read in R using the csv format as
> CCS <- read.csv("CCS.csv")
> head(CCS)
cem blast ash water plast coarse fine age strength
1 540.0 0.0 0 162 2.5 1040.0 676.0 28 79.99
2 540.0 0.0 0 162 2.5 1055.0 676.0 28 61.89
3 332.5 142.5 0 228 0.0 932.0 594.0 270 40.27
4 332.5 142.5 0 228 0.0 932.0 594.0 365 41.05
5 198.6 132.4 0 192 0.0 978.4 825.5 360 44.30
6 266.0 114.0 0 228 0.0 932.0 670.0 90 47.03

Since the scale of the attributes is quite distinct, the data are normalized using the
function normalize defined in (Lantz 2013) as
normalize <- function(x) {
+ return((x - min(x)) / (max(x) - min(x)))
+ }

Table 5 Attributes of the CCS data set

Attribute in original set          Attribute in our data set
Cement                             cem
Blast furnace slag                 blast
Fly ash                            ash
Water                              water
Superplasticizer                   plast
Coarse aggregate                   coarse
Fine aggregate                     fine
Age                                age (in days)
Concrete compressive strength      strength

and the data set is normalized using


CCSN <- as.data.frame(lapply(CCS,normalize))

This results in normalized data; its first few records (truncated to two decimals)
are:
> head(CCSN)
cem blast ash water plast coarse fine age strength
1 1.00 0.00 0 0.32 0.07 0.69 0.20 0.07 0.96
2 1.00 0.00 0 0.32 0.07 0.73 0.20 0.07 0.74
3 0.52 0.39 0 0.84 0.00 0.38 0.00 0.73 0.47
4 0.52 0.39 0 0.84 0.00 0.38 0.00 1.00 0.48
5 0.22 0.36 0 0.56 0.00 0.51 0.58 0.98 0.52
6 0.37 0.31 0 0.84 0.00 0.38 0.19 0.24 0.55

A neural net model with four hidden neurons is computed by


nnet4 <- neuralnet(strength ˜ cem + blast + ash + water +
+ plast + coarse + fine + age,
+ data = CCSN, hidden = 4)

The resulting neural net can be seen using plot(nnet4) and is shown in
Fig. 11.

Fig. 11 Neural net with four hidden neurons (the plot displays the learned connection weights; Error: 2.460318, Steps: 35779)

Once a neural net is created, the compute function of the neuralnet package can be used to calculate and summarize the output of each neuron; it can be used to predict the outputs for new combinations of values of the attributes.
Example 7.3 Consider some new combinations of values for the eight predictive
attributes of CCSN defined by
newconc <- matrix(c(1.00, 0.2, 0.1, 0.1, 0.1, 0.8, 0.8, 0.9,
0.9, 0.5, 0.1, 0.4, 0.1, 0.5, 0.5, 0.2),
byrow = TRUE, ncol = 8)

Using
new.output <- compute(nnet4,newconc)
new.output$net.result

yields the following predictions for strength


[,1]
[1,] 0.8891098520
[2,] 0.9530817536

8 Bibliographic Guide

Data mining and machine learning have generated a vast collection of references.
Among more advanced texts, we recommend (Abu-Mostafa et al. 2012; Bishop
2007; Murphy 2012; Shalev-Shwartz and Ben-David 2014; Zaki and Meira 2014;
Mohri et al. 2012).
A large number of books exist that deal with the R system and its applications to
machine learning and data mining. We mention (Lander 2014; Maindonald and
Braun 2004; Matloff 2011; Wickham 2009) as general references on R; books
specialized in machine learning applications are (Lantz 2013; Zhao 2013; Shao and
Cen 2014).
A very lucid and helpful survey of active learning is (Settles 2012).
The current literature dedicated to support vector machines includes books written at various levels of mathematical sophistication, ranging from accessible titles (Cristianini and Shawe-Taylor 2000; Kung 2014; Statnikov et al. 2011; Suykens et al. 2005) to more advanced ones (Shawe-Taylor and Cristianini 2005; Steinwart and Christman 2008).
A comprehensive discussion related to the implementation of SVM in the
kernlab package of R is presented in (Karatzoglou et al. 2004; Karatzoglu et al.
2006).

A Subspaces and Hyperplanes


We assume that the reader is familiar with the notion of linear space, as presented, for example, in (Simovici and Djeraba 2014). If $L$ is a real linear space, a subspace of $L$ is a subset $M$ of $L$ such that $x, y \in M$ implies $x + y \in M$ and $ax \in M$ for every $a \in \mathbb{R}$.
Note that $L$ is a subspace of $L$ and that the smallest subspace of $L$ is $\{0\}$. Any intersection of subspaces of $L$ is a subspace of $L$. Therefore, if $X$ is a subset of $L$, then the intersection of all subspaces that contain $X$ is a subspace; we refer to this subspace as the subspace generated by $X$, and we denote it by $\operatorname{span}(X)$.
Let $v \in \mathbb{R}^n \setminus \{0\}$ and let $a \in \mathbb{R}$. The hyperplane determined by $v$ and $a$ is the set $H_{v,a} = \{x \in \mathbb{R}^n \mid v'x = a\}$.
If $x_0 \in H_{v,a}$, then $v'x_0 = a$, so $H_{v,a}$ is also described by the equality

$H_{v,a} = \{x \in \mathbb{R}^n \mid v'(x - x_0) = 0\},$

where $x_0 \in H_{v,a}$.
Any hyperplane $H_{v,a}$ partitions $\mathbb{R}^n$ into three sets:

$H^{>}_{v,a} = \{x \in \mathbb{R}^n \mid v'x > a\},\quad H^{0}_{v,a} = H_{v,a},\quad H^{<}_{v,a} = \{x \in \mathbb{R}^n \mid v'x < a\}.$

The sets $H^{>}_{v,a}$ and $H^{<}_{v,a}$ are the positive and negative open half-spaces determined by $H_{v,a}$, respectively. The sets

$H^{\ge}_{v,a} = \{x \in \mathbb{R}^n \mid v'x \ge a\},\quad H^{\le}_{v,a} = \{x \in \mathbb{R}^n \mid v'x \le a\}$

are the positive and negative closed half-spaces determined by $H_{v,a}$, respectively.
If $x_1, x_2 \in H_{v,a}$, then $v \perp (x_1 - x_2)$. This justifies referring to $v$ as the normal to the hyperplane $H_{v,a}$. Observe that a hyperplane is fully determined by a vector $x_0 \in H_{v,a}$ and by $v$.
Let $x_0 \in \mathbb{R}^n$ and let $H_{v,a}$ be a hyperplane. We seek $x \in H_{v,a}$ such that $\|x - x_0\|_2$ is minimal. Finding $x$ amounts to minimizing the function $f(x) = \|x - x_0\|_2^2 = \sum_{i=1}^n (x_i - x_{0i})^2$ subject to the constraint $v_1 x_1 + \cdots + v_n x_n - a = 0$. Using the Lagrangean $L(x) = f(x) + \lambda(v'x - a)$ with the multiplier $\lambda$, we impose the conditions

$\frac{\partial L}{\partial x_i} = 0 \quad \text{for } 1 \le i \le n,$

which amount to

$\frac{\partial f}{\partial x_i} + \lambda v_i = 0$

for $1 \le i \le n$. These equalities yield $2(x_i - x_{0i}) + \lambda v_i = 0$, so we have $x_i = x_{0i} - \frac{1}{2}\lambda v_i$. Consequently, we have $x = x_0 - \frac{1}{2}\lambda v$. Since $x \in H_{v,a}$, this implies

$v'x = v'x_0 - \frac{1}{2}\lambda v'v = a.$

Thus,

$\lambda = 2\,\frac{v'x_0 - a}{v'v} = 2\,\frac{v'x_0 - a}{\|v\|_2^2}.$

We conclude that the closest point in $H_{v,a}$ to $x_0$ is

$x = x_0 - \frac{v'x_0 - a}{\|v\|_2^2}\, v.$

The smallest distance between $x_0$ and a point in the hyperplane $H_{v,a}$ is given by

$\|x_0 - x\| = \left\| \frac{v'x_0 - a}{\|v\|_2^2}\, v \right\| = \frac{|v'x_0 - a|}{\|v\|_2}.$

If we define the distance $d(H_{v,a}, x_0)$ between $x_0$ and $H_{v,a}$ as this smallest distance, we have:

$d(H_{v,a}, x_0) = \frac{|v'x_0 - a|}{\|v\|_2}. \qquad (8)$
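Equality (8) translates directly into R; a small sketch (our addition):

# distance from x0 to the hyperplane {x : v'x = a}, as in Equality (8)
dist.hyperplane <- function(v, a, x0) abs(sum(v * x0) - a) / sqrt(sum(v^2))
# the closest point of the hyperplane to x0
proj.hyperplane <- function(v, a, x0) x0 - (sum(v * x0) - a) / sum(v^2) * v
dist.hyperplane(v = c(1, 1), a = 1, x0 = c(2, 2))  # 3/sqrt(2)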

B Convexity, Partitions, and Entropy

Let $x, y \in \mathbb{R}^n$. The closed segment determined by $x$ and $y$ is the set

$[x, y] = \{ax + (1 - a)y \mid 0 \le a \le 1\}.$

A set $C \subseteq \mathbb{R}^n$ is convex if $x, y \in C$ implies $[x, y] \subseteq C$.
Let $S$ be a non-empty convex subset of $\mathbb{R}^n$. A function $f : S \to \mathbb{R}$ is convex if $f(tx + (1-t)y) \le t f(x) + (1-t) f(y)$ for every $x, y \in S$ and $t \in [0, 1]$.
Theorem B.1 (Jensen's Theorem) Let $f$ be a function that is convex on an interval $I$. If $t_1, \ldots, t_n \in [0, 1]$ are $n$ numbers such that $\sum_{i=1}^n t_i = 1$, then

$f\left(\sum_{i=1}^n t_i x_i\right) \le \sum_{i=1}^n t_i f(x_i)$

for every $x_1, \ldots, x_n \in I$.
Proof The argument is by induction on $n$, where $n \ge 2$. The basis step, $n = 2$, follows immediately from the definition of convex functions.
Suppose that the statement holds for $n$, and let $t_1, \ldots, t_n, t_{n+1}$ be $n + 1$ numbers such that $\sum_{i=1}^{n+1} t_i = 1$. We have

$f(t_1 x_1 + \cdots + t_{n-1} x_{n-1} + t_n x_n + t_{n+1} x_{n+1}) = f\left(t_1 x_1 + \cdots + t_{n-1} x_{n-1} + (t_n + t_{n+1}) \frac{t_n x_n + t_{n+1} x_{n+1}}{t_n + t_{n+1}}\right).$

By the inductive hypothesis, we can write

$f(t_1 x_1 + \cdots + t_n x_n + t_{n+1} x_{n+1}) \le t_1 f(x_1) + \cdots + t_{n-1} f(x_{n-1}) + (t_n + t_{n+1}) f\left(\frac{t_n x_n + t_{n+1} x_{n+1}}{t_n + t_{n+1}}\right).$

Next, by the convexity of $f$, we have

$f\left(\frac{t_n x_n + t_{n+1} x_{n+1}}{t_n + t_{n+1}}\right) \le \frac{t_n}{t_n + t_{n+1}} f(x_n) + \frac{t_{n+1}}{t_n + t_{n+1}} f(x_{n+1}).$

Combining this inequality with the previous inequality gives the desired conclusion. □
Example B.2 It is easy to verify that the function $f(x) = x^a$ is convex on $\mathbb{R}_{>0}$ if $a \ge 1$ because $f''(x) = a(a-1)x^{a-2} \ge 0$ for $x > 0$. Therefore, if $t_1, \ldots, t_n \in [0, 1]$ and $\sum_{i=1}^n t_i = 1$, by applying Jensen's inequality to $f$, we obtain the inequality:

$\left(\sum_{i=1}^n t_i x_i\right)^a \le \sum_{i=1}^n t_i x_i^a.$

In particular, if $t_1 = \cdots = t_n = \frac{1}{n}$, it follows that

$\left(\sum_{i=1}^n x_i\right)^a \le n^{a-1} \sum_{i=1}^n x_i^a,$

so

$\sum_{i=1}^n x_i^a \ge n^{1-a} \left(\sum_{i=1}^n x_i\right)^a.$

When $\sum_{i=1}^n x_i = 1$, the previous inequality implies $\sum_{i=1}^n x_i^a \ge n^{1-a}$.

A partition of a finite and non-empty set $S$ is a collection of non-empty subsets $B_1, \ldots, B_n$ such that
(i) if $1 \le i, j \le n$ and $i \neq j$, then $B_i \cap B_j = \emptyset$;
(ii) $\bigcup_{i=1}^n B_i = S$.
The sets $B_1, \ldots, B_n$ are known as the blocks of $\pi$.

Let $\operatorname{part}(S)$ be the set of partitions of the set $S$. A partial order "$\le$" can be defined on $\operatorname{part}(S)$ as $\pi \le \pi'$ if each block $B'$ of $\pi'$ is a union of blocks of the partition $\pi$.
Example B.3 For $S = \{x_i \mid 1 \le i \le 6\}$ consider the partitions

$\pi = \{\{x_1, x_2\}, \{x_6\}, \{x_3, x_5\}, \{x_4\}\},$
$\pi' = \{\{x_1, x_2, x_6\}, \{x_3, x_4, x_5\}\}.$

We have $\pi \le \pi'$ because each of the blocks of $\pi'$ is a union of blocks of $\pi$.

The partition $\iota_S$ whose blocks are singletons $\{x\}$, where $x \in S$, is the least partition defined on $S$. The partition $\omega_S$ that consists of a single block equal to $S$ is the largest partition on $S$.
Let $\pi, \sigma$ be two partitions of a set $S$, where $\pi = \{B_1, \ldots, B_n\}$ and $\sigma = \{C_1, \ldots, C_m\}$. The partition $\pi \wedge \sigma$ of $S$ consists of all non-empty intersections of the form $B_i \cap C_j$, where $1 \le i \le n$ and $1 \le j \le m$. Clearly, we have $\pi \wedge \sigma \le \pi$ and $\pi \wedge \sigma \le \sigma$. Moreover, if $\tau$ is a partition of $S$ such that $\tau \le \pi$ and $\tau \le \sigma$, then $\tau \le \pi \wedge \sigma$.
If $T \subseteq S$ is a non-empty subset of $S$, then any partition $\pi = \{B_1, \ldots, B_n\}$ of $S$ determines a partition $\pi_T$ on $T$ defined by

$\pi_T = \{T \cap B_i \mid B_i \in \pi \text{ and } T \cap B_i \neq \emptyset\}.$

For example, if $\pi = \{\{x_1, x_2\}, \{x_6\}, \{x_3, x_5\}, \{x_4\}\}$, the trace of $\pi$ on the set $T = \{x_1, x_2, x_5, x_6\}$ is the partition $\pi_T = \{\{x_1, x_2\}, \{x_6\}, \{x_5\}\}$.
A subset $T$ of $S$ is $\pi$-pure if $T$ is included in a block of $\pi$ or, equivalently, if $\pi_T = \omega_T$.
Let $\pi = \{B_1, \ldots, B_n\}$ be a partition of a finite set $S$ and let $x_i = \frac{|B_i|}{|S|}$ for $1 \le i \le n$. Since $\sum_{i=1}^n x_i = 1$, we have the inequality

$1 - \sum_{i=1}^n \left(\frac{|B_i|}{|S|}\right)^a \le 1 - n^{1-a}.$

Note that if $|B_1| = \cdots = |B_n| = \frac{|S|}{n}$, the left member of the above inequality equals $1 - n^{1-a}$.

Definition B.4 The $a$-entropy $H_a(\pi)$ of the partition $\pi = \{B_1, \ldots, B_n\}$ of the set $S$ is given by

$H_a(\pi) = \frac{1}{1 - 2^{1-a}} \left(1 - \sum_{i=1}^n \left(\frac{|B_i|}{|S|}\right)^a\right).$

By the previous considerations, the maximum value of the expression $H_a(\pi)$ is obtained when the blocks of the partition $\pi$ have equal size, and it is equal to $\frac{1 - n^{1-a}}{1 - 2^{1-a}}$.
When $a = 2$, we obtain the Gini index of $\pi$, $\operatorname{gini}(\pi) = 2\left(1 - \sum_{i=1}^n \left(\frac{|B_i|}{|S|}\right)^2\right)$.
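These quantities are easy to compute in R when a partition is represented by a vector of block labels (e.g., a factor); a sketch (our addition):

# a-entropy of a partition given as a vector of block labels
h.alpha <- function(blocks, a = 2) {
  p <- table(blocks) / length(blocks)   # relative block sizes |B_i|/|S|
  (1 - sum(p^a)) / (1 - 2^(1 - a))
}
shannon <- function(blocks) {
  p <- table(blocks) / length(blocks)
  -sum(p * log(p))
}
h.alpha(iris$Species, a = 2)   # Gini index of the species partition
shannon(iris$Species)          # Shannon entropy; equals log(3) for equal blocks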

Example B.5 Starting with the convex function $g(x) = x \ln x$ (whose second derivative $g''(x) = \frac{1}{x}$ is positive for $x > 0$), Jensen's inequality implies:

$\left(\sum_{i=1}^n t_i x_i\right) \ln \left(\sum_{i=1}^n t_i x_i\right) \le \sum_{i=1}^n t_i x_i \ln x_i$

for every $x_1, \ldots, x_n \in I$. As before, for $t_1 = \cdots = t_n = \frac{1}{n}$, we have

$(x_1 + \cdots + x_n) \ln \frac{x_1 + \cdots + x_n}{n} \le \sum_{i=1}^n x_i \ln x_i.$

Applying this inequality to $x_i = \frac{|B_i|}{|S|}$, where $\pi$ is a partition of $S$ given by $\pi = \{B_1, \ldots, B_n\}$, we have

$\ln n \ge -\sum_{i=1}^n \frac{|B_i|}{|S|} \ln \frac{|B_i|}{|S|}.$

The quantity $H(\pi) = -\sum_{i=1}^n \frac{|B_i|}{|S|} \ln \frac{|B_i|}{|S|}$ is the Shannon entropy of $\pi$. Its maximum value $\ln n$ is obtained when the blocks of $\pi$ have equal size.
Note that $\lim_{a \to 1} H_a(\pi) = H(\pi)$. In other words, Shannon's entropy is a limit case of the $H_a$-entropy.
Let $\pi, \sigma$ be two partitions of a set $S$, where $\pi = \{B_1, \ldots, B_n\}$ and $\sigma = \{C_1, \ldots, C_m\}$. The conditional entropy $H_a(\pi|\sigma)$ is defined by

$H_a(\pi|\sigma) = \sum_{j=1}^m H_a(\pi_{C_j}) \left(\frac{|C_j|}{|S|}\right)^a.$

Since $H_a(\pi_{C_j}) = \frac{1}{1 - 2^{1-a}} \left(1 - \sum_{i=1}^n \left(\frac{|B_i \cap C_j|}{|C_j|}\right)^a\right)$, it follows that

$H_a(\pi \wedge \sigma) = H_a(\pi|\sigma) + H_a(\sigma).$

Various types of entropies are used to evaluate the impurity of a set relative to a partition. Namely, for a partition $\kappa$ of $S$, $H_a(\kappa)$ ranges from 0 (when the partition $\kappa$ consists of one block and, therefore, is pure) to $\frac{1 - n^{1-a}}{1 - 2^{1-a}}$ (when the partition consists of $n$ singletons and, therefore, has the highest degree of impurity).
C Optimization with Constraints
An optimization problem consists in finding a local minimum or a local maximum of a function $f : \mathbb{R}^n \to \mathbb{R}$, when such an extremum exists. The function $f$ is referred to as the objective function. Note that finding a local minimum of a function $f$ is equivalent to finding a local maximum of the function $-f$.
In constrained optimization, additional conditions are imposed on the argument of the objective function. A typical formulation of a constrained optimization problem is

minimize $f(x)$, where $x \in \mathbb{R}^n$,
subject to $c_i(x) = 0$, where $1 \le i \le p$,
and $c_j(x) \ge 0$, where $1 \le j \le q$.

Here, the $c_i$ are functions that specify equality constraints placed on $x$, while the $c_j$ define inequality constraints. The feasible region of the constrained optimization problem is the set

$R = \{x \in \mathbb{R}^n \mid c_i(x) = 0 \text{ for } 1 \le i \le p \text{ and } c_j(x) \ge 0 \text{ for } 1 \le j \le q\}.$

If the feasible region $R$ is non-empty and bounded, then, under certain conditions, a solution exists.
If $R = \emptyset$, we say that the constraints are inconsistent.
Note that equality constraints can be replaced in a constrained optimization problem by inequality constraints. Indeed, a constraint of the form $c(x) = 0$ can be replaced by the pair of constraints $c(x) \ge 0$ and $-c(x) \ge 0$.
Let $x \in R$ be a feasible solution and let $c(x) \ge 0$ be an inequality constraint used to define $R$. If $x \in R$ and $c(x) = 0$, we say that $c$ is an active constraint.
Consider the following optimization problem for an objective function $f : \mathbb{R}^n \to \mathbb{R}$, a compact set $S \subseteq \mathbb{R}^n$, and the constraint functions $c : \mathbb{R}^n \to \mathbb{R}^m$ and $d : \mathbb{R}^n \to \mathbb{R}^p$:

minimize $f(x)$, where $x \in S$,
subject to $c(x) \le 0_m$
and $d(x) = 0_p$.

Both the objective function $f$ and the constraint functions $c, d$ are assumed to be continuously differentiable. We shall refer to this optimization problem as the primal problem.
Definition C.1 The Lagrangean associated with this optimization problem is the function $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ given by

$L(x, u, v) = f(x) + u'c(x) + v'd(x)$

for $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$, and $v \in \mathbb{R}^p$. The component $u_i$ of $u$ is the Lagrangean multiplier corresponding to the constraint $c_i(x) \le 0$; the component $v_j$ of $v$ is the Lagrangean multiplier corresponding to the constraint $d_j(x) = 0$.

The dual optimization problem starts with the Lagrange dual function $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ defined by

$g(u, v) = \inf_{x \in S} L(x, u, v) \qquad (9)$

and consists of

maximize $g(u, v)$, where $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^p$,
subject to $u \ge 0_m$.

Theorem C.2 The function $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ defined by Equality (9) is concave over $\mathbb{R}^m \times \mathbb{R}^p$.

Proof For $u_1, u_2 \in \mathbb{R}^m$ and $v_1, v_2 \in \mathbb{R}^p$, we have:

$g(tu_1 + (1-t)u_2, tv_1 + (1-t)v_2)$
$= \inf\{f(x) + (tu_1' + (1-t)u_2')c(x) + (tv_1' + (1-t)v_2')d(x) \mid x \in S\}$
$= \inf\{t(f(x) + u_1'c(x) + v_1'd(x)) + (1-t)(f(x) + u_2'c(x) + v_2'd(x)) \mid x \in S\}$
$\ge t \inf\{f(x) + u_1'c(x) + v_1'd(x) \mid x \in S\} + (1-t) \inf\{f(x) + u_2'c(x) + v_2'd(x) \mid x \in S\}$
$= t g(u_1, v_1) + (1-t) g(u_2, v_2),$

which shows that $g$ is concave. □

Theorem C.2 is significant because a local optimum of $g$ is a global optimum regardless of the convexity properties of $f$, $c$, or $d$. Although the dual function $g$ is not given explicitly, the restrictions of the dual have a simpler form, and this may be an advantage in specific cases.
Example C.3 Let $f : \mathbb{R}^n \to \mathbb{R}$ be the linear function $f(x) = a'x$, $A \in \mathbb{R}^{p \times n}$, and $b \in \mathbb{R}^p$. Consider the primal problem:

minimize $a'x$, where $x \in \mathbb{R}^n$,
subject to $x \ge 0_n$
and $Ax - b = 0_p$.

The constraint functions are $c(x) = -x$ and $d(x) = Ax - b$, and the Lagrangean $L$ is

$L(x, u, v) = a'x - u'x + v'(Ax - b) = -v'b + (a' - u' + v'A)x.$

This yields the dual function

$g(u, v) = -v'b + \inf_{x \in \mathbb{R}^n} (a' - u' + v'A)x.$

Unless $a' - u' + v'A = 0_n'$, we have $g(u, v) = -\infty$. Therefore, we have

$g(u, v) = \begin{cases} -v'b & \text{if } a - u + A'v = 0_n, \\ -\infty & \text{otherwise.} \end{cases}$

Thus, the dual problem is

maximize $g(u, v)$ subject to $u \ge 0_n$.

An equivalent form of the dual problem is

maximize $-v'b$ subject to $a - u + A'v = 0_n$ and $u \ge 0_n$.

In turn, this problem is equivalent to:

maximize $-v'b$ subject to $a + A'v \ge 0_n$.

Example C.4 Let us consider a variant of the primal problem discussed in Example C.3. The objective function is again $f(x) = a'x$. However, now we have only the inequality constraints $c(x) \le 0_m$, where $c(x) = Ax - b$, $A \in \mathbb{R}^{m \times n}$, and $b \in \mathbb{R}^m$. Thus, the primal problem can be stated as

minimize $a'x$, where $x \in \mathbb{R}^n$,
subject to $Ax \le b$.

The Lagrangean $L$ is

$L(x, u) = a'x + u'(Ax - b) = -u'b + (a' + u'A)x,$

which yields the dual function:

$g(u) = \begin{cases} -u'b & \text{if } a' + u'A = 0_n', \\ -\infty & \text{otherwise,} \end{cases}$

and the dual problem is

maximize $-b'u$ subject to $a' + u'A = 0_n'$
and $u \ge 0_m$.

Example C.5 The following optimization problem

minimize $\frac{1}{2} x'Qx - r'x$, where $x \in \mathbb{R}^n$,
subject to $Ax \ge b$,

where $Q \in \mathbb{R}^{n \times n}$ is a positive definite matrix, $r \in \mathbb{R}^n$, $A \in \mathbb{R}^{p \times n}$, and $b \in \mathbb{R}^p$, is known as a quadratic optimization problem.
The Lagrangean $L$ is

$L(x, u) = \frac{1}{2} x'Qx - r'x + u'(Ax - b) = \frac{1}{2} x'Qx + (u'A - r')x - u'b,$

and the dual function is $g(u) = \inf_{x \in \mathbb{R}^n} L(x, u)$ subject to $u \ge 0_p$. Since $x$ is unconstrained in the definition of $g$, the minimum is attained when we have the equalities

$\frac{\partial}{\partial x_i}\left(\frac{1}{2} x'Qx + (u'A - r')x - u'b\right) = 0$

for $1 \le i \le n$, which amount to $x = Q^{-1}(r - A'u)$. Thus, the dual optimization function is $g(u) = -\frac{1}{2} u'Pu - u'd - \frac{1}{2} r'Q^{-1}r$ subject to $u \ge 0_p$, where $P = AQ^{-1}A'$ and $d = b - AQ^{-1}r$. This shows that the dual problem of this quadratic optimization problem is itself a quadratic optimization problem.
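Quadratic problems of this form can be solved in R with the quadprog package, whose solve.QP function minimizes (1/2) x'Dx - d'x subject to A'x >= b0; a sketch (our addition) on a small two-variable instance:

library(quadprog)
Q <- diag(2)                      # positive definite matrix
r <- c(1, 1)
A <- matrix(c(1, 0), nrow = 1)    # single constraint: x1 >= 2
b <- 2
sol <- solve.QP(Dmat = Q, dvec = r, Amat = t(A), bvec = b)
sol$solution                      # primal solution x (here (2, 1))
sol$Lagrangian                    # multiplier u of the active constraint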
Theorem C.6 (The Weak Duality Theorem) Let $x_0$ be a solution of the primal problem and let $(u, v)$ be a solution of the dual problem. We have $g(u, v) \le f(x_0)$.

Proof We have

$g(u, v) = \inf\{f(x) + u'c(x) + v'd(x) \mid x \in S\} \le f(x_0) + u'c(x_0) + v'd(x_0) \le f(x_0),$

because $u \ge 0$, $c(x_0) \le 0_m$, and $d(x_0) = 0_p$, which yields the desired inequality. □

Corollary C.7 For the functions involved in the primal and dual problems, we have

$\sup\{g(u, v) \mid u \ge 0_m\} \le \inf\{f(x) \mid x \in S, c(x) \le 0_m\}.$

Proof This inequality follows immediately from the proof of Theorem C.6. □

Corollary C.8 If $f(x^*) \le g(u, v)$, where $u \ge 0_m$ and $c(x^*) \le 0_m$, then $x^*$ is a solution of the primal problem and $(u, v)$ is a solution of the dual problem.
Furthermore, if $\sup\{g(u, v) \mid u \ge 0_m\} = \infty$, then there is no solution of the primal problem.

Proof These statements are an immediate consequence of Corollary C.7. □
Example C.9 Consider the primal problem

minimize $x_1^2 + x_2^2$, where $(x_1, x_2) \in \mathbb{R}^2$,
subject to $x_1 - 1 \le 0$.

It is clear that the minimum of $f(x)$ is obtained for $x_1 = 1$ and $x_2 = 0$ and this minimum is 1. The Lagrangean is

$L(x, u_1) = x_1^2 + x_2^2 + u_1(x_1 - 1)$

and the dual function is

$g(u_1) = \inf\{x_1^2 + x_2^2 + u_1(x_1 - 1) \mid x \in \mathbb{R}^2\} = -\frac{u_1^2}{4}.$

Then, $\sup\{g(u_1) \mid u_1 \ge 0\} = 0$, and a gap exists between the minimal value of the primal function and the maximal value of the dual function.

The possible gap that exists between $\inf\{f(x) \mid x \in S, c(x) \le 0_m\}$ and $\sup\{g(u, v) \mid u \ge 0_m\}$ is known as the duality gap.
A stronger result holds if certain conditions involving the restrictions are satisfied:
Theorem C.10 (Strong Duality Theorem) Let $C$ be a non-empty convex subset of $\mathbb{R}^n$, let $f : \mathbb{R}^n \to \mathbb{R}$ and $c : \mathbb{R}^n \to \mathbb{R}^m$ be convex functions, and let $d : \mathbb{R}^n \to \mathbb{R}^p$ be given by $d(x) = Ax - b$, where $A \in \mathbb{R}^{p \times n}$ and $b \in \mathbb{R}^p$.
Consider the primal problem

minimize $f(x)$, where $x \in C$,
subject to $c(x) \le 0_m$
and $d(x) = 0_p$,

and its dual

maximize $g(u, v)$, where $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^p$,
subject to $u \ge 0_m$.

Suppose that there exists $z \in C$ such that $c(z) < 0_m$ and $d(z) = 0_p$; additionally, $0_p \in I(d(C))$. We have:

$\sup\{g(u, v) \mid u \ge 0_m\} = \inf\{f(x) \mid x \in C, c(x) \le 0_m, d(x) = 0_p\}. \qquad (10)$

Moreover, if $\inf\{f(x) \mid x \in C, c(x) \le 0_m, d(x) = 0_p\}$ is finite, then there exist $u_1, v_1$ with $u_1 \ge 0_m$ such that $g(u_1, v_1) = \sup\{g(u, v) \mid u \ge 0_m\}$; if $f(\bar{x}) = \inf\{f(x) \mid x \in C, c(x) \le 0_m, d(x) = 0_p\}$ (which means that the infimum is achieved at $\bar{x}$), then $u_1' c(\bar{x}) = 0$.

If $L$ is the Lagrangean of the primal optimization problem

minimize $f(x)$, where $x \in S$,
subject to $c(x) \le 0_m$
and $d(x) = 0_p$,

then a saddle point is a triplet $(x^*, u^*, v^*)$ with $x^* \in S$ and $u^* \ge 0$ such that

$L(x^*, u, v) \le L(x^*, u^*, v^*) \le L(x, u^*, v^*)$

for every $x \in S$ and $u \ge 0$.


The duality gap disappears, and then, a saddle point occurs for the primal
problem, as stated by the next theorem.
Theorem C.11 The triplet ðx ; u ; v Þ is a saddle point of the Lagrangean of the
primal problem if and only if its components x and u ; v are solutions of the
primal and dual problems, respectively, and there is no duality gap, that is,
f ðx Þ ¼ gðu ; v Þ.

References

Abu-Mostafa YS, Magdon-Ismail M, Lin HT (2012) Learning from data. AMLBook.com
Anderson E (1936) The species problem in iris. Ann Mo Bot Gard 23:457–509
Bishop CM (2007) Pattern recognition and machine learning. Springer, New York
Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK (1989) Learnability and the Vapnik–Chervonenkis dimension. J ACM 36(4):929–965
Breiman L, Friedman JH, Olshen RO, Stone CS (1998) Classification and regression trees.
Chapman and Hall, Boca Raton (reprint edition)
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

Cristianini N, Shawe-Taylor J (2000) Support vector machines and other kernel-based learning
methods. Cambridge University Press, Cambridge
Dasgupta S (2011) Two faces of active learning. Theoret Comput Sci 412:1767–1781
Drucker H, Burges CJC, Kaufman L, Smola AJ, Vapnik V (1996) Support vector regression
machines. In: Advances in neural information processing systems 9, NIPS, Denver, CO, USA,
2–5 Dec 1996, pp 155–161
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics
7:179–188
Freund Y, Shapire RE (1999) Large margin classification using the perceptron algorithm. Mach
Learn 37:277–296
Günther F, Fritsch S (2010) Neuralnet: training of neural networks. R J 2(1):30–38
Karatzoglou A, Smola A, Hornik K, Zeileis A (2004) kernlab—an S4 package for kernel methods in R. J Stat Softw 11:1–20
Karatzoglu A, Meyer DM, Hornik K (2006) Support vector machines in R. J Stat Softw 15:1–28
Kung SY (2014) Kernel methods and machine learning. Cambridge University Press, Cambridge
Lander J (2014) R for everyone. Addison-Wesley, Upper Saddle River
Lantz B (2013) Machine learning with R. PACKT Publishing, Birmingham
Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: Proceedings of
the 17th annual international ACM SIGIR conference on research and development in
information retrieval, SIGIR’94, pp 3–12. Springer-Verlag New York, Inc, New York
Maindonald J, Braun J (2004) Data analysis and graphics using R—an example-based approach.
Cambridge University Press, Cambridge
Matloff N (2011) The art of R programming—a tour of statistical software design. No starch press,
San Francisco
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophy 5:115–133
Mitchell TM (1997) Machine learning. McGraw-Hill, Boston
Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations of machine learning. MIT Press,
Cambridge
Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press, Cambridge
Novikoff ABJ (1962) On convergence proofs on perceptrons. In: Proceedings of the symposium
on mathematical theory of automata 12:615–622
Pitts W, McCulloch WS (1947) How we know universals—the perception of auditory and visual
forms. Bull Math Biophys 9:127–147
Quinlan JR (1993) C 4.5 programs for machine learning. Morgan Kaufmann Publ., San Mateo
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and
organization in the brain. Psychol Rev 65:386–407
Scheffer T, Decomain C, Wrobel S (2001) Active hidden Markov models for information
extraction. In: Advances in intelligent data analysis, 4th international conference, IDA 2001.
Cascais, Portugal, Sept 13–15, 2001. Proceedings, pp 309–318
Schohn G, Cohn D (2000) Less is more: active learning with support vector machines. In:
Proceedings of the seventeenth international conference on machine learning (ICML 2000),
Stanford University, Stanford, CA, June 29–July 2, 2000, pp 839–846
Schütze H, Velipasaoglu E, Pedersen JO (2006) Performance thresholding in practical text
classification. In: Proceedings of the 2006 ACM CIKM international conference on
information and knowledge management, Arlington, 6–11 Nov 2006, pp 662–671
Settles B (2012) Active learning. Morgan and Claypool
Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning. Cambridge University
Press, Cambridge
Shao Y, Cen Y (2014) Data mining applications with R. Academic Press, San Diego
Shawe-Taylor J, Cristianini N (2005) Kernel methods for pattern analysis. Cambridge University
Press, Cambridge
Simovici DA, Djeraba C (2014) Mathematical tools for data mining, 2nd edn. Springer, London

Statnikov A, Aliferis CF, Hardin DP, Guyon I (2011) A gentle introduction to support vector
machines in biomedicine. World Scientific, Singapore
Steinwart I, Christman A (2008) Support vector machines. Springer, Berlin
Suykens JAK, van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2005) Least squares
support vector machines. World Scientific, New Jersey
Suykens JAK, Vandewalle J (1999) Least squares support vector machine classifiers. Neural
Process Lett 9(3):293–300
Velipasaoglu E, Schütze H, Pedersen JO (2007) Improving active learning recall via disjunctive boolean constraints. In: SIGIR 2007: proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, Amsterdam, The Netherlands, July 23–27, 2007, pp 893–894
Wickham H (2009) ggplot2—Elegant graphics for data analysis. Springer, Dordrecht
Witten IH, Frank E, Hall MA (2011) Data mining—practical machine learning tools and
techniques, 3rd edn. Elsevier (Morgan Kaufmann), Amsterdam
Zaki MJ, Meira WM (2014) Data mining and analysis. Cambridge University Press, Cambridge
Zhao Y (2013) R and data mining—example and case studies. Academic Press, San Diego
On Meta-heuristics in Optimization
and Data Analysis. Application
to Geosciences

Henri Luchian, Mihaela Elena Breaban and Andrei Bautu

Abstract This chapter presents popular meta-heuristics inspired from nature, focusing on evolutionary computation (EC). The first section, as an elevator pitch, briefly walks through problem solving, touching upon notions such as optimization problems, meta-heuristics, constraint handling, hybridization, and the No Free Lunch Theorem for optimization, and also giving very short introductions into several of the most popular meta-heuristics. The next two sections are dedicated to evolutionary algorithms and swarm intelligence (SI), two of the main areas of EC. Three particular optimization methods illustrating these two areas are presented in more detail: genetic algorithms (GAs), differential evolution (DE), and particle swarm optimization (PSO). For a better understanding of these algorithms, references to R packages implementing the algorithms and code samples to solve numerical and combinatorial problems are given. The fourth section is dedicated to the use of EC techniques in data analysis. Optimization of the hyper-parameters of conventional machine learning techniques is illustrated by a case study. The last section reviews applications of meta-heuristics in geosciences.


Keywords Meta-heuristics Numerical and combinatorial optimization Genetic 
 
algorithms Differential evolution Particle swarm optimization Hyper-param- 

eters optimization Problems in geosciences

H. Luchian · M.E. Breaban (✉)
Faculty of Computer Science, Alexandru Ioan Cuza University of Iasi, Iasi, Romania
e-mail: pmihaela@infoiasi.ro

A. Bautu
Faculty of Navigation and Naval Management, Romanian Naval Academy, Constanta, Romania


1 A Painless Introduction

A particular characteristic of problem solving becomes evident if computers are


used for searching solutions to problems. Namely, when asked to solve a given
problem, one is simultaneously, if implicitly, asked to solve the meta-problem of
finding the best method to solve the problem. Best may refer to saving resources (most often, time) in the process of finding a solution; it may also point to the required accuracy/precision of the solution, to the set of instances of the problem which must be solved, or to a threshold for positive/negative errors, etc. In many
cases, simply finding a method which can successfully look for a solution to the
given problem is not sufficient; the method should comply with requirements such
as those enumerated above, and moreover, it should do this in the best possible
way. Therefore, irrespective of what best means, in order to deal with the com-
panion meta-problem, one needs to be acquainted with a comprehensive set of
methods for solving problems: the larger the set of methods one chooses from, the
better the proposed method should be.
This may be the reason why, along with the ever increasing use of computers for
solving problems, a wealth of new approaches to problem solving has been
proposed.

1.1 Briefly, on Problems and Methods to Solve Them

How many problem-solving methods does one need to master? Indeed, many
new methods for solving problems were invented (some may say discovered) lately.
As opposed to exact deterministic algorithms, many of these new methods are weak
methods; a weak method is not rigidly related to one specific problem, but rather it
can be applied for solving various problems. At times, one or another such prob-
lem-solving technique appears to be most fashionable. To an outsider, genetic
algorithms (GAs), artificial neural networks, particle swarm optimization, and support vector machines, to name just a few, seemed to successively take by storm the proscenium over the last decades. Is each new method better than the previous ones and, consequently, is the choice of the method to solve one's specific problem a matter of keeping pace with fashion? Is there one particular method that, among all existing methods, solves best all problems?
would mean that we actually have a free lunch when trying to solve a given
problem: we could spare the time needed to identify the best method for finding
solutions to the problem. However, a theorem proven in 1995 by Wolpert and
McReady (1997), called the No Free Lunch Theorem for optimization, shows that
the answer to both questions above is negative. Informally (and leaving aside
details and nuances of the theorem), the NFLTO states that, averaging over all problems, all solving methods have the same performance, no matter what indicator
of performance is used. Obviously, the common average is obtained from various

sets of values of the performance indicators for each method and various levels of
each method’s performance when applied to each specific problem. This means that
in general, two different methods perform at their respective best on different
problems, and consequently, each of them has a poorer performance on remaining
problems. It follows that there is no problem-solving method which is the “best”
method to solve all problems (indeed, if a method M would have equally good performances on all problems, then this would be M's average performance; then, any method with scattered values of the performance indicator would outperform M on some problems). Therefore, for each problem-solving method, there is a subset of all problems for which it is the best solving method; in some cases, the subset may consist of only one problem or even zero problems. Conversely, given a
problem to be solved, one has to find a particular method that works best for that
problem which proves that the meta-problem mentioned above is non-trivial.
Actually, it may be a very difficult problem; similar to the way some problem-
solving methods are widely used even if they are not guaranteed to provide the
exact solution, an approximate but acceptably good solution to the meta-problem
may be useful.
Optimization problems There is an informal conjecture stating that, whatever we are doing, we optimize something; or, as Clerc put it in (2006), iterative optimization is as old as life itself. While each of these two statements may be the subject
of subtle philosophical debates, it is true that many problems can be stated as optimization problems. Finding the average of n real numbers is an optimization problem (find the number a which minimizes the sum of the squared differences between a and each of the given numbers); the same goes for decision-making problems, for machine learning ones, and many others.

An optimization problem asks to find—if it exists—an extreme value (either


minimum or maximum) of a given function. Finding the required solution is, in
fact, a search process performed in the space of all candidate solutions; this is why
the terms optimization method and search method are sometimes loosely used as
synonyms, although the term optimization refers to the values of the function, while
search (through the set of candidate solutions) usually points to values of the
variables of the respective function. Several simple taxonomies of optimization
problems are useful when studying meta-heuristics: optimization of functions of
continuous variable/discrete variable; optimization with/without constraints; opti-
mization with a fixed/moving optimum; single objective/multiple objective opti-
mization. Here are some examples:
• constraint optimization raises the critical problem of handling constraints;
• continuous/discrete variables point to specific meta-heuristics that originally specialize in one of the two types of optimization (e.g., GAs for discrete variables; differential evolution (DE) for continuous variables);
• self-adapting meta-heuristics are recommended for solving problems with a moving optimum;
• particular variants of existing meta-heuristics have been defined for multi-objective optimization (e.g., in DE).
Meta-heuristics, described below, are seen as optimization methods (i.e.,
methods for solving optimization problems). While meta-heuristics can also be used
for solving, for example, complex-system-design problems or machine learning
problems, such problems can also be stated as optimization problems.
Meta-heuristics Any problem-solving method belongs to one of three categories: exact deterministic methods, approximate deterministic methods, and non-deterministic methods. This chapter is concerned with the second and third categories, which flourished over the last few decades.

A heuristic is a problem-solving method which is able to find approximate solutions to the given problem either in a (significantly) shorter time than an exact algorithm or simply when no exact algorithm can find a solution. Approximate solutions may be acceptable in various situations; Simon (1969) argues that humans tend to satisfice (use an acceptable approximate solution obtained reasonably quickly) when it comes to complex situations/domains.
Meta-heuristic is a relatively recent term, introduced by Glover in 1986. Various
definitions and taxonomies of meta-heuristics were subsequently proposed; the
works mentioned below discuss these in detail. It is generally accepted that meta-
heuristics are problem-independent high-level strategies which guide the process of
finding (approximate) solutions to given problems. However, problem-independent
methods (also called weak methods) may well be fine-tuned by incorporating in the
search procedure some problem-specific knowledge; an early paper on this is
(Grefenstette 1987).
Among several existing taxonomies of meta-heuristics, the most interesting one for our discussion is the classification concerned with the number of current solutions. A trajectory or single-point meta-heuristic works with only one current solution; the current solution is iteratively subject to conditional change. Local search meta-heuristics, such as Tabu Search, Iterated Local Search, and Variable Neighborhood Search (Blum and Roli 2003), fall into this category. A population-based meta-heuristic iteratively changes a set of candidate solutions collectively called a population; genetic algorithms (GAs) and particle swarm optimization, among others, belong in this category.
This section briefly discusses two trajectory-based methods: iterated hill
climbing and simulated annealing.
Hill climbing Hill climbing is a weak optimization heuristic: In order to apply it to a given problem, the only properties required are that the function to be optimized takes on values which can always be compared against each other (a totally ordered set of values, such as the real numbers or the natural numbers) and that it allows for a step-by-step improvement of candidate solutions (i.e., the problem is not akin to finding the needle in the haystack). Hill climbing does not use any other properties of the function to be optimized and does not organize the search for the optimum following a tree structure (or any other structure); therefore, it requires little computer memory. Hill climbing starts with an initial candidate solution and iteratively aims at improving the current candidate solution by replacing it with any (or the best) neighbor solution which is better than the current one; when no further improvement is possible, the search stops. The neighborhood can be considered either in the set over which the function is defined (a neighbor can be obtained through a slight modification of a number which is a component of the candidate solution) or in the set of computer representations of candidate solutions (a neighbor is reached there by flipping one bit).

While the procedure sketched above is very effective for any mono-modal function (informally, a function whose graph has only one hilltop), it may get stuck in local optima if the function is multi-modal. In the latter case, the graph of the function will also have a second-highest hill, a third-highest one, etc.; one run of the hill-climbing procedure whose initial solution lies on the slope of the second-highest hill will find the second-highest hilltop (a local optimum), but it will then get stuck there, since no improvement is possible anymore in the neighborhood. This is why, for multi-modal functions, iterated hill climbing is used instead of one-iteration hill climbing: The method is applied several times in a row, with different initial candidate solutions, thus increasing the chance that one run of the method will start at the foot of the hill which contains the global optimum.
Simulated Annealing The problem described above (optimization methods getting stuck in local optima) was actually impairing potential advances in optimization methods. A breakthrough was the Metropolis algorithm (Metropolis et al. 1953). The new idea was to occasionally allow candidate solutions which are worse than the current one to replace the current solution. This is compatible with the hill-climbing metaphor: Indeed, when one wanders through a hilly landscape aiming at reaching the top of the highest hill, one may occasionally have to climb down a hill in order to reach a higher one.

The idea of expanding the exploration capabilities of the optimization method at the expense of the quality of the current solution proved to be very productive. Nevertheless, a better idea is to also keep under some kind of control the ratio between the number of steps when the current solution is actually improved and the number of steps when the current solution is worsened. This is where simulated annealing comes into play. Living beings have not been the only inspiration for problem-solving researchers; processes of the non-living world are also a rich source of metaphors and simulations in problem solving. One celebrated example is annealing: Cooled gradually, a metal can gain most desirable physical properties (e.g., ductility and flexibility), while sudden cooling of a metal hardens it. Kirkpatrick et al. (1983) proposed a simulation of annealing which uses a parameter (the temperature) for controlling the improvement/worsening ratio mentioned above: The lower the temperature, the fewer steps which worsen the current solution are allowed. Analogously to what happens in the physical–chemical process of annealing, the temperature starts at a (relatively) high value and decreases at each iteration of the current-solution-changing process. Simulated annealing has been successfully applied to solve many discrete and continuous optimization problems, including optimal design.
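Base R offers a simulated annealing variant through optim with method "SANN"; the sketch below is ours and applies it to the Six Hump Camel Back function defined later in Sect. 1.4, with an illustrative starting point and cooling budget.

# Simulated annealing via base R's optim(method = "SANN").
# The objective is the Six Hump Camel Back function of Sect. 1.4.
sixhump <- function(x)
  (4 - 2.1*x[1]^2 + x[1]^4/3)*x[1]^2 + x[1]*x[2] + (-4 + 4*x[2]^2)*x[2]^2

set.seed(1)   # the method is non-deterministic; fix the seed to reproduce
res <- optim(par = c(1, 1), fn = sixhump, method = "SANN",
             control = list(maxit = 20000, temp = 10))
res$par      # location found by the annealing run
res$value    # objective value there; the global minimum is about -1.0316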
The rest of this chapter and Chapter “Genetic Programming Techniques with
Applications in the Oil and Gas Industry” present several population-based meta-
heuristics: GAs and genetic programing, DE, and particle swarm optimization. We
briefly introduce each of them in the following paragraphs. Four topics of special interest for the meta-heuristics under discussion are then briefly touched upon.
Many more meta-heuristics have been proposed, and new ones continue to appear. Monographs and surveys on meta-heuristics such as Glover (1986), Talbi (2009), and Voß (2001) give comprehensive insights into the topic. The International Journal of Meta-heuristics publishes both theoretical and application papers on methods including: neighborhood search algorithms, evolutionary algorithms, ant systems, particle swarms, variable neighborhood search, artificial neural networks, and artificial immune systems. Those interested in approaches to solving the meta-problem above may wish to read about hyper-heuristics, a term coined by Burke; a survey is provided in Burke et al. (2013).

1.2 What Will the Rest of This Chapter and the Next
One Elaborate On?

We introduce briefly the main topics of the two chapters.


Genetic Algorithms Ingo Rechenberg, a professor with the Technical University of Berlin and a parent of evolution strategies, made a statement which supports the use of evolutionary techniques for problem solving: "Natural evolution is, or comprises, a very efficient optimization process, which, by simulation, can lead to solving difficult optimization problems" (Rechenberg 1973). The statement is empirically supported by many successful applications of evolutionary techniques for solving various optimization problems. The field of evolutionary computing now includes various techniques; the pioneering ones have been the GAs (Holland 1975), evolutionary programing (Fogel et al. 1966), and the evolution strategies (Rechenberg 1973; Schwefel 1993). Excellent textbooks on GAs are widely used: Michalewicz (1992), Mitchell (1996), or, more generally on evolutionary computing, De Jong (2006).

As the title of the groundbreaking book by Holland suggests, adaptation has been the core idea that led to GAs; self-adapting techniques have become ever more popular since. Trying to reach the optimum starting from initial guesses as candidate solutions, such techniques self-adapt their search using properties of the search space of (the instance of) the problem.
GAs simulate a few basic factors of natural evolution: mutation, crossover, and
selection. The implementation of each of these simulated factors involves gener-
ating random numbers: like all evolutionary methods, GAs are non-deterministic.
Adaptation, which is instrumental in natural evolution, is simulated by calculating
values of a function (the environment) and, on this basis, making candidate solu-
tions compete for survival for the next generation. The evolution of the population
of solutions can be seen as a learning process where candidate solutions learn
collectively.
More sophisticated variants of GAs simulate further factors of natural evolution, such as the integrated evolution of two species [coevolution: the host–parasite model (Hillis 1990)].
One particular feature of GAs is that the whole computation process takes place in two dual spaces: the space of candidate solutions to the given problem, where the evaluation and the subsequent selection for survival take place (the phenotype), and the space of the representations of such solutions, where genetic operators such as mutation and crossover are applied (the genotype). This characteristic is also borrowed from natural evolution, where the genetic code and the actual being evolved from that code are instantiations of the two-space paradigm: In natural evolution, the genetic code is altered through mutations and through crossover between parents; subsequently, the being evolved from the genetic code is evaluated with respect to its adaptation to the environment.
The genetic code in GAs is actually the way candidate solutions are represented in the computer. The standard GA (Michalewicz 1992) works with chromosomes (representations of candidate solutions) which are strings of bits. When applied to solve real-world problems, GAs evolved toward sophisticated representations of candidate solutions, including varying-length chromosomes and multi-dimensional chromosomes. One particular representation of candidate solutions has been groundbreaking: trees, from graph theory.
Genetic Programing emerged as a distinct area of GAs. In his seminal book (Koza 1992), Koza uses a particular definition for the solution to a problem: A solution is a computer program which solves the problem. Adding to this the idea that such computer programs can be developed automatically, in particular through genetic programing, a flourishing field of research and applications emerged. As Poli et al. put it, genetic programing automatically solves problems without requiring the user to know or specify the form or structure of the solution in advance (Poli et al. 2008).

A tree can be seen as representing a calculation, in particular, a computer program. In genetic programing, computer programs evolve in an automated manner through self-adaptation of a population of trees, each tree representing a candidate program. Evaluation of candidate solutions is carried out using a set of instances of the problem to be solved for which the actual solutions are known beforehand. Specific operators have been introduced to cope with peculiarities of the evolution of trees as abstract representations.
Spectacular results have been obtained using genetic programing, including patentable inventions.
Differential Evolution Since 1996, when it was publicly proposed by Price and Storn (1997), DE has become a popular optimization technique. It is a population-based method designed for minimizing multi-dimensional real-valued functions through vector processing; the function need not be continuous (let alone differentiable) and, even if it is differentiable, no information on the gradient is used.

DE follows the general steps of an evolutionary scheme: initialization, application of specific operators (see the one described below), evaluation, and selection; this sequence is iterated, from the second step onward, until a halting condition is met. The basic operation in DE is to add the weighted difference of two vectors in the population to a third one. Thus, the candidate solutions learn from each other; the computation is a self-adapting process.
From its early days, DE proved to be a powerful optimization technique: It won
the general-purpose algorithms competition in the First International Contest on
Evolutionary Optimization, 1996 (at the IEEE International Conference on Evo-
lutionary Computation). As was the case with other evolutionary techniques, DE
evolved to incorporate new elements such as elitism or coevolution. Pareto-based
approaches have been proposed for tackling multiple objective optimization
problems using DE (Madavan 2002).
Particle Swarm Optimization Collective intelligence (Nguyen and Kowalczyk 2012) is a rich source of inspiration for designing meta-heuristics through simulation. Particularly successful among such meta-heuristics are Ant Colony Optimization (Dorigo and Stützle 2004) and Particle Swarm Optimization.

The seminal paper for the latter meta-heuristic is Kennedy and Eberhart (1995); a textbook dedicated to PSO is Clerc (2006). Bird flocking or fish schooling can be considered the inspiring metaphors from nature. The core idea is that, at each iteration, each particle (candidate solution) moves through the search space according to a (linear) combination of the particle's current move, of its best previous personal position, and of the best previous position of its neighbors (what neighbors means is a parameter of the procedure). This powerful combination of a backtracking flavor (somehow keeping track of the previous personal best) and collective learning (partly aiming at the regional/global previous best) makes PSO well suited for optimization problems with a moving optimum.
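A minimal global-best PSO sketch in R may make this combination concrete; the sketch is ours, and the inertia weight w and acceleration coefficients c1 and c2 below are commonly used illustrative values, not parameters mandated by the text.

# A minimal sketch of global-best particle swarm optimization (minimization).
pso_minimize <- function(f, lower, upper, n = 20, iters = 100,
                         w = 0.7, c1 = 1.5, c2 = 1.5) {
  d <- length(lower)
  x <- t(replicate(n, runif(d, lower, upper)))  # particle positions
  v <- matrix(0, n, d)                          # particle velocities
  pbest <- x                                    # best personal positions
  pval  <- apply(x, 1, f)                       # best personal values
  g <- pbest[which.min(pval), ]                 # best position of the swarm
  for (t in 1:iters) {
    gmat <- matrix(g, n, d, byrow = TRUE)
    # combine the current move, personal memory, and collective knowledge
    v <- w*v + c1*runif(n*d)*(pbest - x) + c2*runif(n*d)*(gmat - x)
    x <- x + v                                  # bound handling omitted
    val <- apply(x, 1, f)
    better <- val < pval
    pbest[better, ] <- x[better, ]
    pval[better] <- val[better]
    g <- pbest[which.min(pval), ]
  }
  list(par = g, value = min(pval))
}

Called, for instance, as pso_minimize(SixHumpV, c(-3, -2), c(3, 2)) with the function defined in Sect. 1.4, the swarm typically contracts around one of the two global minima.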

1.3 Short Comments on Four Transversal Issues

Parameter Control A key element for the successful design of any meta-heuristic is a proper adjustment of its parameters. It suffices to think of the number of candidate GAs one has to select from when designing a GA for a given problem: Mutation rates and crossover rates can, at least theoretically, take on any value between 0 and 1; there are tens of choices for the population size; the selection procedure can be any of at least ten popular ones (new ones can be invented), etc. This makes the search space for properly designing a GA for a given problem contain at least hundreds of thousands of candidate GAs; of these, only a few will probably have a good performance, and finding these among all possible GAs for that problem is a non-trivial task.

In the design phase of a meta-heuristic, parameters can be set by hand or automatically. For example, for GAs, a supervisor GA has been proposed (Grefenstette 1986) which can be used for off-line improvement of the parameters of a given GA, such as the population size and the mutation and crossover rates.

If one chooses to have dynamic parameter values during the run of the algorithm, this can be done automatically, for example, by automatically checking whether or not the best-so-far solution has changed during a given number of iterations.
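As a toy illustration of such a dynamic adjustment (entirely our own; the stagnation window and scaling factor are arbitrary assumptions), a mutation rate could be increased whenever the best-so-far fitness stops improving:

# Increase the mutation rate pm when the best-so-far fitness has not
# improved over the last `patience` iterations (illustrative rule only).
adapt_mutation <- function(pm, best_history, patience = 10,
                           factor = 1.5, pm_max = 0.5) {
  n <- length(best_history)
  if (n > patience && best_history[n] == best_history[n - patience])
    pm <- min(pm * factor, pm_max)   # stagnation detected: explore more
  pm
}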
Constraint Handling When the problem to be solved belongs to the constraint optimization class, a major concern along the iterative solution-improving process is that of preserving the feasibility of candidate solutions, i.e., keeping only solutions which satisfy all the constraints. The way a feasible solution is obtained in the first place is beyond the scope of this paragraph; this may happen, for example, by applying a heuristic which ends up with a feasible but, very likely, non-optimal solution. Subsequently, the iterative solution-improvement process successively changes the current solution; every such change may turn a current solution which is feasible into one which is not. When unfeasible solutions (candidate solutions which do not satisfy the problem constraints) are obtained, the optimization method should address this.

There are three main ways of tackling unfeasible solutions. A first approach is to penalize unfeasible solutions and otherwise let them continue to be part of the search process. In this way, an unfeasible solution becomes even less competitive than it actually is with respect to the search-for-the-optimum process (see fitness function in the GAs section of this chapter). A second approach is to repair the new solution in case it is unfeasible (repairing means changing the solution in such a way that it becomes feasible); the fact that repairing may have the same complexity as the original problem makes this approach the least recommendable. The best approach seems to be that of including the constraints (or at least some of them) into the representation of solutions. This idea is convincingly illustrated for numerical problems in Michalewicz (1992), where bit string representations are used: Any bit string is decoded into a feasible solution. This approach has the decisive advantage that there is no need to check whether or not candidate solutions obtained from existing ones are feasible.
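As a small illustration of a decoder of this kind (our own sketch, handling only the budget constraint of the portfolio problem introduced below in Sect. 1.4), every bit string can be mapped to a feasible selection by skipping projects that would exceed the budget:

# Decode any bit string into a budget-feasible portfolio: a project is
# kept only if adding its cost does not exceed the budget b.
decode_feasible <- function(bits, cost, b) {
  x <- integer(length(bits)); spent <- 0
  for (i in seq_along(bits)) {
    if (bits[i] == 1 && spent + cost[i] <= b) {
      x[i] <- 1
      spent <- spent + cost[i]
    }
  }
  x
}
decode_feasible(c(1, 1, 1, 0, 1, 1),
                cost = c(250, 350, 100, 200, 300, 100), b = 1000)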
Including the problem constraints into the codification of candidate solutions is actually a form of hybridisation with the problem, which is mentioned in the next paragraph.
Hybridisation According to one of the definitions in Blum and Roli (2003), a basic
idea of meta-heuristics in general is to combine two or more heuristics in one
problem-tailored procedure and use it as a specific meta-heuristic. Hybridisation has
even more subtle aspects. Hybridisation happens when inserting an element from one
meta-heuristic into another meta-heuristic (e.g., using crossover, a defining operator
for GA, in an evolution strategy which, in its standard form, uses only mutations).
Another form of hybridisation could be called hybridisation with the problem:
Problem-specific properties can be used for defining particular operators in a meta-
heuristic. An example can be found in Michalewicz (1992): For the transportation
problem, a feasible solution remains feasible after applying on it a certain transfor-
mation; this transformation is then used to define the mutation operator. An example
of hybridisation is illustrated in this book, in the chapter on genetic programing.

Hybridisation is recommended, in general, for improving the problem-solving method. This could be called intended hybridisation, and it has proven its beneficial effects in countless successful applications.
There also exists an unintended hybridisation, which one should be aware of. For example, when trying to optimize a Royal Road function (Mitchell et al. 1992), the search on a large plateau (while a substring of 8 bits does not yet contain only 1s) is akin to a blind search, even though we run a GA for solving the problem. Indeed, the probability distribution constructed for selection assigns equal probabilities over the whole plateau; consequently, the selection is not biased toward solutions closer to the optimum but is rather a random selection. This way, the GA designed to solve the Royal Road problem is (unwillingly) hybridised with random search, which takes over temporarily while walking the plateau.
Experiments Non-deterministic methods are used in a way which differs from that of deterministic ones. The latter will always provide the same output for a given input, while the former may give different results when run repeatedly with the same input. This behavior leads to the need to assess the quality of a non-deterministic algorithm by running it repeatedly with the same input. Various statistics can be used: usually, the average of the respective best solutions and their standard deviation over a number of runs. Therefore, the proper use of non-deterministic methods requires at least basic knowledge of probability theory and statistics, in particular experiment design. Testing statistical hypotheses gives substance to the study of the performance of (non-deterministic) meta-heuristics.
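The sketch below (ours, with a toy random search standing in for any non-deterministic optimizer) shows the typical workflow: repeat the runs, then summarize and test the resulting sample, much as is done for a GA later in this chapter.

# Assessing a non-deterministic method over 30 independent runs.
set.seed(7)
one_run <- function() min(runif(1000, -1, 1)^2)  # best value in one run
bests <- replicate(30, one_run())
c(mean = mean(bests), sd = sd(bests))  # usual summary statistics
t.test(bests)                          # confidence interval for the mean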

1.4 Going into Practice: Two Running Examples

In order to illustrate the optimization process conducted by the methods described in this chapter, two optimization problems are formulated here. Sample code in R (including the output) invoking the algorithms under consideration is listed in the next sections in an attempt to familiarize the reader with some available, easy-to-use software.
The first optimization problem, known as Six Hump Camel Back, is commonly used as a benchmark function to assess the performance of optimization algorithms, to which its complex multi-modal landscape poses serious difficulties. It is formulated as a minimization problem over two continuous variables. The problem is defined as follows:

$$\text{Minimize } f(x_1, x_2) = \left(4 - 2.1 x_1^2 + x_1^4/3\right) x_1^2 + x_1 x_2 + \left(-4 + 4 x_2^2\right) x_2^2 \quad (1)$$
$$\text{where } -3 \le x_1 \le 3, \quad -2 \le x_2 \le 2.$$

The landscape of the function is illustrated in Fig. 1 with the aid of perspective and contour plots in R.

As is visible in the plots, the function has six local minima and two global minima. The two global minima lie at locations $(x_1, x_2) = (-0.0898, 0.7126)$ and $(x_1, x_2) = (0.0898, -0.7126)$; the value attained at these locations is $f(x_1, x_2) = -1.0316$.

The R code defining the Six Hump Camel Back function is shown below.
The R code defining the Six Hump Camel Back function is shown below.

> SixHump <- function (x1, x2)
{
  (4-2.1*x1^2+x1^4/3)*x1^2+x1*x2+(-4+4*x2^2)*x2^2
}

An equivalent function can be implemented in R using a vector as its argument. This formulation is more appropriate for our goals because this general form, which does not impose restrictions on the size of the input, can be called by the other R routines implementing the meta-heuristics presented in this chapter.

> SixHumpV <- function (x)
{
  (4-2.1*x[1]^2+x[1]^4/3)*x[1]^2+x[1]*x[2]+(-4+4*x[2]^2)*x[2]^2
}

We also illustrate the use of meta-heuristics on a constrained optimization problem with discrete variables, frequently arising in the oil and gas industry: portfolio selection. While this problem may be found under various formulations, we tackle here the variant presented in Shakhsi-Niaei et al. (2013). Given a firm with a budget b and n projects, with the net value of the ith project denoted by $f_i$ and the cost of the ith project denoted by $c_i$, one must find the combination of projects that maximizes the total utility for the firm, as computed in Eq. 2:

$$\text{Maximize } z = \sum_{i=1}^{n} f_i x_i, \quad (2)$$
Fig. 1 Perspective and contour plots for Six Hump Camel Back: a for the entire domain of definition, $x_1 \in [-3, 3]$, $x_2 \in [-2, 2]$; b restricted to $x_1 \in [-1.9, 1.9]$, $x_2 \in [-1.1, 1.1]$. The two global optima are illustrated as blue triangles at locations $(x_1, x_2) = (-0.0898, 0.7126)$ and $(x_1, x_2) = (0.0898, -0.7126)$
$$\text{Subject to: } \sum_{i=1}^{n} c_i x_i \le b, \quad (3)$$

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, n, \quad (4)$$

$$x_1 + x_2 \le 1, \quad (5)$$

$$x_5 + x_3 \le 1, \quad (6)$$

$$x_5 + x_3 + x_4 \le 2. \quad (7)$$

The variables $x_i$ represent the decision to select project i ($x_i = 0$ means the project is not selected, whereas $x_i = 1$ means the project gets selected for implementation), the constraint expressed by Eq. 4. The total budget of the firm must not be exceeded by the total cost of the selected projects (Eq. 3). Other constraints may be imposed on the problem (especially in a real-world context): Eq. 5 expresses the condition that at most one of projects 1 and 2 may be implemented; Eq. 6 expresses the condition that at most one of projects 3 and 5 may be implemented; Eq. 7 expresses the condition that at most 2 out of the 3 projects 3, 4, and 5 may get implemented.

2 Evolutionary Algorithms

Evolutionary algorithms (EAs) are simplified computational models of the evolutionary processes that occur in nature. They are search methods implementing principles of natural selection and genetics. Parts of this section follow closely the text in Breaban (2011).

2.1 Terminology

Evolutionary algorithms use a vocabulary borrowed from genetics. They simulate the evolution, across a sequence of generations (iterations of an iterative process), of a population (set) of candidate solutions. A candidate solution is internally represented as a string of genes and is called a chromosome or individual. The position of a gene in a chromosome is called its locus, and all the possible values for the gene form the set of alleles of the respective gene. The internal representation (encoding) of a candidate solution in an evolutionary algorithm forms the genotype; this is the information processed by the evolutionary algorithm. Each chromosome corresponds to a candidate solution in the search space of the problem, which represents its phenotype. A decoding function is necessary to translate the genotype into the phenotype. If the search space is finite, it is desirable that this function satisfy the bijection property, in order to avoid redundancy in chromosome encoding (which would slow down the convergence) and to ensure the coverage of the entire search space.
The population maintained by an evolutionary algorithm evolves with the aid of genetic operators that simulate the fundamental elements in genetics: Mutation consists in a random perturbation of a gene, while crossover aims at exchanging genetic information among several chromosomes. The chromosome subjected to a genetic operator is called parent, and the resulting chromosome is called offspring.

A process called selection, involving some degree of randomness, selects the individuals that breed and create offspring, mainly based on individual merit. The individual merit is measured using a fitness function which quantifies how fitted the candidate solution encoded by the chromosome is for the problem being solved. The fitness function is formulated based on the mathematical function to be optimized. The solution returned by an evolutionary algorithm is usually the fittest chromosome in the last generation.
2.2 Directions in Evolutionary Algorithms

First efforts to develop computational models of evolutionary systems date back to the 1950s (Bremermann 1958; Fraser 1957). Several distinct interpretations, which are widely used nowadays, were independently developed later. The main differences between these classes of evolutionary algorithms consist in solution encoding, operator implementation, and selection schemes.

Evolutionary programing crystallized in 1963 in the USA at San Diego University, when Fogel (1966) generated simple programs as simple finite-state machines; this technique was developed further by his son David Fogel. A random mutation operator was applied on state transition diagrams, and the best chromosome was selected for survival.
Evolution strategies (ES) were introduced in the 1960s, when Hans-Paul Schwefel and Ingo Rechenberg, working on a shape-optimization problem from mechanics, designed a new optimization technique because the existing mathematical methods were unable to provide a solution. The first ES algorithm was proposed by Schwefel in 1965 and developed further by Rechenberg (1973). Their method was designed to solve optimization problems with continuous variables; it used one candidate solution and applied random mutations followed by the selection of the fittest. ES were later strongly promoted by Back (1996), who incorporated the idea of a population of solutions.
GAs were developed by John Henry Holland in 1973, after years of studying the idea of simulating natural evolution. These algorithms model genetic inheritance and the Darwinian competition for survival. GAs are described in more detail in Sect. 2.3.
Genetic programing is a specialized form of GA. The specialization consists in manipulating a very specific type of encoding and, consequently, in using modified versions of the genetic operators. GP was introduced by Koza in 1992 in an attempt to perform automatic programing. GP directly manipulates phenotypes, which are computer programs (hierarchical structures) expressed as trees. It is currently intensively used to solve symbolic regression problems. Genetic programing and one important variation, gene expression programing, are described in Chapter “Genetic Programming Techniques with Applications in the Oil and Gas Industry” of this book.
DE (Storn and Price 1997) is a more recent class of evolutionary algorithms
whose operators are specifically designed for numerical optimization. DE is
described in detail in Sect. 2.4.
An in-depth analysis under a unified view of these distinct directions in evo-
lutionary algorithms is presented in De Jong (2006).
Fig. 2 A generic genetic algorithm

t := 0
Initialize P0
Evaluate P0
while halting condition not met do
  t := t + 1
  select Pt from Pt−1
  apply crossover and mutation in Pt
  evaluate Pt
end while

2.3 Genetic Algorithms

GAs (Holland 1998) are the best-known and the most intensively used class of evolutionary algorithms.

A GA performs a multi-dimensional search by means of a population of candidate solutions which exchange information and evolve during an iterative process. The process is illustrated by the pseudo-code in Fig. 2.
In order to solve a problem with a GA, one must define the following elements:
• an encoding for candidate solutions (the genotype);
• an initialization procedure to generate the initial population of candidate
solutions;
• a fitness function which defines the environment and measures the quality of the
candidate solutions;
• a selection scheme;
• genetic operators (mutation and crossover);
• numerical parameters.
The encoding is considered to be the main factor determining the success or failure of a GA.

The standard encoding in GAs consists in binary strings of fixed length. The main advantage of this encoding is the existence of a theoretical model (the Schema theorem) explaining the search process until convergence. Another advantage, shown by Holland, is the high implicit parallelism in the GA. A widely used extension of the binary encoding is gray coding.

Unfortunately, for many problems this encoding is not a natural one, and it is difficult to adapt. However, GAs themselves evolved, and the encoding extended to strings of integers and real numbers, permutations, trees, and multi-dimensional structures. Decoding the chromosome into a candidate solution to the problem sometimes necessitates problem-specific heuristics.
Important factors that need to be analyzed with regard to the encoding are the size of the search space induced by a representation and the coverage of the phenotype space: whether the phenotype space is entirely covered and/or reachable, whether the mapping from genotype to phenotype is injective or “degenerate,” and whether particular (groups of) phenotypes are over-represented (Radcliffe et al. 1995). Also, the “heritability” and “locality” of the representation under crossover and mutation need to be studied (Raidl and Gottlieb 2005).
The initialization of the population is usually performed randomly. There exist
approaches which make use of greedy strategies to construct some initial good
solutions or other specific methods depending on the problem.
The fitness function is constructed based on the mathematical function to be
optimized. For more complex problems, the fitness function may involve very
complex computations and increase the intrinsic polynomial complexity of the GA.
Several probabilistic procedures based on the fitness distribution in the population can be used to select the individuals that survive into the next generation and produce offspring; this phase of the algorithm is known as selection for variation. All these procedures encourage to some degree the survival of the fittest individuals, while still allowing worse-adapted individuals to survive and contribute local information (short-length substrings) to the structure of the optimal solution. The most essential feature which differentiates them is the selection pressure: the degree to which the better individuals are favored; the higher the selection pressure, the more the better individuals are favored. The selection pressure has a great impact on the diversity in the population and, consequently, on the convergence of GAs. If the selection pressure is too high, the algorithm suffers from insufficient exploration of the search space and premature convergence occurs, resulting in sub-optimal solutions. On the contrary, if the selection pressure is too low, the algorithm will take an unnecessarily long time to reach the optimal solution. Various selection schemes were proposed and studied from this perspective. They can be grouped into two classes: proportionate-based selection and ordinal-based selection. Proportionate-based selection takes into account the absolute values of the fitness; the best-known procedures in this class are roulette wheel (Holland 1975) and stochastic universal sampling (Baker 1987).
Because of its wide use and popularity among all GA flavors, we describe roulette wheel selection in the following. Each individual is assigned a probability of being selected proportional to its fitness value; the sum of these probabilities over all the individuals in a generation is 1. Let $f_i$ be the fitness of the ith individual of the current population; then the probability $p_i$ of this individual being selected is

$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j},$$

where N is the number of individuals in the population (see, for a simple example, Fig. 3, which assumes a population of 5 individuals).

Fig. 3 Fitness values in a population of 5 individuals. The bottom row contains the fitness values of the individuals. Their associated probabilities are the labels of the circular sectors

On each application of the selection scheme, a random number $r \in [0, 1)$ is generated, and the individual i whose cumulative selection probability is the first to reach r is selected to survive to the next generation:

$$i = \min\left\{k \in \{1, \ldots, N\} \;\middle|\; \sum_{j=1}^{k} p_j \ge r\right\}.$$
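A direct R transcription of this scheme could look as follows (our own sketch; it assumes strictly positive fitness values):

# Roulette wheel selection: returns the index of the selected individual.
roulette_select <- function(fitness) {
  p <- fitness / sum(fitness)    # selection probabilities p_i
  r <- runif(1)                  # random number in [0, 1)
  min(which(cumsum(p) >= r))     # first index whose cumulative sum reaches r
}

set.seed(42)
fitness <- c(4, 1, 3, 7, 5)      # a population of 5 individuals
table(replicate(10000, roulette_select(fitness)))  # counts ~ proportional

Base R's sample() with its prob argument draws from the same distribution and is the idiomatic shortcut in practice.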

Ordinal-based selection takes into account only the relative order of individuals according to their fitness values. The most used procedures of this kind are linear ranking selection (Baker 1985) and tournament selection (Goldberg 1989).
New individuals are created in the population with the aid of two genetic operators: crossover and mutation. The classical crossover operator aims at exchanging genetic material between two chromosomes in two steps: A locus is chosen randomly to play the role of a cut point and splits each of the two chromosomes into two segments; then, two new chromosomes are generated by merging the first segment from the first chromosome with the second segment from the second chromosome, and vice versa. This operator is called one-point crossover in the literature and is illustrated in Fig. 4. Generalizations exist to two or more cut points. Uniform crossover builds the offspring sequentially, copying at each locus the allele randomly chosen from one of the two parents.
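Both operators admit compact R sketches on bit strings (ours, for illustration):

# One-point crossover: swap the tails of two parents at a random cut point.
one_point_crossover <- function(p1, p2) {
  cut <- sample(1:(length(p1) - 1), 1)
  list(c(p1[1:cut], p2[(cut + 1):length(p2)]),
       c(p2[1:cut], p1[(cut + 1):length(p1)]))
}

# Uniform crossover: at each locus, copy the allele of a random parent.
uniform_crossover <- function(p1, p2) {
  mask <- runif(length(p1)) < 0.5
  ifelse(mask, p1, p2)
}

one_point_crossover(c(1, 1, 1, 1, 1), c(0, 0, 0, 0, 0))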
Various constraints imposed by real-world problems led to various encodings for candidate solutions; these problem-specific encodings subsequently necessitate the redefinition of crossover. Thus, algebraic operators are employed for numerical optimization with real encoding, and an impressive number of papers focused on permutation-based encodings, proposing various operators and performing comparative studies. It is now a common procedure to wrap a problem-specific heuristic within the crossover operator [in Ionita et al. (2006), the authors propose new operators for constraint satisfaction; Luchian et al. (1994) present new operators in the context of clustering]. Crossover in GAs stands at the moment for any procedure which combines the information encoded within two or several chromosomes to create new and, hopefully, better individuals.
Fig. 4 Crossover operators in bit string GA

Mutation is a unary operator designed to introduce variability in the population. In the case of binary GAs, the mutation operator modifies each gene (from 0 to 1 or from 1 to 0) with a given probability. As in the case of crossover, mutation takes various forms depending on the problem and the encoding used (see Fig. 5 for examples of how mutation works for different chromosome representations).
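For a binary chromosome, the operator reduces to a few lines of R (our sketch; the mutation probability pm is illustrative):

# Bit-flip mutation: each gene is flipped independently with probability pm.
mutate <- function(chrom, pm = 0.05) {
  flip <- runif(length(chrom)) < pm
  ifelse(flip, 1 - chrom, chrom)
}
mutate(c(0, 1, 1, 0, 1, 0, 0, 1), pm = 0.2)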
When designing a GA, decisions have to be made with regard to several parameters: population size, crossover and mutation rates, and a halting criterion. Apart from some general considerations (e.g., a high mutation rate in the first iterations, decreasing during the run, combined with a complementary evolution for crossover), finding the optimal parameter values comes down to empiricism rather than abstract studies.
In the following, we illustrate the search process conducted by a GA using the
package called “GA” (Scrucca 2013) in R to minimize the Six Hump Camel Back
function, previously defined in Sect. 1.4.

Fig. 5 The behavior of the mutation operator for different encodings


Because this is a problem with a continuous bi-dimensional search space, a real encoding and arithmetical operators are a natural choice. Moreover, empirical studies have reported that these settings obtain better performance than binary encoding with standard operators in the case of numerical optimization problems. The initialization scheme consists in randomly generating points (candidate solutions, chromosomes) in the bi-dimensional search space defined by the problem. We further have to define the fitness function that should be used to measure the quality of the chromosomes. Naturally, this is based on the objective function of our problem, but requires a minimal modification: GAs necessitate a fitness function designed for maximization (the higher the fitness value, the better the candidate solution for our problem). Because the problem we tackle is defined for minimization, low values of our objective function (previously defined in R as SixHumpV) correspond to better solutions, while high values correspond to worse ones. Therefore, we need to build a new function playing the role of the fitness in the GA, simply by multiplying our objective function by (−1):

> SixHumpMax <- function(x)
+ {
+   -SixHumpV(x)
+ }

The lines of code below call the ga function to execute a GA which maximizes
our newly defined function with a population of 20 chromosomes using real
encoding and arithmetic operators for 50 iterations:

> library("GA")
> GA.sols <- ga(type = "real-valued", fitness = SixHumpMax,
+ min = c(-3, -2), max = c(3, 2), maxiter=50, popSize=20)
Iter = 1 | Mean = -20.10513 | Best = 0.3900806
Iter = 2 | Mean = -8.679598 | Best = 0.3900806
Iter = 3 | Mean = -1.909435 | Best = 0.3900806
Iter = 4 | Mean = -0.7739577 | Best = 0.521566
Iter = 5 | Mean = -0.4207289 | Best = 0.521566
...
Iter = 50 | Mean = 0.9275536 | Best = 1.020383

During its execution, the ga function prints at each iteration the mean of the
fitness in population and the best fitness value. To show the final results, we call the
summary function:
72 H. Luchian et al.

> summary(GA.sols)
+-----------------------------------+
| Genetic Algorithm |
+-----------------------------------+

GA settings:
Type = real-valued
Population size = 20
Number of generations = 50
Elitism = 1
Crossover probability = 0.8
Mutation probability = 0.1
Search domain
x1 x2
Min -3 -2
Max 3 2

GA results:
Iterations = 50
Fitness function value = 1.020383
Solution =
x1 x2
[1,] -0.1262185 0.6870156

The best solution obtained over 50 iterations corresponds to SixHump(−0.1262185, 0.6870156) = −1.020383. The evolution of the best value of the objective function in the population during the run is illustrated in Fig. 6.

Fig. 6 The evolution of the best objective value in one run of the GA
Fig. 7 The evolution of the population in GA during one run of the algorithm: the distribution of the candidate solutions at iterations 1, 2, 5, 10, and 15

Figure 7 illustrates the distribution of the individuals in the population during one run of the GA, at iterations 1, 2, 5, 10, and 50. The GA shows very quick convergence toward the regions containing the global minima. The evolution of the fitness for the run illustrated here shows that the GA is able to locate the promising area of the search space in only a few iterations, due to its good exploration abilities. However, by comparing the final solution to the minimum of the objective function (−1.020383 vs. −1.0316), we may conclude that in this run the GA is deficient at exploitation: Even though it was very close to the global optima, starting at iteration number 17, the algorithm stopped improving the best solution achieved so far.

By illustrating only one run of the GA, a general conclusion on its convergence cannot be drawn, due to the stochastic nature of the algorithm. To study its performance, 30 runs are performed with the same settings and, for each run, the objective value corresponding to the returned solution is collected. In this manner, we obtain a sample of 30 values with mean −1.030361 (closer to the optimum than the particular run reported previously) and standard deviation 0.0037 (which indicates that the algorithm is stable, returning each time solutions very close to the optimum). The confidence interval for the mean supports these conclusions: The mean of the objective values returned by the GA is less than −1.028960 (we are interested in the minimum) with probability 0.95. In the code below, "fitness" is a vector with the 30 objective values returned in the 30 runs:
> t.test(fitness)
One Sample t-test

data: fitness
t = -1503.688, df = 29, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-1.031762 -1.028960
sample estimates:
mean of x
-1.030361

Although the reported results are satisfactory, GAs are usually enhanced in practice by hybridizing them with local search algorithms.

With a standard binary encoding, GAs are the most appropriate candidates when attempting to solve the portfolio selection problem by means of meta-heuristics. In order to illustrate such an approach, we consider the problem defined in Sect. 1.4 with the following instantiation: the number of projects n = 6, the budget of the firm b = 1000, and the costs and utilities of the projects as in Table 1. An optimal solution to this problem involves the selection of projects 1, 4, 5, and 6; it has total cost 850 and utility 1700.
One way to deal, within a GA, with the constraints imposed by the problem is to encourage the search in the feasible region of the search space by penalizing the unfeasible candidate solutions. Under this approach, any solution that violates a constraint gets a lower fitness. Identifying the most appropriate penalty scheme is, by itself, an optimization problem. The code below implements one possible fitness function for our problem:

> portfolio <- function(x){
+   cost <- c(250,350,100,200,300,100)
+   utility <- c(500,400,150,300,600,300)
+   totalUtility <- sum(utility*x)
+   totalCost <- sum(cost*x)
+   penalty <- 0
+   if (totalCost > 1000)
+     penalty <- totalCost    # penalty for exceeding the budget
+   p <- sum(cost)
+   if (x[1]+x[2] > 1) penalty <- penalty+p        # violating constraint (5)
+   if (x[3]+x[5] > 1) penalty <- penalty+p        # violating constraint (6)
+   if (x[3]+x[4]+x[5] > 2) penalty <- penalty+p   # violating constraint (7)
+   totalUtility - penalty
+ }
Table 1 Cost and utility of projects

Project   1     2     3     4     5     6
Cost      250   350   100   200   300   100
Utility   500   400   150   300   600   300

A GA with binary encoding is called to solve this problem instance:

> GA <- ga(type = "binary", fitness = portfolio, nBits = 6,
+          maxiter = 50, popSize = 10)
Iter = 1 | Mean = 270 | Best = 1300
Iter = 2 | Mean = 850 | Best = 1400
Iter = 3 | Mean = 1225 | Best = 1700
Iter = 4 | Mean = 1160 | Best = 1700
...
Iter = 49 | Mean = 832.5 | Best = 1700

> summary(GA)
+-----------------------------------+
| Genetic Algorithm |
+-----------------------------------+

GA settings:
Type = binary
Population size = 20
Number of generations = 50
Elitism = 1
Crossover probability = 0.8
Mutation probability = 0.1

GA results:
Iterations = 50
Fitness function value = 1700
Solution =
x1 x2 x3 x4 x5 x6
[1,] 1 0 0 1 1 1

2.4 Differential Evolution

Adhering by design to the area of evolutionary algorithms, but targeting in particular the field of numerical optimization, a method called DE was developed by Ken Price and Rainer Storn during 1994–1996 (Storn and Price 1997).
Fig. 8 The differential evolution algorithm

1. t := 0
2. Initialize population P0 = {x1(0), x2(0), ..., xm(0)} of size m
3. Evaluate P0
4. while halting condition not met do
5.   t := t + 1
6.   for i = 1 to m do
7.     yi = generateMutant(Pt−1)
8.     zi = crossover(xi(t−1), yi)
9.     Evaluate zi
10.    if zi is better than xi(t−1) then
11.      xi(t) = zi
12.    else
13.      xi(t) = xi(t−1)
14.    end if
15.  end for
16. end while

The results in their seminal paper show that DE outperforms GAs in numerical optimization, and this hypothesis was subsequently confirmed in competitions dedicated to real-valued function minimization.
DE makes use of the same terminology as GAs: A population of candidate solutions evolves by means of selection, mutation, and crossover. The differences occur at several levels: the encoding of the candidate solutions, the definition of the genetic operators, and the selection scheme.

Designed for numerical optimization, the internal encoding of the candidate solutions (the genotype) is identical to the phenotype: a string of real values that correspond to the decision variables defined by the problem.

The selection for variation is replaced in DE by a simple pass through the entire population: Each chromosome participates in the variation phase to create a new offspring by means of the genetic operators. However, DE implements selection at replacement: The offspring is introduced in the new population only if it is better than its parent with regard to the fitness function. The pseudo-code of the DE algorithm is illustrated in Fig. 8.
There are several versions of the mutation operator (line 7 of the algorithm). However, they all share a mechanism that is a distinctive feature of DE within the EA framework: The perturbation term is obtained as the difference between some randomly selected chromosomes. This perturbation mechanism, particular to DE, suggestively gives the name of the method. The general formula creating one mutant $y_i$ at time t is given below:

$$y_i = \lambda x_{\text{best}}^{(t-1)} + (1 - \lambda) x_{I_i}^{(t-1)} + \sum_{l=1}^{L} F_l \left( x_{J_{il}}^{(t-1)} - x_{K_{il}}^{(t-1)} \right) \quad (8)$$

where $\lambda$ is a numerical value in the range [0,1] controlling the influence of the best element in the current population, denoted $x_{\text{best}}^{(t-1)}$; $x_{I_i}^{(t-1)}$ is a chromosome from the current population, chosen at random ($I_i \in \{1, 2, \ldots, m\}$). $L \ge 1$ is an integer value specifying the number of pairs of chromosomes of the form $(x_{J_{il}}^{(t-1)}, x_{K_{il}}^{(t-1)})$ randomly chosen from the current population ($J_{il}, K_{il} \in \{1, 2, \ldots, m\}$, $J_{il} \ne K_{il}$) which are used in the perturbation mechanism. $F_l > 0$, $l = 1 \ldots L$, are scaling factors decisive for the influence of each difference.
Different settings of the numerical parameters $\lambda$ and L lead to distinct DE algorithms. In order to specify the DE variant in a concise manner, a simple notation was introduced, based on three variables: DE/a/L/c, where a depends on the value of $\lambda$, L is the number of vector differences used, and c is the type of crossover. The most popular versions of the DE algorithm are DE/best/1 and DE/rand/1 (for either type of crossover). Both versions correspond to the case when only one difference is used to compute the mutant. The first case corresponds to $\lambda = 1$, that is, to

$$y_i = x_{\text{best}}^{(t-1)} + F \left( x_{J_i}^{(t-1)} - x_{K_i}^{(t-1)} \right), \quad (9)$$

while the second case corresponds to $\lambda = 0$, that is, to

$$y_i = x_{I_i}^{(t-1)} + F \left( x_{J_i}^{(t-1)} - x_{K_i}^{(t-1)} \right). \quad (10)$$

It must be noted that the mutation mechanism described above does not alter the
current/selected chromosome xi . It is the role of crossover to build an offspring of
the current chromosome, by combining its genetic material with the one encoded by
the mutant chromosome. From this perspective, DE is not entirely compliant with
the general specifications of the two genetic operators.
Two versions of crossover are proposed in DE. The first one, called binomial crossover, is similar to the uniform crossover in GAs: It is a binary operator that mixes the components of the two chromosomes based on a given probability CR:

$$z_{i,d} = \begin{cases} y_{i,d} & \text{if } r_d < CR \text{ or } d = d_0 \\ x_{i,d} & \text{otherwise} \end{cases} \qquad d = 1 \ldots D, \quad (11)$$

where $r_d$ is a random number uniformly distributed in [0,1] and $d_0 \in [1, D]$ is a random position in the chromosome, guaranteeing that the offspring contains at least one element from the mutant. D denotes the dimensionality of the problem, i.e., the length of the string representing a chromosome.
The second variant of the crossover operator is called exponential crossover and can be expressed by the following formulation:

$$z_{i,d} = \begin{cases} y_{i,d} & \text{for } d \in H \\ x_{i,d} & \text{otherwise} \end{cases} \qquad (12)$$

where $H$ is a series of size at most $D$ of consecutive circular numbers in the range $1, 2, \ldots, D$, starting with a value $d_0$ and continuing with $(d_0 + 1) \bmod D$, $(d_0 + 2) \bmod D, \ldots, (d_0 + k) \bmod D$, where $a \bmod b$ denotes the modulus operator returning the remainder of the division of $a$ by $b$; $k$ is the first trial for which a random number uniformly generated in [0, 1] exceeds CR, so that the length of $H$ follows a truncated geometric distribution. For example, considering $d_0 = 6$ and $D = 10$, $H$ could be the series 6, 7, 8 or 6, 7, 8, 9, 10, 1, 2, depending on the parameter CR; these two examples clearly illustrate the similarity of the exponential crossover in DE with the 2-point crossover in GAs.
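The construction of the index series H can be sketched in R as follows (expIndices is a hypothetical helper written only for this illustration):

> expIndices <- function(D, CR = 0.9) {
+   H <- sample(1:D, 1)                   # random starting position d0
+   while (length(H) < D && runif(1) < CR)
+     H <- c(H, (tail(H, 1) %% D) + 1)    # circular successor of the last index
+   H
+ }
> expIndices(10)    # e.g. 6 7 8, or 6 7 8 9 10 1 2, depending on the random draws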
In both versions of the crossover operator, CR is a parameter deciding the influence of the mutant on the structure of the offspring. A theoretical analysis of the two crossover variants and of their influence on the sensitivity of DE to different values of CR is presented in Zaharie (2007).
An elitist replacement strategy guarantees survival of the fittest chromosome
among the parent and the offspring.
To simulate a run of the DE algorithm on our minimization problem, we use the R package DEoptim (Mullen et al. 2011).1 The following code calls the DEoptim function, which executes the DE/rand/1/bin algorithm (the variant implementing mutation based on a random candidate and one difference, and binomial crossover) to minimize the SixHump function with a population consisting of 20 candidate solutions over 50 iterations; with the trace parameter set to TRUE, the best candidate solution (its value for the objective function and its components) in each iteration is shown during the run:

> library("DEoptim")
> DE.sols <- DEoptim(SixHumpV, lower = c(-3, -2), upper = c(3, 2),
+ control = list(strategy = 1, NP=20, itermax=50, storepopfrom = 1,
+ trace = TRUE))
Iteration: 1 bestvalit: -0.343676 bestmemit: 0.424858 -0.515384
Iteration: 2 bestvalit: -0.343676 bestmemit: 0.424858 -0.515384
Iteration: 3 bestvalit: -0.343676 bestmemit: 0.424858 -0.515384
Iteration: 4 bestvalit: -0.722848 bestmemit: -0.090842 0.885970
Iteration: 5 bestvalit: -0.811161 bestmemit: 0.138414 0.742059
...

The performance of DE is highly dependent on the values of the numerical parameters. The authors of DE recommend setting CR to 0.9 and selecting F from the interval [0.5, 1.0]. The run illustrated here uses the default values in DEoptim: CR = 0.9 and F = 0.8.
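Assuming the F and CR arguments exposed by DEoptim.control, the recommended values can also be set explicitly, e.g.:

> DE.sols <- DEoptim(SixHumpV, lower = c(-3, -2), upper = c(3, 2),
+ control = DEoptim.control(strategy = 1, NP = 20, itermax = 50,
+ F = 0.8, CR = 0.9))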
The following lines of code list the best solution in the last iteration and output two plots: one representing the evolution of the best value of the objective function (the minimum) in the population and one representing the distribution of the candidate solutions during the run. The resulting plots are illustrated in Fig. 9.

1 The package can be freely downloaded from http://cran.r-project.org/web/packages/DEoptim/index.html.

Fig. 9 The evolution of the population in DE during one run of the algorithm: a the evolution of
the best fitness value in population and b the distribution of the candidate solutions (the genotype)

> DE.sols$optim
$bestmem
par1 par2
0.08984226 -0.71265649

$bestval
[1] -1.031628
...
> plot(DE.sols, plot.type = "bestvalit", col="red", pch=1)
> plot(DE.sols, plot.type = "storepop")

Figure 9 clearly illustrates the convergence toward the optimal solution in DE. In our run, the optimum is found after 31 iterations, as indicated by Fig. 9a. The diversity in the population decreases significantly during the run, according to Fig. 9b, which presents in two distinct plots the distribution of the values in each iteration for each parameter of the objective function. This plot indicates an interesting behavior: convergence toward two distinct regions in the search space.
In order to get more insight into the dynamics of the population within DE, Fig. 10 illustrates the candidate solutions in the population at distinct moments during the run, distributed over the contour plot illustrating the landscape of the objective function. The series (a) of plots shows the distribution of the candidate solutions at iterations 1, 5, 10, and 15. The series (b) offers a zoomed-in perspective of the landscape (restricted to $x_1 \in [-1.9, 1.9]$ and $x_2 \in [-1.1, 1.1]$) showing the distribution of the candidate solutions at iterations 15, 20, 30, and 50. In the first iteration of the algorithm, the population is spread at random in the search space. At iteration number 10 (Fig. 10a, third plot), groups of individuals have formed around local and global optima. Toward the end of our run, all the candidate solutions migrate to the regions corresponding to the two global optima.


Fig. 10 The evolution of the population in DE during one run of the algorithm: a the distribution
of the candidate solutions at iterations 1, 5, 10, and 15 and b a zoomed-in landscape showing the
distribution of the candidate solutions at iterations 15, 20, 30, and 50

The mean of the objective values over 30 runs is −1.031615, with a standard deviation of 3.74e−05.
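Such statistics can be reproduced along the following lines (a sketch; the exact figures will differ from run to run):

> best.vals <- replicate(30, DEoptim(SixHumpV, lower = c(-3, -2),
+ upper = c(3, 2), control = list(NP = 20, itermax = 50,
+ trace = FALSE))$optim$bestval)
> mean(best.vals); sd(best.vals)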

2.5 Extensions of EAs for Multi-modal and Multi-objective Problems

Variations have been brought to the classical EAs not only at the level of encoding and operators. In order to face the challenges imposed by real-world problems, modifications have also been made to the general scheme of the algorithms.
EAs are generally preferred to trajectory-based meta-heuristics (i.e., hill climbing, simulated annealing, Tabu Search) in multi-modal environments, mostly due to their increased exploration capabilities. However, a standard EA can still be trapped in a local optimum due to the premature attraction of the entire population into its basin of attraction. Therefore, the main concern of EAs for multi-modal optimization is to maintain diversity for a longer time in order to detect multiple (local) optima. To discover the global optima, the EA must be able to intensify the search in several promising regions and eventually encourage simultaneous convergence toward several local optima. This strategy is called niching: The algorithm forces the population to preserve subpopulations, each subpopulation corresponding to a niche in the search space, and different niches represent different (local) optimal regions.
Several strategies exist in the literature to introduce niching capabilities into
evolutionary algorithms. Deb and Goldberg (1989) propose fitness sharing: The
On Meta-heuristics in Optimization and Data Analysis. Application to Geosciences 81

fitness of each individual is modified by taking into account the number and fitness
of its closely ranged individuals. This strategy determine the number of individuals
in the attraction basin of an optimum to be dependent on the height of that peak.
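A minimal R sketch of this sharing mechanism, for a maximization problem, is given below; the niche radius sigma and the exponent alpha are assumptions chosen for the illustration:

> sharedFitness <- function(pop, fit, sigma = 0.5, alpha = 1) {
+   d <- as.matrix(dist(pop))                          # pairwise distances between individuals
+   sh <- ifelse(d < sigma, 1 - (d / sigma)^alpha, 0)  # sharing function
+   fit / rowSums(sh)                                  # raw fitness divided by the niche count
+ }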
Another widely used strategy is to arrange the candidate solutions into groups of individuals that can only interact among themselves. The island model evolves several populations of candidate solutions independently; after a number of generations, individuals migrate between neighboring islands (Whitley et al. 1998).
There are techniques which divide the population based on the distances between individuals (the so-called radii-based multi-modal search GAs). Genetic chromodynamics (Dumitrescu 2000) introduces a set of restrictions with regard to the way selection is applied or the way recombination takes place. A merging operator is introduced, which merges very similar individuals after perturbation takes place. In Stoean et al. (2010), the best successive local individuals are conserved, while sub-populations are topologically separated.
De Jong introduced a new scheme of inserting the descendants into the population, called the crowding method (De Jong 1975). To preserve diversity, the offspring replace only similar individuals in the population.
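A simplified sketch in the spirit of crowding follows (the original method replaces the most similar of a random subsample of the population; the fitness test used here is a common variant):

> crowdingReplace <- function(pop, fit, child, childFit) {
+   d <- apply(pop, 1, function(x) sqrt(sum((x - child)^2)))  # distance to the offspring
+   j <- which.min(d)                     # most similar individual
+   if (childFit > fit[j]) {              # maximization: the fitter one survives
+     pop[j, ] <- child
+     fit[j] <- childFit
+   }
+   list(pop = pop, fit = fit)
+ }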
A field of intensive research within the evolutionary computation (EC) community is multi-objective optimization. Most real-world problems require the optimization of several, often conflicting, objectives. Population-based optimization methods offer an elegant and very efficient approach to this kind of problem: With small modifications of the basic algorithmic scheme, they are able to offer an approximation of the Pareto optimal solution set. While moving from one Pareto solution to another, there is always a certain amount of sacrifice in one objective(s) to achieve a certain amount of gain in the other(s). Pareto optimal solution sets are often preferred to single solutions in practice, because the trade-off between objectives can be analyzed and optimal decisions can be made for the specific problem instance.
Zitzler et al. (2000) formulate three goals to be achieved by multi-objective search algorithms:
• the Pareto solution set should be as close as possible to the true Pareto front;
• the Pareto solution set should be uniformly distributed and diverse over the Pareto front in order to provide the decision maker with a true picture of the trade-offs;
• the set of solutions should capture the whole spectrum of the Pareto front; this requires investigating solutions at the extreme ends of the objective function space.
GAs have been the most popular heuristic approach to multi-objective design and optimization problems, mostly because of their ability to simultaneously search different regions of a solution space and find a diverse set of solutions. The crossover operator may exploit structures of good solutions with respect to different objectives to create new non-dominated solutions in unexplored parts of the Pareto front. In addition, most multi-objective GAs do not require the user to prioritize, scale, or weigh objectives. There are many variations of multi-objective GAs in the literature and several comparative studies. As in multi-modal environments, the main concern in multi-objective GA optimization is to maintain diversity throughout the search in order to cover the whole Pareto front. Konak et al. (2006) provide a survey of the best-known multi-objective GAs, describing common techniques used in multi-objective GAs to attain the three above-mentioned goals.

3 Swarm Intelligence

Swarm intelligence (SI) is a computational paradigm inspired by the collective behavior of self-organized decentralized systems. It stipulates that problem solving can emerge at the level of a collection of agents which are not aware of the problem itself, but whose collective interactions lead to the solution. SI systems are typically made up of a population of simple autonomous agents interacting locally with one another and with their environment. Although there is no centralized control, the local interactions between agents lead to the emergence of global behavior. Examples of such systems can be found in nature, including ant colonies, bird flocking, animal herding, bacteria molding, and fish schooling.
The most successful SI techniques are ant colony optimization (ACO) and
particle swarm optimization (PSO). In ACO (Dorigo and Stützle 2004), artificial
ants build solutions walking in the graph of the problem and (simulating real ants)
leaving artificial pheromone so that other ants will be able to build better solutions.
ACO was successfully applied to an impressive number of optimization problems.
PSO is an optimization method initially designed for continuous optimization;
however, it was further adapted to solve various combinatorial problems. PSO is
presented in more detail in the next section.

3.1 Particle Swarm Optimization

The PSO model was introduced in 1995 by Kennedy and Eberhart (1995); it was discovered through the simulation of simplified social models such as fish schooling or bird flocking. It was originally conceived as a method for the optimization of continuous nonlinear functions. Later studies showed that PSO can be successfully adapted to solve combinatorial problems.
The evolutionary cultural model proposed by Boyd and Richerson (1985) stands as the basic principle of PSO. According to this model, individuals of a society have two learning sources: individual learning and cultural transmission. Individual learning is efficient only in homogeneous environments: The patterns acquired through local interactions with the environment are generally applicable. For heterogeneous environments, social learning, the essential feature of cultural transmission, is necessary.
In line with the evolutionary cultural model, the PSO algorithm uses a set of
simple agents which collaborate in order to find solutions of a given optimization
problem.
In the PSO paradigm, the environment corresponds to the search space of the
optimization problem to be solved. A swarm of particles is placed in this environment. The location of each particle therefore corresponds to a candidate solution to the problem. A fitness function is formulated in accordance with the optimization criterion to measure the quality of each location. The particles move in their environment, collecting information on the quality of the solutions they visit, and share this information with the neighboring particles in the swarm. Each particle is endowed with memory to store the information gathered through individual interactions with the environment, thus simulating individual learning. The information acquired from neighboring particles corresponds to the social learning component. Eventually, the swarm is likely to move toward better locations of the search space, similar to a flock of birds that collectively forage for food.
Unlike GAs, in PSO there exist no evolution operators and no competition for survival; all particles survive and share information for the welfare of the swarm. The driving force is the emergent swarm intelligence, attained by sharing local information between particles in order to produce global knowledge. It is important to note that problem solving is a population-wide phenomenon, because a particle by itself is probably incapable of solving even simple problems (Poli et al. 2007).
Usually, the swarm is composed of particles that share the same structural and
behavioral features. Each particle is characterized by its current position in the
search space, its velocity, and one or more of its best positions in the past (usually,
only one position). Each particle uses the objective (fitness) function so that it can
find out how good its current status is. The particles use a communication channel
in order to exchange information with (some of) their peers. The topology of the
swarm’s social network is defined by the structure of the communication channel,
where cliques of interconnected particles form neighborhoods.
In the classical PSO algorithm, the position of a particle in the search space is
updated in each iteration depending on the position and velocity of the particle in
the previous iteration. The formulas used to update the particles and the procedures are inspired by and conceived for continuous spaces. Therefore, each particle is
represented by a vector x of length n indicating the position in the n-dimensional
search space and has a velocity vector v used to update the current position. The
velocity vector is computed following the rules:
• every particle tends to keep its current direction (an inertia term);
• every particle is attracted to the best position p it has achieved so far (imple-
ments the individual learning component);
• every particle is attracted to the best particle g in the neighborhood (implements
the social learning component).

The velocity vector is computed as a weighted sum of the three terms above. Two random multipliers $r_1, r_2$ are used to gain stochastic exploration capability, while $w, c_1, c_2$ are weights, usually empirically determined. The formulae used to update each of the individuals in the population at iteration $t$ are as follows:

$$v_i^t = w \cdot v_i^{t-1} + c_1 \cdot r_1 \cdot (p_i^{t-1} - x_i^{t-1}) + c_2 \cdot r_2 \cdot (g_i^{t-1} - x_i^{t-1}) \qquad (13a)$$

$$x_i^t = x_i^{t-1} + v_i^t \qquad (13b)$$

As a side effect of these changes, the velocity of the particle could enter a divergence process, throwing the particle farther and farther away from $p$. To prevent this behavior, Kennedy and Eberhart clamped the amplitude of the velocity to a maximum value, denoted by $v_{max}$:

$$v_i^t = \min(v_{max}, \max(-v_{max}, v_i^t)) \qquad (14)$$

Equation 13b generates a new position in the search space (corresponding to a candidate solution). It can be associated, to some extent, with the mutation operator in evolutionary programming. However, in PSO, this mutation is guided by the past experience of both the particle and other members of the swarm. In other words, “PSO performs mutation with a conscience” (De Jong 2006). Considering the best visited solutions stored in the personal memory of each individual as additional members of the population, PSO implements a weak form of selection (Angeline 1998).
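One complete update of a single particle (Eqs. 13a, 13b, and 14) can be sketched in R as follows; the weights w, c1, c2 and the bound vmax are illustrative assumptions, not prescribed values:

> psoStep <- function(x, v, p, g, w = 0.72, c1 = 1.49, c2 = 1.49, vmax = 1) {
+   r1 <- runif(length(x)); r2 <- runif(length(x))      # random multipliers
+   v <- w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)  # Eq. 13a
+   v <- pmin(vmax, pmax(-vmax, v))                     # Eq. 14: velocity clamping
+   list(x = x + v, v = v)                              # Eq. 13b: new position
+ }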
The shape of the search space is unknown; hence, there exists no known optimum combination of the two learning sources (i.e., individual learning and cultural transmission). The classical PSO algorithm compensates for this lack of information with random values for the learning factors $c_1 \cdot r_1$ and $c_2 \cdot r_2$, which change in each iteration in order to weigh the learning sources differently. The velocity change produced by each term depends on the distance between the compared positions (i.e., the particle will move faster if the distances are larger) and on the random learning factors. This allows PSO to simulate, during a single run, various search strategies.
The solution that the algorithm outputs at the end of the run is obtained from the
information stored in the memory of each particle after the last iteration is
completed.
The search for the optimal solution in PSO is described by the iterative procedure
in Fig. 11. The fitness function is denoted by f and is formulated for maximization.
Particle pi is chosen in the basic version of the algorithm to be the best position
in the problem space visited by particle i. However, the best position is not always
dependent only on the fitness function. Constraints can be applied in order to adapt
PSO to various problems, without slowing down the convergence of the algorithm.
In constrained nonlinear optimization, the particles store only feasible solutions and
ignore the infeasible ones (Hu and Eberhart 2002). In multi-objective optimization,
only the Pareto-dominant solutions are stored (Coello and Lechunga 2002; Hu and

Fig. 11 Basic PSO

1. t := 0
2. Initialize x_i^t, i = 1..n
3. Initialize v_i^t, i = 1..n
4. Store personal best p_i^t = x_i^t, i = 1..n
5. Find neighborhood best g_i^t = argmax_{y ∈ N(x_i^t)} f(y), i = 1..n
6. while halting condition not met do
7.   t := t + 1
8.   Update v_i^t, i = 1..n, using Eq. 13a
9.   Update x_i^t, i = 1..n, using Eq. 13b
10.  Update personal best p_i^t = argmax_{z ∈ {p_i^{t-1}, x_i^t}} f(z)
11.  Find neighborhood best g_i^t = argmax_{y ∈ N(x_i^t)} f(y)
12. end while

Eberhart 2002). In dynamic environments, particle p is reset to the current position


if a change in the environment is detected (Hu and Eberhart 2001).
The selection of particle g_i is performed in two steps: neighborhood selection followed by particle selection. The size of the neighborhood has a great impact on the convergence of the algorithm. It is generally accepted that a large neighborhood speeds up the convergence, while small neighborhoods prevent the algorithm from premature convergence. Various neighborhood topologies were investigated with regard to their impact on the performance of the algorithm (Kennedy 2002; Kennedy and Mendes 2003); however, as expected, there is no free lunch: Different topologies are appropriate for different problems.
A major problem investigated in the PSO literature is the premature convergence of the algorithm in multi-modal optimization. This problem has been addressed in several papers, and solutions include the addition of a queen particle (Clerc 1999), alternation of the neighborhood topology (Kennedy 1999), introduction of subpopulations (Løvbjerg et al. 2001), giving the particles a physical extension (Krink et al. 2002), alternation between phases of attraction and repulsion (Riget and Vesterstrøm 2002), giving different temporary search goals to groups of particles (Al-kazemi and Mohan 2002), giving particles quantum behavior (Sun et al. 2004), and the use of specific swarm-inspired operators (Breaban and Luchian 2005).
Another crucial problem is parameter control. The values and choices for some of these parameters may have a significant impact on the efficiency and reliability of the PSO. Several papers address this problem; in most of them, the values for the parameters are established through repeated experiments, but there also exist attempts to adjust them dynamically, using EC algorithms.
The role played by the inertia weight was compared to that of the temperature parameter in simulated annealing (Shi and Eberhart 1998). A large inertia weight facilitates a global search, while a small inertia weight facilitates a local search. The parameters c_1 and c_2 are generically called learning factors; because of their distinct roles, c_1 is named the cognitive parameter (it weights the information gathered by each individual) and c_2 the social parameter (it weights the cooperation between particles). Another parameter used in PSO is the maximum velocity, which determines the maximum change each particle can take during one iteration. This parameter is usually proportional to the size of the search domain.
One run of the PSO algorithm can be illustrated using the R package pso, which implements standard PSO as described in Bratton and Kennedy (2007):

> library(pso)
> PSO.sols <- psoptim(rep(NA,2),SixHumpV,lower=c(-3,-2),upper=c(3,2),
control=list( maxit=50, s=20, trace=1, REPORT=1))
S=20, K=3, p=0.1426, w0=0.7213, w1=0.7213, c.p=1.193, c.g=1.193
v.max=NA, d=7.211, vectorize=FALSE, hybrid=off
It 1: fitness=-0.3635
It 2: fitness=-0.8261
It 3: fitness=-0.8261
It 4: fitness=-0.8623
It 5: fitness=-0.9337
...

The final solution, obtained in 50 iterations with a population of 20 individuals, reaches the global optimum:

> show(PSO.sols)
$par
[1] 0.09041749 -0.71296641

$value
[1] -1.031627

The algorithm quickly reaches the global optimum, as shown in Fig. 12. Figure 13 illustrates the distribution of the individuals in the population during one run, at iterations 1, 2, 5, 10, 20, and 50.

3.2 PSO on Binary Domains

Although PSO was conceived for continuous optimization, efforts have been made to adapt the algorithm for solving a wide range of combinatorial and binary optimization problems. A short discussion of the binary version of PSO is presented in this section, following the presentation from Bautu (2010).
Kennedy and Eberhart (1997) introduced a first variant of binary PSO, combining the evolutionary cultural model with the reasoned action model. According to the latter, the action performed by an individual is the stochastic result of the intention to perform that action. The strength of the intention results from the interaction of the personal attitude and the social attitude on the matter (Hale et al. 2002).


Fig. 12 The evolution of the best value of the objective function for one run of PSO


Fig. 13 The evolution of the population in PSO during one run of the algorithm: the distribution of the candidate solutions at iterations 1, 2, 5, 10, 20, and 50

The PSO algorithm for real-valued optimization updates the positions of particles based on a function that depends (indirectly) on various personal and social factors. In the binary domain, the intention of a particle to move between the two allowed positions, 0 and 1, is modeled in a similar manner. The probability that the particle will move to position 1 is computed by:

$$P(p^t = 1) = f'(p^{t-1}, v^{t-1}, p_i^{t-1}, g_g^{t-1}) \qquad (15)$$

The individual learning factor and the social learning factor act as personal and social attitudes that help select one of the two binary options.
In particular, with respect to classical PSO, in binary PSO:
• the domain of particle positions in the context of binary optimization problems is $P = \{0, 1\}^n$;
• the cost function that describes the optimization problem is hence defined as $c : \{0, 1\}^n \to \mathbb{R}$;
• the position of a particle consists of the responses of the particle to the n binary queries of the problem. The position in the search space is updated during each iteration depending on its velocity.
Let $p^t \in P$ and $v^t \in \mathbb{R}$ denote the position and the velocity of a particle at iteration $t$. The update equation for the particle's position in binary PSO is as follows:

$$p = \begin{cases} 1, & \text{if } \phi_3 < (1 + \exp(-v))^{-1} \\ 0, & \text{otherwise} \end{cases} \qquad (16)$$

where $\phi_3$ is a random variable uniformly distributed in $[0, 1)$. It follows that a higher velocity induces a higher probability for the particle to choose 1. The update equation ensures that the particle stays within the search space domain; hence, no relocation procedure is required.
The velocity of the particle is updated using the same equation as in classical PSO. The semantics of each term in (13a) for binary PSO are special cases of their original meaning. For example, if the best position of the particle ($p_i^t$) is 1 and the current position ($p^t$) is 0, then $p_i^t - p^t = 1$. In this case, the second term in (13a) will increase the value of $v^t$; hence, the probability that the particle will choose 1 will also increase. Similarly, the velocity will decrease if $p_i^t = 0$ and $p^t = 1$. If the two positions are the same, the individual learning term will not change the velocity, in order to try to maintain the current choice. The same is true for the velocity updates produced by the social learning term. The position of the particle may change due to the stochastic nature of (16), even if the velocity does not change between iterations.
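The stochastic update of Eq. (16) amounts to squashing the velocity through a sigmoid and comparing the result with a uniform draw; a one-line R sketch (the function name is chosen only for illustration):

> binaryPosition <- function(v) as.integer(runif(length(v)) < 1 / (1 + exp(-v)))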
The complete PSO algorithm for binary optimization problems is presented in vector form in Fig. 14.
On Meta-heuristics in Optimization and Data Analysis. Application to Geosciences 89

Require: c - the objective function
Ensure: S - the position that encodes the best solution
1. t = 0
2. Initialize particle positions (p^t)
3. Initialize particle velocities (v^t)
4. Store particle best solutions (g_i^t = p^t)
5. while searching allowed do
6.   t = t + 1
7.   Update positions using Eq. (16)
8.   Find neighborhood best solutions with neighborhood operator N (g_g^t = argmin_{x ∈ {b_i^t | N}} c(x))
9.   Update velocity using Eq. (13a)
10.  Limit velocity using Eq. (14)
11. end while
12. Retrieve the solution.
13. return S

Fig. 14 The particle swarm optimization algorithm for binary optimization

Other PSO variants can also be successfully used on binary spaces. In Wang et al. (2008), the authors propose that the outcome of the binary queries be established randomly based on the position of the particle within a continuous space. Khanesar et al. (2009) present a variation of the binary PSO in which the particle toggles its binary position with a probability depending on its velocity.

4 Integrating Meta-heuristics with Conventional Methods in Data Analysis: A Practical Example

Meta-heuristics stand as a basis for the design of efficient algorithms for various data analysis tasks. Such approaches are either extensions of conventional techniques, obtained as hybridizations with meta-heuristics, or evolved as new self-contained data analysis methods.
There is a large variety of approaches for data clustering based on GAs (Breaban et al. 2012; Hruschka et al. 2009; Luchian et al. 1994), DE (Zaharie 2005), PSO (Breaban and Luchian 2011; Rana et al. 2011), and ACO (Shelokar et al. 2004). Learning Classifier Systems (Lanzi et al. 2000) are one of the major families of techniques that apply EC to machine learning; these systems evolve a set of condition–action rules able to solve classification problems. Decision trees (Turney 1995) and support vector machines (Stoean et al. 2009, 2011) are also evolved with GAs. The representative application example of EAs in regression analysis is the use of genetic programming for symbolic regression, a topic covered in detail in Chapter “Genetic Programming Techniques with Applications in the Oil and Gas Industry” of this book. Many algorithms based on meta-heuristics tackle feature selection and feature extraction.

We restrict the discussion in this section to one particular application: optimization of the parameters of machine learning algorithms used in data analysis. The performance of several machine learning algorithms depends heavily on some parameters involved in their design; such parameters are often called meta-parameters or hyper-parameters. The problem of choosing the best settings for these parameters is also known as model selection.
Examples vary from simple algorithms, such as k-nearest neighbors where k is such a hyper-parameter, to more complex algorithms. In the case of artificial neural networks, the structure of the network (the number of hidden layers, the number of neurons in each layer, the activation function) has a high impact on the accuracy of the results in classification or regression analysis; the degree of complexity of the network is a critical factor in the trade-off between overfitting the model to the training data and underfitting, and the right balance can be achieved only with extensive experiments. In the definition of support vector machines (SVMs), two numerical parameters play important roles: a constant C, called the regularization parameter, and a constant ε, corresponding to the width of the ε-insensitive zone, influence the number of support vectors used in the model, controlling the trade-off between two goals: fitting the training set well and avoiding overfitting; parameters characterizing various kernel functions are also involved.
We illustrate here a simple model selection scheme by means of EAs for regression analysis. A small dataset called “rock,” included in R, is used for this purpose. It consists of 48 rock samples from a petroleum reservoir, characterized by the area of pores, total perimeter of pores, shape, and permeability.

> show(rock)
area peri shape perm
1 4990 2791.900 0.0903296 6.3
2 7002 3892.600 0.1486220 6.3
3 7558 3930.660 0.1833120 6.3
4 7352 3869.320 0.1170630 6.3
...
48 9718 1485.580 0.2004470 580.0

We illustrate regression analysis by training a support vector machine to learn a


model able to predict permeability. The quality of the regression model is usually
measured by the mean squared error, as defined below.

> MSE <- function(x,y)


+ {
+ mean((x-y)^2)
+ }

Support vector regression is implemented in R under the package “e1071.” The results obtained using the radial kernel are shown below:

> library(e1071)
> svr <- svm(perm ~ area+peri+shape, data=rock,
+ type="eps-regression", kernel = "radial")
> predicted <-predict(svr,newdata=rock,type="response")
> MSE(predicted, rock$perm)
[1] 35316.21
> cor(predicted, rock$perm)
[1] 0.9040716
> plot(predicted, rock$perm)

The default settings of the three hyper-parameters used in the run above can be
inspected next: Cost is the regularization parameter, gamma is a parameter of the
kernel function, and epsilon is the size of the insensitive tube.

> summary(svr)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 1
gamma: 0.3333333
epsilon: 0.1

These numerical parameters can be optimized in order to minimize the prediction error measured by MSE. We formulate this task as a numerical optimization problem defined over three numerical parameters (cost, gamma, and epsilon), aiming to minimize the MSE of the predictions obtained with support vector regression under the given settings:

> trainingError <- function(params)


+ {
+ svr <- svm(perm ~ area+peri+shape, data=rock, type="eps-regression",
+ kernel = "radial", gamma=params[1], cost = params[2], epsilon = params[3])
+ predicted <-predict(svr,newdata=rock,type="response")
+ MSE(predicted, rock$perm)
+ }

Any of the meta-heuristics presented in this chapter can be used to tackle this
minimization problem. We illustrate here the use of DE:

> DEparams <- DEoptim(trainingError, lower = c(0, 0, 0), upper = c(4, 4, 1),
+ control = list(strategy = 1,NP=20, itermax=20, trace = TRUE))
Iteration: 1 bestvalit: 5937.692186 bestmemit: 1.929174 2.872409 0.012022
Iteration: 2 bestvalit: 5630.575260 bestmemit: 3.110530 3.717773 0.166768
Iteration: 3 bestvalit: 3623.210268 bestmemit: 2.818071 3.682759 0.077892
...
Iteration: 20 bestvalit: 1473.135923 bestmemit: 3.983884 3.812688 0.046011

The solution obtained by DE is stored next in the vector params and is used to
train a new SVM.

> params <- DEparams$optim$bestmem


> svr <- svm(perm ~ area+peri+shape, data=rock, scale = TRUE, type="eps-regression",
+ kernel = "radial", gamma=params[1], cost = params[2], epsilon = params[3])
> predicted <-predict(svr,newdata=rock,type="response")
> MSE(predicted, rock$perm)
[1] 1473.136
> cor(predicted, rock$perm)
[1] 0.9968882

Figure 15 illustrates the predicted values compared to the real values for the case of SVR with default settings (a) and for the case of SVR with optimized hyper-parameters (b).
The optimized model gives much better results with regard to the prediction error, but it is prone to overfitting: A single dataset was used both for training and testing; in this situation, the model is highly adapted to the dataset and may suffer from poor generalization power. We can avoid overfitting by using distinct sets for training and testing. The new function to be optimized should be formulated as shown below. Although very similar to the previous version in its definition, this function is significantly different in behavior: It invokes a “training” dataset in the learning phase but computes the prediction error on a “testing” dataset:


Fig. 15 Predicted over expected values in regression analysis with SVR using: a default hyper-parameter settings and b optimized settings

> testingError <- function(params)
+ {
+ svr <- svm(perm ~ area+peri+shape, data=training, type="eps-regression",
+ kernel = "radial", gamma=params[1], cost = params[2], epsilon = params[3])
+ predicted <-predict(svr,newdata=testing,type="response")
+ MSE(predicted, testing$perm)
+ }
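The training and testing objects referenced above must be built beforehand; one possible random split of the rock dataset (roughly two-thirds for training; the proportion is an assumption of this example) is:

> set.seed(1)                              # for reproducibility
> idx <- sample(1:nrow(rock), size = 32)
> training <- rock[idx, ]
> testing <- rock[-idx, ]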

The validation of the regression model obtained with the optimized hyper-parameters requires, in this case, a third dataset called the validation set. This phase closes the analysis which, as recommended for any supervised learning task, is composed of three phases: training, testing, and validation. If the accuracy/error obtained in the validation phase is satisfactory, the model can be used in production.

5 Applications of Meta-Heuristics in Geosciences

Evolutionary algorithms have been used for solving optimization problems in geophysics in two main directions: either by performing the optimization directly, or by optimizing the parameters of other methods (e.g., neural networks) used for specific problems.
Evolutionary methods are compared to PSO in a study on the optimization of reservoir models to match past petroleum production data in Hajizadeh et al. (2011). ACO, DE, PSO, and the neighborhood algorithm are integrated in a Bayesian framework in order to measure the uncertainty of the predictions obtained by each algorithm, in a case study involving two petroleum reservoirs. Ahmadi et al. (2013) perform the task of predicting reservoir permeability using a soft sensor implemented on the basis of a feed-forward artificial neural network, which was then optimized using a hybrid GA and PSO method. History matching is also the research topic in Park et al. (2014), where a multi-objective evolutionary algorithm identifies optimal solutions and outperforms a traditional weighted-sum approach.
GAs are acknowledged as important tools for successful neural network data-driven models with applications in the oil and gas industry (Mohaghegh 2005; Shahab et al. 2005). Intelligent software tools used in the industry integrate hard (statistical) and soft (intelligent) computing techniques, such as fuzzy cluster analysis, genetic optimization, or neural computing (Shahab et al. 2005).
The direct use of a GA helps to evaluate the hydrocarbon resource in a field dataset from the North Cambay basin, India (Thander et al. 2014). Several parameters are required for resource estimation (e.g., areal extent, net pay thickness, oil saturation), yet only a limited set is recorded in the exploration phase. Also, recordings are made with uncertainty, due to reservoir heterogeneity. The GA copes well with the uncertainty in the data and delivers estimations of the oil reserve using a real dataset.

An oil production planning problem that appears in the context of oil wells with insufficient oil pressure, and which consists in identifying the amount of gas that should be injected into a well in order to maximize the amount of oil extracted from that well, is solved by an evolutionary algorithm in Singh et al. (2013). The problem is made more difficult by the constraint on the total amount of gas available daily. The authors propose a multi-objective approach to the problem and also formulate a single-objective version, focused on the maximization of profit instead of the oil quantity. The problem of gas allocation among oil wells is also tackled in Ghaedi
et al. (2013), by means of a hybrid GA, and in Abdel Rasoul et al. (2014).
The optimal well type and location are determined with PSO in Onwunalu and Durlofsky (2010), in a study involving vertical, deviated, and dual-lateral wells. Comparisons with a GA over multiple runs of both algorithms show that PSO outperforms, on average, the GA, yet the advantages of using PSO over the GA vary among the cases surveyed. Driven by the goal of maximizing the total hydrocarbon recovery, a well placement problem is tackled in Nwankwor et al. (2013), where a hybrid PSO-DE algorithm is proposed for the problem. The hybrid is compared to basic variants of PSO and DE on three problem cases concerning the placement of vertical wells in 2D and 3D reservoir models. Optimal well placement under uncertainty is tackled in a two-stage approach in Lyons and Nasrabadi (2013). First, an ensemble Kalman filter is used to perform history matching on the reservoir data. Then, well placement is solved by a GA combined with pseudohistory matching.
Carbon dioxide (CO2) sequestration is of great interest to oil engineers. In recent years, the idea of storing CO2 in deep geological formations, such as depleted oil and gas reservoirs (with impermeable rocks), has gained a lot of attention from the community as a solution for greenhouse gas mitigation, by preventing the emission of CO2 into the atmosphere. CO2 sequestration also helps by enhancing methods for oil or gas recovery (Zangeneh et al. 2013). Evolutionary algorithms are used to identify carbon dioxide seepage areas in Cortis et al. (2008). In Zangeneh et al. (2013), the parameters of a CO2 storage model are optimized using a GA. A multi-objective GA (NSGA) is implemented for optimizing gas storage alongside oil recovery in Safarzadeh and Motahhari (2014). Based on the results from the GA, the authors are able to propose some production scenarios.
In Fichter et al. (2000), a portfolio optimization problem for the oil and gas industry is tackled by means of a GA. GAs are chosen for this task both due to their scalability to extremely large portfolios and because they allow the analysis of portfolios from the point of view of value and risk measures.
GA and PSO are used to find the optimal parameters of a linear and an exponential model for the demand of oil in Iran in Assareh et al. (2010). The models use as input variables the population, the gross domestic product, and import and export data; they are used to forecast the demand of oil up to 2030.
PSO emerged as a powerful algorithm for geophysical inverse problems when compared to GAs and simulated annealing in Martínez et al. (2010) and Shaw and Srivastava (2007). Other applications include the inversion of seismic refraction data (Poormirzaee et al. 2014), crosshole traveltime tomography (Tronicke et al. 2012), or reservoir characterization (Fernández Martínez et al. 2012).
A large number of meta-heuristics are compared with respect to training an artificial neural network for the task of forecasting the water temperature of a natural river in Piotrowski et al. (2014). The study involves a comparison of several versions of PSO, DE, and direct search with the Levenberg–Marquardt (LM) algorithm for ANN training. The study concludes that only the DE algorithm obtains results competitive with the LM algorithm. A similar optimization idea is described in Ahmadi and Ebadi (2014), where a hybrid combination of an artificial neural network and PSO, extended with dew point pressure data, leads to a better understanding of reservoir fluid behavior.

References

Abdel Rasoul RR, Daoud A, El Tayeb ESA (2014) Production allocation in multi-layers gas
producing wells using temperature measurements with the application of a genetic algorithm.
Pet Sci Technol 32(3):363–370
Ahmadi MA, Ebadi M (2014) Robust intelligent tool for estimation dew point pressure in
retrograded condensate gas reservoirs: application of particle swarm optimization. J Pet Sci
Eng 123:7–19
Ahmadi MA, Zendehboudi S, Lohi A, Elkamel A, Chatzis I (2013) Reservoir permeability
prediction by neural networks combined with hybrid genetic algorithm and particle swarm
optimization. Geophys Prospect 61(3):582–598
Al-kazemi B, Mohan CK (2002) Multi-phase generalization of the particle swarm optimization
algorithm. In: Proceedings of the IEEE congress on evolutionary computation. IEEE Press
Angeline PJ (1998) Using selection to improve particle swarm optimization. In: Proceedings of the IEEE international conference on evolutionary computation. IEEE Press, pp 84–89. ISBN 0-7803-4869-9
Assareh E, Behrang MA, Assari MR, Ghanbarzadeh A (2010) Application of PSO (particle swarm optimization) and GA (genetic algorithm) techniques on demand estimation of oil in Iran. Energy 35(12):5223–5229
Back T (1996) Evolutionary algorithms in theory and practice. Oxford University Press, New York
Baker JD (1985) Adaptive selection methods for genetic algorithms. In: Proceedings of an
International Conference on Genetic Algorithms and their applications. Hillsdale, New Jersey,
pp 101–111
Baker JD (1987) Reducing bias and inefficiency in the selection algorithm. In: Proceedings of the
second international conference on genetic algorithms. pp 14–21
Bautu A (2010) Generalizations of Particle Swarm Optimization: applications of particle swarm
algorithms to statistical physics and bioinformatics problems. PhD Thesis, Department of
Computer Science, Al. I. Cuza University, Lambert Academic Publishing. ISBN 978-3848417315
Blum C, Roli A (2003) Metaheuristics in combinatorial optimization: overview and conceptual comparison. ACM Comput Surv 35(3):268–308. ISSN 0360-0300. doi:10.1145/937503.937505
Boyd R, Richerson PJ (1985) Culture and the evolutionary process. The University of Chicago
Press, Chicago
Bratton D, Kennedy J (2007) Defining a standard for particle swarm optimization. In: Swarm
intelligence symposium, 2007. SIS 2007, IEEE, pp 120–127

Breaban M (2011) Clustering: evolutionary approaches. PhD Thesis, Department of Computer


Science, Al. I. Cuza University
Breaban M, Luchian H (2005) PSO under an adaptive scheme. In: Proceedings of the IEEE
congress on evolutionary computation. IEEE Press, pp 1212–1217
Breaban ME, Luchian H (2011) PSO aided k-means clustering: introducing connectivity in
k-means. In: Proceedings of the 13th annual conference on Genetic and evolutionary
computation. ACM, pp 1227–1234
Breaban ME, Luchian H, Simovici D (2012) A genetic clustering algorithm by monomial
projection pursuit. In Symbolic and numeric algorithms for scientific computing (SYNASC),
14th international symposium on 2012. IEEE, pp 214–219
Bremermann HJ (1958) The evolution of intelligence: the nervous system as a model of its
environment. Technical Report No. 1, Department of Mathematics, University of Washington,
Seattle
Burke EK, Gendreau M, Hyde M, Kendall G, Ochoa G, Özcan E, Qu R (2013) Hyper-heuristics: a
survey of the state of the art. J Oper Res Soc 64(12):1695–1724
Clerc M (1999) The swarm and the queen: towards a deterministic and adaptive particle swarm
optimization. In: Proceedings of the IEEE congress on evolutionary computation, vol 3,
pp 1951–1957. doi:10.1109/CEC.1999.785513
Clerc M (2006) Particle swarm optimization. Hermes Sci, London. ISBN 1905209045
Coello CAC, Lechunga MS (2002) Mopso: a proposal for multiple objective particle swarm
optimization. In Proceedings of the IEEE congress on evolutionary computation. IEEE Press,
pp 1051–1056
Cortis A, Oldenburg CM, Benson SM (2008) The role of optimality in characterizing CO2 seepage
from geologic carbon sequestration sites. Int J Greenh Gas Control 2(4):640–652
De Jong KA (2006) Evolutionary computation. A unified approach. MIT Press, Cambridge
Deb K, Goldberg DE (1989) An investigation of niche and species formation in genetic function
optimization. In: Proceedings of the 3rd international conference on genetic algorithms, San
Francisco. Morgan Kaufmann Publishers Inc., pp 42–50, ISBN 1-55860-066-3. http://portal.
acm.org/citation.cfm?id=645512.657099
Dorigo M, Stützle T (2004) Ant colony optimization. Bradford Company, Scituate. ISBN
0262042193
Dumitrescu D (2000) Genetic chromodynamics. Studia Universitatis Babes-Bolyai Cluj-Napoca,
Ser. Informatica 45:39–50
Fernández Martínez JL, Mukerji T, García Gonzalo E, Suman A (2012) Reservoir characterization
and inversion uncertainty via a family of particle swarm optimizers. Geophysics 77(1):
M1–M16
Fichter DP et al (2000) Application of genetic algorithms in portfolio optimization for the oil and
gas industry. In: SPE annual technical conference and exhibition. Society of Petroleum
Engineers
Fogel LJ, Owens AJ, Walsh MJ (1966) Artificial intelligence through simulated evolution. Wiley,
New York
Fraser AS (1957) Simulations of genetic systems by automatic digital computers. Aust J Biol Sci
10:492–499
Ghaedi M, Ghotbi C, Aminshahidy B (2013) Optimization of gas allocation to a group of wells in
gas lift in one of the Iranian oil fields using an efficient hybrid genetic algorithm (HGA). Pet Sci
Technol 31(9):949–959
Glover F (1986) Future paths for integer programming and links to artificial intelligence. Comput
Oper Res 13(5):533–549. ISSN 0305-0548. doi:10.1016/0305-0548(86)90048-1
Grefenstette JJ (1986) Optimization of control parameters for genetic algorithms. IEEE Trans Syst
Man Cybern 16(1): 122–128
Grefenstette JJ (1987) Incorporating problem specific knowledge into genetic algorithms. Genet
Algorithms Simul Annealing 4:42–60
Hajizadeh Y, Demyanov V, Mohamed L, Christie M (2011) Comparison of evolutionary and
swarm intelligence methods for history matching and uncertainty quantification in petroleum
reservoir models. In: Intelligent computational optimization in engineering. Springer, Berlin,
pp 209–240
Hale JL, Householder BJ, Greene KL (2002) The theory of reasoned action. Sage Publications,
Thousand Oaks, pp 259–286
Hillis WD (1990) Co-evolving parasites improve simulated evolution as an optimization
procedure. Phys D Nonlinear Phenom 42(1):228–234
Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann
Arbor
Holland JH (1998) Adaptation in natural and artificial systems: an introductory analysis with
applications to biology, control and artificial intelligence. MIT Press, Cambridge.
ISBN 0-262-58111
Hruschka ER, Campello RJGB, Freitas AA, De Carvalho APLF (2009) A survey of evolutionary
algorithms for clustering. IEEE Trans Syst Man Cybern Part C Appl Rev 39(2):133–155
Hu X, Eberhart RC (2001) Tracking dynamic systems with PSO: where’s the cheese? In
Proceedings of the workshop on particle swarm optimization, pp 80–83
Hu X, Eberhart RC (2002) Multiobjective optimization using dynamic neighborhood particle
swarm optimization. In: Proceedings of the IEEE congress on evolutionary computation. IEEE
Press, pp 1677–1681
Hu X, Eberhart RC (2002) Solving constrained nonlinear optimization problems with particle
swarm optimization. In: Proceedings of the sixth world multiconference on systemics,
cybernetics and informatics
Ionita M, Croitoru C, Breaban M (2006) Incorporating inference into evolutionary algorithms for
max-csp. In: 3rd international workshop on hybrid metaheuristics, LNCS 4030. Springer,
Berlin, pp 139–149
Kennedy J (1999) Small worlds and mega-minds: effects of neighborhood topology on particle
swarm performance. In: Proceedings of the IEEE congress of evolutionary computation, vol 3.
IEEE Press, pp 1931–1938. doi:10.1109/CEC.1999.785513
Kennedy J (2002) Population structure and particle swarm performance. In: Proceedings of the
congress on evolutionary computation (CEC 2002). IEEE Press, pp 1671–1676
Kennedy J, Eberhart RC (1995) Particle swarm optimization. In: Proceedings of the 1995 IEEE
international conference on neural networks, vol 4. IEEE Press, pp 1942–1948
Kennedy J, Eberhart RC (1997) A discrete binary version of the particle swarm algorithm. In:
Proceedings of the world multiconference on systemics, cybernetics and informatics, vol 5,
Piscataway. IEEE Press, pp 4104–4109
Kennedy J, Mendes R (2003) Neighborhood topologies in fully-informed and best-of neighbor-
hood particle swarms. In: Proceedings of the 2003 IEEE SMC workshop on soft computing in
industrial applications (SMCia03). IEEE Computer Society, pp 45–50
De Jong KA (1975) An analysis of the behavior of a class of genetic adaptive systems. PhD
thesis, University of Michigan, Dissertation Abstracts International, vol 36, no 10, Ann Arbor,
AAI7609381
Khanesar MA, Tavakoli H, Teshnehlab M, Shoorehdeli MA (2009) Novel binary particle swarm
optimization. I-Tech Education and Publishing, pp 1–10. ISBN 978-953-7619-48-0
Kirkpatrick S, Gelatt CD, Vecchi MP et al (1983) Optimization by simulated annealing. Science
220(4598):671–680
Konak A, Coit DW, Smith AE (2006) Multi-objective optimization using genetic algorithms: a
tutorial. Reliab Eng Syst Saf 91(9):992–1007. http://www.sciencedirect.com/science/
article/B6V4T-4J0NY2F-2/2/97db869c46fc43f457f3d509adaa15b5
Koza J (1992) Genetic programming: on the programming of computers by means of natural
selection. MIT Press, Cambridge
Krink T, Vesterstrom JS, Riget J (2002) Particle swarm optimisation with spatial particle
extension. In: Proceedings of the 2002 congress on evolutionary computation (CEC'02), vol 2.
IEEE Computer Society, Washington, pp 1474–1479. ISBN 0-7803-7282-4. http://portal.acm.org/citation.cfm?id=1251972.1252447
Lanzi PL, Stolzmann W, Wilson SW (2000) Learning classifier systems: from foundations to
applications (No. 1813). Springer, Berlin
Løvbjerg M, Rasmussen TK, Krink T (2001) Hybrid particle swarm optimiser with breeding and
subpopulations. In: Proceedings of the genetic and evolutionary computation conference
(GECCO-2001). Morgan Kaufmann, pp 469–476
Luchian S, Luchian H, Petriuc M (1994) Evolutionary automated classification. In: Proceedings of
1st congress on evolutionary computation, pp 585–588
Lyons J, Nasrabadi H (2013) Well placement optimization under time-dependent uncertainty using
an ensemble kalman filter and a genetic algorithm. J Petrol Sci Eng 109:70–79
Martínez JLF, Gonzalo EG, Álvarez JPF, Kuzma HA, Pérez COM (2010) PSO: a powerful
algorithm to solve geophysical inverse problems: application to a 1D-DC resistivity case.
J Appl Geophys 71(1):13–25
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state
calculations by fast computing machines. J Chem Phys 21(6):1087–1092
Michalewicz Z (1992) Genetic algorithms + data structures = evolution programs (3rd edn).
Springer, Berlin. ISBN 3-540-60676-9
Mitchell M (1996) An introduction to genetic algorithms. MIT Press, Cambridge.
ISBN 0-262-13316-4
Mitchell M, Forrest S, Holland JH (1992) The royal road for genetic algorithms: fitness landscapes
and GA performance. In: Proceedings of the first European conference on artificial life,
pp 245–254. The MIT Press, Cambridge
Mohaghegh SD (2005) A new methodology for the identification of best practices in the oil and
gas industry, using intelligent systems. J Pet Sci Eng 49(3):239–260
Mohaghegh SD et al (2005) Recent developments in application of artificial intelligence in
petroleum engineering. J Pet Technol 57(4):86–91
Mullen KM, Ardia D, Gil DL, Windover D, Cline J (2011) DEoptim: an R package for global
optimization by differential evolution. J Stat Softw 40(6):1–26
Madavan NK (2002) Multiobjective optimization using a pareto differential evolution
approach. In: Proceedings of the world congress on computational intelligence, vol 2. IEEE,
pp 1145–1150
Nguyen NT, Kowalczyk R (2012) Transactions on computational collective intelligence III.
Springer, Berlin
Nwankwor E, Nagar AK, Reid DC (2013) Hybrid differential evolution and particle swarm
optimization for optimal well placement. Comput Geosci 17(2):249–268
Onwunalu JE, Durlofsky LJ (2010) Application of a particle swarm optimization algorithm for
determining optimum well location and type. Comput Geosci 14(1):183–198
Park H-Y, Datta-Gupta A, King MJ (2014) Handling conflicting multiple objectives using pareto-
based evolutionary algorithm during history matching of reservoir performance. J Pet Sci Eng
Piotrowski AP, Osuch M, Napiorkowski MJ, Rowinski PM, Napiorkowski JJ (2014) Comparing
large number of metaheuristics for artificial neural networks training to predict water
temperature in a natural river. Comput Geosci 64:136–151
Poli R, Kennedy J, Blackwell T (2007) Particle swarm optimization. Swarm Intell 1(1):33–57
Poli R, Langdon WB, McPhee NF (2008) A field guide to genetic programming. http://www.gp-
field-guide.org.uk. (With contributions by JR Koza)
Poormirzaee R, Moghadam RH, Zarean A (2014) Inversion seismic refraction data using particle
swarm optimization: a case study of Tabriz, Iran. Arab J Geosci 1–9
Radcliffe NJ, Surry PD (1995) Fitness variance of formae and performance prediction. In:
Foundations of genetic algorithms, pp 51–72
Raidl GR, Gottlieb J (2005) Empirical analysis of locality, heritability and heuristic bias in
evolutionary algorithms: a case study for the multidimensional knapsack problem. Evol
Comput 13(4):441–475
Rana S, Jasola S, Kumar R (2011) A review on particle swarm optimization algorithms and their
applications to data clustering. Artif Intell Rev 35(3):211–222
Rechenberg I (1973) Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der
biologischen Evolution. Frommann-Holzboog Verlag, Stuttgart
Riget J, Vesterstrøm JS (2002) A diversity-guided particle swarm optimizer-the ARPSO.
Department of Computer Science, University of Aarhus, Aarhus, Denmark, Technical Report,
vol 2. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.2929
Safarzadeh MA, Motahhari SM (2014) Co-optimization of carbon dioxide storage and enhanced
oil recovery in oil reservoirs using a multi-objective genetic algorithm (NSGA-II). Pet Sci 11
(3):460–468
Schwefel H-P (1993) Evolution and optimum seeking. Wiley, Hoboken
Scrucca L (2013) GA: a package for genetic algorithms in R. J Stat Softw 53(4):1–37. http://www.
jstatsoft.org/v53/i04/
Shakhsi-Niaei M, Iranmanesh SH, Torabi SA (2013) A review of mathematical optimization
applications in oil-and-gas upstream & midstream management. Int J Energy Stat 1
(02):143–154
Shaw R, Srivastava S (2007) Particle swarm optimization: a new tool to invert geophysical data.
Geophysics 72(2):F75–F83
Shelokar PS, Jayaraman VK, Kulkarni BD (2004) An ant colony approach for clustering.
Anal Chim Acta 509(2):187–195
Shi Y, Eberhart RC (1998) Parameter selection in particle swarm optimization. In: EP’98:
proceedings of the 7th international conference on evolutionary programming VII. Springer,
London, pp 591–600. ISBN 3540648917
Simon HA (1969) The sciences of the artificial, vol 136. MIT Press, Cambridge
Singh HK, Ray T, Sarker R (2013) Optimum oil production planning using infeasibility driven
evolutionary algorithm. Evolut Comput 21(1):65–82
Stoean R, Preuss M, Stoean C, El-Darzi E, Dumitrescu D (2009) Support vector machine learning
with an evolutionary engine. J Oper Res Soc 60(8):1116–1122
Stoean C, Preuss M, Stoean R, Dumitrescu D (2010) Multimodal optimization by means of a
topological species conservation algorithm. IEEE Trans Evolut Comput 14(6):842–864
Stoean R, Stoean C, Lupsor M, Stefanescu H, Badea R (2011) Evolutionary-driven support vector
machines for determining the degree of liver fibrosis in chronic hepatitis C. Artif Intell Med
51:53–65. ISSN 0933-3657
Storn R, Price K (1997) Differential evolution: a simple and efficient heuristic for global
optimization over continuous spaces. J Glob Optim 11(4):341–359. ISSN 09255001. doi:10.
1023/A:1008202821328
Sun J, Feng B, Xu W (2004) Particle swarm optimization with particles having quantum behavior.
In Proceedings of the IEEE congress on evolutionary computation. IEEE Press, pp 325–331
Talbi E-G (2009) Metaheuristics: from design to implementation, vol 74. Wiley, Hoboken
Thander B, Sircar A, Karmakar GP (2014) Hydrocarbon resource estimation: a stochastic
approach. J Pet Explor Prod Technol 1–8
Tronicke J, Paasche H, Böniger U (2012) Crosshole traveltime tomography using particle swarm
optimization: a near-surface field example. Geophysics 77(1):R19–R32
Turney P (1995) Cost-sensitive classification: empirical evaluation of a hybrid genetic decision
tree induction algorithm. J Artif Intell Res 2:369–409
Voß S (2001) Meta-heuristics: the state of the art. In: Local search for planning and scheduling.
Springer, Berlin, pp 1–23
Wang L, Wang X, Fu J, Zhen L (2008) A novel probability binary particle swarm optimization
algorithm and its application. J Softw 3(9):28–35
Whitley D, Rana S, Heckendorn RB (1998) The island model genetic algorithm:
on separability, population size and convergence. J Comput Inf Technol 7:33–47
Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol
Comput 1(1):67–82
Zaharie D (2005) Density based clustering with crowding differential evolution. In: International
symposium on symbolic and numeric algorithms for scientific computing, pp 343–350
Zaharie D (2007) A comparative analysis of crossover variants in differential evolution. In:
Proceedings of IMCSIT 2007, pp 171–181
Zangeneh H, Jamshidi S, Soltanieh M (2013) Coupled optimization of enhanced gas recovery and
carbon dioxide sequestration in natural gas reservoirs: case study in a real gas field in the south
of Iran. Int J Greenh Gas Control 17:515–522
Zitzler E, Deb K, Thiele L (2000) Comparison of multiobjective evolutionary algorithms:
empirical results. Evol Comput 8:173–195
Genetic Programming Techniques with Applications in the Oil and Gas Industry

Henri Luchian, Andrei Băutu and Elena Băutu

H. Luchian
Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi, Romania
A. Băutu
Faculty of Navigation and Naval Management, Romanian Naval Academy, Constanta, Romania
E. Băutu (✉)
Faculty of Mathematics and Computer Science, Ovidius University, Constanta, Romania
e-mail: ebautu@gmail.com
Abstract The chapter, entitled “Genetic Programming Techniques with Applications in the Oil and Gas Industry”, consists of four parts. The first part presents
theoretical features of the genetic programming algorithm, describing its main
components, such as individual representation, initialization of the population,
evaluation of the individuals, genetic operators, and selection scheme. The second
part is concerned with a hybrid evolutionary algorithm—Gene Expression Pro-
gramming, which combines features from genetic algorithms and genetic pro-
gramming. In the third part, references towards software frameworks that
implement GP are provided. This part then focuses on the use of the R package for
genetic programming—RGP and provides a guide for the package, using two
model problems to exemplify its usage. The last part reviews applications of genetic
programming for petroleum engineering problems.

Keywords Genetic programming · Regression · Gene expression programming · RGP · Petroleum engineering problems
This chapter presents the theoretical background behind the evolutionary algorithm
variant known as genetic programming (GP). Details on the features that make GP a
remarkable algorithm for data analysis are provided. Gene Expression Program-
ming (GEP) is a GP variant proposed by Ferreira (2001), which has since gained a
lot of interest from researchers for applications in various fields of science.
We chose to present it in this chapter since it is a good example of a hybrid
evolutionary algorithm that combines advantages from both GAs and GP, and it is
among the most used flavors of GP in applications. Insight into the inner workings
of GP is gained by means of two practical examples: a synthetic regression problem
and a real problem from the field of petroleum engineering, both modeled by GP.
We provide references to existing software packages that offer implementations
of the GP paradigm and focus on the R package for genetic programming (RGP). A
step-by-step guide to using RGP for the aforementioned problems is provided.
Further, we review applications of GP to petroleum engineering-related problems,
such as well log analysis, reservoir characterization, or pressure analysis.
Following the directions set by John Holland for the presentation of adaptive
systems (Holland 1992), we explain the basics of GP and GEP by describing the
following features (Bautu 2010):
• the encoding method used by the individuals (representation) and the decoding
procedure;
• the procedure to generate individuals, used especially during the initialization
of the population, but also in the context of some genetic operators;
• the evaluation procedure (fitness function);
• procedures for genetic operators (e.g., mutation, crossover);
• the selection scheme.

1 Genetic Programming

Nichael Cramer's work from 1985 stands at the root of the genetic programming
paradigm; he proposed a type of genetic algorithm with individuals represented by
computer programs (Cramer 1985). Cramer used the proposed algorithm to auto-
matically evolve simple mathematical expressions. His work was followed by
Schmidhuber’s idea of using Prolog and Lisp as support for evolutionary algo-
rithms, which led to a meta-learning algorithm based on GP (Dickmanns et al.
1987; Schmidhuber 1987). The inventor of modern GP is considered to be John
Koza, a former professor at Stanford University, who laid the foundation of what
is currently known as GP in his first book on the topic (Koza 1992). He envisioned
a genetic algorithm that evolves Lisp S-expressions that automatically solve
problems. Recent accounts on the topic of GP are provided in (Poli and Koza 2014;
Poli 2008); insights into the theoretical foundations of GP are provided in Langdon
and Poli (2002). We will briefly describe in the following the main traits of GP that
differentiate it from GAs, following the description provided in (Bautu 2010; Bautu
and Bautu 2009).

1.1 Representation of Individuals

Traditional GP appeared from the need to automatically solve problems, based on a
high-level statement of the problem, without any prior knowledge of the form or
structure of the solution. The structures (individuals) that evolve are at the base of
any adaptive (or learning-based) system. GP individuals are computer programs,
encoded as syntax trees (e.g., Fig. 1). The nodes in the tree are labeled with
symbols: the leaves of the tree are labeled with terminal symbols (the variables and
the constants in the program—in our example, x and 2), while the internal nodes are
labeled with function symbols (e.g., algebraic operators, trigonometric functions,
or other common mathematical functions). During evolution, the sizes and shapes
of the trees change in order to adapt to the environment provided by the problem.
The search space for the GP algorithm is graphically depicted in Fig. 2.

Fig. 1 GP syntax tree representing the individual (−x) + x · 2

Fig. 2 Graphical representation of the search space for GA (left) and GP (right)
It is important for the symbol set of the algorithm, comprising all the functions
and terminals, to be carefully selected prior to running the GP algorithm, in order to
provide the prerequisites to model the proposed problem (Koza 1992). We refer, in
the following, to two properties that must be met by the symbol set: closure and
completeness.
The closure property requires each function in the function set to be well
defined and closed with respect to any combination of parameters it may receive during
evolution. This is usually achieved by the special treatment of a relatively small
number of situations. For example, the division operation is not allowed to
receive zero as its second parameter, so a function set containing an unprotected
division does not satisfy the closure property; likewise, the logarithm function
should not receive negative parameters.
Examples of closed symbol sets (i.e., sets for which it is guaranteed that all
syntactically valid expressions formed with these symbols are also semantically valid):
• C = {AND, OR, NOT, x, y, TRUE, FALSE}, where x and y are Boolean
variables, and TRUE and FALSE are Boolean constants;
• C = {+, −, ×, x, y, 0, 1}, where x and y are integer variables.
Examples of function sets that are not closed are:
• C = {+, −, ×, /, x, y, 0, 1}, where x and y are real variables—the set is not
closed because it is possible to generate expressions which are semantically
invalid due to division by 0:

f(x, y) = (x · x)/(y − y) or f(x, y) = (x − y)/0;

• C = {+, −, log, x}, where x is a real variable—the set is not closed; in case the
log function receives negative or null parameters, the resulting expression is not
semantically valid:

f(x) = x + log(x − x) or f(x) = log(−x) − log(x)    (1)

A possible solution for achieving closure of the symbol set is by means of the
definition of protected functions. Protected functions return a special value of the
terminal set whenever an exceptional situation is detected. For example, in case of
the division operator, a protected function can return 0 if the second parameter is 0:

$$/_{\mathrm{prot}}(x, y) = \begin{cases} x/y, & \text{if } y \neq 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

In this way, the protected divide operation has a well-defined result for any
values of its parameters. The advantage of this approach is its simplicity, from the
implementation point of view.
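To make this concrete, here is a minimal sketch of the protected division of Eq. (2) in R (the language used later in this chapter for the RGP examples); the function name div_prot is our own choice, not part of any GP package:

# A sketch of the protected division of Eq. (2)
div_prot <- function(x, y) {
  ifelse(y != 0, x / y, 0)   # return 0 whenever the denominator is 0
}
div_prot(6, 3)   # 2
div_prot(1, 0)   # 0 instead of Inf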
In order to meet the completeness property, one must make sure that the symbol
set for the algorithm is sufficient in order to express a solution to the problem; in
general, expert knowledge is needed to implement this part. This property is
guaranteed only for some problem cases where there exist theoretical arguments or
empirical evidence favoring a particular choice of symbols.
The selection of the input variables necessary for a given problem can be
straightforward, or it may be solved by a feature extraction algorithm (Veerama-
chaneni et al. 2010). Similarly, the function set that is sufficient to express a
problem solution is very dependent on the problem to be solved.
For example, the function set {AND, OR, NOT} is sufficient to express any
Boolean function. By removing the AND function, the remaining set still meets the
sufficiency condition, because the AND Boolean function can be simulated with:

AND(x, y) = NOT(OR(NOT(x), NOT(y))).

In case of removing the NOT function, the remaining set no longer meets the
sufficiency condition, because its effect cannot be simulated with the functions left in
the set; thus, functions such as XOR cannot be expressed. As with the terminals,
the responsibility of establishing the set of functions appropriate for the problem
rests with the user.
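This identity is easy to verify exhaustively; e.g., the following plain R snippet (a sanity check of ours, not GP code) confirms it for all Boolean inputs:

# Check that AND is recovered from {OR, NOT} by the identity above
for (x in c(TRUE, FALSE))
  for (y in c(TRUE, FALSE))
    stopifnot((x && y) == !(!x || !y))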
GP builds approximations of the real solution in case the symbols included in
the symbol set are not sufficient to express an exact solution to the problem. For this
reason, the general set of symbols used in GP to express a solution to a given
problem need not coincide with the minimal set of symbols required to express the
solution; it usually contains additional symbols. The effect that these additional
symbols may have on the quality of solutions identified by the algorithm is difficult
to assess a priori. For example, the presence of additional variables in the set of
terminals may lead to a decrease in the algorithm performance in finding solutions
(Fig. 3); in this case, the GP algorithm also performs a feature selection task,
identifying automatically the variables that are significant for the model.
For example, suppose GP is used to infer a formula for the exponential function
e^x. This function cannot be expressed exactly by a finite algebraic expression. If GP
uses the set of symbols

C = {+, −, ×, /, x, y, 0, 1, 2},

it will, most likely, provide finite approximations of this function, such as 1,
1 + x, 1 + x + x²/2!, and 1 + x + x²/2! + x³/3!.

Fig. 3 Completeness of the symbol set and its effect on the solution

1.2 Generating Individuals

The generation of GP individuals is used for the initialization of the population in
the first generation, as well as for implementing certain genetic operators, such as
subtree mutation. GP individuals are generated in a random manner, usually
recursively, node by node. In the beginning, a symbol chosen randomly from the
symbol set of the algorithm is assigned as the root of the tree. If the symbol is a
terminal (e.g., variable, constant, or a function without parameters), then the gen-
erating process stops. The individual obtained is a (degenerate) tree consisting of a
single node, labeled with a terminal. If the symbol chosen is a function f with arity a
(f), then the recursive process builds up a(f) descendants as parameters of this
symbol. If a descendant is labeled with a terminal, then the generation process is
considered completed for that node. If a descendant is labeled with a function, then
the generation process continues recursively until all leaf nodes of the tree are
labeled with terminals. The tree depth is the longest direct path from the root to any
leaf node. Using pseudocode, this process is described by the algorithm (as shown
in Fig. 4).
In practice, the algorithm (as shown in Fig. 4) should be enhanced with a
mechanism for limiting the size of the trees produced; such a mechanism
can be implemented in different ways. Koza proposed three generating methods that
provide control over the sizes and complexity of the trees (Bautu 2010; Koza 1992):
• the full method creates full trees (i.e., the length of the direct path from the root
to any leaf node is equal to the depth of the tree);
• the grow method creates trees with different shapes and sizes;
• the ramped half-and-half method combines the previous methods to produce a
larger variety of full and irregular trees (Fig. 5).
In the initial population, it is essential to have a wide variety of individuals, such
that they ensure a good coverage of the search space, and a good diversity.
Diversity is key for the evolutionary process. The ramped half-and-half method is
very suitable to create a wide variety of trees in the initial population. For example,
with depths between 2 and 5, 12.5 % of the trees are full trees of depth 2, 12.5 % are
irregular trees of depth 2, 12.5 % are full trees of depth 3, and so on until the
maximum depth. Other methods, more sophisticated, are discussed in the literature
and are usually already implemented in dedicated GP packages (Luke 2000a, b).

Require: S – the symbol set
Ensure: T – the resulting tree
1. c = RandomSymbol(S)
2. root(T) = c
3. for i = 1 … a(c) do   ▷ a(c) denotes the arity of symbol c
4.   Child(T, i) = RandomTree(S)
5. end for
6. return T

Fig. 4 Random generation of an individual in GP
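For illustration, the recursive procedure of Fig. 4 can be sketched in R, extended with the depth-limiting mechanism mentioned above; the nested-list representation of trees and all names below are our own assumptions, not those of a particular GP package:

# A sketch of random tree generation (cf. Fig. 4) with a depth limit
functions <- list("+" = 2, "-" = 2, "*" = 2)   # function symbol = arity
terminals <- c("x", "2")

random_tree <- function(max_depth) {
  # force a terminal at the depth limit; otherwise pick one with some
  # probability, so that irregular ("grow"-style) trees may also appear
  if (max_depth <= 1 || runif(1) < 0.3)
    return(sample(terminals, 1))
  f <- sample(names(functions), 1)
  children <- lapply(seq_len(functions[[f]]),
                     function(i) random_tree(max_depth - 1))
  c(list(f), children)   # a node: the function symbol, then its subtrees
}

set.seed(1)
str(random_tree(4))   # prints one randomly generated tree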

Fig. 5 Trees with depth 4, generated by the full method (left) and the grow method (right). Gray
nodes are terminal nodes

1.3 Evaluation of Individuals

The central idea of all EC techniques is adaptation to the environment. In nature,
the number of offspring the individual has is usually used as a measure of the
individual’s adaptation to its environment. In EC, a reverse approach is employed:
The specific adaptation of each individual controls the number of offspring. An
explicit measure for the adaptation of individuals is the fitness value, evaluated
using a procedure specific to the problem addressed.
In the case of GP, each individual fitness is evaluated against a given set of input
data—particular cases of the problem search space. The selection of the input data
should be representative for the problem, because it is the foundation based on
which the algorithm generalizes the results obtained to the whole problem space.
All the individuals in a generation should train using the same input data, such that
they can be compared against each other.
A formula that is very frequently used to evaluate individuals is the root mean
squared error of the individual with respect to the input data:

$$\mathrm{fitness}(i) = \sqrt{\frac{\sum_{j=1}^{N} \left( S(i,j) - C(j) \right)^{2}}{N}}, \qquad (3)$$

where N is the total number of cases used for assessing individuals, S(i, j) is the value
obtained by evaluating individual i of the population on the variables of case j of the
input data, and C(j) is the correct (expected) value for case j. For the sake of
comparing individuals across different generations and algorithm runs, John Koza
introduced several types of fitness, which offer different degrees of abstraction of the
individual performances, all of them based on the distance between the input data
and the estimations made by the GP individual (Koza 1992).
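For illustration, Eq. (3) can be computed in a few lines of R; here an individual is stored as a plain R expression in the variable x (a simplifying assumption of ours; a tree evaluator would play the same role):

# A sketch of the fitness of Eq. (3): root mean squared error over the cases
fitness <- function(individual, x_cases, expected) {
  preds <- vapply(x_cases,
                  function(v) eval(individual, list(x = v)),
                  numeric(1))
  sqrt(mean((preds - expected)^2))
}

ind <- quote(x * x + 1)                           # a candidate model
fitness(ind, x_cases = 1:5, expected = (1:5)^2)   # 1: the error of the "+ 1"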

1.4 Genetic Operators

The main GP operators are selection and crossover. Mutation is considered a
secondary operator. Also, specific GP operators exist, such as permutation,
editing, encapsulation, and decimation—these are likewise considered secondary
operators. Other operators may be defined in order to target particular aspects of the
problems addressed with GP.
Crossover is deemed the most important GP operator. The basic idea is common
to that of the crossover operator from GAs: Parent individuals are selected from the
population, and offspring are produced such that they inherit parts from each parent.
There exist several traditional versions of the crossover operator, and others may be
defined, depending on the specificity of the problem.
The standard crossover operator uses two cut points, one in each parent. The
routine for this operator chooses, with uniform probability, one point in each of the
two parent chromosomes. Then, it swaps the subtree rooted in the corresponding
cut point with the subtree from the other parent (see Fig. 6). This process is
illustrated in the algorithm (as shown in Fig. 7) and represented in Fig. 6.
The offspring produced are always valid structures, due to the closure property
of the symbol set. It can be noted that the operator produces diversity in the
population: in a GA, when two identical individuals undergo crossover, the
offspring are identical to the parents; this is not the case for the standard crossover
operator in GP. Hence, premature convergence is not an issue for GP when standard
crossover is used.
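A compact R sketch of this standard crossover is given below, for individuals stored as R call objects; the path-based node addressing and all names are our own assumptions (cf. Figs. 6 and 7, which follow this sketch):

# A sketch of standard two-point subtree crossover on R call objects
all_paths <- function(e, path = integer(0)) {
  if (!is.call(e)) return(list(path))   # a terminal node
  paths <- list(path)                   # the function node itself
  for (i in 2:length(e))                # element 1 is the operator;
    paths <- c(paths, all_paths(e[[i]], c(path, i)))   # assumes >= 1 argument
  paths
}
get_node <- function(e, path)
  if (length(path) == 0) e else get_node(e[[path[1]]], path[-1])
set_node <- function(e, path, sub) {
  if (length(path) == 0) return(sub)
  e[[path[1]]] <- set_node(e[[path[1]]], path[-1], sub)
  e
}
crossover <- function(p1, p2) {
  c1 <- sample(all_paths(p1), 1)[[1]]   # one cut point in each parent
  c2 <- sample(all_paths(p2), 1)[[1]]
  list(set_node(p1, c1, get_node(p2, c2)),   # swap the rooted subtrees
       set_node(p2, c2, get_node(p1, c1)))
}

set.seed(7)
crossover(quote((x + 2) * x), quote(x - 2 * x))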
Fig. 6 The two-point crossover operator exchanges the subtrees rooted in the cut points (marked with a dashed line)

Require: C1, C2 – parent chromosomes
Ensure: O1, O2 – offspring chromosomes
1. O1 = C1   ▷ clone the parent chromosomes
2. O2 = C2
3. P1 = RandomNodeSelect(C1)   ▷ select the cut points
4. P2 = RandomNodeSelect(C2)
5. node(O1, P1) = Subtree(C2, P2)   ▷ swap the subtrees
6. node(O2, P2) = Subtree(C1, P1)
7. return O1, O2

Fig. 7 The standard (two-point) crossover operator in GP

Mutation The mutation operator is a secondary operator, mainly responsible for
producing diversity in the GP population. The standard implementation of
the operator proceeds by randomly selecting a node from the individual. The
subtree rooted in that node is then replaced by a randomly generated subtree. The
generation of the new subtree makes use of one of the algorithms discussed
earlier. Similar to crossover, a maximum depth limit can be used to restrict the size
of the offspring. This process is illustrated in the algorithm shown in Fig. 8 and
exemplified in Fig. 9.
When the cut point for subtree mutation is close to the root of the syntax tree, the
operator has a highly destructive effect; conversely, a mutation point near the
leaves of the tree has little chance of substantially altering the expression encoded by
the individual. A practical solution to this problem is to assign variable mutation
probabilities to nodes on different levels of the tree, e.g., a mutation probability that
increases from the root to the frontier of the tree.
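Reusing the helpers from the crossover sketch above, subtree mutation can be sketched as follows (cf. the algorithm in Fig. 8 below); random_expr and its symbol choices are, again, our own assumptions:

# A sketch of subtree mutation: replace a random node by a new subtree
random_expr <- function(depth) {
  if (depth <= 1 || runif(1) < 0.3)
    return(sample(list(quote(x), 2), 1)[[1]])   # a terminal: x or 2
  op <- sample(c("+", "-", "*"), 1)
  as.call(list(as.name(op),
               random_expr(depth - 1), random_expr(depth - 1)))
}
mutate <- function(ind, max_depth = 3) {
  p <- sample(all_paths(ind), 1)[[1]]          # the mutation point
  set_node(ind, p, random_expr(max_depth))     # replace the rooted subtree
}

set.seed(3)
mutate(quote((x + 2) * x))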
Require: C – the chromosome undergoing mutation
Require: S – the symbol set
Ensure: O – the chromosome resulting from mutation
1. O = C   ▷ clone the parent
2. P = RandomNodeSelect(O)   ▷ select the mutation point
3. T = RandomTree(S)   ▷ generate a replacement subtree
4. node(O, P) = T   ▷ replace the subtree
5. return O

Fig. 8 The subtree mutation operator

Fig. 9 The subtree mutation operator replaces a subtree with a newly generated tree

Permutation This operator randomly selects an internal node of the syntax tree.
Assume this node is labeled with a function of arity k. The permutation operator

generates a random permutation of the k children and swaps the children nodes
according to this permutation. In case the label of the target node is a commutative
function, the effect of this operator on the phenotype encoded by the tree is actually
null.
Editing The editing operator provides a way to reduce the complexity of individuals'
chromosomes dynamically, at runtime. For example, the editing operator
might evaluate functions that are context-free and have only constants as parameters
and then replace these functions with the result of the evaluation. Complex editing
rules might require large computing resources. The use of this operator is justified
by the necessity of limiting code bloat (Luke 2000a, b), or if individuals need to be
made more readable (for example, one might process the solution of the algorithm
in order to obtain a more user-friendly solution).
Encapsulation Reusability of code may be implemented in GP by means of the
encapsulation operator. This operator assigns names to subtrees of
chosen individuals, so that they can later be referred to in GP chromosomes as
symbols. Encapsulation operates on a single individual by extracting parts of its
chromosome and mapping them to a new symbol name: it randomly selects an
internal node of the tree encoded in the individual, saves the subtree rooted at that
point under a new symbol name, and replaces it with the new symbol. The new
symbol points to the original subtree and is included in the terminal set, because it
represents a complete subtree and does not require any parameters to be evaluated.
The main benefit of this operator is that it protects the subtree used to define the new
symbol from the destructive effects of genetic operators. This operator stands at the
base of the automatically defined functions idea in GP (Poli 2008).

1.5 Selection Scheme

In GP, selection is viewed as an operator that acts on a population of individuals
and results in a single individual. The selection operator works in two stages: first,
an individual from the population is chosen according to a selection scheme, and
then, this individual is copied into the population in the next generation of the
algorithm. The selection schemes available for genetic algorithms are used in the
case of GP, too. Among them, roulette wheel selection stands out as being widely
used—whether applied directly to the fitness values of the individuals, or to their
ranks (assigned, in turn, based on fitness values).
In the context of GP, tournament selection is also of wide use. For this scheme, a
number of individuals (e.g., 2 or 4) are chosen randomly from the population. The
one with the best fitness is selected to survive in the next generation. The parents
remain in the population, so they may participate in future tournaments.
Usual selection schemes are coupled with elitist survival of a number or of a
percent of the individuals in the population. This scheme ensures the survival of the
best individuals from one generation to the next.
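A short R sketch of tournament selection with elitist survival follows; the population is a list of individuals, and we assume that lower fitness values are better (as with the error of Eq. (3)). All names are our own:

# A sketch of tournament selection with elitism
tournament_select <- function(pop, fit, k = 2) {
  idx <- sample(seq_along(pop), k)   # draw k random competitors
  pop[[idx[which.min(fit[idx])]]]    # the best of them wins
}
next_generation <- function(pop, fit, n_elite = 1) {
  elite <- pop[order(fit)[seq_len(n_elite)]]        # elitist survival
  rest <- replicate(length(pop) - n_elite,
                    tournament_select(pop, fit), simplify = FALSE)
  c(elite, rest)
}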

2 Gene Expression Programming

Different representations used in the GP algorithm led to different flavors of GP,
oftentimes with their own names. Driven by the idea that every terminal has a type
and every function has a specification of types for its parameters, Montana intro-
duced strongly typed genetic programming (David 1995). This variant is useful for
implementing type constraints, such as those encountered in physics equations. The
existence of multiple objectives for practical problems led to the proposal of Pareto
GP (Vladislavleva et al. 2009).
Gene Expression Programming is a GP-based algorithm, proposed in (Ferreira
2001), very popular in applications in many domains (Zhou et al. 2003). This
variant combines the advantages of the classical GA representation (linear strings of
fixed size, which leads to easy implementation of genetic operators), with those
exhibited by the individuals in GP (hierarchical structures with different sizes and
shapes, which leads to the possibility of encoding highly complex programs). We
will describe GEP in the following, using the same structure as for the description
of the GP algorithm.

2.1 Representation of GEP Individuals

The phenotype of a GEP individual is a complex mathematical expression that may
be viewed as a hierarchical structure with variable sizes and shapes. In GEP jargon,
it is called an expression tree. The genotype of a GEP individual is a fixed size
string of symbols.
The expression tree from Fig. 10 represents the mathematical expression

√((x − y) · (y + x)),

which is the phenotype of the genotype

√ · − + x y y x,    (4)

Fig. 10 An expression parse tree

where √ denotes the square root function. This encoding is obtained by the breadth-first
traversal of the expression tree in Fig. 10. The expression is different from the
prefix notation, as well as from the postfix notation obtained by depth-first traversal,
which are used by some vector-based or stack-based variants of GP (Keith and
Martin 1994).
Decoding the genotype into the equivalent phenotype follows the same rules.
For example, the genotype √ / − x y x is decoded into an expression tree as
follows: the start symbol (√) has arity 1; hence, it is linked with the following
symbol (/); / has arity 2, and it is linked with the following two symbols, − and x. The
process continues until each symbol is linked with a number of symbols equal to its
arity. The symbols with arity 0 are leaf nodes in the phenotype's expression tree.
The translation process thus builds the expression tree corresponding to √((y − x)/x).
GEP genes are divided into two structural units: head and tail. The head may
contain functions and terminals, and the tail is constrained to contain only termi-
nals. The tail size depends on the head size and on the set of symbols used in the
gene,

t = h(n − 1) + 1,

where t is the required minimum size of the tail, h is the size of the head, and n is
the maximum arity of the symbols that may appear inside the gene. In this organization,
GEP genes are padded at the end with symbols that may not be used in the
decoding (they are inactive). This structural organization of GEP genes ensures
syntactic validity of all obtained programs. Also, GEP genetic operators always
produce syntactically correct expressions.
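To illustrate, the breadth-first translation described above can be sketched in R for the genotype of Eq. (4). With head size h = 4 and maximum arity n = 2, the tail must hold t = 4(2 − 1) + 1 = 5 terminals, so the full gene has 9 symbols, the last of which turns out to be inactive padding here. The two-pass decoding scheme and all names are our own assumptions:

# A sketch of GEP decoding: a linear gene becomes a nested-list tree
arity <- c("sqrt" = 1, "*" = 2, "-" = 2, "+" = 2, "x" = 0, "y" = 0)

decode <- function(gene) {
  a <- arity[gene]
  start <- cumsum(c(2, head(a, -1)))   # first-child position of each symbol
  nodes <- lapply(gene, function(s) list(sym = s, children = list()))
  for (i in rev(seq_along(gene))) {    # attach children back to front,
    if (a[i] > 0 && start[i] <= length(gene))   # so subtrees are complete
      nodes[[i]]$children <- nodes[start[i]:(start[i] + a[i] - 1)]
  }
  nodes[[1]]   # symbols beyond the active region stay unused, as in GEP
}

gene <- c("sqrt", "*", "-", "+", "x", "y", "y", "x", "y")  # last "y": padding
str(decode(gene))   # the tree of sqrt((x - y) * (y + x)) from Eq. (4)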
GEP individuals are multi-genic chromosomes, where each gene encodes a valid
expression tree which interacts with the other genes to creat