Summer 2019
Combinatorial optimization
Outline
1 Heuristic searches
2 Simulated annealing and genetic algorithms
3 Tabu algorithms
4 Up next
More generally
A trade-off
Choice of neighborhood
This defines a so-called k-neighborhood, and any move is called a k-change.
Statistics and Econometrics (CAU Kiel) Summer 2019 10 / 41
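For the 0/1 inclusion/exclusion encoding used in the model-selection examples later in these slides, a k-neighborhood can be enumerated directly. A minimal sketch (the encoding as a tuple of 0/1 flags and the function name are illustrative, not the lecture's code):

```python
from itertools import combinations

def k_neighborhood(theta, k=1):
    """All candidates reachable by one k-change: flip exactly k of the
    p inclusion indicators (theta is a tuple of 0/1 flags, one per predictor)."""
    neighbors = []
    for idx in combinations(range(len(theta)), k):
        cand = list(theta)
        for i in idx:
            cand[i] = 1 - cand[i]  # toggle predictor i in or out of the model
        neighbors.append(tuple(cand))
    return neighbors

# A 1-neighborhood of a p = 4 candidate has exactly p members:
print(len(k_neighborhood((1, 0, 1, 0), k=1)))  # 4
```

Note that a k-neighborhood has C(p, k) members, so larger k quickly makes the neighborhood expensive to scan.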
Heuristic searches
The moves
You may come up with many ways of updating the current candidate; you
typically want an upward move.
Random starts
Worth emphasizing: The random starts strategy can be combined with any
optimization method!
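As a sketch of this combination, here is random-starts local search wrapped around 1-change steepest ascent, assuming a 0/1 encoding and a toy stand-in objective (in the real application each candidate would be scored by its AIC; all names here are illustrative):

```python
import random

def steepest_ascent(theta, f, neighbors, max_iter=15):
    """1-change steepest ascent: move to the best neighbor until no
    neighbor improves on the current candidate (a local optimum)."""
    for _ in range(max_iter):
        best = max(neighbors(theta), key=f)
        if f(best) <= f(theta):
            return theta  # local optimum reached
        theta = best
    return theta

def random_starts(f, neighbors, p, starts=5, seed=0):
    """Run local search from several random initial candidates and keep
    the overall best -- combinable with any local optimizer."""
    rng = random.Random(seed)
    runs = [steepest_ascent(tuple(rng.randint(0, 1) for _ in range(p)),
                            f, neighbors) for _ in range(starts)]
    return max(runs, key=f)

# Toy objective (stand-in for -AIC): reward including the first two predictors.
f = lambda th: th[0] + th[1] - 0.5 * sum(th[2:])
neighbors = lambda th: [th[:i] + (1 - th[i],) + th[i+1:] for i in range(len(th))]
print(random_starts(f, neighbors, p=4))  # (1, 1, 0, 0)
```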
[Figure: Results of random-starts local search, 1-change steepest ascent, 15 iterations max per start. AIC ranges between −300 and −420.]
Method 1 2 3 6 7 8 9 10 12 13 14 15 16 18 19 20 21 22 24 25 26 AIC
LS (2,4) ••• • • • • • • • • • −416.95
LS (1) •• • • • • • • • • −414.60
LS (3) •• • • • • • • • • • −414.16
LS (5) • •• • • • • • • • −413.52
S-Plus • •• • • • • • • • • • −416.94
Efroymson • •• • • • • • • −400.16
Random starts local search [LS] model selection results using 1-change
steepest ascent, as well as two standard procedures (greedy stepwise
ascent). (Bullets indicate inclusion of predictor j.)
We may want to
break out of a bad neighbourhood, or
move towards other (more promising) neighbourhoods.
Of course, while ensuring that you don’t slide back immediately...
Annealing
Wikipedia says:
Annealing, in metallurgy and materials science, is a heat
treatment that alters the physical and sometimes chemical
properties of a material [...].
In annealing, atoms migrate in the crystal lattice [grid] and the
number of dislocations decreases, leading to a change in ductility
and hardness. As the material cools it recrystallizes.
An analogy
Simulated annealing
The algorithm
1 Pick (randomly) a candidate solution θ̃ from the current
neighborhood of θ (t) , according to a proposal distribution g (t) .
(Often the discrete uniform over N (θ (t) ).)
2 If proposal leads to an improvement, update θ (t+1) = θ̃
3 If proposal does not lead to an improvement,
randomly choose whether to update using θ̃ or not;
the current temperature controls the probability of updating!
So let θ(t+1) = θ̃ with probability equal to

    min{ 1, exp( [f(θ(t)) − f(θ̃)] / τj ) }.
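The proposal and acceptance steps, together with a geometric cooling schedule τj = α τj−1 of the kind used in the examples, can be sketched as follows. Here f is minimized (e.g. the AIC), so a proposal with lower f is always accepted; the toy objective and all names are illustrative, not the lecture's code:

```python
import math
import random

def anneal(f, neighbors, theta, tau0=10.0, alpha=0.9,
           stage_len=60, n_stages=15, seed=0):
    """Minimal simulated annealing sketch for minimizing f (e.g. AIC).
    Discrete uniform proposal over the current neighborhood; geometric
    cooling between stages."""
    rng = random.Random(seed)
    tau = tau0
    best = theta
    for _ in range(n_stages):
        for _ in range(stage_len):
            cand = rng.choice(neighbors(theta))
            # Accept with probability min{1, exp((f(theta) - f(cand)) / tau)}:
            # improvements always, uphill moves with a tau-controlled chance.
            if rng.random() < min(1.0, math.exp((f(theta) - f(cand)) / tau)):
                theta = cand
            if f(theta) < f(best):
                best = theta
        tau *= alpha  # cool down between stages
    return best

# Toy target: minimize squared distance to the all-ones candidate.
f = lambda th: sum((1 - x) ** 2 for x in th)
neighbors = lambda th: [th[:i] + (1 - th[i],) + th[i+1:] for i in range(len(th))]
print(anneal(f, neighbors, (0, 0, 0, 0)))
```

At high τ nearly every proposal is accepted (near-random walk); as τ shrinks, the chain increasingly refuses uphill moves and settles into a good candidate.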
Some details
The neighbourhoods:
should be small and easily computed;
any two candidates must be connectable via a sequence of moves through
successive neighbourhoods.
[Figure: Discrete uniform proposal, 1-neighbourhoods, 15-stage cooling schedule with stage lengths of 60, 120, and 220, τj = α(τj−1) = 0.9 τj−1, τ0 is 1 (bottom solid) or 10 (top solid). Dashed: baseline temperature at each step.]
Simulated annealing and genetic algorithms
Genetic algorithms
In this example,
1 The genetic representation of θ is simply inclusion/exclusion of the
respective regressor (obvious codes 0/1).
2 The AIC may play the role of the fitness function.
3 The initial population consists of several candidate models, say picked
at random.
4 To renew the population, we look at the best performing models from
the current populations, and change them suitably to yield a new
generation (“breeding”).
5 We stop if no improvement has been observed for the last, say, 15
generations, or once a maximum total number of generations has been reached (etc.).
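The five steps above can be sketched as follows, with a toy fitness standing in for −AIC. All tuning choices here (rank weights, one-point crossover, per-bit mutation, stopping after a fixed number of generations) are illustrative, not the only options:

```python
import random

def genetic_search(fitness, p, pop_size=20, generations=100,
                   mut_rate=0.01, seed=0):
    """Minimal genetic algorithm sketch for 0/1 model-selection codes:
    fitness-rank based parent selection, simple one-point crossover,
    and per-bit mutation."""
    rng = random.Random(seed)
    # Initial population: candidate models picked at random.
    pop = [[rng.randint(0, 1) for _ in range(p)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness)   # worst ... best
        weights = range(1, pop_size + 1)    # rank-based selection probabilities
        children = []
        for _ in range(pop_size):
            mom, dad = rng.choices(ranked, weights=weights, k=2)
            cut = rng.randrange(1, p)       # simple (one-point) crossover
            child = mom[:cut] + dad[cut:]
            # Mutation: flip each gene independently with small probability.
            child = [1 - g if rng.random() < mut_rate else g for g in child]
            children.append(child)
        pop = children                      # the new generation
    return max(pop, key=fitness)

# Toy fitness (stand-in for -AIC): favor including the first three predictors.
fit = lambda th: sum(th[:3]) - sum(th[3:])
print(genetic_search(fit, p=6))
```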
[Figure: Population size P = 20, 100 generations of breeding with fitness-rank based selection probability, simple crossover, and 1% mutation.]
Tabu algorithms
So, after a downhill step, let’s (temporarily) forbid (or make tabu/taboo)
moves that have the potential of taking us back. We need to
characterize moves to this end, and
keep an eye on the recent history (say H (t) ) of moves.
Move attributes
The tabu list is a list of attributes, not moves, so a single tabu attribute
may forbid entire classes of moves.
Examples of attributes
Generic attribute: a change in the value of θi(t). The attribute may be
the value from which the move began, or the value to which it arrived.
Model selection example (2-change neighbourhood): A1: whether the ith
predictor is added to (or deleted from) the model.

Generic attribute: a swap in the values of θi(t) and θj(t) when
θi(t) ≠ θj(t).
Model selection example (2-change neighbourhood): A2: whether the absent
variable is exchanged for the included variable.
History matters:
Current neighbourhoods may depend on search history; denote this
N (θ (t) , H (t) ).
The choice of θ (t+1) in N (θ (t) , H (t) ) may depend on the (recent)
search history too.
This effectively leads to using an augmented target function, fH (t) .
Workflow
Typical tabu tenure in practice: 7 ≤ τ ≤ 20, or 0.5√p ≤ τ ≤ 2√p. If it is
large enough, it will prevent cycling.
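For example, with p = 27 candidate predictors (an illustrative value), the √p-based rule of thumb gives:

```python
import math

p = 27  # number of candidate predictors (illustrative)
lo, hi = 0.5 * math.sqrt(p), 2 * math.sqrt(p)
print(round(lo, 1), round(hi, 1))  # 2.6 10.4
```

so a tabu tenure between roughly 3 and 10 iterations would be reasonable here.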
Refinements: Aspiration
Most common:
permit a tabu move if it provides a higher value of the objective
function than has been found in any iteration so far.
Aspiration by influence:
A very recent sharp increase in the target may indicate that the
algorithm has entered a new (promising) region of Ω.
So revoke the tabu on reversing the last downward move: further
exploration of the new region is worthwhile.
Refinements: Diversification
Refinements: Intensification
Construct fH (t) to extend the search near good solutions with high residency.
Reward retention of attributes that correlate with high-quality,
high-frequency solutions;
penalize removal of such attributes.
To sum up
Identify a list of problem-specific attributes, initialize tabu list. Then
1 Construct fH (t) (θ) based on f (θ), and perhaps
diversification/intensification penalties or incentives.
2 Identify neighbors of θ (t) , namely the members of N (θ (t) ).
3 Rank the neighbors in decreasing order of improvement, as evaluated
by fH (t) .
4 Select the highest ranking neighbor.
5 Is this neighbor currently on the tabu list? If not, go to step 8.
6 Does this neighbor pass an aspiration criterion? If so, go to step 8.
7 If all neighbors of θ (t) have been considered and none have been
adopted as θ (t+1) , then stop. Otherwise, select the next best neighbor
and go to step 5.
8 Adopt this solution as θ (t+1) and update the tabu list.
9 Has a stopping criterion been met? If so, stop. Otherwise, increment t
and go to step 1.
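The loop above can be sketched for 0/1 model selection (maximizing f) with the flipped index as the move attribute and "new overall best" as the aspiration criterion. This is a simplified illustration, not the lecture's code: it skips over exhausted neighbors rather than stopping as in step 7, and the toy objective stands in for a real model score:

```python
import random

def tabu_search(f, p, tenure=7, max_iter=50, seed=0):
    """Minimal tabu search sketch: moves are 1-changes, the attribute of
    a move is the flipped index, and an attribute stays tabu for `tenure`
    iterations unless the aspiration criterion (new overall best) is met."""
    rng = random.Random(seed)
    theta = tuple(rng.randint(0, 1) for _ in range(p))
    best, best_val = theta, f(theta)
    tabu_until = {}                      # attribute (index) -> expiry iteration
    for t in range(max_iter):
        # Rank the 1-change neighbors in decreasing order of objective value.
        moves = sorted(range(p),
                       key=lambda i: f(theta[:i] + (1 - theta[i],) + theta[i+1:]),
                       reverse=True)
        for i in moves:
            cand = theta[:i] + (1 - theta[i],) + theta[i+1:]
            # Skip tabu moves unless they beat the best solution found so far.
            if tabu_until.get(i, -1) > t and f(cand) <= best_val:
                continue
            theta = cand
            tabu_until[i] = t + tenure   # forbid reversing this attribute
            if f(theta) > best_val:
                best, best_val = theta, f(theta)
            break
    return best

# Toy objective (stand-in for a model score): include the even-indexed predictors.
f = lambda th: sum(g if i % 2 == 0 else -g for i, g in enumerate(th))
print(tabu_search(f, p=6))  # (1, 0, 1, 0, 1, 0)
```

Note that, unlike local search, the loop always moves (possibly downhill) and relies on the tabu list to avoid immediately sliding back.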
Coming up