
Artificial Intelligence

Course No. 320331, Fall 2013


Dr. Kaustubh Pathak
Assistant Professor, Computer Science
k.pathak@jacobs-university.de
Jacobs University Bremen

December 5, 2013


Course Introduction
Python Brief Introduction
Agents and their Task-environments
Goal-based Problem-solving Agents using Searching
Non-classical Search Algorithms
Games Agents Play
Logical Agents: Propositional Logic
Probability Calculus
Beginning to Learn using Naïve Bayesian Classifiers
Bayesian Networks

Course Introduction

Contents

Course Introduction
Course Logistics
What is Artificial Intelligence (AI)?
Foundations of AI
History of AI
State of the Art

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

3 / 475

Course Logistics

Grading
Break-down:
Easy quizzes: 15%. (Auditors: taking 75% of the quizzes is necessary.)
Homeworks (5): 25%
Mid-term exam: 30%. 23rd Oct. (Wed.), after the Reading Days.
Final exam: 30%
If you have an official excuse for a quiz/exam, a make-up will be
provided. For homeworks, make-ups will be decided on a
case-by-case basis: an official excuse covering at least three days immediately
before the deadline is necessary.
Homeworks: Python or C++.
Teaching Assistant: Vahid Azizi, v.azizi@jacobs-university.de

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

4 / 475

Course Introduction

Course Logistics

Homework Submission via Grader

Check after a week:


https://cantaloupe.eecs.jacobs-university.de/login.php

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

5 / 475

Course Logistics

Teaching Philosophy
No question will be ridiculed.
Some questions may be taken offline or might be postponed.
Homeworks are where you really learn!
Not all material will be in the slides. Some material will be derived on
the board - you should take lecture-notes yourselves.
Material done on the board is especially likely to appear in
quizzes/exams.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

6 / 475

Course Introduction

Course Logistics

Expert of the Day


At the beginning of each lecture, a student will summarize the last
lecture in 5 minutes (taking more than 7 minutes will be penalized).
She/He can also highlight things which need more clarification.
A student will volunteer at the end of each lecture to be the
expert in the next lecture.
Your participation counts as 1 quiz. Everyone should do it at least
once.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

7 / 475

December 5, 2013

8 / 475

Course Logistics

Coming Up...

Our Next Expert Is?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

Course Logistics

Textbooks

Main textbook:
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern
Approach, 3rd Edition, 2010, Pearson International Edition.
Other references:
Uwe Schöning, Logic for Computer Scientists, English 2001,
German 2005, Birkhäuser.
Daphne Koller and Nir Friedman, Probabilistic Graphical Models:
Principles and Techniques, 2009, MIT Press.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

9 / 475

Course Logistics

Syllabus
Introduction to AI; Intelligent agents: Chapters 1, 2.
A Brief Introduction to Python (skipped this year).
Solving problems by Searching:
BFS, DFS, A* search, with proofs: Chapter 3.
Sampling Discrete Probability Distributions, Simulated Annealing, Genetic Algorithms, with a real-world example: Chapter 4.
Adversarial search (Games): Minimax, α–β pruning: Chapter 5.
Logical Agents: Chapter 7; also Schöning's book.
Propositional Logic: Inference with Resolution.
Uncertain Knowledge & Reasoning:
Introduction to Probabilistic Reasoning: Chapter 13.
Bayesian Networks: Various Inference Approaches: Chapter 14; also Koller's book.
Introduction to Machine-Learning:
Supervised Learning: Information Entropy, Decision Trees, ANNs: Chapter 18.
Model Estimation: Priors, Maximum Likelihood, Kalman Filter, EKF, RANSAC.
Learning Probabilistic Models: Chapter 20.
Unsupervised Learning: Clustering (K-Means, Mean-Shift Algorithm).
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

10 / 475

Course Introduction

What is Articial Intelligence (AI)?

Defining AI
Human-centered vs. Rationalist Approaches

Thinking Humanly: "[The automation of] activities that we associate with human thinking, activities such as decision-making, problem-solving, learning..." (Bellman, 1978)

Acting Humanly: "The art of creating machines that perform functions that require intelligence when performed by people." (Kurzweil, 1990)

Thinking Rationally: "The study of the computations that make it possible to perceive, reason, and act." (Winston, 1992)

Acting Rationally: "Computational Intelligence is the study of the design of intelligent agents." (Poole et al., 1998)

Articial Intelligence

Course Introduction

December 5, 2013

11 / 475

What is Articial Intelligence (AI)?

Acting Humanly

The Turing Test (1950)


The test is passed if a human interrogator,
after posing some written questions,
cannot determine whether the responses
come from a human or from a computer.

Total Turing Test


There is a video signal for the interrogator
to test the subject's perceptual abilities, as
well as a hatch to pass physical objects
through.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Figure 1: Alan Turing


(1912-1954)

December 5, 2013

12 / 475

Course Introduction

What is Articial Intelligence (AI)?

Reverse Turing Test: CAPTCHA


Completely Automated Public Turing test to tell Computers and Humans Apart

Figure 2: Source: http://www.captcha.net/

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

13 / 475

What is Articial Intelligence (AI)?

Capabilities required for passing the Turing test


The 6 main disciplines composing AI.

The Turing Test


Natural language processing
Knowledge representation
Automated reasoning
Machine learning

The Total Turing Test


Computer vision
Robotics

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

14 / 475

Course Introduction

What is Articial Intelligence (AI)?

Thinking Humanly
Trying to discover how human minds work. Three ways:
Introspection
Psychological experiments on humans
Brain imaging: Functional Magnetic Resonance Imaging (fMRI), Positron Emission Tomography (PET), EEG, etc.
Cognitive Science constructs testable theories of mind using:
Computer models from AI
Experimental techniques from psychology

Figure 3: fMRI image (source: http://www.umsl.edu/~tsytsarev)
YouTube video (1:00-4:20): Reading the mind by fMRI

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

15 / 475

What is Articial Intelligence (AI)?

Thinking Rationally

Aristotle's syllogisms (384-322 B.C.): "right thinking", deductive
logic.
The logicist tradition in AI. "Good old AI." Logic programming.
Problems:
Cannot handle uncertainty.
Does not scale up due to high computational requirements.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

16 / 475

Course Introduction

What is Articial Intelligence (AI)?

Acting Rationally

Definition 1.1 (Agent)

An agent is something that acts, i.e., it
perceives the environment,
acts autonomously,
persists over a prolonged time-period,
adapts to change,
creates and pursues goals (by planning), etc.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

17 / 475

What is Articial Intelligence (AI)?

Acting Rationally
The Rational Agent Approach

Definition 1.2 (Rational Agent)

A rational agent is one that acts so as to achieve the best outcome, or,
when there is uncertainty, the best expected outcome. This approach is
more general, because:
Rationality is more general than logical inference, e.g. reflex actions.
Rationality is more amenable to scientific development than approaches
based on human behavior or thought.
Rationality is well defined mathematically: in a way, it is just
optimization under constraints. When, due to computational
demands in a complicated environment, the agent cannot maintain
perfect rationality, it resorts to limited rationality.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

18 / 475

Course Introduction

What is Articial Intelligence (AI)?

Acting Rationally
The Rational Agent Approach

This course therefore concentrates on general principles of rational


agents and on components for constructing them.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

19 / 475

Foundations of AI

Disciplines Contributing to AI. I


Philosophy
Rationalism: Using the power of reasoning to understand the world.
How does the mind arise from the physical brain?
Dualism: Part of the mind is separate from matter/nature. Proponent:
René Descartes, among others.
Materialism: The brain's operation constitutes the mind. Claims that free
will is just the way the perception of available choices appears to the
choosing entity.

Mathematics
Logic, computational tractability, probability theory.

Economics
Utility theory, decision theory (probability theory + utility theory), game
theory.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

20 / 475

Course Introduction

Foundations of AI

Neuroscience
The exact way the brain enables thought is still a scientific mystery.
However, the mapping between areas of the brain and the parts of the body they
control or receive sensory input from can be found, though it can change
over the course of a few weeks.

Figure 4: The human cortex with the various lobes (frontal, parietal, occipital, temporal), the motor and visual cortices, the cerebellum, and the spinal cord shown in different colors. The
information from the visual cortex gets channeled into the dorsal ("where/how")
and the ventral ("what") streams.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

21 / 475

Foundations of AI

The Human Brain


The human brain has about 10^11 neurons with 10^14 synapses, a cycle time of
10^-3 s, and about 10^14 memory updates/sec. Refer to Fig. 1.3 in the book.

Figure 5: TED Video: Dr. Jill Bolte Taylor

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

22 / 475

Course Introduction

Foundations of AI

Psychology
Behaviorism (stimulus/response), Cognitive psychology.

Computer Engineering
Hardware and Software. Computer vision.

Linguistics
Natural language processing.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

23 / 475

Foundations of AI

Control Theory and Cybernetics

Figure 6: A typical control system with feedback. Source:


https://www.ece.cmu.edu/~koopman/des_s99/control_theory/

The basic idea of control theory is to use sensory feedback to alter system
inputs so as to minimize the error between desired and observed output.
Basic example: controlling the movement of an industrial robotic arm to a
desired orientation.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

24 / 475

Course Introduction

Foundations of AI

Control Theory and Cybernetics

Norbert Wiener (1894-1964): book Cybernetics (1948).

Modern control theory and AI have considerable overlap: both have
the goal of designing systems which maximize an objective function
over time.
The difference lies in: 1) the mathematical techniques used; 2) the
application areas.
Control theory focuses more on the calculus of continuous variables and
matrix algebra, whereas AI also uses tools of logical inference and
planning.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

25 / 475

History of AI

History of AI I
Gestation Period (1943-1955)
McCulloch and Pitts (1943) proposed a model for the neuron. Hebbian
learning (1949) for updating inter-neuron connection strengths developed.
Alan Turing published Computing Machinery and Intelligence (1950),
proposing the Turing test, machine learning, genetic algorithms, and
reinforcement learning.

Birth of AI (1956)
The Dartmouth workshop organized by John McCarthy of Stanford.

Early Enthusiasm (1952-1969)


LISP developed. Several small successes including theorem proving etc.
Perceptrons (Rosenblatt, 1962) developed.

Reality hits (1966-1973)


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

26 / 475

Course Introduction

History of AI

History of AI II
After the Sputnik launch (1957), automatic Russian-to-English translation was
attempted. It failed miserably:
1. "The spirit is willing, but the flesh is weak." was translated to
2. "The vodka is good but the meat is rotten."
The scaling-up of computational complexity could not be handled. Single-layer
perceptrons were found to have very limited representational power. Most
government funding stopped.

Knowledge-based Systems (1969-1979)

Use of expert, domain-specific knowledge and cook-book rules collected
from experts for inference.
Examples: DENDRAL (1969) for inferring molecular structure from mass
spectrometer results; MYCIN (blood infection diagnosis) with about 450 rules.

AI in Industry (1980-present)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

27 / 475

History of AI

History of AI III
Companies like DEC, DuPont, etc. developed expert systems. Industry
boomed, but not all the extravagant promises were fulfilled, leading to the "AI winter".

Return of Neural Networks (1986-present)


Back-propagation learning algorithm developed. The connectionist
approach competes with logicist and symbolic approaches. NN research
bifurcates.

AI embraces Control Theory and Statistics (1987-present)


Rigorous mathematical methods began to be reused instead of ad hoc
methods. Examples: Hidden Markov Models (HMMs), Bayesian Networks,
etc. Sharing of real-life data-sets started.

Intelligent agents (1995-present)


Growth of the Internet. AI in web-based applications (-bots).
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

28 / 475

Course Introduction

History of AI

History of AI IV
Huge data-sets (2001-present)
Learning based on very large data-sets. Example: filling in holes in a
photograph; Hays and Efros (2007). Performance went from poor with
10,000 samples to excellent with 2 million samples.

Figure 7: Source: Hays and Efros (SIGGRAPH 2007).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

29 / 475

December 5, 2013

30 / 475

History of AI

Reading Assignment (not graded)


Read Sec. 1.3 of the textbook.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

State of the Art

Successful Applications
Intelligent Software Wizards and Assistants

(a) Microsoft Office Assistant Clippit

(b) Siri

Figure 8: Wizards and Assistants.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

31 / 475

State of the Art

Logistics Planning
Dynamic Analysis and Replanning Tool (DART). Used during the Gulf War
(1990s) for the scheduling of transportation. DARPA stated that this single
application paid back DARPA's 30 years of investment in AI.
DART won DARPA's Outstanding Performance by a Contractor award, for
modification and transportation feasibility analysis of the Time-Phased Force
and Deployment Data that was used during Desert Storm.
http://www.bbn.com

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

32 / 475

Course Introduction

State of the Art

Flow Machines
2013 Best AI Video Award: http://www.aaaivideos.org

Figure 9: Video (4:53)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

33 / 475

December 5, 2013

34 / 475

State of the Art

Intelligent Textbook
2012 Best AI Video Award: http://www.aaaivideos.org

Figure 10: Video (4:53)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

State of the Art

DARPA Urban Challenge 2007

Figure 11: Video (6:05)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

35 / 475

State of the Art

3D Planar-Patches based Simultaneous Localization and


Mapping (SLAM): Scene Registration

Figure 12: Collecting Data


Registered Planar-Patches

Registered Point-Clouds

The Minimally Uncertain Maximum Consensus (MUMC) Algorithm


Related to the RANSAC (Random Sample Consensus) Algorithm that we will
study.
K. Pathak, A. Birk, N. Vaskevicius, and J. Poppinga, Fast registration based on noisy planes
with unknown correspondences for 3D mapping, IEEE Transactions on Robotics, vol. 26, no.
3, pp. 424-441, 2010.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

36 / 475

Course Introduction

State of the Art

New Sensing Technologies: Example Kinect

(a) The Microsoft Kinect 3D camera (from Wikipedia). (b) A point-cloud obtained from it (from Willow Garage).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

37 / 475

December 5, 2013

38 / 475

State of the Art

RGBD Segmentation
Unsupervised Clustering By Mean-Shift Algorithm

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

State of the Art

Object Recognition & Pose Estimation

Figure 13: IEEE Int. Conf. on Robotics & Automation (ICRA) 2011: Perception
Challenge. Our group won 2nd place, between Berkeley (1st) and Stanford (3rd).
Video (2:39)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

39 / 475

December 5, 2013

40 / 475

Python Brief Introduction

Contents

Python Brief Introduction


Data-types
Control Statements
Functions
Packages, Modules, Classes

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Built-in Data-types
Type              Example                                   Immutable?
Numbers           12, 3.4, 7788990L, 6.1+4j, Decimal        yes
Strings           "abcd", 'abc', "abc's"                    yes
Boolean           True, False                               yes
Lists             [True, 1.2, "vcf"]                        no
Dictionaries      {"A": 25, "V": 70}                        no
Tuples            ("ABC", 1, 'Z')                           yes
Sets/FrozenSets   {90, 'a'}, frozenset({'a', 2})            no / yes
Files             f = open('spam.txt', 'r')                 --
Single Instances  None, NotImplemented                      --
...

December 5, 2013

41 / 475

December 5, 2013

42 / 475

Data-types

Sequences I
str, list, tuple

Creation and Indexing


a= "1234567"
a[0]
b= [z, x, a, k]
b[-1] == b[len(b)-1], b[-1] is b[len(b)-1]
x= """This is a
multiline string"""
print x
y= me
too
print y
len(y)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Sequences II
str, list, tuple

Immutability
a[1]= 'q' # Fails: strings are immutable
b[1]= 's'
c= a;
c is a, c==a
a= "xyz"; c is a

Help
dir(b)
help(b.sort)
b.sort()
b # In-place sorting

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

43 / 475

December 5, 2013

44 / 475

Data-types

Sequences III
str, list, tuple

Slicing
a[1:2]
a[0:-1]
a[:-1], a[3:]
a[:]
a[0:len(a):2]
a[-1::-1]

Repetition & Concatenation


c=a*2
b*3
a= a + '5mn'
d= b + [abs, 1, False]
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Sequences IV
str, list, tuple

Nesting
A=[[1,2,3],[4,5,6],[7,8,9]]
A[0]
A[0][2]
A[0:-1][-1]
A[3] # Error

List Comprehension
q= [x.isdigit() for x in a]
print q
p=[(r[1]**2) for r in A if r[1]< 8] # Power
print p

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

45 / 475

December 5, 2013

46 / 475

Data-types

Sequences V
str, list, tuple

Dictionaries
D= {0:Rhine, 1:"Indus", 3:"Hudson"}
D[0]
D[6] # Error
D[6]="Volga"
dir(D)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Numbers I
Math Operations
a= 10; b= 3; c= 10.5; d= 1.2345
a/b
a//b, c//b # Floor division: b*(a//b) + (a%b) == a
d**c # Power
type(10**40) # Unlimited integers
import math
import random
dir(math)
math.pi # repr(x)
print math.pi # str(x)
s= "e is %08.3f and list is %s" % (math.e, [a,1,1.5])
random.random() # [0,1)
random.choice(['apple','orange','banana','kiwi'])

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

47 / 475

December 5, 2013

48 / 475

Data-types

Numbers II
Booleans

s1= True
s2= 3 < 5 <=100
not(s1 and s2) or (not s2)
x= 2 if s2 else 3
b1= 0xE210 # Hex
print b1
b2= 023 # Oct
print b2
b1 & b2 # Bitwise and
b1 | b2 # Bitwise or
b1 ^ b2 # Bitwise xor
b1 << 3 # Shift b1 left by 3 bits
~b1 # Bitwise complement

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Dynamic Typing I

Variables are names and have no types. They can refer to objects of
any type. Type is associated with objects.

a= "abcf"
b= "abcf"
a==b, a is b
a= 2.5
Objects are garbage-collected automatically.
Shared references

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

49 / 475

December 5, 2013

50 / 475

Data-types

Dynamic Typing II
a= [4,1,5,10]
b= a
b is a
a.sort()
b is a
a.append('w')
a
b is a
a= a + ['w']
a
b is a
b
x= 42
y= 42
x is y, x==y
x= [1,2,3]; y= [1,2,3]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Dynamic Typing III


x is y, x==y
x= 123; y= 123
x is y, x==y # Wassup?
# Assignments create references
L= [1,2,3]
M= ['x', L, 'c']
M
L[1]= 0
M
# To copy
L= [1,2,3]
M= ['x', L[:], 'c']
M
L[1]= 0
M

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

51 / 475

Control Statements

Control Statements I
Mind the indentation! One extra carriage return is needed to finish a block in
interactive mode.

import sys
import random
tmp= sys.stdout
sys.stdout = open('log.txt', 'a')
x= random.random()
if x < 0.25:
    [y, z]= [-1, 4]
elif 0.25 <= x < 0.75:
    print 'case 2'
    y= z= 0
else:
    z, y= 0, 2; y += 3

sys.stdout = tmp

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

52 / 475

Python Brief Introduction

Control Statements

Loops I
While
i= 0
while i < 5:
    s= raw_input("Enter an int: ")
    try:
        j= int(s)
    except:
        print 'invalid input'
        break
    else:
        print "Its square is %d" % j**2
        i += 1
else:
    print "exited normally without break"

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

53 / 475

December 5, 2013

54 / 475

Control Statements

Loops II
For
X= range(2,10,2) # [2, 4, 6, 8]
N= 7
for x in X:
    if x > N:
        print x, "is >", N
        break
else:
    print 'no number >', N, 'found'

for line in open('test.txt', 'r'):
    print line.upper()

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Functions

Functions I
Arguments are passed by assignment
def change_q(p, q):
    for i in p:
        if i not in q: q.append(i)
    p= 'abc'

x= ['a','b','c']  # Mutable
y= 'bdg'          # Immutable
print x, y
change_q(q=x, p=y)
print x, y

Output:
['a', 'b', 'c'] bdg
['a', 'b', 'c', 'd', 'g'] bdg

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

55 / 475

Functions

Functions II
Scoping rule: LEGB= Local-function, Enclosing-function(s), Global
(module), Built-ins.
v= 99
def local():
    def locallocal():
        v= u
        print "inside locallocal ", v
    u= 7; v= 2
    locallocal()
    print "outside locallocal ", v

def glob1():
    global v
    v += 1

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

56 / 475

Python Brief Introduction

Functions

Functions III

local()
print v
glob1()
print v

Output:
inside locallocal  7
outside locallocal  2
99
100

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

57 / 475

Packages, Modules, Classes

Packages, Modules I
Python Standard Library http://docs.python.org/library/

Folder structure:
root/
    pack1/
        __init__.py
        mod1.py
        pack2/
            __init__.py
            mod2.py
root should be in one of the following: 1) the program's home folder, 2)
PYTHONPATH, 3) the standard lib folder, or 4) a .pth file on the path. The
full search-path is in sys.path.
Importing
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

58 / 475

Python Brief Introduction

Packages, Modules, Classes

Packages, Modules II
Python Standard Library http://docs.python.org/library/

import pack1.mod1
import pack1.mod3 as m3
from pack1.pack2.mod2 import A,B,C

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

59 / 475

December 5, 2013

60 / 475

Packages, Modules, Classes

Classes I
Example
class Animal(object): # new style classes
    count= 0
    def __init__(self, _name):
        Animal.count += 1
        self.name= _name
    def __str__(self):
        return 'I am ' + self.name
    def make_noise(self):
        print (self.speak()+" ")*3

class Dog(Animal):
    def __init__(self, _name):
        Animal.__init__(self, _name)
        self.count= 1
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Packages, Modules, Classes

Classes II

    def speak(self):
        return "woof"
Full examples in python examples.tgz
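A minimal usage sketch of the Animal/Dog classes above (the output comments assume a fresh interpreter session in which only these two classes have been defined):

d = Dog("Rex")
print d              # calls __str__: "I am Rex"
d.make_noise()       # calls Dog.speak via Animal.make_noise: "woof woof woof "
print Animal.count   # class-level counter, incremented in Animal.__init__: 1
print d.count        # instance attribute set in Dog.__init__ shadows the class attribute: 1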

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

61 / 475

Packages, Modules, Classes

Useful External Libraries

The SciPy library is a vast Python library for scientific computations.


http://www.scipy.org/
In Ubuntu install python-scitools in the package-manager.
Library for doing linear-algebra, statistics, FFT, integration,
optimization, plotting, etc.

Boost is a very mature and professional C++ library. It has Python


bindings. Refer to:
http://www.boost.org/doc/libs/1_47_0/libs/python/doc/
For creation of Python graph data-structures (leveraging boost) look
at: http://projects.skewed.de/graph-tool/

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

62 / 475

Agents and their Task-environments

Contents

Agents and their Task-environments


Agent Types

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

63 / 475

Agents and their Task-environments

A general agent
Figure: A general agent interacting with its environment: sensors deliver percepts to the agent, and actuators carry its actions back into the environment.

Definition 3.1 (A Rational Agent)

For each possible percept sequence, a rational agent should select an
action that is expected to maximize its performance measure, given the
evidence provided by the percept sequence and whatever built-in (prior)
knowledge the agent has.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

64 / 475

Agents and their Task-environments

Properties of the Task Environment


Fully observable: the relevant environment state is fully exposed by the sensors. vs. Partially observable: e.g. a limited field-of-view sensor; or even unobservable.
Single agent vs. Multi-agent.
Deterministic: the next state is completely determined by the current state and the action of the agent. vs. Stochastic: uncertainties quantified by probabilities.
Episodic: the agent's experience is divided into atomic episodes, each independent of the last, e.g. an assembly-line robot. vs. Sequential: the current action affects future actions, e.g. a chess-playing agent.
Static: the environment is unchanging. Semi-dynamic: the agent's performance measure changes with time, but the environment is static. vs. Dynamic: the environment changes while the agent is deliberating.
Discrete: the state of the environment is discrete, e.g. chess-playing, traffic control. vs. Continuous: the state changes smoothly in time, e.g. a mobile robot.
Known: the rules of the game / laws of physics of the environment are known to the agent. vs. Unknown: the agent must learn the rules of the game.

Hardest case: partially observable, multi-agent, stochastic, sequential,
dynamic, continuous, and unknown.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

65 / 475

Agents and their Task-environments

Example of a Partially Observable Environment

Figure 14: A mobile robot operating GUI.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

66 / 475

Agents and their Task-environments

Agent Types

Agent Types

Four basic types in order of increasing generality:


Simple reflex agents
Reflex agents with state
Goal-based agents
Utility-based agents
All these can be turned into learning agents.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Agents and their Task-environments

December 5, 2013

67 / 475

Agent Types

Simple reflex agents

Figure: Schematic of a simple reflex agent: sensors determine "what the world is like now"; condition-action rules determine "what action I should do now"; actuators execute it in the environment.

Algorithm 1: Simple-Reflex-Agent
input     : percept
output    : action
persistent: rules, a set of condition-action rules
state <- Interpret-Input(percept);
rule <- Rule-Match(state, rules);
action <- rule.action;
return action
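A minimal Python sketch of this pseudocode; the rule representation, the percept format, and the thermostat example below are illustrative assumptions, not from the textbook:

def simple_reflex_agent(rules, interpret_input):
    """Build an agent function that maps a percept to an action."""
    def agent(percept):
        state = interpret_input(percept)
        for condition, action in rules:     # Rule-Match: first rule whose condition holds
            if condition(state):
                return action
        return None                         # no matching rule
    return agent

# Hypothetical example: a thermostat-like reflex agent.
rules = [(lambda t: t < 18.0, 'heat_on'),
         (lambda t: t > 22.0, 'heat_off'),
         (lambda t: True,     'do_nothing')]
agent = simple_reflex_agent(rules, interpret_input=float)
print agent("17.2")    # -> heat_on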

Articial Intelligence

December 5, 2013

68 / 475

Agents and their Task-environments

Agent Types

Model-based reflex agents I

Figure: Schematic of a model-based reflex agent. An internal state, updated using models of "how the world evolves" and "what my actions do", estimates "what the world is like now"; condition-action rules then select "what action I should do now" for the actuators.

Articial Intelligence

Agents and their Task-environments

December 5, 2013

69 / 475

Agent Types

Model-based reflex agents II

Algorithm 2: Model-Based-Reflex-Agent
input     : percept
output    : action
persistent: state, the agent's current conception of the world's state
            model, how the next state depends on the current state and action
            rules, a set of condition-action rules
            action, the most recent action, initially none
state <- Update-State(state, action, percept, model);
rule <- Rule-Match(state, rules);
action <- rule.action;
return action

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

70 / 475

Agents and their Task-environments

Agent Types

Goal-based agents

Figure 15: Schematic of a goal-based agent. Using the internal state and models of how the world evolves and of what its actions do, the agent predicts "what it will be like if I do action A" and compares this with its goals to decide "what action I should do now". Includes search and planning.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Agents and their Task-environments

December 5, 2013

71 / 475

Agent Types

Utility-based agents

Figure 16: Schematic of a utility-based agent. The agent predicts "what it will be like if I do action A" and evaluates "how happy I will be in such a state" with its utility function before choosing an action. An agent's utility function is its internalization of the performance measure.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

72 / 475

Goal-based Problem-solving Agents using Searching

Contents

Goal-based Problem-solving Agents using Searching


The Graph-Search Algorithm
Uninformed (Blind) Search
Informed (Heuristic) Search

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

73 / 475

December 5, 2013

74 / 475

Goal-based Problem-solving Agents using Searching

Problem Solving Agents


Algorithm 3: Simple-Problem-Solving-Agent
input     : percept
output    : action
persistent: seq, an action sequence, initially empty
            state, the agent's current conception of the world's state
            goal, a goal, initially null
            problem, a problem formulation
state <- Update-State(state, percept);
if seq is empty then
    goal <- Formulate-Goal(state);
    problem <- Formulate-Problem(state, goal);
    seq <- Search(problem);
    if seq = failure then return a null action
action <- First(seq);
seq <- Rest(seq);
return action

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

Searching for Solutions


State: The system-state x ∈ X parameterizes all properties of interest. The
initial state is x_0 and the set of goal-states is X_g. The set X of valid states
is called the state-space.

Actions or Inputs: At each state x, there is a set of valid actions u(x) ∈ U(x) that
can be taken by the search agent to alter the state.

State Transition Function: How a new state x' is created by applying an action u to
the current state x:
    x' = f(x, u).    (4.1)

The transition may have a cost k(x, u) > 0.

Figure 17: Nodes are states, and edges are state-transitions caused by actions (initial state x_0, valid actions u_1, u_2, goal state x_g).
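A problem of this form can be captured in Python by a small structure holding x_0, X_g, U(x), f(x,u) and k(x,u). The tiny graph and costs below are a purely illustrative toy example (they are reused in later sketches in this chapter):

# States are strings; each action is "move to a neighbouring state".
toy_problem = {
    'x0':   'A',            # initial state x_0
    'goal': {'D'},          # goal set X_g
    'actions': {            # U(x): for each state, the reachable (x_next, k(x,u)) pairs
        'A': [('B', 1.0), ('C', 4.0)],
        'B': [('C', 2.0), ('D', 5.0)],
        'C': [('D', 1.0)],
        'D': [],
    },
}

def toy_successors(x):
    """f(x,u) and k(x,u) for all u in U(x), returned as (x_next, step_cost) pairs."""
    return toy_problem['actions'][x]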


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

75 / 475

Goal-based Problem-solving Agents using Searching

Examples I

Figure 18: The 8-puzzle problem. (a) An instance, showing a start state and the goal state. (b) A node of the search-graph: it stores the state, a pointer to its parent node, the action that generated it, its depth (here 6), and its path-cost g = 6. Arrows point to parent-nodes.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

76 / 475

Goal-based Problem-solving Agents using Searching

Examples II

Figure 19: An instance of the 8-Queens problem

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

77 / 475

Goal-based Problem-solving Agents using Searching

Examples III
Figure 20: The map of Romania, with road distances (in km) between neighbouring cities. An instance of the route-planning problem given a map.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

78 / 475

Goal-based Problem-solving Agents using Searching

Examples IV

Figure 21: A 2D occupancy grid map created using Laser-Range-Finder (LRF).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

79 / 475

Goal-based Problem-solving Agents using Searching

Examples V

Figure 22: Result of the A* path-planning algorithm on a multi-resolution quad-tree.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

80 / 475

Goal-based Problem-solving Agents using Searching

The Graph-Search Algorithm

Graph-Search
Compare with Textbook Fig. 3.7

Algorithm 4: Graph-Search
input: x_0, X_g
D <- {}   (the explored-set / dead-set / passive-set);
F.Insert(< x_0, g(x_0) = 0, ℓ(x_0) = h(x_0) >)   (the frontier / active-set);
while F not empty do
    < x, g(x), ℓ(x) > <- F.Choose()   (remove the best x from F);
    if x ∈ X_g then return SUCCESS;
    D <- D ∪ {x};
    for u ∈ U(x) do
        x' <- f(x, u),   g(x') <- g(x) + k(x, u);
        if (x' ∉ D) and (x' ∉ F) then
            F.Insert(< x', g(x'), ℓ(x') = g(x') + h(x', X_g) >);
        else if (x' ∈ F) then
            F.Resolve-Duplicate(x', g(x'), ℓ(x'));
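A compact Python sketch of Algorithm 4. It uses heapq as the priority-queue F and, as a deliberate simplification, emulates Resolve-Duplicate lazily: instead of re-ordering the queue, improved nodes are pushed again and stale entries are skipped on removal. The successors interface is the one from the toy problem formulation above.

import heapq

def graph_search(x0, goals, successors, h=lambda x: 0.0):
    """Generic Graph-Search. successors(x) yields (x_next, step_cost) pairs.
    With h == 0 this behaves like Dijkstra / uniform-cost search; with an
    admissible, consistent h it behaves like A*.  Returns (path, cost) or None."""
    g = {x0: 0.0}                       # best path-cost g(x) found so far
    parent = {x0: None}
    frontier = [(h(x0), x0)]            # priority = l(x) = g(x) + h(x)
    explored = set()                    # the explored set D
    while frontier:
        _, x = heapq.heappop(frontier)
        if x in explored:               # stale duplicate entry; skip it
            continue
        if x in goals:                  # goal test at selection time
            path, node = [], x
            while node is not None:
                path.append(node)
                node = parent[node]
            return list(reversed(path)), g[x]
        explored.add(x)
        for x_next, k in successors(x):
            g_new = g[x] + k
            if x_next not in explored and g_new < g.get(x_next, float('inf')):
                g[x_next] = g_new       # Insert, or Resolve-Duplicate with a lower cost
                parent[x_next] = x
                heapq.heappush(frontier, (g_new + h(x_next), x_next))
    return None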

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

81 / 475

The Graph-Search Algorithm

Measuring Problem-Solving Performance

Completeness: Is the algorithm guaranteed to find a solution if there
is one?
Optimality: Does the strategy find optimal solutions?
Time & Space Complexity: How long does the algorithm take and
how much memory is needed?
Branching factor b: the maximum number of successors (children) of
any node.
Depth d: the shallowest goal-node level.
Max-length m: the maximum length of any path in the state-space.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

82 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

Breadth-First Search (BFS)


The frontier F is implemented as a FIFO queue. The oldest element
is chosen by Choose().
For a finite graph, it is complete, and optimal if all edges have the same
cost. It finds the shallowest goal node.
Graph-Search can return as soon as a goal-state is generated
(at child-generation time), rather than when it is selected for expansion.
Number of nodes generated: b + b^2 + ... + b^d = O(b^d). This is the
space and time complexity.
The explored set will have O(b^(d-1)) nodes and the frontier will have
O(b^d) nodes.
Memory becomes more critical than computation time, e.g. for
b = 10, d = 12, 1 KB/node, the search-time is 13 days, and the
memory requirement is 1 petabyte (= 10^15 bytes).
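A minimal BFS sketch with a FIFO frontier and the early goal test just described; successors(x) yields (child, step_cost) pairs as in the earlier toy example (the costs are ignored, since BFS minimizes the number of edges):

from collections import deque

def breadth_first_search(x0, goals, successors):
    """Returns a path with the fewest edges from x0 to a goal, or None."""
    if x0 in goals:
        return [x0]
    parent = {x0: None}
    frontier = deque([x0])             # FIFO queue
    while frontier:
        x = frontier.popleft()         # Choose(): the oldest element
        for x_next, _ in successors(x):
            if x_next not in parent:   # neither explored nor already in the frontier
                parent[x_next] = x
                if x_next in goals:    # goal test at generation time
                    path, node = [], x_next
                    while node is not None:
                        path.append(node)
                        node = parent[node]
                    return list(reversed(path))
                frontier.append(x_next)
    return None

print breadth_first_search(toy_problem['x0'], toy_problem['goal'], toy_successors)
# ['A', 'B', 'D']  (fewest edges, not necessarily the cheapest path)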
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

83 / 475

Uninformed (Blind) Search

BFS Example

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

84 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

Dijkstra Algorithm or Uniform-Cost Search


The frontier F is implemented as a priority-queue. Choose() selects
the element with the highest priority, i.e. the minimum path-length
g (x).
F.Resolve-Duplicate(x ) function on line 2 updates the path-cost
g (x ) of x in the frontier F, if the new value is lower than the stored
value. If the cost is decreased, the old parent is replaced by the new
one. The priority queue is reordered to reect the change.
It is complete and optimum.
When a node x is chosen from the priority-queue, the minimum length
path from x0 to it has been found. Its length is denoted as g (x).
In other words, the optimum path-lengths of all the explored nodes in
the set D have already been found.
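With the generic graph_search sketch given after Algorithm 4, uniform-cost search is simply the case h ≡ 0, e.g. on the earlier toy graph:

# Assumes toy_problem, toy_successors and graph_search from the earlier sketches.
path, cost = graph_search(toy_problem['x0'], toy_problem['goal'], toy_successors)
print path, cost    # ['A', 'B', 'C', 'D'] 4.0   (h defaults to 0, i.e. Dijkstra)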

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

85 / 475

Uninformed (Blind) Search

Correctness of Dijkstra Algorithm


Figure 23: The graph separation by the frontier F: the explored set D (containing x_0) is separated from the unexplored part of the graph (which contains x_g) by the frontier. The node x in the frontier is chosen for further expansion. The set D is a tree.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

86 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

Correctness of Dijkstra Algorithm: Observations


Unexplored nodes can only be reached through the frontier nodes.
An existing frontier node xf s cost can only be reduced thorough a
node xc which currently has been chosen from the priority-queue as it
has the smallest cost in the frontier: This will be done by the
RESOLVE-DUPLICATE function. Afterwards, parent(xf )= xc .
Note that D remains a tree.
The frontier expands only through the unexplored children of the
chosen frontier node, all of the children will have costs worse than
their parent.
The costs of the successively chosen frontier nodes are
non-decreasing.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

87 / 475

Uninformed (Blind) Search

Correctness of Dijkstra Algorithm: Proof by Induction

Theorem 4.1 (When a frontier node x_c is chosen for expansion, its
optimum path has been found)

Proof.
The proof will be done as part of the proof of optimality of the A* algorithm,
as the Dijkstra Algorithm is a special case of the A* algorithm. Refer to
Lemma 4.6.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

88 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

Depth-first Search (DFS)

The frontier F is implemented as a LIFO stack. The newest element
is chosen by Choose().
For a finite graph, it is complete, but not optimal.
Explored nodes with no descendants in the frontier can be removed
from memory! This gives a space-complexity advantage: O(bm)
nodes. This happens automatically if the algorithm is written
recursively.
Assuming that nodes at the same depth as the goal-node have no
successors, with b = 10, d = 16, 1 KB/node, DFS will require about 7 trillion
(7 x 10^12) times less space than BFS! That is why it is popular in AI
research.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

89 / 475

Uninformed (Blind) Search

Algorithm 5: Depth-limited-Search
input: current-state x, depth d
if x ∈ X_g then
    return SUCCESS;
else if d = 0 then
    return CUTOFF
else
    for u ∈ U(x) do
        x' <- f(x, u);
        result <- Depth-limited-Search(x', d - 1);
        if result = SUCCESS then
            return SUCCESS
        else if result = CUTOFF then
            cutoff-occurred <- true
    if cutoff-occurred then return CUTOFF;
    else return NOT-FOUND;

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

90 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

DFS Example
Figure: DFS example on a binary tree with goal node M; explored nodes with no frontier descendants are removed from memory as the search proceeds.

91 / 475

Uninformed (Blind) Search

Iterative Deepening Search (IDS)

As O(b^d) >> O(b^(d-1)), one can combine the benefits of BFS and DFS.

All the work from the previous iteration is redone, but this is
acceptable, as the frontier-size (the last level) is dominant.

Algorithm 6: Iterative-Deepening-Search
for d = 0 to ∞ do
    result <- Depth-limited-Search(x_0, d);
    if result ≠ CUTOFF then
        return result
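A direct Python transcription of Algorithms 5 and 6; it returns only the SUCCESS/CUTOFF/NOT-FOUND status (not the path), keeps no explored set, and so assumes a tree-shaped state space as in the example figures. successors(x) yields (child, step_cost) pairs as before.

SUCCESS, CUTOFF, NOT_FOUND = 'success', 'cutoff', 'not-found'

def depth_limited_search(x, goals, successors, d):
    if x in goals:
        return SUCCESS
    if d == 0:
        return CUTOFF
    cutoff_occurred = False
    for x_next, _ in successors(x):
        result = depth_limited_search(x_next, goals, successors, d - 1)
        if result == SUCCESS:
            return SUCCESS
        if result == CUTOFF:
            cutoff_occurred = True
    return CUTOFF if cutoff_occurred else NOT_FOUND

def iterative_deepening_search(x0, goals, successors, d_max=50):
    for d in range(d_max + 1):          # 'to infinity' in the pseudocode
        result = depth_limited_search(x0, goals, successors, d)
        if result != CUTOFF:
            return result, d
    return CUTOFF, d_max

print iterative_deepening_search('A', {'D'}, toy_successors)   # ('success', 2)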

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

92 / 475

Goal-based Problem-solving Agents using Searching


Figure: Four iterations of Iterative Deepening Search on a binary tree, with depth limits 0, 1, 2, and 3.

December 5, 2013

93 / 475

December 5, 2013

94 / 475

Informed (Heuristic) Search

Bellman's Principle of Optimality


Theorem 4.2
All subpaths of an optimal path are also optimal.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

A* Search

Est. path-cost from x_0 to x_g via x = g(x) + est. path-cost from x to x_g,    (4.2)
    ℓ(x) = g(x) + h(x),                                                        (4.3)
    ℓ(x_g) = g(x_g), as h(x_g) = 0.                                            (4.4)

h(x) is a heuristically estimated cost, e.g., for the map route-finding
problem, h(x) = ||x_g - x|| (the straight-line distance to the goal).
ℓ(x) is the estimated cost of the cheapest solution through node x.

A* is a Graph-Search where the frontier F is a priority-queue with
higher priority given to lower values of the evaluation function ℓ(x).
If no heuristic is used, i.e. h(x) ≡ 0, A* reduces to Dijkstra's
algorithm.
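As a concrete illustration, the generic graph_search sketch given after Algorithm 4 can be run with the straight-line-distance heuristic of Fig. 24 on a small fragment of the Romania map of Fig. 20 (only a handful of the map's roads are reproduced here, with the usual textbook distances):

# A small fragment of the road map (bidirectional edges, costs in km).
edges = {
    'Arad':           [('Zerind', 75), ('Sibiu', 140), ('Timisoara', 118)],
    'Zerind':         [('Arad', 75), ('Oradea', 71)],
    'Oradea':         [('Zerind', 71), ('Sibiu', 151)],
    'Timisoara':      [('Arad', 118), ('Lugoj', 111)],
    'Lugoj':          [('Timisoara', 111)],
    'Sibiu':          [('Arad', 140), ('Oradea', 151), ('Fagaras', 99), ('Rimnicu Vilcea', 80)],
    'Fagaras':        [('Sibiu', 99), ('Bucharest', 211)],
    'Rimnicu Vilcea': [('Sibiu', 80), ('Pitesti', 97)],
    'Pitesti':        [('Rimnicu Vilcea', 97), ('Bucharest', 101)],
    'Bucharest':      [('Fagaras', 211), ('Pitesti', 101)],
}
# hSLD values from Fig. 24 (straight-line distance to Bucharest).
h_sld = {'Arad': 366, 'Zerind': 374, 'Oradea': 380, 'Timisoara': 329, 'Lugoj': 244,
         'Sibiu': 253, 'Fagaras': 176, 'Rimnicu Vilcea': 193, 'Pitesti': 100,
         'Bucharest': 0}

path, cost = graph_search('Arad', {'Bucharest'},
                          successors=lambda x: edges[x],
                          h=lambda x: h_sld[x])
print path, cost   # ['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'] 418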
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

95 / 475

Informed (Heuristic) Search

A* Search
Resolve-Duplicate

As in Dijkstra's Algorithm, F.Resolve-Duplicate(x') in
Algorithm 4 updates the cost ℓ(x') in the frontier F if the new value is lower
than the stored value. If this occurs, the old parent of x' is replaced by the
new one. The priority queue is reordered to reflect the change.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

96 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Example heuristic function

Arad          366      Mehadia          241
Bucharest       0      Neamt            234
Craiova       160      Oradea           380
Drobeta       242      Pitesti          100
Eforie        161      Rimnicu Vilcea   193
Fagaras       176      Sibiu            253
Giurgiu        77      Timisoara        329
Hirsova       151      Urziceni          80
Iasi          226      Vaslui           199
Lugoj         244      Zerind           374

Figure 24: Values of hSLD: straight-line distances to Bucharest.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

97 / 475

Informed (Heuristic) Search

Conditions for Optimality of A* I

A* will find the optimal path if the heuristic cost h(x) is admissible and
consistent.

Definition 4.3 (Admissibility)

h(x) is admissible if it never over-estimates the cost to reach the goal, i.e.,
h(x) is always optimistic.

Definition 4.4 (Consistency)

h(x) is consistent if h(x_g) = 0 and, for every child x_i (generated by action u_i) of a node x,
the triangle-inequality holds:
    h(x_g) = 0,                         (4.5)
    h(x) ≤ k(x, u_i, x_i) + h(x_i).     (4.6)

This is a stronger condition than admissibility, i.e.
consistency ⇒ admissibility.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

98 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Proof: consistency ⇒ admissibility I

We show the result by induction on n(π*[x : x_g]): the number of edges in
the optimal path (the one with the least sum of edge-costs) from a node x to the
goal x_g. We assume that we have a consistent heuristic h(x). Our
induction hypothesis is that the consistent heuristic h(x) is also admissible.

Case n(π*[x : x_g]) = 1: We use the property of consistency that
h(x_g) = 0. Nodes x with n(π*[x : x_g]) = 1 are such that their
optimum path is just one edge long, this edge being the one which
connects them to the goal; thus, h*(x) = k(x, x_g). Now, since
consistency is assumed to hold,
    h(x) ≤ k(x, x_g) + 0 = h*(x).    (4.7)

This proves admissibility. Thus our hypothesis holds for n = 1.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

99 / 475

Informed (Heuristic) Search

Proof: consistency ⇒ admissibility II

Case n(π*[x : x_g]) = m: We assume now that our hypothesis holds
for all nodes with optimal paths of at most m - 1 edges, i.e.
for the nodes S_{m-1} = {y | n(π*[y : x_g]) < m}. Let x be a node with
n(π*[x : x_g]) = m, i.e. its optimal path to the goal is m edges long. Let
the successor of x on this optimal path be x'. Since the sub-paths of
an optimal path are also optimal, x' ∈ S_{m-1}; hence the hypothesis
holds for it, i.e. h(x') ≤ h*(x'). Now, since consistency holds for x,
    h(x) ≤ h(x') + k(x, x') ≤ h*(x') + k(x, x') = h*(x).    (4.8)

The last step holds because x' is the successor of x on the latter's
optimal path to the goal. Hence admissibility has been demonstrated for a
general node x with n(π*[x : x_g]) = m.
By induction, the result holds for all values of n(π*[x : x_g]), i.e.
for the entire graph.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

100 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

A* proof I

Lemma 4.5 (ℓ(x) is non-decreasing along any optimal path, if h(x) is consistent)

Figure 25: Nodes x_0, x_m, x_n, x_p along a path. A dashed line between two nodes denotes that the nodes are
connected by a path, but are not necessarily directly connected.

To prove: Let x_p be a node for which the optimum path with cost g*(x_p)
has been found (see figure). If nodes x_m and x_n lie on this optimum path
such that x_m precedes x_n, then
    ℓ(x_m) ≤ ℓ(x_n).    (4.9)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

101 / 475

Informed (Heuristic) Search

A proof II
Proof.
First note that since xm and xn lie on the optimum path to xp , their
paths are also optimum and have lengths g (xn ) and g (xm )
respectively.
Let us rst assume that xm is the parent of xn , then

(xn ) = h(xn ) + g (xn )

(4.10)

= h(xn ) + g (xm ) + k(xm , xn )

(4.11)

(4.6)

h(xm ) + g (xm )

(xm )

(4.12)

Now if xm is not the parent of xn but a predecessor, the inequality


can be chained for every child-parent node on the path between them,
and we reach the same conclusion.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

102 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Recall Graph-Search
Algorithm 7: Graph-Search
input: x_0, X_g
D <- {}   (the explored-set / dead-set / passive-set);
F.Insert(< x_0, g(x_0) = 0 >)   (the frontier / active-set);
while F not empty do
    < x, g(x) > <- F.Choose()   (remove x from the frontier);
    if x ∈ X_g then return SUCCESS;
    D <- D ∪ {x};
    for u ∈ U(x) do
        x' <- f(x, u),   g(x') <- g(x) + k(x, u);
        if (x' ∉ D) and (x' ∉ F) then
            F.Insert(< x', g(x') + h(x', X_g) >);
        else if (x' ∈ F) then
            F.Resolve-Duplicate(x');
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

103 / 475

Informed (Heuristic) Search

A* proof I

Lemma 4.6 (At selection for expansion, a node's optimum path has been found)

To prove: In every iteration k of A*, the node x selected for expansion from
the frontier F_k (x has the minimum value of ℓ(x) in F_k) is such that, at
selection:
    g(x) = g*(x).    (4.13)

December 5, 2013

104 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Proof of Lemma 4.6

The proof is by induction on the iteration number N. At N = 1, the
frontier F_1 selects its only node x_0, with cost g(x_0) = g*(x_0) = 0.

Assume that the induction hypothesis holds for N = 1 ... k. Now we
need to show that it holds for iteration k + 1.
Assume that F_{k+1} selects x_n at this iteration. All frontier nodes have
their parents in D. At the time of selection, parent(x_n) = x_s.
Suppose that the path through x_s is not the optimal path for x_n, but
that the optimal path is π*, as shown in the figure in blue. This path exists
in the graph at iteration k + 1; whether or not it will ever be
discovered in future iterations is irrelevant.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

105 / 475

Informed (Heuristic) Search

Proof of Lemma 4.6 contd.

Figure 26: The frontier F_{k+1} separates the explored set (containing x_0, x_m and x_s) from the unexplored nodes. The assumed optimal path to x_n, shown in blue, runs x_0 --π*(x_0 : x_m)--> x_m --> x_p --π*(x_p : x_n)--> x_n. Since x_m ∈ D
at iteration k + 1, by the induction hypothesis it must have been selected by
some F_i, i < k + 1, and hence its path π*(x_0 : x_m) is optimum.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

106 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Proof of Lemma 4.6 contd.

Note that the assumed optimal path π* has to pass through a node
which is in F_{k+1}, because the frontier separates the dead nodes D
from the unknown nodes, and all expansion occurs through the frontier
nodes.
Let x_p be the first node on π* to belong to F_{k+1}. Let its parent on π*
be x_m ∈ D.
Thus, the entire assumed optimal path consists of the following
sub-paths (⊕ stands for path-concatenation):

    π*(x_0 : x_n) = π*(x_0 : x_m) ⊕ k(x_m : x_p) ⊕ π*(x_p : x_n).    (4.14)

Proof of Lemma 4.6 contd.

The cost of x_p in F_{k+1} is
    g_{k+1}(x_p) = g*(x_m) + k(x_m : x_p) = g*(x_p),                         (4.15)
    ℓ_{k+1}(x_p) = h(x_p) + g_{k+1}(x_p) = h(x_p) + g*(x_p) = ℓ(x_p).        (4.16)

As x_p lies on the optimum path to x_n, from Lemma 4.5,
    ℓ(x_p) ≤ ℓ(x_n)                                                          (4.17a)
           ≤ ℓ_{k+1}(x_n),   the cost of x_n at iteration k + 1.             (4.17b)

However, x_n was selected by F_{k+1} for expansion. Therefore,
    ℓ_{k+1}(x_n) ≤ ℓ_{k+1}(x_p) = ℓ(x_p),   using (4.16).                    (4.18)

Combining these results,
    ℓ(x_p) ≤ ℓ(x_n) ≤ ℓ_{k+1}(x_n) ≤ ℓ(x_p),   by (4.17a) and (4.18),        (4.19)
    so  ℓ(x_n) = ℓ_{k+1}(x_n).                                               (4.20)

Proof of Lemma 4.6 contd.

As ℓ(x_n) = ℓ_{k+1}(x_n),
    g*(x_n) + h(x_n) = g_{k+1}(x_n) + h(x_n),
    ⇒  g*(x_n) = g_{k+1}(x_n).    (4.21)

Therefore, the path-cost g_{k+1}(x_n) at iteration k + 1 is indeed optimum. This
contradicts our alternate optimal-path hypothesis, i.e. the assumption that the stored
path for x_n is not optimal. Therefore, we conclude that the optimal path, with cost
g*(x_n), has been found at N = k + 1 when x_n is selected by F_{k+1}.
By extension, when the goal node is selected by the frontier, its optimum
path with cost g*(x_g) has been found.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

109 / 475

Informed (Heuristic) Search

Lemma 4.7 (Monotonicity of Expansion)

Let the node selected by F_j be x_j, and that selected by F_{j+1} be x_{j+1}.
Then it must be true that
    ℓ(x_j) ≤ ℓ(x_{j+1}).    (4.22)

Why?

Proof.
Sketch: You have to consider two cases:
At iteration j + 1, x_{j+1} is a child of x_j.
At iteration j + 1, x_{j+1} is not a child of x_j.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

110 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Consequences of the Monotonicity of Expansion

Remark 4.8

This shows that at iteration N = j, if F_j selects x_j, then:

All nodes x with ℓ(x) < ℓ(x_j) have already been expanded (i.e. they
have died), and some nodes with ℓ(x) = ℓ(x_j) have also been
expanded.
In particular, when the first goal is found, all nodes x with
ℓ(x) < g*(x_g) have already been expanded, and some nodes with
ℓ(x) = g*(x_g) have also been expanded.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

111 / 475

Informed (Heuristic) Search

Figure 27: Region searched before finding a solution: Dijkstra path search.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

112 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Figure 28: Region searched before finding a solution: A* path search. The
number of nodes expanded is the minimum possible.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

113 / 475

Informed (Heuristic) Search

Properties of A*

The paths are optimal w.r.t. the cost function, but do you notice any
undesirable properties of the planned paths?
Why are there fewer colors in Fig. 28 (A*) than in Fig. 27 (Dijkstra)?
In Fig. 28, why are the red shades lighter in the beginning?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

114 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

An alternative to A for path-planning

Figure 29: Funnel-planning using wave-front expansion. The path stays away
from the obstacles.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

115 / 475

Informed (Heuristic) Search

Funnel path-planning

Figure 30: Funnel-planning using wave-front expansion in 3D. Source: Brock and
Kavraki, "Decomposition based motion-planning: A framework for real-time
motion-planning in high dimensional configuration spaces", ICRA 2001.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

116 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Algorithm 8: Funnel-Planning
input: x_0, x_g
B <- findFreeSphere(x_0)
B.parent <- ∅
Q.insert(B, ||B.center - x_g|| - B.r)
while Q not empty do
    B <- Q.getMin()
    D.insert(B)
    if x_g ∈ B then
        return [D, B]
    for s <- 1 ... N_s do
        x <- sampleOnSurface(B)
        if x ∉ D then
            C <- findFreeSphere(x)
            C.parent <- B
            Q.insert(C, ||C.center - x_g|| - C.r)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

117 / 475

Informed (Heuristic) Search

Funnel path-planning

Figure 31: Motion planning using the funnel potentials. Source: LaValle,
Planning Algorithms, http://planning.cs.uiuc.edu/.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

118 / 475

Non-classical Search Algorithms

Contents

Non-classical Search Algorithms


Hill-Climbing
Sampling from a PMF
Simulated Annealing
Genetic Algorithms

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

119 / 475

Non-classical Search Algorithms

Local Search
Figure 32: A 1-D state-space landscape (objective function vs. state space), with the global maximum, a local maximum, a flat local maximum, a shoulder, and the current state marked. The aim is to find the global maximum.

Local search algorithms are used when
The search-path itself is not important, but only the final optimal
state, e.g. the 8-Queens problem, job-shop scheduling, IC design, TSP, etc.
Memory efficiency is needed. Typically only one node is retained.
The aim is to find the best state according to an objective function to
be optimized. We may be seeking the global maximum or the
minimum. How can we reformulate the former as the latter?
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

120 / 475

Non-classical Search Algorithms

Hill-Climbing

Algorithm 9: Hill-Climbing
input : x_0, objective (value) function v(x) to maximize
output: x*, the state where a local maximum is achieved
x <- x_0;
while True do
    y <- the highest-valued child of x;
    if v(y) ≤ v(x) then return x;
    x <- y

To avoid getting stuck on plateaux:
Allow side-ways movements. Problems?
Random-restart: perform the search from many randomly chosen x_0 till an
optimal solution is found (see the Python sketch below).

Figure 33: A ridge. The local maxima are not directly connected to each other.
connected to each other.

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

121 / 475

Hill-Climbing

Figure 34: (a) Starting state. h(x) is the number of pairs of queens attacking
each other. Each node has 8 × 7 = 56 children. (b) A local minimum with h(x) = 1.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

122 / 475

Non-classical Search Algorithms

Sampling from a PMF

Sampling from a Probability Mass Function (pmf)


Definition 5.1 (PMF)
Given a discrete random variable A with an exhaustive and ordered (can
be user-defined, if no natural order exists) list of its possible values
[a1, a2, . . . , an], its pmf P(A) is a table with the probabilities
P(A = ai), i = 1 . . . n. Obviously, Σ_{i=1}^{n} P(A = ai) = 1.

Problem 5.2
Sampling a PMF: A uniform random-number generator on the unit interval has
the probability density function p_{u[0,1]}(x) shown below (equal to 1 on
[0, 1) and 0 elsewhere), so that

  P(x ∈ [a, b]; 0 ≤ a ≤ b < 1) = b − a.

Python's random.random() returns a sample x ∈ [0.0, 1.0). How can you use it
to sample a given discrete distribution (PMF)?
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

123 / 475

Sampling from a PMF

Definition 5.3 (Cumulative PMF F_A for a PMF P(A))

  F_A(a_j) ≜ P(A ≤ a_j) = Σ_{i=1}^{n} P(A = a_i) u(a_j − a_i)      (5.1)
                        = Σ_{i ≤ j} P(A = a_i),                    (5.3)

  where u(x) = 0 for x < 0 and u(x) = 1 for x ≥ 0.                 (5.2)

u(x) is called the discrete unit-step function or the Heaviside step
function. What is F_A(a_n)?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

124 / 475

Non-classical Search Algorithms

Sampling from a PMF

Form a vector s of half-closed intervals

  s = [ [0, F_A(a_1)), [F_A(a_1), F_A(a_2)), . . . , [F_A(a_{n−2}), F_A(a_{n−1})), [F_A(a_{n−1}), 1) ],
  and define a_0 s.t. F_A(a_0) ≜ 0.                                (5.4)

Let r_u be a sample from a uniform random-number generator in the
unit interval [0, 1). Then

  P(r_u ∈ s[i]) = P(F_A(a_{i−1}) ≤ r_u < F_A(a_i))
                = F_A(a_i) − F_A(a_{i−1})
                = P(a_i)   by (5.3).
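A small Python sketch of this interval construction, using only the standard library; the function name and arguments are ours:

    import random
    from itertools import accumulate

    def sample_pmf(values, probs):
        """Draw one value from the PMF given by parallel lists values, probs."""
        cdf = list(accumulate(probs))      # F_A(a_1), ..., F_A(a_n) = 1
        r = random.random()                # uniform sample in [0, 1)
        for a, F in zip(values, cdf):
            if r < F:                      # r falls in [F_A(a_{i-1}), F_A(a_i))
                return a
        return values[-1]                  # guard against rounding of the last F

    # Example with P(Weather):
    print(sample_pmf(["sunny", "rain", "cloudy", "snow"], [0.6, 0.1, 0.29, 0.01]))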

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

125 / 475

Simulated Annealing

Algorithm 10: Simulated-Annealing

input : x0, objective (cost) function c(x) to minimize
output: x, a locally optimum state
x ← x0 ;
for k ← k0 to ∞ do
  T ← Schedule(k) ;
  if 0 < T < ε then return x ;
  y ← a randomly selected child of x ;
  ΔE ← c(y) − c(x) ;
  if ΔE < 0 then
    x ← y
  else
    x ← y with probability P(ΔE, T),

  P(ΔE, T) = 1 / (1 + e^{ΔE/T}) ≈ e^{−ΔE/T}   (Boltzmann distribution)   (5.5)

An example of a schedule is

  T_k = T0 · ln(k0) / ln(k).                                              (5.6)

Applications:
VLSI layouts,
Factory-scheduling.
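A compact Python sketch of Algorithm 10 with the logarithmic schedule (5.6); it uses the simpler e^{−ΔE/T} acceptance probability, and neighbour() and cost() are problem-specific stand-ins:

    import math, random

    def simulated_annealing(x0, cost, neighbour, T0=100.0, k0=2, eps=1e-3):
        """Minimize cost(x) from x0 with schedule T_k = T0 ln(k0)/ln(k)."""
        x = x0
        k = k0
        while True:
            T = T0 * math.log(k0) / math.log(k)
            if T < eps:
                return x
            y = neighbour(x)                  # randomly selected child of x
            dE = cost(y) - cost(x)
            if dE < 0 or random.random() < math.exp(-dE / T):
                x = y                         # downhill always; uphill with Boltzmann prob.
            k += 1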

December 5, 2013

126 / 475

Non-classical Search Algorithms

Simulated Annealing

Another Schedule

We rst generate some random rearrangements, and use them


to determine the range of values of E that will be encountered
from move to move. Choosing a starting value of T which is
considerably larger than the largest E normally encountered,
we proceed downward in multiplicative steps each amounting to
a 10 % decrease in T . We hold each new value of T constant
for, say, 100N recongurations, or for 10N successful
recongurations, whichever comes rst. When eorts to reduce
E further become suciently discouraging, we stop.
Numerical Recipes in C: The Art of Scientic Computing.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

127 / 475

Simulated Annealing

Example: Traveling Salesman Problem


[Plots: tour cost and temperature T versus the number of iterations (nr. iteration) over one simulated-annealing run on a TSP instance.]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

128 / 475

Non-classical Search Algorithms

Simulated Annealing

Example: Traveling Salesman Problem



Figure 35: A suboptimal tour found by the algorithm in one of the runs.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

129 / 475

Genetic Algorithms

Algorithm 11: Genetic Algorithm


input : P = {x}, a population of individuals,
a Fitness() function to maximize
output: x , an individual
repeat
Pn ;
for i 1 to Size(P) do
x Random-Selection(P, Fitness()) ;
y Random-Selection(P, Fitness()) ;
c Reproduce(x, y ) ;
if small probability then Mutate(c) ;
Add c to Pn
P Pn
until x P, Fitness(x) > Threshold, or enough time
elapsed;
return best individual in P
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

130 / 475

Non-classical Search Algorithms

Genetic Algorithms

Algorithm 12: Reproduce(x, y)


N Length(x) ;
R random-number from 1 to N (cross-over point);
c Substring(x, 1, R) + Substring(y, R + 1, N) ;
return c
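A Python sketch of Reproduce and a simple mutation for the 8-queens string encoding of Fig. 36; the helper names are ours:

    import random

    def reproduce(x, y):
        """Single-point crossover of two equal-length strings."""
        n = len(x)
        r = random.randint(1, n - 1)       # cross-over point
        return x[:r] + y[r:]

    def mutate(c):
        """Replace one randomly chosen position by a random column 1..8."""
        i = random.randrange(len(c))
        return c[:i] + str(random.randint(1, 8)) + c[i + 1:]

    child = reproduce("24748552", "32752411")
    if random.random() < 0.1:              # small mutation probability
        child = mutate(child)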
[Figure panels: (a) Initial Population, (b) Fitness Function, (c) Selection, (d) Crossover, (e) Mutation. Each individual is an 8-digit string such as 24748552, shown with its fitness and selection percentage.]

Figure 36: The 8-Queens problem. The ith number in the string is the position of
the queen in the ith column. The tness function is the number of non-attacking
pairs (maximum tness 28).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

131 / 475

Genetic Algorithms

Properties of Genetic Algorithms (GA)

Crucial issue: encoding.


Schema e.g. 236 , an instance of this schema is 23689745. If
average tness of instances of schema are above the mean, then the
number of instances of the schema in the population will grow over
time. It is important that the schema makes some sense within the
semantics/physics of the problem.
GAs have been used in job-shop scheduling, circuit-layout, etc.
The identication of the exact conditions under which GAs perform
well requires further research.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

132 / 475

Non-classical Search Algorithms

Genetic Algorithms

A Detailed Example: Flexible Job Scheduling


from G. Zhang et al An eective genetic algorithm for the exible job-shop scheduling
problem, Expert Systems with Applications, vol. 38, 2011.

Figure 37: Gantt-Chart of a Schedule: Minimizing Makespan.


K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

133 / 475

December 5, 2013

134 / 475

Genetic Algorithms

GA Example: Flexible Job Scheduling

Job
J1
J2

Operation
O11
O12
O21
O22
O23

M1
2
3
4
-

K. Pathak (Jacobs University Bremen)

M2
6
8
6
7

M3
5
6
5
11

M4
3
4
5

M5
4
5
8

Articial Intelligence

Non-classical Search Algorithms

Genetic Algorithms

GA Example: Constraints

Oi(j+1) can begin only after Oij has ended.


Only a certain subset ij of machines can perform Oij .
Jio is the number of total operations for job Ji .
L=

N
i=1 Jio

total number of operations of all jobs.

Pijk is the processing-time of Oij on machine k.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

135 / 475

Genetic Algorithms

GA Example: Chromosome Representation

(a)

(b) Machine Selection Part

(c) Operation Sequence Part

Figure 38: Chromosome Representation


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

136 / 475

Non-classical Search Algorithms

Genetic Algorithms

GA Example: Decoding Chromosome

(a) Finding enough space to insert Oi(j+1)


K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

137 / 475

Genetic Algorithms

GA Example: Initial Population: Global Selection (GS)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

138 / 475

Non-classical Search Algorithms

Genetic Algorithms

GA Example: Initial Population: Local Selection (LS)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

139 / 475

Genetic Algorithms

GA Example: MS Crossover Operator

Figure 39: Machine Sequence (MS) Part Crossover

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

140 / 475

Non-classical Search Algorithms

Genetic Algorithms

GA Example: OS Crossover Operator


Precedence Preserving Order-Based Crossover (POX)

Figure 40: Operation Sequence (OS) Part: Precedence Preserving Order-Based


Crossover (POX)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

141 / 475

Genetic Algorithms

GA Example: Mutation

Figure 41: Machine Sequence (MS) Part Mutation

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

142 / 475

Non-classical Search Algorithms

Genetic Algorithms

GA Example: Run

Figure 42: A typical run of GA

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

143 / 475

December 5, 2013

144 / 475

Games Agents Play

Contents

Games Agents Play


Minimax
Alpha-Beta Pruning

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

Minimax

Zero-Sum Games
A Partial Game-Tree for Tic-Tac-Toe
[Figure: the first plies of the tic-tac-toe game tree, alternating MAX (X) and MIN (O) levels down to terminal states whose utilities for MAX are −1, 0 or +1.]

Figure 43: Each half-move is called a ply.


K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

145 / 475

Minimax

Search-Tree vs Game-Tree

The search-tree is usually a subset of the game-tree.

Example: For Chess, the game-tree is estimated to have over 10^40 nodes, with an average branching factor of about 35.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

146 / 475

Games Agents Play

Minimax

Zero-Sum Games
Nomenclature

S0 The initial state.


Player(s) The player which has the move in state s.
Actions(s) Set of legal moves in state s.
Result(s, a) The transition-model: the state resulting from applying the
action a to the state s.
Terminal-Test(s) Returns True if the game is over at s.
Utility(s) The payo for the Max player at a terminal state s.
Zero-sum game A game where the sum of utilities for both players at each
terminal state is a constant. Example: Chess:
(1, 0), (0, 1), (1/2, 1/2).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

147 / 475

Minimax

An Example 2-Ply Game


[Figure: a two-ply game tree. The MAX root has moves A1, A2, A3; each MIN node takes the minimum of its leaf utilities (here 3, 2 and 2), and the MAX root takes their maximum, 3.]

Figure 44: Each node (state) labeled with its minimax value.

Minimax(s) =
  Utility(s)                                    if Terminal-Test(s),
  max_{a ∈ Actions(s)} Minimax(Result(s, a))    if Player(s) = Max,
  min_{a ∈ Actions(s)} Minimax(Result(s, a))    if Player(s) = Min.
(6.1)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

148 / 475

Games Agents Play

Minimax

The Minimax Algorithm


Algorithm 13: Minimax-Decision
input : State s
return arg max_{a ∈ Actions(s)} Min-Value(Result(s, a))

Algorithm 14: Max-Value

input : State s
if Terminal-Test(s) then return Utility(s) ;
v ← −∞ ;
for a ∈ Actions(s) do
  v ← max(v, Min-Value(Result(s, a)))
return v

Algorithm 15: Min-Value

input : State s
if Terminal-Test(s) then return Utility(s) ;
v ← +∞ ;
for a ∈ Actions(s) do
  v ← min(v, Max-Value(Result(s, a)))
return v
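A direct Python transcription of (6.1) and Algorithm 13; the game object is an assumed stand-in bundling the functions from the nomenclature slide (terminal_test, utility, actions, result, player):

    def minimax(s, game):
        """Minimax value of state s under the game's transition model."""
        if game.terminal_test(s):
            return game.utility(s)
        values = [minimax(game.result(s, a), game) for a in game.actions(s)]
        return max(values) if game.player(s) == "MAX" else min(values)

    def minimax_decision(s, game):
        """Choose the action with the best minimax value for the MAX player."""
        return max(game.actions(s),
                   key=lambda a: minimax(game.result(s, a), game))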
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

149 / 475

Minimax

Search-Tree Complexity

Since the search is a DFS, for an average branching factor b and
maximum depth m:
  Space complexity: O(bm).
  Time complexity: O(b^m).

It turns out we can reduce the time complexity in the best case to
O(b^{m/2}) using Alpha-Beta Pruning.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

150 / 475

Games Agents Play

Alpha-Beta Pruning

Alpha-Beta Pruning

[Figure: alpha-beta pruning traced on the two-ply tree of Fig. 44; branches that cannot influence the root's minimax value are pruned without being evaluated.]
December 5, 2013

151 / 475

December 5, 2013

152 / 475

Alpha-Beta Pruning

Algorithm 16: Alpha-Beta-Search(s)


v ← Max-Value(s, α = −∞, β = +∞) ;
return the Action in Actions(s) with value v

Algorithm 17: Max-Value(s, α, β)

if Terminal-Test(s) then return Utility(s) ;
v ← −∞ ;
for a ∈ Actions(s) do
  v ← max(v, Min-Value(Result(s, a), α, β)) ;
  if v ≥ β then return v ;
  α ← max(α, v) ;
return v

Algorithm 18: Min-Value(s, α, β)

if Terminal-Test(s) then return Utility(s) ;
v ← +∞ ;
for a ∈ Actions(s) do
  v ← min(v, Max-Value(Result(s, a), α, β)) ;
  if v ≤ α then return v ;
  β ← min(β, v) ;
return v
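A Python sketch of Algorithms 16-18 in a single function; as above, game is an assumed stand-in for the game-specific hooks:

    import math

    def alpha_beta_value(s, game, alpha=-math.inf, beta=math.inf):
        """Minimax value of s with alpha-beta pruning."""
        if game.terminal_test(s):
            return game.utility(s)
        if game.player(s) == "MAX":
            v = -math.inf
            for a in game.actions(s):
                v = max(v, alpha_beta_value(game.result(s, a), game, alpha, beta))
                if v >= beta:
                    return v              # beta cut-off
                alpha = max(alpha, v)
            return v
        else:
            v = math.inf
            for a in game.actions(s):
                v = min(v, alpha_beta_value(game.result(s, a), game, alpha, beta))
                if v <= alpha:
                    return v              # alpha cut-off
                beta = min(beta, v)
            return v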
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

Alpha-Beta Pruning

Reference

Donald E. Knuth and Ronald W. Moore, An Analysis of Alpha-Beta


Pruning, Articial Intelligence, 1975.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

153 / 475

Alpha-Beta Pruning

Transposition Table

Some state-nodes may reappear in the tree: To avoid repeating their


expansion, their computed utilities can be cached in a hash-table
called the Transposition Table. This analogous to the dead-set in
the Graph-Search Algorithm.
It may not be practical to cache all visited nodes. Various heuristics
are used to decide which nodes to discard from the transposition
table.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

154 / 475

Games Agents Play

Alpha-Beta Pruning

Cuto Depth and Evaluation Functions

To achieve real-time performance, we expand the search-tree up to a


maximum depth only and replace the utility computation by a
heuristic evaluation function.
Algorithm 19: Min-Value(s, α, β, d)
if Cutoff-Test(s, d) then return Eval(s) ;
v ← +∞ ;
for a ∈ Actions(s) do
  v ← min(v, Max-Value(Result(s, a), α, β, d + 1)) ;
  if v ≤ α then return v ;
  β ← min(β, v) ;
return v

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

155 / 475

Alpha-Beta Pruning

Example Evaluation Functions


Each state s is considered to have several features, e.g. in Chess,
number of rooks, pawns, bishops, number of plys till now, etc.
Each feature can be given a weight and a weighted sum of features
can be used. Example: pawn (1), bishop (3), rook (5), queen (9).
Weighting can be nonlinear, e.g. a pair of bishops is worth more than
twice the worth of a single bishop; a bishop is more valuable in
endgame.
Read Sec 5.7 of the textbook. The (2007-2010) computer world
champion was RYBKA running on a desktop with its evaluation
function tuned by International Master Vasik Rajlich. Allegations of
plagiarization: Crafty and Fruit.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

156 / 475

Logical Agents: Propositional Logic

Contents

Logical Agents: Propositional Logic


Propositional Logic
Entailment and Inference
Inference by Model-Checking
Inference by Theorem Proving
Inference by Resolution
Inference with Denite Clauses
2SAT
Agents based on Propositional Logic
Time out from Logic

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

157 / 475

Logical Agents: Propositional Logic

Knowledge-Base
A Knowledge-Base is a set of sentences expressed in a knowledge
representation language. New sentences can be added to the KB and it
can be queried about whether a given sentence can be inferred from what
is known.
A KB-Agent is an example of a Reex-Agent explained previously.
Algorithm 20: Knowledge-Base (KB) Agent
input : KB, a knowledge-base,
t, time, initially 0.
Tell(KB, Make-Percept-Sentence(percept, t)) ;
action Ask(KB, Make-Action-Query(t)) ;
Tell(KB, Make-Action-Sentence(action, t)) ;
t t +1 ;
return action

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

158 / 475

Logical Agents: Propositional Logic

Propositional Logic

Propositional Logic
A simple knowledge representation language

Denition 7.1 (Syntax of Propositional Logic)


An atomic formula (also called an atomic sentence or a
proposition-symbol) has the form P, Q, A1, True, False, IsRaining, etc.
A formula/sentence can be defined inductively:
  All atomic formulas are formulas.
  For every formula F, ¬F is a formula, called a negation.
  For all formulas F and G, the following are also formulas:
    (F ∨ G), called a disjunction.
    (F ∧ G), called a conjunction.

If a formula F is part of another formula G, then it is called a
subformula of G.
We use the short-hand notations:
  F → G (Premise implies Conclusion) for (¬F) ∨ G,
  F ↔ G (Biconditional) for (F → G) ∧ (G → F).
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

159 / 475

Propositional Logic

Propositional Logic
A simple knowledge representation language

Syntax of Propositional Logic rewritten in BNF Grammar

  Sentence → Atomic-Sentence | Complex-Sentence
  Atomic-Sentence → True | False | P | Q | R | . . .
  Complex-Sentence → (Sentence) | [Sentence]
                   | ¬Sentence
                   | Sentence ∧ Sentence
                   | Sentence ∨ Sentence
                   | Sentence → Sentence
                   | Sentence ↔ Sentence

  Operator precedence: ¬, ∧, ∨, →, ↔

(7.1)

Axioms are sentences which are given and cannot be derived from other
sentences.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

160 / 475

Logical Agents: Propositional Logic

Propositional Logic

Semantics of Propositional Logic


The elements of the set T ≜ {0, 1}, also written {False, True} or
{F, T}, are called truth-values.

Let D be a set of atomic formulas/sentences. Then an assignment A
is a mapping A : D → T.
We can extend the mapping A to A′ : E → T, where E ⊇ D is the
set of formulas which can be built using only the atomic formulas in
D, as follows:
  For any atomic formula B_i ∈ D, A′(B_i) ≜ A(B_i).
  A′(¬P) ≜ 1 if A′(P) = 0, and 0 otherwise.
  A′((P ∧ Q)) ≜ 1 if A′(P) = 1 and A′(Q) = 1, and 0 otherwise.
  A′((P ∨ Q)) ≜ 1 if A′(P) = 1 or A′(Q) = 1, and 0 otherwise.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

161 / 475

Propositional Logic

Semantics of Propositional Logic


Truth-Table

The semantic interpretation can be shown by a truth-table.

  A(P)  A(Q)  A′(P → Q)  A′(P ↔ Q)
   1     1        1          1
   1     0        0          0
   0     1        1          0
   0     0        1          1

From now on, the distinction between A and A′ is dropped.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

162 / 475

Logical Agents: Propositional Logic

Propositional Logic

Suitable Assignment, Model, Satisfiability, Validity

If an assignment A is defined for all atomic formulas in a formula F,
then A is called suitable for F.
If A is suitable for F and A(F) = 1, then A is called a model for F
and we write A ⊨ F. Otherwise, we write A ⊭ F.
The set of all models of a formula/sentence F is denoted by M(F).
A formula F is called satisfiable if it has at least one model;
otherwise it is called unsatisfiable or contradictory.
A set of formulas 𝔉 is called satisfiable if there exists an assignment
A which is a model for all F_i ∈ 𝔉.

A formula F is called valid (or a tautology) if every suitable
assignment for F is also a model of F. In this case, we write ⊨ F.
Otherwise, we write ⊭ F.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013


163 / 475

Propositional Logic

Theorem 7.2
A formula F is valid if and only if ¬F is unsatisfiable.

Proof.
F is valid iff every suitable assignment of F is a model of F,
iff every suitable assignment of F (and hence of ¬F) is not a model of ¬F,
iff ¬F has no model, and hence is unsatisfiable.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

164 / 475

Logical Agents: Propositional Logic

Propositional Logic

Wumpus World
[Figure: the 4×4 Wumpus-World grid, with pits and their breezes, the Wumpus and its stenches, the gold, and the START square at (1,1).]
Figure 45: Actions=[Move-Forward, Turn-Left, Turn-Right, Grab, Shoot, Climb],


Percept=[Stench, Breeze, Glitter, Bump, Scream]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

165 / 475

Propositional Logic

Wumpus World

Figure 46: Actions=[Move-Forward, Turn-Left, Turn-Right, Grab, Shoot, Climb],


Percept=[Stench, Breeze, Glitter, Bump, Scream]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

166 / 475

Logical Agents: Propositional Logic

Propositional Logic

Wumpus World

Figure 47: Actions=[Move-Forward, Turn-Left, Turn-Right, Grab, Shoot, Climb],


Percept=[Stench, Breeze, Glitter, Bump, Scream]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

167 / 475

Propositional Logic

Wumpus World KB
P_{x,y} is true if there is a pit in [x, y].
W_{x,y} is true if there is a Wumpus in [x, y].
B_{x,y} is true if the agent perceives a breeze in [x, y].
S_{x,y} is true if the agent perceives a stench in [x, y].

  R1 : ¬P_{1,1}                                                (7.2)
  R2 : B_{1,1} ↔ (P_{1,2} ∨ P_{2,1})                            (7.3)
  R3 : B_{2,1} ↔ (P_{1,1} ∨ P_{2,2} ∨ P_{3,1})                  (7.4)

We also have percepts:

  R4 : ¬B_{1,1},   R5 : B_{2,1}                                 (7.5)

Query to the KB: Q = P_{1,2} or Q = P_{2,2}.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

168 / 475

Logical Agents: Propositional Logic

Entailment and Inference

Entailment

Definition 7.3 (Entailment)

The formula/sentence F entails the formula/sentence G, written F ⊨ G, iff

  M(F) ⊆ M(G).                                                  (7.6)

We also say that G is a consequence of F.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

169 / 475

Entailment and Inference

Theorem 7.4 (Deduction Theorem)

For any formulas F and G, F ⊨ G iff the formula (F → G) is valid, i.e.
true in all assignments suitable for F and G.

Proof.
Assume F ⊨ G. Let A be an assignment suitable for F and G. Then:
  If A is not a model of F, i.e. A ⊭ F, then A is a model of (F → G)
  (ref. truth-table of implication).
  If A ⊨ F, then as F ⊨ G, A ⊨ G. Hence, A is a model of (F → G).
Thus, A is always a model for (F → G). Hence, (F → G) is valid.

Assume (F → G) is valid. Hence, there does not exist an assignment
A such that A ⊨ F and A ⊭ G.
Hence, all models of F are also models of G, and so F ⊨ G.

K. Pathak (Jacobs University Bremen)

Articial Intelligence


December 5, 2013

170 / 475

Logical Agents: Propositional Logic

Entailment and Inference

Definition 7.5 (Equivalence F ≡ G)

Two formulas F and G are semantically equivalent if for every assignment
A suitable for both F and G, A(F) = A(G).

Remark 7.6 (An equivalent definition of equivalence)

F ≡ G iff F ⊨ G and G ⊨ F.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

171 / 475

Entailment and Inference

Equivalence
Example 7.7
In the following, ∧ and ∨ can be swapped to get new equivalences.

  ¬¬F ≡ F                                                        (7.7)
  F ∧ F ≡ F                           Idempotency                (7.8)
  F ∧ G ≡ G ∧ F                       Commutativity              (7.9)
  (F ∧ G) ∧ H ≡ F ∧ (G ∧ H)           Associativity              (7.10)
  F ∧ (F ∨ G) ≡ F                     Absorption                 (7.11)
  F ∧ (G ∨ H) ≡ (F ∧ G) ∨ (F ∧ H)     Distributivity             (7.12)
  ¬(F ∧ G) ≡ (¬F) ∨ (¬G)              deMorgan's Law             (7.13)
  P → Q ≡ ¬Q → ¬P                     Contraposition             (7.14)

All of them can be shown by truth-tables.


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

172 / 475

Logical Agents: Propositional Logic

Entailment and Inference

Theorem 7.8 (Proof by Contradiction or the SAT Problem)

For any formulas/sentences F and G, F ⊨ G iff the sentence (F ∧ ¬G) is
unsatisfiable.

Proof.
First, note the equivalence ¬(F → G) ≡ (F ∧ ¬G).
F ⊨ G iff the formula (F → G) is valid.
From Thm. 7.2, (F → G) is valid iff ¬(F → G), i.e. (F ∧ ¬G), is unsatisfiable.
Combining the above two results, F ⊨ G iff the sentence (F ∧ ¬G) is
unsatisfiable.

K. Pathak (Jacobs University Bremen)


Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

173 / 475

Entailment and Inference

Logical Inference

Inference is the use of entailment to draw conclusions. For example,
KB ⊨ Q shows that the formula/sentence Q can be concluded from
(or is a consequence of) what the agent knows.
KB ⊢_i Q denotes that the inference algorithm i derives Q from KB.
An algorithm which derives only entailed sentences is called sound.
An algorithm is complete if it can derive any sentence that is entailed.

The simplest inference is by model-checking: use the Deduction
Theorem 7.4, i.e. show that KB → Q is a tautology. This proves that
KB ⊨ Q.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

174 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Model-Checking Based Inference


Algorithm 21: TT-Entails(KB, Q)
input: KB, a knowledge-base; Q, a query formula/sentence
symbols ← list of proposition-symbols (atomic formulas) in KB and Q ;
return TT-Check(KB, Q, symbols, model = {})

Algorithm 22: TT-Check(KB, Q, symbols, model)
input: KB, a knowledge-base; Q, a sentence; symbols; model
if Empty(symbols) then
  if PL-True(KB, model) then
    return PL-True(Q, model)
  else
    return True // As KB is False, KB → Q is True.
else
  P ← First(symbols); tail ← Rest(symbols) ;
  return TT-Check(KB, Q, tail, model ∪ {P = True}) And
         TT-Check(KB, Q, tail, model ∪ {P = False})
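A brute-force Python sketch of this model-checking idea, enumerating all 2^n assignments; kb and query are assumed to be functions from a model (dict symbol -> bool) to bool:

    from itertools import product

    def tt_entails(kb, query, symbols):
        """Return True iff kb entails query, by enumerating all assignments."""
        for values in product([True, False], repeat=len(symbols)):
            model = dict(zip(symbols, values))
            if kb(model) and not query(model):
                return False              # a model of KB that falsifies Q
        return True

    # Wumpus check using only R2, R4, R5: does the KB entail ~P(1,2)?
    kb = lambda m: (m["B11"] == (m["P12"] or m["P21"])) and (not m["B11"]) and m["B21"]
    print(tt_entails(kb, lambda m: not m["P12"], ["B11", "B21", "P12", "P21"]))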

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

175 / 475

Inference by Model-Checking

Normal Forms
Definition 7.9 (Literal)
If P is an atomic formula, then
  P is a positive literal,
  ¬P is a negative literal.

Definition 7.10 (Conjunctive Normal Form (CNF))

If L_{i,j} are literals, then a formula F is in CNF if it is of the form

  F = ⋀_{i=1}^{n} ( ⋁_{j=1}^{m_i} L_{i,j} ),  each disjunction being Clause_i,   (7.15)

or, in set format,

  { {L_{1,1}, . . . , L_{1,m_1}}, . . . , {L_{n,1}, . . . , L_{n,m_n}} }.         (7.16)

176 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Conjunctive Normal Form (CNF): Conversion Procedure

Given a formula F:
1. Substitute in F every occurrence of a subformula of the form
     ¬¬G by G,
     ¬(G ∧ H) by (¬G ∨ ¬H),
     ¬(G ∨ H) by (¬G ∧ ¬H),
   until no such subformulas occur.
2. Substitute in F every occurrence of a subformula of the form
     K ∨ (G ∧ H) by (K ∨ G) ∧ (K ∨ H),
     (G ∧ H) ∨ K by (G ∨ K) ∧ (H ∨ K),
   until no such subformulas occur.

Example 7.11 (CNF)

  A ↔ B ≡ (¬A ∨ B) ∧ (¬B ∨ A)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

177 / 475

Inference by Model-Checking

Davis-Putnam-Logemann-Loveland (DPLL) Algorithm


TT-Entails can be made more efficient. Entailment is now to be
inferred by solving the SAT problem (Thm. 7.8): KB ⊨ Q iff the sentence
(KB ∧ ¬Q) is unsatisfiable. DPLL has 3 improvements over
TT-Entails:
Early Termination: A clause is true if any literal is true. A sentence
(in CNF) is false if any clause is false, which occurs when all its
literals are false. Sometimes a partial model suffices. Example:
(A ∨ B) ∧ (C ∨ A) is true if A = True.
Pure Symbol: A symbol which occurs with the same sign in all
clauses. Example: in (A ∨ ¬B), (¬B ∨ ¬C), (C ∨ A), A is pure positive and
B is pure negative. If a sentence has a model, then it has a model with
pure (atomic) symbols assigned so as to make their literal true: this
never makes a clause false.
Unit Clause is a single-literal clause or a clause where all literals but
one have already been assigned False by the model. To make a
unit clause true, the appropriate truth-value for the sole literal can be
chosen. Example: (C ∨ ¬B) with B already assigned True leaves the unit
clause C. Assigning a unit clause may create another: unit propagation.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

178 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Algorithm 23: DPLL-Satisfiable(s)

input: s, a sentence in propositional logic
clauses ← the set of clauses in the CNF of s ;
symbols ← a list of propositional symbols (atomic formulas) in s ;
return DPLL(clauses, symbols, { })

Algorithm 24: DPLL(clauses, symbols, model)
if every clause in clauses is true in model then return True ;
if some clause in clauses is false in model then return False ;
P, v ← Find-Pure-Symbol(symbols, clauses, model) ;
if P ≠ null then return DPLL(clauses, symbols − P, model ∪ { P = v }) ;
P, v ← Find-Unit-Clause(clauses, model) ;
if P ≠ null then return DPLL(clauses, symbols − P, model ∪ { P = v }) ;
P ← First(symbols); rest ← Rest(symbols) ;
return DPLL(clauses, rest, model ∪ { P = True }) Or DPLL(clauses,
rest, model ∪ { P = False })
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

179 / 475

Inference by Model-Checking

Literals

Definition 7.12 (Complement Literal)

The complement L̄ of a literal L is defined by

  L̄ = ¬A if L = A,   and   L̄ = A if L = ¬A.                      (7.17)

December 5, 2013

180 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Early termination restated

An assignment A (possibly partial) satises a clause if it assigns 1 to


at least one of its literals.
An assignment A (possibly partial) satises a CNF formula F , if it
satises each of its clauses.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

181 / 475

Inference by Model-Checking

Residual formula F|ℓ

Let F be a CNF formula containing a literal ℓ. Then F|ℓ denotes the
residual formula obtained by applying the partial assignment A(ℓ) = 1 to
F. This is done by:
  Removing all clauses containing ℓ, as they are satisfied by the
  assignment.
  Deleting the complement literal of ℓ from all clauses containing it. Why?
If, after this deletion, a clause becomes empty, what does it signify?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

182 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Another version of DPLL


Ouyang, How good are branching rules in DPLL? Discrete Applied Mathematics, 1998.

Algorithm 25: DPLL(F)

input: F, a formula in CNF
while F includes a clause of length at most 1 do
  if F includes an empty clause then return Unsatisfiable;
  if F includes a unit clause {ℓ} then
    F ← F|ℓ
while F includes a pure (monotone) literal ℓ do
  F ← F|ℓ

if F is empty then return Satisfiable;

Choose a literal u in F ;
if DPLL(F|u) = Satisfiable then return Satisfiable;
if DPLL(F|ū) = Satisfiable then return Satisfiable;

return Unsatisfiable
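A Python sketch of this clause/residual version of DPLL; clauses are frozensets of signed integer literals (+v / -v), a representation we choose for brevity:

    def reduce_cnf(clauses, lit):
        """The residual formula F|lit."""
        out = []
        for c in clauses:
            if lit in c:
                continue                   # clause satisfied, remove it
            out.append(c - {-lit})         # the complement literal can no longer help
        return out

    def dpll(clauses):
        """Return True iff the clause set is satisfiable."""
        while True:
            if any(len(c) == 0 for c in clauses):
                return False               # empty clause: this branch is unsatisfiable
            units = [next(iter(c)) for c in clauses if len(c) == 1]
            if not units:
                break
            clauses = reduce_cnf(clauses, units[0])   # unit propagation
        if not clauses:
            return True                    # no clauses left: satisfiable
        lits = {l for c in clauses for l in c}
        pures = [l for l in lits if -l not in lits]
        if pures:
            return dpll(reduce_cnf(clauses, pures[0]))
        u = next(iter(lits))               # branch on some literal
        return dpll(reduce_cnf(clauses, u)) or dpll(reduce_cnf(clauses, -u))

    print(dpll([frozenset({1, 2}), frozenset({-1, 3}), frozenset({-3})]))   # True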
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

183 / 475

Inference by Model-Checking

WalkSAT
A local search algorithm for satisability

Algorithm 26: WalkSAT

input : C, a set of propositional clauses; p, probability of a random-walk step;
        N, maximum flips allowed
output: a satisfying model or failure
model ← a random assignment of True/False to the symbols in C ;
for i = 1 to N do
  if model satisfies C then return model ;
  clause ← a randomly selected clause in C that is false in model ;
  if sample true with probability p then
    flip the value in model of a randomly selected symbol from clause ;
  else
    flip whichever symbol in clause maximizes the number of satisfied clauses
return failure
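A Python sketch of WalkSAT on the same signed-integer clause representation used in the DPLL sketch (a positive literal v is true iff model[v] is True):

    import random

    def walksat(clauses, p=0.5, max_flips=10000):
        """Return a satisfying model (dict var -> bool) or None after max_flips."""
        variables = {abs(l) for c in clauses for l in c}
        model = {v: random.choice([True, False]) for v in variables}
        sat = lambda c: any(model[abs(l)] == (l > 0) for l in c)
        for _ in range(max_flips):
            unsatisfied = [c for c in clauses if not sat(c)]
            if not unsatisfied:
                return model
            clause = random.choice(unsatisfied)
            if random.random() < p:
                v = abs(random.choice(list(clause)))    # random-walk step
            else:
                def score(v):                           # satisfied clauses after flipping v
                    model[v] = not model[v]
                    n = sum(1 for c in clauses if sat(c))
                    model[v] = not model[v]
                    return n
                v = max((abs(l) for l in clause), key=score)
            model[v] = not model[v]
        return None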

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

184 / 475

Logical Agents: Propositional Logic

Inference by Theorem Proving

Propositional Theorem Proving


Use inference rules and equivalence-relationships to produce a chain of
conclusions which leads to the desired goal sentence.

Inference Rules

  Modus Ponens:     from P → Q and P, infer Q, i.e. ((P → Q) ∧ P) ⊨ Q.   (7.18)
  And-Elimination:  from P ∧ Q, infer P, i.e. (P ∧ Q) ⊨ P.               (7.19)

Monotonicity of Logical Systems

The set of entailed sentences can only increase as information is added to
the KB: if KB ⊨ P, then KB ∧ R ⊨ P for a new sentence R. This assumes,
of course, that KB ∧ R is still satisfiable.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

185 / 475

Inference by Theorem Proving

Search for Proving


Initial State: The initial KB.
Action: All inference rules applied to all the sentences that match the
top half of the inference rule.
Result: The application of an action results in adding the sentence in
the bottom half of the inference rule to the KB.
Goal: The sentence were trying to prove.
The search can be performed by e.g. IDS. This search is sound, but is it
complete?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

186 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Resolution

Resolution is an inference procedure to prove unsatisability of a set of


clauses (equivalently a formula in CNF). It is:
Sound, i.e. correct.
Complete, when combined with a complete search-algorithm.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

187 / 475

Inference by Resolution

General Resolution

Definition 7.13 (Resolvent)

Let C1, C2, R be clauses. Then R is called a resolvent of C1 and C2,
written R = C1 ⊗ C2, if there is a literal L with L ∈ C1 and L̄ ∈ C2, and

  R = (C1 − {L}) ∪ (C2 − {L̄}).                                    (7.20)

Example 7.14
C1 = {A, ¬B, C, D} and C2 = {C, B, D, E, F}; resolving on B gives
R = {A, C, D, E, F}
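A Python sketch of Definition 7.13 in the set format, with literals as signed integers; it returns every resolvent of two clauses:

    def resolvents(c1, c2):
        """All clauses obtainable by resolving c1 with c2 on one complementary pair."""
        out = []
        for lit in c1:
            if -lit in c2:
                out.append(frozenset((c1 - {lit}) | (c2 - {-lit})))
        return out

    # Example 7.14 with A..F numbered 1..6: C1 = {A, -B, C, D}, C2 = {C, B, D, E, F}
    print(resolvents(frozenset({1, -2, 3, 4}), frozenset({3, 2, 4, 5, 6})))
    # the single resolvent corresponds to {A, C, D, E, F}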

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

188 / 475

Logical Agents: Propositional Logic

Inference by Resolution

General Resolution
Lemma 7.15 (Resolution Lemma)
Let F be a formula in CNF in set format. Let R be the resolvent of two
clauses C1 and C2 in F. Then F and F ∪ {R} are equivalent (Def. 7.5).

Proof.
Let A be an assignment suitable for F (and hence also for F ∪ {R}).
If A ⊨ F ∪ {R}, then clearly A ⊨ F.

Suppose A ⊨ F, i.e. A ⊨ Ci for all clauses Ci ∈ F. Let
R = (C1 − {L}) ∪ (C2 − {L̄}) for L ∈ C1 and L̄ ∈ C2.

  Case A ⊨ L: As A ⊨ C2 and A ⊭ L̄, we have A ⊨ (C2 − {L̄}).
  Hence A ⊨ R.
  Case A ⊭ L: As A ⊨ C1, we have A ⊨ (C1 − {L}). Hence A ⊨ R.
So in both cases A ⊨ R and hence A ⊨ F ∪ {R}.

We have shown that F ∪ {R} ⊨ F and F ⊨ F ∪ {R}. Thus,
F ∪ {R} ≡ F.

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

189 / 475

Inference by Resolution

Remark 7.16 (Caution)

When resolving (A ∨ B ∨ C) and (¬A ∨ ¬B), the resolvent is either
B ∨ ¬B ∨ C or A ∨ ¬A ∨ C, both of which are equivalent to True. The
resolvent is not C: resolution acts on exactly one complementary pair of
literals at a time. You can validate that
(A ∨ B ∨ C) ∧ (¬A ∨ ¬B) is not equivalent to (A ∨ B ∨ C) ∧ (¬A ∨ ¬B) ∧ C
by a truth-table.

Definition 7.17 (Resolution Closure RC(S))

The set of all clauses derivable by repeated application of resolution to a
set of clauses S.

Definition 7.18 (Resolution Applied to Contradicting Clauses)

An empty clause □ results from applying resolution to contradicting
clauses, e.g. to A and ¬A.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

190 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Explicit Model Construction for S when □ ∉ RC(S)

Algorithm 27: ModelConstruction

input : S, a set of propositional clauses;
        RC(S), the set of clauses which is the resolution closure of S and
        does not contain the empty clause □.
output: An assignment A s.t. A ⊨ RC(S).
[P1, . . . , Pn] ← the list of all symbols (atomic formulas) in S ;
for i = 1 to n do
  Find a clause C ∈ RC(S) s.t. C = (¬Pi ∨ False ∨ . . . ∨ False) after
  substituting the values of P1 . . . P_{i−1} assigned in previous iterations ;
  if such a clause C is found then
    A(Pi) ← False ;
  else
    A(Pi) ← True ;
return the constructed model A.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

191 / 475

Inference by Resolution

A Worked-out Example
Let the sentence S consist of the following clauses:
  C1 = (¬X ∨ ¬Y ∨ ¬Z), C2 = (Z ∨ ¬R ∨ S), C3 = (¬S ∨ ¬T).

Its closure RC(S) contains, in addition to S (using ⊗ to denote
resolution):
  C4 = C1 ⊗ C2 = (¬X ∨ ¬Y ∨ ¬R ∨ S),
  C5 = C2 ⊗ C3 = (Z ∨ ¬R ∨ ¬T), and
  C6 = C1 ⊗ C5 = (¬X ∨ ¬Y ∨ ¬R ∨ ¬T).

We now trace Algo. 27 for [P1, . . . , Pn] = [X, Y, R, T, S, Z]. In
the for-loop over i, the following selections are made:
  i = 1 : X = True.
  i = 2 : Y = True.
  i = 3 : R = True.
  i = 4 : T = False, due to C6 under the previous assignments.
  i = 5 : S = True. Examine C3 and C4 carefully.
  i = 6 : Z = False, due to C1 under the previous assignments.

It can be verified that, under this assignment, all clauses of RC(S)
are True.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

192 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Theorem 7.19
Algo. ModelConstruction produces a valid model for S if □ ∉ RC(S).

Proof.
We prove this by contradiction. Assume that at some iteration i = k of the
for-loop in Algo. 27, the assignment to Pk causes a clause C of RC(S) to
become False for the first time.
For this to occur, C = (False ∨ . . . ∨ False ∨ Pk) or
C = (False ∨ . . . ∨ False ∨ ¬Pk). If only one of these two is present in
RC(S), then the assignment rule chooses the appropriate value for Pk
to make A(C) = True.
The problem occurs if both are in RC(S). But in this case, their
resolvent (False ∨ . . . ∨ False) also has to be in RC(S), which means
that the resolvent is already False under the assignment of P1, . . . , P_{k−1}.
This contradicts our assumption that the first falsified clause appears
at stage k.
Thus, the construction never falsifies a clause in RC(S). It produces
a valid model for RC(S) and, in particular, for S.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

193 / 475

Inference by Resolution

Theorem 7.20 (Ground Resolution Theorem)

If a set of clauses is unsatisfiable, then the resolution closure of these
clauses contains the empty clause □.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

194 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Proof of the Ground Resolution Theorem

Proof.
Proof by contraposition: we prove that if the closure RC(S) does not
contain the empty clause □, then S is satisfiable.
If □ ∉ RC(S), we already proved that a model A ⊨ S can be
constructed recursively using Algo. 27 (ModelConstruction).
Therefore, if the closure RC(S) does not contain the empty clause □,
then S is satisfiable. This proves the contraposition, namely the
Ground Resolution Theorem.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

195 / 475

Inference by Resolution

A Resolution Algorithm for Inferring Entailment


Algorithm 28: PL-Resolution(KB, Q)
input : KB, a knowledge-base; Q, a query sentence;
output: whether KB ⊨ Q
clauses ← the set of clauses in the CNF of (KB ∧ ¬Q) ;
new ← {} ;
while True do
  for every resolvable clause-pair Ci, Cj ∈ clauses do
    resolvents ← Resolve(Ci, Cj) ;
    if □ ∈ resolvents then return True ;
    new ← new ∪ resolvents
  if new ⊆ clauses then return False // resolution closure reached ;
  clauses ← clauses ∪ new

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

196 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Wumpus World

[Figure: the resolution steps applied to the clauses of the CNF below, ending in the unit clause ¬P_{1,2}.]

Figure 48: CNF of B_{1,1} ↔ (P_{1,2} ∨ P_{2,1}) along with the observation ¬B_{1,1}. It
entails the query Q : ¬P_{1,2}

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

197 / 475

Inference with Denite Clauses

Special kinds of KB

Remark 7.21
Resolution is the most powerful algorithm for showing entailment for
a general KB .
The SAT problem is in general NP complete.
However, for some special, less general cases, we can make the
algorithm more ecient.
Two special algorithms are: HornSAT and 2SAT, which are applicable
to a KB consisting of a specic kind of clauses only.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

198 / 475

Logical Agents: Propositional Logic

Inference with Denite Clauses

HornSAT
Definition 7.22 (Definite Clause)
A clause with exactly one positive literal. Every definite clause can be written
as an implication. Example: (A ∨ ¬B ∨ ¬C) ≡ (B ∧ C → A).

Definition 7.23 (Horn Clause)

A clause with at most one positive literal. This includes definite clauses.
Horn clauses are closed under resolution. Why?
Horn clauses with no positive literals are called goal clauses.
Inference with Horn clauses can be done with forward or backward
chaining. Deciding entailment with Horn clauses can be done in time
that is linear in the size of the KB!

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

199 / 475

Inference with Denite Clauses

Forward-Chaining with Denite Clauses


Algorithm 29: PL-FC-Entails(KB, Q)
N[C] for a clause C is initialized to the number of symbols in C's premise ;
the assignment A(S) is initially False for all symbols S in the KB ;
g ← a queue of symbols, initially those known to be True in KB ;
while g ≠ ∅ do
  X ← Pop(g) ;
  if X = Q then return True ;
  if A(X) = False then
    A(X) ← True ;
    for C ∈ KB where X is in the premise of C do
      decrement N[C] ;
      if N[C] = 0 then add the conclusion of C to g
return False
The algorithm begins with the known facts (positive literals) and determines whether a single
propositional symbol, the query, is entailed by a KB of definite clauses.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

200 / 475

Logical Agents: Propositional Logic

Inference with Denite Clauses

PL-FC-Entails is complete
Every entailed atomic sentence is derived. Consider the final state of
A when the algorithm reaches a fixed point, i.e. no further inferences
are possible; in other words, g = ∅.
Claim: A can be viewed as a model of the KB: every definite clause
in the KB is True in this A.
To prove this, assume the opposite, i.e. some clause a1 ∧ . . . ∧ an → b is
False in the model. Then the premise must be True and the
conclusion b False in the model, i.e. A(b) = False.
As the premise is True in A, b must have been added to g and hence
at some point (when b was popped from g) assigned True by the
algorithm. This is a contradiction. Therefore, A ⊨ KB.

Any atomic sentence q that is entailed by the KB must be True in all
its models, and in particular in A. Hence every entailed atomic
sentence is derived by the algorithm.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

201 / 475

Inference with Denite Clauses

Example 7.24 (For PL-FC-Entails)

  P → Q
  L ∧ M → P
  B ∧ L → M
  A ∧ P → L
  A ∧ B → L
  A
  B
(7.21)
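A Python sketch of PL-FC-Entails applied to the definite-clause KB (7.21); clauses are represented as (premise-list, conclusion) pairs, a convention we choose here:

    from collections import deque

    def pl_fc_entails(clauses, facts, q):
        """Forward chaining for definite clauses: does the KB entail symbol q?"""
        count = {i: len(prem) for i, (prem, _) in enumerate(clauses)}   # unsatisfied premises
        inferred = set()
        agenda = deque(facts)
        while agenda:
            x = agenda.popleft()
            if x == q:
                return True
            if x in inferred:
                continue
            inferred.add(x)
            for i, (prem, concl) in enumerate(clauses):
                if x in prem:
                    count[i] -= 1
                    if count[i] == 0:
                        agenda.append(concl)
        return False

    kb = [(["P"], "Q"), (["L", "M"], "P"), (["B", "L"], "M"),
          (["A", "P"], "L"), (["A", "B"], "L")]
    print(pl_fc_entails(kb, ["A", "B"], "Q"))   # True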

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

202 / 475

Logical Agents: Propositional Logic

2SAT

2SAT
Applicable to KBs consisting of 2-CNFs

Figure 49: 2SAT KB as an implication graph. Credits: Wikipedia

For each clause A ∨ B, introduce the implications ¬A → B and ¬B → A as
edges in the digraph G(V, E).
The formula F is now a conjunction of such 2-literal clauses.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

203 / 475

2SAT

Reminder: DFS as dened in Cormen et al (CLRS)

Figure 50: The main DFS loop which


makes sure that the full graph is visited.

K. Pathak (Jacobs University Bremen)

Figure 51: The recursive DFS-Visit


function

Articial Intelligence

December 5, 2013

204 / 475

Logical Agents: Propositional Logic

2SAT

Kosarajus algorithm: CLRS version


Uses the DFS version from Figs. 50 and 51.

Figure 52: SCC: the pseudocode from CLRS.

G T (V , E T ) is the graph obtained by reversing the directions of all


edges in the digraph G (V , E ).
The main loop of DFS in line 3 refers to the algorithm shown in
Fig. 50 (lines 57).
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

205 / 475

2SAT

Let f(v) denote the component number of the SCC containing the vertex
v. f(v) is chosen such that it topologically sorts the condensation of
the graph G (i.e. the DAG made out of supervertices, where each
supervertex is an SCC; see the right subfigure below):

  ∀ u, v ∈ V : u ⟿ v ⟹ f(u) ≤ f(v).                              (7.22)

[Figure: a directed graph G (left) and its condensation into supervertices C1, C2, C3, C4 (right).]

Figure 53: A directed graph G and its condensation. The subscripts i for each
component Ci in the right figure have been chosen to be the same as f(v),
v ∈ Ci. Thus C1, C2, C3, C4 is a topological order.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

206 / 475

Logical Agents: Propositional Logic

2SAT

2SAT

Theorem 7.25
If ∃ v ∈ V such that f(v) = f(¬v), then F is unsatisfiable.

Proof.
Since v and ¬v lie in the same strongly connected component,
  v ⟿ ¬v   and   ¬v ⟿ v.
No truth-value can be assigned to v such that both implications are
satisfied. Hence, F is unsatisfiable.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

207 / 475

2SAT

2SAT: Finding a model


Lemma 7.26
If ∀ u ∈ V : f(u) ≠ f(¬u), then we can find a model A of F by setting

  A(u) = 1 if f(u) > f(¬u),
  A(u) = 0 if f(u) < f(¬u).                                       (7.23)

Proof.
Proof by contradiction: assume that F becomes False under this
assignment; this means some clause A ∨ B evaluates to False, i.e. both
A and B are False.

Therefore, f(A) < f(¬A) and f(B) < f(¬B). Why?

The clause A ∨ B contributes the edges ¬A → B and ¬B → A.
Therefore, f(¬A) ≤ f(B) and f(¬B) ≤ f(A).

Combining all inequalities:
  f(A) < f(¬A) ≤ f(B) < f(¬B) ≤ f(A),  a contradiction!


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

208 / 475

Logical Agents: Propositional Logic

2SAT

Runtime of 2SAT

For n 2-clauses, the implication graph has 2n edges and at most 4n
vertices.
The algorithm for finding the SCCs runs in Θ(n).
The overall runtime for 2SAT is therefore Θ(n).
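A Python sketch of the whole 2SAT procedure, assuming the networkx library is available for the SCC, condensation and topological-sort steps; clauses are pairs of signed integers:

    import networkx as nx

    def two_sat(clauses):
        """Return a satisfying assignment {var: bool}, or None if unsatisfiable."""
        g = nx.DiGraph()
        for a, b in clauses:            # clause (a OR b) contributes -a -> b and -b -> a
            g.add_edge(-a, b)
            g.add_edge(-b, a)
        sccs = list(nx.strongly_connected_components(g))
        comp = {v: i for i, scc in enumerate(sccs) for v in scc}
        cond = nx.condensation(g, sccs)                 # DAG of supervertices
        f = {c: pos for pos, c in enumerate(nx.topological_sort(cond))}   # f as in (7.22)
        model = {}
        for v in g.nodes():
            if v > 0:
                if comp[v] == comp[-v]:
                    return None         # v and -v in the same SCC (Thm. 7.25)
                model[v] = f[comp[v]] > f[comp[-v]]     # assignment rule of Lemma 7.26
        return model

    print(two_sat([(1, 2), (-1, 3), (-2, -3), (-1, -2)]))   # prints one satisfying assignment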

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

209 / 475

December 5, 2013

210 / 475

2SAT

Applications of 2SAT
http://en.wikipedia.org/wiki/2-satisfiability

Conict-free placement of geometrical objects.


Data-clustering
Scheduling

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

Agents based on Propositional Logic

Agents based on Propositional Logic


Wumpus Problem Revisited

We have a large collection of rules, written for each square:

  B_{1,1} ↔ (P_{1,2} ∨ P_{2,1})
  S_{1,1} ↔ (W_{1,2} ∨ W_{2,1})  . . .

Initial conditions: ¬P_{1,1}, ¬W_{1,1}, . . .

At least one Wumpus: W_{1,1} ∨ W_{1,2} ∨ . . . ∨ W_{4,4}.
At most one Wumpus: (¬W_{1,1} ∨ ¬W_{1,2}), . . . , (¬W_{4,3} ∨ ¬W_{4,4})

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

211 / 475

Agents based on Propositional Logic

Fluents
Aspects of the world / the agent's state which change with time should have a
time index attached to the name.
All percepts: Stench^3, Stench^4, Breeze^5.
There can be location fluents, e.g. L^t_{x,y}: the agent is in square (x, y)
at time step t.
Other properties: FacingEast^0, HaveArrow^0, WumpusAlive^0.
Percepts can be connected to the properties of the squares where
they were experienced:
  L^t_{x,y} → (Breeze^t ↔ B_{x,y})
  L^t_{x,y} → (Stench^t ↔ S_{x,y})

Actions: Forward^0, TurnRight^1, Shoot^7, Grab^8, Climb^10

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

212 / 475

Logical Agents: Propositional Logic

Agents based on Propositional Logic

Effect Axioms
Transition model: to be written for all 16 squares, for all 4
orientations, and for all actions.
  L^0_{1,1} ∧ FacingEast^0 ∧ Forward^0 → (L^1_{2,1} ∧ ¬L^1_{1,1})
If the agent takes this action, then Ask(KB, L^1_{2,1}) returns True.
Frame problem: each effect axiom has to state what remains
unchanged as a result of the action.
  Forward^t → (HaveArrow^t ↔ HaveArrow^{t+1})
  Forward^t → (WumpusAlive^t ↔ WumpusAlive^{t+1})

There is a proliferation of frame axioms (the representational frame
problem).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

213 / 475

Agents based on Propositional Logic

Successor State Axiom

Instead of focusing on actions, specify how fluents change:

  F^{t+1} ↔ ActionCausesF^t ∨ (F^t ∧ ¬ActionCausesNotF^t)         (7.24)

  HaveArrow^{t+1} ↔ (HaveArrow^t ∧ ¬Shoot^t)

  L^{t+1}_{1,1} ↔ (L^t_{1,1} ∧ (¬Forward^t ∨ Bump^{t+1}))
               ∨ (L^t_{1,2} ∧ (South^t ∧ Forward^t))
               ∨ (L^t_{2,1} ∧ (West^t ∧ Forward^t))

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

214 / 475

Logical Agents: Propositional Logic

Agents based on Propositional Logic

Sequence of Percepts and Actions I


Stench 0 Breeze 0 Glitter 0 Bump 0 Scream 0 ; Forward 0

Stench 1 Breeze 1 Glitter 1 Bump 1 Scream 1 ; TurnRight 1

Stench 2 Breeze 2 Glitter 2 Bump 2 Scream 2 ; TurnRight 2


Stench 3 Breeze 3 Glitter 3 Bump 3 Scream 3 ; Forward 3

Stench 4 Breeze 4 Glitter 4 Bump 4 Scream 4 ; TurnRight 4


Stench 5 Breeze 5 Glitter 5 Bump 5 Scream 5 ; Forward 5
Stench 6 Breeze 6 Glitter 6 Bump 6 Scream 6 .

  OK^t_{x,y} ↔ ¬P_{x,y} ∧ ¬(W_{x,y} ∧ WumpusAlive^t)
x,y
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

(7.25)

(7.26)

215 / 475

Agents based on Propositional Logic

Sequence of Percepts and Actions II



Figure 54: Ask(KB, L^6_{1,2}), Ask(KB, W_{1,3}), Ask(KB, P_{3,1})

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

216 / 475

Logical Agents: Propositional Logic

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

K. Pathak (Jacobs University Bremen)

Agents based on Propositional Logic

December 5, 2013

217 / 475

Agents based on Propositional Logic

Articial Intelligence

December 5, 2013

218 / 475

Logical Agents: Propositional Logic

Time out from Logic

Time out: illogical logical sentences of Yogi Berra

"It ain't over till it's over."
"Nobody goes there anymore; it's too crowded."
"It was impossible to get a conversation going; everybody was talking too much."
"In theory there is no difference between theory and practice; in practice there is."
"I didn't really say everything I said."

Figure 55: Photo


credits: wordpress

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

219 / 475

Time out from Logic

Formal languages
Ontological and Epistemological Commitments

  Language            | Ontological Commitment              | Epistemological Commitment
                      | (what exists in the world)          | (what an agent believes about facts)
  Propositional Logic | Facts                               | True / False / Unknown
  First-order Logic   | Facts, objects, relations           | True / False / Unknown
  Probability Theory  | Facts                               | Degree of belief ∈ [0, 1]
  Fuzzy Logic         | Facts with degree of truth ∈ [0, 1] | Known interval value

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

220 / 475

Probability Calculus

Contents

Probability Calculus
Limitations of Logic
Probability Calculus
Conditional Probabilities
Inference using a Joint Probability Distribution
Conditional Independence

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

221 / 475

Limitations of Logic

Limitations of Logical Agents


Agents need to handle uncertainty due to partial observability (limited
sensing capabilities), nondeterminism, sensing noise, etc.
Example, medical diagnosis:
  Toothache → Cavity ∨ GumProblem ∨ Abscess ∨ . . . ,
  Cavity → Toothache.
The use of logic in a domain like medical diagnosis fails because of:
  Laziness: It is too hard to list all premises and conclusions needed for
  an exceptionless rule, and hard to use such rules.
  Theoretical Ignorance: A complete theory of the domain is
  unavailable.
  Practical Ignorance: Not all test results may be available. Not all
  symptoms may manifest.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

222 / 475

Probability Calculus

Limitations of Logic

Belief State & Degree of Belief

Denition 8.1 (Belief State)


It is the agents current belief about the its own state or about the
relevant states of the environment, given:
its prior knowledge,
the history of its past actions,
the history of its observed percepts.

It is particularly important for partially-observable and/or


non-deterministic scenarios.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

223 / 475

Limitations of Logic

Belief State & Degree of Belief

For a logical agent based on propositional-logic, the belief state is in


terms of sentences which are true or false.
When information is uncertain, The agents knowledge only provides
a degree of belief in the relevant sentences. The degrees of belief are
represented in probabilities.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

224 / 475

Probability Calculus

Probability Calculus

Quantifying Degree of Belief


Let be the universal-set/space of all possible suitable assignments
for all the relevant propositional sentences/formulas of the world
under consideration.
For each assignment or possible world or elementary event A ,
the agent associates a degree of belief or probability of it occurring:
its value P(A) [0, 1].
All the elementary events A are mutually-exclusive and
exhaustive.
Mutually-exclusive: Events Ai and Aj , i = j cannot both occur
simultaneously.
Exhaustive: = i {Ai }.

Hence, to normalize its degree of belief, the agent chooses


P(A) = 1.

(8.1)

A
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

225 / 475

Probability Calculus

Quantifying Degree of Belief

Consider a propositional formula/sentence F.

  P(F is True) ≜ P(F) ≜ Σ_{A ∈ M(F)} P(A).                        (8.2)

Articial Intelligence

December 5, 2013

226 / 475

Probability Calculus

Probability Calculus

Quantifying Degree of Belief

From this, the following can be derived:

  P(F) = 0 if F is inconsistent,   P(F) = 1 if F is valid.        (8.3)

  P(¬F) = 1 − P(F).                                               (8.4)

  P(F ∨ G) = P(F) + P(G) − P(F ∧ G).  Inclusion-Exclusion Principle  (8.5)
(8.5)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

227 / 475

Probability Calculus

Random Variables (RV)


Variables in Probability Theory are called Random Variables (RV),
written with a capital first letter (e.g. Weather). Each RV has a
domain, either discrete or continuous. The domain D is the set of
possible values/instantiations (written in small letters) the RV can
take, e.g. D(Weather) = {sunny, rain, cloudy, snow}. The values
are given their natural or a user-defined order.
In AI, discrete RVs are more common. Let F be the propositional
sentence Weather = sunny; then, using (8.2), the following are
equivalent ways of writing:

  P(F) ≡ P((Weather = sunny) = True) ≡ P(Weather = sunny) ≡ P(sunny).   (8.6)

The whole probability mass distribution (pmf) of a discrete RV
(DRV) can be summarized in vector form, e.g.

  P(Weather) ≜ [P(sunny), P(rain), P(cloudy), P(snow)].           (8.7)
228 / 475

Probability Calculus

Probability Calculus

Joint Probability Distributions


Given a DRV A, with domain D(A) = [a1 , a2 , . . . , an ], its cardinality is
|A|

|D(A)| = n

(8.8)

Given DRVs A, B, . . ., the following notations are equivalent


P((A = a) (B = b) . . .) P(A = a, B = b, . . .)
P(a, b, . . .)

(8.9)

All such probabilities can be collected together in a table called the


full joint probability (mass) distribution (JPD) of A, B, . . ..
P(A, B, . . .) is of size |A| |B|

(8.10)

Note that partial joint probability distributions are also possible, e.g.
P(A = ai , B, . . .) P(ai , B, . . .) is of size |B|
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

(8.11)

229 / 475

Probability Calculus

Vector Probability Distributions

The joint probability distribution (8.9) can also be written as the
distribution of a vector of RVs:

  X = [A, B, . . .]^T,   x = [ai, bj, . . .]^T,   |X| ≜ |A| · |B| · · ·   (8.12)

  P(X) ≜ P(A, B, . . .),                                          (8.13)
  P(x) ≜ P(ai, bj, . . .).                                        (8.14)

Finally, joint probability distributions of vector RVs can be defined:

  P(X, Y, . . .),   P(x, Y, . . .),   P(x, y, . . .).              (8.15)

230 / 475

Probability Calculus

Conditional Probabilities

Conditional Probabilities

Let F be a propositional formula. Suppose the agent initially has the
belief P(F = True), as stated in (8.2):

  P(F) ≜ Σ_{A ∈ M(F)} P(A).

Let G be another propositional formula s.t. M(F) ∩ M(G) ≠ ∅.
Suppose the agent observes that G = True. Thus, the agent now
needs to update its belief about F.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

231 / 475

Conditional Probabilities

Conditional Probabilities
Since G is now known to be True, we set

  P(A) ← 0, if A ∉ M(G).                                          (8.16)

Effectively, the new universe is Ω′ = M(G), and we require P(G | G) = 1.
To get P(G | G) = 1, the agent needs to renormalize its beliefs.
Earlier, by (8.2),

  P(G) = Σ_{A ∈ M(G)} P(A).                                       (8.17)

Define

  P(A | G) = P(A) / P(G),   A ∈ M(G).                             (8.18)

Thus we retain the unit-summation property of the agent's belief set:

  P(G | G) = Σ_{A ∈ M(G)} P(A | G) = Σ_{A ∈ M(G)} P(A) / P(G) = P(G) / P(G) = 1.

December 5, 2013

232 / 475

Probability Calculus

Conditional Probabilities

Conditional Probabilities
Now we can modify (8.2):

  P(F | G) = Σ_{A ∈ M(F) ∩ M(G)} P(A | G)                         (8.19)
           = (1 / P(G)) Σ_{A ∈ M(F) ∩ M(G)} P(A)   using (8.18)
           = P(F ∧ G) / P(G).                                     (8.20)

The product rule:

  P(F ∧ G) = P(G) P(F | G).                                       (8.21)

If P(F | G) = P(F), F and G are independent. In this case,

  P(F ∧ G) = P(G) P(F).   Independence                            (8.22)

233 / 475

Conditional Probabilities

Product Rule in terms of RVs


Since F, G are arbitrary propositional formulas/sentences, we can generalize (8.21) to

P(F_1 ∧ … ∧ F_n ∧ G_1 ∧ … ∧ G_m) = P(F_1 ∧ … ∧ F_n | G_1 ∧ … ∧ G_m) · P(G_1 ∧ … ∧ G_m).   (8.23)

In terms of RVs, we can summarize the product rule in a table using previously introduced notation. Given RVs X_1, …, X_n, Y_1, …, Y_m, each with possibly different cardinality,

P(X_1, …, X_n, Y_1, …, Y_m) = P(X_1, …, X_n | Y_1, …, Y_m) · P(Y_1, …, Y_m).   (8.24)

This table-formula applies component-wise in the table.
A tabulation of the conditional probabilities P(A = a_i | B = b_j), i = 1…|A|, j = 1…|B|, is called a Conditional Probability Table (CPT).

Law of Total Probability


Let the propositional sentences G_i, i = 1…n, be such that

M(G_i) ∩ M(G_j) = ∅ if i ≠ j,  and  ∪_{i=1}^{n} M(G_i) = Ω.   (8.25)

Therefore, for any propositional formula F, its model-set can be partitioned as follows:

M(F) = ∪_{i=1}^{n} ( M(G_i) ∩ M(F) ).   (8.26)

On substituting the above in (8.2),

P(F) = Σ_{i=1}^{n} Σ_{A ∈ M(G_i) ∩ M(F)} P(A) = Σ_{i=1}^{n} P(F ∧ G_i)   (8.27)
     = Σ_{i=1}^{n} P(F | G_i) · P(G_i),  using (8.21).   (8.28)

Law of Total Probability


In terms of RVs, this is called marginalization:

P(A = a_j) = Σ_{i=1}^{|B|} P(A = a_j, B = b_i)   (8.29)
           = Σ_{i=1}^{|B|} P(A = a_j | B = b_i) · P(B = b_i).   (8.30)

Interpret the following carefully using our notation:

P(A) = Σ_{i=1}^{|B|} P(A, b_i) = Σ_{i=1}^{|B|} P(A | b_i) P(b_i),   (8.31)
P(X) = Σ_{i=1}^{|Y|} P(X, y_i) = Σ_{i=1}^{|Y|} P(X | y_i) P(y_i).   (8.32)

The LHS of the above equations is called the marginal probability, computed from the joint or conditional probability distributions respectively.
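
As a quick illustration of (8.29)-(8.32), the following minimal Python sketch marginalizes a discrete joint distribution stored as a dictionary; the toy table and its numbers are made up purely for illustration.

# Minimal sketch of marginalization (8.29): P(A) = sum_b P(A, b).
P_AB = {
    ('sunny', 'hot'): 0.40, ('sunny', 'cold'): 0.10,
    ('rain',  'hot'): 0.05, ('rain',  'cold'): 0.45,
}

def marginalize(joint, axis):
    """Sum out all variables except the one at position `axis`."""
    marg = {}
    for assignment, p in joint.items():
        key = assignment[axis]
        marg[key] = marg.get(key, 0.0) + p
    return marg

P_A = marginalize(P_AB, axis=0)   # {'sunny': 0.5, 'rain': 0.5}
P_B = marginalize(P_AB, axis=1)   # {'hot': 0.45, 'cold': 0.55}
print(P_A, P_B)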

Expectation of a DRV
Let X be a DRV with domain [x_1, x_2, …, x_n], where the x_i are ordered and have numerical values. Then the expectation of X is

E[X] = Σ_{i=1}^{n} x_i P(x_i).   (8.33)

This can also be generalized to a vector DRV X with domain [x_1, x_2, …, x_n]:

E[X] = Σ_{i=1}^{n} x_i P(x_i).   (8.34)

Theorem 8.2 (Linearity of Expectation)


For any two DRVs X and Y with ordered numerical domains, and a, b ∈ ℝ,

E[aX + bY] = a E[X] + b E[Y].   (8.35)

This holds even if X and Y are not independent, i.e. even if P(x, y) ≠ P(x) P(y) in general. Independence of RVs will be covered later.

Proof.

E[aX + bY] = Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} (a x_i + b y_j) P(x_i, y_j)
           = a Σ_{i=1}^{|X|} x_i Σ_{j=1}^{|Y|} P(x_i, y_j) + b Σ_{j=1}^{|Y|} y_j Σ_{i=1}^{|X|} P(x_i, y_j)
           = a Σ_{i=1}^{|X|} x_i P(x_i) + b Σ_{j=1}^{|Y|} y_j P(y_j) = a E[X] + b E[Y],

where the inner sums were reduced by marginalization (8.29), as derived in class.

Example: JPD of 5 Object-Classifiers I

Figure 56: Table from: Combining Pattern Classifiers by Ludmila I. Kuncheva, 2004. We are interested in classifying objects of interest into two class-types: T = 1 or T = 2. We have 5 different classifiers C_i, i = 1…5, to do this task. We show each of them 300 samples of objects of class-type T = 1. The 5 classifiers have various degrees of agreement about the deduced class, which is summarized in the table, e.g. the string 11212 means that C1 classified the object as T = 1, C2 as T = 1, C3 as T = 2, C4 as T = 1, and C5 as T = 2. The probabilities can be computed by dividing the frequencies (counts) by 300.

Example: JPD of 5 Object-Classifiers II


Then we can compute the following probabilities using the results derived so far:

P(C1 = 1, C2 = 1, C3 = 2, C4 = 1, C5 = 2 | T = 1) = 14/300.

On using marginalization (8.29),

P(C1 = 1, C2 = 1, C3 = 1 | T = 1)
  = Σ_{c4 = 1,2} Σ_{c5 = 1,2} P(C1 = 1, C2 = 1, C3 = 1, C4 = c4, C5 = c5 | T = 1)
  = (5 + 4 + 10 + 8)/300 = 27/300.

Example: JPD of 5 Object-Classifiers III


Using the product rule for RVs (8.24), one can find

P(C4 = 2, C5 = 2 | C1 = 1, C2 = 1, C3 = 1, T = 1)
  = P(C1 = 1, C2 = 1, C3 = 1, C4 = 2, C5 = 2 | T = 1) / P(C1 = 1, C2 = 1, C3 = 1 | T = 1)
  = (8/300) / (27/300) = 8/27.

We can compute the table for the RV C1 with the ordered domain [1, 2] (class-type) by marginalization. This can simply be done by summing the first two columns (where C1 = 1) and the last two columns (where C1 = 2):

P(C1 | T = 1) = [146/300, 154/300].
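
These numbers can be reproduced mechanically from the counts. In the sketch below, only the count 14 for the agreement string 11212, the count 8 for 11122, and the total 27 come from the slide; the split of the remaining counts 5, 4 and 10 over the strings 11111, 11112 and 11121 is an assumption made purely for illustration.

# Sketch: reproduce the classifier-example numbers from counts (out of 300).
counts_T1 = {'11111': 5, '11112': 4, '11121': 10, '11122': 8, '11212': 14}
n = 300

# P(C1=1, C2=1, C3=2, C4=1, C5=2 | T=1)
print(counts_T1['11212'] / n)                       # 14/300

# Marginalization (8.29): P(C1=1, C2=1, C3=1 | T=1)
p = sum(c for s, c in counts_T1.items() if s.startswith('111')) / n
print(p)                                            # 27/300

# Product rule (8.24): P(C4=2, C5=2 | C1=1, C2=1, C3=1, T=1)
print((counts_T1['11122'] / n) / p)                 # 8/27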

Normalization Constraints on CPTs


In terms of RVs, the product rule (8.21) is more often used as

P(A = a_i | B = b_j) = P(A = a_i, B = b_j) / P(B = b_j),   (8.36)
P(A | b_j) = P(A, b_j) / P(b_j).   (8.37)

The following result is very useful in practice:

Σ_{i=1}^{|A|} P(A = a_i | B = b_j) = (1 / P(B = b_j)) Σ_{i=1}^{|A|} P(A = a_i, B = b_j)
                                   = P(B = b_j) / P(B = b_j) = 1,  using (8.29).   (8.38)

This implies that P(A | b) is a valid normalized pmf.

Therefore, (8.37) can also be written as

P(A | b_j) = α P(A, b_j).   (8.39)

The normalization constant α can be used to normalize the whole vector P(A | b_j) without explicitly computing P(B = b_j).

Inference using a Joint Probability Distribution

Answering Queries on Beliefs Based on Evidence


If we divide the set of RVs in a given joint distribution into:
  the query RV X;
  the unobserved (hidden) RVs stacked into the vector RV Y;
  the observed (evidence) RVs stacked into the vector RV E, where the evidence is always instantiated E = e on what was observed,

then the given joint distribution P(X, E, Y) is analogous to a probabilistic Knowledge-Base. We can utilize (8.39) to write

P(X | E = e) = α P(X, E = e)   (8.40)
             = α Σ_{i=1}^{|Y|} P(X, E = e, Y = y_i).   (General Form)   (8.41)
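
A direct reading of (8.40)-(8.41) in code: enumerate the joint, keep the entries consistent with the evidence, sum out the hidden RVs and normalize. The dictionary-based representation and the toy numbers below are illustrative choices, not a prescribed course API.

# Minimal sketch of (8.40)-(8.41): P(X | E=e) = alpha * sum_y P(X, e, y).
def query(joint, X, evidence):
    """Return P(X | evidence) from a full joint distribution given as
    {frozenset of (var, value) pairs: probability}."""
    unnormalized = {}
    for assignment, p in joint.items():
        a = dict(assignment)
        if all(a[var] == val for var, val in evidence.items()):
            x = a[X]
            unnormalized[x] = unnormalized.get(x, 0.0) + p
    alpha = 1.0 / sum(unnormalized.values())
    return {x: alpha * p for x, p in unnormalized.items()}

# Toy joint over two binary RVs (made-up numbers):
joint = {
    frozenset({('A', True), ('B', True)}): 0.3,
    frozenset({('A', True), ('B', False)}): 0.2,
    frozenset({('A', False), ('B', True)}): 0.1,
    frozenset({('A', False), ('B', False)}): 0.4,
}
print(query(joint, 'A', {'B': True}))   # {True: 0.75, False: 0.25}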

Bayes Rule

Due to symmetry, swapping the propositional formulas F and G in (8.21), we can also write

P(F ∧ G) = P(G | F) · P(F).   (8.42)

Combining the RHS of Eqs. (8.21) and (8.42),

Bayes Rule:   P(G | F) = P(F | G) · P(G) / P(F).   (8.43)

Bayes Rule for RVs


In terms of RVs,

P(B = b_j | A = a_i) = P(A = a_i | B = b_j) · P(B = b_j) / P(A = a_i),   (8.44)
P(B | a_i) = α P(B, a_i) = α P(a_i | B) · P(B),  using (8.39).   (8.45)

If we divide the set of RVs in a given joint distribution into the RVs of interest A, B and the observed (evidence) vector RV E, then we can generalize (8.45) to

P(B | A = a_i, E = e) = P(a_i | B, e) · P(B | e) / P(a_i | e)   (8.46)
                      = α P(a_i | B, e) · P(B | e)   (General Form)
                      ∝ P(a_i | B, e) · P(B | e).   (8.47)

Significance of Bayes Rule


Nomenclature

Definition 8.3 (Likelihood): P(data | Hypothesis)
Definition 8.4 (Posterior): P(Hypothesis | data)
Definition 8.5 (Prior): P(Hypothesis)
Definition 8.6 (Evidence): P(data)

P(H | D = d) ∝ P(D = d | H) · P(H).   (8.48)

Example: Explaining Away (Probabilistic OR)

Done in class.

Example: Bayesian Estimation for DRVs


Surprise Candy

A candy manufacturer supplies candies in 5 different kinds of bags, all of which look identical:
h1: 100% cherry;
h2: 75% cherry, 25% lime;
h3: 50% cherry, 50% lime;
h4: 25% cherry, 75% lime;
h5: 100% lime.
The candies and their wrappers also look identical. You are given an unknown bag. You sample (by licking ;)) candies from it with replacement, and with each sample you want to:
  keep track of your belief P(h_i | e_1, …, e_n) in each of the 5 different hypotheses for the kind of bag it is;
  predict whether the next sample will be cherry or lime flavored.


Surprise Candy Hypothesis Update

The manufacturer has given the prior pmf of the different bag-types: P(H) = [P(h1), P(h2), P(h3), P(h4), P(h5)].
Suppose e1 = cherry, e2 = lime. Bayes' theorem gives

P(H | e1) = α [1.0·P(h1), 0.75·P(h2), 0.5·P(h3), 0.25·P(h4), 0.0·P(h5)]^T,
P(H | e1, e2) = α [1.0·0.0·P(h1), 0.75·0.25·P(h2), (0.5)^2·P(h3), 0.25·0.75·P(h4), 0.0·1.0·P(h5)]^T.

As we gather more and more samples, the probability P(h_i | e_1, …, e_n) of the correct bag-type h_i will eventually dominate.


Surprise Candy Prediction

Suppose we now want to predict the outcome of the next sample. We then need to estimate the distribution of the RV E_{n+1} with domain [cherry, lime]:

P(E_{n+1} | e_{1:n}) = Σ_{i=1}^{5} P(E_{n+1}, H = h_i | e_{1:n})
                     = Σ_{i=1}^{5} P(E_{n+1} | h_i, e_{1:n}) · P(h_i | e_{1:n})
                     = Σ_{i=1}^{5} P(E_{n+1} | h_i) · P(h_i | e_{1:n}).   (8.49)
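
The candy example can be checked numerically with a few lines of Python. The prior below is only a placeholder, since the slides leave the manufacturer's prior unspecified; the likelihoods P(cherry | h_i) follow the bag definitions above.

# Sketch of the Surprise Candy update and prediction (8.49).
prior = [0.1, 0.2, 0.4, 0.2, 0.1]            # P(h1)..P(h5), assumed placeholder
p_cherry = [1.0, 0.75, 0.5, 0.25, 0.0]       # P(cherry | h_i)

def update(belief, flavor):
    """One Bayesian step: P(h_i | e_{1:n+1}) is proportional to P(e_{n+1} | h_i) P(h_i | e_{1:n})."""
    lik = p_cherry if flavor == 'cherry' else [1.0 - p for p in p_cherry]
    unnorm = [l * b for l, b in zip(lik, belief)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict(belief, flavor):
    """P(E_{n+1} = flavor | e_{1:n}) = sum_i P(flavor | h_i) P(h_i | e_{1:n}), cf. (8.49)."""
    lik = p_cherry if flavor == 'cherry' else [1.0 - p for p in p_cherry]
    return sum(l * b for l, b in zip(lik, belief))

belief = prior
for e in ['cherry', 'lime']:                 # e1 = cherry, e2 = lime
    belief = update(belief, e)
print(belief)                                # P(H | e1, e2)
print(predict(belief, 'lime'))               # P(E3 = lime | e1, e2)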

Conditional Independence

Independence

The events represented by propositional logic sentences F and G are independent if either P(G) = 0, or

P(F | G) = P(F).   (8.50)

Using the general Bayes rule, one sees that this is a symmetric relationship and hence also holds with F and G swapped.
For independent events, (8.21) reduces to

P(F ∧ G) = P(F) · P(G).   (8.51)

Independence for DRVs

In terms of RVs, A = a_i and B = b_j are independent if either P(B = b_j) = 0, or

P(A = a_i | B = b_j) = P(A = a_i).   (8.52)

If the above holds for all a_i, b_j, we can summarize the independence of A and B as A ⟂ B, which implies

P(A | B) = P(A),  or equivalently  P(A, B) = P(A) · P(B).   (8.53)

Conditional Independence
Independence is a changeable property: two events which were earlier independent can become dependent in light of some evidence. Two events F, G are called conditionally independent given H if

P(F | G ∧ H) = P(F | H),  or  P(G ∧ H) = 0.   (8.54)

Using the general Bayes rule, one sees that this is a symmetric relationship and hence also holds with F and G swapped. Using the product rule, another way to write conditional independence is

P(F ∧ G | H) = P(F | H) · P(G | H),  or  P(H) = 0.   (8.55)
In terms of RVs, A = a_i and B = b_j are conditionally independent given C = c_k if either P(B = b_j, C = c_k) = 0, or

P(A = a_i | B = b_j, C = c_k) = P(A = a_i | C = c_k).   (8.56)

If the above holds for all a_i, b_j, c_k, we say that A and B are conditionally independent given C, and all of the following are equivalent:

P(A | B, C) = P(A | C),   (8.57a)
P(B | A, C) = P(B | C),   (8.57b)
P(A, B | C) = P(A | C) · P(B | C).   (8.57c)

This conditional independence is also written as (A ⟂ B) | C or A ⟂_C B.

Example: The Wumpus maze solved probabilistically


[Figure: the 4×4 Wumpus-world grid partitioned into KNOWN (OK) cells with observed breeze B, the FRINGE cells, the QUERY cell, and the remaining OTHER cells.]

Figure 57: Priors P(P_ij = 1) = P(p_ij) = 0.2; only P(P_11 = 0) = P(¬p_11) = 1. The P_ij are independent binary RVs.

Example: A Converging Connection


We are given a joint distribution between three RVs A, B, C which has the following special form:

P(A, B, C) = P(A) · P(B) · P(C | A, B).   (8.58)

Then,

P(A, B) = Σ_{i=1}^{|C|} P(A, B, c_i) = Σ_{i=1}^{|C|} P(A) · P(B) · P(c_i | A, B)
        = P(A) · P(B) · Σ_{i=1}^{|C|} P(c_i | A, B) = P(A) · P(B),

using (8.58). Hence, A and B are independent.

Example: A Serial Connection

We are given a joint distribution between three RVs A, B, C which has the following special form:

P(A, B, C) = P(A) · P(B | A) · P(C | B).   (8.59)

Then,

P(C | A, B) = P(A, B, C) / P(A, B) = [P(A) · P(B | A) · P(C | B)] / [P(A) · P(B | A)] = P(C | B),

using (8.59). Hence, C and A are conditionally independent given B.

Example: A Diverging Connection

We are given a joint distribution between three RVs A, B, C which has the following special form:

P(A, B, C) = P(B | A) · P(C | A) · P(A).   (8.60)

Then,

P(B | A, C) = P(A, B, C) / P(A, C) = [P(B | A) · P(C | A) · P(A)] / [P(C | A) · P(A)] = P(B | A),

using (8.60). Hence, B and C are conditionally independent given A.


However, if A is not given but C is given, then, in general, P(B | C) ≠ P(B):

P(B | C) = P(B, C) / P(C) = Σ_{i=1}^{|A|} P(a_i, B, C) / P(C)
         = Σ_{i=1}^{|A|} P(B | a_i) · P(C | a_i) · P(a_i) / P(C)
         = [ Σ_{i=1}^{|A|} P(B | a_i) · P(C | a_i) · P(a_i) ] / [ Σ_{i=1}^{|A|} P(C | a_i) · P(a_i) ],   (8.61)

using (8.60). The last equation shows how evidence may be transmitted from an instantiation of C to B implicitly through A. What does it mean if A ≡ Gender, B ≡ Height, C ≡ LengthOfHair?

Chain Rule
This is an application of the product rule (stated in terms of RVs):

P(X_1, …, X_n) = P(X_n | X_1, …, X_{n−1}) · P(X_1, …, X_{n−1}),   (8.62a)
P(X_1, …, X_{n−1}) = P(X_{n−1} | X_1, …, X_{n−2}) · P(X_1, …, X_{n−2}),   (8.62b)
P(X_1, …, X_{n−2}) = P(X_{n−2} | X_1, …, X_{n−3}) · P(X_1, …, X_{n−3}),   (8.62c)
⋮
P(X_2, X_1) = P(X_2 | X_1) · P(X_1).   (8.62d)

Combining the above equations, we get

P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | X_{i−1}, X_{i−2}, …).   (8.63)

This formula will be useful later in formulating the joint distribution of a Bayesian network.

Beginning to Learn using Naïve Bayesian Classifiers

Contents

Beginning to Learn using Naïve Bayesian Classifiers
Reasoning using NBC
Learning Terminology
Supervised Learning
Some Discrete Probability Distributions
Training an NBC
Some Continuous Probability Distributions
NBC for Continuous RVs

Reasoning using NBC

Naïve Bayesian Classifier (NBC)
a.k.a. Idiot Bayes Model

[Figure: a Class node with arrows to Attribute 1, Attribute 2, …, Attribute n.]

Figure 58: The Naïve Bayesian Classifier assumes that the attributes are conditionally independent of each other given the class. However, NBC often achieves surprisingly good performance even when this strong assumption is not strictly valid.

The NBC Size Advantage


If we use the full JPD P(C, A_1, A_2, …, A_n), we need to specify |C| · |A_1| ⋯ |A_n| − 1 independent probability values. If all RVs are binary, this number is 2^{n+1} − 1.
If we use an NBC, we need to give:
  P(C), with |C| − 1 independent values;
  n CPTs P(A_i | C), the i-th CPT having (|A_i| − 1) · |C| independent values.
This gives a total of |C| − 1 + Σ_{i=1}^{n} |C| · (|A_i| − 1) values. If all RVs are binary, this number is 1 + 2n.
Thus, the model complexity has come down from exponential to linear in the number of attributes. Fewer parameters make NBCs more immune to overfitting, at the expense of less accuracy compared to more advanced methods.

Query distributions for an NBC

The NBC is first trained using a set of example/instance tuples:

e_i = (C = c_i, A_1 = a_{1i}, …, A_n = a_{ni}),   i = 1, …, m.   (9.1)

During testing, we would like to predict the class of a previously unseen sample with attributes (A_1 = a_1, A_2 = a_2, …, A_n = a_n) and unknown classification. Thus, we want to query:

P(C | A_1 = a_1, …, A_n = a_n).   (9.2)

Answering queries

P(C | a_1, a_2, …, a_n) = P(a_1, a_2, …, a_n | C) · P(C) / P(a_1, a_2, …, a_n)   (9.3)

and, using conditional independence of the attributes given the class,

  = P(a_1 | C) · P(a_2 | C) ⋯ P(a_n | C) · P(C) / Σ_{i=1}^{k} P(a_1, a_2, …, a_n, c_i)   (9.4)
  = P(a_1 | C) · P(a_2 | C) ⋯ P(a_n | C) · P(C) / Σ_{i=1}^{k} P(a_1 | c_i) ⋯ P(a_n | c_i) · P(c_i).   (9.5)

The Log-Sum Trick

P(c_j | a_1, a_2, …, a_n) = P(a_1 | c_j) ⋯ P(a_n | c_j) · P(c_j) / Σ_{i=1}^{k} P(a_1 | c_i) ⋯ P(a_n | c_i) · P(c_i).

As n increases, each term in the summation in the denominator becomes smaller and smaller. This may lead to numerical underflow.

Rewrite

P(c_j | a_1, a_2, …, a_n) = e^{b_j} / Σ_{i=1}^{k} e^{b_i},  where b_j ≜ ln[ P(a_1 | c_j) ⋯ P(a_n | c_j) · P(c_j) ].   (9.6)

Then

ln P(c_j | a_1, a_2, …, a_n) = b_j − ln Σ_{i=1}^{k} e^{b_i}   (9.7)
                             = b_j − b − ln Σ_{i=1}^{k} e^{b_i − b},   (9.8)
where  b ≜ max_i b_i.   (9.9)

Example: ln(e^{−120} + e^{−121}) = −120 + ln(1 + e^{−1}).
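
A small helper makes (9.7)-(9.9) concrete; only the standard math module is needed, and the b_j are assumed to be precomputed log-scores.

import math

def log_normalize(b):
    """Given b_j = ln[P(a1|c_j)...P(an|c_j)P(c_j)], return ln P(c_j | a1..an)
    via (9.8): b_j - b - ln(sum_i exp(b_i - b)) with b = max_i b_i."""
    b_max = max(b)
    log_z = b_max + math.log(sum(math.exp(bi - b_max) for bi in b))
    return [bj - log_z for bj in b]

# The slide's example: ln(e^-120 + e^-121) = -120 + ln(1 + e^-1)
b = [-120.0, -121.0]
log_post = log_normalize(b)
print([math.exp(lp) for lp in log_post])   # approx [0.731, 0.269], no underflow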

Learning Terminology

Learning 101

Definition of Tom Mitchell, CMU

An agent is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

The major issues are:
1. What prior knowledge is available to the agent, and how is the knowledge represented?
2. What feedback is available to learn from?

Feedback to Learn from

Supervised Learning: The agent observes some input (X)-output (Y) pairs and learns a functional mapping between them.
  Classification: the output Y is a discrete set of labels.
  Regression: the output Y is a continuous variable.
Reinforcement Learning: The agent learns from a series of rewards and punishments.
Unsupervised Learning: No feedback. The agent learns patterns in the input, e.g. clustering.

Supervised Learning

Given a training set of i = 1…m example input-output pairs:
  input X_i, generally a vector;
  output Y_i, generally a scalar.
There is an unknown functional relationship Y_i = f(X_i).
We would like to find a function h which approximates the true function f. The approximate function h is called a hypothesis.
To measure the accuracy of the hypothesis we use a test set of example pairs different from the training set. The hypothesis h generalizes well if it correctly predicts the output Y for these novel test-set examples.

Example: The Mushroom Classification Dataset


http://archive.ics.uci.edu/ml/datasets/Mushroom

Nr. of attributes: 22
A1   cap-shape                  bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
A2   cap-surface                fibrous=f, grooves=g, scaly=y, smooth=s
A3   cap-color                  brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
A4   bruises?                   bruises=t, no=f
A5   odor                       almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
A6   gill-attachment            attached=a, descending=d, free=f, notched=n
A7   gill-spacing               close=c, crowded=w, distant=d
A8   gill-size                  broad=b, narrow=n
A9   gill-color                 black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, ye…
A10  stalk-shape                enlarging=e, tapering=t
A11  stalk-root                 bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
A12  stalk-surface-above-ring   fibrous=f, scaly=y, silky=k, smooth=s
A13  stalk-surface-below-ring   fibrous=f, scaly=y, silky=k, smooth=s
A14  stalk-color-above-ring     brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
A15  stalk-color-below-ring     brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
A16  veil-type                  partial=p, universal=u
A17  veil-color                 brown=n, orange=o, white=w, yellow=y
A18  ring-number                none=n, one=o, two=t
A19  ring-type                  cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
A20  spore-print-color          black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
A21  population                 abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
A22  habitat                    grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

Example: The Mushroom Classification Dataset


http://archive.ics.uci.edu/ml/datasets/Mushroom

Nr. of examples: 8124


p   x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e   x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e   b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p   x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e   x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e   x,y,b,t,n,f,c,b,e,e,?,s,s,e,w,p,w,t,e,w,c,w
e   x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e   b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e   b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p   x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
⋮

Cardiac Single Proton Emission Computed Tomography (SPECT) Diagnosis
http://archive.ics.uci.edu/ml/datasets/SPECT+Heart

Citation: Kurgan et al., "Knowledge discovery approach to automated cardiac SPECT diagnosis", Artificial Intelligence in Medicine, vol. 23 (2001).
Attributes: 22+1, all binary.
Instances: 267. Class 0: 55 (Normals), Class 1: 212 (Abnormals).

Independent and Identically Distributed (i.i.d.)

We assume that example data-point j is an RV E_j whose observed value e_j = (x_j, y_j) is sampled from a probability distribution which remains unchanged (stationary) over time. Furthermore, each sample is independent of the others. Thus,

P(E_j | E_{j−1}, E_{j−2}, …) = P(E_j),   (Independence)   (9.10a)
P(E_j) = P(E_{j−1}) = P(E_{j−2}) = …   (Identical Distribution)   (9.10b)

Error-Rate
It is the proportion of mistakes a given hypothesis makes, i.e. the proportion of times h(x) ≠ y.

Holdout Cross-Validation
Split the available examples randomly into a training set, from which the learning algorithm produces a hypothesis, and a test set, on which the accuracy of h is evaluated. Disadvantage: we cannot use all examples for finding h.

Figure 59: Holdout cross-validation: the examples e_1, e_2, …, e_m are split into the two sets.
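
A holdout split as in Figure 59 is a few lines with the standard library; the 30% test fraction below is just an assumed choice.

import random

def holdout_split(examples, test_fraction=0.3, seed=0):
    """Randomly split examples into (training_set, test_set)."""
    rng = random.Random(seed)
    shuffled = examples[:]            # copy, keep the original order intact
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# train, test = holdout_split(list_of_examples)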

Some Discrete Probability Distributions

Multinomial Distribution
Let our sample space be divided into k classes, i.e. |C| = k.
We draw m samples. Each sample e_j, j = 1…m, falls into exactly one class. Let the class of e_j be denoted e_j.c.
Let the known prior probability be P(C = c_i) = θ_i. Thus,

Σ_{i=1}^{k} θ_i = Σ_{i=1}^{k} P(C = c_i) = 1.   (9.11)

Let the integer-valued DRVs N_i, i = 1, …, k, denote the number of samples (out of m) which fall in category i:

N_i = Σ_{j=1}^{m} I(e_j.c = i),   N ≜ [N_1, …, N_k]^T,   (9.12)
Σ_{i=1}^{k} N_i = m.   (9.13)
The joint pmf of the N_i's is given by

P(N = n) = (m choose n_1) · (m − n_1 choose n_2) ⋯ (m − Σ_{i=1}^{k−1} n_i choose n_k) · θ_1^{n_1} θ_2^{n_2} ⋯ θ_k^{n_k}
         = m! / (n_1! n_2! ⋯ n_k!) · ∏_{i=1}^{k} θ_i^{n_i},   (9.14)
where  Σ_{i=1}^{k} n_i = m,   Σ_{i=1}^{k} θ_i = 1.   (9.15)

We write N ~ Multinomial(θ).   (9.16)

The expected count of samples for class C = i is

E[N_i] = Σ_{j=1}^{m} E[I(e_j.c = i)] = Σ_{j=1}^{m} P(C = i) = Σ_{j=1}^{m} θ_i = m θ_i.   (9.17)

Definition 9.1 (Gamma function Γ(x))

Γ(x) ≜ ∫_0^∞ t^{x−1} e^{−t} dt.   (9.18)

If the argument is a positive integer n, then Γ(n) = (n − 1)!.

Figure 60: Src: facebook.com. We are only interested in the positive half of the real axis.

Figure 61: Src: zazzle.com. The famous Gamma function value for a non-integer.

The Dirichlet PDF


It is a pdf of the pmf P(C) = θ = [θ_1, …, θ_k]^T.
θ should lie in the probability simplex S_k = {θ | θ_i ∈ [0, 1] and Σ_{i=1}^{k} θ_i = 1}.
It is parameterized by hyperparameters α ∈ R^k, α_i > 0:

p(θ; α) ≜ d(α) ∏_{i=1}^{k} θ_i^{α_i − 1},   d(α) ≜ Γ(α_1 + … + α_k) / ∏_{i=1}^{k} Γ(α_i).   (9.19)

We write θ ~ Dir(α).
Recall: if n is a positive integer, Γ(n) = (n − 1)!, but Γ is also defined for general real numbers.
As p(θ; α) is a pdf,

∫_{S_k} p(θ; α) dθ = 1.   (9.20)

The Dirichlet Distribution: Properties I


Using this, and the property (z + 1) = z (z), we show that
E [i ] =

Sk

i p(; ) d

= d ()

Sk

1 1 1 ii k k 1 d

= missing steps
(1 + k ) (i + 1)
=
(1 + k + 1) (i )
i
=
.
1 + + k

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Beginning to Learn using Na Bayesian Classiers


ve

(9.21)

December 5, 2013

283 / 475

Some Discrete Probability Distributions

(a) α = (1, 1, 1)    (b) α = (0.1, 0.1, 0.1)    (c) α = (10, 10, 10)    (d) α = (2, 5, 15)

Figure 62: Visualizing the Dirichlet distribution for k = 3 defined on S_3 in the first octant. From Frigyik et al., Univ. of Washington Tech. Report UWEETR-2010-0006.

The Dirichlet Distribution: Properties II


For binary RVs (k = 2):
  the Multinomial distribution reduces to the Binomial distribution;
  the Dirichlet distribution reduces to the Beta distribution.

Figure 63: The Beta distribution. In terms of Dirichlet hyperparameters: α_1 = α, α_2 = β. Src: your.org.

Conjugate Priors
The Bayesian estimation update rule also holds for pdfs:

p(θ_{n+1} | e_{1:n+1}) ∝ p(e_{n+1} | θ_n) · p(θ_n | e_{1:n}),
  i.e.  Posterior ∝ Likelihood · Prior.

In general, the prior, the likelihood, and the posterior distributions may belong to different families of probability distributions.
However, for some families the prior and the posterior belong to the same family F_1, given that the likelihood is from a certain family F_2. In this case the family F_1 is called the conjugate prior to F_2.
The most relevant examples for this course are:
  F_1 = F_2 = Gaussian;
  F_2 = Multinomial, F_1 = Dirichlet distribution.

Multinomial and Dirichlet Conjugate Priors


The prior p(θ) is Dirichlet: θ ~ Dir(α).
The likelihood P(N = n | θ) of observing the counts n is Multinomial: N ~ Multinomial(θ).
The posterior is

p(θ | n) ∝ P(n | θ) · p(θ; α)   (9.22)
         ∝ ∏_{i=1}^{k} θ_i^{n_i} · ∏_{j=1}^{k} θ_j^{α_j − 1}   (9.23)
         = ∏_{i=1}^{k} θ_i^{n_i + α_i − 1},  i.e.  θ | n ~ Dir(n + α).   (9.24)

The α_i are called pseudo-counts.
Using (9.21) for the posterior, we have

E[θ_j] = (n_j + α_j) / Σ_{i=1}^{k} (n_i + α_i).   (9.25)

Effect of Prior vs. Effect of Likelihood

E[θ_j] = (n_j + α_j) / Σ_{i=1}^{k} (n_i + α_i),  where the α_i are pseudo-counts: with few observed counts the prior dominates, while with many observations the empirical frequencies dominate.

Training an NBC

Unknown Attribute Values in Some Examples


For example, the attribute stalk-root has a missing value (?) in some examples of the Mushroom dataset.
Choices:
  Take the most common value of that attribute in the whole example set.
  For a binary decision problem with decision (class) variable having values Class = Y/N, and an attribute A with missing values in some examples, suppose the example with the missing value is of Class = Y. Using the product rule:

P(A = a_i | Class = Y) = P(A = a_i, Class = Y) / P(Class = Y) = p_i / p.   (9.26)

We can then choose the attribute-value a_i with the largest conditional probability. The Class = N case can be handled similarly.

Training the NBC

Step 1
Fill in the missing attribute-values using the heuristics of the last slide.

Step 2
Compute the prior P(C = c_i) for i = 1, …, k:

P(C = c_i) = (1/m) Σ_{j=1}^{m} I(e_j.c = c_i).   (9.27)

Step 3
The CPT P(A_r | C) consists of |C| pmfs P(A_r | c_i) ∈ S_{|A_r|}.
Assume a Dirichlet prior for all pmfs: P(A_r | c_i) ~ Dir(α_r).
In the absence of any other prior information, choose a uniform prior, i.e. α_r[ℓ] = 1 for ℓ = 1, …, |A_r|.


Step 4
Assume that the examples in the database with e_j.A_r = a_{r,ℓ}, e_j.c = c_i are distributed according to P(A_r | c_i). The counts of such observed examples are then sampled from a Multinomial distribution with θ = P(A_r | c_i). This is the likelihood.
So the expected Dirichlet posterior estimate of the pmf P(A_r | c_i), for r = 1, …, n and ℓ = 1, …, |A_r|, from (9.25) is

n_{r,ℓ} ≜ Σ_{j=1}^{m} I(e_j.A_r = a_{r,ℓ}, e_j.c = c_i),   (9.28)
P(A_r = a_{r,ℓ} | C = c_i) = (n_{r,ℓ} + α_r[ℓ]) / Σ_{p=1}^{|A_r|} (n_{r,p} + α_r[p]).   (9.29)
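
Steps 2-4 amount to counting with pseudo-counts. The sketch below is one possible implementation for discrete attributes, assuming the examples are given as (class, attribute-tuple) pairs and using the uniform prior α_r[ℓ] = 1 of Step 3.

from collections import Counter, defaultdict

def train_nbc(examples, attr_domains, classes):
    """examples: list of (c, (a_1, ..., a_n)) tuples.
    Returns the prior (9.27) and CPTs estimated with pseudo-counts alpha = 1 (9.29)."""
    m = len(examples)
    class_counts = Counter(c for c, _ in examples)
    prior = {c: class_counts[c] / m for c in classes}

    n_attrs = len(attr_domains)
    counts = [defaultdict(Counter) for _ in range(n_attrs)]   # counts[r][c][a] = n_{r,a}
    for c, attrs in examples:
        for r, a in enumerate(attrs):
            counts[r][c][a] += 1

    cpts = []                     # cpts[r][c][a] approximates P(A_r = a | C = c)
    for r, domain in enumerate(attr_domains):
        cpt = {}
        for c in classes:
            denom = sum(counts[r][c][a] + 1 for a in domain)
            cpt[c] = {a: (counts[r][c][a] + 1) / denom for a in domain}
        cpts.append(cpt)
    return prior, cpts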

Some Continuous Probability Distributions

Continuous Probability Density Functions

If X is an RV in a continuous domain D(X),

P(x_1 ≤ X ≤ x_2) = ∫_{x_1}^{x_2} p(x) dx.   (9.30)

The function p(X = x) is called the probability density function (pdf) of X. Obviously, ∫_{D(X)} p(x) dx = 1. Analogous to the discrete counterpart, we can also define multivariate pdfs p(X = x).

Most rules, like the product rule, marginalization, and Bayes rule, have similar counterparts in the continuous domain:

p(X = x, Y = y) δx δy = (p(x | y) δx) · (p(y) δy)  ⟹  p(x, y) = p(x | y) · p(y),   (9.31)
∫_{D(X)} p(X = x, Y = y) dx = p(Y = y).   (9.32)

If X ∈ R^n is a normally distributed vector continuous RV (CRV), its normal/Gaussian pdf is defined as

N(X = x; x̄, C) ≜ (2π)^{−n/2} |C|^{−1/2} exp( −(1/2) (x − x̄)^T C^{−1} (x − x̄) ),   (9.33)

where x̄ is the mean and C is the covariance matrix of the distribution.

NBC for Continuous RVs

Example: Banknote authentication


http://archive.ics.uci.edu/ml/datasets/banknote+authentication

Number of attributes: 5
1. Variance of Wavelet Transformed image (continuous)
2. Skewness of Wavelet Transformed image (continuous)
3. Curtosis of Wavelet Transformed image (continuous)
4. Entropy of image (continuous)
5. Class (integer) 0/1

Number of instances: 1372

Example: Banknote authentication


http://archive.ics.uci.edu/ml/datasets/banknote+authentication

3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
3.4566,9.5228,-4.0112,-3.5944,0
0.32924,-4.4552,4.5718,-0.9888,0
-1.3887,-4.8773,6.4774,0.34179,1
-3.7503,-13.4586,17.5932,-2.7771,1
-3.5637,-8.3827,12.393,-1.2823,1
-2.5419,-0.65804,2.6842,1.1952,1

Using NBC for CRVs: Option 1

Discretize all CRVs.
  This is like creating histograms with a user-given bin size.
  Later we will see a discretization technique based on information entropy.

Option 2: Modifying NBC Likelihood for the Continuous Case
The conditional independence assumption gives:

p(A_1 = a_1, A_2 = a_2, …, A_n = a_n | C = c_i) = ∏_{j=1}^{n} p(a_j | c_i).   (9.34)

The conditional pdf p(a_j | c_i) is now computed as

p(A_j = a_j | C = c_i) = N(a_j; μ_{ji}, σ²_{ji}),   (9.35)

where μ_{ji}, σ²_{ji} are the mean and variance of the values of A_j among instances of class C = c_i.

NBC Posterior for Continuous Attributes and Discrete Class

P(c_j | a_1, a_2, …, a_n) = p(a_1 | c_j) ⋯ p(a_n | c_j) · P(c_j) / Σ_{i=1}^{k} p(a_1 | c_i) ⋯ p(a_n | c_i) · P(c_i).   (9.36)
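
Option 2 thus reduces to estimating a per-class mean and variance for every attribute and plugging them into (9.35)-(9.36). A minimal sketch, assuming rows of the banknote kind (feature list plus class label); in practice a small variance floor may be needed to avoid zero variances.

import math
from collections import defaultdict

def gaussian(x, mu, var):
    """Univariate normal pdf N(x; mu, var), used for p(a_j | c_i) in (9.35)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def train_gaussian_nbc(rows):
    """rows: list of ([a_1, ..., a_n], class). Returns priors and per-class (mu, var) per attribute."""
    by_class = defaultdict(list)
    for features, c in rows:
        by_class[c].append(features)
    priors, stats = {}, {}
    for c, feats in by_class.items():
        priors[c] = len(feats) / len(rows)
        n = len(feats)
        mus = [sum(col) / n for col in zip(*feats)]
        vars_ = [sum((x - mu) ** 2 for x in col) / n for col, mu in zip(zip(*feats), mus)]
        stats[c] = list(zip(mus, vars_))
    return priors, stats

def classify(x, priors, stats):
    """Return arg max_c P(c) * prod_j N(x_j; mu_jc, var_jc), cf. (9.36), in log-space."""
    def score(c):
        s = math.log(priors[c])
        for xj, (mu, var) in zip(x, stats[c]):
            s += math.log(gaussian(xj, mu, var))
        return s
    return max(priors, key=score)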

Bayesian Networks

Contents

Bayesian Networks
Some Conditional Independence Results
Pruning a BN
Exact Inference in a BN
Approximate Inference in BN
Efficient Representation of CPTs
Applications of BN

Problems of Using a Full Joint Distribution


Given n RVs, each with cardinality d, a joint distribution table has an
exponentially growing size of O(d n ).
It is usually dicult to assign these probabilities.
In real life, we deal with eects and their direct causes. A domain
expert will usually be able to provide us with probability-tables of type
P(Eect | Cause 1 , . . . , Cause k )
In general, an eect-RV X has certain direct cause-RVs
Ci , i = 1 . . . k, k
n, and X is conditionally independent of the all
other cause RVs given all Ci . The number of probabilities to be
specied now is reduced to O(nd k+1 )
O(d n ).
Example: Run
http://www.aispace.org/bayes/version5.1.9/bayes.jnlp and
load sample problems.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

302 / 475

Bayesian Networks

An Example Bayesian Network (BN)


[Network topology: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls.]

P(B = t) = 0.001,   P(E = t) = 0.002.
P(A = t | B, E):   B=t, E=t: 0.95;   B=t, E=f: 0.94;   B=f, E=t: 0.29;   B=f, E=f: 0.001.
P(J = t | A):      A=t: 0.90;   A=f: 0.05.
P(M = t | A):      A=t: 0.70;   A=f: 0.01.

Figure 64: A typical Bayesian Network (BN) showing the topology and CPTs.

Terminology

A Bayesian Network (BN) is a Directed Acyclic Graph (DAG) of nodes X_i which are RVs. A CPT exists between each node and all its parents.
The descendants of a node X are all the nodes Y to which a directed path exists from X. In this case, X is an ancestor of Y.
The non-descendants of a node X form the set of all nodes which are not descendants of X.
A set of nodes S = {X_1, …, X_k} ⊆ BN is called ancestral if it contains all its ancestors, i.e. there is no Y ∉ S having a descendant X_i ∈ S. In other words, S has no incoming edges from outside S.

Partial and Topological Ordering of the Nodes of a BN


Partial Ordering
A BN is a DAG, and all DAGs have an implicit partial order:

X_i is an ancestor of X_j  ⟹  (i < j).   (10.1)

Topological Ordering/Sorting
A topological ordering of a DAG is a non-unique total ordering which is compatible with the above partial ordering.
In any topological ordering [X_1, X_2, …, X_n], for all vertices X_i,

Parents(X_i) ⊆ Predecessors(X_i).   (10.2)

There exist several linear-time algorithms for topological sorting.

A Simple Algorithm for Finding a Topological Ordering


Algorithm 30: TopologicalOrdering
  input : G(V, E), a DAG
  output: L, a list with vertices in ascending order of topological order
  while (S = {vertex Y ∈ G.V | indegree of Y is 0}) ≠ ∅ do
      Choose any vertex X ∈ S   // Here we have freedom
      Append X to L
      Remove X and all its out-edges from G
  return L

In the highlighted line, different choices will result in different topological orders.
We define a particular choice dubbed ITYCA (Ignore Till You Can't Anymore), which takes a given node Y and, if Y ∈ S, does not choose Y till it is the only node left in S.
ITYCA produces a topological order L = [nondescendants(Y), Y, descendants(Y)].
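
Algorithm 30 is Kahn's algorithm; a direct Python transcription is given below, with the ITYCA choice mimicked by postponing a designated node. The small example call at the end is hypothetical, not the DAG of Figure 65.

def topological_ordering(nodes, edges, ignore=None):
    """Kahn's algorithm (Algorithm 30). edges is a collection of (u, v) pairs for u -> v.
    If `ignore` is given, that node is postponed until it is the only candidate (ITYCA)."""
    indeg = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in edges:
        indeg[v] += 1
        children[u].append(v)
    ready = [v for v in nodes if indeg[v] == 0]
    order = []
    while ready:
        ready.sort(key=lambda v: v == ignore)   # ITYCA: keep `ignore` at the back
        x = ready.pop(0)
        order.append(x)
        for c in children[x]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return order

# Example call (hypothetical small DAG):
# topological_ordering(['A', 'B', 'C'], [('A', 'B'), ('A', 'C')], ignore='B')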

Example for ITYCA Topological Order


[Figure: an example DAG over the nodes A-H.]

Figure 65: If we decide to ITYCA node E, a possible topological order which could be returned is [A, B, C, D, F, G, E, H].

ITYCA Topological Order

How can you get the ITYCA topological order with the DFS based
topological ordering algorithm from the CLRS book?


Recall: Chain Rule


This is an application of the product rule (stated in terms of RVs):

P(X_1, …, X_n) = P(X_n | X_1, …, X_{n−1}) · P(X_1, …, X_{n−1}),   (10.3a)
P(X_1, …, X_{n−1}) = P(X_{n−1} | X_1, …, X_{n−2}) · P(X_1, …, X_{n−2}),   (10.3b)
⋮
P(X_2, X_1) = P(X_2 | X_1) · P(X_1).   (10.3c)

Combining Eqs. (10.3), we get

P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | X_{i−1}, X_{i−2}, …).   (10.4)

This formula will be useful later in formulating the joint distribution of a Bayesian network.

Some Conditional Independence Results

Defining Property of a BN

Theorem 10.1 (The Joint Probability Distribution of a BN)

Let [X_1, …, X_n] be any given topological sorting of the nodes of the BN B. Every node X_i in a BN of n nodes is conditionally independent of its predecessors in the topological sorting, given its parents, if and only if the joint probability distribution represented by B is given by

P_B(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | Parents(X_i)).   (10.5)

Note that some nodes can be without any parents, but they are also included in the above product.

Proof: First Part


First assume that nodes are conditionally independent of their
predecessors in the topological sorting, given their parents
This assumption means that
P(Xi | Xi1 , Xi2 , . . . , X1 ) = P(Xi | Parents (Xi )).

(10.6)

From the chain-rule (10.4), we have


n

P(X1 , . . . , Xn ) =
i=1

P(Xi | Xi1 , Xi2 , . . .),


n

(10.6)

i=1

P(Xi | Parents (Xi )).

This proves (10.5) one way.


K. Pathak (Jacobs University Bremen)

Articial Intelligence

Bayesian Networks

December 5, 2013

311 / 475

Some Conditional Independence Results

Proof: Second Part I


Next consider a BN with the joint distribution given by (10.5). We now need to prove that the nodes are conditionally independent of their predecessors in the topological sorting, given their parents, i.e. that (10.6) holds.

P(X_i | X_{i−1}, X_{i−2}, …, X_1) = P(X_i, X_{i−1}, X_{i−2}, …, X_1) / P(X_{i−1}, X_{i−2}, …, X_1),  where   (10.7)
P(X_i, X_{i−1}, X_{i−2}, …, X_1) = Σ_{X_{i+1}} ⋯ Σ_{X_n} P(X_1, …, X_n)   (10.8)
                                 = Σ_{X_{i+1}} ⋯ Σ_{X_n} ∏_{j=1}^{n} P(X_j | parents(X_j))   (10.9)

Proof: Second Part II

                                 = [ ∏_{k=1}^{i} P(X_k | parents(X_k)) ] · [ Σ_{X_{i+1}} ⋯ Σ_{X_n} ∏_{j=i+1}^{n} P(X_j | parents(X_j)) ]
                                 ≜ f_1 · f_2.   (10.10)

The summation in f_2 can be distributed as follows, using the property that parents(X_j) ⊆ {X_k | k < j}:

f_2 = Σ_{X_{i+1}} P(X_{i+1} | parents(X_{i+1})) ⋯ Σ_{X_n} P(X_n | parents(X_n)).   (10.11)

The last summation is 1; substituting it makes the summation over X_{n−1} one, and this cascading finally makes f_2 = 1.

Proof: Second Part III


We thus have

P(X_i, X_{i−1}, …, X_1) = ∏_{k=1}^{i} P(X_k | parents(X_k)),  and similarly   (10.12)
P(X_{i−1}, X_{i−2}, …, X_1) = ∏_{k=1}^{i−1} P(X_k | parents(X_k)).   (10.13)

Substituting both in (10.7),

P(X_i | X_{i−1}, X_{i−2}, …, X_1) = P(X_i | parents(X_i)),   (10.14)

which proves the required conditional independence.

Local Markov Property of Bayesian Networks

Theorem 10.2 (Local Markov Property)

Each node Y ∈ BN is conditionally independent of nondescendants(Y), given its parents.

Proof.
A node is conditionally independent, given its parents, of its predecessors in any topological ordering of the BN, in particular of those in a topological ordering made by ITYCAing the node: these predecessors are precisely the node's nondescendants.

Example
[Figure: an example BN over nodes including A, C, D, G, H.]

Figure 66: Write down the expression for the JPD P_B(X_1, …, X_n) of the above BN.

Pruning a BN

Theorem 10.3 (Plucking a Leaf)


If a BN B consists of nodes {X_1, …, X_n, L} where the node L is a leaf, then let B′ denote the pruned BN consisting of the nodes {X_1, …, X_n}. Then

P_B(X_1, …, X_n) = P_{B′}(X_1, …, X_n),   (10.15)

where the RHS is the full JPD of B′ as defined in (10.5), and the LHS is the marginal JPD found from the full JPD of B after marginalizing out L.

Proof.

P_B(x_1, …, x_n) = Σ_{i=1}^{|L|} P_B(x_1, …, x_n, L = ℓ_i)
                 = Σ_{i=1}^{|L|} ∏_{j=1}^{n} P(x_j | parents(X_j)) · P(ℓ_i | parents(L))
                 = ∏_{j=1}^{n} P(x_j | parents(X_j)) · Σ_{i=1}^{|L|} P(ℓ_i | parents(L))
                 = ∏_{j=1}^{n} P(x_j | parents(X_j))
                 = P_{B′}(x_1, …, x_n).

Theorem 10.4 (JPD of an Ancestral Sub-DAG of a BN)


Let A = {A_1, A_2, …, A_m} ⊆ B be an ancestral set in a BN B. Consider now the BN represented by the nodes in A. Then

P_B(A_1, …, A_m) = P_A(A_1, …, A_m),   (10.16)

where the RHS is the full JPD of the BN A as defined in (10.5), and the LHS is the marginal JPD found from the full JPD of B after marginalizing out the nodes of B which do not exist in A.

Proof.
From B we can obtain A by plucking one leaf at a time, as shown in Algorithm 31 below. From Theorem 10.3, each time we pluck a leaf, the JPD of the nodes belonging to A within B does not change. Hence, (10.16) follows.

Algorithm 31: PruningToAncestral
  input: B, a BN;  A, an ancestral set in B
  while (S = {node L ∈ B | L ∉ A and outdegree of L is 0}) ≠ ∅ do
      Choose any node X ∈ S
      Remove X and all its in-edges from B.

Exact Inference in a BN

Definition 10.5 (Query on Posterior Distribution)
Given a query RV X, a vector of observed evidence RVs E = e, and a vector of irrelevant (unobserved/hidden) RVs Y, we'd like to find the posterior probability P(X | e). We have a BN B consisting of all the RVs {X, E, Y}.

General Inference Procedure

Use P_B (10.5) as the JPD of all RVs. Then,

P(X | e) = α P(X, e) = α Σ_y P_B(X, e, y).   (10.17)

A Useful Optimization

Ignoring Nodes Irrelevant to the Query

From Theorem 10.4, all nodes which are not ancestors of X or E are irrelevant to the query, and we can answer the query using a pruned, smaller BN consisting of the smallest ancestral set containing X and E.

Recall
[The alarm BN and CPTs of Figure 64.]

P(B | j, m) = α P(B, j, m) = α Σ_{i=1}^{|E|} Σ_{k=1}^{|A|} P(B, j, m, e_i, a_k)
            = α Σ_{i=1}^{|E|} Σ_{k=1}^{|A|} P(B) · P(e_i) · P(a_k | B, e_i) · P(j | a_k) · P(m | a_k).   (10.18)

Factors

P(B | j, m) = α Σ_{i=1}^{|E|} Σ_{k=1}^{|A|} P(B) · P(e_i) · P(a_k | B, e_i) · P(j | a_k) · P(m | a_k)
            = α P(B) · Σ_{i=1}^{|E|} P(e_i) · Σ_{k=1}^{|A|} P(a_k | B, e_i) · P(j | a_k) · P(m | a_k),

with the factors  f_1(B) ≜ P(B),  f_2(E) ≜ P(E),  f_3(A, B, E) ≜ P(A | B, E),  f_4(A) ≜ P(j | A),  f_5(A) ≜ P(m | A).

Each factor f_i is a matrix indexed by the values of its argument RVs, e.g.

f_4(A) = [P(j | a), P(j | ¬a)]^T,   f_5(A) = [P(m | a), P(m | ¬a)]^T,   and f_3(A, B, E) is 2 × 2 × 2.

Point-wise Multiplication of Factors

P(B | j, m) = α f_1(B) ⊗ Σ_E f_2(E) ⊗ Σ_A f_3(A, B, E) ⊗ f_4(A) ⊗ f_5(A).

The symbol ⊗ denotes a point-wise product, defined as

f(X_1, …, X_i, Y_1, …, Y_j, Z_1, …, Z_k) = f_1(X_1, …, X_i, Y_1, …, Y_j) ⊗ f_2(Y_1, …, Y_j, Z_1, …, Z_k).   (10.19)

An Example Illustrating Factor Multiplication


f_3(A, B, C) = f_1(A, B) ⊗ f_2(B, C):

A  B | f_1(A,B)        B  C | f_2(B,C)        A  B  C | f_1 ⊗ f_2
1  1 | p11             1  1 | p21             1  1  1 | p11 · p21
1  0 | p12             1  0 | p22             1  1  0 | p11 · p22
0  1 | p13             0  1 | p23             1  0  1 | p12 · p23
0  0 | p14             0  0 | p24             1  0  0 | p12 · p24
                                              0  1  1 | p13 · p21
                                              0  1  0 | p13 · p22
                                              0  0  1 | p14 · p23
                                              0  0  0 | p14 · p24

An Example Illustrating Factor Marginalization


A   B   C  | f_1(A,B,C)          A   C  | f_2(A,C)
a1  b1  c1 | 0.25                a1  c1 | 0.33
a1  b1  c2 | 0.35                a1  c2 | 0.51
a1  b2  c1 | 0.08                a2  c1 | 0.05
a1  b2  c2 | 0.16                a2  c2 | 0.07
a2  b1  c1 | 0.05                a3  c1 | 0.24
a2  b1  c2 | 0.07                a3  c2 | 0.39
a2  b2  c1 | 0
a2  b2  c2 | 0
a3  b1  c1 | 0.15
a3  b1  c2 | 0.21
a3  b2  c1 | 0.09
a3  b2  c2 | 0.18

Figure 67: f_2(A, C) = Σ_{b_i} f_1(A, B = b_i, C). From Koller and Friedman, Probabilistic Graphical Models, 2009.
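
Both factor operations used above are a few lines over dictionary-based tables. The sketch below reproduces the Figure 67 marginal; the representation (factors as dicts keyed by value tuples) is an illustrative choice.

from itertools import product

def factor_product(vars1, f1, vars2, f2, domains):
    """Point-wise product (10.19): factors are dicts keyed by value tuples."""
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    for assign in product(*[domains[v] for v in out_vars]):
        a = dict(zip(out_vars, assign))
        out[assign] = f1[tuple(a[v] for v in vars1)] * f2[tuple(a[v] for v in vars2)]
    return out_vars, out

def sum_out(var, vars_, f):
    """Marginalize `var` out of the factor f, as in Figure 67."""
    keep = [v for v in vars_ if v != var]
    out = {}
    for assign, p in f.items():
        a = dict(zip(vars_, assign))
        key = tuple(a[v] for v in keep)
        out[key] = out.get(key, 0.0) + p
    return keep, out

# The f1(A, B, C) of Figure 67, and its B-marginal f2(A, C):
domains = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2'], 'C': ['c1', 'c2']}
f1 = {('a1', 'b1', 'c1'): 0.25, ('a1', 'b1', 'c2'): 0.35, ('a1', 'b2', 'c1'): 0.08,
      ('a1', 'b2', 'c2'): 0.16, ('a2', 'b1', 'c1'): 0.05, ('a2', 'b1', 'c2'): 0.07,
      ('a2', 'b2', 'c1'): 0.0,  ('a2', 'b2', 'c2'): 0.0,  ('a3', 'b1', 'c1'): 0.15,
      ('a3', 'b1', 'c2'): 0.21, ('a3', 'b2', 'c1'): 0.09, ('a3', 'b2', 'c2'): 0.18}
print(sum_out('B', ['A', 'B', 'C'], f1)[1][('a1', 'c1')])   # 0.33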

Inference by Variable Elimination I

Remark 10.6 (Notation)

Let the vector RV X = [X_1, …, X_n] be the vector of all nodes (RVs) in a BN. We will use X to refer to both the vector and the set of RVs {X_1, …, X_n}. So if Y = [X_1, …, X_m], m < n, we can write Y ⊆ X.

Algorithm 32: EliminateVariableFromFactors
  input : A factorization F: f_1, …, f_m;
          Z ∈ X, an RV to eliminate by marginalization
  output: The factorization F after elimination

  S(Z) = {f_i ∈ F | f_i involves Z};
  Remove the factors S(Z) from F;
  Marginalize out Z from the product of the factors in S(Z) to create a new factor g:
      h = ⊗_{f ∈ S(Z)} f   (point-wise product),    g = Σ_{z ∈ Z} h;   (10.20)
  Append factor g to F;
  return F

Computing P(Q | E = e)
Zhang and Poole (1994)

Algorithm 33: VariableElimination
  input : A factorization F: f_1, …, f_m of a JPD P(X);
          Q ⊆ X, a vector of query RVs;
          E = e, the vector of instantiated observed (evidence) RVs;
          Y ⊆ X, any ordering of the unobserved RVs. Note: X = Q ∪ E ∪ Y.
  output: The posterior distribution P(Q | E = e)

  Instantiate E = e in all factors in F. This truncates all factor tables to those elements which correspond to E = e;
  for Y ∈ Y do
      F ← EliminateVariableFromFactors(F, Y);
  h(Q) ← point-wise product of all factors in F;
  return Normalize(h(Q))

Recall

[The alarm BN and CPTs of Figure 64, repeated for reference.]

Alarm Example: Computing P(B | j, m) I


P(B | j, m) = α P(B) Σ_{i=1}^{|E|} P(e_i) Σ_{k=1}^{|A|} P(a_k | B, e_i) · P(j | a_k) · P(m | a_k),
with the factors f_1(B), f_2(E), f_3(A, B, E), f_4(A), f_5(A) as before.

We have Q ≡ [B], e ≡ [J = True, M = True]. Let us choose the ordering Y ≡ [E, A] for the unobserved RVs.
Initially, we have the set of factors:

F = [ f_1(B), f_2(E), f_3(A, B, E), f_4(J, A), f_5(M, A) ]
  = [ P(B), P(E), P(A | B, E), P(J | A), P(M | A) ].   (10.21)

Alarm Example: Computing P(B | j, m) II

Instantiate the observed variables J = True, M = True. Taking the listing order (True, False) for all binary RVs, we have

F = [ f_1(B), f_2(E), f_3(A, B, E), f_4(A), f_5(A) ]
  = [ [0.001, 0.999], [0.002, 0.998], f_3(A, B, E), [0.9, 0.05], [0.7, 0.01] ].   (10.22)

In the first iteration of the for-loop of Algo. 33, we have Y ≡ E. In the call EliminateVariableFromFactors(F, E), in (10.20) we have the factor h_1 = f_2(E) ⊗ f_3(A, B, E). Verify that h_1 has the table

B  A | E = True       | E = False
T  T | 0.95 × 0.002   | 0.94 × 0.998
T  F | 0.05 × 0.002   | 0.06 × 0.998
F  T | 0.29 × 0.002   | 0.001 × 0.998
F  F | 0.71 × 0.002   | 0.999 × 0.998

Alarm Example: Computing P(B | j, m) III


Summing out E in h_1 gives us the factor g_1:

B  A | g_1(A, B)
T  T | 0.95 × 0.002 + 0.94 × 0.998  = 0.94002
T  F | 0.05 × 0.002 + 0.06 × 0.998  = 0.05998
F  T | 0.29 × 0.002 + 0.001 × 0.998 = 0.001578
F  F | 0.71 × 0.002 + 0.999 × 0.998 = 0.998422

So now we have F = [f_1(B), g_1(A, B), f_4(A), f_5(A)].
In the second iteration of the for-loop of Algo. 33, we have Y ≡ A. In the call EliminateVariableFromFactors(F, A), in (10.20) we have the factor h_2 = g_1(A, B) ⊗ f_4(A) ⊗ f_5(A). Verify that h_2 has the table

B  A | h_2(A, B)
T  T | 0.94002 × 0.9 × 0.7    = 0.5922126
T  F | 0.05998 × 0.05 × 0.01  = 2.999 × 10⁻⁵
F  T | 0.001578 × 0.9 × 0.7   = 0.00099414
F  F | 0.998422 × 0.05 × 0.01 = 0.0004992

Alarm Example: Computing P(B | j, m) IV


By summing out A from h_2(A, B), we get the table g_2(B) = [0.59224259, 0.001493351]. So now F = [f_1(B), g_2(B)].
Finally, back in VariableElimination, we have h(B) = f_1(B) ⊗ g_2(B):

B | h(B)
T | 0.001 × 0.59224259  = 0.00059224259
F | 0.999 × 0.001493351 = 0.0014918576

Normalizing h(B), we get

P(B | J = True, M = True) = [0.2842, 0.7158].

Taking a different order for the unobserved RVs, e.g. Y ≡ [A, E], would also give the same result but with a different computational efficiency. Which order is better for this example?
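
The variable-elimination result can be cross-checked by brute-force enumeration of (10.18); for this small network a few lines suffice. The CPT numbers are those of Figure 64.

# Brute-force check of P(B | j, m) for the alarm network.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=True | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J=True | A)
P_M = {True: 0.70, False: 0.01}                       # P(M=True | A)

def val(p_true, flag):
    """P(X = flag) from the stored P(X = True)."""
    return p_true if flag else 1.0 - p_true

unnorm = {}
for b in (True, False):
    s = 0.0
    for e in (True, False):
        for a in (True, False):
            s += (val(P_B[True], b) * val(P_E[True], e) * val(P_A[(b, e)], a)
                  * P_J[a] * P_M[a])          # evidence: J = True, M = True
    unnorm[b] = s
z = sum(unnorm.values())
print({b: p / z for b, p in unnorm.items()})  # approx {True: 0.284, False: 0.716}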

Alarm Example: Computing P(J = True | B = True )

For answering P(J = True | B = True ), note that the sub-DAG formed by
the nodes {B, A, J, E } is ancestral. Therefore, we can prune out the leaf
node M before answering the query.


Approximate Inference in BN

Monte Carlo Algorithms

Inference in BN can also be done using randomized sampling algorithms


called Monte Carlo algorithms whose accuracy depends on the number of
samples generated. Hence, the solutions are called any time because
within a given computation time an estimate can be produced.


Algorithm 34: Prior-Sampling


input : B, a BN
output: A sample s from the JPD of B given by (10.5)

The size of the BN, denoted |B|, is the number of nodes in B ;
Z ← TopologicalOrdering(B) ;
Initialize sample-vector s ∈ R^|B| to 0 ;
for i ← 1 . . . |B| do
    z_i ← a random sample from the pmf P(Z_i | parents(Z_i)) ;
    s[i] ← z_i ;
return s
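A minimal Python sketch of this procedure is given below; the dict-based encoding of the alarm network (each variable mapping to its parents and to P(X = True | parents)) is an assumption of the sketch, not the course's code.

```python
import random

# Prior-Sampling (Algo. 34) on an illustrative encoding of the alarm network.
alarm_bn = {
    "B": ([], {(): 0.001}),
    "E": ([], {(): 0.002}),
    "A": (["B", "E"], {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (["A"], {(True,): 0.9, (False,): 0.05}),
    "M": (["A"], {(True,): 0.7, (False,): 0.01}),
}
ORDER = ["B", "E", "A", "J", "M"]   # a topological ordering

def prior_sample(bn, order):
    """Sample every variable in topological order from P(Z_i | parents(Z_i))."""
    s = {}
    for var in order:
        parents, cpt = bn[var]
        p_true = cpt[tuple(s[p] for p in parents)]
        s[var] = random.random() < p_true
    return s

print(prior_sample(alarm_bn, ORDER))
```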


Probability of a sample generated by Prior-Sampling


The sampling proceeds in a topological order Z, i.e. parents are sampled before children. When a child RV is sampled, all its parent RVs have already been instantiated. Let a sample s be generated in order Z. Then
    P(s_1) = P(Z_1 = s_1), as Z_1 is guaranteed to be a root-node; it has no parents;
    P(s_2 ∧ s_1) = P(s_1) P(s_2 | parents(Z_2) ⊆ {s_1});
    ...
In general, from the product-rule,
    P(s_i ∧ s_{i-1} ∧ ... ∧ s_1) = P(s_i | s_{i-1} ∧ s_{i-2} ∧ ... ∧ s_1) P(s_{i-1} ∧ s_{i-2} ∧ ... ∧ s_1)
                                 = P(Z_i = s_i | parents(Z_i) ⊆ {s_1, ..., s_{i-1}}) P(s_{i-1} ∧ s_{i-2} ∧ ... ∧ s_1)


Thus, the probability of the whole sample vector s, containing samples from all RVs in the BN, is

    P(s) = P(s_1 ∧ s_2 ∧ ... ∧ s_|B|)
         = Π_{i=1}^{|B|} P(Z_i = s_i | parents(Z_i) ⊆ {s_1, ..., s_{i-1}})
         = P_B(Z = s),    by (10.5).

Therefore, s is a sample from the JPD of the BN.


Algorithm 35: Rejection-Sampling


input : B, a BN consisting of the RVs {X} ∪ E ∪ Y ;
        X, the query RV; E = e, the evidence vector RV ;
        N, the number of samples to generate
output: An estimate of P(X | e)

Initialize count-map C ∈ R^|X| to 0 ;
for j ← 1 to N do
    Initialize sample-vector s ∈ R^|B| to 0 ;
    s ← Prior-Sampling(B) ;
    if s is consistent with e then
        x ← the value of X in s ;
        C[x] ← C[x] + 1 ;
return Normalize(C)


Sampling only Non-evidence RVs

Rejection-Sampling may reject too many samples as |E| increases! It is therefore unusable for complex problems.
Alternative approach:
Create a sample consistent with the evidence by sampling only the non-evidence RVs and freezing the evidence RVs to the observations e.
Compute the sample's weight as the likelihood of the evidence in the sample.



Algorithm 36: Weighted-Sample
input : B, a BN; E = e, the evidence vector RV
output: A sample s, and its weight w

w ← 1 ;
Z ← TopologicalOrdering(B) ;
Initialize the slots of sample-vector s ∈ R^|B| corresponding to E by e ;
for i ← 1 . . . |B| do
    if Z_i ∈ E then
        z_i ← value of Z_i in e ;
        w ← w · P(Z_i = z_i | parents(Z_i))
    else
        z_i ← a random sample from P(Z_i | parents(Z_i)) ;
    s[i] ← z_i ;
return s, w

Probability of a sample generated by Weighted-Sample

The sampling proceeds in a topological order Z of the RVs {X} ∪ Y ∪ E. When a child RV is sampled, all its parent RVs have already been instantiated, either by sampling or because they are part of the evidence.



Let U = {X} ∪ Y be the set of non-evidence RVs. Weighted-Sample only samples RVs in U.
Then, the probability of a sample s = u ∪ e is

    P_WS(s) = Π_{i=1}^{|U|} P(U_i = u_i | parents(U_i)).                                  (10.23)

The computed weight of this sample is

    w(s) = Π_{i=1}^{|E|} P(E_i = e_i | parents(E_i)),                                     (10.24)

where parents(U_i) and parents(E_i) can contain both variables u instantiated by sampling and other, non-sampled, evidence variables.

Algorithm 37: Likelihood-Weighting


input : B, a BN ;
        X, a query RV ;
        E = e, the evidence vector RV ;
        N, the number of samples to generate
output: An estimate of P(X | e)

W ∈ R^|X|, a map from each value of X to its weighted counts, initialized to 0 ;
for j ← 1 to N do
    s, w ← Weighted-Sample(B, e) ;
    x ← value of X in sample s ;
    W[x] ← W[x] + w ;
return Normalize(W)
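A compact Python sketch of Algos. 36 and 37 combined is shown below; the dict-based alarm-network encoding and the sample count N are illustrative assumptions, not the course's code.

```python
import random

# Weighted-Sample (Algo. 36) + Likelihood-Weighting (Algo. 37) on the alarm network.
alarm_bn = {
    "B": ([], {(): 0.001}),
    "E": ([], {(): 0.002}),
    "A": (["B", "E"], {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (["A"], {(True,): 0.9, (False,): 0.05}),
    "M": (["A"], {(True,): 0.7, (False,): 0.01}),
}
ORDER = ["B", "E", "A", "J", "M"]            # a topological ordering

def weighted_sample(bn, order, evidence):
    s, w = dict(evidence), 1.0
    for var in order:
        parents, cpt = bn[var]
        p_true = cpt[tuple(s[p] for p in parents)]
        if var in evidence:                   # freeze evidence, multiply its likelihood
            w *= p_true if evidence[var] else 1.0 - p_true
        else:                                 # sample the non-evidence variables
            s[var] = random.random() < p_true
    return s, w

def likelihood_weighting(bn, order, query, evidence, N=100_000):
    W = {True: 0.0, False: 0.0}
    for _ in range(N):
        s, w = weighted_sample(bn, order, evidence)
        W[s[query]] += w
    total = W[True] + W[False]
    return {v: W[v] / total for v in W}

print(likelihood_weighting(alarm_bn, ORDER, "B", {"J": True, "M": True}))
# approaches P(B | j, m) ~ [0.284, 0.716] for large N
```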


Consistency of Likelihood-Weighting
Let N_x(y) be the number of samples of type s = {x} ∪ y ∪ e generated by Weighted-Sample.
Then, before normalization of W,

    W[x] = Σ_y N_x(y) w(s = {x} ∪ y ∪ e).

The expected value E[N_x(y)] = N · P_WS(s = {x} ∪ y ∪ e).
Substituting above, and absorbing N into the normalization constant α,

    E[W[x]] = α Σ_y P_WS(s = {x} ∪ y ∪ e) w(s = {x} ∪ y ∪ e)
            = α Σ_y P_B(s = {x} ∪ y ∪ e),    by (10.23), (10.24),
            = α P_B(x, e)

Therefore, Ŵ = α P_B(X, e) = P_B(X | e).

Noisy OR
Efficient Representation of CPTs

An effect (child node) can have several causes (parent nodes), e.g. the binary effect RV Fever can have binary cause-RVs Cold, Flu, Malaria, etc. It is difficult for a domain expert to specify all the numbers for the whole CPT P(Fever | Cold, Flu, Malaria).
The number of probabilities to be specified in a CPT increases exponentially with the number of parents.
Therefore, we use additional assumptions to keep this number bounded.


Noisy OR
Efficient Representation of CPTs

As a logical statement: Fever ⇔ Cold ∨ Flu ∨ Malaria.
Now, in noisy-OR, we allow Fever = False with a small probability even if a cause is True, e.g. Cold = True.
A patient may have a cold but no fever: the cold is then inhibited in its capacity to cause fever.


Noisy OR

Noisy OR makes two assumptions:
1. All possible causes are listed. A leak cause-node may be included: the latter is a catch-all for all other miscellaneous causes.
2. Inhibition of each parent is causally independent of the inhibition of any other parents:

    P(¬fever | Cold, Flu, Malaria) = P(¬fever | Cold) P(¬fever | Flu) P(¬fever | Malaria)      (10.25)


Noisy OR
This allows the CPT to be defined implicitly by the inhibition probabilities

    q_c ≜ P(¬fever | cold) = 1 − P(fever | cold),                                         (10.26)
    q_f ≜ P(¬fever | flu) = 1 − P(fever | flu),                                           (10.27)
    q_m ≜ P(¬fever | malaria) = 1 − P(fever | malaria).                                   (10.28)

The full CPT P(Fever | Cold, Flu, Malaria) can now be given in terms of these values by plugging them into (10.25).

    Cold  Flu  Malaria   P(¬Fever)       P(Fever)
    T     T    T         q_c q_f q_m     1 − q_c q_f q_m
    T     T    F         q_c q_f         1 − q_c q_f
    T     F    T         q_c q_m         1 − q_c q_m
    T     F    F         q_c             1 − q_c
    F     T    T         q_f q_m         1 − q_f q_m
    F     T    F         q_f             1 − q_f
    F     F    T         q_m             1 − q_m
    F     F    F         1.0             0.0
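As an illustration of how (10.25) compresses the CPT, here is a small Python sketch; the numeric inhibition probabilities are made-up values, not from the lecture.

```python
from itertools import product

# Build a noisy-OR CPT from inhibition probabilities q_i (illustrative values).
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}   # q_i = P(no fever | only cause i present)

def noisy_or_cpt(q):
    """Return P(Fever=True | assignment) for every truth assignment of the causes."""
    causes = sorted(q)
    cpt = {}
    for values in product([True, False], repeat=len(causes)):
        p_no_fever = 1.0
        for cause, present in zip(causes, values):
            if present:                  # each present cause fails to cause fever w.p. q[cause]
                p_no_fever *= q[cause]
        cpt[values] = 1.0 - p_no_fever   # P(fever) = 1 - product of inhibition probabilities
    return cpt

for assignment, p in noisy_or_cpt(q).items():
    print(dict(zip(sorted(q), assignment)), round(p, 3))
```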

Noisy MAX
The Noisy MAX is a generalization of the noisy OR for non-binary RVs.
Let Y be an RV with values 0, 1, ..., |Y| − 1, |Y| > 2. The RV Y is semantically graded, meaning that the value 0 implies that the effect Y is absent, and increasing values denote that Y is present with increasing intensity/degree.
The direct causes of Y are represented by Parents(Y) = X = {X_1, ..., X_n}. Each X_i is also a graded RV and can take values from 0 to |X_i| − 1.
Let z_{i,k} denote an instantiation of X in which X_i = k and X_j = 0 for all j ≠ i. Note that z_{i,0} ≡ 0.


Two Assumptions of Noisy MAX


1. When all causes are absent, the effect is absent:

    P(Y = 0 | X = 0) = 1.                                                                 (10.29)

2. We note that, for any x,

    P(Y ≤ d | x) = Σ_{k=0}^{d} P(Y = k | x).                                              (10.30)

The second assumption is that this can be written as a product of the effects of the X_i's acting independently:

    P(Y ≤ d | X_1 = k_1, ..., X_n = k_n) = Π_{i=1}^{n} P(Y ≤ d | z_{i,k_i}).              (10.31)

Parameterization of Noisy MAX I


The full CPT of size |Y| · |X_1| ⋯ |X_n| need not be specified. Using the assumptions above, only the following |Y| · Σ_{i=1}^{n} (|X_i| − 1) values need to be specified:

    c_{i,j,k} ≜ P(Y = i | z_{j,k}),  where                                                (10.32)
        i = 0 ... (|Y| − 1),   k = 1 ... (|X_j| − 1), due to (10.29).

    C_{d,j,k} ≜ P(Y ≤ d | z_{j,k}) = Σ_{i=0}^{d} c_{i,j,k}.                               (10.33)

Substituting (10.32) in (10.31),

    P(Y ≤ d | X_1 = k_1, ..., X_n = k_n) = Π_{j=1}^{n} C_{d,j,k_j} ≜ Q(d, x).             (10.34)

Parameterization of Noisy MAX II


Using the above, the implied CPT can be computed as

    P(Y = 0 | x) = Q(0, x),
    P(Y = i | x) = Q(i, x) − Q(i − 1, x),  i > 0.                                         (10.35)

    (a) c_{i,1,k}                      (b) c_{i,2,k}
    Y \ X1    0     1     2            Y \ X2    0     1     2
      0       1     0.5   0              0       1     0.5   0
      1       0     0.5   1              1       0     0.3   0
      2       0     0     0              2       0     0.2   1

Figure 68: Example of Noisy-MAX parameterization.
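A short sketch of how (10.33)-(10.35) turn the two tables of Figure 68 into a full CPT column; the NumPy encoding is an assumption of this sketch.

```python
import numpy as np

# Noisy-MAX construction for |Y| = 3 and two ternary causes X1, X2 (Figure 68).
c1 = np.array([[1.0, 0.5, 0.0],     # rows: Y = 0,1,2; columns: X1 = 0,1,2
               [0.0, 0.5, 1.0],
               [0.0, 0.0, 0.0]])
c2 = np.array([[1.0, 0.5, 0.0],
               [0.0, 0.3, 0.0],
               [0.0, 0.2, 1.0]])

C1 = np.cumsum(c1, axis=0)          # C_{d,1,k} = P(Y <= d | z_{1,k})
C2 = np.cumsum(c2, axis=0)          # C_{d,2,k} = P(Y <= d | z_{2,k})

def p_y_given(k1, k2):
    """Implied distribution P(Y | X1=k1, X2=k2) via Q(d,x) = C_{d,1,k1} * C_{d,2,k2}."""
    Q = C1[:, k1] * C2[:, k2]       # cumulative distribution of Y
    return np.diff(Q, prepend=0.0)  # P(Y=0)=Q(0), P(Y=i)=Q(i)-Q(i-1)

print(p_y_given(1, 2))              # e.g. distribution of Y when X1=1, X2=2
```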

Applications of BN

"The decomposition of large probabilistic domains into weakly connected subsets through conditional independence is one of the most important developments in the recent history of AI."
(Textbook, page 499.)

Expert Systems for Medicine

[Figure 69: The CPCS system of Pradhan et al. (1994) for internal medicine. It has 448 nodes, 906 edges, and uses Noisy-MAX distributions to reduce the specified CPT values from 133,931,430 to 8,254.]

Expert Systems for Medicine

Pathfinder: diagnosis accuracy at the level of a junior doctor for lymph diseases. (pathfinder.xdsl)
A recent example: the Hepar-II network for liver diseases. (Hepar II.xdsl)

Expert Systems for Fault Diagnosis

Fault-diagnosis for power-systems, cars, printers, etc.
NASA's VISTA project for showing fault-diagnosis of the Shuttle.
Autonomous Underwater Vehicles (AUVs).
A realistic example: the printer-troubleshooting file.

In Genetics

Gene-expression Analysis
Functional Annotation, Protein-protein interaction, Haplotype Inference
Pedigree Analysis
Survey: http://genomics10.bu.edu/bioinformatics/kasif/bayes-net.html

In Software

Document categorization, Semantic Web Ontologies (e.g. BayesOWL).
Data-compression.
Paper-clip: the erstwhile Microsoft Office Assistant. Read the humorous article: http://people.cs.ubc.ca/~murphyk/Bayes/econ.22mar01.html
Spam-email filtering.
Microsoft does active research in BN: http://research.microsoft.com/en-us/um/redmond/groups/adapt/msbnx/

Free and Commercial Software

A good list with comparison at: http://people.cs.ubc.ca/~murphyk/Software/bnsoft.html
Different file-formats for the exchange of Bayesian Network data: XML-based (.xmlbif, .xdsl), a successor of .bif, .net, .dsc, etc. See http://www.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/
Data-sets for benchmarking: http://genie.sis.pitt.edu/networks.html, http://www.cs.huji.ac.il/site/labs/compbio/Repository/
Read case-studies in different fields at: http://www.hugin.com/case-stories.

Some Learning Methodologies

Contents

Some Learning Methodologies


The Perceptron
Entropy
Decision Tree Learning
Cross-Validation
Unsupervised Learning
k-Means Clustering
The Dunn Index
The EM Algorithm
Mean-Shift Clustering


The Perceptron
Supervised Learning

The perceptron is a simple binary classifier for two linearly separable classes M− and M+.
Classification rule: given the learned weight-vector w ∈ R^d and threshold θ, and a vector x ∈ R^d to be classified,

    w·x ≤ θ  ⟹  x ∈ M−,                                                                   (11.1)
    w·x > θ  ⟹  x ∈ M+.                                                                   (11.2)

Redefine  w ← (w, −θ),  x ← (x, 1),                                                       (11.3)

so that the threshold is absorbed and the rule becomes w·x > 0.

Algorithm 38: PerceptronLearning


input : M+, M−
output: The learned weight vector w

repeat
    for all x ∈ M+ do
        if w·x ≤ 0 then
            w = w + x / ‖x‖
    for all x ∈ M− do
        if w·x > 0 then
            w = w − x / ‖x‖
until all x ∈ M+ ∪ M− are classified correctly ;
return w
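A minimal NumPy sketch of Algo. 38 on the augmented vectors of (11.3); the data points below are made up for illustration.

```python
import numpy as np

def perceptron_learning(M_plus, M_minus, max_epochs=1000):
    d = M_plus.shape[1]
    w = np.zeros(d)
    for _ in range(max_epochs):
        changed = False
        for x in M_plus:                     # positive class: want w.x > 0
            if w @ x <= 0:
                w = w + x / np.linalg.norm(x)
                changed = True
        for x in M_minus:                    # negative class: want w.x <= 0
            if w @ x > 0:
                w = w - x / np.linalg.norm(x)
                changed = True
        if not changed:                      # every example classified correctly
            return w
    return w                                 # may not converge if not separable

# 2D points augmented with a constant 1 as in (11.3).
M_plus  = np.array([[1.0, 1.0, 1.0], [2.0, 1.5, 1.0]])
M_minus = np.array([[-1.0, -0.5, 1.0], [-2.0, -1.0, 1.0]])
print(perceptron_learning(M_plus, M_minus))
```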


Learning w: Linearly Separable Points


[Figure: the perceptron's decision boundary after Iteration 1 and Iteration 2 on linearly separable points (axes from -2.5 to 2.5).]

Points not linearly separable


[Figure: the perceptron's decision boundary over Iterations 1-3 on points that are not linearly separable (axes from -2.5 to 2.5).]

Entropy

Given a PMF of an RV X, which specifies probabilities P_i = P(X = x_i) for the events X = x_i, we'd like to have a metric for:
how much choice is involved in the selection of an event from this set;
or, in other words, how uncertain we are of the outcome.
Let's call the desired metric H(X), or, in terms of probabilities, H(P_1, . . . , P_n), and first see which properties it should ideally possess.

Entropy

Desired property 1
H(P_1, . . . , P_n) is a continuous function of the P_i.


Entropy

Desired property 2
If all events X = xi are equally likely, Pi = 1/n, i = 1 . . . n. Then
H(1/n, . . . , 1/n) should be a monotonically increasing function of n.
As the number of equally likely events increases, our choice or
uncertainty increases.


Entropy: Desired Property 3


If a choice can be broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Example:

    H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)

[Figure 70: a two-stage choice tree for this example, with branch probabilities 1/2, 1/2 at the first level and 2/3, 1/3 below. Note that the net probabilities at the leaves (1/2, 1/3, 1/6) are the same.]


Entropy I

Theorem 11.1
The function satisfying the said three properties is
    H(X) ≜ H(P_1, . . . , P_n) = −Σ_{i=1}^{n} P_i log P_i                                 (11.4)


Definition 11.2 (A(n))
The entropy when all n choices are equally likely:

    A(n) ≜ H(P_1 = 1/n, P_2 = 1/n, . . . , P_n = 1/n)                                     (11.5)

Choice Tree

Figure 71: A choice tree with depth d = 3 and branching-factor b = 2.

Consider first levels 2 and 3, then also include level 1. Using property 3,

    A(2^3) = A(2^2) + 2^2 · (1/2^2) · A(2)
           = A(2) + 2^1 · (1/2^1) · A(2) + 2^2 · (1/2^2) · A(2)
           = 3 A(2).

Proof (Shannon, 1948), Part I


Consider a choice-tree with branching-factor b and depth d. Let it represent a total of b^d equally likely choices (leaves). Consider the sub-tree of depth d − 1 excluding the leaves of the original tree. By property 3, we have

    A(b^d) = A(b^{d-1}) + Σ_{i=1}^{b^{d-1}} (1/b^{d-1}) A(b)
           = A(b^{d-1}) + A(b),     similarly dividing A(b^{d-1}),
           = A(b^{d-2}) + 2 A(b),   continuing d times,
           = d A(b).                                                                      (11.6)

The only function satisfying (11.6) is A(n) ≜ k log n, k > 0, from property 2.

Proof (Shannon, 1948), Intermission

Knowing that H(P_1 = 1/n, P_2 = 1/n, . . . , P_n = 1/n) = k log n is nice, but what we're really after is the case H(P_1, P_2, . . . , P_n), where the P_i are arbitrary: this is proven in Part II.
Given any arbitrary values of the P_i, e.g. for n = 3, P_1 = 0.303, P_2 = 0.1417, P_3 = 0.5553, we can always write them as fractions of the form P_i = m_i / M. For the given example, M = 10,000, m_1 = 3030, m_2 = 1417, m_3 = 5553.
If any of the P_i is an irrational number (unlikely in a real situation), we can always approximate it by a rational number to any desired accuracy.

Proof (Shannon, 1948), Part II


Now consider M equally likely choices. We can break them down into a two-level choice-tree: the first level has n nodes x_i, i = 1 . . . n, with probabilities P_i = m_i / M, where Σ_{i=1}^{n} m_i = M. Therefore, Σ_{i=1}^{n} P_i = 1. Each node x_i then has m_i equally likely children.

[Figure 72: A two-level choice-tree: the first level branches with probabilities P_1 = m_1/M, P_2 = m_2/M, ..., P_n = m_n/M; node x_i then has m_i equally likely children.]

Proof (Shannon, 1948), Part II


So, property 3 gives

    A(M) = H(P_1, . . . , P_n) + Σ_{i=1}^{n} P_i A(m_i)
    H(P_1, . . . , P_n) = k (Σ_{i=1}^{n} P_i) log M − k Σ_{i=1}^{n} P_i log m_i,    with Σ_{i=1}^{n} P_i = 1,
                        = −k Σ_{i=1}^{n} P_i log (m_i / M)
                        = −k Σ_{i=1}^{n} P_i log P_i.

If we select k = 1 and log_2, we measure entropy in bits; for log_e, in nats; and for log_10, in bans.

Entropy

Thus, the entropy in bits of an RV is defined as

    H(X) ≜ E[−log_2 P(X)]                                                                 (11.7a)
         = −Σ_{i=1}^{|X|} P(x_i) log_2 P(x_i),       for DRVs                             (11.7b)
         = −∫_{D(X)} p(x) log_2 p(x) dx,             for CRVs                             (11.7c)
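A small helper matching (11.7b); the example PMFs are illustrative, not from the lecture.

```python
import numpy as np

def entropy_bits(pmf):
    """H(X) = -sum_i P(x_i) log2 P(x_i), with the convention 0*log2(0) = 0."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # drop zero-probability values (0 log 0 = 0)
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))       # 1.0 bit for a fair coin
print(entropy_bits([1/3, 1/3, 1/3]))  # log2(3) ~ 1.585 bits
```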

Computing 0 log 0
By L'Hôpital's rule,

    lim_{x→0+} x log x = lim_{x→0+} (log x) / (1/x) = lim_{x→0+} (1/x) / (−1/x^2) = lim_{x→0+} (−x) = 0.

Conditional Entropy
A choice-tree based derivation of conditional entropy was done in the class. The derivation can also be done purely mathematically as follows:

    H(Y | X) ≜ Σ_{i=1}^{|X|} P(x_i) H(Y | X = x_i) = −Σ_{i=1}^{|X|} P(x_i) Σ_{j=1}^{|Y|} P(y_j | x_i) log_2 P(y_j | x_i)      (11.8)
             = −Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(x_i, y_j) log_2 [ P(x_i, y_j) / P(x_i) ]                                        (11.9)
             = −Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(x_i, y_j) log_2 P(x_i, y_j) + Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(x_i, y_j) log_2 P(x_i)
             = H(X, Y) − H(X)                                                                                                 (11.10)

An Important Result Regarding Conditional Entropy I


[Figure 73: Plot of H(x) = −x log_2 x for x ∈ [0, 1]. It is a concave function.]

An Important Result Regarding Conditional Entropy II

For a concave function f(x), a version of Jensen's inequality applies:

    f( Σ_i λ_i x_i ) ≥ Σ_i λ_i f(x_i),     if Σ_i λ_i = 1 and λ_i ≥ 0.                    (11.11)

We now start from (11.9) and write it as

    H(Y | X) = Σ_{j=1}^{|Y|} Σ_{i=1}^{|X|} P(x_i) H( P(y_j | x_i) )                       (11.12)

An Important Result Regarding Conditional Entropy III


Since Σ_i P(x_i) = 1, we can use (11.11) with λ_i identified as P(x_i):

    H(Y | X) ≤ Σ_{j=1}^{|Y|} H( Σ_{i=1}^{|X|} P(x_i) P(y_j | x_i) )                       (11.13)
             = Σ_{j=1}^{|Y|} H( Σ_{i=1}^{|X|} P(x_i, y_j) )
             = Σ_{j=1}^{|Y|} H( P(y_j) )
             = H(Y),    by (11.7b).

    H(Y | X) ≤ H(Y)                                                                       (11.14)

Therefore, knowing some information X can only reduce (never increase) the uncertainty of Y.

Mutual Information
Mutual information of two RVs X and Y is defined as

    I(X, Y) ≜ Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(x_i, y_j) log_2 [ P(x_i, y_j) / (P(x_i) P(y_j)) ]     (11.15a)
            = H(Y) − H(Y | X)                                                                     (11.15b)
            = H(X) − H(X | Y)                                                                     (11.15c)
            = H(X, Y) − H(X | Y) − H(Y | X)                                                       (11.15d)
            = H(X) + H(Y) − H(X, Y)                                                               (11.15e)

The expression (11.15b) shows that the mutual information is also the information gain, i.e. the reduction in the uncertainty of Y on knowing X.
Applying (11.14) to (11.15b), we see that I(X, Y) is always non-negative.
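A small sketch of (11.15a) computed from a joint PMF; the joint table below is illustrative, not from the lecture.

```python
import numpy as np

P_xy = np.array([[0.30, 0.10],      # rows: values of X, columns: values of Y
                 [0.20, 0.40]])

def mutual_information_bits(P_xy):
    Px = P_xy.sum(axis=1, keepdims=True)      # marginal P(X)
    Py = P_xy.sum(axis=0, keepdims=True)      # marginal P(Y)
    mask = P_xy > 0
    return float(np.sum(P_xy[mask] * np.log2((P_xy / (Px * Py))[mask])))

print(mutual_information_bits(P_xy))          # I(X, Y) >= 0
```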

Venn Diagram

[Venn diagram: the two circles H(X) and H(Y) overlap in I(X, Y); the non-overlapping parts are H(X | Y) and H(Y | X), and the union is H(X, Y).]

Principle of Maximum Entropy


E.T. Jaynes, 1957

The principle of maximum entropy is a postulate which states that, subject to known constraints (called testable information), the probability distribution which is the least biased, i.e. which assumes the least prior information, is the one with the largest entropy.
Example: consider a boolean DRV X with P(X = True) = p. Then its entropy is

    H_b(p) ≜ H(p, 1 − p) = −p log p − (1 − p) log (1 − p).                                (11.16)

If no other prior information is available, then by differentiating the above w.r.t. p, we find that the maximum entropy is achieved if we select p = 1/2. The maximal value of the entropy is 1 bit.
Example: if only the mean μ and the variance σ² of a CRV are known beforehand, then, subject to these constraints, the Gaussian distribution has the maximum entropy.

Decision Tree Learning

Decision Trees
[Figure 74: A decision-tree for deciding whether to wait for a table. The root tests Patrons? (None → No, Some → Yes, Full → WaitEstimate?); deeper tests include Alternate?, Bar?, Hungry?, Reservation?, Fri/Sat?, and Raining?, with Yes/No decisions at the leaves.]

Learning Decision Trees from Examples


Table 1: To wait or not to wait

    Ex.   Alt  Bar  Fri  Hun  Patr   Prc   Rain  Rsrv  Type  Est     Wait?
    x1    Y    N    N    Y    Some   $$$   N     Y     Frn   0-10    y1  = Y
    x2    Y    N    N    Y    Full   $     N     N     Thai  30-60   y2  = N
    x3    N    Y    N    N    Some   $     N     N     Burg  0-10    y3  = Y
    x4    Y    N    Y    Y    Full   $     Y     N     Thai  10-30   y4  = Y
    x5    Y    N    Y    N    Full   $$$   N     Y     Frn   >60     y5  = N
    x6    N    Y    N    Y    Some   $$    Y     Y     Itl   0-10    y6  = Y
    x7    N    Y    N    N    None   $     Y     N     Burg  0-10    y7  = N
    x8    N    N    N    Y    Some   $$    Y     Y     Thai  0-10    y8  = Y
    x9    N    Y    Y    N    Full   $     Y     N     Burg  >60     y9  = N
    x10   Y    Y    Y    Y    Full   $$$   N     Y     Itl   10-30   y10 = N
    x11   N    N    N    N    None   $     N     N     Thai  0-10    y11 = N
    x12   Y    Y    Y    Y    Full   $     N     N     Burg  30-60   y12 = Y

(Input attributes: Alt, Bar, Fri, Hun, Patr, Prc, Rain, Rsrv, Type, Est; output: Wait?)

Aim
To build a decision-tree from examples which allows us, on average, to reach a decision as fast as possible, i.e. with the least number of checks.

Strategy
We check the attributes X_i (i.e. split the tree) in decreasing order of their mutual information I(X_i, Y) = H(Y) − H(Y | X_i) with the decision (class Y).


Information Gain on Selecting an Attribute for Splitting


Reminder:

    H_b(p) ≜ H(p, 1 − p) = −p log_2(p) − (1 − p) log_2(1 − p).                            (11.17)

Information-gain for attribute Type:

    I(Type, Wait) = H(Wait) − H(Wait | Type)
    H(Wait | Type) = Σ_{t = f,t,b,i} P(Type = t) Σ_{w = Y,N} h( P(w | t) )
                   = Σ_{t = f,t,b,i} P(Type = t) H_b( P(Wait = Y | t) ).

Information Gain on Selecting an Attribute for Splitting

We estimate the pmf of the boolean RV Wait? using the examples.


We get P(Wait? = Y ) = 6/12 = 0.5, so, P(Wait? = N) = 0.5.
Using (11.16), H(Wait? ) = 1 bit.
Type: Frn (+1,-1), Thai (+2,-2), Burg (+2,-2), Itl (+1,-1).
Patrons: None (+0, -2), Some (+4, -0), Full (+2,-4).


Information Gain on Selecting an Attribute for Splitting


Let us compute the conditional entropy H(Wait? | Type ) from
(11.8).
We will estimate the pmf of Type by the examples given:
P(Type = Frn ) = Pf = 2/12, P(Type = Thai ) = Pt = 4/12,
P(Type = Itl ) = Pi = 2/12, P(Type = Burg ) = Pb = 4/12.
We also need the conditional probabilities:
P(Wait? = Y | Type = Frn ) = Pwf = 1/2,
P(Wait? = Y | Type = Thai ) = Pwt = 2/4,
P(Wait? = Y | Type = Itl ) = Pwi = 1/2,
P(Wait? = Y | Type = Burg ) = Pwb = 2/4.
Using (11.8) and (11.16), we have,
H(Wait? | Type ) = Pf Hb (Pwf ) + Pt Hb (Pwt ) + Pi Hb (Pwi ) +
Pb Hb (Pwb ) = (2/12)1 + (4/12)1 + (2/12)1 + (4/12)1 = 1 bit. So our
information-gain is H(Wait? ) H(Wait? | Type ) = 0!


Information Gain on Selecting an Attribute for Splitting


Let us compute the conditional entropy H(Wait? | Patrons ) from
(11.8).
We will estimate the pmf of Patrons by the examples given:
P(Patrons = None ) = Pn = 2/12, P(Patrons = Some ) = Ps = 4/12,
P(Patrons = Full ) = Pf = 6/12.
We also need the conditional probabilities:
P(Wait? = Y | Patrons = None ) = Pwn = 0/2,
P(Wait? = Y | Patrons = Some ) = Pws = 4/4,
P(Wait? = Y | Patrons = Full ) = Pwf = 2/6.
Using (11.8) and (11.16), we have
H(Wait? | Patrons) = P_n H_b(P_wn) + P_s H_b(P_ws) + P_f H_b(P_wf) = (2/12)·0 + (4/12)·0 + (6/12)·H_b(1/3) = 0.5 x 0.9183 = 0.4591.
The information gain is H(Wait?) − H(Wait? | Patrons) = 0.5408 bits.

Our goal is to reach a decision as soon as possible, so we should split on the attribute which leads to the maximum information gain. In this example (after computing the gain also for all other attributes) it is Patrons, as the sketch below confirms.
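A small sketch that recomputes the two gains above from Table 1 (only the Type and Patrons columns are encoded; the representation is an assumption of this sketch):

```python
from collections import Counter
from math import log2

# (attribute dict, wait decision) for the 12 examples of Table 1.
examples = [
    ({"Type": "Frn",  "Patrons": "Some"}, "Y"), ({"Type": "Thai", "Patrons": "Full"}, "N"),
    ({"Type": "Burg", "Patrons": "Some"}, "Y"), ({"Type": "Thai", "Patrons": "Full"}, "Y"),
    ({"Type": "Frn",  "Patrons": "Full"}, "N"), ({"Type": "Itl",  "Patrons": "Some"}, "Y"),
    ({"Type": "Burg", "Patrons": "None"}, "N"), ({"Type": "Thai", "Patrons": "Some"}, "Y"),
    ({"Type": "Burg", "Patrons": "Full"}, "N"), ({"Type": "Itl",  "Patrons": "Full"}, "N"),
    ({"Type": "Thai", "Patrons": "None"}, "N"), ({"Type": "Burg", "Patrons": "Full"}, "Y"),
]

def H(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attr):
    all_labels = [y for _, y in examples]
    cond = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        cond += len(subset) / len(examples) * H(subset)
    return H(all_labels) - cond

print(information_gain("Type"))      # 0.0
print(information_gain("Patrons"))   # ~0.541
```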

Notation
Let A be the vector RV of all attribute-DRVs. Each DRV A_i has a domain of values {a_{i,j} | j = 1 . . . |A_i|}.
Let the example-set be denoted as

    X = {(A = a_k, Y = y_k) | k = 1 . . . |X|},    X_i ⊆ X.                               (11.18)

If x ∈ X, then we use the notation x.Y and x.A_i to denote its classification and the value of the i-th attribute, respectively.
The function Plurality-Value(X_i) returns the majority classification Y among all examples in X_i. It resolves ties randomly.
Then the call Decision-Tree-Learning(X, A, ∅) returns the decision-tree. The algorithm is given in the next slide.


Algorithm 39: Decision-Tree-Learning


input : Remaining examples X_r, set of remaining attributes A_r, parent-node examples X_p
output: a tree

if X_r = ∅ then return Plurality-Value(X_p) ;
else if ∀x ∈ X_r, x.Y = c then return c (leaf) ;
else if A_r = ∅ then return Plurality-Value(X_r) ;
else
    A* ← arg max_{A ∈ A_r} Information-Gain(A, Y, X_r) ;
    tree ← a new decision-tree with root-test A* ;
    foreach value a_i of attribute A* do
        X_r[A* = a_i] ← {x | x ∈ X_r and x.A* = a_i} ;
        Subtree s ← Decision-Tree-Learning(X_r[A* = a_i], A_r − {A*}, X_r) ;
        Add a branch to tree with label A* = a_i and subtree s ;
    return tree ;

Recall: The Original Decision Tree


[Figure: the original hand-built decision tree (Figure 74), repeated here for comparison with the learned tree on the next slide.]

The Faster Decision Tree Based on Information Gain


[Figure 75: The decision-tree deduced from the 12 examples of the training-set: Patrons? (None → No, Some → Yes, Full → Hungry? (No → No, Yes → Type? (French → Yes, Italian → No, Burger → Yes, Thai → Fri/Sat? (No → No, Yes → Yes)))). Some attributes are never checked to arrive at a decision.]

Other Information-Theoretic Criteria for Tree-Building


Information-gain is a good criterion for splitting if all attributes have an equal number of values. If not, it is biased towards attributes with a higher number of values. Example: an ID number.
In such cases, the gain-ratio may be a better criterion. For an attribute A and a decision-variable C, it is defined as

    GR(A) = I(A, C) / H(A)                                                                (11.19)

It penalizes a higher number of attribute values through the term H(A), which is defined by (11.7b) and computed as

    H(A) = −Σ_{i=1}^{|A|} [(p_i + n_i) / (p + n)] log_2 [(p_i + n_i) / (p + n)]           (11.20)

Usually (Quinlan, 1986), the gain-ratio is only computed for attributes with an above-average value of the information-gain I(A, C), and only these attributes are considered candidates for splitting.

Decision Tree with a Non-Binary Decision

If we have k classes, e.g. for mushrooms: edible, poisonous, and hallucinogenic, we can easily generalize the decision-tree building by using (11.9) to compute the conditional entropy H(Class | Attribute) while computing the information-gain.
This means that you cannot use the shortcut entropy expression H_b(P), since it applies only to binary RVs. You have to use the general entropy expression with all |Class| terms in the summation.


Cross-Validation

Model-Complexity Selection
[Figure 76: Different hypothesis functions h (linear, seventh-degree polynomial, sixth-degree polynomial, sinusoidal) fitted to the same data, panels (a)-(d). Which are preferable?]

Ockham's Razor
Entities must not be multiplied beyond necessity. In other words, we should tend towards simpler theories until more complicated theories become necessary to explain new observations. The field of Model Selection studies these ideas, many of them based on the information entropy: MDL (minimum description length), AIC (Akaike information criterion), etc.

Error-Rate
It is the proportion of mistakes a given hypothesis makes, i.e. the proportion of times h(x) ≠ y.


Recall: Holdout Cross-Validation


Split the available examples randomly into a training-set, from which the learning algorithm produces a hypothesis, and a test-set, on which the accuracy of h is evaluated. Disadvantage: we cannot use all examples for finding h.

[Figure 77: Holdout cross-validation: the examples e_1, ..., e_m are split into a training part and a test part.]



k-fold Cross-Validation
Divide the available examples into k equal subsets. Perform k rounds of learning: in each round, (1 − 1/k) of the data is used as the training-set and the remaining 1/k of the data is used as the test-set (now called the validation-set). Then the average test score of the k rounds is taken as the performance measure. Typical choices are k = 5, 10, m; k = m is called leave-one-out cross-validation (LOOCV).

[Figure 78: 5-fold cross-validation, iteration 3/5: one of the five parts of e_1, ..., e_m serves as the validation set, the rest as the training set.]



k-Fold-Cross-Validation
Algorithm 40: k-Fold-Cross-Validation
input : Learner, a learning-algorithm; s, a model-complexity parameter;
        k; examples
output: Average training-set error-rate eT, average validation-set error-rate eV

eV ← 0, eT ← 0 ;
for i = 1 to k do
    St, Sv ← Partition(examples, i, k) ;
    h ← Learner(s, St) ;
    eT ← eT + Error-Rate(h, St) ;
    eV ← eV + Error-Rate(h, Sv) ;
return (eT/k, eV/k)
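A minimal Python sketch of Algo. 40; the fold construction, the dummy majority-class learner and the error function are illustrative assumptions, not the course's code.

```python
import random

def k_fold_cross_validation(learner, s, k, examples, error_rate):
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]          # k roughly equal subsets
    e_T = e_V = 0.0
    for i in range(k):
        S_v = folds[i]                                   # validation set of round i
        S_t = [x for j, f in enumerate(folds) if j != i for x in f]
        h = learner(s, S_t)
        e_T += error_rate(h, S_t)
        e_V += error_rate(h, S_v)
    return e_T / k, e_V / k

# Tiny usage: a "learner" that returns the majority class of labeled pairs (x, y).
data = [(i, i % 2) for i in range(20)]
majority = lambda s, S: max(set(y for _, y in S), key=[y for _, y in S].count)
err = lambda h, S: sum(1 for _, y in S if y != h) / len(S)
print(k_fold_cross_validation(majority, None, 5, data, err))
```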


Model Selection Complexity vs. Goodness of Fit


Algorithm 41: Cross-Validation-Wrapper

input      : Learner, k, examples
output     : Model of optimal complexity
local vars.: eT, an array indexed by model-complexity s, storing training-set error-rates;
             eV, an array indexed by model-complexity s, storing validation-set error-rates.

for s = 1 to ∞ do
    (eT[s], eV[s]) ← k-Fold-Cross-Validation(Learner, s, k, examples) ;
    if eT has converged then
        s* ← value of s with minimum eV[s] ;
        return Learner(s*, examples)


Example: Decision-Tree Learning


s = number of nodes.
The Learner passed to the wrapper is Decision-Tree-Learning (Algo. 39). It builds the tree breadth-first, rather than depth-first (still using the information-gain criterion), and stops when the maximum specified s is reached.

[Plot: training-set and validation-set error rates versus tree size (number of nodes).]

Using Learning Curves for Comparison of Learners

[Figure 79: From "Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection", by Ratanamahatana et al.]


Unsupervised Learning

Clustering
Clustering partitions the available data x_i (feature-vectors) into clusters, such that samples belonging to a cluster are similar according to some criterion. It finds usage in:
Data analysis and compression
Pattern recognition
Image segmentation
Bioinformatics


Clustering example: data-compression


We have an image stored with 24 bits/pixel (16 million colors). Denote these N pixels as x_j, j = 1 . . . N, each encoded in 24 bits.
We want to compress it to 8 bits/pixel (256 colors).
Color quantization: we want to devise a palette (colormap) of 256 colors which approximates the colors present in the original image as closely as possible.
This means that we want to find m_i, i = 1 . . . k (k = 256), each encoded again in 24 bits, such that each original pixel's color can be approximated by the m_i nearest to it in a color-space.
The m_i are called codebook vectors or code-words.
Ideally, we should choose the code-words m_i so as to minimize the reconstruction error

    E(m_1, . . . , m_k) = Σ_{j=1}^{N} min_p ‖x_j − m_p‖                                   (11.21)

k-Means Clustering

The k-Means Algorithm


Algorithm 42: k-Means

input : Samples x_j, j = 1 . . . N; number of clusters k
output: Cluster-centers / codebook-vectors m_i, i = 1 . . . k

Initialize m_i, e.g. to k random x_j ;
repeat
    for j ← 1 . . . N do
        b_{j,i} = 1 if ‖x_j − m_i‖ = min_p ‖x_j − m_p‖, and 0 otherwise ;
    for i ← 1 . . . k do
        m_i ← (Σ_{j=1}^{N} b_{j,i} x_j) / (Σ_{j=1}^{N} b_{j,i}) ;
until all m_i have converged;
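A minimal NumPy sketch of Algo. 42; the data, k, and the initialization choice are illustrative assumptions.

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=np.random.default_rng(0)):
    m = X[rng.choice(len(X), size=k, replace=False)]        # init centers to k random samples
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)   # N x k distances
        assign = d.argmin(axis=1)                            # hard assignment b_{j,i}
        new_m = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):                            # all centers converged
            break
        m = new_m
    return m, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, labels = k_means(X, k=2)
print(centers)
```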


Animation of k-means

Figure 80: http://cs.joensuu.fi/sipu/clustering/animator/


The Dunn Index

The Dunn Index


A metric for evaluating clustering algorithms

Step 1: A metric Δ_i for intra-cluster distance (cluster-size)
Let x and y be feature-vectors assigned to the same cluster C_i. Then, the following are possible metrics for the size of C_i:

    Δ_i = max_{x,y ∈ C_i} ‖x − y‖                                                         (11.22)
    Δ_i = [1 / (|C_i| (|C_i| − 1))] Σ_{x,y ∈ C_i, x ≠ y} ‖x − y‖                          (11.23)
    Δ_i = (1 / |C_i|) Σ_{x ∈ C_i} ‖x − μ_i‖,   where μ_i is the cluster mean.             (11.24)

The Dunn Index


A metric for evaluating clustering algorithms

Step 2: A metric δ(C_i, C_j) for inter-cluster distance
The following are possible metrics for the inter-cluster distance:

    δ(C_i, C_j) = max_{x ∈ C_i, y ∈ C_j} ‖x − y‖                                          (11.25)
    δ(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} ‖x − y‖                                          (11.26)
    δ(C_i, C_j) = ‖μ_i − μ_j‖.                                                            (11.27)

The Dunn Index


A metric for evaluating clustering algorithms

Step 3: The Dunn index
Let there be m clusters C_1, C_2, . . . , C_m. Then

    DI_m ≜ [ min_{1≤i≤m} min_{1≤j≤m, j≠i} δ(C_i, C_j) ] / [ max_{1≤k≤m} Δ_k ]             (11.28)

The Dunn Index


A metric for evaluating clustering algorithms

    DI_m ≜ [ min_{1≤i≤m} min_{1≤j≤m, j≠i} δ(C_i, C_j) ] / [ max_{1≤k≤m} Δ_k ]

Higher values are better.
If m is not known, the m for which DI_m is the highest can be chosen.
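A small sketch of (11.28), using the diameter (11.22) and the minimum inter-cluster distance (11.26); the random clusters are illustrative.

```python
import numpy as np
from itertools import combinations

def dunn_index(clusters):
    """clusters: list of (n_i, d) arrays, one array per cluster."""
    diameters = [max(np.linalg.norm(x - y) for x, y in combinations(C, 2)) if len(C) > 1 else 0.0
                 for C in clusters]
    min_sep = min(np.linalg.norm(x - y)
                  for Ci, Cj in combinations(clusters, 2) for x in Ci for y in Cj)
    return min_sep / max(diameters)

rng = np.random.default_rng(0)
clusters = [rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))]
print(dunn_index(clusters))      # tight, well-separated clusters give a high value
```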

The EM Algorithm

Expectation-Maximization (EM) Algorithm


Parametric Clustering

[Figure 81: A Gaussian mixture computed from 500 samples with weights (left to right) 0.2, 0.3, 0.5; panels (a) and (b).]

A Gaussian Mixture Model (GMM) for Clustering


Recall: Multivariate Gaussian

    N(X = x; μ, Σ) ≜ [1 / ((2π)^{n/2} |Σ|^{1/2})] exp( −(1/2) (x − μ)^T Σ^{-1} (x − μ) )  (11.29)

GMM pdf

    p(x) = Σ_{i=1}^{K} P(C = i) N(x; μ_i, Σ_i),    where                                  (11.30)
    Σ_{i=1}^{K} P(C = i) = 1.                                                             (11.31)

Expectation/ Expected Value


If X is a DRV, a function f(X = x) has the expected value

    E[f] ≜ Σ_{i=1}^{|X|} P(x_i) f(X = x_i)                                                (11.32)

For vector DRVs, the definition can be extended simply:

    E[f] ≜ Σ_{i=1}^{|X|} P(x_i) f(X = x_i)                                                (11.33)

If X is a CRV, a function f(X = x) has the expected value

    E[f] ≜ ∫_{D(X)} p(x) f(x) dx                                                          (11.34)

This can also be generalized similarly to vector CRVs.

Covariance

For a vector RV (discrete or continuous) X ∈ R^n,

    Cov(X) ≜ E[ (X − E[X]) (X − E[X])^T ]                                                 (11.35)

Cov(X) ∈ R^{n×n}, and is symmetric as well as guaranteed to be at least positive semi-definite. In practice, it is usually positive-definite.

Mahalanobis Distance
If X ∈ R^n is a normally distributed vector continuous RV (CRV), its normal/Gaussian pdf is defined as

    N(X = x; μ, C) ≜ [1 / ((2π)^{n/2} |C|^{1/2})] exp( −(1/2) (x − μ)^T C^{-1} (x − μ) )  (11.36)

For a sample of a CRV X = x, the Mahalanobis distance can be used to compare the weighted distance of samples from the mean of a Gaussian distribution. It is defined as

    Δ²_C(x) ≜ (x − μ)^T C^{-1} (x − μ).                                                   (11.37)

Mahalanobis Distance

    Δ²_C(x) ≜ (x − μ)^T C^{-1} (x − μ).

A smaller value represents more confidence. Example: if x = (x, y)^T, μ = (μ_x, μ_y)^T, and C = diag(σ_x², σ_y²), then

    Δ²_C = (x − μ_x)² / σ_x² + (y − μ_y)² / σ_y².                                         (11.38)

The Mixture Distribution


A datum x is sampled from a mixture of K (known) components in two steps:
1. Choose a component: sample the pmf P(C = i), i = 1 . . . K.
2. Generate a sample from the component: sample the pdf p(x | C = i).
The likelihood of the sample x is then

    p(x) ≜ Σ_{i=1}^{K} P(C = i, x) = Σ_{i=1}^{K} P(C = i) p(x | C = i)                    (11.39)

If the p(x | C = i) are multivariate Gaussians, we have a Mixture of Gaussians. Each Gaussian is parameterized by its mean μ_i and covariance Σ_i. The parameter P(C = i) is called the weight of the i-th component. These taken together for all i form our parameter-vector θ.

Problem 11.3 (Gaussian Mixture)


Given data x_j, j = 1 . . . n, and the number of components K, estimate the parameters P(C = i), μ_i, and Σ_i, i = 1 . . . K.

Chicken or Egg Problem
If we knew which component i generated each x_j, we could estimate the Gaussian parameters by maximizing their likelihood.
If we knew the Gaussian parameters for all components, we could assign each x_j to the component with the minimum Mahalanobis distance and hence estimate P(C = i).
Problem is, we know neither!
Hence we formulate an algorithm which iteratively increases the expectation of the likelihood of the data.
Initialization: choose some reasonable random values for all parameters θ_i: P(C = i), μ_i, and Σ_i, i = 1 . . . K.

The EM Algorithm: The E-Step


Let us define a hidden boolean DRV Z_ij: if x_j was generated by component i, Z_ij = True; if not, Z_ij = False.
Let us estimate P(Z_ij):

    P(Z_ij = True) = P(C = i | x_j) = α p(x_j | C_i) P(C_i)                               (11.40)

The last expression can be evaluated based on the parameters from the previous iteration of the EM algorithm.
The expected count of data-samples in category i out of n samples can be computed as

    n_i = Σ_{j=1}^{n} I(Z_ij = True)                                                      (11.41)
    n̂_i ≜ E[n_i] = Σ_{j=1}^{n} P(Z_ij = True).                                            (11.42)

The EM Algorithm: The M-Step


Update the estimates of the Gaussian parameters of each component i = 1 . . . K as follows (order important):

    μ_i ← (1 / n̂_i) Σ_{j=1}^{n} P(Z_ij = True) x_j                                        (11.43)
    Σ_i ← (1 / n̂_i) Σ_{j=1}^{n} P(Z_ij = True) (x_j − μ_i)(x_j − μ_i)^T                   (11.44)
    P(C = i) ← n̂_i / n.                                                                   (11.45)
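A compact NumPy/SciPy sketch of one possible implementation of (11.40)-(11.45); the data, K, the initialization, and the small regularization term are assumptions of this sketch, not the lecture's code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, rng=np.random.default_rng(0)):
    n, d = X.shape
    weights = np.full(K, 1.0 / K)                       # P(C = i)
    means = X[rng.choice(n, K, replace=False)]          # mu_i
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])  # Sigma_i
    for _ in range(n_iter):
        # E-step: responsibilities P(Z_ij = True) = alpha * p(x_j | C_i) P(C_i)
        R = np.column_stack([weights[i] * multivariate_normal.pdf(X, means[i], covs[i])
                             for i in range(K)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update mu_i, Sigma_i, P(C = i) using the expected counts
        n_i = R.sum(axis=0)
        means = (R.T @ X) / n_i[:, None]
        for i in range(K):
            diff = X - means[i]
            covs[i] = (R[:, i, None] * diff).T @ diff / n_i[i] + 1e-6 * np.eye(d)
        weights = n_i / n
    return weights, means, covs

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
w, mu, Sigma = em_gmm(X, K=2)
print(w, mu, sep="\n")
```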

The EM Algorithm: An Important Property

In each E+M step, the log-likelihood of the whole data increases or stays the same. The log-likelihood is defined, using (11.39), as

    L(x_{1:n}) = Σ_{j=1}^{n} log p(x_j)                                                   (11.46)
               = Σ_{j=1}^{n} log [ Σ_{i=1}^{K} P(C = i) p(x_j | C = i) ].                 (11.47)

The proof is outside the scope of this course, but uses Jensen's inequality.

The EM Algorithm: Example Result


[Figure 82: Log-likelihood L as a function of the EM iteration number. The horizontal dashed line is the log-likelihood of the true mixture.]

The EM Algorithm: Potential Problems

K unknown.
A Gaussian component may shrink to cover just one point: in this
case, its covariance determinant becomes 0 and the likelihood blows
up. Restart with a new better initial guess.


Mean-Shift Clustering

Mean-Shift
Non-Parametric Clustering

Up to now the number of clusters was always given (e.g. k = 256 in the image-compression example).
What if we want the algorithm to figure out the number of clusters, if we give it some hints about the structure: the significance of distances in each of the dimensions of the vector x? These are called bandwidths.
The mean-shift algorithm makes clusters by finding the basins of attraction of the local peaks of the pdf p(x).
The pdf p(x) is estimated by kernel density estimation (KDE).
Detailed derivation done in the class.
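A minimal 1-D KDE sketch with a Gaussian kernel, as used to estimate p(x) here; the samples and bandwidths are illustrative.

```python
import numpy as np

def kde(query, samples, h):
    """Average of Gaussian kernels of bandwidth h centered at the samples."""
    u = (query[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

samples = np.random.randn(200)              # 200 samples from a standard Gaussian
xs = np.linspace(-4, 4, 9)
for h in (2.0, 0.3, 0.1):                   # the bandwidths of Figure 83
    print(h, np.round(kde(xs, samples, h), 3))
```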


Kernel Density Estimation

[Figure 83: KDE using 200 samples from a standard Gaussian, compared with the reference density dnorm(x); the KDE uses a Gaussian kernel with bandwidths 2.0, 0.3, and 0.1.]

[Figure 84, panels (a) hx = 30, hy = 30; (b) hx = 60, hy = 60; (c) hx = 60, hy = 30: effect of the bandwidth on the clustering of randomly generated points on a 640 x 480 grid. The trajectories of all points during the mean-shift iterations are plotted.]


[Figure 85, panels (a) hx = 30, hy = 30; (b) hx = 60, hy = 60: clustering an RGBD image from a Kinect in the multi-dimensional space of color (Luv) (hL = 10, huv = 20), pixel-coordinates (hp = 15 pixels), local normals (hn = 45°), and range (hr = 15 mm).]


Color Spaces

[Figure 86: (a) the RGB and (b) the Hue-Saturation-Value color spaces. Credits: Wikipedia.]


Perceptually Uniform Color-Spaces


CIELuv, CIELab

[Figure 87: the perceptually uniform CIELuv/CIELab color spaces. Credits: Wikipedia.]


Home-work Assignments

Contents

Home-assignment 1
Home-assignment 2
Home-assignment 3
Home-assignment 4
Home-assignment 5


Home-assignment 1

HA 1: A Path-Planning and Simulated Annealing I


Groups of 2. Due date 27.9., 23:59

[Figure 88: The input map. (a) A 2D occupancy grid laser map of the Intel lab. (b) The thresholded map (t = 0.55).]


HA 1: A Path-Planning and Simulated Annealing II


Groups of 2. Due date 27.9., 23:59

1. Code the general Graph-Search Algorithm 4 and specialize it to Dijkstra and A*. Test your code on the map shown in Fig. 88 for the start and goal points given in astar.py. Your program should work for any given start and goal points. The data-files for this homework are available on Campus-Net. (40%)
1.1 The thresholded map is available as a gzipped pickled numpy.array object in the uploaded map1c.p.gz. Code to extract this object is in the uploaded file astar.py, which contains some skeleton code. If you prefer working in C++, request the TA to provide you the map in a text format.
1.2 Code your own Node and Graph classes. In this home-work, use of external graph-libraries is not required.
1.3 Write all your code in a module and in the unit-testing section of the module. The resultant path should be dumped in a file map path.png. Look at the functions of matplotlib.image.

HA 1: A Path-Planning and Simulated Annealing III


Groups of 2. Due date 27.9., 23:59

1.4 Each pixel is a node. If only 4-successors (up-down-left-right) are considered, each edge has cost 1. If the diagonal successors are also considered, their edge cost is √2. Your program should be configurable to work both ways. Pixel value 1 means that the cell is free, value 0 means that it is occupied.
1.5 Both Dijkstra and A* find optimal paths. Compare them in terms of the final costs of the found paths. If both paths are optimal, why are they not the same? Write your answer to this in the header of the module file.
1.6 For implementing the priority-queue, look at http://docs.python.org/library/heapq.html. Your node class should override the __cmp__(...) and __hash__() functions for it to work with the heapq. See also Fig. 89.

Solution: See astar.py in the solutions on Campus-net.


HA 1: A Path-Planning and Simulated Annealing IV


Groups of 2. Due date 27.9., 23:59

[Figure 89: An example UML design for homework 1; feel free to alter it. Base classes Node, Frontier, and GraphSearch are abstract classes which do not care about implementation details. Thus, the same structure can be used for all graph-search variants.]


HA 1: A Path-Planning and Simulated Annealing V


Groups of 2. Due date 27.9., 23:59

2. Answer the Why? in Lemma 4.7 using the hint provided. (25%)
Solution: We're given that the node chosen by F_j is x_j, and the node chosen by F_{j+1} is x_{j+1}. We consider the following cases:
Case I: At iteration j + 1, x_{j+1} is a child of x_j. Therefore, [x_0 : x_j] is a sub-path of [x_0 : x_{j+1}]. This follows from Lemma 4.6, which states that the optimum paths to x_j and x_{j+1} respectively were found when they were selected. When we combine this with Lemma 4.5, we get that the cost of x_j is no greater than the cost of x_{j+1}, because x_j lies on the optimum path to x_{j+1}.
Case II: At iteration j + 1, x_{j+1} is not a child of x_j. In this case, at step j, x_{j+1} must already have been a part of F_j, because F_{j+1} = F_j ∪ {un-dead children of x_j} − {x_j}. Hence, the cost of x_{j+1} was not updated at iteration j + 1, because its parent was not changed by Resolve-Duplicate to x_j; therefore the cost of x_{j+1} in F_j equals its cost in F_{j+1}. But since x_{j+1} was not chosen by F_j, its cost in F_j must be at least the cost of x_j; combined with the previous statement, the cost of x_{j+1} is at least the cost of x_j.


HA 1: A Path-Planning and Simulated Annealing VI


Groups of 2. Due date 27.9., 23:59

3. Implement a sampler class for a PMF of a discrete random variable A as explained in Jump to Location. It should be initialized with a list of probabilities P(A = a_i), i = 1 . . . n, summing to unity. A list of descriptive labels (e.g. ["child selected", "child not selected"], etc.) for each a_i may be given during initialization also. A function choose() should return the index of the selected a_i.
Write a simple unit-test which initializes a PMF

    P(A): P(a_1) = 0.3, P(a_2) = 0.15, P(a_3) = 0.05, P(a_4) = 0.4, P(a_5) = 0.1,

and draws 10,000 samples. Plot a histogram showing the observed relative frequencies of the a_i. Do they correspond to the provided probabilities? (5%)
Solution: See pmf.py in the solutions on Campus-net.


HA 1: A Path-Planning and Simulated Annealing VII


Groups of 2. Due date 27.9., 23:59

4. The uploaded file pt coords list.p.gz contains a list of 25 tuples. The file tsp.py shows some sample code to read in the file and test your code. Each tuple contains the (x, y) coordinates of a point. A robot arm has the task to punch all points with a tool mounted on its end-effector. In which order should the end-effector move from point to point without visiting any point twice, so that the total traversal path-cost (sum of Euclidean inter-point distances) is minimized and no point is left unvisited? The starting and ending point is the first point. Solve the problem using Simulated Annealing coded in Python or C++. (30%)
4.1 A potential solution is a permutation of the list of indices from 1 to 24, the index 0 being the first and the last point in the path.
4.2 For generating a random child of the current permutation list, you could use random.shuffle on a randomly selected sub-list of the parent.


HA 1: A Path-Planning and Simulated Annealing VIII


Groups of 2. Due date 27.9., 23:59

4.3 For deciding whether or not to select a worse child, you need to compute the Boltzmann probability at the current T and sample the boolean random-variable using the code of the sampler class from the last part.
4.4 Select a schedule for the temperature based on the advice given in the quote from Numerical Recipes in C. You have to experiment with it to get the best results.
4.5 Make a plot of the iteration number vs. the current total traversal cost. Also show the iteration number vs. T (the schedule).
4.6 Plot the final path and print its cost.

Solution: Four different heuristics were used in child creation: 1) random sub-path reversal with p = 0.4; 2) insertion of a random sub-path at a random location with p = 0.4; 3) sub-path shuffle with p = 0.1; 4) random swap of two points with p = 0.1. Refer to the code in tsp.py available online. The minimum tour cost is about 459. The plots obtained were as shown in the following figures.

HA 1: A Path-Planning and Simulated Annealing IX


Groups of 2. Due date 27.9., 23:59

[Plot (a): the final TSP path over the 25 points (coordinates in the range 0-100).]

HA 1: A Path-Planning and Simulated Annealing X

Groups of 2. Due date 27.9., 23:59

[Plot (b): the total traversal cost (energy) and the temperature T versus the iteration number, for a schedule parameter of 0.5.]

HA 1: A Path-Planning and Simulated Annealing XI


Groups of 2. Due date 27.9., 23:59

5. Only for AI Lab participants: Implement the funnel-planning Algo. 8


and test it for several start and goal points for the map of problem 1.
Your code should show the result (map + funnel-sequence) in a
pop-up window and print the cost of the path on the console.


Home-assignment 2

HA 2
Groups of 2. Due date 21.10., 23:59

1. There is another way to formulate alpha-beta pruning: this variation
is called the negmax formulation, as opposed to the minimax approach
which we covered in the class. It is given at the end of this homework
in Algo. 43 (from Knuth and Moore, 1975). It consists of just one
function F which recursively calls itself. Show a manual trace of the
run of this algorithm on the 2-ply game of Fig. 44. The trace starts
with the call F(A, −∞, +∞), where A is the root node. You could do
this cleanly on a sheet of paper and then scan it for submission.
(25%)


Solution: a hand trace of F(A, −∞, +∞) on the 2-ply game of Fig. 44,
recording at each node the (α, β) arguments passed in, the successive
values of t returned by the children, and the value finally returned;
the root call returns the value 3.


2. Find the resolution-closure RC(S) of the set S consisting of the
following clauses:
C1 : A B C
C2 : A B C D E
C3 : E F G D
C4 : G E
Is this set of clauses in RC(S) satisfiable? If yes, find a model which
satisfies all the clauses in RC(S) using Algo. 27 (ModelConstruction).
(10% + 10%)


Solution: We proceed as in the worked-out example given after
Algorithm 27. We first write the clauses in CNF with symbols
appearing alphabetically:
C1 : A B C
C2 : A B C D E
C3 : D E F G
C4 : E G
As per Remark 7.16, a resolution like that of C1 with C2 results in a
clause containing the complementary pair B, ¬B and is therefore ≡ True;
it does not provide any new information, and such pairs will not be
mentioned henceforth. Pairs like C1 and C3 cannot be resolved, as they
do not have any complementary literals: such pairs will also not be
mentioned in the

following. In the first iteration, we get the following new clauses on
resolution amongst C1, . . . , C4:
C5 = C2 ⊗ C3 : A B C D F G
C6 = C2 ⊗ C4 : A B C D G
C7 = C3 ⊗ C4 : D E F
In the next iteration, we get the following new clauses from resolution
amongst C1, . . . , C7, ignoring the pairs already considered:
C8 = C2 ⊗ C7 : A B C D F
C9 = C3 ⊗ C6 : A B C D E F
We ignore duplicates like C4 ⊗ C5, which gives C9 again. In the next
iteration, we try to obtain new clauses from resolution amongst

C1 , . . . , C9 , ignoring the pairs already considered: however, in this


iteration, we do not get any new clauses: for many pairs, the
resolvent clauses already exist; for the rest of the pairs, they are either
not resolvable, or their resolvent is equivalent to true. Thus
RC (S) = {C1 , . . . , C9 }.
Since the resolution-closure does not contain the empty clause, there
exists a model which satisfies all the clauses in it. To find such a
model, we use Algorithm 27. Taking the order of symbols as
[P1 , . . . , Pn ] = [A, B, C , D, E , F , G ], we provide the execution trace:
i=1 A = True , C1 is now already True in this partial model
and need not be considered anymore.
i=2 B = True
i=3 C = True , C2 , C5 , C6 , C8 , C9 are now already True in
this partial model and need not be considered anymore.

i=4 D = True , C3, C7 are now already True in this partial
model and need not be considered anymore.
i=5 E = True
i=6 F = True , no clause remains which depends on F.
i=7 G = False , because under the previous assignments the only
literal of C4 that is not yet falsified is ¬G.
The above model is seen to satisfy all the clauses in RC(S), and in
particular S. A different initial order will give a different model.
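For reference, a small sketch of how RC(S) can be computed mechanically, assuming clauses are represented as frozensets of (symbol, sign) pairs; this is an illustration of the idea, not the algorithm from the lecture notes:

from itertools import combinations

def resolvents(ci, cj):
    """All non-tautological resolvents of two clauses.
    A clause is a frozenset of literals; a literal is (symbol, is_positive)."""
    out = set()
    for (sym, pos) in ci:
        if (sym, not pos) in cj:
            merged = (ci - {(sym, pos)}) | (cj - {(sym, not pos)})
            # Skip resolvents that are tautologies (contain P and not-P).
            if not any((s, not p) in merged for (s, p) in merged):
                out.add(frozenset(merged))
    return out

def resolution_closure(clauses):
    """Iterate resolution until no new clause appears; returns RC(S)."""
    closure = set(frozenset(c) for c in clauses)
    while True:
        new = set()
        for ci, cj in combinations(closure, 2):
            new |= resolvents(ci, cj)
        if new <= closure:
            return closure
        closure |= new

The loop terminates because only finitely many distinct clauses exist over a fixed set of symbols; if the empty clause ever appears in the closure, the set is unsatisfiable.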


3. Convert the following set of sentences to CNF and give a trace of the
execution of DPLL (2nd version, Algo. 25) on the conjunction of
these clauses.

S1 : A ⇔ (C ∨ E)        (12.1a)
S2 : E ⇒ D              (12.1b)
S3 : (B ∧ F) ⇒ ¬C       (12.1c)
S4 : E ⇒ C              (12.1d)
S5 : C ⇒ F              (12.1e)
S6 : C ⇒ B              (12.1f)

(20%)


Solution: The CNFs are
S1 : C11 ∧ C12 ∧ C13 = (¬A ∨ C ∨ E) ∧ (¬C ∨ A) ∧ (¬E ∨ A)
S2 : C2 = ¬E ∨ D
S3 : C3 = ¬B ∨ ¬F ∨ ¬C
S4 : C4 = ¬E ∨ C
S5 : C5 = ¬C ∨ F
S6 : C6 = ¬C ∨ B

Solution: We use the decimal notation for recursive calls of DPLL.


The set of initial clauses is denoted by F.

DPLL-1(F = {C11 , . . . , C6 })

In the second while-loop, D is found as a pure literal. Therefore,

F ← F | D, which does not contain C2 anymore.

Although a literal u can now be chosen arbitrarily, we decide to choose
the one with the maximum Jeroslow-Wang (JW) metric w(F, ℓ), as
discussed in the class. To review:

w(F, ℓ) = Σ_{k ≥ 1} 2^(−k) d_k(F, ℓ),          (12.3)

where d_k(F, ℓ) is the number of clauses of length k in F which contain
the literal ℓ. It can be verified that the maximum value is
w(F, ℓ = ¬C) = 7/8. So we choose u = ¬C.

DPLL-1.1(F | ¬C). F ← F | ¬C now contains the reduced clauses:
C11 : {¬A, E}, C13 : {¬E, A}, C4 : {¬E}.
The first while loop finds a unit clause C4 : {¬E}. Therefore, it sets
E = 0. F ← F | ¬E gives only one remaining reduced clause,
C11 : {¬A}, which is also unitary. The second iteration of the while loop
then sets A = 0. F ← F | ¬A is now empty.
Since F is empty, Satisfiable is returned. The current partial model
is D = 1, C = 0, E = 0, A = 0, which satisfies all clauses of the original
set.
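A small sketch of the JW computation, using the clause set as reconstructed above (the literal representation is illustrative; this is not the course's Algo. 25):

from collections import defaultdict

def jeroslow_wang(clauses):
    """w(F, l): sum of 2^(-len(clause)) over the clauses containing l.
    Clauses are lists of literals; a literal is (symbol, is_positive)."""
    w = defaultdict(float)
    for clause in clauses:
        for lit in set(clause):
            w[lit] += 2.0 ** (-len(clause))
    return w

# Reduced clause set from the trace (C2 already removed via the pure literal D):
F = [
    [("A", False), ("C", True), ("E", True)],    # C11
    [("C", False), ("A", True)],                 # C12
    [("E", False), ("A", True)],                 # C13
    [("B", False), ("F", False), ("C", False)],  # C3
    [("E", False), ("C", True)],                 # C4
    [("C", False), ("F", True)],                 # C5
    [("C", False), ("B", True)],                 # C6
]
weights = jeroslow_wang(F)
best = max(weights, key=weights.get)
print(best, weights[best])   # ('C', False) 0.875, i.e. the literal ¬C with 7/8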


4. We are given the following seven 2-clauses of a 2SAT problem:
A B,  A C,  A D,  D C,
D B,  E C,  B C.
4.1 Solve the 2SAT problem by using an implication graph: either prove
the clauses unsatisfiable, or if they are satisfiable, find a model.
4.2 For finding the strongly-connected components, use the algorithm from
the Cormen et al. (CLRS) book, given in Fig. 52. It uses the version of
DFS shown in Figs. 50 and 51.
(25%)
Solution:

Figure 90: Prob. 4.1. The implication graph G(V, E) and the first DFS on G;
the discovery/finishing times u.d/u.f are recorded at each vertex.

Decreasing order of u.f from the first DFS of G:
E, B, D, C, E, A, A, C, D, B.

Figure 91: Prob. 4.1, 4.2. The second DFS on G^T, taking the vertices in
decreasing order of their finishing time in the first DFS.

The strongly connected components found are
T1 : E,   T2 : B, C, A, D,   T3 : E,   T4 : A, D, B, C.

Figure 92: The strongly connected components: all u ∈ Ti have f(u) = i.

Since there does not exist a symbol X s.t. it and its negation belong to the
same SCC, a model exists. Following Lemma 7.26, a model can be
constructed as A = 1, B = 0, C = 0, D = 1, E = 0.
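Because the literal signs did not survive in the figures above, the following is only a generic sketch of the implication-graph/SCC procedure (a Kosaraju-style double DFS in the spirit of the CLRS algorithm referenced in 4.2); the encoding of literals as signed integers and all helper names are my own:

from collections import defaultdict

def solve_2sat(n_vars, clauses):
    """Clauses are pairs of literals; a literal is +v or -v for variable v in
    1..n_vars. Returns None if unsatisfiable, else a dict {v: bool}."""
    def idx(lit):                # map literal to a vertex index 0..2n-1
        v = abs(lit) - 1
        return 2 * v + (0 if lit > 0 else 1)

    def neg(x):                  # vertex of the complementary literal
        return x ^ 1

    graph, rgraph = defaultdict(list), defaultdict(list)
    for a, b in clauses:         # (a or b)  ==>  (not a -> b), (not b -> a)
        for u, v in ((neg(idx(a)), idx(b)), (neg(idx(b)), idx(a))):
            graph[u].append(v)
            rgraph[v].append(u)

    order, seen = [], [False] * (2 * n_vars)
    def dfs1(u):                 # first DFS on G, collect post-order
        seen[u] = True
        for v in graph[u]:
            if not seen[v]:
                dfs1(v)
        order.append(u)

    comp = [-1] * (2 * n_vars)
    def dfs2(u, c):              # second DFS on G^T, label one SCC
        comp[u] = c
        for v in rgraph[u]:
            if comp[v] == -1:
                dfs2(v, c)

    for u in range(2 * n_vars):
        if not seen[u]:
            dfs1(u)
    c = 0
    for u in reversed(order):    # decreasing finishing time
        if comp[u] == -1:
            dfs2(u, c)
            c += 1

    model = {}
    for v in range(n_vars):
        if comp[2 * v] == comp[2 * v + 1]:
            return None          # x and not-x in the same SCC: unsatisfiable
        # The literal whose SCC comes later in the topological numbering
        # of the condensation is set to True (standard 2SAT rule).
        model[v + 1] = comp[2 * v] > comp[2 * v + 1]
    return model

# Hypothetical usage with clauses (x1 or not x2), (not x1 or x3), (x2 or x3):
print(solve_2sat(3, [(1, -2), (-1, 3), (2, 3)]))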

5. Explain the definition of independence P(F | G) = P(F) for any PL


formulas F and G in terms of M(F ) and M(G ) by revisiting the
renormalization scheme used to derive Eq. (8.20). You can explain
using Venn diagrams or by doing algebra or both.
(5%)
Solution:

Venn diagram: the set of all models, containing M(F) and M(G) with their
overlap M(F) ∩ M(G).

The independence condition implies

Σ_{A ∈ M(F)} P(A) = ( Σ_{A ∈ M(F ∧ G)} P(A) ) / ( Σ_{A ∈ M(G)} P(A) ).      (12.5)

In other words, the fraction of probability contained in M(F) w.r.t.
that contained in the set of all models is the same as the fraction of
probability contained in M(F) ∩ M(G) w.r.t. that contained in M(G).

6. Given RVs X and Y, find the values of the following two summations:

Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(xi, yj),        Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(xi | yj).

(2% + 3%)
Solution:
Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(xi, yj) = 1, since we are summing a JPD.
Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(xi | yj) = |Y|, since for each yj the conditional
PMF sums to 1 over the xi, and there are |Y| such summation constraints in
the CPT; refer to Eq. (8.38).
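A quick numerical check of both identities (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 3))
joint /= joint.sum()                  # a JPD P(xi, yj) with |X| = 4, |Y| = 3

p_y = joint.sum(axis=0)               # marginal P(yj)
cond = joint / p_y                    # CPT P(xi | yj): each column sums to 1

print(joint.sum())                    # ~1.0  (first summation)
print(cond.sum())                     # ~3.0  (= |Y|, second summation)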


The Negmax Alpha-Beta Pruning Variation

Algorithm 43: F(s, α, β)

if Terminal-Test(s) then return Utility(s);
for a ∈ Actions(s) do
    t ← −F(Result(s, a), −β, −α);
    if t > α then α ← t;
    if α ≥ β then break;
return α
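A small executable sketch of this negmax formulation as reconstructed above (the game-state interface is passed in as functions; the tiny tree game below is purely illustrative):

import math

def negmax(state, alpha, beta, terminal_test, utility, actions, result):
    """Knuth-Moore style negmax with alpha-beta pruning (cf. Algo. 43)."""
    if terminal_test(state):
        return utility(state)
    for a in actions(state):
        t = -negmax(result(state, a), -beta, -alpha,
                    terminal_test, utility, actions, result)
        if t > alpha:
            alpha = t
        if alpha >= beta:
            break
    return alpha

# Illustrative 2-ply game: the root's children are lists of leaf utilities
# (leaf values are stated from the root player's point of view; since each
# leaf lies two plies below the root, the two sign flips cancel).
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
value = negmax(tree, -math.inf, math.inf,
               terminal_test=lambda s: isinstance(s, int),
               utility=lambda s: s,
               actions=lambda s: range(len(s)),
               result=lambda s, a: s[a])
print(value)   # 3, the minimax value of this tree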


Home-assignment 3

HA 3: NBC/BN
Groups of 2. Due date 17.11., 23:59

1. Download
http://www.aispace.org/bayes/version5.1.9/bayes.jar and
run it using java -jar bayes.jar. Load File > Load Sample
Problem > Car Starting Problem.
1.1 Given its parents, of which nodes is the node Spark Quality (SQ)
conditionally independent? You can abbreviate the node names by the
initials.
1.2 To answer the query P(Battery Voltage | Spark Adequate ), which
nodes are irrelevant and can be pruned out to get a smaller BN
without affecting the query result? Verify this by making the query
P(Battery Voltage | Spark Adequate = T) first (in the solve tab),
then pruning the BN (in the create tab) and making the query again. If
you get an error, you may have pruned too much.

2. Fill in the missing steps in Eq. 9.21.



3. We will compute how accurate NBC is for the dataset


http://archive.ics.uci.edu/ml/datasets/SPECT+Heart. It
does not have any missing attributes.
3.1 Write an NBC class (C++ or Python) which has functionality for:
Loading a training dataset (SPECT.train for this DB);
Computing all the CPT pmfs P(Ai | C = c) assuming a uniform
Dirichlet prior for all pmfs;
Answering posterior queries of the kind
P(C = c | A1 = a1, . . . , An = an) using the log-sum trick.

If you code in Python, use of the numpy library would save you time.
3.2 Use these posterior queries on the test data SPECT.test and compute
the error-rate in percentage. The NBC-predicted classification is, of
course, the c with the maximum posterior probability (a minimal sketch
of such a classifier is given below).
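A minimal NBC sketch under the stated assumptions (binary attributes as in the SPECT data, add-one smoothing as a uniform Dirichlet prior, log-space scoring); the class name, method names, and file-format details are my own assumptions, not those of the reference solution:

import numpy as np

class NaiveBayes:
    """NBC for binary attributes with add-one (Laplace) smoothing."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        n_attr = X.shape[1]
        # log P(C = c), smoothed with a uniform Dirichlet prior
        self.log_prior = np.log(
            [(np.sum(y == c) + 1.0) / (len(y) + len(self.classes))
             for c in self.classes])
        # log P(A_i = 1 | C = c), one row per class
        self.log_p1 = np.empty((len(self.classes), n_attr))
        for k, c in enumerate(self.classes):
            Xc = X[y == c]
            self.log_p1[k] = np.log((Xc.sum(axis=0) + 1.0) / (len(Xc) + 2.0))
        self.log_p0 = np.log(1.0 - np.exp(self.log_p1))
        return self

    def log_posterior(self, x):
        """Unnormalized log P(C = c | a_1, ..., a_n) for each class c.
        A normalized posterior would subtract the log-sum-exp of these
        scores; for classification only the argmax is needed."""
        return (self.log_prior
                + np.where(x == 1, self.log_p1, self.log_p0).sum(axis=1))

    def predict(self, X):
        return np.array([self.classes[np.argmax(self.log_posterior(x))]
                         for x in X])

# Hypothetical usage, assuming the first column of each file is the class:
# data = np.loadtxt("SPECT.train", delimiter=",", dtype=int)
# model = NaiveBayes().fit(data[:, 1:], data[:, 0])
# test = np.loadtxt("SPECT.test", delimiter=",", dtype=int)
# error = np.mean(model.predict(test[:, 1:]) != test[:, 0]) * 100.0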


Home-assignment 4

HA 4: BN/Entropy
Groups of 2. Due date 29.11., 23:59

1. For the alarm example, find P(E, B | j, m) using the
Variable-Elimination algorithm.
2. Show that the implicit CPT of Noisy-Max given by (10.35) is a valid
CPT, i.e. it satisfies the summation constraint.
3. Consider a DRV X = [x1, . . . , xn], where xi ∈ R, with an unknown
PMF [p1, . . . , pn]. The only prior information you are given is that
E[X] = μ. Given this information, show that the least biased PMF
that you can select is

p_i = e^(λ xi) / ( Σ_{j=1}^{n} e^(λ xj) ),   i = 1, . . . , n,        (12.6)

where


the constant λ can be found by numerically solving the equation

μ = ( Σ_{i=1}^{n} xi e^(λ xi) ) / ( Σ_{i=1}^{n} e^(λ xi) ).        (12.7)

Hint: This is a constrained optimization problem where you have two
constraints: Σ_i p_i = 1 and E[X] = μ. Constrained optimization is best
solved using Lagrange multipliers (revise the relevant ESM course). The
constant λ will turn out to be one of the two Lagrange multipliers. For
ease of algebra, use the natural logarithm in the definition of information
entropy, although the base of the logarithm will not change the result.
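Once (12.6) has been derived, λ can be obtained from (12.7) e.g. by bisection, since its right-hand side is monotonically increasing in λ; a small sketch with illustrative values for the xi and μ:

import math

def mean_of_maxent_pmf(lam, xs):
    """E[X] under p_i proportional to exp(lam * x_i)  (Eqs. 12.6/12.7)."""
    m = max(lam * x for x in xs)           # shift for numerical stability
    w = [math.exp(lam * x - m) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

def solve_lambda(xs, mu, lo=-50.0, hi=50.0, tol=1e-10):
    """Bisection on the monotone function lam -> E[X](lam) - mu."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_of_maxent_pmf(mid, xs) < mu:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

xs = [1.0, 2.0, 3.0, 4.0]     # illustrative support of X
mu = 3.2                      # illustrative prescribed mean
lam = solve_lambda(xs, mu)
print(lam, mean_of_maxent_pmf(lam, xs))   # the second value should be ~3.2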


Home-assignment 5

HA 5: Decision Trees
Groups of 2. Due date 14.12., 23:59

We will compute how accurate decision trees are for the dataset
http://archive.ics.uci.edu/ml/datasets/SPECT+Heart, which we
evaluated with NBC in HA 3.
1. Write a decision-tree class (C++ or Python) which has functionality
for:
Loading a training dataset (SPECT.train for this DB);
Running Algo. 39 to learn a decision-tree (the attribute-selection step
is sketched below).

2. Use the decision tree on the test data SPECT.test and compute the
error-rate in percentage.
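A minimal sketch of the information-gain computation that typically drives the attribute selection in such a learner (this is not Algo. 39 itself; the representation of examples as attribute dicts is an assumption):

import math
from collections import Counter

def entropy(labels):
    """H(C) = -sum_c p_c log2 p_c over the class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((cnt / n) * math.log2(cnt / n)
                for cnt in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Gain(A) = H(C) - sum_v (|S_v|/|S|) H(C | A = v)."""
    n = len(labels)
    by_value = {}
    for ex, c in zip(examples, labels):
        by_value.setdefault(ex[attr], []).append(c)
    remainder = sum((len(subset) / n) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder

examples = [{"A1": 1, "A2": 0}, {"A1": 0, "A2": 0}, {"A1": 1, "A2": 1}]
labels = [1, 0, 1]
print(information_gain(examples, labels, "A1"))   # ~0.918: A1 splits perfectly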


Quizzes

Contents

Quizzes
Quiz 1
Quiz 2
Quiz 3


Quiz 1
Sep. 23

1. Two consistent heuristics h1(x) and h2(x) are given s.t.
h1(x) ≥ h2(x), ∀x. Assume for simplicity that there is a single goal
state xg. Which heuristic is better to use in A∗ and why?
(50%)
Solution: From Lemma 4.7 and Remark 4.8, at the iteration when
the goal is found, the maximum possible number of nodes expanded
until then form a set

S_h1 = {x | f(x) ≤ f(xg)} = {x | g(x) ≤ g(xg) − h1(x)},      (13.1)
S_h2 = {x | f(x) ≤ f(xg)} = {x | g(x) ≤ g(xg) − h2(x)}.      (13.2)

Note that g(xg) is fixed. As g(x) is the optimal distance of x from
the origin, the set S_ε = {x | g(x) ≤ ε} is such that |S_ε| is a

non-decreasing function of ε. The threshold ε1 = g(xg) − h1(x)
(corresponding to h1) is at most ε2 = g(xg) − h2(x) (corresponding to h2),
because it is given that h1(x) ≥ h2(x). Therefore,

|S_h1| ≤ |S_h2|      (13.3)

⟹ h1 is more efficient.

2. What could be an admissible heuristic function for solving the
8-puzzle by A∗?
(50%)
Solution: Read Sec. 3.6.2 of the textbook: e.g., h1 = the number of
misplaced tiles, or h2 = the sum of the Manhattan distances of the tiles
from their goal positions (both are sketched in code after the figure below).
Figure: an example Start State and Goal State of the 8-puzzle.
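A minimal sketch of the two textbook heuristics; the start/goal configuration used here is the standard example from Sec. 3.6.2 of the textbook, not necessarily the one shown on the original slide:

def misplaced_tiles(state, goal):
    """h1: number of tiles not in their goal position (blank = 0 excluded)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal, width=3):
    """h2: sum of Manhattan distances of each tile from its goal position."""
    pos = {tile: divmod(i, width) for i, tile in enumerate(goal)}
    return sum(abs(r - pos[t][0]) + abs(c - pos[t][1])
               for i, t in enumerate(state) if t != 0
               for r, c in [divmod(i, width)])

start = (7, 2, 4, 5, 0, 6, 8, 3, 1)      # textbook Fig. 3.28 start state
goal  = (0, 1, 2, 3, 4, 5, 6, 7, 8)
print(misplaced_tiles(start, goal), manhattan(start, goal))   # 8 18

Both heuristics are admissible; h2 dominates h1 and is therefore the more informative choice.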


Quiz 2
Oct. 7

Using properly labeled Venn diagrams show:
1. The case KB ⊨ Q
2. The case KB ⊭ Q
3. The case where KB ⊭ Q and KB ⊭ ¬Q
4. F ≡ F ∪ {R}, where R is the resolvent of two clauses in F
5. Monotonicity of PL knowledge-bases


Figure 93: Quiz 2 solution (Venn diagrams over the model sets M(·)):
(a) KB ⊨ Q: M(KB) ⊆ M(Q).
(b) KB ⊭ Q.
(c) KB ⊭ Q and KB ⊭ ¬Q: M(KB) overlaps both M(Q) and its complement.
(d) F ≡ F ∪ {R}: M(F) = M(F ∪ {R}) = M(F) ∩ M({R}).
(e) Monotonicity: let F be a new clause appended to the CNF KB, so that
KB′ = KB ∧ F = KB ∪ {F}; then M(KB′) ⊆ M(KB) ⊆ M(Q), hence KB′ ⊨ Q.

Quiz 3, 13.11.
Figure: the Bayesian network for Quiz 3; its nodes include A, B, C, D, G, H.

1. A query to the BN is P(B | G = g ). Give a reduced BN which can be


used to answer this query instead of the original.
2. Write the JPD of the reduced BN.
3. Show how P(B | G = g) can be computed in two different ways from
the JPD by distributing the summations over the product differently.