
Artificial Intelligence

Course No. 320331, Fall 2013


Dr. Kaustubh Pathak
Assistant Professor, Computer Science
k.pathak@jacobs-university.de
Jacobs University Bremen

December 5, 2013


Course Introduction
Python Brief Introduction
Agents and their Task-environments
Goal-based Problem-solving Agents using Searching
Non-classical Search Algorithms
Games Agents Play
Logical Agents: Propositional Logic
Probability Calculus
Beginning to Learn using Naïve Bayesian Classifiers
Bayesian Networks

Course Introduction

Contents

Course Introduction
Course Logistics
What is Artificial Intelligence (AI)?
Foundations of AI
History of AI
State of the Art

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

3 / 475

Course Logistics

Grading
Break-down:
Easy quizzes: 15%. (Auditors: taking 75% of the quizzes is necessary.)
Homeworks (5): 25%
Mid-term exam: 30%. 23rd Oct. (Wed.), after the Reading Days.
Final exam: 30%
If you have an official excuse for a quiz/exam, a make-up will be
provided. For homeworks, make-ups will be decided on a
case-by-case basis: an official excuse covering at least three days immediately
before the deadline is necessary.
Homeworks: Python or C++.
Teaching Assistant: Vahid Azizi, v.azizi@jacobs-university.de

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

4 / 475

Course Introduction

Course Logistics

Homework Submission via Grader

Check after a week:


https://cantaloupe.eecs.jacobs-university.de/login.php

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

5 / 475

Course Logistics

Teaching Philosophy
No question will be ridiculed.
Some questions may be taken offline or might be postponed.
Homeworks are where you really learn!
Not all material will be in the slides. Some material will be derived on
the board - you should take lecture-notes yourselves.
Material done on the board is especially likely to appear in
quizzes/exams.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

6 / 475

Course Introduction

Course Logistics

Expert of the Day


At the beginning of each lecture, a student will summarize the last
lecture in 5 minutes (taking more than 7 minutes will be penalized).
She/He can also highlight things which need more clarification.
A student will volunteer at the end of each lecture to be the
expert in the next lecture.
Your participation counts as 1 quiz. Everyone should do it at least
once.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

7 / 475

December 5, 2013

8 / 475

Course Logistics

Coming Up...

Our Next Expert Is?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

Course Logistics

Textbooks

Main textbook:
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern
Approach, 3rd Edition, 2010, Pearson International Edition.
Other references:
Uwe Schöning, Logic for Computer Scientists, English 2001,
German 2005, Birkhäuser.
Daphne Koller and Nir Friedman, Probabilistic Graphical Models:
Principles and Techniques, 2009, MIT Press.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

9 / 475

Course Logistics

Syllabus
Introduction to AI; Intelligent agents: Chapters 1, 2.
A Brief Introduction to Python (skipped this year).
Solving problems by Searching:
BFS, DFS, A* search, with proofs: Chapter 3.
Sampling Discrete Probability Distributions, Simulated Annealing, Genetic Algorithms, with a real-world example: Chapter 4.
Adversarial search (Games): Minimax, α–β pruning: Chapter 5.
Logical Agents: Chapter 7; also Schöning's book.
Propositional Logic: Inference with Resolution.
Uncertain Knowledge & Reasoning:
Introduction to Probabilistic Reasoning: Chapter 13.
Bayesian Networks: Various Inference Approaches: Chapter 14; also Koller's book.
Introduction to Machine-Learning:
Supervised Learning: Information Entropy, Decision Trees, ANNs: Chapter 18.
Model Estimation: Priors, Maximum Likelihood, Kalman Filter, EKF, RANSAC.
Learning Probabilistic Models: Chapter 20.
Unsupervised Learning: Clustering (K-Means, Mean-Shift Algorithm).
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

10 / 475

Course Introduction

What is Articial Intelligence (AI)?

Defining AI
Human-centered vs. Rationalist Approaches

Thinking Humanly: "[The automation of] activities that we associate with human thinking, activities such as decision-making, problem-solving, learning..." (Bellman, 1978)

Acting Humanly: "The art of creating machines that perform functions that require intelligence when performed by people." (Kurzweil, 1990)

Thinking Rationally: "The study of the computations that make it possible to perceive, reason, and act." (Winston, 1992)

Acting Rationally: "Computational Intelligence is the study of the design of intelligent agents." (Poole et al., 1998)

Articial Intelligence

Course Introduction

December 5, 2013

11 / 475

What is Articial Intelligence (AI)?

Acting Humanly

The Turing Test (1950)


The test is passed if a human interrogator,
after posing some written questions,
cannot determine whether the responses
come from a human or from a computer.

Total Turing Test


There is a video signal for the interrogator
to test the subject's perceptual abilities, as
well as a hatch to pass physical objects
through.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Figure 1: Alan Turing


(1912-1954)

December 5, 2013

12 / 475

Course Introduction

What is Articial Intelligence (AI)?

Reverse Turing Test: CAPTCHA


Completely Automated Public Turing test to tell Computers and Humans Apart

Figure 2: Source: http://www.captcha.net/

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

13 / 475

What is Articial Intelligence (AI)?

Capabilities required for passing the Turing test


The 6 main disciplines composing AI.

The Turing Test


Natural language processing
Knowledge representation
Automated reasoning
Machine learning

The Total Turing Test


Computer vision
Robotics

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

14 / 475

Course Introduction

What is Articial Intelligence (AI)?

Thinking Humanly
Trying to discover how human minds work. Three ways:
Introspection
Psychological experiments on humans
Brain imaging: Functional Magnetic Resonance Imaging (fMRI), Positron Emission Tomography (PET), EEG, etc.
Cognitive Science constructs testable theories of mind using:
Computer models from AI
Experimental techniques from psychology

Figure 3: fMRI image (source: http://www.umsl.edu/~tsytsarev)
YouTube video (1:00-4:20): Reading the mind by fMRI

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

15 / 475

What is Articial Intelligence (AI)?

Thinking Rationally

Aristotle's syllogisms (384-322 B.C.): "right thinking", deductive
logic.
The logicist tradition in AI. "Good old AI." Logic programming.
Problems:
Cannot handle uncertainty.
Does not scale up due to high computational requirements.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

16 / 475

Course Introduction

What is Articial Intelligence (AI)?

Acting Rationally

Definition 1.1 (Agent)

An agent is something that acts, i.e., it
perceives the environment,
acts autonomously,
persists over a prolonged time-period,
adapts to change,
creates and pursues goals (by planning), etc.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

17 / 475

What is Articial Intelligence (AI)?

Acting Rationally
The Rational Agent Approach

Definition 1.2 (Rational Agent)

A rational agent is one that acts so as to achieve the best outcome, or,
when there is uncertainty, the best expected outcome. This approach is
more general, because:
Rationality is more general than logical inference, e.g. reflex actions.
Rationality is more amenable to scientific development than approaches
based on human behavior or thought.
Rationality is well defined mathematically: in a way, it is just
optimization under constraints. When, due to computational
demands in a complicated environment, the agent cannot maintain
perfect rationality, it resorts to limited rationality.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

18 / 475

Course Introduction

What is Articial Intelligence (AI)?

Acting Rationally
The Rational Agent Approach

This course therefore concentrates on general principles of rational


agents and on components for constructing them.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

19 / 475

Foundations of AI

Disciplines Contributing to AI. I


Philosophy
Rationalism: Using the power of reasoning to understand the world.
How does the mind arise from the physical brain?
Dualism: Part of the mind is separate from matter/nature. Proponent:
René Descartes, among others.
Materialism: The brain's operation constitutes the mind. Claims that free
will is just the way the perception of available choices appears to the
choosing entity.

Mathematics
Logic, computational tractability, probability theory.

Economics
Utility theory, decision theory (probability theory + utility theory), game
theory.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

20 / 475

Course Introduction

Foundations of AI

Neuroscience
The exact way the brain enables thought is still a scientific mystery.
However, the mapping between areas of the brain and the parts of the body they
control or receive sensory input from can be found, though it can change
over the course of a few weeks.

Figure 4: The human cortex with the various lobes (frontal, parietal, occipital, temporal), the motor and visual cortices, the cerebellum, and the spinal cord shown in different colors. The
information from the visual cortex gets channeled into the dorsal ("where/how")
and the ventral ("what") streams.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

21 / 475

Foundations of AI

The Human Brain


The human brain has about 10^11 neurons with 10^14 synapses, a cycle time of
10^-3 s, and about 10^14 memory updates/sec. Refer to Fig. 1.3 in the book.

Figure 5: TED Video: Dr. Jill Bolte Taylor

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

22 / 475

Course Introduction

Foundations of AI

Psychology
Behaviorism (stimulus/response), Cognitive psychology.

Computer Engineering
Hardware and Software. Computer vision.

Linguistics
Natural language processing.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

23 / 475

Foundations of AI

Control Theory and Cybernetics

Figure 6: A typical control system with feedback. Source:


https://www.ece.cmu.edu/~koopman/des_s99/control_theory/

The basic idea of control theory is to use sensory feedback to alter system
inputs so as to minimize the error between desired and observed output.
Basic example: controlling the movement of an industrial robotic arm to a
desired orientation.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

24 / 475

Course Introduction

Foundations of AI

Control Theory and Cybernetics

Norbert Wiener (1894-1964): book Cybernetics (1948).

Modern control theory and AI have considerable overlap: both have
the goal of designing systems which maximize an objective function
over time.
The difference lies in: 1) the mathematical techniques used; 2) the
application areas.
Control theory focuses more on the calculus of continuous variables and
matrix algebra, whereas AI also uses tools of logical inference and
planning.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

25 / 475

History of AI

History of AI I
Gestation Period (1943-1955)
McCulloch and Pitts (1943) proposed a model for the neuron. Hebbian
learning (1949) for updating inter-neuron connection strengths developed.
Alan Turing published Computing Machinery and Intelligence (1950),
proposing the Turing test, machine learning, genetic algorithms, and
reinforcement learning.

Birth of AI (1956)
The Dartmouth workshop organized by John McCarthy of Stanford.

Early Enthusiasm (1952-1969)


LISP developed. Several small successes including theorem proving etc.
Perceptrons (Rosenblatt, 1962) developed.

Reality hits (1966-1973)


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

26 / 475

Course Introduction

History of AI

History of AI II
After the Sputnik launch (1957), automatic Russian-to-English translation was
attempted. It failed miserably:
1. "The spirit is willing, but the flesh is weak." was translated to
2. "The vodka is good but the meat is rotten."
The scaling-up of computational complexity could not be handled. Single-layer
perceptrons were found to have very limited representational power. Most
government funding stopped.

Knowledge-based Systems (1969-1979)

Use of expert, domain-specific knowledge and cook-book rules collected
from experts for inference.
Examples: DENDRAL (1969) for inferring molecular structure from mass
spectrometer results; MYCIN (blood infection diagnosis) with about 450 rules.

AI in Industry (1980-present)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

27 / 475

History of AI

History of AI III
Companies like DEC, DuPont, etc. developed expert systems. Industry
boomed, but not all the extravagant promises were fulfilled, leading to the "AI winter".

Return of Neural Networks (1986-present)


Back-propagation learning algorithm developed. The connectionist
approach competes with logicist and symbolic approaches. NN research
bifurcates.

AI embraces Control Theory and Statistics (1987-present)


Rigorous mathematical methods began to be reused instead of ad hoc
methods. Examples: Hidden Markov Models (HMMs), Bayesian Networks,
etc. Sharing of real-life data-sets started.

Intelligent agents (1995-present)


Growth of the Internet. AI in web-based applications (-bots).
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

28 / 475

Course Introduction

History of AI

History of AI IV
Huge data-sets (2001-present)
Learning based on very large data-sets. Example: filling in holes in a
photograph; Hays and Efros (2007). Performance went from poor with
10,000 samples to excellent with 2 million samples.

Figure 7: Source: Hays and Efros (SIGGRAPH 2007).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

29 / 475

December 5, 2013

30 / 475

History of AI

Reading Assignment (not graded)


Read Sec. 1.3 of the textbook.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

State of the Art

Successful Applications
Intelligent Software Wizards and Assistants

(a) Microsoft Office Assistant Clippit

(b) Siri

Figure 8: Wizards and Assistants.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

31 / 475

State of the Art

Logistics Planning
Dynamic Analysis and Replanning Tool (DART). Used during the Gulf War
(1990s) for the scheduling of transportation. DARPA stated that this single
application paid back DARPA's 30 years of investment in AI.
DART won DARPA's Outstanding Performance by a Contractor award, for
modification and transportation feasibility analysis of the Time-Phased Force
and Deployment Data that was used during Desert Storm.
http://www.bbn.com

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

32 / 475

Course Introduction

State of the Art

Flow Machines
2013 Best AI Video Award: http://www.aaaivideos.org

Figure 9: Video (4:53)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

33 / 475

December 5, 2013

34 / 475

State of the Art

Intelligent Textbook
2012 Best AI Video Award: http://www.aaaivideos.org

Figure 10: Video (4:53)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

State of the Art

DARPA Urban Challenge 2007

Figure 11: Video (6:05)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

35 / 475

State of the Art

3D Planar-Patches based Simultaneous Localization and


Mapping (SLAM): Scene Registration

Figure 12: Collecting Data


Registered Planar-Patches

Registered Point-Clouds

The Minimally Uncertain Maximum Consensus (MUMC) Algorithm


Related to the RANSAC (Random Sample Consensus) Algorithm that we will
study.
K. Pathak, A. Birk, N. Vaskevicius, and J. Poppinga, Fast registration based on noisy planes
with unknown correspondences for 3D mapping, IEEE Transactions on Robotics, vol. 26, no.
3, pp. 424-441, 2010.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

36 / 475

Course Introduction

State of the Art

New Sensing Technologies: Example Kinect

(a) The Microsoft Kinect 3D camera (from Wikipedia). (b) A point-cloud obtained from it (from Willow Garage).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

December 5, 2013

37 / 475

December 5, 2013

38 / 475

State of the Art

RGBD Segmentation
Unsupervised Clustering By Mean-Shift Algorithm

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Course Introduction

State of the Art

Object Recognition & Pose Estimation

Figure 13: IEEE Int. Conf. on Robotics & Automation (ICRA) 2011: Perception
Challenge. Our group won 2nd place, between Berkeley (1st) and Stanford (3rd).
Video (2:39)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

39 / 475

December 5, 2013

40 / 475

Python Brief Introduction

Contents

Python Brief Introduction


Data-types
Control Statements
Functions
Packages, Modules, Classes

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Built-in Data-types
Type              Example                                   Immutable?
Numbers           12, 3.4, 7788990L, 6.1+4j, Decimal        yes
Strings           "abcd", 'abc', "abc's"                    yes
Boolean           True, False                               yes
Lists             [True, 1.2, "vcf"]                        no
Dictionaries      {"A": 25, "V": 70}                        no
Tuples            ("ABC", 1, 'Z')                           yes
Sets/FrozenSets   {90, 'a'}, frozenset({'a', 2})            no / yes
Files             f = open('spam.txt', 'r')                 --
Single Instances  None, NotImplemented                      --
...

December 5, 2013

41 / 475

December 5, 2013

42 / 475

Data-types

Sequences I
str, list, tuple

Creation and Indexing


a= "1234567"
a[0]
b= [z, x, a, k]
b[-1] == b[len(b)-1], b[-1] is b[len(b)-1]
x= """This is a
multiline string"""
print x
y= me
too
print y
len(y)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Sequences II
str, list, tuple

Immutability
a[1]= 'q' # Fails: strings are immutable
b[1]= 's'
c= a;
c is a, c==a
a= "xyz"; c is a

Help
dir(b)
help(b.sort)
b.sort()
b # In-place sorting

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

43 / 475

December 5, 2013

44 / 475

Data-types

Sequences III
str, list, tuple

Slicing
a[1:2]
a[0:-1]
a[:-1], a[3:]
a[:]
a[0:len(a):2]
a[-1::-1]

Repetition & Concatenation


c=a*2
b*3
a= a + '5mn'
d= b + [abs, 1, False]
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Sequences IV
str, list, tuple

Nesting
A=[[1,2,3],[4,5,6],[7,8,9]]
A[0]
A[0][2]
A[0:-1][-1]
A[3] # Error

List Comprehension
q= [x.isdigit() for x in a]
print q
p=[(r[1]**2) for r in A if r[1]< 8] # Power
print p

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

45 / 475

December 5, 2013

46 / 475

Data-types

Sequences V
str, list, tuple

Dictionaries
D= {0:Rhine, 1:"Indus", 3:"Hudson"}
D[0]
D[6] # Error
D[6]="Volga"
dir(D)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Numbers I
Math Operations
a= 10; b= 3; c= 10.5; d= 1.2345
a/b
a//b, c//b # Floor division: b*(a//b) + (a%b) == a
d**c # Power
type(10**40) # Unlimited integers
import math
import random
dir(math)
math.pi # repr(x)
print math.pi # str(x)
s= "e is %08.3f and list is %s" % (math.e, [a,1,1.5])
random.random() # [0,1)
random.choice(['apple','orange','banana','kiwi'])

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

47 / 475

December 5, 2013

48 / 475

Data-types

Numbers II
Booleans

s1= True
s2= 3 < 5 <=100
not(s1 and s2) or (not s2)
x= 2 if s2 else 3
b1= 0xE210 # Hex
print b1
b2= 023 # Oct
print b2
b1 & b2 # Bitwise and
b1 | b2 # Bitwise or
b1 ^ b2 # Bitwise xor
b1 << 3 # Shift b1 left by 3 bits
~b1 # Bitwise complement

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Dynamic Typing I

Variables are names and have no types. They can refer to objects of
any type. Type is associated with objects.

a= "abcf"
b= "abcf"
a==b, a is b
a= 2.5
Objects are garbage-collected automatically.
Shared references

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

49 / 475

December 5, 2013

50 / 475

Data-types

Dynamic Typing II
a= [4,1,5,10]
b= a
b is a
a.sort()
b is a
a.append('w')
a
b is a
a= a + ['w']
a
b is a
b
x= 42
y= 42
x is y, x==y
x= [1,2,3]; y= [1,2,3]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Data-types

Dynamic Typing III


x is y, x==y
x= 123; y= 123
x is y, x==y # Wassup?
# Assignments create references
L= [1,2,3]
M= ['x', L, 'c']
M
L[1]= 0
M
# To copy
L= [1,2,3]
M= ['x', L[:], 'c']
M
L[1]= 0
M

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

51 / 475

Control Statements

Control Statements I
Mind the indentation! One extra carriage return is needed to finish a block in
interactive mode.

import sys
import random
tmp= sys.stdout
sys.stdout = open('log.txt', 'a')
x= random.random()
if x < 0.25:
    [y, z]= [-1, 4]
elif 0.25 <= x < 0.75:
    print 'case 2'
    y= z= 0
else:
    z, y= 0, 2; y += 3

sys.stdout = tmp

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

52 / 475

Python Brief Introduction

Control Statements

Loops I
While
i= 0
while i < 5:
    s= raw_input("Enter an int: ")
    try:
        j= int(s)
    except:
        print 'invalid input'
        break
    else:
        print "Its square is %d" % j**2
        i += 1
else:
    print "exited normally without break"

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

53 / 475

December 5, 2013

54 / 475

Control Statements

Loops II
For
X= range(2,10,2) # [2, 4, 6, 8]
N= 7
for x in X:
    if x > N:
        print x, "is >", N
        break
else:
    print 'no number >', N, 'found'

for line in open('test.txt', 'r'):
    print line.upper()

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Functions

Functions I
Arguments are passed by assignment
def change_q(p, q):
    for i in p:
        if i not in q: q.append(i)
    p= 'abc'

x= ['a','b','c']  # Mutable
y= 'bdg'          # Immutable
print x, y
change_q(q=x, p=y)
print x, y

Output:
['a', 'b', 'c'] bdg
['a', 'b', 'c', 'd', 'g'] bdg

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

55 / 475

Functions

Functions II
Scoping rule: LEGB= Local-function, Enclosing-function(s), Global
(module), Built-ins.
v= 99
def local():
    def locallocal():
        v= u
        print "inside locallocal ", v
    u= 7; v= 2
    locallocal()
    print "outside locallocal ", v

def glob1():
    global v
    v += 1

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

56 / 475

Python Brief Introduction

Functions

Functions III

local()
print v
glob1()
print v

Output:
inside locallocal  7
outside locallocal  2
99
100

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

57 / 475

Packages, Modules, Classes

Packages, Modules I
Python Standard Library http://docs.python.org/library/

Folder structure:
root/
    pack1/
        __init__.py
        mod1.py
        pack2/
            __init__.py
            mod2.py
root should be in one of the following: 1) the program's home folder, 2)
PYTHONPATH, 3) the standard lib folder, or 4) a .pth file on the path. The
full search-path is in sys.path.
Importing
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

58 / 475

Python Brief Introduction

Packages, Modules, Classes

Packages, Modules II
Python Standard Library http://docs.python.org/library/

import pack1.mod1
import pack1.mod3 as m3
from pack1.pack2.mod2 import A,B,C

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

59 / 475

December 5, 2013

60 / 475

Packages, Modules, Classes

Classes I
Example
class Animal(object): # new style classes
    count= 0
    def __init__(self, _name):
        Animal.count += 1
        self.name= _name
    def __str__(self):
        return 'I am ' + self.name
    def make_noise(self):
        print (self.speak()+" ")*3

class Dog(Animal):
    def __init__(self, _name):
        Animal.__init__(self, _name)
        self.count= 1
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

Packages, Modules, Classes

Classes II

    def speak(self):
        return "woof"
Full examples in python examples.tgz
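A minimal usage sketch of the Animal/Dog classes above (the output comments assume a fresh interpreter session in which only these two classes have been defined):

d = Dog("Rex")
print d              # calls __str__: "I am Rex"
d.make_noise()       # calls Dog.speak via Animal.make_noise: "woof woof woof "
print Animal.count   # class-level counter, incremented in Animal.__init__: 1
print d.count        # instance attribute set in Dog.__init__ shadows the class attribute: 1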

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Python Brief Introduction

December 5, 2013

61 / 475

Packages, Modules, Classes

Useful External Libraries

The SciPy library is a vast Python library for scientific computations.


http://www.scipy.org/
In Ubuntu install python-scitools in the package-manager.
Library for doing linear-algebra, statistics, FFT, integration,
optimization, plotting, etc.

Boost is a very mature and professional C++ library. It has Python


bindings. Refer to:
http://www.boost.org/doc/libs/1_47_0/libs/python/doc/
For creation of Python graph data-structures (leveraging boost) look
at: http://projects.skewed.de/graph-tool/

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

62 / 475

Agents and their Task-environments

Contents

Agents and their Task-environments


Agent Types

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

63 / 475

Agents and their Task-environments

A general agent
Figure: A general agent interacting with its environment: sensors deliver percepts to the agent, and actuators carry its actions back into the environment.

Definition 3.1 (A Rational Agent)

For each possible percept sequence, a rational agent should select an
action that is expected to maximize its performance measure, given the
evidence provided by the percept sequence and whatever built-in (prior)
knowledge the agent has.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

64 / 475

Agents and their Task-environments

Properties of the Task Environment


Fully observable: the relevant environment state is fully exposed by the sensors. vs. Partially observable: e.g. a limited field-of-view sensor; or even unobservable.
Single agent vs. Multi-agent.
Deterministic: the next state is completely determined by the current state and the action of the agent. vs. Stochastic: uncertainties quantified by probabilities.
Episodic: the agent's experience is divided into atomic episodes, each independent of the last, e.g. an assembly-line robot. vs. Sequential: the current action affects future actions, e.g. a chess-playing agent.
Static: the environment is unchanging. Semi-dynamic: the agent's performance measure changes with time, but the environment is static. vs. Dynamic: the environment changes while the agent is deliberating.
Discrete: the state of the environment is discrete, e.g. chess-playing, traffic control. vs. Continuous: the state changes smoothly in time, e.g. a mobile robot.
Known: the rules of the game / laws of physics of the environment are known to the agent. vs. Unknown: the agent must learn the rules of the game.

Hardest case: partially observable, multi-agent, stochastic, sequential,
dynamic, continuous, and unknown.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

65 / 475

Agents and their Task-environments

Example of a Partially Observable Environment

Figure 14: A mobile robot operating GUI.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

66 / 475

Agents and their Task-environments

Agent Types

Agent Types

Four basic types in order of increasing generality:


Simple reflex agents
Reflex agents with state
Goal-based agents
Utility-based agents
All these can be turned into learning agents.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Agents and their Task-environments

December 5, 2013

67 / 475

Agent Types

Simple reflex agents

Figure: Schematic of a simple reflex agent: sensors determine "what the world is like now"; condition-action rules determine "what action I should do now"; actuators execute it in the environment.

Algorithm 1: Simple-Reflex-Agent
input     : percept
output    : action
persistent: rules, a set of condition-action rules
state <- Interpret-Input(percept);
rule <- Rule-Match(state, rules);
action <- rule.action;
return action
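A minimal Python sketch of this pseudocode; the rule representation, the percept format, and the thermostat example below are illustrative assumptions, not from the textbook:

def simple_reflex_agent(rules, interpret_input):
    """Build an agent function that maps a percept to an action."""
    def agent(percept):
        state = interpret_input(percept)
        for condition, action in rules:     # Rule-Match: first rule whose condition holds
            if condition(state):
                return action
        return None                         # no matching rule
    return agent

# Hypothetical example: a thermostat-like reflex agent.
rules = [(lambda t: t < 18.0, 'heat_on'),
         (lambda t: t > 22.0, 'heat_off'),
         (lambda t: True,     'do_nothing')]
agent = simple_reflex_agent(rules, interpret_input=float)
print agent("17.2")    # -> heat_on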

Articial Intelligence

December 5, 2013

68 / 475

Agents and their Task-environments

Agent Types

Model-based reflex agents I

Figure: Schematic of a model-based reflex agent. An internal state, updated using models of "how the world evolves" and "what my actions do", estimates "what the world is like now"; condition-action rules then select "what action I should do now" for the actuators.

Articial Intelligence

Agents and their Task-environments

December 5, 2013

69 / 475

Agent Types

Model-based reflex agents II

Algorithm 2: Model-Based-Reflex-Agent
input     : percept
output    : action
persistent: state, the agent's current conception of the world's state
            model, how the next state depends on the current state and action
            rules, a set of condition-action rules
            action, the most recent action, initially none
state <- Update-State(state, action, percept, model);
rule <- Rule-Match(state, rules);
action <- rule.action;
return action

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

70 / 475

Agents and their Task-environments

Agent Types

Goal-based agents

Figure 15: Schematic of a goal-based agent. Using the internal state and models of how the world evolves and of what its actions do, the agent predicts "what it will be like if I do action A" and compares this with its goals to decide "what action I should do now". Includes search and planning.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Agents and their Task-environments

December 5, 2013

71 / 475

Agent Types

Utility-based agents

Figure 16: Schematic of a utility-based agent. The agent predicts "what it will be like if I do action A" and evaluates "how happy I will be in such a state" with its utility function before choosing an action. An agent's utility function is its internalization of the performance measure.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

72 / 475

Goal-based Problem-solving Agents using Searching

Contents

Goal-based Problem-solving Agents using Searching


The Graph-Search Algorithm
Uninformed (Blind) Search
Informed (Heuristic) Search

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

73 / 475

December 5, 2013

74 / 475

Goal-based Problem-solving Agents using Searching

Problem Solving Agents


Algorithm 3: Simple-Problem-Solving-Agent
input     : percept
output    : action
persistent: seq, an action sequence, initially empty
            state, the agent's current conception of the world's state
            goal, a goal, initially null
            problem, a problem formulation
state <- Update-State(state, percept);
if seq is empty then
    goal <- Formulate-Goal(state);
    problem <- Formulate-Problem(state, goal);
    seq <- Search(problem);
    if seq = failure then return a null action
action <- First(seq);
seq <- Rest(seq);
return action

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

Searching for Solutions


State: The system-state x ∈ X parameterizes all properties of interest. The
initial state is x_0 and the set of goal-states is X_g. The set X of valid states
is called the state-space.

Actions or Inputs: At each state x, there is a set of valid actions u(x) ∈ U(x) that
can be taken by the search agent to alter the state.

State Transition Function: How a new state x' is created by applying an action u to
the current state x:
    x' = f(x, u).    (4.1)

The transition may have a cost k(x, u) > 0.

Figure 17: Nodes are states, and edges are state-transitions caused by actions (initial state x_0, valid actions u_1, u_2, goal state x_g).
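A problem of this form can be captured in Python by a small structure holding x_0, X_g, U(x), f(x,u) and k(x,u). The tiny graph and costs below are a purely illustrative toy example (they are reused in later sketches in this chapter):

# States are strings; each action is "move to a neighbouring state".
toy_problem = {
    'x0':   'A',            # initial state x_0
    'goal': {'D'},          # goal set X_g
    'actions': {            # U(x): for each state, the reachable (x_next, k(x,u)) pairs
        'A': [('B', 1.0), ('C', 4.0)],
        'B': [('C', 2.0), ('D', 5.0)],
        'C': [('D', 1.0)],
        'D': [],
    },
}

def toy_successors(x):
    """f(x,u) and k(x,u) for all u in U(x), returned as (x_next, step_cost) pairs."""
    return toy_problem['actions'][x]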


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

75 / 475

Goal-based Problem-solving Agents using Searching

Examples I

Figure 18: The 8-puzzle problem. (a) An instance, showing a start state and the goal state. (b) A node of the search-graph: it stores the state, a pointer to its parent node, the action that generated it, its depth (here 6), and its path-cost g = 6. Arrows point to parent-nodes.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

76 / 475

Goal-based Problem-solving Agents using Searching

Examples II

Figure 19: An instance of the 8-Queens problem

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

77 / 475

Goal-based Problem-solving Agents using Searching

Examples III
Figure 20: The map of Romania, with road distances (in km) between neighbouring cities. An instance of the route-planning problem given a map.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

78 / 475

Goal-based Problem-solving Agents using Searching

Examples IV

Figure 21: A 2D occupancy grid map created using Laser-Range-Finder (LRF).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

79 / 475

Goal-based Problem-solving Agents using Searching

Examples V

Figure 22: Result of the A* path-planning algorithm on a multi-resolution quad-tree.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

80 / 475

Goal-based Problem-solving Agents using Searching

The Graph-Search Algorithm

Graph-Search
Compare with Textbook Fig. 3.7

Algorithm 4: Graph-Search
input: x_0, X_g
D <- {}   (the explored-set / dead-set / passive-set);
F.Insert(< x_0, g(x_0) = 0, ℓ(x_0) = h(x_0) >)   (the frontier / active-set);
while F not empty do
    < x, g(x), ℓ(x) > <- F.Choose()   (remove the best x from F);
    if x ∈ X_g then return SUCCESS;
    D <- D ∪ {x};
    for u ∈ U(x) do
        x' <- f(x, u),   g(x') <- g(x) + k(x, u);
        if (x' ∉ D) and (x' ∉ F) then
            F.Insert(< x', g(x'), ℓ(x') = g(x') + h(x', X_g) >);
        else if (x' ∈ F) then
            F.Resolve-Duplicate(x', g(x'), ℓ(x'));
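A compact Python sketch of Algorithm 4. It uses heapq as the priority-queue F and, as a deliberate simplification, emulates Resolve-Duplicate lazily: instead of re-ordering the queue, improved nodes are pushed again and stale entries are skipped on removal. The successors interface is the one from the toy problem formulation above.

import heapq

def graph_search(x0, goals, successors, h=lambda x: 0.0):
    """Generic Graph-Search. successors(x) yields (x_next, step_cost) pairs.
    With h == 0 this behaves like Dijkstra / uniform-cost search; with an
    admissible, consistent h it behaves like A*.  Returns (path, cost) or None."""
    g = {x0: 0.0}                       # best path-cost g(x) found so far
    parent = {x0: None}
    frontier = [(h(x0), x0)]            # priority = l(x) = g(x) + h(x)
    explored = set()                    # the explored set D
    while frontier:
        _, x = heapq.heappop(frontier)
        if x in explored:               # stale duplicate entry; skip it
            continue
        if x in goals:                  # goal test at selection time
            path, node = [], x
            while node is not None:
                path.append(node)
                node = parent[node]
            return list(reversed(path)), g[x]
        explored.add(x)
        for x_next, k in successors(x):
            g_new = g[x] + k
            if x_next not in explored and g_new < g.get(x_next, float('inf')):
                g[x_next] = g_new       # Insert, or Resolve-Duplicate with a lower cost
                parent[x_next] = x
                heapq.heappush(frontier, (g_new + h(x_next), x_next))
    return None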

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

81 / 475

The Graph-Search Algorithm

Measuring Problem-Solving Performance

Completeness: Is the algorithm guaranteed to find a solution if there
is one?
Optimality: Does the strategy find optimal solutions?
Time & Space Complexity: How long does the algorithm take and
how much memory is needed?
Branching factor b: the maximum number of successors (children) of
any node.
Depth d: the shallowest goal-node level.
Max-length m: the maximum length of any path in the state-space.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

82 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

Breadth-First Search (BFS)


The frontier F is implemented as a FIFO queue. The oldest element
is chosen by Choose().
For a finite graph, it is complete, and optimal if all edges have the same
cost. It finds the shallowest goal node.
Graph-Search can return as soon as a goal-state is generated
(at child-generation time), rather than when it is selected for expansion.
Number of nodes generated: b + b^2 + ... + b^d = O(b^d). This is the
space and time complexity.
The explored set will have O(b^(d-1)) nodes and the frontier will have
O(b^d) nodes.
Memory becomes more critical than computation time, e.g. for
b = 10, d = 12, 1 KB/node, the search-time is 13 days, and the
memory requirement is 1 petabyte (= 10^15 bytes).
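A minimal BFS sketch with a FIFO frontier and the early goal test just described; successors(x) yields (child, step_cost) pairs as in the earlier toy example (the costs are ignored, since BFS minimizes the number of edges):

from collections import deque

def breadth_first_search(x0, goals, successors):
    """Returns a path with the fewest edges from x0 to a goal, or None."""
    if x0 in goals:
        return [x0]
    parent = {x0: None}
    frontier = deque([x0])             # FIFO queue
    while frontier:
        x = frontier.popleft()         # Choose(): the oldest element
        for x_next, _ in successors(x):
            if x_next not in parent:   # neither explored nor already in the frontier
                parent[x_next] = x
                if x_next in goals:    # goal test at generation time
                    path, node = [], x_next
                    while node is not None:
                        path.append(node)
                        node = parent[node]
                    return list(reversed(path))
                frontier.append(x_next)
    return None

print breadth_first_search(toy_problem['x0'], toy_problem['goal'], toy_successors)
# ['A', 'B', 'D']  (fewest edges, not necessarily the cheapest path)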
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

83 / 475

Uninformed (Blind) Search

BFS Example

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

84 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

Dijkstra Algorithm or Uniform-Cost Search


The frontier F is implemented as a priority-queue. Choose() selects
the element with the highest priority, i.e. the minimum path-length
g (x).
F.Resolve-Duplicate(x ) function on line 2 updates the path-cost
g (x ) of x in the frontier F, if the new value is lower than the stored
value. If the cost is decreased, the old parent is replaced by the new
one. The priority queue is reordered to reect the change.
It is complete and optimum.
When a node x is chosen from the priority-queue, the minimum length
path from x0 to it has been found. Its length is denoted as g (x).
In other words, the optimum path-lengths of all the explored nodes in
the set D have already been found.
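With the generic graph_search sketch given after Algorithm 4, uniform-cost search is simply the case h ≡ 0, e.g. on the earlier toy graph:

# Assumes toy_problem, toy_successors and graph_search from the earlier sketches.
path, cost = graph_search(toy_problem['x0'], toy_problem['goal'], toy_successors)
print path, cost    # ['A', 'B', 'C', 'D'] 4.0   (h defaults to 0, i.e. Dijkstra)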

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

85 / 475

Uninformed (Blind) Search

Correctness of Dijkstra Algorithm


Figure 23: The graph separation by the frontier F: the explored set D (containing x_0) is separated from the unexplored part of the graph (which contains x_g) by the frontier. The node x in the frontier is chosen for further expansion. The set D is a tree.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

86 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

Correctness of Dijkstra Algorithm: Observations


Unexplored nodes can only be reached through the frontier nodes.
An existing frontier node xf s cost can only be reduced thorough a
node xc which currently has been chosen from the priority-queue as it
has the smallest cost in the frontier: This will be done by the
RESOLVE-DUPLICATE function. Afterwards, parent(xf )= xc .
Note that D remains a tree.
The frontier expands only through the unexplored children of the
chosen frontier node, all of the children will have costs worse than
their parent.
The costs of the successively chosen frontier nodes are
non-decreasing.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

87 / 475

Uninformed (Blind) Search

Correctness of Dijkstra Algorithm: Proof by Induction

Theorem 4.1 (When a frontier node x_c is chosen for expansion, its
optimum path has been found)

Proof.
The proof will be done as part of the proof of optimality of the A* algorithm,
as the Dijkstra Algorithm is a special case of the A* algorithm. Refer to
Lemma 4.6.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

88 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

Depth-first Search (DFS)

The frontier F is implemented as a LIFO stack. The newest element
is chosen by Choose().
For a finite graph, it is complete, but not optimal.
Explored nodes with no descendants in the frontier can be removed
from memory! This gives a space-complexity advantage: O(bm)
nodes. This happens automatically if the algorithm is written
recursively.
Assuming that nodes at the same depth as the goal-node have no
successors, with b = 10, d = 16, 1 KB/node, DFS will require about 7 trillion
(7 x 10^12) times less space than BFS! That is why it is popular in AI
research.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

89 / 475

Uninformed (Blind) Search

Algorithm 5: Depth-limited-Search
input: current-state x, depth d
if x ∈ X_g then
    return SUCCESS;
else if d = 0 then
    return CUTOFF
else
    for u ∈ U(x) do
        x' <- f(x, u);
        result <- Depth-limited-Search(x', d - 1);
        if result = SUCCESS then
            return SUCCESS
        else if result = CUTOFF then
            cutoff-occurred <- true
    if cutoff-occurred then return CUTOFF;
    else return NOT-FOUND;

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

90 / 475

Goal-based Problem-solving Agents using Searching

Uninformed (Blind) Search

DFS Example
Figure: DFS example on a binary tree with goal node M; explored nodes with no frontier descendants are removed from memory as the search proceeds.

91 / 475

Uninformed (Blind) Search

Iterative Deepening Search (IDS)

As O(b^d) >> O(b^(d-1)), one can combine the benefits of BFS and DFS.

All the work from the previous iteration is redone, but this is
acceptable, as the frontier-size (the last level) is dominant.

Algorithm 6: Iterative-Deepening-Search
for d = 0 to ∞ do
    result <- Depth-limited-Search(x_0, d);
    if result ≠ CUTOFF then
        return result
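A direct Python transcription of Algorithms 5 and 6; it returns only the SUCCESS/CUTOFF/NOT-FOUND status (not the path), keeps no explored set, and so assumes a tree-shaped state space as in the example figures. successors(x) yields (child, step_cost) pairs as before.

SUCCESS, CUTOFF, NOT_FOUND = 'success', 'cutoff', 'not-found'

def depth_limited_search(x, goals, successors, d):
    if x in goals:
        return SUCCESS
    if d == 0:
        return CUTOFF
    cutoff_occurred = False
    for x_next, _ in successors(x):
        result = depth_limited_search(x_next, goals, successors, d - 1)
        if result == SUCCESS:
            return SUCCESS
        if result == CUTOFF:
            cutoff_occurred = True
    return CUTOFF if cutoff_occurred else NOT_FOUND

def iterative_deepening_search(x0, goals, successors, d_max=50):
    for d in range(d_max + 1):          # 'to infinity' in the pseudocode
        result = depth_limited_search(x0, goals, successors, d)
        if result != CUTOFF:
            return result, d
    return CUTOFF, d_max

print iterative_deepening_search('A', {'D'}, toy_successors)   # ('success', 2)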

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

92 / 475

Goal-based Problem-solving Agents using Searching


Figure: Four iterations of Iterative Deepening Search on a binary tree, with depth limits 0, 1, 2, and 3.

December 5, 2013

93 / 475

December 5, 2013

94 / 475

Informed (Heuristic) Search

Bellman's Principle of Optimality


Theorem 4.2
All subpaths of an optimal path are also optimal.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

A* Search

Est. path-cost from x_0 to x_g via x = g(x) + est. path-cost from x to x_g,    (4.2)
    ℓ(x) = g(x) + h(x),                                                        (4.3)
    ℓ(x_g) = g(x_g), as h(x_g) = 0.                                            (4.4)

h(x) is a heuristically estimated cost, e.g., for the map route-finding
problem, h(x) = ||x_g - x|| (the straight-line distance to the goal).
ℓ(x) is the estimated cost of the cheapest solution through node x.

A* is a Graph-Search where the frontier F is a priority-queue with
higher priority given to lower values of the evaluation function ℓ(x).
If no heuristic is used, i.e. h(x) ≡ 0, A* reduces to Dijkstra's
algorithm.
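As a concrete illustration, the generic graph_search sketch given after Algorithm 4 can be run with the straight-line-distance heuristic of Fig. 24 on a small fragment of the Romania map of Fig. 20 (only a handful of the map's roads are reproduced here, with the usual textbook distances):

# A small fragment of the road map (bidirectional edges, costs in km).
edges = {
    'Arad':           [('Zerind', 75), ('Sibiu', 140), ('Timisoara', 118)],
    'Zerind':         [('Arad', 75), ('Oradea', 71)],
    'Oradea':         [('Zerind', 71), ('Sibiu', 151)],
    'Timisoara':      [('Arad', 118), ('Lugoj', 111)],
    'Lugoj':          [('Timisoara', 111)],
    'Sibiu':          [('Arad', 140), ('Oradea', 151), ('Fagaras', 99), ('Rimnicu Vilcea', 80)],
    'Fagaras':        [('Sibiu', 99), ('Bucharest', 211)],
    'Rimnicu Vilcea': [('Sibiu', 80), ('Pitesti', 97)],
    'Pitesti':        [('Rimnicu Vilcea', 97), ('Bucharest', 101)],
    'Bucharest':      [('Fagaras', 211), ('Pitesti', 101)],
}
# hSLD values from Fig. 24 (straight-line distance to Bucharest).
h_sld = {'Arad': 366, 'Zerind': 374, 'Oradea': 380, 'Timisoara': 329, 'Lugoj': 244,
         'Sibiu': 253, 'Fagaras': 176, 'Rimnicu Vilcea': 193, 'Pitesti': 100,
         'Bucharest': 0}

path, cost = graph_search('Arad', {'Bucharest'},
                          successors=lambda x: edges[x],
                          h=lambda x: h_sld[x])
print path, cost   # ['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'] 418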
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

95 / 475

Informed (Heuristic) Search

A* Search
Resolve-Duplicate

As in Dijkstra's Algorithm, F.Resolve-Duplicate(x') in
Algorithm 4 updates the cost ℓ(x') in the frontier F if the new value is lower
than the stored value. If this occurs, the old parent of x' is replaced by the
new one. The priority queue is reordered to reflect the change.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

96 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Example heuristic function

Arad          366      Mehadia          241
Bucharest       0      Neamt            234
Craiova       160      Oradea           380
Drobeta       242      Pitesti          100
Eforie        161      Rimnicu Vilcea   193
Fagaras       176      Sibiu            253
Giurgiu        77      Timisoara        329
Hirsova       151      Urziceni          80
Iasi          226      Vaslui           199
Lugoj         244      Zerind           374

Figure 24: Values of hSLD: straight-line distances to Bucharest.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

97 / 475

Informed (Heuristic) Search

Conditions for Optimality of A* I

A* will find the optimal path if the heuristic cost h(x) is admissible and
consistent.

Definition 4.3 (Admissibility)

h(x) is admissible if it never over-estimates the cost to reach the goal, i.e.,
h(x) is always optimistic.

Definition 4.4 (Consistency)

h(x) is consistent if h(x_g) = 0 and, for every child x_i (generated by action u_i) of a node x,
the triangle-inequality holds:
    h(x_g) = 0,                         (4.5)
    h(x) ≤ k(x, u_i, x_i) + h(x_i).     (4.6)

This is a stronger condition than admissibility, i.e.
consistency ⇒ admissibility.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

98 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Proof: consistency ⇒ admissibility I

We show the result by induction on n(π*[x : x_g]): the number of edges in
the optimal path (the one with the least sum of edge-costs) from a node x to the
goal x_g. We assume that we have a consistent heuristic h(x). Our
induction hypothesis is that the consistent heuristic h(x) is also admissible.

Case n(π*[x : x_g]) = 1: We use the property of consistency that
h(x_g) = 0. Nodes x with n(π*[x : x_g]) = 1 are such that their
optimum path is just one edge long, this edge being the one which
connects them to the goal; thus, h*(x) = k(x, x_g). Now, since
consistency is assumed to hold,
    h(x) ≤ k(x, x_g) + 0 = h*(x).    (4.7)

This proves admissibility. Thus our hypothesis holds for n = 1.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

99 / 475

Informed (Heuristic) Search

Proof: consistency ⇒ admissibility II

Case n(π*[x : x_g]) = m: We assume now that our hypothesis holds
for all nodes with optimal paths of at most m - 1 edges, i.e.
for the nodes S_{m-1} = {y | n(π*[y : x_g]) < m}. Let x be a node with
n(π*[x : x_g]) = m, i.e. its optimal path to the goal is m edges long. Let
the successor of x on this optimal path be x'. Since the sub-paths of
an optimal path are also optimal, x' ∈ S_{m-1}; hence the hypothesis
holds for it, i.e. h(x') ≤ h*(x'). Now, since consistency holds for x,
    h(x) ≤ h(x') + k(x, x') ≤ h*(x') + k(x, x') = h*(x).    (4.8)

The last step holds because x' is the successor of x on the latter's
optimal path to the goal. Hence admissibility has been demonstrated for a
general node x with n(π*[x : x_g]) = m.
By induction, the result holds for all values of n(π*[x : x_g]), i.e.
for the entire graph.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

100 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

A* proof I

Lemma 4.5 (ℓ(x) is non-decreasing along any optimal path, if h(x) is consistent)

Figure 25: Nodes x_0, x_m, x_n, x_p along a path. A dashed line between two nodes denotes that the nodes are
connected by a path, but are not necessarily directly connected.

To prove: Let x_p be a node for which the optimum path with cost g*(x_p)
has been found (see figure). If nodes x_m and x_n lie on this optimum path
such that x_m precedes x_n, then
    ℓ(x_m) ≤ ℓ(x_n).    (4.9)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

101 / 475

Informed (Heuristic) Search

A proof II
Proof.
First note that since xm and xn lie on the optimum path to xp , their
paths are also optimum and have lengths g (xn ) and g (xm )
respectively.
Let us rst assume that xm is the parent of xn , then

(xn ) = h(xn ) + g (xn )

(4.10)

= h(xn ) + g (xm ) + k(xm , xn )

(4.11)

(4.6)

h(xm ) + g (xm )

(xm )

(4.12)

Now if xm is not the parent of xn but a predecessor, the inequality


can be chained for every child-parent node on the path between them,
and we reach the same conclusion.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

102 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Recall Graph-Search
Algorithm 7: Graph-Search
input: x_0, X_g
D <- {}   (the explored-set / dead-set / passive-set);
F.Insert(< x_0, g(x_0) = 0 >)   (the frontier / active-set);
while F not empty do
    < x, g(x) > <- F.Choose()   (remove x from the frontier);
    if x ∈ X_g then return SUCCESS;
    D <- D ∪ {x};
    for u ∈ U(x) do
        x' <- f(x, u),   g(x') <- g(x) + k(x, u);
        if (x' ∉ D) and (x' ∉ F) then
            F.Insert(< x', g(x') + h(x', X_g) >);
        else if (x' ∈ F) then
            F.Resolve-Duplicate(x');
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

103 / 475

Informed (Heuristic) Search

A* proof I

Lemma 4.6 (At selection for expansion, a node's optimum path has been found)

To prove: In every iteration k of A*, the node x selected for expansion from
the frontier F_k (x has the minimum value of ℓ(x) in F_k) is such that, at
selection:
    g(x) = g*(x).    (4.13)

December 5, 2013

104 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Proof of Lemma 4.6

The proof is by induction on the iteration number N. At N = 1, the
frontier F_1 selects its only node x_0, with cost g(x_0) = g*(x_0) = 0.

Assume that the induction hypothesis holds for N = 1 ... k. Now we
need to show that it holds for iteration k + 1.
Assume that F_{k+1} selects x_n at this iteration. All frontier nodes have
their parents in D. At the time of selection, parent(x_n) = x_s.
Suppose that the path through x_s is not the optimal path for x_n, but
that the optimal path is π*, as shown in the figure in blue. This path exists
in the graph at iteration k + 1; whether or not it will ever be
discovered in future iterations is irrelevant.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

105 / 475

Informed (Heuristic) Search

Proof of Lemma 4.6 contd.

Figure 26: The frontier F_{k+1} separates the explored set (containing x_0, x_m and x_s) from the unexplored nodes. The assumed optimal path to x_n, shown in blue, runs x_0 --π*(x_0 : x_m)--> x_m --> x_p --π*(x_p : x_n)--> x_n. Since x_m ∈ D
at iteration k + 1, by the induction hypothesis it must have been selected by
some F_i, i < k + 1, and hence its path π*(x_0 : x_m) is optimum.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

106 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Proof of Lemma 4.6 contd.

Note that the assumed optimal path π* has to pass through a node
which is in F_{k+1}, because the frontier separates the dead nodes D
from the unknown nodes, and all expansion occurs through the frontier
nodes.
Let x_p be the first node on π* to belong to F_{k+1}. Let its parent on π*
be x_m ∈ D.
Thus, the entire assumed optimal path consists of the following
sub-paths (⊕ stands for path-concatenation):

    π*(x_0 : x_n) = π*(x_0 : x_m) ⊕ k(x_m : x_p) ⊕ π*(x_p : x_n).    (4.14)

Proof of Lemma 4.6 contd.

The cost of x_p in F_{k+1} is
    g_{k+1}(x_p) = g*(x_m) + k(x_m : x_p) = g*(x_p),                         (4.15)
    ℓ_{k+1}(x_p) = h(x_p) + g_{k+1}(x_p) = h(x_p) + g*(x_p) = ℓ(x_p).        (4.16)

As x_p lies on the optimum path to x_n, from Lemma 4.5,
    ℓ(x_p) ≤ ℓ(x_n)                                                          (4.17a)
           ≤ ℓ_{k+1}(x_n),   the cost of x_n at iteration k + 1.             (4.17b)

However, x_n was selected by F_{k+1} for expansion. Therefore,
    ℓ_{k+1}(x_n) ≤ ℓ_{k+1}(x_p) = ℓ(x_p),   using (4.16).                    (4.18)

Combining these results,
    ℓ(x_p) ≤ ℓ(x_n) ≤ ℓ_{k+1}(x_n) ≤ ℓ(x_p),   by (4.17a) and (4.18),        (4.19)
    so  ℓ(x_n) = ℓ_{k+1}(x_n).                                               (4.20)

Proof of Lemma 4.6 contd.

As ℓ(x_n) = ℓ_{k+1}(x_n),
    g*(x_n) + h(x_n) = g_{k+1}(x_n) + h(x_n),
    ⇒  g*(x_n) = g_{k+1}(x_n).    (4.21)

Therefore, the path-cost g_{k+1}(x_n) at iteration k + 1 is indeed optimum. This
contradicts our alternate optimal-path hypothesis, i.e. the assumption that the stored
path for x_n is not optimal. Therefore, we conclude that the optimal path, with cost
g*(x_n), has been found at N = k + 1 when x_n is selected by F_{k+1}.
By extension, when the goal node is selected by the frontier, its optimum
path with cost g*(x_g) has been found.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

109 / 475

Informed (Heuristic) Search

Lemma 4.7 (Monotonicity of Expansion)

Let the node selected by F_j be x_j, and that selected by F_{j+1} be x_{j+1}.
Then it must be true that
    ℓ(x_j) ≤ ℓ(x_{j+1}).    (4.22)

Why?

Proof.
Sketch: You have to consider two cases:
At iteration j + 1, x_{j+1} is a child of x_j.
At iteration j + 1, x_{j+1} is not a child of x_j.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

110 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Consequences of the Monotonicity of Expansion

Remark 4.8

This shows that at iteration N = j, if F_j selects x_j, then:

All nodes x with ℓ(x) < ℓ(x_j) have already been expanded (i.e. they
have died), and some nodes with ℓ(x) = ℓ(x_j) have also been
expanded.
In particular, when the first goal is found, all nodes x with
ℓ(x) < g*(x_g) have already been expanded, and some nodes with
ℓ(x) = g*(x_g) have also been expanded.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

111 / 475

Informed (Heuristic) Search

Figure 27: Region searched before finding a solution: Dijkstra path search.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

112 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Figure 28: Region searched before finding a solution: A* path search. The
number of nodes expanded is the minimum possible.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

113 / 475

Informed (Heuristic) Search

Properties of A*

The paths are optimal w.r.t. the cost function, but do you notice any
undesirable properties of the planned paths?
Why are there fewer colors in Fig. 28 (A*) than in Fig. 27 (Dijkstra)?
In Fig. 28, why are the red shades lighter in the beginning?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

114 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

An alternative to A for path-planning

Figure 29: Funnel-planning using wave-front expansion. The path stays away
from the obstacles.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

115 / 475

Informed (Heuristic) Search

Funnel path-planning

Figure 30: Funnel-planning using wave-front expansion in 3D. Source: Brock and
Kavraki, "Decomposition based motion-planning: A framework for real-time
motion-planning in high dimensional configuration spaces", ICRA 2001.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

116 / 475

Goal-based Problem-solving Agents using Searching

Informed (Heuristic) Search

Algorithm 8: Funnel-Planning
input: x_0, x_g
B <- findFreeSphere(x_0)
B.parent <- ∅
Q.insert(B, ||B.center - x_g|| - B.r)
while Q not empty do
    B <- Q.getMin()
    D.insert(B)
    if x_g ∈ B then
        return [D, B]
    for s <- 1 ... N_s do
        x <- sampleOnSurface(B)
        if x ∉ D then
            C <- findFreeSphere(x)
            C.parent <- B
            Q.insert(C, ||C.center - x_g|| - C.r)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Goal-based Problem-solving Agents using Searching

December 5, 2013

117 / 475

Informed (Heuristic) Search

Funnel path-planning

Figure 31: Motion planning using the funnel potentials. Source: LaValle,
Planning Algorithms, http://planning.cs.uiuc.edu/.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

118 / 475

Non-classical Search Algorithms

Contents

Non-classical Search Algorithms


Hill-Climbing
Sampling from a PMF
Simulated Annealing
Genetic Algorithms

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

119 / 475

Non-classical Search Algorithms

Local Search
Figure 32: A 1-D state-space landscape (objective function vs. state space), with the global maximum, a local maximum, a flat local maximum, a shoulder, and the current state marked. The aim is to find the global maximum.

Local search algorithms are used when
The search-path itself is not important, but only the final optimal
state, e.g. the 8-Queens problem, job-shop scheduling, IC design, TSP, etc.
Memory efficiency is needed. Typically only one node is retained.
The aim is to find the best state according to an objective function to
be optimized. We may be seeking the global maximum or the
minimum. How can we reformulate the former as the latter?
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

120 / 475

Non-classical Search Algorithms

Hill-Climbing

Algorithm 9: Hill-Climbing
input : x_0, objective (value) function v(x) to maximize
output: x*, the state where a local maximum is achieved
x <- x_0;
while True do
    y <- the highest-valued child of x;
    if v(y) ≤ v(x) then return x;
    x <- y

To avoid getting stuck on plateaux:
Allow side-ways movements. Problems?
Random-restart: perform the search from many randomly chosen x_0 till an
optimal solution is found (see the Python sketch below).

Figure 33: A ridge. The local maxima are not directly connected to each other.
connected to each other.

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

121 / 475

Hill-Climbing

Figure 34: (a) Starting state. h(x) is the number of pairs of queens attacking
each other. Each node has 8 × 7 = 56 children. (b) A local minimum with h(x) = 1.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

122 / 475

Non-classical Search Algorithms

Sampling from a PMF

Sampling from a Probability Mass Function (pmf)


Definition 5.1 (PMF)
Given a discrete random variable A with an exhaustive and ordered (can
be user-defined, if no natural order exists) list of its possible values
[a1, a2, . . . , an], its pmf P(A) is a table with the probabilities
P(A = ai), i = 1 . . . n. Obviously, Σ_{i=1}^{n} P(A = ai) = 1.

Problem 5.2
Sampling a PMF: A uniform random-number generator on the unit interval has
the probability density function p_{u[0,1]}(x) shown below (equal to 1 on
[0, 1) and 0 elsewhere), so that

  P(x ∈ [a, b]; 0 ≤ a ≤ b < 1) = b − a.

Python's random.random() returns a sample x ∈ [0.0, 1.0). How can you use it
to sample a given discrete distribution (PMF)?
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

123 / 475

Sampling from a PMF

Definition 5.3 (Cumulative PMF F_A for a PMF P(A))

  F_A(a_j) ≜ P(A ≤ a_j) = Σ_{i=1}^{n} P(A = a_i) u(a_j − a_i)      (5.1)
                        = Σ_{i ≤ j} P(A = a_i),                    (5.3)

  where u(x) = 0 for x < 0 and u(x) = 1 for x ≥ 0.                 (5.2)

u(x) is called the discrete unit-step function or the Heaviside step
function. What is F_A(a_n)?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

124 / 475

Non-classical Search Algorithms

Sampling from a PMF

Form a vector s of half-closed intervals

  s = [ [0, F_A(a_1)), [F_A(a_1), F_A(a_2)), . . . , [F_A(a_{n−2}), F_A(a_{n−1})), [F_A(a_{n−1}), 1) ],
  and define a_0 s.t. F_A(a_0) ≜ 0.                                (5.4)

Let r_u be a sample from a uniform random-number generator in the
unit interval [0, 1). Then

  P(r_u ∈ s[i]) = P(F_A(a_{i−1}) ≤ r_u < F_A(a_i))
                = F_A(a_i) − F_A(a_{i−1})
                = P(a_i)   by (5.3).
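A small Python sketch of this interval construction, using only the standard library; the function name and arguments are ours:

    import random
    from itertools import accumulate

    def sample_pmf(values, probs):
        """Draw one value from the PMF given by parallel lists values, probs."""
        cdf = list(accumulate(probs))      # F_A(a_1), ..., F_A(a_n) = 1
        r = random.random()                # uniform sample in [0, 1)
        for a, F in zip(values, cdf):
            if r < F:                      # r falls in [F_A(a_{i-1}), F_A(a_i))
                return a
        return values[-1]                  # guard against rounding of the last F

    # Example with P(Weather):
    print(sample_pmf(["sunny", "rain", "cloudy", "snow"], [0.6, 0.1, 0.29, 0.01]))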

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

125 / 475

Simulated Annealing

Algorithm 10: Simulated-Annealing

input : x0, objective (cost) function c(x) to minimize
output: x, a locally optimum state
x ← x0 ;
for k ← k0 to ∞ do
  T ← Schedule(k) ;
  if 0 < T < ε then return x ;
  y ← a randomly selected child of x ;
  ΔE ← c(y) − c(x) ;
  if ΔE < 0 then
    x ← y
  else
    x ← y with probability P(ΔE, T),

  P(ΔE, T) = 1 / (1 + e^{ΔE/T}) ≈ e^{−ΔE/T}   (Boltzmann distribution)   (5.5)

An example of a schedule is

  T_k = T0 · ln(k0) / ln(k).                                              (5.6)

Applications:
VLSI layouts,
Factory-scheduling.
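A compact Python sketch of Algorithm 10 with the logarithmic schedule (5.6); it uses the simpler e^{−ΔE/T} acceptance probability, and neighbour() and cost() are problem-specific stand-ins:

    import math, random

    def simulated_annealing(x0, cost, neighbour, T0=100.0, k0=2, eps=1e-3):
        """Minimize cost(x) from x0 with schedule T_k = T0 ln(k0)/ln(k)."""
        x = x0
        k = k0
        while True:
            T = T0 * math.log(k0) / math.log(k)
            if T < eps:
                return x
            y = neighbour(x)                  # randomly selected child of x
            dE = cost(y) - cost(x)
            if dE < 0 or random.random() < math.exp(-dE / T):
                x = y                         # downhill always; uphill with Boltzmann prob.
            k += 1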

December 5, 2013

126 / 475

Non-classical Search Algorithms

Simulated Annealing

Another Schedule

We rst generate some random rearrangements, and use them


to determine the range of values of E that will be encountered
from move to move. Choosing a starting value of T which is
considerably larger than the largest E normally encountered,
we proceed downward in multiplicative steps each amounting to
a 10 % decrease in T . We hold each new value of T constant
for, say, 100N recongurations, or for 10N successful
recongurations, whichever comes rst. When eorts to reduce
E further become suciently discouraging, we stop.
Numerical Recipes in C: The Art of Scientic Computing.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

127 / 475

Simulated Annealing

Example: Traveling Salesman Problem


[Plots: tour cost and temperature T versus the number of iterations (nr. iteration) over one simulated-annealing run on a TSP instance.]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

128 / 475

Non-classical Search Algorithms

Simulated Annealing

Example: Traveling Salesman Problem



Figure 35: A suboptimal tour found by the algorithm in one of the runs.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

129 / 475

Genetic Algorithms

Algorithm 11: Genetic Algorithm


input : P = {x}, a population of individuals,
a Fitness() function to maximize
output: x , an individual
repeat
Pn ;
for i 1 to Size(P) do
x Random-Selection(P, Fitness()) ;
y Random-Selection(P, Fitness()) ;
c Reproduce(x, y ) ;
if small probability then Mutate(c) ;
Add c to Pn
P Pn
until x P, Fitness(x) > Threshold, or enough time
elapsed;
return best individual in P
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

130 / 475

Non-classical Search Algorithms

Genetic Algorithms

Algorithm 12: Reproduce(x, y)


N Length(x) ;
R random-number from 1 to N (cross-over point);
c Substring(x, 1, R) + Substring(y, R + 1, N) ;
return c
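A Python sketch of Reproduce and a simple mutation for the 8-queens string encoding of Fig. 36; the helper names are ours:

    import random

    def reproduce(x, y):
        """Single-point crossover of two equal-length strings."""
        n = len(x)
        r = random.randint(1, n - 1)       # cross-over point
        return x[:r] + y[r:]

    def mutate(c):
        """Replace one randomly chosen position by a random column 1..8."""
        i = random.randrange(len(c))
        return c[:i] + str(random.randint(1, 8)) + c[i + 1:]

    child = reproduce("24748552", "32752411")
    if random.random() < 0.1:              # small mutation probability
        child = mutate(child)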
[Figure panels: (a) Initial Population, (b) Fitness Function, (c) Selection, (d) Crossover, (e) Mutation. Each individual is an 8-digit string such as 24748552, shown with its fitness and selection percentage.]

Figure 36: The 8-Queens problem. The ith number in the string is the position of
the queen in the ith column. The tness function is the number of non-attacking
pairs (maximum tness 28).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

131 / 475

Genetic Algorithms

Properties of Genetic Algorithms (GA)

Crucial issue: encoding.


Schema e.g. 236 , an instance of this schema is 23689745. If
average tness of instances of schema are above the mean, then the
number of instances of the schema in the population will grow over
time. It is important that the schema makes some sense within the
semantics/physics of the problem.
GAs have been used in job-shop scheduling, circuit-layout, etc.
The identication of the exact conditions under which GAs perform
well requires further research.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

132 / 475

Non-classical Search Algorithms

Genetic Algorithms

A Detailed Example: Flexible Job Scheduling


from G. Zhang et al An eective genetic algorithm for the exible job-shop scheduling
problem, Expert Systems with Applications, vol. 38, 2011.

Figure 37: Gantt-Chart of a Schedule: Minimizing Makespan.


K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

133 / 475

December 5, 2013

134 / 475

Genetic Algorithms

GA Example: Flexible Job Scheduling

Job
J1
J2

Operation
O11
O12
O21
O22
O23

M1
2
3
4
-

K. Pathak (Jacobs University Bremen)

M2
6
8
6
7

M3
5
6
5
11

M4
3
4
5

M5
4
5
8

Articial Intelligence

Non-classical Search Algorithms

Genetic Algorithms

GA Example: Constraints

Oi(j+1) can begin only after Oij has ended.


Only a certain subset ij of machines can perform Oij .
Jio is the number of total operations for job Ji .
L=

N
i=1 Jio

total number of operations of all jobs.

Pijk is the processing-time of Oij on machine k.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

135 / 475

Genetic Algorithms

GA Example: Chromosome Representation

(a)

(b) Machine Selection Part

(c) Operation Sequence Part

Figure 38: Chromosome Representation


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

136 / 475

Non-classical Search Algorithms

Genetic Algorithms

GA Example: Decoding Chromosome

(a) Finding enough space to insert Oi(j+1)


K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

137 / 475

Genetic Algorithms

GA Example: Initial Population: Global Selection (GS)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

138 / 475

Non-classical Search Algorithms

Genetic Algorithms

GA Example: Initial Population: Local Selection (LS)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

139 / 475

Genetic Algorithms

GA Example: MS Crossover Operator

Figure 39: Machine Sequence (MS) Part Crossover

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

140 / 475

Non-classical Search Algorithms

Genetic Algorithms

GA Example: OS Crossover Operator


Precedence Preserving Order-Based Crossover (POX)

Figure 40: Operation Sequence (OS) Part: Precedence Preserving Order-Based


Crossover (POX)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Non-classical Search Algorithms

December 5, 2013

141 / 475

Genetic Algorithms

GA Example: Mutation

Figure 41: Machine Sequence (MS) Part Mutation

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

142 / 475

Non-classical Search Algorithms

Genetic Algorithms

GA Example: Run

Figure 42: A typical run of GA

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

143 / 475

December 5, 2013

144 / 475

Games Agents Play

Contents

Games Agents Play


Minimax
Alpha-Beta Pruning

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

Minimax

Zero-Sum Games
A Partial Game-Tree for Tic-Tac-Toe
[Figure: the first plies of the tic-tac-toe game tree, alternating MAX (X) and MIN (O) levels down to terminal states whose utilities for MAX are −1, 0 or +1.]

Figure 43: Each half-move is called a ply.


K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

145 / 475

Minimax

Search-Tree vs Game-Tree

The search-tree is usually a subset of the game-tree.

Example: For Chess, the game-tree is estimated to have over 10^40 nodes, with an average branching factor of about 35.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

146 / 475

Games Agents Play

Minimax

Zero-Sum Games
Nomenclature

S0 The initial state.


Player(s) The player which has the move in state s.
Actions(s) Set of legal moves in state s.
Result(s, a) The transition-model: the state resulting from applying the
action a to the state s.
Terminal-Test(s) Returns True if the game is over at s.
Utility(s) The payo for the Max player at a terminal state s.
Zero-sum game A game where the sum of utilities for both players at each
terminal state is a constant. Example: Chess:
(1, 0), (0, 1), (1/2, 1/2).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

147 / 475

Minimax

An Example 2-Ply Game


[Figure: a two-ply game tree. The MAX root has moves A1, A2, A3; each MIN node takes the minimum of its leaf utilities (here 3, 2 and 2), and the MAX root takes their maximum, 3.]

Figure 44: Each node (state) labeled with its minimax value.

Minimax(s) =
  Utility(s)                                    if Terminal-Test(s),
  max_{a ∈ Actions(s)} Minimax(Result(s, a))    if Player(s) = Max,
  min_{a ∈ Actions(s)} Minimax(Result(s, a))    if Player(s) = Min.
(6.1)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

148 / 475

Games Agents Play

Minimax

The Minimax Algorithm


Algorithm 13: Minimax-Decision
input : State s
return arg max_{a ∈ Actions(s)} Min-Value(Result(s, a))

Algorithm 14: Max-Value

input : State s
if Terminal-Test(s) then return Utility(s) ;
v ← −∞ ;
for a ∈ Actions(s) do
  v ← max(v, Min-Value(Result(s, a)))
return v

Algorithm 15: Min-Value

input : State s
if Terminal-Test(s) then return Utility(s) ;
v ← +∞ ;
for a ∈ Actions(s) do
  v ← min(v, Max-Value(Result(s, a)))
return v
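A direct Python transcription of (6.1) and Algorithm 13; the game object is an assumed stand-in bundling the functions from the nomenclature slide (terminal_test, utility, actions, result, player):

    def minimax(s, game):
        """Minimax value of state s under the game's transition model."""
        if game.terminal_test(s):
            return game.utility(s)
        values = [minimax(game.result(s, a), game) for a in game.actions(s)]
        return max(values) if game.player(s) == "MAX" else min(values)

    def minimax_decision(s, game):
        """Choose the action with the best minimax value for the MAX player."""
        return max(game.actions(s),
                   key=lambda a: minimax(game.result(s, a), game))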
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

149 / 475

Minimax

Search-Tree Complexity

Since the search is a DFS, for an average branching factor b and
maximum depth m:
  Space complexity: O(bm).
  Time complexity: O(b^m).

It turns out we can reduce the time complexity in the best case to
O(b^{m/2}) using Alpha-Beta Pruning.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

150 / 475

Games Agents Play

Alpha-Beta Pruning

Alpha-Beta Pruning

[Figure: alpha-beta pruning traced on the two-ply tree of Fig. 44; branches that cannot influence the root's minimax value are pruned without being evaluated.]
December 5, 2013

151 / 475

December 5, 2013

152 / 475

Alpha-Beta Pruning

Algorithm 16: Alpha-Beta-Search(s)


v ← Max-Value(s, α = −∞, β = +∞) ;
return the Action in Actions(s) with value v

Algorithm 17: Max-Value(s, α, β)

if Terminal-Test(s) then return Utility(s) ;
v ← −∞ ;
for a ∈ Actions(s) do
  v ← max(v, Min-Value(Result(s, a), α, β)) ;
  if v ≥ β then return v ;
  α ← max(α, v) ;
return v

Algorithm 18: Min-Value(s, α, β)

if Terminal-Test(s) then return Utility(s) ;
v ← +∞ ;
for a ∈ Actions(s) do
  v ← min(v, Max-Value(Result(s, a), α, β)) ;
  if v ≤ α then return v ;
  β ← min(β, v) ;
return v
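A Python sketch of Algorithms 16-18 in a single function; as above, game is an assumed stand-in for the game-specific hooks:

    import math

    def alpha_beta_value(s, game, alpha=-math.inf, beta=math.inf):
        """Minimax value of s with alpha-beta pruning."""
        if game.terminal_test(s):
            return game.utility(s)
        if game.player(s) == "MAX":
            v = -math.inf
            for a in game.actions(s):
                v = max(v, alpha_beta_value(game.result(s, a), game, alpha, beta))
                if v >= beta:
                    return v              # beta cut-off
                alpha = max(alpha, v)
            return v
        else:
            v = math.inf
            for a in game.actions(s):
                v = min(v, alpha_beta_value(game.result(s, a), game, alpha, beta))
                if v <= alpha:
                    return v              # alpha cut-off
                beta = min(beta, v)
            return v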
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

Alpha-Beta Pruning

Reference

Donald E. Knuth and Ronald W. Moore, An Analysis of Alpha-Beta


Pruning, Articial Intelligence, 1975.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

153 / 475

Alpha-Beta Pruning

Transposition Table

Some state-nodes may reappear in the tree: To avoid repeating their


expansion, their computed utilities can be cached in a hash-table
called the Transposition Table. This analogous to the dead-set in
the Graph-Search Algorithm.
It may not be practical to cache all visited nodes. Various heuristics
are used to decide which nodes to discard from the transposition
table.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

154 / 475

Games Agents Play

Alpha-Beta Pruning

Cuto Depth and Evaluation Functions

To achieve real-time performance, we expand the search-tree up to a


maximum depth only and replace the utility computation by a
heuristic evaluation function.
Algorithm 19: Min-Value(s, α, β, d)
if Cutoff-Test(s, d) then return Eval(s) ;
v ← +∞ ;
for a ∈ Actions(s) do
  v ← min(v, Max-Value(Result(s, a), α, β, d + 1)) ;
  if v ≤ α then return v ;
  β ← min(β, v) ;
return v

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Games Agents Play

December 5, 2013

155 / 475

Alpha-Beta Pruning

Example Evaluation Functions


Each state s is considered to have several features, e.g. in Chess,
number of rooks, pawns, bishops, number of plys till now, etc.
Each feature can be given a weight and a weighted sum of features
can be used. Example: pawn (1), bishop (3), rook (5), queen (9).
Weighting can be nonlinear, e.g. a pair of bishops is worth more than
twice the worth of a single bishop; a bishop is more valuable in
endgame.
Read Sec 5.7 of the textbook. The (2007-2010) computer world
champion was RYBKA running on a desktop with its evaluation
function tuned by International Master Vasik Rajlich. Allegations of
plagiarization: Crafty and Fruit.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

156 / 475

Logical Agents: Propositional Logic

Contents

Logical Agents: Propositional Logic


Propositional Logic
Entailment and Inference
Inference by Model-Checking
Inference by Theorem Proving
Inference by Resolution
Inference with Denite Clauses
2SAT
Agents based on Propositional Logic
Time out from Logic

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

157 / 475

Logical Agents: Propositional Logic

Knowledge-Base
A Knowledge-Base is a set of sentences expressed in a knowledge
representation language. New sentences can be added to the KB and it
can be queried about whether a given sentence can be inferred from what
is known.
A KB-Agent is an example of a Reex-Agent explained previously.
Algorithm 20: Knowledge-Base (KB) Agent
input : KB, a knowledge-base,
t, time, initially 0.
Tell(KB, Make-Percept-Sentence(percept, t)) ;
action Ask(KB, Make-Action-Query(t)) ;
Tell(KB, Make-Action-Sentence(action, t)) ;
t t +1 ;
return action

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

158 / 475

Logical Agents: Propositional Logic

Propositional Logic

Propositional Logic
A simple knowledge representation language

Denition 7.1 (Syntax of Propositional Logic)


An atomic formula (also called an atomic sentence or a
proposition-symbol) has the form P, Q, A1, True, False, IsRaining, etc.
A formula/sentence can be defined inductively:
  All atomic formulas are formulas.
  For every formula F, ¬F is a formula, called a negation.
  For all formulas F and G, the following are also formulas:
    (F ∨ G), called a disjunction.
    (F ∧ G), called a conjunction.

If a formula F is part of another formula G, then it is called a
subformula of G.
We use the short-hand notations:
  F → G (Premise implies Conclusion) for (¬F) ∨ G,
  F ↔ G (Biconditional) for (F → G) ∧ (G → F).
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

159 / 475

Propositional Logic

Propositional Logic
A simple knowledge representation language

Syntax of Propositional Logic rewritten in BNF Grammar

  Sentence → Atomic-Sentence | Complex-Sentence
  Atomic-Sentence → True | False | P | Q | R | . . .
  Complex-Sentence → (Sentence) | [Sentence]
                   | ¬Sentence
                   | Sentence ∧ Sentence
                   | Sentence ∨ Sentence
                   | Sentence → Sentence
                   | Sentence ↔ Sentence

  Operator precedence: ¬, ∧, ∨, →, ↔

(7.1)

Axioms are sentences which are given and cannot be derived from other
sentences.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

160 / 475

Logical Agents: Propositional Logic

Propositional Logic

Semantics of Propositional Logic


The elements of the set T ≜ {0, 1}, also written {False, True} or
{F, T}, are called truth-values.

Let D be a set of atomic formulas/sentences. Then an assignment A
is a mapping A : D → T.
We can extend the mapping A to A′ : E → T, where E ⊇ D is the
set of formulas which can be built using only the atomic formulas in
D, as follows:
  For any atomic formula B_i ∈ D, A′(B_i) ≜ A(B_i).
  A′(¬P) ≜ 1 if A′(P) = 0, and 0 otherwise.
  A′((P ∧ Q)) ≜ 1 if A′(P) = 1 and A′(Q) = 1, and 0 otherwise.
  A′((P ∨ Q)) ≜ 1 if A′(P) = 1 or A′(Q) = 1, and 0 otherwise.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

161 / 475

Propositional Logic

Semantics of Propositional Logic


Truth-Table

The semantic interpretation can be shown by a truth-table.

  A(P)  A(Q)  A′(P → Q)  A′(P ↔ Q)
   1     1        1          1
   1     0        0          0
   0     1        1          0
   0     0        1          1

From now on, the distinction between A and A′ is dropped.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

162 / 475

Logical Agents: Propositional Logic

Propositional Logic

Suitable Assignment, Model, Satisfiability, Validity

If an assignment A is defined for all atomic formulas in a formula F,
then A is called suitable for F.
If A is suitable for F and A(F) = 1, then A is called a model for F
and we write A ⊨ F. Otherwise, we write A ⊭ F.
The set of all models of a formula/sentence F is denoted by M(F).
A formula F is called satisfiable if it has at least one model;
otherwise it is called unsatisfiable or contradictory.
A set of formulas 𝔉 is called satisfiable if there exists an assignment
A which is a model for all F_i ∈ 𝔉.

A formula F is called valid (or a tautology) if every suitable
assignment for F is also a model of F. In this case, we write ⊨ F.
Otherwise, we write ⊭ F.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013


163 / 475

Propositional Logic

Theorem 7.2
A formula F is valid if and only if ¬F is unsatisfiable.

Proof.
F is valid iff every suitable assignment of F is a model of F,
iff every suitable assignment of F (and hence of ¬F) is not a model of ¬F,
iff ¬F has no model, and hence is unsatisfiable.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

164 / 475

Logical Agents: Propositional Logic

Propositional Logic

Wumpus World
[Figure: the 4×4 Wumpus-World grid, with pits and their breezes, the Wumpus and its stenches, the gold, and the START square at (1,1).]
Figure 45: Actions=[Move-Forward, Turn-Left, Turn-Right, Grab, Shoot, Climb],


Percept=[Stench, Breeze, Glitter, Bump, Scream]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

165 / 475

Propositional Logic

Wumpus World

Figure 46: Actions=[Move-Forward, Turn-Left, Turn-Right, Grab, Shoot, Climb],


Percept=[Stench, Breeze, Glitter, Bump, Scream]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

166 / 475

Logical Agents: Propositional Logic

Propositional Logic

Wumpus World

Figure 47: Actions=[Move-Forward, Turn-Left, Turn-Right, Grab, Shoot, Climb],


Percept=[Stench, Breeze, Glitter, Bump, Scream]

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

167 / 475

Propositional Logic

Wumpus World KB
P_{x,y} is true if there is a pit in [x, y].
W_{x,y} is true if there is a Wumpus in [x, y].
B_{x,y} is true if the agent perceives a breeze in [x, y].
S_{x,y} is true if the agent perceives a stench in [x, y].

  R1 : ¬P_{1,1}                                                (7.2)
  R2 : B_{1,1} ↔ (P_{1,2} ∨ P_{2,1})                            (7.3)
  R3 : B_{2,1} ↔ (P_{1,1} ∨ P_{2,2} ∨ P_{3,1})                  (7.4)

We also have percepts:

  R4 : ¬B_{1,1},   R5 : B_{2,1}                                 (7.5)

Query to the KB: Q = P_{1,2} or Q = P_{2,2}.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

168 / 475

Logical Agents: Propositional Logic

Entailment and Inference

Entailment

Definition 7.3 (Entailment)

The formula/sentence F entails the formula/sentence G, written F ⊨ G, iff

  M(F) ⊆ M(G).                                                  (7.6)

We also say that G is a consequence of F.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

169 / 475

Entailment and Inference

Theorem 7.4 (Deduction Theorem)

For any formulas F and G, F ⊨ G iff the formula (F → G) is valid, i.e.
true in all assignments suitable for F and G.

Proof.
Assume F ⊨ G. Let A be an assignment suitable for F and G. Then:
  If A is not a model of F, i.e. A ⊭ F, then A is a model of (F → G)
  (ref. truth-table of implication).
  If A ⊨ F, then as F ⊨ G, A ⊨ G. Hence, A is a model of (F → G).
Thus, A is always a model for (F → G). Hence, (F → G) is valid.

Assume (F → G) is valid. Hence, there does not exist an assignment
A such that A ⊨ F and A ⊭ G.
Hence, all models of F are also models of G, and so F ⊨ G.

K. Pathak (Jacobs University Bremen)

Articial Intelligence


December 5, 2013

170 / 475

Logical Agents: Propositional Logic

Entailment and Inference

Definition 7.5 (Equivalence F ≡ G)

Two formulas F and G are semantically equivalent if for every assignment
A suitable for both F and G, A(F) = A(G).

Remark 7.6 (An equivalent definition of equivalence)

F ≡ G iff F ⊨ G and G ⊨ F.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

171 / 475

Entailment and Inference

Equivalence
Example 7.7
In the following, ∧ and ∨ can be swapped to get new equivalences.

  ¬¬F ≡ F                                                        (7.7)
  F ∧ F ≡ F                           Idempotency                (7.8)
  F ∧ G ≡ G ∧ F                       Commutativity              (7.9)
  (F ∧ G) ∧ H ≡ F ∧ (G ∧ H)           Associativity              (7.10)
  F ∧ (F ∨ G) ≡ F                     Absorption                 (7.11)
  F ∧ (G ∨ H) ≡ (F ∧ G) ∨ (F ∧ H)     Distributivity             (7.12)
  ¬(F ∧ G) ≡ (¬F) ∨ (¬G)              deMorgan's Law             (7.13)
  P → Q ≡ ¬Q → ¬P                     Contraposition             (7.14)

All of them can be shown by truth-tables.


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

172 / 475

Logical Agents: Propositional Logic

Entailment and Inference

Theorem 7.8 (Proof by Contradiction or the SAT Problem)

For any formulas/sentences F and G, F ⊨ G iff the sentence (F ∧ ¬G) is
unsatisfiable.

Proof.
First, note the equivalence ¬(F → G) ≡ (F ∧ ¬G).
F ⊨ G iff the formula (F → G) is valid.
From Thm. 7.2, (F → G) is valid iff ¬(F → G), i.e. (F ∧ ¬G), is unsatisfiable.
Combining the above two results, F ⊨ G iff the sentence (F ∧ ¬G) is
unsatisfiable.

K. Pathak (Jacobs University Bremen)


Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

173 / 475

Entailment and Inference

Logical Inference

Inference is the use of entailment to draw conclusions. For example,
KB ⊨ Q shows that the formula/sentence Q can be concluded from
(or is a consequence of) what the agent knows.
KB ⊢_i Q denotes that the inference algorithm i derives Q from KB.
An algorithm which derives only entailed sentences is called sound.
An algorithm is complete if it can derive any sentence that is entailed.

The simplest inference is by model-checking: use the Deduction
Theorem 7.4, i.e. show that KB → Q is a tautology. This proves that
KB ⊨ Q.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

174 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Model-Checking Based Inference


Algorithm 21: TT-Entails(KB, Q)
input: KB, a knowledge-base; Q, a query formula/sentence
symbols ← list of proposition-symbols (atomic formulas) in KB and Q ;
return TT-Check(KB, Q, symbols, model = {})

Algorithm 22: TT-Check(KB, Q, symbols, model)
input: KB, a knowledge-base; Q, a sentence; symbols; model
if Empty(symbols) then
  if PL-True(KB, model) then
    return PL-True(Q, model)
  else
    return True // As KB is False, KB → Q is True.
else
  P ← First(symbols); tail ← Rest(symbols) ;
  return TT-Check(KB, Q, tail, model ∪ {P = True}) And
         TT-Check(KB, Q, tail, model ∪ {P = False})
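A brute-force Python sketch of this model-checking idea, enumerating all 2^n assignments; kb and query are assumed to be functions from a model (dict symbol -> bool) to bool:

    from itertools import product

    def tt_entails(kb, query, symbols):
        """Return True iff kb entails query, by enumerating all assignments."""
        for values in product([True, False], repeat=len(symbols)):
            model = dict(zip(symbols, values))
            if kb(model) and not query(model):
                return False              # a model of KB that falsifies Q
        return True

    # Wumpus check using only R2, R4, R5: does the KB entail ~P(1,2)?
    kb = lambda m: (m["B11"] == (m["P12"] or m["P21"])) and (not m["B11"]) and m["B21"]
    print(tt_entails(kb, lambda m: not m["P12"], ["B11", "B21", "P12", "P21"]))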

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

175 / 475

Inference by Model-Checking

Normal Forms
Definition 7.9 (Literal)
If P is an atomic formula, then
  P is a positive literal,
  ¬P is a negative literal.

Definition 7.10 (Conjunctive Normal Form (CNF))

If L_{i,j} are literals, then a formula F is in CNF if it is of the form

  F = ⋀_{i=1}^{n} ( ⋁_{j=1}^{m_i} L_{i,j} ),  each disjunction being Clause_i,   (7.15)

or, in set format,

  { {L_{1,1}, . . . , L_{1,m_1}}, . . . , {L_{n,1}, . . . , L_{n,m_n}} }.         (7.16)

176 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Conjunctive Normal Form (CNF): Conversion Procedure

Given a formula F:
1. Substitute in F every occurrence of a subformula of the form
     ¬¬G by G,
     ¬(G ∧ H) by (¬G ∨ ¬H),
     ¬(G ∨ H) by (¬G ∧ ¬H),
   until no such subformulas occur.
2. Substitute in F every occurrence of a subformula of the form
     K ∨ (G ∧ H) by (K ∨ G) ∧ (K ∨ H),
     (G ∧ H) ∨ K by (G ∨ K) ∧ (H ∨ K),
   until no such subformulas occur.

Example 7.11 (CNF)

  A ↔ B ≡ (¬A ∨ B) ∧ (¬B ∨ A)
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

177 / 475

Inference by Model-Checking

Davis-Putnam-Logemann-Loveland (DPLL) Algorithm


TT-Entails can be made more efficient. Entailment is now to be
inferred by solving the SAT problem (Thm. 7.8): KB ⊨ Q iff the sentence
(KB ∧ ¬Q) is unsatisfiable. DPLL has 3 improvements over
TT-Entails:
Early Termination: A clause is true if any literal is true. A sentence
(in CNF) is false if any clause is false, which occurs when all its
literals are false. Sometimes a partial model suffices. Example:
(A ∨ B) ∧ (C ∨ A) is true if A = True.
Pure Symbol: A symbol which occurs with the same sign in all
clauses. Example: in (A ∨ ¬B), (¬B ∨ ¬C), (C ∨ A), A is pure positive and
B is pure negative. If a sentence has a model, then it has a model with
pure (atomic) symbols assigned so as to make their literal true: this
never makes a clause false.
Unit Clause is a single-literal clause or a clause where all literals but
one have already been assigned False by the model. To make a
unit clause true, the appropriate truth-value for the sole literal can be
chosen. Example: (C ∨ ¬B) with B already assigned True leaves the unit
clause C. Assigning a unit clause may create another: unit propagation.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

178 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Algorithm 23: DPLL-Satisfiable(s)

input: s, a sentence in propositional logic
clauses ← the set of clauses in the CNF of s ;
symbols ← a list of propositional symbols (atomic formulas) in s ;
return DPLL(clauses, symbols, { })

Algorithm 24: DPLL(clauses, symbols, model)
if every clause in clauses is true in model then return True ;
if some clause in clauses is false in model then return False ;
P, v ← Find-Pure-Symbol(symbols, clauses, model) ;
if P ≠ null then return DPLL(clauses, symbols − P, model ∪ { P = v }) ;
P, v ← Find-Unit-Clause(clauses, model) ;
if P ≠ null then return DPLL(clauses, symbols − P, model ∪ { P = v }) ;
P ← First(symbols); rest ← Rest(symbols) ;
return DPLL(clauses, rest, model ∪ { P = True }) Or DPLL(clauses,
rest, model ∪ { P = False })
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

179 / 475

Inference by Model-Checking

Literals

Definition 7.12 (Complement Literal)

The complement L̄ of a literal L is defined by

  L̄ = ¬A if L = A,   and   L̄ = A if L = ¬A.                      (7.17)

December 5, 2013

180 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Early termination restated

An assignment A (possibly partial) satises a clause if it assigns 1 to


at least one of its literals.
An assignment A (possibly partial) satises a CNF formula F , if it
satises each of its clauses.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

181 / 475

Inference by Model-Checking

Residual formula F|ℓ

Let F be a CNF formula containing a literal ℓ. Then F|ℓ denotes the
residual formula obtained by applying the partial assignment A(ℓ) = 1 to
F. This is done by:
  Removing all clauses containing ℓ, as they are satisfied by the
  assignment.
  Deleting the complement literal of ℓ from all clauses containing it. Why?
If, after this deletion, a clause becomes empty, what does it signify?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

182 / 475

Logical Agents: Propositional Logic

Inference by Model-Checking

Another version of DPLL


Ouyang, How good are branching rules in DPLL? Discrete Applied Mathematics, 1998.

Algorithm 25: DPLL(F)

input: F, a formula in CNF
while F includes a clause of length at most 1 do
  if F includes an empty clause then return Unsatisfiable;
  if F includes a unit clause {ℓ} then
    F ← F|ℓ
while F includes a pure (monotone) literal ℓ do
  F ← F|ℓ

if F is empty then return Satisfiable;

Choose a literal u in F ;
if DPLL(F|u) = Satisfiable then return Satisfiable;
if DPLL(F|ū) = Satisfiable then return Satisfiable;

return Unsatisfiable
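A Python sketch of this clause/residual version of DPLL; clauses are frozensets of signed integer literals (+v / -v), a representation we choose for brevity:

    def reduce_cnf(clauses, lit):
        """The residual formula F|lit."""
        out = []
        for c in clauses:
            if lit in c:
                continue                   # clause satisfied, remove it
            out.append(c - {-lit})         # the complement literal can no longer help
        return out

    def dpll(clauses):
        """Return True iff the clause set is satisfiable."""
        while True:
            if any(len(c) == 0 for c in clauses):
                return False               # empty clause: this branch is unsatisfiable
            units = [next(iter(c)) for c in clauses if len(c) == 1]
            if not units:
                break
            clauses = reduce_cnf(clauses, units[0])   # unit propagation
        if not clauses:
            return True                    # no clauses left: satisfiable
        lits = {l for c in clauses for l in c}
        pures = [l for l in lits if -l not in lits]
        if pures:
            return dpll(reduce_cnf(clauses, pures[0]))
        u = next(iter(lits))               # branch on some literal
        return dpll(reduce_cnf(clauses, u)) or dpll(reduce_cnf(clauses, -u))

    print(dpll([frozenset({1, 2}), frozenset({-1, 3}), frozenset({-3})]))   # True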
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

183 / 475

Inference by Model-Checking

WalkSAT
A local search algorithm for satisability

Algorithm 26: WalkSAT

input : C, a set of propositional clauses; p, probability of a random-walk step;
        N, maximum flips allowed
output: a satisfying model or failure
model ← a random assignment of True/False to the symbols in C ;
for i = 1 to N do
  if model satisfies C then return model ;
  clause ← a randomly selected clause in C that is false in model ;
  if sample true with probability p then
    flip the value in model of a randomly selected symbol from clause ;
  else
    flip whichever symbol in clause maximizes the number of satisfied clauses
return failure
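A Python sketch of WalkSAT on the same signed-integer clause representation used in the DPLL sketch (a positive literal v is true iff model[v] is True):

    import random

    def walksat(clauses, p=0.5, max_flips=10000):
        """Return a satisfying model (dict var -> bool) or None after max_flips."""
        variables = {abs(l) for c in clauses for l in c}
        model = {v: random.choice([True, False]) for v in variables}
        sat = lambda c: any(model[abs(l)] == (l > 0) for l in c)
        for _ in range(max_flips):
            unsatisfied = [c for c in clauses if not sat(c)]
            if not unsatisfied:
                return model
            clause = random.choice(unsatisfied)
            if random.random() < p:
                v = abs(random.choice(list(clause)))    # random-walk step
            else:
                def score(v):                           # satisfied clauses after flipping v
                    model[v] = not model[v]
                    n = sum(1 for c in clauses if sat(c))
                    model[v] = not model[v]
                    return n
                v = max((abs(l) for l in clause), key=score)
            model[v] = not model[v]
        return None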

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

184 / 475

Logical Agents: Propositional Logic

Inference by Theorem Proving

Propositional Theorem Proving


Use inference rules and equivalence-relationships to produce a chain of
conclusions which leads to the desired goal sentence.

Inference Rules

  Modus Ponens:     from P → Q and P, infer Q, i.e. ((P → Q) ∧ P) ⊨ Q.   (7.18)
  And-Elimination:  from P ∧ Q, infer P, i.e. (P ∧ Q) ⊨ P.               (7.19)

Monotonicity of Logical Systems

The set of entailed sentences can only increase as information is added to
the KB: if KB ⊨ P, then KB ∧ R ⊨ P for a new sentence R. This assumes,
of course, that KB ∧ R is still satisfiable.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

185 / 475

Inference by Theorem Proving

Search for Proving


Initial State: The initial KB.
Action: All inference rules applied to all the sentences that match the
top half of the inference rule.
Result: The application of an action results in adding the sentence in
the bottom half of the inference rule to the KB.
Goal: The sentence were trying to prove.
The search can be performed by e.g. IDS. This search is sound, but is it
complete?

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

186 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Resolution

Resolution is an inference procedure to prove unsatisability of a set of


clauses (equivalently a formula in CNF). It is:
Sound, i.e. correct.
Complete, when combined with a complete search-algorithm.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

187 / 475

Inference by Resolution

General Resolution

Definition 7.13 (Resolvent)

Let C1, C2, R be clauses. Then R is called a resolvent of C1 and C2,
written R = C1 ⊗ C2, if there is a literal L with L ∈ C1 and L̄ ∈ C2, and

  R = (C1 − {L}) ∪ (C2 − {L̄}).                                    (7.20)

Example 7.14
C1 = {A, ¬B, C, D} and C2 = {C, B, D, E, F}; resolving on B gives
R = {A, C, D, E, F}
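A Python sketch of Definition 7.13 in the set format, with literals as signed integers; it returns every resolvent of two clauses:

    def resolvents(c1, c2):
        """All clauses obtainable by resolving c1 with c2 on one complementary pair."""
        out = []
        for lit in c1:
            if -lit in c2:
                out.append(frozenset((c1 - {lit}) | (c2 - {-lit})))
        return out

    # Example 7.14 with A..F numbered 1..6: C1 = {A, -B, C, D}, C2 = {C, B, D, E, F}
    print(resolvents(frozenset({1, -2, 3, 4}), frozenset({3, 2, 4, 5, 6})))
    # the single resolvent corresponds to {A, C, D, E, F}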

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

188 / 475

Logical Agents: Propositional Logic

Inference by Resolution

General Resolution
Lemma 7.15 (Resolution Lemma)
Let F be a formula in CNF in set format. Let R be the resolvent of two
clauses C1 and C2 in F. Then F and F ∪ {R} are equivalent (Def. 7.5).

Proof.
Let A be an assignment suitable for F (and hence also for F ∪ {R}).
If A ⊨ F ∪ {R}, then clearly A ⊨ F.

Suppose A ⊨ F, i.e. A ⊨ Ci for all clauses Ci ∈ F. Let
R = (C1 − {L}) ∪ (C2 − {L̄}) for L ∈ C1 and L̄ ∈ C2.

  Case A ⊨ L: As A ⊨ C2 and A ⊭ L̄, we have A ⊨ (C2 − {L̄}).
  Hence A ⊨ R.
  Case A ⊭ L: As A ⊨ C1, we have A ⊨ (C1 − {L}). Hence A ⊨ R.
So in both cases A ⊨ R and hence A ⊨ F ∪ {R}.

We have shown that F ∪ {R} ⊨ F and F ⊨ F ∪ {R}. Thus,
F ∪ {R} ≡ F.

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

189 / 475

Inference by Resolution

Remark 7.16 (Caution)

When resolving (A ∨ B ∨ C) and (¬A ∨ ¬B), the resolvent is either
B ∨ ¬B ∨ C or A ∨ ¬A ∨ C, both of which are equivalent to True. The
resolvent is not C: resolution acts on exactly one complementary pair of
literals at a time. You can validate that
(A ∨ B ∨ C) ∧ (¬A ∨ ¬B) is not equivalent to (A ∨ B ∨ C) ∧ (¬A ∨ ¬B) ∧ C
by a truth-table.

Definition 7.17 (Resolution Closure RC(S))

The set of all clauses derivable by repeated application of resolution to a
set of clauses S.

Definition 7.18 (Resolution Applied to Contradicting Clauses)

An empty clause □ results from applying resolution to contradicting
clauses, e.g. to A and ¬A.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

190 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Explicit Model Construction for S when □ ∉ RC(S)

Algorithm 27: ModelConstruction

input : S, a set of propositional clauses;
        RC(S), the set of clauses which is the resolution closure of S and
        does not contain the empty clause □.
output: An assignment A s.t. A ⊨ RC(S).
[P1, . . . , Pn] ← the list of all symbols (atomic formulas) in S ;
for i = 1 to n do
  Find a clause C ∈ RC(S) s.t. C = (¬Pi ∨ False ∨ . . . ∨ False) after
  substituting the values of P1 . . . P_{i−1} assigned in previous iterations ;
  if such a clause C is found then
    A(Pi) ← False ;
  else
    A(Pi) ← True ;
return the constructed model A.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

191 / 475

Inference by Resolution

A Worked-out Example
Let the sentence S consist of the following clauses:
  C1 = (¬X ∨ ¬Y ∨ ¬Z), C2 = (Z ∨ ¬R ∨ S), C3 = (¬S ∨ ¬T).

Its closure RC(S) contains, in addition to S (using ⊗ to denote
resolution):
  C4 = C1 ⊗ C2 = (¬X ∨ ¬Y ∨ ¬R ∨ S),
  C5 = C2 ⊗ C3 = (Z ∨ ¬R ∨ ¬T), and
  C6 = C1 ⊗ C5 = (¬X ∨ ¬Y ∨ ¬R ∨ ¬T).

We now trace Algo. 27 for [P1, . . . , Pn] = [X, Y, R, T, S, Z]. In
the for-loop over i, the following selections are made:
  i = 1 : X = True.
  i = 2 : Y = True.
  i = 3 : R = True.
  i = 4 : T = False, due to C6 under the previous assignments.
  i = 5 : S = True. Examine C3 and C4 carefully.
  i = 6 : Z = False, due to C1 under the previous assignments.

It can be verified that, under this assignment, all clauses of RC(S)
are True.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

192 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Theorem 7.19
Algo. ModelConstruction produces a valid model for S if □ ∉ RC(S).

Proof.
We prove this by contradiction. Assume that at some iteration i = k of the
for-loop in Algo. 27, the assignment to Pk causes a clause C of RC(S) to
become False for the first time.
For this to occur, C = (False ∨ . . . ∨ False ∨ Pk) or
C = (False ∨ . . . ∨ False ∨ ¬Pk). If only one of these two is present in
RC(S), then the assignment rule chooses the appropriate value for Pk
to make A(C) = True.
The problem occurs if both are in RC(S). But in this case, their
resolvent (False ∨ . . . ∨ False) also has to be in RC(S), which means
that the resolvent is already False under the assignment of P1, . . . , P_{k−1}.
This contradicts our assumption that the first falsified clause appears
at stage k.
Thus, the construction never falsifies a clause in RC(S). It produces
a valid model for RC(S) and, in particular, for S.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

193 / 475

Inference by Resolution

Theorem 7.20 (Ground Resolution Theorem)

If a set of clauses is unsatisfiable, then the resolution closure of these
clauses contains the empty clause □.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

194 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Proof of the Ground Resolution Theorem

Proof.
Proof by contraposition: we prove that if the closure RC(S) does not
contain the empty clause □, then S is satisfiable.
If □ ∉ RC(S), we already proved that a model A ⊨ S can be
constructed recursively using Algo. 27 (ModelConstruction).
Therefore, if the closure RC(S) does not contain the empty clause □,
then S is satisfiable. This proves the contraposition, namely the
Ground Resolution Theorem.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

195 / 475

Inference by Resolution

A Resolution Algorithm for Inferring Entailment


Algorithm 28: PL-Resolution(KB, Q)
input : KB, a knowledge-base; Q, a query sentence;
output: whether KB ⊨ Q
clauses ← the set of clauses in the CNF of (KB ∧ ¬Q) ;
new ← {} ;
while True do
  for every resolvable clause-pair Ci, Cj ∈ clauses do
    resolvents ← Resolve(Ci, Cj) ;
    if □ ∈ resolvents then return True ;
    new ← new ∪ resolvents
  if new ⊆ clauses then return False // resolution closure reached ;
  clauses ← clauses ∪ new

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

196 / 475

Logical Agents: Propositional Logic

Inference by Resolution

Wumpus World

[Figure: the resolution steps applied to the clauses of the CNF below, ending in the unit clause ¬P_{1,2}.]

Figure 48: CNF of B_{1,1} ↔ (P_{1,2} ∨ P_{2,1}) along with the observation ¬B_{1,1}. It
entails the query Q : ¬P_{1,2}

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

197 / 475

Inference with Denite Clauses

Special kinds of KB

Remark 7.21
Resolution is the most powerful algorithm for showing entailment for
a general KB .
The SAT problem is in general NP complete.
However, for some special, less general cases, we can make the
algorithm more ecient.
Two special algorithms are: HornSAT and 2SAT, which are applicable
to a KB consisting of a specic kind of clauses only.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

198 / 475

Logical Agents: Propositional Logic

Inference with Denite Clauses

HornSAT
Definition 7.22 (Definite Clause)
A clause with exactly one positive literal. Every definite clause can be written
as an implication. Example: (A ∨ ¬B ∨ ¬C) ≡ (B ∧ C → A).

Definition 7.23 (Horn Clause)

A clause with at most one positive literal. This includes definite clauses.
Horn clauses are closed under resolution. Why?
Horn clauses with no positive literals are called goal clauses.
Inference with Horn clauses can be done with forward or backward
chaining. Deciding entailment with Horn clauses can be done in time
that is linear in the size of the KB!

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

199 / 475

Inference with Denite Clauses

Forward-Chaining with Denite Clauses


Algorithm 29: PL-FC-Entails(KB, Q)
N[C] for a clause C is initialized to the number of symbols in C's premise ;
the assignment A(S) is initially False for all symbols S in the KB ;
g ← a queue of symbols, initially those known to be True in KB ;
while g ≠ ∅ do
  X ← Pop(g) ;
  if X = Q then return True ;
  if A(X) = False then
    A(X) ← True ;
    for C ∈ KB where X is in the premise of C do
      decrement N[C] ;
      if N[C] = 0 then add the conclusion of C to g
return False
The algorithm begins with the known facts (positive literals) and determines whether a single
propositional symbol, the query, is entailed by a KB of definite clauses.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

200 / 475

Logical Agents: Propositional Logic

Inference with Denite Clauses

PL-FC-Entails is complete
Every entailed atomic sentence is derived. Consider the final state of
A when the algorithm reaches a fixed point, i.e. no further inferences
are possible; in other words, g = ∅.
Claim: A can be viewed as a model of the KB: every definite clause
in the KB is True in this A.
To prove this, assume the opposite, i.e. some clause a1 ∧ . . . ∧ an → b is
False in the model. Then the premise must be True and the
conclusion b False in the model, i.e. A(b) = False.
As the premise is True in A, b must have been added to g and hence
at some point (when b was popped from g) assigned True by the
algorithm. This is a contradiction. Therefore, A ⊨ KB.

Any atomic sentence q that is entailed by the KB must be True in all
its models, and in particular in A. Hence every entailed atomic
sentence is derived by the algorithm.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

201 / 475

Inference with Denite Clauses

Example 7.24 (For PL-FC-Entails)

  P → Q
  L ∧ M → P
  B ∧ L → M
  A ∧ P → L
  A ∧ B → L
  A
  B
(7.21)
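A Python sketch of PL-FC-Entails applied to the definite-clause KB (7.21); clauses are represented as (premise-list, conclusion) pairs, a convention we choose here:

    from collections import deque

    def pl_fc_entails(clauses, facts, q):
        """Forward chaining for definite clauses: does the KB entail symbol q?"""
        count = {i: len(prem) for i, (prem, _) in enumerate(clauses)}   # unsatisfied premises
        inferred = set()
        agenda = deque(facts)
        while agenda:
            x = agenda.popleft()
            if x == q:
                return True
            if x in inferred:
                continue
            inferred.add(x)
            for i, (prem, concl) in enumerate(clauses):
                if x in prem:
                    count[i] -= 1
                    if count[i] == 0:
                        agenda.append(concl)
        return False

    kb = [(["P"], "Q"), (["L", "M"], "P"), (["B", "L"], "M"),
          (["A", "P"], "L"), (["A", "B"], "L")]
    print(pl_fc_entails(kb, ["A", "B"], "Q"))   # True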

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

202 / 475

Logical Agents: Propositional Logic

2SAT

2SAT
Applicable to KBs consisting of 2-CNFs

Figure 49: 2SAT KB as an implication graph. Credits: Wikipedia

For each clause A ∨ B, introduce the implications ¬A → B and ¬B → A as
edges in the digraph G(V, E).
The formula F is now a conjunction of such 2-literal clauses.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

203 / 475

2SAT

Reminder: DFS as dened in Cormen et al (CLRS)

Figure 50: The main DFS loop which


makes sure that the full graph is visited.

K. Pathak (Jacobs University Bremen)

Figure 51: The recursive DFS-Visit


function

Articial Intelligence

December 5, 2013

204 / 475

Logical Agents: Propositional Logic

2SAT

Kosarajus algorithm: CLRS version


Uses the DFS version from Figs. 50 and 51.

Figure 52: SCC: the pseudocode from CLRS.

G T (V , E T ) is the graph obtained by reversing the directions of all


edges in the digraph G (V , E ).
The main loop of DFS in line 3 refers to the algorithm shown in
Fig. 50 (lines 57).
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

205 / 475

2SAT

Let f(v) denote the component number of the SCC containing the vertex
v. f(v) is chosen such that it topologically sorts the condensation of
the graph G (i.e. the DAG made out of supervertices, where each
supervertex is an SCC; see the right subfigure below):

  ∀ u, v ∈ V : u ⟿ v ⟹ f(u) ≤ f(v).                              (7.22)

[Figure: a directed graph G (left) and its condensation into supervertices C1, C2, C3, C4 (right).]

Figure 53: A directed graph G and its condensation. The subscripts i for each
component Ci in the right figure have been chosen to be the same as f(v),
v ∈ Ci. Thus C1, C2, C3, C4 is a topological order.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

206 / 475

Logical Agents: Propositional Logic

2SAT

2SAT

Theorem 7.25
If ∃ v ∈ V such that f(v) = f(¬v), then F is unsatisfiable.

Proof.
Since v and ¬v lie in the same strongly connected component,
  v ⟿ ¬v   and   ¬v ⟿ v.
No truth-value can be assigned to v such that both implications are
satisfied. Hence, F is unsatisfiable.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

207 / 475

2SAT

2SAT: Finding a model


Lemma 7.26
If ∀ u ∈ V : f(u) ≠ f(¬u), then we can find a model A of F by setting

  A(u) = 1 if f(u) > f(¬u),
  A(u) = 0 if f(u) < f(¬u).                                       (7.23)

Proof.
Proof by contradiction: assume that F becomes False under this
assignment; this means some clause A ∨ B evaluates to False, i.e. both
A and B are False.

Therefore, f(A) < f(¬A) and f(B) < f(¬B). Why?

The clause A ∨ B contributes the edges ¬A → B and ¬B → A.
Therefore, f(¬A) ≤ f(B) and f(¬B) ≤ f(A).

Combining all inequalities:
  f(A) < f(¬A) ≤ f(B) < f(¬B) ≤ f(A),  a contradiction!


K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

208 / 475

Logical Agents: Propositional Logic

2SAT

Runtime of 2SAT

For n 2-clauses, the implication graph has 2n edges and at most 4n
vertices.
The algorithm for finding the SCCs runs in Θ(n).
The overall runtime for 2SAT is therefore Θ(n).
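A Python sketch of the whole 2SAT procedure, assuming the networkx library is available for the SCC, condensation and topological-sort steps; clauses are pairs of signed integers:

    import networkx as nx

    def two_sat(clauses):
        """Return a satisfying assignment {var: bool}, or None if unsatisfiable."""
        g = nx.DiGraph()
        for a, b in clauses:            # clause (a OR b) contributes -a -> b and -b -> a
            g.add_edge(-a, b)
            g.add_edge(-b, a)
        sccs = list(nx.strongly_connected_components(g))
        comp = {v: i for i, scc in enumerate(sccs) for v in scc}
        cond = nx.condensation(g, sccs)                 # DAG of supervertices
        f = {c: pos for pos, c in enumerate(nx.topological_sort(cond))}   # f as in (7.22)
        model = {}
        for v in g.nodes():
            if v > 0:
                if comp[v] == comp[-v]:
                    return None         # v and -v in the same SCC (Thm. 7.25)
                model[v] = f[comp[v]] > f[comp[-v]]     # assignment rule of Lemma 7.26
        return model

    print(two_sat([(1, 2), (-1, 3), (-2, -3), (-1, -2)]))   # prints one satisfying assignment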

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

209 / 475

December 5, 2013

210 / 475

2SAT

Applications of 2SAT
http://en.wikipedia.org/wiki/2-satisfiability

Conict-free placement of geometrical objects.


Data-clustering
Scheduling

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

Agents based on Propositional Logic

Agents based on Propositional Logic


Wumpus Problem Revisited

We have a large collection of rules, written for each square:

  B_{1,1} ↔ (P_{1,2} ∨ P_{2,1})
  S_{1,1} ↔ (W_{1,2} ∨ W_{2,1})  . . .

Initial conditions: ¬P_{1,1}, ¬W_{1,1}, . . .

At least one Wumpus: W_{1,1} ∨ W_{1,2} ∨ . . . ∨ W_{4,4}.
At most one Wumpus: (¬W_{1,1} ∨ ¬W_{1,2}), . . . , (¬W_{4,3} ∨ ¬W_{4,4})

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

211 / 475

Agents based on Propositional Logic

Fluents
Aspects of the world / the agent's state which change with time should have a
time index attached to the name.
All percepts: Stench^3, Stench^4, Breeze^5.
There can be location fluents, e.g. L^t_{x,y}: the agent is in square (x, y)
at time step t.
Other properties: FacingEast^0, HaveArrow^0, WumpusAlive^0.
Percepts can be connected to the properties of the squares where
they were experienced:
  L^t_{x,y} → (Breeze^t ↔ B_{x,y})
  L^t_{x,y} → (Stench^t ↔ S_{x,y})

Actions: Forward^0, TurnRight^1, Shoot^7, Grab^8, Climb^10

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

212 / 475

Logical Agents: Propositional Logic

Agents based on Propositional Logic

Effect Axioms
Transition model: to be written for all 16 squares, for all 4
orientations, and for all actions.
  L^0_{1,1} ∧ FacingEast^0 ∧ Forward^0 → (L^1_{2,1} ∧ ¬L^1_{1,1})
If the agent takes this action, then Ask(KB, L^1_{2,1}) returns True.
Frame problem: each effect axiom has to state what remains
unchanged as a result of the action.
  Forward^t → (HaveArrow^t ↔ HaveArrow^{t+1})
  Forward^t → (WumpusAlive^t ↔ WumpusAlive^{t+1})

There is a proliferation of frame axioms (the representational frame
problem).

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

213 / 475

Agents based on Propositional Logic

Successor State Axiom

Instead of focusing on actions, specify how fluents change:

  F^{t+1} ↔ ActionCausesF^t ∨ (F^t ∧ ¬ActionCausesNotF^t)         (7.24)

  HaveArrow^{t+1} ↔ (HaveArrow^t ∧ ¬Shoot^t)

  L^{t+1}_{1,1} ↔ (L^t_{1,1} ∧ (¬Forward^t ∨ Bump^{t+1}))
               ∨ (L^t_{1,2} ∧ (South^t ∧ Forward^t))
               ∨ (L^t_{2,1} ∧ (West^t ∧ Forward^t))

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

214 / 475

Logical Agents: Propositional Logic

Agents based on Propositional Logic

Sequence of Percepts and Actions I


Stench 0 Breeze 0 Glitter 0 Bump 0 Scream 0 ; Forward 0

Stench 1 Breeze 1 Glitter 1 Bump 1 Scream 1 ; TurnRight 1

Stench 2 Breeze 2 Glitter 2 Bump 2 Scream 2 ; TurnRight 2


Stench 3 Breeze 3 Glitter 3 Bump 3 Scream 3 ; Forward 3

Stench 4 Breeze 4 Glitter 4 Bump 4 Scream 4 ; TurnRight 4


Stench 5 Breeze 5 Glitter 5 Bump 5 Scream 5 ; Forward 5
Stench 6 Breeze 6 Glitter 6 Bump 6 Scream 6 .

  OK^t_{x,y} ↔ ¬P_{x,y} ∧ ¬(W_{x,y} ∧ WumpusAlive^t)
x,y
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

(7.25)

(7.26)

215 / 475

Agents based on Propositional Logic

Sequence of Percepts and Actions II



Figure 54: Ask(KB, L^6_{1,2}), Ask(KB, W_{1,3}), Ask(KB, P_{3,1})

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

216 / 475

Logical Agents: Propositional Logic

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

K. Pathak (Jacobs University Bremen)

Agents based on Propositional Logic

December 5, 2013

217 / 475

Agents based on Propositional Logic

Articial Intelligence

December 5, 2013

218 / 475

Logical Agents: Propositional Logic

Time out from Logic

Time out: illogical logical sentences of Yogi Berra

"It ain't over till it's over."
"Nobody goes there anymore; it's too crowded."
"It was impossible to get a conversation going; everybody was talking too much."
"In theory there is no difference between theory and practice; in practice there is."
"I didn't really say everything I said."

Figure 55: Photo


credits: wordpress

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Logical Agents: Propositional Logic

December 5, 2013

219 / 475

Time out from Logic

Formal languages
Ontological and Epistemological Commitments

  Language            | Ontological Commitment              | Epistemological Commitment
                      | (what exists in the world)          | (what an agent believes about facts)
  Propositional Logic | Facts                               | True / False / Unknown
  First-order Logic   | Facts, objects, relations           | True / False / Unknown
  Probability Theory  | Facts                               | Degree of belief ∈ [0, 1]
  Fuzzy Logic         | Facts with degree of truth ∈ [0, 1] | Known interval value

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

220 / 475

Probability Calculus

Contents

Probability Calculus
Limitations of Logic
Probability Calculus
Conditional Probabilities
Inference using a Joint Probability Distribution
Conditional Independence

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

221 / 475

Limitations of Logic

Limitations of Logical Agents


Agents need to handle uncertainty due to partial observability (limited
sensing capabilities), nondeterminism, sensing noise, etc.
Example, medical diagnosis:
  Toothache → Cavity ∨ GumProblem ∨ Abscess ∨ . . . ,
  Cavity → Toothache.
The use of logic in a domain like medical diagnosis fails because of:
  Laziness: It is too hard to list all premises and conclusions needed for
  an exceptionless rule, and hard to use such rules.
  Theoretical Ignorance: A complete theory of the domain is
  unavailable.
  Practical Ignorance: Not all test results may be available. Not all
  symptoms may manifest.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

222 / 475

Probability Calculus

Limitations of Logic

Belief State & Degree of Belief

Denition 8.1 (Belief State)


It is the agents current belief about the its own state or about the
relevant states of the environment, given:
its prior knowledge,
the history of its past actions,
the history of its observed percepts.

It is particularly important for partially-observable and/or


non-deterministic scenarios.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

223 / 475

Limitations of Logic

Belief State & Degree of Belief

For a logical agent based on propositional-logic, the belief state is in


terms of sentences which are true or false.
When information is uncertain, The agents knowledge only provides
a degree of belief in the relevant sentences. The degrees of belief are
represented in probabilities.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

224 / 475

Probability Calculus

Probability Calculus

Quantifying Degree of Belief


Let be the universal-set/space of all possible suitable assignments
for all the relevant propositional sentences/formulas of the world
under consideration.
For each assignment or possible world or elementary event A ,
the agent associates a degree of belief or probability of it occurring:
its value P(A) [0, 1].
All the elementary events A are mutually-exclusive and
exhaustive.
Mutually-exclusive: Events Ai and Aj , i = j cannot both occur
simultaneously.
Exhaustive: = i {Ai }.

Hence, to normalize its degree of belief, the agent chooses


P(A) = 1.

(8.1)

A
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

225 / 475

Probability Calculus

Quantifying Degree of Belief

Consider a propositional formula/sentence F.

  P(F is True) ≜ P(F) ≜ Σ_{A ∈ M(F)} P(A).                        (8.2)

Articial Intelligence

December 5, 2013

226 / 475

Probability Calculus

Probability Calculus

Quantifying Degree of Belief

From this, the following can be derived:

  P(F) = 0 if F is inconsistent,   P(F) = 1 if F is valid.        (8.3)

  P(¬F) = 1 − P(F).                                               (8.4)

  P(F ∨ G) = P(F) + P(G) − P(F ∧ G).  Inclusion-Exclusion Principle  (8.5)
(8.5)

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

227 / 475

Probability Calculus

Random Variables (RV)


Variables in Probability Theory are called Random Variables (RV),
written with a capital first letter (e.g. Weather). Each RV has a
domain, either discrete or continuous. The domain D is the set of
possible values/instantiations (written in small letters) the RV can
take, e.g. D(Weather) = {sunny, rain, cloudy, snow}. The values
are given their natural or a user-defined order.
In AI, discrete RVs are more common. Let F be the propositional
sentence Weather = sunny; then, using (8.2), the following are
equivalent ways of writing:

  P(F) ≡ P((Weather = sunny) = True) ≡ P(Weather = sunny) ≡ P(sunny).   (8.6)

The whole probability mass distribution (pmf) of a discrete RV
(DRV) can be summarized in vector form, e.g.

  P(Weather) ≜ [P(sunny), P(rain), P(cloudy), P(snow)].           (8.7)
228 / 475

Probability Calculus

Probability Calculus

Joint Probability Distributions


Given a DRV A, with domain D(A) = [a1 , a2 , . . . , an ], its cardinality is
|A|

|D(A)| = n

(8.8)

Given DRVs A, B, . . ., the following notations are equivalent


P((A = a) (B = b) . . .) P(A = a, B = b, . . .)
P(a, b, . . .)

(8.9)

All such probabilities can be collected together in a table called the


full joint probability (mass) distribution (JPD) of A, B, . . ..
P(A, B, . . .) is of size |A| |B|

(8.10)

Note that partial joint probability distributions are also possible, e.g.
P(A = ai , B, . . .) P(ai , B, . . .) is of size |B|
K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

(8.11)

229 / 475

Probability Calculus

Vector Probability Distributions

The joint probability distribution (8.9) can also be written as the
distribution of a vector of RVs:

  X = [A, B, . . .]^T,   x = [ai, bj, . . .]^T,   |X| ≜ |A| · |B| · · ·   (8.12)

  P(X) ≜ P(A, B, . . .),                                          (8.13)
  P(x) ≜ P(ai, bj, . . .).                                        (8.14)

Finally, joint probability distributions of vector RVs can be defined:

  P(X, Y, . . .),   P(x, Y, . . .),   P(x, y, . . .).              (8.15)

230 / 475

Probability Calculus

Conditional Probabilities

Conditional Probabilities

Let F be a propositional formula. Suppose the agent initially has the
belief P(F = True), as stated in (8.2):

  P(F) ≜ Σ_{A ∈ M(F)} P(A).

Let G be another propositional formula s.t. M(F) ∩ M(G) ≠ ∅.
Suppose the agent observes that G = True. Thus, the agent now
needs to update its belief about F.

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Probability Calculus

December 5, 2013

231 / 475

Conditional Probabilities

Conditional Probabilities
Since G is now known to be True, we set

  P(A) ← 0, if A ∉ M(G).                                          (8.16)

Effectively, the new universe is Ω′ = M(G), and we require P(G | G) = 1.
To get P(G | G) = 1, the agent needs to renormalize its beliefs.
Earlier, by (8.2),

  P(G) = Σ_{A ∈ M(G)} P(A).                                       (8.17)

Define

  P(A | G) = P(A) / P(G),   A ∈ M(G).                             (8.18)

Thus we retain the unit-summation property of the agent's belief set:

  P(G | G) = Σ_{A ∈ M(G)} P(A | G) = Σ_{A ∈ M(G)} P(A) / P(G) = P(G) / P(G) = 1.

December 5, 2013

232 / 475

Probability Calculus

Conditional Probabilities

Conditional Probabilities
Now we can modify (8.2):

  P(F | G) = Σ_{A ∈ M(F) ∩ M(G)} P(A | G)                         (8.19)
           = (1 / P(G)) Σ_{A ∈ M(F) ∩ M(G)} P(A)   using (8.18)
           = P(F ∧ G) / P(G).                                     (8.20)

The product rule:

  P(F ∧ G) = P(G) P(F | G).                                       (8.21)

If P(F | G) = P(F), F and G are independent. In this case,

  P(F ∧ G) = P(G) P(F).   Independence                            (8.22)

233 / 475

Conditional Probabilities

Product Rule in terms of RVs


Since F, G are arbitrary propositional formulas/sentences, we can generalize (8.21) to

P(F_1 ∧ … ∧ F_n ∧ G_1 ∧ … ∧ G_m) = P(F_1 ∧ … ∧ F_n | G_1 ∧ … ∧ G_m) · P(G_1 ∧ … ∧ G_m).   (8.23)

In terms of RVs, we can summarize the product rule in a table using previously introduced notation. Given RVs X_1, …, X_n, Y_1, …, Y_m, each with possibly different cardinality,

P(X_1, …, X_n, Y_1, …, Y_m) = P(X_1, …, X_n | Y_1, …, Y_m) · P(Y_1, …, Y_m).   (8.24)

This table-formula applies component-wise in the table.
A tabulation of the conditional probabilities P(A = a_i | B = b_j), i = 1…|A|, j = 1…|B|, is called a Conditional Probability Table (CPT).

Law of Total Probability


Let the propositional sentences G_i, i = 1…n, be such that

M(G_i) ∩ M(G_j) = ∅ if i ≠ j,  and  ∪_{i=1}^{n} M(G_i) = Ω.   (8.25)

Therefore, for any propositional formula F, its model-set can be partitioned as follows:

M(F) = ∪_{i=1}^{n} ( M(G_i) ∩ M(F) ).   (8.26)

On substituting the above in (8.2),

P(F) = Σ_{i=1}^{n} Σ_{A ∈ M(G_i) ∩ M(F)} P(A) = Σ_{i=1}^{n} P(F ∧ G_i)   (8.27)
     = Σ_{i=1}^{n} P(F | G_i) · P(G_i),  using (8.21).   (8.28)

Law of Total Probability


In terms of RVs, this is called marginalization:

P(A = a_j) = Σ_{i=1}^{|B|} P(A = a_j, B = b_i)   (8.29)
           = Σ_{i=1}^{|B|} P(A = a_j | B = b_i) · P(B = b_i).   (8.30)

Interpret the following carefully using our notation:

P(A) = Σ_{i=1}^{|B|} P(A, b_i) = Σ_{i=1}^{|B|} P(A | b_i) P(b_i),   (8.31)
P(X) = Σ_{i=1}^{|Y|} P(X, y_i) = Σ_{i=1}^{|Y|} P(X | y_i) P(y_i).   (8.32)

The LHS of the above equations is called the marginal probability, computed from the joint or conditional probability distributions respectively.
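
As a quick illustration of (8.29)-(8.32), the following minimal Python sketch marginalizes a discrete joint distribution stored as a dictionary; the toy table and its numbers are made up purely for illustration.

# Minimal sketch of marginalization (8.29): P(A) = sum_b P(A, b).
P_AB = {
    ('sunny', 'hot'): 0.40, ('sunny', 'cold'): 0.10,
    ('rain',  'hot'): 0.05, ('rain',  'cold'): 0.45,
}

def marginalize(joint, axis):
    """Sum out all variables except the one at position `axis`."""
    marg = {}
    for assignment, p in joint.items():
        key = assignment[axis]
        marg[key] = marg.get(key, 0.0) + p
    return marg

P_A = marginalize(P_AB, axis=0)   # {'sunny': 0.5, 'rain': 0.5}
P_B = marginalize(P_AB, axis=1)   # {'hot': 0.45, 'cold': 0.55}
print(P_A, P_B)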

Expectation of a DRV
Let X be a DRV with domain [x_1, x_2, …, x_n], where the x_i are ordered and have numerical values. Then the expectation of X is

E[X] = Σ_{i=1}^{n} x_i P(x_i).   (8.33)

This can also be generalized to a vector DRV X with domain [x_1, x_2, …, x_n]:

E[X] = Σ_{i=1}^{n} x_i P(x_i).   (8.34)

Theorem 8.2 (Linearity of Expectation)


For any two DRVs X and Y with ordered numerical domains, and a, b ∈ ℝ,

E[aX + bY] = a E[X] + b E[Y].   (8.35)

This holds even if X and Y are not independent, i.e. even if P(x, y) ≠ P(x) P(y) in general. Independence of RVs will be covered later.

Proof.

E[aX + bY] = Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} (a x_i + b y_j) P(x_i, y_j)
           = a Σ_{i=1}^{|X|} x_i Σ_{j=1}^{|Y|} P(x_i, y_j) + b Σ_{j=1}^{|Y|} y_j Σ_{i=1}^{|X|} P(x_i, y_j)
           = a Σ_{i=1}^{|X|} x_i P(x_i) + b Σ_{j=1}^{|Y|} y_j P(y_j) = a E[X] + b E[Y],

where the inner sums were reduced by marginalization (8.29), as derived in class.

Example: JPD of 5 Object-Classifiers I

Figure 56: Table from: Combining Pattern Classifiers by Ludmila I. Kuncheva, 2004. We are interested in classifying objects of interest into two class-types: T = 1 or T = 2. We have 5 different classifiers C_i, i = 1…5, to do this task. We show each of them 300 samples of objects of class-type T = 1. The 5 classifiers have various degrees of agreement about the deduced class, which is summarized in the table, e.g. the string 11212 means that C1 classified the object as T = 1, C2 as T = 1, C3 as T = 2, C4 as T = 1, and C5 as T = 2. The probabilities can be computed by dividing the frequencies (counts) by 300.

Example: JPD of 5 Object-Classifiers II


Then we can compute the following probabilities using the results derived so far:

P(C1 = 1, C2 = 1, C3 = 2, C4 = 1, C5 = 2 | T = 1) = 14/300.

On using marginalization (8.29),

P(C1 = 1, C2 = 1, C3 = 1 | T = 1)
  = Σ_{c4 = 1,2} Σ_{c5 = 1,2} P(C1 = 1, C2 = 1, C3 = 1, C4 = c4, C5 = c5 | T = 1)
  = (5 + 4 + 10 + 8)/300 = 27/300.

Example: JPD of 5 Object-Classifiers III


Using the product rule for RVs (8.24), one can find

P(C4 = 2, C5 = 2 | C1 = 1, C2 = 1, C3 = 1, T = 1)
  = P(C1 = 1, C2 = 1, C3 = 1, C4 = 2, C5 = 2 | T = 1) / P(C1 = 1, C2 = 1, C3 = 1 | T = 1)
  = (8/300) / (27/300) = 8/27.

We can compute the table for the RV C1 with the ordered domain [1, 2] (class-type) by marginalization. This can simply be done by summing the first two columns (where C1 = 1) and the last two columns (where C1 = 2):

P(C1 | T = 1) = [146/300, 154/300].
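
These numbers can be reproduced mechanically from the counts. In the sketch below, only the count 14 for the agreement string 11212, the count 8 for 11122, and the total 27 come from the slide; the split of the remaining counts 5, 4 and 10 over the strings 11111, 11112 and 11121 is an assumption made purely for illustration.

# Sketch: reproduce the classifier-example numbers from counts (out of 300).
counts_T1 = {'11111': 5, '11112': 4, '11121': 10, '11122': 8, '11212': 14}
n = 300

# P(C1=1, C2=1, C3=2, C4=1, C5=2 | T=1)
print(counts_T1['11212'] / n)                       # 14/300

# Marginalization (8.29): P(C1=1, C2=1, C3=1 | T=1)
p = sum(c for s, c in counts_T1.items() if s.startswith('111')) / n
print(p)                                            # 27/300

# Product rule (8.24): P(C4=2, C5=2 | C1=1, C2=1, C3=1, T=1)
print((counts_T1['11122'] / n) / p)                 # 8/27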

Normalization Constraints on CPTs


In terms of RVs, the product rule (8.21) is more often used as

P(A = a_i | B = b_j) = P(A = a_i, B = b_j) / P(B = b_j),   (8.36)
P(A | b_j) = P(A, b_j) / P(b_j).   (8.37)

The following result is very useful in practice:

Σ_{i=1}^{|A|} P(A = a_i | B = b_j) = (1 / P(B = b_j)) Σ_{i=1}^{|A|} P(A = a_i, B = b_j)
                                   = P(B = b_j) / P(B = b_j) = 1,  using (8.29).   (8.38)

This implies that P(A | b) is a valid normalized pmf.

Therefore, (8.37) can also be written as

P(A | b_j) = α P(A, b_j).   (8.39)

The normalization constant α can be used to normalize the whole vector P(A | b_j) without explicitly computing P(B = b_j).

Inference using a Joint Probability Distribution

Answering Queries on Beliefs Based on Evidence


If we divide the set of RVs in a given joint distribution into:
  the query RV X;
  the unobserved (hidden) RVs stacked into the vector RV Y;
  the observed (evidence) RVs stacked into the vector RV E, where the evidence is always instantiated E = e on what was observed,

then the given joint distribution P(X, E, Y) is analogous to a probabilistic Knowledge-Base. We can utilize (8.39) to write

P(X | E = e) = α P(X, E = e)   (8.40)
             = α Σ_{i=1}^{|Y|} P(X, E = e, Y = y_i).   (General Form)   (8.41)
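
A direct reading of (8.40)-(8.41) in code: enumerate the joint, keep the entries consistent with the evidence, sum out the hidden RVs and normalize. The dictionary-based representation and the toy numbers below are illustrative choices, not a prescribed course API.

# Minimal sketch of (8.40)-(8.41): P(X | E=e) = alpha * sum_y P(X, e, y).
def query(joint, X, evidence):
    """Return P(X | evidence) from a full joint distribution given as
    {frozenset of (var, value) pairs: probability}."""
    unnormalized = {}
    for assignment, p in joint.items():
        a = dict(assignment)
        if all(a[var] == val for var, val in evidence.items()):
            x = a[X]
            unnormalized[x] = unnormalized.get(x, 0.0) + p
    alpha = 1.0 / sum(unnormalized.values())
    return {x: alpha * p for x, p in unnormalized.items()}

# Toy joint over two binary RVs (made-up numbers):
joint = {
    frozenset({('A', True), ('B', True)}): 0.3,
    frozenset({('A', True), ('B', False)}): 0.2,
    frozenset({('A', False), ('B', True)}): 0.1,
    frozenset({('A', False), ('B', False)}): 0.4,
}
print(query(joint, 'A', {'B': True}))   # {True: 0.75, False: 0.25}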

Bayes Rule

Due to symmetry, swapping the propositional formulas F and G in (8.21), we can also write

P(F ∧ G) = P(G | F) · P(F).   (8.42)

Combining the RHS of Eqs. (8.21) and (8.42),

Bayes Rule:   P(G | F) = P(F | G) · P(G) / P(F).   (8.43)

Bayes Rule for RVs


In terms of RVs,

P(B = b_j | A = a_i) = P(A = a_i | B = b_j) · P(B = b_j) / P(A = a_i),   (8.44)
P(B | a_i) = α P(B, a_i) = α P(a_i | B) · P(B),  using (8.39).   (8.45)

If we divide the set of RVs in a given joint distribution into the RVs of interest A, B and the observed (evidence) vector RV E, then we can generalize (8.45) to

P(B | A = a_i, E = e) = P(a_i | B, e) · P(B | e) / P(a_i | e)   (8.46)
                      = α P(a_i | B, e) · P(B | e)   (General Form)
                      ∝ P(a_i | B, e) · P(B | e).   (8.47)

Significance of Bayes Rule


Nomenclature

Definition 8.3 (Likelihood): P(data | Hypothesis)
Definition 8.4 (Posterior): P(Hypothesis | data)
Definition 8.5 (Prior): P(Hypothesis)
Definition 8.6 (Evidence): P(data)

P(H | D = d) ∝ P(D = d | H) · P(H).   (8.48)

Example: Explaining Away (Probabilistic OR)

Done in class.

Example: Bayesian Estimation for DRVs


Surprise Candy

A candy manufacturer supplies candies in 5 different kinds of bags, all of which look identical:
h1: 100% cherry;
h2: 75% cherry, 25% lime;
h3: 50% cherry, 50% lime;
h4: 25% cherry, 75% lime;
h5: 100% lime.
The candies and their wrappers also look identical. You are given an unknown bag. You sample (by licking ;)) candies from it with replacement, and with each sample you want to:
  keep track of your belief P(h_i | e_1, …, e_n) in each of the 5 different hypotheses for the kind of bag it is;
  predict whether the next sample will be cherry or lime flavored.


Surprise Candy Hypothesis Update

The manufacturer has given the prior pmf of the different bag-types: P(H) = [P(h1), P(h2), P(h3), P(h4), P(h5)].
Suppose e1 = cherry, e2 = lime. Bayes' theorem gives

P(H | e1) = α [1.0·P(h1), 0.75·P(h2), 0.5·P(h3), 0.25·P(h4), 0.0·P(h5)]^T,
P(H | e1, e2) = α [1.0·0.0·P(h1), 0.75·0.25·P(h2), (0.5)^2·P(h3), 0.25·0.75·P(h4), 0.0·1.0·P(h5)]^T.

As we gather more and more samples, the probability P(h_i | e_1, …, e_n) of the correct bag-type h_i will eventually dominate.


Surprise Candy Prediction

Suppose we now want to predict the outcome of the next sample. We then need to estimate the distribution of the RV E_{n+1} with domain [cherry, lime]:

P(E_{n+1} | e_{1:n}) = Σ_{i=1}^{5} P(E_{n+1}, H = h_i | e_{1:n})
                     = Σ_{i=1}^{5} P(E_{n+1} | h_i, e_{1:n}) · P(h_i | e_{1:n})
                     = Σ_{i=1}^{5} P(E_{n+1} | h_i) · P(h_i | e_{1:n}).   (8.49)
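
The candy example can be checked numerically with a few lines of Python. The prior below is only a placeholder, since the slides leave the manufacturer's prior unspecified; the likelihoods P(cherry | h_i) follow the bag definitions above.

# Sketch of the Surprise Candy update and prediction (8.49).
prior = [0.1, 0.2, 0.4, 0.2, 0.1]            # P(h1)..P(h5), assumed placeholder
p_cherry = [1.0, 0.75, 0.5, 0.25, 0.0]       # P(cherry | h_i)

def update(belief, flavor):
    """One Bayesian step: P(h_i | e_{1:n+1}) is proportional to P(e_{n+1} | h_i) P(h_i | e_{1:n})."""
    lik = p_cherry if flavor == 'cherry' else [1.0 - p for p in p_cherry]
    unnorm = [l * b for l, b in zip(lik, belief)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict(belief, flavor):
    """P(E_{n+1} = flavor | e_{1:n}) = sum_i P(flavor | h_i) P(h_i | e_{1:n}), cf. (8.49)."""
    lik = p_cherry if flavor == 'cherry' else [1.0 - p for p in p_cherry]
    return sum(l * b for l, b in zip(lik, belief))

belief = prior
for e in ['cherry', 'lime']:                 # e1 = cherry, e2 = lime
    belief = update(belief, e)
print(belief)                                # P(H | e1, e2)
print(predict(belief, 'lime'))               # P(E3 = lime | e1, e2)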

Conditional Independence

Independence

The events represented by propositional logic sentences F and G are independent if either P(G) = 0, or

P(F | G) = P(F).   (8.50)

Using the general Bayes rule, one sees that this is a symmetric relationship and hence also holds with F and G swapped.
For independent events, (8.21) reduces to

P(F ∧ G) = P(F) · P(G).   (8.51)

Independence for DRVs

In terms of RVs, A = a_i and B = b_j are independent if either P(B = b_j) = 0, or

P(A = a_i | B = b_j) = P(A = a_i).   (8.52)

If the above holds for all a_i, b_j, we can summarize the independence of A and B as A ⟂ B, which implies

P(A | B) = P(A),  or equivalently  P(A, B) = P(A) · P(B).   (8.53)

Conditional Independence
Independence is a changeable property: two events which were earlier independent can become dependent in light of some evidence. Two events F, G are called conditionally independent given H if

P(F | G ∧ H) = P(F | H),  or  P(G ∧ H) = 0.   (8.54)

Using the general Bayes rule, one sees that this is a symmetric relationship and hence also holds with F and G swapped. Using the product rule, another way to write conditional independence is

P(F ∧ G | H) = P(F | H) · P(G | H),  or  P(H) = 0.   (8.55)
In terms of RVs, A = a_i and B = b_j are conditionally independent given C = c_k if either P(B = b_j, C = c_k) = 0, or

P(A = a_i | B = b_j, C = c_k) = P(A = a_i | C = c_k).   (8.56)

If the above holds for all a_i, b_j, c_k, we say that A and B are conditionally independent given C, and all of the following are equivalent:

P(A | B, C) = P(A | C),   (8.57a)
P(B | A, C) = P(B | C),   (8.57b)
P(A, B | C) = P(A | C) · P(B | C).   (8.57c)

This conditional independence is also written as (A ⟂ B) | C or A ⟂_C B.

Example: The Wumpus maze solved probabilistically


[Figure: the 4×4 Wumpus-world grid partitioned into KNOWN (OK) cells with observed breeze B, the FRINGE cells, the QUERY cell, and the remaining OTHER cells.]

Figure 57: Priors P(P_ij = 1) = P(p_ij) = 0.2; only P(P_11 = 0) = P(¬p_11) = 1. The P_ij are independent binary RVs.

Example: A Converging Connection


We are given a joint distribution between three RVs A, B, C which has the following special form:

P(A, B, C) = P(A) · P(B) · P(C | A, B).   (8.58)

Then,

P(A, B) = Σ_{i=1}^{|C|} P(A, B, c_i) = Σ_{i=1}^{|C|} P(A) · P(B) · P(c_i | A, B)
        = P(A) · P(B) · Σ_{i=1}^{|C|} P(c_i | A, B) = P(A) · P(B),

using (8.58). Hence, A and B are independent.

Example: A Serial Connection

We are given a joint distribution between three RVs A, B, C which has the following special form:

P(A, B, C) = P(A) · P(B | A) · P(C | B).   (8.59)

Then,

P(C | A, B) = P(A, B, C) / P(A, B) = [P(A) · P(B | A) · P(C | B)] / [P(A) · P(B | A)] = P(C | B),

using (8.59). Hence, C and A are conditionally independent given B.

Example: A Diverging Connection

We are given a joint distribution between three RVs A, B, C which has the following special form:

P(A, B, C) = P(B | A) · P(C | A) · P(A).   (8.60)

Then,

P(B | A, C) = P(A, B, C) / P(A, C) = [P(B | A) · P(C | A) · P(A)] / [P(C | A) · P(A)] = P(B | A),

using (8.60). Hence, B and C are conditionally independent given A.


However, if A is not given but C is given, then, in general, P(B | C) ≠ P(B):

P(B | C) = P(B, C) / P(C) = Σ_{i=1}^{|A|} P(a_i, B, C) / P(C)
         = Σ_{i=1}^{|A|} P(B | a_i) · P(C | a_i) · P(a_i) / P(C)
         = [ Σ_{i=1}^{|A|} P(B | a_i) · P(C | a_i) · P(a_i) ] / [ Σ_{i=1}^{|A|} P(C | a_i) · P(a_i) ],   (8.61)

using (8.60). The last equation shows how evidence may be transmitted from an instantiation of C to B implicitly through A. What does it mean if A ≡ Gender, B ≡ Height, C ≡ LengthOfHair?

Chain Rule
This is an application of the product rule (stated in terms of RVs):

P(X_1, …, X_n) = P(X_n | X_1, …, X_{n−1}) · P(X_1, …, X_{n−1}),   (8.62a)
P(X_1, …, X_{n−1}) = P(X_{n−1} | X_1, …, X_{n−2}) · P(X_1, …, X_{n−2}),   (8.62b)
P(X_1, …, X_{n−2}) = P(X_{n−2} | X_1, …, X_{n−3}) · P(X_1, …, X_{n−3}),   (8.62c)
⋮
P(X_2, X_1) = P(X_2 | X_1) · P(X_1).   (8.62d)

Combining the above equations, we get

P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | X_{i−1}, X_{i−2}, …).   (8.63)

This formula will be useful later in formulating the joint distribution of a Bayesian network.

Beginning to Learn using Naïve Bayesian Classifiers

Contents

Beginning to Learn using Naïve Bayesian Classifiers
Reasoning using NBC
Learning Terminology
Supervised Learning
Some Discrete Probability Distributions
Training an NBC
Some Continuous Probability Distributions
NBC for Continuous RVs

Reasoning using NBC

Naïve Bayesian Classifier (NBC)
a.k.a. Idiot Bayes Model

[Figure: a Class node with arrows to Attribute 1, Attribute 2, …, Attribute n.]

Figure 58: The Naïve Bayesian Classifier assumes that the attributes are conditionally independent of each other given the class. However, NBC often achieves surprisingly good performance even when this strong assumption is not strictly valid.

The NBC Size Advantage


If we use the full JPD P(C, A_1, A_2, …, A_n), we need to specify |C| · |A_1| ⋯ |A_n| − 1 independent probability values. If all RVs are binary, this number is 2^{n+1} − 1.
If we use an NBC, we need to give:
  P(C), with |C| − 1 independent values;
  n CPTs P(A_i | C), the i-th CPT having (|A_i| − 1) · |C| independent values.
This gives a total of |C| − 1 + Σ_{i=1}^{n} |C| · (|A_i| − 1) values. If all RVs are binary, this number is 1 + 2n.
Thus, the model complexity has come down from exponential to linear in the number of attributes. Fewer parameters make NBCs more immune to overfitting, at the expense of less accuracy compared to more advanced methods.

Query distributions for an NBC

The NBC is first trained using a set of example/instance tuples:

e_i = (C = c_i, A_1 = a_{1i}, …, A_n = a_{ni}),   i = 1, …, m.   (9.1)

During testing, we would like to predict the class of a previously unseen sample with attributes (A_1 = a_1, A_2 = a_2, …, A_n = a_n) and unknown classification. Thus, we want to query:

P(C | A_1 = a_1, …, A_n = a_n).   (9.2)

Answering queries

P(C | a_1, a_2, …, a_n) = P(a_1, a_2, …, a_n | C) · P(C) / P(a_1, a_2, …, a_n)   (9.3)

and, using conditional independence of the attributes given the class,

  = P(a_1 | C) · P(a_2 | C) ⋯ P(a_n | C) · P(C) / Σ_{i=1}^{k} P(a_1, a_2, …, a_n, c_i)   (9.4)
  = P(a_1 | C) · P(a_2 | C) ⋯ P(a_n | C) · P(C) / Σ_{i=1}^{k} P(a_1 | c_i) ⋯ P(a_n | c_i) · P(c_i).   (9.5)

The Log-Sum Trick

P(c_j | a_1, a_2, …, a_n) = P(a_1 | c_j) ⋯ P(a_n | c_j) · P(c_j) / Σ_{i=1}^{k} P(a_1 | c_i) ⋯ P(a_n | c_i) · P(c_i).

As n increases, each term in the summation in the denominator becomes smaller and smaller. This may lead to numerical underflow.

Rewrite

P(c_j | a_1, a_2, …, a_n) = e^{b_j} / Σ_{i=1}^{k} e^{b_i},  where b_j ≜ ln[ P(a_1 | c_j) ⋯ P(a_n | c_j) · P(c_j) ].   (9.6)

Then

ln P(c_j | a_1, a_2, …, a_n) = b_j − ln Σ_{i=1}^{k} e^{b_i}   (9.7)
                             = b_j − b − ln Σ_{i=1}^{k} e^{b_i − b},   (9.8)
where  b ≜ max_i b_i.   (9.9)

Example: ln(e^{−120} + e^{−121}) = −120 + ln(1 + e^{−1}).
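
A small helper makes (9.7)-(9.9) concrete; only the standard math module is needed, and the b_j are assumed to be precomputed log-scores.

import math

def log_normalize(b):
    """Given b_j = ln[P(a1|c_j)...P(an|c_j)P(c_j)], return ln P(c_j | a1..an)
    via (9.8): b_j - b - ln(sum_i exp(b_i - b)) with b = max_i b_i."""
    b_max = max(b)
    log_z = b_max + math.log(sum(math.exp(bi - b_max) for bi in b))
    return [bj - log_z for bj in b]

# The slide's example: ln(e^-120 + e^-121) = -120 + ln(1 + e^-1)
b = [-120.0, -121.0]
log_post = log_normalize(b)
print([math.exp(lp) for lp in log_post])   # approx [0.731, 0.269], no underflow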

Learning Terminology

Learning 101

Definition of Tom Mitchell, CMU

An agent is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

The major issues are:
1. What prior knowledge is available to the agent, and how is the knowledge represented?
2. What feedback is available to learn from?

Feedback to Learn from

Supervised Learning: The agent observes some input (X)-output (Y) pairs and learns a functional mapping between them.
  Classification: the output Y is a discrete set of labels.
  Regression: the output Y is a continuous variable.
Reinforcement Learning: The agent learns from a series of rewards and punishments.
Unsupervised Learning: No feedback. The agent learns patterns in the input, e.g. clustering.

Supervised Learning

Given a training set of i = 1…m example input-output pairs:
  input X_i, generally a vector;
  output Y_i, generally a scalar.
There is an unknown functional relationship Y_i = f(X_i).
We would like to find a function h which approximates the true function f. The approximate function h is called a hypothesis.
To measure the accuracy of the hypothesis we use a test set of example pairs different from the training set. The hypothesis h generalizes well if it correctly predicts the output Y for these novel test-set examples.

Example: The Mushroom Classification Dataset


http://archive.ics.uci.edu/ml/datasets/Mushroom

Nr. of attributes: 22
A1   cap-shape                  bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
A2   cap-surface                fibrous=f, grooves=g, scaly=y, smooth=s
A3   cap-color                  brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
A4   bruises?                   bruises=t, no=f
A5   odor                       almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
A6   gill-attachment            attached=a, descending=d, free=f, notched=n
A7   gill-spacing               close=c, crowded=w, distant=d
A8   gill-size                  broad=b, narrow=n
A9   gill-color                 black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, ye…
A10  stalk-shape                enlarging=e, tapering=t
A11  stalk-root                 bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
A12  stalk-surface-above-ring   fibrous=f, scaly=y, silky=k, smooth=s
A13  stalk-surface-below-ring   fibrous=f, scaly=y, silky=k, smooth=s
A14  stalk-color-above-ring     brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
A15  stalk-color-below-ring     brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
A16  veil-type                  partial=p, universal=u
A17  veil-color                 brown=n, orange=o, white=w, yellow=y
A18  ring-number                none=n, one=o, two=t
A19  ring-type                  cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
A20  spore-print-color          black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
A21  population                 abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
A22  habitat                    grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

Example: The Mushroom Classification Dataset


http://archive.ics.uci.edu/ml/datasets/Mushroom

Nr. of examples: 8124


p   x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e   x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e   b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p   x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e   x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e   x,y,b,t,n,f,c,b,e,e,?,s,s,e,w,p,w,t,e,w,c,w
e   x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e   b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e   b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p   x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
⋮

Cardiac Single Proton Emission Computed Tomography (SPECT) Diagnosis
http://archive.ics.uci.edu/ml/datasets/SPECT+Heart

Citation: Kurgan et al., "Knowledge discovery approach to automated cardiac SPECT diagnosis", Artificial Intelligence in Medicine, vol. 23 (2001).
Attributes: 22+1, all binary.
Instances: 267. Class 0: 55 (Normals), Class 1: 212 (Abnormals).

Independent and Identically Distributed (i.i.d.)

We assume that example data-point j is an RV E_j whose observed value e_j = (x_j, y_j) is sampled from a probability distribution which remains unchanged (stationary) over time. Furthermore, each sample is independent of the others. Thus,

P(E_j | E_{j−1}, E_{j−2}, …) = P(E_j),   (Independence)   (9.10a)
P(E_j) = P(E_{j−1}) = P(E_{j−2}) = …   (Identical Distribution)   (9.10b)

Error-Rate
It is the proportion of mistakes a given hypothesis makes, i.e. the proportion of times h(x) ≠ y.

Holdout Cross-Validation
Split the available examples randomly into a training set, from which the learning algorithm produces a hypothesis, and a test set, on which the accuracy of h is evaluated. Disadvantage: we cannot use all examples for finding h.

Figure 59: Holdout cross-validation: the examples e_1, e_2, …, e_m are split into the two sets.
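
A holdout split as in Figure 59 is a few lines with the standard library; the 30% test fraction below is just an assumed choice.

import random

def holdout_split(examples, test_fraction=0.3, seed=0):
    """Randomly split examples into (training_set, test_set)."""
    rng = random.Random(seed)
    shuffled = examples[:]            # copy, keep the original order intact
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# train, test = holdout_split(list_of_examples)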

Some Discrete Probability Distributions

Multinomial Distribution
Let our sample space be divided into k classes, i.e. |C| = k.
We draw m samples. Each sample e_j, j = 1…m, falls into exactly one class. Let the class of e_j be denoted e_j.c.
Let the known prior probability be P(C = c_i) = θ_i. Thus,

Σ_{i=1}^{k} θ_i = Σ_{i=1}^{k} P(C = c_i) = 1.   (9.11)

Let the integer-valued DRVs N_i, i = 1, …, k, denote the number of samples (out of m) which fall in category i:

N_i = Σ_{j=1}^{m} I(e_j.c = i),   N ≜ [N_1, …, N_k]^T,   (9.12)
Σ_{i=1}^{k} N_i = m.   (9.13)
The joint pmf of the N_i's is given by

P(N = n) = (m choose n_1) · (m − n_1 choose n_2) ⋯ (m − Σ_{i=1}^{k−1} n_i choose n_k) · θ_1^{n_1} θ_2^{n_2} ⋯ θ_k^{n_k}
         = m! / (n_1! n_2! ⋯ n_k!) · ∏_{i=1}^{k} θ_i^{n_i},   (9.14)
where  Σ_{i=1}^{k} n_i = m,   Σ_{i=1}^{k} θ_i = 1.   (9.15)

We write N ~ Multinomial(θ).   (9.16)

The expected count of samples for class C = i is

E[N_i] = Σ_{j=1}^{m} E[I(e_j.c = i)] = Σ_{j=1}^{m} P(C = i) = Σ_{j=1}^{m} θ_i = m θ_i.   (9.17)

Definition 9.1 (Gamma function Γ(x))

Γ(x) ≜ ∫_0^∞ t^{x−1} e^{−t} dt.   (9.18)

If the argument is a positive integer n, then Γ(n) = (n − 1)!.

Figure 60: Src: facebook.com. We are only interested in the positive half of the real axis.

Figure 61: Src: zazzle.com. The famous Gamma function value for a non-integer.

The Dirichlet PDF


It is a pdf of the pmf P(C) = θ = [θ_1, …, θ_k]^T.
θ should lie in the probability simplex S_k = {θ | θ_i ∈ [0, 1] and Σ_{i=1}^{k} θ_i = 1}.
It is parameterized by hyperparameters α ∈ R^k, α_i > 0:

p(θ; α) ≜ d(α) ∏_{i=1}^{k} θ_i^{α_i − 1},   d(α) ≜ Γ(α_1 + … + α_k) / ∏_{i=1}^{k} Γ(α_i).   (9.19)

We write θ ~ Dir(α).
Recall: if n is a positive integer, Γ(n) = (n − 1)!, but Γ is also defined for general real numbers.
As p(θ; α) is a pdf,

∫_{S_k} p(θ; α) dθ = 1.   (9.20)

The Dirichlet Distribution: Properties I


Using this, and the property (z + 1) = z (z), we show that
E [i ] =

Sk

i p(; ) d

= d ()

Sk

1 1 1 ii k k 1 d

= missing steps
(1 + k ) (i + 1)
=
(1 + k + 1) (i )
i
=
.
1 + + k

K. Pathak (Jacobs University Bremen)

Articial Intelligence

Beginning to Learn using Na Bayesian Classiers


ve

(9.21)

December 5, 2013

283 / 475

Some Discrete Probability Distributions

(a) α = (1, 1, 1)    (b) α = (0.1, 0.1, 0.1)    (c) α = (10, 10, 10)    (d) α = (2, 5, 15)

Figure 62: Visualizing the Dirichlet distribution for k = 3 defined on S_3 in the first octant. From Frigyik et al., Univ. of Washington Tech. Report UWEETR-2010-0006.

The Dirichlet Distribution: Properties II


For binary RVs (k = 2):
  the Multinomial distribution reduces to the Binomial distribution;
  the Dirichlet distribution reduces to the Beta distribution.

Figure 63: The Beta distribution. In terms of Dirichlet hyperparameters: α_1 = α, α_2 = β. Src: your.org.

Conjugate Priors
The Bayesian estimation update rule also holds for pdfs:

p(θ_{n+1} | e_{1:n+1}) ∝ p(e_{n+1} | θ_n) · p(θ_n | e_{1:n}),
  i.e.  Posterior ∝ Likelihood · Prior.

In general, the prior, the likelihood, and the posterior distributions may belong to different families of probability distributions.
However, for some families the prior and the posterior belong to the same family F_1, given that the likelihood is from a certain family F_2. In this case the family F_1 is called the conjugate prior to F_2.
The most relevant examples for this course are:
  F_1 = F_2 = Gaussian;
  F_2 = Multinomial, F_1 = Dirichlet distribution.

Multinomial and Dirichlet Conjugate Priors


The prior p(θ) is Dirichlet: θ ~ Dir(α).
The likelihood P(N = n | θ) of observing the counts n is Multinomial: N ~ Multinomial(θ).
The posterior is

p(θ | n) ∝ P(n | θ) · p(θ; α)   (9.22)
         ∝ ∏_{i=1}^{k} θ_i^{n_i} · ∏_{j=1}^{k} θ_j^{α_j − 1}   (9.23)
         = ∏_{i=1}^{k} θ_i^{n_i + α_i − 1},  i.e.  θ | n ~ Dir(n + α).   (9.24)

The α_i are called pseudo-counts.
Using (9.21) for the posterior, we have

E[θ_j] = (n_j + α_j) / Σ_{i=1}^{k} (n_i + α_i).   (9.25)

Effect of Prior vs. Effect of Likelihood

E[θ_j] = (n_j + α_j) / Σ_{i=1}^{k} (n_i + α_i),  where the α_i are pseudo-counts: with few observed counts the prior dominates, while with many observations the empirical frequencies dominate.

Training an NBC

Unknown Attribute Values in Some Examples


For example, the attribute stalk-root has a missing value (?) in some examples of the Mushroom dataset.
Choices:
  Take the most common value of that attribute in the whole example set.
  For a binary decision problem with decision (class) variable having values Class = Y/N, and an attribute A with missing values in some examples, suppose the example with the missing value is of Class = Y. Using the product rule:

P(A = a_i | Class = Y) = P(A = a_i, Class = Y) / P(Class = Y) = p_i / p.   (9.26)

We can then choose the attribute-value a_i with the largest conditional probability. The Class = N case can be handled similarly.

Training the NBC

Step 1
Fill in the missing attribute-values using the heuristics of the last slide.

Step 2
Compute the prior P(C = c_i) for i = 1, …, k:

P(C = c_i) = (1/m) Σ_{j=1}^{m} I(e_j.c = c_i).   (9.27)

Step 3
The CPT P(A_r | C) consists of |C| pmfs P(A_r | c_i) ∈ S_{|A_r|}.
Assume a Dirichlet prior for all pmfs: P(A_r | c_i) ~ Dir(α_r).
In the absence of any other prior information, choose a uniform prior, i.e. α_r[ℓ] = 1 for ℓ = 1, …, |A_r|.


Step 4
Assume that the examples in the database with e_j.A_r = a_{r,ℓ}, e_j.c = c_i are distributed according to P(A_r | c_i). The counts of such observed examples are then sampled from a Multinomial distribution with θ = P(A_r | c_i). This is the likelihood.
So the expected Dirichlet posterior estimate of the pmf P(A_r | c_i), for r = 1, …, n and ℓ = 1, …, |A_r|, from (9.25) is

n_{r,ℓ} ≜ Σ_{j=1}^{m} I(e_j.A_r = a_{r,ℓ}, e_j.c = c_i),   (9.28)
P(A_r = a_{r,ℓ} | C = c_i) = (n_{r,ℓ} + α_r[ℓ]) / Σ_{p=1}^{|A_r|} (n_{r,p} + α_r[p]).   (9.29)
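
Steps 2-4 amount to counting with pseudo-counts. The sketch below is one possible implementation for discrete attributes, assuming the examples are given as (class, attribute-tuple) pairs and using the uniform prior α_r[ℓ] = 1 of Step 3.

from collections import Counter, defaultdict

def train_nbc(examples, attr_domains, classes):
    """examples: list of (c, (a_1, ..., a_n)) tuples.
    Returns the prior (9.27) and CPTs estimated with pseudo-counts alpha = 1 (9.29)."""
    m = len(examples)
    class_counts = Counter(c for c, _ in examples)
    prior = {c: class_counts[c] / m for c in classes}

    n_attrs = len(attr_domains)
    counts = [defaultdict(Counter) for _ in range(n_attrs)]   # counts[r][c][a] = n_{r,a}
    for c, attrs in examples:
        for r, a in enumerate(attrs):
            counts[r][c][a] += 1

    cpts = []                     # cpts[r][c][a] approximates P(A_r = a | C = c)
    for r, domain in enumerate(attr_domains):
        cpt = {}
        for c in classes:
            denom = sum(counts[r][c][a] + 1 for a in domain)
            cpt[c] = {a: (counts[r][c][a] + 1) / denom for a in domain}
        cpts.append(cpt)
    return prior, cpts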

Some Continuous Probability Distributions

Continuous Probability Density Functions

If X is an RV in a continuous domain D(X),

P(x_1 ≤ X ≤ x_2) = ∫_{x_1}^{x_2} p(x) dx.   (9.30)

The function p(X = x) is called the probability density function (pdf) of X. Obviously, ∫_{D(X)} p(x) dx = 1. Analogous to the discrete counterpart, we can also define multivariate pdfs p(X = x).

Most rules, like the product rule, marginalization, and Bayes rule, have similar counterparts in the continuous domain:

p(X = x, Y = y) δx δy = (p(x | y) δx) · (p(y) δy)  ⟹  p(x, y) = p(x | y) · p(y),   (9.31)
∫_{D(X)} p(X = x, Y = y) dx = p(Y = y).   (9.32)

If X ∈ R^n is a normally distributed vector continuous RV (CRV), its normal/Gaussian pdf is defined as

N(X = x; x̄, C) ≜ (2π)^{−n/2} |C|^{−1/2} exp( −(1/2) (x − x̄)^T C^{−1} (x − x̄) ),   (9.33)

where x̄ is the mean and C is the covariance matrix of the distribution.

NBC for Continuous RVs

Example: Banknote authentication


http://archive.ics.uci.edu/ml/datasets/banknote+authentication

Number of attributes: 5
1. Variance of Wavelet Transformed image (continuous)
2. Skewness of Wavelet Transformed image (continuous)
3. Curtosis of Wavelet Transformed image (continuous)
4. Entropy of image (continuous)
5. Class (integer) 0/1

Number of instances: 1372

Example: Banknote authentication


http://archive.ics.uci.edu/ml/datasets/banknote+authentication

3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
3.4566,9.5228,-4.0112,-3.5944,0
0.32924,-4.4552,4.5718,-0.9888,0
-1.3887,-4.8773,6.4774,0.34179,1
-3.7503,-13.4586,17.5932,-2.7771,1
-3.5637,-8.3827,12.393,-1.2823,1
-2.5419,-0.65804,2.6842,1.1952,1

Using NBC for CRVs: Option 1

Discretize all CRVs.
  This is like creating histograms with a user-given bin size.
  Later we will see a discretization technique based on information entropy.

Option 2: Modifying NBC Likelihood for the Continuous Case
The conditional independence assumption gives:

p(A_1 = a_1, A_2 = a_2, …, A_n = a_n | C = c_i) = ∏_{j=1}^{n} p(a_j | c_i).   (9.34)

The conditional pdf p(a_j | c_i) is now computed as

p(A_j = a_j | C = c_i) = N(a_j; μ_{ji}, σ²_{ji}),   (9.35)

where μ_{ji}, σ²_{ji} are the mean and variance of the values of A_j among instances of class C = c_i.

NBC Posterior for Continuous Attributes and Discrete Class

P(c_j | a_1, a_2, …, a_n) = p(a_1 | c_j) ⋯ p(a_n | c_j) · P(c_j) / Σ_{i=1}^{k} p(a_1 | c_i) ⋯ p(a_n | c_i) · P(c_i).   (9.36)
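
Option 2 thus reduces to estimating a per-class mean and variance for every attribute and plugging them into (9.35)-(9.36). A minimal sketch, assuming rows of the banknote kind (feature list plus class label); in practice a small variance floor may be needed to avoid zero variances.

import math
from collections import defaultdict

def gaussian(x, mu, var):
    """Univariate normal pdf N(x; mu, var), used for p(a_j | c_i) in (9.35)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def train_gaussian_nbc(rows):
    """rows: list of ([a_1, ..., a_n], class). Returns priors and per-class (mu, var) per attribute."""
    by_class = defaultdict(list)
    for features, c in rows:
        by_class[c].append(features)
    priors, stats = {}, {}
    for c, feats in by_class.items():
        priors[c] = len(feats) / len(rows)
        n = len(feats)
        mus = [sum(col) / n for col in zip(*feats)]
        vars_ = [sum((x - mu) ** 2 for x in col) / n for col, mu in zip(zip(*feats), mus)]
        stats[c] = list(zip(mus, vars_))
    return priors, stats

def classify(x, priors, stats):
    """Return arg max_c P(c) * prod_j N(x_j; mu_jc, var_jc), cf. (9.36), in log-space."""
    def score(c):
        s = math.log(priors[c])
        for xj, (mu, var) in zip(x, stats[c]):
            s += math.log(gaussian(xj, mu, var))
        return s
    return max(priors, key=score)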

Bayesian Networks

Contents

Bayesian Networks
Some Conditional Independence Results
Pruning a BN
Exact Inference in a BN
Approximate Inference in BN
Efficient Representation of CPTs
Applications of BN

Problems of Using a Full Joint Distribution


Given n RVs, each with cardinality d, a joint distribution table has an
exponentially growing size of O(d n ).
It is usually dicult to assign these probabilities.
In real life, we deal with eects and their direct causes. A domain
expert will usually be able to provide us with probability-tables of type
P(Eect | Cause 1 , . . . , Cause k )
In general, an eect-RV X has certain direct cause-RVs
Ci , i = 1 . . . k, k
n, and X is conditionally independent of the all
other cause RVs given all Ci . The number of probabilities to be
specied now is reduced to O(nd k+1 )
O(d n ).
Example: Run
http://www.aispace.org/bayes/version5.1.9/bayes.jnlp and
load sample problems.
K. Pathak (Jacobs University Bremen)

Articial Intelligence

December 5, 2013

302 / 475

Bayesian Networks

An Example Bayesian Network (BN)


[Network topology: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls.]

P(B = t) = 0.001,   P(E = t) = 0.002.
P(A = t | B, E):   B=t, E=t: 0.95;   B=t, E=f: 0.94;   B=f, E=t: 0.29;   B=f, E=f: 0.001.
P(J = t | A):      A=t: 0.90;   A=f: 0.05.
P(M = t | A):      A=t: 0.70;   A=f: 0.01.

Figure 64: A typical Bayesian Network (BN) showing the topology and CPTs.

Terminology

A Bayesian Network (BN) is a Directed Acyclic Graph (DAG) of nodes X_i which are RVs. A CPT exists between each node and all its parents.
The descendants of a node X are all the nodes Y to which a directed path exists from X. In this case, X is an ancestor of Y.
The non-descendants of a node X form the set of all nodes which are not descendants of X.
A set of nodes S = {X_1, …, X_k} ⊆ BN is called ancestral if it contains all its ancestors, i.e. there is no Y ∉ S having a descendant X_i ∈ S. In other words, S has no incoming edges from outside S.

Partial and Topological Ordering of the Nodes of a BN


Partial Ordering
A BN is a DAG, and all DAGs have an implicit partial order:

X_i is an ancestor of X_j  ⟹  (i < j).   (10.1)

Topological Ordering/Sorting
A topological ordering of a DAG is a non-unique total ordering which is compatible with the above partial ordering.
In any topological ordering [X_1, X_2, …, X_n], for all vertices X_i,

Parents(X_i) ⊆ Predecessors(X_i).   (10.2)

There exist several linear-time algorithms for topological sorting.

A Simple Algorithm for Finding a Topological Ordering


Algorithm 30: TopologicalOrdering
  input : G(V, E), a DAG
  output: L, a list with vertices in ascending order of topological order
  while (S = {vertex Y ∈ G.V | indegree of Y is 0}) ≠ ∅ do
      Choose any vertex X ∈ S   // Here we have freedom
      Append X to L
      Remove X and all its out-edges from G
  return L

In the highlighted line, different choices will result in different topological orders.
We define a particular choice dubbed ITYCA (Ignore Till You Can't Anymore), which takes a given node Y and, if Y ∈ S, does not choose Y till it is the only node left in S.
ITYCA produces a topological order L = [nondescendants(Y), Y, descendants(Y)].
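
Algorithm 30 is Kahn's algorithm; a direct Python transcription is given below, with the ITYCA choice mimicked by postponing a designated node. The small example call at the end is hypothetical, not the DAG of Figure 65.

def topological_ordering(nodes, edges, ignore=None):
    """Kahn's algorithm (Algorithm 30). edges is a collection of (u, v) pairs for u -> v.
    If `ignore` is given, that node is postponed until it is the only candidate (ITYCA)."""
    indeg = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in edges:
        indeg[v] += 1
        children[u].append(v)
    ready = [v for v in nodes if indeg[v] == 0]
    order = []
    while ready:
        ready.sort(key=lambda v: v == ignore)   # ITYCA: keep `ignore` at the back
        x = ready.pop(0)
        order.append(x)
        for c in children[x]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return order

# Example call (hypothetical small DAG):
# topological_ordering(['A', 'B', 'C'], [('A', 'B'), ('A', 'C')], ignore='B')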

Example for ITYCA Topological Order


[Figure: an example DAG over the nodes A-H.]

Figure 65: If we decide to ITYCA node E, a possible topological order which could be returned is [A, B, C, D, F, G, E, H].

ITYCA Topological Order

How can you get the ITYCA topological order with the DFS based
topological ordering algorithm from the CLRS book?


Recall: Chain Rule


This is an application of the product rule (stated in terms of RVs):

P(X_1, …, X_n) = P(X_n | X_1, …, X_{n−1}) · P(X_1, …, X_{n−1}),   (10.3a)
P(X_1, …, X_{n−1}) = P(X_{n−1} | X_1, …, X_{n−2}) · P(X_1, …, X_{n−2}),   (10.3b)
⋮
P(X_2, X_1) = P(X_2 | X_1) · P(X_1).   (10.3c)

Combining Eqs. (10.3), we get

P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | X_{i−1}, X_{i−2}, …).   (10.4)

This formula will be useful later in formulating the joint distribution of a Bayesian network.

Some Conditional Independence Results

Defining Property of a BN

Theorem 10.1 (The Joint Probability Distribution of a BN)

Let [X_1, …, X_n] be any given topological sorting of the nodes of the BN B. Every node X_i in a BN of n nodes is conditionally independent of its predecessors in the topological sorting, given its parents, if and only if the joint probability distribution represented by B is given by

P_B(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | Parents(X_i)).   (10.5)

Note that some nodes can be without any parents, but they are also included in the above product.

Proof: First Part


First assume that nodes are conditionally independent of their
predecessors in the topological sorting, given their parents
This assumption means that
P(Xi | Xi1 , Xi2 , . . . , X1 ) = P(Xi | Parents (Xi )).

(10.6)

From the chain-rule (10.4), we have


n

P(X1 , . . . , Xn ) =
i=1

P(Xi | Xi1 , Xi2 , . . .),


n

(10.6)

i=1

P(Xi | Parents (Xi )).

This proves (10.5) one way.


K. Pathak (Jacobs University Bremen)

Articial Intelligence

Bayesian Networks

December 5, 2013

311 / 475

Some Conditional Independence Results

Proof: Second Part I


Next consider a BN with the joint distribution given by (10.5). We now need to prove that the nodes are conditionally independent of their predecessors in the topological sorting, given their parents, i.e. that (10.6) holds.

P(X_i | X_{i−1}, X_{i−2}, …, X_1) = P(X_i, X_{i−1}, X_{i−2}, …, X_1) / P(X_{i−1}, X_{i−2}, …, X_1),  where   (10.7)
P(X_i, X_{i−1}, X_{i−2}, …, X_1) = Σ_{X_{i+1}} ⋯ Σ_{X_n} P(X_1, …, X_n)   (10.8)
                                 = Σ_{X_{i+1}} ⋯ Σ_{X_n} ∏_{j=1}^{n} P(X_j | parents(X_j))   (10.9)

Proof: Second Part II

                                 = [ ∏_{k=1}^{i} P(X_k | parents(X_k)) ] · [ Σ_{X_{i+1}} ⋯ Σ_{X_n} ∏_{j=i+1}^{n} P(X_j | parents(X_j)) ]
                                 ≜ f_1 · f_2.   (10.10)

The summation in f_2 can be distributed as follows, using the property that parents(X_j) ⊆ {X_k | k < j}:

f_2 = Σ_{X_{i+1}} P(X_{i+1} | parents(X_{i+1})) ⋯ Σ_{X_n} P(X_n | parents(X_n)).   (10.11)

The last summation is 1; substituting it makes the summation over X_{n−1} one, and this cascading finally makes f_2 = 1.

Proof: Second Part III


We thus have

P(X_i, X_{i−1}, …, X_1) = ∏_{k=1}^{i} P(X_k | parents(X_k)),  and similarly   (10.12)
P(X_{i−1}, X_{i−2}, …, X_1) = ∏_{k=1}^{i−1} P(X_k | parents(X_k)).   (10.13)

Substituting both in (10.7),

P(X_i | X_{i−1}, X_{i−2}, …, X_1) = P(X_i | parents(X_i)),   (10.14)

which proves the required conditional independence.

Local Markov Property of Bayesian Networks

Theorem 10.2 (Local Markov Property)

Each node Y ∈ BN is conditionally independent of nondescendants(Y), given its parents.

Proof.
A node is conditionally independent, given its parents, of its predecessors in any topological ordering of the BN, in particular of those in a topological ordering made by ITYCAing the node: these predecessors are precisely the node's nondescendants.

Example
[Figure: an example BN over nodes including A, C, D, G, H.]

Figure 66: Write down the expression for the JPD P_B(X_1, …, X_n) of the above BN.

Pruning a BN

Theorem 10.3 (Plucking a Leaf)


If a BN B consists of nodes {X_1, …, X_n, L} where the node L is a leaf, then let B′ denote the pruned BN consisting of the nodes {X_1, …, X_n}. Then

P_B(X_1, …, X_n) = P_{B′}(X_1, …, X_n),   (10.15)

where the RHS is the full JPD of B′ as defined in (10.5), and the LHS is the marginal JPD found from the full JPD of B after marginalizing out L.

Proof.

P_B(x_1, …, x_n) = Σ_{i=1}^{|L|} P_B(x_1, …, x_n, L = ℓ_i)
                 = Σ_{i=1}^{|L|} ∏_{j=1}^{n} P(x_j | parents(X_j)) · P(ℓ_i | parents(L))
                 = ∏_{j=1}^{n} P(x_j | parents(X_j)) · Σ_{i=1}^{|L|} P(ℓ_i | parents(L))
                 = ∏_{j=1}^{n} P(x_j | parents(X_j))
                 = P_{B′}(x_1, …, x_n).

Theorem 10.4 (JPD of an Ancestral Sub-DAG of a BN)


Let A = {A_1, A_2, …, A_m} ⊆ B be an ancestral set in a BN B. Consider now the BN represented by the nodes in A. Then

P_B(A_1, …, A_m) = P_A(A_1, …, A_m),   (10.16)

where the RHS is the full JPD of the BN A as defined in (10.5), and the LHS is the marginal JPD found from the full JPD of B after marginalizing out the nodes of B which do not exist in A.

Proof.
From B we can obtain A by plucking one leaf at a time, as shown in Algorithm 31 below. From Theorem 10.3, each time we pluck a leaf, the JPD of the nodes belonging to A within B does not change. Hence, (10.16) follows.

Algorithm 31: PruningToAncestral
  input: B, a BN;  A, an ancestral set in B
  while (S = {node L ∈ B | L ∉ A and outdegree of L is 0}) ≠ ∅ do
      Choose any node X ∈ S
      Remove X and all its in-edges from B.

Exact Inference in a BN

Definition 10.5 (Query on Posterior Distribution)
Given a query RV X, a vector of observed evidence RVs E = e, and a vector of irrelevant (unobserved/hidden) RVs Y, we'd like to find the posterior probability P(X | e). We have a BN B consisting of all the RVs {X, E, Y}.

General Inference Procedure

Use P_B (10.5) as the JPD of all RVs. Then,

P(X | e) = α P(X, e) = α Σ_y P_B(X, e, y).   (10.17)

A Useful Optimization

Ignoring Nodes Irrelevant to the Query

From Theorem 10.4, all nodes which are not ancestors of X or E are irrelevant to the query, and we can answer the query using a pruned, smaller BN consisting of the smallest ancestral set containing X and E.

Recall
[The alarm BN and CPTs of Figure 64.]

P(B | j, m) = α P(B, j, m) = α Σ_{i=1}^{|E|} Σ_{k=1}^{|A|} P(B, j, m, e_i, a_k)
            = α Σ_{i=1}^{|E|} Σ_{k=1}^{|A|} P(B) · P(e_i) · P(a_k | B, e_i) · P(j | a_k) · P(m | a_k).   (10.18)

Factors

P(B | j, m) = α Σ_{i=1}^{|E|} Σ_{k=1}^{|A|} P(B) · P(e_i) · P(a_k | B, e_i) · P(j | a_k) · P(m | a_k)
            = α P(B) · Σ_{i=1}^{|E|} P(e_i) · Σ_{k=1}^{|A|} P(a_k | B, e_i) · P(j | a_k) · P(m | a_k),

with the factors  f_1(B) ≜ P(B),  f_2(E) ≜ P(E),  f_3(A, B, E) ≜ P(A | B, E),  f_4(A) ≜ P(j | A),  f_5(A) ≜ P(m | A).

Each factor f_i is a matrix indexed by the values of its argument RVs, e.g.

f_4(A) = [P(j | a), P(j | ¬a)]^T,   f_5(A) = [P(m | a), P(m | ¬a)]^T,   and f_3(A, B, E) is 2 × 2 × 2.

Point-wise Multiplication of Factors

P(B | j, m) = α f_1(B) ⊗ Σ_E f_2(E) ⊗ Σ_A f_3(A, B, E) ⊗ f_4(A) ⊗ f_5(A).

The symbol ⊗ denotes a point-wise product, defined as

f(X_1, …, X_i, Y_1, …, Y_j, Z_1, …, Z_k) = f_1(X_1, …, X_i, Y_1, …, Y_j) ⊗ f_2(Y_1, …, Y_j, Z_1, …, Z_k).   (10.19)

An Example Illustrating Factor Multiplication


f_3(A, B, C) = f_1(A, B) ⊗ f_2(B, C):

A  B | f_1(A,B)        B  C | f_2(B,C)        A  B  C | f_1 ⊗ f_2
1  1 | p11             1  1 | p21             1  1  1 | p11 · p21
1  0 | p12             1  0 | p22             1  1  0 | p11 · p22
0  1 | p13             0  1 | p23             1  0  1 | p12 · p23
0  0 | p14             0  0 | p24             1  0  0 | p12 · p24
                                              0  1  1 | p13 · p21
                                              0  1  0 | p13 · p22
                                              0  0  1 | p14 · p23
                                              0  0  0 | p14 · p24

An Example Illustrating Factor Marginalization


A   B   C  | f_1(A,B,C)          A   C  | f_2(A,C)
a1  b1  c1 | 0.25                a1  c1 | 0.33
a1  b1  c2 | 0.35                a1  c2 | 0.51
a1  b2  c1 | 0.08                a2  c1 | 0.05
a1  b2  c2 | 0.16                a2  c2 | 0.07
a2  b1  c1 | 0.05                a3  c1 | 0.24
a2  b1  c2 | 0.07                a3  c2 | 0.39
a2  b2  c1 | 0
a2  b2  c2 | 0
a3  b1  c1 | 0.15
a3  b1  c2 | 0.21
a3  b2  c1 | 0.09
a3  b2  c2 | 0.18

Figure 67: f_2(A, C) = Σ_{b_i} f_1(A, B = b_i, C). From Koller and Friedman, Probabilistic Graphical Models, 2009.
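
Both factor operations used above are a few lines over dictionary-based tables. The sketch below reproduces the Figure 67 marginal; the representation (factors as dicts keyed by value tuples) is an illustrative choice.

from itertools import product

def factor_product(vars1, f1, vars2, f2, domains):
    """Point-wise product (10.19): factors are dicts keyed by value tuples."""
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    for assign in product(*[domains[v] for v in out_vars]):
        a = dict(zip(out_vars, assign))
        out[assign] = f1[tuple(a[v] for v in vars1)] * f2[tuple(a[v] for v in vars2)]
    return out_vars, out

def sum_out(var, vars_, f):
    """Marginalize `var` out of the factor f, as in Figure 67."""
    keep = [v for v in vars_ if v != var]
    out = {}
    for assign, p in f.items():
        a = dict(zip(vars_, assign))
        key = tuple(a[v] for v in keep)
        out[key] = out.get(key, 0.0) + p
    return keep, out

# The f1(A, B, C) of Figure 67, and its B-marginal f2(A, C):
domains = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2'], 'C': ['c1', 'c2']}
f1 = {('a1', 'b1', 'c1'): 0.25, ('a1', 'b1', 'c2'): 0.35, ('a1', 'b2', 'c1'): 0.08,
      ('a1', 'b2', 'c2'): 0.16, ('a2', 'b1', 'c1'): 0.05, ('a2', 'b1', 'c2'): 0.07,
      ('a2', 'b2', 'c1'): 0.0,  ('a2', 'b2', 'c2'): 0.0,  ('a3', 'b1', 'c1'): 0.15,
      ('a3', 'b1', 'c2'): 0.21, ('a3', 'b2', 'c1'): 0.09, ('a3', 'b2', 'c2'): 0.18}
print(sum_out('B', ['A', 'B', 'C'], f1)[1][('a1', 'c1')])   # 0.33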

Inference by Variable Elimination I

Remark 10.6 (Notation)

Let the vector RV X = [X_1, …, X_n] be the vector of all nodes (RVs) in a BN. We will use X to refer to both the vector and the set of RVs {X_1, …, X_n}. So if Y = [X_1, …, X_m], m < n, we can write Y ⊆ X.

Algorithm 32: EliminateVariableFromFactors
  input : A factorization F: f_1, …, f_m;
          Z ∈ X, an RV to eliminate by marginalization
  output: The factorization F after elimination

  S(Z) = {f_i ∈ F | f_i involves Z};
  Remove the factors S(Z) from F;
  Marginalize out Z from the product of the factors in S(Z) to create a new factor g:
      h = ⊗_{f ∈ S(Z)} f   (point-wise product),    g = Σ_{z ∈ Z} h;   (10.20)
  Append factor g to F;
  return F

Computing P(Q | E = e)
Zhang and Poole (1994)

Algorithm 33: VariableElimination
  input : A factorization F: f_1, …, f_m of a JPD P(X);
          Q ⊆ X, a vector of query RVs;
          E = e, the vector of instantiated observed (evidence) RVs;
          Y ⊆ X, any ordering of the unobserved RVs. Note: X = Q ∪ E ∪ Y.
  output: The posterior distribution P(Q | E = e)

  Instantiate E = e in all factors in F. This truncates all factor tables to those elements which correspond to E = e;
  for Y ∈ Y do
      F ← EliminateVariableFromFactors(F, Y);
  h(Q) ← point-wise product of all factors in F;
  return Normalize(h(Q))

Recall

[The alarm BN and CPTs of Figure 64, repeated for reference.]

Alarm Example: Computing P(B | j, m) I


P(B | j, m) = α P(B) Σ_{i=1}^{|E|} P(e_i) Σ_{k=1}^{|A|} P(a_k | B, e_i) · P(j | a_k) · P(m | a_k),
with the factors f_1(B), f_2(E), f_3(A, B, E), f_4(A), f_5(A) as before.

We have Q ≡ [B], e ≡ [J = True, M = True]. Let us choose the ordering Y ≡ [E, A] for the unobserved RVs.
Initially, we have the set of factors:

F = [ f_1(B), f_2(E), f_3(A, B, E), f_4(J, A), f_5(M, A) ]
  = [ P(B), P(E), P(A | B, E), P(J | A), P(M | A) ].   (10.21)

Alarm Example: Computing P(B | j, m) II

Instantiate the observed variables J = True, M = True. Taking the listing order (True, False) for all binary RVs, we have

F = [ f_1(B), f_2(E), f_3(A, B, E), f_4(A), f_5(A) ]
  = [ [0.001, 0.999], [0.002, 0.998], f_3(A, B, E), [0.9, 0.05], [0.7, 0.01] ].   (10.22)

In the first iteration of the for-loop of Algo. 33, we have Y ≡ E. In the call EliminateVariableFromFactors(F, E), in (10.20) we have the factor h_1 = f_2(E) ⊗ f_3(A, B, E). Verify that h_1 has the table

B  A | E = True       | E = False
T  T | 0.95 × 0.002   | 0.94 × 0.998
T  F | 0.05 × 0.002   | 0.06 × 0.998
F  T | 0.29 × 0.002   | 0.001 × 0.998
F  F | 0.71 × 0.002   | 0.999 × 0.998

Alarm Example: Computing P(B | j, m) III


Summing out E in h_1 gives us the factor g_1:

B  A | g_1(A, B)
T  T | 0.95 × 0.002 + 0.94 × 0.998  = 0.94002
T  F | 0.05 × 0.002 + 0.06 × 0.998  = 0.05998
F  T | 0.29 × 0.002 + 0.001 × 0.998 = 0.001578
F  F | 0.71 × 0.002 + 0.999 × 0.998 = 0.998422

So now we have F = [f_1(B), g_1(A, B), f_4(A), f_5(A)].
In the second iteration of the for-loop of Algo. 33, we have Y ≡ A. In the call EliminateVariableFromFactors(F, A), in (10.20) we have the factor h_2 = g_1(A, B) ⊗ f_4(A) ⊗ f_5(A). Verify that h_2 has the table

B  A | h_2(A, B)
T  T | 0.94002 × 0.9 × 0.7    = 0.5922126
T  F | 0.05998 × 0.05 × 0.01  = 2.999 × 10⁻⁵
F  T | 0.001578 × 0.9 × 0.7   = 0.00099414
F  F | 0.998422 × 0.05 × 0.01 = 0.0004992

Alarm Example: Computing P(B | j, m) IV


By summing out A from h_2(A, B), we get the table g_2(B) = [0.59224259, 0.001493351]. So now F = [f_1(B), g_2(B)].
Finally, back in VariableElimination, we have h(B) = f_1(B) ⊗ g_2(B):

B | h(B)
T | 0.001 × 0.59224259  = 0.00059224259
F | 0.999 × 0.001493351 = 0.0014918576

Normalizing h(B), we get

P(B | J = True, M = True) = [0.2842, 0.7158].

Taking a different order for the unobserved RVs, e.g. Y ≡ [A, E], would also give the same result but with a different computational efficiency. Which order is better for this example?
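
The variable-elimination result can be cross-checked by brute-force enumeration of (10.18); for this small network a few lines suffice. The CPT numbers are those of Figure 64.

# Brute-force check of P(B | j, m) for the alarm network.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=True | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J=True | A)
P_M = {True: 0.70, False: 0.01}                       # P(M=True | A)

def val(p_true, flag):
    """P(X = flag) from the stored P(X = True)."""
    return p_true if flag else 1.0 - p_true

unnorm = {}
for b in (True, False):
    s = 0.0
    for e in (True, False):
        for a in (True, False):
            s += (val(P_B[True], b) * val(P_E[True], e) * val(P_A[(b, e)], a)
                  * P_J[a] * P_M[a])          # evidence: J = True, M = True
    unnorm[b] = s
z = sum(unnorm.values())
print({b: p / z for b, p in unnorm.items()})  # approx {True: 0.284, False: 0.716}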

Alarm Example: Computing P(J = True | B = True )

For answering P(J = True | B = True ), note that the sub-DAG formed by
the nodes {B, A, J, E } is ancestral. Therefore, we can prune out the leaf
node M before answering the query.


Approximate Inference in BN

Monte Carlo Algorithms

Inference in BN can also be done using randomized sampling algorithms


called Monte Carlo algorithms whose accuracy depends on the number of
samples generated. Hence, the solutions are called any time because
within a given computation time an estimate can be produced.


Algorithm 34: Prior-Sampling


input : B, a BN
output: A sample s from the JPD of B given by (10.5)

The size of the BN, denoted |B|, is the number of nodes in B ;
Z ← TopologicalOrdering(B) ;
Initialize sample-vector s ∈ R^|B| to 0 ;
for i ← 1 . . . |B| do
    z_i ← a random sample from the pmf P(Z_i | parents(Z_i)) ;
    s[i] ← z_i ;
return s
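A minimal Python sketch of this procedure is given below; the dict-based encoding of the alarm network (each variable mapping to its parents and to P(X = True | parents)) is an assumption of the sketch, not the course's code.

```python
import random

# Prior-Sampling (Algo. 34) on an illustrative encoding of the alarm network.
alarm_bn = {
    "B": ([], {(): 0.001}),
    "E": ([], {(): 0.002}),
    "A": (["B", "E"], {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (["A"], {(True,): 0.9, (False,): 0.05}),
    "M": (["A"], {(True,): 0.7, (False,): 0.01}),
}
ORDER = ["B", "E", "A", "J", "M"]   # a topological ordering

def prior_sample(bn, order):
    """Sample every variable in topological order from P(Z_i | parents(Z_i))."""
    s = {}
    for var in order:
        parents, cpt = bn[var]
        p_true = cpt[tuple(s[p] for p in parents)]
        s[var] = random.random() < p_true
    return s

print(prior_sample(alarm_bn, ORDER))
```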


Probability of a sample generated by Prior-Sampling


The sampling proceeds in a topological order Z, i.e. parents are sampled before children. When a child RV is sampled, all its parent RVs have already been instantiated. Let a sample s be generated in order Z. Then
    P(s_1) = P(Z_1 = s_1), as Z_1 is guaranteed to be a root-node; it has no parents;
    P(s_2 ∧ s_1) = P(s_1) P(s_2 | parents(Z_2) ⊆ {s_1});
    ...
In general, from the product-rule,
    P(s_i ∧ s_{i-1} ∧ ... ∧ s_1) = P(s_i | s_{i-1} ∧ s_{i-2} ∧ ... ∧ s_1) P(s_{i-1} ∧ s_{i-2} ∧ ... ∧ s_1)
                                 = P(Z_i = s_i | parents(Z_i) ⊆ {s_1, ..., s_{i-1}}) P(s_{i-1} ∧ s_{i-2} ∧ ... ∧ s_1)


Thus, the probability of the whole sample vector s, containing samples from all RVs in the BN, is

    P(s) = P(s_1 ∧ s_2 ∧ ... ∧ s_|B|)
         = Π_{i=1}^{|B|} P(Z_i = s_i | parents(Z_i) ⊆ {s_1, ..., s_{i-1}})
         = P_B(Z = s),    by (10.5).

Therefore, s is a sample from the JPD of the BN.


Algorithm 35: Rejection-Sampling


input : B, a BN consisting of the RVs {X} ∪ E ∪ Y ;
        X, the query RV; E = e, the evidence vector RV ;
        N, the number of samples to generate
output: An estimate of P(X | e)

Initialize count-map C ∈ R^|X| to 0 ;
for j ← 1 to N do
    Initialize sample-vector s ∈ R^|B| to 0 ;
    s ← Prior-Sampling(B) ;
    if s is consistent with e then
        x ← the value of X in s ;
        C[x] ← C[x] + 1 ;
return Normalize(C)


Sampling only Non-evidence RVs

Rejection-Sampling may reject too many samples as |E| increases! It is therefore unusable for complex problems.
Alternative approach:
Create a sample consistent with the evidence by sampling only the non-evidence RVs and freezing the evidence RVs to the observations e.
Compute the sample's weight as the likelihood of the evidence in the sample.



Algorithm 36: Weighted-Sample
input : B, a BN; E = e, the evidence vector RV
output: A sample s, and its weight w

w ← 1 ;
Z ← TopologicalOrdering(B) ;
Initialize the slots of sample-vector s ∈ R^|B| corresponding to E by e ;
for i ← 1 . . . |B| do
    if Z_i ∈ E then
        z_i ← value of Z_i in e ;
        w ← w · P(Z_i = z_i | parents(Z_i))
    else
        z_i ← a random sample from P(Z_i | parents(Z_i)) ;
    s[i] ← z_i ;
return s, w

Probability of a sample generated by Weighted-Sample

The sampling proceeds in a topological order Z of the RVs {X} ∪ Y ∪ E. When a child RV is sampled, all its parent RVs have already been instantiated, either by sampling or because they are part of the evidence.



Let U = {X} ∪ Y be the set of non-evidence RVs. Weighted-Sample only samples RVs in U.
Then, the probability of a sample s = u ∪ e is

    P_WS(s) = Π_{i=1}^{|U|} P(U_i = u_i | parents(U_i)).                                  (10.23)

The computed weight of this sample is

    w(s) = Π_{i=1}^{|E|} P(E_i = e_i | parents(E_i)),                                     (10.24)

where parents(U_i) and parents(E_i) can contain both variables u instantiated by sampling and other, non-sampled, evidence variables.

Algorithm 37: Likelihood-Weighting


input : B, a BN ;
        X, a query RV ;
        E = e, the evidence vector RV ;
        N, the number of samples to generate
output: An estimate of P(X | e)

W ∈ R^|X|, a map from each value of X to its weighted counts, initialized to 0 ;
for j ← 1 to N do
    s, w ← Weighted-Sample(B, e) ;
    x ← value of X in sample s ;
    W[x] ← W[x] + w ;
return Normalize(W)
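A compact Python sketch of Algos. 36 and 37 combined is shown below; the dict-based alarm-network encoding and the sample count N are illustrative assumptions, not the course's code.

```python
import random

# Weighted-Sample (Algo. 36) + Likelihood-Weighting (Algo. 37) on the alarm network.
alarm_bn = {
    "B": ([], {(): 0.001}),
    "E": ([], {(): 0.002}),
    "A": (["B", "E"], {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (["A"], {(True,): 0.9, (False,): 0.05}),
    "M": (["A"], {(True,): 0.7, (False,): 0.01}),
}
ORDER = ["B", "E", "A", "J", "M"]            # a topological ordering

def weighted_sample(bn, order, evidence):
    s, w = dict(evidence), 1.0
    for var in order:
        parents, cpt = bn[var]
        p_true = cpt[tuple(s[p] for p in parents)]
        if var in evidence:                   # freeze evidence, multiply its likelihood
            w *= p_true if evidence[var] else 1.0 - p_true
        else:                                 # sample the non-evidence variables
            s[var] = random.random() < p_true
    return s, w

def likelihood_weighting(bn, order, query, evidence, N=100_000):
    W = {True: 0.0, False: 0.0}
    for _ in range(N):
        s, w = weighted_sample(bn, order, evidence)
        W[s[query]] += w
    total = W[True] + W[False]
    return {v: W[v] / total for v in W}

print(likelihood_weighting(alarm_bn, ORDER, "B", {"J": True, "M": True}))
# approaches P(B | j, m) ~ [0.284, 0.716] for large N
```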


Consistency of Likelihood-Weighting
Let N_x(y) be the number of samples of type s = {x} ∪ y ∪ e generated by Weighted-Sample.
Then, before normalization of W,

    W[x] = Σ_y N_x(y) w(s = {x} ∪ y ∪ e).

The expected value E[N_x(y)] = N · P_WS(s = {x} ∪ y ∪ e).
Substituting above, and absorbing N into the normalization constant α,

    E[W[x]] = α Σ_y P_WS(s = {x} ∪ y ∪ e) w(s = {x} ∪ y ∪ e)
            = α Σ_y P_B(s = {x} ∪ y ∪ e),    by (10.23), (10.24),
            = α P_B(x, e)

Therefore, Ŵ = α P_B(X, e) = P_B(X | e).

Noisy OR
Efficient Representation of CPTs

An effect (child node) can have several causes (parent nodes), e.g. the binary effect RV Fever can have binary cause-RVs Cold, Flu, Malaria, etc. It is difficult for a domain expert to specify all the numbers for the whole CPT P(Fever | Cold, Flu, Malaria).
The number of probabilities to be specified in a CPT increases exponentially with the number of parents.
Therefore, we use additional assumptions to keep this number bounded.


Noisy OR
Efficient Representation of CPTs

As a logical statement: Fever ⇔ Cold ∨ Flu ∨ Malaria.
Now, in noisy-OR, we allow Fever = False with a small probability even if a cause is True, e.g. Cold = True.
A patient may have a cold but no fever: the cold is then inhibited in its capacity to cause fever.


Noisy OR

Noisy OR makes two assumptions:
1. All possible causes are listed. A leak cause-node may be included: the latter is a catch-all for all other miscellaneous causes.
2. Inhibition of each parent is causally independent of the inhibition of any other parents:

    P(¬fever | Cold, Flu, Malaria) = P(¬fever | Cold) P(¬fever | Flu) P(¬fever | Malaria)      (10.25)


Noisy OR
This allows the CPT to be defined implicitly by the inhibition probabilities

    q_c ≜ P(¬fever | cold) = 1 − P(fever | cold),                                         (10.26)
    q_f ≜ P(¬fever | flu) = 1 − P(fever | flu),                                           (10.27)
    q_m ≜ P(¬fever | malaria) = 1 − P(fever | malaria).                                   (10.28)

The full CPT P(Fever | Cold, Flu, Malaria) can now be given in terms of these values by plugging them into (10.25).

    Cold  Flu  Malaria   P(¬Fever)       P(Fever)
    T     T    T         q_c q_f q_m     1 − q_c q_f q_m
    T     T    F         q_c q_f         1 − q_c q_f
    T     F    T         q_c q_m         1 − q_c q_m
    T     F    F         q_c             1 − q_c
    F     T    T         q_f q_m         1 − q_f q_m
    F     T    F         q_f             1 − q_f
    F     F    T         q_m             1 − q_m
    F     F    F         1.0             0.0
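As an illustration of how (10.25) compresses the CPT, here is a small Python sketch; the numeric inhibition probabilities are made-up values, not from the lecture.

```python
from itertools import product

# Build a noisy-OR CPT from inhibition probabilities q_i (illustrative values).
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}   # q_i = P(no fever | only cause i present)

def noisy_or_cpt(q):
    """Return P(Fever=True | assignment) for every truth assignment of the causes."""
    causes = sorted(q)
    cpt = {}
    for values in product([True, False], repeat=len(causes)):
        p_no_fever = 1.0
        for cause, present in zip(causes, values):
            if present:                  # each present cause fails to cause fever w.p. q[cause]
                p_no_fever *= q[cause]
        cpt[values] = 1.0 - p_no_fever   # P(fever) = 1 - product of inhibition probabilities
    return cpt

for assignment, p in noisy_or_cpt(q).items():
    print(dict(zip(sorted(q), assignment)), round(p, 3))
```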

Noisy MAX
The Noisy MAX is a generalization of the noisy OR for non-binary RVs.
Let Y be an RV with values 0, 1, ..., |Y| − 1, |Y| > 2. The RV Y is semantically graded, meaning that the value 0 implies that the effect Y is absent, and increasing values denote that Y is present with increasing intensity/degree.
The direct causes of Y are represented by Parents(Y) = X = {X_1, ..., X_n}. Each X_i is also a graded RV and can take values from 0 to |X_i| − 1.
Let z_{i,k} denote an instantiation of X in which X_i = k and X_j = 0 for all j ≠ i. Note that z_{i,0} ≡ 0.


Two Assumptions of Noisy MAX


1. When all causes are absent, the effect is absent:

    P(Y = 0 | X = 0) = 1.                                                                 (10.29)

2. We note that, for any x,

    P(Y ≤ d | x) = Σ_{k=0}^{d} P(Y = k | x).                                              (10.30)

The second assumption is that this can be written as a product of the effects of the X_i's acting independently:

    P(Y ≤ d | X_1 = k_1, ..., X_n = k_n) = Π_{i=1}^{n} P(Y ≤ d | z_{i,k_i}).              (10.31)

Parameterization of Noisy MAX I


The full CPT of size |Y| · |X_1| ⋯ |X_n| need not be specified. Using the assumptions above, only the following |Y| · Σ_{i=1}^{n} (|X_i| − 1) values need to be specified:

    c_{i,j,k} ≜ P(Y = i | z_{j,k}),  where                                                (10.32)
        i = 0 ... (|Y| − 1),   k = 1 ... (|X_j| − 1), due to (10.29).

    C_{d,j,k} ≜ P(Y ≤ d | z_{j,k}) = Σ_{i=0}^{d} c_{i,j,k}.                               (10.33)

Substituting (10.32) in (10.31),

    P(Y ≤ d | X_1 = k_1, ..., X_n = k_n) = Π_{j=1}^{n} C_{d,j,k_j} ≜ Q(d, x).             (10.34)

Parameterization of Noisy MAX II


Using the above, the implied CPT can be computed as

    P(Y = 0 | x) = Q(0, x),
    P(Y = i | x) = Q(i, x) − Q(i − 1, x),  i > 0.                                         (10.35)

    (a) c_{i,1,k}                      (b) c_{i,2,k}
    Y \ X1    0     1     2            Y \ X2    0     1     2
      0       1     0.5   0              0       1     0.5   0
      1       0     0.5   1              1       0     0.3   0
      2       0     0     0              2       0     0.2   1

Figure 68: Example of Noisy-MAX parameterization.
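A short sketch of how (10.33)-(10.35) turn the two tables of Figure 68 into a full CPT column; the NumPy encoding is an assumption of this sketch.

```python
import numpy as np

# Noisy-MAX construction for |Y| = 3 and two ternary causes X1, X2 (Figure 68).
c1 = np.array([[1.0, 0.5, 0.0],     # rows: Y = 0,1,2; columns: X1 = 0,1,2
               [0.0, 0.5, 1.0],
               [0.0, 0.0, 0.0]])
c2 = np.array([[1.0, 0.5, 0.0],
               [0.0, 0.3, 0.0],
               [0.0, 0.2, 1.0]])

C1 = np.cumsum(c1, axis=0)          # C_{d,1,k} = P(Y <= d | z_{1,k})
C2 = np.cumsum(c2, axis=0)          # C_{d,2,k} = P(Y <= d | z_{2,k})

def p_y_given(k1, k2):
    """Implied distribution P(Y | X1=k1, X2=k2) via Q(d,x) = C_{d,1,k1} * C_{d,2,k2}."""
    Q = C1[:, k1] * C2[:, k2]       # cumulative distribution of Y
    return np.diff(Q, prepend=0.0)  # P(Y=0)=Q(0), P(Y=i)=Q(i)-Q(i-1)

print(p_y_given(1, 2))              # e.g. distribution of Y when X1=1, X2=2
```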

Applications of BN

"The decomposition of large probabilistic domains into weakly connected subsets through conditional independence is one of the most important developments in the recent history of AI."
(Textbook, page 499.)

Expert Systems for Medicine

[Figure 69: The CPCS system of Pradhan et al. (1994) for internal medicine. It has 448 nodes, 906 edges, and uses Noisy-MAX distributions to reduce the specified CPT values from 133,931,430 to 8,254.]

Expert Systems for Medicine

Pathfinder: diagnosis accuracy at the level of a junior doctor for lymph diseases. (pathfinder.xdsl)
A recent example: the Hepar-II network for liver diseases. (Hepar II.xdsl)

Expert Systems for Fault Diagnosis

Fault-diagnosis for power-systems, cars, printers, etc.
NASA's VISTA project for showing fault-diagnosis of the Shuttle.
Autonomous Underwater Vehicles (AUVs).
A realistic example: the printer-troubleshooting file.

In Genetics

Gene-expression Analysis
Functional Annotation, Protein-protein interaction, Haplotype Inference
Pedigree Analysis
Survey: http://genomics10.bu.edu/bioinformatics/kasif/bayes-net.html

In Software

Document categorization, Semantic Web Ontologies (e.g. BayesOWL).
Data-compression.
Paper-clip: the erstwhile Microsoft Office Assistant. Read the humorous article: http://people.cs.ubc.ca/~murphyk/Bayes/econ.22mar01.html
Spam-email filtering.
Microsoft does active research in BN: http://research.microsoft.com/en-us/um/redmond/groups/adapt/msbnx/

Free and Commercial Software

A good list with comparison at: http://people.cs.ubc.ca/~murphyk/Software/bnsoft.html
Different file-formats for the exchange of Bayesian Network data: XML-based (.xmlbif, .xdsl), a successor of .bif, .net, .dsc, etc. See http://www.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/
Data-sets for benchmarking: http://genie.sis.pitt.edu/networks.html, http://www.cs.huji.ac.il/site/labs/compbio/Repository/
Read case-studies in different fields at: http://www.hugin.com/case-stories.

Some Learning Methodologies

Contents

Some Learning Methodologies


The Perceptron
Entropy
Decision Tree Learning
Cross-Validation
Unsupervised Learning
k-Means Clustering
The Dunn Index
The EM Algorithm
Mean-Shift Clustering


The Perceptron
Supervised Learning

The perceptron is a simple binary classifier for two linearly separable classes M− and M+.
Classification rule: given the learned weight-vector w ∈ R^d and threshold θ, and a vector x ∈ R^d to be classified,

    w·x ≤ θ  ⟹  x ∈ M−,                                                                   (11.1)
    w·x > θ  ⟹  x ∈ M+.                                                                   (11.2)

Redefine  w ← (w, −θ),  x ← (x, 1),                                                       (11.3)

so that the threshold is absorbed and the rule becomes w·x > 0.

Algorithm 38: PerceptronLearning


input : M+, M−
output: The learned weight vector w

repeat
    for all x ∈ M+ do
        if w·x ≤ 0 then
            w = w + x / ‖x‖
    for all x ∈ M− do
        if w·x > 0 then
            w = w − x / ‖x‖
until all x ∈ M+ ∪ M− are classified correctly ;
return w
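A minimal NumPy sketch of Algo. 38 on the augmented vectors of (11.3); the data points below are made up for illustration.

```python
import numpy as np

def perceptron_learning(M_plus, M_minus, max_epochs=1000):
    d = M_plus.shape[1]
    w = np.zeros(d)
    for _ in range(max_epochs):
        changed = False
        for x in M_plus:                     # positive class: want w.x > 0
            if w @ x <= 0:
                w = w + x / np.linalg.norm(x)
                changed = True
        for x in M_minus:                    # negative class: want w.x <= 0
            if w @ x > 0:
                w = w - x / np.linalg.norm(x)
                changed = True
        if not changed:                      # every example classified correctly
            return w
    return w                                 # may not converge if not separable

# 2D points augmented with a constant 1 as in (11.3).
M_plus  = np.array([[1.0, 1.0, 1.0], [2.0, 1.5, 1.0]])
M_minus = np.array([[-1.0, -0.5, 1.0], [-2.0, -1.0, 1.0]])
print(perceptron_learning(M_plus, M_minus))
```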


Learning w: Linearly Separable Points


[Figure: the perceptron's decision boundary after Iteration 1 and Iteration 2 on linearly separable points (axes from -2.5 to 2.5).]

Points not linearly separable


[Figure: the perceptron's decision boundary over Iterations 1-3 on points that are not linearly separable (axes from -2.5 to 2.5).]

Entropy

Given a PMF of an RV X, which specifies probabilities P_i = P(X = x_i) for the events X = x_i, we'd like to have a metric for:
how much choice is involved in the selection of an event from this set;
or, in other words, how uncertain we are of the outcome.
Let's call the desired metric H(X), or, in terms of probabilities, H(P_1, . . . , P_n), and first see which properties it should ideally possess.

Entropy

Desired property 1
H(P_1, . . . , P_n) is a continuous function of the P_i.


Entropy

Desired property 2
If all events X = xi are equally likely, Pi = 1/n, i = 1 . . . n. Then
H(1/n, . . . , 1/n) should be a monotonically increasing function of n.
As the number of equally likely events increases, our choice or
uncertainty increases.


Entropy: Desired Property 3


If a choice can be broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Example:

    H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)

[Figure 70: a two-stage choice tree for this example, with branch probabilities 1/2, 1/2 at the first level and 2/3, 1/3 below. Note that the net probabilities at the leaves (1/2, 1/3, 1/6) are the same.]


Entropy I

Theorem 11.1
The function satisfying the said three properties is
    H(X) ≜ H(P_1, . . . , P_n) = −Σ_{i=1}^{n} P_i log P_i                                 (11.4)


Definition 11.2 (A(n))
The entropy when all n choices are equally likely:

    A(n) ≜ H(P_1 = 1/n, P_2 = 1/n, . . . , P_n = 1/n)                                     (11.5)

Choice Tree

Figure 71: A choice tree with depth d = 3 and branching-factor b = 2.

Consider first levels 2 and 3, then also include level 1. Using property 3,

    A(2^3) = A(2^2) + 2^2 · (1/2^2) · A(2)
           = A(2) + 2^1 · (1/2^1) · A(2) + 2^2 · (1/2^2) · A(2)
           = 3 A(2).

Proof (Shannon, 1948), Part I


Consider a choice-tree with branching-factor b and depth d. Let it represent a total of b^d equally likely choices (leaves). Consider the sub-tree of depth d − 1 excluding the leaves of the original tree. By property 3, we have

    A(b^d) = A(b^{d-1}) + Σ_{i=1}^{b^{d-1}} (1/b^{d-1}) A(b)
           = A(b^{d-1}) + A(b),     similarly dividing A(b^{d-1}),
           = A(b^{d-2}) + 2 A(b),   continuing d times,
           = d A(b).                                                                      (11.6)

The only function satisfying (11.6) is A(n) ≜ k log n, k > 0, from property 2.

Proof (Shannon, 1948), Intermission

Knowing that H(P_1 = 1/n, P_2 = 1/n, . . . , P_n = 1/n) = k log n is nice, but what we're really after is the case H(P_1, P_2, . . . , P_n), where the P_i are arbitrary: this is proven in Part II.
Given any arbitrary values of the P_i, e.g. for n = 3, P_1 = 0.303, P_2 = 0.1417, P_3 = 0.5553, we can always write them as fractions of the form P_i = m_i / M. For the given example, M = 10,000, m_1 = 3030, m_2 = 1417, m_3 = 5553.
If any of the P_i is an irrational number (unlikely in a real situation), we can always approximate it by a rational number to any desired accuracy.

Proof (Shannon, 1948), Part II


Now consider M equally likely choices. We can break them down into a two-level choice-tree: the first level has n nodes x_i, i = 1 . . . n, with probabilities P_i = m_i / M, where Σ_{i=1}^{n} m_i = M. Therefore, Σ_{i=1}^{n} P_i = 1. Each node x_i then has m_i equally likely children.

[Figure 72: A two-level choice-tree: the first level branches with probabilities P_1 = m_1/M, P_2 = m_2/M, ..., P_n = m_n/M; node x_i then has m_i equally likely children.]

Proof (Shannon, 1948), Part II


So, property 3 gives

    A(M) = H(P_1, . . . , P_n) + Σ_{i=1}^{n} P_i A(m_i)
    H(P_1, . . . , P_n) = k (Σ_{i=1}^{n} P_i) log M − k Σ_{i=1}^{n} P_i log m_i,    with Σ_{i=1}^{n} P_i = 1,
                        = −k Σ_{i=1}^{n} P_i log (m_i / M)
                        = −k Σ_{i=1}^{n} P_i log P_i.

If we select k = 1 and log_2, we measure entropy in bits; for log_e, in nats; and for log_10, in bans.

Entropy

Thus, the entropy in bits of an RV is defined as

    H(X) ≜ E[−log_2 P(X)]                                                                 (11.7a)
         = −Σ_{i=1}^{|X|} P(x_i) log_2 P(x_i),       for DRVs                             (11.7b)
         = −∫_{D(X)} p(x) log_2 p(x) dx,             for CRVs                             (11.7c)
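A small helper matching (11.7b); the example PMFs are illustrative, not from the lecture.

```python
import numpy as np

def entropy_bits(pmf):
    """H(X) = -sum_i P(x_i) log2 P(x_i), with the convention 0*log2(0) = 0."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # drop zero-probability values (0 log 0 = 0)
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))       # 1.0 bit for a fair coin
print(entropy_bits([1/3, 1/3, 1/3]))  # log2(3) ~ 1.585 bits
```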

Computing 0 log 0
By L'Hôpital's rule,

    lim_{x→0+} x log x = lim_{x→0+} (log x) / (1/x) = lim_{x→0+} (1/x) / (−1/x^2) = lim_{x→0+} (−x) = 0.

Conditional Entropy
A choice-tree based derivation of conditional entropy was done in the class. The derivation can also be done purely mathematically as follows:

    H(Y | X) ≜ Σ_{i=1}^{|X|} P(x_i) H(Y | X = x_i) = −Σ_{i=1}^{|X|} P(x_i) Σ_{j=1}^{|Y|} P(y_j | x_i) log_2 P(y_j | x_i)      (11.8)
             = −Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(x_i, y_j) log_2 [ P(x_i, y_j) / P(x_i) ]                                        (11.9)
             = −Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(x_i, y_j) log_2 P(x_i, y_j) + Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(x_i, y_j) log_2 P(x_i)
             = H(X, Y) − H(X)                                                                                                 (11.10)

An Important Result Regarding Conditional Entropy I


[Figure 73: Plot of H(x) = −x log_2 x for x ∈ [0, 1]. It is a concave function.]

An Important Result Regarding Conditional Entropy II

For a concave function f(x), a version of Jensen's inequality applies:

    f( Σ_i λ_i x_i ) ≥ Σ_i λ_i f(x_i),     if Σ_i λ_i = 1 and λ_i ≥ 0.                    (11.11)

We now start from (11.9) and write it as

    H(Y | X) = Σ_{j=1}^{|Y|} Σ_{i=1}^{|X|} P(x_i) H( P(y_j | x_i) )                       (11.12)

An Important Result Regarding Conditional Entropy III


Since Σ_i P(x_i) = 1, we can use (11.11) with λ_i identified as P(x_i):

    H(Y | X) ≤ Σ_{j=1}^{|Y|} H( Σ_{i=1}^{|X|} P(x_i) P(y_j | x_i) )                       (11.13)
             = Σ_{j=1}^{|Y|} H( Σ_{i=1}^{|X|} P(x_i, y_j) )
             = Σ_{j=1}^{|Y|} H( P(y_j) )
             = H(Y),    by (11.7b).

    H(Y | X) ≤ H(Y)                                                                       (11.14)

Therefore, knowing some information X can only reduce (never increase) the uncertainty of Y.

Mutual Information
Mutual information of two RVs X and Y is defined as

    I(X, Y) ≜ Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(x_i, y_j) log_2 [ P(x_i, y_j) / (P(x_i) P(y_j)) ]     (11.15a)
            = H(Y) − H(Y | X)                                                                     (11.15b)
            = H(X) − H(X | Y)                                                                     (11.15c)
            = H(X, Y) − H(X | Y) − H(Y | X)                                                       (11.15d)
            = H(X) + H(Y) − H(X, Y)                                                               (11.15e)

The expression (11.15b) shows that the mutual information is also the information gain, i.e. the reduction in the uncertainty of Y on knowing X.
Applying (11.14) to (11.15b), we see that I(X, Y) is always non-negative.
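A small sketch of (11.15a) computed from a joint PMF; the joint table below is illustrative, not from the lecture.

```python
import numpy as np

P_xy = np.array([[0.30, 0.10],      # rows: values of X, columns: values of Y
                 [0.20, 0.40]])

def mutual_information_bits(P_xy):
    Px = P_xy.sum(axis=1, keepdims=True)      # marginal P(X)
    Py = P_xy.sum(axis=0, keepdims=True)      # marginal P(Y)
    mask = P_xy > 0
    return float(np.sum(P_xy[mask] * np.log2((P_xy / (Px * Py))[mask])))

print(mutual_information_bits(P_xy))          # I(X, Y) >= 0
```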

Venn Diagram

[Venn diagram: the two circles H(X) and H(Y) overlap in I(X, Y); the non-overlapping parts are H(X | Y) and H(Y | X), and the union is H(X, Y).]

Principle of Maximum Entropy


E.T. Jaynes, 1957

The principle of maximum entropy is a postulate which states that, subject to known constraints (called testable information), the probability distribution which is the least biased, i.e. which assumes the least prior information, is the one with the largest entropy.
Example: consider a boolean DRV X with P(X = True) = p. Then its entropy is

    H_b(p) ≜ H(p, 1 − p) = −p log p − (1 − p) log (1 − p).                                (11.16)

If no other prior information is available, then by differentiating the above w.r.t. p, we find that the maximum entropy is achieved if we select p = 1/2. The maximal value of the entropy is 1 bit.
Example: if only the mean μ and the variance σ² of a CRV are known beforehand, then, subject to these constraints, the Gaussian distribution has the maximum entropy.

Decision Tree Learning

Decision Trees
[Figure 74: A decision-tree for deciding whether to wait for a table. The root tests Patrons? (None → No, Some → Yes, Full → WaitEstimate?); deeper tests include Alternate?, Bar?, Hungry?, Reservation?, Fri/Sat?, and Raining?, with Yes/No decisions at the leaves.]

Learning Decision Trees from Examples


Table 1: To wait or not to wait

    Ex.   Alt  Bar  Fri  Hun  Patr   Prc   Rain  Rsrv  Type  Est     Wait?
    x1    Y    N    N    Y    Some   $$$   N     Y     Frn   0-10    y1  = Y
    x2    Y    N    N    Y    Full   $     N     N     Thai  30-60   y2  = N
    x3    N    Y    N    N    Some   $     N     N     Burg  0-10    y3  = Y
    x4    Y    N    Y    Y    Full   $     Y     N     Thai  10-30   y4  = Y
    x5    Y    N    Y    N    Full   $$$   N     Y     Frn   >60     y5  = N
    x6    N    Y    N    Y    Some   $$    Y     Y     Itl   0-10    y6  = Y
    x7    N    Y    N    N    None   $     Y     N     Burg  0-10    y7  = N
    x8    N    N    N    Y    Some   $$    Y     Y     Thai  0-10    y8  = Y
    x9    N    Y    Y    N    Full   $     Y     N     Burg  >60     y9  = N
    x10   Y    Y    Y    Y    Full   $$$   N     Y     Itl   10-30   y10 = N
    x11   N    N    N    N    None   $     N     N     Thai  0-10    y11 = N
    x12   Y    Y    Y    Y    Full   $     N     N     Burg  30-60   y12 = Y

(Input attributes: Alt, Bar, Fri, Hun, Patr, Prc, Rain, Rsrv, Type, Est; output: Wait?)

Aim
To build a decision-tree from examples which allows us, on average, to reach a decision as fast as possible, i.e. with the least number of checks.

Strategy
We check the attributes X_i (i.e. split the tree) in decreasing order of their mutual information I(X_i, Y) = H(Y) − H(Y | X_i) with the decision (class Y).


Information Gain on Selecting an Attribute for Splitting


Reminder:

    H_b(p) ≜ H(p, 1 − p) = −p log_2(p) − (1 − p) log_2(1 − p).                            (11.17)

Information-gain for attribute Type:

    I(Type, Wait) = H(Wait) − H(Wait | Type)
    H(Wait | Type) = Σ_{t = f,t,b,i} P(Type = t) Σ_{w = Y,N} h( P(w | t) )
                   = Σ_{t = f,t,b,i} P(Type = t) H_b( P(Wait = Y | t) ).

Information Gain on Selecting an Attribute for Splitting

We estimate the pmf of the boolean RV Wait? using the examples.


We get P(Wait? = Y ) = 6/12 = 0.5, so, P(Wait? = N) = 0.5.
Using (11.16), H(Wait? ) = 1 bit.
Type: Frn (+1,-1), Thai (+2,-2), Burg (+2,-2), Itl (+1,-1).
Patrons: None (+0, -2), Some (+4, -0), Full (+2,-4).


Information Gain on Selecting an Attribute for Splitting


Let us compute the conditional entropy H(Wait? | Type ) from
(11.8).
We will estimate the pmf of Type by the examples given:
P(Type = Frn ) = Pf = 2/12, P(Type = Thai ) = Pt = 4/12,
P(Type = Itl ) = Pi = 2/12, P(Type = Burg ) = Pb = 4/12.
We also need the conditional probabilities:
P(Wait? = Y | Type = Frn ) = Pwf = 1/2,
P(Wait? = Y | Type = Thai ) = Pwt = 2/4,
P(Wait? = Y | Type = Itl ) = Pwi = 1/2,
P(Wait? = Y | Type = Burg ) = Pwb = 2/4.
Using (11.8) and (11.16), we have,
H(Wait? | Type ) = Pf Hb (Pwf ) + Pt Hb (Pwt ) + Pi Hb (Pwi ) +
Pb Hb (Pwb ) = (2/12)1 + (4/12)1 + (2/12)1 + (4/12)1 = 1 bit. So our
information-gain is H(Wait? ) H(Wait? | Type ) = 0!


Information Gain on Selecting an Attribute for Splitting


Let us compute the conditional entropy H(Wait? | Patrons ) from
(11.8).
We will estimate the pmf of Patrons by the examples given:
P(Patrons = None ) = Pn = 2/12, P(Patrons = Some ) = Ps = 4/12,
P(Patrons = Full ) = Pf = 6/12.
We also need the conditional probabilities:
P(Wait? = Y | Patrons = None ) = Pwn = 0/2,
P(Wait? = Y | Patrons = Some ) = Pws = 4/4,
P(Wait? = Y | Patrons = Full ) = Pwf = 2/6.
Using (11.8) and (11.16), we have
H(Wait? | Patrons) = P_n H_b(P_wn) + P_s H_b(P_ws) + P_f H_b(P_wf) = (2/12)·0 + (4/12)·0 + (6/12)·H_b(1/3) = 0.5 x 0.9183 = 0.4591.
The information gain is H(Wait?) − H(Wait? | Patrons) = 0.5408 bits.

Our goal is to reach a decision as soon as possible, so we should split on the attribute which leads to the maximum information gain. In this example (after computing the gain also for all other attributes) it is Patrons, as the sketch below confirms.
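A small sketch that recomputes the two gains above from Table 1 (only the Type and Patrons columns are encoded; the representation is an assumption of this sketch):

```python
from collections import Counter
from math import log2

# (attribute dict, wait decision) for the 12 examples of Table 1.
examples = [
    ({"Type": "Frn",  "Patrons": "Some"}, "Y"), ({"Type": "Thai", "Patrons": "Full"}, "N"),
    ({"Type": "Burg", "Patrons": "Some"}, "Y"), ({"Type": "Thai", "Patrons": "Full"}, "Y"),
    ({"Type": "Frn",  "Patrons": "Full"}, "N"), ({"Type": "Itl",  "Patrons": "Some"}, "Y"),
    ({"Type": "Burg", "Patrons": "None"}, "N"), ({"Type": "Thai", "Patrons": "Some"}, "Y"),
    ({"Type": "Burg", "Patrons": "Full"}, "N"), ({"Type": "Itl",  "Patrons": "Full"}, "N"),
    ({"Type": "Thai", "Patrons": "None"}, "N"), ({"Type": "Burg", "Patrons": "Full"}, "Y"),
]

def H(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attr):
    all_labels = [y for _, y in examples]
    cond = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        cond += len(subset) / len(examples) * H(subset)
    return H(all_labels) - cond

print(information_gain("Type"))      # 0.0
print(information_gain("Patrons"))   # ~0.541
```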

Notation
Let A be the vector RV of all attribute-DRVs. Each DRV A_i has a domain of values {a_{i,j} | j = 1 . . . |A_i|}.
Let the example-set be denoted as

    X = {(A = a_k, Y = y_k) | k = 1 . . . |X|},    X_i ⊆ X.                               (11.18)

If x ∈ X, then we use the notation x.Y and x.A_i to denote its classification and the value of the i-th attribute, respectively.
The function Plurality-Value(X_i) returns the majority classification Y among all examples in X_i. It resolves ties randomly.
Then the call Decision-Tree-Learning(X, A, ∅) returns the decision-tree. The algorithm is given in the next slide.


Algorithm 39: Decision-Tree-Learning


input : Remaining examples X_r, set of remaining attributes A_r, parent-node examples X_p
output: a tree

if X_r = ∅ then return Plurality-Value(X_p) ;
else if ∀x ∈ X_r, x.Y = c then return c (leaf) ;
else if A_r = ∅ then return Plurality-Value(X_r) ;
else
    A* ← arg max_{A ∈ A_r} Information-Gain(A, Y, X_r) ;
    tree ← a new decision-tree with root-test A* ;
    foreach value a_i of attribute A* do
        X_r[A* = a_i] ← {x | x ∈ X_r and x.A* = a_i} ;
        Subtree s ← Decision-Tree-Learning(X_r[A* = a_i], A_r − {A*}, X_r) ;
        Add a branch to tree with label A* = a_i and subtree s ;
    return tree ;

Recall: The Original Decision Tree


[Figure: the original hand-built decision tree (Figure 74), repeated here for comparison with the learned tree on the next slide.]

The Faster Decision Tree Based on Information Gain


[Figure 75: The decision-tree deduced from the 12 examples of the training-set: Patrons? (None → No, Some → Yes, Full → Hungry? (No → No, Yes → Type? (French → Yes, Italian → No, Burger → Yes, Thai → Fri/Sat? (No → No, Yes → Yes)))). Some attributes are never checked to arrive at a decision.]

Other Information-Theoretic Criteria for Tree-Building


Information-gain is a good criterion for splitting if all attributes have an equal number of values. If not, it is biased towards attributes with a higher number of values. Example: an ID number.
In such cases, the gain-ratio may be a better criterion. For an attribute A and a decision-variable C, it is defined as

    GR(A) = I(A, C) / H(A)                                                                (11.19)

It penalizes a higher number of attribute values through the term H(A), which is defined by (11.7b) and computed as

    H(A) = −Σ_{i=1}^{|A|} [(p_i + n_i) / (p + n)] log_2 [(p_i + n_i) / (p + n)]           (11.20)

Usually (Quinlan, 1986), the gain-ratio is only computed for attributes with an above-average value of the information-gain I(A, C), and only these attributes are considered candidates for splitting.

Decision Tree with a Non-Binary Decision

If we have k classes, e.g. for mushrooms: edible, poisonous, and hallucinogenic, we can easily generalize the decision-tree building by using (11.9) to compute the conditional entropy H(Class | Attribute) while computing the information-gain.
This means that you cannot use the shortcut entropy expression H_b(P), since it applies only to binary RVs. You have to use the general entropy expression with all |Class| terms in the summation.


Cross-Validation

Model-Complexity Selection
[Figure 76: Different hypothesis functions h (linear, seventh-degree polynomial, sixth-degree polynomial, sinusoidal) fitted to the same data, panels (a)-(d). Which are preferable?]

Ockham's Razor
Entities must not be multiplied beyond necessity. In other words, we should tend towards simpler theories until more complicated theories become necessary to explain new observations. The field of Model Selection studies these ideas, many of them based on the information entropy: MDL (minimum description length), AIC (Akaike information criterion), etc.

Error-Rate
It is the proportion of mistakes a given hypothesis makes, i.e. the proportion of times h(x) ≠ y.


Recall: Holdout Cross-Validation


Split the available examples randomly into a training-set, from which the learning algorithm produces a hypothesis, and a test-set, on which the accuracy of h is evaluated. Disadvantage: we cannot use all examples for finding h.

[Figure 77: Holdout cross-validation: the examples e_1, ..., e_m are split into a training part and a test part.]



k-fold Cross-Validation
Divide the available examples into k equal subsets. Perform k rounds of learning: in each round, (1 − 1/k) of the data is used as the training-set and the remaining 1/k of the data is used as the test-set (now called the validation-set). Then the average test score of the k rounds is taken as the performance measure. Typical choices are k = 5, 10, m; k = m is called leave-one-out cross-validation (LOOCV).

[Figure 78: 5-fold cross-validation, iteration 3/5: one of the five parts of e_1, ..., e_m serves as the validation set, the rest as the training set.]



k-Fold-Cross-Validation
Algorithm 40: k-Fold-Cross-Validation
input : Learner, a learning-algorithm; s, a model-complexity parameter;
        k; examples
output: Average training-set error-rate eT, average validation-set error-rate eV

eV ← 0, eT ← 0 ;
for i = 1 to k do
    St, Sv ← Partition(examples, i, k) ;
    h ← Learner(s, St) ;
    eT ← eT + Error-Rate(h, St) ;
    eV ← eV + Error-Rate(h, Sv) ;
return (eT/k, eV/k)
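A minimal Python sketch of Algo. 40; the fold construction, the dummy majority-class learner and the error function are illustrative assumptions, not the course's code.

```python
import random

def k_fold_cross_validation(learner, s, k, examples, error_rate):
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]          # k roughly equal subsets
    e_T = e_V = 0.0
    for i in range(k):
        S_v = folds[i]                                   # validation set of round i
        S_t = [x for j, f in enumerate(folds) if j != i for x in f]
        h = learner(s, S_t)
        e_T += error_rate(h, S_t)
        e_V += error_rate(h, S_v)
    return e_T / k, e_V / k

# Tiny usage: a "learner" that returns the majority class of labeled pairs (x, y).
data = [(i, i % 2) for i in range(20)]
majority = lambda s, S: max(set(y for _, y in S), key=[y for _, y in S].count)
err = lambda h, S: sum(1 for _, y in S if y != h) / len(S)
print(k_fold_cross_validation(majority, None, 5, data, err))
```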


Model Selection Complexity vs. Goodness of Fit


Algorithm 41: Cross-Validation-Wrapper

input      : Learner, k, examples
output     : Model of optimal complexity
local vars.: eT, an array indexed by model-complexity s, storing training-set error-rates;
             eV, an array indexed by model-complexity s, storing validation-set error-rates.

for s = 1 to ∞ do
    (eT[s], eV[s]) ← k-Fold-Cross-Validation(Learner, s, k, examples) ;
    if eT has converged then
        s* ← value of s with minimum eV[s] ;
        return Learner(s*, examples)


Example: Decision-Tree Learning


s = number of nodes.
The Learner passed to the wrapper is Decision-Tree-Learning (Algo. 39). It builds the tree breadth-first, rather than depth-first (still using the information-gain criterion), and stops when the maximum specified s is reached.

[Plot: training-set and validation-set error rates versus tree size (number of nodes).]

Using Learning Curves for Comparison of Learners

[Figure 79: From "Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection", by Ratanamahatana et al.]


Unsupervised Learning

Clustering
Clustering partitions the available data x_i (feature-vectors) into clusters, such that samples belonging to a cluster are similar according to some criterion. It finds usage in:
Data analysis and compression
Pattern recognition
Image segmentation
Bioinformatics


Clustering example: data-compression


We have an image stored with 24 bits/pixel (16 million colors). Denote these N pixels as x_j, j = 1 . . . N, each encoded in 24 bits.
We want to compress it to 8 bits/pixel (256 colors).
Color quantization: we want to devise a palette (colormap) of 256 colors which approximates the colors present in the original image as closely as possible.
This means that we want to find m_i, i = 1 . . . k (k = 256), each encoded again in 24 bits, such that each original pixel's color can be approximated by the m_i nearest to it in a color-space.
The m_i are called codebook vectors or code-words.
Ideally, we should choose the code-words m_i so as to minimize the reconstruction error

    E(m_1, . . . , m_k) = Σ_{j=1}^{N} min_p ‖x_j − m_p‖                                   (11.21)

k-Means Clustering

The k-Means Algorithm


Algorithm 42: k-Means

input : Samples x_j, j = 1 . . . N; number of clusters k
output: Cluster-centers / codebook-vectors m_i, i = 1 . . . k

Initialize m_i, e.g. to k random x_j ;
repeat
    for j ← 1 . . . N do
        b_{j,i} = 1 if ‖x_j − m_i‖ = min_p ‖x_j − m_p‖, and 0 otherwise ;
    for i ← 1 . . . k do
        m_i ← (Σ_{j=1}^{N} b_{j,i} x_j) / (Σ_{j=1}^{N} b_{j,i}) ;
until all m_i have converged;
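A minimal NumPy sketch of Algo. 42; the data, k, and the initialization choice are illustrative assumptions.

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=np.random.default_rng(0)):
    m = X[rng.choice(len(X), size=k, replace=False)]        # init centers to k random samples
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)   # N x k distances
        assign = d.argmin(axis=1)                            # hard assignment b_{j,i}
        new_m = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):                            # all centers converged
            break
        m = new_m
    return m, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, labels = k_means(X, k=2)
print(centers)
```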


Animation of k-means

Figure 80: http://cs.joensuu.fi/sipu/clustering/animator/


The Dunn Index

The Dunn Index


A metric for evaluating clustering algorithms

Step 1: A metric Δ_i for intra-cluster distance (cluster-size)
Let x and y be feature-vectors assigned to the same cluster C_i. Then, the following are possible metrics for the size of C_i:

    Δ_i = max_{x,y ∈ C_i} ‖x − y‖                                                         (11.22)
    Δ_i = [1 / (|C_i| (|C_i| − 1))] Σ_{x,y ∈ C_i, x ≠ y} ‖x − y‖                          (11.23)
    Δ_i = (1 / |C_i|) Σ_{x ∈ C_i} ‖x − μ_i‖,   where μ_i is the cluster mean.             (11.24)

The Dunn Index


A metric for evaluating clustering algorithms

Step 2: A metric δ(C_i, C_j) for inter-cluster distance
The following are possible metrics for the inter-cluster distance:

    δ(C_i, C_j) = max_{x ∈ C_i, y ∈ C_j} ‖x − y‖                                          (11.25)
    δ(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} ‖x − y‖                                          (11.26)
    δ(C_i, C_j) = ‖μ_i − μ_j‖.                                                            (11.27)

The Dunn Index


A metric for evaluating clustering algorithms

Step 3: The Dunn index
Let there be m clusters C_1, C_2, . . . , C_m. Then

    DI_m ≜ [ min_{1≤i≤m} min_{1≤j≤m, j≠i} δ(C_i, C_j) ] / [ max_{1≤k≤m} Δ_k ]             (11.28)

The Dunn Index


A metric for evaluating clustering algorithms

    DI_m ≜ [ min_{1≤i≤m} min_{1≤j≤m, j≠i} δ(C_i, C_j) ] / [ max_{1≤k≤m} Δ_k ]

Higher values are better.
If m is not known, the m for which DI_m is the highest can be chosen.
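A small sketch of (11.28), using the diameter (11.22) and the minimum inter-cluster distance (11.26); the random clusters are illustrative.

```python
import numpy as np
from itertools import combinations

def dunn_index(clusters):
    """clusters: list of (n_i, d) arrays, one array per cluster."""
    diameters = [max(np.linalg.norm(x - y) for x, y in combinations(C, 2)) if len(C) > 1 else 0.0
                 for C in clusters]
    min_sep = min(np.linalg.norm(x - y)
                  for Ci, Cj in combinations(clusters, 2) for x in Ci for y in Cj)
    return min_sep / max(diameters)

rng = np.random.default_rng(0)
clusters = [rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))]
print(dunn_index(clusters))      # tight, well-separated clusters give a high value
```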

The EM Algorithm

Expectation-Maximization (EM) Algorithm


Parametric Clustering

[Figure 81: A Gaussian mixture computed from 500 samples with weights (left to right) 0.2, 0.3, 0.5; panels (a) and (b).]

A Gaussian Mixture Model (GMM) for Clustering


Recall: Multivariate Gaussian

    N(X = x; μ, Σ) ≜ [1 / ((2π)^{n/2} |Σ|^{1/2})] exp( −(1/2) (x − μ)^T Σ^{-1} (x − μ) )  (11.29)

GMM pdf

    p(x) = Σ_{i=1}^{K} P(C = i) N(x; μ_i, Σ_i),    where                                  (11.30)
    Σ_{i=1}^{K} P(C = i) = 1.                                                             (11.31)

Expectation/ Expected Value


If X is a DRV, a function f(X = x) has the expected value

    E[f] ≜ Σ_{i=1}^{|X|} P(x_i) f(X = x_i)                                                (11.32)

For vector DRVs, the definition can be extended simply:

    E[f] ≜ Σ_{i=1}^{|X|} P(x_i) f(X = x_i)                                                (11.33)

If X is a CRV, a function f(X = x) has the expected value

    E[f] ≜ ∫_{D(X)} p(x) f(x) dx                                                          (11.34)

This can also be generalized similarly to vector CRVs.

Covariance

For a vector RV (discrete or continuous) X ∈ R^n,

    Cov(X) ≜ E[ (X − E[X]) (X − E[X])^T ]                                                 (11.35)

Cov(X) ∈ R^{n×n}, and is symmetric as well as guaranteed to be at least positive semi-definite. In practice, it is usually positive-definite.

Mahalanobis Distance
If X ∈ R^n is a normally distributed vector continuous RV (CRV), its normal/Gaussian pdf is defined as

    N(X = x; μ, C) ≜ [1 / ((2π)^{n/2} |C|^{1/2})] exp( −(1/2) (x − μ)^T C^{-1} (x − μ) )  (11.36)

For a sample of a CRV X = x, the Mahalanobis distance can be used to compare the weighted distance of samples from the mean of a Gaussian distribution. It is defined as

    Δ²_C(x) ≜ (x − μ)^T C^{-1} (x − μ).                                                   (11.37)

Mahalanobis Distance

    Δ²_C(x) ≜ (x − μ)^T C^{-1} (x − μ).

A smaller value represents more confidence. Example: if x = (x, y)^T, μ = (μ_x, μ_y)^T, and C = diag(σ_x², σ_y²), then

    Δ²_C = (x − μ_x)² / σ_x² + (y − μ_y)² / σ_y².                                         (11.38)

The Mixture Distribution


A datum x is sampled from a mixture of K (known) components in two steps:
1. Choose a component: sample the pmf P(C = i), i = 1 . . . K.
2. Generate a sample from the component: sample the pdf p(x | C = i).
The likelihood of the sample x is then

    p(x) ≜ Σ_{i=1}^{K} P(C = i, x) = Σ_{i=1}^{K} P(C = i) p(x | C = i)                    (11.39)

If the p(x | C = i) are multivariate Gaussians, we have a Mixture of Gaussians. Each Gaussian is parameterized by its mean μ_i and covariance Σ_i. The parameter P(C = i) is called the weight of the i-th component. These taken together for all i form our parameter-vector θ.

Problem 11.3 (Gaussian Mixture)


Given data x_j, j = 1 . . . n, and the number of components K, estimate the parameters P(C = i), μ_i, and Σ_i, i = 1 . . . K.

Chicken or Egg Problem
If we knew which component i generated each x_j, we could estimate the Gaussian parameters by maximizing their likelihood.
If we knew the Gaussian parameters for all components, we could assign each x_j to the component with the minimum Mahalanobis distance and hence estimate P(C = i).
Problem is, we know neither!
Hence we formulate an algorithm which iteratively increases the expectation of the likelihood of the data.
Initialization: choose some reasonable random values for all parameters θ_i: P(C = i), μ_i, and Σ_i, i = 1 . . . K.

The EM Algorithm: The E-Step


Let us define a hidden boolean DRV Z_ij: if x_j was generated by component i, Z_ij = True; if not, Z_ij = False.
Let us estimate P(Z_ij):

    P(Z_ij = True) = P(C = i | x_j) = α p(x_j | C_i) P(C_i)                               (11.40)

The last expression can be evaluated based on the parameters from the previous iteration of the EM algorithm.
The expected count of data-samples in category i out of n samples can be computed as

    n_i = Σ_{j=1}^{n} I(Z_ij = True)                                                      (11.41)
    n̂_i ≜ E[n_i] = Σ_{j=1}^{n} P(Z_ij = True).                                            (11.42)

The EM Algorithm: The M-Step


Update the estimates of the Gaussian parameters of each component i = 1 . . . K as follows (order important):

    μ_i ← (1 / n̂_i) Σ_{j=1}^{n} P(Z_ij = True) x_j                                        (11.43)
    Σ_i ← (1 / n̂_i) Σ_{j=1}^{n} P(Z_ij = True) (x_j − μ_i)(x_j − μ_i)^T                   (11.44)
    P(C = i) ← n̂_i / n.                                                                   (11.45)
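A compact NumPy/SciPy sketch of one possible implementation of (11.40)-(11.45); the data, K, the initialization, and the small regularization term are assumptions of this sketch, not the lecture's code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, rng=np.random.default_rng(0)):
    n, d = X.shape
    weights = np.full(K, 1.0 / K)                       # P(C = i)
    means = X[rng.choice(n, K, replace=False)]          # mu_i
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])  # Sigma_i
    for _ in range(n_iter):
        # E-step: responsibilities P(Z_ij = True) = alpha * p(x_j | C_i) P(C_i)
        R = np.column_stack([weights[i] * multivariate_normal.pdf(X, means[i], covs[i])
                             for i in range(K)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update mu_i, Sigma_i, P(C = i) using the expected counts
        n_i = R.sum(axis=0)
        means = (R.T @ X) / n_i[:, None]
        for i in range(K):
            diff = X - means[i]
            covs[i] = (R[:, i, None] * diff).T @ diff / n_i[i] + 1e-6 * np.eye(d)
        weights = n_i / n
    return weights, means, covs

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
w, mu, Sigma = em_gmm(X, K=2)
print(w, mu, sep="\n")
```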

The EM Algorithm: An Important Property

In each E+M step, the log-likelihood of the whole data increases or stays the same. The log-likelihood is defined, using (11.39), as

    L(x_{1:n}) = Σ_{j=1}^{n} log p(x_j)                                                   (11.46)
               = Σ_{j=1}^{n} log [ Σ_{i=1}^{K} P(C = i) p(x_j | C = i) ].                 (11.47)

The proof is outside the scope of this course, but uses Jensen's inequality.

The EM Algorithm: Example Result


[Figure 82: Log-likelihood L as a function of the EM iteration number. The horizontal dashed line is the log-likelihood of the true mixture.]

The EM Algorithm: Potential Problems

K unknown.
A Gaussian component may shrink to cover just one point: in this
case, its covariance determinant becomes 0 and the likelihood blows
up. Restart with a new better initial guess.


Mean-Shift Clustering

Mean-Shift
Non-Parametric Clustering

Up to now the number of clusters was always given (e.g. k = 256 in the image-compression example).
What if we want the algorithm to figure out the number of clusters, if we give it some hints about the structure: the significance of distances in each of the dimensions of the vector x? These are called bandwidths.
The mean-shift algorithm makes clusters by finding the basins of attraction of the local peaks of the pdf p(x).
The pdf p(x) is estimated by kernel density estimation (KDE).
Detailed derivation done in the class.
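A minimal 1-D KDE sketch with a Gaussian kernel, as used to estimate p(x) here; the samples and bandwidths are illustrative.

```python
import numpy as np

def kde(query, samples, h):
    """Average of Gaussian kernels of bandwidth h centered at the samples."""
    u = (query[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

samples = np.random.randn(200)              # 200 samples from a standard Gaussian
xs = np.linspace(-4, 4, 9)
for h in (2.0, 0.3, 0.1):                   # the bandwidths of Figure 83
    print(h, np.round(kde(xs, samples, h), 3))
```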


Kernel Density Estimation

[Figure 83: KDE using 200 samples from a standard Gaussian, compared with the reference density dnorm(x); the KDE uses a Gaussian kernel with bandwidths 2.0, 0.3, and 0.1.]

[Figure 84, panels (a) hx = 30, hy = 30; (b) hx = 60, hy = 60; (c) hx = 60, hy = 30: effect of the bandwidth on the clustering of randomly generated points on a 640 x 480 grid. The trajectories of all points during the mean-shift iterations are plotted.]


[Figure 85, panels (a) hx = 30, hy = 30; (b) hx = 60, hy = 60: clustering an RGBD image from a Kinect in the multi-dimensional space of color (Luv) (hL = 10, huv = 20), pixel-coordinates (hp = 15 pixels), local normals (hn = 45°), and range (hr = 15 mm).]


Color Spaces

[Figure 86: (a) the RGB and (b) the Hue-Saturation-Value color spaces. Credits: Wikipedia.]


Perceptually Uniform Color-Spaces


CIELuv, CIELab

[Figure 87: the perceptually uniform CIELuv/CIELab color spaces. Credits: Wikipedia.]


Home-work Assignments

Contents

Home-assignment 1
Home-assignment 2
Home-assignment 3
Home-assignment 4
Home-assignment 5


Home-assignment 1

HA 1: A Path-Planning and Simulated Annealing I


Groups of 2. Due date 27.9., 23:59

[Figure 88: The input map. (a) A 2D occupancy grid laser map of the Intel lab. (b) The thresholded map (t = 0.55).]


HA 1: A Path-Planning and Simulated Annealing II


Groups of 2. Due date 27.9., 23:59

1. Code the general Graph-Search Algorithm 4 and specialize it to Dijkstra and A*. Test your code on the map shown in Fig. 88 for the start and goal points given in astar.py. Your program should work for any given start and goal points. The data-files for this homework are available on Campus-Net. (40%)
1.1 The thresholded map is available as a gzipped pickled numpy.array object in the uploaded map1c.p.gz. Code to extract this object is in the uploaded file astar.py, which contains some skeleton code. If you prefer working in C++, request the TA to provide you the map in a text format.
1.2 Code your own Node and Graph classes. In this home-work, use of external graph-libraries is not required.
1.3 Write all your code in a module and in the unit-testing section of the module. The resultant path should be dumped in a file map path.png. Look at the functions of matplotlib.image.

HA 1: A Path-Planning and Simulated Annealing III


Groups of 2. Due date 27.9., 23:59

1.4 Each pixel is a node. If only 4-successors (up-down-left-right) are considered, each edge has cost 1. If the diagonal successors are also considered, their edge cost is √2. Your program should be configurable to work both ways. Pixel value 1 means that the cell is free, value 0 means that it is occupied.
1.5 Both Dijkstra and A* find optimal paths. Compare them in terms of the final costs of the found paths. If both paths are optimal, why are they not the same? Write your answer to this in the header of the module file.
1.6 For implementing the priority-queue, look at http://docs.python.org/library/heapq.html. Your node class should override the __cmp__(...) and __hash__() functions for it to work with the heapq. See also Fig. 89.

Solution: See astar.py in the solutions on Campus-net.


HA 1: A Path-Planning and Simulated Annealing IV


Groups of 2. Due date 27.9., 23:59

[Figure 89: An example UML design for homework 1; feel free to alter it. Base classes Node, Frontier, and GraphSearch are abstract classes which do not care about implementation details. Thus, the same structure can be used for all graph-search variants.]


HA 1: A Path-Planning and Simulated Annealing V


Groups of 2. Due date 27.9., 23:59

2. Answer the Why? in Lemma 4.7 using the hint provided. (25%)
Solution: We're given that the node chosen by F_j is x_j, and the node chosen by F_{j+1} is x_{j+1}. We consider the following cases:
Case I: At iteration j + 1, x_{j+1} is a child of x_j. Therefore, [x_0 : x_j] is a sub-path of [x_0 : x_{j+1}]. This follows from Lemma 4.6, which states that the optimum paths to x_j and x_{j+1} respectively were found when they were selected. When we combine this with Lemma 4.5, we get that the cost of x_j is no greater than the cost of x_{j+1}, because x_j lies on the optimum path to x_{j+1}.
Case II: At iteration j + 1, x_{j+1} is not a child of x_j. In this case, at step j, x_{j+1} must already have been a part of F_j, because F_{j+1} = F_j ∪ {un-dead children of x_j} − {x_j}. Hence, the cost of x_{j+1} was not updated at iteration j + 1, because its parent was not changed by Resolve-Duplicate to x_j; therefore the cost of x_{j+1} in F_j equals its cost in F_{j+1}. But since x_{j+1} was not chosen by F_j, its cost in F_j must be at least the cost of x_j; combined with the previous statement, the cost of x_{j+1} is at least the cost of x_j.


HA 1: A Path-Planning and Simulated Annealing VI


Groups of 2. Due date 27.9., 23:59

3. Implement a sampler class for a PMF of a discrete random variable A as explained in Jump to Location. It should be initialized with a list of probabilities P(A = a_i), i = 1 . . . n, summing to unity. A list of descriptive labels (e.g. ["child selected", "child not selected"], etc.) for each a_i may be given during initialization also. A function choose() should return the index of the selected a_i.
Write a simple unit-test which initializes a PMF

    P(A): P(a_1) = 0.3, P(a_2) = 0.15, P(a_3) = 0.05, P(a_4) = 0.4, P(a_5) = 0.1,

and draws 10,000 samples. Plot a histogram showing the observed relative frequencies of the a_i. Do they correspond to the provided probabilities? (5%)
Solution: See pmf.py in the solutions on Campus-net.


HA 1: A Path-Planning and Simulated Annealing VII


Groups of 2. Due date 27.9., 23:59

4. The uploaded file pt coords list.p.gz contains a list of 25 tuples. The file tsp.py shows some sample code to read in the file and test your code. Each tuple contains the (x, y) coordinates of a point. A robot arm has the task to punch all points with a tool mounted on its end-effector. In which order should the end-effector move from point to point without visiting any point twice, so that the total traversal path-cost (sum of Euclidean inter-point distances) is minimized and no point is left unvisited? The starting and ending point is the first point. Solve the problem using Simulated Annealing coded in Python or C++. (30%)
4.1 A potential solution is a permutation of the list of indices from 1 to 24, the index 0 being the first and the last point in the path.
4.2 For generating a random child of the current permutation list, you could use random.shuffle on a randomly selected sub-list of the parent.


HA 1: A Path-Planning and Simulated Annealing VIII


Groups of 2. Due date 27.9., 23:59

4.3 For deciding whether or not to select a worse child, you need to compute the Boltzmann probability at the current T and sample the boolean random-variable using the code of the sampler class from the last part.
4.4 Select a schedule for the temperature based on the advice given in the quote from Numerical Recipes in C. You have to experiment with it to get the best results.
4.5 Make a plot of the iteration number vs. the current total traversal cost. Also show the iteration number vs. T (the schedule).
4.6 Plot the final path and print its cost.

Solution: Four different heuristics were used in child creation: 1) random sub-path reversal with p = 0.4; 2) insertion of a random sub-path at a random location with p = 0.4; 3) sub-path shuffle with p = 0.1; 4) random swap of two points with p = 0.1. Refer to the code in tsp.py available online. The minimum tour cost is about 459. The plots obtained were as shown in the following figures.

HA 1: A Path-Planning and Simulated Annealing IX


Groups of 2. Due date 27.9., 23:59

[Plot (a): the final TSP path over the 25 points (coordinates in the range 0-100).]

HA 1: A Path-Planning and Simulated Annealing X

Groups of 2. Due date 27.9., 23:59

[Plot (b): the total traversal cost (energy) and the temperature T versus the iteration number, for a schedule parameter of 0.5.]

HA 1: A Path-Planning and Simulated Annealing XI


Groups of 2. Due date 27.9., 23:59

5. Only for AI Lab participants: Implement the funnel-planning Algo. 8


and test it for several start and goal points for the map of problem 1.
Your code should show the result (map + funnel-sequence) in a
pop-up window and print the cost of the path on the console.


Home-assignment 2

HA 2
Groups of 2. Due date 21.10., 23:59

1. There is another way to formulate alpha-beta pruning: this variation
is called the negmax formulation, as opposed to the minimax approach
which we covered in the class. It is given at the end of this homework
in Algo. 43 (from Knuth and Moore, 1975). It consists of just one
function F which recursively calls itself. Show a manual trace of the
run of this algorithm on the 2-ply game of Fig. 44. The trace starts
with the call F(A, −∞, +∞), where A is the root node. You could do
this cleanly on a sheet of paper and then scan it for submission.
(25%)


Solution: a hand trace of F(A, −∞, +∞) on the 2-ply game of Fig. 44,
recording at each node the (α, β) arguments passed in, the successive
values of t returned by the children, and the value finally returned;
the root call returns the value 3.


2. Find the resolution-closure RC(S) of the set S consisting of the
following clauses:
C1 : A B C
C2 : A B C D E
C3 : E F G D
C4 : G E
Is this set of clauses in RC(S) satisfiable? If yes, find a model which
satisfies all the clauses in RC(S) using Algo. 27 (ModelConstruction).
(10% + 10%)


Solution: We proceed as in the worked-out example given after
Algorithm 27. We first write the clauses in CNF with symbols
appearing alphabetically:
C1 : A B C
C2 : A B C D E
C3 : D E F G
C4 : E G
As per Remark 7.16, a resolution like that of C1 with C2 results in a
clause containing the complementary pair B, ¬B and is therefore ≡ True;
it does not provide any new information, and such pairs will not be
mentioned henceforth. Pairs like C1 and C3 cannot be resolved, as they
do not have any complementary literals: such pairs will also not be
mentioned in the

following. In the first iteration, we get the following new clauses on
resolution amongst C1, . . . , C4:
C5 = C2 ⊗ C3 : A B C D F G
C6 = C2 ⊗ C4 : A B C D G
C7 = C3 ⊗ C4 : D E F
In the next iteration, we get the following new clauses from resolution
amongst C1, . . . , C7, ignoring the pairs already considered:
C8 = C2 ⊗ C7 : A B C D F
C9 = C3 ⊗ C6 : A B C D E F
We ignore duplicates like C4 ⊗ C5, which gives C9 again. In the next
iteration, we try to obtain new clauses from resolution amongst

C1 , . . . , C9 , ignoring the pairs already considered: however, in this


iteration, we do not get any new clauses: for many pairs, the
resolvent clauses already exist; for the rest of the pairs, they are either
not resolvable, or their resolvent is equivalent to true. Thus
RC (S) = {C1 , . . . , C9 }.
Since the resolution-closure does not contain the empty clause, there
exists a model which satisfies all the clauses in it. To find such a
model, we use Algorithm 27. Taking the order of symbols as
[P1 , . . . , Pn ] = [A, B, C , D, E , F , G ], we provide the execution trace:
i=1 A = True , C1 is now already True in this partial model
and need not be considered anymore.
i=2 B = True
i=3 C = True , C2 , C5 , C6 , C8 , C9 are now already True in
this partial model and need not be considered anymore.

i=4 D = True , C3, C7 are now already True in this partial
model and need not be considered anymore.
i=5 E = True
i=6 F = True , no clause remains which depends on F.
i=7 G = False , because under the previous assignments the only
literal of C4 that is not yet falsified is ¬G.
The above model is seen to satisfy all the clauses in RC(S), and in
particular S. A different initial order will give a different model.
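For reference, a small sketch of how RC(S) can be computed mechanically, assuming clauses are represented as frozensets of (symbol, sign) pairs; this is an illustration of the idea, not the algorithm from the lecture notes:

from itertools import combinations

def resolvents(ci, cj):
    """All non-tautological resolvents of two clauses.
    A clause is a frozenset of literals; a literal is (symbol, is_positive)."""
    out = set()
    for (sym, pos) in ci:
        if (sym, not pos) in cj:
            merged = (ci - {(sym, pos)}) | (cj - {(sym, not pos)})
            # Skip resolvents that are tautologies (contain P and not-P).
            if not any((s, not p) in merged for (s, p) in merged):
                out.add(frozenset(merged))
    return out

def resolution_closure(clauses):
    """Iterate resolution until no new clause appears; returns RC(S)."""
    closure = set(frozenset(c) for c in clauses)
    while True:
        new = set()
        for ci, cj in combinations(closure, 2):
            new |= resolvents(ci, cj)
        if new <= closure:
            return closure
        closure |= new

The loop terminates because only finitely many distinct clauses exist over a fixed set of symbols; if the empty clause ever appears in the closure, the set is unsatisfiable.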


3. Convert the following set of sentences to CNF and give a trace of the
execution of DPLL (2nd version, Algo. 25) on the conjunction of
these clauses.

S1 : A ⇔ (C ∨ E)        (12.1a)
S2 : E ⇒ D              (12.1b)
S3 : (B ∧ F) ⇒ ¬C       (12.1c)
S4 : E ⇒ C              (12.1d)
S5 : C ⇒ F              (12.1e)
S6 : C ⇒ B              (12.1f)

(20%)


Solution: The CNFs are
S1 : C11 ∧ C12 ∧ C13 = (¬A ∨ C ∨ E) ∧ (¬C ∨ A) ∧ (¬E ∨ A)
S2 : C2 = ¬E ∨ D
S3 : C3 = ¬B ∨ ¬F ∨ ¬C
S4 : C4 = ¬E ∨ C
S5 : C5 = ¬C ∨ F
S6 : C6 = ¬C ∨ B

Solution: We use the decimal notation for recursive calls of DPLL.


The set of initial clauses is denoted by F.

DPLL-1(F = {C11 , . . . , C6 })

In the second while-loop, D is found as a pure literal. Therefore,

F ← F | D, which does not contain C2 anymore.

Although a literal u can now be chosen arbitrarily, we decide to choose
the one with the maximum Jeroslow-Wang (JW) metric w(F, ℓ), as
discussed in the class. To review:

w(F, ℓ) = Σ_{k ≥ 1} 2^(−k) d_k(F, ℓ),          (12.3)

where d_k(F, ℓ) is the number of clauses of length k in F which contain
the literal ℓ. It can be verified that the maximum value is
w(F, ℓ = ¬C) = 7/8. So we choose u = ¬C.

DPLL-1.1(F | ¬C). F ← F | ¬C now contains the reduced clauses:
C11 : {¬A, E}, C13 : {¬E, A}, C4 : {¬E}.
The first while loop finds a unit clause C4 : {¬E}. Therefore, it sets
E = 0. F ← F | ¬E gives only one remaining reduced clause,
C11 : {¬A}, which is also unitary. The second iteration of the while loop
then sets A = 0. F ← F | ¬A is now empty.
Since F is empty, Satisfiable is returned. The current partial model
is D = 1, C = 0, E = 0, A = 0, which satisfies all clauses of the original
set.
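A small sketch of the JW computation, using the clause set as reconstructed above (the literal representation is illustrative; this is not the course's Algo. 25):

from collections import defaultdict

def jeroslow_wang(clauses):
    """w(F, l): sum of 2^(-len(clause)) over the clauses containing l.
    Clauses are lists of literals; a literal is (symbol, is_positive)."""
    w = defaultdict(float)
    for clause in clauses:
        for lit in set(clause):
            w[lit] += 2.0 ** (-len(clause))
    return w

# Reduced clause set from the trace (C2 already removed via the pure literal D):
F = [
    [("A", False), ("C", True), ("E", True)],    # C11
    [("C", False), ("A", True)],                 # C12
    [("E", False), ("A", True)],                 # C13
    [("B", False), ("F", False), ("C", False)],  # C3
    [("E", False), ("C", True)],                 # C4
    [("C", False), ("F", True)],                 # C5
    [("C", False), ("B", True)],                 # C6
]
weights = jeroslow_wang(F)
best = max(weights, key=weights.get)
print(best, weights[best])   # ('C', False) 0.875, i.e. the literal ¬C with 7/8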


4. We are given the following seven 2-clauses of a 2SAT problem:
A B,  A C,  A D,  D C,
D B,  E C,  B C.
4.1 Solve the 2SAT problem by using an implication graph: either prove
the clauses unsatisfiable, or if they are satisfiable, find a model.
4.2 For finding the strongly-connected components, use the algorithm from
the Cormen et al. (CLRS) book, given in Fig. 52. It uses the version of
DFS shown in Figs. 50 and 51.
(25%)
Solution:

Figure 90: Prob. 4.1. The implication graph G(V, E) and the first DFS on G;
the discovery/finishing times u.d/u.f are recorded at each vertex.

Decreasing order of u.f from the first DFS of G:
E, B, D, C, E, A, A, C, D, B.

Figure 91: Prob. 4.1, 4.2. The second DFS on G^T, taking the vertices in
decreasing order of their finishing time in the first DFS.

The strongly connected components found are
T1 : E,   T2 : B, C, A, D,   T3 : E,   T4 : A, D, B, C.

Figure 92: The strongly connected components: all u ∈ Ti have f(u) = i.

Since there does not exist a symbol X s.t. it and its negation belong to the
same SCC, a model exists. Following Lemma 7.26, a model can be
constructed as A = 1, B = 0, C = 0, D = 1, E = 0.
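Because the literal signs did not survive in the figures above, the following is only a generic sketch of the implication-graph/SCC procedure (a Kosaraju-style double DFS in the spirit of the CLRS algorithm referenced in 4.2); the encoding of literals as signed integers and all helper names are my own:

from collections import defaultdict

def solve_2sat(n_vars, clauses):
    """Clauses are pairs of literals; a literal is +v or -v for variable v in
    1..n_vars. Returns None if unsatisfiable, else a dict {v: bool}."""
    def idx(lit):                # map literal to a vertex index 0..2n-1
        v = abs(lit) - 1
        return 2 * v + (0 if lit > 0 else 1)

    def neg(x):                  # vertex of the complementary literal
        return x ^ 1

    graph, rgraph = defaultdict(list), defaultdict(list)
    for a, b in clauses:         # (a or b)  ==>  (not a -> b), (not b -> a)
        for u, v in ((neg(idx(a)), idx(b)), (neg(idx(b)), idx(a))):
            graph[u].append(v)
            rgraph[v].append(u)

    order, seen = [], [False] * (2 * n_vars)
    def dfs1(u):                 # first DFS on G, collect post-order
        seen[u] = True
        for v in graph[u]:
            if not seen[v]:
                dfs1(v)
        order.append(u)

    comp = [-1] * (2 * n_vars)
    def dfs2(u, c):              # second DFS on G^T, label one SCC
        comp[u] = c
        for v in rgraph[u]:
            if comp[v] == -1:
                dfs2(v, c)

    for u in range(2 * n_vars):
        if not seen[u]:
            dfs1(u)
    c = 0
    for u in reversed(order):    # decreasing finishing time
        if comp[u] == -1:
            dfs2(u, c)
            c += 1

    model = {}
    for v in range(n_vars):
        if comp[2 * v] == comp[2 * v + 1]:
            return None          # x and not-x in the same SCC: unsatisfiable
        # The literal whose SCC comes later in the topological numbering
        # of the condensation is set to True (standard 2SAT rule).
        model[v + 1] = comp[2 * v] > comp[2 * v + 1]
    return model

# Hypothetical usage with clauses (x1 or not x2), (not x1 or x3), (x2 or x3):
print(solve_2sat(3, [(1, -2), (-1, 3), (2, 3)]))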

5. Explain the definition of independence P(F | G) = P(F) for any PL


formulas F and G in terms of M(F ) and M(G ) by revisiting the
renormalization scheme used to derive Eq. (8.20). You can explain
using Venn diagrams or by doing algebra or both.
(5%)
Solution:

Venn diagram: the set of all models, containing M(F) and M(G) with their
overlap M(F) ∩ M(G).

The independence condition implies

Σ_{A ∈ M(F)} P(A) = ( Σ_{A ∈ M(F ∧ G)} P(A) ) / ( Σ_{A ∈ M(G)} P(A) ).      (12.5)

In other words, the fraction of probability contained in M(F) w.r.t.
that contained in the set of all models is the same as the fraction of
probability contained in M(F) ∩ M(G) w.r.t. that contained in M(G).

6. Given RVs X and Y, find the values of the following two summations:

Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(xi, yj),        Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(xi | yj).

(2% + 3%)
Solution:
Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(xi, yj) = 1, since we are summing a JPD.
Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} P(xi | yj) = |Y|, since for each yj the conditional
PMF sums to 1 over the xi, and there are |Y| such summation constraints in
the CPT; refer to Eq. (8.38).
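A quick numerical check of both identities (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 3))
joint /= joint.sum()                  # a JPD P(xi, yj) with |X| = 4, |Y| = 3

p_y = joint.sum(axis=0)               # marginal P(yj)
cond = joint / p_y                    # CPT P(xi | yj): each column sums to 1

print(joint.sum())                    # ~1.0  (first summation)
print(cond.sum())                     # ~3.0  (= |Y|, second summation)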


The Negmax Alpha-Beta Pruning Variation

Algorithm 43: F(s, α, β)

if Terminal-Test(s) then return Utility(s);
for a ∈ Actions(s) do
    t ← −F(Result(s, a), −β, −α);
    if t > α then α ← t;
    if α ≥ β then break;
return α
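A small executable sketch of this negmax formulation as reconstructed above (the game-state interface is passed in as functions; the tiny tree game below is purely illustrative):

import math

def negmax(state, alpha, beta, terminal_test, utility, actions, result):
    """Knuth-Moore style negmax with alpha-beta pruning (cf. Algo. 43)."""
    if terminal_test(state):
        return utility(state)
    for a in actions(state):
        t = -negmax(result(state, a), -beta, -alpha,
                    terminal_test, utility, actions, result)
        if t > alpha:
            alpha = t
        if alpha >= beta:
            break
    return alpha

# Illustrative 2-ply game: the root's children are lists of leaf utilities
# (leaf values are stated from the root player's point of view; since each
# leaf lies two plies below the root, the two sign flips cancel).
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
value = negmax(tree, -math.inf, math.inf,
               terminal_test=lambda s: isinstance(s, int),
               utility=lambda s: s,
               actions=lambda s: range(len(s)),
               result=lambda s, a: s[a])
print(value)   # 3, the minimax value of this tree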


Home-assignment 3

HA 3: NBC/BN
Groups of 2. Due date 17.11., 23:59

1. Download
http://www.aispace.org/bayes/version5.1.9/bayes.jar and
run it using java -jar bayes.jar. Load File > Load Sample
Problem > Car Starting Problem.
1.1 Given its parents, of which nodes is the node Spark Quality (SQ)
conditionally independent? You can abbreviate the node names by the
initials.
1.2 To answer the query P(Battery Voltage | Spark Adequate ), which
nodes are irrelevant and can be pruned out to get a smaller BN
without affecting the query result? Verify this by making the query
P(Battery Voltage | Spark Adequate = T) first (in the solve tab),
then pruning the BN (in the create tab) and making the query again. If
you get an error, you may have pruned too much.

2. Fill in the missing steps in Eq. 9.21.



3. We will compute how accurate NBC is for the dataset


http://archive.ics.uci.edu/ml/datasets/SPECT+Heart. It
does not have any missing attributes.
3.1 Write an NBC class (C++ or Python) which has functionality for:
Loading a training dataset (SPECT.train for this DB);
Computing all the CPT pmfs P(Ai | C = c) assuming a uniform
Dirichlet prior for all pmfs;
Answering posterior queries of the kind
P(C = c | A1 = a1, . . . , An = an) using the log-sum trick.

If you code in Python, use of the numpy library would save you time.
3.2 Use these posterior queries on the test data SPECT.test and compute
the error-rate in percentage. The NBC-predicted classification is, of
course, the c with the maximum posterior probability (a minimal sketch
of such a classifier is given below).
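A minimal NBC sketch under the stated assumptions (binary attributes as in the SPECT data, add-one smoothing as a uniform Dirichlet prior, log-space scoring); the class name, method names, and file-format details are my own assumptions, not those of the reference solution:

import numpy as np

class NaiveBayes:
    """NBC for binary attributes with add-one (Laplace) smoothing."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        n_attr = X.shape[1]
        # log P(C = c), smoothed with a uniform Dirichlet prior
        self.log_prior = np.log(
            [(np.sum(y == c) + 1.0) / (len(y) + len(self.classes))
             for c in self.classes])
        # log P(A_i = 1 | C = c), one row per class
        self.log_p1 = np.empty((len(self.classes), n_attr))
        for k, c in enumerate(self.classes):
            Xc = X[y == c]
            self.log_p1[k] = np.log((Xc.sum(axis=0) + 1.0) / (len(Xc) + 2.0))
        self.log_p0 = np.log(1.0 - np.exp(self.log_p1))
        return self

    def log_posterior(self, x):
        """Unnormalized log P(C = c | a_1, ..., a_n) for each class c.
        A normalized posterior would subtract the log-sum-exp of these
        scores; for classification only the argmax is needed."""
        return (self.log_prior
                + np.where(x == 1, self.log_p1, self.log_p0).sum(axis=1))

    def predict(self, X):
        return np.array([self.classes[np.argmax(self.log_posterior(x))]
                         for x in X])

# Hypothetical usage, assuming the first column of each file is the class:
# data = np.loadtxt("SPECT.train", delimiter=",", dtype=int)
# model = NaiveBayes().fit(data[:, 1:], data[:, 0])
# test = np.loadtxt("SPECT.test", delimiter=",", dtype=int)
# error = np.mean(model.predict(test[:, 1:]) != test[:, 0]) * 100.0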


Home-assignment 4

HA 4: BN/Entropy
Groups of 2. Due date 29.11., 23:59

1. For the alarm example, find P(E, B | j, m) using the
Variable-Elimination algorithm.
2. Show that the implicit CPT of Noisy-Max given by (10.35) is a valid
CPT, i.e. it satisfies the summation constraint.
3. Consider a DRV X = [x1, . . . , xn], where xi ∈ R, with an unknown
PMF [p1, . . . , pn]. The only prior information you are given is that
E[X] = μ. Given this information, show that the least biased PMF
that you can select is

p_i = e^(λ xi) / ( Σ_{j=1}^{n} e^(λ xj) ),   i = 1, . . . , n,        (12.6)

where


the constant λ can be found by numerically solving the equation

μ = ( Σ_{i=1}^{n} xi e^(λ xi) ) / ( Σ_{i=1}^{n} e^(λ xi) ).        (12.7)

Hint: This is a constrained optimization problem where you have two
constraints: Σ_i p_i = 1 and E[X] = μ. Constrained optimization is best
solved using Lagrange multipliers (revise the relevant ESM course). The
constant λ will turn out to be one of the two Lagrange multipliers. For
ease of algebra, use the natural logarithm in the definition of information
entropy, although the base of the logarithm will not change the result.
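Once (12.6) has been derived, λ can be obtained from (12.7) e.g. by bisection, since its right-hand side is monotonically increasing in λ; a small sketch with illustrative values for the xi and μ:

import math

def mean_of_maxent_pmf(lam, xs):
    """E[X] under p_i proportional to exp(lam * x_i)  (Eqs. 12.6/12.7)."""
    m = max(lam * x for x in xs)           # shift for numerical stability
    w = [math.exp(lam * x - m) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

def solve_lambda(xs, mu, lo=-50.0, hi=50.0, tol=1e-10):
    """Bisection on the monotone function lam -> E[X](lam) - mu."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_of_maxent_pmf(mid, xs) < mu:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

xs = [1.0, 2.0, 3.0, 4.0]     # illustrative support of X
mu = 3.2                      # illustrative prescribed mean
lam = solve_lambda(xs, mu)
print(lam, mean_of_maxent_pmf(lam, xs))   # the second value should be ~3.2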


Home-assignment 5

HA 5: Decision Trees
Groups of 2. Due date 14.12., 23:59

We will compute how accurate decision trees are for the dataset
http://archive.ics.uci.edu/ml/datasets/SPECT+Heart, which we
evaluated with NBC in HA 3.
1. Write a decision-tree class (C++ or Python) which has functionality
for:
Loading a training dataset (SPECT.train for this DB);
Running Algo. 39 to learn a decision-tree (the attribute-selection step
is sketched below).

2. Use the decision tree on the test data SPECT.test and compute the
error-rate in percentage.
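A minimal sketch of the information-gain computation that typically drives the attribute selection in such a learner (this is not Algo. 39 itself; the representation of examples as attribute dicts is an assumption):

import math
from collections import Counter

def entropy(labels):
    """H(C) = -sum_c p_c log2 p_c over the class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((cnt / n) * math.log2(cnt / n)
                for cnt in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Gain(A) = H(C) - sum_v (|S_v|/|S|) H(C | A = v)."""
    n = len(labels)
    by_value = {}
    for ex, c in zip(examples, labels):
        by_value.setdefault(ex[attr], []).append(c)
    remainder = sum((len(subset) / n) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder

examples = [{"A1": 1, "A2": 0}, {"A1": 0, "A2": 0}, {"A1": 1, "A2": 1}]
labels = [1, 0, 1]
print(information_gain(examples, labels, "A1"))   # ~0.918: A1 splits perfectly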


Quizzes

Contents

Quizzes
Quiz 1
Quiz 2
Quiz 3


Quiz 1
Sep. 23

1. Two consistent heuristics h1(x) and h2(x) are given s.t.
h1(x) ≥ h2(x), ∀x. Assume for simplicity that there is a single goal
state xg. Which heuristic is better to use in A∗ and why?
(50%)
Solution: From Lemma 4.7 and Remark 4.8, at the iteration when
the goal is found, the maximum possible number of nodes expanded
until then form a set

S_h1 = {x | f(x) ≤ f(xg)} = {x | g(x) ≤ g(xg) − h1(x)},      (13.1)
S_h2 = {x | f(x) ≤ f(xg)} = {x | g(x) ≤ g(xg) − h2(x)}.      (13.2)

Note that g(xg) is fixed. As g(x) is the optimal distance of x from
the origin, the set S_ε = {x | g(x) ≤ ε} is such that |S_ε| is a

non-decreasing function of ε. The threshold ε1 = g(xg) − h1(x)
(corresponding to h1) is at most ε2 = g(xg) − h2(x) (corresponding to h2),
because it is given that h1(x) ≥ h2(x). Therefore,

|S_h1| ≤ |S_h2|      (13.3)

⟹ h1 is more efficient.

2. What could be an admissible heuristic function for solving the
8-puzzle by A∗?
(50%)
Solution: Read Sec. 3.6.2 of the textbook: e.g., h1 = the number of
misplaced tiles, or h2 = the sum of the Manhattan distances of the tiles
from their goal positions (both are sketched in code after the figure below).
Figure: an example Start State and Goal State of the 8-puzzle.
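A minimal sketch of the two textbook heuristics; the start/goal configuration used here is the standard example from Sec. 3.6.2 of the textbook, not necessarily the one shown on the original slide:

def misplaced_tiles(state, goal):
    """h1: number of tiles not in their goal position (blank = 0 excluded)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal, width=3):
    """h2: sum of Manhattan distances of each tile from its goal position."""
    pos = {tile: divmod(i, width) for i, tile in enumerate(goal)}
    return sum(abs(r - pos[t][0]) + abs(c - pos[t][1])
               for i, t in enumerate(state) if t != 0
               for r, c in [divmod(i, width)])

start = (7, 2, 4, 5, 0, 6, 8, 3, 1)      # textbook Fig. 3.28 start state
goal  = (0, 1, 2, 3, 4, 5, 6, 7, 8)
print(misplaced_tiles(start, goal), manhattan(start, goal))   # 8 18

Both heuristics are admissible; h2 dominates h1 and is therefore the more informative choice.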


Quiz 2
Oct. 7

Using properly labeled Venn diagrams show:
1. The case KB ⊨ Q
2. The case KB ⊭ Q
3. The case where KB ⊭ Q and KB ⊭ ¬Q
4. F ≡ F ∪ {R}, where R is the resolvent of two clauses in F
5. Monotonicity of PL knowledge-bases


Figure 93: Quiz 2 solution (Venn diagrams over the model sets M(·)):
(a) KB ⊨ Q: M(KB) ⊆ M(Q).
(b) KB ⊭ Q.
(c) KB ⊭ Q and KB ⊭ ¬Q: M(KB) overlaps both M(Q) and its complement.
(d) F ≡ F ∪ {R}: M(F) = M(F ∪ {R}) = M(F) ∩ M({R}).
(e) Monotonicity: let F be a new clause appended to the CNF KB, so that
KB′ = KB ∧ F = KB ∪ {F}; then M(KB′) ⊆ M(KB) ⊆ M(Q), hence KB′ ⊨ Q.

Quiz 3, 13.11.
Figure: the Bayesian network for Quiz 3; its nodes include A, B, C, D, G, H.

1. A query to the BN is P(B | G = g ). Give a reduced BN which can be


used to answer this query instead of the original.
2. Write the JPD of the reduced BN.
3. Show how P(B | G = g) can be computed in two different ways from
the JPD by distributing the summations over the product differently.