Bayesian AI Tutorial

Overview
1. Introduction to Bayesian AI (20 min)
Fuzzy Logic

Designed to cope with vagueness: is Fido a Labrador or a Shepherd?

Fuzzy set theory: m(Fido ∈ Labrador) = m(Fido ∈ Shepherd) = 0.5

Extended to fuzzy logic, which takes intermediate truth values:
T(Labrador(Fido)) = 0.5

Probabilities

Classic approach to reasoning under uncertainty (Blaise Pascal and Fermat).

Kolmogorov's Axioms:
1. P(U) = 1
2. 0 ≤ P(A) ≤ 1, for any event A
3. P(A ∪ B) = P(A) + P(B), for disjoint events A and B
Probability Theory

So, why not use probability theory to represent uncertainty? That's what it was invented for... dealing with physical randomness and degrees of ignorance.

Furthermore, if you make bets which violate probability theory, you are subject to Dutch books: a Dutch book is a sequence of "fair" bets which collectively guarantee a loss.

A Dutch Book

Payoff table on a bet for h (Odds = p/(1-p); S = betting unit):

  h    Payoff
  T    $(1-p)S
  F    -$pS

Given a fair bet, the expected value from such a payoff is always $0.

Now, let's violate the probability axioms.
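The arithmetic can be checked with a short Python sketch (mine, not the tutorial's; the prices 0.4 and 0.8 below are arbitrary illustrations of incoherent betting):

# Fair bet on h at odds p/(1-p) with betting unit S:
# win $(1-p)*S if h is true, lose $p*S if h is false.
def expected_payoff(p_true, p_bet, S=1.0):
    """Expected payoff of a bet priced as if P(h) = p_bet,
    when the true probability of h is p_true."""
    return p_true * (1 - p_bet) * S - (1 - p_true) * p_bet * S

# A coherent bettor prices the bet at the true probability: EV is $0.
print(expected_payoff(0.3, 0.3))            # 0.0

# Violate the axioms: price h at 0.4 and not-h at 0.8 (sum 1.2 > 1).
# A bookie sells both bets; however h turns out, the total is negative.
for h_true in (True, False):
    p_h = 1.0 if h_true else 0.0
    total = expected_payoff(p_h, 0.4) + expected_payoff(1 - p_h, 0.8)
    print(h_true, total)                    # -0.2 in both cases

Whichever way h turns out, the incoherent agent loses 0.2·S on the pair of bets: a Dutch book.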
Bayesian AI

A Bayesian conception of an AI is: an autonomous agent which
– can learn about its world and the relation between its actions and future states (probabilities), and
– maximizes its expected utility.

The techniques used in learning about the world are (primarily) statistical... hence Bayesian AI.

Bayesian Networks: Overview

– Semantics
– Evaluation methods
– Influence diagrams (Decision Networks)
– Dynamic Bayesian Networks
Bayesian Networks

Data structure which represents the dependence between variables. Gives a concise specification of the joint probability distribution.

A Bayesian network is a graph in which the following holds:
1. A set of random variables makes up the nodes in the network.
2. A set of directed links or arrows connects pairs of nodes.
3. Each node has a conditional probability table that quantifies the effects the parents have on the node.
4. The graph is directed and acyclic (a DAG), i.e. it has no directed cycles.

Example: Earthquake (Pearl)

Pearl has a new burglar alarm installed. It is reliable at detecting burglary, but also responds to minor earthquakes. Two neighbours (John, Mary) promise to call you at work when they hear the alarm.
– John always calls when he hears the alarm, but confuses the alarm with the phone ringing (and calls then also).
– Mary likes loud music and sometimes misses the alarm!

Given evidence about who has and hasn't called, estimate the probability of a burglary.
[Figure: the earthquake network. Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls. P(B) = 0.01, P(E) = 0.02.]

  B  E  P(A|B,E)
  T  T  0.95
  T  F  0.94
  F  T  0.29
  F  F  0.001

  A  P(J|A)        A  P(M|A)
  T  0.90          T  0.70
  F  0.05          F  0.01

Once the topology is specified, we need to specify a conditional probability table (CPT) for each node.
– Each row contains the conditional probability of each node value for a conditioning case.
– Each row must sum to 1.
– A table for a Boolean variable with n Boolean parents contains 2^(n+1) probabilities.

A Bayesian network is a (more compact) representation of the joint probability distribution:

  P(x1, ..., xn) = P(x1) P(x2|x1) ... P(xn|x1 ∧ ... ∧ x(n-1))
                 = ∏i P(xi|x1 ∧ ... ∧ x(i-1))

– helpful in understanding how to construct the network
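The factorization can be coded directly. Here is a minimal Python sketch (not part of the tutorial) that encodes the CPTs above and answers queries by brute-force enumeration, feasible here because there are only five Boolean variables:

from itertools import product

# CPTs for the earthquake network (values from the tables above).
P_B = 0.01
P_E = 0.02
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A | B, E)
P_J = {True: 0.90, False: 0.05}   # P(J | A)
P_M = {True: 0.70, False: 0.01}   # P(M | A)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) via the network factorization (chain rule)."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

def query_burglary(j, m):
    """P(Burglary | JohnCalls=j, MaryCalls=m) by enumeration."""
    num = sum(joint(True, e, a, j, m)
              for e, a in product([True, False], repeat=2))
    den = sum(joint(b, e, a, j, m)
              for b, e, a in product([True, False], repeat=3))
    return num / den

print(query_burglary(True, True))   # ~0.56 with these CPTs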
Conditional Independence: Causal Chains

[Figure: chain A → B → C]

  P(C|A ∧ B) = P(C|B)

Example: A = Jack's flu, B = severe cough, C = Jill's flu.

Conditional Independence: Common Causes

[Figure: B ← A → C]

Conditioning on a common cause renders its effects independent:

  P(C|A ∧ B) = P(C|A)

Example: A = flu, with B and C two of its symptoms.

Conditional Dependence: Common Effects

[Figure: A → B ← C]

A and C are marginally independent, but become dependent once the common effect B is observed:

  P(A|B ∧ C) ≠ P(A|B)

D-separation

The graphical criterion for reading conditional independencies off the network.

Causal Ordering

Why does the variable ordering matter? Because using the causal order allows direct representation of conditional independencies.

[Figure: the earthquake network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls).]
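A d-separation test can be implemented via the equivalent moral-graph criterion (restrict to ancestors, moralize, delete the conditioning set, test reachability). A sketch, with the function and variable names my own:

def d_separated(parents, x, y, zs):
    """Moral-graph test for d-separation, equivalent to Pearl's
    path-based definition. parents: dict node -> list of parents."""
    # 1. Restrict to x, y, zs and all of their ancestors.
    relevant, stack = set(), [x, y, *zs]
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, []))
    # 2. Moralize: link each node to its parents, and co-parents to each other.
    adj = {n: set() for n in relevant}
    for n in relevant:
        ps = [p for p in parents.get(n, []) if p in relevant]
        for p in ps:
            adj[n].add(p); adj[p].add(n)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # 3. Remove conditioning nodes and check whether x can still reach y.
    blocked, seen, stack = set(zs), {x}, [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        for m in adj[n] - seen - blocked:
            seen.add(m); stack.append(m)
    return True

parents = {'Alarm': ['Burglary', 'Earthquake'],
           'JohnCalls': ['Alarm'], 'MaryCalls': ['Alarm']}
print(d_separated(parents, 'Burglary', 'Earthquake', []))         # True
print(d_separated(parents, 'Burglary', 'Earthquake', ['Alarm']))  # False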
Causal Ordering (cont'd)

[Figure: with the causal ordering, Flu and TB are parents of Cough; with a non-causal ordering (Cough first), Cough is a parent of both Flu and TB.]

In the non-causal ordering, the marginal independence of Flu and TB must be re-established by adding an arc Flu → TB or Flu ← TB.

Inference in Bayesian Networks

The basic task for any probabilistic inference system: compute the posterior probability distribution for a set of query variables, given values for some evidence variables.

[Figure: the four kinds of reasoning, with evidence E and query Q: Diagnostic, Causal, Intercausal (Explaining Away), and Mixed.]
Diagnostic inferences: from effects to causes.
  P(Burglary|JohnCalls)

Causal inferences: from causes to effects.
  P(JohnCalls|Burglary)
  P(MaryCalls|Burglary)

Intercausal inferences: between causes of a common effect (explaining away).
  P(Burglary|Alarm)
  P(Burglary|Alarm ∧ Earthquake)

Mixed inference: combining two or more of the above.
  P(Alarm|JohnCalls ∧ ¬Earthquake)
  P(Burglary|JohnCalls ∧ ¬Earthquake)

Exact inference
– Trees and polytrees: message-passing algorithm
– Multiply-connected networks: clustering

Approximate inference
– Large, complex networks: stochastic simulation; other approximation methods

In the general case, both sorts of inference are computationally complex ("NP-hard").
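All four kinds of query can be answered with the same enumeration machinery. A sketch reusing the joint() function defined earlier (illustrative only, not the message-passing or clustering algorithms just listed):

from itertools import product

NAMES = ['B', 'E', 'A', 'J', 'M']

def prob(query, evidence):
    """P(query | evidence); query and evidence map names in NAMES to booleans."""
    num = den = 0.0
    for vals in product([True, False], repeat=5):
        world = dict(zip(NAMES, vals))
        if any(world[v] != t for v, t in evidence.items()):
            continue
        p = joint(*vals)
        den += p
        if all(world[v] == t for v, t in query.items()):
            num += p
    return num / den

print(prob({'B': True}, {'J': True}))              # diagnostic: effect to cause
print(prob({'J': True}, {'B': True}))              # causal: cause to effect
print(prob({'B': True}, {'A': True}))              # intercausal, ~0.58
print(prob({'B': True}, {'A': True, 'E': True}))   # explaining away, ~0.03
print(prob({'A': True}, {'J': True, 'E': False}))  # mixed

The drop from P(B|A) to P(B|A ∧ E) is explaining away in action: the earthquake accounts for the alarm, so the burglary hypothesis loses support.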
Multiply-Connected Networks

Networks where two nodes are connected by more than one path:
– Two or more possible causes which share a common ancestor
– One variable can influence another through more than one causal mechanism

[Figure: an extended earthquake network with a PhoneRings node, P(Ph) = 0.05; JohnCalls now has parents PhoneRings and Alarm.]

  B  E  P(A)         Ph A  P(J)        A  P(M)
  T  T  0.95         T  T  0.95        T  0.70
  T  F  0.94         T  F  0.5         F  0.01
  F  T  0.29         F  T  0.90
  F  F  0.001        F  F  0.01
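For networks too large for exact methods, the stochastic simulation mentioned above can be sketched as likelihood weighting (one simple sampling scheme among several; this version is mine, reusing the CPT dictionaries P_B, P_E, P_A, P_J, P_M defined earlier):

import random

def likelihood_weighting(query_var, evidence, n=100_000):
    """Approximate P(query_var=T | evidence) by likelihood weighting."""
    topo = [('B', lambda w: P_B),
            ('E', lambda w: P_E),
            ('A', lambda w: P_A[(w['B'], w['E'])]),
            ('J', lambda w: P_J[w['A']]),
            ('M', lambda w: P_M[w['A']])]
    weights = {True: 0.0, False: 0.0}
    for _ in range(n):
        w, world = 1.0, {}
        for var, p_of_true in topo:        # sample in topological order
            p = p_of_true(world)
            if var in evidence:            # evidence: fix value, weight by likelihood
                world[var] = evidence[var]
                w *= p if evidence[var] else 1 - p
            else:                          # otherwise sample from the conditional
                world[var] = random.random() < p
        weights[world[query_var]] += w
    return weights[True] / (weights[True] + weights[False])

print(likelihood_weighting('B', {'J': True, 'M': True}))   # ~0.56, cf. exact answer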
Clustering Methods

Transform a multiply-connected network into a tree of clusters on which exact message passing applies (cf. Lauritzen & Spiegelhalter, 1988).

Decision Networks

A decision network represents information about the agent's current state, its possible actions, and the utility of the resulting states.

Types of Nodes

Chance nodes (ovals): represent random variables (same as in Bayesian networks). Each has an associated CPT. Parents can be decision nodes and other chance nodes.

Example: Umbrella

[Figure: chance nodes Weather and Forecast, decision node Take Umbrella, utility node U.]

Evaluating Decision Networks: Algorithm

1. Set the evidence variables for the current state.
2. For each possible value of the decision node: set the decision node to that value, calculate the posterior probabilities for the parents of the utility node, and calculate the resulting expected utility for the action.
3. Return the action with the highest expected utility.
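A sketch of this algorithm on the umbrella network. All numbers here (the prior on rain, the forecast reliability, and the utilities) are invented for illustration; the slide does not specify them:

# All numbers below are hypothetical -- the slide gives none.
P_RAIN = 0.3                                  # prior P(Weather = rain)
P_FORECAST_RAINY = {True: 0.8, False: 0.2}    # P(Forecast = rainy | Weather)
UTILITY = {(True, True): 70, (True, False): 0,     # keys: (rain?, take umbrella?)
           (False, True): 80, (False, False): 100}

def p_rain_given_forecast(forecast_rainy):
    """Step 1 and step 2's posterior: P(rain | forecast) by Bayes' rule."""
    like_rain = P_FORECAST_RAINY[True] if forecast_rainy else 1 - P_FORECAST_RAINY[True]
    like_dry = P_FORECAST_RAINY[False] if forecast_rainy else 1 - P_FORECAST_RAINY[False]
    num = like_rain * P_RAIN
    return num / (num + like_dry * (1 - P_RAIN))

def best_decision(forecast_rainy):
    """Steps 2-3: expected utility of each action; return the best."""
    p_rain = p_rain_given_forecast(forecast_rainy)
    eu = {take: p_rain * UTILITY[(True, take)] + (1 - p_rain) * UTILITY[(False, take)]
          for take in (True, False)}
    return max(eu, key=eu.get), eu

print(best_decision(True))    # rainy forecast -> take the umbrella
print(best_decision(False))   # dry forecast -> leave it at home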
Dynamic Bayesian Networks

[Figure: a dynamic network unrolled over time: state nodes State t-2 ... State t+2, each linked to an observation node Obs t-2 ... Obs t+2 by the sensor model.]

Similarly, decision networks can be extended to include temporal aspects. Sequence of decisions taken = plan.
Example: Cancer

Metastatic cancer is a possible cause of a brain tumour and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumour. (Example from Pearl, 1988.)

[Figure: A = Metastatic Cancer; B = Increased total serum calcium; C = Brain tumour; D = Coma; E = Severe Headaches.]

Example: Asia

A patient presents to a doctor with shortness of breath. The doctor considers that possible causes are tuberculosis, lung cancer and bronchitis. Other relevant information is whether the patient has recently visited Asia (where tuberculosis is more prevalent) and whether or not the patient is a smoker (which increases the chances of cancer and bronchitis). A positive X-ray would indicate either TB or lung cancer. (Example from Lauritzen, 1988.)

[Figure: the Asia network, with root nodes "visit to Asia" and "smoking".]

Probabilistic Reasoning in Medicine

[Figure: a two-level network with finding nodes F1, F2, F3.]
Medical Applications

– QMR (Quick Medical Reference): 600 diseases, 4,000 findings, 40,000 arcs (Shwe & Cooper, 1990; Dean & Wellman, 1991).
– ALARM (Beinlich et al., 1989): 37 nodes, 42 arcs. (See Netica examples.)
– Glucose prediction and insulin dose adjustment (Andreassen et al., 1991).
– CPSC project (Pradham et al., 1994).

[Figure: a fragment of the ALARM network, including nodes MinVolSet, FiO2, VentAlv, MinVol, InsuffAnesth, SaO2, TPR and Catechol, with marginal probabilities.]
Normative Model: represents our best understanding of the domain; proper (constrained) Bayesian updating, given premises.

User Model: represents our best understanding of the human; Bayesian updating modified to reflect human biases (e.g., overconfidence; Korb, McConachy, Zukerman, 1997).

[Figure: four model structures over nodes A0-A3, L0-L3, Q and Q': (a) mainModel, (b) indepModel, (c) actionModel, (d) locationModel.]

Traffic plan recognition (Pynadeth & Wellman, 1995).

BNs are embedded in a semantic hierarchy:
– supports attentional modeling
– constrained updating
[Figure: the semantic hierarchy: a two-layer semantic network, with higher-level concepts like 'motivation' or 'ability' and lower-level concepts like 'Grade Point Average', above a Bayesian network.]

Bayesian Poker

Poker is ideal for testing automated reasoning under uncertainty:
– physical randomisation
– incomplete hand information
– incomplete opponent information (strategies, bluffing, etc.)
Bayesian Poker BN

Different networks (matrices) for each round. Betting curves based on pot-odds are used to determine the action (bet/call, pass or fold).

OPP Current, BPP Current: (partial) hand types with the cards dealt so far.

Observation nodes:
– OPP Upcards: all of the opponent's cards except the first are visible to BPP.
– OPP Action: BPP knows the opponent's action.

[Figure: the poker network: BPP Win at the top, with OPP Current and BPP Current below it, each linked to its observation nodes.]
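The pot-odds rule referred to above can be written down directly; this simplified sketch is my own (it ignores bluffing and bet sizing) and compares the network's estimated win probability against the break-even point:

def poker_action(p_win, pot, to_call):
    """Compare the BN's estimated win probability (e.g., from the
    BPP Win node) against the pot odds. Break-even point:
    p_win == to_call / (pot + to_call)."""
    pot_odds = to_call / (pot + to_call)
    if p_win > pot_odds:
        return 'bet/call'
    return 'fold' if to_call > 0 else 'pass'

print(poker_action(0.35, pot=10, to_call=2))   # 0.35 > 1/6 -> 'bet/call'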
Deployed BNs

From the Web Site database: see handout for details.

BN Software: Issues

– Functionality (especially application vs API)
– Price: many free demo versions or educational-use licences; commercial licence costs
– Availability (platforms)
– Quality: GUI, documentation and help, leading edge, robustness (of the software and of the company)

BN Software

– Analytica: www.lumina.com
– Hugin: www.hugin.com
– Netica: www.norsys.com
(The above 3 are available during the tutorial lab session.)
– JavaBayes: http://www.cs.cmu.edu/~javabayes/Home/
– Many other packages (see next slide)
BN Web Resources

– Bayesian Belief Network site (Russell Greiner): www.cs.ualberta.ca/~greiner/bn.html
– Bayesian Network Repository (Nir Friedman): www-nt.cs.berkeley.edu/home/nir/public_html/Repository/index.htm
– Summary of BN software and links to software sites (Kevin Murphy): http.cs.berkeley.edu/~murphyk/Bayes/bnsoft.html (includes Murphy's Bayes net toolbox)
– Russell Almond's BN page: bayes.stat.washington.edu/almond/belief.html

Applications: Summary

Various BN structures are available to compactly and accurately represent certain types of domain features.

Bayesian networks have been used for a wide range of AI applications.

Robust and easy-to-use Bayesian network software is now readily available.
Learning Conditional Probability Tables

Spiegelhalter & Lauritzen (1990): assume parameter independence.
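The idea can be sketched as independent Beta (more generally, Dirichlet) counts, one per CPT row, updated sequentially as cases arrive. The uniform prior and the toy data below are arbitrary choices of mine:

from collections import defaultdict

# One independent Beta(a, b) prior per conditioning case (CPT row)
# of a Boolean node -- here Alarm, with parents (Burglary, Earthquake).
counts = defaultdict(lambda: [1.0, 1.0])   # [a, b]: uniform Beta(1,1) prior

def observe(row, value):
    """Sequential updating: bump the count for the observed value."""
    counts[row][0 if value else 1] += 1

def p_true(row):
    """Posterior mean estimate of P(Alarm=T | row)."""
    a, b = counts[row]
    return a / (a + b)

for b, e, alarm in [(False, False, False), (True, False, True),
                    (False, False, False), (False, True, True)]:
    observe((b, e), alarm)

print(p_true((False, False)))   # 1/4 after two negative cases
print(p_true((True, True)))     # 1/2: no data yet, so the prior mean

Parameter independence is what licenses updating each row separately from the others.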
TETRAD II
— Spirtes, Glymour and Scheines (1993)

Replace the Oracle with statistical tests:

– for linear models, a significance test on partial correlation:

  X ⊥ Y | S  iff  ρ(XY·S) = 0

– for discrete models, a χ² test on the difference between the CPT counts expected under independence (Ei) and those observed (Oi):

  X ⊥ Y | S  iff  Σi Oi ln(Oi/Ei) = 0

Statistical Equivalence

Chickering (1995):

Any two causal models over the same variables which have the same skeleton (undirected arcs) and the same directed v-structures are statistically equivalent.

If H1 and H2 are statistically equivalent, then they have the same maximum likelihoods relative to any joint samples:

  max_θ1 P(e|H1, θ1) = max_θ2 P(e|H2, θ2)

where θi is a parameterization of Hi.
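Both tests can be sketched in a few lines. This version assumes scipy is available; the first-order partial correlation recursion is the standard one, and the names are mine:

import math
from scipy import stats

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation rho(XY.Z) from pairwise correlations."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def independent_linear(r, n, k, alpha=0.05):
    """Fisher z test of rho = 0: r is the sample partial correlation,
    n the sample size, k the number of conditioning variables."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - k - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha     # fail to reject rho = 0: accept X _||_ Y | S

def independent_discrete(observed, expected, dof, alpha=0.05):
    """G^2 = 2 * sum O_i ln(O_i / E_i), compared to a chi^2 distribution."""
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return stats.chi2.sf(g2, dof) > alpha

# Example: is X _||_ Y | {Z} plausible at n = 200?
print(independent_linear(partial_corr(0.30, 0.50, 0.60), n=200, k=1))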
Statistical Equivalence Learners

Wallace & Korb (1999): This is not right!

MDL learning:
– M(Xi, π(i)) is the mutual information between Xi and its parent set π(i)
– H(Xi) is the entropy of variable Xi
– Iterate until the MDL score fails to improve ⇒ results similar to K2, but without a full variable ordering
– Equivalent to finding the h that maximizes P(h)P(e|h) — i.e., P(h|e).

The other significant difference from MDL: MML takes parameter estimation seriously (cf. MDL's fixed code length for parameters). Here θj are the parameters for Xj, F(θj) is the Fisher information, and f(θj|h) is assumed to be N(0, σj).
Empirical Results

Info on TETRAD II; downloadable TETRAD III (approximately equivalent to II):
hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html

Combinations of discrete and continuous variables (i.e., mixing node types).

Learning issues
(Other) Limitations
– inappropriate problems (deterministic systems, ...)

References

Introduction to Bayesian AI

T. Bayes (1764) "An Essay Towards Solving a Problem in the Doctrine of Chances." Phil Trans of the Royal Soc of London. Reprinted in Biometrika, 45 (1958), 296-315.

B. Buchanan and E. Shortliffe (eds.) (1984) Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.

B. de Finetti (1964) "Foresight: Its Logical Laws, Its Subjective Sources," in Kyburg and Smokler (eds.) Studies in Subjective Probability. NY: Wiley.
CHAPTERS 1, 2 AND 4 COVER SOME OF THE RELEVANT HISTORY.

F. P. Ramsey (1931) "Truth and Probability," in The Foundations of Mathematics and Other Essays. NY: Humanities Press.
THE ORIGIN OF MODERN BAYESIANISM. INCLUDES LOTTERY-BASED ELICITATION AND DUTCH-BOOK ARGUMENTS FOR THE USE OF PROBABILITIES.

R. Reiter (1980) "A logic for default reasoning," Artificial Intelligence, 13, 81-132.

J. von Neumann and O. Morgenstern (1947) Theory of Games and Economic Behavior, 2nd ed. Princeton Univ.
STANDARD REFERENCE ON ELICITING UTILITIES VIA LOTTERIES.

Bayesian Networks

S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A.R. Sørensen, A. Rosenfalck and F. Jensen (1989) "MUNIN — An Expert EMG Assistant," in J.E. Desmedt (ed.) Computer-Aided Electromyography and Expert Systems, Chapter 21. Elsevier.

S.A. Andreassen, J.J. Benn, R. Hovorka, K.G. Olesen and R.E. Carson (1991) "A Probabilistic Approach to Glucose Prediction and Insulin Dose Adjustment: Description of Metabolic Model and Pilot Evaluation Study."

E. Charniak (1991) "Bayesian Networks Without Tears," Artificial Intelligence Magazine, Vol 12, pp. 50-63.
AN ELEMENTARY INTRODUCTION.

G.F. Cooper (1990) "The computational complexity of probabilistic inference using belief networks," Artificial Intelligence, 42, 393-405.

R.G. Cowell, A.P. Dawid, S.L. Lauritzen and D.J. Spiegelhalter (1999) Probabilistic Networks and Expert Systems. New York: Springer.
TECHNICAL SURVEY OF BAYESIAN NET TECHNOLOGY, INCLUDING LEARNING BAYESIAN NETS.

B. D'Ambrosio (1999) "Inference in Bayesian Networks," Artificial Intelligence Magazine, Vol 20, No. 2.

A.P. Dawid (1998) "Conditional independence," in Encyclopedia of Statistical Sciences, Update Volume 2. New York: Wiley Interscience.

P. Haddawy (1999) "An Overview of Some Recent Developments in Bayesian Problem-Solving Techniques," Artificial Intelligence Magazine, Vol 20, No. 2.

M. Henrion, J.S. Breese and E.J. Horvitz (1991) "Decision analysis and expert systems," AI Magazine, 12, 64-91.

R.A. Howard and J.E. Matheson (1981) "Influence Diagrams," in Howard and Matheson (eds.) Readings in the Principles and Applications of Decision Analysis. Menlo Park, Calif: Strategic Decisions Group.

F.V. Jensen (1996) An Introduction to Bayesian Networks. Springer.

K.B. Korb, I. Zukerman and R. McConachy (1997) "A cognitive model of argumentation," in Proceedings of the Cognitive Science Society, Stanford University.

S.L. Lauritzen and D.J. Spiegelhalter (1988) "Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems," Journal of the Royal Statistical Society, 50(2).

R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley.
SIMILAR COVERAGE TO THAT OF PEARL; MORE EMPHASIS ON PRACTICAL ALGORITHMS FOR NETWORK UPDATING.

J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
THIS IS THE CLASSIC TEXT INTRODUCING BAYESIAN NETWORKS.

Learning Bayesian Networks

H. Blalock (1964) Causal Inference in Nonexperimental Research. University of North Carolina.

G.F. Cooper and E. Herskovits (1991) "A Bayesian Method for Constructing Bayesian Belief Networks from Databases," in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991.

D. Geiger and D. Heckerman (1994) "Learning Gaussian networks," in Lopez de Mantaras and Poole (eds.) UAI 1994, 235-243.

D. Heckerman and D. Geiger (1995) "Learning Bayesian networks: A unification for discrete and Gaussian domains," in Besnard and Hanks (eds.) UAI 1995, 274-284.

D. Heckerman, D. Geiger and D.M. Chickering (1995) "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data," Machine Learning, 20, 197-243.

D. Madigan, S.A. Andersson, M.D. Perlman and C.T. Volinsky (1996) "Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs," Comm in Statistics: Theory and Methods, 25, 2493-2519.

D. Madigan and A.E. Raftery (1994) "Model selection and accounting for model uncertainty in graphical models using Occam's window," Jrn Amer Stat Assoc, 89, 1535-1546.

N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller (1953) "Equations of state calculations by fast computing machines," Jrn Chemical Physics, 21, 1087-1091.

J.R. Neil and K.B. Korb (1999) "The Evolution of Causal Models: A Comparison of Bayesian Metrics and Structure Priors," in N. Zhong and L. Zhous (eds.) Methodologies for Knowledge Discovery and Data Mining: Third Pacific-Asia Conference (pp. 432-437). Springer Verlag.
GENETIC ALGORITHMS FOR CAUSAL DISCOVERY; STRUCTURE PRIORS.

J.R. Neil, C.S. Wallace and K.B. Korb (1999) "Learning Bayesian networks with restricted causal interactions," in Laskey and Prade (eds.) UAI 99, 486-493.

J. Rissanen (1978) "Modeling by shortest data description," Automatica, 14, 465-471.

H. Simon (1954) "Spurious Correlation: A Causal Interpretation," Jrn Amer Stat Assoc, 49, 467-479.

D. Spiegelhalter and S. Lauritzen (1990) "Sequential Updating of Conditional Probabilities on Directed Graphical Structures," Networks, 20, 579-605.

P. Spirtes, C. Glymour and R. Scheines (1990) "Causality from Probability," in J.E. Tiles, G.T. McKee and G.C. Dean (eds.) Evolving Knowledge in Natural Science and Artificial Intelligence. London: Pitman.
AN ELEMENTARY INTRODUCTION TO STRUCTURE LEARNING VIA CONDITIONAL INDEPENDENCE.

P. Spirtes, C. Glymour and R. Scheines (1993) Causation, Prediction and Search: Lecture Notes in Statistics 81. Springer Verlag.
A THOROUGH PRESENTATION OF THE ORTHODOX STATISTICAL APPROACH TO LEARNING CAUSAL STRUCTURE.

J. Suzuki (1996) "Learning Bayesian Belief Networks Based on the Minimum Description Length Principle," in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 462-470). San Francisco: Morgan Kaufmann.

T.S. Verma and J. Pearl (1991) "Equivalence and Synthesis of Causal Models," in P. Bonissone, M. Henrion, L. Kanal and J.F. Lemmer (eds.) Uncertainty in Artificial Intelligence 6 (pp. 255-268). Elsevier.
THE GRAPHICAL CRITERION FOR STATISTICAL EQUIVALENCE.

C.S. Wallace and D. Boulton (1968) "An information measure for classification," Computer Jrn, 11, 185-194.

C.S. Wallace and P.R. Freeman (1987) "Estimation and inference by compact coding," Jrn Royal Stat Soc (Series B), 49, 240-252.

C.S. Wallace and K.B. Korb (1999) "Learning Linear Causal Models by MML Sampling," in A. Gammerman (ed.) Causal Models and Intelligent Data Management. Springer Verlag.
SAMPLING APPROACH TO LEARNING CAUSAL MODELS; DISCUSSION OF STRUCTURE PRIORS.