Bayesian AI Tutorial

Overview
1. Introduction to Bayesian AI (20 min)
Fuzzy Logic

Designed to cope with vagueness: is Fido a Labrador or a Shepherd?

Fuzzy set theory: m(Fido ∈ Labrador) = m(Fido ∈ Shepherd) = 0.5

Extended to fuzzy logic, which takes intermediate truth values:
T(Labrador(Fido)) = 0.5

Probabilities

Classic approach to reasoning under uncertainty (Blaise Pascal and Fermat).

Kolmogorov's Axioms:
1. P(U) = 1
2. 0 ≤ P(A) ≤ 1, for any event A
3. P(A ∪ B) = P(A) + P(B), for disjoint events A and B
Probability Theory

So, why not use probability theory to represent uncertainty? That's what it was invented for... dealing with physical randomness and degrees of ignorance.

Furthermore, if you make bets which violate probability theory, you are subject to Dutch books: a Dutch book is a sequence of "fair" bets which collectively guarantee a loss.

A Dutch Book

Payoff table on a bet for h (Odds = p/(1-p); S = betting unit):

  h    Payoff
  T    $(1-p)S
  F    -$pS

Given a fair bet, the expected value from such a payoff is always $0.

Now, let's violate the probability axioms.
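The arithmetic can be checked with a short Python sketch (mine, not the tutorial's; the prices 0.4 and 0.8 below are arbitrary illustrations of incoherent betting):

# Fair bet on h at odds p/(1-p) with betting unit S:
# win $(1-p)*S if h is true, lose $p*S if h is false.
def expected_payoff(p_true, p_bet, S=1.0):
    """Expected payoff of a bet priced as if P(h) = p_bet,
    when the true probability of h is p_true."""
    return p_true * (1 - p_bet) * S - (1 - p_true) * p_bet * S

# A coherent bettor prices the bet at the true probability: EV is $0.
print(expected_payoff(0.3, 0.3))            # 0.0

# Violate the axioms: price h at 0.4 and not-h at 0.8 (sum 1.2 > 1).
# A bookie sells both bets; however h turns out, the total is negative.
for h_true in (True, False):
    p_h = 1.0 if h_true else 0.0
    total = expected_payoff(p_h, 0.4) + expected_payoff(1 - p_h, 0.8)
    print(h_true, total)                    # -0.2 in both cases

Whichever way h turns out, the incoherent agent loses 0.2·S on the pair of bets: a Dutch book.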
Bayesian AI

A Bayesian conception of an AI is: an autonomous agent which
– can learn about its world and the relation between its actions and future states (probabilities), and
– maximizes its expected utility.

The techniques used in learning about the world are (primarily) statistical... hence Bayesian AI.

Bayesian Networks: Overview

– Semantics
– Evaluation methods
– Influence diagrams (Decision Networks)
– Dynamic Bayesian Networks
Bayesian Networks

Data structure which represents the dependence between variables. Gives a concise specification of the joint probability distribution.

A Bayesian network is a graph in which the following holds:
1. A set of random variables makes up the nodes in the network.
2. A set of directed links or arrows connects pairs of nodes.
3. Each node has a conditional probability table that quantifies the effects the parents have on the node.
4. The graph is directed and acyclic (a DAG), i.e. it has no directed cycles.

Example: Earthquake (Pearl)

Pearl has a new burglar alarm installed. It is reliable at detecting burglary, but also responds to minor earthquakes. Two neighbours (John, Mary) promise to call you at work when they hear the alarm.
– John always calls when he hears the alarm, but confuses the alarm with the phone ringing (and calls then also).
– Mary likes loud music and sometimes misses the alarm!

Given evidence about who has and hasn't called, estimate the probability of a burglary.
[Figure: the earthquake network. Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls. P(B) = 0.01, P(E) = 0.02.]

  B  E  P(A|B,E)
  T  T  0.95
  T  F  0.94
  F  T  0.29
  F  F  0.001

  A  P(J|A)        A  P(M|A)
  T  0.90          T  0.70
  F  0.05          F  0.01

Once the topology is specified, we need to specify a conditional probability table (CPT) for each node.
– Each row contains the conditional probability of each node value for a conditioning case.
– Each row must sum to 1.
– A table for a Boolean variable with n Boolean parents contains 2^(n+1) probabilities.

A Bayesian network is a (more compact) representation of the joint probability distribution:

  P(x1, ..., xn) = P(x1) P(x2|x1) ... P(xn|x1 ∧ ... ∧ x(n-1))
                 = ∏i P(xi|x1 ∧ ... ∧ x(i-1))

– helpful in understanding how to construct the network
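The factorization can be coded directly. Here is a minimal Python sketch (not part of the tutorial) that encodes the CPTs above and answers queries by brute-force enumeration, feasible here because there are only five Boolean variables:

from itertools import product

# CPTs for the earthquake network (values from the tables above).
P_B = 0.01
P_E = 0.02
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A | B, E)
P_J = {True: 0.90, False: 0.05}   # P(J | A)
P_M = {True: 0.70, False: 0.01}   # P(M | A)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) via the network factorization (chain rule)."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

def query_burglary(j, m):
    """P(Burglary | JohnCalls=j, MaryCalls=m) by enumeration."""
    num = sum(joint(True, e, a, j, m)
              for e, a in product([True, False], repeat=2))
    den = sum(joint(b, e, a, j, m)
              for b, e, a in product([True, False], repeat=3))
    return num / den

print(query_burglary(True, True))   # ~0.56 with these CPTs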
Conditional Independence: Causal Chains

[Figure: chain A → B → C]

  P(C|A ∧ B) = P(C|B)

Example: A = Jack's flu, B = severe cough, C = Jill's flu.

Conditional Independence: Common Causes

[Figure: B ← A → C]

Conditioning on a common cause renders its effects independent:

  P(C|A ∧ B) = P(C|A)

Example: A = flu, with B and C two of its symptoms.

Conditional Dependence: Common Effects

[Figure: A → B ← C]

A and C are marginally independent, but become dependent once the common effect B is observed:

  P(A|B ∧ C) ≠ P(A|B)

D-separation

The graphical criterion for reading conditional independencies off the network.

Causal Ordering

Why does the variable ordering matter? Because using the causal order allows direct representation of conditional independencies.

[Figure: the earthquake network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls).]
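A d-separation test can be implemented via the equivalent moral-graph criterion (restrict to ancestors, moralize, delete the conditioning set, test reachability). A sketch, with the function and variable names my own:

def d_separated(parents, x, y, zs):
    """Moral-graph test for d-separation, equivalent to Pearl's
    path-based definition. parents: dict node -> list of parents."""
    # 1. Restrict to x, y, zs and all of their ancestors.
    relevant, stack = set(), [x, y, *zs]
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, []))
    # 2. Moralize: link each node to its parents, and co-parents to each other.
    adj = {n: set() for n in relevant}
    for n in relevant:
        ps = [p for p in parents.get(n, []) if p in relevant]
        for p in ps:
            adj[n].add(p); adj[p].add(n)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # 3. Remove conditioning nodes and check whether x can still reach y.
    blocked, seen, stack = set(zs), {x}, [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        for m in adj[n] - seen - blocked:
            seen.add(m); stack.append(m)
    return True

parents = {'Alarm': ['Burglary', 'Earthquake'],
           'JohnCalls': ['Alarm'], 'MaryCalls': ['Alarm']}
print(d_separated(parents, 'Burglary', 'Earthquake', []))         # True
print(d_separated(parents, 'Burglary', 'Earthquake', ['Alarm']))  # False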
Causal Ordering (cont'd)

[Figure: with the causal ordering, Flu and TB are parents of Cough; with a non-causal ordering (Cough first), Cough is a parent of both Flu and TB.]

In the non-causal ordering, the marginal independence of Flu and TB must be re-established by adding an arc Flu → TB or Flu ← TB.

Inference in Bayesian Networks

The basic task for any probabilistic inference system: compute the posterior probability distribution for a set of query variables, given values for some evidence variables.

[Figure: the four kinds of reasoning, with evidence E and query Q: Diagnostic, Causal, Intercausal (Explaining Away), and Mixed.]
Diagnostic inferences: from effects to causes.
  P(Burglary|JohnCalls)

Causal inferences: from causes to effects.
  P(JohnCalls|Burglary)
  P(MaryCalls|Burglary)

Intercausal inferences: between causes of a common effect (explaining away).
  P(Burglary|Alarm)
  P(Burglary|Alarm ∧ Earthquake)

Mixed inference: combining two or more of the above.
  P(Alarm|JohnCalls ∧ ¬Earthquake)
  P(Burglary|JohnCalls ∧ ¬Earthquake)

Exact inference
– Trees and polytrees: message-passing algorithm
– Multiply-connected networks: clustering

Approximate inference
– Large, complex networks: stochastic simulation; other approximation methods

In the general case, both sorts of inference are computationally complex ("NP-hard").
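All four kinds of query can be answered with the same enumeration machinery. A sketch reusing the joint() function defined earlier (illustrative only, not the message-passing or clustering algorithms just listed):

from itertools import product

NAMES = ['B', 'E', 'A', 'J', 'M']

def prob(query, evidence):
    """P(query | evidence); query and evidence map names in NAMES to booleans."""
    num = den = 0.0
    for vals in product([True, False], repeat=5):
        world = dict(zip(NAMES, vals))
        if any(world[v] != t for v, t in evidence.items()):
            continue
        p = joint(*vals)
        den += p
        if all(world[v] == t for v, t in query.items()):
            num += p
    return num / den

print(prob({'B': True}, {'J': True}))              # diagnostic: effect to cause
print(prob({'J': True}, {'B': True}))              # causal: cause to effect
print(prob({'B': True}, {'A': True}))              # intercausal, ~0.58
print(prob({'B': True}, {'A': True, 'E': True}))   # explaining away, ~0.03
print(prob({'A': True}, {'J': True, 'E': False}))  # mixed

The drop from P(B|A) to P(B|A ∧ E) is explaining away in action: the earthquake accounts for the alarm, so the burglary hypothesis loses support.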
Multiply-Connected Networks

Networks where two nodes are connected by more than one path:
– Two or more possible causes which share a common ancestor
– One variable can influence another through more than one causal mechanism

[Figure: an extended earthquake network with a PhoneRings node, P(Ph) = 0.05; JohnCalls now has parents PhoneRings and Alarm.]

  B  E  P(A)         Ph A  P(J)        A  P(M)
  T  T  0.95         T  T  0.95        T  0.70
  T  F  0.94         T  F  0.5         F  0.01
  F  T  0.29         F  T  0.90
  F  F  0.001        F  F  0.01
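For networks too large for exact methods, the stochastic simulation mentioned above can be sketched as likelihood weighting (one simple sampling scheme among several; this version is mine, reusing the CPT dictionaries P_B, P_E, P_A, P_J, P_M defined earlier):

import random

def likelihood_weighting(query_var, evidence, n=100_000):
    """Approximate P(query_var=T | evidence) by likelihood weighting."""
    topo = [('B', lambda w: P_B),
            ('E', lambda w: P_E),
            ('A', lambda w: P_A[(w['B'], w['E'])]),
            ('J', lambda w: P_J[w['A']]),
            ('M', lambda w: P_M[w['A']])]
    weights = {True: 0.0, False: 0.0}
    for _ in range(n):
        w, world = 1.0, {}
        for var, p_of_true in topo:        # sample in topological order
            p = p_of_true(world)
            if var in evidence:            # evidence: fix value, weight by likelihood
                world[var] = evidence[var]
                w *= p if evidence[var] else 1 - p
            else:                          # otherwise sample from the conditional
                world[var] = random.random() < p
        weights[world[query_var]] += w
    return weights[True] / (weights[True] + weights[False])

print(likelihood_weighting('B', {'J': True, 'M': True}))   # ~0.56, cf. exact answer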
Clustering Methods

Transform a multiply-connected network into a tree of clusters on which exact message passing applies (cf. Lauritzen & Spiegelhalter, 1988).

Decision Networks

A decision network represents information about the agent's current state, its possible actions, and the utility of the resulting states.

Types of Nodes

Chance nodes (ovals): represent random variables (same as in Bayesian networks). Each has an associated CPT. Parents can be decision nodes and other chance nodes.

Example: Umbrella

[Figure: chance nodes Weather and Forecast, decision node Take Umbrella, utility node U.]

Evaluating Decision Networks: Algorithm

1. Set the evidence variables for the current state.
2. For each possible value of the decision node: set the decision node to that value, calculate the posterior probabilities for the parents of the utility node, and calculate the resulting expected utility for the action.
3. Return the action with the highest expected utility.
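A sketch of this algorithm on the umbrella network. All numbers here (the prior on rain, the forecast reliability, and the utilities) are invented for illustration; the slide does not specify them:

# All numbers below are hypothetical -- the slide gives none.
P_RAIN = 0.3                                  # prior P(Weather = rain)
P_FORECAST_RAINY = {True: 0.8, False: 0.2}    # P(Forecast = rainy | Weather)
UTILITY = {(True, True): 70, (True, False): 0,     # keys: (rain?, take umbrella?)
           (False, True): 80, (False, False): 100}

def p_rain_given_forecast(forecast_rainy):
    """Step 1 and step 2's posterior: P(rain | forecast) by Bayes' rule."""
    like_rain = P_FORECAST_RAINY[True] if forecast_rainy else 1 - P_FORECAST_RAINY[True]
    like_dry = P_FORECAST_RAINY[False] if forecast_rainy else 1 - P_FORECAST_RAINY[False]
    num = like_rain * P_RAIN
    return num / (num + like_dry * (1 - P_RAIN))

def best_decision(forecast_rainy):
    """Steps 2-3: expected utility of each action; return the best."""
    p_rain = p_rain_given_forecast(forecast_rainy)
    eu = {take: p_rain * UTILITY[(True, take)] + (1 - p_rain) * UTILITY[(False, take)]
          for take in (True, False)}
    return max(eu, key=eu.get), eu

print(best_decision(True))    # rainy forecast -> take the umbrella
print(best_decision(False))   # dry forecast -> leave it at home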
Dynamic Bayesian Networks

[Figure: a dynamic network unrolled over time: state nodes State t-2 ... State t+2, each linked to an observation node Obs t-2 ... Obs t+2 by the sensor model.]

Similarly, decision networks can be extended to include temporal aspects. Sequence of decisions taken = plan.
Example: Cancer

Metastatic cancer is a possible cause of a brain tumour and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumour. (Example from Pearl, 1988.)

[Figure: A = Metastatic Cancer; B = Increased total serum calcium; C = Brain tumour; D = Coma; E = Severe Headaches.]

Example: Asia

A patient presents to a doctor with shortness of breath. The doctor considers that possible causes are tuberculosis, lung cancer and bronchitis. Other relevant information is whether the patient has recently visited Asia (where tuberculosis is more prevalent) and whether or not the patient is a smoker (which increases the chances of cancer and bronchitis). A positive X-ray would indicate either TB or lung cancer. (Example from Lauritzen, 1988.)

[Figure: the Asia network, with root nodes "visit to Asia" and "smoking".]

Probabilistic Reasoning in Medicine

[Figure: a two-level network with finding nodes F1, F2, F3.]
Medical Applications

– QMR (Quick Medical Reference): 600 diseases, 4,000 findings, 40,000 arcs (Shwe & Cooper, 1990; Dean & Wellman, 1991).
– ALARM (Beinlich et al., 1989): 37 nodes, 42 arcs. (See Netica examples.)
– Glucose prediction and insulin dose adjustment (Andreassen et al., 1991).
– CPSC project (Pradham et al., 1994).

[Figure: a fragment of the ALARM network, including nodes MinVolSet, FiO2, VentAlv, MinVol, InsuffAnesth, SaO2, TPR and Catechol, with marginal probabilities.]
Normative Model: represents our best understanding of the domain; proper (constrained) Bayesian updating, given premises.

User Model: represents our best understanding of the human; Bayesian updating modified to reflect human biases (e.g., overconfidence; Korb, McConachy, Zukerman, 1997).

[Figure: four model structures over nodes A0-A3, L0-L3, Q and Q': (a) mainModel, (b) indepModel, (c) actionModel, (d) locationModel.]

Traffic plan recognition (Pynadeth & Wellman, 1995).

BNs are embedded in a semantic hierarchy:
– supports attentional modeling
– constrained updating
[Figure: the semantic hierarchy: a two-layer semantic network, with higher-level concepts like 'motivation' or 'ability' and lower-level concepts like 'Grade Point Average', above a Bayesian network.]

Bayesian Poker

Poker is ideal for testing automated reasoning under uncertainty:
– physical randomisation
– incomplete hand information
– incomplete opponent information (strategies, bluffing, etc.)
Bayesian Poker BN

Different networks (matrices) for each round. Betting curves based on pot-odds are used to determine the action (bet/call, pass or fold).

OPP Current, BPP Current: (partial) hand types with the cards dealt so far.

Observation nodes:
– OPP Upcards: all of the opponent's cards except the first are visible to BPP.
– OPP Action: BPP knows the opponent's action.

[Figure: the poker network: BPP Win at the top, with OPP Current and BPP Current below it, each linked to its observation nodes.]
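The pot-odds rule referred to above can be written down directly; this simplified sketch is my own (it ignores bluffing and bet sizing) and compares the network's estimated win probability against the break-even point:

def poker_action(p_win, pot, to_call):
    """Compare the BN's estimated win probability (e.g., from the
    BPP Win node) against the pot odds. Break-even point:
    p_win == to_call / (pot + to_call)."""
    pot_odds = to_call / (pot + to_call)
    if p_win > pot_odds:
        return 'bet/call'
    return 'fold' if to_call > 0 else 'pass'

print(poker_action(0.35, pot=10, to_call=2))   # 0.35 > 1/6 -> 'bet/call'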
Deployed BNs

From the Web Site database: see handout for details.

BN Software: Issues

– Functionality (especially application vs API)
– Price: many free demo versions or educational-use licences; commercial licence costs
– Availability (platforms)
– Quality: GUI, documentation and help, leading edge, robustness (of the software and of the company)

BN Software

– Analytica: www.lumina.com
– Hugin: www.hugin.com
– Netica: www.norsys.com
(The above 3 are available during the tutorial lab session.)
– JavaBayes: http://www.cs.cmu.edu/~javabayes/Home/
– Many other packages (see next slide)
BN Web Resources

– Bayesian Belief Network site (Russell Greiner): www.cs.ualberta.ca/~greiner/bn.html
– Bayesian Network Repository (Nir Friedman): www-nt.cs.berkeley.edu/home/nir/public_html/Repository/index.htm
– Summary of BN software and links to software sites (Kevin Murphy): http.cs.berkeley.edu/~murphyk/Bayes/bnsoft.html (includes Murphy's Bayes net toolbox)
– Russell Almond's BN page: bayes.stat.washington.edu/almond/belief.html

Applications: Summary

Various BN structures are available to compactly and accurately represent certain types of domain features.

Bayesian networks have been used for a wide range of AI applications.

Robust and easy-to-use Bayesian network software is now readily available.
Learning Conditional Probability Tables

Spiegelhalter & Lauritzen (1990): assume parameter independence.
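The idea can be sketched as independent Beta (more generally, Dirichlet) counts, one per CPT row, updated sequentially as cases arrive. The uniform prior and the toy data below are arbitrary choices of mine:

from collections import defaultdict

# One independent Beta(a, b) prior per conditioning case (CPT row)
# of a Boolean node -- here Alarm, with parents (Burglary, Earthquake).
counts = defaultdict(lambda: [1.0, 1.0])   # [a, b]: uniform Beta(1,1) prior

def observe(row, value):
    """Sequential updating: bump the count for the observed value."""
    counts[row][0 if value else 1] += 1

def p_true(row):
    """Posterior mean estimate of P(Alarm=T | row)."""
    a, b = counts[row]
    return a / (a + b)

for b, e, alarm in [(False, False, False), (True, False, True),
                    (False, False, False), (False, True, True)]:
    observe((b, e), alarm)

print(p_true((False, False)))   # 1/4 after two negative cases
print(p_true((True, True)))     # 1/2: no data yet, so the prior mean

Parameter independence is what licenses updating each row separately from the others.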
TETRAD II
— Spirtes, Glymour and Scheines (1993)

Replace the Oracle with statistical tests:

– for linear models, a significance test on partial correlation:

  X ⊥ Y | S  iff  ρ(XY·S) = 0

– for discrete models, a χ² test on the difference between the CPT counts expected under independence (Ei) and those observed (Oi):

  X ⊥ Y | S  iff  Σi Oi ln(Oi/Ei) = 0

Statistical Equivalence

Chickering (1995):

Any two causal models over the same variables which have the same skeleton (undirected arcs) and the same directed v-structures are statistically equivalent.

If H1 and H2 are statistically equivalent, then they have the same maximum likelihoods relative to any joint samples:

  max_θ1 P(e|H1, θ1) = max_θ2 P(e|H2, θ2)

where θi is a parameterization of Hi.
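Both tests can be sketched in a few lines. This version assumes scipy is available; the first-order partial correlation recursion is the standard one, and the names are mine:

import math
from scipy import stats

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation rho(XY.Z) from pairwise correlations."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def independent_linear(r, n, k, alpha=0.05):
    """Fisher z test of rho = 0: r is the sample partial correlation,
    n the sample size, k the number of conditioning variables."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - k - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha     # fail to reject rho = 0: accept X _||_ Y | S

def independent_discrete(observed, expected, dof, alpha=0.05):
    """G^2 = 2 * sum O_i ln(O_i / E_i), compared to a chi^2 distribution."""
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return stats.chi2.sf(g2, dof) > alpha

# Example: is X _||_ Y | {Z} plausible at n = 200?
print(independent_linear(partial_corr(0.30, 0.50, 0.60), n=200, k=1))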
Statistical Equivalence Learners

Wallace & Korb (1999): This is not right!

MDL learning:
– M(Xi, π(i)) is the mutual information between Xi and its parent set π(i)
– H(Xi) is the entropy of variable Xi
– Iterate until the MDL score fails to improve ⇒ results similar to K2, but without a full variable ordering
– Equivalent to finding the h that maximizes P(h)P(e|h) — i.e., P(h|e).

The other significant difference from MDL: MML takes parameter estimation seriously (cf. MDL's fixed code length for parameters). Here θj are the parameters for Xj, F(θj) is the Fisher information, and f(θj|h) is assumed to be N(0, σj).
Empirical Results

Info on TETRAD II; downloadable TETRAD III (approximately equivalent to II):
hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html

Combinations of discrete and continuous variables (i.e., mixing node types).

Learning issues
(Other) Limitations
– inappropriate problems (deterministic systems, ...)

References

Introduction to Bayesian AI

T. Bayes (1764) "An Essay Towards Solving a Problem in the Doctrine of Chances." Phil Trans of the Royal Soc of London. Reprinted in Biometrika, 45 (1958), 296-315.

B. Buchanan and E. Shortliffe (eds.) (1984) Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.

B. de Finetti (1964) "Foresight: Its Logical Laws, Its Subjective Sources," in Kyburg and Smokler (eds.) Studies in Subjective Probability. NY: Wiley.
CHAPTERS 1, 2 AND 4 COVER SOME OF THE RELEVANT HISTORY.

F. P. Ramsey (1931) "Truth and Probability," in The Foundations of Mathematics and Other Essays. NY: Humanities Press.
THE ORIGIN OF MODERN BAYESIANISM. INCLUDES LOTTERY-BASED ELICITATION AND DUTCH-BOOK ARGUMENTS FOR THE USE OF PROBABILITIES.

R. Reiter (1980) "A logic for default reasoning," Artificial Intelligence, 13, 81-132.

J. von Neumann and O. Morgenstern (1947) Theory of Games and Economic Behavior, 2nd ed. Princeton Univ.
STANDARD REFERENCE ON ELICITING UTILITIES VIA LOTTERIES.

Bayesian Networks

S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A.R. Sørensen, A. Rosenfalck and F. Jensen (1989) "MUNIN — An Expert EMG Assistant," in J.E. Desmedt (ed.) Computer-Aided Electromyography and Expert Systems, Chapter 21. Elsevier.

S.A. Andreassen, J.J. Benn, R. Hovorka, K.G. Olesen and R.E. Carson (1991) "A Probabilistic Approach to Glucose Prediction and Insulin Dose Adjustment: Description of Metabolic Model and Pilot Evaluation Study."

E. Charniak (1991) "Bayesian Networks Without Tears," Artificial Intelligence Magazine, Vol 12, pp. 50-63.
AN ELEMENTARY INTRODUCTION.

G.F. Cooper (1990) "The computational complexity of probabilistic inference using belief networks," Artificial Intelligence, 42, 393-405.

R.G. Cowell, A.P. Dawid, S.L. Lauritzen and D.J. Spiegelhalter (1999) Probabilistic Networks and Expert Systems. New York: Springer.
TECHNICAL SURVEY OF BAYESIAN NET TECHNOLOGY, INCLUDING LEARNING BAYESIAN NETS.

B. D'Ambrosio (1999) "Inference in Bayesian Networks," Artificial Intelligence Magazine, Vol 20, No. 2.

A.P. Dawid (1998) "Conditional independence," in Encyclopedia of Statistical Sciences, Update Volume 2. New York: Wiley Interscience.

P. Haddawy (1999) "An Overview of Some Recent Developments in Bayesian Problem-Solving Techniques," Artificial Intelligence Magazine, Vol 20, No. 2.

M. Henrion, J.S. Breese and E.J. Horvitz (1991) "Decision analysis and expert systems," AI Magazine, 12, 64-91.

R.A. Howard and J.E. Matheson (1981) "Influence Diagrams," in Howard and Matheson (eds.) Readings in the Principles and Applications of Decision Analysis. Menlo Park, Calif: Strategic Decisions Group.

F.V. Jensen (1996) An Introduction to Bayesian Networks. Springer.

K.B. Korb, I. Zukerman and R. McConachy (1997) "A cognitive model of argumentation," in Proceedings of the Cognitive Science Society, Stanford University.

S.L. Lauritzen and D.J. Spiegelhalter (1988) "Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems," Journal of the Royal Statistical Society, 50(2).

R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley.
SIMILAR COVERAGE TO THAT OF PEARL; MORE EMPHASIS ON PRACTICAL ALGORITHMS FOR NETWORK UPDATING.

J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
THIS IS THE CLASSIC TEXT INTRODUCING BAYESIAN NETWORKS.

Learning Bayesian Networks

H. Blalock (1964) Causal Inference in Nonexperimental Research. University of North Carolina.

G.F. Cooper and E. Herskovits (1991) "A Bayesian Method for Constructing Bayesian Belief Networks from Databases," in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991.

D. Geiger and D. Heckerman (1994) "Learning Gaussian networks," in Lopez de Mantaras and Poole (eds.) UAI 1994, 235-243.

D. Heckerman and D. Geiger (1995) "Learning Bayesian networks: A unification for discrete and Gaussian domains," in Besnard and Hanks (eds.) UAI 1995, 274-284.

D. Heckerman, D. Geiger and D.M. Chickering (1995) "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data," Machine Learning, 20, 197-243.

D. Madigan, S.A. Andersson, M.D. Perlman and C.T. Volinsky (1996) "Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs," Comm in Statistics: Theory and Methods, 25, 2493-2519.

D. Madigan and A.E. Raftery (1994) "Model selection and accounting for model uncertainty in graphical models using Occam's window," Jrn Amer Stat Assoc, 89, 1535-1546.

N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller (1953) "Equations of state calculations by fast computing machines," Jrn Chemical Physics, 21, 1087-1091.

J.R. Neil and K.B. Korb (1999) "The Evolution of Causal Models: A Comparison of Bayesian Metrics and Structure Priors," in N. Zhong and L. Zhous (eds.) Methodologies for Knowledge Discovery and Data Mining: Third Pacific-Asia Conference (pp. 432-437). Springer Verlag.
GENETIC ALGORITHMS FOR CAUSAL DISCOVERY; STRUCTURE PRIORS.

J.R. Neil, C.S. Wallace and K.B. Korb (1999) "Learning Bayesian networks with restricted causal interactions," in Laskey and Prade (eds.) UAI 99, 486-493.

J. Rissanen (1978) "Modeling by shortest data description," Automatica, 14, 465-471.

H. Simon (1954) "Spurious Correlation: A Causal Interpretation," Jrn Amer Stat Assoc, 49, 467-479.

D. Spiegelhalter and S. Lauritzen (1990) "Sequential Updating of Conditional Probabilities on Directed Graphical Structures," Networks, 20, 579-605.

P. Spirtes, C. Glymour and R. Scheines (1990) "Causality from Probability," in J.E. Tiles, G.T. McKee and G.C. Dean (eds.) Evolving Knowledge in Natural Science and Artificial Intelligence. London: Pitman.
AN ELEMENTARY INTRODUCTION TO STRUCTURE LEARNING VIA CONDITIONAL INDEPENDENCE.

P. Spirtes, C. Glymour and R. Scheines (1993) Causation, Prediction and Search: Lecture Notes in Statistics 81. Springer Verlag.
A THOROUGH PRESENTATION OF THE ORTHODOX STATISTICAL APPROACH TO LEARNING CAUSAL STRUCTURE.

J. Suzuki (1996) "Learning Bayesian Belief Networks Based on the Minimum Description Length Principle," in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 462-470). San Francisco: Morgan Kaufmann.

T.S. Verma and J. Pearl (1991) "Equivalence and Synthesis of Causal Models," in P. Bonissone, M. Henrion, L. Kanal and J.F. Lemmer (eds.) Uncertainty in Artificial Intelligence 6 (pp. 255-268). Elsevier.
THE GRAPHICAL CRITERION FOR STATISTICAL EQUIVALENCE.

C.S. Wallace and D. Boulton (1968) "An information measure for classification," Computer Jrn, 11, 185-194.

C.S. Wallace and P.R. Freeman (1987) "Estimation and inference by compact coding," Jrn Royal Stat Soc (Series B), 49, 240-252.

C.S. Wallace and K.B. Korb (1999) "Learning Linear Causal Models by MML Sampling," in A. Gammerman (ed.) Causal Models and Intelligent Data Management. Springer Verlag.
SAMPLING APPROACH TO LEARNING CAUSAL MODELS; DISCUSSION OF STRUCTURE PRIORS.