
ECE 8803
Parameter Learning in Graphical Models

Module 9: Part A
Maximum Likelihood Estimation
1. Known Structure &
2. Fully Observed Variables

Faramarz Fekri
Center for Signal and Information Processing
Overview
•  Learning problems in graphical models
–  Parameter estimation problem
•  Maximum likelihood estimation (MLE)
–  Biased coin toss
–  Gaussian (single and multivariate)
•  Sufficient statistics
–  MLE for Bayesian networks
–  Limitations of MLE
•  Bayesian estimation (next lecture)
Chapters 16 and 17 of the textbook (K&F)
Learning Graphical Models
•  Up to now, we assumed that the graphical models were given.
•  Where do the networks come from?
–  Knowledge engineering with the aid of experts
–  Learning: automated construction of networks from data (instances)
•  Our goal: given a set of independent samples (assignments of the random variables), find the best (i.e., most likely) graphical model, both the structure and the parameters.

[Figure: learning a Bayesian network over (A, F, S, N, H). An inducer takes a data set of samples, e.g. (A,F,S,N,H) = (T,F,F,T,F), (T,F,T,T,F), ..., (F,T,T,T,T), together with prior information, and performs structure learning and parameter learning to output a network (illustrated with variables X1, X2, Y and the learned CPT P(Y|X1,X2)).]

Learning
•  Measures of success
–  How close is the learned network to the original distribution?
•  Use distance measures between distributions
•  Often hard, since we do not have the true underlying distribution
•  Instead, evaluate performance by how well the network predicts new unseen examples (“test data”), e.g., classification accuracy
–  How close is the structure of the network to the true one?
•  Use a distance metric between structures
•  Hard, because we do not know the true structure
•  Instead, ask whether the independencies learned hold in the test data



Prior Knowledge
•  Prespecified structure
–  Need to learn only CPDs
•  Prespecified variables
–  Need to learn both network structure and CPDs
•  Hidden variables
–  Need to learn hidden variables, structure, and CPDs

•  Complete/incomplete data
–  Missing data
–  Unobserved variables
Learning Problems in GM (I)
Four types of learning problems will be covered.
1.  Known structure, fully observed (complete) data:
•  Data does not contain missing values
•  Goal: Parameter (CPD) estimation

[Figure: given the initial network structure over X1, X2, Y and a fully observed data set of tuples (x1, x2, y), the inducer/learner estimates the CPT P(Y|X1,X2). Inputs: the data, prior information, and a choice of parametric family for each CPD, e.g., P(Xi|Pa(Xi)). Output: the learned CPT, e.g.
  P(Y|X1,X2):       y0     y1
  x1=0, x2=0:       1      0
  x1=0, x2=1:       0.2    0.8
  x1=1, x2=0:       0.1    0.9
  x1=1, x2=1:       0.02   0.98 ]


Learning Problems in GM (II)
2.  Unknown structure, fully observed (complete) data:
•  Data does not contain missing values
•  Goal: Structure learning & parameter (CPD) estimation

[Figure: same inducer diagram as in (I), except that the structure is not given; both the edges among X1, X2, Y and the CPT P(Y|X1,X2) must be learned from the data.]


Learning Problems in GM (III)
3.  Known structure, missing (incomplete) data:
•  Data contains missing values (e.g., Naïve Bayes)
•  Goal: Parameter (CPD) estimation

[Figure: same inducer diagram as in (I), but some entries of the data tuples are missing (marked “?”); the structure is known and the CPT P(Y|X1,X2) must still be estimated.]


Learning Problems in GM (IV)
4.  Unknown structure, missing (incomplete) data:
•  Data contains missing values
•  Goal: Structure learning & parameter (CPD) estimation

[Figure: same inducer diagram as in (I), but the data contains missing values and the structure is unknown; both the edges and the CPT P(Y|X1,X2) must be learned.]


Learning Principle in GM
•  Estimation principles:
–  Maximum Likelihood Estimation (MLE)
–  Bayesian Estimation (BE)
•  Common features:
–  Utilize the distribution factorization
–  Utilize inference algorithms
–  Utilize regularization/priors

Learning for GMs:
                          Known Structure    Unknown Structure
  Fully observable data   Relatively easy    Hard
  Missing data            Hard (EM)          Very hard

Parameter Estimation Problem
•  Biased Coin Toss Example
–  The coin can land in two positions: Head or Tail.
–  Task: estimate the probability θ = P(X=h) of landing heads for a biased coin. Denote by P(H) and P(T) the probabilities P(X=h) = θ and P(X=t) = 1−θ, respectively.
–  Given: a sequence of m independently and identically distributed (iid) flips D = {x[1],...,x[m]}
–  Example: D = {x[1],...,x[m]} = {1,0,1,...,0}, with x[i] ∈ {0,1}
•  Assumption: i.i.d. samples
–  Tosses are controlled by an (unknown) parameter θ
–  Tosses are sampled from the same distribution
–  Tosses are independent of each other
•  Goal: find θ ∈ [0,1] that predicts the data well
–  “Predicts the data well” = likelihood of the data given θ:
   L(D : θ) = P(D | θ) = ∏_{i=1}^{m} P(x[i] | x[1],...,x[i−1], θ) = ∏_{i=1}^{m} P(x[i] | θ)

Biased Coin Toss Example
•  Model (Bernoulli): P(x|θ) = θ^x (1−θ)^(1−x), i.e., P(x=1|θ) = θ and P(x=0|θ) = 1−θ.
•  For iid data, the likelihood of the dataset is
   P(D|θ) = ∏_{i=1}^{N} P(x_i|θ) = ∏_{i=1}^{N} θ^{x_i} (1−θ)^{1−x_i} = θ^{#heads} (1−θ)^{#tails}
•  Maximum likelihood estimator: the parameter θ that maximizes L(D:θ).
•  Example: the probability of the sequence H,T,T,H,H is
   L(H,T,T,H,H : θ) = P(H|θ) P(T|θ) P(T|θ) P(H|θ) P(H|θ) = θ^3 (1−θ)^2
   In this example, θ = 0.6 maximizes the likelihood of the sequence.
   [Figure: plot of L(D:θ) versus θ ∈ [0,1], peaking at θ = 0.6.]

Parameter Estimation: Biased Coin Toss (Bayesian view)
•  Bayesians treat the unknown parameter as a random variable whose distribution can be inferred using Bayes’ rule:
   P(θ|D) = P(D|θ) P(θ) / P(D) = P(D|θ) P(θ) / ∫ P(D|θ) P(θ) dθ
•  The crucial equation can be written in words:
   Posterior = (likelihood × prior) / marginal likelihood
•  The prior P(θ) encodes our prior knowledge of the domain.
•  Different priors P(θ) will end up with different estimates P(θ|D)!
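
As a quick numerical illustration (a hypothetical sketch, not part of the original slides), the snippet below evaluates L(D:θ) = θ^#heads (1−θ)^#tails on a grid of θ values for the sequence H,T,T,H,H and confirms that the maximum is near θ = 0.6.

```python
# Hypothetical sketch: evaluate the Bernoulli likelihood for H,T,T,H,H on a grid of theta.
import numpy as np

data = np.array([1, 0, 0, 1, 1])                       # H,T,T,H,H encoded as 1/0
n_heads = data.sum()
n_tails = len(data) - n_heads

thetas = np.linspace(0.01, 0.99, 99)
likelihood = thetas**n_heads * (1 - thetas)**n_tails   # L(D:theta)

print("theta maximizing L(D:theta):", thetas[np.argmax(likelihood)])  # ~0.6
```
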


Maximum Likelihood Estimator (MLE)
•  MLE is a very popular estimator; it is simple and has good statistical properties.
•  Find the parameter θ that maximizes L(D:θ), given N observed data points:
   θ̂ = argmax_θ P(D|θ) = argmax_θ ∏_{i=1}^{N} P(x_i|θ)
•  MLE for the biased coin:
–  Objective function: the log-likelihood
   ℓ(θ; D) = log P(D|θ) = log [θ^{n_h} (1−θ)^{n_t}] = n_h log θ + (N − n_h) log(1−θ)
–  We need to maximize this w.r.t. θ, so take the derivative and set it to zero:
   ∂ℓ/∂θ = n_h/θ − (N − n_h)/(1−θ) = 0  ⇒  θ_MLE = n_h/N,  i.e.,  θ_MLE = (1/N) Σ_i x_i
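
A minimal sketch (hypothetical function name, not from the slides) of the closed-form coin MLE derived above, θ_MLE = n_h/N:

```python
# Closed-form MLE for the biased coin: theta_MLE = n_h / N.
def coin_mle(flips):
    """flips: sequence of 0/1 outcomes (1 = heads). Returns the MLE of theta = P(heads)."""
    flips = list(flips)
    return sum(flips) / len(flips)

print(coin_mle([1, 0, 0, 1, 1]))   # 0.6 for the sequence H,T,T,H,H
```
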
Sufficient Statistics
•  For computing the parameter θ of the coin toss example, we only needed M_H and M_T, since
   L(θ : D) = P(D : θ) = θ^{M_H} (1 − θ)^{M_T}
•  M_H and M_T are sufficient statistics.
•  Definition: a function s(D) is a sufficient statistic, mapping instances to a vector in R^k, if for any two datasets D and D' and any θ ∈ Θ we have
   Σ_{x[i]∈D} s(x[i]) = Σ_{x[i]∈D'} s(x[i])  ⇒  L(D : θ) = L(D' : θ)
•  We often refer to the tuple Σ_{x[i]∈D} s(x[i]) as the sufficient statistics of the data set D.
•  In the coin toss experiment, M_H and M_T are sufficient statistics.
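
A small illustrative sketch (hypothetical, not from the slides): two different coin-toss datasets with the same counts (M_H, M_T) yield exactly the same likelihood function.

```python
# Two orderings of tosses with the same sufficient statistics (M_H = 3, M_T = 2)
# give identical likelihoods L(D:theta) = theta^M_H * (1-theta)^M_T for every theta.
import numpy as np

def likelihood(data, theta):
    m_h = sum(data)
    m_t = len(data) - m_h
    return theta**m_h * (1 - theta)**m_t

D1 = [1, 0, 0, 1, 1]   # H,T,T,H,H
D2 = [1, 1, 1, 0, 0]   # H,H,H,T,T  (same counts, different order)

thetas = np.linspace(0.01, 0.99, 99)
assert np.allclose(likelihood(D1, thetas), likelihood(D2, thetas))
```
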


Sufficient Statistics for Multinomial
•  Y: multinomial with k values (e.g., the result of a dice throw).
•  A sufficient statistic for a dataset D over Y is the tuple of counts <M_1,...,M_k>, where M_i is the number of times that Y = y_i occurs in D.
•  Likelihood function:
   L(D : θ) = ∏_{i=1}^{k} θ_i^{M_i},   where θ_i = P(Y = y_i)
•  MLE principle: choose θ that maximizes L(D:θ).
•  It can be shown that the multinomial MLE is
   θ̂_i = M_i / Σ_{j=1}^{k} M_j
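
A minimal sketch (hypothetical dice example, not from the slides) of the multinomial MLE computed from the counts:

```python
# Multinomial MLE: theta_i = M_i / sum_j M_j, estimated from raw dice throws.
from collections import Counter

def multinomial_mle(samples, k):
    """samples: list of outcomes in {1,...,k}. Returns [theta_1,...,theta_k]."""
    counts = Counter(samples)                          # sufficient statistics <M_1,...,M_k>
    total = len(samples)
    return [counts.get(i, 0) / total for i in range(1, k + 1)]

print(multinomial_mle([1, 3, 3, 6, 2, 3], k=6))        # e.g. theta_3 = 3/6 = 0.5
```
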
Sufficient Statistic for a Single-Variable Gaussian
•  Gaussian distribution: X ~ N(μ, σ²)
–  Probability density function (pdf):
   p(x) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )
–  It can be rewritten as
   p(x) = (1 / (√(2π) σ)) exp( −x²/(2σ²) + xμ/σ² − μ²/(2σ²) )
–  Hence the sufficient statistics for the Gaussian are <M, Σ_m x[m], Σ_m x[m]²>.
•  MLE principle: choose θ = (μ, σ) that maximizes L(D:θ).
•  It can be shown that the Gaussian MLE is
   μ̂ = (1/M) Σ_m x[m]
   σ̂² = (1/M) Σ_m (x[m] − μ̂)²
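
A minimal sketch (toy data, not from the slides) computing the Gaussian MLE directly from the sufficient statistics <M, Σ x[m], Σ x[m]²>:

```python
# Gaussian MLE from sufficient statistics: mu = S1/M and sigma^2 = S2/M - mu^2,
# which equals (1/M) * sum_m (x[m] - mu)^2.
import numpy as np

x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])        # toy observations
M, S1, S2 = len(x), x.sum(), (x**2).sum()      # sufficient statistics

mu_mle = S1 / M
var_mle = S2 / M - mu_mle**2                   # MLE of sigma^2

print(mu_mle, var_mle)
```
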


MLE for a Multivariate Gaussian
•  Likewise, it can be shown that the MLE for μ and Σ of a multivariate Gaussian, given samples x_n = (x_{n,1},...,x_{n,K})^T stacked as the rows x_1^T,...,x_N^T of the data matrix X, is
   μ̂_ML = (1/N) Σ_n x_n
   Σ̂_ML = (1/N) Σ_n (x_n − μ̂_ML)(x_n − μ̂_ML)^T = (1/N) S
   where S = Σ_n (x_n − μ̂_ML)(x_n − μ̂_ML)^T is the scatter matrix.
•  The sufficient statistics are Σ_n x_n together with X^T X = Σ_n x_n x_n^T.
•  Remark: X^T X may not be a full-rank matrix (e.g., if N < D), in which case S, and hence Σ̂_ML, is not invertible.
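
A minimal sketch (toy data, hypothetical variable names) of the multivariate-Gaussian MLE via the scatter matrix:

```python
# Multivariate Gaussian MLE: mu_ML = mean of the rows, Sigma_ML = scatter matrix / N.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # N x K data matrix (toy data)
N = X.shape[0]

mu_ml = X.mean(axis=0)
centered = X - mu_ml
scatter = centered.T @ centered            # S = sum_n (x_n - mu)(x_n - mu)^T
sigma_ml = scatter / N                     # MLE of the covariance matrix

print(mu_ml)
print(sigma_ml)
```
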
MLE for Bayesian Networks (I)
•  Example: a two-node network X → Y.
–  Parameters: θ_{x0}, θ_{x1} (e.g., P(x0) = 0.7, P(x1) = 0.3) and θ_{y0|x0}, θ_{y1|x0}, θ_{y0|x1}, θ_{y1|x1} (e.g., P(y0|x0) = 0.95, P(y1|x0) = 0.05, P(y0|x1) = 0.2, P(y1|x1) = 0.8).
•  Training data: tuples <x[m], y[m]>, m = 1,...,M
•  Likelihood function:
   L(D : θ) = ∏_{m=1}^{M} P(x[m], y[m] : θ)
            = ∏_{m=1}^{M} P(x[m] : θ_X) P(y[m] | x[m] : θ_{Y|X})
            = ( ∏_{m=1}^{M} P(x[m] : θ_X) ) ( ∏_{m=1}^{M} P(y[m] | x[m] : θ_{Y|X}) )
•  The likelihood decomposes into two separate terms, one for each variable (“decomposability of the likelihood function”).
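
A small numerical sketch (hypothetical data; the CPT values are those from the slide) of decomposability: the total log-likelihood equals the sum of a term that depends only on θ_X and a term that depends only on θ_{Y|X}.

```python
# Decomposability of the log-likelihood for the X -> Y network:
# log L(D:theta) = sum_m log P(x[m]) + sum_m log P(y[m] | x[m]).
import math

theta_x = {0: 0.7, 1: 0.3}                                      # P(X)
theta_y_given_x = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}  # P(Y|X)

data = [(0, 0), (1, 1), (0, 0), (1, 0)]                         # (x[m], y[m]) pairs

term_x = sum(math.log(theta_x[x]) for x, _ in data)
term_y_given_x = sum(math.log(theta_y_given_x[x][y]) for x, y in data)
joint = sum(math.log(theta_x[x] * theta_y_given_x[x][y]) for x, y in data)

assert math.isclose(joint, term_x + term_y_given_x)
```
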
MLE for Bayesian Networks (II)
•  The terms further decompose by CPDs:
   ∏_{m=1}^{M} P(y[m] | x[m] : θ_{Y|X})
     = ∏_{m: x[m]=x0} P(y[m] | x[m] : θ_{Y|X}) · ∏_{m: x[m]=x1} P(y[m] | x[m] : θ_{Y|X})
     = ∏_{m: x[m]=x0} P(y[m] | x[m] : θ_{Y|x0}) · ∏_{m: x[m]=x1} P(y[m] | x[m] : θ_{Y|x1})
•  By sufficient statistics:
   ∏_{m: x[m]=x1} P(y[m] | x[m] : θ_{Y|x1}) = θ_{y0|x1}^{M[x1,y0]} · θ_{y1|x1}^{M[x1,y1]}
   where M[x1,y1] is the number of data instances in which X takes the value x1 and Y takes the value y1.
•  MLE:
   θ̂_{y0|x1} = M[x1,y0] / (M[x1,y0] + M[x1,y1]) = M[x1,y0] / M[x1]
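
A minimal sketch (hypothetical toy data) of the count-based estimate θ̂_{y0|x1} = M[x1,y0] / M[x1]:

```python
# Estimate an entry of P(Y|X) by counting, as derived above.
from collections import Counter

data = [("x1", "y0"), ("x1", "y1"), ("x0", "y0"), ("x1", "y0"), ("x0", "y0")]

pair_counts = Counter(data)                       # M[x, y]
x_counts = Counter(x for x, _ in data)            # M[x]

theta_y0_given_x1 = pair_counts[("x1", "y0")] / x_counts["x1"]
print(theta_y0_given_x1)                          # 2/3 for this toy dataset
```
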
MLE for General Bayesian Networks
•  If we assume that the parameters of each CPT are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
   ℓ(θ; D) = log P(D|θ) = log ∏_i ∏_j P(x_j^i | pa_{X_j}^i, θ_j) = Σ_i Σ_j log P(x_j^i | pa_{X_j}^i, θ_j)
•  Example (the Flu/Allergy/Sinus/Headache/Nose network in the figure): each sample i contributes
   log P(a^i | θ_a) + log P(f^i | θ_f) + log P(s^i | a^i, f^i, θ_s) + log P(h^i | s^i, θ_h) + log P(n^i | s^i, θ_n)
•  One term for each CPT: this breaks the MLE problem up into independent subproblems, so we just need to estimate each CPT separately. Why? Earlier we already learned how to estimate a single CPT.
•  For each variable X_i:
   P_MLE(X_i = x | Pa_{X_i} = u) = #(X_i = x, Pa_{X_i} = u) / #(Pa_{X_i} = u)
[Figure: Bayesian network with Allergy and Flu as parents of Sinus, and Sinus as the parent of Nose and Headache.]
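
A minimal sketch (toy structure and data assumed for illustration, not from the slides) that estimates every CPT separately by counting, per the formula above:

```python
# For each variable X, P_MLE(X = x | Pa = u) = #(X = x, Pa = u) / #(Pa = u),
# estimated independently for every CPT of the network.
from collections import Counter

# Toy structure (assumed): parents of each variable.
parents = {"Flu": [], "Allergy": [], "Sinus": ["Flu", "Allergy"], "Headache": ["Sinus"]}

# Fully observed samples: dicts mapping variable name -> value.
data = [
    {"Flu": 1, "Allergy": 0, "Sinus": 1, "Headache": 1},
    {"Flu": 0, "Allergy": 1, "Sinus": 1, "Headache": 0},
    {"Flu": 0, "Allergy": 0, "Sinus": 0, "Headache": 0},
    {"Flu": 1, "Allergy": 0, "Sinus": 1, "Headache": 1},
]

def mle_cpts(data, parents):
    cpts = {}
    for var, pa in parents.items():
        joint = Counter((tuple(d[p] for p in pa), d[var]) for d in data)   # #(X=x, Pa=u)
        pa_counts = Counter(tuple(d[p] for p in pa) for d in data)         # #(Pa=u)
        cpts[var] = {(u, x): c / pa_counts[u] for (u, x), c in joint.items()}
    return cpts

print(mle_cpts(data, parents)["Sinus"])   # e.g. P(Sinus=1 | Flu=1, Allergy=0) = 1.0
```
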
MLE for Table CPDs in Bayesian Networks
•  Multinomial (table) CPD for the X → Y network, with parameters θ_{x0}, θ_{x1} and θ_{y0|x0}, θ_{y1|x0}, θ_{y0|x1}, θ_{y1|x1}.
•  Training data: tuples <x[m], y[m]>, m = 1,...,M
•  Likelihood function for the Y|X term:
   L(D : θ_{Y|X}) = ∏_{m=1}^{M} θ_{y[m]|x[m]} = ∏_{x∈Val(X)} [ ∏_{y∈Val(Y)} θ_{y|x}^{M[x,y]} ]
•  For each value x ∈ Val(X) we get an independent multinomial estimation problem, whose MLE is
   θ̂_{y_i|x} = M[x, y_i] / M[x]

Limitations of MLE
•  Coin A is tossed 10 times and comes up heads on 3 of the 10 tosses.
–  MLE gives probability of heads = 0.3.
•  Coin B is tossed 1,000,000 times and comes up heads on 300,000 of the 1,000,000 tosses.
–  MLE also gives probability of heads = 0.3.
•  Shall we place the same bet on the next Coin A toss as we would on the next Coin B toss?
•  We need to incorporate prior knowledge.
–  Prior knowledge should only be used as a guide.

Frequentist vs Bayesian Parameter Estimation

•  Frequentists think of a parameter as a fixed, unknown constant, not as a random variable (as Bayesians do).
•  Hence they use different “objective” estimators instead of Bayes’ rule:
–  These estimators have different properties, such as being “unbiased”, “minimum variance”, etc.
–  MLE is one popular example of a frequentist method.
•  Bayesians treat the unknown parameters as a random variable, and incorporate prior knowledge into the estimate.
•  Bayesian estimation has been criticized for being “subjective”.
