Lecture 2: Bayesian Decision Theory

Lecture note for Stat 231: Pattern Recognition and Machine Learning (team teaching)
1. Diagram and formulation
2. Bayes rule for inference
3. Bayesian decision
4. Discriminant functions and space partition
5. Advanced issues
X --- all the observables, measured using existing sensors and instruments.
x --- a set of features selected from components of X, or linear/non-linear functions of X.
w --- our inner belief/perception about the subject class.
α --- the action that we take for x.

We denote the three spaces by

$$x \in \Omega_x \subset \mathbb{R}^d, \qquad w \in \Omega_w = \{w_1, \dots, w_C\}, \qquad \alpha \in \Omega_\alpha$$
Examples

Ex 1: Fish classification

X = I is the image of the fish,
x = (brightness, length, fin#, ...)
w is our belief about what the fish type is, Ω_c = {sea bass, salmon, trout, ...}
α is a decision for the fish type, in this case Ω_α = Ω_c = {sea bass, salmon, trout, ...}
Ex 2: Medical diagnosis

X = all the available medical tests and imaging scans that a doctor can order for a patient,
x = (blood pressure, glucose level, cough, x-ray, ...)
w is an illness type, Ω_c = {flu, cold, TB, pneumonia, lung cancer, ...}
α is a decision for treatment, Ω_α = {Tylenol, hospitalize, ...}
[Diagram: subjects → (sensors) → Observables X → Features x → (statistical inference) → Inner belief w → (risk/cost minimization) → Decision α, with a control loop from the decision back to the sensors]

In Bayesian decision theory, we are concerned with the last three steps in the big ellipse, assuming that the observables are given and the features are selected.
Bayes Decision

It is decision making when all the underlying probability distributions are known; it is optimal given those distributions.

For two classes w_1 and w_2, the prior probabilities for an unknown new observation are:

P(w_1): the new observation belongs to class w_1
P(w_2): the new observation belongs to class w_2
P(w_1) + P(w_2) = 1

This reflects our prior knowledge, and it is our decision rule when no feature on the new object is available: classify as class w_1 if P(w_1) > P(w_2).
[Diagram: Features x → (statistical inference) → Inner belief w → (risk/cost minimization) → Decision α(x)]
The belief on the class w is computed by the Bayes rule

$$p(w \mid x) = \frac{p(x \mid w)\, p(w)}{p(x)}$$
The risk is computed by

$$R(\alpha_i \mid x) = \sum_{j=1}^{k} \lambda(\alpha_i \mid w_j)\, p(w_j \mid x)$$
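As a sketch of how these two formulas combine (NumPy; all numbers below are hypothetical):

```python
import numpy as np

# Hypothetical numbers for illustration: k = 3 classes, 3 actions.
likelihood = np.array([0.2, 0.5, 0.1])    # p(x | w_j) at the observed x
prior      = np.array([0.5, 0.3, 0.2])    # p(w_j)
loss       = np.array([[0., 1., 1.],      # lambda(alpha_i | w_j):
                       [1., 0., 1.],      # row = action taken,
                       [1., 1., 0.]])     # column = true class

# Bayes rule: p(w_j | x) = p(x | w_j) p(w_j) / p(x)
posterior = likelihood * prior
posterior = posterior / posterior.sum()   # dividing by p(x)

# Conditional risk: R(alpha_i | x) = sum_j lambda(alpha_i | w_j) p(w_j | x)
risk = loss @ posterior

print("posterior:", posterior)            # ~[0.370, 0.556, 0.074]
print("risk:", risk)                      # risk of each action
print("best action:", risk.argmin())      # minimize the conditional risk
```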
Bayes Decision

We observe features on each object.

P(x | w_1) and P(x | w_2): the class-specific densities. The Bayes rule then gives the posteriors, as above.

Decision Rule

A decision rule is a mapping function from the feature space to the set of actions,

$$\alpha(x): \Omega_x \to \Omega_\alpha$$

We will show that randomized decisions won't be optimal.
A decision is made to minimize the average cost / risk,

$$R = \int R(\alpha(x) \mid x)\, p(x)\, dx$$

It is minimized when our decision is made to minimize the cost / risk for each instance x.
Bayesian error

In a special case like fish classification, where the action is the classification itself, we assume a 0/1 loss:
$$\lambda(\alpha_i \mid w_j) = \begin{cases} 0 & \text{if } \alpha_i = w_j \\ 1 & \text{if } \alpha_i \neq w_j \end{cases}$$

The risk for classifying x to class α_i is

$$R(\alpha_i \mid x) = \sum_{w_j \neq \alpha_i} p(w_j \mid x) = 1 - p(\alpha_i \mid x)$$
The optimal decision is to choose the class that has the maximum posterior probability. The total risk for a decision rule, in this case, is called the Bayesian error.
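Written out (standard results, matching the prose above), the optimal rule and the Bayesian error are:

$$\alpha^*(x) = \arg\max_{i}\, p(w_i \mid x), \qquad p(\text{error}) = \int \bigl(1 - \max_i\, p(w_i \mid x)\bigr)\, p(x)\, dx$$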
The decision rule partitions the feature space into disjoint regions:

$$\Omega_x = \bigcup_{i=1}^{k} \Omega_i, \qquad \Omega_i \cap \Omega_j = \varnothing, \quad i \neq j$$

[Figure: the feature space partitioned into decision regions 1-5]
[Figure: an example of fish classification]
[Figure: decision/classification boundaries]
Closed-form solutions

In two-class classification, k = 2, with p(x|w) being normal densities, the decision boundaries can be computed in closed form.
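As a minimal sketch of why this works in one dimension (all parameters below are hypothetical): setting p(x|w_1)P(w_1) = p(x|w_2)P(w_2) and taking logs gives a quadratic equation in x, whose real roots are the decision boundaries.

```python
import numpy as np

# Hypothetical 1-D example: two normal class densities.
mu1, s1 = 0.0, 1.0      # class 1: N(mu1, s1^2)
mu2, s2 = 3.0, 2.0      # class 2: N(mu2, s2^2)
p1, p2  = 0.5, 0.5      # priors

# log p(x|w1) + log P(w1) = log p(x|w2) + log P(w2) reduces to
# a x^2 + b x + c = 0:
a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
b = mu1 / s1**2 - mu2 / s2**2
c = (mu2**2 / (2 * s2**2) - mu1**2 / (2 * s1**2)
     + np.log(s2 / s1) + np.log(p1 / p2))

boundaries = np.roots([a, b, c])    # up to two real decision boundaries
print(boundaries)
```

When the two variances are equal, the quadratic term vanishes and there is a single linear boundary.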
Exercise

Consider the example of the sea bass / salmon classifier, with two possible actions:

A1: decide the input is sea bass;
A2: decide the input is salmon.

The priors for sea bass and salmon are 2/3 and 1/3, respectively. The cost of classifying a fish as a salmon when it truly is a sea bass is $2, and the cost of classifying a fish as a sea bass when it is truly a salmon is $1.

Find the decision for input X = 13, where the likelihoods are P(X | w_1) = 0.28 and P(X | w_2) = 0.17.
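A sketch of the computation for this exercise in NumPy, with w_1 = sea bass, w_2 = salmon, and the loss matrix read off from the stated costs:

```python
import numpy as np

prior      = np.array([2/3, 1/3])     # P(w1), P(w2)
likelihood = np.array([0.28, 0.17])   # P(X=13 | w1), P(X=13 | w2)
# lambda(A_i | w_j): row = action, column = true class.
# A1 ("sea bass") costs $1 when the fish is truly salmon;
# A2 ("salmon")  costs $2 when the fish is truly sea bass.
loss = np.array([[0., 1.],
                 [2., 0.]])

posterior = likelihood * prior
posterior /= posterior.sum()          # P(w_j | X=13)
risk = loss @ posterior               # R(A_i | X=13)
print("posteriors:", posterior)       # ~[0.767, 0.233]
print("risks:", risk)                 # ~[0.233, 1.534] -> choose A1 (sea bass)
```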
Nominal features

Training data: three input dimensions, rash (R), temperature (T), dizzy (D); output class (C). All variables are binary.
Training phase: We can summarise the training data as follows, and estimate the likelihoods as relative frequencies.
Classify the test data by computing the posterior probabilities: if P(C = 1 | X) > 0.5, classify as C = 1; otherwise classify as C = 0 (since it is a two-class problem). A sketch of the whole procedure is given after the tables below.
[Tables: training-phase summary of counts and testing-phase data]
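A minimal end-to-end sketch of this procedure, under a conditional-independence (naive Bayes) assumption and with a made-up training table, since the original table is not reproduced here:

```python
import numpy as np

# Hypothetical training table: columns = rash R, temperature T, dizzy D;
# all binary; y = class C.
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1],
              [1, 0, 0], [0, 1, 0], [0, 0, 1]])
y = np.array([1, 1, 1, 1, 0, 0, 0])

prior1 = y.mean()                    # P(C = 1), relative frequency
theta1 = X[y == 1].mean(axis=0)      # P(feature = 1 | C = 1)
theta0 = X[y == 0].mean(axis=0)      # P(feature = 1 | C = 0)

def posterior_C1(x):
    """P(C = 1 | x), assuming the features are conditionally independent."""
    lik1 = np.prod(np.where(x == 1, theta1, 1 - theta1)) * prior1
    lik0 = np.prod(np.where(x == 1, theta0, 1 - theta0)) * (1 - prior1)
    return lik1 / (lik1 + lik0)

x_test = np.array([1, 0, 1])         # e.g. rash, no temperature, dizzy
p = posterior_C1(x_test)
print("P(C=1 | x) =", p, "->", "C=1" if p > 0.5 else "C=0")
```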
Finally, we compute the class-conditional probability of observing a height given either a smurf or a troll.

Finally, we compute the prior probability of observing a troll or a smurf, regardless of height.
Testing phase:
In general, we have the most important parts done. However, let's take a look at Bayes rule in this context:

We notice that in order to figure out the final probability, we need the marginal probability p(x), which is the base probability of height. Basically, p(x) is there to make sure that if we summed up all the P(w_j | x) they would equal one. This is simply

$$p(x) = \sum_j p(x \mid w_j)\, P(w_j)$$
Finally... we have everything we need to determine if a creature is most likely a troll or a smurf based upon its height. Thus, we can now ask: if we observe a creature that is 2 tall, how likely is it to be a smurf?

Next we ask: what is the likelihood that we have a troll, given the observation of 2?
Exercise

If we test a creature that is 2.90 tall, how likely is it to be a smurf or a troll?
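The height tables for this example are not reproduced above, so the following only shows the shape of the computation; the means, standard deviations, and priors below are placeholders to be replaced by the training-phase estimates:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(x; mu, sigma^2)."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Placeholder parameters -- substitute the training-phase estimates.
mu_smurf, sd_smurf, p_smurf = 2.0, 0.5, 0.5
mu_troll, sd_troll, p_troll = 3.0, 0.6, 0.5

x = 2.90
num_smurf = normal_pdf(x, mu_smurf, sd_smurf) * p_smurf
num_troll = normal_pdf(x, mu_troll, sd_troll) * p_troll
p_x = num_smurf + num_troll                    # the marginal p(x)
print("P(smurf | 2.90) =", num_smurf / p_x)
print("P(troll | 2.90) =", num_troll / p_x)
```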
In two dimensions, the so-called bivariate normal, we are concerned with two means and a variance-covariance matrix:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$
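For reference, the corresponding bivariate normal density, which the chip-ring computations below evaluate, is the standard

$$p(x) = (2\pi)^{-1}\, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu)^{T} \Sigma^{-1} (x - \mu)\right)$$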
Curvature  Diameter  Quality Control Result
2.95       6.63      Passed
2.35       7.79      Passed
3.57       5.65      Passed
3.16       5.47      Passed
2.58       4.46      Not passed
2.16       6.22      Not passed
3.27       3.52      Not passed
As a consultant to the factory, you get the task of setting up the criteria for automatic quality control. The manager of the factory also wants to test your criteria on a new type of chip ring about which even the human experts argue with each other. The new chip rings have curvature 2.81 and diameter 5.46. Can you solve this problem by employing a Bayes classifier?
X = features (or independent variables) of all data. Each row (denoted by x_k) represents one object; each column stands for one feature. Y = group of the object (or dependent variable) of all data. Each row represents one object, and it has only one column.
Training phase
"2.95 6.63 % $ ' $2.35 7.79 ' $3.57 5.65 ' x= $ 3.16 5.47 ' y= $ ' $2.58 4.46 ' $ ' 2.16 6.22 ' $ $3.27 3.52 ' # &
"1 % $ ' $1 ' $1 ' $ ' $1 ' $2 ' $ ' 2' $ $2 ' # &
x_k = data of row k; for example, x_3 = (3.57, 5.65).
g = number of groups in y; in our example, g = 2.
x_i = features data for group i. Each row represents one object; each column stands for one feature. We separate x into several groups based on the number of categories in y.
μ_i = mean of features in group i, which is the average of x_i:

$$\mu_1 = [\,3.05 \;\; 6.38\,], \qquad \mu_2 = [\,2.67 \;\; 4.73\,]$$

μ = global mean vector, that is, the mean of the whole data set. In this example, μ = [2.88 5.676].
x^0_i = mean-corrected data, that is, the features data for group i, x_i, minus the global mean vector:

$$x^0_1 = \begin{pmatrix} 0.060 & 0.951 \\ -0.535 & 2.109 \\ 0.679 & -0.025 \\ 0.269 & -0.209 \end{pmatrix}, \qquad x^0_2 = \begin{pmatrix} -0.305 & -1.218 \\ -0.732 & 0.547 \\ 0.386 & -2.155 \end{pmatrix}$$
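These group means, the global mean, and the mean-corrected matrices can be re-derived with a few lines of NumPy (note that the values quoted in the slides are rounded approximations, so the recomputed decimals differ slightly):

```python
import numpy as np

# The chip-ring training data from the matrices above.
x = np.array([[2.95, 6.63], [2.35, 7.79], [3.57, 5.65], [3.16, 5.47],
              [2.58, 4.46], [2.16, 6.22], [3.27, 3.52]])
y = np.array([1, 1, 1, 1, 2, 2, 2])

mu_global = x.mean(axis=0)                   # global mean vector
print("global mean:", mu_global)
for i in (1, 2):
    xi = x[y == i]
    print("mu_%d =" % i, xi.mean(axis=0))    # group mean
    print("x0_%d =" % i)                     # mean-corrected data
    print(xi - mu_global)
```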
Finally, we compute the class-conditional probability of observing curvature = 2.81 and diameter = 5.46 given either passed or not passed:
$$p(2.81, 5.46 \mid \text{Passed}) = (2\pi)^{-1}\, |\Sigma_1|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu_1)^{T} \Sigma_1^{-1} (x - \mu_1)\right)$$

with $x = \begin{pmatrix} 2.81 \\ 5.46 \end{pmatrix}$, $\mu_1 = \begin{pmatrix} 3.05 \\ 6.38 \end{pmatrix}$, $\Sigma_1 = \begin{pmatrix} 0.166 & 0.192 \\ 0.192 & 1.349 \end{pmatrix}$, and

$$p(2.81, 5.46 \mid \text{Not passed}) = (2\pi)^{-1}\, |\Sigma_2|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu_2)^{T} \Sigma_2^{-1} (x - \mu_2)\right)$$

with $\mu_2 = \begin{pmatrix} 2.67 \\ 4.73 \end{pmatrix}$, $\Sigma_2 = \begin{pmatrix} 0.259 & 0.286 \\ 0.286 & 2.142 \end{pmatrix}$.
P = prior probability vector (each row represents the prior probability of group i). If we do not know the prior probabilities, we just assume they are equal to the total sample of each group divided by the total number of samples, that is

$$P = \begin{pmatrix} 4/7 \\ 3/7 \end{pmatrix}$$
Testing phase:
In general, we have the most important parts done. However, let's take a look at Bayes rule in this context:

$$P(w_j \mid x) = \frac{p(x \mid w_j)\, P(w_j)}{p(x)}$$

Basically, p(x) is there to make sure that if we summed up all the P(w_j | x) they would equal one. This is simply

$$p(x) = \sum_j p(x \mid w_j)\, P(w_j)$$
Finally... we have everything we need to determine whether a chip ring most likely passed or did not pass quality control. Thus, we can now ask: if we observe curvature = 2.81 and diameter = 5.46, how likely is it to have passed?
$$P(\text{Passed} \mid 2.81, 5.46) = \frac{p(2.81, 5.46 \mid \text{Passed})\, P(\text{Passed})}{p(2.81, 5.46)}$$
Next we ask: what is the posterior probability that the chip ring did not pass, given the observation of curvature = 2.81 and diameter = 5.46?
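Continuing the density sketch above (it reuses mvn_pdf, x, mu1, cov1, mu2, and cov2 from that block), with the priors taken as 4/7 and 3/7 from the sample counts; since p(2.81, 5.46) is common to both posteriors, comparing the numerators is enough:

```python
# Compare the two posterior numerators from the Bayes rule.
p_passed, p_not = 4/7, 3/7                         # priors from sample counts
num_passed = mvn_pdf(x, mu1, cov1) * p_passed      # p(x | Passed) P(Passed)
num_not    = mvn_pdf(x, mu2, cov2) * p_not         # p(x | Not passed) P(Not passed)

marginal = num_passed + num_not                    # p(x), the normalizer
print("P(Passed | x)     =", num_passed / marginal)
print("P(Not passed | x) =", num_not / marginal)
```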