
Bayesian Decision Theory
Team teaching

Lecture 2: Bayesian Decision Theory
1. Diagram and formulation
2. Bayes rule for inference
3. Bayesian decision
4. Discriminant functions and space partition
5. Advanced issues

Lecture note for Stat 231: Pattern Recognition and Machine Learning

Diagram of pattern classification


Procedure of pattern recognition and decision making

subjects -> Observables X -> Features x -> Inner belief w -> Action $\alpha$

X: all the observables using existing sensors and instruments
x: a set of features selected from components of X, or linear/non-linear functions of X
w: our inner belief/perception about the subject class
$\alpha$: the action that we take for x

We denote the three spaces by

$$x \in \Omega^d, \qquad w \in \Omega^C, \qquad \alpha \in \Omega^\alpha$$

where $x = (x_1, x_2, \ldots, x_d)$ is a vector and $w$ is the index of class, $\Omega^C = \{w_1, w_2, \ldots, w_k\}$.



Examples
Ex 1: Fish classification
X = I is the image of a fish
x = (brightness, length, fin #, ...)
w is our belief about what the fish type is, $\Omega^C$ = {sea bass, salmon, trout, ...}
$\alpha$ is a decision for the fish type; in this case $\Omega^C = \Omega^\alpha$, with $\Omega^\alpha$ = {sea bass, salmon, trout, ...}

Ex 2: Medical diagnosis
X = all the available medical tests and imaging scans that a doctor can order for a patient
x = (blood pressure, glucose level, cough, x-ray, ...)
w is an illness type, $\Omega^C$ = {flu, cold, TB, pneumonia, lung cancer}
$\alpha$ is a decision for treatment, $\Omega^\alpha$ = {Tylenol, hospitalize, ...}



Tasks

subjects -> Observables X -> Features x -> Inner belief w -> Decision

Tasks along this pipeline, in order: control sensors; selecting informative features; statistical inference; risk/cost minimization.

In Bayesian decision theory, we are concerned with the last three steps in the big ellipse assuming that the observables are given and features are selected.


Bayes Decision
It is decision making when all underlying probability distributions are known, and it is optimal given that the distributions are known.

For two classes $w_1$ and $w_2$, the prior probabilities for an unknown new observation are:
$P(w_1)$: the new observation belongs to class 1
$P(w_2)$: the new observation belongs to class 2
$P(w_1) + P(w_2) = 1$

The prior reflects our prior knowledge. It gives our decision rule when no feature on the new object is available: classify as class 1 if $P(w_1) > P(w_2)$.

Bayesian Decision Theory


Features x -> statistical inference -> Inner belief $p(w \mid x)$ -> risk/cost minimization -> Decision $\alpha(x)$

Two probability tables:
a) Prior $p(w)$
b) Likelihood $p(x \mid w)$

A risk/cost function (a two-way table): $\lambda(\alpha \mid w)$

The belief on the class w is computed by the Bayes rule

$$p(w \mid x) = \frac{p(x \mid w)\, p(w)}{p(x)}$$

The risk is computed by

$$R(\alpha_i \mid x) = \sum_{j=1}^{k} \lambda(\alpha_i \mid w_j)\, p(w_j \mid x)$$
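A minimal sketch of these two computations for a single observed x. The function names and all numbers are illustrative assumptions, not values from the lecture; `loss[i][j]` stands for $\lambda(\alpha_i \mid w_j)$.

```python
def posterior(prior, likelihood):
    """Bayes rule: p(w_j|x) = p(x|w_j) p(w_j) / p(x), for one observed x."""
    joint = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(joint)  # p(x) = sum_j p(x|w_j) p(w_j)
    return [j / evidence for j in joint]


def conditional_risk(loss, post):
    """R(a_i|x) = sum_j loss[i][j] * p(w_j|x); one value per action a_i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, post)) for row in loss]


# Illustrative numbers only (assumed): k = 2 classes, 2 actions.
prior = [0.6, 0.4]        # p(w_1), p(w_2)
likelihood = [0.2, 0.5]   # p(x|w_1), p(x|w_2) at the observed x
loss = [[0.0, 3.0],       # lambda(a_1|w_1), lambda(a_1|w_2)
        [1.0, 0.0]]       # lambda(a_2|w_1), lambda(a_2|w_2)

post = posterior(prior, likelihood)
risk = conditional_risk(loss, post)
print("posterior:", post)
print("risk per action:", risk)
```

The prior and likelihood tables enter only through the posterior; the cost table is applied afterwards, which is why the same belief can lead to different decisions under different costs.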


Bayes Decision
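We observe features on each object. $P(x \mid w_1)$ and $P(x \mid w_2)$ are the class-specific densities. The Bayes rule:

$$P(w_j \mid x) = \frac{P(x \mid w_j)\, P(w_j)}{\sum_{i=1}^{2} P(x \mid w_i)\, P(w_i)}$$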

Decision Rule
A decision rule is a mapping function from feature space to the set of actions

$$\alpha(x): \Omega^d \to \Omega^\alpha$$

We will show that randomized decisions won't be optimal. A decision is made to minimize the average cost/risk,

$$R = \int R(\alpha(x) \mid x)\, p(x)\, dx$$

It is minimized when our decision is made to minimize the cost/risk for each instance x:

$$\alpha(x) = \arg\min_{\Omega^\alpha} R(\alpha \mid x) = \arg\min_{\Omega^\alpha} \sum_{j=1}^{k} \lambda(\alpha \mid w_j)\, p(w_j \mid x)$$
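The pointwise argument can be checked on a small discrete case, where the integral becomes a sum over feature values. This is a sketch under assumed numbers: the three feature values, the posteriors, and the cost table are all hypothetical.

```python
# Illustrative discrete setting (all numbers are assumptions):
# three possible feature values, two classes, two actions.
p_x = {"low": 0.5, "mid": 0.3, "high": 0.2}  # p(x)
post = {"low": [0.9, 0.1], "mid": [0.5, 0.5], "high": [0.2, 0.8]}  # p(w_j|x)
loss = [[0.0, 2.0],   # lambda(a_1|w_1), lambda(a_1|w_2)
        [1.0, 0.0]]   # lambda(a_2|w_1), lambda(a_2|w_2)


def risk(action, x):
    """Conditional risk R(a|x) = sum_j lambda(a|w_j) p(w_j|x)."""
    return sum(l * p for l, p in zip(loss[action], post[x]))


def bayes_rule(x):
    """Pick the action index with the smallest conditional risk at x."""
    return min(range(len(loss)), key=lambda a: risk(a, x))


def total_risk(rule):
    """Average risk R = sum_x R(rule(x)|x) p(x)."""
    return sum(risk(rule(x), x) * p_x[x] for x in p_x)


bayes_total = total_risk(bayes_rule)
always_a1 = total_risk(lambda x: 0)
always_a2 = total_risk(lambda x: 1)
print(bayes_total, always_a1, always_a2)
```

Because every term of the sum is minimized separately, no other deterministic rule on these three values can have a lower total risk than `bayes_rule`.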


Bayesian error
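In a special case such as fish classification, where the action is classification, we assume a 0/1 error:

$$\lambda(\alpha_i \mid w_j) = \begin{cases} 0 & \text{if } \alpha_i = w_j \\ 1 & \text{if } \alpha_i \neq w_j \end{cases}$$

The risk for classifying x to class $\alpha_i$ is

$$R(\alpha_i \mid x) = \sum_{w_j \neq \alpha_i} p(w_j \mid x) = 1 - p(\alpha_i \mid x)$$

The optimal decision is to choose the class that has maximum posterior probability:

$$\alpha(x) = \arg\min_{\Omega^\alpha} \left(1 - p(\alpha \mid x)\right) = \arg\max_{\Omega^\alpha} p(\alpha \mid x)$$

The total risk for a decision rule, in this case, is called the Bayesian error:

$$R = p(\text{error}) = \int p(\text{error} \mid x)\, p(x)\, dx = \int \left(1 - p(\alpha(x) \mid x)\right) p(x)\, dx$$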
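A short check of the 0/1-loss identity for one x, assuming an illustrative posterior over three classes (the values are invented for the example):

```python
# Assumed posterior p(w_j|x) for k = 3 classes; values are illustrative only.
post = [0.2, 0.5, 0.3]
k = len(post)

# 0/1 loss table: 0 on the diagonal (correct class), 1 elsewhere.
zero_one = [[0 if i == j else 1 for j in range(k)] for i in range(k)]

# Conditional risk of deciding class i: sum of posteriors of the other classes.
risk = [sum(zero_one[i][j] * post[j] for j in range(k)) for i in range(k)]

min_risk_class = min(range(k), key=lambda i: risk[i])
map_class = max(range(k), key=lambda i: post[i])
print(risk, min_risk_class, map_class)
```

The risk of each class equals one minus its posterior, so minimizing risk and maximizing the posterior select the same index.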



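Partition of feature space

$$\alpha(x): \Omega^d \to \Omega^\alpha$$

The decision is a partition/coloring of the feature space into k subspaces:

$$\Omega^d = \bigcup_{i=1}^{k} \Omega_i, \qquad \Omega_i \cap \Omega_j = \emptyset, \quad i \neq j$$

[Figure: feature space partitioned into regions labeled 1 to 5]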


An example of fish classification


Decision/classification boundaries


Closed-form solutions
In two-class classification, k = 2, with $p(x \mid w)$ being Normal densities, the decision boundaries can be computed in closed form.
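As a sketch of what such a closed form looks like, assume univariate Normal class densities $p(x \mid w_i) = \mathcal{N}(\mu_i, \sigma_i^2)$ and 0/1 loss. Both are assumptions made for illustration, and the symbols $\mu_i$, $\sigma_i$ are introduced here. The boundary is where $p(x \mid w_1)\,p(w_1) = p(x \mid w_2)\,p(w_2)$; taking logarithms gives

$$\ln p(w_1) - \ln \sigma_1 - \frac{(x - \mu_1)^2}{2\sigma_1^2} = \ln p(w_2) - \ln \sigma_2 - \frac{(x - \mu_2)^2}{2\sigma_2^2}$$

which is quadratic in x. With equal variances $\sigma_1 = \sigma_2 = \sigma$ and $\mu_1 \neq \mu_2$, the quadratic terms cancel and the boundary is a single threshold:

$$x^* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_1 - \mu_2} \ln \frac{p(w_2)}{p(w_1)}$$

With equal priors the threshold sits midway between the two means; unequal priors shift it toward the mean of the less probable class.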


Exercise
Consider the sea bass/salmon classifier, with these two possible actions:
A1: Decide the input is sea bass;
A2: Decide the input is salmon.
The priors for sea bass and salmon are 2/3 and 1/3, respectively. The cost of classifying a fish as a salmon when it truly is a sea bass is 2 dollars, and the cost of classifying a fish as a sea bass when it truly is a salmon is 1 dollar. Find the decision for input X = 13, where the likelihoods are P(X | w1) = 0.28 and P(X | w2) = 0.17.
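One way to work the exercise is sketched below. It assumes that class 1 is sea bass and class 2 is salmon (the order in which the priors are listed), and that correct decisions cost nothing; the exercise states neither explicitly.

```python
# Assumption: class 1 = sea bass, class 2 = salmon.
prior = {"sea bass": 2 / 3, "salmon": 1 / 3}
likelihood = {"sea bass": 0.28, "salmon": 0.17}  # P(X = 13 | class)

# cost[action][true class]; correct decisions are assumed to cost 0.
cost = {
    "A1: decide sea bass": {"sea bass": 0.0, "salmon": 1.0},
    "A2: decide salmon": {"sea bass": 2.0, "salmon": 0.0},
}

# Bayes rule: posterior = likelihood * prior / evidence.
evidence = sum(likelihood[c] * prior[c] for c in prior)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

# Conditional risk of each action, then pick the cheapest one.
risk = {a: sum(cost[a][c] * posterior[c] for c in prior) for a in cost}
decision = min(risk, key=risk.get)

print(posterior)
print(risk)
print(decision)
```

Under these assumptions both the prior and the likelihood favor sea bass, and the more expensive mistake is calling a sea bass a salmon, so A1 has the lower risk.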

Nominal features

Training data: three input dimensions, rash (R), temperature (T), and dizzy (D), and an output class (C). All variables are binary.


Training phase: We can summarise the training data as follows, and estimate the likelihoods as relative frequencies.

Classify the test data by computing the posterior probabilities. If P(C = 1 | X) > 0.5, classify as C = 1; otherwise classify as C = 0 (since it is a two-class problem).

Testing phase:
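The procedure can be sketched as follows. The training rows below are HYPOTHETICAL, invented only to make the example run, because the slide's own table is not available here. The likelihood is taken as the relative frequency of the full (R, T, D) pattern within each class, which is one reading of "relative frequencies".

```python
from collections import Counter

# HYPOTHETICAL training rows (rash, temperature, dizzy, class); these values
# are invented for illustration and are not the lecture's data.
train = [
    (1, 1, 0, 1), (1, 1, 0, 1), (0, 1, 1, 1),
    (1, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 0, 0),
]

# Training phase: count classes and (pattern, class) pairs.
class_counts = Counter(c for _, _, _, c in train)
pattern_counts = Counter(((r, t, d), c) for r, t, d, c in train)


def posterior(x):
    """P(C = c | X = x) from relative-frequency likelihoods and priors."""
    joint = {}
    for c, n_c in class_counts.items():
        likelihood = pattern_counts[(x, c)] / n_c  # P(X = x | C = c)
        prior = n_c / len(train)                   # P(C = c)
        joint[c] = likelihood * prior
    evidence = sum(joint.values())                 # P(X = x)
    if evidence == 0:
        raise ValueError("pattern never seen in the training data")
    return {c: j / evidence for c, j in joint.items()}


# Testing phase: apply the 0.5 threshold from the text.
post = posterior((1, 1, 0))
label = 1 if post[1] > 0.5 else 0
print(post, label)
```

A pattern that never occurs in training has zero evidence, so the sketch raises an error rather than guessing; smoothing would be a separate design decision.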

Univariate Normal Distribution (Continuous)

Training phase

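Compute the standard deviation of both trolls and smurfs.

Next, we compute the class-conditional probability of observing a height given either a smurf or a troll. For a Normal density with class mean $\mu_j$ and standard deviation $\sigma_j$, this is

$$p(x \mid w_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x - \mu_j)^2}{2\sigma_j^2}\right)$$

Finally, we compute the prior probability of observing a troll or a smurf regardless of height.

Testing phase: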

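In general, we have the most important parts done. However, let's take a look at Bayes rule in this context:

$$P(w_j \mid x) = \frac{p(x \mid w_j)\, P(w_j)}{p(x)}$$

We notice that in order to figure out the final probability, we need the marginal probability p(x), which is the base probability of height. Basically, p(x) is there to make sure that if we summed up all the $P(w_j \mid x)$ they would equal one. This is simply

$$p(x) = \sum_j p(x \mid w_j)\, P(w_j)$$

which would be:

$$p(x) = p(x \mid \text{smurf})\, P(\text{smurf}) + p(x \mid \text{troll})\, P(\text{troll})$$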

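Finally, we have everything we need to determine whether a creature is most likely a troll or a smurf based upon its height. Thus, we can now ask: if we observe a creature of height 2, how likely is it to be a smurf?

$$P(\text{smurf} \mid 2) = \frac{p(2 \mid \text{smurf})\, P(\text{smurf})}{p(2)}$$

Next, we ask how likely it is that we have a troll given the observed height of 2:

$$P(\text{troll} \mid 2) = \frac{p(2 \mid \text{troll})\, P(\text{troll})}{p(2)}$$

The end result is that if

$$P(\text{smurf} \mid 2) > P(\text{troll} \mid 2)$$

then we can decide that we are most likely observing a smurf.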

Exercise
If we test a creature that is 2.90 tall, is it more likely to be a smurf or a troll?

Height   Creature
2.70     Smurf
2.52     Smurf
2.57     Smurf
2.22     Smurf
3.16     Troll
3.58     Troll
3.16     Troll
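The training and testing phases for this exercise can be sketched as below. Two choices are assumptions rather than something the slides specify: the standard deviation uses the sample (n - 1) estimate from `statistics.stdev`, and the priors are taken as class frequencies in the table.

```python
import math
from statistics import mean, stdev

# Heights from the exercise table.
heights = {
    "Smurf": [2.70, 2.52, 2.57, 2.22],
    "Troll": [3.16, 3.58, 3.16],
}


def normal_pdf(x, mu, sigma):
    """Univariate Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


total = sum(len(sample) for sample in heights.values())
x_new = 2.90

# Training phase: per-class mean, standard deviation, and prior.
# Testing phase: likelihood * prior for the new height.
joint = {}
for creature, sample in heights.items():
    mu = mean(sample)
    sigma = stdev(sample)        # sample (n - 1) estimate: an assumption
    prior = len(sample) / total  # class frequency used as the prior
    joint[creature] = normal_pdf(x_new, mu, sigma) * prior

evidence = sum(joint.values())   # p(x), the normalizing constant
posterior = {creature: j / evidence for creature, j in joint.items()}
print(posterior)
```

Switching to the population estimate (`statistics.pstdev`) changes the densities, so it is worth stating which estimate is used when reporting the posteriors.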

The Multivariate Normal Distribution (Continuous)

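In two dimensions, the so-called bivariate normal, we are concerned with two means and a variance-covariance matrix:

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}$$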

Curvature   Diameter   Quality control result
2.95        6.63       Passed
2.53        7.79       Passed
3.57        5.65       Passed
3.16        5.47       Passed
2.58        4.46       Not passed
2.16        6.22       Not passed
3.27        3.52       Not passed

As a consultant to the factory, you get the task of setting up the criteria for automatic quality control. The manager of the factory also wants to test your criteria on a new type of chip ring about which even the human experts disagree. The new chip rings have curvature 2.81 and diameter 5.46. Can you solve this problem by employing a Bayes classifier?

x = features (or independent variables) of all data. Each row (denoted by $x_k$) represents one object; each column stands for one feature.
y = group of the object (or dependent variable) of all data. Each row represents one object, and it has only one column.

Training phase

$$x = \begin{bmatrix} 2.95 & 6.63 \\ 2.53 & 7.79 \\ 3.57 & 5.65 \\ 3.16 & 5.47 \\ 2.58 & 4.46 \\ 2.16 & 6.22 \\ 3.27 & 3.52 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 2 \\ 2 \\ 2 \end{bmatrix}$$

$x_k$ = data of row k; for example, $x_3 = [3.57 \;\; 5.65]$
g = number of groups in y; in our example, g = 2
$x_i$ = features data for group i. Each row represents one object; each column stands for one feature. We separate x into several groups based on the number of categories in y.

$$x_1 = \begin{bmatrix} 2.95 & 6.63 \\ 2.53 & 7.79 \\ 3.57 & 5.65 \\ 3.16 & 5.47 \end{bmatrix}, \qquad x_2 = \begin{bmatrix} 2.58 & 4.46 \\ 2.16 & 6.22 \\ 3.27 & 3.52 \end{bmatrix}$$

$\mu_i$ = mean of features in group i, which is the average of $x_i$:

$$\mu_1 = [3.05 \;\; 6.38], \qquad \mu_2 = [2.67 \;\; 4.73]$$

$\mu$ = global mean vector, that is, the mean of the whole data set. In this example, $\mu = [2.88 \;\; 5.676]$.

$x_i^0$ = mean-corrected data, that is, the features data for group i, $x_i$, minus the global mean vector $\mu$:

$$x_1^0 = \begin{bmatrix} 0.060 & 0.951 \\ -0.357 & 2.109 \\ 0.679 & -0.025 \\ 0.269 & -0.209 \end{bmatrix}, \qquad x_2^0 = \begin{bmatrix} -0.305 & -1.218 \\ -0.732 & 0.547 \\ 0.386 & -2.155 \end{bmatrix}$$

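Covariance matrix of group i:

$$\Sigma_i = \frac{(x_i^0)^T x_i^0}{n_i}$$

$$\Sigma_1 = \begin{bmatrix} 0.166 & -0.192 \\ -0.192 & 1.349 \end{bmatrix}, \qquad \Sigma_2 = \begin{bmatrix} 0.259 & -0.286 \\ -0.286 & 2.142 \end{bmatrix}$$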

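Finally, we compute the class-conditional probability of observing curvature = 2.81 and diameter = 5.46 given either Passed or Not passed:

$$p(2.81, 5.46 \mid \text{Passed}) = (2 \cdot 3.14)^{-\frac{1}{2}(2)} \begin{vmatrix} 0.166 & -0.192 \\ -0.192 & 1.349 \end{vmatrix}^{-1/2} \exp\left( -\frac{1}{2} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 3.05 \\ 6.38 \end{bmatrix} \right)^{T} \begin{bmatrix} 0.166 & -0.192 \\ -0.192 & 1.349 \end{bmatrix}^{-1} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 3.05 \\ 6.38 \end{bmatrix} \right) \right)$$

$$p(2.81, 5.46 \mid \text{Not passed}) = (2 \cdot 3.14)^{-\frac{1}{2}(2)} \begin{vmatrix} 0.259 & -0.286 \\ -0.286 & 2.142 \end{vmatrix}^{-1/2} \exp\left( -\frac{1}{2} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 2.67 \\ 4.73 \end{bmatrix} \right)^{T} \begin{bmatrix} 0.259 & -0.286 \\ -0.286 & 2.142 \end{bmatrix}^{-1} \left( \begin{bmatrix} 2.81 \\ 5.46 \end{bmatrix} - \begin{bmatrix} 2.67 \\ 4.73 \end{bmatrix} \right) \right)$$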

P = prior probability vector (each row represents the prior probability of one group). If we do not know the prior probability, we assume it is equal to the number of samples in each group divided by the total number of samples, that is

$$p(\text{Passed}) = 4/7, \qquad p(\text{Not passed}) = 3/7$$

Testing phase

In general, we have the most important parts done. However, let's take a look at Bayes rule in this context:

$$P(w_j \mid x) = \frac{p(x \mid w_j)\, P(w_j)}{p(x)}$$

Basically, p(x) is there to make sure that if we summed up all the $P(w_j \mid x)$ they would equal one. This is simply

$$p(x) = \sum_j p(x \mid w_j)\, P(w_j)$$

Finally, we have everything we need to determine whether a chip ring is most likely to pass quality control based upon its curvature and diameter. Thus, we can now ask: if we observe curvature = 2.81 and diameter = 5.46, how likely is the ring to be Passed?

$$p(\text{Passed} \mid 2.81, 5.46) = \frac{p(2.81, 5.46 \mid \text{Passed})\, p(\text{Passed})}{p(2.81, 5.46)}$$

Next, we ask how likely it is that the ring is Not passed given the observation of curvature = 2.81 and diameter = 5.46:

$$p(\text{Not passed} \mid 2.81, 5.46) = \frac{p(2.81, 5.46 \mid \text{Not passed})\, p(\text{Not passed})}{p(2.81, 5.46)}$$


If the end result is

$$p(\text{Not passed} \mid 2.81, 5.46) > p(\text{Passed} \mid 2.81, 5.46)$$

then we decide that the ring is most likely classified as Not passed.
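The whole chip-ring calculation can be sketched in plain Python as below. It follows the recipe in these notes: each covariance matrix is built from data centered on the global mean, each density is centered on its group mean, and the priors are class frequencies. The helper names are illustrative, and because the script uses unrounded means, its matrices differ slightly from the rounded values shown in the training phase.

```python
import math

# Training data as (curvature, diameter), from the training-phase matrices.
passed = [(2.95, 6.63), (2.53, 7.79), (3.57, 5.65), (3.16, 5.47)]
not_passed = [(2.58, 4.46), (2.16, 6.22), (3.27, 3.52)]


def mean_vector(rows):
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def covariance(rows, center):
    """2x2 matrix (X0^T X0) / n, where X0 = rows minus `center`."""
    n = len(rows)
    d = [(r[0] - center[0], r[1] - center[1]) for r in rows]
    s00 = sum(a * a for a, _ in d) / n
    s01 = sum(a * b for a, b in d) / n
    s11 = sum(b * b for _, b in d) / n
    return ((s00, s01), (s01, s11))


def gaussian_pdf(x, mu, sigma):
    """Bivariate Normal density, using the closed-form 2x2 inverse."""
    (a, b), (_, c) = sigma
    det = a * c - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


classes = {"Passed": passed, "Not passed": not_passed}
global_mean = mean_vector(passed + not_passed)
total = len(passed) + len(not_passed)
x_new = (2.81, 5.46)

joint = {}
for name, rows in classes.items():
    prior = len(rows) / total              # 4/7 and 3/7
    mu = mean_vector(rows)                 # group mean
    sigma = covariance(rows, global_mean)  # centered on the global mean, as in the notes
    joint[name] = gaussian_pdf(x_new, mu, sigma) * prior

evidence = sum(joint.values())             # p(2.81, 5.46)
posterior = {name: j / evidence for name, j in joint.items()}
print(posterior)
```

Centering the covariance on each group's own mean is the more common estimate for a Gaussian classifier; that variant is a one-line change (`covariance(rows, mu)`) and may give different posteriors.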