A basic classifier
Training data D = {x(i), y(i)}; classifier f(x ; D)
Discrete feature vector x
f(x ; D) is a contingency table:

Features   # bad   # good
X=0          42      15
X=1         338     287
X=2           3       5
A basic classifier
Training data D = {x(i), y(i)}; classifier f(x ; D)
Discrete feature vector x
f(x ; D) gives the conditional probability of each class:

Features   p(y=bad | x)   p(y=good | x)
X=0            .7368           .2632
X=1            .5408           .4592
X=2            .3750           .6250
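A quick sketch (not part of the original slides) of how the conditional probabilities in the table follow from the raw counts:

```python
# x -> (# bad, # good), the contingency-table counts from the slides
counts = {0: (42, 15), 1: (338, 287), 2: (3, 5)}

posteriors = {}
for x, (bad, good) in counts.items():
    total = bad + good
    posteriors[x] = (bad / total, good / total)   # (p(bad|x), p(good|x))
    print(f"X={x}: p(bad|x)={bad/total:.4f}, p(good|x)={good/total:.4f}")
```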
Bayes rule
Two events H and F:
P(H & F) = P(F | H) P(H)   (joint distribution)
Bayes rule:
P(H | F) = P(F | H) P(H) / P(F)
(Use the rule of total probability to calculate the denominator:
P(F) = P(F | H) P(H) + P(F | ~H) P(~H))
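A small numeric sketch of the rule above (the probabilities here are illustrative, not from the slides):

```python
p_H = 0.3             # prior P(H)
p_F_given_H = 0.8     # likelihood P(F | H)
p_F_given_notH = 0.2  # P(F | ~H)

# Rule of total probability for the denominator:
p_F = p_F_given_H * p_H + p_F_given_notH * (1 - p_H)

# Bayes rule:
p_H_given_F = p_F_given_H * p_H / p_F
print(p_H_given_F)  # 0.24 / 0.38, about 0.63
```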
Bayes classifiers
Learn "class conditional" models
Estimate a probability model for each class
Split the training data by class: Dc = { x(j) : y(j) = c }

Features   # bad   # good   p(x|y=0)   p(x|y=1)   p(y=0|x)   p(y=1|x)
X=0          42      15      42/383     15/307      .7368      .2632
X=1         338     287     338/383    287/307      .5408      .4592
X=2           3       5       3/383      5/307      .3750      .6250

Class priors: p(y=0) = 383/690, p(y=1) = 307/690
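A sketch (not from the slides) of recovering p(y | x) from the class-conditional estimates p(x | y) and the class priors, matching the table:

```python
n_bad, n_good = 383, 307                 # class sizes
counts_bad  = {0: 42, 1: 338, 2: 3}      # counts of each x within class "bad"
counts_good = {0: 15, 1: 287, 2: 5}

p_y0 = n_bad / (n_bad + n_good)          # p(y=0) = 383/690
p_y1 = n_good / (n_bad + n_good)         # p(y=1) = 307/690

def posterior_bad(x):
    """p(y=0 | x) via Bayes rule with class-conditional models."""
    p_x_y0 = counts_bad[x] / n_bad       # p(x | y=0)
    p_x_y1 = counts_good[x] / n_good     # p(x | y=1)
    return p_x_y0 * p_y0 / (p_x_y0 * p_y0 + p_x_y1 * p_y1)

print(round(posterior_bad(0), 4))  # 0.7368, as in the table
```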
Gaussian models
Learn class conditional models: estimate a Gaussian model for each class
Split the training data by class: Dc = { x(j) : y(j) = c }
Estimate parameters of the Gaussians from the data
[Figure: two class-conditional Gaussians fit to feature x1]
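A minimal sketch of the per-class Gaussian fit, using standard maximum-likelihood estimates (sample mean and variance); the data values here are made up for illustration:

```python
import math

# class label -> observed values of feature x1 (illustrative data)
data = {0: [-2.1, -1.3, -0.7, -1.9], 1: [0.8, 1.5, 2.2, 1.1]}

params = {}
for c, xs in data.items():
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)   # ML variance estimate
    params[c] = (mu, var)

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Class-conditional likelihoods p(x1 | y=c) at a test point:
for c, (mu, var) in params.items():
    print(c, gaussian_pdf(0.0, mu, var))
```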
Joint distributions
Make a truth table of all combinations of values:

A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Joint distributions
Make a truth table of all combinations of values
For each combination of values, determine how probable it is
Total probability must sum to one
How many values did we specify? (Seven: the eighth is determined by the sum-to-one constraint.)

A B C   p(A,B,C | y=1)
0 0 0       0.50
0 0 1       0.05
0 1 0       0.01
0 1 1       0.10
1 0 0       0.04
1 0 1       0.15
1 1 0       0.05
1 1 1       0.10
Estimating the same table from 10 samples:

A B C   p(A,B,C | y=1)
0 0 0       4/10
0 0 1       1/10
0 1 0       0/10
0 1 1       0/10
1 0 0       1/10
1 0 1       2/10
1 1 0       1/10
1 1 1       1/10
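A sketch of estimating the full joint by counting; the ten samples below are made up to reproduce the table's fractions:

```python
from collections import Counter

# 10 illustrative binary feature vectors (A, B, C) from class y=1
samples = [(0, 0, 0)] * 4 + [(0, 0, 1), (1, 0, 0), (1, 0, 1),
                             (1, 0, 1), (1, 1, 0), (1, 1, 1)]
counts = Counter(samples)
n = len(samples)

# Empirical joint: count / total for every combination
for combo in [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]:
    print(combo, counts[combo], "/", n)
```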
Overfitting!
With only 10 samples, several combinations are estimated as 0/10: the model assigns zero probability to any combination it happens not to have seen, and the number of entries to estimate grows exponentially with the number of features.
Independence:
p(a,b) = p(a) p(b)
p(x1, x2, ..., xN | y=1) = p(x1 | y=1) p(x2 | y=1) ... p(xN | y=1)
Only need to estimate each factor individually

Marginals:  p(A=0|y=1) = .4   p(A=1|y=1) = .6
            p(B=0|y=1) = .7   p(B=1|y=1) = .3
            p(C=0|y=1) = .1   p(C=1|y=1) = .9

A B C   p(A,B,C | y=1)
0 0 0     .4 * .7 * .1
0 0 1     .4 * .7 * .9
0 1 0     .4 * .3 * .1
0 1 1     .4 * .3 * .9
1 0 0     .6 * .7 * .1
1 0 1     .6 * .7 * .9
1 1 0     .6 * .3 * .1
1 1 1     .6 * .3 * .9
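A sketch confirming that under the independence assumption the joint is just the product of the marginals above, and the products automatically sum to one:

```python
# Marginals p(A|y=1), p(B|y=1), p(C|y=1) from the slide
pA = {0: 0.4, 1: 0.6}
pB = {0: 0.7, 1: 0.3}
pC = {0: 0.1, 1: 0.9}

# Factorized joint: p(A,B,C | y=1) = p(A|y=1) p(B|y=1) p(C|y=1)
joint = {(a, b, c): pA[a] * pB[b] * pC[c]
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}

print(joint[(0, 0, 0)])     # .4 * .7 * .1, about 0.028
print(sum(joint.values()))  # sums to 1
```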
Decide class 0
[Table residue: a two-feature example with per-class estimates p(x | y=0) and p(x | y=1), counts out of four examples per class; the original layout is not recoverable]
Naïve Bayes:
p(y|x) = p(x|y) p(y) / p(x),  with  p(x|y) = prod_i p(xi|y)
Covariates are independent given cause
[Figure: decision regions over features x1 and x2]
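A minimal naïve Bayes sketch for binary features and two classes; the priors and per-feature conditionals below are illustrative, not from the slides:

```python
def naive_bayes_posterior(x, priors, conditionals):
    """p(y | x) proportional to p(y) * prod_i p(x_i | y), normalized over classes."""
    scores = {}
    for c in priors:
        score = priors[c]
        for i, xi in enumerate(x):
            score *= conditionals[c][i][xi]
        scores[c] = score
    z = sum(scores.values())          # p(x), by total probability
    return {c: s / z for c, s in scores.items()}

priors = {0: 0.5, 1: 0.5}
# conditionals[c][i][v] = p(x_i = v | y = c)
conditionals = {
    0: [{0: 0.8, 1: 0.2}, {0: 0.6, 1: 0.4}],
    1: [{0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.5}],
}
print(naive_bayes_posterior((1, 0), priors, conditionals))
```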
A Bayes classifier
Given training data, compute p(y=c | x) and choose the largest
What's the (training) error rate of this method?

Features   # bad   # good
X=0          42      15
X=1         338     287
X=2           3       5
(At any x, the classifier picks the more numerous class, so its errors are the minority counts: (15 + 287 + 3) / 690 ≈ 44%.)
A Bayes classifier
Bayes classification decision rule compares probabilities:
p(y=0 | x) >/< p(y=1 | x),  equivalently  p(x, y=0) >/< p(x, y=1)
[Figure: over feature x1, the joint p(x, y=0) (shape p(x | y=0), area p(y=0)) and the joint p(x, y=1) (shape p(x | y=1), area p(y=1)); the decision boundary is where the curves cross]
A Bayes classifier
Not all errors are created equally
Risk associated with each outcome?
[Figure: the joints p(x, y=0) and p(x, y=1) with the two error regions bracketed around the decision boundary]
A Bayes classifier
Scale one side of the comparison by a multiplier alpha to trade off the two error types
Increase alpha: prefer class 0
Example: spam detection
[Figure: the decision boundary shifts as alpha changes]
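A sketch of the alpha-weighted decision; the convention assumed here (alpha scales the class-0 side, so a larger alpha favors predicting class 0) matches the slide's "increase alpha: prefer class 0", and the joint values are made up:

```python
def decide(p_x_y0, p_x_y1, alpha=1.0):
    """Predict 0 if alpha * p(x, y=0) exceeds p(x, y=1), else predict 1."""
    return 0 if alpha * p_x_y0 > p_x_y1 else 1

# Near the boundary, alpha tips the decision:
print(decide(0.04, 0.05, alpha=1.0))  # 1
print(decide(0.04, 0.05, alpha=2.0))  # 0
```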
Measuring errors
Confusion matrix (can extend to more classes):

        Predict 0   Predict 1
Y=0        380           3
Y=1        302           5

True positive rate TP/(TP+FN) -- sensitivity
True negative rate TN/(TN+FP) -- specificity
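A short sketch computing sensitivity and specificity; the cell counts are those implied by the running classifier example (380 = 42 + 338 correct class-0 predictions, and so on), with y=1 treated as "positive":

```python
# conf[(true, predicted)] = count
conf = {(0, 0): 380, (0, 1): 3,
        (1, 0): 302, (1, 1): 5}

tp, fn = conf[(1, 1)], conf[(1, 0)]
tn, fp = conf[(0, 0)], conf[(0, 1)]

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity)
```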
Classical testing:
Choose gamma so that the false positive rate is fixed (the "p-value"):
given that y=0 is true, what's the probability we decide y=1?
(c) Alexander Ihler
ROC curves
Characterize performance as we vary the decision threshold
[Figure: ROC curve traced by the Bayes classifier as the multiplier alpha varies; "guess all 1" at one extreme, and guessing at random with proportion alpha traces the diagonal]
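A sketch of tracing an ROC curve by sweeping the threshold over classifier scores; the scores and labels here are made up for illustration:

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs for every distinct score threshold."""
    pts = []
    pos = sum(labels)
    neg = len(labels) - pos
    for t in sorted(set(scores)) + [float("inf")]:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        pts.append((fp / neg, tp / pos))
    return pts

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(roc_points(scores, labels))
```

Lowering the threshold moves from "guess all 0" at (0, 0) toward "guess all 1" at (1, 1).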
Discriminative learning:
Output prediction ŷ(x)
Probabilistic learning:
Output probability p(y|x)
ROC curves
Characterize performance as we vary our confidence threshold
[Figure: ROC curves for classifiers A and B; vertical axis is the true positive rate (= sensitivity); "guess all 0" and "guess all 1" sit at the extremes]
Gaussian models
Bayes optimal decision: choose the most likely class
Decision boundary: places where the probabilities are equal
Bayes optimal decision boundary:
p(y=0 | x) = p(y=1 | x)
Transition point between p(y=0|x) > p(y=1|x) and p(y=0|x) < p(y=1|x)
Gaussian example
Spherical covariance: Σ = σ² I
Decision rule: choose the class with larger p(x|y) p(y)
[Figure: two spherical Gaussians with the resulting decision boundary]
Ex: diagonal covariance
Σ = [ 3    0  ]
    [ 0   .25 ]
[Figure: elongated contours for the diagonal-covariance example]
p(y=0 | x) = 1 / (1 + exp(-a))   (**)
where a = log [ p(x|y=0) p(y=0) / ( p(x|y=1) p(y=1) ) ]
(**) is called the logistic function, or logistic sigmoid.
Gaussian models
Return to Gaussian models with equal covariances
Now we also know that the probability of each class is given by:
p(y=0 | x) = Logistic( ** ) = Logistic( aᵀ x + b )
We'll see this form again soon
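A numeric sketch of this equivalence in 1-D: for two Gaussians with equal (shared) variance, the Bayes-rule posterior equals a logistic function of a linear function of x. The means, variance, and priors below are made up; a and b are derived by expanding the log-odds of the two Gaussian densities:

```python
import math

mu0, mu1, s2 = -1.0, 2.0, 1.5   # class means, shared variance
p0, p1 = 0.4, 0.6               # class priors

def gauss(x, mu):
    return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def posterior_bayes(x):
    """p(y=0 | x) directly by Bayes rule."""
    j0, j1 = gauss(x, mu0) * p0, gauss(x, mu1) * p1
    return j0 / (j0 + j1)

# Expanding log[ p(x|y=0) p0 / (p(x|y=1) p1) ] gives a linear log-odds a*x + b:
a = (mu0 - mu1) / s2
b = (mu1 ** 2 - mu0 ** 2) / (2 * s2) + math.log(p0 / p1)

def posterior_logistic(x):
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

for x in (-2.0, 0.0, 3.0):
    print(x, posterior_bayes(x), posterior_logistic(x))  # the two columns agree
```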