Multi-class Classification
Fall 2019
Outline
3. Multi-class Classification
Multi-class Naive Bayes
Multi-class Logistic Regression
Review of Logistic Regression
Intuition: Logistic Regression
[Figure: the sigmoid curve of Prob(y = 1 | x) plotted against w^T x, the linear combination of features; outputs near 0 correspond to HAM and outputs near 1 to SPAM, with decision boundary w0 + w1 x1 + w2 x2 = 0.]
Formal Setup: Binary Logistic Classification
[Figure: the sigmoid function plotted for inputs from −6 to 6, rising from 0 toward 1 and crossing 0.5 at 0.]
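As a reference (a standard formulation, not taken verbatim from the slides), the binary logistic model and its training objective, in the notation used later in this deck, are

\[ P(y = 1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}, \qquad P(y = 0 \mid x) = 1 - \sigma(w^\top x), \]

with labels y_n ∈ {0, 1} and the cross-entropy error

\[ E(w) = -\sum_n \left[ y_n \log \sigma(w^\top x_n) + (1 - y_n) \log\left(1 - \sigma(w^\top x_n)\right) \right]. \]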
How to optimize w?
Gradient Descent Update for Logistic Regression

Derivative of the sigmoid:
\[ \frac{d}{da}\,\sigma(a) = \sigma(a)\,[1 - \sigma(a)] \]

Gradient of cross-entropy loss:
\[ \frac{\partial E(w)}{\partial w} = \sum_n \left[ \sigma(w^\top x_n) - y_n \right] x_n \]

Remark
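A minimal NumPy sketch of this update (an illustration, not code from the course), assuming a design matrix X of shape (N, D) and labels y in {0, 1}:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, y, step_size=0.1, num_iters=1000):
    """Batch gradient descent on the cross-entropy error E(w)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        # gradient of E(w): sum_n [sigma(w^T x_n) - y_n] x_n, averaged over N for a stable step
        grad = X.T @ (sigmoid(X @ w) - y) / N
        w -= step_size * grad
    return w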
Numerical optimization
Batch gradient descent vs SGD

Batch GD: fewer iterations; every iteration uses all samples.
SGD: more iterations; every iteration uses one sample.
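For contrast, a stochastic step uses only a single sample's gradient (a sketch reusing sigmoid from the block above):

def sgd_step(w, x_n, y_n, step_size=0.1):
    # one SGD update from a single example: [sigma(w^T x_n) - y_n] x_n
    return w - step_size * (sigmoid(x_n @ w) - y_n) * x_n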
Logistic regression vs linear regression

Residual sum of squares (the linear-regression objective):
\[ \mathrm{RSS}(w) = \frac{1}{2} \sum_n (y_n - w^\top x_n)^2 \]
Non-linear Decision Boundary
How to handle more complex decision boundaries?

[Figure: a data set in the (x1, x2) plane whose classes cannot be separated by a straight line.]
Adding polynomial features

[Figure: the decision boundary −1 + x1² + x2² = 0, a circle in the (x1, x2) plane, obtained by giving the linear model the squared features x1² and x2².]
[Figure: further (x1, x2) plots comparing decision boundaries obtained with different polynomial feature sets.]
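A small sketch of this idea (illustration only; it reuses the hypothetical fit_logistic_regression from the gradient-descent sketch above): append squared features so that an ordinary linear classifier can realize a circular boundary.

import numpy as np

def add_quadratic_features(X):
    """Map [x1, x2] -> [1, x1, x2, x1^2, x2^2] so a linear model can express a circle."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 ** 2, x2 ** 2])

# Usage: w = fit_logistic_regression(add_quadratic_features(X), y)
# A weight vector such as w = [-1, 0, 0, 1, 1] reproduces the boundary -1 + x1^2 + x2^2 = 0.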
Solution to Overfitting: Regularization
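One common remedy, sketched here under the assumption of an L2 penalty on w added to the cross-entropy error (the specific regularizer used on these slides was not preserved):

import numpy as np

def fit_regularized(X, y, lam=0.1, step_size=0.1, num_iters=1000):
    """Gradient descent on E(w) + (lam / 2) * ||w||^2; larger lam yields smoother boundaries."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        # data-fit gradient (reusing sigmoid from the earlier sketch) plus penalty gradient lam * w
        grad = X.T @ (sigmoid(X @ w) - y) / N + lam * w
        w -= step_size * grad
    return w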
Multi-class Classification
What if there are more than 2 classes?

[Figure: a data set with three classes (circles, squares, triangles) in the (x1, x2) plane.]
Setup
K = number of classes
Naive Bayes is already multi-class
Formal Definition
Given a random vector X ∈ R^K and a dependent variable Y ∈ [C], the Naive Bayes model defines the joint distribution by assuming the K features are conditionally independent given Y.
Learning problem

Training data
\[ D = \{(x_n, y_n)\}_{n=1}^{N} \;\rightarrow\; D = \{(\{x_{nk}\}_{k=1}^{K}, y_n)\}_{n=1}^{N} \]

Goal
Learn π_c, c = 1, 2, ..., C, and θ_ck for all c ∈ [C], k ∈ [K], under the constraints
\[ \sum_c \pi_c = 1 \]
and
\[ \sum_k \theta_{ck} = \sum_k P(\mathrm{word}_k \mid Y = c) = 1, \]
as well as π_c, θ_ck ≥ 0.
Our hammer: maximum likelihood estimation

Optimization Problem
\[ (\pi_c^*, \theta_{ck}^*) = \arg\max \; \sum_n \log \pi_{y_n} + \sum_{n,k} x_{nk} \log \theta_{y_n k} \]

Solution
\[ \theta_{ck}^* = \frac{\#\text{ of times word } k \text{ shows up in data points labeled as } c}{\#\text{ total trials for data points labeled as } c} \]
\[ \pi_c^* = \frac{\#\text{ of data points labeled as } c}{N} \]
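A small sketch of these count-based estimates (illustration only), assuming X_counts is an N × K matrix of word counts x_nk and y holds labels 0, ..., C−1:

import numpy as np

def naive_bayes_mle(X_counts, y, C):
    """Closed-form MLE: pi_c from label frequencies, theta_ck from per-class word counts."""
    N, K = X_counts.shape
    pi = np.array([(y == c).mean() for c in range(C)])    # pi_c = (# points labeled c) / N
    theta = np.zeros((C, K))
    for c in range(C):
        counts_c = X_counts[y == c].sum(axis=0)           # times each word appears in class c
        theta[c] = counts_c / counts_c.sum()              # normalize over the vocabulary
    return pi, theta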
Logistic regression for predicting multiple classes?

[Figure: the binary sigmoid Prob(y = 1 | x) against w^T x (HAM vs. SPAM), shown next to a data set with more than two classes in the (x1, x2) plane.]
The One-versus-Rest or One-versus-All Approach

[Figure: one binary boundary per class, e.g. w_1^T x = 0, separating that class from all other data in the (x1, x2) plane. Some regions are claimed by several classifiers ("Square or Circle") and some by none ("None!").]

Ties and empty regions are resolved by comparing the class probabilities, e.g. Pr(Circle) = 0.75, Pr(Square) = 0.6, Pr(Triangle) = 0.2, and predicting the most probable class.
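A hypothetical one-versus-rest sketch along these lines, reusing the binary fit_logistic_regression and sigmoid defined earlier: train one classifier per class on relabeled data, then predict with the highest class probability.

import numpy as np

def train_one_vs_rest(X, y, C):
    # one binary problem per class: class c relabeled as 1, everything else as 0
    return [fit_logistic_regression(X, (y == c).astype(float)) for c in range(C)]

def predict_one_vs_rest(ws, x):
    # score each class by sigmoid(w_c^T x) and return the most probable class
    return int(np.argmax([sigmoid(w @ x) for w in ws]))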
The One-Versus-One Approach

• For each pair of classes Ck and Ck', change the problem into binary classification:
  1. Relabel training data with label Ck into positive (or '1')
  2. Relabel training data with label Ck' into negative (or '0')
  3. Disregard all other data

[Figure: the binary boundary between one pair of classes in the (x1, x2) plane.]
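A matching hypothetical sketch for one-versus-one: one binary classifier per pair of classes, trained only on that pair's data, with prediction by majority vote over the pairwise winners.

from collections import Counter
from itertools import combinations

def train_one_vs_one(X, y, C):
    models = {}
    for c, c_prime in combinations(range(C), 2):
        mask = (y == c) | (y == c_prime)             # disregard all other data
        labels = (y[mask] == c).astype(float)        # class c -> 1, class c' -> 0
        models[(c, c_prime)] = fit_logistic_regression(X[mask], labels)
    return models

def predict_one_vs_one(models, x):
    votes = Counter()
    for (c, c_prime), w in models.items():
        votes[c if sigmoid(w @ x) > 0.5 else c_prime] += 1
    return votes.most_common(1)[0][0]                # class with the most pairwise wins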
Contrast these approaches

Multinomial logistic regression

Intuition: from the decision rule of our naive Bayes classifier
First try
Multinomial logistic regression
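As a reference, the standard softmax parameterization with one weight vector w_k per class (the form consistent with the example on the next slide) is

\[ P(C_k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{k'} \exp(w_{k'}^\top x)}. \]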
How does the softmax function behave?

Suppose we have
\[ w_1^\top x = 100, \qquad w_2^\top x = 50, \qquad w_3^\top x = -20. \]
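A quick numerical check of this example (illustration; the maximum score is subtracted before exponentiating, since exp(100) would overflow in floating point):

import numpy as np

scores = np.array([100.0, 50.0, -20.0])     # w_1^T x, w_2^T x, w_3^T x
probs = np.exp(scores - scores.max())       # numerically stable softmax
probs /= probs.sum()
print(probs)   # approx [1.0, 1.9e-22, 7.7e-53]: the largest score takes essentially all the mass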
Sanity check
Parameter estimation

\[ \Rightarrow \; \sum_n \log P(y_n \mid x_n) = \sum_n \log \prod_{k=1}^{K} P(C_k \mid x_n)^{y_{nk}} = \sum_n \sum_k y_{nk} \log P(C_k \mid x_n) \]
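Negating this log-likelihood gives the error function discussed next; in standard form,

\[ E(w_1, \dots, w_K) = -\sum_n \sum_k y_{nk} \log P(C_k \mid x_n). \]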
Cross-entropy error function

Properties
Summary