
18-661 Introduction to Machine Learning

Multi-class Classification

Fall 2019
Outline

1. Review of Logistic regression

2. Non-linear Decision Boundary

3. Multi-class Classification
Multi-class Naive Bayes
Multi-class Logistic Regression

Review of Logistic regression
Intuition: Logistic Regression

• x1 = # of times 'meet' appears in an email
• x2 = # of times 'lottery' appears in an email
• Define feature vector x = [1, x1, x2]
• Learn the decision boundary w0 + w1 x1 + w2 x2 = 0 such that
  • If wᵀx ≥ 0 declare y = 1 (spam)
  • If wᵀx < 0 declare y = 0 (ham)

[Figure: emails plotted with x1 (occurrences of 'meet') on the horizontal axis and x2 (occurrences of 'lottery') on the vertical axis, separated by the line w0 + w1 x1 + w2 x2 = 0]

Key Idea: map features into points in a high-dimensional space,
and use hyperplanes to separate them

Intuition: Logistic Regression

• Suppose we want to output the probability of an email being
  spam/ham instead of just 0 or 1
• This gives information about the confidence in the decision
• Use a function σ(wᵀx) that maps wᵀx to a value between 0 and 1

[Figure: Prob(y = 1|x) plotted as an S-shaped curve against wᵀx, the linear combination of features; small values of wᵀx correspond to HAM, large values to SPAM]

Key Problem: Finding optimal weights w that accurately predict
this probability for a new email

Formal Setup: Binary Logistic Classification

• Input: x = [1, x1, x2, . . . , xD] ∈ R^{D+1}
• Output: y ∈ {0, 1}
• Training data: D = {(xn, yn), n = 1, 2, . . . , N}
• Model:

      p(y = 1|x; w) = σ(wᵀx)

  where σ(·) stands for the sigmoid function

      σ(a) = 1 / (1 + e^{−a})

[Figure: plot of σ(a) for a ∈ [−6, 6]: an S-shaped curve rising from 0 to 1, crossing 0.5 at a = 0]

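To make the model concrete, here is a minimal NumPy sketch (not from the original slides) of the sigmoid and the resulting classifier; the function names and array shapes are my own choices:

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w):
    """p(y = 1 | x; w) = sigma(w^T x) for each row x of X.

    X: (N, D+1) array whose first column is the constant 1 (bias feature).
    w: (D+1,) weight vector.
    """
    return sigmoid(X @ w)

def predict_label(X, w):
    """Declare spam (1) when sigma(w^T x) >= 0.5, i.e. when w^T x >= 0."""
    return (X @ w >= 0).astype(int)
```
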
How to optimize w?

Probability of a single training sample (xn, yn):

    p(yn|xn; w) = σ(wᵀxn)        if yn = 1
    p(yn|xn; w) = 1 − σ(wᵀxn)    otherwise

Compact expression, exploiting that yn is either 1 or 0:

    p(yn|xn; w) = σ(wᵀxn)^{yn} [1 − σ(wᵀxn)]^{1−yn}

Minimize the negative log-likelihood of the whole training data D,
i.e. the cross-entropy error function

    E(w) = −Σn {yn log σ(wᵀxn) + (1 − yn) log[1 − σ(wᵀxn)]}

Gradient Descent Update for Logistic Regression

Cross-entropy Error Function

    E(w) = −Σn {yn log σ(wᵀxn) + (1 − yn) log[1 − σ(wᵀxn)]}

Simple fact: derivative of σ(a)

    (d/da) σ(a) = σ(a)[1 − σ(a)]

Gradient of the cross-entropy loss

    ∂E(w)/∂w = Σn (σ(wᵀxn) − yn) xn

Remark

• en = σ(wᵀxn) − yn is called the error for the nth training sample.

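As an illustration, a NumPy sketch of the cross-entropy error and its gradient exactly as written above; the small epsilon inside the logarithms is an extra numerical safeguard, not part of the slides:

```python
import numpy as np

def cross_entropy(w, X, y, eps=1e-12):
    """E(w) = -sum_n { y_n log sigma(w^T x_n) + (1 - y_n) log[1 - sigma(w^T x_n)] }."""
    p = 1.0 / (1.0 + np.exp(-X @ w))           # sigma(w^T x_n) for every sample
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(w, X, y):
    """dE/dw = sum_n (sigma(w^T x_n) - y_n) x_n."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    errors = p - y                              # e_n = sigma(w^T x_n) - y_n
    return X.T @ errors
```
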
Numerical optimization

Gradient descent for logistic regression

• Choose a proper step size η > 0
• Iteratively update the parameters following the negative gradient to
  minimize the error function

      w(t+1) ← w(t) − η Σn (σ(xnᵀw(t)) − yn) xn

• Can also perform stochastic gradient descent (with a possibly
  different learning rate η)

      w(t+1) ← w(t) − η (σ(x_{i_t}ᵀw(t)) − y_{i_t}) x_{i_t}

  where i_t is drawn uniformly at random from the training indices
  {1, 2, . . . , N}

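A sketch of both update rules, under the assumption that X stacks the feature vectors xn as rows and y holds the 0/1 labels; the step size and iteration counts are arbitrary placeholders:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, num_iters=100):
    """w <- w - eta * sum_n (sigma(x_n^T w) - y_n) x_n, using all samples per step."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= eta * (X.T @ (p - y))
    return w

def stochastic_gradient_descent(X, y, eta=0.01, num_iters=1000, seed=0):
    """Same update, but with a single sample i_t drawn uniformly at random each step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        i = rng.integers(len(y))               # i_t ~ Uniform{1, ..., N}
        p_i = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= eta * (p_i - y[i]) * X[i]
    return w
```
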
Batch gradient descent vs SGD

[Figure: probability of being spam for four training emails (Email 1–4) versus number of iterations, with learning rate η = 0.01; the Batch GD panel spans 0–40 iterations, the SGD panel 0–200 iterations]

Batch GD: fewer iterations, every iteration uses all samples
SGD: more iterations, every iteration uses one sample

Logistic regression vs linear regression

                       logistic regression                  linear regression

Training data          (xn, yn), yn ∈ {0, 1}                (xn, yn), yn ∈ R
loss function          cross-entropy                        RSS
prob. interpretation   yn|xn, w ∼ Ber(σ(wᵀxn))              yn|xn, w ∼ N(wᵀxn, σ²)
gradient               Σn (σ(xnᵀw) − yn) xn                 Σn (xnᵀw − yn) xn

Cross-entropy loss function (logistic regression):

    E(w) = −Σn {yn log σ(wᵀxn) + (1 − yn) log[1 − σ(wᵀxn)]}

RSS loss function (linear regression):

    RSS(w) = (1/2) Σn (yn − wᵀxn)²

Non-linear Decision Boundary

How to handle more complex decision boundaries?

[Figure: two classes of points in the (x1, x2) plane arranged so that no straight line separates them]

• This data is not linearly separable
• Use non-linear basis functions to add more features

Adding polynomial features

• New feature vector is x = [1, x1, x2, x1², x2²]
• Pr(y = 1|x) = σ(w0 + w1 x1 + w2 x2 + w3 x1² + w4 x2²)
• If w = [−1, 0, 0, 1, 1], the boundary is −1 + x1² + x2² = 0
  • If −1 + x1² + x2² ≥ 0 declare spam
  • If −1 + x1² + x2² < 0 declare ham

[Figure: the same data with the circular decision boundary −1 + x1² + x2² = 0 separating the two classes]

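A small sketch of this quadratic feature map and the resulting circular decision rule; the example data points below are made up for illustration:

```python
import numpy as np

def quadratic_features(X):
    """Map raw features [x1, x2] to [1, x1, x2, x1^2, x2^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2])

# With w = [-1, 0, 0, 1, 1] the boundary is the circle x1^2 + x2^2 = 1:
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])
X_raw = np.array([[0.5, 0.5], [2.0, 0.0]])
scores = quadratic_features(X_raw) @ w         # negative inside the circle, positive outside
print(scores >= 0)                             # [False  True] -> ham, spam
```
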
Adding polynomial features

• What if we add many more features and define
  x = [1, x1, x2, x1², x2², x1³, x2³, . . . ]?
• We get a complex decision boundary

[Figure: the same data separated by a highly complex, non-linear decision boundary]

Can result in overfitting and bad generalization to new data points

Concept-check: Bias-Variance Trade-off

[Figure: two example decision boundaries in the (x1, x2) plane, one labeled high bias, the other high variance]

Solution to Overfitting: Regularization

• Add a regularization term to the cross-entropy loss function

      E(w) = −Σn {yn log σ(wᵀxn) + (1 − yn) log[1 − σ(wᵀxn)]} + (λ/2)‖w‖₂²

  where the last term is the regularization
• Perform gradient descent on this regularized function
• Often, we do NOT regularize the bias term w0 (you will see this in
  the homework)

[Figure: the resulting, less complex decision boundary on the same data]

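A sketch of the gradient of this regularized objective, leaving the bias term w0 out of the penalty as the slide suggests; the value of λ is a placeholder:

```python
import numpy as np

def regularized_gradient(w, X, y, lam=0.1):
    """Gradient of E(w) = cross-entropy + (lambda/2) * ||w||_2^2, without penalizing w0."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - y)                       # gradient of the cross-entropy part
    penalty = lam * w
    penalty[0] = 0.0                           # do NOT regularize the bias term w0
    return grad + penalty
```
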
Multi-class Classification

What if there are more than 2 classes?

• Dog vs. cat vs. crocodile
• Movie genres (action, horror, comedy, . . . )
• Part-of-speech tagging (verb, noun, adjective, . . . )
• ...

[Figure: three classes of points in the (x1, x2) plane]

Setup

Predict multiple classes/outcomes C1, C2, . . . , CK:

• Weather prediction: sunny, cloudy, raining, etc.
• Optical character recognition: 10 digits + 26 characters (lower and
  upper cases) + special characters, etc.

K = number of classes

Methods we've studied for binary classification:

• Naive Bayes
• Logistic regression

Do they generalize to multi-class classification?

Naive Bayes is already multi-class

Formal Definition
Given a random vector X ∈ R^K and a dependent variable Y ∈ [C], the
Naive Bayes model defines the joint distribution

    P(X = x, Y = c) = P(Y = c) P(X = x|Y = c)                      (1)
                    = P(Y = c) ∏_{k=1}^{K} P(wordk|Y = c)^{xk}     (2)
                    = πc ∏_{k=1}^{K} θck^{xk}                      (3)

where xk is the number of occurrences of the kth word, πc is the prior
probability of class c (which allows multiple classes!), and θck is the
weight of the kth word for the cth class.

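To make the prediction rule concrete, a small sketch of how the log of this joint probability would be scored for a word-count vector x, assuming the parameters π and θ have already been estimated:

```python
import numpy as np

def naive_bayes_log_joint(x, log_pi, log_theta):
    """log P(X = x, Y = c) = log pi_c + sum_k x_k log theta_ck, for every class c.

    x:         (K,) word-count vector.
    log_pi:    (C,) log prior probabilities.
    log_theta: (C, K) log word probabilities per class.
    """
    return log_pi + log_theta @ x

def naive_bayes_predict(x, log_pi, log_theta):
    """Pick the class with the largest joint (equivalently, posterior) probability."""
    return int(np.argmax(naive_bayes_log_joint(x, log_pi, log_theta)))
```
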
Learning problem

Training data

    D = {(xn, yn)}_{n=1}^{N} → D = {({xnk}_{k=1}^{K}, yn)}_{n=1}^{N}

Goal
Learn πc, c = 1, 2, · · · , C, and θck, ∀c ∈ [C], k ∈ [K] under the
constraints:

    Σc πc = 1

and

    Σk θck = Σk P(wordk|Y = c) = 1

as well as πc, θck ≥ 0.

Our hammer: maximum likelihood estimation

Log-Likelihood of the training data

    L = log P(D) = log ∏_{n=1}^{N} πyn P(xn|yn)
                 = log ∏_{n=1}^{N} ( πyn ∏k θynk^{xnk} )
                 = Σn ( log πyn + Σk xnk log θynk )
                 = Σn log πyn + Σ_{n,k} xnk log θynk

Optimize it!

    (πc*, θck*) = arg max Σn log πyn + Σ_{n,k} xnk log θynk

Our hammer: maximum likelihood estimation

Optimization Problem

    (πc*, θck*) = arg max Σn log πyn + Σ_{n,k} xnk log θynk

Solution

    θck* = (# of times word k shows up in data points labeled as c)
           / (# total trials for data points labeled as c)

    πc* = (# of data points labeled as c) / N

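These counting formulas translate almost directly into code; a sketch without Laplace smoothing, assuming X is the N×K matrix of word counts and y holds class indices 0, . . . , C−1:

```python
import numpy as np

def naive_bayes_mle(X, y, num_classes):
    """pi_c = fraction of points labeled c; theta_ck = share of word k among all words in class c."""
    N, K = X.shape
    pi = np.zeros(num_classes)
    theta = np.zeros((num_classes, K))
    for c in range(num_classes):
        Xc = X[y == c]                         # data points labeled as class c
        pi[c] = len(Xc) / N
        counts = Xc.sum(axis=0)                # times each word shows up in class c
        theta[c] = counts / counts.sum()       # divide by total word occurrences in class c
    return pi, theta
```
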
Logistic regression for predicting multiple classes?

• The linear decision boundary that we optimized was specific to
  binary classification.
  • If σ(wᵀx) ≥ 0.5 declare y = 1 (spam)
  • If σ(wᵀx) < 0.5 declare y = 0 (ham)
• How to extend it to multi-class classification?

[Figure: the sigmoid curve Prob(y = 1|x) versus wᵀx, the linear combination of features; y = 1 for spam, y = 0 for ham]

Idea: Express as multiple binary classification problems

The One-versus-Rest or One-Versus-All Approach

• For each class Ck, change the problem into binary classification
  1. Relabel training data with label Ck into positive (or '1')
  2. Relabel all the rest of the data into negative (or '0')
• Repeat this multiple times: Train K binary classifiers, using logistic
  regression to differentiate the two classes each time

[Figure: the three-class data with one class relabeled as positive and the rest as negative, and the resulting binary decision boundary w1ᵀx = 0]

The One-versus-Rest or One-Versus-All Approach

How to combine these linear decision boundaries?

• There is ambiguity in some of the regions (the 4 triangular areas)
• How do we resolve this?

[Figure: the three one-vs-rest boundaries overlaid; one marked point falls in a region claimed by both 'square' and 'circle', and another falls in a region claimed by none of the classifiers]

The One-versus-Rest or One-Versus-All Approach

How to combine these linear decision boundaries?

• Use the confidence estimates Pr(y = C1|x) = σ(w1ᵀx), . . . ,
  Pr(y = CK|x) = σ(wKᵀx)
• Declare the class Ck that maximizes the confidence:

      k* = arg max_{k=1,...,K} Pr(y = Ck|x) = σ(wkᵀx)

[Figure: a point in an ambiguous region with confidence estimates Pr(Circle) = 0.75, Pr(Square) = 0.6, Pr(Triangle) = 0.2]

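A sketch of the whole one-versus-rest recipe: train K binary logistic classifiers and predict with the most confident one. The training routine is the plain gradient descent from earlier in the lecture, with the gradient averaged over samples (a choice of mine for a stabler step size):

```python
import numpy as np

def fit_binary_logistic(X, y, eta=0.1, num_iters=500):
    """Gradient descent on the binary cross-entropy (gradient averaged over samples)."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= eta * (X.T @ (p - y)) / len(y)
    return w

def one_vs_rest_fit(X, y, num_classes):
    """Train K binary classifiers: class k versus everything else."""
    return [fit_binary_logistic(X, (y == k).astype(float)) for k in range(num_classes)]

def one_vs_rest_predict(X, weights):
    """Declare the class whose classifier gives the highest confidence sigma(w_k^T x)."""
    scores = np.column_stack([1.0 / (1.0 + np.exp(-X @ w)) for w in weights])
    return np.argmax(scores, axis=1)
```
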
The One-Versus-One Approach

• For each pair of classes Ck and Ck', change the problem into binary
  classification
  1. Relabel training data with label Ck into positive (or '1')
  2. Relabel training data with label Ck' into negative (or '0')
  3. Disregard all other data

[Figure: only the data points from the two chosen classes, separated by a binary decision boundary]

The One-Versus-One Approach

• How many binary classifiers for K classes? K(K − 1)/2
• How to combine their outputs?
  • Given x, count the K(K − 1)/2 votes from the outputs of all binary
    classifiers and declare the winner as the predicted class.
  • Use confidence scores to resolve ties

[Figure: the three pairwise decision boundaries overlaid on the three-class data]

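A sketch of one-versus-one with majority voting; it reuses the fit_binary_logistic helper from the one-versus-rest sketch above:

```python
import numpy as np
from itertools import combinations

def one_vs_one_fit(X, y, num_classes):
    """Train K(K-1)/2 classifiers, one for each pair of classes (k, k')."""
    models = {}
    for k, k2 in combinations(range(num_classes), 2):
        mask = (y == k) | (y == k2)            # disregard all other data
        models[(k, k2)] = fit_binary_logistic(X[mask], (y[mask] == k).astype(float))
    return models

def one_vs_one_predict(X, models, num_classes):
    """Each pairwise classifier votes; the class with the most votes wins."""
    votes = np.zeros((len(X), num_classes))
    for (k, k2), w in models.items():
        pred_is_k = (X @ w >= 0)               # sigma(w^T x) >= 0.5  <=>  w^T x >= 0
        votes[pred_is_k, k] += 1
        votes[~pred_is_k, k2] += 1
    return np.argmax(votes, axis=1)            # ties could be broken with confidence scores
```
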
Contrast these approaches

Number of Binary Classifiers to be trained

• One-Versus-All: K classifiers
• One-Versus-One: K(K − 1)/2 classifiers – bad if K is large

Effect of Relabeling and Splitting Training Data

• One-Versus-All: imbalance in the number of positive and negative
  samples can cause bias in each trained classifier
• One-Versus-One: each classifier is trained on a small subset of data
  (only the points labeled with those two classes are involved), which
  can result in high variance

Any other ideas?

• Hierarchical classification – we will see this in decision trees
• Multinomial Logistic Regression – directly output probabilities of y
  being in each of the K classes, instead of reducing to a binary
  classification problem

Multinomial logistic regression

Intuition: from the decision rule of our naive Bayes classifier

    y* = arg max_k p(y = Ck|x) = arg max_k log p(x|y = Ck) p(y = Ck)
       = arg max_k ( log πk + Σi xi log θki ) = arg max_k wkᵀx

Essentially, we are comparing

    w1ᵀx, w2ᵀx, · · · , wKᵀx

with one for each category.

First try

So, can we define the following conditional model?

    p(y = Ck|x) = σ(wkᵀx)

This would not work because:

    Σk p(y = Ck|x) = Σk σ(wkᵀx) ≠ 1

Each summand can be any number (independently) between 0 and 1.

But we are close!
Learn the K linear models jointly to ensure this property holds!

Multinomial logistic regression

• Model: For each class Ck, we have a parameter vector wk and
  model the posterior probability as:

      p(Ck|x) = exp(wkᵀx) / Σ_{k'} exp(wk'ᵀx)   ← This is called the softmax function

• Decision boundary: Assign x the label that maximizes the posterior:

      arg max_k P(Ck|x) → arg max_k wkᵀx

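A sketch of the softmax posterior for a whole batch of inputs; subtracting the maximum score before exponentiating is a standard numerical safeguard I add, and it does not change the resulting probabilities:

```python
import numpy as np

def softmax_posteriors(X, W):
    """p(C_k | x) = exp(w_k^T x) / sum_k' exp(w_k'^T x) for each row of X.

    X: (N, D+1) features, W: (K, D+1) one weight vector per class.
    """
    scores = X @ W.T                             # w_k^T x for every sample and class
    scores -= scores.max(axis=1, keepdims=True)  # stability: softmax is shift-invariant
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```
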
How does the softmax function behave?

Suppose we have

    w1ᵀx = 100,   w2ᵀx = 50,   w3ᵀx = −20.

We would pick the winning class label 1.

Softmax translates these scores into well-formed conditional
probabilities

    p(y = 1|x) = e^100 / (e^100 + e^50 + e^{−20}) < 1

• preserves relative ordering of scores
• maps scores to values between 0 and 1 that also sum to 1

Sanity check

The multinomial model reduces to binary logistic regression when K = 2:

    p(C1|x) = exp(w1ᵀx) / (exp(w1ᵀx) + exp(w2ᵀx))
            = 1 / (1 + e^{−(w1 − w2)ᵀx})
            = 1 / (1 + e^{−wᵀx})

where w = w1 − w2. Multinomial logistic regression thus generalizes the
(binary) logistic regression to deal with multiple classes.

Parameter estimation

Discriminative approach: maximize the conditional likelihood

    log P(D) = Σn log P(yn|xn)

We will change yn to the vector y_n = [yn1 yn2 · · · ynK]ᵀ, a
K-dimensional vector using 1-of-K encoding:

    ynk = 1 if yn = k, and 0 otherwise

Ex: if yn = 2, then y_n = [0 1 0 0 · · · 0]ᵀ.

    ⇒ Σn log P(yn|xn) = Σn log ∏_{k=1}^{K} P(Ck|xn)^{ynk}
                      = Σn Σk ynk log P(Ck|xn)

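A small sketch of the 1-of-K encoding and the conditional log-likelihood it leads to, reusing the softmax_posteriors helper from the earlier sketch:

```python
import numpy as np

def one_hot(y, num_classes):
    """y_n = k  ->  y_n = [0, ..., 0, 1, 0, ..., 0] with the 1 in position k."""
    Y = np.zeros((len(y), num_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def conditional_log_likelihood(X, Y, W):
    """sum_n sum_k y_nk log P(C_k | x_n), with Y one-hot encoded."""
    P = softmax_posteriors(X, W)
    return np.sum(Y * np.log(P + 1e-12))       # epsilon only guards against log(0)
```
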
Cross-entropy error function

Definition: negative log-likelihood

    E(w1, w2, . . . , wK) = −Σn Σk ynk log P(Ck|xn)
                          = −Σn Σk ynk log ( exp(wkᵀxn) / Σ_{k'} exp(wk'ᵀxn) )

Properties

• Convex, therefore unique global optimum
• Optimization requires numerical procedures, analogous to those used
  for binary logistic regression

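Putting the pieces together, a gradient-descent sketch for this objective. The gradient ∂E/∂W = Σn (pn − yn) xnᵀ, where pn is the vector of predicted posteriors for sample n, is the standard softmax-regression gradient; it is not derived on these slides, so treat the expression as an assumption. The sketch reuses one_hot and softmax_posteriors from the earlier snippets:

```python
import numpy as np

def fit_multinomial_logistic(X, y, num_classes, eta=0.1, num_iters=500):
    """Minimize E(w_1, ..., w_K) = -sum_n sum_k y_nk log P(C_k | x_n) by gradient descent."""
    N, D = X.shape
    W = np.zeros((num_classes, D))
    Y = one_hot(y, num_classes)                # 1-of-K encoding from the previous sketch
    for _ in range(num_iters):
        P = softmax_posteriors(X, W)           # (N, K) predicted posteriors
        grad = (P - Y).T @ X / N               # dE/dW, averaged over samples
        W -= eta * grad
    return W

def predict_multinomial(X, W):
    """Assign the label with the largest posterior (equivalently, the largest w_k^T x)."""
    return np.argmax(X @ W.T, axis=1)
```
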
Summary

You should know

• What logistic regression is and how to solve for w using gradient
  descent on the cross-entropy loss function
• The difference between Naive Bayes and Logistic Regression
• How to solve for the model parameters using gradient descent
• How to handle multi-class classification: one-versus-all,
  one-versus-one, multinomial regression
