Logistic regression
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2015-2016 Emily Fox & Carlos Guestrin, Machine Learning Specialization
Predicting sentiment by topic:
an intelligent restaurant review system

Sample review: "Watching the chefs create incredible edible art made the experience very unique."

A sentence from the review is the input x to the sentence sentiment classifier (the MODEL); the output y is the predicted class, e.g. ŷ = -1.

Note: we'll start talking about 2 classes, and address multiclass later.
A (linear) classifier

• Will use training data to learn a weight or coefficient for each word:

  Word                           Coefficient
  good                            1.0
  great                           1.5
  awesome                         2.7
  bad                            -1.0
  terrible                       -2.1
  awful                          -3.3
  restaurant, the, we, where, …   0.0
  …                               …
Scoring a sentence

Input xi: "Sushi was great, the food was awesome, but the service was terrible."

  Word                           Coefficient
  good                            1.0
  great                           1.2
  awesome                         1.7
  bad                            -1.0
  terrible                       -2.1
  awful                          -3.3
  restaurant, the, we, where, …   0.0
  …                               …

Score(xi) = 1.2 + 1.7 – 2.1 = 0.8 > 0  ⇒  ŷi = +1

Called a linear classifier, because output is weighted sum of input.
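The weighted-sum scoring above can be sketched in a few lines of Python (the `score_sentence` and `predict` helpers are my own illustrative names, not course code; coefficient values are from the table):

```python
# Per-word coefficients learned by the classifier (values from the slide).
coefficients = {
    "good": 1.0, "great": 1.2, "awesome": 1.7,
    "bad": -1.0, "terrible": -2.1, "awful": -3.3,
}

def score_sentence(sentence, coefficients):
    """Sum the coefficient of every word in the sentence (0.0 if unlisted)."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return sum(coefficients.get(w, 0.0) for w in words)

def predict(sentence, coefficients):
    """Linear classifier: predict +1 if the score is positive, else -1."""
    return +1 if score_sentence(sentence, coefficients) > 0 else -1

sentence = "Sushi was great, the food was awesome, but the service was terrible."
print(score_sentence(sentence, coefficients))  # 1.2 + 1.7 - 2.1 ≈ 0.8
print(predict(sentence, coefficients))         # +1
```

Words with coefficient 0.0 (like "the" and "was") can simply be omitted from the dictionary, since a missing word contributes nothing to the score.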
Learning the word coefficients

Data (x,y) — (Sentence1, y1), (Sentence2, y2), … — is split into a training set and a validation set. The training set is used to learn the classifier (the table of word coefficients); the validation set is used to measure its accuracy.
Decision boundaries

(Figure: a plane with #awesome on the horizontal axis and #awful on the vertical axis.)
Decision boundary example

  Word       Coefficient
  #awesome    1.0
  #awful     -1.5

Score(x) = 1.0 #awesome – 1.5 #awful

(Figure: in the #awesome–#awful plane, the line where Score(x) = 0 separates the region with Score(x) > 0 from the region with Score(x) < 0.)
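A quick sketch of this decision rule (coefficient values from the slide; the sample word counts are made up for illustration):

```python
# Score(x) = 1.0 * #awesome - 1.5 * #awful
def score(n_awesome, n_awful, w_awesome=1.0, w_awful=-1.5):
    """Linear score for a review with the given word counts."""
    return w_awesome * n_awesome + w_awful * n_awful

# Points below the boundary line score positive, points above it negative.
print(score(3, 1))  # 3.0 - 1.5 = 1.5 > 0  -> predict +1
print(score(1, 2))  # 1.0 - 3.0 = -2.0 < 0 -> predict -1
```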
Decision boundary separates
positive & negative predictions

• For linear classifiers:
  - When 2 coefficients are non-zero → line
  - When 3 coefficients are non-zero → plane
  - When many coefficients are non-zero → hyperplane
• For more general classifiers → more complicated shapes

(Figure: ML block diagram — data (x,y) feeds a quality metric and an ML algorithm, which outputs the learned coefficients ŵ.)
Coefficients of classifier

Score(x) = w0 + w1 #awesome + w2 #awful + w3 #great

(Here x[1] = #awesome, x[2] = #awful, x[3] = #great.)
General notation

Output: y ∈ {-1,+1}
Inputs: x = (x[1], x[2], …, x[d]), a d-dimensional vector

Notational conventions:
  x[j]  = jth input (scalar)
  hj(x) = jth feature (scalar)
  xi    = input of ith data point (vector)
  xi[j] = jth input of ith data point (scalar)
Decision boundary: effect of changing coefficients

  Input      Coefficient   Value
             w0            1.0  (was 0.0)
  #awesome   w1            1.0
  #awful     w2           -1.5

Score(x) = 1.0 + 1.0 #awesome – 1.5 #awful

(Figure: adding the intercept w0 = 1.0 shifts the line Score(x) = 0 in the #awesome–#awful plane, moving the boundary between the Score(x) > 0 and Score(x) < 0 regions.)
Decision boundary: effect of changing coefficients

  Input      Coefficient   Value
             w0            1.0
  #awesome   w1            1.0
  #awful     w2           -3.0  (was -1.5)

Score(x) = 1.0 + 1.0 #awesome – 3.0 #awful

(Figure: making w2 more negative changes the slope of the line Score(x) = 0, shrinking the Score(x) > 0 region in the #awesome–#awful plane.)
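The two coefficient changes above can be checked numerically; a small sketch (the sample point is made up):

```python
def score(n_awesome, n_awful, w0, w1, w2):
    """Score(x) = w0 + w1 * #awesome + w2 * #awful."""
    return w0 + w1 * n_awesome + w2 * n_awful

point = (1, 1)  # a review with one "awesome" and one "awful"

# Original coefficients (w0 = 0): the point is on the negative side.
print(score(*point, 0.0, 1.0, -1.5))   # -0.5 -> predict -1
# Raising w0 to 1.0 shifts the boundary: the same point flips to +1.
print(score(*point, 1.0, 1.0, -1.5))   # 0.5 -> predict +1
# Making w2 more negative tilts the boundary: the point flips back.
print(score(*point, 1.0, 1.0, -3.0))   # -1.0 -> predict -1
```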
More generic features… D-dimensional hyperplane

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi)
Model: ŷi = sign(Score(xi))
Are you sure about the prediction?

Class probability: a prediction can be ŷ = +1 with high probability, or ŷ = +1 with probability barely above 0.5.
Basics of probabilities – quick review

  x = review text                                               y = sentiment
  All the sushi was delicious! Easily best sushi in Seattle.    +1
  The sushi & everything else were awesome!                     +1
  My wife tried their ramen, it was pretty forgettable.         -1
  The sushi was good, the service was OK.                       +1
  …                                                             …

If P(y=+1) = 0.7, I expect 70% of rows to have y = +1. (Exact number will vary for each specific dataset.)

• Probabilities are always between 0 & 1
• Probabilities sum up to 1

Class probabilities are useful for predicting ŷ; we find the best model P̂ by finding the best ŵ.
Thus far, we focused on decision boundaries

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wᵀh(xi)

(Figure: decision boundary in the #awesome–#awful plane, with Score(x) > 0 on one side and Score(x) < 0 on the other.)

How do we relate Score(xi) to P̂(y=+1|x,ŵ)?
Interpreting Score(xi)

Score(xi) = wᵀh(xi) ranges over (-∞, +∞):
  Score(xi) → -∞:  ŷi = -1,  P̂(y=+1|xi) = 0
  Score(xi) = 0.0:             P̂(y=+1|xi) = 0.5
  Score(xi) → +∞:  ŷi = +1,  P̂(y=+1|xi) = 1
Why not just use regression to build classifier?

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi), with -∞ < Score(xi) < +∞.

But probabilities P(y=+1|xi) lie between 0 and 1. How do we link (-∞, +∞) to [0, 1]?
Link function: squeeze real line into [0,1]

Score(xi) = wᵀh(xi) ranges over (-∞, +∞); a link function g maps it into [0, 1]:

P̂(y=+1|xi) = g(wᵀh(xi))

This is a generalized linear model.
The learned model: P̂(y=+1|x,ŵ) = g(ŵᵀh(x))
Logistic regression classifier:
linear score with logistic link function

sigmoid(Score) = 1 / (1 + e^(-Score))

  Score           -∞     -2      0.0    +2     +∞
  sigmoid(Score)   0.0   ≈0.12    0.5   ≈0.88   1.0

P̂(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(-wᵀh(xi)))
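The logistic link is easy to sketch directly (a minimal standard-library version; the function name is mine):

```python
import math

def sigmoid(score):
    """Logistic link: squeezes any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

for s in (-2, 0, 2):
    print(s, round(sigmoid(s), 2))
# sigmoid(-2) ≈ 0.12, sigmoid(0) = 0.5, sigmoid(2) ≈ 0.88
```

Note the symmetry sigmoid(-s) = 1 - sigmoid(s), which is why the two tails approach 0 and 1 at the same rate.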
Logistic regression ⇒ linear decision boundary

P̂(y=+1|x,w) = 1 / (1 + e^(-wᵀh(x)))

(Figure: the boundary P̂(y=+1|x,w) = 0.5, i.e. wᵀh(x) = 0, is still a line in the #awesome–#awful plane.)
Effect of coefficients on
logistic regression model

              Model 1   Model 2   Model 3
  w0           -2         0         0
  w#awesome    +1        +1        +3
  w#awful      -1        -1        -3

(Figure: three sigmoid curves of P̂(y=+1|x,w) = 1/(1 + e^(-wᵀh(x))) plotted against #awesome – #awful. Shifting w0 moves the curve left or right; scaling up the word coefficients makes the transition steeper.)
Overview of learning
logistic regression model

P̂(y=+1|x,ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

  Word      Coefficient   Value
  awesome   ŵ2             1.7
  bad       ŵ3            -1.0
  awful     ŵ4            -3.3
  …         …              …

Data (x,y) — (Sentence1, y1), (Sentence2, y2), … — is split into a training set, used to learn the classifier, and a validation set, used to assess its quality.
Find "best" classifier =
maximize quality metric over all possible w0, w1, w2

Quality metric: likelihood ℓ(w). For example, ℓ(w0=0, w1=1, w2=-1.5) = 10^-6.

Best model = highest likelihood ℓ(w):
ŵ = (w0=1, w1=0.5, w2=-1.5)

(Figure: candidate decision boundaries in the #awesome–#awful plane, each with its own likelihood.)
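The likelihood of a candidate w is the product, over data points, of the probability the model assigns to each observed label. A hedged sketch (the tiny dataset and helper names are made up for illustration):

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def likelihood(w, data):
    """Product over points of P(y_i | x_i, w) for a 2-feature model.

    Each point is ((n_awesome, n_awful), y) with y in {+1, -1};
    Score = w0 + w1 * n_awesome + w2 * n_awful.
    """
    total = 1.0
    for (a, f), y in data:
        p_pos = sigmoid(w[0] + w[1] * a + w[2] * f)
        total *= p_pos if y == +1 else (1.0 - p_pos)
    return total

# Made-up dataset: two positive reviews, one negative.
data = [((3, 0), +1), ((2, 1), +1), ((0, 2), -1)]
# Compare two candidate coefficient vectors; learning picks the one
# with the higher likelihood.
print(likelihood((0.0, 1.0, -1.5), data))
print(likelihood((1.0, 0.5, -1.5), data))
```

In practice the log-likelihood is maximized instead, since a product of many small probabilities underflows quickly.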
Encoding categorical inputs

x = country of birth (Argentina, Brazil, USA, …, Zimbabwe: 196 values)

1-hot encoding: x → (h1(x), h2(x), …, h195(x), h196(x)), where hj(x) = 1 for the country equal to x and 0 for all others, so each country maps to a different single 1.

Another categorical example: input x = image pixels, output y = object in image.
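A minimal 1-hot encoder sketch (the three-country list is an illustrative subset; the real feature vector has one slot per country):

```python
countries = ["Argentina", "Brazil", "USA"]  # illustrative subset of the 196

def one_hot(value, categories):
    """Return a list with a single 1 at the matching category's position."""
    return [1 if c == value else 0 for c in categories]

print(one_hot("Brazil", countries))  # [0, 1, 0]
```

Each encoded vector has exactly one 1, so the learned coefficient for slot j acts as a per-country intercept.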
Multiclass classification formulation

• C possible classes: y can be 1, 2, …, C
• N datapoints:

  Data point   x[1]   x[2]   y
  x1,y1         2      1     …
  x2,y2         0      2     …
  x3,y3         3      3     …
  x4,y4         4      1     …

(The class labels y are shown graphically in the slide.)

Learn a probability model P̂(y = c|x) for each class c.
1 versus all:
estimate P̂(y = c|x) using a 2-class model

For each class c, train a 2-class classifier P̂c(y=+1|x), treating class c as +1 and all other classes as -1.

Predict: P̂(y = c|xi) = P̂c(y=+1|xi) = 1 / (1 + e^(-ŵcᵀh(xi)))
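A 1-versus-all prediction sketch (the class names and per-class weight vectors are made-up illustrations; each class gets its own 2-class logistic model):

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# One weight vector (w0, w1, w2) per class -- made-up values.
class_weights = {
    "triangle": (0.0, 1.0, -1.0),
    "circle":   (0.0, -1.0, 1.0),
    "star":     (-1.0, 0.5, 0.5),
}

def predict_class(x, class_weights):
    """Pick the class whose 2-class model gives the highest P(y=+1|x)."""
    def prob(w):
        return sigmoid(w[0] + w[1] * x[0] + w[2] * x[1])
    return max(class_weights, key=lambda c: prob(class_weights[c]))

print(predict_class((3, 1), class_weights))  # "triangle" scores highest here
```

Note the per-class probabilities need not sum to 1, since each model was trained separately; prediction only compares them.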
What you can do now…
• Describe decision boundaries and linear
classifiers
• Use class probability to express degree of
confidence in prediction
• Define a logistic regression model
• Interpret logistic regression outputs as class
probabilities
• Describe impact of coefficient values on
logistic regression output
• Use 1-hot encoding to represent categorical
inputs
• Perform multiclass classification using the
1-versus-all approach