
Linear classifiers:

Logistic regression
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2015-2016 Emily Fox & Carlos Guestrin
Predicting sentiment by topic:
An intelligent restaurant review system


It's a big day & I want to book a table at a nice Japanese restaurant.
Seattle has many ★★★★ sushi restaurants.
What are people saying about the food? the ambiance? ...
Positive reviews are not positive about everything

Sample review:
•  Experience: "Watching the chefs create incredible edible art made the experience very unique."
•  "My wife tried their ramen and it was pretty forgettable."
•  "All the sushi was delicious! Easily best sushi in Seattle."
Classifying sentiment of a review

Input sentence: "Easily best sushi in Seattle."
→ Sentence sentiment classifier →
Output: predicted sentiment ŷ


Linear classifier: Intuition



Classifier

Sentence from review (Input: x) → Classifier MODEL → Output: y (predicted class), either ŷ = +1 or ŷ = -1

Note: we'll start talking about 2 classes, and address multiclass later
A (linear) classifier
•  Will use training data to learn a weight
or coefficient for each word
Word Coefficient
good 1.0
great 1.5
awesome 2.7
bad -1.0
terrible -2.1
awful -3.3
restaurant, the, we, where, … 0.0
… …
Scoring a sentence

Input xi: "Sushi was great, the food was awesome, but the service was terrible."

Word Coefficient
good 1.0
great 1.2
awesome 1.7
bad -1.0
terrible -2.1
awful -3.3
restaurant, the, we, where, … 0.0
… …

Score(xi) = 1.2 (great) + 1.7 (awesome) – 2.1 (terrible) = 0.8 > 0, so ŷi = +1

Called a linear classifier, because the output is a weighted sum of the inputs.

Simple linear classifier

Score(x) = weighted count of words in sentence

For a sentence from a review (Input: x):
If Score(x) > 0:
  ŷ = +1
Else:
  ŷ = -1
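A minimal Python sketch of this word-count classifier (the coefficients mirror the example table above; the naive tokenizer is my own assumption):

```python
# Minimal sketch of the simple linear classifier described above.
coefficients = {
    "good": 1.0, "great": 1.2, "awesome": 1.7,
    "bad": -1.0, "terrible": -2.1, "awful": -3.3,
}  # all other words implicitly have coefficient 0.0

def score(sentence: str) -> float:
    """Weighted count of words in the sentence."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return sum(coefficients.get(w, 0.0) for w in words)

def predict(sentence: str) -> int:
    """ŷ = +1 if Score(x) > 0, else ŷ = -1."""
    return +1 if score(sentence) > 0 else -1

review = "Sushi was great, the food was awesome, but the service was terrible."
print(score(review))    # ≈ 0.8
print(predict(review))  # 1, i.e., ŷ = +1
```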
Training a classifier = Learning the coefficients

Word Coefficient
good 1.0
awesome 1.7
bad -1.0
awful -3.3
… …

Data (x,y): (Sentence1, sentiment1), (Sentence2, sentiment2), …
Training set → Learn classifier (the coefficients)
Validation set → assess Accuracy
Decision boundaries



Suppose only two words had non-zero coefficients

Word Coefficient
#awesome 1.0
#awful -1.5

Score(x) = 1.0 #awesome – 1.5 #awful

[Plot: reviews placed in the plane by their word counts, with #awesome on the horizontal axis and #awful on the vertical axis; e.g., "Sushi was awesome, the food was awesome, but the service was awful." has 3 "awesome" and 1 "awful", so it sits at (3, 1).]
Decision boundary example

Word Coefficient
#awesome 1.0
#awful -1.5

Score(x) = 1.0 #awesome – 1.5 #awful

[Plot: the decision boundary is the line 1.0 #awesome – 1.5 #awful = 0; points with Score(x) > 0 lie below the line, points with Score(x) < 0 lie above it.]
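A small matplotlib sketch (the plotting choices are mine, not part of the slides) that reproduces this picture by shading the two half-planes defined by the sign of Score(x):

```python
# Sketch: visualize the decision boundary Score(x) = 1.0*#awesome - 1.5*#awful = 0.
import numpy as np
import matplotlib.pyplot as plt

w_awesome, w_awful = 1.0, -1.5

awesome = np.linspace(0, 4, 200)
awful = np.linspace(0, 4, 200)
A, B = np.meshgrid(awesome, awful)
score = w_awesome * A + w_awful * B

plt.contourf(A, B, np.sign(score), levels=[-1.5, 0, 1.5], alpha=0.3)  # shade Score<0 vs Score>0
plt.contour(A, B, score, levels=[0], colors="k")  # the boundary: 1.0*#awesome - 1.5*#awful = 0
plt.xlabel("#awesome")
plt.ylabel("#awful")
plt.title("Score(x) > 0 below the line, Score(x) < 0 above it")
plt.show()
```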
Decision boundary separates
positive & negative predictions
•  For linear classifiers:
  - When 2 coefficients are non-zero ⇒ line
  - When 3 coefficients are non-zero ⇒ plane
  - When many coefficients are non-zero ⇒ hyperplane
•  For more general classifiers
  ⇒ more complicated shapes


Linear classifier: Model



[Block diagram] Training data supplies inputs x and labels y. Feature extraction turns x into h(x); the ML model maps h(x) to the prediction ŷ. The ML algorithm uses y and a quality metric to learn the coefficients ŵ used by the model.
Coefficients of classifier

Score(x) = w0 + w1 #awesome + w2 #awful + w3 #great
where x[1] = #awesome, x[2] = #awful, x[3] = #great.
[Figure: data plotted in the (#awesome, #awful) plane, i.e., (x[1], x[2]).]
General notation

Output: y ∈ {-1,+1}
Inputs: x = (x[1], x[2], …, x[d]), a d-dimensional vector

Notational conventions:
  x[j] = jth input (scalar)
  hj(x) = jth feature (scalar)
  xi = input of ith data point (vector)
  xi[j] = jth input of ith data point (scalar)


Simple hyperplane

Model: ŷi = sign(Score(xi))
Score(xi) = w0 + w1 xi[1] + … + wd xi[d] = wᵀxi

feature 1 = 1 (constant)
feature 2 = x[1] … e.g., #awesome
feature 3 = x[2] … e.g., #awful
…
feature d+1 = x[d] … e.g., #ramen
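A minimal numpy sketch of this model, assuming the convention above in which h(x) prepends a constant 1 to the raw inputs (function names and example values are illustrative):

```python
# Sketch: simple hyperplane model ŷ = sign(w^T h(x)), with h(x) = (1, x[1], ..., x[d]).
import numpy as np

def score(w: np.ndarray, x: np.ndarray) -> float:
    """Score(x) = w0 + w1*x[1] + ... + wd*x[d] = w^T h(x)."""
    h = np.concatenate(([1.0], x))  # prepend the constant feature
    return float(w @ h)

def predict(w: np.ndarray, x: np.ndarray) -> int:
    return +1 if score(w, x) > 0 else -1

w = np.array([0.0, 1.0, -1.5])     # (w0, w_#awesome, w_#awful) from the running example
x = np.array([3.0, 1.0])           # 3 "awesome", 1 "awful"
print(score(w, x), predict(w, x))  # 1.5, +1
```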
Decision boundary: effect of changing coefficients

Input      Coefficient  Value
(constant) w0            0.0
#awesome   w1            1.0
#awful     w2           -1.5

Score(x) = 1.0 #awesome – 1.5 #awful
[Plot: the boundary line passes through the origin; Score(x) > 0 on one side, Score(x) < 0 on the other.]
Decision boundary: effect of changing coefficients

Input      Coefficient  Value
(constant) w0            1.0  (was 0.0)
#awesome   w1            1.0
#awful     w2           -1.5

Score(x) = 1.0 + 1.0 #awesome – 1.5 #awful
[Plot: adding the intercept shifts the boundary line; it no longer passes through the origin.]
Decision boundary: effect of changing coefficients

Input      Coefficient  Value
(constant) w0            1.0
#awesome   w1            1.0
#awful     w2           -3.0  (was -1.5)

Score(x) = 1.0 + 1.0 #awesome – 3.0 #awful
[Plot: changing w2 changes the slope of the boundary line.]
More generic features… D-dimensional hyperplane

Model: ŷi = sign(Score(xi))
Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi)
          = Σ_{j=0}^{D} wj hj(xi) = wᵀh(xi)

feature 1 = h0(x) … e.g., 1 (constant)
feature 2 = h1(x) … e.g., x[1] = #awesome
feature 3 = h2(x) … e.g., x[2] = #awful
            or log(x[7]) x[2] = log(#bad) x #awful
            or tf-idf(“awful”)
…
feature D+1 = hD(x) … some other function of x[1],…, x[d]
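A minimal Python sketch of such feature functions (the transforms follow the examples above; using log(1 + #bad) instead of log(#bad) to avoid log of zero is my own adjustment, and the coefficient values are illustrative):

```python
# Sketch: generic features h_j(x), then Score(x) = w^T h(x).
import numpy as np

def features(counts: dict) -> np.ndarray:
    """Map raw word counts to a feature vector h(x)."""
    return np.array([
        1.0,                                                          # h0: constant
        counts.get("awesome", 0),                                     # h1: #awesome
        counts.get("awful", 0),                                       # h2: #awful
        np.log(1 + counts.get("bad", 0)) * counts.get("awful", 0),    # h3: log(1+#bad) * #awful
    ])

w = np.array([1.0, 1.0, -1.5, -0.5])   # illustrative coefficients, one per feature
x = {"awesome": 3, "awful": 1, "bad": 2}
s = w @ features(x)                    # Score(x) = w^T h(x)
print(s, np.sign(s))
```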
[Block diagram, as before] Feature extraction turns x into h(x); the ML model now outputs ŷ = sign(ŵᵀh(x)) (either -1 or +1); the ML algorithm learns ŵ from the training labels y and the quality metric.
Are you sure about the prediction?
Class probability



How confident is your prediction?
•  Thus far, we've outputted a hard prediction: ŷ = +1 or ŷ = -1
•  But, how sure are you about the prediction?

“The sushi & everything else were awesome!” → Definite: ŷ = +1 with high probability
“The sushi was good, the service was OK” → Not sure: ŷ = +1 with probability 0.5
Basics of probabilities – quick review



Basic probability

Probability a review is positive is 0.7

x = review text    y = sentiment
All the sushi was delicious! Easily best sushi in Seattle.  +1
The sushi & everything else were awesome!  +1
My wife tried their ramen, it was pretty forgettable.  -1
The sushi was good, the service was OK  +1
…  …

I expect 70% of rows to have y = +1.
(Exact number will vary for each specific dataset.)


Interpreting probabilities
as degrees of belief

P(y=+1):
  0.0 → absolutely sure reviews are negative
  0.5 → not sure if reviews are positive or negative
  1.0 → absolutely sure reviews are positive
Key properties of probabilities
•  Probabilities are always between 0 & 1:
  - Two classes (e.g., y is +1 or -1): 0 ≤ P(y=+1) ≤ 1
  - Multiple classes (e.g., y is dog, cat or bird): 0 ≤ P(y=c) ≤ 1 for every class c
•  Probabilities sum up to 1:
  - Two classes: P(y=+1) + P(y=-1) = 1
  - Multiple classes: P(y=dog) + P(y=cat) + P(y=bird) = 1


Conditional probability

Probability a review with 3 “awesome” and 1 “awful” is positive is 0.9

x = review text    y = sentiment
All the sushi was delicious! Easily best sushi in Seattle.  +1
Sushi was awesome & everything else was awesome!  +1
The service was awful, but overall awesome place!
My wife tried their ramen, it was pretty forgettable.  -1
The sushi was good, the service was OK  +1
…  …
awesome … awesome … awful … awesome  +1
…  …
awesome … awesome … awful … awesome  -1
…  …
awesome … awesome … awful … awesome  +1

I expect 90% of rows with reviews containing 3 “awesome” & 1 “awful” to have y = +1.
(Exact number will vary for each specific dataset.)
Interpreting conditional probabilities

P(y=+1 | xi=“All the sushi was delicious!”)   (output label given input sentence)
  0.0 → absolutely sure the review “All the sushi was delicious!” is negative
  0.5 → not sure if the review “All the sushi was delicious!” is positive or negative
  1.0 → absolutely sure the review “All the sushi was delicious!” is positive
Key properties of
conditional probabilities
•  Conditional probabilities are always between 0 & 1:
  - Two classes (e.g., y is +1 or -1, xi is review text): 0 ≤ P(y=+1|x) ≤ 1
  - Multiple classes (e.g., y is dog, cat or bird, xi is image): 0 ≤ P(y=c|x) ≤ 1 for every class c
•  Conditional probabilities sum up to 1 over y, but not over x:
  - Two classes: P(y=+1|x) + P(y=-1|x) = 1
  - Multiple classes: P(y=dog|x) + P(y=cat|x) + P(y=bird|x) = 1
Using probabilities in classification



How confident is your prediction?

“The sushi & everything else were awesome!” → Definite:
  P(y=+1 | x=“The sushi & everything else were awesome!”) = 0.99

“The sushi was good, the service was OK” → Not sure:
  P(y=+1 | x=“The sushi was good, the service was OK”) = 0.55

Many classifiers provide a degree of certainty: P(y|x), the probability of the output label given the input sentence.
Extremely useful in practice.
Goal: Learn conditional probabilities from data

Training data: N observations (xi, yi)

x[1] = #awesome   x[2] = #awful   y = sentiment
2                 1               +1
0                 2               -1
3                 3               -1
4                 1               +1
…                 …               …

Optimize the quality metric on the training data:
find the best model P̂ by finding the best ŵ, which is then useful for predicting ŷ.


Predict most likely class

P̂(y|x) = estimate of class probabilities

For a sentence from a review (Input: x):
If P̂(y=+1|x) > 0.5:
  ŷ = +1
Else:
  ŷ = -1

•  Estimating P̂(y|x) improves interpretability:
  -  Predict ŷ = +1 and tell me how sure you are


Predicting class probabilities with
generalized linear models




[Block diagram, as before] The ML model now outputs P̂(y|x) rather than a hard prediction; the ML algorithm still learns ŵ from y and the quality metric.
Thus far, we focused on decision boundaries

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wᵀh(xi)

[Plot, as before: the boundary Score(x) = 0 in the (#awesome, #awful) plane, with Score(x) > 0 on one side and Score(x) < 0 on the other.]

How do we relate Score(xi) to P̂(y=+1|x,ŵ)?
Interpreting Score(xi)

Score(xi) = wᵀh(xi) ranges from -∞ to +∞:
  Score(xi) very negative → very sure ŷi = -1 → P̂(y=+1|xi) = 0
  Score(xi) near 0.0      → not sure if ŷi = -1 or +1 → P̂(y=+1|xi) = 0.5
  Score(xi) very positive → very sure ŷi = +1 → P̂(y=+1|xi) = 1
Why not just use regression to build classifier?

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi), with -∞ < Score(xi) < +∞

But probabilities are between 0 and 1: 0 ≤ P(y=+1|xi) ≤ 1.
How do we link (-∞, +∞) to [0, 1]?
Link function: squeeze real line into [0,1]

Score(xi) = wᵀh(xi) ranges over (-∞, +∞); a link function g squeezes it into [0, 1]:

P(y=+1|xi) = g(wᵀh(xi))

This is called a generalized linear model.

P(y=+1|x,ŵ) = g(ŵᵀh(x))

[Block diagram, as before] The ML model outputs this probability; the ML algorithm learns ŵ from y and the quality metric.
Logistic regression classifier:
linear score with
logistic link function



Logistic function (sigmoid, logit)

sigmoid(Score) = 1 / (1 + e^(-Score))

Score:           -∞    -2    0.0    +2    +∞
sigmoid(Score):   0   0.12   0.5   0.88    1
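A minimal Python sketch of the sigmoid (the helper name and test values are mine):

```python
# Sketch: the logistic (sigmoid) link function.
import numpy as np

def sigmoid(score):
    """sigmoid(Score) = 1 / (1 + exp(-Score)); maps (-inf, +inf) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(score, dtype=float)))

print(sigmoid([-50.0, -2.0, 0.0, 2.0, 50.0]))
# -> approximately [0, 0.12, 0.5, 0.88, 1]
```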


Logistic regression model

Score(xi) = wᵀh(xi) ranges over (-∞, +∞), and

P(y=+1|xi,w) = sigmoid(Score(xi))

squeezes it into [0.0, 1.0], with Score = 0 mapping to probability 0.5.


Understanding the logistic
regression model

P(y = +1 | x, w) = 1 / (1 + e^(-wᵀh(x)))

[Plot: P(y=+1|xi,w) = 1/(1 + e^(-wᵀh(x))) plotted against Score(xi) = wᵀh(x): an S-shaped curve rising from 0 to 1, crossing 0.5 at Score = 0.]
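Putting the pieces together, a minimal sketch of the logistic regression probability prediction (coefficients are the illustrative values from the running example; h(x) prepends the constant feature as before):

```python
# Sketch: logistic regression prediction P(y=+1|x,w) = sigmoid(w^T h(x)).
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def predict_proba(w: np.ndarray, x: np.ndarray) -> float:
    """Probability that y = +1 given the raw inputs x."""
    h = np.concatenate(([1.0], x))   # h(x) = (1, x[1], ..., x[d])
    return float(sigmoid(w @ h))

w = np.array([0.0, 1.0, -1.5])                   # (w0, w_#awesome, w_#awful)
print(predict_proba(w, np.array([3.0, 1.0])))    # Score = 1.5  -> ~0.82
print(predict_proba(w, np.array([1.0, 2.0])))    # Score = -2.0 -> ~0.12
```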
Logistic regression ⇒
Linear decision boundary

[Plot: in the (#awesome, #awful) plane, the set of points where wᵀh(x) = 0, i.e., where 1/(1 + e^(-wᵀh(x))) = 0.5, is a straight line: the decision boundary.]
Effect of coefficients on
logistic regression model

Three coefficient settings, each plotted as P(y=+1|x) = 1/(1 + e^(-wᵀh(x))) versus #awesome - #awful:
  w0 = -2, w#awesome = +1, w#awful = -1
  w0 =  0, w#awesome = +1, w#awful = -1
  w0 =  0, w#awesome = +3, w#awful = -3

[Plots: changing w0 shifts the sigmoid curve left or right; larger-magnitude coefficients make it steeper.]
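A small matplotlib sketch (plotting choices are mine) that reproduces these three curves:

```python
# Sketch: how coefficients change the logistic regression curve.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

diff = np.linspace(-4, 4, 200)   # x-axis: #awesome - #awful
settings = [
    (-2.0, 1.0),   # w0 = -2, w_#awesome = +1, w_#awful = -1
    ( 0.0, 1.0),   # w0 =  0, w_#awesome = +1, w_#awful = -1
    ( 0.0, 3.0),   # w0 =  0, w_#awesome = +3, w_#awful = -3
]
for w0, w1 in settings:
    # with w_#awesome = -w_#awful = w1, the score depends only on the difference
    plt.plot(diff, sigmoid(w0 + w1 * diff), label=f"w0={w0}, |w|={w1}")

plt.xlabel("#awesome - #awful")
plt.ylabel("P(y=+1|x)")
plt.legend()
plt.show()
```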



P(y=+1|x,ŵ) = sigmoid(ŵᵀh(x)) = 1 / (1 + e^(-ŵᵀh(x)))

[Block diagram, as before] Feature extraction produces h(x); the ML model outputs this probability; the ML algorithm learns ŵ from y and the quality metric.
Overview of learning
logistic regression model



Training a classifier = Learning the coefficients

Word        Coefficient  Value
(constant)  ŵ0           -2.0
good        ŵ1            1.0
awesome     ŵ2            1.7
bad         ŵ3           -1.0
awful       ŵ4           -3.3
…           …             …

P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

Data (x,y): (Sentence1, sentiment1), (Sentence2, sentiment2), …
Training set → Learn classifier (the coefficients ŵ)
Validation set → assess Quality
Find “best” classifier =
Maximize quality metric over all possible w0, w1, w2

Quality metric: likelihood ℓ(w). For example:
  ℓ(w0=0, w1=1,   w2=-1.5) = 10^-6
  ℓ(w0=1, w1=1,   w2=-1.5) = 10^-5
  ℓ(w0=1, w1=0.5, w2=-1.5) = 10^-4

Best model: highest likelihood ℓ(w), here ŵ = (w0=1, w1=0.5, w2=-1.5).
Find the best model coefficients w with gradient ascent!

[Plot: each candidate w defines a different decision boundary in the (#awesome, #awful) plane; the one with the highest likelihood is chosen.]
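A minimal sketch of how the data likelihood ℓ(w) could be computed for candidate coefficients, using the small (#awesome, #awful, sentiment) training table from earlier; the helper names are mine, and the actual quality metric and gradient ascent are derived later in the course:

```python
# Sketch: data likelihood l(w) = product over i of P(y_i | x_i, w) under a logistic model.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def likelihood(w, X, y):
    """w = (w0, w1, w2); rows of X are (#awesome, #awful); y entries are +1 or -1."""
    scores = w[0] + X @ w[1:]
    p_plus = sigmoid(scores)                           # P(y=+1 | x, w)
    p_observed = np.where(y == +1, p_plus, 1.0 - p_plus)
    return float(np.prod(p_observed))

# Training table from earlier: x[1] = #awesome, x[2] = #awful, y = sentiment.
X = np.array([[2, 1], [0, 2], [3, 3], [4, 1]])
y = np.array([+1, -1, -1, +1])

for w in [np.array([0.0, 1.0, -1.5]),
          np.array([1.0, 1.0, -1.5]),
          np.array([1.0, 0.5, -1.5])]:
    print(w, likelihood(w, X, y))
```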
Encoding categorical inputs



Categorical inputs
•  Numeric inputs:
  -  #awesome, age, salary, …
  -  Intuitive when multiplied by a coefficient
     •  e.g., 1.5 #awesome
•  Categorical inputs:
  -  Gender (Male, Female, ...), Country of birth (Argentina, Brazil, USA, ...), Zipcode (10005, 98195, ...)
  -  A zipcode is a numeric value, but should be interpreted as a category (98195 is not about 9x larger than 10005)
  -  How do we multiply a category by a coefficient???

Must convert categorical inputs into numeric features
Encoding categories as numeric features

1-hot encoding:
  x = Country of birth (Argentina, Brazil, USA, ..., Zimbabwe): 196 categories → 196 features h1(x), h2(x), …, h195(x), h196(x).
  For example, for x = Brazil exactly one of these features is 1 and all the others are 0.

Bag of words:
  x = Restaurant review (text data): 10,000 words in vocabulary → 10,000 features h1(x), h2(x), …, h9999(x), h10000(x), the count of each vocabulary word in the review.
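A minimal Python sketch of both encodings (the category list and vocabulary are tiny illustrative stand-ins for the 196 countries and the 10,000-word vocabulary):

```python
# Sketch: 1-hot encoding for categories and bag-of-words counts for text.
from collections import Counter

countries = ["Argentina", "Brazil", "USA", "Zimbabwe"]          # stand-in for 196 categories
vocabulary = ["awesome", "awful", "sushi", "service", "good"]   # stand-in for 10,000 words

def one_hot(country: str) -> list:
    """Exactly one feature is 1, all others are 0."""
    return [1 if country == c else 0 for c in countries]

def bag_of_words(review: str) -> list:
    """One count feature per vocabulary word."""
    counts = Counter(review.lower().split())
    return [counts[w] for w in vocabulary]

print(one_hot("Brazil"))                                 # [0, 1, 0, 0]
print(bag_of_words("awesome sushi but awful service"))   # [1, 1, 1, 1, 0]
```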
Multiclass classification
using 1 versus all



Multiclass classification

Input: x (e.g., image pixels) → Output: y (object in image)
Multiclass classification formulation

•  C possible classes:
  - y can be 1, 2, …, C
•  N datapoints:

Data point  x[1]  x[2]  y
x1,y1       2     1     y1
x2,y2       0     2     y2
x3,y3       3     3     y3
x4,y4       4     1     y4
(each yi is one of the C classes)

Learn: P̂(y=c|x) for each class c = 1, …, C
1 versus all:
Estimate P̂(y=c|x) for one class c using a 2-class model

  +1 class: points with yi = c
  -1 class: points with yi equal to any of the other classes

Train a 2-class classifier: P̂(y=+1|x)
Predict: P̂(y=c|xi) = P̂(y=+1|xi)


1 versus all: simple multiclass classification
using C 2-class models

Train one 2-class model per class c, giving an estimate P̂(y=c|xi) for each class.


Multiclass training

P̂c(y=+1|x) = estimate of the 1-vs-all model for class c

Predict most likely class, given input xi:

max_prob = 0; ŷ = 0
For c = 1, …, C:
  If P̂c(y=+1|xi) > max_prob:
    ŷ = c
    max_prob = P̂c(y=+1|xi)
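A minimal Python sketch of 1-versus-all prediction (the per-class coefficient vectors are illustrative stand-ins; in practice each would come from a trained 2-class logistic regression classifier):

```python
# Sketch: 1-versus-all prediction over C trained 2-class models.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# One coefficient vector per class (illustrative values), each over h(x) = (1, x[1], x[2]).
class_weights = {
    1: np.array([ 0.0,  1.0, -1.0]),
    2: np.array([-1.0, -0.5,  1.0]),
    3: np.array([ 0.5, -1.0, -1.0]),
}

def predict_class(x: np.ndarray) -> int:
    """Return the class whose 1-vs-all model gives the highest P(y=+1|x)."""
    h = np.concatenate(([1.0], x))
    max_prob, y_hat = 0.0, 0
    for c, w in class_weights.items():
        p = sigmoid(w @ h)          # P_c(y=+1 | x)
        if p > max_prob:
            max_prob, y_hat = p, c
    return y_hat

print(predict_class(np.array([2.0, 1.0])))   # -> 1 for these illustrative weights
```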
Summary of logistic
regression classifier




P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

[Block diagram, as before] Feature extraction produces h(x); the ML model outputs this probability; the ML algorithm learns ŵ from y and the quality metric.
What you can do now…
•  Describe decision boundaries and linear classifiers
•  Use class probability to express degree of confidence in prediction
•  Define a logistic regression model
•  Interpret logistic regression outputs as class probabilities
•  Describe impact of coefficient values on logistic regression output
•  Use 1-hot encoding to represent categorical inputs
•  Perform multiclass classification using the 1-versus-all approach
