
CS7015 (Deep Learning) : Lecture 10

Learning Vectorial Representations Of Words

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Acknowledgments

‘word2vec Parameter Learning Explained’ by Xin Rong

‘word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method’ by Yoav Goldberg and Omer Levy

Sebastian Ruder’s blogs on word embeddings (Blog1, Blog2, Blog3)

Ali Ghodsi’s video lectures on Word2Vec
Module 10.1: One-hot representations of words

Let us start with a very simple motivation for why we are interested in
vectorial representations of words

Suppose we are given an input stream of words (sentence, document, etc.) and
we are interested in learning some function of it (say, ŷ = sentiments(words))

Say, we employ a machine learning algorithm (some mathematical model) for
learning such a function (ŷ = f(x))

We first need a way of converting the input stream (or each word in the
stream) to a vector x (a mathematical quantity)

Example input: This is by far AAMIR KHAN’s best one. Finest casting and
terrific acting by all.
Model output: [5.7, 1.2, 2.3, -10.2, 4.5, ..., 11.9, 20.1, -0.5, 40.7]
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

Given a corpus, consider the set V of all unique words across all input
streams (i.e., all sentences or documents)

V is called the vocabulary of the corpus

We need a representation for every word in V

One very simple way of doing this is to use one-hot vectors of size |V|

The representation of the i-th word will have a 1 in the i-th position and a
0 in the remaining |V| − 1 positions

V = [human, machine, interface, for, computer, applications, user, opinion,
of, system, response, time, management, engineering, improved]

machine: 0 1 0 ... 0 0 0
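As a concrete illustration (not from the slides), here is a minimal Python
sketch that builds the vocabulary of this toy corpus and produces one-hot
vectors; the helper name one_hot is just for the example:

    import numpy as np

    corpus = [
        "human machine interface for computer applications",
        "user opinion of computer system response time",
        "user interface management system",
        "system engineering for improved response time",
    ]

    # the vocabulary V: all unique words across the input streams
    vocab = sorted({w for sentence in corpus for w in sentence.split()})
    word2id = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """|V|-dimensional vector with a 1 in the word's position, 0 elsewhere."""
        x = np.zeros(len(vocab))
        x[word2id[word]] = 1.0
        return x

    print(len(vocab))           # |V| for this toy corpus
    print(one_hot("machine"))   # the one-hot representation of 'machine'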
Problems:

cat:   0 0 0 0 0 1 0
dog:   0 1 0 0 0 0 0
truck: 0 0 0 1 0 0 0

euclid dist(cat, dog) = √2
euclid dist(dog, truck) = √2
cosine sim(cat, dog) = 0
cosine sim(dog, truck) = 0

V tends to be very large (for example, 50K for PTB, 13M for the Google 1T
corpus)

These representations do not capture any notion of similarity

Ideally, we would want the representations of cat and dog (both domestic
animals) to be closer to each other than the representations of cat and
truck

However, with 1-hot representations, the Euclidean distance between any two
words in the vocabulary is √2

And the cosine similarity between any two words in the vocabulary is 0
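A small numerical check of these two observations (any pair of distinct
one-hot vectors will do; here they are simply rows of an identity matrix):

    import numpy as np

    V = 7
    cat, dog, truck = np.eye(V)[5], np.eye(V)[1], np.eye(V)[3]

    def euclid_dist(a, b):
        return np.linalg.norm(a - b)

    def cosine_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(euclid_dist(cat, dog), euclid_dist(dog, truck))  # both sqrt(2) ≈ 1.414
    print(cosine_sim(cat, dog), cosine_sim(dog, truck))    # both 0.0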
Module 10.2: Distributed Representations of words

You shall know a word by the company it keeps - Firth, J. R. 1957:11

Distributional similarity based representations

This leads us to the idea of a co-occurrence matrix

A bank is a financial institution that accepts deposits from the public and
creates credit.

The idea is to use the accompanying words (financial, deposits, credit) to
represent bank
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

A co-occurrence matrix is a terms × terms matrix which captures the number
of times a term appears in the context of another term

The context is defined as a window of k words around the terms

Let us build a co-occurrence matrix for this toy corpus with k = 2

This is also known as a word × context matrix

You could choose the set of words and contexts to be same or different

Each row (column) of the co-occurrence matrix gives a vectorial
representation of the corresponding word (context)

            human  machine  system  for  ...  user
  human       0       1       0      1   ...    0
  machine     1       0       0      1   ...    0
  system      0       0       0      1   ...    2
  for         1       1       1      0   ...    0
  ...
  user        0       0       2      0   ...    0

Co-occurrence Matrix
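A minimal sketch (not from the lecture) of how such a word × context matrix
could be built in Python for the toy corpus with a window of k = 2; the exact
counts depend on conventions such as sentence boundaries and window handling,
so the numbers on the slide may differ slightly from what this produces:

    import numpy as np

    corpus = [
        "human machine interface for computer applications",
        "user opinion of computer system response time",
        "user interface management system",
        "system engineering for improved response time",
    ]
    k = 2  # context window size

    vocab = sorted({w for s in corpus for w in s.lower().split()})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))

    for s in corpus:
        words = s.lower().split()
        for i, w in enumerate(words):
            # every word within k positions of w counts as a co-occurrence
            for j in range(max(0, i - k), min(len(words), i + k + 1)):
                if j != i:
                    X[idx[w], idx[words[j]]] += 1

    print(X[idx["human"], idx["machine"]])  # count for the (human, machine) pair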
Some (fixable) problems

Stop words (a, the, for, etc.) are very frequent → these counts will be very
high

Solution 1: Ignore very frequent words

Solution 2: Use a threshold t (say, t = 100)

Xij = min(count(wi, cj), t),

where w is a word and c is a context.

            human  machine  system  for  ...  user
  human       0       1       0      x   ...    0
  machine     1       0       0      x   ...    0
  system      0       0       0      x   ...    2
  for         x       x       x      x   ...    x
  ...
  user        0       0       2      x   ...    0

(x marks the entries involving the very frequent word "for", which Solutions
1 and 2 target)
Some (fixable) problems

Solution 3: Instead of count(w, c) use PMI(w, c)

PMI(w, c) = log [ p(c|w) / p(c) ]
          = log [ (count(w, c) × N) / (count(c) × count(w)) ]

N is the total number of words

If count(w, c) = 0, PMI(w, c) = −∞

Instead use,

PMI0(w, c) = PMI(w, c)   if count(w, c) > 0
           = 0           otherwise

or

PPMI(w, c) = PMI(w, c)   if PMI(w, c) > 0
           = 0           otherwise

            human   machine  system   for   ...   user
  human       0      2.944     0      2.25  ...     0
  machine   2.944      0       0      2.25  ...     0
  system      0        0       0      1.15  ...   1.84
  for       2.25     2.25     1.15     0    ...     0
  ...
  user        0        0      1.84     0    ...     0
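A sketch of turning raw co-occurrence counts into PPMI values following the
formula above (assuming X is the count matrix from the earlier sketch; here N
is taken as the total of all counts, which is one common convention):

    import numpy as np

    def ppmi(X):
        """Positive PMI matrix computed from a co-occurrence count matrix X."""
        N = X.sum()
        row = X.sum(axis=1, keepdims=True)   # count(w)
        col = X.sum(axis=0, keepdims=True)   # count(c)
        with np.errstate(divide="ignore"):
            # PMI(w, c) = log( count(w, c) * N / (count(w) * count(c)) )
            pmi = np.log((X * N) / (row * col))
        return np.maximum(pmi, 0.0)          # clip negatives (and -inf) to 0

    X_ppmi = ppmi(X)   # X is the count matrix built in the earlier sketch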
Some (severe) problems

Very high dimensional (|V|)

Very sparse

Grows with the size of the vocabulary

Solution: Use dimensionality reduction (SVD)
Module 10.3: SVD for learning word representations

Singular Value Decomposition gives a rank k approximation of the original
matrix:

X = XPPMI (m × n) = U (m × k) · Σ (k × k) · V^T (k × n)

where U = [u1 · · · uk], Σ = diag(σ1, . . . , σk) and V^T = [v1^T ; . . . ; vk^T]

XPPMI (simplifying notation to X) is the co-occurrence matrix with PPMI
values

SVD gives the best rank-k approximation of the original data (X)

Discovers latent semantics in the corpus (We will soon examine this with the
help of an example)
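A sketch of the rank-k truncation with NumPy (assuming X_ppmi is the PPMI
matrix from the earlier sketch; np.linalg.svd returns singular values in
decreasing order, so keeping the first k triplets gives the rank-k
approximation):

    import numpy as np

    def truncated_svd(X, k):
        """Return the top-k singular triplets of X."""
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k], S[:k], Vt[:k, :]

    U, S, Vt = truncated_svd(X_ppmi, k=2)   # X_ppmi from the PPMI sketch
    X_hat = U @ np.diag(S) @ Vt             # best rank-2 approximation of X_ppmi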
Notice that the product can be written as a sum of k rank-1 matrices:

X = σ1 u1 v1^T + σ2 u2 v2^T + · · · + σk uk vk^T

Each σi ui vi^T ∈ R^(m×n) because it is a product of an m × 1 vector with a
1 × n vector

If we truncate the sum at σ1 u1 v1^T then we get the best rank-1
approximation of X (By the SVD theorem! But what does this mean? We will see
on the next slide)

If we truncate the sum at σ1 u1 v1^T + σ2 u2 v2^T then we get the best
rank-2 approximation of X, and so on
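A quick numerical check (using U, S, Vt from the previous sketch) that the
truncated product and the sum of rank-1 matrices are the same thing:

    import numpy as np

    # X_hat as a sum of k rank-1 matrices: sigma_i * u_i * v_i^T
    X_hat_sum = sum(S[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(S)))

    assert np.allclose(X_hat_sum, U @ np.diag(S) @ Vt)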
What do we mean by approximation here?

Notice that X has m × n entries

When we use the rank-1 approximation we are using only m + n + 1 entries to
reconstruct it [u ∈ R^m, v ∈ R^n, σ ∈ R^1]

But the SVD theorem tells us that u1, v1 and σ1 store the most important
information in X (akin to the principal components in X)

Each subsequent term (σ2 u2 v2^T, σ3 u3 v3^T, . . . ) stores less and less
important information
As an analogy consider the case when we are using 8 bits to represent
colors:

very light green: 0 0 0 1 1 0 1 1
light green:      0 0 1 0 1 0 1 1
dark green:       0 1 0 0 1 0 1 1
very dark green:  1 0 0 0 1 0 1 1

The representation of very light, light, dark and very dark green would look
different

But now what if we were asked to compress this into 4 bits? (akin to
compressing m × n values into m + n + 1 values on the previous slide)

We will retain the most important 4 bits and the previously (slightly)
latent similarity between the colors now becomes very obvious

Something similar is guaranteed by SVD (retain the most important
information and discover the latent similarities between words)
Co-occurrence Matrix (X):

            human   machine  system   for   ...   user
  human       0      2.944     0      2.25  ...     0
  machine   2.944      0       0      2.25  ...     0
  system      0        0       0      1.15  ...   1.84
  for       2.25     2.25     1.15     0    ...     0
  ...
  user        0        0      1.84     0    ...     0

Low rank X̂:

            human   machine  system    for    ...   user
  human      2.01     2.01    0.23     2.14   ...   0.43
  machine    2.01     2.01    0.23     2.14   ...   0.43
  system     0.23     0.23    1.17     0.96   ...   1.29
  for        2.14     2.14    0.96     1.87   ...  -0.13
  ...
  user       0.43     0.43    1.29    -0.13   ...   1.71

Notice that after low rank reconstruction with SVD, the latent co-occurrence
between {system, machine} and {human, user} has become visible
Recall that earlier each row of the original matrix X served as the
representation of a word

Then XX^T is a matrix whose ij-th entry is the dot product between the
representation of word i (X[i :]) and word j (X[j :])

For example,

    [ 1 2 3 ]   [ 1 2 1 ]   [ .  .  22 ]
    [ 2 1 0 ] × [ 2 1 3 ] = [ .  .   . ]
    [ 1 3 5 ]   [ 3 0 5 ]   [ .  .   . ]
        X          X^T         XX^T

X =
            human   machine  system   for   ...   user
  human       0      2.944     0      2.25  ...     0
  machine   2.944      0       0      2.25  ...     0
  system      0        0       0      1.15  ...   1.84
  for       2.25     2.25     1.15     0    ...     0
  ...
  user        0        0      1.84     0    ...     0

XX^T =
            human   machine  system    for    ...   user
  human      32.5     23.9    7.78    20.25   ...   7.01
  machine    23.9     32.5    7.78    20.25   ...   7.01
  system     7.78     7.78     0      17.65   ...  21.84
  for       20.25    20.25   17.65    36.3    ...   11.8
  ...
  user       7.01     7.01   21.84    11.8    ...   28.3

The ij-th entry of XX^T thus (roughly) captures the cosine similarity
between word i and word j

cosine sim(human, user) = 0.21
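A sketch of these computations (assuming X_ppmi and idx from the earlier
sketches; the specific numbers above come from the lecture's own matrix, so
the toy-corpus values will differ):

    import numpy as np

    XXt = X_ppmi @ X_ppmi.T     # (i, j) entry = dot product of rows i and j

    def cosine_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    i, j = idx["human"], idx["user"]
    print(XXt[i, j], cosine_sim(X_ppmi[i], X_ppmi[j]))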
Once we do an SVD, what is a good choice for the representation of word i?

Obviously, taking the i-th row of the reconstructed matrix does not make
sense because it is still high dimensional

But we saw that the reconstructed matrix X̂ = UΣV^T discovers latent
semantics and its word representations are more meaningful

Wishlist: We would want representations of words (i, j) to be of smaller
dimensions but still have the same similarity (dot product) as the
corresponding rows of X̂

X̂ =
            human   machine  system    for    ...   user
  human      2.01     2.01    0.23     2.14   ...   0.43
  machine    2.01     2.01    0.23     2.14   ...   0.43
  system     0.23     0.23    1.17     0.96   ...   1.29
  for        2.14     2.14    0.96     1.87   ...  -0.13
  ...
  user       0.43     0.43    1.29    -0.13   ...   1.71

X̂X̂^T =
            human   machine  system    for    ...   user
  human      25.4     25.4     7.6     21.9   ...   6.84
  machine    25.4     25.4     7.6     21.9   ...   6.84
  system      7.6      7.6    24.8    18.03   ...   20.6
  for        21.9     21.9    0.96     24.6   ...  15.32
  ...
  user       6.84     6.84    20.6    15.32   ...  17.11

cosine sim(human, user) = 0.33
Notice that the dot product between the rows of the matrix Wword = UΣ is the
same as the dot product between the rows of X̂:

X̂X̂^T = (UΣV^T)(UΣV^T)^T
      = (UΣV^T)(VΣU^T)
      = UΣΣ^T U^T          (∵ V^T V = I)
      = UΣ(UΣ)^T = Wword Wword^T

Conventionally,

Wword = UΣ ∈ R^(m×k)

is taken as the representation of the m words in the vocabulary, and

Wcontext = V

is taken as the representation of the context words
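A quick numerical check of this identity (with U, S, Vt from the truncated
SVD sketch): the k-dimensional rows of Wword = UΣ reproduce exactly the dot
products between the rows of X̂.

    import numpy as np

    X_hat = U @ np.diag(S) @ Vt      # rank-k reconstruction
    W_word = U @ np.diag(S)          # m x k word representations
    W_context = Vt.T                 # n x k context representations

    # dot products between rows of W_word equal those between rows of X_hat
    assert np.allclose(W_word @ W_word.T, X_hat @ X_hat.T)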
Module 10.4: Continuous bag of words model

The methods that we have seen so far are called count based models because
they use the co-occurrence counts of words

We will now see methods which directly learn word representations (these are
called (direct) prediction based models)
The story ahead ...
Continuous bag of words model
Skip gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!
Consider this Task: Predict the n-th word given the previous n-1 words

Example: he sat on a chair

Training data: All n-word windows in your corpus

Training data for this task is easily available (take all n-word windows
from the whole of Wikipedia)

For ease of illustration, we will first focus on the case when n = 2 (i.e.,
predict the second word based on the first word)

Some sample 4 word windows from a corpus:

Sometime in the 21st century, Joseph Cooper, a widowed former engineer and
former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and
daughter Murphy. It is a post-truth society (Cooper is reprimanded for
telling Murphy that the Apollo missions did indeed happen) and a series of
crop blights threatens humanity’s survival. Murphy believes her bedroom is
haunted by a poltergeist. When a pattern is created out of dust on the
floor, Cooper realizes that gravity is behind its formation, not a ”ghost”.
He interprets the pattern as a set of geographic coordinates formed into
binary code. Cooper and Murphy follow the coordinates to a secret NASA
facility, where they are met by Cooper’s former professor, Dr. Brand.
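As an illustrative sketch (the variable names are just for the example),
extracting the (context, target) training pairs for the n = 2 case from a
sentence:

    text = "he sat on a chair"
    words = text.split()

    # n = 2: predict the second word of every 2-word window from the first
    training_pairs = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
    print(training_pairs)
    # [('he', 'sat'), ('sat', 'on'), ('on', 'a'), ('a', 'chair')]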
We will now try to answer these two questions:

How do you model this task?

What is the connection between this task and learning word representations?
We will model this problem using a feedforward neural network

Input: One-hot representation of the context word

Output: There are |V| words (classes) possible and we want to predict a
probability distribution over these |V| classes (multi-class classification
problem)

Parameters: Wcontext ∈ R^(k×|V|) and Wword ∈ R^(k×|V|) (we are assuming that
the set of words and context words is the same: each of size |V|)

[Network figure: the input x ∈ R^(|V|) is the one-hot vector for the context
word (sat); the hidden layer h ∈ R^k is computed using Wcontext ∈ R^(k×|V|);
the output layer uses Wword ∈ R^(k×|V|) to produce P(chair|sat), P(man|sat),
P(on|sat), P(he|sat), ...]
What is the product Wcontext x given that x is a one-hot vector?

It is simply the i-th column of Wcontext:

    [ -1  0.5   2 ] [0]   [ 0.5]
    [  3   -1  -2 ] [1] = [-1  ]
    [ -2  1.7   3 ] [0]   [ 1.7]

So when the i-th word is present, the i-th element in the one-hot vector is
ON and the i-th column of Wcontext gets selected

In other words, there is a one-to-one correspondence between the words and
the columns of Wcontext

More specifically, we can treat the i-th column of Wcontext as the
representation of context i
How do we obtain P(on|sat)? For this multi-class classification problem what
is an appropriate output function? (softmax)

P(on|sat) = e^((Wword^T h)[i]) / Σ_j e^((Wword^T h)[j])

Therefore, P(on|sat) is proportional to the dot product between the j-th
column of Wcontext and the i-th column of Wword

P(word = i|sat) thus depends on the i-th column of Wword

We thus treat the i-th column of Wword as the representation of word i

Hope you see an analogy with SVD! (there we had a different way of learning
Wcontext and Wword, but we saw that the i-th column of Wword corresponded to
the representation of the i-th word)

Having understood the interpretation of Wcontext and Wword, our aim now is
to learn these parameters
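A minimal sketch of this forward pass with a toy vocabulary and random
parameters (just to make the shapes and the column selection concrete; this
is not the lecture's training code):

    import numpy as np

    vocab = ["he", "sat", "on", "a", "chair", "man"]
    V, k = len(vocab), 4
    rng = np.random.default_rng(0)

    W_context = rng.normal(size=(k, V))   # column i = representation of context i
    W_word = rng.normal(size=(k, V))      # column i = representation of word i

    x = np.zeros(V)
    x[vocab.index("sat")] = 1.0           # one-hot input for the context word

    h = W_context @ x                     # picks out the column of W_context for 'sat'
    scores = W_word.T @ h                 # dot product of h with every column of W_word
    y_hat = np.exp(scores) / np.exp(scores).sum()   # softmax over the |V| words

    print(y_hat[vocab.index("on")])       # P(on | sat) under the current parameters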
[Figure: the same network, with the one-hot input x ∈ R^{|V|} for the context word 'sat', hidden layer h ∈ R^k, parameters W_context, W_word ∈ R^{k×|V|}, and softmax outputs ŷ = [P(chair|sat), P(man|sat), P(on|sat), P(he|sat), . . .]]

We denote the context word (sat) by the index c and the correct output word (on) by the index w
For this multi-class classification problem, what is an appropriate output function (ŷ = f(x))? softmax
What is an appropriate loss function? cross entropy

    L(θ) = −log ŷ_w = −log P(w|c)
    h = W_context · x_c = u_c
    ŷ_w = exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′})

u_c is the column of W_context corresponding to the context word c and v_w is the column of W_word corresponding to the word w
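To make the setup concrete, here is a minimal NumPy sketch of this single-context model (a toy illustration with made-up sizes and indices, not the lecture's code): it builds W_context and W_word, computes h = u_c, applies a softmax over the scores u_c · v_w, and returns the cross-entropy loss −log ŷ_w.

    import numpy as np

    # A toy instantiation of the single-context model above (illustrative sizes).
    V, k = 10, 4                                     # vocabulary size, embedding dim
    rng = np.random.default_rng(0)
    W_context = rng.normal(scale=0.1, size=(k, V))   # u_c = W_context[:, c]
    W_word    = rng.normal(scale=0.1, size=(k, V))   # v_w = W_word[:, w]

    def forward_loss(c, w):
        """Return the cross-entropy loss -log P(w|c) and the softmax vector y_hat."""
        u_c = W_context[:, c]                        # h = W_context . x_c = u_c
        scores = W_word.T @ u_c                      # scores[i] = u_c . v_i
        scores = scores - scores.max()               # subtract max for numerical stability
        y_hat = np.exp(scores) / np.exp(scores).sum()
        return -np.log(y_hat[w]), y_hat

    loss, y_hat = forward_loss(c=3, w=7)             # hypothetical indices for 'sat' and 'on'
    print(loss)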
How do we train this simple feedforward neural network? backpropagation
Let us consider one input-output pair (c, w) and see the update rule for v_w
    L(θ) = −log ŷ_w
         = −log [ exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′}) ]
         = −( u_c · v_w − log Σ_{w′∈V} exp(u_c · v_{w′}) )

    ∇_{v_w} = ∂L(θ)/∂v_w
            = −( u_c − [ exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′}) ] u_c )
            = −u_c (1 − ŷ_w)

And the update rule would be

    v_w = v_w − η ∇_{v_w}
        = v_w + η u_c (1 − ŷ_w)
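As a sanity check on this derivation, a self-contained sketch of the single-pair update (again a toy setup; sizes, indices and the learning rate are made up):

    import numpy as np

    # Toy setup (same shapes as in the earlier sketch).
    V, k, eta = 10, 4, 0.1
    rng = np.random.default_rng(0)
    W_context = rng.normal(scale=0.1, size=(k, V))
    W_word    = rng.normal(scale=0.1, size=(k, V))

    def sgd_step(c, w):
        """Apply the update v_w = v_w + eta * u_c * (1 - y_hat_w) for one pair (c, w)."""
        u_c = W_context[:, c]
        scores = W_word.T @ u_c
        scores = scores - scores.max()
        y_hat = np.exp(scores) / np.exp(scores).sum()
        W_word[:, w] += eta * u_c * (1.0 - y_hat[w])

    sgd_step(c=3, w=7)

Repeating this over all (context, word) pairs is stochastic gradient descent on the loss above; a full implementation would of course also update u_c and the other columns of W_word (whose gradients are +ŷ_{w′} u_c for w′ ≠ w).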
This update rule has a nice interpretation

    v_w = v_w + η u_c (1 − ŷ_w)

If ŷ_w → 1 then we are already predicting the right word and v_w will not be updated
If ŷ_w → 0 then v_w gets updated by adding a fraction of u_c to it
This increases the cosine similarity between v_w and u_c (How? Refer to slide 38 of Lecture 2)
The training objective ensures that the cosine similarity between the word (v_w) and context word (u_c) is maximized
What happens to the representations of two words w and w′ which tend to appear in similar contexts (c)?
The training ensures that both v_w and v_{w′} have a high cosine similarity with u_c, and hence transitively (intuitively) ensures that v_w and v_{w′} have a high cosine similarity with each other
This is only an intuition (reasonable)
Haven't come across a formal proof for this!
[Figure: the same network with a window of two context words ('he', 'sat'): input x ∈ R^{2|V|}, stacked parameters [W_context, W_context] ∈ R^{k×2|V|}, hidden layer h ∈ R^k, W_word ∈ R^{k×|V|}, and softmax outputs P(man|sat, he), P(on|sat, he), P(he|sat, he), P(chair|sat, he), . . .]

In practice, instead of a window size of 1, it is common to use a window size of d
So now,

    h = Σ_{i=1}^{d−1} u_{c_i}

[W_context, W_context] just means that we are stacking 2 copies of the W_context matrix
For example, with k = 3 and |V| = 3, and a two-hot input with 1s at the positions of 'sat' and 'he':

    [ −1   0.5    2   −1   0.5    2 ]                          [ 2.5 ]
    [  3  −1.0   −2    3  −1.0   −2 ]  ·  [0 1 0 0 0 1]^T  =   [−3.0 ]
    [ −2   1.7    3   −2   1.7    3 ]                          [ 4.7 ]

The resultant product would simply be the sum of the columns corresponding to 'sat' and 'he'
Of course, in practice we will not do this expensive matrix multiplication
If 'he' is the i-th word in the vocabulary and 'sat' is the j-th word, then we will simply access columns W_context[:, i] and W_context[:, j] and add them
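A small sketch of this column-lookup trick (toy sizes; the word-to-index mapping is hypothetical):

    import numpy as np

    V, k = 10, 3
    rng = np.random.default_rng(0)
    W_context = rng.normal(size=(k, V))

    def hidden(context_indices):
        """h = sum of the W_context columns of the context words (no |V|-sized matmul)."""
        return W_context[:, context_indices].sum(axis=1)

    h = hidden([2, 5])    # e.g. 'he' at index 2 and 'sat' at index 5
    print(h)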
Now what happens during backpropagation?
Recall that

    h = Σ_{i=1}^{d−1} u_{c_i}

and

    P(on | sat, he) = exp((W_word h)[k]) / Σ_j exp((W_word h)[j])

where 'k' is the index of the word 'on'
The loss function depends on {W_word, u_{c_1}, u_{c_2}, . . . , u_{c_{d−1}}}, and all these parameters will get updated during backpropagation
Try deriving the update rule for v_w now and see how it differs from the one we derived before
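If you attempt the derivation, one way to check it is against a numerically computed gradient. A small sketch with toy sizes and arbitrarily chosen indices (the derivation gives the same −u_c(1 − ŷ_w) form as before, with h in place of u_c):

    import numpy as np

    # Numerical check of the hand-derived gradient for the window-d model (toy sizes).
    V, k = 8, 4
    rng = np.random.default_rng(1)
    W_context = rng.normal(scale=0.1, size=(k, V))
    W_word    = rng.normal(scale=0.1, size=(k, V))
    ctx, w = [1, 5], 3                               # hypothetical context / target indices

    def loss(Wd):
        h = W_context[:, ctx].sum(axis=1)            # h = sum of u_{c_i}
        s = Wd.T @ h
        s = s - s.max()
        y_hat = np.exp(s) / np.exp(s).sum()
        return -np.log(y_hat[w])

    # Analytical gradient w.r.t. v_w: -h * (1 - y_hat_w)
    h = W_context[:, ctx].sum(axis=1)
    s = W_word.T @ h
    s = s - s.max()
    y_hat = np.exp(s) / np.exp(s).sum()
    grad_analytic = -h * (1.0 - y_hat[w])

    # Finite-difference estimate for one coordinate of v_w
    eps = 1e-6
    Wp, Wm = W_word.copy(), W_word.copy()
    Wp[0, w] += eps
    Wm[0, w] -= eps
    grad_numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
    print(grad_analytic[0], grad_numeric)            # the two numbers should match closely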
Some problems:
Notice that the softmax function at the output is computationally very expensive

    ŷ_w = exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′})

The denominator requires a summation over all words in the vocabulary
We will revisit this issue soon
Module 10.5: Skip-gram model

The model that we just saw is called the continuous bag of words (CBOW) model (it predicts an output word given a bag of context words)
We will now see the skip-gram model (which predicts context words given an input word)
[Figure: the skip-gram network: one-hot input x ∈ R^{|V|} for the word 'on', W_word ∈ R^{k×|V|}, hidden layer h ∈ R^k, W_context ∈ R^{k×|V|}, and outputs for the context words 'he', 'sat', 'a', 'chair']

Notice that the role of context and word has changed now
In the simple case when there is only one context word, we will arrive at the same update rule for u_c as we did for v_w earlier
Notice that even when we have multiple context words the loss function would just be a summation of many cross-entropy errors

    L(θ) = −Σ_{i=1}^{d−1} log ŷ_{w_i}

Typically, we predict context words on both sides of the given word
Some problems (same as for the bag of words model): the softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
Let D be the set of all correct (w, c) pairs in the corpus

    D = [(sat, on), (sat, a), (sat, chair), (on, a), (on, chair), (a, chair), (on, sat), (a, sat), (chair, sat), (a, on), (chair, on), (chair, a)]

Let D′ be the set of all incorrect (w, r) pairs in the corpus

    D′ = [(sat, oxygen), (sat, magic), (chair, sad), (chair, walking)]

D′ can be constructed by randomly sampling a context word r which has never appeared with w and creating a pair (w, r)
As before, let v_w be the representation of the word w and u_c be the representation of the context word c
[Figure: a single sigmoid unit that takes u_c and v_w and outputs σ(u_c · v_w) = P(z = 1|w, c)]

For a given (w, c) ∈ D we are interested in maximizing

    p(z = 1|w, c)

Let us model this probability by

    p(z = 1|w, c) = σ(u_c^T v_w) = 1 / (1 + e^{−u_c^T v_w})

Considering all (w, c) ∈ D, we are interested in

    maximize_θ Π_{(w,c)∈D} p(z = 1|w, c)

where θ is the word representation (v_w) and context representation (u_c) for all the words in our corpus
For (w, r) ∈ D′ we are interested in maximizing

    p(z = 0|w, r)

Again we model this as

    p(z = 0|w, r) = 1 − σ(u_r^T v_w)
                  = 1 − 1 / (1 + e^{−u_r^T v_w})
                  = 1 / (1 + e^{u_r^T v_w})
                  = σ(−u_r^T v_w)

Considering all (w, r) ∈ D′, we are interested in

    maximize_θ Π_{(w,r)∈D′} p(z = 0|w, r)
Combining the two we get:

    maximize_θ Π_{(w,c)∈D} p(z = 1|w, c) Π_{(w,r)∈D′} p(z = 0|w, r)
    = maximize_θ Π_{(w,c)∈D} p(z = 1|w, c) Π_{(w,r)∈D′} (1 − p(z = 1|w, r))
    = maximize_θ Σ_{(w,c)∈D} log p(z = 1|w, c) + Σ_{(w,r)∈D′} log(1 − p(z = 1|w, r))
    = maximize_θ Σ_{(w,c)∈D} log [ 1 / (1 + e^{−u_c · v_w}) ] + Σ_{(w,r)∈D′} log [ 1 / (1 + e^{u_r · v_w}) ]
    = maximize_θ Σ_{(w,c)∈D} log σ(u_c · v_w) + Σ_{(w,r)∈D′} log σ(−u_r · v_w)

where σ(x) = 1 / (1 + e^{−x})
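A compact sketch of this objective for a single (w, c) pair with a handful of sampled negatives (toy sizes and indices; an illustration of the formula, not the word2vec implementation):

    import numpy as np

    V, k = 10, 4
    rng = np.random.default_rng(0)
    W_word    = rng.normal(scale=0.1, size=(k, V))   # v_w = W_word[:, w]
    W_context = rng.normal(scale=0.1, size=(k, V))   # u_c = W_context[:, c]

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_loss(w, c, negatives):
        """Negative of: log sigma(u_c . v_w) + sum_r log sigma(-u_r . v_w)."""
        v_w = W_word[:, w]
        pos = np.log(sigmoid(W_context[:, c] @ v_w))
        neg = np.log(sigmoid(-(W_context[:, negatives].T @ v_w))).sum()
        return -(pos + neg)

    print(sgns_loss(w=3, c=7, negatives=[1, 4, 9]))  # indices are hypothetical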
In the original paper, Mikolov et al. sample k negative (w, r) pairs for every positive (w, c) pair
The size of D′ is thus k times the size of D
The random context word r is drawn from a modified unigram distribution:

    r ∼ count(r)^{3/4} / N

where N = total number of words in the corpus
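A small sketch of drawing negatives from this damped unigram distribution (the counts below are hypothetical; a real implementation precomputes a sampling table over the whole vocabulary):

    import numpy as np

    counts = {"the": 500, "cat": 20, "sat": 15, "oxygen": 2, "abracadabra": 1}
    words = list(counts)
    p = np.array([counts[w] ** 0.75 for w in words])
    p = p / p.sum()                         # normalise to a probability distribution

    rng = np.random.default_rng(0)
    print(rng.choice(words, size=5, p=p))   # frequent words are damped, rare ones boosted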
Module 10.6: Contrastive estimation

(Recap) The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
Positive: He sat on a chair        Negative: He sat abracadabra a chair

[Figure: two copies of a small scoring network, one per sentence; each takes the pair of vectors (v_c, v_w), i.e. ('sat', 'on') for the positive example and ('sat', 'abracadabra') for the negative one, through a hidden layer with W_h ∈ R^{2d×h} and an output layer W_out ∈ R^{h×1}, producing scalar scores s and s_r respectively]

We would like the score s of the positive example to be greater than the score s_r of the negative example
Okay, so let us try to maximize s − s_r
But we would like the difference to be at least a margin m, so we consider s − (s_r + m)
What if s > s_r + m already? (don't do anything)
This gives the ranking objective: maximize min(0, s − (s_r + m)), i.e. minimize the hinge loss max(0, s_r + m − s), which is zero (no update) exactly when s ≥ s_r + m
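A sketch of this margin objective with a tiny two-layer scorer (the tanh nonlinearity, the sizes and the random vectors are all assumptions made for illustration; the lecture does not fix these details):

    import numpy as np

    d, h = 5, 8
    rng = np.random.default_rng(0)
    W_h   = rng.normal(scale=0.1, size=(2 * d, h))
    W_out = rng.normal(scale=0.1, size=(h,))

    def score(v_c, v_w):
        """Scalar score of a (context, word) pair: W_out . tanh([v_c; v_w] W_h)."""
        return np.tanh(np.concatenate([v_c, v_w]) @ W_h) @ W_out

    v_sat, v_on, v_abracadabra = rng.normal(size=(3, d))
    s, s_r, m = score(v_sat, v_on), score(v_sat, v_abracadabra), 1.0
    loss = max(0.0, s_r + m - s)            # zero (no update) once s >= s_r + m
    print(loss)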
Module 10.7: Hierarchical softmax

(Recap) The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
[Figure: the output layer of the network, which previously computed the full softmax e^{v_c^T u_w} / Σ_{w′∈V} e^{v_c^T u_{w′}} over all words, is replaced by a binary tree over the vocabulary; the input is the one-hot vector for 'sat' with h = v_c, the internal nodes carry vectors u_1, u_2, . . ., and the path from the root to the leaf 'on' is marked by π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0]

Construct a binary tree such that there are |V| leaf nodes, each corresponding to one word in the vocabulary
There exists a unique path from the root node to a leaf node
Let l(w_1), l(w_2), . . . , l(w_p) be the nodes on the path from the root to the word w
Let π(w) be a binary vector such that:

    π(w)_k = 1   if the path branches left at node l(w_k)
           = 0   otherwise

Finally, each internal node is associated with a vector u_i
So the parameters of the model are W_context and u_1, u_2, . . . , u_V (in effect, we have the same number of parameters as before)
For a given pair (w, c) we are interested in the probability p(w|v_c)
We model this probability as

    p(w|v_c) = Π_k P(π(w)_k | v_c)

For example,

    P(on|v_sat) = P(π(on)_1 = 1 | v_sat) · P(π(on)_2 = 0 | v_sat) · P(π(on)_3 = 0 | v_sat)

In effect, we are saying that the probability of predicting a word is the same as the probability of predicting the correct unique path from the root node to that word
We model

    P(π(on)_i = 1) = 1 / (1 + e^{−v_c^T u_i})
    P(π(on)_i = 0) = 1 − P(π(on)_i = 1) = 1 / (1 + e^{v_c^T u_i})

The above model ensures that the representation of a context word v_c will have a high (low) similarity with the representation of the node u_i if u_i appears on the path and the path branches to the left (right) at u_i
Again, transitively, the representations of contexts which appear with the same words will have high similarity
    P(w|v_c) = Π_{k=1}^{|π(w)|} P(π(w)_k | v_c)

Note that p(w|v_c) can now be computed using |π(w)| computations instead of the |V| required by softmax
How do we construct the binary tree?
Turns out that even a random arrangement of the words on the leaf nodes does well in practice
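A small sketch of this path computation (the tree, its node vectors and the path below are all made up for illustration; in practice they come from the trained model):

    import numpy as np

    k = 4
    rng = np.random.default_rng(0)
    node_vectors = rng.normal(scale=0.1, size=(7, k))   # u_i for 7 internal nodes
    v_c = rng.normal(size=k)                            # representation of the context word

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def path_probability(path_nodes, branch_left):
        """Product over the path of P(branch at node | v_c); needs |path| steps, not |V|."""
        p = 1.0
        for node, left in zip(path_nodes, branch_left):
            p_left = sigmoid(v_c @ node_vectors[node])
            p *= p_left if left else (1.0 - p_left)
        return p

    # e.g. a leaf reached via internal nodes 0 -> 2 -> 5, branching left, right, right
    print(path_probability(path_nodes=[0, 2, 5], branch_left=[True, False, False]))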
Module 10.8: GloVe representations

Count based methods (SVD) rely on global co-occurrence counts from the corpus for computing word representations
Predict based methods learn word representations using co-occurrence information
Why not combine the two (count and learn)?
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)

X=

human machine system for ... user


human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13
. . . . . . .
. . . . . . .
. . . . . . .
user 0.43 0.43 1.29 -0.13 ... 1.71

Xij Xij
P (j|i) = P =
Xij Xi
Xij = Xji

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)
Why not learn word vectors which are faith-
X= ful to this information?
human machine system for ... user
human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13
. . . . . . .
. . . . . . .
. . . . . . .
user 0.43 0.43 1.29 -0.13 ... 1.71

Xij Xij
P (j|i) = P =
Xij Xi
Xij = Xji

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)
Why not learn word vectors which are faith-
X= ful to this information?
human machine system for ... user For example, enforce
human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43 viT vj = log P (j|i)
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13 = log Xij − log(Xi )
. . . . . . .
. . . . . . .
. . . . . . .
user 0.43 0.43 1.29 -0.13 ... 1.71

Xij Xij
P (j|i) = P =
Xij Xi
Xij = Xji

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)
Why not learn word vectors which are faith-
X= ful to this information?
human machine system for ... user For example, enforce
human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43 viT vj = log P (j|i)
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13 = log Xij − log(Xi )
. . . . . . .
. . . . . . .
. . . . . . . Similarly,
user 0.43 0.43 1.29 -0.13 ... 1.71

vjT vi = log Xij − log Xj (Xij = Xji )


Xij Xij
P (j|i) = P =
Xij Xi
Xij = Xji

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)
Why not learn word vectors which are faith-
X= ful to this information?
human machine system for ... user For example, enforce
human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43 viT vj = log P (j|i)
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13 = log Xij − log(Xi )
. . . . . . .
. . . . . . .
. . . . . . . Similarly,
user 0.43 0.43 1.29 -0.13 ... 1.71

vjT vi = log Xij − log Xj (Xij = Xji )


Xij Xij Essentially we are saying that we want word
P (j|i) = P =
Xij Xi vectors vi and vj such that viT vj is faithful
Xij = Xji to the globally computed P (j|i)

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Adding the two equations we get

    2 v_i^T v_j = 2 log X_ij − log X_i − log X_j
    v_i^T v_j = log X_ij − (1/2) log X_i − (1/2) log X_j

Note that log X_i and log X_j depend only on the words i and j, and we can think of them as word-specific biases which will be learned

    v_i^T v_j = log X_ij − b_i − b_j
    v_i^T v_j + b_i + b_j = log X_ij

We can then formulate this as the following optimization problem

    min_{v_i, v_j, b_i, b_j}  Σ_{i,j} ( v_i^T v_j + b_i + b_j − log X_ij )^2

where v_i^T v_j + b_i + b_j is the predicted value using the model parameters and log X_ij is the actual value computed from the given corpus
    min_{v_i, v_j, b_i, b_j}  Σ_{i,j} ( v_i^T v_j + b_i + b_j − log X_ij )^2

Drawback: this weighs all co-occurrences equally
Solution: add a weighting function

    min_{v_i, v_j, b_i, b_j}  Σ_{i,j} f(X_ij) ( v_i^T v_j + b_i + b_j − log X_ij )^2

Wishlist: f(X_ij) should be such that neither rare nor frequent words are overweighted

    f(x) = (x / x_max)^α   if x < x_max
         = 1               otherwise

where α can be tuned for a given dataset
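A sketch of this weighted least-squares objective on a toy co-occurrence matrix (random counts; the x_max and α values are placeholders, and for brevity one set of vectors plays both the word and context roles):

    import numpy as np

    V, k, x_max, alpha = 6, 3, 100.0, 0.75
    rng = np.random.default_rng(0)
    X = np.abs(rng.normal(scale=5.0, size=(V, V))) + 0.1   # toy positive co-occurrence values
    W = rng.normal(scale=0.1, size=(V, k))                 # word vectors v_i (rows)
    b = np.zeros(V)                                        # word biases b_i

    def f(x):
        """Weighting function: (x / x_max)^alpha below x_max, and 1 above it."""
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    def glove_loss():
        pred = W @ W.T + b[:, None] + b[None, :]           # v_i . v_j + b_i + b_j
        return np.sum(f(X) * (pred - np.log(X)) ** 2)

    print(glove_loss())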
Module 10.9: Evaluating word representations

How do we evaluate the learned word representations?
Semantic Relatedness
Ask humans to judge the relatedness between a pair of words, e.g.

    S_human(cat, dog) = 0.8

Compute the cosine similarity between the corresponding word vectors learned by the model

    S_model(cat, dog) = v_cat^T v_dog / (||v_cat|| ||v_dog||) = 0.7

Given a large number of such word pairs, compute the correlation between S_model and S_human, and compare different models
Model 1 is better than Model 2 if

    correlation(S_model1, S_human) > correlation(S_model2, S_human)
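A sketch of this evaluation on a few hypothetical word pairs (the human scores and vectors below are invented; real evaluations use datasets such as WordSim-353 together with the trained embeddings, and typically report Spearman correlation, so SciPy is assumed here):

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    vectors = {w: rng.normal(size=50) for w in ["cat", "dog", "car", "banana"]}
    human = {("cat", "dog"): 0.8, ("cat", "car"): 0.3, ("dog", "banana"): 0.1}

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    s_model = [cosine(vectors[a], vectors[b]) for a, b in human]
    s_human = list(human.values())
    rho, _ = spearmanr(s_model, s_human)
    print(rho)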
Synonym Detection
Given: a term and four candidate synonyms, e.g.

    Term: levied
    Candidates: {imposed, believed, requested, correlated}

Pick the candidate which has the largest cosine similarity with the term

    Synonym = argmax_{c∈C} cosine(v_term, v_c)

Compute the accuracy of different models and compare
Analogy
Semantic Analogy:  brother : sister :: grandson : ?
    Find the nearest neighbour of v_sister − v_brother + v_grandson
Syntactic Analogy:  work : works :: speak : ?
    Find the nearest neighbour of v_works − v_work + v_speak
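A sketch of the analogy test by vector arithmetic (the vocabulary and vectors are random toys; a real evaluation would use the trained embeddings and a benchmark such as the Google analogy set):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["brother", "sister", "grandson", "granddaughter", "car"]
    vectors = {w: rng.normal(size=50) for w in vocab}

    def analogy(a, b, c):
        """Return the word closest (by cosine) to v_b - v_a + v_c, excluding a, b, c."""
        target = vectors[b] - vectors[a] + vectors[c]
        def cos(w):
            v = vectors[w]
            return (v @ target) / (np.linalg.norm(v) * np.linalg.norm(target))
        return max((w for w in vocab if w not in (a, b, c)), key=cos)

    print(analogy("brother", "sister", "grandson"))   # with trained vectors: 'granddaughter'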
So which algorithm gives the best result?
Baroni et al. [2014] showed that predict models consistently outperform count models in all tasks.
Levy et al. [2015] do a much more thorough analysis (IMO) and show that good old SVD does better than prediction based models on similarity tasks but not on analogy tasks.
Module 10.10: Relation between SVD & word2Vec

The story ahead ...
Continuous bag of words model
Skip-gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!
[Figure: the same skip-gram network as before]

Recall that SVD does a matrix factorization of the co-occurrence matrix
Levy et al. [2015] show that word2vec also implicitly does a matrix factorization
What does this mean?
Recall that word2vec gives us W_context and W_word
Turns out that we can also show that

    M = W_word^T W_context,   where M_ij = PMI(w_i, c_j) − log(k)

and k is the number of negative samples (not to be confused with the embedding dimension)
So essentially, word2vec (skip-gram with negative sampling) factorizes a matrix M which is related to the PMI-based co-occurrence matrix (very similar to what SVD does)
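To see the connection concretely, a sketch that builds the shifted PMI matrix from toy counts and factorizes it with SVD (the counts, the chosen dimension and the way the singular values are split between the two factors are all illustrative choices):

    import numpy as np

    # Shifted PMI matrix M_ij = PMI(w_i, c_j) - log(k) from toy counts, then SVD.
    k_neg = 5                                            # number of negative samples
    counts = np.array([[10.0, 2.0, 1.0],                 # toy word-context co-occurrence counts
                       [ 2.0, 8.0, 3.0],
                       [ 1.0, 3.0, 6.0]])
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    M = np.log(p_wc / (p_w * p_c)) - np.log(k_neg)       # shifted PMI

    U, S, Vt = np.linalg.svd(M)
    dim = 2
    word_vectors = U[:, :dim] * np.sqrt(S[:dim])         # one common way to split the factors
    print(word_vectors)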
