
CS7015 (Deep Learning) : Lecture 10

Learning Vectorial Representations Of Words

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Acknowledgments

‘word2vec Parameter Learning Explained’ by Xin Rong

‘word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method’ by Yoav Goldberg and Omer Levy

Sebastian Ruder’s blogs on word embeddings (Blog1, Blog2, Blog3)

Ali Ghodsi’s video lectures on Word2Vec
Module 10.1: One-hot representations of words

Let us start with a very simple motivation for why we are interested in
vectorial representations of words

Suppose we are given an input stream of words (sentence, document, etc.) and
we are interested in learning some function of it (say, ŷ = sentiments(words))

Say, we employ a machine learning algorithm (some mathematical model) for
learning such a function (ŷ = f(x))

We first need a way of converting the input stream (or each word in the
stream) to a vector x (a mathematical quantity)

Example input: This is by far AAMIR KHAN’s best one. Finest casting and
terrific acting by all.
Model output: [5.7, 1.2, 2.3, -10.2, 4.5, ..., 11.9, 20.1, -0.5, 40.7]
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

Given a corpus, consider the set V of all unique words across all input
streams (i.e., all sentences or documents)

V is called the vocabulary of the corpus

We need a representation for every word in V

One very simple way of doing this is to use one-hot vectors of size |V|

The representation of the i-th word will have a 1 in the i-th position and a
0 in the remaining |V| − 1 positions

V = [human, machine, interface, for, computer, applications, user, opinion,
of, system, response, time, management, engineering, improved]

machine: 0 1 0 ... 0 0 0
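As a concrete illustration (not from the slides), here is a minimal Python
sketch that builds the vocabulary of this toy corpus and produces one-hot
vectors; the helper name one_hot is just for the example:

    import numpy as np

    corpus = [
        "human machine interface for computer applications",
        "user opinion of computer system response time",
        "user interface management system",
        "system engineering for improved response time",
    ]

    # the vocabulary V: all unique words across the input streams
    vocab = sorted({w for sentence in corpus for w in sentence.split()})
    word2id = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """|V|-dimensional vector with a 1 in the word's position, 0 elsewhere."""
        x = np.zeros(len(vocab))
        x[word2id[word]] = 1.0
        return x

    print(len(vocab))           # |V| for this toy corpus
    print(one_hot("machine"))   # the one-hot representation of 'machine'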
Problems:

cat:   0 0 0 0 0 1 0
dog:   0 1 0 0 0 0 0
truck: 0 0 0 1 0 0 0

euclid dist(cat, dog) = √2
euclid dist(dog, truck) = √2
cosine sim(cat, dog) = 0
cosine sim(dog, truck) = 0

V tends to be very large (for example, 50K for PTB, 13M for the Google 1T
corpus)

These representations do not capture any notion of similarity

Ideally, we would want the representations of cat and dog (both domestic
animals) to be closer to each other than the representations of cat and
truck

However, with 1-hot representations, the Euclidean distance between any two
words in the vocabulary is √2

And the cosine similarity between any two words in the vocabulary is 0
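A small numerical check of these two observations (any pair of distinct
one-hot vectors will do; here they are simply rows of an identity matrix):

    import numpy as np

    V = 7
    cat, dog, truck = np.eye(V)[5], np.eye(V)[1], np.eye(V)[3]

    def euclid_dist(a, b):
        return np.linalg.norm(a - b)

    def cosine_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(euclid_dist(cat, dog), euclid_dist(dog, truck))  # both sqrt(2) ≈ 1.414
    print(cosine_sim(cat, dog), cosine_sim(dog, truck))    # both 0.0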
Module 10.2: Distributed Representations of words

You shall know a word by the company it keeps - Firth, J. R. 1957:11

Distributional similarity based representations

This leads us to the idea of a co-occurrence matrix

A bank is a financial institution that accepts deposits from the public and
creates credit.

The idea is to use the accompanying words (financial, deposits, credit) to
represent bank
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

A co-occurrence matrix is a terms × terms matrix which captures the number
of times a term appears in the context of another term

The context is defined as a window of k words around the terms

Let us build a co-occurrence matrix for this toy corpus with k = 2

This is also known as a word × context matrix

You could choose the set of words and contexts to be same or different

Each row (column) of the co-occurrence matrix gives a vectorial
representation of the corresponding word (context)

            human  machine  system  for  ...  user
  human       0       1       0      1   ...    0
  machine     1       0       0      1   ...    0
  system      0       0       0      1   ...    2
  for         1       1       1      0   ...    0
  ...
  user        0       0       2      0   ...    0

Co-occurrence Matrix
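A minimal sketch (not from the lecture) of how such a word × context matrix
could be built in Python for the toy corpus with a window of k = 2; the exact
counts depend on conventions such as sentence boundaries and window handling,
so the numbers on the slide may differ slightly from what this produces:

    import numpy as np

    corpus = [
        "human machine interface for computer applications",
        "user opinion of computer system response time",
        "user interface management system",
        "system engineering for improved response time",
    ]
    k = 2  # context window size

    vocab = sorted({w for s in corpus for w in s.lower().split()})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))

    for s in corpus:
        words = s.lower().split()
        for i, w in enumerate(words):
            # every word within k positions of w counts as a co-occurrence
            for j in range(max(0, i - k), min(len(words), i + k + 1)):
                if j != i:
                    X[idx[w], idx[words[j]]] += 1

    print(X[idx["human"], idx["machine"]])  # count for the (human, machine) pair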
Some (fixable) problems

Stop words (a, the, for, etc.) are very frequent → these counts will be very
high

Solution 1: Ignore very frequent words

Solution 2: Use a threshold t (say, t = 100)

Xij = min(count(wi, cj), t),

where w is a word and c is a context.

            human  machine  system  for  ...  user
  human       0       1       0      x   ...    0
  machine     1       0       0      x   ...    0
  system      0       0       0      x   ...    2
  for         x       x       x      x   ...    x
  ...
  user        0       0       2      x   ...    0

(x marks the entries involving the very frequent word "for", which Solutions
1 and 2 target)
Some (fixable) problems

Solution 3: Instead of count(w, c) use PMI(w, c)

PMI(w, c) = log [ p(c|w) / p(c) ]
          = log [ (count(w, c) × N) / (count(c) × count(w)) ]

N is the total number of words

If count(w, c) = 0, PMI(w, c) = −∞

Instead use,

PMI0(w, c) = PMI(w, c)   if count(w, c) > 0
           = 0           otherwise

or

PPMI(w, c) = PMI(w, c)   if PMI(w, c) > 0
           = 0           otherwise

            human   machine  system   for   ...   user
  human       0      2.944     0      2.25  ...     0
  machine   2.944      0       0      2.25  ...     0
  system      0        0       0      1.15  ...   1.84
  for       2.25     2.25     1.15     0    ...     0
  ...
  user        0        0      1.84     0    ...     0
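A sketch of turning raw co-occurrence counts into PPMI values following the
formula above (assuming X is the count matrix from the earlier sketch; here N
is taken as the total of all counts, which is one common convention):

    import numpy as np

    def ppmi(X):
        """Positive PMI matrix computed from a co-occurrence count matrix X."""
        N = X.sum()
        row = X.sum(axis=1, keepdims=True)   # count(w)
        col = X.sum(axis=0, keepdims=True)   # count(c)
        with np.errstate(divide="ignore"):
            # PMI(w, c) = log( count(w, c) * N / (count(w) * count(c)) )
            pmi = np.log((X * N) / (row * col))
        return np.maximum(pmi, 0.0)          # clip negatives (and -inf) to 0

    X_ppmi = ppmi(X)   # X is the count matrix built in the earlier sketch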
Some (severe) problems

Very high dimensional (|V|)

Very sparse

Grows with the size of the vocabulary

Solution: Use dimensionality reduction (SVD)
Module 10.3: SVD for learning word representations

Singular Value Decomposition gives a rank k approximation of the original
matrix:

X = XPPMI (m × n) = U (m × k) · Σ (k × k) · V^T (k × n)

where U = [u1 · · · uk], Σ = diag(σ1, . . . , σk) and V^T = [v1^T ; . . . ; vk^T]

XPPMI (simplifying notation to X) is the co-occurrence matrix with PPMI
values

SVD gives the best rank-k approximation of the original data (X)

Discovers latent semantics in the corpus (We will soon examine this with the
help of an example)
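A sketch of the rank-k truncation with NumPy (assuming X_ppmi is the PPMI
matrix from the earlier sketch; np.linalg.svd returns singular values in
decreasing order, so keeping the first k triplets gives the rank-k
approximation):

    import numpy as np

    def truncated_svd(X, k):
        """Return the top-k singular triplets of X."""
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k], S[:k], Vt[:k, :]

    U, S, Vt = truncated_svd(X_ppmi, k=2)   # X_ppmi from the PPMI sketch
    X_hat = U @ np.diag(S) @ Vt             # best rank-2 approximation of X_ppmi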
Notice that the product can be written as a sum of k rank-1 matrices:

X = σ1 u1 v1^T + σ2 u2 v2^T + · · · + σk uk vk^T

Each σi ui vi^T ∈ R^(m×n) because it is a product of an m × 1 vector with a
1 × n vector

If we truncate the sum at σ1 u1 v1^T then we get the best rank-1
approximation of X (By the SVD theorem! But what does this mean? We will see
on the next slide)

If we truncate the sum at σ1 u1 v1^T + σ2 u2 v2^T then we get the best
rank-2 approximation of X, and so on
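A quick numerical check (using U, S, Vt from the previous sketch) that the
truncated product and the sum of rank-1 matrices are the same thing:

    import numpy as np

    # X_hat as a sum of k rank-1 matrices: sigma_i * u_i * v_i^T
    X_hat_sum = sum(S[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(S)))

    assert np.allclose(X_hat_sum, U @ np.diag(S) @ Vt)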
What do we mean by approximation here?

Notice that X has m × n entries

When we use the rank-1 approximation we are using only m + n + 1 entries to
reconstruct it [u ∈ R^m, v ∈ R^n, σ ∈ R^1]

But the SVD theorem tells us that u1, v1 and σ1 store the most important
information in X (akin to the principal components in X)

Each subsequent term (σ2 u2 v2^T, σ3 u3 v3^T, . . . ) stores less and less
important information
As an analogy consider the case when we are using 8 bits to represent
colors:

very light green: 0 0 0 1 1 0 1 1
light green:      0 0 1 0 1 0 1 1
dark green:       0 1 0 0 1 0 1 1
very dark green:  1 0 0 0 1 0 1 1

The representation of very light, light, dark and very dark green would look
different

But now what if we were asked to compress this into 4 bits? (akin to
compressing m × n values into m + n + 1 values on the previous slide)

We will retain the most important 4 bits and the previously (slightly)
latent similarity between the colors now becomes very obvious

Something similar is guaranteed by SVD (retain the most important
information and discover the latent similarities between words)
Co-occurrence Matrix (X):

            human   machine  system   for   ...   user
  human       0      2.944     0      2.25  ...     0
  machine   2.944      0       0      2.25  ...     0
  system      0        0       0      1.15  ...   1.84
  for       2.25     2.25     1.15     0    ...     0
  ...
  user        0        0      1.84     0    ...     0

Low rank X̂:

            human   machine  system    for    ...   user
  human      2.01     2.01    0.23     2.14   ...   0.43
  machine    2.01     2.01    0.23     2.14   ...   0.43
  system     0.23     0.23    1.17     0.96   ...   1.29
  for        2.14     2.14    0.96     1.87   ...  -0.13
  ...
  user       0.43     0.43    1.29    -0.13   ...   1.71

Notice that after low rank reconstruction with SVD, the latent co-occurrence
between {system, machine} and {human, user} has become visible
Recall that earlier each row of the original matrix X served as the
representation of a word

Then XX^T is a matrix whose ij-th entry is the dot product between the
representation of word i (X[i :]) and word j (X[j :])

For example,

    [ 1 2 3 ]   [ 1 2 1 ]   [ .  .  22 ]
    [ 2 1 0 ] × [ 2 1 3 ] = [ .  .   . ]
    [ 1 3 5 ]   [ 3 0 5 ]   [ .  .   . ]
        X          X^T         XX^T

X =
            human   machine  system   for   ...   user
  human       0      2.944     0      2.25  ...     0
  machine   2.944      0       0      2.25  ...     0
  system      0        0       0      1.15  ...   1.84
  for       2.25     2.25     1.15     0    ...     0
  ...
  user        0        0      1.84     0    ...     0

XX^T =
            human   machine  system    for    ...   user
  human      32.5     23.9    7.78    20.25   ...   7.01
  machine    23.9     32.5    7.78    20.25   ...   7.01
  system     7.78     7.78     0      17.65   ...  21.84
  for       20.25    20.25   17.65    36.3    ...   11.8
  ...
  user       7.01     7.01   21.84    11.8    ...   28.3

The ij-th entry of XX^T thus (roughly) captures the cosine similarity
between word i and word j

cosine sim(human, user) = 0.21
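A sketch of these computations (assuming X_ppmi and idx from the earlier
sketches; the specific numbers above come from the lecture's own matrix, so
the toy-corpus values will differ):

    import numpy as np

    XXt = X_ppmi @ X_ppmi.T     # (i, j) entry = dot product of rows i and j

    def cosine_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    i, j = idx["human"], idx["user"]
    print(XXt[i, j], cosine_sim(X_ppmi[i], X_ppmi[j]))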
Once we do an SVD, what is a good choice for the representation of word i?

Obviously, taking the i-th row of the reconstructed matrix does not make
sense because it is still high dimensional

But we saw that the reconstructed matrix X̂ = UΣV^T discovers latent
semantics and its word representations are more meaningful

Wishlist: We would want representations of words (i, j) to be of smaller
dimensions but still have the same similarity (dot product) as the
corresponding rows of X̂

X̂ =
            human   machine  system    for    ...   user
  human      2.01     2.01    0.23     2.14   ...   0.43
  machine    2.01     2.01    0.23     2.14   ...   0.43
  system     0.23     0.23    1.17     0.96   ...   1.29
  for        2.14     2.14    0.96     1.87   ...  -0.13
  ...
  user       0.43     0.43    1.29    -0.13   ...   1.71

X̂X̂^T =
            human   machine  system    for    ...   user
  human      25.4     25.4     7.6     21.9   ...   6.84
  machine    25.4     25.4     7.6     21.9   ...   6.84
  system      7.6      7.6    24.8    18.03   ...   20.6
  for        21.9     21.9    0.96     24.6   ...  15.32
  ...
  user       6.84     6.84    20.6    15.32   ...  17.11

cosine sim(human, user) = 0.33
Notice that the dot product between the rows of the matrix Wword = UΣ is the
same as the dot product between the rows of X̂:

X̂X̂^T = (UΣV^T)(UΣV^T)^T
      = (UΣV^T)(VΣU^T)
      = UΣΣ^T U^T          (∵ V^T V = I)
      = UΣ(UΣ)^T = Wword Wword^T

Conventionally,

Wword = UΣ ∈ R^(m×k)

is taken as the representation of the m words in the vocabulary, and

Wcontext = V

is taken as the representation of the context words
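A quick numerical check of this identity (with U, S, Vt from the truncated
SVD sketch): the k-dimensional rows of Wword = UΣ reproduce exactly the dot
products between the rows of X̂.

    import numpy as np

    X_hat = U @ np.diag(S) @ Vt      # rank-k reconstruction
    W_word = U @ np.diag(S)          # m x k word representations
    W_context = Vt.T                 # n x k context representations

    # dot products between rows of W_word equal those between rows of X_hat
    assert np.allclose(W_word @ W_word.T, X_hat @ X_hat.T)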
Module 10.4: Continuous bag of words model

The methods that we have seen so far are called count based models because
they use the co-occurrence counts of words

We will now see methods which directly learn word representations (these are
called (direct) prediction based models)
The story ahead ...
Continuous bag of words model
Skip gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!
Consider this Task: Predict the n-th word given the previous n-1 words

Example: he sat on a chair

Training data: All n-word windows in your corpus

Training data for this task is easily available (take all n-word windows
from the whole of Wikipedia)

For ease of illustration, we will first focus on the case when n = 2 (i.e.,
predict the second word based on the first word)

Some sample 4 word windows from a corpus:

Sometime in the 21st century, Joseph Cooper, a widowed former engineer and
former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and
daughter Murphy. It is a post-truth society (Cooper is reprimanded for
telling Murphy that the Apollo missions did indeed happen) and a series of
crop blights threatens humanity’s survival. Murphy believes her bedroom is
haunted by a poltergeist. When a pattern is created out of dust on the
floor, Cooper realizes that gravity is behind its formation, not a ”ghost”.
He interprets the pattern as a set of geographic coordinates formed into
binary code. Cooper and Murphy follow the coordinates to a secret NASA
facility, where they are met by Cooper’s former professor, Dr. Brand.
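As an illustrative sketch (the variable names are just for the example),
extracting the (context, target) training pairs for the n = 2 case from a
sentence:

    text = "he sat on a chair"
    words = text.split()

    # n = 2: predict the second word of every 2-word window from the first
    training_pairs = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
    print(training_pairs)
    # [('he', 'sat'), ('sat', 'on'), ('on', 'a'), ('a', 'chair')]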
We will now try to answer these two questions:

How do you model this task?

What is the connection between this task and learning word representations?
We will model this problem using a feedforward neural network

Input: One-hot representation of the context word

Output: There are |V| words (classes) possible and we want to predict a
probability distribution over these |V| classes (multi-class classification
problem)

Parameters: Wcontext ∈ R^(k×|V|) and Wword ∈ R^(k×|V|) (we are assuming that
the set of words and context words is the same: each of size |V|)

[Network figure: the input x ∈ R^(|V|) is the one-hot vector for the context
word (sat); the hidden layer h ∈ R^k is computed using Wcontext ∈ R^(k×|V|);
the output layer uses Wword ∈ R^(k×|V|) to produce P(chair|sat), P(man|sat),
P(on|sat), P(he|sat), ...]
What is the product Wcontext x given that x is a one-hot vector?

It is simply the i-th column of Wcontext:

    [ -1  0.5   2 ] [0]   [ 0.5]
    [  3   -1  -2 ] [1] = [-1  ]
    [ -2  1.7   3 ] [0]   [ 1.7]

So when the i-th word is present, the i-th element in the one-hot vector is
ON and the i-th column of Wcontext gets selected

In other words, there is a one-to-one correspondence between the words and
the columns of Wcontext

More specifically, we can treat the i-th column of Wcontext as the
representation of context i
How do we obtain P(on|sat)? For this multi-class classification problem what
is an appropriate output function? (softmax)

P(on|sat) = e^((Wword^T h)[i]) / Σ_j e^((Wword^T h)[j])

Therefore, P(on|sat) is proportional to the dot product between the j-th
column of Wcontext and the i-th column of Wword

P(word = i|sat) thus depends on the i-th column of Wword

We thus treat the i-th column of Wword as the representation of word i

Hope you see an analogy with SVD! (there we had a different way of learning
Wcontext and Wword, but we saw that the i-th column of Wword corresponded to
the representation of the i-th word)

Having understood the interpretation of Wcontext and Wword, our aim now is
to learn these parameters
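A minimal sketch of this forward pass with a toy vocabulary and random
parameters (just to make the shapes and the column selection concrete; this
is not the lecture's training code):

    import numpy as np

    vocab = ["he", "sat", "on", "a", "chair", "man"]
    V, k = len(vocab), 4
    rng = np.random.default_rng(0)

    W_context = rng.normal(size=(k, V))   # column i = representation of context i
    W_word = rng.normal(size=(k, V))      # column i = representation of word i

    x = np.zeros(V)
    x[vocab.index("sat")] = 1.0           # one-hot input for the context word

    h = W_context @ x                     # picks out the column of W_context for 'sat'
    scores = W_word.T @ h                 # dot product of h with every column of W_word
    y_hat = np.exp(scores) / np.exp(scores).sum()   # softmax over the |V| words

    print(y_hat[vocab.index("on")])       # P(on | sat) under the current parameters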
[Figure: the same network, with the one-hot input x ∈ R^{|V|} for the context word 'sat', hidden layer h ∈ R^k, parameters W_context, W_word ∈ R^{k×|V|}, and softmax outputs ŷ = [P(chair|sat), P(man|sat), P(on|sat), P(he|sat), . . .]]

We denote the context word (sat) by the index c and the correct output word (on) by the index w
For this multi-class classification problem, what is an appropriate output function (ŷ = f(x))? softmax
What is an appropriate loss function? cross entropy

    L(θ) = −log ŷ_w = −log P(w|c)
    h = W_context · x_c = u_c
    ŷ_w = exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′})

u_c is the column of W_context corresponding to the context word c and v_w is the column of W_word corresponding to the word w
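To make the setup concrete, here is a minimal NumPy sketch of this single-context model (a toy illustration with made-up sizes and indices, not the lecture's code): it builds W_context and W_word, computes h = u_c, applies a softmax over the scores u_c · v_w, and returns the cross-entropy loss −log ŷ_w.

    import numpy as np

    # A toy instantiation of the single-context model above (illustrative sizes).
    V, k = 10, 4                                     # vocabulary size, embedding dim
    rng = np.random.default_rng(0)
    W_context = rng.normal(scale=0.1, size=(k, V))   # u_c = W_context[:, c]
    W_word    = rng.normal(scale=0.1, size=(k, V))   # v_w = W_word[:, w]

    def forward_loss(c, w):
        """Return the cross-entropy loss -log P(w|c) and the softmax vector y_hat."""
        u_c = W_context[:, c]                        # h = W_context . x_c = u_c
        scores = W_word.T @ u_c                      # scores[i] = u_c . v_i
        scores = scores - scores.max()               # subtract max for numerical stability
        y_hat = np.exp(scores) / np.exp(scores).sum()
        return -np.log(y_hat[w]), y_hat

    loss, y_hat = forward_loss(c=3, w=7)             # hypothetical indices for 'sat' and 'on'
    print(loss)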
How do we train this simple feedforward neural network? backpropagation
Let us consider one input-output pair (c, w) and see the update rule for v_w
    L(θ) = −log ŷ_w
         = −log [ exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′}) ]
         = −( u_c · v_w − log Σ_{w′∈V} exp(u_c · v_{w′}) )

    ∇_{v_w} = ∂L(θ)/∂v_w
            = −( u_c − [ exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′}) ] u_c )
            = −u_c (1 − ŷ_w)

And the update rule would be

    v_w = v_w − η ∇_{v_w}
        = v_w + η u_c (1 − ŷ_w)
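As a sanity check on this derivation, a self-contained sketch of the single-pair update (again a toy setup; sizes, indices and the learning rate are made up):

    import numpy as np

    # Toy setup (same shapes as in the earlier sketch).
    V, k, eta = 10, 4, 0.1
    rng = np.random.default_rng(0)
    W_context = rng.normal(scale=0.1, size=(k, V))
    W_word    = rng.normal(scale=0.1, size=(k, V))

    def sgd_step(c, w):
        """Apply the update v_w = v_w + eta * u_c * (1 - y_hat_w) for one pair (c, w)."""
        u_c = W_context[:, c]
        scores = W_word.T @ u_c
        scores = scores - scores.max()
        y_hat = np.exp(scores) / np.exp(scores).sum()
        W_word[:, w] += eta * u_c * (1.0 - y_hat[w])

    sgd_step(c=3, w=7)

Repeating this over all (context, word) pairs is stochastic gradient descent on the loss above; a full implementation would of course also update u_c and the other columns of W_word (whose gradients are +ŷ_{w′} u_c for w′ ≠ w).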
This update rule has a nice interpretation

    v_w = v_w + η u_c (1 − ŷ_w)

If ŷ_w → 1 then we are already predicting the right word and v_w will not be updated
If ŷ_w → 0 then v_w gets updated by adding a fraction of u_c to it
This increases the cosine similarity between v_w and u_c (How? Refer to slide 38 of Lecture 2)
The training objective ensures that the cosine similarity between the word (v_w) and context word (u_c) is maximized
What happens to the representations of two words w and w′ which tend to appear in similar contexts (c)?
The training ensures that both v_w and v_{w′} have a high cosine similarity with u_c, and hence transitively (intuitively) ensures that v_w and v_{w′} have a high cosine similarity with each other
This is only an intuition (reasonable)
Haven't come across a formal proof for this!
[Figure: the same network with a window of two context words ('he', 'sat'): input x ∈ R^{2|V|}, stacked parameters [W_context, W_context] ∈ R^{k×2|V|}, hidden layer h ∈ R^k, W_word ∈ R^{k×|V|}, and softmax outputs P(man|sat, he), P(on|sat, he), P(he|sat, he), P(chair|sat, he), . . .]

In practice, instead of a window size of 1, it is common to use a window size of d
So now,

    h = Σ_{i=1}^{d−1} u_{c_i}

[W_context, W_context] just means that we are stacking 2 copies of the W_context matrix
For example, with k = 3 and |V| = 3, and a two-hot input with 1s at the positions of 'sat' and 'he':

    [ −1   0.5    2   −1   0.5    2 ]                          [ 2.5 ]
    [  3  −1.0   −2    3  −1.0   −2 ]  ·  [0 1 0 0 0 1]^T  =   [−3.0 ]
    [ −2   1.7    3   −2   1.7    3 ]                          [ 4.7 ]

The resultant product would simply be the sum of the columns corresponding to 'sat' and 'he'
Of course, in practice we will not do this expensive matrix multiplication
If 'he' is the i-th word in the vocabulary and 'sat' is the j-th word, then we will simply access columns W_context[:, i] and W_context[:, j] and add them
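A small sketch of this column-lookup trick (toy sizes; the word-to-index mapping is hypothetical):

    import numpy as np

    V, k = 10, 3
    rng = np.random.default_rng(0)
    W_context = rng.normal(size=(k, V))

    def hidden(context_indices):
        """h = sum of the W_context columns of the context words (no |V|-sized matmul)."""
        return W_context[:, context_indices].sum(axis=1)

    h = hidden([2, 5])    # e.g. 'he' at index 2 and 'sat' at index 5
    print(h)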
Now what happens during backpropagation?
Recall that

    h = Σ_{i=1}^{d−1} u_{c_i}

and

    P(on | sat, he) = exp((W_word h)[k]) / Σ_j exp((W_word h)[j])

where 'k' is the index of the word 'on'
The loss function depends on {W_word, u_{c_1}, u_{c_2}, . . . , u_{c_{d−1}}}, and all these parameters will get updated during backpropagation
Try deriving the update rule for v_w now and see how it differs from the one we derived before
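If you attempt the derivation, one way to check it is against a numerically computed gradient. A small sketch with toy sizes and arbitrarily chosen indices (the derivation gives the same −u_c(1 − ŷ_w) form as before, with h in place of u_c):

    import numpy as np

    # Numerical check of the hand-derived gradient for the window-d model (toy sizes).
    V, k = 8, 4
    rng = np.random.default_rng(1)
    W_context = rng.normal(scale=0.1, size=(k, V))
    W_word    = rng.normal(scale=0.1, size=(k, V))
    ctx, w = [1, 5], 3                               # hypothetical context / target indices

    def loss(Wd):
        h = W_context[:, ctx].sum(axis=1)            # h = sum of u_{c_i}
        s = Wd.T @ h
        s = s - s.max()
        y_hat = np.exp(s) / np.exp(s).sum()
        return -np.log(y_hat[w])

    # Analytical gradient w.r.t. v_w: -h * (1 - y_hat_w)
    h = W_context[:, ctx].sum(axis=1)
    s = W_word.T @ h
    s = s - s.max()
    y_hat = np.exp(s) / np.exp(s).sum()
    grad_analytic = -h * (1.0 - y_hat[w])

    # Finite-difference estimate for one coordinate of v_w
    eps = 1e-6
    Wp, Wm = W_word.copy(), W_word.copy()
    Wp[0, w] += eps
    Wm[0, w] -= eps
    grad_numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
    print(grad_analytic[0], grad_numeric)            # the two numbers should match closely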
Some problems:
Notice that the softmax function at the output is computationally very expensive

    ŷ_w = exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′})

The denominator requires a summation over all words in the vocabulary
We will revisit this issue soon
Module 10.5: Skip-gram model

The model that we just saw is called the continuous bag of words (CBOW) model (it predicts an output word given a bag of context words)
We will now see the skip-gram model (which predicts context words given an input word)
[Figure: the skip-gram network: one-hot input x ∈ R^{|V|} for the word 'on', W_word ∈ R^{k×|V|}, hidden layer h ∈ R^k, W_context ∈ R^{k×|V|}, and outputs for the context words 'he', 'sat', 'a', 'chair']

Notice that the role of context and word has changed now
In the simple case when there is only one context word, we will arrive at the same update rule for u_c as we did for v_w earlier
Notice that even when we have multiple context words the loss function would just be a summation of many cross-entropy errors

    L(θ) = −Σ_{i=1}^{d−1} log ŷ_{w_i}

Typically, we predict context words on both sides of the given word
Some problems (same as for the bag of words model): the softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
Let D be the set of all correct (w, c) pairs in the corpus

    D = [(sat, on), (sat, a), (sat, chair), (on, a), (on, chair), (a, chair), (on, sat), (a, sat), (chair, sat), (a, on), (chair, on), (chair, a)]

Let D′ be the set of all incorrect (w, r) pairs in the corpus

    D′ = [(sat, oxygen), (sat, magic), (chair, sad), (chair, walking)]

D′ can be constructed by randomly sampling a context word r which has never appeared with w and creating a pair (w, r)
As before, let v_w be the representation of the word w and u_c be the representation of the context word c
[Figure: a single sigmoid unit that takes u_c and v_w and outputs σ(u_c · v_w) = P(z = 1|w, c)]

For a given (w, c) ∈ D we are interested in maximizing

    p(z = 1|w, c)

Let us model this probability by

    p(z = 1|w, c) = σ(u_c^T v_w) = 1 / (1 + e^{−u_c^T v_w})

Considering all (w, c) ∈ D, we are interested in

    maximize_θ Π_{(w,c)∈D} p(z = 1|w, c)

where θ is the word representation (v_w) and context representation (u_c) for all the words in our corpus
For (w, r) ∈ D′ we are interested in maximizing

    p(z = 0|w, r)

Again we model this as

    p(z = 0|w, r) = 1 − σ(u_r^T v_w)
                  = 1 − 1 / (1 + e^{−u_r^T v_w})
                  = 1 / (1 + e^{u_r^T v_w})
                  = σ(−u_r^T v_w)

Considering all (w, r) ∈ D′, we are interested in

    maximize_θ Π_{(w,r)∈D′} p(z = 0|w, r)
Combining the two we get:

    maximize_θ Π_{(w,c)∈D} p(z = 1|w, c) Π_{(w,r)∈D′} p(z = 0|w, r)
    = maximize_θ Π_{(w,c)∈D} p(z = 1|w, c) Π_{(w,r)∈D′} (1 − p(z = 1|w, r))
    = maximize_θ Σ_{(w,c)∈D} log p(z = 1|w, c) + Σ_{(w,r)∈D′} log(1 − p(z = 1|w, r))
    = maximize_θ Σ_{(w,c)∈D} log [ 1 / (1 + e^{−u_c · v_w}) ] + Σ_{(w,r)∈D′} log [ 1 / (1 + e^{u_r · v_w}) ]
    = maximize_θ Σ_{(w,c)∈D} log σ(u_c · v_w) + Σ_{(w,r)∈D′} log σ(−u_r · v_w)

where σ(x) = 1 / (1 + e^{−x})
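A compact sketch of this objective for a single (w, c) pair with a handful of sampled negatives (toy sizes and indices; an illustration of the formula, not the word2vec implementation):

    import numpy as np

    V, k = 10, 4
    rng = np.random.default_rng(0)
    W_word    = rng.normal(scale=0.1, size=(k, V))   # v_w = W_word[:, w]
    W_context = rng.normal(scale=0.1, size=(k, V))   # u_c = W_context[:, c]

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_loss(w, c, negatives):
        """Negative of: log sigma(u_c . v_w) + sum_r log sigma(-u_r . v_w)."""
        v_w = W_word[:, w]
        pos = np.log(sigmoid(W_context[:, c] @ v_w))
        neg = np.log(sigmoid(-(W_context[:, negatives].T @ v_w))).sum()
        return -(pos + neg)

    print(sgns_loss(w=3, c=7, negatives=[1, 4, 9]))  # indices are hypothetical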
In the original paper, Mikolov et al. sample k negative (w, r) pairs for every positive (w, c) pair
The size of D′ is thus k times the size of D
The random context word r is drawn from a modified unigram distribution:

    r ∼ count(r)^{3/4} / N

where N = total number of words in the corpus
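A small sketch of drawing negatives from this damped unigram distribution (the counts below are hypothetical; a real implementation precomputes a sampling table over the whole vocabulary):

    import numpy as np

    counts = {"the": 500, "cat": 20, "sat": 15, "oxygen": 2, "abracadabra": 1}
    words = list(counts)
    p = np.array([counts[w] ** 0.75 for w in words])
    p = p / p.sum()                         # normalise to a probability distribution

    rng = np.random.default_rng(0)
    print(rng.choice(words, size=5, p=p))   # frequent words are damped, rare ones boosted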
Module 10.6: Contrastive estimation

(Recap) The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
Positive: He sat on a chair        Negative: He sat abracadabra a chair

[Figure: two copies of a small scoring network, one per sentence; each takes the pair of vectors (v_c, v_w), i.e. ('sat', 'on') for the positive example and ('sat', 'abracadabra') for the negative one, through a hidden layer with W_h ∈ R^{2d×h} and an output layer W_out ∈ R^{h×1}, producing scalar scores s and s_r respectively]

We would like the score s of the positive example to be greater than the score s_r of the negative example
Okay, so let us try to maximize s − s_r
But we would like the difference to be at least a margin m, so we consider s − (s_r + m)
What if s > s_r + m already? (don't do anything)
This gives the ranking objective: maximize min(0, s − (s_r + m)), i.e. minimize the hinge loss max(0, s_r + m − s), which is zero (no update) exactly when s ≥ s_r + m
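A sketch of this margin objective with a tiny two-layer scorer (the tanh nonlinearity, the sizes and the random vectors are all assumptions made for illustration; the lecture does not fix these details):

    import numpy as np

    d, h = 5, 8
    rng = np.random.default_rng(0)
    W_h   = rng.normal(scale=0.1, size=(2 * d, h))
    W_out = rng.normal(scale=0.1, size=(h,))

    def score(v_c, v_w):
        """Scalar score of a (context, word) pair: W_out . tanh([v_c; v_w] W_h)."""
        return np.tanh(np.concatenate([v_c, v_w]) @ W_h) @ W_out

    v_sat, v_on, v_abracadabra = rng.normal(size=(3, d))
    s, s_r, m = score(v_sat, v_on), score(v_sat, v_abracadabra), 1.0
    loss = max(0.0, s_r + m - s)            # zero (no update) once s >= s_r + m
    print(loss)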
Module 10.7: Hierarchical softmax

(Recap) The softmax function at the output is computationally expensive
Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax
[Figure: the output layer of the network, which previously computed the full softmax e^{v_c^T u_w} / Σ_{w′∈V} e^{v_c^T u_{w′}} over all words, is replaced by a binary tree over the vocabulary; the input is the one-hot vector for 'sat' with h = v_c, the internal nodes carry vectors u_1, u_2, . . ., and the path from the root to the leaf 'on' is marked by π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0]

Construct a binary tree such that there are |V| leaf nodes, each corresponding to one word in the vocabulary
There exists a unique path from the root node to a leaf node
Let l(w_1), l(w_2), . . . , l(w_p) be the nodes on the path from the root to the word w
Let π(w) be a binary vector such that:

    π(w)_k = 1   if the path branches left at node l(w_k)
           = 0   otherwise

Finally, each internal node is associated with a vector u_i
So the parameters of the model are W_context and u_1, u_2, . . . , u_V (in effect, we have the same number of parameters as before)
For a given pair (w, c) we are interested in the probability p(w|v_c)
We model this probability as

    p(w|v_c) = Π_k P(π(w)_k | v_c)

For example,

    P(on|v_sat) = P(π(on)_1 = 1 | v_sat) · P(π(on)_2 = 0 | v_sat) · P(π(on)_3 = 0 | v_sat)

In effect, we are saying that the probability of predicting a word is the same as the probability of predicting the correct unique path from the root node to that word
We model

    P(π(on)_i = 1) = 1 / (1 + e^{−v_c^T u_i})
    P(π(on)_i = 0) = 1 − P(π(on)_i = 1) = 1 / (1 + e^{v_c^T u_i})

The above model ensures that the representation of a context word v_c will have a high (low) similarity with the representation of the node u_i if u_i appears on the path and the path branches to the left (right) at u_i
Again, transitively, the representations of contexts which appear with the same words will have high similarity
    P(w|v_c) = Π_{k=1}^{|π(w)|} P(π(w)_k | v_c)

Note that p(w|v_c) can now be computed using |π(w)| computations instead of the |V| required by softmax
How do we construct the binary tree?
Turns out that even a random arrangement of the words on the leaf nodes does well in practice
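A small sketch of this path computation (the tree, its node vectors and the path below are all made up for illustration; in practice they come from the trained model):

    import numpy as np

    k = 4
    rng = np.random.default_rng(0)
    node_vectors = rng.normal(scale=0.1, size=(7, k))   # u_i for 7 internal nodes
    v_c = rng.normal(size=k)                            # representation of the context word

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def path_probability(path_nodes, branch_left):
        """Product over the path of P(branch at node | v_c); needs |path| steps, not |V|."""
        p = 1.0
        for node, left in zip(path_nodes, branch_left):
            p_left = sigmoid(v_c @ node_vectors[node])
            p *= p_left if left else (1.0 - p_left)
        return p

    # e.g. a leaf reached via internal nodes 0 -> 2 -> 5, branching left, right, right
    print(path_probability(path_nodes=[0, 2, 5], branch_left=[True, False, False]))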
Module 10.8: GloVe representations

Count based methods (SVD) rely on global co-occurrence counts from the corpus for computing word representations
Predict based methods learn word representations using co-occurrence information
Why not combine the two (count and learn)?
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)

X=

human machine system for ... user


human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13
. . . . . . .
. . . . . . .
. . . . . . .
user 0.43 0.43 1.29 -0.13 ... 1.71

Xij Xij
P (j|i) = P =
Xij Xi
Xij = Xji

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)
Why not learn word vectors which are faith-
X= ful to this information?
human machine system for ... user
human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13
. . . . . . .
. . . . . . .
. . . . . . .
user 0.43 0.43 1.29 -0.13 ... 1.71

Xij Xij
P (j|i) = P =
Xij Xi
Xij = Xji

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)
Why not learn word vectors which are faith-
X= ful to this information?
human machine system for ... user For example, enforce
human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43 viT vj = log P (j|i)
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13 = log Xij − log(Xi )
. . . . . . .
. . . . . . .
. . . . . . .
user 0.43 0.43 1.29 -0.13 ... 1.71

Xij Xij
P (j|i) = P =
Xij Xi
Xij = Xji

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)
Why not learn word vectors which are faith-
X= ful to this information?
human machine system for ... user For example, enforce
human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43 viT vj = log P (j|i)
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13 = log Xij − log(Xi )
. . . . . . .
. . . . . . .
. . . . . . . Similarly,
user 0.43 0.43 1.29 -0.13 ... 1.71

vjT vi = log Xij − log Xj (Xij = Xji )


Xij Xij
P (j|i) = P =
Xij Xi
Xij = Xji

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Corpus:
Xij encodes important global information
Human machine interface for computer applications
about the co-occurrence between i and j
User opinion of computer system response time
User interface management system
(global: because it is computed from the en-
System engineering for improved response time
tire corpus)
Why not learn word vectors which are faith-
X= ful to this information?
human machine system for ... user For example, enforce
human 2.01 2.01 0.23 2.14 ... 0.43
machine 2.01 2.01 0.23 2.14 ... 0.43 viT vj = log P (j|i)
system 0.23 0.23 1.17 0.96 ... 1.29
for 2.14 2.14 0.96 1.87 ... -0.13 = log Xij − log(Xi )
. . . . . . .
. . . . . . .
. . . . . . . Similarly,
user 0.43 0.43 1.29 -0.13 ... 1.71

vjT vi = log Xij − log Xj (Xij = Xji )


Xij Xij Essentially we are saying that we want word
P (j|i) = P =
Xij Xi vectors vi and vj such that viT vj is faithful
Xij = Xji to the globally computed P (j|i)

59/70
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 10
Adding the two equations we get

    2 v_i^T v_j = 2 log X_ij − log X_i − log X_j
    v_i^T v_j = log X_ij − (1/2) log X_i − (1/2) log X_j

Note that log X_i and log X_j depend only on the words i and j, and we can think of them as word-specific biases which will be learned

    v_i^T v_j = log X_ij − b_i − b_j
    v_i^T v_j + b_i + b_j = log X_ij

We can then formulate this as the following optimization problem

    min_{v_i, v_j, b_i, b_j}  Σ_{i,j} ( v_i^T v_j + b_i + b_j − log X_ij )^2

where v_i^T v_j + b_i + b_j is the predicted value using the model parameters and log X_ij is the actual value computed from the given corpus
    min_{v_i, v_j, b_i, b_j}  Σ_{i,j} ( v_i^T v_j + b_i + b_j − log X_ij )^2

Drawback: this weighs all co-occurrences equally
Solution: add a weighting function

    min_{v_i, v_j, b_i, b_j}  Σ_{i,j} f(X_ij) ( v_i^T v_j + b_i + b_j − log X_ij )^2

Wishlist: f(X_ij) should be such that neither rare nor frequent words are overweighted

    f(x) = (x / x_max)^α   if x < x_max
         = 1               otherwise

where α can be tuned for a given dataset
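A sketch of this weighted least-squares objective on a toy co-occurrence matrix (random counts; the x_max and α values are placeholders, and for brevity one set of vectors plays both the word and context roles):

    import numpy as np

    V, k, x_max, alpha = 6, 3, 100.0, 0.75
    rng = np.random.default_rng(0)
    X = np.abs(rng.normal(scale=5.0, size=(V, V))) + 0.1   # toy positive co-occurrence values
    W = rng.normal(scale=0.1, size=(V, k))                 # word vectors v_i (rows)
    b = np.zeros(V)                                        # word biases b_i

    def f(x):
        """Weighting function: (x / x_max)^alpha below x_max, and 1 above it."""
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    def glove_loss():
        pred = W @ W.T + b[:, None] + b[None, :]           # v_i . v_j + b_i + b_j
        return np.sum(f(X) * (pred - np.log(X)) ** 2)

    print(glove_loss())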
Module 10.9: Evaluating word representations

How do we evaluate the learned word representations?
Semantic Relatedness
Ask humans to judge the relatedness between a pair of words, e.g.

    S_human(cat, dog) = 0.8

Compute the cosine similarity between the corresponding word vectors learned by the model

    S_model(cat, dog) = v_cat^T v_dog / (||v_cat|| ||v_dog||) = 0.7

Given a large number of such word pairs, compute the correlation between S_model and S_human, and compare different models
Model 1 is better than Model 2 if

    correlation(S_model1, S_human) > correlation(S_model2, S_human)
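A sketch of this evaluation on a few hypothetical word pairs (the human scores and vectors below are invented; real evaluations use datasets such as WordSim-353 together with the trained embeddings, and typically report Spearman correlation, so SciPy is assumed here):

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    vectors = {w: rng.normal(size=50) for w in ["cat", "dog", "car", "banana"]}
    human = {("cat", "dog"): 0.8, ("cat", "car"): 0.3, ("dog", "banana"): 0.1}

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    s_model = [cosine(vectors[a], vectors[b]) for a, b in human]
    s_human = list(human.values())
    rho, _ = spearmanr(s_model, s_human)
    print(rho)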
Synonym Detection
Given: a term and four candidate synonyms, e.g.

    Term: levied
    Candidates: {imposed, believed, requested, correlated}

Pick the candidate which has the largest cosine similarity with the term

    Synonym = argmax_{c∈C} cosine(v_term, v_c)

Compute the accuracy of different models and compare
Analogy
Semantic Analogy:  brother : sister :: grandson : ?
    Find the nearest neighbour of v_sister − v_brother + v_grandson
Syntactic Analogy:  work : works :: speak : ?
    Find the nearest neighbour of v_works − v_work + v_speak
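A sketch of the analogy test by vector arithmetic (the vocabulary and vectors are random toys; a real evaluation would use the trained embeddings and a benchmark such as the Google analogy set):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["brother", "sister", "grandson", "granddaughter", "car"]
    vectors = {w: rng.normal(size=50) for w in vocab}

    def analogy(a, b, c):
        """Return the word closest (by cosine) to v_b - v_a + v_c, excluding a, b, c."""
        target = vectors[b] - vectors[a] + vectors[c]
        def cos(w):
            v = vectors[w]
            return (v @ target) / (np.linalg.norm(v) * np.linalg.norm(target))
        return max((w for w in vocab if w not in (a, b, c)), key=cos)

    print(analogy("brother", "sister", "grandson"))   # with trained vectors: 'granddaughter'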
So which algorithm gives the best result?
Baroni et al. [2014] showed that predict models consistently outperform count models in all tasks.
Levy et al. [2015] do a much more thorough analysis (IMO) and show that good old SVD does better than prediction based models on similarity tasks but not on analogy tasks.
Module 10.10: Relation between SVD & word2Vec

The story ahead ...
Continuous bag of words model
Skip-gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!
[Figure: the same skip-gram network as before]

Recall that SVD does a matrix factorization of the co-occurrence matrix
Levy et al. [2015] show that word2vec also implicitly does a matrix factorization
What does this mean?
Recall that word2vec gives us W_context and W_word
Turns out that we can also show that

    M = W_word^T W_context,   where M_ij = PMI(w_i, c_j) − log(k)

and k is the number of negative samples (not to be confused with the embedding dimension)
So essentially, word2vec (skip-gram with negative sampling) factorizes a matrix M which is related to the PMI-based co-occurrence matrix (very similar to what SVD does)
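To see the connection concretely, a sketch that builds the shifted PMI matrix from toy counts and factorizes it with SVD (the counts, the chosen dimension and the way the singular values are split between the two factors are all illustrative choices):

    import numpy as np

    # Shifted PMI matrix M_ij = PMI(w_i, c_j) - log(k) from toy counts, then SVD.
    k_neg = 5                                            # number of negative samples
    counts = np.array([[10.0, 2.0, 1.0],                 # toy word-context co-occurrence counts
                       [ 2.0, 8.0, 3.0],
                       [ 1.0, 3.0, 6.0]])
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    M = np.log(p_wc / (p_w * p_c)) - np.log(k_neg)       # shifted PMI

    U, S, Vt = np.linalg.svd(M)
    dim = 2
    word_vectors = U[:, :dim] * np.sqrt(S[:dim])         # one common way to split the factors
    print(word_vectors)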
