
Frontiers of Computational Journalism

Columbia Journalism School

Week 3: Document Topic Modeling
September 24, 2012

Week 3: Document Topic Modeling

Vector Space Model

Cosine Distance

Tf-idf

Topic Models


Vector representation for documents

As before, we want to find numerical “features” that describe the document.

An old idea, going back to proto-search engine research by Luhn at IBM in the 1950s.

- H. P. Luhn, A Statistical Approach to Mechanized Encoding and Searching of Literary Information, 1957

Turns out features = words works fine

Encode each document as the list of words it contains.

Dimensions = vocabulary of document set.

Value on each dimension = # of times word appears in document

Example

D1 = “I like databases”
D2 = “I hate hate databases”

         I   like   hate   databases
   D1    1    1      0        1
   D2    1    0      2        1

Each row = document vector
All rows = term-document matrix
Individual entry = tf(t, d) = “term frequency”
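To make the encoding concrete, here is a minimal sketch using scikit-learn’s CountVectorizer (the library choice is mine, not the lecture’s):

```python
# Minimal bag-of-words sketch: rows are document vectors, columns
# are vocabulary words, entries are tf(t, d).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like databases", "I hate hate databases"]

# permissive token pattern so one-letter words like "I" survive
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)           # term-document matrix

print(vectorizer.get_feature_names_out())    # ['databases' 'hate' 'i' 'like']
print(X.toarray())                           # [[1 0 1 1]
                                             #  [1 2 1 0]]
```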

Aka “Bag of words” model

Throws out word order.

e.g. “soldiers shot civilians” and “civilians shot soldiers” encoded identically.

Week 3: Document Topic Modeling

Vector Space Model

Cosine Distance

Tf-idf

Topic Models

Distance function

Useful for:

- clustering documents
- finding docs similar to an example
- matching a search query

Basic idea: look for overlapping terms

Cosine similarity

Given document vectors a, b, define:

similarity(a, b) ≡ a · b

If each word occurs exactly once in each document, equivalent to counting overlapping words.

Note: not a distance function, as similarity increases when documents are… similar. (What part of the definition of a distance function is violated here?)
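As a quick sketch (my code, not from the slides), the unnormalized measure is just a dot product, which for 0/1 vectors counts shared words:

```python
import numpy as np

# D1/D2 vectors from the earlier example: [databases, hate, i, like]
a = np.array([1, 0, 1, 1])    # "I like databases"
b = np.array([1, 2, 1, 0])    # "I hate hate databases"

print(np.dot(a, b))           # 2 -- shared terms "I" and "databases"
```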

Problem: long documents always win

Let a = “This car runs fast.”
Let b = “My car is old. I want a new car, a shiny car”
Let query = “fast car”

 

        this  car  runs  fast  my  is  old  I  want  a  new  shiny
   a     1     1    1     1    0   0   0    0   0    0   0     0
   b     0     3    0     0    1   1   1    1   1    1   1     1
   q     0     1    0     1    0   0   0    0   0    0   0     0

Problem: long documents always win

similarity(a, q) = 1·1 [car] + 1·1 [fast] = 2
similarity(b, q) = 3·1 [car] + 0·1 [fast] = 3

Longer document more “similar”, by virtue of repeating words.

Normalize document vectors

similarity(a, b) ≡ (a · b) / (|a| |b|) = cos(θ)

Returns a result in [0, 1].

Normalized query example

 

        this  car  runs  fast  my  is  old  I  want  a  new  shiny
   a     1     1    1     1    0   0   0    0   0    0   0     0
   b     0     3    0     0    1   1   1    1   1    1   1     1
   q     0     1    0     1    0   0   0    0   0    0   0     0

similarity(a, q) = 2 / (√4 · √2) = 1/√2 ≈ 0.707

similarity(b, q) = 3 / (√17 · √2) ≈ 0.514
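A short sketch (my code) reproducing these numbers with normalized vectors:

```python
# Cosine similarity normalizes away document length, so the short,
# relevant document a now beats the long, repetitive document b.
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

#             this car runs fast my  is old  I want  a new shiny
a = np.array([  1,  1,   1,   1,  0,  0,  0, 0,  0,  0,  0,   0])
b = np.array([  0,  3,   0,   0,  1,  1,  1, 1,  1,  1,  1,   1])
q = np.array([  0,  1,   0,   1,  0,  0,  0, 0,  0,  0,  0,   0])

print(cosine_similarity(a, q))   # 0.7071...
print(cosine_similarity(b, q))   # 0.5144...
```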

Cosine similarity

cos(θ) = similarity(a, b) ≡ (a · b) / (|a| |b|)

[Diagram: angle θ between document vectors a and b]

Cosine distance (finally)

dist(a, b) ≡ 1 − (a · b) / (|a| |b|)

Week 3: Document Topic Modeling

Vector Space Model

Cosine Distance

Tf-idf

Topic Models

Problem: common words

We want to weight words that “discriminate” among documents.

Stopwords: if all documents contain “the,” are all documents similar?

Common words: if most documents contain “car” then car doesn’t tell us much about (contextual) similarity.

Context matters

[Diagram: General News vs. Car Reviews corpora, each split into documents that contain “car” and documents that do not]

Document Frequency

Idea: de-weight common words.
Common = appears in many documents.

df(t, D) = |{d ∈ D : t ∈ d}| / |D|

“Document frequency” = fraction of docs containing the term.

Inverse Document Frequency

Invert (so more common = smaller weight) and take the log:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

Tf-idf

Multiply term frequency by inverse document frequency:

tfidf(t, d, D) = tf(t, d) · idf(t, D)
              = n(t, d) · log( |D| / n(t, D) )

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t
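A direct transcription of this formula (my sketch; production libraries such as scikit-learn’s TfidfVectorizer use smoothed variants of the same idea):

```python
import math

def tfidf(term, doc, corpus):
    """tfidf(t, d, D) = n(t, d) * log(|D| / n(t, D)), per the slide."""
    n_td = doc.count(term)                        # times t occurs in d
    n_tD = sum(1 for d in corpus if term in d)    # docs containing t
    if n_td == 0 or n_tD == 0:
        return 0.0
    return n_td * math.log(len(corpus) / n_tD)

corpus = [["i", "like", "databases"],
          ["i", "hate", "hate", "databases"]]

print(tfidf("hate", corpus[1], corpus))  # 2 * log(2/1) ≈ 1.386
print(tfidf("i", corpus[1], corpus))     # 1 * log(2/2) = 0.0 -- common word zeroed out
```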

Salton’s description of tf-idf

- from Salton, Wong, Yang, A Vector Space Model for Automatic Indexing, 1975

Tf-idf

[Figure: nj-senator-menendez corpus, Overview sample files; color = human tags generated from tf-idf clusters]

Cluster Hypothesis

“documents in the same cluster behave similarly with respect to relevance to information needs”

- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement, but the crucial link between human semantics and mathematical properties.

Articulated as early as 1971, it has been shown to hold at web scale, and is widely assumed.

Bag of words + tf-idf hard to beat

Practical win: good precision-recall metrics in tests with human-tagged document sets.

Still the dominant text indexing scheme used today (Lucene, FAST, Google…). Many variants.

Some, but not much, theory to explain why this works. (E.g., why that particular idf formula? Why doesn’t indexing bigrams improve performance?)

Collectively:

the vector space document model

Week 3: Document Topic Modeling

Vector Space Model

Cosine Distance

F-idf

Topic Models

Problem Statement

Can the computer tell us the “topics” in a document set? Can the computer organize the documents by “topic”?

Topic Modeling approaches

Find clusters directly in tf-idf space
- fast and simple
- this is what Overview does

Reduce the dimensionality of the space, so each dimension is a topic
- potentially better results, less sensitive to “noise”
- LSI, PLSI, LDA

Polysemy and Synonymy

Polysemy: same word, different meanings

“bank” the institution vs. “bank” an airplane

Synonymy: different word, same meaning

“buy” vs. “purchase”

Bag of words model cannot differentiate polysemes, and incorrectly splits synonyms.

Contextual information

“Word sense disambiguation” research has shown that surrounding words can be used to determine sense:

“The bank has my money”
“The plane banked sharply left”

Contextual information

Synonym detection: similar usages imply similar meanings:

“I bought a couch” “We need to buy more stamps”

“I purchased a couch” “We need to purchase more stamps”

Latent Semantic Analysis

Basic idea: factor the term-document matrix, reconstruct a low-rank approximation from the leading factors.

(what?)

Given the term-document matrix X, factor by singular value decomposition.

(what?)

Latent Semantic Analysis

X = U D Vᵀ

U = “words to concepts”
D = diagonal matrix, “concept strength in document set”
V = “concepts to documents”

Polysemy and synonymy

Synonymy = many rows in U (many words) map to the same column in D (same concept)

Polysemy = same row in U (one word) maps to multiple columns in D (different concepts)

Throw out trailing concepts

Technically: a low-rank approximation to X.

Intuition: these concepts don’t contribute much to the term weights.

LSA tends to do best on retrieval metrics with ~100 factors.
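A minimal LSA sketch along these lines, using scikit-learn’s TruncatedSVD (my library choice; "corpus.txt" is a hypothetical input file):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = open("corpus.txt").read().splitlines()   # hypothetical: one document per line

X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # docs x terms
k = min(100, X.shape[1] - 1)          # the slide's ~100 factors, capped for small corpora
svd = TruncatedSVD(n_components=k)
doc_concepts = svd.fit_transform(X)   # each row: a document in concept space
word_concepts = svd.components_       # each row: a concept as weights over words
```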

Probabilistic Latent Semantic Indexing

Fixed set of topics Z.
For each topic z ∈ Z, there is a distribution of words p(w|z).

Assume each document is generated by:

1. Choose a mixture of topics p(z|d)
2. Choose each word from the topic mixture, p(w|z)·p(z|d)


PLSI in practice

Start with the term-document matrix X = p(w|d).

Find:
- topics p(w|z)
- document coordinates p(z|d)

by maximizing the probability p(w|d) that they produced the observed X, aka the “likelihood”:

p(w|d) = ∑_{z ∈ Z} p(w|z) p(z|d)    for each d ∈ D
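For concreteness, a compact EM implementation of these updates (my sketch; dense numpy arrays, so only suitable for small corpora):

```python
import numpy as np

def plsi(X, n_topics, n_iter=100, seed=0):
    """EM for PLSI. X[d, w] holds raw term counts n(t, d)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    # random initialization of p(w|z) and p(z|d), rows normalized
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: p(z|d,w) proportional to p(w|z) p(z|d); shape (docs, words, topics)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate both factors from expected counts n(d,w) p(z|d,w)
        weighted = X[:, :, None] * resp
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d

# tiny demo on the D1/D2 counts: columns [databases, hate, i, like]
X = np.array([[1., 0., 1., 1.],
              [1., 2., 1., 0.]])
p_w_z, p_z_d = plsi(X, n_topics=2)
```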

Extracted topics p(w|z)

[Table: for each topic z, the words in the topic ranked by p(w|z)]

Latent Dirichlet Allocation

Same underlying idea, but now generate a document by:

1. For each doc d, choose a mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose the word from p(w|z)

Difference: each word in doc can come from a different topic.
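In practice LDA is rarely implemented by hand; a sketch using scikit-learn’s LatentDirichletAllocation (my library choice; "corpus.txt" is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = open("corpus.txt").read().splitlines()   # hypothetical: one document per line

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)                  # LDA takes raw counts, not tf-idf

lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(X)               # rows approximate p(z|d), one per doc

# top words per topic: the largest entries in each row of components_
words = counts.get_feature_names_out()
for z, row in enumerate(lda.components_):
    top = row.argsort()[-8:][::-1]
    print(z, [words[i] for i in top])
```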

Dimensionality reduction

Output of LSA, PLSI, LDA is a vector of much lower dimension for each document.

Dimensions are “concepts” or “topics” instead of words.

Can measure cosine distance, cluster, etc. in this new space.
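For example (my sketch, with hypothetical topic mixtures), cosine distance works the same way on the reduced vectors:

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# two hypothetical documents as 4-topic mixtures p(z|d)
d1 = np.array([0.7, 0.2, 0.1, 0.0])
d2 = np.array([0.6, 0.3, 0.0, 0.1])

print(cosine_distance(d1, d2))   # ~0.037: similar topic mixtures
```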

Which method is best?

LSA, PLSI, LDA all improve performance in information retrieval applications (area under the precision/recall curve).

But are they useful for journalism? Not clear.

The “smoothing” they apply may destroy interesting outliers.

Homework: let’s find out.