
Frontiers of Computational Journalism

Columbia Journalism School

Week 3: Document Topic Modeling (September 24, 2012)

Week 3: Document Topic Modeling

Vector Space Model

Cosine Distance

Tf-idf

Topic Models


Vector representation for documents

As before, we want to find numerical “features” that describe the document.

An old idea, going back to proto-search engine research by Luhn at IBM in the 1950s.

- H. P. Luhn, A Statistical Approach to Mechanized Encoding and Searching of Literary Information, 1957

Turns out features = words works fine

Encode each document as the list of words it contains.

Dimensions = vocabulary of document set.

Value on each dimension = # of times word appears in document

Example

D1 = “I like databases”
D2 = “I hate hate databases”

        I   like   hate   databases
D1      1    1      0        1
D2      1    0      2        1

Each row = document vector
All rows = term-document matrix
Individual entry = tf(t, d) = “term frequency”

Aka “Bag of words” model

Throws out word order.

e.g. “soldiers shot civilians” and “civilians shot soldiers” encoded identically.
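The bag-of-words encoding above can be sketched in a few lines of Python. This is a minimal illustration; `vectorize` is a hypothetical helper name, not part of any library:

```python
from collections import Counter

def vectorize(doc, vocab):
    """Encode a document as term counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

docs = ["I like databases", "I hate hate databases"]
vocab = sorted({w for d in docs for w in d.lower().split()})

d1, d2 = (vectorize(d, vocab) for d in docs)

# Word order is discarded: these two sentences encode identically.
wv = ["civilians", "shot", "soldiers"]
a = vectorize("soldiers shot civilians", wv)
b = vectorize("civilians shot soldiers", wv)
assert a == b
```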

Week 3: Document Topic Modeling

Vector Space Model

Cosine Distance

Tf-idf

Topic Models

Distance function

Useful for:

- clustering documents
- finding docs similar to an example
- matching a search query

Basic idea: look for overlapping terms

Cosine similarity

Given document vectors a, b define

similarity(a, b) = a · b

If each word occurs exactly once in each document, equivalent to counting overlapping words.

Note: not a distance function, as similarity increases when documents are… similar. (What part of the definition of a distance function is violated here?)

Problem: long documents always win

Let a = “This car runs fast.” Let b = “My car is old. I want a new car, a shiny car” Let query = “fast car”

        this  car  runs  fast  my  is  old  I  want  a  new  shiny
a        1    1    1     1     0   0   0    0   0    0   0    0
b        0    3    0     0     1   1   1    1   1    1   1    1
q        0    1    0     1     0   0   0    0   0    0   0    0

Problem: long documents always win

similarity( a,q ) = 1*1 [car] + 1*1 [fast] = 2 similarity( b,q ) = 3*1 [car] + 0*1 [fast] = 3

Longer document more “similar”, by virtue of repeating words.
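A quick check of these numbers with plain dot products over the count vectors from the table (a sketch; the variable names are mine):

```python
a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # "This car runs fast."
b = [0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # "My car is old. I want a new car, a shiny car"
q = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # query "fast car"

def dot(u, v):
    """Unnormalized overlap between two term-count vectors."""
    return sum(x * y for x, y in zip(u, v))

print(dot(a, q))  # 2 -- car + fast
print(dot(b, q))  # 3 -- car counted three times
# The longer document b "wins" despite never mentioning "fast".
```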

Normalize document vectors

similarity(a, b) = (a · b) / (|a| |b|)

= cos(θ), returns result in [0, 1]

Normalized query example

        this  car  runs  fast  my  is  old  I  want  a  new  shiny
a        1    1    1     1     0   0   0    0   0    0   0    0
b        0    3    0     0     1   1   1    1   1    1   1    1
q        0    1    0     1     0   0   0    0   0    0   0    0

similarity(a, q) = 2 / (√4 · √2) = 1/√2 ≈ 0.707

similarity(b, q) = 3 / (√17 · √2) ≈ 0.514
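Normalizing by vector length reverses the ranking, matching the numbers above (a pure-Python sketch, no external libraries):

```python
import math

def cosine_similarity(u, v):
    """Dot product of length-normalized vectors: cos(theta), in [0, 1] for count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
q = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(round(cosine_similarity(a, q), 3))  # 0.707 -- now a beats b
print(round(cosine_similarity(b, q), 3))  # 0.514
```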

Cosine similarity

cos(θ) = similarity(a, b) = (a · b) / (|a| |b|)
Cosine distance (finally)

dist(a, b) ≡ 1 − (a · b) / (|a| |b|)

Week 3: Document Topic Modeling

Vector Space Model

Cosine Distance

Tf-idf

Topic Models

Problem: common words

We want to weight words that “discriminate” among documents.

Stopwords: if all documents contain “the,” are all documents similar?

Common words: if most documents contain “car” then car doesn’t tell us much about (contextual) similarity.

Context matters

[Diagram: in a General News corpus, documents containing “car” differ from those that do not; in a Car Reviews corpus, every document contains “car”, so the word discriminates nothing.]

Document Frequency

Idea: de-weight common words
Common = appears in many documents

df(t, D) = |{d ∈ D : t ∈ d}| / |D|

“document frequency” = fraction of docs containing term

Inverse Document Frequency

Invert (so more common = smaller weight) and take log

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

Tf-idf

Multiply term frequency by inverse document frequency

tfidf(t, d, D) = tf(t, d) · idf(t, D)

= n(t, d) · log( |D| / n(t, D) )

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t
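Putting the pieces together, a minimal tf-idf weighting over a toy corpus (a sketch of the formula above; the corpus and function names are mine):

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """tfidf(t, d, D) = n(t, d) * log(|D| / n(t, D)); doc is a token list."""
    tf = Counter(doc).get(term, 0)               # n(t, d)
    df = sum(1 for d in corpus if term in d)     # n(t, D)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [
    ["the", "car", "runs", "fast"],
    ["the", "car", "is", "old"],
    ["the", "plane", "banked", "left"],
]

# "the" appears in every document, so its idf -- and hence tf-idf -- is zero.
print(tfidf("the", corpus[0], corpus))   # 0.0
# "fast" appears in only one of three documents, so it gets a high weight.
print(tfidf("fast", corpus[0], corpus))  # log(3) ~= 1.0986
```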

Salton’s description of Tf-idf

- from Salton, Wong, Yang, A Vector Space Model for Automatic Indexing, 1975

Tf-idf

nj-senator-menendez corpus, Overview sample files
color = human tags generated from Tf-idf clusters

Cluster Hypothesis

“documents in the same cluster behave similarly with respect to relevance to information needs”

- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement – but the crucial link between human semantics and mathematical properties.

Articulated as early as 1971, has been shown to hold at web scale, widely assumed.

Bag of words + Tf-idf hard to beat

Practical win: good precision-recall metrics in tests with human-tagged document sets.

Still the dominant text indexing scheme used today. (Lucene, FAST, Google…) Many variants.

Some, but not much, theory to explain why this works. (E.g. why that particular idf formula? why doesn’t indexing bigrams improve performance?)

Collectively:

the vector space document model

Week 3: Document Topic Modeling

Vector Space Model

Cosine Distance

Tf-idf

Topic Models

Problem Statement

Can the computer tell us the “topics” in a document set? Can the computer organize the documents by “topic”?

Topic Modeling approaches

Find clusters directly in Tf-idf space

Fast and simple
This is what Overview does

Reduce dimensionality of space, so each dimension is a topic

Potentially better results, less sensitive to “noise”
LSI, PLSI, LDA

Polysemy and Synonymy

Polysemy: same word, different meanings

“bank” the institution vs. “bank” an airplane

Synonymy: different word, same meaning

Bag of words model cannot differentiate polysemes, and incorrectly splits synonyms.

Contextual information

“Word sense disambiguation” research has shown that surrounding words can be used to determine sense:

“The bank has my money”
“The plane banked sharply left”

Contextual information

Synonym detection: similar usages imply similar meanings:

“I bought a couch” “We need to buy more stamps”

“I purchased a couch” “We need to purchase more stamps”

Latent Semantic Analysis

Basic idea: factor the term-document matrix, reconstruct a low-rank approximation from the leading factors.

(what?)

Given term-document matrix X, factor by singular value decomposition: X = U D Vᵀ

(what?)

Latent Semantic Analysis

U = “words to concepts”
D = diagonal matrix, “concept strength in document set”
V = “concepts to documents”

Polysemy and synonymy

Synonymy = many rows in U (many words) map to same column in D (same concept)

Polysemy = same row in U (word) maps to multiple columns in D (different concepts)

Throw out trailing concepts

Technically: low-rank approximation to X

Intuition: these concepts don’t contribute much to term weights. LSA tends to do best on retrieval metrics with ~100 factors.
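The low-rank reconstruction can be sketched with NumPy’s SVD. This is illustrative only: the tiny matrix and k=1 stand in for a real term-document matrix and the ~100 factors used in practice:

```python
import numpy as np

# Toy term-document matrix X: rows = terms, columns = documents.
X = np.array([
    [1.0, 1.0, 0.0],   # "car"
    [1.0, 0.0, 0.0],   # "engine"
    [0.0, 0.0, 1.0],   # "senate"
])

# Factor X = U @ diag(s) @ Vt by singular value decomposition.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Throw out trailing concepts: keep only the k leading factors.
k = 1
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# X_k is the best rank-k approximation to X in the least-squares sense.
print(np.linalg.matrix_rank(X_k))  # 1
```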

Probabilistic Latent Semantic Indexing

Fixed set of topics Z
For each topic z ∈ Z, there’s a distribution of words p(w|z)

Assume each document generated by:

1. Choose mixture of topics p(z|d)
2. Choose each word from topic mixture, p(w|z) p(z|d)

Probabilistic Latent Semantic Indexing

PLSI in practice

Learn topics p(w|z) and document coordinates p(z|d) by maximizing the probability p(w|d) that they produced the observed X, aka the “likelihood”

p(w|d) = Σ_{z ∈ Z} p(w|z) p(z|d)
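In matrix form, the mixture p(w|d) = Σ_z p(w|z) p(z|d) is just a product of two stochastic matrices. A sketch with made-up numbers; real PLSI fits these matrices to the data rather than assuming them:

```python
import numpy as np

# p(w|z): rows are words, columns are topics; each column sums to 1.
P_wz = np.array([
    [0.5, 0.0],   # "car"
    [0.5, 0.1],   # "money"
    [0.0, 0.9],   # "bank"
])

# p(z|d): rows are topics, columns are documents; each column sums to 1.
P_zd = np.array([
    [0.8, 0.2],
    [0.2, 0.8],
])

P_wd = P_wz @ P_zd       # p(w|d) for every word/document pair
print(P_wd[:, 0])        # word distribution of document 0
print(P_wd.sum(axis=0))  # each column is itself a valid distribution: [1. 1.]
```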

Extracted topics p(w|z)

[Figure: for each topic z, the most probable words under p(w|z)]

Latent Dirichlet Allocation

Same underlying idea, but now generate a document by:

1. For each doc d, choose mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose word from p(w|z)

Difference: each word in doc can come from a different topic.
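The three generative steps above can be sketched as a toy sampler. This is illustrative only: real LDA places Dirichlet priors on these distributions and infers them from data, rather than sampling from a fixed hand-written model:

```python
import random

random.seed(0)

# p(w|z): a distribution over words for each topic z (made-up numbers).
topics = {
    "finance":  {"bank": 0.5, "money": 0.4, "loan": 0.1},
    "aviation": {"plane": 0.5, "bank": 0.2, "left": 0.3},
}

def generate_doc(topic_mixture, n_words):
    """Given p(z|d), draw a topic per word, then draw each word from p(w|z)."""
    words = []
    for _ in range(n_words):
        z = random.choices(list(topic_mixture), weights=topic_mixture.values())[0]
        w_dist = topics[z]
        words.append(random.choices(list(w_dist), weights=w_dist.values())[0])
    return words

# A document that is 70% finance, 30% aviation; each word may use a different topic.
print(generate_doc({"finance": 0.7, "aviation": 0.3}, 8))
```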

Dimensionality reduction

Output of LSA, PLSI, LDA is a vector of much lower dimension for each document.

Dimensions are “concepts” or “topics” instead of words.

Can measure cosine distance, cluster, etc. in this new space.

Which method is best?

LSA, PLSI, LDA all improve performance in information retrieval applications (area under precision/recall curve)

But are they useful for journalism? Not clear.

The “smoothing” they apply may destroy interesting outliers.

Homework: let’s find out.