
Headline Similarities

TF/IDF Demo
Problem
From input files with schema: Item_ID Country_ID Category_ID Title

Define and identify similar items given item ID


Define and find clusters of similar items

Options given:

TF/IDF analysis with a choice of distance metric: Euclidean, Cosine, Manhattan (L1), [Jaccard]
Specific Problem Analysis
1. Titles are limited in length and most users use the full length, so documents
are similar in size and there is no need to compensate for document size
2. Need not worry about the size of the set difference (word set 1 \ word set 2),
so no need for Jaccard
3. Word repetition is rare, so treat each title as a set of words with TF = 1/n,
where n is the number of words in the title (constant for every word in that title)
4. Use cosine similarity,
Cosine = Sum(words in both) TF1*TF2*IDF^2 /
(sqrt(Sum(words in 1) TF1^2*IDF^2) * sqrt(Sum(words in 2) TF2^2*IDF^2))
= Sum(words in both) IDF^2 /
(sqrt(Sum(words in 1) IDF^2) * sqrt(Sum(words in 2) IDF^2))
because TF is constant within each title and cancels.
1/sqrt(Sum IDF^2) can be precomputed for each Item_ID
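
A minimal sketch of the simplified cosine for set-of-words titles, with each norm taken as the square root of the sum of squared IDF values, as in the standard cosine definition. The slides do not define IDF, so the form log(N / document frequency) is an assumption, and the names `idf_weights` and `cosine` are hypothetical.

```python
import math

def idf_weights(titles):
    """Map each word to log(N / document frequency). The log form is an assumption."""
    n_docs = len(titles)
    doc_freq = {}
    for words in titles.values():
        for word in words:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def cosine(words1, words2, idf):
    """Cosine similarity of two word sets; TF is constant per title and cancels."""
    shared = sum(idf[w] ** 2 for w in words1 & words2)
    norm1 = math.sqrt(sum(idf[w] ** 2 for w in words1))
    norm2 = math.sqrt(sum(idf[w] ** 2 for w in words2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # every word occurs in all documents, so nothing distinguishes them
    return shared / (norm1 * norm2)

# Toy data (hypothetical): item ID -> set of title words
titles = {
    "a": {"aspen", "belly", "canfield"},
    "b": {"aspen", "belly", "canfield"},
    "c": {"aspen", "bibles"},
    "d": {"zebra"},
}
idf = idf_weights(titles)
sim_same = cosine(titles["a"], titles["b"], idf)      # identical titles
sim_partial = cosine(titles["a"], titles["c"], idf)   # one shared, common word
sim_none = cosine(titles["a"], titles["d"], idf)      # no shared words
```

Identical titles score 1 and disjoint titles score 0, which matches the 1.000 entries in the sample output for exact duplicates.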
Preparation
Precompute

1. Dict keyed by ID: set of title words, filtered to alphanumeric
2. Dict keyed by word: set of IDs whose title contains the word
3. Dict keyed by ID: 1/sqrt(Sum IDF^2)
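
The precomputation could look roughly like the sketch below. It assumes rows arrive as (item ID, title) pairs, that tokens are lowercased ASCII alphanumerics, and that IDF is log(N / document frequency). None of these details are stated in the slides, and `precompute` is a hypothetical name. The per-ID value stored is the inverse cosine norm, 1/sqrt(Sum IDF^2).

```python
import math
import re
from collections import defaultdict

def precompute(rows):
    """rows: iterable of (item_id, title). Returns the lookup dicts plus the IDF table."""
    words_by_id = {}
    ids_by_word = defaultdict(set)
    for item_id, title in rows:
        # Keep only alphanumeric tokens; lowercasing is an assumption.
        words = set(re.findall(r"[a-z0-9]+", title.lower()))
        words_by_id[item_id] = words
        for word in words:
            ids_by_word[word].add(item_id)

    n_docs = len(words_by_id)
    # Assumed IDF form: log(N / number of titles containing the word).
    idf = {w: math.log(n_docs / len(ids)) for w, ids in ids_by_word.items()}

    inv_norm_by_id = {}
    for item_id, words in words_by_id.items():
        total = sum(idf[w] ** 2 for w in words)
        inv_norm_by_id[item_id] = 1.0 / math.sqrt(total) if total > 0 else 0.0
    return words_by_id, ids_by_word, idf, inv_norm_by_id

# Toy rows (hypothetical)
rows = [("1", "Red apple, pie!"), ("2", "red pear"), ("3", "green pear")]
words_by_id, ids_by_word, idf, inv_norm_by_id = precompute(rows)
```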

Generated sample input

Set of 10k documents of up to 80 characters each
Skewed toward lexicographically earlier words (starting with a, b, c) to distinguish common words and less common words in set
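
One way such a skewed sample could be generated is sketched below. The 1/(rank+1) weighting over the sorted vocabulary, the zero-padded 8-digit IDs, and the name `make_titles` are all assumptions for illustration. The slides do not say how the skew was produced.

```python
import random

def make_titles(vocab, n_titles, seed=0, max_chars=80, max_draws=50):
    """Generate titles of distinct words, at most max_chars long, favoring early words."""
    rng = random.Random(seed)
    ordered = sorted(vocab)
    # Weight 1/(rank+1): words earlier in sorted order are drawn more often (assumed skew).
    weights = [1.0 / (rank + 1) for rank in range(len(ordered))]
    titles = {}
    for n in range(n_titles):
        chosen, length = [], 0
        for _ in range(max_draws):  # bounded, so a tiny vocabulary cannot loop forever
            word = rng.choices(ordered, weights=weights, k=1)[0]
            if word in chosen:
                continue  # titles are sets of words, so skip repeats
            extra = len(word) + (1 if chosen else 0)  # +1 for the separating space
            if length + extra > max_chars:
                break
            chosen.append(word)
            length += extra
        titles[f"{n:08d}"] = " ".join(sorted(chosen))
    return titles

# Small hypothetical vocabulary; a real run would use a full word list and n_titles=10000.
vocab = ["adulterated", "aspen", "belly", "beriberi", "blockish",
         "bondwoman", "canfield", "malefactions", "zebra"]
titles = make_titles(vocab, n_titles=5)
```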
Find Top k similar Items for ID
Get the IDs of items sharing at least one word with the title
Select the top k with a heap (heapq library), O(n log k), where n counts only
the IDs with a word in common
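
A sketch of the top-k lookup, assuming the precomputed dicts described under Preparation and an IDF of log(N / document frequency). The function name `top_k_similar` and the toy data are hypothetical. `heapq.nlargest` keeps a heap of size k, which gives the O(n log k) bound.

```python
import heapq
import math

def top_k_similar(item_id, k, words_by_id, ids_by_word, idf, inv_norm_by_id):
    """Return up to k (similarity, other_id) pairs, most similar first."""
    shared = {}
    for word in words_by_id[item_id]:
        weight = idf[word] ** 2
        for other in ids_by_word[word]:  # only IDs sharing this word are touched
            if other != item_id:
                shared[other] = shared.get(other, 0.0) + weight
    scale = inv_norm_by_id[item_id]
    scored = ((s * scale * inv_norm_by_id[o], o) for o, s in shared.items())
    return heapq.nlargest(k, scored)

# Toy data (hypothetical), built the same way as the precomputed dicts
words_by_id = {
    "1": {"aspen", "belly", "canfield"},
    "2": {"aspen", "belly", "canfield"},
    "3": {"aspen", "bibles"},
    "4": {"zebra"},
}
ids_by_word = {}
for item_id, words in words_by_id.items():
    for word in words:
        ids_by_word.setdefault(word, set()).add(item_id)
idf = {w: math.log(len(words_by_id) / len(ids)) for w, ids in ids_by_word.items()}
inv_norm_by_id = {
    i: 1.0 / math.sqrt(sum(idf[w] ** 2 for w in ws)) for i, ws in words_by_id.items()
}

top = top_k_similar("1", 2, words_by_id, ids_by_word, idf, inv_norm_by_id)
```

Here item "4" never enters the candidate set because it shares no word with item "1".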
Sample Output
Input itemID: item words
00002000: adulterated aspen belly beriberi blockish bondwoman canfield malefactions
Similar itemID: (cosine similarity) item words
|- 00006034: (0.376) adulterated applique archaizing balalaika belabours beriberi bibles canfield
|- 00006538: (0.376) adulterated applique archaizing balalaika belabours beriberi bibles canfield
|- 00009158: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00007371: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00007426: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00009646: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00003585: (1.000) adulterated aspen belly beriberi blockish bondwoman canfield malefactions
|- 00003440: (1.000) adulterated aspen belly beriberi blockish bondwoman canfield malefactions
|- 00003908: (1.000) adulterated aspen belly beriberi blockish bondwoman canfield malefactions
|- 00007631: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
Clustering
1. Idea 1: Traverse as a graph, following the k most similar items to form a cluster. Very slow.
2. Idea 2: Traverse the set of words.
Ignore common words with IDF < cutoff (an optimization in the absence of parallel
programming)
For each word, traverse pairs ID1, ID2 from hashmap[word]
If ID1 != ID2, increment the cosine similarity of the pair by
IDF(word)^2/IDFsum(ID1)/IDFsum(ID2),
where IDFsum(ID) is the cosine normalization for ID (the inverse of the value precomputed per Item_ID)
3. Idea 2 runs in ~3 seconds for a 10k document example file
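
Idea 2 might be sketched as follows. The slides do not say how connections become clusters, so two things here are assumptions: a pair counts as a connection when its accumulated similarity reaches `min_similarity`, and clusters are the connected components of those connections (found with a small union-find). Each unordered pair is visited once per shared word rather than as both (ID1, ID2) and (ID2, ID1). All function names are hypothetical.

```python
import math
from itertools import combinations

def pair_similarities(ids_by_word, idf, inv_norm_by_id, idf_cutoff):
    """Accumulate cosine similarity per ID pair, skipping words with IDF < cutoff."""
    sims = {}
    for word, ids in ids_by_word.items():
        if idf[word] < idf_cutoff:
            continue  # common word: many pairs, little signal
        weight = idf[word] ** 2
        for id1, id2 in combinations(sorted(ids), 2):
            inc = weight * inv_norm_by_id[id1] * inv_norm_by_id[id2]
            sims[(id1, id2)] = sims.get((id1, id2), 0.0) + inc
    return sims

def clusters_from(sims, min_similarity):
    """Connected components over pairs with similarity >= min_similarity (assumed rule)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (id1, id2), sim in sims.items():
        if sim >= min_similarity:
            parent[find(id1)] = find(id2)
    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())

# Toy data (hypothetical)
words_by_id = {
    "1": {"aspen", "belly", "canfield"},
    "2": {"aspen", "belly", "canfield"},
    "3": {"aspen", "bibles"},
    "4": {"yak", "zebra"},
    "5": {"yak", "zebra"},
}
ids_by_word = {}
for item_id, words in words_by_id.items():
    for word in words:
        ids_by_word.setdefault(word, set()).add(item_id)
idf = {w: math.log(len(words_by_id) / len(ids)) for w, ids in ids_by_word.items()}
inv_norm_by_id = {
    i: 1.0 / math.sqrt(sum(idf[w] ** 2 for w in ws)) for i, ws in words_by_id.items()
}

sims = pair_similarities(ids_by_word, idf, inv_norm_by_id, idf_cutoff=0.6)
clusters = clusters_from(sims, min_similarity=0.5)
```

Because skipped words still contribute to each item's norm, the accumulated value is a lower bound on the full cosine similarity. That is the cost of the cutoff optimization.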
Sample Output for Clustering
204 clusters generated in 3.01s
2378 connections found
Cluster 1 words:
anchorite, apparition, augend, backsliders, blotting, bumblers, corruptible, denazified
Cluster 2 words:
aerosols, arabesk, atheneum, bibles, bilious, bortz, cardoon, childlike, curriery
Cluster 3 words:
addressability, alexandrine, antiques, balalaika, bedevilling, frolicky, futural
Cluster 4 words:
adulterated, annoyances, arabesk, areola, ascend, battlefields, chatted, crinkly
Cluster 5 words:
applique, attaches, backsliders, blahs, calliopes, circulates, constraints
Cluster 6 words:
amasser, amiss, anchorite, aouads, approbations, autographs, begin, calliopes, heats
Cluster 7 words:
ahchoo, alleviator, amiss, aspen, brownstone, budgie, centiliter, coagulators, doubtable
Cluster 8 words:
amasser, bantling, bioresearch, bulgurs, bunter, canaanite, disseminated, misprinting
