TF/IDF Demo

Problem
From input files with schema: Item_ID, Country_ID, Category_ID, Title
Define and identify similar items given an item ID
Define and find clusters of similar items

Options given
TF/IDF analysis with the distance metric chosen from:
Euclidean, Cosine, Manhattan (L1), [Jaccard]

Specific Problem Analysis
1. Titles are limited in length and most users use the full length, so documents are similar in size and there is no need to compensate for document size
2. Need not worry about the size of (word set 1 \ word set 2), so no need for Jaccard
3. Word repetition is rare, so treat each title as a set of words (bag of words without counts) with TF = 1/n, where n is the number of words in the title (the same constant for every word in that title)
4. Use cosine similarity:
   Cosine = Sum(words in both) TF1*TF2*IDF^2 / ( sqrt(Sum(words in 1) TF1^2*IDF^2) * sqrt(Sum(words in 2) TF2^2*IDF^2) )
          = Sum(words in both) IDF^2 / ( sqrt(Sum(words in 1) IDF^2) * sqrt(Sum(words in 2) IDF^2) )
   The constant TF factors cancel between numerator and denominator.
   1/sqrt(Sum IDF^2) can be precomputed for each Item_ID

Preparation
Precompute:
1. Dict keyed by ID of sets of title words, filtered to alphanumeric
2. Dict keyed by word of sets of IDs containing the word
3. Dict keyed by ID of 1/sqrt(Sum IDF^2)

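The three precomputed dictionaries and the simplified cosine can be sketched as follows. This is a minimal sketch, not the demo's actual code: the names `build_indexes` and `cosine` are hypothetical, the IDF variant log(N / document frequency) is an assumption (the slides do not say which variant was used), and lowercasing the titles is also an assumption.

```python
import math
import re
from collections import defaultdict


def build_indexes(titles):
    """titles: dict mapping item ID -> title string."""
    # 1. ID -> set of alphanumeric title words (lowercasing is an assumption)
    words_by_id = {
        item_id: set(re.findall(r"[a-z0-9]+", title.lower()))
        for item_id, title in titles.items()
    }
    # 2. word -> set of IDs whose title contains the word
    ids_by_word = defaultdict(set)
    for item_id, words in words_by_id.items():
        for word in words:
            ids_by_word[word].add(item_id)
    # Assumed IDF variant: log(N / document frequency)
    n_docs = len(titles)
    idf = {w: math.log(n_docs / len(ids)) for w, ids in ids_by_word.items()}
    # 3. ID -> 1 / sqrt(sum of IDF^2 over the title's words)
    inv_norm = {}
    for item_id, words in words_by_id.items():
        total = sum(idf[w] ** 2 for w in words)
        inv_norm[item_id] = 1.0 / math.sqrt(total) if total > 0 else 0.0
    return words_by_id, ids_by_word, idf, inv_norm


def cosine(id1, id2, words_by_id, idf, inv_norm):
    # Only shared words contribute; the constant TF factors have cancelled.
    shared = words_by_id[id1] & words_by_id[id2]
    return sum(idf[w] ** 2 for w in shared) * inv_norm[id1] * inv_norm[id2]


titles = {
    "a": "red apple pie",
    "b": "red apple pie",
    "c": "green apple tart",
    "d": "blue car",
}
words_by_id, ids_by_word, idf, inv_norm = build_indexes(titles)
print(cosine("a", "b", words_by_id, idf, inv_norm))  # identical titles: close to 1
print(cosine("a", "c", words_by_id, idf, inv_norm))  # one shared word: between 0 and 1
print(cosine("a", "d", words_by_id, idf, inv_norm))  # nothing shared: 0
```

Storing the inverse norm per ID means each pairwise similarity costs one set intersection and two multiplications.
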
Generated sample input
Set of 10k documents of up to 80 characters each
Skewed toward lexicographically earlier words (starting with a, b, c) so that the set contains both common and less common words

Find Top k similar Items for ID
Get IDs containing words in title
Heap-based top-k selection with the heapq library, O(n log k), where n counts only the IDs with a word in common

Sample Output
Input itemID: item words
00002000: adulterated aspen belly beriberi blockish bondwoman canfield malefactions
Similar itemID: (cosine similarity) item words
|- 00006034: (0.376) adulterated applique archaizing balalaika belabours beriberi bibles canfield
|- 00006538: (0.376) adulterated applique archaizing balalaika belabours beriberi bibles canfield
|- 00009158: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00007371: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00007426: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00009646: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00003585: (1.000) adulterated aspen belly beriberi blockish bondwoman canfield malefactions
|- 00003440: (1.000) adulterated aspen belly beriberi blockish bondwoman canfield malefactions
|- 00003908: (1.000) adulterated aspen belly beriberi blockish bondwoman canfield malefactions
|- 00007631: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman

Clustering
1. Idea 1: Traverse as a graph, following the k most similar items to form a cluster. Very slow.
2. Idea 2: Traverse the set of words.
   Ignore common words with IDF < cutoff (optimization in the absence of parallel programming)
   Traverse ID1, ID2 from hashmap[word]
   If ID1 != ID2, increment cosine similarity by IDF(word)^2/IDFsum(ID1)/IDFsum(ID2)
   (IDFsum(ID) is the precomputed norm sqrt(Sum IDF^2) over the words of that ID)
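Idea 2 can be sketched as below. This is a minimal sketch under stated assumptions, not the demo's implementation: `pair_similarities` and `clusters` are hypothetical names, IDF is assumed to be log(N / document frequency), and each unordered pair is visited once rather than both (ID1, ID2) and (ID2, ID1). The slides do not say how connections become clusters; grouping IDs linked by a similarity above a threshold (connected components via union-find) is an assumption made here for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_similarities(words_by_id, idf_cutoff=0.5):
    # Inverted index: word -> IDs containing it (hashmap[word] in the slides)
    ids_by_word = defaultdict(set)
    for item_id, words in words_by_id.items():
        for w in words:
            ids_by_word[w].add(item_id)
    n_docs = len(words_by_id)
    # Assumed IDF variant: log(N / document frequency)
    idf = {w: math.log(n_docs / len(ids)) for w, ids in ids_by_word.items()}
    # IDFsum(ID): norm over all words of the title
    norm = {
        i: math.sqrt(sum(idf[w] ** 2 for w in ws))
        for i, ws in words_by_id.items()
    }
    sims = defaultdict(float)
    for w, ids in ids_by_word.items():
        # Skip common words; skipping makes the result an underestimate
        if idf[w] <= 0 or idf[w] < idf_cutoff:
            continue
        for id1, id2 in combinations(sorted(ids), 2):
            sims[(id1, id2)] += idf[w] ** 2 / norm[id1] / norm[id2]
    return sims


def clusters(sims, threshold):
    # Assumption: a cluster is a connected component of pairs >= threshold
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), s in sims.items():
        if s >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())


words_by_id = {
    1: {"red", "apple", "pie"},
    2: {"red", "apple", "pie"},
    3: {"blue", "car", "wash"},
    4: {"blue", "car", "wax"},
    5: {"green", "tea"},
}
sims = pair_similarities(words_by_id)
found = clusters(sims, threshold=0.3)
print(sorted(sorted(c) for c in found))
```

The cost is driven by the size of each word's ID set, which is why dropping low-IDF (common) words matters: a word present in m titles contributes m*(m-1)/2 pairs.
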
Idea 2 runs in ~3 seconds for a 10k-document example file

Sample Output for Clustering
204 clusters generated in 3.01s
2378 connections found
Cluster 1 words: anchorite, apparition, augend, backsliders, blotting, bumblers, corruptible, denazified
Cluster 2 words: aerosols, arabesk, atheneum, bibles, bilious, bortz, cardoon, childlike, curriery
Cluster 3 words: addressability, alexandrine, antiques, balalaika, bedevilling, frolicky, futural
Cluster 4 words: adulterated, annoyances, arabesk, areola, ascend, battlefields, chatted, crinkly
Cluster 5 words: applique, attaches, backsliders, blahs, calliopes, circulates, constraints
Cluster 6 words: amasser, amiss, anchorite, aouads, approbations, autographs, begin, calliopes, heats
Cluster 7 words: ahchoo, alleviator, amiss, aspen, brownstone, budgie, centiliter, coagulators, doubtable
Cluster 8 words: amasser, bantling, bioresearch, bulgurs, bunter, canaanite, disseminated, misprinting
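
For the earlier "Find Top k similar Items for ID" step, a minimal sketch of the heapq-based lookup follows. The names `build` and `top_k_similar` are hypothetical, the IDF variant log(N / document frequency) is an assumption, and `heapq.nlargest` is used here as one way to get the O(n log k) selection; the demo may have managed its heap differently.

```python
import heapq
import math
from collections import defaultdict


def build(words_by_id):
    # Inverted index, IDF table, and per-ID norm (precomputed once)
    ids_by_word = defaultdict(set)
    for item_id, words in words_by_id.items():
        for w in words:
            ids_by_word[w].add(item_id)
    n_docs = len(words_by_id)
    # Assumed IDF variant: log(N / document frequency)
    idf = {w: math.log(n_docs / len(ids)) for w, ids in ids_by_word.items()}
    norm = {
        i: math.sqrt(sum(idf[w] ** 2 for w in ws))
        for i, ws in words_by_id.items()
    }
    return ids_by_word, idf, norm


def top_k_similar(query_id, words_by_id, ids_by_word, idf, norm, k=10):
    # Accumulate the numerator only for IDs sharing a word with the query
    shared_weight = defaultdict(float)
    for w in words_by_id[query_id]:
        for other in ids_by_word[w]:
            if other != query_id:
                shared_weight[other] += idf[w] ** 2
    # A positive weight implies both norms are positive, so division is safe
    scored = (
        (weight / (norm[query_id] * norm[other]), other)
        for other, weight in shared_weight.items()
        if weight > 0
    )
    # nlargest keeps at most k entries at a time and returns them
    # sorted from most to least similar
    return heapq.nlargest(k, scored)


words_by_id = {
    "q": {"aspen", "belly", "canfield"},
    "x": {"aspen", "belly", "canfield"},
    "y": {"aspen", "bibles"},
    "z": {"zebra"},
}
ids_by_word, idf, norm = build(words_by_id)
result = top_k_similar("q", words_by_id, ids_by_word, idf, norm, k=2)
for score, item_id in result:
    print(f"|- {item_id}: ({score:.3f}) {' '.join(sorted(words_by_id[item_id]))}")
```

Items with no word in common with the query (such as "z" here) are never scored, which keeps n small.
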