Andreas TF IDF Demo

Headline Similarities

TF/IDF Demo
Problem
Given input files with the schema: Item_ID, Country_ID, Category_ID, Title

1. Define and identify similar items for a given item ID
2. Define and find clusters of similar items

Options given:

TF/IDF analysis with a choice of distance metric: Euclidean, Cosine, Manhattan (L1), [Jaccard]
Specific Problem Analysis
1. Titles are limited in length and most users use the full length, so documents
are similar in size and there is no need to compensate for document size
2. The size of (word set 1 \ word set 2) is not a concern, so there is no need for Jaccard
3. Word repetition is rare, so use sets of words (bag of words) with TF = 1/n, where n is
the number of words in the title (a constant for each word in the title)
4. Use cosine similarity:
Cosine = Sum(words in both) TF1*TF2*IDF^2 /
(sqrt(Sum(words in 1) TF1^2*IDF^2) * sqrt(Sum(words in 2) TF2^2*IDF^2))
= Sum(words in both) IDF^2 /
(sqrt(Sum(words in 1) IDF^2) * sqrt(Sum(words in 2) IDF^2))
The constant TF factors 1/n1 and 1/n2 cancel between numerator and denominator.
1/sqrt(Sum IDF^2) can be precomputed for each Item_ID
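The simplified cosine above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's code: the slides do not define IDF, so `log(N / document frequency)` is assumed here, the function names `idf_table` and `cosine` are hypothetical, and the denominator uses the standard square-root norms so that identical titles score 1.

```python
import math

def idf_table(titles):
    """Map each word to an assumed IDF of log(N / document frequency).

    titles: dict of item_id -> set of words.
    """
    n_docs = len(titles)
    doc_freq = {}
    for words in titles.values():
        for word in words:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def cosine(words1, words2, idf):
    """Cosine similarity of two word sets; the constant TF = 1/n cancels out."""
    shared = sum(idf[w] ** 2 for w in words1 & words2)
    norm1 = math.sqrt(sum(idf[w] ** 2 for w in words1))
    norm2 = math.sqrt(sum(idf[w] ** 2 for w in words2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a title made only of words present everywhere carries no signal
    return shared / (norm1 * norm2)

# Tiny made-up corpus for illustration only
titles = {
    "a": {"aspen", "belly", "canfield"},
    "b": {"aspen", "belly", "canfield"},
    "c": {"aspen", "bibles", "blimey"},
    "d": {"answered", "baronies"},
}
idf = idf_table(titles)
print(cosine(titles["a"], titles["b"], idf))  # identical titles
print(cosine(titles["a"], titles["c"], idf))  # one shared, fairly common word
print(cosine(titles["a"], titles["d"], idf))  # no shared words
```

Because only the shared words contribute to the numerator, two titles with no word in common never need to be compared at all, which is what the inverted index in the preparation step exploits.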
Preparation
Precompute

1. Dict keyed by ID: set of title words, filtered to alphanumeric characters
2. Dict keyed by word: set of IDs whose title contains the word
3. Dict keyed by ID: 1/sqrt(Sum IDF^2)
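The three lookup tables could be built roughly as follows. This is a sketch under assumptions: the function name `precompute`, the lowercasing step, the regular expression used for alphanumeric filtering, and the IDF definition `log(N / document frequency)` are all illustrative choices, not taken from the talk. The third table stores the reciprocal of the norm sqrt(Sum IDF^2), which is the factor the cosine needs for each item.

```python
import math
import re
from collections import defaultdict

def precompute(rows):
    """rows: iterable of (item_id, title) pairs.

    Returns (words_by_id, ids_by_word, idf, inv_norm_by_id).
    """
    words_by_id = {}
    ids_by_word = defaultdict(set)
    for item_id, title in rows:
        # Keep alphanumeric tokens only; lowercasing is an assumption
        words = set(re.findall(r"[a-z0-9]+", title.lower()))
        words_by_id[item_id] = words
        for word in words:
            ids_by_word[word].add(item_id)

    n_docs = len(words_by_id)
    # Assumed IDF definition: log(N / document frequency)
    idf = {w: math.log(n_docs / len(ids)) for w, ids in ids_by_word.items()}

    inv_norm_by_id = {}
    for item_id, words in words_by_id.items():
        sum_sq = sum(idf[w] ** 2 for w in words)
        inv_norm_by_id[item_id] = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 0.0
    return words_by_id, ids_by_word, idf, inv_norm_by_id

rows = [("1", "Aspen belly, canfield!"), ("2", "aspen bibles"), ("3", "blimey")]
words_by_id, ids_by_word, idf, inv_norm_by_id = precompute(rows)
print(words_by_id["1"], ids_by_word["aspen"], inv_norm_by_id["3"])
```

The IDF table is returned as a fourth value because both the top-k lookup and the clustering step need per-word weights in addition to the per-item norms.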

Generated sample input

Set of 10k documents of up to 80 characters each

Word frequencies are skewed toward lexicographically earlier words (starting with a, b, c) so
that the set contains both common and less common words
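One way such a skewed sample could be generated is sketched below. The talk does not say how the skew was produced, so the geometric weighting, the `decay` parameter, the function name `generate_titles`, and the toy vocabulary are all assumptions made for illustration.

```python
import random

def generate_titles(vocab, n_docs=10_000, max_chars=80, decay=0.999, seed=0):
    """Generate titles of at most max_chars characters.

    Words earlier in sorted order get geometrically larger sampling
    weights, so they end up as the common words of the corpus.
    """
    rng = random.Random(seed)
    vocab = sorted(vocab)
    weights = [decay ** rank for rank in range(len(vocab))]
    titles = []
    for _ in range(n_docs):
        words = []
        length = 0
        while True:
            word = rng.choices(vocab, weights=weights, k=1)[0]
            extra = len(word) + (1 if words else 0)  # +1 for the separating space
            if length + extra > max_chars:
                break
            words.append(word)
            length += extra
        # Drop repeated words and sort them alphabetically
        titles.append(" ".join(sorted(set(words))))
    return titles

# Toy vocabulary: "aaaaa", "bbbbb", ..., "zzzzz" (hypothetical)
toy_vocab = [chr(ord("a") + i) * 5 for i in range(26)]
sample = generate_titles(toy_vocab, n_docs=5, decay=0.8)
for title in sample:
    print(title)
```

A steeper `decay` makes the early words dominate more strongly, which widens the gap between low-IDF and high-IDF words.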
Find Top k similar Items for ID
Get the IDs containing words in the title (candidates come from the word-to-IDs dict)
Select the top k with a heap using the heapq library: O(n log k), where n counts only the IDs
with at least one word in common
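The lookup can be sketched with `heapq.nlargest`, which keeps a heap of size k while scanning the candidates. The helper `build_index`, the function `top_k_similar`, and the IDF definition `log(N / document frequency)` are assumptions for illustration; only the use of `heapq` comes from the talk.

```python
import heapq
import math
from collections import defaultdict

def build_index(words_by_id):
    """Build word -> IDs, the IDF table and per-item norms (hypothetical helper)."""
    ids_by_word = defaultdict(set)
    for item_id, words in words_by_id.items():
        for word in words:
            ids_by_word[word].add(item_id)
    n_docs = len(words_by_id)
    # Assumed IDF definition: log(N / document frequency)
    idf = {w: math.log(n_docs / len(ids)) for w, ids in ids_by_word.items()}
    norm = {i: math.sqrt(sum(idf[w] ** 2 for w in ws))
            for i, ws in words_by_id.items()}
    return ids_by_word, idf, norm

def top_k_similar(query_id, words_by_id, ids_by_word, idf, norm, k=10):
    """Return up to k (similarity, item_id) pairs, most similar first."""
    # Accumulate shared IDF^2 only for items with at least one word in common
    shared = defaultdict(float)
    for word in words_by_id[query_id]:
        for other in ids_by_word[word]:
            if other != query_id:
                shared[other] += idf[word] ** 2
    # s > 0 implies both norms are positive, so the division is safe
    scored = ((s / (norm[query_id] * norm[other]), other)
              for other, s in shared.items() if s > 0)
    return heapq.nlargest(k, scored)

# Tiny made-up corpus for illustration only
words_by_id = {
    "q": {"aspen", "belly", "canfield"},
    "same": {"aspen", "belly", "canfield"},
    "part": {"aspen", "bibles"},
    "none": {"zebra"},
    "extra": {"yak"},
}
ids_by_word, idf, norm = build_index(words_by_id)
for sim, item_id in top_k_similar("q", words_by_id, ids_by_word, idf, norm, k=2):
    print(f"{item_id}: ({sim:.3f})")
```

`heapq.nlargest` returns its results sorted from highest to lowest similarity, so no separate sort is needed before printing.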
Sample Output
Input itemID: item words
00002000: adulterated aspen belly beriberi blockish bondwoman canfield malefactions
Similar itemID: (cosine similarity) item words
|- 00006034: (0.376) adulterated applique archaizing balalaika belabours beriberi bibles canfield
|- 00006538: (0.376) adulterated applique archaizing balalaika belabours beriberi bibles canfield
|- 00009158: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00007371: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00007426: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00009646: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
|- 00003585: (1.000) adulterated aspen belly beriberi blockish bondwoman canfield malefactions
|- 00003440: (1.000) adulterated aspen belly beriberi blockish bondwoman canfield malefactions
|- 00003908: (1.000) adulterated aspen belly beriberi blockish bondwoman canfield malefactions
|- 00007631: (0.384) answered apocalyptically baronies belly beriberi blimey bondwoman
Clustering
1. Idea 1: Traverse the items as a graph, following the k most similar items to form each cluster. Very slow.
2. Idea 2: Traverse the set of words.
Ignore common words with IDF < cutoff (an optimization in the absence of parallel
programming)
For each word, traverse all pairs ID1, ID2 from hashmap[word]
If ID1 != ID2, increment the cosine similarity of the pair by
IDF(word)^2/IDFsum(ID1)/IDFsum(ID2), where IDFsum(ID) = sqrt(Sum IDF^2) over the words of ID
3. Idea 2 runs in ~3 seconds for a 10k document example file
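Idea 2 can be sketched as below. The word-by-word accumulation follows the steps above, but the talk does not say how accumulated similarities are turned into clusters, so the similarity threshold and the union-find grouping into connected components are assumptions, as are the function name `cluster`, its parameters, and the IDF definition `log(N / document frequency)`. The sketch visits each unordered pair once instead of both (ID1, ID2) and (ID2, ID1).

```python
import math
from collections import defaultdict
from itertools import combinations

def cluster(words_by_id, idf_cutoff=1.0, sim_threshold=0.5):
    """Return (clusters, number_of_connections) for a dict of id -> word set."""
    n_docs = len(words_by_id)
    ids_by_word = defaultdict(set)
    for item_id, words in words_by_id.items():
        for word in words:
            ids_by_word[word].add(item_id)
    # Assumed IDF definition: log(N / document frequency)
    idf = {w: math.log(n_docs / len(ids)) for w, ids in ids_by_word.items()}
    norm = {i: math.sqrt(sum(idf[w] ** 2 for w in ws))
            for i, ws in words_by_id.items()}

    # Accumulate pairwise cosine similarity one word at a time
    sim = defaultdict(float)
    for word, ids in ids_by_word.items():
        if idf[word] <= 0 or idf[word] < idf_cutoff:
            continue  # skip common words
        for id1, id2 in combinations(sorted(ids), 2):
            sim[(id1, id2)] += idf[word] ** 2 / (norm[id1] * norm[id2])

    # Assumed grouping step: union-find over pairs above the threshold
    parent = {i: i for i in words_by_id}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    connections = 0
    for (id1, id2), s in sim.items():
        if s >= sim_threshold:
            connections += 1
            parent[find(id1)] = find(id2)

    groups = defaultdict(set)
    for i in words_by_id:
        groups[find(i)].add(i)
    return [g for g in groups.values() if len(g) > 1], connections

# Tiny made-up corpus for illustration only
docs = {
    1: {"a", "b", "c"}, 2: {"a", "b", "c"}, 3: {"a", "b", "d"},
    4: {"x", "y"}, 5: {"x", "y"}, 6: {"z"},
}
clusters, connections = cluster(docs)
print(clusters, connections)
```

Skipping low-IDF words matters because a word shared by m items generates on the order of m^2 pairs, so the most common words dominate the running time while contributing little to the similarity.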
Sample Output for Clustering
204 clusters generated in 3.01s
2378 connections found
Cluster 1 words:
anchorite, apparition, augend, backsliders, blotting, bumblers, corruptible, denazified
Cluster 2 words:
aerosols, arabesk, atheneum, bibles, bilious, bortz, cardoon, childlike, curriery
Cluster 3 words:
addressability, alexandrine, antiques, balalaika, bedevilling, frolicky, futural
Cluster 4 words:
adulterated, annoyances, arabesk, areola, ascend, battlefields, chatted, crinkly
Cluster 5 words:
applique, attaches, backsliders, blahs, calliopes, circulates, constraints
Cluster 6 words:
amasser, amiss, anchorite, aouads, approbations, autographs, begin, calliopes, heats
Cluster 7 words:
ahchoo, alleviator, amiss, aspen, brownstone, budgie, centiliter, coagulators, doubtable
Cluster 8 words:
amasser, bantling, bioresearch, bulgurs, bunter, canaanite, disseminated, misprinting
