Jure Leskovec and Anand Rajaraman
Stanford University
Friday 5:30 at Gates B12, 5:30-7:30pm
You will learn and get hands-on experience on:
  Log in to Amazon EC2 and request a cluster
  Run Hadoop MapReduce jobs
  Use Aster nCluster software
Amazon gave us $12k of computing time
  Each student has about $200 worth of computing time
1/21/2010
Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
Ideally teams of 2 students (1 or 3 is also OK)
Project:
  Discover interesting relationships within a significant amount of data
  Have some original idea that extends/builds on what we learned in class
    Extend/Improve/Speed up some existing algorithm
    Define a new problem and solve it
Answer the following questions:
  What is the problem you are solving?
  What data will you use (where will you get it)?
  How will you do it?
    Which algorithms/techniques do you plan to use?
    Be as specific as you can!
  How will you evaluate, measure success?
  What do you expect to submit at the end of the quarter?
Due at midnight on Feb 1, 2010
Email the PDF to cs345awin0910
staff@lists.stanford.edu
TAs will assign group numbers
Wikipedia
IM buddy graph
Yahoo Altavista web graph
Stanford WebBase
Twitter Data
Blogs and news data
Netflix
Restaurant reviews
Yahoo Music Ratings
Complete edit history of Wikipedia until January 2008
  For every single edit the complete snapshot of the article is saved
Each page has a talk page:
Talk page:
Every registered user has a page:
Every user's page has a talk page:
<page>
  <title>Anarchism</title>
  <id>12</id>
  <revision>
    <id>18201</id>
    <timestamp>2002-02-25T15:00:22Z</timestamp>
    <contributor>
      <ip>Conversion script</ip>
    </contributor>
    <minor />
    <comment>Automated conversion</comment>
    <text xml:space="preserve">''Anarchism'' is the political theory that advocates the abolition of all forms of government.
    ...
    </text>
  </revision>
  <revision>
    <id>19746</id>
    <timestamp>2002-02-25T15:43:11Z</timestamp>
    <contributor>
      <ip>140.232.153.45</ip>
    </contributor>
    <comment>*</comment>
    <text xml:space="preserve">''Anarchism'' is the political theory that advocates the abolition of all forms of government.
    ...
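As a rough illustration of how the revision records above could be read, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is a shortened, hypothetical page in the same shape (its text values are placeholders), the function name `revisions` is made up for this example, and the `<username>` fallback is an assumption about how registered contributors appear, since the slide only shows `<ip>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened page in the shape shown on the slide (placeholder text).
SAMPLE = """<page>
  <title>Anarchism</title>
  <id>12</id>
  <revision>
    <id>18201</id>
    <timestamp>2002-02-25T15:00:22Z</timestamp>
    <contributor><ip>Conversion script</ip></contributor>
    <text xml:space="preserve">first version</text>
  </revision>
  <revision>
    <id>19746</id>
    <timestamp>2002-02-25T15:43:11Z</timestamp>
    <contributor><ip>140.232.153.45</ip></contributor>
    <text xml:space="preserve">second version</text>
  </revision>
</page>"""


def revisions(page_xml):
    """Return (title, list of revision dicts) for one <page> element."""
    page = ET.fromstring(page_xml)
    title = page.findtext("title")
    revs = []
    for rev in page.findall("revision"):
        contributor = rev.find("contributor")
        # The slide shows <ip>; <username> for registered users is an assumption.
        who = contributor.findtext("ip") or contributor.findtext("username")
        revs.append({
            "id": int(rev.findtext("id")),
            "timestamp": rev.findtext("timestamp"),
            "contributor": who,
            "text": rev.findtext("text") or "",
        })
    return title, revs


title, revs = revisions(SAMPLE)
for r in revs:
    print(title, r["id"], r["timestamp"], r["contributor"])
```

A complete edit history will not fit in memory as one string, so a streaming parser such as `ET.iterparse` would probably be the more realistic choice for the full dataset.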
Complete edit and talk history of Wikipedia:
  How do articles evolve?
    Use a string-edit-distance-like approach to measure differences between versions of the article
    Model the evolution of the content
  Which users make what types of edits?
    Big vs. small changes, reorganization?
    Suggest which user should edit the page
  How do users talk and then edit the same pages?
    Do users first talk and then edit?
    Is it the other way around?
    Suggest to users which pages to edit
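One possible reading of the "string edit distance like approach" is a word-level Levenshtein distance between two versions of an article. The sketch below is only that: the function name `edit_distance` and the two example sentences are illustrative assumptions, and the project could just as well use a different notion of difference.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Illustrative versions of one sentence (hypothetical, not taken from the dump).
old = "Anarchism is theory that advocates the abolition of all government".split()
new = ("Anarchism is the political theory that advocates "
       "the abolition of all forms of government").split()

d = edit_distance(old, new)
print(d, "word edits between the two versions")
```

The running time is quadratic in the article length, so for long articles across millions of revisions some cheaper approximation would likely be needed.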
Altavista web graph from 2002:
  Nodes are web pages
  Directed edges are hyperlinks
  1.4 billion public web pages
  Several billion edges
  For each node we also know the page URL
SPAM:
  Use the web graph structure to more efficiently extract spam web pages
    Link farms
    Spider traps
  Personalized and topic-sensitive PageRank
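To make the personalized / topic-sensitive PageRank idea concrete, here is a minimal power-iteration sketch in which random jumps land only on a chosen teleport set. The function name, the damping value `beta=0.85`, the dead-end handling, and the toy graph are all assumptions made for illustration; how the teleport set is chosen (for example, hand-picked trusted or on-topic pages) is left open.

```python
def personalized_pagerank(edges, teleport, beta=0.85, iters=50):
    """Power iteration where random jumps land only on the `teleport` set."""
    teleport = set(teleport)
    nodes = sorted({u for e in edges for u in e} | teleport)
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)
    jump = {u: (1.0 / len(teleport) if u in teleport else 0.0) for u in nodes}
    rank = dict(jump)
    for _ in range(iters):
        nxt = {u: (1.0 - beta) * jump[u] for u in nodes}
        for u in nodes:
            if out[u]:
                share = beta * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:
                # Dead end: hand its mass back to the teleport set.
                for t in teleport:
                    nxt[t] += beta * rank[u] / len(teleport)
        rank = nxt
    return rank


# Toy graph: "c" links to "a" but nothing reachable from the teleport set links to "c".
edges = [("a", "b"), ("b", "a"), ("c", "a")]
rank = personalized_pagerank(edges, teleport={"a"})
print(rank)
```

In this toy graph the page `c` receives no score because nothing reachable from the teleport set links to it, which is the property that makes this family of methods interesting for spam detection.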
Website structure identification:
  From the web graph extract websites
  What are common navigational structures of websites?
    Cluster website graphs
    Identify common subgraphs and patterns
  What roles do pages/links play in the graph:
    Content pages
    Navigational pages
    Index pages
  Build a summary/map of the website
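A first step toward extracting websites from the web graph could be to group hyperlinks by host. The sketch below assumes that one hostname corresponds to one website, which is only an approximation (a site may span several hosts); the function name `site_graphs` and the `example` URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_graphs(edges):
    """Split a URL-level edge list into per-host lists of internal links."""
    sites = defaultdict(list)
    for src, dst in edges:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Keep only links that stay inside one host (assumed to be one website).
        if src_host is not None and src_host == dst_host:
            sites[src_host].append((src, dst))
    return dict(sites)


# Hypothetical edges; real input would come from the web graph's URL table.
edges = [
    ("http://a.example/index.html", "http://a.example/about.html"),
    ("http://a.example/index.html", "http://b.example/"),
    ("http://b.example/", "http://b.example/x"),
]
sites = site_graphs(edges)
for host, internal in sites.items():
    print(host, len(internal), "internal links")
```

Each per-host edge list can then be treated as a small graph for clustering or pattern mining.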
A collection of focused snapshots of the Web
  Data starts in 2004 and continues till today
General crawls
  Start from ~1000 seed web pages
  Crawl up to ~150,000 pages per site
Specialized crawls:
  Universities
  US Government
  Hurricane Katrina (2005) daily crawls
  Monthly newspaper crawls
Smaller than Altavista but you also have the page content
  Can do topic analysis
  Topic-sensitive PageRank
Study the evolution of websites and webpages
50 million tweets per month starting June 2009 (6 months)
Format:
  T 2009-06-07 02:07:42
  U http://twitter.com/redsoxtweets
  W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy
Two important things:
  URLs
  Hashtags
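The T/U/W record format above could be read with something like the following sketch, which also pulls out the two things highlighted on the slide, URLs and hashtags. It assumes each field sits on its own line with the one-letter tag separated from the value by whitespace, and that `W` is the last field of a record; the function name and both regular expressions are illustrative choices, not part of the dataset's documentation.

```python
import re
from datetime import datetime

HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")


def parse_tweets(lines):
    """Yield one dict per T/U/W record (W assumed to close each record)."""
    record = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or parts[0] not in ("T", "U", "W"):
            continue  # skip blank or unrecognized lines
        record[parts[0]] = parts[1]
        if parts[0] == "W" and "T" in record and "U" in record:
            text = record["W"]
            yield {
                "time": datetime.strptime(record["T"], "%Y-%m-%d %H:%M:%S"),
                "user": record["U"],
                "text": text,
                "hashtags": [h.lower() for h in HASHTAG.findall(text)],
                "urls": URL.findall(text),
            }
            record = {}


SAMPLE = [
    "T 2009-06-07 02:07:42",
    "U http://twitter.com/redsoxtweets",
    "W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's "
    "perfecto and his shutout.. http://tinyurl.com/pyhgwy",
]
tweets = list(parse_tweets(SAMPLE))
print(tweets[0]["hashtags"], tweets[0]["urls"])
```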
Trending topics: rising, falling
Inferring links of the who-follows-whom network
What is the lifecycle of URLs and hashtags?
Clustering tweets by topic or category
Sentiment analysis: are people positive/negative about something (a product?)
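For the trending-topics idea, one simple baseline is to compare how often each hashtag appears in two consecutive time windows. The sketch below scores a hashtag by the ratio of its smoothed relative frequencies; the function name, the add-one smoothing, and the toy counts are assumptions for illustration, and a real project would probably want a more careful statistical test.

```python
from collections import Counter


def trend_scores(earlier, later, smoothing=1.0):
    """Ratio of smoothed hashtag frequency in the later vs. the earlier window."""
    a, b = Counter(earlier), Counter(later)
    na, nb = sum(a.values()), sum(b.values())
    vocab = set(a) | set(b)
    scores = {}
    for tag in vocab:
        fa = (a[tag] + smoothing) / (na + smoothing * len(vocab))
        fb = (b[tag] + smoothing) / (nb + smoothing * len(vocab))
        scores[tag] = fb / fa  # > 1 means rising, < 1 means falling
    return scores


# Hypothetical hashtag occurrences in two windows.
earlier = ["#a"] * 8 + ["#b"] * 2
later = ["#a"] * 2 + ["#b"] * 8
scores = trend_scores(earlier, later)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```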
More than 1 million news media and blog articles per day since August 2008
Extract phrases (quotes) and links
http://memetracker.org
Format:
  P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defendspalins-experience-level
  T 2008-09-01 00:00:13
  Q dangerously unprepared to be president
  Q even more dangerously unprepared
  Q understands the challenges that we face
  Q worked and succeeded
  Q still to this day refuses to acknowledge that the surge has succeeded
  L http://www.cnn.com
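A reader for the P/T/Q/L format above might look like the sketch below. It assumes each record starts with a `P` line and that the one-letter tag is separated from its value by whitespace; the function name and the dictionary keys are made up for this example.

```python
def parse_memetracker(lines):
    """Yield one dict per document: URL, timestamp, quoted phrases, outgoing links."""
    doc = None
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # blank line or tag without a value
        tag, value = parts
        if tag == "P":  # a new document starts (assumption: P always comes first)
            if doc is not None:
                yield doc
            doc = {"url": value, "time": None, "quotes": [], "links": []}
        elif doc is None:
            continue
        elif tag == "T":
            doc["time"] = value
        elif tag == "Q":
            doc["quotes"].append(value)
        elif tag == "L":
            doc["links"].append(value)
    if doc is not None:
        yield doc


SAMPLE = [
    "P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defendspalins-experience-level",
    "T 2008-09-01 00:00:13",
    "Q dangerously unprepared to be president",
    "Q even more dangerously unprepared",
    "L http://www.cnn.com",
]
docs = list(parse_memetracker(SAMPLE))
print(docs[0]["time"], len(docs[0]["quotes"]), docs[0]["links"])
```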
Find all variants (mutations) of the same phrase; cluster phrases based on edit distance and time:
  lipstick on a pig
  you can put lipstick on a pig
  you can put lipstick on a pig but it's still a pig
  i think they put some lipstick on a pig but it's still a pig
  putting lipstick on a pig
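The variants listed above differ a lot in length, so plain edit distance would keep some of them apart. As one hedged alternative, the sketch below links two phrases when their longest shared run of words covers most of the shorter phrase, then merges linked phrases with a small union-find. This swaps in a word-overlap measure for edit distance and ignores time entirely; the function names, the 0.75 threshold, and the extra unrelated phrase are assumptions for illustration.

```python
from difflib import SequenceMatcher


def overlap(p, q):
    """Longest shared run of words, as a fraction of the shorter phrase (non-empty)."""
    a, b = p.split(), q.split()
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return m.size / min(len(a), len(b))


def cluster(phrases, threshold=0.75):
    """Single-link clustering: phrases joined by a chain of overlaps end up together."""
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            if overlap(phrases[i], phrases[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i, p in enumerate(phrases):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


PHRASES = [
    "lipstick on a pig",
    "you can put lipstick on a pig",
    "you can put lipstick on a pig but it's still a pig",
    "i think they put some lipstick on a pig but it's still a pig",
    "putting lipstick on a pig",
    "understands the challenges that we face",  # unrelated phrase for contrast
]
clusters = cluster(PHRASES)
for c in clusters:
    print(len(c), c[0])
```

The all-pairs comparison is quadratic in the number of phrases, so at the scale of this dataset some form of candidate filtering would be needed first.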
Predict the popularity of a phrase over time
How does information mutate/change over time?
Which media sites are the most influential? Build a predictive model of site influence
Which nodes are early mentioners, latecomers, summarizers?
Sentiment analysis: are people positive/negative about something (news, a product)
Create a model of political bias (liberal vs. conservative)
What is genuine news, what are genuine phrases and what is spam?
We also have the Wikipedia webserver logs, i.e., page visit statistics
  How do Wiki page visit statistics correlate with external events, natural disasters?
    Use Twitter or MemeTracker data to detect those
  Compare occurrence of phrases and visits to Wikipedia pages
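One simple way to compare phrase occurrences with page visits is a Pearson correlation between two daily count series. The sketch below assumes the counts have already been aggregated per day and aligned; the function name and the numbers are hypothetical, chosen only to show the computation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical daily counts: visits to one Wikipedia page, mentions of a related phrase.
visits = [10, 12, 50, 30, 11]
mentions = [1, 2, 9, 5, 1]
r = pearson(visits, mentions)
print(round(r, 3))
```

Shifting one series by a day or two before correlating would be a natural extension for asking which signal leads the other.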
A large IM buddy graph from March 2005
  230 million nodes
  7,340 million undirected edges
Limitations:
  Only have the buddy graph with random node ids
  No communication or edge strength
Find communities, clusters in such a big graph
Count frequent subgraphs
Design algorithms to characterize the structure of the network as a whole
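The smallest interesting case of counting subgraphs is counting triangles. The in-memory sketch below shows the idea on a toy undirected edge list with integer node ids; the function name is made up, and a graph with billions of edges would need a distributed or streaming variant rather than this version.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Require u < v < w so that each triangle is counted exactly once.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count


# Toy graph: one triangle (1, 2, 3) plus a pendant node 4.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangles = count_triangles(edges)
print(triangles)
```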
Movie ratings:
  Netflix prize dataset:
    http://www.netflixprize.com/
  Yahoo Music ratings:
    Yahoo Music user ratings of songs with artist, album and genre information
    717 million ratings
    136,000 songs
    1.8 million users
Restaurant reviews
Collaborative filtering:
  Predict what ratings a user will give to particular songs/movies, i.e., which songs will he/she like?
Supplement the data with additional data sources:
  Movies: IMDB
  Playlists from the web
  Lyrics (text of the song)
Include taste, temporal component, diversity into the model
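Before building anything elaborate, a rating predictor is often compared against a simple baseline. The sketch below predicts a rating as the global mean plus an item offset plus a user offset; it is a common starting point rather than the method the slides prescribe, and the function name and the tiny rating list are hypothetical.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: list of (user, item, rating). Returns a predict(user, item) function."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item offset: how far an item's ratings sit from the global mean.
    item_dev = defaultdict(list)
    for _, i, r in ratings:
        item_dev[i].append(r - mu)
    b_item = {i: sum(d) / len(d) for i, d in item_dev.items()}

    # User offset: what is left after removing the global mean and the item offset.
    user_dev = defaultdict(list)
    for u, i, r in ratings:
        user_dev[u].append(r - mu - b_item[i])
    b_user = {u: sum(d) / len(d) for u, d in user_dev.items()}

    def predict(user, item):
        # Unseen users or items fall back to an offset of zero.
        return mu + b_user.get(user, 0.0) + b_item.get(item, 0.0)

    return predict


# Hypothetical ratings on a 1-5 scale.
ratings = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4), ("u2", "b", 2), ("u3", "a", 5)]
predict = fit_baseline(ratings)
print(round(predict("u3", "b"), 2))
```

Taste, temporal effects, and diversity would enter as additional terms or as a separate model on top of these residuals.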
Stanford Search Queries
New York Times articles since 1987
  Articles are manually annotated by subject categories and keywords
  Entity or relation extraction
  Extract keywords, predict article category
Don't feel limited by these
  You can collect the dataset yourself
  And define the project/question yourself