
CS345a: Data Mining

Jure Leskovec and Anand Rajaraman
Stanford University

Friday 5:30 at Gates B12, 5:30-7:30pm
You will learn and get hands-on experience on:
Log in to Amazon EC2 and request a cluster
Run Hadoop MapReduce jobs
Use Aster nCluster software

Amazon gave us $12k of computing time
Each student has about $200 worth of computing time

1/21/2010

Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

Ideally teams of 2 students (1 or 3 is also ok)
Project:
Discovers interesting relationships within a significant amount of data
Have some original idea that extends/builds on what we learned in class
Extend/improve/speed up some existing algorithm
Define a new problem and solve it


Answer the following questions:
What is the problem you are solving?
What data will you use (where will you get it)?
How will you do it?
Which algorithms/techniques do you plan to use?
Be as specific as you can!

How will you evaluate and measure success?
What do you expect to submit at the end of the quarter?

Due at midnight, Feb 1, 2010

Email the PDF to cs345awin0910
staff@lists.stanford.edu

TAs will assign group numbers

Name your file: <group#>_proposal.pdf


Wikipedia
IM buddy graph
Yahoo Altavista web graph
Stanford WebBase
Twitter data
Blogs and news data
Netflix
Restaurant reviews
Yahoo Music ratings


Complete edit history of Wikipedia until January 2008
For every single edit the complete snapshot of the article is saved
Each page has a talk page:


Talk page:

Editors discuss things like:


Every registered user has a page:


Every user's page has a talk page:

Users discuss things:


<page>
  <title>Anarchism</title>
  <id>12</id>
  <revision>
    <id>18201</id>
    <timestamp>2002-02-25T15:00:22Z</timestamp>
    <contributor>
      <ip>Conversion script</ip>
    </contributor>
    <minor />
    <comment>Automated conversion</comment>
    <text xml:space="preserve">''Anarchism'' is the political
theory that advocates the abolition of all forms of
government.
...
    </text>
  </revision>
  <revision>
    <id>19746</id>
    <timestamp>2002-02-25T15:43:11Z</timestamp>
    <contributor>
      <ip>140.232.153.45</ip>
    </contributor>
    <comment>*</comment>
    <text xml:space="preserve">''Anarchism'' is the political
theory that advocates the abolition of all forms of government.
...
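A minimal sketch of pulling the revision history out of a fragment shaped like the one above, using Python's standard `xml.etree.ElementTree`. The sample string is a hypothetical, well-formed stand-in (the slide excerpt is truncated), and `iter_revisions` is an illustrative name, not part of any dataset tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modeled on the excerpt above.
SAMPLE = """<page>
  <title>Anarchism</title>
  <id>12</id>
  <revision>
    <id>18201</id>
    <timestamp>2002-02-25T15:00:22Z</timestamp>
    <contributor><ip>Conversion script</ip></contributor>
    <text xml:space="preserve">first version</text>
  </revision>
  <revision>
    <id>19746</id>
    <timestamp>2002-02-25T15:43:11Z</timestamp>
    <contributor><ip>140.232.153.45</ip></contributor>
    <text xml:space="preserve">second version</text>
  </revision>
</page>"""


def iter_revisions(page_xml):
    """Yield (title, revision id, timestamp, contributor, text) per revision."""
    page = ET.fromstring(page_xml)
    title = page.findtext("title")
    for rev in page.findall("revision"):
        # Anonymous edits carry <ip>; registered users carry <username>.
        who = rev.findtext("contributor/ip") or rev.findtext("contributor/username")
        yield (title, int(rev.findtext("id")), rev.findtext("timestamp"),
               who, rev.findtext("text") or "")


revisions = list(iter_revisions(SAMPLE))
```

Real dumps wrap pages in a namespaced root element and are far too large to load at once, so a streaming parser such as `ET.iterparse` would be the likelier choice there.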

Complete edit and talk history of Wikipedia:
How do articles evolve?
Use a string-edit-distance-like approach to measure differences between versions of the article
Model the evolution of the content

Which users make what types of edits?
Big vs. small changes, reorganization?
Suggest which user should edit the page?

How do users talk and then edit the same pages?
Do users first talk and then edit?
Is it the other way around?

Suggest to users which pages to edit
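One way the edit-distance idea could look in practice: a word-level Levenshtein distance between two revisions, normalized by the longer revision. This is a sketch under assumptions; the function names and the two sample sentences are made up for illustration, and a real project would need something faster for full articles.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


def revision_change(old_text, new_text):
    """Word-level edit distance normalized by the longer revision's length."""
    old_words, new_words = old_text.split(), new_text.split()
    longest = max(len(old_words), len(new_words)) or 1
    return edit_distance(old_words, new_words) / longest


# Made-up pair of revisions (not taken from the dump).
v1 = "Anarchism is theory that advocates the abolition of all government."
v2 = ("Anarchism is the political theory that advocates the abolition "
      "of all forms of government.")
change = revision_change(v1, v2)
```

Here `v2` adds four words to `v1`, so the distance is 4 over a 14-word revision.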

Altavista web graph from 2002:

Nodes are webpages
Directed edges are hyperlinks
1.4 billion public webpages
Several billion edges
For each node we also know the page URL


SPAM:
Use the web graph structure to more efficiently extract spam webpages
Link farms
Spider traps

Personalized and topic-sensitive PageRank
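A small sketch of topic-sensitive (personalized) PageRank by power iteration, where random jumps land only on a chosen set of pages. The parameter names (`beta`, `teleport_set`), the dead-end handling, and the toy graph are assumptions for illustration; it presumes every link target also appears as a key in `out_links`.

```python
def topic_sensitive_pagerank(out_links, teleport_set, beta=0.85, iterations=50):
    """Power iteration where random jumps land only on pages in teleport_set.

    out_links maps every node to the list of nodes it links to.
    Rank from dead ends (no out-links) is also sent to the teleport set.
    """
    nodes = list(out_links)
    jump = {n: (1.0 / len(teleport_set) if n in teleport_set else 0.0)
            for n in nodes}
    rank = dict(jump)  # start from the teleport distribution
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        leaked = 0.0
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = beta * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
                leaked += (1.0 - beta) * rank[n]
            else:
                leaked += rank[n]  # dead end: all of its rank teleports
        for n in nodes:
            nxt[n] += leaked * jump[n]
        rank = nxt
    return rank


# Hypothetical 4-page graph; "a" and "b" stand in for an on-topic seed set.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": []}
scores = topic_sensitive_pagerank(graph, teleport_set={"a", "b"})
```

Choosing the teleport set is the design decision: pages about one topic give a topic-sensitive ranking, while a set of trusted pages pushes rank away from link farms.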


Website structure identification:
From the web graph extract websites
What are common navigational structures of websites?
Cluster website graphs
Identify common subgraphs and patterns

What roles do pages/links play in the graph:
Content pages
Navigational pages
Index pages

Build a summary/map of the website

A collection of focused snapshots of the Web
Data starts in 2004 and continues till today
General crawls start from ~1000 seed webpages
Crawl up to ~150,000 pages per site

Specialized crawls:
Universities
US Government
Hurricane Katrina (2005) daily crawls
Monthly newspaper crawls

Smaller than Altavista but you also have the page content

Can do topic analysis

Topic-sensitive PageRank

Study the evolution of websites and webpages


50 million tweets per month starting June 2009 (6 months)
Format:

T    2009-06-07 02:07:42
U    http://twitter.com/redsoxtweets
W    #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy

Two important things:
URLs
Hashtags
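A sketch of reading records in the T/U/W format shown above and extracting the two things the slide highlights, URLs and hashtags. It assumes the field letter and its value are separated by a tab, which the slide does not show; the regular expressions and the name `parse_tweets` are illustrative.

```python
import re

SAMPLE = (
    "T\t2009-06-07 02:07:42\n"
    "U\thttp://twitter.com/redsoxtweets\n"
    "W\t#redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's "
    "perfecto and his shutout.. http://tinyurl.com/pyhgwy\n"
    "\n"
)

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def parse_tweets(lines):
    """Yield one dict per T/U/W record; a record is emitted when its W line is read."""
    record = {}
    for line in lines:
        # Assumed layout: one-letter field tag, a tab, then the value.
        key, _, value = line.rstrip("\n").partition("\t")
        if key == "T":
            record = {"time": value}
        elif key == "U":
            record["user"] = value
        elif key == "W":
            record["text"] = value
            record["urls"] = URL_RE.findall(value)
            record["hashtags"] = HASHTAG_RE.findall(value)
            yield record


tweets = list(parse_tweets(SAMPLE.splitlines()))
```

Because the function is a generator over lines, it can be fed an open file handle directly, which matters at 50 million tweets per month.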


Trending topics: rising, falling

Inferring links of the who-follows-whom network

What are the lifecycles of URLs and hashtags?

Finding early/influential users?

Clustering tweets by topic or category

Sentiment analysis: are people positive/negative about something (a product?)


More than 1 million news media and blog articles per day since August 2008
Extract phrases (quotes) and links
http://memetracker.org
Format:

P    http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defendspalins-experience-level
T    2008-09-01 00:00:13
Q    dangerously unprepared to be president
Q    even more dangerously unprepared
Q    understands the challenges that we face
Q    worked and succeeded
Q    still to this day refuses to acknowledge that the surge has succeeded
L    http://www.cnn.com


Find all variants (mutations) of the same phrase: cluster phrases based on edit distance and time:

lipstick on a pig
you can put lipstick on a pig
you can put lipstick on a pig but it's still a pig
i think they put some lipstick on a pig but it's still a pig
putting lipstick on a pig

Temporal variations of the phrase volume
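A rough sketch of grouping phrase variants like the ones listed above. As a stand-in for a normalized edit distance it uses the word-level matching ratio from Python's `difflib.SequenceMatcher`, then links any pair above a threshold (single-link clustering with union-find). The threshold of 0.7 and the function names are assumptions, and the time dimension mentioned on the slide is left out.

```python
from difflib import SequenceMatcher


def similarity(p, q):
    """Word-level similarity in [0, 1]; 1.0 means identical token sequences."""
    return SequenceMatcher(None, p.split(), q.split()).ratio()


def cluster_phrases(phrases, threshold=0.7):
    """Single-link clustering: phrases joined by a chain of similar pairs share a cluster."""
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            if similarity(phrases[i], phrases[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())


phrases = [
    "lipstick on a pig",
    "you can put lipstick on a pig",
    "you can put lipstick on a pig but it's still a pig",
    "i think they put some lipstick on a pig but it's still a pig",
    "putting lipstick on a pig",
    "dangerously unprepared to be president",  # unrelated phrase for contrast
]
clusters = cluster_phrases(phrases)
```

The all-pairs loop is quadratic, so at a million articles per day some form of candidate filtering (for example, only comparing phrases that share a rare word) would be needed first.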


Predict the popularity of a phrase over time

How does information mutate/change over time?

Which media sites are the most influential? Build a predictive model of site influence

Which nodes are early mentioners, latecomers, summarizers?

Sentiment analysis: are people positive/negative about something (news, a product)

Create a model of political bias (liberal vs. conservative)

What is genuine news, what are genuine phrases and what is spam?


We also have the Wikipedia webserver logs, i.e., page visit statistics

How do Wiki page visit statistics correlate with external events, natural disasters?
Use Twitter or MemeTracker data to detect those
Compare occurrence of phrases and visits to Wikipedia pages
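One simple way to compare phrase occurrences with page visits is a lagged Pearson correlation between two daily count series. The sketch below is only illustrative: the counts are made up, and `best_lag` and `max_lag` are hypothetical names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_lag(visits, mentions, max_lag=3):
    """Return (lag, correlation) maximizing corr(visits[t + lag], mentions[t])."""
    scored = []
    for lag in range(max_lag + 1):
        shifted = visits[lag:]
        scored.append((pearson(shifted, mentions[:len(shifted)]), lag))
    corr, lag = max(scored)
    return lag, corr


# Made-up daily counts: page visits echo phrase mentions one day later.
mentions = [1, 2, 9, 4, 2, 1, 1, 1]
visits = [5, 10, 20, 90, 40, 20, 10, 10]
lag, corr = best_lag(visits, mentions)
```

A positive best lag suggests visits trail mentions by that many days; `pearson` divides by zero on a constant series, so flat windows would need to be skipped.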


A large IM buddy graph from March 2005
230 million nodes
7,340 million undirected edges

Limitations:
Only have the buddy graph with random node ids
No communication or edge strength


Find communities, clusters in such a big graph

Count frequent subgraphs

Design algorithms to characterize the structure of the network as a whole
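Triangles are the smallest non-trivial subgraph to count, so they make a reasonable first example of subgraph counting. The sketch below is an in-memory version for an undirected edge list with comparable node ids; the toy graph is hypothetical, and a graph with billions of edges would need a distributed variant (for instance on MapReduce).

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)

    triangles = 0
    # Deduplicate edges and orient each one as (smaller id, larger id).
    for u, v in {(min(a, b), max(a, b)) for a, b in edges if a != b}:
        # Each common neighbor w > v closes a triangle u < v < w, counted once.
        triangles += sum(1 for w in neighbors[u] & neighbors[v] if w > v)
    return triangles


# Hypothetical toy graph: a 4-clique on nodes 0..3 plus a pendant node 4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
n_triangles = count_triangles(edges)
```

Ordering the three nodes of each triangle by id is what keeps every triangle from being counted three times.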


Movie ratings:
Netflix prize dataset:
http://www.netflixprize.com/

Yahoo Music ratings:
Yahoo Music user ratings of songs with artist, album and genre information
717 million ratings
136,000 songs
1.8 million users

Restaurant reviews


Collaborative filtering:
Predict what ratings a user will give to particular songs/movies, i.e., which songs will he/she like?

Supplement the data with additional data sources:
Movies: IMDB
Playlists from the web
Lyrics (text of the song)

Include taste, temporal component, diversity into the model
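Before any neighborhood or latent-factor model, a bias-only baseline predictor is a common starting point for rating prediction. The sketch below is that baseline, not a full collaborative filtering method; the ratings, the 1-5 scale, and the function names are made up for illustration.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit r(u, i) ~ mu + b_u + b_i from (user, item, rating) triples.

    mu is the global mean, b_i the item's mean offset from mu, and b_u the
    user's mean offset after the item offsets are removed.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    by_item = defaultdict(list)
    for _, item, r in ratings:
        by_item[item].append(r - mu)
    item_bias = {i: sum(d) / len(d) for i, d in by_item.items()}

    by_user = defaultdict(list)
    for user, item, r in ratings:
        by_user[user].append(r - mu - item_bias[item])
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}
    return mu, user_bias, item_bias


def predict(model, user, item, lo=1.0, hi=5.0):
    """Predict a rating, falling back to zero bias for unseen users or items."""
    mu, user_bias, item_bias = model
    raw = mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)
    return min(hi, max(lo, raw))


# Made-up ratings on a 1-5 scale.
ratings = [
    ("ann", "song_a", 5), ("ann", "song_b", 3),
    ("bob", "song_a", 4), ("bob", "song_b", 2), ("bob", "song_c", 1),
    ("cat", "song_c", 3),
]
model = fit_baseline(ratings)
guess = predict(model, "ann", "song_c")
```

Extra signals such as genre, lyrics, or time could then be modeled on the residuals that this baseline leaves unexplained.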


Stanford Search Queries
New York Times articles since 1987
Articles are manually annotated by subject categories and keywords
Entity or relation extraction
Extract keywords, predict article category

Don't feel limited by these
You can collect the dataset yourself
And define the project/question yourself
