
On day 4, we saw how to process text data using the Enron email dataset. In reality, we only processed a small fraction of the entire dataset: about 15 megabytes of Kenneth Lay's emails. The entire dataset, containing many Enron employees' mailboxes, is 1.3 gigabytes, about 87 times the size of what we worked with. And what if we worked on GMail, Yahoo! Mail, or Hotmail? We'd have several petabytes' worth of emails, at least 71 million times the size of the data we dealt with.

All that data would take a while to process, and it certainly couldn't fit on, nor be crunched by, a single laptop. We'd have to store the data on many machines, and we'd have to process it (tokenize it, calculate tf-idf) using multiple machines. There are many ways to do this, but one of the more popular recent methods of parallelizing data computation is based on a programming framework called MapReduce, an idea that Google presented to the world in 2004. Luckily, you do not have to work at Google to benefit from MapReduce: an open source implementation called Hadoop is available for your use!

You might worry that we don't have hundreds of machines sitting around for us to use. Actually, we do! Amazon Web Services offers a service called Elastic MapReduce (EMR) that gives us access to as many machines as we would like for about 10 cents per machine per hour of use. Use 100 machines for 2 hours? Pay Amazon around $20.00. If you've ever heard the buzzword "cloud computing," this elastic service is part of the hype.

Let's start with a simple word count example, then rewrite it in MapReduce, then run MapReduce on 20 machines using Amazon's EMR, and finally write a big-person MapReduce workflow to calculate TF-IDF!

Setup
We're going to be using two files, dataiap/day5/term_tools.py and dataiap/day5/package.tar.gz. Either write your code in the dataiap/day5 directory, or copy these files to the directory where your work lives.

Counting Words
We're going to start with a simple example that should be familiar to you from day 4's lecture. First, unzip the JSON-encoded Kenneth Lay email file:
unzip dataiap/datasets/emails/kenneth_json.zip

This will result in a new file called lay-k.json, which is JSON-encoded. What is JSON? You can think of it like a text representation of python dictionaries and lists. If you open up the file, you will see on each line something that looks like this:


{"sender":"rosalee.fleming@enron.com","recipients":["lizard_ar@yahoo.com"],"cc":[],"text":"Liz,Idon'tknowhowthe addressshowsupwhensent,buttheytellusit's\nkenneth.lay@enron.com.\n\nTalktoyousoon,Ihope.\n\nRosie","mid": "32285792.1075840285818.JavaMail.evans@thyme","fpath":"enron_mail_20110402/maildir/layk/_sent/108.","bcc":[],"to": ["lizard_ar@yahoo.com"],"replyto":null,"ctype":"text/plain;charset=usascii","fname":"108.","date":"20000810 03:27:0007:00","folder":"_sent","subject":"KLL'semailaddress"}

It's a dictionary representing an email found in Kenneth Lay's mailbox. It contains the same content that we dealt with on day 4, but encoded into JSON, and rather than one file per email, we have a single file with one email per line.

Why did we do this? Big data crunching systems like Hadoop don't deal well with lots of small files: they want to be able to send a large chunk of data to a machine and have it crunch on it for a while. So we've processed the data to be in this format: one big file, a bunch of emails line by line. If you're curious how we did this, check out dataiap/day5/emails_to_json.py.

Aside from that, processing the emails is pretty similar to what we did on day 4. Let's look at a script that counts the words in the text of each email. (Remember: it would help if you wrote and ran your code in dataiap/day5/... today, since several modules like term_tools.py are available in that directory.)
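Here is a minimal sketch of such a single-machine word counter over the JSON-encoded emails (a sketch in the spirit of the day 4 code, not necessarily the course's exact script):

import json
import sys
from collections import defaultdict

from term_tools import get_terms  # the same tokenizer we use all day

# count terms across all emails in a single process, on a single machine
word_count = defaultdict(int)
for line in open(sys.argv[1]):  # e.g., ../datasets/emails/lay-k.json
    email = json.loads(line)  # one JSON-encoded email per line
    for term in get_terms(email['text']):
        word_count[term] += 1

for term, count in word_count.items():
    print('%s\t%d' % (term, count))

This works fine on lay-k.json, but notice the single loop and the single dictionary on a single machine: that is exactly what won't scale to the full dataset, and it motivates what comes next.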


OK, now that you've seen the motivation behind the MapReduce technique, let's actually try it out.

MapReduce
Say we have a JSON-encoded file with emails (3,000,000 emails on 3,000,000 lines), and we have 3 computers to compute the number of times each word appears.

In the map phase (figure below), we are going to send each computer 1/3 of the lines. Each computer will process its 1,000,000 lines by reading each of the lines and tokenizing their words. For example, the first machine may extract "enron," "call," ..., while the second machine extracts "conference," "call," ....

From the words, we will create (key, value) pairs (or (word, 1) pairs in this example). The shuffle phase will assign each key to one of the 3 computers, and all the values associated with the same key are sent to the key's computer. This is necessary because the whole dictionary doesn't fit in the memory of a single computer! Think of this as creating the (key, value) pairs of a huge dictionary that spans all of the 3,000,000 emails. Because the whole dictionary doesn't fit into a single machine, the keys are distributed across our 3 machines. In this example, "enron" is assigned to computer 1, while "call" and "conference" are assigned to computer 2, and "bankrupt" is assigned to computer 3.
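The assignment of keys to computers is usually done by hashing. Hadoop's actual partitioner is more involved (and configurable), but a sketch of the idea is:

# every machine computes the same function, so they all agree on
# which machine is responsible for a given key without coordinating
def machine_for_key(key, num_machines=3):
    return hash(key) % num_machines  # e.g., 'enron' always lands on the same machine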

Finally, once each machine has received the values of the keys it's responsible for, the reduce phase will process each key's values. It does this by going through each key that is assigned to the machine and executing a reducer function on the values associated with the key. For example, "enron" was associated with a list of three 1's, and the reducer step simply adds them up.

MapReduce is more general-purpose than just serving to count words. Some people have used it to do exotic things like process millions of songs, but we want you to work through an entire end-to-end example. Without further ado, here's the word count example, but written as a MapReduce application:


import sys
from mrjob.protocol import JSONValueProtocol
from mrjob.job import MRJob
from term_tools import get_terms

class MRWordCount(MRJob):
    INPUT_PROTOCOL = JSONValueProtocol
    OUTPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, key, email):
        for term in get_terms(email['text']):
            yield term, 1

    def reducer(self, term, occurrences):
        yield None, {'term': term, 'count': sum(occurrences)}

if __name__ == '__main__':
    MRWordCount.run()

Let's break this thing down. You'll notice the term MRJob in a bunch of places. mrjob is a python package that makes writing MapReduce programs easy. The developers at Yelp (they wrote the mrjob module) wrote a convenience class called MRJob that you will extend. When it's run, it automatically hooks into the MapReduce framework, reads and parses the input files, and does a bunch of other things for you.

What we do is create a class MRWordCount that extends MRJob, and implement the mapper and reducer functions. If the program is run from the command line (the if __name__ == '__main__': part), it will execute the MRWordCount MapReduce program.

Looking inside MRWordCount, we see INPUT_PROTOCOL being set to JSONValueProtocol. By default, map functions expect a line of text as input, but we've encoded our emails as JSON, so we let MRJob know that. Similarly, we explain that our reduce tasks will emit dictionaries by setting OUTPUT_PROTOCOL appropriately.

The mapper function handles the functionality described in the first image of the last section. It takes each email, tokenizes it into terms, and yields each term. You can yield a key and a value (term and 1) in a mapper (notice the yield arrows in the second figure above). We yield the term with the value 1, meaning one instance of the word term was found. yield is a python keyword that turns functions into iterators (see the StackOverflow explanation). In the context of writing mapper and reducer functions, you can think of it as return.

The reducer function implements the third image of the last section. We are given a word (the key emitted from mappers), and a list occurrences of all of the values emitted for each instance of term. Since we are counting occurrences of words, we yield a dictionary containing the term and a sum of the occurrences we've seen.
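To make the protocol business concrete, here is roughly what JSONValueProtocol does to each input line, written in the same class-method style as the code above (newer versions of mrjob use protocol instances instead, i.e., JSONValueProtocol().read(line)):

from mrjob.protocol import JSONValueProtocol

line = '{"sender": "rosalee.fleming@enron.com", "text": "Liz, ..."}'
key, email = JSONValueProtocol.read(line)  # key is None; email is the parsed dictionary
assert email['sender'] == 'rosalee.fleming@enron.com'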

Back to the reducer: note that we sum instead of len the occurrences. This allows us to change the mapper implementation to emit the number of times each word occurs in a document, rather than 1 for each word.

Both the mapper and reducer offer us the parallelism we wanted. There is no loop through our entire set of emails, so MapReduce is free to distribute the emails to multiple machines, each of which will run mapper on an email-by-email basis. We don't have a single dictionary with the count of every word, but instead have a reduce function that has to sum up the occurrences of a single word, meaning we can again distribute the work to several reducing machines.
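For example, here is a hypothetical alternative mapper that pre-counts terms within each email; because the reducer sums values rather than counting them, the reducer needs no change at all:

from collections import defaultdict

def mapper(self, key, email):
    # count each term within this one email...
    counts = defaultdict(int)
    for term in get_terms(email['text']):
        counts[term] += 1
    # ...then emit one (term, n) pair per distinct term,
    # instead of n separate (term, 1) pairs
    for term, count in counts.items():
        yield term, count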

RunIt!
Enough talk! Let's run this thing:
python mr_wordcount.py -o 'wordcount_test' --no-output '../datasets/emails/lay-k.json'

The -o flag tells MRJob to output all reducer output to the wordcount_test directory. The --no-output flag says not to print the output of the reducers to the screen. The last argument ('../datasets/emails/lay-k.json') specifies which file (or files) to read into the mappers as input.

Take a look at the newly created wordcount_test directory. There should be at least one file (part-00000), and perhaps more. There is one file per reducer that counted words. Reducers don't talk to one another as they do their work, and so we end up with multiple output files. While the count of a specific word will only appear in one file, we have no idea which reducer file will contain a given word.

The output files (open one up in a text editor) list each word as a dictionary on a single line (OUTPUT_PROTOCOL = JSONValueProtocol in mr_wordcount.py is what caused this).

You will notice we have not yet run tasks on large datasets (we're still using lay-k.json) and we are still running them locally on our computers. We will soon learn to move this work to Amazon's cloud infrastructure, but running MRJob tasks locally to test them on a small file is forever important. MapReduce tasks will take a long time to run and hold up several tens to several hundreds of machines. They also cost money to run, whether they contain a bug or not. Test them locally like we just did to make sure you don't have bugs before going to the full dataset.

Show Off What You Learned
Exercise: Create a second version of the MapReduce word counter that counts the number of each word emitted by each sender. You will need this for later, since we're going to be calculating per-sender TF-IDF scores. You can accomplish this with a sneaky change to the term emitted by the mapper. You can either turn that term into a dictionary, or into a more complicated string, but either way you will have to encode both sender and term information in that term. If you get stuck, take a peek at dataiap/day5/mr_wc_by_sender.py.

(Optional) Exercise: The grep command on UNIX-like systems allows you to search text files for some term or terms. Typing grep hotdogs file1 will return all instances of the word hotdogs in the file file1. Implement a grep for emails. When a user uses your MapReduce program to find a word in the email collection, they will be given a list of the subjects and senders of all emails that contain the word. You might find you do not need a particularly smart reducer in this case: that's fine. If you're pressed for time, you can skip this exercise.

We now know how to write some pretty gnarly MapReduce programs, but they all run on our laptops. Sort of boring. It's time to move to the world of distributed computing, Amazon-style!

Amazon Web Services
Amazon Web Services (AWS) is Amazon's gift to people who don't own datacenters. It allows you to elastically request computation and storage resources at varied scales using different services. As a testament to the flexibility of the services, companies like Netflix are moving their entire operation into AWS.

In order to work with AWS, you will need to set up an account. If you're in the class, we will have given you a username, password, access key, and access secret. To use these accounts, you will have to log in through a special class-only webpage. The same instructions work for people who are trying this at home, only you need to log in at the main AWS website.

The username and password log you into the AWS console so that you can click around its interface. In order to let your computer identify itself with AWS, you have to tell your computer your access key and secret. On UNIX-like platforms (GNU/Linux, BSD, MacOS), type the following:


export AWS_ACCESS_KEY_ID='your_key_id'
export AWS_SECRET_ACCESS_KEY='your_access_id'

On Windows machines, type the following at the command line:
set AWS_ACCESS_KEY_ID=your_key_id
set AWS_SECRET_ACCESS_KEY=your_access_id

Replace your_key_id and your_access_id with the ones you were assigned. That's it! There are more than a day's worth of AWS services to discuss, so let's stick with two of them: Simple Storage Service (S3) and Elastic MapReduce (EMR).

AWS S3
S3 allows you to store gigabytes, terabytes, and, if you'd like, petabytes of data in Amazon's datacenters. This is useful, because laptops often don't crunch and store more than a few hundred gigabytes' worth of data, and storing it in the datacenter allows you to securely have access to the data in case of hardware failures. It's also nice because Amazon tries harder than you to have the data be always accessible.

In exchange for nice guarantees about scale and accessibility of data, Amazon charges you rent on the order of 14 cents per gigabyte stored per month.

Services that work on AWS, like EMR, read data from and store data to S3. When we run our MapReduce programs on EMR, we're going to read the email data from S3, and write word count data to S3.

S3 data is stored in buckets. Within a bucket you create, you can store as many files or folders as you'd like. The name of your bucket has to be unique across all of the people that store their stuff in S3. Want to make your own bucket? Let's do this!

- Log in to the AWS console (the website), and click on the S3 tab. This will show you a file-explorer-like interface, with buckets listed on the left and files per bucket listed on the right.
- Click "Create Bucket" near the top left.
- Enter a bucket name. This has to be unique across all users of S3. Pick something like dataiap-YOURUSERNAME-testbucket. Do not use underscores in the name of the bucket.
- Click "Create".

This gives you a bucket, but the bucket has nothing in it! Poor bucket. Let's upload Kenneth Lay's emails.

- Select the bucket from the list on the left.
- Click "Upload".
- Click "Add Files".
- Select the lay-k.json file on your computer.
- Click "Start Upload".
- Right-click on the uploaded file, and click "Make Public".
- Verify the file is public by going to http://dataiap-YOURUSERNAME-testbucket.s3.amazonaws.com/lay-k.json.

Awesome! We just uploaded our first file to S3. Amazon is now hosting the file. We can access it over the web, which means we can share it with other researchers or process it in Elastic MapReduce. To save time, we've uploaded the entire Enron dataset to https://dataiap-enron-json.s3.amazonaws.com/. Head over there to see all of the different Enron employees' files listed (the first three should be allen-p.json, arnold-j.json, and arora-h.json).

(Note: The complete dataset is no longer available here, since it was only intended for in-class use. For alternate instructions, please see the Lectures and Labs page.)

Two notes from here. First, uploading the file to S3 was just an exercise: we'll use the dataiap-enron-json bucket for our future exercises. That's because the total file upload is around 1.3 gigs, and we didn't want to put everyone through the pain of uploading it themselves. Second, most programmers don't use the web interface to upload their files. They instead opt to upload the files from the command line. If you have some free time, feel free to check out dataiap/resources/s3_util.py for a script that copies directories to and downloads buckets from S3.

Let's crunch through these files!

AWS EMR
We're about to process the entire Enron dataset. Let's do a quick sanity check that our MapReduce word count script still works. We're about to get into the territory of spending money, on the order of 10 cents per machine-hour, so we want to make sure we don't run into preventable problems that waste money.
python mr_wordcount.py -o 'wordcount_test2' --no-output '../datasets/emails/lay-k.json'

Did that finish running and output the word counts to wordcount_test2? If so, let's run it on 20 machines (costing us $2, rounded to the nearest hour). Before running the script, we'll talk about the parameters:


python mr_wordcount.py --num-ec2-instances=20 --python-archive package.tar.gz -r emr -o 's3://dataiap-YOURUSERNAME-testbucket/output' --no-output 's3://dataiap-enron-json/*.json'

The parameters are:

- --num-ec2-instances=20: we want to run on 20 machines in the cloud. Snap!
- --python-archive package.tar.gz: when the script runs on remote machines, it will need term_tools.py in order to tokenize the email text. We have packaged this file into package.tar.gz.
- -r emr: don't run the script locally; run it on AWS EMR.
- -o 's3://dataiap-YOURUSERNAME-testbucket/output': write script output to the bucket you made when playing around with S3. Put all files in a directory called output in that bucket. Make sure you change dataiap-YOURUSERNAME-testbucket to whatever bucket name you picked on S3.
- --no-output: don't print the reducer output to the screen.
- 's3://dataiap-enron-json/*.json': perform the MapReduce with input from the dataiap-enron-json bucket that the instructors created, and use as input any file that ends in .json. You could have named a specific file, like lay-k.json, here, but the point is that we can run on much larger datasets.

Check back on the script. Is it still running? It should be. You may as well keep reading, since you'll be here a while. In total, our run took three minutes for Amazon to requisition the machines, four minutes to install the necessary software on them, and between 15 and 25 minutes to run the actual MapReduce tasks on Hadoop. That might strike some of you as weird, and we'll talk about it now.

Understanding MapReduce is about understanding scale. We're used to thinking of our programs as being about performance, but that's not the role of MapReduce. Running a script on a single file on a single machine will be faster than what we just did: running a script on multiple files split amongst multiple machines, which shuffle data around to one another and emit the data to a service (like EMR and S3) over the internet, is not going to be fast. We write MapReduce programs because they let us easily ask for 10 times more machines when the data we're processing grows by a factor of 10, not so that we can achieve sub-second processing times on large datasets. It's a mental model switch that will take a while to appreciate, so let it brew in your mind for a bit.

What it does mean is that MapReduce as a programming model is not a magic bullet. The Enron dataset is not actually so large that it shouldn't be processed on your laptop. We used the dataset because it was large enough to give you an appreciation for order-of-magnitude file size differences, but not large enough that a modern laptop can't process the data. In practice, don't look into MapReduce until you have several tens or hundreds of gigabytes of data to analyze. In the world that exists inside most companies, this size dataset is easy to stumble upon. So don't be disheartened if you don't need the MapReduce skills just yet: you will likely need them one day.

Analyzing the output
Hopefully your first MapReduce is done by now. There are two bits of output we should check out. First, when the MapReduce job finishes, you will see something like the following message in your terminal window:
Counters from step 1:
  FileSystemCounters:
    FILE_BYTES_READ: 499365431
    FILE_BYTES_WRITTEN: 61336628
    S3_BYTES_READ: 1405888038
    S3_BYTES_WRITTEN: 8354556
  Job Counters:
    Launched map tasks: 189
    Launched reduce tasks: 85
    Rack-local map tasks: 189
  Map-Reduce Framework:
    Combine input records: 0
    Combine output records: 0
    Map input bytes: 1405888038
    Map input records: 516893
    Map output bytes: 585440070
    Map output records: 49931418
    Reduce input groups: 232743
    Reduce input records: 49931418
    Reduce output records: 232743
    Reduce shuffle bytes: 27939562
    Spilled Records: 134445547

That's a summary of how many Mappers and Reducers ran on your 20 machines. You can run more than one of each on a physical machine, which explains why more than 20 of each ran in our tasks. Notice how many reducers ran your task. Each reducer is going to receive a set of words and their number of occurrences, and emit word counts. Reducers don't talk to one another, so they end up writing their own files.

With this in mind, go to the S3 console, and look at the output directory of the S3 bucket to which you output your words. Notice that there are several files in the output directory, named part-00000, part-00001, .... There should be as many files as there were reducers, since each wrote the file out. Download some of these files and open them up. You will see the various word counts for words across the entire Enron email corpus. Life is good!

(Optional) Exercise: Make a directory called copied. Copy the output from your script to copied using dataiap/resources/s3_util.py with a command like python ../resources/s3_util.py get s3://dataiap-YOURUSERNAME-testbucket/output copied. Once you've got all the files downloaded, load them up and sort the lines by their count (one way to do this is sketched below). Do the popular terms across the entire dataset make sense?
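A sketch of that load-and-sort step, assuming the part files ended up in copied/:

import json
import os

counts = []
for fname in os.listdir('copied'):
    for line in open(os.path.join('copied', fname)):
        counts.append(json.loads(line))  # each line is {'term': ..., 'count': ...}

counts.sort(key=lambda record: record['count'], reverse=True)  # highest counts first
for record in counts[:20]:
    print('%s\t%d' % (record['term'], record['count']))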

TF-IDF
This section is going to further exercise our MapReduce-fu.

On day 4, we learned that counting words is not enough to summarize text: common words like "the" and "and" are too popular. In order to discount those words, we multiplied the term frequency of word X by log(total # documents / # documents with word X). Let's do that with MapReduce!

We're going to emit a per-sender TF-IDF. To do this, we need three MapReduce tasks:

- The first will calculate the number of documents, for the numerator in IDF.
- The second will calculate the number of documents each term appears in, for the denominator of IDF, and emit the IDF (log(total # documents / # documents with word X)).
- The third calculates a per-sender TF-IDF for each term after taking both the second MapReduce's term IDFs and the email corpus as input.

HINT: Do not run these MapReduce tasks on Amazon. You saw how slow it was to run; make sure the entire TF-IDF workflow works on your local machine with lay-k.json before moving to Amazon.

MapReduce 1: Total Number of Documents
Eugene and I are the laziest of instructors. We don't like doing work where we don't have to. If you'd like a mental exercise as to how to write this MapReduce, you can do so yourself; it's simpler than the word count example. Our dataset is small enough that we can just use the wc UNIX command to count the number of lines in our corpus:
wc -l lay-k.json

Kenneth Lay has 5929 emails in his dataset. We ran wc -l on the entire Enron email dataset, and got 516893. This took a few seconds. Sometimes, it's not worth over-engineering a simple task! :) (If you do want the MapReduce version as a mental exercise, one is sketched below.)
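A sketch of what that document-counting MapReduce might look like (not the course's code): every email votes under a single shared key, and one reducer sums the votes.

from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol

class MRDocCount(MRJob):
    INPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, key, email):
        yield 'doc_count', 1  # one vote per email, all under the same key

    def reducer(self, key, occurrences):
        yield None, sum(occurrences)  # a single reducer sees every vote

if __name__ == '__main__':
    MRDocCount.run()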

MapReduce 2: Per-Term IDF
We recommend you stick to 516893 as your total number of documents, since eventually we're going to be crunching the entire dataset!

What we want to do here is emit log(516893.0 / # documents with word X) for each word X in our dataset. Notice the decimal on 516893.0: that's so we do floating point division rather than integer division. The output should be a file where each line contains {'term': 'wordX', 'idf': 35.92} for actual values of wordX and 35.92.

We've put our answer in dataiap/day5/mr_per_term_idf.py, but try your hand at writing it yourself before you look at ours (or at the sketch below). It can be implemented with a three-line change to the original word count MapReduce we wrote (one line just includes math.log!).
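Once you've given it a shot, here is a sketch of one possible implementation (ours in mr_per_term_idf.py may differ slightly):

import math

from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol
from term_tools import get_terms

TOTAL_DOCUMENTS = 516893.0  # floating point, so the division below isn't integer division

class MRPerTermIDF(MRJob):
    INPUT_PROTOCOL = JSONValueProtocol
    OUTPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, key, email):
        # set() emits each term at most once per email, so the reducer
        # counts documents containing the term, not total occurrences
        for term in set(get_terms(email['text'])):
            yield term, 1

    def reducer(self, term, occurrences):
        yield None, {'term': term,
                     'idf': math.log(TOTAL_DOCUMENTS / sum(occurrences))}

if __name__ == '__main__':
    MRPerTermIDF.run()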

MapReduce 3: Per-Sender TF-IDFs
The third MapReduce multiplies per-sender term frequencies by per-term IDFs. This means it needs to take as input the IDFs calculated in the last step and calculate the per-sender TFs. That requires something we haven't seen yet: initialization logic. Let's show you the code, then tell you how it's done.
import os
from mrjob.protocol import JSONValueProtocol
from mrjob.job import MRJob
from term_tools import get_terms

DIRECTORY = "/path/to/idf_parts/"

class MRTFIDFBySender(MRJob):
    INPUT_PROTOCOL = JSONValueProtocol
    OUTPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, key, email):
        for term in get_terms(email['text']):
            yield {'term': term, 'sender': email['sender']}, 1

    def reducer_init(self):
        self.idfs = {}
        for fname in os.listdir(DIRECTORY):  # look through file names in the directory
            file = open(os.path.join(DIRECTORY, fname))  # open a file
            for line in file:  # read each line in json file
                term_idf = JSONValueProtocol.read(line)[1]  # parse the line as a JSON object
                self.idfs[term_idf['term']] = term_idf['idf']

    def reducer(self, term_sender, howmany):
        tfidf = sum(howmany) * self.idfs[term_sender['term']]
        yield None, {'term_sender': term_sender, 'tfidf': tfidf}

if __name__ == '__main__':
    MRTFIDFBySender.run()

If you did the first exercise, the mapper and reducer functions should look a lot like the per-sender word count mapper and reducer functions you wrote for that. One difference is that reducer takes the term frequencies and multiplies them by self.idfs[term], to normalize by each word's IDF. The other difference is the addition of reducer_init, which we will describe next.
self.idfs is a dictionary containing term-to-IDF mappings from the second MapReduce. Say you ran the IDF-calculating MapReduce like so:
python mr_per_term_idf.py -o 'idf_parts' --no-output '../datasets/emails/lay-k.json'


The individual terms and IDFs would be emitted to the directory idf_parts/. We would want to load all of these term-IDF mappings into self.idfs. Set DIRECTORY to the filesystem path that points to the idf_parts/ directory.

Sometimes, we want to load some data before running the mapper or the reducer. In our example, we want to load the IDF values into memory before executing the reducer, so that the values are available when we compute the TF-IDF. The function reducer_init is designed to perform this setup. It is called before the first reducer is called to calculate TF-IDF. It opens all of the output files in DIRECTORY, and reads them into self.idfs. This way, when reducer is called on a term, the IDF for that term has already been calculated.

To verify you've done this correctly, compare your output to ours. There were some potty mouths that emailed Kenneth Lay:

{'tfidf': 13.155591168821202, 'term_sender': {'term': 'ahole', 'sender': 'justinsitzman@hotmail.com'}}

Why is it OK to Load IDFs Into Memory?
You might be alarmed at the moment. Here we are, working with BIG DATA, and now we're expecting the TF-IDF calculation to load the entirety of the IDF data into memory on EVERY SINGLE reducer. That's crazy town.

It's actually not. While the corpus we're analyzing is large, the number of words in the English language (roughly the number of terms we calculate IDF for) is not. In fact, the output of the per-term IDF calculation was around 8 megabytes, which is far smaller than the 1.3 gigabytes we processed. Keep this in mind: even if calculating something over a large amount of data is hard and takes a while, the result might end up small.

Optional: Run The TF-IDF Workflow
We recommend running the TF-IDF workflow on Amazon once class is over. The first MapReduce script (per-term IDF) should run just fine on Amazon. The second will not. The reducer_init logic expects a file to live in your local directory. You will have to modify it to read the output of the IDF calculations from S3 using boto. Take a look at the code that implements get in dataiap/resources/s3_util.py for a programmatic view of accessing files in S3.
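A rough sketch of what that modification might look like (the bucket and prefix names here are placeholders for wherever your IDF output actually lives; boto reads the AWS credentials from the environment variables you set earlier):

from boto.s3.connection import S3Connection
from mrjob.protocol import JSONValueProtocol

def reducer_init(self):
    self.idfs = {}
    bucket = S3Connection().get_bucket('dataiap-YOURUSERNAME-testbucket')
    for key in bucket.list(prefix='idf_parts/part-'):  # one key per reducer output file
        for line in key.get_contents_as_string().splitlines():
            term_idf = JSONValueProtocol.read(line)[1]
            self.idfs[term_idf['term']] = term_idf['idf']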

Where to go from here
We hope that MapReduce serves you well with large datasets. If this kind of work excites you, here are some things to read up on.

- As you can see, writing more complex workflows for things like TF-IDF can get annoying. In practice, folks use higher-level languages than map and reduce to build MapReduce workflows. Some examples are Pig, Hive, and Cascading.
- If you care about making your MapReduce tasks run faster, there are lots of tricks you can play. One of the easiest things to do is to add a combiner between your mapper and reducer. A combiner has similar logic to a reducer, but runs on a mapper before the shuffle stage. This allows you to, for example, pre-sum the words emitted by the map stage in a word count so that you don't have to shuffle as many words around (see the sketch below).
- MapReduce is one model for parallel programming called data parallelism. Feel free to read about others.
- When MapReduce runs on multiple computers, it's an example of distributed computing, which has a lot of interesting applications and problems to be solved.
- S3 is a distributed storage system and is one of many. It is built upon Amazon's Dynamo technology. It's one of many distributed filesystems and distributed data stores.
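On that combiner point: mrjob lets you define a combiner method alongside mapper and reducer (check your mrjob version's documentation for support). A sketch of adding one to the MRWordCount class from earlier might look like this:

class MRWordCountWithCombiner(MRWordCount):
    def combiner(self, term, occurrences):
        # runs on each mapper's machine before the shuffle, pre-summing that
        # machine's (term, 1) pairs so fewer records cross the network;
        # the reducer is unchanged because it was already summing
        yield term, sum(occurrences)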

MIT OpenCourseWare http://ocw.mit.edu

Resource: How to Process, Analyze and Visualize Data


Adam Marcus and Eugene Wu

The following may not correspond to a particular course on MIT OpenCourseWare, but has been provided by the author as an individual learning resource.

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
