Let's start with a simple word count example, then rewrite it in MapReduce, then run MapReduce on 20 machines using Amazon's EMR, and finally write a big-person MapReduce workflow to calculate TF-IDF!
Setup
We're going to be using two files, dataiap/day5/term_tools.py and dataiap/day5/package.tar.gz. Either write your code in the dataiap/day5 directory, or copy these files to the directory where your work lives.
Counting Words
We're going to start with a simple example that should be familiar to you from day 4's lecture. First, unzip the JSON-encoded Kenneth Lay email file:
unzip dataiap/datasets/emails/kenneth_json.zip
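The warm-up script itself did not survive the conversion of this document. Below is a minimal sketch of a single-machine word counter consistent with the rest of this lab: it assumes, as described later, that lay-k.json holds one JSON-encoded email per line with a 'text' field, and that get_terms from term_tools.py is the tokenizer.

    import json
    from collections import defaultdict

    from term_tools import get_terms  # tokenizer shipped in dataiap/day5

    counts = defaultdict(int)
    for line in open('lay-k.json'):
        email = json.loads(line)  # one JSON-encoded email per line
        for term in get_terms(email['text']):
            counts[term] += 1

    # print the twenty most frequent terms
    for term, count in sorted(counts.items(), key=lambda pair: -pair[1])[:20]:
        print('%s: %d' % (term, count))

This works fine on one machine, but it loops over every email serially and keeps one big dictionary in memory. Once the corpus no longer fits on a single machine, we need a way to spread that loop and that dictionary across many machines. That is the motivation for MapReduce.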
OK, now that you've seen the motivation behind the MapReduce technique, let's actually try it out.
MapReduce
Say we have a JSON-encoded file with emails (3,000,000 emails on 3,000,000 lines), and we have 3 computers to compute the number of times each word appears.

In the map phase, we are going to send each computer 1/3 of the lines. Each computer will process its 1,000,000 lines by reading each of the lines and tokenizing their words. For example, the first machine may extract "enron", "call", ..., while the second machine extracts "conference", "call", .... The words from all of the machines are then shuffled so that every count for a given word lands on the same machine, and in the reduce phase each machine sums up the counts for the words it received.
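The code listing for mr_wordcount.py was also lost in conversion. Here is a minimal sketch consistent with the walkthrough that follows; the authoritative version lives in dataiap/day5/mr_wordcount.py, so treat any detail that differs (such as the exact output key names) as an assumption.

    from mrjob.protocol import JSONValueProtocol
    from mrjob.job import MRJob

    from term_tools import get_terms


    class MRWordCount(MRJob):
        INPUT_PROTOCOL = JSONValueProtocol   # each input line is a JSON-encoded email
        OUTPUT_PROTOCOL = JSONValueProtocol  # emit dictionaries as output

        def mapper(self, key, email):
            # tokenize the email body and emit each term with a count of 1
            for term in get_terms(email['text']):
                yield term, 1

        def reducer(self, term, occurrences):
            # sum all of the 1s emitted for this term across every email
            yield None, {'term': term, 'count': sum(occurrences)}


    if __name__ == '__main__':
        MRWordCount.run()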
Let's break this thing down. You'll notice the term MRJob in a bunch of places. mrjob is a Python package that makes writing MapReduce programs easy. The developers at Yelp (they wrote the mrjob module) wrote a convenience class called MRJob that you will extend. When it's run, it automatically hooks into the MapReduce framework, reads and parses the input files, and does a bunch of other things for you.

What we do is create a class MRWordCount that extends MRJob, and implement the mapper and reducer functions. If the program is run from the command line (the if __name__ == '__main__': part), it will execute the MRWordCount MapReduce program.

Looking inside MRWordCount, we see INPUT_PROTOCOL being set to JSONValueProtocol. By default, map functions expect a line of text as input, but we've encoded our emails as JSON, so we let mrjob know that. Similarly, we explain that our reduce tasks will emit dictionaries by setting OUTPUT_PROTOCOL appropriately.

The mapper function handles the map phase described in the last section. It takes each email, tokenizes it into terms, and yields each term. You can yield a key and a value (term and 1) in a mapper. We yield the term with the value 1, meaning one instance of the word term was found. yield is a Python keyword that turns functions into iterators (see the Stack Overflow explanation). In the context of writing mapper and reducer functions, you can think of it as return.

The reducer function implements the reduce phase. We are given a word (the key emitted from mappers), and a list occurrences of all of the values emitted for each instance of term. Since we are counting occurrences of words, we yield a dictionary containing the term and a sum of the occurrences we've seen.
Note that we sum instead of len the occurrences. This allows us to change the mapper implementation to emit the number of times each word occurs in a document, rather than 1 for each word.

Both the mapper and reducer offer us the parallelism we wanted. There is no loop through our entire set of emails, so MapReduce is free to distribute the emails to multiple machines, each of which will run mapper on an email-by-email basis. We don't have a single dictionary with the count of every word, but instead have a reduce function that has to sum up the occurrences of a single word, meaning we can again distribute the work to several reducing machines.
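For example, a mapper along these lines (a hypothetical variant, not one of the course files) pre-counts terms within each email and still works with the exact same reducer:

    from collections import defaultdict

    def mapper(self, key, email):
        # count each term within this one email, then emit per-email totals;
        # the reducer's sum() aggregates these just as it aggregated the 1s
        counts = defaultdict(int)
        for term in get_terms(email['text']):
            counts[term] += 1
        for term, count in counts.items():
            yield term, count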
Run It!
Enough talk! Let's run this thing.
python mr_wordcount.py -o 'wordcount_test' --no-output '../datasets/emails/lay-k.json'
The -o flag tells mrjob to output all reducer output to the wordcount_test directory. The --no-output flag says not to print the output of the reducers to the screen. The last argument ('../datasets/emails/lay-k.json') specifies which file (or files) to read into the mappers as input.

Take a look at the newly created wordcount_test directory. There should be at least one file (part-00000), and perhaps more. There is one file per reducer that counted words. Reducers don't talk to one another as they do their work, and so we end up with multiple output files. While the count of a specific word will only appear in one file, we have no idea which reducer file will contain a given word.

The output files (open one up in a text editor) list each word as a dictionary on a single line (OUTPUT_PROTOCOL = JSONValueProtocol in mr_wordcount.py is what caused this).
Show Off What You Learned
Exercise: Create a second version of the MapReduce word counter that counts the number of each word emitted by each sender. You will need this for later, since we're going to be calculating TF-IDF per sender. You can accomplish this with a sneaky change to the term emitted by the mapper. You can either turn that term into a dictionary, or into a more complicated string, but either way you will have to encode both sender and term information in that term. If you get stuck, take a peek at dataiap/day5/mr_wc_by_sender.py.

(Optional) Exercise: The grep command on UNIX-like systems allows you to search text files for some term or terms. Typing grep hotdogs file1 will return all instances of the word hotdogs in the file file1. Implement a grep for emails. When a user uses your MapReduce program to find a word in the email collection, they will be given a list of the subjects and senders of all emails that contain the word. You might find you do not need a particularly smart reducer in this case: that's fine. If you're pressed for time, you can skip this exercise.

We now know how to write some pretty gnarly MapReduce programs, but they all run on our laptops. Sort of boring. It's time to move to the world of distributed computing, Amazon-style!
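If you want a hint for the first exercise without opening that file, the sneaky change is to the mapper's key; a sketch (the actual mr_wc_by_sender.py may differ in details):

    def mapper(self, key, email):
        # encode both the sender and the term in the key, so each
        # (sender, term) pair is counted separately by the reducer
        for term in get_terms(email['text']):
            yield {'term': term, 'sender': email['sender']}, 1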
Amazon Web Services
Amazon Web Services (AWS) is Amazon's gift to people who don't own datacenters. It allows you to elastically request computation and storage resources at varied scales using different services. As a testament to the flexibility of the services, companies like Netflix are moving their entire operation into AWS.

In order to work with AWS, you will need to set up an account. If you're in the class, we will have given you a username and a pair of AWS credentials (an access key ID and a secret access key). Set these in your environment so that mrjob can find them when it talks to AWS.
On Windows machines, type the following at the command line:
set AWS_ACCESS_KEY_ID=your_key_id
set AWS_SECRET_ACCESS_KEY=your_access_id
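The corresponding instructions for OS X and Linux did not survive the conversion here; in bash the equivalent is:

    export AWS_ACCESS_KEY_ID=your_key_id
    export AWS_SECRET_ACCESS_KEY=your_access_id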
AWS S3
S3 allows you to store gigabytes, terabytes, and, if you'd like, petabytes of data in Amazon's datacenters. This is useful, because laptops often don't crunch and store more than a few hundred gigabytes worth of data, and storing it in the datacenter allows you to securely have access to the data in case of hardware failures. It's also nice because Amazon tries harder than you to have the data be always accessible.

In exchange for nice guarantees about scale and accessibility of data, Amazon charges you rent on the order of 14 cents per gigabyte stored per month.

Services that work on AWS, like EMR, read data from and store data to S3. When we run our MapReduce programs on EMR, we're going to read the email data from S3, and write word count data to S3.

S3 data is stored in buckets. Within a bucket you create, you can store as many files or folders as you'd like. The name of your bucket has to be unique across all of the people that store their stuff in S3. Want to make your own bucket? Let's do this!

Log in to the AWS console (the website), and click on the S3 tab. This will show you a file-explorer-like interface, with buckets listed on the left and files per bucket listed on the right.

- Click "Create Bucket" near the top left.
- Enter a bucket name. This has to be unique across all users of S3. Pick something like dataiap-YOURUSERNAME-testbucket. Do not use underscores in the name of the bucket.
- Click "Create."

This gives you a bucket, but the bucket has nothing in it! Poor bucket. Let's upload Kenneth Lay's emails.

- Select the bucket from the list on the left.
- Click "Upload."
- Click "Add Files."
- Select the lay-k.json file on your computer.
- Click "Start Upload."
- Right-click on the uploaded file, and click "Make Public."
- Verify the file is public by going to http://dataiap-YOURUSERNAME-testbucket.s3.amazonaws.com/lay-k.json.

Awesome! We just uploaded our first file to S3. Amazon is now hosting the file. We can access it over the web, which means we can share it with other researchers or process it in Elastic MapReduce. To save time, we've uploaded the entire Enron dataset to https://dataiap-enron-json.s3.amazonaws.com/.
The complete dataset is no longer available here, since it was only intended for in-class use. For alternate instructions, please see the Lectures and Labs page.
Head over there to see all of the different Enron employees' files listed (the first three should be allen-p.json, arnold-j.json, and arora-h.json).

Two notes from here. First, uploading the file to S3 was just an exercise: we'll use the dataiap-enron-json bucket for our future exercises. That's because the total file upload is around 1.3 gigs, and we didn't want to put everyone through the pain of uploading it themselves. Second, most programmers don't use the web interface to upload their files. They instead opt to upload the files from the command line. If you have some free time, feel free to check out dataiap/resources/s3_util.py for a script that copies directories to and downloads buckets from S3.

Let's crunch through these files!
AWS EMR
We're about to process the entire Enron dataset. Let's do a quick sanity check that our MapReduce word count script still works. We're about to get into the territory of spending money, on the order of 10 cents per machine-hour, so we want to make sure we don't run into preventable problems that waste money.
python mr_wordcount.py -o 'wordcount_test2' --no-output '../datasets/emails/lay-k.json'
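Once the sanity check passes, launch the job on EMR. The exact command line was lost in conversion, but from the parameter list below it had this shape (substitute your own bucket name):

    python mr_wordcount.py --num-ec2-instances=20 --python-archive package.tar.gz -r emr -o 's3://dataiap-YOURUSERNAME-testbucket/output' --no-output 's3://dataiap-enron-json/*.json'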
The parameters are:

- --num-ec2-instances: we want to run on 20 machines in the cloud. Snap!
- --python-archive: when the script runs on remote machines, it will need term_tools.py in order to tokenize the email text. We have packaged this file into package.tar.gz.
- -r emr: don't run the script locally; run it on AWS EMR.
- -o 's3://dataiap-YOURUSERNAME-testbucket/output': write script output to the bucket you made when playing around with S3. Put all files in a directory called output in that bucket. Make sure you change dataiap-YOURUSERNAME-testbucket to whatever bucket name you picked on S3.
- --no-output: don't print the reducer output to the screen.
- 's3://dataiap-enron-json/*.json': perform the MapReduce with input from the dataiap-enron-json bucket that the instructors created, and use as input any file that ends in .json. You could have named a specific file, like lay-k.json here, but the point is that we can run on much larger datasets.

Check back on the script. Is it still running? It should be. You may as well keep reading, since you'll be here a while. In total, our run took three minutes for Amazon to requisition the machines, four minutes to install the necessary software on them, and between 15 and 25 minutes to run the actual MapReduce tasks on Hadoop. That might strike some of you as weird, and we'll talk about it now.

Understanding MapReduce is about understanding scale. We're used to thinking of our programs as being about performance, but that's not the role of MapReduce. Running a script on a single file on a single machine will be faster; running a script on multiple files split amongst multiple machines, which shuffle data around to one another and emit the data to a service (like EMR and S3) over the internet, is not going to be fast. We write MapReduce programs because they let us easily ask for 10 times more machines when the data we're processing grows by a factor of 10, not so that we can achieve subsecond processing times on large datasets. It's a mental model switch that will take a while to appreciate, so let it brew in your mind for a bit.

What it does mean is that MapReduce as a programming model is not a magic bullet. The Enron dataset is not actually so large that it shouldn't be processed on your laptop. We used the dataset because it was large enough to give you an appreciation for order-of-magnitude file size differences, but not large enough that a modern laptop can't process the data. In practice, don't look into MapReduce until you have several tens or hundreds of gigabytes of data to analyze. In the world that exists inside most companies, this size dataset is easy to stumble upon. So don't be disheartened if you don't need the
MapReduce skills just yet: you will likely need them one day.
Analyzing the output
Hopefully your first MapReduce is done by now. There are two bits of output we should check out. First, when the MapReduce job finishes, you will see something like the following message in your terminal window:
Counters from step 1:
  FileSystemCounters:
    FILE_BYTES_READ: 499365431
    FILE_BYTES_WRITTEN: 61336628
    S3_BYTES_READ: 1405888038
    S3_BYTES_WRITTEN: 8354556
  Job Counters:
    Launched map tasks: 189
    Launched reduce tasks: 85
    Rack-local map tasks: 189
  Map-Reduce Framework:
    Combine input records: 0
    Combine output records: 0
    Map input bytes: 1405888038
    Map input records: 516893
    Map output bytes: 585440070
    Map output records: 49931418
    Reduce input groups: 232743
    Reduce input records: 49931418
    Reduce output records: 232743
    Reduce shuffle bytes: 27939562
    Spilled Records: 134445547
That's a summary of how many mappers and reducers ran on your 20 machines. You can run more than one of each on a physical machine, which explains why more than 20 of each ran in our tasks. Notice how many reducers ran your task. Each reducer is going to receive a set of words and their number of occurrences, and emit word counts. Reducers don't talk to one another, so they end up writing their own files.

With this in mind, go to the S3 console, and look at the output directory of the S3 bucket to which you output your words. Notice that there are several files in the output directory named part-00000, part-00001, .... There should be as many files as there were reducers, since each wrote its own file out. Download some of these files and open them up. You will see the various word counts for words across the entire Enron email corpus. Life is good!

(Optional) Exercise: Make a directory called copied. Copy the output from your script to copied using dataiap/resources/s3_util.py with a command like python ../resources/s3_util.py get s3://dataiap-YOURUSERNAME-testbucket/output copied. Once you've got all the files downloaded, load them up and sort the lines by their count. Do the popular terms across the entire dataset make sense?
TF-IDF
This section is going to further exercise our MapReduce-fu.

On day 4, we learned that counting words is not enough to summarize text: common words like the and and are too popular. In order to discount those words, we multiplied the term frequency of wordX by log(total # documents / # documents with wordX). Let's do that with MapReduce!

We're going to emit a per-sender TF-IDF. To do this, we need three MapReduce tasks:

- The first will calculate the number of documents, for the numerator in IDF.
- The second will calculate the number of documents each term appears in, for the denominator of IDF, and emits the IDF (log(total # documents / # documents with wordX)).
- The third calculates a per-sender TF-IDF for each term, taking as input both the second MapReduce's term IDFs and the emails themselves.
MapReduce 1: Total Number of Documents
Eugene and I are the laziest of instructors. We don't like doing work where we don't have to. If you'd like a mental exercise as to how to write this MapReduce, you can do so yourself, but it's simpler than the word count example. Our dataset is small enough that we can just use the wc UNIX command to count the number of lines in our corpus:
wc -l lay-k.json
Kenneth Lay has 5929 emails in his dataset. We ran wc -l on the entire Enron email dataset, and got 516893. This took a few seconds. Sometimes, it's not worth overengineering a simple task! :)
MapReduce 2: Per-Term IDF
We recommend you stick to 516893 as your total number of documents, since eventually we're going to be crunching the entire dataset!

What we want to do here is emit log(516893.0 / # documents with wordX) for each wordX in our dataset. Notice the decimal on 516893.0: that's so we do floating point division rather than integer division. The output should be a file where each line contains {'term': 'wordX', 'idf': 35.92} for actual values of wordX and 35.92.

We've put our answer in dataiap/day5/mr_per_term_idf.py, but try your hand at writing it yourself before you look at ours. It can be implemented with a three-line change to the original word count MapReduce we wrote (one line just imports math.log!).
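As a reference point, here is one way the per-term IDF job can look. This is a sketch of the idea, not necessarily identical to mr_per_term_idf.py; the set() call makes each document count a term at most once, so the reducer's sum is a document frequency.

    import math

    from mrjob.protocol import JSONValueProtocol
    from mrjob.job import MRJob

    from term_tools import get_terms

    TOTAL_DOCUMENTS = 516893.0  # from the wc -l step above; float for float division


    class MRPerTermIDF(MRJob):
        INPUT_PROTOCOL = JSONValueProtocol
        OUTPUT_PROTOCOL = JSONValueProtocol

        def mapper(self, key, email):
            # emit each term once per document, so the reducer's sum
            # counts the number of documents containing the term
            for term in set(get_terms(email['text'])):
                yield term, 1

        def reducer(self, term, occurrences):
            yield None, {'term': term,
                         'idf': math.log(TOTAL_DOCUMENTS / sum(occurrences))}


    if __name__ == '__main__':
        MRPerTermIDF.run()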
MapReduce 3: Per-Sender TF-IDFs
The third MapReduce multiplies per-sender term frequencies by per-term IDFs. This means it needs to take as input the IDFs calculated in the last step and calculate the per-sender TFs. That requires something we haven't seen yet: initialization logic. Let's show you the code, then tell you how it's done.
import os

from mrjob.protocol import JSONValueProtocol
from mrjob.job import MRJob

from term_tools import get_terms

DIRECTORY = "/path/to/idf_parts/"


class MRTFIDFBySender(MRJob):
    INPUT_PROTOCOL = JSONValueProtocol
    OUTPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, key, email):
        for term in get_terms(email['text']):
            yield {'term': term, 'sender': email['sender']}, 1

    def reducer_init(self):
        self.idfs = {}
        for fname in os.listdir(DIRECTORY):  # look through file names in the directory
            file = open(os.path.join(DIRECTORY, fname))  # open a file
            for line in file:  # read each line in the JSON file
                term_idf = JSONValueProtocol.read(line)[1]  # parse the line as a JSON object
                self.idfs[term_idf['term']] = term_idf['idf']

    def reducer(self, term_sender, howmany):
        tfidf = sum(howmany) * self.idfs[term_sender['term']]
        yield None, {'term_sender': term_sender, 'tfidf': tfidf}


if __name__ == '__main__':
    MRTFIDFBySender.run()
If you did the first exercise, the mapper and reducer functions should look a lot like the per-sender word count mapper and reducer functions you wrote for that. One difference is that reducer takes the term frequencies and multiplies them by self.idfs[term_sender['term']], to normalize by each word's IDF. The other difference is the addition of reducer_init, which we will describe next.
self.idfs is a dictionary containing term-to-IDF mappings from the second MapReduce. Say you ran the IDF-calculating MapReduce like so:
python mr_per_term_idf.py -o 'idf_parts' --no-output '../datasets/emails/lay-k.json'
The individual terms and IDFs would be emitted to the directory idf_parts/. We would want to load all of these term-IDF mappings into self.idfs. Set DIRECTORY to the filesystem path that points to the idf_parts/ directory.

Sometimes, we want to load some data before running the mapper or the reducer. In our example, we want to load the IDF values into memory before executing the reducer, so that the values are available when we compute the TF-IDF. The function reducer_init is designed to perform this setup. It is called before the first reducer is called to calculate TF-IDF. It opens all of the output files in DIRECTORY, and reads them into self.idfs. This way, when reducer is called on a term, the IDF for that term has already been calculated.

To verify you've done this correctly, compare your output to ours. There were some potty mouths that emailed Kenneth Lay:

{'tfidf': 13.155591168821202, 'term_sender': {'term': 'ahole', 'sender': 'justinsitzman@hotmail.com'}}
Why is it OK to Load IDFs Into Memory?
You might be alarmed at the moment. Here we are, working with BIG DATA, and now we're expecting the TF-IDF calculation to load the entirety of the IDF data into memory on EVERY SINGLE reducer. That's crazy town.

It's actually not. While the corpus we're analyzing is large, the number of words in the English language (roughly the number of terms we calculate IDF for) is not. In fact, the output of the per-term IDF calculation was around 8 megabytes, which is far smaller than the 1.3 gigabytes we processed. Keep this in mind: even if calculating something over a large amount of data is hard and takes a while, the result might end up small.
Optional: Run The TF-IDF Workflow
We recommend running the TF-IDF workflow on Amazon once class is over. The first MapReduce script (per-term IDF) should run just fine on Amazon. The second will not. The reducer_init logic expects a file to live in a local directory. You will have to modify it to read the output of the IDF calculations from S3 using boto. Take a look at the code that implements get in dataiap/resources/s3_util.py for a programmatic view of accessing files in S3.
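As a starting point, a reducer_init that pulls the IDF parts from S3 with boto might look like this. This is a sketch, assuming your AWS credentials are set in the environment; the bucket name and the output/ prefix are placeholders for wherever your IDF job wrote its results.

    import boto
    from mrjob.protocol import JSONValueProtocol

    def reducer_init(self):
        self.idfs = {}
        # list every part-XXXXX file the IDF job wrote under output/ in the bucket
        bucket = boto.connect_s3().get_bucket('dataiap-YOURUSERNAME-testbucket')
        for key in bucket.list(prefix='output/'):
            for line in key.get_contents_as_string().splitlines():
                term_idf = JSONValueProtocol.read(line)[1]  # same parsing as before
                self.idfs[term_idf['term']] = term_idf['idf']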
Where to go from here
We hope that MapReduce serves you well with large datasets. If this kind of work excites you, here are some things to read up on.

- As you can see, writing more complex workflows for things like TF-IDF can get annoying. In practice, folks use higher-level languages than map and reduce to build MapReduce workflows. Some examples are Pig, Hive, and Cascading.
- If you care about making your MapReduce tasks run faster, there are lots of tricks you can play. One of the easiest things to do is to add a combiner between your mapper and reducer. A combiner has similar logic to a reducer, but runs on a mapper before the shuffle stage. This allows you to, for example, pre-sum the words emitted by the map stage in a word count so that you don't have to shuffle as many words around (see the sketch after this list).
- MapReduce is one model for parallel programming called data parallelism. Feel free to read about others.
- When MapReduce runs on multiple computers, it's an example of distributed computing, which has a lot of interesting applications and problems to be solved.
- S3 is a distributed storage system and is one of many. It is built upon Amazon's Dynamo technology. It's one of many distributed file systems and distributed data stores.
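For the combiner idea, here is a sketch of the word count with a combiner added, assuming a version of mrjob that supports the combiner method on MRJob subclasses:

    from mrjob.protocol import JSONValueProtocol
    from mrjob.job import MRJob

    from term_tools import get_terms


    class MRWordCountCombined(MRJob):
        INPUT_PROTOCOL = JSONValueProtocol
        OUTPUT_PROTOCOL = JSONValueProtocol

        def mapper(self, key, email):
            for term in get_terms(email['text']):
                yield term, 1

        def combiner(self, term, occurrences):
            # runs machine-locally after the map stage: pre-sum this mapper's
            # counts so fewer records get shuffled to the reducers
            yield term, sum(occurrences)

        def reducer(self, term, occurrences):
            yield None, {'term': term, 'count': sum(occurrences)}


    if __name__ == '__main__':
        MRWordCountCombined.run()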
The following may not correspond to a particular course on MIT OpenCourseWare, but has been provided by the author as an individual learning resource.
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.