This is the HTML version of the file http://www.cloudera.com/content/dam/cloudera/partners/academicpartnersgated/labs/_Homework_Labs_WithProfessorNotes.pdf, generated automatically by Google's web crawler.
Apache Hadoop
A Course for Undergraduates
Homework Labs with Professor's Notes
Copyright 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
Table of Contents

General Notes on Homework Labs
Lecture 1 Lab: Using HDFS
Lecture 2 Lab: Running a MapReduce Job
Lecture 3 Lab: Writing a MapReduce Java Program
Lecture 3 Lab: More Practice with MapReduce Java Programs
Lecture 3 Lab: Writing a MapReduce Streaming Program
Lecture 3 Lab: Writing Unit Tests with the MRUnit Framework
Lecture 4 Lab: Using ToolRunner and Passing Parameters
Lecture 4 Lab: Using a Combiner
Lecture 5 Lab: Testing with LocalJobRunner
Lecture 5 Lab: Logging
Lecture 5 Lab: Using Counters and a Map-Only Job
Lecture 6 Lab: Writing a Partitioner
Lecture 6 Lab: Implementing a Custom WritableComparable
Lecture 6 Lab: Using SequenceFiles and File Compression
Lecture 7 Lab: Creating an Inverted Index
Lecture 7 Lab: Calculating Word Co-Occurrence
Lecture 8 Lab: Importing Data with Sqoop
Lecture 8 Lab: Running an Oozie Workflow
Lecture 8 Bonus Lab: Exploring a Secondary Sort Example
Notes for Upcoming Labs
Lecture 9 Lab: Data Ingest With Hadoop Tools
Lecture 9 Lab: Using Pig for ETL Processing
Lecture 9 Lab: Analyzing Ad Campaign Data with Pig
Lecture 10 Lab: Analyzing Disparate Data Sets with Pig
Lecture 10 Lab: Extending Pig with Streaming and UDFs
Lecture 11 Lab: Running Hive Queries from the Shell, Scripts, and Hue
Lecture 11 Lab: Data Management with Hive
Lecture 12 Lab: Gaining Insight with Sentiment Analysis
Lecture 12 Lab: Data Transformation with Hive
Lecture 13 Lab: Interactive Analysis with Impala
General Notes on Homework Labs

Students complete homework for this course using the student version of the training Virtual Machine (VM). Cloudera supplies a second VM, the professor's VM, in addition to the student VM. The professor's VM comes complete with solutions to the homework labs.

The professor's VM contains additional project subdirectories with hints and solutions. Subdirectories named src/hints and src/solution provide hints (partial solutions) and full solutions, respectively. The student VM will have src/stubs directories only, no hints or solutions directories. Full solutions can be distributed to students after homework has been submitted. In some cases, a lab may require that the previous lab(s) ran successfully, ensuring that the VM is in the required state. Providing students with the solution to the previous lab, and having them run the solution, will bring the VM to the required state. This should be completed prior to running code for the new lab.

Except for the presence of solutions in the professor VM, the student and professor versions of the training VM are the same. Both VMs run the CentOS 6.3 Linux distribution and come configured with CDH (Cloudera's Distribution, including Apache Hadoop) installed in pseudo-distributed mode. In addition to core Hadoop, the Hadoop ecosystem tools necessary to complete the homework labs are also installed (e.g. Pig, Hive, Flume, etc.). Perl, Python, PHP, and Ruby are installed as well.

Hadoop pseudo-distributed mode is a method of running Hadoop whereby all Hadoop daemons run on the same machine. It is, essentially, a cluster consisting of a single machine.
It works just like a larger Hadoop cluster; the key difference (apart from speed, of course!) is that the block replication factor is set to one, since there is only a single DataNode available.

Note: Homework labs are grouped into individual files by lecture number for easy posting of assignments. The same labs appear in this document, but with references to hints and solutions where applicable. The students' homework labs will reference a stubs subdirectory, not hints or solutions. Students will typically complete their coding in the stubs subdirectories.
Getting Started
1. The VM is set to automatically log in as the user training. Should you log out at any time, you can log back in as the user training with the password training.

Working with the Virtual Machine

1. Should you need it, the root password is training. You may be prompted for this if, for example, you want to change the keyboard layout. In general, you should not need this password since the training user has unlimited sudo privileges.

2. In some command-line steps in the labs, you will see lines like this:

$ hadoop fs -put shakespeare \
  /user/training/shakespeare

The dollar sign ($) at the beginning of each line indicates the Linux shell prompt. The actual prompt will include additional information (e.g., [training@localhost workspace]$) but this is omitted from these instructions for brevity.

The backslash (\) at the end of the first line signifies that the command is not completed, and continues on the next line. You can enter the code exactly as shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type in the backslash.
3. Although many students are comfortable using UNIX text editors like vi or emacs, some might prefer a graphical text editor. To invoke the graphical editor from the command line, type gedit followed by the path of the file you wish to edit. Appending & to the command allows you to type additional commands while the editor is still open. Here is an example of how to edit a file named myfile.txt:

$ gedit myfile.txt &
Lecture 1 Lab: Using HDFS

Files Used in This Exercise:
Data files (local):
  ~/training_materials/developer/data/shakespeare.tar.gz
  ~/training_materials/developer/data/access_log.gz

In this lab you will begin to get acquainted with the Hadoop tools. You will manipulate files in HDFS, the Hadoop Distributed File System.

Set Up Your Environment

1. Before starting the labs, run the course setup script in a terminal window:

$ ~/scripts/developer/training_setup_dev.sh
Hadoop
Hadoop is already installed, configured, and running on your virtual machine.

Most of your interaction with the system will be through a command-line wrapper called hadoop. If you run this program with no arguments, it prints a help message. To try this, run the following command in a terminal window:

$ hadoop
The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.

Step 1: Exploring HDFS

The subsystem associated with HDFS in the Hadoop wrapper program is called FsShell. This subsystem can be invoked with the command hadoop fs.

1. Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop.

2. In the terminal window, enter:

$ hadoop fs

You see a help message describing all the commands associated with the FsShell subsystem.
3. Enter:

$ hadoop fs -ls /

This shows you the contents of the root directory in HDFS. There will be multiple entries, one of which is /user. Individual users have a home directory under this directory, named after their username; your username in this course is training, therefore your home directory is /user/training.

4. Try viewing the contents of the /user directory by running:

$ hadoop fs -ls /user

You will see your home directory in the directory listing.
5. List the contents of your home directory by running:

$ hadoop fs -ls /user/training

There are no files yet, so the command silently exits. This is different from running hadoop fs -ls /foo, which refers to a directory that doesn't exist. In this case, an error message would be displayed.

Note that the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are completely separate namespaces.

Step 2: Uploading Files

Besides browsing the existing filesystem, another important thing you can do with FsShell is to upload new data into HDFS.

1. Change directories to the local filesystem directory containing the sample data we will be using in the homework labs.

$ cd ~/training_materials/developer/data

If you perform a regular Linux ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both of these contain the complete works of Shakespeare in text format, but with different formats and organizations. For now we will work with shakespeare.tar.gz.
2. Unzip shakespeare.tar.gz by running:

$ tar zxvf shakespeare.tar.gz

This creates a directory named shakespeare/ containing several files on your local filesystem.
3. Insert this directory into HDFS:

$ hadoop fs -put shakespeare /user/training/shakespeare

This copies the local shakespeare directory and its contents into a remote HDFS directory named /user/training/shakespeare.

4. List the contents of your HDFS home directory now:

$ hadoop fs -ls /user/training

You should see an entry for the shakespeare directory.

5. Now try the same fs -ls command but without a path argument:

$ hadoop fs -ls

You should see the same results. If you don't pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training.

Relative paths

If you pass any relative (non-absolute) paths to FsShell commands (or use relative paths in MapReduce programs), they are considered relative to your home directory.

6. We will also need a sample web server log file, which we will put into HDFS for use in future labs. This file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload in one step. First, create a directory in HDFS in which to store it:

$ hadoop fs -mkdir weblog

7. Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash (-) in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.
$ gunzip -c access_log.gz \
  | hadoop fs -put - weblog/access_log

8. Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.

9. The access log file is quite large: around 500 MB. Create a smaller version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent labs.

$ hadoop fs -mkdir testlog
$ gunzip -c access_log.gz | head -n 5000 \
  | hadoop fs -put - testlog/test_access_log
Step 3: Viewing and Manipulating Files

Now let's view some of the data you just copied into HDFS.

1. Enter:

$ hadoop fs -ls shakespeare

This lists the contents of the /user/training/shakespeare HDFS directory, which consists of the files comedies, glossary, histories, poems, and tragedies.

2. The glossary file included in the compressed file you began with is not strictly a work of Shakespeare, so let's remove it:

$ hadoop fs -rm shakespeare/glossary

Note that you could leave this file in place if you so wished. If you did, then it would be included in subsequent computations across the works of Shakespeare, and would skew your results slightly. As with many real-world big data problems, you make trade-offs between the labor to purify your input data and the precision of your results.
3. Enter:

$ hadoop fs -cat shakespeare/histories | tail -n 50

This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce program is very large, making it inconvenient to view the entire file in the terminal. For this reason, it's often a good idea to pipe the output of the fs -cat command into head, tail, more, or less.
4. To download a file to work with on the local filesystem use the fs -get command. This command takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local filesystem:

$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ less ~/shakepoems.txt
Other Commands

There are several other operations available with the hadoop fs command to perform most common filesystem manipulations: -mv, -cp, -mkdir, etc.

1. Enter:

$ hadoop fs

This displays a brief usage report of the commands available within FsShell. Try playing around with a few of these commands if you like.

This is the end of the lab.
Lecture 2 Lab: Running a MapReduce Job

Files and Directories Used in this Exercise:
Source directory: ~/workspace/wordcount/src/solution
Files:
  WordCount.java: A simple MapReduce driver class.
  WordMapper.java: A mapper class for the job.
  SumReducer.java: A reducer class for the job.
  wc.jar: The compiled, assembled WordCount program

In this lab you will compile Java files, create a JAR, and run MapReduce jobs.

In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.

One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
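The lab runs the provided Java classes, but the word-count logic itself is small. As a quick illustration (not part of the lab materials), here is a Python sketch of the map and reduce steps the job performs:

```python
from collections import defaultdict

def map_words(line):
    """Mapper logic: emit a (word, 1) pair for each word in a line of text."""
    return [(word, 1) for word in line.split()]

def reduce_counts(pairs):
    """Reducer logic: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

pairs = []
for line in ["to be or not to be"]:
    pairs.extend(map_words(line))
print(reduce_counts(pairs))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In the real job, the shuffle and sort phase between map and reduce is handled by Hadoop; here it is implicit in grouping the pairs by key.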
Compiling and Submitting a MapReduce Job

1. In a terminal window, change to the lab source directory, and list the contents:

$ cd ~/workspace/wordcount/src
$ ls

List the files in the solution package directory:

$ ls solution

The package contains the following Java files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.

Examine these files if you wish, but do not change them. Remain in this directory while you execute the following commands.

2. Before compiling, examine the classpath Hadoop is configured to use:

$ hadoop classpath

This lists the locations where the Hadoop core API classes are installed.

3. Compile the three Java classes:

$ javac -classpath `hadoop classpath` solution/*.java

Note: in the command above, the quotes around hadoop classpath are backquotes. This runs the hadoop classpath command and uses its output as part of the javac command.

The compiled (.class) files are placed in the solution directory.

4. Collect your compiled Java files into a JAR file:

$ jar cvf wc.jar solution/*.class

5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:

$ hadoop jar wc.jar solution.WordCount \
  shakespeare wordcounts

This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (solution.WordCount), and the HDFS input and output directories to use for the MapReduce job.
Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.

6. Try running this same command again without any change:

$ hadoop jar wc.jar solution.WordCount \
  shakespeare wordcounts

Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory. This is by design: since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.

7. Review the result of your MapReduce job:

$ hadoop fs -ls wordcounts

This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)

8. View the contents of the output for your job:

$ hadoop fs -cat wordcounts/part-r-00000 | less

You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.) Note that you could have specified wordcounts/* just as well in this command.
Wildcards in HDFS file paths

Take care when using wildcards (e.g. *) when specifying HDFS file names: because of how Linux works, the shell will attempt to expand the wildcard before invoking hadoop, and then pass incorrect references to local files instead of HDFS files. You can prevent this by enclosing the wildcarded HDFS filenames in single quotes, e.g. hadoop fs -cat 'wordcounts/*'

9. Try running the WordCount job against a single file:

$ hadoop jar wc.jar solution.WordCount \
  shakespeare/poems pwords

When the job completes, inspect the contents of the pwords HDFS directory.

10. Clean up the output files produced by your job runs:

$ hadoop fs -rm -r wordcounts pwords
Stopping MapReduce Jobs

It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (which is displaying the MapReduce job's progress) does not actually stop the job itself.

A MapReduce job, once submitted to Hadoop, runs independently of the initiating process, so losing the connection to the initiating process does not kill the job. Instead, you need to tell the Hadoop JobTracker to stop the job.
1. Start another word count job like you did in the previous section:

$ hadoop jar wc.jar solution.WordCount shakespeare \
  count2

2. While this job is running, open another terminal window and enter:

$ mapred job -list

This lists the job ids of all running jobs. A job id looks something like:

job_200902131742_0002

3. Copy the job id, and then kill the running job by entering:

$ mapred job -kill jobid

The JobTracker kills the job, and the program running in the original terminal completes.
This is the end of the lab.
Lecture 3 Lab: Writing a MapReduce Java Program

Projects and Directories Used in this Exercise:
Eclipse project: averagewordlength
Java files:
  AverageReducer.java (Reducer)
  LetterMapper.java (Mapper)
  AvgWordLength.java (driver)
Test data (HDFS):
  shakespeare
Exercise directory: ~/workspace/averagewordlength

In this lab, you will write a MapReduce job that reads any text input and computes the average length of all words that start with each character.

For any text input, the job should report the average length of words that begin with 'a', 'b', and so forth. For example, for input:

No now is definitely not the time

The output would be:

N  2.0
n  3.0
d  10.0
i  2.0
t  3.5

(For the initial solution, your program should be case-sensitive as shown in this example.)
The Algorithm

The algorithm for this program is a simple one-pass MapReduce program:

The Mapper

The Mapper receives a line of text for each input value. (Ignore the input key.) For each word in the line, emit the first letter of the word as a key, and the length of the word as a value. For example, for input value:

No now is definitely not the time

Your Mapper should emit:

N  2
n  3
i  2
d  10
n  3
t  3
t  4

The Reducer

Thanks to the shuffle and sort phase built into MapReduce, the Reducer receives the keys in sorted order, and all the values for one key are grouped together. So, for the Mapper output above, the Reducer receives this:

N  (2)
d  (10)
i  (2)
n  (3,3)
t  (3,4)

The Reducer output should be:

N  2.0
d  10.0
i  2.0
n  3.0
t  3.5
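Before writing the Java classes, it can help to see the whole algorithm end to end. Here is an illustrative Python sketch of the same map and reduce logic (for understanding only; the lab itself is written in Java):

```python
from collections import defaultdict

def mapper(line):
    """Emit (first letter, word length) for each word in the line."""
    return [(word[0], len(word)) for word in line.split()]

def reducer(pairs):
    """Average the word lengths grouped by first letter (case-sensitive)."""
    groups = defaultdict(list)
    for letter, length in pairs:
        groups[letter].append(length)
    return {letter: sum(lengths) / len(lengths)
            for letter, lengths in groups.items()}

averages = reducer(mapper("No now is definitely not the time"))
print(averages["t"])  # 3.5, the average of len("the") and len("time")
```

The grouping step inside reducer() stands in for Hadoop's shuffle and sort, which delivers all values for one key together.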
Step 1: Start Eclipse

There is one Eclipse project for each of the labs that use Java. Using Eclipse will speed up your development time.

1. Be sure you have run the course setup script as instructed earlier in the General Notes section. That script sets up the lab workspace and copies in the Eclipse projects you will use for the remainder of the course.

2. Start Eclipse using the icon on your VM desktop. The projects for this course will appear in the Project Explorer on the left.

Step 2: Write the Program in Java

There are stub files for each of the Java classes for this lab: LetterMapper.java (the Mapper), AverageReducer.java (the Reducer), and AvgWordLength.java (the driver).

If you are using Eclipse, open the stub files (located in the src/stubs package) in the averagewordlength project. If you prefer to work in the shell, the files are in ~/workspace/averagewordlength/src/stubs.
You may wish to refer back to the word count example (in the wordcount project in Eclipse or in ~/workspace/wordcount) as a starting point for your Java code. Here are a few details to help you begin your Java programming:

3. Define the driver

This class should configure and submit your basic job. Among the basic steps here, configure the job with the Mapper class and the Reducer class you will write, and the data types of the intermediate and final keys.

4. Define the Mapper

Note these simple string operations in Java:

str.substring(0, 1)  // String: first letter of str
str.length()         // int: length of str

5. Define the Reducer

In a single invocation the reduce() method receives a string containing one letter (the key) along with an iterable collection of integers (the values), and should emit a single key-value pair: the letter and the average of the integers.

6. Compile your classes and assemble the jar file

To compile and jar, you may either use the command-line javac command as you did earlier in the Running a MapReduce Job lab, or follow the steps below (Using Eclipse to Compile Your Solution) to use Eclipse.
Step 3: Use Eclipse to Compile Your Solution

Follow these steps to use Eclipse to complete this lab.

Note: These same steps will be used for all subsequent labs. The instructions will not be repeated each time, so take note of the steps.
1. Verify that your Java code does not have any compiler errors or warnings.

The Eclipse software in your VM is pre-configured to compile code automatically without performing any explicit steps. Compile errors and warnings appear as red and yellow icons to the left of the code. A red X indicates a compiler error.

2. In the Package Explorer, open the Eclipse project for the current lab (i.e. averagewordlength). Right-click the default package under the src entry and select Export.
3. Select Java > JAR file from the Export dialog box, then click Next.

4. Specify a location for the JAR file. You can place your JAR files wherever you like, e.g.:
Note: For more information about using Eclipse, see the Eclipse Reference in Homework_EclipseRef.docx.

Step 4: Test Your Program

1. In a terminal window, change to the directory where you placed your JAR file. Run the hadoop jar command as you did previously in the Running a MapReduce Job lab.

$ hadoop jar avgwordlength.jar stubs.AvgWordLength \
  shakespeare wordlengths

2. List the results:

$ hadoop fs -ls wordlengths

A single reducer output file should be listed.

3. Review the results:

$ hadoop fs -cat wordlengths/*

The file should list all the numbers and letters in the data set, and the average length of the words starting with them, e.g.:

1  1.02
2  1.0588235294117647
3  1.0
4  1.5
5  1.5
6  1.5
7  1.0
8  1.5
9  1.0
A  3.891394576646375
B  5.139302507836991
C  6.629694233531706

This example uses the entire Shakespeare dataset for your input; you can also try it with just one of the files in the dataset, or with your own test data.
This is the end of the lab.
Lecture 3 Lab: More Practice with MapReduce Java Programs

Files and Directories Used in this Exercise:
Eclipse project: log_file_analysis
Java files:
  SumReducer.java: the Reducer
  LogFileMapper.java: the Mapper
  ProcessLogs.java: the driver class
Test data (HDFS):
  weblog (full version)
  testlog (test sample set)
Exercise directory: ~/workspace/log_file_analysis

In this lab, you will analyze a log file from a web server to count the number of hits made from each unique IP address.

Your task is to count the number of hits made from each IP address in the sample (anonymized) web server log file that you uploaded to the /user/training/weblog directory in HDFS when you completed the Using HDFS lab.

In the log_file_analysis directory, you will find stubs for the Mapper and Driver.
1. Using the stub files in the log_file_analysis project directory, write Mapper and Driver code to count the number of hits made from each IP address in the access log file. Your final result should be a file in HDFS containing each IP address, and the count of log hits from that address. Note: The Reducer for this lab performs the exact same function as the one in the WordCount program you ran earlier. You can reuse that code or you can write your own if you prefer.

2. Build your application jar file following the steps in the previous lab.
3. Test your code using the sample log data in the /user/training/weblog directory. Note: You may wish to test your code against the smaller version of the access log you created in a prior lab (located in the /user/training/testlog HDFS directory) before you run your code against the full log, which can be quite time consuming.
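The per-IP counting this lab asks for mirrors word counting, with the IP address as the key. As an illustration only (not the lab's Java code), here is a Python sketch; it assumes Apache-style log lines in which the client IP is the first whitespace-delimited field:

```python
from collections import Counter

def ip_of(log_line):
    """Mapper logic: extract the key (the client IP) from one log line."""
    return log_line.split()[0]

# Hypothetical sample lines, invented here for illustration.
log_lines = [
    '10.0.0.1 - - [15/Sep/2013:23:59:59 -0400] "GET / HTTP/1.1" 200 1024',
    '10.0.0.2 - - [15/Sep/2013:23:59:59 -0400] "GET /a HTTP/1.1" 200 512',
    '10.0.0.1 - - [16/Sep/2013:00:00:01 -0400] "GET /b HTTP/1.1" 404 0',
]

# Reducer logic: sum the hits per IP, just as SumReducer sums per word.
hits = Counter(ip_of(line) for line in log_lines)
print(hits["10.0.0.1"])  # 2
```

In the actual job, the Mapper emits (IP, 1) pairs and the Reducer sums them; Counter collapses both steps for the sketch.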
This is the end of the lab.
Lecture 3 Lab: Writing a MapReduce Streaming Program

Files and Directories Used in this Exercise:
Project directory: ~/workspace/averagewordlength
Test data (HDFS):
  shakespeare

In this lab you will repeat the same task as in the previous lab: writing a program to calculate average word lengths for letters. However, you will write this as a streaming program using a scripting language of your choice rather than using Java.

Your virtual machine has Perl, Python, PHP, and Ruby installed, so you can choose any of these, or even shell scripting, to develop a Streaming solution.

For your Hadoop Streaming program you will not use Eclipse. Launch a text editor to write your Mapper script and your Reducer script. Here are some notes about solving the problem in Hadoop Streaming:
1. The Mapper Script
The Mapper will receive lines of text on stdin. Find the words in the lines to produce the intermediate output, and emit intermediate (key, value) pairs by writing strings of the form:

key<tab>value<newline>

These strings should be written to stdout.
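A minimal mapper script along these lines might look like this in Python (a sketch for the average-word-length task, assuming the first letter of each word is emitted as the key and the word's length as the value):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: reads lines of text on stdin and
# writes "firstletter<tab>wordlength" lines to stdout.
import sys

def map_stream(stdin, stdout):
    for line in stdin:
        for word in line.split():
            stdout.write("%s\t%d\n" % (word[0], len(word)))

if __name__ == "__main__":
    map_stream(sys.stdin, sys.stdout)
```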
2. The Reducer Script

For the reducer, multiple values with the same key are sent to your script on stdin as successive lines of input. Each line contains a key, a tab, a value, and a newline. All lines with the same key are sent one after another, possibly followed by lines with a different key, until the reducing input is complete. For example, the reduce script may receive the following:

t	3
t	4
w	4
w	6

For this input, emit the following to stdout:

t	3.5
w	5.0

Observe that the reducer receives a key with each input line, and must notice when the key changes on a subsequent line (or when the input is finished) to know when the values for a given key have been exhausted. This is different from the Java version you worked on in the previous lab.
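The key-change detection described above can be sketched in Python like this (assuming tab-separated key/value lines and an average computed per key; the function and variable names are illustrative):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming reducer sketch: input lines are "key<tab>value",
# grouped by key. Emit "key<tab>average" whenever the key changes, and once
# more when the input ends.
import sys

def reduce_stream(stdin, stdout):
    current_key, total, count = None, 0.0, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if current_key is not None and key != current_key:
            stdout.write("%s\t%s\n" % (current_key, total / count))
            total, count = 0.0, 0
        current_key = key
        total += float(value)
        count += 1
    if current_key is not None:
        stdout.write("%s\t%s\n" % (current_key, total / count))

if __name__ == "__main__":
    reduce_stream(sys.stdin, sys.stdout)
```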
3. Run the streaming program:

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/\
contrib/streaming/hadoop-streaming*.jar \
-input inputDir -output outputDir \
-file pathToMapScript -file pathToReduceScript \
-mapper mapBasename -reducer reduceBasename

(Remember, you may need to delete any previous output before running your program by issuing: hadoop fs -rmr dataToDelete.)
4. Review the output in the HDFS directory you specified (outputDir).
Professor's Note ~
The Perl example is in: ~/workspace/wordcount/perl_solution
Professor's Note ~
Solution in Python

You can find a working solution to this lab written in Python in the directory ~/workspace/averagewordlength/python_sample_solution.

To run the solution, change directory to ~/workspace/averagewordlength and run this command:

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce\
/contrib/streaming/hadoop-streaming*.jar \
-input shakespeare -output avgwordstreaming \
-file python_sample_solution/mapper.py \
-file python_sample_solution/reducer.py \
-mapper mapper.py -reducer reducer.py
This is the end of the lab.
Lecture 3 Lab: Writing Unit Tests with the MRUnit Framework

Projects Used in this Exercise
  Eclipse project: mrunit
  Java files:
    SumReducer.java (Reducer from WordCount)
    WordMapper.java (Mapper from WordCount)
    TestWordCount.java (Test Driver)

In this Exercise, you will write Unit Tests for the WordCount code.

1. Launch Eclipse (if necessary) and expand the mrunit folder.
2. Examine the TestWordCount.java file in the mrunit project stubs package.
Notice that three tests have been created, one each for the Mapper, the Reducer, and the entire MapReduce flow. Currently, all three tests simply fail.

3. Run the tests by right-clicking on TestWordCount.java in the Package Explorer panel and choosing Run As > JUnit Test.
4. Observe the failure. Results in the JUnit tab (next to the Package Explorer tab) should indicate that three tests ran with three failures.
5. Now implement the three tests.
6. Run the tests again. Results in the JUnit tab should indicate that three tests ran with no failures.
7. When you are done, close the JUnit tab.

This is the end of the lab.
Lecture 4 Lab: Using ToolRunner and Passing Parameters

Files and Directories Used in this Exercise
  Eclipse project: toolrunner
  Java files:
    AverageReducer.java (Reducer from AverageWordLength)
    LetterMapper.java (Mapper from AverageWordLength)
    AvgWordLength.java (driver from AverageWordLength)
  Exercise directory: ~/workspace/toolrunner

In this Exercise, you will implement a driver using ToolRunner.
Follow the steps below to start with the AverageWordLength program you wrote in an earlier lab, and modify the driver to use ToolRunner. Then modify the Mapper to reference a Boolean parameter called caseSensitive; if true, the mapper should treat upper- and lowercase letters as different; if false or unset, all letters should be converted to lowercase.
Modify the AverageWordLength Driver to use ToolRunner

1. Copy the Reducer, Mapper, and driver code you completed in the Writing Java MapReduce Programs lab earlier (in the averagewordlength project).

Copying Source Files
You can use Eclipse to copy a Java source file from one project or package to another by right-clicking on the file and selecting Copy, then right-clicking the new package and selecting Paste. If the packages have different names (e.g. if you copy from averagewordlength.solution to toolrunner.stubs), Eclipse will automatically change the package directive at the top of the file. If you copy the file using a file browser or the shell, you will have to do that manually.
2. Modify the AvgWordLength driver to use ToolRunner. Refer to the slides for details.
   a. Implement the run method
   b. Modify main to call run
3. Jar your solution and test it before continuing; it should continue to function exactly as it did before. Refer to the Writing a Java MapReduce Program lab for how to assemble and test if you need a reminder.

Modify the Mapper to use a configuration parameter

4. Modify the LetterMapper class to:
   a. Override the setup method to get the value of a configuration parameter called caseSensitive, and use it to set a member variable indicating whether to do case-sensitive or case-insensitive processing.
   b. In the map method, choose whether to do case-sensitive processing (leave the letters as-is), or insensitive processing (convert all letters to lowercase) based on that variable.
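The setup/map pattern in step 4 can be illustrated in plain Python (this mirrors the lab's caseSensitive idea but is not the Hadoop API; the names are hypothetical):

```python
# Pure-Python sketch of the setup/map pattern: read a Boolean "caseSensitive"
# configuration value once (like setup()), then branch on it per record
# (like map()).
def make_letter_mapper(conf):
    case_sensitive = conf.get("caseSensitive", False)   # setup(): read once
    def map_word(word):                                 # map(): use the flag
        return word[0] if case_sensitive else word[0].lower()
    return map_word

assert make_letter_mapper({"caseSensitive": True})("Hadoop") == "H"
assert make_letter_mapper({})("Hadoop") == "h"
```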
Pass a parameter programmatically
5. Modify the driver's run method to set a Boolean configuration parameter called caseSensitive. (Hint: Use the Configuration.setBoolean method.)
6. Test your code twice, once passing false and once passing true. When set to true, your final output should have both upper- and lowercase letters; when false, it should have only lowercase letters.
Hint: Remember to rebuild your Jar file to test changes to your code.
Pass a parameter as a runtime parameter

7. Comment out the code that sets the parameter programmatically. (Eclipse hint: Select the code to comment and then select Source > Toggle Comment). Test again, this time passing the parameter value using -D on the Hadoop command line, e.g.:

$ hadoop jar toolrunner.jar stubs.AvgWordLength \
-DcaseSensitive=true shakespeare toolrunnerout

8. Test passing both true and false to confirm the parameter works correctly.
This is the end of the lab.
Lecture 4 Lab: Using a Combiner

Files and Directories Used in this Exercise
  Eclipse project: combiner
  Java files:
    WordCountDriver.java (Driver from WordCount)
    WordMapper.java (Mapper from WordCount)
    SumReducer.java (Reducer from WordCount)
  Exercise directory: ~/workspace/combiner

In this lab, you will add a Combiner to the WordCount program to reduce the amount of intermediate data sent from the Mapper to the Reducer.

Because summing is associative and commutative, the same class can be used for both the Reducer and the Combiner.
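Why this works can be illustrated in plain Python: pre-summing each mapper's output (what a Combiner does) and then summing the partial totals gives the same answer as one global sum (a sketch, not Hadoop code; the sample data is hypothetical):

```python
# Because addition is associative and commutative, applying the sum function
# once per mapper (the "combiner") and once globally (the "reducer") gives
# the same result as a single global sum, while shipping fewer pairs.
from collections import Counter

def sum_reduce(pairs):
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

mapper_outputs = [
    [("cat", 1), ("the", 1), ("the", 1)],   # mapper 1
    [("the", 1), ("cat", 1)],               # mapper 2
]
# Without a combiner: all 5 pairs cross the network.
direct = sum_reduce(p for out in mapper_outputs for p in out)
# With a combiner: each mapper's output is pre-summed first.
combined = sum_reduce(p for out in mapper_outputs
                      for p in sum_reduce(out).items())
assert direct == combined == Counter({"the": 3, "cat": 2})
```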
Implement a Combiner

1. Copy WordMapper.java and SumReducer.java from the wordcount project to the combiner project.
2. Modify the WordCountDriver.java code to add a Combiner for the WordCount program.
3. Assemble and test your solution. (The output should remain identical to the WordCount application without a combiner.)

This is the end of the lab.
Lecture 5 Lab: Testing with LocalJobRunner

Files and Directories Used in this Exercise
  Eclipse project: toolrunner
  Test data (local):
    ~/training_materials/developer/data/shakespeare
  Exercise directory: ~/workspace/toolrunner

In this lab, you will practice running a job locally for debugging and testing purposes.

In the Using ToolRunner and Passing Parameters lab, you modified the AverageWordLength program to use ToolRunner. This makes it simple to set job configuration properties on the command line.

Run the AverageWordLength program using LocalJobRunner on the command line

1. Run the AverageWordLength program again. Specify -jt=local to run the job locally instead of submitting to the cluster, and -fs=file:/// to use the local file system instead of HDFS. Your input and output files should refer to local files rather than HDFS files.
Note: Use the program you completed in the ToolRunner lab.
$ hadoop jar toolrunner.jar stubs.AvgWordLength \
-fs=file:/// -jt=local \
~/training_materials/developer/data/shakespeare \
localout
2. Review the job output in the local output folder you specified.

Optional: Run the AverageWordLength program using LocalJobRunner in Eclipse

1. In Eclipse, locate the toolrunner project in the Package Explorer. Open the stubs package.
2. Right-click on the driver class (AvgWordLength) and select Run As > Run Configurations...
3. Ensure that Java Application is selected in the run types listed in the left pane.
4. In the Run Configuration dialog, click the New launch configuration button.
5. On the Main tab, confirm that the Project and Main class are set correctly for your project, e.g. Project: toolrunner and Main class: stubs.AvgWordLength.
6. Select the Arguments tab and enter the input and output folders. (These are local, not HDFS, folders, and are relative to the run configuration's working folder, which by default is the project folder in the Eclipse workspace: e.g. ~/workspace/toolrunner.)
7. Click the Run button. The program will run locally with the output displayed in the Eclipse console window.
8. Review the job output in the local output folder you specified.
Note: You can re-run any previous configurations using the Run or Debug history buttons on the Eclipse toolbar.

This is the end of the lab.
Lecture 5 Lab: Logging

Files and Directories Used in this Exercise
  Eclipse project: logging
  Java files:
    AverageReducer.java (Reducer from ToolRunner)
    LetterMapper.java (Mapper from ToolRunner)
    AvgWordLength.java (driver from ToolRunner)
  Test data (HDFS):
    shakespeare
  Exercise directory: ~/workspace/logging

In this lab, you will practice using log4j with MapReduce.

Modify the AverageWordLength program you built in the Using ToolRunner and Passing Parameters lab so that the Mapper logs a debug message indicating whether it is comparing with or without case sensitivity.
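The effect you are after, a debug message that appears only when the log level permits it, works the same way in log4j as in Python's standard logging module; a minimal illustration (the logger name and message mirror the lab but are illustrative):

```python
# Level filtering: a DEBUG message is emitted only when the logger's level
# is DEBUG or lower; at INFO the same call produces no output.
import io
import logging

def run_mapper_with_level(level):
    stream = io.StringIO()
    logger = logging.getLogger("LetterMapper")
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.debug("case sensitive = false")   # the lab's debug message
    logger.removeHandler(handler)
    return stream.getvalue()

assert "case sensitive" in run_mapper_with_level(logging.DEBUG)
assert run_mapper_with_level(logging.INFO) == ""
```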
Enable Mapper Logging for the Job

1. Before adding additional logging messages, try re-running the toolrunner lab solution with Mapper debug logging enabled by adding

-Dmapred.map.child.log.level=DEBUG

to the command line. E.g.:

$ hadoop jar toolrunner.jar stubs.AvgWordLength \
-Dmapred.map.child.log.level=DEBUG shakespeare outdir

2. Take note of the Job ID in the terminal window or by using the mapred job command.
3. When the job is complete, view the logs. In a browser on your VM, visit the JobTracker UI: http://localhost:50030/jobtracker.jsp. Find the job you just ran in the Completed Jobs list and click its Job ID. E.g.:
4. In the task summary, click map to view the map tasks.
5. In the list of tasks, click on the map task to view the details of that task.
6. Under Task Logs, click All. The logs should include both INFO and DEBUG messages. E.g.:

Add Debug Logging Output to the Mapper

7. Copy the code from the toolrunner project to the logging project stubs package. (Use your solution from the ToolRunner lab.)
8. Use log4j to output a debug log message indicating whether the Mapper is doing case-sensitive or case-insensitive mapping.

Build and Test Your Code

9. Following the earlier steps, test your code with Mapper debug logging enabled. View the map task logs in the JobTracker UI to confirm that your message is included in the log. (Hint: Search for LetterMapper in the page to find your message.)
10. Optional: Try running with map logging set to INFO (the default) or WARN instead of DEBUG, and compare the log output.

This is the end of the lab.
Lecture 5 Lab: Using Counters and a Map-Only Job

Files and Directories Used in this Exercise
  Eclipse project: counters
  Java files:
    ImageCounter.java (driver)
    ImageCounterMapper.java (Mapper)
  Test data (HDFS):
    weblog (full web server access log)
    testlog (partial dataset for testing)
  Exercise directory: ~/workspace/counters

In this lab you will create a Map-only MapReduce job.

Your application will process a web server's access log to count the number of times gifs, jpegs, and other resources have been retrieved. Your job will report three figures: number of gif requests, number of jpeg requests, and number of other requests.
Hints

1. You should use a Map-only MapReduce job, by setting the number of Reducers to 0 in the driver code.
2. For input data, use the Web access log file that you uploaded to the HDFS /user/training/weblog directory in the Using HDFS lab.
Note: Test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.
3. Use a counter group such as ImageCounter, with names gif, jpeg and other.
4. In your driver code, retrieve the values of the counters after the job has completed and report them using System.out.println.
5. The output folder on HDFS will contain Mapper output files which are empty, because the Mappers did not write any data.
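The classification logic the hints describe can be sketched in plain Python (the extension matching and names are illustrative; the real job would increment Hadoop counters in the Mapper rather than build a dictionary):

```python
# Pure-Python sketch of the counting logic: classify each requested resource
# as gif, jpeg, or other, and tally a counter per category.
from collections import Counter

def classify(request_path):
    path = request_path.lower()
    if path.endswith(".gif"):
        return "gif"
    # Treating .jpg and .jpeg as the same category is an assumption here.
    if path.endswith(".jpeg") or path.endswith(".jpg"):
        return "jpeg"
    return "other"

counters = Counter(classify(p) for p in
                   ["/logo.gif", "/cat.jpg", "/index.html", "/pic.jpeg"])
assert counters == Counter({"jpeg": 2, "gif": 1, "other": 1})
```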
This is the end of the lab.
Lecture 6 Lab: Writing a Partitioner

Files and Directories Used in this Exercise
  Eclipse project: partitioner
  Java files:
    MonthPartitioner.java (Partitioner)
    ProcessLogs.java (driver)
    CountReducer.java (Reducer)
    LogMonthMapper.java (Mapper)
  Test data (HDFS):
    weblog (full web server access log)
    testlog (partial dataset for testing)
  Exercise directory: ~/workspace/partitioner

In this Exercise, you will write a MapReduce job with multiple Reducers, and create a Partitioner to determine which Reducer each piece of Mapper output is sent to.

The Problem

In the More Practice with Writing MapReduce Java Programs lab you did previously, you built the code in the log_file_analysis project. That program counted the number of hits for each different IP address in a web log file. The final output was a file containing a list of IP addresses, and the number of hits from each address.

This time, you will perform a similar task, but the final output should consist of 12 files, one for each month of the year: January, February, and so on. Each file will contain a list of IP addresses, and the number of hits from each address in that month.

We will accomplish this by having 12 Reducers, each of which is responsible for processing the data for a particular month. Reducer 0 processes January hits, Reducer 1 processes February hits, and so on.
Note: We are actually breaking the standard MapReduce paradigm here, which says that all the values from a particular key will go to the same Reducer. In this example, which is a very common pattern when analyzing log files, values from the same key (the IP address) will go to multiple Reducers, based on the month portion of the line.
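The partitioning rule can be sketched in plain Python (not the Hadoop Partitioner API; the month list and function name are illustrative):

```python
# Pure-Python sketch of the month-based partitioning rule: the partitioner
# inspects the VALUE (the month), not the key (the IP address), and maps it
# to a reducer number in the range 0-11.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def get_partition(key, value, num_reducers=12):
    # value is the three-letter month emitted by the Mapper
    return MONTHS.index(value) % num_reducers

assert get_partition("96.7.4.14", "Jan") == 0
assert get_partition("96.7.4.14", "Apr") == 3
```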
Write the Mapper

1. Starting with the LogMonthMapper.java stub file, write a Mapper that maps a log file output line to an IP/month pair. The map method will be similar to that in the LogFileMapper class in the log_file_analysis project, so you may wish to start by copying that code.
2. The Mapper should emit a Text key (the IP address) and Text value (the month). E.g.:

Input: 96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433
Output key: 96.7.4.14
Output value: Apr

Hint: In the Mapper, you may use a regular expression to parse the log file data if you are familiar with regex processing (see the file Homework_RegexRef.docx for reference). Remember that the log file may contain unexpected data; that is, lines that do not conform to the expected format. Be sure that your code copes with such lines.
Write the Partitioner

3. Modify the MonthPartitioner.java stub file to create a Partitioner that sends the (key, value) pair to the correct Reducer based on the month. Remember that the Partitioner receives both the key and value, so you can inspect the value to determine which Reducer to choose.

Modify the Driver

4. Modify your driver code to specify that you want 12 Reducers.
5. Configure your job to use your custom Partitioner.

Test your Solution

6. Build and test your code. Your output directory should contain 12 files named part-r-000xx. Each file should contain IP address and number of hits for month xx.

Hints:
- Write unit tests for your Partitioner!
- You may wish to test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory. However, note that the test data may not include all months, so some result files will be empty.

This is the end of the lab.
Lecture 6 Lab: Implementing a Custom WritableComparable

Files and Directories Used in this Exercise
  Eclipse project: writables
  Java files:
    StringPairWritable (implements a WritableComparable type)
    StringPairMapper (Mapper for test job)
    StringPairTestDriver (Driver for test job)
  Data file:
    ~/training_materials/developer/data/nameyeartestdata (small set of data for the test job)
  Exercise directory: ~/workspace/writables

In this lab, you will create a custom WritableComparable type that holds two strings.

Test the new type by creating a simple program that reads a list of names (first and last) and counts the number of occurrences of each name.

The mapper should accept lines in the form:

lastname firstname other data

The goal is to count the number of times a lastname/firstname pair occurs within the dataset. For example, for input:
Smith Joe 1963-08-12 Poughkeepsie, NY
Smith Joe 1832-01-20 Sacramento, CA
Murphy Alice 2004-06-02 Berlin, MA

We want to output:

(Smith,Joe)	2
(Murphy,Alice)	1
Note: You will use your custom WritableComparable type in a future lab, so make sure it is working with the test job now.

StringPairWritable

You need to implement a WritableComparable object that holds the two strings. The stub provides an empty constructor for serialization, a standard constructor that will be given two strings, a toString method, and the generated hashCode and equals methods. You will need to implement the readFields, write, and compareTo methods required by WritableComparables.

Note that Eclipse automatically generated the hashCode and equals methods in the stub file. You can generate these two methods in Eclipse by right-clicking in the source code and choosing Source > Generate hashCode() and equals().
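What the three methods must accomplish can be illustrated in plain Python (Hadoop's actual wire format and API differ; this only sketches the write/readFields/compareTo contract for a pair of strings):

```python
# Sketch of the serialization contract: write both fields in a fixed order,
# read them back in the same order, and compare pairs field by field.
import io
import struct

def write(stream, left, right):              # like write(DataOutput)
    for s in (left, right):
        data = s.encode("utf-8")
        # Hypothetical format: 2-byte big-endian length prefix, then bytes.
        stream.write(struct.pack(">H", len(data)) + data)

def read_fields(stream):                     # like readFields(DataInput)
    fields = []
    for _ in range(2):
        (length,) = struct.unpack(">H", stream.read(2))
        fields.append(stream.read(length).decode("utf-8"))
    return tuple(fields)

buf = io.BytesIO()
write(buf, "Smith", "Joe")
buf.seek(0)
assert read_fields(buf) == ("Smith", "Joe")
# compareTo: Python tuple comparison is already field-by-field.
assert ("Murphy", "Alice") < ("Smith", "Joe")
```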
Name Count Test Job

The test job requires a Reducer that sums the number of occurrences of each key. This is the same function that the SumReducer used previously in wordcount, except that SumReducer expects Text keys, whereas the reducer for this job will get StringPairWritable keys. You may either re-write SumReducer to accommodate other types of keys, or you can use the LongSumReducer Hadoop library class, which does exactly the same thing.

You can use the simple test data in ~/training_materials/developer/data/nameyeartestdata to make sure your new type works as expected.
You may test your code using LocalJobRunner or by submitting a Hadoop job to the (pseudo-)cluster as usual. If you submit the job to the cluster, note that you will need to copy your test data to HDFS first.

This is the end of the lab.
Lecture 6 Lab: Using SequenceFiles and File Compression

Files and Directories Used in this Exercise
  Eclipse project: createsequencefile
  Java files:
    CreateSequenceFile.java (a driver that converts a text file to a sequence file)
    ReadCompressedSequenceFile.java (a driver that converts a compressed sequence file to text)
  Test data (HDFS):
    weblog (full web server access log)
  Exercise directory: ~/workspace/createsequencefile

In this lab you will practice reading and writing uncompressed and compressed SequenceFiles.

First, you will develop a MapReduce application to convert text data to a SequenceFile. Then you will modify the application to compress the SequenceFile using Snappy file compression.

When creating the SequenceFile, use the full access log file for input data. (You uploaded the access log file to the HDFS /user/training/weblog directory when you performed the Using HDFS lab.)
After you have created the compressed SequenceFile, you will write a second MapReduce application to read the compressed SequenceFile and write a text file that contains the original log file text.
Write a MapReduce program to create sequence files from text files

1. Determine the number of HDFS blocks occupied by the access log file:
   a. In a browser window, start the NameNode Web UI. The URL is http://localhost:50070
   b. Click Browse the filesystem.
   c. Navigate to the /user/training/weblog/access_log file.
   d. Scroll down to the bottom of the page. The total number of blocks occupied by the access log file appears in the browser window.
2. Complete the stub file in the createsequencefile project to read the access log file and create a SequenceFile. Records emitted to the SequenceFile can have any key you like, but the values should match the text in the access log file. (Hint: You can use a Map-only job using the default Mapper, which simply emits the data passed to it.)
Note: If you specify an output key type other than LongWritable, you must call job.setOutputKeyClass, not job.setMapOutputKeyClass. If you specify an output value type other than Text, you must call job.setOutputValueClass, not job.setMapOutputValueClass.
3. Build and test your solution so far. Use the access log as input data, and specify the uncompressedsf directory for output.
4. Examine the initial portion of the output SequenceFile using the following command:

$ hadoop fs -cat uncompressedsf/part-m-00000 | less

Some of the data in the SequenceFile is unreadable, but parts of the SequenceFile should be recognizable:
- The string SEQ, which appears at the beginning of a SequenceFile
- The Java classes for the keys and values
- Text from the access log file
5. Verify that the number of files created by the job is equivalent to the number of blocks required to store the uncompressed SequenceFile.

Compress the Output

6. Modify your MapReduce job to compress the output SequenceFile. Add statements to your driver to configure the output as follows:
- Compress the output file.
- Use block compression.
- Use the Snappy compression codec.
7. Compile the code and run your modified MapReduce job. For the MapReduce output, specify the compressedsf directory.
8. Examine the first portion of the output SequenceFile. Notice the differences between the uncompressed and compressed SequenceFiles:
- The compressed SequenceFile specifies the org.apache.hadoop.io.compress.SnappyCodec compression codec in its header.
- You cannot read the log file text in the compressed file.
9. Compare the file sizes of the uncompressed and compressed SequenceFiles in the uncompressedsf and compressedsf directories. The compressed SequenceFiles should be smaller.
Write another MapReduce program to uncompress the files

10. Starting with the provided stub file, write a second MapReduce program to read the compressed log file and write a text file. This text file should have the same text data as the log file, plus keys. The keys can contain any values you like.
11. Compile the code and run your MapReduce job.
For the MapReduce input, specify the compressedsf directory in which you created the compressed SequenceFile in the previous section.
For the MapReduce output, specify the compressedsftotext directory.
12. Examine the first portion of the output in the compressedsftotext directory. You should be able to read the textual log file entries.
Optional: Use command-line options to control compression

13. If you used ToolRunner for your driver, you can control compression using command-line arguments. Try commenting out the code in your driver where you call the compression configuration methods. Then test setting the mapred.output.compressed option on the command line, e.g.:

$ hadoop jar sequence.jar \
stubs.CreateUncompressedSequenceFile \
-Dmapred.output.compressed=true \
weblog outdir

14. Review the output to confirm the files are compressed.
This is the end of the lab.
Lecture 7 Lab: Creating an Inverted Index

Files and Directories Used in this Exercise
  Eclipse project: inverted_index
  Java files:
    IndexMapper.java (Mapper)
    IndexReducer.java (Reducer)
    InvertedIndex.java (Driver)
  Data files:
    ~/training_materials/developer/data/invertedIndexInput.tgz
  Exercise directory: ~/workspace/inverted_index

In this lab, you will write a MapReduce job that produces an inverted index.

For this lab you will use an alternate input, provided in the file invertedIndexInput.tgz. When decompressed, this archive contains a directory of files; each is a Shakespeare play formatted as follows:
0	HAMLET
1
2
3	DRAMATIS PERSONAE
4
5
6	CLAUDIUS	king of Denmark. (KING CLAUDIUS:)
7
8	HAMLET	son to the late, and nephew to the present king.
9
10	POLONIUS	lord chamberlain. (LORD POLONIUS:)
...

Each line contains:
  Line number
  separator: a tab character
  value: the line of text

This format can be read directly using the KeyValueTextInputFormat class provided in the Hadoop API. This input format presents each line as one record to your Mapper, with the part before the tab character as the key, and the part after the tab as the value.
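The key/value split that KeyValueTextInputFormat performs can be sketched in a few lines of Python (an illustration of the record format, not the Hadoop implementation):

```python
# Everything before the first tab is the key; everything after it is the value.
def split_key_value(line):
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value

assert split_key_value("8\tHAMLET son to the late king.\n") == \
       ("8", "HAMLET son to the late king.")
```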
Given a body of text in this form, your indexer should produce an index of all the words in the text. For each word, the index should have a list of all the locations where the word appears. For example, for the word honeysuckle your output should look like this:
honeysuckle	2kinghenryiv@1038,midsummernightsdream@2175,...

The index should contain such an entry for every word in the text.
Prepare the Input Data

1. Extract the invertedIndexInput directory and upload to HDFS:

$ cd ~/training_materials/developer/data
$ tar zxvf invertedIndexInput.tgz
$ hadoop fs -put invertedIndexInput invertedIndexInput
Define the MapReduce Solution

Remember that for this program you use a special input format to suit the form of your data, so your driver class will include a line like:

job.setInputFormatClass(KeyValueTextInputFormat.class);

Don't forget to import this class for your use.
Retrieving the File Name

Note that the lab requires you to retrieve the file name, since that is the name of the play. The Context object can be used to retrieve the name of the file like this:

FileSplit fileSplit = (FileSplit) context.getInputSplit();
Path path = fileSplit.getPath();
String fileName = path.getName();

Build and Test Your Solution

Test against the invertedIndexInput data you loaded above.
Hints

You may like to complete this lab without reading any further, or you may find the following hints about the algorithm helpful.

The Mapper

Your Mapper should take as input a key and a line of words, and emit, for each word in the line, that word as the key and its location (file name and line number) as the value.

For example, the line of input from the file hamlet:

282	Have heaven and earth together

produces intermediate output:
Have	hamlet@282
heaven	hamlet@282
and	hamlet@282
earth	hamlet@282
together	hamlet@282
The Reducer

Your Reducer simply aggregates the values presented to it for the same key into one value. Use a separator like "," between the values listed.
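Putting the hints together: below is a minimal plain-Java sketch of the algorithm, with no Hadoop dependencies and illustrative names only (this is not the lab's solution code).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the inverted-index algorithm. In the MapReduce
// version, the body of map() corresponds to context.write() calls in the
// Mapper, and the joining in reduce() happens in the Reducer.
public class InvertedIndexSketch {
    // word -> list of "file@lineNumber" locations
    private final Map<String, List<String>> index = new TreeMap<>();

    // Mapper logic: called once per record. lineNumber is the key that
    // KeyValueTextInputFormat supplies; fileName comes from the InputSplit.
    public void map(String fileName, String lineNumber, String line) {
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new ArrayList<>())
                     .add(fileName + "@" + lineNumber);
            }
        }
    }

    // Reducer logic: join all locations recorded for a word with commas.
    public String reduce(String word) {
        return word + "\t" + String.join(",", index.get(word));
    }
}
```

In the real job, the Mapper emits one (word, location) pair per word, and the Reducer receives all locations for a given word in a single call.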
This is the end of the lab.
Lecture 7 Lab: Calculating Word Co-Occurrence

Files and Directories Used in this Exercise

Eclipse project: word_cooccurrence
Java files:
  WordCoMapper.java (Mapper)
  SumReducer.java (Reducer from WordCount)
  WordCo.java (Driver)
Test directory (HDFS): shakespeare
Exercise directory: ~/workspace/word_cooccurence

In this lab, you will write an application that counts the number of times words appear next to each other.

Test your application using the files in the shakespeare folder you previously copied into HDFS in the Using HDFS lab.
Note that this implementation is a specialization of Word Co-Occurrence as we describe it in the notes; in this case we are only interested in pairs of words which appear directly next to each other.

1. Change directories to the word_cooccurrence directory within the labs directory.

2. Complete the Driver and Mapper stub files; you can use the standard SumReducer from the WordCount project as your Reducer. Your Mapper's intermediate output should be in the form of a Text object as the key and an IntWritable as the value; the key will be word1,word2, and the value will be 1.
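As a hedged sketch of the pairing logic the Mapper needs (plain Java without Hadoop types; the tokenizing rule here is an assumption, and your solution may split words differently):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: every pair of adjacent words in a line becomes a
// "word1,word2" key. In the real Mapper, each such key would be written
// as a Text with an IntWritable value of 1 via context.write().
public class AdjacentPairs {
    public static List<String> pairs(String line) {
        List<String> out = new ArrayList<>();
        String[] words = line.toLowerCase().split("\\W+");
        for (int i = 0; i < words.length - 1; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
            out.add(words[i] + "," + words[i + 1]);
        }
        return out;
    }
}
```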
Extra Credit

If you have extra time, please complete these additional challenges:
Challenge 1: Use the StringPairWritable key type from the Implementing a Custom WritableComparable lab. Copy your completed solution (from the writables project) into the current project.

Challenge 2: Write a second MapReduce job to sort the output from the first job so that the list of pairs of words appears in ascending frequency.

Challenge 3: Sort by descending frequency instead (so that the most frequently occurring word pairs are first in the output). Hint: You will need to extend
org.apache.hadoop.io.LongWritable.Comparator.

This is the end of the lab.
Lecture 8 Lab: Importing Data with Sqoop

In this lab you will import data from a relational database using Sqoop. The data you load here will be used in subsequent labs.
Consider the MySQL database movielens, derived from the MovieLens project from the University of Minnesota. (See the note at the end of this lab.) The database consists of several related tables, but we will import only two of these: movie, which contains about 3,900 movies, and movierating, which has about 1,000,000 ratings of those movies.

Review the Database Tables

First, review the database tables to be loaded into Hadoop.

1. Log on to MySQL:

$ mysql --user=training --password=training movielens

2. Review the structure and contents of the movie table:

mysql> DESCRIBE movie;
...
mysql> SELECT * FROM movie LIMIT 5;

3. Note the column names for the table:
____________________________________________________________________________________________
4. Review the structure and contents of the movierating table:

mysql> DESCRIBE movierating;
mysql> SELECT * FROM movierating LIMIT 5;

5. Note these column names:

____________________________________________________________________________________________

6. Exit mysql:

mysql> quit
Import with Sqoop

You invoke Sqoop on the command line to perform several commands. With it you can connect to your database server to list the databases (schemas) to which you have access, and list the tables available for loading. For database access, you provide a connect string to identify the server and, if required, your username and password.

1. Show the commands available in Sqoop:

$ sqoop help

2. List the databases (schemas) in your database server:

$ sqoop list-databases \
    --connect jdbc:mysql://localhost \
    --username training --password training

(Note: Instead of entering --password training on your command line, you may prefer to enter -P, and let Sqoop prompt you for the password, which is then not visible when you type it.)
3. List the tables in the movielens database:
$ sqoop list-tables \
    --connect jdbc:mysql://localhost/movielens \
    --username training --password training

4. Import the movie table into Hadoop:

$ sqoop import \
    --connect jdbc:mysql://localhost/movielens \
    --username training --password training \
    --fields-terminated-by '\t' --table movie

5. Verify that the command has worked:

$ hadoop fs -ls movie
$ hadoop fs -tail movie/part-m-00000

6. Import the movierating table into Hadoop. Repeat the last two steps, but for the movierating table.

This is the end of the lab.
Note:

This lab uses the MovieLens dataset, or subsets thereof. This data is freely available for academic purposes, and is used and distributed by Cloudera with the express permission of the UMN GroupLens Research Group. If you would like to use this data for your own research purposes, you are free to do so, as long as you cite the GroupLens Research Group in any resulting publications. If you would like to use this data for commercial purposes, you must obtain explicit permission. You may find the full dataset, as well as detailed license terms, at http://www.grouplens.org/node/73
Lecture 8 Lab: Running an Oozie Workflow

Files and Directories Used in this Exercise

Exercise directory: ~/workspace/oozie_labs
Oozie job folders:
  lab1-java-mapreduce
  lab2-sort-wordcount

In this lab, you will inspect and run Oozie workflows.

1. Start the Oozie server:

$ sudo /etc/init.d/oozie start

2. Change directories to the lab directory:

$ cd ~/workspace/oozie_labs
3. Inspect the contents of the job.properties and workflow.xml files in the lab1-java-mapreduce/job folder. You will see that this is the standard WordCount job.

   In the job.properties file, take note of the job's base directory (lab1-java-mapreduce), and the input and output directories relative to that. (These are HDFS directories.)

4. We have provided a simple shell script to submit the Oozie workflow. Inspect the run.sh script and then run:

$ ./run.sh lab1-java-mapreduce

   Notice that Oozie returns a job identification number.
5. Inspect the progress of the job:

$ oozie job -oozie http://localhost:11000/oozie \
    -info job_id

6. When the job has completed, review the job output directory in HDFS to confirm that the output has been produced as expected.

7. Repeat the above procedure for lab2-sort-wordcount. Notice when you inspect workflow.xml that this workflow includes two MapReduce jobs which run one after the other, in which the output of the first is the input for the second. When you inspect the output in HDFS, you will see that the second job sorts the output of the first job into descending numerical order.

This is the end of the lab.
Lecture 8 Bonus Lab: Exploring a Secondary Sort Example

Files and Directories Used in this Exercise

Eclipse project: secondarysort
Data files:
  ~/training_materials/developer/data/nameyeartestdata
Exercise directory: ~/workspace/secondarysort

In this lab, you will run a MapReduce job in different ways to see the effects of various components in a secondary sort program.

The program accepts lines in the form

lastname firstname birthdate

The goal is to identify the youngest person with each last name. For example, for input:

Murphy Joanne 19630812
Murphy Douglas 18320120
Murphy Alice 20040602

we want to write out:

Murphy Alice 20040602

All the code is provided to do this. Following the steps below, you are going to progressively add each component to the job to accomplish the final goal.
Build the Program

1. In Eclipse, review but do not modify the code in the secondarysort project's example package.

2. In particular, note the NameYearDriver class, in which the code to set the partitioner, sort comparator, and group comparator for the job is commented out. This allows us to set those values on the command line instead.

3. Export the jar file for the program as secsort.jar.

4. A small test data file called nameyeartestdata has been provided for you, located in the secondarysort project folder. Copy the data file to HDFS, if you did not already do so in the Writables lab.
Run as a Map-only Job

5. The Mapper for this job constructs a composite key using the StringPairWritable type. See the output of just the mapper by running this program as a Map-only job:

$ hadoop jar secsort.jar example.NameYearDriver \
    -Dmapred.reduce.tasks=0 nameyeartestdata secsortout

6. Review the output. Note the key is a string pair of last name and birth year.
Run using the default Partitioner and Comparators

7. Re-run the job, setting the number of reduce tasks to 2 instead of 0.

8. Note that the output now consists of two files: one for each of the two reduce tasks. Within each file, the output is sorted by last name (ascending) and year (ascending). But it isn't sorted between files, and records with the same last name may be in different files (meaning they went to different reducers).
Run using the custom partitioner

9. Review the code of the custom partitioner class: NameYearPartitioner.
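The rule the partitioner must implement can be sketched in plain Java. The method shape mirrors Hadoop's Partitioner.getPartition(), but this is an illustration, not the project's code:

```java
// Sketch of partitioning by the name half of the composite key only: all
// records with the same last name map to the same reduce task, whatever
// the year.
public class NamePartitionSketch {
    public static int getPartition(String name, int numReduceTasks) {
        // Mask the sign bit so a negative hashCode cannot yield a
        // negative partition number.
        return (name.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```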
10. Re-run the job, adding a second parameter to set the partitioner class to use:

-Dmapreduce.partitioner.class=example.NameYearPartitioner

11. Review the output again, this time noting that all records with the same last name have been partitioned to the same reducer. However, they are still being sorted into the default sort order (name, year ascending). We want it sorted by name ascending and year descending.
Run using the custom sort comparator

12. The NameYearComparator class compares Name/Year pairs, first comparing the names and, if they are equal, comparing the years in descending order (i.e., later years are considered less than earlier years, and thus earlier in the sort order). Re-run the job using NameYearComparator as the sort comparator by adding a third parameter:

-Dmapred.output.key.comparator.class=example.NameYearComparator

13. Review the output and note that each reducer's output is now correctly partitioned and sorted.
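The ordering described in step 12 can be sketched in plain Java (illustrative only; the lab's NameYearComparator operates on StringPairWritable keys):

```java
// Sketch of the secondary sort order: names ascending, then years
// descending, so the youngest person with each name sorts first.
public class NameYearOrderSketch {
    public static int compare(String nameA, int yearA, String nameB, int yearB) {
        int byName = nameA.compareTo(nameB);
        if (byName != 0) return byName;
        // Reversed comparison: later (larger) years sort earlier.
        return Integer.compare(yearB, yearA);
    }
}
```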
Run with the NameYearReducer

14. So far we've been running with the default reducer, the identity reducer, which simply writes each key/value pair it receives. The actual goal of this job is to emit the record for the youngest person with each last name. We can do this easily if all records for a given last name are passed to a single reduce call, sorted in descending order, which can then simply emit the first value passed in each call.

15. Review the NameYearReducer code and note that it emits only the first value passed to each reduce call.

16. Re-run the job, using the reducer by adding a fourth parameter:

-Dmapreduce.reduce.class=example.NameYearReducer

   Alas, the job still isn't correct, because the data being passed to the reduce method is being grouped according to the full key (name and year), so multiple records with the same last name (but different years) are being output. We want it to be grouped by name only.
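Once grouping and sort order are correct, the reduce logic described in step 14 amounts to taking the first value of each call. A plain-Java sketch (names illustrative, not the project's code):

```java
import java.util.List;

// Sketch of the NameYearReducer idea: with all records for one last name
// grouped into a single call and sorted youngest-first, the reducer
// simply emits the first value it receives.
public class YoungestPicker {
    public static String youngest(List<String> sortedRecords) {
        return sortedRecords.get(0);
    }
}
```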
Run with the custom group comparator

17. The NameComparator class compares two string pairs by comparing only the name field and disregarding the year field. Pairs with the same name will be grouped into the same reduce call, regardless of the year. Add the group comparator to the job by adding a final parameter:

-Dmapred.output.value.groupfn.class=example.NameComparator

18. Note that the final output now correctly includes only a single record for each different last name, and that that record is the youngest person with that last name.

This is the end of the lab.
Notes for Upcoming Labs

VM Services Customization

For the remainder of the labs, there are services that must be running in your VM, and others that are optional. It is strongly recommended that you run the following command whenever you start the VM:

$ ~/scripts/analyst/toggle_services.sh

This will conserve memory and increase performance of the virtual machine. After running this command, you may safely ignore any messages about services that have already been started or shut down.

Data Model Reference

For your convenience, you will find a reference document depicting the structure of the tables you will use in the following labs. See file: Homework_DataModelRef.docx

Regular Expression (Regex) Reference

For your convenience, you will find a reference document describing regular expression syntax. See file: Homework_RegexRef.docx
Lecture 9 Lab: Data Ingest With Hadoop Tools

In this lab you will practice using the Hadoop command-line utility to interact with Hadoop's Distributed Filesystem (HDFS) and use Sqoop to import tables from a relational database to HDFS.

Prepare your Virtual Machine

Launch the VM if you haven't already done so, and then run the following command to boost performance by disabling services that are not needed for this class:

$ ~/scripts/analyst/toggle_services.sh
Step 1: Set Up HDFS

1. Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop. Next, change to the directory for this lab by running the following command:

$ cd $ADIR/exercises/data_ingest

2. To see the contents of your home directory, run the following command:

$ hadoop fs -ls /user/training

3. If you do not specify a path, hadoop fs assumes you are referring to your home directory. Therefore, the following command is equivalent to the one above:

$ hadoop fs -ls

4. Most of your work will be in the /dualcore directory, so create that now:

$ hadoop fs -mkdir /dualcore

Step 2: Importing Database Tables into HDFS with Sqoop

Dualcore stores information about its employees, customers, products, and orders in a MySQL database. In the next few steps, you will examine this database before using Sqoop to import its tables into HDFS.
1. Log in to MySQL and select the dualcore database:

$ mysql --user=training --password=training dualcore

2. Next, list the available tables in the dualcore database (mysql> represents the MySQL client prompt and is not part of the command):

mysql> SHOW TABLES;

3. Review the structure of the employees table and examine a few of its records:

mysql> DESCRIBE employees;
mysql> SELECT emp_id, fname, lname, state, salary FROM employees LIMIT 10;

4. Exit MySQL by typing quit, and then hitting the enter key:

mysql> quit

5. Next, run the following command, which imports the employees table into the /dualcore directory created earlier, using tab characters to separate each field:

$ sqoop import \
    --connect jdbc:mysql://localhost/dualcore \
    --username training --password training \
    --fields-terminated-by '\t' \
    --warehouse-dir /dualcore \
    --table employees
6. Revise the previous command and import the customers table into HDFS.

7. Revise the previous command and import the products table into HDFS.

8. Revise the previous command and import the orders table into HDFS.

9. Next, you will import the order_details table into HDFS. The command is slightly different because this table only holds references to records in the orders and products tables, and lacks a primary key of its own. Consequently, you will need to specify the --split-by option and instruct Sqoop to divide the import work among map tasks based on values in the order_id field. An alternative is to use the -m 1 option to force Sqoop to import all the data with a single task, but this would significantly reduce performance.

$ sqoop import \
    --connect jdbc:mysql://localhost/dualcore \
    --username training --password training \
    --fields-terminated-by '\t' \
    --warehouse-dir /dualcore \
    --table order_details \
    --split-by=order_id

This is the end of the lab.
Lecture 9 Lab: Using Pig for ETL Processing

In this lab you will practice using Pig to explore, correct, and reorder data in files from two different ad networks. You will first experiment with small samples of this data using Pig in local mode, and once you are confident that your ETL scripts work as you expect, you will use them to process the complete data sets in HDFS by using Pig in MapReduce mode.

IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Background Information

Dualcore has recently started using online advertisements to attract new customers to its e-commerce site. Each of the two ad networks they use provides data about the ads they've placed. This includes the site where the ad was placed, the date when it was placed, what keywords triggered its display, whether the user clicked the ad, and the per-click cost.
Unfortunately, the data from each network is in a different format. Each file also contains some invalid records. Before we can analyze the data, we must first correct these problems by using Pig to:

- Filter invalid records
- Reorder fields
- Correct inconsistencies
- Write the corrected data to HDFS
Step #1: Working in the Grunt Shell

In this step, you will practice running Pig commands in the Grunt shell.

1. Change to the directory for this lab:

$ cd $ADIR/exercises/pig_etl

2. Copy a small number of records from the input file to another file on the local file system. When you start Pig, you will run in local mode. For testing, you can work faster with small local files than large files in HDFS.

   It is not essential to choose a random sample here; just a handful of records in the correct format will suffice. Use the command below to capture the first 25 records so you have enough to test your script:

$ head -n 25 $ADIR/data/ad_data1.txt > sample1.txt

3. Start the Grunt shell in local mode so that you can work with the local sample1.txt file.
$ pig -x local

A prompt indicates that you are now in the Grunt shell:

grunt>

4. Load the data in the sample1.txt file into Pig and dump it:

grunt> data = LOAD 'sample1.txt';
grunt> DUMP data;

You should see the 25 records that comprise the sample data file.
5. Load the first two columns' data from the sample file as character data, and then dump that data:

grunt> first_2_columns = LOAD 'sample1.txt' AS
       (keyword:chararray, campaign_id:chararray);
grunt> DUMP first_2_columns;

6. Use the DESCRIBE command in Pig to review the schema of first_2_columns:

grunt> DESCRIBE first_2_columns;

The schema appears in the Grunt shell.

Use the DESCRIBE command while performing these labs any time you would like to review schema definitions.

7. See what happens if you run the DESCRIBE command on data. Recall that when you loaded data, you did not define a schema.
grunt> DESCRIBE data;

8. End your Grunt shell session:

grunt> QUIT;
Step #2: Processing Input Data from the First Ad Network

In this step, you will process the input data from the first ad network. First, you will create a Pig script in a file, and then you will run the script. Many people find working this way easier than working directly in the Grunt shell.

1. Edit the first_etl.pig file to complete the LOAD statement and read the data from the sample you just created. The following table shows the format of the data in the file. For simplicity, you should leave the date and time fields separate, so each will be of type chararray, rather than converting them to a single field of type datetime.
Index  Field         Data Type  Description                      Example
0      keyword       chararray  Keyword that triggered ad        tablet
1      campaign_id   chararray  Uniquely identifies the ad       A3
2      date          chararray  Date of ad display               05/29/2013
3      time          chararray  Time of ad display               15:49:21
4      display_site  chararray  Domain where ad shown            www.example.com
5      was_clicked   int        Whether ad was clicked           1
6      cpc           int        Cost per click, in cents         106
7      country       chararray  Name of country in which ad ran  USA
8      placement     chararray  Where on page was ad displayed   TOP
2. Once you have edited the LOAD statement, try it out by running your script in local mode:

$ pig -x local first_etl.pig

   Make sure the output looks correct (i.e., that you have the fields in the expected order and the values appear similar in format to those shown in the table above) before you continue with the next step.

3. Make each of the following changes, running your script in local mode after each one to verify that your change is correct:

   a. Update your script to filter out all records where the country field does not contain USA.
   b. We need to store the fields in a different order than we received them. Use a FOREACH ... GENERATE statement to create a new relation containing the fields in the same order as shown in the following table (the country field is not included, since all records now have the same value):
Index  Field         Description
0      campaign_id   Uniquely identifies the ad
1      date          Date of ad display
2      time          Time of ad display
3      keyword       Keyword that triggered ad
4      display_site  Domain where ad shown
5      placement     Where on page was ad displayed
6      was_clicked   Whether ad was clicked
7      cpc           Cost per click, in cents
   c. Update your script to convert the keyword field to uppercase and to remove any leading or trailing whitespace. (Hint: You can nest calls to the two built-in functions inside the FOREACH ... GENERATE statement from the last statement.)
4. Add the complete data file to HDFS:

$ hadoop fs -put $ADIR/data/ad_data1.txt /dualcore

5. Edit first_etl.pig and change the path in the LOAD statement to match the path of the file you just added to HDFS (/dualcore/ad_data1.txt).

6. Next, replace DUMP with a STORE statement that will write the output of your processing as tab-delimited records to the /dualcore/ad_data1 directory.

7. Run this script in Pig's MapReduce mode to analyze the entire file in HDFS:

$ pig first_etl.pig

   If your script fails, check your code carefully, fix the error, and then try running it again. Don't forget that you must remove output in HDFS from a previous run before you execute the script again.
8. Check the first 20 output records that your script wrote to HDFS and ensure they look correct (you can ignore the message "cat: Unable to write to output stream"; this simply happens because you are writing more data with the fs -cat command than you are reading with the head command):

$ hadoop fs -cat /dualcore/ad_data1/part* | head -20

   a. Are the fields in the correct order?
   b. Are all the keywords now in uppercase?

Step #3: Processing Input Data from the Second Ad Network
Now that you have successfully processed the data from the first ad network, continue by processing data from the second one.

1. Create a small sample of the data from the second ad network that you can test locally while you develop your script:

$ head -n 25 $ADIR/data/ad_data2.txt > sample2.txt

2. Edit the second_etl.pig file to complete the LOAD statement and read the data from the sample you just created (Hint: The fields are comma-delimited). The following table shows the order of fields in this file:
Index  Field         Data Type  Description                     Example
0      campaign_id   chararray  Uniquely identifies the ad      A3
1      date          chararray  Date of ad display              05/29/2013
2      time          chararray  Time of ad display              15:49:21
3      display_site  chararray  Domain where ad shown           www.example.com
4      placement     chararray  Where on page was ad displayed  TOP
5      was_clicked   int        Whether ad was clicked          1
6      cpc           int        Cost per click, in cents        106
7      keyword       chararray  Keyword that triggered ad       tablet
3. Once you have edited the LOAD statement, use the DESCRIBE keyword and then run your script in local mode to check that the schema matches the table above:

$ pig -x local second_etl.pig

4. Replace DESCRIBE with a DUMP statement and then make each of the following changes to second_etl.pig, running this script in local mode after each change to verify what you've done before you continue with the next step:

   d. This ad network sometimes logs a given record twice. Add a statement to the second_etl.pig file so that you remove any duplicate records. If you have done this correctly, you should only see one record where the display_site field has a value of siliconwire.example.com.
   e. As before, you need to store the fields in a different order than you received them. Use a FOREACH ... GENERATE statement to create a new relation containing the fields in the same order you used to write the output from the first ad network (shown again in the table below), and also use the UPPER and TRIM functions to correct the keyword field as you did earlier:
Index  Field         Description
0      campaign_id   Uniquely identifies the ad
1      date          Date of ad display
2      time          Time of ad display
3      keyword       Keyword that triggered ad
4      display_site  Domain where ad shown
5      placement     Where on page was ad displayed
6      was_clicked   Whether ad was clicked
7      cpc           Cost per click, in cents
   f. The date field in this data set is in the format MM-DD-YYYY, while the data you previously wrote is in the format MM/DD/YYYY. Edit the FOREACH ... GENERATE statement to call the REPLACE(date, '-', '/') function to correct this.
5. Once you are sure the script works locally, add the full data set to HDFS:

$ hadoop fs -put $ADIR/data/ad_data2.txt /dualcore

6. Edit the script to have it LOAD the file you just added to HDFS, and then replace the DUMP statement with a STORE statement to write your output as tab-delimited records to the /dualcore/ad_data2 directory.

7. Run your script against the data you added to HDFS:
$ pig second_etl.pig

8. Check the first 15 output records written to HDFS by your script:

$ hadoop fs -cat /dualcore/ad_data2/part* | head -15

   a. Do you see any duplicate records?
   b. Are the fields in the correct order?
   c. Are all the keywords in uppercase?
   d. Is the date field in the correct (MM/DD/YYYY) format?

This is the end of the lab.
Lecture 9 Lab: Analyzing Ad Campaign Data with Pig

During the previous lab, you performed ETL processing on data sets from two online ad networks. In this lab, you will write Pig scripts that analyze this data to optimize advertising, helping Dualcore to save money and attract new customers.
IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Step #1: Find Low-Cost Sites

Both ad networks charge a fee only when a user clicks on a Dualcore ad. This is ideal for Dualcore, since their goal is to bring new customers to their site. However, some sites and keywords are more effective than others at attracting people interested in the new tablet being advertised by Dualcore. With this in mind, you will begin by identifying which sites have the lowest total cost.

1. Change to the directory for this lab:

$ cd $ADIR/exercises/analyze_ads

2. Obtain a local subset of the input data by running the following command:

$ hadoop fs -cat /dualcore/ad_data1/part* \
    | head -n 100 > test_ad_data.txt
Youcanignorethemessagecat:Unabletowritetooutputstream,whichappears
becauseyouarewritingmoredatawiththefscatcommandthanyouarereading
withtheheadcommand.
Note:Asmentionedinthepreviouslab,itisfastertotestPigscriptsbyusingalocal
subsetoftheinputdata.Althoughexplicitstepsarenotprovidedforcreatinglocaldata
subsetsinupcominglabs,doingsowillhelpyouperformthelabsmorequickly.
3. Open the low_cost_sites.pig file in your editor, and then make the following changes:

a. Modify the LOAD statement to read the sample data in the test_ad_data.txt file.
b. Add a line that creates a new relation to include only records where was_clicked has a value of 1.
c. Group this filtered relation by the display_site field.
d. Create a new relation that includes two fields: the display_site and the total cost of all clicks on that site.
e. Sort that new relation by cost (in ascending order).
f. Display just the first three records to the screen.
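Steps a-f might be sketched as follows. This is only an illustration, not the sample solution; the field names and types in the LOAD statement are assumptions about the ad network data produced by the previous lab's ETL, not taken from the lab files:

```pig
-- Sketch only; the schema below is an assumption about the ETL output.
data = LOAD 'test_ad_data.txt' AS (campaign_id:chararray,
    date:chararray, time:chararray, keyword:chararray,
    display_site:chararray, placement:chararray,
    was_clicked:int, cpc:int);

-- (b) keep only the ads that were actually clicked
clicked = FILTER data BY was_clicked == 1;

-- (c, d) group by site, then total the cost of clicks per site
by_site = GROUP clicked BY display_site;
totals = FOREACH by_site GENERATE group AS display_site,
    SUM(clicked.cpc) AS cost;

-- (e, f) lowest-cost sites first; show the top three
sorted = ORDER totals BY cost ASC;
top_three = LIMIT sorted 3;
DUMP top_three;
```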
4. Once you have made these changes, try running your script against the sample data:

$ pig -x local low_cost_sites.pig

5. In the LOAD statement, replace the test_ad_data.txt file with a file glob (pattern) that will load both the /dualcore/ad_data1 and /dualcore/ad_data2 directories (and does not load any other data, such as the text files from the previous lab).

6. Once you have made these changes, try running your script against the data in HDFS:

$ pig low_cost_sites.pig

Question: Which three sites have the lowest overall cost?
Step #2: Find High-Cost Keywords

The terms users type when doing searches may prompt the site to display a Dualcore advertisement. Since online advertisers compete for the same set of keywords, some of them cost more than others. You will now write some Pig Latin to determine which keywords have been the most expensive for Dualcore overall.
1. Since this will be a slight variation on the code you have just written, copy that file as high_cost_keywords.pig:

$ cp low_cost_sites.pig high_cost_keywords.pig

2. Edit the high_cost_keywords.pig file and make the following three changes:

a. Group by the keyword field instead of display_site.
b. Sort in descending order of cost.
c. Display the top five results to the screen instead of the top three as before.

3. Once you have made these changes, try running your script against the data in HDFS:

$ pig high_cost_keywords.pig

Question: Which five keywords have the highest overall cost?
Bonus Lab #1: Count Ad Clicks
One important statistic we haven't yet calculated is the total number of clicks the ads have received. Doing so will help the marketing director plan the next ad campaign budget.

1. Change to the bonus_01 subdirectory of the current lab:

$ cd bonus_01

2. Edit the total_click_count.pig file and implement the following:

a. Group the records (filtered by was_clicked == 1) so that you can call the aggregate function in the next step.
b. Invoke the COUNT function to calculate the total of clicked ads (Hint: Because we shouldn't have any null records, you can use the COUNT function instead of COUNT_STAR, and the choice of field you supply to the function is arbitrary).
c. Display the result to the screen.
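The GROUP ... ALL pattern these steps describe might look like the sketch below (the relation name `clicked` and the field `was_clicked` are assumptions carried over from the earlier steps, not the sample solution):

```pig
-- assumes a relation named data loaded as in the earlier steps
clicked = FILTER data BY was_clicked == 1;

-- (a) GROUP ... ALL collapses everything into a single group
-- so an aggregate can run over the whole relation
everything = GROUP clicked ALL;

-- (b, c) COUNT over any non-null field gives the total clicks
total = FOREACH everything GENERATE COUNT(clicked.was_clicked);
DUMP total;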
3. Once you have made these changes, try running your script against the data in HDFS:

$ pig total_click_count.pig

Question: How many clicks did we receive?
Bonus Lab #2: Estimate the Maximum Cost of the Next Ad Campaign

When you reported the total number of clicks, the Marketing Director said that the goal is to get about three times that amount during the next campaign. Unfortunately, because the cost is based on the site and keyword, it isn't clear how much to budget for that campaign. You can help by estimating the worst-case (most expensive) cost based on 50,000 clicks. You will do this by finding the most expensive ad and then multiplying it by the number of clicks desired in the next campaign.

1. Because this code will be similar to the code you wrote in the previous step, start by copying that file as project_next_campaign_cost.pig:

$ cp total_click_count.pig project_next_campaign_cost.pig

2. Edit the project_next_campaign_cost.pig file and make the following modifications:

a. Since you are trying to determine the highest possible cost, you should not limit your calculation to the cost for ads actually clicked. Remove the FILTER statement so that you consider the possibility that any ad might be clicked.
b. Change the aggregate function to the one that returns the maximum value in the cpc field (Hint: Don't forget to change the name of the relation this field belongs to, in order to account for the removal of the FILTER statement in the previous step).
c. Modify your FOREACH...GENERATE statement to multiply the value returned by the aggregate function by the total number of clicks we expect to have in the next campaign.
d. Display the resulting value to the screen.
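After those modifications, the core of the script might look like this sketch (relation name `data` and field `cpc` are assumed from the earlier labs; compare against the sample solution rather than treating this as it):

```pig
-- (a) no FILTER: in the worst case, any ad might be clicked
everything = GROUP data ALL;

-- (b, c) MAX of cost-per-click, scaled to the 50,000 clicks
-- targeted for the next campaign
projected = FOREACH everything GENERATE MAX(data.cpc) * 50000;

-- (d)
DUMP projected;
```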
3. Once you have made these changes, try running your script against the data in HDFS:

$ pig project_next_campaign_cost.pig

Question: What is the maximum you expect this campaign might cost?

Professor's Note ~
You can compare your solution to the one in the bonus_02/sample_solution/ subdirectory.
Bonus Lab #3: Calculating Click-Through Rate (CTR)

The calculations you did at the start of this lab provided a rough idea about the success of the ad campaign, but didn't account for the fact that some sites display Dualcore's ads more than others. This makes it difficult to determine how effective their ads were by simply counting the number of clicks on one site and comparing it to the number of clicks on another site. One metric that would allow Dualcore to better make such comparisons is the Click-Through Rate (http://tiny.cloudera.com/ade03a), commonly abbreviated as CTR. This value is simply the percentage of ads shown that users actually clicked, and can be calculated by dividing the number of clicks by the total number of ads shown.

1. Change to the bonus_03 subdirectory of the current lab:

$ cd ../bonus_03

2. Edit the lowest_ctr_by_site.pig file and implement the following:

a. Within the nested FOREACH, filter the records to include only records where the ad was clicked.
b. Create a new relation on the line that follows the FILTER statement which counts the number of records within the current group.
c. Add another line below that to calculate the click-through rate in a new field named ctr.
d. After the nested FOREACH, sort the records in ascending order of click-through rate and display the first three to the screen.
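A nested FOREACH of the shape these steps describe might look like the following sketch (relation and field names are assumptions consistent with the earlier steps, not the lab's starter file):

```pig
-- assumes a relation named data with display_site and was_clicked fields
by_site = GROUP data BY display_site;

ctr_by_site = FOREACH by_site {
    -- (a) only the clicked ads within this site's group
    clicked = FILTER data BY was_clicked == 1;
    -- (b) count of clicks within the current group
    clicks = COUNT(clicked);
    -- (c) clicks divided by total ads shown on this site
    GENERATE group AS display_site,
        (double)clicks / COUNT(data) AS ctr;
};

-- (d) lowest click-through rates first
sorted = ORDER ctr_by_site BY ctr ASC;
lowest = LIMIT sorted 3;
DUMP lowest;
```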
3. Once you have made these changes, try running your script against the data in HDFS:

$ pig lowest_ctr_by_site.pig

Question: Which three sites have the lowest click-through rate?

If you still have time remaining, modify your script to display the three keywords with the highest click-through rate.
This is the end of the lab.
Lecture 10 Lab: Analyzing Disparate Data Sets with Pig

In this lab, you will practice combining, joining, and analyzing the product sales data previously exported from Dualcore's MySQL database so you can observe the effects that the recent advertising campaign has had on sales.

IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Step #1: Show Per-Month Sales Before and After Campaign

Before we proceed with more sophisticated analysis, you should first calculate the number of orders Dualcore received each month for the three months before their ad campaign began (February-April 2013), as well as for the month during which their campaign ran (May 2013).
1. Change to the directory for this lab:

$ cd $ADIR/exercises/disparate_datasets

2. Open the count_orders_by_period.pig file in your editor. We have provided the LOAD statement as well as a FILTER statement that uses a regular expression to match the records in the date range you'll analyze. Make the following additional changes:

a. Following the FILTER statement, create a new relation with just one field: the order's year and month (Hint: Use the SUBSTRING built-in function to extract the first part of the order_dtm field, which contains the month and year).
b. Count the number of orders in each of the months you extracted in the previous step.
c. Display the count by month to the screen.
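Steps a-c might be sketched like this (the relation name `recent` and the order_dtm layout "YYYY-MM-DD ..." are assumptions; the lab file provides the actual LOAD and FILTER statements):

```pig
-- assumes the provided LOAD/FILTER produced a relation named recent
-- whose order_dtm field begins with the year and month, e.g. 2013-05-01

-- (a) keep just the year-and-month prefix of the timestamp
months = FOREACH recent GENERATE SUBSTRING(order_dtm, 0, 7) AS month;

-- (b) one group per month, then count the orders in each
by_month = GROUP months BY month;
counts = FOREACH by_month GENERATE group AS month, COUNT(months) AS n;

-- (c)
DUMP counts;
```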
3. Once you have made these changes, try running your script against the data in HDFS:

$ pig count_orders_by_period.pig

Question: Does the data suggest that the advertising campaign we started in May led to a substantial increase in orders?
Step #2: Count Advertised Product Sales by Month

Our analysis from the previous step suggests that sales increased dramatically the same month Dualcore began advertising. Next, you'll compare the sales of the specific product Dualcore advertised (product ID #1274348) during the same period to see whether the increase in sales was actually related to their campaign.

You will be joining two data sets during this portion of the lab. Since this is the first join you have done with Pig, now is a good time to mention a tip that can have a profound effect on the performance of your script. Filtering out unwanted data from each relation before you join them, as we've done in our example, means that your script will need to process less data and will finish more quickly. We will discuss several more Pig performance tips later in class, but this one is worth learning now.

4. Edit the count_tablet_orders_by_period.pig file and implement the following:

a. Join the two relations on the order_id field they have in common.
b. Create a new relation from the joined data that contains a single field: the order's year and month, similar to what you did previously in the count_orders_by_period.pig file.
c. Group the records by month and then count the records in each group.
d. Display the results to your screen.
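The join and per-month count might be sketched as follows (the relation names `orders` and `details` are placeholders; the lab file defines the real ones, already filtered as the performance tip above suggests):

```pig
-- (a) inner join on the shared order_id field
joined = JOIN orders BY order_id, details BY order_id;

-- (b) after a join, fields are disambiguated with the :: prefix
months = FOREACH joined
    GENERATE SUBSTRING(orders::order_dtm, 0, 7) AS month;

-- (c, d) count tablet orders per month
by_month = GROUP months BY month;
counts = FOREACH by_month GENERATE group AS month, COUNT(months);
DUMP counts;
```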
5. Once you have made these changes, try running your script against the data in HDFS:

$ pig count_tablet_orders_by_period.pig

Question: Does the data show an increase in sales of the advertised product corresponding to the month in which Dualcore's campaign was active?
Bonus Lab #1: Calculate Average Order Size

It appears that Dualcore's advertising campaign was successful in generating new orders. Since they sell this tablet at a slight loss to attract new customers, let's see if customers who buy this tablet also buy other things. You will write code to calculate the average number of items for all orders that contain the advertised tablet during the campaign period.

1. Change to the bonus_01 subdirectory of the current lab:

$ cd bonus_01

2. Edit the average_order_size.pig file to calculate the average as described above. While there are multiple ways to achieve this, it is recommended that you implement the following:

a. Filter the orders by date (using a regular expression) to include only those placed during the campaign period (May 1, 2013 through May 31, 2013).
b. Exclude any orders which do not contain the advertised product (product ID #1274348).
c. Create a new relation containing the order_id and product_id fields for these orders.
d. Count the total number of products per order.
e. Calculate the average number of products for all orders.
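One of the multiple possible shapes for this script is sketched below. All relation and field names here are assumptions (the lab file defines the real LOAD statements), and the date-matching regex assumes order_dtm begins with "2013-05-":

```pig
-- (a) orders placed during the campaign month
may_orders = FILTER orders BY order_dtm MATCHES '2013-05-.*';

-- (b) order IDs for orders that include the advertised tablet
joined = JOIN may_orders BY order_id, details BY order_id;
tablet = FILTER joined BY prod_id == 1274348;
ids = DISTINCT (FOREACH tablet
    GENERATE may_orders::order_id AS order_id);

-- (c, d) every line item in those orders, counted per order
items = JOIN ids BY order_id, details BY order_id;
by_order = GROUP items BY ids::order_id;
sizes = FOREACH by_order GENERATE COUNT(items) AS n;

-- (e) average order size across all qualifying orders
avg_size = FOREACH (GROUP sizes ALL) GENERATE AVG(sizes.n);
DUMP avg_size;
```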
3. Once you have made these changes, try running your script against the data in HDFS:

$ pig average_order_size.pig

Question: Does the data show that the average order contained at least two items in addition to the tablet Dualcore advertised?
Bonus Lab #2: Segment Customers for Loyalty Program

Dualcore is considering starting a loyalty rewards program. This will provide exclusive benefits to their best customers, which will help to retain them. Another advantage is that it will also allow Dualcore to capture even more data about the shopping habits of their customers; for example, Dualcore can easily track their customers' in-store purchases when these customers provide their rewards program number at checkout.

To be considered for the program, a customer must have made at least five purchases from Dualcore during 2012. These customers will be segmented into groups based on the total retail price of all purchases each made during that year:

Platinum: Purchases totaled at least $10,000
Gold: Purchases totaled at least $5,000 but less than $10,000
Silver: Purchases totaled at least $2,500 but less than $5,000

Since we are considering the total sales price of orders in addition to the number of orders a customer has placed, not every customer with at least five orders during 2012 will qualify. In fact, only about one percent of the customers will be eligible for membership in one of these three groups.

During this lab, you will write the code needed to filter the list of orders based on date, group them by customer ID, count the number of orders per customer, and then filter this to exclude any customer who did not have at least five orders. You will then join this information with the order details and products data sets in order to calculate the total sales of those orders for each customer, split them into the groups based on the criteria described above, and then write the data for each group (customer ID and total sales) into a separate directory in HDFS.
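The final segmentation-and-store stage of that pipeline could use Pig's SPLIT operator, roughly as sketched below. The relation name `totals` is a placeholder for the (customer ID, total sales) relation your earlier steps produce, and the thresholds assume totals are in cents, as the price data in these labs is:

```pig
-- assumes totals: (cust_id, total) with total in cents
SPLIT totals INTO
    platinum IF total >= 1000000,
    gold     IF total >= 500000 AND total < 1000000,
    silver   IF total >= 250000 AND total < 500000;

-- one output directory per segment
STORE platinum INTO '/dualcore/loyalty/platinum';
STORE gold     INTO '/dualcore/loyalty/gold';
STORE silver   INTO '/dualcore/loyalty/silver';
```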
1. Change to the bonus_02 subdirectory of the current lab:

$ cd ../bonus_02

2. Edit the loyalty_program.pig file and implement the steps described above. The code to load the three data sets you will need is already provided for you.

3. After you have written the code, run it against the data in HDFS:

$ pig loyalty_program.pig

4. If your script completed successfully, use the hadoop fs -getmerge command to create a local text file for each group so you can check your work (note that the name of the directory shown here may not be the same as the one you chose):

$ hadoop fs -getmerge /dualcore/loyalty/platinum platinum.txt
$ hadoop fs -getmerge /dualcore/loyalty/gold gold.txt
$ hadoop fs -getmerge /dualcore/loyalty/silver silver.txt

5. Use the UNIX head and/or tail commands to check a few records and ensure that the total sales prices fall into the correct ranges:

$ head platinum.txt
$ tail gold.txt
$ head silver.txt

6. Finally, count the number of customers in each group:

$ wc -l platinum.txt
$ wc -l gold.txt
$ wc -l silver.txt

This is the end of the lab.
Lecture 10 Lab: Extending Pig with Streaming and UDFs

In this lab you will use the STREAM keyword in Pig to analyze metadata from Dualcore's customer service call recordings to identify the cause of a sudden increase in complaints. You will then use this data in conjunction with a user-defined function to propose a solution for resolving the problem.

IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Background Information

Dualcore outsources its call center operations, and costs have recently risen due to an increase in the volume of calls handled by these agents. Unfortunately, Dualcore does not have access to the call center's database, but they are provided with recordings of these calls stored in MP3 format. By using Pig's STREAM keyword to invoke a provided Python script, you can extract the category and timestamp from the files, and then analyze that data to learn what is causing the recent increase in calls.

Step #1: Extract Call Metadata
Note: Since the Python library we are using for extracting the tags doesn't support HDFS, we run this script in local mode on a small sample of the call recordings. Because you will use Pig's local mode, there will be no need to ship the script to the nodes in the cluster.
1. Change to the directory for this lab:

$ cd $ADIR/exercises/extending_pig

2. A Python script (readtags.py) is provided for extracting the metadata from the MP3 files. This script takes the path of a file on the command line and returns a record containing five tab-delimited fields: the file path, call category, agent ID, customer ID, and the timestamp of when the agent answered the call.

Your first step is to create a text file containing the paths of the files to analyze, with one line for each file. You can easily create the data in the required format by capturing the output of the UNIX find command:

$ find $ADIR/data/cscalls -name '*.mp3' > call_list.txt

3. Edit the extract_metadata.pig file and make the following changes:

a. Replace the hard-coded parameter in the SUBSTRING function used to filter by month with a parameter named MONTH whose value you can assign on the command line. This will make it easy to check the leading call categories for different months without having to edit the script.
b. Add the code necessary to count calls by category.
c. Display the top three categories (based on number of calls) to the screen.
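The overall shape of the script might look like this sketch. It is not the provided extract_metadata.pig: the STREAM output schema, the timestamp layout, and the exact way readtags.py is invoked through STREAM are assumptions here, so treat this only as an outline of the technique:

```pig
calls = LOAD 'call_list.txt' AS (path:chararray);

-- run each file path through the provided metadata extractor
meta = STREAM calls THROUGH `readtags.py` AS (path:chararray,
    category:chararray, agent_id:chararray, cust_id:chararray,
    answered:chararray);

-- (a) $MONTH is supplied on the command line with -param,
-- assuming the timestamp begins with YYYY-MM
in_month = FILTER meta BY SUBSTRING(answered, 0, 7) == '$MONTH';

-- (b, c) count calls per category and keep the top three
by_cat = GROUP in_month BY category;
counts = FOREACH by_cat GENERATE group AS category,
    COUNT(in_month) AS n;
sorted = ORDER counts BY n DESC;
top3 = LIMIT sorted 3;
DUMP top3;
```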
4. Once you have made these changes, run your script to check the top three categories in the month before Dualcore started the online advertising campaign:

$ pig -x local -param MONTH=2013-04 extract_metadata.pig

5. Now run the script again, this time specifying the parameter for May:

$ pig -x local -param MONTH=2013-05 extract_metadata.pig

The output should confirm that not only is call volume substantially higher in May, the SHIPPING_DELAY category has more than twice the amount of calls as the other two.
Step #2: Choose Best Location for Distribution Center

The analysis you just completed uncovered a problem. Dualcore's Vice President of Operations launched an investigation based on your findings and has now confirmed the cause: their online advertising campaign is indeed attracting many new customers, but many of them live far from Dualcore's only distribution center in Palo Alto, California. All shipments are transported by truck, so an order can take up to five days to deliver depending on the customer's location.
To solve this problem, Dualcore will open a new distribution center to improve shipping times.

The ZIP codes for the three proposed sites are 02118, 63139, and 78237. You will look up the latitude and longitude of these ZIP codes, as well as the ZIP codes of customers who have recently ordered, using a supplied data set. Once you have the coordinates, you will invoke the HaversineDistInMiles UDF distributed with DataFu to determine how far each customer is from the three distribution centers. You will then calculate the average distance for all customers to each of these distribution centers in order to propose the one that will benefit the most customers.
1. Add the tab-delimited file mapping ZIP codes to latitude/longitude points to HDFS:

$ hadoop fs -mkdir /dualcore/distribution
$ hadoop fs -put $ADIR/data/latlon.tsv \
    /dualcore/distribution
2. A script (create_cust_location_data.pig) has been provided to find the ZIP codes for customers who placed orders during the period of the ad campaign. It also excludes the ones who are already close to the current facility, as well as customers in the remote states of Alaska and Hawaii (where orders are shipped by airplane). The Pig Latin code joins these customers' ZIP codes with the latitude/longitude data set uploaded in the previous step, then writes those three columns (ZIP code, latitude, and longitude) as the result. Examine the script to see how it works, and then run it to create the customer location data in HDFS:

$ pig create_cust_location_data.pig
3. You will use the HaversineDistInMiles function to calculate the distance from each customer to each of the three proposed warehouse locations. This function requires us to supply the latitude and longitude of both the customer and the warehouse. While the script you just executed created the latitude and longitude for each customer, you must create a data set containing the ZIP code, latitude, and longitude for these warehouses. Do this by running the following UNIX command:
$ egrep '^02118|^63139|^78237' \
    $ADIR/data/latlon.tsv > warehouses.tsv
4. Next, add this file to HDFS:

$ hadoop fs -put warehouses.tsv /dualcore/distribution

5. Edit the calc_average_distances.pig file. The UDF is already registered and an alias for this function named DIST is defined at the top of the script, just before the two data sets you will use are loaded. You need to complete the rest of this script:
a. Create a record for every combination of customer and proposed distribution center location.
b. Use the function to calculate the distance from the customer to the warehouse.
c. Calculate the average distance for all customers to each warehouse.
d. Display the result to the screen.
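Steps a-d might be sketched as below. DIST is the alias the script registers for HaversineDistInMiles; the relation names `customers` and `warehouses` and their field names are assumptions about the two data sets the script loads:

```pig
-- (a) CROSS pairs every customer with every proposed warehouse
pairs = CROSS customers, warehouses;

-- (b) distance in miles for each customer/warehouse pair
dists = FOREACH pairs GENERATE warehouses::zip AS wzip,
    DIST(customers::lat, customers::lon,
         warehouses::lat, warehouses::lon) AS miles;

-- (c, d) average distance per proposed warehouse
by_wh = GROUP dists BY wzip;
avgs = FOREACH by_wh GENERATE group AS wzip, AVG(dists.miles);
DUMP avgs;
```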
6. After you have finished implementing the Pig Latin code described above, run the script:

$ pig calc_average_distances.pig

Question: Which of these three proposed ZIP codes has the lowest average mileage to Dualcore's customers?

This is the end of the lab.
Lecture 11 Lab: Running Hive Queries from the Shell, Scripts, and Hue

In this lab you will write HiveQL queries to analyze data in Hive tables that have been populated with data you placed in HDFS during earlier labs.

IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Step #1: Running a Query from the Hive Shell

Dualcore ran a contest in which customers posted videos of interesting ways to use their new tablets. A $5,000 prize will be awarded to the customer whose video received the highest rating.

However, the registration data was lost due to an RDBMS crash, and the only information they have is from the videos. The winning customer introduced herself only as "Bridget from Kansas City" in her video.

You will need to run a Hive query that identifies the winner's record in the customer database so that Dualcore can send her the $5,000 prize.
1. Change to the directory for this lab:

$ cd $ADIR/exercises/analyzing_sales

2. Start Hive:

$ hive

Hive Prompt

To make it easier to copy queries and paste them into your terminal window, we do not show the hive> prompt in subsequent steps. Steps prefixed with $ should be executed on the UNIX command line; the rest should be run in Hive unless otherwise noted.
3. Make the query results easier to read by setting the property that will make Hive show column headers:

set hive.cli.print.header=true;

4. All you know about the winner is that her name is Bridget and she lives in Kansas City. Use Hive's LIKE operator to do a wildcard search for names such as "Bridget", "Bridgette", or "Bridgitte". Remember to filter on the customer's city.

Question: Which customer did your query identify as the winner of the $5,000 prize?
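A wildcard search of the kind step 4 describes might look like the sketch below; the column names (fname, lname, city) are assumptions about the customers table schema, so adjust them to match DESCRIBE customers:

```sql
-- 'Bridg%' matches Bridget, Bridgette, Bridgitte, etc.
SELECT cust_id, fname, lname
FROM customers
WHERE fname LIKE 'Bridg%'
  AND city = 'Kansas City';
```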
Step #2: Running a Query Directly from the Command Line

You will now run a top-N query to identify the three most expensive products that Dualcore currently offers.

5. Exit the Hive shell and return to the command line:

quit;

6. Although HiveQL statements are terminated by semicolons in the Hive shell, it is not necessary to do this when running a single query from the command line using the -e option. Run the following command to execute the quoted HiveQL statement:

$ hive -e 'SELECT price, brand, name FROM PRODUCTS
    ORDER BY price DESC LIMIT 3'

Question: Which three products are the most expensive?
Step #3: Running a HiveQL Script

The rules for the contest described earlier require that the winner bought the advertised tablet from Dualcore between May 1, 2013 and May 31, 2013. Before Dualcore can authorize the accounting department to pay the $5,000 prize, you must ensure that Bridget is eligible. Since this query involves joining data from several tables, it's a perfect case for running it as a Hive script.
1. Study the HiveQL code for the query to learn how it works:

$ cat verify_tablet_order.hql

2. Execute the HiveQL script using the hive command's -f option:

$ hive -f verify_tablet_order.hql

Question: Did Bridget order the advertised tablet in May?
Step #4: Running a Query Through Hue and Beeswax

Another way to run Hive queries is through your Web browser using Hue's Beeswax application. This is especially convenient if you use more than one computer, or if you use a device (such as a tablet) that isn't capable of running Hive itself, because it does not require any software other than a browser.
1. Start the Firefox Web browser by clicking the orange and blue icon near the top of the VM window, just to the right of the System menu. Once Firefox starts, type http://localhost:8888/ into the address bar, and then hit the enter key.

2. After a few seconds, you should see Hue's login screen. Enter training in both the username and password fields, and then click the Sign In button. If prompted to remember the password, decline by hitting the ESC key so you can practice this step again later if you choose.

Although several Hue applications are available through the icons at the top of the page, the Beeswax query editor is shown by default.

3. Select default from the database list on the left side of the page.

4. Write a query in the text area that will count the number of records in the customers table, and then click the Execute button.

Question: How many customers does Dualcore serve?

5. Click the Query Editor link in the upper left corner, and then write and run a query to find the ten states with the most customers.

Question: Which state has the most customers?
Bonus Lab #1: Calculating Revenue and Profit

Several more questions are described below, and you will need to write the HiveQL code to answer them. You can use whichever method you like best, including the Hive shell, a Hive script, or Hue, to run your queries.
Which top three products has Dualcore sold more of than any other?

Hint: Remember that if you use a GROUP BY clause in Hive, you must group by all fields listed in the SELECT clause that are not part of an aggregate function.

What was Dualcore's total revenue in May, 2013?

What was Dualcore's gross profit (sales price minus cost) in May, 2013?

The results of the above queries are shown in cents. Rewrite the gross profit query to format the value in dollars and cents (e.g., $2000000.00). To do this, you can divide the profit by 100 and format the result using the PRINTF function and the format string "$%.2f".
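The PRINTF formatting described above might be applied as in the sketch below. This is not the sample solution: the table names, join keys, and date column are assumptions about the schemas imported in earlier labs, and only the PRINTF pattern itself is the point of the example:

```sql
-- gross profit for May 2013, formatted as dollars and cents
SELECT PRINTF("$%.2f", SUM(p.price - p.cost) / 100) AS profit
FROM orders o
JOIN order_details d ON (o.order_id = d.order_id)
JOIN products p ON (d.prod_id = p.prod_id)
WHERE o.order_date LIKE '2013-05%';
```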
Professor's Note ~
There are several ways you could write each query, and you can find one solution for each problem in the bonus_01/sample_solution/ directory.

This is the end of the lab.
Lecture 11 Lab: Data Management with Hive

In this lab you will practice using several common techniques for creating and populating Hive tables. You will also create and query a table containing each of the complex field types we studied: array, map, and struct.

IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Additionally, many of the commands you will run use environment variables and relative file paths. It is important that you use the Hive shell, rather than Hue or another interface, as you work through the steps that follow.
Step #1: Use Sqoop's Hive Import Option to Create a Table

You used Sqoop in an earlier lab to import data from MySQL into HDFS. Sqoop can also create a Hive table with the same fields as the source table in addition to importing the records, which saves you from having to write a CREATE TABLE statement.

1. Change to the directory for this lab:

$ cd $ADIR/exercises/data_mgmt

2. Execute the following command to import the suppliers table from MySQL as a new Hive-managed table:

$ sqoop import \
    --connect jdbc:mysql://localhost/dualcore \
    --username training --password training \
    --fields-terminated-by '\t' \
    --table suppliers \
    --hive-import
3.StartHive:
http://webcache.googleusercontent.com/search?q=cache:yfigtLK8unYJ:www.cloudera.com/content/dam/cloudera/partners/academicpartnersgated/labs/_
108/135
7/26/2015
ApacheHadoopAcourseforundergraduatesHomeworkLabswithProfessorsNotes
$ hive
4. It is always a good idea to validate data after adding it. Execute the Hive query shown below to count the number of suppliers in Texas:
SELECT COUNT(*) FROM suppliers WHERE state='TX'
The query should show that nine records match.
Step #2: Create an External Table in Hive
You imported data from the employees table in MySQL in an earlier lab, but it would be convenient to be able to query this from Hive. Since the data already exists in HDFS, this is a good opportunity to use an external table.
1. Write and execute a HiveQL statement to create an external table for the tab-delimited records in HDFS at /dualcore/employees. The data format is shown below:
Field Name   Field Type
emp_id       STRING
fname        STRING
lname        STRING
address      STRING
city         STRING
state        STRING
zipcode      STRING
job_title    STRING
email        STRING
active       STRING
salary       INT
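One way this statement might be written is sketched below; it follows the field chart above and the /dualcore/employees path from the lab text, but treat it as a hint rather than the official solution:

```sql
-- Sketch of a possible solution (compare with the lab's sample solution).
-- EXTERNAL means dropping the table later leaves the HDFS files in place.
CREATE EXTERNAL TABLE employees
  (emp_id STRING, fname STRING, lname STRING, address STRING,
   city STRING, state STRING, zipcode STRING, job_title STRING,
   email STRING, active STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/dualcore/employees';
```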
2. Run the following Hive query to verify that you have created the table correctly:
SELECT job_title, COUNT(*) AS num
FROM employees
GROUP BY job_title
ORDER BY num DESC
LIMIT 3
It should show that Sales Associate, Cashier, and Assistant Manager are the three most common job titles at Dualcore.
Step #3: Create and Load a Hive-Managed Table
Next, you will create and then load a Hive-managed table with product ratings data.
1. Create a table named ratings for storing tab-delimited records using this structure:
Field Name   Field Type
posted       TIMESTAMP
cust_id      INT
prod_id      INT
rating       TINYINT
message      STRING
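A minimal sketch of one way to write this statement, using the field names and types from the chart above:

```sql
-- Sketch: a managed table for tab-delimited ratings records.
CREATE TABLE ratings
  (posted TIMESTAMP,
   cust_id INT,
   prod_id INT,
   rating TINYINT,
   message STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
```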
2. Show the table description and verify that its fields have the correct order, names, and types:
DESCRIBE ratings
3. Next, open a separate terminal window (File > Open Terminal) so you can run the following shell command. This will populate the table directly by using the hadoop fs command to copy product ratings data from 2012 to the table's directory in HDFS:
$ hadoop fs -put $ADIR/data/ratings_2012.txt \
/user/hive/warehouse/ratings
Leave the window open afterwards so that you can easily switch between Hive and the command prompt.
4. Next, verify that Hive can read the data we just added. Run the following query in Hive to count the number of records in this table (the result should be 464):
SELECT COUNT(*) FROM ratings
5. Another way to load data into a Hive table is through the LOAD DATA command. The next few commands will lead you through the process of copying a local file to HDFS and loading it into Hive. First, copy the 2013 ratings data to HDFS:
$ hadoop fs -put $ADIR/data/ratings_2013.txt /dualcore
6. Verify that the file is there:
$ hadoop fs -ls /dualcore/ratings_2013.txt
7. Use the LOAD DATA statement in Hive to load that file into the ratings table:
LOAD DATA INPATH '/dualcore/ratings_2013.txt' INTO TABLE ratings
8. The LOAD DATA INPATH command moves the file to the table's directory. Verify that the file is no longer present in the original directory:
$ hadoop fs -ls /dualcore/ratings_2013.txt
9. Verify that the file is shown alongside the 2012 ratings data in the table's directory:
$ hadoop fs -ls /user/hive/warehouse/ratings
10. Finally, count the records in the ratings table to ensure that all 21,997 are available:
SELECT COUNT(*) FROM ratings
Step #4: Create, Load, and Query a Table with Complex Fields
Dualcore recently started a loyalty program to reward their best customers. Dualcore has a sample of the data that contains information about customers who have signed up for the program, including their phone numbers (as a map), a list of past order IDs (as an array), and a struct that summarizes the minimum, maximum, average, and total value of past orders. You will create the table, populate it with the provided data, and then run a few queries to practice referencing these types of fields.
1. Run the following statement in Hive to create the table:
CREATE TABLE loyalty_program
  (cust_id INT,
   fname STRING,
   lname STRING,
   email STRING,
   level STRING,
   phone MAP<STRING, STRING>,
   order_ids ARRAY<INT>,
   order_value STRUCT<min:INT,
                      max:INT,
                      avg:INT,
                      total:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
2. Examine the data in loyalty_data.txt to see how it corresponds to the fields in the table, and then load it into Hive:
LOAD DATA LOCAL INPATH 'loyalty_data.txt' INTO TABLE loyalty_program
3. Run a query to select the HOME phone number (Hint: Map keys are case-sensitive) for customer ID 1200866. You should see 408-555-4914 as the result.
4. Select the third element from the order_ids array for customer ID 1200866 (Hint: Elements are indexed from zero). The query should return 5278505.
5. Select the total attribute from the order_value struct for customer ID 1200866. The query should return 401874.
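Sketches of how these three queries might be written; Hive uses bracket syntax for map keys and array indexes, and dot syntax for struct attributes:

```sql
-- Step 3: map lookup (keys are case-sensitive, so 'HOME' not 'home')
SELECT phone['HOME'] FROM loyalty_program WHERE cust_id = 1200866;

-- Step 4: arrays are zero-indexed, so [2] is the third element
SELECT order_ids[2] FROM loyalty_program WHERE cust_id = 1200866;

-- Step 5: struct attributes use dot notation
SELECT order_value.total FROM loyalty_program WHERE cust_id = 1200866;
```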
Bonus Lab #1: Alter and Drop a Table
1. Use ALTER TABLE to rename the level column to status.
2. Use the DESCRIBE command on the loyalty_program table to verify the change.
3. Use ALTER TABLE to rename the entire table to reward_program.
4. Although the ALTER TABLE command often requires that we make a corresponding change to the data in HDFS, renaming a table or column does not. You can verify this by running a query on the table using the new names (the result should be SILVER):
SELECT status FROM reward_program WHERE cust_id = 1200866
5. As sometimes happens in the corporate world, priorities have shifted and the program is now canceled. Drop the reward_program table.
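The statements for this bonus lab might look like the following sketch:

```sql
-- 1. Rename the level column (the type must be restated with CHANGE)
ALTER TABLE loyalty_program CHANGE level status STRING;
-- 2. Verify the change
DESCRIBE loyalty_program;
-- 3. Rename the table itself
ALTER TABLE loyalty_program RENAME TO reward_program;
-- 5. Drop the table once the program is canceled
DROP TABLE reward_program;
```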
This is the end of the lab.
Lecture 12 Lab: Gaining Insight with Sentiment Analysis
In this optional lab, you will use Hive's text processing features to analyze customers' comments and product ratings. You will uncover problems and propose potential solutions.
IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.
Background Information
Customer ratings and feedback are great sources of information for both customers and retailers like Dualcore. However, customer comments are typically free-form text and must be handled differently. Fortunately, Hive provides extensive support for text processing.
Step #1: Analyze Numeric Product Ratings
Before delving into text processing, you will begin by analyzing the numeric ratings customers have assigned to various products.
1. Change to the directory for this lab:
$ cd $ADIR/exercises/sentiment
2. Start Hive and use the DESCRIBE command to remind yourself of the table's structure.
3. We want to find the product that customers like most, but must guard against being misled by products that have few ratings assigned. Run the following query to find the product with the highest average among all those with at least 50 ratings:
SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
             COUNT(*) AS num
      FROM ratings
      GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating DESC
LIMIT 1
4. Rewrite, and then execute, the query above to find the product with the lowest average among products with at least 50 ratings. You should see that the result is product ID 1274673 with an average rating of 1.10.
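One possible rewrite, shown as a sketch: only the sort direction changes, from DESC to ASC:

```sql
SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
             COUNT(*) AS num
      FROM ratings
      GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating ASC
LIMIT 1;
```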
Step #2: Analyze Rating Comments
We observed earlier that customers are very dissatisfied with one of the products that Dualcore sells. Although numeric ratings can help identify which product that is, they don't tell Dualcore why customers don't like the product. We could simply read through all the comments associated with that product to learn this information, but that approach doesn't scale. Next, you will use Hive's text processing support to analyze the comments.
1. The following query normalizes all comments on that product to lowercase, breaks them into individual words using the SENTENCES function, and passes those to the NGRAMS function to find the five most common bigrams (two-word combinations). Run the query in Hive:
SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 2, 5))
  AS bigrams
FROM ratings
WHERE prod_id = 1274673
2. Most of these words are too common to provide much insight, though the word "expensive" does stand out in the list. Modify the previous query to find the five most common trigrams (three-word combinations), and then run that query in Hive.
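A sketch of the trigram version: the second argument to NGRAMS is the n-gram size, so it changes from 2 to 3:

```sql
SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 3, 5))
  AS trigrams
FROM ratings
WHERE prod_id = 1274673;
```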
3. Among the patterns you see in the results is the phrase "ten times more." This might be related to the complaints that the product is too expensive. Now that you've identified a specific phrase, look at a few comments that contain it by running this query:
SELECT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%ten times more%'
LIMIT 3
You should see three comments that say, "Why does the red one cost ten times more than the others?"
4. We can infer that customers are complaining about the price of this item, but the comment alone doesn't provide enough detail. One of the words ("red") in that comment was also found in the list of trigrams from the earlier query. Write and execute a query that will find all distinct comments containing the word "red" that are associated with product ID 1274673.
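One way this query might look, sketched with DISTINCT and a LIKE pattern:

```sql
SELECT DISTINCT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%red%';
```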
5. The previous step should have displayed two comments:
"What is so special about red?"
"Why does the red one cost ten times more than the others?"
The second comment implies that this product is overpriced relative to similar products. Write and run a query that will display the record for product ID 1274673 in the products table.
6. Your query should have shown that the product was a "16 GB USB Flash Drive (Red)" from the Orion brand. Next, run this query to identify similar products:
SELECT *
FROM products
WHERE name LIKE '%16 GB USB Flash Drive%'
AND brand = 'Orion'
The query results show that there are three almost identical products, but the product with the negative reviews (the red one) costs about ten times as much as the others, just as some of the comments said.
Based on the cost and price columns, it appears that doing text processing on the product ratings has helped Dualcore uncover a pricing error.
This is the end of the lab.
Lecture 12 Lab: Data Transformation with Hive
In this lab you will create and populate a table with log data from Dualcore's Web server. Queries on that data will reveal that many customers abandon their shopping carts before completing the checkout process. You will create several additional tables, using data from a TRANSFORM script and a supplied UDF, which you will use later to analyze how Dualcore could turn this problem into an opportunity.
IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.
Step #1: Create and Populate the Web Logs Table
Typical log file formats are not delimited, so you will need to use the RegexSerDe and specify a pattern Hive can use to parse lines into individual fields you can then query.
1. Change to the directory for this lab:
$ cd $ADIR/exercises/transform
2. Examine the create_web_logs.hql script to get an idea of how it uses a RegexSerDe to parse lines in the log file (an example log line is shown in the comment at the top of the file). When you have examined the script, run it to create the table in Hive:
$ hive -f create_web_logs.hql
3. Populate the table by adding the log file to the table's directory in HDFS:
$ hadoop fs -put $ADIR/data/access.log \
/dualcore/web_logs
4. Start the Hive shell in another terminal window.
5. Verify that the data is loaded correctly by running this query to show the top three items users searched for on Dualcore's Web site:
SELECT term, COUNT(term) AS num FROM
  (SELECT LOWER(REGEXP_EXTRACT(request,
      '/search\\?phrase=(\\S+)', 1)) AS term
   FROM web_logs
   WHERE request REGEXP '/search\\?phrase=') terms
GROUP BY term
ORDER BY num DESC
LIMIT 3
You should see that it returns tablet (303), ram (153), and wifi (148).
Note: The REGEXP operator, which is available in some SQL dialects, is similar to LIKE, but uses regular expressions for more powerful pattern matching. The REGEXP operator is synonymous with the RLIKE operator.
Step #2: Analyze Customer Checkouts
You've just queried the logs to see what users search for on Dualcore's Web site, but now you'll run some queries to learn whether they buy. As on many Web sites, customers add products to their shopping carts and then follow a "checkout" process to complete their purchase. Since each part of this four-step process can be identified by its URL in the logs, we can use a regular expression to easily identify them:

Step  Request URL                        Description
1     /cart/checkout/step1-viewcart      View list of items added to cart
2     /cart/checkout/step2-shippingcost  Notify customer of shipping cost
3     /cart/checkout/step3-payment       Gather payment information
4     /cart/checkout/step4-receipt       Show receipt for completed order
1. Run the following query in Hive to show the number of requests for each step of the checkout process:
SELECT COUNT(*), request
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY request
The results of this query highlight a major problem. About one out of every three customers abandons their cart after the second step. This might mean millions of dollars in lost revenue, so let's see if we can determine the cause.
2. The log file's cookie field stores a value that uniquely identifies each user session. Since not all sessions involve checkouts at all, create a new table containing the session ID and number of checkout steps completed for just those sessions that do:
CREATE TABLE checkout_sessions AS
SELECT cookie, ip_address, COUNT(request) AS steps_completed
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY cookie, ip_address
3. Run this query to show the number of people who abandoned their cart after each step:
SELECT steps_completed, COUNT(cookie) AS num
FROM checkout_sessions
GROUP BY steps_completed
You should see that most customers who abandoned their order did so after the second step, which is when they first learn how much it will cost to ship their order.
Step #3: Use TRANSFORM for IP Geolocation
Based on what you've just seen, it seems likely that customers abandon their carts due to high shipping costs. The shipping cost is based on the customer's location and the weight of the items they've ordered. Although this information is not in the database (since the order wasn't completed), we can gather enough data from the logs to estimate them.
We don't have the customer's address, but we can use a process known as "IP geolocation" to map the computer's IP address in the log file to an approximate physical location. Since this isn't a built-in capability of Hive, you'll use a provided Python script to TRANSFORM the ip_address field from the checkout_sessions table to a ZIP code, as part of a HiveQL statement that creates a new table called cart_zipcodes.

Regarding TRANSFORM and UDF Examples in this Exercise
During this lab, you will use a Python script for IP geolocation and a UDF to calculate shipping costs. Both are implemented merely as a simulation, compatible with the fictitious data we use in class and intended to work even when Internet access is unavailable. The focus of these labs is on how to use external scripts and UDFs, rather than how the code for the examples works internally.
1. Examine the create_cart_zipcodes.hql script and observe the following:
a. It creates a new table called cart_zipcodes based on a SELECT statement.
b. That SELECT statement transforms the ip_address, cookie, and steps_completed fields from the checkout_sessions table using a Python script.
c. The new table contains the ZIP code instead of an IP address, plus the other two fields from the original table.
2. Examine the ipgeolocator.py script and observe the following:
a. Records are read from Hive on standard input.
b. The script splits them into individual fields using a tab delimiter.
c. The ip_addr field is converted to zipcode, but the cookie and steps_completed fields are passed through unmodified.
d. The three fields in each output record are delimited with tabs and printed to standard output.
3. Run the script to create the cart_zipcodes table:
$ hive -f create_cart_zipcodes.hql
Step #4: Extract List of Products Added to Each Cart
As described earlier, estimating the shipping cost also requires a list of items in the customer's cart. You can identify products added to the cart since the request URL looks like this (only the product ID changes from one record to the next):
/cart/additem?productid=1234567
1. Write a HiveQL statement to create a table called cart_items with two fields, cookie and prod_id, based on data selected from the web_logs table. Keep the following in mind when writing your statement:
a. The prod_id field should contain only the seven-digit product ID (Hint: Use the REGEXP_EXTRACT function)
b. Add a WHERE clause with REGEXP using the same regular expression as above so that you only include records where customers are adding items to the cart.
Professor's Note ~
If you need a hint on how to write the statement, look at the file:
sample_solution/create_cart_items.hql
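A sketch of one possible statement; the regular expression shown is an illustration, and the sample solution file named above is authoritative:

```sql
-- Extract the seven-digit product ID from matching requests only.
CREATE TABLE cart_items AS
SELECT cookie,
       REGEXP_EXTRACT(request,
           '/cart/additem\\?productid=(\\d+)', 1) AS prod_id
FROM web_logs
WHERE request REGEXP '/cart/additem\\?productid=';
```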
2. Execute the HiveQL statement you just wrote.
3. Verify the contents of the new table by running this query:
SELECT COUNT(DISTINCT cookie) FROM cart_items WHERE prod_id = 1273905
Professor's Note ~
If this doesn't return 47, then compare your statement to the file sample_solution/create_cart_items.hql. Make the necessary corrections, and then re-run your statement (after dropping the cart_items table).
Step #5: Create Tables to Join Web Logs with Product Data
You now have tables representing the ZIP codes and products associated with checkout sessions, but you'll need to join these with the products table to get the weight of these items before you can estimate shipping costs. In order to do some more analysis later, we'll also include total selling price and total wholesale cost in addition to the total shipping weight for all items in the cart.
1. Run the following HiveQL to create a table called cart_orders with the information:
CREATE TABLE cart_orders AS
SELECT z.cookie, steps_completed, zipcode,
       SUM(shipping_wt) AS total_weight,
       SUM(price) AS total_price,
       SUM(cost) AS total_cost
FROM cart_zipcodes z
JOIN cart_items i
  ON (z.cookie = i.cookie)
JOIN products p
  ON (i.prod_id = p.prod_id)
GROUP BY z.cookie, zipcode, steps_completed
Step #6: Create a Table Using a UDF to Estimate Shipping Cost
We finally have all the information we need to estimate the shipping cost for each abandoned order. You will use a Hive UDF to calculate the shipping cost given a ZIP code and the total weight of all items in the order.
1. Before you can use a UDF, you must add it to Hive's classpath. Run the following command in Hive to do that:
ADD JAR geolocation_udf.jar
2. Next, you must register the function with Hive and provide the name of the UDF class as well as the alias you want to use for the function. Run the Hive command below to associate our UDF with the alias CALC_SHIPPING_COST:
CREATE TEMPORARY FUNCTION CALC_SHIPPING_COST AS
'com.cloudera.hive.udf.UDFCalcShippingCost'
3. Now create a new table called cart_shipping that will contain the session ID, number of steps completed, total retail price, total wholesale cost, and the estimated shipping cost for each order based on data from the cart_orders table:
CREATE TABLE cart_shipping AS
SELECT cookie, steps_completed, total_price, total_cost,
       CALC_SHIPPING_COST(zipcode, total_weight) AS shipping_cost
FROM cart_orders
4. Finally, verify your table by running the following query to check a record:
SELECT * FROM cart_shipping WHERE cookie='100002920697'
This should show that session as having two completed steps, a total retail price of $263.77, a total wholesale cost of $236.98, and a shipping cost of $9.09.
Note: The total_price, total_cost, and shipping_cost columns in the cart_shipping table contain the number of cents as integers. Be sure to divide results containing monetary amounts by 100 to get dollars and cents.
This is the end of the lab.
Lecture 13 Lab: Interactive Analysis with Impala
In this lab you will examine abandoned cart data using the tables created in the previous lab. You will use Impala to quickly determine how much lost revenue these abandoned carts represent, and use several "what if" scenarios to determine whether Dualcore should offer free shipping to encourage customers to complete their purchases.
IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.
Step #1: Start the Impala Shell and Refresh the Cache
1. Issue the following commands to start Impala, then change to the directory for this lab:
$ sudo service impala-server start
$ sudo service impala-state-store start
$ cd $ADIR/exercises/interactive
2. First, start the Impala shell:
$ impala-shell
3. Since you created tables and modified data in Hive, Impala's cache of the metastore is outdated. You must refresh it before continuing by entering the following command in the Impala shell:
REFRESH
Step #2: Calculate Lost Revenue
1. First, you'll calculate how much revenue the abandoned carts represent. Remember, there are four steps in the checkout process, so only records in the cart_shipping table with a steps_completed value of four represent a completed purchase:
SELECT SUM(total_price) AS lost_revenue
FROM cart_shipping
WHERE steps_completed < 4
[Illustration: "Lost Revenue From Abandoned Shopping Carts" - sample rows from the cart_shipping table (cookie, steps_completed, total_price, total_cost, shipping_cost), with the sum of total_price taken where steps_completed < 4]
You should see that abandoned carts mean that Dualcore is potentially losing out on more than $2 million in revenue! Clearly it's worth the effort to do further analysis.
Note: The total_price, total_cost, and shipping_cost columns in the cart_shipping table contain the number of cents as integers. Be sure to divide results containing monetary amounts by 100 to get dollars and cents.
2. The number returned by the previous query is revenue, but what counts is profit. We calculate gross profit by subtracting the cost from the price. Write and execute a query similar to the one above, but which reports the total lost profit from abandoned carts.
Professor's Note ~
If you need a hint on how to write this query, you can check the file:
sample_solution/abandoned_checkout_profit.sql
After running your query, you should see that Dualcore is potentially losing $111,058.90 in profit due to customers not completing the checkout process.
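A sketch of one possible query (the sample solution file named above is authoritative); remember that the monetary columns are stored in cents:

```sql
SELECT SUM(total_price - total_cost) AS lost_profit
FROM cart_shipping
WHERE steps_completed < 4;
```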
3. How does this compare to the amount of profit Dualcore receives from customers who do complete the checkout process? Modify your previous query to consider only those records where steps_completed = 4, and then execute it in the Impala shell.
Professor's Note ~
Check sample_solution/completed_checkout_profit.sql for a hint.
The result should show that Dualcore earns a total of $177,932.93 on completed orders, so abandoned carts represent a substantial proportion of additional profits.
4. The previous two queries show the total profit for abandoned and completed orders, but these aren't directly comparable because there were different numbers of each. It might be the case that one is much more profitable than the other on a per-order basis. Write and execute a query that will calculate the average profit based on the number of steps completed during the checkout process.
Professor's Note ~
If you need help writing this query, check the file:
sample_solution/checkout_profit_by_step.sql
You should observe that carts abandoned after step two represent an even higher average profit per order than completed orders.
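One plausible form of this query, sketched below; grouping by steps_completed yields one average per checkout stage (compare with the sample solution file named above):

```sql
SELECT steps_completed,
       AVG(total_price - total_cost) AS avg_profit
FROM cart_shipping
GROUP BY steps_completed;
```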
Step #3: Calculate Cost/Profit for a Free Shipping Offer
You have observed that most carts (and the most profitable carts) are abandoned at the point where the shipping cost is displayed to the customer. You will now run some queries to determine whether offering free shipping, on at least some orders, would actually bring in more revenue, assuming this offer prompted more customers to finish the checkout process.
1. Run the following query to compare the average shipping cost for orders abandoned after the second step versus completed orders:
SELECT steps_completed, AVG(shipping_cost) AS ship_cost
FROM cart_shipping
WHERE steps_completed = 2 OR steps_completed = 4
GROUP BY steps_completed
[Illustration: "Average Shipping Cost for Carts Abandoned After Steps 2 and 4" - sample rows from the cart_shipping table, with the average of shipping_cost taken where steps_completed = 2 or 4]
You will see that the shipping cost of abandoned orders was almost 10% higher than for completed purchases. Offering free shipping, at least for some orders, might actually bring in more money than passing on the cost and risking abandoned orders.
2. Run the following query to determine the average profit per order over the entire month for the data you are analyzing in the log file. This will help you to determine whether Dualcore could absorb the cost of offering free shipping:
SELECT AVG(price - cost) AS profit
FROM products p
JOIN order_details d
  ON (d.prod_id = p.prod_id)
JOIN orders o
  ON (d.order_id = o.order_id)
WHERE YEAR(order_date) = 2013
AND MONTH(order_date) = 05
[Illustration: "Average Profit per Order, May 2013" - the products, order_details, and orders tables joined on prod_id and order_id, averaging the profit (price - cost) on orders made in May 2013]
You should see that the average profit for all orders during May was $7.80. An earlier query you ran showed that the average shipping cost was $8.83 for completed orders and $9.66 for abandoned orders, so clearly Dualcore would lose money by offering free shipping on all orders. However, it might still be worthwhile to offer free shipping on orders over a certain amount.
3. Run the following query, which is a slightly revised version of the previous one, to determine whether offering free shipping only on orders of $10 or more would be a good idea:
SELECT AVG(price - cost) AS profit
FROM products p
JOIN order_details d
  ON (d.prod_id = p.prod_id)
JOIN orders o
  ON (d.order_id = o.order_id)
WHERE YEAR(order_date) = 2013
AND MONTH(order_date) = 05
AND price >= 1000
You should see that the average profit on orders of $10 or more was $9.09, so absorbing the cost of shipping would leave very little profit.
4. Repeat the previous query, modifying it slightly each time to find the average profit on orders of at least $50, $100, and $500.
You should see that there is a huge spike in the amount of profit for orders of $500 or more (Dualcore makes $111.05 on average for these orders).
5. How much does shipping cost on average for orders totaling $500 or more? Write and run a query to find out.
Professor's Note ~
The file sample_solution/avg_shipping_cost_50000.sql contains the solution.
You should see that the average shipping cost is $12.28, which happens to be about 11% of the profit brought in on those orders.
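A sketch of one way to write it; since the monetary columns hold cents, $500 is expressed as 50000 (the sample solution file named above is authoritative):

```sql
SELECT AVG(shipping_cost) AS avg_ship_cost
FROM cart_shipping
WHERE total_price >= 50000;
```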
6. Since Dualcore won't know in advance who will abandon their cart, they would have to absorb the $12.28 average cost on all orders of at least $500. Would the extra money they might bring in from abandoned carts offset the added cost of free shipping for customers who would have completed their purchases anyway? Run the following query to see the total profit on completed purchases:
SELECT SUM(total_price - total_cost) AS total_profit
FROM cart_shipping
WHERE total_price >= 50000
AND steps_completed = 4
After running this query, you should see that the total profit for completed orders is $107,582.97.
7. Now, run the following query to find the potential profit, after subtracting shipping costs, if all customers completed the checkout process:
SELECT gross_profit - total_shipping_cost AS potential_profit
FROM (SELECT
        SUM(total_price - total_cost) AS gross_profit,
        SUM(shipping_cost) AS total_shipping_cost
      FROM cart_shipping
      WHERE total_price >= 50000) large_orders
Since the result of $120,355.26 is greater than the $107,582.97 Dualcore currently earns from completed orders, it appears that they could earn nearly $13,000 more by offering free shipping for all orders of at least $500.
Congratulations! Your hard work analyzing a variety of data with Hadoop's tools has helped make Dualcore more profitable than ever.
This is the end of the lab.