
7/26/2015

Apache Hadoop: A Course for Undergraduates - Homework Labs with Professor's Notes

This is the html version of the file http://www.cloudera.com/content/dam/cloudera/partners/academicpartnersgated/labs/_Homework_Labs_WithProfessorNotes.pdf.
Google automatically generates html versions of documents as we crawl the web.


Apache Hadoop
A Course for Undergraduates

Homework Labs
with
Professor's Notes


Copyright 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.


Table of Contents

General Notes on Homework Labs
Lecture 1 Lab: Using HDFS
Lecture 2 Lab: Running a MapReduce Job
Lecture 3 Lab: Writing a MapReduce Java Program
Lecture 3 Lab: More Practice with MapReduce Java Programs
Lecture 3 Lab: Writing a MapReduce Streaming Program
Lecture 3 Lab: Writing Unit Tests with the MRUnit Framework
Lecture 4 Lab: Using ToolRunner and Passing Parameters
Lecture 4 Lab: Using a Combiner
Lecture 5 Lab: Testing with LocalJobRunner
Lecture 5 Lab: Logging
Lecture 5 Lab: Using Counters and a Map-Only Job
Lecture 6 Lab: Writing a Partitioner
Lecture 6 Lab: Implementing a Custom WritableComparable
Lecture 6 Lab: Using SequenceFiles and File Compression
Lecture 7 Lab: Creating an Inverted Index
Lecture 7 Lab: Calculating Word Co-Occurrence
Lecture 8 Lab: Importing Data with Sqoop
Lecture 8 Lab: Running an Oozie Workflow
Lecture 8 Bonus Lab: Exploring a Secondary Sort Example
Notes for Upcoming Labs
Lecture 9 Lab: Data Ingest With Hadoop Tools

Lecture 9 Lab: Using Pig for ETL Processing
Lecture 9 Lab: Analyzing Ad Campaign Data with Pig
Lecture 10 Lab: Analyzing Disparate Data Sets with Pig
Lecture 10 Lab: Extending Pig with Streaming and UDFs
Lecture 11 Lab: Running Hive Queries from the Shell, Scripts, and Hue
Lecture 11 Lab: Data Management with Hive
Lecture 12 Lab: Gaining Insight with Sentiment Analysis
Lecture 12 Lab: Data Transformation with Hive
Lecture 13 Lab: Interactive Analysis with Impala


General Notes on Homework Labs
Students complete homework for this course using the student version of the training Virtual Machine (VM). Cloudera supplies a second VM, the professor's VM, in addition to the student VM. The professor's VM comes complete with solutions to the homework labs.
The professor's VM contains additional project subdirectories with hints and solutions. Subdirectories named src/hints and src/solution provide hints (partial solutions) and full solutions, respectively. The student VM will have src/stubs directories only, no hints or solutions directories. Full solutions can be distributed to students after homework has been submitted. In some cases, a lab may require that the previous lab(s) ran successfully, ensuring that the VM is in the required state. Providing students with the solution to the previous lab, and having them run the solution, will bring the VM to the required state. This should be completed prior to running code for the new lab.
Except for the presence of solutions in the professor VM, the student and professor versions of the training VM are the same. Both VMs run the CentOS 6.3 Linux distribution and come configured with CDH (Cloudera's Distribution, including Apache Hadoop) installed in pseudo-distributed mode. In addition to core Hadoop, the Hadoop ecosystem tools necessary to complete the homework labs are also installed (e.g. Pig, Hive, Flume, etc.). Perl, Python, PHP, and Ruby are installed as well.
Hadoop pseudo-distributed mode is a method of running Hadoop whereby all Hadoop daemons run on the same machine. It is, essentially, a cluster consisting of a single machine.

It works just like a larger Hadoop cluster; the key difference (apart from speed, of course!) is that the block replication factor is set to one, since there is only a single DataNode available.
Note: Homework labs are grouped into individual files by lecture number for easy posting of assignments. The same labs appear in this document, but with references to hints and solutions where applicable. The students' homework labs will reference a stubs subdirectory, not hints or solutions. Students will typically complete their coding in the stubs subdirectories.

Getting Started


1. The VM is set to automatically log in as the user training. Should you log out at any time, you can log back in as the user training with the password training.

Working with the Virtual Machine
1. Should you need it, the root password is training. You may be prompted for this if, for example, you want to change the keyboard layout. In general, you should not need this password since the training user has unlimited sudo privileges.
2. In some command-line steps in the labs, you will see lines like this:
$ hadoop fs -put shakespeare \
/user/training/shakespeare
The dollar sign ($) at the beginning of each line indicates the Linux shell prompt. The actual prompt will include additional information (e.g., [training@localhost workspace]$) but this is omitted from these instructions for brevity.
The backslash (\) at the end of the first line signifies that the command is not completed, and continues on the next line. You can enter the code exactly as shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type in the backslash.
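You can see the backslash-continuation behavior for yourself with a harmless command such as echo, which simply prints its arguments (no Hadoop cluster needed for this sketch):

```shell
# Both forms run the same command; the trailing backslash only tells
# the shell that the command continues on the next line.
echo put shakespeare \
  /user/training/shakespeare

echo put shakespeare /user/training/shakespeare
# both lines print: put shakespeare /user/training/shakespeare
```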


3. Although many students are comfortable using UNIX text editors like vi or emacs, some might prefer a graphical text editor. To invoke the graphical editor from the command line, type gedit followed by the path of the file you wish to edit. Appending & to the command allows you to type additional commands while the editor is still open. Here is an example of how to edit a file named myfile.txt:
$ gedit myfile.txt &


Lecture 1 Lab: Using HDFS
Files Used in This Exercise:
Data files (local):
~/training_materials/developer/data/shakespeare.tar.gz
~/training_materials/developer/data/access_log.gz

In this lab you will begin to get acquainted with the Hadoop tools. You will manipulate files in HDFS, the Hadoop Distributed File System.

Set Up Your Environment
1. Before starting the labs, run the course setup script in a terminal window:
$ ~/scripts/developer/training_setup_dev.sh

Hadoop

Hadoop is already installed, configured, and running on your virtual machine.
Most of your interaction with the system will be through a command-line wrapper called hadoop. If you run this program with no arguments, it prints a help message. To try this, run the following command in a terminal window:
$ hadoop


The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.

Step 1: Exploring HDFS
The subsystem associated with HDFS in the Hadoop wrapper program is called FsShell. This subsystem can be invoked with the command hadoop fs.
1. Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop.
2. In the terminal window, enter:
$ hadoop fs
You see a help message describing all the commands associated with the FsShell subsystem.
3. Enter:
$ hadoop fs -ls /
This shows you the contents of the root directory in HDFS. There will be multiple entries, one of which is /user. Individual users have a home directory under this directory, named after their username; your username in this course is training, therefore your home directory is /user/training.
4. Try viewing the contents of the /user directory by running:
$ hadoop fs -ls /user
You will see your home directory in the directory listing.


5. List the contents of your home directory by running:
$ hadoop fs -ls /user/training
There are no files yet, so the command silently exits. This is different from running hadoop fs -ls /foo, which refers to a directory that doesn't exist. In this case, an error message would be displayed.
Note that the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are completely separate namespaces.

Step 2: Uploading Files
Besides browsing the existing filesystem, another important thing you can do with FsShell is to upload new data into HDFS.
1. Change directories to the local filesystem directory containing the sample data we will be using in the homework labs:
$ cd ~/training_materials/developer/data
If you perform a regular Linux ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both of these contain the complete works of Shakespeare in text format, but with different formats and organizations. For now we will work with shakespeare.tar.gz.
2. Unzip shakespeare.tar.gz by running:
$ tar zxvf shakespeare.tar.gz
This creates a directory named shakespeare/ containing several files on your local filesystem.
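If you would like to see what the tar flags do before touching the course data, here is a self-contained local sketch (all the paths and file names below are invented for illustration):

```shell
# Build a tiny archive, then extract it with the same flags used above:
# z = gzip compression, x = extract (c = create), v = verbose, f = file name.
mkdir -p /tmp/tardemo/works
echo "sample text" > /tmp/tardemo/works/poems.txt
cd /tmp/tardemo
tar zcvf works.tar.gz works   # create the compressed archive
rm -r works
tar zxvf works.tar.gz         # recreates the works/ directory
cat works/poems.txt           # prints: sample text
```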


3. Insert this directory into HDFS:
$ hadoop fs -put shakespeare /user/training/shakespeare
This copies the local shakespeare directory and its contents into a remote, HDFS directory named /user/training/shakespeare.
4. List the contents of your HDFS home directory now:
$ hadoop fs -ls /user/training
You should see an entry for the shakespeare directory.
5. Now try the same fs -ls command but without a path argument:
$ hadoop fs -ls

You should see the same results. If you don't pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training.

Relative paths
If you pass any relative (non-absolute) paths to FsShell commands (or use relative paths in MapReduce programs), they are considered relative to your home directory.

6. We will also need a sample web server log file, which we will put into HDFS for use in future labs. This file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload in one step. First, create a directory in HDFS in which to store it:
$ hadoop fs -mkdir weblog
7. Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash (-) in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.
$ gunzip -c access_log.gz \
| hadoop fs -put - weblog/access_log
8. Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.
9. The access log file is quite large, around 500 MB. Create a smaller version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent labs.
$ hadoop fs -mkdir testlog
$ gunzip -c access_log.gz | head -n 5000 \
| hadoop fs -put - testlog/test_access_log
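The extract-and-upload pipelines above stream the uncompressed data instead of writing it to local disk first. You can try the same pattern entirely locally (the file names below are invented), with a plain file standing in for the hadoop fs -put - sink:

```shell
# Create a small compressed "log", then stream-uncompress it through head,
# keeping only the first 2 lines; the full text is never stored uncompressed.
printf 'line1\nline2\nline3\n' > /tmp/demo_log
gzip -f /tmp/demo_log                            # produces /tmp/demo_log.gz
gunzip -c /tmp/demo_log.gz | head -n 2 > /tmp/demo_sample
cat /tmp/demo_sample                             # prints line1 and line2
```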

Step 3: Viewing and Manipulating Files
Now let's view some of the data you just copied into HDFS.
1. Enter:
$ hadoop fs -ls shakespeare
This lists the contents of the /user/training/shakespeare HDFS directory, which consists of the files comedies, glossary, histories, poems, and tragedies.
2. The glossary file included in the compressed file you began with is not strictly a work of Shakespeare, so let's remove it:
$ hadoop fs -rm shakespeare/glossary
Note that you could leave this file in place if you so wished. If you did, then it would be included in subsequent computations across the works of Shakespeare, and would skew your results slightly. As with many real-world big data problems, you make trade-offs between the labor to purify your input data and the precision of your results.


3. Enter:
$ hadoop fs -cat shakespeare/histories | tail -n 50
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce program is very large, making it inconvenient to view the entire file in the terminal. For this reason, it's often a good idea to pipe the output of the fs -cat command into head, tail, more, or less.
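head and tail behave the same way on any stream, so you can practice the habit locally first (the file name here is invented):

```shell
# Generate a 1000-line file, then peek at only the ends of it
# instead of printing the whole thing.
seq 1 1000 > /tmp/biglist
head -n 3 /tmp/biglist    # prints the first three lines: 1, 2, 3
tail -n 2 /tmp/biglist    # prints the last two lines: 999, 1000
```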
4. To download a file to work with on the local filesystem use the fs -get command. This command takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local filesystem:
$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ less ~/shakepoems.txt

Other Commands
There are several other operations available with the hadoop fs command to perform most common filesystem manipulations: -mv, -cp, -mkdir, etc.
1. Enter:
$ hadoop fs
This displays a brief usage report of the commands available within FsShell. Try playing around with a few of these commands if you like.

This is the end of the lab.


Lecture 2 Lab: Running a MapReduce Job
Files and Directories Used in this Exercise
Source directory: ~/workspace/wordcount/src/solution
Files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
wc.jar: The compiled, assembled WordCount program

In this lab you will compile Java files, create a JAR, and run MapReduce jobs.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.

Compiling and Submitting a MapReduce Job
1. In a terminal window, change to the lab source directory, and list the contents:
$ cd ~/workspace/wordcount/src
$ ls
List the files in the solution package directory:
$ ls solution
The package contains the following Java files:

WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
Examine these files if you wish, but do not change them. Remain in this directory while you execute the following commands.
2. Before compiling, examine the classpath Hadoop is configured to use:

$ hadoop classpath

This lists the locations where the Hadoop core API classes are installed.
3. Compile the three Java classes:
$ javac -classpath `hadoop classpath` solution/*.java
Note: in the command above, the quotes around hadoop classpath are backquotes. This runs the hadoop classpath command and uses its output as part of the javac command.
The compiled (.class) files are placed in the solution directory.
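Backquote command substitution is a general shell feature, not something specific to Hadoop. A tiny local sketch (the jar paths echoed here are invented for illustration):

```shell
# The shell runs the command inside backquotes first and splices its
# output into the surrounding command line. $(...) is the modern equivalent.
cp_demo=`echo /usr/lib/demo1.jar:/usr/lib/demo2.jar`
echo "compiling with classpath $cp_demo"
# prints: compiling with classpath /usr/lib/demo1.jar:/usr/lib/demo2.jar
```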
4. Collect your compiled Java files into a JAR file:
$ jar cvf wc.jar solution/*.class
5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
$ hadoop jar wc.jar solution.WordCount \
shakespeare wordcounts
This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (solution.WordCount), and the HDFS input and output directories to use for the MapReduce job.


Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.
6. Try running this same command again without any change:


$ hadoop jar wc.jar solution.WordCount \
shakespeare wordcounts
Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory. This is by design: since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.
7. Review the result of your MapReduce job:
$ hadoop fs -ls wordcounts
This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)
8. View the contents of the output for your job:
$ hadoop fs -cat wordcounts/part-r-00000 | less
You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.) Note that you could have specified wordcounts/* just as well in this command.


Wildcards in HDFS file paths
Take care when using wildcards (e.g. *) when specifying HDFS filenames because of how Linux works: the shell will attempt to expand the wildcard before invoking hadoop, and then pass incorrect references to local files instead of HDFS files. You can prevent this by enclosing the wildcarded HDFS filenames in single quotes, e.g. hadoop fs -cat 'wordcounts/*'
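You can watch the shell's wildcard expansion happen using only local commands (the directory and file names below are invented):

```shell
mkdir -p /tmp/globdemo
touch /tmp/globdemo/part-r-00000 /tmp/globdemo/part-r-00001
# Unquoted: the local shell expands the * itself, so the command
# receives two separate file-name arguments.
echo /tmp/globdemo/*
# prints: /tmp/globdemo/part-r-00000 /tmp/globdemo/part-r-00001

# Quoted: the * is passed through literally as one argument, so a
# command like hadoop would receive the pattern unexpanded.
echo '/tmp/globdemo/*'
# prints: /tmp/globdemo/*
```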

9. Try running the WordCount job against a single file:
$ hadoop jar wc.jar solution.WordCount \
shakespeare/poems pwords
When the job completes, inspect the contents of the pwords HDFS directory.
10. Clean up the output files produced by your job runs:
$ hadoop fs -rm -r wordcounts pwords

Stopping MapReduce Jobs
It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (which is displaying the MapReduce job's progress) does not actually stop the job itself.
A MapReduce job, once submitted to Hadoop, runs independently of the initiating process, so losing the connection to the initiating process does not kill the job. Instead, you need to tell the Hadoop JobTracker to stop the job.


1. Start another word count job like you did in the previous section:

$ hadoop jar wc.jar solution.WordCount shakespeare \
count2
2. While this job is running, open another terminal window and enter:
$ mapred job -list
This lists the job ids of all running jobs. A job id looks something like:
job_200902131742_0002
3. Copy the job id, and then kill the running job by entering:
$ mapred job -kill jobid
The JobTracker kills the job, and the program running in the original terminal completes.

This is the end of the lab.


Lecture 3 Lab: Writing a MapReduce Java Program
Projects and Directories Used in this Exercise
Eclipse project: averagewordlength
Java files:
AverageReducer.java (Reducer)
LetterMapper.java (Mapper)
AvgWordLength.java (driver)
Test data (HDFS):
shakespeare
Exercise directory: ~/workspace/averagewordlength

In this lab, you will write a MapReduce job that reads any text input and computes the average length of all words that start with each character.
For any text input, the job should report the average length of words that begin with 'a', 'b', and so forth. For example, for input:
No now is definitely not the time
The output would be:
N	2.0
d	10.0
i	2.0
n	3.0
t	3.5
(For the initial solution, your program should be case-sensitive as shown in this example.)


The Algorithm
The algorithm for this program is a simple one-pass MapReduce program:
The Mapper
The Mapper receives a line of text for each input value. (Ignore the input key.) For each word in the line, emit the first letter of the word as a key, and the length of the word as a value. For example, for input value:
No now is definitely not the time
Your Mapper should emit:
N	2
n	3
i	2
d	10
n	3
t	3
t	4
The Reducer
Thanks to the shuffle and sort phase built into MapReduce, the Reducer receives the keys in sorted order, and all the values for one key are grouped together. So, for the Mapper output above, the Reducer receives this:


N	(2)
d	(10)
i	(2)
n	(3, 3)
t	(3, 4)

The Reducer output should be:
N	2.0
d	10.0
i	2.0
n	3.0
t	3.5

Step 1: Start Eclipse
There is one Eclipse project for each of the labs that use Java. Using Eclipse will speed up your development time.
1. Be sure you have run the course setup script as instructed earlier in the General Notes section. That script sets up the lab workspace and copies in the Eclipse projects you will use for the remainder of the course.
2. Start Eclipse using the icon on your VM desktop. The projects for this course will appear in the Project Explorer on the left.

Step 2: Write the Program in Java
There are stub files for each of the Java classes for this lab: LetterMapper.java (the Mapper), AverageReducer.java (the Reducer), and AvgWordLength.java (the driver).
If you are using Eclipse, open the stub files (located in the src/stubs package) in the averagewordlength project. If you prefer to work in the shell, the files are in ~/workspace/averagewordlength/src/stubs.

You may wish to refer back to the wordcount example (in the wordcount project in Eclipse or in ~/workspace/wordcount) as a starting point for your Java code. Here are a few details to help you begin your Java programming:
3. Define the driver
This class should configure and submit your basic job. Among the basic steps here, configure the job with the Mapper class and the Reducer class you will write, and the data types of the intermediate and final keys.
4. Define the Mapper
Note these simple string operations in Java:
str.substring(0, 1)  // String: first letter of str
str.length()         // int: length of str
5. Define the Reducer
In a single invocation the reduce() method receives a string containing one letter (the key) along with an iterable collection of integers (the values), and should emit a single key-value pair: the letter and the average of the integers.
6. Compile your classes and assemble the jar file
To compile and jar, you may either use the command-line javac command as you did earlier in the 'Running a MapReduce Job' lab, or follow the steps below ('Using Eclipse to Compile Your Solution') to use Eclipse.
Step 3: Use Eclipse to Compile Your Solution
Follow these steps to use Eclipse to complete this lab.
Note: These same steps will be used for all subsequent labs. The instructions will not be repeated each time, so take note of the steps.


1. Verify that your Java code does not have any compiler errors or warnings.
The Eclipse software in your VM is pre-configured to compile code automatically without performing any explicit steps. Compile errors and warnings appear as red and yellow icons to the left of the code.
A red X indicates a compiler error.
2. In the Package Explorer, open the Eclipse project for the current lab (i.e. averagewordlength). Right-click the default package under the src entry and select Export.


3. Select Java > JAR file from the Export dialog box, then click Next.
4. Specify a location for the JAR file. You can place your JAR files wherever you like, e.g.:


Note: For more information about using Eclipse, see the Eclipse Reference in Homework_EclipseRef.docx.

Step 4: Test your program
1. In a terminal window, change to the directory where you placed your JAR file. Run the hadoop jar command as you did previously in the 'Running a MapReduce Job' lab.
$ hadoop jar avgwordlength.jar stubs.AvgWordLength \
shakespeare wordlengths
2. List the results:
$ hadoop fs -ls wordlengths
A single reducer output file should be listed.
3. Review the results:
$ hadoop fs -cat wordlengths/*
The file should list all the numbers and letters in the data set, and the average length of the words starting with them, e.g.:
1	1.02
2	1.0588235294117647
3	1.0
4	1.5
5	1.5
6	1.5
7	1.0
8	1.5
9	1.0
A	3.891394576646375
B	5.139302507836991
C	6.629694233531706
This example uses the entire Shakespeare dataset for your input; you can also try it with just one of the files in the dataset, or with your own test data.


This is the end of the lab.


Lecture 3 Lab: More Practice with MapReduce Java Programs
Files and Directories Used in this Exercise
Eclipse project: log_file_analysis
Java files:
SumReducer.java: the Reducer
LogFileMapper.java: the Mapper
ProcessLogs.java: the driver class
Test data (HDFS):
weblog (full version)
testlog (test sample set)
Exercise directory: ~/workspace/log_file_analysis

In this lab, you will analyze a log file from a web server to count the number of hits made from each unique IP address.
Your task is to count the number of hits made from each IP address in the sample (anonymized) web server log file that you uploaded to the /user/training/weblog directory in HDFS when you completed the 'Using HDFS' lab.
In the log_file_analysis directory, you will find stubs for the Mapper and Driver.


1. Using the stub files in the log_file_analysis project directory, write Mapper and Driver code to count the number of hits made from each IP address in the access log file. Your final result should be a file in HDFS containing each IP address, and the count of log hits from that address. Note: The Reducer for this lab performs the exact same function as the one in the WordCount program you ran earlier. You can reuse that code, or you can write your own if you prefer.
2. Build your application jar file following the steps in the previous lab.


3. Test your code using the sample log data in the /user/training/weblog directory. Note: You may wish to test your code against the smaller version of the access log you created in a prior lab (located in the /user/training/testlog HDFS directory) before you run your code against the full log, which can be quite time consuming.

This is the end of the lab.


Lecture 3 Lab: Writing a MapReduce Streaming Program
Files and Directories Used in this Exercise
Project directory: ~/workspace/averagewordlength

Test data (HDFS):
shakespeare

In this lab you will repeat the same task as in the previous lab: writing a program to calculate average word lengths for letters. However, you will write this as a streaming program using a scripting language of your choice rather than using Java.
Your virtual machine has Perl, Python, PHP, and Ruby installed, so you can choose any of these, or even shell scripting, to develop a Streaming solution.
For your Hadoop Streaming program you will not use Eclipse. Launch a text editor to write your Mapper script and your Reducer script. Here are some notes about solving the problem in Hadoop Streaming:
1. The Mapper Script


The Mapper will receive lines of text on stdin. Find the words in the lines to produce the intermediate output, and emit intermediate (key, value) pairs by writing strings of the form:
key<tab>value<newline>
These strings should be written to stdout.
2. The Reducer Script
For the reducer, multiple values with the same key are sent to your script on stdin as successive lines of input. Each line contains a key, a tab, a value, and a newline. All lines with the same key are sent one after another, possibly followed by lines with a different key, until the reducing input is complete. For example, the reduce script may receive the following:
t	3
t	4
w	4
w	6
For this input, emit the following to stdout:
t	3.5
w	5.0
Observe that the reducer receives a key with each input line, and must notice when the key changes on a subsequent line (or when the input is finished) to know when the values for a given key have been exhausted. This is different from the Java version you worked on in the previous lab.
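Professor's Note ~
One possible shape for the two scripts, in Python. This is a sketch, not the provided sample solution; the functions take an iterable of lines so the logic is easy to test, where the real scripts would read sys.stdin directly.

```python
import sys

def mapper(lines, out=sys.stdout):
    """Emit 'first-letter<tab>word-length' for every word on stdin."""
    for line in lines:
        for word in line.split():
            out.write("%s\t%d\n" % (word[0].lower(), len(word)))

def reducer(lines, out=sys.stdout):
    """Input arrives sorted by key; watch for the key to change to know
    when all values for a letter are in, then emit the average length."""
    current, total, count = None, 0.0, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current and current is not None:
            out.write("%s\t%s\n" % (current, total / count))
            total, count = 0.0, 0
        current = key
        total += float(value)
        count += 1
    if current is not None:
        out.write("%s\t%s\n" % (current, total / count))
```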
3. Run the streaming program:
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/\
contrib/streaming/hadoop-streaming*.jar \
-input inputDir -output outputDir \
-file pathToMapScript -file pathToReduceScript \
-mapper mapBasename -reducer reduceBasename
(Remember, you may need to delete any previous output before running your program by issuing: hadoop fs -rm -r dataToDelete.)
4. Review the output in the HDFS directory you specified (outputDir).

Professor's Note ~
The Perl example is in: ~/workspace/wordcount/perl_solution


Professor's Note ~

Solution in Python
You can find a working solution to this lab written in Python in the directory ~/workspace/averagewordlength/python_sample_solution.
To run the solution, change directory to ~/workspace/averagewordlength and run this command:
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce\
/contrib/streaming/hadoop-streaming*.jar \
-input shakespeare -output avgwordstreaming \
-file python_sample_solution/mapper.py \
-file python_sample_solution/reducer.py \
-mapper mapper.py -reducer reducer.py

This is the end of the lab.


Lecture 3 Lab: Writing Unit Tests with the MRUnit Framework
Projects Used in this Exercise
Eclipse project: mrunit

Java files:
SumReducer.java (Reducer from WordCount)
WordMapper.java (Mapper from WordCount)
TestWordCount.java (Test Driver)

In this Exercise, you will write Unit Tests for the WordCount code.
1. Launch Eclipse (if necessary) and expand the mrunit folder.
2. Examine the TestWordCount.java file in the mrunit project stubs package.

Notice that three tests have been created, one each for the Mapper, Reducer, and the entire MapReduce flow. Currently, all three tests simply fail.
3. Run the tests by right-clicking on TestWordCount.java in the Package Explorer panel and choosing Run As > JUnit Test.
4. Observe the failure. Results in the JUnit tab (next to the Package Explorer tab) should indicate that three tests ran with three failures.
5. Now implement the three tests.
6. Run the tests again. Results in the JUnit tab should indicate that three tests ran with no failures.
7. When you are done, close the JUnit tab.
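Professor's Note ~
An MRUnit test always has the same shape: supply one input record, declare the exact expected output records, and compare. Here is that given/expect pattern expressed in plain Python (not MRUnit, and not the lab's Java classes), with one check each for the map logic, the reduce logic, and the full flow:

```python
def word_mapper(offset, line):
    """WordCount map logic: emit (word, 1) for each word in the line."""
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    """WordCount reduce logic: emit (word, total count)."""
    return [(key, sum(values))]

def map_reduce(records):
    """The full flow: map every record, group by key, reduce each group."""
    grouped = {}
    for offset, line in records:
        for k, v in word_mapper(offset, line):
            grouped.setdefault(k, []).append(v)
    return [pair for k in sorted(grouped) for pair in sum_reducer(k, grouped[k])]

# One check per component, in the style of MapDriver, ReduceDriver,
# and MapReduceDriver: given this input, expect exactly this output.
assert word_mapper(1, "cat cat dog") == [("cat", 1), ("cat", 1), ("dog", 1)]
assert sum_reducer("cat", [1, 1]) == [("cat", 2)]
assert map_reduce([(1, "cat cat dog")]) == [("cat", 2), ("dog", 1)]
```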

This is the end of the lab.


Lecture 4 Lab: Using ToolRunner and Passing Parameters
Files and Directories Used in this Exercise
Eclipse project: toolrunner

Java files:
AverageReducer.java (Reducer from AverageWordLength)
LetterMapper.java (Mapper from AverageWordLength)
AvgWordLength.java (driver from AverageWordLength)
Exercise directory: ~/workspace/toolrunner

In this Exercise, you will implement a driver using ToolRunner.
Follow the steps below to start with the Average Word Length program you wrote in an earlier lab, and modify the driver to use ToolRunner. Then modify the Mapper to reference a Boolean parameter called caseSensitive: if true, the mapper should treat upper- and lowercase letters as different; if false or unset, all letters should be converted to lowercase.


Modify the AverageWordLength Driver to use ToolRunner
1. Copy the Reducer, Mapper and driver code you completed in the Writing Java MapReduce Programs lab earlier, in the averagewordlength project.

Copying Source Files
You can use Eclipse to copy a Java source file from one project or package to another by right-clicking on the file and selecting Copy, then right-clicking the new package and selecting Paste. If the packages have different names (e.g. if you copy from averagewordlength.solution to toolrunner.stubs), Eclipse will automatically change the package directive at the top of the file. If you copy the file using a file browser or the shell, you will have to do that manually.


2. Modify the AvgWordLength driver to use ToolRunner. Refer to the slides for details.
a. Implement the run method
b. Modify main to call run
3. Jar your solution and test it before continuing; it should continue to function exactly as it did before. Refer to the Writing a Java MapReduce Program lab for how to assemble and test if you need a reminder.

Modify the Mapper to use a configuration parameter
4. Modify the LetterMapper class to
a. Override the setup method to get the value of a configuration parameter called caseSensitive, and use it to set a member variable indicating whether to do case-sensitive or case-insensitive processing.
b. In the map method, choose whether to do case-sensitive processing (leave the letters as-is), or insensitive processing (convert all letters to lowercase) based on that variable.
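Professor's Note ~
The setup/map pattern in step 4 can be sketched in plain Python, with a dict standing in for Hadoop's Configuration object (the class name here is illustrative; only the caseSensitive key matches the lab):

```python
class LetterMapperSketch:
    """setup() reads a job parameter once; map() uses it on every record."""

    def setup(self, conf):
        # In Hadoop this would be context.getConfiguration().getBoolean(...)
        self.case_sensitive = conf.get("caseSensitive", False)

    def map(self, line):
        out = []
        for word in line.split():
            w = word if self.case_sensitive else word.lower()
            out.append((w[0], len(w)))   # (first letter, word length)
        return out

m = LetterMapperSketch()
m.setup({"caseSensitive": True})
print(m.map("Now now"))    # [('N', 3), ('n', 3)]
m.setup({})                # unset -> case-insensitive
print(m.map("Now now"))    # [('n', 3), ('n', 3)]
```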

Pass a parameter programmatically


5. Modify the driver's run method to set a Boolean configuration parameter called caseSensitive. (Hint: Use the Configuration.setBoolean method.)
6. Test your code twice, once passing false and once passing true. When set to true, your final output should have both upper- and lowercase letters; when false, it should have only lowercase letters.
Hint: Remember to rebuild your Jar file to test changes to your code.

Pass a parameter as a runtime parameter
7. Comment out the code that sets the parameter programmatically. (Eclipse hint: Select the code to comment and then select Source > Toggle Comment.) Test again, this time passing the parameter value using -D on the Hadoop command line, e.g.:
$ hadoop jar toolrunner.jar stubs.AvgWordLength \
-DcaseSensitive=true shakespeare toolrunnerout
8. Test passing both true and false to confirm the parameter works correctly.

This is the end of the lab.

Lecture 4 Lab: Using a Combiner
Files and Directories Used in this Exercise
Eclipse project: combiner

Java files:
WordCountDriver.java (Driver from WordCount)
WordMapper.java (Mapper from WordCount)

SumReducer.java (Reducer from WordCount)
Exercise directory: ~/workspace/combiner

In this lab, you will add a Combiner to the WordCount program to reduce the amount of intermediate data sent from the Mapper to the Reducer.
Because summing is associative and commutative, the same class can be used for both the Reducer and the Combiner.
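Professor's Note ~
Why the same class works in both roles, and what the Combiner buys, in a plain-Python sketch (not the lab's Java code): the sum logic can be applied once on each Mapper's local output and again at the Reducer, because addition does not care about grouping or order.

```python
from collections import Counter

def map_words(line):
    """WordCount Mapper output for one input line: (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def sum_pairs(pairs):
    """Sum logic shared by Combiner and Reducer; safe to run twice
    because addition is associative and commutative."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

mapper_output = map_words("to be or not to be")
print(len(mapper_output))        # 6 records would cross the network
combined = sum_pairs(mapper_output)
print(len(combined))             # only 4 after local combining
print(sum_pairs(combined))       # final result is unchanged:
                                 # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```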

Implement a Combiner
1. Copy WordMapper.java and SumReducer.java from the wordcount project to the combiner project.
2. Modify the WordCountDriver.java code to add a Combiner for the WordCount program.
3. Assemble and test your solution. (The output should remain identical to the WordCount application without a combiner.)

This is the end of the lab.


Lecture 5 Lab: Testing with LocalJobRunner
Files and Directories Used in this Exercise
Eclipse project: toolrunner

Test data (local):
~/training_materials/developer/data/shakespeare
Exercise directory: ~/workspace/toolrunner

In this lab, you will practice running a job locally for debugging and testing purposes.
In the Using ToolRunner and Passing Parameters lab, you modified the Average Word Length program to use ToolRunner. This makes it simple to set job configuration properties on the command line.

Run the Average Word Length program using LocalJobRunner on the command line
1. Run the Average Word Length program again. Specify -jt=local to run the job locally instead of submitting to the cluster, and -fs=file:/// to use the local file system instead of HDFS. Your input and output files should refer to local files rather than HDFS files.
Note: Use the program you completed in the ToolRunner lab.


$ hadoop jar toolrunner.jar stubs.AvgWordLength \
-fs=file:/// -jt=local \
~/training_materials/developer/data/shakespeare \
localout


2. Review the job output in the local output folder you specified.

Optional: Run the Average Word Length program using LocalJobRunner in Eclipse
1. In Eclipse, locate the toolrunner project in the Package Explorer. Open the stubs package.
2. Right-click on the driver class (AvgWordLength) and select Run As > Run Configurations...
3. Ensure that Java Application is selected in the run types listed in the left pane.
4. In the Run Configuration dialog, click the New launch configuration button:


5. On the Main tab, confirm that the Project and Main class are set correctly for your project, e.g. Project: toolrunner and Main class: stubs.AvgWordLength
6. Select the Arguments tab and enter the input and output folders. (These are local, not HDFS, folders, and are relative to the run configuration's working folder, which by default is the project folder in the Eclipse workspace: e.g. ~/workspace/toolrunner.)

7. Click the Run button. The program will run locally with the output displayed in the Eclipse console window.

8. Review the job output in the local output folder you specified.

Note: You can re-run any previous configurations using the Run or Debug history buttons on the Eclipse toolbar.

This is the end of the lab.


Lecture 5 Lab: Logging

Files and Directories Used in this Exercise
Eclipse project: logging

Java files:
AverageReducer.java (Reducer from ToolRunner)
LetterMapper.java (Mapper from ToolRunner)
AvgWordLength.java (driver from ToolRunner)

Test data (HDFS):
shakespeare

Exercise directory: ~/workspace/logging

In this lab, you will practice using log4j with MapReduce.
Modify the Average Word Length program you built in the Using ToolRunner and Passing Parameters lab so that the Mapper logs a debug message indicating whether it is comparing with or without case sensitivity.



Enable Mapper Logging for the Job
1. Before adding additional logging messages, try re-running the toolrunner lab solution with Mapper debug logging enabled by adding
-Dmapred.map.child.log.level=DEBUG
to the command line. E.g.
$ hadoop jar toolrunner.jar stubs.AvgWordLength \
-Dmapred.map.child.log.level=DEBUG shakespeare outdir
2. Take note of the Job ID in the terminal window or by using the mapred job command.
3. When the job is complete, view the logs. In a browser on your VM, visit the JobTracker UI: http://localhost:50030/jobtracker.jsp. Find the job you just ran in the Completed Jobs list and click its Job ID. E.g.:

4. In the task summary, click map to view the map tasks.

5. In the list of tasks, click on the map task to view the details of that task.


6. Under Task Logs, click All. The logs should include both INFO and DEBUG messages. E.g.:

Add Debug Logging Output to the Mapper
7. Copy the code from the toolrunner project to the logging project stubs package. (Use your solution from the ToolRunner lab.)
8. Use log4j to output a debug log message indicating whether the Mapper is doing case-sensitive or case-insensitive mapping.

Build and Test Your Code
9. Following the earlier steps, test your code with Mapper debug logging enabled. View the map task logs in the JobTracker UI to confirm that your message is included in the log. (Hint: Search for LetterMapper in the page to find your message.)
10. Optional: Try running map logging set to INFO (the default) or WARN instead of DEBUG and compare the log output.
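Professor's Note ~
The level-threshold behavior this lab demonstrates is the same in any log4j-style framework: a DEBUG message is only emitted when the logger's threshold is DEBUG, not at INFO (the default). A plain-Python analog, using Python's logging module rather than log4j (the logger name and messages are illustrative):

```python
import io
import logging

def mapper_log_output(level):
    """Capture what a logger emits at a given threshold; analogous to
    setting mapred.map.child.log.level to DEBUG vs. INFO."""
    buf = io.StringIO()
    logger = logging.getLogger("LetterMapperSketch")
    logger.propagate = False
    logger.handlers = [logging.StreamHandler(buf)]
    logger.setLevel(level)
    logger.debug("mapper is case-insensitive")   # the lab's kind of message
    logger.info("map task started")
    return buf.getvalue()

print(mapper_log_output(logging.DEBUG))  # both messages appear
print(mapper_log_output(logging.INFO))   # DEBUG message is suppressed
```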

This is the end of the lab.


Lecture 5 Lab: Using Counters and a Map-Only Job
Files and Directories Used in this Exercise

Eclipse project: counters

Java files:
ImageCounter.java (driver)
ImageCounterMapper.java (Mapper)
Test data (HDFS):
weblog (full web server access log)
testlog (partial data set for testing)
Exercise directory: ~/workspace/counters

In this lab you will create a Map-only MapReduce job.
Your application will process a web server's access log to count the number of times gifs, jpegs, and other resources have been retrieved. Your job will report three figures: number of gif requests, number of jpeg requests, and number of other requests.

Hints
1. You should use a Map-only MapReduce job, by setting the number of Reducers to 0 in the driver code.
2. For input data, use the Web access log file that you uploaded to the HDFS /user/training/weblog directory in the Using HDFS lab.
Note: Test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.
3. Use a counter group such as ImageCounter, with names gif, jpeg and other.


4. In your driver code, retrieve the values of the counters after the job has completed and report them using System.out.println.
5. The output folder on HDFS will contain Mapper output files which are empty, because the Mappers did not write any data.
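Professor's Note ~
The Mapper's classification logic can be sketched in plain Python; a dict stands in for Hadoop's counter group, and matching on the file extension is one reasonable reading of "gif, jpeg, and other" (the sample log lines are invented, not from the course data):

```python
def count_image_requests(log_lines):
    """Classify each request as gif, jpeg, or other; mirrors incrementing
    the ImageCounter group's gif / jpeg / other counters in the Mapper."""
    counters = {"gif": 0, "jpeg": 0, "other": 0}
    for line in log_lines:
        try:
            # The request path is the 2nd token inside "GET /x.gif HTTP/1.1"
            path = line.split('"')[1].split()[1].lower()
        except IndexError:
            continue                      # malformed line: skip it
        if path.endswith(".gif"):
            counters["gif"] += 1
        elif path.endswith(".jpg") or path.endswith(".jpeg"):
            counters["jpeg"] += 1
        else:
            counters["other"] += 1
    return counters

sample = [
    '10.0.0.1 - - [24/Apr/2011:04:20:11 -0400] "GET /logo.gif HTTP/1.1" 200 512',
    '10.0.0.2 - - [24/Apr/2011:04:21:30 -0400] "GET /cat.jpg HTTP/1.1" 200 713',
    '10.0.0.3 - - [24/Apr/2011:04:22:01 -0400] "GET /index.html HTTP/1.1" 200 88',
]
print(count_image_requests(sample))   # {'gif': 1, 'jpeg': 1, 'other': 1}
```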

This is the end of the lab.


Lecture 6 Lab: Writing a Partitioner
Files and Directories Used in this Exercise
Eclipse project: partitioner

Java files:
MonthPartitioner.java (Partitioner)
ProcessLogs.java (driver)
CountReducer.java (Reducer)
LogMonthMapper.java (Mapper)

Test data (HDFS):
weblog (full web server access log)
testlog (partial data set for testing)

Exercise directory: ~/workspace/partitioner

In this Exercise, you will write a MapReduce job with multiple Reducers, and create a Partitioner to determine which Reducer each piece of Mapper output is sent to.

The Problem
In the More Practice with Writing MapReduce Java Programs lab you did previously, you built the code in the log_file_analysis project. That program counted the number of hits for each different IP address in a web log file. The final output was a file containing a list of IP addresses, and the number of hits from that address.
This time, you will perform a similar task, but the final output should consist of 12 files, one for each month of the year: January, February, and so on. Each file will contain a list of IP addresses, and the number of hits from that address in that month.
We will accomplish this by having 12 Reducers, each of which is responsible for processing the data for a particular month. Reducer 0 processes January hits, Reducer 1 processes February hits, and so on.

Note: We are actually breaking the standard MapReduce paradigm here, which says that all the values from a particular key will go to the same Reducer. In this example, which is a very common pattern when analyzing log files, values from the same key (the IP address) will go to multiple Reducers, based on the month portion of the line.

Write the Mapper
1. Starting with the LogMonthMapper.java stub file, write a Mapper that maps a log file output line to an IP/month pair. The map method will be similar to that in the LogFileMapper class in the log_file_analysis project, so you may wish to start by copying that code.
2. The Mapper should emit a Text key (the IP address) and Text value (the month). E.g.:
Input: 96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433
Output key: 96.7.4.14
Output value: Apr

Hint: In the Mapper, you may use a regular expression to parse the log file data if you are familiar with regex processing (see the file Homework_RegexRef.docx for reference). Remember that the log file may contain unexpected data; that is, lines that do not conform to the expected format. Be sure that your code copes with such lines.
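Professor's Note ~
A plain-Python sketch of the regex approach. The pattern below is a hypothetical one, not taken from Homework_RegexRef.docx, and returning None stands in for the Mapper simply emitting nothing for malformed lines:

```python
import re

# Hypothetical pattern: an IPv4 address at the start of the line, and the
# month as the second field inside [dd/Mon/yyyy:...].
LOG_RE = re.compile(r'^(\d+\.\d+\.\d+\.\d+).*\[\d+/(\w+)/')

def ip_and_month(line):
    """Return the (IP, month) pair the Mapper would emit, or None for
    lines that do not match the expected format."""
    m = LOG_RE.match(line)
    return (m.group(1), m.group(2)) if m else None

line = '96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433'
print(ip_and_month(line))               # ('96.7.4.14', 'Apr')
print(ip_and_month("not a log line"))   # None
```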


Write the Partitioner
3. Modify the MonthPartitioner.java stub file to create a Partitioner that sends the (key, value) pair to the correct Reducer based on the month. Remember that the Partitioner receives both the key and value, so you can inspect the value to determine which Reducer to choose.
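Professor's Note ~
The month-to-Reducer mapping is the whole job of the Partitioner. A plain-Python sketch of getPartition, assuming the Mapper emits three-letter month abbreviations as the value:

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def get_partition(key, value, num_reducers):
    """Sketch of MonthPartitioner.getPartition: the value is the month
    abbreviation, so Reducer 0 gets Jan, Reducer 1 gets Feb, and so on."""
    return MONTHS.index(value) % num_reducers

print(get_partition("96.7.4.14", "Jan", 12))   # 0
print(get_partition("96.7.4.14", "Apr", 12))   # 3
```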

Modify the Driver
4. Modify your driver code to specify that you want 12 Reducers.
5. Configure your job to use your custom Partitioner.

Test your Solution
6. Build and test your code. Your output directory should contain 12 files named part-r-000xx. Each file should contain IP address and number of hits for month xx.
Hints:
- Write unit tests for your Partitioner!
- You may wish to test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory. However, note that the test data may not include all months, so some result files will be empty.

This is the end of the lab.


Lecture 6 Lab: Implementing a Custom WritableComparable
Files and Directories Used in this Exercise
Eclipse project: writables

Java files:
StringPairWritable - implements a WritableComparable type
StringPairMapper - Mapper for test job
StringPairTestDriver - Driver for test job
Data file:
~/training_materials/developer/data/nameyeartestdata (small set of data for the test job)

Exercise directory: ~/workspace/writables

In this lab, you will create a custom WritableComparable type that holds two strings.
Test the new type by creating a simple program that reads a list of names (first and last) and counts the number of occurrences of each name.
The mapper should accept lines in the form:
lastname firstname other data
The goal is to count the number of times a lastname/firstname pair occur within the dataset. For example, for input:
Smith Joe 1963-08-12 Poughkeepsie, NY
Smith Joe 1832-01-20 Sacramento, CA
Murphy Alice 2004-06-02 Berlin, MA
We want to output:
(Smith, Joe) 2
(Murphy, Alice) 1

Note: You will use your custom WritableComparable type in a future lab, so make sure it is working with the test job now.

StringPairWritable
You need to implement a WritableComparable object that holds the two strings. The stub provides an empty constructor for serialization, a standard constructor that will be given two strings, a toString method, and the generated hashCode and equals methods. You will need to implement the readFields, write, and compareTo methods required by WritableComparables.
Note that Eclipse automatically generated the hashCode and equals methods in the stub file. You can generate these two methods in Eclipse by right-clicking in the source code and choosing Source > Generate hashCode() and equals().
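Professor's Note ~
A plain-Python analog of what readFields, write, and compareTo must do: length-prefixed binary serialization of the two strings, and an ordering that compares the left string first. The wire format here is illustrative, not Hadoop's actual encoding.

```python
import io
import struct

class StringPair:
    """Plain-Python analog of StringPairWritable: two strings, binary
    write/read, and a total ordering (mirroring write, readFields,
    and compareTo)."""

    def __init__(self, left="", right=""):
        self.left, self.right = left, right

    def write(self, out):
        for s in (self.left, self.right):
            data = s.encode("utf-8")
            out.write(struct.pack(">i", len(data)))   # 4-byte length prefix
            out.write(data)

    def read_fields(self, inp):
        def read_one():
            (n,) = struct.unpack(">i", inp.read(4))
            return inp.read(n).decode("utf-8")
        self.left, self.right = read_one(), read_one()

    def compare_to(self, other):
        # compareTo semantics: left string first, then right; -1 / 0 / 1
        return ((self.left, self.right) > (other.left, other.right)) - \
               ((self.left, self.right) < (other.left, other.right))

buf = io.BytesIO()
StringPair("Smith", "Joe").write(buf)
buf.seek(0)
p = StringPair()
p.read_fields(buf)                                  # round-trip
print((p.left, p.right))                            # ('Smith', 'Joe')
print(StringPair("Murphy", "Alice").compare_to(p))  # -1 ('M' sorts before 'S')
```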

NameCountTestJob
The test job requires a Reducer that sums the number of occurrences of each key. This is the same function that the SumReducer used previously in wordcount, except that SumReducer expects Text keys, whereas the reducer for this job will get StringPairWritable keys. You may either re-write SumReducer to accommodate other types of keys, or you can use the LongSumReducer Hadoop library class, which does exactly the same thing.
You can use the simple test data in ~/training_materials/developer/data/nameyeartestdata to make sure your new type works as expected.
You may test your code using local job runner or by submitting a Hadoop job to the (pseudo-)cluster as usual. If you submit the job to the cluster, note that you will need to copy your test data to HDFS first.

This is the end of the lab.


Lecture 6 Lab: Using SequenceFiles and File Compression
Files and Directories Used in this Exercise
Eclipse project: createsequencefile

Java files:
CreateSequenceFile.java (a driver that converts a text file to a sequence file)
ReadCompressedSequenceFile.java (a driver that converts a compressed sequence file to text)

Test data (HDFS):
weblog (full web server access log)
Exercise directory: ~/workspace/createsequencefile

In this lab you will practice reading and writing uncompressed and compressed SequenceFiles.
First, you will develop a MapReduce application to convert text data to a SequenceFile. Then you will modify the application to compress the SequenceFile using Snappy file compression.
When creating the SequenceFile, use the full access log file for input data. (You uploaded the access log file to the HDFS /user/training/weblog directory when you performed the Using HDFS lab.)
After you have created the compressed SequenceFile, you will write a second MapReduce application to read the compressed SequenceFile and write a text file that contains the original log file text.


Write a MapReduce program to create sequence files from text files
1. Determine the number of HDFS blocks occupied by the access log file:
a. In a browser window, start the NameNode Web UI. The URL is http://localhost:50070
b. Click Browse the filesystem.
c. Navigate to the /user/training/weblog/access_log file.
d. Scroll down to the bottom of the page. The total number of blocks occupied by the access log file appears in the browser window.
2. Complete the stub file in the createsequencefile project to read the access log file and create a SequenceFile. Records emitted to the SequenceFile can have any key you like, but the values should match the text in the access log file. (Hint: You can use a Map-only job using the default Mapper, which simply emits the data passed to it.)
Note: If you specify an output key type other than LongWritable, you must call job.setOutputKeyClass, not job.setMapOutputKeyClass. If you specify an output value type other than Text, you must call job.setOutputValueClass, not job.setMapOutputValueClass.
3. Build and test your solution so far. Use the access log as input data, and specify the uncompressedsf directory for output.
4. Examine the initial portion of the output SequenceFile using the following command:
$ hadoop fs -cat uncompressedsf/part-m-00000 | less
Some of the data in the SequenceFile is unreadable, but parts of the SequenceFile should be recognizable:
- The string SEQ, which appears at the beginning of a SequenceFile
- The Java classes for the keys and values
- Text from the access log file


5. Verify that the number of files created by the job is equivalent to the number of blocks required to store the uncompressed SequenceFile.

Compress the Output
6. Modify your MapReduce job to compress the output SequenceFile. Add statements to your driver to configure the output as follows:
- Compress the output file.
- Use block compression.
- Use the Snappy compression codec.
7. Compile the code and run your modified MapReduce job. For the MapReduce output, specify the compressedsf directory.
8. Examine the first portion of the output SequenceFile. Notice the differences between the uncompressed and compressed SequenceFiles:
- The compressed SequenceFile specifies the org.apache.hadoop.io.compress.SnappyCodec compression codec in its header.


- You cannot read the log file text in the compressed file.
9. Compare the file sizes of the uncompressed and compressed SequenceFiles in the uncompressedsf and compressedsf directories. The compressed SequenceFiles should be smaller.


Write another MapReduce program to uncompress the files
10. Starting with the provided stub file, write a second MapReduce program to read the compressed log file and write a text file. This text file should have the same text data as the log file, plus keys. The keys can contain any values you like.
11. Compile the code and run your MapReduce job.
For the MapReduce input, specify the compressedsf directory in which you created the compressed SequenceFile in the previous section.
For the MapReduce output, specify the compressedsftotext directory.
12. Examine the first portion of the output in the compressedsftotext directory. You should be able to read the textual log file entries.

Optional: Use command-line options to control compression
13. If you used ToolRunner for your driver, you can control compression using command-line arguments. Try commenting out the code in your driver where you configure output compression. Then test setting the mapred.output.compressed option on the command line, e.g.:
$ hadoop jar sequence.jar \
stubs.CreateUncompressedSequenceFile \
-Dmapred.output.compressed=true \
weblog outdir
14. Review the output to confirm the files are compressed.

This is the end of the lab.


Lecture 7 Lab: Creating an Inverted Index
Files and Directories Used in this Exercise
Eclipse project: inverted_index

Java files:
IndexMapper.java (Mapper)
IndexReducer.java (Reducer)
InvertedIndex.java (Driver)
Data files:
~/training_materials/developer/data/invertedIndexInput.tgz
Exercise directory: ~/workspace/inverted_index


In this lab, you will write a MapReduce job that produces an inverted index.
For this lab you will use an alternate input, provided in the file invertedIndexInput.tgz. When decompressed, this archive contains a directory of files; each is a Shakespeare play formatted as follows:
0	HAMLET
1
2
3	DRAMATIS PERSONAE
4
5
6	CLAUDIUS	king of Denmark. (KING CLAUDIUS:)
7
8	HAMLET	son to the late, and nephew to the present king.
9
10	POLONIUS	lord chamberlain. (LORD POLONIUS:)


...
Each line contains:
- Line number
- separator: a tab character
- value: the line of text

This format can be read directly using the KeyValueTextInputFormat class provided in the Hadoop API. This input format presents each line as one record to your Mapper, with the part before the tab character as the key, and the part after the tab as the value.
Given a body of text in this form, your indexer should produce an index of all the words in the text. For each word, the index should have a list of all the locations where the word appears. For example, for the word honeysuckle your output should look like this:


honeysuckle	2kinghenryiv@1038,midsummernightsdream@2175,...

The index should contain such an entry for every word in the text.

Prepare the Input Data
1. Extract the invertedIndexInput directory and upload to HDFS:
$ cd ~/training_materials/developer/data
$ tar zxvf invertedIndexInput.tgz
$ hadoop fs -put invertedIndexInput invertedIndexInput

Define the MapReduce Solution
Remember that for this program you use a special input format to suit the form of your data, so your driver class will include a line like:
job.setInputFormatClass(KeyValueTextInputFormat.class);
Don't forget to import this class for your use.
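As a quick sanity check, the record-splitting behavior this input format gives your Mapper can be sketched in plain Python (the function name is illustrative, not part of the Hadoop API):

```python
def key_value_split(line):
    """Mimic KeyValueTextInputFormat: the key is everything before the
    first tab character, and the value is everything after it."""
    key, _tab, value = line.partition("\t")
    return key, value

# Line 8 of the hamlet file arrives at the Mapper as this key/value pair:
print(key_value_split("8\tHAMLET son to the late, and nephew to the present king."))
```

Everything after the first tab, including any further tabs, stays in the value; a line with no tab yields an empty value.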

Retrieving the File Name
Note that the lab requires you to retrieve the file name, since that is the name of the play. The Context object can be used to retrieve the name of the file like this:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
Path path = fileSplit.getPath();
String fileName = path.getName();

Build and Test Your Solution
Test against the invertedIndexInput data you loaded above.


Hints
You may like to complete this lab without reading any further, or you may find the following hints about the algorithm helpful.

The Mapper
Your Mapper should take as input a key and a line of words, and emit as intermediate values each word as the key, with the word's location (filename@linenumber, built from the file name and the input key) as the value.
For example, the line of input from the file hamlet:
282	Have heaven and earth together
produces intermediate output:


Have	hamlet@282
heaven	hamlet@282
and	hamlet@282
earth	hamlet@282
together	hamlet@282

The Reducer
Your Reducer simply aggregates the values presented to it for the same key into one value. Use a separator like ',' between the values listed.

This is the end of the lab.


Lecture 7 Lab: Calculating Word Co-Occurrence
Files and Directories Used in this Exercise
Eclipse project: word_cooccurrence

Java files:
WordCoMapper.java (Mapper)

SumReducer.java (Reducer from WordCount)
WordCo.java (Driver)
Test directory (HDFS):
shakespeare
Exercise directory: ~/workspace/word_cooccurrence

In this lab, you will write an application that counts the number of times words appear next to each other.
Test your application using the files in the shakespeare folder you previously copied into HDFS in the Using HDFS lab.
Note that this implementation is a specialization of Word Co-Occurrence as we describe it in the notes; in this case we are only interested in pairs of words which appear directly next to each other.
1. Change directories to the word_cooccurrence directory within the labs directory.
2. Complete the Driver and Mapper stub files; you can use the standard SumReducer from the WordCount project as your Reducer. Your Mapper's intermediate output should be in the form of a Text object as the key, and an IntWritable as the value; the key will be "word1,word2", and the value will be 1.
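The pair-emission logic described in step 2 can be sketched in plain Python before you fill in the Java stubs (names here are illustrative; whether you normalize case is up to you, and this sketch lower-cases):

```python
from collections import Counter

def wordco_mapper(line):
    """Emit ('word1,word2', 1) for each pair of adjacent words in a line."""
    words = line.lower().split()
    for left, right in zip(words, words[1:]):
        yield "%s,%s" % (left, right), 1

def wordco_count(lines):
    """Sum the 1s per pair, as SumReducer does after the shuffle."""
    counts = Counter()
    for line in lines:
        for pair, one in wordco_mapper(line):
            counts[pair] += one
    return counts

print(wordco_count(["to be or not to be"])["to,be"])   # 2
```

Pairing each word with its successor (words against words[1:]) is exactly the "directly next to each other" restriction described above.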

Extra Credit
If you have extra time, please complete these additional challenges:

Challenge 1: Use the StringPairWritable key type from the Implementing a Custom WritableComparable lab. Copy your completed solution (from the writables project) into the current project.
Challenge 2: Write a second MapReduce job to sort the output from the first job so that the list of pairs of words appears in ascending frequency.
Challenge 3: Sort by descending frequency instead (so that the most frequently occurring word pairs are first in the output.) Hint: You will need to extend
org.apache.hadoop.io.LongWritable.Comparator.

This is the end of the lab.


Lecture 8 Lab: Importing Data with Sqoop
In this lab you will import data from a relational database using Sqoop. The data you load here will be used in subsequent labs.

Consider the MySQL database movielens, derived from the MovieLens project from the University of Minnesota. (See note at the end of this lab.) The database consists of several related tables, but we will import only two of these: movie, which contains about 3,900 movies; and movierating, which has about 1,000,000 ratings of those movies.

Review the Database Tables
First, review the database tables to be loaded into Hadoop.
1. Log on to MySQL:
$ mysql --user=training --password=training movielens
2. Review the structure and contents of the movie table:
mysql> DESCRIBE movie;
...
mysql> SELECT * FROM movie LIMIT 5;
3. Note the column names for the table:
____________________________________________________________________________________________


4. Review the structure and contents of the movierating table:
mysql> DESCRIBE movierating;


mysql> SELECT * FROM movierating LIMIT 5;
5. Note these column names:
____________________________________________________________________________________________
6. Exit mysql:
mysql> quit

Import with Sqoop
You invoke Sqoop on the command line to perform several commands. With it you can connect to your database server to list the databases (schemas) to which you have access, and list the tables available for loading. For database access, you provide a connect string to identify the server, and, if required, your username and password.
1. Show the commands available in Sqoop:
$ sqoop help
2. List the databases (schemas) in your database server:
$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training --password training
(Note: Instead of entering --password training on your command line, you may prefer to enter -P, and let Sqoop prompt you for the password, which is then not visible when you type it.)


3. List the tables in the movielens database:


$ sqoop list-tables \
--connect jdbc:mysql://localhost/movielens \
--username training --password training
4. Import the movie table into Hadoop:
$ sqoop import \
--connect jdbc:mysql://localhost/movielens \
--username training --password training \
--fields-terminated-by '\t' --table movie
5. Verify that the command has worked.
$ hadoop fs -ls movie
$ hadoop fs -tail movie/part-m-00000
6. Import the movierating table into Hadoop.
Repeat the last two steps, but for the movierating table.

This is the end of the lab.
Note:
This lab uses the MovieLens dataset, or subsets thereof. This data is freely available for academic purposes, and is used and distributed by Cloudera with the express permission of the UMN GroupLens Research Group. If you would like to use this data for your own research purposes, you are free to do so, as long as you cite the GroupLens Research Group in any resulting publications. If you would like to use this data for commercial purposes, you must obtain explicit permission. You may find the full dataset, as well as detailed license terms, at http://www.grouplens.org/node/73



Lecture 8 Lab: Running an Oozie Workflow
Files and Directories Used in this Exercise
Exercise directory: ~/workspace/oozie_labs

Oozie job folders:
lab1-java-mapreduce
lab2-sort-wordcount

In this lab, you will inspect and run Oozie workflows.
1. Start the Oozie server:
$ sudo /etc/init.d/oozie start
2. Change directories to the lab directory:
$ cd ~/workspace/oozie_labs
3. Inspect the contents of the job.properties and workflow.xml files in the lab1-java-mapreduce/job folder. You will see that this is the standard WordCount job.
In the job.properties file, take note of the job's base directory (lab1-java-mapreduce), and the input and output directories relative to that. (These are HDFS directories.)
4. We have provided a simple shell script to submit the Oozie workflow. Inspect the run.sh script and then run:

$ ./run.sh lab1-java-mapreduce
Notice that Oozie returns a job identification number.



5. Inspect the progress of the job:
$ oozie job -oozie http://localhost:11000/oozie \
-info job_id
6. When the job has completed, review the job output directory in HDFS to confirm that the output has been produced as expected.
7. Repeat the above procedure for lab2-sort-wordcount. Notice when you inspect workflow.xml that this workflow includes two MapReduce jobs which run one after the other, in which the output of the first is the input for the second. When you inspect the output in HDFS you will see that the second job sorts the output of the first job into descending numerical order.

This is the end of the lab.


Lecture 8 Bonus Lab: Exploring a Secondary Sort Example
Files and Directories Used in this Exercise
Eclipse project: secondarysort

Data files:
~/training_materials/developer/data/nameyeartestdata
Exercise directory: ~/workspace/secondarysort

In this lab, you will run a MapReduce job in different ways to see the effects of various components in a secondary sort program.
The program accepts lines in the form
lastname firstname birthdate
The goal is to identify the youngest person with each last name. For example, for input:
Murphy Joanne 1963-08-12
Murphy Douglas 1832-01-20
Murphy Alice 2004-06-02
We want to write out:
Murphy Alice 2004-06-02
All the code is provided to do this. Following the steps below you are going to progressively add each component to the job to accomplish the final goal.


Build the Program
1. In Eclipse, review but do not modify the code in the secondarysort project example package.
2. In particular, note the NameYearDriver class, in which the code to set the partitioner, sort comparator, and group comparator for the job is commented out. This allows us to set those values on the command line instead.
3. Export the jar file for the program as secsort.jar.
4. A small test data file called nameyeartestdata has been provided for you, located in the secondarysort project folder. Copy the data file to HDFS, if you did not already do so in the Writables lab.

Run as a Map-only Job
5. The Mapper for this job constructs a composite key using the StringPairWritable type. See the output of just the mapper by running this program as a Map-only job:
$ hadoop jar secsort.jar example.NameYearDriver \
-D mapred.reduce.tasks=0 nameyeartestdata secsortout
6. Review the output. Note the key is a string pair of last name and birth year.

Run using the default Partitioner and Comparators
7. Re-run the job, setting the number of reduce tasks to 2 instead of 0.
8. Note that the output now consists of two files, one each for the two reduce tasks. Within each file, the output is sorted by last name (ascending) and year (ascending). But it isn't sorted between files, and records with the same last name may be in different files (meaning they went to different reducers).

Run using the custom partitioner
9. Review the code of the custom partitioner class: NameYearPartitioner.


10. Re-run the job, adding a second parameter to set the partitioner class to use:
-D mapreduce.partitioner.class=example.NameYearPartitioner
11. Review the output again, this time noting that all records with the same last name have been partitioned to the same reducer.
However, they are still being sorted into the default sort order (name, year ascending). We want it sorted by name ascending / year descending.

Run using the custom sort comparator
12. The NameYearComparator class compares Name/Year pairs, first comparing the names and, if equal, compares the year (in descending order; i.e. later years are considered less than earlier years, and thus earlier in the sort order.) Re-run the job using NameYearComparator as the sort comparator by adding a third parameter:
-D mapred.output.key.comparator.class=example.NameYearComparator
13. Review the output and note that each reducer's output is now correctly partitioned and sorted.

Run with the NameYearReducer
14. So far we've been running with the default reducer, which is the IdentityReducer, which simply writes each key/value pair it receives. The actual goal of this job is to emit the record for the youngest person with each last name. We can do this easily if all records for a given last name are passed to a single reduce call, sorted in descending order, which can then simply emit the first value passed in each call.
15. Review the NameYearReducer code and note that it emits only the first value in each call.
16. Re-run the job, using the reducer by adding a fourth parameter:
-D mapreduce.reduce.class=example.NameYearReducer
Alas, the job still isn't correct, because the data being passed to the reduce method is being grouped according to the full key (name and year), so multiple records with the same last name (but different years) are being output. We want it to be grouped by
name only.

Run with the custom group comparator
17. The NameComparator class compares two string pairs by comparing only the name field and disregarding the year field. Pairs with the same name will be grouped into the same reduce call, regardless of the year. Add the group comparator to the job by adding a final parameter:
-D mapred.output.value.groupfn.class=example.NameComparator
18. Note the final output now correctly includes only a single record for each different last name, and that that record is the youngest person with that last name.
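The interaction of the partitioner, sort comparator, group comparator, and reducer in this lab can be simulated in plain Python (a sketch only; Hadoop's real comparators operate on serialized keys, and the sample dates are assumed to use YYYY-MM-DD order):

```python
from itertools import groupby

records = [("Murphy", "Joanne", "1963-08-12"),
           ("Murphy", "Douglas", "1832-01-20"),
           ("Murphy", "Alice", "2004-06-02")]

# Composite key (name, year), like StringPairWritable.
keyed = [((last, date[:4]), (last, first, date))
         for last, first, date in records]

# NameYearComparator: sort by name ascending, then year descending.
# (NameYearPartitioner would route on the name alone, so all Murphys
# land in the same partition before this sort.)
keyed.sort(key=lambda kv: (kv[0][0], -int(kv[0][1])))

# NameComparator (group comparator): group on the name only; emitting
# the first value per group is what NameYearReducer does.
youngest = [next(vals)[1]
            for _, vals in groupby(keyed, key=lambda kv: kv[0][0])]
print(youngest)   # [('Murphy', 'Alice', '2004-06-02')]
```

With the descending year sort in place, the first record in each name group is guaranteed to be the youngest person, which is why the reducer can stay so simple.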

This is the end of the lab.


Notes for Upcoming Labs
VM Services Customization
For the remainder of the labs, there are services that must be running in your VM, and others that are optional. It is strongly recommended that you run the following command whenever you start the VM:

$ ~/scripts/analyst/toggle_services.sh
This will conserve memory and increase performance of the virtual machine. After running this command, you may safely ignore any messages about services that have already been started or shut down.

Data Model Reference
For your convenience, you will find a reference document depicting the structure of the tables you will use in the following labs. See file: Homework_DataModelRef.docx

Regular Expression (Regex) Reference
For your convenience, you will find a reference document describing regular expression syntax. See file: Homework_RegexRef.docx


Lecture 9 Lab: Data Ingest With Hadoop Tools
In this lab you will practice using the Hadoop command-line utility to interact with Hadoop's Distributed Filesystem (HDFS) and use Sqoop to import tables from a relational database to HDFS.

Prepare your Virtual Machine
Launch the VM if you haven't already done so, and then run the following command to boost performance by disabling services that are not needed for this class:
$ ~/scripts/analyst/toggle_services.sh


Step 1: Set up HDFS
1. Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop. Next, change to the directory for this lab by running the following command:
$ cd $ADIR/exercises/data_ingest
2. To see the contents of your home directory, run the following command:
$ hadoop fs -ls /user/training
3. If you do not specify a path, hadoop fs assumes you are referring to your home directory. Therefore, the following command is equivalent to the one above:
$ hadoop fs -ls
4. Most of your work will be in the /dualcore directory, so create that now:
$ hadoop fs -mkdir /dualcore

Step 2: Importing Database Tables into HDFS with Sqoop
Dualcore stores information about its employees, customers, products, and orders in a MySQL database. In the next few steps, you will examine this database before using Sqoop to import its tables into HDFS.


1. Log in to MySQL and select the dualcore database:
$ mysql --user=training --password=training dualcore
2. Next, list the available tables in the dualcore database (mysql> represents the MySQL client prompt and is not part of the command):
mysql> SHOW TABLES;
3. Review the structure of the employees table and examine a few of its records:
mysql> DESCRIBE employees;
mysql> SELECT emp_id, fname, lname, state, salary FROM
employees LIMIT 10;
4. Exit MySQL by typing quit, and then hit the enter key:
mysql> quit
5. Next, run the following command, which imports the employees table into the /dualcore directory created earlier, using tab characters to separate each field:
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table employees


6. Revise the previous command and import the customers table into HDFS.
7. Revise the previous command and import the products table into HDFS.
8. Revise the previous command and import the orders table into HDFS.
9. Next, you will import the order_details table into HDFS. The command is slightly different because this table only holds references to records in the orders and products tables, and lacks a primary key of its own. Consequently, you will need to specify the --split-by option and instruct Sqoop to divide the import work among map tasks based on values in the order_id field. An alternative is to use the -m 1 option to force Sqoop to import all the data with a single task, but this would significantly reduce performance.
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table order_details \
--split-by=order_id
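The idea behind --split-by is that Sqoop looks up the minimum and maximum of the chosen column and gives each map task one slice of that range. A simplified plain-Python illustration (the real boundary arithmetic differs in detail):

```python
def split_ranges(lo, hi, tasks):
    """Divide the key range [lo, hi] into one sub-range per map task,
    roughly mimicking how --split-by partitions the import work."""
    size = (hi - lo + 1) // tasks
    bounds = []
    start = lo
    for i in range(tasks):
        # The last task absorbs any remainder so the full range is covered.
        end = hi if i == tasks - 1 else start + size - 1
        bounds.append((start, end))
        start = end + 1
    return bounds

print(split_ranges(1, 100, 4))   # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each map task then issues a query restricted to its own order_id interval, which is why a column with well-distributed values makes a good split key.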

This is the end of the lab.

Lecture 9 Lab: Using Pig for ETL Processing
In this lab you will practice using Pig to explore, correct, and reorder data in files from two different ad networks. You will first experiment with small samples of this data using Pig in local mode, and once you are confident that your ETL scripts work as you expect, you will use them to process the complete datasets in HDFS by using Pig in MapReduce mode.
IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Background Information
Dualcore has recently started using online advertisements to attract new customers to its e-commerce site. Each of the two ad networks they use provides data about the ads they've placed. This includes the site where the ad was placed, the date when it was placed, what keywords triggered its display, whether the user clicked the ad, and the per-click cost.
Unfortunately, the data from each network is in a different format. Each file also contains some invalid records. Before we can analyze the data, we must first correct these problems
by using Pig to:
- Filter invalid records
- Reorder fields
- Correct inconsistencies
- Write the corrected data to HDFS


Step #1: Working in the Grunt Shell
In this step, you will practice running Pig commands in the Grunt shell.
1. Change to the directory for this lab:
$ cd $ADIR/exercises/pig_etl
2. Copy a small number of records from the input file to another file on the local file system. When you start Pig, you will run in local mode. For testing, you can work faster with small local files than large files in HDFS.
It is not essential to choose a random sample here; just a handful of records in the correct format will suffice. Use the command below to capture the first 25 records so you have enough to test your script:
$ head -n 25 $ADIR/data/ad_data1.txt > sample1.txt
3. Start the Grunt shell in local mode so that you can work with the local sample1.txt file.


$ pig -x local
A prompt indicates that you are now in the Grunt shell:
grunt>
4. Load the data in the sample1.txt file into Pig and dump it:
grunt> data = LOAD 'sample1.txt';
grunt> DUMP data;
You should see the 25 records that comprise the sample data file.


5. Load the first two columns' data from the sample file as character data, and then dump that data:
grunt> first_2_columns = LOAD 'sample1.txt' AS
(keyword:chararray, campaign_id:chararray);
grunt> DUMP first_2_columns;
6. Use the DESCRIBE command in Pig to review the schema of first_2_columns:
grunt> DESCRIBE first_2_columns;
The schema appears in the Grunt shell.
Use the DESCRIBE command while performing these labs any time you would like to review schema definitions.
7. See what happens if you run the DESCRIBE command on data. Recall that when you loaded data, you did not define a schema.

grunt> DESCRIBE data;
8. End your Grunt shell session:
grunt> QUIT


Step #2: Processing Input Data from the First Ad Network
In this step, you will process the input data from the first ad network. First, you will create a Pig script in a file, and then you will run the script. Many people find working this way easier than working directly in the Grunt shell.
1. Edit the first_etl.pig file to complete the LOAD statement and read the data from the sample you just created. The following table shows the format of the data in the file. For simplicity, you should leave the date and time fields separate, so each will be of type chararray, rather than converting them to a single field of type datetime.

Index  Field         Data Type  Description                      Example
0      keyword       chararray  Keyword that triggered ad        tablet
1      campaign_id   chararray  Uniquely identifies the ad       A3
2      date          chararray  Date of ad display               05/29/2013
3      time          chararray  Time of ad display               15:49:21
4      display_site  chararray  Domain where ad shown            www.example.com
5      was_clicked   int        Whether ad was clicked
6      cpc           int        Cost per click, in cents         106
7      country       chararray  Name of country in which ad ran  USA
8      placement     chararray  Where on page was ad displayed   TOP

2. Once you have edited the LOAD statement, try it out by running your script in local mode:
$ pig -x local first_etl.pig
Make sure the output looks correct (i.e., that you have the fields in the expected order and the values appear similar in format to that shown in the table above) before you continue with the next step.
3. Make each of the following changes, running your script in local mode after each one to verify that your change is correct:
a. Update your script to filter out all records where the country field does not contain USA.


b. We need to store the fields in a different order than we received them. Use a FOREACH ... GENERATE statement to create a new relation containing the fields in the same order as shown in the following table (the country field is not included since all records now have the same value):

Index  Field         Description
0      campaign_id   Uniquely identifies the ad
1      date          Date of ad display
2      time          Time of ad display
3      keyword       Keyword that triggered ad
4      display_site  Domain where ad shown
5      placement     Where on page was ad displayed
6      was_clicked   Whether ad was clicked
7      cpc           Cost per click, in cents

c. Update your script to convert the keyword field to uppercase and to remove
any leading or trailing whitespace (Hint: You can nest calls to the two built-in functions inside the FOREACH ... GENERATE statement from the last statement).
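The three changes in step 3 amount to the following per-record logic, sketched here in plain Python (field names follow the tables above; the sample values are made up for illustration):

```python
def etl_record(fields):
    """Sketch of the first-network ETL: drop non-USA rows, clean the
    keyword (trimmed, then upper-cased), and reorder the fields."""
    (keyword, campaign_id, date, time, display_site,
     was_clicked, cpc, country, placement) = fields
    if country != "USA":
        return None                       # step 3a: filter by country
    keyword = keyword.strip().upper()     # step 3c: nested TRIM and UPPER
    # step 3b: emit the fields in the new order (country dropped)
    return (campaign_id, date, time, keyword, display_site,
            placement, was_clicked, cpc)

row = etl_record(("tablet ", "A3", "05/29/2013", "15:49:21",
                  "www.example.com", "1", "106", "USA", "TOP"))
print(row[3])   # TABLET
```

In the Pig script these become a FILTER, a FOREACH ... GENERATE with the new field order, and nested TRIM/UPPER calls inside that same statement.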
4. Add the complete data file to HDFS:
$ hadoop fs -put $ADIR/data/ad_data1.txt /dualcore
5. Edit first_etl.pig and change the path in the LOAD statement to match the path of the file you just added to HDFS (/dualcore/ad_data1.txt).
6. Next, replace DUMP with a STORE statement that will write the output of your processing as tab-delimited records to the /dualcore/ad_data1 directory.
7. Run this script in Pig's MapReduce mode to analyze the entire file in HDFS:
$ pig first_etl.pig
If your script fails, check your code carefully, fix the error, and then try running it again. Don't forget that you must remove output in HDFS from a previous run before you execute the script again.


8. Check the first 20 output records that your script wrote to HDFS and ensure they look correct (you can ignore the message "cat: Unable to write to output stream"; this simply happens because you are writing more data with the fs -cat command than you are reading with the head command):
$ hadoop fs -cat /dualcore/ad_data1/part* | head -20
a. Are the fields in the correct order?
b. Are all the keywords now in uppercase?

Step #3: Processing Input Data from the Second Ad Network
Now that you have successfully processed the data from the first ad network, continue by processing data from the second one.
1. Create a small sample of the data from the second ad network that you can test locally while you develop your script:
$ head -n 25 $ADIR/data/ad_data2.txt > sample2.txt

2. Edit the second_etl.pig file to complete the LOAD statement and read the data from the sample you just created (Hint: The fields are comma-delimited). The following table shows the order of fields in this file:

Index  Field         Data Type  Description                      Example
0      campaign_id   chararray  Uniquely identifies the ad       A3
1      date          chararray  Date of ad display               05/29/2013
2      time          chararray  Time of ad display               15:49:21
3      display_site  chararray  Domain where ad shown            www.example.com
4      placement     chararray  Where on page was ad displayed   TOP
5      was_clicked   int        Whether ad was clicked
6      cpc           int        Cost per click, in cents         106
7      keyword       chararray  Keyword that triggered ad        tablet

Copyright20102014Cloudera,Inc.Allrightsreserved.
Nottobereproducedwithoutpriorwrittenconsent.

76

Page77

3. Once you have edited the LOAD statement, use the DESCRIBE keyword and then run your script in local mode to check that the schema matches the table above:
$ pig -x local second_etl.pig
4. Replace DESCRIBE with a DUMP statement and then make each of the following changes to second_etl.pig, running this script in local mode after each change to verify what you've done before you continue with the next step:
d. This ad network sometimes logs a given record twice. Add a statement to the second_etl.pig file so that you remove any duplicate records. If you have
done this correctly, you should only see one record where the display_site field has a value of siliconwire.example.com.
e. As before, you need to store the fields in a different order than you received them. Use a FOREACH ... GENERATE statement to create a new relation containing the fields in the same order you used to write the output from the first ad network (shown again in the table below) and also use the UPPER and TRIM functions to correct the keyword field as you did earlier:

Index  Field         Description
0      campaign_id   Uniquely identifies the ad
1      date          Date of ad display
2      time          Time of ad display
3      keyword       Keyword that triggered ad
4      display_site  Domain where ad shown
5      placement     Where on page was ad displayed
6      was_clicked   Whether ad was clicked
7      cpc           Cost per click, in cents

f. The date field in this dataset is in the format MM-DD-YYYY, while the data you previously wrote is in the format MM/DD/YYYY. Edit the FOREACH ... GENERATE statement to call the REPLACE(date, '-', '/') function to correct this.
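Sketched in plain Python, the two remaining corrections for this dataset look like this (Pig's DISTINCT operator handles duplicate removal, though it does not promise to preserve record order; function names here are illustrative):

```python
def fix_date(date):
    """What REPLACE(date, '-', '/') does: MM-DD-YYYY -> MM/DD/YYYY."""
    return date.replace("-", "/")

def dedupe(records):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen = set()
    out = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out

print(fix_date("05-29-2013"))   # 05/29/2013
```

After these two fixes the second network's records match the layout and date format you already wrote for the first network.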


5. Once you are sure the script works locally, add the full dataset to HDFS:
$ hadoop fs -put $ADIR/data/ad_data2.txt /dualcore
6. Edit the script to have it LOAD the file you just added to HDFS, and then replace the DUMP statement with a STORE statement to write your output as tab-delimited records to the /dualcore/ad_data2 directory.
7. Run your script against the data you added to HDFS:
$ pig second_etl.pig
8. Check the first 15 output records written in HDFS by your script:
$ hadoop fs -cat /dualcore/ad_data2/part* | head -15
a. Do you see any duplicate records?
b. Are the fields in the correct order?
c. Are all the keywords in uppercase?
d. Is the date field in the correct (MM/DD/YYYY) format?

This is the end of the lab.


Lecture 9 Lab: Analyzing Ad Campaign Data with Pig
During the previous lab, you performed ETL processing on datasets from two online ad networks. In this lab, you will write Pig scripts that analyze this data to optimize advertising, helping Dualcore to save money and attract new customers.

IMPORTANT: Since this lab builds on the previous one, it is important that you successfully
complete the previous lab before starting this lab.

Step #1: Find Low-Cost Sites
Both ad networks charge a fee only when a user clicks on Dualcore's ad. This is ideal for
Dualcore, since their goal is to bring new customers to their site. However, some sites and
keywords are more effective than others at attracting people interested in the new tablet
being advertised by Dualcore. With this in mind, you will begin by identifying which sites
have the lowest total cost.
1. Change to the directory for this lab:
$ cd $ADIR/exercises/analyze_ads
2. Obtain a local subset of the input data by running the following command:
$ hadoop fs -cat /dualcore/ad_data1/part* \
| head -n 100 > test_ad_data.txt
You can ignore the message "cat: Unable to write to output stream," which appears
because you are writing more data with the fs -cat command than you are reading
with the head command.
Note: As mentioned in the previous lab, it is faster to test Pig scripts by using a local
subset of the input data. Although explicit steps are not provided for creating local data
subsets in upcoming labs, doing so will help you perform the labs more quickly.


3. Open the low_cost_sites.pig file in your editor, and then make the following
changes:
a. Modify the LOAD statement to read the sample data in the
test_ad_data.txt file.

b. Add a line that creates a new relation to include only records where
was_clicked has a value of 1.
c. Group this filtered relation by the display_site field.
d. Create a new relation that includes two fields: the display_site and
the total cost of all clicks on that site.
e. Sort that new relation by cost (in ascending order).
f. Display just the first three records to the screen.
4. Once you have made these changes, try running your script against the sample data:
$ pig -x local low_cost_sites.pig
5. In the LOAD statement, replace the test_ad_data.txt file with a file glob (pattern)
that will load both the /dualcore/ad_data1 and /dualcore/ad_data2
directories (and does not load any other data, such as the text files from the previous
lab).
6. Once you have made these changes, try running your script against the data in HDFS:
$ pig low_cost_sites.pig
Question: Which three sites have the lowest overall cost?
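Professor's Note ~
The pipeline in this step (filter, group, sum, sort, limit) can be sketched in plain Python on made-up records, which is a useful way to check your understanding of what the Pig script should compute:

```python
from collections import defaultdict

# Hypothetical sample records: (display_site, was_clicked, cpc in cents)
records = [
    ("a.example.com", 1, 106),
    ("b.example.com", 1, 95),
    ("a.example.com", 0, 121),
    ("c.example.com", 1, 133),
    ("b.example.com", 1, 87),
]

totals = defaultdict(int)
for site, was_clicked, cpc in records:
    if was_clicked == 1:           # FILTER ... BY was_clicked == 1
        totals[site] += cpc        # GROUP ... BY display_site; SUM of cpc

# ORDER ... BY cost ASC, then keep the first three (LIMIT 3)
lowest = sorted(totals.items(), key=lambda kv: kv[1])[:3]
```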


Step #2: Find High-Cost Keywords
The terms users type when doing searches may prompt the site to display a Dualcore

advertisement. Since online advertisers compete for the same set of keywords, some of
them cost more than others. You will now write some Pig Latin to determine which
keywords have been the most expensive for Dualcore overall.
1. Since this will be a slight variation on the code you have just written, copy that file as
high_cost_keywords.pig:
$ cp low_cost_sites.pig high_cost_keywords.pig
2. Edit the high_cost_keywords.pig file and make the following three changes:
a. Group by the keyword field instead of display_site.
b. Sort in descending order of cost.
c. Display the top five results to the screen instead of the top three as before.
3. Once you have made these changes, try running your script against the data in HDFS:
$ pig high_cost_keywords.pig
Question: Which five keywords have the highest overall cost?


Bonus Lab #1: Count Ad Clicks

One important statistic we haven't yet calculated is the total number of clicks the ads have
received. Doing so will help the marketing director plan the next ad campaign budget.
1. Change to the bonus_01 subdirectory of the current lab:
$ cd bonus_01
2. Edit the total_click_count.pig file and implement the following:
a. Group the records (filtered by was_clicked == 1) so that you can call
the aggregate function in the next step.
b. Invoke the COUNT function to calculate the total of clicked ads (Hint:
Because we shouldn't have any null records, you can use the COUNT
function instead of COUNT_STAR, and the choice of field you supply to the
function is arbitrary).
c. Display the result to the screen.
3. Once you have made these changes, try running your script against the data in HDFS:
$ pig total_click_count.pig
Question: How many clicks did we receive?



Bonus Lab #2: Estimate the Maximum Cost of the Next Ad
Campaign
When you reported the total number of clicks, the Marketing Director said that the goal is to
get about three times that amount during the next campaign. Unfortunately, because the
cost is based on the site and keyword, it isn't clear how much to budget for that campaign.
You can help by estimating the worst-case (most expensive) cost based on 50,000 clicks. You
will do this by finding the most expensive ad and then multiplying it by the number of clicks
desired in the next campaign.
1. Because this code will be similar to the code you wrote in the previous step, start by
copying that file as project_next_campaign_cost.pig:
$ cp total_click_count.pig project_next_campaign_cost.pig
2. Edit the project_next_campaign_cost.pig file and make the following
modifications:
a. Since you are trying to determine the highest possible cost, you should
not limit your calculation to the cost for ads actually clicked. Remove the
FILTER statement so that you consider the possibility that any ad might
be clicked.
b. Change the aggregate function to the one that returns the maximum value
in the cpc field (Hint: Don't forget to change the name of the relation this
field belongs to, in order to account for the removal of the FILTER
statement in the previous step).
c. Modify your FOREACH ... GENERATE statement to multiply the value
returned by the aggregate function by the total number of clicks we
expect to have in the next campaign.
d. Display the resulting value to the screen.


3. Once you have made these changes, try running your script against the data in HDFS:
$ pig project_next_campaign_cost.pig
Question: What is the maximum you expect this campaign might cost?
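Professor's Note ~
The estimate here is just the single most expensive cost-per-click multiplied by the expected number of clicks. A plain-Python sketch with hypothetical cpc values:

```python
# Hypothetical cost-per-click values, in cents
cpcs = [95, 106, 133, 87]
expected_clicks = 50000

# Worst case: assume every click happens at the most expensive cpc
max_cost_cents = max(cpcs) * expected_clicks
```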

Professor's Note ~
You can compare your solution to the one in the bonus_02/sample_solution/
subdirectory.


Bonus Lab #3: Calculating Click-Through Rate (CTR)
The calculations you did at the start of this lab provided a rough idea about the success of
the ad campaign, but didn't account for the fact that some sites display Dualcore's ads more
than others. This makes it difficult to determine how effective their ads were by simply
counting the number of clicks on one site and comparing it to the number of clicks on
another site. One metric that would allow Dualcore to better make such comparisons is the
Click-Through Rate (http://tiny.cloudera.com/ade03a), commonly abbreviated as
CTR. This value is simply the percentage of ads shown that users actually clicked, and can be
calculated by dividing the number of clicks by the total number of ads shown.
1. Change to the bonus_03 subdirectory of the current lab:
$ cd ../bonus_03
2. Edit the lowest_ctr_by_site.pig file and implement the following:
a. Within the nested FOREACH, filter the records to include only records
where the ad was clicked.
b. Create a new relation on the line that follows the FILTER statement
which counts the number of records within the current group.
c. Add another line below that to calculate the click-through rate in a new
field named ctr.
d. After the nested FOREACH, sort the records in ascending order of
click-through rate and display the first three to the screen.
3. Once you have made these changes, try running your script against the data in HDFS:
$ pig lowest_ctr_by_site.pig
Question: Which three sites have the lowest click-through rate?
If you still have time remaining, modify your script to display the three keywords with the
highest click-through rate.
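Professor's Note ~
The CTR calculation itself is simple division per group. This plain-Python sketch (on made-up impression records) mirrors what the nested FOREACH computes:

```python
from collections import defaultdict

# Hypothetical impressions: (display_site, was_clicked)
impressions = [
    ("a.example.com", 1),
    ("a.example.com", 0),
    ("a.example.com", 0),
    ("b.example.com", 1),
]

shown = defaultdict(int)
clicked = defaultdict(int)
for site, was_clicked in impressions:
    shown[site] += 1              # every record is one ad displayed
    clicked[site] += was_clicked  # count only the clicks

# CTR = clicks / ads shown, per site
ctr = {site: clicked[site] / shown[site] for site in shown}
```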


This is the end of the lab.


Lecture 10 Lab: Analyzing Disparate Data
Sets with Pig
In this lab, you will practice combining, joining, and analyzing the product sales data
previously exported from Dualcore's MySQL database so you can observe the effects
that the recent advertising campaign has had on sales.
IMPORTANT: Since this lab builds on the previous one, it is important that you successfully
complete the previous lab before starting this lab.

Step #1: Show Per-Month Sales Before and After Campaign
Before we proceed with more sophisticated analysis, you should first calculate the number
of orders Dualcore received each month for the three months before their ad campaign
began (February–April, 2013), as well as for the month during which their campaign ran
(May, 2013).
1. Change to the directory for this lab:
$ cd $ADIR/exercises/disparate_datasets
2. Open the count_orders_by_period.pig file in your editor. We have provided the
LOAD statement as well as a FILTER statement that uses a regular expression to match
the records in the date range you'll analyze. Make the following additional changes:
a. Following the FILTER statement, create a new relation with just one
field: the order's year and month (Hint: Use the SUBSTRING built-in
function to extract the first part of the order_dtm field, which contains
the month and year).
b. Count the number of orders in each of the months you extracted in the
previous step.
c. Display the count by month to the screen.

3. Once you have made these changes, try running your script against the data in HDFS:
$ pig count_orders_by_period.pig
Question: Does the data suggest that the advertising campaign we started in May led to
a substantial increase in orders?
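Professor's Note ~
Extracting the year-and-month prefix and counting per month can be sketched in plain Python. The timestamp format below is an assumption for illustration; SUBSTRING in Pig corresponds to string slicing here:

```python
from collections import Counter

# Hypothetical order_dtm values, assumed to begin with "YYYY-MM"
order_dtms = [
    "2013-02-03 08:10:00",
    "2013-02-17 12:00:00",
    "2013-05-01 09:30:00",
]

# Take the first 7 characters (the year and month) and count orders per month
by_month = Counter(dtm[:7] for dtm in order_dtms)
```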

Step #2: Count Advertised Product Sales by Month
Our analysis from the previous step suggests that sales increased dramatically the same
month Dualcore began advertising. Next, you'll compare the sales of the specific product
Dualcore advertised (product ID #1274348) during the same period to see whether the
increase in sales was actually related to their campaign.
You will be joining two data sets during this portion of the lab. Since this is the first join you
have done with Pig, now is a good time to mention a tip that can have a profound effect on
the performance of your script. Filtering out unwanted data from each relation before you
join them, as we've done in our example, means that your script will need to process less
data and will finish more quickly. We will discuss several more Pig performance tips later in
class, but this one is worth learning now.
4. Edit the count_tablet_orders_by_period.pig file and implement the
following:
a. Join the two relations on the order_id field they have in common.
b. Create a new relation from the joined data that contains a single field: the
order's year and month, similar to what you did previously in the
count_orders_by_period.pig file.
c. Group the records by month and then count the records in each group.
d. Display the results to your screen.
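Professor's Note ~
An inner join simply pairs up records from two relations that share a key. This plain-Python sketch (with made-up order IDs and dates) shows the idea behind joining on order_id:

```python
# Hypothetical relations sharing an order_id key
orders = {1001: "2013-05-01", 1002: "2013-04-12"}              # order_id -> order date
details = [(1001, 1274348), (1001, 9999999), (1002, 1274348)]  # (order_id, product_id)

# Inner join: emit one combined record per detail row whose order_id matches
joined = [(oid, orders[oid], pid) for oid, pid in details if oid in orders]
```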


5. Once you have made these changes, try running your script against the data in HDFS:
$ pig count_tablet_orders_by_period.pig
Question: Does the data show an increase in sales of the advertised product
corresponding to the month in which Dualcore's campaign was active?


Bonus Lab #1: Calculate Average Order Size
It appears that Dualcore's advertising campaign was successful in generating new orders.
Since they sell this tablet at a slight loss to attract new customers, let's see if customers who
buy this tablet also buy other things. You will write code to calculate the average number of
items for all orders that contain the advertised tablet during the campaign period.
1. Change to the bonus_01 subdirectory of the current lab:
$ cd bonus_01
2. Edit the average_order_size.pig file to calculate the average as described above.
While there are multiple ways to achieve this, it is recommended that you implement
the following:
a. Filter the orders by date (using a regular expression) to include only those
placed during the campaign period (May 1, 2013 through May 31, 2013).
b. Exclude any orders which do not contain the advertised product (product ID
#1274348).
c. Create a new relation containing the order_id and product_id fields for
these orders.
d. Count the total number of products per order.
e. Calculate the average number of products for all orders.
3. Once you have made these changes, try running your script against the data in HDFS:
$ pig average_order_size.pig

Question: Does the data show that the average order contained at least two items in
addition to the tablet Dualcore advertised?


Bonus Lab #2: Segment Customers for Loyalty Program
Dualcore is considering starting a loyalty rewards program. This will provide exclusive
benefits to their best customers, which will help to retain them. Another advantage is that it
will also allow Dualcore to capture even more data about the shopping habits of their
customers; for example, Dualcore can easily track their customers' in-store purchases when
these customers provide their rewards program number at checkout.
To be considered for the program, a customer must have made at least five purchases from
Dualcore during 2012. These customers will be segmented into groups based on the total
retail price of all purchases each made during that year:
- Platinum: Purchases totaled at least $10,000
- Gold: Purchases totaled at least $5,000 but less than $10,000
- Silver: Purchases totaled at least $2,500 but less than $5,000
Since we are considering the total sales price of orders in addition to the number of orders a
customer has placed, not every customer with at least five orders during 2012 will qualify.
In fact, only about one percent of the customers will be eligible for membership in one of
these three groups.
During this lab, you will write the code needed to filter the list of orders based on date,
group them by customer ID, count the number of orders per customer, and then filter this to
exclude any customer who did not have at least five orders. You will then join this
information with the order details and products data sets in order to calculate the total sales
of those orders for each customer, split them into the groups based on the criteria described
above, and then write the data for each group (customer ID and total sales) into a separate
directory in HDFS.
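Professor's Note ~
The segmentation thresholds can be expressed as a simple bucketing function. This plain-Python sketch assumes totals are stored in cents, as prices are elsewhere in these labs:

```python
# Bucket a customer's annual purchase total (in cents) into a loyalty tier.
def segment(total_cents):
    if total_cents >= 1000000:      # at least $10,000
        return "Platinum"
    if total_cents >= 500000:       # at least $5,000
        return "Gold"
    if total_cents >= 250000:       # at least $2,500
        return "Silver"
    return None                     # not eligible for any tier
```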


1. Change to the bonus_02 subdirectory of the current lab:
$ cd ../bonus_02
2. Edit the loyalty_program.pig file and implement the steps described above. The
code to load the three data sets you will need is already provided for you.
3. After you have written the code, run it against the data in HDFS:
$ pig loyalty_program.pig
4. If your script completed successfully, use the hadoop fs -getmerge command to
create a local text file for each group so you can check your work (note that the name of
the directory shown here may not be the same as the one you chose):
$ hadoop fs -getmerge /dualcore/loyalty/platinum platinum.txt
$ hadoop fs -getmerge /dualcore/loyalty/gold gold.txt
$ hadoop fs -getmerge /dualcore/loyalty/silver silver.txt
5. Use the UNIX head and/or tail commands to check a few records and ensure that the
total sales prices fall into the correct ranges:
$ head platinum.txt
$ tail gold.txt
$ head silver.txt
6. Finally, count the number of customers in each group:
$ wc -l platinum.txt
$ wc -l gold.txt
$ wc -l silver.txt

This is the end of the lab.


Lecture 10 Lab: Extending Pig with
Streaming and UDFs
In this lab you will use the STREAM keyword in Pig to analyze metadata from
Dualcore's customer service call recordings to identify the cause of a sudden increase
in complaints. You will then use this data in conjunction with a user-defined function
to propose a solution for resolving the problem.
IMPORTANT: Since this lab builds on the previous one, it is important that you successfully
complete the previous lab before starting this lab.

Background Information
Dualcore outsources its call center operations, and costs have recently risen due to an
increase in the volume of calls handled by these agents. Unfortunately, Dualcore does not
have access to the call center's database, but they are provided with recordings of these calls
stored in MP3 format. By using Pig's STREAM keyword to invoke a provided Python script,
you can extract the category and timestamp from the files, and then analyze that data to
learn what is causing the recent increase in calls.

Step #1: Extract Call Metadata
Note: Since the Python library we are using for extracting the tags doesn't support HDFS, we
run this script in local mode on a small sample of the call recordings. Because you will use
Pig's local mode, there will be no need to ship the script to the nodes in the cluster.


1. Change to the directory for this lab:
$ cd $ADIR/exercises/extending_pig
2. A Python script (readtags.py) is provided for extracting the metadata from the MP3
files. This script takes the path of a file on the command line and returns a record
containing five tab-delimited fields: the file path, call category, agent ID, customer ID,
and the timestamp of when the agent answered the call.
Your first step is to create a text file containing the paths of the files to analyze, with one
line for each file. You can easily create the data in the required format by capturing the
output of the UNIX find command:
$ find $ADIR/data/cscalls/ -name '*.mp3' > call_list.txt
3. Edit the extract_metadata.pig file and make the following changes:
a. Replace the hard-coded parameter in the SUBSTRING function used to filter
by month with a parameter named MONTH whose value you can assign on the
command line. This will make it easy to check the leading call categories for
different months without having to edit the script.
b. Add the code necessary to count calls by category.
c. Display the top three categories (based on number of calls) to the screen.
4. Once you have made these changes, run your script to check the top three categories in
the month before Dualcore started the online advertising campaign:
$ pig -x local -param MONTH=2013-04 extract_metadata.pig
5. Now run the script again, this time specifying the parameter for May:
$ pig -x local -param MONTH=2013-05 extract_metadata.pig
The output should confirm that not only is call volume substantially higher in May, the
SHIPPING_DELAY category has more than twice the amount of calls as the other two.


Step #2: Choose Best Location for Distribution Center
The analysis you just completed uncovered a problem. Dualcore's Vice President of
Operations launched an investigation based on your findings and has now confirmed the
cause: their online advertising campaign is indeed attracting many new customers, but
many of them live far from Dualcore's only distribution center in Palo Alto, California. All
shipments are transported by truck, so an order can take up to five days to deliver
depending on the customer's location.


To solve this problem, Dualcore will open a new distribution center to improve
shipping times.
The ZIP codes for the three proposed sites are 02118, 63139, and 78237. You will
look up the latitude and longitude of these ZIP codes, as well as the ZIP codes of
customers who have recently ordered, using a supplied data set. Once you have the
coordinates, you will use the HaversineDistInMiles UDF
distributed with DataFu to determine how far each customer is from the three
proposed distribution centers. You will then calculate the average distance for all
customers to each of these centers in order to propose the one that will benefit the
most customers.
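Professor's Note ~
The UDF computes great-circle distance between two latitude/longitude points. The standard haversine formula it is based on looks like this in plain Python (an illustration, not DataFu's actual code; the coordinates below are rough values for Palo Alto and Boston):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    earth_radius_miles = 3959.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_miles * asin(sqrt(a))

# Roughly Palo Alto, CA to Boston, MA: on the order of 2,700 miles
d = haversine_miles(37.44, -122.14, 42.36, -71.06)
```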
1. Add the tab-delimited file mapping ZIP codes to latitude/longitude points to
HDFS:
$ hadoop fs -mkdir /dualcore/distribution
$ hadoop fs -put $ADIR/data/latlon.tsv \
/dualcore/distribution
2. A script (create_cust_location_data.pig) has been provided to find
the ZIP codes for customers who placed orders during the period of the ad
campaign. It also excludes the ones who are already close to the current facility,
as well as customers in the remote states of Alaska and Hawaii (where orders
are shipped by airplane). The Pig Latin code joins these customers' ZIP codes
with the latitude/longitude data set uploaded in the previous step, then writes
those three columns (ZIP code, latitude, and longitude) as the result. Examine
the script to see how it works, and then run it to create the customer location
data in HDFS:
$ pig create_cust_location_data.pig
3. You will use the HaversineDistInMiles function to calculate the distance
from each customer to each of the three proposed warehouse locations. This
function requires us to supply the latitude and longitude of both the customer
and the warehouse. While the script you just executed created the latitude and
longitude for each customer, you must create a data set containing the ZIP code,


latitude, and longitude for these warehouses. Do this by running the following
UNIX command:
$ egrep '^02118|^63139|^78237' \
$ADIR/data/latlon.tsv > warehouses.tsv
4. Next, add this file to HDFS:
$ hadoop fs -put warehouses.tsv /dualcore/distribution

5. Edit the calc_average_distances.pig file. The UDF is already registered
and an alias for this function named DIST is defined at the top of the script, just
before the two data sets you will use are loaded. You need to complete the rest
of this script:

a. Create a record for every combination of customer and proposed
distribution center location.
b. Use the function to calculate the distance from the customer to
the warehouse.
c. Calculate the average distance for all customers to each
warehouse.
d. Display the result to the screen.
6. After you have finished implementing the Pig Latin code described above, run
the script:
$ pig calc_average_distances.pig
Question: Which of these three proposed ZIP codes has the lowest average
mileage to Dualcore's customers?

This is the end of the lab.


Lecture 11 Lab: Running Hive Queries
from the Shell, Scripts, and Hue
In this lab you will write HiveQL queries to analyze data in Hive tables that
have been populated with data you placed in HDFS during earlier labs.
IMPORTANT: Since this lab builds on the previous one, it is important that you
successfully complete the previous lab before starting this lab.

Step #1: Running a Query from the Hive Shell
Dualcore ran a contest in which customers posted videos of interesting ways to use
their new tablets. A $5,000 prize will be awarded to the customer whose video
received the highest rating.
However, the registration data was lost due to an RDBMS crash, and the only
information they have is from the videos. The winning customer introduced herself
only as "Bridget from Kansas City" in her video.
You will need to run a Hive query that identifies the winner's record in the customer
database so that Dualcore can send her the $5,000 prize.
1. Change to the directory for this lab:
$ cd $ADIR/exercises/analyzing_sales
2. Start Hive:
$ hive

Hive Prompt
To make it easier to copy queries and paste them into your terminal window, we
do not show the hive> prompt in subsequent steps. Steps prefixed with
$ should be executed on the UNIX command line; the rest should be run in Hive
unless otherwise noted.


3. Make the query results easier to read by setting the property that will make
Hive show column headers:
set hive.cli.print.header=true;
4. All you know about the winner is that her name is Bridget and she lives in
Kansas City. Use Hive's LIKE operator to do a wildcard search for names such as
"Bridget", "Bridgette" or "Bridgitte". Remember to filter on the customer's city.
Question: Which customer did your query identify as the winner of the $5,000
prize?

Step #2: Running a Query Directly from the Command
Line
You will now run a top-N query to identify the three most expensive products that
Dualcore currently offers.
5. Exit the Hive shell and return to the command line:
quit;
6. Although HiveQL statements are terminated by semicolons in the Hive shell, it is
not necessary to do this when running a single query from the command line
using the -e option. Run the following command to execute the quoted HiveQL
statement:
$ hive -e 'SELECT price, brand, name FROM PRODUCTS
ORDER BY price DESC LIMIT 3'
Question: Which three products are the most expensive?

Step #3: Running a HiveQL Script
The rules for the contest described earlier require that the winner bought the
advertised tablet from Dualcore between May 1, 2013 and May 31, 2013. Before


Dualcore can authorize the accounting department to pay the $5,000 prize, you
must ensure that Bridget is eligible. Since this query involves joining data from
several tables, it's a perfect case for running it as a Hive script.
1. Study the HiveQL code for the query to learn how it works:
$ cat verify_tablet_order.hql
2. Execute the HiveQL script using the hive command's -f option:

$ hive -f verify_tablet_order.hql
Question: Did Bridget order the advertised tablet in May?

Step #4: Running a Query Through Hue and Beeswax
Another way to run Hive queries is through your Web browser using Hue's Beeswax
application. This is especially convenient if you use more than one computer or if
you use a device (such as a tablet) that isn't capable of running Hive itself, because
it does not require any software other than a browser.


1. Start the Firefox Web browser by clicking the orange and blue icon near the top
of the VM window, just to the right of the System menu. Once Firefox starts, type
http://localhost:8888/ into the address bar, and then hit the enter key.
2. After a few seconds, you should see Hue's login screen. Enter training in
both the username and password fields, and then click the Sign In button. If
prompted to remember the password, decline by hitting the ESC key so you can
practice this step again later if you choose.
Although several Hue applications are available through the icons at the top of
the page, the Beeswax query editor is shown by default.
3. Select default from the database list on the left side of the page.
4. Write a query in the text area that will count the number of records in the
customers table, and then click the Execute button.
Question: How many customers does Dualcore serve?
5. Click the Query Editor link in the upper left corner, and then write and run a
query to find the ten states with the most customers.
Question: Which state has the most customers?


Bonus Lab #1: Calculating Revenue and Profit
Several more questions are described below, and you will need to write the HiveQL
code to answer them. You can use whichever method you like best, including the Hive
shell, a Hive script, or Hue, to run your queries.

- Which top three products has Dualcore sold more of than any other?
Hint: Remember that if you use a GROUP BY clause in Hive, you must group by
all fields listed in the SELECT clause that are not part of an aggregate function.
- What was Dualcore's total revenue in May, 2013?
- What was Dualcore's gross profit (sales price minus cost) in May, 2013?
- The results of the above queries are shown in cents. Rewrite the gross profit
query to format the value in dollars and cents (e.g., $2000000.00). To do this,
you can divide the profit by 100 and format the result using the PRINTF
function and the format string "$%.2f".
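Professor's Note ~
The cents-to-dollars conversion is plain arithmetic plus printf-style formatting; the same format string behaves identically in Python (the profit value below is hypothetical):

```python
# Divide a cents amount by 100 and render it with the "$%.2f" format string,
# mirroring the arithmetic the Hive PRINTF call performs.
profit_cents = 200000000                      # hypothetical profit, in cents
formatted = "$%.2f" % (profit_cents / 100.0)
```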

Professor's Note ~
There are several ways you could write each query, and you can find one solution
for each problem in the bonus_01/sample_solution/ directory.

This is the end of the lab.


Lecture 11 Lab: Data Management
with Hive

In this lab you will practice using several common techniques for creating and populating Hive tables. You will also create and query a table containing each of the complex field types we studied: array, map, and struct.
IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.
Additionally, many of the commands you will run use environment variables and relative file paths. It is important that you use the Hive shell, rather than Hue or another interface, as you work through the steps that follow.

Step #1: Use Sqoop's Hive Import Option to Create a Table
You used Sqoop in an earlier lab to import data from MySQL into HDFS. Sqoop can also create a Hive table with the same fields as the source table in addition to importing the records, which saves you from having to write a CREATE TABLE statement.
1. Change to the directory for this lab:
$ cd $ADIR/exercises/data_mgmt
2. Execute the following command to import the suppliers table from MySQL as a new Hive-managed table:
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--table suppliers \
--hive-import

3. Start Hive:
$ hive
4. It is always a good idea to validate data after adding it. Execute the Hive query shown below to count the number of suppliers in Texas:
SELECT COUNT(*) FROM suppliers WHERE state='TX';
The query should show that nine records match.

Step #2: Create an External Table in Hive
You imported data from the employees table in MySQL in an earlier lab, but it would be convenient to be able to query this from Hive. Since the data already exists in HDFS, this is a good opportunity to use an external table.

1. Write and execute a HiveQL statement to create an external table for the tab-delimited records in HDFS at /dualcore/employees. The data format is shown below:

Field Name   Field Type
emp_id       STRING
fname        STRING
lname        STRING
address      STRING
city         STRING
state        STRING
zipcode      STRING
job_title    STRING
email        STRING
active       STRING
salary       INT


2. Run the following Hive query to verify that you have created the table correctly:
SELECT job_title, COUNT(*) AS num
FROM employees
GROUP BY job_title
ORDER BY num DESC
LIMIT 3;
It should show that Sales Associate, Cashier, and Assistant Manager are the three most common job titles at Dualcore.

Step #3: Create and Load a Hive-Managed Table
Next, you will create and then load a Hive-managed table with product ratings data.
1. Create a table named ratings for storing tab-delimited records using this structure:

Field Name   Field Type
posted       TIMESTAMP
cust_id      INT
prod_id      INT
rating       TINYINT
message      STRING


2. Show the table description and verify that its fields have the correct order, names, and types:
DESCRIBE ratings;
3. Next, open a separate terminal window (File -> Open Terminal) so you can run the following shell command. This will populate the table directly by using the hadoop fs command to copy product ratings data from 2012 to that directory in HDFS:
$ hadoop fs -put $ADIR/data/ratings_2012.txt \
/user/hive/warehouse/ratings
Leave the window open afterwards so that you can easily switch between Hive and the command prompt.
4. Next, verify that Hive can read the data we just added. Run the following query in Hive to count the number of records in this table (the result should be 464):
SELECT COUNT(*) FROM ratings;
5. Another way to load data into a Hive table is through the LOAD DATA command. The next few commands will lead you through the process of copying a local file to HDFS and loading it into Hive. First, copy the 2013 ratings data to HDFS:
$ hadoop fs -put $ADIR/data/ratings_2013.txt /dualcore
6. Verify that the file is there:
$ hadoop fs -ls /dualcore/ratings_2013.txt
7. Use the LOAD DATA statement in Hive to load that file into the ratings table:
LOAD DATA INPATH '/dualcore/ratings_2013.txt' INTO TABLE ratings;


8. The LOAD DATA INPATH command moves the file to the table's directory. Verify that the file is no longer present in the original directory:
$ hadoop fs -ls /dualcore/ratings_2013.txt
9. Verify that the file is shown alongside the 2012 ratings data in the table's directory:
$ hadoop fs -ls /user/hive/warehouse/ratings
10. Finally, count the records in the ratings table to ensure that all 21,997 are available:
SELECT COUNT(*) FROM ratings;

Step #4: Create, Load, and Query a Table with Complex Fields
Dualcore recently started a loyalty program to reward their best customers. Dualcore has a sample of the data that contains information about customers who have signed up for the program, including their phone numbers (as a map), a list of past order IDs (as an array), and a struct that summarizes the minimum, maximum, average, and total value of past orders. You will create the table, populate it with the provided data, and then run a few queries to practice referencing these types of fields.


1. Run the following statement in Hive to create the table:
CREATE TABLE loyalty_program
(cust_id INT,
fname STRING,
lname STRING,
email STRING,
level STRING,
phone MAP<STRING, STRING>,
order_ids ARRAY<INT>,
order_value STRUCT<min:INT,
max:INT,
avg:INT,
total:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
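The three delimiter clauses determine how each line of the data file is split into fields, collection items, and map entries. As a rough illustration, the same parsing rules can be sketched in Python; the sample record and its values below are invented for the sketch, not taken from loyalty_data.txt:

```python
def parse_loyalty_record(line):
    """Split one record the way the table's delimiters direct Hive to:
    fields on '|', collection items on ',', and map keys on ':'."""
    (cust_id, fname, lname, email, level,
     phone, order_ids, order_value) = line.rstrip("\n").split("|")
    return {
        "cust_id": int(cust_id),
        "fname": fname, "lname": lname, "email": email, "level": level,
        # MAP<STRING,STRING>: items on ',', key/value on ':'
        "phone": dict(item.split(":", 1) for item in phone.split(",")),
        # ARRAY<INT>: items on ','
        "order_ids": [int(x) for x in order_ids.split(",")],
        # STRUCT fields arrive in declaration order, separated by ','
        "order_value": dict(zip(("min", "max", "avg", "total"),
                                (int(x) for x in order_value.split(",")))),
    }

sample = ("1200866|Pat|Lee|pat@example.com|SILVER|"
          "HOME:408-555-4914,WORK:408-555-0123|"
          "5278503,5278504,5278505|1099,25049,8999,401874")
record = parse_loyalty_record(sample)
```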
2. Examine the data in loyalty_data.txt to see how it corresponds to the fields in the table and then load it into Hive:
LOAD DATA LOCAL INPATH 'loyalty_data.txt' INTO TABLE loyalty_program;
3. Run a query to select the HOME phone number (Hint: Map keys are case-sensitive) for customer ID 1200866. You should see 408-555-4914 as the result.
4. Select the third element from the order_ids array for customer ID 1200866 (Hint: Elements are indexed from zero). The query should return 5278505.
5. Select the total attribute from the order_value struct for customer ID 1200866. The query should return 401874.


Bonus Lab #1: Alter and Drop a Table
1. Use ALTER TABLE to rename the level column to status.
2. Use the DESCRIBE command on the loyalty_program table to verify the change.
3. Use ALTER TABLE to rename the entire table to reward_program.
4. Although the ALTER TABLE command often requires that we make a corresponding change to the data in HDFS, renaming a table or column does not. You can verify this by running a query on the table using the new names (the result should be SILVER):
SELECT status FROM reward_program WHERE cust_id = 1200866;
5. As sometimes happens in the corporate world, priorities have shifted and the program is now canceled. Drop the reward_program table.

This is the end of the lab.


Lecture 12 Lab: Gaining Insight with Sentiment Analysis
In this optional lab, you will use Hive's text processing features to analyze customers' comments and product ratings. You will uncover problems and propose potential solutions.

IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Background Information
Customer ratings and feedback are great sources of information for both customers and retailers like Dualcore. However, customer comments are typically free-form text and must be handled differently. Fortunately, Hive provides extensive support for text processing.

Step #1: Analyze Numeric Product Ratings
Before delving into text processing, you will begin by analyzing the numeric ratings customers have assigned to various products.
1. Change to the directory for this lab:
$ cd $ADIR/exercises/sentiment


2. Start Hive and use the DESCRIBE command to remind yourself of the table's structure.
3. We want to find the product that customers like most, but must guard against being misled by products that have few ratings assigned. Run the following query to find the product with the highest average among all those with at least 50 ratings:
SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
COUNT(*) AS num
FROM ratings
GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating DESC
LIMIT 1;
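The guard in the subquery matters because a product with a single 5-star rating would otherwise beat a product with hundreds of solid ratings. The same logic, sketched in plain Python on made-up ratings data:

```python
from collections import defaultdict

def best_rated(ratings, min_ratings=50):
    """Return (prod_id, avg) for the highest-rated product among those
    with at least min_ratings ratings -- the guard the Hive subquery
    applies with its WHERE num >= 50 clause."""
    totals = defaultdict(lambda: [0, 0])  # prod_id -> [sum, count]
    for prod_id, rating in ratings:
        totals[prod_id][0] += rating
        totals[prod_id][1] += 1
    qualified = {p: s / float(c) for p, (s, c) in totals.items()
                 if c >= min_ratings}
    best = max(qualified, key=qualified.get)
    return best, round(qualified[best], 2)

# Hypothetical ratings: product B averages 5.0 but has too few ratings
sample = [("A", 4)] * 50 + [("B", 5)] * 10
```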
4. Rewrite, and then execute, the query above to find the product with the lowest average among products with at least 50 ratings. You should see that the result is product ID 1274673 with an average rating of 1.10.

Step #2: Analyze Rating Comments
We observed earlier that customers are very dissatisfied with one of the products that Dualcore sells. Although numeric ratings can help identify which product that is, they don't tell Dualcore why customers don't like the product. We could simply read through all the comments associated with that product to learn this information, but that approach doesn't scale. Next, you will use Hive's text processing support to analyze the comments.

1. The following query normalizes all comments on that product to lowercase, breaks them into individual words using the SENTENCES function, and passes those to the NGRAMS function to find the five most common bigrams (two-word combinations). Run the query in Hive:
SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 2, 5)) AS bigrams
FROM ratings
WHERE prod_id = 1274673;
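Conceptually, NGRAMS lowercases and tokenizes the text, slides an n-word window over it, and keeps the most frequent combinations. A rough Python equivalent, run on invented comments rather than the lab's ratings data:

```python
from collections import Counter

def top_ngrams(comments, n=2, k=5):
    """Count the k most common n-grams across a list of comments --
    roughly what NGRAMS(SENTENCES(LOWER(message)), n, k) computes."""
    counts = Counter()
    for comment in comments:
        words = comment.lower().split()
        counts.update(tuple(words[i:i + n])
                      for i in range(len(words) - n + 1))
    return counts.most_common(k)

sample = ["Way too expensive", "Too expensive for what it is"]
```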
2. Most of these words are too common to provide much insight, though the word "expensive" does stand out in the list. Modify the previous query to find the five most common trigrams (three-word combinations), and then run that query in Hive.
3. Among the patterns you see in the results is the phrase "ten times more". This might be related to the complaints that the product is too expensive. Now that you've identified a specific phrase, look at a few comments that contain it by running this query:
SELECT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%ten times more%'
LIMIT 3;
You should see three comments that say, "Why does the red one cost ten times more than the others?"

4. We can infer that customers are complaining about the price of this item, but the comment alone doesn't provide enough detail. One of the words ("red") in that comment was also found in the list of trigrams from the earlier query. Write and execute a query that will find all distinct comments containing the word "red" that are associated with product ID 1274673.
5. The previous step should have displayed two comments:
"What is so special about red?"
"Why does the red one cost ten times more than the others?"
The second comment implies that this product is overpriced relative to similar products. Write and run a query that will display the record for product ID 1274673 in the products table.
6. Your query should have shown that the product was a "16GB USB Flash Drive (Red)" from the Orion brand. Next, run this query to identify similar products:
SELECT *
FROM products
WHERE name LIKE '%16GB USB Flash Drive%'
AND brand = 'Orion';
The query results show that there are three almost identical products, but the product with the negative reviews (the red one) costs about ten times as much as the others, just as some of the comments said.

Based on the cost and price columns, it appears that doing text processing on the product ratings has helped Dualcore uncover a pricing error.

This is the end of the lab.


Lecture 12 Lab: Data Transformation with Hive
In this lab you will create and populate a table with log data from Dualcore's Web server. Queries on that data will reveal that many customers abandon their shopping carts before completing the checkout process. You will create several additional tables, using data from a TRANSFORM script and a supplied UDF, which you will use later to analyze how Dualcore could turn this problem into an opportunity.

IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Step #1: Create and Populate the Web Logs Table
Typical log file formats are not delimited, so you will need to use the RegexSerDe and specify a pattern Hive can use to parse lines into individual fields you can then query.
1. Change to the directory for this lab:
$ cd $ADIR/exercises/transform

2. Examine the create_web_logs.hql script to get an idea of how it uses a RegexSerDe to parse lines in the log file (an example log line is shown in the comment at the top of the file). When you have examined the script, run it to create the table in Hive:
$ hive -f create_web_logs.hql


3. Populate the table by adding the log file to the table's directory in HDFS:
$ hadoop fs -put $ADIR/data/access.log /dualcore/web_logs
4. Start the Hive shell in another terminal window.
5. Verify that the data is loaded correctly by running this query to show the top three items users searched for on Dualcore's Web site:
SELECT term, COUNT(term) AS num FROM
(SELECT LOWER(REGEXP_EXTRACT(request,
'/search\\?phrase=(\\S+)', 1)) AS term
FROM web_logs
WHERE request REGEXP '/search\\?phrase=') terms
GROUP BY term
ORDER BY num DESC
LIMIT 3;
You should see that it returns tablet (303), ram (153), and wifi (148).
Note: The REGEXP operator, which is available in some SQL dialects, is similar to LIKE, but uses regular expressions for more powerful pattern matching. The REGEXP operator is synonymous with the RLIKE operator.

Step #2: Analyze Customer Checkouts
You've just queried the logs to see what users search for on Dualcore's Web site, but now you'll run some queries to learn whether they buy. As on many Web sites, customers add products to their shopping carts and then follow a "checkout" process to complete their purchase. Since each part of this four-step process can be identified by its URL in the logs, we can use a regular expression to easily identify them:

Step  Request URL                         Description
1     /cart/checkout/step1-viewcart       View list of items added to cart
2     /cart/checkout/step2-shippingcost   Notify customer of shipping cost
3     /cart/checkout/step3-payment        Gather payment information
4     /cart/checkout/step4-receipt        Show receipt for completed order
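A quick way to convince yourself that one pattern covers all four steps is to try it in Python; the URLs below follow the table above, and the suffix after the step number is what the `.+` in the Hive pattern matches:

```python
import re

STEP_RE = re.compile(r'/cart/checkout/step(\d).+')

def checkout_step(request):
    """Return the checkout step number in a request URL, or None
    for requests outside the checkout process."""
    m = STEP_RE.search(request)
    return int(m.group(1)) if m else None
```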

1. Run the following query in Hive to show the number of requests for each step of the checkout process:
SELECT COUNT(*), request
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY request;
The results of this query highlight a major problem. About one out of every three customers abandons their cart after the second step. This might mean millions of dollars in lost revenue, so let's see if we can determine the cause.
2. The log file's cookie field stores a value that uniquely identifies each user session. Since not all sessions involve checkouts at all, create a new table containing the session ID and number of checkout steps completed for just those sessions that do:
CREATE TABLE checkout_sessions AS
SELECT cookie, ip_address, COUNT(request) AS steps_completed
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY cookie, ip_address;
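The CREATE TABLE ... AS SELECT above is essentially a filtered GROUP BY. The same aggregation can be sketched in Python over made-up (cookie, request) pairs; for brevity this sketch groups on the cookie alone, whereas the Hive query also carries the ip_address along:

```python
import re
from collections import Counter

CHECKOUT_RE = re.compile(r'/cart/checkout/step\d.+')

def steps_per_session(log_records):
    """Count checkout requests per session cookie, dropping sessions
    with no checkout activity -- the shape of checkout_sessions."""
    return dict(Counter(cookie for cookie, request in log_records
                        if CHECKOUT_RE.search(request)))

sample = [("c1", "/cart/checkout/step1-viewcart"),
          ("c1", "/cart/checkout/step2-shippingcost"),
          ("c2", "/cart/checkout/step1-viewcart"),
          ("c2", "/search?phrase=ram")]
```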


3. Run this query to show the number of people who abandoned their cart after each step:
SELECT steps_completed, COUNT(cookie) AS num
FROM checkout_sessions
GROUP BY steps_completed;
You should see that most customers who abandoned their order did so after the second step, which is when they first learn how much it will cost to ship their order.

Step #3: Use TRANSFORM for IP Geolocation
Based on what you've just seen, it seems likely that customers abandon their carts due to high shipping costs. The shipping cost is based on the customer's location and the weight of the items they've ordered. Although this information is not in the database (since the order wasn't completed), we can gather enough data from the logs to estimate them.

We don't have the customer's address, but we can use a process known as "IP geolocation" to map the computer's IP address in the log file to an approximate physical location. Since this isn't a built-in capability of Hive, you'll use a provided Python script to TRANSFORM the ip_address field from the checkout_sessions table to a ZIP code, as part of a HiveQL statement that creates a new table called cart_zipcodes.

Regarding TRANSFORM and UDF Examples in this Exercise
During this lab, you will use a Python script for IP geolocation and a UDF to calculate shipping costs. Both are implemented merely as a simulation, compatible with the fictitious data we use in class and intended to work even when Internet access is unavailable. The focus of these labs is on how to use external scripts and UDFs, rather than how the code for the examples works internally.


1. Examine the create_cart_zipcodes.hql script and observe the following:
a. It creates a new table called cart_zipcodes based on a SELECT statement.
b. That SELECT statement transforms the ip_address, cookie, and steps_completed fields from the checkout_sessions table using a Python script.
c. The new table contains the ZIP code instead of an IP address, plus the other two fields from the original table.
2. Examine the ipgeolocator.py script and observe the following:
a. Records are read from Hive on standard input.
b. The script splits them into individual fields using a tab delimiter.
c. The ip_addr field is converted to zipcode, but the cookie and steps_completed fields are passed through unmodified.
d. The three fields in each output record are delimited with tabs and printed to standard output.
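That description amounts to a small streaming filter. A minimal sketch of such a TRANSFORM script is shown below; the in-memory lookup table is a hypothetical stand-in, and the real ipgeolocator.py's geolocation logic is not reproduced here:

```python
import sys

# Hypothetical lookup; the actual script's mapping logic differs.
IP_TO_ZIP = {"10.0.0.1": "94305"}

def transform_line(line, lookup):
    """Turn one tab-delimited record (ip_addr, cookie, steps_completed)
    into (zipcode, cookie, steps_completed)."""
    ip_addr, cookie, steps_completed = line.rstrip("\n").split("\t")
    return "\t".join([lookup.get(ip_addr, "00000"), cookie, steps_completed])

def main(stdin, stdout):
    # Hive streams input records on stdin and reads results from stdout.
    for line in stdin:
        stdout.write(transform_line(line, IP_TO_ZIP) + "\n")

# In the real script this would run as: main(sys.stdin, sys.stdout)
```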
3. Run the script to create the cart_zipcodes table:
$ hive -f create_cart_zipcodes.hql

Step #4: Extract List of Products Added to Each Cart
As described earlier, estimating the shipping cost also requires a list of items in the customer's cart. You can identify products added to the cart since the request URL looks like this (only the product ID changes from one record to the next):
/cart/additem?productid=1234567


1. Write a HiveQL statement to create a table called cart_items with two fields, cookie and prod_id, based on data selected from the web_logs table. Keep the following in mind when writing your statement:
a. The prod_id field should contain only the seven-digit product ID (Hint: Use the REGEXP_EXTRACT function).
b. Add a WHERE clause with REGEXP using the same regular expression as above so that you only include records where customers are adding items to the cart.

Professor's Note ~
If you need a hint on how to write the statement, look at the file:
sample_solution/create_cart_items.hql
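Since the add-to-cart URL format is fixed, the extraction the hint points at is a one-group regular expression. Here it is in Python form for quick experimentation; the request line is an invented example built around the placeholder ID shown above:

```python
import re

PROD_RE = re.compile(r'/cart/additem\?productid=(\d{7})')

def prod_id(request):
    """Return the seven-digit product ID from an add-to-cart request,
    or None for any other request -- the Python analogue of the
    REGEXP_EXTRACT call a HiveQL solution would use."""
    m = PROD_RE.search(request)
    return m.group(1) if m else None
```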

2. Execute the HiveQL statement you just wrote.
3. Verify the contents of the new table by running this query:
SELECT COUNT(DISTINCT cookie) FROM cart_items WHERE prod_id = 1273905;

Professor's Note ~
If this doesn't return 47, then compare your statement to the file sample_solution/create_cart_items.hql. Make the necessary corrections, and then re-run your statement (after dropping the cart_items table).

Step #5: Create Tables to Join Web Logs with Product Data
You now have tables representing the ZIP codes and products associated with checkout sessions, but you'll need to join these with the products table to get the weight of these items before you can estimate shipping costs. In order to do some more analysis later, we'll also include total selling price and total wholesale cost in addition to the total shipping weight for all items in the cart.
1. Run the following HiveQL to create a table called cart_orders with the information:
CREATE TABLE cart_orders AS
SELECT z.cookie, steps_completed, zipcode,
SUM(shipping_wt) AS total_weight,
SUM(price) AS total_price,
SUM(cost) AS total_cost
FROM cart_zipcodes z
JOIN cart_items i
ON (z.cookie = i.cookie)
JOIN products p
ON (i.prod_id = p.prod_id)
GROUP BY z.cookie, zipcode, steps_completed;

Step #6: Create a Table Using a UDF to Estimate Shipping Cost
We finally have all the information we need to estimate the shipping cost for each abandoned order. You will use a Hive UDF to calculate the shipping cost given a ZIP code and the total weight of all items in the order.


1. Before you can use a UDF, you must add it to Hive's classpath. Run the following command in Hive to do that:
ADD JAR geolocation_udf.jar;
2. Next, you must register the function with Hive and provide the name of the UDF class as well as the alias you want to use for the function. Run the Hive command below to associate our UDF with the alias CALC_SHIPPING_COST:
CREATE TEMPORARY FUNCTION CALC_SHIPPING_COST AS 'com.cloudera.hive.udf.UDFCalcShippingCost';

3. Now create a new table called cart_shipping that will contain the session ID, number of steps completed, total retail price, total wholesale cost, and the estimated shipping cost for each order based on data from the cart_orders table:
CREATE TABLE cart_shipping AS
SELECT cookie, steps_completed, total_price, total_cost,
CALC_SHIPPING_COST(zipcode, total_weight) AS shipping_cost
FROM cart_orders;
4. Finally, verify your table by running the following query to check a record:
SELECT * FROM cart_shipping WHERE cookie = '100002920697';
This should show that session as having two completed steps, a total retail price of $263.77, a total wholesale cost of $236.98, and a shipping cost of $9.09.
Note: The total_price, total_cost, and shipping_cost columns in the cart_shipping table contain the number of cents as integers. Be sure to divide results containing monetary amounts by 100 to get dollars and cents.

This is the end of the lab.


Lecture 13 Lab: Interactive Analysis with Impala
In this lab you will examine abandoned cart data using the tables created in the previous lab. You will use Impala to quickly determine how much lost revenue these abandoned carts represent and use several "what if" scenarios to determine whether Dualcore should offer free shipping to encourage customers to complete their purchases.
IMPORTANT: Since this lab builds on the previous one, it is important that you successfully complete the previous lab before starting this lab.

Step #1: Start the Impala Shell and Refresh the Cache
1. Issue the following commands to start Impala, then change to the directory for this lab:
$ sudo service impala-server start
$ sudo service impala-state-store start
$ cd $ADIR/exercises/interactive
2. First, start the Impala shell:
$ impala-shell
3. Since you created tables and modified data in Hive, Impala's cache of the metastore is outdated. You must refresh it before continuing by entering the following command in the Impala shell:
REFRESH


Step #2: Calculate Lost Revenue
1. First, you'll calculate how much revenue the abandoned carts represent. Remember, there are four steps in the checkout process, so only records in the cart_shipping table with a steps_completed value of four represent a completed purchase:
SELECT SUM(total_price) AS lost_revenue
FROM cart_shipping
WHERE steps_completed < 4;

[Figure: "Lost Revenue From Abandoned Shipping Carts": sample rows from the cart_shipping table (cookie, steps_completed, total_price, total_cost, shipping_cost), illustrating the sum of total_price where steps_completed < 4.]

You should see that abandoned carts mean that Dualcore is potentially losing out on more than $2 million in revenue! Clearly it's worth the effort to do further analysis.
Note: The total_price, total_cost, and shipping_cost columns in the cart_shipping table contain the number of cents as integers. Be sure to divide results containing monetary amounts by 100 to get dollars and cents.


2. The number returned by the previous query is revenue, but what counts is profit. We calculate gross profit by subtracting the cost from the price. Write and execute a query similar to the one above, but which reports the total lost profit from abandoned carts.
Professor's Note ~
If you need a hint on how to write this query, you can check the file:
sample_solution/abandoned_checkout_profit.sql
After running your query, you should see that Dualcore is potentially losing $111,058.90 in profit due to customers not completing the checkout process.
3. How does this compare to the amount of profit Dualcore receives from customers who do complete the checkout process? Modify your previous query to consider only those records where steps_completed = 4, and then execute it in the Impala shell.
Professor's Note ~
Check sample_solution/completed_checkout_profit.sql for a hint.
The result should show that Dualcore earns a total of $177,932.93 on completed orders, so abandoned carts represent a substantial proportion of additional profits.
4. The previous two queries show the total profit for abandoned and completed orders, but these aren't directly comparable because there were different numbers of each. It might be the case that one is much more profitable than the other on a per-order basis. Write and execute a query that will calculate the average profit based on the number of steps completed during the checkout process.
Professor's Note ~
If you need help writing this query, check the file:
sample_solution/checkout_profit_by_step.sql
You should observe that carts abandoned after step two represent an even higher average profit per order than completed orders.


Step #3: Calculate Cost/Profit for a Free Shipping Offer
You have observed that most carts, and the most profitable carts, are abandoned at the point where the shipping cost is displayed to the customer. You will now run some queries to determine whether offering free shipping, on at least some orders, would actually bring in more revenue, assuming this offer prompted more customers to finish the checkout process.
1. Run the following query to compare the average shipping cost for orders abandoned after the second step versus completed orders:
SELECT steps_completed, AVG(shipping_cost) AS ship_cost
FROM cart_shipping
WHERE steps_completed = 2 OR steps_completed = 4
GROUP BY steps_completed;

[Figure: "Average Shipping Cost for Carts Abandoned After Steps 2 and 4": sample rows from the cart_shipping table, illustrating the average of shipping_cost where steps_completed = 2 or 4.]

You will see that the shipping cost of abandoned orders was almost 10% higher than for completed purchases. Offering free shipping, at least for some orders, might actually bring in more money than passing on the cost and risking abandoned orders.


2. Run the following query to determine the average profit per order over the entire month for the data you are analyzing in the log file. This will help you to determine whether Dualcore could absorb the cost of offering free shipping:
SELECT AVG(price - cost) AS profit
FROM products p
JOIN order_details d
ON (d.prod_id = p.prod_id)
JOIN orders o
ON (d.order_id = o.order_id)
WHERE YEAR(order_date) = 2013
AND MONTH(order_date) = 05;

[Figure: "Average Profit per Order, May 2013": sample rows from the products, order_details, and orders tables, joined to average the profit on orders made in May, 2013.]

You should see that the average profit for all orders during May was $7.80. An earlier query you ran showed that the average shipping cost was $8.83 for completed orders and $9.66 for abandoned orders, so clearly Dualcore would lose money by offering free shipping on all orders. However, it might still be worthwhile to offer free shipping on orders over a certain amount.

3. Run the following query, which is a slightly revised version of the previous one,
to determine whether offering free shipping only on orders of $10 or more
would be a good idea:

    SELECT AVG(price - cost) AS profit
        FROM products p
        JOIN order_details d
          ON (d.prod_id = p.prod_id)
        JOIN orders o
          ON (d.order_id = o.order_id)
        WHERE YEAR(order_date) = 2013
          AND MONTH(order_date) = 05
          AND PRICE >= 1000

You should see that the average profit on orders of $10 or more was
$9.09, so absorbing the cost of shipping would leave very little profit.

4. Repeat the previous query, modifying it slightly each time to find the average
profit on orders of at least $50, $100, and $500.

You should see that there is a huge spike in the amount of profit for
orders of $500 or more (Dualcore makes $111.05 on average for these
orders).
5. How much does shipping cost on average for orders totaling $500 or more?
Write and run a query to find out.

Professor's Note ~

The file sample_solution/avg_shipping_cost_50000.sql contains
the solution.

You should see that the average shipping cost is $12.28, which happens
to be about 11% of the profit brought in on those orders.
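As a quick sanity check of that percentage, the two averages reported in steps 4 and 5 can be compared directly (these are the lab's reported figures, restated here in dollars):

```python
# Figures reported in the lab text, in dollars
avg_profit_500 = 111.05    # average profit per order of $500 or more (step 4)
avg_shipping_500 = 12.28   # average shipping cost for those orders (step 5)

# Shipping expressed as a share of the profit on these large orders
ratio = avg_shipping_500 / avg_profit_500
print(f"{ratio:.1%}")  # 11.1%
```

So shipping would consume roughly one ninth of the profit on large orders, which matches the "about 11%" claim.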
6. Since Dualcore won't know in advance who will abandon their cart, they would
have to absorb the $12.28 average cost on all orders of at least $500. Would the
extra money they might bring in from abandoned carts offset the added cost of
free shipping for customers who would have completed their purchases
anyway? Run the following query to see the total profit on completed
purchases:

    SELECT SUM(total_price - total_cost) AS total_profit
        FROM cart_shipping
        WHERE total_price >= 50000
          AND steps_completed = 4
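The filter-then-sum pattern in this query can be tried locally with Python's sqlite3 module. The cart_shipping rows below are invented purely for illustration (the real table lives in Hive); amounts are in cents, as in the lab:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE cart_shipping
               (cust_id INTEGER, steps_completed INTEGER,
                total_price INTEGER, total_cost INTEGER,
                shipping_cost INTEGER)""")

# Invented rows: two completed large orders, one cart abandoned at step 2,
# and one completed order that the total_price filter should exclude
cur.executemany("INSERT INTO cart_shipping VALUES (?,?,?,?,?)", [
    (1000001, 4, 52000, 40000, 900),   # completed, >= $500
    (1000002, 4, 60000, 45000, 950),   # completed, >= $500
    (1000003, 2, 55000, 43000, 920),   # abandoned, excluded
    (1000004, 4,  4000,  3000, 400),   # under $500, excluded
])

total_profit, = cur.execute("""
    SELECT SUM(total_price - total_cost) AS total_profit
    FROM cart_shipping
    WHERE total_price >= 50000
      AND steps_completed = 4
""").fetchone()
print(total_profit)  # 27000 cents ($270), from the two qualifying rows
```

Only the first two rows survive both WHERE conditions, so the sum is their combined profit.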

After running this query, you should see that the total profit for
completed orders is $107,582.97.

7. Now, run the following query to find the potential profit, after subtracting
shipping costs, if all customers completed the checkout process:

    SELECT gross_profit - total_shipping_cost AS potential_profit
        FROM (SELECT
                SUM(total_price - total_cost) AS gross_profit,
                SUM(shipping_cost) AS total_shipping_cost
              FROM cart_shipping
              WHERE total_price >= 50000) large_orders

Since the result of $120,355.26 is greater than the $107,582.97 Dualcore
currently earns from completed orders, it appears that they could earn
nearly $13,000 more by offering free shipping for all orders of at least $500.
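The "nearly $13,000" figure follows directly from the two totals reported above; a one-line check makes the arithmetic explicit:

```python
# Totals reported by the queries above, in dollars
potential_profit = 120355.26   # if every cart were completed (step 7)
current_profit = 107582.97     # completed orders today (step 6)

extra = potential_profit - current_profit
print(round(extra, 2))  # 12772.29, i.e. nearly $13,000
```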
Congratulations! Your hard work analyzing a variety of data with Hadoop's
tools has helped make Dualcore more profitable than ever.

This is the end of the lab.
