

Here are top Hadoop Developer Interview Questions and Answers based on different components of the Hadoop Ecosystem –

1) Hadoop Basic Interview Questions
2) Hadoop HDFS Interview Questions
3) MapReduce Interview Questions
4) Hadoop HBase Interview Questions
5) Hadoop Sqoop Interview Questions
6) Hadoop Flume Interview Questions
7) Hadoop Zookeeper Interview Questions


8) Pig Interview Questions


9) Hive Interview Questions
10) Hadoop YARN Interview Questions

Big Data Hadoop Interview Questions and Answers
These are Hadoop Basic Interview Questions and Answers for freshers and experienced.
1. What is Big Data?


Big data is defined as the voluminous amount of structured, unstructured or semi-structured data that has huge potential for mining but is so large that it cannot be processed using traditional database systems. Big data is characterized by its high velocity, volume and variety, which require cost-effective and innovative methods for information processing to draw meaningful business insights. More than the volume of the data, it is the nature of the data that defines whether it is considered Big Data or not.

2. What do the four V's of Big Data denote?

IBM has a nice, simple explanation for the four critical features of big data:
a) Volume – Scale of data
b) Velocity – Analysis of streaming data
c) Variety – Different forms of data
d) Veracity – Uncertainty of data


3. How does big data analysis help businesses increase their revenue? Give an example.


Big data analysis is helping businesses differentiate themselves – for example,


Walmart, the world's largest retailer in 2014 in terms of revenue, is using big data analytics to increase its sales through better predictive analytics, providing customized recommendations and launching new products based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in online sales, for $1 billion in incremental revenue. There are many more companies like Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase, Bank of America, etc. using big data analytics to boost their revenue.


4. Name some companies that use Hadoop.


Yahoo (one of the biggest users & more than 80% code contributor to Hadoop)
Facebook
Netflix
Amazon
Adobe
eBay
Hulu
Spotify
Rubikloud
Twitter

To view a detailed list of some of the top companies using Hadoop, CLICK HERE.

5. Differentiate between Structured and Unstructured data.


Data which can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, can be referred to as Structured Data. Data which can be stored only partially in traditional database systems, for example data in XML records, can be referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured or structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data.
6. On what concept does the Hadoop framework work?

The Hadoop Framework works on the following two core components –
1) HDFS – Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks and it operates on the Master-Slave Architecture.
2) Hadoop MapReduce – This is a Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. Hadoop jobs perform two separate tasks. The map job breaks down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job is executed.
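To make the map and reduce stages concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API (the class names and tokenization rule are illustrative, not from the article):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map job: break each input line into (word, 1) key-value pairs
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().trim().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // emit (word, 1) tuples
        }
    }
}

// Reduce job: combine the tuples for each key into a smaller set
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get(); // runs after all maps finish
        context.write(key, new IntWritable(sum));    // one total per word
    }
}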

7) What are the main components of a Hadoop Application?
Hadoop applications have a wide range of technologies that provide great advantage in solving complex business problems.
Core components of a Hadoop application are –
1) Hadoop Common
2) HDFS
3) Hadoop MapReduce
4) YARN
Data Access Components are – Pig and Hive
Data Storage Component is – HBase
Data Integration Components are – Apache Flume, Sqoop, Chukwa
Data Management and Monitoring Components are – Ambari, Oozie and Zookeeper
Data Serialization Components are – Thrift and Avro
Data Intelligence Components are – Apache Mahout and Drill
8. What is Hadoop streaming?
The Hadoop distribution has a generic application programming interface for writing Map and Reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the Mapper or Reducer.
9. What is the best hardware configuration to run Hadoop?
The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4GB or 8GB RAM that use ECC memory. Hadoop benefits highly from using ECC memory, though it is not low-end. ECC memory is recommended for running Hadoop because most Hadoop users have experienced various checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.

10. What are the most commonly defined input formats in Hadoop?
The most common input formats defined in Hadoop are (see the configuration sketch below):
TextInputFormat – This is the default input format defined in Hadoop.
KeyValueInputFormat – This input format is used for plain text files wherein each line is broken down into a key and a value.
SequenceFileInputFormat – This input format is used for reading files in sequence.
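For illustration, the input format is selected when configuring the job object; a minimal sketch using the new-API class KeyValueTextInputFormat (the job name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KvJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "kv-example");
        // TextInputFormat is the default; override it when each input line
        // is a tab-separated key/value pair:
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}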

We have further categorized Big Data Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 1, 2, 4, 5, 6, 7, 8, 9
Hadoop Interview Questions and Answers for Experienced – Q.Nos 3, 8, 9, 10
For a detailed PDF report on Hadoop Salaries – CLICK HERE

Hadoop HDFS Interview Questions and Answers
1. What is a block and block scanner in HDFS?
Block – The minimum amount of data that can be read or written is generally referred to as a block in HDFS. The default size of a block in HDFS is 64MB in Hadoop 1.x (128MB from Hadoop 2.x onwards).
Block Scanner – Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the DataNode.
2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. NameNode uses two files for the namespace –
fsimage file – It keeps track of the latest checkpoint of the namespace.
edits file – It is a log of changes that have been made to the namespace since the checkpoint.
Checkpoint Node:
Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as that of the NameNode's directory. The Checkpoint Node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then updated back to the active NameNode.
Backup Node:
Backup Node also provides checkpointing functionality like that of the Checkpoint Node, but it additionally maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
3. What is commodity hardware?
Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware includes RAM because there are specific services that need to be executed on RAM. Hadoop can be run on any commodity hardware and does not require any supercomputers or high-end hardware configuration to execute jobs.
4. What is the port number for NameNode, Task Tracker and Job Tracker?
NameNode – 50070
JobTracker – 50030
TaskTracker – 50060

5. Explain about the process of inter-cluster data copying.
HDFS provides a distributed data copying facility through DistCP from source to destination. When this data copying happens between two different Hadoop clusters, it is referred to as inter-cluster data copying. DistCP requires both source and destination to have a compatible or same version of Hadoop.
6. How can you overwrite the replication factors in HDFS?
The replication factor in HDFS can be modified or overwritten in 2 ways –
1) Using the Hadoop FS Shell, the replication factor can be changed on a per-file basis using the below command –
$ hadoop fs -setrep -w 2 /my/test_file
(test_file is the filename whose replication factor will be set to 2)
2) Using the Hadoop FS Shell, the replication factor of all files under a given directory can be modified using the below command –
$ hadoop fs -setrep -w 5 /my/test_dir
(test_dir is the name of the directory; all the files in this directory will have their replication factor set to 5)
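The same change can also be made programmatically; a minimal sketch using the HDFS Java API, reusing the illustrative path from above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // equivalent of: hadoop fs -setrep 2 /my/test_file
        boolean changed = fs.setReplication(new Path("/my/test_file"), (short) 2);
        System.out.println("replication changed: " + changed);
    }
}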
7. Explain the difference between NAS and HDFS.
NAS runs on a single machine and thus there is no probability of data redundancy, whereas HDFS runs on a cluster of different machines and thus there is data redundancy because of the replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines.
In NAS, data is stored independently of the computation, and hence Hadoop MapReduce cannot be used for processing; HDFS, on the other hand, works with Hadoop MapReduce because the computations in HDFS are moved to the data.

8. Explain what happens if, during the PUT operation, an HDFS block is assigned a replication factor of 1 instead of the default value of 3.
Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times the blocks are to be replicated, to ensure high data availability. For every block that is stored in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value of 3, there will be a single copy of the data. Under these circumstances, if the DataNode crashes, the only copy of the data will be lost.
9. What is the process to change the files at arbitrary locations in HDFS?
HDFS does not support modifications at arbitrary offsets in the file or multiple writers; files are written by a single writer in append-only format, i.e. writes to a file in HDFS are always made at the end of the file.

10. Explain about the indexing process in HDFS.
The indexing process in HDFS depends on the block size. HDFS stores the last part of the data, which further points to the address where the next part of the data chunk is stored.

11. What is rack awareness and on what basis is data stored in a rack?
All the data nodes put together form a storage area, i.e. the physical location of the data nodes is referred to as a Rack in HDFS. The rack information, i.e. the rack id of each data node, is acquired by the NameNode. The process of selecting closer data nodes depending on the rack information is known as Rack Awareness.
The contents of the file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting with the NameNode, the client allocates 3 data nodes for each data block. For each data block, there exist 2 copies in one rack and the third copy is present in another rack. This is generally referred to as the Replica Placement Policy.
We have further categorized Hadoop HDFS Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 2, 3, 7, 9, 10, 11
Hadoop Interview Questions and Answers for Experienced – Q.Nos 1, 2, 4, 5, 6, 7, 8
Click here to know more about our IBM Certified Hadoop Developer course

Hadoop MapReduce Interview Questions and Answers
1. Explain the usage of Context Object.
The Context Object is used to help the mapper interact with other Hadoop systems. It can be used for updating counters, to report progress and to provide any application-level status updates. The Context Object also holds the configuration details for the job, as well as interfaces that help it generate the output.
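As an illustration, a hedged sketch of a mapper using the Context object for counters, status and output (the counter group and name are invented for the example):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AuditMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.getLength() == 0) {
            // update a counter through the context
            context.getCounter("audit", "empty-lines").increment(1);
            return;
        }
        context.setStatus("processing offset " + key.get()); // status update
        context.write(value, new LongWritable(1));           // emit output
    }
}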

2. What are the core methods of a Reducer?
The 3 core methods of a reducer are (see the skeleton sketch below) –
1) setup() – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.
Function Definition – public void setup(Context context)
2) reduce() – It is the heart of the reducer, called once per key with the associated list of values.
Function Definition – public void reduce(Key key, Iterable<Value> values, Context context)
3) cleanup() – This method is called only once, at the end of the reduce task, for clearing all the temporary files.
Function Definition – public void cleanup(Context context)
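Put together, a reducer skeleton showing all three methods might look like the following sketch (the generic types are arbitrary):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) {
        // runs once per task, before any reduce() call – read configuration here
    }
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get(); // called once per key
        context.write(key, new IntWritable(sum));
    }
    @Override
    protected void cleanup(Context context) {
        // runs once per task, after the last reduce() call – release resources
    }
}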
3. Explain about the partitioning, shuffle and sort phases.
Shuffle Phase – Once the first map tasks are completed, the nodes continue to perform several other map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducer is referred to as Shuffling.
Sort Phase – Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.
Partitioning Phase – The process that determines which intermediate keys and values will be received by each reducer instance is referred to as partitioning. The destination partition is the same for any key, irrespective of the mapper instance that generated it.
4. How do you write a custom partitioner for a Hadoop MapReduce job?
Steps to write a Custom Partitioner for a Hadoop MapReduce Job (as shown in the sketch below) –
A new class must be created that extends the predefined Partitioner class.
The getPartition method of the Partitioner class must be overridden.
The custom partitioner can be added to the job as a config file in the wrapper which runs Hadoop MapReduce, or it can be added to the job by using the set method of the partitioner class.
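A hedged sketch of such a custom partitioner (the routing rule is invented purely for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // route keys starting with 'a'-'m' and 'n'-'z' to different reducers
        String s = key.toString();
        if (s.isEmpty()) return 0;
        char first = Character.toLowerCase(s.charAt(0));
        return (first <= 'm' ? 0 : 1) % numPartitions;
    }
}
// wired into the job with: job.setPartitionerClass(FirstLetterPartitioner.class);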
5. What is the relationship between Job and Task in Hadoop?
A single job can be broken down into one or many tasks in Hadoop.
6. Is it important for Hadoop MapReduce jobs to be written in Java?
It is not necessary to write Hadoop MapReduce jobs in Java. Users can write MapReduce jobs in any desired programming language like Ruby, Perl, Python, R, Awk, etc. through the Hadoop Streaming API.
7. What is the process of changing the split size if there is limited storage space on commodity hardware?
If there is limited storage space on commodity hardware, the split size can be changed by implementing a Custom Splitter. The call to the Custom Splitter can be made from the main method.
8. What are the primary phases of a Reducer?
The 3 primary phases of a reducer are –
1) Shuffle
2) Sort
3) Reduce
9. What is a Task Instance?
The actual Hadoop MapReduce jobs that run on each slave node are referred to as Task Instances. Every task instance has its own JVM process; for every new task instance, a JVM process is spawned by default.
10. Can reducers communicate with each other?
Reducers always run in isolation and can never communicate with each other, as per the Hadoop MapReduce programming paradigm.

We have further categorized Hadoop MapReduce Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 2, 5, 6
Hadoop Interview Questions and Answers for Experienced – Q.Nos 1, 3, 4, 7, 8, 9, 10

Hadoop HBase Interview Questions and Answers
1. When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has –
1) A variable schema
2) Data stored in the form of collections
3) A demand for key-based access to data while retrieving.
Key components of HBase are –
Region – This component contains the memory data store (MemStore) and the HFile.
Region Server – This monitors the Region.
HBase Master – It is responsible for monitoring the region servers.
Zookeeper – It takes care of the coordination between the HBase Master component and the client.
Catalog Tables – The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
2. What are the different operational commands in HBase at record level and table level?
Record-level operational commands in HBase are – put, get, increment, scan and delete.
Table-level operational commands in HBase are – describe, list, drop, disable and scan.
3. What is a Row Key?
Every row in an HBase table has a unique identifier known as the Row Key. It is used for grouping cells logically, and it ensures that all cells that have the same Row Key are co-located on the same server. The Row Key is internally regarded as a byte array.
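For illustration, a sketch of how the row key is supplied through the HBase Java client API (the table, column family and key names are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // the byte array passed to Put/Get is the row key
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            Result row = table.get(new Get(Bytes.toBytes("user#1001")));
            System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
        }
    }
}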

4. Explain the difference between the RDBMS data model and the HBase data model.
RDBMS is a schema-based database, whereas HBase is a schema-less data model.
RDBMS does not have support for in-built partitioning, whereas in HBase there is automated partitioning.
RDBMS stores normalized data, whereas HBase stores denormalized data.
5. Explain about the different catalog tables in HBase.
The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.

6. What are column families? What happens if you alter the block size of a column family on an already populated database?
The logical deviation of data is represented through a key known as the column family. Column families consist of the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data will remain within the old block size, whereas the new data that comes in will take the new block size. When compaction takes place, the old data will take the new block size so that the existing data is read correctly.
7. Explain the difference between HBase and Hive.
HBase and Hive are completely different Hadoop-based technologies – Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive helps SQL-savvy people run MapReduce jobs, whereas HBase supports 4 primary operations – put, get, scan and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.
8. Explain the process of row deletion in HBase.
On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells; rather, the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
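A sketch of issuing such a delete through the HBase Java client (same hypothetical table as in the Row Key example); the call only writes a tombstone marker, and the cells physically disappear at a later compaction:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // marks the whole row with a tombstone rather than removing it in place
            table.delete(new Delete(Bytes.toBytes("user#1001")));
        }
    }
}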
9. What are the different types of tombstone markers in HBase for deletion?
There are 3 different types of tombstone markers in HBase for deletion –
1) Family Delete Marker – This marker marks all the columns for a column family.
2) Version Delete Marker – This marker marks a single version of a column.
3) Column Delete Marker – This marker marks all the versions of a column.
10. Explain about HLog and WAL in HBase.
All edits in the HStore are stored in the HLog. Every region server has one HLog. The HLog contains entries for edits of all regions performed by a particular Region Server. WAL stands for Write Ahead Log (WAL), in which all the HLog edits are written immediately. WAL edits remain in memory till the flush period in case of deferred log flush.
We have further categorized Hadoop HBase Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 1, 2, 4, 5, 7
Hadoop Interview Questions and Answers for Experienced – Q.Nos 2, 3, 6, 8, 9, 10

Hadoop Sqoop Interview Questions and Answers
1. Explain about some important Sqoop commands other than import and export.
Create Job (--create)
Here we are creating a job with the name myjob, which can import the table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database to an HDFS file.
$ sqoop job --create myjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Verify Job (--list)
The --list argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show)
The --show argument is used to inspect or verify particular jobs and their details. The following command and sample output is used to verify a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec)
The --exec option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ sqoop job --exec myjob

2. How can Sqoop be used in a Java program?
The Sqoop jar should be included in the classpath of the Java code. After this, the method Sqoop.runTool() must be invoked. The necessary parameters should be passed to Sqoop programmatically, just like on the command line.
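A minimal sketch of that invocation, assuming the Sqoop and Hadoop jars are on the classpath (the connection string, table and target directory are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopFromJava {
    public static void main(String[] args) {
        // same arguments as the command line, passed as a String array
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://localhost/db",
            "--username", "root",
            "--table", "employee",
            "--target-dir", "/user/hadoop/employee"
        };
        int exitCode = Sqoop.runTool(sqoopArgs, new Configuration());
        System.exit(exitCode);
    }
}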
3. What is the process to perform an incremental data load in Sqoop?
The process to perform an incremental data load in Sqoop is to synchronize the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop. The incremental load can be performed by using the Sqoop import command or by loading the data into Hive without overwriting it. The different attributes that need to be specified during an incremental load in Sqoop are –
1) Mode (--incremental) – The mode defines how Sqoop will determine what the new rows are. The mode can have the value Append or LastModified.
2) Col (--check-column) – This attribute specifies the column that should be examined to find out the rows to be imported.
3) Value (--last-value) – This denotes the maximum value of the check column from the previous import operation.
4. Is it possible to do an incremental import using Sqoop?
Yes, Sqoop supports two types of incremental imports –
1) Append
2) LastModified
To insert only new rows, Append should be used in the import command; for inserting new rows and also updating existing ones, LastModified should be used in the import command.
5. What is the standard location or path for Hadoop Sqoop scripts?
/usr/bin/Hadoop Sqoop
6. How can you check all the tables present in a single database using Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as follows –
$ sqoop list-tables --connect jdbc:mysql://localhost/user

7. How are large objects handled in Sqoop?
Sqoop provides the capability to store large-sized data in a single field based on the type of data. Sqoop supports the ability to store –
1) CLOBs – Character Large Objects
2) BLOBs – Binary Large Objects
Large objects in Sqoop are handled by importing the large objects into a file referred to as a LobFile, i.e. Large Object File. The LobFile has the ability to store records of huge size; thus each record in the LobFile is a large object.
8. Can free-form SQL queries be used with the Sqoop import command? If yes, then how can they be used?
Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.
9. Differentiate between Sqoop and distCP.
The DistCP utility can be used to transfer data between clusters, whereas Sqoop can be used to transfer data only between Hadoop and an RDBMS.
10. What are the limitations of importing RDBMS tables into Hcatalog directly?
There is an option to import RDBMS tables into Hcatalog directly by making use of the --hcatalog-database option with --hcatalog-table, but the limitation is that several arguments like --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.

We have further categorized Hadoop Sqoop Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 4, 5, 6, 9
Hadoop Interview Questions and Answers for Experienced – Q.Nos 1, 2, 3, 6, 7, 8, 10

Hadoop Flume Interview Questions and Answers
1) Explain about the core components of Flume.
The core components of Flume are –
Event – The single log entry or unit of data that is transported.
Source – This is the component through which data enters Flume workflows.
Sink – It is responsible for transporting data to the desired destination.
Channel – It is the duct between the Sink and the Source.
Agent – Any JVM that runs Flume.
Client – The component that transmits the event to the source that operates with the agent.
2) Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end-to-end reliability because of its transactional approach to data flow.
3) How can Flume be used with HBase?
Apache Flume can be used with HBase using one of the two HBase sinks –
HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the novel HBase IPC that was introduced in version HBase 0.96.
AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than the HBase sink as it can easily make non-blocking calls to HBase.
Working of the HBaseSink –
In HBaseSink, a Flume Event is converted into HBase Increments or Puts. The Serializer implements the HBaseEventSerializer, which is instantiated when the sink starts. For every event, the sink calls the initialize method in the serializer, which then translates the Flume Event into HBase increments and puts to be sent to the HBase cluster.
Working of the AsyncHBaseSink –
AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink when it starts. The sink invokes the setEvent method and then makes calls to the getIncrements and getActions methods, just similar to the HBase sink. When the sink stops, the cleanUp method is called by the serializer.
4) Explain about the different channel types in Flume. Which channel type is faster?
The 3 different built-in channel types available in Flume are –
MEMORY Channel – Events are read from the source into memory and passed to the sink.
JDBC Channel – JDBC Channel stores the events in an embedded Derby database.
FILE Channel – File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
MEMORY Channel is the fastest channel among the three, however it carries the risk of data loss. The channel that you choose completely depends on the nature of the big data application and the value of each event.
5) Which is the reliable channel in Flume to ensure that there is no data loss?
FILE Channel is the most reliable channel among the 3 channels – JDBC, FILE and MEMORY.
6) Explain about the replication and multiplexing selectors in Flume.
Channel selectors are used to handle multiple channels. Based on the Flume header value, an event can be written to just a single channel or to multiple channels. If a channel selector is not specified for the source, then by default it is the Replicating selector. Using the replicating selector, the same event is written to all the channels in the source's channels list. The multiplexing channel selector is used when the application has to send different events to different channels.
7) How can a multi-hop agent be set up in Flume?
The Avro RPC Bridge mechanism is used to set up a multi-hop agent in Apache Flume.

8) Does Apache Flume provide support for third-party plugins?
Yes, Apache Flume has a plugin-based architecture, as it can load data from external sources and transfer it to external destinations, and most data analysts make use of such plugins.
9) Is it possible to leverage real-time analysis on the big data collected by Flume directly? If yes, then explain how.
Data from Flume can be extracted, transformed and loaded in real time into Apache Solr servers using MorphlineSolrSink.
10) Differentiate between FileSink and FileRollSink.
The major difference between HDFS FileSink and FileRollSink is that HDFS FileSink writes the events into the Hadoop Distributed File System (HDFS), whereas FileRollSink stores the events in the local file system.

Hadoop Flume Interview Questions and Answers for Freshers – Q.Nos 1, 2, 4, 5, 6, 10
Hadoop Flume Interview Questions and Answers for Experienced – Q.Nos 3, 7, 8, 9

Hadoop Zookeeper Interview Questions and Answers
1) Can Apache Kafka be used without Zookeeper?
It is not possible to use Apache Kafka without Zookeeper, because if Zookeeper is down, Kafka cannot serve client requests.
2) Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
3) What is the role of Zookeeper in HBase architecture?
In HBase architecture, ZooKeeper is the monitoring server that provides different services, like tracking server failures and network partitions, maintaining the configuration information, establishing communication between the clients and region servers, and the usability of ephemeral nodes to identify the available servers in the cluster.

4) Explain about ZooKeeper in Kafka.
Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. Zookeeper is used by Kafka to store various configurations and use them across the cluster in a distributed manner. To achieve distribution, configurations are distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot directly connect to Kafka by bypassing ZooKeeper, because if ZooKeeper is down it will not be able to serve the client requests.
5) Explain how Zookeeper works.
ZooKeeper is referred to as the King of Coordination, and distributed applications use ZooKeeper to store and facilitate important configuration-information updates. ZooKeeper works by coordinating the processes of distributed applications. ZooKeeper is a robust replicated synchronization service with eventual consistency. A set of nodes is known as an ensemble, and persisted data is distributed between multiple nodes.
3 or more independent servers collectively form a ZooKeeper cluster and elect a master. A client connects to any one of the servers and migrates if that particular node fails. The ensemble of ZooKeeper nodes is alive as long as the majority of nodes are working. The master node in ZooKeeper is dynamically selected by consensus within the ensemble, so if the master node fails, the role of master migrates to another node, which is selected dynamically. Writes are linear and reads are concurrent in ZooKeeper.

6) List some examples of Zookeeper use cases.
Found by Elastic uses Zookeeper comprehensively for resource allocation, leader election, high-priority notifications and discovery. The entire service of Found is built up of various systems that read and write to Zookeeper.
Apache Kafka, which depends on ZooKeeper, is used by LinkedIn.
Storm, which relies on ZooKeeper, is used by popular companies like Groupon and Twitter.
7) How do you use the Apache Zookeeper command line interface?
ZooKeeper has command line client support for interactive use. The command line interface of ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a hierarchy of znodes, where each znode can contain data, just similar to a file. Each znode can also have children, just like directories in the UNIX file system.
The zookeeper-client command is used to launch the command line client. If the initial prompt is hidden by the log messages after entering the command, users can just hit ENTER to view the prompt.

8) What are the different types of znodes?
There are 2 types of znodes, namely Ephemeral and Sequential znodes (see the sketch below).
The znodes that get destroyed as soon as the client that created them disconnects are referred to as Ephemeral znodes.
A Sequential znode is one in which a sequential number is chosen by the ZooKeeper ensemble and appended when the client assigns a name to the znode.
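A sketch of creating both kinds of znodes with the ZooKeeper Java client (the connect string and paths are illustrative, and the parent path /app is assumed to already exist):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // ephemeral: destroyed automatically when this client session ends
        zk.create("/app/live-worker", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // sequential: the server appends a monotonically increasing suffix
        String path = zk.create("/app/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + path); // e.g. /app/task-0000000007
    }
}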

9) What are watches?
Client disconnection can be a troublesome problem, especially when we need to keep track of the state of znodes at regular intervals. ZooKeeper has an event system referred to as a watch, which can be set on a znode to trigger an event whenever it is removed or altered, or any new children are created below it.
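A sketch of registering such a watch via exists() with the ZooKeeper Java client (connect string and path are again illustrative):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class WatchExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch fired = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // one-shot watch: triggers when /app/config is created, altered or removed
        zk.exists("/app/config", event -> {
            System.out.println("znode event: " + event.getType());
            fired.countDown();
        });
        fired.await(); // watches fire once; re-register to keep observing
    }
}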

10) What problems can be addressed by using Zookeeper?
In the development of distributed systems, creating your own protocols for coordinating the Hadoop cluster results in failure and frustration for the developers. The architecture of a distributed system can be prone to deadlocks, inconsistency and race conditions. This leads to various difficulties in making the Hadoop cluster fast, reliable and scalable. To address all such problems, Apache ZooKeeper can be used as a coordination service to write correct distributed applications without having to reinvent the wheel from the beginning.
Hadoop ZooKeeper Interview Questions and Answers for Freshers – Q.Nos 1, 2, 8, 9
Hadoop ZooKeeper Interview Questions and Answers for Experienced – Q.Nos 3, 4, 5, 6, 7, 10

Hadoop Pig Interview Questions and Answers
1) What do you mean by a bag in Pig?
A collection of tuples is referred to as a bag in Apache Pig.
2) Does Pig support multi-line commands?
Yes.
3) What are the different modes of execution in Apache Pig?
Apache Pig runs in 2 modes – one is the Pig (Local Mode) Command Mode and the other is the Hadoop MapReduce (Java) Command Mode. Local Mode requires access to only a single machine, where all files are installed and executed on a localhost, whereas MapReduce mode requires access to the Hadoop cluster.

4) Explain the need for MapReduce while programming in Apache Pig.
Apache Pig programs are written in a query language known as Pig Latin, which is similar to the SQL query language. To execute a query, there is a need for an execution engine. The Pig engine converts the queries into MapReduce jobs; thus MapReduce acts as the execution engine and is needed to run the programs.
5) Explain about COGROUP in Pig.
The COGROUP operator in Pig is used to work with multiple tuples. The COGROUP operator is applied on statements that contain or involve two or more relations. The COGROUP operator can be applied on up to 127 relations at a time. When using the COGROUP operator on two tables at once, Pig first groups both the tables and after that joins the two tables on the grouped columns.
6) Explain about the BloomMapFile.
BloomMapFile is a class that extends the MapFile class. It is used in the HBase table format to provide a quick membership test for the keys, using dynamic bloom filters.
7) Differentiate between Hadoop MapReduce and Pig.
Pig provides a higher level of abstraction, whereas MapReduce provides a low level of abstraction.
MapReduce requires the developers to write more lines of code when compared to Apache Pig.
The Pig coding approach is comparatively slower than a fully tuned MapReduce coding approach.
Read More in Detail – http://www.dezyre.com/article/mapreducevspigvshive/163
8) What is the usage of the foreach operation in Pig scripts?
The FOREACH operation in Apache Pig is used to apply a transformation to each element in the data bag, so that a respective action is performed to generate new data items.
Syntax – FOREACH data_bag GENERATE exp1, exp2
9) Explain about the different complex data types in Pig.
Apache Pig supports 3 complex data types –
Maps – These are key-value stores joined together using #.
Tuples – Just similar to the row in a table, where different items are separated by a comma. Tuples can have multiple attributes.
Bags – Unordered collection of tuples. A bag allows multiple duplicate tuples.
10) What does Flatten do in Pig?
Sometimes there is data in a tuple or bag, and if we want to remove the level of nesting from that data, then the Flatten modifier in Pig can be used. Flatten un-nests bags and tuples. For tuples, the Flatten operator will substitute the fields of a tuple in place of the tuple, whereas un-nesting bags is a little more complex because it requires creating new tuples.
We have further categorized Hadoop Pig Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 1, 2, 4, 7, 9
Hadoop Interview Questions and Answers for Experienced – Q.Nos 3, 5, 6, 8, 10

Hadoop Hive Interview Questions and Answers
1) What is a Hive Metastore?
The Hive Metastore is a central repository that stores metadata in an external database.
2) Are multi-line comments supported in Hive?
No.
3) What is ObjectInspector functionality?
ObjectInspector is used to analyze the structure of individual columns and the internal structure of the row objects. ObjectInspector in Hive provides access to complex objects, which can be stored in multiple formats.

4) Explain about the different types of join in Hive.
HiveQL has 4 different types of joins –
JOIN – Similar to an Inner Join in SQL.
FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
LEFT OUTER JOIN – All the rows from the left table are returned even if there are no matches in the right table.
RIGHT OUTER JOIN – All the rows from the right table are returned even if there are no matches in the left table.

5) How can you configure remote metastore mode in Hive?
To configure the metastore in Hive, the hive-site.xml file has to be configured with the below property –
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://node1(or IP Address):9083</value>
  <description>IP address and port of the metastore host</description>
</property>

6) Explain about the SMB Join in Hive.
In an SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge sort join is performed. Sort Merge Bucket (SMB) join in Hive is mainly used as there is no limit on file, partition or table join. An SMB join can best be used when the tables are large. In an SMB join the columns are bucketed and sorted using the join columns. All tables should have the same number of buckets in an SMB join.
7) Is it possible to change the default location of Managed Tables in Hive, and if so, how?
Yes, we can change the default location of Managed Tables using the LOCATION keyword while creating the managed table. The user has to specify the storage path of the managed table as the value of the LOCATION keyword.

8) How does data transfer happen from HDFS to Hive?
If data is already present in HDFS then the user need not LOAD DATA, which moves the files to /user/hive/warehouse/. The user just has to define the table using the keyword external, which creates the table definition in the Hive metastore.
Create external table table_name (
  id int,
  myfields string
)
location '/my/location/in/hdfs';
9) How can you connect an application, if you run Hive as a server?
When running Hive as a server, an application can be connected in one of 3 ways (the JDBC route is sketched below) –
ODBC Driver – This supports the ODBC protocol.
JDBC Driver – This supports the JDBC protocol.
Thrift Client – This client can be used to make calls to all Hive commands using different programming languages like PHP, Python, Java, C++ and Ruby.
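For illustration, a minimal sketch of the JDBC route, assuming HiveServer2 is running and the hive-jdbc driver is on the classpath (host, port, credentials and query are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // hive-jdbc driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table")) {
            while (rs.next()) System.out.println(rs.getLong(1));
        }
    }
}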
10) What does the overwrite keyword denote in a Hive load statement?
The overwrite keyword in a Hive load statement deletes the contents of the target table and replaces them with the files referred to by the file path, i.e. the files that are referred to by the file path will be added to the table when using the overwrite keyword.
11) What is SerDe in Hive? How can you write your own custom SerDe?
SerDe is a Serializer/DeSerializer. Hive uses SerDe to read and write data from tables. Generally, users prefer to write a Deserializer instead of a full SerDe, as they want to read their own data format rather than write to it. If the SerDe supports DDL, i.e. basically a SerDe with parameterized columns and different column types, users can implement a protocol-based DynamicSerDe rather than writing the SerDe from scratch.
12) In case of embedded Hive, can the same metastore be used by multiple users?
No, we cannot use the metastore in sharing mode. It is suggested to use a standalone real database like PostgreSQL or MySQL.
Hadoop Hive Interview Questions and Answers for Freshers – Q.Nos 1, 2, 3, 4, 6, 8
Hadoop Hive Interview Questions and Answers for Experienced – Q.Nos 5, 7, 9, 10, 11, 12

Hadoop YARN Interview Questions and Answers
1) What are the stable versions of Hadoop?
Release 2.7.1 (stable)
Release 2.4.1
Release 1.2.1 (stable)
2) What is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large-scale distributed system for running big data applications.
3) Is YARN a replacement of Hadoop MapReduce?
YARN is not a replacement of Hadoop; it is a more powerful and efficient technology that supports MapReduce and is also referred to as Hadoop 2.0 or MapReduce 2.

We have further categorized Hadoop YARN Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 2, 3
Hadoop Interview Questions and Answers for Experienced – Q.Nos 1

Hadoop Interview Questions – Answers Needed
Hadoop YARN Interview Questions –
1) What are the additional benefits YARN brings into Hadoop?
2) How can native libraries be included in YARN jobs?
3) Explain the differences between Hadoop 1.x and Hadoop 2.x
Or
4) Explain the difference between MapReduce 1 and MapReduce 2/YARN
5) What are the modules that constitute the Apache Hadoop 2.0 framework?
6) What are the core changes in Hadoop 2.0?
7) How is the distance between two nodes defined in Hadoop?
8) Differentiate between NFS, Hadoop NameNode and JournalNode.

We hope that these Hadoop Interview Questions and Answers have pre-charged you for your next Hadoop Interview. Get the ball rolling and answer the unanswered questions in the comments below. Please do! It's all part of our shared mission to ease Hadoop Interviews for all prospective Hadoopers. We invite you to get involved.
Click here to know more about our IBM Certified Hadoop Developer course

