

Here are top Hadoop Developer Interview Questions and Answers based on different components of the Hadoop Ecosystem –

1) Hadoop Basic Interview Questions
2) Hadoop HDFS Interview Questions
3) MapReduce Interview Questions
4) Hadoop HBase Interview Questions
5) Hadoop Sqoop Interview Questions
6) Hadoop Flume Interview Questions
7) Hadoop Zookeeper Interview Questions


8) Pig Interview Questions


9) Hive Interview Questions
10) Hadoop YARN Interview Questions

Big Data Hadoop Interview Questions and Answers
These are Hadoop Basic Interview Questions and Answers for freshers and experienced.
1. What is Big Data?


Big data is defined as the voluminous amount of structured, unstructured or semi-structured data that has huge potential for mining but is so large that it cannot be processed using traditional database systems. Big data is characterized by its high velocity, volume and variety, which require cost-effective and innovative methods for information processing to draw meaningful business insights. More than the volume of the data, it is the nature of the data that defines whether it is considered Big Data or not.

2. What do the four V's of Big Data denote?

IBM has a nice, simple explanation for the four critical features of big data:
a) Volume – Scale of data
b) Velocity – Analysis of streaming data
c) Variety – Different forms of data
d) Veracity – Uncertainty of data


3. How does big data analysis help businesses increase their revenue? Give an example.


Big data analysis is helping businesses differentiate themselves – for example,


Walmart, the world's largest retailer in 2014 in terms of revenue, is using big data analytics to increase its sales through better predictive analytics, providing customized recommendations and launching new products based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in online sales, for $1 billion in incremental revenue. There are many more companies like Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase, Bank of America, etc. using big data analytics to boost their revenue.


4. Name some companies that use Hadoop.


Yahoo (one of the biggest users & more than 80% code contributor to Hadoop)
Facebook
Netflix
Amazon
Adobe
eBay
Hulu
Spotify
Rubikloud
Twitter

To view a detailed list of some of the top companies using Hadoop, CLICK HERE.

5. Differentiate between Structured and Unstructured data.


Data which can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, can be referred to as Structured Data. Data which can be stored only partially in traditional database systems, for example data in XML records, can be referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured or structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data.
6. On what concept does the Hadoop framework work?

The Hadoop Framework works on the following two core components –
1) HDFS – Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks and it operates on the Master-Slave Architecture.
2) Hadoop MapReduce – This is a Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. Hadoop jobs perform two separate tasks. The map job breaks down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job is executed.
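To make the map and reduce stages concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API (the class names and tokenization rule are illustrative, not from the article):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map job: break each input line into (word, 1) key-value pairs
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().trim().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // emit (word, 1) tuples
        }
    }
}

// Reduce job: combine the tuples for each key into a smaller set
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get(); // runs after all maps finish
        context.write(key, new IntWritable(sum));    // one total per word
    }
}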

7) What are the main components of a Hadoop Application?
Hadoop applications have a wide range of technologies that provide great advantage in solving complex business problems.
Core components of a Hadoop application are –
1) Hadoop Common
2) HDFS
3) Hadoop MapReduce
4) YARN
Data Access Components are – Pig and Hive
Data Storage Component is – HBase
Data Integration Components are – Apache Flume, Sqoop, Chukwa
Data Management and Monitoring Components are – Ambari, Oozie and Zookeeper
Data Serialization Components are – Thrift and Avro
Data Intelligence Components are – Apache Mahout and Drill
8. What is Hadoop streaming?
The Hadoop distribution has a generic application programming interface for writing Map and Reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the Mapper or Reducer.
9. What is the best hardware configuration to run Hadoop?
The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4GB or 8GB RAM that use ECC memory. Hadoop benefits highly from using ECC memory, though it is not low-end. ECC memory is recommended for running Hadoop because most Hadoop users have experienced various checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.

10. What are the most commonly defined input formats in Hadoop?
The most common input formats defined in Hadoop are (see the configuration sketch below):
TextInputFormat – This is the default input format defined in Hadoop.
KeyValueInputFormat – This input format is used for plain text files wherein each line is broken down into a key and a value.
SequenceFileInputFormat – This input format is used for reading files in sequence.
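For illustration, the input format is selected when configuring the job object; a minimal sketch using the new-API class KeyValueTextInputFormat (the job name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KvJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "kv-example");
        // TextInputFormat is the default; override it when each input line
        // is a tab-separated key/value pair:
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}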

We have further categorized Big Data Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 1, 2, 4, 5, 6, 7, 8, 9
Hadoop Interview Questions and Answers for Experienced – Q.Nos 3, 8, 9, 10
For a detailed PDF report on Hadoop Salaries – CLICK HERE

Hadoop HDFS Interview Questions and Answers
1. What is a block and block scanner in HDFS?
Block – The minimum amount of data that can be read or written is generally referred to as a block in HDFS. The default size of a block in HDFS is 64MB in Hadoop 1.x (128MB from Hadoop 2.x onwards).
Block Scanner – Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the DataNode.
2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. NameNode uses two files for the namespace –
fsimage file – It keeps track of the latest checkpoint of the namespace.
edits file – It is a log of changes that have been made to the namespace since the checkpoint.
Checkpoint Node:
Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as that of the NameNode's directory. The Checkpoint Node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then updated back to the active NameNode.
Backup Node:
Backup Node also provides checkpointing functionality like that of the Checkpoint Node, but it additionally maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
3. What is commodity hardware?
Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware includes RAM because there are specific services that need to be executed on RAM. Hadoop can be run on any commodity hardware and does not require any supercomputers or high-end hardware configuration to execute jobs.
4. What is the port number for NameNode, Task Tracker and Job Tracker?
NameNode – 50070
JobTracker – 50030
TaskTracker – 50060

5. Explain about the process of inter-cluster data copying.
HDFS provides a distributed data copying facility through DistCP from source to destination. When this data copying happens between two different Hadoop clusters, it is referred to as inter-cluster data copying. DistCP requires both source and destination to have a compatible or same version of Hadoop.
6. How can you overwrite the replication factors in HDFS?
The replication factor in HDFS can be modified or overwritten in 2 ways –
1) Using the Hadoop FS Shell, the replication factor can be changed on a per-file basis using the below command –
$ hadoop fs -setrep -w 2 /my/test_file
(test_file is the filename whose replication factor will be set to 2)
2) Using the Hadoop FS Shell, the replication factor of all files under a given directory can be modified using the below command –
$ hadoop fs -setrep -w 5 /my/test_dir
(test_dir is the name of the directory; all the files in this directory will have their replication factor set to 5)
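The same change can also be made programmatically; a minimal sketch using the HDFS Java API, reusing the illustrative path from above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // equivalent of: hadoop fs -setrep 2 /my/test_file
        boolean changed = fs.setReplication(new Path("/my/test_file"), (short) 2);
        System.out.println("replication changed: " + changed);
    }
}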
7. Explain the difference between NAS and HDFS.
NAS runs on a single machine and thus there is no probability of data redundancy, whereas HDFS runs on a cluster of different machines and thus there is data redundancy because of the replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines.
In NAS, data is stored independently of the computation, and hence Hadoop MapReduce cannot be used for processing; HDFS, on the other hand, works with Hadoop MapReduce because the computations in HDFS are moved to the data.

8. Explain what happens if, during the PUT operation, an HDFS block is assigned a replication factor of 1 instead of the default value of 3.
Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times the blocks are to be replicated, to ensure high data availability. For every block that is stored in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value of 3, there will be a single copy of the data. Under these circumstances, if the DataNode crashes, the only copy of the data will be lost.
9. What is the process to change the files at arbitrary locations in HDFS?
HDFS does not support modifications at arbitrary offsets in the file or multiple writers; files are written by a single writer in append-only format, i.e. writes to a file in HDFS are always made at the end of the file.

10. Explain about the indexing process in HDFS.
The indexing process in HDFS depends on the block size. HDFS stores the last part of the data, which further points to the address where the next part of the data chunk is stored.

11. What is rack awareness and on what basis is data stored in a rack?
All the data nodes put together form a storage area, i.e. the physical location of the data nodes is referred to as a Rack in HDFS. The rack information, i.e. the rack id of each data node, is acquired by the NameNode. The process of selecting closer data nodes depending on the rack information is known as Rack Awareness.
The contents of the file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting with the NameNode, the client allocates 3 data nodes for each data block. For each data block, there exist 2 copies in one rack and the third copy is present in another rack. This is generally referred to as the Replica Placement Policy.
We have further categorized Hadoop HDFS Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 2, 3, 7, 9, 10, 11
Hadoop Interview Questions and Answers for Experienced – Q.Nos 1, 2, 4, 5, 6, 7, 8
Click here to know more about our IBM Certified Hadoop Developer course

Hadoop MapReduce Interview Questions and Answers
1. Explain the usage of Context Object.
The Context Object is used to help the mapper interact with other Hadoop systems. It can be used for updating counters, to report progress and to provide any application-level status updates. The Context Object also holds the configuration details for the job, as well as interfaces that help it generate the output.
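As an illustration, a hedged sketch of a mapper using the Context object for counters, status and output (the counter group and name are invented for the example):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AuditMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.getLength() == 0) {
            // update a counter through the context
            context.getCounter("audit", "empty-lines").increment(1);
            return;
        }
        context.setStatus("processing offset " + key.get()); // status update
        context.write(value, new LongWritable(1));           // emit output
    }
}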

2. What are the core methods of a Reducer?
The 3 core methods of a reducer are (see the skeleton sketch below) –
1) setup() – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.
Function Definition – public void setup(Context context)
2) reduce() – It is the heart of the reducer, called once per key with the associated list of values.
Function Definition – public void reduce(Key key, Iterable<Value> values, Context context)
3) cleanup() – This method is called only once, at the end of the reduce task, for clearing all the temporary files.
Function Definition – public void cleanup(Context context)
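Put together, a reducer skeleton showing all three methods might look like the following sketch (the generic types are arbitrary):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) {
        // runs once per task, before any reduce() call – read configuration here
    }
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get(); // called once per key
        context.write(key, new IntWritable(sum));
    }
    @Override
    protected void cleanup(Context context) {
        // runs once per task, after the last reduce() call – release resources
    }
}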
3. Explain about the partitioning, shuffle and sort phases.
Shuffle Phase – Once the first map tasks are completed, the nodes continue to perform several other map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducer is referred to as Shuffling.
Sort Phase – Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.
Partitioning Phase – The process that determines which intermediate keys and values will be received by each reducer instance is referred to as partitioning. The destination partition is the same for any key, irrespective of the mapper instance that generated it.
4. How do you write a custom partitioner for a Hadoop MapReduce job?
Steps to write a Custom Partitioner for a Hadoop MapReduce Job (as shown in the sketch below) –
A new class must be created that extends the predefined Partitioner class.
The getPartition method of the Partitioner class must be overridden.
The custom partitioner can be added to the job as a config file in the wrapper which runs Hadoop MapReduce, or it can be added to the job by using the set method of the partitioner class.
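A hedged sketch of such a custom partitioner (the routing rule is invented purely for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // route keys starting with 'a'-'m' and 'n'-'z' to different reducers
        String s = key.toString();
        if (s.isEmpty()) return 0;
        char first = Character.toLowerCase(s.charAt(0));
        return (first <= 'm' ? 0 : 1) % numPartitions;
    }
}
// wired into the job with: job.setPartitionerClass(FirstLetterPartitioner.class);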
5. What is the relationship between Job and Task in Hadoop?
A single job can be broken down into one or many tasks in Hadoop.
6. Is it important for Hadoop MapReduce jobs to be written in Java?
It is not necessary to write Hadoop MapReduce jobs in Java. Users can write MapReduce jobs in any desired programming language like Ruby, Perl, Python, R, Awk, etc. through the Hadoop Streaming API.
7. What is the process of changing the split size if there is limited storage space on commodity hardware?
If there is limited storage space on commodity hardware, the split size can be changed by implementing a Custom Splitter. The call to the Custom Splitter can be made from the main method.
8. What are the primary phases of a Reducer?
The 3 primary phases of a reducer are –
1) Shuffle
2) Sort
3) Reduce
9. What is a Task Instance?
The actual Hadoop MapReduce jobs that run on each slave node are referred to as Task Instances. Every task instance has its own JVM process; for every new task instance, a JVM process is spawned by default.
10. Can reducers communicate with each other?
Reducers always run in isolation and can never communicate with each other, as per the Hadoop MapReduce programming paradigm.

We have further categorized Hadoop MapReduce Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 2, 5, 6
Hadoop Interview Questions and Answers for Experienced – Q.Nos 1, 3, 4, 7, 8, 9, 10

Hadoop HBase Interview Questions and Answers
1. When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has –
1) A variable schema
2) Data stored in the form of collections
3) A demand for key-based access to data while retrieving.
Key components of HBase are –
Region – This component contains the memory data store (MemStore) and the HFile.
Region Server – This monitors the Region.
HBase Master – It is responsible for monitoring the region servers.
Zookeeper – It takes care of the coordination between the HBase Master component and the client.
Catalog Tables – The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
2. What are the different operational commands in HBase at record level and table level?
Record-level operational commands in HBase are – put, get, increment, scan and delete.
Table-level operational commands in HBase are – describe, list, drop, disable and scan.
3. What is a Row Key?
Every row in an HBase table has a unique identifier known as the Row Key. It is used for grouping cells logically, and it ensures that all cells that have the same Row Key are co-located on the same server. The Row Key is internally regarded as a byte array.
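For illustration, a sketch of how the row key is supplied through the HBase Java client API (the table, column family and key names are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // the byte array passed to Put/Get is the row key
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            Result row = table.get(new Get(Bytes.toBytes("user#1001")));
            System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
        }
    }
}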

4. Explain the difference between the RDBMS data model and the HBase data model.
RDBMS is a schema-based database, whereas HBase is a schema-less data model.
RDBMS does not have support for in-built partitioning, whereas in HBase there is automated partitioning.
RDBMS stores normalized data, whereas HBase stores denormalized data.
5. Explain about the different catalog tables in HBase.
The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.

6. What are column families? What happens if you alter the block size of a column family on an already populated database?
The logical deviation of data is represented through a key known as the column family. Column families consist of the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data will remain within the old block size, whereas the new data that comes in will take the new block size. When compaction takes place, the old data will take the new block size so that the existing data is read correctly.
7. Explain the difference between HBase and Hive.
HBase and Hive are completely different Hadoop-based technologies – Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive helps SQL-savvy people run MapReduce jobs, whereas HBase supports 4 primary operations – put, get, scan and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.
8. Explain the process of row deletion in HBase.
On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells; rather, the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
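A sketch of issuing such a delete through the HBase Java client (same hypothetical table as in the Row Key example); the call only writes a tombstone marker, and the cells physically disappear at a later compaction:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // marks the whole row with a tombstone rather than removing it in place
            table.delete(new Delete(Bytes.toBytes("user#1001")));
        }
    }
}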
9. What are the different types of tombstone markers in HBase for deletion?
There are 3 different types of tombstone markers in HBase for deletion –
1) Family Delete Marker – This marker marks all the columns for a column family.
2) Version Delete Marker – This marker marks a single version of a column.
3) Column Delete Marker – This marker marks all the versions of a column.
10. Explain about HLog and WAL in HBase.
All edits in the HStore are stored in the HLog. Every region server has one HLog. The HLog contains entries for edits of all regions performed by a particular Region Server. WAL stands for Write Ahead Log (WAL), in which all the HLog edits are written immediately. WAL edits remain in memory till the flush period in case of deferred log flush.
We have further categorized Hadoop HBase Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 1, 2, 4, 5, 7
Hadoop Interview Questions and Answers for Experienced – Q.Nos 2, 3, 6, 8, 9, 10

Hadoop Sqoop Interview Questions and Answers
1. Explain about some important Sqoop commands other than import and export.
Create Job (--create)
Here we are creating a job with the name myjob, which can import the table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database to an HDFS file.
$ sqoop job --create myjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Verify Job (--list)
The --list argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show)
The --show argument is used to inspect or verify particular jobs and their details. The following command and sample output is used to verify a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec)
The --exec option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ sqoop job --exec myjob

2. How can Sqoop be used in a Java program?
The Sqoop jar should be included in the classpath of the Java code. After this, the method Sqoop.runTool() must be invoked. The necessary parameters should be passed to Sqoop programmatically, just like on the command line.
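A minimal sketch of that invocation, assuming the Sqoop and Hadoop jars are on the classpath (the connection string, table and target directory are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopFromJava {
    public static void main(String[] args) {
        // same arguments as the command line, passed as a String array
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://localhost/db",
            "--username", "root",
            "--table", "employee",
            "--target-dir", "/user/hadoop/employee"
        };
        int exitCode = Sqoop.runTool(sqoopArgs, new Configuration());
        System.exit(exitCode);
    }
}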
3. What is the process to perform an incremental data load in Sqoop?
The process to perform an incremental data load in Sqoop is to synchronize the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop. The incremental load can be performed by using the Sqoop import command or by loading the data into Hive without overwriting it. The different attributes that need to be specified during an incremental load in Sqoop are –
1) Mode (--incremental) – The mode defines how Sqoop will determine what the new rows are. The mode can have the value Append or LastModified.
2) Col (--check-column) – This attribute specifies the column that should be examined to find out the rows to be imported.
3) Value (--last-value) – This denotes the maximum value of the check column from the previous import operation.
4. Is it possible to do an incremental import using Sqoop?
Yes, Sqoop supports two types of incremental imports –
1) Append
2) LastModified
To insert only new rows, Append should be used in the import command; for inserting new rows and also updating existing ones, LastModified should be used in the import command.
5. What is the standard location or path for Hadoop Sqoop scripts?
/usr/bin/Hadoop Sqoop
6. How can you check all the tables present in a single database using Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as follows –
$ sqoop list-tables --connect jdbc:mysql://localhost/user

7. How are large objects handled in Sqoop?
Sqoop provides the capability to store large-sized data in a single field based on the type of data. Sqoop supports the ability to store –
1) CLOBs – Character Large Objects
2) BLOBs – Binary Large Objects
Large objects in Sqoop are handled by importing the large objects into a file referred to as a LobFile, i.e. Large Object File. The LobFile has the ability to store records of huge size; thus each record in the LobFile is a large object.
8. Can free-form SQL queries be used with the Sqoop import command? If yes, then how can they be used?
Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.
9. Differentiate between Sqoop and distCP.
The DistCP utility can be used to transfer data between clusters, whereas Sqoop can be used to transfer data only between Hadoop and an RDBMS.
10. What are the limitations of importing RDBMS tables into Hcatalog directly?
There is an option to import RDBMS tables into Hcatalog directly by making use of the --hcatalog-database option with --hcatalog-table, but the limitation is that several arguments like --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.

We have further categorized Hadoop Sqoop Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 4, 5, 6, 9
Hadoop Interview Questions and Answers for Experienced – Q.Nos 1, 2, 3, 6, 7, 8, 10

Hadoop Flume Interview Questions and Answers
1) Explain about the core components of Flume.
The core components of Flume are –
Event – The single log entry or unit of data that is transported.
Source – This is the component through which data enters Flume workflows.
Sink – It is responsible for transporting data to the desired destination.
Channel – It is the duct between the Sink and the Source.
Agent – Any JVM that runs Flume.
Client – The component that transmits the event to the source that operates with the agent.
2) Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end-to-end reliability because of its transactional approach to data flow.
3) How can Flume be used with HBase?
Apache Flume can be used with HBase using one of the two HBase sinks –
HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the novel HBase IPC that was introduced in version HBase 0.96.
AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than the HBase sink as it can easily make non-blocking calls to HBase.
Working of the HBaseSink –
In HBaseSink, a Flume Event is converted into HBase Increments or Puts. The Serializer implements the HBaseEventSerializer, which is instantiated when the sink starts. For every event, the sink calls the initialize method in the serializer, which then translates the Flume Event into HBase increments and puts to be sent to the HBase cluster.
Working of the AsyncHBaseSink –
AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink when it starts. The sink invokes the setEvent method and then makes calls to the getIncrements and getActions methods, just similar to the HBase sink. When the sink stops, the cleanUp method is called by the serializer.
4) Explain about the different channel types in Flume. Which channel type is faster?
The 3 different built-in channel types available in Flume are –
MEMORY Channel – Events are read from the source into memory and passed to the sink.
JDBC Channel – JDBC Channel stores the events in an embedded Derby database.
FILE Channel – File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
MEMORY Channel is the fastest channel among the three, however it carries the risk of data loss. The channel that you choose completely depends on the nature of the big data application and the value of each event.
5) Which is the reliable channel in Flume to ensure that there is no data loss?
FILE Channel is the most reliable channel among the 3 channels – JDBC, FILE and MEMORY.
6) Explain about the replication and multiplexing selectors in Flume.
Channel selectors are used to handle multiple channels. Based on the Flume header value, an event can be written to just a single channel or to multiple channels. If a channel selector is not specified for the source, then by default it is the Replicating selector. Using the replicating selector, the same event is written to all the channels in the source's channels list. The multiplexing channel selector is used when the application has to send different events to different channels.
7) How can a multi-hop agent be set up in Flume?
The Avro RPC Bridge mechanism is used to set up a multi-hop agent in Apache Flume.

8) Does Apache Flume provide support for third-party plugins?
Yes, Apache Flume has a plugin-based architecture, as it can load data from external sources and transfer it to external destinations, and most data analysts make use of such plugins.
9) Is it possible to leverage real-time analysis on the big data collected by Flume directly? If yes, then explain how.
Data from Flume can be extracted, transformed and loaded in real time into Apache Solr servers using MorphlineSolrSink.
10) Differentiate between FileSink and FileRollSink.
The major difference between HDFS FileSink and FileRollSink is that HDFS FileSink writes the events into the Hadoop Distributed File System (HDFS), whereas FileRollSink stores the events in the local file system.

Hadoop Flume Interview Questions and Answers for Freshers – Q.Nos 1, 2, 4, 5, 6, 10
Hadoop Flume Interview Questions and Answers for Experienced – Q.Nos 3, 7, 8, 9

Hadoop Zookeeper Interview Questions and Answers
1) Can Apache Kafka be used without Zookeeper?
It is not possible to use Apache Kafka without Zookeeper, because if Zookeeper is down, Kafka cannot serve client requests.
2) Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
3) What is the role of Zookeeper in HBase architecture?
In HBase architecture, ZooKeeper is the monitoring server that provides different services, like tracking server failures and network partitions, maintaining the configuration information, establishing communication between the clients and region servers, and the usability of ephemeral nodes to identify the available servers in the cluster.

4) Explain about ZooKeeper in Kafka.
Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. Zookeeper is used by Kafka to store various configurations and use them across the cluster in a distributed manner. To achieve distribution, configurations are distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot directly connect to Kafka by bypassing ZooKeeper, because if ZooKeeper is down it will not be able to serve the client requests.
5) Explain how Zookeeper works.
ZooKeeper is referred to as the King of Coordination, and distributed applications use ZooKeeper to store and facilitate important configuration-information updates. ZooKeeper works by coordinating the processes of distributed applications. ZooKeeper is a robust replicated synchronization service with eventual consistency. A set of nodes is known as an ensemble, and persisted data is distributed between multiple nodes.
3 or more independent servers collectively form a ZooKeeper cluster and elect a master. A client connects to any one of the servers and migrates if that particular node fails. The ensemble of ZooKeeper nodes is alive as long as the majority of nodes are working. The master node in ZooKeeper is dynamically selected by consensus within the ensemble, so if the master node fails, the role of master migrates to another node, which is selected dynamically. Writes are linear and reads are concurrent in ZooKeeper.

6) List some examples of Zookeeper use cases.
Found by Elastic uses Zookeeper comprehensively for resource allocation, leader election, high-priority notifications and discovery. The entire service of Found is built up of various systems that read and write to Zookeeper.
Apache Kafka, which depends on ZooKeeper, is used by LinkedIn.
Storm, which relies on ZooKeeper, is used by popular companies like Groupon and Twitter.
7) How do you use the Apache Zookeeper command line interface?
ZooKeeper has command line client support for interactive use. The command line interface of ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a hierarchy of znodes, where each znode can contain data, just similar to a file. Each znode can also have children, just like directories in the UNIX file system.
The zookeeper-client command is used to launch the command line client. If the initial prompt is hidden by the log messages after entering the command, users can just hit ENTER to view the prompt.

8) What are the different types of znodes?
There are 2 types of znodes, namely Ephemeral and Sequential znodes (see the sketch below).
The znodes that get destroyed as soon as the client that created them disconnects are referred to as Ephemeral znodes.
A Sequential znode is one in which a sequential number is chosen by the ZooKeeper ensemble and appended when the client assigns a name to the znode.
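A sketch of creating both kinds of znodes with the ZooKeeper Java client (the connect string and paths are illustrative, and the parent path /app is assumed to already exist):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // ephemeral: destroyed automatically when this client session ends
        zk.create("/app/live-worker", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // sequential: the server appends a monotonically increasing suffix
        String path = zk.create("/app/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + path); // e.g. /app/task-0000000007
    }
}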

9) What are watches?
Client disconnection can be a troublesome problem, especially when we need to keep track of the state of znodes at regular intervals. ZooKeeper has an event system referred to as a watch, which can be set on a znode to trigger an event whenever it is removed or altered, or any new children are created below it.
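A sketch of registering such a watch via exists() with the ZooKeeper Java client (connect string and path are again illustrative):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class WatchExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch fired = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // one-shot watch: triggers when /app/config is created, altered or removed
        zk.exists("/app/config", event -> {
            System.out.println("znode event: " + event.getType());
            fired.countDown();
        });
        fired.await(); // watches fire once; re-register to keep observing
    }
}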

10) What problems can be addressed by using Zookeeper?
In the development of distributed systems, creating your own protocols for coordinating the Hadoop cluster results in failure and frustration for the developers. The architecture of a distributed system can be prone to deadlocks, inconsistency and race conditions. This leads to various difficulties in making the Hadoop cluster fast, reliable and scalable. To address all such problems, Apache ZooKeeper can be used as a coordination service to write correct distributed applications without having to reinvent the wheel from the beginning.
Hadoop ZooKeeper Interview Questions and Answers for Freshers – Q.Nos 1, 2, 8, 9
Hadoop ZooKeeper Interview Questions and Answers for Experienced – Q.Nos 3, 4, 5, 6, 7, 10

Hadoop Pig Interview Questions and Answers
1) What do you mean by a bag in Pig?
A collection of tuples is referred to as a bag in Apache Pig.
2) Does Pig support multi-line commands?
Yes.
3) What are the different modes of execution in Apache Pig?
Apache Pig runs in 2 modes – one is the Pig (Local Mode) Command Mode and the other is the Hadoop MapReduce (Java) Command Mode. Local Mode requires access to only a single machine, where all files are installed and executed on a localhost, whereas MapReduce mode requires access to the Hadoop cluster.

4) Explain the need for MapReduce while programming in Apache Pig.
Apache Pig programs are written in a query language known as Pig Latin, which is similar to the SQL query language. To execute a query, there is a need for an execution engine. The Pig engine converts the queries into MapReduce jobs; thus MapReduce acts as the execution engine and is needed to run the programs.
5) Explain about COGROUP in Pig.
The COGROUP operator in Pig is used to work with multiple tuples. The COGROUP operator is applied on statements that contain or involve two or more relations. The COGROUP operator can be applied on up to 127 relations at a time. When using the COGROUP operator on two tables at once, Pig first groups both the tables and after that joins the two tables on the grouped columns.
6) Explain about the BloomMapFile.
BloomMapFile is a class that extends the MapFile class. It is used in the HBase table format to provide a quick membership test for the keys, using dynamic bloom filters.
7) Differentiate between Hadoop MapReduce and Pig.
Pig provides a higher level of abstraction, whereas MapReduce provides a low level of abstraction.
MapReduce requires the developers to write more lines of code when compared to Apache Pig.
The Pig coding approach is comparatively slower than a fully tuned MapReduce coding approach.
Read More in Detail – http://www.dezyre.com/article/mapreducevspigvshive/163
8) What is the usage of the foreach operation in Pig scripts?
The FOREACH operation in Apache Pig is used to apply a transformation to each element in the data bag, so that a respective action is performed to generate new data items.
Syntax – FOREACH data_bag GENERATE exp1, exp2
9) Explain about the different complex data types in Pig.
Apache Pig supports 3 complex data types –
Maps – These are key-value stores joined together using #.
Tuples – Just similar to the row in a table, where different items are separated by a comma. Tuples can have multiple attributes.
Bags – Unordered collection of tuples. A bag allows multiple duplicate tuples.
10) What does Flatten do in Pig?
Sometimes there is data in a tuple or bag, and if we want to remove the level of nesting from that data, then the Flatten modifier in Pig can be used. Flatten un-nests bags and tuples. For tuples, the Flatten operator will substitute the fields of a tuple in place of the tuple, whereas un-nesting bags is a little more complex because it requires creating new tuples.
We have further categorized Hadoop Pig Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 1, 2, 4, 7, 9
Hadoop Interview Questions and Answers for Experienced – Q.Nos 3, 5, 6, 8, 10

Hadoop Hive Interview Questions and Answers
1) What is a Hive Metastore?
The Hive Metastore is a central repository that stores metadata in an external database.
2) Are multi-line comments supported in Hive?
No.
3) What is ObjectInspector functionality?
ObjectInspector is used to analyze the structure of individual columns and the internal structure of the row objects. ObjectInspector in Hive provides access to complex objects, which can be stored in multiple formats.

4) Explain about the different types of join in Hive.
HiveQL has 4 different types of joins –
JOIN – Similar to an Inner Join in SQL.
FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
LEFT OUTER JOIN – All the rows from the left table are returned even if there are no matches in the right table.
RIGHT OUTER JOIN – All the rows from the right table are returned even if there are no matches in the left table.

5) How can you configure remote metastore mode in Hive?
To configure the metastore in Hive, the hive-site.xml file has to be configured with the below property –
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://node1(or IP Address):9083</value>
  <description>IP address and port of the metastore host</description>
</property>

6) Explain about the SMB Join in Hive.
In an SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge sort join is performed. Sort Merge Bucket (SMB) join in Hive is mainly used as there is no limit on file, partition or table join. An SMB join can best be used when the tables are large. In an SMB join the columns are bucketed and sorted using the join columns. All tables should have the same number of buckets in an SMB join.
7) Is it possible to change the default location of Managed Tables in Hive, and if so, how?
Yes, we can change the default location of Managed Tables using the LOCATION keyword while creating the managed table. The user has to specify the storage path of the managed table as the value of the LOCATION keyword.

8) How does data transfer happen from HDFS to Hive?
If data is already present in HDFS then the user need not LOAD DATA, which moves the files to /user/hive/warehouse/. The user just has to define the table using the keyword external, which creates the table definition in the Hive metastore.
Create external table table_name (
  id int,
  myfields string
)
location '/my/location/in/hdfs';
9) How can you connect an application, if you run Hive as a server?
When running Hive as a server, an application can be connected in one of 3 ways (the JDBC route is sketched below) –
ODBC Driver – This supports the ODBC protocol.
JDBC Driver – This supports the JDBC protocol.
Thrift Client – This client can be used to make calls to all Hive commands using different programming languages like PHP, Python, Java, C++ and Ruby.
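For illustration, a minimal sketch of the JDBC route, assuming HiveServer2 is running and the hive-jdbc driver is on the classpath (host, port, credentials and query are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // hive-jdbc driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table")) {
            while (rs.next()) System.out.println(rs.getLong(1));
        }
    }
}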
10) What does the overwrite keyword denote in a Hive load statement?
The overwrite keyword in a Hive load statement deletes the contents of the target table and replaces them with the files referred to by the file path, i.e. the files that are referred to by the file path will be added to the table when using the overwrite keyword.
11) What is SerDe in Hive? How can you write your own custom SerDe?
SerDe is a Serializer/DeSerializer. Hive uses SerDe to read and write data from tables. Generally, users prefer to write a Deserializer instead of a full SerDe, as they want to read their own data format rather than write to it. If the SerDe supports DDL, i.e. basically a SerDe with parameterized columns and different column types, users can implement a protocol-based DynamicSerDe rather than writing the SerDe from scratch.
12) In case of embedded Hive, can the same metastore be used by multiple users?
No, we cannot use the metastore in sharing mode. It is suggested to use a standalone real database like PostgreSQL or MySQL.
Hadoop Hive Interview Questions and Answers for Freshers – Q.Nos 1, 2, 3, 4, 6, 8
Hadoop Hive Interview Questions and Answers for Experienced – Q.Nos 5, 7, 9, 10, 11, 12

Hadoop YARN Interview Questions and Answers
1) What are the stable versions of Hadoop?
Release 2.7.1 (stable)
Release 2.4.1
Release 1.2.1 (stable)
2) What is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large-scale distributed system for running big data applications.
3) Is YARN a replacement of Hadoop MapReduce?
YARN is not a replacement of Hadoop; it is a more powerful and efficient technology that supports MapReduce and is also referred to as Hadoop 2.0 or MapReduce 2.

We have further categorized Hadoop YARN Interview Questions for Freshers and Experienced –
Hadoop Interview Questions and Answers for Freshers – Q.Nos 2, 3
Hadoop Interview Questions and Answers for Experienced – Q.Nos 1

Hadoop Interview Questions – Answers Needed
Hadoop YARN Interview Questions –
1) What are the additional benefits YARN brings into Hadoop?
2) How can native libraries be included in YARN jobs?
3) Explain the differences between Hadoop 1.x and Hadoop 2.x
Or
4) Explain the difference between MapReduce 1 and MapReduce 2/YARN
5) What are the modules that constitute the Apache Hadoop 2.0 framework?
6) What are the core changes in Hadoop 2.0?
7) How is the distance between two nodes defined in Hadoop?
8) Differentiate between NFS, Hadoop NameNode and JournalNode.

We hope that these Hadoop Interview Questions and Answers have pre-charged you for your next Hadoop Interview. Get the ball rolling and answer the unanswered questions in the comments below. Please do! It's all part of our shared mission to ease Hadoop Interviews for all prospective Hadoopers. We invite you to get involved.
Click here to know more about our IBM Certified Hadoop Developer course

