Apache Hadoop stores huge files as they are (raw) without specifying any
schema.
High scalability – We can add any number of nodes, hence enhancing
performance dramatically.
Reliable – It stores data reliably on the cluster despite machine failure.
High availability – In Hadoop, data is highly available despite hardware
failure. If a machine or some hardware crashes, the data can still be accessed
from another node that holds a replica.
Economic – Hadoop runs on a cluster of commodity hardware, which is not
very expensive.
Follow this link to know about Features of Hadoop
Q.3 What are the core components of Hadoop?
Hadoop is an open-source software framework for distributed storage and
processing of large datasets. Apache Hadoop core components are HDFS,
MapReduce, and YARN.
HDFS- Hadoop Distributed File System (HDFS) is the primary storage
system of Hadoop. HDFS stores very large files on a cluster of commodity
hardware. It works on the principle of storing a small number of large files
rather than a huge number of small files. HDFS stores data reliably even in
the case of hardware failure. It provides high-throughput access to
applications by reading data in parallel.
MapReduce- MapReduce is the data processing layer of Hadoop. It provides
a framework for writing applications that process the large structured and
unstructured data stored in HDFS. MapReduce processes a huge amount of
data in parallel by dividing the job (submitted job) into a set of independent
tasks (sub-jobs). In Hadoop, MapReduce works by breaking the processing
into two phases: Map and Reduce. Map is the first phase of processing,
where we specify all the complex logic. Reduce is the second phase of
processing, where we specify light-weight processing such as
aggregation or summation. A minimal sketch of the two phases is shown below.
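The following is a minimal word-count sketch (not part of the original answer) illustrating the two phases with Hadoop's org.apache.hadoop.mapreduce API; the class names are illustrative only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: per-record logic, emits (word, 1) for every word in a line.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }
}

// Reduce phase: light-weight aggregation, sums the counts for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));   // final output
    }
}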
YARN- YARN is the resource management layer of Hadoop. It provides
resource management and allows multiple data processing engines, for
example real-time streaming, data science, and batch processing.
Read Hadoop Ecosystem Components in detail.
Q.4 What are the Features of Hadoop?
The various Features of Hadoop are:
You can also change the replication factor on a per-file basis using the
command: hadoop fs -setrep -w 3 /file_location
You can also change the replication factor for all the files in a directory by
using: hadoop fs -setrep -w 3 -R /directory_location
Learn: Read Write Operations in HDFS
13) Explain Hadoop Archives?
Apache Hadoop HDFS stores and processes large (terabytes) data sets.
However, storing a large number of small files in HDFS is inefficient, since
each file is stored in a block, and block metadata is held in memory by the
namenode.
Reading through small files normally causes lots of seeks and lots of hopping
from datanode to datanode to retrieve each small file, which is an inefficient
data access pattern.
Hadoop Archive (HAR) basically deals with the small files issue. HAR packs a
number of small files into a large file, so one can still access the original files
in parallel, transparently (without expanding the archive) and efficiently.
Hadoop Archives are special-format archives that map to a file system
directory. A Hadoop Archive always has a *.har extension. In particular, Hadoop
MapReduce can use Hadoop Archives as input. An example of creating and
reading an archive is shown below.
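For illustration only (the paths and archive name are placeholders), a HAR file is created with the hadoop archive command and its contents can then be listed through the har:// scheme:

hadoop archive -archiveName data.har -p /user/hadoop input /user/hadoop/archives
hadoop fs -ls har:///user/hadoop/archives/data.har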
14) What do you mean by the High Availability of a NameNode in Hadoop
HDFS?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF): if the NameNode
fails, all clients, including MapReduce jobs, are unable to read, write, or list
files. In such an event, the whole Hadoop system is out of service until a
new NameNode is brought online.
Hadoop 2.0 overcomes this single point of failure by providing support for
multiple NameNodes. The high availability feature adds an extra NameNode
(an active/standby pair) to the Hadoop architecture, configured for
automatic failover. If the active NameNode fails, the standby NameNode takes
over all the responsibilities of the active node and the cluster continues to work.
The initial implementation of HDFS NameNode high availability provided for a
single active NameNode and a single standby NameNode. However, some
deployments require a higher degree of fault tolerance. Hadoop 3.0 enables
this by allowing the user to run multiple standby NameNodes. For
instance, with three NameNodes and five JournalNodes, the cluster can
tolerate the failure of two NameNodes rather than one. A sample configuration
sketch is shown below.
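As a rough sketch of the hdfs-site.xml properties involved (the nameservice name, NameNode IDs, and host names are placeholders, and the JournalNode and ZooKeeper settings are omitted here):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>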
Read HDFS NameNode High Availability in detail
15) What is Fault Tolerance in HDFS?
Fault tolerance in HDFS is the working strength of the system in unfavorable
conditions (such as the crashing of a node or a hardware failure). HDFS
controls faults through the process of replica creation. When a client stores a
file in HDFS, the file is divided into blocks, and the blocks are distributed
across different machines in the HDFS cluster. HDFS creates a replica of each
block on other machines in the cluster; by default it creates 3 copies of each
block. If any machine in the cluster goes down or fails due to unfavorable
conditions, the user can still easily access that data from the other machines
in the cluster that hold a replica of the block.
Read HDFS Fault Tolerance feature in detail
16) What is Rack Awareness?
Rack Awareness improves network traffic while reading/writing a file: the
NameNode chooses DataNodes that are on the same rack or a nearby rack. The
NameNode obtains rack information by maintaining the rack ID of each
DataNode, and DataNodes are chosen based on this rack information. In HDFS,
the NameNode makes sure that all the replicas are not stored on the same rack
or a single rack. It follows the Rack Awareness algorithm to reduce latency and
improve fault tolerance.
The default replication factor is 3. According to the Rack Awareness algorithm,
the first replica of a block is stored on the local rack, the next replica is stored
on another datanode within the same rack, and the third replica is stored on a
different rack.
In Hadoop, we need Rack Awareness because it improves:
Data high availability and reliability.
The performance of the cluster.
Network bandwidth.
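The rack ID of each DataNode typically comes from a topology script configured in core-site.xml. A minimal sketch, assuming a custom script (the path is a placeholder) that maps a DataNode address to a rack ID such as /rack1:

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology.sh</value>
</property>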
Read about Rack Awareness in Detail
17) Explain the Single point of Failure in Hadoop?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode
fails, all clients are unable to read or write files, and the whole Hadoop
system is out of service until a new NameNode is up.
Hadoop 2.0 overcomes this SPOF by providing support for multiple
NameNodes. The high availability feature adds an extra NameNode to the Hadoop
architecture and provides automatic failover. If the active NameNode
fails, the standby NameNode takes over all the responsibilities of the active node,
and the cluster continues to work.
The initial implementation of NameNode high availability provided for a single
active and a single standby NameNode. However, some deployments require a
higher degree of fault tolerance. Hadoop 3.0 enables this by allowing the user to
run multiple standby NameNodes. For instance, with three NameNodes
and five JournalNodes, the cluster can tolerate the failure of two
NameNodes rather than one.
18) Explain Erasure Coding in Hadoop?
In Hadoop, by default HDFS replicates each block three times. Replication in
HDFS is a very simple and robust form of redundancy that shields against the
failure of a datanode, but replication is very expensive: the 3x replication
scheme has 200% overhead in storage space and other resources.
Thus, Hadoop 3.x introduced Erasure Coding, a new feature that can be used in
place of replication. It provides the same level of fault tolerance with much less
storage; for example, the default Reed-Solomon policy has only 50% storage
overhead.
Erasure Coding is similar in concept to RAID (Redundant Array of Inexpensive
Disks). RAID implements EC through striping, in which logically sequential data
(such as a file) is divided into smaller units (such as a bit, byte, or block) and
stored on different disks.
Encoding- In this process, the codec calculates and stores parity cells for each
stripe of data cells, and errors are recovered through the parity. Erasure coding
extends a message with redundant data for fault tolerance. The EC codec
operates on uniformly sized data cells: it takes a number of data cells as input
and produces parity cells as output. The data cells and parity cells together are
called an erasure coding group.
There are two algorithms available for Erasure Coding:
XOR Algorithm
Reed-Solomon Algorithm
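In Hadoop 3.x, erasure coding policies are managed with the hdfs ec command. As an illustration (the path and the policy name below are examples, not values from the original answer):

hdfs ec -listPolicies
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k
hdfs ec -getPolicy -path /data/cold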
Read about Erasure Coding in detail
19) What is Disk Balancer in Hadoop?
HDFS provides a command line tool called Disk Balancer. It distributes data
evenly across all disks of a datanode. This tool operates against a given datanode
and moves blocks from one disk to another.
Disk Balancer works by creating a plan (a set of statements) and executing that
plan on the datanode. The plan describes how much data should move
between two disks and is composed of multiple steps. A move step has a source
disk, a destination disk, and the number of bytes to move. The plan is
executed against an operational datanode.
By default, Disk Balancer is not enabled; hence, to enable it,
dfs.disk.balancer.enabled must be set to true in hdfs-site.xml. Example
commands are shown after the list below.
When we write a new block in HDFS, the datanode uses a volume-choosing
policy to choose the disk for the block. Each data directory is a volume in HDFS
terminology. There are two such policies:
Round-robin: It distributes the new blocks evenly across the available
disks.
Available space: It writes data to the disk that has the most free space (by
percentage).
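A typical Disk Balancer run looks roughly like the following (the DataNode host name and the plan file path are placeholders):

hdfs diskbalancer -plan datanode1.example.com
hdfs diskbalancer -execute /system/diskbalancer/<date>/datanode1.example.com.plan.json
hdfs diskbalancer -query datanode1.example.com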
Read Hadoop Disk Balancer in Detail
If you don’t understand any Hadoop Interview Questions and Answers, ask us
in the comment section and our support team will get back to you.
20) How would you check whether your NameNode is working or not?
There are several ways to check the status of the NameNode. Mostly, one uses
the jps command to check the status of all daemons running in the HDFS.
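For example, running jps on the master node gives output roughly like the following (the process IDs will differ); if the NameNode process is missing from the list, it is not running. The NameNode web UI (port 50070 in Hadoop 2, 9870 in Hadoop 3) is another quick check.

$ jps
2451 NameNode
2673 SecondaryNameNode
3120 ResourceManager
4015 Jps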
21) Is Namenode machine same as DataNode machine as in terms of hardware?
Unlike the DataNodes, a NameNode is a highly available server that manages
the file system namespace and maintains the metadata information.
Metadata information includes the number of blocks, their locations, replicas,
and other details. It also executes file system namespace operations such as
opening, closing, and renaming files/directories.
Therefore, the NameNode requires more RAM for storing the metadata for
millions of files. A DataNode, on the other hand, is responsible for storing the
actual data in HDFS and performs read and write operations as requested by
clients. Therefore, a DataNode needs a higher disk capacity for storing huge
data sets.
Learn: Namenode High Availability in Hadoop
22) What are file permissions in HDFS and how HDFS check permissions for
files or directory?
For files and directories, the Hadoop distributed file system (HDFS) implements a
permissions model. For each file or directory, we can manage permissions for
3 distinct user classes:
The owner, the group, and others.
There are 3 different permissions for each user class: read (r), write (w),
and execute (x).
For files, the r permission is required to read the file, and the w permission is
required to write to the file.
For directories, the r permission is required to list the contents of the
directory, the w permission is required to create or delete files or
sub-directories in it, and the x permission is required to access a child of the
directory.
How HDFS checks permissions for a file or directory (example commands follow below):
If the user name matches the owner of the file or directory, HDFS tests the
owner permissions.
If the group of the file or directory matches any of the user's groups, HDFS
tests the group permissions.
HDFS tests the "other" permissions when the owner and the group names
don't match.
If none of the permission checks succeed, the client's request is denied.
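For example, permissions can be inspected and changed with the usual file system shell commands (the paths, user, and group names are placeholders):

hadoop fs -ls /user/hadoop               # shows permissions such as -rw-r--r--
hadoop fs -chmod 640 /user/hadoop/file.txt
hadoop fs -chown hadoop:analysts /user/hadoop/file.txt
hadoop fs -chgrp analysts /user/hadoop/reports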
23) If DataNode increases, then do we need to upgrade NameNode?
The NameNode stores metadata, i.e., the number of blocks, their locations, and
replicas. This metadata is kept in memory on the master for faster retrieval.
The NameNode maintains and manages the slave nodes, assigns tasks to them,
and regulates clients' access to files. It also executes file system namespace
operations such as opening, closing, and renaming files/directories.
During Hadoop installation, the NameNode is sized based on the size of the
cluster. Mostly we don't need to upgrade the NameNode when DataNodes are
added, because it does not store the actual data but only the metadata, so such
a requirement rarely arises.
These Big Data Hadoop interview questions are the selected ones which are
asked frequently, and by going through these HDFS interview questions you
will be able to answer many other related questions in your interview.
We have categorized the above Big Data Hadoop interview questions and
answers for HDFS Interview for freshers and experienced.
Hadoop Interview Questions and Answers for Freshers – Q. No.- 5 – 8.
Hadoop Interview Questions and Answers for Experienced – Q. No.- 8-23.
Here are few more frequently asked Hadoop HDFS interview Questions and
Answers for Freshers and Experienced.
NameNode – It is the master node. It is responsible for storing the metadata of all
the files and directories. It also has information about blocks, their locations,
replicas, and other details.
Datanode – It is the slave node that contains the actual data. The DataNode also
performs read and write operations as requested by the clients.
Secondary NameNode – The Secondary NameNode downloads
the FsImage and EditLogs from the NameNode, then merges the EditLogs with the
FsImage periodically. It keeps the edit log size within a limit and stores the merged
FsImage in persistent storage, which can be used in the case of
NameNode failure.
5) What is NameNode and DataNode in HDFS?
NameNode – It works as the master in a Hadoop cluster. Listed below are the main functions
performed by the NameNode:
Stores metadata of the actual data, e.g., file name, path, number of blocks, block IDs,
block locations, number of replicas, and also slave-related configuration.
It also manages the file system namespace.
Regulates client access requests for the actual file data.
It also assigns work to the slaves (DataNodes).
Executes file system namespace operations like opening/closing files and renaming
files/directories.
As the NameNode keeps the metadata in memory for fast retrieval, it requires a
large amount of memory for its operation. It should also be hosted on reliable
hardware.
DataNode – It works as a slave in a Hadoop cluster. Listed below are the main functions
performed by the DataNode:
Stores the actual business data.
It is the actual worker node, so it handles read/write/data processing.
Upon instruction from the master, it performs creation/replication/deletion of data
blocks.
As the DataNode stores all the business data, it requires a large amount of
storage for its operation. It should also be hosted on commodity hardware.
These were some general Hadoop interview questions and answers. Now let us take
some Hadoop interview questions and answers specifically for freshers.
One can also change the replication factor on a per-file basis using the command:
hadoop fs -setrep -w 3 /file_location
One can also change the replication factor for all the files in a directory by using:
hadoop fs -setrep -w 3 -R /directory_location
27) Why Hadoop performs replication, although it results in data redundancy?
In HDFS, replication provides fault tolerance and is one of the unique
features of HDFS. Data replication solves the problem of data loss in unfavorable
conditions such as hardware failure or the crashing of a node.
HDFS by default creates 3 replicas of each block across the cluster, and we can
change this as per need. So if any node goes down, we can recover the data on
that node from another node. Replication does lead to the consumption of a lot
of space, but the user can always add more nodes to the cluster if required. It is
very rare to have free-space issues in a practical cluster, since the very first
reason to deploy HDFS was to store huge data sets. Also, one can change the
replication factor to save HDFS space, or use one of the compression codecs
provided by Hadoop to compress the data.
28) What is Rack Awareness in Apache Hadoop?
In Hadoop, Rack Awareness improves network traffic while
reading/writing a file: the NameNode chooses DataNodes that are on the same
rack or a nearby rack. The NameNode obtains rack information by maintaining
the rack IDs of each DataNode, and DataNodes are chosen based on this rack
information.
The HDFS NameNode makes sure that all the replicas are not stored on a single rack or
the same rack. It follows the Rack Awareness algorithm to reduce latency and improve
fault tolerance.
The default replication factor is 3. Therefore, according to the Rack Awareness algorithm:
The first replica of the block is stored on the local rack.
The next replica is stored on another datanode within the same rack.
The third replica is stored on a different rack.
In Hadoop, we need Rack Awareness because it improves data availability and
reliability, the performance of the cluster, and network bandwidth utilization.
Note that if the Hadoop framework allocated all new blocks to a newly added
datanode and read from it, that new datanode would be overloaded.
HDFS provides a tool called Balancer that analyzes block placement and rebalances
data across the datanodes.
These are very common type of Hadoop interview questions and answers faced
during the interview of an experienced professional.
32) What is Disk Balancer in Apache Hadoop?
Disk Balancer is a command line tool that distributes data evenly across all disks of a
datanode. This tool operates against a given datanode and moves blocks from one
disk to another.
Disk Balancer works by creating and executing a plan (a set of statements) on the
datanode. The plan describes how much data should move between two disks and is
composed of multiple steps. A move step has a source disk, a destination disk, and the
number of bytes to move. The plan is executed against an operational datanode.
By default, Disk Balancer is not enabled. Hence, to enable it,
dfs.disk.balancer.enabled must be set to true in hdfs-site.xml.
When we write a new block in HDFS, the datanode uses a volume-choosing policy to
choose the disk for the block. Each data directory is a volume in HDFS terminology.
There are two such policies: round-robin and available space.
Round-robin distributes the new blocks evenly across the available disks.
Available space writes data to the disk that has the most free space (by
percentage).
Read about HDFS Disk Balancer in detail.
33) What is active and passive NameNode in Hadoop?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails,
all clients are unable to read, write, or list files, and the whole Hadoop system is
out of service until a new NameNode is up.
Hadoop 2.0 overcomes this SPOF by providing support for multiple
NameNodes. The high availability feature adds an extra NameNode to the Hadoop
architecture for automatic failover.
Active NameNode – It is the NameNode that runs in the cluster and is
responsible for all client operations in the cluster.
Passive NameNode – It is a standby NameNode that has the same data as the active
NameNode. It simply acts as a slave and maintains enough state to provide a fast
failover, if necessary.
If the active NameNode fails, the passive NameNode takes over all the responsibilities
of the active node, and the cluster continues to work.
34) How is indexing done in Hadoop HDFS?
Apache Hadoop has a unique way of indexing. Once the Hadoop framework stores the
data according to the data block size, HDFS keeps storing the last part of the data,
which indicates where the next part of the data will be. In fact, this is the basis of
indexing in HDFS.
35) What is a Block Scanner in HDFS?
The block scanner verifies whether the data blocks stored on each DataNode are correct
or not. When the block scanner detects a corrupted data block, the following steps occur:
First of all, the DataNode reports the corrupted block to the NameNode.
After that, the NameNode starts the process of creating a new replica, using a
correct replica of the corrupted block present on another DataNode.
When the replication count of the correct replicas matches the replication factor
(3), the corrupted block is deleted.
36) How to perform the inter-cluster data copying work in HDFS?
HDFS uses the distributed copy (DistCp) command to perform inter-cluster data copying,
as below:
hadoop distcp hdfs://<source NameNode>/<source path> hdfs://<target NameNode>/<target path>
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling,
recovery, and reporting. This distributed copy tool expands a list of files and directories into input for map tasks.
37) What are the main properties of hdfs-site.xml file?
hdfs-site.xml – It specifies configuration setting for HDFS daemons in Hadoop. It also provides default block replication and permission
checking on HDFS.
The three main hdfs-site.xml properties are:
1. dfs.name.dir gives the location where the NameNode stores the metadata
(FsImage and edit logs), whether on local disk or in a remote directory.
2. dfs.data.dir gives the location where the DataNodes store the data.
3. fs.checkpoint.dir is the directory on the file system where the Secondary
NameNode stores temporary images of the edit logs.
A sample snippet of these entries is shown below.
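A minimal sketch of such hdfs-site.xml entries; the paths are placeholders, and in Hadoop 2.x and later these properties are named dfs.namenode.name.dir, dfs.datanode.data.dir, and dfs.namenode.checkpoint.dir:

<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/datanode</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/data/hadoop/secondary</value>
</property>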
38) How can one check whether NameNode is working or not?
One can check the status of the HDFS NameNode in several ways. Most usually, one
uses the jps command to check the status of all daemons running in the HDFS.
39) How would you restart NameNode?
NameNode is also known as Master node. It stores meta-data i.e. number of blocks,
replicas, and other details. NameNode maintains and manages the slave nodes, and
assigns tasks to them.
You can restart the NameNode by either of the following two methods:
First stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop
namenode command. Then start the NameNode using the ./sbin/hadoop-daemon.sh
start namenode command.
Use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all
the daemons first and then start all of them again.
The above Hadoop interview questions and answers were for experienced but
freshers can also refer these Hadoop interview questions and answers for in depth
knowledge. Now let’s move forward with some advanced Hadoop interview
questions and answers.
If the user name matches the owner of the file or directory, Hadoop tests the owner
permissions.
If the group matches the directory's group, then Hadoop tests the user's group
permissions.
Hadoop tests the "other" permission when the owner and the group names don't
match.
If none of the permission checks succeed, the client's request is denied.
45) How one can format Hadoop HDFS?
One can format HDFS by using the bin/hadoop namenode -format command.
The bin/hadoop namenode -format command formats HDFS via the NameNode.
Formatting implies initializing the directory specified by the dfs.name.dir variable.
When you run this command on an existing file system, you will lose all the
data stored on your NameNode.
The Hadoop NameNode directory contains the FsImage and edit files, which hold the basic
information about the Hadoop file system, such as which user created which files.
Hence, when we format the NameNode, it deletes this information from the
directory. This directory is specified in hdfs-site.xml as dfs.namenode.name.dir.
Formatting the NameNode does not format the DataNodes.
NOTE: Never format an up-and-running Hadoop file system; you will lose the data
stored in HDFS.
46) What is the process to change the files at arbitrary locations in HDFS?
HDFS doesn't support modifications at arbitrary offsets in a file or multiple
writers. Files are written by a single writer in append-only fashion: writes to a file in
HDFS are always made at the end of the file.
47) Differentiate HDFS & HBase.
Data write process
HDFS- Append method
HBase- Bulk incremental, random write
Data read process
HDFS- Table scan
HBase- Table scan/random read/small range scan
Hive SQL querying
HDFS- Excellent
HBase- Average
Read about HBase in detail.
These are some advanced Hadoop interview questions and answers for HDFS that
will help you answer many more interview questions in the best manner.
48) What is meant by streaming access?
HDFS works on the principle of "write once, read many". Its focus is on fast and
accurate data retrieval. Streaming access means reading the complete data set in
sequence instead of retrieving a single record from the database.
49) How to transfer data from Hive to HDFS?
One can transfer data from Hive by writing the query:
hive> insert overwrite directory '/' select * from emp;
Hence, the output you receive will be stored in part files in the specified HDFS path.
50) How to add/delete a Node to the existing cluster?
To add a node to the existing cluster:
Add the new node's host name/IP address to the dfs.hosts/slaves file. Then, refresh
the cluster with hadoop dfsadmin -refreshNodes.
To remove a node from the existing cluster:
Add the host name/IP address to dfs.hosts.exclude and remove the entry from the
slaves file. Then, refresh the cluster with hadoop dfsadmin -refreshNodes.
51) How to format the HDFS? How frequently it will be done?
These types of Hadoop interview questions are answered very short and to the
point; giving a very lengthy answer here is unnecessary and may lead to
negative points.
hadoop namenode -format
Note: Format the HDFS only once, during the initial cluster setup.
52) What is the importance of dfs.namenode.name.dir in HDFS?
dfs.namenode.name.dir contains the FsImage file for the NameNode.
We should configure it to write to at least two directories on separate physical
disks or file systems, because if we lose the FsImage file we lose the entire HDFS
file system, and there is no other recovery mechanism if no FsImage file is
available.
Numbers 40-52 were the advanced Hadoop interview questions and answers to get
in-depth knowledge of handling difficult Hadoop interview questions and answers.
This was all about the Hadoop Interview Questions and Answers
These questions are frequently asked Hadoop interview questions and answers. You
can read here some more Hadoop HDFS interview questions and answers.
After going through these top Hadoop Interview questions and answers you will be
able to confidently face an interview and will be able to answer the Hadoop Interview
questions asked in your interview in the best manner. These Hadoop
Interview Questions are suggested by the experts at DataFlair.
Key –
Q.1 – Q.5 Basic Hadoop Interview Questions
Q.6 – Q.10 HDFS Hadoop interview questions and answers for freshers
Q. 11- Q. 20 Frequently asked Questions in Hadoop Interview
Q.21 – Q39 were the HDFS Hadoop interview questions and answer for experienced
Q.40 – Q.52 were the advanced HDFS Hadoop interview questions and answers
These Hadoop interview questions and answers are categorized so that you can pay
more attention to questions specified for you, however, it is recommended that you
go through all the Hadoop interview questions and answers for complete
understanding.
If you have any more doubt or query on Hadoop Interview Questions and Answers,
Drop a comment and our support team will be happy to help you. Now let’s jump to
our second part of Hadoop Interview Questions i.e. MapReduce Interview Questions
and Answers.
These are very common type of MapReduce Hadoop interview questions and
answers faced during the interview of a Fresher.
68) How to write a custom partitioner for a Hadoop MapReduce job?
This is one of the most common MapReduce Hadoop interview questions.
A custom partitioner distributes the map output across the reducers based on a
user-defined condition. By setting a partitioner to partition by the key, we can
guarantee that records for the same key will go to the same reducer; it also ensures
that only one reducer receives all the records for that particular key.
By the following steps, we can write a custom partitioner for a Hadoop MapReduce
job (a sketch is shown below):
Create a new class that extends the Partitioner class.
Then, override the getPartition method, which the MapReduce framework calls
for every map output record.
Add the custom partitioner to the job by using the setPartitionerClass method, or
add the custom partitioner to the job as a config file.
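A minimal sketch of such a partitioner, assuming Text keys and IntWritable values; the routing rule (first letter of the key) is purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                        // map-only job, nothing to partition
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        // keys starting with a-m go to one partition, n-z to the other
        return (first <= 'm' ? 0 : 1) % numReduceTasks;
    }
}

// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);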
69) What is shuffling and sorting in Hadoop MapReduce?
Shuffling and sorting take place after the completion of the map task, and the shuffle
and sort phases in Hadoop occur simultaneously.
Shuffling- Shuffling is the process by which the system sorts the key-value output
of the map tasks and transfers it to the reducers. The shuffle phase is necessary for
the reducers; otherwise, they would not have any input. Shuffling can start even
before the map phase has finished, which saves some time and completes the
task in less time.
Sorting- The mappers generate intermediate key-value pairs. Before the reducers
start, the MapReduce framework sorts these key-value pairs by key. This also
helps the reducers easily distinguish when a new reduce task should start, which
saves time for the reducer.
Shuffling and sorting are not performed at all if you specify zero reducers
(setNumReduceTasks(0)).
Read about Shuffling and Sorting in detail.
70) Why aggregation cannot be done in Mapper?
The mapper task processes each input record (from the RecordReader) and generates a
key-value pair; the mapper stores its intermediate output on the local disk.
We cannot perform aggregation in the mapper because:
Sorting takes place only before the reducer function; there is no provision for
sorting in the mapper function, and without sorting, aggregation is not possible.
To perform aggregation, we need the output of all the mapper functions, which
may not be possible to collect in the map phase because the mappers may be
running on different machines where the data blocks are present.
If we try to aggregate data in the mappers, it requires
communication between all the mapper functions, which may be running on
different machines. This consumes high network bandwidth and can
cause network bottlenecks.
71) Explain map-only job?
MapReduce is the data processing layer of Hadoop. It is the framework for writing
applications that process the vast amount of data stored in HDFS. It processes
a huge amount of data in parallel by dividing the job into a set of independent
tasks (sub-jobs). In Hadoop, MapReduce has 2 phases of processing: Map and
Reduce.
In the Map phase we specify all the complex logic/business rules/costly code. Map takes
a set of data and converts it into another set of data, breaking individual
elements into tuples (key-value pairs). In the Reduce phase we specify light-weight
processing like aggregation/summation. Reduce takes the output from the map as
input and combines the tuples (key-value pairs) based on the key, modifying
the value of the key accordingly.
Consider a case where we just need to perform an operation and no aggregation is
required. In such a case, we will prefer a "map-only job" in Hadoop. In a map-only
job, the map does all the work on its InputSplit and the reducer does no job; the map
output is the final output.
We can achieve this by setting job.setNumReduceTasks(0) in the driver configuration,
as sketched below. This makes the number of reducers 0, and thus only the mapper
does the complete task.
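A minimal driver-side sketch of a map-only job (a fragment of the driver's main method); the class names and input/output paths are placeholders, and the usual imports from org.apache.hadoop.mapreduce and its lib.input/lib.output packages are assumed:

Job job = Job.getInstance(new Configuration(), "map-only-example");
job.setJarByClass(MapOnlyDriver.class);
job.setMapperClass(MyMapper.class);
job.setNumReduceTasks(0);               // zero reducers: map output is the final output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);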
Read about map-only job in Hadoop Mapreduce in detail.
72) What is SequenceFileInputFormat in Hadoop MapReduce?
SequenceFileInputFormat is an InputFormat that reads sequence files. Sequence
files are binary files that store sequences of binary key-value pairs and can be
block-compressed. Sequence files provide direct serialization and deserialization of
several arbitrary data types.
Here both the key and the value are user-defined.
SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the
sequence file's keys and values to Text objects by calling toString() on them.
This InputFormat makes sequence files suitable input for Streaming.
SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat with which
we can extract the sequence file's keys and values as opaque binary objects.
The above 58 – 72 MapReduce Hadoop interview questions and answers were for
freshers, However experienced can also go through these MapReduce Hadoop
interview questions and answers for revising the basics.
The number of bytes read and written within the map/reduce job is correct or not.
The number of tasks launched and successfully run in the map/reduce job is correct
or not.
The amount of CPU and memory consumed is appropriate for our job and cluster
nodes.
There are two types of counters:
Built-in counters – Hadoop maintains some built-in counters for every job.
These report various metrics; for example, there are counters for the number of bytes
and records, which allow us to confirm that the job consumes the expected amount of
input and produces the expected amount of output.
User-defined counters – Hadoop MapReduce permits user code to define a set of
counters, which are then incremented as desired in the mapper or reducer. For
example, in Java, counters are defined with an 'enum' (a sketch is shown below).
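A minimal sketch of a user-defined counter; the enum name and the "bad record" condition are illustrative only:

public enum QualityCounters { BAD_RECORDS }

// Inside the mapper's map() method:
//   if (fields.length < EXPECTED_FIELDS) {
//       context.getCounter(QualityCounters.BAD_RECORDS).increment(1);
//       return;
//   }

// After job completion, in the driver:
//   long bad = job.getCounters().findCounter(QualityCounters.BAD_RECORDS).getValue();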
Read about Counters in detail.
79) How to submit extra files(jars,static files) for MapReduce job during runtime in
Hadoop?
The MapReduce framework provides the Distributed Cache to cache files needed by
applications. It can cache read-only text files, archives, JAR files, etc.
An application that needs to distribute a file using the distributed cache should make
sure that the file is available at a URL, which can be either hdfs:// or http://.
If the file is present at such a URL, the user marks it as a cache file to distribute.
The framework copies the cache file to all the nodes before any tasks start on those
nodes. The files are copied only once per job, and applications should not modify
those files. A usage sketch is shown below.
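A minimal sketch using the newer Job API; the HDFS path is a placeholder, and java.net.URI is assumed to be imported:

// In the driver:
Job job = Job.getInstance(conf, "join-with-lookup");
job.addCacheFile(new URI("hdfs:///user/hadoop/lookup/countries.txt"));

// In the mapper's setup() method:
//   URI[] cached = context.getCacheFiles();
//   // open the local copy of the cached file and load it into memory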
80) What is TextInputFormat in Hadoop?
TextInputFormat is the default InputFormat. It treats each line of the input file as a
separate record. It is useful for unformatted data or line-based records like log files.
Its RecordReader converts each record into a key-value pair, where:
Key- It is the byte offset of the beginning of the line.
Value- It is the contents of the line, excluding line terminators.
File content is- on the top of the building
so,
Key- 0
Value- on the top of the building
TextInputFormat uses LineRecordReader as its RecordReader
(SequenceFileInputFormat, by contrast, uses SequenceFileRecordReader).
81) How many Mappers run for a MapReduce job?
The number of mappers depends on 2 factors:
The amount of data we want to process, along with the block size. The number of
mappers is driven by the number of InputSplits: if we have a block size of 128 MB
and we expect 10 TB of input data, we will have roughly 82,000 maps. Ultimately
the InputFormat determines the number of maps.
The configuration of the slave, i.e., the number of cores and the RAM available on
the slave. The right number of maps per node can be between 10 and 100. The
Hadoop framework should allot 1 to 1.5 cores of processor to each mapper, so for a
15-core processor, 10 mappers can run.
In a MapReduce job, we can control the number of mappers by changing the block size:
changing the block size increases or decreases the number of InputSplits. We can also
suggest a higher number of map tasks by using JobConf's conf.setNumMapTasks(int num).
Mappers = (total data size) / (input split size)
For example, if the data size is 1 TB and the input split size is 100 MB:
Mappers = (1000 * 1000) / 100 = 10,000
82) How many Reducers run for a MapReduce job?
Answer these types of MapReduce Hadoop interview questions very shortly
and to the point.
With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the
job. To set the right number of reducers, use the formula:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)
With 0.95, all the reducers can launch immediately as the maps finish and start
transferring map output. With 1.75, the faster nodes finish their first round of reduces
and launch a second wave of reduces.
With an increase in the number of reducers:
Load balancing increases.
The cost of failures decreases.
Framework overhead increases.
These are very common type of MapReduce Hadoop interview questions and
answers faced during the interview of an experienced professional.
83) How to sort intermediate output based on values in MapReduce?
Hadoop MapReduce automatically sorts the key-value pairs generated by the mapper,
and sorting takes place on the basis of keys. Thus, to sort the intermediate output based on
values, we need to use secondary sorting.
There are two possible approaches:
First, in the reducer: the reducer reads and buffers all the values for a given key, then
does an in-reducer sort on the values. Since the reducer receives all the values for a
given key (a potentially huge list), this can cause the reducer to run out of memory.
Thus, this approach works well only if the number of values is small.
Second, using the MapReduce framework itself: sort the reducer input values by creating
a composite key (the value-to-key conversion approach), i.e., by adding a part
of, or the entire, value to the natural key. This approach
is scalable and will not generate out-of-memory errors.
We also need a custom partitioner so that all the data with the same natural key (even
though it carries a composite key) goes to the same reducer, and a custom grouping
comparator so that the data is grouped by the natural key once it arrives at the
reducer.
84) What is purpose of RecordWriter in Hadoop?
The reducer takes the mapper output (intermediate key-value pairs) as input and runs a
reducer function on it to generate output (zero or more key-value pairs); the output of
the reducer is the final output.
RecordWriter writes these output key-value pairs from the reducer phase to output
files. OutputFormat determines how RecordWriter writes these key-value pairs to the
output files. Hadoop provides OutputFormat instances that help write files to
HDFS or the local disk.
85) What are the most common OutputFormat in Hadoop?
The reducer takes the mapper output as input and produces output (zero or more key-value
pairs). RecordWriter writes these output key-value pairs from the reducer phase to
output files, and OutputFormat determines how RecordWriter writes them.
The FileOutputFormat.setOutputPath() method is used to set the output directory, so every
reducer writes a separate file into a common output directory.
Most common OutputFormat are:
TextOutputFormat – It is the default OutputFormat in MapReduce.
TextOutputFormat writes key-value pairs on individual lines of text files. Keys
and values of this format can be of any type. Because TextOutputFormat turns
them to string by calling toString() on them.
SequenceFileOutputFormat – This OutputFormat writes sequences files for its
output. It is also used between MapReduce jobs.
SequenceFileAsBinaryOutputFormat – It is another form of
SequenceFileOutputFormat, which writes keys and values to a sequence file in
binary format.
DBOutputFormat – We use this for writing to relational databases and HBase.
It sends the reduce output to a SQL table. It accepts key-value pairs, where
the key has a type extending DBWritable.
Read about outputFormat in detail.
86) What is LazyOutputFormat in Hadoop?
FileOutputFormat subclasses will create output files (part-r-nnnnn), even if they are
empty. Some applications prefer not to create empty files, which is
where LazyOutputFormat helps.
LazyOutputFormat is a wrapper OutputFormat. It makes sure that an output file is
created only when the first record is emitted for a given partition.
To use LazyOutputFormat, call its setOutputFormatClass() method with the job configuration.
For Streaming and Pipes, LazyOutputFormat is enabled with the -lazyOutput option.
87) How to handle record boundaries in Text files or Sequence files in MapReduce
InputSplits?
An InputSplit's RecordReader in MapReduce will "start" and "end" at a record
boundary.
In a SequenceFile, a 20-byte sync mark is written roughly every 2 KB between records.
These sync marks allow the RecordReader to seek to the start of the InputSplit (which
consists of a file, an offset, and a length), find the first sync mark after the start of
the split, and continue processing records until it reaches the first sync mark after the
end of the split.
Similarly, text files use newlines instead of sync marks to handle record boundaries.
88) What are the main configuration parameters in a MapReduce program?
The main configuration parameters are:
Input format of data.
Job’s input locations in the distributed file system.
Output format of data.
Job’s output location in the distributed file system.
JAR file containing the mapper, reducer and driver classes
Class containing the map function.
Class containing the reduce function.
89) Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory.
The Hadoop cluster, by default, takes the input and the output format as 'text'.
TextInputFormat – MapReduce default InputFormat is TextInputFormat. It treats
each line of each input file as a separate record and also performs no parsing. For
unformatted data or line-based records like log files, TextInputFormat is also useful.
By default, RecordReader also uses TextInputFormat for converting data into key-
value pairs.
TextOutputFormat- MapReduce default OutputFormat is TextOutputFormat. It also
writes (key, value) pairs on individual lines of text files. Its keys and values can be of
any type.
90) What is Identity Mapper?
Identity Mapper is the default mapper provided by Hadoop. When a MapReduce
program has not defined any mapper class, the Identity Mapper runs. It simply
passes the input key-value pairs on to the reducer phase. Identity Mapper does not
perform any computation or calculation on the input data; it only writes the input
data to the output.
The class name is org.apache.hadoop.mapred.lib.IdentityMapper
91) What is Identity reducer?
Identity Reducer is the default reducer provided by Hadoop. When a MapReduce
program has not defined any reducer class, the Identity Reducer runs. It does not
mean that the reduce step will not take place: it will take place, and the related sorting
and shuffling will also happen, but there will be no aggregation. So you can use the
Identity Reducer if you want to sort the data coming from the map but don't
care about any grouping.
The above MapReduce Hadoop interview questions and answers i.e Q. 73 – Q. 91
were for experienced but freshers can also refer these MapReduce Hadoop interview
questions and answers for in depth knowledge. Now let’s move forward with some
advanced MapReduce Hadoop interview questions and answers.