
Q. 1 What is Apache Hadoop?

Hadoop emerged as a solution to "Big Data" problems. It is part of the Apache
project sponsored by the Apache Software Foundation (ASF). It is an open
source software framework for distributed storage and distributed processing
of large data sets. Open source means it is freely available and we can even
change its source code as per our requirements. Apache Hadoop makes it
possible to run applications on a system with thousands of commodity
hardware nodes. Its distributed file system provides rapid data transfer rates
among nodes and allows the system to continue operating in case of node
failure. Apache Hadoop provides:
 Storage layer – HDFS
 Batch processing engine – MapReduce
 Resource Management Layer – YARN
Q.2 Why do we need Hadoop?
Hadoop came into existence to deal with Big Data challenges. The challenges
with Big Data are:
 Storage – Since the data is very large, storing such a huge amount of it is
difficult.
 Security – Since the data is huge in size, keeping it secure is another
challenge.
 Analytics – In Big Data, most of the time we are unaware of the kind of
data we are dealing with, so analyzing that data is even more difficult.
 Data Quality – In the case of Big Data, data is very messy, inconsistent and
incomplete.
 Discovery – Using powerful algorithms to find patterns and insights is very
difficult.
Hadoop is an open-source software framework that supports the storage and
processing of large data sets. Apache Hadoop is the best solution for storing
and processing Big data because:

 Apache Hadoop stores huge files as they are (raw) without specifying any
schema.
 High scalability – We can add any number of nodes, hence enhancing
performance dramatically.
 Reliable – It stores data reliably on the cluster despite machine failure.
 High availability – In Hadoop data is highly available despite hardware
failure. If a machine or hardware crashes, then we can access data from
another path.
 Economic – Hadoop runs on a cluster of commodity hardware, which is not
very expensive.
Follow this link to know about Features of Hadoop
Q.3 What are the core components of Hadoop?
Hadoop is an open-source software framework for distributed storage and
processing of large datasets. Apache Hadoop core components are HDFS,
MapReduce, and YARN.
 HDFS- Hadoop Distributed File System (HDFS) is the primary storage
system of Hadoop. HDFS stores very large files on a cluster of commodity
hardware. It works on the principle of storing a small number of large files
rather than a huge number of small files. HDFS stores data reliably even in
the case of hardware failure. It provides high-throughput access to
applications by allowing data access in parallel.
 MapReduce- MapReduce is the data processing layer of Hadoop. It is a
framework for writing applications that process large structured and
unstructured data stored in HDFS. MapReduce processes a huge amount
of data in parallel by dividing the submitted job into a set of independent
tasks (sub-jobs). In Hadoop, MapReduce works by breaking the processing
into phases: Map and Reduce. Map is the first phase of processing, where
we specify all the complex logic code. Reduce is the second phase of
processing, where we specify light-weight processing like
aggregation/summation.
 YARN- YARN is the processing framework in Hadoop. It provides resource
management and allows multiple data processing engines, for example
real-time streaming, data science, and batch processing.
Read Hadoop Ecosystem Components in detail.
Q.4 What are the Features of Hadoop?
The various Features of Hadoop are:

 Open Source – Apache Hadoop is an open source software framework.
Open source means it is freely available and we can even change its source
code as per our requirements.
 Distributed processing – HDFS stores data in a distributed manner across
the cluster, and MapReduce processes the data in parallel on the cluster of
nodes.
 Fault Tolerance – Apache Hadoop is highly fault-tolerant. By default, each
block has 3 replicas across the cluster, and we can change this as per our
need. So if any node goes down, we can recover the data on that node from
another node. The framework recovers node or task failures automatically.
 Reliability – It stores data reliably on the cluster despite machine failure.
 High Availability – Data is highly available and accessible despite
hardware failure. In Hadoop, when a machine or hardware crashes, we can
access the data from another path.
 Scalability – Hadoop is highly scalable, as one can add new nodes and
hardware to the cluster.
 Economic – Hadoop runs on a cluster of commodity hardware which is not
very expensive. We do not need any specialized machine for it.
 Easy to use – The client does not need to deal with distributed computing;
the framework takes care of all of it. So it is easy to use.
Read Hadoop Features in detail
Q.5 Compare Hadoop and RDBMS?
Apache Hadoop is seen as the future of data storage and processing because it
stores and processes very large amounts of data, which is not possible with a
traditional database. The differences between Hadoop and an RDBMS are as
follows:
 Architecture – A traditional RDBMS has ACID properties, whereas Hadoop
is a distributed computing framework with two main components: the
distributed file system (HDFS) and MapReduce.
 Data acceptance – An RDBMS accepts only structured data, while Hadoop
can accept both structured and unstructured data. This is a great feature
of Hadoop, as we can store everything and there is no data loss.
 Scalability – An RDBMS is a traditional database that provides vertical
scalability: if the data grows, we have to scale up the configuration of a
particular system. Hadoop provides horizontal scalability: we just add one
or more nodes to the cluster when the data grows.
 OLTP (real-time data processing) and OLAP – A traditional RDBMS
supports OLTP (real-time data processing). OLTP is not supported in
Apache Hadoop; Apache Hadoop supports large-scale batch processing
workloads (OLAP).
 Cost – An RDBMS is licensed software, so we have to pay for it, whereas
Hadoop is an open source framework, so we don't need to pay for the
software.
Read: 50+ Hadoop MapReduce Interview Questions and Answers
If you have any doubts or queries regarding these Hadoop Interview Questions
at any point, you can ask them in the comment section and our support team
will get back to you.

3. Hadoop Interview Questions for Freshers
The following Hadoop Interview Questions are for freshers and students;
however, experienced professionals can also refer to them for revising the basics.
Q.6 What are the modes in which Hadoop runs?
Apache Hadoop runs in three modes:
 Local (Standalone) Mode – By default Hadoop runs in a single-node, non-
distributed mode, as a single Java process. Local mode uses the local file
system for input and output operations. It is also used for debugging
purposes, and it does not support the use of HDFS. Further, in this mode,
no custom changes are required in the configuration files.
 Pseudo-Distributed Mode – Just like Standalone mode, Hadoop runs on a
single node in Pseudo-distributed mode. The difference is that each
daemon runs in a separate Java process in this mode. In Pseudo-
distributed mode, we need to configure all four configuration files (core-
site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml). In this case, all
daemons run on one node and thus both the Master and Slave node are
the same.
 Fully-Distributed Mode – In this mode, all daemons execute on separate
nodes forming a multi-node cluster. Thus, it allows separate nodes for
Master and Slaves.
Read: Know what is Rack Awareness in Hadoop
Q.7 What are the features of Standalone (local) mode?
By default, Hadoop runs in a single-node, non-distributed mode, as a single
Java process. Local mode uses the local file system for input and output
operations. One can also use it for debugging purposes. It does not support the
use of HDFS. Standalone mode is suitable only for running programs during
development and testing. Further, in this mode, no custom changes are
required in the configuration files. The configuration files are:
 core-site.xml
 hdfs-site.xml
 mapred-site.xml
 yarn-site.xml
Q.8 What are the features of Pseudo mode?
Just like Standalone mode, Hadoop can also run on a single node in this
mode. The difference is that each Hadoop daemon runs in a separate Java
process in this mode. In Pseudo-distributed mode, we need to configure all
four configuration files mentioned above. In this case, all daemons run on one
node and thus both the Master and Slave node are the same.
Pseudo-distributed mode is suitable for both development and testing
environments. In this mode, all the daemons run on the same machine.
Read: How Hadoop MapReduce Works
Q.9 What are the features of Fully-Distributed mode?
In this mode, all daemons execute on separate nodes forming a multi-node
cluster. Thus, it allows separate nodes for Master and Slaves.
We use this mode in the production environment, where 'n' machines form a
cluster. Hadoop daemons run on this cluster of machines: there is one host on
which the NameNode runs and other hosts on which DataNodes run. A
NodeManager is installed on every DataNode machine and is responsible for
the execution of tasks on that node.
The ResourceManager manages all these NodeManagers. It receives the
processing requests and then passes the corresponding parts of each request
to the appropriate NodeManagers.
Q.10 What are configuration files in Hadoop?
core-site.xml – It contains configuration settings for Hadoop core, such as I/O
settings that are common to HDFS and MapReduce. It specifies the hostname
and port of the default file system; the most commonly used port is 9000.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

hdfs-site.xml – This file contains the configuration settings for the HDFS
daemons. hdfs-site.xml also specifies the default block replication and
permission checking on HDFS.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

mapred-site.xml – In this file, we specify the framework to use for
MapReduce by setting the mapreduce.framework.name property.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

yarn-site.xml – This file provides the configuration settings
for the NodeManager and ResourceManager.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Q.11 What are the limitations of Hadoop?
Various limitations of Hadoop are:
 Issue with small files – Hadoop is not suited for small files. Small files are
a major problem in HDFS. A small file is significantly smaller than the
HDFS block size (default 128 MB). If we store a large number of such
small files, HDFS cannot handle them well, as HDFS is designed to work
with a small number of large files rather than a huge number of small
files. A huge number of small files overloads the NameNode, since the
NameNode stores the namespace of HDFS. HAR files, sequence files, and
HBase overcome the small-files issue.
 Processing Speed – MapReduce processes large data sets with a parallel
and distributed algorithm, performing the Map and Reduce tasks. These
tasks require a lot of time, which increases latency: because data is
distributed and processed over the cluster, MapReduce adds overhead and
reduces processing speed.
 Support only Batch Processing – Hadoop supports only batch processing.
It does not process streamed data, and hence overall performance is
slower. The MapReduce framework does not leverage the memory of the
cluster to the maximum.
 Iterative Processing – Hadoop is not efficient for iterative processing, as it
does not support cyclic data flow, that is, a chain of stages in which the
input to the next stage is the output of the previous stage.
 Vulnerable by nature – Hadoop is written entirely in Java, one of the most
widely used languages, and Java has been heavily exploited by
cyber-criminals. As a result, Hadoop has been implicated in numerous
security breaches.
 Security – Hadoop can be challenging to manage for complex applications.
Hadoop is missing encryption at the storage and network levels, which is
a major point of concern. Hadoop supports Kerberos authentication, which
is hard to manage.
Read more on Hadoop Limitations

4. Hadoop Interview Questions for Experienced
The following core Hadoop Interview Questions are for experienced
professionals, but freshers and students can also read and refer to them for
advanced understanding.
Q.12 Compare Hadoop 2 and Hadoop 3?
 In Hadoop 2, the minimum supported version of Java is Java 7, while in
Hadoop 3 it is Java 8.
 Hadoop 2 handles fault tolerance by replication (which wastes space),
while Hadoop 3 handles it by erasure coding.
 For data balancing, Hadoop 2 uses the HDFS balancer, while Hadoop 3
also provides an intra-datanode balancer.
 In Hadoop 2, some default service ports fall in the Linux ephemeral port
range, so at startup they may fail to bind. In Hadoop 3 these ports have
been moved out of the ephemeral range.
 In Hadoop 2, HDFS replication has 200% overhead in storage space, while
erasure coding in Hadoop 3 has about 50% overhead.
 Both versions provide features to overcome the NameNode single point of
failure (SPOF), so whenever the NameNode fails it recovers automatically.
Hadoop 3 goes further by supporting more than one standby NameNode,
so no manual intervention is needed to overcome it.
Read about more Comparison between Hadoop 2 and Hadoop 3
13) Explain Data Locality in Hadoop?
A major drawback in Hadoop was cross-switch network traffic due to the huge
volume of data. To overcome this drawback, data locality came into the
picture. It refers to the ability to move the computation close to where the
actual data resides on the node, instead of moving large data to the
computation. Data locality increases the overall throughput of the system.
In Hadoop, HDFS stores datasets. Datasets are divided into blocks and stored
across the datanodes in the Hadoop cluster. When a user runs a MapReduce
job, the framework sends the MapReduce code to the datanodes on which data
related to the MapReduce job is available.
Data locality has three categories:
 Data local – In this category the data is on the same node as the mapper
working on the data, so the data is as close as possible to the computation.
This is the most preferred scenario.
 Intra-Rack – In this scenario the mapper runs on a different node but on
the same rack, as it is not always possible to execute the mapper on the
same datanode due to constraints.
 Inter-Rack – In this scenario the mapper runs on a different rack, as it is
not possible to execute the mapper on a node in the same rack due to
resource constraints.
14) What is Safemode in Hadoop?
Safemode in Apache Hadoop is a maintenance state of the NameNode during
which the NameNode doesn't allow any modifications to the file system.
During Safemode, the HDFS cluster is read-only and doesn't replicate or
delete blocks.
At the startup of the NameNode:
 It loads the file system namespace from the last saved FsImage into its
main memory, along with the edits log file.
 It merges the edits log file with the FsImage, which results in a new file
system namespace.
 Then it receives block reports containing information about block
locations from all the datanodes.
In Safemode the NameNode collects these block reports from the datanodes.
The NameNode enters safemode automatically during its startup and leaves
Safemode after the DataNodes have reported that most blocks are available.
Use the commands:
hdfs dfsadmin -safemode get: to know the status of Safemode
hdfs dfsadmin -safemode enter: to enter Safemode
hdfs dfsadmin -safemode leave: to come out of Safemode
The NameNode front page also shows whether safemode is on or off.
15) What is the problem with small files in Hadoop?
Hadoop is not suited for small data. Hadoop HDFS lacks the ability to support
random reading of small files. A small file in HDFS is one significantly
smaller than the HDFS block size (default 128 MB). If we store huge numbers
of such small files, HDFS can't handle them well: HDFS is designed for a
small number of large files storing large datasets, not for a large number of
small files. A large number of small files overloads the NameNode, since it
stores the namespace of HDFS.
Solution:
HAR (Hadoop Archive) files deal with the small-files issue. HAR introduces a
layer on top of HDFS which provides an interface for file access. Using the
Hadoop archive command we can create HAR files; the command runs a
MapReduce job to pack the archived files into a smaller number of HDFS
files. Reading through files in a HAR is not more efficient than reading
through files in HDFS, since each HAR file access requires reading two index
files as well as the data file, which makes it slower.
Sequence files also deal with the small-file problem. In this approach, we use
the filename as the key and the file contents as the value. If we have 10,000
files of 100 KB, we can write a program to put them into a single sequence
file and then process them in a streaming fashion.
Read: Hadoop Mapper – 4 Steps Learning to MapReduce Mapper
16) What is a “Distributed Cache” in Apache Hadoop?
In Hadoop, data chunks are processed independently and in parallel among
the DataNodes, using a program written by the user. If we want to access
some files from all the DataNodes, then we put those files into the distributed
cache.
The MapReduce framework provides Distributed Cache to cache files needed
by the applications. It can cache read-only text files, archives, jar files etc.
Once we have cached a file for our job, Hadoop makes it available on each
datanode where map/reduce tasks are running, so we can access the file from
all the datanodes in our map and reduce tasks.
An application which needs to use the distributed cache should make sure
that the files are available at URLs, which can be either hdfs:// or http://. If
the file is present at the mentioned URL, the user declares it to be a cache file
for the distributed cache. The framework copies the cache file to all the nodes
before starting the tasks on those nodes. By default the size of the distributed
cache is 10 GB; we can adjust it using local.cache.size.
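As a rough illustration of the newer Java API (the file path and job setup are hypothetical, not taken from this article), a client can register a cache file on the Job and a mapper can read the registered URIs back in setup():
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
  public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // The cached file is copied to every node before the tasks start;
      // here we simply retrieve the URIs that were registered on the job.
      URI[] cacheFiles = context.getCacheFiles();
      // ... open and load the lookup file as needed ...
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "distributed cache sketch");
    job.setJarByClass(CacheExample.class);
    job.setMapperClass(LookupMapper.class);
    // Hypothetical HDFS path; the file must already exist at this URL.
    job.addCacheFile(new URI("hdfs://localhost:9000/cache/lookup.txt"));
    // ... set input/output paths and formats, then submit the job ...
  }
}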
Read Hadoop distributed cache in detail
17) How is security achieved in Hadoop?
Apache Hadoop achieves security by using Kerberos.
At a high level, there are three steps that a client must take to access a service
when using Kerberos, each of which involves a message exchange with a
server.
 Authentication – The client authenticates itself to the authentication
server and receives a timestamped Ticket-Granting Ticket (TGT).
 Authorization – The client uses the TGT to request a service ticket from the
Ticket Granting Server.
 Service Request – The client uses the service ticket to authenticate itself to
the server.
18) Why does one remove or add nodes in a Hadoop cluster frequently?
One of the most important features of Hadoop is its use of commodity
hardware. However, this leads to frequent DataNode crashes in a Hadoop
cluster.
Another striking feature of Hadoop is the ease of scaling in response to rapid
growth in data volume.
Hence, for the above reasons, administrators add and remove DataNodes in a
Hadoop cluster frequently.
Read: Hadoop HDFS Architecture Explanation and Assumptions
19) What is throughput in Hadoop?
Throughput is the amount of work done in unit time. HDFS provides good
throughput for the following reasons:
 HDFS follows the write-once, read-many model. This simplifies data
coherency issues, as data written once cannot be modified, and thus
provides high-throughput data access.
 Hadoop works on the data locality principle. This principle states that
computation is moved to the data instead of data to the computation. This
reduces network congestion and therefore enhances the overall system
throughput.
20) How to restart NameNode or all the daemons in Hadoop?
We can restart the NameNode by the following methods:
 You can stop the NameNode individually using the ./sbin/hadoop-
daemon.sh stop namenode command, then start it again
using ./sbin/hadoop-daemon.sh start namenode.
 Use ./sbin/stop-all.sh and then ./sbin/start-all.sh; these commands first stop
all the daemons and then start all the daemons.
The sbin directory inside the Hadoop directory stores these script files.
21) What does jps command do in Hadoop?
The jps command helps us check whether the Hadoop daemons are running or
not. It shows all the Hadoop daemons that are running on the machine:
NameNode, DataNode, ResourceManager, NodeManager etc.
Read:Top 10 Hadoop Hdfs Commands with Examples and Usage
22) What are the main hdfs-site.xml properties?
hdfs-site.xml – This file contains the configuration settings for the HDFS
daemons. hdfs-site.xml also specifies the default block replication and
permission checking on HDFS.
The three main hdfs-site.xml properties are:
1. dfs.name.dir gives the location where the NameNode stores the metadata
(FsImage and edit logs), and also specifies where DFS should locate it – on
disk or in a remote directory.
2. dfs.data.dir gives the location where DataNodes store the data.
3. fs.checkpoint.dir is the directory on the file system where the Secondary
NameNode stores the temporary images of the edit logs; these edit logs
and the FsImage are then merged there for backup.
23) What is fsck?
fsck is the File System Check. Hadoop HDFS uses the fsck (filesystem check)
command to check for various inconsistencies. It also reports problems with
the files in HDFS, for example missing blocks for a file or under-replicated
blocks. It is different from the traditional fsck utility for the native file
system: it does not correct the errors it detects.
Normally the NameNode automatically corrects most recoverable failures.
Filesystem check also ignores open files, but provides an option to include all
files during reporting. The HDFS fsck command is not a Hadoop shell
command; it runs as bin/hdfs fsck. Filesystem check can run on the whole
file system or on a subset of files.
Usage:
hdfs fsck <path>
[-list-corruptfileblocks |
[-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]]
[-includeSnapshots]
<path> – Start checking from this path.
-delete – Delete corrupted files.
-files – Print out the files being checked.
-files -blocks – Print out the block report.
-files -blocks -locations – Print out locations for every block.
-files -blocks -racks – Print out the network topology for datanode locations.
-includeSnapshots – Include snapshot data if the given path indicates a
snapshottable directory or contains one.
-list-corruptfileblocks – Print the list of missing blocks and the files they
belong to.
24) How to debug Hadoop code?
First, check the list of MapReduce jobs currently running. Then, check
whether any orphaned jobs are running; if yes, you need to determine the
location of the ResourceManager (RM) logs.
 First of all, run "ps -ef | grep -i ResourceManager" and look for the log
directory in the displayed result. Find the job-id from the displayed list
and check whether there is any error message associated with that job.
 Now, on the basis of the RM logs, identify the worker node involved in the
execution of the task.
 Now, log in to that node and run "ps -ef | grep -i NodeManager".
 Examine the NodeManager log.
 The majority of errors come from the user-level logs for each map-reduce
job.
Read: Hadoop HDFS Commands with Examples and Usage
25) Explain Hadoop streaming?
The Hadoop distribution provides a generic application programming
interface (API) that allows writing Map and Reduce jobs in any desired
programming language. The utility allows creating/running jobs with any
executable as the Mapper/Reducer.
For example:
hadoop jar hadoop-streaming-3.0.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc
In the example, both the mapper and the reducer are executables that read
the input from stdin (line by line) and emit the output to stdout. The utility
creates/submits a Map/Reduce job to an appropriate cluster and monitors the
progress of the job until it completes. Hadoop Streaming uses both streaming
command options as well as generic command options. Be sure to place the
generic options before the streaming options, otherwise the command will fail.
The general command-line syntax is shown below:
hadoop command [genericOptions] [streamingOptions]
26) What does hadoop-metrics.properties file do?
Metrics are the statistical information exposed by the Hadoop daemons. The
Hadoop framework uses them for monitoring, performance tuning and
debugging. By default, many metrics are available, and they are very useful
for troubleshooting.
The Hadoop framework uses hadoop-metrics.properties for 'Performance
Reporting' purposes; it controls the metrics reporting for Hadoop. The API
provides an abstraction so it can be implemented on top of a variety of metrics
client libraries. The choice of client library is a configuration option, and
different modules within the same application can use different metrics
implementation libraries.
This file is present inside /etc/hadoop.
Also Read Hadoop Tutorial with Hadoop Daemons
27) How does Hadoop's CLASSPATH play a vital role in starting or stopping
Hadoop daemons?
CLASSPATH includes all the directories containing the jar files required to
start/stop the Hadoop daemons.
For example, HADOOP_HOME/share/hadoop/common/lib contains all the utility
jar files. We cannot start/stop the Hadoop daemons if we don't set CLASSPATH.
We can set CLASSPATH inside the /etc/hadoop/hadoop-env.sh file. The next time
you run hadoop, the CLASSPATH is added automatically; that is, you don't
need to add CLASSPATH to the parameters each time you run a command.
28) What are the different commands used to startup and shutdown Hadoop
daemons?
• To start all the Hadoop daemons use ./sbin/start-all.sh.
Then, to stop all the Hadoop daemons use ./sbin/stop-all.sh.
• You can also start all the DFS daemons together using ./sbin/start-dfs.sh,
the YARN daemons together using ./sbin/start-yarn.sh, and the MR Job
History Server using ./sbin/mr-jobhistory-daemon.sh start historyserver.
Then, to stop these daemons we can use:
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
./sbin/mr-jobhistory-daemon.sh stop historyserver
• Finally, the last way is to start all the daemons individually and then stop
them individually:
./sbin/hadoop-daemon.sh start namenode
./sbin/hadoop-daemon.sh start datanode
./sbin/yarn-daemon.sh start resourcemanager
./sbin/yarn-daemon.sh start nodemanager
./sbin/mr-jobhistory-daemon.sh start historyserver
Read – Hadoop Commands with Examples and Usage Part
29) What is configured in /etc/hosts and what is its role in setting up a Hadoop
cluster?
The /etc/hosts file contains the hostnames and IP addresses of hosts, i.e. it
maps IP addresses to hostnames. In a Hadoop cluster, we store the hostnames
of all the nodes (master and slaves) with their IP addresses in /etc/hosts, so
that we can easily use hostnames instead of IP addresses.
30) How is the splitting of file invoked in Hadoop framework?
Input files store the data for Hadoop MapReduce tasks, and these files
typically reside in HDFS. InputFormat defines how these input files are split
and read. It is also responsible for creating InputSplits, which are the logical
representation of data, and for dividing each split into records. The mapper
then processes each record (a key-value pair). The Hadoop framework invokes
the splitting of the file by calling the getSplits() method of the InputFormat
class (such as FileInputFormat) configured for the job.
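As a hedged sketch of how this hook can be customized (the class below is an illustrative example, not a built-in Hadoop class): a user-defined InputFormat can subclass the stock one and refuse to split files, and the framework will still call its getSplits() when the job runs.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative subclass: every file becomes exactly one InputSplit,
// so a single mapper processes one whole file.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // getSplits() then creates a single split per file
  }
}
It would be wired in with job.setInputFormatClass(NonSplittableTextInputFormat.class) in the driver.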
31) Is it possible to provide multiple inputs to Hadoop? If yes, then how?
Yes, it is possible by using the MultipleInputs class.
For example:
Suppose we have weather data from the UK Met Office and we want to
combine it with the NCDC data for our maximum temperature analysis. Then
we can set up the input as follows:
MultipleInputs.addInputPath(job, ncdcInputPath, TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metofficeInputPath, TextInputFormat.class, MetofficeMaxTemperatureMapper.class);
The above code replaces the usual calls
to FileInputFormat.addInputPath() and job.setMapperClass(). Both the Met
Office and NCDC data are text based, so we use TextInputFormat for each.
And we use two different mappers, as the two data sources have different
line formats. The MaxTemperatureMapper reads NCDC input data and
extracts the year and temperature fields. The MetofficeMaxTemperatureMapper
reads Met Office input data and extracts the year and temperature fields.
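A hedged sketch of a complete driver around those two calls (the paths, mapper classes and the MaxTemperatureReducer are assumptions carried over from the example above, so this will only compile alongside those classes):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature, two sources");
    job.setJarByClass(MaxTemperatureDriver.class);

    Path ncdcInputPath = new Path(args[0]);       // NCDC records
    Path metofficeInputPath = new Path(args[1]);  // Met Office records
    Path outputPath = new Path(args[2]);

    // One mapper class per data source; both emit (year, temperature) pairs.
    MultipleInputs.addInputPath(job, ncdcInputPath,
        TextInputFormat.class, MaxTemperatureMapper.class);
    MultipleInputs.addInputPath(job, metofficeInputPath,
        TextInputFormat.class, MetofficeMaxTemperatureMapper.class);

    FileOutputFormat.setOutputPath(job, outputPath);
    job.setReducerClass(MaxTemperatureReducer.class);  // assumed reducer class
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}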
Read More on InputSplit in Hadoop MapReduce
32) Is it possible to have hadoop job output in multiple directories?
If yes, how?
Yes, it is possible by using the following approaches:
a. Using the MultipleOutputs class –
This class simplifies writing output data to multiple outputs.
MultipleOutputs.addNamedOutput(job, "outputFileName", OutputFormatClass, keyClass, valueClass);
The API provides two overloaded write methods to achieve this:
MultipleOutputs.write("outputFileName", new Text(key), new Text(value));
Then, we need to use the overloaded write method with an extra parameter
for the base output path, which allows writing the output file to separate
output directories:
MultipleOutputs.write("outputFileName", new Text(key), new Text(value), baseOutputPath);
Then, we need to change the baseOutputPath in each of our implementations.
b. Rename/Move the file in the driver class –
This is the easiest hack to write output to multiple directories. We can use
MultipleOutputs and write all the output files to a single directory, but the file
names need to be different for each category.
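For context, a hedged sketch of a reducer that writes through MultipleOutputs (the named output "categoryOut" and the per-key base path are assumptions for illustration):
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategoryReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> multipleOutputs;

  @Override
  protected void setup(Context context) {
    multipleOutputs = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // The base output path derived from the key sends each category
      // to its own sub-directory under the job output directory.
      multipleOutputs.write("categoryOut", key, value, key.toString() + "/part");
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    multipleOutputs.close();  // flush and close all named outputs
  }
}
In the driver this would be paired with MultipleOutputs.addNamedOutput(job, "categoryOut", TextOutputFormat.class, Text.class, Text.class).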
We have categorized the above scenario-based Hadoop Interview Questions
for Hadoop developers, Hadoop administrators, Hadoop architects etc., for
freshers as well as for experienced candidates.
 Hadoop Interview Questions and Answers for Freshers – Q. No. 5-11
 Hadoop Interview Questions and Answers for Experienced – Q. No. 11-32
Click here for more frequently asked Hadoop real time interview Questions
and Answers for Freshers and Experienced.

5. HDFS Hadoop Interview Questions and Answers
In this section, we have covered the top HDFS Hadoop Interview Questions
and Answers. In the below Hadoop Interview Questions we cover what HDFS
is, HDFS components, indexing, read-write operations, blocks, replication etc.
So, let's get started with the Hadoop Interview Questions on HDFS.
1) What is HDFS- Hadoop Distributed File System?
Hadoop Distributed File System (HDFS) is the primary storage system of
Hadoop. HDFS stores very large files on a cluster of commodity hardware. It
works on the principle of storing a small number of large files rather than a
huge number of small files. HDFS stores data reliably even in the case of
hardware failure. It provides high-throughput access to applications by
allowing data access in parallel.
Components of HDFS:
 NameNode – It is also known as the Master node. The NameNode stores
meta-data, i.e. the number of blocks, their replicas and other details.
 DataNode – It is also known as a Slave. In Hadoop HDFS, a DataNode is
responsible for storing the actual data. DataNodes perform read and write
operations as per the clients' requests in HDFS.
Read Hadoop HDFS in detail
2) Explain NameNode and DataNode in HDFS?
I. NameNode – It is also known as the Master node. The NameNode stores
meta-data, i.e. the number of blocks, their locations, replicas and other details.
This meta-data is kept in memory on the master for faster retrieval of data.
The NameNode maintains and manages the slave nodes and assigns tasks to
them. It should be deployed on reliable hardware as it is the centerpiece of
HDFS.
Tasks of NameNode
 Manage the file system namespace.
 Regulate clients' access to files.
 In HDFS, the NameNode also executes file system operations such as
naming, closing, and opening files and directories.
II. DataNode – It is also known as a Slave. In Hadoop HDFS, a DataNode is
responsible for storing the actual data in HDFS. DataNodes perform read and
write operations as per the clients' requests. One can deploy DataNodes on
commodity hardware.
Tasks of DataNode
 In HDFS, a DataNode performs operations like block replica creation,
deletion, and replication according to the instructions of the NameNode.
 A DataNode manages the data storage of the system.
Read: HDFS Blocks & Data Block Size
3) Why is block size set to 128 MB in Hadoop HDFS?
A block is a contiguous location on the hard drive where data is stored. In
general, a file system stores data as a collection of blocks. HDFS stores each
file as blocks and distributes them across the Hadoop cluster. In HDFS, the
default size of a data block is 128 MB, which we can configure as per our
requirements.
The block size is set to 128 MB:
 To reduce disk seeks (I/O). The larger the block size, the fewer the blocks
per file and the fewer the disk seeks, and transfer of a block can be done
within a reasonable time, and in parallel.
 HDFS holds huge data sets, i.e. terabytes and petabytes of data. If we took
a 4 KB block size for HDFS, just like the Linux file system, we would have
far too many blocks and therefore too much metadata. Managing this huge
number of blocks and metadata would create huge overhead and traffic,
which is something we don't want. So the block size is set to 128 MB.
On the other hand, the block size can't be so large that the system ends up
waiting a very long time for the last unit of data processing to finish its work.
4) How data or file is written into HDFS?
When a client wants to write a file to HDFS, it contacts the NameNode for
metadata. The NameNode responds with the number of blocks, the replication
factor and the DataNodes to write to. Then, on the basis of the information
from the NameNode, the client splits the file into multiple blocks and starts
sending them to the first DataNode: the client sends block A to DataNode 1
along with the details of the other two DataNodes.
When DataNode 1 receives block A from the client, it copies the same block to
DataNode 2 on the same rack; as both DataNodes are in the same rack, the
block transfers via the rack switch. Then DataNode 2 copies the same block to
DataNode 3; as these DataNodes are in different racks, the block transfers via
an out-of-rack switch.
Once a DataNode receives a block from the client, it sends a write
confirmation to the NameNode, and a write confirmation is sent back to the
client. The same process repeats for each block of the file, and data transfer
happens in parallel for faster writing of blocks.
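From the application side, all of this pipeline detail is hidden behind the FileSystem API. A minimal, hedged write sketch (the path is hypothetical and fs.defaultFS is assumed to point at the cluster):
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/dataflair/demo.txt");  // hypothetical path
    try (FSDataOutputStream out = fs.create(file, true)) {
      // The client writes a byte stream; HDFS splits it into blocks and
      // replicates each block through the DataNode pipeline described above.
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }
    fs.close();
  }
}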
Read HDFS file write operation workflow in detail

6. HDFS Hadoop Interview Questions for Freshers:
The HDFS Hadoop Interview Questions number 5 to 8 are basic questions for
freshers; however, experienced people can also read them for understanding
the basics.
5) Can multiple clients write into an HDFS file concurrently?
Multiple clients cannot write into an HDFS file at the same time. Apache
Hadoop HDFS follows a single-writer, multiple-reader model. When a client
opens a file for writing, the NameNode grants it a lease. Now suppose some
other client wants to write into that file: it asks the NameNode for the write
operation. The NameNode first checks whether it has already granted the
lease for writing into that file to someone else. If someone else already holds
the lease, it rejects the write request of the second client.
6) How data or file is read in HDFS?
To read from HDFS, the client first contacts the NameNode for metadata. The
NameNode responds with the names of the files, the number of blocks, their
locations and the replication factor. The client then communicates with the
DataNodes where the blocks are present and starts reading data from them in
parallel, based on the information received from the NameNode.
Once the client or application receives all the blocks of the file, it combines
these blocks to form the file. To improve read performance, the locations of
the blocks are ordered by their distance from the client, and HDFS selects the
replica which is closest to the client. This reduces read latency and bandwidth
consumption: the client first reads the block on the same node, then from
another node in the same rack, and finally from a DataNode in another rack.
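The client-side counterpart for reading, again a hedged sketch with an assumed path; the replica selection described above happens inside open()/read():
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/dataflair/demo.txt");  // hypothetical path

    try (FSDataInputStream in = fs.open(file)) {
      // HDFS serves each block from the closest available replica.
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}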
Read HDFS file read operation workflow in detail
7) Why HDFS stores data using commodity hardware despite the higher chance
of failures?
HDFS stores data on commodity hardware because HDFS is highly fault-
tolerant. HDFS provides fault tolerance by replicating the data blocks and
distributing them among different DataNodes across the cluster. By default,
the replication factor is 3, which is configurable. Replication of data solves the
problem of data loss in unfavorable conditions, such as the crashing of a node
or hardware failure. So, when any machine in the cluster goes down, the client
can easily access the data from another machine that contains the same copy
of the data blocks.
Learn: Automatic Failover in Hadoop
8) How is indexing done in HDFS?
Hadoop has a unique way of indexing. Once the Hadoop framework stores the
data according to the block size, HDFS keeps storing, with the last part of each
block of data, an indication of where the next part of the data will be. In fact,
this is the basis of HDFS.

7. HDFS Hadoop Interview Questions for Experienced:
The Hadoop Interview Questions from 9 to 23 are for experienced people, but
freshers and students can also read them for advanced understanding.
9) What is a Heartbeat in HDFS?
A heartbeat is the signal that the NameNode receives from each DataNode to
show that it is functioning (alive). The NameNode and DataNodes
communicate using heartbeats. If, after a certain period, the NameNode does
not receive a heartbeat from a DataNode, that node is considered dead, and
the NameNode then schedules the creation of new replicas of that node's
blocks on other DataNodes. Heartbeats from a DataNode also carry
information about total storage capacity, the fraction of storage in use, and
the number of data transfers currently in progress.
The default heartbeat interval is 3 seconds. One can change it by
using dfs.heartbeat.interval in hdfs-site.xml.
10) How to copy a file into HDFS with a different block size to that of existing
block size configuration?
One can copy a file into HDFS with a different block size by using
-Ddfs.blocksize=block_size, where block_size is in bytes.
Let us explain it with an example. Suppose you want to copy a file called
test.txt, of size say 128 MB, into HDFS, and for this file you want the block
size to be 32 MB (33554432 bytes) in place of the default (128 MB). You would
issue the following command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/dataflair/test.txt /sample_hdfs
Now, you can check the HDFS block size associated with this file by:
hadoop fs -stat %o /sample_hdfs/test.txt
Else, you can also use the NameNode web UI to inspect the HDFS directory.
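The same effect can be had programmatically; a hedged sketch using the FileSystem.create() overload that takes an explicit block size (the target path, buffer size and replication are illustrative values):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSizeWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path target = new Path("/sample_hdfs/test.txt");  // hypothetical target path

    long blockSize = 32L * 1024 * 1024;               // 32 MB instead of the default
    try (FSDataOutputStream out =
             fs.create(target, true, 4096, (short) 3, blockSize)) {
      out.writeUTF("file written with a 32 MB block size");
    }
    fs.close();
  }
}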
11) Why HDFS performs replication, although it results in data redundancy?
In HDFS, replication provides fault tolerance. Data replication is one of the
most important and unique features of HDFS. Replication of data solves the
problem of data loss in unfavorable conditions, such as the crashing of a node
or hardware failure. HDFS by default creates 3 replicas of each block across
the cluster, and we can change this as per need. So, if any node goes down, we
can recover the data on that node from another node.
Replication does lead to the consumption of a lot of space, but the user can
always add more nodes to the cluster if required. It is very rare to have free
space issues in a practical cluster, as the very first reason to deploy HDFS was
to store huge data sets. One can also change the replication factor to save
HDFS space, or use the different codecs provided by Hadoop to compress the
data.
Read: Namenode High Availability
12) What is the default replication factor and how will you change it?
The default replication factor is 3. One can change it in the following three
ways:
 By adding this property to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>5</value>
<description>Block Replication</description>
</property>
 You can also change the replication factor on a per-file basis using the
command: hadoop fs -setrep -w 3 /file_location
 You can also change the replication factor for all the files in a directory by
using: hadoop fs -setrep -R -w 3 /directory_location
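Alternatively, the replication factor of a file that already exists in HDFS can be changed from Java code; a hedged sketch using the public FileSystem API (the path is hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/dataflair/demo.txt");  // hypothetical file

    // Roughly equivalent to: hadoop fs -setrep 5 /user/dataflair/demo.txt
    boolean changed = fs.setReplication(file, (short) 5);
    System.out.println("Replication change requested: " + changed);
    fs.close();
  }
}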
Learn: Read Write Operations in HDFS
13) Explain Hadoop Archives?
Apache Hadoop HDFS stores and processes large (terabyte-scale) data sets.
However, storing a large number of small files in HDFS is inefficient, since
each file is stored in a block and block metadata is held in memory by the
namenode.
Reading through small files normally causes lots of seeks and lots of hopping
from datanode to datanode to retrieve each small file, which is an inefficient
data access pattern.
Hadoop Archive (HAR) basically deals with the small-files issue. HAR packs a
number of small files into a large file, so one can access the original files in
parallel, transparently (without expanding the files) and efficiently.
Hadoop Archives are special-format archives that map to a file system
directory. A Hadoop Archive always has a *.har extension. In particular,
Hadoop MapReduce can use Hadoop Archives as input.
14) What do you mean by the High Availability of a NameNode in Hadoop
HDFS?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF): if the
namenode fails, all clients, including MapReduce jobs, would be unable to
read, write or list files. In such an event, the whole Hadoop system would be
out of service until a new namenode is brought online.
Hadoop 2.0 overcomes this single point of failure by providing support for
more than one NameNode. The high availability feature adds an extra
(standby) NameNode to the Hadoop architecture, configured for automatic
failover. If the active NameNode fails, the standby NameNode takes over all
the responsibilities of the active node and the cluster continues to work.
The initial implementation of HDFS namenode high availability provided for
a single active namenode and a single standby namenode. However, some
deployments require a higher degree of fault tolerance; this is enabled by
version 3.0, which allows the user to run multiple standby namenodes. For
instance, by configuring three namenodes and five journal nodes, the cluster
is able to tolerate the failure of two nodes rather than one.
Read HDFS NameNode High Availability in detail
15) What is Fault Tolerance in HDFS?
Fault tolerance in HDFS is the working strength of the system in unfavorable
conditions (such as the crashing of a node or hardware failure). HDFS controls
faults through the process of replica creation. When a client stores a file in
HDFS, the file is divided into blocks, and the blocks of data are distributed
across different machines in the HDFS cluster. HDFS creates a replica of each
block on other machines in the cluster – 3 copies by default. If any machine in
the cluster goes down or fails due to unfavorable conditions, the user can still
easily access that data from the other machines in the cluster on which a
replica of the block is present.
Read HDFS Fault Tolerance feature in detail
16) What is Rack Awareness?
Rack awareness reduces network traffic while reading/writing files: the
NameNode chooses DataNodes that are on the same rack or a nearby rack.
The NameNode obtains rack information by maintaining the rack ID of each
DataNode, and chooses DataNodes based on this rack information. In HDFS,
the NameNode makes sure that all the replicas are not stored on a single rack.
It follows the rack awareness algorithm to reduce latency as well as improve
fault tolerance.
With the default replication factor of 3, according to the rack awareness
algorithm, the first replica of a block is stored on the local rack, the next
replica is stored on another datanode within the same rack, and the third
replica is stored on a different rack.
In Hadoop, we need Rack Awareness because it improves:
 Data high availability and reliability.
 The performance of the cluster.
 Network bandwidth.
Read about Rack Awareness in Detail
17) Explain the Single point of Failure in Hadoop?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the
namenode fails, all clients are unable to read/write files. In such an event, the
whole Hadoop system would be out of service until a new namenode is up.
Hadoop 2.0 overcomes this SPOF by providing support for an additional
NameNode. The high availability feature adds an extra NameNode to the
Hadoop architecture and provides automatic failover. If the active NameNode
fails, the standby NameNode takes over all the responsibilities of the active
node and the cluster continues to work.
The initial implementation of namenode high availability provided for a
single active/standby namenode pair. However, some deployments require a
higher degree of fault tolerance, so version 3.0 enables this by allowing the
user to run multiple standby namenodes. For instance, by configuring three
namenodes and five journal nodes, the cluster is able to tolerate the failure of
two nodes rather than one.
18) Explain Erasure Coding in Hadoop?
In Hadoop, by default HDFS replicates each block three times. Replication in
HDFS is a very simple and robust form of redundancy to shield against
datanode failure, but it is expensive: the 3x replication scheme has 200%
overhead in storage space and other resources.
Hadoop 3.x therefore introduced Erasure Coding, a new feature to use in place
of replication. It provides the same level of fault tolerance with much less
storage, at about 50% storage overhead.
Erasure Coding borrows from RAID (Redundant Array of Inexpensive Disks).
RAID implements EC through striping, in which logically sequential data
(such as a file) is divided into smaller units (such as a bit, byte or block) and
stored on different disks.
Encoding – In this process, parity cells are calculated for each stripe of data
cells, and errors are recovered through the parity. Erasure coding extends a
message with redundant data for fault tolerance. The EC codec operates on
uniformly sized data cells: it takes a number of data cells as input and
produces parity cells as output. The data cells and parity cells together are
called an erasure coding group.
There are two algorithms available for Erasure Coding:
 XOR Algorithm
 Reed-Solomon Algorithm
Read about Erasure Coding in detail
19) What is Disk Balancer in Hadoop?
HDFS provides a command line tool called diskbalancer. It distributes data
evenly across all the disks of a datanode. The tool operates against a given
datanode and moves blocks from one disk to another.
The disk balancer works by creating a plan (a set of statements) and executing
that plan on the datanode. The plan describes how much data should move
between disks and is composed of multiple steps; a move step has a source
disk, a destination disk and the number of bytes to move. The plan executes
against an operational datanode.
By default, the disk balancer is not enabled; to enable it,
dfs.disk.balancer.enabled must be set to true in hdfs-site.xml.
When we write a new block to HDFS, the datanode uses a volume-choosing
policy to choose the disk for the block (each data directory is a volume in
HDFS terminology). The two such policies are:
 Round-robin: It distributes the new blocks evenly across the available
disks.
 Available space: It writes data to the disk that has maximum free space (by
percentage).
Read Hadoop Disk Balancer in Detail
If you don’t understand any Hadoop Interview Questions and Answers, ask us
in the comment section and our support team will get back to you.
20) How would you check whether your NameNode is working or not?
There are several ways to check the status of the NameNode. Mostly, one uses
the jps command to check the status of all daemons running in the HDFS.
21) Is the NameNode machine the same as a DataNode machine in terms of
hardware?
No. Unlike the DataNodes, the NameNode is a highly available server that
manages the file system namespace and maintains the metadata information.
Metadata information includes the number of blocks, their locations, replicas
and other details. It also executes file system operations such as naming,
closing and opening files/directories.
Therefore, the NameNode requires higher RAM for storing the metadata of
millions of files, whereas a DataNode is responsible for storing the actual data
in HDFS and performs read and write operations as per the clients' requests.
Therefore, a DataNode needs to have a higher disk capacity for storing huge
data sets.
Learn: Namenode High Availability in Hadoop
22) What are file permissions in HDFS and how does HDFS check permissions
for files or directories?
For files and directories, the Hadoop Distributed File System (HDFS)
implements a permissions model. For each file or directory, we can manage
permissions for a set of 3 distinct user classes: the owner, the group, and
others.
There are 3 different permissions for each user class: read (r), write (w),
and execute (x).
 For files, the r permission is required to read the file, and the w
permission is required to write to the file.
 For directories, the r permission is required to list the contents of the
directory, and the w permission is required to create or delete files or
directories in it.
 The x permission is required to access a child of the directory.
HDFS checks permissions for a file or directory as follows:
 Hadoop checks the owner's permissions if the user name matches the
owner of the file or directory.
 If the group matches the directory's group, then Hadoop tests the user's
group permissions.
 Hadoop tests the "other" permissions when the owner and the group
names don't match.
 If none of the permission checks succeed, the client's request is denied.
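For completeness, a hedged Java sketch of setting such permissions through the FileSystem API (the path is hypothetical; the shell equivalent would be hadoop fs -chmod 750):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class SetPermissionExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/user/dataflair/reports");  // hypothetical directory

    // owner: rwx, group: r-x, others: --- (i.e. mode 750)
    FsPermission perm =
        new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE);
    fs.setPermission(dir, perm);
    fs.close();
  }
}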
23) If DataNodes increase, do we need to upgrade the NameNode?
The NameNode stores meta-data, i.e. the number of blocks, their locations and
replicas. This meta-data is kept in memory on the master for faster retrieval
of data. The NameNode maintains and manages the slave nodes and assigns
tasks to them. It regulates clients' access to files and also executes file system
operations such as naming, closing, and opening files/directories.
During Hadoop installation, the NameNode is provisioned based on the size of
the cluster. Mostly we don't need to upgrade the NameNode when DataNodes
are added, because it does not store the actual data; it stores only the
metadata, so such a requirement rarely arises.
These Big Data Hadoop interview questions are selected ones which are asked
frequently, and by going through these HDFS interview questions you will be
able to answer many other related questions in your interview.
We have categorized the above Big Data Hadoop interview questions and
answers for HDFS Interview for freshers and experienced.
 Hadoop Interview Questions and Answers for Freshers – Q. No.- 5 – 8.
 Hadoop Interview Questions and Answers for Experienced – Q. No.- 8-23.
Here are few more frequently asked Hadoop HDFS interview Questions and
Answers for Freshers and Experienced.

8. MapReduce Hadoop Interview Questions and Answers
In this section, we discuss the top MapReduce Hadoop Interview Questions
for freshers and experienced. In the below Hadoop Interview Questions on
MapReduce we have taken up questions on topics like what Hadoop
MapReduce is, key-value pairs, mappers, combiners, reducers etc. Let's get
started with the Interview Questions on MapReduce.
The Hadoop Interview Questions from 1 to 6 are for freshers, but experienced
professionals can also refer to these Hadoop Interview Questions for basic
understanding.
1) What is Hadoop MapReduce?
MapReduce is the data processing layer of Hadoop. It is the framework for
writing applications that process the vast amounts of data stored in HDFS. It
processes a huge amount of data in parallel by dividing the job into a set of
independent tasks (sub-jobs). In Hadoop, MapReduce works by breaking the
processing into phases: Map and Reduce.
 Map – It is the first phase of processing, in which we specify all the
complex logic/business rules/costly code. The map takes a set of data and
converts it into another set of data, breaking individual elements into
tuples (key-value pairs).
 Reduce – It is the second phase of processing, in which we specify light-
weight processing like aggregation/summation. Reduce takes the output
from the map as input and then combines the tuples (key-value pairs)
based on the key, modifying the value of the key accordingly.
Read Hadoop MapReduce in detail
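To make the two phases concrete, here is a minimal word-count sketch (the classic introductory example, not taken from this article): the map phase tokenizes each line into (word, 1) pairs and the reduce phase sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map phase: per-record logic lives here.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);            // emit (word, 1)
      }
    }
  }

  // Reduce phase: lightweight aggregation (summation) per key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      context.write(key, new IntWritable(sum));  // emit (word, total)
    }
  }
}
A driver would wire these in with job.setMapperClass(...) and job.setReducerClass(...).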
2) Why Hadoop MapReduce?
When we store a huge amount of data in HDFS, the first question that arises
is how to process it. Transferring all this data to a central node for processing
does not work: we would have to wait forever for the data to transfer over the
network. Google faced this same problem with its distributed Google File
System (GFS) and solved it using the MapReduce data processing model.
Challenges before MapReduce:
 Costly – Keeping all the data (terabytes) on one server or in a database
cluster is very expensive and also hard to manage.
 Time-consuming – Using a single machine, we cannot analyze terabytes of
data in a reasonable time.
MapReduce overcomes these challenges:
 Cost-efficient – It distributes the data over multiple low-configuration
machines.
 Time-efficient – If we want to analyze the data, we write the analysis code
in the Map function and the integration code in the Reduce function and
execute it. This MapReduce code goes to every machine that has a part of
our data and executes on that specific part. Hence, instead of moving
terabytes of data, we just move kilobytes of code, which is much faster.
Read more on Hadoop MapReduce
3) What is the key- value pair in MapReduce?
Hadoop MapReduce implements a data model which represents data as key-
value pairs. Both input and output to the MapReduce framework must be in
key-value pairs only.
In Hadoop, if the schema is static we can work directly on the columns
instead of key-value pairs; if the schema is not static, we work on keys and
values. Keys and values are not intrinsic properties of the data; the user
analyzing the data chooses the key-value pairs. A key-value pair in Hadoop
MapReduce is generated in the following way:
 InputSplit – It is the logical representation of data. An InputSplit
represents the data that an individual Mapper will process.
 RecordReader – It communicates with the InputSplit (created by the
InputFormat) and converts the split into records. Records are in the form
of key-value pairs suitable for reading by the mapper. By default,
TextInputFormat's RecordReader is used to convert data into key-value
pairs.
Key – It is the byte offset of the beginning of the line within the file, so it will
be unique if combined with the file name.
Value – It is the contents of the line, excluding line terminators. For example,
if the file content is "on the top of the crumpetty Tree":
Key – 0
Value – on the top of the crumpetty Tree
Read about Hadoop Key-Value Pair in detail
4) Why MapReduce uses the key-value pair to process the data?
MapReduce works on unstructured and semi-structured data apart from
structured data. Structured data, like that stored in an RDBMS, can be read
by columns, but handling unstructured data is feasible using key-value pairs,
and the very core of MapReduce works on the basis of these pairs: the
framework maps data into a collection of key-value pairs, and the reducer
operates on all the pairs with the same key. As stated by Google themselves
in their research publication, in most computations the Map operation is
applied to each logical "record" in the input to compute a set of intermediate
key-value pairs, and then the Reduce operation is applied to all the values
that share the same key, in order to combine the derived data properly.
In conclusion, we can say that key-value pairs are the best solution for
working on data problems with MapReduce.
Read about MapReduce key-value pair in detail
5) How many Mappers run for a MapReduce job in Hadoop?
The Mapper task processes each input record (from the RecordReader) and generates
a key-value pair. The number of mappers depends on 2 factors:
 The amount of data we want to process along with the block size, i.e. the
number of InputSplits. If the block size is 128 MB and we expect 10 TB of
input data, we will have roughly 82,000 maps. Ultimately the InputFormat
determines the number of maps.
 The configuration of the slave, i.e. the number of cores and the amount of RAM
available on the slave. The right number of maps per node is usually between
10 and 100. The Hadoop framework should give 1 to 1.5 cores of the processor
to each mapper; thus, for a 15-core processor, 10 mappers can run.
In a MapReduce job, one can control the number of mappers by changing the block
size: changing the block size increases or decreases the number of InputSplits. One
can also increase the number of map tasks manually (as a hint to the framework) by
using JobConf's conf.setNumMapTasks(int num).
Mapper = (total data size) / (input split size)
If data size = 1 TB and input split size = 100 MB, then
Mapper = (1000 * 1000) / 100 = 10,000
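As an illustration only, the split size (and hence the number of mappers) can also be influenced at the job level without touching the HDFS block size; the sketch below uses FileInputFormat's split-size settings, and the job name and input path are assumptions, not values from this guide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-size-demo");  // hypothetical job name
    FileInputFormat.addInputPath(job, new Path("/data/input"));         // assumed input path
    // Force each split to be at most 64 MB, so 1 TB of input gives roughly 16,000 maps.
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    // Mapper, reducer, and output configuration are omitted for brevity.
  }
}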
6) How many Reducers run for a MapReduce job in Hadoop?
The Reducer takes the set of intermediate key-value pairs produced by the mappers
as its input and runs a reduce function on each of them to generate the output. The
output of the reducer is the final output, which it stores in HDFS. Usually, in the
reducer, we do aggregation or summation sorts of computation.
With the help of Job.setNumReduceTasks(int), the user sets the number of reducers
for the job. The right number of reducers is given by the formula:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)
With 0.95, all the reducers can launch immediately and start transferring map
outputs as the maps finish.
With 1.75, the faster nodes finish the first round of reduces and then launch a
second wave of reduces.
Increasing the number of reducers:
 Increases framework overhead
 Improves load balancing
 Lowers the cost of failures
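A small sketch (assumed node and container counts, not figures from this guide) of how the 0.95 factor might be applied when configuring a job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDemo {
  public static void main(String[] args) throws Exception {
    int nodes = 10;              // assumed cluster size
    int containersPerNode = 8;   // assumed maximum containers per node
    int reducers = (int) (0.95 * nodes * containersPerNode);  // 76 reducers

    Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
    job.setNumReduceTasks(reducers);   // Job.setNumReduceTasks(int), as mentioned above
    // Mapper, reducer, input and output configuration are omitted for brevity.
  }
}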
Read: What is Reducer in MapReduce?
MapReduce Hadoop Interview Questions for Experienced
The following Hadoop interview questions are more technical and targeted at
experienced professionals; however, freshers can also refer to them for a better
understanding.
7) What is the difference between Reducer and Combiner in Hadoop
MapReduce?
The Combiner is a mini-reducer that performs a local reduce task. The Combiner
runs on the map output and produces output that becomes the reducer's input. A
combiner is usually used for network optimization. The Reducer takes the set of
intermediate key-value pairs produced by the mappers as its input and runs a reduce
function on each of them to generate the output; the output of the reducer is the
final output.
 Unlike a reducer, the combiner has a limitation: its input and output key and
value types must match the output types of the mapper.
 Combiners can operate only on a subset of keys and values, i.e. combiners can
execute only functions that are commutative and associative.
 A combiner function takes input from a single mapper, while a reducer can take
data from multiple mappers as a result of partitioning.
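As a sketch, assuming a word-count style job where the reduce function is a plain sum (and therefore commutative and associative), the same reducer class can usually be reused as the combiner inside the job driver; the class names here are hypothetical.

// Inside a hypothetical job driver, after Job.getInstance(...):
job.setMapperClass(LineWordMapper.class);      // hypothetical mapper
job.setCombinerClass(WordSumReducer.class);    // local, per-mapper reduce
job.setReducerClass(WordSumReducer.class);     // final, cluster-wide reduce
job.setMapOutputKeyClass(Text.class);          // combiner input/output types must
job.setMapOutputValueClass(IntWritable.class); // match the mapper's output types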
8) What happens if the number of reducers is 0 in Hadoop?
If we set the number of reducers to 0, then no reducer will execute and no
aggregation will take place. In such a case, we prefer a "map-only job" in Hadoop.
In a map-only job, the map does all the work on its InputSplit and the reducer does
no work; the map output is the final output.
Between the map and reduce phases there is a sort and shuffle phase, which is
responsible for sorting the keys in ascending order and then grouping values based
on the same keys. This phase is very expensive, so if the reduce phase is not required
we should avoid it. Avoiding the reduce phase eliminates the sort and shuffle phase
as well, which also saves network congestion: in shuffling, the output of the mapper
travels to the reducer, and when the data size is huge, large amounts of data travel
to the reducer.
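A minimal driver fragment for a map-only job might look like the sketch below (class names are placeholders):

// Inside a hypothetical job driver:
job.setMapperClass(LineWordMapper.class);  // hypothetical mapper
job.setNumReduceTasks(0);                  // 0 reducers: no sort/shuffle phase,
                                           // map output is written directly to HDFS
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);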
9) What do you mean by shuffling and sorting in MapReduce?
Shuffling and sorting take place after the completion of the map tasks, and the
shuffle and sort phases in Hadoop occur simultaneously.
Shuffling- It is the process of transferring data from the mapper to the reducer, i.e.
the process by which the system sorts the key-value output of the map tasks and
transfers it to the reducer. The shuffle phase is necessary for the reducers; otherwise,
they would not have any input. Because shuffling can start even before the map
phase has finished, it saves some time and completes the task in less time.
Sorting- The mapper generates the intermediate key-value pairs. Before the reducer
starts, the MapReduce framework sorts these key-value pairs by key. Sorting helps
the reducer to easily distinguish when a new reduce task should start, thus saving
time for the reducer.
Shuffling and sorting are not performed at all if you specify zero reducers
(setNumReduceTasks(0)).
Any doubt in the Hadoop Interview Questions and Answers yet? Just Drop a
comment and we will get back to you.
Read Shuffling and Sorting in detail
10) What is the fundamental difference between a MapReduce InputSplit and
HDFS block?
By definition
 Block – A block is a continuous location on the hard drive where HDFS stores
data. In general, a FileSystem stores data as a collection of blocks. In a similar
way, HDFS stores each file as blocks and distributes them across the Hadoop
cluster.
 InputSplit- InputSplit represents the data which an individual Mapper will
process. Each split is further divided into records, and each record (which is a
key-value pair) is processed by the map.
Data representation
 Block- It is the physical representation of data.
 InputSplit- It is the logical representation of data, which MapReduce programs
and other processing techniques use during data processing. Importantly, an
InputSplit does not contain the input data; it is just a reference to the data.
Size
 Block- The default size of an HDFS block is 128 MB, which we can configure as
per our requirement. All blocks of a file are of the same size except the last
block, which can be the same size or smaller. In Hadoop, files are split into
128 MB blocks and then stored in the Hadoop file system.
 InputSplit- Split size is approximately equal to the block size, by default.
Example
Consider an example where we need to store a file in HDFS. HDFS stores files as
blocks, and a block is the smallest unit of data that can be stored or retrieved from
the disk. The default size of a block is 128 MB. HDFS breaks files into blocks and
stores these blocks on different nodes in the cluster. If we have a file of 130 MB,
HDFS will break this file into 2 blocks.
Now, if we want to perform a MapReduce operation on the blocks, the second block
cannot be processed on its own, as the record at its boundary is incomplete.
InputSplit solves this problem: an InputSplit forms a logical grouping of blocks as a
single unit, because it includes the location of the next block and the byte offset of
the data needed to complete the record.
From this, we can conclude that an InputSplit is only a logical chunk of data, i.e. it
has just the information about block addresses or locations. Thus, during MapReduce
execution, Hadoop scans through the blocks and creates InputSplits. The split acts as
a broker between the block and the mapper.
11) What is a Speculative Execution in Hadoop MapReduce?
MapReduce breaks jobs into tasks, and these tasks run in parallel rather than
sequentially, which reduces overall execution time. This model of execution is
sensitive to slow tasks, as they slow down the overall execution of a job. There are
various reasons for the slowdown of tasks, such as hardware degradation, and it may
be difficult to detect the causes, since the tasks still complete successfully, although
they take more time than expected.
Apache Hadoop doesn't try to diagnose and fix slow-running tasks. Instead, it tries to
detect them and runs backup tasks for them. This is called speculative execution in
Hadoop, and the backup tasks are called speculative tasks. First, the Hadoop
framework launches all the tasks for the job. It then launches speculative tasks for
those tasks that have been running for some time (about a minute) and have not
made much progress, on average, compared with the other tasks of the job. If the
original task completes before the speculative task, then the speculative task is
killed. On the other hand, the original task is killed if the speculative task finishes
before it.
Read Hadoop Speculative Execution in detail
12) How to submit extra files(jars, static files) for MapReduce job during
runtime?
The MapReduce framework provides the Distributed Cache to cache files needed by
applications. It can cache read-only text files, archives, jar files, etc.
First of all, an application that needs to use the distributed cache to distribute a file
should make sure that the file is available at a URL, which can be either hdfs:// or
http://. If the file is present at an hdfs:// or http:// URL, the user marks it as a
cache file to distribute. The framework copies the cache file to all the nodes before
any tasks start on those nodes. The files are copied only once per job, and
applications should not modify them.
By default, the size of the distributed cache is 10 GB. We can adjust the size of the
distributed cache using local.cache.size.
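For illustration, a hedged sketch of the Job-level distributed cache API (Job.addCacheFile in the driver, JobContext.getCacheFiles in the task); the HDFS path is an assumption.

// In the driver: register a read-only file that every task will need.
job.addCacheFile(new java.net.URI("hdfs:///apps/lookup/stopwords.txt"));  // assumed path

// In the Mapper or Reducer: read the cached file once, in setup().
@Override
protected void setup(Context context) throws IOException, InterruptedException {
  java.net.URI[] cached = context.getCacheFiles();
  if (cached != null && cached.length > 0) {
    // open cached[0] via FileSystem (or its local symlink) and load it into memory
  }
}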
1) What is Hadoop HDFS – Hadoop Distributed File System?
The Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS
stores very large files running on a cluster of commodity hardware. It works on the principle
of storing a small number of large files rather than a huge number of small files. HDFS stores
data reliably even in the case of hardware failure. It also provides high-throughput access to
applications by accessing data in parallel.
HDFS Architecture
Components of HDFS:
 NameNode – It works as the Master in a Hadoop cluster. The NameNode stores meta-data,
i.e. the number of blocks, replicas, and other details. The meta-data is kept in memory on
the master to provide faster retrieval of data. The NameNode maintains and manages the
slave nodes and assigns tasks to them. It should be deployed on reliable hardware, as it is
the centerpiece of HDFS.
 DataNode – It works as a Slave in a Hadoop cluster. In Hadoop HDFS, the DataNode is
responsible for storing the actual data. It also performs read and write operations as per
the clients' requests. DataNodes can be deployed on commodity hardware.
Read about HDFS in detail.
2) What are the key features of HDFS?
The various Features of HDFS are:
 Fault Tolerance – Fault tolerance is the working strength of a system in unfavorable
conditions. Hadoop HDFS is highly fault-tolerant: in HDFS, data is divided into blocks and
multiple copies of the blocks are created on different machines in the cluster. If any
machine in the cluster goes down due to unfavorable conditions, a client can still easily
access the data from other machines that contain the same copies of the data blocks.
 High Availability – HDFS is a highly available file system. Data gets replicated among the
nodes in the HDFS cluster by creating replicas of the blocks on other slaves present in the
cluster. Hence, when a client wants to access data, it can be served from the slave that
contains its blocks and is available on the nearest node in the cluster. At the time of a
node failure, a client can easily access the data from other nodes.
 Data Reliability – HDFS is a distributed file system which provides reliable data storage.
HDFS can store data in the range of hundreds of petabytes. It stores data reliably by
creating a replica of each and every block present on the nodes and hence provides a
fault tolerance facility.
 Replication – Data replication is one of the most important and unique features of HDFS.
In HDFS, data is replicated to solve the problem of data loss in unfavorable conditions like
crashing of a node, hardware failure, and so on.
 Scalability – HDFS stores data on multiple nodes in the cluster; when the requirement
increases, we can scale the cluster. Two scalability mechanisms are available: vertical and
horizontal.
 Distributed Storage – In HDFS, all the features are achieved via distributed storage and
replication. Data is stored in a distributed manner across the nodes in the HDFS cluster.
Read about HDFS Features in detail.
3) What is the difference between NAS and HDFS?
 The Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop.
HDFS is designed to store very large files running on a cluster of commodity hardware,
while Network-Attached Storage (NAS) is a file-level computer data storage server. NAS
provides data access to a heterogeneous group of clients.
 HDFS distributes data blocks across all the machines in a cluster, whereas in NAS, data is
stored on dedicated hardware.
 Hadoop HDFS is designed to work with the MapReduce framework, in which computation
moves to the data instead of data to the computation. NAS is not suitable for MapReduce,
as it stores data separately from the computations.
 Hadoop HDFS runs on a cluster of commodity hardware, which is cost-effective, while NAS
is a high-end storage device with a high cost.
4) List the various HDFS daemons in HDFS cluster?
HDFS Daemons
The daemons that run in an HDFS cluster are as follows:
 NameNode – It is the master node. It is responsible for storing the metadata of all the
files and directories. It also has information about blocks, their locations, replicas, and
other details.
 DataNode – It is the slave node that contains the actual data. The DataNode also performs
read and write operations as per the clients' requests.
 Secondary NameNode – The Secondary NameNode downloads the FsImage and EditLogs
from the NameNode and then merges the EditLogs with the FsImage periodically. It keeps
the edit log size within a limit and stores the modified FsImage in persistent storage,
which we can use in the case of NameNode failure.
5) What is NameNode and DataNode in HDFS?
NameNode – It works as the Master in a Hadoop cluster. Listed below are the main functions
performed by the NameNode:
 Stores metadata of the actual data, e.g. filename, path, number of blocks, block IDs,
block locations, number of replicas, and also slave-related configuration.
 It also manages the filesystem namespace.
 Regulates client access requests for the actual file data.
 It also assigns work to the Slaves (DataNodes).
 Executes file system namespace operations like opening/closing files and renaming
files/directories.
 As the NameNode keeps metadata in memory for fast retrieval, it requires a huge
amount of memory for its operation. It should also be hosted on reliable hardware.
DataNode – It works as a Slave in a Hadoop cluster. Listed below are the main functions
performed by the DataNode:
 Stores the actual business data.
 It is the actual worker node, so it handles reads/writes/data processing.
 Upon instruction from the Master, it performs creation/replication/deletion of data
blocks.
 As DataNodes store all the business data, they require a huge amount of storage for
their operation. They should be hosted on commodity hardware.
These were some general Hadoop interview questions and answers. Now let us take up some
Hadoop interview questions and answers especially for freshers.
HDFS Hadoop Interview Questions and Answers for Freshers
6) What do you mean by metadata in HDFS?
In Apache Hadoop HDFS, metadata shows the structure of HDFS directories and files. It
provides various information about directories and files, like permissions and replication
factor. The NameNode stores the following metadata files:
 FsImage – FsImage is an "image file". It contains the entire filesystem namespace and is
stored as a file in the NameNode's local file system. It also contains a serialized form of
all the directory and file inodes in the filesystem. Each inode is an internal representation
of a file's or directory's metadata.
 EditLogs – EditLogs contains all the recent modifications made to the file system since the
most recent FsImage. When the NameNode receives a create/update/delete request from
the client, the request is first recorded in the edits file.
If you face any doubt while reading the Hadoop interview questions and answers
drop a comment and we will get back to you.
7) What is Block in HDFS?
This is a very important Hadoop interview question, asked in most interviews.
A block is a continuous location on the hard drive where data is stored. In general, a
FileSystem stores data as a collection of blocks. In a similar way, HDFS stores each file as
blocks and distributes them across the Hadoop cluster. The HDFS client does not have any
control over the blocks, such as block location; the NameNode decides all such things.
The default size of an HDFS block is 128 MB, which we can configure as per the requirement.
All blocks of a file are of the same size except the last block, which can be the same size or
smaller.
If the data size is less than the block size, then the block size will be equal to the data size.
For example, if the file size is 129 MB, then 2 blocks will be created for it. One block will be of
the default size of 128 MB, and the other will be of 1 MB only, not 128 MB, as that would
waste space (here the block size is equal to the remaining data size). Hadoop is intelligent
enough not to waste the remaining 127 MB, so it allocates a 1 MB block for the 1 MB of data.
The major advantage of storing data in such a block size is that it saves disk seek time.
Read about HDFS Data Blocks in Detail.
8) Why is Data Block size set to 128 MB in Hadoop?
The block size is 128 MB for the following reasons:
 To reduce disk seeks (IO). The larger the block size, the fewer the file blocks and
therefore the fewer disk seeks, and each block can be transferred within a respectable
time, and in parallel.
 HDFS holds huge datasets, i.e. terabytes and petabytes of data. If we took a 4 KB block
size for HDFS, like the Linux file system, which has a 4 KB block size, then we would have
too many blocks and therefore too much metadata. Managing this huge number of blocks
and metadata would create huge overhead, which is something we don't want. So, the
block size is set to 128 MB.
On the other hand, the block size can't be too large, because the system would wait a very
long time for the last unit of data processing to finish its work.
9) What is the difference between a MapReduce InputSplit and HDFS block?
Tip for this type of Hadoop interview question: start with the definitions of Block and
InputSplit, answer in a comparative manner, and then cover data representation, size, and
an example, also comparatively.
By definition-
 Block- A block in Hadoop is a continuous location on the hard drive where HDFS stores
data. In general, a FileSystem stores data as a collection of blocks. In a similar way, HDFS
stores each file as blocks and distributes them across the Hadoop cluster.
 InputSplit- InputSplit represents the data which an individual Mapper will process. Each
split is further divided into records, and each record (which is a key-value pair) is
processed by the map.
Data representation-
 Block- It is the physical representation of data.
 InputSplit- It is the logical representation of data, which MapReduce programs and other
processing techniques use during data processing. Importantly, an InputSplit does not
contain the input data; it is just a reference to the data.
Size-
 Block- The default size of an HDFS block is 128 MB, which we can configure as per our
requirement. All blocks of a file are of the same size except the last block, which can be
the same size or smaller. In Hadoop, files are split into 128 MB blocks and then stored in
the Hadoop file system.
 InputSplit- Split size is approximately equal to the block size, by default.
Example-
Consider an example where we need to store a file in HDFS. HDFS stores files as blocks, and a
block is the smallest unit of data that can be stored or retrieved from the disk. The default
size of a block is 128 MB. HDFS breaks files into blocks and stores these blocks on different
nodes in the cluster. If we have a file of 130 MB, HDFS will break this file into 2 blocks.
Now, if one wants to perform a MapReduce operation on the blocks, the second block cannot
be processed on its own, as the record at its boundary is incomplete. InputSplit solves this
problem: an InputSplit forms a logical grouping of blocks as a single unit, because it includes
the location of the next block and the byte offset of the data needed to complete the record.
From this, we can conclude that an InputSplit is only a logical chunk of data, i.e. it has just
the information about block addresses or locations. Thus, during MapReduce execution,
Hadoop scans through the blocks and creates InputSplits.
Read InputSplit vs HDFS Blocks in Hadoop in detail.
10) How can one copy a file into HDFS with a different block size to that of existing
block size configuration?
One can copy a file into HDFS with a different block size by using the generic option
-Ddfs.blocksize=block_size, where block_size is in bytes.
Consider an example to explain it in detail:
Suppose you want to copy a file called test.txt of size, say, 128 MB into HDFS, and you want
the block size for this file to be 32 MB (33554432 bytes) in place of the default (128 MB).
You can issue the following command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/dataflair/test.txt /sample_hdfs
Now, you can check the HDFS block size associated with this file by:
hadoop fs -stat %o /sample_hdfs/test.txt
You can also check it by using the NameNode web UI for seeing the HDFS directory.
These are very common types of Hadoop interview questions and answers faced during the
interview of a fresher.
Frequently Asked Questions in Hadoop Interview
11) Which one is the master node in HDFS? Can it be commodity hardware?
The NameNode is the master node in HDFS. The NameNode stores metadata and works as the
high-availability machine in HDFS. It requires a large memory (RAM) space, so the NameNode
should be a high-end machine with good memory space. It cannot be commodity hardware,
as the entire HDFS works on it.
12) In HDFS, how Name node determines which data node to write on?
Answer this type of Hadoop interview question briefly and to the point.
The NameNode contains the metadata, i.e. the number of blocks, replicas, their locations,
and other details. This meta-data is available in memory on the master for faster retrieval of
data. Using this metadata and the rack awareness policy, the NameNode chooses the
DataNodes for each block and returns their locations to the client. The NameNode maintains
and manages the DataNodes and assigns tasks to them.
13) What is a Heartbeat in Hadoop?
A heartbeat is the signal that the NameNode receives from the DataNodes to show that they
are functioning (alive).
The NameNode and DataNodes communicate using heartbeats. If, after a certain time, a
DataNode does not send any heartbeat to the NameNode, then that node is considered dead,
and the NameNode in HDFS will create new replicas of its blocks on other DataNodes.
Heartbeats carry information about the total storage capacity, the fraction of storage in use,
and the number of data transfers currently in progress.
The default heartbeat interval is 3 seconds. One can change it by using
dfs.heartbeat.interval in hdfs-site.xml.
14) Can multiple clients write into a Hadoop HDFS file concurrently?
Multiple clients cannot write into a Hadoop HDFS file at the same time. Apache Hadoop
follows a single-writer, multiple-reader model. When an HDFS client opens a file for writing,
the NameNode grants it a lease. Now suppose some other client wants to write into that file.
It asks the NameNode for a write operation. The NameNode first checks whether it has
already granted the lease for writing into that file to someone else. If someone else holds the
lease, it will reject the write request of the other client.
Read HDFS Data Write Operation in detail.
15) How data or file is read in Hadoop HDFS?
To read from HDFS, the client first communicates with the NameNode for metadata. The
NameNode responds with the details: the number of blocks, block IDs, block locations, and
number of replicas. Then the client communicates with the DataNodes where the blocks are
present. The client starts reading data in parallel from the DataNodes, based on the
information received from the NameNode.
Once an application or HDFS client receives all the blocks of the file, it combines these blocks
to form the file. To improve read performance, the locations of the blocks are ordered by
their distance from the client. HDFS selects the replica which is closest to the client. This
reduces the read latency and bandwidth consumption. It first reads the block on the same
node, then another node in the same rack, and finally another DataNode in another rack.
Read HDFS Data Read Operation in detail.
16) Does HDFS allow a client to read a file which is already opened for writing?
Yes, a client can read a file which is already opened for writing. But the problem in reading a
file which is currently open for writing lies in the consistency of the data: HDFS does not
guarantee that the data which has been written into the file will be visible to a new reader.
For this, one can call the hflush operation. It pushes all the data in the buffer into the write
pipeline and then waits for acknowledgments from the DataNodes. Hence, by doing this, the
data that the client has written into the file before the hflush operation is sure to be visible
to the reader.
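A minimal sketch of the hflush call (the file path is an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataOutputStream out = fs.create(new Path("/tmp/open-file.txt"))) {  // assumed path
      out.writeBytes("first batch of records\n");
      out.hflush();   // data written so far becomes visible to new readers
      out.writeBytes("later records, not yet flushed\n");
    }
  }
}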
If you encounter any doubt or query in the Hadoop interview questions, feel free to
ask us in the comment section below and our support team will get back to you.
17) Why is Reading done in parallel and writing is not in HDFS?
Clients read data in parallel because doing so lets them access the data fast, and reading in
parallel also makes the system fault tolerant. But the client does not perform the write
operation in parallel, because writing in parallel might result in data inconsistency.
Suppose you have a file and two nodes are trying to write data into it in parallel. Then the
first node does not know what the second node has written and vice versa, so we cannot
identify which data to store and access.
The client in Hadoop writes data in a pipeline fashion. There are various benefits of a
pipeline write:
 More efficient bandwidth consumption for the client – The client only has to transfer one
replica to the first DataNode in the pipeline, so each node only receives and sends one
replica over the network (except the last DataNode, which only receives data). This
results in balanced bandwidth consumption, as compared to the client writing three
replicas to three different DataNodes.
 Smaller sent/ack window to maintain – The client maintains a much smaller sliding
window. The sliding window records which blocks of the replica are being sent to the
DataNodes and which blocks are waiting for acks to confirm that the write has been done.
In a pipeline write, the client appears to write data to only one DataNode.
18) What is the problem with small files in Apache Hadoop?
Hadoop is not suitable for small data. Hadoop HDFS lacks the ability to support random
reading of small files. A small file in HDFS is one smaller than the HDFS block size (default
128 MB). If we store a huge number of small files, HDFS can't handle them well. HDFS works
best with a small number of large files for storing large datasets; it is not suitable for a large
number of small files. A large number of small files overloads the NameNode, since it stores
the namespace of HDFS.
Solution –
 HAR (Hadoop Archive) files – HAR files deal with the small file issue. HAR introduces a
layer on top of HDFS, which provides an interface for file access. Using the Hadoop
archive command we can create HAR files; the command runs a MapReduce job to pack
the archived files into a smaller number of HDFS files. However, reading through files in a
HAR is not more efficient than reading through files in HDFS.
 Sequence Files – Sequence files also deal with the small file problem. In this approach,
we use the filename as the key and the file contents as the value. Suppose we have
10,000 files, each of 100 KB; we can write a program to put them into a single sequence
file and then process them in a streaming fashion.
19) What is throughput in HDFS?
The amount of work done in a unit of time is known as throughput. Below are the reasons
why HDFS provides good throughput:
 Hadoop works on the data locality principle, which states that computation is moved to
the data instead of data to the computation. This reduces network congestion and
therefore enhances the overall system throughput.
 HDFS follows the write-once, read-many model. This simplifies data coherency issues, as
data written once cannot be modified. Thus, HDFS provides high-throughput data access.
20) Comparison between Secondary NameNode and Checkpoint Node in Hadoop?
The Secondary NameNode downloads the FsImage and EditLogs from the NameNode and then
merges the EditLogs with the FsImage periodically. The Secondary NameNode stores the
modified FsImage in persistent storage, so we can use the FsImage in the case of NameNode
failure. But it does not upload the merged FsImage back to the active NameNode. A
Checkpoint node, on the other hand, is a node which periodically creates checkpoints of the
namespace.
The Checkpoint Node in Hadoop first downloads the FsImage and edits from the active
NameNode. Then it merges them (FsImage and edits) locally, and at last it uploads the new
image back to the active NameNode.
The above Hadoop interview questions and answers (7-20) were for freshers; however,
experienced candidates can also go through these Hadoop interview questions and answers
to revise the basics.
HDFS Hadoop Interview Questions and Answers for Experienced
21) What is a Backup node in Hadoop?
The Backup node provides the same checkpointing functionality as the Checkpoint node (a
Checkpoint node is a node which periodically creates checkpoints of the namespace: it
downloads the FsImage and edits from the active NameNode, merges them locally, and
uploads the new image back to the active NameNode). In Hadoop, the Backup node keeps an
in-memory, up-to-date copy of the file system namespace, which is always synchronized with
the active NameNode state.
The Backup node does not need to download the FsImage and edits files from the active
NameNode in order to create a checkpoint, as would be required with a Checkpoint node or
Secondary NameNode, since it already has an up-to-date state of the namespace in memory.
The Backup node checkpoint process is more efficient, as it only needs to save the namespace
into the local FsImage file and reset the edits.
The NameNode supports one Backup node at a time, and no Checkpoint nodes may be
registered while a Backup node is in use.
22) How does HDFS ensure Data Integrity of data blocks stored in HDFS?
Data integrity ensures the correctness of the data. However, it is possible that the data will
get corrupted during I/O operations on the disk. Corruption can occur for various reasons,
such as network faults or buggy software. Hadoop HDFS client software implements checksum
checking on the contents of HDFS files.
In Hadoop, when a client creates an HDFS file, it computes a checksum of each block of the
file and then stores these checksums in a separate hidden file in the same HDFS namespace.
When a client retrieves file contents, it first verifies that the data it received from each
DataNode matches the checksum stored in the associated checksum file. If not, the client can
opt to retrieve that block from another DataNode that has a replica of that block.
23) What do you mean by the NameNode High Availability in hadoop?
In Hadoop 1.x, the NameNode is a single point of failure (SPOF). If the NameNode fails, all
clients are unable to read or write files or list files. In such an event, the whole Hadoop
system is out of service until a new NameNode is up.
Hadoop 2.x overcomes this SPOF by providing support for multiple NameNodes. The high
availability feature gives an extra NameNode (an active-standby pair) to the Hadoop
architecture. This extra NameNode is configured for automatic failover: if the active
NameNode fails, then the standby NameNode takes over all its responsibilities and the cluster
continues to work.
The initial implementation of NameNode high availability provided for a single active/standby
NameNode pair. However, some deployments require a higher degree of fault tolerance.
Hadoop 3.x enables this by allowing the user to run multiple standby NameNodes. For
example, by configuring 3 NameNodes and 5 JournalNodes, the cluster can tolerate the
failure of 2 nodes rather than 1.
Read about HDFS NameNode High Availability.
24) What is Fault Tolerance in Hadoop HDFS?
Fault tolerance in HDFS is the working strength of a system in unfavorable conditions, such as
the crashing of a node, hardware failure, and so on.
HDFS controls faults by the process of replica creation. When a client stores a file in HDFS,
the Hadoop framework divides the file into blocks, distributes the data blocks across different
machines present in the HDFS cluster, and then creates a replica of each block on other
machines present in the cluster.
HDFS, by default, creates 3 copies of each block on other machines present in the cluster. If
any machine in the cluster goes down or fails due to unfavorable conditions, the user can still
easily access that data from the other machines on which a replica of the block is present.
Read about HDFS Fault Tolerance in detail.
25) Describe HDFS Federation.
In Hadoop 1.0, the HDFS architecture allows only a single namespace for the entire cluster.
Limitations-
 The namespace layer and storage layer are tightly coupled. This makes alternate
implementations of the NameNode difficult. It also restricts other services from using the
block storage directly.
 The namespace is not scalable like the DataNodes. Scaling an HDFS cluster is done
horizontally by adding DataNodes, but we can't add more namespaces to an existing
cluster.
 There is no separation of namespaces, so there is no isolation among tenant organizations
that are using the cluster.
In Hadoop 2.0, HDFS Federation overcomes these limitations. It supports multiple
NameNodes/namespaces to scale the namespace horizontally. HDFS Federation isolates
different categories of applications and users to different namespaces. This improves
read/write throughput by adding more NameNodes.
Read about HDFS Federation in detail.
26) What is the default replication factor in Hadoop and how will you change it?
The default replication factor is 3. One can change the replication factor in the following
three ways:
 By adding this property to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>5</value>
<description>Block Replication</description>
</property>
 One can also change the replication factor on a per-file basis using the command:
hadoop fs -setrep -w 3 /file_location
 One can also change the replication factor for all the files in a directory by using:
hadoop fs -setrep -w 3 -R /directory_location
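Besides the shell commands above, the replication factor of an existing file can also be changed programmatically through the FileSystem API; a minimal sketch (the path is an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Roughly equivalent to "hadoop fs -setrep 3 /file_location" for a single file.
    boolean scheduled = fs.setReplication(new Path("/file_location"), (short) 3);
    System.out.println("Replication change scheduled: " + scheduled);
  }
}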
27) Why Hadoop performs replication, although it results in data redundancy?
In HDFS, replication provides fault tolerance, and it is one of the unique features of HDFS.
Data replication solves the issue of data loss in unfavorable conditions, such as hardware
failure, crashing of a node, and so on.
HDFS by default creates 3 replicas of each block across the cluster in Hadoop, and we can
change this as per the need. So if any node goes down, we can recover the data on that node
from another node. In HDFS, replication does lead to the consumption of a lot of space, but
the user can always add more nodes to the cluster if required. It is very rare to have
free-space issues in a practical cluster, as the very first reason to deploy HDFS was to store
huge data sets. Also, one can change the replication factor to save HDFS space, or use a
different codec provided by Hadoop to compress the data.
28) What is Rack Awareness in Apache Hadoop?
In Hadoop, Rack Awareness improves network traffic while reading/writing files. With Rack
Awareness, the NameNode chooses DataNodes which are on the same rack or a nearby rack.
The NameNode obtains the rack information by maintaining the rack IDs of each DataNode,
and thus chooses DataNodes based on this rack information.
The HDFS NameNode makes sure that all the replicas are not stored on a single rack or the
same rack. It follows the Rack Awareness algorithm to reduce latency as well as improve
fault tolerance.
The default replication factor is 3. Therefore, according to the Rack Awareness algorithm:
 The first replica of the block is stored on a local rack.
 The next replica is stored on another DataNode within the same rack.
 The third replica is stored on a different rack.
In Hadoop, we need Rack Awareness because it improves:
 Data high availability and reliability.
 The performance of the cluster.
 Network bandwidth.
Read about HDFS Rack Awareness in detail.
29) Explain the Single point of Failure in Hadoop?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, all
clients are unable to read/write files. In such an event, the whole Hadoop system is out of
service until a new NameNode is up.
Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes. The high
availability feature provides an extra NameNode to the Hadoop architecture. This feature
provides automatic failover: if the active NameNode fails, then the standby NameNode takes
over all the responsibilities of the active node, and the cluster continues to work.
The initial implementation of NameNode high availability provided for a single active/standby
NameNode pair. However, some deployments require a higher degree of fault tolerance. The
new version, 3.0, enables this by allowing the user to run multiple standby NameNodes. For
example, by configuring 3 NameNodes and 5 JournalNodes, the cluster can tolerate the
failure of 2 nodes rather than 1.
30) Explain Erasure Coding in Apache Hadoop?
For fault tolerance, HDFS, by default, replicates each block three times. Replication provides
a very simple form of redundancy to protect against the failure of DataNodes. But replication
is very expensive: a 3x replication scheme results in 200% overhead in storage space and
other resources.
Hadoop 3.x introduced a new feature called "Erasure Coding" to use in place of replication.
It provides the same level of fault tolerance with much less storage space, at about 50%
storage overhead.
Erasure Coding is modeled on RAID (Redundant Array of Inexpensive Disks). RAID implements
Erasure Coding through striping, in which it divides logically sequential data (such as a file)
into smaller units (such as a bit, byte, or block). After that, it stores the data on different
disks.
Encoding – In this step, RAID calculates and sorts parity cells for each stripe of data cells,
and errors are recovered through the parity. Erasure Coding extends a message with
redundant data for fault tolerance. Its codec operates on uniformly sized data cells: the
codec takes a number of data cells as input and produces parity cells as the output.
There are two algorithms available for Erasure Coding:
 XOR algorithm
 Reed-Solomon algorithm
Read about Erasure Coding in detail
31) What is Balancer in Hadoop?
Data may not always be distributed uniformly across the DataNodes in HDFS, due to the
following reasons:
 A lot of deletes and writes
 Disk replacement
The data block allocation strategy tries hard to spread new blocks uniformly among all the
DataNodes. In a large cluster, each node has a different capacity, and quite often you need
to delete some old nodes and also add new nodes for more capacity.
The addition of a new DataNode can become a bottleneck for the following reason:
 When the Hadoop framework allocates all the new blocks to, and reads from, the new
DataNode, this will overload the new DataNode.
HDFS provides a tool called Balancer that analyzes block placement and balances data across
the DataNodes.
These are very common types of Hadoop interview questions and answers faced during the
interview of an experienced professional.
32) What is Disk Balancer in Apache Hadoop?
Disk Balancer is a command-line tool which distributes data evenly across all disks of a
DataNode. This tool operates against a given DataNode and moves blocks from one disk to
another.
Disk Balancer works by creating and executing a plan (a set of statements) on the DataNode.
The plan describes how much data should move between two disks. A plan is composed of
multiple steps; a move step has a source disk, a destination disk, and the number of bytes to
move. The plan is executed against an operational DataNode.
By default, Disk Balancer is not enabled. Hence, to enable Disk Balancer,
dfs.disk.balancer.enabled must be set to true in hdfs-site.xml.
When we write a new block in HDFS, the DataNode uses a volume-choosing policy to choose
the disk for the block. Each directory is a volume in HDFS terminology. There are two such
policies: round-robin and available space.
 Round-robin distributes the new blocks evenly across the available disks.
 Available space writes data to the disk that has the maximum free space (by percentage).
Read about HDFS Disk Balancer in detail.
33) What is active and passive NameNode in Hadoop?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, then
all clients are unable to read or write files or list files. In such an event, the whole Hadoop
system is out of service until a new NameNode is up.
Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes. The high
availability feature provides an extra NameNode to the Hadoop architecture for automatic
failover.
 Active NameNode – It is the NameNode which works and runs in the cluster. It is also
responsible for all client operations in the cluster.
 Passive NameNode – It is a standby NameNode, which has data similar to the active
NameNode's. It simply acts as a slave and maintains enough state to provide a fast
failover, if necessary.
If the active NameNode fails, then the passive NameNode takes over all the responsibilities of
the active node, and the cluster continues to work.
34) How is indexing done in Hadoop HDFS?
Apache Hadoop has a unique way of indexing. Once the Hadoop framework stores the data
according to the data block size, HDFS keeps storing, in the last part of each block, a pointer
to where the next part of the data will be. In fact, this is the basis of HDFS.
35) What is a Block Scanner in HDFS?
The block scanner verifies whether the data blocks stored on each DataNode are correct or
not. When the block scanner detects a corrupted data block, the following steps occur:
 First of all, the DataNode reports the corrupted block to the NameNode.
 After that, the NameNode starts the process of creating a new replica. It creates the new
replica using a correct replica of the corrupted block present on other DataNodes.
 When the replication count of the correct replicas matches the replication factor (3), the
corrupted block is deleted.
36) How to perform the inter-cluster data copying work in HDFS?
HDFS uses the distributed copy (DistCp) command to perform inter-cluster data copying, as
shown below:
hadoop distcp hdfs://<source NameNode> hdfs://<target NameNode>
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling,
recovery, and reporting. This distributed copy tool expands a list of files and directories into the input to map tasks.
37) What are the main properties of hdfs-site.xml file?
hdfs-site.xml – It specifies the configuration settings for HDFS daemons in Hadoop. It also provides the default block replication and permission
checking on HDFS.
The three main hdfs-site.xml properties are:
1. dfs.name.dir gives the location where the NameNode stores the metadata
(FsImage and edit logs). It also specifies where DFS should locate it, on the local disk or
on a remote directory.
2. dfs.data.dir gives the location where the DataNodes store the data.
3. fs.checkpoint.dir is the directory on the file system in which the Secondary NameNode
stores the temporary images of the edit logs.
38) How can one check whether NameNode is working or not?
One can check the status of the HDFS NameNode in several ways. Most commonly, one uses
the jps command to check the status of all the daemons running in HDFS.
39) How would you restart NameNode?
The NameNode is also known as the Master node. It stores meta-data, i.e. the number of
blocks, replicas, and other details. The NameNode maintains and manages the slave nodes
and assigns tasks to them.
You can restart the NameNode by either of the following two methods:
 First stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode
command. Then start the NameNode using the ./sbin/hadoop-daemon.sh start namenode
command.
 Use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all the daemons first
and then start all the daemons.
The above Hadoop interview questions and answers were for experienced candidates, but
freshers can also refer to these Hadoop interview questions and answers for in-depth
knowledge. Now let's move forward with some advanced Hadoop interview questions and
answers.
Advanced Questions for Hadoop Interview
40) How NameNode tackle Datanode failures in Hadoop?
HDFS has a master-slave architecture in which the master is the NameNode and the slaves
are DataNodes. An HDFS cluster has a single NameNode that manages the file system
namespace (metadata) and multiple DataNodes that are responsible for storing the actual
data in HDFS and performing read-write operations as per the clients' requests.
The NameNode receives a heartbeat and a block report from each DataNode. Receipt of a
heartbeat implies that the DataNode is alive and functioning properly, and a block report
contains a list of all the blocks on a DataNode. When the NameNode observes that a DataNode
has not sent a heartbeat message after a certain amount of time, the DataNode is marked as
dead. The NameNode then replicates the blocks of the dead node to other DataNodes using
the replicas created earlier. Hence, the NameNode can easily handle DataNode failure.
41) Is Namenode machine same as DataNode machine as in terms of hardware in
Hadoop?
The NameNode is a highly available server, unlike the DataNodes. The NameNode manages
the file system namespace. It also maintains the metadata information: the number of
blocks, their locations, replicas, and other details. It also executes file system operations
such as naming, closing, and opening files/directories.
Because of the above reasons, the NameNode requires higher RAM for storing the metadata
of millions of files, whereas the DataNode is responsible for storing the actual data in HDFS
and performs read and write operations as per the clients' requests. Therefore, DataNodes
need a higher disk capacity for storing huge data sets.
42) If DataNode increases, then do we need to upgrade NameNode in Hadoop?
The NameNode stores meta-data, i.e. the number of blocks, their locations, and replicas. In
Hadoop, this meta-data is kept in memory on the master for faster retrieval of data. The
NameNode manages and maintains the slave nodes and assigns tasks to them. It regulates
clients' access to files.
It also executes file system operations such as naming, closing, and opening
files/directories. During Hadoop installation, the framework sizes the NameNode based on
the size of the cluster. Mostly we don't need to upgrade the NameNode when the number of
DataNodes increases, because it does not store the actual data, only the metadata, so such a
requirement rarely arises.
43) Explain what happens if, during the PUT operation, HDFS block is assigned a
replication factor 1 instead of the default value 3?
The replication factor can be set for the entire cluster to adjust the number of replicated
blocks. It ensures high data availability.
The cluster will have n-1 duplicated blocks for every block that is present in HDFS. So, if the
replication factor during the PUT operation is set to 1 in place of the default value 3, then
there will be only a single copy of the data. And if the DataNode holding it crashes under any
circumstances, that only copy of the data will be lost.
44) What are file permissions in HDFS? how does HDFS check permissions for
files/directory?
The Hadoop Distributed File System (HDFS) implements a permissions model for
files/directories.
For each file/directory, one can manage permissions for a set of 3 distinct user classes:
owner, group, and others.
There are also 3 different permissions for each user class: read (r), write (w), and
execute (x).
 For files, the w permission is required to write to the file and the r permission to read
the file.
 For directories, the w permission is required to create or delete entries in the directory,
and the r permission to list the contents of the directory.
 The x permission is required to access a child of the directory.
HDFS checks permissions for a file or directory as follows:
 If the user name matches the owner of the directory, Hadoop tests the owner's
permissions.
 If the group matches the directory's group, then Hadoop tests the user's group
permissions.
 Hadoop tests the "other" permission when the owner and the group names don't match.
 If none of the permission checks succeed, the client's request is denied.
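For reference, a hedged sketch of inspecting and setting these permissions through the Java FileSystem API; the directory path and mode are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/user/dataflair/reports");   // assumed path
    FileStatus status = fs.getFileStatus(dir);
    // Prints owner, group and the rwx permission string, e.g. "rwxr-x---".
    System.out.println(status.getOwner() + " " + status.getGroup() + " " + status.getPermission());
    fs.setPermission(dir, new FsPermission((short) 0750));   // owner rwx, group r-x, others ---
  }
}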
45) How one can format Hadoop HDFS?
One can format HDFS by using the bin/hadoop namenode -format command.
$ bin/hadoop namenode -format formats the HDFS via the NameNode. Formatting means
initializing the directory specified by the dfs.name.dir variable. When you run this command
on an existing file system, you will lose all the data stored on your NameNode.
The Hadoop NameNode directory contains the FsImage and edit files, which hold the basic
information about the Hadoop file system, such as which user created which files, and so on.
Hence, when we format the NameNode, it deletes the above information from the directory.
This location is specified in hdfs-site.xml as dfs.namenode.name.dir. Formatting a NameNode
does not format the DataNodes.
NOTE: Never format an up-and-running Hadoop file system; you will lose the data stored in
HDFS.
46) What is the process to change the files at arbitrary locations in HDFS?
HDFS doesn't support modifications at arbitrary offsets in a file or multiple writers. A single
writer writes files in an append-only fashion: writes to a file in HDFS are always made at the
end of the file.
47) Differentiate HDFS & HBase.
Data write process
 HDFS- Append method
 HBase- Bulk incremental, random write
Data read process
 HDFS- Table scan
 HBase- Table scan/random read/small range scan
Hive SQL querying
 HDFS- Excellent
 HBase- Average
Read about HBase in detail.
These are some advanced Hadoop interview questions and answers for HDFS that will help
you answer many more interview questions in the best manner.
48) What is meant by streaming access?
HDFS works on the principle of "write once, read many". Its focus is on fast and accurate
data retrieval. Streaming access means reading the complete data set instead of retrieving a
single record from the database.
49) How to transfer data from Hive to HDFS?
One can transfer data from Hive by writing the query:
hive> insert overwrite directory '/' select * from emp;
The output you receive will be stored in part files in the specified HDFS path.
50) How to add/delete a Node to the existing cluster?
To add a node to the existing cluster:
Add the host name/IP address to the dfs.hosts/slaves file. Then refresh the cluster with:
$ hadoop dfsadmin -refreshNodes
To remove a node from the existing cluster:
Add the hostname/IP address to dfs.hosts.exclude and remove the entry from the slaves
file. Then refresh the cluster with:
$ hadoop dfsadmin -refreshNodes
51) How to format the HDFS? How frequently it will be done?
This type of Hadoop interview question should also be answered very briefly and to the
point. Giving a very lengthy answer here is unnecessary and may lead to negative points.
$ hadoop namenode -format
Note: Format the HDFS only once, during the initial cluster setup.
52) What is the importance of dfs.namenode.name.dir in HDFS?
dfs.namenode.name.dir contains the FsImage file for the NameNode.
We should configure it to write to at least two filesystems on different physical hosts (e.g.
the NameNode and the Secondary NameNode), because if we lose the FsImage file we lose
the entire HDFS file system. There is also no other recovery mechanism if no FsImage file is
available.
Questions 40-52 were the advanced Hadoop interview questions and answers, meant to give
you in-depth knowledge for handling difficult Hadoop interview questions and answers.
This was all about the Hadoop Interview Questions and Answers
These questions are frequently asked Hadoop interview questions and answers. You
can read here some more Hadoop HDFS interview questions and answers.
After going through these top Hadoop interview questions and answers you will be able to
confidently face an interview and answer the Hadoop interview questions asked in your
interview in the best manner. These Hadoop interview questions are suggested by the experts
at DataFlair.
Key –
Q.1 – Q.5 Basic Hadoop Interview Questions
Q.6 – Q.10 HDFS Hadoop interview questions and answers for freshers
Q. 11- Q. 20 Frequently asked Questions in Hadoop Interview
Q.21 – Q.39 HDFS Hadoop interview questions and answers for experienced
Q.40 – Q.52 Advanced HDFS Hadoop interview questions and answers
These Hadoop interview questions and answers are categorized so that you can pay
more attention to questions specified for you, however, it is recommended that you
go through all the Hadoop interview questions and answers for complete
understanding.
If you have any more doubts or queries on Hadoop Interview Questions and Answers, drop a
comment and our support team will be happy to help you. Now let's jump to the second part
of Hadoop Interview Questions, i.e. MapReduce Interview Questions and Answers.
Hadoop Interview Questions and Answers for MapReduce
It is difficult to pass a Hadoop interview, as Hadoop is a fast-growing technology. To get you
through this tough path, these MapReduce Hadoop interview questions and answers will
serve as the backbone. This section contains the commonly asked MapReduce Hadoop
interview questions and answers.
In this section on MapReduce Hadoop interview questions and answers, we have
covered 50+ Hadoop interview questions and answers for MapReduce in detail. We
have covered MapReduce Hadoop interview questions and answers for freshers,
MapReduce Hadoop interview questions and answers for experienced as well as
some advanced Mapreduce Hadoop interview questions and answers.
These MapReduce Hadoop interview questions are framed keeping in mind the needs of the
era and the trending interview patterns that companies follow. The interview questions on
Hadoop MapReduce are dedicatedly framed by the company experts to help you reach your
goal.
Top 50 MapReduce Hadoop Interview Questions and Answers for Hadoop Jobs.
Basic MapReduce Hadoop Interview Questions and Answers
53) What is MapReduce in Hadoop?
Hadoop MapReduce is the data processing layer. It is the framework for writing
applications that process the vast amount of data stored in the HDFS.
It processes a huge amount of data in parallel by dividing the job into a set of
independent tasks (sub-job). In Hadoop, MapReduce works by breaking the
processing into phases: Map and Reduce.
 Map – It is the first phase of processing, in which we specify all the complex
logic/business rules/costly code. The map takes a set of data and converts it into another
set of data, breaking individual elements into tuples (key-value pairs).
 Reduce – It is the second phase of processing, in which we specify light-weight processing
like aggregation/summation. The output from the map is the input to the Reducer. The
Reducer then combines the (key, value) tuples based on the key and modifies the value
of the key accordingly.
Read about Hadoop MapReduce in Detail.
54) What is the need of MapReduce in Hadoop?
In Hadoop, once we have stored the data in HDFS, the first question that arises is how to process this data. Transferring all of it to a central node for processing is not going to work, since we would wait forever for the data to move over the network. Google faced this same problem with its distributed Google File System (GFS) and solved it using the MapReduce data processing model.
Challenges before MapReduce
 Time-consuming – Using a single machine we cannot analyze the data (terabytes), as it would take a lot of time.
 Costly – Keeping all the data (terabytes) on one server or as a database cluster is very expensive and also hard to manage.
How MapReduce overcomes these challenges
 Time-efficient – If we want to analyze the data, we write the analysis code in the Map function and the aggregation code in the Reduce function and execute it. This MapReduce code goes to every machine that holds a part of our data and executes on that specific part. Hence, instead of moving terabytes of data, we move just kilobytes of code, which makes this movement time-efficient.
 Cost-efficient – It distributes the data over multiple low-configuration machines.
55) What is Mapper in Hadoop?
The Mapper task processes each input record (from the RecordReader) and generates a key-value pair. The key-value pairs generated by the mapper can be completely different from the input pair. The Mapper stores its intermediate output on the local disk rather than on HDFS, because it is temporary data and writing it to HDFS would create unnecessary multiple copies.
The Mapper only understands key-value pairs of data, so before passing data to the mapper, the framework first converts the data into key-value pairs. InputSplit and RecordReader perform this conversion: InputSplit is the logical representation of data, and RecordReader communicates with the InputSplit and converts the data into key-value pairs. Hence:
 Key is a reference to the input value.
 Value is the data set on which to operate.
The number of maps depends on the total size of the input, i.e. the total number of blocks of the input files.
Number of Mappers = (total data size) / (input split size)
If data size = 1 TB and input split size = 100 MB, then
Number of Mappers = (1000 * 1000) / 100 = 10,000
Read about Mapper in detail.
56) What is Reducer in Hadoop?
The Reducer takes the output of the Mapper (intermediate key-value pairs) as its input. It then runs a reduce function on each of them to generate the output. The output of the reducer is the final output, which it stores in HDFS. Usually, in the Reducer, we do aggregation or summation sorts of computation. The Reducer has three primary phases-
 Shuffle- The framework fetches the relevant partition of the output of all the Mappers for each reducer via HTTP.
 Sort- The framework groups the Reducer inputs by key in this phase. The shuffle and sort phases occur simultaneously.
 Reduce- After shuffling and sorting, the reduce task aggregates the key-value pairs. In this phase, the framework calls the reduce (Object, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs.
With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the job.
Hence, the right number of reducers is 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
Read about Reducer in detail.
57) How to set mappers and reducers for MapReduce jobs?
One can configure the JobConf to set the number of mappers and reducers, as in the sketch below.
 For Mapper – job.setNumMapTasks()
 For Reducer – job.setNumReduceTasks()
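A minimal driver sketch of the same idea with the newer Job API is given below. Note that the number of map tasks is only a hint to the framework (the real count is driven by the input splits), while the number of reduce tasks is honored exactly; the class name is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskCountConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hint for the number of map tasks; the actual count is driven by the input splits.
    conf.setInt("mapreduce.job.maps", 20);

    Job job = Job.getInstance(conf, "task-count-demo");
    // The number of reduce tasks is honored exactly by the framework.
    job.setNumReduceTasks(4);
  }
}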
These were some general MapReduce Hadoop interview questions and answers. Now let us take up some MapReduce Hadoop interview questions and answers especially for freshers.
MapReduce Hadoop Interview Questions and Answers for Freshers
58) What is the key- value pair in Hadoop MapReduce?
Hadoop MapReduce implements a data model which represents data as key-value pairs. Both input and output to the MapReduce framework should be in key-value pairs only. In Hadoop, if a schema is static we can work directly on the columns instead of key-value pairs; but if the schema is not static, we work on keys and values. Keys and values are not intrinsic properties of the data: the user analyzing the data chooses the key-value pair.
A key-value pair in Hadoop MapReduce is generated in the following way:
 InputSplit – It is the logical representation of data. An InputSplit represents the data which an individual Mapper will process.
 RecordReader – It converts the split into records which are in the form of key-value pairs, suitable for reading by the mapper.
By default, the RecordReader uses TextInputFormat for converting data into key-value pairs:
 Key – It is the byte offset of the beginning of the line within the file.
 Value – It is the contents of the line, excluding line terminators.
For example, if the file content is- on the top of the crumpetty Tree
Key- 0
Value- on the top of the crumpetty Tree
Read about MapReduce Key-value pair in detail.
59) What is the need of key-value pair to process the data in MapReduce?
Hadoop MapReduce works on unstructured and semi-structured data apart from structured data. One can read structured data, like the data stored in an RDBMS, by columns.
But handling unstructured data is feasible using key-value pairs, and the very core idea of MapReduce works on the basis of these pairs: the framework maps the data into a collection of key-value pairs with the mapper, and the reducer then works on all the pairs with the same key.
In most computations, the Map operation is applied to each logical “record” in our input and computes a set of intermediate key-value pairs. Then the Reduce operation is applied to all the values that share the same key, which combines the derived data properly.
Thus, we can say that key-value pairs are the best solution to work on data problems
on MapReduce.
60) What are the most common InputFormats in Hadoop?
In Hadoop, input files store the data for a MapReduce job. Input files typically reside in HDFS. In MapReduce, the InputFormat defines how these input files are split and read. The InputFormat creates the InputSplit.
The most common InputFormats are:
 FileInputFormat – It is the base class for all file-based InputFormats. It specifies the input directory where the data files are present. FileInputFormat reads all the files and then divides them into one or more InputSplits.
 TextInputFormat – It is the default InputFormat of MapReduce. It uses each line of each input file as a separate record and thus performs no parsing.
Key- byte offset.
Value- It is the contents of the line, excluding line terminators.
 KeyValueTextInputFormat – It also treats each line of input as a separate record. The main difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line itself into key and value by the tab character ('\t').
Key- Everything up to the tab character.
Value- Remaining part of the line after the tab character.
 SequenceFileInputFormat – It reads sequence files.
Key & Value- Both are user-defined.
Read about Mapreduce InputFormat in detail.
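As a hedged sketch, the InputFormat for a job is chosen in the driver. The classes used below ship with Hadoop, while the input path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inputformat-demo");
    // Default is TextInputFormat; switch to tab-separated key-value records instead.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input")); // hypothetical path
  }
}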
61) Explain InputSplit in Hadoop?
The InputFormat creates the InputSplit, which is the logical representation of data. The Hadoop framework further divides the InputSplit into records, and the mapper processes each record. The size of a split is approximately equal to the HDFS block size (128 MB). In a MapReduce program, the InputSplit size is user-defined, so the user can control the split size based on the size of the data.
An InputSplit in MapReduce has a length in bytes and a set of storage locations (hostname strings). The framework uses the storage locations to place map tasks as close to the split's data as possible. Map tasks are processed in order of InputSplit size, so that the largest one gets processed first; this minimizes the job runtime. The important thing is that an InputSplit is just a reference to the data; it does not contain the input data itself.
The client running the job calculates the splits for the job by calling getSplits(), and then sends them to the application master, which uses their storage locations to schedule map tasks that process them on the cluster. The split is passed to the createRecordReader() method, which creates a RecordReader for the split in the MapReduce job. The RecordReader then generates records (key-value pairs) and passes them to the map function.
Read about MapReduce InputSplit in detail.
62) Explain the difference between a MapReduce InputSplit and HDFS block.
Tip for this type of MapReduce Hadoop interview question: start with the definitions of Block and InputSplit, answer in a comparative style, and then cover their data representation, size and an example, also in a comparative style.
By definition-
 Block – It is the smallest unit of data that the file system stores. In general, a FileSystem stores data as a collection of blocks. In a similar way, HDFS stores each file as blocks and distributes them across the Hadoop cluster.
 InputSplit – InputSplit represents the data which individual Mapper will process.
Further split divides into records. Each record (which is a key-value pair) will be
processed by the map.
Size-
 Block – The default size of an HDFS block is 128 MB, which can be configured as per our requirement. All blocks of a file are of the same size except the last block, which can be of the same size or smaller. In Hadoop, files are split into 128 MB blocks and then stored in the Hadoop file system.
 InputSplit – Split size is approximately equal to block size, by default.
Data representation-
 Block – It is the physical representation of data.
 InputSplit – It is the logical representation of data. Thus, during data processing, the MapReduce program and other processing techniques use the InputSplit. The important thing is that the InputSplit does not contain the input data; it is just a reference to the data.
Read more differences between MapReduce InputSplit and HDFS block.
63) What is the purpose of RecordReader in hadoop?
The RecordReader in Hadoop uses the data within the boundaries defined by the InputSplit and creates key-value pairs for the mapper. The “start” is the byte position in the file at which the RecordReader should start generating key-value pairs, and the “end” is where it should stop reading records.
The RecordReader in a MapReduce job loads data from its source and converts it into key-value pairs suitable for reading by the mapper. The RecordReader communicates with the InputSplit until it has read the complete file. The InputFormat defines the RecordReader instance used by the MapReduce framework. By default, it uses TextInputFormat for converting data into key-value pairs.
Two commonly used RecordReaders are LineRecordReader and SequenceFileRecordReader.
LineRecordReader in Hadoop is the default RecordReader that TextInputFormat
provides. Hence, each line of the input file is the new value and the key is byte offset.
SequenceFileRecordReader in Hadoop reads data specified by the header of a
sequence file.
Read about MapReduce RecordReder in detail.
64) What is Combiner in Hadoop?
In a MapReduce job, the Mapper generates large chunks of intermediate data, which are then passed to the reducer for further processing. This leads to enormous network congestion. The Hadoop MapReduce framework provides a function known as the Combiner, which plays a key role in reducing network congestion.
The Combiner in Hadoop is also known as a Mini-reducer that performs local aggregation on the mapper's output. This reduces the data transfer between mapper and reducer and increases efficiency.
There is no guarantee of execution of the Combiner in Hadoop, i.e. Hadoop may or may not execute a combiner, and if required it may execute it more than once. Hence, your MapReduce jobs should not depend on the Combiner's execution.
Read about MapReduce Combiner in detail.
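A hedged driver sketch follows; WordCountMapper and WordCountReducer are the hypothetical classes from the word-count sketch earlier on this page. The same reducer logic is simply registered a second time as the combiner.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combiner-demo");
    job.setMapperClass(WordCountMapper.class);     // hypothetical mapper from the earlier sketch
    // The same reducer logic runs as a combiner on each mapper's local output.
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);
  }
}
Because the combiner may run zero, one or several times, the job must produce correct results even if it never executes.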
65) Explain about the partitioning, shuffle and sort phase in MapReduce?
Partitioning Phase – Partitioning ensures that all the values for each key are grouped together and that all the values of a single key go to the same Reducer, thus allowing even distribution of the map output over the Reducers.
Shuffle Phase – It is the process by which the system sorts the key-value output of the map tasks and transfers it to the reducers.
Sort Phase – The mapper generates the intermediate key-value pairs. Before the Reducer starts, the MapReduce framework sorts these key-value pairs by key. This also helps the reducer to easily distinguish when a new reduce task should start, which saves time for the reducer.
Read about Shuffling and Sorting in detail
66) What does a “MapReduce Partitioner” do?
The Partitioner comes into the picture if we are working with more than one reducer. The Partitioner controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, by a hash function. Partitioning ensures that all the values for each key are grouped together and that all the values of a single key go to the same reducer, thus allowing even distribution of the map output over the reducers. It redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.
The total number of partitions is equal to the number of Reducers. The Partitioner in Hadoop divides the data according to the number of reducers, so a single reducer processes the data from a single partition.
Read about MapReduce Partitioner in detail.
67) If no custom partitioner is defined in Hadoop then how is data partitioned before it
is sent to the reducer?
Hadoop MapReduce by default uses the HashPartitioner.
It uses the hashCode() method to determine to which partition a given (key, value) pair will be sent. HashPartitioner has a method called getPartition, which takes key.hashCode() & Integer.MAX_VALUE and finds the modulus using the number of reduce tasks. Suppose there are 10 reduce tasks; then getPartition will return values 0 through 9 for all keys.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
These are very common types of MapReduce Hadoop interview questions and answers faced during the interview of a fresher.
68) How to write a custom partitioner for a Hadoop MapReduce job?
This is one of the most common MapReduce Hadoop interview questions and answers.
A custom partitioner distributes the results across the reducers based on a user-defined condition. By setting a Partitioner to partition by the key, we can guarantee that records for the same key will go to the same reducer. It also ensures that only one reducer receives all the records for that particular key.
By the following steps, we can write a custom partitioner for a Hadoop MapReduce job (a sketch is given after this list):
 Create a new class that extends the Partitioner class.
 Override the getPartition method, in the wrapper that runs in the MapReduce.
 Add the custom partitioner to the job by using the setPartitionerClass method, or add the custom partitioner to the job as a config file.
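Here is a hedged sketch of those steps. The partitioning rule (first character of the key) is purely illustrative, and the class name is hypothetical.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Step 1 and 2: extend Partitioner and override getPartition.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 0) {
      return 0;                               // map-only job, nothing to partition
    }
    String k = key.toString();
    int firstChar = k.isEmpty() ? 0 : k.charAt(0);
    // Illustrative rule: partition by the first character of the key.
    return (firstChar & Integer.MAX_VALUE) % numReduceTasks;
  }
}
Step 3 is then a single driver call: job.setPartitionerClass(FirstCharPartitioner.class).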
69) What is shuffling and sorting in Hadoop MapReduce?
Shuffling and sorting take place after the completion of the map tasks. The shuffle and sort phases in Hadoop occur simultaneously.
 Shuffling- Shuffling is the process by which the system sorts the key-value output of the map tasks and transfers it to the reducers. The shuffle phase is important for reducers; otherwise, they would not have any input. Shuffling can start even before the map phase has finished, which saves some time and completes the task in less time.
 Sorting- The mapper generates the intermediate key-value pairs. Before the reducer starts, the MapReduce framework sorts these key-value pairs by key. This also helps the reducer to easily distinguish when a new reduce task should start, which saves time for the reducer.
Shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)).
Read about Shuffling and Sorting in detail.
70) Why aggregation cannot be done in Mapper?
The Mapper task processes each input record (from the RecordReader) and generates a key-value pair. The Mapper stores its intermediate output on the local disk.
We cannot perform aggregation in the mapper because:
 Sorting takes place only on the Reducer side; there is no provision for sorting in the mapper function, and without sorting, aggregation is not possible.
 To perform aggregation, we need the output of all the Mapper functions, which may not be possible to collect in the map phase, because mappers may be running on different machines where the data blocks are present.
 If we try to perform aggregation of data at the mapper, it requires communication between all the mapper functions, which may be running on different machines. This would consume high network bandwidth and could cause network bottlenecks.
71) Explain map-only job?
MapReduce is the data processing layer of Hadoop. It is the framework for writing applications that process the vast amount of data stored in HDFS. It processes the huge amount of data in parallel by dividing the job into a set of independent tasks (sub-jobs). In Hadoop, MapReduce has 2 phases of processing: Map and Reduce.
In the Map phase we specify all the complex logic/business rules/costly code. Map takes a set of data and converts it into another set of data, breaking individual elements into tuples (key-value pairs). In the Reduce phase we specify light-weight processing like aggregation/summation. Reduce takes the output from the map as input; it combines tuples (key-value pairs) based on the key and then modifies the value of the key accordingly.
Consider a case where we just need to perform an operation and no aggregation is required. In such a case, we prefer a “Map-Only job” in Hadoop: the map does all the work on its InputSplit and the reducer does no job, so the map output is the final output.
We can achieve this by setting job.setNumReduceTasks(0) in the driver configuration. This makes the number of reducers 0, and thus only the mapper does the complete task.
Read about map-only job in Hadoop Mapreduce in detail.
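A hedged driver sketch for a map-only job is shown below; the mapper class and the HDFS paths are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only-demo");
    job.setJarByClass(MapOnlyJobDriver.class);
    job.setMapperClass(WordCountMapper.class);   // hypothetical mapper from the earlier sketch
    job.setNumReduceTasks(0);                    // no reduce, shuffle or sort phase
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/in"));    // hypothetical path
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/out")); // hypothetical path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}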
72) What is SequenceFileInputFormat in Hadoop MapReduce?
SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. These files can be block-compressed and provide direct serialization and deserialization of several arbitrary data types.
Here Key and Value are both user-defined.
SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat. It converts the sequence file's keys and values to Text objects by calling toString() on them. This InputFormat makes sequence files suitable input for streaming.
SequenceFileAsBinaryInputFormat is another variant of SequenceFileInputFormat. Using it, we can extract the sequence file's keys and values as opaque binary objects.
The above MapReduce Hadoop interview questions and answers (Q.58 – Q.72) were for freshers; however, experienced professionals can also go through them to revise the basics.
MapReduce Hadoop Interview Questions and Answers for Experienced
73) What is KeyValueTextInputFormat in Hadoop?
KeyValueTextInputFormat treats each line of input as a separate record and breaks the line itself into key and value, using the tab character ('\t') to split the line into a key-value pair.
Key- Everything up to the tab character.
Value- Remaining part of the line after the tab character.
Consider the following input file, where → represents a (horizontal) tab character:
But→ his face you could not see
Account→ of his beaver hat
Hence, the output is:
Key- But
Value- his face you could not see
Key- Account
Value- of his beaver hat
74) Differentiate Reducer and Combiner in Hadoop MapReduce?
Combiner- The combiner is a Mini-Reducer that performs the local reduce task. It runs on the map output and produces output which becomes the reducer's input. The combiner is usually used for network optimization.
Reducer- The Reducer takes the set of intermediate key-value pairs produced by the mapper as its input, then runs a reduce function on each of them to generate the output. The output of the reducer is the final output.
 Unlike a reducer, the combiner has a limitation: its input and output key and value types must match the output types of the mapper.
 Combiners can operate only on a subset of keys and values, i.e. combiners can execute only functions that are commutative and associative.
 Combiner functions take input from a single mapper, while reducers can take data from multiple mappers as a result of partitioning.
75) Explain the process of spilling in MapReduce?
The Map task processes each input record (from the RecordReader) and generates a key-value pair. The Mapper does not store its output on HDFS, because it is temporary data and writing it to HDFS would create unnecessary multiple copies. Instead, the Mapper writes its output into a circular memory buffer (RAM). The size of the buffer is 100 MB by default; we can change it using the mapreduce.task.io.sort.mb property.
Spilling is the process of copying the data from the memory buffer to disk. It takes place when the content of the buffer reaches a certain threshold size: a background thread starts spilling the contents after 80% of the buffer size has been filled (by default). Therefore, for a 100 MB buffer, spilling will start after the content of the buffer reaches a size of 80 MB.
76) What happens if the number of reducers is set to 0 in Hadoop?
If we set the number of reducers to 0:
 Then no reducer will execute and no aggregation will take place.
 In such a case we prefer a “Map-only job” in Hadoop. In a map-only job, the map does all the work on its InputSplit and the reducer does no job; the map output is the final output.
In between the map and reduce phases there are the partition, sort, and shuffle phases. The sort and shuffle phases are responsible for sorting the keys in ascending order and then grouping values based on the same keys. This phase is very expensive, so if the reduce phase is not required we should avoid it. Avoiding the reduce phase eliminates the sort and shuffle phases as well, which also reduces network congestion, because in shuffling the output of the mapper travels to the reducer, and when the data size is huge, a large amount of data travels to the reducer.
77) What is Speculative Execution in Hadoop?
MapReduce breaks jobs into tasks and runs these tasks in parallel rather than sequentially, which reduces execution time. This model of execution is sensitive to slow tasks, as they slow down the overall execution of a job. There are various reasons for the slowdown of tasks, like hardware degradation, but it may be difficult to detect the causes since the tasks still complete successfully, although they take more time than expected.
The Hadoop framework doesn't try to diagnose and fix slow-running tasks. It tries to detect them and runs backup tasks for them. This process is called speculative execution in Hadoop, and the backup tasks are called speculative tasks.
First of all, the Hadoop framework launches all the tasks for the job. Then it launches speculative tasks for those tasks that have been running for some time (about a minute) and have not made much progress, on average, compared with the other tasks of the job.
If the original task completes before the speculative task, then the speculative task is killed. On the other hand, the original task is killed if the speculative task finishes before it.
Read about Speculative Execution in detail.
78) What is a counter in Hadoop MapReduce?
Counters in MapReduce are a useful channel for gathering statistics about the MapReduce job, for quality control or at the application level. They are also useful for problem diagnosis.
Counters validate that:
 The number of bytes read and written within the map/reduce job is correct.
 The number of tasks launched and successfully run in the map/reduce job is correct.
 The amount of CPU and memory consumed is appropriate for our job and cluster nodes.
There are two types of counters:
 Built-in Counters – In Hadoop there are some built-in counters for every job. These report various metrics; for example, there are counters for the number of bytes and records, which allow us to confirm that the job consumed the expected amount of input and produced the expected amount of output.
 User-Defined Counters – Hadoop MapReduce permits user code to define a set of counters, which are then incremented as desired in the mapper or reducer. For example, in Java, counters are defined with an ‘enum’.
Read about Counters in detail.
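A hedged sketch of a user-defined counter follows: the enum and the “malformed record” rule are hypothetical, while context.getCounter(...).increment(...) is the standard way to bump a counter from inside a mapper.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  // User-defined counters are declared as an enum.
  public enum RecordQuality { GOOD, MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().trim().isEmpty()) {            // hypothetical "bad record" rule
      context.getCounter(RecordQuality.MALFORMED).increment(1);
      return;
    }
    context.getCounter(RecordQuality.GOOD).increment(1);
    context.write(value, NullWritable.get());
  }
}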
79) How do you submit extra files (jars, static files) for a MapReduce job during runtime in Hadoop?
The MapReduce framework provides the Distributed Cache to cache files needed by applications. It can cache read-only text files, archives, jar files, etc.
An application which needs to use the distributed cache to distribute a file should make sure that the file is available at a URL, which can be either hdfs:// or http://.
If the file is present at such a URL, the user mentions it as a cache file to distribute. The framework copies the cache file to all the nodes before starting the tasks on those nodes. The files are copied only once per job, and applications should not modify them.
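A hedged sketch with the newer Job API is given below; the HDFS path is hypothetical. Inside a task, context.getCacheFiles() returns the registered URIs.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");
    // The file must already be reachable in HDFS; the path below is hypothetical.
    job.addCacheFile(new URI("hdfs:///user/demo/lookup.txt"));
    // Extra jars can be shipped similarly, e.g. with job.addFileToClassPath(new Path("/user/demo/extra.jar"))
    // or the -libjars option of the hadoop jar command.
  }
}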
80) What is TextInputFormat in Hadoop?
TextInputFormat is the default InputFormat. It treats each line of the input file as a
separate record. For unformatted data or line-based records like log files,
TextInputFormat is useful. By default, RecordReader also uses TextInputFormat for
converting data into key-value pairs. So,
 Key- It is the byte offset of the beginning of the line.
 Value- It is the contents of the line, excluding line terminators.
File content is- on the top of the building
so,
Key- 0
Value- on the top of the building
The RecordReader that TextInputFormat provides is LineRecordReader (SequenceFileRecordReader is the one used by SequenceFileInputFormat).
Top Interview Questions for Hadoop MapReduce
81) How many Mappers run for a MapReduce job?
The number of mappers depends on 2 factors:
 The amount of data we want to process, along with the block size. It is driven by the number of InputSplits. If we have a block size of 128 MB and we expect 10 TB of input data, we will have about 82,000 maps. Ultimately the InputFormat determines the number of maps.
 The configuration of the slave, i.e. the number of cores and the RAM available on the slave. The right number of maps per node is between 10 and 100. The Hadoop framework should give 1 to 1.5 processor cores to each mapper; for a 15-core processor, 10 mappers can run.
In a MapReduce job, by changing the block size we can control the number of mappers, because changing the block size increases or decreases the number of InputSplits. By using JobConf's conf.setNumMapTasks(int num) we can give a hint to increase the number of map tasks.
Mapper= {(total data size)/ (input split size)}
If data size= 1 Tb and input split size= 100 MB
Mapper= (1000*1000)/100= 10,000
82) How many Reducers run for a MapReduce job?
Answer these types of MapReduce Hadoop interview questions briefly and to the point.
With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the job. To set the right number of reducers, use the below formula:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95, all the reducers can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reduces and launch a second wave of reduces.
With an increase in the number of reducers:
 Load balancing increases.
 The cost of failures decreases.
 Framework overhead increases.
These are very common types of MapReduce Hadoop interview questions and answers faced during the interview of an experienced professional.
83) How to sort intermediate output based on values in MapReduce?
Hadoop MapReduce automatically sorts the key-value pairs generated by the mapper. Sorting takes place on the basis of keys. Thus, to sort the intermediate output based on values we need to use secondary sorting.
There are two possible approaches:
 First, using the reducer: the reducer reads and buffers all the values for a given key and then does an in-reducer sort on all the values. Since the reducer receives all the values for a given key (a huge list of values), this can cause the reducer to run out of memory. Thus, this approach works well only if the number of values is small.
 Second, using the MapReduce paradigm itself: sort the reducer input values by creating a composite key (the value-to-key conversion approach), i.e. by adding a part of, or the entire, value to the natural key. This approach is scalable and will not generate out-of-memory errors.
We need a custom partitioner to make sure that all the data with the same natural key (within the composite key) goes to the same reducer, and a custom grouping comparator so that the data is grouped by the natural key once it arrives at the reducer. A driver-level sketch follows.
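A hedged, driver-level sketch of the value-to-key approach: CompositeKeyComparator, NaturalKeyPartitioner and NaturalKeyGroupingComparator are hypothetical classes you would still have to implement (the composite key holds the natural key plus the value part to sort on), so this shows the wiring only, not a complete job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "secondary-sort-demo");
    // Sort composite keys by natural key first, then by the embedded value part.
    job.setSortComparatorClass(CompositeKeyComparator.class);           // hypothetical
    // Route all records with the same natural key to the same reducer.
    job.setPartitionerClass(NaturalKeyPartitioner.class);               // hypothetical
    // Group reducer input by the natural key only, ignoring the value part.
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // hypothetical
  }
}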
84) What is the purpose of RecordWriter in Hadoop?
The Reducer takes the mapper output (intermediate key-value pairs) as its input and runs a reduce function on them to generate output (zero or more key-value pairs). The output of the reducer is the final output.
RecordWriter writes these output key-value pairs from the Reducer phase to output files. The OutputFormat determines how the RecordWriter writes these key-value pairs into the output files. Hadoop provides OutputFormat instances which help to write files in HDFS or on the local disk.
85) What are the most common OutputFormat in Hadoop?
The Reducer takes the mapper output as input and produces output (zero or more key-value pairs). RecordWriter writes these output key-value pairs from the Reducer phase to output files, and the OutputFormat determines how the RecordWriter writes them into the output files.
The FileOutputFormat.setOutputPath() method is used to set the output directory, and every Reducer writes a separate file in this common output directory.
The most common OutputFormats are:
 TextOutputFormat – It is the default OutputFormat in MapReduce. TextOutputFormat writes key-value pairs on individual lines of text files. Keys and values of this format can be of any type, because TextOutputFormat turns them into strings by calling toString() on them.
 SequenceFileOutputFormat – This OutputFormat writes sequence files as its output. It is also used for passing data between MapReduce jobs.
 SequenceFileAsBinaryOutputFormat – It is another form of SequenceFileOutputFormat, which writes keys and values to a sequence file in binary format.
 DBOutputFormat – We use this for writing to relational databases and HBase. It sends the reduce output to a SQL table. It accepts key-value pairs where the key has a type extending DBWritable.
Read about outputFormat in detail.
86) What is LazyOutputFormat in Hadoop?
FileOutputFormat subclasses will create output files (part-r-nnnnn), even if they are empty. Some applications prefer not to create empty files, which is where LazyOutputFormat helps.
LazyOutputFormat is a wrapper OutputFormat. It makes sure that the output file is created only when the first record is emitted for a given partition.
To use LazyOutputFormat, call its setOutputFormatClass() method in the driver, as in the sketch below. To enable lazy output, Streaming and Pipes support a -lazyOutput option.
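A hedged driver sketch with the new API is shown below; the older mapred API has an analogous LazyOutputFormat.setOutputFormatClass(JobConf, ...) call.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "lazy-output-demo");
    // Wrap TextOutputFormat so empty part-r-* files are not created.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
  }
}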
87) How to handle record boundaries in Text files or Sequence files in MapReduce
InputSplits?
The InputSplit's RecordReader in MapReduce will “start” and “end” at a record boundary.
In a SequenceFile, roughly every 2 KB there is a 20-byte sync mark between the records. These sync marks allow the RecordReader, whose split contains a file, offset and length, to seek to the first sync mark after the start of the split, and the RecordReader then continues processing records until it reaches the first sync mark after the end of the split.
Similarly, text files use newlines instead of sync marks to handle record boundaries.
88) What are the main configuration parameters in a MapReduce program?
The main configuration parameters are:
 Input format of data.
 Job’s input locations in the distributed file system.
 Output format of data.
 Job’s output location in the distributed file system.
 JAR file containing the mapper, reducer and driver classes.
 Class containing the map function.
 Class containing the reduce function.
89) Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory.
Hadoop cluster, by default, takes the input and the output format as ‘text’.
TextInputFormat – MapReduce default InputFormat is TextInputFormat. It treats
each line of each input file as a separate record and also performs no parsing. For
unformatted data or line-based records like log files, TextInputFormat is also useful.
By default, RecordReader also uses TextInputFormat for converting data into key-
value pairs.
TextOutputFormat- MapReduce default OutputFormat is TextOutputFormat. It also
writes (key, value) pairs on individual lines of text files. Its keys and values can be of
any type.
90) What is Identity Mapper?
Identity Mapper is the default Mapper provided by Hadoop. When a MapReduce program has not defined any mapper class, the Identity Mapper runs. It simply passes the input key-value pairs to the reducer phase. The Identity Mapper does not perform any computation or calculation on the input data; it only writes the input data to the output.
The class name is org.apache.hadoop.mapred.lib.IdentityMapper
91) What is Identity reducer?
Identity Reducer is the default Reducer provided by Hadoop. When a MapReduce program has not defined any reducer class, the Identity Reducer runs. It does not mean that the reduce step will not take place: it will take place, and the related sorting and shuffling will also take place, but there will be no aggregation. So you can use the Identity Reducer if you want to sort the data coming from the map but do not care about any grouping.
The above MapReduce Hadoop interview questions and answers, i.e. Q.73 – Q.91, were for experienced professionals, but freshers can also refer to them for in-depth knowledge. Now let's move forward with some advanced MapReduce Hadoop interview questions and answers.
Advanced Interview Questions and Answers for Hadoop MapReduce
92) What is Chain Mapper?
We can use multiple Mapper classes within a single Map task by using the ChainMapper class. The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last mapper. The Hadoop framework writes the output of the last mapper to the task's output.
The key benefit of this feature is that the Mappers in the chain do not need to be
aware that they execute in a chain. And, this enables having reusable specialized
Mappers. We can combine these mappers to perform composite operations within a
single task in Hadoop.
Special care must be taken when creating chains: the key/value types output by a Mapper must be valid input for the following mapper in the chain.
The class name is org.apache.hadoop.mapred.lib.ChainMapper (a sketch with the newer API follows).
This is one of the very important MapReduce Hadoop interview questions and answers.
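A hedged sketch using the newer-API ChainMapper (org.apache.hadoop.mapreduce.lib.chain) is given below; the two small mappers are illustrative only, but they show the type constraint: the output types of one stage must match the input types of the next.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

public class ChainMapperDemo {

  // First mapper in the chain: (LongWritable, Text) -> (Text, Text).
  public static class OffsetToTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(key.toString()), value);
    }
  }

  // Second mapper: consumes the first mapper's (Text, Text) output.
  public static class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, new Text(value.toString().toUpperCase()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "chain-mapper-demo");
    ChainMapper.addMapper(job, OffsetToTextMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        new Configuration(false));
    ChainMapper.addMapper(job, UpperCaseMapper.class,
        Text.class, Text.class, Text.class, Text.class,
        new Configuration(false));
  }
}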
93) What are the core methods of a Reducer?
The Reducer processes the output of the mapper. After processing the data, it produces a new set of output, which it stores in HDFS. The core methods of a Reducer are:
 setup() – This method configures various parameters like the input data size, distributed cache, heap size, etc. Function definition- public void setup(Context context)
 reduce() – The framework calls this method once per key, with the associated list of values, for the reduce task. Function definition- public void reduce(Key key, Iterable<Value> values, Context context)
 cleanup() – This method is called only once, at the end of the reduce task, for clearing all the temporary files. Function definition- public void cleanup(Context context)
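A hedged skeleton showing where the three methods sit in a reducer; the summation logic is just an example.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void setup(Context context) {
    // Runs once per task before any reduce() call: read configuration,
    // open side files from the distributed cache, etc.
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;                        // called once per key with all of its values
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }

  @Override
  protected void cleanup(Context context) {
    // Runs once per task after the last reduce() call: close resources, remove temp files.
  }
}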
94) What are the parameters of mappers and reducers?
The parameters for Mappers (for example, in a word count job) are:
 LongWritable (input key)
 Text (input value)
 Text (intermediate output key)
 IntWritable (intermediate output value)
The parameters for Reducers are:
 Text (intermediate output key)
 IntWritable (intermediate output value)
 Text (final output key)
 IntWritable (final output value)
95) What is the difference between TextinputFormat and KeyValueTextInputFormat
class?
TextInputFormat – It is the default InputFormat. It treats each line of the input file as a separate record. It is useful for unformatted data or line-based records like log files. So,
 Key- It is the byte offset of the beginning of the line within the file.
 Value- It is the contents of the line, excluding line terminators.
KeyValueTextInputFormat – It is like TextInputFormat in that it also treats each line of input as a separate record. The main difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line itself into key and value by the tab character ('\t'). So,
 Key- Everything up to the tab character.
 Value- Remaining part of the line after the tab character.
For example, consider a file contents as below:
AL#Alabama
AR#Arkansas
FL#Florida
So, with TextInputFormat:
Key    Value
0      AL#Alabama
14     AR#Arkansas
23     FL#Florida
And with KeyValueTextInputFormat:
Key    Value
AL     Alabama
AR     Arkansas
FL     Florida
These are some of the advanced MapReduce Hadoop interview questions and answers.
96) How is the splitting of a file invoked in Hadoop?
The InputFormat is responsible for creating the InputSplit, which is the logical representation of data. The Hadoop framework further divides the split into records, and the Mapper processes each record (which is a key-value pair).
The Hadoop framework invokes the splitting of files by running the getSplits() method, which belongs to the InputFormat class (such as FileInputFormat) that the user configures for the job.
97) How many InputSplits will be made by the Hadoop framework?
The InputFormat is responsible for creating the InputSplit, which is the logical representation of data. The Hadoop framework further divides the split into records, and the Mapper processes each record (which is a key-value pair).
The MapReduce system uses storage locations to place map tasks as close to the split's data as possible. By default, the split size is approximately equal to the HDFS block size (128 MB).
For example, if the file size is 514 MB:
128 MB: 1st block, 128 MB: 2nd block, 128 MB: 3rd block,
128 MB: 4th block, 2 MB: 5th block
So, 5 InputSplits are created based on the 5 blocks.
If you have any confusion about any of the MapReduce Hadoop interview questions, do let us know by leaving a comment. We will be glad to solve your queries.
98) Explain the usage of Context Object.
With the help of the Context object, the Mapper can easily interact with other Hadoop systems. It also helps in updating counters, so the counters can report the progress and provide any application-level status updates. It contains the configuration details for the job.
99) When is it not recommended to use MapReduce paradigm for large scale data
processing?
It is not suggested to use MapReduce for iterative processing use cases, as it is not cost-effective; instead, Apache Pig can be used for the same.
100) What is the difference between RDBMS with Hadoop MapReduce?
Size of Data
 RDBMS- A traditional RDBMS can handle up to gigabytes of data.
 MapReduce- Hadoop MapReduce can handle up to petabytes of data or more.
Updates
 RDBMS- Read and Write multiple times.
 MapReduce- Read many times but write once model.
Schema
 RDBMS- Static Schema that needs to be pre-defined.
 MapReduce- Has a dynamic schema
Processing Model
 RDBMS- Supports both batch and interactive processing.
 MapReduce- Supports only batch processing.
Scalability
 RDBMS- Non-Linear
 MapReduce- Linear
101) Define Writable data types in Hadoop MapReduce.
Hadoop reads and writes data in a serialized form using the Writable interface. The Writable interface has several implementations such as Text, IntWritable, LongWritable, FloatWritable and BooleanWritable. Users are also free to define their own Writable classes.
102) Explain what conf.setMapperClass does in MapReduce.
conf.setMapperClass sets the mapper class for the job, which includes reading the data and generating key-value pairs out of the mapper.