
Quiz demo: Knowledge For Cloudera

Practice quiz demo for CCD-410 (02-2014)


1. Question

For each input key-value pair, mappers can emit:

As many intermediate key-value pairs as designed, but they cannot be of the same type as the
input key-value pair.

As many intermediate key-value pairs as designed. There are no restrictions on the types of
those key-value pairs (i.e., they can be heterogeneous).

One intermediate key-value pair, of a different type.

As many intermediate key-value pairs as designed, as long as all the keys have the same types
and all the values have the same type.
One intermediate key-value pair, but of the same type.

Correct

Mapper maps input key/value pairs to a set of intermediate key/value pairs.


Maps are the individual tasks that transform input records into intermediate records. The
transformed intermediate records do not need to be of the same type as the input records. A given
input pair may map to zero or many output pairs.
Reference: Hadoop Map-Reduce Tutorial

Explanation:
Correct answer(s):
As many intermediate key-value pairs as designed, as long as all the keys have the same types
and all the values have the same type.
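As an illustrative sketch of this rule (new-API style; the class and field names below are hypothetical, not part of the question), a mapper may emit zero or many pairs per input record, provided every emitted key matches the declared output key type and every value matches the declared output value type:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (token, 1) pair per whitespace-separated token in the line:
// zero pairs for an empty line, many pairs otherwise. All keys are Text,
// all values are IntWritable, matching the declared output types.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, one);
        }
    }
}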

2. Question

In the reducer, the MapReduce API provides you with an iterator over Writable values. What does
calling the next() method return?

It returns a reference to the same Writable object each time, but populated with different data.

It returns a reference to a Writable object. The API leaves unspecified whether this is a reused
object or a new object.

It returns a reference to a different Writable object each time.

It returns a reference to the same Writable object if the next value is the same as the previous
value, or a new Writable object otherwise.

It returns a reference to a Writable object from an object pool.

Incorrect

Calling Iterator.next() will always return the SAME EXACT instance of IntWritable,
with the contents of that instance replaced with the next value.
Reference: manipulating iterators in MapReduce

Explanation:
Correct answer(s):
It returns a reference to the same Writable object each time, but populated with different data.
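The practical consequence is that values which must outlive the current loop iteration have to be copied. A minimal sketch, assuming Text values (the class name is illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CollectingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Text> kept = new ArrayList<Text>();
        for (Text value : values) {
            // WRONG: kept.add(value) would store the same reused object repeatedly.
            kept.add(new Text(value));   // copy the contents before storing
        }
        context.write(key, new Text("count=" + kept.size()));
    }
}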

3. Question

MapReduce v2 (MRv2/YARN) is designed to address which two issues?


Single point of failure in the NameNode.

Resource pressure on the JobTracker.

Standardize on a single MapReduce API.

HDFS latency.

Ability to run frameworks other than MapReduce, such as MPI.

Reduce complexity of the MapReduce APIs.

Incorrect

YARN (Yet Another Resource Negotiator), as an aspect of Hadoop, has two major
kinds of benefits:
* (D) The ability to use programming frameworks other than MapReduce.
/ MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce
alternative
* Scalability, no matter what programming framework you use.

Note:
* The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker,
resource management and job scheduling/monitoring, into separate daemons. The idea is to have
a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is
either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
* (B) The central goal of YARN is to clearly separate two things that are unfortunately smushed
together in current Hadoop, specifically in (mainly) JobTracker:
/ Monitoring the status of the cluster with respect to which nodes have which resources available.
Under YARN, this will be global.
/ Managing the parallelization execution of any specific job. Under YARN, this will be done
separately for each job.
The current Hadoop MapReduce system is fairly scalable — Yahoo runs 5000 Hadoop jobs, truly
concurrently, on a single cluster, for a total of 1.5–2 million jobs/cluster/month. Still, YARN will
remove scalability bottlenecks.
Reference: Apache Hadoop YARN – Concepts & Applications

Explanation:
Correct answer(s):
Resource pressure on the JobTracker.
Ability to run frameworks other than MapReduce, such as MPI.

4. Question

Your cluster’s HDFS block size is 64MB. You have a directory containing 100 plain text files, each of
which is 100MB in size. The InputFormat for your job is TextInputFormat. How many
Mappers will run?

200

100

640

64

Incorrect

Each file would be split into two as the block size (64 MB) is less than the file size
(100 MB), so 200 mappers would be running.
Note:
If you’re not compressing the files then hadoop will process your large files (say 10G), with a
number of mappers related to the block size of the file.
Say your block size is 64M, then you will have ~160 mappers processing this 10G file (160*64 ~=
10G). Depending on how CPU intensive your mapper logic is, this might be an
acceptable block size, but if you find that your mappers are executing in sub-minute times, then
you might want to increase the work done by each mapper (by increasing the block size to 128,
256, 512m – the actual size depends on how you intend to process the data).
Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriate-inputfiles-size
(first answer, second paragraph)

Explanation:
Correct answer(s):
200
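For reference, the arithmetic behind 200, written out as a small illustrative calculation (plain Java, nothing Hadoop-specific):

public class SplitCountEstimate {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB
        long fileSize  = 100L * 1024 * 1024;  // 100 MB
        int files = 100;
        // ceil(100 MB / 64 MB) = 2 splits per file for an uncompressed text file
        long splitsPerFile = (fileSize + blockSize - 1) / blockSize;
        System.out.println(files * splitsPerFile + " map tasks"); // prints "200 map tasks"
    }
}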

5. Question

You’ve written a MapReduce job that will process 500 million input records and generate 500
million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a
significant amount of intermediate data that it needs to transfer between mappers and reducers,
which is a potential bottleneck. A custom implementation of which interface is most likely to reduce
the amount of intermediate data transferred across the network?

Writable

Partitioner

WritableComparable

Combiner

InputFormat
OutputFormat

Incorrect

Combiners are used to increase the efficiency of a MapReduce program. They are
used to aggregate intermediate map output locally on individual mapper outputs. Combiners can
help you reduce the amount of data that needs to be transferred across to the reducers. You can
use your reducer code as a combiner if the operation performed is commutative and associative.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are
combiners? When should I use a combiner in my MapReduce Job?

Explanation:
Correct answer(s):
Combiner

6. Question

Determine which best describes when the reduce method is first called in a MapReduce job?

Reducers start copying intermediate key-value pairs from each Mapper as soon as it has
completed. The programmer can configure in the job what percentage of the intermediate data
should arrive before the reduce method begins.

Reducers start copying intermediate key-value pairs from each Mapper as soon as it has
completed. The reduce method is called only after all intermediate data has been copied and
sorted.

Reduce methods and map methods all start at the beginning of a job, in order to provide
optimal performance for map-only or reduce-only jobs.

Reducers start copying intermediate key-value pairs from each Mapper as soon as it has
completed. The reduce method is called as soon as the intermediate key-value pairs start to
arrive.

Incorrect

* In a MapReduce job, reducers do not start executing the reduce method until all the
map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers
as soon as they are available. The programmer-defined reduce method is called only after all the
mappers have finished.

* Reducers start copying intermediate key-value pairs from the mappers as soon as they are
available. The progress calculation also takes in account the processing of data transfer which is
done by reduce process, therefore the reduce progress starts showing up as soon as any
intermediate key-value pair for a mapper is available to be transferred to reducer. Though the
reducer progress is updated still the programmer defined reduce method is called only after all the
mappers have finished.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers , When is the
reducers are started in a MapReduce job?

Explanation:
Correct answer(s):
Reducers start copying intermediate key-value pairs from each Mapper as soon as it has
completed. The reduce method is called only after all intermediate data has been copied and
sorted.

7. Question

Identify which best defines a SequenceFile.

A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key
must be the same type. Each value must be the same type.

A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable
objects, in sorted order.

A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable
objects.

A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable
objects.

Incorrect

SequenceFile is a flat file consisting of binary key/value pairs.


There are 3 different SequenceFile formats:
Uncompressed key/value records.
Record compressed key/value records – only ‘values’ are compressed here.
Block compressed key/value records – both keys and values are collected in ‘blocks’ separately
and compressed. The size of the ‘block’ is configurable.
Reference: http://wiki.apache.org/hadoop/SequenceFile

Explanation:
Correct answer(s):
A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key
must be the same type. Each value must be the same type.
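A brief sketch of writing such a file with the SequenceFile.Writer API (the older FileSystem-based createWriter overload; the path and record contents are illustrative). The key and value classes are fixed once when the writer is created, which is why every key must share one type and every value another:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.seq");   // illustrative path
        SequenceFile.Writer writer = null;
        try {
            // Key and value classes are declared once for the whole file.
            writer = SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
            for (int i = 0; i < 10; i++) {
                writer.append(new Text("record-" + i), new IntWritable(i));
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }
}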

8. Question
In a MapReduce job, you want each of your input files processed by a single map task. How do
you configure a MapReduce job so that a single map task processes each input file regardless of
how many blocks the input file occupies?

Write a custom MapRunner that iterates over all key-value pairs in the entire file.

Increase the parameter that controls minimum split size in the job configuration.

Set the number of mappers equal to the number of input files you want to process.

Write a custom FileInputFormat and override the method isSplitable to always return false.

Incorrect

FileInputFormat is the base class for all file-based InputFormats. This provides a
generic implementation of getSplits(JobContext). Subclasses of FileInputFormat can also override
the isSplitable(JobContext, Path) method to ensure input-files are not split-up and are processed
as a whole by Mappers.
Reference: org.apache.hadoop.mapreduce.lib.input, Class FileInputFormat<K,V>

Explanation:
Correct answer(s):
Write a custom FileInputFormat and override the method isSplitable to always return false.
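A hedged sketch of that approach (new API; extending TextInputFormat, itself a FileInputFormat subclass, keeps the normal line-by-line record reading while forcing one split, and therefore one map task, per file; the class name is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each input file becomes exactly one InputSplit / map task.
        return false;
    }
}

In the driver this would be registered with job.setInputFormatClass(WholeFileTextInputFormat.class).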

9. Question

Workflows expressed in Oozie can contain:

Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with
exception handlers but no forks.

Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions
including forks, decision points, and path joins.

Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce
sequences can be combined with forks and path joins.

Iterative repetition of MapReduce jobs until a desired answer or state is reached.

Incorrect

An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs)
arranged in a control dependency DAG (Directed Acyclic Graph), specifying a sequence of actions
to execute. This graph is specified in hPDL (an XML Process Definition Language).

hPDL is a fairly compact language, using a limited amount of flow control and action nodes.
Control nodes define the flow of execution and include beginning and end of a workflow (start, end
and fail nodes) and mechanisms to control the workflow execution path ( decision, fork and join
nodes).
Note: Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a
database to store:
Workflow definitions
Currently running workflow instances, including instance states and variables
Reference: Introduction to Oozie

Explanation:
Correct answer(s):
Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions
including forks, decision points, and path joins.

10. Question

You wrote a map function that throws a runtime exception when it encounters a control character
in input data. The input supplied to your mapper contains twelve such characters in total, spread
across five file splits. The first four file splits each have two control characters and the last split has
four control characters.

Identify the number of failed task attempts you can expect when you run the job with
mapred.max.map.attempts set to 4:

You will have forty-eight failed task attempts

You will have twelve failed task attempts

You will have seventeen failed task attempts

You will have five failed task attempts

You will have twenty failed task attempts

Incorrect

There will be four failed task attempts for each of the five file splits, giving twenty failed task
attempts in total.

Explanation:
Correct answer(s):
You will have twenty failed task attempts
11. Question

Which project gives you a distributed, scalable data store that allows you random, realtime
read/write access to hundreds of terabytes of data?

Hue

Hive

Oozie

HBase

Flume

Sqoop

Pig

Incorrect

Use Apache HBase when you need random, realtime read/write access to your Big
Data.
Note: This project’s goal is the hosting of very large tables — billions of rows X millions of columns
— atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned,
column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for
Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided
by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop
and HDFS.
Features
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data
encoding options
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
Reference: http://hbase.apache.org/ (when would I use HBase? First sentence)

Explanation:
Correct answer(s):
HBase
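A small sketch of that random read/write access using the classic HTable-based HBase client API (the table, column family, and row key names are illustrative; newer HBase releases use a Connection/Table API instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "weblogs");          // table name is illustrative
        // Random, realtime write: one cell in column family "stats".
        Put put = new Put(Bytes.toBytes("row-2014-02-01"));
        put.add(Bytes.toBytes("stats"), Bytes.toBytes("hits"), Bytes.toBytes("42"));
        table.put(put);
        // Random, realtime read of the same row.
        Result result = table.get(new Get(Bytes.toBytes("row-2014-02-01")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("hits"))));
        table.close();
    }
}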
12. Question

When can a reduce class also serve as a combiner without affecting the output of a MapReduce
program?

Always. Code can be reused in Java since it is a polymorphic object-oriented programming
language.

When the types of the reduce operation’s input key and input value match the types of the
reducer’s output key and output value and when the reduce operation is both commutative and
associative.

Never. Combiners and reducers must be implemented separately because they serve different
purposes.

When the signature of the reduce method matches the signature of the combine method.

Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to
increase performance.

Incorrect

You can use your reducer code as a combiner if the operation performed is
commutative and associative.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are
combiners? When should I use a combiner in my MapReduce Job?

Explanation:
Correct answer(s):
When the types of the reduce operation’s input key and input value match the types of the
reducer’s output key and output value and when the reduce operation is both commutative and
associative.

13. Question

Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop
daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce
operation.

NameNode

TaskTracker

Secondary NameNode

DataNode
JobTracker

Incorrect

The JobTracker is the daemon service for submitting and tracking MapReduce jobs in
Hadoop. There is only one JobTracker process running on any Hadoop cluster. The JobTracker runs in
its own JVM process. In a typical production cluster it runs on a separate machine. Each slave
node is configured with the JobTracker node location. The JobTracker is a single point of failure for the
Hadoop MapReduce service. If it goes down, all running jobs are halted. The JobTracker in Hadoop
performs the following actions (from the Hadoop wiki):
Client applications submit jobs to the Job tracker.
The JobTracker talks to the NameNode to determine the location of the data
The JobTracker locates TaskTracker nodes with available slots at or near the data
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they
are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do
then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid,
and it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a
JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?

Explanation:
Correct answer(s):
JobTracker

14. Question

You want to count the number of occurrences for each unique word in the supplied input data.
You’ve decided to implement this by having your mapper tokenize each word and emit a literal
value 1, and then have your reducer increment a counter for each literal 1 it receives. After
successfully implementing this, it occurs to you that you could optimize this by specifying a
combiner. Will you be able to reuse your existing Reducer as your combiner in this case, and why
or why not?

No, because the sum operation in the reducer is incompatible with the operation of a Combiner.

Yes, because the sum operation is both associative and commutative and the input and output
types to the reduce method match.

No, because the Reducer and Combiner are separate interfaces.

Yes, because Java is a polymorphic object-oriented language and thus reducer code can be
reused as a combiner.

No, because the Combiner is incompatible with a mapper which doesn’t use the same data
type for both the key and value.

Incorrect

Combiners are used to increase the efficiency of a MapReduce program. They are
used to aggregate intermediate map output locally on individual mapper outputs. Combiners can
help you reduce the amount of data that needs to be transferred across to the reducers. You can
use your reducer code as a combiner if the operation performed is commutative and associative.
The execution of a combiner is not guaranteed; Hadoop may or may not execute a combiner. Also, if
required, it may execute it more than once. Therefore your MapReduce jobs should not depend
on the combiner’s execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are
combiners? When should I use a combiner in my MapReduce Job?

Explanation:
Correct answer(s):
Yes, because the sum operation is both associative and commutative and the input and output
types to the reduce method match.
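A minimal sketch of such a reusable sum reducer (the class name is illustrative; Hadoop's own examples ship an equivalent class). Summing is commutative and associative, and the reducer's input types (Text, IntWritable) match its output types, so the same class can be registered as the combiner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

In the driver the same class would then be registered twice, e.g. job.setCombinerClass(IntSumReducer.class) and job.setReducerClass(IntSumReducer.class).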

15. Question

You want to understand more about how users browse your public website, such as which pages
they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How
will you gather this data for your analysis?

Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for
reduces.

Ingest the server web logs into HDFS using Flume.

Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.

Channel these clickstreams into Hadoop using Hadoop Streaming.

Sample the weblogs from the web servers, copying them into Hadoop using curl.

Incorrect

Hadoop MapReduce for Parsing Weblogs


Here are the steps for parsing a log file using Hadoop MapReduce:
Load log files into the HDFS location using this Hadoop command:
hadoop fs -put <local file path of weblogs> <hadoop HDFS location>
The Opencsv2.3.jar framework is used for parsing log records.
Below is the Mapper program for parsing the log file from the HDFS location.
public static class ParseMapper extends Mapper<Object, Text, NullWritable, Text> {
    private final Text word = new Text();
    // Opencsv parser: fields are separated by spaces and quoted with double quotes.
    private final CSVParser parser = new CSVParser(' ', '"');

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = parser.parseLine(value.toString());
        StringBuilder rec = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            rec.append(fields[i]);
            if (i != fields.length - 1) {
                rec.append(",");
            }
        }
        word.set(rec.toString());
        context.write(NullWritable.get(), word);
    }
}
The command below is the Hadoop-based log parse execution. The MapReduce program is
attached in this article. You can add extra parsing methods in the class. Be sure to create a new
JAR with any change and move it to the Hadoop distributed job tracker system.
hadoop jar <path of logparse jar> <hadoop HDFS logfile path> <output path of parsed log file>
The output file is stored in the HDFS location, and the output file name starts with "part-".

Explanation:
Correct answer(s):
Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for
reduces.

16. Question

What data does a Reducer’s reduce method process?

All data for a given value, regardless of which mapper(s) produced it.

All the data in a single input file.

All data produced by a single mapper.

All data for a given key, regardless of which mapper(s) produced it.

Incorrect
Reducing lets you aggregate values together. A reducer function receives an iterator
of input values from an input list. It then combines these values together, returning a single output
value.
All values with the same key are presented to a single reduce task.
Reference: Yahoo! Hadoop Tutorial, Module 4: MapReduce

Explanation:
Correct answer(s):
All data for a given key, regardless of which mapper(s) produced it.

17. Question

On a cluster running MapReduce v1 (MRv1), a TaskTracker heartbeats into the JobTracker on
your cluster, and alerts the JobTracker that it has an open map task slot.
What determines how the JobTracker assigns each map task to a TaskTracker?

The amount of RAM installed on the TaskTracker node.

The amount of free disk space on the TaskTracker node.

The number and speed of CPU cores on the TaskTracker node.

The average system load on the TaskTracker node over the past fifteen (15) minutes.

The location of the InputSplit to be processed in relation to the location of the node.

Incorrect

The TaskTrackers send out heartbeat messages to the JobTracker, usually every
few minutes, to reassure the JobTracker that it is still alive. These messages also inform the
JobTracker of the number of available slots, so the JobTracker can stay up to date with where in
the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a
task within the MapReduce operations, it first looks for an empty slot on the same server that
hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the
same rack.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How
JobTracker schedules a task?

Explanation:
Correct answer(s):
The location of the InputSplit to be processed in relation to the location of the node.

18. Question

How are keys and values presented and passed to the reducers during a standard sort and shuffle
phase of MapReduce?

Keys are presented to a reducer in random order; values for a given key are sorted in
ascending order.

Keys are presented to a reducer in sorted order; values for a given key are not sorted.

Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending
order.

Keys are presented to a reducer in random order; values for a given key are not sorted.

Incorrect

Reducer has 3 primary phases:


1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the
same key).
The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are
merged.
SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should
extend the key with the secondary key and define a grouping comparator. The keys will be sorted
using the entire key, but will be grouped using the grouping comparator to decide which keys and
values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Explanation:
Correct answer(s):
Keys are presented to a reducer in sorted order; values for a given key are not sorted.

19. Question

Which describes how a client reads a file from HDFS?


The client queries the NameNode for the block location(s). The NameNode returns the block
location(s) to the client. The client reads the data directly off the DataNode(s).

The client contacts the NameNode for the block location(s). The NameNode then queries the
DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode
redirects the client to the DataNode that holds the requested data block(s). The client then reads
the data directly off the DataNode.

The client queries all DataNodes in parallel. The DataNode that contains the requested data
responds directly to the client. The client reads the data directly off the DataNode.

The client contacts the NameNode for the block location(s). The NameNode contacts the
DataNode that holds the requested data block. Data is transferred from the DataNode to the
NameNode, and then from the NameNode to the client.

Incorrect

The client communication with HDFS happens using the Hadoop HDFS API. Client
applications talk to the NameNode whenever they wish to locate a file, or when they want to
add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by
returning a list of relevant DataNode servers where the data lives. Client applications can talk
directly to a DataNode, once the NameNode has provided the location of the data.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the Client
communicates with HDFS?

Explanation:
Correct answer(s):
The client contacts the NameNode for the block location(s). The NameNode then queries the
DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode
redirects the client to the DataNode that holds the requested data block(s). The client then reads
the data directly off the DataNode.

20. Question

You need to run the same job many times with minor variations. Rather than hardcoding all job
configuration options in your driver code, you’ve decided to have your Driver subclass
org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.
Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop.

hadoop MyDriver mapred.job.name=Example input output

hadoop “mapred.job.name=Example” MyDriver input output

hadoop setproperty mapred.job.name=Example MyDriver input output

hadoop MyDriver -D mapred.job.name=Example input output


hadoop setproperty (“mapred.job.name=Example”) MyDriver input output

Incorrect

Configure the property using the -D key=value notation:


-D mapred.job.name='My Job'
You can list a whole bunch of options by calling the streaming jar with just the -info argument
Reference: Python hadoop streaming : Setting a job name

Explanation:
Correct answer(s):
hadoop MyDriver -D mapred.job.name=Example input output
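A hedged sketch of such a driver (reusing the MyDriver name from the question; the mapper, reducer, and output types are left at their defaults purely for brevity). ToolRunner parses the generic options, so -D mapred.job.name=Example is applied to the Configuration before run() receives the remaining arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options parsed by ToolRunner.
        Job job = new Job(getConf());
        job.setJarByClass(MyDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}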

21. Question

Which best describes how TextInputFormat processes input files and line breaks?

Input file splits may cross line breaks. A line that crosses file splits is ignored.

Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader
of the split that contains the beginning of the broken line.

Input file splits may cross line breaks. A line that crosses file splits is read by the
RecordReaders of both splits containing the broken line.

The input file is split exactly at the line breaks, so each RecordReader will read a series of
complete lines.

Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader
of the split that contains the end of the broken line.

Incorrect

As the Map operation is parallelized the input file set is first split to several pieces
called FileSplits. If an individual file is so large that it will affect seek time it will be split to several
Splits. The splitting does not know anything about the input file’s internal logical structure, for
example line-oriented text files are split on arbitrary byte boundaries. Then a new map task is
created per FileSplit.
When an individual map task starts it will open a new output writer per configured reduce task. It
will then proceed to read its FileSplit using the RecordReader it gets from the specified
InputFormat. InputFormat parses the input and generates key-value pairs. InputFormat must also
handle records that may be split on the FileSplit boundary. For example TextInputFormat will read
the last line of the FileSplit past the split boundary and, when reading other than the first FileSplit,
TextInputFormat ignores the content up to the first newline.
Reference: How Map and Reduce operations are actually carried out
Explanation:
Correct answer(s):
Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader
of the split that contains the beginning of the broken line.

22. Question

Given a directory of files with the following structure: line number, tab character, string:
Example:
1abialkjfjkaoasdfjksdlkjhqweroij
2kadfjhuwqounahagtnbvaswslmnbfgy
3kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which InputFormat should you use to
complete the line: conf.setInputFormat(____.class); ?

SequenceFileAsTextInputFormat

KeyValueFileInputFormat

BDBInputFormat

SequenceFileInputFormat

Incorrect

Note:
The output format for your first MR job should be SequenceFileOutputFormat – this will store the
Key/Values output from the reducer in a binary format, that can then be read back in, in your
second MR job using SequenceFileInputFormat.
Reference: How to parse CustomWritable from text in Hadoop
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop
(see answer 1 and then see the comment #1 for it)

Explanation:
Correct answer(s):
SequenceFileInputFormat

23. Question

What types of algorithms are difficult to express in MapReduce v1 (MRv1)?

Algorithms that require a global, shared state.

Relational operations on large amounts of structured and semi-structured data.


Algorithms that require applying the same mathematical function to large numbers of individual
binary records.

Large-scale graph algorithms that require one-step link traversal.

Text analysis algorithms on large collections of unstructured text (e.g, Web crawls).

Incorrect

See 3) below.
Limitations of MapReduce – where not to use MapReduce
While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to
every problem. Here are some problems I found where MapReduce is not suited and some papers
that address the limitations of MapReduce.
1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce cannot be
used. One good example is the Fibonacci series where each value is summation of the previous
two values. i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a

single machine, then it is better to do it as a single reduce(map(data)) operation rather than going
through the entire map reduce process.
2. Full-text indexing or ad hoc searching
The index generated in the Map step is one dimensional, and the Reduce step must not generate
a large amount of data or there will be a serious performance degradation. For example,
CouchDB’s MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a
problem better suited for a tool such as Lucene.
3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing do not require global synchronization.
As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run
independently and in isolation. However, there are many examples of algorithms that depend
crucially on the existence of shared global state during processing, making them difficult to
implement in MapReduce (since the single opportunity for global synchronization in MapReduce is
the barrier between the map and reduce phases of processing)
Reference: Limitations of MapReduce – where not to use MapReduce

Explanation:
Correct answer(s):
Algorithms that require a global, shared state.

24. Question

You want to perform analysis on a large collection of images. You want to store this data in HDFS
and process it with MapReduce but you also want to give your data analysts and data scientists
the ability to process the data directly from HDFS with an interpreted high-level programming
language like Python. Which format should you use to store this data in HDFS?
SequenceFiles

XML

JSON

Avro

HTML

CSV

Incorrect

Using Hadoop Sequence Files


So what should we do in order to deal with a huge amount of images? Use Hadoop sequence files!
Those are map files that inherently can be read by MapReduce applications – there is an input
format especially for sequence files – and are splittable by MapReduce, so we can have one huge
file that will be the input of many map tasks. By using those sequence files we are letting Hadoop
use its advantages. It can split the work into chunks so the processing is parallel, but the chunks
are big enough that the process stays efficient.
Since sequence files are map files, the desired format will be that the key is Text and holds the
HDFS filename, and the value is BytesWritable and contains the image content of the file.
Reference: Hadoop binary files processing introduced by image duplicates finder

Explanation:
Correct answer(s):
SequenceFiles

25. Question

MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate
daemons? Select two.

Health status checks (heartbeats)

Job scheduling/monitoring

Launching tasks

Job coordination between the ResourceManager and NodeManager

MapReduce metric reporting

Managing file system metadata

Resource management
Managing tasks

Incorrect

The fundamental idea of MRv2 is to split up the two major functionalities of the
JobTracker, resource management and job scheduling/monitoring, into separate daemons. The
idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An
application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
Note:
The central goal of YARN is to clearly separate two things that are unfortunately smushed together
in current Hadoop, specifically in (mainly) JobTracker:
/ Monitoring the status of the cluster with respect to which nodes have which resources available.
Under YARN, this will be global.
/ Managing the parallelization execution of any specific job. Under YARN, this will be done
separately for each job.
Reference: Apache Hadoop YARN – Concepts & Applications

Explanation:
Correct answer(s):
Resource management
Job scheduling/monitoring

26. Question

You have the following key-value pairs as output from your Map task:
(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)
How many keys will be passed to the Reducer’s reduce method?

One

Five

Four

Six

Three

Two

Incorrect
The two (the, 1) pairs are grouped under a single key, so the distinct keys passed to the reducer are the, fox, faster, than, and dog: five in total.

Explanation:
Correct answer(s):
Five

27. Question

A client application creates an HDFS file named foo.txt with a replication factor of 3. Identify which
best describes the file access rules in HDFS if the file has a single block that is stored on data
nodes A, B and C?

Each data node stores a copy of the file in the local file system with the same name as the
HDFS file.

The file will be marked as corrupted if data node B fails during the creation of the file.

Each data node locks the local file to prohibit concurrent readers and writers of the file.

The file can be accessed if at least one of the data nodes storing the file is available.

Incorrect

HDFS keeps three copies of a block on three different datanodes to protect against
true data corruption. HDFS also tries to distribute these three replicas on more than one rack to
protect against data availability issues. The fact that HDFS actively monitors any failed
datanode(s) and upon failure detection immediately schedules re-replication of blocks (if needed)
implies that three copies of data on three different nodes is sufficient to avoid corrupted files.
Note:
HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size. The
blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. An application can specify the number of replicas of a file. The replication
factor can be specified at file creation time and can be changed later. Files in HDFS are write-once
and have strictly one writer at any time. The NameNode makes all decisions regarding replication
of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of
3 copies of a data block on HDFS; 2 copies are stored on DataNodes on the same rack and the 3rd copy
on a different rack.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers , How the
HDFS Blocks are replicated?

Explanation:
Correct answer(s):
The file can be accessed if at least one of the data nodes storing the file is available.
28. Question

You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses
TextInputFormat: the mapper applies a regular expression over input values and emits key-value
pairs with the key consisting of the matching text, and the value containing the filename and byte
offset. Determine the difference between setting the number of reducers to one and setting the
number of reducers to zero.

There is no difference in output between the two settings.

With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With
one reducer, all instances of matching patterns are gathered together in one file on HDFS.

With zero reducers, all instances of matching patterns are gathered together in one file on
HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.

With zero reducers, no reducer runs and the job throws an exception. With one reducer,
instances of matching patterns are stored in a single file on HDFS.

Incorrect

* It is legal to set the number of reduce-tasks to zero if no reduction is desired.


In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by
setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the
FileSystem.
* Often, you may want to process input data using a map function only. To do this, simply set
mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks.
Rather, the outputs of the mapper tasks will be the final output of the job.
Note:
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is
called for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to the FileSystem via
OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages and
update Counters, or just indicate that they are alive.
The output of the Reducer is not sorted.

Explanation:
Correct answer(s):
With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With
one reducer, all instances of matching patterns are gathered together in one file on HDFS.
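A short sketch of the configuration difference (driver-side only; the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "grep-like job");
        // Zero reducers: map-only job, each map task writes its own
        // part-m-NNNNN file directly to the output path, unsorted.
        job.setNumReduceTasks(0);
        // One reducer instead funnels every intermediate pair through a single
        // task, producing one combined part-r-00000 file:
        // job.setNumReduceTasks(1);
    }
}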

29. Question
The Hadoop framework provides a mechanism for coping with machine issues such as faulty
configuration or impending hardware failure. MapReduce detects that one or a number of
machines are performing poorly and starts more copies of a map or reduce task. All the tasks run
simultaneously and the results of the task that finishes first are used. This is called:

Combine

Default Partitioner

IdentityMapper

IdentityReducer

Speculative Execution

Incorrect

Speculative execution: One problem with the Hadoop system is that by dividing the
tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program.
For example if one node has a slow disk controller, then it may be reading its input at only 10% the
speed of all the other nodes. So when 99 map tasks are already complete, the system is still
waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their
inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore,
the same input can be processed multiple times in parallel, to exploit differences in machine
capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule
redundant copies of the remaining tasks across several nodes which do not have other work to
perform. This process is known as speculative execution. When tasks complete, they announce
this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If
other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks
and discard their outputs. The Reducers then receive their inputs from whichever Mapper
completed successfully, first.
Reference: Apache Hadoop, Module 4: MapReduce
Note:
* Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first
one to finish wins, and the other copies are killed.
Failed tasks are tasks that error out.
* There are a few reasons Hadoop can kill tasks by his own decisions:

a) Task does not report progress during timeout (default is 10 minutes)


b) FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or
queue (CapacityScheduler).
c) Speculative execution causes results of task not to be needed since it has completed on other
place.
Reference: Difference failed tasks vs killed tasks

Explanation:
Correct answer(s):
Speculative Execution
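Speculative execution is on by default and can be toggled per job; a small sketch using the MRv1 property names of this era (newer releases also accept mapreduce.map.speculative / mapreduce.reduce.speculative):

import org.apache.hadoop.conf.Configuration;

public class SpeculativeExecutionSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Allow redundant copies of slow map tasks, but not of reduce tasks.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        System.out.println("map speculation: "
                + conf.getBoolean("mapred.map.tasks.speculative.execution", true));
    }
}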

30. Question

You need to move a file titled “weblogs” into HDFS. When you try to copy the file, you can’t. You
know you have ample space on your DataNodes. Which action should you take to relieve this
situation and store more files in HDFS?

Increase the amount of memory for the NameNode.

Decrease the block size on your remaining files.

Decrease the block size on all current files in HDFS.

Increase the block size on all current files in HDFS.

Increase the block size on your remaining files.

Increase the number of disks (or size) for the NameNode.

Incorrect

Note:
* -put localSrc dest: copies the file or directory from the local file system identified by localSrc to
dest within the DFS.
* What is HDFS block size? How is it different from traditional file system block size?

In HDFS data is split into blocks and distributed across multiple nodes in the cluster. Each block is
typically 64MB or 128MB in size. Each block is replicated multiple times. The default is to replicate
each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system
to store each HDFS block as a separate file. HDFS Block size can not be compared with the
traditional file system block size.

Explanation:
Correct answer(s):
Decrease the block size on your remaining files.

31. Question

To process input key-value pairs, your mapper needs to load a 512 MB data file in memory. What
is the best way to accomplish this?

Place the data file in the DistributedCache and read the data into memory in the map method of
the mapper.
Place the data file in the DataCache and read the data into memory in the configure method of
the mapper.

Place the data file in the DistributedCache and read the data into memory in the configure
method of the mapper.

Serialize the data file, insert in it the JobConf object, and read the data into memory in the
configure method of the mapper.

Incorrect

Hadoop has a distributed cache mechanism to make files available locally that may
be needed by Map/Reduce jobs.
Use Case
Let’s understand our use case in a bit more detail so that we can follow the code snippets.
We have a key-value file that we need to use in our Map jobs. For simplicity, let’s say we need to
replace all keywords that we encounter during parsing with some other value.
So what we need is:
A key-value file (let’s use a Properties file)
The Mapper code that uses it
Write the Mapper code that uses it:
public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Properties cache;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

        if (localCacheFiles != null) {
            // expecting only a single file here
            for (int i = 0; i < localCacheFiles.length; i++) {
                Path localCacheFile = localCacheFiles[i];
                cache = new Properties();
                cache.load(new FileReader(localCacheFile.toString()));
            }
        } else {
            // do your error handling here
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // use the cache here:
        // if the value contains some attribute, cache.get(<value>)
        // do some action or replace it with something else
    }
}
Note:
* Distribute application-specific large, read-only files efficiently.
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives,
jars etc.) needed by applications.
Applications specify the files, via urls (hdfs:// or http://) to be cached via the JobConf. The
DistributedCache assumes that the files specified via hdfs:// urls are already present on the
FileSystem at the path specified by the url.
Reference: Using Hadoop Distributed Cache

Explanation:
Correct answer(s):
Place the data file in the DistributedCache and read the data into memory in the configure
method of the mapper.
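For completeness, the driver-side half that pairs with the mapper above is a single call registering the HDFS file with the cache (the path is illustrative):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheDriverSnippet {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The file must already exist in HDFS; the path here is illustrative.
        DistributedCache.addCacheFile(new URI("/user/hadoop/lookup.properties"), conf);
        // ... build and submit the Job with this Configuration ...
    }
}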
32. Question

Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume
that the two tables are formatted as comma-separated files in HDFS.

No, but it can be done with either Pig or Hive.

Yes, but only if one of the tables fits into memory

Yes.

Yes, so long as both tables fit into memory.

No, MapReduce cannot perform relational operations.

Incorrect

Note:
* Join Algorithms in MapReduce
A) Reduce-side join
B) Map-side join
C) In-memory join
/ Striped variant
/ Memcached variant
* Which join to use?
/ In-memory join > map-side join > reduce-side join
/ Limitations of each?
In-memory join: memory
Map-side join: sort order and partitioning
Reduce-side join: general purpose

Explanation:
Correct answer(s):
Yes.
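A compact, hedged sketch of one way to do it, a reduce-side join: the mapper tags each CSV row with its source table and emits the join key, and the reducer buffers one side and emits the cross product. The file names ("orders", customers), field positions, and class names are all assumptions for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {

    // Tag each row with its source table; the join column is assumed to be the
    // first comma-separated field of both files.
    public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            String tag = fileName.startsWith("orders") ? "O" : "C";   // illustrative file names
            String[] fields = value.toString().split(",", 2);
            String rest = fields.length > 1 ? fields[1] : "";
            context.write(new Text(fields[0]), new Text(tag + "\t" + rest));
        }
    }

    // Buffer both sides for a key, then emit the joined rows.
    public static class JoiningReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> customers = new ArrayList<String>();
            List<String> orders = new ArrayList<String>();
            for (Text value : values) {
                String[] parts = value.toString().split("\t", 2);
                if ("C".equals(parts[0])) {
                    customers.add(parts[1]);
                } else {
                    orders.add(parts[1]);
                }
            }
            for (String customer : customers) {
                for (String order : orders) {
                    context.write(key, new Text(customer + "," + order));
                }
            }
        }
    }
}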

33. Question

You have just executed a MapReduce job. Where is intermediate data written to after being
emitted from the Mapper’s map method?

Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are
written into HDFS.

Intermediate data is streamed across the network from the Mapper to the Reducer and is never
written to disk.

Into in-memory buffers that spill over to the local file system of the TaskTracker node running
the Mapper.

Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are
written into HDFS.

Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker
node running the Reducer

Incorrect

The mapper output (intermediate data) is stored on the local file system (NOT
HDFS) of each individual mapper node. This is typically a temporary directory location which can

be setup in config by the hadoop administrator. The intermediate data is cleaned up after the
Hadoop Job completes.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the
Mapper Output (intermediate kay-value data) stored ?

Explanation:
Correct answer(s):
Into in-memory buffers that spill over to the local file system of the TaskTracker node running
the Mapper.

34. Question

All keys used for intermediate output from mappers must:

Override isSplitable.

Implement a splittable compression algorithm.

Be a subclass of FileInputFormat.

Implement WritableComparable.

Implement a comparator for speedy sorting.

Incorrect

The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement
the Writable interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework.
Reference: MapReduce Tutorial

Explanation:
Correct answer(s):
Implement WritableComparable.
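A hedged sketch of such a key (the field names are illustrative): write()/readFields() give the framework the serialization it needs, and compareTo() supplies the sort order used during the shuffle:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A composite key holding (symbol, timestamp).
public class StockKey implements WritableComparable<StockKey> {
    private String symbol = "";
    private long timestamp;

    public void set(String symbol, long timestamp) {
        this.symbol = symbol;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(symbol);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        symbol = in.readUTF();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(StockKey other) {
        int cmp = symbol.compareTo(other.symbol);
        if (cmp != 0) return cmp;
        return timestamp < other.timestamp ? -1 : (timestamp == other.timestamp ? 0 : 1);
    }

    @Override
    public int hashCode() {           // used by the default HashPartitioner
        return symbol.hashCode() * 163 + (int) timestamp;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof StockKey)) return false;
        StockKey other = (StockKey) o;
        return symbol.equals(other.symbol) && timestamp == other.timestamp;
    }
}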

35. Question

Table metadata in Hive is:

Stored along with the data in HDFS.

Stored as metadata on the NameNode.

Stored in the Metastore.

Stored in ZooKeeper.

Incorrect

By default, Hive uses an embedded Derby database to store metadata information.


The metastore is the "glue" between Hive and HDFS. It tells Hive where your data files live in
HDFS, what type of data they contain, what tables they belong to, etc.
The Metastore is an application that runs on an RDBMS and uses an open source ORM layer
called DataNucleus, to convert object representations into a relational schema and vice versa.
They chose this approach as opposed to storing this information in hdfs as they need the
Metastore to be very low latency. The DataNucleus layer allows them to plugin many different
RDBMS technologies.
Note:
* By default, Hive stores metadata in an embedded Apache Derby database, and other
client/server databases like MySQL can optionally be used.
* features of Hive include:
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during
query execution.
Reference: Store Hive Metadata into RDBMS

Explanation:
Correct answer(s):
Stored in the Metastore.

36. Question

You have user profile records in your OLTP database that you want to join with web logs you
have already ingested into the Hadoop file system. How will you obtain these user records?

Hive LOAD DATA command

Ingest with Flume agents

Pig LOAD command

HDFS command

Ingest with Hadoop Streaming

Sqoop import

Incorrect

Apache Hadoop and Pig provide excellent tools for extracting and analyzing data
from very large Web logs.
We use Pig scripts for sifting through the data and to extract useful information from the Web logs.
We load the log file into Pig using the LOAD command.
raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);
Note 1:
Data Flow and Components
* Content will be created by multiple Web servers and logged in local hard discs. This content will
then be pushed to HDFS using FLUME framework. FLUME has agents running on Web servers;
these are machines that collect data intermediately using collectors and finally push that data to
HDFS.
* Pig Scripts are scheduled to run using a job scheduler (could be cron or any sophisticated batch
job solution). These scripts actually analyze the logs on various dimensions and extract the
results. Results from Pig are by default inserted into HDFS, but we can use storage

implementation for other repositories also such as HBase, MongoDB, etc. We have also tried the
solution with HBase (please see the implementation section). Pig Scripts can either push this data
to HDFS and then MR jobs will be required to read and push this data into HBase, or Pig scripts
can push this data into HBase directly. In this article, we use scripts to push data onto HDFS, as
we are showcasing the Pig framework applicability for log analysis at large scale.
* The database HBase will have the data processed by Pig scripts ready for reporting and further
slicing and dicing.
* The data-access Web service is a REST-based service that eases the access and integrations
with data clients. The client can be in any language to access REST-based API. These clients
could be BI- or UI-based clients.
Note 2:
The Log Analysis Software Stack
* Hadoop is an open source framework that allows users to process very large data in parallel. It’s
based on the framework that supports Google search engine. The Hadoop core is mainly divided
into two modules:
1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using
multiple commodity servers connected in a cluster.
2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default
implementation is bonded with HDFS.
* The database can be a NoSQL database such as HBase. The advantage of a NoSQL database
is that it provides scalability for the reporting module as well, as we can keep historical processed
data for reporting purposes. HBase is an open source columnar DB or NoSQL DB, which uses
HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to
very large data sets — HBase can save very large tables having million of rows. It’s a distributed
database and can also keep multiple versions of a single row.
* The Pig framework is an open source platform for analyzing large data sets and is implemented
as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of
developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to
be written in Java. In contrast, Pig enables users to write code in a scripting language.
* Flume is a distributed, reliable and available service for collecting, aggregating and moving a
large amount of log data (src flume-wiki). It was built to push large logs into Hadoop-HDFS for
further processing. It’s a data flow solution, where there is an originator and destination for each
node and is divided into Agent and Collector tiers for collecting logs and pushing them to
destination storage.
Reference: Hadoop and Pig for Large-Scale Web Log Analysis

Explanation:
Correct answer(s):
Pig LOAD command

37. Question

What is a SequenceFile?

A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable
objects.

A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable
objects, in sorted order.

A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key
must be the same type. Each value must be the same type.

A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable
objects.

Incorrect

SequenceFile is a flat file consisting of binary key/value pairs.

There are 3 different SequenceFile formats:


Uncompressed key/value records.
Record compressed key/value records – only ‘values’ are compressed here.
Block compressed key/value records – both keys and values are collected in ‘blocks’ separately
and compressed. The size of the ‘block’ is configurable.
Reference: http://wiki.apache.org/hadoop/SequenceFile
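
To make the key/value typing concrete, here is a minimal write-side sketch (the path and data are
hypothetical; the classes come from org.apache.hadoop.conf, org.apache.hadoop.fs and
org.apache.hadoop.io). Every appended key must be a Text and every value an IntWritable,
matching the classes handed to createWriter:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/tmp/example.seq"); // hypothetical output location
SequenceFile.Writer writer =
    SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
try {
    writer.append(new Text("apple"), new IntWritable(3));  // key type fixed to Text
    writer.append(new Text("banana"), new IntWritable(7)); // value type fixed to IntWritable
} finally {
    writer.close();
}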

Explanation:
Correct answer(s):
A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key
must be the same type. Each value must be the same type.

38. Question

Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application
containers and monitoring application resource usage?

NodeManager

ResourceManager

JobTracker

TaskTracker

ApplicationMaster

ApplicationMasterService

Incorrect

The fundamental idea of MRv2 (YARN) is to split up the two major functionalities of
the JobTracker, resource management and job scheduling/monitoring, into separate daemons.
The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM).
An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
Reference: Apache Hadoop YARN – Concepts & Applications

Explanation:
Correct answer(s):
ApplicationMaster

39. Question

You need to perform statistical analysis in your MapReduce job and would like to call methods in
the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file.
Which is the best way to make this library available to your MapReduce job at runtime?

Have your system administrator copy the JAR to all nodes in the cluster and set its location in
the HADOOP_CLASSPATH environment variable before you submit your job.

Package your code and the Apache Commons Math library into a zip file named JobJar.zip

Have your system administrator place the JAR file on a Web server accessible to all cluster
nodes and then set the HTTP_JAR_URL environment variable to its location.

When submitting the job on the command line, specify the -libjars option followed by the JAR
file path.

Incorrect

The usage of the jar command is:

Usage: hadoop jar <jar> [mainClass] args…
If you want the commons-math3.jar to be available to all the tasks, you can do either of the following:
1. Copy the jar file in $HADOOP_HOME/lib dir
or
2. Use the generic option -libjars.
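
Note that -libjars is only honoured when the driver routes its arguments through
GenericOptionsParser, which ToolRunner does automatically. A minimal driver sketch, assuming the
new (org.apache.hadoop.mapreduce) API; the class name and job name are hypothetical, and
Configured, Tool and ToolRunner come from org.apache.hadoop.conf and org.apache.hadoop.util:

public class StatsJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Generic options such as -libjars have already been stripped from args here.
        Job job = Job.getInstance(getConf(), "stats-job");
        // ... set mapper, reducer, input and output paths ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
    public static void main(String[] args) throws Exception {
        // ToolRunner invokes GenericOptionsParser, which handles -libjars, -files, -archives, -D ...
        System.exit(ToolRunner.run(new Configuration(), new StatsJobDriver(), args));
    }
}
// Example invocation (hypothetical paths):
//   hadoop jar stats.jar StatsJobDriver -libjars commons-math3.jar /input /output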

Explanation:
Correct answer(s):
When submitting the job on the command line, specify the -libjars option followed by the JAR
file path.

40. Question

Which best describes what the map method accepts and emits?

It accepts a single key-value pair as input and can emit only one key-value pair as output.

It accepts a single key-value pair as input and emits a single key and list of corresponding
values as output.

It accepts a single key-value pair as input and can emit any number of key-value pairs as
output, including zero.

It accepts a list of key-value pairs as input and can emit only one key-value pair as output.

Incorrect

public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object
Maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks which transform input records into intermediate records. The
transformed intermediate records need not be of the same type as the input records. A given input
pair may map to zero or many output pairs.
Reference: org.apache.hadoop.mapreduce
Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
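
A short sketch (hypothetical class name, new API) illustrating that one input pair can yield zero, one
or many output pairs — an empty line emits nothing, while a long line emits one pair per token:

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {  // zero iterations for an empty line
            word.set(tokens.nextToken());
            context.write(word, ONE);     // many writes for a long line
        }
    }
}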

Explanation:
Correct answer(s):
It accepts a single key-value pair as input and can emit any number of key-value pairs as
output, including zero.

41. Question

For each intermediate key, each reducer task can emit:

As many final key-value pairs as desired. There are no restrictions on the types of those key-value
pairs (i.e., they can be heterogeneous).

One final key-value pair per value associated with the key; no restrictions on the type.

One final key-value pair per key; no restrictions on the type.

As many final key-value pairs as desired, as long as all the keys have the same type and all the
values have the same type.

As many final key-value pairs as desired, but they must have the same type as the intermediate
key-value pairs.

Incorrect

Reducer reduces a set of intermediate values which share a key to a smaller set of
values.
Reducing lets you aggregate values together. A reducer function receives an iterator of input
values from an input list. It then combines these values together, returning a single output value.
Reference: Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce
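
To illustrate the keyed answer, here is a sketch (hypothetical class name) of a reducer that emits
exactly one pair per key by summing that key's values:

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total); // one output pair for this key
    }
}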

Explanation:
Correct answer(s):
One final key-value pair per key; no restrictions on the type.

42. Question

In a MapReduce job, the reducer receives all values associated with the same key. Which statement
best describes the ordering of these values?

The values are arbitrarily ordered, and the ordering may vary from run to run of the same
MapReduce job.

The values are in sorted order.

The values are arbitrarily ordered, but multiple runs of the same MapReduce job will always
have the same ordering.

Since the values come from mapper outputs, the reducers will receive contiguous sections of
sorted values.

Incorrect

Note:
* Input to the Reducer is the sorted output of the mappers.
* The framework calls the application’s Reduce function once for each unique key in the sorted
order.
* Example:
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

Explanation:
Correct answer(s):
The values are arbitrarily ordered, and the ordering may vary from run to run of the same
MapReduce job.

43. Question

You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text
keys, IntWritable values. Which interface should your class implement?

Combiner <Text, IntWritable, Text, IntWritable>

Mapper <Text, IntWritable, Text, IntWritable>

Combiner <Text, Text, IntWritable, IntWritable>

Reducer <Text, Text, IntWritable, IntWritable>


Reducer <Text, IntWritable, Text, IntWritable>

Incorrect

Correct answer(s):
Reducer <Text, IntWritable, Text, IntWritable>
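
In both the old and new APIs a combiner is simply a Reducer whose input and output types match
the mapper's output types, so it is wired into the job like this (the class names are the hypothetical
ones sketched in the earlier questions, and job is assumed to be a configured Job instance):

job.setMapperClass(TokenMapper.class);   // emits (Text, IntWritable)
job.setCombinerClass(SumReducer.class);  // a Reducer<Text, IntWritable, Text, IntWritable>
job.setReducerClass(SumReducer.class);   // the same class often serves both roles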

44. Question

Analyze each scenario below and identify which best describes the behavior of the default
partitioner?

The default partitioner computes the hash of the value and takes the mod of that value with the
number of reducers. The result determines the reducer assigned to process the key-value pair.

The default partitioner assigns key-value pairs to reducers based on an internal random
number generator.

The default partitioner computes the hash of the key. Hash values between specific ranges are
associated with different buckets, and each bucket is assigned to a specific reducer.

The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each
reducer in turn. This ensures an even partition of the key space.

The default partitioner computes the hash of the key and takes that value modulo the number
of reducers. The result determines the reducer assigned to process the key-value pair.

Incorrect

The default partitioner computes a hash value for the key and assigns the partition
based on this result.
The default Partitioner implementation is called HashPartitioner. It uses the hashCode() method of
the key objects modulo the number of partitions total to determine which partition to send a given
(key, value) pair to.
In Hadoop, the default partitioner is HashPartitioner, which hashes a record’s key to determine
which partition (and thus which reducer) the record belongs in. The number of partitions is then
equal to the number of reduce tasks for the job.
Reference: Getting Started With (Customized) Partitioning
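
The stock HashPartitioner boils down to essentially the following (sketched here for illustration):

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}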

Explanation:
Correct answer(s):
The default partitioner computes the hash of the key and takes that value modulo the number
of reducers. The result determines the reducer assigned to process the key-value pair.
45. Question

In a MapReduce job with 500 map tasks, how many map task attempts will there be?

At least 500.

It depends on the number of reduces in the job.

Between 500 and 1000.

Exactly 500.

At most 500.

Incorrect

From Cloudera Training Course:


Task attempt is a particular instance of an attempt to execute a task
– There will be at least as many task attempts as there are tasks
– If a task attempt fails, another will be started by the JobTracker
– Speculative execution can also result in more task attempts than completed tasks

Explanation:
Correct answer(s):
At least 500.

46. Question

You are developing a MapReduce job for sales reporting. The mapper will process input keys
representing the year (IntWritable) and input values representing product identifiers (Text).
Identify what determines the data types used by the Mapper for a given job.

The key and value types specified in the JobConf.setMapInputKeyClass and
JobConf.setMapInputValuesClass methods

The data types specified in the HADOOP_MAP_DATATYPES environment variable

The mapper-specification.xml file submitted with the job determines the mapper’s input key and
value types.

The InputFormat used by the job determines the mapper’s input key and value types.

Incorrect

The input types fed to the mapper are controlled by the InputFormat used. The
default input format, "TextInputFormat", will load data in as (LongWritable, Text) pairs. The long
value is the byte offset of the line in the file. The Text object holds the string contents of the line of
the file.
Note: The data types emitted by the reducer are identified by setOutputKeyClass() and
setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as
well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods
of the JobConf class will override these.
Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD
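
A driver fragment (a sketch assuming the new API and a Job instance named job) showing that the
InputFormat fixes what the mapper receives, while separate calls declare what it emits:

job.setInputFormatClass(TextInputFormat.class); // mapper will receive (LongWritable, Text)
// For the year/product-identifier input in this question, a custom InputFormat producing
// (IntWritable, Text) records would be needed instead.
job.setMapOutputKeyClass(Text.class);           // what the mapper emits as keys
job.setMapOutputValueClass(IntWritable.class);  // what the mapper emits as values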

Explanation:
Correct answer(s):
The InputFormat used by the job determines the mapper’s input key and value types.

47. Question

When is the earliest point at which the reduce method of a given Reducer can be called?

As soon as a mapper has emitted at least one record.

As soon as at least one mapper has finished processing its input split.

It depends on the InputFormat used for the job.

Not until all mappers have finished processing all records.

Incorrect

In a MapReduce job, reducers do not start executing the reduce method until all map tasks
have completed. Reducers start copying intermediate key-value pairs from the mappers
as soon as they are available. The programmer-defined reduce method is called only after all the
mappers have finished.
Note: The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by
the reducer from each mapper. This can happen while mappers are generating data since it is only
a data transfer. On the other hand, sort and reduce can only start once all the mappers are done.
Why is starting the reducers early a good thing? Because it spreads out the data transfer from the
mappers to the reducers over time, which is a good thing if your network is the bottleneck.
Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only
copying data. Another job that starts later and would actually use the reduce slots cannot use
them.
You can customize when the reducers startup by changing the default value of
mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the
mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A
value of 0.5 will start the reducers when half of the mappers are complete. You can also change
mapred.reduce.slowstart.completed.maps on a job-by-job basis.
Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has
multiple jobs running at once. This way the job doesn’t hog up reducers when they aren’t doing
anything but copying data. If you only ever have one job running at a time, doing 0.1 would
probably be appropriate.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When is the
reducers are started in a MapReduce job?
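
For example, the slow-start threshold can be set per job in the driver — a sketch using the MRv1
property name quoted above (in Hadoop 2 the equivalent key is
mapreduce.job.reduce.slowstart.completedmaps):

Configuration conf = new Configuration();
// Do not schedule reducers until 90% of the map tasks have completed.
conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.90f);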

Explanation:
Correct answer(s):
Not until all mappers have finished processing all records.

48. Question

You want to populate an associative array in order to perform a map-side join. You’ve decided to
put this information in a text file, place that file into the DistributedCache and read it in your
Mapper before any records are processed.
Identify which method in the Mapper you should use to implement code for reading the file and
populating the associative array?

map

combine

init

configure

Incorrect

See 3) below.
Here is an illustrative example on how to use the DistributedCache:

// Setting up the cache for the application


1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Setup the application’s JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
3. Use the cached files in the Mapper
or Reducer:
public static class MapClass extends MapReduceBase
implements Mapper<K, V, K, V> {
private Path[] localArchives;
private Path[] localFiles;
public void configure(JobConf job) {
// Get the cached archives/files
localArchives = DistributedCache.getLocalCacheArchives(job);
localFiles = DistributedCache.getLocalCacheFiles(job);
}
public void map(K key, V value,
OutputCollector<K, V> output, Reporter reporter)
throws IOException {
// Use data from the cached archives/files here

// …
// …
output.collect(k, v);
}
}
Reference: org.apache.hadoop.filecache , Class DistributedCache

Explanation:
Correct answer(s):
configure

49. Question

Which process describes the lifecycle of a Mapper?

The TaskTracker spawns a new Mapper to process all records in a single input split.

The JobTracker spawns a new Mapper to process all records in a single file.

The JobTracker calls the TaskTracker’s configure () method, then its map () method and finally
its close () method.

The TaskTracker spawns a new Mapper to process each key-value pair.

Incorrect
For each map instance that runs, the TaskTracker creates a new instance of your
mapper.
Note:
* The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat. The
mapper may perform a number of extraction and transformation functions on the Key/Value pair
before ultimately outputting zero, one or many Key/Value pairs of the same or a different Key/Value
type.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class.
This class defines an ‘Identity’ map function by default – every input Key/Value pair obtained from
the InputFormat is written out.
Examining the run() method, we can see the lifecycle of the mapper:
/**
* Expert users can override this method for more complete control over the
* execution of the Mapper.

* @param context
* @throws IOException
*/
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
}
setup(Context) – Perform any setup for the mapper. The default implementation is a no-op method.
map(Key, Value, Context) – Perform a map operation in the given Key / Value pair. The default
implementation calls Context.write(Key, Value)
cleanup(Context) – Perform any cleanup for the mapper. The default implementation is a no-op
method.
Reference: Hadoop/MapReduce/Mapper

Explanation:
Correct answer(s):
The TaskTracker spawns a new Mapper to process all records in a single input split.

50. Question

You have written a Mapper which invokes the following five calls to the OutputCollector.collect
method:
output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));
How many times will the Reducer’s reduce method be invoked?

Incorrect

reduce() gets called once for each [key, (list of values)] pair. To explain, let’s say
you called:
out.collect(new Text("Car"), new Text("Subaru"));
out.collect(new Text("Car"), new Text("Honda"));
out.collect(new Text("Car"), new Text("Ford"));
out.collect(new Text("Truck"), new Text("Dodge"));
out.collect(new Text("Truck"), new Text("Chevy"));
Then reduce() would be called twice with the pairs

reduce(Car, <Subaru, Honda, Ford>)


reduce(Truck, <Dodge, Chevy>)
Reference: Mapper output.collect()?

Explanation:
Correct answer(s):
3

51. Question

A combiner reduces:

The number of input files a mapper must process.

The number of values across different keys in the iterator supplied to a single reduce method
call.

The amount of intermediate data that must be transferred between the mapper and reducer.

The number of output files a reducer must produce.

Incorrect
Combiners are used to increase the efficiency of a MapReduce program. They are
used to aggregate intermediate map output locally on individual mapper nodes. Combiners can
help you reduce the amount of data that needs to be transferred across to the reducers. You can
use your reducer code as a combiner if the operation performed is commutative and associative.
The execution of the combiner is not guaranteed; Hadoop may or may not execute a combiner, and if
required it may execute it more than once. Therefore your MapReduce jobs should not depend
on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are
combiners? When should I use a combiner in my MapReduce Job?

Explanation:
Correct answer(s):
The amount of intermediate data that must be transferred between the mapper and reducer.

52. Question

You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB.
Just after this command has finished writing 200 MB of this file, what would another user see
when trying to access this file?

They would see the current state of the file through the last completed block.

They would see Hadoop throw an ConcurrentFileAccessException when they try to access this
file.

They would see the current state of the file, up to the last bit written by the command.

They would see no content until the whole file is written and closed.

Incorrect

Note:
* put
Usage: hadoop fs -put <localsrc> … <dst>
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads
input from stdin and writes to destination filesystem.

Explanation:
Correct answer(s):
They would see no content until the whole file is written and closed.

53. Question
In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will
there be in the sort/shuffle phase?

m × n (i.e., m multiplied by n)

m+n (i.e., m plus n)

m^n (i.e., m to the power of n)

Incorrect

A MapReduce job with m mappers and n reducers involves up to m × n distinct copy
operations, since each mapper may have intermediate output going to every reducer.

Explanation:
Correct answer(s):
m × n (i.e., m multiplied by n)

54. Question

What is the disadvantage of using multiple reducers with the default HashPartitioner and
distributing your workload across your cluster?

You will no longer be able to take advantage of a Combiner.

You will not be able to compress the intermediate data.

By using multiple reducers with the default HashPartitioner, output files may not be in globally
sorted order.

There are no concerns with this approach. It is always advisable to use multiple reducers.

Incorrect

Multiple reducers and total ordering


If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml
has been set to a number larger than 1, or because you’ve used the -r option to specify
the number of reducers on the command line), then by default Hadoop will use the HashPartitioner
to distribute records across the reducers. Use of the HashPartitioner means that you can’t
concatenate your output files to create a single sorted output file. To do this you’ll need total
ordering.
Reference: Sorting text files with MapReduce
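
If a globally sorted result is required across several reducers, one option is the
TotalOrderPartitioner together with an input sampler. A rough sketch rather than a complete recipe
(the partition-file path is hypothetical, and input paths and key types must already be configured on
the job):

job.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("/tmp/partitions"));
// Sample the input to choose reducer boundaries (10% sampling frequency, at most 10,000 samples).
InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.1, 10000));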
Explanation:
Correct answer(s):
By using multiple reducers with the default HashPartitioner, output files may not be in globally
sorted order.

55. Question

Identify the tool best suited to import a portion of a relational database every day as files into
HDFS, and generate Java classes to interact with that imported data?

Pig

Oozie

Sqoop

Flume

Hive

fuse-dfs

Hue

Incorrect

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:


Imports individual tables or entire databases to files in HDFS
Generates Java classes to allow you to interact with your imported data
Provides the ability to import from SQL databases straight into your Hive data warehouse
Note:
Data Movement Between Hadoop and Relational Databases
Data can be moved between Hadoop and a relational database as a bulk data transfer, or
relational tables can be accessed from within a MapReduce map function.
Note:
* Cloudera’s Distribution for Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports
individual tables or entire databases into HDFS files. The tool also generates Java classes that
support interaction with the imported data. Sqoop supports all relational databases over JDBC,
and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to

data residing in Oracle databases.


Reference: http://log.medcl.net/item/2011/08/hadoop-and-mapreduce-big-data-analytics-gartner/
(Data Movement between hadoop and relational databases, second paragraph)

Explanation:
Correct answer(s):
Sqoop

56. Question

Identify the utility that allows you to create and run MapReduce jobs with any executable or script
as the mapper and/or the reducer?

Flume

Sqoop

Oozie

mapred

Hadoop Streaming

Incorrect

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility
allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or
the reducer.
Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop Streaming,
second sentence)

Explanation:
Correct answer(s):
Hadoop Streaming

57. Question

You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt
and #data.txt. How many files will be processed by the FileInputFormat.setInputPaths () command
when it’s given a path object representing this directory?

Four, all files will be processed

Two, file names with a leading period or underscore are ignored

None, the directory cannot be named jobdata

Three, the pound sign is an invalid character for HDFS file names

One, no special characters can prefix the name of an input file

Incorrect
Files whose names begin with ‘_’ or ‘.’ are treated as hidden and are ignored by FileInputFormat’s
default path filter, much like dot-files on Unix. The ‘#’ character is allowed in HDFS file names.

Explanation:
Correct answer(s):
Two, file names with a leading period or underscore are ignored

58. Question

Assuming default settings, which best describes the order of data provided to a reducer’s reduce
method:

Both the keys and values passed to a reducer always appear in sorted order.

Neither keys nor values are in any predictable order.

The keys given to a reducer aren’t in a predictable order, but the values associated with those
keys always are.

The keys given to a reducer are in sorted order but the values associated with each key are in
no predictable order

Incorrect

Reducer has 3 primary phases:

1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the
same key).
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are
merged.
SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should
extend the key with the secondary key and define a grouping comparator. The keys will be sorted
using the entire key, but will be grouped using the grouping comparator to decide which keys and
values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
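
Wiring up a secondary sort typically looks like the following sketch; all three classes are hypothetical
user-written comparators/partitioners built around a composite key, and job is a configured Job
instance:

job.setPartitionerClass(NaturalKeyPartitioner.class);               // partition on the natural key only
job.setSortComparatorClass(CompositeKeyComparator.class);           // sort on (natural key, secondary key)
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // group reduce() calls by natural key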

Explanation:
Correct answer(s):
The keys given to a reducer are in sorted order but the values associated with each key are in
no predictable order

59. Question

You want to run Hadoop jobs on your development workstation for testing before you submit them
to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate
a production cluster while using a single machine?

Run all the nodes in your production cluster as virtual machines on your development
workstation.

Run the hadoop command with the -jt local and the -fs file:/// options.

Run simldooop, the Apache open-source software for simulating Hadoop clusters.

Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.

Incorrect

Hosting on local VMs


As well as large-scale cloud infrastructures, there is another deployment pattern: local VMs on
desktop systems or other development machines. This is a good tactic if your physical machines
run Windows and you need to bring up a Linux system running Hadoop, and/or you want to
simulate the complexity of a small Hadoop cluster.
Have enough RAM for the VM to not swap.
Don’t try to run more than one VM per physical host; it will only make things slower.
Use file: URLs to access persistent input and output data.
Consider making the default filesystem a file: URL so that all storage is really on the physical host.
It’s often faster and preserves data better.

Explanation:
Correct answer(s):
Run all the nodes in your production cluster as virtual machines on your development
workstation.

60. Question

You need to create a job that does frequency analysis on input data. You will do this by writing a
Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into
individual characters. For each one of these characters, you will emit the character as a key and
an IntWritable as the value. As this will produce proportionally more intermediate data than input
data, which two resources should you expect to be bottlenecks?

Processor and network I/O

Processor and RAM

Processor and disk I/O

Disk I/O and network I/O

Incorrect

Intermediate map output is written and sorted on local disk and then copied across the network to
the reducers during the shuffle, so a job that emits far more intermediate data than it reads will be
bound by disk I/O and network I/O rather than processor or RAM.

Correct answer(s):
Disk I/O and network I/O

We do not provide actual exam questions from any vendor like Microsoft, Cisco, Oracle, EMC etc.
