
MapReduce: Simplified Data Analysis of Big Data

Seema Maitreya, C.K. Jha

The development of computers and the internet has increased the growth of data exponentially:
Data coming from employees
Data coming from social media
Data coming from machines
The issue is how to extract value from these data efficiently.
Efficient parallel/concurrent algorithms and implementation techniques are the key to meeting the scalability and performance requirements entailed in such large-scale data mining analyses.

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.
The significance of Big Data

Big data is a valuable term despite the hype surrounding it.
It is gaining popularity and interest from both business users and the IT industry.
From an analytics perspective, it represents analytic workloads and data management solutions that could not previously be supported because of cost and/or technology considerations.
The significance of
Big Data continued

The solutions provided enable smarter and faster decision making, and allow organizations to achieve faster time to value from their investments in analytical processing technology and products.
Analytics on multi-structured data enables smarter decisions. Until now, these types of data have been difficult to process using traditional analytical processing.
using traditional analytical processing
The significance of
Big Data continued

Rapid decisions are enabled because big data solutions support the rapid analysis of high volumes of detailed data.
Faster time to value is possible because organizations can now process and analyze data that is outside of the enterprise data warehouse.
The table below summarizes the main features, challenges and technology responses connected to handling different types of large data sets.
Attribute: Volume
Features: The amount of generated data has increased tremendously over the past years. However, this is the less challenging aspect in practice.
Challenges and technology responses: The internet has created a tremendous increase in global data production. One response to this situation has been the generalization of cloud-based solutions. The NoSQL database approach is a response to storing and querying huge volumes of heavily distributed data.

Attribute: Velocity
Features: Production of data is growing at high speed, and such data must be collected in ever shorter time frames.
Challenges and technology responses: Millions of connected devices (e.g. smartphones) are added daily, which increases not only the volume but also the velocity of data. To gain a competitive edge, global companies have turned to real-time data processing platforms.

Attribute: Variety
Features: There has been an explosion of data formats, ranging from structured information to free text, with the multiplication of data sources.
Challenges and technology responses: The current way to collect and analyse non-structured or semi-structured data is just the opposite of how the traditional relational data model and its query languages work. This reality has resulted in the evolution of new kinds of data stores that support flexible data models.

Attribute: Value
Features: Until recently, the focus was on recording large volumes of data, not on how to exploit them.
Challenges and technology responses: Big Data technologies are deepening their roots in creating, capturing and exploiting large volumes of data. In principle, the challenge lies in transforming raw data into information that contains value and can be used in decision making or other business requirements.
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
MapReduce gained its popularity when used successfully by Google. In reality, it is a fault-tolerant data processing tool which provides the ability to process huge volumes of data in parallel on many low-end computing nodes.
MapReduce Continued

MapReduce is designed to be used by programmers, rather than business users. It is a programming model, not a programming language. It has gained popularity for its ease of use, efficiency and ability to handle Big Data in a timely manner.
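As an illustration of the model, the classic word count can be expressed as a map phase, a shuffle, and a reduce phase. This is a minimal sketch in plain Python, not Hadoop itself; the function names are made up for this example:

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all intermediate values by their intermediate key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine all values associated with the same key.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big analysis"])))
# counts == {"big": 2, "data": 1, "analysis": 1}
```

In a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data across the network, but the logical structure is the same.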
MapReduce Continued

Steps in MapReduce to process the database

MapReduce Continued

MapReduce with combiners

MapReduce Continued

Mappers: generate an arbitrary number of intermediate key-value pairs.
Reducers: applied to all intermediate values associated with the same intermediate key.
Partitioners: divide the intermediate key space and assign intermediate key-value pairs to reducers.
Combiners: an optional optimization that allows local aggregation of data before the shuffle-and-sort phase. Essentially, combiners are used to save bandwidth (e.g. in a word count program).
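To make the bandwidth saving concrete, here is a sketch of a word count combiner doing local aggregation on one mapper's output. This is plain Python with hypothetical function names, not the Hadoop API:

```python
from collections import Counter

def mapper(document):
    # Emit one (word, 1) pair per word in this input split.
    return [(word, 1) for word in document.split()]

def combiner(pairs):
    # Pre-sum values per key on the mapper's own node, so fewer
    # intermediate pairs have to be shuffled across the network.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

raw = mapper("big data big cluster big data")
local = combiner(raw)
# Without a combiner, all 6 raw pairs would be shuffled;
# after local aggregation only 3 remain:
# [("big", 3), ("cluster", 1), ("data", 2)]
```

The reducer can stay unchanged, because summing pre-summed counts gives the same final totals; this is why combiners are safe for associative, commutative operations like counting.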
Hadoop Distributed
File System (HDFS)

HDFS is best suited to a small number of very large files.
Data replication makes it possible to achieve data availability in HDFS, but it increases the storage required to hold the data.
HDFS supports multiple readers and one writer (MROW). No index mechanism is available in HDFS; hence, it is best suited to read-only applications that need to scan and read the complete contents of a file.
HDFS architecture

The best way to understand and get familiar with the workings of Hadoop is to walk through the process of writing a Hadoop MapReduce application. We will work with a simple MapReduce application that reverses many strings.
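The string-reversal application can be sketched as a mapper/reducer pair. This is a plain-Python illustration of the logic, not the actual code from the paper:

```python
def reverse_mapper(lines):
    # Mapper: for each input string, emit (original, reversed) as
    # the intermediate key-value pair.
    for line in lines:
        line = line.rstrip("\n")
        yield (line, line[::-1])

def identity_reducer(pairs):
    # Reversal needs no aggregation, so the reducer simply passes
    # each (original, reversed) pair through to the output.
    for key, value in pairs:
        yield (key, value)

result = dict(identity_reducer(reverse_mapper(["hadoop", "mapreduce"])))
# result == {"hadoop": "poodah", "mapreduce": "ecuderpam"}
```

Because every input line is independent, this job is trivially parallel: each mapper can reverse its own split of the file without any shuffling of shared keys.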
Main Objective
To optimize HDFS and have a significant impact on the overall performance of the MapReduce framework, thereby boosting the overall efficiency of MapReduce applications in Hadoop.

Big data and the technologies associated with it can bring significant benefits to the business.
To extract these benefits, it is crucial to know how to ensure intelligent use, management and re-use of data sources, including public government data, within and across countries, to build useful applications and services.
It is crucial to evaluate the best approach for filtering and/or analyzing the data.
The Hadoop framework speeds up the processing of large amounts of data through distributed processing and thus provides responses very fast.
Paper Analysis
There is no clear statement of the problem in the paper. The closest is the mention of how to make full use of large-scale data to support decision making.
Author's purpose/approach/methods
The authors' approach is to create a sample program to test the efficiency of Hadoop.
Is the title appropriate/clear?
The title of the paper gives only vague information about its main contents.
Is the abstract in the correct form?
The abstract seems to be in the correct form, but it gives little information on what the paper aims to achieve.
Is the purpose of the article clear in the intro?
The purpose of the article is not clearly stated in the introduction. I had to read up to the experiment and contribution sections to fully understand what the purpose is.
Is the authors' objective on topic?
The objective is on topic, considering that the topic is the improvement of Hadoop.
Is the objective important to the field of IT?
Yes. Increased performance and efficiency in gathering and analyzing data is important to every field, not just Information Technology.
Are the experimental methods described adequately?
The experiment consists of testing a program that reverses some text. However, no matching results are displayed to confirm whether the test was successful and congruent with the main objective.
Suggested Improvements
Make sure that the main objective and the problem are stated at the start of the article, so that readers understand from the outset what they are going to look at, and give a clear indication of what the article is about.
Presented by:
Christian F Ramos