Вы находитесь на странице: 1из 4

1

ASSN.NO:1 CLOUD COMPUTING ASSIGNMENT


13.10.2019

MAP REDUCE:

MapReduce is a core component of the Apache Hadoop software framework.

Hadoop enables resilient, distributed processing of massive unstructured data


sets across commodity computer clusters, in which each node of the cluster
includes its own storage. MapReduce serves two essential functions: it filters
and parcels out work to various nodes within the cluster or map, a function
sometimes referred to as the mapper, and it organizes and reduces the results
from each node into a cohesive answer to a query, referred to as the reducer.

How MapReduce works:

The original version of MapReduce involved several component daemons,


including:

 JobTracker -- the master node that manages all the jobs and resources in a
cluster;
 TaskTrackers -- agents deployed to each machine in the cluster to run the
map and reduce tasks; and
 JobHistory Server -- a component that tracks completed jobs and is typically
deployed as a separate function or with JobTracker.

With the introduction of MapReduce and Hadoop version 2, previous


JobTracker and TaskTracker daemons have been replaced with components of
Yet Another Resource Negotiator (YARN), called ResourceManager and
NodeManager.
2

 ResourceManager runs on a master node and handles the submission and


scheduling of jobs on the cluster. It also monitors jobs and allocates
resources.
 NodeManager runs on slave nodes and interoperates with Resource Manager
to run tasks and track resource usage. NodeManager can employ other
daemons to assist with task execution on the slave node.

To distribute input data and collate results, MapReduce operates in


parallel across massive cluster sizes. Because cluster size doesn't affect a
processing job's final results, jobs can be split across almost any number of
servers. Therefore, MapReduce and the overall Hadoop framework simplify
software development.

MapReduce is available in several languages, including C, C++, Java, Ruby,


Perl and Python. Programmers can use MapReduce libraries to create tasks
without dealing with communication or coordination between nodes.

MapReduce is also fault-tolerant, with each node periodically reporting its


status to a master node. If a node doesn't respond as expected, the master node
reassigns that piece of the job to other available nodes in the cluster. This
creates resiliency and makes it practical for MapReduce to run on inexpensive
commodity servers.

MapReduce examples and uses

The power of MapReduce is in its ability to tackle huge data sets by distributing
processing across many nodes, and then combining or reducing the results of
those nodes.

As a basic example, users could list and count the number of times every word
appears in a novel as a single server application, but that is time-consuming. By
contrast, users can split the task among 26 people, so each takes a page, writes a
word on a separate sheet of paper and takes a new page when they're finished.
This is the map aspect of MapReduce. And if a person leaves, another person
takes his or her place. This exemplifies MapReduce's fault-tolerant element.
3

When all the pages are processed, users sort their single-word pages into 26
boxes, which represent the first letter of each word. Each user takes a box and
sorts each word in the stack alphabetically. The number of pages with the same
word is an example of the reduce aspect of MapReduce.

There is a broad range of real-world uses for MapReduce involving complex


and seemingly unrelated data sets. For example, a social networking site could
use MapReduce to determine users' potential friends, colleagues and other
contacts based on site activity, names, locations, employers and many other data
elements. A booking website could use MapReduce to examine the search
criteria and historical behaviors of users, and can create customized offerings
for each. An industrial facility could collect equipment data from different
sensors across the installation and use MapReduce to tailor maintenance
schedules or predict equipment failures to improve overall uptime and cost-
savings.

MapReduce services and alternatives

One challenge with MapReduce is the infrastructure it requires to run. Many


businesses that could benefit from big data tasks can't sustain the capital and
overhead needed for such an infrastructure. As a result, some organizations rely
on public cloud services for Hadoop and MapReduce, which offer enormous
scalability with minimal capital costs or maintenance overhead.

For example, Amazon Web Services (AWS) provides Hadoop as a service


through its Amazon Elastic MapReduce (EMR) offering. Microsoft Azure
offers its HDInsight service, which enables users to provision Hadoop, Apache
Spark and other clusters for data processing tasks. Google Cloud Platform
provides its Cloud Dataproc service to run Spark and Hadoop clusters.

For organizations that prefer to build and maintain private, on-premises big data
infrastructures, Hadoop and MapReduce represent only one option.
Organizations can opt to deploy other platforms, such as Apache Spark, High-
Performance Computing Cluster and Hydra. The big data framework an
enterprise chooses will depend on the types of processing tasks required,
4

supported programming languages, and performance and infrastructure


demands.

Вам также может понравиться