Introduction
Apache Hadoop is one of the earliest and most influential open-source tools for storing and
processing the massive amounts of readily available digital data that have accumulated with the
rise of the World Wide Web. It evolved from a project called Nutch, which attempted to build a
better open source way to crawl the web. Nutch's creators were heavily influenced by the
thinking in two key papers from Google, describing the Google File System and MapReduce, and
originally incorporated those ideas into Nutch. Eventually, the storage and processing work split
off into the Hadoop project, while Nutch continued as its own web crawling project.
In this article, we'll briefly consider data systems and some specific, distinguishing needs of big
data systems. Then we'll look at how Hadoop has evolved to address those needs.
Data Systems
Data exists all over the place: on scraps of paper, in books, photos, multimedia files, server logs,
and on web sites. When that data is purposefully collected, it enters a data system.
Imagine a school project where students measure the water level of a nearby creek each day.
They record their measurements in the field on a clipboard, return to their classroom, and enter
that data in a spreadsheet. When they've collected a sufficient amount, they begin to analyze it.
They might compare the same months from different years or sort from highest to lowest water
level. They might build graphs to look for trends.
This school project illustrates a data system: the data is stored (saved to disk on the classroom
computer; field notebooks might be copied or retained to verify the integrity of the data).
Big data systems address the same basic needs, but at a scale that forced several changes in
thinking:
1. Big data systems had to accept that data would be distributed. Storing the dataset in
pieces across a cluster of machines was unavoidable.
2. Once clusters became the storage foundation, the software had to account for hardware
failure, because hardware failure is inevitable when you're running hundreds or thousands of
machines in a cluster.
3. Since machines will fail, they needed a new way of communicating with each other. In
everyday data computing, we're used to some specific machine, usually identified by an IP
address or hostname, sending specific data to another specific machine. This explicit
communication had to be replaced with implicit communication, where some machine tells
some other machine that it must process some specific data. Otherwise, programmers would face
a verification problem at least as large as the data processing problem itself.
4. Finally, computing would need to go to the data and process it on the distributed
machines rather than moving the vast quantities of data across the network.
Hadoop, a Java-based programming framework that spun off as its own open source project in
2006, was the first to embrace these changes in thinking. Its first iteration consisted of two layers:
1. HDFS: The Hadoop Distributed File System, responsible for storing data across multiple
machines.
2. MapReduce: The software framework for processing the data in place and in parallel on
each machine, as well as scheduling tasks, monitoring them, and re-running failed ones.
HDFS 1.0
The Hadoop Distributed File System, HDFS, is the distributed storage layer that Hadoop uses to
spread out data and ensure that it is properly stored for high availability.
When a new block is created, HDFS places the first replica on the node where the writer is
running. A second replica is written to a randomly chosen node in any rack except the rack
where the first replica was written. The third replica is then placed on a randomly chosen
machine in that second rack. If the configuration specifies more than the default of three
replicas, the remaining replicas are placed randomly, with the restrictions that no more than
one replica is placed on any one node and no more than two replicas are placed on the same rack.
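The replication factor can also be requested per file from client code. Below is a minimal sketch
using the Hadoop Java API; it assumes a reachable HDFS cluster configured on the classpath, and
the /tmp/example.txt path is hypothetical, used purely for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            // Loads core-site.xml/hdfs-site.xml from the classpath;
            // the dfs.replication property defaults to 3 unless overridden.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path, for illustration only.
            Path path = new Path("/tmp/example.txt");

            // Ask HDFS to keep three replicas of each block of this file.
            short replication = 3;
            try (FSDataOutputStream out = fs.create(path, true, 4096,
                    replication, fs.getDefaultBlockSize(path))) {
                out.writeBytes("she sells seashells\n");
            }

            // The replication factor can also be changed after the fact.
            fs.setReplication(path, (short) 2);
        }
    }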
MapReduce 1.0
The second layer of Hadoop, MapReduce, is responsible for batch processing the data stored on
HDFS. Hadoop's implementation of Google's MapReduce programming model makes it possible
for developers to use the resources provided by HDFS without needing experience with parallel
and distributed systems.
Consider counting the words in a short tongue twister. During mapping, each word is emitted as
a {word, 1} pair; shuffling groups the pairs by word; and reducing sums each group into a final
count:

MAPPING                           SHUFFLING              REDUCING
{she, 1}, {she, 1}                {she, 1, 1}            {she, 2}
{sells, 1}, {sells, 1}            {sells, 1, 1}          {sells, 2}
{seashells, 1}, {seashells, 1}    {seashells, 1, 1}      {seashells, 2}
{by, 1}                           {by, 1}                {by, 1}
{six, 1}                          {six, 1}               {six, 1}
{seashores, 1}, {seashores, 1}    {seashores, 1, 1}      {seashores, 2}
{sure, 1}                         {sure, 1}              {sure, 1}
{well, 1}                         {well, 1}              {well, 1}
If this mapping were done in sequence over a large dataset, it would take much too long; done in
parallel and then reduced, the work scales to very large datasets.
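For concreteness, here is a minimal sketch of this job written against the Hadoop MapReduce
Java API, modeled on the canonical word count example that ships with Hadoop; input and
output HDFS paths are passed as command-line arguments:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: emits a {word, 1} pair for every word in its input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: receives {word, [1, 1, ...]} after shuffling and sums the counts.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework handles splitting the input across machines, scheduling map and reduce tasks
near the data, and re-running any that fail; the developer writes only the two functions above.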
Higher-level components can plug into the MapReduce layer to supply additional functionality.
For example, Apache Pig provides developers with a language for writing data analysis programs
by abstracting the Java MapReduce idioms to a higher level, similar to what SQL does for
relational databases. Apache Hive supports data analysis and reporting with an SQL-like
interface to HDFS, abstracting the low-level MapReduce Java API into high-level query
functionality for developers.
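To give a feel for that interface, here is a hedged sketch of a Java client issuing a HiveQL
aggregation over JDBC; the HiveServer2 endpoint at localhost:10000 and the words table are
assumptions for illustration, not details from this article:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver; 10000 is its default port.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 // "words" is a hypothetical table used only for illustration.
                 ResultSet rs = stmt.executeQuery(
                        "SELECT word, COUNT(*) AS freq FROM words GROUP BY word")) {
                while (rs.next()) {
                    System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
                }
            }
        }
    }

In a Hadoop 1.x-era deployment, Hive compiles a query like this into one or more MapReduce
jobs behind the scenes, so the developer never touches the MapReduce API directly.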
Many additional components are available for Hadoop 1.x, but the ecosystem was constrained by
some key limitations in MapReduce.
Limitations of MapReduce 1
In Hadoop 1.x, every workload had to pass through the batch-oriented, often I/O-intensive,
high-latency MapReduce framework, and cluster resource management was tightly coupled to
MapReduce itself. Hadoop 2.x was designed to lift these constraints.
Hadoop 2.x
1 HDFS Federation
HDFS federation introduces a clear separation between namespace and storage, making multiple
namespaces in a cluster possible. This brings key improvements in scalability (NameNodes can be
added horizontally as the cluster grows), performance (file system throughput is no longer
bottlenecked by a single NameNode), and isolation (separate namespaces can be dedicated to
different applications or tenants).
2 NameNode High Availability
In addition to the improved scalability, performance, and isolation provided by NameNode
federation, Hadoop 2.0 introduced high availability for the NameNodes: if an active NameNode
fails, a standby can take over so the cluster keeps serving clients.
3 YARN
Hadoop 2.0 decoupled MapReduce from HDFS. The management of workloads, multi-tenancy,
security controls, and high availability features were spun off into YARN (Yet Another Resource
Negotiator). YARN is, in essence, a large-scale distributed operating system for big data
applications, which makes Hadoop well suited not only for MapReduce but also for applications
that can't wait for batch processing to complete. YARN removed the need to work through the
often I/O-intensive, high-latency MapReduce framework, enabling new processing models to be
used with HDFS. The decoupled MapReduce remained as a user-facing framework exclusively
devoted to the task it was intended to perform: batch processing.
Below are some of the processing models available to users of Hadoop 2.x:
Batch Processing
Batch processing systems are non-interactive and have access to all of the data before processing
starts. In addition, all of the questions being explored must be known before processing starts.
Batch processing is typically high latency, with the run times of big data batch jobs generally
measured in minutes or more.
Where batch processing shines: Indexing data, crawling the web, and data crunching
Some programs that can do this for Hadoop: MapReduce, Tez, Spark, Hive, and Flink
Interactive Processing
Interactive processing systems are needed when you don't know all of your questions ahead of
time. Instead, a user interprets the answer to a query, then formulates a new question. To support
this kind of exploration, the response has to be returned much more quickly than a typical
MapReduce job.
Where interactive processing shines: Interactive processing supports data exploration.
Some programs that do this for Hadoop: Impala, Drill, HAWQ, Presto, Vortex, Vertica SQL, and
Tez
Stream Processing
Stream processing systems take large numbers of discrete data points and execute a continuous
query to produce near-real-time results as new data arrives in the system.
Where stream processing shines: Any time you have digital data that is being continually
generated. Examples include monitoring public sentiment toward an issue, event, or product in
social media in order to track emerging trends, or monitoring server logs.
Some programs that do this for Hadoop: Spark Streaming, Storm
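As a minimal sketch of the streaming model, here is a continuous word count using Spark
Streaming's Java API; the text source emitting lines on localhost port 9999 and the five-second
batch interval are assumptions for illustration:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setMaster("local[2]")           // two local threads: receiver + processing
                    .setAppName("StreamingWordCount");
            // Group arriving data into five-second batches.
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Hypothetical source: a process writing lines of text to port 9999.
            JavaReceiverInputDStream<String> lines =
                    ssc.socketTextStream("localhost", 9999);

            // Split each line into words, then count occurrences per batch.
            JavaDStream<String> words =
                    lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            JavaPairDStream<String, Integer> counts =
                    words.mapToPair(w -> new Tuple2<>(w, 1))
                         .reduceByKey(Integer::sum);

            counts.print();          // emit updated counts as each batch completes
            ssc.start();
            ssc.awaitTermination();
        }
    }

Each batch is processed as it arrives, so results update continuously instead of waiting for a
complete dataset, which is the defining difference from the batch model above.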
Graph Processing
Graph algorithms typically require communication between vertices, hopping along edges from
one vertex to another, which incurred a lot of unnecessary overhead when forced through the
1.x MapReduce framework.
Where graph processing shines: Graphs are ideal for representing the non-linear relationships
between things: Facebook's friend relationships, Twitter's followers, and the distributed graph
databases at the core of social media sites.
Some programs that do this for Hadoop: Apache Giraph, Apache Spark's GraphX, Hama, Titan
These are just a few of the alternative processing models and tools. For a comprehensive guide to
the open source Hadoop ecosystem, including processing models other than MapReduce, see
The Hadoop Ecosystem Table.
Hadoop 3.x
At the time of this writing, Hadoop 3.0.0-alpha1 is available for testing. The 3.x branch aims
to provide improvements such as HDFS erasure coding to conserve disk space, improvements
to YARN's Timeline Service to improve its scalability, reliability, and usability, support for more
than two NameNodes, an intra-DataNode balancer, and more. To learn more, visit the overview
of major changes.
Conclusion
In this article, we've looked at how Hadoop evolved to meet the needs of increasingly large
datasets. If you're interested in experimenting with Hadoop, you might like to take a look
at Installing Hadoop in Stand-Alone Mode on Ubuntu 16.04. For more about big data concepts in
general, see An Introduction to Big Data Concepts and Terminology.