Вы находитесь на странице: 1из 3


SearchData Data warehouse 5 Hadoop 5

Search the TechTarget Network

Essential Guide Browse Sections 5

This content is part of the Essential Guide:

Using big data platforms for data management, access and analytics


Hadoop Distributed File System (HDFS) Posted by: Margaret Rouse


u The Hadoop Distributed File System (HDFS) is the primary data storage system used
Contributor(s): Jack Vaughan and Emma Preslar
by Hadoop applications. It employs a NameNode and DataNode architecture to implement a
c distributed file system that provides high-performance access to data across highly scalable

o Hadoop clusters.

i HDFS is a key part of the many Hadoop

ecosystem technologies, as it provides a reliable
n means for managing pools of big data and
supporting related big data analytics applications.

How HDFS works

HDFS supports the rapid transfer of data between
compute nodes. At its outset, it was closely
coupled with MapReduce, a programmatic
framework for data processing.

When HDFS takes in data, it breaks the

information down into separate blocks and
distributes them to different nodes in a cluster,
thus enabling highly efficient parallel processing.

Moreover, the Hadoop Distributed File System is

specially designed to be highly fault-tolerant. The
file system replicates, or copies, each piece of
data multiple times and distributes the copies to
individual nodes, placing at least one copy on a
different server rack than the others. As a result, the data on nodes that crash can be found
elsewhere within a cluster. This ensures that processing can continue while data is

HDFS uses master/slave architecture. In its initial incarnation, each Hadoop cluster consisted
of a single NameNode that managed file system operations and supporting DataNodes that
managed data storage on individual compute nodes. The HDFS elements combine to
support applications with large data sets.

This master node "data chunking" architecture takes as its design guides elements from
Google File System (GFS), a proprietary file system outlined in in Google technical papers,
as well as IBM's General Parallel File System (GPFS), a format that boosts I/O by striping
blocks of data over multiple disks, writing blocks in parallel. While HDFS is not Portable
Operating System Interface model-compliant, it echoes POSIX design style in some aspects.


k HDFS architecture centers on commanding NameNodes that hold metadata and DataNodes that store information in
blocks. Working at the heart of Hadoop, HDFS can replicate data at great scale.

Why use HDFS?

The Hadoop Distributed File System arose at Yahoo as a part of that company's ad serving
and search engine requirements. Like other web-oriented companies, Yahoo found itself
juggling a variety of applications that were accessed by a growing numbers of users, who
were creating more and more data. Facebook, eBay, LinkedIn and Twitter are among the web
companies that used HDFS to underpin big data analytics to address these same

But the file system found use beyond that. HDFS was used by The New York Times as part
of large-scale image conversions, Media6Degrees for log processing and machine learning,
LiveBet for log storage and odds analysis, Joost for session analysis and Fox Audience
Network for log analysis and data mining. HDFS is also at the core of many open source data
warehouse alternatives, sometimes called data lakes.

Because HDFS is typically deployed as part of very large-scale implementations, support for
low-cost commodity hardware is a particularly useful feature. Such systems, running web
search and related applications, for example, can range into the hundreds of petabytes and
thousands of nodes. They must be especially resilient, as server failures are common at such

HDFS and Hadoop history

In 2006, Hadoop's originators ceded their work on HDFS and MapReduce to the Apache
Software Foundation project. The software was widely adopted in big data analytics projects
in a range of industries. In 2012, HDFS and Hadoop became available in Version 1.0.



The basic HDFS standard has been continuously updated since its inception.

With Version 2.0 of Hadoop in 2013, a general-purpose YARN resource manager was added,
and MapReduce and HDFS were effectively decoupled. Thereafter, diverse data processing
frameworks and file systems were supported by Hadoop. While MapReduce was often
replaced by Apache Spark, HDFS continued to be a prevalent file format for Hadoop.

After four alpha releases and one beta, Apache Hadoop 3.0.0 became generally available in
December 2017, with HDFS enhancements supporting additional NameNodes, erasure
coding facilities and greater data compression. At the same time, advances in HDFS tooling,
such as LinkedIn's open source Dr. Elephant and Dynamometer performance testing tools,
have expanded to enable development of ever larger HDFS implementations.

This was last updated in February 2018

m Continue Reading About Hadoop Distributed File

System (HDFS)

∙∙ Careful planning needed to build a Hadoop cluster architecture

Examining NameNode and HDFS in Hadoop architecture

∙∙ Hadoop co-creator talks where Hadoop has been and where it's going

HDFS architecture guide

∙ Hadoop database for big data

Related Terms

Apache Hive
Apache Hive is an open source data warehouse system for querying and analyzing large data sets that are
principally stored in ... See complete definitionq

big data analytics

Big data analytics is the often complex process of examining large and varied data sets -- or big data -- to
uncover information ... See complete definitionq

Hadoop is an open source distributed processing framework that manages data processing and storage for
big data applications ... See complete definitionq

m Join the conversation z 2 comments

Send me notifications when other members comment.

Add My Comment

Oldest 5

[-] Margaret Rouse - 23 Jul 2013 10:18 AM

How do you plan to implement open source distributed file systems in your organization?

[-] RajaBaba - 29 Aug 2017 10:48 PM
Huge databases are used to identify in instance of reference is the goal of Hadoop developer.



Active Directory GUI Tool

File Extensions and File Formats

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z #

Powered by:

Latest TechTarget

A2organizations 2 2
Data-rich Cloud AutoML, Google's
BigQuery ML BigQuery ML
turn focus to upgrades needs big
ethical data expand Google changes to
mining stake in ML compete

As data analytics have increasingly Google introduced a raft of updates to On the eve of Google Cloud Next '19,
SAP become a core component of its cloud-based machine learning and experts assess the impact of Google's
organizations' strategies, concerns AI products, including expanded BigQuery ML since its release and their
SQL SERVER have arisen around how data is... capabilities for its ... hopes for the ...

About Us Contributors Features All Rights Reserved, Copyright

2005 - 2019, TechTarget

Meet The Editors Reprints Guides

Contact Us Archive Opinions

Privacy Policy Site Map Photo Stories


Advertisers Answers Quizzes

Business Partners Definitions Tips

Media Kit E-Products Tutorials

Corporate Site Events Videos