
TDWI RESEARCH

TDWI CHECKLIST REPORT

HADOOP
Revealing Its True Value for
Business Intelligence

By Philip Russom

Sponsored by

tdwi.org
DECEMBER 2011

TDWI CHECKLIST REPORT

HADOOP
Revealing Its True Value for
Business Intelligence

By Philip Russom

TABLE OF CONTENTS

FOREWORD
NUMBER ONE: Hadoop is an ecosystem, not a single product.
NUMBER TWO: HDFS is a file system, not a DBMS.
NUMBER THREE: MapReduce provides control for analytics, not analytics per se.
NUMBER FOUR: Hive resembles SQL, but is not standard SQL.
NUMBER FIVE: Hadoop is about data diversity, not just data volume.
NUMBER SIX: Hadoop is a complement to BI and DW, rarely a replacement.
NUMBER SEVEN: Hadoop enables many types of analytics, not just Web analytics.
ABOUT OUR SPONSORS
ABOUT THE TDWI CHECKLIST REPORT SERIES
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH

1201 Monster Road SW, Suite 250


Renton, WA 98057

T 425.277.9126
F 425.687.2842
E info@tdwi.org

tdwi.org
© 2011 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.
TDWI CHECKLIST REPORT: HADOOP: REVEALING ITS TRUE VALUE FOR BUSINESS INTELLIGENCE

FOREWORD

Despite all the hubbub and hype around Hadoop, few business intelligence (BI) and data warehousing (DW) professionals know much about what Hadoop is, how it does what it does, or in which situations they should deploy it. Because of the newness and complexity of Hadoop, there are several points of confusion that are holding back BI/DW professionals and other people:

Hadoop is multiple products. As we'll see in the next section, Hadoop is a family of open source products and technologies overseen by the Apache Software Foundation.

Hadoop is an ecosystem. In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products that integrate with or expand Hadoop technologies.

Apache Hadoop is open source. Its open source software library is available through Apache. For users who want a more enterprise-ready package, a few vendors now offer Hadoop distributions that also include administrative tools and technical support.

Hadoop manages big data. The Hadoop file system excels with big data that is file based, including files that contain nonstructured data.

Hadoop enables advanced analytics. Hadoop is excellent for storing and searching multi-structured big data, but advanced analytics is possible only with certain combinations of Hadoop products, third-party products, or extensions of Hadoop technologies.

Hadoop differs from traditional BI and DW. In particular, the Hadoop family has its own query and database technologies. These are similar to standard SQL and relational databases, such that BI/DW professionals can learn them quickly.

Hadoop and related technologies have been with us for over five years now, but BI/DW professionals have only recently started exploring them, motivated by the rise of big data analytics. The business advantages of big data analytics are the leading reasons why BI/DW professionals need to know more about Hadoop now. Despite the short-term confusion, TDWI anticipates that Hadoop techniques will soon become a common complement to older BI/DW approaches.

To help BI/DW professionals and other people prepare for the eventual widespread use of Hadoop and its extended ecosystem, this Checklist Report drills into common points of confusion. It clarifies these points and reveals the true value of Hadoop for BI, DW, big data, and analytics.

NUMBER ONE
HADOOP IS AN ECOSYSTEM, NOT A SINGLE PRODUCT.

Apache Hadoop is an open source software project administered by the Apache Software Foundation.1 Hadoop is the brand name Apache and its open source community have given to a family of related open source products and technologies. The Hadoop family includes numerous products, which users implement in certain combinations for specific applications. The Apache Hadoop product family includes the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, ZooKeeper, Flume, Sqoop, Oozie, Hue, and so on. Most of these are approved projects that are ready for download via apache.org today. Others (including Flume, Sqoop, and Oozie) are still in the incubation phase.

HDFS and MapReduce together constitute core Hadoop, which is the foundation for all Hadoop-based applications. For applications in BI, DW, and big data analytics, core Hadoop is usually augmented with Hive and HBase, and sometimes Pig.

When people in the know say "Hadoop," they usually mean core Hadoop. When you hear or read about Hadoop, be careful to grasp the correct definition of Hadoop for that context.

As Hadoop's popularity has increased in recent years, an ecosystem of products, technologies, and services has sprung up around the Hadoop product family. The extended Hadoop ecosystem includes a growing number of third-party vendor products. For example, a few vendors offer their own distribution of Hadoop. These usually focus on HDFS, perhaps with other Apache Hadoop products. Such distributions include vendor-specific improvements and extensions of Hadoop products, without introducing proprietary forks or underpinnings. Other vendors offer tools for deploying and administering Hadoop environments. Yet more vendors offer general-purpose platforms for BI, analytics, and data integration, but with interfaces, development tools, and other functions that support Hadoop products.

The point is that Hadoop is not a single entity. It is a rich, complex, and evolving ecosystem of multiple open source products from Apache. In addition, the ecosystem expands almost daily as more open source and vendor products support or extend Hadoop products and technical approaches.

1. Apache, Apache Hadoop, and Hadoop are trademarks of the Apache Software Foundation.

TDWI RESEARCH | tdwi.org



NUMBER TWO
HDFS IS A FILE SYSTEM, NOT A DBMS.

One of the most common misconceptions about Hadoop is that HDFS is a database management system (DBMS). In fact, TDWI researchers have seen public presentations and read publications by otherwise knowledgeable people who describe Hadoop as a database. As its name explicitly states, however, the Hadoop Distributed File System is a file system, not a DBMS. HDFS can query and index the data it manages, which makes it similar to a DBMS, but that doesn't make it a true DBMS.

Here's a description of how HDFS works, as described in the HDFS Architecture Guide:

"The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on [clusters of] commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant [because it automatically replicates file blocks across multiple machine nodes] and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets."2

As a file system, HDFS manages files that contain data. Because it is file based, HDFS itself does not offer random access to data and has limited metadata capabilities when compared to a DBMS. Likewise, HDFS is strongly batch oriented, so it has limited real-time data access functions.3

To overcome these challenges, you can layer HBase over HDFS to gain some DBMS capabilities. HBase is one of the many products from the Apache Hadoop product family. HBase is modeled after Google's Bigtable; hence HBase (like Bigtable) excels with random, real-time access to very large tables containing billions of rows and millions of columns. HBase is new and will no doubt improve, but today it's limited to straightforward tables and records with little support for more complex data structures. In addition, note that the Hive metastore gives Hadoop some DBMS-like metadata capabilities.

HDFS has advantages over a DBMS in some circumstances. For example, most DBMSs are designed for structured data (or simply relational data) and have limitations when managing unstructured data. As a file system, HDFS can handle any file or document type containing data that ranges from structured data (relational or hierarchical) to unstructured data (such as human language text). When HDFS and MapReduce are combined, Hadoop easily parses and indexes the full range of data types. Furthermore, as a distributed system, HDFS scales well and has a certain amount of fault tolerance based on data replication, even when deployed atop commodity hardware. For these reasons, HDFS and MapReduce (whether from Apache or elsewhere) can complement existing BI/DW systems that focus on structured, relational data.

2. From the HDFS Architecture Guide on hadoop.apache.org.
3. For more information, see the article "What Hadoop is Not" on wiki.apache.org.
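The block replication that gives HDFS its fault tolerance can be illustrated with a toy model. This is a sketch in pure Python, not the real HDFS API (HDFS is accessed through its own NameNode/DataNode architecture and Java or C interfaces); the class and method names are invented for the example. A file is split into blocks, each block is copied to several simulated nodes, and a read still succeeds after a node is lost.

```python
import random

BLOCK_SIZE = 8   # bytes per block (tiny here; the HDFS default is measured in MB)
REPLICATION = 3  # copies of each block, mirroring HDFS's default replication factor

class ToyHDFS:
    """A toy model of HDFS-style block storage and replication (illustrative only)."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]  # node -> {block key: bytes}
        self.block_map = {}  # filename -> list of (block key, [node indexes])

    def put(self, filename, data):
        """Split a file into blocks and place each block on several distinct nodes."""
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        placements = []
        for block_id, block in enumerate(blocks):
            homes = random.sample(range(len(self.nodes)), REPLICATION)
            for n in homes:
                self.nodes[n][(filename, block_id)] = block
            placements.append(((filename, block_id), homes))
        self.block_map[filename] = placements

    def get(self, filename):
        """Reassemble a file by reading each block from any surviving replica."""
        out = []
        for key, homes in self.block_map[filename]:
            for n in homes:
                if key in self.nodes[n]:
                    out.append(self.nodes[n][key])
                    break
            else:
                raise IOError("all replicas lost for %s" % (key,))
        return b"".join(out)

    def kill_node(self, n):
        self.nodes[n] = {}  # simulate a failed data node

fs = ToyHDFS(num_nodes=5)
fs.put("weblog", b"GET /a\nGET /b\nGET /c\n")
fs.kill_node(0)  # lose one node; remaining replicas still cover every block
assert fs.get("weblog") == b"GET /a\nGET /b\nGET /c\n"
```

The point of the sketch is the design choice the report describes: durability comes from replicating whole file blocks across cheap machines, not from transactional machinery, which is why HDFS tolerates node loss yet still lacks a DBMS's random access and metadata services.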



NUMBER THREE
MAPREDUCE PROVIDES CONTROL FOR ANALYTICS, NOT ANALYTICS PER SE.

Hadoop MapReduce and variations of it are sometimes called analytic tools, but that's not quite right. MapReduce is more of a general-purpose execution engine that works with a variety of storage technologies, including HDFS, other file systems, and some DBMSs. Developers at Google created MapReduce before HDFS existed, which corroborates that not all variants of MapReduce require HDFS. As an execution engine, MapReduce and its underlying data platform handle the complexities of network communication, parallel programming, and fault tolerance. In addition, MapReduce controls hand-coded programs and automatically provides multithreading processes so they can execute in parallel for massive scalability. The controlled parallelization of MapReduce can apply to multiple types of distributed applications, not just analytic ones.

To put it more succinctly: Hadoop MapReduce is "a software [programming] framework for easily writing [massively parallel] applications which process vast amounts of data (multi-terabyte data sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner."4

Note that Hadoop MapReduce assumes a fair amount of hand coding. A developer can choose a programming language from a long list, including Hive, Java, C++, C#, Python, R, and so on. Theoretically, any code that can be compiled or is supported directly by MapReduce should work. In a MapReduce application, the hand-coded routines provide the actual computations. Hence, in an analytic application based on MapReduce, the routines are the analytics.

MapReduce (whether from Apache or from a software vendor) gets its name from the procedure it follows to provide massively parallel processing: "A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks [which, in turn, assemble one or more result sets]."5 See Figure 1.

MAPREDUCE HAS A LOT TO OFFER FOR ADVANCED ANALYTICS.

MapReduce brings advanced analytic processing to the data. This is the reverse of older practices where we bring large quantities of transformed data to an analytic tool, especially those based on data mining or statistical analysis. As big data gets bigger, it's just not practical (from both a time and cost perspective) to move and process that much data.

MapReduce was built for parallel processing. This is key to scaling up big data analytics.

MapReduce significantly broadens the scope of advanced analytics. MapReduce is typically coupled with a data platform that manages diverse types of data, files, documents, content, and schema.

MapReduce is schema neutral. Big data analytics usually involves the discovery of new business facts, and forcing data into an a priori data model inhibits open-ended discovery. Furthermore, MapReduce works with diverse data that has little or no metadata or structure.

(Continues)

4. From the MapReduce Tutorial on hadoop.apache.org.
5. Ibid.

[Figure 1 (diagram): a user app submits a job to the master node, which assigns map tasks to worker nodes. Input files from big data sources are split (INPUT) and processed by the map tasks (MAP); map results, written as key-value pairs to temp files, are collected and sorted (COLLECT & SORT); reduce tasks then aggregate them (REDUCE) into one or more result sets (OUTPUT).]

Figure 1. Hadoop MapReduce in action.
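The split, map, sort, and reduce procedure shown in Figure 1 can be sketched in a few lines of pure Python. This illustrates the flow of a word-count job, not the Hadoop API itself; on a real cluster the map calls would run in parallel across worker nodes and the sort would be the distributed shuffle phase.

```python
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    """MAP: emit one (key, value) pair per word, as in the classic word count."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_task(key, values):
    """REDUCE: aggregate every value observed for a single key."""
    return (key, sum(values))

def map_reduce(chunks):
    # MAP: each input split is processed independently (in parallel on Hadoop).
    intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
    # COLLECT & SORT: group intermediate pairs by key (Hadoop's shuffle phase).
    intermediate.sort(key=itemgetter(0))
    # REDUCE: one call per distinct key; the outputs form the result set.
    return [reduce_task(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))]

chunks = ["big data big", "data big analytics"]  # two input splits
print(map_reduce(chunks))
# [('analytics', 1), ('big', 3), ('data', 2)]
```

Note where the analytics live: the framework only splits, sorts, and schedules; the hand-coded `map_task` and `reduce_task` routines supply the actual computation, which is the report's point that MapReduce provides control for analytics rather than analytics per se.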



(Continued from NUMBER THREE)

MAPREDUCE DIFFERS FROM WHAT BI PROFESSIONALS ARE USED TO.

Open source Hadoop MapReduce requires a lot of hand coding. Hand coding routines is okay for many application developers, especially those who already know the languages supported by MapReduce, but survey data from TDWI shows that most BI/DW shops are aggressively moving to vendor-tool-based solutions instead of hand-coded ones.

Hadoop MapReduce has its own query language. Ironically, MapReduce can control routines coded in many languages, but not SQL. Luckily, Hadoop Hive gets results that are similar to those of SQL, and Hive has a syntax that is similar to that of SQL. Hence, BI/DW professionals who know SQL can learn Hive easily. Furthermore, as more vendors release ODBC and JDBC drivers, these enable BI professionals to develop in SQL while the driver handles translations to Hive and back.

Hadoop MapReduce was designed for HDFS. That's fine if you anticipate implementing HDFS and your big data is mostly file based; however, MapReduce coupled with a scalable RDBMS would appeal to more mainstream BI/DW shops. Open source Hadoop MapReduce aside, commercial variants of MapReduce that work with a relational DBMS and that support standard SQL are now available from a few software vendors.

NUMBER FOUR
HIVE RESEMBLES SQL, BUT IS NOT STANDARD SQL.

Within the Apache Hadoop product family, Apache Hive implements a SQL-like language called QL. Users familiar with SQL seem to pick up Hive QL easily. Hive is optimized for querying large data sets residing in distributed storage. It provides direct access to files stored in Apache HDFS or Apache HBase, and possibly in other data storage systems.

HIVE HAS A NUMBER OF COMPELLING ABILITIES.

Hadoop MapReduce executes Hive queries such that they run in parallel. In the MapReduce environment, developers can mix hand-coded Hive routines with other routines, plus custom mappers and reducers. Likewise, Hive queries benefit from the scalability and fault tolerance of HDFS. Hive queries are schema neutral and work with many data types, just as Apache Hadoop and MapReduce do.

HIVE'S CHALLENGES ARE BEING OVERCOME.

For BI/DW purposes, it would be desirable for a wide variety of tools and applications to plug into Hive as a way of expressing a SQL-defined query in a Hadoop environment. The current barrier to this common style of connectivity is that Hive QL is not SQL compliant. This situation will soon improve dramatically, because almost all of the leading BI and data integration vendors have built or are building interfaces, connectors, and front ends to Hadoop technologies. These new capabilities can do the required SQL-to-Hive translations, and some enable the developer to work in standard SQL, thereby shielding the developer from Hive altogether.

Despite the improvements, caveats apply. ODBC/JDBC-based SQL access to HDFS (via Hive or some other Hadoop layer) is valuable within its limited scope, but it won't transform HDFS into a true DBMS. Furthermore, given the batch orientation that HDFS imposes on Hive and MapReduce, iterative ad hoc queries process rather slowly compared to those running on modern relational systems, which in turn slows the work of business analysts and similar users.
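As a concrete illustration of "resembles SQL, but is not standard SQL," the following Hive QL statements are shown as Python strings (the table and column names are invented for the example). The SELECT would look immediately familiar to a SQL developer, while the table definition uses Hive-specific clauses with no ANSI SQL equivalent:

```python
# Hypothetical table and column names; the syntax follows Apache Hive's
# documented DDL and query language, which resembles but is not ANSI SQL.
create_stmt = """
CREATE TABLE page_views (ts STRING, user_id STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED AS TEXTFILE
"""

# A familiar-looking aggregate query; Hadoop MapReduce executes it in parallel.
query = """
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
"""

# The ROW FORMAT and STORED AS clauses describe how to read raw files at
# query time, rather than how a DBMS stores rows. Clauses like these are
# one reason BI tools need SQL-to-Hive translation layers.
assert "ROW FORMAT DELIMITED" in create_stmt
assert "GROUP BY" in query
```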



NUMBER FIVE
HADOOP IS ABOUT DATA DIVERSITY, NOT JUST DATA VOLUME.

We think of Hadoop and other big data platforms as being largely about data volume because we're impressed by the massive data sets they handle. But these environments, and big data itself, are just as much about the diversity of data as they are about data volume. The two concepts are related in that big data often gets big precisely because of its diversity.

Data diversity is one of the most formidable challenges in BI/DW today. That's because most BI technologies and user best practices were designed for operating on relational data and other forms of structured data. Many user organizations still struggle to wring BI value from the wide range of unstructured data types, including text, clickstreams, log files, social media, documents, location data, sensor data, and so on.

Hadoop technologies are renowned for making sense of diverse big data. For example, developers can push files containing a wide range of unstructured data into HDFS without needing to define data types or structures at load time. Instead, data is structured at query or analysis time. This is a good match for analytic methods that are open-ended for discovery purposes, since imposing structure can alter or hide detailed data that discovery depends on. For BI/DW tools and platforms that demand structured data, Hadoop Hive and MapReduce can output records and tables as needed. This way, HDFS can be an effective source of unstructured data, yet with structured output for BI/DW purposes.

Diverse data invariably includes unstructured data in the form of human language text. Getting full value for BI, DW, and analytics from such text requires natural language processing (NLP). Today, Hadoop developers produce programs that can perform NLP, and MapReduce executes them. Eventually, tools for text mining, text analytics, and similar NLP functions should connect to MapReduce to make their computational abilities available in Hadoop environments.

NUMBER SIX
HADOOP IS A COMPLEMENT TO BI AND DW, RARELY A REPLACEMENT.

The broad range of unstructured data just discussed houses a richness of information that could be used for BI, DW, and analytic purposes, if it were readily accessible. The catch is that the DBMSs that most data warehouses are based on were not designed for unstructured data, and few can truly leverage it. Plus, many users consciously choose to design and optimize their data warehouses for the most common BI deliverables (reports, dashboards, performance management, and OLAP) but not for advanced forms of analytics. Instead of retrofitting a data warehouse to support unstructured data and advanced analytics, many organizations are looking into HDFS and other Hadoop products as complementary technologies for these purposes.

Note that there's nothing new about complementing an existing BI/DW technology stack with an additional platform. For several years now, distributed data warehouse architectures have included a number of "systems on the side" (SOSs) for data staging, managing detailed source data, real-time data feeds, federated data marts, advanced analytics, and so on. In fact, the long-standing tradition of the SOS has ramped up in recent years as user organizations have acquired newer types of SOSs built specifically for analytics, including data warehouse appliances, columnar databases, and no-SQL databases.

In line with the SOS tradition and its recent focus on analytics, Hadoop products (whether from Apache or elsewhere) show great promise as platforms for advanced analytics, thus complementing the average report-oriented data warehouse with new analytic capabilities, especially for analytics with unstructured data. Outside of BI and DW, Hadoop products also show promise for online archiving, content management, and staging multi-structured data for a variety of applications. This puts pressure on vendors to offer good integration with Hadoop and to provide tools that reduce the manual coding some Hadoop technologies require today.

Let's consider an alternate viewpoint, where Hadoop might be a replacement instead of a complement. TDWI recently spoke with a large multinational retailer that is in the process of deploying a Hadoop environment. The retailer plans to first augment BI/DW with analytics, and then later test whether Hadoop could become the platform for all BI data. If the plan pans out, this large retailer might eventually replace its EDW with Hadoop. So, who knows? For a few organizations, Hadoop or its equivalent might become a new type of data warehouse platform.
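The "structured at query or analysis time" pattern described under NUMBER FIVE (often called schema-on-read) can be sketched in pure Python. Raw lines land untyped, and a structure is imposed only when a particular analysis asks for it; the log format and field names here are invented for the example.

```python
# Raw, untyped records land first; no schema is declared at load time,
# just as files are pushed into HDFS without defining data types.
raw_lines = [
    "2011-12-01 alice /products/42 200",
    "2011-12-01 bob   /checkout    500",
    "2011-12-02 alice /products/7  200",
]

def parse_clickstream(line):
    """Impose structure at query time: this 'schema' exists only in the
    analysis code, and a different analysis could parse the same lines
    differently without reloading anything."""
    date, user, url, status = line.split()
    return {"date": date, "user": user, "url": url, "status": int(status)}

# One possible analysis: count server errors per day.
errors_per_day = {}
for rec in map(parse_clickstream, raw_lines):
    if rec["status"] >= 500:
        errors_per_day[rec["date"]] = errors_per_day.get(rec["date"], 0) + 1

print(errors_per_day)
# {'2011-12-01': 1}
```

Because the structure lives in the analysis rather than the storage layer, detailed data is never altered or hidden by an up-front model, which is exactly the open-ended discovery property the section describes.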



NUMBER SEVEN
HADOOP ENABLES MANY TYPES OF ANALYTICS, NOT JUST WEB ANALYTICS.

Representatives from Amazon, Comscore, eBay, Google, and LinkedIn have recently spoken at TDWI events, explaining how their firms depend on Hadoop technologies for analytics and other operations with Web data. These are the kinds of companies we usually hear about when the IT press discusses uses of Hadoop. Based on these users' success stories, Hadoop's value for big data analytics is clear, at least with Web data at large, Internet-based companies.

This raises an important question: can Hadoop technologies enable analytics outside of Web data and be applicable to mainstream user organizations that are not Internet companies? The answer is yes. Here are several scenarios for big data where Hadoop can contribute to mainstream analytics:

Exploratory analytics. In the trend toward advanced analytics, users are looking for platforms that enable analytics as an open-ended discovery or exploratory mission. Discovering new facts and relationships typically results from tapping big data that was previously inaccessible to BI. Discovery also comes from mixing data of various types from various sources. HDFS and MapReduce enable the exploration of this eclectic mix of big data.

Big data that isn't Web data. Besides the Web, big data can come from traditional applications, especially if you want to keep decades of data live for analysis. Much of the data explosion comes from sensory devices such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities.

Unstructured data. Text is not just from the Web. It also comes in great volume from the claims process in insurance, medical records in healthcare, and call center applications in any industry.

Semi-structured data. Many organizations are modernizing business-to-business data exchange, which makes XML-based data ever more important. Most traditional BI/DW platforms struggle with the semi-structured hierarchies of XML, whereas Hadoop handles these quite well.

Larger statistical samples. Hadoop isn't just for new analytic applications. It can revamp old ones, too. For example, analytics for risk and fraud that are based on statistical analysis or data mining benefit from the much larger data samples that HDFS and MapReduce can wring from diverse big data.

More granular analytics. Most 360-degree customer views include hundreds of customer attributes. Hadoop can provide insight and data to bump that up to thousands of attributes, which in turn provides greater detail and precision for customer-base segmentation and other customer analytics.
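The semi-structured data scenario above can be made concrete with a short sketch using only the Python standard library. The B2B order document and its element names are invented for the example; the point is that a map-style routine can flatten each nested XML document independently into the row-like records a BI/DW platform expects.

```python
import xml.etree.ElementTree as ET

# A hypothetical business-to-business order document. A platform expecting
# flat rows struggles with the nesting; a per-document routine does not.
doc = """
<order id="1001">
  <customer>acme</customer>
  <lines>
    <line sku="A-7" qty="2"/>
    <line sku="B-3" qty="1"/>
  </lines>
</order>
"""

def flatten_order(xml_text):
    """Flatten one nested XML order into row-like dicts. The flat 'schema'
    is imposed here, at analysis time; other flattenings are possible."""
    root = ET.fromstring(xml_text)
    customer = root.findtext("customer")
    return [{"order_id": root.get("id"),
             "customer": customer,
             "sku": line.get("sku"),
             "qty": int(line.get("qty"))}
            for line in root.iter("line")]

rows = flatten_order(doc)
print(rows)
# [{'order_id': '1001', 'customer': 'acme', 'sku': 'A-7', 'qty': 2},
#  {'order_id': '1001', 'customer': 'acme', 'sku': 'B-3', 'qty': 1}]
```

Run as a map task over millions of such documents, this kind of routine is how a Hadoop job turns semi-structured hierarchies into structured output for downstream BI/DW tools.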



ABOUT OUR SPONSORS

Cloudera, the leader in Apache Hadoop-based software and services, enables data-driven enterprises to easily derive business value from all their structured and unstructured data. Cloudera's Distribution Including Apache Hadoop (CDH), which is available to download for free at www.cloudera.com/downloads, is the most comprehensive, tested, stable, and widely deployed distribution of Hadoop in commercial and noncommercial environments. For the fastest path to reliably using this completely open source technology in production for big data analytics and answering previously unaddressable big questions, organizations can subscribe to Cloudera Enterprise, comprising Cloudera Support and a portfolio of software including Cloudera Management Suite. Cloudera also offers consulting services, training, and certification on Apache technologies. As the top contributor to the Apache open source community and with tens of thousands of nodes under management across customers in financial services, government, telecommunications, media, Web, advertising, retail, energy, bioinformatics, pharma/healthcare, university research, oil and gas, and gaming, Cloudera's depth of experience and commitment to sharing expertise are unrivaled. Please visit www.cloudera.com.

Tableau Software helps people see and understand data. Ranked by Gartner in 2011 as the world's fastest-growing business intelligence company, Tableau helps anyone quickly and easily analyze, visualize, and share information. More than 6,500 customers across most industries get rapid results with Tableau in the office and on the go. Tens of thousands of people use Tableau to share data in their blogs and Web sites. See how Tableau can help you by downloading the free trial at www.tableausoftware.com/trial.

The Teradata Aster MapReduce Platform is the market-leading big data analytics solution. This analytic platform embeds MapReduce analytic processing for deeper insights on new data sources and multi-structured data types to deliver analytic capabilities with breakthrough performance and scalability. Teradata Aster's solution utilizes the patented SQL-MapReduce to parallelize the processing of data and applications and deliver rich analytic insights at scale. For more information, visit www.asterdata.com or, for more about Teradata, visit teradata.com.

EMC Greenplum is driving the future of data warehousing and analytics with breakthrough products including the Greenplum Data Computing Appliance, Greenplum Database, Greenplum HD enterprise-ready Apache Hadoop, and Greenplum Chorus, the industry's first Enterprise Data Cloud platform. EMC Corporation (NYSE: EMC) is the world's leading developer and provider of information infrastructure technology and solutions. Visit www.greenplum.com.



ABOUT THE TDWI CHECKLIST REPORT SERIES

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.

ABOUT THE AUTHOR

Philip Russom is the research director for data management at The Data Warehousing Institute (TDWI), where he oversees many of TDWI's research-oriented publications, services, and events. Over the years, Russom has produced over 500 industry reports, magazine articles, opinion columns, speeches, and Webinars on topics in data management, data warehousing, and business intelligence. Russom was an industry analyst at Forrester Research and Giga Information Group. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at prussom@tdwi.org.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for business intelligence and data warehousing professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence, data warehousing, and data management solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences, as well as consulting services for strategic planning to user and vendor organizations.

