
Welcome to Hadoop Technology and Architecture – Overview (Part 1)

Copyright © 1996, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 EMC Corporation.
All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information
is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR
WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
EMC2, EMC, Data Domain, RSA, EMC Centera, EMC ControlCenter, EMC LifeLine, EMC OnCourse, EMC Proven, EMC Snap,
EMC SourceOne, EMC Storage Administrator, Acartus, Access Logix, AdvantEdge, AlphaStor, ApplicationXtender,
ArchiveXtender, Atmos, Authentica, Authentic Problems, Automated Resource Manager, AutoStart, AutoSwap,
AVALONidm, Avamar, Captiva, Catalog Solution, C-Clip, Celerra, Celerra Replicator, Centera, CenterStage, CentraStar,
ClaimPack, ClaimsEditor, CLARiiON, ClientPak, Codebook Correlation Technology, Common Information Model,
Configuration Intelligence, Configuresoft, Connectrix, CopyCross, CopyPoint, Dantz, DatabaseXtender, Direct Matrix
Architecture, DiskXtender, DiskXtender 2000, Document Sciences, Documentum, elnput, E-Lab, EmailXaminer,
EmailXtender, Enginuity, eRoom, Event Explorer, FarPoint, FirstPass, FLARE, FormWare, Geosynchrony, Global File
Virtualization, Graphic Visualization, Greenplum, HighRoad, HomeBase, InfoMover, Infoscape, Infra, InputAccel, InputAccel
Express, Invista, Ionix, ISIS, Max Retriever, MediaStor, MirrorView, Navisphere, NetWorker, nLayers, OnAlert, OpenScale,
PixTools, Powerlink, PowerPath, PowerSnap, QuickScan, Rainfinity, RepliCare, RepliStor, ResourcePak, Retrospect, RSA, the
RSA logo, SafeLine, SAN Advisor, SAN Copy, SAN Manager, Smarts, SnapImage, SnapSure, SnapView, SRDF, StorageScope,
SupportMate, SymmAPI, SymmEnabler, Symmetrix, Symmetrix DMX, Symmetrix VMAX, TimeFinder, UltraFlex, UltraPoint,
UltraScale, Unisphere, VMAX, Vblock, Viewlets, Virtual Matrix, Virtual Matrix Architecture, Virtual Provisioning, VisualSAN,
VisualSRM, Voyence, VPLEX, VSAM-Assist, WebXtender, xPression, xPresso, YottaYotta, the EMC logo, and where
information lives, are registered trademarks or trademarks of EMC Corporation in the United States and other countries.

All other trademarks used herein are the property of their respective owners.

© Copyright 2013 EMC Corporation. All rights reserved. Published in the USA.



This course covers Hadoop from both a technical and presales positioning perspective
with an explanation of Hadoop technology, terminology and architecture. Hadoop
architecture models, key components and deployment scenarios will also be covered.
Lastly, learners will become versed in the various distributions of Hadoop and the solutions available to solve modern data challenges.

This course is intended for those who would like a technical understanding of Hadoop
and its architecture from a presales perspective.



Hadoop is a commodity-based compute and storage solution. Because it is built on commodity hardware, it distributes the storage and processing of information over the network across many computers joined together as a cluster.
A consideration in positioning Hadoop is to investigate some of the challenges that it can
solve. Some of these challenges can have a big impact on the efficiency of data processing
and information handling for an environment.
Hadoop can solve the following challenges:
• Poor utilization of storage and CPU
• Inefficient data staging and loading processes (ETLT: Extract, Transform, Load, Transform)
• Missing backup and disaster recovery, which Hadoop can address automatically
• Technology and knowledge gaps that are preventing Apache Hadoop from becoming an enterprise standard
Despite the challenges, many companies are constructing a Hadoop strategy. In some
environments, Hadoop is still a research project with unknown results. In other
environments, research and deployment can go on for years.
EMC solves customer challenges by providing comprehensive support through EMC
Consulting, Big Data Advisory Services, Greenplum Database + Greenplum HD with Isilon
HDFS tech support, Greenplum Labs Data Science Teams, and Greenplum UAP (Unified
Analytics Platform). Also, with the recent acquisition of Pivotal Labs, we can now build data-
driven applications for speed to market.



Shown here are Hadoop use cases by vertical market. It’s important to understand why Hadoop exists and the market silos in which it operates best. For example, in a finance environment, Hadoop can be utilized to provide fast processing for customer profiling, social media analytics and fraud detection. Hadoop’s ability to take in various forms of data and allow simultaneous search and assimilation of information is critical for such a market. In a Web environment, Hadoop can be leveraged to analyze customer churn, POS transactions and product recommendations. In the realm of Telecom, Hadoop can be used as an analysis engine for user behavior, call detail records and network utilization. Lastly, in the Healthcare field, Hadoop can be used for electronic record analysis from various sources, medical image processing and drug safety analysis.



Hadoop is a scalable, fault-tolerant data storage and batch processing solution in a
distributed-computing environment. Hadoop can scale linearly on inexpensive Intel-based
commodity hardware and makes use of shared compute and storage in a distributed system.
Hadoop can be leveraged for processing large datasets across clusters of computers using a
simple programming model. Hadoop lets users distribute information over many nodes
allowing for parallel processing, storage and retrieval of information.
Hadoop began its existence as an open-source Apache project from Yahoo!. It was created
by Doug Cutting, formerly of Yahoo!, now with Cloudera, and was modeled after Google's
MapReduce and GFS (Google File System). Although much of the initial work was done by
Yahoo!, Hadoop is now a top-level Apache project supported largely by the open-source
development community.
Hadoop is written in Java and runs on the following platforms: Linux, Mac OS X, Windows
and Solaris.
The core components of Hadoop include:
• HDFS (Hadoop Distributed File System)
• MapReduce (distributed compute/data processing)
Hadoop is designed to scale up from a single server to thousands of machines, each offering
local storage and computation.



Shown here are some important milestones in Hadoop’s creation and evolution.
• In 2004, Google published its GFS paper; Hadoop continues to work toward that design, but as a publicly available implementation.
• As mentioned earlier, in 2005 Doug Cutting created Nutch, an open-source web search product utilizing Google’s MapReduce.
• In 2008, Hadoop formally became an Apache project.
• In 2009, Hadoop was used at Yahoo! to sort 1 TB in approximately 60 seconds. At that
time, the cluster configuration at Yahoo! was:
 910 nodes, each with four dual-core Xeons @ 2.0 GHz, 8 GB RAM and four SATA disks
 1 Gb Ethernet
 40 nodes per rack
 8 Gb Ethernet uplinks from each rack
 Red Hat Enterprise Linux Server 5.1
 Sun JDK 1.6.0_05-b13



As with any system, Hadoop has its advantages and disadvantages, although Hadoop’s
advantages largely outweigh its disadvantages. It is an ever-improving, constantly-evolving
system.
Hadoop's advantages are:
• Storage for huge datasets, spreading the data out over many nodes
• Allows for variable, changing schemas and datasets (changing schemas can otherwise be disruptive)
• Supports batch queries and organized analytics; batch queries are not exclusive to Hadoop and remain important on legacy systems such as mainframes
• A large ecosystem of support software/projects, which allow for customization
• Accepts native source input for the data analyzed; Hadoop does not necessarily require that source data be extracted and transformed before being used
Some of Hadoop's disadvantages are:
• Ease of use and access; Hadoop is not considered a common end-user tool
• It does not excel at low-latency responses due to its distributed compute/storage environment, and response times can be highly variable. Hadoop does not handle large-scale simulation modeling well; it is better suited to tasks such as searching for strings of information.
In the past, Hadoop could not handle file changes. Apache Hadoop 2.0 added the ability to append to existing files. However, there are many installations of Hadoop that pre-date this feature.



Why would a company use Hadoop for its enterprise solution in dealing with large datasets
and hard-to-solve storage and compute issues?
Hadoop:
• Delivers performance and scalability at a low cost. Traditional enterprise deployments
are quickly resource constrained and difficult to manage and bring online.
• Stores data in its source format, for example, it stores Excel spreadsheets
as .xls files
• Allows for transparent application scalability
• Handles large amounts of data; data growth rates are astronomical, and most of that data is unstructured.
• Is resilient in the face of infrastructure failures. Traditional enterprise environments
are always expensive. Hadoop doesn’t have to be.

Hadoop makes analytics on large-scale datasets more pragmatic. It also opens up new ways
of understanding and running lines of business (LOBs). Data growth figures change rapidly
and continuously. For example, the rise of unstructured data continues to gain momentum.
Five-year enterprise data growth is estimated at 650%, with over 80% of it unstructured; Facebook, for example, collects 100 TB per day (Gartner, 2012).



Hadoop is a different environment than traditional data processing systems. In many senses
it is a paradigm shift that goes against the traditional N+1 enterprise system design we have
embraced for many years. Hadoop is predicated upon the premises of:
• Not if something fails, but when…
• Not if data grows, but when…
• No real recovery, just move on
Other items that make Hadoop different are that it defers some challenges and/or decisions:
• Hadoop is non-transactional.
• Its file system is essentially read-only.
• Today, there is no real support for updates; keeping the Hadoop software up to date is left to the operator.
HDFS reduces the cost of storing and processing data to a point that keeping all data,
indefinitely, is now a very real possibility. At the same time, MapReduce makes developing
and executing massively-parallel data processing tasks trivial compared to historical
alternatives (e.g., High Performance Computing (HPC)).
In addition, Hadoop makes analytics on large-scale datasets more pragmatic. It also opens up
new ways of understanding and thus running lines of business.



Positioning Hadoop can be an interesting challenge. Anyone can download Hadoop for free
directly from Apache. The value added for positioning Hadoop is in the following:
• Assisting with the design of a Hadoop environment. (EMC will give a lot of assistance
to the paying customer to design the right environment for their particular Hadoop
implementation.)
• Proposing solutions for data-processing challenges, such as offering professional services to help with data integration, custom data ingestion (migration) and developing scalability options. (Further declaring that EMC will provide expert assistance through EMC Professional Services to customize the customer’s iteration of Hadoop for their specific data, load that data into a Hadoop file system, and, if needed, help scale out the file system and hardware.)
• Creating custom front-end clients for either data input or massaging of data within an
environment
• Extending an existing solution with custom programming, implementation, resource outsourcing or equipment build-outs
• Further extending the Hadoop ecosystem with custom components. (Extend the basic Hadoop system with custom add-ons that will further refine and detail the available information.)
Hadoop environments need a lot of maintenance as they grow and expand. Additionally,
there are opportunities to position backup, networking, storage, compute, and professional
services towards any Hadoop solution.



To understand Hadoop and its overall solution, it's important to know the core components
(HDFS and MapReduce), and also the critical ecosystem components.
The core components were mentioned earlier. They are:
• HDFS – Hadoop Distributed File System
• MapReduce – Distributed Data Processing Model
Hadoop’s starting set of ecosystem projects includes:
• Pig – Data Flow Language and execution environment
• Hive – Distributed Data Warehouse, provides SQL-based query language
• HBase – Distributed column-based database
• Common – Interfaces for distributed filesystems and general I/O (serialization, Java
RPC, etc.)



As shown, Hadoop has nodes with discrete purposes and functions:
• The namenode is responsible for managing the file system namespace and managing jobs.
• Datanodes are responsible for storing blocks of data and running tasks.
• MapReduce moves computation to where the data resides within the environment.
• Hadoop is constructed within a distributed-computing environment.
• Shared file system space between nodes is provided via HDFS.
• The processing framework is provided by MapReduce.
• Languages to manipulate the data, such as Pig and Hive, are provided.
• The key/value store is provided by HBase.
Hadoop is written in Java. Developers also typically write their MapReduce code in Java.
Higher-level abstractions on top of MapReduce have also been developed.
The system is self-healing in the sense that it automatically routes around failure. If a node
fails, then its workload and data are transparently shifted somewhere else. The system is
intelligent in the sense that the MapReduce scheduler optimizes for the processing to
happen on the same node storing the associated data (or co-located on the same leaf
Ethernet switch). It also speculatively executes redundant tasks if certain nodes are detected
to be slow.
One of the key benefits of Hadoop is the ability to upload any unstructured files without having to “schematize” them first. You can dump any type of data into Hadoop, and the input record readers will abstract it as if it were structured (i.e., schema-on-read versus schema-on-write).
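As a minimal sketch of that schema-on-read idea (the field layout and class names are hypothetical, and the standard org.apache.hadoop.mapreduce API is assumed), a map task can impose structure on raw text lines only at the moment the data is read:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: assumes comma-separated input lines such as "userId,action,timestamp".
    public class ActionCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text action = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");  // schema applied on read, not on write
            if (fields.length >= 2) {                       // skip malformed records instead of failing
                action.set(fields[1]);
                context.write(action, ONE);                 // emit an intermediate key/value pair
            }
        }
    }

Because the raw files are stored unmodified in HDFS, a different job could later reinterpret the same data with an entirely different assumed layout.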



Here is a different view of a typical Hadoop architecture/environment.
The Hadoop Architecture, at its simplest level, consists of three major objects: a primary
namenode, Hadoop workers and clients. HDFS is the storage system for Hadoop applications.
HDFS exposes a file system namespace and controls client access to files stored in the
namespace. A file in this namespace is split into one or more data blocks, which are then
stored on datanodes. Data blocks are replicated and distributed on compute nodes
throughout the cluster to handle the processing of semi-structured and unstructured data.
The Hadoop file system architecture relies on at least one master that manages the file
system namespace and controls read and write access to files within the namespace, while
servicing requests from clients. To handle job requests, it relies on a jobtracker to manage the jobs that clients submit.
The jobtracker schedules each job initiated by a client and interfaces with tasktrackers at the datanode level. In the MapReduce process, the namenode does not perform the reducing, and the datanodes do not coordinate the work among themselves; the jobtracker assigns the reduce tasks, and the datanodes then transfer intermediate data to one another.
The HDFS workers are the datanodes. Each datanode has its own tasktracker for tracking its
own tasks associated with a request, serving read and write requests from the client. The
datanode performs the replication tasks based on instructions received from the namenode,
periodically validating the checksum of data blocks. Data is stored at the datanode level,
whereas the namenode manages the file system metadata. Datanodes serve read and write
requests from and to clients. A secondary namenode performs checkpoints of the
namespace.
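As a rough illustration of how a client interacts with this architecture (the namenode address and file path are hypothetical), the HDFS Java API sends namespace operations to the namenode while the block data itself streams to and from datanodes:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");       // hypothetical namenode address

            FileSystem fs = FileSystem.get(conf);                    // namespace operations go to the namenode

            Path file = new Path("/data/example.txt");               // hypothetical path
            try (FSDataOutputStream out = fs.create(file, true)) {   // write: blocks stream to datanodes
                out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
            }

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());                   // read: blocks stream from datanodes
            }
            fs.close();
        }
    }

This is only a sketch; in a real cluster the client would typically pick up the namenode address from the cluster configuration files rather than setting it in code.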



Hadoop itself refers to the overall system that runs jobs, distributes tasks (pieces of these jobs) and
stores data in a parallel and distributed fashion. The Hadoop solution includes MapReduce and HDFS.
MapReduce is a programming paradigm that expresses a large, distributed computation as a
sequence of distributed operations on datasets of key/value pairs. The Hadoop MapReduce
framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the
nodes in the cluster. A MapReduce computation has two phases, a map phase and a reduce phase.
The input to the computation is a dataset of key/value pairs.
In the map phase, the framework splits the input dataset into a large number of fragments and
assigns each fragment to a map task. The framework also distributes the many map tasks across the
cluster of nodes on which it operates. Each map task consumes key/value pairs from its assigned
fragment and produces a set of intermediate key/value pairs.
In Hadoop, the combination of all of the Java Archive (JAR) files and classes needed to run a
MapReduce program is called a job. All of these components are themselves collected into a JAR,
which is usually referred to as a job file. To execute a job, a client will submit the job to a jobtracker.
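A minimal driver sketch of that packaging-and-submission step follows (the input/output paths are hypothetical, and the mapper is the ActionCountMapper sketched earlier, assumed to be packaged in the same job JAR):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ActionCountDriver {

        // Sums the 1s emitted by the mapper for each distinct key.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));   // one output record per distinct key
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "action-count");

            job.setJarByClass(ActionCountDriver.class);      // identifies the JAR (the "job file") to ship to the cluster
            job.setMapperClass(ActionCountMapper.class);     // map phase (mapper from the earlier sketch)
            job.setReducerClass(SumReducer.class);           // reduce phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/data/in"));     // hypothetical input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path("/data/out"));  // output directory must not already exist

            System.exit(job.waitForCompletion(true) ? 0 : 1);   // submits the job for scheduling and waits
        }
    }

The compiled classes would typically be bundled into a JAR and launched with a command along the lines of hadoop jar actioncount.jar ActionCountDriver.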
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file
as a sequence of blocks. All the blocks in a file, except the last block, are the same size. Blocks
belonging to a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. Files in HDFS are ‘write once’ and have strictly one writer at any time.
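For example (a sketch only; the paths and values are arbitrary), the per-file replication factor and block size can be chosen through the HDFS API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Keep three copies of this particular file (hypothetical path).
            fs.setReplication(new Path("/data/important.log"), (short) 3);

            // Or choose replication and block size when the file is created:
            // create(path, overwrite, bufferSize, replication, blockSize)
            fs.create(new Path("/data/big-blocks.dat"), true, 4096, (short) 2, 256L * 1024 * 1024)
              .close();
        }
    }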
Like MapReduce, HDFS follows a master/worker architecture. An HDFS installation consists of a single
namenode, a master server that manages the file system namespace and regulates access to files by
clients. In addition, there are a number of datanodes, one per node in the cluster, which manage
storage attached to the nodes that they run on. The namenode makes file system namespace
operations like opening, closing, renaming, etc., of files and directories available via an RPC interface.
It also determines the mapping of blocks to datanodes. The datanodes are responsible for serving
read and write requests from file system clients. They also perform block creation, deletion and
replication upon instruction from the namenode.



Here we can see how Hadoop works in the data center alongside other applications, how users interact with Hadoop and the problems it solves.
Hadoop allows for loading petabytes of data from many different concurrent inputs. It allows for batch processing, recovery from failures and on-the-fly transformation of data.
Hadoop as a solution protects cluster performance by balancing the workload across the cluster, and protects data by moving it to better-performing nodes and storage. Hadoop can also be used by multiple users, multiple lines of business, and for multiple purposes.



Shown here are the data input and management components that work with Hadoop: Pig, Hive, and HBase.



Hadoop is a ‘top-level’ Apache project created and managed under the auspices of the
Apache Software Foundation. Several other projects exist that rely on some or all of Hadoop,
typically either both HDFS and MapReduce, or just HDFS. Ecosystem projects are often also
top-level Apache projects.
Hive is a project that was initially created at Facebook. The motivation for it was that many
data analysts are very familiar with Structured Query Language (SQL), the de facto standard
for querying data in Relational Database Management Systems (RDBMSs). Data analysts tend
to be far less familiar with programming languages such as Java. Hive provides a way to query data in HDFS without having to write MapReduce code in Java. Around 99% of Facebook’s Hadoop
jobs are now created by the Hive interpreter. Hive allows users to query data using HiveQL, a
language very similar to standard SQL. Hive turns HiveQL queries into standard MapReduce
jobs. Note that Hive is not an RDBMS!
Pig is another high-level abstraction on top of MapReduce, originally developed at Yahoo! It
provides a scripting language known as Pig Latin. Pig abstracts MapReduce details from the
developer and is made up of a set of operations that are applied to the input data to produce
output. The language is relatively easy to learn for people experienced in Perl, Python, Ruby
and other similar languages. It is also fairly easy to use for writing complex tasks, such as joins of multiple datasets. Within the system, Pig Latin scripts are converted to MapReduce jobs.
HBase stores its data in HDFS for reliability and availability. It provides random, real-time
read/write access to large amounts of data. It also allows you to manage tables consisting of
billions of rows, with potentially millions of columns.
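As a minimal sketch of that kind of random, real-time read/write access (the table name, column family and ZooKeeper quorum are hypothetical, and this uses the classic pre-1.0 HTable client; newer HBase releases favor a Connection/Table API), a Java client can put and get individual cells by row key:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");   // hypothetical ZooKeeper quorum

            HTable table = new HTable(conf, "users");            // hypothetical table, backed by files in HDFS

            Put put = new Put(Bytes.toBytes("user42"));          // random, real-time write by row key
            put.add(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Boston"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("user42"));          // random, real-time read by row key
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"))));

            table.close();
        }
    }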



There are several variations, or distributions, of Hadoop available today. Remember, Hadoop started as an Apache project and as free software supported by a community of developers.
Hadoop is still widely available for free, with all required components and ecosystem software also free. Popular distributions of Hadoop include:
• Apache (free)
• Greenplum HD (Hadoop, GPHD)
• PivotalHD (Pivotal Hadoop)
• Cloudera (where Hadoop’s creator now works)



This is a good take-away resource to help you review and reference what each of the core
Hadoop components are, and what they are best suited for.
MapReduce is the software framework for Hadoop. Typically written in Java, MapReduce
programs interact directly with files in HDFS. A MapReduce job is broken into two steps:
1. The input dataset is split into independent chunks, which are processed by the map
tasks in a parallel manner.
2. The map output acts as input to the reduce tasks. In the map step, data is split into key/value pairs, while during the reduce step the ‘like’ keys are retrieved, sorted and then reduced.
HBase is a powerful non-relational, distributed, column-oriented database that sits on top of
HDFS and supports high-performance ingest and data manipulation on billions of rows of
data in real time. This database is typically used when one needs random, real-time, read
and write access to Big Data.
Pig is the platform that supports Pig Latin, a procedural relational data-flow language that
reduces the need to write Java code to run in MapReduce. Pig provides the infrastructure or
compiler for evaluating Pig Latin programs used to analyze large datasets.
Hive is a row-based data warehouse system that facilitates querying and managing large
datasets in distributed storage. It enables easy Extract, Transform, Load (ETL), imposes
structure on a variety of data formats, provides access to files in the Hadoop environment,
and provides query execution in MapReduce. Hive facilitates data summarization, ad-hoc
queries and large dataset analysis. It uses HiveQL, or HQL, an SQL-like language to query data
and create custom functions.
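As a rough sketch of how such a HiveQL query might be submitted from a Java client (the HiveServer2 JDBC driver class, connection URL, credentials and table name are assumptions that vary by Hive version and deployment), Hive compiles the query into MapReduce jobs over files in HDFS:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            // Assumed HiveServer2 JDBC driver and URL; older Hive releases use different values.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hadoop", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL looks like SQL; Hive turns it into MapReduce jobs over data in HDFS.
                ResultSet rs = stmt.executeQuery(
                    "SELECT action, COUNT(*) FROM clickstream GROUP BY action");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }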



Let’s spend a few minutes discussing the individual Hadoop distributions from various third-party vendors.
Apache Hadoop is freely available. It is the 'source' for all versions of Hadoop. That is, many
packaged versions of Hadoop (that you would buy) are based on Apache Hadoop, and may
replace Apache Hadoop with little to no change to the actual source code.
This is fine, especially since Apache Hadoop comes with a stack certification for the other
tangential ecosystem projects that work with Hadoop. When a version of Hadoop is released,
it is certified to work with other related ecosystem software packages.



Greenplum Hadoop (HD) is based on pure Apache open source Hadoop. It includes HBase,
Zookeeper, Pig, Hive and Mahout.
Greenplum HD is an enterprise-ready version of Hadoop that includes the following in its
core package:
• Wizard-based installation and configuration
• UI-based system management
• Validation up to a 1,000-node cluster
• Certification that it works with commodity hardware (the Greenplum Data Computing Appliance) and an open operating system, Red Hat Enterprise Linux (RHEL) 6.x



Beginning in the Spring of 2013, Greenplum became Pivotal and Pivotal released PivotalHD, or Pivotal
Hadoop. PivotalHD includes these core Hadoop components:
• Installation and Configuration Manager (ICM) – cluster installation, upgrade and expansion
tools.
• GP Command Center – visual interface for cluster health, system metrics and job monitoring.
• Hadoop Virtualization Extension (HVE) – enhances Hadoop to support virtual node awareness
and enables greater cluster elasticity.
• GP Data Loader – parallel loading infrastructure that supports “line speed” data loading into
HDFS.
• Isilon Integration – extensively tested at scale with guidelines for compute-heavy, storage-
heavy and balanced configuration.
Additionally, PivotalHD includes these extended features:
• Advanced Database Services (HAWQ): high-performance, “True SQL” query interface running
within the Hadoop cluster.
• Xtensions Framework: support for ADS interfaces on external data providers (HBase, Avro,
etc.).
 Advanced Analytics Functions (MADLib) – ability to access parallelized machine-learning
and data-mining functions at scale.
 Unified Storage Services (USS) and Unified Catalog Services (UCS) – support for tiered
storage (hot, warm, cold) and integration of multiple data provider catalogs into a single
interface.
Pivotal HD is focused on delivering the enterprise-class features that are required by our target
customers and prospects. These features drive data-worker productivity, enable massively-parallel
data loading, support enterprise-grade storage options and can be deployed in virtualized
environments.



This overview of the PivotalHD architecture shows that the core is based on re-purposing critical Apache Hadoop components and adding powerful Pivotal pieces.
Perhaps the most significant aspect is Hadoop With Query (HAWQ), where Pivotal adds the power of the Greenplum database to the Hadoop offering, allowing for structured querying, a query optimizer, cataloging services and dynamic pipelining of data inputs.



This architecture diagram shows the Pivotal Data Platform for Hadoop. It is based on
streaming services for data ingestion, SQL services for analytic workloads, in-memory
services for operational intelligence and in-memory object services for allowing run-time
applications to interact with the underlying data in the infrastructure.
All this is based on the HDFS foundation of Hadoop. One thing the Pivotal data platform does point out is that both RDBMS and data-visualization services still have a valuable role in enterprises, allowing for different and varied services than a Hadoop-only environment can provide.



Another player for the enterprise realm of Hadoop is Cloudera. One important consideration
here is that the creator of Hadoop, Doug Cutting, is now at Cloudera and brings the vision of
the original version of Hadoop to this product set.
Cloudera distributes a platform of open-source Apache projects called Cloudera’s Distribution Including Apache Hadoop, or CDH. In addition, Cloudera offers its enterprise customers a family of products and services that complement the open-source Apache Hadoop platform.



Of course Hadoop has competitors. Hadoop’s competitors are not really the same as
Hadoop, but they do offer other options to consumers for working with distributed,
unstructured data.
• The first competitor is ‘Brisk’, which is based on Cassandra. Cassandra is a NoSQL database system that implements a distributed key-value storage system for its data structures. It is much like a MapReduce solution.
• MongoDB is also a distributed system. It offers MapReduce-like features through a distributed database written in C++.
• Boost.MapReduce is a C++ library with MapReduce features.
• Apache Spark provides cluster computing with iterative MapReduce features.
None of these projects is as widely implemented, as stable, or as user-friendly as Hadoop. The competitive offerings require a lot of customization, programming and development to build your own enterprise-grade infrastructure solution. For many organizations this is a daunting task that requires a great deal of planning, design and custom programming.



Hadoop is very well suited for the social media market. Most of the data in social media is unstructured, experiences rapid daily growth, has new data sources constantly appearing and requires long retention of information.
Servicing the users of social media is important, but for the provider, understanding how the service is used is itself a huge data-processing and business-analytics case. The same engines that are kept busy servicing users and their vast amounts of unstructured data are also leveraged to understand how that data is used and how it trends. All this information is measured, analyzed and quantified to enhance the user experience and provide business intelligence to the provider.
Social Analytics
• Understanding how a company’s brand is impacted (reach and virality) beyond primary user impressions. For example, if you are a “Fan” of a product or service, what is the impact on the brand as a function of your network of friends? This quantifies social media marketing spending.
• Understand the size and audience composition of a brand’s social media following
• Determine the reach and frequency of exposure to social marketing messages and social
advertising
• Link social media exposure to brand engagement and spending propensity
Why Hadoop?
• Social media/web data is unstructured
• Amount of data is immense:
 140 M raw social pages per day
 5 TB of data per day



This course covered the basic principles of Hadoop, including how it is implemented, some of the challenges that it addresses and the core components of a Hadoop solution, as well as use case examples and competitors to a Hadoop solution.




