
Technology Forecast

Making sense
of Big Data

A quarterly journal
2010, Issue 3

In this issue

04 Tapping into the power of Big Data
22 Building a bridge to the rest of your data
36 Revising the CIO’s data playbook
Contents

Features
04 Tapping into the power of Big Data
Treating it differently from your core enterprise data is essential.

22 Building a bridge to the rest of your data


How companies are using open-source cluster-computing
techniques to analyze their data.

36 Revising the CIO’s data playbook


Start by adopting a fresh mind-set, grooming the right talent,
and piloting new tools to ride the next wave of innovation.
Interviews
14 The data scalability challenge
John Parkinson of TransUnion describes the data handling issues
more companies will face in three to five years.

18 Creating a cost-effective Big Data strategy


Disney’s Bud Albers, Scott Thompson, and Matt Estes outline an
agile approach that leverages open-source and cloud technologies.

34 Hadoop’s foray into the enterprise


Cloudera’s Amr Awadallah discusses how and why diverse
companies are trying this novel approach.

46 New approaches to customer data analysis


Razorfish’s Mark Taylor and Ray Velez discuss how new techniques
enable them to better analyze petabytes of Web data.

Departments
02 Message from the editor

50 Acknowledgments

54 Subtext
Message from
the editor

Bill James has loved baseball statistics ever since he was a kid in Mayetta,
Kansas, cutting baseball cards out of the backs of cereal boxes in the early
1960s. James, who compiled The Bill James Baseball Abstract for years, is
a renowned “sabermetrician” (a term he coined himself). He now is a senior
advisor on baseball operations for the Boston Red Sox, and he previously
worked in a similar capacity for other Major League Baseball teams.
James has done more to change the world of baseball statistics than
anyone in recent memory. As broadcaster Bob Costas says, James
“doesn’t just understand information. He has shown people a different
way of interpreting that information.” Before Bill James, Major League
Baseball teams all relied on long-held assumptions about how games are
won. They assumed batting average, for example, had more importance
than it actually does.
James challenged these assumptions. He asked critical questions that
didn’t have good answers at the time, and he did the research and analysis
necessary to find better answers. For instance, how many days’ rest does
a reliever need? James’s answer is that some relievers can pitch well for
two or more consecutive days, while others do better with a day or two of
rest in between. It depends on the individual. Why can’t a closer work more
than just the ninth inning? A closer is frequently the best reliever on the
team. James observes that managers often don’t use the best relievers
to their maximum potential.
The lesson learned from the Bill James example is that the best statistics
come from striving to ask the best questions and trying to get answers to
those questions. But what are the best questions? James takes an iterative
approach, analyzing the data he has, or can gather, asking some questions
based on that analysis, and then looking for the answers. He doesn’t stop
with just one set of statistics. The first set suggests some questions, to
which a second set suggests some answers, which then give rise to yet
another set of questions. It’s a continual process of investigation, one that’s
focused on surfacing the best questions rather than assuming those
questions have already been asked.
Enterprises can take advantage of a similarly iterative, investigative
approach to data. Enterprises are being overwhelmed with data; many generate
petabytes of information that they aren’t making the best use of. And not all of
the data is the same. Some of it has value, and some,
not so much.
The problem with this data has been twofold: (1) it’s difficult to analyze,
and (2) processing it using conventional systems takes too long and is
too expensive.



Addressing these problems effectively doesn’t require radically new technology. Better architectural design choices and software that allows a different approach to the problems are enough. Search engine companies such as Google and Yahoo provide a pragmatic way forward in this respect. They’ve demonstrated that efficient, cost-effective, system-level design can lead to an architecture that allows any company to handle different data differently.

Enterprises shouldn’t treat voluminous, mostly unstructured information (for example, Web server log files) the same way they treat the data in core transactional systems. Instead, they can use commodity computer clusters, open-source software, and Tier 3 storage, and they can process in an exploratory way the less-structured kinds of data they’re generating. With this approach, they can do what Bill James does and find better questions to ask.

In this issue of the Technology Forecast, we review the techniques behind low-cost distributed computing that have led companies to explore more of their data in new ways. In the article “Tapping into the power of Big Data,” on page 04, we begin with a consideration of exploratory analytics—methods that are separate from traditional business intelligence (BI). These techniques make it feasible to look for more haystacks, rather than just the needle in one haystack.

The article “Building a bridge to the rest of your data,” on page 22, highlights the growing interest in and adoption of Hadoop clusters. Hadoop provides high-volume, low-cost computing with the help of open-source software and hundreds or thousands of commodity servers. It also offers a simplified approach to processing more complex data in parallel. The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze lots of data they didn’t have the means to analyze before.

The buzz around Big Data and “cloud storage” (a term some vendors use to describe less-expensive cluster-computing techniques) is considerable, but the article “Revising the CIO’s data playbook,” on page 36, emphasizes that CIOs have time to pick and choose the most suitable approach. The most promising opportunity is in the area of “gray data,” or data that comes from a variety of sources. This data is often raw and unvalidated, arrives in huge quantities, and doesn’t yet have established value. Gray data analysis requires a different skill set—people who are more exploratory by nature.

As always, in this issue we’ve included interviews with knowledgeable executives who have insights on the overall topic of interest:

• John Parkinson of TransUnion describes the data challenges that more and more companies will face during the next three to five years.

• Bud Albers, Scott Thompson, and Matt Estes of Disney outline an agile, open-source cloud data vision.

• Amr Awadallah of Cloudera explores the reasons behind Apache Hadoop’s adoption at search engine, social media, and financial services companies.

• Mark Taylor and Ray Velez of Razorfish contrast newer, more scalable techniques of studying customer data with the old methods.

Please visit pwc.com/techforecast to find these articles and other issues of the Technology Forecast online. If you would like to receive future issues of the Technology Forecast as a PDF attachment, you can sign up at pwc.com/techforecast/subscribe.

We welcome your feedback and your ideas for future research and analysis topics to cover.

Tom DeGarmo
Principal
Technology Leader
thomas.p.degarmo@us.pwc.com



Tapping into the
power of Big Data

Treating it differently from your core enterprise data is essential.


By Galen Gruman



Like most corporations, the Walt Disney Co. is swimming in a rising sea of Big Data: information collected from business operations, customers, transactions, and the like; unstructured information created by social media and other Web repositories, including the Disney home page itself and sites for its theme parks, movies, books, and music; plus the sites of its many big business units, including ESPN and ABC.

“In any given year, we probably generate more data than the Walt Disney Co. did in its first 80 years of existence,” observes Bud Albers, executive vice president and CTO of the Disney Technology Shared Services Group. “The challenge becomes what do you do with it all?”

Albers and his team are in the early stages of answering their own question with an economical cluster-computing architecture based on a set of cost-effective and scalable technologies anchored by Apache Hadoop, an open-source, Java-based distributed computing framework developed by the Apache Software Foundation, whose file system is modeled on the Google File System. These still-emerging technologies allow Disney analysts to explore multiple terabytes of information without the lengthy time requirements or high cost of traditional business intelligence (BI) systems.

This issue of the Technology Forecast examines how Apache Hadoop and these related technologies can derive business value from Big Data by supporting a new kind of exploratory analytics unlike traditional BI. These software technologies and their hardware cluster platform make it feasible not only to look for the needle in the haystack, but also to look for new haystacks. This kind of analysis demands an attitude of exploration—and the ability to generate value from data that hasn’t been scrubbed or fully modeled into relational tables.

Using Disney and other examples, this first article introduces the idea of exploratory BI for Big Data. The second article examines Hadoop clusters and technologies that support them (page 22), and the third article looks at steps CIOs can take now to exploit the future benefits (page 36). We begin with a closer look at Disney’s still-nascent but illustrative effort.

“In any given year, we probably generate more data than the Walt Disney Co. did in its first 80 years of existence.” —Bud Albers of Disney



Bringing Big Data under control

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. Like everyone else, Disney’s Big Data is huge, more unstructured than structured, and growing much faster than transactional data.

The Disney Technology Shared Services Group, which is responsible for Disney’s core Web and analysis technologies, recently began its Big Data efforts but already sees high potential. The group is testing the technology and working with analysts in Disney business units. Disney’s data comes from varied sources, but much of it is collected for departmental business purposes and not yet widely shared. Disney’s Big Data approach will allow it to look at diverse data sets for unplanned purposes and to uncover patterns across customer activities. For example, insights from Disney Store activities could be useful in call centers for theme park booking or to better understand the audience segments of one of its cable networks.

The Technology Shared Services Group is even using Big Data approaches to explore its own IT questions to understand what data is being stored, how it is used, and thus what type of storage hardware and management the group needs.

Albers assumes that Big Data analysis is destined to become essential. “The speed of business these days and the amount of data that we are now swimming in mean that we need to have new ways and new techniques of getting at the data, finding out what’s in there, and figuring out how we deal with it,” he says.

The team stumbled upon an inexpensive way to improve the business while pursuing more IT cost-effectiveness through the use of private-cloud technologies. (See the Technology Forecast, Summer 2009, for more on the topic of cloud computing.) When Albers launched the effort to change the division’s cost curve so IT expenses would rise more slowly than the business usage of IT—the opposite had been true—he turned to an approach that many companies use to make data centers more efficient: virtualization.

Virtualization offers several benefits, including higher utilization of existing servers and the ability to move workloads to prevent resource bottlenecks. An organization can also move workloads to external cloud providers, using them as a backup resource when needed, an approach called cloud bursting. By using such approaches, the Disney Technology Shared Services Group lowered its IT expense growth rate from 27 percent to –3 percent, while increasing its annual processing growth from 17 percent to 45 percent.

While achieving this efficiency, the team realized that the ability to move resources and tap external ones could apply to more than just data center efficiency. At first, they explored using external clouds to analyze big sets of data, such as Web traffic to Disney’s many sites, and to handle big processing jobs more cost-effectively and more quickly than with internal systems.

During that exploration, the team discovered Hadoop, MapReduce, and other open-source technologies that distribute data-analysis workloads across many computers, breaking the analysis into many parallel workloads that produce results faster. Faster results mean that more questions can be asked, and the low cost of the technologies means the team can afford to ask those questions.

Disney assembled a Hadoop cluster and set up a central logging service to mine data that the organization hadn’t been able to before. It will begin to provide internal group access to the cluster in October 2010. Figure 1 shows how the Hadoop cluster will benefit internal groups, business partners, and customers.

“The speed of business these days and the amount of data that we
are now swimming in mean that we need to have new ways and new
techniques of getting at the data, finding out what’s in there, and figuring
out how we deal with it.” —Bud Albers of Disney



[Figure 1 diagram: site visitors, internal business partners, and affiliated businesses reach the D-Cloud data cluster (Hadoop) through an interface to the cluster (MapReduce/Hive/Pig). (1) Usage data from core IT and business unit systems flows through (2) a central logging service into (3) the Hadoop cluster and its metadata repository, producing (4) an improved experience.]

Figure 1: Disney’s Hadoop cluster and central logging service


Disney’s new D-Cloud data cluster can scale to handle (1) less-structured usage data through the establishment of (2) a central
logging service, (3) a cost-effective Hadoop data analysis engine, and a commodity computer cluster. The result is (4) a more
responsive and personalized user experience.
Source: Disney, 2010
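The processing model behind such a cluster is straightforward to sketch. In the Hadoop Streaming convention, a mapper and a reducer read lines on standard input and write tab-separated key/value pairs to standard output; Hadoop splits the input across machines and sorts the mapper output by key before the reduce step. The example below is a minimal illustration of that pattern, counting events per page path; the log field layout, script name, and invocation are hypothetical stand-ins, not Disney’s actual central-logging schema.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style job: count events per page path.
# Assumes each input line is a tab-separated log record whose third
# field is a URL path (a hypothetical layout, not Disney's schema).
import sys

def mapper(lines):
    """Emit (path, 1) for every event record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            print(f"{fields[2]}\t1")

def reducer(lines):
    """Sum counts per path; Hadoop has already sorted mapper output by key."""
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Typical (simplified) invocation:
    #   hadoop jar hadoop-streaming.jar -input logs/ -output counts/ \
    #       -mapper "count_events.py map" -reducer "count_events.py reduce"
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)(sys.stdin)
```

Because each mapper only sees its own slice of the logs and each reducer only sees one key at a time, the same small script scales from a laptop test to hundreds of commodity servers without changes, which is the property the Disney team is relying on.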



Simply put, the low cost of a Hadoop cluster means freedom to experiment. Disney uses a couple of dozen servers that were scheduled to be retired, and the organization operates its cluster with a handful of existing staff. Matt Estes, principal data architect for the Disney Technology Shared Services Group, estimates the cost of the project at $300,000 to $500,000.

“Before, I would have needed to figure on spending $3 million to $5 million for such an initiative,” Albers says. “Now I can do this without charging to the bottom line.”

Unlike the reusable canned queries in typical BI systems, Big Data analysis does require more effort to write the queries and the data-parsing code for what are often unique inquiries of data sources. But Albers notes that “the risk is lower due to all the other costs being lower.” Failure is inexpensive, so analysts are more willing to explore questions they would otherwise avoid.

Even in this early stage, Albers is confident that the ability to ask more questions will lead to more insights that translate to both the bottom line and the top line. For example, Disney already is seeking to boost customer engagement and spending by making recommendations to customers based on pattern analysis of their online behavior.

Disney and others explore their data without a lot of preconceptions. They know the results won’t be as specific as a profit-margin calculation or a drug-efficacy determination. But they still expect demonstrable value, and they expect to get it without a lot of extra expense.

Opportunities for Big Data insights

Here are other examples of the kinds of insights that may be gleaned from analysis of Big Data information flows:

• Customer churn, based on analysis of call center, help desk, and Web site traffic patterns

• Changes in corporate reputation and the potential for regulatory action, based on the monitoring of social networks as well as Web news sites

• Real-time demand forecasting, based on disparate inputs such as weather forecasts, travel reservations, automotive traffic, and retail point-of-sale data

• Supply chain optimization, based on analysis of weather patterns, potential disaster scenarios, and political turmoil

How Big Data analysis is different

What should other enterprises anticipate from Hadoop-style analytics? It is a type of exploratory BI they haven’t done much before. This is business intelligence that provides indications, not absolute conclusions. It requires a different mind-set, one that begins with exploration, the results of which create hypotheses that are tested before moving on to validation and consolidation.

These methods could be used to answer questions such as, “What indicators might there be that predate a surge in Web traffic?” or “What fabrics and colors are gaining popularity among influencers, and what sources might be able to provide the materials to us?” or “What’s the value of an influencer on Web traffic through his or her social network?” See the sidebar “Opportunities for Big Data insights” for more examples of the kinds of questions that can be asked of Big Data.

Typical BI uses data from transactional and other relational database management systems (RDBMSs) that an enterprise collects—such as sales and purchasing records, product development costs, and new employee hire records—diligently scrubs the data for accuracy and consistency, and then puts it into a form the BI system is programmed to run queries against. Such systems are vital for accurate analyses of transactional information, especially information subject to compliance requirements, but they don’t work well for messy questions, they’ve been too expensive for questions you’re not sure there’s any value in asking, and they haven’t been able to scale to analyze large data sets efficiently. (See Figure 2.)
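To make the “data-parsing code for unique inquiries” concrete, here is a minimal exploratory sketch aimed at one of the questions above: which referrers show up in the hour before a surge in Web traffic? It assumes Web server logs in the common combined format; the file name and the surge threshold are invented for illustration, not taken from Disney’s environment.

```python
# Exploratory sketch: which referrers appear most often in the hour
# before a traffic surge? Parses Apache/Nginx combined-format logs.
# The file name and the surge threshold are hypothetical.
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "(?P<ref>[^"]*)"')

hits_per_hour = Counter()
referrers_by_hour = defaultdict(Counter)

with open("access.log") as f:            # hypothetical log file
    for line in f:
        m = LINE.search(line)
        if not m:
            continue                      # skip unparseable lines; this is exploration
        ts = datetime.strptime(m["ts"].split()[0], "%d/%b/%Y:%H:%M:%S")
        hour = ts.replace(minute=0, second=0)
        hits_per_hour[hour] += 1
        referrers_by_hour[hour][m["ref"]] += 1

baseline = sum(hits_per_hour.values()) / max(len(hits_per_hour), 1)
for hour, hits in sorted(hits_per_hour.items()):
    if hits > 3 * baseline:               # crude, adjustable definition of a "surge"
        prior = referrers_by_hour.get(hour - timedelta(hours=1), Counter())
        print(hour, "surge of", hits, "hits; top referrers an hour earlier:",
              prior.most_common(5))
```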



[Figure 2 diagram: a quadrant chart plotting data set size (small to large) against data type (non-relational to relational). Large non-relational data sets are the territory of Big Data (via Hadoop/MapReduce); large relational data sets run into scalability limits; small relational data sets are served by traditional BI; small non-relational data sets offer little analytical value.]

Figure 2: Where Big Data fits in
Source: PricewaterhouseCoopers, 2010

In contrast, Big Data techniques allow you to sift through data to look for patterns at a much lower cost and in much less time than traditional BI systems. Should the data end up being so valuable that it requires the ongoing, compliance-oriented analysis of regular BI systems, only then do you make that investment. Big Data approaches let you ask more questions of more information, opening a wide range of potential insights you couldn’t afford to consider in the past.

“Part of the analytics role is to challenge assumptions,” Estes says. BI systems aren’t designed to do that; instead, they’re designed to dig deeper into known questions and look for variations that may indicate deviations from expected outcomes.

Furthermore, Big Data analysis is usually iterative: you ask one question or examine one data set, then think of more questions or decide to look at more data. That’s different from the “single source of truth” approach to standard BI and data warehousing. The Disney team started with making sure they could expose and access the data, then moved to iterative refinement in working with the data. “We aggressively got in to find the direction and the base. Then we began to iterate rather than try to do a Big Bang,” Albers says.

Other companies have also tapped into the excitement brewing over Big Data technologies. Several Web-oriented companies that have always dealt with huge amounts of data—such as Yahoo, Twitter, and Google—were early adopters. Now, more traditional companies—such as TransUnion, a credit rating service—are exploring Big Data concepts, having seen the cost and scalability benefits the Web companies have realized.

Specifically, enterprises are also motivated by the inability to scale their existing approach for working on traditional analytics tasks, such as querying across terabytes of relational data. They are learning that the tools associated with Hadoop are uniquely positioned to explore data that has been sitting on the side, unanalyzed. Figure 3 illustrates how the data architecture landscape appears in 2010. Enterprises with high processing power requirements and centralized architectures are facing scaling issues.

[Figure 3 diagram: a quadrant chart plotting processing power (low to high) against compute architecture (centralized to distributed). Enterprises with high processing power and centralized architectures face capacity/cost and scaling problems; Google, Amazon, Facebook, Twitter, and similar companies pair high processing power with distributed architectures (all use non-relational data stores for reasons of scale); most enterprises occupy the low-power, centralized quadrant; cloud users with low compute requirements occupy the low-power, distributed quadrant.]

Figure 3: The data architecture landscape in 2010
Source: PricewaterhouseCoopers, 2010

Wolfram Research and IBM have begun to extend their analytics applications to run on such large-scale data pools, and startups are presenting approaches they promise will allow data exploration in ways that technologies couldn’t have enabled in the past, including support for tools that let knowledge workers examine traditional databases using Big Data–style exploratory tools.



The ways different enterprises approach Big Data

It should come as no surprise that organizations dealing with lots of data are already investigating Big Data technologies, or that they have mixed opinions about these tools.

“At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data, looking for things that match a pattern approximately,” says John Parkinson, TransUnion’s acting CTO. “We want to do accurate but approximate matching and categorization in very large low-structure data sets.”

Parkinson has explored Big Data technologies such as MapReduce that appear to have a more efficient filtering model than some of the pattern-matching algorithms TransUnion has tried in the past. “MapReduce also, at least in its theoretical formulation, is very amenable to highly parallelized execution,” which lets the users tap into farms of commodity hardware for fast, inexpensive processing, he notes.

However, Parkinson thinks Hadoop and MapReduce are too immature. “MapReduce really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it. As for Hadoop, they have done a good job, but it’s like a lot of open-source software—80 percent done. There were limits in the code that broke the stack well before what we thought was a good theoretical limit.”

Parkinson echoes many IT executives who are skeptical of open-source software in general. “If I have a bunch of engineers, I don’t want them spending their day being the technology support environment for what should be a product in our architecture,” he says.

That’s a legitimate point of view, especially considering the data volumes TransUnion manages—8 petabytes from 83,000 sources in 4,000 formats and growing—and its focus on mission-critical capabilities for this data. Credit scoring must run successfully and deliver top-notch credit scores several times a day. It’s an operational system that many depend on for critical business decisions that happen millions of times a day. (For more on TransUnion, see the interview with Parkinson on page 14.)

Disney’s system is purely intended for exploratory efforts or at most for reporting that eventually may feed up to product strategy or Web site design decisions. If it breaks or needs a little retooling, there’s no crisis.

But Albers disagrees about the readiness of the tools, noting that the Disney Technology Shared Services Group also handles quite a bit of data. He figures Hadoop and MapReduce aren’t any worse than a lot of proprietary software. “I fully expect we will run on things that break,” he says, adding facetiously, “Not that any commercial product I’ve ever had has ever broken.”

Data architect Estes also sees responsiveness in open-source development that’s laudable. “In our testing, we uncovered stuff, and you get somebody on the other end. This is their baby, right? I mean, they want it fixed.”

Albers emphasizes the total cost-effectiveness of Hadoop and MapReduce. “My software cost is zero. You still have the implementation, but that’s a constant at some level, no matter what. Now you probably need to have a little higher skill level at this stage of the game, so you’re probably paying a little more, but certainly, you’re not going out and approving a Teradata cluster. You’re talking about Tier 3 storage. You’re talking about a very low level of cost for the storage.”

Albers’ points are also valid. PricewaterhouseCoopers predicts these open-source tools will be solid sooner rather than later, and are already worthy of use in non-mission-critical environments and applications. Hence, in the CIO article on page 36, we argue in favor of taking cautious but exploratory steps.

Asking new business questions

Saving money is certainly a big reward, but PricewaterhouseCoopers contends the biggest payoff from Hadoop-style analysis of Big Data is the potential to improve organizations’ top line. “There’s a lot of potential value in the unstructured data in organizations, and people are starting to look at it more seriously,” says Tom Urquhart, chief architect at PricewaterhouseCoopers. Think of it as a “Google in a box, which allows you to do intelligent search regardless of whether the underlying content is structured or unstructured,” he says.



The Google-style techniques in Hadoop, MapReduce, and related technologies work in a fundamentally different way from traditional BI systems, which use strictly formatted data cubes pulling information from data warehouses. Big Data tools let you work with data that hasn’t been formally modeled by data architects, so you can analyze and compare data of different types and of different levels of rigor. Because these tools typically don’t discard or change the source data before the analysis begins, the original context remains available for drill-down by analysts.

These tools provide technology assistance to a very human form of analysis: looking at the world as it is and finding patterns of similarity and difference, then going deeper into the areas of interest judged valuable. In contrast, BI systems know what questions should be asked and what answers to expect; their goal is to look for deviations from the norm or changes in standard patterns deemed important to track (such as changes in baseline quality or in sales rates in specific geographies). Such an approach, absent an exploratory phase, results in a lot of information loss during data consolidation. (See Figure 4.)

[Figure 4 diagram: two data consolidation funnels are compared. In the traditional path, pre-consolidated data is never collected, and information is lost as all collected data is consolidated into summary departmental data and then summary enterprise data before yielding insight. With an exploration phase, that pre-consolidated data is retained, less information is lost during consolidation, and the result is greater insight.]

Figure 4: Information loss in the data consolidation process
Source: PricewaterhouseCoopers, 2010

Pattern analysis mashup services

There’s another use of Big Data that combines efficiency and exploratory benefits: on-the-fly pattern analysis from disparate sources to return real-time results. Amazon.com pioneered Big Data–based product recommendations by analyzing customer data, including purchase histories, product ratings, and comments. Albers is looking for similar value that would come from making live recommendations to customers when they go to a Disney site, store, or reservations phone line—based on their previous online and offline behavior with Disney.

O’Reilly Media, a publisher best known for technical books and Web sites, is working with the White House to develop mashup applications that look at data from various sources to identify patterns that might help lobbyists and policymakers. For example, by mashing together US Census data and labor statistics, they can see which counties have the most international and domestic immigration, then correlate those attributes with government spending changes, says Roger Magoulas, O’Reilly’s research director.
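A mashup of the kind Magoulas describes boils down to joining unrelated public data sets on a shared key and then looking for a relationship. The sketch below joins two hypothetical CSV extracts on a county FIPS code and computes a simple correlation; the file names and column headings are assumptions for illustration, not the actual Census or labor statistics layouts.

```python
# Mashup sketch: join two public data sets on a county code and check
# how net migration tracks changes in government spending.
# File names and column names are hypothetical stand-ins.
import csv

def load(path, key, value):
    """Read a CSV into {county_fips: float(value_column)}."""
    with open(path, newline="") as f:
        return {row[key]: float(row[value]) for row in csv.DictReader(f)}

migration = load("census_migration.csv", "fips", "net_migration")
spending = load("gov_spending_change.csv", "fips", "spending_change_pct")

pairs = [(migration[k], spending[k]) for k in migration.keys() & spending.keys()]
n = len(pairs)

if n > 1:
    # Pearson correlation, computed by hand to keep the sketch dependency-free.
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
    sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
    if sx and sy:
        print(f"counties joined: {n}, correlation: {cov / (sx * sy):.2f}")
```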



Mashups like this can also result in customer-facing services. FlightCaster for iPhone and BlackBerry uses Big Data approaches to analyze flight-delay records and current conditions to issue flight-delay predictions to travelers.

Exploiting the power of human analysis

Big Data approaches can lower processing and storage costs, but we believe their main value is to perform the analysis that BI systems weren’t designed for, acting as an enabler and an amplifier of human analysis.

Ad hoc exploration at a bargain

Big Data lets you inexpensively explore questions and peruse data for patterns that may indicate opportunities or issues. In this arena, failure is cheap, so analysts are more willing to explore questions they would otherwise avoid. And that should lead to insights that help the business operate better.

Medical data is an example of the potential for ad hoc analysis. “A number of such discoveries are made on the weekends when the people looking at the data are doing it from the point of view of just playing around,” says Doug Lenat, founder and CEO of Cycorp and a former professor at Stanford and Carnegie Mellon universities.

Right now the technical knowledge required to use these tools is nontrivial. Imagine the value of extending the exploratory capability more broadly. Cycorp is one of many startups trying to make Big Data analytic capabilities usable by more knowledge workers so they can perform such exploration.

Analyzing data that wasn’t designed for BI

Big Data also lets you work with “gray data,” or data from multiple sources that isn’t formatted or vetted for your specific needs, and that varies significantly in its level of detail and accuracy—and thus cannot be examined by BI systems.

One analogy is Wikipedia. Everyone knows its information is not rigorously managed or necessarily accurate; nonetheless, Wikipedia is a good first place to look for indicators of what may be true and useful. From there, you do further research using a mix of information resources whose accuracy and completeness may be more established.

People use their knowledge and experience to appropriately weigh and correlate what they find across gray data to come up with improved strategies to aid the business. Figure 5 compares gray data and more normalized black data.

Figure 5: Gray versus black data

Gray data: raw; data and context commingled; noisy; hypothetical; unchecked; indicative; less trustworthy; managed by business unit (e.g., Wikipedia).
Black data: classified; provenanced; cleaned; actual; reviewed; confirming; more trustworthy; managed by IT (e.g., financial system data).

Source: PricewaterhouseCoopers, 2010

Web analytics and financial risk analysis are two examples of how Big Data approaches augment human analysts. These techniques comb huge data sets of information collected for specific purposes (such as monitoring individual financial records), looking for patterns that might identify good prospects for loans and flag problem borrowers. Increasingly, they comb external data not collected by a credit reporting agency—for example, trends in a neighborhood’s housing values or in local merchants’ sales patterns—to provide insights into where sales opportunities could be found or where higher concentrations of problem customers are located.

The same approaches can help identify shifts in consumer tastes, such as for apparel and furniture. And, by analyzing gray data related to costs of resources and changes in transportation schedules, these approaches can help anticipate stresses on suppliers and help identify where additional suppliers might be found.

All of these activities require human intelligence, experience, and insight to make sense of the data, figure out the questions to ask, decide what information should be correlated, and generally conduct the analysis.



Why the time is ripe for Big Data

The human analysis previously described is old hat for many business analysts, whether they work in manufacturing, fashion, finance, or real estate. What’s changing is scale. As noted, many types of information are now available that never existed or were not accessible. What could once only be suggested through surveys, focus groups, and the like can now be examined directly, because more of the granular thinking and behaviors are captured. Businesses have the potential to discover more through larger samples and more granular details, without relying on people to recall behaviors and motivations accurately.

This potential can be realized only if you pull together and analyze all that data. Right now, there’s simply too much information for individual analysts to manage, increasing the chances of missing potential opportunities or risks. Businesses that augment their human experts with Big Data technologies could have significant competitive advantages by heading off problems sooner, identifying opportunities earlier, and performing mass customization at a larger scale.

Fortunately, the emerging Big Data tools should let businesspeople apply individual judgments to vaster pools of information, enabling low-cost, ad hoc analysis never before feasible. Plus, as patterns are discovered, the detection of some can be automated, letting the human analysts concentrate on the art of analysis and interpretation that algorithms can’t accomplish.

Even better, emerging Big Data technologies promise to extend the reach of analysis beyond the cadre of researchers and business analysts. Several startups offer new tools that use familiar data-analysis tools—similar to those for SQL databases and Excel spreadsheets—to explore Big Data sources, thus broadening the ability to explore to a wider set of knowledge workers.

Finally, Big Data approaches can be used to power analytics-based services that improve the business itself, such as in-context recommendations to customers, more accurate predictions of service delivery, and more accurate failure predictions (such as for the manufacturing, energy, medical, and chemical industries).

Conclusion

PricewaterhouseCoopers believes that Big Data approaches will become a key value creator for businesses, letting them tap into the wild, woolly world of information heretofore out of reach. These new data management and storage technologies can also provide economies of scale in more traditional data analysis. Don’t limit yourself to the efficiencies of Big Data and miss out on the potential for gaining insights through its advantages in handling the gray data prevalent today.

Big Data analysis does not replace other systems. Rather, it supplements the BI systems, data warehouses, and database systems essential to financial reporting, sales management, production management, and compliance systems. The difference is that these information systems deal with the knowns that must meet high standards for rigor, accuracy, and compliance—while the emerging Big Data analytics tools help you deal with the unknowns that could affect business strategy or its execution.

As the amount and interconnectedness of data vastly increases, the value of the Big Data approach will only grow. If the amount and variety of today’s information is daunting, think what the world will be like in 5 or 10 years. People will become mobile sensors—collecting, creating, and transmitting all sorts of information, from locations to body status to environmental information.

We already see this happening as smartphones equipped with cameras, microphones, geolocation, and compasses proliferate. Wearable medical sensors, small temperature tags for use on packages, and other radio-equipped sensors are a reality. They’ll be the Twitter and Facebook feeds of tomorrow, adding vast quantities of new information that could provide context on behavior and environment never before possible—and a lot of “noise” certain to mask what’s important.

Insight-oriented analytics in this sea of information—where interactions cause untold ripples and eddies in the flow and delivery of business value—will become a critical competitive requirement. Big Data technology is the likeliest path to gaining such insights.



The data scalability
challenge

John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years.
Interview conducted by Vinod Baya and Alan Morrison

John Parkinson is the acting CTO of TransUnion, the chairman and owner of Parkwood
Advisors, and a former CTO at Capgemini. In this interview, Parkinson outlines
TransUnion’s considerable requirements for less-structured data analysis, shedding
light on the many data-related technology challenges TransUnion faces today—challenges
he says that more companies will face in the near future.

PwC: In your role at TransUnion, you’ve evaluated many large-scale data processing technologies. What do you think of Hadoop and MapReduce?

JP: MapReduce is a very computationally attractive answer for a certain class of problem. If you have that class of problem, then MapReduce is something you should look at. The challenge today, however, is that the number of people who really get the formalism behind MapReduce is a lot smaller than the group of people trying to understand what to do with it. It really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it.

PwC: What class of problem would that be?

JP: MapReduce works best in situations where you want to do high-volume, accurate but approximate matching and categorization in very large, low-structured data sets. At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data looking for things that match a pattern approximately. MapReduce is a more efficient filter for some of the pattern-matching algorithms that we have tried to use. At least in its theoretical formulation, it’s very amenable to highly parallelized execution, which many of the other filtering algorithms we’ve used aren’t.

The open-source stack is attractive for experimenting, but the problem we find is that Hadoop isn’t what Google runs in production—it’s an attempt by a bunch of pretty smart guys to reproduce what Google runs in production. They’ve done a good job, but it’s like a lot of open-source software—80 percent done. The 20 percent that isn’t done—those are the hard parts.

From an experimentation point of view, we have had a lot of success in proving that the computing formalism behind MapReduce works, but the software that we can acquire today is very fragile. It’s difficult to manage. It has some bugs in it, and it doesn’t behave very well in an enterprise environment. It also has some interesting limitations when you try to push the scale and the performance.



We found a number of representational problems when we used the HDFS/Hadoop/HBase stack to do something that, according to the documentation available, should have worked. However, in practice, limits in the code broke the stack well before what we thought was a good theoretical limit.

Now, the good news of course is that you get source code. But that’s also the bad news. You need to get the source code, and that’s not something that we want to do as part of routine production. I have a bunch of smart engineers, but I don’t want them spending their day being the technology support environment for what should be a product in our architecture. Yes, there’s a pony there, but it’s going to be awhile before it stabilizes to the point that I want to bet revenue on it.

PwC: Data warehousing appliance prices have dropped pretty dramatically over the past couple of years. When it comes to data that’s not necessarily on the critical path, how does an enterprise make sure that it is not spending more than it has to?

JP: We are probably not a good representational example of that because our business is analyzing the data. There is almost no price we won’t pay to get a better answer faster, because we can price that into the products we produce. The challenge we face is that the tools don’t always work properly at the edge of the envelope. This is a problem for hardware as well as software. A lot of the vendors stop testing their applications at about 80 percent or 85 percent of their theoretical capability. We routinely run them at 110 percent of their theoretical capability, and they break. I don’t mind making tactical justifications for technologies that I expect to replace quickly. I do that all the time. But having done that, I want the damn thing to work. Too often, we’ve discovered that it doesn’t work.

PwC: Are you forced to use technologies that have matured because of a wariness of things on the absolute edge?

JP: My dilemma is that things that are known to work usually don’t scale to what we need—for speed or full capacity. I must spend some time, energy, and dollars betting on things that aren’t mature yet, but that can be sufficiently generalized architecturally. If the one I pick doesn’t work, or goes away, I can fit something else into its place relatively easily. That’s why we like appliances. As long as they are well behaved at the network layer and have a relatively generalized or standards-based business semantic interface, it doesn’t matter if I have to unplug one in 18 months or two years because something better came along. I can’t do that for everything, but I can usually afford to do it in the areas where I have no established commercial alternative.

“I have a bunch of smart engineers, but I don’t want them spending their
day being the technology support environment for what should be a
product in our architecture.”



PwC: What are you using in place of something like Hadoop?

JP: Essentially, we use brute force. We use Ab Initio, which is a very smart brute-force parallelization scheme. I depend on certain capabilities in Ab Initio to parallelize the ETL [extract, transform, and load] in such a way that I can throw more cores at the problem.

PwC: Much of the data you see is transactional. Is it all structured data, or are you also mining text?

JP: We get essentially three kinds of data. We get accounts receivable data from credit loan issuers. That’s the record of what people actually spend. We get public record data, such as bankruptcy records, court records, and liens, which are semi-structured text. And we get other data, which is whatever shows up, and it’s generally hooked together around a well-understood set of identifiers. But the cost of this data is essentially free—we don’t pay for it. It’s also very noisy. So we have to spend computational time figuring out whether the data we have is right, because we must find a place to put it in the working data sets that we build.

At TransUnion, we suck in 100 million updates a day for the credit files. We update a big data warehouse that contains all the credit and related data. And then every day we generate somewhere between 1 and 20 operational data stores, which is what we actually run the business on. Our products are joined between what we call indicative data, the information that identifies you as an individual; structured data, which is derived from transactional records; and unstructured data that is attached to the indicative. We build those products on the fly because the data may change every day, sometimes several times a day.

One challenge is how to accurately find the right place to put the record. For example, we get a Joe Smith at 13 Main Street and a Joe Smith at 31 Main Street. Are those two different Joe Smiths, or is that a typing error? We have to figure that out 100 million times a day using a bunch of custom pattern-matching and probabilistic algorithms.

PwC: Of the three kinds of data, which is the most challenging?

JP: We have two kinds of challenges. The first is driven purely by the scale at which we operate. We add roughly half a terabyte of data per month to the credit file. Everything we do has challenges related to scale, updates, speed, or database performance. The vendors both love us and hate us. But we are where the industry is going—where everybody is going to be in two to five years. We are a good leading indicator, but we break their stuff all the time. A second challenge is the unstructured part of the data, which is increasing.

PwC: It’s more of a challenge to deal with the unstructured stuff because it comes in various formats and from various sources, correct?

JP: Yes. We have 83,000 data sources. Not everyone provides us with data every day. It comes in about 4,000 formats, despite our data interchange standards. And, to be able to process it fast enough, we must convert all data into a single interchange format that is the representation of what we use internally. Complex computer science problems are associated with all of that.

PwC: Are these the kinds of data problems that businesses in other industries will face in three to five years?

JP: Yes, I believe so.

PwC: What are some of the other problems you think will become more widespread?

JP: Here are some simple practical examples. We have 8.5 petabytes of data in the total managed environment. Once you go seriously above 100 terabytes, you must replace the storage fabric every four or five years. Moving 100 terabytes of data becomes a huge material issue and takes a long time. You do get some help from improved interconnect speed, but the arrays go as fast as they go for reads and writes and you can’t go faster than that. And businesses down the food chain are not accustomed to thinking about refresh cycles that take months to complete. Now, a refresh cycle of PCs might take months to complete, but any one piece of it takes only a couple of hours. When I move data from one array to another, I’m not done until I’m done. Additionally, I have some bugs and new vulnerabilities to deal with.

Today, we don’t have a backup problem at TransUnion because we do incremental forever backup. However, we do have a restore problem. To restore a material amount of data, which we very occasionally need to do, takes days in some instances because the physics of the technology we use won’t go faster than that. The average IT department doesn’t worry about these problems. But take the amount of data an average IT department has under management, multiply it by a single decimal order of magnitude, and it starts to become a material issue.

We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It. For now, I don’t have a computational problem, but if I can’t shift the trend line on Store It and Move It, I will have a computational problem within a few years. To perform the computations in useful time, I must parallelize how I compute. Above a certain point, the parallelization breaks because I can’t move the data further.

PwC: Cloudera [a vendor offering a Hadoop distribution] would say bring the computation to the data.

JP: That works only for certain kinds of data. We already do all of that large-scale computation on a file system basis, not on a database basis. And we spend compute cycles to compress the data so there are fewer bits to move, then decompress the data for computation, and recompress it so we have fewer bits to store.

What we have discovered—because I run the fourth largest commercial GPFS [general parallel file system, a distributed computing file system developed by IBM] cluster in the world—is that once you go beyond a certain size, the parallelization management tools break. That’s why I keep telling people that Hadoop is not what Google runs in production. Maybe the Google guys have solved this, but if they have, they aren’t telling me how.

“We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It.”
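The “Joe Smith at 13 Main Street versus 31 Main Street” question Parkinson raises is a record-linkage decision. TransUnion’s own algorithms are custom and proprietary and are not described in the interview; the sketch below only illustrates the general shape of such a check, scoring name and address similarity with Python’s standard-library difflib and applying an invented threshold.

```python
# Record-linkage sketch: is an incoming record the same person as an
# existing one, or a new individual? Thresholds and weights are invented
# for illustration; production systems use far richer probabilistic models.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(incoming: dict, candidate: dict) -> float:
    """Weighted blend of name and address similarity."""
    return (0.6 * similarity(incoming["name"], candidate["name"]) +
            0.4 * similarity(incoming["address"], candidate["address"]))

existing = {"name": "Joe Smith", "address": "13 Main Street"}
incoming = {"name": "Joe Smith", "address": "31 Main Street"}

score = match_score(incoming, existing)
# "13" versus "31" is a small transposition, so the score stays high; a real
# system would also weigh date of birth, identifiers, and account history.
print(f"score={score:.2f} ->",
      "probable same person" if score > 0.85 else "treat as distinct")
```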



Creating a cost-effective
Big Data strategy

Disney’s Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies.
Interview conducted by Galen Gruman
and Alan Morrison

Bud Albers joined what is now the Disney Technology Shared Services Group two years ago as executive vice president and CTO. His management
team includes Scott Thompson, vice president of architecture, and Matt Estes, principal
data architect. The Technology Shared Services Group, located in Seattle, has a heritage
dating back to the late 1990s, when Disney acquired Starwave and Infoseek.
The group supports all the Disney businesses ($38 billion in annual revenue), managing
the company’s portfolio of Web properties. These include properties for the studio, store,
and park; ESPN; ABC; and a number of local television stations in major cities.
In this interview, Albers, Thompson, and Estes discuss how they’re expanding Disney’s
Web data analysis footprint without incurring additional cost by implementing a Hadoop
cluster. Albers and team freed up budget for this cluster by virtualizing servers and
eliminating other redundancies.
PwC: Disney is such a diverse company, and yet there clearly is lots of potential for synergies and cross-fertilization. How do you approach these opportunities from a data perspective?

BA: We try and understand the best way to work with and to provide services to the consumer in the long term. We have some businesses that are very data intensive, and then we have some that are less so because of their consumer audience. One of the challenges always is how to serve both kinds of businesses and do so in ways that make sense. The sell-to relationships extend from the studio out to the distribution groups and the theater chains. If you’re selling to millions, you’re trying to understand the different audiences and how they connect.

One of the things I’ve been telling my folks from a data perspective is that you don’t send terabytes one way to be mated with a spreadsheet on the other side, right? We’re thinking through those kinds of pieces and trying to figure out how we move down a path. The net is that working with all these businesses gives us a diverse set of requirements, as you might imagine. We’re trying to stay ahead of where all the businesses are.

In that respect, the questions I’m asking are, how do we get more agile, and how do we do it in a way that handles all the data we have? We must consider all of the new form factors being developed, all of which will generate lots of data. A big question is, how do we handle this data in a way that makes cost sense for the business and provides us an increased level of agility?



We hope to do in other areas what we’ve done with content distribution networks [CDNs]. We’ve had a tremendous amount of success with the CDN marketplace by standardizing, by staying in the middle of the road and not going to Akamai proprietary extensions, and by creating a dynamic marketplace. If we get a new episode of Lost, we can start streaming it, and I can be streaming 80 percent on Akamai and 20 percent on Level 3. Then we can decide we’re going to turn it back, and I’m going to give 80 percent to Limelight and 20 percent to Level 3. We can do that dynamically.

PwC: What are the other main strengths of the Technology Shared Services Group at Disney?

BA: When I came here a couple of years ago, we had some very good core central services. If you look at the true definition of a cloud, we had the very early makings of one—shared central services around registration, for example. On Disney, on ABC, or on ESPN, if you have an ID, it works on all the Disney properties. If you have an ESPN ID, you can sign in to KGO in San Francisco, and it will work. It’s all a shared registration system. The advertising system we built is shared. The marketing systems we built are shared—all the analytics collection, all those things are centralized. Those things that are common are shared among all the sites.

Those things that are brand specific are built by the brands, and the user interface is controlled by the brands, so each of the various divisions has a head of engineering on the Web site who reports to me. Our CIO worries about it from the firewall back; I worry about it from the firewall to the living room and the mobile device. That’s the way we split up the world, if that makes sense.

PwC: How do you link the data requirements of the central core with those that are unique to the various parts of the business?

BA: It’s more art than science. The business units must generate revenue, and we must provide the core services. How do you strike that balance? Ownership is a lot more negotiated on some things today. We typically pull down most of the analytics and add things in, and it’s a constant struggle to answer the question, “Do we have everything?” We’re headed toward this notion of one data element at a time, aggregate, and queue up the aggregate. It can get a little bit crazy because you wind up needing to pull the data in and run it through that whole food chain, and it may or may not have lasting value.

It may have only a temporal level of importance, and so we’re trying to figure out how to better handle that. An awful lot of what we do in the data collection is pull it in, lay it out so it can be reported on, and/or push it back into the businesses, because the Web is evolving rapidly from a standalone thing to an integral part of how you do business.

“It’s more art than science. The business units must generate revenue,
and we must provide the core services. How do you strike that balance?
Ownership is a lot more negotiated on some things today.”
—Bud Albers



PwC: Hadoop seems to suggest a feasible way to analyze data that has only temporal importance. How did you get to the point where you could try something like a Hadoop cluster?

BA: Guys like me never get called when it's all pretty and shiny. The Disney unit I joined obviously has many strengths, but when I was brought on, there was a cost growth situation. The volume of the aggregate activity growth was 17 percent. Our server growth at the time was 30 percent. So we were filling up data centers, but we were filling them with CPUs that weren't being used. My question was, how can you go to the CFO and ask for a lot of money to fill a data center with capital assets that you're going to use only 5 percent of?

CPU utilization isn't the only measure, but it's the most prominent one. To study and understand what was happening, we put monitors and measures on our servers and reported peak CPU utilization on five-minute intervals across our server farm. We found that on roughly 80 percent of our servers, we never got above 10 percent utilization in a monthly period.

Our first step to address that problem was virtualization. At this point, about 49 percent of our data center is virtual. Our virtualization effort had a sizable impact on cost. Dollars fell out because we quit building data centers and doing all kinds of redundant shuffling. We didn't have to lay off people. We changed some of our processes, and we were able to shift our growth curve from plus 27 to minus 3 on the shared service.

We call this our D-Cloud effort. Another step in this effort was moving to a REST [REpresentational State Transfer] and JSON [JavaScript Object Notation] data exchange standard, because we knew we had to hit all these different devices and create some common APIs [application programming interfaces] in the framework. One of the very first things we put in place was a central logging service for all the events. These event logs can be streamed into one very large data set. We can then use the Hadoop and MapReduce paradigm to go after that data.

PwC: How does the central logging service fit into your overall strategy?

ST: As we looked at it, we said, it's not just about virtualization. To be able to burst and do these other things, you need to build a bunch of core services. The initiative we're working on now is to build some of those core services around managing configuration. This project takes the foundation we laid with virtualization and a REST and JSON data exchange standard, and adds those core services that enable us to respond to the marketplace as it develops. Piping that data back to a central repository helps you to analyze it, understand what's going on, and make better decisions on the basis of what you learned.

PwC: How do you evolve so that the data strategy is really served well, so that it's more of a data-driven approach in some ways?

ME: On one side, you have a very transactional OLTP [online transactional processing] kind of world, RDBMSs [relational database management systems], and major vendors that we're using there. On the other side of it, you have traditional analytical warehousing. And where we've slotted this [Hadoop-style data] is in the middle with the other operational data. Some of it is derived from transactional data, and some has been crafted out of analytical data. There's a freedom that's derived from blending these two kinds of data.

Our centralized logging service is an example. As we look at continuing to drive down costs to drive up efficiency, we can begin to log a large amount of this data at a price point that we have not been able to achieve by scaling up RDBMSs or using warehousing appliances.

Then the key will be putting an expert system in place. That will give us the ability to really understand what's going on in the actual operational environment. We're starting to move again toward lower utilization trajectories. We need to scale the infrastructure back and get that utilization level up to the threshold.
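The central logging pattern the Disney team describes (append events to one very large data set in HDFS, then run ad hoc jobs over it later) maps naturally onto a map-only Hadoop job. The sketch below is purely illustrative: the log format, the "signin" event name, and the paths are hypothetical, not Disney's. It scans a day of JSON event logs and keeps only one event type.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative map-only job: keep only "signin" events from raw JSON event logs.
public class SigninFilter {

  public static class FilterMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Crude string match; a production job would use a real JSON parser.
      if (line.toString().contains("\"event\":\"signin\"")) {
        context.write(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "signin-filter");
    job.setJarByClass(SigninFilter.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0); // map-only: matching lines go straight to output files
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., a day of logs in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because there is no reduce phase, each map task writes its matches directly; adding a reducer turns the same skeleton into an aggregation.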



PwC: This kind of information doesn't go in a cube. Not that data cubes are going away, but cubes are fairly well known now. The value you can create is exactly what you said, understanding the thinking behind it and the exploratory steps.

ST: We think storing the unstructured data in its raw format is what's coming. In a Hadoop environment, instead of bringing the data back to your warehouse, you figure out what question you want to answer. Then you MapReduce the input, and you may send that off to a data cube and a place that someone can dig around in, but you keep the data in its raw format and pull out only what you need.

BA: The wonderful thing about where we're headed right now is that data analysis used to be this giant, massive bet that you had to place up front, right? No longer. Now, I pull Hadoop off of the Internet, first making sure that we're compliant from a legal perspective with licensing and so forth. After that's taken care of, you begin to prototype. You begin to work with it against common hardware. You begin to work with it against stuff you otherwise might throw out. Rather than, I'm going to go spend how much for Teradata?

We're using the basic premise of the cloud, and we're using those techniques of standardizing the interface to virtualize and drive cost out. I'm taking that cost savings and returning some of it to the business, but then reinvesting some in new capabilities while the cost curve is stabilizing.

ME: Refining some of this reinvestment in new capabilities doesn't have to be put in the category of traditional "$5 million projects" companies used to think about. You can make significant improvements with reinvestments of $200,000 or even $50,000.

BA: It's then a matter of how you're redeploying an investment in resources that you've already made as a corporation. It's a matter of now prioritizing your work and not changing the bottom-line trajectory in a negative fashion with a bet that may not pay off. I can try it, and I don't have to get great big governance-based permission to do it, because it's not a bet of half the staff and all of this stuff. It's, OK, let's get something on the ground, let's work with the business unit, let's pilot it, let's go somewhere where we know we have a need, let's validate it against this need, and let's make sure that it's working. It's not something that must go through an RFP [request for proposal] and standard procurement. I can move very fast.

“We think storing the unstructured data in its raw format is what’s
coming. In a Hadoop environment, instead of bringing the data back to
your warehouse, you figure out what question you want to answer.”
—Scott Thompson



Building a bridge to the rest of your data

How companies are using open-source cluster-computing techniques to analyze their data.

By Alan Morrison



As recently as two years ago, the International Supercomputing Conference (ISC) agenda included nothing about distributed computing for Big Data—as if projects such as Google Cluster Architecture, a low-cost, distributed computing design that enables efficient processing of large volumes of less-structured data, didn't exist. In a May 2008 blog, Brough Turner noted the omission, pointing out that Google had harnessed as much as 100 petaflops1 of computing power, compared to a mere 1 petaflop in the new IBM Roadrunner, a supercomputer profiled in EE Times that month. "Have the supercomputer folks been bypassed and don't even know it?" Turner wondered.2

Turner, co-founder and CTO of Ashtonbrooke.com, a startup in stealth mode, had been reading Google's research papers and remarking on them in his blog for years. Although the broader business community had taken little notice, some companies were following in Google's wake. Many of them were Web companies that had data processing scalability challenges similar to Google's.

Yahoo, for example, abandoned its own data architecture and began to adopt one along the lines pioneered by Google. It moved to Apache Hadoop, an open-source, Java-based framework developed by the Apache Software Foundation, whose distributed file system is based on Google File System; it also adopted MapReduce, Google's parallel programming framework. Yahoo used these and other open-source tools it helped develop to crawl and index the Web. After implementing the architecture, it found other uses for the technology and has now scaled its Hadoop cluster to 4,000 nodes.

By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O'Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage. Big Data refers to data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis by traditional means. Many who are familiar with these new methods are convinced that Hadoop clusters will enable cost-effective analysis of Big Data, and these methods are now spreading beyond companies that mine the public Web as part of their business.

By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O'Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage.



“Hadoop will process the data set and output a new data set,
as opposed to changing the data set in place.” —Amr Awadallah
of Cloudera
What are these methods and how do they work? This article looks at the architecture and tools surrounding Hadoop clusters with an eye toward which aspects of them will be useful to mainstream enterprises during the next three to five years. We focus on their utility for less-structured data.

Hadoop clusters

Although cluster computing has been around for decades, commodity clusters are more recent, starting with UNIX- and Linux-based Beowulf clusters in the mid-1990s. These banks of inexpensive servers networked together were pitted against expensive supercomputers from companies such as Cray and others—the kind of computers that government agencies, such as the National Aeronautics and Space Administration (NASA), bought. It was no accident that NASA pioneered the development of Beowulf.3

Hadoop extends the value of commodity clusters, making it possible to assemble a high-end computing cluster at a low-end price. A central assumption underlying this architecture is that some nodes are bound to fail when computing jobs are distributed across hundreds or thousands of nodes. Therefore, one key to success is to design the architecture to anticipate and recover from individual node failures.4 Other goals of the Google Cluster Architecture and its expression in open-source Hadoop include:

• Price/performance over peak performance—The emphasis is on optimizing aggregate throughput; for example, sorting functions to rank the occurrence of keywords in Web pages. Overall sorting throughput is high. In each of the past three years, Yahoo's Hadoop clusters have won Gray's terabyte sort benchmarking test.5

• Software tolerance for hardware failures—When a failure occurs, the system responds by transferring the processing to another node, a critical capability for large distributed systems. As Roger Magoulas, research director for O'Reilly Media, says, "If you are going to have 40 or 100 machines, you don't expect your machines to break. If you are running something with 1,000 nodes, stuff is going to break all the time."

• High compute power per query—The ability to scale up to thousands of nodes implies the ability to throw more compute power at each query. That ability, in turn, makes it possible to bring more data to bear on each problem.

• Modularity and extensibility—Hadoop clusters scale horizontally with the help of a uniform, highly modular architecture.

Hadoop isn't intended for all kinds of workloads, especially not those with many writes. It works best for read-intensive workloads. These clusters complement, rather than replace, high-performance computing (HPC) systems and relational data systems. They don't work well with transactional data or records that require frequent updating. "Hadoop will process the data set and output a new data set, as opposed to changing the data set in place," says Amr Awadallah, vice president of engineering and CTO of Cloudera, which develops a version of Hadoop.

A data architecture and a software design that are frugal with network and disk resources are responsible for the price/performance ratio of Hadoop clusters. In Awadallah's words, "You move your processing to where your data lives." Each node has its own processing and storage, and the data is divided and processed locally in blocks sized for the purpose. This concept of localization makes it possible to use inexpensive serial advanced technology attachment (SATA) hard disks—the kind used in most PCs and servers—and Gigabit Ethernet for most network interconnections. (See Figure 1.)



Figure 1: Hadoop cluster layout and characteristics

[The figure shows a client connected through a 1000Mbps switch to 100Mbps rack switches; each rack holds task tracker/DataNode servers, coordinated by a JobTracker and a NameNode.]

Typical node setup: 2 quad-core Intel Nehalem processors, 24GB of RAM, 12 1TB SATA disks (non-RAID), 1 Gigabit Ethernet card. Cost per node: $5,000. Effective file space per node: 20TB.

Claimed benefits: linear scaling at $250 per user TB (versus $5,000–$100,000 for alternatives); compute placed near the data and fewer writes limit networking and storage costs; modularity and extensibility.

Source: IBM, 2008, and Cloudera, 2010



“Amazon supports Hadoop directly through its Elastic MapReduce
application programming interfaces.” —Chris Wensel of Concurrent

The result is less-expensive large-scale distributed computing and parallel processing, which make possible an analysis that is different from what most enterprises have previously attempted. As author Tom White points out, "The ability to run an ad hoc query against your whole data set and get the results in a reasonable time is transformative."6

The cost of this capability is low enough that companies can fund a Hadoop cluster from existing IT budgets. When it decided to try Hadoop, Disney's Technology Shared Services Group took advantage of the increased server utilization it had already achieved from virtualization. As of March 2010, with nearly 50 percent of its servers virtualized, Disney had 30 percent server image growth annually but 30 percent less growth in physical servers. It was then able to set up a multi-terabyte cluster with Hadoop and other free open-source tools, using servers it had planned to retire. The group estimates it spent less than $500,000 on the entire project. (See the article, "Tapping into the power of Big Data," on page 04.)

These clusters are also transformative because cloud providers can offer them on demand. Instead of using their own infrastructures, companies can subscribe to a service such as Amazon's or Cloudera's distribution on the Amazon Elastic Compute Cloud (EC2) platform.

The EC2 platform was crucial in a well-known use of cloud computing on a Big Data project that also depended on Hadoop and other open-source tools. In 2007, The New York Times needed to quickly assemble the PDFs of 11 million articles from 4 terabytes of scanned images. Amazon's EC2 service completed the job in 24 hours after setup, a feat that received widespread attention in blogs and the trade press.

Mostly overlooked in all that attention was the use of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Using these open-source tools, after studying how-to blog posts from others, Times senior software architect Derek Gottfrid developed and ran code in parallel across multiple Amazon machines.7

"Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces [APIs]," says Chris Wensel, founder of Concurrent, which developed Cascading. (See the discussion of Cascading later in this article.) "I regularly work with customers to boot up 200-node clusters and process 3 terabytes of data in five or six hours, and then shut the whole thing down. That's extraordinarily powerful."

The Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) and the MapReduce parallel programming framework are at the core of Apache Hadoop. Comparing HDFS and MapReduce to Linux, Awadallah says that together they're a "data operating system." This description may be overstated, but there are similarities to any operating system. Operating systems schedule tasks, allocate resources, and manage files and data flows to fulfill the tasks. HDFS does a distributed computing version of this. "It takes care of linking all the nodes together to look like one big file and job scheduling system for the applications running on top of it," Awadallah says.

HDFS, like all Hadoop tools, is Java based. An HDFS cluster contains two kinds of nodes:

• A single NameNode that logs and maintains the necessary metadata in memory for distributed jobs

• Multiple DataNodes that create, manage, and serve the 64MB blocks that hold the pieces of each file, according to instructions from the NameNode
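Applications don't deal with NameNodes, DataNodes, or blocks directly; they read and write files through Hadoop's FileSystem API while HDFS handles placement and replication underneath. A minimal sketch of that interaction follows; the path and record are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteReadExample {
  public static void main(String[] args) throws Exception {
    // Reads the NameNode address (fs.default.name) from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS splits it into blocks and replicates each block
    // (three copies by default) across DataNodes.
    Path events = new Path("/logs/events/2010-06-01.log");
    FSDataOutputStream out = fs.create(events);
    out.writeBytes("2010-06-01T00:00:01\tregistration\tespn\n");
    out.close();

    // Read it back; the client fetches blocks from whichever DataNodes hold them.
    FSDataInputStream in = fs.open(events);
    byte[] buffer = new byte[4096];
    int bytesRead = in.read(buffer);
    System.out.println(new String(buffer, 0, bytesRead));
    in.close();
  }
}

Under the hood, the create() call asks the NameNode where to place each block, and the client then streams data to the chosen DataNodes, which replicate it among themselves.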



HDFS uses multi-gigabyte file sizes to reduce the management complexity of lots of files in large data volumes. It typically writes each copy of the data once, adding to files sequentially. This approach simplifies the task of synchronizing data and reduces disk and bandwidth usage.

Equally important are fault tolerance and staying within disk and bandwidth usage limits. To accomplish fault tolerance, HDFS creates three copies of each data block, typically storing two copies in the same rack. The system goes to another rack only if it needs the third copy. Figure 2 shows a simplified depiction of HDFS and its data block copying method.

Figure 2: The Hadoop Distributed File System, or HDFS
[The figure shows a client consulting a single NameNode, which maps each file to its constituent blocks, and four DataNodes that hold replicated copies of those blocks.]
Source: Apache Software Foundation, IBM, and PricewaterhouseCoopers, 2008

HDFS does not perform tasks such as changing specific numbers in a list or other changes on parts of a database. This limitation leads some to assume that HDFS is not suitable for structured data. "HDFS was never designed for structured data and therefore it's not optimal to perform queries on structured data," says Daniel Abadi, assistant professor of computer science at Yale University. Abadi and others at Yale have done performance testing on the subject, and they have created a relational database alternative to HDFS called HadoopDB to address the performance issues they identified.8

Some developers are structuring data in ways that are suitable for HDFS; they're just doing it differently from the way relational data would be structured. Nathan Marz, a lead engineer at BackType, a company that offers a search engine for social media buzz, uses schemas to ensure consistency and avoid data corruption. "A lot of people think that Hadoop is meant for unstructured data, like log files," Marz says. "While Hadoop is great for log files, it's also fantastic for strongly typed, structured data." For this purpose, Marz uses Thrift, which was developed by Facebook for data translation and serialization purposes.9 (See the discussion of Thrift later in this article.) Figure 3 illustrates a typical Hadoop data processing flow that includes Thrift and MapReduce.

Figure 3: Hadoop ecosystem overview
[The figure shows less-structured input data (log files, messages, images) flowing through input applications such as Cascading, Thrift, Zookeeper, and Pig into core Hadoop, where jobs run map and reduce tasks over 64MB blocks; the results feed output applications such as mashups, RDBMS applications, and BI systems.]
Source: PricewaterhouseCoopers, derived from Apache Software Foundation and Dion Hinchcliffe, 2010



MapReduce

MapReduce is the base programming framework for Hadoop. It often acts as a bridge between HDFS and tools that are more accessible to most programmers. According to those at Google who developed the tool, "it hides the details of parallelization" and the other nuts and bolts of HDFS.10

MapReduce is a layer of abstraction, a way of managing a sea of details by creating a layer that captures and summarizes their essence. That doesn't mean it is easy to use. Many developers choose to work with another tool, yet another layer of abstraction on top of it. "I avoid using MapReduce directly at all cost," Marz says. "I actually do almost all my MapReduce work with a library called Cascading."

The terms "map" and "reduce" refer to steps the tool takes to distribute, or map, the input for parallel processing, and then reduce, or aggregate, the processed data into output files. (See Figure 4.) MapReduce works with key-value pairs. Frequently with Web data, the keys consist of URLs and the values consist of Web page content, such as Hypertext Markup Language (HTML).

MapReduce's main value is as a platform with a set of APIs. Before MapReduce, fewer programmers could take advantage of distributed computing. Now that user-accessible tools have been designed, simpler programming is possible on massively parallel systems and less adaptation of the programs is required. The following sections examine some of these tools.

Figure 4: MapReduce phases
[The figure shows input key-value pairs from multiple data stores passing through map tasks, which emit values grouped by key; a barrier aggregates the intermediate values by output key; reduce tasks then produce the final values for each key.]
Source: Google, 2004, and Cloudera, 2009
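The canonical illustration of these phases is a word count: each map task emits a (word, 1) pair for every word it reads, and each reduce task sums the values that arrive for a given word. The following minimal version, written against the standard Hadoop Java API (a generic example, not code from any project described in this article), shows how little of the distribution logic the programmer sees.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each input line, emit (word, 1).
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: sum the counts that arrive for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word-count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework supplies the barrier shown in Figure 4: it sorts the intermediate pairs, groups them by key, and delivers each group to a reduce task; the optional combiner pre-aggregates counts on the map side to cut shuffle traffic.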



“You can code in whatever JVM-based language you want, and then
shove that into the cluster.” —Chris Wensel of Concurrent

Cascading

Wensel, who created Cascading, calls it an alternative API to MapReduce, a single library of operations that developers can tap. It's another layer of abstraction that helps bring what programmers ordinarily do in non-distributed environments to distributed computing. With it, he says, "you can code in whatever JVM-based [Java Virtual Machine] language you want, and then shove that into the cluster."

Wensel wanted to obviate the need for "thinking in MapReduce." When using Cascading, developers don't think in key-value pair terms—they think in terms of fields and lists of values called "tuples." A Cascading tuple is simpler than a database record but acts like one. Each tuple flows through "pipe" assemblies, which are comparable to Java classes. The data flow begins at the source, an input file, and ends with a sink, an output directory. (See Figure 5.)

Figure 5: A Cascading assembly
[The figure shows tuples with field names flowing from a source through a series of pipes to a sink.]
Source: Concurrent, 2010

Rather than approach map and reduce phases large-file by large-file, developers assemble flows of operations using functions, filters, aggregators, and buffers. Those flows make up the pipe assemblies, which, in Marz's terms, "compile to MapReduce." In this way, Cascading smoothes the bumpy MapReduce terrain so more developers—including those who work mainly in scripting languages—can build flows. (See Figure 6.)

Figure 6: Cascading assembly and flow
[The figure shows pipe assemblies on the client being translated into MapReduce jobs that run as map and reduce tasks on the cluster.]
Source: Concurrent, 2010
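In Cascading, the same word count is expressed as a pipe assembly rather than as map and reduce classes. The sketch below follows the word-count example in the Cascading 1.x documentation; class names and signatures have shifted between Cascading releases, so treat it as illustrative rather than definitive.

import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class CascadingWordCount {
  public static void main(String[] args) {
    // Source and sink taps: where tuples enter and leave the flow.
    Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
    Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

    // Pipe assembly: split lines into words, group by word, count each group.
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // The planner translates the assembly into one or more MapReduce jobs.
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, CascadingWordCount.class);
    Flow flow = new FlowConnector(properties).connect("word-count", source, sink, assembly);
    flow.complete();
  }
}

The FlowConnector is the planner that Figure 6 depicts: it compiles the assembly of Each, GroupBy, and Every pipes into MapReduce jobs and runs them on the cluster.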



Some useful tools for MapReduce-style analytics programming

Open-source tools that work via MapReduce on Hadoop clusters are proliferating. Users and developers don't seem concerned that Google received a patent for MapReduce in January 2010. In fact, Google, IBM, and others have encouraged the development and use of open-source versions of these tools at various research universities.11 A few of the more prominent tools relevant to analytics, and used by developers we've interviewed, are listed in the sections that follow.

Clojure

Clojure creator Rich Hickey wanted to combine aspects of C or C#, LISP (for list processing, a language associated with artificial intelligence that's rich in mathematical functions), and Java. The letters C, L, and J led him to name the language, which is pronounced "closure." Clojure combines a LISP library with Java libraries. Clojure's mathematical and natural language processing (NLP) capabilities and the fact that it is JVM based make it useful for statistical analysis on Hadoop clusters. FlightCaster, a commercial-airline-delay-prediction service, uses Clojure on top of Cascading, on top of MapReduce and Hadoop, for "getting the right view into unstructured data from heterogeneous sources," says Bradford Cross, FlightCaster co-founder.

LISP has attributes that lend themselves to NLP, making Clojure especially useful in NLP applications. Mark Watson, an artificial intelligence consultant and author, says most LISP programming he's done is for NLP. He considers LISP to be four times as productive for programming as C++ and twice as productive as Java. His NLP code "uses a huge amount of memory-resident data," such as lists of proper nouns, text categories, common last names, and nationalities.

With LISP, Watson says, he can load the data once and test multiple times. In C++, he would need to use a relational database and reload each time for a program test. Using LISP makes it possible to create and test small bits of code in an iterative fashion, a major reason for the productivity gains.

This iterative, LISP-like program-programmer interaction with Clojure leads to what Hickey calls "dynamic development." Any code entered in the console interface, he points out, is automatically compiled on the fly.

"Getting the right view into unstructured data from heterogeneous sources can be quite tricky." —Bradford Cross of FlightCaster

Thrift

Thrift, initially created at Facebook in 2007 and then released to open source, helps developers create services that communicate across languages, including C++, C#, Java, Perl, Python, PHP, Erlang, and Ruby. With Thrift, according to Facebook, users can "define all the necessary data structures and interfaces for a complex service in a single short file."

A more important aspect of Thrift, according to BackType's Marz, is its ability to create strongly typed data and flexible schemas. Countering the emphasis of the so-called NoSQL community on schema-less data, Marz asserts there are effective ways to lightly structure the data in Hadoop-style analysis.

Marz uses Thrift's serialization features, which turn objects into a sequence of bits that can be stored as files, to create schemas between types (for instance, differentiating between text strings and long, 64-bit integers) and schemas between relationships (for instance, linking Twitter accounts that share a common interest). Structuring the data in this way helps BackType avoid inconsistencies in the data or the need to manually filter for some attributes.

BackType can use required and optional fields to structure the Twitter messages it crawls and analyzes. The required fields can help enforce data type. The optional fields, meanwhile, allow changes to the schema as well as the use of old objects that were created using the old schema.
Cross of FlightCaster



Marz's use of Thrift to model social graphs like the one in Figure 7 demonstrates the flexibility of the schema for Hadoop-style computing. Thrift essentially enables modularity in the social graph described in the schema. For example, to select a single age for each person, BackType can take into account all the raw age data. It can do this by a computation on the entire data set or a selective computation on only the people in the data set who have new data.

Figure 7: An example of a social graph modeled using a Thrift schema
[The figure shows three people linked to one another: Bob (gender male, age 39), Alice (gender female, age 25), and Charlie (gender male, age 22), with the structure defined in an Apache Thrift schema (language: C++).]
Source: Nathan Marz, 2010

BackType doesn't just work with raw data. It runs a series of jobs that constantly normalize and analyze new data coming in, and then other jobs that write the analyzed data to a scalable random-access database such as HBase or Cassandra.12

Open-source, non-relational data stores

Non-relational data stores have become much more numerous since the Apache Hadoop project began in 2007. Many are open source. Developers of these data stores have optimized each for a different kind of data. When contrasted with relational databases, these data stores lack many design features that can be essential for enterprise transactional data. However, they are often well tailored to specific, intended purposes, and they offer the added benefit of simplicity. Primary non-relational data store types include the following:

• Multidimensional map store—Each record maps a row name, a column name, and a time stamp to a value. Map stores have their heritage in Google's Bigtable.

• Key-value store—Each record consists of a key, or unique identifier, mapped to one or more values.

• Graph store—Each record consists of elements that together form a graph. Graphs depict relationships. For example, social graphs describe relationships between people. Other graphs describe relationships between objects, between links, or both.

• Document store—Each record consists of a document. Extensible Markup Language (XML) databases, for example, store XML documents.

Because of their simplicity, map and key-value stores can have scalability advantages over most types of relational databases. (HadoopDB, a hybrid approach developed at Yale University, is designed to overcome the scalability problems associated with relational databases.) Table 1 provides a few examples of the open-source, non-relational data stores that are available.

Table 1: Example open-source, non-relational data stores

Map: HBase, Hypertable, Cassandra
Key-value: Tokyo Cabinet/Tyrant, Project Voldemort, Redis
Document: MongoDB, CouchDB, Xindice
Graph: Resource Description Framework (RDF), Neo4j, InfoGrid

Source: PricewaterhouseCoopers, Daniel Abadi of Yale University, and organization Web sites, 2010
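BackType's pattern of writing analyzed results into a random-access store such as HBase shows how a map store gets used next to Hadoop. The following minimal sketch uses the HBase Java client to write and read one cell; the table, column family, and row key are invented for the example, and the client API has changed across HBase versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AccountStoreExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "accounts"); // hypothetical table

    // Each cell is addressed by row key, column family:qualifier, and timestamp.
    Put put = new Put(Bytes.toBytes("twitter:alice"));
    put.add(Bytes.toBytes("profile"), Bytes.toBytes("age"), Bytes.toBytes("25"));
    table.put(put);

    Get get = new Get(Bytes.toBytes("twitter:alice"));
    Result result = table.get(get);
    byte[] age = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("age"));
    System.out.println("age = " + Bytes.toString(age));

    table.close();
  }
}

Because every value is addressed by that (row, column, timestamp) coordinate, which is the multidimensional map described above, batch jobs can bulk-load analyzed results that front-end services then fetch by key.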



“We established that Hadoop does horizontally scale. This is what’s really
exciting, because I’m an RDBMS guy, right? I’ve done that for years, and
you don’t get that kind of constant scalability no matter what you do.”
—Scott Thompson of Disney

Other related technologies and vendors

A comprehensive review of the various tools created for the Hadoop ecosystem is beyond the scope of this article, but a few of the tools merit brief description here because they've been mentioned elsewhere in this issue:

• Pig—A scripting language called Pig Latin, which is a primary feature of Apache Pig, allows more concise querying of data sets "directly from the console" than is possible using MapReduce, according to author Tom White.

• Hive—Hive is designed as "mainly an ETL [extract, transform, and load] system" for use at Facebook, according to Chris Wensel.

• Zookeeper—Zookeeper provides an interface for creating distributed applications, according to Apache.

Big Data covers many vendor niches, and some vendors' products take advantage of the Hadoop stack or add to its capabilities. (See the sidebar "Selected Big Data tool vendors.")

Conclusion

Interest in and adoption of Hadoop clusters are growing rapidly. Reasons for Hadoop's popularity include:

• Open, dynamic development—The Hadoop/MapReduce environment offers cost-effective distributed computing to a community of open-source programmers who've grown up on Linux and Java, and scripting languages such as Perl and Python. Some are taking advantage of functional programming language dialects such as Clojure. The openness and interaction can lead to faster development cycles.

• Cost-effective scalability—Horizontal scaling from a low-cost base implies a feasible long-term cost structure for more kinds of data. Scott Thompson, vice president of infrastructure at the Disney Technology Shared Services Group, says, "We established that Hadoop does horizontally scale. This is what's really exciting, because I'm an RDBMS guy, right? I've done that for years, and you don't get that kind of constant scalability no matter what you do."

• Fault tolerance—Associated with scalability is the assumption that some nodes will fail. Hadoop and MapReduce are fault tolerant, another reason commodity hardware can be used.

• Suitability for less-structured data—Perhaps most importantly, the methods that Google pioneered, and that Yahoo and others expanded, focus on what Cloudera's Awadallah calls "complex" data. Although developers such as Marz understand the value of structuring data, most Hadoop/MapReduce developers don't have an RDBMS mentality. They have an NLP mentality, and they're focused on techniques optimized for large amounts of less-structured information, such as the vast amount of information on the Web.

The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze the Big Data they didn't have the means to analyze before. This set of methods is separate from, yet complements, data warehousing. Understanding what Hadoop clusters do and how they do it is fundamental to deciding when and where enterprises should consider making use of them.



Selected Big Data tool vendors

Amazon
Amazon provides a Hadoop framework on its Elastic Compute Cloud (EC2) and S3 storage service it calls Elastic MapReduce.

Appistry
Appistry's CloudIQ Storage platform offers a substitute for HDFS, one designed to eliminate the single point of failure of the NameNode.

Cloudera
Cloudera takes a Red Hat approach to Hadoop, offering its own distribution on EC2/S3 with management tools, training, support, and professional services.

Cloudscale
Cloudscale's first product, Cloudcel, marries an Excel-based interface to a back end that's a massively parallel stream processing engine. The product is designed to process stored, historical, or real-time data.

Concurrent
Concurrent developed Cascading, for which it offers licensing, training, and support.

Drawn to Scale
Drawn to Scale offers an analytical and transactional database product on Hadoop and HBase, with occasional consulting.

IBM
IBM introduced a distribution of Hadoop called BigInsights in May 2010. The company's jStart team offers briefings and workshops on Hadoop pilots. IBM BigSheets acts as an aggregation, analysis, and visualization point for large amounts of Web data.

Microsoft
Microsoft Pivot uses the company's Deep Zoom technology to provide visual data browsing capabilities for XML files. Azure Table services is in some ways comparable to Bigtable or HBase. (See the interview with Mark Taylor and Ray Velez of Razorfish on page 46.)

ParaScale
ParaScale offers software for enterprises to set up their own public or private cloud storage environments with parallel processing and large-scale data handling capability.

1 FLOPS stands for "floating point operations per second." Floating point processors use more bits to store each value, allowing more precision and ease of programming than fixed point processors. One petaflop is upwards of one quadrillion floating point operations per second.

2 Brough Turner, "Google Surpasses Supercomputer Community, Unnoticed?" May 20, 2008, http://blogs.broughturner.com/communications/2008/05/google-surpasses-supercomputer-community-unnoticed.html (accessed April 8, 2010).

3 See, for example, Tim Kientzle, "Beowulf: Linux clustering," Dr. Dobb's Journal, November 1, 1998, Factiva Document dobb000020010916dub100045 (accessed April 9, 2010).

4 Luis Barroso, Jeffrey Dean, and Urs Hoelzle, "Web Search for a Planet: The Google Cluster Architecture," Google Research Publications, http://research.google.com/archive/googlecluster.html (accessed April 10, 2010).

5 See http://sortbenchmark.org/ and http://developer.yahoo.net/blog/ (accessed April 9, 2010).

6 Tom White, Hadoop: The Definitive Guide (Sebastopol, CA: O'Reilly Media, 2009), 4.

7 See Derek Gottfrid, "Self-service, Prorated Super Computing Fun!" The New York Times Open Blog, November 1, 2007, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ (accessed June 4, 2010) and Bill Snyder, "Cloud Computing: Not Just Pie in the Sky," CIO, March 5, 2008, Factiva Document CIO0000020080402e4350000 (accessed March 28, 2010).

8 See "HadoopDB" at http://db.cs.yale.edu/hadoopdb/hadoopdb.html (accessed April 11, 2010).

9 Nathan Marz, "Thrift + Graphs = Strong, flexible schemas on Hadoop," http://nathanmarz.com/blog/schemas-on-hadoop/ (accessed April 11, 2010).

10 Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Google Research Publications, December 2004, http://labs.google.com/papers/mapreduce.html (accessed April 22, 2010).

11 See Dean, et al., US Patent No. 7,650,331, January 19, 2010, at http://www.uspto.gov. For an example of the participation by Google and IBM in Hadoop's development, see "Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges," Google press release, October 8, 2007, http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html (accessed March 28, 2010).

12 See the Apache site at http://apache.org/ for descriptions of many tools that take advantage of MapReduce and/or HDFS that are not profiled in this article.



Hadoop's foray into the enterprise

Cloudera's Amr Awadallah discusses how and why diverse companies are trying this novel approach.

Interview conducted by Alan Morrison, Bo Parker, and Vinod Baya

Amr Awadallah is vice president of engineering and CTO at Cloudera, a company that
offers products and services around Hadoop, an open-source technology that allows
efficient mining of large, complex data sets. In this interview, Awadallah provides an
overview of Hadoop’s capabilities and how Cloudera customers are using them.

PwC: Were you at Yahoo before coming to Cloudera?

AA: Yes. I was with Yahoo from mid-2000 until mid-2008, starting with the Yahoo Shopping team after selling my company VivaSmart to Yahoo. Beginning in 2003, my career shifted toward business intelligence and analytics at consumer-facing properties such as Yahoo News, Mail, Finance, Messenger, and Search. I had the daunting task of building a very large data warehouse infrastructure that covered all these diverse products and figuring out how to bring them together.

That is when I first experienced Hadoop. Its model of "mine first, govern later" fits in with the well-governed infrastructure of a data mart, so it complements these systems very well. Governance standards are important for maintaining a common language across the organization. However, they do inhibit agility, so it's best to complement a well-governed data mart with a more agile complex data processing system like Hadoop.

PwC: How did Yahoo start using Hadoop?

AA: In 2005, Yahoo was faced with a business challenge. The cost of creating the Web search index was approaching the revenues being made from the keyword advertising on the search pages. Yahoo Search adopted Hadoop as an economically scalable solution, and worked on it in conjunction with the open-source Apache Hadoop community. Yahoo played a very big role in the evolution of Hadoop to where it is today. Soon after the Yahoo Search team started using Hadoop, other parts of the company began to see the power and flexibility that this system offers. Today, Yahoo uses Hadoop for data warehousing, mail spam detection, news feed processing, and content/ad targeting.

PwC: What are some of the advantages of Hadoop when you compare it with RDBMSs [relational database management systems]?

AA: With Oracle, Teradata, and other RDBMSs, you must create the table and schema first. You say, this is what I'm going to be loading in, these are the types of columns I'm going to load in, and then you load your data. That process can inhibit how fast you can evolve your data model and schemas, and it can limit what you log and track.

With Hadoop, it's the other way around. You load all of your data, such as XML [Extensible Markup Language], tab delimited flat files, Apache log files, JSON [JavaScript Object Notation], etc. Then in Hive or Pig [both of which are Hadoop data query tools], you point your metadata toward the file and parse the data on the fly when reading it out. This approach lets you extract the columns that map to the data structure you're interested in.

Creating the structure on the read path like this can have its disadvantages; however, it gives you the agility and the flexibility to evolve your schema much quicker without normalizing your data first. In general, relational systems are not well suited for quickly evolving complex data types.

Another benefit is retroactive schemas. For example, an engineer launching a new product feature can add the logging for it, and that new data will start flowing directly into Hadoop. Weeks or months later, a data analyst can update their read schema on how to parse this new data. Then they will immediately be able to query the history of this metric since it started flowing in [as opposed to waiting for the RDBMS schema to be updated and the ETL (extract, transform, and load) processes to reload the full history of that metric].

"We are not talking about a replacement technology for data warehouses—let's be clear on this. No customers are using Hadoop in that fashion."

PwC: What about the cost advantages?

AA: The cost basis is 10 to 100 times cheaper than other solutions. But it's not just about cost. Relational databases are really good at what they were designed for, which is running interactive SQL queries against well-structured data. We are not talking about a replacement technology for data warehouses—let's be clear on this.

No customers are using Hadoop in that fashion. They recognize that the nature of data is changing. Where is the data growing? It's growing around complex data types. Is a relational container the best and most interesting place to ask questions of complex plus relational data? Probably not, although organizations still need to use, collect, and present relational data for questions that are routine and require, in some cases, a real-time response.

PwC: How have companies benefited from querying across both structured and complex data?

AA: When you query against complex data types, such as Web log files and customer support forums, as well as against the structured data you have already been collecting, such as customer records, sales history, and transactions, you get a much more accurate answer to the question you're asking. For example, a large credit card company we've worked with can identify which transactions are most likely fraudulent and can prioritize which accounts need to be addressed.

PwC: Are the companies you work with aware that this is a totally different paradigm?

AA: Yes and no. The main use case we see is in companies that have a mix of complex data and structured data that they want to query across. Some large financial institutions that we talk to have 10, 20, or even hundreds of Oracle systems—it's amazing. They have all of these file servers storing XML files or log files, and they want to consolidate all these tables and files onto one platform that can handle both data types so they can run comprehensive queries. This is where Hadoop really shines; it allows companies to run jobs across both data types.



Revising the CIO's data playbook

Start by adopting a fresh mind-set, grooming the right talent, and piloting new tools to ride the next wave of innovation.

By Jimmy Guterman



Like pioneers exploring a new territory, a few enterprises are making discoveries by exploring Big Data. The terrain is complex and far less structured than the data CIOs are accustomed to. And it is growing by exabytes each year. But it is also getting easier and less expensive to explore and analyze, in part because software tools built to take advantage of cloud computing infrastructures are now available. Our advice to CIOs: You don't need to rush, but do begin to acquire the necessary mind-set, skill set, and tool kit.

These are still the early days. The prime directive for any CIO is to deliver value to the business through technology. One way to do that is to integrate new technologies in moderation, with a focus on the long-term opportunities they may yield. Leading CIOs pride themselves on waiting until a technology has proven value before they adopt it. Fair enough.

However, CIOs who ignore the Big Data trends described in the first two articles risk being marginalized in the C-suite. As they did with earlier technologies, including traditional business intelligence, business unit executives are ready to seize the Big Data opportunity and make it their own. This will be good for their units and their careers, but it would be better for the organization as a whole if someone—the CIO is the natural person—drove a single, central, cross-enterprise Big Data initiative.

With this in mind, PricewaterhouseCoopers encourages CIOs to take these steps:

• Start to add the discipline and skill set for Big Data to your organizations; the people for this may or may not come from existing staff.

• Set up sandboxes (which you can rent or buy) to experiment with Big Data technologies.

• Understand the open-source nature of the tools and how to manage risk.

Enterprises have the opportunity to analyze more kinds of data more cheaply than ever before. It is also important to remember that Big Data tools did not originate with vendors that were simply trying to create new markets. The tools sprang from a real need among the enterprises that first confronted the scalability and cost challenges of Big Data—challenges that are now felt more broadly. These pioneers also discovered the need for a wider variety of talent than IT has typically recruited.

Enterprises have the opportunity to analyze more kinds of data more cheaply than ever before. It is also important to remember that Big Data tools did not originate with vendors that were simply trying to create new markets.



Big Data lessons from Web companies

Today's CIO literature is full of lessons you can learn from companies such as Google. Some of the comparisons are superficial because most companies do not have a Web company's data complexities and will never attain the original singleness of purpose that drove Google, for example, to develop Big Data innovations. But there is no niche where the development of Big Data tools, techniques, mind-set, and usage is greater than in companies such as Google, Yahoo, Facebook, Twitter, and LinkedIn. And there is plenty that CIOs can learn from these companies. Every major service these companies create is built on the idea of extracting more and more value from more and more data.

For example, the 1-800-GOOG-411 service, which individuals can call to get telephone numbers and addresses of local businesses, does not merely take an ax to the high-margin directory assistance services run by incumbent carriers (although it does that). That is just a by-product. More important, the 800-number service has let Google compile what has been described as the world's largest database of spoken language. Google is using that database to improve the quality of voice recognition in Google Voice, in its sundry mobile-phone applications, and in other services under development. Some of the ways companies such as Google capture data and convert it into services are listed in Table 1.

Table 1: Web portal Big Data strategy (services and the data that Web companies capture from them)

Self-serve advertising: Ad-clicking and -picking behavior
Analytics: Aggregated Web site usage tracking
Social networking: Sundry online
Browser: Limited browser behaviors
E-mail: Words used in e-mails
Search engine: Searches and clicking information
RSS feeds: Detailed reading habits
Extra browser functionality: All browser behavior
View videos: All site behavior
Free directory assistance: Database of spoken words

Source: PricewaterhouseCoopers, 2010



“I see inspiration from the Google model ... just having lots of
cheap stuff that you can use to crunch vast quantities of data.”
—Phil Buckle of the UK National Policing Improvement Agency

Many Web companies are finding opportunity in "gray data." Gray data is the raw and unvalidated data that arrives from various sources, in huge quantities, and not in the most usable form. Yet gray data can deliver value to the business even if the generators of that content (for example, people calling directory assistance) are contributing that data for a reason far different from improving voice-recognition algorithms. They just want the right phone number; the data they leave is a gift to the company providing the service.

The new technologies and services described in the article, "Building a bridge to the rest of your data," on page 22 are making it possible to search for enterprise value in gray data in agile ways at low cost. Much of this value is likely to be in the area of knowing your customers, a sure path for CIOs looking for ways to contribute to company growth and deepen their relationships with the rest of the C-suite.

What Web enterprise use of Big Data shows CIOs, most of all, is that there is a way to think and manage differently when you conclude that standard transactional data analysis systems are not and should not be the only models. New models are emerging. CIOs who recognize these new models without throwing away the legacy systems that still serve them well will see that having more than one tool set, one skill set, and one set of controls makes their organizations more sophisticated, more agile, less expensive to maintain, and more valuable to the business.

The business case

Besides Google, Yahoo, and other Web-based enterprises that have complex data sets, there are stories of brick and mortar organizations that will be making more use of Big Data. For example, Rollin Ford, Wal-Mart's CIO, told The Economist earlier this year, "Every day I wake up and ask, 'How can I flow data better, manage data better, analyze data better?'" The answer to that question today implies a budget reallocation, with less-expensive hardware and software carrying more of the load. "I see inspiration from the Google model and the notion of moving into commodity-based computing—just having lots of cheap stuff that you can use to crunch vast quantities of data. I think that really contrasts quite heavily with the historic model of paying lots of money for really specialist stuff," says Phil Buckle, CTO of the UK's National Policing Improvement Agency, which oversees law enforcement infrastructure nationwide. That's a new mind-set for the CIO, who ordinarily focuses on keeping the plumbing and the data it carries safe, secure, in-house, and functional.

Seizing the Big Data initiative would give CIOs in particular and IT in general more clout in the executive suite. But are CIOs up to the task? "It would be a positive if IT could harness unstructured data effectively," former Gartner analyst Howard Dresner, president and founder of Dresner Advisory Services, observes. "However, they haven't always done a great job with structured data, and unstructured is far more complex and exists predominantly outside the firewall and beyond their control."

Tools are not the issue. Many evolving tools, as noted in the previous article, come from the open-source community; they can be downloaded and experimented with for low cost and are certainly up to supporting any pilot project. More important is the aforementioned mind-set and a new kind of talent IT will need.



“The talent demand isn’t so much for Java developers or statisticians
per se as it is for people who know how to work with denormalized
data.” —Ray Velez of Razorfish

To whom does the future of IT belong?

The ascendance of Big Data means that CIOs need a more data-centric approach. But what kind of talent can help a CIO succeed in a more data-centric business environment, and what specific skills do the CIO's teams focused on the area need to develop and balance?

Hal Varian, a University of California, Berkeley, professor and Google's chief economist, says, "The sexy job in the next 10 years will be statisticians." He and others, such as IT and management professor Erik Brynjolfsson at the Massachusetts Institute of Technology (MIT), contend this demand will happen because the amount of data to be analyzed is out of control. Those who can make sense of the flood will reap the greatest rewards. They have a point, but the need is not just for statisticians—it's for a wide range of analytically minded people.

Today, larger companies still need staff with expertise in package implementations and customizations, systems integration, and business process reengineering, as well as traditional data management and business intelligence that's focused on transactional data. But there is a growing role for people with flexible minds to analyze data and suggest solutions to problems or identify opportunities from that data.

In Silicon Valley and elsewhere, where businesses such as Google, Facebook, and Twitter are built on the rigorous and speedy analysis of data, programming frameworks such as MapReduce (which works with Hadoop) and NoSQL (a database approach for non-relational data stores) are becoming more popular.

Chris Wensel, who created Cascading (an alternative application programming interface [API] to MapReduce) and straddles the worlds of startups and entrenched companies, says, "When I talk to CIOs, I tell them: 'You know those people you have who know about data. You probably don't use those people as much as you should. But once you take advantage of that expertise and reallocate that talent, you can take advantage of these new techniques.'"

The increased emphasis on data analysis does not mean that traditional programmers will be replaced by quantitative analysts or data warehouse specialists. "The talent demand isn't so much for Java developers or statisticians per se as it is for people who know how to work with denormalized data," says Ray Velez, CTO at Razorfish, an interactive marketing and technology consulting firm involved in many Big Data initiatives. "It's about understanding how to map data into a format that most people are not familiar with. Most people understand SQL and the relational format, so the real skill set evolution doesn't have quite as much to do with whether it's Java or Python or other technologies."

Velez points to Bill James as a useful case. James, a baseball writer and statistician, challenged conventional wisdom by taking an exploratory mind-set to baseball statistics. He literally changed how baseball management makes talent decisions, and even how they manage on the field. In fact, James became senior advisor for baseball operations in the Boston Red Sox's front office.


For example, James showed that batting average is less an indicator of a player's future success than how often he's involved in scoring runs—getting on base, advancing runners, or driving them in. In this example and many others, James used his knowledge of the topic, explored the data, asked questions no one had asked, and then formulated, tested, and refined hypotheses.

Says Velez: "Our analytics team within Razorfish has the James types of folks who can help drive different thinking and envision possibilities with the data. We need to find a lot more of those people. They're not very easy to find. There is an aspect of James that just has to do with boldness and courage, a willingness to challenge those who are in the habit of using metrics they've been using for years."

The CIO will need people throughout the organization who have all sorts of relevant analysis and coding skills, who understand the value of data, and who are not afraid to explore. This does not mean the end of the technology- or application-centric organizational chart of the typical IT organization. Rather, it means the addition of a data-exploration dimension that is more than one or two people. These people will be using a blend of tools that differ depending on requirements, as Table 2 illustrates. More of the tools will be open source than in the past.

Skills: Natural language processing and text mining
Tools (a sampler): Clojure, Redis, Scala, Crane, other Java functional language libraries, Python Natural Language ToolKit
Comments: To some extent, each of these serves as a layer of abstraction on top of Hadoop. Those familiar keep adding layers on top of layers. FlightCaster, for example, uses a stack consisting of Amazon S3 -> Amazon EC2 -> Cloudera -> HDFS -> Hadoop -> Cascading -> Clojure. [1]

Skills: Data mining
Tools (a sampler): R, MATLAB
Comments: R is more suited to finance and statistics, whereas MATLAB is more engineering oriented. [2]

Skills: Scripting and NoSQL database programming skills
Tools (a sampler): Python and related frameworks, HBase, Cassandra, CouchDB, Tokyo Cabinet
Comments: These lend themselves to, or are based on, functional languages such as LISP or languages comparable to LISP. CouchDB, for example, is written in Erlang. [3] (See the discussion of Clojure and LISP on page 30.)

Table 2: New skills and tools for the IT department


Source: Cited online postings and PricewaterhouseCoopers, 2008–2010

[1] Pete Skomoroch, “How FlightCaster Squeezes Predictions from Flight Data,” Data Wrangling blog, August 24, 2009, http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data (accessed May 14, 2010).

[2] Brendan O’Connor, “Comparison of data analysis packages,” AI and Social Science blog, February 23, 2009, http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/ (accessed May 25, 2010).

[3] Scripting languages such as Python run more slowly than Java, but developers sometimes make the tradeoff to increase their own productivity. Some companies have created their own frameworks and released these to open source. See Klaas Bosteels, “Python + Hadoop = Flying Circus Elephant,” Last.HQ Last.fm blog, May 29, 2008, http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant (accessed May 14, 2010).
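To make the Python-and-Hadoop row of Table 2 concrete, the following is a minimal sketch of a MapReduce job written for Hadoop Streaming, which lets mapper and reducer scripts simply read standard input and write standard output. The log format and field position are hypothetical; real clickstream data would need its own parsing.

#!/usr/bin/env python
# mapper.py -- a minimal Hadoop Streaming mapper for a hypothetical log
# format. Emits one tab-separated (url, 1) pair per input line, assuming
# the URL is the third whitespace-separated field.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 3:
        continue  # skip malformed lines
    print("%s\t1" % fields[2])

#!/usr/bin/env python
# reducer.py -- sums the counts emitted by the mapper. Hadoop sorts the
# mapper output by key, so identical keys arrive as a contiguous run.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        count += int(value)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))

A pair of scripts like these would be submitted to a cluster with the hadoop-streaming JAR, passing them as the -mapper and -reducer arguments; the same logic can be tested on a single machine by piping a sample file through the mapper, a sort, and the reducer.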



“Every technology department has a skunkworks, no matter how
informal—a sandbox where they can test and prove technologies.
That’s how open source entered our organization. A small Hadoop
installation might be a gateway that leads you to more open source.
But it might turn out to be a neat little open-source project that
sits by itself and doesn’t bother anything else.” —CIO of a small
Massachusetts company

Where do CIOs find such talent? Start with your own enterprise. For example, business analysts managing the marketing department’s lead-generation systems could be promoted onto an IT data staff charged with exploring the data flow. Most large consumer-oriented companies already have people in their business units who can analyze data and suggest solutions to problems or identify opportunities. These people need to be groomed and promoted, and more of them hired for IT, to enable the entire organization, not just the marketing department, to reap the riches.

Set up a sandbox

Although the business case CIOs can make for Big Data is inarguable, even inarguable business cases carry some risk. Many CIOs will look at the risks associated with Big Data and find a familiar canard. Many Big Data technologies—Hadoop in particular—are open source, and open source is often criticized for carrying too much risk.

The open-source versus proprietary technology argument is nothing new. CIOs who have tried to implement open-source programs, from the Apache Web server to the Drupal content-management system, have faced the usual arguments against code being available to all comers. Some of those arguments, especially concerns revolving around security and reliability, verge on the specious. Google built its internal Web servers atop Apache. And it would be difficult to find a Big Data site as reliable as Google’s.

Clearly, one challenge CIOs face has nothing to do with data or skill sets. Open-source projects become available earlier in their evolution than do proprietary alternatives. In this respect, Big Data tools are less stable and complete than are Apache or Linux open-source tool kits.

Introducing an open-source technology such as Hadoop into a mostly proprietary environment does not necessarily mean turning the organization upside down. A CIO at a small Massachusetts company says, “Every technology department has a skunkworks, no matter how informal—a sandbox where they can test and prove technologies. That’s how open source entered our organization. A small Hadoop installation might be a gateway that leads you to more open source. But it might turn out to be a neat little open-source project that sits by itself and doesn’t bother anything else. Either can be OK, depending on the needs of your company.”

Bud Albers, executive vice president and CTO of Disney Technology Shared Services Group, concurs. “It depends on your organizational mind-set,” he says. “It depends on your organizational capability. There is a certain ‘don’t try this at home’ kind of warning that goes with technologies like Hadoop. You have to be willing at this stage of its maturity to maybe have a little higher level of capability to go in.”



PricewaterhouseCoopers agrees with those sentiments and strongly urges large enterprises to establish a sandbox dedicated to Big Data and Hadoop/MapReduce. This move should be standard operating procedure for large companies in 2010, as should a small, dedicated staff of data explorers and modest budget for the efforts. For more information on what should be in your sandbox, refer to the article, “Building a bridge to the rest of your data,” on page 22. And for some ideas on how the sandbox could fit in your organization chart, see Figure 1.

Figure 1: Where a data exploration team might fit in an organization chart (labels in the chart: VP of IT; Director of application development; Director of data analysis; Data analysis team; Data exploration team; Marketing, Web site, Sales, Operations, and Finance managers)
Source: PricewaterhouseCoopers, 2010

Different companies will want to experiment with Hadoop in different ways, or segregate it from the rest of the IT infrastructure with stronger or weaker walls. The CIO must determine how to encourage this kind of experimentation.

Understand and manage the risks

Some of the risks associated with Big Data are legitimate, and CIOs must address them. In the case of Hadoop clusters, security is a pressing question: it was a feature added as the project developed, not cooked in from the beginning. It’s still far from perfect. Many open-source projects start as projects intended to prove a concept or solve a particular problem. Some, such as Linux or Mozilla, become massive successes, but they rarely start with the sort of requirements a CIO faces when introducing systems to corporate settings.

Beyond open source, regardless of which tools are used to manipulate data, there are always risks associated with making decisions based on the analysis of Big Data. To give one dramatic example, the recent financial crisis was caused in part by banks and rating agencies whose models for understanding value at risk and the potential for securities based on subprime mortgages to fail were flat-out wrong. Just as there is risk in data that is not sufficiently clean, there is risk in data manipulation techniques that have not been sufficiently vetted. Many times, the only way to understand big, complicated data is through the use of big, complicated algorithms, which leaves a door open to big, catastrophic mistakes in analysis.

Proactively preventing these mistakes from happening requires the risk mind-set, says Larry Best, IT risk manager at PricewaterhouseCoopers. “You have to think carefully about what can go wrong, do a quantitative analysis of the likelihood of such a mistake, and anticipate the impact if the mistake occurs.”



Table 3 includes a list of the risks associated with Big Data analysis and ways to mitigate them.

Risk: Over-reliance on insights gleaned from data analysis leads to loss
Mitigation tactic: Testing

Risk: Inaccurate or obsolete data
Mitigation tactic: Maintain strong metadata management; unverified information must be flagged

Risk: Analysis leads to paralysis
Mitigation tactic: Keep the sandbox related to the business problem or opportunity

Risk: Security
Mitigation tactic: Keep the Hadoop clusters away from the firewall, be vigilant, ask chief security officer for help

Risk: Buggy code and other glitches
Mitigation tactic: Make sure the team keeps track of modifications and other implementation history, since documentation isn’t plentiful

Risk: Rejection by other parts of the organization
Mitigation tactic: Perform change management to help improve the odds of acceptance, along with quick successes

Table 3: How to mitigate the risks of Big Data analysis


Source: PricewaterhouseCoopers, 2010

Best points out that choosing the right controls to implement in a given risk scenario is essential. The only way to make sound choices is by adopting a risk mind-set and approach that allows a focus on the most critical controls, he says. Enterprises simply don’t have the resources to implement blanket controls. The Control Objectives for Information and related Technology (COBIT) framework, a popular reference for IT risk assessment, is a “phone book of thousands of controls,” he says. Risk is not juggling a lot of balls. “Risk is knowing which balls are made out of rubber and which are made out of glass.”

By nature and by work experiences, most CIOs are risk averse. Blue-chip CIOs hold off installing new versions of software until they have been proven beyond a doubt, and these CIOs don’t standardize on new platforms until the risk for change appears to be less than the risk of stasis.

“The fundamental issue is whether IT is willing to depart from the status quo, such as an RDBMS [relational database management system], in favor of more powerful technologies,” Dresner says. “This means massive change, and IT doesn’t always embrace change.” More forward-thinking IT organizations constantly review their software portfolio and adjust accordingly.

In this case, the need to manipulate larger and larger amounts of data that companies are collecting is pressing. Even risk-averse CIOs are exploring the possibilities of Big Data for their businesses. Bud Mathaisel, CIO of the outsourcing vendor Achievo, divides the risks of Big Data and their solutions into three areas:

• Accessibility—The data repository used for data analysis should be access managed.

• Classification—Gray data should be identified as such.

• Governance—Who’s doing what with this?

Yes, Big Data is new. But accessibility, classification, and governance are matters CIOs have had to deal with for many years in many guises.
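As a purely illustrative sketch of how those three areas might show up in code, the following Python fragment gates access to an analysis repository by role, tags unvetted records as gray data, and logs who read what. The roles, field names, and repository structure are hypothetical, not a reference implementation.

import logging

logging.basicConfig(level=logging.INFO)

# Hypothetical roles that are cleared to read the analysis repository.
APPROVED_ROLES = {"data_explorer", "data_analyst"}

def fetch_records(user, role, repository):
    # Accessibility: only approved roles may read the repository.
    if role not in APPROVED_ROLES:
        raise PermissionError("%s is not cleared for this repository" % user)

    # Governance: keep a record of who is doing what with the data.
    logging.info("user=%s role=%s repository=%s", user, role, repository["name"])

    records = []
    for record in repository["records"]:
        # Classification: flag gray data -- records whose source has not
        # been vetted -- so downstream analysis can treat them differently.
        record["is_gray_data"] = not record.get("source_vetted", False)
        records.append(record)
    return records

In a real deployment, the repository interface, role definitions, and audit logging would come from the organization’s existing access management infrastructure rather than a standalone script.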



Conclusion

At many companies, Big Data is both an opportunity (what useful needles can we find in a terabyte-sized haystack?) and a source of stress (Big Data is overwhelming our current tools and methods; they don’t scale up to meet the challenge). The prefix “tera” in “terabyte,” after all, comes from the Greek word for “monster.” CIOs aiming to use Big Data to add value to their businesses are monster slayers.

CIOs don’t just manage hardware and software now; they’re expected to manage the data stored in that hardware and used by that software—and provide a framework for delivering insights from the data.

From Amazon.com to the Boston Red Sox, diverse companies compete based on what data they collect and what they learn from it. CIOs must deliver easy, reliable, secure access to that data and develop consistent, trustworthy ways to explore and wrench wisdom from that data. CIOs do not need to rush, but they do need to be prepared for the changes that Big Data is likely to require.

Perhaps the most productive way for CIOs to frame the issue is to acknowledge that Big Data isn’t merely a new model; it’s a new way to think about all data models. Big Data isn’t merely more data; it is different data that requires different tools. As more and more internal and external sources cast off more and more data, basic notions about the size and attributes of data sets are likely to change. With those changes, CIOs will be expected to capture more data and deliver it to the executive team in a manner that reveals the business—and how to grow it—in new ways.

Web companies have set the bar high already. John Avery, a partner at Sungard Consulting Services, points to the YouTube example: “YouTube’s ability to index a data store of such immense size and then accrete additional analysis on top of that, as an ongoing process with no foresight into what those analyses would look like when the data was originally stored, is very, very impressive. That is something that has challenged folks in financial technology for years.”

As companies with a history of cautious data policies begin to test and embrace Hadoop, MapReduce, and the like, forward-looking CIOs will turn to the issues that will become more important as Big Data becomes the norm. The communities arising around Hadoop (and the inevitable open-source and proprietary competitors that follow) will grow and become influential, inspiring more CIOs to become more data-centric. The profusion of new data sources will lead to dramatic growth in the use and diversity of metadata. As the data grows, so will our vocabulary for understanding it.

Whether learning from Google’s approach to Big Data, hiring a staff primed to maximize its value, or managing the new risks, forward-looking CIOs will, as always, be looking to enable new business opportunities through technology.



New approaches to customer data analysis

Razorfish’s Mark Taylor and Ray Velez discuss how new techniques enable them to better analyze petabytes of Web data.

Interview conducted by Alan Morrison and Bo Parker

Mark Taylor is global solutions director and Ray Velez is CTO of Razorfish, an interactive marketing and technology consulting firm that is now a part of Publicis Groupe. In this interview, Taylor and Velez discuss how they use Amazon’s Elastic Compute Cloud (EC2) and Elastic MapReduce services, as well as Microsoft Azure Table services, for large-scale customer segmentation and other data mining functions.

PwC: What business problem were you trying to solve with the Amazon services?

MT: We needed to join together large volumes of disparate data sets that both we and a particular client can access. Historically, those data sets have not been able to be joined at the capacity level that we were able to achieve using the cloud.

In our traditional data environment, we were limited to the scope of real clickstream data that we could actually access for processing and leveraging bandwidth, because we procured a fixed size of data. We managed and worked with a third party to serve that data center.

This approach worked very well until we wanted to tie together and use SQL servers with online analytical processing cubes, all in a fixed infrastructure. With the cloud, we were able to throw billions of rows of data together to really start categorizing that information so that we could segment non-personally identifiable data from browsing sessions and from specific ways in which we think about segmenting the behavior of customers.

That capability gives us a much smarter way to apply rules to our clients’ merchandising approaches, so that we can achieve far more contextual focus for the use of the data. Rather than using the data for reporting only, we can actually leverage it for targeting and think about how we can add value to the insight.

RV: It was slightly different from a traditional database approach. The traditional approach just isn’t going to work when dealing with the amounts of data that a tool like the Atlas ad server [a Razorfish ad engine that is now owned by Microsoft and offered through Microsoft Advertising] has to deal with.

PwC: The scalability aspect of it seems clear. But is the nature of the data you’re collecting such that it may not be served well by a relational approach?

RV: It’s not the nature of the data itself, but what we end up needing to deal with when it comes to relational data. Relational data has lots of flexibility because of the normalized format, and then you can slice and dice and look at the data in lots of different ways. Until you



put it into a data warehouse format or a denormalized EMR [Elastic MapReduce] or Bigtable type of format, you really don’t get the performance that you need when dealing with larger data sets.

So it’s really that classic tradeoff; the data doesn’t necessarily lend itself perfectly to either approach. When you’re looking at performance and the amount of data, even a data warehouse can’t deal with the amount of data that we would get from a lot of our data sources.

PwC: What motivated you to look at this new technology to solve that old problem?

RV: Here’s a similar example where we used a slightly different technology. We were working with a large financial services institution, and we were dealing with massive amounts of spending-pattern and anonymous data. We knew we had to scale to Internet volumes, and we were talking about columnar databases. We wondered, can we use a relational structure with enough indexes to make it perform well? We experimented with a relational structure and it just didn’t work.

So early on we jumped into what Microsoft Azure technology allowed us to do, and we put it into a Bigtable format, or a Hadoop-style format, using Azure Table services. The real custom element was designing the partitioning structure of this data to denormalize what would usually be five or six tables into one huge table with lots of columns, to the point where we started to bump up against the maximum number of columns they had.

We were able to build something that we never would have thought of exposing to the world, because it never would have performed well. It actually spurred a whole new business idea for us. We were able to take what would typically be a BusinessObjects or a Cognos application, which would not scale to Internet volumes.

We did some sizing to determine how big the data footprint would be. Obviously, when you do that, you tend to have a ton more space than you require, because you’re duplicating lots and lots of data that, with a relational database table, would be lookup data or other things like that. But it turned out that when I laid the indexes on top of the traditionally relational data, the resulting data set actually had even greater storage requirements than performing the duplication and putting the data set into a large denormalized format. That was a bit of a surprise to us. The size of the indexes got so large.

When you think about it, maybe that’s just how an index works anyway—it puts things into this denormalized format. An index file is just some closed concept in your database or memory space. The point is, we would have never tried to expose that to consumers, but we were able to expose it to consumers because of this new format.

MT: The first commercial benefits were the ability to aggregate large and disparate data into one place and the extra processing power. But the next phase of benefits really derives from the ability to identify true relationships across that data.
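The denormalization Velez describes, collapsing several relational tables into one wide row addressed by a partition key and a row key, can be pictured with a short Python sketch. The entities and field names below are invented for illustration; they are not Razorfish’s or Azure’s actual schema.

def denormalize(customer, session, purchases):
    # One wide row per customer session, keyed the way a Bigtable- or
    # Azure Table-style store expects: a partition key that groups
    # related rows together and a row key that is unique within it.
    row = {
        "PartitionKey": customer["region"],
        "RowKey": "%s-%s" % (customer["id"], session["id"]),
    }
    # Duplicate attributes that a normalized schema would keep in
    # separate, joined tables -- trading storage space for read speed.
    for key, value in customer.items():
        row["customer_" + key] = value
    for key, value in session.items():
        row["session_" + key] = value
    # Flatten the one-to-many purchase records into numbered columns,
    # which is how a "huge table with lots of columns" comes about.
    for i, purchase in enumerate(purchases):
        for key, value in purchase.items():
            row["purchase%d_%s" % (i, key)] = value
    return row

# Example with invented data: the result is a single flat dictionary
# ready to be written to a wide-column store.
wide_row = denormalize(
    {"id": "c42", "region": "us-east", "segment": "returning"},
    {"id": "s7", "referrer": "search", "pages_viewed": 12},
    [{"sku": "A100", "amount": 29.99}],
)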



“The stat section on [the MLB] site was always the most difficult part of
the site, but the business insisted it needed it.” —Ray Velez

Tiny percentages of these data sets have the most significant impact on our customer interactions. We are already developing new data measurement and KPI [key performance indicator] strategies as we’re starting to ask ourselves, “Do our clients really need all of the data and measurement points to solve their business goals?”

PwC: Given these new techniques, is the skill set that’s most beneficial to have at Razorfish changing?

RV: It’s about understanding how to map data into a format that most people are not familiar with. Most people understand SQL and relational format, so I think the real skill set evolution doesn’t have quite as much to do with whether the tool of choice is Java or Python or other technologies; it’s more about do I understand normalized versus denormalized structures.

MT: From a more commercial viewpoint, there’s a shift away from product type and skill set, which is based around constraints and managing known parameters, and very much more toward what else can we do. It changes the impact—not just in the technology organization, but in the other disciplines as well.

I’ve already seen a profound effect on the old ways of doing things. Rather than thinking of doing the same things better, it really comes down to having the people and skills to meet your intended business goals.

Using the Elastic MapReduce service with Cascading, our solutions can have a ripple effect on all of the non-technical business processes and engagements across teams. For example, conventional marketing segmentation used to involve teams of analysts who waded through various data sets and stages of processing and analysis to make sense of how a business might view groups of customers. Using the Hadoop-style alternative and Cascading, we’re able to identify unconventional relationships across many data points with less effort, and in the process create new segmentations and insights.

This way, we stay relevant and respond more quickly to customer demand. We’re identifying new variations and shifts in the data on a real-time basis that would have taken weeks or months, or that we might have missed completely, using the old approach. The analyst’s role in creating these new algorithms and designing new methods of campaign planning is clearly key to this type of solution design. The outcome of all this is really interesting and I’m starting to see a subtle, organic response to different changes in the way our solution tracks and targets customer behavior.

PwC: Are you familiar with Bill James, a Major League Baseball statistician who has taken a rather different approach to metrics? James developed some metrics that turned out to be more useful than those used for many years in baseball. That kind of person seems to be the type that you’re enabling to hypothesize, perhaps even do some machine learning to generate hypotheses.

RV: Absolutely. Our analytics team within Razorfish has the Bill James type of folks who can help drive different thinking and envision possibilities with the data. We need to find a lot more of those people. They’re not very easy to find. And we have some of the leading folks in the industry.

You know, a long, long time ago we designed the Major League Baseball site and the platform. The stat section on that site was always the most difficult part of the site, but the business insisted it needed it. The amount of people who really wanted to churn through that data was small. We were using Oracle at the time. We used the concept of temporary tables, which would denormalize lots of different relational tables for performance reasons, and that was a challenge. If I had the cluster technology we do now back in 1999 and 2000, we could have built to scale much more than going to two measly servers that we could cluster.



PwC: The Bill James analogy goes beyond batting averages, which have been the age-old metric for assessing the contribution of a hitter to a team, to measuring other things that weren’t measured before.

RV: Even crazy stuff. We used to do things like, show me all of Derek Jeter’s hits at night on a grass field.

PwC: There you go. Exactly.

RV: That’s the example I always use, because that was the hardest thing to get to scale, but if you go to the stat section, you can do a lot of those things. But if too many people went to the stat section on the site, the site would melt down, because Oracle couldn’t handle it. If I were to rebuild that today, I could use an EMR or a Bigtable and I’d be much happier.

PwC: Considering the size of the Bigtable that you’re able to put together without using joins, it seems like you’re initially able to filter better and maybe do multistage filtering to get to something useful. You can take a cyclical approach to your analysis, correct?

RV: Yes, you’re almost peeling away the layer of the onion. But putting data into a denormalized format does restrict flexibility, because you have so much more power with a where clause than you do with a standard EMR or Bigtable access mechanism.

It’s like the difference between something built for exactly one task versus something built to handle tasks I haven’t even thought of. If you peel away the layer of the onion, you might decide, wow, this data’s interesting and we’re going in a very interesting direction, so what about this? You may not be able to slice it that way. You might have to step back and come up with a different partition structure to support it.

PwC: Social media is helping customers become more active and engaged. From a marketing analysis perspective, it’s a variation on a Super Bowl advertisement, just scaled down to that social media environment. And if that’s going to happen frequently, you need to know what is the impact, who’s watching it, and how are the people who were watching it affected by it. If you just think about the data ramifications of that, it sort of blows your mind.

RV: If you think about the popularity of Hadoop and Bigtable, which is really looking under the covers of the way Google does its search, and when you think about search at the end of the day, search really is recommendations. It’s relevancy. What are the impacts on the ability of people to create new ways to do search and to compete in a more targeted fashion with the search engine? If you look three to five years out, that’s really exciting. We used to say we could never re-create that infrastructure that Google has; Google is the second largest server manufacturer in the world. But now we have a way to create small targeted ways of doing what Google does. I think that’s pretty exciting.
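The multistage, peel-the-onion analysis Velez describes can be sketched in a few lines of Python. In practice each stage might be its own MapReduce job over the denormalized store; the pure-Python version below, with invented field names and thresholds, only illustrates the shape of the approach.

# Stage one: a coarse filter that keeps only the sessions showing
# meaningful engagement (threshold chosen arbitrarily for illustration).
def stage_one(rows):
    return [r for r in rows if r.get("pages_viewed", 0) >= 5]

# Stage two: aggregate the survivors by referrer to see which channels
# drive the engaged sessions; its output can prompt the next question,
# which may require going back and repartitioning the data.
def stage_two(rows):
    totals = {}
    for r in rows:
        totals[r["referrer"]] = totals.get(r["referrer"], 0) + 1
    return totals

sample_sessions = [
    {"referrer": "search", "pages_viewed": 12},
    {"referrer": "display", "pages_viewed": 2},
    {"referrer": "search", "pages_viewed": 7},
]
print(stage_two(stage_one(sample_sessions)))  # {'search': 2}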



Acknowledgments

Advisory
Sponsor & Technology Leader
Tom DeGarmo

US Thought Leadership
Partner-in-Charge
Tom Craren

Center for Technology and Innovation


Managing Editor
Bo Parker
Editors
Vinod Baya, Alan Morrison
Contributors
Larry Best, Galen Gruman, Jimmy Guterman, Larry Marion, Bill Roberts
Editorial Advisers
Markus Anderle, Stephen Bay, Brian Butte, Tom Johnson,
Krishna Kumaraswamy, Bud Mathaisel, Sean McClowry, Rajesh Munavalli,
Luis Orama, Dave Patton, Jonathan Reichental, Terry Retter, Deepak Sahi,
Carter Shock, David Steier, Joe Tagliaferro, Dimpsy Teckchandani,
Cindi Thompson, Tom Urquhart, Christine Wendin, Dean Wotkiewich

Copyedit
Lea Anne Bantsari, Ellen Dunn

Transcription
Paula Burns



Graphic Design

Art Director
Jacqueline Corliss

Designers
Jacqueline Corliss, Suzanne Lau

Illustrators
Donald Bernhardt, Suzanne Lau, Tatiana Pechenik

Photographers
Tim Szumowski, David Tipling (Getty Images), Marina Waltz

Online

Director, Online Marketing
Jack Teuber

Designer and Producer
Scott Schmidt

Reviewers
Dave Stuckey, Chris Wensel

Marketing
Bob Kramer

Special thanks to
Ray George, Page One
Rachel Lovinger, Razorfish
Mariam Sughayer, Disney

Industry perspectives

During the preparation of this publication, we benefited greatly from interviews and conversations with the following executives and industry analysts:

Bud Albers, executive vice president and chief technology officer, Technology Shared Services Group, Disney
Matt Aslett, analyst, enterprise software, the451
John Avery, partner, Sungard Consulting Services
Amr Awadallah, vice president, engineering, and chief technology officer, Cloudera
Phil Buckle, chief technology officer, National Policing Improvement Agency
Howard Dresner, president and founder, Dresner Advisory Services
Brian Donnelly, founder and chief executive officer, InSilico Discovery
Matt Estes, principal data architect, Technology Shared Services Group, Disney
Jim Kobelius, senior analyst, Forrester Research
Doug Lenat, founder and chief executive officer, Cycorp
Roger Magoulas, research director, O’Reilly Media
Nathan Marz, lead engineer, BackType
Bill McColl, founder and chief executive officer, Cloudscale
John Parkinson, acting chief technology officer, TransUnion
David Smoley, chief information officer, Flextronics
Mark Taylor, global solutions director, Razorfish
Scott Thompson, vice president, architecture, Technology Shared Services Group, Disney
Ray Velez, chief technology officer, Razorfish

pwc.com/us

To have a deeper conversation


about how this subject may affect
your business, please contact:

Tom DeGarmo
Principal, Technology Leader
PricewaterhouseCoopers
+1 267-330-2658
thomas.p.degarmo@us.pwc.com

This publication is printed on Coronado Stipple Cover made from 30% recycled fiber; and
Endeavor Velvet Book made from 50% recycled fiber, a Forest Stewardship Council (FSC)
certified stock using 25% post-consumer waste.

Recycled paper
Subtext

Big Data: Data sets that range from many terabytes to petabytes in size, and that usually consist of less-structured information such as Web log files.

Hadoop cluster: A type of scalable computer cluster inspired by the Google Cluster Architecture and intended for cost-effectively processing less-structured information.

Apache Hadoop: The core of an open-source ecosystem that makes Big Data analysis more feasible through the efficient use of commodity computer clusters.

Cascading: A bridge from Hadoop to common Java-based programming techniques not previously usable in cluster-computing environments.

NoSQL: A class of non-relational data stores and data analysis techniques that are intended for various kinds of less-structured data. Many of these techniques are part of the Hadoop ecosystem.

Gray data: Data from multiple sources that isn’t formatted or vetted for specific needs, but worth exploring with the help of Hadoop cluster analysis techniques.

Comments or requests? Please visit www.pwc.com/techforecast OR send e-mail to: techforecasteditors@us.pwc.com

PricewaterhouseCoopers (www.pwc.com) provides industry-focused assurance, tax and advisory services to build public trust and enhance value for
its clients and their stakeholders. More than 155,000 people in 153 countries across our network share their thinking, experience and solutions to
develop fresh perspectives and practical advice.

© 2010 PricewaterhouseCoopers LLP. All rights reserved. “PricewaterhouseCoopers” refers to PricewaterhouseCoopers LLP, a Delaware limited
liability partnership, or, as the context requires, the PricewaterhouseCoopers global network or other member firms of the network, each of which is a
separate and independent legal entity. This document is for general information purposes only, and should not be used as a substitute for consultation
with professional advisors.
