Making sense
of Big Data
A quarterly journal
2010, Issue 3
In this issue
04 Tapping into the power of Big Data
22 Building a bridge to the rest of your data
36 Revising the CIO's data playbook
Contents
Features
04 Tapping into the power of Big Data
Treating it differently from your core enterprise data is essential.
Departments
02 Message from the editor
50 Acknowledgments
54 Subtext
Message from the editor
Bill James has loved baseball statistics ever since he was a kid in Mayetta,
Kansas, cutting baseball cards out of the backs of cereal boxes in the early
1960s. James, who compiled The Bill James Baseball Abstract for years, is
a renowned “sabermetrician” (a term he coined himself). He now is a senior
advisor on baseball operations for the Boston Red Sox, and he previously
worked in a similar capacity for other Major League Baseball teams.
James has done more to change the world of baseball statistics than
anyone in recent memory. As broadcaster Bob Costas says, James
“doesn’t just understand information. He has shown people a different
way of interpreting that information.” Before Bill James, Major League
Baseball teams all relied on long-held assumptions about how games are
won. They assumed batting average, for example, had more importance
than it actually does.
James challenged these assumptions. He asked critical questions that
didn’t have good answers at the time, and he did the research and analysis
necessary to find better answers. For instance, how many days’ rest does
a reliever need? James’s answer is that some relievers can pitch well for
two or more consecutive days, while others do better with a day or two of
rest in between. It depends on the individual. Why can’t a closer work more
than just the ninth inning? A closer is frequently the best reliever on the
team. James observes that managers often don’t use the best relievers
to their maximum potential.
The lesson of the Bill James example is that the best statistics come from
striving to ask the best questions and then pursuing the answers. But what
are the best questions? James takes an iterative
approach, analyzing the data he has, or can gather, asking some questions
based on that analysis, and then looking for the answers. He doesn’t stop
with just one set of statistics. The first set suggests some questions, to
which a second set suggests some answers, which then give rise to yet
another set of questions. It’s a continual process of investigation, one that’s
focused on surfacing the best questions rather than assuming those
questions have already been asked.
Enterprises can take advantage of a similarly iterative, investigative
approach to data. They are being overwhelmed with it; many generate
petabytes of information they aren't making the best use of. And not all of
that data is the same. Some of it has value, and some, not so much.
The problem with this data has been twofold: (1) it’s difficult to analyze,
and (2) processing it using conventional systems takes too long and is
too expensive.
“The speed of business these days and the amount of data that we
are now swimming in mean that we need to have new ways and new
techniques of getting at the data, finding out what’s in there, and figuring
out how we deal with it.” —Bud Albers of Disney
[Figure: Disney's D-Cloud data cluster. Usage data from site visitors, internal business partners, and affiliated businesses is gathered by a central logging service and fed into a Hadoop-based D-Cloud data cluster, which is accessed through a cluster interface (MapReduce/Hive/Pig) and tied to core IT and business unit systems through a metadata repository.]
started with making sure they could expose and access the data, then moved to iterative refinement in working with the data. "We aggressively got in to find the direction and the base. Then we began to iterate rather than try to do a Big Bang," Albers says.

Wolfram Research and IBM have begun to extend their analytics applications to run on such large-scale data pools, and startups are presenting approaches they promise will allow data exploration in ways that technologies couldn't have enabled in the past, including support for tools that let knowledge workers examine traditional databases using Big Data–style exploratory tools.
[Figure: Exploration versus consolidation. Departmental data is consolidated into summary departmental data, which is consolidated again into summary enterprise data, with information loss at each consolidation step. Exploring pre-consolidated data (which conventional systems never collect) means less information loss.]
John Parkinson is the acting CTO of TransUnion, the chairman and owner of Parkwood
Advisors, and a former CTO at Capgemini. In this interview, Parkinson outlines
TransUnion’s considerable requirements for less-structured data analysis, shedding
light on the many data-related technology challenges TransUnion faces today—challenges
that, he says, more companies will face in the near future.
PwC: In your role at TransUnion, you've evaluated many large-scale data processing technologies. What do you think of Hadoop and MapReduce?

JP: MapReduce is a very computationally attractive answer for a certain class of problem. If you have that class of problem, then MapReduce is something you should look at. The challenge today, however, is that the number of people who really get the formalism behind MapReduce is a lot smaller than the group of people trying to understand what to do with it. It really hasn't evolved yet to the point where your average enterprise technologist can easily make productive use of it.

PwC: What class of problem would that be?

JP: MapReduce works best in situations where you want to do high-volume, accurate but approximate matching and categorization in very large, low-structured data sets. At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data looking for things that match a pattern approximately. MapReduce is a more efficient filter for some of the pattern-matching algorithms that we have tried to use. At least in its theoretical formulation, it's very amenable to highly parallelized execution, which many of the other filtering algorithms we've used aren't.

The open-source stack is attractive for experimenting, but the problem we find is that Hadoop isn't what Google runs in production—it's an attempt by a bunch of pretty smart guys to reproduce what Google runs in production. They've done a good job, but it's like a lot of open-source software—80 percent done. The 20 percent that isn't done—those are the hard parts.

From an experimentation point of view, we have had a lot of success in proving that the computing formalism behind MapReduce works, but the software that we can acquire today is very fragile. It's difficult to manage. It has some bugs in it, and it doesn't behave very well in an enterprise environment. It also has some interesting limitations when you try to push the scale and the performance.
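To make the formalism Parkinson describes concrete, here is a minimal sketch, in Python rather than Hadoop's native Java, of approximate matching expressed as a map step and a reduce step. It illustrates the class of problem, not TransUnion's actual pipeline; the patterns, records, and similarity threshold are invented.

    # Illustrative only: fuzzy pattern matching in MapReduce style.
    from collections import defaultdict
    from difflib import SequenceMatcher

    PATTERNS = ["JOHN Q PUBLIC", "JANE R DOE"]  # hypothetical target patterns
    THRESHOLD = 0.8                              # similarity cutoff, 0.0 to 1.0

    def map_record(record):
        """Map phase: emit (pattern, record) for every fuzzy match."""
        for pattern in PATTERNS:
            score = SequenceMatcher(None, pattern, record.upper()).ratio()
            if score >= THRESHOLD:
                yield pattern, (record, round(score, 2))

    def reduce_matches(pairs):
        """Reduce phase: group candidate records under each pattern."""
        grouped = defaultdict(list)
        for pattern, match in pairs:
            grouped[pattern].append(match)
        return dict(grouped)

    rows = ["John Q. Public", "Jon Q Public", "J. R. Doe", "Jane R Doe"]
    pairs = [kv for row in rows for kv in map_record(row)]
    print(reduce_matches(pairs))

Because each record is scored independently of every other, the map step can be spread across as many machines as there are rows to scan, which is the property that makes the formalism attractive at tens of billions of rows.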
“I have a bunch of smart engineers, but I don’t want them spending their
day being the technology support environment for what should be a
product in our architecture.”
“It’s more art than science. The business units must generate revenue,
and we must provide the core services. How do you strike that balance?
Ownership is a lot more negotiated on some things today.”
—Bud Albers
CPU utilization isn't the only measure, but it's the most prominent one. To study and understand what was happening, we put monitors and measures on our servers and reported peak CPU utilization on five-minute intervals across our server farm. We found that on roughly 80 percent of our servers, we never got above 10 percent utilization in a monthly period.

Our first step to address that problem was virtualization. At this point, about 49 percent of our data center is virtual. Our virtualization effort had a sizable impact on cost. Dollars fell out because we quit building data centers and doing all kinds of redundant shuffling. We didn't have to lay off people. We changed some of our processes, and we were able to shift our growth curve from plus 27 to minus 3 on the shared service.

We call this our D-Cloud effort. Another step in this effort was moving to a REST [REpresentational State Transfer] and JSON [JavaScript Object Notation] data exchange standard, because we knew we had to hit all these different devices and create some common APIs [application programming interfaces] in the framework. One of the very first things we put in place was a central logging service for all the events. These event logs can be streamed into one very large data set. We can then use the Hadoop and MapReduce paradigm to go after that data.

PwC: How do you evolve so that the data strategy is really served well, so that it's more of a data-driven approach in some ways?

ME: On one side, you have a very transactional OLTP [online transactional processing] kind of world, RDBMSs [relational database management systems], and major vendors that we're using there. On the other side of it, you have traditional analytical warehousing. And where we've slotted this [Hadoop-style data] is in the middle with the other operational data. Some of it is derived from transactional data, and some has been crafted out of analytical data. There's a freedom that's derived from blending these two kinds of data.

Our centralized logging service is an example. As we look at continuing to drive down costs to drive up efficiency, we can begin to log a large amount of this data at a price point that we have not been able to achieve by scaling up RDBMSs or using warehousing appliances.

Then the key will be putting an expert system in place. That will give us the ability to really understand what's going on in the actual operational environment. We're starting to move again toward lower utilization trajectories. We need to scale the infrastructure back and get that utilization level up to the threshold.
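As a concrete illustration of the peak-utilization reporting described above, the sketch below computes peak CPU per server over five-minute intervals from JSON-formatted event logs. The field names and sample events are invented; at production scale, this logic would run as a Hadoop job over the central logging service's event stream rather than in a single Python process.

    # Illustrative only: peak CPU per server per five-minute window.
    import json
    from collections import defaultdict

    events = [  # stand-in for lines streamed from a central logging service
        '{"host": "web01", "ts": 1262304000, "cpu": 7.5}',
        '{"host": "web01", "ts": 1262304120, "cpu": 9.1}',
        '{"host": "web02", "ts": 1262304290, "cpu": 42.0}',
    ]

    peaks = defaultdict(float)  # (host, interval start) -> peak CPU percentage
    for line in events:
        event = json.loads(line)
        interval = event["ts"] // 300 * 300  # bucket into 5-minute windows
        key = (event["host"], interval)
        peaks[key] = max(peaks[key], event["cpu"])

    for (host, interval), cpu in sorted(peaks.items()):
        print(f"{host} @ {interval}: peak {cpu}%")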
“We think storing the unstructured data in its raw format is what’s
coming. In a Hadoop environment, instead of bringing the data back to
your warehouse, you figure out what question you want to answer.”
—Scott Thompson
[Figure: Typical Hadoop cluster layout. A 1000Mbps core switch connects 100Mbps rack switches, each serving racks of TaskTracker/DataNode machines alongside a dedicated JobTracker and NameNode.]

Typical node setup:
• 2 quad-core Intel Nehalem processors
• 24GB of RAM
• 12 1TB SATA disks (non-RAID)
• 1 Gigabit Ethernet card
• Cost per node: $5,000
• Effective file space per node: 20TB
The result is less-expensive large-scale distributed computing and parallel processing, which make possible an analysis that is different from what most enterprises have previously attempted. As author Tom White points out, "The ability to run an ad hoc query against your whole data set and get the results in a reasonable time is transformative."6

The cost of this capability is low enough that companies can fund a Hadoop cluster from existing IT budgets. When it decided to try Hadoop, Disney's Technology Shared Services Group took advantage of the increased server utilization it had already achieved from virtualization. As of March 2010, with nearly 50 percent of its servers virtualized, Disney had 30 percent server image growth annually but 30 percent less growth in physical servers. It was then able to set up a multi-terabyte cluster with Hadoop and other free open-source tools, using servers it had planned to retire. The group estimates it spent less than $500,000 on the entire project. (See the article, "Tapping into the power of Big Data," on page 04.)

These clusters are also transformative because cloud providers can offer them on demand. Instead of using their own infrastructures, companies can subscribe to a service such as Amazon's or Cloudera's distribution on the Amazon Elastic Compute Cloud (EC2) platform.

The EC2 platform was crucial in a well-known use of cloud computing on a Big Data project that also depended on Hadoop and other open-source tools. In 2007, The New York Times needed to quickly assemble the PDFs of 11 million articles from 4 terabytes of scanned images. Amazon's EC2 service completed the job in 24 hours after setup, a feat that received widespread attention in blogs and the trade press.

Mostly overlooked in all that attention was the use of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Using these open-source tools, after studying how-to blog posts from others, Times senior software architect Derek Gottfrid developed and ran code in parallel across multiple Amazon machines.7

"Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces [APIs]," says Chris Wensel, founder of Concurrent, which developed Cascading. (See the discussion of Cascading later in this article.) "I regularly work with customers to boot up 200-node clusters and process 3 terabytes of data in five or six hours, and then shut the whole thing down. That's extraordinarily powerful."

The Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) and the MapReduce parallel programming framework are at the core of Apache Hadoop. Comparing HDFS and MapReduce to Linux, Awadallah says that together they're a "data operating system." This description may be overstated, but there are similarities to any operating system. Operating systems schedule tasks, allocate resources, and manage files and data flows to fulfill the tasks. HDFS does a distributed computing version of this. "It takes care of linking all the nodes together to look like one big file and job scheduling system for the applications running on top of it," Awadallah says.

HDFS, like all Hadoop tools, is Java based. An HDFS contains two kinds of nodes:

• A single NameNode that logs and maintains the necessary metadata in memory for distributed jobs

• Multiple DataNodes that create, manage, and process the 64MB blocks that contain pieces of Hadoop jobs, according to the instructions from the NameNode
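The division of labor between the two node types can be pictured with a toy model. The sketch below is a simplification with invented class names, and it omits replication and fault tolerance; only the 64MB block size comes from the description above.

    # Toy model of the HDFS division of labor; not the real Java API.
    BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, HDFS's default block size

    class NameNode:
        """Holds only metadata: which blocks make up a file, and where they live."""
        def __init__(self):
            self.block_map = {}  # filename -> list of (block_id, datanode name)

        def register(self, filename, block_id, datanode):
            self.block_map.setdefault(filename, []).append((block_id, datanode))

    class DataNode:
        """Holds the actual block contents."""
        def __init__(self, name):
            self.name = name
            self.blocks = {}  # block_id -> bytes

    def put_file(filename, data, namenode, datanodes):
        """Split data into 64MB blocks and spread them across DataNodes."""
        for i in range(0, len(data), BLOCK_SIZE):
            node = datanodes[(i // BLOCK_SIZE) % len(datanodes)]
            block_id = f"{filename}:{i // BLOCK_SIZE}"
            node.blocks[block_id] = data[i:i + BLOCK_SIZE]
            namenode.register(filename, block_id, node.name)

    nn, dns = NameNode(), [DataNode("dn1"), DataNode("dn2")]
    put_file("logs.txt", b"x" * (150 * 1024 * 1024), nn, dns)  # a 150MB file
    print(nn.block_map["logs.txt"])  # three blocks, alternating DataNodes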
[Figure: The MapReduce data flow. Input key-value pairs from data stores 1 through n are fed to parallel Map functions, which emit lists of values grouped by key ([f1, f2, ...]) for downstream processing.]

[Figure: A Cascading pipe assembly. A source (So), a chain of pipes (P), and a sink (Si) form the assembly, which Cascading translates into a sequence of Hadoop MapReduce jobs.]
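Cascading itself is a Java API, but the shape of a pipe assembly is easy to mimic. The following Python sketch mirrors the figure's source-pipes-sink structure as chained generators, with invented record formats; it is a conceptual analogue only, since Cascading's real contribution is planning such assemblies into sequences of Hadoop MapReduce jobs.

    # Conceptual analogue of a pipe assembly, not Cascading's actual API.
    def source(lines):                 # So: read raw records
        yield from lines

    def parse_pipe(records):           # P: split each record into fields
        for rec in records:
            yield rec.split(",")

    def filter_pipe(rows):             # P: keep rows matching a condition
        for row in rows:
            if row[1] == "delayed":
                yield row

    def sink(rows):                    # Si: write the results out
        return list(rows)

    raw = ["UA90,delayed", "DL12,on-time", "AA7,delayed"]
    print(sink(filter_pipe(parse_pipe(source(raw)))))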
Clojure

Clojure creator Rich Hickey wanted to combine aspects of C or C#, LISP (for list processing, a language associated with artificial intelligence that's rich in mathematical functions), and Java. The letters C, L, and J led him to name the language, which is pronounced "closure." Clojure combines a LISP library with Java libraries. Clojure's mathematical and natural language processing (NLP) capabilities and the fact that it is JVM based make it useful for statistical analysis on Hadoop clusters. FlightCaster, a commercial-airline-delay-prediction service, uses Clojure on top of Cascading, on top of MapReduce and Hadoop, for "getting the right view into unstructured data from heterogeneous sources," says Bradford Cross, FlightCaster co-founder.

LISP has attributes that lend themselves to NLP, making Clojure especially useful in NLP applications. Mark Watson, an artificial intelligence consultant and author, says most LISP programming he's done is for NLP. He considers LISP to be four times as productive for programming as C++ and twice as productive as Java. His NLP code "uses a huge amount of memory-resident data," such as lists of proper nouns, text categories, common last names, and nationalities.

"Getting the right view into unstructured data from heterogeneous sources can be quite tricky." —Bradford Cross of FlightCaster

Thrift

Thrift, initially created at Facebook in 2007 and then released to open source, helps developers create services that communicate across languages, including C++, C#, Java, Perl, Python, PHP, Erlang, and Ruby. With Thrift, according to Facebook, users can "define all the necessary data structures and interfaces for a complex service in a single short file."

A more important aspect of Thrift, according to BackType's Marz, is its ability to create strongly typed data and flexible schemas. Countering the emphasis of the so-called NoSQL community on schema-less data, Marz asserts there are effective ways to lightly structure the data in Hadoop-style analysis.

Marz uses Thrift's serialization features, which turn objects into a sequence of bits that can be stored as files, to create schemas between types (for instance, differentiating between text strings and long, 64-bit integers) and schemas between relationships (for instance, linking Twitter accounts that share a common interest). Structuring the data in this way helps BackType avoid inconsistencies in the data or the need to manually filter for some attributes.

BackType can use required and optional fields to structure the Twitter messages it crawls and analyzes. The required fields can help enforce data type. The optional fields, meanwhile, allow changes to the schema as well as the use of old objects that were created using the old schema.
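Real BackType schemas are Thrift struct definitions, but the required/optional idea Marz describes can be sketched in a few lines of Python. The field names below are invented for illustration.

    # Illustrative only: required fields enforce type; optional fields may be absent.
    REQUIRED = {"account_id": int}             # must be present, type enforced
    OPTIONAL = {"followers": int, "bio": str}  # may be missing in old objects

    def validate(obj):
        """Accept new and old objects alike, as long as required fields hold."""
        for field, ftype in REQUIRED.items():
            if field not in obj or not isinstance(obj[field], ftype):
                raise ValueError(f"bad required field: {field}")
        for field, ftype in OPTIONAL.items():
            if field in obj and not isinstance(obj[field], ftype):
                raise ValueError(f"bad optional field: {field}")
        return obj

    validate({"account_id": 42})                     # old object, still valid
    validate({"account_id": 42, "followers": 1500})  # new object, added field

The payoff is the last two lines: an old object missing the optional field still validates, so the schema can grow without invalidating data already on disk.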
Amr Awadallah is vice president of engineering and CTO at Cloudera, a company that
offers products and services around Hadoop, an open-source technology that allows
efficient mining of large, complex data sets. In this interview, Awadallah provides an
overview of Hadoop’s capabilities and how Cloudera customers are using them.
PwC: Were you at Yahoo before coming to Cloudera?

AA: Yes. I was with Yahoo from mid-2000 until mid-2008, starting with the Yahoo Shopping team after selling my company VivaSmart to Yahoo. Beginning in 2003, my career shifted toward business intelligence and analytics at consumer-facing properties such as Yahoo News, Mail, Finance, Messenger, and Search. I had the daunting task of building a very large data warehouse infrastructure that covered all these diverse products and figuring out how to bring them together.

That is when I first experienced Hadoop. Its model of "mine first, govern later" fits in with the well-governed infrastructure of a data mart, so it complements these systems very well. Governance standards are important for maintaining a common language across the organization. However, they do inhibit agility, so it's best to complement a well-governed data mart with a more agile complex data processing system like Hadoop.

PwC: How did Yahoo start using Hadoop?

AA: In 2005, Yahoo was faced with a business challenge. The cost of creating the Web search index was approaching the revenues being made from the keyword advertising on the search pages. Yahoo Search adopted Hadoop as an economically scalable solution, and worked on it in conjunction with the open-source Apache Hadoop community. Yahoo played a very big role in the evolution of Hadoop to where it is today. Soon after the Yahoo Search team started using Hadoop, other parts of the company began to see the power and flexibility that this system offers. Today, Yahoo uses Hadoop for data warehousing, mail spam detection, news feed processing, and content/ad targeting.

PwC: What are some of the advantages of Hadoop when you compare it with RDBMSs [relational database management systems]?

AA: With Oracle, Teradata, and other RDBMSs, you must create the table and schema first. You say, this is what I'm going to be loading in, these are the types of columns I'm going to load in, and then you load your data. That process can inhibit how fast you can evolve your data model and schemas, and it can limit what you log and track.

With Hadoop, it's the other way around. You load all of your data, such as XML [Extensible Markup Language], tab delimited flat files, Apache log files, JSON [JavaScript Object Notation], etc. Then in Hive or Pig [both of which are Hadoop data query tools], you point your metadata toward the file and parse the data on
the fly when reading it out. This approach lets you extract the columns that map to the data structure you're interested in.

Creating the structure on the read path like this can have its disadvantages; however, it gives you the agility and the flexibility to evolve your schema much quicker without normalizing your data first. In general, relational systems are not well suited for quickly evolving complex data types.

Another benefit is retroactive schemas. For example, an engineer launching a new product feature can add the logging for it, and that new data will start flowing directly into Hadoop. Weeks or months later, a data analyst can update their read schema on how to parse this new data. Then they will immediately be able to query the history of this metric since it started flowing in [as opposed to waiting for the RDBMS schema to be updated and the ETL (extract, transform, and load) processes to reload the full history of that metric].

Is the RDBMS still an interesting place to ask questions of complex plus relational data? Probably not, although organizations still need to use, collect, and present relational data for questions that are routine and require, in some cases, a real-time response.

PwC: How have companies benefited from querying across both structured and complex data?

AA: When you query against complex data types, such as Web log files and customer support forums, as well as against the structured data you have already been collecting, such as customer records, sales history, and transactions, you get a much more accurate answer to the question you're asking. For example, a large credit card company we've worked with can identify which transactions are most likely fraudulent and can prioritize which accounts need to be addressed.
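A small sketch makes the schema-on-read pattern Awadallah describes concrete. In Hadoop, Hive or Pig applies the read schema to files in place; the Python below imitates that by parsing raw, hypothetical log lines only at query time, so a later change to the parser immediately applies to the full history.

    # Illustrative only: raw lines are stored untouched; a read schema
    # projects out columns at query time. The log format is invented.
    raw_log = [  # loaded as-is, no upfront table definition
        "2010-03-01T12:00:00 GET /home 200",
        "2010-03-01T12:00:05 GET /buy 500 feature=oneclick",  # field added later
    ]

    def read_schema(line):
        """Parse on the read path; unknown trailing fields become attributes."""
        ts, method, path, status, *extras = line.split()
        row = {"ts": ts, "method": method, "path": path, "status": int(status)}
        row.update(kv.split("=", 1) for kv in extras)  # retroactively queryable
        return row

    # An analyst can update read_schema months later and immediately query
    # the full history, including records written before the schema changed.
    errors = [r for r in (read_schema(l) for l in raw_log) if r["status"] >= 500]
    print(errors)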
Many Web companies are finding opportunity in "gray data." Gray data is the raw and unvalidated data that arrives from various sources, in huge quantities, and not in the most usable form. Yet gray data can deliver value to the business even if the generators of that content (for example, people calling directory assistance) are contributing that data for a reason far different from improving voice-recognition algorithms. They just want the right phone number; the data they leave is a gift to the company providing the service.

The new technologies and services described in the article, "Building a bridge to the rest of your data," on page 22 are making it possible to search for enterprise value in gray data in agile ways at low cost. Much of this value is likely to be in the area of knowing your customers, a sure path for CIOs looking for ways to contribute to company growth and deepen their relationships with the rest of the C-suite.

What Web enterprise use of Big Data shows CIOs, most of all, is that there is a way to think and manage differently when you conclude that standard transactional data analysis systems are not and should not be the only models. New models are emerging. CIOs who recognize these new models without throwing away the legacy systems that still serve them well will see that having more than one tool set, one skill set, and one set of controls makes their organizations more sophisticated, more agile, less expensive to maintain, and more valuable to the business.

The business case

Besides Google, Yahoo, and other Web-based enterprises that have complex data sets, there are stories of brick and mortar organizations that will be making more use of Big Data. For example, Rollin Ford, Wal-Mart's CIO, told The Economist earlier this year, "Every day I wake up and ask, 'How can I flow data better, manage data better, analyze data better?'" The answer to that question today implies a budget reallocation, with less-expensive hardware and software carrying more of the load. "I see inspiration from the Google model and the notion of moving into commodity-based computing—just having lots of cheap stuff that you can use to crunch vast quantities of data. I think that really contrasts quite heavily with the historic model of paying lots of money for really specialist stuff," says Phil Buckle, CTO of the UK's National Policing Improvement Agency, which oversees law enforcement infrastructure nationwide. That's a new mind-set for the CIO, who ordinarily focuses on keeping the plumbing and the data it carries safe, secure, in-house, and functional.

Seizing the Big Data initiative would give CIOs in particular and IT in general more clout in the executive suite. But are CIOs up to the task? "It would be a positive if IT could harness unstructured data effectively," former Gartner analyst Howard Dresner, president and founder of Dresner Advisory Services, observes. "However, they haven't always done a great job with structured data, and unstructured is far more complex and exists predominantly outside the firewall and beyond their control."
Tools are not the issue. Many evolving tools, as noted
in the previous article, come from the open-source
community; they can be downloaded and experimented
with for low cost and are certainly up to supporting
any pilot project. More important is the aforementioned
mind-set and a new kind of talent IT will need.
To whom does the future of IT belong?

The ascendance of Big Data means that CIOs need a more data-centric approach. But what kind of talent can help a CIO succeed in a more data-centric business environment, and what specific skills do the CIO's teams focused on the area need to develop and balance?

Hal Varian, a University of California, Berkeley, professor and Google's chief economist, says, "The sexy job in the next 10 years will be statisticians." He and others, such as IT and management professor Erik Brynjolfsson at the Massachusetts Institute of Technology (MIT), contend this demand will happen because the amount of data to be analyzed is out of control. Those who can make sense of the flood will reap the greatest rewards. They have a point, but the need is not just for statisticians—it's for a wide range of analytically minded people.

Today, larger companies still need staff with expertise in package implementations and customizations, systems integration, and business process reengineering, as well as traditional data management and business intelligence that's focused on transactional data. But there is a growing role for people with flexible minds to analyze data and suggest solutions to problems or identify opportunities from that data.

In Silicon Valley and elsewhere, where businesses such as Google, Facebook, and Twitter are built on the rigorous and speedy analysis of data, programming frameworks such as MapReduce (which works with Hadoop) and NoSQL (a database approach for non-relational data stores) are becoming more popular.

Chris Wensel, who created Cascading (an alternative application programming interface [API] to MapReduce) and straddles the worlds of startups and entrenched companies, says, "When I talk to CIOs, I tell them: 'You know those people you have who know about data. You probably don't use those people as much as you should. But once you take advantage of that expertise and reallocate that talent, you can take advantage of these new techniques.'"

The increased emphasis on data analysis does not mean that traditional programmers will be replaced by quantitative analysts or data warehouse specialists. "The talent demand isn't so much for Java developers or statisticians per se as it is for people who know how to work with denormalized data," says Ray Velez, CTO at Razorfish, an interactive marketing and technology consulting firm involved in many Big Data initiatives. "It's about understanding how to map data into a format that most people are not familiar with. Most people understand SQL and the relational format, so the real skill set evolution doesn't have quite as much to do with whether it's Java or Python or other technologies."

Velez points to Bill James as a useful case. James, a baseball writer and statistician, challenged conventional wisdom by taking an exploratory mind-set to baseball statistics. He literally changed how baseball management makes talent decisions, and even how they manage on the field. In fact, James became senior advisor for baseball operations in the Boston Red Sox's front office.
Natural language processing and text mining
Tools: Clojure, Redis, Scala, Crane, other Java functional language libraries, Python Natural Language ToolKit
Notes: To some extent, each of these serves as a layer of abstraction on top of Hadoop. Those familiar keep adding layers on top of layers. FlightCaster, for example, uses a stack consisting of Amazon S3 -> Amazon EC2 -> Cloudera -> HDFS -> Hadoop -> Cascading -> Clojure.1

Scripting and NoSQL database programming skills
Tools: Python and related frameworks, HBase, Cassandra, CouchDB, Tokyo Cabinet
Notes: These lend themselves to or are based on functional languages such as LISP, or comparable to LISP. CouchDB, for example, is written in Erlang.3 (See the discussion of Clojure and LISP on page 30.)

1 Pete Skomoroch, "How FlightCaster Squeezes Predictions from Flight Data," Data Wrangling blog, August 24, 2009, http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/ (accessed May 25, 2010).

3 Scripting languages such as Python run more slowly than Java, but developers sometimes make the tradeoff to increase their own productivity. Some companies have created their own frameworks and released these to open source. See Klaas Bosteels, "Python + Hadoop = Flying Circus Elephant," Last.HQ Last.fm blog, May 29, 2008, http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant (accessed May 14, 2010).
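As a taste of the first row of the table, here is a toy text-mining snippet using the Python Natural Language Toolkit; the sample sentence is invented, and the tokenizer model is fetched on first use.

    # Illustrative only: tokenize a sentence and count word frequencies.
    import nltk
    nltk.download("punkt", quiet=True)  # tokenizer model, downloaded once

    text = "FlightCaster predicts flight delays by mining heterogeneous data sources."
    tokens = nltk.word_tokenize(text)
    freq = nltk.FreqDist(t.lower() for t in tokens if t.isalpha())
    print(freq.most_common(3))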
Where do CIOs find such talent? Start with your own enterprise. For example, business analysts managing the marketing department's lead-generation systems could be promoted onto an IT data staff charged with exploring the data flow. Most large consumer-oriented companies already have people in their business units who can analyze data and suggest solutions to problems or identify opportunities. These people need to be groomed and promoted, and more of them hired for IT, to enable the entire organization, not just the marketing department, to reap the riches.

Set up a sandbox

Although the business case CIOs can make for Big Data is inarguable, even inarguable business cases carry some risk. Many CIOs will look at the risks associated with Big Data and find a familiar canard. Many Big Data technologies—Hadoop in particular—are open source, and open source is often criticized for carrying too much risk.

The open-source versus proprietary technology argument is nothing new. CIOs who have tried to implement open-source programs, from the Apache Web server to the Drupal content-management system, have faced the usual arguments against code being available to all comers. Some of those arguments, especially concerns revolving around security and reliability, verge on the specious. Google built its internal Web servers atop Apache. And it would be difficult to find a Big Data site as reliable as Google's.

Clearly, one challenge CIOs face has nothing to do with data or skill sets. Open-source projects become available earlier in their evolution than do proprietary alternatives. In this respect, Big Data tools are less stable and complete than are Apache or Linux open-source tool kits.

Introducing an open-source technology such as Hadoop into a mostly proprietary environment does not necessarily mean turning the organization upside down. A CIO at a small Massachusetts company says, "Every technology department has a skunkworks, no matter how informal—a sandbox where they can test and prove technologies. That's how open source entered our organization. A small Hadoop installation might be a gateway that leads you to more open source. But it might turn out to be a neat little open-source project that sits by itself and doesn't bother anything else. Either can be OK, depending on the needs of your company."

Bud Albers, executive vice president and CTO of Disney Technology Shared Services Group, concurs. "It depends on your organizational mind-set," he says. "It depends on your organizational capability. There is a certain 'don't try this at home' kind of warning that goes with technologies like Hadoop. You have to be willing at this stage of its maturity to maybe have a little higher level of capability to go in."
Risk: Inaccurate or obsolete data
Mitigation: Maintain strong metadata management; unverified information must be flagged.

Risk: Analysis leads to paralysis
Mitigation: Keep the sandbox related to the business problem or opportunity.

Risk: Security
Mitigation: Keep the Hadoop clusters away from the firewall; be vigilant; ask the chief security officer for help.

Risk: Buggy code and other glitches
Mitigation: Make sure the team keeps track of modifications and other implementation history, since documentation isn't plentiful.

Risk: Rejection by other parts of the organization
Mitigation: Perform change management to help improve the odds of acceptance, along with quick successes.
Best points out that choosing the right controls to implement in a given risk scenario is essential. The only way to make sound choices is by adopting a risk mind-set and approach that allow a focus on the most critical controls, he says. Enterprises simply don't have the resources to implement blanket controls. The Control Objectives for Information and related Technology (COBIT) framework, a popular reference for IT risk assessment, is a "phone book of thousands of controls," he says. Risk is not juggling a lot of balls. "Risk is knowing which balls are made out of rubber and which are made out of glass."

By nature and by work experiences, most CIOs are risk averse. Blue-chip CIOs hold off installing new versions of software until they have been proven beyond a doubt, and these CIOs don't standardize on new platforms until the risk for change appears to be less than the risk of stasis.

"The fundamental issue is whether IT is willing to depart from the status quo, such as an RDBMS [relational database management system], in favor of more powerful technologies," Dresner says. "This means massive change, and IT doesn't always embrace change." More forward-thinking IT organizations constantly review their software portfolio and adjust accordingly.

In this case, the need to manipulate larger and larger amounts of data that companies are collecting is pressing. Even risk-averse CIOs are exploring the possibilities of Big Data for their businesses. Bud Mathaisel, CIO of the outsourcing vendor Achievo, divides the risks of Big Data and their solutions into three areas:

• Accessibility—The data repository used for data analysis should be access managed.

• Classification—Gray data should be identified as such.

• Governance—Who's doing what with this?

Yes, Big Data is new. But accessibility, classification, and governance are matters CIOs have had to deal with for many years in many guises.
PwC: What business problem were you trying to solve with the Amazon services?

MT: We needed to join together large volumes of disparate data sets that both we and a particular client can access. Historically, those data sets have not been able to be joined at the capacity level that we were able to achieve using the cloud.

In our traditional data environment, we were limited to the scope of real clickstream data that we could actually access for processing and leveraging bandwidth, because we procured a fixed size of data. We managed and worked with a third party to serve that data center.

This approach worked very well until we wanted to tie together and use SQL servers with online analytical processing cubes, all in a fixed infrastructure. With the cloud, we were able to throw billions of rows of data together to really start categorizing that information so that we could segment non-personally identifiable data from browsing sessions and from specific ways in which we think about segmenting the behavior of customers.

That capability gives us a much smarter way to apply rules to our clients' merchandising approaches, so that we can achieve far more contextual focus for the use of the data. Rather than using the data for reporting only, we can actually leverage it for targeting and think about how we can add value to the insight.

RV: It was slightly different from a traditional database approach. The traditional approach just isn't going to work when dealing with the amounts of data that a tool like the Atlas ad server [a Razorfish ad engine that is now owned by Microsoft and offered through Microsoft Advertising] has to deal with.

PwC: The scalability aspect of it seems clear. But is the nature of the data you're collecting such that it may not be served well by a relational approach?

RV: It's not the nature of the data itself, but what we end up needing to deal with when it comes to relational data. Relational data has lots of flexibility because of the normalized format, and then you can slice and dice and look at the data in lots of different ways. Until you
Tiny percentages of these data sets have the most significant impact on our customer interactions. We are already developing new data measurement and KPI [key performance indicator] strategies as we're starting to ask ourselves, "Do our clients really need all of the data and measurement points to solve their business goals?"

PwC: Given these new techniques, is the skill set that's most beneficial to have at Razorfish changing?

RV: It's about understanding how to map data into a format that most people are not familiar with. Most people understand SQL and relational format, so I think the real skill set evolution doesn't have quite as much to do with whether the tool of choice is Java or Python or other technologies; it's more about do I understand normalized versus denormalized structures.

MT: From a more commercial viewpoint, there's a shift away from product type and skill set, which is based around constraints and managing known parameters, and very much more toward what else can we do. It changes the impact—not just in the technology organization, but in the other disciplines as well.

I've already seen a profound effect on the old ways of doing things. Rather than thinking of doing the same things better, it really comes down to having the people and skills to meet your intended business goals.

Using the Elastic MapReduce service with Cascading, our solutions can have a ripple effect on all of the non-technical business processes and engagements across teams. For example, conventional marketing segmentation used to involve teams of analysts who waded through various data sets and stages of processing and analysis to make sense of how a business might view groups of customers. Using the Hadoop-style alternative and Cascading, we're able to identify unconventional relationships across many data points with less effort, and in the process create new segmentations and insights.

This way, we stay relevant and respond more quickly to customer demand. We're identifying new variations and shifts in the data on a real-time basis that would have taken weeks or months, or that we might have missed completely, using the old approach. The analyst's role in creating these new algorithms and designing new methods of campaign planning is clearly key to this type of solution design. The outcome of all this is really interesting and I'm starting to see a subtle, organic response to different changes in the way our solution tracks and targets customer behavior.

PwC: Are you familiar with Bill James, a Major League Baseball statistician who has taken a rather different approach to metrics? James developed some metrics that turned out to be more useful than those used for many years in baseball. That kind of person seems to be the type that you're enabling to hypothesize, perhaps even do some machine learning to generate hypotheses.

RV: Absolutely. Our analytics team within Razorfish has the Bill James type of folks who can help drive different thinking and envision possibilities with the data. We need to find a lot more of those people. They're not very easy to find. And we have some of the leading folks in the industry.

You know, a long, long time ago we designed the Major League Baseball site and the platform. The stat section on that site was always the most difficult part of the site, but the business insisted it needed it. The amount of people who really wanted to churn through that data was small. We were using Oracle at the time. We used the concept of temporary tables, which would denormalize lots of different relational tables for performance reasons, and that was a challenge. If I had the cluster technology we do now back in 1999 and 2000, we could have built to scale much more than going to two measly servers that we could cluster.
Advisory
Sponsor & Technology Leader
Tom DeGarmo
US Thought Leadership
Partner-in-Charge
Tom Craren
Copyedit
Lea Anne Bantsari, Ellen Dunn
Transcription
Paula Burns
pwc.com/us
Tom DeGarmo
Principal, Technology Leader
PricewaterhouseCoopers
+1 267-330-2658
thomas.p.degarmo@us.pwc.com
This publication is printed on Coronado Stipple Cover made from 30% recycled fiber; and
Endeavor Velvet Book made from 50% recycled fiber, a Forest Stewardship Council (FSC)
certified stock using 25% post-consumer waste.
Recycled paper
Subtext
Big Data: Data sets that range from many terabytes to petabytes in size, and that usually consist of less-structured information such as Web log files.

Apache Hadoop: The core of an open-source ecosystem that makes Big Data analysis more feasible through the efficient use of commodity computer clusters.

NoSQL: A class of non-relational data stores and data analysis techniques that are intended for various kinds of less-structured data. Many of these techniques are part of the Hadoop ecosystem.

Gray data: Data from multiple sources that isn't formatted or vetted for specific needs, but is worth exploring with the help of Hadoop cluster analysis techniques.
PricewaterhouseCoopers (www.pwc.com) provides industry-focused assurance, tax and advisory services to build public trust and enhance value for
its clients and their stakeholders. More than 155,000 people in 153 countries across our network share their thinking, experience and solutions to
develop fresh perspectives and practical advice.
© 2010 PricewaterhouseCoopers LLP. All rights reserved. “PricewaterhouseCoopers” refers to PricewaterhouseCoopers LLP, a Delaware limited
liability partnership, or, as the context requires, the PricewaterhouseCoopers global network or other member firms of the network, each of which is a
separate and independent legal entity. This document is for general information purposes only, and should not be used as a substitute for consultation
with professional advisors.