
Big Data Analytics

f rs 22 October 2012; 18:18:2

The Wiley & SAS Business Series presents books that help senior-level managers with
their critical management decisions.
Titles in the Wiley and SAS Business Series include:
Activity-Based Management for Financial Institutions: Driving Bottom-Line Results by Brent Bahnub
Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse
Branded! How Retailers Engage Consumers with Social Media and Mobility by Bernie Brennan and Lori Schafer
Business Analytics for Customer Intelligence by Gert Laursen
Business Analytics for Managers: Taking Business Intelligence beyond Reporting by Gert Laursen and Jesper Thorlund
The Business Forecasting Deal: Exposing Bad Practices and Providing Practical Solutions by Michael Gilliland
Business Intelligence Success Factors: Tools for Aligning Your Business in the Global Economy by Olivia Parr Rud
CIO Best Practices: Enabling Strategic Value with Information Technology, Second Edition by Joe Stenzel
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner
Credit Risk Assessment: The New Lending System for Borrowers, Lenders, and Investors by Clark Abrahams and Mingyuan Zhang
Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring by Naeem Siddiqi
The Data Asset: How Smart Companies Govern Their Data for Business Success by Tony Fisher
Demand-Driven Forecasting: A Structured Approach to Forecasting by Charles Chase
Executive's Guide to Solvency II by David Buckham, Jason Wahl, and Stuart Rose
The Executive's Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow
Fair Lending Compliance: Intelligence and Implications for Credit Risk Management by Clark R. Abrahams and Mingyuan Zhang
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan
Human Capital Analytics: How to Harness the Potential of Your Organization's Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Information Revolution: Using the Information Evolution Model to Grow Your Business by Jim Davis, Gloria J. Miller, and Allan Russell
Manufacturing Best Practices: Optimizing Productivity and Product Quality by Bobby Hull
Marketing Automation: Practical Steps to More Effective Direct Marketing by Jeff LeSueur
Mastering Organizational Knowledge Flow: How to Make Knowledge Sharing Work by Frank Leistner
The New Know: Innovation Powered by Analytics by Thornton May
Performance Management: Integrating Strategy Execution, Methodologies, Risk, and Analytics by Gary Cokins
Retail Analytics: The Secret Weapon by Emmett Cox
Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, Second Edition by Roger W. Hoerl and Ronald D. Snee
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks
The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs
Visual Six Sigma: Making Data Analysis Lean by Ian Cox, Marie A. Gaudard, Philip J. Ramsey, Mia L. Stephens, and Leo Wright

For more information on any of the above titles, please visit www.wiley.com.

Big Data Analytics
Turning Big Data into Big Money

Frank Ohlhorst

John Wiley & Sons, Inc.

Cover image: © liangpv/iStockphoto
Cover design: Michael Rutkowski

Copyright 2013 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, electronic, mechanical, photocopying, recording,
scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United
States Copyright Act, without either the prior written permission of the Publisher, or
authorization through payment of the appropriate per-copy fee to the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax
(978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for
permission should be addressed to the Permissions Department, John Wiley & Sons, Inc.,
111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used
their best efforts in preparing this book, they make no representations or warranties with
respect to the accuracy or completeness of the contents of this book and specifically
disclaim any implied warranties of merchantability or fitness for a particular purpose. No
warranty may be created or extended by sales representatives or written sales materials.
The advice and strategies contained herein may not be suitable for your situation. You
should consult with a professional where appropriate. Neither the publisher nor author
shall be liable for any loss of profit or any other commercial damages, including but not
limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support,
please contact our Customer Care Department within the United States at (800)
762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand.
Some material included with standard print versions of this book may not be included
in e-books or in print-on-demand. If this book refers to media such as a CD or DVD
that is not included in the version you purchased, you may download this material
at http://booksupport.wiley.com. For more information about Wiley products,
visit www.wiley.com.

Ohlhorst, Frank, 1964-

Big data analytics : turning big data into big money / Frank Ohlhorst.
p. cm. -- (Wiley & SAS business series)
Includes index.
ISBN 978-1-118-14759-7 (cloth); ISBN 978-1-118-22582-0 (ePDF);
ISBN 978-1-118-26380-8 (Mobi); ISBN 978-1-118-23904-9 (ePub)
1. Business intelligence. 2. Data mining. I. Title.
HD38.7.O36 2013
658.4'72--dc23

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Preface

Acknowledgments

Chapter 1 What Is Big Data?
    The Arrival of Analytics
    Where Is the Value?
    More to Big Data Than Meets the Eye
    Dealing with the Nuances of Big Data
    An Open Source Brings Forth Tools
    Caution: Obstacles Ahead

Chapter 2 Why Big Data Matters
    Big Data Reaches Deep
    Obstacles Remain
    Data Continue to Evolve
    Data and Data Analysis Are Getting More Complex
    The Future Is Now

Chapter 3 Big Data and the Business Case
    Realizing Value
    The Case for Big Data
    The Rise of Big Data Options
    Beyond Hadoop
    With Choice Come Decisions

Chapter 4 Building the Big Data Team
    The Data Scientist
    The Team Challenge
    Different Teams, Different Goals
    Don't Forget the Data
    Challenges Remain
    Teams versus Culture
    Gauging Success

Chapter 5 Big Data Sources
    Hunting for Data
    Setting the Goal
    Big Data Sources Growing
    Diving Deeper into Big Data Sources
    A Wealth of Public Information
    Getting Started with Big Data Acquisition
    Ongoing Growth, No End in Sight

Chapter 6 The Nuts and Bolts of Big Data
    The Storage Dilemma
    Building a Platform
    Bringing Structure to Unstructured Data
    Processing Power
    Choosing among In-house, Outsourced, or Hybrid Approaches

Chapter 7 Security, Compliance, Auditing, and Protection
    Pragmatic Steps to Securing Big Data
    Classifying Data
    Protecting Big Data Analytics
    Big Data and Compliance
    The Intellectual Property Challenge

Chapter 8 The Evolution of Big Data
    Big Data: The Modern Era
    Today, Tomorrow, and the Next Day
    Changing Algorithms

Chapter 9 Best Practices for Big Data Analytics
    Start Small with Big Data
    Thinking Big
    Avoiding Worst Practices
    Baby Steps
    The Value of Anomalies
    Expediency versus Accuracy
    In-Memory Processing

Chapter 10 Bringing It All Together
    The Path to Big Data
    The Realities of Thinking Big Data
    Hands-on Big Data
    The Big Data Pipeline in Depth
    Big Data Visualization
    Big Data Privacy

Appendix Supporting Data
    The MapR Distribution for Apache Hadoop
    High Availability: No Single Points of Failure

About the Author

Index

What are data? This seems like a simple enough question; however,
depending on the interpretation, the definition of data can be anything
from something recorded to everything under the sun. Data can be
summed up as everything that is experienced, whether it is a machine
recording information from sensors, an individual taking pictures, or a
cosmic event recorded by a scientist. In other words, everything is
data. However, recording and preserving that data has always been
the challenge, and technology has limited the ability to capture and
preserve data.

The human brain's memory storage capacity is estimated to be
around 2.5 petabytes (or about 2.5 million gigabytes). Think of it this way:
If your brain worked like a digital video recorder in a television, 2.5
petabytes would be enough to hold 3 million hours of TV shows. You
would have to leave the TV running continuously for more than 300
years to use up all of that storage space. The available technology for
storing data pales in comparison, creating a technology segment called
Big Data that is growing exponentially.
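
The figures in the digital video recorder comparison can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check, not part of the original estimate: it assumes decimal storage units (1 petabyte = 1,000,000 gigabytes) and an average year of 365.25 days, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the brain-as-DVR comparison.
# Assumptions (not from the text): decimal units (1 PB = 1,000,000 GB)
# and an average year of 365.25 days.

GB_PER_PB = 1_000_000
HOURS_PER_YEAR = 24 * 365.25

capacity_pb = 2.5        # estimated brain capacity quoted in the text
tv_hours = 3_000_000     # hours of TV shows quoted in the text

# Convert the capacity to gigabytes.
capacity_gb = capacity_pb * GB_PER_PB

# How long continuous playback of 3 million hours would take.
years_of_playback = tv_hours / HOURS_PER_YEAR

# Recording rate implied by fitting 3 million hours into 2.5 PB.
implied_gb_per_hour = capacity_gb / tv_hours

print(f"Capacity: {capacity_gb:,.0f} GB")
print(f"Continuous playback: {years_of_playback:.0f} years")
print(f"Implied recording rate: {implied_gb_per_hour:.2f} GB per hour")
```

Under these assumptions, 3 million hours works out to roughly 342 years of continuous playback, which is consistent with "more than 300 years," and the comparison implies a recording rate a little under 1 gigabyte per hour.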

Today, businesses are recording more and more information, and
that information (or data) is growing, consuming more and more
storage space and becoming harder to manage, thus creating Big Data.
The reasons vary for the need to record such massive amounts of
information. Sometimes the reason is adherence to compliance
regulations, at other times it is the need to preserve transactions, and in
many cases it is simply part of a backup strategy.

Nevertheless, it costs time and money to save data, even if it's only
for posterity. Therein lies the biggest challenge: How can businesses
continue to afford to save massive amounts of data? Fortunately, those
who have come up with the technologies to mitigate these storage
concerns have also come up with a way to derive value from what
many see as a burden. It is a process called Big Data analytics.

The concepts behind Big Data analytics are actually nothing new.
Businesses have been using business intelligence tools for many
decades, and scientists have been studying data sets to uncover the secrets
of the universe for many years. However, the scale of data collection is
changing, and the more data you have available, the more information
you can extrapolate from them.

The challenge today is to find the value of the data and to explore
data sources in more interesting and applicable ways to develop
intelligence that can drive decisions, find relationships, solve problems,
and increase profits, productivity, and even the quality of life.

The key is to think big, and that means Big Data analytics.

This book will explore the concepts behind Big Data, how to
analyze that data, and the payoff from interpreting the analyzed data.

Chapter 1 deals with the origins of Big Data analytics, explores the
    evolution of the associated technology, and explains the basic
    concepts behind deriving value.
Chapter 2 delves into the different types of data sources and
    explains why those sources are important to businesses that
    are seeking to find value in data sets.
Chapter 3 helps those who are looking to leverage data analytics to
    build a business case to spur investment in the technologies
    and to develop the skill sets needed to successfully extract
    intelligence and value out of data sets.
Chapter 4 brings the concepts of the analytics team together,
    describes the necessary skill sets, and explains how to integrate
    Big Data into a corporate culture.
Chapter 5 assists in the hunt for data sources to feed Big Data
    analytics, covers the various public and private sources for data, and
    identifies the different types of data usable for analytics.
Chapter 6 deals with storage, processing power, and platforms by
    describing the elements that make up a Big Data analytics
    system.
Chapter 7 describes the importance of security, compliance, and
    auditing: the tools and techniques that keep large data sources
    secure yet available for analytics.
Chapter 8 delves into the evolution of Big Data and discusses the
    short-term and long-term changes that will materialize as Big
    Data evolves and is adopted by more and more organizations.
Chapter 9 discusses best practices for data analysis, covers some of
    the key concepts that make Big Data analytics easier to deliver,
    and warns of the potential pitfalls and how to avoid them.
Chapter 10 explores the concept of the data pipeline and how
    Big Data moves through the analysis process and is then
    transformed into usable information that delivers value.

Sometimes the best information on a particular technology comes
from those who are promoting that technology for profit and growth,
hence the birth of the white paper. White papers are meant to
educate and inform potential customers about a particular technology
segment while gently goading those potential customers toward the
vendor's product.

That said, it is always best to take white papers with a grain of
salt. Nevertheless, white papers prove to be an excellent source for
researching technology and have significant educational value. With
that in mind, I have included the following white papers in the appendix
of this book, and each offers additional knowledge for those who are
looking to leverage Big Data solutions: The MapR Distribution for
Apache Hadoop and High Availability: No Single Points of Failure,
both from MapR Technologies.

Acknowledgments

Take it from me, writing a book takes time, patience, and motivation in
equal measures. At times the challenges can be overwhelming, and it
becomes very easy to lose focus. However, analytics, patterns, and
uncovering the hidden meaning behind data have always attracted
me. When one considers the possibilities offered by comprehensive
analytics and the inclusion of what may seem to be unrelated data sets,
the effort involved seems almost inconsequential.

The idea for this book came from a brief conversation with John
Wiley & Sons editor Timothy Burgard, who contacted me out of the
blue with a proposition to build on some articles I had written on Big
Data. Tim explained that comprehensive information that could be
consumed by C-level executives and those entering the data analytics
arena was sorely lacking, and he thought that I was up to the challenge
of creating that information. So it was with Tim's encouragement that I
started down the path to create a book on Big Data.

I would be remiss if I didn't mention the excellent advice and
additional motivation that I received from John Wiley & Sons
development editor Stacey Rivera, who was faced with the challenge of
keeping me on track and moving me along in the process, a chore
that I would not wish on anyone!

Putting together a book like this is a long journey that introduced
me to many experts, mentors, and acquaintances who helped me to
shape my ideology on how large data sets can be brought together for
processing to uncover trends and other valuable bits of information.

I also have to acknowledge the many vendors in the Big Data
arena who inadvertently helped me along my journey to expose the
value contained in data. Those vendors, who number in the dozens,
have made concentrated efforts to educate the public about the value
behind Big Data, and the events they have sponsored as well as the
information they have disseminated have helped to further define the
market and give rise to conversations that encouraged me to pursue
my ultimate goal of writing a book.

Writing takes a great deal of energy and can quickly consume all
of the hours in a day. With that in mind, I have to thank the numerous
editors whom I have worked with on freelance projects while
concurrently writing this book. Without their understanding and flexibility,
I could never have written this book, or any other. Special thanks
go out to Mike Vizard, Ed Scannell, Mike Fratto, Mark Fontecchio,
James Allen Miller, and Cameron Sturdevant.

When it comes to providing the ultimate in encouragement and
support, no one can compare with my wife, Carol, who understood
the toll that writing a book would take on family time and was still
willing to provide me with whatever I needed to successfully complete
this book. I also have to thank my children, Connor, Tyler, Sarah, and
Katelyn, for understanding that Daddy had to work and was not
always available. I am very thankful to have such a wonderful and
supportive family.
