Big Data Analytics

Big Data
Turning Big Data into Big Money

Frank Ohlhorst

John Wiley & Sons, Inc.

Preface

Acknowledgments

Chapter 1 What Is Big Data?

The Arrival of Analytics
Where Is the Value?
More to Big Data Than Meets the Eye
Dealing with the Nuances of Big Data
An Open Source Brings Forth Tools
Caution: Obstacles Ahead

Chapter 2 Why Big Data Matters

Big Data Reaches Deep
Obstacles Remain
Data Continue to Evolve
Data and Data Analysis Are Getting More Complex
The Future Is Now

Chapter 3 Big Data and the Business Case

Realizing Value
The Case for Big Data
The Rise of Big Data Options
Beyond Hadoop
With Choice Come Decisions

Chapter 4 Building the Big Data Team

The Data Scientist
The Team Challenge
Different Teams, Different Goals
Don't Forget the Data
Challenges Remain
Teams versus Culture
Gauging Success

Chapter 5 Big Data Sources

Hunting for Data
Setting the Goal
Big Data Sources Growing
Diving Deeper into Big Data Sources
A Wealth of Public Information
Getting Started with Big Data Acquisition
Ongoing Growth, No End in Sight

Chapter 6 The Nuts and Bolts of Big Data

The Storage Dilemma
Building a Platform
Bringing Structure to Unstructured Data
Processing Power
Choosing among In-house, Outsourced, or Hybrid Approaches

Chapter 7 Security, Compliance, Auditing, and Protection

Pragmatic Steps to Securing Big Data
Classifying Data
Protecting Big Data Analytics
Big Data and Compliance
The Intellectual Property Challenge

Chapter 8 The Evolution of Big Data

Big Data: The Modern Era
Today, Tomorrow, and the Next Day
Changing Algorithms

Chapter 9 Best Practices for Big Data Analytics

Start Small with Big Data
Thinking Big
Avoiding Worst Practices
Baby Steps
The Value of Anomalies
Expediency versus Accuracy
In-Memory Processing

Chapter 10 Bringing It All Together

The Path to Big Data
The Realities of Thinking Big Data
Hands-on Big Data
The Big Data Pipeline in Depth
Big Data Visualization
Big Data Privacy

Appendix Supporting Data

The MapR Distribution for Apache Hadoop
High Availability: No Single Points of Failure

About the Author

Index

What are data? This seems like a simple enough question; however,
depending on the interpretation, the definition of data can be anything
from something recorded to everything under the sun. Data can be
summed up as everything that is experienced, whether it is a machine
recording information from sensors, an individual taking pictures, or a
cosmic event recorded by a scientist. In other words, everything is
data. However, recording and preserving that data has always been
the challenge, and technology has limited the ability to capture and
preserve data.
The human brain's memory storage capacity is thought to be
around 2.5 petabytes (or 2.5 million gigabytes). Think of it this way:
If your brain worked like a digital video recorder in a television, 2.5
petabytes would be enough to hold 3 million hours of TV shows. You
would have to leave the TV running continuously for more than 300
years to use up all of that storage space. The available technology for
storing data pales in comparison, creating a technology segment called
Big Data that is growing exponentially.
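The figures in this comparison can be checked with a little arithmetic. The sketch below is only an illustration, written in Python, and it assumes decimal units (1 petabyte = 10^15 bytes, 1 gigabyte = 10^9 bytes) and a 365-day year. The per-hour recording rate it prints is derived from the quoted figures and is not a number stated in the text.

```python
# Sanity check of the storage comparison, assuming decimal units.
PETABYTE = 10**15  # bytes (assumption: decimal, not binary, units)
GIGABYTE = 10**9   # bytes

capacity_bytes = 2.5 * PETABYTE   # quoted estimate of brain capacity
tv_hours = 3_000_000              # quoted hours of TV shows
hours_per_year = 24 * 365         # assumption: non-leap year

capacity_gb = capacity_bytes / GIGABYTE    # capacity expressed in gigabytes
years_of_tv = tv_hours / hours_per_year    # continuous viewing time in years
gb_per_hour = capacity_gb / tv_hours       # implied recording rate

print(f"Capacity: {capacity_gb:,.0f} GB")
print(f"Continuous viewing: {years_of_tv:.0f} years")
print(f"Implied recording rate: {gb_per_hour:.2f} GB per hour")
```

Under these assumptions, 2.5 petabytes works out to 2.5 million gigabytes, and 3 million hours of continuous viewing comes to a little over 340 years, which is consistent with "more than 300 years."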
Today, businesses are recording more and more information, and
that information (or data) is growing, consuming more and more
storage space and becoming harder to manage, thus creating Big Data.
The reasons vary for the need to record such massive amounts of
information. Sometimes the reason is adherence to compliance
regulations, at other times it is the need to preserve transactions, and in
many cases it is simply part of a backup strategy.
Nevertheless, it costs time and money to save data, even if it's only
for posterity. Therein lies the biggest challenge: How can businesses
continue to afford to save massive amounts of data? Fortunately, those
who have come up with the technologies to mitigate these storage
concerns have also come up with a way to derive value from what
many see as a burden. It is a process called Big Data analytics.
The concepts behind Big Data analytics are actually nothing new.
Businesses have been using business intelligence tools for many
decades, and scientists have been studying data sets to uncover the secrets
of the universe for many years. However, the scale of data collection is
changing, and the more data you have available, the more information
you can extrapolate from them.
The challenge today is to find the value of the data and to explore
data sources in more interesting and applicable ways to develop
intelligence that can drive decisions, find relationships, solve problems,
and increase profits, productivity, and even the quality of life.
The key is to think big, and that means Big Data analytics.
This book will explore the concepts behind Big Data, how to
analyze that data, and the payoff from interpreting the analyzed data.

Chapter 1 deals with the origins of Big Data analytics, explores the
evolution of the associated technology, and explains the basic
concepts behind deriving value.
Chapter 2 delves into the different types of data sources and
explains why those sources are important to businesses that
are seeking to find value in data sets.
Chapter 3 helps those who are looking to leverage data analytics to
build a business case to spur investment in the technologies
and to develop the skill sets needed to successfully extract
intelligence and value out of data sets.
Chapter 4 brings the concepts of the analytics team together,
describes the necessary skill sets, and explains how to integrate
Big Data into a corporate culture.
Chapter 5 assists in the hunt for data sources to feed Big Data
analytics, covers the various public and private sources for data, and
identifies the different types of data usable for analytics.
Chapter 6 deals with storage, processing power, and platforms by
describing the elements that make up a Big Data analytics
system.
Chapter 7 describes the importance of security, compliance, and
auditing: the tools and techniques that keep large data sources
secure yet available for analytics.
Chapter 8 delves into the evolution of Big Data and discusses the
short-term and long-term changes that will materialize as Big
Data evolves and is adopted by more and more organizations.
Chapter 9 discusses best practices for data analysis, covers some of
the key concepts that make Big Data analytics easier to deliver,
and warns of the potential pitfalls and how to avoid them.
Chapter 10 explores the concept of the data pipeline and how
Big Data moves through the analysis process and is then
transformed into usable information that delivers value.

Sometimes the best information on a particular technology comes
from those who are promoting that technology for profit and growth,
hence the birth of the white paper. White papers are meant to
educate and inform potential customers about a particular technology
segment while gently goading those potential customers toward the
vendor's product.
That said, it is always best to take white papers with a grain of
salt. Nevertheless, white papers prove to be an excellent source for
researching technology and have significant educational value. With
that in mind, I have included the following white papers in the appendix
of this book, and each offers additional knowledge for those who are
looking to leverage Big Data solutions: "The MapR Distribution for
Apache Hadoop" and "High Availability: No Single Points of Failure,"
both from MapR Technologies.

Acknowledgments

Take it from me, writing a book takes time, patience, and motivation in
equal measures. At times the challenges can be overwhelming, and it
becomes very easy to lose focus. However, analytics, patterns, and
uncovering the hidden meaning behind data have always attracted
me. When one considers the possibilities offered by comprehensive
analytics and the inclusion of what may seem to be unrelated data sets,
the effort involved seems almost inconsequential.
The idea for this book came from a brief conversation with John
Wiley & Sons editor Timothy Burgard, who contacted me out of the
blue with a proposition to build on some articles I had written on Big
Data. Tim explained that comprehensive information that could be
consumed by C-level executives and those entering the data analytics
arena was sorely lacking, and he thought that I was up to the challenge
of creating that information. So it was with Tim's encouragement that I
started down the path to create a book on Big Data.
I would be remiss if I didn't mention the excellent advice and
additional motivation that I received from John Wiley & Sons
development editor Stacey Rivera, who was faced with the challenge of
keeping me on track and moving me along in the process, a chore
that I would not wish on anyone!
Putting together a book like this is a long journey that introduced
me to many experts, mentors, and acquaintances who helped me to
shape my ideology on how large data sets can be brought together for
processing to uncover trends and other valuable bits of information.
I also have to acknowledge the many vendors in the Big Data
arena who inadvertently helped me along my journey to expose the
value contained in data. Those vendors, who number in the dozens,
have made concentrated efforts to educate the public about the value
behind Big Data, and the events they have sponsored as well as the
information they have disseminated have helped to further define the
market and give rise to conversations that encouraged me to pursue
my ultimate goal of writing a book.
Writing takes a great deal of energy and can quickly consume all
of the hours in a day. With that in mind, I have to thank the numerous
editors whom I have worked with on freelance projects while
concurrently writing this book. Without their understanding and flexibility,
I could never have written this book, or any other. Special thanks
go out to Mike Vizard, Ed Scannell, Mike Fratto, Mark Fontecchio,
James Allen Miller, and Cameron Sturdevant.
When it comes to providing the ultimate in encouragement and
support, no one can compare with my wife, Carol, who understood
the toll that writing a book would take on family time and was still
willing to provide me with whatever I needed to successfully complete
this book. I also have to thank my children, Connor, Tyler, Sarah, and
Katelyn, for understanding that Daddy had to work and was not
always available. I am very thankful to have such a wonderful and
supportive family.
