
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/266938495

Managing Big Data

Book · November 2014


1 author:

Dr. Chandrakant Naikodi
MNC, Bangalore


All content following this page was uploaded by Dr.Chandrakant Naikodi on 05 December 2014.



Managing Big Data

CHANDRAKANT NAIKODI
ABOUT AUTHOR
Dr. Chandrakant N
Dr. Chandrakant Naikodi is presently working as a Senior Software Engineer in an MNC and
is a visiting professor at Cambridge Institute of Technology, Bangalore, India. He completed
a Diploma in CSE, a BE degree in Information Science and Engg., and an ME in Information
Technology. His PhD degree in Computer Science and Engg. was earned from
Bangalore University. He has published many research papers in refereed international
journals and conferences. He is the author of many technical books, namely “C: Test
Your Aptitude” and “1000 Questions and Answers in C++” published by Tata McGraw-Hill,
“Programming in C and Data Structure” by Vikas Publication, and
“Wireless Sensor Network for Beginners” by Mudranik Technologies. His areas of interest
include Computer Networks, MANETs, WSN, Programming Languages, and Big Data.

PREFACE
Big data is a popular term used to describe the exponential growth and availability of
data, both structured and unstructured. Hadoop is an open-source software framework
for storing and processing big data in a distributed fashion on large clusters of commodity
hardware. This book concentrates on Hadoop architecture, HDFS, MapReduce, and related
tools. The author will appreciate suggestions or feedback from readers and users of this
book; kindly communicate via the email addresses chandrakant.naikodi@{yahoo.in, gmail.com,
facebook.com}.
ACKNOWLEDGEMENTS
My deep gratitude and thanks go to my wife Mrs. Vidyadhare Chandrakant and my daughter
Vaishnavi N for their immense patience, prayers and support. My sincere thanks to my father
Mr. Dharmanna N and mother Mrs. Shanthabhai Dharmanna for their blessings and support.
I thank my brothers Mr. Shankar N, Mr. Surykant N, father-in-law Mr. Venkatesh J,
mother-in-law Mrs. Kalavathi V, my brother-in-law Mr. Nataraj and Mr. Raghavendra.
I am greatly thankful to my well-wishers and teachers, especially the late Sri. B C Bhavikatti, Sri.
M B Naikodi, Sri. Somashekar Muniyappa, Sri. Sanjeevkumar Chetti and Sri. Santhoshkumar
Hanaji, who supported and encouraged me greatly at every step.
This book could not have been completed without the support of my family and friends and the help
of several individuals who extended their valuable support in the preparation and compilation
of this book.
To daughter Vaishnavi
Contents

1 UNDERSTANDING BIG DATA 1


1.1 What is Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Why Big Data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Data! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Data Storage and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Comparison with Other Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5.1 Relational Database Management System (RDBMS) . . . . . . . . . . . . . . 4
1.5.2 Grid Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5.3 Volunteer Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Convergence of Key Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 Unstructured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.8 Industry Examples of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8.1 Web Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.8.2 Big Data and Marketing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.8.3 Fraud and Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.8.4 Risk and Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8.5 Big Data and Algorithmic Trading . . . . . . . . . . . . . . . . . . . . . . . 11
1.8.6 Big Data and Healthcare . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8.7 Big Data in Medicine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.8.8 Advertising and Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.9 Big Data Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.9.1 Introduction to Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.9.2 Open Source Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.9.3 Cloud and Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.9.4 Mobile Business Intelligence (Mobile BI) . . . . . . . . . . . . . . . . . . . 19
1.9.5 Crowd Sourcing Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.9.6 Inter and Trans Firewall Analytics . . . . . . . . . . . . . . . . . . . . . . . 20

2 NOSQL DATA MANAGEMENT 23


2.1 Introduction to NoSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Aggregate Data Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Aggregates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2 Key-value and Document Data Models . . . . . . . . . . . . . . . . . . . . 33
2.3 Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.4 Graph Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Schemaless Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Materialized Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Distribution Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7.1 Sharding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.8 Version Stamps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.9 Map Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.10 Partitioning and Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.11 Composing Map-Reduce Calculations . . . . . . . . . . . . . . . . . . . . . . . . . 46

3 BASICS OF HADOOP 51
3.1 Data Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 Analyzing Data with Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.1 Map and Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.2 Java MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Scaling Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Hadoop Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Hadoop Pipes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.7 Design of Hadoop Distributed File System (HDFS) . . . . . . . . . . . . . . . . . 62
3.8 HDFS Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8.1 Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8.2 Namenodes (NN) and Datanodes (DN) . . . . . . . . . . . . . . . . . . . . . 63
3.8.3 HDFS Federation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.8.4 HDFS High-Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.9 Java Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.9.1 Reading Data from a Hadoop URL . . . . . . . . . . . . . . . . . . . . . . . 66
3.9.2 Reading Data using the FileSystem API . . . . . . . . . . . . . . . . . . . . 66
3.9.3 Writing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.9.4 Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.9.5 Querying the Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.9.6 Deleting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.10 Hadoop I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.10.1 Data Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.10.2 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.10.3 Serialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.10.4 File-based Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4 MAP-REDUCE APPLICATIONS 77
4.1 Map-Reduce Workflows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.1 Decomposing a Problem into MapReduce Jobs . . . . . . . . . . . . . . . 77
4.1.2 JobControl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.1.3 Apache Oozie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Unit tests with MRUnit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.1 Mappers Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.2 Reducers Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.3 Integration Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Test Data and Local Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.1 Testing the Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4 Anatomy of a MapReduce Job Run . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.4.1 Classic MapReduce (MapReduce 1) . . . . . . . . . . . . . . . . . . . . . . 87
4.4.2 YARN (MapReduce 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5 Failures in Classic MapReduce and YARN . . . . . . . . . . . . . . . . . . . . . . . 93
4.5.1 Failures in Classic MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5.2 Failures in YARN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6 Job Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.1 The Fair Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.6.2 The Capacity Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.7 Shuffle and Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.7.1 The Map Side . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.7.2 The Reduce Side . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7.3 Configuration Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.8 Task Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.8.1 The Task Execution Environment . . . . . . . . . . . . . . . . . . . . . . . . 100
4.8.2 Speculative Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.8.3 Output Committers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.8.4 Task JVM Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.8.5 Skipping Bad Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.9 MapReduce Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.9.1 MapReduce Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.9.2 Input Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.9.3 Output Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

5 HADOOP RELATED TOOLS 117


5.1 HBase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.1.1 Data Model and Implementations . . . . . . . . . . . . . . . . . . . . . . . 122
5.1.2 HBase Clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.1.3 HBase Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.2 Praxis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2.1 Versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2.2 HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.2.3 UI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.2.4 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.2.5 Schema Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.2.6 Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.2.7 Bulk Load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3 Cassandra, Cassandra data model . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.1 Cassandra Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.3.2 Cassandra Clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.3.3 Hadoop Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.4 Pig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.4.1 Grunt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.4.2 Pig Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.4.3 Pig Latin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.4.4 Developing and Testing Pig Latin Scripts . . . . . . . . . . . . . . . . . . . 141
5.5 Hive Data Types and File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5.1 HiveQL Data Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.5.2 HiveQL Data Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.5.3 HiveQL Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

6 LAB EXPERIMENTS 153


6.1 Exercise 1: HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Exercise 2: MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.3 Exercise 3: MapReduce (Programs) . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.4 Exercise 4: Extract Facts using Hive . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.5 Exercise 5: Extract Sessions using Pig . . . . . . . . . . . . . . . . . . . . . . . . . . 166
