Hadoop: Data Processing and Modelling
About this ebook
- Conquer the mountain of data using Hadoop 2.X tools
- The authors succeed in creating a context for Hadoop and its ecosystem
- Hands-on examples and recipes giving the bigger picture and helping you to master Hadoop 2.X data processing platforms
- Overcome the challenging data processing problems using this exhaustive course with Hadoop 2.X
This course is for Java developers who know scripting and want a career shift into the Hadoop and Big Data segment of the IT industry. Whether you are a Hadoop novice or an expert, this book will take you to the most advanced levels of Hadoop 2.X.
Garry Turkington
Garry Turkington has 14 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current roles as VP Data Engineering at Improve Digital and the company's lead architect, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon UK, where he led several software development teams building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA. He has BSc and PhD degrees in computer science from Queen's University Belfast in Northern Ireland and an MEng in Systems Engineering from the Stevens Institute of Technology in the USA.
Table of Contents
Hadoop: Data Processing and Modelling
Hadoop: Data Processing and Modelling
Credits
Preface
What this learning path covers
Hadoop Beginner's Guide
Hadoop Real-World Solutions Cookbook, Second Edition
Mastering Hadoop
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. What It's All About
Big data processing
The value of data
Historically for the few and not the many
Classic data processing systems
Scale-up
Early approaches to scale-out
Limiting factors
A different approach
All roads lead to scale-out
Share nothing
Expect failure
Smart software, dumb hardware
Move processing, not data
Build applications, not infrastructure
Hadoop
Thanks, Google
Thanks, Doug
Thanks, Yahoo
Parts of Hadoop
Common building blocks
HDFS
MapReduce
Better together
Common architecture
What it is and isn't good for
Cloud computing with Amazon Web Services
Too many clouds
A third way
Different types of costs
AWS – infrastructure on demand from Amazon
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic MapReduce (EMR)
What this book covers
A dual approach
Summary
2. Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Other operating systems
Time for action – checking the prerequisites
What just happened?
Setting up Hadoop
A note on versions
Time for action – downloading Hadoop
What just happened?
Time for action – setting up SSH
What just happened?
Configuring and running Hadoop
Time for action – using Hadoop to calculate Pi
What just happened?
Three modes
Time for action – configuring the pseudo-distributed mode
What just happened?
Configuring the base directory and formatting the filesystem
Time for action – changing the base HDFS directory
What just happened?
Time for action – formatting the NameNode
What just happened?
Starting and using Hadoop
Time for action – starting Hadoop
What just happened?
Time for action – using HDFS
What just happened?
Time for action – WordCount, the Hello World of MapReduce
What just happened?
Have a go hero – WordCount on a larger body of text
Monitoring Hadoop from the browser
The HDFS web UI
The MapReduce web UI
Using Elastic MapReduce
Setting up an account in Amazon Web Services
Creating an AWS account
Signing up for the necessary services
Time for action – WordCount on EMR using the management console
What just happened?
Have a go hero – other EMR sample applications
Other ways of using EMR
AWS credentials
The EMR command-line tools
The AWS ecosystem
Comparison of local versus EMR Hadoop
Summary
3. Understanding MapReduce
Key/value pairs
What it means
Why key/value data?
Some real-world examples
MapReduce as a series of key/value transformations
Pop quiz – key/value pairs
The Hadoop Java API for MapReduce
The 0.20 MapReduce Java API
The Mapper class
The Reducer class
The Driver class
Writing MapReduce programs
Time for action – setting up the classpath
What just happened?
Time for action – implementing WordCount
What just happened?
Time for action – building a JAR file
What just happened?
Time for action – running WordCount on a local Hadoop cluster
What just happened?
Time for action – running WordCount on EMR
What just happened?
The pre-0.20 Java MapReduce API
Hadoop-provided mapper and reducer implementations
Time for action – WordCount the easy way
What just happened?
Walking through a run of WordCount
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reduce input
Partitioning
The optional partition function
Reducer input
Reducer execution
Reducer output
Shutdown
That's all there is to it!
Apart from the combiner…maybe
Why have a combiner?
Time for action – WordCount with a combiner
What just happened?
When you can use the reducer as the combiner
Time for action – fixing WordCount to work with a combiner
What just happened?
Reuse is your friend
Pop quiz – MapReduce mechanics
Hadoop-specific data types
The Writable and WritableComparable interfaces
Introducing the wrapper classes
Primitive wrapper classes
Array wrapper classes
Map wrapper classes
Time for action – using the Writable wrapper classes
What just happened?
Other wrapper classes
Have a go hero – playing with Writables
Making your own
Input/output
Files, splits, and records
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Don't forget Sequence files
Summary
4. Developing MapReduce Programs
Using languages other than Java with Hadoop
How Hadoop Streaming works
Why to use Hadoop Streaming
Time for action – implementing WordCount using Streaming
What just happened?
Differences in jobs when using Streaming
Analyzing a large dataset
Getting the UFO sighting dataset
Getting a feel for the dataset
Time for action – summarizing the UFO data
What just happened?
Examining UFO shapes
Time for action – summarizing the shape data
What just happened?
Time for action – correlating sighting duration to UFO shape
What just happened?
Using Streaming scripts outside Hadoop
Time for action – performing the shape/time analysis from the command line
What just happened?
Java shape and location analysis
Time for action – using ChainMapper for field validation/analysis
What just happened?
Have a go hero
Too many abbreviations
Using the Distributed Cache
Time for action – using the Distributed Cache to improve location output
What just happened?
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
What just happened?
Too much information!
Summary
5. Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
When this is a bad idea
Map-side versus reduce-side joins
Matching account and sales information
Time for action – reduce-side join using MultipleInputs
What just happened?
DataJoinMapper and TaggedMapperOutput
Implementing map-side joins
Using the Distributed Cache
Have a go hero - Implementing map-side joins
Pruning data to fit in the cache
Using a data representation instead of raw data
Using multiple mappers
To join or not to join...
Graph algorithms
Graph 101
Graphs and MapReduce – a match made somewhere
Representing a graph
Time for action – representing the graph
What just happened?
Overview of the algorithm
The mapper
The reducer
Iterative application
Time for action – creating the source code
What just happened?
Time for action – the first run
What just happened?
Time for action – the second run
What just happened?
Time for action – the third run
What just happened?
Time for action – the fourth and last run
What just happened?
Running multiple jobs
Final thoughts on graphs
Using language-independent data structures
Candidate technologies
Introducing Avro
Time for action – getting and installing Avro
What just happened?
Avro and schemas
Time for action – defining the schema
What just happened?
Time for action – creating the source Avro data with Ruby
What just happened?
Time for action – consuming the Avro data with Java
What just happened?
Using Avro within MapReduce
Time for action – generating shape summaries in MapReduce
What just happened?
Time for action – examining the output data with Ruby
What just happened?
Time for action – examining the output data with Java
What just happened?
Have a go hero – graphs in Avro
Going forward with Avro
Summary
6. When Things Break
Failure
Embrace failure
Or at least don't fear it
Don't try this at home
Types of failure
Hadoop node failure
The dfsadmin command
Cluster setup, test files, and block sizes
Fault tolerance and Elastic MapReduce
Time for action – killing a DataNode process
What just happened?
NameNode and DataNode communication
Have a go hero – NameNode log delving
Time for action – the replication factor in action
What just happened?
Time for action – intentionally causing missing blocks
What just happened?
When data may be lost
Block corruption
Time for action – killing a TaskTracker process
What just happened?
Comparing the DataNode and TaskTracker failures
Permanent failure
Killing the cluster masters
Time for action – killing the JobTracker
What just happened?
Starting a replacement JobTracker
Have a go hero – moving the JobTracker to a new host
Time for action – killing the NameNode process
What just happened?
Starting a replacement NameNode
The role of the NameNode in more detail
File systems, files, blocks, and nodes
The single most important piece of data in the cluster – fsimage
DataNode startup
Safe mode
SecondaryNameNode
So what to do when the NameNode process has a critical failure?
BackupNode/CheckpointNode and NameNode HA
Hardware failure
Host failure
Host corruption
The risk of correlated failures
Task failure due to software
Failure of slow running tasks
Time for action – causing task failure
What just happened?
Have a go hero – HDFS programmatic access
Hadoop's handling of slow-running tasks
Speculative execution
Hadoop's handling of failing tasks
Have a go hero – causing tasks to fail
Task failure due to data
Handling dirty data through code
Using Hadoop's skip mode
Time for action – handling dirty data by using skip mode
What just happened?
To skip or not to skip...
Summary
7. Keeping Things Running
A note on EMR
Hadoop configuration properties
Default values
Time for action – browsing default properties
What just happened?
Additional property elements
Default storage location
Where to set properties
Setting up a cluster
How many hosts?
Calculating usable space on a node
Location of the master nodes
Sizing hardware
Processor / memory / storage ratio
EMR as a prototyping platform
Special node requirements
Storage types
Commodity versus enterprise class storage
Single disk versus RAID
Finding the balance
Network storage
Hadoop networking configuration
How blocks are placed
Rack awareness
The rack-awareness script
Time for action – examining the default rack configuration
What just happened?
Time for action – adding a rack awareness script
What just happened?
What is commodity hardware anyway?
Pop quiz – setting up a cluster
Cluster access control
The Hadoop security model
Time for action – demonstrating the default security
What just happened?
User identity
The super user
More granular access control
Working around the security model via physical access control
Managing the NameNode
Configuring multiple locations for the fsimage class
Time for action – adding an additional fsimage location
What just happened?
Where to write the fsimage copies
Swapping to another NameNode host
Having things ready before disaster strikes
Time for action – swapping to a new NameNode host
What just happened?
Don't celebrate quite yet!
What about MapReduce?
Have a go hero – swapping to a new NameNode host
Managing HDFS
Where to write data
Using balancer
When to rebalance
MapReduce management
Command line job management
Have a go hero – command line job management
Job priorities and scheduling
Time for action – changing job priorities and killing a job
What just happened?
Alternative schedulers
Capacity Scheduler
Fair Scheduler
Enabling alternative schedulers
When to use alternative schedulers
Scaling
Adding capacity to a local Hadoop cluster
Have a go hero – adding a node and running balancer
Adding capacity to an EMR job flow
Expanding a running job flow
Summary
8. A Relational View on Data with Hive
Overview of Hive
Why use Hive?
Thanks, Facebook!
Setting up Hive
Prerequisites
Getting Hive
Time for action – installing Hive
What just happened?
Using Hive
Time for action – creating a table for the UFO data
What just happened?
Time for action – inserting the UFO data
What just happened?
Validating the data
Time for action – validating the table
What just happened?
Time for action – redefining the table with the correct column separator
What just happened?
Hive tables – real or not?
Time for action – creating a table from an existing file
What just happened?
Time for action – performing a join
What just happened?
Have a go hero – improve the join to use regular expressions
Hive and SQL views
Time for action – using views
What just happened?
Handling dirty data in Hive
Have a go hero – do it!
Time for action – exporting query output
What just happened?
Partitioning the table
Time for action – making a partitioned UFO sighting table
What just happened?
Bucketing, clustering, and sorting... oh my!
User-Defined Function
Time for action – adding a new User Defined Function (UDF)
What just happened?
To preprocess or not to preprocess...
Hive versus Pig
What we didn't cover
Hive on Amazon Web Services
Time for action – running UFO analysis on EMR
What just happened?
Using interactive job flows for development
Have a go hero – using an interactive EMR cluster
Integration with other AWS products
Summary
9. Working with Relational Databases
Common data paths
Hadoop as an archive store
Hadoop as a preprocessing step
Hadoop as a data input tool
The serpent eats its own tail
Setting up MySQL
Time for action – installing and setting up MySQL
What just happened?
Did it have to be so hard?
Time for action – configuring MySQL to allow remote connections
What just happened?
Don't do this in production!
Time for action – setting up the employee database
What just happened?
Be careful with data file access rights
Getting data into Hadoop
Using MySQL tools and manual import
Have a go hero – exporting the employee table into HDFS
Accessing the database from the mapper
A better way – introducing Sqoop
Time for action – downloading and configuring Sqoop
What just happened?
Sqoop and Hadoop versions
Sqoop and HDFS
Time for action – exporting data from MySQL to HDFS
What just happened?
Mappers and primary key columns
Other options
Sqoop's architecture
Importing data into Hive using Sqoop
Time for action – exporting data from MySQL into Hive
What just happened?
Time for action – a more selective import
What just happened?
Datatype issues
Time for action – using a type mapping
What just happened?
Time for action – importing data from a raw query
What just happened?
Have a go hero
Sqoop and Hive partitions
Field and line terminators
Getting data out of Hadoop
Writing data from within the reducer
Writing SQL import files from the reducer
A better way – Sqoop again
Time for action – importing data from Hadoop into MySQL
What just happened?
Differences between Sqoop imports and exports
Inserts versus updates
Have a go hero
Sqoop and Hive exports
Time for action – importing Hive data into MySQL
What just happened?
Time for action – fixing the mapping and re-running the export
What just happened?
Other Sqoop features
Incremental merge
Avoiding partial exports
Sqoop as a code generator
AWS considerations
Considering RDS
Summary
10. Data Collection with Flume
A note about AWS
Data data everywhere...
Types of data
Getting network traffic into Hadoop
Time for action – getting web server data into Hadoop
What just happened?
Have a go hero
Getting files into Hadoop
Hidden issues
Keeping network data on the network
Hadoop dependencies
Reliability
Re-creating the wheel
A common framework approach
Introducing Apache Flume
A note on versioning
Time for action – installing and configuring Flume
What just happened?
Using Flume to capture network data
Time for action – capturing network traffic in a log file
What just happened?
Time for action – logging to the console
What just happened?
Writing network data to log files
Time for action – capturing the output of a command to a flat file
What just happened?
Logs versus files
Time for action – capturing a remote file in a local flat file
What just happened?
Sources, sinks, and channels
Sources
Sinks
Channels
Or roll your own
Understanding the Flume configuration files
Have a go hero
It's all about events
Time for action – writing network traffic onto HDFS
What just happened?
Time for action – adding timestamps
What just happened?
To Sqoop or to Flume...
Time for action – multi level Flume networks
What just happened?
Time for action – writing to multiple sinks
What just happened?
Selectors replicating and multiplexing
Handling sink failure
Have a go hero - Handling sink failure
Next, the world
Have a go hero - Next, the world
The bigger picture
Data lifecycle
Staging data
Scheduling
Summary
11. Where to Go Next
What we did and didn't cover in this book
Upcoming Hadoop changes
Alternative distributions
Why alternative distributions?
Bundling
Free and commercial extensions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
IBM InfoSphere BigInsights
Choosing a distribution
Other Apache projects
HBase
Oozie
Whirr
Mahout
MRUnit
Other programming abstractions
Pig
Cascading
AWS resources
HBase on EMR
SimpleDB
DynamoDB
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
A. Pop Quiz Answers
Chapter 3, Understanding MapReduce
Pop quiz – key/value pairs
Pop quiz – walking through a run of WordCount
Chapter 7, Keeping Things Running
Pop quiz – setting up a cluster
2. Module 2
1. Getting Started with Hadoop 2.X
Introduction
Installing a single-node Hadoop Cluster
Getting ready
How to do it...
How it works...
Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
There's more
Installing a multi-node Hadoop cluster
Getting ready
How to do it...
How it works...
Adding new nodes to existing Hadoop clusters
Getting ready
How to do it...
How it works...
Executing the balancer command for uniform data distribution
Getting ready
How to do it...
How it works...
There's more...
Entering and exiting from the safe mode in a Hadoop cluster
How to do it...
How it works...
Decommissioning DataNodes
Getting ready
How to do it...
How it works...
Performing benchmarking on a Hadoop cluster
Getting ready
How to do it...
TestDFSIO
NNBench
MRBench
How it works...
2. Exploring HDFS
Introduction
Loading data from a local machine to HDFS
Getting ready
How to do it...
How it works...
Exporting HDFS data to a local machine
Getting ready
How to do it...
How it works...
Changing the replication factor of an existing file in HDFS
Getting ready
How to do it...
How it works...
Setting the HDFS block size for all the files in a cluster
Getting ready
How to do it...
How it works...
Setting the HDFS block size for a specific file in a cluster
Getting ready
How to do it...
How it works...
Enabling transparent encryption for HDFS
Getting ready
How to do it...
How it works...
Importing data from another Hadoop cluster
Getting ready
How to do it...
How it works...
Recycling deleted data from trash to HDFS
Getting ready
How to do it...
How it works...
Saving compressed data in HDFS
Getting ready
How to do it...
How it works...
3. Mastering Map Reduce Programs
Introduction
Writing the Map Reduce program in Java to analyze web log data
Getting ready
How to do it...
How it works...
Executing the Map Reduce program in a Hadoop cluster
Getting ready
How to do it
How it works...
Adding support for a new writable data type in Hadoop
Getting ready
How to do it...
How it works...
Implementing a user-defined counter in a Map Reduce program
Getting ready
How to do it...
How it works...
Map Reduce program to find the top X
Getting ready
How to do it...
How it works
Map Reduce program to find distinct values
Getting ready
How to do it
How it works...
Map Reduce program to partition data using a custom partitioner
Getting ready
How to do it...
How it works...
Writing Map Reduce results to multiple output files
Getting ready
How to do it...
How it works...
Performing Reduce side Joins using Map Reduce
Getting ready
How to do it
How it works...
Unit testing the Map Reduce code using MRUnit
Getting ready
How to do it...
How it works...
4. Data Analysis Using Hive, Pig, and HBase
Introduction
Storing and processing Hive data in a sequential file format
Getting ready
How to do it...
How it works...
Storing and processing Hive data in the RC file format
Getting ready
How to do it...
How it works...
Storing and processing Hive data in the ORC file format
Getting ready
How to do it...
How it works...
Storing and processing Hive data in the Parquet file format
Getting ready
How to do it...
How it works...
Performing FILTER By queries in Pig
Getting ready
How to do it...
How it works...
Performing Group By queries in Pig
Getting ready
How to do it...
How it works...
Performing Order By queries in Pig
Getting ready
How to do it...
How it works...
Performing JOINS in Pig
Getting ready
How to do it...
How it works
Replicated Joins
Skewed Joins
Merge Joins
Writing a user-defined function in Pig
Getting ready
How to do it...
How it works...
There's more...
Analyzing web log data using Pig
Getting ready
How to do it...
How it works...
Performing HBase operations in the CLI
Getting ready
How to do it
How it works...
Performing HBase operations in Java
Getting ready
How to do it
How it works...
Executing MapReduce programming with an HBase table
Getting ready
How to do it
How it works
5. Advanced Data Analysis Using Hive
Introduction
Processing JSON data in Hive using JSON SerDe
Getting ready
How to do it...
How it works...
Processing XML data in Hive using XML SerDe
Getting ready
How to do it...
How it works
Processing Hive data in the Avro format
Getting ready
How to do it...
How it works...
Writing a user-defined function in Hive
Getting ready
How to do it
How it works...
Performing table joins in Hive
Getting ready
How to do it...
Left outer join
Right outer join
Full outer join
Left semi join
How it works...
Executing map side joins in Hive
Getting ready
How to do it...
How it works...
Performing context Ngram in Hive
Getting ready
How to do it...
How it works...
Call Data Record Analytics using Hive
Getting ready
How to do it...
How it works...
Twitter sentiment analysis using Hive
Getting ready
How to do it...
How it works
Implementing Change Data Capture using Hive
Getting ready
How to do it
How it works
Multiple table inserting using Hive
Getting ready
How to do it
How it works
6. Data Import/Export Using Sqoop and Flume
Introduction
Importing data from RDBMS to HDFS using Sqoop
Getting ready
How to do it...
How it works...
Exporting data from HDFS to RDBMS
Getting ready
How to do it...
How it works...
Using query operator in Sqoop import
Getting ready
How to do it...
How it works...
Importing data using Sqoop in compressed format
Getting ready
How to do it...
How it works...
Performing Atomic export using Sqoop
Getting ready
How to do it...
How it works...
Importing data into Hive tables using Sqoop
Getting ready
How to do it...
How it works...
Importing data into HDFS from Mainframes
Getting ready
How to do it...
How it works...
Incremental import using Sqoop
Getting ready
How to do it...
How it works...
Creating and executing Sqoop job
Getting ready
How to do it...
How it works...
Importing data from RDBMS to HBase using Sqoop
Getting ready
How to do it...
How it works...
Importing Twitter data into HDFS using Flume
Getting ready
How to do it...
How it works
Importing data from Kafka into HDFS using Flume
Getting ready
How to do it...
How it works
Importing web logs data into HDFS using Flume
Getting ready
How to do it...
How it works...
7. Automation of Hadoop Tasks Using Oozie
Introduction
Implementing a Sqoop action job using Oozie
Getting ready
How to do it...
How it works
Implementing a Map Reduce action job using Oozie
Getting ready
How to do it...
How it works...
Implementing a Java action job using Oozie
Getting ready
How to do it
How it works
Implementing a Hive action job using Oozie
Getting ready
How to do it...
How it works...
Implementing a Pig action job using Oozie
Getting ready
How to do it...
How it works
Implementing an e-mail action job using Oozie
Getting ready
How to do it...
How it works...
Executing parallel jobs using Oozie (fork)
Getting ready
How to do it...
How it works...
Scheduling a job in Oozie
Getting ready
How to do it...
How it works...
8. Machine Learning and Predictive Analytics Using Mahout and R
Introduction
Setting up the Mahout development environment
Getting ready
How to do it...
How it works...
Creating an item-based recommendation engine using Mahout
Getting ready
How to do it...
How it works...
Creating a user-based recommendation engine using Mahout
Getting ready
How to do it...
How it works...
Using Predictive analytics on Bank Data using Mahout
Getting ready
How to do it...
How it works...
Clustering text data using K-Means
Getting ready
How to do it...
How it works...
Performing Population Data Analytics using R
Getting ready
How to do it...
How it works...
Performing Twitter Sentiment Analytics using R
Getting ready
How to do it...
How it works...
Performing Predictive Analytics using R
Getting ready
How to do it...
How it works...
9. Integration with Apache Spark
Introduction
Running Spark standalone
Getting ready
How to do it...
How it works...
Running Spark on YARN
Getting ready
How to do it...
How it works...
Olympics Athletes analytics using the Spark Shell
Getting ready
How to do it...
How it works...
Creating Twitter trending topics using Spark Streaming
Getting ready
How to do it...
How it works...
Twitter trending topics using Spark streaming
Getting ready
How to do it...
How it works...
Analyzing Parquet files using Spark
Getting ready
How to do it...
How it works...
Analyzing JSON data using Spark
Getting ready
How to do it...
How it works...
Processing graphs using GraphX
Getting ready
How to do it...
How it works...
Conducting predictive analytics using Spark MLlib
Getting ready
How to do it...
How it works...
10. Hadoop Use Cases
Introduction
Call Data Record analytics
Getting ready
How to do it...
Problem Statement
Solution
How it works...
Web log analytics
Getting ready
How to do it...
Problem statement
Solution
How it works...
Sensitive data masking and encryption using Hadoop
Getting ready
How to do it...
Problem statement
Solution
How it works...
3. Module 3
1. Hadoop 2.X
The inception of Hadoop
The evolution of Hadoop
Hadoop's genealogy
Hadoop-0.20-append
Hadoop-0.20-security
Hadoop's timeline
Hadoop 2.X
Yet Another Resource Negotiator (YARN)
Architecture overview
Storage layer enhancements
High availability
HDFS Federation
HDFS snapshots
Other enhancements
Support enhancements
Hadoop distributions
Which Hadoop distribution?
Performance
Scalability
Reliability
Manageability
Available distributions
Cloudera Distribution of Hadoop (CDH)
Hortonworks Data Platform (HDP)
MapR
Pivotal HD
Summary
2. Advanced MapReduce
MapReduce input
The InputFormat class
The InputSplit class
The RecordReader class
Hadoop's small files problem
Filtering inputs
The Map task
The dfs.blocksize attribute
Sort and spill of intermediate outputs
Node-local Reducers or Combiners
Fetching intermediate outputs – Map-side
The Reduce task
Fetching intermediate outputs – Reduce-side
Merge and spill of intermediate outputs
MapReduce output
Speculative execution of tasks
MapReduce job counters
Handling data joins
Reduce-side joins
Map-side joins
Summary
3. Advanced Pig
Pig versus SQL
Different modes of execution
Complex data types in Pig
Compiling Pig scripts
The logical plan
The physical plan
The MapReduce plan
Development and debugging aids
The DESCRIBE command
The EXPLAIN command
The ILLUSTRATE command
The advanced Pig operators
The advanced FOREACH operator
The FLATTEN operator
The nested FOREACH operator
The COGROUP operator
The UNION operator
The CROSS operator
Specialized joins in Pig
The Replicated join
Skewed joins
The Merge join
User-defined functions
The evaluation functions
The aggregate functions
The Algebraic interface
The Accumulator interface
The filter functions
The load functions
The store functions
Pig performance optimizations
The optimization rules
Measurement of Pig script performance
Combiners in Pig
Memory for the Bag data type
Number of reducers in Pig
The multiquery mode in Pig
Best practices
The explicit usage of types
Early and frequent projection
Early and frequent filtering
The usage of the LIMIT operator
The usage of the DISTINCT operator
The reduction of operations
The usage of Algebraic UDFs
The usage of Accumulator UDFs
Eliminating nulls in the data
The usage of specialized joins
Compressing intermediate results
Combining smaller files
Summary
4. Advanced Hive
The Hive architecture
The Hive metastore
The Hive compiler
The Hive execution engine
The supporting components of Hive
Data types
File formats
Compressed files
ORC files
The Parquet files
The data model
Dynamic partitions
Semantics for dynamic partitioning
Indexes on Hive tables
Hive query optimizers
Advanced DML
The GROUP BY operation
ORDER BY versus SORT BY clauses
The JOIN operator and its types
Map-side joins
Advanced aggregation support
Other advanced clauses
UDF, UDAF, and UDTF
Summary
5. Serialization and Hadoop I/O
Data serialization in Hadoop
Writable and WritableComparable
Hadoop versus Java serialization
Avro serialization
Avro and MapReduce
Avro and Pig
Avro and Hive
Comparison – Avro versus Protocol Buffers / Thrift
File formats
The Sequence file format
Reading and writing Sequence files
The MapFile format
Other data structures
Compression
Splits and compressions
Scope for compression
Summary
6. YARN – Bringing Other Paradigms to Hadoop
The YARN architecture
Resource Manager (RM)
Application Master (AM)
Node Manager (NM)
YARN clients
Developing YARN applications
Writing YARN clients
Writing the Application Master entity
Monitoring YARN
Job scheduling in YARN
CapacityScheduler
FairScheduler
YARN commands
User commands
Administration commands
Summary
7. Storm on YARN – Low Latency Processing in Hadoop
Batch processing versus streaming
Apache Storm
Architecture of an Apache Storm cluster
Computation and data modeling in Apache Storm
Use cases for Apache Storm
Developing with Apache Storm
Apache Storm 0.9.1
Storm on YARN
Installing Apache Storm-on-YARN
Prerequisites
Installation procedure
Summary
8. Hadoop on the Cloud
Cloud computing characteristics
Hadoop on the cloud
Amazon Elastic MapReduce (EMR)
Provisioning a Hadoop cluster on EMR
Summary
9. HDFS Replacements
HDFS – advantages and drawbacks
Amazon AWS S3
Hadoop support for S3
Implementing a filesystem in Hadoop
Implementing an S3 native filesystem in Hadoop
Summary
10. HDFS Federation
Limitations of the older HDFS architecture
Architecture of HDFS Federation
Benefits of HDFS Federation
Deploying federated NameNodes
HDFS high availability
Secondary NameNode, Checkpoint Node, and Backup Node
High availability – edits sharing
Useful HDFS tools
Three-layer versus four-layer network topology
HDFS block placement
Pluggable block placement policy
Summary
11. Hadoop Security
The security pillars
Authentication in Hadoop
Kerberos authentication
The Kerberos architecture and workflow
Kerberos authentication and Hadoop
Authentication via HTTP interfaces
Authorization in Hadoop
Authorization in HDFS
Identity of an HDFS user
Group listings for an HDFS user
HDFS APIs and shell commands
Specifying the HDFS superuser
Turning off HDFS authorization
Limiting HDFS usage
Name quotas in HDFS
Space quotas in HDFS
Service-level authorization in Hadoop
Data confidentiality in Hadoop
HTTPS and encrypted shuffle
SSL configuration changes
Configuring the keystore and truststore
Audit logging in Hadoop
Summary
12. Analytics Using Hadoop
Data analytics workflow
Machine learning
Apache Mahout
Document analysis using Hadoop and Mahout
Term frequency
Document frequency
Term frequency – inverse document frequency
Tf-Idf in Pig
Cosine similarity distance measures
Clustering using k-means
K-means clustering using Apache Mahout
RHadoop
Summary
13. Hadoop for Microsoft Windows
Deploying Hadoop on Microsoft Windows
Prerequisites
Building Hadoop
Configuring Hadoop
Deploying Hadoop
Summary
A. Bibliography
Index
Hadoop: Data Processing and Modelling
Hadoop: Data Processing and Modelling
Unlock the power of your data with Hadoop 2.X ecosystem and its data warehousing techniques across large data sets
A course in three modules
BIRMINGHAM - MUMBAI
Hadoop: Data Processing and Modelling
Copyright © 2016 Packt Publishing
All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Published on: August 2016
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN: 978-1-78712-516-2
www.packtpub.com
Credits
Authors
Garry Turkington
Tanmay Deshpande
Sandeep Karanth
Reviewers
David Gruzman
Muthusamy Manigandan
Vidyasagar N V
Shashwat Shriparv
Shiva Achari
Pavan Kumar Polineni
Uchit Vyas
Yohan Wadia
Content Development Editor
Rashmi Suvarna
Graphics
Kirk D'Penha
Production Coordinator
Shantanu N. Zagade
Preface
A number of organizations are focusing on big data processing, particularly with Hadoop. This course will help you understand how Hadoop, as an ecosystem, helps us store, process, and analyze data. Hadoop is synonymous with Big Data processing. Its simple programming model, its "code once and deploy at any scale" paradigm, and its ever-growing ecosystem make Hadoop an inclusive platform for programmers with different levels of expertise and breadth of knowledge. A cluster of machines interconnected via a very fast network can provide better scaling and elasticity, but that is not enough.
These clusters have to be programmed. A greater number of machines, just like a team of human beings, requires more coordination and synchronization. The higher the number of machines, the greater the possibility of failures in the cluster. How do we handle synchronization and fault tolerance in a simple way, easing the burden on the programmer? The answer is systems such as Hadoop. Today, it is the most sought-after job skill in the data sciences space. To handle and analyze Big Data, Hadoop has become the go-to tool. Hadoop 2.x is spreading its wings to cover a variety of application paradigms and solve a wider range of data problems. It is rapidly becoming a general-purpose cluster platform for all data processing needs, and will soon become a mandatory skill for every engineer across verticals. This course explores the power of the Hadoop ecosystem so that you can tackle real-world scenarios and build robust solutions.
This course covers optimizations and advanced features of MapReduce, Pig, and Hive, along with Hadoop 2.x, and illustrates how these can be used to extend the capabilities of Hadoop.
When you finish this course, you will be able to tackle real-world scenarios and become a big data expert, using the tools and knowledge gained from the various step-by-step tutorials and recipes.
What this learning path covers
Hadoop Beginner's Guide
This module is here to help you make sense of Hadoop and use it to solve your big data problems. It's a really exciting time to work with data processing technologies such as Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of large corporations and government agencies—is now possible through free open source software (OSS). This module removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the module gives you the understanding needed to effectively use Hadoop to solve real-world problems. Starting with the basics of installing and configuring Hadoop, the module explains how to develop applications, maintain the system, and use additional products to integrate with other systems. In addition to examples on local Hadoop clusters on Ubuntu, the module covers the use of cloud services such as Amazon EC2 and Elastic MapReduce.
Hadoop Real-World Solutions Cookbook, Second Edition
Big Data is the need of the day. Many organizations are producing huge amounts of data every day. With the advancement of Hadoop-like tools, it has become easier for everyone to solve Big Data problems with great efficiency and at a very low cost. When you are handling such a massive amount of data, even a small mistake can cost you dearly in terms of performance and storage. It's very important to learn the best practices of handling such tools before you start building an enterprise Big Data warehouse; doing so will be greatly advantageous in making your project successful.
This module gives readers insights into learning and mastering big data via recipes. The module not only clarifies most big data tools in the market but also provides best practices for using them. The module provides recipes based on the latest versions of Apache Hadoop 2.X, YARN, Hive, Pig, Sqoop, Flume, Apache Spark, Mahout, and many more such ecosystem tools. This real-world-solutions cookbook is packed with handy recipes you can apply to your own everyday issues. Each chapter provides in-depth recipes that can be referenced easily. The module provides detailed practices on the latest technologies, such as YARN and Apache Spark. Readers will be able to consider themselves big data experts on completion of this module.
Mastering Hadoop
The era of Big Data has brought similar changes to businesses as well. Almost everything in a business is logged. Every action taken by a user on an e-commerce page is recorded to improve quality of service, and every item bought by the user is recorded to cross-sell or up-sell other items. Businesses want to understand the DNA of their customers and try to infer it by teasing out every possible piece of data they can get about these customers. Businesses are not worried about the format of the data. They are ready to accept speech, images, natural language text, or structured data. These data points are used to drive business decisions and personalize experiences for the user. The more data, the higher the degree of personalization and the better the experience for the user.
Hadoop is synonymous with Big Data processing. Its simple programming model, its "code once and deploy at any scale" paradigm, and its ever-growing ecosystem make Hadoop an all-encompassing platform for programmers with different levels of expertise. This module explores the industry guidelines to optimize MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0. It then dives deep into Hadoop 2.0-specific features such as YARN and HDFS Federation. This module is a step-by-step guide that focuses on advanced Hadoop concepts and aims to take your Hadoop knowledge and skill set to the next level. The data processing flow dictates the order of the concepts in each chapter, and each chapter is illustrated with code fragments or schematic diagrams.
What you need for this learning path
In the simplest case, a single Linux-based machine will give you a platform to explore almost all the exercises in this course. We assume you have a recent distribution of Ubuntu, but as long as you have command-line Linux familiarity any modern distribution will suffice. Some of the examples in later chapters really need multiple machines to see things working, so you will require access to at least four such hosts. Virtual machines are completely acceptable; they're not ideal for production but are fine for learning and exploration. Since we also explore Amazon Web Services in this course, you can run all the examples on EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout the modules. AWS services are usable by anyone, but you will need a credit card to sign up!
To get started with this hands-on recipe-driven module, you should have a laptop/desktop with any OS, such as Windows, Linux, or Mac. It's good to have an IDE, such as Eclipse or IntelliJ, and of course, you need a lot of enthusiasm to learn.
The following software suites are required to try out the examples in the module:
Java Development Kit (JDK 1.7 or later): This is free software from Oracle that provides a JRE (Java Runtime Environment) and additional tools for developers. It can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.
The IDE for editing Java code: IntelliJ IDEA is the IDE that has been used to develop the examples. Any other IDE of your choice can also be used. The community edition of the IntelliJ IDE can be downloaded from https://www.jetbrains.com/idea/download/.
Maven: Maven is a build tool that has been used to build the samples in the course. Maven can be used to automatically pull build dependencies and specify configurations via XML files. The code samples in the chapters can be built into a JAR using two simple Maven commands:
mvn compile
mvn assembly:single
These commands compile the code and create a consolidated JAR file containing the program along with all its dependencies. It is important to change the mainClass reference in the pom.xml to the driver class name when building the consolidated JAR file.
Hadoop-related consolidated JAR files can be run using the command:
hadoop jar
This command directly picks the driver program from the mainClass that was specified in the pom.xml. Maven can be downloaded and installed from http://maven.apache.org/download.cgi. The Maven XML template used to build the samples in this course is along the lines of the following minimal sketch; the groupId, artifactId, version, and mainClass values are placeholders that should be replaced with your own:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example.hadoop</groupId>
  <artifactId>hadoop-samples</artifactId>
  <version>1.0</version>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <!-- Change mainClass to the driver class of the sample being built -->
              <mainClass>com.example.hadoop.Driver</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
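As a concrete illustration of the workflow (the JAR name follows the placeholder artifactId and version in the sketch above, and the input and output paths are hypothetical), building and running a sample would look like this:

mvn compile
mvn assembly:single
hadoop jar target/hadoop-samples-1.0-jar-with-dependencies.jar /input /output

Because the assembly plugin records mainClass in the JAR's manifest, hadoop jar can be invoked without naming the driver class explicitly; any arguments after the JAR name are passed straight through to the driver.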
Hadoop 2.2.0: Apache Hadoop is required to try out the examples in general. Appendix, Hadoop for Microsoft Windows, has the details on Hadoop's single-node installation on a Microsoft Windows machine. The steps are similar and easier for other operating systems such as Linux or Mac, and they can be found at http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/SingleNodeSetup.html
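Once Hadoop is installed, it is worth running a quick sanity check before working through the modules. A minimal sketch, assuming the Hadoop binaries are on your PATH and HADOOP_HOME points at a standard Hadoop 2.x installation:

hadoop version
hdfs dfs -mkdir -p /user/$USER
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 4 1000

The pi example here is the same job used in Module 1's Time for action – using Hadoop to calculate Pi, so a successful run confirms that both HDFS and the MapReduce runtime are functioning.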
Who this learning path is for
We assume you are reading this course because you want to know more about Hadoop at a hands-on level; the key audience is those with software development experience but no prior exposure to Hadoop or similar big data technologies.
For developers who want to know how to write MapReduce applications, we assume you are comfortable writing Java programs and are familiar with the Unix command-line interface. We will also show you a few programs in Ruby, but these are usually only to demonstrate language independence, and you don't need to be a Ruby expert.
For architects and system administrators, the course also provides significant value in explaining how Hadoop works, its place in the broader architecture, and how it can be managed operationally. Some of the more involved techniques in Chapter 4, Developing MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably of less direct interest to this audience.
This course is for Java developers who know scripting and want a career shift into the Hadoop and Big Data segment of the IT industry. Whether you are a Hadoop novice or an expert, this course will take you to the most advanced levels of Hadoop 2.X.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the course's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the course in the Search box.
Select the course for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this course from.
Click on Code Download.
You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Data-Science-with-Hadoop/tree/master. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our modules—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this course, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.
Part 1. Module 1
Hadoop Beginner’s Guide
Learn how to crunch big data to extract meaning from the data avalanche
Chapter 1. What It's All About
This book is about Hadoop, an open source framework for large-scale data processing. Before we get into the details of the technology and its use in later chapters, it is important to spend a little time exploring the trends that led to Hadoop's creation and its enormous success.
Hadoop was not created in a vacuum; instead, it exists due to the explosion in the amount of data being created and consumed and a shift that sees this data deluge arrive at small startups and not just huge multinationals. At the same time, other trends have changed how software and systems are deployed, using cloud resources alongside or even in preference to more traditional infrastructures.
This chapter will explore some of these trends and explain in detail the specific problems Hadoop seeks to solve and the drivers that shaped its design.
In the rest of this chapter we shall:
Learn about the big data revolution
Understand what Hadoop is and how it can extract value from data
Look into cloud computing and understand what Amazon Web Services provides
See how powerful the combination of big data processing and cloud computing can be
Get an overview of the topics covered in the rest of this book
So let's get on with it!
Big data processing
Look around at the technology we have today, and it's easy to come to the conclusion that it's all about data. As consumers, we have an increasing appetite for rich media, both in terms of the movies we watch and the pictures and videos we create and upload. We also, often without thinking, leave a trail of data across the Web as we perform the actions of our daily lives.
Not only is the amount of data being generated increasing, but the rate of increase is also accelerating. From emails to Facebook posts, from purchase histories to web links, there are large data sets growing everywhere. The challenge is in extracting from this data the most valuable aspects; sometimes this means particular data elements, and at other times, the focus is instead on identifying trends and relationships between pieces of data.
There's a subtle change occurring behind the scenes that is all about using data in more and more meaningful ways. Large companies have realized the value in data for some time and have been using it to improve the services they provide to their customers, that is, us. Consider how Google displays advertisements relevant to our web surfing, or how Amazon or Netflix recommend new products or titles that often match well to our tastes and interests.
The value of data
These corporations wouldn't invest in large-scale data processing if it didn't provide a meaningful return on the investment or a competitive advantage. There are several main aspects to big data that should be appreciated:
Some questions only give value when asked of sufficiently large data sets. Recommending a movie based on the preferences of another person is, in the absence of other factors, unlikely to be very accurate. Increase the number of people to a hundred and the chances increase slightly. Use the viewing history of ten million other people and the chances of detecting patterns that can be used to give relevant recommendations improve dramatically.
Big data tools often enable the processing of data on a larger scale and at a lower cost than previous solutions. As a consequence, it is often possible to perform data processing tasks that were previously prohibitively expensive.
The cost of large-scale data processing isn't just about financial expense; latency is also a critical factor. A system may be able to process as much data as is thrown at it, but if the average processing time is measured in weeks, it is unlikely to be useful. Big data tools allow data volumes to be increased while keeping processing time under control, usually by matching the increased data volume with additional hardware; the sketch after this list makes the arithmetic concrete.
Previous assumptions of what a database should look like or how its data should be structured may need to be revisited to meet the needs of the biggest data problems.
In combination with the preceding points, sufficiently large data sets and flexible tools allow previously unimagined questions to be answered.
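To make the latency point concrete, here is a rough back-of-envelope sketch in Java. It is illustrative only: the per-server scan rate of 100 MB/s is an assumed round number, not a measurement, and the model ignores network and coordination overhead entirely.

    // Rough model: scan time = data volume / aggregate throughput.
    // Assumes work divides evenly across servers; ignores all overhead.
    public class ScanTimeSketch {

        // Assumed figure: one commodity disk streaming at ~100 MB/s
        private static final double MB_PER_SECOND_PER_SERVER = 100.0;

        static double scanHours(double dataTerabytes, int servers) {
            double totalMb = dataTerabytes * 1024 * 1024; // TB to MB
            return totalMb / (MB_PER_SECOND_PER_SERVER * servers) / 3600;
        }

        public static void main(String[] args) {
            System.out.printf("10 TB, 1 server:   %.1f hours%n", scanHours(10, 1));  // ~29 hours
            System.out.printf("10 TB, 20 servers: %.1f hours%n", scanHours(10, 20)); // ~1.5 hours
        }
    }

The exact numbers matter less than the shape of the relationship: doubling the data while doubling the hardware keeps processing time roughly constant, which is precisely the property big data tools aim to preserve.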
Historically for the few and not the many
The examples discussed in the previous section have generally been seen in the form of innovations from large search engines and online companies. This is a continuation of a much older trend wherein processing large data sets was an expensive and complex undertaking, out of the reach of small- or medium-sized organizations.
Similarly, the broader approach of data mining has been around for a very long time but has never really been a practical tool outside the largest corporations and government agencies.
This situation may have been regrettable, but most smaller organizations were not at a disadvantage, as they rarely had access to data volumes that would have justified such an investment.
The increase in data is not limited to the big players anymore, however; many small and medium companies—not to mention some individuals—find themselves gathering larger and larger amounts of data that they suspect may have some value they want to unlock.
Before understanding how this can be achieved, it is important to appreciate some of these broader historical trends that have laid the foundations for systems such as Hadoop today.
Classic data processing systems
The fundamental reason that big data mining systems were rare and expensive is that scaling a system to process large data sets is very difficult; as we will see, processing has traditionally been limited to the power that can be built into a single computer.
There are, however, two broad approaches to scaling a system as the size of the data increases, generally referred to as scale-up and scale-out.
Scale-up
In most enterprises, data processing has typically been performed on impressively large computers with equally impressive price tags. As the size of the data grows, the approach is to move to a bigger server or storage array. Even with an effective architecture, the cost of such hardware can easily run to hundreds of thousands or even millions of dollars; as we'll describe later in this chapter, this remains true today.
The advantage of simple scale-up is that the architecture does not significantly change as the system grows. Though larger components are used, the basic relationship (for example, between database server and storage array) stays the same. For applications such as commercial database engines, the software handles the complexities of utilizing the available hardware, so in theory, increased scale is achieved by migrating the same software onto larger and larger servers. Note, though, that moving software onto more and more processors is never trivial; in addition, there are practical limits on just how big a single host can be, so at some point, scale-up cannot be extended any further.
The promise of a single architecture at any scale is also unrealistic. A scale-up design for data sets of 1 terabyte, 100 terabytes, and 1 petabyte may conceptually apply larger versions of the same components, but the complexity of connecting them can range from cheap commodity parts to custom hardware as the scale increases.
Early approaches to scale-out
Instead of growing a system onto larger and larger hardware, the scale-out approach spreads the processing onto more and more machines. If the data set doubles, simply use two servers instead of a single double-sized one. If it doubles again, move to four hosts.
The obvious benefit of this approach is that purchase costs remain much lower than for scale-up. Server hardware costs tend to increase sharply when one seeks to purchase larger machines; though a single host may cost $5,000, one with ten times the processing power may cost a hundred times as much. The downside is that we need to develop strategies for splitting our data processing across a fleet of servers, and the tools historically used for this purpose have proven to be complex.
As a consequence, deploying a scale-out solution has required significant engineering effort; the system developer often needs to handcraft the mechanisms for data partitioning and reassembly, not to mention the logic to schedule the work across the cluster and handle individual machine failures.
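To see why this handcrafting is burdensome, consider a minimal sketch, in Java, of the partitioning step alone; the class and method names here are invented for illustration and are not taken from any real system.

    import java.util.ArrayList;
    import java.util.List;

    // A hand-rolled partitioner: route each record to a server by
    // hashing its key. This is only the easy part of scale-out.
    public class NaivePartitioner {

        private final List<List<String>> serverBuckets = new ArrayList<>();

        public NaivePartitioner(int numServers) {
            for (int i = 0; i < numServers; i++) {
                serverBuckets.add(new ArrayList<>());
            }
        }

        public void route(String key, String record) {
            // floorMod keeps the index non-negative even for negative hash codes
            int server = Math.floorMod(key.hashCode(), serverBuckets.size());
            serverBuckets.get(server).add(record);
        }
    }

Notice everything the sketch leaves out: if a server fails, its bucket of work is simply lost, and if the cluster is resized, every key maps to a different server. Handling those cases correctly, along with reassembling the partial results, is where the real engineering effort has historically gone.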
Limiting factors
These traditional approaches to scale-up and scale-out have not been widely adopted outside large enterprises, government, and academia. The purchase costs are often high, as is the effort to develop and manage the systems. These factors alone put them out of the reach of many smaller businesses. In addition, the approaches themselves have had several weaknesses that have become apparent over time:
As scale-out systems get large, or as scale-up systems deal with multiple CPUs, the difficulties caused by concurrency become significant. Effectively utilizing multiple hosts or CPUs is a very difficult task, and implementing the strategies needed to maintain efficiency throughout the execution of the desired workloads can entail enormous effort.
Hardware advances, often couched in terms of Moore's law, have begun to highlight discrepancies in system capability. CPU power has grown much faster than network or disk speeds; CPU cycles were once the most valuable resource in the system, but today that no longer holds. Whereas a modern CPU may execute millions of times as many operations per second as one from 20 years ago, memory and hard disk speeds have increased only by factors of hundreds or thousands. It is quite easy to build a modern system with so much CPU power that the storage system simply cannot feed it data fast enough to keep the CPUs busy, as the short calculation below illustrates.
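A quick, hedged calculation shows the scale of this mismatch; both throughput figures below are assumptions chosen as round numbers, not measurements of any particular machine.

    public class CpuVersusDisk {
        public static void main(String[] args) {
            // Assumed figures: a CPU retiring ~10 billion simple operations
            // per second, and a disk streaming reads at ~100 MB/s.
            double cpuOpsPerSecond    = 10_000_000_000.0;
            double diskBytesPerSecond = 100_000_000.0;

            // Prints 100: the CPU could spend ~100 operations on every byte
            // the disk delivers, yet a simple scan needs only a handful,
            // so the processor mostly sits idle waiting on storage.
            System.out.printf("Ops available per byte read: %.0f%n",
                    cpuOpsPerSecond / diskBytesPerSecond);
        }
    }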
A different approach
Out of the preceding scenarios, a number of techniques have emerged that have been used successfully to ease the pain of scaling data processing systems to the sizes required by big data.
All roads lead to scale-out
As just hinted, taking a scale-up approach to scaling is not an open-ended tactic. There is a limit to the size of individual servers that can be purchased from mainstream hardware suppliers, and even the most niche players can't offer an arbitrarily large server. At some point, the workload will increase beyond the capacity of the single, monolithic scale-up server, so then what? The unfortunate answer is that the best approach is to have two large servers instead of one. Then, later, three, four, and so on. In other words, the natural tendency of scale-up architecture is, in extreme cases, to add a scale-out strategy to the mix. Though this gives some of the benefits of both approaches, it also compounds the costs and weaknesses; instead of very expensive hardware or the need to manually develop the cross-cluster logic, this hybrid architecture requires both.
As a