Hadoop: Data Processing and Modelling
Ebook · 2,407 pages · 15 hours


About this ebook

About This Book
  • Conquer the mountain of data using Hadoop 2.X tools
  • The authors succeed in creating a context for Hadoop and its ecosystem
  • Hands-on examples and recipes giving the bigger picture and helping you to master Hadoop 2.X data processing platforms
  • Overcome the challenging data processing problems using this exhaustive course with Hadoop 2.X
Who This Book Is For

This course is for Java developers who know scripting and want a career shift into the Hadoop and Big Data segment of the IT industry. Whether you are a novice or an expert in Hadoop, this book will take you to the most advanced level in Hadoop 2.X.

Language: English
Release date: Aug 31, 2016
ISBN: 9781787120457
Author

Garry Turkington

Garry Turkington has 14 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current roles as VP Data Engineering at Improve Digital and the company's lead architect, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon UK, where he led several software development teams building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and USA. He has BSc and PhD degrees in computer science from Queen's University Belfast in Northern Ireland and an MEng in Systems Engineering from the Stevens Institute of Technology in the USA.



    Table of Contents

    Hadoop: Data Processing and Modelling

    Hadoop: Data Processing and Modelling

    Credits

    Preface

    What this learning path covers

    Hadoop Beginner's Guide

    Hadoop Real World Solutions Cookbook, 2nd edition

    Mastering Hadoop

    What you need for this learning path

    Who this learning path is for

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Module 1

    1. What It's All About

    Big data processing

    The value of data

    Historically for the few and not the many

    Classic data processing systems

    Scale-up

    Early approaches to scale-out

    Limiting factors

    A different approach

    All roads lead to scale-out

    Share nothing

    Expect failure

    Smart software, dumb hardware

    Move processing, not data

    Build applications, not infrastructure

    Hadoop

    Thanks, Google

    Thanks, Doug

    Thanks, Yahoo

    Parts of Hadoop

    Common building blocks

    HDFS

    MapReduce

    Better together

    Common architecture

    What it is and isn't good for

    Cloud computing with Amazon Web Services

    Too many clouds

    A third way

    Different types of costs

    AWS – infrastructure on demand from Amazon

    Elastic Compute Cloud (EC2)

    Simple Storage Service (S3)

    Elastic MapReduce (EMR)

    What this book covers

    A dual approach

    Summary

    2. Getting Hadoop Up and Running

    Hadoop on a local Ubuntu host

    Other operating systems

    Time for action – checking the prerequisites

    What just happened?

    Setting up Hadoop

    A note on versions

    Time for action – downloading Hadoop

    What just happened?

    Time for action – setting up SSH

    What just happened?

    Configuring and running Hadoop

    Time for action – using Hadoop to calculate Pi

    What just happened?

    Three modes

    Time for action – configuring the pseudo-distributed mode

    What just happened?

    Configuring the base directory and formatting the filesystem

    Time for action – changing the base HDFS directory

    What just happened?

    Time for action – formatting the NameNode

    What just happened?

    Starting and using Hadoop

    Time for action – starting Hadoop

    What just happened?

    Time for action – using HDFS

    What just happened?

    Time for action – WordCount, the Hello World of MapReduce

    What just happened?

    Have a go hero – WordCount on a larger body of text

    Monitoring Hadoop from the browser

    The HDFS web UI

    The MapReduce web UI

    Using Elastic MapReduce

    Setting up an account in Amazon Web Services

    Creating an AWS account

    Signing up for the necessary services

    Time for action – WordCount on EMR using the management console

    What just happened?

    Have a go hero – other EMR sample applications

    Other ways of using EMR

    AWS credentials

    The EMR command-line tools

    The AWS ecosystem

    Comparison of local versus EMR Hadoop

    Summary

    3. Understanding MapReduce

    Key/value pairs

    What it means

    Why key/value data?

    Some real-world examples

    MapReduce as a series of key/value transformations

    Pop quiz – key/value pairs

    The Hadoop Java API for MapReduce

    The 0.20 MapReduce Java API

    The Mapper class

    The Reducer class

    The Driver class

    Writing MapReduce programs

    Time for action – setting up the classpath

    What just happened?

    Time for action – implementing WordCount

    What just happened?

    Time for action – building a JAR file

    What just happened?

    Time for action – running WordCount on a local Hadoop cluster

    What just happened?

    Time for action – running WordCount on EMR

    What just happened?

    The pre-0.20 Java MapReduce API

    Hadoop-provided mapper and reducer implementations

    Time for action – WordCount the easy way

    What just happened?

    Walking through a run of WordCount

    Startup

    Splitting the input

    Task assignment

    Task startup

    Ongoing JobTracker monitoring

    Mapper input

    Mapper execution

    Mapper output and reduce input

    Partitioning

    The optional partition function

    Reducer input

    Reducer execution

    Reducer output

    Shutdown

    That's all there is to it!

    Apart from the combiner…maybe

    Why have a combiner?

    Time for action – WordCount with a combiner

    What just happened?

    When you can use the reducer as the combiner

    Time for action – fixing WordCount to work with a combiner

    What just happened?

    Reuse is your friend

    Pop quiz – MapReduce mechanics

    Hadoop-specific data types

    The Writable and WritableComparable interfaces

    Introducing the wrapper classes

    Primitive wrapper classes

    Array wrapper classes

    Map wrapper classes

    Time for action – using the Writable wrapper classes

    What just happened?

    Other wrapper classes

    Have a go hero – playing with Writables

    Making your own

    Input/output

    Files, splits, and records

    InputFormat and RecordReader

    Hadoop-provided InputFormat

    Hadoop-provided RecordReader

    OutputFormat and RecordWriter

    Hadoop-provided OutputFormat

    Don't forget Sequence files

    Summary

    4. Developing MapReduce Programs

    Using languages other than Java with Hadoop

    How Hadoop Streaming works

    Why to use Hadoop Streaming

    Time for action – implementing WordCount using Streaming

    What just happened?

    Differences in jobs when using Streaming

    Analyzing a large dataset

    Getting the UFO sighting dataset

    Getting a feel for the dataset

    Time for action – summarizing the UFO data

    What just happened?

    Examining UFO shapes

    Time for action – summarizing the shape data

    What just happened?

    Time for action – correlating sighting duration to UFO shape

    What just happened?

    Using Streaming scripts outside Hadoop

    Time for action – performing the shape/time analysis from the command line

    What just happened?

    Java shape and location analysis

    Time for action – using ChainMapper for field validation/analysis

    What just happened?

    Have a go hero

    Too many abbreviations

    Using the Distributed Cache

    Time for action – using the Distributed Cache to improve location output

    What just happened?

    Counters, status, and other output

    Time for action – creating counters, task states, and writing log output

    What just happened?

    Too much information!

    Summary

    5. Advanced MapReduce Techniques

    Simple, advanced, and in-between

    Joins

    When this is a bad idea

    Map-side versus reduce-side joins

    Matching account and sales information

    Time for action – reduce-side join using MultipleInputs

    What just happened?

    DataJoinMapper and TaggedMapperOutput

    Implementing map-side joins

    Using the Distributed Cache

    Have a go hero - Implementing map-side joins

    Pruning data to fit in the cache

    Using a data representation instead of raw data

    Using multiple mappers

    To join or not to join...

    Graph algorithms

    Graph 101

    Graphs and MapReduce – a match made somewhere

    Representing a graph

    Time for action – representing the graph

    What just happened?

    Overview of the algorithm

    The mapper

    The reducer

    Iterative application

    Time for action – creating the source code

    What just happened?

    Time for action – the first run

    What just happened?

    Time for action – the second run

    What just happened?

    Time for action – the third run

    What just happened?

    Time for action – the fourth and last run

    What just happened?

    Running multiple jobs

    Final thoughts on graphs

    Using language-independent data structures

    Candidate technologies

    Introducing Avro

    Time for action – getting and installing Avro

    What just happened?

    Avro and schemas

    Time for action – defining the schema

    What just happened?

    Time for action – creating the source Avro data with Ruby

    What just happened?

    Time for action – consuming the Avro data with Java

    What just happened?

    Using Avro within MapReduce

    Time for action – generating shape summaries in MapReduce

    What just happened?

    Time for action – examining the output data with Ruby

    What just happened?

    Time for action – examining the output data with Java

    What just happened?

    Have a go hero – graphs in Avro

    Going forward with Avro

    Summary

    6. When Things Break

    Failure

    Embrace failure

    Or at least don't fear it

    Don't try this at home

    Types of failure

    Hadoop node failure

    The dfsadmin command

    Cluster setup, test files, and block sizes

    Fault tolerance and Elastic MapReduce

    Time for action – killing a DataNode process

    What just happened?

    NameNode and DataNode communication

    Have a go hero – NameNode log delving

    Time for action – the replication factor in action

    What just happened?

    Time for action – intentionally causing missing blocks

    What just happened?

    When data may be lost

    Block corruption

    Time for action – killing a TaskTracker process

    What just happened?

    Comparing the DataNode and TaskTracker failures

    Permanent failure

    Killing the cluster masters

    Time for action – killing the JobTracker

    What just happened?

    Starting a replacement JobTracker

    Have a go hero – moving the JobTracker to a new host

    Time for action – killing the NameNode process

    What just happened?

    Starting a replacement NameNode

    The role of the NameNode in more detail

    File systems, files, blocks, and nodes

    The single most important piece of data in the cluster – fsimage

    DataNode startup

    Safe mode

    SecondaryNameNode

    So what to do when the NameNode process has a critical failure?

    BackupNode/CheckpointNode and NameNode HA

    Hardware failure

    Host failure

    Host corruption

    The risk of correlated failures

    Task failure due to software

    Failure of slow running tasks

    Time for action – causing task failure

    What just happened?

    Have a go hero – HDFS programmatic access

    Hadoop's handling of slow-running tasks

    Speculative execution

    Hadoop's handling of failing tasks

    Have a go hero – causing tasks to fail

    Task failure due to data

    Handling dirty data through code

    Using Hadoop's skip mode

    Time for action – handling dirty data by using skip mode

    What just happened?

    To skip or not to skip...

    Summary

    7. Keeping Things Running

    A note on EMR

    Hadoop configuration properties

    Default values

    Time for action – browsing default properties

    What just happened?

    Additional property elements

    Default storage location

    Where to set properties

    Setting up a cluster

    How many hosts?

    Calculating usable space on a node

    Location of the master nodes

    Sizing hardware

    Processor / memory / storage ratio

    EMR as a prototyping platform

    Special node requirements

    Storage types

    Commodity versus enterprise class storage

    Single disk versus RAID

    Finding the balance

    Network storage

    Hadoop networking configuration

    How blocks are placed

    Rack awareness

    The rack-awareness script

    Time for action – examining the default rack configuration

    What just happened?

    Time for action – adding a rack awareness script

    What just happened?

    What is commodity hardware anyway?

    Pop quiz – setting up a cluster

    Cluster access control

    The Hadoop security model

    Time for action – demonstrating the default security

    What just happened?

    User identity

    The super user

    More granular access control

    Working around the security model via physical access control

    Managing the NameNode

    Configuring multiple locations for the fsimage class

    Time for action – adding an additional fsimage location

    What just happened?

    Where to write the fsimage copies

    Swapping to another NameNode host

    Having things ready before disaster strikes

    Time for action – swapping to a new NameNode host

    What just happened?

    Don't celebrate quite yet!

    What about MapReduce?

    Have a go hero – swapping to a new NameNode host

    Managing HDFS

    Where to write data

    Using balancer

    When to rebalance

    MapReduce management

    Command line job management

    Have a go hero – command line job management

    Job priorities and scheduling

    Time for action – changing job priorities and killing a job

    What just happened?

    Alternative schedulers

    Capacity Scheduler

    Fair Scheduler

    Enabling alternative schedulers

    When to use alternative schedulers

    Scaling

    Adding capacity to a local Hadoop cluster

    Have a go hero – adding a node and running balancer

    Adding capacity to an EMR job flow

    Expanding a running job flow

    Summary

    8. A Relational View on Data with Hive

    Overview of Hive

    Why use Hive?

    Thanks, Facebook!

    Setting up Hive

    Prerequisites

    Getting Hive

    Time for action – installing Hive

    What just happened?

    Using Hive

    Time for action – creating a table for the UFO data

    What just happened?

    Time for action – inserting the UFO data

    What just happened?

    Validating the data

    Time for action – validating the table

    What just happened?

    Time for action – redefining the table with the correct column separator

    What just happened?

    Hive tables – real or not?

    Time for action – creating a table from an existing file

    What just happened?

    Time for action – performing a join

    What just happened?

    Have a go hero – improve the join to use regular expressions

    Hive and SQL views

    Time for action – using views

    What just happened?

    Handling dirty data in Hive

    Have a go hero – do it!

    Time for action – exporting query output

    What just happened?

    Partitioning the table

    Time for action – making a partitioned UFO sighting table

    What just happened?

    Bucketing, clustering, and sorting... oh my!

    User-Defined Function

    Time for action – adding a new User Defined Function (UDF)

    What just happened?

    To preprocess or not to preprocess...

    Hive versus Pig

    What we didn't cover

    Hive on Amazon Web Services

    Time for action – running UFO analysis on EMR

    What just happened?

    Using interactive job flows for development

    Have a go hero – using an interactive EMR cluster

    Integration with other AWS products

    Summary

    9. Working with Relational Databases

    Common data paths

    Hadoop as an archive store

    Hadoop as a preprocessing step

    Hadoop as a data input tool

    The serpent eats its own tail

    Setting up MySQL

    Time for action – installing and setting up MySQL

    What just happened?

    Did it have to be so hard?

    Time for action – configuring MySQL to allow remote connections

    What just happened?

    Don't do this in production!

    Time for action – setting up the employee database

    What just happened?

    Be careful with data file access rights

    Getting data into Hadoop

    Using MySQL tools and manual import

    Have a go hero – exporting the employee table into HDFS

    Accessing the database from the mapper

    A better way – introducing Sqoop

    Time for action – downloading and configuring Sqoop

    What just happened?

    Sqoop and Hadoop versions

    Sqoop and HDFS

    Time for action – exporting data from MySQL to HDFS

    What just happened?

    Mappers and primary key columns

    Other options

    Sqoop's architecture

    Importing data into Hive using Sqoop

    Time for action – exporting data from MySQL into Hive

    What just happened?

    Time for action – a more selective import

    What just happened?

    Datatype issues

    Time for action – using a type mapping

    What just happened?

    Time for action – importing data from a raw query

    What just happened?

    Have a go hero

    Sqoop and Hive partitions

    Field and line terminators

    Getting data out of Hadoop

    Writing data from within the reducer

    Writing SQL import files from the reducer

    A better way – Sqoop again

    Time for action – importing data from Hadoop into MySQL

    What just happened?

    Differences between Sqoop imports and exports

    Inserts versus updates

    Have a go hero

    Sqoop and Hive exports

    Time for action – importing Hive data into MySQL

    What just happened?

    Time for action – fixing the mapping and re-running the export

    What just happened?

    Other Sqoop features

    Incremental merge

    Avoiding partial exports

    Sqoop as a code generator

    AWS considerations

    Considering RDS

    Summary

    10. Data Collection with Flume

    A note about AWS

    Data data everywhere...

    Types of data

    Getting network traffic into Hadoop

    Time for action – getting web server data into Hadoop

    What just happened?

    Have a go hero

    Getting files into Hadoop

    Hidden issues

    Keeping network data on the network

    Hadoop dependencies

    Reliability

    Re-creating the wheel

    A common framework approach

    Introducing Apache Flume

    A note on versioning

    Time for action – installing and configuring Flume

    What just happened?

    Using Flume to capture network data

    Time for action – capturing network traffic in a log file

    What just happened?

    Time for action – logging to the console

    What just happened?

    Writing network data to log files

    Time for action – capturing the output of a command to a flat file

    What just happened?

    Logs versus files

    Time for action – capturing a remote file in a local flat file

    What just happened?

    Sources, sinks, and channels

    Sources

    Sinks

    Channels

    Or roll your own

    Understanding the Flume configuration files

    Have a go hero

    It's all about events

    Time for action – writing network traffic onto HDFS

    What just happened?

    Time for action – adding timestamps

    What just happened?

    To Sqoop or to Flume...

    Time for action – multi level Flume networks

    What just happened?

    Time for action – writing to multiple sinks

    What just happened?

    Selectors replicating and multiplexing

    Handling sink failure

    Have a go hero - Handling sink failure

    Next, the world

    Have a go hero - Next, the world

    The bigger picture

    Data lifecycle

    Staging data

    Scheduling

    Summary

    11. Where to Go Next

    What we did and didn't cover in this book

    Upcoming Hadoop changes

    Alternative distributions

    Why alternative distributions?

    Bundling

    Free and commercial extensions

    Cloudera Distribution for Hadoop

    Hortonworks Data Platform

    MapR

    IBM InfoSphere Big Insights

    Choosing a distribution

    Other Apache projects

    HBase

    Oozie

    Whir

    Mahout

    MRUnit

    Other programming abstractions

    Pig

    Cascading

    AWS resources

    HBase on EMR

    SimpleDB

    DynamoDB

    Sources of information

    Source code

    Mailing lists and forums

    LinkedIn groups

    HUGs

    Conferences

    Summary

    A. Pop Quiz Answers

    Chapter 3, Understanding MapReduce

    Pop quiz – key/value pairs

    Pop quiz – walking through a run of WordCount

    Chapter 7, Keeping Things Running

    Pop quiz – setting up a cluster

    2. Module 2

    1. Getting Started with Hadoop 2.X

    Introduction

    Installing a single-node Hadoop Cluster

    Getting ready

    How to do it...

    How it works...

    Hadoop Distributed File System (HDFS)

    Yet Another Resource Negotiator (YARN)

    There's more

    Installing a multi-node Hadoop cluster

    Getting ready

    How to do it...

    How it works...

    Adding new nodes to existing Hadoop clusters

    Getting ready

    How to do it...

    How it works...

    Executing the balancer command for uniform data distribution

    Getting ready

    How to do it...

    How it works...

    There's more...

    Entering and exiting from the safe mode in a Hadoop cluster

    How to do it...

    How it works...

    Decommissioning DataNodes

    Getting ready

    How to do it...

    How it works...

    Performing benchmarking on a Hadoop cluster

    Getting ready

    How to do it...

    TestDFSIO

    NNBench

    MRBench

    How it works...

    2. Exploring HDFS

    Introduction

    Loading data from a local machine to HDFS

    Getting ready

    How to do it...

    How it works...

    Exporting HDFS data to a local machine

    Getting ready

    How to do it...

    How it works...

    Changing the replication factor of an existing file in HDFS

    Getting ready

    How to do it...

    How it works...

    Setting the HDFS block size for all the files in a cluster

    Getting ready

    How to do it...

    How it works...

    Setting the HDFS block size for a specific file in a cluster

    Getting ready

    How to do it...

    How it works...

    Enabling transparent encryption for HDFS

    Getting ready

    How to do it...

    How it works...

    Importing data from another Hadoop cluster

    Getting ready

    How to do it...

    How it works...

    Recycling deleted data from trash to HDFS

    Getting ready

    How to do it...

    How it works...

    Saving compressed data in HDFS

    Getting ready

    How to do it...

    How it works...

    3. Mastering Map Reduce Programs

    Introduction

    Writing the Map Reduce program in Java to analyze web log data

    Getting ready

    How to do it...

    How it works...

    Executing the Map Reduce program in a Hadoop cluster

    Getting ready

    How to do it

    How it works...

    Adding support for a new writable data type in Hadoop

    Getting ready

    How to do it...

    How it works...

    Implementing a user-defined counter in a Map Reduce program

    Getting ready

    How to do it...

    How it works...

    Map Reduce program to find the top X

    Getting ready

    How to do it...

    How it works

    Map Reduce program to find distinct values

    Getting ready

    How to do it

    How it works...

    Map Reduce program to partition data using a custom partitioner

    Getting ready

    How to do it...

    How it works...

    Writing Map Reduce results to multiple output files

    Getting ready

    How to do it...

    How it works...

    Performing Reduce side Joins using Map Reduce

    Getting ready

    How to do it

    How it works...

    Unit testing the Map Reduce code using MRUnit

    Getting ready

    How to do it...

    How it works...

    4. Data Analysis Using Hive, Pig, and Hbase

    Introduction

    Storing and processing Hive data in a sequential file format

    Getting ready

    How to do it...

    How it works...

    Storing and processing Hive data in the ORC file format

    Getting ready

    How to do it...

    How it works...

    Storing and processing Hive data in the Parquet file format

    Getting ready

    How to do it...

    How it works...

    Performing FILTER By queries in Pig

    Getting ready

    How to do it...

    How it works...

    Performing Group By queries in Pig

    Getting ready

    How to do it...

    How it works...

    Performing Order By queries in Pig

    Getting ready

    How to do it..

    How it works...

    Performing JOINS in Pig

    Getting ready

    How to do it...

    How it works

    Replicated Joins

    Skewed Joins

    Merge Joins

    Writing a user-defined function in Pig

    Getting ready

    How to do it...

    How it works...

    There's more...

    Analyzing web log data using Pig

    Getting ready

    How to do it...

    How it works...

    Performing the Hbase operation in CLI

    Getting ready

    How to do it

    How it works...

    Performing Hbase operations in Java

    Getting ready

    How to do it

    How it works...

    Executing the MapReduce programming with an Hbase Table

    Getting ready

    How to do it

    How it works

    5. Advanced Data Analysis Using Hive

    Introduction

    Processing JSON data in Hive using JSON SerDe

    Getting ready

    How to do it...

    How it works...

    Processing XML data in Hive using XML SerDe

    Getting ready

    How to do it...

    How it works

    Processing Hive data in the Avro format

    Getting ready

    How to do it...

    How it works...

    Writing a user-defined function in Hive

    Getting ready

    How to do it

    How it works...

    Performing table joins in Hive

    Getting ready

    How to do it...

    Left outer join

    Right outer join

    Full outer join

    Left semi join

    How it works...

    Executing map side joins in Hive

    Getting ready

    How to do it...

    How it works...

    Performing context Ngram in Hive

    Getting ready

    How to do it...

    How it works...

    Call Data Record Analytics using Hive

    Getting ready

    How to do it...

    How it works...

    Twitter sentiment analysis using Hive

    Getting ready

    How to do it...

    How it works

    Implementing Change Data Capture using Hive

    Getting ready

    How to do it

    How it works

    Multiple table inserting using Hive

    Getting ready

    How to do it

    How it works

    6. Data Import/Export Using Sqoop and Flume

    Introduction

    Importing data from RDBMS to HDFS using Sqoop

    Getting ready

    How to do it...

    How it works...

    Exporting data from HDFS to RDBMS

    Getting ready

    How to do it...

    How it works...

    Using query operator in Sqoop import

    Getting ready

    How to do it...

    How it works...

    Importing data using Sqoop in compressed format

    Getting ready

    How to do it...

    How it works...

    Performing Atomic export using Sqoop

    Getting ready

    How to do it...

    How it works...

    Importing data into Hive tables using Sqoop

    Getting ready

    How to do it...

    How it works...

    Importing data into HDFS from Mainframes

    Getting ready

    How to do it...

    How it works...

    Incremental import using Sqoop

    Getting ready

    How to do it...

    How it works...

    Creating and executing Sqoop job

    Getting ready

    How to do it...

    How it works...

    Importing data from RDBMS to Hbase using Sqoop

    Getting ready

    How to do it...

    How it works...

    Importing Twitter data into HDFS using Flume

    Getting ready

    How to do it...

    How it works

    Importing data from Kafka into HDFS using Flume

    Getting ready

    How to do it...

    How it works

    Importing web logs data into HDFS using Flume

    Getting ready

    How to do it...

    How it works...

    7. Automation of Hadoop Tasks Using Oozie

    Introduction

    Implementing a Sqoop action job using Oozie

    Getting ready

    How to do it...

    How it works

    Implementing a Map Reduce action job using Oozie

    Getting ready

    How to do it...

    How it works...

    Implementing a Java action job using Oozie

    Getting ready

    How to do it

    How it works

    Implementing a Hive action job using Oozie

    Getting ready

    How to do it...

    How it works...

    Implementing a Pig action job using Oozie

    Getting ready

    How to do it...

    How it works

    Implementing an e-mail action job using Oozie

    Getting ready

    How to do it...

    How it works...

    Executing parallel jobs using Oozie (fork)

    Getting ready

    How to do it...

    How it works...

    Scheduling a job in Oozie

    Getting ready

    How to do it...

    How it works...

    8. Machine Learning and Predictive Analytics Using Mahout and R

    Introduction

    Setting up the Mahout development environment

    Getting ready

    How to do it...

    How it works...

    Creating an item-based recommendation engine using Mahout

    Getting ready

    How to do it...

    How it works...

    Creating a user-based recommendation engine using Mahout

    Getting ready

    How to do it...

    How it works...

    Using Predictive analytics on Bank Data using Mahout

    Getting ready

    How to do it...

    How it works...

    Clustering text data using K-Means

    Getting ready

    How to do it...

    How it works...

    Performing Population Data Analytics using R

    Getting ready

    How to do it...

    How it works...

    Performing Twitter Sentiment Analytics using R

    Getting ready

    How to do it...

    How it works...

    Performing Predictive Analytics using R

    Getting ready

    How to do it...

    How it works...

    9. Integration with Apache Spark

    Introduction

    Running Spark standalone

    Getting ready

    How to do it...

    How it works...

    Running Spark on YARN

    Getting ready

    How to do it...

    How it works...

    Olympics Athletes analytics using the Spark Shell

    Getting ready

    How to do it...

    How it works...

    Creating Twitter trending topics using Spark Streaming

    Getting ready

    How to do it...

    How it works...

    Twitter trending topics using Spark streaming

    Getting ready

    How to do it...

    How it works...

    Analyzing Parquet files using Spark

    Getting ready

    How to do it...

    How it works...

    Analyzing JSON data using Spark

    Getting ready

    How to do it...

    How it works...

    Processing graphs using Graph X

    Getting ready

    How to do it...

    How it works...

    Conducting predictive analytics using Spark MLlib

    Getting ready

    How to do it...

    How it works...

    10. Hadoop Use Cases

    Introduction

    Call Data Record analytics

    Getting ready

    How to do it...

    Problem Statement

    Solution

    How it works...

    Web log analytics

    Getting ready

    How to do it...

    Problem statement

    Solution

    How it works...

    Sensitive data masking and encryption using Hadoop

    Getting ready

    How to do it...

    Problem statement

    Solution

    How it works...

    3. Module 3

    1. Hadoop 2.X

    The inception of Hadoop

    The evolution of Hadoop

    Hadoop's genealogy

    Hadoop-0.20-append

    Hadoop-0.20-security

    Hadoop's timeline

    Hadoop 2.X

    Yet Another Resource Negotiator (YARN)

    Architecture overview

    Storage layer enhancements

    High availability

    HDFS Federation

    HDFS snapshots

    Other enhancements

    Support enhancements

    Hadoop distributions

    Which Hadoop distribution?

    Performance

    Scalability

    Reliability

    Manageability

    Available distributions

    Cloudera Distribution of Hadoop (CDH)

    Hortonworks Data Platform (HDP)

    MapR

    Pivotal HD

    Summary

    2. Advanced MapReduce

    MapReduce input

    The InputFormat class

    The InputSplit class

    The RecordReader class

    Hadoop's small files problem

    Filtering inputs

    The Map task

    The dfs.blocksize attribute

    Sort and spill of intermediate outputs

    Node-local Reducers or Combiners

    Fetching intermediate outputs – Map-side

    The Reduce task

    Fetching intermediate outputs – Reduce-side

    Merge and spill of intermediate outputs

    MapReduce output

    Speculative execution of tasks

    MapReduce job counters

    Handling data joins

    Reduce-side joins

    Map-side joins

    Summary

    3. Advanced Pig

    Pig versus SQL

    Different modes of execution

    Complex data types in Pig

    Compiling Pig scripts

    The logical plan

    The physical plan

    The MapReduce plan

    Development and debugging aids

    The DESCRIBE command

    The EXPLAIN command

    The ILLUSTRATE command

    The advanced Pig operators

    The advanced FOREACH operator

    The FLATTEN operator

    The nested FOREACH operator

    The COGROUP operator

    The UNION operator

    The CROSS operator

    Specialized joins in Pig

    The Replicated join

    Skewed joins

    The Merge join

    User-defined functions

    The evaluation functions

    The aggregate functions

    The Algebraic interface

    The Accumulator interface

    The filter functions

    The load functions

    The store functions

    Pig performance optimizations

    The optimization rules

    Measurement of Pig script performance

    Combiners in Pig

    Memory for the Bag data type

    Number of reducers in Pig

    The multiquery mode in Pig

    Best practices

    The explicit usage of types

    Early and frequent projection

    Early and frequent filtering

    The usage of the LIMIT operator

    The usage of the DISTINCT operator

    The reduction of operations

    The usage of Algebraic UDFs

    The usage of Accumulator UDFs

    Eliminating nulls in the data

    The usage of specialized joins

    Compressing intermediate results

    Combining smaller files

    Summary

    4. Advanced Hive

    The Hive architecture

    The Hive metastore

    The Hive compiler

    The Hive execution engine

    The supporting components of Hive

    Data types

    File formats

    Compressed files

    ORC files

    The Parquet files

    The data model

    Dynamic partitions

    Semantics for dynamic partitioning

    Indexes on Hive tables

    Hive query optimizers

    Advanced DML

    The GROUP BY operation

    ORDER BY versus SORT BY clauses

    The JOIN operator and its types

    Map-side joins

    Advanced aggregation support

    Other advanced clauses

    UDF, UDAF, and UDTF

    Summary

    5. Serialization and Hadoop I/O

    Data serialization in Hadoop

    Writable and WritableComparable

    Hadoop versus Java serialization

    Avro serialization

    Avro and MapReduce

    Avro and Pig

    Avro and Hive

    Comparison – Avro versus Protocol Buffers / Thrift

    File formats

    The Sequence file format

    Reading and writing Sequence files

    The MapFile format

    Other data structures

    Compression

    Splits and compressions

    Scope for compression

    Summary

    6. YARN – Bringing Other Paradigms to Hadoop

    The YARN architecture

    Resource Manager (RM)

    Application Master (AM)

    Node Manager (NM)

    YARN clients

    Developing YARN applications

    Writing YARN clients

    Writing the Application Master entity

    Monitoring YARN

    Job scheduling in YARN

    CapacityScheduler

    FairScheduler

    YARN commands

    User commands

    Administration commands

    Summary

    7. Storm on YARN – Low Latency Processing in Hadoop

    Batch processing versus streaming

    Apache Storm

    Architecture of an Apache Storm cluster

    Computation and data modeling in Apache Storm

    Use cases for Apache Storm

    Developing with Apache Storm

    Apache Storm 0.9.1

    Storm on YARN

    Installing Apache Storm-on-YARN

    Prerequisites

    Installation procedure

    Summary

    8. Hadoop on the Cloud

    Cloud computing characteristics

    Hadoop on the cloud

    Amazon Elastic MapReduce (EMR)

    Provisioning a Hadoop cluster on EMR

    Summary

    9. HDFS Replacements

    HDFS – advantages and drawbacks

    Amazon AWS S3

    Hadoop support for S3

    Implementing a filesystem in Hadoop

    Implementing an S3 native filesystem in Hadoop

    Summary

    10. HDFS Federation

    Limitations of the older HDFS architecture

    Architecture of HDFS Federation

    Benefits of HDFS Federation

    Deploying federated NameNodes

    HDFS high availability

    Secondary NameNode, Checkpoint Node, and Backup Node

    High availability – edits sharing

    Useful HDFS tools

    Three-layer versus four-layer network topology

    HDFS block placement

    Pluggable block placement policy

    Summary

    11. Hadoop Security

    The security pillars

    Authentication in Hadoop

    Kerberos authentication

    The Kerberos architecture and workflow

    Kerberos authentication and Hadoop

    Authentication via HTTP interfaces

    Authorization in Hadoop

    Authorization in HDFS

    Identity of an HDFS user

    Group listings for an HDFS user

    HDFS APIs and shell commands

    Specifying the HDFS superuser

    Turning off HDFS authorization

    Limiting HDFS usage

    Name quotas in HDFS

    Space quotas in HDFS

    Service-level authorization in Hadoop

    Data confidentiality in Hadoop

    HTTPS and encrypted shuffle

    SSL configuration changes

    Configuring the keystore and truststore

    Audit logging in Hadoop

    Summary

    12. Analytics Using Hadoop

    Data analytics workflow

    Machine learning

    Apache Mahout

    Document analysis using Hadoop and Mahout

    Term frequency

    Document frequency

    Term frequency – inverse document frequency

    Tf-Idf in Pig

    Cosine similarity distance measures

    Clustering using k-means

    K-means clustering using Apache Mahout

    RHadoop

    Summary

    13. Hadoop for Microsoft Windows

    Deploying Hadoop on Microsoft Windows

    Prerequisites

    Building Hadoop

    Configuring Hadoop

    Deploying Hadoop

    Summary

    A. Bibliography

    Index

    Hadoop: Data Processing and Modelling


    Hadoop: Data Processing and Modelling

    Unlock the power of your data with Hadoop 2.X ecosystem and its data warehousing techniques across large data sets

    A course in three modules

    BIRMINGHAM - MUMBAI

    Hadoop: Data Processing and Modelling

    Copyright © 2016 Packt Publishing

    All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Published on: August 2016

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN: 978-1-78712-516-2

    www.packtpub.com

    Credits

    Authors

    Garry Turkington

    Tanmay Deshpande

    Sandeep Karanth

    Reviewers

    David Gruzman

    Muthusamy Manigandan

    Vidyasagar N V

    Shashwat Shriparv

    Shiva Achari

    Pavan Kumar Polineni

    Uchit Vyas

    Yohan Wadia

    Content Development Editor

    Rashmi Suvarna

    Graphics

    Kirk D'Phena

    Production Coordinator

    Shantanu N. Zagade

    Preface

    A number of organizations are focusing on big data processing, particularly with Hadoop. This course will help you understand how Hadoop, as an ecosystem, helps us store, process, and analyze data. Hadoop is synonymous with Big Data processing. Its simple programming model, its "code once and deploy at any scale" paradigm, and an ever-growing ecosystem make Hadoop an inclusive platform for programmers with different levels of expertise and breadth of knowledge. A team of machines interconnected via a very fast network can provide better scaling and elasticity, but that is not enough.

    These clusters have to be programmed. A greater number of machines, just like a team of human beings, requires more coordination and synchronization. The higher the number of machines, the greater the possibility of failures in the cluster. How do we handle synchronization and fault tolerance in a simple way that eases the burden on the programmer? The answer is systems such as Hadoop. Today, it is the number-one sought-after job skill in the data sciences space. To handle and analyze Big Data, Hadoop has become the go-to tool. Hadoop 2.x is spreading its wings to cover a variety of application paradigms and solve a wider range of data problems. It is rapidly becoming a general-purpose cluster platform for all data processing needs and will soon become a mandatory skill for every engineer across verticals. This course will help you explore the power of the Hadoop ecosystem so that you can tackle real-world scenarios.

    This course covers optimizations and advanced features of MapReduce, Pig, and Hive, along with Hadoop 2.x, and illustrates how these can be used to extend the capabilities of Hadoop.

    When you finish this course, you will be able to tackle real-world scenarios and become a big data expert using the tools and knowledge gained from the various step-by-step tutorials and recipes.

    What this learning path covers

    Hadoop Beginner's Guide

    This module is here to help you make sense of Hadoop and use it to solve your big data problems. It's a really exciting time to work with data processing technologies such as Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of large corporations and government agencies—is now possible through free open source software (OSS). This module removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the module gives you the understanding needed to effectively use Hadoop to solve real-world problems. Starting with the basics of installing and configuring Hadoop, the module explains how to develop applications, maintain the system, and use additional products to integrate with other systems. In addition to examples on Hadoop clusters on Ubuntu, uses of cloud services such as Amazon EC2 and Elastic MapReduce are covered.

    Hadoop Real World Solutions Cookbook, 2nd edition

    Big Data is the need of the day. Many organizations are producing huge amounts of data every day. With the advancement of Hadoop-like tools, it has become easier for everyone to solve Big Data problems with great efficiency and at a very low cost. When you are handling such a massive amount of data, even a small mistake can cost you dearly in terms of performance and storage. It's very important to learn the best practices of handling such tools before you start building an enterprise Big Data Warehouse; doing so will be greatly advantageous in making your project successful.

    This module gives readers insights into learning and mastering big data via recipes. The module not only clarifies most big data tools in the market but also provides best practices for using them. The module provides recipes that are based on the latest versions of Apache Hadoop 2.X, YARN, Hive, Pig, Sqoop, Flume, Apache Spark, Mahout and many more such ecosystem tools. This real-world-solution cookbook is packed with handy recipes you can apply to your own everyday issues. Each chapter provides in-depth recipes that can be referenced easily. This module provides detailed practices on the latest technologies such as YARN and Apache Spark. Readers will be able to consider themselves as big data experts on completion of this module.

    Mastering Hadoop

    This era of Big Data has brought similar changes to businesses as well. Almost everything in a business is logged. Every action taken by a user on an e-commerce page is recorded to improve the quality of service, and every item bought by the user is recorded to cross-sell or up-sell other items. Businesses want to understand the DNA of their customers and try to infer it by extracting every possible piece of data they can get about these customers. Businesses are not worried about the format of the data. They are ready to accept speech, images, natural language text, or structured data. These data points are used to drive business decisions and personalize experiences for the user. The more the data, the higher the degree of personalization, and the better the experience for the user.

    Hadoop is synonymous with Big Data processing. Its simple programming model, its "code once and deploy at any scale" paradigm, and an ever-growing ecosystem make Hadoop an all-encompassing platform for programmers with different levels of expertise. This module explores the industry guidelines to optimize MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0. Then, it dives deep into Hadoop 2.0-specific features such as YARN and HDFS Federation. This module is a step-by-step guide that focuses on advanced Hadoop concepts and aims to take your Hadoop knowledge and skill set to the next level. The data processing flow dictates the order of the concepts in each chapter, and each chapter is illustrated with code fragments or schematic diagrams.

    This module is a guide focusing on advanced concepts and features in Hadoop.

    Foundations of every concept are explained with code fragments or schematic illustrations. The data processing flow dictates the order of the concepts in each chapter.

    What you need for this learning path

    In the simplest case, a single Linux-based machine will give you a platform to explore almost all the exercises in this course. We assume you have a recent distribution of Ubuntu, but as long as you have command-line Linux familiarity, any modern distribution will suffice. Some of the examples in later chapters really need multiple machines to see things working, so you will require access to at least four such hosts. Virtual machines are completely acceptable; they're not ideal for production but are fine for learning and exploration. Since we also explore Amazon Web Services in this course, you can run all the examples on EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout the modules. AWS services are usable by anyone, but you will need a credit card to sign up!

    To get started with this hands-on recipe-driven module, you should have a laptop/desktop with any OS, such as Windows, Linux, or Mac. It's good to have an IDE, such as Eclipse or IntelliJ, and of course, you need a lot of enthusiasm to learn.

    The following software suites are required to try out the examples in the module:

    Java Development Kit (JDK 1.7 or later): This is free software from Oracle that provides a JRE (Java Runtime Environment) and additional tools for developers. It can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

    The IDE for editing Java code: IntelliJ IDEA is the IDE that has been used to develop the examples. Any other IDE of your choice can also be used. The community edition of the IntelliJ IDE can be downloaded from https://www.jetbrains.com/idea/download/.

    Maven: Maven is a build tool that has been used to build the samples in the course. Maven can be used to automatically pull build dependencies and specify configurations via XML files. The code samples in the chapters can be built into a JAR using two simple Maven commands:

    mvn compile
    mvn assembly:single

    These commands compile the code and create a consolidated JAR file containing the program along with all of its dependencies. It is important to change the mainClass references in the pom.xml to the driver class name when building the consolidated JAR file.

    Hadoop-related consolidated JAR files can be run using the command:

    hadoop jar <jar_file> args

    This command directly picks the driver program from the mainClass that was specified in the pom.xml. Maven can be downloaded and installed from http://maven.apache.org/download.cgi. The Maven XML template file used to build the samples in this course is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>MasteringHadoop</groupId>
      <artifactId>MasteringHadoop</artifactId>
      <version>1.0-SNAPSHOT</version>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
              <source>1.7</source>
              <target>1.7</target>
            </configuration>
          </plugin>
          <plugin>
            <version>3.1</version>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
              <archive>
                <manifest>
                  <mainClass>MasteringHadoop.MasteringHadoopTest</mainClass>
                </manifest>
              </archive>
            </configuration>
          </plugin>
          <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
              <archive>
                <manifest>
                  <mainClass>MasteringHadoop.MasteringHadoopTest</mainClass>
                </manifest>
              </archive>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
            </configuration>
          </plugin>
        </plugins>
        <pluginManagement>
          <plugins>
            <!-- This plugin's configuration is used to store Eclipse m2e settings
                 only. It has no influence on the Maven build itself. -->
            <plugin>
              <groupId>org.eclipse.m2e</groupId>
              <artifactId>lifecycle-mapping</artifactId>
              <version>1.0.0</version>
              <configuration>
                <lifecycleMappingMetadata>
                  <pluginExecutions>
                    <pluginExecution>
                      <pluginExecutionFilter>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-dependency-plugin</artifactId>
                        <versionRange>[2.1,)</versionRange>
                        <goals>
                          <goal>copy-dependencies</goal>
                        </goals>
                      </pluginExecutionFilter>
                      <action>
                        <ignore />
                      </action>
                    </pluginExecution>
                  </pluginExecutions>
                </lifecycleMappingMetadata>
              </configuration>
            </plugin>
          </plugins>
        </pluginManagement>
      </build>
      <dependencies>
        <!-- Project dependencies go here. -->
      </dependencies>
    </project>

    Hadoop 2.2.0: Apache Hadoop is required to try out the examples in general. Appendix, Hadoop for Microsoft Windows, has the details on Hadoop's single-node installation on a Microsoft Windows machine. The steps are similar and easier for other operating systems such as Linux or Mac, and they can be found at http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/SingleNodeSetup.html.
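    As a rough illustration of how the build and run steps above fit together (the JAR file name below is inferred from the Maven coordinates in the template, and the input and output arguments are placeholders rather than paths used in the course), building and running one of the consolidated samples might look like this:

    mvn compile
    mvn assembly:single
    hadoop jar target/MasteringHadoop-1.0-SNAPSHOT-jar-with-dependencies.jar input output

    The hadoop jar command launches whichever driver class is named in the mainClass element of the pom.xml, and input and output stand in for whatever HDFS paths the particular sample expects.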

    Who this learning path is for

    We assume you are reading this course because you want to know more about Hadoop at a hands-on level; the key audience is those with software development experience but no prior exposure to Hadoop or similar big data technologies.

    For developers who want to know how to write MapReduce applications, we assume you are comfortable writing Java programs and are familiar with the Unix command-line interface. We will also show you a few programs in Ruby, but these are usually only to demonstrate language independence, and you don't need to be a Ruby expert.

    For architects and system administrators, the course also provides significant value in explaining how Hadoop works, its place in the broader architecture, and how it can be managed operationally. Some of the more involved techniques in Chapter 4, Developing MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably of less direct interest to this audience.

    This course is for Java developers who know scripting and want a career shift into the Hadoop and Big Data segment of the IT industry. Whether you are a novice or an expert in Hadoop, this course will take you to the most advanced level in Hadoop 2.X.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the course's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    You can download the code files by following these steps:

    Log in or register to our website using your e-mail address and password.

    Hover the mouse pointer on the SUPPORT tab at the top.

    Click on Code Downloads & Errata.

    Enter the name of the course in the Search box.

    Select the course for which you're looking to download the code files.

    Choose from the drop-down menu where you purchased this course from.

    Click on Code Download.

    You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.

    Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

    WinRAR / 7-Zip for Windows

    Zipeg / iZip / UnRarX for Mac

    7-Zip / PeaZip for Linux

    The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Data-Science-with-Hadoop/tree/master. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our modules—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

    We appreciate your help in protecting our authors and our ability to bring you valuable content.

    Questions

    If you have a problem with any aspect of this course, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.

    Part 1. Module 1

    Hadoop Beginner’s Guide

    Learn how to crunch big data to extract meaning from the data avalanche

    Chapter 1. What It's All About

    This book is about Hadoop, an open source framework for large-scale data processing. Before we get into the details of the technology and its use in later chapters, it is important to spend a little time exploring the trends that led to Hadoop's creation and its enormous success.

    Hadoop was not created in a vacuum; instead, it exists due to the explosion in the amount of data being created and consumed and a shift that sees this data deluge arrive at small startups and not just huge multinationals. At the same time, other trends have changed how software and systems are deployed, using cloud resources alongside or even in preference to more traditional infrastructures.

    This chapter will explore some of these trends and explain in detail the specific problems Hadoop seeks to solve and the drivers that shaped its design.

    In the rest of this chapter we shall:

    Learn about the big data revolution

    Understand what Hadoop is and how it can extract value from data

    Look into cloud computing and understand what Amazon Web Services provides

    See how powerful the combination of big data processing and cloud computing can be

    Get an overview of the topics covered in the rest of this book

    So let's get on with it!

    Big data processing

    Look around at the technology we have today, and it's easy to come to the conclusion that it's all about data. As consumers, we have an increasing appetite for rich media, both in terms of the movies we watch and the pictures and videos we create and upload. We also, often without thinking, leave a trail of data across the Web as we perform the actions of our daily lives.

    Not only is the amount of data being generated increasing, but the rate of increase is also accelerating. From emails to Facebook posts, from purchase histories to web links, there are large data sets growing everywhere. The challenge is in extracting from this data the most valuable aspects; sometimes this means particular data elements, and at other times, the focus is instead on identifying trends and relationships between pieces of data.

    There's a subtle change occurring behind the scenes that is all about using data in more and more meaningful ways. Large companies have realized the value in data for some time and have been using it to improve the services they provide to their customers, that is, us. Consider how Google displays advertisements relevant to our web surfing, or how Amazon or Netflix recommend new products or titles that often match well to our tastes and interests.

    The value of data

    These corporations wouldn't invest in large-scale data processing if it didn't provide a meaningful return on the investment or a competitive advantage. There are several main aspects to big data that should be appreciated:

    Some questions only give value when asked of sufficiently large data sets. Recommending a movie based on the preferences of another person is, in the absence of other factors, unlikely to be very accurate. Increase the number of people to a hundred and the chances increase slightly. Use the viewing history of ten million other people and the chances of detecting patterns that can be used to give relevant recommendations improve dramatically.

    Big data tools often enable the processing of data on a larger scale and at a lower cost than previous solutions. As a consequence, it is often possible to perform data processing tasks that were previously prohibitively expensive.

    The cost of large-scale data processing isn't just about financial expense; latency is also a critical factor. A system may be able to process as much data as is thrown at it, but if the average processing time is measured in weeks, it is likely not useful. Big data tools allow data volumes to be increased while keeping processing time under control, usually by matching the increased data volume with additional hardware.

    Previous assumptions of what a database should look like or how its data should be structured may need to be revisited to meet the needs of the biggest data problems.

    In combination with the preceding points, sufficiently large data sets and flexible tools allow previously unimagined questions to be answered.

    Historically for the few and not the many

    The examples discussed in the previous section have generally been seen in the form of innovations of large search engines and online companies. This is a continuation of a much older trend wherein processing large data sets was an expensive and complex undertaking, out of the reach of small- or medium-sized organizations.

    Similarly, the broader approach of data mining has been around for a very long time but has never really been a practical tool outside the largest corporations and government agencies.

    This situation may have been regrettable, but most smaller organizations were not at a disadvantage, as they rarely had access to the volume of data requiring such an investment.

    The increase in data is not limited to the big players anymore, however; many small and medium companies—not to mention some individuals—find themselves gathering larger and larger amounts of data that they suspect may have some value they want to unlock.

    Before understanding how this can be achieved, it is important to appreciate some of these broader historical trends that have laid the foundations for systems such as Hadoop today.

    Classic data processing systems

    The fundamental reason that big data mining systems were rare and expensive is that scaling a system to process large data sets is very difficult; as we will see, it has traditionally been limited to the processing power that can be built into a single computer.

    There are, however, two broad approaches to scaling a system as the size of the data increases, generally referred to as scale-up and scale-out.

    Scale-up

    In most enterprises, data processing has typically been performed on impressively large computers with impressively larger price tags. As the size of the data grows, the approach is to move to a bigger server or storage array. Through an effective architecture—even today, as we'll describe later in this chapter—the cost of such hardware could easily be measured in hundreds of thousands or in millions of dollars.

    The advantage of simple scale-up is that the architecture does not significantly change through the growth. Though larger components are used, the basic relationship (for example, database server and storage array) stays the same. For applications such as commercial database engines, the software handles the complexities of utilizing the available hardware, but in theory, increased scale is achieved by migrating the same software onto larger and larger servers. Note though that the difficulty of moving software onto more and more processors is never trivial; in addition, there are practical limits on just how big a single host can be, so at some point, scale-up cannot be extended any further.

    The promise of a single architecture at any scale is also unrealistic. Designing a scale-up system to handle data sets of sizes such as 1 terabyte, 100 terabyte, and 1 petabyte may conceptually apply larger versions of the same components, but the complexity of their connectivity may vary from cheap commodity through custom hardware as the scale increases.

    Early approaches to scale-out

    Instead of growing a system onto larger and larger hardware, the scale-out approach spreads the processing onto more and more machines. If the data set doubles, simply use two servers instead of a single double-sized one. If it doubles again, move to four hosts.

    The obvious benefit of this approach is that purchase costs remain much lower than for scale-up. Server hardware costs tend to increase sharply when one seeks to purchase larger machines, and though a single host may cost $5,000, one with ten times the processing power may cost a hundred times as much. The downside is that we need to develop strategies for splitting our data processing across a fleet of servers and the tools historically used for this purpose have proven to be complex.

    As a consequence, deploying a scale-out solution has required significant engineering effort; the system developer often needs to handcraft the mechanisms for data partitioning and reassembly, not to mention the logic to schedule the work across the cluster and handle individual machine failures.
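    To give a flavour of what that handcrafted logic involves, here is a minimal, purely illustrative Java sketch that hash-partitions records across a fixed fleet of workers. The class and method names are invented for this example and are not part of Hadoop or any library; even this toy version sidesteps the genuinely hard parts, which are noted in the comments.

    import java.util.ArrayList;
    import java.util.List;

    // A toy illustration of hand-rolled scale-out partitioning: each record is
    // assigned to one of N workers by hashing its key. In a real deployment the
    // queues would be remote hosts, with all the failure handling that implies.
    public class NaivePartitioner {

        // Decide which of numWorkers hosts should process the given key.
        static int workerFor(String key, int numWorkers) {
            // Mask to a non-negative value before taking the modulus.
            return (key.hashCode() & Integer.MAX_VALUE) % numWorkers;
        }

        public static void main(String[] args) {
            int numWorkers = 4;

            // One "work queue" per worker.
            List<List<String>> queues = new ArrayList<>();
            for (int i = 0; i < numWorkers; i++) {
                queues.add(new ArrayList<>());
            }

            String[] records = {"alice", "bob", "carol", "dave", "eve", "mallory"};
            for (String record : records) {
                queues.get(workerFor(record, numWorkers)).add(record);
            }

            for (int i = 0; i < numWorkers; i++) {
                System.out.println("worker " + i + " -> " + queues.get(i));
            }
            // Still missing: shipping the data to the hosts, scheduling the
            // processing, detecting dead workers, re-running their share of the
            // work, and merging the partial results back together.
        }
    }

    The point of the sketch is not the partitioning itself, which is trivial, but everything it leaves out; frameworks such as Hadoop exist precisely to supply that missing machinery.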

    Limiting factors

    These traditional approaches to scale-up and scale-out have not been widely adopted outside large enterprises, government, and academia. The purchase costs are often high, as is the effort to develop and manage the systems. These factors alone put them out of the reach of many smaller businesses. In addition, the approaches themselves have had several weaknesses that have become apparent over time:

    As scale-out systems get large, or as scale-up systems deal with multiple CPUs, the difficulties caused by the complexity of the concurrency in the systems have become significant. Effectively utilizing multiple hosts or CPUs is a very difficult task, and implementing the necessary strategy to maintain efficiency throughout execution of the desired workloads can entail enormous effort.

    Hardware advances—often couched in terms of Moore's law—have begun to highlight discrepancies in system capability. CPU power has grown much faster than network or disk speeds have; once CPU cycles were the most valuable resource in the system, but today, that no longer holds. Whereas a modern CPU may be able to execute millions of times as many operations as a CPU 20 years ago would, memory and hard disk speeds have only increased by factors of thousands or even hundreds. It is quite easy to build a modern system with so much CPU power that the storage system simply cannot feed it data fast enough to keep the CPUs busy.

    A different approach

    From the preceding scenarios there are a number of techniques that have been used successfully to ease the pain in scaling data processing systems to the large scales required by big data.

    All roads lead to scale-out

    As just hinted, taking a scale-up approach to scaling is not an open-ended tactic. There is a limit to the size of individual servers that can be purchased from mainstream hardware suppliers, and even more niche players can't offer an arbitrarily large server. At some point, the workload will increase beyond the capacity of the single, monolithic scale-up server, so then what? The unfortunate answer is that the best approach is to have two large servers instead of one. Then, later, three, four, and so on. Or, in other words, the natural tendency of scale-up architecture is—in extreme cases—to add a scale-out strategy to the mix. Though this gives some of the benefits of both approaches, it also compounds the costs and weaknesses; instead of very expensive hardware or the need to manually develop the cross-cluster logic, this hybrid architecture requires both.

    As a
