
Beginner’s Guide to

Big Data Analytics
Introduction
‘Big Data’: what do these two words really mean? Everyone is talking about it, but frankly, not many really understand what the hype is all about. This book by Jigsaw Academy aims to give one an understanding of Big Data and what makes data big, while also elaborating in simple language on the challenges of Big Data, the emerging technologies and the Big Data landscape. Finally, we talk about careers in Big Data and what the future could hold in store for the industry.

This book is also a useful companion for those of you enrolled in Jigsaw's course ‘Big Data Analytics Using Hadoop and R’. You can use this book to complement your learning and better understand Big Data. Please note the course notes in every chapter, which link the content in the chapter to the modules covered in the course.

Enjoy the book.

The Big Data Team at Jigsaw (led by Kiran P.V.)



Outline
What is Big Data
What Makes Data Big
Challenges of Big Data Processing
Big Data Technologies
Big Data and Analytics
Unstructured Data and Text Analytics
Big Data in Action
Big Data Landscape
Big Data Career Paths
Big Data in the Future
Learn more about Big Data



CHAPTER 01

What is Big Data?
“I don't know how the cost of hard disks has decreased so rapidly. These days one can buy a terabyte hard drive for just $100,” a friend told me a couple of years ago. It's hard not to agree with him, and a quick review of historical facts validated his opinion. In the 1990s, the cost per gigabyte of hard disk was around $10,000, and now a gigabyte can be purchased for only $0.10. The price has dropped 100,000 times over a span of 20 years. Currently we are even seeing a few gigabytes of storage space being offered free of cost by email service providers and file hosting services. For personal accounts, Gmail offers about 15 gigabytes of free storage space whereas the file hosting service Dropbox offers up to 3.5 gigabytes. These limits are higher for business accounts.

One would wonder how enterprises are influenced by the lower costs of storage space. For one, it definitely provides them with more opportunities to store data around their product and service offerings. Virtually every industry is seeing a tremendous explosion in new data sources and is dependent on advanced storage technologies. Increased adoption of the internet and smartphones has enabled individuals across the globe to leave a huge digital footprint of online data which many enterprises want to capture. In the past, for example, banks used to store customer data mostly around demographic information tracked from application forms and transaction information tracked from passbooks. These days, the customer data being stored is enormous and varies widely across mobile usage, online transactions, ATM withdrawals, customer feedback, social media comments and credit bureau information. All these new sources of data which did not exist in the past can be categorized under the new term “Big Data”. Big Data can simply be described as data which is huge, but more importantly Big Data is data that comes from multiple sources rather than just one.

Big Data is definitely one of the more fascinating evolutions of the 21st century in the world of IT. The truth is that Big Data has opened up tremendous opportunities and has provided us with endless solutions to deal with social, economic and business problems. For enterprises, it is a huge untapped source of profit which, if used appropriately, will be the key to staying ahead of their competition. In order to deal with Big Data effectively, they need to depend on advanced database technologies and faster processing capabilities. Just having Big Data is not a sufficient criterion for success; enterprises also need to implement analytics effectively, in order to be able to garner insights that help improve profitability. They should actively pursue the art and science of capturing, analysing and creating value out of Big Data.

Course note: The Big Data and Hadoop Overview Module provides pre-class videos and lots of reading material on the importance of Big Data and how it is transforming the way enterprises are implementing data-based strategies to become more competitive.



CHAPTER 02

What makes data Big?
We live in the era of Big Data and it is not leaving any industry untouched, be it financial services, consumer goods, e-commerce, transportation, manufacturing or social media. Across all industries, enterprises now have access to an increasing number of both internal and external Big Data sources. Internal sources typically track information around demographics, online or offline transactions, sensors, server logs, website traffic and emails. This list is not exhaustive and varies from industry to industry. External sources, on the other hand, are mostly related to social media information from online discussions, comments and feedback shared about a particular product or service. Another major source of Big Data is machine data, which consists of real-time information from sensors and web logs that monitor customers' online activity. In the coming years, as we continue to develop new ways of generating data either online or offline by leveraging technological advancements, the one correct prediction we can make is this: the growth of Big Data is not going to stop.

Although at a high level Big Data is about data being captured from multiple sources and its sheer size, there are many technical definitions which provide more clarity. The O'Reilly Strata group states that “Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures”. In simple terms, Big Data needs multiple systems to efficiently handle and process data rather than a single system. Say an online e-commerce enterprise in a single region is generating about 1000 gigabytes of data on a daily basis, which can be handled and processed using traditional database systems. On expanding operations to a global level, its daily data generation increases 10,000 times and reaches 10 petabytes (1 petabyte = 1,000,000 gigabytes). To handle this kind of data, traditional database systems do not have the required capabilities, and enterprises need to depend on Big Data technologies such as Hadoop, which uses a distributed computing framework. We will learn more about these technical topics in subsequent chapters.

To further simplify our understanding of Big Data, we can rely on three major characteristics of Big Data, i.e. volume, variety and velocity, which are more commonly referred to as the 3 V's of Big Data. Occasionally, some resources also talk about a less common characteristic of Big Data, i.e. veracity, which is referred to as the 4th V of Big Data. All these 4 characteristics provide more details around the nature, scope and structure of Big Data.

Course note: Commonly Big Data is characterized by 3 V's and these provide context for a new class of possibilities. You will learn more about how these characteristics help achieve more information from massive amounts of data in the Big Data and Hadoop Overview Module.



Volume

Volume deals with the size aspect of Big Data. With technical advancements in global connectivity and social media, the amount of data generated on a daily basis is growing exponentially. Every day, about 30 billion pieces of information are shared globally. An IDC Digital Universe study estimated that global data volume was about 1.8 zettabytes as of 2011 and would grow about 5 times by 2015. A zettabyte is a quantity of information or information storage capacity equal to one billion terabytes, which is a 1 followed by 21 zeroes of bytes. Across many enterprises, these increasing volumes pose an immediate challenge to traditional database systems with regard to the storing and processing of data.

Variety
Big Data comes from sources such as conversations on social media, media files shared on social networks, online transactions, smart phone usage, climate sensor data, financial market data and many more. The underlying formats of data coming out of these sources vary across excel sheets, text documents, audio files and server log files, which can be broadly classified under either structured or unstructured data types. Structured data formats typically refer to a defined way of storing data, i.e. clearly marked out rows and columns, whereas unstructured data formats do not have any order and mostly refer to text, audio and video data. Unstructured formats of data are a more recent phenomenon, and traditional database systems do not possess the required capabilities to process this kind of information.



Velocity
Increased volumes of data have put a lot of stress on the processing abilities of traditional database systems. Enterprises should be able to quickly process incoming data from various sources and then share it with the business to ensure the smooth functioning of day-to-day operations. This quick flow of data within an enterprise refers to the velocity characteristic of Big Data. Another important aspect is the ability to provide relevant services to the end user on a real-time basis. For example, Amazon provides instant recommendation services depending on the user's search and location. Based on the entered keyword, these services need to search through the entire history of transactions and share relevant results which hopefully will convert into a potential purchase. The effects of velocity are very similar to those of volume, and enterprises need to rely on advanced processing technologies to efficiently handle Big Data.

Veracity
Though enterprises have access to a lot of Big Data, some aspects of it may be missing. Over the years, we have learned that data quality issues usually arise due to human entry error or due to some individuals withholding information. In the Big Data era, where most of the data capturing processes are automated, the same issues can occur due to technical lapses arising from system or sensor malfunction. Whatever the reasons may be, one should be careful to deal with inconsistency in Big Data before using it for any kind of analysis.



CHAPTER 03

Challenges of Big Data Processing

Just having a big source of data is not enough to become successful; enterprises need to implement relevant processes and systems which will help them extract value out of it. An important aspect here is what one should do with it. In the absence of a business context, data in itself is meaningless and would just occupy space on the storage servers. Also, many Big Data sources tend to have missing or low-content information, as described by the veracity characteristic earlier. The actual power of Big Data surfaces only by applying analytics on top of it, when it is used to generate useful insights to guide future decision making. Irrespective of the size of the data, whether big or small, analytics methodologies need to be implemented to reap benefits. This typically involves cleaning, analysing, interpreting and finally visualizing the hidden patterns that emerge from the data. Due to the sheer volume, variety and velocity of Big Data, the processing capacity of traditional database systems is strained beyond its limits. So enterprises need to look out for advanced processing technologies and capabilities to effectively manage Big Data and further implement analytics on it.

One of the major aspects of any Big Data processing framework would be to successfully handle huge amounts of information without compromising on query time. Traditional database systems lack the required infrastructure and internal designs to process Big Data at a scale of petabytes or exabytes. Since these systems tend to operate out of a single machine with a huge hard disk and processing capabilities, there are a set of limitations which come with it. The first one is a scalability issue. With a continuous increase in data volumes, the storage capacity of these systems needs to be continuously increased, and this can be an expensive option. The second one is slow querying time, because the storage load is already operating at maximum levels and enterprises cannot wait for days to get their daily reports. These limitations call for alternate approaches based on scalable storage and a distributed approach to scaling.

Course note: NoSQL technologies are helping enterprises to achieve more than what was possible previously. In the Big Data and Hadoop Overview Module, the evolution and benefits of new technologies such as Hadoop and MongoDB are discussed, and also how these help in overcoming the limitations of traditional IT systems for solving Big Data problems.

Big Data sources are diverse, and inputs for these systems can be in structured or unstructured formats. Since the origin of these data formats is spread across the globe, most of the time they won't have a pre-defined order and require pre-processing before being used for any analysis. A common use of Big Data processing is to make use of unstructured data, specifically comments on social media, to track customer sentiment towards various product and service offerings. Due to their inherent static design, many traditional database systems can handle only structured data and as such do not provide any alternatives for unstructured data. For example, SQL-based database systems depend on schema designs which clearly define the nature of the data being loaded and are used to process transactional data. Since unstructured data for the most part does not have a proper structure, it would be impossible to handle it in SQL-based systems. Luckily for us, there exist alternatives in the form of NoSQL databases, which can handle both structured and unstructured data formats.

The majority of client applications run by enterprises operate in real time, and instant support on services has become a priority. This requires processing Big Data by the minute in order to provide relevant services to customers. For example, based on a user's search keyword, Google instantly processes information stored across its many databases and returns relevant links within a matter of seconds. Similarly, banks need to track global online transactions at any time of the day and further update their databases so that the transactions reflect in a customer's online account immediately. These services require enterprises to have a system which can ensure the fast movement of data without any potential failures. In order to handle this velocity of Big Data, coupled with volume and variety, enterprises need to depend on sophisticated databases which form part of the NoSQL category. These databases relax the limits of the schema-based design of SQL systems and store data as key-value pairs, which makes them capable of handling all 3 V's of Big Data.
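To make the key-value idea concrete, here is a minimal, illustrative Python sketch (not tied to any particular NoSQL product) of a schema-less key-value store: records live under unique keys and are free to carry different fields, something a fixed relational schema cannot do without altering tables. The store and field names are invented for illustration.

# Minimal sketch of key-value storage: each record lives under a unique key,
# and different records are free to carry different fields (no fixed schema).
store = {}

def put(key, value):
    """Store any dictionary-like document under the given key."""
    store[key] = value

def get(key):
    """Retrieve the document for a key, or None if it is absent."""
    return store.get(key)

# A structured, row-like customer record...
put("customer:1001", {"name": "Asha", "city": "Bangalore", "age": 34})

# ...and an unstructured one holding free text and a list of page visits.
put("customer:1002", {"name": "Ravi",
                      "feedback": "Loved the quick delivery, app crashed once.",
                      "pages_visited": ["/home", "/offers", "/checkout"]})

print(get("customer:1001")["city"])        # Bangalore
print(get("customer:1002")["feedback"])    # free-text field only this record has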

© Jigsaw Academy pg08


CHAPTER 04

Big Data Technologies - Overview of Components
With the growing challenges of Big Data and the limitations of traditional data management solutions, enterprises need to leverage new technologies and data warehousing architectures, which have significant IT implications. These technologies vary in functionality, ranging from storing and processing massive volumes of data to performing various analyses on the data at the lowest level of granularity. For example, by integrating unstructured data such as text fields, social media chatter, and email documents, enterprises can leverage new sources of data which can reveal new insights about their customers.

According to a market forecast report by ABI Research, worldwide IT spending on Big Data technologies exceeded $31 billion in 2013 and is projected to reach $114 billion by 2018. Most existing Big Data technologies fall under the open source paradigm. These are free to use and can be experimented upon by anyone. In the current Big Data technology landscape, there are many open-source tools which can potentially solve any problem, but one should have the right knowledge and niche expertise in order to work efficiently with these technologies.

One of the most popular and widely adopted open source Big Data technologies is Apache Hadoop. It is more formally defined as an open-source software framework that supports distributed processing of vast amounts of data across clusters of computers by using a simple programming model. Apache Hadoop is considered a cost-effective solution which provides capabilities to scale up from single servers to thousands of machines, each offering local computation and storage. In simple terms, it is a cluster of machines interconnected by a network, processing chunks of data at the same time rather than depending on one single machine, which is time consuming and inefficient, especially in the case of Big Data. A Hadoop cluster can be made up of a single machine or thousands of machines, which are commonly termed nodes.



Let us try to understand this concept using a simple example. Say an apartment complex housing 50 families has a single washing machine catering to its laundry needs. Assuming each family washes 10 clothes per day on average and the washing machine takes one hour to clean about 50 clothes, the total time taken by the washing machine to meet the entire apartment's needs per day would be 10 hours. Now the apartment manager is considering increasing capacity to 100 houses, and this would definitely put tremendous stress on the washing machine's daily load management capacity. In order to deal with this situation, the manager should probably consider buying 4 more washing machines. With a total of 5 machines working together, the entire apartment complex's laundry needs can be managed within 4 hours on any given day. The new solution also allows the families to be more flexible with respect to when they use a washing machine. This example briefly captures the essence of implementing distributed processing solutions using a cluster of machines rather than depending on one single machine to meet growing Big Data needs.
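The arithmetic behind this analogy can be reproduced in a few lines; the following is a small Python sketch with the numbers from the example hard-coded, showing how adding machines, much like adding nodes to a cluster, cuts the total processing time.

# Clothes washed per machine per hour, taken from the example above.
CLOTHES_PER_MACHINE_PER_HOUR = 50

def hours_needed(families, clothes_per_family, machines):
    """Total hours to wash one day's laundry with the given number of machines."""
    total_clothes = families * clothes_per_family
    return total_clothes / (machines * CLOTHES_PER_MACHINE_PER_HOUR)

print(hours_needed(families=50, clothes_per_family=10, machines=1))   # 10.0 hours
print(hours_needed(families=100, clothes_per_family=10, machines=5))  # 4.0 hours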

Invented by Doug Cutting and named after his son's toy elephant, the Hadoop Ecosystem comprises multiple projects which together provide the complete data management solution needed by an enterprise. Some of the projects of the Hadoop Ecosystem include HDFS, MapReduce, Hive, HBase, Pig and others. Though the evolution of Hadoop dates back to the early 2000s, its mainstream usage picked up momentum only a couple of years ago. A major advantage is its ability to efficiently manage and process unstructured data. Since about 80% of Big Data consists of unstructured data, implementing Hadoop-based solutions has become a strategic choice for many enterprises.

Let's briefly review some of the key components of Hadoop Ecosystem.

HDFS (Hadoop Distributed File System)

Course note: Apache Hadoop is the most popular IT solution for effectively dealing with Big Data. With the help of the Big Data and Hadoop Overview, Hadoop Data Management and Processing Complex Data using Hadoop modules, you will learn technical aspects of setting up a Hadoop Cluster, its Architecture, HDFS and the MapReduce Framework and other components using hands-on examples.

Two primary components of Apache Hadoop are HDFS, which provides distributed data storage capabilities, and MapReduce, which is a parallel programming framework. To better understand how Hadoop allows scalability, one should understand how HDFS functions. HDFS breaks down the data to be processed into smaller pieces called blocks, and stores them across the various nodes of a Hadoop cluster. This mechanism of HDFS enables one to handle Big Data more efficiently and in a cost-effective way by employing low cost commodity hardware on the nodes of the Hadoop cluster. Unlike relational databases, which depend on defined schemas to store structured data, HDFS puts no restrictions on the type of data and can easily handle unstructured data too. Based on the NoSQL principle, HDFS allows for schema-less storage of data, which makes it more popular when it comes to Big Data management.
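To picture how block-based storage works, the following is a purely conceptual Python sketch, not the real HDFS API, of splitting a file into fixed-size blocks and spreading replicated copies across the nodes of a cluster. The block size, replication factor and node names are illustrative (real HDFS typically defaults to 128 MB blocks and 3 replicas).

# Conceptual illustration of HDFS-style block placement (not the real HDFS API).
BLOCK_SIZE = 4          # bytes, tiny on purpose; HDFS typically uses 128 MB
REPLICATION = 2         # copies of each block; HDFS typically keeps 3
NODES = ["node1", "node2", "node3"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Chop the file content into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block (and its replicas) to nodes in round-robin fashion."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[f"block_{i}"] = {
            "content": block,
            "nodes": [nodes[(i + r) % len(nodes)] for r in range(replication)],
        }
    return placement

file_content = b"I like Hadoop and HDFS"
for name, info in place_blocks(split_into_blocks(file_content)).items():
    print(name, info["content"], "->", info["nodes"])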



MapReduce
MapReduce forms the heart of Hadoop and is a programming model which processes data stored on the nodes of a Hadoop cluster in a parallel and distributed manner. Typically a MapReduce program consists of two components: the Map() and Reduce() procedures. Both of these phases work on key/value pairs. A key/value pair is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. These key/value pairs can be, for example, customer unique identifiers paired with location details, or URLs paired with the number of visits. What goes into the key/value pairs is subjective and depends on the type of problem being solved.

The Map() procedure or job performs operations such as filtering and sorting, which take individual elements of data and break them down into key/value pairs. After execution of the Map() job, Reduce() implements summary functions where the output is produced in an aggregated form. Always remember that the order of any MapReduce program involves the execution of the Map() job followed by the Reduce() job, and the output of the Map() job acts as the input to the Reduce() job.

MapReduce Example
Let’s look at a simple example. Assume you have three text documents, and each file contains a small number of words. Say the first document contains the sentence “I like Hadoop”, and all the documents are stored in HDFS. The end objective is to find the frequency of the words present across all the text documents. For this we need to write Map and Reduce jobs to process the text data and summarize the word distribution.

As the Map job executes, the documents are first sent to the mapper, which counts each unique word in each document: a list of (key/value) pairs is thus created with the word as key and its count as value. For example, the results produced by one mapper task for the first text document would look like this:

(I,1) (Like,1) (Hadoop,1)

The list of (key/value) pairs generated by all the mapper tasks is then processed by the reducer, which aggregates the (key/value) pairs from each mapper and sums the counts from the three mappers, producing a final result set as follows:



(I,1) (Like,1) (Hadoop,3) (Is,2) (Fun,1) (So,1) (Great,1)
This is a simple and straightforward example. Even though a real-world application would be quite complex and often involves processing millions or billions of rows, the key principle behind a MapReduce execution remains the same.
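For readers who want to see the key/value flow end to end, here is a minimal in-memory Python sketch of the word-count logic described above. It mimics the Map, shuffle (grouping by key) and Reduce phases rather than calling any Hadoop API, and it assumes the three example documents are “I like Hadoop”, “Hadoop is fun” and “So Hadoop is great”.

from collections import defaultdict

# The three documents assumed for this example.
documents = ["I like Hadoop", "Hadoop is fun", "So Hadoop is great"]

def map_phase(doc):
    """Map(): emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group all values by key, as Hadoop does between the Map and Reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce(): sum the counts for one word."""
    return key, sum(values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)   # {'I': 1, 'like': 1, 'Hadoop': 3, 'is': 2, 'fun': 1, 'So': 1, 'great': 1}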

Java APIs
In order to deal with Hadoop programming at the MapReduce level, one would need to work with Java APIs. Since the Hadoop framework is developed on the Java platform, MapReduce programming using the Java language is more native by design. Hadoop developers or analysts should have fair knowledge of Java concepts to process queries on data stored in the various cluster nodes. Running MapReduce jobs involves installing an Eclipse environment for Hadoop, writing Map and Reduce job scripts, compiling them into a jar file and then executing these jobs on the data stored in HDFS.

For those who are averse to Java programming and who do not have a developer background, alternatives exist for Hadoop programming in the form of the Pig, Hive and Hadoop Streaming components. Using the Hadoop Streaming component, it is easier to create and run MapReduce jobs with general programming languages such as Ruby, Python, Perl, C++, R etc.

Pig

Pig comes to the rescue of non-technical professionals and makes it more approachable to work with Big Data on Hadoop clusters. It is a highly interactive and script-based environment for executing MapReduce jobs on the data stored in HDFS. It consists of a data flow language, called Pig Latin, which supports writing MapReduce programs with more ease and less code in comparison to using the Java APIs. In many ways, the functionality of Pig is very similar to how SQL operates in relational database management systems. It also supports many user-defined functions, which can be embedded and executed along with a Java program.

Course note: Analyzing Big Data is a key component of any enterprise's IT strategies related to Hadoop. In the Processing Complex Data using Hadoop Module, you will gain a strong command of components such as Hive, Pig and Impala which enable faster querying and aggregation of data from a Hadoop cluster.

Hive

Hive enables the connection between the worlds of Hadoop and SQL. It is
very beneficial for people with strong SQL skills. Hive is a data warehouse
infrastructure built on top of Hadoop that provides data summarization, querying, and analysis capabilities using
an SQL-like language called HiveQL. Similar to Pig, Hive functions like an abstraction on top of MapReduce, and
queries run will be converted to a series of MapReduce jobs at the time of execution. Since the internal architecture
is very similar to that of relational databases, Hive is used to handle structured data and enables easy integration
between Hadoop and other business intelligence tools.



Impala
Impala, similar to Hive, provides an interactive SQL-based query engine for data sitting on Hadoop servers. It is an open-source program for handling and ensuring the availability of large quantities of data. This engine was developed by the Hadoop distribution vendor Cloudera and can currently be accessed under the open source Apache license. As is the case with Hive, Impala supports a widely known SQL-style query language, meaning that users can put their SQL expertise directly to use on Big Data. Based on comparison results published by Cloudera, Impala offers 6 to 69 times faster query times than Hive, making it a first choice among enterprises when it comes to performing Big Data analyses on Hadoop.

Hadoop Streaming

The Hadoop Streaming component is a utility which allows us to write Map and Reduce programs in languages other than Java. It uses UNIX standard streams as the interface between Hadoop and the MapReduce job, and thus any language that supports reading standard input and writing standard output can be used. It supports most programming languages, such as Ruby, Python, Perl, and .NET. So when you come across a MapReduce job written in any of these languages, its execution will surely be handled by the Hadoop Streaming component.

Course note: Hadoop Streaming is an essential utility and quite helpful for programmers who prefer programming with Python or R over Java. In the Performing Analytics on Hadoop Module, you will learn about running R scripts for MapReduce jobs through the Hadoop Streaming utility.

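As a sketch of what this looks like in practice, the word-count example from earlier in this chapter can be written as one small Python script that acts as the mapper or the reducer depending on a command-line argument, reading standard input and writing standard output, which is all Hadoop Streaming requires. The file name, HDFS paths and the location of the streaming jar below are illustrative and vary by setup and distribution.

#!/usr/bin/env python
# wordcount_streaming.py -- illustrative word-count script for Hadoop Streaming.
# Run with the argument "reduce" to act as the reducer; otherwise it acts as the mapper.
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a given word are contiguous.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    reducer() if len(sys.argv) > 1 and sys.argv[1] == "reduce" else mapper()

# Illustrative submission command (the streaming jar path differs by distribution):
# hadoop jar /path/to/hadoop-streaming.jar \
#   -input /user/demo/docs -output /user/demo/wordcount \
#   -mapper "python wordcount_streaming.py map" \
#   -reducer "python wordcount_streaming.py reduce" \
#   -file wordcount_streaming.py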
HBase

HBase is a column-oriented database within the Hadoop Ecosystem and runs on top of HDFS. Hadoop is a batch-oriented system which allows loading data into HDFS, processing it and then retrieving the results. This kind of operating mechanism is not ideal for tasks involving regular reading and writing of data, and HBase fills that gap by providing fast, random read and write access to data stored in HDFS. MapReduce programs can read input data from HBase and write outputs directly back to it. Apart from using the Java API, Hive and Pig can be used to write MapReduce programs to be implemented on data sitting in HBase.

Sqoop and Flume

These components enable connectivity between Hadoop and the rest of the data world. Sqoop is a tool which allows transfer of data between a Hadoop cluster and SQL-based relational databases. Using Sqoop, one can easily import data from external enterprise data warehouses or relational databases, and can efficiently store it in HDFS or Hive databases.

While Sqoop is used to connect with traditional databases, Flume is used to collect and aggregate application data into HDFS. Typically, it is used to collect large amounts of log data from distributed servers. Flume's architecture is based on streaming data flows and can be easily integrated with Hadoop or Hive for analysis of data. Some of the common applications of the Flume component are transporting massive quantities of event data such as web logs, network traffic data, social media data like Twitter feeds, and email messages.

Course note: The Hadoop Data Management Module provides a detailed introduction with hands-on exposure to database components of Hadoop such as HBase, Hive, Sqoop and Flume. You will also be able to develop more in-depth knowledge of how to load and query data using these components.

Zookeeper and Oozie
While Hadoop offers a great model for scaling data processing applications through its distributed file system (HDFS), MapReduce and numerous add-ons including Hive and Pig, all this power and distributed processing requires coordination and smooth workflows, which can be achieved with the Zookeeper and Oozie components.

ZooKeeper is a component of the Hadoop ecosystem which enables highly reliable distributed coordination. Within a Hadoop cluster, Zookeeper looks after the synchronization and configuration of nodes and stores information about how these nodes can access different services relating to MapReduce implementations.

Oozie is an open source workflow scheduler system to manage data processing jobs across a Hadoop cluster. It provides mechanisms to schedule the execution of MapReduce jobs based either on time-based criteria or on data availability. It allows for repetitive execution of multi-stage workflows that can describe a complete end-to-end process, thus reducing the need for custom coding for each stage.



CHAPTER 05

Big Data and Analytics
So far, we have learned about various technological and database architectural components that support Big Data management. The real imperative of Big Data lies in the enterprise's ability to derive actionable insights and to create business value. Building capabilities for analysing Big Data provides unique opportunities for enterprises and also puts them ahead of their competition. Also, these analyses can be performed on more detailed and complete data, as compared to traditional analysis which would be limited only to samples. However, performing analytics on Big Data is a challenging task considering the volumes and complex structures involved. To deal with this, enterprises need to be able to find the right mix of tools, expertise and analytics techniques.

Many early adopters of Big Data such as Google, Yahoo, Amazon and eBay are considered to be pioneers in analysing Big Data. For example, eBay launches successful products and services by employing analytics on demographic and behavioural data from their millions of customers. Data used for analysis can come in various forms - user behaviour data, transactional data, and customer service data. On the other hand, Amazon offers recommendation services on their home page. It leverages Big Data analytics on data relating to customers' buying history and demographics to identify hidden patterns and provide accurate recommendations for potential new purchases.

Course note: The real value of Big Data lies in the insights it can generate. The Processing Complex Data using Hadoop Module provides hands-on techniques and knowledge to analyze Big Data with the help of Hadoop components. In the Performing Analytics on Hadoop Module, you will learn about how analytics tools can be used to run some advanced analyses on data residing on Hadoop.

How eBay leverages Big Data
Online auction giant eBay regularly monitors and analyzes huge amounts of information from their 100 million customers' interactions. They use this data to conduct experiments on their users in order to maximise selling prices and customer satisfaction. On average, they run about 200 experimental tests at the same time, which range from barely noticeable alterations, such as the dimensions of product images, right up to complete overhauls of the way content on users' personal eBay home pages is displayed. Their huge customer base creates 12 TB of data per day, from every button they click to every product they buy, which is continually added to an existing pile of Big Data. As the data is queried by automatic monitoring systems and employees looking to find more meaning from it, data throughput reaches 100 petabytes (102,400 TB) per day.

One of the business problems eBay tackles is achieving the highest possible selling price for all items users place for sale, as profits come from a cut of each sale. Its data scientists perform advanced analytics by looking at all variables in the way items are presented and sold. As one of the solutions to this problem, they began to study the impact of the quality of the picture in a listing on the selling price. They used the Hadoop framework to process petabytes of pictures due to its capability of handling unstructured data. These pictures were then analyzed and re-analyzed, and the data scientists managed to extract some structured information, such as how much the items were sold for and how many people viewed them. Towards the end, they managed to establish a possible relation and concluded that better image quality does indeed lead to a better price.

Analytics Project Framework
Before doing a deep dive into Big Data, the first and most important aspect of any analysis is to identify the business problem. This is a fundamental step even with traditional data analytics projects. Once the business problem is defined, Big Data can be leveraged to search for hidden patterns and get valuable insights. Typically, some of the analytics problems being solved are of the following nature.

• Predicting customer churn behaviour to design reach-out campaigns
• Understanding online and offline marketing impacts on sales
• Identifying whether a transaction is fraudulent or not
• Using customer purchasing patterns to recommend new products
• Forecasting sales for better inventory management
Irrespective of the problem or the vertical, the methodology involved in implementing data analytics projects remains the same. The major difference between Big Data analytics projects and traditional data projects is the scale of the data being handled and the combination of tools required. On the other hand, the business problems, analytics techniques and project methodology remain the same and are independent of the data being handled. As part of any analytics project cycle, the processes typically involved are problem definition, data gathering, selecting the right technique, performing the analysis and visualizing the final results.

Let us get some more perspective on the various stages involved in implementing an analytics project using a used car price prediction example.

Problem Identification

The first question that needs to be asked in any data analytics project is: what is the problem we are trying to solve? In today's Big Data world, enterprises are performing data analytics on various kinds of business problems. It becomes essential to figure out which problem would create the greater business impact and to focus on it to maximize ROI.

In the case of the used car price prediction example, the goal is to determine the value of a used car based on a variety of characteristics such as make, model and engine. Such information would help retailers to better manage the supply and demand flow in a highly price-volatile market. Also, with robust knowledge of price variations by model type, retailers can target buyers with relevant promotions and targeted discounts.

Gathering required Data

After identification of the business problem, data needs to be gathered from various sources that will be useful for the analysis. Based on the problem definition, the data attributes of these datasets can be defined.



For prediction of used car prices, we will require sales data across years which captures information on the type of car sold, the number of years it was in use and the final amount paid by the buyer. Additional data can be captured on the condition and performance of the car related to mileage and internal characteristics such as type, trim, doors and cruise control. These days, with the rapid growth of social media and other data sources, more data can be captured around the brand perception of used cars and insurance claims related to the car, which provides more insight into price variations.

Choosing the Right Analytics Technique

Picking the right technique for any given problem is as critical as finding the right kind of data to begin with. In analytics projects, we often depend on various tools and algorithms to work on various data problems. For example, R is known for its statistical offerings while Python is popular for text data processing. Statistical techniques rely on business context and have specific use cases: clustering algorithms are used for solving customer segmentation problems, time series algorithms are used for forecasting problems, and recommendation algorithms are used to provide insights on more relevant products or services. Before applying any technique, the gathered data needs to undergo a set of data operations, such as data cleansing, data aggregation, data augmentation, data sorting, and data formatting, which are collectively referred to as pre-processing steps. These steps translate the raw data into a fixed data format which is then shared as input to the various algorithms.

Since the problem in the used car example is the prediction of price values, regression techniques can be used. At a broad level, these deal with predictions of continuous variables like price, income, age, sales etc. Many algorithms can implement regression techniques, such as linear regression, random forests, neural networks and support vector machines, which vary in terms of complexity of implementation and scope of business interpretation. At this point, this might sound rather technical, but getting a general idea is what matters here. You will be able to appreciate the underlying concepts more while using these techniques in real projects.

Implementing Analytics Techniques

As discussed in the above section, analytics problems can be solved with the help of a variety of statistical techniques. When it comes to implementing these techniques, there are a lot of options available in terms of analytics tools such as SAS, R and Python. SAS is more popular amongst enterprises because of its ease of use, while R and Python are open source tools which have a lot of takers amongst academia and programmers. On average, almost 80% of the time of any analytics project goes into the problem identification, data gathering and pre-processing steps, while the remaining 20% is used for implementing the chosen techniques and visualizing the final results. In the case of Big Data, the same algorithms can be translated to MapReduce algorithms for running them on Hadoop clusters, which often requires more effort and specialized expertise. In the Hadoop Ecosystem, the Mahout component uses the Java programming language to implement statistical techniques such as classification, recommendation algorithms and others.

Depending on the data volumes gathered for the used car prediction problem, a linear regression technique can be implemented in SAS or R for smaller datasets, while Hadoop integrated with R or SAS can be used for Big Data. Another alternative for Big Data would be to make use of the Mahout component, which requires Java expertise.

Visualizing End Results

Data visualization is used for displaying the output of analytics projects. Typically this is the last step of any analytics project, where visualization techniques are implemented either for validating the technique's outcomes or to present end results to a non-technical management team. This can be done with various data visualization software packages such as Tableau and Spotfire, and also with the in-built capabilities of SAS and R. In comparison with SAS, R offers a variety of packages, namely ggplot2 and lattice, for visualization of datasets.

After building the linear regression model for used car price prediction, visualization techniques are implemented to validate the statistical results and to further check whether these results satisfy the technique's assumptions. Some of the standard validation checks for a linear regression model are tests for heteroskedasticity, autocorrelation, and multicollinearity. Diagnostic plots of the final model results (not reproduced here) showcase how these validation checks can be performed visually for the used car price prediction example.
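As a sketch of such a diagnostic, the following Python snippet (using matplotlib and the same kind of made-up data as the earlier regression sketch) draws a residuals-versus-fitted plot, a common quick visual check for patterns such as heteroskedasticity.

# Minimal sketch: residuals-versus-fitted plot for a toy used car price model.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Toy data standing in for the used car example (age in years, mileage in km).
X = np.array([[1, 12000], [3, 45000], [5, 80000], [2, 25000],
              [7, 110000], [4, 60000], [6, 95000], [8, 130000]], dtype=float)
y = np.array([14500, 11000, 7500, 13000, 5200, 9000, 6400, 4100], dtype=float)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")        # residuals should scatter evenly around zero
plt.xlabel("Fitted price")
plt.ylabel("Residual")
plt.title("Residuals vs fitted values")
plt.show()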

Different kinds of Analytics

By looking at the used car price prediction example, one can understand that irrespective of the domain, the framework for implementing analytics projects remains the same. However, as different enterprises work on solving various business problems, there are different kinds of analytics with domain-specific applications. Some of the common ones are marketing analytics, customer analytics, risk analytics, fraud analytics, human resource analytics and web analytics, which are classified based on different business functions. Marketing analytics in any enterprise revolves around increasing efficiency and maximizing marketing performance through analyses such as marketing mix optimization, marketing mix modeling, price analysis and promotional analysis, to name a few. On the other hand, customer analytics deals with understanding customer behaviours and increasing loyalty using analyses like customer segmentation, attrition analysis and lifetime value analysis.

Another common classification exists which is based on complexity level of analytics techniques being
implemented across any enterprise and is independent of the domain. These kinds are broadly classified under
basic analytics and advanced analytics categories.

Basic Analytics
Basic analytics techniques are generally used to explore your data and include simple visualizations or simple statistics. Some of the common examples are:

• Slicing and dicing refers to breaking down data into smaller sets that are easier to explore. This is mostly employed as a preliminary step to gain more understanding of the data attributes, how different techniques can be used, and how much computational power is required to implement a full-scale analysis.
• Anomaly identification is the process of detecting outliers, such as an event where the actual observation differs from the expected value. This might involve computing summary statistics like mean, median, and range values, and sometimes involves visualization techniques such as scatter plots and box plots to identify outliers through visual means (see the short sketch after this list).
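Below is a minimal Python sketch of the anomaly identification idea, applying the common interquartile-range (IQR) rule to a made-up list of daily transaction amounts.

# Flag outliers with the interquartile-range (IQR) rule: values far outside
# the middle 50% of the data are treated as anomalies.
import statistics

amounts = [120, 135, 128, 140, 131, 126, 133, 2250, 129, 138]  # one suspicious spike

q1, _, q3 = statistics.quantiles(amounts, n=4)   # 25th, 50th, 75th percentiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in amounts if x < lower or x > upper]
print(outliers)   # [2250]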

Advanced Analytics
Advanced analytics involves the application of statistical algorithms for complex analysis of either structured or unstructured data. Among its many use cases, these techniques can be deployed for finding patterns in data, prediction, forecasting, and complex event processing. With the growth of Big Data and enterprises' need to stay ahead of the competition, advanced analytics implementations have become mainstream as an integral part of their decision making process. Some of the examples of advanced analytics are:

• Text Analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information on which statistical techniques can be applied. Since much of Big Data comprises unstructured data, text analytics has become one of the mainstream applications amongst Big Data analytics.
• Predictive Modeling consists of statistical or data-mining solutions including algorithms and techniques to determine future outcomes. A predictive model is made up of a number of predictors, which are variable factors that are likely to influence future behavior or results. In marketing, for example, a customer's gender, age, and purchase history might predict the likelihood of a future sale. Some of the other common applications include churn prediction, fraud detection, customer segmentation, marketing spend optimization and many more.



CHAPTER 06

Unstructured Data and Text Analytics

Unstructured data usually takes up lots of storage capacity and is more difficult to analyze when compared with structured data, which is relatively easy to handle and process. It is basically information which is text heavy, in most cases does not have a predefined data model, and does not fit well into traditional database management systems. At an enterprise level, only 20% of the Big Data being handled is structured and the remaining 80% is unstructured. Most of the unstructured data these days is machine generated from various sources such as satellite images, video surveillance, scientific sensors, weather monitoring, social media, mobile and other web content. Data coming from these sources is in the form of text, images, videos, web logs, and other machine formats like sensor output.

A few key facts related to unstructured data:

• Most new data is unstructured, representing almost 95 percent of all data generated
• Unstructured data tends to grow exponentially, and is estimated to be doubling every year
• Unstructured data is vastly underutilized due to the limitations of traditional IT technologies

With the evolution of Big Data technologies, enterprises can effectively process unstructured data and derive business value out of it. Most firms currently implement NoSQL-based technologies, mainly Hadoop, whose capabilities extend beyond those of traditional databases. Regardless of the native formats, Hadoop can store different types of data from multiple databases with no prior need for a schema. Within the Hadoop Ecosystem, HDFS is used for storage, which handles non-predefined data models, and the MapReduce framework is used for quick processing of large volumes of unstructured data. Later, using the data sitting in Hadoop, enterprises can tap into traditionally unexplored information and can start making more decisions based on hard data.



We have seen in an example in an earlier chapter how eBay (a giant online marketplace) tries to achieve the highest selling price for items by understanding the impact of the quality of the picture shared in the listing. To find a possible solution, data science teams at eBay performed extensive image analysis and successfully found a relationship between listing views and items sold. This is a classic real-world example of unstructured data processing with the help of Hadoop. Generally, in order to create value out of unstructured data, some of the most common analytics methodologies are text analytics and image and audio analysis. Out of these, text analytics has been adopted as a mainstream activity across many enterprises with the increased usage of Hadoop and other Big Data technologies.

Course note: The majority of enterprise Hadoop applications are implemented to deal with unstructured data. In the Processing Complex Data using Hadoop and Performing Analytics on Hadoop Modules, you will learn more about how to handle and analyze text data with the help of real-world examples leveraging Twitter and Email data.

Text analytics is commonly referred to as the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways. The analysis and extraction processes used in text analytics take advantage of techniques that originated in computational linguistics, statistics, and other computer science disciplines. In a Big Data scenario, the applications of text analytics are widespread, spanning social media analysis, brand perception and sentiment analysis, and even areas like churn and fraud prediction. Increasingly, enterprises across all verticals are looking for ways to combine both structured and unstructured data to get a complete view of their customers' perceptions of various product and service offerings.

In the context of Big Data analytics, text analytics implementations can be done with the help of Hadoop components such as Pig, Hive, and MapReduce programming using Java, Python and other languages. These components are equipped with in-built custom functions which are suited to the processing of unstructured data formats like text, images and videos. The key to successfully handling unstructured data is to bring structure to the native format and then apply analytics or statistical techniques on top of it. Apart from Hadoop solutions, other commercial text analytics tools are offered by vendors like Attensity, Clarabridge, IBM and SAS in the Big Data space.
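As a tiny illustration of “bringing structure to text”, the Python sketch below turns a handful of made-up customer comments into a word-frequency table and a naive keyword-based sentiment label per comment. Real text analytics pipelines use far richer linguistic processing; the comments and keyword lists here are purely illustrative.

# Turn raw, unstructured comments into a small structured summary:
# word frequencies plus a naive keyword-based sentiment label per comment.
from collections import Counter
import re

comments = [
    "Great product, fast delivery and great support!",
    "Terrible experience, the app keeps crashing.",
    "Delivery was slow but support was helpful.",
]

POSITIVE = {"great", "fast", "helpful", "good"}      # illustrative keyword lists
NEGATIVE = {"terrible", "slow", "crashing", "bad"}

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

word_counts = Counter(word for c in comments for word in tokenize(c))

structured = []
for c in comments:
    words = set(tokenize(c))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    structured.append({"comment": c, "sentiment": label})

print(word_counts.most_common(5))
for row in structured:
    print(row["sentiment"], "->", row["comment"])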



CHAPTER 07

Big Data in Action

Enterprises are spending millions of dollars on scaling up their existing IT infrastructure with Big Data technologies to meet the end goal of achieving more business value out of data and staying ahead of their competition. Unlike traditional data warehousing and BI opportunities, Big Data and analytics opportunities are more business-hypothesis driven and often revolve around exploratory activities. This scenario is consistent across all verticals, since Big Data is being generated from every function of a business, ranging from manufacturing to sales. The key to success in dealing with Big Data lies in an enterprise's ability to define relevant business problems, combine structured and unstructured data sources, and identify hidden trends.

The business problem being handled varies by task, as some are computationally intensive while others are more data-analysis intensive. Understanding the nature of the problem is essential for picking the correct approach. In order to exploit Big Data analytics, enterprises often develop a compelling business use case clearly outlining what business outcomes are to be achieved. One such use case in the IT domain is the implementation of the Aadhar Big Data project by the government of India. Aadhar is a 12-digit unique number issued to all residents of India with the goal of creating the world's largest biometric database. The objective of this project is to deliver more efficient public services and facilitate direct cash transfers to the people.

Course note: Applications and use cases of Big Data are many, spanning across business domains. The case studies taught in the course help you appreciate both the IT and business aspects of Big Data. They cover domains such as Finance, E-Commerce, Airline and Social Media, which provide hands-on exposure in terms of processing Big Data and then analyzing it to solve various business problems.

Likewise, many organizations are basing their business cases on the following benefits that can be derived from Big Data and analytics:

• Smarter decisions: enable decision making beyond traditional practices
• Faster decisions: reduce the dependency on bureaucracy within an organization
• Impactful decisions: focus on value-generating efforts and capabilities



In pursuit of these benefits, many business verticals such as Telecom, Banking, Insurance, Healthcare, Retail, IT and Manufacturing are riding the wave of Big Data analytics. We will now review how some of these industries are leveraging Big Data to solve their business problems.

Retail & E-Commerce


Retail is one of the high-potential areas for Big Data. A survey conducted by the research firm IDC revealed that retailers are increasingly looking at Big Data and analytics to derive business benefits. Companies can bring together both online and offline data along with transaction information to better understand the factors that drive shoppers' behavioural traits. Beyond purchase data, retailers are looking at a whole array of new data sources – web browsing data, social data and geo-tracking data – which further helps in thorough segmentation of customers. Combining this new information with traditional data, they have started doing high-end analytics like market-basket analytics, seasonal sales analytics, inventory optimization analytics, and pricing optimization analytics.

In the case of e-mail targeting, the traditional approach has been to scan through the entire customer base, develop a list of customers and then send out mass mailers to all of them. The Big Data advantage, however, is personalization: by understanding a consumer's browsing history, retailers can share specific messages related to the items searched for and then offer that shopper a targeted promotion. Also, with the help of location data from mobile devices, if a particular customer is present in a store they can be offered specific coupons to motivate them into making a purchase.

Telecom
Similar to other sectors, communications service providers all over the globe are witnessing significant data growth
due to increased adoption of smart phones, rise of social media and growth of mobile networks. Many of these
firms are tackling Big Data challenges so as to gain more market share and increase profit margins. Big Data can
help service providers achieve some of the key business objectives – provide better customer service with the help
of internal and external data, implement innovative product services using segmentation techniques, and develop
strategies to generate new sources of revenue. Over the last few years telecom operators have moved away from a
traditional model of data warehousing towards a centralized data repository model with integrated reporting
solutions. With exponential growth of Big Data in this sector, these operators are now looking towards new
technologies as a cost-effective solution to process the growing volumes of data.

One of the common applications most telecom operators implement is around integrating network performance
data with subscriber usage patterns. This is to understand what is happening in the complex intersection of
network and services (voice, data, and content). It generally helps them to detect network performance issues in
real time and provide quality customer services to maximize their customer satisfaction.

Financial Services
Historically, banking and financial management firms have been rife with transaction data, with hundreds of millions of records generated on a daily basis. With digitalisation, a variety of data sources – social media, information portals and customized web applications – are adding more information to the industry's existing ocean of data. Implementing Big Data solutions enables these enterprises to collect and organize a host of additional data, such as customer preferences, behaviors, interaction history, events and location-specific details, in a cost-effective manner. Using this wealth of information, many financial services firms run sophisticated analytics to determine the best set of actions to recommend to a customer, such as a targeted promotion, a cross-sell/up-sell offer, a discounted product or fee, or a retention offer. In addition, Big Data technologies add value through real-time insight generation and help in faster decision making.

One of the major developments is to integrate external data sources such as social media with the internal IT
infrastructure, which provides a broader view on customers, products and services at enterprise level. Customer
segmentation is a key tool for sales, promotion, and marketing campaigns across financial services firms. Using the
available data, generally customers are grouped into different segments based on their expectations, needs and
other characteristics. The advantages from such an implementation are multi-fold for enterprises in terms of
increasing loyalty with customers, selling more products and services and also cutting costs by better management
of resources.

HealthCare
Big Data has many implications for patients, providers, researchers, payers, and other health-care constituents. Today's patients are demanding more information about their health-care options so that they understand their choices and can participate in decisions about their care. In a health-care scenario, the key data sources have traditionally been patient demographics and medical history, diagnostics, clinical trials data and drug effectiveness indices. If these traditional data sources are combined with data provided by patients themselves through social media channels and telematics, they can become a valuable source of information for researchers and others who are constantly in search of options to reduce costs, boost outcomes, and improve treatment.

One of the major applications of Big Data has been in the area of DNA analysis. With the latest tools and technologies, one can analyze an entire individual human DNA sequence and compare it against those of other individuals and groups in much shorter timeframes. The relatively low current cost of individual DNA analysis (thousands of dollars) has made this tool accessible to a substantial number of people, compared to the cost of millions of dollars a few years ago when the first full human genome was analyzed.

Smart Cities
Growth of Big Data and digitalization has resulted in the availability of a wide range of information about cities,
their physical infrastructure, services, and interactions between people. Smarter Cities are leveraging this Big Data
to improve infrastructure, planning and management, and human services as a system of systems – with the goal
of making cities more desirable, liveable, sustainable, and green. Some of the applications include mass transit,
utilities, environment, emergency response, big event planning, public safety, and social programs.

IBM has been a pioneer in providing Big Data solutions in this area under its flagship Smarter Planet program. In one such project, IBM used Big Data to improve traffic flow in Dublin, Ireland. By utilizing GPS information from buses, IBM has been able to measure arrival and departure times more accurately and pass that data on to travellers via the transportation system's notification boards. This information enables people to make better use of their time, which further increases confidence in the public transport system.



CHAPTER 08

BigDataLandscape
HadoopDistributionOfferings

Although Hadoop and its projects are completely open source, a large number of companies have developed their own Hadoop distributions which are more ready to use. These distributions are packaged and guaranteed to include the HDFS and MapReduce components, along with all the other supporting tools. Several distributions are available, such as those provided by EMC and Intel, as well as those from hardware vendors like IBM, which are typically all-in-one solutions that include hardware. But the three biggest and most prevalent Hadoop distributions today are Cloudera, MapR and Hortonworks. If you are looking for a quick plug-and-play option, each of these vendors offers VM images with Linux and Hadoop already installed.

Apache Hadoop, the original release of Hadoop, comes from apache.org and is backed by the community support of the Apache Software Foundation. Many of the original Hadoop releases are made by this group, the latest being Hadoop 2.0. Other companies or organizations that release products containing modified or extended versions of Apache Hadoop are generally termed Hadoop Distributions. One important point to note is that these Hadoop distributions are continuously upgraded to keep up with the latest Apache Hadoop releases from the Apache Software Foundation.
Several companies include Apache Hadoop and provide additional capabilities in terms of commercial support and other utilities related to Hadoop. Cloudera's Hadoop Distribution, CDH4, includes HDFS, YARN, HBase, MapReduce, Hive, Pig, Zookeeper, Oozie, Mahout, Hue, and other open source tools (including the real-time query engine Impala). Cloudera Manager Free Edition includes all of CDH, plus a basic Manager supporting up to 50 cluster nodes. Cloudera Enterprise combines CDH with a more sophisticated Manager supporting an unlimited number of cluster nodes, proactive monitoring, and additional data analysis tools.

There are many Hadoop distributions available, and Cloudera CDH4 is one of the widely used distributions at enterprise level. In the Big Data and Hadoop Overview Module, you will learn about installation and working with the CDH4 Hadoop Distribution. The Cloudera CDH4 installation includes Apache Hadoop along with other components such as Pig, Hive and Impala for Big Data processing.

Hortonworks Hadoop Distribution, HDP version 2.0 includes HDFS, YARN, HBase, MapReduce, Hive, Pig,
HCatalog, Zookeeper, Oozie, Mahout, Hue, Ambari, Tez, and a real-time version of Hive (Stinger) and other
open source tools. It also provides high-availability support, a high-performance Hive ODBC driver, and Talend
Open Studio for Big Data.

MapR Hadoop Distribution, M7 version includes HDFS, HBase, MapReduce, Hive, Mahout, Oozie, Pig,
ZooKeeper, Hue, and other open source tools. It also includes direct NFS access, snapshots, and mirroring for
“high availability,” a proprietary HBase implementation that is fully compatible with Apache APIs, and a MapR
management console.

IBM Infosphere BigInsights is available in two editions. The Basic Edition includes HDFS, HBase, MapReduce,
Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and several other open source tools, as well as a basic version of
the IBM installer and data access tools. The Enterprise Edition adds sophisticated job management tools, a
data access layer that integrates with major data sources, and BigSheets (a spreadsheet-like interface for
manipulating data in the cluster).

Intel Distribution for Apache Hadoop is a product based on Apache Hadoop, containing optimizations for Intel's latest CPUs and chipsets. It includes the Intel Manager for Apache Hadoop for managing a cluster.

Amazon Elastic MapReduce is a cloud service that enables users to easily process vast amounts of data at a cheaper cost. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). It includes HDFS (with S3 support), HBase (proprietary backup recovery), MapReduce, Hive (added support for Dynamo), Pig, and Zookeeper.

Handling Big Data on the cloud is one of the growing practices among enterprises due to low cost and better processing capabilities. You will be provided an AWS instance with a Hadoop installation to work on the assignments and other case study problems as part of the virtual lab offering.

Windows Azure HDInsight is a Hadoop solution for the Azure cloud. It is integrated with the Microsoft
management console for easy deployment and integration with System Center. It can be integrated with Excel
through a Hive Excel plug-in. Further, it also offers connectivity services with Microsoft SQL Server Analysis
Services (SSAS), PowerPivot, and Power View through the Hive Open Database Connectivity (ODBC) driver.

For Big Data analysis, apart from knowing about the analytics project cycle and the kinds of analysis that can be done, enterprises should also leverage the right kind of analytics tools to deal with Big Data efficiently. Broadly, Big Data analysis tools can be classified around statistical technique offerings and business intelligence integration capabilities. Although Hadoop components can be used to achieve each of these independently, Hadoop is not a specialized analytics tool and is popularly used only for its distributed framework. Let's explore some of the tools which offer extensive visualizations, drag-and-drop options, and easy-to-install scripts.

AnalyticsImplementations
With the explosion of Big Data, there has been a quick growth of tools providing statistical capabilities at a
larger scale. Since statistics is critical for identifying and quantifying relationships between various attributes
in the data, it is one of the key components of many analytics tools catering to Big Data. Some of the more
interesting tools include:

R is an open source programming and statistical language that is rapidly gaining popularity in the Big Data space. It has been widely used among universities and startup companies alike for many years, but much of the recent interest can be attributed to its open source nature and its flexibility of integration with open source Big Data technologies such as Hadoop. In terms of statistical capabilities, R is very versatile and has more than 4,000 packages which can deal with any problem related to Big Data analysis. Also, with the introduction of the RHadoop packages by Revolution Analytics, anyone can now easily work with a Hadoop cluster, interact with data in HDFS, and run MapReduce jobs written in R syntax.

R is the most popular open source analytics tool. In the Performing Analytics on Hadoop Module, you will learn about R syntax, handling data and running statistical tests with R, and also about techniques to integrate R with Hadoop for implementing MapReduce programs from the R console.
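To make this concrete, here is a minimal sketch of a MapReduce job expressed entirely in R with the rmr2 package from the RHadoop project. It is only an illustration under stated assumptions: rmr2 is installed, the HADOOP_CMD and HADOOP_STREAMING environment variables point at your Hadoop installation when a real cluster is used, and the input (a plain numeric vector) simply stands in for real data.

# Minimal rmr2 (RHadoop) sketch: count how many numbers fall into each
# bucket (value modulo 10) with a MapReduce job written in R.
# The "local" backend lets you test without a cluster; switch to "hadoop"
# once HADOOP_CMD and HADOOP_STREAMING are configured.
library(rmr2)

rmr.options(backend = "local")

ints <- to.dfs(1:1000)                                  # write sample data into (H)DFS

job <- mapreduce(
  input  = ints,
  map    = function(k, v) keyval(v %% 10, 1),           # emit (bucket, 1) pairs
  reduce = function(k, counts) keyval(k, sum(counts))   # sum the counts per bucket
)

print(from.dfs(job))                                    # pull (bucket, count) pairs back into R

Only the map and reduce functions change for more realistic jobs; rmr2 takes care of serialization and job submission behind the scenes.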

SAS has been a pioneer in business analytics software and services over the last decade, and is also the largest independent vendor in the business intelligence market. In order to deal with the data deluge, SAS has recently upgraded its services with Big Data handling capabilities. These help users perform data manipulation and exploratory analysis on Hadoop. Unlike working directly with Hadoop, which requires specialized expertise, enterprises can leverage their existing SAS skills to work easily with Big Data. SAS also offers text analytics capabilities as part of its overall analytics platform, where text data is viewed as simply another source of data.

Apache Mahout, a statistical component of the Hadoop Ecosystem, provides scalable machine learning algorithms on top of the Hadoop platform. Mahout provides algorithms for clustering, classification, and collaborative filtering implemented on top of Apache Hadoop using MapReduce. However, working with Mahout and implementing MapReduce jobs through it requires Java programming expertise.

MADlib, one of the latest developments, is an open-source library that supports in-database analytics. It
provides data-parallel implementations of mathematical, statistical, and machine-learning methods that
support structured, semistructured, and unstructured data.

TextAnalyticsImplementations
Though text analytics can be grouped under Big Data analytics, it is always a good idea to deal with it separately because its applications are specific to unstructured data. Here is an overview of some of the players in the text analytics segment of the Big Data market.

Attensity is one of the original text analytics companies, and began developing and selling products more than ten years ago. It offers several engines for text analytics around Auto-Classification, Entity Extraction, and Exhaustive Extraction. Attensity's text analytics tools use the Hadoop framework to store data and are focused on social and multichannel analytics, analyzing text from both internal and external sources for reporting.

Clarabridge is another pure-play text analytics vendor which deals extensively with unstructured data processing. It offers its solution as Software as a Service (SaaS).

Software giant IBM offers IBM Content Analytics solutions in the text analytics space. This tool transforms content into analyzed information, which is then made available for detailed analysis, similar to the way structured data would be analyzed in a BI toolset.

BusinessIntelligenceIntegration
Generally, enterprise-level business intelligence needs cater to regular reporting, generating dashboards and creating visualizations. The Hive component of Hadoop provides traditional database features and business intelligence integration capabilities to meet an enterprise's reporting and analysis needs on structured data. Though it uses a SQL-like query language for performing Big Data analysis, additional ready-to-use features such as drag-and-drop and automated reporting are not supported. Many alternative tools exist which provide advanced business intelligence reporting on Big Data sitting in a Hadoop cluster. These BI tools provide a rich, user-friendly environment to slice and dice data. We will review some of the widely used ones below, after a brief look at how Hive's SQL-like interface is typically reached from an analytics environment.
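As a sketch of that interface, the snippet below queries Hive from R over JDBC and runs the kind of SQL-like aggregation a BI tool would generate. It is a hedged example built on several assumptions rather than a recipe for any particular distribution: the RJDBC package is installed, a HiveServer2 instance is listening on the default port 10000, the Hive JDBC jars live under /usr/lib/hive/lib (locations vary by distribution), and the transactions table is purely hypothetical; older clusters running HiveServer1 would need a different driver class and connection URL.

# Hedged sketch: a SQL-like aggregation over data in Hadoop via Hive's JDBC
# interface. Jar paths, driver class, connection details and the
# 'transactions' table are illustrative assumptions.
library(RJDBC)

hive_jars <- list.files("/usr/lib/hive/lib", pattern = "\\.jar$", full.names = TRUE)

drv  <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",   # HiveServer2 driver class
             classPath   = hive_jars)
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default", "hive", "")

# Aggregate the hypothetical transactions table, much as a BI tool would
daily <- dbGetQuery(conn, "
  SELECT txn_date, COUNT(*) AS txns, SUM(amount) AS total_amount
  FROM   transactions
  GROUP  BY txn_date")

head(daily)
dbDisconnect(conn)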
Tableau has been gaining popularity across enterprises as the go-to BI tool when it comes to analysing and visualizing Big Data. It offers direct connections for many high-performance databases, cubes, Hadoop, and cloud data sources such as Salesforce.com and Google Analytics. Tableau has a fast, in-memory analytical engine which can work directly with Big Data to create reports and dashboards. It also provides features for publishing web dashboards on a server and enables easy sharing across the enterprise. With the availability of more than 30 database plugins ranging from Big Data databases to traditional SQL databases, Tableau is attaining the status of a must-have BI tool across many enterprises and industry verticals.

Tableau is one of the leading data visualization tools at the enterprise level. In the Performing Analytics on Hadoop Module, you will learn about working with Tableau and will be able to build web dashboards and complex visualizations using the data residing in a Hadoop cluster.

Datameer Analytics Solution (DAS) is a business integration platform for Hadoop and provides comprehensive capabilities to analyze both structured and unstructured data. Its major specialization is enabling analysis of large volumes of data stored in a Hadoop cluster. It has a spreadsheet interface with over 200 analytic functions and visualizations including reports, charts and dashboards. DAS provides support for all major Hadoop distributions including Apache, Cloudera, EMC Greenplum HD, IBM BigInsights, MapR, Yahoo! and Amazon.

Pivotal, an EMC spin-off, offers big data storage and analytics capabilities. Pivotal Big Data solutions offer a wide set of enterprise data products: MPP and column-store databases, in-memory data processing, and Hadoop. Pivotal also provides in-database integrations with SAS analytics and is one of the fast-growing BI vendors in the Big Data analytics space.

Pentaho Big Data Solutions supports the entire Big Data analytics process ranging from discovering and preparing
data sources, to integration, visualization, analysis and prediction. For IT and developers, Pentaho provides a
complete, visual design environment to simplify and accelerate data preparation and modeling. For business users,
Pentaho provides visualization and exploration of data. And for data analysts and scientists, Pentaho provides full
data discovery, exploration and predictive analytics.

Another commercial solution offered by IBM combines InfoSphere BigInsights and Cognos software. This gives organizations a powerful solution to translate large amounts of data into valuable, actionable insights. InfoSphere BigInsights software provides Big Data processing capabilities, whereas Cognos software offers enterprise-level BI capabilities.



CHAPTER 09

BigDataCareerRoles
As the field of Big Data is booming, many enterprises are actively looking out for the
right talent with relevant IT expertise and deep analytical skills. According to
information technology research and advisory firm Gartner, Big Data will create
more than 4.4 million jobs by 2015, opening up plenty of opportunities for analysts,
computer scientists, mathematicians and other data-savvy job seekers. In spite of
this explosion in business demand, enterprises are currently short of experts who
can work with new tools and technologies, and make sense of unstructured data
flowing from mobile devices, sensors, social media and other sources.

Earlier, Big Data skills were most popular in the defence and technology sectors. As Big Data technologies became cheaper and more easily accessible, more and more sectors joined the movement and the competition for Big Data talent became fiercer. Currently, e-commerce companies and social media services are leading the demand. Other sectors on the lookout for Big Data skills include food manufacturers, retailers, consulting companies, gaming, online travel, consumer finance, telecommunications and insurance, according to a report published by the career site Dice.com. Big Data talent should combine several skills – good statistical and analytical knowledge, an understanding of how Big Data can be used to make better business decisions, and computer programming expertise for Big Data analysis.

Amongst Big Data Analytics jobs, the Data Scientist role is a top requirement for many firms. Jigsaw Academy has an extensive industry network to facilitate placements for its students as and when suitable positions are available. All participants get one-on-one support for resume and interview preparation.

Let us review some of the common roles across Big Data talent.

ITFocused
Hadoop Developer
Amongst the existing Big Data technologies, Hadoop is the preferred choice of many enterprises because of its flexibility and low cost. The major responsibilities of a Hadoop Developer include the design, development and implementation of data processing applications using Apache Hadoop in a real-time project setup. These roles require a thorough understanding of the Hadoop framework – HDFS, MapReduce and other components – with a major emphasis on IT applications. Hadoop Developers should also have a strong working knowledge of the Java programming language and preferably exposure to other languages such as C, C++, Python and PHP. Typically, this role is most sought after by software professionals, ETL developers, and data engineers who have a solid foundation in Hadoop architecture.

Hadoop Administrator
The Hadoop Administrator role is similar to that of a traditional DBA (Database Administrator), but in the context of configuring, deploying and maintaining a Hadoop cluster. As enterprise spending on Hadoop technologies rises, so does the need for specialists who can work with such a framework for storage and large-scale processing of data sets. As a Hadoop administrator, one should have a very good understanding of Apache Hadoop and its Ecosystem, as well as expertise in UNIX and the cloud frameworks commonly used for setting up Hadoop clusters. This role is preferred by traditional data warehousing specialists and database administrators who are willing to scale up their expertise for Big Data.

AnalyticsFocused
Big Data Analyst
The Big Data talent needed by many enterprises includes the ability to access, manipulate, and analyze massive data sets in a Hadoop cluster using SQL and familiar scripting languages. The major focus of this role is on analysing large volumes of data using analytics tools like SAS and R integrated with the Hadoop cluster. Exposure to business intelligence tools like Tableau is also required, as they are widely used across enterprises for the regular generation of interactive reports and dashboards on various business needs. Knowledge of tools like Impala, Hive, and Pig is also required, especially to implement real-time analytics functions. Generally, professionals working as data analysts, BI specialists, and business analysts who work extensively with data and are willing to take up Big Data challenges are ideal for this Big Data Analyst role.

Data Scientist
Harvard Business Review has termed data scientist the sexiest job of the 21st century. It is definitely one of the roles in highest demand, mainly for skills around statistical knowledge, computing expertise and the ability to work with Big Data technologies. One common misconception is that the data scientist role is always tied to Big Data projects; it has more to do with the increased breadth and depth of data being examined compared to traditional roles. The demand for data scientists is increasing tremendously as more and more industry verticals rely on data-based decision making to become more profitable. Apart from skills around the Hadoop framework, statistical tools such as R and Python, and analytics techniques, a data scientist should also possess good domain expertise and the communication skills to present technical findings to the management team as final recommendations. Traditional roles such as statistician, predictive analyst, business analyst and business intelligence analyst, coupled with Big Data skills, are the most natural paths to a data scientist career.



CHAPTER 10

BigDataintheFuture
As our dependence on mobile and online technologies increases over the years, so will the growth of Big Data. It is essential for individuals and enterprises of all sizes, types, and industries to embrace this data deluge. So far, Big Data technologies and tools such as Hadoop have helped us deal with data challenges around volume, variety and velocity.

Since Apache Hadoop technology is rapidly evolving, it is definitely a big challenge to keep up with the latest developments. Our Industry Experts will update the content in line with the latest developments and provide additional videos and references so that you won't miss a thing.

With Big Data applications still at an early stage but progressing rapidly year on year, we will surely witness many more technological solutions packaged into distributions, appliances and on-demand cloud services for implementing analytics solutions. The existing cohort of solutions includes Hadoop connectors for analytical tools such as Tableau and R, but these can be improved further with better interactivity and support for abstract programming in new tools. Abstract programming is about running a data processing task with less code than a regular programming language like Java requires, and is aimed at helping non-programmers handle Big Data easily without getting into too many technical nuances.

In terms of enterprise-level Big Data spending, most of the share currently goes to IT implementations that set up infrastructure and meet data processing needs. Though business intelligence components are being deployed, the full-scale benefits of Big Data analytics have not yet been achieved. One major trend that can be expected in the future is the increased impact of data science teams on business decision making. As they become integrated more closely with an enterprise's business operations, we will see more repeatable processes at a daily level, starting with raw data and finishing with data products in the form of reports, dashboards or other applications.

Another future Big Data trend will be around visualization, which typically satisfies the exploration and explanation needs in a data workflow. While many know visualization output as the end result of a data science project, it can also be used as an exploration tool to identify anomalies and patterns in the data. Traditionally, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills. Now, with the increasing need amongst managers and business analysts to make sense of Big Data, it will become essential for enterprises to build analytics capabilities around visualization tools to support non-data-savvy professionals.



CHAPTER 11

LearnmoreaboutBigData
So far this guide has provided a comprehensive introduction about various
topics around the field of Big Data and analytics. Like increasing volumes of Big
Data, the demand for skilled individuals in these areas and the salaries offered
are growing quickly. Fortunately, you can start building the required expertise by
exploring free resources available online and reading some best-selling books.
These resources differ from each other in the degree to which they emphasize technical or business application aspects. Irrespective of whether you purchase a new book or refer to an online resource, get ready to learn more about the fascinating world of Big Data, which is believed to be transforming the way businesses are run and helping them achieve competitive success.

BooksonBigData

• “Big Data: A Revolution That Will Transform How We Live, Work, and Think” by Viktor Mayer-Schonberger and Kenneth Neil Cukier
• “Big Data at Work: Dispelling the Myths, Uncovering the Opportunities” by Thomas H. Davenport
• “Taming the Big Data Tidal Wave” by Bill Franks

BooksonHadoop
• “Hadoop: The Definitive Guide”, 3rd Edition, by Tom White
• “Hadoop in Practice” by Alex Holmes
• “Hadoop in Action” by Chuck Lam



BooksonAnalytics

• “Keeping Up with the Quants: Your Guide to Understanding and Using Analytics” by Thomas H. Davenport and Jinho Kim
• “Competing on Analytics: The New Science of Winning” by Thomas H. Davenport and Jeanne G. Harris
• “Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die” by Eric Siegel
• “Super Crunchers: Why Thinking-By-Numbers is the New Way To Be Smart” by Ian Ayres
• “Data Science for Business: What you need to know about data mining and data-analytic thinking” by Foster Provost and Tom Fawcett

PopularBlogsonBigDataandHadoop
• Smarter Computing Blog - Maintained by IBM, with articles around Big Data and cloud computing
• Planet Big Data - An aggregator of worldwide blogs about Big Data, Hadoop, and related topics
• Big Data | Forrester Blogs - An aggregation of blogs and articles from enterprise experts focusing on Big Data topics
• Hadoop Wizard - A website dedicated to helping people learn how to use Hadoop for “Big Data” analytics
• Yahoo! Hadoop Tutorial - Free materials that cover in detail how to use the Hadoop distributed data processing environment
• Hadoop 360 - An exclusive Hadoop site maintained by the Data Science Central community
• Cloudera Developer Blog - Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
• The Hortonworks Blog - A collation of articles around Hadoop covering the latest releases, trends and updates from the expert team at Hortonworks

VideoResourcesforHadoop
• Big Data University - An IBM initiative which offers free online courses taught by leading experts in the field
• MapR Academy - Provides a few free training resources to help individuals and teams learn and use Hadoop
• Hadoop Screencast - A collection of good quality screencasts on installing and working with Apache Hadoop and the various components of the Apache Hadoop Ecosystem
• Hadoop Essentials - A six-part recorded webinar series offered for free by Cloudera, covering the introduction to and motivation for Hadoop
• Hortonworks Sandbox - A self-contained virtual machine provided by Hortonworks with hands-on video tutorials and a pre-installed single-node Hadoop cluster

References
1. https://www.abiresearch.com/press/big-data-spending-to-reach-114-billion-in-2018-loo
2. http://www.v3.co.uk/v3-uk/news/2302017/ebay-using-big-data-analytics-to-drive-up-price-listings

Did you enjoy this book? We would love to hear your feedback and suggestions. Do write to us at info@jigsawacademy.com

