
BIG DATA: AN INTRODUCTION
BAGUS JATI SANTOSO
MARCH, 7TH 2018
WHAT IS BIG DATA?

• Big data is high-volume, high-velocity and/or high-variety information assets that
demand cost-effective, innovative forms of information processing that enable
enhanced insight, decision making, and process automation.
• Big data is also a blanket term for the non-traditional strategies and
technologies needed to collect, organize, process, and gather insights from large
datasets.
• Today, working with data often exceeds the computing power or storage of a
single computer.
WHAT IS BIG DATA

• In general, big data refers to:
• Large datasets
• The category of computing strategies and technologies used to handle large
datasets

• A large dataset is one too large to reasonably process or store with traditional
tooling or on a single computer.
DATA NOWADAYS

• The volume at which new data is being generated is staggering
• We live in an age when the amount of data generated in the world is measured in
exabytes (1 billion GB) and zettabytes (1 thousand billion GB)
• By 2025, the data on the Internet is forecast to exceed the brain capacity
of everyone living on the planet
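
The unit sizes quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, where 1 GB is 10^9 bytes; the helper name `in_gigabytes` is illustrative only.

```python
# Decimal (SI) storage units expressed in bytes (assumption: 1 GB = 10**9 bytes).
GB = 10**9
EB = 10**18   # exabyte
ZB = 10**21   # zettabyte


def in_gigabytes(n_bytes: int) -> int:
    """Convert a byte count to whole gigabytes."""
    return n_bytes // GB


print(in_gigabytes(EB))  # 1000000000 (1 billion GB)
print(in_gigabytes(ZB))  # 1000000000000 (1 thousand billion GB)
```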
DATA NOWADAYS

• The velocity of data generation, acquisition, processing, and output increases
exponentially as the number of sources and the variety of formats grow
over time
• It is widely reported that some 90% of the world’s data has been created in the last two
years (http://www.economist.com/node/21537967)
• The big data revolution has driven massive change in the ability to process complex
events, capture online transactional data, develop products and services for mobile
computing, and process many large data events in near real time.
BIG DATA : WHAT IS THE DATA?

• Organizations nowadays are capturing additional data from their operational environments at an
increasingly fast speed. Some examples are:
• Web data. Customer-level web behaviour data (page views, searches, reading, reviews, purchasing).
• Text data. Email, news, Facebook feeds, documents, etc. This is one of the biggest and most widely applicable types
of big data.
• Time and location data. GPS and connectivity make time and location information a growing source of data. As
more individuals open up their time and location data publicly, many interesting applications start to
emerge.
• Smart grid and sensor data. Sensor data is now collected from cars, oil pipes, and windmill turbines, at
extremely high frequency.
• Social network data. Within social network sites like Facebook, LinkedIn, and Instagram, it is possible to do link
analysis to uncover the network of a given user.
CHARACTERISTICS OF BIG DATA (3V)

• Volume
• Velocity
• Variety

• Other characteristics: Veracity, Variability, and Value


VOLUME

• The sheer scale of the information processed helps define big data systems.
• These datasets can be orders of magnitude larger than traditional datasets, which
demands more thought at each stage of the processing and storage life cycle.
• There exists a challenge of pooling, allocating, and coordinating resources from groups of
computers. Cluster management and algorithms capable of breaking tasks into smaller
pieces become increasingly important.
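
The idea of breaking a task into smaller pieces and combining the partial results can be sketched in a few lines. This is only an illustration under stated assumptions: worker threads on one machine stand in for the machines of a cluster, and the function names (`split_into_chunks`, `count_words`, `distributed_word_count`) are hypothetical, not taken from any framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks pieces of roughly equal size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """The small task each worker runs on its own piece of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, n_workers=3):
    """Run count_words on each chunk, then merge the partial counts."""
    chunks = split_into_chunks(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["big data is big", "data moves fast", "big clusters process data"]
result = distributed_word_count(lines)
print(result["big"], result["data"])  # 3 3
```

The merge step works because word counts from separate chunks can simply be added, which is what makes this kind of task easy to split.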
VELOCITY

• Another way in which big data differs is the speed at which information moves through the
system.
• Data is frequently flowing into the system from multiple sources and is often expected to
be processed in real time to gain insights and update the current understanding of the
system.
• The focus on near instant feedback has driven many big data practitioners away from a
batch-oriented approach and closer to a real-time streaming system.
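
The difference between the batch-oriented and streaming approaches can be illustrated with a running average. In this hedged sketch, `batch_mean` needs the whole dataset before it can answer, while the hypothetical `RunningMean` class updates its answer as each value arrives; the sensor readings are made-up sample values.

```python
class RunningMean:
    """Streaming style: update the mean one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental update: no need to keep earlier values in memory.
        self.mean += (value - self.mean) / self.count
        return self.mean


def batch_mean(values):
    """Batch style: compute the mean once all values are available."""
    return sum(values) / len(values)


readings = [10.0, 20.0, 30.0, 40.0]  # made-up sample data
stream = RunningMean()
for reading in readings:
    current = stream.update(reading)  # an up-to-date answer after every event

print(current, batch_mean(readings))  # 25.0 25.0
```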
VARIETY

• The variety of sources and data types being generated expands as fast as new technology can
be created
• Big data is unique because of the wide range of both the sources being processed and their
relative quality.
• Data can be ingested from internal systems like application and server logs, from social media
feeds and other external APIs, from physical device sensors, and from other providers.
• Big data seeks to handle potentially useful data regardless of where it's coming from by
consolidating all information into a single system.
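
Consolidating data from different sources usually means mapping each format onto one common record shape. The sketch below assumes two hypothetical inputs, a JSON application log line and a CSV sensor line; the field names (`user`, `action`, `entity`, `event`) are invented for illustration.

```python
import csv
import io
import json


def from_json_log(line):
    """Normalize one JSON application-log line into the common shape."""
    rec = json.loads(line)
    return {"source": "app_log", "entity": rec["user"], "event": rec["action"]}


def from_csv_sensor(line):
    """Normalize one 'device,reading' CSV line into the common shape."""
    device, reading = next(csv.reader(io.StringIO(line)))
    return {"source": "sensor", "entity": device, "event": f"reading={reading}"}


# Two differently formatted inputs end up in a single list of uniform records.
consolidated = [
    from_json_log('{"user": "alice", "action": "login"}'),
    from_csv_sensor("turbine-7,42.5"),
]
for record in consolidated:
    print(record)
```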
SOME MAKE IT 4V
VERACITY, VARIABILITY, VALUE

• Veracity: The variety of sources and the complexity of the processing can lead to
challenges in evaluating the quality of the data (and consequently, the quality of the
resulting analysis)
• Variability: Variation in the data leads to wide variation in quality. Additional resources
may be needed to identify, process, or filter low quality data to make it more useful.
• Value: The ultimate challenge of big data is delivering value. Sometimes, the systems and
processes in place are complex enough that using the data and extracting actual value can
become difficult.
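
Filtering low quality data can be as simple as a validity check applied to every record. This is a minimal sketch: the `temperature` field and the accepted range of -50 to 60 are assumptions chosen for the example, not rules from any particular system.

```python
def is_valid(record):
    """A record is usable if it has a numeric temperature in a plausible range."""
    temp = record.get("temperature")
    return isinstance(temp, (int, float)) and -50 <= temp <= 60


def split_by_quality(records):
    """Separate usable records from ones that need repair or removal."""
    good, bad = [], []
    for record in records:
        (good if is_valid(record) else bad).append(record)
    return good, bad


records = [
    {"id": 1, "temperature": 21.5},
    {"id": 2, "temperature": None},   # missing value
    {"id": 3, "temperature": 999},    # out of range
    {"id": 4},                        # field absent
]
good, bad = split_by_quality(records)
print(len(good), len(bad))  # 1 3
```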
WHO’S GENERATING BIG DATA

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion
THE MODEL HAS CHANGED…

The Model of Generating/Consuming Data has Changed

Old Model: A few companies generate data; all others consume data

New Model: All of us generate data, and all of us consume data
WHAT’S DRIVING BIG DATA

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
HARNESSING BIG DATA

• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
BIG DATA : THE INSIGHT

• Big data enables a shift from the traditional "descriptive" insight to new
"predictive" and "prescriptive" insights
• Descriptive analytics: "What happened in the past?"
• What was the revenue last year? Which is our most profitable product?

• Predictive analytics: "What might happen next?"
• What will the number of complaints be in the next quarter? Which customers are likely to churn?

• Prescriptive analytics: "How should I deal with this?"
• Recommend articles that we think a reader would like to read next; offer a value
package to a customer who has a high chance of churning
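
The three kinds of insight can be contrasted with toy functions that mirror the example questions above. Everything here is an assumption made for illustration: the figures are invented, `forecast_next` is a deliberately naive linear trend rather than a real forecasting model, and the 0.7 churn threshold is arbitrary.

```python
def most_profitable(profit_by_product):
    """Descriptive: summarize what already happened."""
    return max(profit_by_product, key=profit_by_product.get)


def forecast_next(history):
    """Predictive: last value plus the average change per period."""
    changes = [b - a for a, b in zip(history, history[1:])]
    return history[-1] + sum(changes) / len(changes)


def recommend_action(churn_probability, threshold=0.7):
    """Prescriptive: turn a prediction into a suggested action."""
    if churn_probability >= threshold:
        return "offer value package"
    return "no action"


profit = {"A": 120, "B": 340, "C": 90}        # invented yearly profit
complaints = [100, 110, 120, 130]             # invented quarterly complaint counts

print(most_profitable(profit))    # B
print(forecast_next(complaints))  # 140.0
print(recommend_action(0.85))     # offer value package
```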
BUSINESS VALUE OF BIG DATA ANALYTICS

• Enable enhanced insight
• Decision making
• Automate the decision (process automation)
APPLICATION OF BIG DATA

• Segmentation and prediction
• A bank reviews a person's financial history and assesses their likelihood of repaying a debt

• Churn prediction
• Attracting new customers is much more expensive than retaining existing ones

• Recommender systems and targeted marketing
• Amazon recommends items to users

• Sentiment analysis
• Find opinions across a large number of people to learn what the market is saying, thinking, and feeling about an organization.

• Operational analytics
• Airlines automatically reroute customers when a flight is delayed in order to limit travel disruption and raise customer satisfaction.

• Big data for social good
• Tracking the spread of epidemic diseases
LIFE CYCLE OF BIG DATA

• Ingesting data into the system: Data ingestion is the process of taking raw data and adding
it to the system. The complexity depends heavily on the format and quality of the data
sources and how far the data is from the desired state prior to processing.
• Persisting the data into storage: The ingestion processes typically hand the data off to the
components that manage storage, so that it can be reliably persisted to disk.
• Computing and analyzing: The computation layer is perhaps the most diverse part of
the system, as the requirements and best approach can vary significantly depending on
what type of insights are desired.
• Visualizing the results: Visualizing data is one of the most useful ways to spot
trends and make sense of a large number of data points.
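
The four stages above can be strung together in a toy pipeline. This is a sketch on a single machine under stated assumptions: the raw input is made-up "user, page" lines, storage is a JSON Lines file in a temporary directory, and the function names simply mirror the stage names rather than any real framework.

```python
import json
import os
import tempfile
from collections import Counter


def ingest(raw_lines):
    """Ingest: parse raw 'user, page' lines and drop empty ones."""
    records = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        user, page = line.split(",")
        records.append({"user": user.strip(), "page": page.strip()})
    return records


def persist(records, path):
    """Persist: write one JSON record per line to disk."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def compute(path):
    """Compute: count page views from the stored records."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["page"]] += 1
    return counts


def visualize(counts):
    """Visualize: render the counts as a simple text bar chart."""
    return "\n".join(f"{page:<8} {'#' * n}" for page, n in counts.most_common())


raw = ["alice, home", "bob, home", "", "alice, cart"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.jsonl")
    persist(ingest(raw), path)
    page_views = compute(path)

print(visualize(page_views))
```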
CLUSTERED COMPUTING

• Computer clusters are a better fit for the high storage and computational needs of
big data.
• The benefits are:
• Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU
and memory pooling is also extremely important.
• High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees
to prevent hardware or software failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the
group. This means the system can react to changes without expanding the physical resources on a
machine.
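
One way to picture horizontal scaling is to spread data keys across the machines in a cluster and then add a machine. The sketch below is an assumption-laden illustration: it uses simple hash-modulo placement, and the node names and keys are invented. Modulo placement relocates many keys whenever the node count changes, so it is shown here only because it is the shortest way to demonstrate the idea.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node for a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Return a mapping of node name to the keys stored on it."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


keys = [f"user-{i}" for i in range(12)]   # invented data keys
small_cluster = ["node-a", "node-b", "node-c"]
large_cluster = small_cluster + ["node-d"]  # scale out by adding one machine

before = distribute(keys, small_cluster)
after = distribute(keys, large_cluster)

for node, stored in after.items():
    print(node, len(stored))
```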
BIG DATA TECHNOLOGY
The World of Big Data Tools

• Programming models: DAG Model, MapReduce Model, Graph Model, BSP/Collective Model
• Hadoop, MPI
• For iterations/learning: HaLoop, Twister, Giraph, Hama, GraphLab, Spark, GraphX, Harp
• Stratosphere, Dryad/DryadLINQ, Reef
• For query: Pig/PigLatin, Hive, Tez, Drill, Shark, MRQL
• For streaming: S4, Storm, Samza, Spark Streaming