
Big Data

Dr. Manish Pokharel


June 08, 2013


Contents

Introduction
Data Volume
A few facts about Big Data
Analysis
Challenges in Big Data
Handling Big Data
Research Areas in Big Data


Preamble: The Evolution of Data


1. In the past, the most difficult problem for businesses was
how to store all the data.
2. The challenge now is no longer to store large amounts of
information, but to understand and analyze this data.
3. By harnessing this data through sophisticated analytics,
and by presenting the key metrics in an efficient, easily
discernible fashion, we are afforded unprecedented
understanding and insight into our data.


The Evolution of Data


1. Unlocking the true value of this massive amount of
information will require new systems for centralizing,
aggregating, analyzing, and visualizing these enormous
data sets. In particular, analyzing and understanding
petabytes of structured and unstructured data poses the
following unique challenges:
1. Scalability
2. Robustness
3. Diversity
4. Analytics
5. Visualization of the Data

Introduction
We are awash in a flood of data today.
We have entered an era of Big Data.
Handling more than 30 PB (30 x 1,125,899,906,842,624 bytes) in a day has become a common phenomenon in most international companies nowadays.
In the USA alone, the government has produced more than 848 PB of data.
So we cannot run away from, or ignore, the presence of such huge data.
We have to think in a different way to handle these huge volumes of data.


Data Volume


Continue
As we know, we need to convert data into information so that we can make good decisions based upon it.
In a broad range of application areas, data is being collected
at unprecedented scale.
Decisions that previously were based on guesswork, or on
painstakingly constructed models of reality, can now be made
based on the data itself.


Continue
Big Data analysis now drives nearly every aspect of our
modern society, including mobile services, retail,
manufacturing, financial services, life sciences, and physical
sciences.
Big Data is an entity that is very large in size, must be interpreted very fast, and comes in a variety of structures, so it is not easily processed by traditional database management tools.
It refers to data sets whose size is beyond the capabilities of current database technology.


Continue
Big Data is massive data that comes from different sources and is characterized by three Vs: Volume, Velocity and Variety.
Volume
Velocity
Variety


Continue
Variety
Up to 85 percent of an organization's data is unstructured (not numeric), but it still must be folded into quantitative analysis and decision making.
Example: Text, video, audio and other unstructured data require
different architecture and technologies for analysis.

Velocity
Initiatives such as the use of RFID tags and smart metering are driving
an ever greater need to deal with the torrent of data in near real time.
This, coupled with the need and drive to be more agile and deliver
insight quicker, is putting tremendous pressure on organizations to
build the necessary infrastructure and skill base to react quickly
enough.

Continue
Variability
In addition to the speed at which data comes your way, the data flows
can be highly variable with daily, seasonal and event-triggered peak
loads that can be challenging to manage.

Complexity
Difficulties dealing with data increase with the expanding universe of
data sources and are compounded by the need to link, match and
transform data across business entities and systems.
Organizations need to understand relationships, such as complex
hierarchies and data linkages, among all data.


Continue
Big Data can also be considered a phenomenon that describes large volumes of high-velocity, highly complex and variable data.
Big Data technologies are a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis.
There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics.
Big Data is a relative term describing a situation where the volume, velocity and variety of data exceed an organization's storage or compute capacity for accurate and timely decision making.

Continue
Because Big Data has special characteristics, it requires special technologies to capture, extract, integrate, analyze, and interpret it.
Extracting meaning from Big Data is not impossible, but it is not easy.
Since Big Data is never at rest and its size grows very fast, ultra-high-speed messaging technology is required to capture and monitor streaming data continuously in real time.
The heterogeneous nature of incoming data, its growing volume, the need for quick interpretation, and security are the prime challenges of Big Data.

Continue
While the potential benefits of Big Data are real and
significant, and some initial successes have already been
achieved, there remain many technical challenges that must
be addressed to fully realize this potential.
The sheer size of the data, of course, is a major challenge, and
is the one that is most easily recognized.
Industry analysis companies like to point out that there are
challenges not just in Volume, but also in Variety and Velocity,
and that companies should not focus on just the first of these.


Continue
By Variety, they usually mean heterogeneity of data types,
representation, and semantic interpretation.
By Velocity, they mean both the rate at which data arrives and the time within which it must be acted upon.
While these three are important, this short list fails to include
additional important requirements such as privacy and
usability.


Source: IDC's Digital Universe Study, sponsored by EMC, December 2012



A few facts about Big Data!!


From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020).
Through 2020, the digital universe will roughly double every two years.
The investment in managing, containing, studying, and storing the bits in the digital universe will grow by only 40% between 2012 and 2020.
As a result, the investment per gigabyte during that same period will drop from $2.00 to $0.20.

Continue
Between 2012 and 2020, emerging markets' share of the
expanding digital universe will grow from 36% to 62%.
A majority of the information in the digital universe, 68% in
2012, is created and consumed by consumers watching
digital TV, interacting with social media, sending camera
phone images and videos between devices and around the
Internet, and so on.
Yet enterprises have liability or responsibility for nearly 80% of
the information in the digital universe.


Continue
It is estimated that by 2020, as much as 33% of the digital
universe will contain information that might be valuable if
analyzed.
By 2020, nearly 40% of the information in the digital universe will be "touched" by cloud computing providers, meaning that a byte will be stored or processed in a cloud somewhere in its journey from origination to disposal.


Continue
The proportion of data in the digital universe that requires
protection is growing faster than the digital universe itself,
from less than a third in 2010 to more than 40% in 2020.
The amount of information individuals create themselves (writing documents, taking pictures, downloading music, etc.) is far less than the amount of information being created about them in the digital universe.


Big Data Analysis


The analysis of Big Data involves multiple distinct phases, as shown in the figure on the next slide, each of which introduces its own challenges.
Many people unfortunately focus just on the analysis/modeling phase: while that phase is crucial, it is of little use without the other phases of the data analysis pipeline.
Even in the analysis phase, which has received much attention, there are poorly understood complexities in the context of multi-tenanted clusters where several users' programs run concurrently.
Many significant challenges extend beyond the analysis
phase.

The Big Data Analysis Pipeline

[Figure: phases of the Big Data analysis pipeline: data acquisition and recording; information extraction and cleaning; data integration, aggregation, and representation; query processing, data modeling, and analysis; interpretation.]

Continue
Data Acquisition and Recording
Big Data does not arise out of a vacuum: it is recorded from some data-generating source.
Much of this data is of no interest, and it can be filtered
and compressed by orders of magnitude.
One challenge is to define these filters in such a way that
they do not discard useful information.
The second challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured (a small acquisition sketch follows below).
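To make the two challenges concrete, the sketch below (not part of the original slides) shows an acquisition step that both filters raw readings and attaches descriptive metadata. The record layout, the sensor_id/value field names, the unit, and the threshold are hypothetical assumptions chosen only for illustration.

```python
import json
import time

def acquire(raw_records, min_value=0.0):
    """Filter raw readings and attach descriptive metadata.

    The record shape and the `min_value` threshold are assumptions; the point
    is that filtering rules and metadata are defined at acquisition time.
    """
    kept = []
    for rec in raw_records:
        # Filter: drop obviously uninformative readings (here, non-positive values).
        if rec.get("value", 0.0) <= min_value:
            continue
        # Metadata: record what was measured, from which source, and when.
        rec["metadata"] = {
            "source": rec.get("sensor_id", "unknown"),
            "unit": "celsius",
            "acquired_at": time.time(),
        }
        kept.append(rec)
    return kept

if __name__ == "__main__":
    raw = [{"sensor_id": "s1", "value": 21.5}, {"sensor_id": "s2", "value": -1.0}]
    print(json.dumps(acquire(raw), indent=2))
```

The filter here is deliberately crude; in practice the filtering rule is exactly what must be designed so that useful information is not discarded.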


Continue
Information Extraction and Cleaning
The information collected will not be in a format ready for
analysis.
We require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis (a small sketch follows below).
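A minimal sketch of such an extraction step, assuming a hypothetical web-server log layout (the regular expression below is an assumption, not a format the slides prescribe): each unstructured line is parsed into named fields, types are normalized, and lines that do not parse are discarded as part of cleaning.

```python
import re

# Hypothetical access-log layout; adjust the pattern to the real source.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3})'
)

def extract(line):
    """Pull structured fields out of one unstructured log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None                      # cleaning: drop lines that do not parse
    record = match.groupdict()
    record["status"] = int(record["status"])  # normalize types for later analysis
    return record

if __name__ == "__main__":
    line = '192.0.2.1 - - [08/Jun/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200'
    print(extract(line))
```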


Continue
Data Integration, Aggregation, and Representation
Given the heterogeneity of the flood of data, it is not
enough merely to record it and throw it into a repository.
Data analysis is considerably more challenging than simply
locating, identifying, understanding, and citing data.
For effective large-scale analysis, all of this has to happen in a completely automated manner (see the sketch below).
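A toy sketch of the integration and aggregation idea, assuming two hypothetical source layouts (field names such as cid, svc and service_name are invented for illustration): each record is normalized onto a common schema, after which aggregation across sources becomes a simple, automated step.

```python
def normalize(record, source):
    """Map one record from a source-specific layout onto a common schema."""
    if source == "ministry_a":   # per-source field names are assumptions
        return {"citizen_id": record["cid"], "service": record["svc"],
                "amount": float(record["amt"])}
    if source == "ministry_b":
        return {"citizen_id": record["citizen"], "service": record["service_name"],
                "amount": float(record["value"])}
    raise ValueError("unknown source: %s" % source)

def integrate(batches):
    """Normalize every record, then aggregate totals per service."""
    unified = [normalize(rec, src) for src, recs in batches for rec in recs]
    totals = {}
    for rec in unified:
        totals[rec["service"]] = totals.get(rec["service"], 0.0) + rec["amount"]
    return unified, totals

if __name__ == "__main__":
    batches = [
        ("ministry_a", [{"cid": "C1", "svc": "passport", "amt": "50"}]),
        ("ministry_b", [{"citizen": "C2", "service_name": "passport", "value": "50"}]),
    ]
    print(integrate(batches))
```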


Continue
Query Processing, Data Modeling, and Analysis
Methods for querying and mining Big Data are
fundamentally different from traditional statistical analysis
on small samples.
Big Data is often noisy, dynamic, heterogeneous, interrelated and untrustworthy.
Interpretation
Having the ability to analyze Big Data is of limited value if
users cannot understand the analysis.
Ultimately, a decision-maker, provided with the result of
analysis, has to interpret these results.

Challenges in Big Data Analysis

Heterogeneity and Incompleteness


Scale
Timeliness
Privacy
Human Collaboration


Managing Big Data


The classic architecture's potential bottleneck is the database server when faced with peak workloads.
A database server is limited in scalability and costly to scale, yet scalability and low cost are two important goals of Big Data processing.
A Big Data architecture has the following three key aspects:
Distributed file system
Non-structural and semi-structured data storage
Cloud platform


Handling Big Data


Algorithms
Clustering (see the sketch after this list)
Association Learning
Parameter Estimation
Recommendation Engines
Classification
Similarity Matching
Neural Networks
Genetic Algorithms, etc.
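As an example of the first item, here is a minimal, pure-Python k-means clustering sketch on 2-D points. It illustrates the assignment/update loop of the algorithm and is not tuned for large data sets; the sample data and the choice of k are arbitrary.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points: a toy sketch, not a production implementation."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
    return centers

if __name__ == "__main__":
    data = [(1, 1), (1.2, 0.8), (8, 8), (8.2, 7.9), (0.9, 1.1)]
    print(kmeans(data, k=2))
```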


Common Aspects
Analytics / Machine Learning: learning insights from data
Big Data: handling massive data volume
The two can be combined or used separately


Approaches to Solving (Processing) Big Data


The existing database approach is not appropriate, so we can use the following approaches:
Map Reduce
Cloud Computing


Big Data in E-Government System


The government provides services to the citizens.
Nowadays, most services have to be provided in real time or on the fly, for example: disaster management, traffic control, crime control, etc.
For that, the government needs to make quick decisions based upon various data from various sources in various formats.
Government should strive to understand the Art of the Possible enabled by advances in techniques and technologies to manage and exploit Big Data.
Hence, the government has to be smart enough to handle a huge volume of data, at high velocity, across a variety of data.
Government has to explore the possibility of breaking the problems into smaller sub-problems. [i.e. Divide and Conquer]
Assign these sub-problems to different workers and manage the entire problem to be solved. [Map Reduce]


Map Reduce
Map Reduce is a framework, popularized by Google, that processes a set of individual problems in parallel.
Map Reduce is a programming model that allows easy development of scalable parallel applications to process big data on large clusters of commodity machines.
It is simple, yet it provides good scalability and fault tolerance for massive data processing.
The philosophy of Map Reduce is based upon Divide and Conquer: solve the big problem by decomposing it into small problems.


Continue
Mapping and Reducing are the two main functions of Map Reduce.
The Mapping function takes the problem as input, breaks it into many manageable small problems expressed as (key, value) pairs, and assigns them to different computers.
The function is executed on each computer in parallel and produces a list of [Key1, list(Value1)] pairs, whereas the Reducing function collects the processed small problems and combines them into a defined format.
The Reducing function is executed at the end and produces [list(Value2)].
Features such as simplicity, flexibility, fault tolerance and high scalability have made Map Reduce very successful in managing Big Data.
A minimal word-count sketch of the model is shown below.
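The sketch that follows is a single-process word count, the customary illustration of the model (it is not taken from the slides). The map phase emits (word, 1) pairs, a shuffle step groups the values by key, and the reduce phase combines each key's list; on a real cluster the map and reduce calls would run on different machines.

```python
from collections import defaultdict

def map_phase(document):
    """Mapping: break the input into (key, value) pairs; here (word, 1)."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Group all values by key, producing {key: list(value)}."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducing: combine each key's value list into a final result."""
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    documents = ["big data needs big tools", "map reduce handles big data"]
    # In a real cluster each document would be mapped on a different machine.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(shuffle(mapped)))   # e.g. {'big': 3, 'data': 2, ...}
```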

Map Reduce in Connected Government

[Figure: individual ministries (Ministry A through Ministry Z) expose their services; the Map step breaks them into (ministry, service) pairs, the cluster shuffles and rearranges them, and the Reduce step combines them into consolidated service lists (List[Services 1], List[Services 2], List[Services 3]) delivered by the connected government.]

Cloud Computing
Cloud computing is a type of parallel and distributed system consisting of a collection of interconnected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources, based upon service level agreements (SLAs) established through negotiation between the service provider and the service user.



Few Research Topics in Big Data

Security in Big Data


Data Acquisition in Big Data
Data Visualization in Big Data
Managing data effectively in Big Data
Performance level in Big Data


Conclusion
Big Data has become a phenomenon in the ICT world.
We cannot run away from the presence of Big Data.
There are still many open research areas in Big Data.


Thank You Very Much!!!


A few more slides, if you need them!


Apache Hadoop
Apache Hadoop was developed to overcome the deficiencies of prior storage and analytics architectures (e.g. SANs, sharding, parallel databases, etc.).
The Apache Hadoop software library framework allows for
distributed processing of large datasets across clusters of
computers on commodity hardware.
This solution is designed for flexibility and scalability, with an
architecture that scales to thousands of servers and petabytes
of data.
The library detects and handles failures at the application
layer, delivering a high-availability service on commodity
hardware.

Hadoop
Hadoop is a platform which enables you to store and analyze large volumes of data.
Hadoop is batch oriented (high throughput, but relatively high latency) and strongly consistent (all readers see the same data).
Hadoop is best utilized for:
Large-scale batch analytics
Unstructured or semi-structured data
Flat files
Hadoop comprises two major subsystems (a minimal streaming example follows below):
HDFS (the distributed file system)
Map Reduce
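As an illustration of how the two subsystems are used together, the sketch below is a word-count mapper and reducer written for Hadoop Streaming, which lets any executable that reads stdin and writes tab-separated key/value lines act as a map or reduce task over files in HDFS. The script name and the exact path of the streaming jar in the usage note are assumptions that vary by Hadoop installation.

```python
#!/usr/bin/env python
# Word-count mapper and reducer for Hadoop Streaming, which pipes records
# through stdin/stdout. The run mode ("map" or "reduce") is chosen via argv.
import sys

def mapper():
    # Emit one "word<TAB>1" line per word; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer():
    # Input arrives sorted by key, so counts can be accumulated per run of keys.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

A typical invocation (paths and jar name are installation-specific) would look like: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py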

Thank you very much!!!
