8/3/2015
Contents
Introduction
Data Volume
A Few Facts about Big Data
Analysis
Challenges in Big Data
Handling Big Data
Research Areas in Big Data
Introduction
We are awash in a flood of data today.
We have entered an era of Big Data.
Handling more than 30 PB (30 x 1,125,899,906,842,624 bytes) in
a day has become a common phenomenon at many international
companies nowadays.
In the USA alone, more than 848 PB of data was produced by the
government.
So we cannot run away from or ignore the presence of this huge data
We have to think in a different way to handle such huge data
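As a quick sanity check on the figure above: the parenthetical number is exactly 2^50, the number of bytes in one pebibyte. A minimal sketch (the 30 PB/day figure is from the slide; the variable names are illustrative):

```python
# Sanity check: 1 PiB = 2**50 bytes, so 30 PB as given in the slide
# equals 30 x 1,125,899,906,842,624 bytes.
PIB = 2 ** 50                   # bytes in one pebibyte
daily_volume_bytes = 30 * PIB   # the 30 PB/day figure from the slide

print(PIB)                  # 1125899906842624
print(daily_volume_bytes)   # 33776997205278720
```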
Data Volume
Continue
As we know, we need data so that we can convert it into
information and make good decisions based upon it.
In a broad range of application areas, data is being collected
at unprecedented scale.
Decisions that previously were based on guesswork, or on
painstakingly constructed models of reality, can now be made
based on the data itself.
Continue
Big Data analysis now drives nearly every aspect of our
modern society, including mobile services, retail,
manufacturing, financial services, life sciences, and physical
sciences.
Big Data is an entity that is very big in size, arrives very fast,
and comes in various types of structures, so it is not easily
processed by traditional database management tools.
It refers to data sets whose size is beyond the capabilities of
current database technology
Continue
Big Data is the massive data that comes from different sources
and is characterized by three Vs: Volume, Velocity
and Variety.
Volume
Velocity
Variety
Continue
Variety
Up to 85 percent of an organization's data is unstructured (not
numeric), but it still must be folded into quantitative analysis and
decision making.
Example: text, video, audio and other unstructured data require
different architectures and technologies for analysis.
Velocity
Initiatives such as the use of RFID tags and smart metering are driving
an ever greater need to deal with the torrent of data in near real time.
This, coupled with the need and drive to be more agile and deliver
insight quicker, is putting tremendous pressure on organizations to
build the necessary infrastructure and skill base to react quickly
enough.
Continue
Variability
In addition to the speed at which data comes your way, the data flows
can be highly variable with daily, seasonal and event-triggered peak
loads that can be challenging to manage.
Complexity
Difficulties dealing with data increase with the expanding universe of
data sources and are compounded by the need to link, match and
transform data across business entities and systems.
Organizations need to understand relationships, such as complex
hierarchies and data linkages, among all data.
Continue
Big Data can also be considered as a phenomenon that
describes large volumes of high velocity with high complexity
and variable data.
Big Data technologies as a new generation of technologies
and architectures, designed to economically extract value
from very large volumes of a wide variety of data by enabling
high-velocity capture, discovery, and/or analysis.
There are three main characteristics of Big Data: the data
itself, the analytics of the data, and the presentation of the
results of the analytics.
Big data is a relative term describing a situation where the
volume, velocity and variety of data exceed an organization's
storage or compute capacity for accurate and timely decision
making.
Continue
Big data has special characteristics; it requires special
technologies to capture, extract, integrate, analyze,
and interpret it.
Extracting the meaning from big data is not impossible, but
it is not easy either.
Since big data is never at rest and its size is increasing
very fast, ultra-high-speed messaging technology is
required for real-time streaming data capture and
continuous monitoring.
The heterogeneous nature of incoming data, its increasing
volume, the need for quick interpretation, and security
are the prime challenges of big data.
Continue
While the potential benefits of Big Data are real and
significant, and some initial successes have already been
achieved, there remain many technical challenges that must
be addressed to fully realize this potential.
The sheer size of the data, of course, is a major challenge, and
is the one that is most easily recognized.
Industry analysis companies like to point out that there are
challenges not just in Volume, but also in Variety and Velocity,
and that companies should not focus on just the first of these.
Continue
By Variety, they usually mean heterogeneity of data types,
representation, and semantic interpretation.
By Velocity, they mean both the rate at which data arrive and
the time in which it must be acted upon.
While these three are important, this short list fails to include
additional important requirements such as privacy and
usability.
Continue
Between 2012 and 2020, emerging markets' share of the
expanding digital universe will grow from 36% to 62%.
A majority of the information in the digital universe, 68% in
2012, is created and consumed by consumers watching
digital TV, interacting with social media, sending camera
phone images and videos between devices and around the
Internet, and so on.
Yet enterprises have liability or responsibility for nearly 80% of
the information in the digital universe.
Continue
It is estimated that by 2020, as much as 33% of the digital
universe will contain information that might be valuable if
analyzed.
By 2020, nearly 40% of the information in the digital universe
will be "touched" by cloud computing providers, meaning
that a byte will be stored or processed in a cloud somewhere
in its journey from originator to disposal.
Continue
The proportion of data in the digital universe that requires
protection is growing faster than the digital universe itself,
from less than a third in 2010 to more than 40% in 2020.
The amount of information individuals create themselves
(writing documents, taking pictures, downloading music, etc.)
is far less than the amount of information being created
about them in the digital universe.
Continue
Data Acquisition and Recording
Big Data does not arise out of a vacuum: it is recorded
from some data generating source.
Much of this data is of no interest, and it can be filtered
and compressed by orders of magnitude.
One challenge is to define these filters in such a way that
they do not discard useful information.
The second challenge is to automatically generate the right
metadata to describe what data is recorded and how it is
recorded and measured.
Continue
Information Extraction and Cleaning
The information collected will not be in a format ready for
analysis.
We require an information extraction process that pulls
out the required information from the underlying sources
and expresses it in a structured form suitable for analysis.
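Such an extraction process can be sketched with a single regex rule that pulls structured records out of semi-structured text; the log format and field names below are invented for illustration, and a real pipeline would combine many such rules with further cleaning:

```python
import re

# Hypothetical semi-structured input: raw web-server-style log lines.
RAW = [
    "2015-03-08 10:01:22 GET /index.html 200",
    "2015-03-08 10:01:25 POST /login 401",
    "not a log line at all",   # noise to be cleaned out
]

# One extraction rule expressed as a regex with named fields.
LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:]+) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
)

def extract(lines):
    """Pull structured records out of raw text, dropping unparseable lines."""
    records = []
    for line in lines:
        m = LINE.match(line)
        if m:
            rec = m.groupdict()
            rec["status"] = int(rec["status"])  # cleaning: cast types
            records.append(rec)
    return records

rows = extract(RAW)
print(len(rows))          # 2
print(rows[0]["method"])  # GET
```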
Continue
Data Integration, Aggregation, and Representation
Given the heterogeneity of the flood of data, it is not
enough merely to record it and throw it into a repository.
Data analysis is considerably more challenging than simply
locating, identifying, understanding, and citing data.
For effective large-scale analysis all of this has to happen
in a completely automated manner.
Continue
Query Processing, Data Modeling, and Analysis
Methods for querying and mining Big Data are
fundamentally different from traditional statistical analysis
on small samples.
Big Data is often noisy, dynamic, heterogeneous, interrelated and untrustworthy.
Interpretation
Having the ability to analyze Big Data is of limited value if
users cannot understand the analysis.
Ultimately, a decision-maker, provided with the result of
analysis, has to interpret these results.
Common Aspects
Analytics /Machine Learning
Learning insights from data
Big Data
Handling massive data volume
Can be combined or used separately
Map Reduce
Map Reduce is a framework, popularized by Google, that
processes a set of individual problems in parallel.
Map Reduce is a programming model that allows easy
development of scalable parallel applications to process big
data on large clusters of commodity machines.
It is simple, but provides good scalability and fault tolerance
for massive data processing.
The philosophy of Map Reduce is based upon divide and
conquer: solve a big problem by decomposing it into
small problems.
Continue
Mapping and Reducing are the two main functions of Map
Reduce.
The Mapping function takes the problem as input, breaks it into
many manageable small problems as (key, value) pairs, and
assigns them to different computers.
The function is executed on each computer in parallel,
producing a list of [Key1, list(Value1)] pairs.
The Reducing function, executed at the end, collects the
processed small problems and combines them in a defined
format, producing [list(Value2)].
Features such as simplicity, flexibility, fault tolerance and
high scalability have made Map Reduce very successful in
managing big data.
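The map, group-by-key, and reduce phases described above can be illustrated with the classic word-count example, simulated here in plain Python (no cluster; the parallel assignment to different computers is replaced by a simple loop):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: break the input into (key, value) pairs -- here (word, 1)."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def shuffle(pairs):
    """Group values by key, producing [Key1, list(Value1)] as in the text."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's list of values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "big problems small problems"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])       # 3
print(counts["problems"])  # 2
```

In a real Map Reduce job, `map_phase` runs on many machines at once and the framework performs the shuffle; the structure of the computation, however, is exactly this.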
[Figure: a Map Reduce example over government services. Lists of ministry services from across government are mapped into (Ministry, Service) key-value pairs, grouped per ministry in a cluster, and reduced into consolidated per-ministry service lists for a connected government.]
Cloud Computing
Cloud computing is a type of parallel and distributed
system consisting of a collection of inter-connected and
virtualized computers that are dynamically provisioned and
presented as one or more unified computing resources, based
upon service level agreements (SLAs) established through
negotiation between the service provider and the service user.
Conclusion
Big Data has become a phenomenon in the ICT world
We cannot run away from the presence of Big Data
There are still many research areas in Big Data
Apache Hadoop
Apache Hadoop was developed to overcome the previously
mentioned deficiencies of prior storage and analytics
architectures (e.g. SANs, sharding, parallel databases).
The Apache Hadoop software library framework allows for
distributed processing of large datasets across clusters of
computers on commodity hardware.
This solution is designed for flexibility and scalability, with an
architecture that scales to thousands of servers and petabytes
of data.
The library detects and handles failures at the application
layer, delivering a high-availability service on commodity
hardware.
Hadoop
Hadoop is a platform which enables you to store and analyze
large volumes of data.
Hadoop is batch oriented (high throughput, but high latency)
and strongly consistent (every read sees the most recently
written data).
Hadoop is best utilized for:
Large scale batch analytics
Unstructured or semi-structured data
Flat files
Hadoop comprises two major subsystems:
HDFS (the distributed file system)
Map Reduce
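The HDFS side can be illustrated with a toy sketch of its core idea: files are split into fixed-size blocks, and each block is replicated across several machines so the system survives node failure. The block size, replication factor, node names, and round-robin placement below are all illustrative simplifications, not real HDFS behavior (HDFS uses 64/128 MB blocks, 3 replicas by default, and rack-aware placement):

```python
BLOCK_SIZE = 4        # illustrative; real HDFS uses 64/128 MB blocks
REPLICATION = 2       # illustrative; HDFS defaults to 3 replicas
NODES = ["node-1", "node-2", "node-3"]  # hypothetical cluster

def split_into_blocks(data):
    """Split a file's bytes into fixed-size blocks, HDFS-style."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_replicas(blocks):
    """Assign each block to REPLICATION distinct nodes (round-robin)."""
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(b"hello hdfs!!")
print(len(blocks))                # 3
print(place_replicas(blocks)[0])  # ['node-1', 'node-2']
```

Because every block lives on more than one node, losing a machine loses no data, and Map Reduce tasks can be scheduled on whichever node already holds a local copy of their block.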