
Big Data

A SOFT INTRODUCTION TO BIG DATA


COPYRIGHT CHIRAG AHUJA RESTRICTED CIRCULATION
Contents
What is Big Data
Conventional Approaches
Problems with Conventional Approaches
Welcome to the world of Big Data
What is Big Data
Every day, the world creates 2.5 quintillion bytes of data (roughly 2.5 exabytes), so much that 90%
of the data in the world today has been created in the last two years
alone
Gartner defines Big Data as high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision making.
What is Big Data
According to IBM, 80% of the data captured today is unstructured: from
sensors used to gather climate information, posts to social media sites,
digital pictures and videos, purchase transaction records, and cell phone
GPS signals, to name a few. All of this unstructured data is also Big Data.
Why Big Data
Huge competition in the market:
Retail: customer analytics
Travel: customers' travel patterns
Websites: understanding users' navigation patterns, interests, conversions, etc.
Sensors, satellite, and geospatial data
Military and intelligence
Essence of Big Data
Volume
Today we are living in a world of data. There are multiple factors
contributing to data growth
Huge volumes of data are generated from various sources:
Transaction-based data (stored over the years)
Text, images, and videos from social media
Increasing amounts of data generated by sensors
Volume
Turn 12 terabytes of Tweets created each day into improved product
sentiment analysis
Convert 350 billion annual meter readings to better predict power
consumption
Analyze billions of customer complaints to identify the root causes of
customer churn
Velocity
According to Gartner, velocity "means both how fast data is being
produced and how fast the data must be processed to meet demand."
Scrutinize 5 million trade events created each day to identify potential
fraud
Analyze customers' searching and buying patterns and show them
advertisements for attractive offers in real time (a minimal sketch follows below)
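
As a rough illustration of velocity, here is a minimal Python sketch that flags trades arriving too quickly from one account, examining each event as it streams in rather than batching it for overnight analysis. The field names, the per-minute threshold, and the rule itself are illustrative assumptions, not a real fraud-detection method:

# Minimal sketch: flag potentially fraudulent trades as they arrive.
# Field names and the threshold rule are illustrative assumptions.
from collections import defaultdict
from typing import Iterable

def flag_suspicious(trades: Iterable[dict], max_per_minute: int = 50) -> list:
    """Return trade IDs whose account exceeds max_per_minute trades in any minute."""
    counts = defaultdict(int)      # (account, minute) -> number of trades seen
    flagged = []
    for trade in trades:           # trades are consumed one by one, as they stream in
        key = (trade["account"], trade["timestamp"] // 60)
        counts[key] += 1
        if counts[key] > max_per_minute:
            flagged.append(trade["id"])
    return flagged

# Example: three trades from the same account within one minute, limit of 2.
sample = [
    {"id": 1, "account": "A", "timestamp": 0},
    {"id": 2, "account": "A", "timestamp": 10},
    {"id": 3, "account": "A", "timestamp": 20},
]
print(flag_suspicious(sample, max_per_minute=2))   # -> [3]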
Velocity (example)
Take Google's example of how quickly data is processed:
As soon as a blog post is published, it appears in search results.
If we search for travel, shopping (electronics, apparel, shoes,
watches, etc.), jobs, and so on, relevant advertisements are shown to us
while we browse.
Even ads in email are highly content-driven
Variety
Data today comes in all types of formats: from traditional databases, to
hierarchical data stores created by end users and OLAP systems, to text
documents, email, meter-collected data, video, audio, stock ticker data,
and financial transactions
Veracity
Big Data veracity refers to the biases, noise, and abnormalities in data: is
the data being stored and mined actually meaningful to the problem
being analyzed?
Veracity is the biggest challenge in data analysis when compared to
things like volume and velocity. Keep your data clean, and put processes
in place to keep dirty data from accumulating in your systems (a minimal cleaning sketch follows below).
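
As a minimal sketch of keeping dirty data out of the pipeline, the Python snippet below drops records with missing required fields and removes duplicates before they are stored; the field names and the cleaning rules are illustrative assumptions, not a general-purpose data-quality method:

# Minimal sketch: filter out incomplete and duplicate records before storing them.
# The required fields and the notion of "duplicate" are illustrative assumptions.
def clean(records, required=("id", "timestamp", "value")):
    seen_ids = set()
    for record in records:
        if any(record.get(field) is None for field in required):
            continue                      # drop records with missing required fields
        if record["id"] in seen_ids:
            continue                      # drop duplicates by id
        seen_ids.add(record["id"])
        yield record

raw = [
    {"id": 1, "timestamp": 100, "value": 3.5},
    {"id": 1, "timestamp": 100, "value": 3.5},   # duplicate
    {"id": 2, "timestamp": 101, "value": None},  # missing value
]
print(list(clean(raw)))   # only the first record survives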
Conventional Approaches
Storage
RDBMS (Oracle, DB2, MySQL, etc.)
OS Filesystem
Processing
SQL Queries
Custom framework
C/C++
Python/Perl
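
To make the conventional approach concrete, here is a minimal Python sketch that stores data in a single-node relational database (SQLite standing in for Oracle/DB2/MySQL) and processes it with a SQL query; the table layout and data are illustrative assumptions:

# Minimal sketch of the conventional approach: one machine, one relational
# database, processing done with SQL. SQLite stands in for Oracle/DB2/MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")            # single-node storage
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("alice", 120.0), ("bob", 40.0), ("alice", 60.0)],
)

# Processing = a SQL query running on that same single machine.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
):
    print(customer, total)                    # alice 180.0 / bob 40.0

This works well as long as the data fits on, and can be processed by, one machine, which is exactly the assumption the following slides question.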
Why Big Data Technologies
Conventional approaches and technologies are not able to solve today's
problems
They are good for certain use cases
But they cannot handle data in the range of petabytes (a back-of-envelope sketch follows below)
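
A back-of-envelope calculation shows why. Assuming a single disk with a typical sequential-read throughput of about 100 MB/s (an assumed ballpark figure, not a measurement), simply reading one petabyte takes months:

# Back-of-envelope sketch: why a single machine struggles at petabyte scale.
# The ~100 MB/s sequential-read throughput is an assumed, typical figure for
# a single spinning disk, not a measurement.
petabyte_bytes = 10**15
throughput_bytes_per_s = 100 * 10**6          # ~100 MB/s from one disk

seconds = petabyte_bytes / throughput_bytes_per_s
days = seconds / (60 * 60 * 24)
print(f"Reading 1 PB sequentially: ~{days:.0f} days")   # ~116 days

Real systems have more disks and caches, but the single-machine bottleneck remains the core problem.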
Problems with Conventional Approaches
Limited Storage capacity
Limited Processing capacity
No scalability
Single point of failure
Sequential Processing
RDBMSs can only handle structured data
Requires preprocessing of data
Information is collected according to current business needs
Limited Storage capacity
Installed on a single machine
Has fixed storage limits
Requires archiving the data again and again
Reloading archived data back into the repository, as business needs
change, is problematic
Can only process the data that fits on a single machine
Limited Processing capacity
Installed on a single machine
Has fixed processing limits
Has a limited number of processing elements (CPUs)
Not able to process large amounts of data efficiently
No scalability
One of the biggest limitations of conventional RDBMSs is the lack of scalability
We cannot add more resources on the fly
Thank You
