
What is Big Data?

• Big data is a voluminous amount of structured, semi-structured, or unstructured data.
• It cannot be processed using traditional database systems.
• Big data is characterized by its high velocity, volume and variety.
• It requires cost-effective and innovative methods of information
processing to draw meaningful business insights.
Four V’s of Big Data

• IBM has a nice, simple explanation for the four critical features of big data:
• a) Volume – Scale of data
• b) Velocity – Analysis of streaming data
• c) Variety – Different forms of data
• d) Veracity – Uncertainty of data
Volume of Big Data
• By 2010, data volume was 40 zettabytes, which is 300 times from
• Of the world's 7 billion people, 6 billion use cell phones.
• Most U.S. companies store 100 TB of data every day.
Velocity of Big Data
• During each trading session, NSE captured 1 TB of trade information.
• In 2016, there were around 18.9 billion network connections.
• Modern cars have around 100 sensors that monitor fuel level, tire
pressure, etc.
Variety of Big Data
• Every month, 30 billion pieces of content are shared on Facebook.
• Per day, 400 million tweets are sent by 200 million users.
• 150 exabytes of health care data had been produced globally as of 2011.
Veracity of Big Data
• 1 in 3 business leaders don't trust the information they use to make decisions.
What is Hadoop
Hadoop vs RDBMS
• Hadoop: Processes semi-structured and unstructured data.
• RDBMS: Processes structured data.
• Hadoop: Schema on read.
• RDBMS: Schema on write.
Best Fit for Applications :
• Hadoop: Data discovery and Massive Storage/Processing of Unstructured data.
• RDBMS: Best suited for OLTP and complex ACID transactions. (A transaction
must maintain Atomicity, Consistency, Isolation, and Durability, commonly
known as the ACID properties, to ensure accuracy, completeness, and data
integrity.)
• Hadoop: Writes are fast.
• RDBMS: Reads are fast.
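To illustrate what atomicity in an ACID transaction means in practice, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an RDBMS. The accounts table, the names, and the transfer function are hypothetical and chosen only for illustration. A transfer that would overdraw an account fails as a whole, so the partial credit already applied is rolled back.

```python
import sqlite3

# In-memory database with a hypothetical "accounts" table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

Calling transfer(conn, "alice", "bob", 500) returns False and leaves both balances unchanged, even though the credit to bob was executed before the failing debit.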
Schema on Read vs Schema on Write
Schema on write :
• First you define your schema, then you write your data; when you read
your data, it comes back in the schema you defined up front.
Schema on read :
• Just load the data as-is and apply your own lens to the data when you
read it back out.
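The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sqlite3 stands in for an RDBMS, a list of JSON strings stands in for raw files loaded as-is, and the purchases table, field names, and read_purchases function are hypothetical.

```python
import json
import sqlite3

# Schema on write: the table layout is defined before any row is stored.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)")
db.execute("INSERT INTO purchases VALUES (?, ?)", ("ann", 19.5))
rows_on_write = db.execute("SELECT customer, amount FROM purchases").fetchall()

# Schema on read: raw records are kept as-is; their fields and types may vary.
raw_records = [
    '{"customer": "ann", "amount": 19.5, "device": "mobile"}',
    '{"customer": "bob", "amount": "7"}',
    '{"customer": "cy"}',
]


def read_purchases(lines):
    """Apply a (customer, amount) lens at read time; skip records with no amount."""
    for line in lines:
        record = json.loads(line)
        if "amount" in record:
            yield record["customer"], float(record["amount"])


rows_on_read = list(read_purchases(raw_records))
```

In the first half, a row that does not fit the declared columns is rejected at write time. In the second half, every record is accepted, and the decision about which fields matter is deferred to the reader.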
What a big data problem looks like
Scale of data: 100s of TB to 10s of PB
Diverse Data: Structured, Web logs
Data needs to be processed in record time
Parallel processing
• 1024 GB = 1 TB
• 1024 TB = 1 PB
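A rough back-of-the-envelope sketch shows why data at this scale calls for parallel processing. The figures below are assumptions for illustration only: a read rate of 100 MB per second per machine, 1 GB = 1024 MB, and a perfectly even split of the work with no coordination overhead. The function names are hypothetical.

```python
GB_PER_TB = 1024
TB_PER_PB = 1024
MB_PER_GB = 1024  # assumed, following the same binary convention


def tb_to_gb(tb):
    """Convert terabytes to gigabytes."""
    return tb * GB_PER_TB


def pb_to_tb(pb):
    """Convert petabytes to terabytes."""
    return pb * TB_PER_PB


def scan_hours(size_tb, machines, mb_per_second=100):
    """Hours needed to read size_tb once, split evenly across `machines`.

    mb_per_second is an assumed per-machine read rate, not a measured value.
    """
    size_mb = size_tb * GB_PER_TB * MB_PER_GB
    seconds = size_mb / (mb_per_second * machines)
    return seconds / 3600
```

Under these assumptions, scan_hours(100, 1) comes to roughly 291 hours (about 12 days) for a single machine to read 100 TB once, while scan_hours(100, 1000) comes to about 17 minutes when the same work is spread over 1000 machines.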
Why do we need Big data Analysis?
Big data analysis helps businesses increase their revenue
• Walmart, the world's largest retailer in 2014 in terms of revenue, is
using big data analytics to increase its sales through better predictive
analytics, customized recommendations, and new products based on
customer preferences and needs.
• Walmart observed a significant 10% to 15% increase in online sales,
for $1 billion in incremental revenue.
• There are many more companies like Facebook, Twitter, LinkedIn,
Pandora, JPMorgan Chase, Bank of America, etc. using big data
analytics to boost their revenue.
Structured, Semi structured and Unstructured data
• Data that can be stored in traditional database systems in the form
of rows and columns, for example online purchase transactions,
is referred to as structured data.
• Data that can be stored only partially in traditional database
systems, for example data in XML records, is referred to as
semi-structured data.
• Unorganized and raw data that cannot be categorized as
semi-structured or structured data is referred to as unstructured data.
Facebook updates, tweets on Twitter, reviews, web logs, etc. are all
examples of unstructured data.
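The three categories can be contrasted with a small Python sketch. The sample order, XML record, and review text are made up for illustration. The structured data maps directly to rows and columns, the XML record carries its own tags but no fixed table layout, and the free text has to be interpreted by the reader.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed rows and columns, as in a purchase transaction table.
structured = "order_id,customer,amount\n1001,ann,19.50\n"
orders = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: an XML record is self-describing but has no fixed columns.
semi_structured = '<order id="1001"><customer>ann</customer><note>gift wrap</note></order>'
root = ET.fromstring(semi_structured)
order_id = root.get("id")
customer = root.findtext("customer")

# Unstructured: free text such as a review; any structure is imposed by the reader.
review = "Loved the fast delivery, but the box arrived damaged."
words = re.findall(r"[a-z']+", review.lower())
```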
Data Warehouse and Data Lake: Two data management approaches
Data Warehouse
• A data warehouse (DW) is the central repository of integrated data from
one or more disparate sources, where data is extracted from transactional
systems.
• It stores current and historical data.
• It is used to create trend reports for senior management, such as
annual and quarterly comparisons.
• Its data is highly transformed and structured.
Data Lake
1) A data lake (DL) retains all data, regardless of source and structure, in
its raw form, and we only transform it when we're ready to use it.
This approach is known as “Schema on Read”, as opposed to the “Schema on
Write” approach used in the data warehouse.

2) Data lakes support all data types, from both traditional and
non-traditional data sources. (Non-traditional data sources include web
server logs, sensor data, social network activity, text and images.)
3) Data lakes support all users, both normal users and data scientists.
Data scientists may use advanced analytic tools and capabilities like
statistical analysis and predictive modelling. They can go to the lake
and work with the very large and varied data sets they need.
Normal users make use of more structured views of the data provided
for their use.
4) Data lakes adapt easily to changes.
- In the data lake, all data is stored in its raw form.
- Data is always accessible to someone who needs to use it.
- Users can explore data in novel ways and answer their questions at
their own pace, which has given rise to the concept of self-service
business intelligence.
5) Data lakes provide faster insights.
- They enable users to get to their results faster than the traditional
data warehouse approach.
Technology needed for Data Warehouse and Data Lake
• Data lakes - Big data technologies such as Hadoop.
• Data warehouses - Relational databases.
• Hadoop ecosystem works great for the data lake approach. How?
- It adapts and scales very easily for very large volumes of data.
- It can handle any data type or structure.
• Hadoop can also support data warehouse scenarios. How?
- By applying structured views to the raw data.
- Hadoop has the flexibility to serve all tiers of business users.
• Relational database technologies are ideal for data warehouse
applications. How?
- They excel at high-speed queries against highly structured data.
Data lake concept