
SWE2011- Big Data Analytics

Teaching Material by
Dr.T.Ramkumar
Associate Professor
School of Information Technology & Engineering
Vellore Institute of Technology
Vellore – 632 014
ramkumar.thirunavukarasu@vit.ac.in
What is Big Data?
 “Big Data”

“Big Data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze

 “Big Data”
Big data is data that exceeds the processing capacity of conventional
database systems. The data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures
 “Big Data”
Data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it
 “Big Data”
Big data is the realization of greater business intelligence by storing,
processing, and analyzing data that was previously ignored due to the
limitations of traditional data management technologies.
(Source: Harness the Power of Big Data: The IBM Big Data Platform)
Why Big Data? & What makes Big Data?
 Key enablers for the growth of “Big Data” are:
 Increase in storage capacity
 Increase in processing power
 Availability of data
Facts about Big Data
1 Bit = Binary Digit
8 Bits = 1 Byte
1024 Bytes = 1 Kilobyte (KB)
1024 Kilobytes = 1 Megabyte (MB)
1024 Megabytes = 1 Gigabyte (GB)
1024 Gigabytes = 1 Terabyte (TB)
1024 Terabytes = 1 Petabyte (PB)
1024 Petabytes = 1 Exabyte (EB)
1024 Exabytes = 1 Zettabyte (ZB)
1024 Zettabytes = 1 Yottabyte (YB)
1024 Yottabytes = 1 Brontobyte (BB)
1024 Brontobytes = 1 Geopbyte (GeB)
Where does data come from?

 Data come from many quarters:
 Science – medical imaging, sensor data, genome sequencing, weather data, satellite feeds
 Industry – financial, pharmaceutical, manufacturing, insurance, online retail
 Legacy – sales data, customer behavior, product databases, accounting data, etc.
 System data – log files, status feeds, activity streams, network messages, spam filters
Big Data … Every Where … … …
 100 TB of data is uploaded daily to Facebook
 235 TB of data had been collected by the U.S. Library of Congress as of April 2011
 Wal-Mart handles more than 1 million customer transactions every hour, amounting to more than 2.5 PB of data
 Google processes 20 PB per day
 2.7 ZB of data exists in the digital universe today
Big Data contributors

 Social media and networks (all of us are generating data)
 Scientific instruments (collecting all sorts of data)
 Mobile devices (tracking all objects all the time)
 Sensor technology and networks (measuring all kinds of data)
The Changing model of data
 The model of generating/consuming data has changed
 Old model – a few companies generate data; all others consume it
 New model – all of us generate data, and all of us consume it
Drivers of Big Data
 Medical information such as genome sequencing
 Photos and video footage uploaded to the WWW
 Video surveillance
 Mobile devices – providing the geographical location of the user
 Metadata about text messages, call history and application usage in smartphones
 Smart sensors in public and industrial infrastructure
 Non-IT devices such as RFID and GPS
Different Forms of Data

 Structured data: data with a defined data type, format, and structure.
 Semi-structured data: data organized with minimal structure and a self-describing nature.
 Quasi-structured data: textual data with erratic formats that can be structured with tools.
 Unstructured data: data that has no inherent structure.
The Dimensions / Characteristics of Big Data

 Volume – large quantity of data
 Velocity – quickly moving data
 Variety – structured, unstructured, images, etc.
Volume - Scale
 Data volume is increasing exponentially
 Many organizations are generating data at a rate of terabytes per day
 Examples:
– Yahoo – over 170 PB of data
– Facebook – over 30 PB of data
– eBay – over 5 PB of data
 Such data has inherent value and hence cannot be discarded
Velocity - Speed
 Data is being generated fast and needs to be processed fast – online data analytics
 Late decisions → missed opportunities
 Examples:
1. E-promotions: based on your current location and purchase history, send promotions right now for the store next to you
2. Healthcare monitoring: sensors monitoring your activities and body – any abnormal measurement requires an immediate reaction
Complexity - Variety
 Various formats, types, and structures
 Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
 Static data vs. streaming data
 A single application can generate/collect many types of data
Complexity – Variety … [Contd.]

 To extract knowledge, all these types of data need to be linked together
Big Data: 3 V’s
Some extend this to 4 V’s (the fourth commonly being Veracity)
Data Analytics
 Data analytics is the science of examining raw data with the purpose of drawing conclusions that can be transformed into useful information.
 Data Analytics is multidisciplinary and uses methods
from
 Statistics
 Computer Science
 Mathematics
 Operational Research
Data Analytics

• Analyze/mine/summarize large datasets
• Extract knowledge from past data
• Predict trends for the future
• Utilize machine learning and data mining tools
• Both machine learning and data mining are subsets of Artificial Intelligence
Big Data Analytics

 Large volumes of different forms of data, which cannot be handled by standard database management systems such as DBMS, RDBMS or ORDBMS
 High-volume / high-variety / high-velocity information assets require new forms of processing/analytics to enable enhanced decision making, insight discovery and optimization
Traditional Analytics Vs Big Data Analytics
Attribute             Traditional Analytics              Big Data Analytics
Architecture          Centralized                        Distributed
Nature of data        Limited data sets;                 Large-scale data sets;
                      cleaned data;                      more types of data (raw data);
                      structured or transactional        unstructured or semi-structured
Focus                 Descriptive analytics              Predictive analytics
Data model            Fixed schema                       Schema-less
Data relationship     Known relationships                Complex/unknown relationships
Data source           Operational                        Operational + external
Nature of query       Repetitive                         Experimental and ad hoc
Data volume           GBs to terabytes                   Petabytes to exabytes
Outcome of analysis   Causation: what happened,          Correlation: new insights and
                      and why?                           more accurate answers
Concepts that Fall under Big Data

 Traditional Business Intelligence


 Data Mining
 Statistical Applications
 Predictive Analytics
Business Intelligence Vs. Advanced Analytics
The Big Data Challenge
HACE Theorem

• Big Data starts with Heterogeneous, Autonomous sources with distributed and decentralized control, and seeks to explore Complex and Evolving relationships among the data.
• These characteristics make it an extreme challenge to discover useful knowledge from Big Data.
Big Data Challenges
Harnessing the Big Data

 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data architecture & technology)
What is Data Science?
• An interdisciplinary field about processes and systems to extract knowledge and insights from data in various forms, either structured or unstructured; a continuation of data analysis fields such as statistics, data mining, and KDD.
• Data science is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the creation of business and IT strategies.
• It helps an organization rein in costs, increase efficiency, recognize new market opportunities and increase its competitive advantage.
Who is Data Scientist?
• People with deep analytical talent
• “Part analyst – part artist”
(Anjul Bhambhri, Vice President of Big Data Products at IBM)
• A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, and then recommends ways to apply it.
Who is Data Scientist?
• A data scientist must possess a combination of analytic, machine learning, data mining and statistical skills, as well as experience with algorithms and coding.
• A data scientist should be able to translate the significance of data in a way that can be easily understood by others.
Profile of a Data Scientist
• Quantitative skills: such as mathematics or statistics
• Technical aptitude: namely software engineering, machine learning and programming skills
• Skeptical mind-set and critical thinking: able to examine data critically rather than in a one-sided way
• Curious and creative: passionate about data and about finding creative ways to solve problems
• Communicative and collaborative: articulates business value clearly and works collaboratively with other stakeholders
Evolution of Analytic Scalability
 Big data is more real-time in nature than traditional DW applications
 Traditional DW architectures are not well suited for big data applications
 Shared-nothing, massively parallel processing, scale-out architectures are well suited for big data applications
 Novel architectures, analytic processes, analytic methods and tools are needed
Analytic Scalability
 Horizontal scaling means that you scale by adding more machines to your pool of resources (see the sketch below)
 Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine
Data Design Approaches
 ACID approach for relational database systems (see the sketch below)
 Atomicity
 Consistency
 Isolation
 Durability
 CAP theorem for Big Data-based systems
 Consistency
 Availability
 Partition tolerance
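A minimal sketch of the "A" (atomicity) in ACID, using Python's built-in sqlite3 module; the accounts table and the simulated failure are illustrative only.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transfer")
        # the matching credit to bob would go here (never reached)
except RuntimeError:
    pass

# Atomicity: the half-finished transfer was rolled back, balances unchanged.
print(conn.execute("SELECT name, balance FROM accounts").fetchall())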
Massively Parallel Processing Systems

 An MPP database spreads data out into independent pieces managed by independent storage and processing units.
 Hundreds of computers are connected over a network, each holding a small piece of a large data set.
 Many smaller independent queries run simultaneously instead of just one big query.
 A very useful architecture for the data preparation stage.
Grid Computing

 Applying resources from many computers in a network – at the same time – to a single problem.
 Suited to problems that require a large number of processing cycles or access to large amounts of data.
 Organizations can unite disparate technology capabilities to create a single, unified system.
 Enables the virtual sharing, management of, and access to devices across an enterprise, industry or workgroup.
Cloud Computing

 Enterprises incur no infrastructure or capital costs, only operational costs on a pay-per-use basis.
 Capacity can be scaled up or down dynamically.
 The underlying hardware can be anywhere geographically.
 The architectural specifications are abstracted from the user.
 Multiple users from multiple organizations can access the exact same infrastructure simultaneously.
MapReduce

 MapReduce is a parallel programming framework.
 It is neither a database nor a direct competitor to databases.
 It is a programming framework that helps organizations handle unstructured and semi-structured sources such as text, web log data, sensor data and images; a minimal sketch follows.
 Hadoop is the best-known implementation of the MapReduce framework.
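For intuition, here is a minimal in-process Python sketch of the MapReduce idea (word count); a real framework such as Hadoop would distribute the map, shuffle and reduce phases across a cluster, while here they run in one process to show the flow.

from collections import defaultdict

def map_phase(document):
    for word in document.lower().split():
        yield (word, 1)            # emit intermediate (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)     # group values by key, as the framework would
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)        # aggregate each key's values

docs = ["big data moves fast", "big data is big"]
pairs = [p for d in docs for p in map_phase(d)]
print(dict(reduce_phase(k, v) for k, v in shuffle(pairs).items()))
# -> {'big': 3, 'data': 2, 'moves': 1, 'fast': 1, 'is': 1}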
A closer look on MapReduce

 MapReduce is used to process big data sources and summarize them in a meaningful way.
 It does not provide indexing, query optimization, security or other features provided by databases.
 It has no historical perspective on other jobs that have been run, and no knowledge of other data that exists.
 MapReduce is still not mature.
Evolution of Analytic Processes

 Big Data platforms provide interesting capabilities for exploratory analysis and experimentation.
 Such capabilities usually consist of mashed-up datasets, sophisticated algorithms and codebases, and rich data-visualization components.
 These capabilities are called “Sandboxes”.
 A sandbox provides complete freedom to perform any analysis over the data.


A typical Sandbox contains….

 Data integration tools such as Hadoop ecosystem tools
 Statistical analysis tools such as SAS and R
 Tools for soft computing such as MATLAB
 Ad hoc querying and rich data visualization tools such as Tableau, QlikView and others
Usage of Sandboxes
 Maximize insight into a dataset
 Uncover underlying structure
 Extract important variables
 Detect outliers and anomalies
 Test underlying assumptions
 Develop parsimonious models
 Determine optimal factor settings
Internal Sandboxes

 A portion of the enterprise data warehouse or data mart is carved out to serve as the analytic sandbox.
 The sandbox is physically located in the production system but is not part of it.
 Internal sandboxes are cost-effective since no new hardware is needed.
 The weakness is that the internal sandbox consumes space and CPU resources of the data warehouse/data mart.
External Sandboxes

 A physically separate analytic sandbox, created for testing and development of analytic processes.
 A stand-alone environment dedicated to advanced analytics.
 Weakness: the additional cost of the stand-alone system that serves as the sandbox platform.
 To mitigate this cost, many organizations shift older equipment to the sandbox environment.
 Data movement from the production system to the sandbox incurs significant cost.
Analytic Dataset

 A dataset pulled together in order to create an analysis or model.
 It is data in the format required for the specific analysis at hand.
 An analytic dataset is generated by transforming, aggregating, and combining data.
 The analytic dataset helps bridge the gap between efficient storage and ease of use.
Enterprise Analytic Dataset

 An enterprise analytic dataset is logically one row per entity, with dozens, hundreds, or even thousands of attributes.
 One can create an EAD for any entity that is the focus of analysis (e.g., customers, employees, suppliers).
 Offers a standardized view of data to support multiple analysis efforts.
 Streamlines the data preparation process.
 Allows analytic professionals to spend much more time on analysis; a minimal sketch of building one follows.
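A minimal sketch of building an enterprise analytic dataset, assuming pandas is available; the column names and values are hypothetical. Raw transactions are transformed and aggregated into one row per customer.

import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount":      [20.0, 35.0, 5.0, 12.5, 7.5],
    "channel":     ["web", "store", "web", "web", "store"],
})

# Transform, aggregate and combine raw rows into one row per customer.
ead = transactions.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_basket=("amount", "mean"),
    n_orders=("amount", "size"),
    web_share=("channel", lambda c: (c == "web").mean()),
)
print(ead)  # a standardized view ready to support multiple analyses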
Evolution of Analytic Methods

 Supervised Learning
 Unsupervised Learning
 Ensemble Methods
 Text Analysis
 Collaborative Filtering
Supervised Learning

• Supervised learning deals with learning a function from available training data.
• A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
• Examples: classifying an e-mail as spam; labeling a web page based on its content.
• Algorithms: neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers (see the sketch below).
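A minimal supervised-learning sketch, assuming scikit-learn is installed; the toy e-mails and labels are illustrative. A Naive Bayes classifier learns the mapping from labelled examples and is applied to a new one.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting agenda attached",
          "free money claim now", "project status report"]
labels = ["spam", "ham", "spam", "ham"]        # the labelled training data

vec = CountVectorizer()
X = vec.fit_transform(emails)                  # text -> term-count features
model = MultinomialNB().fit(X, labels)         # learn the inferred function

print(model.predict(vec.transform(["claim your free prize"])))  # -> ['spam']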

Unsupervised Learning

• Unsupervised learning makes sense of unlabeled data without any predefined dataset for its training.
• Unsupervised learning is an extremely powerful tool for analyzing available data and looking for patterns and trends; a sketch follows.
• Example: grouping objects based on their similarity.
• Algorithms: k-means algorithm, hierarchical algorithms.
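A minimal unsupervised-learning sketch, assuming scikit-learn; k-means groups unlabeled points by similarity alone. The points are synthetic.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # one natural group
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])   # another group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment found without any labels
print(km.cluster_centers_)  # the discovered group centres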

Ensemble Methods

• Instead of building a single model with a single technique, multiple models are built using multiple techniques.
• All of the results are combined to come up with a final answer.
• Note that ensemble models go beyond picking the best individual performer from a set of models.
• They actually combine the results of multiple models in order to get a single final answer, as in the sketch below.
• This phenomenon is often called the “wisdom of crowds”.
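A minimal ensemble sketch, assuming scikit-learn; the synthetic dataset is illustrative. Three different model types vote, rather than the single best performer being picked.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Three different techniques vote; the majority is the single final answer.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=3)),
]).fit(X, y)
print(ensemble.predict(X[:5]))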

Text Analysis

• One of the most rapidly growing methods utilized by organizations today is the analysis of text and other unstructured data sources.
• How many of a given customer’s e-mails were negative or positive in tone?
• How often did a given customer focus on a specific product line in his/her communication?
• It is essential to create structured information out of the raw unstructured data.
• There are issues to address when parsing it, issues to address when identifying the context, and issues in identifying meaningful patterns.
• Unstructured data itself isn’t analyzed directly; rather, it is processed in a way that applies some sort of structure to it, as in the sketch below.
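A minimal sketch of applying structure to unstructured text; the word lists are illustrative, not a real sentiment lexicon. Each raw e-mail becomes a structured record of counts and flags.

NEGATIVE = {"broken", "refund", "angry", "late"}
PRODUCT_LINES = {"laptop", "phone"}

emails = ["my laptop arrived late and broken, I want a refund",
          "the phone is great, thanks for the quick delivery"]

rows = []
for text in emails:
    words = [w.strip(".,!?") for w in text.lower().split()]
    rows.append({
        "n_words": len(words),
        "negative_terms": sum(w in NEGATIVE for w in words),
        "mentions_product": any(w in PRODUCT_LINES for w in words),
    })
print(rows)  # structured records derived from unstructured text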

Collaborative Filtering

• An extensive framework for performing recommendations.
• The process of filtering for information or patterns using techniques involving collaboration between multiple data points.
• The underlying assumption of the approach is that if person ‘A’ has the same opinion as person ‘B’ on an issue, ‘A’ is more likely to share B’s opinion on a different issue ‘x’ than the opinion of a randomly chosen person.
• The idea behind collaborative filtering is that people often get the best recommendations from someone with tastes similar to their own.

Collaborative Filtering Types
• User-based collaborative filtering predicts a test user’s interest in a test item based on rating information from similar user profiles (see the sketch below).
• Item-based collaborative filtering is a form of collaborative filtering based on the similarity between items, calculated using people’s ratings of those items.
• Many different similarity measures are used:
o Cosine, Pearson, …
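A minimal user-based collaborative-filtering sketch using cosine similarity; the ratings matrix is hypothetical (0 = unrated). The prediction weights other users' votes by how similar their tastes are.

import numpy as np

ratings = np.array([[5, 4, 0, 1],     # user A (0 = unrated)
                    [4, 5, 1, 0],     # user B – tastes similar to A
                    [1, 0, 5, 4]])    # user C – different tastes

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target, item = 0, 2                   # predict user A's rating of item 2
others = [i for i in range(len(ratings)) if i != target]
sims = np.array([cosine(ratings[target], ratings[o]) for o in others])
votes = np.array([ratings[o, item] for o in others])
print(sims @ votes / sims.sum())      # prediction weighted by taste similarity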
Evolution of Analytic Tools

• R Project for Statistical Computing
• Massive Online Analysis (MOA)
• Hadoop ecosystem tools
• Modern visualization tools
Analyzing Data with Hadoop

 Nature of analysis:
– Batch processing
– Parallel execution
– Spread data over a cluster of servers and take the computation to the data (see the mapper/reducer sketch below)
 Analysis conducted at lower cost
 Analysis conducted in less time
 Greater flexibility
 Linear scalability
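A hedged sketch in the Hadoop Streaming style: Hadoop spreads the input splits over the cluster and pipes them through a mapper, then pipes sorted (key, value) pairs through a reducer; both are plain Python scripts reading stdin. The word-count logic, file name and job wiring are assumptions for illustration, not a setup this deck prescribes.

import sys

def mapper(stream=sys.stdin):
    for line in stream:                          # Hadoop pipes an input split here
        for word in line.strip().lower().split():
            print(f"{word}\t1")                  # emit key<TAB>value on stdout

def reducer(stream=sys.stdin):
    current, total = None, 0
    for line in stream:                          # Hadoop delivers keys sorted
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

# Assumed invocation (jar path varies by installation):
#   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#       -mapper "python3 wc.py map" -reducer "python3 wc.py"
if __name__ == "__main__":
    if sys.argv[1:] == ["map"]:
        mapper()
    else:
        reducer()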
What analysis is possible with Hadoop?
Text Mining
Index Building
Graph Creation and Analysis
Pattern Recognition
Collaborative Filtering
Model Prediction
Sentiment Analysis
Risk Assessment
Modeling true risk
 Challenge:
 How much risk exposure does an organization really have with each customer?
 Solution with Hadoop:
 Source and aggregate disparate data sources to build a data picture – e.g., credit card records, call recordings, chat sessions, e-mails, banking activity
 Structure and analyze – sentiment analysis, graph creation, pattern recognition
 Typical Industry:
 Financial services – banks, insurance companies
Customer Churn Analysis
 Challenge:
 Why are organizations really losing customers?
 Solution with Hadoop:
 Rapidly build behavioral models from disparate data sources
 Structure and analyze with Hadoop – traversing, graph creation, pattern recognition
 Typical Industry:
 Telecommunications, financial services
Recommendation Engine / Ad Targeting
 Challenge:
 Using user data to predict which products to recommend
 Solution with Hadoop:
 Batch processing framework – allows execution in parallel over large data sets
 Collaborative filtering – collects taste information from many users and utilizes it to predict what similar users will like
 Typical Industry:
 E-commerce, manufacturing, retail, advertising
Point of sale transaction analysis
 Challenge:
 Analyzing point-of-sale (PoS) data to target promotions and manage operations
 Sources are complex and data volumes grow across chains of stores and other sources
 Solution with Hadoop:
 Batch processing framework – allows execution in parallel over large data sets
 Pattern recognition
 Optimizing over multiple data sources
 Utilizing information to predict demand
Interactive analysis of big data
 Challenge:
 Analyzing real-time data series from a network of sensors
 Calculating average frequency over time is extremely tedious because of the need to analyze terabytes of data
 Solution with Hadoop:
 Take the computation to the data
 Expand from simple scans to more complex data mining
 Better understand how the network reacts to fluctuations
 Discrete anomalies may, in fact, be interconnected
 Identify the leading indicators of component failure
 Typical Industry:
 Telecommunications, data centers
Threat Analysis / Trade Surveillance
 Challenge:
 Detecting threats in the form of fraudulent activity or attacks
 Large data volumes are involved
 Like looking for a needle in a haystack
 Solution with Hadoop:
 Parallel processing over huge data sets
 Pattern recognition to identify anomalies – e.g., threats
 Typical Industry:
 Security, financial services
 General – spam fighting, click fraud
Search Quality
 Challenge:
 Providing meaningful search results
 Solution with Hadoop:
 Analyzing search attempts in conjunction with structured data
 Pattern recognition
 Browsing patterns of users performing searches in different categories
 Typical Industry:
 Web, e-commerce
Data Sandbox
 Challenge:
 Data deluge – not knowing what to do with the data or what analysis to run
 Solution with Hadoop:
 “Dump” all the data into an HDFS cluster
 Use Hadoop to start trying out different analyses on the data
 Spot patterns to derive value from the data
 Typical Industry:
 Common across all industries
Typical Use Case: Big Data Analytics for a Retail Organization
 Goal: to increase sales
 Objective: to gather information from a multitude of data sources
 Some are internal sources
 Others are external
 Some of the data has to be purchased
 Some may be available in the public domain
Typical Use Case: Big Data Analytics for a Retail Organization
 Start with internal structured data first:
 Sales logs
 Inventory movement
 Registered transactions
 Customer information
 Pricing and supplier interactions
 Then focus on unstructured data:
 Call centre support logs
 Customer feedback (e-mail and other communications), surveys
 Data generated by sensors (parking lot usage)
Typical Use Case: Big Data Analytics for a Retail Organization
 Finally, external data must be taken into account
 The data that makes up the public portion of the analytic process comes from government entities, research companies, social networking sites and others
 Mining Twitter data, Facebook, the U.S. Census, weather information, traffic pattern information and news archives
 The richness of the data is the basis for predictive analytics
 E.g., comparing population trends along with social sentiment
Some useful external big data sources

 Some excellent tools are available that mine data from the web and transform it for big data analytics platforms
 Some are free, and some are available for a fee
 https://www.programmableweb.com/api/extractiv
(automatically converts unstructured text into structured semantic data)
 https://www.mozenda.com/
(quickly turns web page content into structured data)
 openrefine.org
(a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data)
Some useful external big data sources
 http://80legs.com/
(specializes in gathering data from social networking sites as well as retail and business directories)
 http://commoncrawl.org/
(maintains an open repository of web crawl data that can be accessed and analyzed by anyone)
 https://books.google.com/ngrams/
(optimized for quick inquiries into the usage of small sets of phrases)
Data Analytics Life Cycle
Phase – 1 Discovery
• Learning the business domain
• Availability of resources to support the project in terms of
people, technology, data and time
• Framing the business problem as an analytical challenge
• Identifying the key stakeholders
• Interviewing the analytics sponsor
• Developing initial hypotheses
• Identifying potential data sources
Phase – 2 Data preparation

• Preparing the analytic sandbox
• Performing ETLT (extract-transform-load / extract-load-transform)
• Learning about the data
• Data conditioning
• Survey and visualize
Phase – 3 Model Planning

• Assess the structure of the data set


• Data exploration and variable selection
• Model selection
Phase – 4 Model building

• Develop data sets for training, testing and production purposes
• Check the robustness of the training and test data for the analytical techniques
• Refine the model to optimize the results
• Record the results and the logic of the model
Phase – 5 Communicate Results
• Assess the results and identify which data points may have been surprising and which are in line with the hypotheses
• Record all the findings and select the three most significant ones to share with stakeholders
• Quantify the business impact of the results
• Clearly articulate the results, methodology and business value of the findings
• Make recommendations for future work or improvements
Phase – 6 Operationalize
• Delivery of final reports, briefings, code and technical documents
• Execution of algorithms on in-memory tools such as R
• Running a pilot to implement the project in a production environment
• Enables the team to learn about the performance and related constraints of the model in a production environment on a small scale
The Nuts and Bolts of Big Data
• Building a platform: support for batch and real-time analytics
• Availability of mapping tools
• Big Data abstraction tools
• Parallel execution of business logic
• Moving away from SQL
• In-memory processing
• Support for public, private and hybrid clouds
Best Practices of Big Data Analytics (Do’s)

• Decide what data to include and what data to leave out
• Build effective business rules
• Translate business rules into relevant analytics in a collaborative fashion
• Have a maintenance plan
• Keep your users in mind – all of them
• Recognize the value of anomalies
• Weigh expediency vs. accuracy
Best Practices of Big Data Analytics (Dont’s)

• Thinking “if we build it, they will come”
• Assuming that the software will have all of the answers
• Not understanding that you need to think differently
• Forgetting all of the lessons of the past
• Not having the requisite business and analytical expertise
• Treating the project like a science experiment
Big Data Security

• Control access by process, not by job function
• Secure the data at rest
• Protect the cryptographic keys and store them separately from the data
• Create trusted applications and stacks to protect data
• Use a holistic approach
