

BY AMIT PANDEY – 15401A0503 (BATCH 04)

Data Science – Standard Definition
 Data science is a multi-disciplinary field that
uses scientific methods, processes, algorithms and systems
to extract knowledge and insights from structured and
unstructured data.
 Data Science is the science which uses computer science,
statistics, machine learning, visualization and human-computer
interaction to collect, clean, integrate, analyze, visualize
and interact with data to create data products.
Where do we obtain data from?
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 Financial transactions, bank/credit transactions
 Online trading and purchasing
 Social Network
How much data do we have?
 Google processes 20 PB a day (2008)
 Facebook has 60 TB of daily logs
 eBay has 6.5 PB of user data + 50 TB/day (5/2009)
 1000 genomes project: 200 TB
 Cost of 1 TB of disk: $35
 Time to read 1 TB disk: 3 hrs (100 MB/s)
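The read-time figure above follows from simple arithmetic. The sketch below reproduces it, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sequential throughput. The function name is illustrative and not part of the original slides.

```python
def read_time_hours(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to read size_bytes sequentially at a constant throughput."""
    return size_bytes / throughput_bytes_per_s / 3600


TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte

hours = read_time_hours(1 * TB, 100 * MB)
print(f"{hours:.2f} hours")  # 2.78 hours, which the slide rounds up to 3
```

At this rate a single disk needs roughly 10,000 seconds per terabyte, so scanning petabyte-scale collections such as those listed above is only practical when the reads are spread across many disks in parallel.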
What is big data?
Big Data is any data that is
expensive to manage and hard to
extract value from
 Volume
 The size of the data
 Velocity
 The latency of data processing
relative to the growing demand for interactivity
 Variety and Complexity
 the diversity of sources, formats,
quality, structures.
5 V’s of Big Data
 Raw Data: “Volume”
 Change over time: “Velocity”
 Data types: “Variety”
 Data Quality: “Veracity”
 Information for Decision
Making: “Value”
Elements of BIG DATA
What To Do With These Data?
 Aggregation and Statistics
 Data warehousing and OLAP
 Indexing, Searching, and Querying
 Keyword based search
 Pattern matching (XML/RDF)
 Knowledge discovery
 Data Mining
 Statistical Modeling
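Two of the operations listed above, aggregation and keyword-based search over an index, can be sketched in a few lines of plain Python. This is a toy illustration only: the records, documents, and function names are invented for the example, and real systems would use a data warehouse or a search engine instead.

```python
from collections import defaultdict

# Hypothetical purchase records, invented for illustration only.
purchases = [
    {"user": "a", "category": "books", "amount": 12.0},
    {"user": "b", "category": "books", "amount": 8.0},
    {"user": "a", "category": "music", "amount": 5.0},
]


def total_by(records, key, value):
    """Aggregation: sum `value` for each distinct `key` (a one-dimensional roll-up)."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r[value]
    return dict(totals)


def build_index(docs):
    """Indexing: map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Keyword search: ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for w in words[1:]:
        result &= index.get(w, set())
    return result


# Hypothetical documents, invented for illustration only.
docs = {1: "big data is hard", 2: "data science uses statistics", 3: "big science"}
index = build_index(docs)

print(total_by(purchases, "category", "amount"))  # {'books': 20.0, 'music': 5.0}
print(sorted(search(index, "big data")))          # [1]
```

The index is built once and reused for every query, which is the reason indexing and searching are listed together: the up-front cost of indexing buys fast lookups later.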
What actually is Data Science?
 An area that manages, manipulates, extracts, and
interprets knowledge from tremendous amounts of data
 Data science (DS) is a multidisciplinary field of study
with the goal of addressing the challenges in big data
 Data science principles apply to all data – big and small
 Theories and techniques from many fields and
disciplines are used to investigate and analyze a
large amount of data to help decision makers in
many industries such as science, engineering,
economics, politics, finance, and education
 Computer Science
 Pattern recognition, visualization, data
warehousing, High performance computing,
Databases, AI
 Mathematics
 Mathematical Modeling
 Statistics
 Statistical and Stochastic modeling,
Different milestones in Data Science

R.A. Fisher, W.E. Deming, Peter Luhn

What are the expectations according to
Gartner’s 2014 Hype Cycle
What we invest in Data Science
Real Life Examples
 Companies learn your secrets, shopping patterns,
and preferences
 Data Science and elections (2008, 2012)
 …that was just one of several ways that Mr.
Obama’s campaign operations, some unnoticed by
Mr. Romney’s aides in Boston, helped save the
president’s candidacy. In Chicago, the campaign
recruited a team of behavioral scientists to build
an extraordinarily sophisticated database
 …that allowed the Obama campaign not only to
alter the very nature of the electorate, making it
younger and less white, but also to create a
portrait of shifting voter allegiances. The power of
this operation stunned Mr. Romney’s aides on
election night, as they saw voters they never even
knew existed turn out in places like Osceola
County, Fla.
-- New York Times, Wed
Nov 7, 2012
Real life examples (contd.)
 Exciting new effective
applications of data analytics
 Example: Google Flu Trends:
Detecting outbreaks two weeks
ahead of CDC data
 New models are estimating
which cities are most at risk
for spread of the Ebola virus.
 Prediction model is built on
various data sources, types and
PageRank: The web as a behavioral dataset
Sponsored search
 Google revenue is around $50 bn/year from
marketing, 97% of the company’s revenue.
 Sponsored search uses an auction – a pure
competition for marketers trying to win
access to consumers.
 In other words, a competition for models of
consumers – their likelihood of responding to
the ad – and of determining the right bid for
the item.
 There are around 30 billion search requests a
month. Perhaps a trillion events of history
between search providers.
 Google AdWords and AdSense
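The idea that the auction is a competition between models of consumers can be illustrated with a simplified ranking: each ad is scored by its bid multiplied by the estimated likelihood that the consumer responds. This is a sketch under stated assumptions, not a description of how Google actually runs its auction. The class, the field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    advertiser: str
    bid: float      # hypothetical amount offered per response (click)
    p_click: float  # hypothetical model estimate that the consumer responds


def expected_value(ad: Ad) -> float:
    """Expected payment per impression: bid times estimated response likelihood."""
    return ad.bid * ad.p_click


def rank_ads(ads):
    """Order ads from highest to lowest expected value."""
    return sorted(ads, key=expected_value, reverse=True)


# Invented example data.
ads = [Ad("A", 2.00, 0.01), Ad("B", 1.00, 0.05), Ad("C", 3.00, 0.02)]

for ad in rank_ads(ads):
    print(ad.advertiser, round(expected_value(ad), 4))
```

In this toy ranking advertiser B places above advertiser A despite bidding half as much, because B's estimated response likelihood is five times higher. A better consumer model can therefore matter as much as a higher bid.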

Other data science applications
 Transaction Databases → Recommender systems
(Netflix), Fraud Detection (Security and Privacy)
 Wireless Sensor Data → Smart Home, Real-time
Monitoring, Internet of Things
 Text Data, Social Media Data → Product Review
and Consumer Satisfaction (Facebook, Twitter,
LinkedIn), E-discovery
 Software Log Data → Automatic Troubleshooting
(Splunk)
 Genotype and Phenotype Data → Epic, 23andMe,
Patient-Centered Care, Personalized Medicine
What can you do with the data?
Traffic Prediction and Earthquake Warning

Crowdsourcing + physical modeling + sensing + data assimilation to produce:

From Alex Bayen, UCB, Director, Institute for Transportation Studies

Who are data scientists?
 Data scientists are a new breed of
analytical data experts who have the
technical skills to solve complex
problems – and the curiosity to explore
what problems need to be solved.
 They find stories, extract knowledge.
They are not reporters
 Data scientists are the key to realizing
the opportunities presented by big data.
They bring structure to it, find
compelling patterns in it, and advise
executives on the implications for
products, processes, and decisions
Duties of Data Scientists.
There's not a definitive job description when it comes
to a data scientist role. But here are a few things
you'll likely be doing:
 Collecting large amounts of unruly data and
transforming it into a more usable format.
 Staying on top of analytical techniques such as
machine learning, deep learning and text analytics.
 Solving business-related problems using data-driven
techniques.
 Communicating and collaborating with both IT and
business.
 Working with a variety of programming languages,
including SAS, R and Python.
 Looking for order and patterns in data, as well as
spotting trends that can help a business’s bottom line.
 Having a solid grasp of statistics, including
statistical tests and distributions.
What are the tools of Data Scientists?
 Data visualization: the presentation of data
in a pictorial or graphical format so it can
be easily analyzed.
 Pattern recognition: technology that
recognizes patterns in data (often used
interchangeably with machine learning).
 Machine learning: a branch of artificial
intelligence based on mathematical
algorithms and automation.
 Data preparation: the process of
converting raw data into another format so
it can be more easily consumed.
 Deep learning: an area of machine learning
research that uses data to model complex abstractions.
 Text analytics: the process of examining
unstructured data to glean key business insights.
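Data preparation, as described in the list above, can be sketched as a small cleaning step that turns raw text rows into typed records. The field layout (name, age, city), the sample rows, and the function name are assumptions made for this example only.

```python
def prepare(raw_rows):
    """Convert raw 'name,age,city' strings into clean dict records.

    Rows with the wrong number of fields or a non-numeric age are skipped.
    """
    records = []
    for row in raw_rows:
        parts = [p.strip() for p in row.split(",")]
        if len(parts) != 3:
            continue  # malformed row: wrong number of fields
        name, age, city = parts
        if not age.isdigit():
            continue  # malformed row: age is not a whole number
        records.append({"name": name.title(), "age": int(age), "city": city.title()})
    return records


# Invented raw input with inconsistent spacing, casing, and two bad rows.
raw = ["  alice , 30, new york", "bob,unknown,paris", "carol,41", "dave,25,  delhi "]
clean = prepare(raw)
print(clean)
```

Only the first and last rows survive here. Dropping bad rows silently is a simplification; in practice one would usually count or log the rejected rows so that data-quality problems stay visible.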
Companies that use Data Science.
 Accenture
 Fidelity Investments
 Bank of America
 Google
 Facebook
 Tata Consultancy Services
 Intel
 Many more…
Contrast between Databases and Data Science
                Databases               Data Science
Data Value      “Precious”              “Cheap”
Data Volume     Modest                  Massive
Examples        Bank records,           Online clicks,
                Personnel records,      GPS logs,
                Census,                 Tweets,
                Medical records         Building sensor readings
Priorities      Consistency,            Speed,
                Error recovery,         Availability,
                Auditability            Query richness
Structured      Strongly (Schema)       Weakly or none (Text)
Properties      Transactions, ACID*     CAP* theorem (2/3),
                                        eventual consistency
Realizations    SQL                     NoSQL:
                                        MongoDB, CouchDB,
                                        HBase, Cassandra, Riak,
                                        Apache River, …
Contrast: Machine Learning v/s Data Science
Machine Learning                    Data Science
Develop new (individual) models     Explore many models, build and tune hybrids
Prove mathematical properties       Understand empirical properties
of models                           of models
Improve/validate on a few,          Develop/use tools that can
relatively clean, small datasets    handle massive datasets
Publish a paper                     Take action!

Requirements for being a Data Scientist.
 Mathematics and Applied Mathematics
 Applied Statistics/Data Analysis
 Solid Programming Skills (R, Python, Julia, SQL)
 Data Mining
 Database Storage and Management
 Machine Learning and discovery
Data Science – A Visual Definition