Data Science 101
Arik Pelkey
Pentaho Senior Director – Product Marketing, Hitachi Vantara
Scott Cooley
Pentaho Data Scientist, Hitachi Vantara
Agenda
This session will provide an introduction to data science fundamentals.
• What is Data Science?
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future
AI, Machine Learning, and Deep Learning

• AI: Getting machines to do what humans are good at
• Machine Learning: Feeding an algorithm data to learn and predict something
• Deep Learning: A type of machine learning

Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/.


Data Science: Solving Problems with Data

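[Venn diagram of three overlapping skill sets, with DATA SCIENCE at the center]
• HACKING SKILLS: Computer science, data engineering and wrangling, coding
• MATH AND STATISTICS KNOWLEDGE: Algorithms and numerical techniques to derive insights
• SUBSTANTIVE EXPERIENCE: Domain knowledge, business acumen, experience, value to the business
• Overlaps: Machine Learning (hacking skills + math and statistics), Danger Zone! (hacking skills + substantive experience), Traditional Research (math and statistics + substantive experience)
• Understanding of the underlying assumptions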

Diagram from Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.


What’s all the fuss?
These techniques were created many years ago

• Bayes' Theorem: Thomas Bayes, mid-1700s
• Regression: Legendre and Gauss, early 1800s; Galton, late 1800s
• Neural Networks: McCulloch and Pitts, early 1940s



Think about All Our Data and Compute
It is still GROWING!

SKA (Square Kilometer Array Telescope), 2020: Will generate as much data in a day as the entire PLANET does in a year!
https://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/.
Types of Machine Learning


• Regression – Looking for a statistical relationship across variables that may give us an estimate of a particular outcome.
• Classification – Similar to regression, but looking for separations in the data given predefined classes. (Supervised)
• Clustering – No predefined classes; trying to find groups or sets based upon the data at hand. (Unsupervised)
• Anomaly Detection – Identification of outliers based upon expected ranges of data.
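
As a rough illustration of anomaly detection "based upon expected ranges of data", the sketch below flags values that fall outside an interquartile-range fence. The fence rule (1.5 × IQR), the function name `find_outliers`, and the sensor readings are all assumptions made for this example; the slides do not prescribe a particular method.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values outside the expected range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings; one value sits far away from the rest.
readings = [20.1, 19.8, 20.5, 21.0, 20.3, 19.9, 95.0, 20.7]
outliers = find_outliers(readings)
print(outliers)
```

Widening `k` makes the expected range more tolerant; narrowing it flags more points.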



Labelled vs Unlabelled
Let's say we want to classify houses by size, given a set of features (the feature set).

FullBath  HalfBath  Bedrooms  Home Age  Size Label
1         0         2         56        M
1         1         3         59        L
2         1         3         20        M
2         1         3         19        S

Supervised Learning
Use the labels to build a model. The model is then used to classify a new house's size based ONLY on the known feature set.

Unsupervised
SIZE is missing! We need to look for similarities in the data and group them into clusters.
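
The difference between the two settings can be sketched with the four houses from the table. This is only an illustrative sketch: the choice of a 1-nearest-neighbour classifier for the supervised case and a plain k-means loop for the unsupervised case, the function names `classify` and `cluster`, and the new house being classified are assumptions for this example, not methods named in the slides.

```python
import math

# Feature rows from the table: (FullBath, HalfBath, Bedrooms, Home Age)
features = [(1, 0, 2, 56), (1, 1, 3, 59), (2, 1, 3, 20), (2, 1, 3, 19)]
labels = ["M", "L", "M", "S"]  # Size Label column

def classify(new_house):
    """Supervised: return the label of the closest labelled house (1-NN)."""
    nearest = min(range(len(features)),
                  key=lambda i: math.dist(features[i], new_house))
    return labels[nearest]

def cluster(rows, k=2, rounds=10):
    """Unsupervised: plain k-means; the size labels are never used."""
    # Assumption: start from the first k rows so the result is deterministic.
    centroids = [tuple(float(v) for v in row) for row in rows[:k]]
    assignment = []
    for _ in range(rounds):
        # Assign each row to its nearest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(row, centroids[c]))
                      for row in rows]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment

predicted = classify((2, 1, 3, 21))  # hypothetical new house
groups = cluster(features)
print(predicted, groups)
```

With unscaled features, Home Age dominates the distance, so the two older houses land in one cluster and the two newer houses in the other. In practice the features would usually be scaled first.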
More on Machine Learning
Machine Learning is a methodology for creating a model from sample data and then
using that model to make a prediction or choose a strategy algorithmically.
SUPERVISED LEARNING MODEL
• Historical records that contain square feet, number of bathrooms, zip code….
• Records that contain the price the house sold for
• Iterate the algorithm over the combined data to train the model
• Use the trained model to predict outcome on new records
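
The train-then-predict flow above can be sketched in a few lines. This is a minimal sketch under stated assumptions: it uses a single feature (square feet), fits a straight line with closed-form least squares rather than an iterative training loop, and the records and the names `train` and `predict` are hypothetical.

```python
def train(square_feet, prices):
    """Fit price = slope * square_feet + intercept by ordinary least squares."""
    n = len(square_feet)
    mean_x = sum(square_feet) / n
    mean_y = sum(prices) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, prices))
    sxx = sum((x - mean_x) ** 2 for x in square_feet)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, square_feet):
    """Apply the trained model to a new record."""
    slope, intercept = model
    return slope * square_feet + intercept

# Hypothetical historical records: features and the price each house sold for.
history_sqft = [1000, 1500, 2000, 2500]
history_price = [200_000, 290_000, 410_000, 500_000]

model = train(history_sqft, history_price)  # train on the combined data
estimate = predict(model, 1800)             # predict for a new record
print(round(estimate))
```

A real model would use more features (bathrooms, zip code) and hold out some records to check how well the predictions generalize.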
The Data Science Process: Getting from Raw Data to Outcomes
• Formal Framework: CRISP-DM (Cross Industry Standard Process for Data Mining)
• The Data Science Workflow: created by Joe Blitzstein and Hanspeter Pfister for the Harvard Data Science course.
Specialist Traditional Data Science Team
Data Scientist (DS)
– Prepares data and engineers features. Most valuable skill: training models.
Data Engineer (DE)
– Data acquisition focus. Builds data pipelines. A 5:1 DE:DS ratio is not uncommon.
Data Analyst (DA)
– Assists the DS with data prep.
Application Architect (AA)
– Designs the complete solution; deploys and maintains models in production.
Mythical Creatures
Trends
• Automation
• Tools for Citizen Data Scientists
• Pre-trained models in the cloud



Hiring Guidance



Defining Success
• Easy for the tangible
– Search order optimization
– Recommendation engine or CTR
• Hard for others
– Lead scoring
– Attrition

• Try to measure direct outcomes


• Rarely a silver bullet
• Think ROI



Typical Data Science Project

[Timeline diagram mapping roles to project stages]
Stages, in order:
• Understand business objectives
• ID and procure training data
• Prepare data and build new features
• Train model
• Deploy models
• Update models
Roles shown across the stages: DS, DE, DA, AA
Preventive Maintenance: Caterpillar Marine Asset Intelligence
[Architecture diagram]
• Data sources: Fleet Data via Satellite; Local Equipment Sensor and Server Data; Cross Department Operations Data (Scheduling/ERP)
• Processing: Data Integration, Data Marts
• Data Scientist: Data Mining and Predictive Maintenance
• Business User (COO): Reporting on Operations and Efficiency
• Onboard and Onshore: Dashboards and Reports on Machine Performance
The Future
• Scaling up / enabling more data scientists
• Model management
• Improved productivity
• Support for containerized applications.



Pentaho ML Orchestration

• Makes data science teams more productive
• Broad support for open source libraries in various languages
Summary
• What is Data Science?
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future
Next Steps
Want to learn more?
• Schedule a Meet the Expert
• Read Mark Hall’s Machine Learning with Pentaho Blog