
Scikit-learn for easy machine learning:

the vision, the tool, and the project


Gaël Varoquaux

scikit

machine learning in Python


1 Scikit-learn: the vision

An enabler

Machine learning for everybody and for everything
Machine learning without learning the machinery

G Varoquaux 2
Machine learning in a nutshell

Machine learning is about making predictions from data

G Varoquaux 3
1 Machine learning: a historical perspective

Artificial Intelligence    the 80s
Building decision rules
[Diagram: hand-built decision rules — Eatable? Tall? Mobile?]

Machine learning           the 90s
Learn these from observations

Statistical learning       2000s
Model the noise in the observations

Big data                   today
Many observations, simple rules

“Big data isn’t actually interesting without machine learning”
                           Steve Jurvetson, VC, Silicon Valley

G Varoquaux 4
1 Machine learning in a nutshell: an example

Face recognition

Andrew   Bill   Charles   Dave
[New face image → ?]

G Varoquaux 5
1 Machine learning in a nutshell

A simple method:
1 Store all the known (noisy) images and the names that go with them.
2 For a new (noisy) image, find the stored image that is most similar.

“Nearest neighbor” method

How many errors on already-known images?
... 0: no errors

Test data ≠ Train data

G Varoquaux 6
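The “nearest neighbor” method above maps directly onto scikit-learn’s KNeighborsClassifier. A minimal sketch on toy 2D points standing in for images (the data and names are illustrative, not the face-recognition setup from the slide):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy "images": known samples with the names that go with them
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array(["Andrew", "Andrew", "Bill", "Bill"])

# 1) Store all the known samples and their labels
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

# 2) For a new (noisy) sample, find the most similar stored one
prediction = clf.predict([[4.8, 5.2]])
```

As the slide notes, this method makes zero errors on the already-known samples, which is exactly why train error alone is misleading.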
1 Machine learning in a nutshell: regression

A single descriptor: one dimension
[Figure: fitting y as a function of x — which model to prefer?]

Problem of “over-fitting”
Minimizing error is not always the best strategy (learning noise)
Test data ≠ train data

Prefer simple models
= concept of “regularization”
Balance the number of parameters to learn with the amount of data
(the bias-variance tradeoff)

Two descriptors: 2 dimensions
[Figure: fitting y as a function of x1 and x2]
More parameters ⇒ need more data
“curse of dimensionality”

G Varoquaux 7
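The over-fitting point can be made concrete with polynomial regression of increasing degree (a minimal sketch; the toy sine data and the degrees 1/3/9 are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + 0.3 * rng.randn(30)   # noisy observations
X = x[:, None]

# Test data != train data: hold out every other point
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

scores = {}
for degree in (1, 3, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    scores[degree] = (model.score(X_train, y_train),   # train R^2
                      model.score(X_test, y_test))     # test R^2

# More parameters always fit the train data at least as well,
# but past some complexity the model starts learning the noise
# and the test score is what suffers.
```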
1 Machine learning in a nutshell: classification

Example:
recognizing hand-written digits

Represent each digit with 2 numerical features
[Figure: digits plotted in the (X1, X2) feature plane]

It’s about finding separating lines

G Varoquaux 8
1 Machine learning in a nutshell: models

1 staircase: fit with a staircase of 10 constant values
2 staircases combined: fit a new staircase on the errors
3 staircases combined: keep going
300 staircases combined: boosted regression trees

Complexity trade-offs
Computational + statistical

G Varoquaux 9
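The staircase-combining procedure above is gradient boosting on regression trees. A minimal sketch (toy 1D sine data; the depth and number of trees are illustrative — a shallow tree is the piecewise-constant “staircase”):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, 200))[:, None]
y = np.sin(X.ravel()) + 0.1 * rng.randn(200)

# 1 staircase: a single shallow tree gives a piecewise-constant fit
one_tree = GradientBoostingRegressor(n_estimators=1, max_depth=3,
                                     random_state=0).fit(X, y)

# 300 staircases combined: each new tree is fit on the remaining errors
boosted = GradientBoostingRegressor(n_estimators=300, max_depth=3,
                                    random_state=0).fit(X, y)
```

Each added tree corrects the residuals of the current combination, so the fit keeps improving as staircases are combined.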
1 Machine learning in a nutshell: unsupervised

Stock market structure

Unlabeled data is more common than labeled data

G Varoquaux 10
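Unsupervised learning in scikit-learn uses the same estimator pattern, but fits on the data alone. A minimal clustering sketch on toy blobs (illustrative data, not the stock-market example from the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Unlabeled data: two well-separated blobs of points
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])

# No y is passed: structure is discovered from X alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```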
Machine learning

Mathematics and algorithms for fitting predictive models

Regression · Classification · Unsupervised...
[Figure: example regression fit, y vs x]

Notions of overfit, test error,
regularization, model complexity

G Varoquaux 11
Machine learning is everywhere

Image recognition

Marketing (click-through rate)

Movie / music recommendation

Medical data

Logistics chains (e.g. supermarkets)

Language translation

Detecting industrial failures

G Varoquaux 12
Why another machine learning package?

G Varoquaux 13
Real statisticians use R
And real astronomers use IRAF

Real economists use Gauss

Real coders use C / assembler

Real experiments are controlled in Labview

Real Bayesians use BUGS / Stan

Real text processing is done in Perl


Real deep learning is best done with Torch (Lua)

And medical doctors only trust SPSS


G Varoquaux 14
1 My stack

Python, what else?
General purpose
Interactive language
Easy to read / write

The scientific Python stack: numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared with C / Fortran

Connecting to scipy, scikit-image, pandas, ...
It’s about plugging things together

Being Pythonic and SciPythonic

G Varoquaux 15
1 scikit-learn vision

Machine learning for all


No specific application domain
No requirements in machine learning

High-quality Pythonic software library


Interfaces designed for users

Community-driven development
BSD licensed, very diverse contributors

http://scikit-learn.org
G Varoquaux 16
1 Between research and applications

Machine learning research


Conceptual complexity is not an issue
New and bleeding edge is better
Simple problems are old science

In the field
Tried and tested (aka boring) is good
Little sophistication from the user
API is more important than maths

Solving simple problems matters


Solving them really well matters a lot
G Varoquaux 17
2 Scikit-learn: the tool
A Python library for machine learning

© Theodore W. Gray
G Varoquaux 18
2 A Python library

A library, not a program


More expressive and flexible
Easy to include in an ecosystem

As easy as py
from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)

G Varoquaux 19
2 API: specifying a model

A central concept: the estimator


Instantiated without data
But specifying the parameters

from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier(
    n_neighbors=2)

G Varoquaux 20
2 API: training a model

Training from data


estimator.fit(X_train, Y_train)

with:
X a numpy array with shape n_samples × n_features
y a numpy 1D array, of ints or floats, with shape

G Varoquaux 21
2 API: using a model

Prediction: classification, regression


Y_test = estimator.predict(X_test)

Transforming: dimension reduction, filter


X_new = estimator.transform(X_test)

Test score, density estimation


test_score = estimator.score(X_test)

G Varoquaux 22
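Putting the three API steps together on scikit-learn’s bundled digits data (a minimal sketch; `sklearn.model_selection` is the modern home of `train_test_split` — at the time of this talk it lived in `sklearn.cross_validation`):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

estimator = KNeighborsClassifier(n_neighbors=2)   # specify the model
estimator.fit(X_train, y_train)                   # train from data
y_pred = estimator.predict(X_test)                # predict on new data
test_score = estimator.score(X_test, y_test)      # evaluate on held-out data
```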
2 Vectorizing

From raw data to a sample matrix X

For text data: counting word occurrences
- Input data: list of documents (strings)
- Output data: numerical matrix

from sklearn.feature_extraction.text import HashingVectorizer

hasher = HashingVectorizer()
X = hasher.fit_transform(documents)

G Varoquaux 23
2 Scikit-learn: very rich feature set

Supervised learning
Decision trees (Random-Forest, Boosted Tree)
Linear models
SVM

Unsupervised Learning
Clustering
Dictionary learning
Outlier detection

Model selection
Built-in cross-validation
Parameter optimization
G Varoquaux 24
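The model-selection tools follow the same estimator API. A minimal sketch of built-in cross-validated parameter optimization (the grid values and the iris data are illustrative):

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Cross-validated grid search over an SVM hyper-parameter
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)

best_C = search.best_params_["C"]
best_score = search.best_score_
```

`GridSearchCV` is itself an estimator: after `fit` it behaves like the best model found.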
2 Computational performance

             scikit-learn  mlpy   pybrain  pymvpa  mdp    shogun

SVM          5.2           9.47   17.5     11.52   40.48  5.63
LARS         1.17          105.3  -        37.35   -      -
Elastic Net  0.52          73.7   -        1.44    -      -
kNN          0.57          1.41   -        0.56    0.58   1.36
PCA          0.18          -      -        8.93    0.47   0.33
k-Means      1.34          0.79   ∞        -       35.75  0.68

Algorithmic optimizations

Minimizing data copies

G Varoquaux 25
2 Computational performance

[Figure: Random Forest fit time (s) across implementations — scikit-learn (Python, Cython) vs OpenCV (C++), OK3 (C), Weka (Java), randomForest (R, Fortran), Orange (Python). Figure: Gilles Louppe]

G Varoquaux 25
What if the data does not fit in memory?

“Big data”:                  Mere mortals:
Petabytes...                 Gigabytes...
Distributed storage          Python programming
Computing cluster            Off-the-shelf computers

See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns

G Varoquaux 26
2 On-line algorithms

estimator.partial_fit(X_train, Y_train)

Linear models
sklearn.linear_model.SGDRegressor
sklearn.linear_model.SGDClassifier
Clustering
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch (new in 0.16)
PCA (new in 0.16)
sklearn.decomposition.IncrementalPCA

G Varoquaux 27
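Out-of-core learning with `partial_fit` can be sketched as follows (the batches are simulated in memory here and the toy labeling rule is illustrative; with classifiers, the full list of classes must be passed on the first call):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])   # all classes must be declared up front

for _ in range(20):                    # pretend each batch is read from disk
    X_batch = rng.randn(100, 5)
    y_batch = (X_batch[:, 0] > 0).astype(int)   # toy rule to learn
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_check = rng.randn(200, 5)
accuracy = clf.score(X_check, (X_check[:, 0] > 0).astype(int))
```

Only one batch is ever held in memory, so the dataset size is no longer bounded by RAM.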
2 On-the-fly data reduction

Many features
⇒ Reduce the data as it is loaded

X_small = estimator.transform(X_big, y)

Random projections (will average features)
sklearn.random_projection
random linear combinations of the features

Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: a super-pixel strategy

Hashing when observations have varying size (e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel

G Varoquaux 28
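The random-projection reduction above can be sketched as follows (the dimensions are illustrative; each output column is a random linear combination of the input features):

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(0)
X_big = rng.randn(100, 10000)          # many features

# Reduce to random linear combinations of the features
reducer = SparseRandomProjection(n_components=100, random_state=0)
X_small = reducer.fit_transform(X_big)
```

Because the projection matrix is drawn at random rather than learned, the same reduction can be applied to each chunk of data as it is loaded.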
3 Scikit-learn: the project

G Varoquaux 29
3 Having an impact

1% of Debian installs
1200 job offers on stackoverflow

G Varoquaux 30
3 Community-based development in scikit-learn

Huge feature set: the benefit of a large team
Project growth:

More than 200 contributors
∼ 12 core contributors from the start

Estimated cost of development: $6 million
COCOMO model, http://www.ohloh.net/p/scikit-learn

G Varoquaux 31
3 Many eyes make code fast

L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer


G Varoquaux 32
3 6 steps to a community-driven project

1 Focus on quality
2 Build great docs and examples
3 Use GitHub
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power

http://www.slideshare.net/GaelVaroquaux/
scikit-learn-dveloppement-communautaire
G Varoquaux 33
3 Quality assurance
Code review: pull requests
Can include newcomers

We read each other’s code

Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are the names meaningful?
- Are the numerics stable?
- Could it be faster?

G Varoquaux 34
3 Quality assurance
Unit testing
Everything is tested

Great for numerics

Overarching tests enforced on all estimators:
- consistency with the API
- basic invariances
- good handling of various inputs

If it ain’t tested
it’s broken
G Varoquaux 35
Make it work, make it right, make it boring

G Varoquaux 36
3 The tragedy of the commons

Individuals, acting independently and rationally according
to each one’s self-interest, behave contrary to the whole
group’s long-term best interests by depleting some common
resource.
                                                  Wikipedia

+ It’s so hard to scale
User support
Growing codebase

Make it work, make it right, make it boring

Core projects (boring) are taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages

G Varoquaux 37
Scikit-learn

The vision
Machine learning as a means, not an end
Versatile library: the “right” level of abstraction
Close to research, but seeking different tradeoffs

The tool
Simple API uniform across learners
Numpy matrices as data containers
Reasonably fast

The project
Many people working together
Tests and discussions for quality

We’re hiring!

@GaelVaroquaux
