Machine Learning
Topics
- Machine learning intro
- Learning is density estimation
- The curse of dimensionality
Machine Learning
& Bayesian Statistics
Statistics
How does machine learning work?
- Learning: learn a probability distribution
- Classification: assign probabilities to data
Application Scenario:
- Automatic scales at the supermarket
- Detect the type of fruit using a camera
[Figure: camera above the scale; display reads "Banana 1.25 kg", "Total 13.15"]
Learning Probabilities
Toy Example:
- We want to distinguish pictures of bananas from pictures of oranges
Learning Probabilities
Very simple algorithm:
- Compute the average color of each picture
- Learn the distribution of these colors
[Figure: banana and orange examples plotted by average color in the red/green plane]
Simple Learning
Simple learning algorithms:
- Histograms
- Fitting Gaussians
Both work well in low dimensions (dim = 2..3).
[Figure: Gaussians fitted to the banana and orange clusters in the red/green plane]
Learning Probabilities
[Figure: red/green plane with banana and orange clusters and the banana-orange decision boundary; points are classified with probabilities such as orange (p = 95%), banana (p = 90%), and an uncertain banana (p = 51%) near the boundary]
Machine Learning
Very simple idea:
- Collect data
- Estimate a probability distribution
- Use the learned probabilities for classification (etc.)
- We always decide for the most likely case (largest probability)
Easy to see: if the probability distributions are known exactly, this decision rule is optimal, i.e., it minimizes the expected classification error. A minimal sketch of the pipeline follows below.
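As a sketch of this pipeline on hypothetical average-color data (assuming numpy and scipy are available), one could fit a Gaussian per class and decide by the highest density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical training data: (red, green) average colors per image.
rng = np.random.default_rng(0)
bananas = rng.normal([0.8, 0.8], 0.05, size=(100, 2))
oranges = rng.normal([0.9, 0.5], 0.05, size=(100, 2))

# "Learn the distribution": fit one Gaussian per class.
models = {
    "banana": multivariate_normal(bananas.mean(axis=0), np.cov(bananas.T)),
    "orange": multivariate_normal(oranges.mean(axis=0), np.cov(oranges.T)),
}

def classify(color):
    # Decide for the most likely case (largest density).
    return max(models, key=lambda fruit: models[fruit].pdf(color))

print(classify([0.82, 0.78]))  # -> banana
```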
Features:
- Average color: 3D (RGB)
- Full image: 30 000 dimensions
[Figure: learning in the red/green plane (dim = 2..3) vs. learning directly on full images]
Neural Networks
- Features as input
- Combine basic non-linear functions with weights $w_1, w_2, \dots$
- Optimize the weights to yield the outputs $(1,0)$ on bananas and $(0,1)$ on oranges
- This fits the decision boundary to the data
[Figure: network of non-linear functions mapping inputs to outputs]
Neural Networks
[Figure: network with layers $l_1, l_2, \dots$ and a bottleneck between inputs and outputs]
[Figure: training set with the best separating hyperplane]
Example mapping from original space to feature space:
$$(x, y) \mapsto (x^2, xy, y^2)$$
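As a small illustration of why such a mapping helps (hypothetical data, assuming numpy): points inside vs. outside a circle are not linearly separable in the original space, but become separable by a hyperplane after the quadratic mapping:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(200, 2))
labels = pts[:, 0]**2 + pts[:, 1]**2 < 0.5   # inside/outside a circle

# Quadratic feature map (x, y) -> (x^2, xy, y^2).
def feature_map(p):
    x, y = p[:, 0], p[:, 1]
    return np.stack([x**2, x * y, y**2], axis=1)

phi = feature_map(pts)
# In feature space the classes are separated by the hyperplane z1 + z3 = 0.5,
# i.e. a purely linear decision rule recovers the labels exactly.
pred = phi[:, 0] + phi[:, 2] < 0.5
print((pred == labels).mean())  # -> 1.0
```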
Learning Tasks
Examples of machine learning problems:
- Pattern recognition
  - Single class (banana / non-banana)
  - Multi-class (banana, orange, apple, pear)
  - How-to: density estimation; deciding for the highest density minimizes the risk
- Regression
  - Fit a curve to sparse data
  - How-to: curve with parameters, density estimation for the parameters
Supervision
- Supervised learning: the training set is labeled
- Semi-supervised: part of the training set is labeled
- Unsupervised: no labels; find structure on your own (clustering)
- Reinforcement learning: learn from experience (losses/gains; robotics)
Principle
- Model (hypothesis) with parameters $\theta_1, \theta_2, \dots, \theta_d$
- Fit the parameters to the training set
Inference: given the learned density $p(x)$, report either
- the maximum of the distribution (maximum density: maximum likelihood or maximum a posteriori), or
- the mean of the distribution.
[Figure: density $p(x)$ over $x$ with its maximum and mean marked]
Bayesian Models
Scenario:
- Customer picks a banana ($X = 0$) or an orange ($X = 1$)
Modeling:
- Given the image $D$ (observed), what was $X$ (latent)?
$$P(X \mid D) = \frac{P(D \mid X)\,P(X)}{P(D)} \sim P(D \mid X)\,P(X)$$
Bayesian Models
Model for estimating $X$:
$$\underbrace{P(X \mid D)}_{\text{posterior}} \sim \underbrace{P(D \mid X)}_{\text{data term, likelihood}} \cdot \underbrace{P(X)}_{\text{prior}}$$
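A tiny numeric sketch of this rule on the fruit scenario (all numbers hypothetical):

```python
# Hypothetical likelihoods of the observed image under each fruit,
# e.g. from learned color Gaussians as above.
likelihood = {"banana": 0.030, "orange": 0.010}
# Prior: how frequently each fruit is bought (hypothetical).
prior = {"banana": 0.7, "orange": 0.3}

# Posterior ~ likelihood * prior, normalized over the fruits.
unnorm = {f: likelihood[f] * prior[f] for f in likelihood}
z = sum(unnorm.values())
posterior = {f: p / z for f, p in unnorm.items()}
print(posterior)  # {'banana': 0.875, 'orange': 0.125}
```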
Comprehensive (generative) model:
- Learn $P(\text{img} \mid \text{fruit})$ (data term)
- Learn $P(\text{fruit})$ (prior: frequency of fruits)
- Compute $P(\text{fruit} \mid \text{img})$ via Bayes' rule
Properties: models the whole phenomenon.

Discriminative model:
- Ignore $P(\text{img} \mid \text{fruit})$
- Ignore $P(\text{fruit})$ (frequency of fruits)
- Learn $P(\text{fruit} \mid \text{img})$ directly
Properties: easier.
- Learn a mapping from phenomenon to explanation
- Not trying to explain / understand the whole phenomenon
Statistical Dependencies
Problem
Estimation problem: the same decomposition
$$\underbrace{P(X \mid D)}_{\text{posterior}} \sim \underbrace{P(D \mid X)}_{\text{data term, likelihood}} \cdot \underbrace{P(X)}_{\text{prior}}$$
now involves images $D$ with 30 000 dimensions.
Reducing Dependencies
Problem:
- $p(x_1, x_2, \dots, x_{10000})$ is too high-dimensional
- $k$ states, $n$ variables: $O(k^n)$ density entries
- General dependencies kill the model
Idea:
- Hand-craft dependencies
- Graphical models
Factorized Models
Pairwise models:
$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{(i,j)\ \text{neighbors}} \psi_{i,j}(x_i, x_j)$$
Model complexity: $O(nk^2)$ parameters instead of $O(k^n)$.
[Figure: chain of variables with pairwise potentials $\psi_{1,2}, \psi_{2,3}, \dots$]
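A quick sketch of why this factorization helps (assuming numpy), counting parameters and evaluating a small pairwise chain:

```python
import numpy as np

k, n = 5, 12                       # states per variable, number of variables
print(k**n)                        # full joint table: 244_140_625 entries
print((n - 1) * k * k)             # pairwise chain: 275 entries

# Evaluate an (unnormalized) pairwise chain density for one configuration.
rng = np.random.default_rng(0)
psi = rng.uniform(0.1, 1.0, size=(n - 1, k, k))   # potentials psi_{i,i+1}
x = rng.integers(0, k, size=n)                     # one configuration
p_unnorm = np.prod([psi[i, x[i], x[i + 1]] for i in range(n - 1)])
print(p_unnorm)
```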
Graphical Models
Markov random fields:
- Factorize the density into local cliques
- In the graphical model, directly dependent variables are connected
Formal model: conditional independence.
[Figure: grid of variables with pairwise potentials $\psi_{1,2}, \psi_{2,3}, \dots$]
Graphical Models
Conditional independence:
- A node is conditionally independent of all other nodes given its direct neighbors in the graph
Theorem (Hammersley-Clifford): given the conditional independences as a graph, a (positive) density factorizes into potentials over the cliques of the graph.
[Figure: graph with a node and its neighbors highlighted]
Texture Synthesis
Idea:
- One or more images as examples
- Learn image statistics
- Use the knowledge: specify boundary conditions, fill in the texture
[Figure: example data, boundary conditions, region selected for completion]
Image statistics: how a pixel is colored depends on its neighborhood.
[Figure: pixel with its neighborhood highlighted]
Simplification
Problem: statistical dependencies.
- A simple model can express dependencies between all kinds of combinations of pixels
However: regions overlap.
[Figure: pixel neighborhoods overlapping across the image]
Texture Synthesis
Use for texture synthesis:
$$p(\mathbf{x}) = \frac{1}{Z} \prod_{i=1}^{w} \prod_{j=1}^{h} \psi_{i,j}\big(\mathbf{x}_{N_{i,j}}\big), \qquad \mathbf{x}_{N_{i,j}} = \big(x_{i,j}, \dots, x_{i+\Delta,\, j+\Delta}\big)$$
with potentials that favor neighborhoods close to patches of the example data:
$$\psi_{i,j} \sim \exp\!\left(-\frac{d^2}{2\sigma^2}\right)$$
where $d$ measures the distance between the local neighborhood and the best-matching patch in the examples.
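A minimal sketch of this idea, using greedy neighborhood matching in the spirit of Efros-Leung rather than full MRF inference (grayscale, hypothetical sizes, assuming numpy):

```python
import numpy as np

def synthesize(example, out_h=8, out_w=8, half=1, seed=0):
    # Greedy, causal neighborhood matching: for each new pixel, compare the
    # already-synthesized pixels above it with every location in the example
    # and copy the best match. Minimizing the patch distance d corresponds to
    # maximizing the potential exp(-d^2 / (2 sigma^2)) pixel by pixel.
    ex = np.asarray(example, dtype=float)
    rng = np.random.default_rng(seed)
    out = np.zeros((out_h, out_w))
    # Seed the first rows and columns with a random crop of the example.
    i0 = int(rng.integers(0, ex.shape[0] - out_h))
    j0 = int(rng.integers(0, ex.shape[1] - out_w))
    out[:half, :] = ex[i0:i0 + half, j0:j0 + out_w]
    out[:, :half] = ex[i0:i0 + out_h, j0:j0 + half]
    for i in range(half, out_h):
        for j in range(half, out_w):
            target = out[i - half:i, j - half:j + 1]   # already-filled pixels
            best_val, best_d = 0.0, np.inf
            for a in range(half, ex.shape[0]):
                for b in range(half, ex.shape[1] - 1):
                    cand = ex[a - half:a, b - half:b + 1]
                    d = np.sum((cand - target) ** 2)
                    if d < best_d:                      # greedy: best match only
                        best_val, best_d = ex[a, b], d
            out[i, j] = best_val
    return out

# Usage on a tiny hypothetical example texture (a checkerboard):
example = np.tile(np.array([[0.0, 1.0], [1.0, 0.0]]), (16, 16))
print(synthesize(example))
```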
Inference
Inference problem:
- Computing $p(x)$ is trivial for known $x$
- Finding the $x$ that maximizes $p(x)$ is very complicated
- In general: NP-hard
- No efficient solution known (not even for the image case)
In practice:
- Different approximation strategies
- Approximation only; can run into bad local minima
Learning Theory
Problem: Overfitting
Two steps:
- Learn the model on training data
- Use the model on more data (test data)
Overfitting: the model fits the training data well but generalizes poorly to the test data.
Learning Probabilities
[Figure: red/green plane with banana and orange clusters and several possible banana-orange decision boundaries, all consistent with the training data]
Regression Example
Housing Prices in Springfield
[Figure: house prices (100 K to 600 K) over the years 1960-2010, with candidate regression curves fitted to the data points]
Disclaimer: numbers are made up; this is not investment advice.
Regression Example
[Figure: the same price data annotated with events: oil crisis (recession), great recession starts, up again, housing bubble]
Disclaimer: numbers are made up; this is not investment advice.
Variance: bad generalization performance.
Model Selection
How to choose the right model? For example:
- Linear
- Quadratic
- Higher order
Cross Validation
Housing Prices in Springfield
[Figure: the data set is split; each candidate model is fitted to a training subset and evaluated on the held-out points, repeated across splits]
Disclaimer: numbers are made up; this is not investment advice.
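A minimal sketch of model selection by k-fold cross-validation on hypothetical price data (assuming numpy), comparing linear, quadratic, and higher-order fits:

```python
import numpy as np

rng = np.random.default_rng(0)
years = rng.uniform(0, 50, size=40)                       # years since 1960
prices = 200 + 4.0 * years + rng.normal(0, 30, size=40)   # price in K, noisy linear trend

def cv_error(degree, x, y, folds=5):
    # Average squared validation error over k folds: fit on the training
    # part, evaluate on the held-out part, repeat across splits.
    idx = np.random.default_rng(1).permutation(len(x))
    errs = []
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

for degree in [1, 2, 5]:   # linear, quadratic, higher order
    print(degree, cv_error(degree, years, prices))
# The degree with the lowest validation error is selected; with this
# hypothetical data the linear fit should win.
```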
Looking for:
- A hypothesis $h$ that works everywhere on the data distribution
- 1 MPixel photos: $256^{1\,000\,000}$ possible data items
- Cannot cover everything with examples
Assumption:
- No prior information
- All true labeling functions are equally likely
Consequences:
- Without prior knowledge, the expected off-training-set error is the same for every learning algorithm (under these assumptions)
- No truly fully automatic machine learning
Example: Regression
Housing Prices in Springfield
[Figure: several very different curves that all fit the training points equally well but disagree away from the training data]
Erratic fits vs. smooth densities; in this case: Gaussians.
Solution
- Choose the one with the higher likelihood
Significance test
- For example: does a new drug help?
- $h_0$: just a random outcome
- Show that $P(h_0)$ is small
Complex Models
Example: polynomial fitting, $d$ continuous parameters:
$$f(x) = \sum_{i=0}^{d-1} a_i x^i$$
[Figure: high-order polynomial fitted to the housing-price data]
Significance?
Simple criterion:
- The model must be able to predict the training data
- An order-$(d-1)$ polynomial can always fit $d$ points perfectly
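A quick check of this claim (assuming numpy): a degree-$(d-1)$ polynomial interpolates any $d$ points exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
x = np.sort(rng.uniform(-1, 1, size=d))    # d distinct points
y = rng.normal(size=d)                      # arbitrary targets

coef = np.polyfit(x, y, deg=d - 1)          # d coefficients = degree d-1
print(np.max(np.abs(np.polyval(coef, x) - y)))  # ~1e-12: perfect training fit
```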
Simple Model
Single hypothesis:
- $h: X \to \{0, 1\}$, maps features to decisions
- Ground truth $t: X \to \{0, 1\}$, the correct labeling
- Stream of data $(x_i, t_i) \sim D$ with $t_i = t(x_i)$, drawn i.i.d. from a fixed distribution $D$
- Expected error: $E(h) = P_{x \sim D}\big(h(x) \neq t(x)\big)$
Simple Model
Empirical vs. true error:
- Infinite stream $(x_i, t_i) \sim D$, drawn i.i.d.
- Finite training set $\{(x_1, t_1), \dots, (x_n, t_n)\} \sim D$, drawn i.i.d.
- Expected error: $E(h) = P_{x \sim D}\big(h(x) \neq t(x)\big)$
- Empirical error (training error): $\hat{E}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big[h(x_i) \neq t_i\big]$
- Bound (Hoeffding): $P\big(|E(h) - \hat{E}(h)| > \epsilon\big) \le 2\exp(-2n\epsilon^2)$
Simple Model
Empirical vs. true error:
- Finite training set $\{(x_1, t_1), \dots, (x_n, t_n)\} \sim D$, drawn i.i.d.
- Training error bound: $P\big(|E(h) - \hat{E}(h)| > \epsilon\big) \le 2\exp(-2n\epsilon^2)$
Result: the reliability of the assessment of the hypothesis quality grows exponentially with the number of training examples $n$.
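A quick numeric illustration of this bound (plain Python): how the failure probability shrinks with $n$ for a fixed tolerance $\epsilon = 0.05$:

```python
import math

eps = 0.05
for n in [100, 1000, 10000]:
    bound = 2 * math.exp(-2 * n * eps**2)
    print(n, bound)
# 100   -> 1.21...   (vacuous)
# 1000  -> 0.0135    (useful)
# 10000 -> 3.9e-22   (essentially certain)
```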
Machine Learning
- We have multiple hypotheses
- Hypothesis set $H = \{h_1, \dots, h_k\}$
- Need a bound on the generalization error estimate that holds for all hypotheses simultaneously
Machine Learning
Result: after $n$ training examples, we know the true error of every $h \in H$ up to
$$\epsilon = \sqrt{\frac{1}{2n} \log \frac{2k}{\delta}}$$
Equivalently, reaching accuracy $\epsilon$ requires
$$n \ge \frac{1}{2\epsilon^2} \log \frac{2k}{\delta}$$
training examples (logarithmic in $k$). With probability $1 - \delta$, for all $h \in H$:
$$|E(h) - \hat{E}(h)| \le \sqrt{\frac{1}{2n} \log \frac{2k}{\delta}}$$
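A small numeric sketch (plain Python) of how weakly the required $n$ depends on the number of hypotheses $k$ (here $\epsilon = 0.05$, $\delta = 0.05$):

```python
import math

eps, delta = 0.05, 0.05
for k in [10, 1000, 10**6, 10**9]:
    n = math.log(2 * k / delta) / (2 * eps**2)
    print(k, math.ceil(n))
# k grows by eight orders of magnitude, n grows only about fourfold.
```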
Pick the hypothesis with the smallest training error:
$$h^* = \arg\min_{i = 1..k} \hat{E}(h_i)$$
Trade-off:
$$E(h^*) \le \min_{h \in H} E(h) + 2\sqrt{\frac{1}{2n} \log \frac{2k}{\delta}}$$
- Bias: a small hypothesis set may not contain a good hypothesis (first term)
- Variance: the second term grows with the size $k$ of the hypothesis set
Generalization
- The theory can be generalized to multi-class learning, regression, etc.
Conclusion
Two theoretical insights:
- No free lunch: without prior knowledge, the expected off-training-set error is the same for every learning algorithm
Conclusion
Two theoretical insights:
- There is no contradiction here (between the no-free-lunch theorem and the generalization bounds)
  - Still, some non-training points might be misclassified all the time
  - But they cannot show up frequently
- We have to choose the hypothesis set
  - Infinite capacity leads to unbounded error
- Thus: we do need prior knowledge
Conclusions
Machine Learning
- Is basically density estimation
Curse of dimensionality:
- High dimensionality makes things intractable
- Model dependencies to fight the problem
Conclusions
No free lunch:
- You can only learn when you already know something
- Math won't tell you where that knowledge initially came from
Significance:
- Beware of overfitting!
- Need to adapt the plasticity of the model to the available training data