
Introduction to Machine Learning

Linear Regression

Prof. Andreas Krause


Learning and Adaptive Systems (las.ethz.ch)
Announcements
First recordings online:
https://www.video.ethz.ch/lectures/d-infk/2019/spring/252-0220-00L.html

Waitlist updates

2
Basic Supervised Learning Pipeline

[Figure: training data (examples labeled "spam" / "ham") are fed to a learning method, which fits a classifier $f : X \to Y$ (model fitting); the classifier is then applied to test data to predict the unknown labels (prediction / generalization).]
3
Regression
Instance of supervised learning
Goal: Predict real valued labels (possibly vectors)
Examples:
X (features)              Y (label)
Flight route              Delay (minutes)
Real estate objects       Price
Customer & ad features    Click-through probability

4
Running example: Diabetes
[Efron et al ‘04]
Features X:
Age
Sex
Body mass index
Average blood pressure
Six blood serum measurements (S1-S6)
Label (target) Y: quantitative measure of disease
progression

5
Regression
[Scatter plot: training points (x, y) in the plane, marked with "+".]
Goal: learn a real-valued mapping $f : \mathbb{R}^d \to \mathbb{R}$


6
Important choices in regression
What types of functions f should we consider? Examples:

[Two example plots of f(x) against x: different classes of functions fit to the same scattered data points.]
How should we measure goodness of fit?

7
Example: linear regression
[Scatter plot of the training points (x, y); a linear function is fit to them.]

8
Homogeneous representation
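Presumably this refers to the standard trick of absorbing the offset (bias) term into the weight vector by appending a constant feature 1 to every input; a minimal sketch of that construction (the notation $\tilde{x}$, $\tilde{w}$ is illustrative, not from the slide):

$$f(x) = w^\top x + b = \tilde{w}^\top \tilde{x}, \qquad \tilde{x} = \begin{pmatrix} x \\ 1 \end{pmatrix} \in \mathbb{R}^{d+1}, \quad \tilde{w} = \begin{pmatrix} w \\ b \end{pmatrix} \in \mathbb{R}^{d+1}.$$

With this representation the offset is handled like any other weight, so the formulas below can ignore $b$.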

9
Quantifying goodness of fit
$D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.

[Scatter plot: training points (x, y) together with a candidate fit.]

10
Least-squares linear regression optimization
[Legendre 1805, Gauss 1809]
Given data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$

How do we find the optimal weight vector?

$$w^* = \arg\min_{w} \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2$$

11
How to solve? Example: Scikit Learn
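A minimal sketch of what such a call looks like on the diabetes data (my own example; the slide's actual code and settings are not reproduced here):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Diabetes data: 10 features (age, sex, BMI, blood pressure, S1-S6),
# label = quantitative measure of disease progression.
X, y = load_diabetes(return_X_y=True)

model = LinearRegression()   # ordinary least-squares linear regression
model.fit(X, y)

print("weights w:", model.coef_)
print("offset b: ", model.intercept_)
print("prediction for the first patient:", model.predict(X[:1]))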

12
Demo
[Demo plot: disease progression (label) against body mass index (feature) for the diabetes data.]


13
Method 1: Closed form solution
The problem

$$w^* = \arg\min_{w} \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2$$

can be solved in closed form:

$$w^* = \left(X^\top X\right)^{-1} X^\top y$$

Here, $X \in \mathbb{R}^{n \times d}$ is the design matrix whose rows are the $x_i^\top$, and $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$ is the vector of labels.
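A minimal NumPy sketch of this closed-form solution (solving the normal equations instead of explicitly inverting $X^\top X$ is an implementation choice of mine, not from the slide):

import numpy as np

def least_squares_closed_form(X, y):
    # Solves w* = argmin_w ||y - X w||^2 via the normal equations
    # X^T X w = X^T y; equivalent to w* = (X^T X)^{-1} X^T y, but
    # avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage on synthetic data (shapes only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(least_squares_closed_form(X, y))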

14
Method 2: Optimization
The objective function

$$\hat{R}(w) = \sum_{i} \left(y_i - w^\top x_i\right)^2$$

is convex!
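One standard way to see this (the verification is not spelled out on the slide): in matrix form the objective is a quadratic whose Hessian is positive semi-definite,

$$\hat{R}(w) = \|y - Xw\|_2^2, \qquad \nabla^2 \hat{R}(w) = 2\, X^\top X \succeq 0,$$

and a quadratic with positive semi-definite Hessian is convex.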

15
Gradient Descent
Start at an arbitrary $w_0 \in \mathbb{R}^d$.
For $t = 1, 2, \ldots$ do

$$w_{t+1} = w_t - \eta_t \nabla \hat{R}(w_t)$$

Here, $\eta_t$ is called the learning rate (step size).

16
Convergence of gradient descent
Under mild assumptions, if the step size is sufficiently small,
gradient descent converges to a stationary point (gradient = 0).
For convex objectives, it therefore finds the optimal solution!

In the case of the squared loss, gradient descent with a suitable
constant step size converges linearly.
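A standard choice for that constant step size is $\eta = \frac{1}{2\,\lambda_{\max}(X^\top X)}$ (this particular constant is a conventional one, not quoted from the slide). With this choice, and if $\lambda_{\min}(X^\top X) > 0$, a direct calculation on the quadratic objective gives geometric (linear) convergence of the iterates:

$$\|w_t - w^*\|_2 \;\le\; \left(1 - \frac{\lambda_{\min}(X^\top X)}{\lambda_{\max}(X^\top X)}\right)^{t} \|w_0 - w^*\|_2 .$$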

17
Computing the gradient
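For reference, a sketch of the standard computation (not copied from the slide):

$$\nabla \hat{R}(w) = \nabla_w \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2 = -2 \sum_{i=1}^{n} \left(y_i - w^\top x_i\right) x_i = 2\, X^\top (X w - y).$$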

18
Demo: Gradient descent

19
Choosing a stepsize
What happens if we choose a poor stepsize?

20
Adaptive step size
Can update the step size adaptively. Examples:

1) Via line search (optimizing the step size at every step)

2) "Bold driver" heuristic (see the sketch below):
If the function value decreases, increase the step size.
If the function value increases, decrease the step size.
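A minimal sketch of the bold driver heuristic in code (the multipliers 1.05 and 0.5 are illustrative values of mine, not from the slide):

import numpy as np

def bold_driver_gd(f, grad, w0, eta0=0.1, n_steps=100, inc=1.05, dec=0.5):
    # Gradient descent with the "bold driver" step-size heuristic.
    # f(w): objective value at w, grad(w): gradient at w.
    # inc/dec are illustrative multipliers, not values from the slide.
    w = np.asarray(w0, dtype=float)
    eta, f_old = eta0, f(w)
    for _ in range(n_steps):
        w_new = w - eta * grad(w)
        f_new = f(w_new)
        if f_new <= f_old:      # function decreased: accept step, increase step size
            w, f_old = w_new, f_new
            eta *= inc
        else:                   # function increased: reject step, decrease step size
            eta *= dec
    return w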

21
Demo: Gradient Descent for Linear Regression
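A minimal sketch of such a demo, regressing disease progression on body mass index (my own code, not the lecture's; the BMI column index follows scikit-learn's documented feature order, and the step size and iteration count are arbitrary choices):

import numpy as np
from sklearn.datasets import load_diabetes

X_all, y = load_diabetes(return_X_y=True)
bmi = X_all[:, 2]                                # column 2 = body mass index
X = np.column_stack([bmi, np.ones_like(bmi)])    # homogeneous representation

def grad(w):
    # Gradient of R(w) = ||y - X w||^2
    return 2.0 * X.T @ (X @ w - y)

eta = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X).max())   # constant step size
w = np.zeros(2)
for t in range(5000):
    w = w - eta * grad(w)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)     # closed-form solution
print("gradient descent:", w)
print("closed form:     ", w_closed)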

22
Gradient descent vs closed form
Why would one ever consider performing gradient descent
when it is possible to find a closed-form solution?

Computational complexity (forming and solving the normal equations costs roughly $O(n d^2 + d^3)$, while a single gradient step costs only $O(n d)$)
May not need an optimal solution
Many problems don't admit a closed-form solution

23
Other loss functions
So far: Measure goodness of fit via squared error
Many other loss functions possible (and sensible!)
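Two common alternatives, shown here as standard examples (the specific losses discussed in the lecture are not listed on the slide), with residual $r = y - w^\top x$:

$$\ell_{\mathrm{abs}}(r) = |r| \quad \text{(absolute / } \ell_1 \text{ loss, less sensitive to outliers than the squared loss)}$$

$$\ell_{\delta}(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left(|r| - \tfrac{\delta}{2}\right) & \text{otherwise} \end{cases} \quad \text{(Huber loss: quadratic near 0, linear in the tails)}$$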

24
Fitting nonlinear functions
How about functions like this:

[Plot: data points whose trend in x is clearly nonlinear.]

25
Linear regression for polynomials
We can fit non-linear functions via linear regression,
using nonlinear features of our data (basis functions)
$$f(x) = \sum_{i=1}^{d} w_i \phi_i(x)$$
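A minimal sketch of this idea for one-dimensional polynomial features (the degree, the synthetic data, and the choice $\phi_i(x) = x^{i}$ are illustrative, not from the slide):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=x.shape)   # illustrative data

degree = 5
Phi = np.vander(x, N=degree + 1, increasing=True)      # columns: 1, x, x^2, ..., x^degree

# Same least-squares machinery as before, applied to the feature matrix Phi.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

y_hat = Phi @ w
print("training MSE:", np.mean((y - y_hat) ** 2))

In scikit-learn the same construction is available by combining sklearn.preprocessing.PolynomialFeatures with LinearRegression.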

26
