
Tarigan Statistical Consulting & Coaching (statistical-coaching.ch)
Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL

Hands-on Data Analysis with R


University of Neuchâtel, 10 May 2016

Linear and Piecewise Linear Regressions

Bernadetta Tarigan, Dr. sc. ETHZ




Critic data

• generated from different versions of a software project


• version: a version number of the software project
• rule name: the name of the rule used to generate critics about source code
• entity: the name of the entity that has a critic
• other fields contain additional information about the rules and entities and can be used for filtering
• [version, rule name, entity] triplets have unique values, since for each version an entity can have only one critic by a given rule

• Main calculated value: number of critics in a version = number of lines with the same version number
• How does the trend in the number of critics change after version 50185? The hypothesis is that after that version the number of critics should go down.

• But there is too much noise to analyze all the data without filtering.
• Ideas on filtering (see the R sketch after this list):
  • filter by package: focus on the Collections-*, Kernel, System-* and Nautilus packages
    • or perform the analysis per package and merge the results
  • filter by rule severity: focus only on error rules
  • filter by rule group: focus only on “Pharo bugs” and “Bugs”
  • calculate the trend values for each rule or each package separately and compare trends / eliminate outliers
  • consider only entities that were changed since the last version
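
A rough sketch of the counting and filtering steps in R (the file name and the column names version, rule_name, entity, package and severity are assumptions based on the field list above, not taken from the actual data set):

## read the critic data; column names are assumed for illustration
critics <- read.csv("critics.csv", stringsAsFactors = FALSE)

## number of critics in a version = number of rows with that version number
counts <- aggregate(entity ~ version, data = critics, FUN = length)
names(counts)[2] <- "n_critics"

## example filters (assuming 'package' and 'severity' columns exist)
focus_pkgs  <- subset(critics, grepl("^(Collections-|Kernel|System-|Nautilus)", package))
errors_only <- subset(critics, severity == "error")

## look at the raw trend
plot(counts$version, counts$n_critics, type = "l",
     xlab = "version", ylab = "number of critics")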

Very noisy indeed…


What is happening…?


There is hope…


This model? (fit using all versions)
What's wrong with this?


Or this one? Good enough? (fit using only versions > 185)
What's wrong with this?



How about this one? Much better? (fit using all versions)
Looks better, no?



Simple Linear Regression = fitting a line on a one-dimensional input

• 𝑓(𝑥) = 𝛽0 + 𝛽1 𝑥
• 𝛽0 : intercept of the line (when 𝑥 = 0, 𝑦 = 𝛽0 ); 𝛽1 : slope of the line (a one-unit increase in 𝑥 changes 𝑦 by 𝛽1 units)
• 𝜀𝑖 : random component (statistical error) for the 𝑖-th case; it accounts for the fact that the statistical model does not give an exact fit to each and every data point
• 𝜀𝑖 is unobservable, but we assume that E(𝜀𝑖) = 0 and Var(𝜀𝑖) = σε² for all 𝑖 = 1, … , 𝑛
• However, we do not assume any distribution for 𝜀𝑖
• The population parameters are 𝛽0 , 𝛽1 and σε², and we want to estimate them
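
A minimal R sketch of this model on simulated data (the true values β0 = 2, β1 = 0.5 and σ = 1 are made up for illustration; this example is reused on the following slides):

## simulate y = beta0 + beta1 * x + eps
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # beta0 = 2, beta1 = 0.5, sigma = 1

fit <- lm(y ~ x)   # least squares fit of the simple linear regression
coef(fit)          # estimates of beta0 (intercept) and beta1 (slope)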

Estimating the best line


• Define
  • fitted value $\hat y_i := \hat\beta_0 + \hat\beta_1 x_i$
  • residual $e_i := y_i - \hat y_i$
• Points above the line have positive residuals, points below the line have negative residuals
• A good line should have small residuals
• Residuals should be small in magnitude, because large negative residuals are as bad as large positive ones
• So we cannot simply require $\sum_i e_i = 0$
• In fact, any line passing through the point of means $(\bar x, \bar y)$ satisfies $\sum_i e_i = 0$
• Two immediate solutions
  • require $\sum_i |e_i|$ to be as small as possible (least absolute distance)
  • require $\sum_i e_i^2$ to be as small as possible (least squares distance)
• Consider the second option: mathematically easier (e.g. to take derivatives), although the first option is more resistant to outliers

Least squares solution


• Denote $\beta^T = (\beta_0, \beta_1)$ and $x_i^T = (1, x_i)$ (both column vectors)

• Residual sum of squares $RSS(\beta) := \sum_i e_i^2 = \sum_i \{y_i - \beta^T x_i\}^2$

• The least squares solution is
  $\hat\beta^{ls} = \arg\min_\beta RSS(\beta) = \arg\min_\beta \sum_i \{y_i - \beta^T x_i\}^2$

• Easy to solve: set the first partial derivatives equal to zero, check the second derivatives…
  $\hat\beta_0^{ls} = \bar y - \hat\beta_1^{ls}\,\bar x$
  $\hat\beta_1^{ls} = \dfrac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}$

• Properties of the residuals
  • $\sum_i e_i = 0$, since the least squares line passes through $(\bar x, \bar y)$
  • $\sum_i x_i e_i = 0$ and $\sum_i \hat y_i e_i = 0$: the residuals are uncorrelated with the independent variable $x_i$ and with the fitted values $\hat y_i$
  • $\hat\beta^{ls}$ is uniquely defined as long as the $x_i$'s are not all identical; in that case the denominator $\sum_i (x_i - \bar x)^2 = 0$

• The estimate for $\sigma_\varepsilon^2$ is $s_e^2 := RSS/(n-2)$
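
Continuing the simulated example from the earlier sketch, the least squares estimates and the residual properties can be checked by hand:

## least squares estimates computed from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)                  # identical to coef(fit)

e  <- y - (b0 + b1 * x)    # residuals
sum(e)                     # ~ 0
sum(x * e)                 # ~ 0: residuals uncorrelated with x
sum((b0 + b1 * x) * e)     # ~ 0: residuals uncorrelated with the fitted values

s2 <- sum(e^2) / (length(y) - 2)   # estimate of the error variance sigma_eps^2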


How good is the fit?


• Use the residual standard error $s_e := \sqrt{RSS/(n-2)}$: the smaller the better
• Use the coefficient of determination $R^2 := RegSS/TSS$
  • Define $TSS := \sum_i (y_i - \bar y)^2$
    o the total sum of squares of the “null” model,
    o i.e., the model that does not use the independent variable
  • Recall $RSS := \sum_i (y_i - \hat y_i)^2$
  • Clearly $RSS \le TSS$
  • Define $RegSS := \sum_i (\hat y_i - \bar y)^2$
  • $RegSS = TSS - RSS$: it gives the reduction in squared error due to the linear regression
  • Define $R^2 := RegSS/TSS$; clearly $0 \le R^2 \le 1$
  • $R^2$ is the proportion of the variation in $y$ that is explained by the linear regression
  • The larger $R^2$, the better
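
Still with the simulated example, both goodness-of-fit measures can be computed by hand and compared with what summary() reports:

## goodness of fit by hand versus summary(fit)
yhat  <- fitted(fit)
TSS   <- sum((y - mean(y))^2)
RSS   <- sum((y - yhat)^2)
RegSS <- TSS - RSS

R2 <- RegSS / TSS
se <- sqrt(RSS / (length(y) - 2))   # residual standard error

c(R2, summary(fit)$r.squared)   # same value
c(se, summary(fit)$sigma)       # same value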


Famous result: least squares estimates are BLUE


• BLUE = Best Linear Unbiased Estimates

• $\hat\beta^{ls} = \arg\min_\beta RSS(\beta) = \arg\min_\beta \sum_i \{y_i - \beta^T x_i\}^2$

• Gauss-Markov Theorem:
  least squares estimates have the smallest variance
  among all linear unbiased estimates
• Recall:
  • Let $\hat\beta$ be an estimate of an unknown parameter $\beta$
  • The quality of $\hat\beta$ is measured via its mean squared error
    $MSE(\hat\beta) := E(\hat\beta - \beta)^2 = (\beta - E\hat\beta)^2 + Var(\hat\beta) = \mathrm{Bias}^2 + \mathrm{Variance}$
• Therefore least squares estimates are famous: if the underlying function $f(x)$ were truly linear (that is, $y = \beta^T x + \varepsilon$), then the least squares estimates are your best approximation!


Great! But, what next?


• Remember that we do not assume any distribution for the statistical error 𝜀, only that E(𝜀𝑖) = 0 and Var(𝜀𝑖) = σε² for all 𝑖 = 1, … , 𝑛
• The least squares estimates are a great, purely mathematical solution, but we cannot do much more with them
• In particular, we cannot do statistical inference on them, e.g.
  • confidence intervals
  • hypothesis tests
• Inference is needed when the goal of estimating the underlying mechanism is to explain or to describe
  • but not to predict

• Statistical inference: drawing conclusions about a population from a sample, with some calculated uncertainty
• When you have two data sets/samples from the same mechanism 𝑦 = 𝛽ᵀ𝑥 + 𝜀, you will get two different sets of estimates

Normal distribution of the random error 𝜺


• Linear statistical model: 𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖
• Assume that the random errors 𝜀𝑖 are iid and 𝑵(𝟎, σε²)-distributed, for 𝑖 = 1, … , 𝑛; then
  $Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma_\varepsilon^2)$
[Figure: normal densities of 𝑦 at 𝑥1, 𝑥2 and 𝑥3, each centred at E(𝑦|𝑥𝑗) = 𝛽0 + 𝛽1 𝑥𝑗 ; the standard deviation remains constant, but the mean value changes with 𝑥.]

Maximum Likelihood Estimates


• Now that we know the distributions of 𝑌𝑖 |𝑥𝑖 (independent but not identical: $Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma_\varepsilon^2)$), we can apply the maximum likelihood estimation (MLE) method

• The MLE estimates for 𝛽0 and 𝛽1 are equal to the least squares estimates:

  $\hat\beta_1^{MLE} = \hat\beta_1^{ls} = \dfrac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}$

  $\hat\beta_0^{MLE} = \hat\beta_0^{ls} = \bar y - \hat\beta_1^{ls}\,\bar x$


Maximum Likelihood Estimates (Cont.)


• However, we get more; with $s_x := \sum_{i=1}^n (x_i - \bar x)^2$:
  • $\hat\beta_0 \sim N\!\left(\beta_0,\ \dfrac{\sigma_\varepsilon^2}{n\,s_x}\sum_{i=1}^n x_i^2\right)$
  • $\hat\beta_1 \sim N\!\left(\beta_1,\ \dfrac{\sigma_\varepsilon^2}{s_x}\right)$
  • Covariance $Cov(\hat\beta_0, \hat\beta_1) = \dfrac{-\sigma_\varepsilon^2\,\bar x}{s_x}$
• Define $S^2 := RSS(\hat\beta_0, \hat\beta_1)/(n-2)$, an unbiased estimate of $\sigma_\varepsilon^2$
  • $\dfrac{(n-2)\,S^2}{\sigma_\varepsilon^2} \sim \chi^2_{n-2}$
• Moreover, $(\hat\beta_0, \hat\beta_1)$ and $S^2$ are independent
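
Continuing the simulated example, R returns these estimated sampling variances and the covariance directly from the fitted lm object (sigma() assumes R >= 3.3):

vcov(fit)     # 2x2 matrix: Var(b0_hat), Var(b1_hat) on the diagonal, Cov off it
sigma(fit)    # S = sqrt(RSS / (n - 2))
sigma(fit)^2 / sum((x - mean(x))^2)   # matches vcov(fit)[2, 2], i.e. Var(b1_hat)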


Test statistics

From the results about the sampling distributions, it immediately follows that

  $\dfrac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \sim t_{n-2}, \qquad j = 0, 1,$

which is the basis for inference, significance tests and CI estimation regarding the two parameters 𝛽0 and 𝛽1

Test of significance
1. Test both parameters simultaneously with the 𝑭 test
   𝐻0 : 𝛽0 = 𝛽1 = 0
   𝐻1 : at least one of them is not zero

2. Test each parameter separately with a 𝒕 test, for 𝑖 = 0, 1
   𝐻0 : 𝛽𝑖 = 0
   𝐻1 : 𝛽𝑖 ≠ 0


Confidence Interval (CI) estimation


The (1 − 𝛼) CIs for 𝛽0 and 𝛽1 respectively are

  $\hat\beta_0 \pm t_{1-\alpha/2;\,n-2} \cdot SE(\hat\beta_0)$
  $\hat\beta_1 \pm t_{1-\alpha/2;\,n-2} \cdot SE(\hat\beta_1)$

i.e. point estimate ± margin of error

• R returns the SE values
• When 𝑛 is large, the 𝑡 distribution behaves like the standard normal 𝑍
• For 𝛼 = 0.05, $t_{1-\alpha/2;\,n-2} \approx 2$
• Remember the 68-95-99.7 rule
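
In R, summary() reports the t tests (and the F test) and confint() the intervals; a sketch with the simulated example from before:

summary(fit)                  # t statistic and p-value per coefficient, F test at the bottom
confint(fit, level = 0.95)    # 95% CIs: estimate +/- t * SE

## the same interval by hand for the slope
est <- coef(summary(fit))["x", ]
est["Estimate"] + c(-1, 1) * qt(0.975, df = fit$df.residual) * est["Std. Error"]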


Model validation
Check the assumptions on the random term (i.e., the errors) and look for outliers:

1. Zero mean of the errors
2. Constant variance (homoscedasticity) of the errors
3. Independence of the errors
4. Normality of the errors
5. Outlier diagnostics
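
A quick way to look at points 1-5 in R, continuing the simulated example (plot() on an lm object gives the four standard diagnostic plots):

par(mfrow = c(2, 2))
plot(fit)            # residuals vs fitted, normal Q-Q, scale-location, leverage/outliers
par(mfrow = c(1, 1))

shapiro.test(residuals(fit))   # formal check of the normality of the errors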


Model Evaluation
The goodness-of-fit or quality of the model
How good is the fit?
Two measures:
• Residual standard error
• Coefficient of determination 𝑅2


Piecewise linear regression


• Other names: hockey stick, broken stick or segmented regression
• It is a simple modification of the linear model, yet very useful

• For different ranges of 𝑥, different linear relationships occur
• A single linear model may not provide an adequate explanation or description
• Breakpoints are the values of 𝑥 where the slope changes
• The values of the breakpoints may or may not be known before the analysis; when unknown, they have to be estimated


Even to model a nonlinear relationship!

Breakpoints are the values of 𝑥 where the slope changes

The values of the breakpoints may or may not be known before the analysis; when unknown, they have to be estimated


One breakpoint with known value


• Let 𝑐 be the value of the breakpoint
• Denote $(x - c)_+ = \begin{cases} 0, & x \le c \\ x - c, & x > c \end{cases}$
• Piecewise linear model
  $y = \beta_0 + \beta_1 x + \beta_2 (x - c)_+ + \varepsilon$
• It can be written as
  $y = \begin{cases} \beta_0 + \beta_1 x, & x \le c \\ (\beta_0 - \beta_2 c) + (\beta_1 + \beta_2)\,x, & x > c \end{cases}$
• For 𝑥 ≤ 𝑐 the slope is 𝛽1
• It then changes to 𝛽1 + 𝛽2 when 𝑥 > 𝑐
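
A sketch of this model in R with a known breakpoint, reusing the simulated x from the earlier example and a made-up piecewise response (the true values 1, 0.8 and -1.2 are assumptions for illustration):

## simulate a piecewise linear response with breakpoint cbrk
cbrk <- 5
y_pw <- 1 + 0.8 * x - 1.2 * pmax(x - cbrk, 0) + rnorm(length(x), sd = 0.5)

## beta0, beta1, beta2 as on the slide: (x - c)_+ is pmax(x - cbrk, 0)
fit_pw <- lm(y_pw ~ x + pmax(x - cbrk, 0))
coef(fit_pw)

## fitted broken line
plot(x, y_pw)
ord <- order(x)
lines(x[ord], fitted(fit_pw)[ord], lwd = 2)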


Hypothesis test
  $y = \begin{cases} \beta_0 + \beta_1 x, & x \le c \\ (\beta_0 - \beta_2 c) + (\beta_1 + \beta_2)\,x, & x > c \end{cases}$

• For 𝑥 ≤ 𝑐 the slope is 𝛽1
• It then changes to 𝛽1 + 𝛽2 when 𝑥 > 𝑐

To test whether the trend in 𝑦 goes down after the breakpoint 𝑐 as 𝑥 increases, test whether the change in slope is negative, i.e. 𝐻0 : 𝛽2 ≥ 0 against 𝐻1 : 𝛽2 < 0
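
Continuing the piecewise sketch above, the one-sided test of β2 < 0 uses the t statistic that summary() reports for the (x - c)_+ term:

tab  <- coef(summary(fit_pw))
tval <- tab[3, "t value"]            # t statistic for beta2 (third coefficient)
pt(tval, df = fit_pw$df.residual)    # one-sided p-value for H1: beta2 < 0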

