
Tarigan Statistical Consulting & Coaching (statistical-coaching.ch)
Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL

Hands-on Data Analysis with R


University of Neuchâtel, 10 May 2016

Linear and Piecewise Linear Regressions

Bernadetta Tarigan, Dr. sc. ETHZ




Critic data

• generated from different versions of a software project


• version: a version number of the software project
• rule name: the name of the rule used to generate critics about source code
• entity: the name of the entity that has a critic
• other fields contain additional information about the rules and entities and can be used for filtering
• [version, rule name, entity] triplets have unique values, since for each version an entity can have only one critic by a given rule

• Main calculated value: number of critics in a version = number of lines with the same version number
• How does the trend in the number of critics change after version 50185? The hypothesis is that after that version the number of critics should go down.

• But there is too much noise to analyze all the data without filtering.
• Ideas on filtering (see the R sketch after this list):
  • filter by package: focus on the Collections-*, Kernel, System-* and Nautilus packages
    • or perform the analysis per package and merge the results
  • filter by rule severity: focus only on error rules
  • filter by rule group: focus only on “Pharo bugs” and “Bugs”
  • calculate the trend values for each rule or each package separately and compare trends / eliminate outliers
  • consider only entities that were changed since the last version
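
A rough sketch of the counting and filtering steps in R (the file name and the column names version, rule_name, entity, package and severity are assumptions based on the field list above, not taken from the actual data set):

## read the critic data; column names are assumed for illustration
critics <- read.csv("critics.csv", stringsAsFactors = FALSE)

## number of critics in a version = number of rows with that version number
counts <- aggregate(entity ~ version, data = critics, FUN = length)
names(counts)[2] <- "n_critics"

## example filters (assuming 'package' and 'severity' columns exist)
focus_pkgs  <- subset(critics, grepl("^(Collections-|Kernel|System-|Nautilus)", package))
errors_only <- subset(critics, severity == "error")

## look at the raw trend
plot(counts$version, counts$n_critics, type = "l",
     xlab = "version", ylab = "number of critics")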

Very noisy indeed…


What is happening…?


There is hope…


This model? (fit using all versions)
What's wrong with this?


Or this one? Good enough? (fit using only versions > 185)
What's wrong with this?



How about this one? Much better? (fit using all versions)
Looks better, no?



Simple Linear Regression = fitting a line on a one-dimensional input

• 𝑓(𝑥) = 𝛽0 + 𝛽1 𝑥
• 𝛽0 : intercept of the line (when 𝑥 = 0, 𝑦 = 𝛽0 ); 𝛽1 : slope of the line (a one-unit increase in 𝑥 changes 𝑦 by 𝛽1 units)
• 𝜀𝑖 : random component (statistical error) for the 𝑖-th case; it accounts for the fact that the statistical model does not give an exact fit to each and every data point
• 𝜀𝑖 is unobservable, but we assume that E(𝜀𝑖) = 0 and Var(𝜀𝑖) = σε² for all 𝑖 = 1, … , 𝑛
• However, we do not assume any distribution for 𝜀𝑖
• The population parameters are 𝛽0 , 𝛽1 and σε², and we want to estimate them
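
A minimal R sketch of this model on simulated data (the true values β0 = 2, β1 = 0.5 and σ = 1 are made up for illustration; this example is reused on the following slides):

## simulate y = beta0 + beta1 * x + eps
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # beta0 = 2, beta1 = 0.5, sigma = 1

fit <- lm(y ~ x)   # least squares fit of the simple linear regression
coef(fit)          # estimates of beta0 (intercept) and beta1 (slope)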

Estimating the best line


• Define
  • fitted value $\hat y_i := \hat\beta_0 + \hat\beta_1 x_i$
  • residual $e_i := y_i - \hat y_i$
• Points above the line have positive residuals, points below the line have negative residuals
• A good line should have small residuals
• Residuals should be small in magnitude, because large negative residuals are as bad as large positive ones
• So we cannot simply require $\sum_i e_i = 0$
• In fact, any line passing through the point of means $(\bar x, \bar y)$ satisfies $\sum_i e_i = 0$
• Two immediate solutions
  • require $\sum_i |e_i|$ to be as small as possible (least absolute distance)
  • require $\sum_i e_i^2$ to be as small as possible (least squares distance)
• Consider the second option: mathematically easier (e.g. to take derivatives), although the first option is more resistant to outliers

Least squares solution


• Denote $\beta^T = (\beta_0, \beta_1)$ and $x_i^T = (1, x_i)$ (both column vectors)

• Residual sum of squares $RSS(\beta) := \sum_i e_i^2 = \sum_i \{y_i - \beta^T x_i\}^2$

• The least squares solution is
  $\hat\beta^{ls} = \arg\min_\beta RSS(\beta) = \arg\min_\beta \sum_i \{y_i - \beta^T x_i\}^2$

• Easy to solve: set the first partial derivatives equal to zero, check the second derivatives…
  $\hat\beta_0^{ls} = \bar y - \hat\beta_1^{ls}\,\bar x$
  $\hat\beta_1^{ls} = \dfrac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}$

• Properties of the residuals
  • $\sum_i e_i = 0$, since the least squares line passes through $(\bar x, \bar y)$
  • $\sum_i x_i e_i = 0$ and $\sum_i \hat y_i e_i = 0$: the residuals are uncorrelated with the independent variable $x_i$ and with the fitted values $\hat y_i$
  • $\hat\beta^{ls}$ is uniquely defined as long as the $x_i$'s are not all identical; in that case the denominator $\sum_i (x_i - \bar x)^2 = 0$

• The estimate for $\sigma_\varepsilon^2$ is $s_e^2 := RSS/(n-2)$
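
Continuing the simulated example from the earlier sketch, the least squares estimates and the residual properties can be checked by hand:

## least squares estimates computed from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)                  # identical to coef(fit)

e  <- y - (b0 + b1 * x)    # residuals
sum(e)                     # ~ 0
sum(x * e)                 # ~ 0: residuals uncorrelated with x
sum((b0 + b1 * x) * e)     # ~ 0: residuals uncorrelated with the fitted values

s2 <- sum(e^2) / (length(y) - 2)   # estimate of the error variance sigma_eps^2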


How good is the fit?


• Use the residual standard error $s_e := \sqrt{RSS/(n-2)}$: the smaller the better
• Use the coefficient of determination $R^2 := RegSS/TSS$
  • Define $TSS := \sum_i (y_i - \bar y)^2$
    o the total sum of squares of the “null” model,
    o i.e., the model that does not use the independent variable
  • Recall $RSS := \sum_i (y_i - \hat y_i)^2$
  • Clearly $RSS \le TSS$
  • Define $RegSS := \sum_i (\hat y_i - \bar y)^2$
  • $RegSS = TSS - RSS$: it gives the reduction in squared error due to the linear regression
  • Define $R^2 := RegSS/TSS$; clearly $0 \le R^2 \le 1$
  • $R^2$ is the proportion of the variation in $y$ that is explained by the linear regression
  • The larger $R^2$, the better
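
Still with the simulated example, both goodness-of-fit measures can be computed by hand and compared with what summary() reports:

## goodness of fit by hand versus summary(fit)
yhat  <- fitted(fit)
TSS   <- sum((y - mean(y))^2)
RSS   <- sum((y - yhat)^2)
RegSS <- TSS - RSS

R2 <- RegSS / TSS
se <- sqrt(RSS / (length(y) - 2))   # residual standard error

c(R2, summary(fit)$r.squared)   # same value
c(se, summary(fit)$sigma)       # same value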


Famous result: least squares estimates are BLUE


• BLUE = Best Linear Unbiased Estimates

• $\hat\beta^{ls} = \arg\min_\beta RSS(\beta) = \arg\min_\beta \sum_i \{y_i - \beta^T x_i\}^2$

• Gauss-Markov Theorem:
  least squares estimates have the smallest variance
  among all linear unbiased estimates
• Recall:
  • Let $\hat\beta$ be an estimate of an unknown parameter $\beta$
  • The quality of $\hat\beta$ is measured via its mean squared error
    $MSE(\hat\beta) := E(\hat\beta - \beta)^2 = (\beta - E\hat\beta)^2 + Var(\hat\beta) = \mathrm{Bias}^2 + \mathrm{Variance}$
• Therefore least squares estimates are famous: if the underlying function $f(x)$ were truly linear (that is, $y = \beta^T x + \varepsilon$), then the least squares estimates are your best approximation!


Great! But, what next?


• Remember that we do not assume any distribution for the statistical error 𝜀, only that E(𝜀𝑖) = 0 and Var(𝜀𝑖) = σε² for all 𝑖 = 1, … , 𝑛
• The least squares estimates are a great, purely mathematical solution, but we cannot do much more with them
• In particular, we cannot do statistical inference on them, e.g.
  • confidence intervals
  • hypothesis tests
• Inference is needed when the goal of estimating the underlying mechanism is to explain or to describe
  • but not to predict

• Statistical inference: drawing conclusions about a population from a sample, with some calculated uncertainty
• When you have two data sets/samples from the same mechanism 𝑦 = 𝛽ᵀ𝑥 + 𝜀, you will get two different sets of estimates

Normal distribution of the random error 𝜺


• Linear statistical model: 𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖
• Assume that the random errors 𝜀𝑖 are iid and 𝑵(𝟎, σε²)-distributed, for 𝑖 = 1, … , 𝑛; then
  $Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma_\varepsilon^2)$
[Figure: normal densities of 𝑦 at 𝑥1, 𝑥2 and 𝑥3, each centred at E(𝑦|𝑥𝑗) = 𝛽0 + 𝛽1 𝑥𝑗 ; the standard deviation remains constant, but the mean value changes with 𝑥.]

Maximum Likelihood Estimates


• Now that we know the distributions of 𝑌𝑖 |𝑥𝑖 (independent but not identical: $Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma_\varepsilon^2)$), we can apply the maximum likelihood estimation (MLE) method

• The MLE estimates for 𝛽0 and 𝛽1 are equal to the least squares estimates:

  $\hat\beta_1^{MLE} = \hat\beta_1^{ls} = \dfrac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}$

  $\hat\beta_0^{MLE} = \hat\beta_0^{ls} = \bar y - \hat\beta_1^{ls}\,\bar x$


Maximum Likelihood Estimates (Cont.)


• However, we get more; with $s_x := \sum_{i=1}^n (x_i - \bar x)^2$:
  • $\hat\beta_0 \sim N\!\left(\beta_0,\ \dfrac{\sigma_\varepsilon^2}{n\,s_x}\sum_{i=1}^n x_i^2\right)$
  • $\hat\beta_1 \sim N\!\left(\beta_1,\ \dfrac{\sigma_\varepsilon^2}{s_x}\right)$
  • Covariance $Cov(\hat\beta_0, \hat\beta_1) = \dfrac{-\sigma_\varepsilon^2\,\bar x}{s_x}$
• Define $S^2 := RSS(\hat\beta_0, \hat\beta_1)/(n-2)$, an unbiased estimate of $\sigma_\varepsilon^2$
  • $\dfrac{(n-2)\,S^2}{\sigma_\varepsilon^2} \sim \chi^2_{n-2}$
• Moreover, $(\hat\beta_0, \hat\beta_1)$ and $S^2$ are independent
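
Continuing the simulated example, R returns these estimated sampling variances and the covariance directly from the fitted lm object (sigma() assumes R >= 3.3):

vcov(fit)     # 2x2 matrix: Var(b0_hat), Var(b1_hat) on the diagonal, Cov off it
sigma(fit)    # S = sqrt(RSS / (n - 2))
sigma(fit)^2 / sum((x - mean(x))^2)   # matches vcov(fit)[2, 2], i.e. Var(b1_hat)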


Test statistics

From the results about the sampling distributions, it immediately follows that

  $\dfrac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \sim t_{n-2}, \qquad j = 0, 1,$

which is the basis for inference, significance tests and CI estimation regarding the two parameters 𝛽0 and 𝛽1

Test of significance
1. Test both parameters simultaneously with the 𝑭 test
   𝐻0 : 𝛽0 = 𝛽1 = 0
   𝐻1 : at least one of them is not zero

2. Test each parameter separately with a 𝒕 test, for 𝑖 = 0, 1
   𝐻0 : 𝛽𝑖 = 0
   𝐻1 : 𝛽𝑖 ≠ 0


Confidence Interval (CI) estimation


The (1 − 𝛼) CIs for 𝛽0 and 𝛽1 respectively are

  $\hat\beta_0 \pm t_{1-\alpha/2;\,n-2} \cdot SE(\hat\beta_0)$
  $\hat\beta_1 \pm t_{1-\alpha/2;\,n-2} \cdot SE(\hat\beta_1)$

i.e. point estimate ± margin of error

• R returns the SE values
• When 𝑛 is large, the 𝑡 distribution behaves like the standard normal 𝑍
• For 𝛼 = 0.05, $t_{1-\alpha/2;\,n-2} \approx 2$
• Remember the 68-95-99.7 rule
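
In R, summary() reports the t tests (and the F test) and confint() the intervals; a sketch with the simulated example from before:

summary(fit)                  # t statistic and p-value per coefficient, F test at the bottom
confint(fit, level = 0.95)    # 95% CIs: estimate +/- t * SE

## the same interval by hand for the slope
est <- coef(summary(fit))["x", ]
est["Estimate"] + c(-1, 1) * qt(0.975, df = fit$df.residual) * est["Std. Error"]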


Model validation
Check the assumptions on the random term (i.e., the errors) and look for outliers:

1. Zero mean of the errors
2. Constant variance (homoscedasticity) of the errors
3. Independence of the errors
4. Normality of the errors
5. Outlier diagnostics
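
A quick way to look at points 1-5 in R, continuing the simulated example (plot() on an lm object gives the four standard diagnostic plots):

par(mfrow = c(2, 2))
plot(fit)            # residuals vs fitted, normal Q-Q, scale-location, leverage/outliers
par(mfrow = c(1, 1))

shapiro.test(residuals(fit))   # formal check of the normality of the errors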


Model Evaluation
The goodness-of-fit or quality of the model
How good is the fit?
Two measures:
• Residual standard error
• Coefficient of determination 𝑅2


Piecewise linear regression


• Other names: hockey stick, broken stick or segmented regression
• It is a simple modification of the linear model, yet very useful

• For different ranges of 𝑥, different linear relationships occur
• A single linear model may not provide an adequate explanation or description
• Breakpoints are the values of 𝑥 where the slope changes
• The values of the breakpoints may or may not be known before the analysis; when unknown, they have to be estimated


Even to model a nonlinear relationship!

Breakpoints are the values of 𝑥 where the slope changes

The values of the breakpoints may or may not be known before the analysis; when unknown, they have to be estimated


One breakpoint with known value


• Let 𝑐 be the value of the breakpoint
• Denote $(x - c)_+ = \begin{cases} 0, & x \le c \\ x - c, & x > c \end{cases}$
• Piecewise linear model
  $y = \beta_0 + \beta_1 x + \beta_2 (x - c)_+ + \varepsilon$
• It can be written as
  $y = \begin{cases} \beta_0 + \beta_1 x, & x \le c \\ (\beta_0 - \beta_2 c) + (\beta_1 + \beta_2)\,x, & x > c \end{cases}$
• For 𝑥 ≤ 𝑐 the slope is 𝛽1
• It then changes to 𝛽1 + 𝛽2 when 𝑥 > 𝑐
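
A sketch of this model in R with a known breakpoint, reusing the simulated x from the earlier example and a made-up piecewise response (the true values 1, 0.8 and -1.2 are assumptions for illustration):

## simulate a piecewise linear response with breakpoint cbrk
cbrk <- 5
y_pw <- 1 + 0.8 * x - 1.2 * pmax(x - cbrk, 0) + rnorm(length(x), sd = 0.5)

## beta0, beta1, beta2 as on the slide: (x - c)_+ is pmax(x - cbrk, 0)
fit_pw <- lm(y_pw ~ x + pmax(x - cbrk, 0))
coef(fit_pw)

## fitted broken line
plot(x, y_pw)
ord <- order(x)
lines(x[ord], fitted(fit_pw)[ord], lwd = 2)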


Hypothesis test
  $y = \begin{cases} \beta_0 + \beta_1 x, & x \le c \\ (\beta_0 - \beta_2 c) + (\beta_1 + \beta_2)\,x, & x > c \end{cases}$

• For 𝑥 ≤ 𝑐 the slope is 𝛽1
• It then changes to 𝛽1 + 𝛽2 when 𝑥 > 𝑐

To test whether the trend in 𝑦 goes down after the breakpoint 𝑐 as 𝑥 increases, test whether the change in slope is negative, i.e. 𝐻0 : 𝛽2 ≥ 0 against 𝐻1 : 𝛽2 < 0
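
Continuing the piecewise sketch above, the one-sided test of β2 < 0 uses the t statistic that summary() reports for the (x - c)_+ term:

tab  <- coef(summary(fit_pw))
tval <- tab[3, "t value"]            # t statistic for beta2 (third coefficient)
pt(tval, df = fit_pw$df.residual)    # one-sided p-value for H1: beta2 < 0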

