
Module-11

Linear Regression
Linear & Polynomial Regression
Learning Objectives

• At the end of this section delegates will be able to:


• Understand the role of regression analysis within the
Transactional DMAIC Improvement Process
• Understand that regression can be used to explore the
relationships between inputs (x’s) and outputs (y’s)
Linear & Polynomial Regression - Agenda

• Regression within DMAIC


• Review of Scatter Diagrams
• Introduction to Regression
• Linear Regression
• Polynomial Regression
• Summary
Six Sigma Transactional Improvement Process
Define
• Select Project
• Define Project Objective
• Form the Team
• Map the Process
• Identify Customer Requirements
• Identify Priorities
• Update Project File
• Phase Review

Measure
• Define Measures (y’s)
• Check Data Integrity
• Determine Process Stability
• Determine Process Capability
• Set Targets for Measures
• Phase Review

Analyse
• Develop Detailed Process Maps
• Identify Critical Process Steps (x’s) by looking for:
– Process Bottlenecks
– Rework / Repetition
– Non-value Added Steps
– Sources of Error / Mistake
• Map the Ideal Process
• Identify gaps between current and ideal
• Phase Review

Improve
• Brainstorm Potential Improvement Strategies
• Select Improvement Strategy
• Plan and Implement Pilot
• Verify Improvement
• Implement Countermeasures
• Phase Review

Control
• Control Critical x’s
• Monitor y’s
• Validate Control Plan
• Identify further opportunities
• Close Project
• Phase Review
Scatter Diagram Revisited

Purpose
To show:
• How one variable changes in response to changes in
another.
• The nature of the relationship between two
variables.
• The strength of relationship between two variables.
Scatter Diagram

Suspected Cause: Amount Overweight
Suspected Effect: Health Index

Amount Overweight   Health Index
 50                 .53
 81                 .32
117                 .10
100                 .13
 68                 .59
 77                 .40
112                 .28
 49                 .45
 70                 .50
 89                 .25
 70                 .34
115                 .18
 52                 .60
 90                 .42
 70                 .43
121                 .15
 80                 .49
 40                 .65
 75                 .22
 35                 .58
100                 .35
Scatter Diagram

[Figure: scatter plot of Health Index (vertical axis, .10 to .60) against Amount Overweight (horizontal axis, 45 to 125)]
Scatter Diagram

Positive Correlation
An increase in Y may depend on increases in X. If X is controlled, Y could be controlled.

Negative Correlation
An increase in X may cause a decrease in Y. Therefore, if X is controlled, Y could be controlled.

Possible Positive Correlation
If X is increased, Y may increase somewhat, but Y seems to have other causes than X.

Possible Negative Correlation
If X is increased, Y may decrease somewhat, but Y seems to have other causes than X.

No Correlation
There is no correlation.
Scatter Diagram: Risks & Limitations

• Does not prove a cause-and-effect relationship
• Both axes should be of equal length
• Conclusions must not be made outside the experimental range
• Experimental range should be wide enough to draw useful conclusions
Regression
• At the heart of Six Sigma activities is identifying which
inputs or process steps cause unwanted variation in process
outputs

y = f(x)

• Regression analysis will allow us to determine which inputs (x’s) influence our output or outputs (y’s)
• We can sometimes use regression analysis to build a
mathematical model which can be used to predict the value
of our outputs
Why do we use Regression?

• Regression analysis is used when we wish to determine the relationship between two or more continuous variables
• In Six Sigma activities we often need to understand the relationship between our output (y) and our critical x’s
• In Problem Solving activities this may help us to discover root causes
• In Process Improvement activities, regression analysis will allow us to find optimal settings for our critical x’s
Regression Exercise
Exercise - consider the following pairs of measures – could we draw a
line which might summarise the relationship/regression between them?
• Volume of Speech and Alcohol Volume Consumed
• Site Location and Quantity of Defects
• Shipping Defects and Customer Distance from Distribution Depot
• WIP and Yield
• Education and Salary
• Age and Beauty
• Sales and Advertising
• Pick Errors and Cycle time
• Sales Representative and Sales value
• Goals Scored per Season and Purchase Price of the Player
• Quantity Sold and Selling Price
• Speed of Query Resolution and Experience of the Operator
Linear Regression

• The simplest form of regression is single variable linear regression
• y is the dependent variable
• x is the independent variable
• The equation for linear regression is:

y = β0 + β1x + error

β0 is the intercept
β1 is the slope
Linear Regression - Example

A finance department is carrying out an investigation into the number of errors that are generated on customer invoices.

They suspect that the number of errors may be affected by the volume of invoicing on any particular day.

The following slide shows the data for the past 50 working days.
Linear Regression - Example

Volume  Errors    Volume  Errors    Volume  Errors
155      2        178      6        175      3
165      5        155      3        173      3
170      3        186      5        201      7
198      5        201      7        297     28
276     26        241      8        165      5
209      3        174      8        193      2
186      4        163      4        162      0
288     23        207     10        271     17
176      3        188      6        179      5
208      4        154      1        197      8
163      5        163      3        162      2
173      3        178      5        265     14
174      1        210      9        221      9
223     11        263     13        165      6
196      7        162      3        154      5
241     10        165      3        199      5
283     26        224      6
Scatter Diagram
Open Worksheet: “Invoicing Errors”

Enter Errors in Y and Volume in X and click OK
Scatter Diagram
[Figure: Scatterplot of Errors vs Volume; Errors on the vertical axis, Volume (150 to 300) on the horizontal axis]

A scatter diagram reveals that there may be a relationship between the number of
errors and the volume of invoices. A regression analysis will reveal the existence
and/or the strength of the relationship.
Linear Regression – Least Squares Method
[Figure: Scatterplot of Errors vs Volume; Errors on the vertical axis, Volume (150 to 300) on the horizontal axis]

We first need to establish the equation for the best fitting line: the line that minimises the sum of the squared differences between the observed y values and the predicted y values. This is known as the “least squares” method.
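The least squares calculation can be sketched in a few lines of Python. This is only an illustration, not Minitab's own procedure: the function name `least_squares` and the variable names are assumptions, and the data are the 50 volume/error pairs from the example table. The slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept forces the line through the point of means.

```python
# Invoice volume (x) and invoice errors (y) for the 50 working days in the example.
volume = [155, 165, 170, 198, 276, 209, 186, 288, 176, 208, 163, 173, 174, 223, 196, 241, 283,
          178, 155, 186, 201, 241, 174, 163, 207, 188, 154, 163, 178, 210, 263, 162, 165, 224,
          175, 173, 201, 297, 165, 193, 162, 271, 179, 197, 162, 265, 221, 165, 154, 199]
errors = [2, 5, 3, 5, 26, 3, 4, 23, 3, 4, 5, 3, 1, 11, 7, 10, 26,
          6, 3, 5, 7, 8, 8, 4, 10, 6, 1, 3, 5, 9, 13, 3, 3, 6,
          3, 3, 7, 28, 5, 2, 0, 17, 5, 8, 2, 14, 9, 6, 5, 5]


def least_squares(x, y):
    """Return (intercept, slope) of the line minimising the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar  # the fitted line passes through (x_bar, y_bar)
    return intercept, slope


intercept, slope = least_squares(volume, errors)
print(f"Errors = {intercept:.2f} + {slope:.4f} Volume")
```

Rounded to the same precision, the coefficients should agree with the fitted equation Minitab reports for this worksheet.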
Regression - Minitab

Open Worksheet: “Invoicing Errors”


Regression - Minitab

1. Enter Errors and Volume
2. Check Linear
Minitab – Regression Plot

[Figure: Fitted Line Plot of Errors (vertical axis) against Volume (horizontal axis, 150 to 300)]

Errors = - 21.74 + 0.1465 Volume
S = 2.98583   R-Sq = 79.3%   R-Sq(adj) = 78.9%

This is the equation for the best fit line. We can use it to predict: e.g. if we have 200 invoices we would predict:

-21.74 + 0.1465 (200) = 7.6 errors
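As a small illustration, the prediction can be wrapped in a function. The name `predict_errors` is hypothetical, and the range check is an added assumption based on the earlier advice not to draw conclusions outside the experimental range (volumes in the example data run from 154 to 297).

```python
# Coefficients of the fitted line, rounded as reported in the Minitab output.
INTERCEPT = -21.74
SLOPE = 0.1465

# Smallest and largest invoice volumes in the example data (assumed valid range).
MIN_VOLUME, MAX_VOLUME = 154, 297


def predict_errors(volume):
    """Predict the number of invoice errors for a given daily invoice volume."""
    if not MIN_VOLUME <= volume <= MAX_VOLUME:
        raise ValueError("volume is outside the range of the data used to fit the line")
    return INTERCEPT + SLOPE * volume


print(round(predict_errors(200), 1))  # 7.6
```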
Minitab – Regression Plot

[Figure: Fitted Line Plot, Errors = - 21.74 + 0.1465 Volume; S = 2.98583, R-Sq = 79.3%, R-Sq(adj) = 78.9%]

The R-Squared and R-Squared (adjusted) tell us how much of the variation in Errors can be explained by the changes in Volume.

Here it is around 79%.
Minitab – Regression Plot

[Figure: Fitted Line Plot, Errors = - 21.74 + 0.1465 Volume; S = 2.98583, R-Sq = 79.3%, R-Sq(adj) = 78.9%]

The s value is the standard error of the y values about the best fit line. It is the standard deviation of the residuals (the difference between actual and best-fit y values for each x).
Linear Regression – Minitab Output
Regression Analysis: Errors versus Volume

The regression equation is


Errors = - 21.74 + 0.1465 Volume

S = 2.98583 R-Sq = 79.3% R-Sq(adj) = 78.9%


Analysis of Variance

Source       DF       SS        MS        F       P
Regression    1   1642.07   1642.07   184.19   0.000
Error        48    427.93      8.92
Total        49   2070.00

A “p” value of <0.05 indicates a statistically significant effect.
Analysis of Variance (ANOVA) for Linear Regression
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Regression                     1               1642.07         1642.07      184.19
Residual (Error)              48                427.93            8.92
Total                         49               2070.00

The analysis of variance divides up the total variation in y (errors) into its constituent parts.
We can learn a lot from this table:
1. What is the overall variation in y?
2. Is there a significant relationship between y and x?
3. How much of the variation in y is due to changes in x?
4. How much variation in y is still unexplained?
5. How accurate is my prediction of y for a given value of x?
What is the overall variation in y?

The total variation in y is given by the Total Sum of Squares = 2070.00

The Total Sum of Squares = $\sum (y - \bar{y})^2$

The total mean square =

$$\frac{\text{The total sum of squares}}{\text{Total Degrees of Freedom}} = \frac{\sum (y - \bar{y})^2}{n - 1} = \sigma_{n-1}^2$$

$$\sigma_{n-1}^2 = \frac{2070.00}{49} = 42.245$$

$$\sigma_{n-1} = \sqrt{42.245} = 6.5$$

Check this out by calculating the standard deviation of the 50 error results.
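The suggested check can be sketched in Python using the standard library. The list `errors` holds the 50 error counts from the example data table; the variable names are illustrative assumptions.

```python
import math
import statistics

# The 50 daily error counts from the example data table.
errors = [2, 5, 3, 5, 26, 3, 4, 23, 3, 4, 5, 3, 1, 11, 7, 10, 26,
          6, 3, 5, 7, 8, 8, 4, 10, 6, 1, 3, 5, 9, 13, 3, 3, 6,
          3, 3, 7, 28, 5, 2, 0, 17, 5, 8, 2, 14, 9, 6, 5, 5]

n = len(errors)
y_bar = statistics.mean(errors)

# Total sum of squares: squared deviations of each y from the mean of y.
ss_total = sum((y - y_bar) ** 2 for y in errors)

# Total mean square = sample variance; its square root is the sample standard deviation.
variance = ss_total / (n - 1)
std_dev = math.sqrt(variance)

print(f"SS total = {ss_total:.2f}, variance = {variance:.3f}, std dev = {std_dev:.1f}")
print(f"statistics.stdev gives {statistics.stdev(errors):.1f}")
```

Both routes should give the same standard deviation, because `statistics.stdev` also divides by n − 1.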
Is there a significant relationship between y and x?

We can test the significance of the relationship between y and x by examining the F-Ratio. The F-Ratio is named after Sir Ronald Fisher, who devised this test for comparing variances.

$$F\text{-Ratio} = \frac{\text{Regression Mean Square}}{\text{Residual Mean Square}} = \frac{1642.07}{8.92} = 184.19$$

(The quoted 184.19 uses the unrounded residual mean square, 427.93 / 48 ≈ 8.915.)

Examining the F tables for $F_{0.05,1,48}$ gives a critical value of approximately 4.04.

Our value of 184.19 is greater than this critical value, so we can conclude that there is a statistically significant relationship between y and x.
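The F-Ratio arithmetic can be reproduced from the sums of squares and degrees of freedom in the ANOVA table. This is a minimal sketch: the variable names are illustrative, and `F_CRITICAL` is an assumed value read from F tables rather than computed here.

```python
# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression, df_regression = 1642.07, 1
ss_residual, df_residual = 427.93, 48

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_ratio = ms_regression / ms_residual

# Assumption: approximate tabulated 5% critical value for 1 and 48 degrees of freedom.
F_CRITICAL = 4.04

print(f"F-Ratio = {f_ratio:.2f}")
print("Significant at the 5% level" if f_ratio > F_CRITICAL else "Not significant at the 5% level")
```

Keeping the residual mean square unrounded matters: dividing by the rounded 8.92 would give about 184.09 instead of 184.19.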
Is there a significant relationship between y and x?
Analysis of Variance

Source DF SS MS F P
Regression 1 1642.07 1642.07 184.19 0.000
Error 48 427.93 8.92
Total 49 2070.00

Minitab gives a “P” value as the outcome of a Hypothesis Test:

H0: The regression is not significant (i.e. variation in the x is not significant in explaining the variation in the y)
H1: The regression is significant

Minitab’s P value is the probability that we would get an F value at least this large if the Null Hypothesis were true.

Since it is below 0.05 we can conclude with at least 95% confidence that the number of errors is influenced by the volume of invoices processed.
How much variation in y is explained by changes in x?

$SS_{TOTAL} = SS_{REGRESSION} + SS_{RESIDUAL}$

$SS_{TOTAL}$ = Total Sum of Squares = total variability in the y values.

$SS_{REGRESSION}$ = Regression Sum of Squares = the amount of variability in the y values explained by the regression relationship.

$SS_{RESIDUAL}$ = Residual Sum of Squares (or Error Sum of Squares) = the amount of variability in the y values not accounted for by the regression relationship.
How much variation in y is explained by changes in x?

The Coefficient of Determination $R^2$

$$R^2 = \frac{SS_{REGRESSION}}{SS_{TOTAL}} = \frac{1642.07}{2070.00} = 0.79$$

The coefficient of determination is normally expressed as a percentage. It represents the percentage of the total variability accounted for by the regression relationship. It can also be used to test whether the regression accounts for a statistically significant amount of the total variability.
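As a sketch, both R-Sq and R-Sq(adj) can be recovered from the ANOVA figures. The slides do not give the adjusted formula; the one below is the standard definition (one minus the ratio of residual mean square to total mean square) and is stated here as an assumption about what Minitab reports.

```python
# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression = 1642.07
ss_residual, df_residual = 427.93, 48
ss_total, df_total = 2070.00, 49

# Coefficient of determination: share of total variability explained by the regression.
r_squared = ss_regression / ss_total

# Adjusted R-squared (assumed standard definition): compares mean squares, not sums of squares.
r_squared_adj = 1 - (ss_residual / df_residual) / (ss_total / df_total)

print(f"R-Sq = {r_squared:.1%}, R-Sq(adj) = {r_squared_adj:.1%}")
```

With these inputs the two percentages should match the 79.3% and 78.9% shown on the fitted line plot.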
How much variation in y is still unexplained?

The Residual (Error) term provides us with information concerning the amount of variation in y which is not accounted for by the regression. The square root of the residual mean square is the standard error of y about the regression equation.

$\sqrt{\text{MS Error}}$ = standard error of y about x (here $\sqrt{427.93 / 48} \approx 2.99$, the S value in the Minitab output)

We can use the standard error to calculate confidence intervals for y values for any given value of x.
Residuals

$\sqrt{\text{MS Error}}$ = standard error of y about x

[Figure: Scatterplot of Errors vs Volume, with the actual value, the predicted value and the residual between them marked for one point]

Residuals are the difference between the observed values of y and the predicted values based on the regression model.
Examination of Residuals
Residuals are the differences between the observed values of y and the predicted
values based on the regression model. If there was no difference between these two
entities, then we would have a perfect model. In reality, this is unlikely to occur.

x     Observed y   Predicted ŷ = -21.74 + 0.1465x   (y − ŷ)    (y − ŷ)²
155       2                  0.9675                  1.0325     1.066
165       5                  2.4325                  2.5675     6.592
170       3                  3.165                  -0.165      0.027
*         *                  *                       *          *
199       5                  7.4135                 -2.4135     5.825
                                                     Sum:       427.93 = $SS_{RESIDUAL}$
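The residual columns can be recomputed directly from the fitted equation. The sketch below uses the rounded coefficients from the Minitab output and takes each residual as observed minus predicted; the function name `residual_row` is an illustrative assumption, and only the rows listed explicitly in the table are included.

```python
# Coefficients of the fitted line, rounded as in the Minitab output.
INTERCEPT = -21.74
SLOPE = 0.1465

# (volume, observed errors) pairs for the rows shown explicitly in the table.
observations = [(155, 2), (165, 5), (170, 3), (199, 5)]


def residual_row(x, y):
    """Return (predicted y, residual, squared residual) for one observation."""
    y_hat = INTERCEPT + SLOPE * x
    residual = y - y_hat  # observed minus predicted
    return y_hat, residual, residual ** 2


for x, y in observations:
    y_hat, residual, squared = residual_row(x, y)
    print(f"{x:>4} {y:>3} {y_hat:8.4f} {residual:8.4f} {squared:7.3f}")
```

Summing the squared residuals over all 50 observations, rather than these four rows, gives the Residual Sum of Squares in the ANOVA table.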
Residuals vs Fits
[Figure: Residuals Versus the Fitted Values (response is Errors); Residual on the vertical axis, Fitted Value on the horizontal axis]

• By examining the residual plot we can check for:
– Lack of fit (model inadequacy)
– Non-constant variability
• When we have sufficient data points, a normality test can also be carried out. The distribution of residuals should be normal if the model is a good fit to the data.
Normality of Residuals
[Figure: Normal Probability Plot of RESI1 (the residuals); Percent on the vertical axis, RESI1 on the horizontal axis]
Mean 3.055334E-15   StDev 2.955   N 50   AD 0.342   P-Value 0.479

• In this case a Normality Test of the Residuals gives no evidence that they depart from Normal (p value = 0.479, which is > 0.05)
How accurate is my prediction of y?


$\sqrt{\text{MS Error}}$ = standard error of y about x

Statistical software programs will use the error mean square to calculate confidence intervals when predicting y for a given value of x. We can obtain confidence intervals for the predicted mean value and also for the predicted individual values.
How accurate is my prediction of y?

Open Worksheet: “Invoicing Errors”

1. Enter Errors and Volume
2. Check Linear
3. Click on Options
4. Tick both Display Options
How accurate is my prediction of y?
[Figure: Fitted Line Plot, Errors = - 21.74 + 0.1465 Volume, showing the regression line with 95% CI and 95% PI bands; S = 2.98583, R-Sq = 79.3%, R-Sq(adj) = 78.9%]

• 95% Confidence Intervals show the range within which we expect the average number of errors to lie for any particular volume of invoices being processed
• 95% Prediction Intervals show the range within which we expect 95% of the individual error values to fall at a given volume when we predict them with the regression equation
• Precise values can be obtained within the Stat > Regression > Regression menu
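For illustration, both intervals can be approximated by hand with the textbook formulas for simple linear regression. This is a sketch under stated assumptions, not Minitab's output: `X_BAR` and `S_XX` were computed from the 50 volume values in the example data, `T_CRIT` is the two-sided 95% t value for 48 degrees of freedom taken from t tables, and the function name `intervals` is hypothetical.

```python
import math

N = 50                 # number of observations
X_BAR = 197.5          # mean invoice volume (computed from the example data)
S_XX = 76500.5         # sum of squared deviations of volume from its mean (computed from the data)
S = 2.98583            # standard error of y about the line, from the Minitab output
T_CRIT = 2.0106        # assumption: t value for 95% two-sided, 48 degrees of freedom
INTERCEPT, SLOPE = -21.74, 0.1465


def intervals(x0):
    """Return (prediction, 95% confidence interval, 95% prediction interval) at volume x0."""
    y_hat = INTERCEPT + SLOPE * x0
    # Uncertainty in the fitted mean grows with distance of x0 from the mean volume.
    spread = 1 / N + (x0 - X_BAR) ** 2 / S_XX
    ci_half = T_CRIT * S * math.sqrt(spread)      # interval for the average number of errors
    pi_half = T_CRIT * S * math.sqrt(1 + spread)  # interval for an individual day's errors
    return y_hat, (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)


y_hat, ci, pi = intervals(200)
print(f"Predicted errors: {y_hat:.2f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f}")
print(f"95% PI: {pi[0]:.2f} to {pi[1]:.2f}")
```

The prediction interval is always wider than the confidence interval because it adds the scatter of individual days about the line to the uncertainty in the line itself.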
Regression Exercises
Question 1:
A company developing healthcare software solutions is bidding for a new
contract and has historical data on similar previous contracts. It wants to
minimise the risk of failing to deliver the solution on time, so wants a good
estimate of the man-years of effort needed (the output measure, or y).
The variables previously recorded are the number of application sub-programs
written (x1), and the number of software configuration change proposals
implemented (x2).
Use regression to:
1. Investigate the relationship between x1 and the man-years required
2. Investigate the relationship between x2 and the man-years required
3. If the company estimates that 150 application sub-programs will be required, and there are likely to be 100 software configuration change proposals implemented, what would be your recommendation for the number of man-years they should estimate?
Data is in Minitab Worksheet: Transactional Regression Exercises.mtw
Regression Exercises

Question 2:
The team investigating the Expense Claims process have
identified a potential input variable (x) that they believe
could affect the amount of time taken to pay the claims. The
potential variable is the amount of money claimed, and they
have gathered data on amounts claimed for the 100 payment
times they already had. Use Regression Analysis to
investigate the relationship, and be prepared to advise the
team on your conclusions.
Data is in Minitab Worksheet:
PAYMENT TIMES.mtw
Summary - Linear & Polynomial Regression

• Regression Analysis can be used to identify x’s that are affecting the y’s
• A linear or polynomial regression model of y=f(x) can
be developed for individual x’s
• The model can be tested to see if it is significant and
how well it fits the data
• The model can be used to make predictions of y for
given values of x
• Regression is used much more extensively in
operational and DFSS activities