
ACTL2002/ACTL5101 Probability and Statistics: Week 12

ACTL2002/ACTL5101 Probability and Statistics


© Katja Ignatieva

School of Risk and Actuarial Studies
Australian School of Business
University of New South Wales
k.ignatieva@unsw.edu.au

First nine weeks

- Introduction to probability;
- Moments: (non-)central moments, mean, variance (standard deviation), skewness & kurtosis;
- Special univariate (parametric) distributions (discrete & continuous);
- Joint distributions;
- Convergence, with applications: LLN & CLT;
- Estimators (MME, MLE, and Bayesian);
- Evaluation of estimators;
- Interval estimation.

Last two weeks

Simple linear regression:
- Idea;
- Estimation using LSE (& BLUE estimator & relation to MLE);
- Partition of the variability of the variable;
- Testing:
  i) Slope;
  ii) Intercept;
  iii) Regression line;
  iv) Correlation coefficient.

Multiple linear regression:
- Matrix notation;
- LSE estimates;
- Tests;
- R-squared and adjusted R-squared.

Modelling with Linear Regression

Modelling assumptions in linear regression
- Confounding effects
- Collinearity
- Heteroscedasticity

Special explanatory variables
- Interaction of explanatory variables
- Categorical explanatory variables

Model selection
- Reduction of number of explanatory variables
- Model validation

Confounding effects

Linear regression measures the effect of explanatory variables X1, ..., Xn on the dependent variable Y.

The assumptions are:
- Effects of the covariates (explanatory variables) must be additive;
- Homoscedastic (constant) variance;
- Errors must be independent of the explanatory variables with mean zero (weak assumptions);
- Errors must be Normally distributed, and hence symmetric (strong assumptions).

But what about confounding variables? Correlation does not imply causality!

C is a confounder of the relation between X and Y if C influences X and C influences Y, but X does not influence Y (directly).

How to correctly use (or not use) confounding variables?

- If the confounding variable is observable: add the confounding variable to the regression.
- If the confounding variable is unobservable: be careful with the interpretation.
  - The predictor variable has an indirect influence on the dependent variable.
    Example: Age -> Experience -> Probability of a car accident. Experience cannot be measured, thus age can be a proxy for experience.
  - The predictor variable has no direct influence on the dependent variable.
    Example: Becoming older does not make you a better driver.

Hence, such a predictor variable works as a predictor, but action taken on the predictor itself will have no effect.

Collinearity

Multicollinearity occurs when one explanatory variable is a (nearly) linear combination of the other explanatory variables.

If an explanatory variable is collinear, this variable is redundant; it provides no/little additional information.

Example: a perfect fit for y = -87 + x1 + 18 x2, but also for y = -7 + 9 x1 + 2 x2:

i      1   2   3    4
yi    23  83  63  103
xi1    2   8   6   10
xi2    6   9   8   10

Note that x2 = 5 + x1/2; thus beta2_hat = beta2 + c, beta1_hat = beta1 - c/2 and beta0_hat = beta0 - 5c give the same fitted values for any c.
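The equivalence of the two fits can be checked numerically; a minimal sketch using the values from the table (the intercepts are negative, consistent with the data):

```python
# Data from the collinearity example; note x2 = 5 + x1/2 exactly.
x1 = [2, 8, 6, 10]
x2 = [6, 9, 8, 10]
y = [23, 83, 63, 103]

# Two different coefficient vectors, identical fitted values:
fit_a = [-87 + 1 * a + 18 * b for a, b in zip(x1, x2)]
fit_b = [-7 + 9 * a + 2 * b for a, b in zip(x1, x2)]

print(fit_a, fit_b)  # both reproduce y exactly
```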

Collinearity

Collinearity:
- does not influence the fit, nor the predictions;
- estimates of the error variance, and thus also model adequacy, are still reliable;
- standard errors of individual regression coefficients are higher, leading to small t-ratios.

Detecting collinearity:
i) Regress xj on the other explanatory variables;
ii) Determine the coefficient of determination Rj^2;
iii) Calculate the Variance Inflation Factor: VIFj = (1 - Rj^2)^{-1}. If it is large (> 10), severe collinearity exists.

When severe collinearity exists, often the only option is to remove one or more variables from the regression equation.
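The VIF recipe above can be sketched in Python; a minimal implementation assuming a plain numpy design matrix (the data in the usage note are illustrative):

```python
import numpy as np

def vif(X, j):
    """Variance Inflation Factor of column j: regress x_j on the other
    columns (plus an intercept) and return 1 / (1 - R_j^2)."""
    n = X.shape[0]
    y = X[:, j]
    Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)
```

For example, a column that is nearly 5 + x1/2 gets a very large VIF, while a genuinely informative column stays below the rule-of-thumb threshold of 10.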

Heteroscedasticity

We have assumed homoscedastic residuals; if the variance of the residuals differs across observations, then we have heteroscedasticity.

- The least squares estimator is unbiased, even in the presence of heteroscedasticity.
- The least squares estimator might, however, not be the optimal estimator.
- Confidence intervals and hypothesis tests depend on homoscedastic residuals.

Graphical check: plot the estimated residuals against the endogenous variable.

[Figure: residuals plotted against yi and against x1i, x2i; the spread of the residuals grows with yi. A final panel shows the residuals after the transformation y*i = log(yi).]

Solution: use a transformation for Y, i.e., y*i = log(yi).

Graphical check: plot the estimated residuals against the explanatory variables. Here LM = SSM/2 = 34, while chi-squared_{0.99}(2) = 9.21 and chi-squared_{0.95}(2) = 5.99, so homoscedasticity is rejected.

[Figure: residuals plotted against yi, x1i, and x2i; the spread of the residuals grows with x1i.]

Detecting heteroscedasticity

F-test (using two groups of data), White (1980) test, Breusch and Pagan (1979) test.

Breusch-Pagan test: test H0: homoscedastic residuals v.s. H1: Var(yi) = sigma^2 + z_i' gamma, where z_i is a known vector of variables and gamma is a p-dimensional vector of parameters.

Test procedure:
1. Fit the regression model and determine the residuals eps_i.
2. Calculate the squared standardized residuals eps*_i^2 = eps_i^2 / s^2.
3. Fit a regression model of eps*_i^2 on z_i (can be all X_i).
4. Test statistic: LM = SSM/2, where SSM is the model sum of squares of this auxiliary regression, SSM = sum_{i=1}^n (fitted eps*_i^2 - mean(eps*^2))^2.
5. Reject H0 if LM > chi-squared_{1-alpha}(p).
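This test procedure can be sketched in Python; a minimal numpy version (function and variable names are illustrative, and the ML variance estimate SSE/n is used for s^2):

```python
import numpy as np

def breusch_pagan(y, X, Z):
    """Sketch of the Breusch-Pagan LM statistic. X: regressors including an
    intercept column; Z: variance-driving variables (also with intercept).
    Returns LM = SSM/2 from the auxiliary regression of the squared
    standardized residuals on Z."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    u = e**2 / (e @ e / len(y))       # squared standardized residuals
    g, *_ = np.linalg.lstsq(Z, u, rcond=None)
    fitted = Z @ g
    return float(np.sum((fitted - u.mean())**2) / 2.0)
```

Reject homoscedasticity when the statistic exceeds the chi-squared critical value with p degrees of freedom, p being the number of columns of Z excluding the intercept.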

GLS & WLS (OPTIONAL)

Sometimes you know how much variation there should be in the residual for each observation.
Example: differences in exposure to risk.
Application: mortality modelling (exposures by age), proportion claiming (changing portfolio sizes).

Heteroscedasticity: Sigma != I_n:

E[eps eps'] = sigma^2 Sigma,   with   Sigma^{-1} = P'P.

Find the OLS estimates of the regression model which is pre-multiplied by P:

P y = P X beta + P eps
y~ = X~ beta + eps~

Find P using (for example) the Cholesky decomposition.

GLS & WLS (OPTIONAL)

Note that we can apply OLS to the model pre-multiplied by P:

E[eps~ eps~'] = E[P eps eps' P'] = P E[eps eps'] P' = P sigma^2 Sigma P' = sigma^2 I_n.

Hence, we have the Generalized Least Squares estimator:

beta_hat = (X~' X~)^{-1} X~' y~
         = (X' P' P X)^{-1} X' P' P y
         = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y

Var(beta_hat) = sigma^2 (X~' X~)^{-1} = sigma^2 (X' Sigma^{-1} X)^{-1}

GLS & WLS (OPTIONAL)

Weighted least squares: each observation is weighted by 1/sqrt(sigma_i):

E[eps eps'] = sigma^2 Sigma = sigma^2 diag(sigma_1, ..., sigma_n)

Sigma^{-1} = diag(1/sigma_1, ..., 1/sigma_n) = P'P,   with   P = diag(1/sqrt(sigma_1), ..., 1/sqrt(sigma_n)).

Only applicable when you know the relative variances sigma_1, ..., sigma_n.

Can we also estimate the variance-covariance matrix Sigma?
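The pre-multiplication trick can be sketched in Python: OLS on the transformed data (P y, P X) reproduces the direct GLS formula (numpy sketch; data in the usage note are illustrative):

```python
import numpy as np

def wls(y, X, sigma):
    """OLS on the transformed model: weight each row by 1/sqrt(sigma_i),
    i.e. pre-multiply by P = diag(1/sqrt(sigma_1), ..., 1/sqrt(sigma_n))."""
    w = 1.0 / np.sqrt(sigma)
    beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta

def gls(y, X, Sigma):
    """Direct GLS formula: (X' Sigma^-1 X)^-1 X' Sigma^-1 y."""
    Si = np.linalg.inv(Sigma)
    return np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
```

For a diagonal Sigma = diag(sigma_1, ..., sigma_n) the two routines return the same coefficients.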

EGLS/FGLS (OPTIONAL)

Feasible GLS (or Estimated GLS) does not impose the structure of heteroscedasticity, but estimates it from the data.

Estimation procedure:
1. Estimate the regression using OLS.
2. Regress the squared residuals on the explanatory variables.
3. Determine the expected squared residuals: Sigma_hat = diag(sigma_hat_1, ..., sigma_hat_n).
4. Use WLS with weights sigma_hat_i to find the EGLS/FGLS estimate.
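The four steps can be sketched as follows; a minimal numpy version (the positivity floor on the fitted variances is an added safeguard, not part of the slides):

```python
import numpy as np

def fgls(y, X):
    """Feasible GLS sketch: OLS, regress squared residuals on X to get
    expected variances, then WLS with those estimated variances."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)        # step 1
    e2 = (y - X @ b_ols) ** 2                            # step 2
    g, *_ = np.linalg.lstsq(X, e2, rcond=None)
    var_hat = np.clip(X @ g, 1e-8, None)                 # step 3, kept positive
    w = 1.0 / np.sqrt(var_hat)                           # step 4: WLS
    b_fgls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return b_fgls
```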


Interaction of explanatory variables

- In linear regression covariates are additive.
- Multiplicative relations are non-additive.
  Example: liquidity of stocks can be explained by price, volume, and value (= price * volume).
- Note that interaction terms might lead to high collinearity.
- For symmetric distributions, if the explanatory variables are centered, there is no correlation between the interaction term and the main effects. Thus, center variables to reduce collinearity issues.

Example: yi = beta0 + beta1 x1i + beta2 x2i + beta3 x1i x2i + eps_i, where beta1 and beta2 are the main effects and beta3 is the interaction effect. The marginal effect of X1 given X2 = x2 is d yi / d x1 = beta1 + beta3 x2.
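The claim that centering removes the correlation between the interaction term and the main effects can be illustrated numerically (simulated symmetric data; the exact figures are illustrative, not the slide's data set):

```python
import numpy as np

# Two independent symmetric regressors with non-zero means.
rng = np.random.default_rng(42)
x1 = rng.normal(5.0, 1.0, 500)
x2 = rng.normal(3.0, 1.0, 500)

# Correlation of the main effect x1 with the raw interaction x1*x2 ...
corr_raw = np.corrcoef(x1, x1 * x2)[0, 1]
# ... and with the interaction of the centered variables.
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()
corr_centered = np.corrcoef(x1c, x1c * x2c)[0, 1]

print(corr_raw, corr_centered)  # the centered correlation is near zero
```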

Interaction of random variables

A moderator variable is a predictor (e.g., X1) that interacts with another predictor (e.g., X2) in explaining variance in a predicted variable (e.g., Y).

Moderating effects are equivalent to interaction effects. Rearranging we get:

yi = (beta0 + beta1 X1) + (beta2 + beta3 X1) X2 + eps_i,

where the first term shows that the intercept depends on X1 and the second term shows that the slope of X2 depends on X1.

Always include the marginal effects in the regression.

Example: interaction of random variables

Regression output (see excel file):

           Main effects    Full model      Excluding X2    Centered
           Coef  t-stat    Coef  t-stat    Coef  t-stat    Coef  t-stat
intercept  0.71  0.94      0.61  1.51      0.46  1.55      1.90  6.08
X1         3.88  4.29      0.83  1.24      1.02  1.83      2.73  5.30
X2         0.46  0.71      0.18  0.53                      1.08  3.01
X1*X2                      1.11  6.57      1.12  6.83      1.11  6.57

Correlations (where x~i = xi - E[Xi]):

non-centered:                      centered:
      x1     x2     x1*x2               x~1    x~2    x~1*x~2
y     0.92   0.82   0.97           y    0.92   0.82   0.51
x1           0.86   0.80           x~1         0.86   0.07
x2                  0.80           x~2                0.07

Always include the marginal effects (heteroscedastic variance): scatter plot of the residuals v.s. the explanatory variables.

[Figure: residuals plotted against x1i and against x1i*x2i; the variance of eps_i is a decreasing function of x1i*x2i.]


Binary explanatory variables

Categorical variables provide a numerical label for measurements of observations that fall in distinct groups or categories.

A binary variable is a variable which can only take two values, namely zero and one.
Example: gender (male = 1, female = 0), number of years of education (1 if more than 12 years, 0 otherwise).

Regression (see next slide), standard errors in parentheses:

LnFace_i     = beta0 + beta1 LnIncome_i + beta2 Single_i + eps_i
LnFace_hat_i = 0.42  + 1.12 LnIncome_i  - 0.51 Single_i
              (0.56)  (0.05)             (0.16)            s = 0.57

Binary explanatory variables

Interpretation of the coefficients:
- beta0: intercept for non-singles;
- beta1: marginal effect of LnIncome;
- beta2: difference in intercept between singles and non-singles, i.e., beta0 + beta2 is the intercept for singles.

[Figure: LnFace against LnIncome for non-singles and singles, with parallel fitted lines.]

[Figure: LnFace against LnIncome, with the fitted regression lines for non-singles and singles.]

[Figure: residuals against LnIncome for non-singles and singles.]

Binary explanatory variables

Regression (see next slide), standard errors in parentheses:

LnFace_i     = beta0 + beta1 LnI_i + beta2 S_i + beta3 S_i LnI_i + eps_i
LnFace_hat_i = 0.11  + 1.07 LnI_i  - 3.28 S_i  + 0.27 S_i LnI_i
              (0.61)  (0.06)        (1.41)      (0.14)             s = 0.56

Interpretation of the coefficients:
- beta0: intercept for non-singles;
- beta1: marginal effect of LnIncome;
- beta2: difference in intercept between singles and non-singles, i.e., beta0 + beta2 is the intercept for singles;
- beta3: difference in the marginal effect of LnIncome between singles and non-singles, i.e., beta1 + beta3 is the marginal effect of LnIncome for singles.
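The single/non-single specification can be written as a design matrix. A sketch (the slide's point estimates are used, with beta2 taken as negative to match the interpretation that singles have a lower intercept beta0 + beta2):

```python
import numpy as np

# Design matrix for LnFace = b0 + b1*LnI + b2*S + b3*S*LnI + e.
def design(ln_income, single):
    ln_income = np.asarray(ln_income, dtype=float)
    single = np.asarray(single, dtype=float)
    return np.column_stack([np.ones_like(ln_income), ln_income,
                            single, single * ln_income])

beta = np.array([0.11, 1.07, -3.28, 0.27])
# Same income, non-single v.s. single: for singles the intercept is
# b0 + b2 and the slope of LnI is b1 + b3.
pred = design([10.0, 10.0], [0.0, 1.0]) @ beta
```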

[Figure: LnFace against LnIncome, with fitted lines of different slopes for non-singles and singles.]

[Figure: residuals against LnIncome for non-singles and singles.]

Binary explanatory variables

Now consider the example where we try to explain the hourly wage of individuals in a random sample. As explanatory variable we have years of education.

Question: Why are you here?

Regression (see next slide), standard errors in parentheses:

HW_i     = beta0 + beta1 YE_i + eps_i
HW_hat_i = -406  + 33 YE_i
           (8.97)  (0.65)       s = 12

Solution: To earn more money later on?

Question: Explain the extent of correlation and causality in this case.

[Figure: hourly wage against years of education (12 to 16 years).]

[Figure: residuals against years of education.]

Binary explanatory variables

Regression (see next slide), standard errors in parentheses:

HW_i     = beta0 + beta1 YE_i + beta2 (YE_i - 14) D_i + eps_i
HW_hat_i = -151  + 13.7 YE_i  + 35 (YE_i - 14) D_i
           (3.91)  (0.30)       (0.47)                  s = 2.37

where D_i = 1 if YE_i > 14 and zero otherwise.

Interpretation of the coefficients:
- beta0: intercept for hourly wage;
- beta1: marginal effect of years of education up to 14 years;
- beta2: difference in the marginal effect of years of education before and after 14 years, i.e., beta1 + beta2 is the marginal effect of years of education for YE > 14.
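The kinked (piecewise-linear) specification can be sketched as follows; the point estimates are the slide's, with the intercept taken as -151 (an assumption here) so that predicted wages fall in the plotted range:

```python
import numpy as np

# HW = b0 + b1*YE + b2*(YE - 14)*D, with D = 1 when YE > 14.
def kink_design(ye, knot=14.0):
    ye = np.asarray(ye, dtype=float)
    d = (ye > knot).astype(float)
    return np.column_stack([np.ones_like(ye), ye, (ye - knot) * d])

beta = np.array([-151.0, 13.7, 35.0])
hw = kink_design([12.0, 14.0, 16.0]) @ beta
# Slope is b1 below the knot and b1 + b2 above it.
```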

[Figure: hourly wage against years of education, with the piecewise-linear fit kinked at 14 years.]

[Figure: residuals against years of education for the piecewise-linear model.]

Categorical explanatory variables

Now consider the example where we try to explain the yearly income of individuals in a random sample. As explanatory variable we have the highest degree: high school, college, university, PhD.

Define: Ci = 0, if the highest degree is high school;
        Ci = 1, if the highest degree is college;
        Ci = 2, if the highest degree is a university degree;
        Ci = 3, if the highest degree is a PhD.

Regression (see next slide), standard errors in parentheses:

YI_i     = beta0 + beta1 C_i + eps_i
YI_hat_i = 33.0  + 8.00 C_i
           (1.36)  (0.91)      s = 15.7

[Figure: empirical yearly income and residuals against education level for the categorical fit.]

[Figure: empirical yearly income and residuals against education level for the dummy-variable fit.]

Comparing categorical and dummy explanatory variables

Change the categorical variable into dummy variables.

Regression (see previous slide), standard errors in parentheses:

YI_i     = beta0 + beta1 D1,i + beta2 D2,i + beta3 D3,i + eps_i
YI_hat_i = 33.4  + 5.32 D1,i  + 18.07 D2,i + 14.50 D3,i
           (1.53)  (2.12)       (2.02)       (4.05)       s = 15.5

Interpretation:
- beta0: average income of high school;
- beta1: average additional income of college relative to HS;
- beta2: average additional income of university relative to HS;
- beta3: average additional income of PhD relative to HS.

Conclusion: only use a categorical variable if the marginal effects of all categories are equal!
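Converting the categorical variable into dummies can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

# Expand the categorical education level C in {0, 1, 2, 3} into dummies
# D1, D2, D3, with high school (C = 0) as the baseline category.
def dummies(c):
    c = np.asarray(c)
    return np.column_stack([(c == k).astype(float) for k in (1, 2, 3)])

c = np.array([0, 1, 2, 3, 2])
D = dummies(c)
# The dummy model fits a separate mean per category, whereas the single
# categorical regressor C forces equal income steps between adjacent
# categories (here a step of beta1 = 8.00 per level).
```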

Reduction of number of explanatory variables

There are many possible combinations of explanatory variables (2^k - 1 non-empty subsets of k candidate variables):

number of X   combinations     number of X   combinations
1                        1     6                      63
2                        3     7                     127
3                        7     8                     255
4                       15     9                     511
5                       31     10                   1023

You do not want to check them all.

How to decide which explanatory variables to include?

Stepwise regression algorithm

(i) Consider all possible regressions using one explanatory variable. For each of the regressions calculate the t-ratio. Select the explanatory variable with the highest absolute t-ratio (if larger than the critical value).
(ii) Add a variable to the model from the previous step. The variable to enter is the one that makes the largest significant contribution; its t-ratio must be above the critical value.
(iii) Delete a variable from the model from the previous step. The variable to be removed is the one that makes the smallest contribution; its t-ratio must be below the critical value.
(iv) Repeat steps (ii) and (iii) until all possible additions and deletions are performed.
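Steps (i)-(iv) can be sketched in Python; a minimal numpy version with a fixed critical value cv (a real implementation would guard more carefully against add/drop cycles):

```python
import numpy as np

def t_ratios(y, X):
    """OLS t-ratios; X must include an intercept column."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)
    return beta / np.sqrt(s2 * np.diag(XtX_inv))

def stepwise(y, X, cv=2.0):
    """Forward: add the candidate with the largest |t| above cv.
    Backward: drop the included variable with the smallest |t| if it
    falls below cv. Repeat until nothing changes."""
    n, p = X.shape
    selected = []
    for _ in range(2 * p + 2):          # crude guard against cycling
        changed = False
        best_j, best_t = None, cv       # steps (i)/(ii): forward
        for j in range(p):
            if j in selected:
                continue
            Z = np.column_stack([np.ones(n)] + [X[:, i] for i in selected + [j]])
            t = abs(t_ratios(y, Z)[-1])
            if t > best_t:
                best_j, best_t = j, t
        if best_j is not None:
            selected.append(best_j)
            changed = True
        if selected:                    # step (iii): backward
            Z = np.column_stack([np.ones(n)] + [X[:, i] for i in selected])
            t = np.abs(t_ratios(y, Z)[1:])
            worst = int(np.argmin(t))
            if t[worst] < cv:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return sorted(selected)
```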

Stepwise regression algorithm

+ A useful algorithm that quickly searches through a number of candidate models.
- The procedure snoops through a large number of candidate models and may fit the data too well.
- There is no guarantee that the selected model is the best.
- The algorithm uses a single criterion, namely the t-ratio, and does not consider other criteria such as s, R^2, Ra^2, and so on (s will decrease if the absolute value of the t-ratio is larger than one).
- The algorithm does not take into account the joint effect of explanatory variables.
- Purely automatic procedures may not take into account an investigator's special knowledge.

Comparing models

How to compare two regression models with the same number of explanatory variables:
- F-statistic;
- the variability of the residual (s);
- R-squared.

How to compare two regression models with unequal numbers of explanatory variables:
- the variability of the residual (s);
- adjusted R-squared;
- likelihood ratio test: 2 (l_{p+q} - l_p) ~ chi-squared(q).
  Reject that the model with p variables is as good as the model with p + q variables if 2 (l_{p+q} - l_p) > chi-squared_{1-alpha}(q).
- (Optional:) Information criteria: select the model with the lowest AIC = 2k - 2l.
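The likelihood-ratio comparison can be sketched for Gaussian regressions, where the maximized log-likelihood has the closed form l = -(n/2)(log(2 pi sigma_hat^2) + 1) with sigma_hat^2 = SSE/n (numpy sketch; the data in the test are illustrative):

```python
import numpy as np

def loglik(y, X):
    """Maximized Gaussian log-likelihood of a linear model."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    s2 = resid @ resid / n          # ML estimate of the error variance
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1)

def lr_stat(y, X_small, X_big):
    """2 * (l_{p+q} - l_p); compare with a chi-squared(q) critical value."""
    return 2.0 * (loglik(y, X_big) - loglik(y, X_small))
```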

Out-of-Sample Validation Procedure (OPTIONAL)

(i) Begin with a sample of size n; divide it into two subsamples of sizes n1 and n2.
(ii) Using the model development subsample, fit a candidate model to the data set i = 1, ..., n1.
(iii) Using the model from step (ii), predict yhat_i in the validation subsample.
(iv) Assess the proximity of the predictions to the held-out data. One measure is the sum of squared prediction errors:

SSPE = sum_{i = n1+1}^{n} (y_i - yhat_i)^2
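A sketch of the split-sample procedure (numpy; the data and split point used for checking are illustrative):

```python
import numpy as np

def sspe(y, X, n1):
    """Fit on the first n1 observations, then sum the squared prediction
    errors over the held-out observations n1+1, ..., n."""
    beta, *_ = np.linalg.lstsq(X[:n1], y[:n1], rcond=None)
    return float(np.sum((y[n1:] - X[n1:] @ beta) ** 2))
```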

PRESS Validation Procedure (OPTIONAL)

Predicted residual sum of squares:
(i) For the full sample, omit the i-th point and use the remaining n - 1 observations to compute the regression coefficients.
(ii) Use the regression coefficients from (i) to compute the predicted response for the i-th point, yhat_(i).
(iii) Repeat steps (i) and (ii) for i = 1, ..., n, and define:

PRESS = sum_{i=1}^{n} (y_i - yhat_(i))^2

Note: one can rewrite this into a computationally less intensive procedure:

y_i - yhat_(i) = eps_hat_i / (1 - x_i' (X'X)^{-1} x_i)
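The leverage shortcut can be verified against the brute-force leave-one-out loop; a numpy sketch (data are illustrative):

```python
import numpy as np

def press(y, X):
    """PRESS via the leverage shortcut: y_i - yhat_(i) = e_i / (1 - h_ii),
    with h_ii = x_i' (X'X)^{-1} x_i, so no n separate refits are needed."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    return float(np.sum((e / (1.0 - h)) ** 2))

def press_naive(y, X):
    """Direct leave-one-out computation, for comparison."""
    total = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        total += (y[i] - X[i] @ b) ** 2
    return total
```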

Data analysis and modeling

Box (1980): examine data, hypothesize a model, compare the data to the candidate model, formulate an improved model.

1. Examine the data graphically; use prior knowledge of relationships (economic theory, industry practice).
2. The assumptions underlying the model must be consistent with the data.
3. Diagnostic checks (data and model criticism): the data and the model must be consistent with one another before inferences can be made.