
Statistical Modeling

Linear Regression and ANOVA


A Practical Computational Perspective

Hamid D. Ismail, PhD


Table of Contents
PREFACE ................................................................................................................................... VIII
1. INTRODUCTION TO MODELING .................................................................................... 10
1.1 INTRODUCTION..................................................................................................................... 10
1.2 GENERALIZED LINEAR MODEL .............................................................................................. 11
1.2.1 The regression .............................................................................................................. 14
1.2.2 The effect model .......................................................................................................... 15
1.3 THE STEPS OF STATISTICAL MODELING ................................................................................. 16
1.3.1 Problem definition ....................................................................................................... 16
1.3.2 Identifying the variables and the levels of measurement ............................................. 17
1.3.3 Study design ................................................................................................................. 18
1.3.4 Data presentation ......................................................................................................... 18
1.3.5 Exploring the variables ................................................................................................ 19
1.3.6 Data transformation and new variables........................................................................ 19
1.3.7 Choosing a GLM.......................................................................................................... 19
1.3.8 Model fitting and testing .............................................................................................. 20
1.3.9 Checking model assumptions and adequacy................................................................ 20
1.3.10 Interpretation of the results ........................................................................................ 21
2. BASIC STATISTICS .............................................................................................................. 23
2.1 THE TYPES OF RESEARCH STUDIES ........................................................................................ 23
2.1.1 Observational study ..................................................................................................... 23
2.1.2 Experimental studies .................................................................................................... 23
2.2 DESCRIPTIVE STATISTICS...................................................................................................... 24
2.2.1 Numerical descriptive measures .................................................................................. 24
2.2.2 Graphical descriptive tools .......................................................................................... 26
2.3 PROBABILITY DISTRIBUTION ................................................................................................ 27
2.3.1 Probability distribution of discrete random variables .................................................. 27
2.3.2 Probability distribution of continuous random variables ............................................. 33
2.4 SAMPLING DISTRIBUTIONS ................................................................................................... 36
2.4.1 The distribution of the sample mean ............................................................................ 37
2.5 OTHER STATISTICAL DISTRIBUTIONS FOR A RANDOM SAMPLE .............................................. 40
2.5.1 Student’s t distribution ................................................................................................. 40
2.5.2 Chi-square distribution................................................................................................. 41
2.5.3 F distribution ................................................................................................................ 41
2.6 STATISTICAL INFERENCES .................................................................................................... 42
2.6.1 Estimation .................................................................................................................... 42
2.6.2 Hypothesis testing ........................................................................................................ 43
3. CORRELATIONS .................................................................................................................. 47
3.1 SCATTERPLOT ...................................................................................................................... 47
3.2 THE COVARIANCE................................................................................................................. 51
3.3 THE PEARSON CORRELATION COEFFICIENT .......................................................................... 54
3.3.1 Inference about correlation coefficient using hypothesis testing ................................. 57
3.3.2 Inference about correlation coefficient using the confidence interval ......................... 61
3.4 SPEARMAN CORRELATION COEFFICIENT ............................................................... 63
3.5 HOEFFDING DEPENDENCE COEFFICIENT .............................................................................. 66
3.6 KENDALL’S TAU-B CORRELATION COEFFICIENT ................................................................... 68
3.7 PARTIAL CORRELATION ........................................................................................................ 71
3.8 USING CORR AND IML PROCEDURES FOR CORRELATION.................................................... 74
3.8.1 Bivariate correlation example ...................................................................................... 75
3.8.2 Partial correlation example .......................................................................................... 83
4. SIMPLE LINEAR REGRESSION ....................................................................................... 86
4.1 INTRODUCTION..................................................................................................................... 86
4.1.1 The simple linear model assumptions .......................................................................... 88
4.2 THE LEAST SQUARES ESTIMATE FOR SIMPLE LINEAR MODEL PARAMETERS........................... 90
4.3 LEAST SQUARES ESTIMATE FOR VARIANCE........................................................................... 93
4.4 ANALYSIS OF VARIANCE AND INFERENCES ABOUT THE SLOPE .............................................. 95
4.4.1 Sum of squares for simple linear regression ................................................................ 96
4.4.2 The F statistic and coefficient of determination .......................................................... 97
4.5 INFERENCE ABOUT THE SLOPE USING STUDENT’S T STATISTIC ........................................... 100
4.6 INFERENCE ABOUT THE INTERCEPT..................................................................................... 104
4.7 CONFIDENCE INTERVAL OF MEAN RESPONSE AND PREDICTED VALUE................................. 107
4.7.1 The confidence interval for the expected value of response variable ........................ 107
4.7.2 Confidence interval for predicted value ..................................................................... 108
4.8 USING PROC REG FOR SIMPLE LINEAR REGRESSION ......................................................... 115
5. MULTIPLE LINEAR REGRESSION ............................................................................... 123
5.1 INTRODUCTION TO THE MULTIPLE LINEAR REGRESSION ..................................................... 123
5.2 DATA ORGANIZATION ........................................................................................................ 124
5.3 EXPLORING MULTIVARIATE DATA ...................................................................................... 127
5.4 FITTING A MULTIPLE LINEAR REGRESSION MODEL.............................................................. 130
5.5. ESTIMATING THE VARIANCE OF THE RESPONSE VARIABLE................................................. 136
5.6 THE VARIANCES OF THE REGRESSION COEFFICIENTS .......................................................... 137
5.7 INFERENCES ABOUT MODEL PARAMETERS .......................................................................... 139
5.8 ANALYSIS OF VARIANCE AND MODEL REDUCTION APPROACH ............................................ 141
5.8.1 Analysis of variance for multiple regression ............................................................. 141
5.8.2 Model reduction ......................................................................................................... 146
5.9 TESTING SIGNIFICANCE OF THE CONTRIBUTION OF THE PREDICTOR VARIABLES ................. 147
5.9.1 Testing the overall multiple linear model .................................................................. 147
5.9.2 Testing the coefficients of a subset of predictors equal zero ..................................... 149
5.9.3 Testing a subset of independent variables have equal effects.................................... 154
5.10 PREDICTION OF NEW OBSERVATIONS ................................................................................ 156
5.10.1 Extrapolation and joint region of the multiple linear model .................................... 156
5.10.2 Confidence interval for response mean.................................................................... 159
5.10.3 Prediction interval for an individual predicted response value ................................ 160
5.11 USING PROC REG FOR MULTIPLE LINEAR REGRESSION .................................................. 164
6. LINEAR REGRESSION ASSUMPTIONS, ADEQUACY, AND DIAGNOSTICS ....... 172
6.1 ASSUMPTIONS OF LINEAR REGRESSION............................................................................... 172
6.2 CHECKING LINEAR REGRESSION ADEQUACY WITH RESIDUAL ANALYSIS............................. 177
6.2.1 Standardized residuals ............................................................................................... 179
6.2.2 Studentized residuals ................................................................................................. 180
6.2.3 The predicted residual error sum of squares .............................................................. 181
6.2.4 Studentized residual with a deleted current observation............................................ 181
6.2.5 Checking the normality assumption........................................................................... 185
6.2.6 Checking linear model assumptions with scatterplots ............................................... 191
6.2.7 Linear regression diagnostics, leverage, and influence ............................................. 195
7. POLYNOMIAL REGRESSION ......................................................................................... 211
7.1 INTRODUCTION................................................................................................................... 211
7.2 SINGLE PREDICTOR POLYNOMIAL MODEL ........................................................................... 211
7.2.1 Single predictor second-order model ......................................................................... 212
7.2.2 Single predictor third-order model............................................................................. 221
7.3 POLYNOMIAL MODEL WITH TWO PREDICTORS OR MORE ..................................................... 225
7.4 ORTHOGONAL POLYNOMIAL REGRESSION .......................................................................... 229
7.4.1 Estimation of orthogonal regression coefficients ...................................................... 230
7.4.2 Inferences about the orthogonal regression coefficients ............................................ 232
7.4.3 The sum of squares for the orthogonal polynomials .................................................. 232
8. MULTILINEARITY AND ITS EFFECT ON MULTIPLE LINEAR REGRESSION . 238
8.1 COLLINEARITY OR MULTILINEARITY .................................................................................. 238
8.1.1 The effects of collinearity on coefficient estimates ................................................... 238
8.1.2 Variance inflation factor and tolerance ...................................................................... 242
8.1.3 Condition index and eigenvalues ............................................................................... 248
8.1.4 Variance decomposition proportions ......................................................................... 249
9. REGRESSION WITH CORRELATED DATA ................................................................ 258
9.1 RIDGE REGRESSION ............................................................................................................ 258
9.1.1 Ridge regression estimation ....................................................................................... 262
9.1.2 Choosing the k value for ridge regression ................................................................. 266
9.2 DATA REDUCTION USING PRINCIPAL COMPONENT ANALYSIS .............................................. 275
9.2.1 Principal component analysis .................................................................................... 276
10. VARIABLE SELECTION ................................................................................................. 290
10.1 VARIABLE SELECTION CRITERIA ....................................................................................... 291
10.1.1 Residual error mean square ...................................................................................... 291
10.1.2 Coefficient of multiple determination...................................................................... 292
10.1.3 Adjusted coefficient of multiple determination ....................................................... 292
10.1.4 The significance level of F statistic ......................................................................... 292
10.1.5 Mallows criterion ..................................................................................................... 293
10.1.6 Akaike Information Criterion .................................................................................. 294
10.1.7 Corrected Akaike Information Criterion .................................................................. 294
10.1.8 Bayes Information Criterion (BIC) .......................................................................... 294
10.1.9 The predicted residual error sum of squares ............................................................ 295
10.2 VARIABLE SELECTION METHODS ...................................................................................... 295
10.2.1 Generation of all possible subset models ................................................. 295
10.2.2 Stepwise selection method ....................................................................................... 304
10.2.3 LASSO selection...................................................................................................... 318
11. REGRESSION WITH DUMMY VARIABLES .............................................................. 323
11.1 DUMMY VARIABLES ......................................................................................................... 324
11.2 ANCOVA MODELS .......................................................................................................... 328
11.2.1 Separate intercept and common slope model........................................................... 329
11.2.2 Separate intercept and separate slope model ........................................................... 339
11.2.3 Common intercept and separate slope model .......................................................... 347
12. ONE-WAY ANOVA FOR FIXED EFFECT ................................................................... 354
12.1 FIXED EFFECT ................................................................................................................... 354
12.2 ONE-WAY ANOVA WITH FIXED EFFECT (MODEL I) ........................................................ 355
12.2.1 One-way ANOVA sum of squares and the overall test statistic .............................. 358
12.2.2 Linear contrast and preplanned comparison ............................................................ 365
12.2.3 Unplanned post-ANOVA comparisons ................................................................... 372
12.2.4 ANOVA adequacy ................................................................................................... 390
13. TWO-WAY ANOVA FOR FIXED EFFECTS ................................................................ 405
13.1 RANDOMIZED COMPLETE BLOCK DESIGN ......................................................................... 405
13.1.1 Randomized complete block design without replicates ........................................... 405
13.1.2 Randomized complete block design with replicates ................................................ 417
13.2 LATIN SQUARE DESIGN ..................................................................................................... 425
13.2.1 ANOVA model for Latin square design .................................................................. 427
13.3 THE FACTORIAL ANOVA ................................................................................................ 434
13.3.1 The two factorial design with interaction model ..................................................... 436
14. ANOVA FOR MIXED EFFECTS ..................................................................................... 446
14.1 RANDOM EFFECT .............................................................................................................. 446
14.2 ONE-WAY ANOVA WITH RANDOM EFFECT ..................................................................... 446
14.2.1 The expected sum of squares of random model....................................................... 448
14.2.2 The test statistic for random effect........................................................................... 450
14.3 MIXED MODEL.................................................................................................................. 454
14.3.1 Parameter estimation................................................................................................ 455
14.3.2 The estimation method for degrees of freedom of the fixed effect.......................... 456
INDEX........................................................................................................................................ 462

Preface
Statistical modeling is the branch of advanced statistics that explains or predicts probabilistic phenomena and behavior from available data. It has emerged as one of the important topics in statistics because of its applications in science and business and its impact on many fields, such as the stock market, medicine, and weather prediction, to name a few. In this era of big data and massive information, statistical modeling has become more valuable still: it is now an essential interdisciplinary subject in universities and a popular training topic in corporations. Rather than running convenient statistical packages and taking their output at face value, students of statistics and curious researchers may need to delve into the computational and mathematical details. This book is an attempt to satisfy the needs of students and researchers who wish to understand the finer points of linear modeling. It treats linear modeling from a computational perspective, with an emphasis on the mathematical details and step-by-step calculations.
Each chapter typically begins with the theory; the discussion is then followed by an example that demonstrates the relevant computation using SAS® PROC IML to mimic the manual calculation. The chapter usually concludes with a discussion of the convenient SAS procedure ordinarily used for that kind of analysis.
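As a sketch of this approach, a least squares fit can be reproduced step by step in PROC IML before turning to the standard procedure; the data values below are invented purely for illustration and do not come from the book.

```sas
proc iml;
   /* Toy data, assumed for illustration only */
   x = {1, 2, 3, 4, 5};
   y = {2.1, 3.9, 6.2, 8.1, 9.8};
   /* Build the design matrix: a column of ones plus the predictor */
   X = j(nrow(x), 1, 1) || x;
   /* Least squares estimates b = (X'X)^{-1} X'y, mimicking the manual calculation */
   b = inv(X` * X) * X` * y;
   print b;
quit;
```

The same coefficients would then be obtained with the convenient procedure, for example `proc reg; model y = x; run;` on a data set holding these hypothetical values.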
Chapter 1 gives a concise introduction to modeling and to the steps of statistical modeling that newcomers to modeling may appreciate.
Chapter 2 provides a refresher on basic statistics, such as experimental designs and descriptive and inferential statistics, that readers will encounter while reading this book.
Chapter 3 discusses the correlations between independent variables as a preliminary step in linear modeling. The correlations covered include the Pearson correlation and other common non-parametric correlations provided in statistical software packages such as SAS.
Chapters 4 and 5 cover simple and multiple linear regression, respectively. The chapters discuss
data organization, least squares estimation for regression coefficients, hypothesis testing,
confidence intervals, prediction of new observations, and the use of the SAS procedures for
regression analyses.
Chapter 6 covers model adequacy and the diagnostics that can be carried out to assess and validate a linear model.
Chapter 7 extends linear modeling theory to polynomial regression for a single predictor, multiple
predictors, and orthogonal polynomial regression.
Chapter 8 discusses the problem of correlated independent variables and its effect on least squares estimation. The chapter introduces various metrics that measure collinearity among the predictors, including the variance inflation factor, tolerance, the condition index, eigenvalues, and variance decomposition proportions.

Chapter 9 covers two methods for handling highly correlated data in linear modeling: ridge regression, which can serve as an alternative to the least squares method for parameter estimation, and principal component analysis, which transforms correlated, high-dimensional data into less correlated, lower-dimensional data.
Chapter 10 covers variable selection or model selection, which is a crucial step in modeling when
the number of independent variables is large. The chapter discusses the most commonly used
criteria and methods for variable selection.
Chapter 11 discusses the analysis of covariance (ANCOVA), that is, generalized linear models in which the response variable is quantitative and the independent variables are a mix of quantitative and qualitative variables. The chapter introduces dummy variables and regression with dummy variables, which give rise to multiple regression equations and the associated hypothesis tests.
Chapters 12 and 13 cover analysis of variance (ANOVA) with fixed effects, that is, generalized linear models in which the response variable is quantitative and the independent variables are qualitative with a known number of levels. These chapters discuss the most commonly used ANOVA experimental designs, such as the completely randomized design, the randomized complete block design, the Latin square, and the factorial design. They also discuss preplanned (a priori) and unplanned (a posteriori) post-ANOVA comparisons and ANOVA adequacy.
Chapter 14 covers ANOVA for mixed effects, in which the values of the independent qualitative
variables represent a random sample from a broader population.
The book ends with the index and some statistical tables that the reader may need while using this book, although SAS code is also provided to generate the relevant quantiles and probabilities.
Hamid D. Ismail
Index

#
R2 inflation, 144

A
Akaike Information Criterion (AIC), 294
alpha inflation, 372
alternative hypothesis, 43
analysis of variance (ANOVA), 141
ANCOVA, 13, 323, 328
Anderson-Darling, 391
ANOVA, 13, 354, 446
ANOVA fixed model, 355
arithmetic mean, 25, 394
autocorrelation, 88

B
backward elimination selection, 305
balanced design, 417
Bartlett, 393
Bartlett’s test, 394
Bayes Information Criterion (BIC), 294
binomial distribution, 12, 30, 31, 32
Bonferroni’s method, 382
Boxplot, 26
Brown and Forsythe, 393

C
categories, 324
central limit theorem, 37, 40
central tendency, 24, 25
Chi-square distribution, 41
coefficient of determination, 98
coefficient of multiple determination, 143, 292
coefficient of the multiple correlation, 243
coefficient of variation, 26, 100, 122, 145, 170, 361, 362
collinearity, 238
completely randomized design, 355
Completely randomized design, 23
condition number, 248
confidence level, 43
confidence region, 156
confounding, 123
Cook’s distance (D), 201
correlation, 47
correlation coefficients, 51
covariance, 51
covariance matrix, 279
Cramér-von Mises, 391
critical value, 43
cumulative distribution function (cdf), 33

D
Descriptive statistics, 24, 42
design matrix, 325, 358
direct or causal relationship, 71
dummy variables, 324
Dunnett’s method, 384

E
effect model, 15, 354
eigenvalues, 248, 281
eigenvectors, 248, 276, 281
empirical rule, 35
Estimation, 42
Euclidean distance, 248
Experimental studies, 23
explained variation, 141
external standard error, 180
external variance, 176
extrapolating, 156
extrapolation, 92, 156, 211, 237

F
F distribution, 41
factorial design, 405
Factorial design, 24
factors, 324
familywise error rate (FWER), 373
finite correction factor, 40
FISHER, 79
Fisher LSD, 375
Fisher Z transformation, 57
fixed effect, 15
fixed effects, 354
forward selection, 305
forward variable selection, 212
Fractional factorial design, 24
frequency histogram, 27, 33
full model, 146

G
generalized linear model (GLM), 12

H
Hadi’s influence measure, 209
hat matrix, 133
hidden extrapolation, 156
histogram, 391
Histogram, 27
HOEFFDING, 79
Hoeffding Dependence Coefficient, 66
homogeneity of variance (HOV), 393
hypothesis testing, 330
Hypothesis testing, 43

I
idempotent matrix, 133
ill-conditioned, 238
ill-conditioning, 211
influence, 195
interaction, 225
interaction effect, 24, 71, 229, 339, 405, 434
internal standard error, 180
internal variance, 176
internally standardized residuals, 180
interpolation, 91
interquartile range (IQR), 25
interval estimates, 42
intervening relationship, 71
iterative method, 273

J
jackknife, 181
jackknifing cross validation, 201

K
KENDALL, 79
Kendall’s tau-b correlation coefficient, 68
Kolmogorov-Smirnov, 391

L
LASSO, 318
Latin square, ix, 18, 24, 405, 425, 426, 428
least squares, 14
least squares method, 131
leave-one-out, 201
leave-one-out (LOO), 181
Left-tailed test, 43
levels, 324
Levene, 393
Levene’s test, 396
leverage, 195
linear contrast, 366
linear effect coefficient, 212
LINEAR REGRESSION, 86
logistic model, 13
log-linear model, 14

M
main effects, 434
Mallows criterion (Cq), 293
maximum likelihood, 456
maximum likelihood method (ML), 90
median, 25, 26, 64, 79, 402
mixed model, 454
Model I, 355
Model II, 15
Model III, 15
model reduction, 141, 147, 151, 152, 292, 330
model selection, 295
multilinearity, 238, 258
multiple coefficient of correlation, 177
multiple coefficient of determination, 243
multiple correlation, 143
multiple linear correlation, 127
multiple linear model, 13
multiple linear regression, 123, 130

N
near singular, 238
Newton-Raphson algorithm, 456
normal distributions, 34
normal equations, 132
normal probability plot, 185, 391
nuisance factor, 405
null hypothesis, 43

O
O’Brien’s test, 399
O’Brien’s tests for homogeneity, 393
Observational study, 23
observed values, 17
omnibus F statistic, 407
ordinary least squares method (OLS), 90
orthogonal, 238, 366, 425
orthogonal polynomials, 229
outliers, 179, 192

P
parsimony, 290
partial correlation, 127
Partial correlation, 71
partial regression coefficients, 130
partial slope, 130
PEARSON, 79
Pearson correlation coefficient, 54
percentiles, 25
point estimates, 42
point estimators, 136
Poisson distribution, 30, 32
polynomial regression, 211
power of the test, 44
prediction interval (PI), 160
preplanned (priori), 365
PRESS, 197, 295
Principal component analysis (PCA), 276
principal components (PCs), 276
priori comparisons, 366
probability density function (pdf), 33
Probability distribution, 27, 33
probability mass function (pmf), 28
PROC CORR, 74, 217
PROC REG, 115
PROC UNIVARIATE, 190
prospective studies, 23
p-value, 43

Q
quadratic effect coefficient, 212

R
random effect model, 15
random effects, 446
random unexplained error, 96
random variable, 27
randomized complete block design, 405, 454
Randomized complete block design, 24
RANNOR(seeds), 173
reduced model (RM), 146
reference line, 330
regression coefficients, 14, 87
Regression sum of squares, 96
regression sum of squares (SSR), 141
relative frequency, 27, 33
residual error sum of squares, 14
Residual error sum of squares, 96
residual maximum likelihood, 456
response surface methodology (RSM), 225
retrospective studies, 23
ridge regression, 259
ridge trace, 269, 270, 273, 274
Right-tailed test, 44

S
sample survey studies, 23
Sampling distributions, 36
Scatterplot, 47
Scheffé’s method, 387
selection criterion, 305
Shapiro-Wilk, 391
significance level, 43
significance level to stay, 313
simple coefficient of correlation, 177
simple linear model, 12, 86
singular values, 249
SPEARMAN, 79
Spearman correlation coefficient, 63
spurious relationship, 71
standard deviation, 26
standard error (SE), 37
standard normal distribution, 35, 42, 43, 60
Statistical inferences, 42
stepwise regression, 305
stopping criterion, 305
strong control, 374
Student’s t distribution, 40
studentized range distribution, 377
studentized residuals, 197
Student-Newman-Keuls, 380
surface response, 225
symmetric matrix, 133

T
the least squares (LS) estimates, 91
Tolerance, 244
total sum of squares, 96
Two-tailed test, 44
Type I error, 44
Type II, 44

U
unbiased estimator, 42
unexplained variation, 142
uniform distributions, 34
unplanned (posteriori), 365

V
variability, 10, 24, 25, 26, 170, 225, 276
variable selection, 291
variance, 25
variance component, 447
variance components, 15
variance decomposition proportion, 250
variance inflation, 242, 258
variance inflation factor (VIF), 242
variation components, 455

W
weak control, 374
Welsch and Kuh measure (DFFITS), 205
