Preface
Statistical modeling is the branch of advanced statistics that helps explain or predict probabilistic
phenomena and behavior based on available data. It has emerged as one of the most important
topics in statistics because of its applications in science and business and its impact on many
fields such as the stock market, medicine, and weather prediction, to name a few. Moreover, in
this era of big data and massive information, statistical modeling has become an essential
interdisciplinary subject in universities and a popular training topic in corporations. Rather than
simply running convenient statistical packages and taking the output at face value, students
trained in statistics and curious researchers may need to delve into the computational and
mathematical details. This book is an attempt to satisfy the needs of students and researchers who
wish to understand the finer points of linear modeling. It treats linear modeling from a
computational perspective, with an emphasis on the mathematical details and step-by-step
calculations.
Each chapter typically begins with the theory; the discussion is then followed by an example that
demonstrates the relevant computation using SAS® PROC IML to mimic the manual calculation.
The chapter usually concludes with a discussion of the SAS procedure that is commonly used for
that kind of analysis.
Chapter 1 includes a concise introduction to modeling and the steps of statistical modeling that
many newcomers to modeling will appreciate.
Chapter 2 provides a review of basic statistics, such as experimental designs and descriptive and
inferential statistics, that readers will encounter while reading this book.
Chapter 3 discusses the correlations between independent variables as a preliminary step in linear
modeling. The correlations discussed in this chapter include the Pearson correlation and other
common non-parametric correlations that are provided in statistical software packages such as
SAS.
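As a small preview of the kind of step-by-step computation the book carries out in PROC IML, the Pearson correlation can be built directly from deviations about the means. The following is a minimal sketch in Python with NumPy standing in for IML's matrix language; the data are invented for illustration only:

```python
import numpy as np

# Two made-up variables for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r = S_xy / sqrt(S_xx * S_yy), computed from deviations from the means
dx, dy = x - x.mean(), y - y.mean()
r = (dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy))
```

The same quantity is what PROC CORR reports as the Pearson correlation coefficient; the manual construction above is what a PROC IML version would mirror.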
Chapters 4 and 5 cover simple and multiple linear regression, respectively. The chapters discuss
data organization, least squares estimation for regression coefficients, hypothesis testing,
confidence intervals, prediction of new observations, and the use of the SAS procedures for
regression analyses.
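To give a flavor of the manual least squares calculation that Chapters 4 and 5 demonstrate in PROC IML, the coefficient estimates can be obtained by solving the normal equations. Below is a minimal Python sketch (NumPy stands in for IML's matrix operations; the data are made up for illustration):

```python
import numpy as np

# Illustrative data (made up): one predictor, five observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Normal equations: b = (X'X)^{-1} X'y, solved without forming the inverse
b = np.linalg.solve(X.T @ X, X.T @ y)  # b[0] = intercept, b[1] = slope
```

PROC REG performs the same estimation automatically; writing it out term by term is exactly the computational emphasis described above.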
Chapter 6 covers model adequacy and the diagnostics that can be carried out to assess and
validate a linear model.
Chapter 7 extends linear modeling theory to polynomial regression for a single predictor, multiple
predictors, and orthogonal polynomial regression.
Chapter 8 discusses the problem of correlated independent variables and its effect on least
squares estimation. This chapter introduces various metrics that can be computed to measure
collinearity in the data, including the variance inflation factor, tolerance, condition index,
eigenvalues, and variance decomposition proportions.
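Two of those collinearity metrics can be sketched directly from their definitions. The following Python fragment (a hedged stand-in for the PROC IML calculations; the data are simulated for illustration) computes the variance inflation factor for each predictor and a condition index from the eigenvalues of the scaled cross-product matrix:

```python
import numpy as np

# Simulated predictors with deliberate near-collinearity between x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.1 * rng.normal(size=50)   # nearly a copy of x1
x3 = rng.normal(size=50)              # unrelated predictor
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), from regressing predictor j on the others."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ b
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]

# Condition index: sqrt(lambda_max / lambda_min) of X'X with unit-length columns
Xs = X / np.linalg.norm(X, axis=0)
eig = np.linalg.eigvalsh(Xs.T @ Xs)
cond_index = np.sqrt(eig.max() / eig.min())
```

The collinear pair produces large VIF values and a large condition index, while the unrelated predictor's VIF stays near 1; PROC REG reports the same diagnostics via its VIF, TOL, and COLLIN options.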
Chapter 9 covers two methods for handling highly correlated data in linear modeling: ridge
regression, which can serve as an alternative to the least squares method for parameter
estimation, and principal component analysis, a method for transforming correlated,
high-dimensional data into less correlated, lower-dimensional data.
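The ridge estimator differs from least squares only by a biasing constant added to the diagonal of the cross-product matrix. A minimal Python sketch of that idea (simulated data; NumPy standing in for PROC IML) is:

```python
import numpy as np

# Simulated data for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)

def ridge(X, y, k):
    """Ridge estimates b = (X'X + kI)^{-1} X'y for a given biasing constant k."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

b_ls = ridge(X, y, 0.0)     # k = 0 recovers ordinary least squares
b_rr = ridge(X, y, 10.0)    # a positive k shrinks the estimates toward zero
```

Choosing the constant k (for example, from a ridge trace) is the substantive question the chapter addresses; the sketch only illustrates the estimator itself.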
Chapter 10 covers variable selection or model selection, which is a crucial step in modeling when
the number of independent variables is large. The chapter discusses the most commonly used
criteria and methods for variable selection.
Chapter 11 discusses analysis of covariance (ANCOVA), the generalized linear models in which
the response variable is quantitative and the independent variables are a mix of quantitative and
qualitative variables. This chapter introduces dummy variables and regression with dummy
variables, which give rise to multiple regression equations and the corresponding hypothesis tests.
Chapters 12 and 13 cover analysis of variance (ANOVA) with fixed effects or generalized linear
models in which the response variable is quantitative and the independent variables are qualitative
with a known number of values. These chapters discuss the most commonly used ANOVA
experimental designs such as completely randomized design, randomized complete block design,
Latin square, and factorial design. The chapters also discuss post-ANOVA preplanned (a priori)
and unplanned (a posteriori) comparisons, as well as ANOVA adequacy.
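The core ANOVA computation for a completely randomized design, partitioning total variation into between-treatment and within-treatment sums of squares, can be sketched as follows (Python with invented data; the book carries this out in PROC IML and PROC ANOVA/GLM):

```python
import numpy as np

# Made-up responses for three treatment groups (completely randomized design)
groups = [np.array([5.1, 4.9, 5.3, 5.0]),
          np.array([6.2, 6.0, 6.4, 6.1]),
          np.array([4.2, 4.4, 4.1, 4.3])]

all_y = np.concatenate(groups)
grand = all_y.mean()

# Between-treatment and within-treatment (error) sums of squares
ss_trt = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_err = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_trt = len(groups) - 1          # treatments minus 1
df_err = len(all_y) - len(groups) # observations minus treatments

# F statistic: ratio of treatment mean square to error mean square
F = (ss_trt / df_trt) / (ss_err / df_err)
```

Comparing F to the appropriate quantile of the F distribution with (df_trt, df_err) degrees of freedom completes the test.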
Chapter 14 covers ANOVA for mixed effects, in which the values of the independent qualitative
variables represent a random sample from a broader population.
The book ends with the index and some statistical tables that the reader may need while using this
book, although SAS code is also provided to generate the relevant quantiles and probabilities.
Hamid D. Ismail
Index

#
R2 inflation, 144

A
Akaike Information Criterion (AIC), 294
alpha inflation, 372
alternative hypothesis, 43
analysis of variance (ANOVA), 141
ANCOVA, 13, 323, 328
Anderson-Darling, 391
ANOVA, 13, 354, 446
ANOVA fixed model, 355
arithmetic mean, 25, 394
autocorrelation, 88

B
backward elimination selection, 305
balanced design, 417
Bartlett, 393
Bartlett's test, 394
Bayes Information Criterion (BIC), 294
binomial distribution, 12, 30, 31, 32
Bonferroni's method, 382
Boxplot, 26
Brown and Forsythe, 393

C
categories, 324
central limit theorem, 37, 40
central tendency, 24, 25
Chi-square distribution, 41
coefficient of determination, 98
coefficient of multiple determination, 143, 292
coefficient of the multiple correlation, 243
coefficient of variation, 26, 100, 122, 145, 170, 361, 362
collinearity, 238
completely randomized design, 355
Completely randomized design, 23
condition number, 248
confidence level, 43
confidence region, 156
confounding, 123
Cook's distance (D), 201
correlation, 47
correlation coefficients, 51
covariance, 51
covariance matrix, 279
Cramér-von Mises, 391
critical value, 43
cumulative distribution function (cdf), 33

D
Descriptive statistics, 24, 42
design matrix, 325, 358
direct or causal relationship, 71
dummy variables, 324
Dunnett's method, 384

E
effect model, 15, 354
eigenvalues, 248, 281
eigenvectors, 248, 276, 281
empirical rule, 35
Estimation, 42
Euclidean distance, 248
Experimental studies, 23
explained variation, 141
external standard error, 180
external variance, 176
extrapolating, 156
extrapolation, 92, 156, 211, 237

F
F distribution, 41
factorial design, 405
Factorial design, 24
factors, 324
familywise error rate (FWER), 373
finite correction factor, 40
FISHER, 79
Fisher LSD, 375
Fisher Z transformation, 57
fixed effect, 15
fixed effects, 354
forward selection, 305
forward variable selection, 212
Fractional factorial design, 24
frequency histogram, 27, 33
full model, 146

G
generalized linear model (GLM), 12

H
Hadi's influence measure, 209
hat matrix, 133
hidden extrapolation, 156
histogram, 391
Histogram, 27
HOEFFDING, 79
Hoeffding Dependence Coefficient, 66
homogeneity of variance (HOV), 393
hypothesis testing, 330
Hypothesis testing, 43

I
idempotent matrix, 133
ill-conditioned, 238
ill-conditioning, 211
influence, 195
interaction, 225
interaction effect, 24, 71, 229, 339, 405, 434
internal standard error, 180
internal variance, 176
internally standardized residuals, 180
interpolation, 91
interquartile range (IQR), 25
interval estimates, 42
intervening relationship, 71
iterative method, 273

J
jackknife, 181
jackknifing cross validation, 201

K
KENDALL, 79
Kendall's tau-b correlation coefficient, 68
Kolmogorov-Smirnov, 391

L
LASSO, 318
Latin square, ix, 18, 24, 405, 425, 426, 428
least squares, 14
least squares method, 131
leave-one-out, 201
leave-one-out (LOO), 181
Left-tailed test, 43
levels, 324
Levene, 393
Levene's test, 396
leverage, 195
linear contrast, 366
linear effect coefficient, 212
LINEAR REGRESSION, 86
logistic model, 13
log-linear model, 14

M
main effects, 434
Mallows criterion (Cp), 293
maximum likelihood, 456
maximum likelihood method (ML), 90
median, 25, 26, 64, 79, 402
mixed model, 454
model complexity, 290
Model I, 355
Model II, 15
Model III, 15
model reduction, 141, 147, 151, 152, 292, 330
model selection, 295
multicollinearity, 238, 258
multiple coefficient of correlation, 177
multiple coefficient of determination, 243
multiple correlation, 143
multiple linear correlation, 127
multiple linear model, 13
multiple linear regression, 123, 130

S
standard deviation, 26
standard error (SE), 37
standard normal distribution, 35, 42, 43, 60
Statistical inferences, 42
stepwise regression, 305
stopping criterion, 305
strong control, 374
Student's t distribution, 40
studentized range distribution, 377
studentized residuals, 197
Student-Newman-Keuls, 380
surface response, 225
symmetric matrix, 133

T
the least squares (LS) estimates, 91
Tolerance, 244
total sum of squares, 96
Two-tailed test, 44
Type I error, 44
Type II, 44

U
unbiased estimator, 42
unexplained variation, 142
uniform distributions, 34
unplanned (posteriori), 365

V
variability, 10, 24, 25, 26, 170, 225, 276
variable selection, 291
variance, 25
variance component, 447
variance components, 15
variance decomposition proportion, 250
variance inflation, 242, 258
variance inflation factor (VIF), 242
variation components, 455

W
weak control, 374
Welsch and Kuh measure (DFFITS), 205