
Modelling

Definition of statistical modelling

Components of a model:

1. Objective
2. Model structure
3. Model assumptions
4. Parameter estimates and interpretation
5. Model fit
6. Model selection

Example:

SIMPLE LINEAR REGRESSION

Objective: to model the expected value of a continuous variable Y as a linear function of a continuous predictor X.
Model structure: Yi = β0 + β1xi + εi
Model assumptions: Y is normally distributed, the errors are independent and normally distributed, X is fixed, and the variance is constant.
Parameter estimates and interpretation: β0 is the estimate of the intercept and β1 is the estimate of the slope; for every one-unit increase in X, the mean of Y changes by β1.
Model fit: R², residual analysis, F statistic.
Model selection: From a plethora of candidate predictors, which variables should be included?

Generalized Linear Models (GLMs)


- The term generalized linear model (GLIM or GLM) refers to a larger class of models popularized by McCullagh and Nelder (1983; 2nd edition 1989). In these models, the response variable yi is assumed to follow an exponential family distribution with mean μi, which is assumed to be some (often nonlinear) function of xiTβ. Some would call these nonlinear because μi is often a nonlinear function of the covariates, but McCullagh and Nelder consider them to be linear, because the covariates affect the distribution of yi only through the linear combination xiTβ.

Model                  Random        Link                Systematic
Linear Regression      Normal        Identity            Continuous
ANOVA                  Normal        Identity            Categorical
ANCOVA                 Normal        Identity            Mixed
Logistic Regression    Binomial      Logit               Mixed
Loglinear              Poisson       Log                 Categorical
Poisson Regression     Poisson       Log                 Mixed
Multinomial response   Multinomial   Generalized Logit   Mixed
There are three components to any GLM:
1. Random Component - refers to the probability distribution of the response variable (Y); e.g., the normal distribution for Y in linear regression, or the binomial distribution for Y in binary logistic regression. Also called a noise model or error model. How is random error added to the prediction that comes out of the link function?
2. Systematic Component - specifies the explanatory variables (X1, X2, ..., Xk) in the model, more specifically their linear combination in creating the so-called linear predictor; e.g., β0 + β1x1 + β2x2, as we have seen in a linear regression, or as we will see in a logistic regression in this lesson.
3. Link Function, or g(·) - specifies the link between the random and systematic components. It says how the expected value of the response relates to the linear predictor of explanatory variables; e.g., η = g(E(Yi)) = E(Yi) for linear regression, or η = logit(π) for logistic regression. A short R sketch of all three components follows below.
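
In R, the three components map onto a single glm() call: the model formula gives the systematic component, the family argument fixes the random component, and its link argument fixes the link function. A minimal sketch (the data frame dat and its columns y and x are hypothetical):

    # Same systematic component y ~ x; only the random and link components change
    fit_lin <- glm(y ~ x, family = gaussian(link = "identity"), data = dat)  # linear regression
    fit_log <- glm(y ~ x, family = binomial(link = "logit"), data = dat)     # logistic regression
    fit_poi <- glm(y ~ x, family = poisson(link = "log"), data = dat)        # Poisson regression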

Assumptions:
The data Y1, Y2, ..., Yn are independently distributed, i.e., cases are independent.
The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a
distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal,...)
GLM does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the transformed response (in terms of the link function) and the explanatory variables; e.g., for binary logistic regression logit(π) = β0 + βX.
Independent (explanatory) variables can even be power terms or some other nonlinear transformations of the original independent variables.
The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure, and overdispersion (when the observed variance is larger than what the model assumes) may be present.
Errors need to be independent but NOT normally distributed.
It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to
estimate the parameters, and thus relies on large-sample approximations.
Goodness-of-fit measures rely on sufficiently large samples, where a heuristic rule is that no more than 20% of the expected cell counts are less than 5.

Example:
Simple Linear Regression models how the mean (expected value) of a continuous response variable depends on a set of explanatory variables, where the index i stands for each data point:
Yi = β0 + β1xi + εi, or E(Yi) = β0 + β1xi
o Random component: Y is the response variable and has a normal distribution; generally we assume errors εi ~ N(0, σ²).
o Systematic component: X is the explanatory variable (can be continuous or discrete) and is linear in the parameters, β0 + β1xi. Notice that with a multiple linear regression, where we have more than one explanatory variable, e.g., (X1, X2, ..., Xk), we would have a linear combination of these Xs in terms of regression parameters βs, but the explanatory variables themselves could be transformed, e.g., X², or log(X).
o Link function: Identity link, η = g(E(Yi)) = E(Yi) --- identity because we are modeling the mean directly; this is the simplest link function. A short sketch follows below.
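
As a quick sketch (again with a hypothetical data frame dat), fitting this model as a GLM with the identity link reproduces ordinary least squares:

    fit_glm <- glm(y ~ x, family = gaussian(link = "identity"), data = dat)
    fit_ols <- lm(y ~ x, data = dat)
    all.equal(coef(fit_glm), coef(fit_ols))  # same estimates; here MLE coincides with OLS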

Binary Logistic Regression models how a binary response variable Y depends on a set of k explanatory variables, X = (X1, X2, ..., Xk):
logit(π) = log(π/(1−π)) = β0 + β1x1 + ... + βkxk
which models the log odds of the probability of "success" as a function of the explanatory variables.
o Random component: The distribution of Y is assumed to be Binomial(n, π), where π is the probability of "success".
o Systematic component: X's are explanatory variables (can be continuous, discrete, or both) and are linear in the parameters, e.g., β0 + β1x1 + ... + βkxk. Again, transformations of the X's themselves are allowed, as in linear regression; this holds for any GLM.
o Link function: Logit link, η = logit(π) = log(π/(1−π)). More generally, the logit link models the log odds of the mean, and the mean here is π. Binary logistic regression models are also known as logit models when the predictors are all categorical; see the R sketch below.
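
A minimal sketch in R, assuming a hypothetical data frame dat with a 0/1 response y and a predictor x:

    fit <- glm(y ~ x, family = binomial(link = "logit"), data = dat)
    coef(fit)       # the betas, on the log-odds scale
    exp(coef(fit))  # odds ratios: multiplicative effect on the odds of success per unit of x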

Log-linear Model models the expected cell counts as a function of the levels of categorical variables, e.g., for a two-way table the saturated model is
log(μij) = λ + λi^A + λj^B + λij^AB
where μij = E(nij) as before are the expected cell counts (the mean in each cell of the two-way table), A and B represent the two categorical variables, and the λ's are model parameters; we are modeling the natural log of the expected counts.
o Random component: The distribution of the counts, which are the responses, is Poisson.
o Systematic component: X's are discrete variables used in the cross-classification, and are linear in the parameters, β0 + β1X1i + β2X2j + ...
o Link Function: Log link, η = log(μ) --- log because we are modeling the log of the cell means.
The log-linear models are more general than logit models, and some logit models are equivalent to certain log-linear models. A log-linear model is also equivalent to a Poisson regression model when all explanatory variables are discrete. For additional details see Agresti (2007), Sec. 3.3, and Agresti (2013), Sec. 4.3 (for counts), Sec. 9.2 (for rates), and Sec. 13.2 (for random effects).
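
A sketch of the saturated two-way log-linear model in R, using a hypothetical 2x2 table of factors A and B entered in "long" form:

    dat <- data.frame(A = gl(2, 2), B = gl(2, 1, 4), n = c(25, 10, 15, 30))  # hypothetical counts
    fit <- glm(n ~ A * B, family = poisson(link = "log"), data = dat)  # A*B expands to A + B + A:B
    fitted(fit)  # the saturated model reproduces the observed counts exactly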

Advantages of GLMs over traditional (OLS) regression


We do not need to transform the response Y to have a normal distribution
The choice of link is separate from the choice of random component, so we have more flexibility in modeling.
If the link produces additive effects, then we do not need constant variance.
The models are fitted via maximum likelihood estimation; thus the estimators have optimal large-sample properties.
All the inference tools and model checking that we will discuss for log-linear and logistic
regression models apply for other GLMs too; e.g., Wald and Likelihood ratio tests, Deviance,
Residuals, Confidence intervals, Overdispersion.
There is often a single procedure in a software package that captures all the models listed above, e.g., PROC GENMOD in SAS or glm() in R, with options to vary the three components.

Limitations of GLMs
a. Linear function: the systematic component can contain only a linear predictor (in the parameters)
b. Responses must be independent
Poisson Regression Model
Poisson regression is also a type of GLM, where the random component is specified by a Poisson distribution for the response variable, which is a count. When all explanatory variables are discrete, the log-linear model is equivalent to the Poisson regression model.
Example of research questions:
o In an example using data about crabs, we are interested in knowing: How does the number of satellites (male crabs residing near a female crab) for a female horseshoe crab depend on the width of her back? And what is the rate of satellites per unit width?
o In an example using data about credit cards, we are interested in knowing: What is the expected number of credit cards a person may have, given his/her income? Or what is the sample rate of possession of credit cards?

Variables:
In Poisson regression the response/outcome variable Y is a count. But we can also have the rate (or incidence), Y/t, as the response variable, where t is an interval representing time, space, or some other grouping.
Explanatory Variable(s):
o Explanatory variables, X = (X1, X2, ..., Xk), can be continuous or a combination of continuous and categorical variables. The convention is to call such a model Poisson regression.
o Explanatory variables, X = (X1, X2, ..., Xk), can be ALL categorical. Then the counts to be modeled are the counts in a contingency table, and the convention is to call such a model a log-linear model.
o If Y/t is the variable of interest, then even with all categorical predictors the model is known as Poisson regression, not a log-linear model.

GLM Model for Counts with its assumptions:


g(μ) = β0 + β1x1 + β2x2 + ... + βkxk = xiTβ

- Random component: The response Y has a Poisson distribution, that is, yi ~ Poisson(μi) for i = 1, ..., N, where the expected count of yi is E(Yi) = μi.
- Systematic component: Any set of X = (X1, X2, ..., Xk) can be explanatory variables. For now, let's focus on a single variable X.
- Link:
Identity link: μ = β0 + β1x1
Sometimes the identity link function is used in Poisson regression. This model is the same as that used in ordinary regression except that the random component is the Poisson distribution.
Issue: it can yield μ < 0!
- Natural log link: log(μ) = β0 + β1x1
The Poisson regression model for counts is sometimes referred to as a Poisson loglinear model. We will focus on this one, and on a rate model for incidences.
- For simplicity, with a single explanatory variable, we write log(μ) = α + βx. This is equivalent to:
μ = exp(α + βx) = exp(α)exp(βx)
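
In R, and in the spirit of the horseshoe-crab example above (the data frame crabs with columns satellites and width is hypothetical), this count model is a one-liner:

    fit <- glm(satellites ~ width, family = poisson(link = "log"), data = crabs)
    coef(fit)       # alpha and beta on the log scale
    exp(coef(fit))  # exp(alpha) and exp(beta), the multiplicative form above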

Interpretation of Parameters
exp(α) = the effect on the mean of Y, that is μ, when X = 0
exp(β) = with every unit increase in X, the predictor variable has a multiplicative effect of exp(β) on the mean of Y, that is μ
If β = 0, then exp(β) = 1, the expected count is μ = E(Y) = exp(α), and Y and X are not related.
If β > 0, then exp(β) > 1, and the expected count μ = E(Y) is exp(β) times larger than when X = 0.
If β < 0, then exp(β) < 1, and the expected count μ = E(Y) is exp(β) times smaller than when X = 0.
(A worked numeric sketch follows below.)
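
A worked numeric sketch of these rules, with hypothetical estimates α = 0.5 and β = 0.2:

    alpha <- 0.5; beta <- 0.2     # hypothetical values, for illustration only
    exp(alpha)                    # ~1.65, the expected count when X = 0
    exp(beta)                     # ~1.22, each extra unit of X multiplies the mean by ~1.22
    exp(alpha + beta * 3)         # expected count at X = 3 ...
    exp(alpha) * exp(beta)^3      # ... identical, via the multiplicative form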

GLM Model for Rates:


GLM: g(μ) = β0 + β1x1 + β2x2 + ... + βkxk
Random component: The response Y has a Poisson distribution, and t is an index of time or space; more specifically, the expected value of the rate Y/t is E(Y/t) = μ/t, where μ = E(Y).
Systematic component: Any set of X = (X1, X2, ..., Xk) can be explanatory variables. For now, let's focus on a single variable X.
Link:
o Log of the rate: log(μ/t)

The Poisson loglinear regression model for the expected rate of occurrence of an event is:
log(μ/t) = α + βx
This can be rearranged to:
o log(μ) − log(t) = α + βx
o log(μ) = α + βx + log(t)
The term log(t) is referred to as an offset. It is an adjustment term; a group of observations may have the same offset, or each individual may have a different value of t. log(t) is an observed quantity, not an estimated parameter, and it changes the value of the estimated counts:
μ = exp(α + βx + log(t)) = t exp(α) exp(βx)
This means that the mean count is proportional to t.
- Note that the interpretation of the parameter estimates α and β stays the same as for the model of counts; you just need to multiply the expected counts by t. An R sketch with an offset follows below.
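
A minimal R sketch of the rate model (hypothetical data frame dat with event count y, exposure t, and predictor x); the offset enters the formula with a fixed coefficient of 1 and is not estimated:

    fit <- glm(y ~ x + offset(log(t)), family = poisson(link = "log"), data = dat)
    # equivalent: glm(y ~ x, family = poisson, offset = log(t), data = dat)
    exp(coef(fit))  # multiplicative effects on the rate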

Parameter Estimation
Similar to the case of logistic regression, the maximum likelihood estimates (MLEs) of (β0, β1, etc.) are obtained by finding the values that maximize the log-likelihood. In general, there are no closed-form solutions, so the ML estimates are obtained by using iterative algorithms such as Newton-Raphson (NR), iteratively re-weighted least squares (IRWLS), etc.
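
As an illustration of how IRWLS works, here is a minimal sketch for the Poisson/log-link case (a teaching toy, not production code), checked against R's glm() on simulated data:

    # Minimal IRWLS for Poisson regression with log link
    irwls_poisson <- function(X, y, tol = 1e-8, maxit = 25) {
      beta <- rep(0, ncol(X))
      for (it in seq_len(maxit)) {
        eta <- drop(X %*% beta)        # linear predictor
        mu  <- exp(eta)                # inverse of the log link
        z   <- eta + (y - mu) / mu     # working response
        w   <- mu                      # working weights (for Poisson, Var(Y) = mu)
        beta_new <- drop(solve(crossprod(X * w, X), crossprod(X * w, z)))
        if (max(abs(beta_new - beta)) < tol) return(beta_new)
        beta <- beta_new
      }
      beta
    }

    set.seed(1)
    x <- rnorm(100); y <- rpois(100, exp(0.5 + 0.2 * x))
    cbind(irwls = irwls_poisson(cbind(1, x), y),
          glm   = coef(glm(y ~ x, family = poisson)))  # the two columns agree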

Inference
The usual tools from basic statistical inference and GLM theory are valid:
o Confidence Intervals and Hypothesis tests for parameters
o Wald statistics and asymptotic standard error (ASE)
o Likelihood ratio tests
o Score tests
o Distribution of probability estimates
Model Fit
Overall goodness-of-fit statistics of the model are the same as for any GLM:
o Pearson chi-square statistic, X2
o Deviance, G2
o Likelihood ratio test, and statistic, G2
Residual analysis: Pearson, deviance, adjusted residuals, etc...
Overdispersion
o Recall that a Poisson random variable has the same mean and variance, i.e., E(Y) = Var(Y) = μ.
o Overdispersion means that the observed variance is larger than the assumed variance, i.e., Var(Y) = φμ, where φ is a scale parameter like the one we saw in logistic regression.
o Two typical solutions are (both are sketched in R below):
- Adjust for overdispersion (as in logistic regression), where we estimate φ = X²/(N − p) and adjust the standard errors and test statistics.
- Use negative binomial regression instead (see notes on ANGEL), where the response Y is assumed to follow a negative binomial distribution with E(Y) = μ and Var(Y) = μ + Dμ². The index D is called a dispersion parameter. Greater heterogeneity in the Poisson means results in a larger value of D. As D approaches 0, Var(Y) approaches μ, and the negative binomial and Poisson regressions give the same inference.
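
In R, both remedies are one-line changes; a sketch (hypothetical data frame dat with count y and predictor x; glm.nb() is in the MASS package):

    fit_q <- glm(y ~ x, family = quasipoisson, data = dat)  # estimates the scale and inflates SEs
    library(MASS)
    fit_nb <- glm.nb(y ~ x, data = dat)  # negative binomial; its theta corresponds to 1/D,
                                         # since Var(Y) = mu + mu^2/theta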
