Components of a model:
1. Objective
2. Model structure
3. Model assumptions
4. Parameter estimates and interpretation
5. Model Fit
6. Model selection
Example:
Assumptions:
The data Y1, Y2, ..., Yn are independently distributed, i.e., cases are independent.
The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a
distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal,...)
GLM does NOT assume a linear relationship between the dependent variable and the
independent variables, but it does assume linear relationship between the transformed
response in terms of the link function and the explanatory variables; e.g., for binary logistic
regression, logit(π) = β0 + βX.
Independent (explanatory) variables can even be power terms or other nonlinear
transformations of the original independent variables.
The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in
many cases given the model structure, and overdispersion (when the observed variance is larger
than what the model assumes) may be present.
Errors need to be independent but NOT normally distributed.
It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to
estimate the parameters, and thus relies on large-sample approximations.
Goodness-of-fit measures rely on sufficiently large samples, where a heuristic rule is that no
more than 20% of the expected cell counts are less than 5.
Example:
Simple Linear Regression models how the mean (expected value) of a continuous response variable
depends on a set of explanatory variables, where the index i stands for each data point:
Yi = β0 + β1xi + εi   or   E(Yi) = β0 + β1xi
o Random component: Y is a response variable and has a normal distribution, and
generally we assume errors εi ~ N(0, σ²).
o Systematic component: X is the explanatory variable (can be continuous or discrete) and
is linear in the parameters, β0 + β1xi. Notice that with a multiple linear regression where
we have more than one explanatory variable, e.g., (X1, X2, ..., Xk), we would have a linear
combination of these Xs in terms of regression parameters β's, but the explanatory
variables themselves could be transformed, e.g., X², or log(X).
o Link function: Identity link, η = g(E(Yi)) = E(Yi) --- identity because we are modeling the
mean directly; this is the simplest link function.
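As a minimal numerical sketch of the identity-link case (with made-up simulated data): under normal errors, the MLE of (β0, β1) coincides with the ordinary least-squares solution, so fitting the linear model recovers the coefficients used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from Y_i = b0 + b1*x_i + e_i with e_i ~ N(0, sigma^2)
# (b0 = 2.0 and b1 = 0.5 are arbitrary illustrative values)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

# With the identity link and normal errors, the MLE equals the OLS solution
X = np.column_stack([np.ones(n), x])          # design matrix [1, x]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)  # approximately [2.0, 0.5]
```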
Binary Logistic Regression models how a binary response variable Y depends on a set of k
explanatory variables, X = (X1, X2, ..., Xk):
logit(π) = log(π/(1−π)) = β0 + β1x1 + ... + βkxk
which models the log odds of the probability of "success" as a function of the explanatory variables.
o Random component: The distribution of Y is assumed to be Binomial(n, π), where π is the
probability of "success".
o Systematic component: X's are explanatory variables (can be continuous, discrete, or
both) and are linear in the parameters, e.g., β0 + β1x1 + ... + βkxk. Again,
transformations of the X's themselves are allowed, as in linear regression; this holds for
any GLM.
o Link function: Logit link, η = logit(π) = log(π/(1−π)). More generally, the logit link models the
log odds of the mean, and the mean here is π. Binary logistic regression models are also
known as logit models when the predictors are all categorical.
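A small sketch of how the logit link maps between the probability scale and the linear-predictor scale (the coefficients β0 = −3.0, β1 = 0.8 and the value x = 5 are assumed purely for illustration):

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit: map a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical fitted coefficients (illustrative values, not from real data)
beta0, beta1 = -3.0, 0.8
x = 5.0
eta = beta0 + beta1 * x        # linear predictor on the log-odds scale
pi = inv_logit(eta)            # predicted probability of "success"
print(round(eta, 3), round(pi, 3))   # eta = 1.0, pi ≈ 0.731
assert abs(logit(pi) - eta) < 1e-9   # logit and inv_logit are inverses
```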
The Log-linear Model models the expected cell counts as a function of levels of categorical variables,
e.g., for a two-way table the saturated model is
log(μij) = λ + λAi + λBj + λABij
where μij = E(nij) as before are expected cell counts (the mean in each cell of the two-way table), A
and B represent the two categorical variables, the λ's are model parameters, and we are modeling
the natural log of the expected counts.
o Random component: The distribution of counts, which are the responses, is Poisson.
o Systematic component: X's are discrete variables used in the cross-classification, and are
linear in the parameters, λ + λX1i + λX2j + ...
o Link Function: Log link, η = log(μ) --- log because we are modeling the log of the cell
means.
The log-linear models are more general than logit models, and some logit models are equivalent
to certain log-linear models. The log-linear model is also equivalent to the Poisson regression
model when all explanatory variables are discrete. For additional details see Agresti (2007), Sec. 3.3,
Agresti (2013), Section 4.3 (for counts), Section 9.2 (for rates), and Section 13.2 (for random
effects).
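To make the two-way log-linear setup concrete, here is a sketch with a made-up 2x2 table: the independence model (the saturated model without the λAB term) has fitted counts μij = ni+ n+j / n, while the saturated model reproduces the observed counts exactly (deviance zero).

```python
import numpy as np

# Hypothetical 2x2 table of observed counts (made-up data for illustration)
n = np.array([[30.0, 10.0],
              [20.0, 40.0]])

total = n.sum()
row = n.sum(axis=1, keepdims=True)   # row totals n_{i+}
col = n.sum(axis=0, keepdims=True)   # column totals n_{+j}

# Independence model (no interaction term): mu_ij = n_{i+} * n_{+j} / n
mu = row @ col / total

# For the saturated model the fitted counts equal the observed counts,
# so its deviance is zero; the independence model's deviance G^2 is:
G2 = 2 * np.sum(n * np.log(n / mu))
print(mu)                 # fitted counts under independence
print(round(G2, 2))       # 17.26 for this made-up table
```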
Limitations of GLMs
a. Linear function: the model can have only a linear predictor in the systematic component
b. Responses must be independent
Poisson Regression Model
Poisson regression is also a type of GLM in which the random component is specified
by a Poisson distribution for the response variable, which is a count. When all explanatory
variables are discrete, the log-linear model is equivalent to the Poisson regression model.
Example of research questions:
o In an example using data about crabs, we are interested in knowing: How does the
number of satellites (male crabs residing near a female crab) for a female horseshoe
crab depend on the width of her back? What is the rate of satellites per unit width?
o In an example using data about credit cards we are interested in knowing: What is the
expected number of credit cards a person may have, given his/her income?, or What is
the sample rate of possession of credit cards?
Variables:
In Poisson regression, the response/outcome variable Y is a count. But we can also have Y/t, the
rate (or incidence) as the response variable, where t is an interval representing time, space or
some other grouping.
Explanatory Variable(s):
o Explanatory variables, X = (X1, X2, ..., Xk), can be continuous or a combination of
continuous and categorical variables. The convention is to call such a model a Poisson
regression.
o Explanatory variables, X = (X1, X2, ..., Xk), can be ALL categorical. Then the counts to be
modeled are the counts in a contingency table, and the convention is to call such a
model a log-linear model.
o If Y/t is the variable of interest, then even with all categorical predictors, the regression
model will be known as a Poisson regression, not a log-linear model.
- Random component: The response Y has a Poisson distribution, that is, Yi ~ Poisson(μi) for i = 1, ..., N,
where the expected count of Yi is E(Yi) = μi.
- Systematic component: Any set of X = (X1, X2, ..., Xk) can be explanatory variables. For now let's
focus on a single variable X.
- Link:
Identity link: μ = β0 + β1x1
Sometimes the identity link function is used in Poisson regression. This model is the same as
that used in ordinary regression except that the random component is the Poisson distribution.
Issue: it can yield μ < 0!
- Natural log link: log(μ) = β0 + β1x1
The Poisson regression model for counts is sometimes referred to as a Poisson loglinear
model. We will focus on this one and a rate model for incidences.
- For simplicity, with a single explanatory variable, we write: log(μ) = α + βx. This is equivalent to:
μ = exp(α + βx) = exp(α)exp(βx)
Interpretation of Parameters
exp(α) = effect on the mean of Y, that is μ, when X = 0
exp(β) = with every unit increase in X, the predictor variable has a multiplicative effect of exp(β) on the
mean of Y, that is μ
If β = 0, then exp(β) = 1, and the expected count μ = E(Y) = exp(α), and Y and X are not related.
If β > 0, then exp(β) > 1, and the expected count μ = E(Y) is exp(β) times larger than when X = 0
If β < 0, then exp(β) < 1, and the expected count μ = E(Y) is exp(β) times smaller than when X = 0
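A quick numerical check of this interpretation, using assumed coefficients α = 0.5 and β = 0.2 (illustrative values, not fitted from data): the expected count at x = 0 is exp(α), and each one-unit increase in x multiplies the expected count by exp(β).

```python
import math

# Hypothetical Poisson loglinear model: log(mu) = alpha + beta*x
alpha, beta = 0.5, 0.2   # assumed values for illustration

def mu(x):
    return math.exp(alpha + beta * x)

# exp(alpha) is the expected count at x = 0
print(round(mu(0), 4))               # exp(0.5) ≈ 1.6487

# Each one-unit increase in x multiplies the expected count by exp(beta)
ratio = mu(3) / mu(2)
print(round(ratio, 4))               # exp(0.2) ≈ 1.2214
assert abs(ratio - math.exp(beta)) < 1e-9
```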
The Poisson loglinear regression model for the expected rate of occurrence of an event is:
log(μ/t) = α + βx
This can be rearranged to:
o log(μ) − log(t) = α + βx
o log(μ) = α + βx + log(t)
The term log(t) is referred to as an offset. It is an adjustment term and a group of
observations may have the same offset, or each individual may have a different value of t.
log(t) is an observed value, not an estimated parameter, and it will change the value of the
estimated counts:
μ = exp(α + βx + log(t)) = t exp(α)exp(βx)
This means that the mean count μ is proportional to t.
- Note that the interpretation of the parameter estimates α and β will stay the same as for the
model of counts; you just need to multiply the expected counts by t.
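The proportionality to t can be verified directly; the coefficients below (α = −1.0, β = 0.3) are assumed purely for illustration:

```python
import math

# Hypothetical rate model (assumed coefficients): log(mu/t) = alpha + beta*x
alpha, beta = -1.0, 0.3

def expected_count(x, t):
    # Equivalent forms: exp(alpha + beta*x + log(t)) == t * exp(alpha + beta*x)
    return math.exp(alpha + beta * x + math.log(t))

x = 2.0
# Doubling the exposure t doubles the expected count (mu is proportional to t)
m1 = expected_count(x, t=10.0)
m2 = expected_count(x, t=20.0)
print(round(m2 / m1, 6))   # 2.0
```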
Parameter Estimation
Similar to the case of logistic regression, the maximum likelihood estimates (MLEs) of the
parameters (β0, β1, etc.) are obtained by finding the values that maximize the log-likelihood. In
general, there are no closed-form solutions, so the ML estimates are obtained by using iterative
algorithms such as Newton-Raphson (NR), iteratively re-weighted least squares (IRWLS), etc.
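A minimal sketch of Newton-Raphson for Poisson regression with the log link, on simulated data (the coefficients 0.5 and 0.3 used to generate the data are arbitrary). For the canonical log link, Newton-Raphson and Fisher scoring coincide: the score is X'(y − μ) and the information matrix is X' diag(μ) X.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate Poisson counts with log(mu_i) = 0.5 + 0.3 * x_i (illustrative values)
n = 500
x = rng.uniform(0, 4, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))

# Newton-Raphson iterations:
#   score        U = X'(y - mu)
#   information  I = X' diag(mu) X
#   update       beta <- beta + I^{-1} U
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    U = X.T @ (y - mu)
    I = X.T @ (X * mu[:, None])
    step = np.linalg.solve(I, U)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:   # converged
        break

print(beta)  # close to [0.5, 0.3]
```

In practice this loop is what IRWLS implements under the hood; statistical software reports the same estimates along with standard errors from the inverse information matrix.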
Inference
The usual tools from basic statistical inference and GLMs are valid.
o Confidence Intervals and Hypothesis tests for parameters
o Wald statistics and asymptotic standard error (ASE)
o Likelihood ratio tests
o Score tests
o Distribution of probability estimates
Model Fit
Overall goodness-of-fit statistics of the model are the same as for any GLM:
o Pearson chi-square statistic, X2
o Deviance, G2
o Likelihood ratio test, and statistic, G2
Residual analysis: Pearson, deviance, adjusted residuals, etc...
Overdispersion
o Recall that a Poisson random variable has the same mean and variance, i.e.,
E(Y) = Var(Y) = μ
o Overdispersion means that the observed variance is larger than the assumed variance, i.e.,
Var(Y) = φμ, where φ is a scale parameter like we saw in logistic regression.
o Two typical solutions are:
Adjust for overdispersion (as in logistic regression), where we estimate
φ = X²/(N − p), and adjust the standard errors and test statistics.
Use negative binomial regression instead (see notes on ANGEL), where the
response Y is assumed to follow a negative binomial distribution with E(Y) = μ and
Var(Y) = μ + Dμ². The index D is called a dispersion parameter. Greater
heterogeneity in the Poisson means results in a larger value of D. As D
approaches 0, Var(Y) approaches μ, and the negative binomial and Poisson
regressions give the same inference.
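The scale-parameter adjustment can be sketched as follows; all numbers here (observed counts, fitted means, and the model-based standard errors) are made up for illustration, not from a real fit:

```python
import numpy as np

# Sketch of the overdispersion adjustment for a fitted Poisson model:
# observed counts y, fitted means mu, and p estimated parameters.
y = np.array([2, 0, 6, 1, 8, 3, 12, 5, 9, 4], dtype=float)          # made-up counts
mu = np.array([1.5, 1.8, 3.0, 2.2, 5.0, 3.5, 7.0, 4.0, 6.5, 4.5])  # made-up fitted means
p = 2                      # e.g., an intercept and one slope
N = len(y)

# Pearson chi-square and the estimated scale parameter phi = X^2 / (N - p)
X2 = np.sum((y - mu) ** 2 / mu)
phi = X2 / (N - p)
print(round(phi, 3))       # phi > 1 suggests overdispersion

# Model-based standard errors are inflated by sqrt(phi)
se_poisson = np.array([0.10, 0.04])          # hypothetical model-based SEs
se_adjusted = np.sqrt(phi) * se_poisson
print(se_adjusted)
```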