Poisson Regression
Poisson regression is used to model count variables.
Please note: The purpose of this page is to show how to use various data analysis commands. It does not
cover all aspects of the research process which researchers are expected to do. In particular, it does not
cover data cleaning and checking, verification of assumptions, model diagnostics or potential follow-up
analyses.
This page was produced using SPSS 19.
Let's start by loading the data and looking at some descriptive statistics.
GET
  FILE='poisson_sim.sav'.
DESCRIPTIVES
  VARIABLES=math num_awards
  /STATISTICS=MEAN STDDEV VAR MIN MAX.
Each variable has 200 valid observations and their distributions seem quite reasonable.
The unconditional mean and variance of our outcome variable are not extremely different. Our model
assumes that these values, conditioned on the predictor variables, will be equal (or at least roughly so).
Let's continue with our description of the variables in this dataset. The table below shows the average
numbers of awards by program type and seems to suggest that program type is a good candidate for
predicting the number of awards, our outcome variable, because the mean value of the outcome appears
to vary by prog. Additionally, the means and variances within each level of prog--the conditional means
and variances--are similar.
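The command that produced the table of awards by program type is not shown on this page. A minimal sketch, assuming the MEANS procedure is used to report the mean, count and variance of num_awards within each level of prog (the choice of procedure and cell statistics is an assumption, not taken from the page):

```spss
* Conditional means and variances of num_awards within each level of prog.
MEANS TABLES=num_awards BY prog
  /CELLS=MEAN COUNT VARIANCE.
```

Comparing the mean and variance columns row by row gives a rough check of whether the conditional variance is close to the conditional mean, as the Poisson model assumes.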
GRAPH
  /HISTOGRAM=num_awards.
OLS regression - Count outcome variables are sometimes log-transformed and analyzed using OLS regression.
Many issues arise with this approach, including biased estimates and loss of data, because the log of zero
is undefined and every zero count is therefore dropped.
Poisson regression
Below we use the genlin command to estimate a Poisson regression model. We have one continuous
predictor and one categorical predictor. In the genlin line, we list our categorical predictor prog after "by"
and our continuous predictor math after "with". Both appear in the model line. We use
the covb=robust option in the criteria line to obtain robust standard errors for the parameter estimates, as
recommended by Cameron and Trivedi (2009), to control for mild violation of the distribution assumption
that the variance equals the mean. Finally, we ask SPSS to print out the model fit statistics, the summary of
the model effects, and the parameter estimates.
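The command listing itself is not shown on this page. The following is a sketch reconstructed from the description above (prog after "by", math after "with", both in the model line, covb=robust in the criteria line, and fit, summary and solution in the print line); treat the exact layout as an assumption rather than the original listing:

```spss
* Poisson regression of num_awards on prog (factor) and math (covariate).
GENLIN num_awards BY prog WITH math
  /MODEL prog math INTERCEPT=YES
    DISTRIBUTION=POISSON LINK=LOG
  /CRITERIA COVB=ROBUST
  /PRINT FIT SUMMARY SOLUTION.
```

With no explicit ordering option, GENLIN uses the last category of a factor as the reference level, which is consistent with the text treating [prog=3] as the reference group.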
The output begins with the Goodness of Fit table. This lists various
statistics indicating model fit. To assess the fit of the model, the
goodness-of-fit chi-squared test is provided in the first line of this table.
We evaluate the deviance (189.45) as chi-square distributed with the
residual degrees of freedom (196). This is not a test of the model
coefficients (which we saw in the header information), but a test of the
model form: does the Poisson model form fit our data? We conclude
that the model fits reasonably well because the goodness-of-fit
chi-squared test is not statistically significant (with 196 degrees of freedom,
p = 0.204). Note that a deviance of 189.45 lies below the mean (196) of its
reference distribution, so its own upper-tail p-value must exceed 0.5; the
reported p = 0.204 therefore appears to belong to a different statistic in the
same table (possibly the Pearson chi-square). Either way, the test is not
significant. If the test had been statistically significant, it would indicate
that the data do not fit the model well. In that situation, we may try to
determine if there are omitted predictor variables, if our linearity
assumption holds and/or if there is an issue of over-dispersion.
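A goodness-of-fit p-value can be checked by hand. The sketch below assumes a dataset is already open (COMPUTE needs an active dataset) and uses a hypothetical variable name, gof_p, to hold the upper-tail chi-square probability for the deviance quoted above:

```spss
* Upper-tail chi-square probability for a deviance of 189.45 on 196 df.
COMPUTE gof_p = 1 - CDF.CHISQ(189.45, 196).
EXECUTE.
FORMATS gof_p (F8.4).
LIST VARIABLES=gof_p /CASES=FROM 1 TO 1.
```

Because 189.45 is smaller than the degrees of freedom, the result should be above 0.5, which is well away from any conventional significance threshold.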
Next we see the Omnibus Test. This is a test that all of the estimated
coefficients are equal to zero--a test of the model as a whole. From the
p-value, we can see that the model is statistically significant.
Next is the Tests of Model Effects. This evaluates each of the model
variables with the appropriate degrees of freedom. The prog variable is
categorical with three levels. Thus, it will appear in the model as two
one degree-of-freedom indicator variables. To assess the significance
of prog as a variable, we need to test these two dummy variables
together in a two degree-of-freedom chi-square test. This indicates
that prog is a statistically significant predictor of num_awards. The
continuous predictor variable math requires one degree-of-freedom in
the model, and so the test presented here is equivalent to that in the
Parameter Estimates output.
Sometimes, we might want to present the regression results as incident rate ratios (IRR). These IRR values are
the exponentiated coefficients from the output above, and we can ask SPSS to print them by
requesting solution(exponentiated).
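As a sketch of how that request might look, the model can be refit with solution(exponentiated) in the print line. The rest of the command repeats the specification described earlier and is an assumption about the original listing, which is not shown on this page:

```spss
* Same Poisson model, printing exponentiated coefficients (IRR).
GENLIN num_awards BY prog WITH math
  /MODEL prog math INTERCEPT=YES
    DISTRIBUTION=POISSON LINK=LOG
  /CRITERIA COVB=ROBUST
  /PRINT SOLUTION(EXPONENTIATED).
```

The Parameter Estimates table then gains an Exp(B) column, which holds the incident rate ratios.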
The output above indicates that the incident rate for [prog=2] is 2.042
times the incident rate for the reference group, [prog=3]. Likewise, the
incident rate for [prog=1] is 0.691 times the incident rate for the reference
group, holding the other variables constant. The percent change in the
incident rate of num_awards is an increase of 7% for every unit increase
in math.
Recall the form of our model equation:
log(num_awards) = Intercept + b1(prog=1) + b2(prog=2) + b3*math
This implies:
num_awards = exp(Intercept + b1(prog=1) + b2(prog=2) + b3*math)
           = exp(Intercept) * exp(b1(prog=1)) * exp(b2(prog=2)) * exp(b3*math)
The coefficients have an additive effect in the log(y) scale and the IRR have a multiplicative effect in the y
scale.
For additional information on the various metrics in which the results can be presented, and the
interpretation of such, please see Regression Models for Categorical Dependent Variables Using Stata,
Second Edition by J. Scott Long and Jeremy Freese (2006).
To understand the model better, we can use the emmeans command to calculate the predicted counts at
each level of prog, holding all other variables (in this example, math) in the model at their means.
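A sketch of such a command, assuming the same model specification as before, with the regular output suppressed and the estimated marginal means requested on the original (count) scale:

```spss
* Predicted counts for each level of prog, with math held at its mean.
GENLIN num_awards BY prog WITH math
  /MODEL prog math INTERCEPT=YES
    DISTRIBUTION=POISSON LINK=LOG
  /CRITERIA COVB=ROBUST
  /PRINT NONE
  /EMMEANS TABLES=prog SCALE=ORIGINAL.
```

SCALE=ORIGINAL reports the means as counts rather than on the log scale of the linear predictor; covariates not named in a CONTROL option are held at their means.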
In the output above, we see that the predicted number of events for level 1 of prog is about .21,
holding math at its mean. The predicted number of events for level 2 of prog is higher at .62, and the
predicted number of events for level 3 of prog is about .31. Note that the predicted count of level 2
of prog is (.62/.31) = 2.0 times higher than the predicted count for level 3 of prog. This matches what we
saw in the IRR output table.
Below we will obtain the predicted counts for each value of prog at two set values of math: 35 and 75.
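One way to request these two sets of predictions is a pair of emmeans lines, each fixing math at a chosen value through the CONTROL option. This is a sketch under the same assumed model specification, not the original listing:

```spss
* Predicted counts by prog with math fixed at 35 and then at 75.
GENLIN num_awards BY prog WITH math
  /MODEL prog math INTERCEPT=YES
    DISTRIBUTION=POISSON LINK=LOG
  /CRITERIA COVB=ROBUST
  /PRINT NONE
  /EMMEANS TABLES=prog CONTROL=math(35) SCALE=ORIGINAL
  /EMMEANS TABLES=prog CONTROL=math(75) SCALE=ORIGINAL.
```

Each emmeans line produces its own table, so the predicted counts at the two math values can be compared level by level.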
The table above shows that with prog=1 and math held at 35, the average predicted count (or average
number of awards) is about .06; when math = 75, the average predicted count for prog=1 is about 1.01. If
we look at these predicted counts at math = 35 and math = 75, we can see that the ratio is (1.01/0.06) =
16.8. This matches (within rounding error) the IRR of 1.0727 for a 40 unit change: 1.0727^40 = 16.6.
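The compounding of the per-unit IRR can be checked directly. The sketch below assumes a dataset is open and uses a hypothetical variable name, irr_40, for the result:

```spss
* Compound the per-unit IRR for math over a 40-point difference (35 to 75).
COMPUTE irr_40 = 1.0727**40.
EXECUTE.
FORMATS irr_40 (F8.2).
LIST VARIABLES=irr_40 /CASES=FROM 1 TO 1.
```

The value comes out at roughly 16.6, close to the ratio of about 16.8 obtained from the rounded predicted counts.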
Things to consider
If we omitted the predictor variable prog in the example above, our model would seem to have a
problem with over-dispersion. In other words, a mis-specified model
could present a symptom like an over-dispersion problem.
If the data generating process does not allow for any 0s (such as the
number of days spent in the hospital), then a zero-truncated model may
be more appropriate.
References
Long, J. S. 1997. Regression Models for Categorical and Limited
Dependent Variables. Thousand Oaks, CA: Sage Publications.
Example 2. The number of people in line in front of you at the grocery store. Predictors may include the number
of items currently offered at a special discounted price and whether a special event (e.g., a holiday, a big
sporting event) is three or fewer days away.
Example 3. The number of awards earned by students at one high school. Predictors of the number of awards
earned include the type of program in which the student was enrolled (e.g., vocational, general or academic)
and the score on their final exam in math.
Description of the data
For the purpose of illustration, we have simulated a data set for Example 3 above: poisson_sim.sav. In this
example, num_awards is the outcome variable and indicates the number of awards earned by students at a
high school in a year, math is a continuous predictor variable and represents students' scores on their math
final exam, and prog is a categorical predictor variable with three levels indicating the type of program in
which the students were enrolled.
One common cause of over-dispersion is excess zeros, which in turn are
generated by an additional data generating process. In this situation,
a zero-inflated model should be considered.
The outcome variable in a Poisson regression cannot have negative numbers.
References
Cameron, A. C. and Trivedi, P. K. 2009. Microeconometrics Using Stata.
College Station, TX: Stata Press.