library(broom)
library(dplyr)
library(ggplot2)
library(foreign)
library(magrittr)
library(pscl)
There is a broad range of diagnostic methods (mostly graphical) that can help you choose from among a dizzying array of alternative models for count data. The most important consideration is to start at the shallow end of the pool with the simplest approaches. Don’t jump in at the deep end and try something really complex before you’ve had the chance to look at simpler models, even if you know they are overly simplistic.
The dataset used in this example
Francis Huang has a nice web page showing two datasets and how to analyze them using the Poisson regression model as well as several more sophisticated models. The first dataset is available for anyone to download, but the link I originally used no longer works.
<- "http://faculty.missouri.edu/huangf/data/poisson/articles.csv"
fn <- read.csv(fn)
articles head(articles)
Here is a different link for the same file, used at German Rodriguez’s website. I’ve placed a copy on my website in case this link breaks. This dataset is also available in the pscl library of R under the name bioChemists.
<- "http://www.pmean.com/00files/couart2.dta"
fn <- read.dta(fn)
articles head(articles)
art fem mar kid5 phd ment
1 0 Men Married 0 2.52 7
2 0 Women Single 0 2.05 6
3 0 Women Single 0 3.75 6
4 0 Men Married 1 1.18 3
5 0 Women Single 0 3.75 26
6 0 Women Married 2 3.59 2
The variables are:

+ gender (fem),
+ marital status (mar),
+ number of children (kid5),
+ prestige of graduate program (phd),
+ the number of articles published by the individual’s mentor (ment), and
+ the number of articles published by the scientist (art: the outcome).
Histogram of outcome variable
Look at the possible values for your outcome variable, which is usually a count. If you have rate data, do a histogram both for counts and rates.
ggplot(articles, aes(art)) +
geom_histogram(binwidth=1, color="black", fill="white")
Compare this to a bar chart of Poisson probabilities. The outcome variable from a Poisson regression model does not need to follow a single, simple Poisson distribution; it is more likely a mixture of different Poisson distributions, one for each covariate pattern. Nevertheless, it can’t hurt to look at the Poisson probabilities, as long as you don’t get too concerned about discrepancies between the actual data and the Poisson probabilities.
data.frame(x=0:max(articles$art)) %>%
mutate(p=dpois(x, lambda=mean(articles$art))) %>%
ggplot(aes(factor(x), p)) +
geom_col(color="black", fill="white")
Here there seems to be a disparity at zero, which may (or may not) indicate the need for a zero-inflated model.
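If you want a rough numeric check rather than relying only on the two graphs, you can compare the observed proportion of zeros to the proportion expected under a Poisson distribution with the same mean. This is a minimal sketch, assuming the articles data frame loaded above.
# Observed proportion of zero counts versus the proportion expected
# under a Poisson distribution with the same mean (a rough check only).
obs_zero <- mean(articles$art == 0)
exp_zero <- dpois(0, lambda = mean(articles$art))
round(c(observed = obs_zero, expected = exp_zero), 3)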
Descriptive statistics on independent variables
It helps to know the general descriptive statistics (means, standard deviations, percentages) for each of the independent variables.
articles %>%
  group_by(fem) %>%
  summarize(n=n())
# A tibble: 2 × 2
fem n
<fct> <int>
1 Men 494
2 Women 421
quantile(articles$ment)
0% 25% 50% 75% 100%
0 3 6 12 77
quantile(articles$phd)
0% 25% 50% 75% 100%
0.755 2.260 3.150 3.920 4.620
articles %>%
  group_by(mar) %>%
  summarize(n=n())
# A tibble: 2 × 2
mar n
<fct> <int>
1 Single 309
2 Married 606
articles %>% group_by(kid5) %>% summarize(n=n())
The wide ranges for ment and phd may make their coefficients difficult to interpret. A simple approach is to rescale them to the interval 0 to 1.
articles %>%
mutate(ment=(ment-min(ment))/(max(ment)-min(ment))) %>%
mutate(phd=(phd-min(phd))/(max(phd)-min(phd))) -> articles_scaled
Bivariate plots
Although you are going to fit a model with multiple independent variables, it does not hurt to look at the bivariate relationships first. A variable that looks important from a bivariate perspective may be unimportant from a multivariate perspective, and vice versa, so don’t read too much into these plots. They are intended to get you familiar with trends and patterns that might (or might not) show up in the more complex models.
Use boxplots when the independent variable is categorical and scatterplots when the independent variable is continuous.
ggplot(articles, aes(factor(fem), art)) +
geom_boxplot()
ggplot(articles, aes(ment, art)) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(articles, aes(phd, art)) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(articles, aes(factor(mar), art)) +
geom_boxplot()
ggplot(articles, aes(factor(kid5), art)) +
geom_boxplot()
There aren’t any huge trends here; it looks like there might not be much going on at all. If there are differences, they seem to show up in the values above the 75th percentile.
The one cause for early concern is that the effect of ment seems to level off after a certain point.
Fit simple models first
pois_fem <- glm(art ~ fem, family=poisson, data = articles_scaled)
summary(pois_fem)
Call:
glm(formula = art ~ fem, family = poisson, data = articles_scaled)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.63265 0.03279 19.293 < 2e-16 ***
femWomen -0.24718 0.05187 -4.765 1.89e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1794.4 on 913 degrees of freedom
AIC: 3466.1
Number of Fisher Scoring iterations: 5
pois_ment <- glm(art ~ ment, family=poisson, data = articles_scaled)
summary(pois_ment)
Call:
glm(formula = art ~ ment, family = poisson, data = articles_scaled)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.25991 0.03436 7.564 3.91e-14 ***
ment 2.00584 0.14764 13.586 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1669.5 on 913 degrees of freedom
AIC: 3341.3
Number of Fisher Scoring iterations: 5
pois_phd <- glm(art ~ phd, family=poisson, data = articles_scaled)
summary(pois_phd)
Call:
glm(formula = art ~ phd, family = poisson, data = articles_scaled)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.32259 0.06813 4.735 2.19e-06 ***
phd 0.32976 0.10053 3.280 0.00104 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1806.6 on 913 degrees of freedom
AIC: 3478.3
Number of Fisher Scoring iterations: 5
pois_mar <- glm(art ~ mar, family=poisson, data = articles_scaled)
summary(pois_mar)
Call:
glm(formula = art ~ mar, family = poisson, data = articles_scaled)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.46514 0.04508 10.317 <2e-16 ***
marMarried 0.09117 0.05458 1.671 0.0948 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1814.6 on 913 degrees of freedom
AIC: 3486.3
Number of Fisher Scoring iterations: 5
pois_kid5 <- glm(art ~ kid5, family=poisson, data = articles_scaled)
summary(pois_kid5)
Call:
glm(formula = art ~ kid5, family = poisson, data = articles_scaled)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.55960 0.02988 18.728 <2e-16 ***
kid5 -0.06978 0.03450 -2.023 0.0431 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1813.2 on 913 degrees of freedom
AIC: 3485
Number of Fisher Scoring iterations: 5
For all of these models, the residual deviance is much greater than its degrees of freedom, indicating a poor fit. This could reflect the failure to include all the variables together in a multivariate model, or a problem with the underlying Poisson assumption.
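One quick, if crude, way to quantify this is the ratio of the residual deviance (or the Pearson chi-square statistic) to the residual degrees of freedom; values well above 1 suggest overdispersion. A minimal sketch, using pois_ment as an example:
# Ratio of residual deviance to residual degrees of freedom;
# values much larger than 1 are a sign of overdispersion.
pois_ment$deviance / pois_ment$df.residual

# The same idea, using the Pearson statistic instead of the deviance.
sum(residuals(pois_ment, type = "pearson")^2) / df.residual(pois_ment)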
Don’t put too much stock in the p-values just yet, but several of them are small. This may change when we look at the independent variables in combination.
Fit gradually more complex models
pois_all_ivs <- glm(art ~ fem + ment + phd + mar + kid5, family=poisson, data = articles_scaled)
summary(pois_all_ivs)
Call:
glm(formula = art ~ fem + ment + phd + mar + kid5, family = poisson,
data = articles_scaled)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.31430 0.08758 3.589 0.000332 ***
femWomen -0.22459 0.05461 -4.112 3.92e-05 ***
ment 1.96679 0.15447 12.733 < 2e-16 ***
phd 0.04956 0.10202 0.486 0.627139
marMarried 0.15524 0.06137 2.529 0.011424 *
kid5 -0.18488 0.04013 -4.607 4.08e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1634.4 on 909 degrees of freedom
AIC: 3314.1
Number of Fisher Scoring iterations: 5
tidy(pois_fem) %>%
bind_rows(tidy(pois_ment)) %>%
bind_rows(tidy(pois_phd)) %>%
bind_rows(tidy(pois_mar)) %>%
bind_rows(tidy(pois_kid5)) %>%
filter(term != "(Intercept)") %>%
mutate(unadjusted_rate_ratio=round(exp(estimate), 2)) %>%
select(term, unadjusted_rate_ratio) -> pois_crude
tidy(pois_all_ivs) %>%
filter(term != "(Intercept)") %>%
mutate(adjusted_rate_ratio=round(exp(estimate), 2)) %>%
select(term, adjusted_rate_ratio) -> pois_adjust
pois_crude %>%
  full_join(pois_adjust)
Joining with `by = join_by(term)`
# A tibble: 5 × 3
term unadjusted_rate_ratio adjusted_rate_ratio
<chr> <dbl> <dbl>
1 femWomen 0.78 0.8
2 ment 7.43 7.15
3 phd 1.39 1.05
4 marMarried 1.1 1.17
5 kid5 0.93 0.83
Examine residuals
A good general plot is fitted (predicted) values versus residuals. Since the fitted value is a composite of all the independent variables, it is likely to show unusual behavior if there are issues involving any of the independent variables.
An ideal pattern would be a random scatter of points spread across all regions of the graph. A curvilinear pattern (e.g., high in the middle, low at either extreme) or a fanning out pattern (e.g., more variation on the right side of the graph than the left) is indication that there are unusual features in the data that are not fully captured by the current model.
r_pois <- augment(pois_all_ivs, articles_scaled, type.predict="response", type.residuals="deviance")
ggplot(r_pois, aes(.fitted, .resid)) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Clearly there is something unusual going on with the largest fitted value.
r_pois %>%
  filter(.fitted > 7) %>%
  data.frame
art fem mar kid5 phd ment .fitted .resid .hat .sigma
1 1 Men Married 1 0.2652005 1 9.627207 -3.567244 0.1993306 1.33509
.cooksd .std.resid
1 0.4006419 -3.986632
This appears to be the subject where the scaled value for ment equals 1. In other words, the highest possible value for ment.
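If you want to confirm that, you can look up the record with the largest mentor count directly. A quick sketch, assuming the articles and articles_scaled data frames from above:
# The unscaled record with the most mentor publications should match
# the scaled record where ment equals 1.
articles %>% filter(ment == max(ment))
articles_scaled %>% filter(ment == 1)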
You should also plot the residuals versus every independent variable, although the plots versus binary independent variables are unlikely to show anything interesting.
ggplot(r_pois, aes(factor(fem), .resid)) +
geom_boxplot()
ggplot(r_pois, aes(ment, .resid)) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(r_pois, aes(phd, .resid)) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(r_pois, aes(factor(mar), .resid)) +
geom_boxplot()
ggplot(r_pois, aes(factor(kid5), .resid)) +
geom_boxplot()
Perhaps the value of a mentor helps only up to a point, and anything beyond that offers no additional benefit. Let’s compute a new value of ment where any value above 40 is set equal to 40.
articles %>%
  mutate(ment=pmin(40, ment)) %>%
  mutate(ment=(ment-min(ment))/(max(ment)-min(ment))) %>%
  mutate(phd=(phd-min(phd))/(max(phd)-min(phd))) -> articles_rescaled
pois_rescaled <-
  glm(
    art ~ fem + ment + phd + mar + kid5,
    family=poisson,
    data = articles_rescaled)
r_rescaled <-
  augment(
    pois_rescaled,
    articles_rescaled,
    type.predict="response",
    type.residuals="deviance")
ggplot(r_rescaled, aes(.fitted, .resid)) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(r_rescaled, aes(ment, .resid)) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Negative binomial model
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
nb1 <- glm.nb(art ~ fem + ment + phd + mar + kid5, data = articles_rescaled)
summary(pois_rescaled)
Call:
glm(formula = art ~ fem + ment + phd + mar + kid5, family = poisson,
data = articles_rescaled)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.28876 0.08797 3.283 0.00103 **
femWomen -0.22336 0.05453 -4.096 4.2e-05 ***
ment 1.29786 0.09771 13.283 < 2e-16 ***
phd -0.02054 0.10361 -0.198 0.84284
marMarried 0.15519 0.06136 2.529 0.01143 *
kid5 -0.17383 0.03987 -4.360 1.3e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1612.3 on 909 degrees of freedom
AIC: 3292.1
Number of Fisher Scoring iterations: 5
summary(nb1)
Call:
glm.nb(formula = art ~ fem + ment + phd + mar + kid5, data = articles_rescaled,
init.theta = 2.345482748, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.24464 0.11622 2.105 0.03530 *
femWomen -0.21377 0.07215 -2.963 0.00305 **
ment 1.34279 0.14614 9.188 < 2e-16 ***
phd 0.02888 0.13882 0.208 0.83517
marMarried 0.15054 0.08156 1.846 0.06494 .
kid5 -0.17066 0.05240 -3.257 0.00113 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(2.3455) family taken to be 1)
Null deviance: 1122.1 on 914 degrees of freedom
Residual deviance: 1008.1 on 909 degrees of freedom
AIC: 3128.3
Number of Fisher Scoring iterations: 1
Theta: 2.345
Std. Err.: 0.289
2 x log-likelihood: -3114.311
Zero-inflated model
zip1 <-
  zeroinfl(
    art ~ fem + ment + phd + mar + kid5 | 1,
    dist="poisson",
    data = articles_rescaled)
summary(zip1)
Call:
zeroinfl(formula = art ~ fem + ment + phd + mar + kid5 | 1, data = articles_rescaled,
dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-1.4952 -0.9862 -0.2940 0.5601 7.3463
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.52080 0.09770 5.331 9.79e-08 ***
femWomen -0.23166 0.05824 -3.978 6.96e-05 ***
ment 1.13674 0.10570 10.754 < 2e-16 ***
phd -0.05874 0.11145 -0.527 0.598140
marMarried 0.13225 0.06574 2.012 0.044239 *
kid5 -0.16129 0.04285 -3.765 0.000167 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7435 0.1639 -10.63 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 13
Log-likelihood: -1611 on 7 Df
zip2 <-
  zeroinfl(
    art ~ fem + ment + phd + mar + kid5 |
      fem + ment + phd + mar + kid5,
    dist="poisson",
    data = articles_rescaled)
summary(zip2)
Call:
zeroinfl(formula = art ~ fem + ment + phd + mar + kid5 | fem + ment +
phd + mar + kid5, data = articles_rescaled, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-1.9307 -0.8803 -0.2634 0.5320 7.0179
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.62199 0.10366 6.000 1.97e-09 ***
femWomen -0.21558 0.06390 -3.374 0.000741 ***
ment 0.96293 0.11653 8.263 < 2e-16 ***
phd -0.09387 0.12331 -0.761 0.446502
marMarried 0.10351 0.07184 1.441 0.149635
kid5 -0.13632 0.04756 -2.866 0.004156 **
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.61145 0.43313 -1.412 0.15804
femWomen 0.06827 0.28564 0.239 0.81109
ment -4.03326 1.45417 -2.774 0.00554 **
phd -0.21522 0.56689 -0.380 0.70421
marMarried -0.34733 0.32230 -1.078 0.28118
kid5 0.22108 0.19709 1.122 0.26198
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 21
Log-likelihood: -1599 on 12 Df
zip3 <-
  zeroinfl(
    art ~ fem + ment + phd + mar + kid5 | ment,
    dist="poisson",
    data = articles_rescaled)
summary(zip3)
Call:
zeroinfl(formula = art ~ fem + ment + phd + mar + kid5 | ment, data = articles_rescaled,
dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-1.9318 -0.8877 -0.2517 0.5310 7.0536
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.59978 0.09752 6.150 7.74e-10 ***
femWomen -0.22165 0.05874 -3.774 0.000161 ***
ment 0.96217 0.11360 8.470 < 2e-16 ***
phd -0.07773 0.11213 -0.693 0.488213
marMarried 0.13383 0.06630 2.018 0.043546 *
kid5 -0.15649 0.04324 -3.619 0.000295 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8236 0.2141 -3.847 0.00012 ***
ment -4.1956 1.3978 -3.002 0.00269 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 14
Log-likelihood: -1600 on 8 Df
zinb <-
  zeroinfl(
    art ~ fem + ment + phd + mar + kid5 | ment,
    dist="negbin",
    data = articles_rescaled)
summary(zinb)
Call:
zeroinfl(formula = art ~ fem + ment + phd + mar + kid5 | ment, data = articles_rescaled,
dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.2732 -0.7706 -0.2525 0.4796 6.4457
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.37538 0.12149 3.090 0.00200 **
femWomen -0.21021 0.07162 -2.935 0.00333 **
ment 1.14750 0.15473 7.416 1.21e-13 ***
phd -0.01688 0.13731 -0.123 0.90214
marMarried 0.14006 0.08088 1.732 0.08331 .
kid5 -0.16263 0.05218 -3.117 0.00183 **
Log(theta) 1.01830 0.14352 7.095 1.29e-12 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8630 0.3706 -2.329 0.0199 *
ment -25.3534 10.5777 -2.397 0.0165 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 2.7685
Number of iterations in BFGS optimization: 36
Log-likelihood: -1550 on 9 Df
Comparing competing models
The AIC statistic is useful for comparing models, but it is not produced automatically for negative binomial regression or the zero-inflated models. It requires you to pull the individual components (the log-likelihood and the number of estimated parameters) from each fitted model yourself.
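Here is a minimal sketch of that calculation, assuming the fitted model objects pois_rescaled, nb1, zip1, zip2, zip3, and zinb from above. It extracts the log-likelihood and its degrees of freedom from each model and computes AIC = -2*logLik + 2*df; the generic AIC() function does the same arithmetic for any model that supplies a logLik method. Smaller values indicate a better balance between fit and complexity.
# Pull the log-likelihood and parameter count from each fitted model,
# then compute AIC = -2 * logLik + 2 * df for a side-by-side comparison.
models <- list(poisson = pois_rescaled, negbin = nb1,
               zip1 = zip1, zip2 = zip2, zip3 = zip3, zinb = zinb)
data.frame(
  df  = sapply(models, function(m) attr(logLik(m), "df")),
  AIC = sapply(models, function(m) -2 * as.numeric(logLik(m)) +
                                     2 * attr(logLik(m), "df")))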
save.image(file="glm_diagnostics.RData")
An earlier version of this page was published on new.pmean.com.