The R package COMPoissonReg
supports Conway-Maxwell
Poisson (CMP) and Zero-Inflated Conway-Maxwell Poisson (ZICMP) models
for analysis of count data in a flexible manner, to account for data
dispersion relative to a Poisson model. The package provides regression
functionality in addition to basic distribution functions. Interested
users can refer to Sellers and Shmueli
(2010) and Sellers and Raim (2016)
regarding the underlying theoretical developments for the CMP and ZICMP
regressions, respectively. A full specification of the public
COMPoissonReg
interface can be found in the manual. In
addition to package prerequisites Rcpp
(Eddelbuettel 2013) and numDeriv
(Gilbert and Varadhan 2019),
ggplot2
(Wickham 2016) is
also used in this vignette.
One of the challenges of working with CMP and ZICMP lies in computing
the normalizing constant and related quantities. The normalizing
constant does not have a simple closed form in general and can quickly
increase or decrease magnitude as parameters are varied.
COMPoissonReg
takes a hybrid approach of either truncating
the infinite series or making use of an approximation, depending on
parameter values.
The remainder of the vignette proceeds as follows. Section describes
functions to support the CMP distribution, including numerical handling
of the normalizing constant. Section describes functions for ZICMP.
Finally, Section demonstrates regression functions; Sections and give
specific examples based on CMP and ZICMP outcomes, respectively. The
COMPoissonReg
package is on CRAN at https://cran.r-project.org/package=COMPoissonReg and the
source code is on Github at https://github.com/lotze/COMPoissonReg.
Let Y ∼ CMP(λ, ν) be a Conway-Maxwell Poisson (CMP) random variable with density where λ > 0, ν > 0, and ℕ represents the nonnegative integers {0, 1, 2, …}. Three notable special cases of CMP(λ, ν) help to demonstrate its flexibility in count modeling.
The case ν = 1 corresponds to Poisson(λ).
When λ ∈ (0, 1) and ν = 0, the CMP(λ, ν) distribution is Geometric(1 − λ) with density f(y ∣ λ) = (1 − λ)λy for y ∈ ℕ, which is overdispersed relative to Poisson.
When ν → ∞, CMP(λ, ν) converges to a Bernoulli(λ/(1 + λ)) distribution which is underdispersed relative to Poisson.
The normalizing constant Z(λ, ν) presents some challenges in the practical use of CMP models and has been a topic of interest in the CMP literature. In general, there is no simple closed form expression for the series Z(λ, ν). Shmueli et al. (2005) give the approximation where the O(⋅) term vanishes as λ−1/ν becomes small. Approximations have been further studied and refined in subsequent literature; see for example Gillispie and Green (2015), Daly and Gaunt (2016), and Gaunt et al. (2019). The expression in emphasizes that the magnitude of Z(λ, ν) explodes as ν → 0 when λ > 1. For example, Z(2, 0.075) ≈ e780.515 is too large to store as a double-precision floating point number, and may evaluate to infinity if care is not taken. In contrast, Z(λ, ν) → 1/(1 − λ) when λ < 1 and ν → 0.
In practice, the COMPoissonReg
package does not place
constraints on λ and ν, except to ensure that they are
positive, so that their values are driven by the data or the user’s
selection. A hybrid strategy motivated by is taken by
COMPoissonReg
.
To compute Z(λ, ν), suppose we are given a small tolerance δ > 0. If the first term of dominates the second term, and we take as an approximation. Otherwise, the series is computed by truncating to a finite number of terms, which is described next. In either case, computations are kept on the log-scale as much as possible to accommodate numbers with potentially very large and very small magnitudes.
We approximate Z(λ, ν) by a finite summation $Z^{(M)}(\lambda, \nu) = \sum_{r=0}^M \lambda^r / (r!)^\nu$ if condition fails, so that the remainder is smaller than a given tolerance. The general approach is described in Appendix B of Shmueli et al. (2005). Robbins (1955) gives bounds for Stirling’s approximation as Noting that e1/(12n + 1) ≥ 1 for n ≥ 1 and $\sqrt{2\pi} e^{1 / (12n)} \leq e$ for n ≥ 2, we obtain simpler bounds which will be convenient in the following calculations.3 We may then bound the truncation error for Z(M)(λ, ν) using assuming that |λeν/(M + 1)ν| < 1 so that the geometric series in converges. To ensure this convergence we choose M at least large enough so that For a small given number ϵ > 0, we may consider bounding the relative error by The second inequality of can be expressed on the log-scale using where Therefore, we compute Z(M)(λ, ν) until at least M > λ1/νe − 1, increasing M and updating Z(M)(λ, ν) until is satisfied. This is summarized as Algorithm .
The individual terms λr/(r!)ν
in the summation may be too large to store at their original scale.
Therefore, summation is carried out at the log-scale, wherever possible,
using the identity this is especially helpful when
0 < y ≪ x, as
log x may be kept on the
log-scale by the first term of the right-hand side of , and the standard
library function log1p
may be used to accurately compute
log (1 + ϕ) for very small
ϕ > 0.
Many of the functions in the user interface of
COMPoissonReg
take an optional control
argument, which can be constructed as follows.
The tolerances δ and ϵ are specified as
hybrid.tol
and truncate.tol
respectively.
Taking hybrid.tol
to be a very small positive number
results in use of the truncated sum Z(M)(λ, ν),
while hybrid.tol = Inf
uses the approximation method ,
except in extreme cases where λ−1/ν evaluates
to zero or ∞ numerically. The argument
ymax
specifies upper limit M; this is a safety measure which
prevents very large computations unless the user opts to allow them.
When no control object is specified, a global default (via option
COMPoissonReg.control
) is used.
> control = getOption("COMPoissonReg.control")
> control$ymax
[1] 1e+06
> control$hybrid.tol
[1] 0.01
> control$truncate.tol
[1] 1e-06
The default may be replaced in your current session if desired.
The control object contains several other useful arguments to be discussed later in the vignette.
The ncmp
function computes the normalizing constant
Z(λ, ν) and
returns its value either on the original scale or the log-scale.
> ncmp(lambda = 1.5, nu = 1.2)
[1] 4.01341
> ncmp(lambda = 1.5, nu = 1.2, log = TRUE)
[1] 1.389641
> ncmp(lambda = 1.5, nu = 1.2, log = TRUE, control = get.control(hybrid.tol = 1e10))
[1] 1.373642
> ncmp(lambda = 1.5, nu = 1.2, log = TRUE, control = get.control(hybrid.tol = 1e-10))
[1] 1.389641
Before proceeding, let us define a function to display errors and warnings which are intentionally triggered in the remainder of the vignette.
The function tcmp
returns the truncation value M obtained from Algorithm .
> nu_seq = c(1, 0.5, 0.2, 0.1, 0.05, 0.03)
> tryCatch({ tcmp(lambda = 1.5, nu = nu_seq) }, warning = print_warning)
[1] simpleWarning in y_trunc(prep$lambda, prep$nu, tol = truncate.tol, ymax
[2] = ymax): Terms of normalizing constant CMP(1.5, 0.03) could not be
[3] bounded by a geometric series when y <= 1e+06. Consider adjusting the
[4] controls ymax, hybrid.tol, and truncate.tol
Note that tcmp
returns 1e6
and produces a
warning for the smallest nu
value 0.03 because Algorithm
has reached ymax = 1e6
before the series could be bounded
by a geometric series. Here, it is likely that support values with
non-negligible mass are being left out. Let us increase
ymax
to avoid this problem.
It is also possible to reach the second loop of Algorithm where the
geometric series can be used, but ymax
is not large enough
to satisfy . Here is an example where this occurs.
> tcmp(lambda = 1.2, nu = 0.03, control = get.control(ymax = 1200))
Warning in y_trunc(prep$lambda, prep$nu, tol = truncate.tol, ymax = ymax):
Absolute relative error 1.13902e-05 was larger than tolerance 1e-06 with
CMP(1.2, 0.03) truncated to 1200. Consider adjusting the controls ymax,
hybrid.tol, and truncate.tol
[1] 1200
Now that we have a somewhat robust computation for the normalizing constant, let us create a plot of the interesting behavior when λ > 1 and ν is decreasing.
library(ggplot2)
nu_seq = seq(0.03, 1.5, length.out = 20)
nc1 = ncmp(lambda = 0.5, nu = nu_seq, log = TRUE)
nc2 = ncmp(lambda = 1.05, nu = nu_seq, log = TRUE)
nc3 = ncmp(lambda = 1.20, nu = nu_seq, log = TRUE)
ggplot() +
geom_point(data = data.frame(x = nu_seq, y = nc1), aes(x = x, y = y), pch = 1) +
geom_point(data = data.frame(x = nu_seq, y = nc2), aes(x = x, y = y), pch = 2) +
geom_point(data = data.frame(x = nu_seq, y = nc3), aes(x = x, y = y), pch = 3) +
xlab("nu") +
ylab("log of normalizing constant") +
theme_bw()
Log of normalizing constant for λ = 0.5 (∘), λ = 1.05 (Δ), and λ = 1.20 (+).
We see that with λ = 1.2 a value of Z(λ, 0.03) ≈ e18.67 is obtained, which is an extremely large jump from the next value in the series Z(λ, 0.1074) ≈ e3.2.
The respective functions for CMP density, variate generation, CDF,
and quantile functions are dcmp
, rcmp
,
pcmp
, and qcmp
. Their usage is similar to
distribution functions provided by the R stats
package.
> dcmp(0, lambda = 10, nu = 0.9)
[1] 6.819476e-06
> dcmp(0:17, lambda = 10, nu = 0.9, log = TRUE)
[1] -11.895728 -9.593143 -7.914390 -6.600556 -5.545636 -4.691545
[7] -4.001543 -3.450278 -3.019190 -2.694107 -2.463848 -2.319369
[13] -2.253200 -2.259069 -2.331636 -2.466296 -2.659041 -2.906347
> dcmp(c(0, 1, 2), lambda = c(10, 11, 12), nu = c(0.9, 1.0, 1.1), log = TRUE)
[1] -11.895728 -8.602105 -6.071951
> rcmp(50, lambda = 10, nu = 0.9)
[1] 10 13 8 18 17 12 13 15 13 10 20 13 13 14 15 5 18 14 14 17 9 20 17 10 10
[26] 16 11 13 10 12 14 12 15 14 11 14 10 11 14 8 13 18 6 13 18 18 11 13 15 9
> pcmp(0:17, lambda = 10, nu = 0.9)
[1] 6.819476e-06 7.501423e-05 4.404609e-04 1.800073e-03 5.704532e-03
[6] 1.487703e-02 3.316443e-02 6.490125e-02 1.137420e-01 1.813448e-01
[11] 2.664516e-01 3.647872e-01 4.698497e-01 5.742973e-01 6.714341e-01
[16] 7.563328e-01 8.263482e-01 8.810232e-01
> qq = seq(0, 0.95, length.out = 10)
> qcmp(qq, lambda = 10, nu = 0.9)
[1] 0 8 10 11 12 13 14 15 17 19
The CMP distribution functions compute the normalizing constant via
Section . The density dcmp
uses the hybrid method, while
rcmp
, pcmp
, and qcmp
use the
truncation method regardless of condition . Variate generation, CDF, and
quantiles are then computed on CMP as if it were a discrete distribution
on the sample space {0, …, M}.
As with tcmp
and ncmp
, a warning is emitted in
cases where they may not produce reliable results.
> tryCatch({ rcmp(1, lambda = 2, nu = 0.01) }, warning = print_warning)
[1] simpleWarning in r_cmp(n, prep$lambda, prep$nu, hybrid.tol,
[2] truncate.tol, ymax): Terms of normalizing constant CMP(2, 0.01) could
[3] not be bounded by a geometric series when y <= 1e+06. Consider
[4] adjusting the controls ymax, hybrid.tol, and truncate.tol
The truncation method for qcmp
can result in unreliable
quantiles when the requested probability is very close to 1. For
example, the actual quantile for probability 1 is ∞,
which can be expressed with no computation, but the computed quantity
will be the truncation value M. More generally, implies that
Therefore, it is possible that quantiles larger than
1 − ϵ may be inaccurately
computed with the truncated support. COMPoissonReg
gives a
warning when these are requested.
> tryCatch({
+ qcmp(0.9999999, lambda = 1.5, nu = 0.5)
+ }, warning = print_warning)
[1] simpleWarning in qcmp(0.9999999, lambda = 1.5, nu = 0.5): At least one
[2] requested quantile was very close to 1. In particular, 1 of the given
[3] probabilities were greater than 1 - truncate_tol = exp(-1e-06), where
[4] truncate_tol = 1e-06. Associated results may be affected by truncation.
[5] Consider adjusting the controls ymax and truncate.tol or reducing logq.
As a sanity check, let us ensure that the empirical density values,
cumulative probabilities, and quantiles of draws from rcmp
align with respective results computed via dcmp
,
pcmp
, and qcmp
.
library(ggplot2)
n = 100000
lambda = 0.5
nu = 0.1
x = rcmp(n, lambda, nu)
xx = seq(-1, max(x)) ## Include -1 to ensure it gets probability zero
qq = seq(0, 0.99, length.out = 100)
fx = dcmp(xx, lambda, nu)
px = pcmp(xx, lambda, nu)
qx = qcmp(qq, lambda, nu)
qx_emp = quantile(x, probs = qq)
ggplot() +
geom_bar(data = data.frame(x = x), aes(x = x, y = ..prop..), fill = "NA",
col = "black") +
geom_point(data = data.frame(x = xx[-1], fx = fx[-1]), aes(x, fx)) +
ylab("Density") +
theme_bw()
Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(prop)` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
Empirical density of draws (histogram) versus density computed via the dcmp function (points).
ggplot() +
stat_ecdf(data = data.frame(x = x), aes(x), geom = "step") +
geom_point(data = data.frame(x = xx, px = px), aes(x, px)) +
ylab("Probability") +
theme_bw()
Empirical CDF of draws (solid line) versus CDF computed via the pcmp function (points).
ggplot() +
geom_point(data = data.frame(x = qq, qx_emp = qx_emp), aes(qq, qx_emp), pch = 1) +
geom_point(data = data.frame(x = qq, qx = qx), aes(qq, qx), pch = 3) +
xlab("Probability") +
ylab("Quantile") +
theme_bw()
Empirical quantiles of draws (o
) versus quantiles computed
via the qcmp function (+
).
For Y ∼ CMP(λ, ν), we
can consider computing the expectation and variance of Y in two ways. First, if there is a
moderately-sized M where {0, …, M} contains all but a
negligible amount of the mass of Y, we can compute the moments using
truncated summations Otherwise, a different approach
is taken. Notice that the expected value is related to the first
derivative of the log-normalizing constant via For the
second derivative, Therefore, we may use first and
second derivatives of log Z(λ, ν), taken
with respect to λ, to compute
$\E(Y)$ and $\Var(Y)$. The ecmp
and
vcmp
functions implement this computation of the
expectation and variance, respectively. The control object is used to
determine whether to use the truncation or differentiation approach.
Condition is used to determine whether to use the truncation approach
(if false) or differentiation approach (if true).
> ecmp(lambda = 10, nu = 1.2)
[1] 6.727397
> ecmp(lambda = 1.5, nu = 0.5)
[1] 2.815447
> ecmp(lambda = 1.5, nu = 0.05)
[1] 3334.757
> ecmp(lambda = 1.5, nu = 0.05, control = get.control(hybrid.tol = 1e-10))
[1] 3334.762
> ecmp(lambda = 1.5, nu = 0.05, control = get.control(hybrid.tol = 1e10))
[1] 3334.757
> vcmp(lambda = 10, nu = 1.2)
[1] 5.679667
> vcmp(lambda = 1.5, nu = 0.5)
[1] 4.513041
> vcmp(lambda = 1.5, nu = 0.05)
[1] 66505.13
> vcmp(lambda = 1.5, nu = 0.05, control = get.control(hybrid.tol = 1e-10))
[1] 66505.03
> vcmp(lambda = 1.5, nu = 0.05, control = get.control(hybrid.tol = 1e10))
[1] 66505.13
Provided that an enormously large truncation value M is not required, we may compute
other moments by truncated sums using tcmp
.
Let S ∼ Bernoulli(p) and T ∼ CMP(λ, ν) be independent random variables. Then Y = (1 − S)T follows a Zero-Inflated Conway-Maxwell Poisson distribution ZICMP(λ, ν, p) with density Like CMP, several interesting special cases are obtained.
Taking ν = 1 corresponds to Zero-Inflated Poisson ZIP(λ, p) with density $f(y \mid \lambda, p) = (1 - p) e^{-\lambda} \lambda^{y} / y! + p \cdot \ind(y = 0)$.
When λ ∈ (0, 1) and ν → 0, ZICMP(λ, ν) converges to a Zero-Inflated Geometric distribution with density $f(y \mid \lambda, p) = (1 - p) (1 - \lambda) \lambda^y + p \cdot \ind(y = 0)$ for y ∈ ℕ.
When ν → ∞, ZICMP(λ, ν, p) converges to a “Zero-Inflated Bernoulli” distribution with density which remains a Bernoulli distribution with an adjusted success probability. Here the λ and p parameters are not individually identifiable. Therefore, users may want to avoid zero-inflation in CMP data analyses with extreme underdispersion.
Finally, p = 0 yields CMP(λ, ν).
There is a close relationship between the CMP and ZICMP distributions, as we have seen from construction of ZICMP. The relationship between the CMP densities and variate generation mechanisms was given earlier in this section. Denote F(x ∣ λ, ν) and F(x ∣ λ, ν, p) as the CDFs of CMP(λ, ν) and ZICMP(λ, ν, p), respectively. We have For a given probability ϕ ∈ [0, 1], the associated CMP and ZICMP quantile functions are related via
The respective functions for ZICMP density, variate generation, CDF,
and quantile functions are dzicmp
, rzicmp
,
pzicmp
, and qzicmp
. They make use of the CMP
implementation described in Section such as the criteria to either
truncate or approximate the normalizing constant.
> qq = seq(0, 0.95, length.out = 20)
> rzicmp(20, lambda = 1.5, nu = 0.2, p = 0.25)
[1] 10 17 14 8 0 11 0 19 12 16 8 0 10 0 0 12 8 10 12 0
> dzicmp(c(0, 1, 2), lambda = 1.5, nu = 0.2, p = 0.25)
[1] 0.26630689 0.02446034 0.03194095
> pzicmp(c(0, 1, 2), lambda = 1.5, nu = 0.2, p = 0.25)
[1] 0.2663069 0.2907672 0.3227082
> qzicmp(qq, lambda = 1.5, nu = 0.2, p = 0.25)
[1] 0 0 0 0 0 0 2 3 4 5 6 7 8 9 11 12 13 15 17 20
As with qcmp
, the qzicmp
function is
computed by the truncation method, and cannot accurately compute
quantiles very close to 1. More specifically, from the CMP distribution,
gives where ϕϵ = (1 − ϵ)(1 − p) + p
and the last equality uses . This motivates a warning from
qzicmp
when the argument is larger than ϕϵ.
> tryCatch({
+ qzicmp(0.9999999, lambda = 1.5, nu = 0.5, p = 0.5)
+ }, warning = print_warning)
[1] simpleWarning in qzicmp(0.9999999, lambda = 1.5, nu = 0.5, p = 0.5): At
[2] least one requested quantile was very close to 1. In particular, 1 of
[3] the given probabilities were greater than (1 - truncate.tol) * (1-p) +
[4] p, where truncate_tol = 1e-06. Associated results may be affected by
[5] truncation. Consider adjusting the controls ymax and truncate.tol or
[6] reducing logq.
Let us repeat the sanity check from Section to ensure that the
empirical density values, cumulative probabilities, and quantiles line
up with the ones computed via dzicmp
, pzicmp
,
and qzicmp
.
library(ggplot2)
n = 100000
lambda = 0.5
nu = 0.1
p = 0.5
x = rzicmp(n, lambda, nu, p)
xx = seq(-1, max(x)) ## Include -1 to ensure it gets probability zero
qq = seq(0, 0.99, length.out = 100)
fx = dzicmp(xx, lambda, nu, p)
px = pzicmp(xx, lambda, nu, p)
qx = qzicmp(qq, lambda, nu, p)
qx_emp = quantile(x, probs = qq)
ggplot() +
geom_bar(data = data.frame(x = x), aes(x = x, y = ..prop..), fill = "NA",
col = "black") +
geom_point(data = data.frame(x = xx[-1], fx = fx[-1]), aes(x, fx)) +
ylab("Density") +
theme_bw()
Empirical density of draws (histogram) versus density computed via the dzicmp function (points).
ggplot() +
stat_ecdf(data = data.frame(x = x), aes(x), geom = "step") +
geom_point(data = data.frame(x = xx, px = px), aes(x, px)) +
ylab("Probability") +
theme_bw()
Empirical CDF of draws (solid line) versus CDF computed via the pzicmp function (points).
ggplot() +
geom_point(data = data.frame(x = qq, qx_emp = qx_emp), aes(qq, qx_emp), pch = 1) +
geom_point(data = data.frame(x = qq, qx = qx), aes(qq, qx), pch = 3) +
xlab("Probability") +
ylab("Quantile") +
theme_bw()
Empirical quantiles of draws (o
) versus quantiles computed
via the qzicmp function (+
).
The expected value and variance of Y ∼ ZICMP(λ, ν, p)
are apparent from the construction Y = (1 − S)T given
earlier in this section. Namely, may be obtained using
formulas for iterated conditional expectations and variances. They are
evaluated in COMPoissonReg
using the ezicmp
and vzicmp
functions respectively. These functions make use
of the ecmp
and vcmp
functions described in
Section to compute $\E(T)$ and $\Var(T)$.
Suppose there are n
subjects with outcomes y1, …, yn ∈ ℕ
and covariates x⃗i ∈ ℝd1,
s⃗i ∈ ℝd2,
and w⃗i ∈ ℝd3
for i = 1, …, n. The
COMPoissonReg
package fits both CMP and ZICMP regression
models.
The CMP regression model assumes that where log λi = x⃗i⊤β⃗
and log νi = s⃗i⊤γ⃗.
Writing $\btheta = (\vec{\beta},
\vec{\gamma})$, the likelihood is The ZICMP
regression model assumes that where log λi = x⃗i⊤β⃗,
log νi = s⃗i⊤γ⃗,
and $\logit p_i = \vec{w}_i^\top
\vec{\zeta}$. Writing $\btheta =
(\vec{\beta}, \vec{\gamma}, \vec{\zeta})$, the likelihood is
We will write d = d1 + d2 + d3
for the total dimension of $\btheta$.
The n × d1
design matrix whose rows consist of x⃗i will be
denoted X⃗. Similarly, we will
write S⃗ and W⃗ as the n × d2 and n × d3 design
matrices constructed from s⃗i and w⃗i,
respectively. The glm.cmp
function provides a formula
interface to fit models of both types: and .
out = glm.cmp(formula.lambda, formula.nu = ~ 1, formula.p = NULL,
data = NULL, init = NULL, fixed = NULL, control = NULL, ...)
The interface contains three formulas: formula.lambda
specifies the regression x⃗i⊤β⃗
used for λi, while
formula.nu
and formula.p
correspond to s⃗i⊤γ⃗
for νi and
w⃗i⊤ζ⃗
for pi,
respectively. ZICMP regression is utilized when formula.p
is set to something other than its default NULL
value;
otherwise, CMP regression is assumed. The data
argument is
used to pass a data.frame
explicitly rather than having the
data be read from the local environment. The init
,
fixed
, and control
arguments and associated
helper functions are described below.
The init
argument represents an initial value for the
optimizer. The following functions can be used to construct it.
> get.init(beta = c(1, 2, 3), gamma = c(-1, 1), zeta = c(-2, -1))
$beta
[1] 1 2 3
$gamma
[1] -1 1
$zeta
[1] -2 -1
attr(,"class")
[1] "COMPoissonReg.init"
> get.init.zero(d1 = 3, d2 = 2, d3 = 2)
$beta
[1] 0 0 0
$gamma
[1] 0 0
$zeta
[1] 0 0
attr(,"class")
[1] "COMPoissonReg.init"
The fixed
argument is used to specify indices of the
three coefficients which will remain fixed at their initial value during
optimization.
> get.fixed(beta = c(1L, 2L), gamma = c(1L))
$beta
[1] 1 2
$gamma
[1] 1
$zeta
integer(0)
attr(,"class")
[1] "COMPoissonReg.fixed"
The specification above requests the first two elements of
beta
and the first element of gamma
to be
fixed. Notice that indices must be integers and that the default value
is an empty integer vector which is interpreted as “no elements are
fixed”. The fixed
argument can usually be disregarded but
may be useful in some circumstances; an example is given in Section
.
Specifying the elements of init
and fixed
may be somewhat awkward with the formula interface, as they require
knowledge of how formulas will be expanded into design matrices and
coefficients. It can be helpful to produce the design matrices using R’s
model.matrix
function.
> model.matrix(formula.lambda, data = data)
> model.matrix(formula.nu, data = data)
> model.matrix(formula.p, data = data)
The control
argument has been introduced in Section ;
regression modeling makes use of several additional arguments.
COMPoissonReg
uses optim
to compute the
maximum likelihood estimate (MLE) $\hat{\btheta}$ for $\btheta$ under the specified model. Several
controls are provided to influence how COMPoissonReg
invokes optim
; here are their default values.
> control = getOption("COMPoissonReg.control")
> control$optim.method
[1] "L-BFGS-B"
> control$optim.control
$maxit
[1] 150
The element optim.method
is a string which is passed as
the method
argument to optim
, while
optim.control
is a list passed as the control
argument of optim
. Note that, for the latter, if an entry
is given for fnscale
, it will be ignored and overwritten
internally by COMPoissonReg
.
The covariance of $\hat{\btheta}$ is
estimated by $\hat{\vec{V}}(\hat{\btheta}) =
-[\vec{H}(\hat{\btheta})]^{-1}$, where $\vec{H}(\btheta) = \partial^2 \log L(\btheta) /
[\partial \btheta \partial \btheta^\top]$ is the Hessian of the
log-likelihood computed by optim
. The standard error for
coefficient θj in $\btheta$ is then obtained as the square root
of the jth diagonal of $\hat{\vec{V}}(\hat{\btheta})$.
We will now illustrate use of the regression tools using two examples whose data are included in the package. Note that these demonstrations are not intended to be complete regression analyses, and results may be slightly different than previously published analyses due to differences in the computations.
The freight
dataset (Kutner,
Nachtsheim, and Neter 2003) was analyzed using CMP regression by
Sellers and Shmueli (2010) and found to
exhibit underdispersion. The data describe n = 10 instances where 1,000 ampules
were transported via air shipment. The outcome of interest is the
variable broken
which describes the number of broken
ampules in each shipment. The covariate transfers
describes
the number of times the carton was transferred from one aircraft to
another during the shipment.
Let us load and view the dataset.
> data(freight)
> print(freight)
broken transfers
1 16 1
2 9 0
3 17 2
4 12 0
5 22 3
6 13 1
7 8 0
8 15 1
9 19 2
10 11 0
Before fitting a CMP regression, let us fit a Poisson regression
model $Y_i \indep
\text{Poisson}(\lambda_i)$ with This can be
carried out with the standard glm
function.
> glm.out = glm(broken ~ transfers, data = freight, family = poisson)
> summary(glm.out)
Call:
glm(formula = broken ~ transfers, family = poisson, data = freight)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.35295 0.13174 17.86 < 2e-16 ***
transfers 0.26384 0.07924 3.33 0.000869 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 12.5687 on 9 degrees of freedom
Residual deviance: 1.8132 on 8 degrees of freedom
AIC: 50.395
Number of Fisher Scoring iterations: 4
Next, let us fit a similar CMP regression model with using only an intercept for νi.
> cmp.out = glm.cmp(broken ~ transfers, data = freight)
> print(cmp.out)
CMP coefficients
Estimate SE z-value p-value
X:(Intercept) 13.8286 6.2405 2.2159 0.0267
X:transfers 1.4843 0.6892 2.1537 0.03127
S:(Intercept) 1.7550 0.4493 3.9065 9.365e-05
--
Transformed intercept-only parameters
Estimate SE
nu 5.7835 2.5982
--
Chi-squared test for equidispersion
X^2 = 9.1048, df = 1, p-value = 2.5494e-03
--
Elapsed: 0.05 sec Sample size: 10 formula interface
LogLik: -18.6449 AIC: 43.2898 BIC: 44.1975
Optimization Method: L-BFGS-B Converged status: 0
Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH
The coefficients used in the λi formula are
prefixed with an X:
label, while an S:
label
is used for coefficients of the νi formula.
Notice that estimates for X:
coefficients from the CMP fit
are dissimilar to those from the Poisson fit; this may occur when the
estimate of ν deviates from
the value of 1. Similarly to the glm
output, the output of
glm.cmp
displays several quantities for each coefficient
θj, j = 1, …, d: a point
estimate θ̂j, an
associated standard error $\widehat{\text{SE}}(\hat{\theta}_j)$, a
z-value $z_j = \hat{\theta}_j /
\widehat{\text{SE}}(\hat{\theta}_j)$, and a p-value 2Φ(−|zj|)
for the test H0 : θj = 0
versus H1 : θj ≠ 0.
Here, Φ is the CDF of the
standard normal distribution. Because an intercept-only formula was
specified for νi, ν̂ = exp (γ̂) does not vary
with i and its estimate and
associated standard error are added to the display. Here we see evidence
of underdispersion with ν̂ > 1. A test for equidispersion
is displayed to determine whether there is a significant amount of over
or underdispersion in the data. In particular, a likelihood ratio test
is used to decide whether $H_0: \vec{\gamma} =
\vec{0}$ versus $H_0: \vec{\gamma} \neq
\vec{0}$. The test statistic is displayed along with the degrees
of freedom and associated p-value. Here we have fairly strong evidence
to reject the null hypothesis of equidispersion.
The AIC for the CMP model cmp.out
shows some improvement
over that of glm.out
. Let us also consider a slope
coefficient for the ν
component using via the following call to
glm.cmp
.
> cmp2.out = glm.cmp(broken ~ transfers, formula.nu = ~ transfers, data = freight)
> print(cmp2.out)
CMP coefficients
Estimate SE z-value p-value
X:(Intercept) 15.5959 7.1087 2.1939 0.02824
X:transfers 4.6322 2.3978 1.9319 0.05338
S:(Intercept) 1.8935 0.4525 4.1846 2.857e-05
S:transfers 0.1206 0.0505 2.3883 0.01693
--
Chi-squared test for equidispersion
X^2 = 11.6991, df = 2, p-value = 2.8812e-03
--
Elapsed: 0.08 sec Sample size: 10 formula interface
LogLik: -17.3477 AIC: 42.6954 BIC: 43.9058
Optimization Method: L-BFGS-B Converged status: 0
Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH
Model cmp2.out
provides a slight improvement over
cmp.out
in AIC and BIC, but we may prefer
cmp.out
in the interest of simplicity.
To gain some insight into the optimization, we may wish to increase the trace level, which can be done as follows.
> control = get.control(optim.control = list(maxit = 5, trace = 3, REPORT = 1))
> cmp3.out = glm.cmp(broken ~ transfers, data = freight, control = control)
N = 3, M = 5 machine precision = 2.22045e-16
This problem is unconstrained.
At X0, 0 variables are exactly at the bounds
At iterate 0 f= 273.34 |proj g|= 260.29
At iterate 1 f = 49.923 |proj g|= 113
At iterate 2 f = 46.155 |proj g|= 44.594
At iterate 3 f = 29.883 |proj g|= 22.868
At iterate 4 f = 27.657 |proj g|= 2.6174
At iterate 5 f = 27.599 |proj g|= 4.443
At iterate 6 f = 27.359 |proj g|= 7.3357
final value 27.359209
stopped after 6 iterations
N = 2, M = 5 machine precision = 2.22045e-16
This problem is unconstrained.
At X0, 0 variables are exactly at the bounds
At iterate 0 f= 148.87 |proj g|= 152.94
At iterate 1 f = 57.259 |proj g|= 146.63
At iterate 2 f = 53.033 |proj g|= 67.913
At iterate 3 f = 42.039 |proj g|= 42.645
At iterate 4 f = 36.942 |proj g|= 38.126
At iterate 5 f = 32.584 |proj g|= 43.852
At iterate 6 f = 24.06 |proj g|= 17.695
final value 24.060495
stopped after 6 iterations
Data from the local environment may be passed to the
glm.cmp
function without explicitly using the
data
argument.
In a count regression model, it may be desirable to include offset
terms such as An offset
term may be used
in the formula interface to accomplish this.
> freight$offx = 13
> freight$offs = 1
> glm.cmp(broken ~ transfers + offset(offx), data = freight)
> glm.cmp(broken ~ transfers + offset(offx), formula.nu = ~1 + offset(offs), data = freight)
For users who wish to bypass the formula interface and prepare the X⃗ and S⃗ design matrices manually, a “raw” interface to the regression functionality is also provided.
Several accessors are provided to extract results from the output object.
> logLik(cmp.out) ## Log-likelihood evaluated at MLE.
[1] -18.64489
> AIC(cmp.out) ## AIC evaluated at MLE.
[1] 43.28978
> BIC(cmp.out) ## BIC evaluated at MLE.
[1] 44.19754
> coef(cmp.out) ## Estimates of theta as a flat vector
X:(Intercept) X:transfers S:(Intercept)
13.828561 1.484303 1.755002
> coef(cmp.out, type = "list") ## Estimates of theta as a named list
$beta
X:(Intercept) X:transfers
13.828561 1.484303
$gamma
S:(Intercept)
1.755002
> vcov(cmp.out) ## Estimated covariance matrix of theta hat
X:(Intercept) X:transfers S:(Intercept)
X:(Intercept) 38.944034 4.0877545 2.8000943
X:transfers 4.087754 0.4749988 0.2978880
S:(Intercept) 2.800094 0.2978880 0.2018301
> sdev(cmp.out) ## Standard deviations from vcov(...) diagonals
X:(Intercept) X:transfers S:(Intercept)
6.2405155 0.6892016 0.4492551
> sdev(cmp.out, type = "list") ## Standard deviations as a named list
$beta
[1] 6.2405155 0.6892016
$gamma
[1] 0.4492551
The predict
function computes several useful quantities
evaluated at the estimate $\hat{\btheta}$. The argument default
type = "response"
produces a vector of estimates of the
response $\hat{y}_i = \E(Y_i)$ for
i = 1, …, n using the
method described in Section . The argument type = "link"
produces a data.frame
with columns for the linked
parameters λi and νi.
> predict(cmp.out)
[1] 13.70507 10.50769 17.83752 10.50769 23.17871 13.70507 10.50769 13.70507
[9] 17.83752 10.50769
> predict(cmp.out, type = "link")
lambda nu
1 4469845 5.78346
2 1013136 5.78346
3 19720454 5.78346
4 1013136 5.78346
5 87004438 5.78346
6 4469845 5.78346
7 1013136 5.78346
8 4469845 5.78346
9 19720454 5.78346
10 1013136 5.78346
Note that the estimated nu
values are equal for all
observations because the model assumed only an intercept term for the
dispersion component.
We can also use predict
on new covariate values. Note
that models fit with the formula interface expect the new data to be
provided as a data.frame
which is interpreted using the
formula used to fit the model. If the raw interface was used to fit the
model, use the get.modelmatrix
function to specify design
matrices to use for prediction.
# Prepare new data to fit by formula interface
new.df = data.frame(transfers = 0:10)
# Prepare new data to fit by raw interface
X = model.matrix(~ transfers, data = new.df)
S = model.matrix(~ 1, data = new.df)
new.data = get.modelmatrix(X = X, S = S)
# Pass new data to model from by formula interface
y.hat.new = predict(cmp.out, newdata = new.df)
# Pass new data to model from by raw interface
y.hat.new = predict(cmp.raw.out, newdata = new.data)
# Compute predictions for links
predict.out = predict(cmp.out, newdata = new.df, type = "link")
# Plot predictions
ggplot() +
geom_point(data = new.df, aes(transfers, y.hat.new)) +
xlab("Number of transfers") +
ylab("Predicted number broken") +
theme_bw()
> print(y.hat.new)
[1] 1.176542 2.690927 5.694026 11.715378 23.807032 48.096062
[7] 96.889869 194.912671 391.832565 787.430083 1582.156271
> print(predict.out)
lambda nu
1 1.013136e+06 5.78346
2 4.469845e+06 5.78346
3 1.972045e+07 5.78346
4 8.700444e+07 5.78346
5 3.838538e+08 5.78346
6 1.693520e+09 5.78346
7 7.471622e+09 5.78346
8 3.296396e+10 5.78346
9 1.454333e+11 5.78346
10 6.416355e+11 5.78346
11 2.830824e+12 5.78346
The leverage
function computes the diagonal entries of a
“hat” matrix which can be formulated in CMP regression. These can be
used to diagnose influential observations. For details, see Section 3.6
of Sellers and Shmueli (2010).
> leverage(cmp.out)
[1] 0.1541752 0.1943925 0.2733327 0.2259140 0.6002368 0.3795226 0.2379250
[8] 0.1117655 0.3653074 0.4574283
The residuals
function provides either raw (the default)
or quantile-based residuals (Dunn and Smyth
1996). In a CMP regression setting, raw residuals yi − ŷi
generally do not work well with traditional regression diagnostics, such
as Q-Q plots. Quantile-based residuals often produce interpretable
diagnostics; however, a random element is used in the computation of
quantile residuals for discrete distributions. This aids
interpretability but gives slightly different residual values each time
they are computed. See Dunn and Smyth
(1996) for details.
Pearson residuals may be preferred over raw residuals for diagnostics; these can be obtained by standardizing raw residuals using leverage values and variance estimates.
> link.hat = predict(cmp.out, type = "link")
> vv = vcmp(link.hat$lambda, link.hat$nu)
> hh = leverage(cmp.out)
> res.pearson = res.raw / sqrt(vv*(1-hh))
For each type of residual—raw, Pearson, and quantile—we now plot fitted values versus residuals and Q-Q plots.
plot.fit.res = function(y.hat, res) {
ggplot(data.frame(y = y.hat, res = res)) +
geom_point(aes(y, res)) +
xlab("Fitted Value") +
ylab("Residual Value") +
theme_bw() +
theme(plot.title = element_text(size = 10))
}
plot.qq.res = function(res) {
ggplot(data.frame(res = res), aes(sample = res)) +
stat_qq() +
stat_qq_line() +
theme_bw() +
theme(plot.title = element_text(size = 10))
}
y.hat = predict(cmp.out)
plot.fit.res(y.hat, res.raw) +
ggtitle("Fitted Values vs. Raw Residuals")
plot.qq.res(res.raw) +
ggtitle("Q-Q Plot of Raw Residuals")
plot.fit.res(y.hat, res.pearson) +
ggtitle("Fitted Values vs. Pearson Residuals")
plot.qq.res(res.pearson) +
ggtitle("Q-Q Plot of Pearson Residuals")
plot.fit.res(y.hat, res.qtl) +
ggtitle("Fitted Values vs. Quantile Residuals")
plot.qq.res(res.qtl) +
ggtitle("Q-Q Plot of Quantile Residuals")
In this example, with only 10 observations, it is difficult to see an advantage of using quantile residuals; the benefit will be more apparent in Section . One benefit of raw residuals is that they may be used to compute a mean-squared error.
To access the results of the equidispersion test shown in the output
of cmp.out
, we may use the equitest
accessor
function.
The deviance
function computes the deviance quantities
$D_i = -2 [\log L_i(\hat{\btheta}) - \log
L_i(\tilde{\btheta}_i)]$ for i = 1, …, n, where $L_i(\btheta)$ is the term of the likelihood
corresponding to the ith
observation, $\hat{\btheta}$ is the MLE
computed under the full likelihood $L(\btheta)
= \prod_{i=1}^n L_i(\btheta)$, and $\tilde{\btheta}_i$ is the maximizer of $L_i(\btheta)$.
> deviance(cmp.out)
[1] 0.3376988 -0.5150035 -1.7750671 -0.6863613 -2.2081582 -1.9383878
[7] 2.1710154 -1.1354954 -1.6584238 -2.1772063
The parametric.bootstrap
function carries out a
parametric bootstrap with R
repetitions. Using the fitted MLE $\hat{\btheta}$, bootstrap samples y⃗(r) = (y1(r), …, yn(r))
are drawn from the likelihood $L(\hat{\btheta})$ for r = 1, …, R. Estimate $\hat{\btheta}^{(r)}$ is fitted from
bootstrap sample y⃗(r). An R × d matrix is returned
whose rth row is $\hat{\btheta}^{(r)}$.
> cmp.boot = parametric.bootstrap(cmp.out, reps = 100)
> head(cmp.boot)
(Intercept) transfers (Intercept)
[1,] 12.435045 1.066322 1.620051
[2,] 22.895677 2.768967 2.297103
[3,] 21.989757 2.518793 2.223626
[4,] 26.788708 2.663686 2.416597
[5,] 15.550523 2.011527 1.873008
[6,] 8.892033 1.240461 1.348294
We used R = 100 in the display above to keep vignette computations small, but a larger number may be desired in practice. Bootstrap repetitions can be used, for example, to compute 95% confidence intervals for each of the coefficients.
Large covariates can present numerical difficulties in fitting CMP regression. We will briefly demonstrate the difficulties and some possible workarounds. First let us generate a new dataset based on a large covariate in the regression for λi.
set.seed(1234)
n = 200
x = rnorm(n, 500, 10)
X = cbind(intercept = 1, slope = x)
S = matrix(1, n, 1)
beta_true = c(-0.05, 0.05)
gamma_true = 2
lambda_true = exp(X %*% beta_true)
nu_true = exp(S %*% gamma_true)
y = rcmp(n, lambda_true, nu_true)
Notice that the generated counts y1, …, yn are relatively small compared to the covariate x1, …, xn.
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
471.4 492.3 498.3 499.4 505.5 530.4
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
22.0 27.0 29.0 29.1 31.0 37.0
An initial attempt to fit the true data-generating model fails.
> tryCatch({
+ glm.cmp(y ~ x, formula.nu = ~ 1)
+ }, error = print_warning)
[1] Error in optim(par.init, loglik, method = optim.method, control =
[2] optim.control, : L-BFGS-B needs finite values of 'fn'
Internally, the linked rate parameter λi = exp (β0 + β1xi)
may evaluate to Inf
or become very close to zero as the
optimizer moves β1
away from zero in a positive or negative direction, respectively. Some
possible ways to address this are as follows.
Standardize the covariate to have mean zero and variance one.
> glm.cmp(y ~ scale(x), formula.nu = ~ 1)
CMP coefficients
Estimate SE z-value p-value
X:(Intercept) 26.0617 2.5565 10.1944 2.1e-24
X:scale(x) 0.5253 0.0627 8.3848 5.082e-17
S:(Intercept) 2.0416 0.0980 20.8290 2.365e-96
--
Transformed intercept-only parameters
Estimate SE
nu 7.7031 0.755
--
Chi-squared test for equidispersion
X^2 = 232.4625, df = 1, p-value = 1.7311e-52
--
Elapsed: 0.88 sec Sample size: 200 formula interface
LogLik: -417.9344 AIC: 841.8689 BIC: 851.7638
Optimization Method: L-BFGS-B Converged status: 0
Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH
Use a logarithmic transformation on the covariate.
> glm.cmp(y ~ log(x), formula.nu = ~ 1)
CMP coefficients
Estimate SE z-value p-value
X:(Intercept) -134.3015 17.4934 -7.6773 1.625e-14
X:log(x) 25.8123 3.1597 8.1693 3.103e-16
S:(Intercept) 2.0422 0.1017 20.0764 1.188e-89
--
Transformed intercept-only parameters
Estimate SE
nu 7.7074 0.784
--
Chi-squared test for equidispersion
X^2 = 232.3911, df = 1, p-value = 1.7943e-52
--
Elapsed: 0.88 sec Sample size: 200 formula interface
LogLik: -417.9869 AIC: 841.9738 BIC: 851.8688
Optimization Method: L-BFGS-B Converged status: 0
Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH
Change optimization method or other optim
arguments.
> control = get.control(optim.method = "BFGS", optim.control = list(maxit = 200))
> suppressWarnings({
+ cmp.out = glm.cmp(y ~ x, formula.nu = ~ 1, control = control)
+ print(cmp.out)
+ })
CMP coefficients
Estimate SE z-value p-value
X:(Intercept) 12.1759 0.0467 260.6333 0
X:x -0.0210 NaN NaN NaN
S:(Intercept) -0.6747 NaN NaN NaN
--
Transformed intercept-only parameters
Estimate SE
nu 0.5093 NaN
--
Chi-squared test for equidispersion
X^2 = 254.5417, df = 1, p-value = 2.6567e-57
--
Elapsed: 0.16 sec Sample size: 200 formula interface
LogLik: -937.3960 AIC: 1880.7921 BIC: 1890.6870
Optimization Method: BFGS Converged status: 0
In this case, standardization and logarithmic transformation produce
a usable fit. Changing the optimization method to BFGS
allows the optimization to finish, but there are further numerical
problems in computing the Hessian for standard errors.
Now consider a generated dataset with large outcomes but a relatively small covariate. This situation can also present numerical difficulties.
set.seed(1234)
n = 200
x = runif(n, 1, 2)
X = cbind(intercept = 1, slope = x)
S = matrix(1, n, 1)
beta_true = c(1, 1)
gamma_true = -0.95
lambda_true = exp(X %*% beta_true)
nu_true = exp(S %*% gamma_true)
y = rcmp(n, lambda_true, nu_true)
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.009 1.238 1.486 1.483 1.735 1.999
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
156.0 332.0 606.5 810.5 1161.2 2391.0
An initial attempt to fit the data-generating model fails.
> tryCatch({
+ glm.cmp(y ~ x, formula.nu = ~ 1)
+ }, error = print_warning)
[1] Error in optim(par.init, loglik, method = optim.method, control =
[2] optim.control, : L-BFGS-B needs finite values of 'fn'
Informative starting values help the optimizer to initially make progress. True data-generating parameters will not be available in a real data analysis situation but help to illustrate the idea.
> init = get.init(beta = beta_true, gamma = gamma_true)
> glm.cmp(y ~ x, formula.nu = ~ 1, init = init)
CMP coefficients
Estimate SE z-value p-value
X:(Intercept) 0.9970 0.1050 9.4933 2.238e-21
X:x 1.0029 0.1047 9.5757 1.012e-21
S:(Intercept) -0.9498 0.1044 -9.0936 9.584e-20
--
Transformed intercept-only parameters
Estimate SE
nu 0.3868 0.0404
--
Chi-squared test for equidispersion
X^2 = 133.5665, df = 1, p-value = 6.7968e-31
--
Elapsed: 0.02 sec Sample size: 200 formula interface
LogLik: -1023.2186 AIC: 2052.4373 BIC: 2062.3322
Optimization Method: L-BFGS-B Converged status: 0
Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH
Without choosing an initial value, changing the optimization method
to Nelder-Mead
and increasing the maximum number of
iterations also helps the optimizer find a solution.
> control = get.control(optim.method = "Nelder-Mead", optim.control = list(maxit = 1000))
> glm.cmp(y ~ x, formula.nu = ~ 1, control = control)
CMP coefficients
Estimate SE z-value p-value
X:(Intercept) 0.8460 0.0424 19.9591 1.25e-88
X:x 0.8518 0.0417 20.4230 1.043e-92
S:(Intercept) -1.1133 0.0486 -22.8895 5.915e-116
--
Transformed intercept-only parameters
Estimate SE
nu 0.3285 0.016
--
Chi-squared test for equidispersion
X^2 = 5048.8383, df = 1, p-value = 0.0000e+00
--
Elapsed: 0.04 sec Sample size: 200 formula interface
LogLik: -1024.1845 AIC: 2054.3691 BIC: 2064.2640
Optimization Method: Nelder-Mead Converged status: 0
Note that this solution is different from the previous one; the log-likelihood of the previous one is slightly better.
The couple
dataset (Loeys et al.
2012) was analyzed with ZICMP regression in Sellers and Raim (2016) and found to exhibit
overdispersion. The data concern separation trajectories of n = 387 couples. The variable
UPB
records the number of unwanted pursuit behavior
perpetrations and is considered the outcome of interest. Included
covariates are the binary variable EDUCATION
, which is 1 if
at least a bachelor’s degree was attained, and a continuous variable
ANXIETY
which measures anxious attachment. A zero-inflated
count model is considered for these data because 246 of the 387 records
have an outcome of zero.
Let us load and view the first few records in the dataset.
> data(couple)
> head(couple)
UPB EDUCATION ANXIETY
1 15 0 1.0053
2 0 0 -0.7034
3 0 1 -0.7034
4 3 1 0.6110
5 12 0 0.2167
6 0 1 2.0569
As a preliminary model, let us fit a standard Poisson model $Y_i \indep \text{Poisson}(\lambda_i)$ with
We may use the standard glm
function.
> glm.out = glm(UPB ~ EDUCATION + ANXIETY, data = couple, family = poisson)
> summary(glm.out)
Call:
glm(formula = UPB ~ EDUCATION + ANXIETY, family = poisson, data = couple)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.81695 0.04386 18.628 <2e-16 ***
EDUCATION -0.21579 0.07047 -3.062 0.0022 **
ANXIETY 0.42169 0.03333 12.651 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2478.3 on 386 degrees of freedom
Residual deviance: 2310.8 on 384 degrees of freedom
AIC: 2782.4
Number of Fisher Scoring iterations: 6
Now consider a ZICMP regression with We use the
glm.cmp
function as follows.
> zicmp0.out = glm.cmp(UPB ~ EDUCATION + ANXIETY,
+ formula.nu = ~ 1,
+ formula.p = ~ EDUCATION + ANXIETY,
+ data = couple)
> print(zicmp0.out)
ZICMP coefficients
Estimate SE z-value p-value
X:(Intercept) -0.1604 0.0189 -8.4843 2.171e-17
X:EDUCATION -0.0678 0.0325 -2.0849 0.03708
X:ANXIETY 0.0226 0.0142 1.5923 0.1113
S:(Intercept) -11.0963 59.4424 -0.1867 0.8519
W:(Intercept) 0.4176 0.1599 2.6119 0.009005
W:EDUCATION -0.3863 0.2677 -1.4433 0.1489
W:ANXIETY -0.5234 0.1335 -3.9200 8.856e-05
--
Transformed intercept-only parameters
Estimate SE
nu 0 9e-04
--
Chi-squared test for equidispersion
X^2 = 350.5662, df = 1, p-value = 3.1903e-78
--
Elapsed: 6.42 sec Sample size: 387 formula interface
LogLik: -627.1675 AIC: 1268.3350 BIC: 1296.0440
Optimization Method: L-BFGS-B Converged status: 0
Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH
There are now three sets of coefficients reported in the output: the
X:
, S:
, and W:
prefixes label
estimates for the λi, νi, and pi formulas
respectively.
Our attempt to fit the previous model strongly tended to the Zero-Inflated Geometric special case of ZICMP, but SEs computed via the Hessian become large in this region. In this section, we fix γ0 at the extreme −∞ and fit the remaining coefficients. Let us use the raw interface to do this.
init = coef(zicmp0.out, type = "list")
y = couple$UPB
X = model.matrix(~ EDUCATION + ANXIETY, data = couple)
S = model.matrix(~ 1, data = couple)
W = model.matrix(~ EDUCATION + ANXIETY, data = couple)
control = get.control(optim.method = "BFGS")
zicmp.out = glm.zicmp.raw(y, X, S, W,
init = get.init(beta = c(-1,0,0), gamma = -Inf, zeta = c(-1,0,0)),
fixed = get.fixed(gamma = 1L), control = control)
> print(zicmp.out)
ZICMP coefficients
Estimate SE z-value p-value Fixed
X:(Intercept) -0.1605 0.0188 -8.5432 1.305e-17 F
X:EDUCATION -0.0678 0.0325 -2.0837 0.03718 F
X:ANXIETY 0.0226 0.0142 1.5896 0.1119 F
S:(Intercept) -Inf NA NA NA T
W:(Intercept) 0.4172 0.1599 2.6090 0.009082 F
W:EDUCATION -0.3863 0.2678 -1.4428 0.1491 F
W:ANXIETY -0.5245 0.1336 -3.9261 8.632e-05 F
--
Transformed intercept-only parameters
Estimate SE
nu 0 0
--
Some elements of gamma were fixed. Chi-squared test for equidispersion not defined.
--
Elapsed: 1.28 sec Sample size: 387 raw interface
LogLik: -627.1672 AIC: 1266.3345 BIC: 1290.0850
Optimization Method: BFGS Converged status: 0
Notice that an additional Fixed
column has been added to
the display, indicating that the coefficient gamma
is
fixed. Furthermore, its Estimate
column is set to the
initial value and the columns SE
, z-value
, and
p-value
are set to NA
. This model achieves a
similar log-likelihood value as our first attempt using
L-BFGS-B
but does not exhibit signs of numerical
issues.
Here are several of the accessors provided to extract model outputs.
> logLik(zicmp.out) ## Log-likelihood evaluated at MLE
[1] -627.1672
> AIC(zicmp.out) ## AIC evaluated at MLE
[1] 1266.334
> BIC(zicmp.out) ## BIC evaluated at MLE
[1] 1290.085
> coef(zicmp.out) ## Estimates of theta as a flat vector
X:(Intercept) X:EDUCATION X:ANXIETY S:(Intercept) W:(Intercept)
-0.16047147 -0.06776507 0.02257241 -Inf 0.41722279
W:EDUCATION W:ANXIETY
-0.38633736 -0.52450590
> coef(zicmp.out, type = "list") ## Estimates of theta as a named list
$beta
X:(Intercept) X:EDUCATION X:ANXIETY
-0.16047147 -0.06776507 0.02257241
$gamma
S:(Intercept)
-Inf
$zeta
W:(Intercept) W:EDUCATION W:ANXIETY
0.4172228 -0.3863374 -0.5245059
> vcov(zicmp.out) ## Estimated covariance matrix of theta hat
X:(Intercept) X:EDUCATION X:ANXIETY W:(Intercept)
X:(Intercept) 0.0003528180 -2.909813e-04 -1.246030e-04 0.0005767772
X:EDUCATION -0.0002909813 1.057615e-03 2.583975e-05 -0.0005084108
X:ANXIETY -0.0001246030 2.583975e-05 2.016361e-04 -0.0001489969
W:(Intercept) 0.0005767772 -5.084108e-04 -1.489969e-04 0.0255741783
W:EDUCATION -0.0005266101 2.146314e-03 7.074582e-05 -0.0255267345
W:ANXIETY -0.0001695705 1.830298e-04 3.848049e-04 -0.0006753149
W:EDUCATION W:ANXIETY
X:(Intercept) -5.266101e-04 -0.0001695705
X:EDUCATION 2.146314e-03 0.0001830298
X:ANXIETY 7.074582e-05 0.0003848049
W:(Intercept) -2.552673e-02 -0.0006753149
W:EDUCATION 7.169741e-02 0.0009802763
W:ANXIETY 9.802763e-04 0.0178471340
> sdev(zicmp.out) ## Standard deviations from vcov(...) diagonals
X:(Intercept) X:EDUCATION X:ANXIETY W:(Intercept) W:EDUCATION
0.01878345 0.03252099 0.01419986 0.15991929 0.26776372
W:ANXIETY
0.13359317
> sdev(zicmp.out, type = "list") ## Standard deviations as a named list
$beta
[1] 0.01878345 0.03252099 0.01419986
$gamma
[1] NA
$zeta
[1] 0.1599193 0.2677637 0.1335932
> equitest(zicmp0.out) ## Likelihood ratio test for H_0: gamma = 0
$teststat
[1] 350.5662
$pvalue
[1] 3.190326e-78
$df
[1] 1
> tryCatch({ ## An error is thrown for model with fixed gamma
+ equitest(zicmp.out)
+ }, error = print_warning)
[1] Error in equitest.zicmpfit(zicmp.out): Some elements of gamma were
[2] fixed, chi-squared test for equidispersion not defined
Because we fixed γ = −∞ to
obtain zicmp0.out
, the equitest
function
throws an error instead of proceeding with an equidispersion test.
The predict
function behaves similarly as in CMP
regression; however, the link
type here also includes a
column with the estimated pi.
> y.hat = predict(zicmp.out) ## Fitted values based on ecmp
> link.hat = predict(zicmp.out, type = "link")
> head(y.hat)
[1] 3.570765 1.622937 1.451592 2.391102 2.522830 3.713332
> head(link.hat)
lambda nu p
1 0.8712909 0 0.4725120
2 0.8383254 0 0.6870063
3 0.7833983 0 0.5986451
4 0.8069894 0 0.4281048
5 0.8559186 0 0.5753131
6 0.8337620 0 0.2596150
In this example, we can see the benefit of using quantile residuals
rather than raw residuals for diagnostic plots. The functions
plot.fit.res
and plot.qq.res
have been defined
in Section .
res.raw = residuals(zicmp.out, type = "raw")
res.qtl = residuals(zicmp.out, type = "quantile")
plot.fit.res(y.hat, res.raw) +
ggtitle("Fitted Values vs. Raw Residuals")
plot.qq.res(res.raw) +
ggtitle("Q-Q Plot of Raw Residuals")
plot.fit.res(y.hat, res.qtl) +
ggtitle("Fitted Values vs. Quantile Residuals")
plot.qq.res(res.qtl) +
ggtitle("Q-Q Plot of Quantile Residuals")
Here is an example of computing fitted values for new covariate data.
new.df = data.frame(EDUCATION = round(1:20 / 20), ANXIETY = seq(-3,3, length.out = 20))
X.new = model.matrix(~ EDUCATION + ANXIETY, data = new.df)
S.new = model.matrix(~ 1, data = new.df)
W.new = model.matrix(~ EDUCATION + ANXIETY, data = new.df)
new.data = get.modelmatrix(X.new, S.new, W.new)
# For model fit using raw interface, use get.modelmatrix to prepare new design
# matrices, offsets, etc
y.hat.new = predict(zicmp.out, newdata = new.data)
# For models fit with the formula interface, pass a data.frame with the same
# structure as used in the fit.
y.hat.new = predict(zicmp0.out, newdata = new.df)
> print(y.hat.new)
[1] 0.4699009 0.5622034 0.6711343 0.7991174 0.9487439 1.1227245 1.3238518
[8] 1.5549320 1.8187433 2.1180077 2.0359259 2.2797300 2.5382951 2.8108724
[15] 3.0968671 3.3959397 3.7081289 4.0339804 4.3745604 4.7316162
As with CMP regression, a parametric.bootstrap
function
is provided for convenience to obtain a bootstrap sample $\hat{\btheta}^{(r)}$ of $\btheta$ based on the estimate $\hat{\btheta}$. Because it is too time
consuming to run this example within the vignette, we show the code
without output below. As in Section , we consider using the bootstrap
samples to construct a 95% confidence interval for each of the
coefficients.
We acknowledge Thomas Lotze for significant contributions to the
initial development of the COMPoissonReg
package and for
service as initial maintainer on CRAN. We thank Darcy Steeg Morris, Eric
Slud, and Tommy Wright for reviewing the manuscript. We are grateful to
the users of COMPoissonReg
for their interest and for
bringing to light some of the issues which have been considered in this
work.
[email protected], Center for Statistical Research
& Methodology, U.S. Census Bureau, Washington, DC, 20233, U.S.A.
Disclaimer:
This document is released to
inform interested parties of ongoing research and to encourage
discussion of work in progress. Any views expressed are those of the
authors and not those of the U.S. Census Bureau. ↩︎
[email protected], Center for Statistical Research & Methodology, U.S. Census Bureau and Department of Mathematics and Statistics, Georgetown University, Washington, DC, 20057, U.S.A. ↩︎
These bounds are also stated at https://en.wikipedia.org/wiki/Stirling\%27s_approximation, last accessed 2022-10-09.↩︎
Comments about Results
The AIC of the ZICMP model is drastically smaller than the Poisson model, indicating a greatly improved fit. However, there are signs of possible numerical issues. The estimate for γ0 is a large negative number, but with an extremely large associated SE, which suggests that the effect may not be statistically significant. On the other hand, the estimate of ν is nearly zero with a small SE, which suggests that the dispersion parameter is indeed statistically significant. On the surface, this seems to be a contradiction.
The issue is that the Hessian of the log-likelihood becomes insensitive to small changes in γ0 when λi < 1 and γ0 is a large negative number. Let us first verify that the estimates for λi are indeed smaller than 1.
To show the insensitivity of the Hessian, let us consider a simpler setting with Y ∼ CMP(λ, ν), λ = exp {−0.25} ≈ 0.7788 fixed, and ν = exp {γ0}. We then have log-density with first derivative and second derivatives, respectively, For a given value of y, −eγ0log (y!) approaches zero as γ0 decreases. Therefore, let us focus on the function g(γ0) = −log Z(λ, eγ0) and its first and second derivatives. The following code illustrates their behavior.
Here is the result.
Notice that g(γ0) approaches a limit as γ0 → −∞, which coincides with the CMP distribution approaching a Geometric distribution. It may not be surprising that the first and second derivatives approach zero accordingly. This explains the large SE for γ0 in the model
zicmp0.out
. With estimates tending to this region of the parameter space, it may be preferable to fix fix γ0 at a value such as −∞, which will be done in the next section.