Introduction
Quantal responses are counts of all-or-none responses, out of a total \(n\). A dose-response relationship quantifies the response as a function of dose, or more generally as a function of an exposure.
Data are from a broad class of experiments where responses are insect deaths out of some total number exposed. Exposure may be time in coolstorage, a dose of a fumigant, a concentration-time measure of exposure to a fumigant, or an intensity-time measure of exposure to radiation. See Follett and Neven (2006) for commentary on the regulatory and scientific background. We will use Hawaiian fruitfly data, supplied by Dr Peter Follett, to demonstrate the use of functions in the qra package. Here the exposure is time in coolstorage.
The following code sets up the data.
HawCon <- qra::HawCon
## Change name "CommonName" to "CN", for more compact output.
CCnum <- match("CommonName", names(HawCon))
names(HawCon)[CCnum] <- "CN"
## trtGp will identify species & lifestage combination
## trtGpRep will identify species, lifestage, and rep
## cTime is centered version of TrtTime
## scTime is centered and scaled version of TrtTime,
## needed to get some mixed model fits to converge
HawCon <- within(HawCon, {
  trtGp <- factor(paste0(CN, LifestageTrt))
  trtGpRep <- paste0(CN, LifestageTrt, ":", RepNumber)
  scTime <- scale(TrtTime)
  obs <- factor(1:nrow(HawCon))
})
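The combination of within() and scale() used above can be illustrated on a small artificial data frame (the toy data here are not from the HawCon dataset):

```r
## Toy illustration (artificial data, not the HawCon dataset)
df <- data.frame(TrtTime = c(2, 4, 6, 8, 10))
df <- within(df, {
  scTime <- as.numeric(scale(TrtTime))  # centered and scaled
  obs <- factor(1:nrow(df))             # observation-level identifier
})
mean(df$scTime)  # essentially 0 after centering
sd(df$scTime)    # 1 after scaling
```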
1 Data setup and choice of model
For the data that will be considered here, the exposure measure is time in coolstorage, and the response is mortality of insect pests.
1.1 Graphical Display
Figure 1.1 is designed to give a broad indication of the pattern of response, on a complementary log-log link scale. Responses appear acceptably linear, at least after the first one or two observations. There are clear systematic differences between replicates, indicative of a strong between-replicate component of variation.
1.2 The choice of link function
Commonly used link functions are:

logit
: With the logit link, the model is predicting the \(\log(\mbox{odds})\) of response. Except for datasets that have very large numbers of observations at exposures that lead to very low or very high mortality, it is difficult to distinguish from the probit.

probit
: This link can be motivated by assuming the presence of a normally distributed random variable that generates a response whenever its value crosses a threshold.

complementary log-log, abbreviated to cloglog
: This arises naturally as an extreme value distribution when, for insect mortality, death arises from the failure of any one of a number of organs.
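The differences between these links show mainly in the tails. A quick base-R comparison of the inverse link functions, using plogis() for the logit, pnorm() for the probit, and 1-exp(-exp(x)) for the complementary log-log:

```r
## Inverse link functions, evaluated on a common linear predictor scale
invlogit <- function(x) plogis(x)
invprobit <- function(x) pnorm(x)
invcloglog <- function(x) 1 - exp(-exp(x))

x <- c(-2, 0, 2)
round(cbind(logit = invlogit(x),
            probit = invprobit(x),
            cloglog = invcloglog(x)), 3)
## The cloglog is asymmetric: it approaches 1 much faster than it approaches 0
```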
While later analyses will suggest a mild preference for a complementary log-log link over a logit link, the difference is small. The choice does matter, however, for extrapolation to mortality values (commonly 99% or greater) that lie close to or beyond the limits of the data. The fitted models should, for this purpose, be treated as providing an indication of the broad pattern of response.
The possible use of a transformation for the exposure variable (or variables), commonly a logarithmic transformation, gives further flexibility. One or other choice within the range of possibilities noted has often been found to work well in practice, at least to the extent that the model survives scrutiny under standard diagnostic checks. Where the model will be used to make predictions beyond the limits of the main body of the data, these choices add further uncertainty.
1.3 Modeling the error distribution
With data such as these, it is to be expected that the response will show strong extra-binomial variation. This happens because the response varies from insect to insect, and/or because insects do not respond independently. The glmmTMB package (Brooks et al. 2017) implements the betabinomial error family, with the option to model the scale parameter, and hence the multiplier for the binomial variance, as a function of explanatory variable(s). In the terminology used for R's glm() function, which will be followed in the sequel, the multiplier for the binomial variance is referred to as the dispersion factor. For the data considered here, this dispersion factor is high in the middle of the range of data, reducing at the extremes.
An alternative that will be investigated in a later section is the use of binomial errors, with observation-level random effects used to account for differences at the level of individual observations. For the present data, this automatically achieves much the same effect as betabinomial errors, without the need for specific attention to the modeling of a dispersion factor. The model fit appears, however, to be less satisfactory than the use of betabinomial errors with a dispersion factor adjustment.
Other error models that are described in the literature, and that operate at the level of individual observations, are discussed in the vignette vignette("countDists", package = "qra").
1.4 Choices required for mixed model fits
Fits will require a choice of link functions, modeling of the fixed effects, and modeling of the error distribution. Because there are just three replicates per lifestage, it is necessary to base estimates on a model that brings together components of variance information across lifestages. This inevitably leads to estimates that will sometimes smooth out effects that are specific to an individual lifestage.
The subsection that follows will explore the use of a betabinomial error model, as implemented in the glmmTMB package. The parameterization is that described in Morris (1997), with parameters \(\mu\) and \(\phi\). Here \(\mbox{E}[x] = n \mu\), and \[ \mbox{var}[x] = n \mu (1-\mu) \dfrac{\phi+n}{\phi+1} \] Setting \(\phi = \frac{1-\rho}{\rho}\), where \(\rho\) is the intra-class correlation, this equals \[ n \mu(1-\mu)(1+(n-1)\rho) \]
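The equivalence of the two forms of the variance multiplier can be checked numerically (a minimal sketch in base R, with illustrative values of \(\rho\) and \(n\)):

```r
## Betabinomial variance multiplier, in the two equivalent forms
## (phi + n)/(phi + 1) and 1 + (n-1)*rho, with phi = (1-rho)/rho
rho <- 0.2
phi <- (1 - rho)/rho
n <- 50
mult1 <- (phi + n)/(phi + 1)
mult2 <- 1 + (n - 1)*rho
all.equal(mult1, mult2)  # TRUE
```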
Quasibinomial errors
The lme4::glmer() function offers the option of a quasibinomial error, as for stats::glm(). It does not offer an equivalent to the glmmTMB dispformula.
Specification of a quasibinomial error has the consequence that the model is fitted as for a binomial distribution, with the binomial variance \(n \pi (1- \pi)\) then multiplied by a constant factor \(\Phi\) that is usually estimated using the Pearson chi-squared statistic. Compare this with the betabinomial, where the multiplier is \(\Phi = 1+(n-1)\rho\), i.e., it increases with \(n\). This is an important difference.
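The practical consequence can be seen by tabulating the two multipliers over a range of total counts \(n\) (a sketch, with an illustrative \(\rho\), and a quasibinomial \(\Phi\) chosen to match the betabinomial multiplier at \(n = 50\)):

```r
rho <- 0.1
n <- c(10, 50, 200)
bb <- 1 + (n - 1)*rho             # betabinomial: grows with n
qb <- rep(1 + (50 - 1)*rho, 3)    # quasibinomial: one constant factor
cbind(n, betabinomial = bb, quasibinomial = qb)
```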
1.5 Complementary log-log versus logit link
Figure 1.2 shows lines and curves from alternative models that have been fitted to the data.
One has to experiment to get these models to fit. Some of the possibilities are:

- Stay with default control parameters, or try an alternative, such as
  ctl <- glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS"))
- Replace TrtTime by scTime, the centered and scaled version, in some or all of
  - the random effects term(s) in the model formula
  - the fixed effects terms
  - the dispersion formula (dispformula).
- As will be shown later, the dispersion formula for the HawCon data needs to be able to fit a curve that has a roughly hill shape. Degree 2 normal splines (splines::ns(x,2), where in our case x=scTime) appear to work reasonably well.
- Use of update() to update a simpler model (e.g., to add the degree 2 term in models that added a curvature term to the straight line in the fixed effect) may avoid warning messages or failures that otherwise result.
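The effect of update() on a model formula can be sketched with update.formula() alone, without fitting any model (base R plus the splines package; the formulas mirror those used below):

```r
library(splines)
form <- cbind(Dead, Live) ~ 0 + trtGp/TrtTime + (1 | trtGpRep)
## Add a degree 2 spline curvature term to the fixed effects
form2 <- update(form, . ~ . + ns(scTime, 2))
form2
```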
Normal spline curves, or polynomial curves, may be used to model the fixed effects, as alternatives to lines. This can be compared with fitting a line, in order to check for curvature in the response.
If glmer() (from the lme4 package) is used in place of glmmTMB(), the lme4 function allFit() can be used to compare results between a range of available optimizers.
Both the complementary log-log model and the logit model appear to fit well, as judged by examining plots of randomized quantile residuals.
Details of the model fitting process
Code that fits the several models is:
# Load packages that will be used
suppressMessages({library(lme4); library(splines)})
form <- cbind(Dead,Live)~0+trtGp/TrtTime+(scTime|trtGpRep)
form2s <- cbind(Dead,Live)~0+trtGp/TrtTime+ns(scTime,2)+(scTime|trtGpRep)
HCbb.cll <- glmmTMB::glmmTMB(form, dispformula=~trtGp+ns(scTime,2),
                             family=glmmTMB::betabinomial(link="cloglog"),
                             data=HawCon)
HCbb2s.cll <- update(HCbb.cll, formula=form2s)
HCbb.logit <- glmmTMB::glmmTMB(form, dispformula=~trtGp+ns(TrtTime,2),
                               family=glmmTMB::betabinomial(link="logit"),
                               data=HawCon)
HCbb2s.logit <- update(HCbb.logit, formula=form2s)
The right hand side of the model formula divides into two parts:

- The first part, i.e., 0+trtGp/TrtTime, expands to 0 + trtGp + trtGp:TrtTime. This specifies a different constant term and a different slope, and thus a different line, for each treatment group.
  - If the 0 is omitted, so that this initial part of the formula reduces to trtGp/TrtTime, all that changes is the parameterization. There is then an overall constant term, with treatment group effects expressed as differences from the overall constant term.
- The round brackets that enclose the remainder of the right hand side of the formula, i.e., (scTime|trtGpRep), identify it as supplying random effects terms. Here, a different random effects line is added for each replicate (trtGpRep). This is achieved by fitting a random intercept and a random slope for each replicate.
  - Note that scTime is by default interpreted as 1 + scTime.
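The expansion of 0+trtGp/TrtTime can be checked with model.matrix() on a toy factor (artificial data; the column names will differ from those in the HawCon fit):

```r
## Toy data: two groups, a numeric exposure
d <- data.frame(trtGp = factor(rep(c("A", "B"), each = 3)),
                TrtTime = rep(1:3, 2))
## A separate intercept and a separate slope column for each group
colnames(model.matrix(~ 0 + trtGp/TrtTime, data = d))
```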
In addition, all models use a dispformula to allow for change in the betabinomial scale parameter \(\phi\). The dispformula used, allowing a different degree 2 function of scTime for each treatment group, was chosen after some experimentation.
1.6 Fitted lines vs fitted normal spline curves
Figure 1.2 shows fitted lines, and fitted degree 2 normal spline curves, in Panel A for a complementary log-log (cloglog) link, and in Panel B for a logit link.
1.7 Diagnostic checks
Randomized quantile residuals, with the plots that are available
in the DHARMa
provide helpful diagnostic check. For any
residual, the corresponding quantile residual is the proportion
of residuals expected, under model assumptions, to be less than
or equal to it. If the distributional assumptions are satisfied,
the quantile residuals should have a distribution that differs
only by statistical error from a uniform distribution.
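The idea can be sketched directly: simulate replicate "observations" from the assumed distribution, and record the proportion that fall at or below the observed value (a toy illustration with a single binomial observation; the numbers are illustrative, and DHARMa's randomization within ties is omitted):

```r
set.seed(17)
n <- 60; p <- 0.4
y.obs <- 21                              # an observed count
sims <- rbinom(250, size = n, prob = p)  # simulated reference distribution
## Quantile residual: proportion of simulated values <= the observed value
qres <- mean(sims <= y.obs)
qres  # should be near pbinom(21, 60, 0.4)
```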
The function DHARMa::simulateResiduals() provides a convenient means to simulate enough sets of residuals (by default, 250) to give a good sense, for each individual observation, of the distribution. These then provide a reference distribution for the calculation of quantile residuals. Residuals may be calculated allowing only for the fixed effects (the default), or conditional on one or more levels of random effects. If the model is correct, residuals should be uniformly distributed irrespective of the conditioning. See ?DHARMa::simulateResiduals for details.
- For the data as a whole, the distribution of residuals can be checked by plotting the quantile residuals against the corresponding uniform quantiles.
  - Departures from assumptions show a pattern of difference from the line y=x that is different from that for normal distribution quantile-quantile plots.
- A second check plots quantile residuals against quantiles of predicted values. Quantile regression is then used to fit curves at the 25%, 50%, and 75% quantiles of the quantile residuals. If the model is correctly specified, these should all be, to within statistical error, horizontal lines.
- Plots against other explanatory variables provide added checks.
Do such deviations from assumptions as are present matter? A useful device is to simulate new 'observations' from the model, and check whether there is a difference of substance in the fitted values and parameter estimates.
Figure 1.3 shows the diagnostic plots for the linear model with a complementary log-log link.
The quantile-quantile (Q-Q) plot looks fine. The quantile residuals from the data appear, if anything, closer to uniformly distributed than any of the simulated sets of residuals. In Panels B and C, the quartiles of the data appear satisfactorily close to the relevant quartiles.
Now compare (Figure 1.4) scaled residuals between treatment groups.
The numbers for MelonEgg, MedL1, and MelonL1 are too small to give useful boxplot displays. Except for MedEgg, where points are concentrated in the tails, the scatter of points in Panel A appears reasonably comparable between treatment groups.
QQ plots look good for all the models. In the sequel, they will be left out.
Uniform quantile-quantile plots — an example.
We will generate data from a highly overdispersed binomial-type distribution, then examine the uniform quantile-quantile plot given by DHARMa::plotResiduals(). An easy way to generate overdispersed binomial-type data is to start with binomial data, then multiply both "successes" and "failures" by a number that is substantially greater than one.
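A sketch of this construction, checking the variance inflation directly (base R; the multiplier of 5 and other numbers are illustrative):

```r
set.seed(31)
n <- 40; p <- 0.5; mult <- 5
y <- rbinom(500, size = n, prob = p)
## Multiply successes and failures by `mult`: totals become n*mult,
## but counts now vary in steps of `mult`, inflating the variance
succ <- y*mult; tot <- n*mult
phat <- succ/tot
## Under a binomial model the variance of phat would be p*(1-p)/tot;
## here it is roughly `mult` times larger
var(phat)/(p*(1 - p)/tot)
```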
AIC-based model comparisons
Comparisons should be for models with the same outcome variable, and with the same data. In addition to the models described so far, we will also include models, to be discussed below, that assume binomial errors with observation-level random effects added. The model with the lowest AIC is the preferred model.
The following shows AIC values:
bb.cll bb2s.cll bb.logit bb2s.logit biObs.cll biObs.logit
df 27.00 29 27.00 29.00 20.00 20.00
AIC 710.01 NA 721.68 720.02 719.33 721.29
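Differences from the smallest AIC are often easier to read than the raw values (a sketch using the values in the table above; the NA is the fit that failed):

```r
aic <- c(bb.cll = 710.01, bb2s.cll = NA, bb.logit = 721.68,
         bb2s.logit = 720.02, biObs.cll = 719.33, biObs.logit = 721.29)
round(aic - min(aic, na.rm = TRUE), 2)
```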
There is a strong preference for a complementary log-log link, with the linear fit slightly better than the degree 2 normal spline fit, arguing for the use of the fitted lines for calculation of LT99s and confidence intervals.
1.8 Estimates of \(\rho\), and of the dispersion factor
Figures 1.7A and B (for a logit link) plot the estimates of \(\rho\), based on modeling the logarithm of the scale parameter as a degree 2 natural spline function of scTime that is added to a straight line response that is different for each treatment group. Use of a logarithmic link (the default) has the consequence that effects that are additive on the link scale become multipliers when unlogged.
Figure 1.7C shows, for the complementary log-log model, the dispersion factors that result: high at midrange times and mortalities, reducing to close to a constant 1.0 at either extreme. Use of a normal spline basis helps ensure that the value of \(\rho\), and hence the dispersion factor, extrapolates to a close-to-constant value at both extremes.
A dispersion factor that is close to 1.0 at the upper extreme of the data provides a limited justification for assuming a binomial distribution, for purposes of calculating the sample size needed to demonstrate, at a 95% confidence level, a mandated survival rate of, e.g., no more than one in perhaps 100,000 insects.
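Under a binomial assumption, that sample size calculation can be sketched as follows: if all of \(n\) insects die, the one-sided upper 95% confidence bound for the survival probability \(p\) satisfies \((1-p)^n = 0.05\) (the numbers here are illustrative):

```r
## Smallest n for which, with zero survivors observed, the one-sided
## upper 95% bound on the survival probability is below 1 in 100,000
p <- 1e-5
n <- ceiling(log(0.05)/log(1 - p))
n                    # a little under 300,000
(1 - p)^n <= 0.05    # check: zero survivors is then unlikely under p
```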
2 Binomial errors, plus observation level random effects
The suggestion here is that for the replicates within each species/lifestage combination, the response in any one observation is close to binomial. Added to this is random variation between observations. By comparison with fitting a betabinomial or a quasibinomial error, the effect is to reduce the variance of the observed mortalities at low and high mortality points, relative to variance at midrange values.
The following fits the two models:
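The fitting code is not reproduced in this extract. The intended model structure, with the observation-level factor obs entering as an additional random effect, can be sketched as follows (the object names and commented calls are illustrative, not the source's code):

```r
## Sketch (assumed form): binomial errors plus an observation-level
## random effect (1|obs); model names here are illustrative
formObs <- cbind(Dead, Live) ~ 0 + trtGp/TrtTime + (1 | trtGpRep) + (1 | obs)
## HCbiObs.cll   <- glmmTMB::glmmTMB(formObs, family = binomial(link = "cloglog"),
##                                   data = HawCon)
## HCbiObs.logit <- glmmTMB::glmmTMB(formObs, family = binomial(link = "logit"),
##                                   data = HawCon)
formObs
```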
Notice that the random effects term (scTime|trtGpRep) has been changed to (1|trtGpRep). This avoids convergence failure messages.
The AIC statistic shows a slight preference for the complementary log-log model. The comparison with data that have been simulated from the respective fitted distributions indicates that the distinction is far from clear.
The attempt to repeat a similar comparison with models that assume a betabinomial error failed. Most simulations generated error messages.
Figure 2.2 compares the diagnostic plots, between use of a complementary log-log link, and use of a logit link.