HLMdiag: a diagnostic tool for hierarchical (multilevel) linear models

The package HLMdiag was created in order to provide a unified framework for analysts to diagnose hierarchical linear models (HLMs). When HLMdiag was created in 2014, it made diagnostic procedures available for HLMs that had not yet been implemented in statistical software. Over the past 6 years, other packages have gradually implemented some of these procedures; however, diagnosing a model still often requires multiple packages, each of which has its own syntax and quirks. HLMdiag provides diagnostic tools targeting all aspects and levels of hierarchical linear models in a single package. In this update, HLMdiag focuses on usability in its functions. Drawing inspiration from the augment() function in the broom package, the new functions hlm_resid(), hlm_influence(), and hlm_augment() append diagnostic information to a model’s data frame, allowing the analyst to easily work with and see how residuals and influence diagnostics relate to the original variables.

This vignette introduces new functions concerning residual diagnostics in HLMdiag, specifically hlm_resid(), pull_resid(), and hlm_augment(). The main focus of this vignette, hlm_resid(), is intended to replace HLMresid() in functionality, and works more robustly with 3-level models as well as models fit with nlme. Additionally, hlm_resid() appends residuals and fitted values to a model’s data frame in the form of a tibble. Tibbles help the analyst diagnose models by simplifying the creation of diagnostic plots, especially within ggplot2, and working seamlessly with sorting functions in dplyr.

For information about influence diagnostics such as Cook’s distance and leverage in HLMs, the reader should see the vignette for hlm_influence().

hlm_resid() function

The function hlm_resid() takes a hierarchical linear model fit as a lmerMod or lme object and extracts residuals and predicted values from the model using both Least Squares (LS) and Empirical Bayes (EB) methods in estimating parameters. Whereas the old function, HLMresid(), would return a vector with a single type of residual, hlm_resid() appends specified residuals to the data frame of the model in the form of a tibble. The use of a tibble allows the analyst to easily plot residuals against explanatory variables and identify possible outliers or model deficiencies. The function draws heavy inspiration from the augment function of the broom package, however offers more options for the types of residuals that the analyst may want to use.

Types of residuals in multilevel models

The presence of multiple sources of variability in hierarchical linear models results in numerous quantities defining residuals. All functions in HLMdiag follow the classification by Hilden-Minton (1995) and define three types of residuals:

level-1 (conditional) residuals
higher-level (random effects) residuals
marginal (composite) residuals

The definitions of level-1 and higher-level residuals lead to different residuals depending on how the fixed and random model coefficients are estimated. hlm_resid() implements two estimation methods: Least Squares estimation and Empirical Bayes estimation. For a comprehensive discussion of these two methods and how they are implemented, we direct the reader to Loy and Hoffman (2014).

LS and EB residuals each have their own advantages and disadvantages. Level-1 LS residuals are calculated by fitting separate linear models to each group and using LS to estimate the fixed and random effects. Because they rely only on the first level of the hierarchy, they are unconfounded with higher-level residuals and a useful first step in diagnosing a model; however, for small group sample sizes, LS residuals are unreliable. Level-1 EB residuals are defined as the conditional modes of the random effects given the data and the estimated parameter values, which are calculated with (restricted) maximum likelihood. They are confounded with higher-level residuals, but more robust with small sample sizes. For higher-level residuals, coefficients can again be estimated using either LS or EB, but EB residuals are generally preferred due to small group sizes.

Example Data

To illustrate the use of hlm_resid(), we make use of data on exam scores of 4,059 students in 65 inner London schools. This data set is distributed as part of the R package mlmRev (Bates, Maechler, Bolker 2013), which makes well known multilevel modeling data sets available in R.

data("Exam", package = "mlmRev")
head(Exam)
#>   school   normexam schgend    schavg      vr     intake   standLRT sex type
#> 1      1  0.2613242   mixed 0.1661752 mid 50% bottom 25%  0.6190592   F  Mxd
#> 2      1  0.1340672   mixed 0.1661752 mid 50%    mid 50%  0.2058022   F  Mxd
#> 3      1 -1.7238820   mixed 0.1661752 mid 50%    top 25% -1.3645760   M  Mxd
#> 4      1  0.9675862   mixed 0.1661752 mid 50%    mid 50%  0.2058022   F  Mxd
#> 5      1  0.5443412   mixed 0.1661752 mid 50%    mid 50%  0.3711052   F  Mxd
#> 6      1  1.7348992   mixed 0.1661752 mid 50% bottom 25%  2.1894372   M  Mxd
#>   student
#> 1     143
#> 2     145
#> 3     142
#> 4     141
#> 5     138
#> 6     155

For each student, the data consist of their gender (sex) and two standardized exam scores— an intake score on the London Reading Test (LRT) at age 11 (standLRT) and a score on the General Certificate of Secondary Education (GCSE) examination at age 16 (normexam). Additionally, the students’ LRT scores were used to segment students into three categories (bottom 25%, middle 50%, and top 25%) based on their verbal reasoning subscore (vr) and overall score (intake). At the school level, the data contain the average intake score for the school (schavg) and type based on school gender (schgend, coded as mixed, boys, or girls).

We illustrate usage for hlm_resid() below using the exam data and fitting an initial two-level HLM using a student’s standardized London Reading Test intake score (standLRT) to explain their GCSE exam score, allowing for a random intercept for each school:

fm1 <- lme4::lmer(normexam ~ standLRT + (1 | school), Exam, REML = FALSE)
fm1
#> Linear mixed model fit by maximum likelihood  ['lmerMod']
#> Formula: normexam ~ standLRT + (1 | school)
#>    Data: Exam
#>       AIC       BIC    logLik -2*log(L)  df.resid 
#>  9365.243  9390.478 -4678.622  9357.243      4055 
#> Random effects:
#>  Groups   Name        Std.Dev.
#>  school   (Intercept) 0.3035  
#>  Residual             0.7522  
#> Number of obs: 4059, groups:  school, 65
#> Fixed Effects:
#> (Intercept)     standLRT  
#>    0.002391     0.563371

hlm_resid() usage

hlm_resid() takes 5 arguments:

object - the model fit as a lmerMod or lme object.
level - the level at which residuals should be extracted. By default, level = 1, returning both EB and LS level-1 residuals as well as marginal residuals. level can also be set to a grouping factor as defined in names(object@flist) in lme4 or in names(object$groups) in nlme. When set to a grouping factor, hlm_resid() returns both EB and LS higher-level residuals, otherwise known as the random effects for each group.
standardize - a logical indicating if the returned residuals should be standardized. By default, standardize = FALSE, returning unstandardized residuals. When standardize = TRUE, residuals are divided by the estimated standard error. For marginal residuals, when standardize = TRUE, the Cholesky residuals are returned.
include.ls - a logical indicating if LS residuals should be returned. By default, include.ls = TRUE. The LS method of estimating residuals is more computationally intensive than the EB method. Consequently, setting include.ls = FALSE substantially decreases the runtime on models with many observations.
data - only necessary if na.action = na.exclude for a lmerMod model object. data should specify the data frame used to fit the model containing observations with NA values.

General usage

To extract unstandardized level-1 residuals and marginal residuals, we use the following call.

library(HLMdiag)
hlm_resid(fm1)
#> # A tibble: 4,059 × 10
#>      .id normexam standLRT school  .resid .fitted .ls.resid .ls.fitted
#>    <dbl>    <dbl>    <dbl> <fct>    <dbl>   <dbl>     <dbl>      <dbl>
#>  1     1    0.261    0.619 1      -0.464   0.725     -0.561    0.822  
#>  2     2    0.134    0.206 1      -0.358   0.492     -0.395    0.529  
#>  3     3   -1.72    -1.36  1      -1.33   -0.393     -1.14    -0.585  
#>  4     4    0.968    0.206 1       0.475   0.492      0.438    0.529  
#>  5     5    0.544    0.371 1      -0.0409  0.585     -0.102    0.647  
#>  6     6    1.73     2.19  1       0.125   1.61      -0.201    1.94   
#>  7     7    1.04    -1.12  1       1.29   -0.253      1.45    -0.409  
#>  8     8   -0.129   -1.03  1       0.0773 -0.206      0.221   -0.350  
#>  9     9   -0.939   -0.538 1      -1.01    0.0730    -0.941    0.00167
#> 10    10   -1.22    -1.45  1      -0.780  -0.439     -0.576   -0.643  
#> # ℹ 4,049 more rows
#> # ℹ 2 more variables: .mar.resid <dbl>, .mar.fitted <dbl>

hlm_resid() appends 6 columns to the model frame:

.resid the unstandardized Empirical Bayes (EB) level-1 residuals for each observation
.fitted the Empirical Bayes (EB) fitted values
.ls.resid the unstandardized Least Squares (LS) level-1 residuals
.ls.fitted the Least Squares fitted values
.mar.resid the unstandardized marginal residuals
.mar.fitted the marginal fitted values

Note that the columns containing the Empirical Bayes residuals are simply named .resid and .fitted. This is because both lme4::resid() and nlme::resid() use the EB method in estimating residuals, thus the analyst is likely more familiar with this type of residual.

standardize and include.ls

We can alter the output of the level-1 residuals using the standardize and include.ls parameters introduced above. The following call returns both the level-1 and marginal residuals divided by the estimated standard error and excludes the least squares residuals.

hlm_resid(fm1, standardize = TRUE, include.ls = FALSE)
#> # A tibble: 4,059 × 8
#>      .id normexam standLRT school .std.resid .fitted .chol.mar.resid .mar.fitted
#>    <dbl>    <dbl>    <dbl> <fct>       <dbl>   <dbl>           <dbl>       <dbl>
#>  1     1    0.261    0.619 1         -0.616   0.725          -0.111        0.351
#>  2     2    0.134    0.206 1         -0.476   0.492           0.0353       0.118
#>  3     3   -1.72    -1.36  1         -1.77   -0.393          -1.19        -0.766
#>  4     4    0.968    0.206 1          0.632   0.492           1.21         0.118
#>  5     5    0.544    0.371 1         -0.0544  0.585           0.445        0.211
#>  6     6    1.73     2.19  1          0.167   1.61            0.618        1.24 
#>  7     7    1.04    -1.12  1          1.72   -0.253           2.06        -0.627
#>  8     8   -0.129   -1.03  1          0.103  -0.206           0.352       -0.580
#>  9     9   -0.939   -0.538 1         -1.35    0.0730         -1.07        -0.301
#> 10    10   -1.22    -1.45  1         -1.04   -0.439          -0.705       -0.813
#> # ℹ 4,049 more rows

hlm_resid() appends the EB and marginal residual columns to the mode frame. The names of the columns containing residuals now have the prefix .std to reflect the fact that they are standardized. Note that standardized marginal residuals are equivalent to the Cholesky residuals of a HLM and thus are named .chol.mar.resid.

Higher-level residuals

To access higher-level residuals, we can set the level parameter to a grouping factor as defined in names(object@flist) in lme4 or in names(object$groups) in nlme.

names(fm1@flist)
#> [1] "school"

To extract the school-level residuals, otherwise known as the predicted random effects for school, we use the following call.

hlm_resid(fm1, level = "school")
#> # A tibble: 65 × 3
#>    school .ranef.intercept .ls.intercept
#>    <chr>             <dbl>         <dbl>
#>  1 1                0.374         0.451 
#>  2 2                0.502         0.550 
#>  3 3                0.504         0.626 
#>  4 4                0.0181        0.0719
#>  5 5                0.240         0.329 
#>  6 6                0.541         0.671 
#>  7 7                0.379         0.467 
#>  8 8               -0.0262        0.0429
#>  9 9               -0.135        -0.169 
#> 10 10              -0.337        -0.257 
#> # ℹ 55 more rows

In the tibble returned by hlm_resid(), each row now represents one of the groups specified by the level parameter. In this example, there were 65 schools from which data was collected, and the resulting tibble has 65 rows. If we had included school-level variables in the model, such as the average intake score for a school (schavg), those variables would also be included in the output. The final two columns contain the random effects for intercept at the school.

.ranef.intercept the random effects for intercept, using Empirical Bayes estimation
.ls.intercept the random effects for intercept, using Least Squares estimation

Had we included other random effect terms in the model, the EB and LS predictions of their random effects would also be included in the output. Again, the EB random effects are simply named .ranef because it is assumed that the analyst is most familiar with this type of random effect The parameters standardize and include.ls work the same way with higher-level residuals and with both lmerMod and lme objects. Setting include.ls = FALSE is often recommended with higher-level residuals, as calculating LS residuals is computationally intensive and the resulting residuals are often unreliable due to small group sizes.

na.action and data

The final parameter, data, is an argument that becomes required when na.action = na.exclude for an lmerMod model object. Model frames in lme4 always use na.omit. Thus, a data parameter is needed to retain NA values in the model frame that hlm_resid() returns. To demonstrate this requirement, we create a duplicate data set of Exam with the standardized LRT scores set to NA for observations 1, 2, and 5.

# make copy of Exam data set
Exam2 <- Exam              
# set standLRT to NA for 3 observations
Exam2[c(1,2,5),7] <- NA    
# refit model with na.exclude
fm1.NA <- lme4::lmer(normexam ~ standLRT + (1 | school), Exam2,  
               na.action = na.exclude)

In the first example, when we try to call hlm_resid() on our model object, it throws an informative error asking for the data set used to fit the model. We provide this using the data parameter in the second example.

# incorrect:
hlm_resid(fm1.NA) 
#> Error in hlm_resid.lmerMod(fm1.NA): Please provide the data frame used to fit the model. This is necessary when the na.action is set to na.exclude

# correct:
hlm_resid(fm1.NA, data = Exam2) 
#> # A tibble: 4,059 × 10
#>      .id school normexam standLRT  .resid .fitted .ls.resid .ls.fitted
#>    <dbl> <fct>     <dbl>    <dbl>   <dbl>   <dbl>     <dbl>      <dbl>
#>  1     1 1         0.261   NA     NA      NA         NA        NA     
#>  2     2 1         0.134   NA     NA      NA         NA        NA     
#>  3     3 1        -1.72    -1.36  -1.34   -0.381     -1.15     -0.575 
#>  4     4 1         0.968    0.206  0.464   0.504      0.423     0.545 
#>  5     5 1         0.544   NA     NA      NA         NA        NA     
#>  6     6 1         1.73     2.19   0.113   1.62      -0.224     1.96  
#>  7     7 1         1.04    -1.12   1.28   -0.241      1.44     -0.398 
#>  8     8 1        -0.129   -1.03   0.0653 -0.194      0.210    -0.339 
#>  9     9 1        -0.939   -0.538 -1.02    0.0850    -0.954     0.0142
#> 10    10 1        -1.22    -1.45  -0.792  -0.427     -0.585    -0.634 
#> # ℹ 4,049 more rows
#> # ℹ 2 more variables: .mar.resid <dbl>, .mar.fitted <dbl>

Note that for the same model fit in nlme, we do not need to include the data parameter.

fm1.NA.lme <- nlme::lme(normexam ~ standLRT, random = ~1 | school, Exam2,  
               na.action = na.exclude)
hlm_resid(fm1.NA.lme)
#> # A tibble: 4,059 × 10
#>      .id normexam standLRT school  .resid .fitted .ls.resid .ls.fitted
#>    <dbl>    <dbl>    <dbl> <fct>    <dbl>   <dbl>     <dbl>      <dbl>
#>  1     1    0.261   NA     1      NA      NA         NA        NA     
#>  2     2    0.134   NA     1      NA      NA         NA        NA     
#>  3     3   -1.72    -1.36  1      -1.34   -0.381     -1.15     -0.575 
#>  4     4    0.968    0.206 1       0.464   0.504      0.423     0.545 
#>  5     5    0.544   NA     1      NA      NA         NA        NA     
#>  6     6    1.73     2.19  1       0.113   1.62      -0.224     1.96  
#>  7     7    1.04    -1.12  1       1.28   -0.241      1.44     -0.398 
#>  8     8   -0.129   -1.03  1       0.0653 -0.194      0.210    -0.339 
#>  9     9   -0.939   -0.538 1      -1.02    0.0850    -0.954     0.0142
#> 10    10   -1.22    -1.45  1      -0.792  -0.427     -0.585    -0.634 
#> # ℹ 4,049 more rows
#> # ℹ 2 more variables: .mar.resid <dbl>, .mar.fitted <dbl>

Three-level models

In three-level models, how we specify middle-level residuals changes depending on whether a model is a lmerMod or lme object. This is due to differences in how lme4 and nlme store their grouping factors To demonstrate this, we use a classroom data set from West, Welch, Galecki (2006). The data contains math testing scores from 1190 children in 312 different classroom within 107 unique schools. All classrooms are nested within schools, and all students are nested within a single classroom.

data("classroom", package = "WWGbook")
head(classroom)
#>   sex minority mathkind mathgain   ses yearstea mathknow housepov mathprep
#> 1   1        1      448       32  0.46        1       NA    0.082     2.00
#> 2   0        1      460      109 -0.27        1       NA    0.082     2.00
#> 3   1        1      511       56 -0.03        1       NA    0.082     2.00
#> 4   0        1      449       83 -0.38        2    -0.11    0.082     3.25
#> 5   0        1      425       53 -0.03        2    -0.11    0.082     3.25
#> 6   1        1      450       65  0.76        2    -0.11    0.082     3.25
#>   classid schoolid childid
#> 1     160        1       1
#> 2     160        1       2
#> 3     160        1       3
#> 4     217        1       4
#> 5     217        1       5
#> 6     217        1       6

The data set contains 12 columns; however, we are only interested in the response variable, mathgain, and the grouping variables, classid and schoolid. We fit the simple unconditional means model in both lme4 and nlme below.

# lme4
fm2.lmer <- lme4::lmer(mathgain ~ 1 + (1|schoolid/classid), data = classroom)
# nlme
fm2.lme <- nlme::lme(mathgain ~ 1, random = ~1|schoolid/classid, data = classroom)

For both models, classroom is nested within school. We can access the grouping factors for each model with the following:

# lme4
names(fm2.lmer@flist)
#> [1] "classid:schoolid" "schoolid"
# nlme
names(fm2.lme$groups)
#> [1] "schoolid" "classid"

Note how the orders of the grouping factors are opposite for lme4 and nlme. In lme4, convention is to start at the lowest level of hierarchy and move upwards (in this case classid:schoolid, followed by schoolid). Grouping factors in nlme are given starting at the highest grouping and move down (schoolid, followed by classid). The order of classid and schoolid in the returned tibble is based on these conventions.

To extract the classroom-level random effects for fm2.lmer, we set level = "classid:schoolid".

# lme4
hlm_resid(fm2.lmer, level = "classid:schoolid")
#> # A tibble: 312 × 5
#>    group classid schoolid .ranef.intercept .ls.intercept
#>    <chr>   <int>    <int>            <dbl>         <dbl>
#>  1 160:1     160        1           1.64            4.15
#>  2 217:1     217        1          -0.433          -4.15
#>  3 197:2     197        2          -1.69          -12.9 
#>  4 211:2     211        2           2.51            6.58
#>  5 307:2     307        2           2.44            6.33
#>  6 11:3       11        3           3.87           -4.38
#>  7 137:3     137        3           5.56           16.1 
#>  8 145:3     145        3           8.10            6.62
#>  9 228:3     228        3          -0.0237        -18.4 
#> 10 48:4       48        4           3.77           44   
#> # ℹ 302 more rows

For the fm2.lme, we set level = "classid".

# nlme
hlm_resid(fm2.lme, level = "classid")
#> # A tibble: 312 × 5
#>    group schoolid classid .ranef.intercept .ls.intercept
#>    <chr>    <int>   <int>            <dbl>         <dbl>
#>  1 1/160        1     160           1.64            4.15
#>  2 1/217        1     217          -0.433          -4.15
#>  3 2/197        2     197          -1.69          -12.9 
#>  4 2/211        2     211           2.51            6.58
#>  5 2/307        2     307           2.44            6.33
#>  6 3/11         3      11           3.87           -4.38
#>  7 3/137        3     137           5.56           16.1 
#>  8 3/145        3     145           8.10            6.62
#>  9 3/228        3     228          -0.0235        -18.4 
#> 10 4/48         4      48           3.77           44   
#> # ℹ 302 more rows

Using hlm_resid() in context

Now that we have discussed how to extract all types of residuals for hierarchical models using hlm_resid(), we illustrate how to use those residuals in context to evaluate a model. We will diagnose the earlier model, fm1, fit on the Exam data set above. We use hlm_resid() to extract the standardized level-1 and marginal residuals.

resid1_fm1 <- hlm_resid(fm1, level = 1, standardize = TRUE)
resid1_fm1
#> # A tibble: 4,059 × 10
#>      .id normexam standLRT school .std.resid .fitted .std.ls.resid .ls.fitted
#>    <dbl>    <dbl>    <dbl> <fct>       <dbl>   <dbl>         <dbl>      <dbl>
#>  1     1    0.261    0.619 1         -0.616   0.725         -0.682    0.822  
#>  2     2    0.134    0.206 1         -0.476   0.492         -0.480    0.529  
#>  3     3   -1.72    -1.36  1         -1.77   -0.393         -1.40    -0.585  
#>  4     4    0.968    0.206 1          0.632   0.492          0.532    0.529  
#>  5     5    0.544    0.371 1         -0.0544  0.585         -0.124    0.647  
#>  6     6    1.73     2.19  1          0.167   1.61          -0.251    1.94   
#>  7     7    1.04    -1.12  1          1.72   -0.253          1.78    -0.409  
#>  8     8   -0.129   -1.03  1          0.103  -0.206          0.271   -0.350  
#>  9     9   -0.939   -0.538 1         -1.35    0.0730        -1.15     0.00167
#> 10    10   -1.22    -1.45  1         -1.04   -0.439         -0.712   -0.643  
#> # ℹ 4,049 more rows
#> # ℹ 2 more variables: .chol.mar.resid <dbl>, .mar.fitted <dbl>

Utilizing the tibble output of hlm_resid(), we can easily plot the standardized LS residuals against explanatory variables such as the standardized LRT score. We use the LS residuals at level-1 because they are unconfounded of higher-level residuals.

library(ggplot2)
ggplot(data = resid1_fm1, aes(x = standLRT, y = .std.ls.resid)) + 
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", se = FALSE) + 
  labs(y = "LS level-1 residuals", 
       title = "LS residuals against standardized LRT score")

The smoother shows that standardized LRT scores might not be linearly related to GCSE exam scores. Likelihood ratio tests (not shown) confirm that quadratic and cubic terms for standLRT contribute significantly in describing GCSE exam scores, so we incorporate these terms in the updated model, fm1b. We then reassess the level-1 LS residuals.

fm1b <- update(fm1, .~. + I(standLRT^2) + I(standLRT^3))

resid1_fm1b <- hlm_resid(fm1b, level = 1, standardize = TRUE)

ggplot(data = resid1_fm1b, aes(x = standLRT, y = .std.ls.resid)) + 
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", se = FALSE) + 
  labs(y = "LS level-1 residuals", title = "LS residuals against standardized LRT score")

The addition of the quadratic and cubic terms seems to have improved the level-1 residuals. Additionally, there don’t appear to be any large outliers that might need to be removed. We double check the LS level-1 residuals for fm1, plotting them against the LS fitted values appended to the model frame by hlm_resid().

ggplot(data = resid1_fm1b, aes(x = .ls.fitted, y = .std.ls.resid)) + 
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", se = FALSE) + 
  labs(x = "LS fitted values",
       y = "LS level-1 residuals", 
       title = "LS residuals against LS fitted values")

The LS level-1 residuals give no indication that the model violates any assumptions of an HLM. Because there are no glaring issues, we move on to the higher-level residuals, again plotting the output of hlm_resid().

resid2_fm1b <- hlm_resid(fm1b, level = "school", include.ls = FALSE)
resid2_fm1b
#> # A tibble: 65 × 2
#>    school .ranef.intercept
#>    <chr>             <dbl>
#>  1 1                0.374 
#>  2 2                0.498 
#>  3 3                0.502 
#>  4 4                0.0211
#>  5 5                0.247 
#>  6 6                0.538 
#>  7 7                0.390 
#>  8 8               -0.0220
#>  9 9               -0.156 
#> 10 10              -0.331 
#> # ℹ 55 more rows

ggplot(data = resid2_fm1b, aes(x = school, y = .ranef.intercept)) + 
  geom_point() +
  labs(y = "Random effects - intercept", 
       title = "Intercept random effects against school") + 
  coord_flip()

The school-level random effects again show no large outliers. We would now check influence diagnostics such as Cook’s distance and leverage to ensure our model is appropriate. For a discussion of influence diagnostics within HLMdiag, we direct the reader to the vignette for hlm_influence().

pull_resid()

The function pull_resid() is an efficient way to pull a single type of residual as a vector. Whereas hlm_resid() spends time calculating all types of residuals and fitted values, pull_resid() only calculates the type of residual specified by the user, saving time over using hlm_resid() and indexing for a specific column. This is especially useful when used in a loop, such as when using simulation-based diagnostics.

pull_resid() takes three parameters, object, type, and standardize. The parameters object and standardize work identically as they do in hlm_resid(), and the new parameter, type, can be set to:

"ls" to return LS level-1 residuals
"eb" to return EB level-1 residuals
"marginal" to return marginal residuals

The example below illustrates the output of pull_resid() being used to extract level-1 standardized LS residuals.

head(pull_resid(fm1, type = "ls", standardize = TRUE))
#> [1] -0.6824458 -0.4800891 -1.4044892  0.5323373 -0.1242091 -0.2512477

The function pull_resid() is only implemented for level-1 residuals. If the analyst wants to extract random effects for a model, they should use the ranef() functions from either lme4 or nlme.

hlm_augment()

The hlm_augment() function combines the hlm_resid() and hlm_influence() functions to return a tibble containing information about the residuals and the influence diagnostics appended to the data. hlm_augment() has three parameters, object, level, and include.ls, all of which have the same functionality as in hlm_resid(). The syntax is the same for the two functions. For example, we can extract both level-1 residual and influence diagnostics from the model fm1 with the following call.

fm1.aug <- hlm_augment(fm1)
tibble::glimpse(fm1.aug)
#> Rows: 4,059
#> Columns: 15
#> $ .id              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
#> $ normexam         <dbl> 0.2613242, 0.1340672, -1.7238820, 0.9675862, 0.544341…
#> $ standLRT         <dbl> 0.6190592, 0.2058022, -1.3645760, 0.2058022, 0.371105…
#> $ school           <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ .resid           <dbl> -0.46358741, -0.35802733, -1.33127074, 0.47549167, -0…
#> $ .fitted          <dbl> 0.72491161, 0.49209453, -0.39261126, 0.49209453, 0.58…
#> $ .ls.resid        <dbl> -0.56113518, -0.39525182, -1.13926654, 0.43826718, -0…
#> $ .ls.fitted       <dbl> 0.822459376, 0.529319021, -0.584615462, 0.529319021, …
#> $ .mar.resid       <dbl> -0.089826659, 0.015733418, -0.957509986, 0.849252418,…
#> $ .mar.fitted      <dbl> 0.35115086, 0.11833378, -0.76637201, 0.11833378, 0.21…
#> $ cooksd           <dbl> 1.503345e-05, 2.075760e-06, 1.042793e-03, 3.661260e-0…
#> $ mdffits          <dbl> 1.503227e-05, 2.075722e-06, 1.042108e-03, 3.661194e-0…
#> $ covtrace         <dbl> 7.814095e-05, 1.809065e-05, 6.568974e-04, 1.809065e-0…
#> $ covratio         <dbl> 1.000078, 1.000018, 1.000657, 1.000018, 1.000031, 1.0…
#> $ leverage.overall <dbl> 0.01271288, 0.01265360, 0.01328391, 0.01265360, 0.012…

This is useful for inspecting residuals values and influence diagnostics values at the same time. However, hlm_augment() lacks some of the functionality that hlm_influence() and hlm_resid() have. The delete and approx parameters available for hlm_influence() are not available in hlm_augment(), so the function will always use a one step approximation and delete all observations or groups instead of a selected few. The standardize parameter from hlm_resid() is also not included, so unstandardized residuals will always be returned. If additional functionality is required, hlm_influence() or hlm_resid() should be used instead. hlm_augment() is especially useful for inspecting influence diagnostics of observations with relatively high residuals, or vice versa. For more information about available functionality in hlm_influence(), see the hlm_influence() vignette.

References

Adam Loy, Heike Hofmann (2014). HLMdiag: A Suite of Diagnostics for Hierarchical Linear Models in R. Journal of Statistical Software, 56(5), 1-28. DOI

Brady West, Kathleen Welch, Andrzej Galecki (2006) Linear Mixed Models: A Practical Guide Using Statistical Software. First Edition. Chapman Hall / CRC Press. ISBN 1584884800

David Robinson, Alex Hayes (2020). broom: Convert Statistical Analysis Objects into Tidy Tibbles. R package version 0.5.6. https://CRAN.R-project.org/package=broom

Douglas Bates, Martin Maechler, Ben Bolker (2020). mlmRev: Examples from Multilevel Modelling Software Review. R package version 1.0-8. https://CRAN.R-project.org/package=mlmRev

Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. DOI

J.A. Hilden-Minton (1995). Multilevel Diagnostics for Mixed and Hierarchical Linear Models. Ph.D. thesis, University of California Los Angeles.

Jose Pinheiro, Douglas Bates, Saikat DebRoy, Deepayan Sarkar, R Core Team (2020). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-148, https://CRAN.R-project.org/package=nlme.

Residual Diagnostics

2025-05-12