\documentclass[nojss]{jss}

%\VignetteIndexEntry{Fitting Generalized Linear Mixed-Effects Model Trees}
%\VignetteDepends{glmertree}
%\VignetteKeywords{mixed-effects model trees, recursive partitioning, decision trees}
%\VignettePackage{glmertree}

%% packages
\usepackage{thumbpdf,lmodern,placeins}

\title{Fitting Generalized Linear Mixed-Effects Model Trees}
\Shorttitle{Generalized Linear Mixed-Effects Model Trees}
\author{Marjolein Fokkema\\Universiteit Leiden \And Achim Zeileis\\Universit\"at Innsbruck}
\Plainauthor{Marjolein Fokkema, Achim Zeileis}

\Abstract{
This vignette introduces the \pkg{glmertree} package for fitting generalized linear mixed-effects model trees (GLMM trees or glmertrees). In hands-on examples based on artificial datasets, emphasis is given to three special cases of fitting GLMM trees: trees with constant fits in the terminal nodes (Section~\hyperref[sec:regression_trees]{\ref{sec:regression_trees}}), detection of treatment-subgroup interactions (Section~\hyperref[sec:treatment_subgroups]{\ref{sec:treatment_subgroups}}), and subgroup detection in longitudinal growth curve models (Section~\hyperref[sec:growth_curves]{\ref{sec:growth_curves}}).
}
\Keywords{recursive partitioning, mixed-effects model trees, decision trees}

\Address{
Marjolein Fokkema\\
Department of Methods \& Statistics, Institute of Psychology\\
Universiteit Leiden\\
Wassenaarseweg 52\\
2333 AK Leiden, The Netherlands\\
E-mail: \email{m.fokkema@fsw.leidenuniv.nl}\\
URL: \url{http://www.marjoleinfokkema.nl/}\\

Achim Zeileis \\
Department of Statistics \\
Faculty of Economics and Statistics \\
Universit\"at Innsbruck \\
Universit\"atsstr.~15 \\
6020 Innsbruck, Austria \\
E-mail: \email{Achim.Zeileis@R-project.org} \\
URL: \url{https://eeecon.uibk.ac.at/~zeileis/} \\
}

\begin{document}
\SweaveOpts{concordance=TRUE}

\vspace*{-0.35cm}

\SweaveOpts{engine=R, keep.source=TRUE}
<<echo=FALSE, results=hide>>=
library("glmertree")
options(width = 70, prompt = "R> ", continue = "+ ")
@

\section{Introduction}

Generalized linear mixed-effects model trees (GLMM trees or glmertrees) were initially proposed by \cite{FokkySmit18} for detecting treatment-subgroup interactions in clustered datasets. However, GLMM trees may be used to address a wider range of research questions. Using hands-on examples based on artificial datasets, this vignette describes how to fit GLMM trees. Section~\hyperref[sec:regression_trees]{\ref{sec:regression_trees}} shows how to fit (G)LMM trees with constant fits in the terminal nodes on clustered (multilevel) data. Section~\hyperref[sec:treatment_subgroups]{\ref{sec:treatment_subgroups}} shows how to assess main and interaction effects of a categorical variable (treatment) on a continuous response (treatment outcome) in clustered data. Section~\hyperref[sec:growth_curves]{\ref{sec:growth_curves}} shows how subgroups can be detected with respect to the parameters of a growth curve model. These are just examples; package \pkg{glmertree} can be used to detect predictors and moderators in a wide range of GLMM-type models.

The GLMM tree model is composed of global and local parts: the global model comprises the random-effects terms and is estimated using all observations. The local model comprises the fixed-effects terms, which are estimated locally: the observations in a dataset are partitioned with respect to additional covariates (a.k.a.\ partitioning variables) and a separate fixed-effects model is estimated in each cell of the resulting partition.
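More formally, the model can be sketched as follows (simplified notation for illustration; see \cite{FokkySmit18} for the full formulation): given a partition of the observations into terminal nodes $m = 1, \dots, M$ based on the partitioning variables, the linear predictor for observation $i$ in cluster $j$, belonging to terminal node $m$, can be written as
%
\begin{equation*}
\eta_{ij} = x_{ij}^\top \beta_m + z_{ij}^\top b_j, \qquad b_j \sim N(0, \Sigma),
\end{equation*}
%
where the fixed-effects coefficients $\beta_m$ are local (i.e., node-specific), while the random effects $b_j$ and their covariance matrix $\Sigma$ are global (i.e., estimated using all observations).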
Package \pkg{glmertree} employs package \pkg{partykit} \citep{HothyZeil15} to find the partition and package \pkg{lme4} \citep{BateyMeac15} to estimate the mixed-effects model. The current stable release version of package \pkg{glmertree} can be installed from the Comprehensive \proglang{R} Archive Network (CRAN) as follows:
%
<<eval=FALSE>>=
install.packages("glmertree")
@
%
Alternatively, the current development version can be installed from \proglang{R}-Forge:
%
<<eval=FALSE>>=
install.packages("glmertree", repos = "http://R-Forge.R-project.org")
@
%
After installation, the package can be loaded as follows:
%
<<>>=
library("glmertree")
@
%
The main functions in package \pkg{glmertree} are \code{lmertree()} for continuous outcome variables, and \code{glmertree()} for binary or count outcome variables. Both functions require specification of a \code{formula} and a \code{data} argument. The various methods that are available for plotting and printing the fitted models, and for extracting further information, will be shown in the examples below.

For both main functions, the \code{formula} argument specifies the model formula, which is composed of a left- and a right-hand side: the left-hand side specifies the response variable, followed by a tilde (\code{~}). The right-hand side comprises three parts: the predictors for the node-specific model (comprising fixed effects only, with coefficients that are allowed to differ over subgroups), the global model (comprising random and/or fixed effects, for which coefficients are estimated globally, using all observations), and the potential partitioning variables. The three parts of the right-hand side are separated by vertical bars:
\newline
\newline
%
\code{response ~ node-specific predictors | global predictors | partitioning variables}
%

\section{Fitting a mixed-effects model tree with constant fits}
\label{sec:regression_trees}

For this example, we will make use of the artificially generated \code{MHserviceDemo} dataset, containing data on $N = \Sexpr{nrow(MHserviceDemo)}$ young people receiving treatment at one of \Sexpr{length(unique(MHserviceDemo$cluster_id))} mental-health service providers. The response variable (\code{outcome}) is a continuous variable representing treatment outcome, as measured by a mental-health difficulties score at follow-up, corrected for the baseline assessment, with higher values reflecting poorer treatment outcome. Potential predictor variables are demographic variables and case characteristics: two continuous (\code{age} and \code{impact}) and four binary covariates (\code{gender}, \code{emotional}, \code{autism} and \code{conduct}). The cluster indicator (\code{cluster\_id}) identifies the mental-health service provider. The data can be loaded as follows:
%
<<>>=
data("MHserviceDemo", package = "glmertree")
summary(MHserviceDemo)
@
%
The response is a continuous variable, so we employ function \code{lmertree()}. In the model formula, we specify \code{outcome} as the response variable, followed by a tilde. Next, we specify the node-specific model, which comprises only an intercept in this case, because we want to identify subgroups which differ on their value of the response variable. Next, we specify the predictor for the global model. We only want to account for possible outcome differences between the service providers, so we specify a random intercept with respect to the \code{cluster\_id} variable.
Finally, we specify the demographic variables and case characteristics as the potential partitioning variables:
%
<<>>=
MH_tree <- lmertree(outcome ~ 1 | cluster_id | age + gender +
  emotional + autism + impact + conduct, data = MHserviceDemo)
@
%
Note that we specified \code{cluster\_id} as the predictor variable for the global model, in order to estimate a random intercept with respect to \code{cluster\_id}. This is shorthand notation for \code{(1 | cluster\_id)}: by default, functions \code{lmertree()} and \code{glmertree()} assume a single random intercept term, which can be specified by providing only the cluster indicator. More complex random-effects structures can be specified with the customary formulation employed in package \pkg{lme4}. For example, if we wanted to account for a global linear fixed effect of age, we could incorporate it in the specification of the global model as follows (results not presented or discussed further):
%
<<>>=
MH_tree2 <- lmertree(outcome ~ 1 | age + (1 | cluster_id) | gender +
  emotional + autism + impact + conduct, data = MHserviceDemo)
@
%
Note that we used round brackets around \code{(1 | cluster\_id)} to protect the vertical bars separating the global predictors from the node-specific predictors and potential partitioning variables.

Alternatively, using the \code{glmertree()} function, a tree may be fitted to a binary (\code{family = binomial}, default) or count response variable (\code{family = poisson}). For example, a binomial GLMM tree for a dichotomized response could be obtained by (results not presented or discussed further):
%
<<>>=
MHserviceDemo$outcome_bin <- factor(MHserviceDemo$outcome > 0)
MH_gtree <- glmertree(outcome_bin ~ 1 | cluster_id | age + gender +
  emotional + autism + impact + conduct, data = MHserviceDemo,
  family = "binomial")
@
%
Using the \code{plot} method, we can plot the resulting tree and random effects:
%
<<eval=FALSE>>=
plot(MH_tree)
@
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.8\textwidth}
<<echo=FALSE, fig=TRUE>>=
plot(MH_tree, which = "tree")
@
\caption{\label{fig:tree_1} LMM tree for predicting treatment outcome.}
\end{figure}
%
Using argument \code{which}, we can specify which part of the model should be plotted; by default (\code{which = "all"}), both the tree and the random effects are plotted.

The plotted tree is depicted in Figure~\ref{fig:tree_1}. In every inner node of the plotted tree, the splitting variable and the corresponding $p$-value from the parameter stability test are reported. To control for multiple testing, the $p$-values are Bonferroni corrected, by default. This can be turned off by adding \code{bonferroni = FALSE} to the function call, which yields a less conservative criterion for the parameter stability tests, but may increase the likelihood of overfitting. The significance level $\alpha$ equals .05 by default, but a different value may be specified by including, for example, \code{alpha = .01} in the function call.

The tree in Figure~\ref{fig:tree_1} shows the distribution of the observed values of the response in each of the terminal nodes. Four subgroups were found: terminal node 3 indicates that for female patients of lower age, the highest values of the response (i.e., the poorest treatment outcomes) were observed. Slightly better treatment outcomes are observed in terminal node 4 (females of higher age) and node 6 (males not presenting with an emotional disorder). The best treatment outcomes are observed in node 7 (males presenting with an emotional disorder). Random-effects predictions are plotted in Figure~\ref{fig:ranef_1}.
On average, patients at service provider 3 appear to have higher response variable values (i.e., poorer outcomes), while patients at service provider 10 appear to have more favorable outcomes.
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.5\textwidth}
<<echo=FALSE, fig=TRUE>>=
plot(MH_tree, which = "ranef")
@
\caption{\label{fig:ranef_1} Predicted random effects for the different service providers.}
\end{figure}
%
To obtain numerical results, \code{print}, \code{coef}, \code{fixef}, \code{ranef} and \code{VarCorr} methods are available (results omitted):
%
<<results=hide>>=
print(MH_tree)
coef(MH_tree)
fixef(MH_tree)
ranef(MH_tree)
VarCorr(MH_tree)
@
%
To obtain predicted values, the \code{predict} method can be used:
%
<<>>=
predict(MH_tree, newdata = MHserviceDemo[1:10, ])
@
%
If the \code{newdata} argument is not specified, predictions for the training observations are returned, by default. Also by default, the predictions are based on both the random and fixed effects. Random effects can be excluded from the predictions by adding \code{re.form = NA}. This is useful, for example, when \code{newdata} specifies new observations which are not part of a cluster from the training data (i.e., are from a `new' cluster):
%
<<>>=
predict(MH_tree, newdata = MHserviceDemo[1:10, -7], re.form = NA)
@

\subsection{Inspecting residuals}

Residuals of the fitted mixed-effects tree can be obtained with the \code{residuals} method. Residuals can be used for assessing potential misspecification of the model or violation of its assumptions (e.g., heteroscedasticity):
%
<<eval=FALSE>>=
resids <- residuals(MH_tree)
preds <- predict(MH_tree)
plot(MHserviceDemo$cluster_id, resids)
scatter.smooth(preds, resids)
@
%
\begin{figure}[b!]
\centering
\setkeys{Gin}{width=0.49\textwidth}
<<echo=FALSE, fig=TRUE>>=
resids <- residuals(MH_tree)
preds <- predict(MH_tree)
plot(MHserviceDemo$cluster_id, resids, xlab = "Cluster", ylab = "Residuals")
@
<<echo=FALSE, fig=TRUE>>=
scatter.smooth(preds, resids, xlab = "Predicted values", ylab = "Residuals")
@
\caption{\label{fig:residuals_1} Residuals of the fitted linear mixed-effects model tree in Figure~\ref{fig:tree_1}.}
\end{figure}
%
The plotted residuals are depicted in Figure~\ref{fig:residuals_1}. The left panel indicates some differences in residual variances across the levels of \code{cluster\_id}, but these differences were not statistically significant (when tested with functions \code{bartlett.test()} or \code{fligner.test()}). The right panel of Figure~\ref{fig:residuals_1} indicates no pattern of association between fitted values and residuals.

\FloatBarrier

\section{Detecting treatment-subgroup interactions in clustered data}
\label{sec:treatment_subgroups}

In this example, we extend the model for the terminal nodes to accommodate a predictor variable: an indicator for treatment. Including predictor variables in the node-specific model may be particularly helpful when the interest is in detecting moderators, for example in the detection of treatment-subgroup interactions, where the effect of treatment may be moderated by one or more additional covariates.
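In the simplified notation of the sketch in the introduction, the node-specific model then comprises an intercept and a treatment effect, both of which may differ across the terminal nodes $m$:
%
\begin{equation*}
y_{ij} = \beta_{0m} + \beta_{1m} \, \text{treatment}_{ij} + b_j + \epsilon_{ij}, \qquad b_j \sim N(0, \sigma_b^2),
\end{equation*}
%
so that a treatment-subgroup interaction corresponds to differences in the treatment effect $\beta_{1m}$ between the terminal nodes.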
To illustrate, we will use the artificial motivating dataset from \cite{FokkySmit18}, which can be recreated using the code provided in Appendix~\ref{app}, or can be loaded as follows:
%
<<>>=
data("DepressionDemo", package = "glmertree")
summary(DepressionDemo)
@
%
The dataset includes seven variables: a continuous response variable (\code{depression}), a predictor variable for the linear model (\code{treatment}), three potential partitioning variables (\code{age}, \code{anxiety}, \code{duration}), an indicator for cluster (\code{cluster}), and a binarized response variable (\code{depression\_bin}). We fit a tree to the continuous response as follows:
%
<<>>=
lmmt <- lmertree(depression ~ treatment | cluster | age + duration +
  anxiety, data = DepressionDemo)
@
%
The left-hand side of the model formula (preceding the tilde symbol) specifies the response variable (\code{depression}). The right-hand side of the model formula comprises three parts, separated by vertical bars: the first part specifies the predictor variable(s) of the node-specific (G)LMM (\code{treatment}, in this example). The second part specifies the predictors of the global model (comprising only a random intercept with respect to \code{cluster}, in this example). The third part specifies the potential partitioning variables. In this example, all partitioning variables are continuous, but (ordered) categorical partitioning variables may also be specified.

We specified only a single variable for the global model; by default, if only a single variable is specified for the global model, a random intercept with respect to this variable will be estimated. More complex random effects can be specified using the customary formulation employed in \pkg{lme4}. For example, we could specify a global model comprising correlated random intercepts and slopes of \code{age}, both estimated with respect to \code{cluster}:
%
<<eval=FALSE>>=
depression ~ treatment | (age + (1 + age | cluster)) | age + duration +
  anxiety
@
%
Note that we included a fixed effect for \code{age} in the global model, as is customary when specifying a random effect. Note also that we used round brackets in order to protect the vertical bars in the formulation of the (global) random effects.

We could also encounter nested multilevel structures, for example when we have patients from treatment centers nested within geographical areas. Using the indicators for treatment center and geographical area (e.g., \code{center} and \code{area}), we could then specify random intercept terms for \code{center}, nested within \code{area}:
%
<<eval=FALSE>>=
depression ~ treatment | (1 | area/center) | age + duration + anxiety
@
%
Using the \code{plot} method, we can plot the resulting tree and random effects:
%
<<eval=FALSE>>=
plot(lmmt)
@
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.8\textwidth}
<<echo=FALSE, fig=TRUE>>=
plot(lmmt, which = "tree")
@
\caption{\label{fig:tree} Linear mixed-effects model tree with treatment-subgroup interactions.}
\end{figure}
%
The plotted tree is depicted in Figure~\ref{fig:tree}. In every inner node of the plotted tree, the splitting variable and the corresponding $p$-value from the parameter stability test are reported. To control for multiple testing, the $p$-values are Bonferroni corrected, by default.
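These defaults can again be adjusted through the \code{alpha} and \code{bonferroni} arguments mentioned in Section~\ref{sec:regression_trees}. For example, a tree employing a stricter significance level could be requested as follows (a sketch for illustration only; the object \code{lmmt\_strict} is not used further):
%
<<eval=FALSE>>=
lmmt_strict <- lmertree(depression ~ treatment | cluster | age +
  duration + anxiety, data = DepressionDemo, alpha = .01)
@
%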
The plotted tree in Figure~\ref{fig:tree} shows that there are three subgroups with differential treatment effectiveness: node 3 indicates that for patients with lower duration and lower anxiety, Treatment 1 leads to lower post-treatment depression. Node 4 indicates that for patients with lower duration and higher anxiety, both treatments yield more or less the same expected outcome. Node 5 indicates that for patients with higher duration, Treatment 2 leads to lower post-treatment depression.

\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.5\textwidth}
<<echo=FALSE, fig=TRUE>>=
plot(lmmt, which = "ranef")
@
\caption{\label{fig:ranef} Random effects for the linear mixed-effects model tree in Figure~\ref{fig:tree}.}
\end{figure}

The predicted random effects are plotted in Figure~\ref{fig:ranef}. On average, patients from cluster 10 have somewhat higher expected post-treatment depression scores, whereas patients from cluster 4 have somewhat lower expected post-treatment depression scores.

Alternatively, we can request caterpillar plots of the estimated node-specific coefficients through specifying \code{which = "tree.coef"}:
%
<<eval=FALSE>>=
plot(lmmt, which = "tree.coef")
@
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.8\textwidth}
<<echo=FALSE, fig=TRUE>>=
plot(lmmt, which = "tree.coef")
@
\caption{\label{fig:tree.coef} Estimated node-specific fixed effects with error bars.}
\end{figure}
%
The plotted results depict the node-specific parameter estimates with error bars ($\pm 1.96$ times the standard error). Note that these standard errors do not account for the searching of the tree structure and are likely too small, but they do allow for gauging the precision of the estimated parameters.

To obtain numerical results, \code{print}, \code{coef}, \code{fixef}, \code{ranef}, and \code{VarCorr} methods are available (results omitted):
%
<<results=hide>>=
print(lmmt)
coef(lmmt)
fixef(lmmt)
ranef(lmmt)
VarCorr(lmmt)
@
%
To obtain predicted values, the \code{predict} method can be used:
%
<<>>=
predict(lmmt, newdata = DepressionDemo[1:7, ])
@
%
When \code{newdata} is not specified, predictions for the training observations are returned, by default. Random effects can be excluded from the predictions by adding \code{re.form = NA}. This is useful, for example, when \code{newdata} is specified, but the new observations do not have a cluster indicator or are from new clusters:
%
<<>>=
predict(lmmt, newdata = DepressionDemo[1:7, -3], re.form = NA)
@

\subsection{Inspecting residuals}

Residuals of the fitted (G)LMM tree can be obtained with the \code{residuals} method. This can be useful for assessing potential misspecification of the model (e.g., heteroscedasticity):
%
<<eval=FALSE>>=
resids <- residuals(lmmt)
preds <- predict(lmmt)
plot(factor(DepressionDemo$cluster), resids)
scatter.smooth(preds, resids)
@
%
\begin{figure}[b!]
\centering
\setkeys{Gin}{width=0.49\textwidth}
<<echo=FALSE, fig=TRUE>>=
resids <- residuals(lmmt)
preds <- predict(lmmt)
plot(factor(DepressionDemo$cluster), resids, xlab = "Cluster",
  ylab = "Residuals")
@
<<echo=FALSE, fig=TRUE>>=
scatter.smooth(preds, resids, xlab = "Predicted values", ylab = "Residuals")
@
\caption{\label{fig:residuals} Residuals of the fitted linear mixed-effects model tree in Figure~\ref{fig:tree}.}
\end{figure}
%
The plotted residuals are depicted in Figure~\ref{fig:residuals}.
The left panel indicates some variation in error variances across the levels of the random effects, but these differences do not appear to be statistically significant:
%
<<>>=
fligner.test(resids ~ DepressionDemo$cluster)
bartlett.test(resids ~ DepressionDemo$cluster)
@
%
The right panel of Figure~\ref{fig:residuals} plots fitted values against residuals and also does not reveal a pattern indicating model misspecification.

\section{Detecting subgroups with different growth trajectories}
\label{sec:growth_curves}

An artificially generated, longitudinal dataset is included in package \pkg{glmertree} and can be loaded as follows:
%
<<>>=
data("GrowthCurveDemo", package = "glmertree")
dim(GrowthCurveDemo)
names(GrowthCurveDemo)
@
%
The dataset contains 1250 repeated measurements from \Sexpr{length(unique(GrowthCurveDemo$person))} individuals. The data are in long format and the response was measured at five timepoints for each individual. The dataset contains \Sexpr{ncol(GrowthCurveDemo)} variables: a continuous response variable (\code{y}), a predictor variable for the linear model (\code{time}, taking values 0 through 4), time-invariant potential partitioning variables (\code{x1} through \code{x8}), and an indicator for person (\code{person}). The data were generated so that \code{x1}, \code{x2} and \code{x3} are true partitioning variables. Furthermore, \code{x1} is a binary variable, while all other potential partitioning variables follow a normal distribution with $\mu = 0$ and $\sigma = 5$. Potential partitioning variables were generated so as to be uncorrelated. Random intercepts and slopes were generated so that the intercept and slope values for persons vary around their node-specific means, following a normal distribution with $\mu = 0$ and $\sigma = \sqrt{2}$ for the intercept and $\sigma = \sqrt{.4}$ for the slope. Errors were uncorrelated and followed a normal distribution with $\mu = 0$ and $\sigma = \sqrt{5}$.

The default fitting procedure employed by functions \code{lmertree()} and \code{glmertree()} assumes that potential partitioning variables are measured at the observation level. In this example, however, the potential partitioning variables are measured at the cluster level (i.e., they are time-invariant covariates), and the observation-level stability tests would likely yield an inflated type~I error rate. We can account for the level of the partitioning variables through specification of the \code{cluster} argument. As a result, parameter stability tests will be performed at the cluster instead of the observation level:
%
<<>>=
gc_tree <- lmertree(y ~ time | person | x1 + x2 + x3 + x4 + x5 + x6 +
  x7 + x8, cluster = person, data = GrowthCurveDemo)
@
%
The first part of the formula (\code{y ~ time}) regresses the response on time. The second part (\code{| person |}) specifies that a random intercept should be estimated with respect to \code{person}. The third part (\code{x1 + ... + x8}) specifies the potential partitioning variables.
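Note that specifying \code{cluster = person} is appropriate here because the partitioning variables are time-invariant within persons. If in doubt, this can be checked directly; for example (an ad-hoc check for \code{x1}, not part of package \pkg{glmertree}):
%
<<eval=FALSE>>=
## Check whether x1 takes a single unique value within each person:
all(tapply(GrowthCurveDemo$x1, GrowthCurveDemo$person,
  function(x) length(unique(x)) == 1))
@
%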
Using the cluster-level stability tests, we obtained a tree with four subgroups (terminal nodes):
%
<<>>=
width(gc_tree$tree)
@
%
Employing the default observation-level stability tests would have yielded a tree with more (possibly spurious) subgroups:
%
<<>>=
gc_obs_tree <- lmertree(y ~ time | person | x1 + x2 + x3 + x4 + x5 +
  x6 + x7 + x8, data = GrowthCurveDemo)
width(gc_obs_tree$tree)
@
%
We can plot the growth-curve tree using the \code{plot} method:
%
<<eval=FALSE>>=
plot(gc_tree, which = "tree", fitted = "marginal")
@
%
Note that we additionally specified \code{which = "tree"}, to obtain a plot of the tree only, and \code{fitted = "marginal"}. The latter specifies that the fitted values (represented by the red lines in the terminal nodes) should be computed by fixing all remaining (fixed- and random-effects) predictor variables at their means (or majority class, for categorical predictors). By default (\code{fitted = "combined"}), fitted values are computed based on the observed values of the remaining (random- and fixed-effects) predictor variables. We employed the marginal approach here, as it yields straight lines plotted in the terminal nodes, which may better reflect the average trajectories.
%
To depict the observed data as growth curves instead of single data points, specify \code{which = "growth"}:
%
<<eval=FALSE>>=
plot(gc_tree, which = "growth")
@
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.8\textwidth}
<<echo=FALSE, fig=TRUE>>=
plot(gc_tree, which = "growth")
@
\caption{\label{fig:gc_tree} Linear mixed-effects model tree with growth curve models in the terminal nodes.}
\end{figure}
%
The resulting plot is depicted in Figure~\ref{fig:gc_tree}. The red lines represent the average trajectory within each terminal node, and the dots represent the observed data values. The plot reveals that the true partitioning variables (\code{x1}, \code{x2} and \code{x3}) were selected for splitting. The fitted models in the terminal nodes (red lines) reveal a decrease in the response variable over time for the left-most subgroup, and an increase for the right-most subgroup. The curves in the two middle subgroups are rather flat, indicating no change over time.

We can also print the values of the estimated coefficients in the terminal nodes:
%
<<eval=FALSE>>=
plot(gc_tree, type = "simple", which = "tree")
@
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.8\textwidth}
<<echo=FALSE, fig=TRUE>>=
plot(gc_tree, type = "simple", which = "tree")
@
\caption{\label{fig:gc_tree_simple} Linear mixed-effects model tree with estimated coefficients printed in the terminal nodes.}
\end{figure}
%
The tree in Figure~\ref{fig:gc_tree_simple} reveals effects of time relatively close to zero for nodes 4 and 6. We can also plot the coefficients with error bars:
%
<<eval=FALSE>>=
plot(gc_tree, which = "tree.coef")
@
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.8\textwidth}
<<echo=FALSE, fig=TRUE>>=
plot(gc_tree, which = "tree.coef")
@
\caption{\label{fig:gc_tree.coefs} Caterpillar plots of estimated coefficients for each of the terminal nodes.}
\end{figure}
%
The standard errors used for creating the error bars in Figure~\ref{fig:gc_tree.coefs} do not account for the searching of the tree structure and may be too small. However, the overlapping error bars for the effect of time in nodes 4 and 6 indicate a non-significant difference between the effects of time in these nodes.
%
The observed data points in Figure~\ref{fig:gc_tree} indicate that the individual observations show substantial variation around the estimated fixed effects.
To obtain estimates of the random-effects and residual variances, we can use the \code{VarCorr} method:
%
<<>>=
varcor <- VarCorr(gc_tree)
varcor
@
%
We can use these variances to obtain an estimate of the intraclass correlation (ICC):
%
<<>>=
res_var <- attr(varcor, "sc")^2
int_var <- as.numeric(varcor$person)
ICC <- int_var / (res_var + int_var)
ICC
@
%
The value of the ICC indicates that about \Sexpr{round(100*ICC, digits = 2)} percent of the variance in the response is accounted for by inter-individual variation.

\subsection{Adding a random slope of time}

Earlier, we specified a model formula with only a random intercept, and thus did not account for possible variation between persons in the effect of time within terminal nodes. To account for such differences, we can incorporate a random slope of time into the model formula:
%
<<>>=
form_s <- formula(paste0("y ~ time | (1 + time | person) | ",
  paste0("x", 1:8, collapse = " + ")))
form_s
@
%
Again, we fit the tree:
%
<<>>=
gc_tree_s <- lmertree(form_s, cluster = person, data = GrowthCurveDemo)
@
%
In this case, we obtained the same tree structure with or without estimating random slopes (Figure~\ref{fig:gc_tree}). This need not be the case with other datasets. At the very least, the estimated random effects can provide us with additional information about variation due to between-person differences in initial levels and growth over time:
%
<<>>=
VarCorr(gc_tree_s)
@
%
Compared to the fitted model with random intercepts only, we see that the residual variance has decreased somewhat.

\newpage

\bibliography{glmertree}

\begin{appendix}

\section[R code for generating artificial motivating dataset]{\proglang{R} code for generating artificial motivating dataset}
\label{app}

Generate the predictor variables and error term:
%
<<>>=
set.seed(123)
treatment <- rbinom(n = 150, size = 1, prob = .5)
duration <- round(rnorm(150, mean = 7, sd = 3))
anxiety <- round(rnorm(150, mean = 10, sd = 3))
age <- round(rnorm(150, mean = 45, sd = 10))
error <- rnorm(150, 0, 2)
@
%
Generate the random intercepts:
%
<<>>=
cluster <- error + rnorm(150, 0, 6)
rand_int <- sort(rep(rnorm(10, 0, 1), each = 15))
rand_int[order(cluster)] <- rand_int
error <- error - rand_int
cluster[order(cluster)] <- rep(1:10, each = 15)
@
%
Generate the treatment subgroups:
%
<<>>=
node3t1 <- ifelse(duration <= 8 & anxiety <= 10 & treatment == 0, -2, 0)
node3t2 <- ifelse(duration <= 8 & anxiety <= 10 & treatment == 1, 2, 0)
node5t1 <- ifelse(duration > 8 & treatment == 0, 2.5, 0)
node5t2 <- ifelse(duration > 8 & treatment == 1, -2.5, 0)
@
%
Generate the continuous and dichotomized outcome variable:
%
<<>>=
depression <- round(9 + node3t1 + node3t2 + node5t1 + node5t2 +
  .4 * treatment + error + rand_int)
depression_bin <- factor(as.numeric(depression > 9))
@
%
Make the treatment indicator a factor and collect everything in a data frame:
%
<<>>=
treatment <- factor(treatment, labels = c("Treatment 1", "Treatment 2"))
DepressionDemo <- data.frame(depression, treatment, cluster, age, anxiety,
  duration, depression_bin)
@

\end{appendix}

\end{document}