The MRTAnalysis package now supports estimation of the distal causal excursion effect (DCEE) of a continuous distal outcome in micro-randomized trials (MRTs), using the function dcee().
Distal outcomes are measured once at the end of the study (e.g., weight loss, cognitive score), in contrast to proximal outcomes, which are measured repeatedly after each treatment decision point.
This vignette introduces the dcee() function, which estimates the DCEE in an MRT with a continuous distal outcome.

In a distal-outcome MRT, each participant \(i\) is observed over a sequence of decision points \(t\). At decision point \(t\), one observes covariates \(X_{it}\), an availability indicator \(I_{it}\), and a treatment \(A_{it}\) randomized with probability \(p_{it}\); a single distal outcome \(Y_i\) is measured at the end of the study. Thus, each row in the long-format data corresponds to \((X_{it}, A_{it}, I_{it}, p_{it})\), with \(Y_i\) constant within each participant.
The distal causal excursion effects are defined using potential outcomes in Qian (2025). Roughly speaking, the DCEE at decision point \(t\) is the difference in the outcome \(Y_i\) due to assigning treatment \(A_{it}=1\) versus \(A_{it}=0\) at time \(t\), while keeping the past and future treatment assignments according to the randomization probabilities in the MRT (i.e., the MRT policy), and averaging over the covariate history and availability at \(t\).
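For readers who prefer a formula, one informal way to sketch this estimand is the following (the notation here is illustrative and not taken verbatim from Qian (2025)):

\[
\beta(t) = E\left[\, Y_i\big(\bar{A}_{t-1},\, A_t = 1,\, \text{MRT policy after } t\big) - Y_i\big(\bar{A}_{t-1},\, A_t = 0,\, \text{MRT policy after } t\big) \,\middle|\, I_{it} = 1 \right],
\]

where \(\bar{A}_{t-1}\) denotes the treatments before \(t\) assigned under the MRT policy, "MRT policy after \(t\)" means that future treatments follow the randomization probabilities, and conditioning on \(I_{it} = 1\) restricts attention to decision points at which the participant is available; the expectation also averages over the covariate history at \(t\).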
This package provides data_distal_continuous, a synthetic dataset with:

- userid: participant id.
- dp: decision point index.
- X: continuous endogenous covariate.
- Z: binary endogenous covariate.
- avail: availability indicator.
- A: treatment indicator.
- prob_A: randomization probability.
- A_lag1: lag-1 treatment.
- Y: continuous distal outcome, identical across rows for the same userid.

library(MRTAnalysis)
current_options <- options(digits = 3) # save current options for restoring later
head(data_distal_continuous, 10)
#> userid dp X Z avail A prob_A A_lag1 Y
#> 1 1 1 -1.0605 0 0 0 0.100 0 103
#> 2 1 2 -1.6822 0 0 0 0.100 0 103
#> 3 1 3 -1.7167 0 1 0 0.100 0 103
#> 4 1 4 -1.9478 0 0 0 0.100 0 103
#> 5 1 5 -2.0758 1 0 0 0.411 0 103
#> 6 1 6 -0.5967 0 1 0 0.142 0 103
#> 7 1 7 0.7401 0 1 0 0.216 0 103
#> 8 1 8 1.8208 1 1 1 0.758 0 103
#> 9 1 9 -0.0854 1 1 0 0.639 1 103
#> 10 1 10 2.6413 1 1 1 0.824 0 103
dcee()
In the following call to dcee(), we specify the distal outcome variable by outcome = "Y", the treatment variable by treatment = "A", and the time-varying randomization probability by rand_prob = "prob_A". We specify the fully marginal effect as the quantity to be estimated by setting moderator_formula = ~1. We use X and Z as control variables by setting control_formula = ~ X + Z. We specify the availability variable by availability = "avail". We use linear regression for the control regression model (i.e., the Stage-1 nuisance models in the two-stage estimation procedure in Qian (2025)) by setting control_reg_method = "lm".
Note that the estimator for the distal causal excursion effect is consistent even if the control regression model is mis-specified, as long as the treatment randomization probabilities are correctly specified (which will be the case for MRTs). Different control regression methods can be used to improve efficiency.
fit_lm <- dcee(
data = data_distal_continuous,
id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
moderator_formula = ~1,
control_formula = ~ X + Z,
availability = "avail",
control_reg_method = "lm"
)
#> [dcee] Learner: lm | Moderators: <marginal> | Controls: X, Z
summary(fit_lm)
#>
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y",
#> treatment = "A", rand_prob = "prob_A", moderator_formula = ~1,
#> control_formula = ~X + Z, availability = "avail", control_reg_method = "lm")
#>
#> Inference: small-sample t; df = 49
#> Confidence level: 95%
#>
#> Distal causal excursion effect (beta):
#> Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept 0.404 -0.771 1.579 0.585 0.691 49 0.49
The summary()
function provides the estimated distal
causal excursion effect as well as the 95% confidence interval, standard
error, and p-value. The only row in the output
Distal causal excursion effect (beta)
is named
Intercept
, indicating that this is the fully marginal
effect (like an intercept in the causal effect model). In particular,
the estimated marginal distal excursion effect is 0.404, with 95%
confidence interval (-0.771, 1.579), and p-value 0.49. The confidence
interval and the p-value are based on t-quantiles.
The following code uses dcee()
to estimate the distal
causal excursion effect moderated by the time-varying covariate
Z
. This is achieved by setting
moderator_formula = ~ Z
.
fit_mod <- dcee(
data = data_distal_continuous,
id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
moderator_formula = ~Z,
control_formula = ~ Z + X,
availability = "avail",
control_reg_method = "lm"
)
#> [dcee] Learner: lm | Moderators: Z | Controls: Z, X
summary(fit_mod, lincomb = c(1, 1)) # beta0 + beta1
#>
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y",
#> treatment = "A", rand_prob = "prob_A", moderator_formula = ~Z,
#> control_formula = ~Z + X, availability = "avail", control_reg_method = "lm")
#>
#> Inference: small-sample t; df = 48
#> Confidence level: 95%
#>
#> Distal causal excursion effect (beta):
#> Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept -0.0968 -1.4528 1.2592 0.6744 -0.1436 48 0.89
#> Z 1.2100 -0.9299 3.3500 1.0643 1.1369 48 0.26
#>
#> Linear combinations (L * beta):
#> Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> L1 1.113 -0.719 2.946 0.911 1.222 48 0.23
In the above, we used the optional lincomb argument to ask summary() to calculate and print the estimate of \(\beta_0 + \beta_1\), the distal causal excursion effect when the binary moderator \(Z\) takes value 1. Setting lincomb = c(1, 1) asks summary() to print out \([1, 1] \times (\beta_0, \beta_1)^T = \beta_0 + \beta_1\). The table under Linear combinations (L * beta) is the fitted result for this combination of coefficients.
One can use generalized additive models (GAM) for the control
regression models by setting control_reg_method = "gam"
.
This may improve efficiency if the relationship between the distal
outcome and the covariates is non-linear. One can use s()
to specify non-linear terms in the control_formula
. For
example, here we use a smooth term for the continuous covariate
X
, by setting
control_formula = ~ s(X) + Z
.
fit_gam <- dcee(
data = data_distal_continuous,
id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
moderator_formula = ~Z,
control_formula = ~ s(X) + Z,
availability = "avail",
control_reg_method = "gam"
)
#> [dcee] Learner: gam | Moderators: Z | Controls: s(X), Z
summary(fit_gam)
#>
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y",
#> treatment = "A", rand_prob = "prob_A", moderator_formula = ~Z,
#> control_formula = ~s(X) + Z, availability = "avail", control_reg_method = "gam")
#>
#> Inference: small-sample t; df = 48
#> Confidence level: 95%
#>
#> Distal causal excursion effect (beta):
#> Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept -0.0968 -1.4528 1.2592 0.6744 -0.1436 48 0.89
#> Z 1.2100 -0.9299 3.3500 1.0643 1.1369 48 0.26
One can also use tree-based methods for the control regression models
by setting control_reg_method = "rf"
(random forest via
randomForest
package) or
control_reg_method = "ranger"
(faster random forest via
ranger
package). This may improve efficiency if the
relationship between the distal outcome and the covariates is complex.
Note that tree-based methods do not allow specification of smooth terms like s(X); the control_formula must be specified using main terms only. Additional optional arguments can be passed to the underlying random forest function via the ... argument of dcee(), which is not shown in this example.
fit_rf <- dcee(
data = data_distal_continuous,
id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
moderator_formula = ~1,
control_formula = ~ X + Z,
availability = "avail",
control_reg_method = "rf" # can replace "rf" with "ranger" for faster implementation
)
#> [dcee] Learner: rf | Moderators: <marginal> | Controls: X, Z
summary(fit_rf)
#>
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y",
#> treatment = "A", rand_prob = "prob_A", moderator_formula = ~1,
#> control_formula = ~X + Z, availability = "avail", control_reg_method = "rf")
#>
#> Inference: small-sample t; df = 49
#> Confidence level: 95%
#>
#> Distal causal excursion effect (beta):
#> Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept 0.306 -0.787 1.399 0.544 0.563 49 0.58
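As an aside, here is a hedged sketch of passing a tuning argument through the ... argument of dcee(). The vignette states that extra arguments are forwarded to the underlying random forest function; ntree is a real randomForest() argument, but whether a particular argument is accepted downstream should be checked against ?dcee.

```r
# Sketch: forwarding a randomForest tuning parameter through `...`.
# Assumes dcee() forwards unknown arguments (here, ntree) to
# randomForest(); consult the dcee() documentation before relying on this.
fit_rf_tuned <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "rf",
    ntree = 1000 # passed through `...` to the random forest learner
)
```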
The dcee()
function also supports cross-fitting, which
may lead to improved finite sample performance when using complex
machine learning methods for the control regression models. This is done
by setting cross_fit = TRUE
and specifying the number of
folds via cf_fold
. Here we use 5-fold cross-fitting with
generalized additive models for the control regression models as an
example. The particular cross-fitting algorithm follows Section 4 in the
Web Appendix of Zhong and others (2021).
fit_cf <- dcee(
data = data_distal_continuous,
id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
moderator_formula = ~1,
control_formula = ~ X + Z,
availability = "avail",
control_reg_method = "gam",
cross_fit = TRUE, cf_fold = 5
)
#> [dcee] Learner: gam | Moderators: <marginal> | Controls: X, Z
#> [dcee] Cross-fitting enabled with 5 folds
summary(fit_cf)
#>
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y",
#> treatment = "A", rand_prob = "prob_A", moderator_formula = ~1,
#> control_formula = ~X + Z, availability = "avail", control_reg_method = "gam",
#> cross_fit = TRUE, cf_fold = 5)
#>
#> Inference: small-sample t; df = 49
#> Confidence level: 95%
#>
#> Distal causal excursion effect (beta):
#> Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept 0.468 -0.703 1.638 0.583 0.803 49 0.43
We can set show_control_fit = TRUE in the summary() function to inspect the control regression (i.e., Stage-1 nuisance) model fits. This is useful for diagnosing the fit of the control regression models. For lm/gam fits, these are standard regression summaries; for tree-based or SuperLearner fits, the original learner output is shown. To inspect the control regression model fits further, one can examine $fit$regfit_a0 and $fit$regfit_a1 directly.
summary(fit_lm, show_control_fit = TRUE)
#>
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y",
#> treatment = "A", rand_prob = "prob_A", moderator_formula = ~1,
#> control_formula = ~X + Z, availability = "avail", control_reg_method = "lm")
#>
#> Inference: small-sample t; df = 49
#> Confidence level: 95%
#>
#> Distal causal excursion effect (beta):
#> Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept 0.404 -0.771 1.579 0.585 0.691 49 0.49
#>
#> Stage-1 nuisance fits:
#>
#> [A = 0] Outcome model
#> Method: lm | Formula: Y ~ X + Z | Subset: A == 0 | #person-decision-points = 996
#>
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -35.11 -9.33 -0.62 9.48 37.26
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 97.606 0.558 174.87 < 2e-16 ***
#> X 1.340 0.348 3.84 0.00013 ***
#> Z 2.817 0.900 3.13 0.00180 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 12.8 on 993 degrees of freedom
#> Multiple R-squared: 0.0256, Adjusted R-squared: 0.0236
#> F-statistic: 13 on 2 and 993 DF, p-value: 2.62e-06
#>
#>
#> [A = 1] Outcome model
#> Method: lm | Formula: Y ~ X + Z | Subset: A == 1 | #person-decision-points = 504
#>
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -41.64 -8.86 0.57 8.27 34.98
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 99.292 1.023 97.10 < 2e-16 ***
#> X 4.111 0.518 7.94 1.3e-14 ***
#> Z 4.850 1.208 4.01 6.9e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 12.8 on 501 degrees of freedom
#> Multiple R-squared: 0.142, Adjusted R-squared: 0.138
#> F-statistic: 41.4 on 2 and 501 DF, p-value: <2e-16
#>
#>
#> Note: If cross_fit = TRUE in dcee(), these are last-fold models for inspection only.
#>
#> For full details, inspect $fit$regfit_a0 and $fit$regfit_a1.
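For example, the Stage-1 nuisance fits can be pulled out of the fitted object directly; a brief sketch, assuming the fit_lm object from earlier in this vignette (for the "lm" learner, each slot should be an ordinary lm fit, so the usual accessors apply):

```r
# Directly inspect the Stage-1 nuisance fits stored on the fitted object.
summary(fit_lm$fit$regfit_a0) # model fit on rows with A == 0
coef(fit_lm$fit$regfit_a1)    # coefficients of the A == 1 model
```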