Exploratory Analysis for Micro-Randomized Trial (MRT): Continuous Distal Outcomes

Tianchen Qian (t.qian@uci.edu)

2025-08-28

Introduction

The MRTAnalysis package now supports analysis of distal causal excursion effects (DCEEs) on a continuous distal outcome in micro-randomized trials (MRTs), using the function dcee().
Distal outcomes are measured once at the end of the study (e.g., weight loss, cognitive score), in contrast to proximal outcomes, which are measured repeatedly after each treatment decision point.

This vignette introduces:

- the data structure of an MRT with a distal outcome;
- the definition of the distal causal excursion effect (DCEE);
- how to use dcee() to estimate fully marginal and moderated effects, with various control regression methods (lm, gam, rf, ranger), cross-fitting, and inspection of the Stage-1 fits.

Data Structure of MRT with Distal Outcomes

In a distal-outcome MRT:

- At each decision point \(t\), we observe for participant \(i\) the time-varying covariates \(X_{it}\), an availability indicator \(I_{it}\), the randomization probability \(p_{it}\), and the assigned treatment \(A_{it}\).
- The distal outcome \(Y_i\) is measured once per participant at the end of the study.

Thus, each row in the long-format data corresponds to \((X_{it}, A_{it}, I_{it}, p_{it})\), with \(Y_i\) constant within each participant.

Distal Causal Excursion Effects

The distal causal excursion effects are defined using potential outcomes in Qian (2025). Roughly speaking, the DCEE at decision point \(t\) is the difference in the outcome \(Y_i\) due to assigning treatment \(A_{it}=1\) versus \(A_{it}=0\) at time \(t\), while keeping the past and future treatment assignments according to the randomization probabilities in the MRT (i.e., the MRT policy), and averaging over the covariate history and availability at \(t\).
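Schematically (an informal rendering; see Qian (2025) for the formal potential-outcomes definition), the DCEE at decision point \(t\) can be written as

```latex
\mathrm{DCEE}(t) \;=\; E\!\left[ Y_i\bigl(\bar{A}_{i,t-1},\, A_{it}=1,\, \underline{A}_{i,t+1}\bigr) \;-\; Y_i\bigl(\bar{A}_{i,t-1},\, A_{it}=0,\, \underline{A}_{i,t+1}\bigr) \right],
```

where the past treatments \(\bar{A}_{i,t-1}\) and the future treatments \(\underline{A}_{i,t+1}\) are assigned according to the MRT randomization probabilities, and the expectation averages over the covariate history and availability at \(t\).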

Example Dataset

This package provides data_distal_continuous, a synthetic dataset with:

- 50 participants (userid), each with 30 decision points (dp);
- a continuous time-varying covariate X and a binary time-varying covariate Z;
- an availability indicator avail, treatment A, its randomization probability prob_A, and the lagged treatment A_lag1;
- a continuous distal outcome Y, constant within each participant.

library(MRTAnalysis)
current_options <- options(digits = 3) # save current options for restoring later
head(data_distal_continuous, 10)
#>    userid dp       X Z avail A prob_A A_lag1   Y
#> 1       1  1 -1.0605 0     0 0  0.100      0 103
#> 2       1  2 -1.6822 0     0 0  0.100      0 103
#> 3       1  3 -1.7167 0     1 0  0.100      0 103
#> 4       1  4 -1.9478 0     0 0  0.100      0 103
#> 5       1  5 -2.0758 1     0 0  0.411      0 103
#> 6       1  6 -0.5967 0     1 0  0.142      0 103
#> 7       1  7  0.7401 0     1 0  0.216      0 103
#> 8       1  8  1.8208 1     1 1  0.758      0 103
#> 9       1  9 -0.0854 1     1 0  0.639      1 103
#> 10      1 10  2.6413 1     1 1  0.824      0 103

Using dcee()

Fully Marginal Effect (no moderators)

In the following call to dcee(), we specify the distal outcome variable by outcome = "Y", the treatment variable by treatment = "A", and the time-varying randomization probability by rand_prob = "prob_A". Setting moderator_formula = ~1 targets the fully marginal effect. We adjust for the two control variables X and Z by setting control_formula = ~ X + Z, and we specify the availability variable by availability = "avail". Finally, control_reg_method = "lm" uses linear regression for the control regression model (i.e., the Stage-1 nuisance models in the two-stage estimation procedure of Qian (2025)).

Note that the estimator for the distal causal excursion effect is consistent even if the control regression model is mis-specified, as long as the treatment randomization probabilities are correctly specified (which is always the case in an MRT, where the probabilities are set by design). Different control regression methods can be used to improve efficiency.

fit_lm <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "lm"
)
#> [dcee] Learner: lm | Moderators: <marginal> | Controls: X, Z
summary(fit_lm)
#> 
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y", 
#>     treatment = "A", rand_prob = "prob_A", moderator_formula = ~1, 
#>     control_formula = ~X + Z, availability = "avail", control_reg_method = "lm")
#> 
#> Inference: small-sample t; df = 49
#> Confidence level: 95%
#> 
#> Distal causal excursion effect (beta):
#>           Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept    0.404  -0.771   1.579      0.585   0.691 49           0.49

The summary() function provides the estimated distal causal excursion effect as well as the 95% confidence interval, standard error, and p-value. The only row in the output Distal causal excursion effect (beta) is named Intercept, indicating that this is the fully marginal effect (like an intercept in the causal effect model). In particular, the estimated marginal distal excursion effect is 0.404, with 95% confidence interval (-0.771, 1.579), and p-value 0.49. The confidence interval and the p-value are based on t-quantiles.

Moderated Effect

The following code uses dcee() to estimate the distal causal excursion effect moderated by the time-varying covariate Z. This is achieved by setting moderator_formula = ~ Z.

fit_mod <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~Z,
    control_formula = ~ Z + X,
    availability = "avail",
    control_reg_method = "lm"
)
#> [dcee] Learner: lm | Moderators: Z | Controls: Z, X
summary(fit_mod, lincomb = c(1, 1)) # beta0 + beta1
#> 
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y", 
#>     treatment = "A", rand_prob = "prob_A", moderator_formula = ~Z, 
#>     control_formula = ~Z + X, availability = "avail", control_reg_method = "lm")
#> 
#> Inference: small-sample t; df = 48
#> Confidence level: 95%
#> 
#> Distal causal excursion effect (beta):
#>           Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept  -0.0968 -1.4528  1.2592     0.6744 -0.1436 48           0.89
#> Z           1.2100 -0.9299  3.3500     1.0643  1.1369 48           0.26
#> 
#> Linear combinations (L * beta):
#>    Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> L1    1.113  -0.719   2.946      0.911   1.222 48           0.23

In the above, we asked summary() to calculate and print the estimate of \(\beta_0 + \beta_1\), the distal causal excursion effect when the binary moderator \(Z\) takes value 1, through the optional lincomb argument. Setting lincomb = c(1, 1) asks summary() to print \([1, 1] \times (\beta_0, \beta_1)^T = \beta_0 + \beta_1\). The table under Linear combinations (L * beta) shows the fitted result for this coefficient combination.
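As a quick sanity check on the output above, the L1 point estimate equals the sum of the two coefficient estimates, while its standard error is computed from the full covariance matrix of \(\hat{\beta}\) rather than by adding the individual standard errors:

```latex
L\hat{\beta} = \hat{\beta}_0 + \hat{\beta}_1 = -0.0968 + 1.2100 = 1.1132 \approx 1.113,
\qquad
\widehat{\mathrm{SE}}(L\hat{\beta}) = \sqrt{L\, \widehat{\mathrm{Cov}}(\hat{\beta})\, L^\top}.
```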

GAM nuisance models

One can use generalized additive models (GAM) for the control regression models by setting control_reg_method = "gam". This may improve efficiency if the relationship between the distal outcome and the covariates is non-linear. One can use s() to specify non-linear terms in the control_formula. For example, here we use a smooth term for the continuous covariate X, by setting control_formula = ~ s(X) + Z.

fit_gam <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~Z,
    control_formula = ~ s(X) + Z,
    availability = "avail",
    control_reg_method = "gam"
)
#> [dcee] Learner: gam | Moderators: Z | Controls: s(X), Z
summary(fit_gam)
#> 
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y", 
#>     treatment = "A", rand_prob = "prob_A", moderator_formula = ~Z, 
#>     control_formula = ~s(X) + Z, availability = "avail", control_reg_method = "gam")
#> 
#> Inference: small-sample t; df = 48
#> Confidence level: 95%
#> 
#> Distal causal excursion effect (beta):
#>           Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept  -0.0968 -1.4528  1.2592     0.6744 -0.1436 48           0.89
#> Z           1.2100 -0.9299  3.3500     1.0643  1.1369 48           0.26

Random Forest / Ranger nuisance models

One can also use tree-based methods for the control regression models by setting control_reg_method = "rf" (random forest via the randomForest package) or control_reg_method = "ranger" (a faster random forest via the ranger package). This may improve efficiency if the relationship between the distal outcome and the covariates is complex. Note that tree-based methods do not allow smooth terms like s(X); the control_formula must be specified using main terms only. Additional optional arguments can be passed to the underlying random forest function via the ... argument of dcee(), which is not shown in this example.

fit_rf <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "rf" # can replace "rf" with "ranger" for faster implementation
)
#> [dcee] Learner: rf | Moderators: <marginal> | Controls: X, Z
summary(fit_rf)
#> 
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y", 
#>     treatment = "A", rand_prob = "prob_A", moderator_formula = ~1, 
#>     control_formula = ~X + Z, availability = "avail", control_reg_method = "rf")
#> 
#> Inference: small-sample t; df = 49
#> Confidence level: 95%
#> 
#> Distal causal excursion effect (beta):
#>           Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept    0.306  -0.787   1.399      0.544   0.563 49           0.58
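As a hypothetical illustration of passing extra arguments to the underlying learner, the call below forwards num.trees (ranger's own argument for the number of trees) through the ... argument of dcee(); whether a particular option is accepted should be checked against ?dcee and ?ranger::ranger.

```r
# Hypothetical sketch: forward `num.trees` to ranger::ranger() via `...`.
fit_ranger <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "ranger",
    num.trees = 1000 # passed through to ranger::ranger()
)
summary(fit_ranger)
```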

Cross-Fitting

The dcee() function also supports cross-fitting, which may lead to improved finite sample performance when using complex machine learning methods for the control regression models. This is done by setting cross_fit = TRUE and specifying the number of folds via cf_fold. Here we use 5-fold cross-fitting with generalized additive models for the control regression models as an example. The particular cross-fitting algorithm follows Section 4 in the Web Appendix of Zhong and others (2021).

fit_cf <- dcee(
    data = data_distal_continuous,
    id = "userid", outcome = "Y", treatment = "A", rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "gam",
    cross_fit = TRUE, cf_fold = 5
)
#> [dcee] Learner: gam | Moderators: <marginal> | Controls: X, Z
#> [dcee] Cross-fitting enabled with 5 folds
summary(fit_cf)
#> 
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y", 
#>     treatment = "A", rand_prob = "prob_A", moderator_formula = ~1, 
#>     control_formula = ~X + Z, availability = "avail", control_reg_method = "gam", 
#>     cross_fit = TRUE, cf_fold = 5)
#> 
#> Inference: small-sample t; df = 49
#> Confidence level: 95%
#> 
#> Distal causal excursion effect (beta):
#>           Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept    0.468  -0.703   1.638      0.583   0.803 49           0.43
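Conceptually, cross-fitting splits the sample into folds, fits the Stage-1 nuisance models on all folds but one, and forms nuisance predictions on the held-out fold. The following is an illustrative sketch only, assuming participant-level fold splitting; it is not the package's internal code, whose exact algorithm follows Section 4 in the Web Appendix of Zhong and others (2021).

```r
# Illustrative sketch of participant-level K-fold cross-fitting
# (assumption: folds formed at the participant level; NOT dcee()'s internals).
K <- 5
ids <- unique(data_distal_continuous$userid)
fold_of <- sample(rep(1:K, length.out = length(ids)))
names(fold_of) <- ids

for (k in 1:K) {
    held_out <- ids[fold_of[as.character(ids)] == k]
    train <- subset(data_distal_continuous, !(userid %in% held_out))
    test <- subset(data_distal_continuous, userid %in% held_out)
    # Fit the Stage-1 nuisance models (e.g., gam) on `train`,
    # then predict on `test`; the Stage-2 estimating equation
    # combines the out-of-fold predictions across all K folds.
}
```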

Inspecting Stage-1 Fits

We can set show_control_fit = TRUE in the summary() function to inspect the control regression (i.e., Stage-1 nuisance) model fits. This is useful for diagnosing the control regression models. For lm/gam fits, regression summaries are shown; for tree-based or SuperLearner fits, the original learner output is shown. To inspect the fits further, one can examine $fit$regfit_a0 and $fit$regfit_a1 directly.

summary(fit_lm, show_control_fit = TRUE)
#> 
#> Call:
#> dcee(data = data_distal_continuous, id = "userid", outcome = "Y", 
#>     treatment = "A", rand_prob = "prob_A", moderator_formula = ~1, 
#>     control_formula = ~X + Z, availability = "avail", control_reg_method = "lm")
#> 
#> Inference: small-sample t; df = 49
#> Confidence level: 95%
#> 
#> Distal causal excursion effect (beta):
#>           Estimate 95% LCL 95% UCL Std. Error t value df Pr(>|t value|)
#> Intercept    0.404  -0.771   1.579      0.585   0.691 49           0.49
#> 
#> Stage-1 nuisance fits:
#> 
#>   [A = 0] Outcome model
#>   Method: lm | Formula: Y ~ X + Z | Subset: A == 0 | #person-decision-points = 996
#>   
#>   
#>   Residuals:
#>      Min     1Q Median     3Q    Max 
#>   -35.11  -9.33  -0.62   9.48  37.26 
#>   
#>   Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#>   (Intercept)   97.606      0.558  174.87  < 2e-16 ***
#>   X              1.340      0.348    3.84  0.00013 ***
#>   Z              2.817      0.900    3.13  0.00180 ** 
#>   ---
#>   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>   
#>   Residual standard error: 12.8 on 993 degrees of freedom
#>   Multiple R-squared:  0.0256,   Adjusted R-squared:  0.0236 
#>   F-statistic:   13 on 2 and 993 DF,  p-value: 2.62e-06
#>   
#> 
#>   [A = 1] Outcome model
#>   Method: lm | Formula: Y ~ X + Z | Subset: A == 1 | #person-decision-points = 504
#>   
#>   
#>   Residuals:
#>      Min     1Q Median     3Q    Max 
#>   -41.64  -8.86   0.57   8.27  34.98 
#>   
#>   Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#>   (Intercept)   99.292      1.023   97.10  < 2e-16 ***
#>   X              4.111      0.518    7.94  1.3e-14 ***
#>   Z              4.850      1.208    4.01  6.9e-05 ***
#>   ---
#>   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>   
#>   Residual standard error: 12.8 on 501 degrees of freedom
#>   Multiple R-squared:  0.142,    Adjusted R-squared:  0.138 
#>   F-statistic: 41.4 on 2 and 501 DF,  p-value: <2e-16
#>   
#> 
#>   Note: If cross_fit = TRUE in dcee(), these are last-fold models for inspection only.
#> 
#>   For full details, inspect $fit$regfit_a0 and $fit$regfit_a1.

References

Qian, T. (2025). Distal causal excursion effects: Modeling long-term effects of time-varying treatments in micro-randomized trials. arXiv preprint arXiv:2502.13500.
Zhong, Y., Kennedy, E. H., Bodnar, L. M. and Naimi, A. I. (2021). AIPW: An R package for augmented inverse probability–weighted estimation of average causal effects. American Journal of Epidemiology 190, 2690–2699.