Risk Set Matching with rsmatch

Time-varying matching

Any time we are handling observational data (as opposed to that from a randomized study) and trying to measure the effect of some treatment, bias can potentially confound our estimates—undermining the analysis. To reduce this bias, matching attempts to find groups that are similar across measured variables, despite receiving alternate sides of treatment. Matching requires special care when treatment(s) (and related matching variables) happen at multiple multiple times, as in some longitudinal data.

The R package rsmatch implements various forms of matching on time-varying observational studies to help in causal inference.

library(rsmatch)

This short vignette aims to provide enough information on time-varying matching and the rsmatch package for someone to start data analysis. The package implements two different approaches, and we recommend a thorough read-through of the corresponding paper to understand the methodology, potential pitfalls, and other crucial considerations.

OASIS data

We start by showing a data set that can illustrate the structure of a time-varying observational data set.

Consider the oasis data with 51 patients from the Open Access Series of Imaging Studies that are classified at each time point as having Alzheimer’s disease (AD) or not.

data(oasis)
head(oasis, n = 10)
#>    subject_id visit time_of_ad m_f educ ses age mr_delay e_tiv n_wbv   asf
#> 1   OAS2_0002     1         NA   M   12  -1  75        0  1678 0.736 1.046
#> 2   OAS2_0002     2         NA   M   12  -1  76      560  1738 0.713 1.010
#> 3   OAS2_0002     3         NA   M   12  -1  80     1895  1698 0.701 1.034
#> 4   OAS2_0007     1          3   M   16  -1  71        0  1357 0.748 1.293
#> 5   OAS2_0007     3          3   M   16  -1  73      518  1365 0.727 1.286
#> 6   OAS2_0007     4          3   M   16  -1  75     1281  1372 0.710 1.279
#> 7   OAS2_0009     1         NA   M   12   2  68        0  1457 0.806 1.205
#> 8   OAS2_0009     2         NA   M   12   2  69      576  1480 0.791 1.186
#> 9   OAS2_0010     1         NA   F   12   3  66        0  1447 0.769 1.213
#> 10  OAS2_0010     2         NA   F   12   3  68      854  1482 0.752 1.184

The data set has been modified to a “long” time-varying format with the following variables:

Importantly, visits have a similar time difference between them. This sets up a panel data structure. The package does not help with data sets where subject visits happen irregularly or continuously.

This data set illustrates a few of the standard causal inference elements for time-varying matching, including

Unfortunately, the data is missing a key aspect for causal analysis, the outcome on which to measure our treatment effect. Open-source longitudinal data sets are somewhat scarce, and this was the most illustrative example we could find while including in the package. However, this data fully demonstrates the matching process. Applying a set of matches to determine the treatment effect is, thankfully, a typically straightforward process.

Our goal is to match subjects similar in their gender, education level, socioeconomic status, age, and other MRI measurements, but one of which moved from a cognitively normal state to AD in the next period and the other who remained cognitively normal. On these similar groups, we can hopefully isolate the effect of an AD diagnosis on an outcome of interest.

Balanced Risk Set Matching

Based on the work of Li et al. (2001), the brsmatch() function performs “Balanced Risk Set Matching”. Given our data structure, it finds matches based on minimizing the Mahalanobis distance between covariates. If balancing is desired, the model will try to minimize the imbalance in terms of specified balancing covariates in the final pair output. Each treated individual is matched to one other individual. Full details are provided in the available reference.

For example, we can find 5 matched pairs based on all covariates with balance across gender and age:

pairs <- brsmatch(
  n_pairs = 5,
  data = oasis,
  id = "subject_id", 
  time = "visit", 
  trt_time = "time_of_ad",
  covariates = c("m_f", "educ", "ses", "age", 
                 "mr_delay", "e_tiv", "n_wbv", "asf"),
  balance = TRUE, 
  balance_covariates = c("m_f", "age")
)

na.omit(pairs)
#>    subject_id pair_id type
#> 5   OAS2_0014       1  trt
#> 15  OAS2_0043       2  all
#> 22  OAS2_0079       2  trt
#> 32  OAS2_0112       3  all
#> 36  OAS2_0124       1  all
#> 40  OAS2_0140       5  all
#> 41  OAS2_0150       3  trt
#> 43  OAS2_0160       4  trt
#> 49  OAS2_0182       4  all
#> 50  OAS2_0184       5  trt

The output includes a long-format data frame with subject_id, pair_id, and type to indicate whether the subject was considered in the treatment (trt) group, or the control (all) group for the indicated time point. The type variable is helpful because subjects who got AD could match in either the treatment or control group depending on which visit they were matched in. brsmatch() considers all of the possibilities in finding an optimal match. The output will include all subjects, with an NA value for pair_id and type for unmatched subjects.

Taking, for example, the first pair, OAS2_0014 to OAS2_0124, we know that the matching happened at visit 2, and in that visit, the two subjects have very similar covariates.

first_pair <- pairs[which(pairs$pair_id == 1), "subject_id"]
oasis[which(oasis$subject_id %in% first_pair), ]
#>    subject_id visit time_of_ad m_f educ ses age mr_delay e_tiv n_wbv   asf
#> 11  OAS2_0014     1          2   M   16   3  76        0  1602 0.697 1.096
#> 12  OAS2_0014     2          2   M   16   3  77      504  1590 0.696 1.104
#> 80  OAS2_0124     1         NA   M   16   3  70        0  1463 0.749 1.200
#> 81  OAS2_0124     2         NA   M   16   3  71      472  1479 0.750 1.187

Full details of the brsmatch() function’s arguments can be found with ?brsmatch. One option not used in this example allows for exact matching constraints. These can be very useful to improve the optimization speed for large data sets.

Propensity Score Matching with Time-Dependent Covariates

Based on the work of Lu (2005) “Propensity Score Matching with Time-Dependent Covariates”, the coxpsmatch() function matches subjects based on time-dependent propensity scores from a Cox proportional hazards model. If balancing is desired, the model will try to minimize the imbalance in terms of specified balancing covariates in the final pair output. Each treated individual is matched to one other individual.

The coxpsmatch() interface is similar to brsmatch() in both inputs and outputs as the following example shows. However, coxpsmatch() can pair all individuals if desired and does not allow (or need) the option of separate balance variables.

pairs <- coxpsmatch(
  n_pairs = 5,
  data = oasis,
  id = "subject_id", 
  time = "visit", 
  trt_time = "time_of_ad",
  covariates = c("m_f", "educ", "ses", "age", 
                 "mr_delay", "e_tiv", "n_wbv", "asf")
)

na.omit(pairs)
#>    subject_id pair_id type
#> 3   OAS2_0009       1  all
#> 5   OAS2_0014       2  trt
#> 16  OAS2_0046       1  trt
#> 29  OAS2_0104       5  trt
#> 32  OAS2_0112       5  all
#> 34  OAS2_0114       4  trt
#> 36  OAS2_0124       2  all
#> 39  OAS2_0139       4  all
#> 41  OAS2_0150       3  trt
#> 49  OAS2_0182       3  all

In this example, OAS2_0014 still matches to OAS2_0124, but other matches differ.

second_pair <- pairs[which(pairs$pair_id == 2), "subject_id"]
oasis[which(oasis$subject_id %in% second_pair), ]
#> # A tibble: 4 × 11
#>   subject_id visit time_of_ad m_f    educ ses     age mr_delay e_tiv n_wbv   asf
#>   <chr>      <int>      <dbl> <chr> <int> <fct> <int>    <int> <int> <dbl> <dbl>
#> 1 OAS2_0014      1          2 M        16 3        76        0  1602 0.697  1.10
#> 2 OAS2_0014      2          2 M        16 3        77      504  1590 0.696  1.10
#> 3 OAS2_0124      1         NA M        16 3        70        0  1463 0.749  1.2 
#> 4 OAS2_0124      2         NA M        16 3        71      472  1479 0.75   1.19

Going further

This quick vignette provides a primer to get you started with time-varying matching methods. Some necessary concepts not covered include:

References

Li, Yunfei Paul, Kathleen J Propert, and Paul R Rosenbaum. 2001. “Balanced Risk Set Matching.” Journal of the American Statistical Association 96 (455): 870–82.

Lu, Bo. 2005. “Propensity Score Matching with Time-Dependent Covariates.” Biometrics 61 (3): 721–28.