The twangRDC R package is a streamlined version of the
twang R package that was created specifically for use in the Federal Statistical Research Data Centers (FSRDCs), which are restricted data enclaves run by the Census Bureau to promote research use of data products from the Census and other federal statistical agencies. In particular, the twangRDC package contains functions to address linkage failures, which may systematically affect some records creating bias in population estimates. Further, the same functions can be used to ensure that a comparison group is equivalent to a treatment group on observable characteristics.
The package utilizes gradient boosted models to estimate weights for linkage failures and comparison group construction. Results using twangRDC will not necessarily be reproduced using
twang due to important differences in implementation. The twangRDC package allows for much larger datasets, like those used in population sciences, and many more covariates, but users should note that with smaller datasets, the
twang package is computationally more efficient. The twangRDC package uses xgboost for gradient boosting, which provides users with an optimized gradient boosting library that can utilize parallel computation within the FSRDCs.
The twangRDC package uses gradient boosted models to derive weights for nonequivalent groups. The algorithm alternates between adding iterations to the gradient boosted model (increasing its complexity) and evaluating balance between the two groups on the covariates. The algorithm automatically stops when additional iterations no longer improve balance. The package allows the user to generate weights for two different scenarios: linkage failures and the construction of a comparison group.
The Census Bureau’s Person Identification Validation System (PVS) facilitates data linkage by assigning unique person identifiers, known as Protected Identification Keys (PIKs) to federal, commercial, census, and survey data. The PVS uses probabilistic matching to assign a unique PIK for each person based on Personally Identifiable Information (PII) from secure reference files containing data from the Social Security Administration Numerical Identification file and other federal files. PIKs are linking keys that can be used to match person records across datasets1. For example, using a PIK, a respondent in the 2000 Decennial Census can be linked to their response in the 2010 Decennial Census. PIKs are assigned through a probabilistic matching process that links survey or census records containing Personally Identifiable Information (PII) like name, address, date of birth and place of birth, to a Social Security reference file.
Not all survey and census records can be linked to the reference file and as such not all records will be assigned a PIK. Linkage failures may occur if the PII used to match to the reference file is of insufficient quality (e.g., missing information), has insufficiently unique identifying information (e.g., a common name such as John Smith), or if an individual has no matching record in the Social Security reference file (e.g., undocumented residents). PIKs can be assigned for the large majority of 2000 Decennial Census records. A Decennial Census record must have a PIK in order to link it to administrative records that contain other information, such as mortality or annual residence.
The twangRDC package can generate weights to account for linkage failures such that observations with PIKs are representative of the population. It does this by using standard nonresponse weights, where nonresponse is defined by linkage failure. This is achieved by weighting observations with PIKs by the inverse probability of having a PIK. As a note, although designed for linkage failure, the twangRDC can be used for other nonresponse and missing data.
The steps of the algorithm are described here assuming the user specifies strata in the model, but the steps are the same without strata by considering the data coming from a single stratum. Guidance on specific decisions relevant for each step is provided throughout the examples.
iters.per.stepiterations to the gradient boosted model.
If the goal is the construction of a comparison group, twangRDC can apply gradient boosted propensity score weights for the average treatment effect on the treated. Treatment observations receive a weight of 1, while comparison observations receive a weight of the odds of treatment based on the propensity score model. The steps of the algorithm are described here assuming the user specifies strata in the model, but the steps are the same without strata by considering the data coming from a single stratum. Guidance on specific decisions relevant for each step is provided throughout the examples.
iters.per.stepiterations to the gradient boosted model.
The primary function of twangRDC is
ps.xgb, which performs the methodology previously described. In Table 1, we describe the options of
ps.xgb. Additional details can be found in the help file.
|formula||A symbolic description of the model to be fit with an observed data indicator or a treatment indicator on the left side of the formula and the variables to be balanced on the right side. If
|linkage||An indicator of whether the weighting should be for linkage failure or comparison group construction.
|strata||An optional factor variable identifying the strata. If specified, balance is optimized within strata.|
|params||xgboost parameters. Details below.|
|file||An optional filename to save intermediate results. The file is overwritten at each step of the algorithm.|
|iters.per.step||An integer specifying the number of iterations to add to the model at each step of algorithm.|
|max_steps||An integer specifying the maximum number of steps to take while optimizing the gradent boosted model. The maximum number of iterations considered during optimization is given by
|id.var||A variable that uniquely identifies observations.|
|min.width||An integer specifying the minimum number of iterations between the current number of model iterations and the optimal value.|
|save.model||A logical value indicating if the xgboost model is saved as part of the output object.|
|weights||An optional variable that identifies user defined weights to be incorporated into the optimization, e.g., sampling design weights. If specified, the output automatically accounts for these weights.|
A detailed description of the xgboost options can be found in the xgboost documentation. Here, we briefly describe the options that are most useful in this context.
etais a value between 0 and 1 that controls the learning rate. Smaller values of
etareduce overfit but require additional iterations to achieve optimal balance.
max_depthis the maximum allowable depth of the gradient boosted trees. A larger value allows the model to consider more complex interactions between the covariates. Increasing
max_depthmay require a reduction in
min_child_weightis the minimum number of observations needed to further partition a tree. If a leaf node is such that it has fewer than
min_child_weightobservations, the tree partitioning process will stop. Users can conceptualize this as similar to the number of observations in a level of a categorical variable for a standard regression model. Note, if weights are specified in
min_child_weightreferences the sum of the weights.
The options for twangRDC should be set to achieve optimal balance. Many of the arguments are tuning parameters that should be tweaked to explore if the resulting model achieves better or worse balance than other choices of the arguments. We recommend that users run the models several times modifying the choice of the arguments to optimize their choice. In most cases, we have found that small changes to the arguments will not result in substantial differences in the underlying model fit. However, it is important to explore this possibility for each new analysis. Some general recommendations are provided here.
formulashould include all covariates that are to be balanced. If working with small to moderate sample sizes, it is likely that achieving balance on a large set of covariates is not feasible, and the user should make the choice of covariates as parsimonious as possible. The underlying gradient boosted model will automatically consider interactions and nonlinearities, but balance is only calculated based on the form of the covariates in the
formula. To assess balance for specific interactions, the interactions must be included in the
iters.per.stepwill reduce the computation time necessary for the model to finish. However, there is a tradeoff between the choice of
iters.per.stepand the balance achieved by the model. Larger values will result in worse balance. We recommend values between 50 and 500 for datasets exceeding 100,000 records.
min_child_weightshould be set to a combination of values that allow the xgboost model to run for at least a few thousand iterations. If the optimum iteration is too small, users can decrease
min_child_weight. Conversely, if the algorithm does not stop within the users expectations, the value of
max_depthcan be increased or
min_child_weightdecreased. Our experience suggests that values of
etabetween 0.1 and 0.3 are a good place to start.
We will highlight two uses of the twangRDC R package. First, we will generate weights that address potential bias due to linkage failures by ensuring individuals with PIKs are representative of their geographic region’s population. Second, we will highlight using the twangRDC R package to generate propensity score weights such that a group of southern metropolitan areas can be used as a comparison group for residents of Orleans Parish, Louisiana, which corresponds exactly with the boundaries of the City of New Orleans.
A simulated data file was created for use in this vignette. It contains simulated records for residents of Orleans Parish and three other metropolitan areas in the South census region (Atlanta, GA, Memphis, TN, and Washington, D.C.). We built the file exclusively from public use data, but it mirrors the structure of restricted versions of the 2000 Decennial Census available through the FSRDC network2. Each simulated record includes basic individual demographic characteristics and basic household characteristics. Each record also includes two important indicators, one to simulate whether the individual record received a PIK and another to denote whether the individual lived in Orleans Parish.
The data was created by extracting all variables for households and individuals from the 5% Integrated Public Use Microdata Sample (IPUMS3) of the 2000 Decennial Census. We simulated assignment of households to census tracts and attached tract identifiers and characteristics.
We also simulated an indicator of PIK assignment (PIK=yes/no) to person records. Public use data do not include PIK assignment, so we estimated a predicted probability of receiving a PIK by using estimated regression parameters from Bond et al. 20134. We added a random error to the predicted probability so that the PIK assignment status is not deterministic and converted the predicted probability into a dichotomous PIK=yes/no variable.
Lastly, we pooled Orleans Parish records with those from other southern metropolitan areas, created an indicator for Orleans Parish residence and, for the purposes of this vignette, sampled the data to shrink the size of the dataset. The simulated file contains individual records for 8,893 residents of Orleans Parish, LA and 9,503 individual records for residents from select southern metropolitan areas.
Table 2 provides a description of the data elements included in the simulated data file.
|Data element||Description||Labels and Coding|
|metarea||A categorical variable for metropolitan area.||Atlanta, GA (52)
Memphis, TN/AR/MS (492)
New Orleans, LA (556)
Washington, DC/MD/VA (884)
|c00_age12||A categorical variable for age in years at the 2000 Decennial Census.||0 to 2 years old (1)
3 to 5 years old (2)
6 to 9 years old (3)
10 to 14 years old (4)
15 to 18 years old (5)
19 to 24 years old (6)
25 to 34 years old (7)
35 to 44 years old (8)
45 to 54 years old (9)
55 to 64 years old (10)
65 to 74 years old (11)
75 and older (12)
|c00_sex||A binary indicator of sex as reported on the 2000 Decennial Census.||Male (0)
|c00_race||A categorical variable for race as reported on the 2000 Decennial Census.||White (1)
African American (2)
American Indian or Alaskan Native (3)
Other Asian or Pacific Islander (5)
Some other race (6)
|c00_nphu||The number of persons in housing unit as reported on the 2000 Decennial Census.||1 to 16|
|tract_id_str||Census tract identifier.|
|sim_pik||Simulated binary indicator of PIK assignment.||No PIK assigned (0)
PIK assigned (1)
|nola_rec||Binary indicator for record from Orleans Parish.||Not Orleans Parish Record (0)
Orleans Parish Record (1)
|id||An ID variable that uniquely identifies individuals in the dataset.|
As previously mentioned, we will generate weights that ensure individuals with PIKs are representative of their geographic region. To keep the computational time of this vignette down, we focus only on a subset of Orleans parish. The twangRDC package is installed on the FSRDC servers and available to all FSRDC researchers by loading the package into R as shown here. First, we load the twangRDC package and our simulated dataset into R.
Next, it is important that the variables of the dataset are coded as intended. In this case, we convert several of variables to factors.
# factors need to be coded as such nola_south$metarea = as.factor(nola_south$metarea) nola_south$c00_age12 = as.factor(nola_south$c00_age12) nola_south$c00_race = as.factor(nola_south$c00_race) nola_south$c00_sex = as.factor(nola_south$c00_sex)
In a final data preparation step, we limit the dataset to Orleans Parish.
# only consider Orleans parish nola_only = subset(nola_south , metarea==556)
In this case, we wish to generate weights to ensure that individuals with PIKs are representative of their entire Census tract. That is, for each Census tract, we want to generate weights such that the observations with PIKs within the Census tract are representative of the tract’s population. This is achieved by specifying
linkage=TRUE, which tells the function that we wish to generate nonresponse weights based on linkage failure, and
strata="tract_id_str", which tells the function that we wish to generate weights that are representative within strata defined by Census tracts. The formula
sim_pik ~ c00_age12 + c00_race + c00_sex identifies the observations with PIK on the left hand side and the characteristics that we want to consider when estimating the weights the right hand side.
# set the model parameters params = list(eta = .1 , max_depth = 5 , min_child_weight=50) # fit the xgboost model res.pik = ps.xgb(sim_pik ~ c00_age12 + c00_race + c00_sex , strata="tract_id_str", data=nola_only, params=params, max.steps=50, iters.per.step=100, min.iter=1000, id.var="id", linkage = TRUE)
The parameters of the underlying xgboost model are specified in
params. These were described in more detail in a previous section. The
id.var="id" provides a unique identifier for observations such that the generated weights can easily be merged back in with the original data. The other options specified in the
ps.xgb function control how frequently the algorithm checks for convergence and how many iterations should be considered before stopping. First,
iters.per.step=100 tells the algorithm to only consider every 100-th iteration when evaluating convergence. Larger values improve computational time by reducing the number of balance evaluations, while smaller values may achieve slightly better balance. Next,
min.iter=1000 tells the algorithm that at least 1000 iterations must be used before stopping for convergence. Larger values ensure that more complex models are evaluated before determining the optimal set of weights. Finally,
max.steps=50 indicates that the algorithm should only evaluate balance up to 50 times before stopping. The maximum number of iterations of the xgboost model is given by
max.steps*iters.per.setp, which in this case is 5000. In general, this value should be large to ensure that the optimum set of weights is achieved. The default value is
max.steps=Inf, which will continue adding iterations to the model until the convergence criteria is met. Due to computational concerns, we recommend testing your code with values of
max.steps such that the total number of iterations is small (1000 to 10000). Once you have determined the model is working as intended, set
Now that the weights have been estimated, we will evaluate their quality. First, we need to ensure that a sufficient number of iterations have been used such that the balance criteria is minimized.
plot function provides a plot of the balance criteria (Figure 1), which in this case is the average of the strata-specific maximum absolute standardized difference of the covariates, versus the number iterations. We use this figure to verify that the algorithm has run for a sufficient number of iterations such that a clear minimum has been achieved. For example, Figure 1 shows that there is little change in the maximum absolute standardized difference after approximately 2,000 iterations. This indicates that adding additional iterations is unlikely to substantially improve the balance.
After evaluating convergence of the algorithm, we can assess the quality of the weights using balance tables. First, the
bal.table function will produce the population mean for each covariate, as well as the unadjusted (unweighted) and the adjusted (weighted) mean among those with PIK (Table 3). It also provides the standardized difference for each covariate before and after weighting. The standardized difference for a specific covariate is calculated as the mean difference between those with PIK versus those without divided by the population standard deviation. The goal is for the standardized differences after weighting to all be small, with many researchers recommending absolute values less than 0.1 as the target. Regardless of the specific target, standardized differences close to zero are ideal, and the quality of the balance should be assessed with the context of the specific analysis in mind. In this case, all of our standardized differences are very close to zero.
|Population Mean||Unadjusted Mean||Adjusted Mean||Unadjusted Standardized Difference||Adjusted Standardized Difference|
Next, balance can be assessed within Census tract since our goal for this model was to generate weights such that observations with PIKs are representative of their Census tract. We will identify tract-covariate combinations in which the adjusted standardized differences are the furthest from zero (Table 4). The
type='strata' option tells
bal.table to provide the maximum absolute standardized difference by the strata variable, in this case, the maximum absolute standardized difference of the covariates within each Census tract. We additionally specify
include.var=TRUE, which identifies the covariate that the maximum absolute standardized difference corresponds to within each Census tract,
decreasing = T, which orders the strata in decreasing order of their weighted standardized differences, and
n=3, which only prints the three strata with the largest absolute standardized differences. Note that this table produces a single row for each stratum, such that the table only shows the covariate with the worst balance for each stratum.
bal.table(res.pik , type='strata' , include.var=TRUE , n=3 , decreasing = T)
|Stratum||Unadjusted Maximum Standardized Difference||Adjusted Maximum Standardized Difference||Variable|
This table allows us to assess which strata had the worst balance after weighting, and among those strata, which covariates were problematic. In this case, both our within strata balance and our overall population balance look excellent (i.e., adjusted standardized differences close to zero). We conclude that the weighting has achieved our goal of making the sample representative of the population.
get.weights function extracts the weights at the optimal iteration. The resulting data contains the weights and the ID variable specified in
id.var. The weights can then be merged back in with the original data using the
id.var. Note that base R merge function is slow compared to modern alternatives. If your data is large, consider
# extract weights w = get.weights(res.pik) # merge weights into data dta=merge(dta , w , by='id' , all=TRUE)
These weights can be used in subsequent steps of the analysis. For example, if matching a specific marginal population total is necessary, the weights output by this function can be used as an input to a post-stratification step. Alternately, if the goal of the study is to compare two groups, these weights can be used as inputs into a propensity score model.
Our second example use of twangRDC will generate a comparison group for Orleans Parish consisting of residents of the three other southern metropolitan areas. The steps of the process remain the same as the PIK weighting example, but with minor adjustments to the interpretation and specification of the model. First,
linkage = FALSE specifies that we are no longer interested in weights to account for linkage failure, but instead, we wish to generate a comparison group using propensity score weights. Note that
ps.xgb only estimates the average treatment effect on the treated, with the treatment group identified by records with a value of 1 in the left hand side of the formula. Second, our formula now includes an interaction between age and sex. This informs the algorithm to evaluate balance based on unique combinations of age and sex (e.g., it will attempt to balance the proportion of each sex-by-age combination). Currently, twangRDC only allows for specification of interactions between factor variables. Finally, we no longer specify a stratification variable, as we are attempting to create a comparison group that is representative of Orleans parish as a whole. If we were interested in using the linkage failure weights from our previous example, they could be specified in the
weights option, and the output of this model would automatically account for those weights.
# set the model parameters params = list(eta = .3 , max_depth = 5 , subsample = 1 , max_delta_step=0 , gamma=0 , lambda=0 , alpha=0, min_child_weight=50 , objective = "binary:logistic") # fit the xgboost model res.ps = ps.xgb(nola_rec ~ c00_age12:c00_sex + c00_race , data=nola_south, params=params, max.steps=25, iters.per.step=100, min.iter=1000, min.width = 500, id.var="id", linkage = FALSE)
After fitting the model, the same steps covered in the linkage failure example should be used here as well. The first step is to evaluate the convergence of the model using the
plot function (Figure 2). We are verifying that the algorithm has run for a sufficient number of iterations such that a clear minimum has been achieved.
Next, we evaluate the performance of the algorithm by looking at balance tables (Table 5). Since we specified
bal.table function will now produce the treatment group mean for each covariate, as well as the mean before and after propensity score weighting among comparison cases. It also provides the standardized differences for each covariate. As in the linkage failure example, the goal is for the standardized differences after weighting to all be close to zero.
|Treatment Group Mean||Unadjusted Comparison Group Mean||Adjusted Comparison Group Mean||Unadjusted Standardized Difference||Adjusted Standardized Difference|
Since no strata were specified in the model, the
type='strata' option used in the linkage failure example is no longer useful. These weights can be extracted and merged in with the data using the same steps described in the linkage failure example.
For more information, tutorials, and tools for weighting and analysis of nonequivalent groups, see the TWANG website at https://www.rand.org/statistics/twang.html.
Wagner, Deborah and Mary Layne. The Person Identification Validation System (PVS): Applying the Center Administrative Records Research and Applications’ (CARRA) Record Linkage Software. US Census Bureau, Center Administrative Records Research and Applications Working Paper #2014-01. 2014.↩︎
Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0↩︎
Bond, Brittany, J. David Brown, Adela Luque and Amy O’Hara. The nature of the bias when studying only linkable person records: Evidence from the ACS. US Census Bureau, Center Administrative Records Research and Applications Working Paper #2014-08, 2014.↩︎