% setwd("d:/dev/twang/inst/doc")
% Sweave("twang.rnw"); system("texify twang.tex"); system("c:\\MiKTeX\\miktex\\bin\\yap.exe twang.dvi",wait=FALSE)



\documentclass{article}
\bibliographystyle{plain}
\usepackage[active]{srcltx}
\usepackage{url}
\usepackage{xcolor}
\usepackage{array}
\addtolength{\topmargin}{-0.5in}
\addtolength{\textheight}{0.75in}
\addtolength{\oddsidemargin}{-0.5in}
\addtolength{\textwidth}{1in}
\newcommand{\EV}{\mathrm{E}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\aRule}{\begin{center} \rule{5in}{1mm} \end{center}}
\newcolumntype{L}{>{\arraybackslash}m{.8\linewidth}}


\title{Toolkit for Weighting and Analysis of Nonequivalent Groups: A guide to the twang package}
%\title{Propensity scores for binary treatments:\\A
%tutorial for the \texttt{ps} function in the \texttt{twang} package, Version 2.0}

\author{\parbox{\linewidth}{\centering Greg Ridgeway, Dan McCaffrey, Andrew Morral, Matthew Cefalu, Lane Burgette, Joseph Pane, and Beth Ann Griffin\footnote{The twang package and this tutorial were developed under NIDA grants R01 DA017507, R01 DA015697-03, and R01 DA034065}}
\\RAND}
%\VignetteIndexEntry{Toolkit for Weighting and Analysis of Nonequivalent Groups: A guide to the twang package}
%\VignetteDepends{gbm,survey,xtable}
%\VignetteKeywords{propensity score, nonresponse weighting}
%\VignettePackage{twang}

% \addtolength{\textheight}{0.75in}
\addtolength{\oddsidemargin}{-0.5in}
\addtolength{\textwidth}{0.75in}
\newcommand{\mathgbf}[1]{{\mbox{\boldmath$#1$\unboldmath}}}

\usepackage{Sweave}
\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\section{Introduction}

The Toolkit for Weighting and Analysis of Nonequivalent Groups, \texttt{twang}, contains a set of functions and procedures to support causal modeling of observational data through the estimation and evaluation of propensity scores and associated weights for binary, multinomial, and time-varying treatments. This package was developed in 2004. The \texttt{twang} package underwent extensive revisions in 2012 and 2020. Starting with version 2.0, the \texttt{twang} package includes revisions that improve computational efficiency. Code written for prior versions of \texttt{twang} should still run without modification. However, in certain situations, results from the updated version of \texttt{twang} will no longer replicate previous results. If a user wishes to reproduce results from a prior version of twang, a new option (\texttt{version="legacy"}) must be specified. 

This tutorial provides an introduction to \texttt{twang} and demonstrates its use through illustrative examples. Interested readers can review the related tutorials for more than two treatment groups and time-varying treatments at \url{https://www.rand.org/statistics/twang/tutorials.html}.

The foundation of the methods supported by \texttt{twang} is the propensity score. The propensity score is the probability that a particular case would be assigned or exposed to a treatment condition. Rosenbaum \& Rubin (1983) showed that knowing the propensity score is sufficient to separate the effect of a treatment on an outcome from observed confounding factors that influence both treatment assignment and outcomes, provided the necessary conditions hold. The propensity score has the balancing property that given the propensity score the distribution of features for the treatment cases is the same as that for the control cases. While the treatment selection probabilities are generally not known, good estimates of them can be effective at diminishing or eliminating confounds between pretreatment group differences and treatment outcomes in the estimation of treatment effects.

There are now numerous propensity scoring methods in the literature. They differ in how they estimate the propensity score (e.g. logistic regression, CART), the target estimand (e.g. treatment effect on the treated, population treatment effect), and how they utilize the resulting estimated propensity scores (e.g. stratification, matching, weighting, doubly robust estimators). We originally developed the \texttt{twang} package with a particular process in mind, namely, generalized boosted regression to estimate the propensity scores and weighting of the comparison cases to estimate the average treatment effect on the treated (ATT). However, we have updated the package to also meaningfully handle the case where interest lies in using the population weights (e.g., weighting of comparison and treatment cases to estimate the population average treatment effect, ATE). 

The main workhorse of \texttt{twang} is the \texttt{ps()} function, which implements gradient boosted models with either the \texttt{gbm} package or the \texttt{xgboost} package to estimate the propensity scores. However, the framework of the package is flexible enough to allow the user to use propensity score estimates from other methods and to assess the usefulness of those estimates for ensuring equivalence (or ``balance'') in the pretreatment covariate distributions of treatment and control groups using tools from the package. The same set of functions is also useful for other tasks such as non-response weighting, as discussed in Section~\ref{sec:nonresponse}.

The \texttt{twang} package aims to (i) compute from the data estimates of the propensity scores which yield accurate causal effect estimates, (ii) check the quality of the resulting propensity score weights by assessing whether or not they have the balancing properties that we expect in theory, and (iii) use them in computing treatment effect estimates. Users who are more comfortable with Stata or SAS than R are encouraged to visit \url{www.rand.org/statistics/twang/} for information on Stata commands and SAS macros that implement these methods. Additionally, we now have Shiny apps available that are menu-driven and user-friendly versions of the R tools.

\section{Principles of \texttt{twang}}

The \texttt{twang} package utilizes gradient boosted models to estimate propensity scores. These models are flexible, machine learning approaches that can automatically incorporate nonlinearities and interactions among the covariates. In general, the algorithms implemented in the \texttt{twang} package optimize the similarity between treatment and comparison observations by selecting the iteration (complexity) of the gradient boosted model that achieves the best balance. The user simply selects the set of covariates they wish to balance, and the algorithm automatically incorporates nonlinearity and interactions to achieve the optimal balance. This vignette focuses on binary treatments using the \texttt{ps} function of the \texttt{twang} package, but many of the functions and arguements are shared across \texttt{twang} functions developed for other treatment types (e.g., multinomial treatment using \texttt{mnps}). See \url{www.rand.org/statistics/twang/} for more information. 

The arguements to the \texttt{ps} can be broken up into three categories: (1) those that specify the model; (2)  those that control the gradient boosting; and (3) those that improve computational efficiency. We will describe these arguments in detail now, and then highlight their use in the next section. 

\subsection{Arguements that specify the propensity score model}

\begin{description}
  \item[\texttt{formula}] is a symbolic description of the propensity score model using standard R syntax with the treatment indicator on the left side of the formula and the potential confounding variables on the right side. In contrast to the \texttt{lm()} function, there is no need to specify interaction terms in the formula as the gradient boosted model automatically considers interactions. There is also no need --- and it can be counterproductive --- to create indicator, or ``dummy coded,'' variables to represent categorical covariates, provided the categorical variables are stored as a \texttt{factor} or as \texttt{ordered} (see \texttt{help(factor)} for more details).
    \item[\texttt{data}] indicates the dataset to be used, which should include a binary treatment indicator and the covariates to be balanced.
    \item[\texttt{estimand}] is the causal effect of interest, which is either ``ATE'' (average treatment effect) or ``ATT'' (average treatment effect on the treated). ATE estimates the change in the outcome if the treatment were applied to the entire population versus if the control were applied to the entire population. ATT estimates the analogous effect, averaging only over the treated population. The \texttt{estimand} argument was added to the 2012 revision of the package which integrated ATE weighting into the package and the \texttt{ps} function estimate of the propensity score. The default value is ``ATE''.
    \item[\texttt{sampw}] are optional sampling weights. If specified, the sampling weights are automatically incorporated into the derivation of the propensity score weights.
\end{description}

\subsection{Arguements that control the gradient boosting}

\begin{description}
  \item[\texttt{stop.method}] is a method or set of methods for measuring and summarizing balance across pretreatment variables. The \texttt{ps} function selects the optimal number of gradient boosting iterations to minimize the differences between the treatment and control groups as measured by the rules of the given \texttt{stop.method}. Current options are ks.mean, ks.max, es.mean, and es.max. ks refers to the Kolmogorov-Smirnov statistic and es refers to standardized effect size. These are summarized across the pretreatment variables by either the maximum (.max) or the mean (.mean). The default value is to utilize both ks.mean and es.mean.
  \item[\texttt{n.trees}] is the maximum number of gradient boosting iterations to be considered. The more iterations allows for more nonlienarity and interactions to be considered. The default value is 10,000.
  \item[\texttt{interaction.depth}] is a positive integer denoting the tree depth used in gradient boosting. This can be loosely be interpreted as the maximum number of variables that can be included in an interaction. Higher values allow for greater model complexity. The default value is 3.
  \item[\texttt{shrinkage}] is a numeric value between 0 and 1 denoting the learning rate. Smaller values restrict the complexity that is added at each iteration of the gradient boosting algorithm. A smaller learning rate requires more iterations (\texttt{n.trees}), but adds  some protection against model overfit. The default value is 0.01. 
  \item[\texttt{n.minobsinnode}] is the minimum number of observations in the terminal nodes of the trees used in the gradient boosting algorithm. Smaller values allow for more model complexity, while larger values provide some protection against model overfit. The default value is 10.
\end{description}

\subsection{Arguements that improve computational efficiency}

\begin{description}
\item[\texttt{n.keep}] is a positive integer indicating that the algorithm should only consider every \texttt{n.keep}-th iteration of the propensity score model and optimize balance over this set instead of all iterations. Larger values of \texttt{n.keep} improve computational efficiency by only assessing balance on a subset of the gradient boosting iterations, but is likely to achieve worse balance than considering every iteration. The default value is 1, which is to optimize over all iterations. 
\item[\texttt{n.grid}] is a positive integer that sets the grid size for an initial search of the region most likely to minimize the \texttt{stop.method}. \texttt{n.grid} corresponds to a grid of points on the kept iterations as defined by \texttt{n.keep}. For example, \texttt{n.keep=20} with \texttt{n.trees=5000} will keep the iterations \texttt{(20,40,...,5000)}. The option \texttt{n.grid=10} then splits this vector into 10 points. Thus, \texttt{n.grid*n.keep} must be less than or equal to \texttt{n.trees}. The default grid size is 25.
\item[\texttt{ks.exact}] improves the speed when calculating the p-value for the weighted two-sample Kolmogorov–Smirnov (KS) statistic. The default behavior when calculating the weighted two-sample KS p-value differs from implementations of \texttt{twang} prior to Version 2.0. Specifically, if \texttt{ks.exact=NULL} and the product of the effective sample sizes is less than 10,000, then an approximation based on the exact distribution of the unweighted KS statistic is used. This approximation via the exact distribution can also be requested directly by \texttt{ks.exact=TRUE}. Otherwise, an approximation based on the asymptotic distribution of the unweighted KS statistic is used. The default is \texttt{ks.exact=NULL}. We note that \texttt{ks.exact=TRUE} replicates the calculation of KS p-values from version of \texttt{twang} prior to Version 2.0 but adds substantial computation time for larger datasets. We recommend using the default value.
\item[\texttt{version}] allows the user to specify which gradient boosting package to use for the propensity score model. Current options are \texttt{"gbm"}, \texttt{"xgboost"}, and \texttt{"legacy"}. The default value is \texttt{version="gbm"} which uses gradient boosting from the \texttt{gbm} package and provides a faster implementation of key components of the \texttt{twang} package to improve speed when calculating and optimizing the balance statistics. \texttt{version="xgboost"} uses gradient boosting from the \texttt{xgboost} package. This provides users with access to a cutting-edge implementation of gradient boosting, while potentially providing speed improvements in larger datasets. The behavior of \texttt{xgboost} can be controlled using the the boosting parameters of \texttt{gbm} discussed earlier, but can also be controlled directly using the options of \texttt{xgboost}. See \url{https://xgboost.readthedocs.io/en/latest/parameter.html} for a description of the different options. \texttt{version="legacy"} uses the implementation of the \texttt{ps} function prior to Version 2.0. To replicate results from analyses utilizing twang prior to Version 2.0, specify the \texttt{version="legacy"} option.
\end{description}

\subsection{Typical workflow with \texttt{twang}}

Here, we outline the usual steps for a propensity score analysis using \texttt{twang}. While meant to be illustrative, these steps should not be considered comprehensive. Every analysis is unique and requires careful consideration. These steps will be expanded upon in the subsequent examples.

\begin{enumerate}
  \item Determine the estimand of interest, either ATE or ATT.
  \item Determine the observed confounding factors to be balanced.
  \item Fit the propensity score model using the \texttt{ps()} function.
  \begin{itemize}
    \item Evaluate the convergence of the algorithm.
    \item Assess the balance of the confounding factors before and after applying the propensity score weights.
    \item If needed, rerun the algorithm adjusting the gradient boosting parameters to achieve convergence or improve balance.
  \end{itemize}
  \item Extract the propensity score weights and fit a weighted model to estimate the treatment effect.
\end{enumerate}

\section{An ATT example}

<<echo=FALSE>>=
options(width=80)
@

\subsection{Installing \texttt{twang} for first time users}

The package is installed by typing \texttt{install.packages("twang", dependencies=TRUE)} into the R console. You will only need to do this step once. In the future, running \texttt{update.packages()} regularly will ensure that you have the latest versions of the packages, including bug fixes and new features.

\subsection{Load the data}

To start using \texttt{twang}, first load the package using \texttt{library(twang)}. This is required at the beginning of each R session. We will highlight features of \texttt{twang} using data from Lalonde's National Supported Work Demonstration analysis (Lalonde 1986, Dehejia \& Wahba 1999, \url{http://users.nber.org/~rdehejia/nswdata2.html}). This dataset is provided with the \texttt{twang} package and can be loaded using \texttt{data(lalonde)}.

<<>>=
library(twang)
data(lalonde)
@

R can read data from many other sources. The manual ``R Data Import/Export,'' available at \url{http://cran.r-project.org/doc/manuals/R-data.pdf}, describes that process in detail. 

\subsection{Fit a gradient boosted model with \texttt{ps}}

For the \texttt{lalonde} dataset, the variable \texttt{treat} is the treatment indicator, where 1 indicates being part of the National Supported Work Demonstration (``treatment'') and 0 indicates cases drawn from the Current Population Survey (``comparison''). In order to estimate a treatment effect for this demonstration program that is unbiased by pretreatment group differences on other observed covariates, we include the following covariates in a propensity score model of treatment: age, education, black, Hispanic, having no degree, married, earnings in 1974 (pretreatment), and earnings in 1975 (pretreatment). Note that we specify no outcome variables at this time. The \texttt{ps()} function is the primary method in \texttt{twang} for estimating propensity scores for binary treatments. 

<<>>=
ps.lalonde.gbm = ps(treat ~ age + educ + black + hispan + nodegree +
                         married + re74 + re75,
                 data = lalonde,
                 n.trees=5000,
                 interaction.depth=2,
                 shrinkage=0.01,
                 estimand = "ATT",
                 stop.method=c("es.mean","ks.max"),
                 n.minobsinnode = 10,
                 n.keep = 1,
                 n.grid = 25,
                 ks.exact = NULL,
                 verbose=FALSE)
@

In this model, we have specified \texttt{estimand = "ATT"} indicating that we will estimate the average treatment effect among those who partcipated in the National Supported Work Demonstration. We also specified two stopping criteria: \texttt{es.mean} and \texttt{ks.max}. This indicates that the algorithm should separately determine the iteration that minimizes the average absolute standardized effect size (\texttt{es.mean}) and the maximum KS statistic (\texttt{ks.max}). We will compare the results of these stopping rules to evaluate the sensitivity of the results to this choice.

The saved object \texttt{ps.lalonde.gbm} is of class \texttt{ps} and contains the propensity score model fit information. We will work with this object to assess model convergence, to assess the balance of pretreatment characteristics, and to extract the propensity score weights needed for subsequent analysis. By default, the \texttt{ps()} function uses the \texttt{gbm} package for gradient boosting. A user can request use of \texttt{xgboost} by specifying the option \texttt{version="xgboost"}. The use of \texttt{xgboost} will not be discussed further in this tutorial. See the documentation for additional details. 

\subsubsection{Convergence diagnostic checks}

\noindent Before continuing to outcome analyses, an analyst should perform diagnostic checks assessing the convergence of the gradient boosting algorithm. The specified value of \texttt{n.trees} should be large enough to allows the gradient boosted model to have explored sufficiently complicated models. We can do this quickly with the \texttt{plot()} function.\footnote{In versions 1.0.x of the \texttt{twang} package, the \texttt{ps} function itself included some plotting functions.  This is no longer the case (and the function no longer includes a \texttt{plots} argument); these functions have been moved to the generic \texttt{plot()} function.} As a default, the \texttt{plot()} function applied to a \texttt{ps} object gives the balance measures as a function of the number of iterations in the gradient boosting algorithm, with higher iterations corresponding to more complicated fitted models. In the example below, \Sexpr{ps.lalonde.gbm$desc$es.mean.ATT$n.trees} iterations minimized the average absolute standardized effect size and \Sexpr{ps.lalonde.gbm$desc$ks.max.ATT$n.trees} iterations minimized the largest of the eight Kolmogorov-Smirnov (KS) statistics computed for the covariates. 

If it appears that additional iterations would be likely to result in lower values of the balance statistic, \texttt{n.trees} should be increased. However, after a point, additional complexity typically makes the balance worse, as in the example below. This figure also gives information on how compatible two or more stopping rules are: if the minima for multiple stopping rules under consideration are near one another, the results should not be sensitive to which stopping rule one uses for the final analysis. See Section~\ref{sec:pseval} for a discussion of these and other balance measures.  

\begin{center}
<<label=iterPt, fig=TRUE, echo=TRUE, include = TRUE, height = 4, width = 6>>= 
    plot(ps.lalonde.gbm)
@
\end{center}

%This figure illustrates the process of selecting the number of trees
%or iterations in the GBM fit by assessing
%balance. 
%In each 
%panel, the number of GBM iterations is plotted on the horizontal
%axis and the overall summary measure of balance (mean standardized
%bias for \texttt{es.mean} or maximum KS statistic for \texttt{ks.max})
%is plotted on the vertical axis.
%Each iteration adds complexity to the propensity score model giving
%it greater modeling flexibility. The increased flexibility improves
%the balance of the two groups up to a certain point after which
%additional iterations offer no improvement or actually make the
%balance worse. 


% \begin{center}
% <<label=iterPt2, fig=TRUE, echo=TRUE, include = TRUE, height = 3.5, width = 6>>= 
%     plot(ps.lalonde.gbm, subset = 2)
% @
% \end{center}


\subsubsection{Assessing ``balance'' using balance tables}

Having estimated the propensity scores, \texttt{bal.table()} produces a table that shows how well the resulting weights succeed in manipulating the control group so that its weighted pretreatment characteristics match, or balance, those of the unweighted treatment group when \texttt{estimand = "ATT"}. If \texttt{estimand = "ATE"}, both the control and treatment groups are weighted so that the weighted pretreatment characteristics match, or balance, with one another.

By default, the \texttt{bal.table()} function uses the value of \texttt{estimand} set with the \texttt{ps()} function call. For example, in the analysis we set \texttt{estimand =  "ATT"} when calling \texttt{ps()} to estimate the propensity scores. The function \texttt{bal.table()} automatically uses the correct weights when checking balance and comparing the distributions of pre-treatment variables for the weighted control group with those from the unweighted treatment group.

<<echo=FALSE>>=
options(width=85)
@

<<>>=
lalonde.balance <- bal.table(ps.lalonde.gbm)
lalonde.balance
@

<<echo=FALSE>>=
options(width=80)
@

\texttt{bal.table()} returns information on the pretreatment covariates before and after weighting. The object is a list with named components, one for an unweighted analysis (named \texttt{unw}) and one for each \texttt{stop.method} specified, here \texttt{es.mean} and \texttt{ks.max}. McCaffrey et al (2004) essentially used \texttt{es.mean} for the analyses, but our more recent work has sometimes used \texttt{ks.max}. See McCaffrey et al. (2013) for a more detailed description of these choices.

If there are missing values (represented as \texttt{NA}) in the covariates, \texttt{twang} will attempt to construct weights that also balance rates of missingness in the treatment and control arms.  In this case, the \texttt{bal.table()} will have an extra row for each variable that has missing entries. User should note that missing data in \texttt{xgboost} is handled differently than in \texttt{gbm}. In \texttt{gbm}, missing values are placed in their own node, while in \texttt{xgboost} missing values are placed in the left or right node based on minimizing the objective function. 

The columns of the table consist of the following items:
\begin{description}
\item[\texttt{tx.mn, ct.mn}] The treatment means and the control means for each of the variables. The unweighted table \texttt{(unw)} shows the unweighted means. For each stopping rule the means are weighted using weights corresponding to the gbm model selected by \texttt{ps()} using the stopping rule.  When \texttt{estimand = "ATT"} the weights for the treatment group always equal 1 for all cases and there is no difference between unweighted and propensity score weighted tx.mn.

\item[\texttt{tx.sd, ct.sd}] The propensity score weighted treatment and control groups' standard deviations for each of the variables. The unweighted table (unw) shows the unweighted
standard deviations.

\item[\texttt{std.eff.sz}] The standardized effect size, defined as the treatment group mean minus the control group mean divided by the treatment group standard deviation if \texttt{estimand = "ATT"} or divided by the pooled sample (treatment and control) standard deviation if \texttt{estimand = "ATE"}. (In discussions of propensity scores this value is sometimes referred to as ``standardized bias``.) Occasionally, lack of treatment group or pooled sample variance on a covariate results in very large (or infinite) standardized effect sizes. For purposes of analyzing mean effect sizes across multiple covariates, we set all standardized effect sizes larger than 500 to \texttt{NA} (missing values).

\item[\texttt{stat, p}] Depending on whether the variable is continuous or categorical, \texttt{stat} is a t-statistic or a $\chi^2$ statistic.
\texttt{p} is the associated p-value

\item[\texttt{ks, ks.pval}] The Kolmogorov-Smirnov test statistic and its associated p-value. P-values for the KS statistics are either derived from Monte Carlo simulations or analytic approximations, depending on the specifications made in the \texttt{perm.test.iters} argument of the \texttt{ps} function. For categorical variables this is just the $\chi^2$ test p-value.
\end{description}

Components of these tables are useful for demonstrating that pretreatment differences between groups on observed variables have been eliminated using the weights. %The \texttt{xtable} package aids in formatting for \LaTeX { }and Word documents. Table~\ref{tab:balance} shows the results for \texttt{ks.max} reformatted for a \LaTeX { }document. For Word documents, paste the \LaTeX { }description of the table into a Word document, highlight it and use Word tools to convert the text to a table using ``\&`` as the separator.  

% 
% # <<eval=FALSE>>=
% # library(xtable)
% # pretty.tab <- lalonde.balance$ks.max.ATT[,c("tx.mn","ct.mn","ks")]
% # pretty.tab <- cbind(pretty.tab, lalonde.balance$unw[,"ct.mn"])
% # names(pretty.tab) <- c("E(X|t=1)","E(X|t=1)","KS","E(Y0|t=0)")
% # xtable(pretty.tab,
% #        caption = "Balance of the treatment and comparison groups",
% #        label = "tab:balance",
% #        digits = c(0, 2, 2, 2, 2),
% #        align=c("l","r","r","r","r"))
% # @
% 
% \clearpage
% <<echo=FALSE, results=tex>>=
% library(xtable)
% pretty.tab <- lalonde.balance$ks.max.ATT[,c("tx.mn","ct.mn","ks")]
% pretty.tab <- cbind(pretty.tab, lalonde.balance$unw[,"ct.mn"])
% names(pretty.tab) <- c("E(X|t=1)","E(X|t=1)","KS","E(X|t=0)")
% t1 <- xtable(pretty.tab,
%              caption = "Balance of the treatment and comparison groups",
%              label = "tab:balance",
%              digits = c(0, 2, 2, 2, 2),
%              align=c("l","r","r","r","r"))
% t1
% @

The \texttt{summary()} method for \texttt{ps} objects offers a compact summary of the sample sizes of the groups and the balance measures. If \texttt{perm.test.iters>0} was used to create the \texttt{ps} object, then Monte Carlo simulation is used to estimate p-values for the maximum KS statistic that would be expected across the covariates, had individuals with the same covariate values been assigned to groups randomly. Thus, a p-value of 0.04 for \texttt{max.ks.p} indicates that the largest KS statistic found across the covariates is larger than would be expected in 96\% of trials in which the same cases were randomly assigned to groups. Otherwise, \texttt{max.ks.p} will be NA.

<<>>=
summary(ps.lalonde.gbm)
@


In general, weighted means can have greater sampling variance than unweighted means from a sample of equal size. The effective sample size (ESS) of the weighted comparison group captures this increase in variance as

\begin{equation}
\mathrm{ESS} = \frac{\left(\sum_{i\in C} w_i\right)^2}{\sum_{i\in C}
w_i^2}.
\label{eq:ESS}
\end{equation}

The ESS is approximately the number of observations from a simple random sample that yields an estimate with sampling variation equal to the sampling variation obtained with the weighted comparison observations. Therefore, the ESS will give an estimate of the number of comparison participants that are comparable to the
treatment group when \texttt{estimand = "ATT"}. The ESS is an accurate measure of the relative size of the variance of means when the weights are fixed or they are uncorrelated with outcomes. Otherwise the ESS underestimates the effective sample size (Little \& Vartivarian, 2004). With propensity score weights, it is rare that weights are uncorrelated with outcomes.  Hence the ESS typically gives a lower bound on the effective sample size, but it still serves as a useful measure for choosing among alternative models and assessing the overall quality of a model, even if it provides a possibly conservative picture of the loss in
precision due to weighting.  

The \texttt{ess.treat} and \texttt{ess.ctrl} columns in the summary results shows the ESS for the estimated propensity scores. Note that although the original comparison group had 429 cases, the propensity score estimates effectively utilize only 24 or 36.4 of the comparison cases, depending on the rules and measures used to estimate the propensity scores. While this may seem like a large loss of sample size, this indicates that many of the original cases were unlike the treatment cases and, hence, were not useful for isolating the
treatment effect. Moreover, similar or even greater reductions in ESS would be expected from alternative approaches to using propensity scores, such as matching or stratification. Since the estimand of interest in this example is ATT, \texttt{ess.treat = n.treat} throughout (i.e., all treatment cases have a weight of 1). 

\subsubsection{Graphical assessments of balance}
\label{sec:plots} 

The \texttt{plot()} function can generate useful diagnostic plots from the propensity score objects. The full set of plots available in \texttt{twang} and the argument value of \texttt{plot} to produce each one are given in Table~\ref{tab:allArg}.  The convergence plot --- the default --- was discussed above.

\begin{table}[ht]
\begin{center}
\begin{tabular}{rll}
\hline
Descriptive & Numeric & Description \\
argument & argument & \\
\hline
\texttt{"optimize"}&  1  & Balance measure as a function of GBM iterations\\
\texttt{"boxplot"}& 2& Boxplot of treatment/control propensity scores\\
\texttt{"es"}        & 3 & Standardized effect size of pretreatment variables\\
\texttt{"t"}          & 4  & $t$-test $p$-values for weighted
pretreatment variables\\
\texttt{"ks"}       & 5 & Kolmogorov-Smirnov $p$-values for weighted
pretreatment variables\\
\texttt{"histogram"}       & 6 & Histogram of weights for treatment/control\\
\hline
\end{tabular}
\caption{Available options for \texttt{plots} argument to
  \texttt{plot()} function.}
\label{tab:allArg}
\end{center}
\end{table}

The \texttt{plot()} function takes a \texttt{plots} argument in order to produce other diagnostic plots.  For example, specifying \texttt{plots = 2} or \texttt{plots = "boxplot"} produces boxplots illustrating the spread of the estimated propensity scores in the treatment and comparison groups. Whereas propensity score stratification requires considerable overlap in these spreads, excellent covariate balance can often be achieved with weights, even when the propensity scores estimated for the treatment and control groups show little overlap.

\begin{center}
<<fig=TRUE,height=4,width=6>>=
plot(ps.lalonde.gbm, plots=2)
@
\end{center}

The effect size plot illustrates the effect of weights on the magnitude of differences between groups on each pretreatment covariate. These magnitudes are standardized using the standardized effect size described earlier.  In these plots, substantial reductions in effect sizes are observed for most variables (blue lines), with only one variable showing an increase in effect size (red lines), but only a seemingly trivial increase. Closed red circles indicate a statistically significant difference, many of which occur before weighting, none after. In some analyses variables can have very little variance in the treatment group sample or the entire sample and group differences can be very large relative to the standard deviations. In these situations, the user is warned that some effect sizes are too large to plot. 

\begin{center}
<<fig=TRUE,height=4,width=6>>=
plot(ps.lalonde.gbm, plots=3)
@
\end{center}

When many of the p-values testing individual covariates for balance are very small, the groups are clearly
imbalanced and inconsistent with what we would expect had the groups been formed by random assignment. After
weighting we would expect the p-values to be larger if balance had been achieved. We use a QQ plot comparing
the quantiles of the observed p-values to the quantiles of the uniform distribution (45 degree line) to conduct this check of balance. Ideally, the p-values from independent tests in which the null hypothesis is true will have a uniform distribution. Although the ideal is unlikely to hold even if we had random assignment (Bland, 2013), severe deviation of the p-values below the diagonal suggests lack of balance and p-values running at or above the diagonal suggests balance might have been achieved. The p-value plot (\texttt{plots=4} or \texttt{plots="t"}) allows users to visually to inspect the p-values of the t-tests for group differences in the covariate means.

\begin{center}
<<fig =TRUE, height=4, width = 6>>=
plot(ps.lalonde.gbm, plots = 4)
@ 
\end{center}

Before weighting (closed circles), the groups have statistically significant differences on many variables (i.e., p-values are near zero). After weighting (open circles) the p-values are generally above the 45-degree line, which represents the cumulative distribution of a uniform variable on [0,1]. This indicates that the p-values are even larger than would be expected in a randomized study. 

One can inspect similar plots for the KS statistic with the argument \texttt{plots = "ks"} or \texttt{plots = 5}.

\begin{center}
<<fig =TRUE, height=4, width = 6>>=
plot(ps.lalonde.gbm, plots = 5)
@ 
\end{center}


%\texttt{plot()}
%can create similar figures for KS statistic p-values by setting
%\texttt{plots="ks pvalues"}.

%\begin{center}
%<<fig=TRUE,height=4,width=6>>=
%plot(ps.lalonde, plots=5)
%@
%\end{center}

%\begin{center}
%<<fig=TRUE,height=4,width=6>>=
%plot(ps.lalonde, plots=5, subset = 2)
%@
%\end{center}


    
%\begin{figure}[h]
%  \centering
%  \includegraphics{twang-iterPt.pdf}
%  \caption{Caption.}
%\end{figure}
  
  

In all cases, the \texttt{subset} argument can be used if we wish to focus on results from one stopping rule.  
\begin{center}
<<fig =TRUE, height=4, width = 6>>=
plot(ps.lalonde.gbm, plots = 3, subset = 2)
@ 
\end{center}

\subsubsection{Understanding the relationship between the covariates and the treatment assignment}

The \texttt{gbm} package has various tools for exploring the relationship between the covariates and the treatment assignment indicator if these are of interest. These tools can be used with the \texttt{ps} object to extract useful information. In particular, the \texttt{summary()} function applied to the underling \texttt{gbm} object computes the relative influence of each variable for estimating the probability of treatment assignment. The relaive influence of each variable changes as more iteration are added to the gradient boosted model. In this example, we choose the number of iterations to be the optimal number for minimizing the largest of the KS statistics. This value can be found in the \texttt{ps.lalonde.gbm\$desc\$ks.max.ATT\$n.trees}. Figure~\ref{fig:relativeinfluence} shows the barchart of the relative influence and is produced when \texttt{plot=TRUE} in the call to \texttt{summary()}.

<<include=FALSE,width=6,height=8.5>>=
summary(ps.lalonde.gbm$gbm.obj,
        n.trees=ps.lalonde.gbm$desc$ks.max.ATT$n.trees,
        plot=FALSE)
@

\begin{figure}[h!]
\begin{center}
<<fig=TRUE,echo=FALSE,term=FALSE,width=6,height=8.5>>=
summary(ps.lalonde.gbm$gbm.obj,
        n.trees=ps.lalonde.gbm$desc$ks.max.ATT$n.trees)
@
\end{center}
\caption{Relative influence of the covariates on the estimated propensity score.}
\label{fig:relativeinfluence}
\end{figure}

Additionally, our team has developed the \texttt{SBdecomp} package for R which can be used to quantify the proportion of the estimated selection bias explained by each observed confounder when estimating causal effects using propensity score weights. It includes two approaches to quantify the proportion of the selection bias explained by each observed confounder: a single confounder removal approach and a single confounder inclusion approach. This tool can help analyze data where there is a substantive interest in identifying the variable or variables that explains the largest proportion of the estimated selection bias. Interested users are encourage to review the tutorial for that package at \url{https://www.rand.org/pubs/tools/TLA570-3.html}.

\subsection{Analysis of outcomes}

\subsubsection{Propensity scores from \texttt{ps()}}

A separate R package, the \texttt{survey} package, is useful for performing the outcomes analyses using weights. Its statistical methods account for the weights when computing standard error estimates. It is not a part of the standard R installation but installing \texttt{twang} should automatically install \texttt{survey} as well.

<<>>=
library(survey)
@

The \texttt{get.weights()} function extracts the propensity score weights from a \texttt{ps} object. Those weights may then be used as case weights in a \texttt{svydesign} object.  By default, it returns weights corresponding to the estimand (ATE or ATT) that was specified in the original call to \texttt{ps()}. If needed, the user can override the default via the optional \texttt{estimand} argument. 

<<>>=
lalonde$w <- get.weights(ps.lalonde.gbm, stop.method="es.mean")
design.ps <- svydesign(ids=~1, weights=~w, data=lalonde)
@

<<echo=FALSE>>=
@ 


%The \texttt{estimand} argument to the \texttt{get.weights} function
%specifies whether the weights are for estimating the treatment
%effect on the treated (ATT), computed as 1 for the treatment cases and
%$p/(1-p)$ for the comparison cases, or for estimating the treatment
%effect on the population (ATE), computed as $1/p$ for the treatment cases
%and $1/(1-p)$ for the comparison cases. 
%The currently implemented
%\texttt{stop.method} objects optimize for the treatment effect on
%the treated and it is possible that a different set of propensity
%scores would be optimal for a treatment on the population analysis.
The \texttt{stop.method} argument specifies which GBM model, and consequently which weights, to utilize. The \texttt{svydesign} function from the \texttt{survey} package creates an object that stores the dataset along with design information needed for analyses. See \texttt{help(svydesign)} for more details on setting up \texttt{svydesign} objects.

The aim of the National Supported Work Demonstration analysis is to determine whether the program was effective at increasing earnings in 1978. The propensity score adjusted test can be computed with \texttt{svyglm}.

<<>>=
glm1 <- svyglm(re78 ~ treat, design=design.ps)
summary(glm1)
@



The analysis estimates an increase in earnings of \$\Sexpr{round(glm1$coefficients["treat"],0)} for those that participated in the NSW compared with similarly situated people observed in the CPS. The effect, however, does not appear to be statistically significant.


Some authors have recommended utilizing both propensity score adjustment and additional covariate adjustment to minimize mean square error or to obtain ``doubly robust`` estimates of the treatment effect (Huppler-Hullsiek \& Louis 2002, Bang \& Robins 2005). These estimators are consistent if either the propensity scores are estimated correctly \textit{or} the regression model is specified correctly. For example, note that the balance table for \texttt{es.mean.ATT} made the two groups more similar on \texttt{nodegree}, but still some differences remained, \Sexpr{round(100*ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["nodegree","tx.mn"],1)}\% of the treatment group had no degree while \Sexpr{round(100*ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["nodegree","ct.mn"],1)}\% of the comparison group had no degree. While linear regression is sensitive to model misspecification when the treatment and comparison groups are dissimilar, the propensity score weighting has made them more similar, perhaps enough so that additional modeling with covariates can adjust for any remaining differences. In addition to potential bias reduction, the inclusion of additional covariates can reduce the standard error of the treatment effect if some of the covariates are strongly related to the outcome.

<<>>=
glm2 <- svyglm(re78 ~ treat + nodegree, design=design.ps)
summary(glm2)
@

Adjusting for the remaining group difference in the \texttt{nodegree} variable slightly increased the estimate of the program's effect to \$\Sexpr{round(glm2$coefficients["treat"],0)}, but the difference is still not statistically significant. We can further adjust for the other covariates, but that too in this case has little effect on the estimated program effect.



<<>>=
glm3 <- svyglm(re78 ~ treat + age + educ + black + hispan + nodegree +
                      married + re74 + re75,
               design=design.ps)
summary(glm3)
@

\subsubsection{Estimating the program effect using linear regression}

The classical regression approach to estimating the program effect would fit a linear model with a treatment indicator and linear terms for each of the covariates.

<<>>=
glm4 <- lm(re78 ~ treat + age + educ + black + hispan + nodegree +
                  married + re74 + re75,
           data=lalonde)
summary(glm4)
@

<<echo=FALSE>>=
glm5 <- lm(sqrt(re78) ~ treat + age + educ + black + hispan + nodegree +
                        married + sqrt(re74) + sqrt(re75),
           data=lalonde)
@


This model estimates a rather strong treatment effect, estimating a program effect of \$\Sexpr{round(coef(glm4)[2])} with a p-value=\Sexpr{round(coef(summary(glm4))[2,4],3)}. Several variations of this regression approach also estimate strong program effects. For example using square root transforms on the earnings variables yields a p-value=\Sexpr{round(coef(summary(glm5))[2,4],3)}. These estimates, however, are very sensitive to the model structure since the treatment and control subjects differ greatly as seen in the unweighted balance comparison from \texttt{bal.table(ps.lalonde.gbm)}.

% \subsubsection{Omitted variable sensitivity analysis}
% 
% Even though \texttt{glm3} is a ``doubly robust`` model, it is best practice to understand the sensitivity of the finding to unobserved variables. We utilize the Omitted Variable Tool, \texttt{OVtool}, to assess how sensitive the treatment effect direction is to unobserved variables. \texttt{OVtool} is a R package that assesses the sensitivity of research findings to omitted variables when estimating causal effects using ps weighting. The \texttt{OVtool} can be used to assess the sensitivity of both the estimated treatment effect and the p-value for a given analysis. \texttt{OVtool} should automatically be installed when installing \texttt{twang} and can be loaded with the library command.
% 
% <<>>=
% library(OVtool)
% @
% 
% The first step after loading the package is to replicate the outcome model using the \texttt{outcome\_model} function. This function automatically calls \texttt{svyglm} from the \texttt{survey} package. 
% 
% <<>>=
% results = outcome_model(ps_object = NULL,
%                         stop.method = NULL, 
%                         data = lalonde,
%                         weights="w",
%                         treatment = "treat",
%                         outcome = "re78", 
%                         model_covariates = c("age", "educ",
%                                              "black", "hispan",
%                                              "nodegree", "married",
%                                              "re74", "re75"),
%                         estimand = "ATT")
% summary(results$mod_results)
% @
% 
% We now are ready to run our sensitivity analysis using the following call to the \texttt{ov\_sim} function. For details on what the following parameters represent, please see \url{https://cloud.r-project.org/web/packages/OVtool/vignettes/OVtool.html}.
% 
% <<eval=FALSE>>=
% ovtool_att = ov_sim(model_results=results,
%                                   plot_covariates=c("age", "educ", "hispan",
%                                                     "nodegree", "married",
%                                                     "re74", "re75"),
%                                   es_grid = NULL,
%                                   rho_grid = seq(0, 0.40, by = 0.05),
%                                   n_reps = 200,
%                                   progress = TRUE)
% @
% 
% <<echo=FALSE>>=
% # save(ovtool_att, file = "data/ovtool_att.rda", version = 2)
% load("ovtool_att.rda")
% @
% 
% \begin{center}
% <<fig =TRUE, height=6, width = 8>>=
% plot.ov(ovtool_att, col='color', print_graphic = 3)
% @ 
% \end{center}
% 
% The y-axis shown in the figure represents the unobserved variable's absolute correlation with the outcome, [\texttt{re78}], while the x-axis has the association between the unobserved confounder and the treatment indicator on an effect size scale. Further, the black lines represent the treatment effect contours evaluated across plausible x and y values. The red dashed lines represent the p-value contours. Finally, the blue points represent a sample of the observed variables we controlled for in our doubly robust model to provide us with context in terms of how strong the omitted variable would need to be to shift the meaning of our results. Since there are no blue points to the right of the ``0`` contour, we can feel fairly confident that the direction of our estimated treatment effect is robust to unobserved confounders. We can utilize the \texttt{summary.ov()} function to help us interpret this graphic:
% 
% <<eval=TRUE>>=
% summary.ov(ovtool_att, model_results = results)
% @
% 
% The output above demonstrates that even if omitted variables existed, they likely wouldn't shift the direction of the effect. However, we do note that the magnitude of the effect would more than double if an unobserved variable existed that had an equivalent absolute correlation with [\texttt{re78}] and association with the treatment indicator. This is unlikely given [\texttt{re78}] is the pretreatment version of the outcome.

\clearpage

\subsection{Propensity scores from logistic regression}

Propensity score analysis is intended to avoid problems associated with the misspecification of covariate adjusted models of outcomes, but the quality of the balance and the treatment effect estimates can be sensitive to the method used to estimate the propensity scores. Consider estimating the propensity scores using logistic regression instead of \texttt{ps()}.

<<>>=
ps.logit <- glm(treat ~ age + educ + black + hispan + nodegree +
                        married + re74 + re75,
                data = lalonde,
                family = binomial)
lalonde$w.logit <- rep(1,nrow(lalonde))
lalonde$w.logit[lalonde$treat==0] <- exp(predict(ps.logit,subset(lalonde,treat==0)))
@

\texttt{predict()} for logistic regression model produces estimates on the log-odds scale by default. Exponentiating those predictions for the comparison subjects gives the ATT weights $p/(1-p)$. \texttt{dx.wts()} from the \texttt{twang} package diagnoses the balance for an arbitrary set of weights producing a balance table.  This function requires the user to specify the estimand argument in order to perform the appropriate calculations relative to the target group on which we are drawing inferences.  The function \texttt{dx.wts} has not been updated in Version 2.0 and still relies on the older version of the balance calculations.


<<>>=
bal.logit <- dx.wts(x = lalonde$w.logit,
                    data=lalonde,
                    vars=c("age","educ","black","hispan","nodegree",
                      "married","re74","re75"),
                    treat.var="treat",
                    perm.test.iters=0, estimand = "ATT")
bal.logit
@ 


Applying the \texttt{bal.table()} function to this object returns a variable-by-variable summary of balance, just like it did for the \texttt{ps} object.  

<<>>=
bal.table(bal.logit)
@ 

For weights estimated with logistic regression, the largest KS statistic was reduced from the unweighted sample's largest KS of 0.64 to 0.31, which is still quite a large KS statistic. %Table~\ref{tab:balancelogit} shows the details of the balance of the treatment and comparison groups. The means of the two groups appear to be quite similar while the KS statistic shows substantial differences in their distributions.

% % Code for Table 3, which is not evaluated here but later for formatting purposes
% <<eval=FALSE>>=
% pretty.tab <- bal.table(bal.logit)[[2]][,c("tx.mn","ct.mn","ks")]
% pretty.tab <- cbind(pretty.tab, bal.table(bal.logit)[[1]]$ct.mn)
% names(pretty.tab) <- c("E(Y1|t=1)","E(Y0|t=1)","KS","E(Y0|t=0)")
% xtable(pretty.tab,
%        caption = "Logistic regression estimates of the propensity scores",
%        label = "tab:balancelogit",
%        digits = c(0, 2, 2, 2, 2),
%        align=c("l","r","r","r","r"))
% @

Table \ref{tab:balancecompare} and \ref{tab:balancecompare2} compares the balancing quality of the weights directly with one another using the standardize effect size and the KS statistic, respectively. The standardized effect sizes for both sets of propensity score weights are improved compared to the unweighted analysis. The KS statistics for \texttt{age} and \texttt{re74} are larger for the logistic regression model than the GBM-based propensity score model. Combining the results of Table \ref{tab:balancecompare} and \ref{tab:balancecompare2} we can conclude that logistic regression is achieving mean balance between the groups (Table \ref{tab:balancecompare}), but it does not balance the full distribution (Table \ref{tab:balancecompare2}).

\begin{table}[ht]
\begin{center}
\begin{tabular}{rccc}
\hline
Covariate & Unweighted & Using \texttt{ps} & Using logistic regression \\
\hline
age & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["age","std.eff.sz"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["age","std.eff.sz"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["age","std.eff.sz"],3)} \\
educ & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["educ","std.eff.sz"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["educ","std.eff.sz"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["educ","std.eff.sz"],3)} \\
black & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["black","std.eff.sz"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["black","std.eff.sz"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["black","std.eff.sz"],3)} \\
hispan & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["hispan","std.eff.sz"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["hispan","std.eff.sz"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["hispan","std.eff.sz"],3)} \\
nodegree & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["nodegree","std.eff.sz"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["nodegree","std.eff.sz"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["nodegree","std.eff.sz"],3)} \\
married & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["married","std.eff.sz"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["married","std.eff.sz"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["married","std.eff.sz"],3)} \\
re74 & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["re74","std.eff.sz"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["re74","std.eff.sz"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["re74","std.eff.sz"],3)} \\
re75 & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["re75","std.eff.sz"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["re75","std.eff.sz"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["re75","std.eff.sz"],3)} \\
\hline
\end{tabular}
\caption{Standardized effect size of covariates using GBM and logistic regression.}
\label{tab:balancecompare}
\end{center}
\end{table}

\begin{table}[ht]
\begin{center}
\begin{tabular}{rccc}
\hline
Covariate & Unweighted & Using \texttt{ps} & Using logistic regression \\
\hline
age & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["age","ks"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["age","ks"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["age","ks"],3)} \\
educ & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["educ","ks"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["educ","ks"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["educ","ks"],3)} \\
black & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["black","ks"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["black","ks"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["black","ks"],3)} \\
hispan & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["hispan","ks"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["hispan","ks"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["hispan","ks"],3)} \\
nodegree & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["nodegree","ks"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["nodegree","ks"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["nodegree","ks"],3)} \\
married & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["married","ks"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["married","ks"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["married","ks"],3)} \\
re74 & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["re74","ks"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["re74","ks"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["re74","ks"],3)} \\
re75 & \Sexpr{round(ps.lalonde.gbm$desc$unw$bal.tab$results["re75","ks"],3)} & \Sexpr{round(ps.lalonde.gbm$desc$es.mean.ATT$bal.tab$results["re75","ks"] , 3)} & \Sexpr{round(bal.table(bal.logit)[[2]]["re75","ks"],3)} \\
\hline
\end{tabular}
\caption{KS statistic of covariates using GBM and logistic regression.}
\label{tab:balancecompare2}
\end{center}
\end{table}

<<>>=
design.logit <- svydesign(ids=~1, weights=~w.logit, data=lalonde)
glm6 <- svyglm(re78 ~ treat, design=design.logit)
summary(glm6)
@

% % Code for Table 3, which is evaluated here
% <<echo= FALSE, results=tex>>=
% pretty.tab <- bal.table(bal.logit)[[2]][,c("tx.mn","ct.mn","ks")]
% pretty.tab <- cbind(pretty.tab, bal.table(bal.logit)[[1]]$ct.mn)
% names(pretty.tab) <- c("E(Y1|t=1)","E(Y0|t=1)","KS","E(Y0|t=0)")
% xtable(pretty.tab,
%        caption = "Logistic regression estimates of the propensity scores",
%        label = "tab:balancelogit",
%        digits = c(0, 2, 2, 2, 2),
%        align=c("l","r","r","r","r"))
% @

% % Code for Table 4, which is evaluated here
% <<echo=FALSE, results=tex>>=
% bal.gbm <- dx.wts(ps.lalonde.gbm,
%                   data=lalonde, estimand = "ATE",
%                   vars=c("age","educ","black","hispan","nodegree","married","re74","re75"),
%                   treat.var="treat",
%                   perm.test.iters=0)
% pretty.tab <- rbind(bal.logit$summary.tab,
%                     bal.gbm$summary.tab[-1,])
% rownames(pretty.tab) <- pretty.tab$type
% rownames(pretty.tab)[2] <- "logit"
% pretty.tab <- pretty.tab[,c("n.treat","ess.ctrl","max.es","mean.es","max.ks","mean.ks")]
% xtable(pretty.tab,
%        caption = "Summary of the balancing properties of logistic regression and gbm",
%        label = "tab:balancecompare",
%        digits = c(0, 0, 2, 2, 2, 2, 2),
%        #digits = c(0,0, rep(2,9)),
%        align=c("l","r","r","r","r","r","r"))
%        #align = rep("l", 11))
% @

The analysis estimates an increase in earnings of \$\Sexpr{round(coef(glm6)[2])} for those that participated in the NSW compared with similarly situated people observed in the CPS. Table~\ref{tab:allTE} compares all of the treatment effect estimates.

<<echo=FALSE>>=
glm7 <- svyglm(re78 ~ treat + age + educ + black + hispan + nodegree +
                      married + re74 + re75,
               design=design.logit)
@

\begin{table}[ht]
\begin{center}
\begin{tabular}{rll}
\hline
Treatment effect & PS estimate & Linear adjustment \\
\hline
\$\Sexpr{round(coef(glm1)[2])} & GBM, minimize ES    & none \\
\$\Sexpr{round(coef(glm2)[2])} & GBM, minimize ES    & nodegree\\
\$\Sexpr{round(coef(glm3)[2])} & GBM, minimize ES    & all\\
\$\Sexpr{round(coef(glm4)[2])} & None                & all\\
\$\Sexpr{round(coef(glm6)[2])} & Logistic regression & none \\
\$\Sexpr{round(coef(glm7)[2])} & Logistic regression & all\\
\hline
\end{tabular}
\caption{Treatment effect estimates by various methods}
\label{tab:allTE}
\end{center}
\end{table}

\section{An ATE example}

In the analysis of Section 2, we focused on estimating ATT for the \texttt{lalonde} dataset.  In this situation, the ATE is not of great substantive interest because not all people who are offered entrance into the program could be expected to take advantage of the opportunity.  Further, there is some evidence that the treated subjects were drawn from a subset of the covariate space.  In particular, in an ATE analysis, we see that we are unable to achieve balance, especially for the ``black`` indicator.  

We now turn to an ATE analysis that is feasible and meaningful.  We focus on the \texttt{lindner} dataset, which was included in the \texttt{USPS} package (Obenchain 2011), and is now included in \texttt{twang} for
convenience.  A tutorial by Helmreich and Pruzek (2009; HP) for the \texttt{PSAgraphics}  package also uses propensity scores to analyze a portion of these data.  HP describe the data as follows on p.\ 3 with our minor recodings in square braces: 

\begin{quote}
  The \texttt{lindner} data contain data on 996 patients treated at the Lindner Center, Christ Hospital, Cincinnati in 1997.  Patients received a Percutaneous Coronary Intervention (PCI).  The data consists of 10 variables.  Two are outcomes: [\texttt{sixMonthSurvive}] ranges over two values... depending on whether patients surved to six months post treatment [denoted by \texttt{TRUE}] or did not survive to six months [\texttt{FALSE}]...  Secondly, \texttt{cardbill} contains the costs in 1998 dollars for the first
  six months (or less if the patient did not survive) after treatment... The treatment variable is \texttt{abcix}, where 0 indicates PCI treatment and 1 indicates standard PCI treatment and additional treatment in some form with abciximab. Covariates include \texttt{acutemi}, 1 indicating a recent acute myocardial infarction and 0 not; \texttt{ejecfrac} for the left ventricle ejection fraction, a percentage from 0 to 90; \texttt{ves1proc} giving the number of vessels (0 to 5) involved in the initial PCI; \texttt{stent} with 1 indicating coronary stent inserted, 0 not; \texttt{diabetic} where 1 indicates that the patient has been diagnosed with diabetes, 0 not; \texttt{height} in centimeters and \texttt{female} coding the sex of the patent, 1 for female, 0 for male.
\end{quote}

HP focus on \texttt{cardbill} --- the cost for the first months after treatment  --- as their outcome of
interest.  However, since not all patients survived to six months, it is not clear whether a lower value of \texttt{cardbill} is good or not.  For this reason, we choose six-month survival (\texttt{sixMonthSurvive}) as our outcome of interest.  

Ignoring pre-treatment variables, we see that \texttt{abcix} is associated with lower rates of 6-month mortality:

<<>>=
data(lindner)
table(lindner$sixMonthSurvive, lindner$abcix)
chisq.test(table(lindner$sixMonthSurvive, lindner$abcix))
@ 

The question is whether this association is causal. If health care policies were to be made on the basis of these data, we would wish to elicit expert opinion as to whether there are likely to be other confounding pretreatment variables.  For this tutorial, we simply follow HP in choosing the pre-treatment covariates.  
The \texttt{twang} model is fit as follows

<<>>=
set.seed(1)
ps.lindner <- ps(abcix ~ stent + height + female + diabetic + 
                 acutemi + ejecfrac + ves1proc,
                 data = lindner,
                 estimand = "ATE",
                 verbose = FALSE)
@ 

We set \texttt{estimand =  "ATE"} because we are interested in the effects of abciximab on everyone in the population.  We do not specify the stopping rules.  Consequently \texttt{ps()} uses the defaults:
\texttt{es.mean} and \texttt{ks.mean}. We then inspect pre- and post-weighting balance with the command

<<echo=FALSE>>=
options(width=85)
@

<<>>=
bal.table(ps.lindner)
@

<<echo=FALSE>>=
options(width = 80)
@ 


This balance table shows that \texttt{stent}, \texttt{acutemi}, and \texttt{ves1proc} were all significantly
imbalanced before weighting.  After weighting (using either \texttt{stop.method} considered)  we do not see problems in this regard.  Examining \texttt{plot(ps.lindner, plots = x)} for \texttt{x} running from 1 to 5 does not reveal problems, either.  In regard to the optimize plot, we note that the scales of the KS and ES statistics presented in the optimize plots are not necessarily comparable.  The fact that the KS values are lower than the ES values in the optimize plot does not suggest that the KS stopping rule is finding superior models.  Each panel of the optimize plot indicates the gbm model that minimizes each stopping rule.  The
panels should not be compared other than to compare the number of iterations selected by each rule.  

\begin{center}
<<fig =TRUE, height=4, width = 6>>=
plot(ps.lindner, plots = 1)
@ 
\end{center}

\begin{center}
<<fig =TRUE, height=4, width = 6>>=
plot(ps.lindner, plots = 2)
@ 
\end{center}

\pagebreak
\begin{center}
<<fig =TRUE, height=4, width = 6>>=
plot(ps.lindner, plots = 3)
@ 
\end{center}

\begin{center}
<<fig =TRUE, height=4, width = 6>>=
plot(ps.lindner, plots = 4)
@ 
\end{center}

\pagebreak
\begin{center}
<<fig =TRUE, height=4, width = 6>>=
plot(ps.lindner, plots = 5)
@ 
\end{center}

From a call to \texttt{summary()}, we see that the \texttt{es.mean.ATE} stopping rule results in a slightly higher ESS with comparable balance measures, so we proceed with those weights. Also, we note that \texttt{ess.treat} is no longer equal to \texttt{n.treat} since we are focusing on ATE rather than ATT.

<<>>=
summary(ps.lindner)
@ 

As before, we use the \texttt{survey} package to reweight our sample and perform the analysis.

<<>>=
lindner$w <- get.weights(ps.lindner, stop.method = "es.mean")
design.ps <- svydesign(ids=~1, weights = ~w, data = lindner)
svychisq(~sixMonthSurvive + abcix, design = design.ps)
@ 

The reweighting does not diminish the association between the treatment and the outcome.  Indeed, it is marginally more significant after the reweighting. Alternatively, we can run a generalised linear model. 

% The \texttt{OVtool} can assess the sensitivity of findings for binary outcomes, but it utilizes the residuals from a linear probability model to generate the empirical cumulative distribution function. For that purpose we will utilize a linear probability model as opposed to a logistic model.
% 
% <<>>=
% glm_ATE = svyglm(sixMonthSurvive ~ abcix, design = design.ps)
% summary(glm_ATE)
% @
% 
% As in the ATT example, we want to check to see how sensitive our analysis is to possible unobserved confounders. We first replicate the outcome model.
% 
% <<>>=
% results_ATE = outcome_model(data = lindner,
%                             weights="w",
%                             treatment = "abcix",
%                             outcome = "sixMonthSurvive",
%                             model_covariates = 1,
%                             estimand = "ATE")
% @
% 
% We then run the omitted variable simulations to understand the sensitivity of our results.
% 
% <<eval=FALSE>>=
% ovtool_ate = ov_sim(model_results=results_ATE,
%                     plot_covariates=c("stent", "height", "female",
%                                       "diabetic", "ejecfrac"),
%                     es_grid = NULL,
%                     rho_grid = NULL,
%                     n_reps = 250,
%                     progress = TRUE)
% @
% 
% <<echo=FALSE>>=
% # save(ovtool_ate, file = "vignettes/ovtool_ate.rda", version = 2)
% load("ovtool_ate.rda")
% @
% 
% \begin{center}
% <<fig =TRUE, height=6, width = 8>>=
% plot.ov(ovtool_ate, col='color', print_graphic = 3, p_contours = c(0.01, 0.05))
% @
% \end{center}
% 
% As discussed in the ATT example, the black lines represent the treatment effect contours evaluated across plausible x and y values. The x-axis represents the association a variable has with the treatment indicator, [\texttt{abcix}], while the y-axis represents the absolute correlation a variable has with the outcome, [\texttt{sixMonthSurvive}].  The p-value threshold of 0.05 visible in the top right corner of the graphic tells us that an unobserved variable would have to have an association with the treatment indicator and outcome of at least 0.3 and 0.15, respectively to shift these results. The \texttt{summary.ov()} function helps us summarize not only the sensitivity of statistical significance but also the direction of the effect.
% 
% <<eval=TRUE>>=
% summary.ov(ovtool_ate, model_results = results_ATE)
% @
% 
% The output above demonstrates that even if omitted variables existed, they likely wouldn't shift the interpretation of these results.

\section{Non-response weights}
\label{sec:nonresponse}

The \texttt{twang} package was designed to estimate propensity score
weights for the evaluation of treatment effects in observational or
quasi-experimental studies.  However, we find that the package
includes functions and diagnostic tools that are highly valuable for
other applications, such as for generating and diagnosing
nonresponse weights for survey nonresponse or study attrition.  We
now present an example that uses the tools in \texttt{twang}.  This
example uses the subset of the US Sustaining Effects Study data
distributed with the HLM software (Bryk, Raudenbush, Congdon, 1996)
and also available in the R package \texttt{mlmRev}.  The data
include mathematics test scores for 1721 students in kindergarten to
fourth grade.  They also include student race (black, Hispanic,
or other), gender, an indicator for whether or not the student had
been retained in grade, the percent low income students at the
school, the school size, the percent of mobile students, the
students' grade-levels, student and school IDs, and grades converted
to year by centering.  The study analysis plans to analyze growth in
math achievement from grade 1 to grade 4 using only students with
complete data.  However, the students with complete data differ from
other students. To reduce bias that could potentially result from
excluding incomplete cases, our analysis plan is to weight complete
cases with nonresponse weights.

The goal of nonresponse weighting is to develop weights for the
respondents that make them look like the entire sample --- both the
respondents and nonrespondents. Since the respondents already look
like themselves, the hard part is to figure out how well each
respondent represents the nonrespondents.
Nonresponse weights equal the reciprocal of the
probability of response and are applied only to respondents.


Note that the the probability of response is equivalent to the propensity score if we
consider subjects with an observed outcome to be the ``treated''
group, and those with an unobserved outcome to be the ``controls''.
We wish to reweight the sample to make it equivalent to the population
from which the sample was drawn, so ATE weights are more appropriate
in this case.  Further, recall that the weights for the treated
subjects are $1/p$ in an ATE analysis.  Therefore we can reweight the
sample of respondents
using the \texttt{get.weights()} function.



Before we can generate nonresponse weights, we need to prepare the
data using the following commands.
First we load the data.

<<>>=
data(egsingle)
@

Next we create a response indicator variable that we can merge onto the student by test score level data.
We want to include only students with scores from all of grades 1 to 4. The data include scores from
kindergarten (grade = 0) to grade 5 with some students having multiple scores from the same grade.  First
we keep the unique grades for each student:

<<>>=
tmp <- tapply(egsingle$grade, egsingle$childid, unique)
@

Because students do not all have the same number of score,
\texttt{tapply()} returns a list with one element per student.
Each element contains the unique set of grades observed for each
student. We now check for grades in 1 to 4:

<<>>=
tmp <- lapply(tmp, function(x){return(x %in% 1:4)})
@

The list tmp now contains a boolean vector for each student, where ``TRUE'' indicates the grade
took on a value in 1 to 4. The sum of this vector for each student determines how many of grades 1 to 4
we observed for him or her.

<<>>=
tmp <- lapply(tmp, sum)
@

A student is a respondent if he or she has scores from all four of grades 1 to 4 or if the value of \texttt{tmp} is 4.

<<>>=
tmp <- sapply(tmp, function(x){as.numeric(x == 4)})
@

We create a data frame of response indicators so we can merge them onto the test scores data:

<<>>=
tmp <- data.frame(tmp)
names(tmp) <- "resp"
tmp$childid <- row.names(tmp)
@

and merge this back to create a single data frame

<<>>=
egsingle <- merge(egsingle, tmp)
@


Because nonresponse is a student-level variable rather than a
student-by-year-level variable we create one record per student.

<<>>=
egsingle.one <-unique(egsingle[,-c(3:6)])
@

We also create a race variable

<<>>=
egsingle.one$race <- as.factor(race <- ifelse(egsingle.one$black==1, 1,
                                         ifelse(egsingle.one$hispanic==1, 2, 3)))
@

As discussed above, to use \texttt{ps()} to estimate nonresponse, we
need to let respondents be the treatment group by modeling an
indicator of response.


%<<>>=
%egsingle.one$nresp <- 1-egsingle.one$resp
%@

<<>>=
egsingle.ps <-  ps(resp ~ race + female + size + lowinc + mobility,
      data=egsingle.one,
      stop.method=c("es.mean","ks.max"),
      n.trees=5000,
      verbose=FALSE,
      estimand = "ATE")
@

As in standard propensity score applications, we should check that
\texttt{n.trees} was set large enough, so that balance would seem not
to be improved if more complex models were considered. Recall that for
\texttt{es.mean.ATE} the measure is the average effect size
difference between the two groups and for \texttt{ks.max.ATE} the
measure is the largest of the KS statistics.

\begin{center}
<<fig=TRUE,height=3.5,width=6>>=
plot(egsingle.ps)
@
\end{center}


%The optimal number of iterations for gbm to minimize the maximum KS
%statistic is 3989 and the
%optimal number of iterations for gbm to minimize the average effect
%size is 3513.  The weights
%do an excellent job matching the distribution of the respondent
%group covariates to those of the nonrespondents as shown in
%Table~\ref{tab:balance2}.

By default the balance table generated by \texttt{ps()} compares
the weighted treatment group (respondents) to the weighted comparison
group (nonresponders) -- both groups weighted to equal the overall
population.  However, the  goal is to weight the respondents to match
the population not to compare the weighted respondents and
nonrespondents. The default balance table may be useful for
evaluating the propensity scores, but it does not directly assess the
quality of the weights for  balancing the weighted respondents with
the overall population.

We can ``trick'' the \texttt{dx.wts()} function in the \texttt{twang}
package into making the desired comparison. We want to compare the
weighted respondents to the unweighted full sample. When evaluating
ATT weights we compare the weighted comparison group with the unweighted
treatment group. If we apply \texttt{dx.wts()} to a data set where
the ``treatment'' group is the entire \texttt{esingle.one} sample
and the ``control'' group is the \texttt{esingle.one} respondents and
the weights equal one for every student in the pseudo-treatment group
and equal the weights from \texttt{ps()} for every student in the
pseudo-control group, we can obtain the balance statistics we want.

We begin by setting up the data with the pseudo-treatment and control groups.
We add ATE weights from the ``ks.max'' stopping rule as our nonresponse
weights.

<<>>=
egsingle.one$wgt <- get.weights(egsingle.ps, stop.method="ks.max")
@

We now stack the full sample and the respondents. The variable ``nr2''
is the pseudo-treatment indicator.  We set it equal to one for the
full sample and 0 for the respondents. Similarly, ``wgt2'' is the
pseudo-ATT weight which is set equal to one for the full sample and
equal to the nonresponse weights for the respondents.

<<>>=
egtmp <- rbind(data.frame(egsingle.one, nr2=1, wgt2=1),
                data.frame(egsingle.one, nr2=0, wgt2=egsingle.one$wgt)[egsingle.one$resp==1,])
@

We now run \texttt{dx.wts()} to obtain the balance statistics.
Switching to ATT from ATE yields a warning that can be ignored in this case.

<<>>=
egdxwts <- dx.wts(x=egtmp$wgt2,
                   data=egtmp,
                   estimand="ATT",
                   vars=c("race", "female", "size",  "lowinc",  "mobility"),
                   treat.var="nr2")
@

<<results=tex>>=
# pretty.tab<-bal.table(egdxwts)[[2]][,c("tx.mn","ct.mn","std.eff.sz","ks")]
# names(pretty.tab) <- c("OverallS Sample","Weighted responders","Std ES","KS")
# xtable(pretty.tab,
#         caption = "Balance of the nonrespondents and respondents",
#         label = "tab:balance2",
#         digits = c(0, 2, 2, 2, 2),
#         align=c("l","r","r","r","r"))
bal.table(egdxwts)[[2]]
@


The balance is very good. We can use the weighted respondents
for analysis. We select only the records with an observed outcome.
This will be our analysis sample and the variable ``wgt'' will
contains the nonreponse weights.

<<>>=
egsinge.resp <- merge(subset(egsingle, subset=resp==1),
                        subset(egsingle.one, subset=resp==1,
                               select=c(childid, wgt)) )
@





\section{The details of \texttt{twang}}

\subsection{Propensity scores and weighting}

Propensity scores can be used to reweight
comparison cases so that the distribution of their features match
the distribution of features of the treatment cases, for ATT, or cases
from both treatment and control groups to match each other, for ATE 
(Rosenbaum 1987, Woold\-ridge 2002,
Hirano and Imbens 2001, McCaffrey \textit{et al.} 2004)
Let
$f(\mathbf{x}|t=1)$ be the distribution of features for the
treatment cases and $f(\mathbf{x}|t=0)$ be the distribution of
features for the comparison cases. If treatments were randomized
then we would expect these two distributions to be similar. When
they differ for ATT we will construct a weight, $w(\mathbf{x})$, so that
\begin{equation}
f(\mathbf{x}|t=1) = w(\mathbf{x})f(\mathbf{x}|t=0).
\label{eq:balance}
\end{equation}
For example, if $f(\mbox{age=65},\mbox{sex=F}|t=1) = 0.10$ and
$f(\mbox{age=65},\mbox{sex=F}|t=0) = 0.05$ (i.e. 10\% of the
treatment cases and 5\% of the comparison cases are 65 year old
females) then we need to give a weight of 2.0 to every 65 year old
female in the comparison group so that they have the same
representation as in the treatment group. More generally, we can
solve (\ref{eq:balance}) for $w(\mathbf{x})$ and apply Bayes Theorem
to the numerator and the denominator to give an expression for the
propensity score weight for comparison cases,
\begin{equation}
w(\mathbf{x})=K\frac{f(t=1|\mathbf{x})}{f(t=0|\mathbf{x})}
=K\frac{P(t=1|\mathbf{x})}{1-P(t=1|\mathbf{x})}, \label{eq:weight}
\end{equation}
where $K$ is a normalization constant that will cancel out in the
outcomes analysis. Equation (\ref{eq:weight}) indicates that if we
assign a weight to comparison case $i$ equal to the odds that a case
with features $\mathbf{x}_i$ would be exposed to the treatment, then
the distribution of their features would balance. Note that for
comparison cases with features that are atypical of treatment cases,
the propensity score $P(t=1|\mathbf{x})$ would be near 0 and would
produce a weight near 0. On the other hand, comparison cases with
features typical of the treatment cases would receive larger
weights.

For ATE, each group is weighted to match the population. The weight
must satisfy:
\begin{equation}
f(\mathbf{x} | t =1) = w(\mathbf{x})f(\mathbf{x}), \mbox{ and}   
\label{eq:balance2}
\end{equation}
\begin{equation}
 f(\mathbf{x} | t=0) = w(\mathbf{x}) f(\mathbf{x}), \mbox{ and}
  \label{eq:balance3}
\end{equation}
Again using Bayes Theorem we obtain  $w(\mathbf{x}) \propto
1/f(t=1 | \mathbf{x})$ for the treatment group and $w(\mathbf{x}) \propto 1/
f(t = 0 | \mathbf{x} )$  for the control group.  


\subsection{Estimating the propensity score}

In randomized studies $P(t=1|\mathbf{x})$ is known and fixed in the
study design. In observational studies the propensity score is
unknown and must be estimated, but poor estimation of the propensity
scores can cause just as much of a problem for estimating treatment
effects as poor regression modeling of the outcome. Linear logistic
regression is the common method for estimating propensity scores,
and can suffice for many problems. Linear logistic regression for
propensity scores estimates the log-odds of a case being in the
treatment given $\mathbf{x}$ as

\begin{equation}
\log\frac{P(t=1|\mathbf{x})}{1-P(t=1|\mathbf{x})} = \beta'\mathbf{x}
\label{eq:logodds}
\end{equation}
Usually, $\beta$ is selected to maximize the logistic log-likelihood
\begin{equation}
\ell{\beta}=\frac{1}{n}\sum_{i=1}^n
t_i\beta'\mathbf{x}_i-\log\left(1+\exp(\beta'\mathbf{x}_i)\right)
\label{eq:loglikelihood}
\end{equation}

Maximizing (\ref{eq:loglikelihood}) provides the maximum likelihood
estimates of $\beta$. However, in an attempt to remove as much
confounding as possible, observational studies often record data on
a large number of potential confounders, many of which can be
correlated with one another. Standard methods for fitting logistic
regression models to such data with the iteratively reweighted least
squares algorithm can be statistically and numerically unstable. To
improve the propensity score estimates we might also wish to include
non-linear effects and interactions in $\mathbf{x}$. The inclusion
of such terms only increases the instability of the models.

One increasingly popular method for fitting models with numerous
correlated variables is the lasso (least absolute subset selection
and shrinkage operator) introduced in statistics in Tibshirani
(1996). For logistic regression, lasso estimation replaces
(\ref{eq:loglikelihood}) with a version that penalizes the absolute
magnitude of the coefficients

\begin{equation}
\ell{\beta}=\frac{1}{n}\sum_{i=1}^n
t_i\beta'\mathbf{x}_i-\log\left(1+\exp(\beta'\mathbf{x}_i)\right) -
\lambda\sum_{j=1}^J|\beta_j| \label{eq:lasso}
\end{equation}

The second term on the right-hand side of the equation is the
penalty term since it decreases the overall of $\ell{\beta}$ when
there are coefficient that are large in absolute value.  Setting
$\lambda=0$ returns the standard (and potentially unstable) logistic
regression estimates of $\beta$. Setting $\lambda$ to be very large
essentially forces all of the $\beta_j$ to be equal to 0 (the
penalty excludes $\beta_0$). For a fixed value of $\lambda$ the
estimated $\hat\beta$ can have many coefficients exactly equal to 0,
not just extremely small but precisely 0, and only the most powerful
predictors of $t$ will be non-zero. As a result the absolute penalty
operates as a variable selection penalty. In practice, if we have
several predictors of $t$ that are highly correlated with each
other, the lasso tends to include all of them in the model, shrink
their coefficients toward 0, and produce a predictive model that
utilizes all of the information in the covariates, producing a model
with greater out-of-sample predictive performance than models fit
using variable subset selection methods.

Our aim is to include as covariates all piecewise constant functions
of the potential confounders and their interactions. That is, in
$\mathbf{x}$ we will include indicator functions for continuous
variables like $I(\mbox{age}<15), I(\mbox{age}<16), \ldots,
I(\mbox{age}<90)$, etc., for categorical variables like
$I(\mbox{sex}=\mbox{male}), I(\mbox{prior MI}=\mbox{TRUE})$, and
interactions among them like $I(\mbox{age}<16)I(\mbox{sex} =
\mbox{male})I(\mbox{prior MI}=\mbox{TRUE})$. This collection of
basis functions spans a plausible set of propensity score functions,
are computationally efficient, and are flat at the extremes of
$\mathbf{x}$ reducing the likelihood of propensity score estimates
near 0 and 1 that can occur with linear basis functions of
$\mathbf{x}$. Theoretically with the lasso we can estimate the model
in (\ref{eq:lasso}), selecting a $\lambda$ small enough so that it
will eliminate most of the irrelevant terms and yield a sparse model
with only the most important main effects and interactions. Boosting
(Friedman 2001, 2003, Ridgeway 1999) effectively implements this
strategy using a computationally efficient method that Efron
\textit{et al.} (2004) showed is equivalent to optimizing
(\ref{eq:lasso}). With boosting it is possible to maximize
(\ref{eq:lasso}) for a range of values of $\lambda$ with no
additional computational effort than for a specific value of
$\lambda$. We use boosted logistic regression as implemented in the
generalized boosted modeling (\texttt{gbm}) package in R (Ridgeway 2005).

\subsection{Evaluating the weights}
\label{sec:pseval} As with regression analyses, propensity score
methods cannot adjust for unmeasured covariates that are
uncorrelated with the observed covariates. Nonetheless, the quality
of the adjustment for the observed covariates achieved by propensity
score weighting is easy to evaluate. The estimated propensity score
weights should equalize the distributions of the cases' features as
in (\ref{eq:balance}). This implies that weighted statistics of the
covariates of the comparison group should equal the same statistics
for the treatment group. For example, the weighted average of the
age of comparison cases should equal the average age of the
treatment cases. To assess the quality of the propensity score
weights one could compare a variety of statistics such as means,
medians, variances, and Kolmogorov-Smirnov statistics for each
covariate as well as interactions. The \texttt{twang} package
provides both the standardized effect sizes and KS statistics and
p-values testing for differences in the means and distributions of
the covariates for analysts to use in assessing balance.





\subsection{Analysis of outcomes}

With propensity score analyses the final outcomes analysis is
generally straightforward, while the propensity score estimation may
require complex modeling. Once we have weights that
equalize the distribution of features of treatment and control
cases by reweighting. For ATT, we give each treatment case a 
weight of 1 and each comparison
case a weight $w_i = p(\mathbf{x}_i)/(1 - p(\mathbf{x}_i))$. To
estimate the ATE, we give control cases weight $w_i =
1/(1-p(\mathbf{x}_i))$ and we give the treatment cases $w_i =
1/p(\mathbf{x}_i)$.  We then
estimate the treatment effect estimate with a weighted regression
model that contains only a treatment indicator. No additional
covariates are needed if the weights account for
differences in $\mathbf{x}$.

A combination of propensity score weighting and covariate adjustment
can be useful for several reasons. First, the propensity scores may
not have been able to completely balance all of the covariates. The
inclusion of these covariates in addition to the treatment indicator
in a weighted regression model may correct this if the imbalance is
relatively small. Second, in addition to exposure, the relationship
between some of the covariates and the outcome may also be of
interest. Their inclusion can provide coefficients that can estimate
the direction and magnitude of the relationship. Third, as with
randomized trials, stratifying on covariates that are highly
correlated with the outcome can improve the precision of estimates.
Lastly, the some treatment effect estimators that utilize an
outcomes regression model and propensity scores are ``doubly
robust'' in the sense that if either the propensity score model is
correct or the regression model is correct then the treatment effect
estimator will be unbiased (Bang \& Robins 2005).


\section*{About This Tutorial}
This tutorial and the R package were supported by funding from grant R01DA017507, R01DA015697, and R01DA034065 from the National Institute on Drug Abuse. The overarching goal of this work is to develop statistical methods and tools that will provide addiction health services researchers and others with the tools and training they need to study the effectiveness of treatments using observational data. For more information about \texttt{twang} and other causal tools being developed, see www.rand.org/statistics/twang.

RAND Social and Economic Well-Being is a division of the RAND Corporation that seeks to actively improve the health and social and economic well-being of populations and communities throughout the world. This research was conducted in the Social and Behavioral Policy Program within RAND Social and Economic Well-Being. The program focuses on such topics as risk factors and prevention programs, social safety net programs and other social supports, poverty, aging, disability, child and youth health and well-being, and quality of life, as well as other policy concerns that are influenced by social and behavioral actions and systems that affect well-being. For more information, email sbp@rand.org.

\begin{thebibliography}{77}     % start the bibliography

\small                          % put the bibliography in a small font

\bibitem{Bang:Robins:2005} Bang H. and J. Robins (2005). ``Doubly robust estimation in missing data and
causal inference models,'' \textit{Biometrics} 61:692--972.

\bibitem{bland:2013} Bland M. (2013). ``Do baseline p-values follow a uniform distribution in randomised trials?''
\textit{PLoS ONE} 8(10): e76010: 1--5.

\bibitem{Dehejia:Wahba:1999} Dehejia, R.H. and S. Wahba (1999). ``Causal effects in nonexperimental
studies: re-evaluating the evaluation of training programs," \textit{Journal
of the American Statistical Association} 94:1053--1062.

\bibitem{lars:2004} Efron, B., T. Hastie, I. Johnstone, R. Tibshirani (2004). ``Least angle
regression,'' \textit{Annals of Statistics} 32(2):407--499.

\bibitem{Friedman:2001} Friedman, J.H. (2001). ``Greedy function approximation: a gradient boosting
machine,'' \textit{Annals of Statistics} 29(5):1189--1232.

\bibitem{Friedman:2002} Friedman, J.H. (2002). ``Stochastic gradient boosting,'' \textit{Computational
Statistics and Data Analysis} 38(4):367--378.

\bibitem{Frie:Hast:Tibs:2000} Friedman, J.H., T. Hastie, R. Tibshirani (2000). ``Additive logistic regression:
a statistical view of boosting,'' \textit{Annals of Statistics} 28(2):337--374.

\bibitem{Hast:Tibs:Frie:2001} Hastie, T., R. Tibshirani, and J. Friedman (2001). \textit{The Elements of
Statistical Learning}. Springer-Verlag, New York.

\bibitem{helm} Helmreich, J.E., and R.M.\ Pruzek
  (2009). ``PSAgraphics: An R package to support propensity score
  analysis,'' \textit{Journal of Statistical Software} 29(6):1--23.

\bibitem{Hirano:Imbens:2001} Hirano, K. and G. Imbens (2001). ``Estimation of causal effects using
propensity score weighting: An application to data on right heart
catheterization,'' \textit{Health Services and Outcomes Research
Methodology} 2:259--278.

\bibitem{Hull:Loui:2002} Huppler-Hullsiek, K. and T. Louis (2002)
``Propensity score modeling strategies for the causal analysis of
observational data,'' \textit{Biostatistics} 3:179--193.

\bibitem{Lalonde:1986} Lalonde, R. (1986). ``Evaluating the econometric evaluations of training
programs with experimental data," \textit{American Economic Review}
76:604--620.

\bibitem{Litt:Vart:2004} Little, R. J. and S. Vartivarian (2004).
``Does weighting for nonresponse increase the variance of survey
means?'' \textit{ASA Proceedings of the Joint Statistical Meetings},
3897-3904 American Statistical Association (Alexandria, VA)
\url{http://www.bepress.com/cgi/viewcontent.cgi?article=1034&context=umichbiostat}.

\bibitem{McCaffrey:Ridgeway:Morral:2004}
McCaffrey, D., G. Ridgeway, A. Morral (2004). ``Propensity score
estimation with boosted regression for evaluating adolescent
substance abuse treatment,'' \textit{Psychological Methods}
9(4):403--425.

\bibitem{usps} Obenchain, B. (2011). \textit{USPS 1.2 package
    manual}. \url{http://cran.r-project.org/web/packages/USPS/USPS.pdf}

\bibitem{Ridgeway:1999} Ridgeway, G. (1999). ``The state of boosting,'' \textit{Computing Science and
Statistics} 31:172--181.

\bibitem{Ridgeway:2005} Ridgeway, G. (2005). \textit{GBM 1.5 package manual}.
\url{http://cran.r-project.org/doc/packages/gbm.pdf}.



\bibitem{Ridgeway:2006} Ridgeway, G. (2006). ``Assessing the effect of race bias in
post-traffic stop outcomes using propensity scores.''
\textit{Journal of Quantitative Criminology} 22(1):1--29.

\bibitem{Rosenbaum:Rubin:1983}
Rosenbaum, P. and D. Rubin (1983). ``The Central Role of the Propensity Score
in Observational Studies for Causal Effects,'' \emph{Biometrika} 70(1):41--55.

\bibitem{Rosenbaum:1987} Rosenbaum, P. (1987). ``Model-based direct adjustment,'' \textit{Journal of the
American Statistical Association} 82:387--394.

\bibitem{Tibshirani:1996} Tibshirani, R. (1996). ``Regression shrinkage and selection via the lasso,''
\textit{Journal of the Royal Statistical Society, Series B} 58(1):267--288.

\bibitem{Wooldridge:2002} Wooldridge, J. (2002). \textit{Econometric analysis of cross section and panel
data}, MIT Press, Cambridge.

\end{thebibliography}           % end the bibliography


\end{document}