---
title: "Introduction to cdid"
author: "David Benatia"
date:  "2024-12-31"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to cdid}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

<!-- rmarkdown::pandoc_version() -->
```{r setup, include=FALSE}
# Set the working directory to the project root
knitr::opts_knit$set(root.dir = getwd())
```

# Introduction

------------------------------------------------------------------------

-   Recent advances in econometrics have focused on the identification,
    estimation, and inference of treatment effects in staggered designs,
    i.e. where treatment timing varies across units. Despite the variety
    of available approaches, none stand out as universally superior, and
    many overlook key challenges such as estimator efficiency or the
    frequent occurrence of unbalanced panel data.

    In our recent paper [(Bellégo, Benatia, Dortet-Bernadet, Journal of
    Econometrics,
    2024)](<https://doi.org/10.1016/j.jeconom.2024.105783>),
    we address these gaps by extending the well-established method
    developed by [Callaway and Sant'Anna (Journal of Econometrics,
    2021)](<https://doi.org/10.1016/j.jeconom.2020.12.001>)
    The cdid library is therefore intended to be used in connection with
    the **did** library:
    <https://bcallaway11.github.io/did/articles/did-basics.html>. Our
    approach improves efficiency and adapts seamlessly to unbalanced
    panels, making it particularly valuable for researchers facing data
    with missing observations.

    This page introduces the **cdid** R library that implements the
    methods from our paper, showcasing their relevance for empirical
    research. **For those short on time, here are the key takeaways:**

    -   **cdid** offers greater precision (smaller standard errors) than
        **did** with balanced panel data, by aggregating smaller ATT
        parameters efficiently.

    -   **cdid** outperforms **did** with unbalanced panels,
        particularly if errors are serially correlated (e.g., in
        presence of unit-level fixed effects). Remark, however, units
        observed only once across all time periods must be dropped
        before estimation because they are not used by **cdid.**

    -   **cdid** is less prone to bias in cases of attrition, notably if
        missingness is related to unobservable heterogeneity. For cases
        where missingness is related to observables, additional results
        from our paper have yet to be implemented in the library (see
        the paper).

### Features of the cdid library

**What it supports:**

-   Handles any form of missing data.

-   Allows for control units comprising only "never treated" or also
    "not-yet-treated."

-   Provides two weighting matrix options for GMM-based aggregation of
    $\Delta_k ATT(g,t)$ into $ATT(g,t)$: identity or 2-step.
    Small-sample simulations show that:

    -   **Identity**: Best for smaller datasets with more missing data,
        and less overidentifying restrictions, like rotating panel
        surveys.

    -   **2-step**: Best for larger, balanced datasets with many
        overidentifying restrictions, or unbalanced datasets with
        relatively few missings.

-   Implements simple propensity scores for treatment assignment where
    predictor columns X are treated as constant across time periods. For
    time-varying covariates, users can include additional columns
    (Xt,Xt+1,etc.) in the dataset.

**Current limitations:**

-   The generalized attrition model from our paper, which allows for
    dynamic attrition processes (e.g., past outcomes influencing
    observation status), is not yet implemented.

-   The library currently supports only the IPW estimator, though
    extensions to doubly-robust estimators like in the **did** library
    are planned.

# Getting started

Installing the library requires using remotes at the moment because we
cannot submit the package on CRAN yet (Happy holidays!). The command is

``` r
remotes::install_github("joelcuerrier/cdid", ref = "main", build_vignettes = TRUE, force = TRUE)
```

## Balanced data

```{r}
#Load the relevant packages
library(did) #Callaway and Sant'Anna (2021) 
library(cdid) #Bellego, Benatia, and Dortet-Bernadet (2024)

set.seed(123)
#Generate a dataset: 500 units, 8 time periods, with unit fixed-effects. 
# The parameter sigma_alpha controls the unit-specific time-persistent unobserved
# heterogeneity.
data0=fonction_simu_attrition(N = 500,T = 8,theta2_alpha_Gg = 0.5, 
                              lambda1_alpha_St = 0, sigma_alpha = 2, 
                              sigma_epsilon = 0.1, tprob = 0.5)

# The true values of the coefficients are based on time-to-treatment. The treatment
# effect is zero before the treatment, 1.75 one period after, 1.5 two period after,
# 1.25 three period after, 1 four period after, 0.75 five period after, 0.5 six  
# period after, etc.

#We keep all observations, so we have a balanced dataset
data0$S <- 1

#Look at the data
head(data0,20)
```

The dataset is a dataframe with 3960 observations and 6 columns: $id$ to
keep track of unit ids, $date$ to keep track of time periods, an outcome
variable $Y$, a predictor $X$, a treatment date $date\_G$ (zero for
control group), and a sampling dummy $S$ used to keep track which
observations are used in the estimation. There are 495 unique id, and 8
time periods.

```{r}
#run did library on balanced panel
did.results = att_gt(
  yname="Y",
  tname="date",
  idname = "id",
  gname = "date_G",
  xformla = ~X,
  data = data0,
  weightsname = NULL,
  allow_unbalanced_panel = FALSE,
  panel = TRUE,
  control_group = "notyettreated",
  alp = 0.05,
  bstrap = TRUE,
  cband = TRUE,
  biters = 1000,
  clustervars = NULL,
  est_method = "ipw",
  base_period = "varying",
  print_details = FALSE,
  pl = FALSE,
  cores = 1
)

#run cdid with 2step weighting matrix
result_2step = att_gt_cdid(yname="Y", tname="date",
                         idname="id",
                         gname="date_G",
                         xformla=~X,
                         data=data0,
                         control_group="notyettreated",
                         alp=0.05,
                         bstrap=TRUE,
                         biters=1000,
                         clustervars=NULL,
                         cband=TRUE,
                         est_method="2-step",
                         base_period="varying",
                         print_details=FALSE,
                         pl=FALSE,
                         cores=1)

#run cdid with identity weighting matrix
result_id = att_gt_cdid(yname="Y", tname="date",
                        idname="id",
                        gname="date_G",
                        xformla=~X,
                        data=data0,
                        control_group="notyettreated",
                        alp=0.05,
                        bstrap=TRUE,
                        biters=1000,
                        clustervars=NULL,
                        cband=TRUE,
                        est_method="Identity",
                        base_period="varying",
                        print_details=FALSE,
                        pl=FALSE,
                        cores=1)
```

```{r}
print(did.results)
# Remark that standard errors are smaller for most ATT(g,t) when using cdid
print(result_2step)
print(result_id)
```

Now that we have the ATT(g,t) for all three methods, we can compare
aggregate parameters using the following commands

```{r}
# Aggregation
agg.es.did <- aggte(MP = did.results, type = 'dynamic')
agg.es.2step <- aggte(MP = result_2step, type = 'dynamic')
agg.es.id <- aggte(MP = result_id, type = 'dynamic')
```

```{r}
# Print results
agg.es.did
```

```{r}
#Remark that standard errors are smaller, notably so for the 2step estimator
agg.es.2step
```

```{r}
agg.es.id
```

## Unbalanced data

Let us now do the same but with an unbalanced panel dataset.

```{r}
#Generate a dataset: 500 units, 8 time periods, with unit fixed-effects (alpha)
set.seed(123)
data0=fonction_simu_attrition(N = 500,T = 8,theta2_alpha_Gg = 0.5, 
                              lambda1_alpha_St = 0, sigma_alpha = 2, 
                              sigma_epsilon = 0.1, tprob = 0.5)

#We discard observations based on sampling indicator S
data0 <- data0[data0$S==1,]

#run did
did.results =  att_gt(
  yname="Y",
  tname="date",
  idname = "id",
  gname = "date_G",
  xformla = ~X,
  data = data0,
  weightsname = NULL,
  allow_unbalanced_panel = FALSE,
  panel = FALSE,
  control_group = "notyettreated",
  alp = 0.05,
  bstrap = TRUE,
  cband = TRUE,
  biters = 1000,
  clustervars = NULL,
  est_method = "ipw",
  base_period = "varying",
  print_details = FALSE,
  pl = FALSE,
  cores = 1
)

#run cdid with 2step weighting matrix
result_2step = att_gt_cdid(yname="Y", tname="date",
                         idname="id",
                         gname="date_G",
                         xformla=~X,
                         data=data0,
                         control_group="notyettreated",
                         alp=0.05,
                         bstrap=TRUE,
                         biters=1000,
                         clustervars=NULL,
                         cband=TRUE,
                         est_method="2-step",
                         base_period="varying",
                         print_details=FALSE,
                         pl=FALSE,
                         cores=1)

#run cdid with identity weighting matrix
result_id = att_gt_cdid(yname="Y", tname="date",
                        idname="id",
                        gname="date_G",
                        xformla=~X,
                        data=data0,
                        control_group="notyettreated",
                        alp=0.05,
                        bstrap=TRUE,
                        biters=1000,
                        clustervars=NULL,
                        cband=TRUE,
                        est_method="Identity",
                        base_period="varying",
                        print_details=FALSE,
                        pl=FALSE,
                        cores=1)
```

```{r}
#Note the precision gains
print(did.results)
print(result_2step)
print(result_id)
```

```{r}
# Aggregation
# There are other ways to aggregate, see the did library
agg.es.did <- aggte(MP = did.results, type = 'dynamic')
agg.es.2step <- aggte(MP = result_2step, type = 'dynamic')
agg.es.id <- aggte(MP = result_id, type = 'dynamic')
```

```{r}
#Note the precision gains
agg.es.did
```

```{r}
agg.es.2step
```

```{r}
agg.es.id
```