---
title: "Cohort tables"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{a03_cohort_table}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Cohort table

```{r, echo = FALSE, message = FALSE, warning = FALSE}
library(omopgenerics)
library(dplyr)
```

A cohort is a **set of people that fulfill a certain set of criteria for a period of time**.

In omopgenerics we defined the `cohort_table` class that allows us to represent individuals in a cohort.

A `cohort_table` is created using the `newCohortTable()` function that is defined by:

- A cohort table.

- A cohort set.

- A cohort attrition.

Let's start by creating a cdm reference with just two people.
```{r}
person <- tibble(
  person_id = c(1, 2),
  gender_concept_id = 0, year_of_birth = 1990,
  race_concept_id = 0, ethnicity_concept_id = 0
)
observation_period <- dplyr::tibble(
  observation_period_id = c(1, 2), person_id = c(1, 2),
  observation_period_start_date = as.Date("2000-01-01"),
  observation_period_end_date = as.Date("2021-12-31"),
  period_type_concept_id = 0
)
cdm <- cdmFromTables(
  tables = list(
    "person" = person,
    "observation_period" = observation_period
  ),
  cdmName = "example_cdm"
)
cdm
```

Now let's say one of these people have a clinical event of interest, we can include them in a cohort table which can then be used as part of an analysis. 
```{r}
cohort <- tibble(
  cohort_definition_id = 1, subject_id = 1,
  cohort_start_date = as.Date("2020-01-01"),
  cohort_end_date = as.Date("2020-01-10")
)
cdm <- insertTable(cdm = cdm, name = "cohort", table = cohort)
cdm$cohort <- newCohortTable(cdm$cohort)
```

The cohort table will be associated with settings and attrition. As we didn't specify these in newCohortTable() above they will have been automatically populated. You can access the cohort set of a cohort table using the function `settings()`
```{r}
settings(cdm$cohort)
```

Meanwhile, you can access the cohort attrition of a cohort table using the function `attrition()`
```{r}
attrition(cdm$cohort)
```

Cohort attrition table is also used to compute the number of counts that each cohort (ie from the last row of the attrition). It can be seen with the function `cohortCount()`. 
```{r}
cohortCount(cdm$cohort)
```

Note that because the cohort count is taken from the last row of attrition, if we make changes to a cohort we should then update attrition as we go. We can do this 
```{r}
cdm$cohort <- cdm$cohort |>
  filter(cohort_start_date == as.Date("2019-01-01")) |>
  compute(name = "cohort", temporary = FALSE) |>
  recordCohortAttrition("Require cohort start January 1st 2019")
attrition(cdm$cohort)
cohortCount(cdm$cohort)
```

An additional, optional, attribute keeps track of the concepts used to create the cohort. In this example we do not have a codelist associated with our cohort. 

```{r}
cohortCodelist(cdm$cohort, cohortId = 1, type = "index event")
```

We could though associate our cohort with a codelist
```{r}
cdm$cohort <- newCohortTable(cdm$cohort,
  cohortCodelistRef = dplyr::tibble(
    cohort_definition_id = c(1, 1),
    codelist_name = c("disease X", "disease X"),
    concept_id = c(101, 102),
    type = "index event"
  )
)
cohortCodelist(cdm$cohort, cohortId = 1, type = "index event")
```

Each one of the elements that define a cohort table have to fulfill certain criteria.

### Cohort Set

A cohort set must be a table with:

- Lower case column names.

- At least `r cohortColumns("cohort_set")` columns (`cohortColumns("cohort_set")`).

- `cohort_name` it must contain unique cohort names (currently they are cased to snake case).

- `cohort_definition_id` it must contain unique cohort ids, all the ids present in table must be present in the cohort set and the same ids must be present in cohort attrition.

### Cohort Attrition

A cohort attrition must be a table with:

- Lower case column names.

- At least `r cohortColumns("cohort_attrition")` columns (`cohortColumns("cohort_attrition")`).

- `cohort_definition_id` it must contain cohort ids, all the ids present in table must be present in the cohort attrition and the same ids must be present in cohort set.

- There must exist unique pairs of `cohort_definition_id` and `reason_id`.

### Cohort Codelist

A cohort codelist must be a table with:

- Lower case column names.

- At least `r cohortColumns("cohort_codelist")` columns (`cohortColumns("cohort_codelist")`).

- `cohort_definition_id` it must contain cohort ids, all the ids present in table must be present in the cohort attrition and the same ids must be present in cohort set.

- `type` must be one of "index event", "inclusion criteria", and "exit criteria"

### Cohort Table

A cohort table must be a table with:

- It comes from a cdm_reference (extracted via `cdm$cohort`).

- It has the same source than this cdm_reference.

- Lower case column names.

- At least `r cohortColumns("cohort")` columns (`cohortColumns("cohort")`).

- There is no record with `NA` value in the required columns.

- There is no record with `cohort_start_date` after `cohort_end_date`.

- There is no overlap between records. A person can be in a cohort several times (several records with the same subject_id). But it can't enter (cohort_start_date) the cohort again before leaving it (cohort_end_date). So an individual can't be simultaneously more than once in the same cohort. This rule is applied at the cohort_definition_id level, so records with different cohort_definition_id can overlap.

- All the time between cohort_start_date and cohort_end_date (both included) the individual must be in observation.

## Combining generated cohort sets

You can bind two cohort tables using the method `bind()`. You can combine several cohort tables using this method. The only constrain is that cohort names must be unique across the different cohort tables. You have to provide a name for the new cohort table.

```{r}
asthma <- tibble(
  cohort_definition_id = 1, subject_id = 1,
  cohort_start_date = as.Date("2020-01-01"),
  cohort_end_date = as.Date("2020-01-10")
)
cdm <- insertTable(cdm, name = "asthma", table = asthma)
cdm$asthma <- newCohortTable(cdm$asthma,
  cohortSetRef = tibble(
    cohort_definition_id = 1,
    cohort_name = "asthma"
  )
)

copd <- tibble(
  cohort_definition_id = 1, subject_id = 2,
  cohort_start_date = as.Date("2020-01-01"),
  cohort_end_date = as.Date("2020-01-10")
)
cdm <- insertTable(cdm, name = "copd", table = copd)
cdm$copd <- newCohortTable(cdm$copd,
  cohortSetRef = tibble(
    cohort_definition_id = 1,
    cohort_name = "copd"
  )
)

cdm <- bind(cdm$asthma,
  cdm$copd,
  name = "exposures"
)
cdm$exposures

settings(cdm$exposures)
attrition(cdm$exposures)
cohortCount(cdm$exposures)
```

## Export metadata about a cohort table

You can export the metadata of a `cohort_table` using the function: `summary()`:
```{r}
summary(cdm$exposures) |>
  glimpse()
```

This will provide a `summarised_result` object with the metadata of the cohort (cohort set, cohort counts and cohort attrition).