---
title: "Delay distributions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Delay-distributions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, message = FALSE, warning = FALSE, collapse=TRUE}
library(rtestim)
library(ggplot2)
library(dplyr)
library(nnet)
library(forcats)
library(tidyr)
theme_set(theme_bw())
```
\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\Rtseq}{\mathop{\{R_t\}_{t=1}^n}}
```{r, include = FALSE}
knitr::opts_chunk$set(
  dpi = 300,
  collapse = FALSE,
  comment = "#>",
  fig.asp = 0.618,
  fig.width = 6,
  out.width = "80%",
  fig.align = "center"
)
```
This package accommodates 3 different ways of specifying a delay distribution
in the renewal equation, and this vignette illustrates these in turn. The
renewal equation specifies that the expected incidence at time $t$ is a
weighted sum of previous incidence, multiplied by $R_t$:
\begin{equation}
\mathbb{E}[I_t \mid I_1,\ldots,I_{t-1}] = R_t \sum_{a=1}^t w_a I_{t-a},
\end{equation}
calculated by convolving the preceding new infections with the discretized
generation (or serial) interval distribution $w$. It is the allowable delay
distributions $\{w_a\}_{a=1}^n$ that are the focus of this vignette.
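As a toy illustration of this recursion (with made-up numbers, not package code),
the expected incidence at a single time point is just a weighted sum of past
incidence, scaled by $R_t$:
```{r renewal-sketch}
# Toy illustration of the renewal equation at a single time point t
w <- c(0.2, 0.5, 0.3)   # hypothetical delay distribution (sums to 1)
I_past <- c(10, 12, 15) # I_{t-3}, I_{t-2}, I_{t-1}
R_t <- 1.4              # hypothetical reproduction number at time t
R_t * sum(w * rev(I_past)) # R_t * (w_1 I_{t-1} + w_2 I_{t-2} + w_3 I_{t-3})
```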
## Discretization
First, it is important to recognize that the sequence $\{w_a\}_{a=1}^n$ is usually
a discretization of a probability density function, most frequently a gamma or
Weibull density. This is because observed incidence is reported at discrete time
points like days or weeks, while the delays themselves are continuous. Using the
default (parametric) delay distributions therefore requires calculating a discrete
approximation.
## Default (parametric) delay distribution
By default, `estimate_rt()` uses a gamma distribution parameterized by the
shape $k$ and scale $\theta$. This distribution has pdf
\[
f_W(w) = \frac{1}{\Gamma(k)\theta^k} w^{k-1} e^{-w/\theta} I(w > 0),
\]
where $I$ is the indicator function. The mean of this distribution is $k\theta$,
and the variance is $k\theta^2$. Both $k$ and $\theta$ must be greater than 0.
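As a quick check of this parameterization, here is a small sketch computing the
implied mean and standard deviation when the shape and scale are both 2.5, the
values used for the default (orange) curve in the figure below:
```{r gamma-moments}
k <- 2.5; theta <- 2.5 # parameters of the default (orange) curve below
c(mean = k * theta, sd = sqrt(k) * theta)
```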
The figure below shows a few example densities from this family.
```{r gamma-pdf, echo = FALSE}
ggplot(data.frame(x = seq(.001, 20, length.out = 100)), aes(x)) +
  stat_function(fun = dgamma, args = list(shape = 2.5, scale = 2.5), color = "orange") +
  stat_function(fun = dgamma, args = list(shape = 1, scale = 2), color = "cornflowerblue") +
  stat_function(fun = dgamma, args = list(shape = 2, scale = 1), color = "forestgreen") +
  scale_y_continuous(expand = expansion(c(0, 0.05)))
```
The default is the orange curve shown above. Given an incidence sequence of
length $n$, `delay_calculator()` internally creates $\{w_a\}_{a=1}^n$ by computing
\[
w_a = F(a) - F(a - 1),
\]
and then renormalizing with `w / sum(w)` so that the weights sum to 1. Here, $F$
is the cumulative distribution function of the gamma distribution (though the same
construction works for any continuous distribution). Note that this formula assumes
that the probability of a delay of 0 is 0, and that the probability of a delay of $a$ is really
\[
\int_{a-1}^a f_W(w) \, dw = F(a) - F(a-1).
\]
When using this default, the delay distribution is necessarily assumed to be the same for all $t$, and we
compute the convolution of `w` and `I` with the Fast Fourier Transform.
Finally, an adjustment is made to handle the first observation. We define
$I_0 = I_1$.
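For concreteness, here is a sketch of the discretize-and-renormalize step using
`pgamma()` (this mimics, but is not, the internal code of `delay_calculator()`):
```{r discretize-sketch}
# Illustrative discretization of the default gamma delay distribution:
# w_a = F(a) - F(a - 1) for a = 1, ..., n, then renormalize to sum to 1.
a <- 1:30
w <- pgamma(a, shape = 2.5, scale = 2.5) - pgamma(a - 1, shape = 2.5, scale = 2.5)
w <- w / sum(w)
round(head(w), 4)
```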
Just to illustrate this behaviour, we show the results for the default setting
on the included `cancovid` data.
```{r can-default}
can_default <- estimate_rt(cancovid$incident_cases, x = cancovid$date, nsol = 20L)
plot(can_default) + coord_cartesian(ylim = c(0.5, 2))
```
## Constant, non-parametric delay distribution
If we don't believe that the Gamma distribution closely approximates the serial
interval distribution, then we can specify our own distribution manually. For example,
[Backer et al., Table S1](https://doi.org/10.2807/1560-7917.ES.2022.27.6.2200042)
gives observed serial intervals for the Omicron (SGTF) and Delta (non-SGTF)
COVID-19 variants during 2 weeks in 2021 in the Netherlands. For the sake of
illustration, we aggregate these together and use this as our (constant) delay
distribution.
```{r cancov-nonpar}
# Data from Backer et al., Table S1 (observed serial intervals by variant)
delay <- read.csv("backer.csv") |>
  filter(delay > 0) |> # keep only positive delays
  select(-delay)       # drop the delay column, keeping the counts
delay <- rowSums(delay)     # aggregate the variants together
delay <- delay / sum(delay) # normalize to a probability distribution
```
```{r cancov-nonpar-plot, echo=FALSE}
ggplot(
  data.frame(delay = seq_along(delay), probability = delay),
  aes(delay, probability)
) +
  geom_point(colour = "cornflowerblue") +
  geom_segment(yend = 0, colour = "cornflowerblue") +
  scale_y_continuous(expand = expansion(c(0, 0.05)))
```
This distribution looks something like a gamma, but it has finite support.
We can easily use it instead.
```{r can-nonpar}
can_nonpar <- estimate_rt(
  cancovid$incident_cases,
  x = cancovid$date,
  delay_distn = delay,
  nsol = 20L
)
plot(can_nonpar) + coord_cartesian(ylim = c(0.5, 2))
```
The result is much less dramatic than the previous version. This is likely
because the distribution is much more concentrated near short delays. We can see
this by examining the two CDFs.
```{r cdfs, echo = FALSE}
cdfs <- data.frame(
  x = 0:20,
  default = pgamma(0:20, shape = 2.5, scale = 2.5),
  backer = cumsum(c(0, delay, rep(0, 20 - length(delay))))
)
ggplot(cdfs |> pivot_longer(-x), aes(x, value, color = name)) +
  geom_step() +
  scale_y_continuous(expand = expansion(0)) +
  scale_color_manual(
    values = c(default = "orange", backer = "cornflowerblue"), # match earlier plots
    name = ""
  )
```
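To make the comparison concrete, the sketch below computes the mean delay implied
by each distribution (the default gamma has mean $k\theta = 2.5 \times 2.5 = 6.25$
days; the mean of the Backer et al. weights is computed from the `delay` vector
above):
```{r mean-delay-comparison}
# Mean delay under each distribution: default gamma vs. the Backer et al. weights
c(
  default = 2.5 * 2.5,                   # gamma mean = shape * scale
  backer = sum(seq_along(delay) * delay) # weights are placed on delays 1, 2, ...
)
```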
## Time-varying delays
Finally, we also allow time-varying delay distributions. This is accomplished
with a matrix. This requires a bit of work, but is not too challenging. For
example, to create the correct matrix using the Baker et al. delays, the
necessary code is the following.
```{r backer-matrix, eval=FALSE}
library(Matrix) # needed for drop0() and sparse matrices below
n <- nrow(cancovid)
backer_delay <- c(0, delay, rep(0, n - length(delay) - 1))
delay_mat <- matrix(0, n, n)
delay_mat[1, 1] <- 1
for (iter in 2:n) delay_mat[iter, 1:iter] <- rev(backer_delay[1:iter])
delay_mat <- drop0(as(delay_mat, "CsparseMatrix")) # make it sparse, not necessary
delay_mat <- delay_mat / rowSums(delay_mat) # renormalize each row
```
We could then simply pass `delay_mat` to `estimate_rt(..., delay_distn = delay_mat)`.
The result would be the same as that shown above.
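In code, that call would look like the following sketch (not run here;
`can_backer_tv` is just an illustrative name):
```{r pass-backer-matrix, eval=FALSE}
can_backer_tv <- estimate_rt(
  cancovid$incident_cases,
  x = cancovid$date,
  delay_distn = delay_mat, # n x n matrix: row t holds the delay distribution at time t
  nsol = 20L
)
```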
## Variant-specific delays
For a more illustrative version of the previous example, we revisit the Canadian
data, but combine it with some additional information.
We use variant
circulation information from [CoVaRR-Net's Duotang notebook](https://covarr-net.github.io/duotang/duotang.html).
For simplicity, the code for this processing is hidden in the fold.
```{r duotang-processing, eval=FALSE, echo=TRUE}
# Run on 19 April 2024
library(readr)   # for read_tsv() and read_csv()
library(stringr) # for str_replace(), str_locate(), str_replace_all()
duotang <- read_tsv("https://github.com/CoVaRR-NET/duotang/raw/main/data_needed/virusseq.metadata.csv.gz")
columnlist <- c(
  "fasta_header_name", "province", "host_gender", "host_age_bin",
  "sample_collection_date", "sample_collected_by",
  "purpose_of_sampling", "purpose_of_sequencing", "lineage",
  "raw_lineage", "gisaid_accession", "isolate"
)
unknown.str <- c(
  "Undeclared", "Not Provided", "Restricted Access", "Missing",
  "Not Applicable", "", "NA", "unknow"
)
duotang <- duotang |>
  rename(province = geo_loc_name_state_province_territory) |>
  select(all_of(columnlist))
meta <- duotang |>
  mutate(
    week = cut(sample_collection_date, "week"),
    month = gsub("-..$", "", as.character(cut(sample_collection_date, "month")))
  )
source("https://github.com/CoVaRR-NET/duotang/raw/main/scripts/scanlineages.R")
meta <- meta |>
  mutate(gisaid_accession = str_replace(gisaid_accession, "EPI_ISL_", "")) |>
  rename(GID = gisaid_accession) |>
  rowwise() |>
  mutate(raw_lineage = ifelse(
    grepl("^X", raw_lineage),
    str_replace_all(paste0(
      realtorawlineage(substr(
        raw_lineage, 1, str_locate(raw_lineage, "\\.") - 1
      )),
      ".",
      substr(raw_lineage, str_locate(raw_lineage, "\\.") + 1, nchar(raw_lineage))
    ), "[\r\n]", ""),
    raw_lineage
  )) |>
  ungroup()
dico <- makepangolindico() # rebuild the lineage dictionary so the correct names get assigned to XBB descendants not named XBB
VOCVOI <- read_csv("https://raw.githubusercontent.com/CoVaRR-NET/duotang/main/resources/vocvoi.csv")
meta$pango_group <- create.pango.group(VOCVOI, meta)
meta <- select(meta, province, week, pango_group) |>
  mutate(week = as.Date(week))
counts <- group_by(meta, province, week, pango_group) |>
  count() |>
  ungroup() |>
  arrange(province, week, pango_group)
can_counts <- group_by(meta, week, pango_group) |>
  count() |>
  ungroup() |>
  arrange(week, pango_group) |>
  mutate(province = "Canada")
counts <- bind_rows(can_counts, counts)
saveRDS(counts, "duotang-counts.rds")
```
We smooth the raw data using multinomial logistic regression on a third-order
orthogonal polynomial in time to produce the following estimated variant proportions
in Canada.
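The core of that processing is a call to `multinom()` from the nnet package; the
following sketch shows the shape of the fit, where `variant_props` and `all_days`
are placeholder objects (the full code lives in the hidden chunk below).
```{r smoothing-sketch, eval=FALSE}
# Sketch of the smoothing step (placeholder data objects, not run):
# regress variant proportions on an orthogonal cubic polynomial in time.
fit <- multinom(
  cbind(Alpha, Delta, Omicron, other) ~ poly(time, degree = 3),
  data = variant_props, trace = FALSE
)
smoothed <- predict(fit, newdata = data.frame(time = all_days), type = "probs")
```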
```{r duotang-counts, echo=FALSE, message=FALSE}
props <- readRDS("duotang-counts.rds") |>
  pivot_wider(names_from = pango_group, values_from = n, values_fill = 0) |>
  mutate(total = rowSums(across(-c(week, province)))) |>
  mutate(across(-c(week, province, total), ~ .x / total)) |>
  select(-total)
smooth_it <- function(props_group) {
  z <- props_group |> select(-week, -province)
  n <- names(z)
  nn <- gsub(" ", "_", n)
  names(z) <- nn
  form_resp <- paste0("cbind(", paste0(names(z), collapse = ","), ") ~ ")
  z$time <- as.numeric(props_group$week)
  form <- as.formula(paste0(form_resp, "poly(time, degree = 3)"))
  fits <- multinom(form, z, trace = FALSE)
  rng <- range(props_group$week)
  alltime <- as.numeric(seq(rng[1], rng[2], by = 1))
  z <- as_tibble(predict(fits, data.frame(time = alltime), type = "probs")) |>
    mutate(Date = as.Date(alltime))
  z
}
can_props_smoothed <- smooth_it(props |> filter(province == "Canada"))
can_props_smoothed |>
  pivot_longer(-Date) |>
  ggplot(aes(Date, y = value, fill = name)) +
  geom_area(position = "stack") +
  ylab("Variants in circulation") +
  xlab("") +
  theme_bw() +
  scale_x_date(name = "", date_breaks = "1 year", date_labels = "%Y", expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_fill_viridis_d(name = "")
can_pred_class <- can_props_smoothed |>
  filter(Date <= max(cancovid$date)) |>
  pivot_longer(-Date) |>
  group_by(Date) |>
  summarise(var = name[which.max(value)], .groups = "drop") |>
  mutate(var = case_when(
    var == "other" ~ "Ancestral lineage",
    var == "Alpha" ~ "Alpha",
    var == "Beta" ~ "Beta",
    var == "Delta" ~ "Delta",
    TRUE ~ "Omicron"
  ))
boot <- data.frame(
  Date = seq(min(cancovid$date), min(can_props_smoothed$Date) - 1, by = "day"),
  var = "Ancestral lineage"
)
can_pred_class <- bind_rows(boot, can_pred_class)
```
Using the estimated proportions, we label each date with the dominant variant
at the time (and restrict ourselves to the time period of the included case
data).
```{r, echo=FALSE, fig.height=1, dev='png'}
ggplot(can_pred_class, aes(x = Date, fill = fct_relevel(var, "Ancestral lineage"))) +
  geom_ribbon(aes(ymax = 1, ymin = 0)) +
  scale_y_continuous(expand = expansion(0)) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
  scale_fill_brewer(palette = "Dark2", name = "")
```
## Creating delay distributions
We use the delay distribution data from the meta-analysis of
[Xu, X., Wu, Y., Kummer, A.G. et al.](https://doi.org/10.1186/s12916-023-03070-8).
The raw data for their analysis is available on
[GitHub](https://github.com/xxyy0574/COVID-19-transmission-parameters). We use
a simplified version of their procedure, taking the median across studies of the
reported mean and standard deviation of the delay for each variant separately.
We then convert these to the shape $k$ and scale $\theta$ parameters of a gamma
distribution. This analysis is hidden below the fold.
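The conversion itself is simple moment matching: since the gamma mean is $k\theta$
and the variance is $k\theta^2$, we have $k = \text{mean}^2/\text{sd}^2$ and
$\theta = \text{sd}^2/\text{mean}$. A small sketch (the helper name and the
numbers are made up for illustration):
```{r gamma-moment-matching}
# Moment matching (illustrative helper, not a package function):
# solve mean = shape * scale and sd^2 = shape * scale^2 for shape and scale.
moment_match_gamma <- function(mean, sd) {
  c(shape = mean^2 / sd^2, scale = sd^2 / mean) # scale = sd^2 / mean = mean / shape
}
moment_match_gamma(mean = 5.2, sd = 1.7) # made-up mean/sd of a serial interval
```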
```{r meta-analysis, eval=FALSE}
data_raw <- readxl::read_excel("xu-etal-DATA_RAW.xlsx") |>
  select(type, para, n = Sample_size, mean, sd, se, median) |>
  filter(!is.na(type)) |>
  mutate(across(-c(type, para), as.numeric))
bonehead_meta <- data_raw |>
  group_by(type, para) |>
  mutate(
    no_n = all(is.na(n)),
    n = case_when(!is.na(n) ~ n, no_n ~ 1, TRUE ~ median(n, na.rm = TRUE))
  ) |>
  ungroup() |>
  mutate(
    mean = case_when(!is.na(mean) ~ mean, TRUE ~ median),
    sd = case_when(!is.na(sd) ~ sd, TRUE ~ se * sqrt(n))
  ) |>
  group_by(type, para) |>
  summarise(
    mean = median(mean, na.rm = TRUE),
    sd = median(sd, na.rm = TRUE),
    .groups = "drop"
  )
## There's only one Beta entry, and it's only for IP.
## We use the corresponding sd from Alpha IP,
## and duplicate Alpha for GT / SI.
Beta_IP <- bonehead_meta |> filter(type == "Beta")
Beta_IP$sd <- bonehead_meta |>
  filter(type == "Alpha", para == "IP") |>
  pull(sd)
Beta <- bind_rows(
  Beta_IP,
  bonehead_meta |>
    filter(type == "Alpha", para != "IP") |>
    mutate(type = "Beta")
)
delay_dstns_byvar <- bonehead_meta |>
  filter(type != "Beta") |>
  bind_rows(Beta) |>
  arrange(type, para) |>
  mutate(shape = mean^2 / sd^2, scale = mean / shape)
saveRDS(delay_dstns_byvar, "delay-distns-byvar.rds")
```
We'll use just the serial intervals (SI), though the incubation periods (IP) and
generation times (GT) are also available. Below, we visualize the estimated
delays for the 4 variants that were most prevalent in Canada.
```{r variant-si-plot, echo=FALSE}
delay_dstns_can <- readRDS("delay-distns-byvar.rds") |>
  filter(para == "SI", type %in% unique(can_pred_class$var))
delay_dstns_can |>
  rowwise() |>
  mutate(probability = list(discretize_gamma(0:20, shape, scale))) |>
  ungroup() |>
  select(type, probability) |>
  unnest(probability) |>
  group_by(type) |>
  mutate(delay = row_number() - 1) |>
  ggplot(aes(delay, probability, colour = type)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2", name = "") +
  scale_y_continuous(expand = expansion(c(0, 0.05)))
```
We use each variant's delay distribution during the period in which that variant
was dominant to estimate Rt. First, we build the delay matrix.
```{r tvar-matrix, message=FALSE}
library(Matrix)
n <- nrow(cancovid)
delay_mat <- matrix(0, n, n)
delay_mat[1,1] <- 1
for (iter in 2:n) {
  current_var <- can_pred_class$var[iter]
  current_pars <- delay_dstns_can |> filter(type == current_var)
  delay <- discretize_gamma(0:(iter - 1), current_pars$shape, current_pars$scale)
  delay_mat[iter, 1:iter] <- rev(delay)
}
delay_mat <- drop0(as(delay_mat, "CsparseMatrix")) # make it sparse, not necessary
delay_mat <- delay_mat / rowSums(delay_mat) # renormalize each row
```
Finally, we use this time-varying delay matrix to estimate Rt.
```{r tvar-Rt}
can_tvar <- estimate_rt(
  cancovid$incident_cases,
  x = cancovid$date,
  delay_distn = delay_mat,
  nsol = 20L
)
plot(can_tvar) + coord_cartesian(ylim = c(0.5, 2))
```