---
title: "Data Generation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Generation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Updated: 2024-11-12

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(AllelicSeries)
```

## Overview

The data generating process `DGP` provides the ability to simulate data from a variety of genetic architectures, including the baseline model, allelic sum models, and allelic SKAT models. 


### Baseline model

The *baseline allelic series model* is:
$$ \mathbb{E}(Y|N,X) = \sum_{l=1}^{L}N_{l}\beta_{l} + X'\beta_{X} $$
where $Y$ is the phenotype, $L$ is the number of annotation categories, $N_{l}$ is the number of variants in annotation category $l$, and $X$ is a vector of covariates with associated coefficient $\beta_{X}$. To simulate from the baseline model, the aggregation `method` is set to `"none"`. `n` is the sample size, and `snps` in the number of variants. `prop_anno` specifies the proportion of variants in each annotation category. $L = 3$ annotation categories are adopted by default. `beta` is the per-variant effect size in each annotation category. `weights` is redundant with `beta` in the case of the baseline model, but these arguments have distinct functions in the allelic sum and max models. 

```{r, eval=FALSE}
data <- AllelicSeries::DGP(
  method = "none",
  n = 100,
  snps = 300,
  prop_anno = c(0.5, 0.4, 0.1),
  beta = c(1, 2, 3),
  weights = c(1, 1, 1)
)
```


### Allelic sum model

The *allelic series sum model* is:
$$ \mathbb{E}(Y|N,X) = \left(\sum_{l=1}^{L}N_{l}w_{l}\right)\beta + X'\beta_{X} $$
To simulate from the sum model, the aggregation `method` is set to `"sum"`. In contrast to the baseline model, `beta` is now a scalar multiplier for the allelic sum burden $\sum_{l=1}^{L}N_{l}w_{l}$, while `weights` specifies the annotation category weights $(w_{1}, \dots, w_{L})$. 

```{r, eval=FALSE}
data <- AllelicSeries::DGP(
  method = "sum",
  n = 100,
  snps = 300,
  prop_anno = c(0.5, 0.4, 0.1),
  beta = 1,
  weights = c(1, 2, 3)
)
```


### Allelic max model

The *allelic series max model* is:
$$ \mathbb{E}(Y|N,X) = \left(\max_{l=1}^{L}N_{l}w_{l}\right)\beta + X'\beta_{X} $$

To simulate from the max model, the aggregation `method` is set to `"max"`.

```{r, eval=FALSE}
data <- AllelicSeries::DGP(
  method = "max",
  n = 100,
  snps = 300,
  prop_anno = c(0.5, 0.4, 0.1),
  beta = 1,
  weights = c(1, 2, 3)
)
```


### Allelic SKAT model

The generative version of the *allelic series SKAT model* is:
$$ \begin{gathered}
\mathbb{E}(Y|G,X) = \sum_{j=1}^{J}G_{j}\beta_{j} + X'\beta_{X}, \\
\beta_{j} = r_{j}\gamma_{j}\beta_{A_{j}}
\end{gathered} $$

Here $G_{j}$ is genotype at the $J$th rare variant and $\beta_{j}$ is the corresponding effect size. The effect size of each variant are the product of a random sign $r_{j} \in \{-1, 1\}$, a scalar frailty $\gamma_{j} \sim \Gamma(\alpha, \alpha)$ with mean 1 and variance $\alpha^{-1}$, and $\beta_{A_{j}}$ is the mean absolute effect size for variant $j$'s annotation category $A_{j} \in \{1, \dots, L\}$. To simulate from the SKAT model, the aggregation `method` is set to `"none"` and the `random_signs` argument is set to `TRUE`. The mean absolute effect sizes are set via `beta`. The variance of the frailty $\gamma_{j}$ is specified with `random_var`. While the annotation category `weights` are not explicitly required for generation from the SKAT model, `weights` should be provided because the number of annotation categories $L$ is inferred from the length of the weight vector. 

```{r, eval=FALSE}
data <- AllelicSeries::DGP(
  method = "none",
  random_signs = TRUE,
  random_var = 1,
  n = 100,
  snps = 300,
  prop_anno = c(0.5, 0.4, 0.1),
  beta = c(1, 2, 3),
  weights = c(1, 1, 1)
)
```


### Simulating more than 3 annotation categories

Data can be simulated from any of the baseline, sum, max, or SKAT models simply by changing the lengths of annotation category proportions, effect sizes, and weights. These arguments should all be updated to reflect the number of annotation categories $L$.

```{r, eval=FALSE}
# Baseline model, 2 categories.
data <- AllelicSeries::DGP(
  method = "none",
  n = 100,
  snps = 300,
  prop_anno = c(0.6, 0.4),
  beta = c(1, 2),
  weights = c(1, 1)
)

# Baseline model, 4 categories.
data <- AllelicSeries::DGP(
  method = "none",
  n = 100,
  snps = 300,
  prop_anno = c(0.4, 0.3, 0.2, 0.1),
  beta = c(1, 2, 3, 4),
  weights = c(1, 1, 1, 1)
)

```


### Additional arguments

* Real-data annotations or genotypes can be provided by specifying `anno` or `geno`. 

* To simulate binary phenotypes from a probit model, set `binary = TRUE`. The binary phenotype is generated by first constructing a latent normal phenotype `Z` then dichotomizing $Y = \mathbb{I}(Z > 0)$. 

* The range of minor allele frequencies can be specified with `maf_range`. By default, variants have MAFs in the range of 0.1% to 0.5%. Genotypes are generated in such a way that the minor allele count is always at least 1, guaranteeing that no empty variants will be present. 

* The proportion of causal variants can be modulated with `prop_causal`. By default, all variants are causal for the phenotype (unless `beta` is set to zero). If `prop_causal < 1`, then a corresponding proportion of variants is removed from the causal set.