---
title: "Introduction to radEmu"
author: "David Clausen, Sarah Teichman and Amy Willis"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to radEmu}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

First, we will install `radEmu`, if we haven't already. 

```{r, eval = FALSE}
# if (!require("remotes", quietly = TRUE))
#     install.packages("remotes")
# 
# remotes::install_github("statdivlab/radEmu")
```

Next, we can load `radEmu` as well as the `tidyverse` package suite. 

```{r setup, message = FALSE}
library(magrittr)
library(dplyr)
library(ggplot2)
library(stringr)
library(radEmu)
```

## Introduction 

In this lab we'll explore a [dataset published by Wirbel et al. (2019)](https://www.nature.com/articles/s41591-019-0406-6). This is a meta-analysis of case-control studies, meaning that Wirbel et al. collected raw sequencing data from studies other researchers conducted and re-analyzed it (in this case, they also collected some new data of their own).

Wirbel et al. published two pieces of data we'll focus on today:

* metadata giving demographics and other information about participants
* a mOTU (metagenomic OTU) table

In the manuscript, we looked at differential abundance across otherwise similar colorectal cancer and non-cancer control study participants for the 849 mOTUs that Wirbel et al. published. For the purpose of having a streamlined tutorial, we will only look at a subset of those 849 mOTUs in this vignette. 

## Loading and exploring data 

We'll start by looking at the metadata. 

```{r}
data("wirbel_sample")
dim(wirbel_sample)
head(wirbel_sample)
```

We can see that this dataset includes $566$ observations and $14$ variables. Let's see how many observations we have among cases ("CRC") and controls ("CTR")

```{r}
wirbel_sample %>%
  group_by(Group) %>%
  summarize(count = n())
```

We have data from studies in 5 different countries. How much from each study, you ask? Let's find out!

```{r}
wirbel_sample %>%
  group_by(Country) %>%
  summarize(count = n())
```

Let's see how many cases and controls were enrolled in each study as well. 

```{r}
wirbel_sample %>%
  group_by(Country, Group) %>%
  summarize(n = n())
```

Now let's load the mOTU table. 

```{r}
data("wirbel_otu")
dim(wirbel_otu)
# let's check out a subset
wirbel_otu[1:5, 1:3]
```

We can see that this table has $566$ samples (just like the metadata) and $845$ mOTUs. Let's save these mOTU names in a vector. 

```{r}
mOTU_names <- colnames(wirbel_otu)
```

## Fitting a model 

`radEmu` is a package that can be used to estimate fold-differences in the abundance of microbial taxa between levels of a covariate. In this analysis, the covariate that we are primarily interested in is whether a sample is from a case of colorectal cancer or a control. We will make control ("CTR") the reference category: 

```{r}
wirbel_sample$Group <- factor(wirbel_sample$Group, levels = c("CTR","CRC"))
```

While in general we would fit a model to all mOTUs, we are going to subset to some specific genera for the purposes of this tutorial. Let's look at *Eubacterium*, *Porphyromonas*, *Faecalibacteria*, and *Fusobacterium* for now.

```{r}
chosen_genera <- c("Eubacterium", "Faecalibacterium", "Fusobacterium", "Porphyromonas")
mOTU_name_df <- data.frame(name = mOTU_names) %>% 
  mutate(base_name = stringr::str_remove(mOTU_names, "unknown ") %>%
                      stringr::str_remove("uncultured ")) %>%
  mutate(genus_name = stringr::word(base_name, 1))
restricted_mOTU_names <- mOTU_name_df %>%
  filter(genus_name %in% chosen_genera) %>%
  pull(name)
```

Again, while we would generally fit a model using all of our samples, for this tutorial we are only going to consider data from a case-control study from China. 

```{r}
ch_study_obs <- which(wirbel_sample$Country %in% c("CHI"))
```

Next, we want to confirm that all samples have at least one non-zero count across the categories we've chosen and that all categories have at least one non-zero count across the samples we've chosen.

```{r}
small_Y <- wirbel_otu[ch_study_obs, restricted_mOTU_names]
sum(rowSums(small_Y) == 0) # no samples have a count sum of 0 
sum(colSums(small_Y) == 0) # one category has a count sum of 0

category_to_rm <- which(colSums(small_Y) == 0)
small_Y <- small_Y[, -category_to_rm]
sum(colSums(small_Y) == 0)
```

The function that we use to fit our model is called `emuFit`. It can accept your data in various forms, and here we will show how to use it with data frames as input. Check out the `phyloseq` vignette if you'd like to know how `radEmu` plays with `phyloseq` objects! One version of inputs to `emuFit` are 

  - `formula`: This is a formula telling radEmu what predictors to use in its model. We are using Group, which is an indicator for case (CRC) vs control (CTR). 
  - `data`: A dataframe containing information on our predictors. Recall that here we're only looking observations from the Chinese study. 
  - `Y`: A matrix or dataframe containing our observed abundance data (e.g., counts or depth measurements). The rows give the observations (samples), and the columns give the categories (taxa/mOTUs). Here we are only considering the observations from the Chinese study and the *Eubacterium*, *Porphyromonas*, *Faecalibacteria*, and *Fusobacterium* genera. Note that `Y` doesn't have to be integer-valued (counts)! 
  
and some optional arguments include 

  - `run_score_tests`: A logical value denoting whether or not to run score tests. Score tests are awesome in their error rate control (including with small sample sizes; though of course larger sample sizes always give better power), but require refitting the model, so can require some compute time. 

```{r}
ch_fit <- emuFit(formula = ~ Group, 
                 data = wirbel_sample[ch_study_obs, ],
                 Y = small_Y,
                 run_score_tests = FALSE) 
```
Let's check out what this object looks like!

```{r}
ch_fit
```

Now, we can easily visualize our results using the `plot` function!
So that we only produce the plot, and not the dataframe used to produce the plots,
we will explicitly extract the plots using the `$` operator to get the `plots` component.

```{r, fig.height = 6, fig.width = 6}
plot(ch_fit)$plots
```

In the plot above, it is a bit difficult to read the taxa names on the y-axis. We can create a data frame that maps the full taxa names in our data to simplified labels and plot using those instead, as shown below.
```{r, fig.height = 6, fig.width = 6}
#create data frame that has simplified names for taxa
taxa_names <- data.frame("category" = ch_fit$coef$category) %>%
  mutate(cat_small = stringr::str_remove(paste0("mOTU_", 
                            stringr::str_split(category, 'mOTU_v2_', simplify = TRUE)[, 2]), 
                            "\\]"))


#produce plot with cleaner taxa labels
plot(ch_fit, taxon_names = taxa_names)$plots
```


Interestingly, we estimate a meta-mOTU "unknown Eubacterium [meta_mOTU_v2_7116]" assigned to Eubacteria to have a much higher ratio of abundance (comparing CRC group to control) than is typical across the mOTUs we included in this analysis.

The confidence interval for this effect does not include zero -- but (!!!) the kind of confidence interval that is returned by default by emuFit is not extremely reliable when counts are very skewed or sample size is small-to-moderate. 

To investigate further, let's run a robust score test, which is more reliable in these settings (but also takes more time because apparently we can't have nice things). For comparison, we'll also test the mOTU "Fusobacterium nucleatum s. nucleatum [ref_mOTU_v2_0777]", which we also estimate to have a much larger ratio of concentrations across groups than is typical among the taxa we included in this model fit.

To set up this test, we can again run `emuFit`, giving it the fitted values that it's already found:

  - `formula`, `data` and `Y` are as before
  - `B` is our previous fitted object (the output of `emuFit`)
  - `test_kj` a dataframe listing the indices of the parameters (in `ch_fit$B`) that we want to test. Below we show how to identify these, but `j = 3` is F. nucleatum and `j = 36` is the *Eubacterium* meta mOTU. Note that while `test_kj` was an optional argument in previous software versions, in `radEmu` v2.0.0.0 and forward `test_kj` is required when `run_score_tests = TRUE`. 

```{r}
taxa_to_test <- c(which(str_detect(restricted_mOTU_names, "0777")), 
                  which(str_detect(restricted_mOTU_names, "7116")))
design <- make_design_matrix(formula = ~ Group, data = wirbel_sample[ch_study_obs, ])
colnames(design)
covariate_to_test <- 2 # we see in the previous line that the covariate we care about corresponds to the second column of the design matrix
ch_fit$B %>% rownames
two_robust_score_tests <- emuFit(formula = ~ Group,
                                 data = wirbel_sample[ch_study_obs, ],
                                 B = ch_fit$B,
                                 test_kj = data.frame(k = covariate_to_test, 
                                                      j = taxa_to_test), 
                                 Y = small_Y)
```

Let's take a look at the test output.

```{r}
two_robust_score_tests$coef[taxa_to_test, c("covariate", "category", "estimate", "pval")]
```

The *Fusobacterium nucleatum* mOTU has a robust score test p-value of `r round(two_robust_score_tests$coef[taxa_to_test[1], "pval"], 2)`, while the unknown Eubacterium mOTU has a robust score test p-value of `r round(two_robust_score_tests$coef[taxa_to_test[2], "pval"], 2)`. Does this make sense? Let's investigate further by looking at the Eubacterium mOTU counts by Group. 

```{r}
data.frame(counts = wirbel_otu[ch_study_obs, "unknown Eubacterium [meta_mOTU_v2_7116]"],
           group = wirbel_sample$Group[ch_study_obs]) %>%
  mutate(eubact_present = counts > 0) %>%
  group_by(group, eubact_present) %>%
  count()
```

We only detect this meta-mOTU in a single sample in the Chinese study cohort! So, yes -- it makes sense that our test returns a relatively large p-value. Good job, `emuFit`! 

Now let's look at *F. nucleatum*.

```{r}
data.frame(counts = wirbel_otu[ch_study_obs, 
                               "Fusobacterium nucleatum s. nucleatum [ref_mOTU_v2_0777]"],
           group = wirbel_sample$Group[ch_study_obs]) %>%
  mutate(fuso_present = counts > 0) %>%
  group_by(group, fuso_present) %>%
  count()
```

This also makes sense given what we found --  *F. nucleatum* shows up in a sizeable minority of CRC cases, whereas  Wirbel et al detect it in only one control participant. 

We could run robust score tests for every taxon in this analysis, but it will take a longer amount of time to run. The code below will not run in the vignette, but feel free to run it on your own. 

```{r, eval = FALSE}
test_all <- emuFit(formula = ~ Group, 
                   data = wirbel_sample[ch_study_obs, ],
                   B = ch_fit$B,
                   Y = small_Y,
                   run_score_tests = TRUE,
                   test_kj = data.frame(k = covariate_to_test,
                                        j = 1:ncol(small_Y)))
```

## A more reasonable model

In the above analysis, we provided a basic illustration of our method looking at a small number of taxa, in a subset of samples (one study out of five). However, if we we're truly interested in identifying taxa that are unusually abundant in either CRC cases or controls, it would make sense to compare across study populations that are similar in their sex, age, BMI, country and whether the samples were provided before undergoing colonoscopy or after. That's what we did in our manuscript! Code to fit this model is below. 

```{r, eval=FALSE}
all_fit <- emuFit(formula = ~ Group + Study + Gender +
                    Age_spline.1 + Age_spline.2 + 
                    BMI_spline.1 + BMI_spline.2 + Sampling,
                  data = wirbel_sample, 
                  Y = wirbel_otu[, restricted_mOTU_names],
                  run_score_tests = FALSE)
```

Note that we wanted to allow in non-linear trends in age and BMI, which we did using B-splines. If you're interested in doing something similar, you could adapt the code below. 

```{r, eval = FALSE}
age_spline <- splines2::bSpline(wirbel_sample$Age, degree = 1, knots = median(wirbel_sample$Age))
age_spline[,1] <- (age_spline[,1] - mean(age_spline[,1]))/sd(age_spline[,1])
age_spline[,2] <- (age_spline[,2] - mean(age_spline[,2]))/sd(age_spline[,2])

bmi_spline <- splines2::bSpline(wirbel_sample$BMI, degree = 1, knots = median(wirbel_sample$BMI))
bmi_spline[,1] <- (bmi_spline[,1] - mean(bmi_spline[,1]))/sd(bmi_spline[,1])
bmi_spline[,2] <- (bmi_spline[,2] - mean(bmi_spline[,2]))/sd(bmi_spline[,2])
```