---
title: "Introduction to radEmu with phyloseq"
author: "David Clausen, Sarah Teichman and Amy Willis"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to radEmu with phyloseq}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

First, we will install `radEmu`, if we haven't already.

```{r, eval = FALSE}
# if (!require("remotes", quietly = TRUE))
#     install.packages("remotes")
#
# remotes::install_github("statdivlab/radEmu")
```

Next, we can load `radEmu` as well as the `tidyverse` package suite. 

```{r setup, message = FALSE}
library(magrittr)
library(dplyr)
library(ggplot2)
library(stringr)
library(radEmu)
```

## Introduction 

This vignette provides an introduction to using `radEmu` for differential abundance analysis using a `phyloseq` data object. For more in-depth explanations of how this software works, an example of running hypothesis tests, and details on this analysis, see the vignette "intro_radEmu.Rmd". 

In this lab we'll explore a [dataset published by Wirbel et al. (2019)](https://www.nature.com/articles/s41591-019-0406-6). This is a meta-analysis of case-control studies, meaning that Wirbel et al. collected raw sequencing data from studies other researchers conducted and re-analyzed it (in this case, they also collected some new data of their own).

Wirbel et al. published two pieces of data we'll focus on today:

* metadata giving demographics and other information about participants
* a mOTU (metagenomic OTU) table

In the manuscript, we looked at differential abundance across otherwise similar colorectal cancer and non-cancer control study participants for the 849 mOTUs that Wirbel et al. published. For the purpose of having a streamlined tutorial, we will only look at a subset of those 849 mOTUs in this vignette. 

## Loading and exploring data 

Note that in order to follow along with this tutorial (but not to use `radEmu`!) you will need to have `phyloseq` installed. We will check if you have `phyloseq` installed, and if you do not then you can read the following code but it will not be run.

```{r, results = 'hide'}
phy <- requireNamespace("phyloseq", quietly = TRUE) == TRUE
```

```{r, echo = FALSE}
print(paste0("phyloseq is installed: ", phy))
```

Now that we have loaded the `phyloseq` package, we will create our `phyloseq` data object. 

```{r, eval = phy} 
data(wirbel_sample)
data(wirbel_otu)
data(wirbel_taxonomy)
wirbel_phylo <- phyloseq::phyloseq(phyloseq::sample_data(wirbel_sample),
                                   phyloseq::otu_table(wirbel_otu, taxa_are_rows = FALSE),
                                   phyloseq::tax_table(wirbel_taxonomy))
wirbel_phylo
```

We'll start by looking at the metadata. 

```{r, eval = phy}
dim(phyloseq::sample_data(wirbel_phylo))
head(phyloseq::sample_data(wirbel_phylo))
```

We can see that this dataset includes $566$ observations and $14$ variables. 

Now let's load the mOTU table. 

```{r, eval = phy}
dim(phyloseq::otu_table(wirbel_phylo))
# let's check out a subset
phyloseq::otu_table(wirbel_phylo)[1:5, 1:3]
```

We can see that this table has $566$ samples (just like the metadata) and $845$ mOTUs. Let's save these mOTU names in a vector. 

```{r, eval = phy}
mOTU_names <- colnames(phyloseq::otu_table(wirbel_phylo))
```

Finally, we can check out the taxonomy table. 

```{r, eval = phy}
head(phyloseq::tax_table(wirbel_phylo))
```

## Fitting a model 

`radEmu` is a package that can be used to estimate fold-differences in the abundance of microbial taxa between levels of a covariate. In this analysis, the covariate that we are primarily interested in is whether a sample is from a case of colorectal cancer or a control. We will make control ("CTR") the reference category: 

```{r, eval = phy}
phyloseq::sample_data(wirbel_phylo)$Group <- factor(phyloseq::sample_data(wirbel_phylo)$Group, levels = c("CTR","CRC"))
```

While in general we would fit a model to all mOTUs, we are going to subset to some specific genera for the purposes of this tutorial. Let's look at *Eubacterium*, *Porphyromonas*, *Faecalibacteria*, and *Fusobacterium* for now.

```{r, eval = phy}
chosen_genera <- c("Eubacterium", "Faecalibacterium", "Fusobacterium", "Porphyromonas")
wirbel_restrict <- phyloseq::subset_taxa(wirbel_phylo, genus %in% chosen_genera)
```

Again, while we would generally fit a model using all of our samples, for this tutorial we are only going to consider data from a case-control study from China. 

```{r, eval = phy}
wirbel_china <- phyloseq::subset_samples(wirbel_restrict, Country == "CHI")
```

Next, we want to confirm that all samples have at least one non-zero count across the categories we've chosen and that all categories have at least one non-zero count across the samples we've chosen.

```{r}
sum(rowSums(phyloseq::otu_table(wirbel_china)) == 0) # no samples have a count sum of 0 
sum(colSums(phyloseq::otu_table(wirbel_china)) == 0) # one category has a count sum of 0 
category_to_rm <- names(which(colSums(phyloseq::otu_table(wirbel_china)) == 0))
wirbel_china <- phyloseq::subset_taxa(wirbel_china, species != category_to_rm)
sum(colSums(phyloseq::otu_table(wirbel_china)) == 0) # now no categories have a count sum of 0 
```

The function that we use to fit our model is called `emuFit`. It can accept your data in various forms, and here we will show how to use it with a `phyloseq` object as input. 


```{r, eval = phy}
ch_fit <- emuFit(formula = ~ Group, 
                 Y = wirbel_china, 
                 run_score_tests = FALSE) 
```

The way to access estimated coefficients and confidence intervals from the model is with `ch_fit$coef`. 

Now, we can easily visualize our results using the `plot.emuFit` function!

```{r, fig.height = 6, fig.width = 6, eval = phy}
plot(ch_fit)$plots
```

If you'd like to see more explanations of the `radEmu` software and additional analyses of this data, check out the vignette "intro_radEmu.Rmd".