--- title: "Introduction to radEmu with phyloseq" author: "David Clausen, Sarah Teichman and Amy Willis" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to radEmu with phyloseq} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` First, we will install `radEmu`, if we haven't already. ```{r, eval = FALSE} # if (!require("remotes", quietly = TRUE)) # install.packages("remotes") # # remotes::install_github("statdivlab/radEmu") ``` Next, we can load `radEmu` as well as the `tidyverse` package suite. ```{r setup, message = FALSE} library(magrittr) library(dplyr) library(ggplot2) library(stringr) library(radEmu) ``` ## Introduction This vignette provides an introduction to using `radEmu` for differential abundance analysis using a `phyloseq` data object. For more in-depth explanations of how this software works, an example of running hypothesis tests, and details on this analysis, see the vignette "intro_radEmu.Rmd". In this lab we'll explore a [dataset published by Wirbel et al. (2019)](https://www.nature.com/articles/s41591-019-0406-6). This is a meta-analysis of case-control studies, meaning that Wirbel et al. collected raw sequencing data from studies other researchers conducted and re-analyzed it (in this case, they also collected some new data of their own). Wirbel et al. published two pieces of data we'll focus on today: * metadata giving demographics and other information about participants * a mOTU (metagenomic OTU) table In the manuscript, we looked at differential abundance across otherwise similar colorectal cancer and non-cancer control study participants for the 849 mOTUs that Wirbel et al. published. For the purpose of having a streamlined tutorial, we will only look at a subset of those 849 mOTUs in this vignette. ## Loading and exploring data Note that in order to follow along with this tutorial (but not to use `radEmu`!) you will need to have `phyloseq` installed. We will check if you have `phyloseq` installed, and if you do not then you can read the following code but it will not be run. ```{r, results = 'hide'} phy <- requireNamespace("phyloseq", quietly = TRUE) == TRUE ``` ```{r, echo = FALSE} print(paste0("phyloseq is installed: ", phy)) ``` Now that we have loaded the `phyloseq` package, we will create our `phyloseq` data object. ```{r, eval = phy} data(wirbel_sample) data(wirbel_otu) data(wirbel_taxonomy) wirbel_phylo <- phyloseq::phyloseq(phyloseq::sample_data(wirbel_sample), phyloseq::otu_table(wirbel_otu, taxa_are_rows = FALSE), phyloseq::tax_table(wirbel_taxonomy)) wirbel_phylo ``` We'll start by looking at the metadata. ```{r, eval = phy} dim(phyloseq::sample_data(wirbel_phylo)) head(phyloseq::sample_data(wirbel_phylo)) ``` We can see that this dataset includes $566$ observations and $14$ variables. Now let's load the mOTU table. ```{r, eval = phy} dim(phyloseq::otu_table(wirbel_phylo)) # let's check out a subset phyloseq::otu_table(wirbel_phylo)[1:5, 1:3] ``` We can see that this table has $566$ samples (just like the metadata) and $845$ mOTUs. Let's save these mOTU names in a vector. ```{r, eval = phy} mOTU_names <- colnames(phyloseq::otu_table(wirbel_phylo)) ``` Finally, we can check out the taxonomy table. ```{r, eval = phy} head(phyloseq::tax_table(wirbel_phylo)) ``` ## Fitting a model `radEmu` is a package that can be used to estimate fold-differences in the abundance of microbial taxa between levels of a covariate. In this analysis, the covariate that we are primarily interested in is whether a sample is from a case of colorectal cancer or a control. We will make control ("CTR") the reference category: ```{r, eval = phy} phyloseq::sample_data(wirbel_phylo)$Group <- factor(phyloseq::sample_data(wirbel_phylo)$Group, levels = c("CTR","CRC")) ``` While in general we would fit a model to all mOTUs, we are going to subset to some specific genera for the purposes of this tutorial. Let's look at *Eubacterium*, *Porphyromonas*, *Faecalibacteria*, and *Fusobacterium* for now. ```{r, eval = phy} chosen_genera <- c("Eubacterium", "Faecalibacterium", "Fusobacterium", "Porphyromonas") wirbel_restrict <- phyloseq::subset_taxa(wirbel_phylo, genus %in% chosen_genera) ``` Again, while we would generally fit a model using all of our samples, for this tutorial we are only going to consider data from a case-control study from China. ```{r, eval = phy} wirbel_china <- phyloseq::subset_samples(wirbel_restrict, Country == "CHI") ``` Next, we want to confirm that all samples have at least one non-zero count across the categories we've chosen and that all categories have at least one non-zero count across the samples we've chosen. ```{r} sum(rowSums(phyloseq::otu_table(wirbel_china)) == 0) # no samples have a count sum of 0 sum(colSums(phyloseq::otu_table(wirbel_china)) == 0) # one category has a count sum of 0 category_to_rm <- names(which(colSums(phyloseq::otu_table(wirbel_china)) == 0)) wirbel_china <- phyloseq::subset_taxa(wirbel_china, species != category_to_rm) sum(colSums(phyloseq::otu_table(wirbel_china)) == 0) # now no categories have a count sum of 0 ``` The function that we use to fit our model is called `emuFit`. It can accept your data in various forms, and here we will show how to use it with a `phyloseq` object as input. ```{r, eval = phy} ch_fit <- emuFit(formula = ~ Group, Y = wirbel_china, run_score_tests = FALSE) ``` The way to access estimated coefficients and confidence intervals from the model is with `ch_fit$coef`. Now, we can easily visualize our results using the `plot.emuFit` function! ```{r, fig.height = 6, fig.width = 6, eval = phy} plot(ch_fit)$plots ``` If you'd like to see more explanations of the `radEmu` software and additional analyses of this data, check out the vignette "intro_radEmu.Rmd".