---
title: "High-Dimensional Mediation Analysis"
subtitle: "A Guide to Using the HIMA Package"
author: "The HIMA Development Team"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{High-Dimensional Mediation Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  echo = TRUE # Ensures all code is displayed by default
)
library(HIMA)
```

# Introduction

Mediation analysis is a statistical method used to explore the mechanisms by which an independent variable influences a dependent variable through one or more intermediary variables, known as mediators. This process involves assessing the indirect, direct, and total effects within a defined statistical framework. The primary goal of mediation analysis is to test the mediation hypothesis, which posits that the effect of an independent variable on a dependent variable is partially or fully mediated by intermediate variables. By examining these pathways, researchers can gain a deeper understanding of the causal relationships among variables. Mediation analysis is particularly valuable in refining interventions to enhance their effectiveness in clinical trials and in observational studies, where it helps identify key intervention targets and elucidate the mechanisms underlying complex diseases.


# Package Overview

The `HIMA` package provides robust tools for estimating and testing high-dimensional mediation effects, specifically designed for modern `omic` data, including epigenetics, transcriptomics, and microbiomics. It supports high-dimensional mediation analysis across continuous, binary, and survival outcomes, with specialized methods tailored for compositional microbiome data and quantile mediation analysis.

At the core of the package is the `hima` function, a flexible and powerful wrapper that integrates various high-dimensional mediation analysis methods for both effect estimation and hypothesis testing. The `hima` function automatically selects the most suitable analytical approach based on the outcome and mediator data type, streamlining complex workflows and reducing user burden.

`hima` is designed with extensibility in mind, allowing seamless integration of future features and methods. This ensures a consistent, user-friendly interface while staying adaptable to evolving analytical needs.


# Data Preparation and Settings

The `hima` function provides a flexible and user-friendly interface for performing high-dimensional mediation analysis. It supports a variety of analysis methods tailored for continuous, binary, survival, and compositional data. Below is an overview of the `hima` function and its key parameters:

### `hima` Function Interface

```{r hima-interface}
hima(
  formula,          # The model formula specifying outcome, exposure, and covariate(s)
  data.pheno,       # Data frame with outcome, exposure, and covariate(s)
  data.M,           # Data frame or matrix of high-dimensional mediators
  mediator.type,    # Type of mediators: "gaussian", "negbin", or "compositional"
  penalty = "DBlasso",  # Penalty method: "DBlasso", "MCP", "SCAD", or "lasso"
  quantile = FALSE, # Use quantile mediation analysis (default: FALSE)
  efficient = FALSE,# Use efficient mediation analysis (default: FALSE)
  scale = TRUE,     # Scale data (default: TRUE)
  sigcut = 0.05,    # Significance cutoff for mediator selection
  contrast = NULL,  # Named list of contrasts for factor covariate(s)
  subset = NULL,    # Optional subset of observations
  verbose = FALSE   # Display progress messages (default: FALSE)
)
```

To use the `hima` function, ensure your data is prepared according to the following guidelines:

### 1. Formula Argument (`formula`)

Define the model formula to specify the relationship between the `Outcome`, `Exposure`, and `Covariate(s)`. Ensure the following:

- **General Form**: Use the format `Outcome ~ Exposure + Covariate(s)`. Note that the `Exposure` variable represents the exposure of interest (e.g., "Treatment" in the demo examples) and it has to be listed as the first independent variable in the formula. `Covariate(s)` are optional.

- **Survival Data**: For survival analysis, use the format `Surv(time, event) ~ Exposure + Covariate(s)`. See data examples `SurvivalData$PhenoData` for more details.

### 2. Phenotype Data (`data.pheno`)

The `data.pheno` object should be a `data.frame` or `matrix` containing the phenotype information for the analysis (without missing values). Key requirements include:

- **Rows:** Represent samples.

- **Columns:** Include variables such as the outcome, treatment, and optional covariate(s).

- **Formula Consistency:** Ensure that all variables specified in the `formula` argument (e.g., `Outcome`, `Treatment`, and `Covariate(s)`) are present in `data.pheno`.

### 3. Mediator Data (`data.M`)

The `data.M` object should be a `data.frame` or `matrix` containing high-dimensional mediators (without missing values). Key requirements include:

- **Rows:** Represent samples, aligned with the rows in `data.pheno`.

- **Columns:** Represent mediators (e.g., CpGs, genes, or other molecular features).

- **Mediator Type:** Specify the type of mediators in the `mediator.type` argument. Supported types include:
  - `"gaussian"` for continuous mediators (default, e.g., DNA methylation data).
  - `"negbin"` for count data (e.g., transcriptomic data).
  - `"compositional"` for microbiome or other compositional data.

### 4. About data scaling

In most real-world data analysis scenarios, `scale` is typically set to `TRUE`, ensuring that the exposure (variable of interest), mediators, and covariate(s) (if included) are standardized to a mean of zero and a variance of one. No scaling will be applied to `Outcome`. However, if your data is already pre-standardized—such as in simulation studies or when using our demo dataset-`scale` should be set to `FALSE` to prevent introducing biases or altering the original data structure. 

When applying `HIMA` to simulated data, if `scale` is set to `TRUE`, it is imperative to preprocess the mediators by scaling them to have a mean of zero and a variance of one prior to generating the outcome variables.

## Parallel Computing Support

The `hima()` function supports **parallel computing** to speed up high-dimensional mediation analysis, especially when dealing with a large number of mediators.

### Enabling Parallel Computing

To enable parallel computing, simply set `parallel = TRUE` and specify the number of CPU cores to use via the `ncore` argument:

```{r parallel}
hima(..., parallel = TRUE, ncore = 4)
```

# Applications and Examples

## Load the `HIMA` Package

```{r load-HIMA}
library(HIMA)
```

## Continuous Outcome Analysis

When analyzing continuous and normally distributed outcomes and mediators, we can use the following code snippet:

```{r continuous-example}
data(ContinuousOutcome)
pheno_data <- ContinuousOutcome$PhenoData
mediator_data <- ContinuousOutcome$Mediator

hima_continuous.fit <- hima(
  Outcome ~ Treatment + Sex + Age,
  data.pheno = pheno_data,
  data.M = mediator_data,
  mediator.type = "gaussian",
  penalty = "DBlasso",
  scale = FALSE # Demo data is already standardized
)
summary(hima_continuous.fit, desc=TRUE) 
# `desc = TRUE` option to show the description of the output results
```

`penalty = "DBlasso"` is particularly effective at identifying mediators with weaker signals compared to `penalty = "MCP"`. However, using `DBlasso` requires more computational time.

## Efficient HIMA

For continuous and normally distributed mediators and outcomes, an efficient HIMA method can be activated with the `efficient = TRUE` option (penalty should be `MCP` for the best results). This method may also provide greater statistical power to detect mediators with weaker signals.

```{r efficient-example}
hima_efficient.fit <- hima(
  Outcome ~ Treatment + Sex + Age,
  data.pheno = pheno_data,
  data.M = mediator_data,
  mediator.type = "gaussian",
  efficient = TRUE,
  penalty = "MCP",
  scale = FALSE # Demo data is already standardized
)
summary(hima_efficient.fit, desc=TRUE) 
# Note that the efficient HIMA is controlling FDR
```

It is recommended to try different `penalty` options and `efficient` option to find the best one for your data.

## Binary Outcome Analysis

The package can handle binary outcomes based on logistic regression:   

```{r binary-example}
data(BinaryOutcome)
pheno_data <- BinaryOutcome$PhenoData
mediator_data <- BinaryOutcome$Mediator

hima_binary.fit <- hima(
  Disease ~ Treatment + Sex + Age,
  data.pheno = pheno_data,
  data.M = mediator_data,
  mediator.type = "gaussian",
  penalty = "MCP",
  scale = FALSE # Demo data is already standardized
)
summary(hima_binary.fit)
```

## Survival Outcome Analysis

For survival data, `HIMA` incorporates a Cox proportional hazards approach. Here is an example of survival outcome analysis using `HIMA`:

```{r survival-example}
data(SurvivalData)
pheno_data <- SurvivalData$PhenoData
mediator_data <- SurvivalData$Mediator

hima_survival.fit <- hima(
  Surv(Time, Status) ~ Treatment + Sex + Age,
  data.pheno = pheno_data,
  data.M = mediator_data,
  mediator.type = "gaussian",
  penalty = "DBlasso",
  scale = FALSE # Demo data is already standardized
)
summary(hima_survival.fit)
```

## Microbiome Mediation Analysis

For compositional microbiome data, `HIMA` employs isometric Log-Ratio transformations. Here is an example of microbiome mediation analysis using `HIMA`:

```{r microbiome-example}
data(MicrobiomeData)
pheno_data <- MicrobiomeData$PhenoData
mediator_data <- MicrobiomeData$Mediator

hima_microbiome.fit <- hima(
  Outcome ~ Treatment + Sex + Age,
  data.pheno = pheno_data,
  data.M = mediator_data,
  mediator.type = "compositional",
  penalty = "DBlasso"
)
summary(hima_microbiome.fit)
```

## Quantile Mediation Analysis

Perform quantile mediation analysis using the `quantile = TRUE` option and specify `tau` for desired quantile(s):

```{r quantile-example}
data(QuantileData)
pheno_data <- QuantileData$PhenoData
mediator_data <- QuantileData$Mediator

hima_quantile.fit <- hima(
  Outcome ~ Treatment + Sex + Age,
  data.pheno = pheno_data,
  data.M = mediator_data,
  mediator.type = "gaussian",
  quantile = TRUE,
  penalty = "MCP",
  tau = c(0.3, 0.5, 0.7),
  scale = FALSE # Demo data is already standardized
)
summary(hima_quantile.fit)
```