---
title: "Elastic Net Penalized Structural Equation Models"
author:
- Anhui Huang
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  pdf_document:
    fig_caption: yes
    toc: yes
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Elastic Net Penalized Structural Equation Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---
## Introduction
This vignette demonstrates the usage of the `sparseSEM` package via a simulated network as well as a Gene Regulatory Network (GRN) in budding yeast (*Saccharomyces cerevisiae*).
### Simulated Network
The simulated network is relatively small and sparse for demonstration purposes. A total of 30 nodes with 30 edges were simulated to form a directed acyclic graph (DAG) following the description in Huang 2014 [1] (a minimal data-generation sketch follows this list):
- Number of nodes in the network: $N_{g} = 30$;
- Expected number of edges per node: $N_{e} = 1$ (giving 30 edges in total);
- Edge weight $B_{i j}$ was generated randomly from a uniform distribution in $(0.5,1)$ or $(-1,-0.5)$ if there was an edge from node $j$ to node $i$; and
- Noise $E_{i j}$ was sampled from a Gaussian distribution with zero mean and variance 0.01.
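As a rough illustration of how such data could be generated (this is a minimal sketch, not the exact script behind `data(B)`, `data(Y)`, and `data(X)`; in particular, the diagonal attribute-effect matrix `Fsim` below is an assumption), one could simulate from a structural model of the form $Y = BY + FX + E$:
```{r, eval=FALSE}
set.seed(1)
Ng <- 30    # number of nodes
Ns <- 200   # number of samples
## a lower-triangular support keeps the graph acyclic; place 30 edges at random
Bsim <- matrix(0, Ng, Ng)
edges <- sample(which(lower.tri(Bsim)), 30)
Bsim[edges] <- sample(c(-1, 1), 30, replace = TRUE) * runif(30, min = 0.5, max = 1)
## node attributes (e.g., one perturbation per node) and Gaussian noise with variance 0.01
Xsim <- matrix(rnorm(Ng * Ns), Ng, Ns)
Esim <- matrix(rnorm(Ng * Ns, sd = 0.1), Ng, Ns)
Fsim <- diag(runif(Ng, 0.5, 1))   # assumed direct effect of each attribute on its own node
## solve Y = BY + FX + E for Y
Ysim <- solve(diag(Ng) - Bsim) %*% (Fsim %*% Xsim + Esim)
```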
### Yeast Gene Regulatory Network (GRN)
The data contain expression levels of 6,126 yeast ORFs and 2,956 genetic markers measured in 112 yeast segregants from a cross between a laboratory strain (BY4716) and a wild strain (RM11-1a) [2].
The following steps were taken to arrive at the final dataset in `data(yeast)`:
- Step 1. Screen out ORFs not in the Yeast Comparative Genomics database [3], resulting in 5,225 ORF names in the annotation list;
- Step 2. Screen out ORFs with more than $5 \%$ missing expression data, resulting in 3,380 ORFs (a sketch of this filter follows the list);
- Step 3. Associate genetic markers with an ORF if they were within a distance of $\leq 20 \mathrm{~kb}$, representing the QTL resolution for this cross [4];
- Step 4. Map gene expression levels to their associated genetic markers through the Wilcoxon rank-sum test with adjusted `q-value`, following the procedure described in [4];
- Step 5. If multiple cis-eQTLs were found for an ORF, keep the gene marker with the smallest $p$-value, leading to 1,162 ORFs with cis-eQTLs.
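For instance, the missing-data filter in Step 2 could be written as follows; this is an illustrative sketch only, and `expr` is a hypothetical ORF-by-segregant expression matrix with `NA` entries, not an object shipped with the package:
```{r, eval=FALSE}
## keep ORFs (rows) with at most 5% missing expression values
keep <- rowMeans(is.na(expr)) <= 0.05
expr <- expr[keep, , drop = FALSE]
```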
The gene expression profiles of these 3,380 ORFs and the genetic markers of the 1,162 cis-eQTLs were then used for network inference by `sparseSEM`.
[Back to Top](#top)
## Quick Start
We will give users a general idea of the package by walking through the simulated example to demonstrate basic package usage. By running the main functions and examining the outputs, users will get a better sense of how the package works, what functions are available, which parameters to choose, and where to seek help. More information on the GRN will be provided in later sections.
Let us first clear the workspace and load the `sparseSEM` package:
```{r}
rm(list = ls())
library(sparseSEM)
```
The simulated data include the matrices `Y`, `X`, `B`, and `Missing` (here M = 30 nodes and N = 200 samples), giving the following measurements:
- `Y`: The observed node response data with dimension of M (nodes) by N (samples);
- `X`: The network node attribute (for causal effect inference) with dimension of M by N;
- `B`: A sparse M by M matrix which contains the actual edges among the network of M nodes.
- `Missing`: M by N matrix corresponding to elements of Y to denote any missing values.
```{r}
data(B);
data(Y);
data(X);
data(Missing);
cat("dimenstion of Y: ",dim(Y) )
cat("dimenstion of X: ",dim(X) )
```
Note that both `B` and `Missing` are optional for all functions in the package.
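For example, a fit using only `Y` and `X`, as done later for the yeast data, would look like the following (not evaluated here; the ground-truth performance statistics reported later would then not be available):
```{r, eval=FALSE}
## B and Missing omitted; only the observed data and node attributes are used
outputYX = elasticNetSEM(Y, X, verbose = 0)
```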
We fit the model using the most basic call to `elasticNetSEM`:
```{r}
set.seed(1)
output = elasticNetSEM(Y, X, Missing, B, verbose = 1);
names(output)
```
"output" is a list containing all the relevant information of the fitted model. Users can examine the output by directly looking at each element in the list. Particularly, the sparse regression coefficients can be extracted as showed below:
```{r}
fit_SEM = output$weight
```
Here we can use the `plot.matrix` package to visually compare the `sparseSEM`-estimated network structure with the simulated ground truth:
```{r}
library('plot.matrix')
par(mfrow = c(1, 2), mar = c(5.1, 4.1, 4.1, 4.1))
plot(B)        # ground-truth edge-weight matrix
plot(fit_SEM)  # sparseSEM estimated edge weights
```
The left plot is the simulation ground truth, and the right plot is the `sparseSEM`-inferred network. The plot shows that 29 of the 30 edges in the network are captured by `sparseSEM` (the missed edge connects node 2 (column 2) to node 26 (row 26)). Meanwhile, the estimated edge weights are smaller in the fitted model due to the shrinkage of the `elastic net` penalty.
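The edge counts above can be verified directly by comparing the non-zero patterns of the two matrices:
```{r}
## true edges recovered by the fit (non-zero in both B and fit_SEM)
sum(B != 0 & fit_SEM != 0)
## spurious edges (non-zero in the fit but absent from B)
sum(B == 0 & fit_SEM != 0)
```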
### Cross Validation (CV)
The hyperparameters in the `elastic net` are specified as (alpha, lambda), which control the number of non-zero effects to be selected; cross-validation (CV) is perhaps the simplest and most widely used method for deciding their values. While `elasticNetSEM` performs all of this computation behind the scenes, `elasticNetSEMcv` is the function that allows users to control more details of the CV, and can be called as follows:
```{r}
set.seed(1)
cvfit = elasticNetSEMcv(Y, X, Missing, B, alpha_factors = c(0.75, 0.5, 0.25),
lambda_factors=c(0.1, 0.01, 0.001), kFold = 5, verbose = 1);
names(cvfit)
```
`elasticNetSEMcv` returns a `cvfit` object, which is a list with all the ingredients of the CV and the final fit obtained with the CV-selected hyperparameters. We can print out the CV results, the selected hyperparameters and the corresponding coefficients. For example, the results for different hyperparameters and the corresponding prediction errors are shown below:
```{r}
head(cvfit$cv)
```
Note that the values supplied for the hyperparameters (alpha, lambda) are `alpha_factor` and `lambda_factor`, which are weights applied to the actual (alpha, lambda) and should be in the range $[0,1]$; i.e., $\alpha = alpha\_factor \cdot \alpha_0$ and $\lambda = lambda\_factor \cdot \lambda_0$. Here $\alpha_0 = 1$ since, as in the `elastic net`, $0\leq \alpha \leq 1$. However, $\lambda_0$ is an algorithm-computed value such that there is one and only one non-zero element in matrix $\hat{\mathbf{B}}$. See Huang (2014) [1] for details on the `lasso` discarding rule for SEM.
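To see which hyperparameter pair achieved the smallest prediction error, the CV table can be indexed directly; the column name used below (`mse`) is only a guess, so check `colnames(cvfit$cv)` for the actual names first:
```{r, eval=FALSE}
cv_tab <- as.data.frame(cvfit$cv)
colnames(cv_tab)                          # inspect the actual column names
err_col <- "mse"                          # hypothetical name; replace with the real error column
cv_tab[which.min(cv_tab[[err_col]]), ]    # row with the smallest prediction error
```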
As shown in the plot below, with this choice of the `alpha_factor` and `lambda_factor` grid, all 30 edges are captured:
```{r}
par(mfrow = c(1, 2),mar=c(5.1, 4.1, 4.1, 4.1))
plot(B)
plot(cvfit$fit$weight)
```
With the ground truth `B` provided, the functions also compute performance statistics in `fit$statistics`, which is a 6x1 array recording:
1. Correct positive
2. Total positive
3. False positive
4. Positive detected
5. Power of detection (PD) = correct positive/total positive
6. False Discovery Rate (FDR) = false positive/positive detected
```{r}
cvfit$fit$statistics
```
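The last two entries can also be computed by hand from `B` and the fitted weight matrix, as a quick cross-check of the definitions above:
```{r}
est <- cvfit$fit$weight
tp  <- sum(B != 0 & est != 0)    # correct positives
fp  <- sum(B == 0 & est != 0)    # false positives
c(PD = tp / sum(B != 0), FDR = fp / sum(est != 0))
```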
Users may also notice that in the above two examples only a few `lambda` values were computed, rather than the full set specified by the `lambda_factors` parameter (20 `lambdas` by default). This is due to the early-stop rule implemented in [1]: if the current `lambda` leads to a Mean Square Error (MSE) larger than that of the previous `lambda` plus one Standard Error (STE) unit, the selection path does not continue. See [1] for more details on the selection path and discarding rules.
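Conceptually, the early-stop rule can be sketched as below; this is pseudo-code rather than the package's internal implementation, and it assumes vectors `lambda_path`, `mse`, and `ste` holding the candidate `lambda` values, the CV error, and its standard error along the path:
```{r, eval=FALSE}
for (k in 2:length(lambda_path)) {
  ## stop the selection path once the error degrades by more than one standard error
  if (mse[k] > mse[k - 1] + ste[k - 1]) break
}
```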
### Stability Selection (STS)
With a twist of the hyperparameters, the above plots showed that the selected non-zero edges can be slightly different. Stability Selection (STS) [5,6] is an algorithm that achieves stable effect selection with a controlled False Discovery Rate (FDR). The algorithm relies on bootstrapping (typically n = 100 rounds); each round randomly selects half of the dataset and runs it through the lasso/elastic net selection path. The stable effects are then those consistently selected across bootstraps, as measured by the bounded per-comparison error rate and FDR.
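The idea can be sketched conceptually as below; this simplified version refits the CV-selected model on each random half of the samples and records how often each edge is selected, whereas the packaged function implements the full procedure and error-rate calibration of [5, 6]:
```{r, eval=FALSE}
nBootstrap <- 100
freq <- matrix(0, nrow(Y), nrow(Y))             # selection frequency of each edge
for (b in 1:nBootstrap) {
  sub  <- sample(ncol(Y), floor(ncol(Y) / 2))   # random half of the samples
  fitb <- elasticNetSEM(Y[, sub], X[, sub], verbose = 0)
  freq <- freq + (fitb$weight != 0)
}
stableEdges <- (freq / nBootstrap) >= 0.6       # 0.6 is only an illustrative threshold
```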
The following code shows an example of running STS:
```{r}
tStart = proc.time()
set.seed(0)
output = enSEM_stability_selection(Y,X, Missing,B,
alpha_factors = seq(1,0.1, -0.1),
lambda_factors =10^seq(-0.2,-3,-0.2),
kFold = 5,
nBootstrap = 100,
verbose = -1)
tEnd = proc.time()
simTime = tEnd - tStart;
print(simTime)
names(output)
cat("nSTS = ", length(which(output$STS !=0)))
```
One key aspect of the STS method is that the hyperparameters are generally set toward higher penalties so that only a few non-zero edges are allowed. This is because, if most of the hyperparameters carry small penalties, the number of non-zero edges will be so large that, no matter how the threshold (pi) [5, 6] is set, there will be a large number of false positives.
In the above example the `lambda_factors` values are therefore not set too small, and all true effects are captured with no false positives, as shown below:
```{r}
B[which(B != 0)] = 1   # binarize the ground truth to match the binary STS output
par(mfrow = c(1, 2),mar=c(5.1, 4.1, 4.1, 4.1))
plot(B)
plot(output$STS)
```
Note that STS outputs a binary matrix, with $B_{i j} = 1$ indicating an edge from node $j$ to node $i$. Parallel bootstrapping is also implemented via the R package `parallel`:
```{r, eval=FALSE}
library(parallel)
cl<-makeCluster(4,type="SOCK")
clusterEvalQ(cl,{library(sparseSEM)})
output = enSEM_stability_selection_parallel(Y,X, Missing,B,
alpha_factors = seq(1,0.1, -0.1),
lambda_factors =10^seq(-0.2,-3,-0.2),
kFold = 3,
nBootstrap = 100,
verbose = -1,
clusters = cl)
stopCluster(cl)
```
[Back to Top](#top)
## Yeast GRN
We next turn to a real-world scenario and apply `sparseSEM` to the yeast GRN dataset. While the simulated network above has 30 nodes, the yeast GRN has 3,380 nodes, which increases the computation time significantly: roughly 30 minutes with `elasticNetSEM` and 1.5 hours with `enSEM_stability_selection` on an Intel i5 CPU.
```{r, eval=FALSE}
rm(list = ls())
library(sparseSEM)
data(yeast)
output = elasticNetSEM(Y, X, verbose = 1)
# STS
STS = enSEM_stability_selection(Y,X,verbose = -1)
```
For even larger networks, a parallel-computation version of `sparseSEM` is also available at the [sparseSEM GitHub Repo](https://github.com/anhuihng/pySparseSEM/tree/master/openMPI).
[Back to Top](#top)
## References
1. Huang, A. (2014). Sparse model learning for inferring genotype and phenotype associations. Ph.D. Dissertation, Chapter 7. University of Miami (1186).
2. Brem, R.B., Yvert, G., Clinton, R. and Kruglyak, L., 2002. Genetic dissection of transcriptional regulation in budding yeast. Science, 296:752-755.
3. Kellis, M., Patterson, N., Endrizzi, M., Birren, B. and Lander, E.S., 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423:241-254.
4. Brem, R.B. and Kruglyak, L., 2005. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America, 102:1572-1577.
5. Meinshausen, N. and Bühlmann, P., 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), pp.417-473.
6. Shah, R.D. and Samworth, R.J., 2013. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), pp.55-80.