An Introduction to the package M3JF

Xiaoyao Yin

2023-08-11

This vignette presents the M3JF package, which implements a framework named multi-modality matrix joint factorization (M3JF) for integrative analysis of multiple-modality data in R. The objective is to provide an implementation of the proposed method, which is designed to handle the high-dimensional, multiple-modality data found in bioinformatics. This is achieved by jointly factorizing the data matrices into a shared sub-matrix and several modality-specific sub-matrices. The group sparse constraint on the shared sub-matrix acts on the samples in the same group and lets each modality exploit only a subset of the dimensions of the global latent space.

Installation

The latest stable version of the package can be installed from any CRAN repository mirror:

#Install
install.packages('M3JF')
#Load
library(M3JF)

The latest development version is available from https://cran.r-project.org/package=M3JF; download the source archive from there and install it manually:

install.packages('/path/to/file/M3JF.tar.gz',repos=NULL,type="source")
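
The examples in this vignette also load InterSIM, SNFtool, dplyr and mclust. As a minimal sketch (the helper variable names are illustrative, not part of M3JF), the following checks which of these packages are missing without installing anything:

```r
# Packages loaded somewhere in this vignette
pkgs <- c("M3JF", "InterSIM", "SNFtool", "dplyr", "mclust")

# requireNamespace() returns FALSE (instead of failing) when a package is absent
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!available]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages used in this vignette are installed.")
}
```

Any package reported as missing can then be installed with install.packages before running the examples.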

Support: Users interested in this package are encouraged to email Xiaoyao Yin with enquiries, bug reports, feature requests, suggestions or M3JF-related discussions.

Usage

The following example shows how to use the package.

Simulation data generation

We generate simulated data with the R package InterSIM, which generates three inter-related data sets with realistic inter- and intra-relationships based on the DNA methylation, mRNA expression and protein expression data from the TCGA ovarian cancer study. Each data modality consists of 500 samples. The samples are assigned to 4 groups of 100, 150, 135 and 115 samples, respectively. The data can be generated by running:

library(InterSIM)

sim.data <- InterSIM(n.sample=500, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
                     delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
                     p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL,
                     sigma.protein=NULL,cor.methyl.expr=NULL,
                     cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE,
                     feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
data_list <- list(sim.methyl, sim.expr, sim.protein)
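
The group sizes quoted above follow directly from n.sample and cluster.sample.prop. A small sketch in base R (the variable names are illustrative) makes the arithmetic explicit and does not require InterSIM:

```r
# Values copied from the InterSIM call above
n_sample <- 500
cluster_prop <- c(0.20, 0.30, 0.27, 0.23)

# The proportions must sum to 1
stopifnot(isTRUE(all.equal(sum(cluster_prop), 1)))

# Expected number of samples per group
group_size <- round(n_sample * cluster_prop)
print(group_size)
```

Once sim.data exists, table(sim.data$clustering.assignment$cluster.id) should report matching counts.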

Simulation data groundtruth assignment and permutation

Label assignment: According to the data generation process, we assign the ground-truth label to the generated data as:

true_label <- sim.data$clustering.assignment$cluster.id

This label will be used to test the clustering ability afterwards.

Now we can cluster the samples with the proposed method and assess its performance by passing the true label and the predicted label to the function cal_NMI, which calculates the normalized mutual information between them.


You should start from here if you are using your own data.


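Evaluating k: Estimate the most suitable number of clusters k from the mean modality modularity with the function new_modularity. The example below uses the rotation cost computed by RotationCostBestGivenGraph instead and picks the k that minimizes it.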

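#Build similarity matrices for your data with SNFtool
library(SNFtool)
library(dplyr)
#One affinity matrix per modality (rows of each data matrix are samples)
WL_dist1 <- lapply(data_list,function(x){
  dd <- x%>%as.matrix
  dd %>% dist2(dd) %>% affinityMatrix(K = 10, sigma = 0.5)
})
#Assumption: W is taken to be the average of the per-modality affinity
#matrices, giving one sample-by-sample similarity matrix
W <- Reduce(`+`, WL_dist1) / length(WL_dist1)
#Assign the interval of k according to your data
k_list = 2:10
#Compute the rotation cost for each candidate k
clu_eval <- RotationCostBestGivenGraph(W,k_list)
#The most suitable k is the one with minimal rotation cost
best_k = k_list[which.min(clu_eval)]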
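
To show what a sample-by-sample similarity matrix looks like, here is a simplified base-R sketch on toy data. It uses a fixed-bandwidth Gaussian kernel as a stand-in and is not the computation performed by dist2 and affinityMatrix in SNFtool. The toy matrix, the bandwidth and the cost vector are all made up for illustration.

```r
set.seed(1)
# Toy data: 6 samples (rows) x 4 features (columns); not InterSIM output
x <- matrix(rnorm(24), nrow = 6)

# Pairwise squared Euclidean distances between samples
d2 <- as.matrix(dist(x))^2

# Fixed-bandwidth Gaussian kernel; the bandwidth is an illustrative choice
sigma_toy <- sqrt(median(d2[upper.tri(d2)]))
w_toy <- exp(-d2 / (2 * sigma_toy^2))

# Selection step: pick the candidate k with the smallest cost.
# cost_toy is a made-up vector, not the output of RotationCostBestGivenGraph
k_toy <- 2:5
cost_toy <- c(0.8, 0.3, 0.5, 0.9)
best_k_toy <- k_toy[which.min(cost_toy)]
```

The resulting w_toy is symmetric with ones on the diagonal, and larger entries mark more similar pairs of samples.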

M3JF: Jointly factorize the matrices into a shared embedding matrix and several modality private basis matrices.

#Assign the parameters
lambda = 0.01
theta = 10^-6
k = best_k
res = M3JF(data_list,lambda,theta,k)

You now have the clustering result.
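
The factorization performed in this step can be pictured with a toy sketch of the matrix shapes involved. It assumes each modality matrix (samples x features) is approximated by a shared embedding times a modality-specific basis. The names E, H and X_hat are illustrative, and the code does not reproduce the optimization inside M3JF.

```r
set.seed(1)
n <- 8            # number of samples (toy value)
k_toy <- 2        # number of latent dimensions (toy value)
p <- c(5, 4, 3)   # number of features in each of three modalities (toy values)

# Shared embedding: one row per sample, common to all modalities
E <- matrix(runif(n * k_toy), nrow = n)

# One modality-specific basis matrix per modality
H <- lapply(p, function(p_i) matrix(runif(k_toy * p_i), nrow = k_toy))

# Each modality is reconstructed from the same E and its own basis
X_hat <- lapply(H, function(H_i) E %*% H_i)

# Rows: number of samples and number of features; one column per modality
dims <- sapply(X_hat, dim)
print(dims)
```

Every reconstructed matrix has n rows because all modalities share the same samples, while the number of columns follows each modality's own feature count.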


You can omit the following if you do not have true labels to serve as the ground truth. We use the next step to evaluate our method.


Robustness test: We test the robustness of our method by calculating the normalized mutual information (NMI) and the adjusted Rand index (ARI) between the true label and our predicted label. These scores allow us to compare the performance of our method with others. NMI lies in the interval [0,1]; ARI is at most 1 and is close to 0, or even slightly negative, for a random assignment. The larger the scores, the better the predicted clusters agree with the true labels.

library(SNFtool)
#Calculate the NMI of *M3JF*
M3JF_res = M3JF(data_list,lambda,theta,k)
M3JF_cluster = M3JF_res$clusters
M3JF_NMI = cal_NMI(true_label,M3JF_cluster)
#Calculate the ARI of *M3JF*
library(mclust)
M3JF_ARI = adjustedRandIndex(true_label,M3JF_cluster)
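
To make the two scores concrete, the following base-R sketch computes them from a contingency table on toy labels. The functions nmi_sketch and ari_sketch are illustrative helpers, not part of M3JF or mclust. The NMI here is normalized by the geometric mean of the two entropies, which is an assumption and may differ from the normalization used by cal_NMI.

```r
# NMI: mutual information divided by sqrt(H(a) * H(b)).
# Undefined when either labeling has a single cluster (zero entropy).
nmi_sketch <- function(a, b) {
  joint <- table(a, b) / length(a)   # joint distribution of the two labelings
  pa <- rowSums(joint)
  pb <- colSums(joint)
  nz <- joint > 0                    # skip empty cells to avoid log(0)
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  ha <- -sum(pa * log(pa))
  hb <- -sum(pb * log(pb))
  mi / sqrt(ha * hb)
}

# ARI: pair-counting agreement corrected for chance
ari_sketch <- function(a, b) {
  tab <- table(a, b)
  sum_ij <- sum(choose(tab, 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(length(a), 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Same partition with different label names: both scores equal 1
truth <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
relabelled <- c(3, 3, 3, 1, 1, 1, 2, 2, 2)
nmi_same <- nmi_sketch(truth, relabelled)
ari_same <- ari_sketch(truth, relabelled)

# A partition that cuts across the truth: ARI falls below 0
ari_crossed <- ari_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

Both scores depend only on how samples are grouped, not on the label names, so a clustering that recovers the true groups scores 1 regardless of how the clusters are numbered.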