Automated command line analysis

Arnaud Wolfer

2019-10-03

The santaR package is designed for the detection of significantly altered time trajectories between study groups, in short time-series. Command line parallelisation and reporting functions allow the automated analysis of multiple variables.

The automated command line functions are preferred over the GUI when processing a very high number of variables, as they are more efficient and can be integrated in scripts.

Using an example dataset, this vignette covers parallel processing, automated reporting, and saving results for the GUI.

Parallel processing

Within a single experiment, multiple variables can be measured and explored dynamically (e.g. NMR or MS features, genes). As santaR’s analysis is a univariate approach, each variable can be fitted independently. This lack of dependency makes santaR’s analysis an embarrassingly parallel workload.

The santaR_auto_fit() function wraps each of the analytical functions (i.e. get_ind_time_matrix(), santaR_fit(), santaR_CBand(), santaR_pvalue_dist() and santaR_pvalue_fit()) and executes them in parallel; for each individual function, see its help page and the advanced command line options vignette. The parallelisation relies on the doParallel package for the instantiation of worker nodes and on foreach for the distribution of tasks. This set of packages enables parallelisation on all operating systems (Windows, Mac OS and most Linux distributions).
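
The division of labour between the two packages can be sketched as follows. This is a minimal illustration on toy data, not santaR's internal code: the toy_* objects, the choice of two workers, and the use of stats::smooth.spline() as the per-variable task are all assumptions made for the example.

```r
library(foreach)
library(doParallel)

# Toy data (hypothetical): 20 samples as rows, 6 variables as columns
set.seed(1)
toy_time <- rep(1:5, times = 4)
toy_data <- as.data.frame(matrix(rnorm(20 * 6), nrow = 20,
                                 dimnames = list(NULL, paste0("var_", 1:6))))

# doParallel: instantiate worker nodes and register them as the foreach backend
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

# foreach: distribute one task per column; tasks are independent of each other
toy_fits <- foreach(j = seq_len(ncol(toy_data))) %dopar% {
  stats::smooth.spline(x = toy_time, y = toy_data[[j]], df = 3)
}
names(toy_fits) <- colnames(toy_data)

# Always release the worker nodes once the work is done
parallel::stopCluster(cl)
```

Because no task reads another task's result, the loop body can run on any worker in any order, which is what makes the workload embarrassingly parallel.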

Observation values are expected as a data frame with samples as rows and variables as columns; the parallelisation takes place over the columns. For a selected number of CPU cores (ncores parameter), santaR_auto_fit() first instantiates worker nodes; if ncores=0, the procedure is applied sequentially, without parallelisation. The conversion of inputs by get_ind_time_matrix() is however not parallelised by default, as the parallelisation overhead exceeds the time gained for all but the most complex datasets. When the number of individuals, unique time points, or variables is high, the forceParIndTimeMat parameter enables the parallelisation of this step. All subsequent analytical steps are automatically parallelised, with the calculation of confidence bands on the group mean curves and the identification of altered trajectories activated by default.
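
The two options described above might be exercised as in the sketch below. It only uses parameters named in this vignette, but several choices are assumptions made to keep the example quick: the input is restricted to the first two variables, CBand and pval.dist are set to FALSE (inferred from the TRUE values used elsewhere in this vignette), and forceParIndTimeMat is assumed to take a logical value.

```r
library(santaR)

# Example data shipped with the package
tmp_data <- acuteInflammation$data
tmp_meta <- acuteInflammation$meta

# Sequential run: ncores=0 applies the procedure without parallelisation.
# Only two variables, no confidence bands and no p-value (assumed settings, for speed).
res_seq <- santaR_auto_fit(inputData=tmp_data[, 1:2], ind=tmp_meta$ind,
                           time=tmp_meta$time, group=tmp_meta$group, df=5,
                           ncores=0, CBand=FALSE, pval.dist=FALSE)

# Parallel run on 2 cores, additionally parallelising the input conversion step.
# forceParIndTimeMat=TRUE is assumed to be the way to enable it.
res_par <- santaR_auto_fit(inputData=tmp_data[, 1:2], ind=tmp_meta$ind,
                           time=tmp_meta$time, group=tmp_meta$group, df=5,
                           ncores=2, CBand=FALSE, pval.dist=FALSE,
                           forceParIndTimeMat=TRUE)
```

On a dataset this small the sequential call is likely to be the faster of the two, which is the overhead trade-off described above.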

santaR_auto_fit() returns a list of SANTAObj containing each variable’s analysis results. In practice, santaR_auto_fit() is the function employed for command line analysis as it caters for all possible use cases.

library(santaR)

# Load example data
tmp_data  <- acuteInflammation$data
tmp_meta  <- acuteInflammation$meta

# Analyse data, with confidence bands and p-value
res_acuteInf_df5 <- santaR_auto_fit(inputData=tmp_data, ind=tmp_meta$ind, time=tmp_meta$time, group=tmp_meta$group, df=5, ncores=4, CBand=TRUE, pval.dist=TRUE)
# Input data generated: 0.13 secs
# Spline fitted:        1.05 secs
# ConfBands done:      18.98 secs
# p-val dist done:     35.43 secs
# total time:          55.59 secs

length(res_acuteInf_df5)
# [1] 22
names(res_acuteInf_df5)
#  [1] "var_1"  "var_2"  "var_3"  "var_4"  "var_5"  "var_6"  "var_7"  "var_8"  "var_9"  "var_10" "var_11" "var_12" "var_13" "var_14" "var_15" "var_16" "var_17" "var_18"
# [19] "var_19" "var_20" "var_21" "var_22"

Automated Reporting

After multiple variables have been analysed using santaR_auto_fit(), a reporting function helps assess significant results and summarise them in an easily interpretable fashion. santaR_auto_summary() takes a list of SANTAObj as generated by santaR_auto_fit() as input.

First, correction for multiple testing can be applied to generate Bonferroni, Benjamini-Hochberg or Benjamini-Yekutieli corrected p-values. P-values can be returned by the function and also automatically saved to disk as .csv. For a given significance cut-off (plotCutOff parameter), the number of significantly altered variables is reported and plots are automatically saved to disk, ordered by increasing p-value. The appearance of the plots can be adjusted using options such as the display of confidence bands (showConfBand parameter) or the generation of a mean curve across all samples (showTotalMeanCurve parameter), which can help assess differences between groups when group sizes are unbalanced.

# Generate a summary
#   without a defined 'targetFolder', no csv or plots can be saved
pval_acuteInf_df5 <- santaR_auto_summary(SANTAObjList=res_acuteInf_df5, targetFolder=NA)
# p-value dist found
# Benjamini-Hochberg corrected p-value

names(pval_acuteInf_df5)
# [1] "pval.all"     "pval.summary"

pval_acuteInf_df5$pval.summary
#      Test  Inf 0.05  Inf 0.01  Inf 0.001
#      dist        17         8          0
#   dist_BH        16         0          0

pval_acuteInf_df5$pval.all
#         dist      dist_upper  dist_lower  curveCorr  dist_BH
# var_1   0.00999   0.0183      0.005434    -0.243     0.02747
# var_2   0.007992  0.0157      0.004054    0.0006572  0.02747
# var_3   0.006993  0.01437     0.00339     -0.131     0.02747
# var_4   0.2098    0.2361      0.1857      -0.3878    0.2148
# var_5   0.005994  0.01302     0.002749    -0.5635    0.02747
# var_6   0.008991  0.017       0.004736    -0.4767    0.02747
# var_7   0.01399   0.02334     0.008347    -0.5629    0.03077
# var_8   0.00999   0.0183      0.005434    -0.4679    0.02747
# var_9   0.03896   0.05282     0.02863     -0.389     0.05042
# var_10  0.03497   0.04825     0.02524     -0.05017   0.04808
# var_11  0.01399   0.02334     0.008347    0.0568     0.03077
# var_12  0.2148    0.2413      0.1904      0.153      0.2148
# var_13  0.06693   0.08414     0.05304     -0.4078    0.0775
# var_14  0.1548    0.1786      0.1337      -0.06504   0.1703
# var_15  0.008991  0.017       0.004736    0.1268     0.02747
# var_16  0.01598   0.02581     0.00986     0.5055     0.03197
# var_17  0.01998   0.03067     0.01297     0.2798     0.03663
# var_18  0.02997   0.04247     0.02107     0.4028     0.04396
# var_19  0.05395   0.06973     0.04157     0.5015     0.06593
# var_20  0.02398   0.03543     0.01616     0.3899     0.03768
# var_21  0.02298   0.03425     0.01536     0.1458     0.03768
# var_22  0.007992  0.0157      0.004054    -0.2075    0.02747
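
The multiple-testing corrections mentioned above can be illustrated with base R's p.adjust(), applied to the dist p-values listed in the pval.all output. Whether santaR_auto_summary() relies on p.adjust() internally is an assumption, and the column names dist_Bonf and dist_BY are hypothetical labels chosen for this sketch; the Benjamini-Hochberg values it produces should agree with the dist_BH column up to rounding.

```r
# Unadjusted 'dist' p-values copied from the pval.all output (var_1 to var_22)
p_dist <- c(0.00999, 0.007992, 0.006993, 0.2098, 0.005994, 0.008991, 0.01399,
            0.00999, 0.03896, 0.03497, 0.01399, 0.2148, 0.06693, 0.1548,
            0.008991, 0.01598, 0.01998, 0.02997, 0.05395, 0.02398, 0.02298,
            0.007992)
names(p_dist) <- paste0("var_", seq_along(p_dist))

# Base R corrections (assumed equivalent to what the summary function applies)
p_adj <- data.frame(
  dist      = p_dist,
  dist_Bonf = p.adjust(p_dist, method = "bonferroni"),
  dist_BH   = p.adjust(p_dist, method = "BH"),
  dist_BY   = p.adjust(p_dist, method = "BY")
)

# Number of variables below each cut-off, for each correction
cut_offs <- c(0.05, 0.01, 0.001)
n_signif <- sapply(p_adj, function(p) sapply(cut_offs, function(a) sum(p < a)))
rownames(n_signif) <- paste("Inf", cut_offs)
n_signif
```

The dist and dist_BH columns of n_signif correspond to the rows of pval.summary, while the Bonferroni and Benjamini-Yekutieli columns show how much more conservative those corrections are on the same p-values.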

Save results for GUI

In practice, time-dependent patterns for a given biological question (e.g. a grouping of individuals) are assessed by parallelised fitting and analysis using santaR_auto_fit() and reporting using santaR_auto_summary(). When results are available, the most significantly altered variables can be identified using the reports and visually inspected for confirmation using the plots already saved to disk.

Additionally, analysis results can be loaded into the GUI for interactive visualisation or generation of plots. To do so, the list of SANTAObj generated by santaR_auto_fit() must be saved under the variable name inSp in a .RData file:

# Rename the results
inSp        <- res_acuteInf_df5
# Save to disk
outputPath  <- file.path('path_to_my_output_folder', 'acuteInf_results.rdata') 
save(inSp, file=outputPath, compress=TRUE)
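
Since the GUI expects the variable name inSp, it can be worth checking what a saved file actually contains before loading it. The sketch below does this with base R only; the placeholder list stands in for real santaR_auto_fit() results, and the temporary path and the check_env name are illustrative assumptions.

```r
# Stand-in for the list returned by santaR_auto_fit() (placeholder content)
inSp <- list(var_1 = "placeholder", var_2 = "placeholder")

# Save to a temporary location (illustrative path only)
tmp_path <- file.path(tempdir(), "acuteInf_results.rdata")
save(inSp, file = tmp_path, compress = TRUE)

# Load into a fresh environment; load() returns the names of the restored objects
check_env    <- new.env()
loaded_names <- load(tmp_path, envir = check_env)

# Fails with an error if the file does not hold exactly one object called 'inSp'
stopifnot(identical(loaded_names, "inSp"))
```

Loading into a separate environment avoids overwriting objects of the same name in the current session.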

See Also