Longitudinal analysis pipeline with longdat

Introduction

This is an example of running longdat_disc(). Note that the time variable (proxy of treatment) here should be discrete. If the time variable is continuous, please apply longdat_cont() instead.

# Load the packages
library(LongDat)
library(kableExtra)

Explaining the input data frame format

The input data frame (called master table) should have the same format as the example data “LongDat_disc_master_table”. If you have metadata and feature (eg. microbiome, immunome) data stored in separate tables, you can go to the section Preparing the input data frame with make_master_table() below. The function make_master_table() helps you to create master table from metadata and feature tables.

Now let’s have a look at the required format for the input master table. The example below is a dummy longitudinal data set with 3 time points (1, 2, 3). Here we want to see if the treatment has a significant effect on gut microbial abundance or not.

# Read in the data frame. LongDat_disc_master_table is already lazily loaded.
master <- LongDat_disc_master_table
master %>%
  kableExtra::kbl() %>%
  kableExtra::kable_paper(bootstrap_options = "responsive", font_size = 12) %>% 
  kableExtra::scroll_box(width = "700px", height = "200px")

Individual	Time_point	sex	age	DrugA	DrugB	BacteriumA	BacteriumB	BacteriumC
1	1	0	61	0.0	10	11	4	23
1	2	0	61	0.0	10	13	2	44
1	3	0	61	0.0	20	7	13	48
2	1	0	66	0.0	640	344	0	48
2	2	0	66	0.0	320	3	0	80
2	3	0	66	0.0	640	379	0	87
3	1	0	63	7.5	100	55	0	95
3	2	0	63	7.5	0	5	0	160
3	3	0	63	7.5	100	205	0	210
4	1	0	47	0.0	300	60	0	126
4	2	0	47	0.0	200	4	0	130
4	3	0	47	0.0	300	64	59	186
5	1	1	51	0.0	160	100	20	15
5	2	1	51	0.0	130	3	64	8
5	3	1	51	0.0	160	53	5	34
6	1	1	53	10.0	0	32	138	0
6	2	1	53	10.0	0	2	0	5
6	3	1	53	10.0	0	10	0	2
7	1	0	50	0.0	40	22	105	69
7	2	0	50	0.0	20	27	158	40
7	3	0	50	0.0	40	32	100	113
8	1	1	54	0.0	100	24	0	0
8	2	1	54	0.0	80	0	0	0
8	3	1	54	0.0	160	192	0	2
9	1	0	44	0.0	160	65	0	1
9	2	0	44	0.0	80	1	0	31
9	3	0	44	0.0	160	12	0	31
10	1	0	60	0.0	100	19	163	163
10	2	0	60	0.0	25	0	41	155
10	3	0	60	0.0	100	43	13	180

As you can see, the “Individual” is at the first column, and the features (dependent variables), which are gut microbial abundances in this case, are at the end of the table. Any column apart from individual, test_var (e.g. Time_point) and dependent variables will be taken as potential covariates (could be confounder or mediator). For example, here the potential covariates are sex, age, drug A and drug B. Please avoid using characters that don’t belong to ASCII printable characters for the column names in the input data frame.

Preparing the input data frame with make_master_table()

If you have your input master table prepared already, you can skip this section and go to Run longdat_disc() directly. If your metadata and feature (eg. microbiome, immunome) data are stored in two tables, you can create a master table out of them easily with the function make_master_table().

First, let’s take a look at an example of the metadata table. Metadata table should be a data frame whose columns consist of sample identifiers (sample_ID, unique for each sample), individual, time point and other meta data. Each row corresponds to one sample_ID.

# Read in the data frame. LongDat_disc_metadata_table is already lazily loaded.
metadata <- LongDat_disc_metadata_table
metadata %>%
  kableExtra::kbl() %>%
  kableExtra::kable_paper(bootstrap_options = "responsive", font_size = 12) %>% 
  kableExtra::scroll_box(width = "700px", height = "200px")

Sample_ID	Individual	Time_point	sex	age	DrugA	DrugB
1_1	1	1	0	61	0.0	10
1_2	1	2	0	61	0.0	10
1_3	1	3	0	61	0.0	20
2_1	2	1	0	66	0.0	640
2_2	2	2	0	66	0.0	320
2_3	2	3	0	66	0.0	640
3_1	3	1	0	63	7.5	100
3_2	3	2	0	63	7.5	0
3_3	3	3	0	63	7.5	100
4_1	4	1	0	47	0.0	300
4_2	4	2	0	47	0.0	200
4_3	4	3	0	47	0.0	300
5_1	5	1	1	51	0.0	160
5_2	5	2	1	51	0.0	130
5_3	5	3	1	51	0.0	160
6_1	6	1	1	53	10.0	0
6_2	6	2	1	53	10.0	0
6_3	6	3	1	53	10.0	0
7_1	7	1	0	50	0.0	40
7_2	7	2	0	50	0.0	20
7_3	7	3	0	50	0.0	40
8_1	8	1	1	54	0.0	100
8_2	8	2	1	54	0.0	80
8_3	8	3	1	54	0.0	160
9_1	9	1	0	44	0.0	160
9_2	9	2	0	44	0.0	80
9_3	9	3	0	44	0.0	160
10_1	10	1	0	60	0.0	100
10_2	10	2	0	60	0.0	25
10_3	10	3	0	60	0.0	100

This example is a dummy longitudinal meatadata with 3 time points for each individual. Besides sample_ID, individual, time point columns, there are also information of sex, age and drugs that individuals take. Here we want to see if the treatment has a significant effect on gut microbial abundance or not.

Then, let’s see how a feature table looks like. Feature table should be a data frame whose columns only consist of sample identifiers (sample_ID) and features (dependent variables, e.g. microbiome). Each row corresponds to one sample_ID. Please do not include any columns other than sample_ID and features in the feature table.

# Read in the data frame. LongDat_disc_feature_table is already lazily loaded.
feature <- LongDat_disc_feature_table
feature %>%
  kableExtra::kbl() %>%
  kableExtra::kable_paper(bootstrap_options = "responsive", font_size = 12) %>% 
  kableExtra::scroll_box(width = "700px", height = "200px")

Sample_ID	BacteriumA	BacteriumB	BacteriumC
1_1	11	4	23
1_2	13	2	44
1_3	7	13	48
2_1	344	0	48
2_2	3	0	80
2_3	379	0	87
3_1	55	0	95
3_2	5	0	160
3_3	205	0	210
4_1	60	0	126
4_2	4	0	130
4_3	64	59	186
5_1	100	20	15
5_2	3	64	8
5_3	53	5	34
6_1	32	138	0
6_2	2	0	5
6_3	10	0	2
7_1	22	105	69
7_2	27	158	40
7_3	32	100	113
8_1	24	0	0
8_2	0	0	0
8_3	192	0	2
9_1	65	0	1
9_2	1	0	31
9_3	12	0	31
10_1	19	163	163
10_2	0	41	155
10_3	43	13	180

This example is a dummy longitudinal feature data. It stores the gut microbial abundance of each sample.

To enable the joining process of metadata and feature tables, please pay attention to the following rules.

The row numbers of metadata and feature tables should be the same.
Sample_IDs are unique for each sample (i.e. no repeated sample_ID)
Metadata and feature tables have the same sample_IDs. If sample_IDs don’t match between the two tables, the joining process will fail.
As mentioned above, feature table should include only the columns of sample_ID and features.
Avoid using characters that don’t belong to ASCII printable characters for the column names.

Now let’s create a master table and take a look at the result!

master_created <- make_master_table(
metadata_table = LongDat_disc_metadata_table,
feature_table = LongDat_disc_feature_table,
sample_ID = "Sample_ID",
individual = "Individual")
#> [1] "Finished creating master table successfully!"

master_created %>%
  kableExtra::kbl() %>%
  kableExtra::kable_paper(bootstrap_options = "responsive", font_size = 12) %>% 
  kableExtra::scroll_box(width = "700px", height = "200px")

Individual	Time_point	sex	age	DrugA	DrugB	BacteriumA	BacteriumB	BacteriumC
1	1	0	61	0.0	10	11	4	23
1	2	0	61	0.0	10	13	2	44
1	3	0	61	0.0	20	7	13	48
2	1	0	66	0.0	640	344	0	48
2	2	0	66	0.0	320	3	0	80
2	3	0	66	0.0	640	379	0	87
3	1	0	63	7.5	100	55	0	95
3	2	0	63	7.5	0	5	0	160
3	3	0	63	7.5	100	205	0	210
4	1	0	47	0.0	300	60	0	126
4	2	0	47	0.0	200	4	0	130
4	3	0	47	0.0	300	64	59	186
5	1	1	51	0.0	160	100	20	15
5	2	1	51	0.0	130	3	64	8
5	3	1	51	0.0	160	53	5	34
6	1	1	53	10.0	0	32	138	0
6	2	1	53	10.0	0	2	0	5
6	3	1	53	10.0	0	10	0	2
7	1	0	50	0.0	40	22	105	69
7	2	0	50	0.0	20	27	158	40
7	3	0	50	0.0	40	32	100	113
8	1	1	54	0.0	100	24	0	0
8	2	1	54	0.0	80	0	0	0
8	3	1	54	0.0	160	192	0	2
9	1	0	44	0.0	160	65	0	1
9	2	0	44	0.0	80	1	0	31
9	3	0	44	0.0	160	12	0	31
10	1	0	60	0.0	100	19	163	163
10	2	0	60	0.0	25	0	41	155
10	3	0	60	0.0	100	43	13	180

The table “master_created” is just the same as the table “master” or “LongDat_disc_master_table” in the previous section, with the “Individual” as the first column, and the features (dependent variables), which are gut microbial abundances in this case, are at the end of the table. Any column apart from individual, test_var (e.g. Time_point) and dependent variables will be taken as potential covariates (could be confounder or mediator). For the details of the arguments, please read the help page of this function by using ?make_master_table.

OK, now we’re ready to run longdat_disc()!

Run longdat_disc()

The input is the example data frame LongDat_disc_master_table (same as “master” or “master_created” in the previous sections), and the data_type is “count” since the dependent variables (features, in this case they’re gut microbial abundance) are count data. The “test_var” is the independent variable you’re testing, and here we’re testing “Time_point” (time as the proxy for treatment). The variable_col is 7 because the dependent variables start at column 7. And the fac_var mark the columns that aren’t numerical. For the details of the arguments, please read the help page of this function by using ?longdat_disc.

The run below takes less than a minute to complete. When data_type equals to “count”, please remember to set seed (as shown below) so that you’ll get reproducible randomized control test.

# Run longdat_disc() on LongDat_disc_master_table
set.seed(100)
test_disc <- longdat_disc(input = LongDat_disc_master_table,
data_type = "count",
test_var = "Time_point",
variable_col = 7, 
fac_var = c(1:3))
#> [1] "Start data preprocessing."
#> [1] "Finish data preprocessing."
#> [1] "Start selecting potential covariates."
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] "Finished selecting potential covariates."
#> [1] "Start null model test and post-hoc test."
#> [1] 1
#> ERROR : argument is of length zero 
#> [1] 2
#> ERROR : argument is of length zero 
#> [1] 3
#> ERROR : argument is of length zero 
#> [1] "Finish null model test and post-hoc test."
#> [1] "Start covariate model test."
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] "Finish covariate model test."
#> [1] "Start unlisting tables from covariate model result."
#> [1] "Finish unlisting tables from covariate model result."
#> [1] "Start calculating effect size."
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] "Finish calculating effect size."
#> [1] "Start randomized negative control model test."
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] "Finish randomized negative control model test."
#> [1] "Start Wilcoxon post-hoc test."
#> [1] "Finish Wilcoxon post-hoc test."
#> [1] "Start removing the dependent variables to be exlcuded."
#> [1] "Finish removing the dependent variables to be exlcuded."
#> [1] "Start multiple test correction on null model test p values."
#> [1] "Finish multiple test correction on null model test p values."
#> [1] "Start generating result tables."
#> [1] "Not_reducible_to_covariate"
#> [1] "Finished successfully!"

If you have completed running the function successfully, you’ll see the message “Finished successfully!” at the end. The results are stored in list format.

Results

The major output from longdat_disc() include a result table and a covariate table. If you have count data (data_type equals to “count”), then there are chances that you get a third table “randomized control table”. If you specify data_type as either “measurement” or “others”, then you’ll get a “Normalize_method” table. For more details about the “randomized control table” and “Normalize_method” table, please read the help page of this function by using ?longdat_disc.

Result table

Let’s have a look at the result table first.

# The first dataframe in the list is the result table
result_table <- test_disc[[1]]
result_table %>%
  kableExtra::kbl() %>%
  kableExtra::kable_paper(bootstrap_options = "responsive", font_size = 12, position = "center") %>% 
  kableExtra::scroll_box(width = "700px")

Feature	Prevalence_percentage	Mean_abundance	Signal	Effect_1_2	Effect_1_3	Effect_2_3	EffectSize_1_2	EffectSize_1_3	EffectSize_2_3	Null_time_model_q	Post-hoc_q_1_2	Post-hoc_q_1_3	Post-hoc_q_2_3
BacteriumA	93.333	59.567	OK_nrc	Decreased	NS	Enriched	-0.6	0.2	0.8	0.0000001	0.0000015	0.5070438	0.0000002
BacteriumB	46.667	29.500	NS	NS	NS	NS	-0.1	-0.2	-0.1	0.7869569	0.7995718	0.7995718	0.7995718
BacteriumC	90.000	69.533	OK_d	NS	Enriched	NS	0.3	1.0	0.7	0.0018420	0.0686659	0.0007599	0.0686659

The second and third columns show the prevalence and mean abundance of each feature According to the “Signal” column, treatment is a significant predictor for BacteriumA as it shows “OK_nrc” (which represents “OK and not reducible to covariate”), meaning that there are potential covariates, however there is an effect of time (proxy of treatment) and it is independent of those of covariates. To find out what the covariates are, you’ll need to see the covariate table, but we’ll get to that later. On the other hand, the signal for BacteriumC is “OK_nc” (which represents OK and no covariate), meaning that the abundance of BacteriumC alters significantly throughout time (proxy of treatment), and that there is no potential covariate. As for BacteriumB, time (proxy of treatment) has no effect on its abundance.

The following columns “Effect_1_2”, “Effect_1_3” and “Effect_2_3” describe how features (dependent variables) change from time point a to b. Here we can tell that BacteriumA decreases from time point 1 to 2, and then enriched from time point 2 to 3. From the columns “EffectSize_1_2” and “EffectSize_2_3”, we know that the effect size are -0.6 and 0.8. The most relevant information for users ends here, which are listed from the first column to “EffectSize” columns.

Then the following columns contain the details of model test p values (“Null_time_model_q”), the post-hoc test p values (Post.hoc_q_1_2, Post.hoc_q_1_3 and Post.hoc_q_2_3). For more detailed information of the columns in the result table, please refer to the help page by using ?longdat_disc.

The explanation of each type of “Signal” is listed below.

Signal	Meaning	Explanation
NS	Non-significant	There’s no effect of time.
OK_nc	OK and no covariate	There’s an effect of time and there’s no potential covariate.
OK_d	OK but doubtful	There’s an effect of time and there’s no potential covariate, however the confidence interval of the test_var estimate in the model test includes zero, and thus it is doubtful. Please check the raw data (e.g., plot feature against time) to confirm if there is real effect of time.
OK_nrc	OK and not reducible to covariate	There are potential covariates, however there’s an effect of time and it is independent of those of covariates.
EC	Entangled with covariate	There are potential covariates, and it isn’t possible to conclude whether the effect is resulted from time or covariates.
RC	Effect reducible to covariate	There’s an effect of time, but it can be reduced to the covariate effects.

Covariate table

Next, let’s take a look at the covariate table.

# The second dataframe in the list is the covariate table
covariate_table <- test_disc[[2]]
covariate_table %>%
  kableExtra::kbl() %>%
  kableExtra::kable_paper(bootstrap_options = "responsive", font_size = 12, position = "center") %>% 
  kableExtra::scroll_box(width = "700px")

Feature	Covariate1	Covariate_type1	Effect_size1	Covariate2	Covariate_type2	Effect_size2
BacteriumA	DrugB	Not_reducible_to_covariate	0.4943542	NA	NA	NA

The columns of this covariate table are grouped every three columns. “Covariate1” is the name of the covariate, while “Covariate_type1” is the covariate type of covariate1, that is, if the effect of time is reducible to covariate1. “Effect_size1” is the effect size of the dependent variable values between different levels of covariate1. Here, the effect of time isn’t reducible to covariate1, so we don’t need to worry about the covariate effect of covariate1 on bacteriumA. If there are more than one covariates, they will be listed along the rows of each dependent variable.

Result interpretation

From the result above, we see that the potential covariate for BacteriumA is DrugB, but we don’t need to worry about the effect of time (proxy for the treatment) being able to reduce to DrugB since it’s covariate type is “not reducible to covariate”, meaning that the effect of time is independent from the effect of DrugB. With this information, we confirm that the treatment can solely explain the changes in BacteriumA abundance. All together, the treatment induces significant changes on the abundance of BacteriumA and BacteriumC, while causing no alteration in that of BacteriumB.

Plotting the result

Finally, we can plot the result with the function cuneiform_plot(). The required input is a result table from longdat_disc() (or any table with the same format as a result table does).

test_plot <- cuneiform_plot(result_table = test_disc[[1]],
                            x_axis_order = c("Effect_1_2", "Effect_2_3",                             "Effect_1_3"),
                            title_size = 15)
#> [1] "Finished plotting successfully!"

test_plot

Here we can see the result clearly from the cuneiform plot. It shows the features whose signals are not “NS”. The left panel displays the effects in each time interval. Red represents positive effect size while blue describes negative one (colors can be customized by users). Signficant signals are indicated by solid shapes, whereas insignificant signals are denoted by transparent ones. The right panel displays the covariate status of each feature, and users can remove it by specifying covariate_panel = FALSE. For more details of the arguments, please read the help page of this function by using ?cuneiform_plot.

Wrap-up

This tutorial ends here! If you have any further questions and can’t find the answers in the vignettes or help pages, please contact the author (jessica821112@gmail.com).

Longitudinal analysis pipeline with longdat_disc()

Time (the proxy for treatment) as a discrete variable