---
title: "GSSTDA Vignette: Gene Structure Survival using Topological Data Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{GSSTDA-vignette}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette is an introduction to the use of the GSSTDA package.

```{r setup}
library(GSSTDA)
```

Loading data:

* The *full_data* is the expression matrix whose columns correspond to the patients and rows to the genes,
* the *survival_time* is a vector that, for patients whose sample is pathological, gives the time between disease diagnosis and the event (in this case relapse). If the patient has not relapsed, it gives the time until the end of follow-up. Patients whose sample is from healthy tissue have an NA value,
* the *survival_event* is a vector indicating whether the event (in this case relapse) has occurred in each patient,
* and the *case_tag* indicates, for each patient, whether the sample is pathological or healthy tissue.

See the *GSSTDA* documentation for further information.

```{r}
data("full_data")
data("survival_time")
data("survival_event")
data("case_tag")
```

Declare the parameters needed for the GSSTDA object. The *gen_select_type* parameter chooses how the genes to be used in the mapper are selected: either "Abs" or "Top_Bot". The *percent_gen_select* parameter is the percentage of genes to be selected for use in the mapper.

```{r}
# Gene selection information
gen_select_type <- "Top_Bot"
percent_gen_select <- 10 # Percentage of genes to be selected
```

For the mapper, it is necessary to set the number of intervals into which the values of the filter functions will be divided (`num_intervals`) and the percentage of overlap between them (`percent_overlap`). The defaults are 5 and 40, respectively.

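To make the roles of the number of intervals and the overlap concrete, the following minimal sketch splits a handful of made-up filter values into overlapping intervals. It is only an illustration: the `toy_*` names and values are hypothetical, and the way the overlap is distributed (half of it added at each end of every interval) is an assumption that may differ from what the package does internally.

```{r}
# Illustrative only: hypothetical filter values for eight patients.
toy_filter <- c(0.1, 0.4, 0.9, 1.3, 1.8, 2.2, 2.7, 3.0)
toy_num_intervals <- 3
toy_percent_overlap <- 40

# Width of each interval before any overlap is added.
toy_width <- diff(range(toy_filter)) / toy_num_intervals
toy_starts <- min(toy_filter) + toy_width * (0:(toy_num_intervals - 1))
toy_ends <- toy_starts + toy_width

# Assumption: the overlap is split evenly between both ends of each interval.
toy_margin <- toy_width * toy_percent_overlap / 100 / 2
toy_intervals <- data.frame(lower = toy_starts - toy_margin,
                            upper = toy_ends + toy_margin)

# Indices of the patients that fall in each (overlapping) interval.
toy_levels <- lapply(seq_len(toy_num_intervals), function(i) {
  which(toy_filter >= toy_intervals$lower[i] &
          toy_filter <= toy_intervals$upper[i])
})
toy_levels
```

With these toy values, patients 3 and 6 each fall into two consecutive intervals. Shared patients of this kind are what later allow clusters from neighbouring levels to be connected.
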
It is also necessary to choose the type of distance used for clustering within each interval (`distance_type`: "correlation", the default, or "euclidean") and the clustering type (`clustering_type`: "hierarchical", the default, or "PAM", “partition around medoids”). For hierarchical clustering only, the console will ask you to choose the mode used to determine the number of clusters ("silhouette", the default, or "standard"). If you use the package's own data, we recommend "silhouette". If the mode is "standard", you can set the number of bins used to generate the histogram (`num_bins_when_clustering`, 10 by default). If the clustering type is "PAM", the "silhouette" mode is used. Also, if the clustering type is hierarchical, you can choose the linkage criterion (`linkage_type`: "single", "complete" or "average").

```{r}
# Mapper information
num_intervals <- 10
percent_overlap <- 40
distance_type <- "correlation"
clustering_type <- "hierarchical"
linkage_type <- "single" # only necessary if the type of clustering is hierarchical
# num_bins_when_clustering <- 10 # only necessary if the type of clustering is hierarchical
                                 # and the optimal_clustering_mode is "standard"
                                 # (this is not the case)
```

The package allows the steps required for GSSTDA to be performed separately or together in a single function.

### OPTION #1 (the three blocks of the G-SS-TDA process are in separate functions):

# First step of the process: dsga.

This analysis, developed by Nicolau *et al.*, is independent of the rest of the process, and its output can be used for analyses other than mapper. It computes the "disease component": using linear models, it removes the part of the data that is considered normal or healthy and keeps only the component that is due to the disease.

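The idea of removing the healthy part of a sample with a linear model can be illustrated with a deliberately simplified sketch. It regresses one made-up pathological sample on a few made-up healthy samples by ordinary least squares and keeps the residual. All `toy_*` objects are hypothetical, and this is not the package's implementation: `dsga` and the method of Nicolau *et al.* may include further steps that are not shown here.

```{r}
set.seed(42)
toy_n_genes <- 200

# Hypothetical data: 5 healthy samples (genes in rows, samples in columns).
toy_healthy <- matrix(rnorm(toy_n_genes * 5), nrow = toy_n_genes)

# A hypothetical pathological sample: a mix of healthy profiles plus a signal.
toy_signal <- rnorm(toy_n_genes, sd = 0.5)
toy_tumour <- as.vector(toy_healthy %*% c(0.4, 0.3, 0.1, 0.1, 0.1)) + toy_signal

# Least-squares fit of the pathological sample on the healthy samples.
toy_fit <- lm.fit(x = toy_healthy, y = toy_tumour)

# The fitted part plays the role of the "normal" component,
# the residual plays the role of the "disease" component.
toy_normal_component <- toy_fit$fitted.values
toy_disease_component <- toy_fit$residuals
```

By construction, the two components add up to the original sample, and the residual is orthogonal to every healthy profile used in the fit.
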
```{r}
dsga_object <- dsga(full_data, survival_time, survival_event, case_tag)
```

# Second step of the process: Select the genes within the dsga object created in the previous step and calculate the values of the filtering functions.

After performing a survival analysis of each gene, `gene_selection` selects the genes to be used in the mapper according to both their variability within the database and their relationship with survival. Then, using the selected genes, the values of the filtering functions are calculated for each patient. The filter function summarizes each patient's vector of gene values into a single value. This function takes into account the survival associated with each gene.

```{r}
gene_selection_object <- gene_selection(dsga_object, gen_select_type,
                                        percent_gen_select)
```

There is another way to execute the second step of the process: create a "data_object" object with the required information. This can be used when you do not want to apply dsga.

```{r}
# Create data object
data_object <- list("full_data" = full_data, "survival_time" = survival_time,
                    "survival_event" = survival_event, "case_tag" = case_tag)
class(data_object) <- "data_object"

# Select genes from data object
gene_selection_object <- gene_selection(data_object, gen_select_type,
                                        percent_gen_select)
```

# Third step of the process: Create the mapper object with the disease component matrix restricted to the selected genes and the filter function obtained in the gene selection step.

Mapper condenses the information of high-dimensional datasets into a combinatorial graph that is referred to as the skeleton of the dataset. To do so, it divides the dataset into levels according to the values of the filtering function. These levels overlap each other. Within each level, an independent clustering is performed using the input matrix and the indicated distance type. Subsequently, clusters from different levels that share patients are joined by an edge.

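The last step of the description above, connecting clusters that share patients, can be sketched in a few lines. The clusters below are invented for illustration, and the `toy_*` names are hypothetical; the sketch only shows the linking rule, not how the package stores its graph.

```{r}
# Hypothetical clusters found in three overlapping levels (patient indices).
toy_nodes <- list(
  L1_C1 = c(1, 2, 3),
  L2_C1 = c(3, 4),
  L2_C2 = c(5, 6),
  L3_C1 = c(6, 7, 8)
)

# Adjacency matrix: two nodes are connected when they share at least one patient.
toy_n <- length(toy_nodes)
toy_adj <- matrix(0L, toy_n, toy_n,
                  dimnames = list(names(toy_nodes), names(toy_nodes)))
for (i in seq_len(toy_n - 1)) {
  for (j in (i + 1):toy_n) {
    if (length(intersect(toy_nodes[[i]], toy_nodes[[j]])) > 0) {
      toy_adj[i, j] <- 1L
      toy_adj[j, i] <- 1L
    }
  }
}
toy_adj
```

Here patient 3 links `L1_C1` with `L2_C1`, and patient 6 links `L2_C2` with `L3_C1`. In the package, the division into levels, the clustering and the linking of clusters are all carried out by the `mapper` function.
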
The `mapper` function is independent of the rest of the process and can be used without having run dsga and gene selection.

```{r}
mapper_object <- mapper(data = gene_selection_object[["genes_disease_component"]],
                        filter_values = gene_selection_object[["filter_values"]],
                        num_intervals = num_intervals,
                        percent_overlap = percent_overlap,
                        distance_type = distance_type,
                        clustering_type = clustering_type,
                        linkage_type = linkage_type)
```

Obtain information from the dsga block created in the first step. The `results_dsga` function returns the 100 genes with the highest variability within the dataset and builds a heat map with them.

```{r}
dsga_information <- results_dsga(dsga_object[["matrix_disease_component"]], case_tag)
print(dsga_information)
```

Obtain information from the mapper object created in the G-SS-TDA process.

```{r}
print(mapper_object)
```

Plot the mapper graph.

```{r}
plot_mapper(mapper_object)
```

Hovering the mouse over a node in the interactive graph displays the number of samples that form the node.

### OPTION #2 (the whole process integrated in a single function):

The `gsstda` function creates the GSSTDA object from the full data set, which is pre-processed internally using the dsga technique, together with the mapper information.

```{r}
gsstda_obj <- gsstda(full_data = full_data, survival_time = survival_time,
                     survival_event = survival_event, case_tag = case_tag,
                     gen_select_type = gen_select_type,
                     percent_gen_select = percent_gen_select,
                     num_intervals = num_intervals,
                     percent_overlap = percent_overlap,
                     distance_type = distance_type,
                     clustering_type = clustering_type,
                     linkage_type = linkage_type)
```

Obtain information from the dsga block computed inside the GSSTDA object. The `results_dsga` function returns the 100 genes with the highest variability within the dataset and builds a heat map with them.

```{r}
dsga_information <- results_dsga(gsstda_obj[["matrix_disease_component"]], case_tag)
print(dsga_information)
```

Obtain information from the mapper object created in the G-SS-TDA process.


```{r}
print(gsstda_obj[["mapper_obj"]])
```

Plot the mapper graph.

```{r}
plot_mapper(gsstda_obj[["mapper_obj"]])
```