--- title: "TDApplied Theory and Practice" author: "Shael Brown and Dr. Reza Farivar" output: rmarkdown::html_document: fig_caption: yes vignette: > %\VignetteIndexEntry{TDApplied Theory and Practice} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} bibliography: REFERENCES.bib --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) # get original graphic parameters to be able to # revert back at the end of the vignette original_mfrow <- par()$mfrow original_xpd <- par()$xpd original_mar <- par()$mar original_scipen <- options()$scipen if(exists(".Random.seed", .GlobalEnv) == F) { runif(1) } oldseed <- get(".Random.seed", .GlobalEnv) oldRNGkind <- RNGkind() # set some new parameters for viewing and reproducibility options(scipen = 999) set.seed(123) ``` # Introduction Topological data analysis is a relatively new area of data science which can compare and contrast data sets via non-linear global structure. The main tool of topological data analysis, *persistent homology* [@PHoriginal;@ComputingPH], builds on techniques from the field of algebraic topology to describe shape features present in a data set (stored in a "persistence diagram"). Persistent homology has been used in a number of applications, including - predicting stock market crashes from the topology of stock price correlations over time [@PHeconomics], - finding non-trivial and complex topological structure in local patches of naturalistic images [@PHpatches], - translating sentences in one language into another language (from a set of candidate sentences) using the persistence diagrams of their word embeddings [@word_embeddings], and - improving model predictions of various chemical properties of molecules by including topological features [@TDA_chemistry], - distinguishing between the topology of human brain function of healthy control subjects vs. subjects with a neurological disorder [@TDA_ADHD;@TDA_Alzheimers;@persistent_hom_brain_networks], etc. For a broad introduction to the mathematical background and main tools of topological data analysis with applications, see [@TDA_intro;@TDA]. Traditional data science pipelines in academia and industry are focused on machine learning and statistical inference of structured (tabular) data, being able to answer questions like: - How well can a label variable be predicted from feature variables? - What are the latent subgroups/dimensions of a dataset? - Are the subgroups of a dataset, defined by factor features, distinguishable? While persistence diagrams have been found to be a useful summary of datasets in many domains, they are not structured data and therefore require special analysis methods. Some papers (for example [@Robinson_Turner;@persistence_fisher]) have described post-processing pipelines for analyzing persistence diagrams built on distance [@distance_calc] and kernel [@persistence_fisher] calculations, however these papers are lacking publicly available implementations in R (and Python), and many more data science methods are possible using such calculations [@murphyML;@kpca;@kernel_test;@Cox2008;@kkmeans]. **TDApplied** is the first R package which provides applied analysis implementations of published methods for analyzing persistence diagrams using machine learning and statistical inference. Its functions contain highly optimized and scalable code (see the package vignette "Benchmarking and Speedups") and have been tested and validated (see the package vignette "Comparing Distance Calculations"). 
**TDApplied** can interface with other data science packages to perform powerful and flexible analyses (see the package vignette "Personalized Analyses with **TDApplied**"), and an example usage of **TDApplied** on real data has been demonstrated (see the package vignette "Human Connectome Project Analysis"). This vignette documents the background of **TDApplied** functions and the usage of those functions on simulated data, by considering a typical data analysis workflow for topological data analysis: 1. Computing and comparing persistence diagrams. 2. Visualizing and interpreting persistence diagrams. 3. Analyzing statistical properties of groups of persistence diagrams. 4. Finding latent structure in groups of persistence diagrams. 5. Predicting labels from persistence diagram structure. To start we must load the **TDApplied** package: ```{r setup} library("TDApplied") ``` Let's get started! # Computing and Comparing Persistence Diagrams ## Computing Diagrams and **TDApplied**'s `PyH` Function The main tool of topological data analysis is called *persistent homology* [@PHoriginal;@ComputingPH]. Persistent homology starts with either data points and a distance function, or a distance matrix storing distance values between data points. It assumes that these points arise from a dataset with some kind of "shape". This "shape" has certain features that exist at various scales, but sampling induces noise in these features. Persistent homology aims to describe certain mathematical features of this underlying shape, by forming approximations to the shape at various distance scales. The mathematical features which are tracked include clusters (connected components), loops (ellipses) and voids (spheres), which are examples of *cycles* (i.e. different types of holes). The *homological dimension* of these features are 0, 1 and 2, respectively. What is interesting about these particular mathematical features is that they can tell us where our data is not, which is extremely important information which other data analysis methods cannot provide. ```{r,echo = F,fig.height = 3,fig.width = 7,fig.align = 'center',eval = requireNamespace("TDAstats")} par(mfrow = c(1,4)) circ <- TDAstats::circle2d[sample(1:100,50),] plot(x = circ[,1],y = circ[,2],main = "Approximation 1:\nindividual data points",xlab = "",ylab = "",las = 1) plot(x = circ[,1],y = circ[,2],main = "Approximation 2:\nnot a loop",xlab = "",ylab = "",las = 1) for(i in 1:(nrow(circ)-1)) { for(j in (i+1):nrow(circ)) { if(sqrt((circ[i,1]-circ[j,1])^2+(circ[i,2] - circ[j,2])^2) <= 0.2) { lines(c(circ[i,1],circ[j,1]),c(circ[i,2],circ[j,2])) } } } plot(x = circ[,1],y = circ[,2],main = "Approximation 3:\nloop",xlab = "",ylab = "",las = 1) for(i in 1:(nrow(circ)-1)) { for(j in (i+1):nrow(circ)) { if(sqrt((circ[i,1]-circ[j,1])^2+(circ[i,2] - circ[j,2])^2) <= 1) { lines(c(circ[i,1],circ[j,1]),c(circ[i,2],circ[j,2])) } } } plot(x = circ[,1],y = circ[,2],main = "Approximation 4:\nnot a loop",xlab = "",ylab = "",las = 1) for(i in 1:(nrow(circ)-1)) { for(j in (i+1):nrow(circ)) { if(sqrt((circ[i,1]-circ[j,1])^2+(circ[i,2] - circ[j,2])^2) <= 2) { lines(c(circ[i,1],circ[j,1]),c(circ[i,2],circ[j,2])) } } } par(mfrow = c(1,1)) ``` The persistent homology algorithm proceeds in the following manner: first, if the input is a dataset and distance metric, then the distance matrix, storing the distance metric value of each pair of points in the dataset, is computed. 
Next, a parameter $\epsilon \geq 0$ is grown starting at 0, and at each $\epsilon$ value we compute a shape approximation of the dataset $C_{\epsilon}$, called a *simplicial complex* or in this case a *Rips complex*. We construct $C_{\epsilon}$ by connecting all pairs of points whose distance is at most $\epsilon$. To encode higher-dimensional structure in these approximations, we also add a triangle between any triple of points which are all connected (note that no triangles are formally shaded on the above diagram, even though there are certainly triples of connected points), a tetrahedron between any quadruple of points which are all connected, etc. Note that this process of forming a sequence of skeletal approximations is called a *Vietoris-Rips filtration*, and other methods exist for forming the approximations.

At any given $\epsilon$ value, some topological features will exist in $C_{\epsilon}$. As $\epsilon$ grows, the $C_{\epsilon}$'s will contain each other, i.e. if $\epsilon_{1} < \epsilon_{2}$, then every edge (triangle, tetrahedron etc.) in $C_{\epsilon_1}$ will also be present in $C_{\epsilon_2}$. Each topological feature of interest will be "born" at some $\epsilon_{birth}$ value, and "die" at some $\epsilon_{death}$ value -- certainly each feature will die once the whole dataset is connected and has trivial shape structure. Consider the example of a loop -- a loop will be "born" when the last edge around the circumference of the loop is added (at the $\epsilon$ value which is the largest distance between consecutive points around the loop), and the loop will "die" when enough connections across the loop fill in its hole. Since the topological features are tracked across multiple scales, we can estimate their (static) location in the data, i.e. find the points lying on these structures, by calculating what are called *representative cycles*.

The output of persistent homology, a *persistence diagram*, has one 2D point for each topological feature found in the filtration process in each desired homological dimension, where the $x$-value of the point is the birth $\epsilon$-value and the $y$-value is the death $\epsilon$-value. Hence every point in a persistence diagram lies above the diagonal -- features die after they are born! The difference between a point's $y$ and $x$ values, $y-x$, is called the "persistence" of the corresponding topological feature. Points which have high (large) persistence likely represent real topological features of the dataset, whereas points with low persistence likely represent topological noise.

For a more practical and scalable computation of persistence diagrams, a method called *persistent cohomology* [@PChom;@PHom_dualities] has been introduced, which calculates exactly the same output as persistent homology (with "representative cocycles" analogous to persistent homology's representative cycles), just much faster. Persistent cohomology is implemented in the C++ library **ripser** [@Ripser], which is wrapped in R in the **TDAstats** package [@R-TDAstats] and in Python in the **ripser** module. However, it was observed in simulations that the Python implementation of **ripser** seemed faster, even when called in R via the reticulate package [@R-reticulate] (see the package vignette "Benchmarking and Speedups"). Therefore, the **TDApplied** `PyH` function has been implemented as a wrapper of the Python **ripser** module for fast calculations of persistence diagrams.
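Since a persistence diagram is just a table of birth and death values, the persistence of each feature is simple to compute once a diagram has been calculated (for instance with the `PyH` function described next). Here is a minimal sketch (not evaluated), assuming a diagram stored in a data frame `diag` with `dimension`, `birth` and `death` columns:

```{r,eval = FALSE}
# the persistence of a feature is its death radius minus its birth radius;
# sorting by persistence lists the most prominent (likely real) features first
persistence <- diag$death - diag$birth
diag[order(persistence, decreasing = TRUE), ]
```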
There are three prerequisites that must be satisfied in order to use the `PyH` function: 1. The reticulate package must be installed. 2. Python must be installed and configured to work with reticulate. 3. The **ripser** Python module must be installed, via `reticulate::py_install("ripser")`. Some windows machines have had issues with recent versions of the **ripser** module, but version 0.6.1 has been tested and does work on Windows. So, Windows users may try `reticulate::py_install("ripser==0.6.1")`. For a sample use of `PyH` we will use the following pre-loaded dataset called "circ" (which is stored as a data frame in this vignette): ```{r,echo = F,eval = T} # create the data circ <- data.frame(x = c(-0.815663170207959,-0.595099344842628,0.864889292249816,0.20793951765139,-0.954866650195679,-0.18706988463915,-0.798725432243614,0.830591026518797,-0.661799977235305,0.999878714877245,0.245504222689744,-0.431326750789015,0.666897829439335,0.30720650008886,0.661328840033743,-0.76039885584628,0.239379052558954,-0.351042298907497,-0.829639564028719,0.71216384463492,-0.940714425455924,-0.949907158729059,0.101402707416082,0.986585898475086,0.33772426530395,0.734256674456601,0.771307272532461,0.980540492964891,-0.750935667779769,0.63057972598311,0.75993549565125,-0.997652044265574,-0.199144715289604,-0.996513284779077,0.997484507122883,0.918374523107964,0.54816769740782,0.22028039764542,-0.965072372227865,-0.86565030108487,0.999975356226944,-0.0356376975101777,-0.998775901784022,0.879402540916111,0.389455305917558,-0.73827286958174,0.322030655401438,0.321397994754972,0.943823546436826,-0.565572997267532),y = c(0.578527089051413,-0.803652144754107,-0.501962660116878,0.978141685544025,-0.297034813353725,-0.982346608006102,0.601695673814639,0.55688288415649,-0.749680458683131,0.0155741945354593,0.969395521261319,-0.902195784768357,-0.745149169689602,-0.951642877503506,0.750096104069088,0.649456372690012,-0.970926191425475,-0.936359602064153,-0.558299376498163,0.702013289329204,-0.339199601619947,0.312532222011246,-0.994845460827303,0.163242962880818,0.941245090624598,-0.678871958484023,0.63646295362616,-0.196316941847027,0.660375365119076,-0.776124480466288,-0.649998494190016,0.0684864845989492,-0.979970092590699,0.0834342451204145,-0.0708848224221418,0.395712313816768,0.836368444836729,-0.975436592717935,-0.261983427648545,0.500649134855612,-0.00702046571072753,0.999364775503006,-0.0494641083566077,0.476078954618127,-0.921045365165398,0.674502164592185,-0.946729241642889,0.946944205836586,-0.330449864868202,0.824698238607201)) ``` ```{r,echo = F,warning=F,fig.height = 4,fig.width = 4,fig.align = 'center'} plot(x = circ$x,y = circ$y,xlab = "x",ylab = "y",main = "circ") ``` We would then calculate the persistence diagram as follows: ```{r,echo = T,eval = F} # import the ripser module ripser <- import_ripser() # calculate the persistence diagram diag <- PyH(X = circ,maxdim = 1,thresh = 2,ripser = ripser) # view last five rows of the diagram diag[47:51,] ``` ```{r,echo = F,eval = T} diag <- data.frame(dimension = c(rep(0,50),1),birth = c(rep(0,50),0.5579783),death = 
c(0.0123064760118723,0.0144490944221616,0.0149910748004913,0.0156172784045339,0.0172923970967531,0.0189713705331087,0.0196240246295929,0.0225948672741652,0.0286996569484472,0.0365069359540939,0.038569450378418,0.0386403761804104,0.0444764532148838,0.0477333702147007,0.0612373314797878,0.0639129132032394,0.0699725300073624,0.0705153122544289,0.0721508488059044,0.0791449695825577,0.0858016163110733,0.0872511267662048,0.0882881134748459,0.0893174782395363,0.0925402045249939,0.0944025367498398,0.0944980531930923,0.099234826862812,0.117955945432186,0.120451688766479,0.126571387052536,0.139067515730858,0.142296731472015,0.148265853524208,0.158034011721611,0.181465998291969,0.188804805278778,0.19113427400589,0.20612421631813,0.21517525613308,0.228875741362572,0.233790531754494,0.235128790140152,0.242270082235336,0.244500055909157,0.245646774768829,0.254552245140076,0.281323730945587,0.288743227720261,2,1.73859250545502))
diag[47:51,]
```

In the package vignette "Benchmarking and Speedups", simulations are used to demonstrate the practical advantages of using `PyH` to calculate persistence diagrams compared to the alternatives. Note that the installation status of Python for `PyH` is checked using the function `reticulate::py_available()`, which according to several online forums does not always behave as expected. If error messages about Python not being installed occur when using **TDApplied** functions, then we recommend consulting online resources to ensure that the `py_available` function returns `TRUE` on your system. Due to the complicated dependencies required to use the `PyH` function, it is only an optional function in the **TDApplied** package, and therefore the reticulate package is only a suggested (not required) dependency of **TDApplied**.

## Determining the Radius Threshold with **TDApplied**'s `enclosing_radius` Function

One of the most important parameters in a persistent (co)homology calculation is the maximum connectivity radius of the filtration (which is the `thresh` parameter in the `PyH` function). A small radius threshold may prematurely stop the filtration before we can identify real large-scale features of our data, while an overly large threshold could quickly cause the calculations to become intractable. While there is no method that can determine an optimal balance between these two extremes, a radius threshold exists, called the "enclosing radius", beyond which the topology of the dataset is trivial (i.e. there are no topological features except for one connected component). This radius is obtained by calculating, for each data point, the maximum distance from that point to all other points, and taking the minimum of these values. The enclosing radius of a dataset can be computed using **TDApplied**'s `enclosing_radius` function.

Consider the topology of a dataset sampled from a unit circle together with its center point (the origin):

```{r, echo=FALSE,fig.height = 5,fig.width = 5,fig.align = 'center'}
theta <- stats::runif(n = 100,min = 0,max = 2*pi)
x <- cos(theta)
y <- sin(theta)
origin <- data.frame(x = 0,y = 0)
new_data <- rbind(data.frame(x = x,y = y), origin)
layout <- as.matrix(new_data)
rownames(layout) <- as.character(1:nrow(new_data))
vrs <- vr_graphs(new_data, eps = c(0.0001,1))
plot_vr_graph(vrs, eps = 0.0001, layout = layout, plot_isolated_vertices = TRUE,vertex_labels = FALSE)
```

Without knowing anything about the structure of our new dataset, we would not know which distance threshold to use for persistent homology calculations.
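Before calling the package function, note that the definition above can be checked directly from a distance matrix; here is a minimal sketch (not evaluated, using the circle-plus-origin dataset `new_data` from the plot above) of the same minimax calculation that `enclosing_radius` performs:

```{r,eval = FALSE}
# for each point, find its largest distance to any other point,
# then take the smallest of these maxima
dist_mat <- as.matrix(dist(new_data))
min(apply(dist_mat, MARGIN = 1, FUN = max)) # should agree with enclosing_radius(new_data)
```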
We would compute the enclosing radius as follows:

```{r}
enc_rad <- enclosing_radius(new_data, distance_mat = FALSE)
print(enc_rad)
```

This radius is the distance from the center point (the origin) to the points on the circumference -- at this scale the whole graph becomes connected and the loop is gone:

```{r,echo = FALSE,fig.height = 5,fig.width = 5,fig.align = 'center'}
plot_vr_graph(vrs, eps = enc_rad, layout = layout,vertex_labels = F)
```

We will see in a later section how to construct these types of graphs using the `vr_graphs` and `plot_vr_graph` functions.

## Converting Diagrams to DataFrames with **TDApplied**'s `diagram_to_df` Function

The most typical data structure used in R for data science is a data frame. The output of the `PyH` function is a data frame (unless representatives are calculated, in which case the output is a list containing a data frame and other information), but the persistence diagrams calculated by the popular R packages **TDA** [@R-TDA] and **TDAstats** [@R-TDAstats] are not stored in data frames, making subsequent machine learning and inference analyses of these diagrams challenging. In order to solve this problem, the **TDApplied** function `diagram_to_df` can convert **TDA**/**TDAstats** persistence diagrams into data frames:

```{r,echo = T,eval = F}
# convert TDA diagram into data frame
diag1 <- TDA::ripsDiag(circ,maxdimension = 1,maxscale = 2,library = "dionysus")
diag1_df <- diagram_to_df(diag1)
class(diag1_df)
```

```{r,echo = F}
c("data.frame")
```

```{r,echo = T,eval = F}
# convert TDAstats diagram into data frame
diag2 <- TDAstats::calculate_homology(circ,dim = 1,threshold = 2)
diag2_df <- diagram_to_df(diag2)
class(diag2_df)
```

```{r,echo = F}
c("data.frame")
```

When a persistence diagram is calculated with either `PyH`, `ripsDiag` or `alphaComplexDiag` and contains representatives, `diagram_to_df` only returns the persistence diagram data frame (i.e. the representatives are ignored).

## Comparing Persistence Diagrams and **TDApplied**'s `diagram_distance` and `diagram_kernel` Functions

Earlier we mentioned that persistence diagrams do not form structured data, and now we will give an intuitive argument for why this is the case. A persistence diagram $\{(x_1,y_1),\dots,(x_n,y_n)\}$ containing $n$ topological features can be represented in a vector of length $2n$, $(x_1,y_1,x_2,y_2,\dots,x_n,y_n)$. However, we cannot easily combine vectors calculated in this way into a table with a fixed number of feature columns because (1) persistence diagrams can contain different numbers of features, meaning their vectors would be of different lengths, and (2) the ordering of the features is arbitrary, calling into question what the right way to compare the vectors would be. Fortunately, we can still apply many machine learning and inference techniques to persistence diagrams provided we can quantify how near (similar) or far (distant) they are from each other, and these calculations are possible with distance and kernel functions.

There are several ways to compute distances between persistence diagrams in the same homological dimension. The two most common are called the *2-wasserstein* and *bottleneck* distances [@distance_calc;@comp_geom]. These techniques find an optimal matching of the 2D points in their two input diagrams, and compute a cost of that optimal matching. A point from one diagram is allowed either to be paired (matched) with a point in the other diagram or its diagonal projection, i.e.
the nearest point on the diagonal line $y=x$ (matching a point to its diagonal projection is essentially saying that the feature is likely topological noise because it died very soon after it was born). Allowing points to be paired with their diagonal projections both allows for matchings of persistence diagrams with different numbers of points (which is almost always the case in practice) and also formalizes the idea that some points in a persistence diagram represent noise. The "cost" value associated with a matching is given by either (i) the maximum of the infinity-norm distances between paired points, or (ii) the square root of the sum of the squared infinity-norm distances between matched points. The cost of the optimal matching under cost (i) is the bottleneck distance of persistence diagrams, and the cost of the optimal matching under cost (ii) is called the 2-wasserstein metric of persistence diagrams. Both distance metrics have been used in a number of applications, but the 2-wasserstein metric is able to find more fine-scale differences in persistence diagrams compared to the bottleneck distance. The problem of finding an optimal matching can be solved with the Hungarian algorithm, which is implemented in the R package clue [@R-clue].

We will introduce three new simple persistence diagrams, D1, D2 and D3, for examples in this section (and future ones):

```{r,echo = F,fig.height = 3,fig.width = 7,fig.align = 'center'}
D1 = data.frame(dimension = c(0),birth = c(2),death = c(3))
D2 = data.frame(dimension = c(0),birth = c(2,0),death = c(3.3,0.5))
D3 = data.frame(dimension = c(0),birth = c(0),death = c(0.5))
par(mfrow = c(1,3))
plot_diagram(D1,title = "D1",max_radius = 4,legend = F)
plot_diagram(D2,title = "D2",max_radius = 4,legend = F)
plot_diagram(D3,title = "D3",max_radius = 4,legend = F)
```

Here is a plot of the optimal matchings between D1 and D2, and between D1 and D3:

```{r,echo = F,fig.height = 3,fig.width = 5,fig.align = 'center'}
par(mfrow = c(1,2))
plot(x = c(D1$birth,D2$birth),y = c(D1$death,D2$death),main = "Best matching D1,D2",xlab = "",ylab = "",xlim = c(0,4),ylim = c(0,4))
abline(a = 0,b = 1)
lines(c(2,2),c(3,3.3))
lines(c(0,0.25),c(0.5,0.25))
plot(x = c(D1$birth,D3$birth),y = c(D1$death,D3$death),main = "Best matching D1,D3",xlab = "",ylab = "",xlim = c(0,4),ylim = c(0,4))
abline(a = 0,b = 1)
lines(c(2,2.5),c(3,2.5))
lines(c(0,0.25),c(0.5,0.25))
par(mfrow = c(1,1))
```

In the picture we can see that there is a "better" matching between D1 and D2 compared to D1 and D3, so the (wasserstein/bottleneck) distance value between D1 and D2 would be smaller than that of D1 and D3.

Another distance metric between persistence diagrams, which will be useful for kernel calculations, is called the *Fisher information metric*, $d_{FIM}(D_1,D_2,\sigma)$ [@persistence_fisher]. The idea is to represent the two persistence diagrams as probability density functions, with a 2D Gaussian centered at each point in both diagrams (including the diagonal projections of the points in the opposite diagram), all of variance $\sigma^2 > 0$, and to calculate how much those distributions disagree on their pdf values at each point in the plane (this overall disagreement is their *Fisher information metric*).
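In the notation of [@persistence_fisher], if $\rho_1$ and $\rho_2$ denote the two density functions constructed in this way (each normalized to integrate to 1), this disagreement can be written as

$$d_{FIM}(D_1,D_2,\sigma) = \mbox{arccos}\left(\int \sqrt{\rho_1(x)\rho_2(x)}dx\right),$$

which is 0 when the two densities agree everywhere and grows as they place their mass in different regions of the plane.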
```{r,echo = F,fig.height = 3,fig.width = 7,fig.align = 'center'} par(mfrow = c(1,3)) x = seq(-4,4,0.01) y = seq(-4,4,0.01) D1_with_diag = rbind(D1,data.frame(dimension = c(0),birth = c(0.25),death = c(0.25))) z1 = outer(x,y,FUN = function(x,y){ # sigma = 1 return((exp(-((x-D1_with_diag[1,2])^2+(y-D1_with_diag[1,3])^2)/(2*1^2)))/sqrt(2*pi*1^2) + (exp(-((x-D1_with_diag[2,2])^2+(y-D1_with_diag[2,3])^2)/(2*1^2)))/sqrt(2*pi*1^2)) }) z1 = z1/sum(z1) image(x = x,y = y,z1,main = "Distribution for D1",xlab = "",xlim = c(-4,4),ylim = c(-4,4),ylab = "") abline(a = 0,b = 1) D3_with_diag = rbind(D3,data.frame(dimension = c(0),birth = c(2.5),death = c(2.5))) z3 = outer(x,y,FUN = function(x,y){ # sigma = 1 return((exp(-((x-D3_with_diag[1,2])^2+(y-D3_with_diag[1,3])^2)/(2*1^2)))/sqrt(2*pi*1^2) + (exp(-((x-D3_with_diag[2,2])^2+(y-D3_with_diag[2,3])^2)/(2*1^2)))/sqrt(2*pi*1^2)) }) z3 = z3/sum(z3) image(x = x,y = y,z3,main = "Distribution for D3",xlab = "",xlim = c(-4,4),ylim = c(-4,4),ylab = "") abline(a = 0,b = 1) image(x = x,y = y,z1-z3,main = "Difference of distributions",xlab = "",xlim = c(-4,4),ylim = c(-4,4),ylab = "") abline(a = 0,b = 1) par(mfrow = c(1,1)) ``` Points in the rightmost plot which are close to white in color have the most similar pdf values in the two distributions, and would not contribute to a large distance value; however, having more points with a red color would contribute to a larger distance value. The wasserstein and bottleneck distances have been implemented in the **TDApplied** function `diagram_distance`. We can confirm that the distance between D1 and D2 is smaller than D1 and D3 for both distances: ```{r,echo = T} # calculate 2-wasserstein distance between D1 and D2 diagram_distance(D1,D2,dim = 0,p = 2,distance = "wasserstein") # calculate 2-wasserstein distance between D1 and D3 diagram_distance(D1,D3,dim = 0,p = 2,distance = "wasserstein") # calculate bottleneck distance between D1 and D2 diagram_distance(D1,D2,dim = 0,p = Inf,distance = "wasserstein") # calculate bottleneck distance between D1 and D3 diagram_distance(D1,D3,dim = 0,p = Inf,distance = "wasserstein") ``` There is a generalization of the 2-wasserstein distance for any $p \geq 1$, the *p-wasserstein* distance, which can also be computed using the `diagram_distance` function by varying the parameter `p`, although $p = 2$ seems to be the most popular parameter choice. The `diagram_distance` function can also calculate the Fisher information metric between persistence diagrams: ```{r,echo = T} # Fisher information metric calculation between D1 and D2 for sigma = 1 diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1) # Fisher information metric calculation between D1 and D3 for sigma = 1 diagram_distance(D1,D3,dim = 0,distance = "fisher",sigma = 1) ``` Again, D1 and D2 are less different than D1 and D3 using the Fisher information metric. A fast approximation to the Fisher information metric was described in [@persistence_fisher], and C++ code in the GitHub repository (https://github.com/vmorariu/figtree) was used to calculate this approximation in Matlab. 
Using the Rcpp package [@Rcpp] this code is included in **TDApplied**, and the approximation can be calculated by providing the positive `rho` parameter (for comparison we first recompute the exact value):

```{r,echo = T,eval = F}
# exact Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1)
```

```{r,echo = F}
0.02354779
```

```{r,echo = T,eval = F}
# fast approximate Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1,rho = 0.001)
```

```{r,echo = F}
0.02354779
```

Now we will explore calculating similarity of persistence diagrams using kernel functions. A kernel function is a special (positive semi-definite) symmetric similarity measure between objects in some complicated space which can be used to project data into a space suitable for machine learning [@murphyML]. Some examples of machine learning techniques which can be "kernelized" when dealing with complicated data are *k-means* (kernel k-means), *principal components analysis* (kernel PCA), and *support vector machines* (SVM), which are inherently based on kernel calculations.

There have been, to date, four main kernels proposed for persistence diagrams. In **TDApplied** the persistence Fisher kernel [@persistence_fisher] has been implemented because of its practical advantages over the other kernels -- smaller cross-validation SVM error on a number of test data sets and a faster method for cross validation. For information on the other three kernels see [@kernel_TDA;@sliced_wasserstein;@kernel_TDA_original]. The persistence Fisher kernel is computed directly from the Fisher information metric between two persistence diagrams: let $\sigma > 0$ be the parameter for $d_{FIM}$, and let $t > 0$. Then the persistence Fisher kernel is defined as $k_{PF}(D_1,D_2) = \mbox{exp}(-t \cdot d_{FIM}(D_1,D_2,\sigma))$. Computing the persistence Fisher kernel can be achieved with the `diagram_kernel` function in **TDApplied**:

```{r,echo = T}
# calculate the kernel value between D1 and D2 with sigma = 2, t = 2
diagram_kernel(D1,D2,dim = 0,sigma = 2,t = 2)
# calculate the kernel value between D1 and D3 with sigma = 2, t = 2
diagram_kernel(D1,D3,dim = 0,sigma = 2,t = 2)
```

As before, D1 and D2 are more similar than D1 and D3, and if desired a fast approximation to the kernel value can be computed.

# Visualizing and Interpreting Persistence Diagrams

## **TDApplied**'s Function `plot_diagram`

Persistence diagrams can contain any number of points representing different types of topological features from different homological dimensions. However, we can easily view this information in a two-dimensional plot of the birth and death values of the topological features, where each homological dimension has a unique point shape and color, using **TDApplied**'s `plot_diagram` function:

```{r,echo = T,fig.height = 5,fig.width = 5,fig.align = 'center'}
plot_diagram(diag,title = "Circle diagram")
```

Now that we can visualize persistence diagrams we will describe four tools which can be used for their interpretation -- filtering out noisy topological features with the universal null distribution and bootstrapping methods, representative (co)cycles, and *Vietoris-Rips graphs* (called VR graphs for short).

## Filtering Topological Features and **TDApplied**'s `universal_null` Function

Noise in datasets can drastically affect the results of inference and machine learning procedures. Therefore, it is desirable to clean data before applying such procedures.
Noise in a persistence diagram consists of features whose birth and death values are very similar. The question of determining which points in a persistence diagram are "significant" and which are "noise" has recently been addressed via a parametric approach in [@universal_null]. The idea is to compute a test statistic for each feature based on the ratio of its death value to its birth value, to compare these statistics against the left-skewed Gumbel distribution to compute p-values, and to use a Bonferroni correction to ensure that the probability of getting a false positive significant feature is at most $\alpha$. This is a very strict thresholding procedure, testing each p-value $p$ for a topological feature in dimension $i$ against $\alpha/n_i$, where $n_i$ is the number of topological features of that dimension (so typical choices for $\alpha$, like 0.05 or 0.01 etc., may be too strict for some analyses).

We would be tempted to use this procedure to locate the main loop in our circ dataset, but the results are surprising:

```{r}
# determine significant features
circ_result <- universal_null(circ, thresh = enclosing_radius(circ))
circ_result$subsetted_diag
```

No loop was found! The reason is that circ is a "topologically perfect" dataset, with no noise and therefore only one loop, and this single loop can never be significant without smaller loops to compare it to. The same phenomenon will occur in other datasets like a sampled unit sphere, etc., but does this mean that the procedure does not work? Certainly not! All real world datasets will contain some measure of noise and therefore will likely have more than one feature in a given homological dimension. Let's add some noise to our circ dataset and rerun the procedure:

```{r,echo = T,fig.height = 5,fig.width = 5,fig.align = 'center',eval = requireNamespace("TDA")}
# create circle with noise dataset and plot
circ_with_noise <- circ
x_noise <- stats::rnorm(n = 50,sd = 0.1)
y_noise <- stats::rnorm(n = 50,sd = 0.1)
circ_with_noise$x <- circ_with_noise$x + x_noise
circ_with_noise$y <- circ_with_noise$y + y_noise
plot(circ_with_noise)

# rerun the inference procedure
library(TDA)
noisy_circ_result <- universal_null(circ_with_noise, FUN_diag = "ripsDiag", thresh = enclosing_radius(circ_with_noise), return_pvals = TRUE)
noisy_circ_result$subsetted_diag
noisy_circ_result$pvals
```

We have successfully found the significant loop in the noisy dataset (with p-value less than `alpha`). Remember though that this procedure is very strict -- it is much more likely to disregard real topological features than to retain noise features, so if the `universal_null` function says that your dataset has no significant features when you are confident there are some, you can try increasing the `alpha` parameter to carry out a more relaxed thresholding.

While the enclosing radius, used in our examples, may be an appropriate connectivity threshold for most applications, it may be too computationally intensive for others. When lower thresholds are used, real topological features may be obscured because the threshold is much smaller than their (unobserved) death radii. Thankfully though, the `universal_null` procedure can handle these "infinite cycles", by: 1. Using the left-skewed Gumbel distribution and `alpha` parameter to determine the minimum connectivity threshold for persistent homology where a given infinite cycle feature *might* be significant, 2. Computing the persistence diagram at that threshold, and 3.
Checking if the feature is still infinite (i.e. if its death value is the new larger threshold). If yes then the feature is reported as being significant, otherwise it is filtered out. Picking a very small threshold, resulting in many infinite cycles, will be very computationally demanding requiring the computation of many persistence diagrams, so care should be taken when picking the threshold when performing this infinite cycle inference. Here's an example of performing inference on the noisy circ dataset, with and without infinite cycle inference at a lower threshold value: ```{r,eval = requireNamespace("TDA")} # inference without infinite cycle inference res_non_inf_small_thresh <- universal_null(circ_with_noise, FUN_diag = "ripsDiag", thresh = 0.9) res_non_inf_small_thresh$subsetted_diag # inference with infinite cycle inference res_inf_small_thresh <- universal_null(circ_with_noise, FUN_diag = "ripsDiag", thresh = 0.9, infinite_cycle_inference = TRUE) res_inf_small_thresh$subsetted_diag ``` At this smaller threshold the procedure with infinite cycle inference was able to locate the main loop. In a later section you will read about "representative (co)cycles", which are subsets of the datapoints on a topological feature that can be used to help locate the feature in our dataset, but for now just know that if desired this procedure will subset the representatives for only the significant topological features, but infinite cycles will not have representatives. ## Bootstrapping Topological Features and **TDApplied**'s `bootstrap_persistence_thresholds` Function Another approach in the literature for filtering significant topological features of datasets was based on bootstrapping ([@bootstrap]), although this method is far slower than the aforementioned universal null distribution approach because it requires the computation of many persistence diagrams and distances. On the other hand, for topologically simple datasets without noise (like a sampled unit circle) the universal null distribution will often fail to retrieve the real topological features. The idea of the bootstrap procedure is as follows, where $X$ is the input data set and $\alpha$ is the desired threshold for type 1 error: 1. We first compute $D$, the diagram of $X$. 2. Then we repeatedly sample, with replacement, the original data to obtain $\{X_1,\dots,X_n\}$ and compute new persistence diagrams $\{D_1,\dots,D_n\}$. 3. We calculate the bottleneck distance of each new diagram with the original, $d_{\infty}(D_i,D)$, in each dimension separately. 4. Finally we compute the $1-\alpha$ percentile of these values in each dimension. These thresholds form a square-shaped "confidence interval" around each point in $D$. In particular, if $t$ was the threshold found for dimension $k$ then the confidence interval around a point $(x,y) \in D$ (of dimension $k$) is the set of points $\{(x',y'):\mbox{max}(|x-x'|,|y-y'|) < t\}$. 
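To make these four steps concrete, here is a minimal (deliberately simplified, not evaluated) sketch for the circ dataset in dimension 1, assuming **TDAstats** is available; the `bootstrap_persistence_thresholds` function introduced below carries out this procedure with additional features such as parallelization and p-value calculation:

```{r,eval = FALSE}
# steps 1-4 by hand for the circ dataset, dimension 1, alpha = 0.05
alpha <- 0.05
# step 1: the diagram of the original data
D <- diagram_to_df(TDAstats::calculate_homology(as.matrix(circ), dim = 1, threshold = 2))
bottleneck_dists <- sapply(1:30, FUN = function(i){
  # step 2: resample the rows of circ with replacement and recompute the diagram
  X_i <- circ[sample(1:nrow(circ), size = nrow(circ), replace = TRUE), ]
  D_i <- diagram_to_df(TDAstats::calculate_homology(as.matrix(X_i), dim = 1, threshold = 2))
  # step 3: bottleneck distance to the original diagram in dimension 1
  diagram_distance(D, D_i, dim = 1, p = Inf, distance = "wasserstein")
})
# step 4: the 1 - alpha percentile of these distances is the dimension-1 threshold t
t_1 <- as.numeric(stats::quantile(bottleneck_dists, probs = 1 - alpha))
```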
For example, if we calculated the bottleneck-threshold-based confidence interval around the single 1-dimensional point in the circ dataset's persistence diagram, we would get something like this:

```{r,echo = F,fig.height = 5,fig.width = 5,fig.align = 'center',eval = requireNamespace("TDAstats") & requireNamespace("TDA")}
par(mfrow = c(1,1))
pt <- as.numeric(diag[which(diag[,1L] == 1),])[2:3]
plot_diagram(D = diag,title = "Circ diagram with confidence interval",legend = T,max_radius = 2*0.3953059 + pt[[2]])
graphics::rect(xleft = pt[[1]]-0.3953059,xright = pt[[1]]+0.3953059,ybottom = pt[[2]]-0.3953059,ytop = pt[[2]]+0.3953059)
graphics::lines(x = c(pt[[1]],pt[[1]] + 0.3953059),y = c(pt[[2]],pt[[2]]))
graphics::lines(x = c(pt[[1]],pt[[1]]),y = c(pt[[2]],pt[[2]] - 0.3953059))
graphics::text(x = c(pt[[1]] + 0.3953059/2,0.4),y = c(1.83,1.55),c("t","t"))
```

In this setup, "significant" points will be those whose confidence intervals do not touch the diagonal line, i.e. the line on which birth and death values are equal. Note that the persistence threshold is twice the bottleneck distance threshold calculated above: the (Euclidean) distance from the point to the bottom right corner of the box is $\sqrt{2}t$, which must be no greater than the (Euclidean) distance of the point to its diagonal projection, $\frac{y-x}{\sqrt{2}}$, in order for the box to avoid the diagonal. Therefore, for the point to be considered real (i.e. significant), $\sqrt{2}t \leq \frac{y-x}{\sqrt{2}}$, implying that the persistence of the point, $y-x$, must be no less than twice the bottleneck threshold $t$. As in [@Robinson_Turner] we can calculate the p-value for a feature as $p = \frac{Z + 1}{N + 1}$, where $Z$ is the number of bootstrap iterations whose bottleneck distance, when doubled, is at least the persistence of the feature, and $N$ is the number of bootstrap samples. In order to ensure that the p-value of any topological feature which survives the thresholding is less than $\alpha$ we transform $\alpha$ by $\alpha \leftarrow \frac{\mbox{max}\{\alpha(N + 1) - 1,0\}}{N + 1}$.

The function `bootstrap_persistence_thresholds` can be used to determine these persistence thresholds. Here is an example for the circ dataset, and the results can be plotted with the `plot_diagram` function (with overlaid p-values using the graphics `text` function):

```{r,fig.height = 4,fig.width = 8,fig.align = 'center',eval = F}
# calculate the bootstrapped persistence thresholds using 2 cores
# and 30 iterations. We'll use the distance matrix of circ to
# make representative cycles more comprehensible
library("TDA")
thresh <- bootstrap_persistence_thresholds(X = as.matrix(dist(circ)),
                                           FUN_diag = 'ripsDiag',
                                           FUN_boot = 'ripsDiag',
                                           distance_mat = T,
                                           maxdim = 1,thresh = 2,num_workers = 2,
                                           alpha = 0.05,num_samples = 30,
                                           return_subsetted = T,return_pvals = T,
                                           calculate_representatives = T)
diag <- thresh$diag

# plot original diagram and thresholded diagram side-by-side, including
# p-values.
These p-values are the smallest possible (1/31) when there # are 30 bootstrap iterations par(mfrow = c(1,2)) plot_diagram(diag,title = "Circ diagram") plot_diagram(diag,title = "Circ diagram with thresholds", thresholds = thresh$thresholds) text(x = c(0.2,0.5),y = c(2,1.8), paste("p = ",round(thresh$pvals,digits = 3)), cex = 0.5) ``` ```{r,fig.height = 4,fig.width = 8,fig.align = 'center',eval = T,echo = F} thresh <- readRDS("thresh.rds") diag <- thresh$diag par(mfrow = c(1,2)) plot_diagram(diag,title = "Circ diagram") plot_diagram(diag,title = "Circ diagram with thresholds", thresholds = thresh$thresholds) text(x = c(0.2,0.5),y = c(2,1.8), paste("p = ",round(thresh$pvals,digits = 3)), cex = 0.5) ``` ```{r,echo = F} par(mfrow = c(1,1)) ``` To see why we needed to load the **TDA** package prior to the function call please see the `bootstrap_persistence_thresholds` function documentation. The bootstrap procedure can be done in parallel or sequentially, depending on which function is specified to calculate the persistence diagrams. There are three possible such functions -- **TDAstats**' `calculate_homology`, **TDA**'s `ripsDiag` and **TDApplied**'s `PyH`. The functions `calculate_homology` and `ripsDiag` can be run in parallel. However, `PyH` is the fastest function followed by `calculate_homology` based on our simulations. Both `ripsDiag` and `PyH` allow for the calculation of representative (co)cycles (i.e. the approximate location in the data of each topological feature), whereas `calculate_homology` does not. Therefore, our recommendation is to pick the function according to the following criteria: if a user can use the `PyH` function, then it should be used in all cases except for when the input data set is small, the machine has many available cores and the number of desired bootstrap iterations is large. Otherwise, use `calculate_homology` for speed or `ripsDiag` for calculating representatives. ## Representative (Co)Cycles One of the advantages of the R package **TDA** over **TDAstats** is its ability to calculate representative cycles in the data, i.e. locating the persistence diagram topological features in the input data set. Having access to representative cycles can permit deep analyses of the original data set by finding particular types of features spanned by certain subsets of data points. For example, a representative cycle of a 1-dimensional topological feature would be a set of edges between data points which lie along that feature (a loop). The `PyH` function can also find *representative cocycles* (i.e. analogues of representative cycles for persistent cohomology) in its input data, which are returned if the `calculate_representatives` parameter is set to `TRUE`. In that case, the `PyH` function returns a list with a data frame called "diagram", containing the persistence diagram, and a list called "representatives". The "representatives" list has one element for each homological dimension in the persistence diagram, with one matrix/array for each point in the persistence diagram of that dimension (except for dimension 0, where the list is always empty). The matrix for a $d$-dimensional feature (1 for loops, 2 for voids, etc.) has $d + 1$ columns, where row $i$ contains the row indices in the data set of the data points in the $i$-th substructure in the representative (a substructure of a loop would be an edge, a substructure of a void would be a triangle, etc.). 
Here is an example where we calculate the representative cocycles of our circ dataset: ```{r,echo = T,eval = F} # ripser has already been imported, so calculate diagram with representatives diag_rep <- PyH(circ,maxdim = 1,thresh = 2,ripser = ripser,calculate_representatives = T) # identify the loops in the diagram diag_rep$diagram[which(diag_rep$diagram$dimension == 1),] ``` ```{r,echo = F} data.frame(dimension = c(1),birth = c(0.5579783),death = c(1.738593),row.names = c("50")) ``` ```{r,echo = T,eval = F} # show the representative for the loop, just the first five rows diag_rep$representatives[[2]][[1]][1:5,] ``` ```{r,echo = F} representative = matrix(data = c(50,42,46,42,50,4,42,29,42,16,50,11,42,7,42,1,50,48,50,25,42,40,46,4,29,4,16,4,46,11,29,11,16,11,7,4,48,46,4,1,11,7,46,25,48,29,50,37,48,16,29,25,11,1,25,16,42,22,48,7,40,4,25,7,48,1,40,11,25,1,50,15,48,40,40,25,50,20,46,37,37,29,37,16,42,34,22,4,42,32,50,27,22,11,37,7,37,1,46,15,29,15,48,22,50,8,43,42,16,15,25,22,46,20,40,37,29,20,15,7,20,16,50,44,15,1,34,4,46,27,32,4,20,7,29,27,34,11,27,16,20,1,32,11,50,36,40,15,42,39,27,7,46,8,48,34,48,32,29,8,43,4,34,25,37,22,27,1,42,5,40,20,16,8,32,25,43,11,42,21,46,44,8,7,44,29,40,27,8,1,44,16,48,43,43,25,22,15,46,36,44,7,50,24,36,29,40,8,36,16,44,1,39,4,22,20,37,34,5,4,37,32,39,11,36,7),nrow = 113,ncol = 2,byrow = T) representative[1:5,] ``` The representative of the one loop contains the edges found to be present in the loop. We could iterate over the representative for the loop to find all the data points in that representative: ```{r,echo = T,eval = F} unique(c(diag_rep$representatives[[2]][[1]][,1],diag_rep$representatives[[2]][[1]][,2])) ``` ```{r,echo = F} unique(c(representative[,1],representative[,2])) ``` Since the circ dataset is two-dimensional, we could actually plot the loop representative according to the following process: 1. Pick a threshold $\epsilon$ between the birth and death radii of the cocycle. 2. Plot all edges between pairs of points in circ of distance no more than $\epsilon$. 3. Highlight all edges in the representative. 
Since the death radius of the main cocycle is 1.738593, we can use the following code to plot the cocycle at the threshold value 1.7:

```{r,echo = T,eval = F}
plot(x = circ$x,y = circ$y,xlab = "x",ylab = "y",main = "circ with representative")
for(i in 1:nrow(circ)) {
  for(j in 1:nrow(circ)) {
    pt1 <- circ[i,]
    pt2 <- circ[j,]
    if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7) {
      graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]))
    }
  }
}
for(i in 1:nrow(diag_rep$representatives[[2]][[1]])) {
  pt1 <- circ[diag_rep$representatives[[2]][[1]][i,1],]
  pt2 <- circ[diag_rep$representatives[[2]][[1]][i,2],]
  if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7) {
    graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]),col = "red")
  }
}
```

```{r,echo = F,warning=F,fig.height = 4,fig.width = 4,fig.align = 'center'}
plot(x = circ$x,y = circ$y,xlab = "x",ylab = "y",main = "circ with representative")
for(i in 1:nrow(circ)) {
  for(j in 1:nrow(circ)) {
    pt1 <- circ[i,]
    pt2 <- circ[j,]
    if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7) {
      graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]))
    }
  }
}
for(i in 1:nrow(representative)) {
  pt1 <- circ[representative[i,1],]
  pt2 <- circ[representative[i,2],]
  if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7) {
    graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]),col = "red")
  }
}
```

This plot shows the main loop that was found via persistent cohomology, and the representative is a set of edges (in this 2D case) whose removal would destroy the loop. A more intuitive notion of a "representative loop" can be found with persistent homology, for instance using the **TDA** `ripsDiag` function with the `location` parameter set to TRUE. Another view of the representative cocycle can be obtained using a different threshold value, for instance 0.6009, just above the loop's birth radius:

```{r,echo = F,warning=F,fig.height = 4,fig.width = 4,fig.align = 'center'}
plot(x = circ$x,y = circ$y,xlab = "x",ylab = "y",main = "circ with representative")
for(i in 1:nrow(circ)) {
  for(j in 1:nrow(circ)) {
    pt1 <- circ[i,]
    pt2 <- circ[j,]
    if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 0.6009) {
      graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]))
    }
  }
}
for(i in 1:nrow(representative)) {
  pt1 <- circ[representative[i,1],]
  pt2 <- circ[representative[i,2],]
  if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 0.6009) {
    graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]),col = "red")
  }
}
```

Since we know that the data points in the representatives help form a loop in the original data set, we could perform a further exploratory analysis in circ to explore the periodic nature of the feature. While in this example we know that a loop will be present, this type of analysis could help find hidden latent structure in data sets.

## VR Graphs and **TDApplied**'s `vr_graphs` and `plot_vr_graph` Functions

Integral to the plots in the previous section was the fact that the circ dataset is 2D, but most datasets are not. In order to investigate topological features of high-dimensional datasets we can use the Rips complexes which are calculated as an intermediate step in persistent homology (as described earlier).
Recall that the Rips complex of a dataset is a skeletal representations of the dataset's structure at a particular scale $\epsilon$ formed by connecting data points of distance at most $\epsilon$ and adding higher-dimensional structures (like triangles between triples of connected points). The connections between data points in a Rips complex form a "VR graph" [@VR_graph], and this graph can be visualized regardless of the dimension of the underlying dataset. We can use the function `vr_graphs` to compute VR graphs at a variety of scales, and can visualize the resulting graphs with the **igraph** package [@R-igraph] and the `plot_vr_graph` function. To demonstrate this functionality, we will pick two $\epsilon$ scales to study the dataset structure of circ, only based on its persistence diagram - the first scale will be half the birth radius of the loop and the second scale will be the average of the loop's birth and death radius: ```{r,echo = T,eval = T} # get half of loop's birth radius eps_1 <- diag[nrow(diag),2L]/2 # get mean of loop's birth and death radii eps_2 <- (diag[nrow(diag),3L] + diag[nrow(diag),2L])/2 # compute two VR graphs gs <- vr_graphs(X = circ,eps = c(eps_1,eps_2)) ``` Next we can plot both VR graphs: ```{r,eval = requireNamespace("igraph"),fig.height = 5,fig.width = 5,fig.align = 'center'} # plot first graph plot_vr_graph(gs,eps_1) # plot second graph layout <- plot_vr_graph(gs,eps_2,return_layout = TRUE) layout <- apply(layout,MARGIN = 2,FUN = function(X){ return(-1 + 2*(X - min(X))/(max(X) - min(X))) }) ``` The first graph shows that at a scale before the loop is born, the dominant loop does not exist in the data (and that the space is more fragmented), while the second graph was able to retrieve the dominant loop. Visualizing multiple VR graphs of a dataset, choosing particular scales of interest from the persistence diagram of the dataset, can be an effective tool for interpreting topological features. The input parameters of `plot_vr_graph` can help customize the graph visualizations. For example, if we want to investigate the loop we can also customize the plot to highlight the loop representative, and to only plot the component of the graph which contains the loop vertices: ```{r,eval = requireNamespace('igraph'),echo = F,fig.width = 5,fig.align = 'center'} # get the stimuli in the loop stimuli_in_loop <- unique(as.numeric(thresh$subsetted_representatives[[2]])) # create colors for the data points, light blue if not in the loop # and red if in the loop colors <- rep("lightblue",50) colors[stimuli_in_loop] <- "red" # plot only component containing the loop stimuli with vertex colors plot_vr_graph(gs,eps_2,cols = colors,component_of = stimuli_in_loop[[1]],layout = layout) ``` We can customize one level further by removing the vertex labels and using the graph layout (i.e. x-y coordinates of all vertices) to plot other things on the graph: ```{r,eval = requireNamespace('igraph'),fig.width = 5,fig.align = 'center'} # plot only component containing the loop stimuli with vertex colors plot_vr_graph(gs,eps_2,cols = colors, component_of = stimuli_in_loop[[1]], vertex_labels = FALSE, layout = layout) # get indices of vertices in loop # not necessary in this case but necessary when we have # removed some vertices from the graph vertex_inds <- match(stimuli_in_loop,as.numeric(rownames(layout))) # add volcano image over loop nodes # image could be anything like rasters read from # png files! this is just an example.. 
utils::data("volcano") volcano <- (volcano - min(volcano)) / diff(range(volcano)) for(i in vertex_inds) { graphics::rasterImage(volcano,xleft = layout[i,1L] - 0.05, xright = layout[i,1L] + 0.05, ybottom = layout[i,2L] - 0.05, ytop = layout[i,2L] + 0.05) } ``` These visualizations can help determine which points are part of a representative and what the dataset structure is at various scales. # Hypothesis Testing Having visualized and interpreted our data of interest (persistence diagrams), a common next step in data analysis pipelines is to ask statistical questions in the form of hypothesis testing [@CasellaBerger]. Three such questions that we will focus on are: 1. Are groups of persistence diagrams different from each other? 2. Are two groups of persistence diagrams independent of each other? 3. Are two datasets good topological models of each other? In order to answer these questions we need to generate groups of persistence diagrams. For the third question we will generate diagrams from simulated circle datasets. For the first two questions we will generate diagrams which are random deviations of D1, D2 and D3 diagrams from earlier: ```{r,echo = F,fig.height = 3,fig.width = 7,fig.align = 'center'} par(mfrow = c(1,3)) plot_diagram(D1,title = "D1",max_radius = 4,legend = F) plot_diagram(D2,title = "D2",max_radius = 4,legend = F) plot_diagram(D3,title = "D3",max_radius = 4,legend = F) ``` ```{r,echo = F} par(mfrow = c(1,1)) ``` The helper function `generate_TDApplied_vignette_data` (seen below) adds Gaussian noise with a small variance to the birth and death values of the points in D1, D2 and D3, making "noisy copies" of the three diagrams: ```{r,echo = F} generate_TDApplied_vignette_data <- function(num_D1,num_D2,num_D3){ # num_D1 is the number of desired copies of D1, and likewise # for num_D2 and num_D3 # create data D1 = data.frame(dimension = c(0),birth = c(2),death = c(3)) D2 = data.frame(dimension = c(0),birth = c(2,0),death = c(3.3,0.5)) D3 = data.frame(dimension = c(0),birth = c(0),death = c(0.5)) # make noisy copies noisy_copies <- lapply(X = 1:(num_D1 + num_D2 + num_D3),FUN = function(X){ # i stores the number of the data frame to make copies of: # i = 1 is for D1, i = 2 is for D2 and i = 3 is for D3 i <- 1 if(X > num_D1 & X <= num_D1 + num_D2) { i <- 2 } if(X > num_D1 + num_D2) { i <- 3 } # store correct data in noisy_copy noisy_copy <- get(paste0("D",i)) # add Gaussian noise to birth and death values n <- nrow(noisy_copy) noisy_copy$dimension <- as.numeric(as.character(noisy_copy$dimension)) noisy_copy$birth <- noisy_copy$birth + stats::rnorm(n = n,mean = 0,sd = 0.05) noisy_copy$death <- noisy_copy$death + stats::rnorm(n = n,mean = 0,sd = 0.05) # make any birth values which are less than 0 equal 0 noisy_copy[which(noisy_copy$birth < 0),2] <- 0 # make any birth values which are greater than their death values equal their death values noisy_copy[which(noisy_copy$birth > noisy_copy$death),2] <- noisy_copy[which(noisy_copy$birth > noisy_copy$death),3] return(noisy_copy) }) # return list containing num_D1 noisy copies of D1, then # num_D2 noisy copies of D2, and finally num_D3 noisy copies # of D3 return(noisy_copies) } ``` Here is an example D1 and two noisy copies: ```{r,echo = F,fig.height = 5,fig.width = 5,fig.align = 'center'} par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE) noisy_copies_D1 = generate_TDApplied_vignette_data(2,0,0) cols = factor(c("D1","D1\'","D2\'\'"),levels = c("D1","D1\'","D1\'\'")) plot(x = c(D1$birth,noisy_copies_D1[[1]]$birth,noisy_copies_D1[[2]]$birth),y = 
c(D1$death,noisy_copies_D1[[1]]$death,noisy_copies_D1[[2]]$death),main = "D1 and noisy copies",xlab = "",ylab = "",xlim = c(0,4),ylim = c(0,4),col = c("black","red","blue"),bty = "L")
lines(x = c(-0.15,4.15),y = c(-0.15,4.15))
legend("topright", inset=c(-0.2,0), legend=levels(cols), pch=16, col=c("black","red","blue"))
```

## Detecting Group Differences and **TDApplied**'s `permutation_test` Function

One of the most important inference procedures in classical statistics is the analysis of variance (ANOVA), which can find differences in the means of groups of normally-distributed measurements [@CasellaBerger]. Distributions of persistence diagrams and their means can be complicated (see [@PD_means] and [@PD_distributions]). Therefore, a non-parametric permutation test has been proposed which can find differences in groups of persistence diagrams. Such a test was first described in [@Robinson_Turner], and some variations have been suggested in later publications.

In [@Robinson_Turner], two groups of persistence diagrams are compared. The null hypothesis, $H_0$, is that the diagrams from the two groups are generated from shapes with the same type and scale of topological features, i.e. they "come" from the same "shape". The alternative hypothesis, $H_A$, is that the underlying type or scale of the features differs between the two groups. In each dimension a p-value is computed, quantifying the evidence against $H_0$ in that dimension. A measure of within-group distances (a "loss function") is calculated for the two groups, and that measure is compared to a null distribution obtained by permuting the group labels.

This inference procedure is implemented in the `permutation_test` function, with several speedups and additional functionalities. Firstly, the loss function is computed in parallel for scalability since distance computations can be expensive. Secondly, we store distance calculations as we compute them because these calculations are often repeated. Additional functionality includes allowing for any number of groups (not just two) and allowing for a pairing between groups of the same size as described in [@dependencies]. When a natural pairing exists between the groups (such as when the groups contain persistence diagrams from the same subjects of a study under different conditions) we can simulate a more realistic null distribution by restricting the way in which we permute group labels, achieving higher statistical power.

In order to demonstrate the permutation test we will detect differences between noisy copies of D1, D2 and D3:

```{r,echo = T}
# permutation test between three diagrams
g1 <- generate_TDApplied_vignette_data(3,0,0)
g2 <- generate_TDApplied_vignette_data(0,3,0)
g3 <- generate_TDApplied_vignette_data(0,0,3)
perm_test <- permutation_test(g1,g2,g3,
                              num_workers = 2,
                              dims = c(0))
perm_test$p_values
```

As expected, a difference was found (at the $\alpha = 0.05$ significance level) between the three groups. The package **TDAstats** also has a function called `permutation_test` which is based on the same test procedure. However, it uses an unpublished distance metric between persistence diagrams (see the package vignette "Comparing Distance Calculations") and does not use parallelization for scalability. As such, care must be taken when both **TDApplied** and **TDAstats** are attached in an R script to use the particular `permutation_test` function desired.
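One simple way to remove this ambiguity (a standard R mechanism, not specific to **TDApplied**) is to call the function with an explicit namespace prefix:

```{r,eval = FALSE}
# explicitly request TDApplied's permutation_test even if TDAstats is attached
perm_test <- TDApplied::permutation_test(g1, g2, g3, num_workers = 2, dims = c(0))
```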
## Independence Between Two Groups of Paired Diagrams and **TDApplied**'s `independence_test` Function

An important question when presented with two groups of paired data is determining if the pairings are independent or not. A procedure was described in [@kernel_test] which can be used to answer this question using kernel computations, and importantly uses a parametric null distribution. The null hypothesis for this test is that the groups are independent, and the alternative hypothesis is that the groups are not independent. A test statistic called the *Hilbert-Schmidt independence criterion* [@kernel_test] is calculated, and its value is compared to a gamma distribution with certain parameters which are estimated from the data. This inference procedure has been implemented in the `independence_test` function, and returns the p-value of the test in each desired dimension of the diagrams (among other information).

We would expect to find no dependence between noisy copies of D1 and D2, since each copy is generated randomly:

```{r,echo = T}
# create 10 noisy copies of D1 and D2
g1 <- generate_TDApplied_vignette_data(10,0,0)
g2 <- generate_TDApplied_vignette_data(0,10,0)

# do independence test with sigma = t = 1
indep_test <- independence_test(g1,g2,dims = c(0),num_workers = 2)
indep_test$p_values
```

The p-value of this test would not be significant at any typical significance threshold, reflecting the fact that there is no real (i.e. non-spurious) dependence between the two groups, as expected.

## Model Inference and **TDApplied**'s `permutation_model_inference` Function

The permutation test procedure allows us to test whether two groups of persistence diagrams are different, and the bootstrap procedure allows us to estimate the sampling distribution of the persistence diagram of a single dataset. Putting these two concepts together we can test whether two datasets are topologically similar or not, i.e. if each dataset is a good topological "model" of the other. The procedure we will employ is to

1. compute two groups of bootstrapped persistence diagrams, one from each dataset, and
2. carry out a permutation test on the two groups.

A small p-value in a given homological dimension of this test indicates that the two datasets had a significant topological difference in that dimension. Let's take an example of two datasets sampled from a circle, our original `circ` and a new dataset `circ2`:

```{r,echo = F,fig.height = 3,fig.width = 6,fig.align = 'center'}
# plot
par(mfrow = c(1,2))
plot(x = circ[,1],y = circ[,2],main = "circ",xlab = "",ylab = "",las = 1)
theta <- stats::runif(n = 50,min = 0,max = 2*pi)
circ2 <- data.frame(x = cos(theta),y = sin(theta))
plot(x = circ2[,1],y = circ2[,2],main = "circ2",xlab = "",ylab = "",las = 1)
```

When we run the model inference procedure we find that there is no topological difference (at the 0.05 significance level) in dimensions 0 or 1:

```{r}
mod_comp <- permutation_model_inference(circ, circ2, iterations = 20,
                                        thresh = 2, num_samples = 3,
                                        num_workers = 2L)
mod_comp$p_values
```

Now suppose we knew there was a correspondence between the rows of the two datasets, for instance row $i$ in both datasets corresponded to the same subject in two studies.
For a practical example consider the two datasets `circ` and `circ_noise`, a copy of `circ` with a small amount of Gaussian noise added to each coordinate:

```{r,echo = F,fig.height = 3,fig.width = 6,fig.align = 'center'}
# plot
par(mfrow = c(1,2))
plot(x = circ[,1],y = circ[,2],main = "circ",xlab = "",ylab = "",las = 1)
circ_noise <- data.frame(x = circ[,1] + rnorm(50,sd = 0.01),y = circ[,2] + rnorm(50,sd = 0.01))
plot(x = circ_noise[,1],y = circ_noise[,2],main = "circ_noise",xlab = "",ylab = "",las = 1)
```

We can carry out a paired permutation test to obtain increased statistical power as follows:

```{r}
mod_comp_paired <- permutation_model_inference(circ, circ_noise, iterations = 20,
                                               thresh = 2, num_samples = 3,
                                               paired = TRUE, num_workers = 2L)
mod_comp_paired$p_values
```

Once again the p-values are not significant. In general a high p-value only "suggests" a better model fit, but in this case we know that the two datasets are highly related.

If we wanted to carry out multiple paired tests between datasets (for instance data from the same subjects across multiple studies) we can ensure that the same bootstrap samples are used in all comparisons by explicitly defining those samples and supplying them to the procedure. If we had defined another dataset, `circ3`, we could carry out the paired tests as follows:

```{r,eval = F}
# in this case we are creating 3 samples of the rows of circ
# (and equivalently the other two datasets)
boot_samples <- lapply(X = 1:3,
                       FUN = function(X){return(unique(sample(1:nrow(circ),size = nrow(circ),replace = TRUE)))})

# carry out three model comparisons between circ, circ2 and circ3
mod_comp_1_2 <- permutation_model_inference(circ, circ2, iterations = 20,
                                            thresh = 2, num_samples = 3,
                                            paired = TRUE, num_workers = 2L,
                                            samp = boot_samples)
mod_comp_1_3 <- permutation_model_inference(circ, circ3, iterations = 20,
                                            thresh = 2, num_samples = 3,
                                            paired = TRUE, num_workers = 2L,
                                            samp = boot_samples)
mod_comp_2_3 <- permutation_model_inference(circ2, circ3, iterations = 20,
                                            thresh = 2, num_samples = 3,
                                            paired = TRUE, num_workers = 2L,
                                            samp = boot_samples)
```

# Finding Latent Structure

Patterns may exist in a collection of persistence diagrams. For example, if the collection contained diagrams from three distinct shapes then we would like to be able to assign three distinct (discrete) labels to the correct subsets of diagrams. On the other hand, maybe the diagrams vary along several dimensions, like the mean persistence of their loops and the mean persistence of their components. In this case we would like to be able to retrieve these continuous features for visualizing all diagrams in the same (2D in this example) space. These two types of analyses are called clustering and dimension reduction respectively, and are two of the most common machine learning applications. We will explore three techniques from these areas: kernel k-means from clustering, and multidimensional scaling and kernel principal components analysis from dimension reduction.

## Kernel k-means and **TDApplied**'s `diagram_kkmeans` Function

Kernel k-means [@kkmeans] is a method which can find hidden groups in complex data, extending regular k-means clustering [@murphyML] via a kernel. A "kernel distance" is calculated between a persistence diagram and a cluster center using only the kernel function, and the algorithm converges like regular k-means. This algorithm is implemented in the function `diagram_kkmeans` as a wrapper of the kernlab function `kkmeans`.
Moreover, a prediction function `predict_diagram_kkmeans` can be used to find the nearest cluster labels for a new set of diagrams. Here is an example clustering three groups of noisy copies of D1, D2 and D3:

```{r,echo = T}
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate kmeans clusters with centers = 3, and sigma = t = 0.1
clust <- diagram_kkmeans(diagrams = g,centers = 3,dim = 0,t = 0.1,sigma = 0.1,num_workers = 2)

# display cluster labels
clust$clustering@.Data
```

As we can see, the `diagram_kkmeans` function was able to correctly separate the three generating diagrams D1, D2 and D3 (the cluster labels are arbitrary and therefore may not be 1,1,1,2,2,2,3,3,3; however, they induce the correct partition).

If we wish to predict the cluster label for new persistence diagrams (each new diagram is assigned to the cluster whose center has the largest kernel value with it), we can use the `predict_diagram_kkmeans` function as follows:

```{r,echo = T}
# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# predict cluster labels
predict_diagram_kkmeans(new_diagrams = g_new,clustering = clust,num_workers = 2)
```

This function correctly predicted the cluster for each new diagram (assigning each diagram to the cluster corresponding to D1, D2 or D3, depending on which diagram it was generated from).

## Multidimensional Scaling and **TDApplied**'s `diagram_mds` Function

Dimension reduction is a task in machine learning which is commonly used for data visualization, removing noise in data, and decreasing the number of covariates in a model (which can be helpful in reducing overfitting). One common dimension reduction technique in machine learning is called *multidimensional scaling* (MDS) [@Cox2008]. MDS takes as input an $n$ by $n$ distance (or dissimilarity) matrix $D$, computed from $n$ points in a dataset, and outputs an embedding of those points into a Euclidean space of chosen dimension $k$ which best preserves the inter-point distances. MDS is often used for visualizing data in exploratory analyses, and can be particularly useful when the input data points do not live in a shared Euclidean space (as is the case for persistence diagrams).

Using the R function `cmdscale` from the package stats [@R-stats] we can compute the optimal embedding of a set of persistence diagrams using any of the three distance metrics with the function `diagram_mds`. Here is an example of the `diagram_mds` function projecting nine persistence diagrams, three noisy copies sampled from each of D1, D2 and D3, into 2D space:

```{r,echo = T,fig.height = 3,fig.width = 6,fig.align = 'center'}
# create 9 diagrams based on D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate their 2D MDS embedding in dimension 0 with the bottleneck distance
mds <- diagram_mds(diagrams = g,dim = 0,p = Inf,k = 2,num_workers = 2)

# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(mds[,1],mds[,2],xlab = "Embedding coordinate 1",ylab = "Embedding coordinate 2",
     main = "MDS plot",col = as.factor(rep(c("D1","D2","D3"),each = 3)),bty = "L")
legend("topright", inset=c(-0.2,0),
       legend=levels(as.factor(c("D1","D2","D3"))),
       pch=16, col=unique(as.factor(c("D1","D2","D3"))))
```

The MDS plot shows the clear separation between the three generating diagrams (D1, D2 and D3), and the embedded coordinates could be used for further downstream analyses.
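As one illustration of such a downstream analysis (a sketch only, not part of **TDApplied**'s own pipeline), the embedding returned by `diagram_mds` lives in ordinary Euclidean space, so it can be passed directly to standard tools such as k-means clustering from the stats package:

```{r,eval = F}
# cluster the 2D MDS coordinates with ordinary k-means, then compare
# the resulting labels against the known generating diagrams
embedding_clusters <- stats::kmeans(mds,centers = 3)$cluster
table(embedding_clusters,rep(c("D1","D2","D3"),each = 3))
```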
## Kernel Principal Components Analysis and **TDApplied**'s `diagram_kpca` Function

PCA is another dimension reduction technique in machine learning, but can be preferable to MDS in certain situations because it allows for the projection of new data points onto an old embedding model [@murphyML]. For example, this can be important if PCA is used as a pre-processing step in model fitting. Kernel PCA (kPCA) [@kpca] is an extension of regular PCA which uses a kernel to project complex data into a high-dimensional Euclidean space and then uses PCA to project that data into a low-dimensional space. The `diagram_kpca` function computes the kPCA embedding of a set of persistence diagrams, and the `predict_diagram_kpca` function can be used to project new diagrams using a pre-trained kPCA model. Here is an example using a group of noisy copies of D1, D2 and D3:

```{r,echo = T,fig.height = 3,fig.width = 6,fig.align = 'center'}
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate their 2D PCA embedding with sigma = t = 2
pca <- diagram_kpca(diagrams = g,dim = 0,t = 2,sigma = 2,features = 2,num_workers = 2)

# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(pca$pca@rotated[,1],pca$pca@rotated[,2],xlab = "Embedding coordinate 1",
     ylab = "Embedding coordinate 2",main = "PCA plot",
     col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0),
       legend=levels(as.factor(c("D1","D2","D3"))),
       pch=16, col=unique(as.factor(c("D1","D2","D3"))))
```

The function was able to recognize the three groups, and the embedding coordinates can be used for further downstream analysis. However, an important advantage of kPCA over MDS is that in kPCA we can project new points onto an old embedding using the `predict_diagram_kpca` function:

```{r,echo = T,fig.height = 3,fig.width = 6,fig.align = 'center'}
# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# project new diagrams onto old model
new_pca <- predict_diagram_kpca(new_diagrams = g_new,embedding = pca,num_workers = 2)

# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(new_pca[,1],new_pca[,2],xlab = "Embedding coordinate 1",
     ylab = "Embedding coordinate 2",main = "PCA prediction plot",
     col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0),
       legend=levels(as.factor(c("D1","D2","D3"))),
       pch=16, col=unique(as.factor(c("D1","D2","D3"))))
```

As we can see, the original three groups, and their approximate locations in 2D space, are preserved during prediction.

# Predicting Labels of Persistence Diagrams

One of the most valuable problems that machine learning is used to solve is that of prediction: can a variable $Y$ be predicted from other variables $X$? Kernel functions can be used to predict a variable $Y$ from a persistence diagram $D$ using a class of models called support vector machines (SVMs) [@murphyML].

## Support Vector Machines and **TDApplied**'s `diagram_ksvm` Function

SVMs are one of the most popular machine learning techniques for regression and classification tasks. SVMs use a kernel function to project complex data into a high-dimensional space and then find a sparse set of training examples, called "support vectors", which maximally linearly separate the outcome variable classes (or yield the highest explained variance in the case of regression).
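Before moving to persistence diagrams, the following sketch illustrates this general mechanism on ordinary numeric data using the kernlab package (which **TDApplied** already uses for kernel k-means). The toy data, kernel choice and parameters below are invented purely for illustration and are unrelated to **TDApplied**'s internal implementation:

```{r,eval = F}
library("kernlab")

# a toy two-class dataset in the plane
x <- rbind(matrix(stats::rnorm(40,mean = 0),ncol = 2),
           matrix(stats::rnorm(40,mean = 3),ncol = 2))
y <- factor(rep(c("A","B"),each = 20))

# fit a kernel SVM with a Gaussian (RBF) kernel
toy_svm <- ksvm(x,y,kernel = "rbfdot",kpar = list(sigma = 1),C = 1)

# predict the classes of the first and last training points
predict(toy_svm,x[c(1,40),])
```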
Unlike in the case of dimension reduction or clustering, it is possible that our persistence diagrams may each have a label, either discrete or continuous, which gives us more information about the underlying data represented by the diagram. For instance, if our persistence diagrams represented periodic behavior in stock market trends during a particular temporal window then a useful (discrete) label could be whether the overall market went up or down during that period. Being able to predict such labels from the persistence diagrams themselves is a way to link persistence diagrams to external data through modeling, i.e. classification and regression.

SVMs have been implemented in **TDApplied** by the function `diagram_ksvm`, tailored for input datasets which contain pairs of persistence diagrams and their outcome variable labels. A prediction function, `predict_diagram_ksvm`, is supplied which can be used to predict the label value of a set of new persistence diagrams given a pre-trained model. A parallelized implementation of cross-validation model-fitting is used based on the remarks in [@persistence_fisher] for scalability (which avoids needlessly recomputing persistence Fisher information metric values). Here is an example of fitting an SVM model on a list of persistence diagrams for a classification task (guessing whether the diagram comes from D1, D2 or D3):

```{r,echo = T}
# create thirty noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(10,10,10)

# create response vector
y <- as.factor(rep(c("D1","D2","D3"),each = 10))

# fit model with cross validation
model_svm <- diagram_ksvm(diagrams = g,cv = 2,dim = c(0),
                          y = y,sigma = c(1,0.1),t = c(1,2),
                          num_workers = 2)
```

We can use the function `predict_diagram_ksvm` to predict the labels of new diagrams like so:

```{r,echo = T}
# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# predict
predict_diagram_ksvm(new_diagrams = g_new,model = model_svm,num_workers = 2)
```

As we can see, the best SVM model was able to separate the three diagrams. We can gain more information about the best model found during model fitting, and about the cross-validation results, by accessing different list elements of `model_svm`.

# Limitations of **TDApplied** Functionality

There is one main limitation of **TDApplied** which should be discussed with a view to future improvements -- **TDApplied** functions cannot analyze numeric and factor features alongside persistence diagrams. This may be too inflexible for some applications, where the data may include several persistence diagrams, or a mix of persistence diagrams, numeric and categorical variables. The package vignette "Personalized Analyses with **TDApplied**" demonstrates how one can circumvent this issue using extra code; however, a future update to **TDApplied** might build such functionality directly into its functions.

# Conclusion

Current topological data analysis packages in R (and Python) do not provide the ability to carry out statistics and machine learning with persistence diagrams, leading to a high barrier to adoption of topological data analysis in academia and industry. By filling in this gap, the **TDApplied** package aims to bring topological data analysis to researchers and data practitioners in the R community. Topological data analysis is an exciting and powerful new field of data analysis, and with **TDApplied** anyone can access its power for meaningful and creative analyses of data.
```{r,echo = F}
# reset parameters
par(mfrow = original_mfrow,xpd = original_xpd,mar = original_mar)
options(scipen = original_scipen)
do.call("RNGkind",as.list(oldRNGkind))
assign(".Random.seed", oldseed, .GlobalEnv)
```

## References