--- title: "Comparative Analysis of Embedding Methods for Non-Euclidean Dissimilarity Data: A Comprehensive Evaluation of Topolow" author: "Omid Arhami" date: "2025-08-04" output: rmarkdown::html_vignette: toc: true toc_depth: 3 number_sections: true fig_width: 10 fig_height: 8 vignette: > %\VignetteIndexEntry{Comparative Analysis of Embedding Methods for Non-Euclidean Dissimilarity Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Abstract This vignette presents a comprehensive comparative analysis of embedding methods for non-Euclidean dissimilarity data, with particular focus on the Topolow algorithm. We evaluate three fundamentally different approaches: Topolow (force-directed optimization), Classical Multidimensional Scaling (eigenvalue decomposition), and Iterative MDS (STRESS minimization) across two distinct types of non-Euclidean datasets. Our analysis demonstrates the relative strengths and limitations of each method when dealing with incomplete, non-metric dissimilarity matrices that violate fundamental assumptions of Euclidean geometry. # Introduction The embedding of high-dimensional or non-metric dissimilarity data into low-dimensional Euclidean spaces is a fundamental problem in computational statistics, bioinformatics, and machine learning. While classical methods such as Multidimensional Scaling (MDS) provide theoretically optimal solutions under ideal conditions, real-world datasets often exhibit characteristics that challenge these assumptions: sparse measurements, non-Euclidean geometry, and threshold-censored observations. The Topolow algorithm represents a novel approach to this problem, utilizing physics-inspired force-directed optimization to handle these challenging data characteristics naturally. This vignette provides a rigorous comparative evaluation of Topolow against established methods using controlled synthetic datasets that exhibit known non-Euclidean properties. ## Methodological Framework We evaluate three distinct embedding approaches: 1. **Topolow**: A force-directed method that models objects as particles connected by springs (for measured dissimilarities) and subject to repulsive forces (for missing measurements) 2. **Classical MDS**: The mathematically optimal eigenvalue-based solution that provides the theoretical best-case Euclidean approximation 3. **Iterative MDS**: A gradient-based STRESS minimization approach that uses stochastic optimization similar to Topolow This three-method comparison allows us to understand both the theoretical limits of embedding accuracy and the practical differences in optimization stability between deterministic and stochastic methods. # Data Generation Methodologies We employ two distinct data generation methods to create non-Euclidean datasets with controlled properties, allowing for systematic evaluation of embedding performance under different types of geometric violations. ## Method 1: Synthesized non-Euclidean Space This approach creates non-Euclidean data by systematically distorting a high-dimensional clustered dataset through non-linear transformations and asymmetric noise injection. ``` r #' Generate Non-Euclidean Data by Distorting Clustered Points #' #' Creates non-Euclidean data through systematic distortion of clustered high-dimensional points. #' This method is particularly effective for testing robustness to violations of the triangle inequality. #' #' @param n_objects Number of objects to generate #' @param initial_dims Dimensionality of the initial latent space #' @param n_clusters Number of clusters to form #' @param noise_factor Magnitude of asymmetric noise to add #' @param missing_fraction Proportion of distances to set to NA #' @return List containing complete and incomplete distance matrices generate_distorted_euclidean_data <- function(n_objects = 50, initial_dims = 10, n_clusters = 8, noise_factor = 0.3, missing_fraction = 0.30) { object_names <- paste0("Object_", sprintf("%02d", 1:n_objects)) # Generate structured high-dimensional coordinates cluster_size <- n_objects %/% n_clusters cluster_centers <- matrix(rnorm(n_clusters * initial_dims, mean = 0, sd = 4), nrow = n_clusters, ncol = initial_dims) initial_coords <- matrix(0, nrow = n_objects, ncol = initial_dims) rownames(initial_coords) <- object_names for(i in 1:n_objects) { cluster_id <- ((i - 1) %/% cluster_size) + 1 if(cluster_id > n_clusters) cluster_id <- n_clusters initial_coords[i, ] <- cluster_centers[cluster_id, ] + rnorm(initial_dims, mean = 0, sd = 1.5) } # Calculate foundational Euclidean distances euclidean_distances <- as.matrix(dist(initial_coords, method = "euclidean")) # Apply non-linear transformations dist_quantiles <- quantile(euclidean_distances[upper.tri(euclidean_distances)], c(0.33, 0.66)) transform_1 <- euclidean_distances transform_1[euclidean_distances <= dist_quantiles[1]] <- euclidean_distances[euclidean_distances <= dist_quantiles[1]]^1.3 transform_1[euclidean_distances > dist_quantiles[1] & euclidean_distances <= dist_quantiles[2]] <- euclidean_distances[euclidean_distances > dist_quantiles[1] & euclidean_distances <= dist_quantiles[2]]^1.6 transform_1[euclidean_distances > dist_quantiles[2]] <- euclidean_distances[euclidean_distances > dist_quantiles[2]]^1.8 # Add asymmetric noise asymmetric_noise <- transform_1 * noise_factor * matrix(runif(n_objects^2, -1, 1), nrow = n_objects) asymmetric_noise <- (asymmetric_noise + t(asymmetric_noise)) / 2 transform_2 <- transform_1 + asymmetric_noise transform_2[transform_2 < 0] <- 0.01 # Create final non-Euclidean distance matrix complete_non_euclidean_distances <- transform_2 diag(complete_non_euclidean_distances) <- 0 # Introduce missing values total_unique_pairs <- n_objects * (n_objects - 1) / 2 target_missing_pairs <- round(total_unique_pairs * missing_fraction) upper_tri_indices <- which(upper.tri(complete_non_euclidean_distances), arr.ind = TRUE) missing_pair_indices <- sample(nrow(upper_tri_indices), target_missing_pairs) incomplete_distances <- complete_non_euclidean_distances incomplete_distances[upper_tri_indices[missing_pair_indices,]] <- NA incomplete_distances[upper_tri_indices[missing_pair_indices, c(2,1)]] <- NA return(list( complete_distances = complete_non_euclidean_distances, incomplete_distances = incomplete_distances, object_names = object_names, method = "Synthesized non-Euclidean" )) } ``` ## Method 2: Swiss Roll Manifold This approach generates data points on a 2D "Swiss Roll" manifold embedded in 3D space, where true distances are geodesic (along the manifold surface) and thus inherently non-Euclidean from a 3D perspective. x = t * cos(t), y = height, z = t * sin(t), t = 1.5 * pi * (1 + 2runif) Marsland, "Machine Learning: An Algorithmic Perspective", 2nd edition, Chapter 6, 2014 ``` r #' Generate Non-Euclidean Data from a Swiss Roll Manifold #' #' Creates data points on a Swiss Roll manifold where geodesic distances #' are inherently non-Euclidean. Includes "tunneling" effects common in real-world manifold data. #' #' @param n_objects Number of points on the manifold #' @param noise Standard deviation of Gaussian noise added to points #' @param tunnel_fraction Fraction of distances to replace with Euclidean shortcuts #' @param missing_fraction Proportion of final distances to set to NA #' @return List containing complete and incomplete distance matrices generate_swiss_roll_data <- function(n_objects = 50, noise = 0.05, tunnel_fraction = 0.05, missing_fraction = 0.30) { # Check if required packages are available if(!requireNamespace("RANN", quietly = TRUE)) { stop("RANN package is required for this function. Please install it.") } if(!requireNamespace("igraph", quietly = TRUE)) { stop("igraph package is required for this function. Please install it.") } # Generate points on the Swiss Roll t <- 1.5 * pi * (1 + 2 * runif(n_objects)) height <- 21 * runif(n_objects) x <- t * cos(t) y <- height z <- t * sin(t) points_3d <- data.frame(x = x, y = y, z = z) + rnorm(n_objects * 3, sd = noise) object_names <- paste0("Object_", sprintf("%02d", 1:n_objects)) rownames(points_3d) <- object_names # Calculate geodesic distances via k-NN graph if(requireNamespace("RANN", quietly = TRUE)) { knn_graph <- RANN::nn2(points_3d, k = min(10, n_objects-1))$nn.idx adj_matrix <- matrix(0, n_objects, n_objects) for (i in 1:n_objects) { for (j_idx in 2:min(10, n_objects)) { if(j_idx <= ncol(knn_graph)) { j <- knn_graph[i, j_idx] dist_ij <- dist(rbind(points_3d[i,], points_3d[j,])) adj_matrix[i, j] <- dist_ij adj_matrix[j, i] <- dist_ij } } } g <- igraph::graph_from_adjacency_matrix(adj_matrix, mode = "undirected", weighted = TRUE) geodesic_distances <- igraph::distances(g) geodesic_distances[is.infinite(geodesic_distances)] <- max(geodesic_distances[is.finite(geodesic_distances)]) * 1.5 } else { # Fallback to Euclidean if RANN not available geodesic_distances <- as.matrix(dist(points_3d)) warning("RANN package not available. Using Euclidean distances as approximation.") } # Introduce "tunnels" or "short-circuits" euclidean_distances <- as.matrix(dist(points_3d)) n_tunnels <- round(tunnel_fraction * n_objects * (n_objects - 1) / 2) upper_tri_indices <- which(upper.tri(geodesic_distances), arr.ind = TRUE) tunnel_indices <- sample(nrow(upper_tri_indices), min(n_tunnels, nrow(upper_tri_indices))) complete_distances <- geodesic_distances if(length(tunnel_indices) > 0) { for (k in tunnel_indices) { i <- upper_tri_indices[k, 1] j <- upper_tri_indices[k, 2] complete_distances[i, j] <- complete_distances[j, i] <- euclidean_distances[i, j] } } diag(complete_distances) <- 0 # Introduce missing values total_unique_pairs <- n_objects * (n_objects - 1) / 2 target_missing_pairs <- round(total_unique_pairs * missing_fraction) missing_pair_indices <- sample(nrow(upper_tri_indices), min(target_missing_pairs, nrow(upper_tri_indices))) incomplete_distances <- complete_distances if(length(missing_pair_indices) > 0) { incomplete_distances[upper_tri_indices[missing_pair_indices,]] <- NA incomplete_distances[upper_tri_indices[missing_pair_indices, c(2,1)]] <- NA } rownames(complete_distances) <- object_names colnames(complete_distances) <- object_names rownames(incomplete_distances) <- object_names colnames(incomplete_distances) <- object_names return(list( complete_distances = complete_distances, incomplete_distances = incomplete_distances, object_names = object_names, method = "Swiss Roll Manifold" )) } ``` # Experimental Design and Analysis Pipeline Our analysis follows a standardized pipeline for each dataset type: 1. **Data Generation** with controlled non-Euclidean characteristics 2. **Geometric Assessment** via eigenvalue analysis of the Gram matrix 3. **Parameter Optimization** using the Euclidify automated workflow 4. **Three-Method Comparison** with multiple runs for stochastic methods 5. **Performance Evaluation** and statistical analysis 6. **Distance Preservation Assessment** ``` r #' Comprehensive analysis pipeline for embedding method comparison #' #' @param data_gen_func Function to generate the dataset #' @param dataset_name Name for the dataset (for labeling) #' @param n_objects Number of objects in the dataset #' @param n_runs Number of runs for stochastic methods run_comprehensive_analysis <- function(data_gen_func, dataset_name, n_objects = 50, n_runs = 50) { cat("=== ANALYSIS FOR", toupper(dataset_name), "===\n") # Step 1: Data Generation cat("\n1. Generating", dataset_name, "dataset...\n") data_gen_output <- data_gen_func(n_objects = n_objects, missing_fraction = 0.3) complete_distances_for_evaluation <- data_gen_output$complete_distances incomplete_distances_for_embedding <- data_gen_output$incomplete_distances object_names <- data_gen_output$object_names actual_missing_percentage <- sum(is.na(incomplete_distances_for_embedding)) / (n_objects * (n_objects-1)) * 100 data_quality <- calculate_data_quality_metrics(incomplete_distances_for_embedding) cat("Generated dataset with", n_objects, "objects and", round(actual_missing_percentage, 1), "% missing values\n") # Step 2: Assess Non-Euclidean Character cat("\n2. Assessing non-Euclidean character...\n") D_squared <- complete_distances_for_evaluation^2 n <- nrow(D_squared) J <- diag(n) - (1/n) * matrix(1, n, n) B <- -0.5 * J %*% D_squared %*% J eigenvals <- eigen(B, symmetric = TRUE)$values numerical_tolerance <- 1e-12 positive_eigenvals <- eigenvals[eigenvals > numerical_tolerance] negative_eigenvals <- eigenvals[eigenvals < -numerical_tolerance] total_variance <- sum(abs(eigenvals)) negative_variance <- sum(abs(negative_eigenvals)) positive_variance <- sum(positive_eigenvals) if(positive_variance > 0) { deviation_score <- negative_variance / positive_variance } else { deviation_score <- 1.0 } negative_variance_fraction <- negative_variance / total_variance cumulative_positive_variance <- cumsum(positive_eigenvals) / positive_variance dims_for_90_percent <- which(cumulative_positive_variance >= 0.90)[1] if(is.na(dims_for_90_percent)) dims_for_90_percent <- length(positive_eigenvals) cat("Non-Euclidean deviation score:", round(deviation_score, 4), "\n") cat("Dimensions for 90% variance:", dims_for_90_percent, "\n") # Step 3: Parameter Optimization using Euclidify cat("\n3. Running automated parameter optimization...\n") # Create temporary directory for optimization temp_dir <- tempfile() dir.create(temp_dir, recursive = TRUE) euclidify_results <- tryCatch({ Euclidify( dissimilarity_matrix = incomplete_distances_for_embedding, output_dir = temp_dir, ndim_range = c(2, min(10, dims_for_90_percent + 3)), n_initial_samples = 20, # Reduced for vignette speed n_adaptive_samples = 40, # Reduced for vignette speed folds = 20, mapping_max_iter = 300, # Reduced for speed clean_intermediate = TRUE, verbose = "off", fallback_to_defaults = TRUE, save_results = FALSE ) }, error = function(e) { cat("Euclidify failed, using default parameters\n") NULL }) # Clean up temp directory unlink(temp_dir, recursive = TRUE) if(!is.null(euclidify_results)) { optimal_params <- euclidify_results$optimal_params euclidify_positions <- euclidify_results$positions target_dims <- optimal_params$ndim cat("Optimal parameters found - dimensions:", target_dims, "\n") } else { # Fallback parameters target_dims <- max(2, min(5, dims_for_90_percent)) optimal_params <- list( ndim = target_dims, k0 = 5.0, cooling_rate = 0.01, c_repulsion = 0.02 ) euclidify_positions <- NULL cat("Using fallback parameters - dimensions:", target_dims, "\n") } # Step 4: Three-Method Comparison cat("\n4. Running three-method comparison...\n") # Initialize storage topolow_results <- vector("list", n_runs) iterative_mds_results <- vector("list", n_runs) classical_mds_result <- NULL topolow_errors <- numeric(n_runs) iterative_mds_errors <- numeric(n_runs) classical_mds_error <- NA # Topolow runs cat("Running Topolow embeddings...\n") for(i in 1:n_runs) { if(i == 1 && !is.null(euclidify_positions)) { # Use Euclidify result for first run topolow_coords <- euclidify_positions } else { # Run new embedding topolow_result <- tryCatch({ euclidean_embedding( dissimilarity_matrix = incomplete_distances_for_embedding, ndim = optimal_params$ndim, mapping_max_iter = 300, k0 = optimal_params$k0, cooling_rate = optimal_params$cooling_rate, c_repulsion = optimal_params$c_repulsion, relative_epsilon = 1e-6, convergence_counter = 3, verbose = FALSE ) }, error = function(e) NULL) if(!is.null(topolow_result)) { topolow_coords <- topolow_result$positions } else { topolow_coords <- NULL } } if(!is.null(topolow_coords)) { topolow_coords <- topolow_coords[order(row.names(topolow_coords)), ] if(validate_coordinates(topolow_coords, "Topolow", n_objects, target_dims)) { topolow_coords <- scale(topolow_coords, center = TRUE, scale = FALSE) topolow_results[[i]] <- topolow_coords embedded_distances <- as.matrix(dist(topolow_coords)) rownames(embedded_distances) <- rownames(topolow_coords) colnames(embedded_distances) <- rownames(topolow_coords) valid_mask <- !is.na(complete_distances_for_evaluation) distance_errors <- abs(complete_distances_for_evaluation[valid_mask] - embedded_distances[valid_mask]) topolow_errors[i] <- sum(distance_errors) } else { topolow_results[[i]] <- NULL topolow_errors[i] <- NA } } else { topolow_results[[i]] <- NULL topolow_errors[i] <- NA } } # Classical MDS cat("Running Classical MDS...\n") imputation_result <- improved_missing_data_imputation(incomplete_distances_for_embedding) classical_mds_matrix <- imputation_result$matrix tryCatch({ classical_mds_result_raw <- cmdscale(classical_mds_matrix, k = target_dims, eig = TRUE) classical_mds_coords <- classical_mds_result_raw$points rownames(classical_mds_coords) <- object_names classical_mds_coords <- classical_mds_coords[order(row.names(classical_mds_coords)), ] if(validate_coordinates(classical_mds_coords, "Classical MDS", n_objects, target_dims)) { classical_mds_coords <- scale(classical_mds_coords, center = TRUE, scale = FALSE) classical_mds_result <- classical_mds_coords embedded_distances <- as.matrix(dist(classical_mds_coords)) rownames(embedded_distances) <- rownames(classical_mds_coords) colnames(embedded_distances) <- rownames(classical_mds_coords) valid_mask <- !is.na(complete_distances_for_evaluation) distance_errors <- abs(complete_distances_for_evaluation[valid_mask] - embedded_distances[valid_mask]) classical_mds_error <- sum(distance_errors) } }, error = function(e) { cat("Classical MDS failed:", e$message, "\n") }) # Iterative MDS cat("Running Iterative MDS...\n") for(i in 1:n_runs) { tryCatch({ if(requireNamespace("smacof", quietly = TRUE)) { max_dist <- max(classical_mds_matrix, na.rm = TRUE) scaled_matrix <- classical_mds_matrix / max_dist iterative_mds_result_raw <- smacof::smacofSym( delta = scaled_matrix, ndim = target_dims, type = "interval", init = "torgerson", verbose = FALSE, itmax = 1000, eps = 1e-5 ) iterative_mds_coords <- iterative_mds_result_raw$conf current_max_dist <- max(dist(iterative_mds_coords)) if(current_max_dist > 0) { scale_factor <- max_dist / current_max_dist iterative_mds_coords <- iterative_mds_coords * scale_factor } } else { # Fallback to isoMDS iterative_mds_result_raw <- MASS::isoMDS(classical_mds_matrix, k = target_dims, trace = FALSE) iterative_mds_coords <- iterative_mds_result_raw$points } rownames(iterative_mds_coords) <- object_names iterative_mds_coords <- iterative_mds_coords[order(row.names(iterative_mds_coords)), ] if(validate_coordinates(iterative_mds_coords, "Iterative MDS", n_objects, target_dims)) { iterative_mds_coords <- scale(iterative_mds_coords, center = TRUE, scale = FALSE) iterative_mds_results[[i]] <- iterative_mds_coords embedded_distances <- as.matrix(dist(iterative_mds_coords)) rownames(embedded_distances) <- rownames(iterative_mds_coords) colnames(embedded_distances) <- rownames(iterative_mds_coords) valid_mask <- !is.na(complete_distances_for_evaluation) distance_errors <- abs(complete_distances_for_evaluation[valid_mask] - embedded_distances[valid_mask]) iterative_mds_errors[i] <- sum(distance_errors) } else { iterative_mds_results[[i]] <- NULL iterative_mds_errors[i] <- NA } }, error = function(e) { iterative_mds_results[[i]] <- NULL iterative_mds_errors[i] <- NA }) } # Step 5: Compile Results valid_topolow_results <- !is.na(topolow_errors) valid_iterative_results <- !is.na(iterative_mds_errors) results <- list( dataset_name = dataset_name, data_characteristics = list( n_objects = n_objects, missing_percentage = actual_missing_percentage, deviation_score = deviation_score, dims_90_percent = dims_for_90_percent, negative_variance_fraction = negative_variance_fraction ), optimal_params = optimal_params, topolow_errors = topolow_errors, topolow_results = topolow_results, valid_topolow_results = valid_topolow_results, classical_mds_error = classical_mds_error, classical_mds_result = classical_mds_result, iterative_mds_errors = iterative_mds_errors, iterative_mds_results = iterative_mds_results, valid_iterative_results = valid_iterative_results, complete_distances = complete_distances_for_evaluation, incomplete_distances = incomplete_distances_for_embedding, target_dims = target_dims ) return(results) } ``` # Results: Synthesized non-Euclidean Data We begin our analysis with the Synthesized non-Euclidean dataset, which represents a controlled deviation from metric properties through systematic transformations. ``` r # Run analysis for Synthesized non-Euclidean data distorted_results <- run_comprehensive_analysis( generate_distorted_euclidean_data, "Synthesized non-Euclidean", n_objects = 50, n_runs = 50 ) ``` ``` #> === Synthesized non-Euclidean DATA ANALYSIS SUMMARY === #> Dataset characteristics: #> - Objects: 50 #> - Missing data: 30 % #> - Non-Euclidean deviation score: 0.6034 #> - Dimensions for 90% variance: 13 #> - Target embedding dimensions: 3 #> Performance Summary: #> - Topolow mean error: 115995.6 ± 414.62 #> - Classical MDS error: 154109.3 #> - Iterative MDS mean error: 263692 ± 0 ``` ## Eigenvalue Spectrum Analysis - Synthesized non-Euclidean ![plot of chunk distorted_eigenvalues](figure/distorted_eigenvalues-1.png) ## Performance Comparison - Synthesized non-Euclidean ![plot of chunk distorted_performance](figure/distorted_performance-1.png) # Results: Swiss Roll Manifold Data Next, we analyze the Swiss Roll manifold dataset, which represents geodesic distances that are inherently non-Euclidean in the embedding space. ``` r # Run analysis for Swiss Roll data swiss_results <- run_comprehensive_analysis( generate_swiss_roll_data, "Swiss Roll Manifold", n_objects = 50, n_runs = 50 ) ``` ``` #> === SWISS ROLL MANIFOLD DATA ANALYSIS SUMMARY === #> Dataset characteristics: #> - Objects: 50 #> - Missing data: 30 % #> - Non-Euclidean deviation score: 0.2945 #> - Dimensions for 90% variance: 9 #> - Target embedding dimensions: 4 #> Performance Summary: #> - Topolow mean error: 2871.67 ± 76.94 #> - Classical MDS error: 5727.19 #> - Iterative MDS mean error: 9628.46 ± 0 ``` ## Eigenvalue Spectrum Analysis - Swiss Roll ![plot of chunk swiss_eigenvalues](figure/swiss_eigenvalues-1.png) ## Performance Comparison - Swiss Roll ![plot of chunk swiss_performance](figure/swiss_performance-1.png) # Comparative Analysis Across Dataset Types ## Combined Performance Analysis ![plot of chunk combined_analysis](figure/combined_analysis-1.png) ``` #> #> === COMBINED PERFORMANCE SUMMARY === #> # A tibble: 6 × 8 #> Method Dataset Method_Type Count Mean_Error SD_Error Min_Error Max_Error #> #> 1 Classical MDS Swiss Roll Deterministic 1 5727. NA 5727. 5727. #> 2 Classical MDS Synthesized non-Euclidean Deterministic 1 154109. NA 154109. 154109. #> 3 Iterative MDS Swiss Roll Stochastic 50 9628. 0 9628. 9628. #> 4 Iterative MDS Synthesized non-Euclidean Stochastic 50 263692. 0 263692. 263692. #> 5 Topolow Swiss Roll Stochastic 50 2872. 76.9 2822. 3369. #> 6 Topolow Synthesized non-Euclidean Stochastic 50 115996. 415. 115258. 116960. ``` ## Statistical Comparisons ``` #> #> === STATISTICAL ANALYSIS === #> #> Synthesized non-Euclidean Data: #> #> --- Statistical Comparison (Topolow vs Iterative MDS ) --- #> - Welch's t-statistic: -2518.874 #> - p-value: 6.39e-127 #> - Cohen's d (effect size): -503.775 #> - Effect size interpretation: Large #> $t_test #> #> Welch Two Sample t-test #> #> data: valid_topolow and valid_other #> t = -2518.9, df = 49, p-value < 2.2e-16 #> alternative hypothesis: true difference in means is not equal to 0 #> 95 percent confidence interval: #> -147814.3 -147578.6 #> sample estimates: #> mean of x mean of y #> 115995.6 263692.0 #> #> #> $cohens_d #> [1] -503.7748 #> #> Swiss Roll Manifold Data: #> #> --- Statistical Comparison (Topolow vs Iterative MDS ) --- #> - Welch's t-statistic: -620.991 #> - p-value: 4e-97 #> - Cohen's d (effect size): -124.198 #> - Effect size interpretation: Large #> $t_test #> #> Welch Two Sample t-test #> #> data: valid_topolow and valid_other #> t = -620.99, df = 49, p-value < 2.2e-16 #> alternative hypothesis: true difference in means is not equal to 0 #> 95 percent confidence interval: #> -6778.653 -6734.922 #> sample estimates: #> mean of x mean of y #> 2871.674 9628.462 #> #> #> $cohens_d #> [1] -124.1981 ``` ## Distance Preservation Analysis ``` #> #> === DISTANCE PRESERVATION: SYNTHESIZED NON-EUCLIDEAN === ``` ![plot of chunk distance_preservation](figure/distance_preservation-1.png) ``` #> #> Distance Preservation Statistics: #> Method Dataset Correlation R_squared #> 1 Topolow Synthesized non-Euclidean 0.8938058 0.7988889 #> 2 Classical MDS Synthesized non-Euclidean 0.7988927 0.6382295 #> 3 Iterative MDS Synthesized non-Euclidean 0.8059096 0.6494903 #> #> === DISTANCE PRESERVATION: SWISS ROLL === ``` ![plot of chunk distance_preservation](figure/distance_preservation-2.png) ``` #> #> Distance Preservation Statistics: #> Method Dataset Correlation R_squared #> 1 Topolow Swiss Roll 0.9788604 0.9581677 #> 2 Classical MDS Swiss Roll 0.9085225 0.8254131 #> 3 Iterative MDS Swiss Roll 0.9042658 0.8176966 ``` # Discussion ## Key Findings Our comprehensive analysis reveals several important insights about the performance of embedding methods on non-Euclidean dissimilarity data: ### 1. Non-Euclidean Character Assessment The eigenvalue spectrum analysis provides crucial context for interpreting method performance: - **Synthesized non-Euclidean Data**: Shows deviation score of 0.6034, indicating moderate non-Euclidean structure - **Swiss Roll Data**: Shows deviation score of 0.2945, reflecting the manifold's inherent geometric properties ### 2. Method Performance Characteristics **Topolow** consistently demonstrates the best performance across both datasets, with its force-directed approach naturally handling missing data without requiring imputation. The method's physics-inspired optimization appears particularly well-suited for sparse, non-Euclidean datasets. **Classical MDS** provides the theoretical baseline for Euclidean approximation but requires complete data matrices through imputation. Performance varies significantly based on the quality of imputation and the degree of non-Euclidean structure. **Iterative MDS** shows no stochasticity in our tests due to the fixed optimal initialization. The method achieves performance comparable to classical MDS. ### 3. Distance Preservation Quality The correlation analysis reveals how well each method preserves the original distance relationships. However, correlation is not a sufficient measure of accuracy and needs to be interpreted joint with measures of error such as MAE. ``` #> Distance Preservation Summary (Correlation with Original Distances): #> Method Dataset Correlation R_squared #> 1 Topolow Synthesized non-Euclidean 0.8938058 0.7988889 #> 2 Classical MDS Synthesized non-Euclidean 0.7988927 0.6382295 #> 3 Iterative MDS Synthesized non-Euclidean 0.8059096 0.6494903 #> 4 Topolow Swiss Roll 0.9788604 0.9581677 #> 5 Classical MDS Swiss Roll 0.9085225 0.8254131 #> 6 Iterative MDS Swiss Roll 0.9042658 0.8176966 #> #> Mean Performance Across Dataset Types: #> # A tibble: 3 × 3 #> Method Mean_Correlation Mean_R_squared #> #> 1 Classical MDS 0.854 0.732 #> 2 Iterative MDS 0.855 0.734 #> 3 Topolow 0.936 0.879 ``` ## Methodological Implications ### For Sparse Data The analysis demonstrates that Topolow's approach of using spring networks to handle missing measurements provides a principled alternative to imputation-based methods. This is particularly relevant for: - Biological assay data with systematic missingness - Network analysis with incomplete edge information - High-throughput screening data with threshold censoring ### For Non-Euclidean Data Different types of non-Euclidean structure pose varying challenges: - **Systematic distortions** (as in our Synthesized non-Euclidean data) can often be well-approximated by optimization - **Manifold structure** (as in Swiss Roll data) can be embedded into a metric space using the faithfull transformation or topolow ### Parameter Optimization The Euclidify wizard function automatically tunes parameters and optimizes the output coordinates, making it particularly valuable for non-expert users. ## Limitations and Future Directions ### Current Limitations 1. **Computational Complexity**: run-time of topolow scales poorly with dataset size, making parallel processing a bottleneck ### Future Research Directions 1. **Scalability Improvements**: Development for large-scale applications 2. **Adaptive Dimensionality**: Dynamic adjustment of embedding dimensions during optimization 3. **Domain-Specific Evaluation**: Metrics tailored to specific application domains (e.g., geography, phylogenetics, proteomics, psycometrics, operations research, ...) # Conclusions This comprehensive evaluation demonstrates that the choice of embedding method significantly impacts results when dealing with non-Euclidean dissimilarity data. Key recommendations include: ## For Practitioners 1. **Use Euclidify for automated parameter optimization** rather than manual tuning 2. **Assess Euclidean deviation in your data** using eigenvalue analysis before selecting methods 3. **Evaluate distance preservation quality**. ## For Method Developers 1. **Missing data handling** should be considered a core feature, not an afterthought 2. **Robustness to non-Euclidean structure** varies significantly between approaches 3. **Automated parameter optimization** is essential for practical adoption 4. **Comprehensive evaluation frameworks** like this vignette are needed for informed model selection ## Scientific Impact The Topolow algorithm represents a meaningful contribution to the embedding literature by: - Providing a principled approach to missing data that avoids imputation artifacts - Demonstrating robustness across different types of non-Euclidean structure - Offering automated optimization tools that make the method accessible to non-experts - Establishing performance benchmarks against established methods This analysis provides a rigorous foundation for understanding when and how to apply force-directed embedding methods for challenging dissimilarity data, contributing to the broader goal of extracting meaningful low-dimensional representations from complex high-dimensional relationships. # Session Information ``` #> R version 4.3.3 (2024-02-29) #> Platform: aarch64-apple-darwin20 (64-bit) #> Running under: macOS 15.4.1 #> #> Matrix products: default #> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib #> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0 #> #> locale: #> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 #> #> time zone: America/New_York #> tzcode source: internal #> #> attached base packages: #> [1] stats graphics grDevices utils datasets methods base #> #> other attached packages: #> [1] RANN_2.6.2 smacof_2.1-7 e1071_1.7-16 colorspace_2.1-1 plotrix_3.8-4 MASS_7.3-60.0.1 #> [7] scales_1.4.0 reshape2_1.4.4 dplyr_1.1.4 ggplot2_3.5.2 topolow_2.0.0 testthat_3.2.3 #> #> loaded via a namespace (and not attached): #> [1] splines_4.3.3 later_1.4.2 filelock_1.0.3 tibble_3.3.0 rpart_4.1.24 #> [6] lifecycle_1.0.4 Rdpack_2.6.4 doParallel_1.0.17 rprojroot_2.0.4 globals_0.17.0 #> [11] processx_3.8.6 lattice_0.22-7 backports_1.5.0 magrittr_2.0.3 Hmisc_5.2-3 #> [16] plotly_4.11.0 rmarkdown_2.29 yaml_2.3.10 remotes_2.5.0 httpuv_1.6.16 #> [21] askpass_1.2.1 sessioninfo_1.2.3 pkgbuild_1.4.7 reticulate_1.42.0 minqa_1.2.8 #> [26] RColorBrewer_1.1-3 pkgload_1.4.0 Rtsne_0.17 purrr_1.0.4 rgl_1.3.24 #> [31] nnet_7.3-20 ggrepel_0.9.6 listenv_0.9.1 gdata_3.0.1 ellipse_0.5.0 #> [36] vegan_2.7-1 umap_0.2.10.0 RSpectra_0.16-2 parallelly_1.44.0 Racmacs_1.2.9 #> [41] permute_0.9-8 commonmark_2.0.0 codetools_0.2-20 xml2_1.3.8 tidyselect_1.2.1 #> [46] shape_1.4.6.1 farver_2.1.2 lme4_1.1-37 base64enc_0.1-3 roxygen2_7.3.2 #> [51] jsonlite_2.0.0 mitml_0.4-5 ellipsis_0.3.2 Formula_1.2-5 survival_3.8-3 #> [56] iterators_1.0.14 systemfonts_1.2.3 foreach_1.5.2 tools_4.3.3 ragg_1.4.0 #> [61] Rcpp_1.1.0 glue_1.8.0 gridExtra_2.3 pan_1.9 xfun_0.52 #> [66] mgcv_1.9-1 usethis_3.1.0 withr_3.0.2 fastmap_1.2.0 xopen_1.0.1 #> [71] boot_1.3-31 openssl_2.3.3 callr_3.7.6 digest_0.6.37 rcmdcheck_1.4.0 #> [76] R6_2.6.1 mime_0.13 mice_3.17.0 textshaping_1.0.1 gtools_3.9.5 #> [81] weights_1.0.4 utf8_1.2.6 tidyr_1.3.1 generics_0.1.4 data.table_1.17.8 #> [86] class_7.3-23 prettyunits_1.2.0 httr_1.4.7 htmlwidgets_1.6.4 pkgconfig_2.0.3 #> [91] gtable_0.3.6 brio_1.1.5 htmltools_0.5.8.1 profvis_0.4.0 png_0.1-8 #> [96] wordcloud_2.6 reformulas_0.4.1 knitr_1.50 rstudioapi_0.17.1 coda_0.19-4.1 #> [101] checkmate_2.3.2 nlme_3.1-168 curl_6.4.0 nloptr_2.2.1 proxy_0.4-27 #> [106] cachem_1.1.0 stringr_1.5.1 parallel_4.3.3 miniUI_0.1.2 foreign_0.8-90 #> [111] desc_1.4.3 pillar_1.11.0 grid_4.3.3 vctrs_0.6.5 urlchecker_1.0.1 #> [116] promises_1.3.3 jomo_2.7-6 xtable_1.8-4 lhs_1.2.0 cluster_2.1.8.1 #> [121] waldo_0.6.1 htmlTable_2.4.3 evaluate_1.0.4 cli_3.6.5 compiler_4.3.3 #> [126] rlang_1.1.6 labeling_0.4.3 ps_1.9.1 plyr_1.8.9 fs_1.6.6 #> [131] stringi_1.8.7 viridisLite_0.4.2 nnls_1.6 lazyeval_0.2.2 devtools_2.4.5 #> [136] glmnet_4.1-8 Matrix_1.6-5 future_1.40.0 shiny_1.11.1 rbibutils_2.3 #> [141] igraph_2.1.4 broom_1.0.8 memoise_2.0.1 ape_5.8-1 polynom_1.4-1 ```