--- title: "spqrp on mock data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{spqrp on mock data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` This vignette mirrors `Example_Mock_Data.Rmd` from the Python package, but uses the native R package. No Python install is required. ## Load data The package ships a small mock cohort and a matching protein ranking, both in long format with the required columns (`Sample_ID`, `Patient_ID`, `Protein`, `Intensity`). ```{r, message = FALSE} library(spqrp) df <- spqrp_example_data("input_cohort_df") ranking <- spqrp_example_data("protein_ranking") head(df) ranking ``` ## Run clustering `run_clustering()` builds a kNN graph in a 2D embedding (PCA / MDS / UMAP), iteratively splits oversized components, and visualises the result. Use `n_neighbors = expected_samples_per_patient - 1` and `max_component_size = expected_samples_per_patient` for the typical case where every patient should appear in their own small cluster. ```{r} res <- run_clustering( df = df, ranking = ranking, n_neighbors = 1L, max_component_size = 2L, metric = "manhattan", method = "PCA", # switch to "UMAP" once a cohort is large enough plot_name = "Mock ranking on Mock data", #quiet = FALSE. #save_path = Mock_data_clustering.png ) ``` It is recommended to specify 'save_path' to save the clustering with a better visualization with less cramped clusters (adjustable figsize and dpi). The plot in `res$plot` is a ggplot2 object — you can save or modify it like any other ggplot. Cluster bookkeeping is also returned as plain R lists: ```{r} head(res$cluster_assignments, 8) res$uncertain_samples res$error_candidate_samples res$plot res$transitive_results ``` * `cluster_assignments` — named integer vector mapping sample → cluster ID * `uncertain_samples` — samples that should have a same-patient match in the cohort but ended up isolated * `error_candidate_samples` — samples connected to a *different*-patient neighbour (i.e. probable mix-ups) ## Threshold-based evaluation `perform_distance_evaluation_on_ranked_proteins()` computes pairwise distances on the top-`n` ranked proteins, picks a cutoff at the `p`-th percentile, and reports classification metrics against the patient ID ground truth. ```{r} result <- perform_distance_evaluation_on_ranked_proteins( df = df, top_importance_df = ranking, metric = "manhattan", p = 0.989, n = 4L, ) result$cutoff result$eval_metrics[c("TP", "FP", "FN", "TN", "Precision", "Sensitivity", "F1")] result$plot ``` ## Compute your own protein ranking If you don't have a precomputed ranking, train a pairwise random-forest classifier on the raw data. Three balancing strategies are available via `classifier_backend`; the differences are documented at . **Avoid applying preprocessing or filtering steps to the data frame you pass in.** Anything that uses cohort-wide information — median normalization, occurrence filtering, outlier detection on the full cohort — leaks the held-out fold into the training fold and inflates the importance estimates. The built-in `outlier_removal = TRUE` option is safe: it fits the isolation forest inside each train/test split rather than on the whole cohort. ```{r} results <- train_with_normalise( df, plate_corrected = FALSE, # mock data has no plate column outlier_removal = FALSE # skip on tiny data # classifier_backend defaults to "randomForest" — closest to imblearn's # BalancedRandomForestClassifier. Pass "ranger" for a faster (but more # divergent from original Python package) backend. ) new_ranking <- retrieve_ranking(results) new_ranking ``` ## Filtering (optional) Isolation-forest-based outlier detection flags samples whose protein profiles look anomalous relative to the rest of the cohort. Skip it on very small cohorts (the algorithm has little to learn from < ~30 samples) — it's mostly useful at production scale where one mis-pipetted sample can drag clustering off course. ```{r} # Drop flagged samples in one step: filtered <- remove_outlier_samples(df, contamination = "auto") filtered$outlier_list # samples flagged as anomalous filtered$anomaly_df # per-sample anomaly scores head(filtered$df) # input df minus the flagged samples ``` ```{r, eval = FALSE} #Inspect the score distribution before deciding on a cutoff: filtered$anomaly_plot # Or call the underlying detector directly to keep the original df and # only act on the flag list — useful when you want to surface candidates # without auto-removing them: forest <- by_isolation_forest(df, impute_median = TRUE, contamination = 0.05) # top 5% by score forest$outlier_list ``` `contamination = "auto"` uses an anomaly-score cutoff calibrated for `solitude` (the underlying engine; see `?by_isolation_forest` for why the default is `0.6` rather than sklearn's `0.5`). Pass a numeric fraction like `contamination = 0.1` to flag a fixed percentile instead. `train_with_normalise()` also runs the same filter internally via `outlier_removal = TRUE` (default), so you don't need to call this explicitly before training for creating the protein ranking — only when you want to inspect or pre-filter the cohort (e.g. for the clustering or threshold approach) yourself. ## The complete SQRP pipeline: SPQRP consists of six main steps: preprocessing, protein selection with a Balanced Random Forest Classifier (BRF-Classifier), distance calculation, clustering-based classification, threshold-based classification, and performance calculation. The full SPQRP R packages functions can be used to run them in three stages: (1) preprocessing, (2) protein ranking via `train_with_normalise()`, and (3) clustering plus threshold-based evaluation on the filtered cohort (including, distance calculation, clustering- or respectively threshold-based classification, and performance calculation in one function). To follow it end-to-end, feed the `new_ranking` and (optionally) filtered data frame from stages 1–2 into `run_clustering()` and `perform_distance_evaluation_on_ranked_proteins()`: ```{r} res <- run_clustering( df = filtered$df, ranking = new_ranking, n_neighbors = 1L, max_component_size = 2L, metric = "manhattan", method = "PCA", # switch to "UMAP" once a cohort is large enough plot_name = "Mock ranking on Mock data", #quiet = FALSE. #save_path = filtered_mock_data_clustering.png ) head(res$cluster_assignments, 8) res$uncertain_samples res$error_candidate_samples res$plot ``` ```{r} result <- perform_distance_evaluation_on_ranked_proteins( df = filtered$df, top_importance_df = new_ranking, metric = "manhattan", p = 0.989, n = 4L, ) result$cutoff result$eval_metrics[c("TP", "FP", "FN", "TN", "Precision", "Sensitivity", "F1")] result$plot ``` ## Preprocessing (optional) The package mirrors the Python preprocessing pipeline. None of these steps are required for clustering — they're convenience helpers if you want to start from raw intensities. ```{r, eval = FALSE} df_raw <- spqrp_example_data("input_cohort_df") # Zeros become NA so missingness is explicit df_raw$Intensity[df_raw$Intensity == 0] <- NA df_pp <- df_raw |> log_transform() |> filter_by_occurrence(0.7) # Per-sample median normalization (returns list(data, plot)) norm <- normalize_medianintensity(df_pp, plot = FALSE) df_pp <- norm$data # Plate-effect residualisation if a `plate` column exists df_pp <- plate_correct_residuals_by_protein(df_pp) ```