---
title: "Benchmarking Recall and Latency"
output:
  litedown::html_format:
    meta:
      css: ["@default"]
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(bigANNOY.progress = FALSE)
set.seed(20260326)
```

`bigANNOY` includes exported benchmark helpers so you can measure several related things with the same interface:

- index build time
- search time
- optional Euclidean recall against an exact `bigKNN` baseline
- comparison against direct `RcppAnnoy`
- scaling with data volume and generated index size

This vignette shows how to use those helpers for both quick one-off runs and small parameter sweeps.

## What the Benchmark Helpers Do

The package currently exports four benchmark functions:

- `benchmark_annoy_bigmatrix()` for one build-and-search configuration
- `benchmark_annoy_recall_suite()` for a grid of `n_trees` and `search_k` settings on the same dataset
- `benchmark_annoy_vs_rcppannoy()` for a direct comparison between the package's `bigmemory` workflow and a dense `RcppAnnoy` baseline
- `benchmark_annoy_volume_suite()` for scaling studies across larger synthetic data sizes

These helpers can work with:

- synthetic data generated on the fly
- user-supplied dense matrices
- `big.matrix` inputs, descriptors, descriptor paths, and external pointers

They can also write summaries to CSV so results can be saved outside the current R session, and the comparison helpers add byte-oriented fields for the reference data, query data, Annoy index file, and total persisted artifacts.

## Load the Package

```{r}
library(bigANNOY)
```

## Create a Benchmark Workspace

We will write any temporary benchmark files into a dedicated directory so the workflow is easy to inspect.

```{r}
bench_dir <- tempfile("bigannoy-benchmark-")
dir.create(bench_dir, recursive = TRUE, showWarnings = FALSE)
bench_dir
```

## A Single Synthetic Benchmark Run

The simplest benchmark call uses synthetic data.
This is useful when you want a quick sense of how build and search times respond to `n_trees`, `search_k`, and the problem dimensions.

```{r}
single_csv <- file.path(bench_dir, "single.csv")
single <- benchmark_annoy_bigmatrix(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 3L,
  n_trees = 10L,
  search_k = 50L,
  exact = FALSE,
  path_dir = bench_dir,
  output_path = single_csv,
  load_mode = "eager"
)
single$summary
```

The returned object contains more than just the summary row.

```{r}
names(single)
single$params
single$exact_available
```

Because `exact = FALSE`, the benchmark skips the exact `bigKNN` comparison and focuses only on the approximate Annoy path.

## Validation Is Part of the Benchmark Workflow

The benchmark helpers also validate the built Annoy index before measuring the search step. That helps ensure the timing result corresponds to a usable, reopenable index rather than a partially successful build.

```{r}
single$validation$valid
single$validation$checks[, c("check", "passed", "severity")]
```

The same summary is also written to CSV when `output_path` is supplied.

```{r}
read.csv(single_csv, stringsAsFactors = FALSE)
```

## External-Query Versus Self-Search Benchmarks

One subtle but important detail is how synthetic data generation works:

- if `x = NULL` and `query` is omitted, the benchmark generates a separate synthetic query matrix
- if `x = NULL` and `query = NULL` is supplied explicitly, the benchmark runs self-search on the reference matrix

That difference is reflected in the `self_search` and `n_query` fields.
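To make the distinction concrete, here is a small brute-force sketch in plain base R (illustrative only, not package code) of the two search modes, using exhaustive Euclidean distances:

```{r}
# Toy reference data: 6 rows in 3 dimensions (row i holds i, i + 6, i + 12).
ref <- matrix(as.numeric(1:18), nrow = 6, ncol = 3)

# Two external query rows, each close to a specific reference row.
query <- rbind(c(1.1, 7.1, 13.1),   # near reference row 1
               c(5.9, 11.9, 17.9))  # near reference row 6

# Exhaustive Euclidean k-NN of each query row against the reference rows.
brute_knn <- function(reference, queries, k) {
  t(apply(queries, 1, function(q) {
    d <- sqrt(colSums((t(reference) - q)^2))
    order(d)[seq_len(k)]
  }))
}

# External-query mode: query rows are distinct from the reference rows.
external_idx <- brute_knn(ref, query, k = 2)
external_idx[, 1]
#> [1] 1 6

# Self-search mode: the reference rows are queried against themselves, so
# each row's nearest neighbour is the row itself (distance exactly zero).
self_idx <- brute_knn(ref, ref, k = 2)
all(self_idx[, 1] == seq_len(nrow(ref)))
#> [1] TRUE
```

Self-search therefore always reports each point as its own top hit, which is why the `self_search` flag matters when you interpret timing and recall numbers.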
```{r}
external_run <- benchmark_annoy_bigmatrix(
  n_ref = 120L,
  n_query = 12L,
  n_dim = 5L,
  k = 3L,
  n_trees = 8L,
  exact = FALSE,
  path_dir = bench_dir
)

self_run <- benchmark_annoy_bigmatrix(
  n_ref = 120L,
  query = NULL,
  n_dim = 5L,
  k = 3L,
  n_trees = 8L,
  exact = FALSE,
  path_dir = bench_dir
)

shape_cols <- c("self_search", "n_ref", "n_query", "k")
rbind(
  external = external_run[["summary"]][, shape_cols],
  self = self_run[["summary"]][, shape_cols]
)
```

That distinction matters when you are benchmarking workflows that mirror either training-set neighbour search or truly external query traffic.

## Benchmark a Recall Suite Across Parameter Grids

For tuning work, a single benchmark point is usually not enough. The suite helper runs a grid of `n_trees` and `search_k` values on the same dataset so you can compare trade-offs more systematically.

```{r}
suite_csv <- file.path(bench_dir, "suite.csv")
suite <- benchmark_annoy_recall_suite(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 3L,
  n_trees = c(5L, 10L),
  search_k = c(-1L, 50L),
  exact = FALSE,
  path_dir = bench_dir,
  output_path = suite_csv,
  load_mode = "eager"
)
suite$summary
```

Each row corresponds to one `(n_trees, search_k)` configuration on the same underlying benchmark dataset. The saved CSV contains the same summary table.

```{r}
read.csv(suite_csv, stringsAsFactors = FALSE)
```

## Optional Exact Recall Against bigKNN

For Euclidean workloads, the benchmark helpers can optionally compare Annoy results against the exact `bigKNN` baseline and report:

- `exact_elapsed`
- `recall_at_k`

That comparison is only available when the runtime package `bigKNN` is installed.
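Before running that comparison, it helps to know what `recall_at_k` measures: the standard recall-at-`k` definition, i.e. the mean, over query rows, of the fraction of a query's exact top-`k` neighbour ids that the approximate search also returned. A minimal base-R sketch of that computation (an illustration of the definition, not the package's internal code):

```{r}
# Each row holds the neighbour ids of the top-k hits for one query.
exact_ids  <- rbind(c(1L, 2L, 3L),
                    c(4L, 5L, 6L))
approx_ids <- rbind(c(1L, 3L, 9L),   # recovers 2 of the 3 exact ids
                    c(4L, 5L, 6L))   # recovers all 3

recall_at_k <- function(exact, approx) {
  per_query <- vapply(seq_len(nrow(exact)), function(i) {
    length(intersect(exact[i, ], approx[i, ])) / ncol(exact)
  }, numeric(1))
  mean(per_query)
}

recall_at_k(exact_ids, approx_ids)
#> [1] 0.8333333
```

A recall of 1 means the approximate search recovered every exact neighbour; lower values quantify what the Annoy approximation traded away for speed.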
```{r}
if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
  exact_run <- benchmark_annoy_bigmatrix(
    n_ref = 150L,
    n_query = 15L,
    n_dim = 5L,
    k = 3L,
    n_trees = 10L,
    search_k = 50L,
    metric = "euclidean",
    exact = TRUE,
    path_dir = bench_dir
  )
  # Inside an `if` block, only the last expression is auto-printed,
  # so print the intermediate result explicitly.
  print(exact_run$exact_available)
  exact_run$summary[, c("build_elapsed", "search_elapsed",
                        "exact_elapsed", "recall_at_k")]
} else {
  "Exact baseline example skipped because bigKNN is not installed."
}
```

This is the most direct way to answer the practical question, "How much search speed am I buying, and what recall do I lose in return?"

## Benchmark User-Supplied Data

Synthetic data is convenient, but real benchmarking usually needs real data. The benchmark helpers can also accept user-supplied reference and query inputs.

```{r}
ref <- matrix(rnorm(80 * 4), nrow = 80, ncol = 4)
query <- matrix(rnorm(12 * 4), nrow = 12, ncol = 4)

user_run <- benchmark_annoy_bigmatrix(
  x = ref,
  query = query,
  k = 3L,
  n_trees = 12L,
  search_k = 40L,
  exact = FALSE,
  filebacked = TRUE,
  path_dir = bench_dir,
  load_mode = "eager"
)
user_run$summary[, c(
  "filebacked", "self_search", "n_ref", "n_query", "n_dim",
  "build_elapsed", "search_elapsed"
)]
```

When `filebacked = TRUE`, dense reference inputs are first converted into a file-backed `big.matrix` before the Annoy build starts. That can be useful when you want the benchmark workflow to resemble the package's real persisted data path more closely.

## Compare bigANNOY with Direct RcppAnnoy

When you want to understand the cost of the `bigmemory`-oriented wrapper itself, the most useful benchmark is not an exact Euclidean baseline. It is a direct comparison with plain `RcppAnnoy`, using the same synthetic dataset, the same metric, the same `n_trees`, and the same `search_k`. That is what `benchmark_annoy_vs_rcppannoy()` provides.
```{r}
compare_csv <- file.path(bench_dir, "compare.csv")
compare_run <- benchmark_annoy_vs_rcppannoy(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 3L,
  n_trees = 10L,
  search_k = 50L,
  exact = FALSE,
  path_dir = bench_dir,
  output_path = compare_csv,
  load_mode = "eager"
)
compare_run$summary[, c(
  "implementation", "reference_storage", "n_ref", "n_query", "n_dim",
  "total_data_bytes", "index_bytes", "build_elapsed", "search_elapsed"
)]
```

This benchmark is useful for a different question from the earlier exact baseline:

- `benchmark_annoy_bigmatrix()` asks how approximate Annoy behaves on a given dataset and, optionally, how much recall it loses against exact `bigKNN`
- `benchmark_annoy_vs_rcppannoy()` asks how much overhead or benefit comes from the package's `bigmemory` and persistence workflow relative to direct `RcppAnnoy`

The output also includes data-volume fields:

- `ref_bytes`: estimated bytes in the reference matrix
- `query_bytes`: estimated bytes in the query matrix
- `total_data_bytes`: reference plus effective query volume
- `index_bytes`: bytes in the saved Annoy index
- `metadata_bytes`: bytes in the sidecar metadata file
- `artifact_bytes`: total bytes of persisted Annoy artifacts written by the workflow

The generated CSV contains the same comparison table.

```{r}
read.csv(compare_csv, stringsAsFactors = FALSE)[, c(
  "implementation", "ref_bytes", "query_bytes",
  "index_bytes", "metadata_bytes", "artifact_bytes"
)]
```

In practice, the comparison table helps answer two operational questions:

- Is `bigANNOY` close enough to plain `RcppAnnoy` on build and search speed for this workload?
- How large is the persisted Annoy index relative to the input data volume?

## Benchmark Scaling by Data Volume

A single comparison point is useful, but it does not tell you whether the wrapper overhead stays modest as the problem gets larger. The volume suite runs the same `bigANNOY` versus `RcppAnnoy` comparison across a grid of synthetic data sizes.
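Assuming the suite crosses the supplied `n_ref` and `n_dim` vectors, a call with two values of each produces four benchmark points per implementation. Before committing to a long run, you can preview the combinations, plus a rough dense-storage data volume per point, with base R (the 8-bytes-per-double estimate here is an illustration and may differ from the helper's own byte accounting):

```{r}
# Preview of the size grid, assuming the suite crosses the two vectors.
grid <- expand.grid(n_ref = c(200L, 500L), n_dim = c(6L, 12L))

# Rough reference-data volume per point, assuming dense 8-byte doubles.
grid$approx_ref_bytes <- grid$n_ref * grid$n_dim * 8
grid
```

A preview like this is a cheap sanity check that the largest point will still fit comfortably in memory and on disk.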
```{r}
volume_csv <- file.path(bench_dir, "volume.csv")
volume_run <- benchmark_annoy_volume_suite(
  n_ref = c(200L, 500L),
  n_query = 20L,
  n_dim = c(6L, 12L),
  k = 3L,
  n_trees = 10L,
  search_k = 50L,
  exact = FALSE,
  path_dir = bench_dir,
  output_path = volume_csv,
  load_mode = "eager"
)
volume_run$summary[, c(
  "implementation", "n_ref", "n_dim", "total_data_bytes",
  "index_bytes", "build_elapsed", "search_elapsed"
)]
```

This kind of table is especially useful when you want to prepare a more formal benchmark note for a package release or for internal performance regression tracking:

- it shows how build time changes as reference size grows
- it shows how query time changes as dimension grows
- it shows whether index size scales roughly as expected with data volume
- it makes the `bigANNOY` versus direct `RcppAnnoy` gap visible across more than one benchmark point

## Interpreting the Main Summary Columns

The most useful summary fields are:

- `build_elapsed`: time spent creating the Annoy index
- `search_elapsed`: time spent running the search step
- `exact_elapsed`: time spent on the exact Euclidean baseline, when available
- `recall_at_k`: average overlap with the exact top-`k` neighbours
- `implementation`: whether the row came from `bigANNOY` or direct `RcppAnnoy`
- `n_trees`: index quality/size control at build time
- `search_k`: query effort control at search time
- `self_search`: whether the benchmark searched the reference rows against themselves
- `filebacked`: whether dense reference data was converted into a file-backed `big.matrix`
- `ref_bytes`, `query_bytes`, and `index_bytes`: the rough data and artifact volume associated with the benchmark

In practice:

- raise `search_k` first when recall is too low
- increase `n_trees` when higher search budgets alone are not enough
- compare `search_elapsed` and `recall_at_k` together instead of optimizing either one in isolation
- use `benchmark_annoy_vs_rcppannoy()` when you want to reason about package overhead rather than approximate-versus-exact quality
- use `benchmark_annoy_volume_suite()` when you need a more formal scaling table for release notes or internal reports

## Installed Benchmark Runner

The package also installs a command-line benchmark script. That is convenient when you want to run a benchmark outside an interactive R session or save CSV output from shell scripts. The installed path is:

```{r}
system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY")
```

Example single-run command:

```sh
Rscript "$(Rscript -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=single \
  --n_ref=5000 \
  --n_query=500 \
  --n_dim=50 \
  --k=20 \
  --n_trees=100 \
  --search_k=5000 \
  --load_mode=eager
```

Example suite command:

```sh
Rscript "$(Rscript -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=suite \
  --n_ref=5000 \
  --n_query=500 \
  --n_dim=50 \
  --k=20 \
  --suite_trees=10,50,100 \
  --suite_search_k=-1,2000,10000 \
  --output_path=/tmp/bigannoy_suite.csv
```

Example direct-comparison command:

```sh
Rscript "$(Rscript -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=compare \
  --n_ref=5000 \
  --n_query=500 \
  --n_dim=50 \
  --k=20 \
  --n_trees=100 \
  --search_k=5000 \
  --load_mode=eager
```

Example volume-suite command:

```sh
Rscript "$(Rscript -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=volume \
  --suite_n_ref=2000,5000,10000 \
  --suite_n_query=200 \
  --suite_n_dim=20,50 \
  --k=10 \
  --n_trees=50 \
  --search_k=1000 \
  --output_path=/tmp/bigannoy_volume.csv
```

## Recommended Workflow

A practical tuning workflow usually looks like this:

1. start with a small single benchmark to confirm dimensions and plumbing
2. switch to a suite over a small `n_trees` by `search_k` grid
3. enable exact Euclidean benchmarking when `bigKNN` is available
4. compare recall and latency together
5. repeat the same workflow on user-supplied data before drawing conclusions

## Recap

`bigANNOY`'s benchmark helpers are designed to make performance work part of the normal package workflow, not a separate ad hoc script:

- `benchmark_annoy_bigmatrix()` for one configuration
- `benchmark_annoy_recall_suite()` for parameter sweeps
- `benchmark_annoy_vs_rcppannoy()` for direct implementation comparison
- `benchmark_annoy_volume_suite()` for speed and size scaling studies
- optional exact recall against `bigKNN`
- CSV output for saved summaries
- support for both synthetic and user-supplied data

The next vignette to read after this one is usually *Metrics and Tuning*, which goes deeper on how to choose metrics and search/build controls.