---
title: "Using bigKNN as Exact Ground Truth"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using bigKNN as Exact Ground Truth}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

`bigKNN` is a natural exact backend for evaluating approximate nearest-neighbour systems. It gives you two especially useful tools:

- exact top-`k` neighbours that act as trusted ground truth
- exact reranking of approximate candidate sets against the original `big.matrix` reference

This vignette stays package-independent on purpose. Instead of depending on a specific ANN backend, it uses a small exact reference set plus a hand-built candidate matrix that behaves like approximate output.

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
if (!requireNamespace("bigmemory", quietly = TRUE)) {
  cat("This vignette requires the 'bigmemory' package.\n")
  knitr::knit_exit()
}
library(bigKNN)
library(bigmemory)
```

```{r helpers, include=FALSE}
neighbor_table <- function(index, query_ids, ref_ids, distance = NULL) {
  do.call(rbind, lapply(seq_along(query_ids), function(i) {
    out <- data.frame(
      query = query_ids[i],
      rank = seq_len(ncol(index)),
      neighbor = ref_ids[index[i, ]],
      row.names = NULL
    )
    if (!is.null(distance)) {
      out$distance <- signif(distance[i, ], 5)
    }
    out
  }))
}
```

# Why exact ground truth matters

Approximate search is usually evaluated against an exact baseline. Without that baseline, it is hard to answer basic questions:

- Are the returned neighbours actually correct?
- Is the approximate system missing true near neighbours, or just ranking them oddly?
- Does widening the ANN candidate pool improve top-`k` quality enough to justify the extra cost?

Exact search is also useful as a debugging tool. If an ANN system suddenly behaves differently after a tuning change, exact neighbours give you a stable reference point for comparing what changed.
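As a point of reference, here is what an exact Euclidean top-`k` amounts to on plain dense matrices. This is an illustrative base-R sketch only: `exact_topk_dense` is defined here for the vignette and is not part of `bigKNN`, which performs conceptually the same computation against `big.matrix` storage.

```{r exact-brute-sketch}
# Illustrative brute-force exact top-k on dense matrices (not package code).
exact_topk_dense <- function(reference, query, k) {
  index <- matrix(0L, nrow(query), k)
  distance <- matrix(0, nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # Euclidean distance from query i to every reference row
    d <- sqrt(rowSums(sweep(reference, 2, query[i, ])^2))
    ord <- order(d)[seq_len(k)]
    index[i, ] <- ord
    distance[i, ] <- d[ord]
  }
  list(index = index, distance = distance)
}

ref <- rbind(c(0, 0), c(1, 0), c(5, 5))
res <- exact_topk_dense(ref, rbind(c(0.1, 0)), k = 2)
res$index  # the two nearest reference rows for the single query
```

Everything the baseline gives you follows from this exhaustive computation: there is no index, no pruning, and therefore no possibility of a missed neighbour.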
# Build a small exact evaluation set

We will use eight reference rows and three queries. The example is small so the exact neighbours and the approximate candidate sets remain easy to inspect.

```{r create-data}
reference_points <- data.frame(
  id = paste0("r", 1:8),
  x1 = c(0.0, 0.2, 1.0, 1.2, 3.0, 3.2, 4.0, 4.2),
  x2 = c(0.0, 0.1, 0.9, 1.0, 3.0, 3.1, 4.0, 4.1),
  x3 = c(0.5, 0.4, 1.2, 1.1, 2.8, 2.9, 3.8, 3.9)
)

query_points <- data.frame(
  id = paste0("q", 1:3),
  x1 = c(0.1, 1.1, 3.1),
  x2 = c(0.0, 1.0, 3.0),
  x3 = c(0.45, 1.15, 2.85)
)

reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2", "x3")]))
query_matrix <- as.matrix(query_points[c("x1", "x2", "x3")])

reference_points
query_points
```

# Producing exact neighbours with `bigKNN`

The exact reference result is simply the exact `k`-NN output for the same reference set, query set, metric, and `k` that you want to evaluate.

```{r exact-truth}
exact <- knn_bigmatrix(
  reference,
  query = query_matrix,
  k = 3,
  metric = "euclidean",
  exclude_self = FALSE
)

exact

neighbor_table(
  exact$index,
  query_ids = query_points$id,
  ref_ids = reference_points$id,
  distance = exact$distance
)
```

This object is the ground truth for the rest of the vignette. Every later comparison asks how closely an approximate candidate set matches these exact top-three neighbours.

# Measuring recall with `recall_against_exact()`

Suppose another package returns the following approximate top-3 neighbour index matrix:

```{r approx-top3}
approx_top3 <- rbind(
  c(8, 3, 1),
  c(7, 4, 2),
  c(6, 2, 5)
)

neighbor_table(
  approx_top3,
  query_ids = query_points$id,
  ref_ids = reference_points$id
)
```

Each query still recovers some correct neighbours, but one true top-3 neighbour is missing from every row.
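That claim is easy to check by hand. The sketch below hard-codes the exact top-3 ids read off the exact result above (only set membership matters here, not order) and counts the overlap per query:

```{r manual-recall}
# Set-based recall computed by hand. The exact top-3 ids are transcribed
# from the exact result above; their within-row order is irrelevant here.
exact_index <- rbind(
  c(1, 2, 3),  # q1
  c(4, 3, 2),  # q2
  c(5, 6, 7)   # q3
)

approx_top3 <- rbind(
  c(8, 3, 1),
  c(7, 4, 2),
  c(6, 2, 5)
)

# Fraction of true top-3 ids recovered by each approximate row
per_query <- sapply(seq_len(nrow(exact_index)), function(i) {
  mean(exact_index[i, ] %in% approx_top3[i, ])
})

per_query        # 2 of 3 true neighbours recovered for every query
mean(per_query)  # overall recall: 2/3
```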
We can quantify that with `recall_against_exact()`:

```{r recall}
recall_before <- recall_against_exact(exact, approx_top3, k = 3)
recall_before

data.frame(
  query = query_points$id,
  recall = recall_before$per_query,
  row.names = NULL
)
```

Recall is set-based rather than rank-based:

- it checks whether the exact top-`k` ids appear in the approximate top-`k`
- it does not care whether those ids are in the same internal order
- `overall` is the mean of the per-query recalls

That makes recall a good first coverage metric, but not a full ranking metric. An ANN system can have excellent recall while still misordering the neighbours within the recovered set.

# Reranking candidate sets with `rerank_candidates_bigmatrix()`

Reranking is useful when an approximate system returns a wider candidate pool than the final `k` you care about. The idea is:

1. let the ANN stage generate a shortlist
2. compute exact distances only within that shortlist
3. sort the shortlist exactly and keep the best `top_k`

Here is a 5-candidate pool that contains the true exact top-3 ids for every query, but not always in the first three slots:

```{r candidate-pool}
candidate_pool <- rbind(
  c(8, 3, 1, 2, 6),
  c(7, 4, 2, 3, 1),
  c(6, 2, 5, 7, 1)
)

neighbor_table(
  candidate_pool,
  query_ids = query_points$id,
  ref_ids = reference_points$id
)
```

Now rerank those candidates exactly against the reference matrix:

```{r rerank}
reranked <- rerank_candidates_bigmatrix(
  reference,
  query = query_matrix,
  candidate_index = candidate_pool,
  metric = "euclidean",
  top_k = 3
)

reranked

neighbor_table(
  reranked$index,
  query_ids = query_points$id,
  ref_ids = reference_points$id,
  distance = reranked$distance
)
```

The reranked result now matches the exact top-3 output, because the true neighbours were present somewhere in the candidate pool.

# Comparing approximate candidates against exact results

The usual evaluation pattern is:

1. compute exact neighbours once on an evaluation subset
2. compare ANN output to exact truth with `recall_against_exact()`
3. rerank a wider candidate pool exactly
4. compare recall again after reranking

```{r compare-before-after}
recall_after <- recall_against_exact(exact, reranked, k = 3)

data.frame(
  stage = c("Approximate top-3", "Reranked top-3 from 5 candidates"),
  overall_recall = c(recall_before$overall, recall_after$overall),
  row.names = NULL
)
```

In this example, the approximate top-3 recall is `2/3`, while reranking the 5-wide candidate pool recovers perfect top-3 recall.

# Suggested workflow alongside a separate ANN package

A practical package-independent workflow looks like this:

1. Choose an evaluation subset of queries and references.
2. Run `knn_bigmatrix()` once to compute exact top-`k` neighbours on that subset.
3. Run your ANN package with the same metric definition and collect its candidate indices as a matrix or `big.matrix`.
4. Measure ANN recall with `recall_against_exact()`.
5. If the ANN backend produces a wider candidate pool than the final `k`, rerank it exactly with `rerank_candidates_bigmatrix()`.
6. Compare recall before and after reranking, and decide whether the candidate width is worth the cost.

This keeps responsibilities clean:

- the ANN package handles fast candidate generation
- `bigKNN` supplies exact truth and exact candidate reranking

# Interpreting recall and reranking results

There are two distinct questions to keep apart.

First: does the approximate stage cover the correct neighbours at all?

- low recall means true neighbours are missing from the approximate top-`k`
- widening the ANN candidate pool is often the first fix

Second: if the correct ids are present somewhere in the candidate pool, are they ranked well enough for the final top-`k`?
- reranking can repair ordering within the supplied candidate set
- reranking can also improve top-`k` coverage when true neighbours sit just outside the ANN top-`k`

The wider the candidate pool, the more reranking can recover, but the more exact distance calculations it has to perform.

# Limits and caveats

Reranking improves ordering only within the candidates you provide. It cannot invent missing neighbours. If the candidate pool is already too narrow, recall may stay capped even after exact reranking:

```{r rerank-limit}
reranked_limited <- rerank_candidates_bigmatrix(
  reference,
  query = query_matrix,
  candidate_index = approx_top3,
  metric = "euclidean",
  top_k = 3
)

recall_after_limited <- recall_against_exact(exact, reranked_limited, k = 3)

data.frame(
  stage = c("Approximate top-3", "Reranked top-3 from same 3 candidates"),
  overall_recall = c(recall_before$overall, recall_after_limited$overall),
  row.names = NULL
)
```

Other important caveats:

- exact search still has a real computational cost, so evaluation often uses a representative subset rather than the entire production workload
- exact and approximate systems must use the same metric definition for recall to mean anything
- `recall_against_exact()` measures set overlap, not rank quality
- sparse query inputs are supported in `bigKNN`, but are currently densified on the R side before exact computation

Used this way, `bigKNN` becomes a clean exact companion to approximate search: it tells you what the right neighbours are, how often the ANN system finds them, and whether a wider candidate pool can be turned into a better final result by exact reranking.
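As a closing illustration, the whole evaluate-then-rerank loop can be condensed into one self-contained base-R sketch. It reuses the example coordinates from this vignette, with plain matrices (`ref_mat`, `query_mat`) standing in for the `big.matrix` object and the package calls:

```{r end-to-end-sketch}
# Self-contained sketch of the evaluate-then-rerank loop, mirroring the
# bigKNN calls above without depending on them.
ref_mat <- rbind(
  c(0.0, 0.0, 0.50), c(0.2, 0.1, 0.40), c(1.0, 0.9, 1.20), c(1.2, 1.0, 1.10),
  c(3.0, 3.0, 2.80), c(3.2, 3.1, 2.90), c(4.0, 4.0, 3.80), c(4.2, 4.1, 3.90)
)
query_mat <- rbind(c(0.1, 0.0, 0.45), c(1.1, 1.0, 1.15), c(3.1, 3.0, 2.85))

# Step 1: exact top-3 ground truth by brute force
exact_index <- t(sapply(seq_len(nrow(query_mat)), function(i) {
  d <- sqrt(rowSums(sweep(ref_mat, 2, query_mat[i, ])^2))
  order(d)[1:3]
}))

# Step 2: the 5-wide approximate candidate pool from the vignette
candidate_pool <- rbind(
  c(8, 3, 1, 2, 6),
  c(7, 4, 2, 3, 1),
  c(6, 2, 5, 7, 1)
)

# Step 3: exact rerank within each candidate row, keeping the best 3
reranked_index <- t(sapply(seq_len(nrow(query_mat)), function(i) {
  cand <- candidate_pool[i, ]
  d <- sqrt(rowSums(sweep(ref_mat[cand, , drop = FALSE], 2, query_mat[i, ])^2))
  cand[order(d)][1:3]
}))

# Step 4: set-based recall of the reranked result against exact truth
recall <- mean(sapply(seq_len(nrow(exact_index)), function(i) {
  mean(exact_index[i, ] %in% reranked_index[i, ])
}))
recall  # 1: the 5-wide pool contains every true top-3 neighbour
```

Because every true top-3 id appears somewhere in the 5-wide pool, exact reranking recovers perfect recall, which is exactly the behaviour the `rerank_candidates_bigmatrix()` example above demonstrates at package level.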