bigKNN is a natural exact backend for evaluating
approximate nearest-neighbour systems. It gives you two especially
useful tools:
k neighbours that act as trusted ground
truthbig.matrix referenceThis vignette stays package-independent on purpose. Instead of depending on a specific ANN backend, it uses a small exact reference set plus a hand-built candidate matrix that behaves like approximate output.
Approximate search is usually evaluated against an exact baseline. Without that baseline, it is hard to answer basic questions:
k
quality enough to justify the extra cost?Exact search is also useful as a debugging tool. If an ANN system suddenly behaves differently after a tuning change, exact neighbours give you a stable reference point for comparing what changed.
We will use eight reference rows and three queries. The example is small so the exact neighbours and the approximate candidate sets remain easy to inspect.
reference_points <- data.frame(
id = paste0("r", 1:8),
x1 = c(0.0, 0.2, 1.0, 1.2, 3.0, 3.2, 4.0, 4.2),
x2 = c(0.0, 0.1, 0.9, 1.0, 3.0, 3.1, 4.0, 4.1),
x3 = c(0.5, 0.4, 1.2, 1.1, 2.8, 2.9, 3.8, 3.9)
)
query_points <- data.frame(
id = paste0("q", 1:3),
x1 = c(0.1, 1.1, 3.1),
x2 = c(0.0, 1.0, 3.0),
x3 = c(0.45, 1.15, 2.85)
)
reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2", "x3")]))
query_matrix <- as.matrix(query_points[c("x1", "x2", "x3")])
reference_points
#> id x1 x2 x3
#> 1 r1 0.0 0.0 0.5
#> 2 r2 0.2 0.1 0.4
#> 3 r3 1.0 0.9 1.2
#> 4 r4 1.2 1.0 1.1
#> 5 r5 3.0 3.0 2.8
#> 6 r6 3.2 3.1 2.9
#> 7 r7 4.0 4.0 3.8
#> 8 r8 4.2 4.1 3.9
query_points
#> id x1 x2 x3
#> 1 q1 0.1 0 0.45
#> 2 q2 1.1 1 1.15
#> 3 q3 3.1 3 2.85bigKNNThe exact reference result is just the exact k-NN output
for the same reference, query set, metric, and k that you
want to evaluate.
exact <- knn_bigmatrix(
reference,
query = query_matrix,
k = 3,
metric = "euclidean",
exclude_self = FALSE
)
exact
#> <bigknn_knn_result>
#> metric: euclidean
#> k: 3
#> queries: 3
#> references: 8
#> backend: bruteforce
neighbor_table(
exact$index,
query_ids = query_points$id,
ref_ids = reference_points$id,
distance = exact$distance
)
#> query rank neighbor distance
#> 1 q1 1 r1 0.1118
#> 2 q1 2 r2 0.1500
#> 3 q1 3 r3 1.4773
#> 4 q2 1 r4 0.1118
#> 5 q2 2 r3 0.1500
#> 6 q2 3 r2 1.4773
#> 7 q3 1 r5 0.1118
#> 8 q3 2 r6 0.1500
#> 9 q3 3 r7 1.6470This object is the ground truth for the rest of the vignette. Every later comparison asks how closely an approximate candidate set matches these exact top three neighbours.
recall_against_exact()Suppose another package returns the following approximate top-3 neighbour index matrix:
approx_top3 <- rbind(
c(8, 3, 1),
c(7, 4, 2),
c(6, 2, 5)
)
neighbor_table(
approx_top3,
query_ids = query_points$id,
ref_ids = reference_points$id
)
#> query rank neighbor
#> 1 q1 1 r8
#> 2 q1 2 r3
#> 3 q1 3 r1
#> 4 q2 1 r7
#> 5 q2 2 r4
#> 6 q2 3 r2
#> 7 q3 1 r6
#> 8 q3 2 r2
#> 9 q3 3 r5Each query still recovers some correct neighbours, but one true top-3
neighbour is missing for every row. We can quantify that with
recall_against_exact():
recall_before <- recall_against_exact(exact, approx_top3, k = 3)
recall_before
#> <bigknn_recall>
#> overall: 0.6667
#> k: 3
#> queries: 3
data.frame(
query = query_points$id,
recall = recall_before$per_query,
row.names = NULL
)
#> query recall
#> 1 q1 0.6666667
#> 2 q2 0.6666667
#> 3 q3 0.6666667Recall is set-based rather than rank-based:
k ids appear in the
approximate top-koverall is the mean of the per-query recallsThat makes recall a good first coverage metric, but not a full ranking metric. An ANN system can have excellent recall while still misordering the neighbours within the recovered set.
rerank_candidates_bigmatrix()Reranking is useful when an approximate system returns a wider
candidate pool than the final k you care about. The idea
is:
top_kHere is a 5-candidate pool that contains the true exact top-3 ids for every query, but not always in the first three slots:
candidate_pool <- rbind(
c(8, 3, 1, 2, 6),
c(7, 4, 2, 3, 1),
c(6, 2, 5, 7, 1)
)
neighbor_table(
candidate_pool,
query_ids = query_points$id,
ref_ids = reference_points$id
)
#> query rank neighbor
#> 1 q1 1 r8
#> 2 q1 2 r3
#> 3 q1 3 r1
#> 4 q1 4 r2
#> 5 q1 5 r6
#> 6 q2 1 r7
#> 7 q2 2 r4
#> 8 q2 3 r2
#> 9 q2 4 r3
#> 10 q2 5 r1
#> 11 q3 1 r6
#> 12 q3 2 r2
#> 13 q3 3 r5
#> 14 q3 4 r7
#> 15 q3 5 r1Now rerank those candidates exactly against the reference matrix:
reranked <- rerank_candidates_bigmatrix(
reference,
query = query_matrix,
candidate_index = candidate_pool,
metric = "euclidean",
top_k = 3
)
reranked
#> <bigknn_knn_result>
#> metric: euclidean
#> k: 3
#> queries: 3
#> references: 8
#> backend: candidate-rerank
neighbor_table(
reranked$index,
query_ids = query_points$id,
ref_ids = reference_points$id,
distance = reranked$distance
)
#> query rank neighbor distance
#> 1 q1 1 r1 0.1118
#> 2 q1 2 r2 0.1500
#> 3 q1 3 r3 1.4773
#> 4 q2 1 r4 0.1118
#> 5 q2 2 r3 0.1500
#> 6 q2 3 r2 1.4773
#> 7 q3 1 r5 0.1118
#> 8 q3 2 r6 0.1500
#> 9 q3 3 r7 1.6470The reranked result now matches the exact top-3 output, because the true neighbours were present somewhere in the candidate pool.
The usual evaluation pattern is:
recall_against_exact()recall_after <- recall_against_exact(exact, reranked, k = 3)
data.frame(
stage = c("Approximate top-3", "Reranked top-3 from 5 candidates"),
overall_recall = c(recall_before$overall, recall_after$overall),
row.names = NULL
)
#> stage overall_recall
#> 1 Approximate top-3 0.6666667
#> 2 Reranked top-3 from 5 candidates 1.0000000In this example, the approximate top-3 recall is 2/3,
while reranking the 5-wide candidate pool recovers perfect top-3
recall.
A practical package-independent workflow looks like this:
knn_bigmatrix() once to compute exact
top-k neighbours on that subset.big.matrix.recall_against_exact().k, rerank it exactly with
rerank_candidates_bigmatrix().This keeps responsibilities clean:
bigKNN supplies exact truth and exact candidate
rerankingThere are two distinct questions to keep apart.
First: does the approximate stage cover the correct neighbours at all?
kSecond: if the correct ids are present somewhere in the candidate
pool, are they ranked well enough for the final top-k?
k coverage when true
neighbours are present just outside the ANN top-kThe wider the candidate pool, the better reranking can work, but the more exact distance calculations it has to do.
Reranking improves ordering only within the candidates you provide. It cannot invent missing neighbours. If the candidate pool is already too narrow, recall may stay capped even after exact reranking:
reranked_limited <- rerank_candidates_bigmatrix(
reference,
query = query_matrix,
candidate_index = approx_top3,
metric = "euclidean",
top_k = 3
)
recall_after_limited <- recall_against_exact(exact, reranked_limited, k = 3)
data.frame(
stage = c("Approximate top-3", "Reranked top-3 from same 3 candidates"),
overall_recall = c(recall_before$overall, recall_after_limited$overall),
row.names = NULL
)
#> stage overall_recall
#> 1 Approximate top-3 0.6666667
#> 2 Reranked top-3 from same 3 candidates 0.6666667Other important caveats:
recall_against_exact() measures set overlap, not rank
qualitybigKNN, but
currently densified on the R side before exact computationUsed this way, bigKNN becomes a clean exact companion to
approximate search: it tells you what the right neighbours are, how
often the ANN system finds them, and whether a wider candidate pool can
be turned into a better final result by exact reranking.