---
title: "Metrics and Tuning"
output:
  litedown::html_format:
    meta:
      css: ["@default"]
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(bigANNOY.progress = FALSE)
set.seed(20260326)
```

`bigANNOY` exposes two kinds of choices that matter in practice:

- the **metric**, which defines what "near" means
- the **tuning controls**, which trade build cost, search cost, and search quality against one another

This vignette walks through both with small concrete examples and then ends with a lightweight tuning workflow you can reuse on your own data.

## Load the Packages

```{r}
library(bigANNOY)
library(bigmemory)
```

## A Small Dataset for Metric Comparisons

To make metric behavior easier to see, we will use a tiny reference set with a few deliberately different vector directions and magnitudes.

```{r}
tune_dir <- tempfile("bigannoy-tuning-")
dir.create(tune_dir, recursive = TRUE, showWarnings = FALSE)

ref_labels <- c(
  "unit_x", "double_x", "unit_y", "tilted_x", "unit_z", "diag_xy"
)
ref_dense <- matrix(
  c(
    1.0, 0.0, 0.0,
    2.0, 0.0, 0.0,
    0.0, 1.0, 0.0,
    0.8, 0.2, 0.0,
    0.0, 0.0, 1.0,
    1.0, 1.0, 0.0
  ),
  ncol = 3,
  byrow = TRUE
)
query_dense <- matrix(
  c(
    1.0, 0.0, 0.0,
    0.9, 0.1, 0.0
  ),
  ncol = 3,
  byrow = TRUE
)
ref_big <- as.big.matrix(ref_dense)

data.frame(
  index = seq_along(ref_labels),
  label = ref_labels,
  ref_dense,
  row.names = NULL
)
```

## Supported Metrics

`bigANNOY` currently supports:

- `"euclidean"`
- `"angular"`
- `"manhattan"`
- `"dot"`

The most important rule of thumb is that **distances are only directly comparable within the same metric**. A Euclidean distance and an angular distance are not on the same scale and should not be interpreted as if they meant the same thing.

## Compare Metrics on the Same Queries

Here is the same search performed under all four metrics.
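Before reaching for the index, it can help to see what each metric actually computes. The sketch below evaluates all four distance formulas by hand in plain base R for the first query (which equals `unit_x`) against the `tilted_x` reference row. The angular formula `sqrt(2 * (1 - cosine))` is Annoy's documented convention; the sign and scale the backend reports for `"dot"` may differ, so treat that last value as illustrative.

```{r}
# First query vector and the "tilted_x" reference row from the toy data above.
q <- c(1.0, 0.0, 0.0)
r <- c(0.8, 0.2, 0.0)

euclidean <- sqrt(sum((q - r)^2))  # straight-line distance
manhattan <- sum(abs(q - r))       # coordinatewise absolute deviations
cosine    <- sum(q * r) / (sqrt(sum(q^2)) * sqrt(sum(r^2)))
angular   <- sqrt(2 * (1 - cosine))  # Annoy's angular distance
dot       <- sum(q * r)              # raw inner product (higher = closer)

# euclidean = 0.283, manhattan = 0.4, angular = 0.244, dot = 0.8
round(c(euclidean = euclidean, manhattan = manhattan,
        angular = angular, dot = dot), 3)
```

Note how `"dot"` rewards magnitude: against this same query, `double_x` has inner product `2.0` while `unit_x` has `1.0`, even though both point in exactly the same direction.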
```{r}
metric_table <- do.call(
  rbind,
  lapply(c("euclidean", "angular", "manhattan", "dot"), function(metric) {
    index_path <- file.path(tune_dir, sprintf("%s.ann", metric))
    idx <- annoy_build_bigmatrix(
      ref_big,
      path = index_path,
      metric = metric,
      n_trees = 20L,
      seed = 123L,
      load_mode = "eager"
    )
    res <- annoy_search_bigmatrix(
      idx,
      query = query_dense,
      k = 2L,
      search_k = 100L
    )
    data.frame(
      metric = metric,
      q1_top1 = ref_labels[res$index[1, 1]],
      q1_distance = round(res$distance[1, 1], 3),
      q2_top1 = ref_labels[res$index[2, 1]],
      q2_distance = round(res$distance[2, 1], 3),
      stringsAsFactors = FALSE
    )
  })
)
metric_table
```

Even on this toy example, the metric choice changes how rows are ranked. The practical interpretation is:

- use `"euclidean"` when straight-line distance in the original space is what you care about, and especially when you want the most direct comparison with `bigKNN`
- use `"angular"` when vector direction matters more than magnitude
- use `"manhattan"` when coordinatewise absolute deviations are a more natural notion of difference than Euclidean distance
- use `"dot"` when inner-product style ranking is closer to the scoring rule you want

For non-Euclidean metrics, treat the returned `distance` matrix as the Annoy-backend distance for that metric rather than as something you can compare directly to Euclidean values.

## Build-Time Controls

The most important build-time controls are:

- `n_trees`
- `seed`
- `build_threads`
- `block_size`
- `load_mode`

### n_trees

`n_trees` is the main quality-versus-build-cost knob at index build time.

- more trees usually improve search quality
- more trees usually increase build time and index size
- very small tree counts are useful for quick experiments but often not for final production settings

### seed

`seed` makes index construction reproducible. This is especially useful when you are benchmarking different settings and want to reduce one source of variation between runs.
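The effect is the same as seeding any stochastic procedure in R: two runs that start from the same seed make identical random choices. A minimal base-R illustration, with `pick_split()` as a hypothetical stand-in for one randomized split decision during tree construction:

```{r}
# Hypothetical stand-in for a randomized decision made while building a tree.
pick_split <- function(seed) {
  set.seed(seed)
  sample(1:6, 3)
}

identical(pick_split(123L), pick_split(123L))  # TRUE: same seed, same draws
```

Fixing `seed` in `annoy_build_bigmatrix()` plays the same role for tree construction, so timing or quality differences between benchmark runs are not confounded by different random splits.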
### build_threads

`build_threads` is passed to the native C++ backend.

- `-1L` means "use Annoy's default"
- positive integers request an explicit build-thread count
- the debug-only R backend ignores this control

### block_size

`block_size` controls how many rows are processed per streamed block while building and searching. This is mostly an execution-behavior knob, not a quality knob.

- smaller blocks can reduce transient memory pressure
- larger blocks can reduce overhead in some workloads

### load_mode

`load_mode` controls session behavior, not search quality:

- `"lazy"` delays opening the native handle until first search
- `"eager"` opens the handle immediately

Here is a simple side-by-side example.

```{r}
lazy_index <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(tune_dir, "lazy.ann"),
  metric = "euclidean",
  n_trees = 8L,
  seed = 123L,
  load_mode = "lazy"
)
eager_index <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(tune_dir, "eager.ann"),
  metric = "euclidean",
  n_trees = 25L,
  seed = 123L,
  load_mode = "eager"
)
c(
  lazy_loaded = annoy_is_loaded(lazy_index),
  eager_loaded = annoy_is_loaded(eager_index)
)
```

## Query-Time Controls

The most important search-time controls are:

- `k`
- `search_k`
- `block_size`
- `prefault`

### k

`k` is simply the number of neighbours you want returned. It changes the shape of the result and the amount of work the search must do.

### search_k

`search_k` is the main quality-versus-search-cost knob at query time.

- larger values usually improve search quality
- larger values usually increase search time
- `-1L` lets Annoy use its default search budget

When you start tuning, this is usually the first knob to increase.

### block_size

At search time, `block_size` controls how many query rows are processed per block. As with build-time blocking, this affects execution behavior more than quality.

### prefault

`prefault` controls how the persisted Annoy index is loaded by the native backend.
It can be useful for repeated search workloads on some platforms, but it is not guaranteed to have the same effect everywhere.

```{r, eval = FALSE}
reopened <- annoy_open_index(
  eager_index$path,
  prefault = TRUE,
  load_mode = "eager"
)
result <- annoy_search_bigmatrix(
  reopened,
  query = query_dense,
  k = 2L,
  search_k = 100L,
  prefault = TRUE
)
```

Because `prefault` depends on platform and OS support, it is best treated as a workload-specific optimization rather than as a universal default.

## Use the Benchmark Helpers to Tune n_trees and search_k

Once you know which metric is appropriate, the next question is usually how far to push `n_trees` and `search_k`. The benchmark helpers are the easiest way to study that trade-off.

```{r}
if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
  tuning_suite <- benchmark_annoy_recall_suite(
    n_ref = 200L,
    n_query = 20L,
    n_dim = 6L,
    k = 3L,
    n_trees = c(5L, 20L),
    search_k = c(-1L, 50L, 200L),
    metric = "euclidean",
    exact = TRUE,
    path_dir = tune_dir
  )
  tuning_suite$summary[, c(
    "n_trees", "search_k", "build_elapsed", "search_elapsed", "recall_at_k"
  )]
} else {
  tuning_suite <- benchmark_annoy_recall_suite(
    n_ref = 200L,
    n_query = 20L,
    n_dim = 6L,
    k = 3L,
    n_trees = c(5L, 20L),
    search_k = c(-1L, 50L, 200L),
    metric = "euclidean",
    exact = FALSE,
    path_dir = tune_dir
  )
  tuning_suite$summary[, c(
    "n_trees", "search_k", "build_elapsed", "search_elapsed"
  )]
}
```

That table is the practical center of most tuning work:

- if recall is available, compare it against search time
- if recall is not available yet, compare build and search timing first
- only benchmark metrics against each other when those metrics make sense for the same modelling problem

## Package-Level Defaults

`bigANNOY` also exposes a few package options that are useful in repeated tuning sessions.
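Because these are ordinary R options, a default can be raised for one stretch of work and then restored afterwards, since `options()` returns the previous values. A base-R sketch using the `bigANNOY.block_size` option:

```{r}
# Raise the session-wide block size default for a tuning session...
old <- options(bigANNOY.block_size = 4096L)
getOption("bigANNOY.block_size")  # 4096

# ...and restore whatever was set before when you are done.
options(old)
```

This keeps experiments from silently leaking a non-default block size into the rest of the session.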
```{r}
list(
  block_size_default = getOption("bigANNOY.block_size", 1024L),
  progress_default = getOption("bigANNOY.progress", FALSE),
  backend_default = getOption("bigANNOY.backend", "cpp")
)
```

In practice:

- set `options(bigANNOY.block_size = ...)` when you want a session-wide block size default
- set `options(bigANNOY.progress = TRUE)` when you want progress messages during long runs
- keep the native C++ backend as the default for real performance work

## A Practical Tuning Pattern

A useful workflow is:

1. choose the metric that best matches the meaning of similarity in your data
2. start with moderate `n_trees` and a modest `search_k`
3. benchmark a small grid of `n_trees` by `search_k`
4. increase `search_k` first if quality is too low
5. rebuild with more trees when higher search budgets alone are not enough
6. revisit `block_size`, `load_mode`, and `prefault` only after the main quality-versus-latency trade-off is understood

## Recap

The most important ideas in `bigANNOY` tuning are:

- metric choice comes first
- `n_trees` mostly controls build-time quality investment
- `search_k` mostly controls query-time quality investment
- `block_size`, `load_mode`, and `prefault` mostly affect execution behavior rather than neighbour semantics
- Euclidean tuning is the easiest place to start when you want an exact baseline with `bigKNN`

The next vignette after this one is usually *Validation and Sharing Indexes*, which focuses on sidecar metadata, persisted files, and safe reuse across sessions.