---
title: "File-Backed bigmemory Workflows"
output:
  litedown::html_format:
    meta:
      css: ["@default"]
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(bigANNOY.progress = FALSE)
set.seed(20260326)
```

One of the main goals of `bigANNOY` is to work comfortably with `bigmemory` data that already lives on disk. Instead of forcing a large reference matrix through dense in-memory copies, the package can build and query Annoy indexes directly from file-backed `big.matrix` objects and their descriptors.

This vignette focuses on the most common disk-oriented workflows:

- building from a file-backed reference matrix
- querying with descriptor objects and descriptor file paths
- streaming neighbour results into file-backed destination matrices
- working with separated-column `big.matrix` query layouts

## Load the Packages

```{r}
library(bigANNOY)
library(bigmemory)
```

## Create a Small File-Backed Workspace

For reproducibility, we will create all backing files inside a temporary directory. In real work this would usually be a project directory or a shared data location.

```{r}
workspace_dir <- tempfile("bigannoy-filebacked-")
dir.create(workspace_dir, recursive = TRUE, showWarnings = FALSE)

make_filebacked_matrix <- function(values, type, backingpath, name) {
  bm <- filebacked.big.matrix(
    nrow = nrow(values),
    ncol = ncol(values),
    type = type,
    backingfile = sprintf("%s.bin", name),
    descriptorfile = sprintf("%s.desc", name),
    backingpath = backingpath
  )
  bm[, ] <- values
  bm
}
```

## Build a File-Backed Reference Matrix

We will create a reference dataset and store it in a file-backed `big.matrix`. The corresponding descriptor file is what lets later R sessions reattach to the same on-disk data.
```{r}
ref_dense <- matrix(
  c(
    0.0, 0.0,
    5.0, 0.0,
    0.0, 5.0,
    5.0, 5.0,
    9.0, 9.0
  ),
  ncol = 2,
  byrow = TRUE
)

ref_fb <- make_filebacked_matrix(
  values = ref_dense,
  type = "double",
  backingpath = workspace_dir,
  name = "ref"
)

ref_desc <- describe(ref_fb)
ref_desc_path <- file.path(workspace_dir, "ref.desc")

file.exists(ref_desc_path)
dim(ref_fb)
```

At this point we have:

- a file-backed data file at `ref.bin`
- a descriptor file at `ref.desc`
- a `big.matrix` object currently attached in this R session

## Build an Annoy Index from a Descriptor Path

The simplest persisted workflow is to build directly from the descriptor file path instead of from the live `big.matrix` object. That mirrors how later sessions typically work.

```{r}
index_path <- file.path(workspace_dir, "ref.ann")

index <- annoy_build_bigmatrix(
  x = ref_desc_path,
  path = index_path,
  n_trees = 25L,
  metric = "euclidean",
  seed = 99L,
  load_mode = "lazy"
)
index
```

This pattern is useful because the build call no longer depends on a particular in-memory object being alive. As long as the descriptor can be reattached, the reference matrix can be used.

## Accepted File-Oriented Input Forms

For `x`, `query`, `xpIndex`, and `xpDistance`, `bigANNOY` accepts several `bigmemory`-oriented forms:

- a live `big.matrix`
- an external pointer to a `big.matrix`
- a `big.matrix.descriptor` object
- a descriptor file path

For queries only, a dense numeric matrix is also accepted. That flexibility matters most in persisted workflows where one part of the pipeline writes descriptors and another part reattaches them later.

## Query with a File-Backed big.matrix

Now we will create a file-backed query matrix and search the persisted Annoy index against it.
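One of the accepted input forms listed above, the plain dense numeric matrix, is not otherwise demonstrated in this vignette. As a minimal sketch (reusing the `index` object built earlier; useful for small, ad hoc lookups where file backing is unnecessary):

```{r}
# A plain dense matrix is accepted for `query` only; each row is one
# query point with the same dimensionality as the reference matrix.
dense_query <- matrix(c(1.0, 1.0), ncol = 2)

dense_result <- annoy_search_bigmatrix(
  index,
  query = dense_query,
  k = 2L,
  search_k = 100L
)
dense_result$index
```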
```{r}
query_dense <- matrix(
  c(
    0.2, 0.1,
    4.7, 5.1
  ),
  ncol = 2,
  byrow = TRUE
)

query_fb <- make_filebacked_matrix(
  values = query_dense,
  type = "double",
  backingpath = workspace_dir,
  name = "query"
)

query_result_big <- annoy_search_bigmatrix(
  index,
  query = query_fb,
  k = 2L,
  search_k = 100L
)
query_result_big$index
round(query_result_big$distance, 3)
```

The query matrix itself is file-backed, but the search call looks the same as it would for an in-memory `big.matrix`.

## Query with a Descriptor Object and a Descriptor Path

The same persisted query data can be supplied through its descriptor object or through the descriptor file path. This is often the most convenient way to reattach query data across sessions.

```{r}
query_desc <- describe(query_fb)
query_desc_path <- file.path(workspace_dir, "query.desc")

query_result_desc <- annoy_search_bigmatrix(
  index,
  query = query_desc,
  k = 2L,
  search_k = 100L
)

query_result_path <- annoy_search_bigmatrix(
  index,
  query = query_desc_path,
  k = 2L,
  search_k = 100L
)

query_result_desc$index
query_result_path$index
```

These should match the result obtained from the live `big.matrix` query.

```{r}
identical(query_result_big$index, query_result_desc$index)
identical(query_result_big$index, query_result_path$index)
all.equal(query_result_big$distance, query_result_desc$distance)
```

## Stream Results into File-Backed Destination Matrices

Large search results can be expensive to keep in ordinary R memory. To avoid that, `bigANNOY` can stream neighbour ids and distances directly into destination `big.matrix` objects. For file-backed workflows, this means you can keep both the inputs and the outputs on disk.
```{r}
index_store <- filebacked.big.matrix(
  nrow = nrow(query_dense),
  ncol = 2L,
  type = "integer",
  backingfile = "nn_index.bin",
  descriptorfile = "nn_index.desc",
  backingpath = workspace_dir
)

distance_store <- filebacked.big.matrix(
  nrow = nrow(query_dense),
  ncol = 2L,
  type = "double",
  backingfile = "nn_distance.bin",
  descriptorfile = "nn_distance.desc",
  backingpath = workspace_dir
)

streamed_result <- annoy_search_bigmatrix(
  index,
  query = query_desc,
  k = 2L,
  xpIndex = describe(index_store),
  xpDistance = file.path(workspace_dir, "nn_distance.desc")
)

index_store[, ]
round(distance_store[, ], 3)
```

The important practical details are:

- `xpIndex` must be integer-compatible
- `xpDistance` must be double-compatible
- both destination matrices must have shape `n_query x k`
- `xpDistance` can only be supplied when `xpIndex` is also supplied

## Reattach the Output Files Later

Because the result matrices are file-backed, they can be reattached later in the same way as any other `bigmemory` artifact.

```{r}
index_store_again <- attach.big.matrix(file.path(workspace_dir, "nn_index.desc"))
distance_store_again <- attach.big.matrix(file.path(workspace_dir, "nn_distance.desc"))

index_store_again[, ]
round(distance_store_again[, ], 3)
```

That is useful in longer pipelines where one step performs ANN search and a later step consumes the neighbour graph or distance matrix.

## Separated-Column Query Matrices

`bigANNOY` also supports separated-column `big.matrix` layouts. These are not necessarily file-backed, but they are common in `bigmemory` workflows and are worth knowing about because they use a different memory layout from the usual contiguous matrix case.
```{r}
query_sep <- big.matrix(
  nrow = nrow(query_dense),
  ncol = ncol(query_dense),
  type = "double",
  separated = TRUE
)
query_sep[, ] <- query_dense

sep_result <- annoy_search_bigmatrix(
  index,
  query = describe(query_sep),
  k = 2L,
  search_k = 100L
)
sep_result$index
round(sep_result$distance, 3)
```

For the same query values, the separated-column result should match the ordinary file-backed query result.

```{r}
identical(sep_result$index, query_result_big$index)
all.equal(sep_result$distance, query_result_big$distance)
```

## Persisted Reference, Persisted Index, Persisted Outputs

Taken together, the main file-backed pattern looks like this:

1. store the reference data in a file-backed `big.matrix`
2. keep the descriptor alongside the backing file
3. build the Annoy index from the descriptor path
4. query using either a live `big.matrix`, a descriptor object, or a descriptor path
5. write neighbour results into file-backed destination matrices when result size matters

This is often the most practical way to use `bigANNOY` in large-data settings, because every major artifact in the workflow can be reopened later.

## Practical Tips

- Keep descriptor files with their corresponding backing files.
- Keep the `.ann` file with its `.meta` sidecar file.
- Use descriptor paths when you want to decouple one R session from another.
- Use streamed outputs when `n_query x k` is too large to hold comfortably in ordinary R matrices.
- Use the lifecycle helpers from the persistence vignette when you want to reopen and validate the Annoy index itself across sessions.
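To make the cross-session decoupling in the tips above concrete, here is a sketch of what a fresh R session might look like. It is not evaluated in this vignette; `annoy_load()` is a hypothetical placeholder for the index-reopening helper from the persistence vignette, and the workspace path is illustrative:

```{r, eval = FALSE}
# In a fresh R session, only descriptor paths and the .ann file are needed.
library(bigANNOY)
library(bigmemory)

workspace_dir <- "/path/to/bigannoy-filebacked"  # wherever the artifacts live

# `annoy_load()` is a placeholder name; substitute the actual lifecycle
# helper documented in the persistence vignette.
index <- annoy_load(file.path(workspace_dir, "ref.ann"))

# Query data reattaches by descriptor path, with no objects carried over
# from the earlier session.
result <- annoy_search_bigmatrix(
  index,
  query = file.path(workspace_dir, "query.desc"),
  k = 2L
)
```

The point of the sketch is that no live R object from the first session is required: every input is named by a file on disk.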
## Recap

This vignette covered the main `bigmemory` persistence features in `bigANNOY`:

- file-backed reference matrices
- descriptor-object and descriptor-path queries
- streamed file-backed outputs
- reattachment of persisted outputs
- separated-column query support

The natural next vignette after this one is *Benchmarking Recall and Latency*, which shows how to evaluate these workflows against runtime and quality targets.