---
title: "String Distance Benchmarking"
author: "Jon Downs"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{String Distance Benchmarking}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

### Introduction

This vignette presents a performance comparison between `fozzie_string_join()`
and `fuzzyjoin::stringdist_join()` across several string distance algorithms.
The goal is to demonstrate an easily digestible example of methods for the full
suite of `fozziejoin` vs. `fuzzyjoin` benchmarks. 

We benchmark the following methods:

- Optimal String Alignment (`osa`)
- Jaccard (`jaccard`)
- Cosine (`cosine`)

To conclude, we’ll highlight general considerations when benchmarking
`fozziejoin` against `fuzzyjoin`, including differences in implementation,
return types, and performance characteristics.

For comprehensive benchmarking methods and results, please refer to:

📂 [Benchmarking Scripts](https://github.com/fozzieverse/fozziejoin/tree/main/benchmarks/r)
📊 [Benchmarking Outputs](https://github.com/fozzieverse/fozziejoin/tree/main/benchmarks/results)
🔁 [Reproducible GitHub Workflow](https://github.com/fozzieverse/fozziejoin/blob/main/.github/workflows/run_rbase_benches.yml)

---

### Benchmark Setup

We use the `qdapDictionaries::DICTIONARY` as our reference word list and
`qdapDictionaries::misspellings` as the source of noisy input. For
demonstration purposes, we use a sample size of 100 misspellings.

```{r, message = FALSE}
library(microbenchmark)
library(fuzzyjoin)
library(fozziejoin)
library(qdapDictionaries)
library(tibble)
library(dplyr)

nsamp <- 100
seed <- 2016

params <- list(
  list(method = "osa", mode = "inner", max_dist = 1, q = 0),
  list(method = "jaccard", mode = "inner", max_dist = 0.5, q = 2),
  list(method = "lv", mode = "inner", max_dist = 1)
)

data(misspellings)
sub_misspellings <- misspellings[sample(seq_len(nrow(misspellings)), nsamp), ]

words <- as.data.frame(DICTIONARY)
results <- data.frame()
```

```{r}
for (p in params) {
  set.seed(seed)

  bench <- microbenchmark(
    fuzzy = fuzzy <- stringdist_join(
      sub_misspellings, words,
      by = c(misspelling = "word"),
      method = p$method, mode = p$mode,
      max_dist = p$max_dist, q = p$q
    ),
    fozzie = fozzie <- fozzie_string_join(
      sub_misspellings, words,
      by = c(misspelling = "word"),
      method = p$method, how = p$mode,
      max_distance = p$max_dist, q = p$q
    ),
    times = 10
  )

  # Check for equal output
  if (!isTRUE(all.equal(fuzzy, fozzie))) {
    message("Mismatch detected for method: ", p$method)
  }

  # Convert benchmark results to df, append to running list
  df <- as.data.frame(bench)
  df$method <- p$method
  df$n_comps <- nrow(sub_misspellings) * nrow(words)
  df$os <- Sys.info()["sysname"]
  results <- rbind(results, df)
}
```

### Summary results

We compute the average runtime (in milliseconds) and compare performance across
methods and sample sizes.

```{r}
summary_stats <- aggregate(
  time ~ expr + method + n_comps,
  data = results,
  FUN = function(x) mean(x)
)

summary_df <- data.frame(
  expr = summary_stats$expr,
  method = summary_stats$method,
  n_comps = summary_stats$n_comps,
  mean_time = summary_stats$time / 1e6
)

wide_df <- reshape(
  summary_df,
  idvar = c("n_comps", "method"),
  timevar = "expr",
  direction = "wide"
)

wide_df$mean_ratio <- wide_df$mean_time.fuzzy / wide_df$mean_time.fozzie

clean_df <- tibble(wide_df[order(wide_df$method), c(
  "method", "n_comps", "mean_time.fuzzy", "mean_time.fozzie", "mean_ratio"
)])

print(clean_df)
```

### Considerations when Benchmarking

If you wish to perform your own benchmarks of `fozziejoin` and `fuzzyjoin`,
please consider these factors that may influence your results.


#### Results may not be Identical

The ouput of `fozziejoin` is not guaranteed to be identical to `fuzzyjoin`,
especially for normalized distances. For instance, these cosine distance joins
produce different results:

```{r}
fozzie_cos <- fozzie_string_join(
  sub_misspellings, words, 
  by = c(misspelling = 'word'), 
  method='cosine', q = 3, max_distance = 0.5,
  distance_col = 'dist'
)
fuzzy_cos <- stringdist_join(
  sub_misspellings, words,
  by = c(misspelling = 'word'),
  method = 'cosine', q = 3, max_dist = 0.5,
  distance_col = 'dist'
)
```

Upon further exploration, `fozziejoin` results contain additional rows that
`fuzzyjoin` does not. Each row has a cosine distance of around 0.5, the
threshold for inclusion. This implies the difference results from floating
point rounding errors.

```{r}
joinby <- c('misspelling', 'correct', 'word')
only_fozzie <- anti_join(fozzie_cos, fuzzy_cos, by = joinby)
print(table(only_fozzie$dist))

only_fuzzy <- anti_join(fuzzy_cos, fozzie_cos, by = joinby)
print(paste("Fozzie contains all fuzzy rows:", nrow(only_fuzzy) == 0))
```

#### Multithreading

All `fozziejoin` implementations currently support multithreading. In contrast,
multithreading in `fuzzyjoin` is limited to specific cases. For example,
`stringdist_join()` benefits from parallelism via the `stringdist` package,
while functions like `difference_join()` remain single-threaded.

As a result, `fozziejoin`'s relative performance advantage tends to increase
with the number of available threads — especially on large datasets. Performance
tends to be slightly worse on Windows than Unix systems due to the relatively
efficiency of multithreading via `rayon`.

#### Smart Search strategies

`fozziejoin` uses algorithmic optimizations to reduce the number of comparisons
required during fuzzy joins. In contrast, `fuzzyjoin` typically performs naive
pairwise comparisons across all values. As a result, the relative performance
of `fozziejoin` depends on the structure and distribution of the input data.

In some cases, these methods could increase runtime, such as for very small
datasets. Such slowdowns are typically negligible. Meanwhile, the potential
performance gains can be dramatic, such as datasets where string values are
duplicated many times.

#### `fozzie_string_join`

This function deduplicates strings on the left and right sides before computing
string distances. As a result, `fozzie_string_join` becomes increasingly
efficient when either input contains duplicate values. In contrast,
`fuzzyjoin::stringdist_join()` performs comparisons across all rows, regardless
of duplication.

#### `fozzie_difference_join`

This method bins right-hand values using a width equal to `max_distance`. Each
left-hand value is then compared to at most three bins. The effectiveness of
this strategy depends on the distribution of values and the chosen
`max_distance`. When values are well-separated, binning drastically reduces
comparisons, leading to faster joins than `fuzzyjoin`.

#### `fozzie_interval_join`

This join uses an Adelson-Velsky and Landis (AVL) tree to index intervals.
Intervals that are more than `max_distance` units apart are excluded from
evaluation. This pruning strategy avoids unnecessary comparisons and scales
well with large datasets.