String Distance Benchmarking

Introduction

This vignette presents a performance comparison between fozzie_string_join() and fuzzyjoin::stringdist_join() across several string distance algorithms. The goal is to demonstrate an easily digestible example of methods for the full suite of fozziejoin vs. fuzzyjoin benchmarks.

We benchmark the following methods:

Optimal String Alignment (osa)
Jaccard (jaccard)
Cosine (cosine)

To conclude, we’ll highlight general considerations when benchmarking fozziejoin against fuzzyjoin, including differences in implementation, return types, and performance characteristics.

For comprehensive benchmarking methods and results, please refer to:

📂 Benchmarking Scripts 📊 Benchmarking Outputs 🔁 Reproducible GitHub Workflow

Benchmark Setup

We use the qdapDictionaries::DICTIONARY as our reference word list and qdapDictionaries::misspellings as the source of noisy input. For demonstration purposes, we use a sample size of 100 misspellings.

library(microbenchmark)
library(fuzzyjoin)
library(fozziejoin)
library(qdapDictionaries)
library(tibble)
library(dplyr)

nsamp <- 100
seed <- 2016

params <- list(
  list(method = "osa", mode = "inner", max_dist = 1, q = 0),
  list(method = "jaccard", mode = "inner", max_dist = 0.5, q = 2),
  list(method = "lv", mode = "inner", max_dist = 1)
)

data(misspellings)
sub_misspellings <- misspellings[sample(seq_len(nrow(misspellings)), nsamp), ]

words <- as.data.frame(DICTIONARY)
results <- data.frame()

for (p in params) {
  set.seed(seed)

  bench <- microbenchmark(
    fuzzy = fuzzy <- stringdist_join(
      sub_misspellings, words,
      by = c(misspelling = "word"),
      method = p$method, mode = p$mode,
      max_dist = p$max_dist, q = p$q
    ),
    fozzie = fozzie <- fozzie_string_join(
      sub_misspellings, words,
      by = c(misspelling = "word"),
      method = p$method, how = p$mode,
      max_distance = p$max_dist, q = p$q
    ),
    times = 10
  )

  # Check for equal output
  if (!isTRUE(all.equal(fuzzy, fozzie))) {
    message("Mismatch detected for method: ", p$method)
  }

  # Convert benchmark results to df, append to running list
  df <- as.data.frame(bench)
  df$method <- p$method
  df$n_comps <- nrow(sub_misspellings) * nrow(words)
  df$os <- Sys.info()["sysname"]
  results <- rbind(results, df)
}

Summary results

We compute the average runtime (in milliseconds) and compare performance across methods and sample sizes.

summary_stats <- aggregate(
  time ~ expr + method + n_comps,
  data = results,
  FUN = function(x) mean(x)
)

summary_df <- data.frame(
  expr = summary_stats$expr,
  method = summary_stats$method,
  n_comps = summary_stats$n_comps,
  mean_time = summary_stats$time / 1e6
)

wide_df <- reshape(
  summary_df,
  idvar = c("n_comps", "method"),
  timevar = "expr",
  direction = "wide"
)

wide_df$mean_ratio <- wide_df$mean_time.fuzzy / wide_df$mean_time.fozzie

clean_df <- tibble(wide_df[order(wide_df$method), c(
  "method", "n_comps", "mean_time.fuzzy", "mean_time.fozzie", "mean_ratio"
)])

print(clean_df)

## # A tibble: 3 × 5
##   method  n_comps mean_time.fuzzy mean_time.fozzie mean_ratio
##   <chr>     <int>           <dbl>            <dbl>      <dbl>
## 1 jaccard 2013700           1384.            19.2        72.0
## 2 lv      2013700           1187.             7.02      169. 
## 3 osa     2013700           1197.             7.12      168.

Considerations when Benchmarking

If you wish to perform your own benchmarks of fozziejoin and fuzzyjoin, please consider these factors that may influence your results.

Results may not be Identical

The ouput of fozziejoin is not guaranteed to be identical to fuzzyjoin, especially for normalized distances. For instance, these cosine distance joins produce different results:

fozzie_cos <- fozzie_string_join(
  sub_misspellings, words, 
  by = c(misspelling = 'word'), 
  method='cosine', q = 3, max_distance = 0.5,
  distance_col = 'dist'
)
fuzzy_cos <- stringdist_join(
  sub_misspellings, words,
  by = c(misspelling = 'word'),
  method = 'cosine', q = 3, max_dist = 0.5,
  distance_col = 'dist'
)

Upon further exploration, fozziejoin results contain additional rows that fuzzyjoin does not. Each row has a cosine distance of around 0.5, the threshold for inclusion. This implies the difference results from floating point rounding errors.

joinby <- c('misspelling', 'correct', 'word')
only_fozzie <- anti_join(fozzie_cos, fuzzy_cos, by = joinby)
print(table(only_fozzie$dist))

## 
## 0.5 
##  49

only_fuzzy <- anti_join(fuzzy_cos, fozzie_cos, by = joinby)
print(paste("Fozzie contains all fuzzy rows:", nrow(only_fuzzy) == 0))

## [1] "Fozzie contains all fuzzy rows: TRUE"

Multithreading

All fozziejoin implementations currently support multithreading. In contrast, multithreading in fuzzyjoin is limited to specific cases. For example, stringdist_join() benefits from parallelism via the stringdist package, while functions like difference_join() remain single-threaded.

As a result, fozziejoin’s relative performance advantage tends to increase with the number of available threads — especially on large datasets. Performance tends to be slightly worse on Windows than Unix systems due to the relatively efficiency of multithreading via rayon.

Smart Search strategies

fozziejoin uses algorithmic optimizations to reduce the number of comparisons required during fuzzy joins. In contrast, fuzzyjoin typically performs naive pairwise comparisons across all values. As a result, the relative performance of fozziejoin depends on the structure and distribution of the input data.

In some cases, these methods could increase runtime, such as for very small datasets. Such slowdowns are typically negligible. Meanwhile, the potential performance gains can be dramatic, such as datasets where string values are duplicated many times.

`fozzie_string_join`

This function deduplicates strings on the left and right sides before computing string distances. As a result, fozzie_string_join becomes increasingly efficient when either input contains duplicate values. In contrast, fuzzyjoin::stringdist_join() performs comparisons across all rows, regardless of duplication.

`fozzie_difference_join`

This method bins right-hand values using a width equal to max_distance. Each left-hand value is then compared to at most three bins. The effectiveness of this strategy depends on the distribution of values and the chosen max_distance. When values are well-separated, binning drastically reduces comparisons, leading to faster joins than fuzzyjoin.

`fozzie_interval_join`

This join uses an Adelson-Velsky and Landis (AVL) tree to index intervals. Intervals that are more than max_distance units apart are excluded from evaluation. This pruning strategy avoids unnecessary comparisons and scales well with large datasets.