This vignette presents a performance comparison between
fozzie_string_join() and
fuzzyjoin::stringdist_join() across several string distance
algorithms. The goal is to demonstrate an easily digestible example of
methods for the full suite of fozziejoin
vs. fuzzyjoin benchmarks.
We benchmark the following methods:
osa)jaccard)cosine)To conclude, we’ll highlight general considerations when benchmarking
fozziejoin against fuzzyjoin, including
differences in implementation, return types, and performance
characteristics.
For comprehensive benchmarking methods and results, please refer to:
📂 Benchmarking Scripts 📊 Benchmarking Outputs 🔁 Reproducible GitHub Workflow
We use the qdapDictionaries::DICTIONARY as our reference
word list and qdapDictionaries::misspellings as the source
of noisy input. For demonstration purposes, we use a sample size of 100
misspellings.
library(microbenchmark)
library(fuzzyjoin)
library(fozziejoin)
library(qdapDictionaries)
library(tibble)
library(dplyr)
nsamp <- 100
seed <- 2016
params <- list(
list(method = "osa", mode = "inner", max_dist = 1, q = 0),
list(method = "jaccard", mode = "inner", max_dist = 0.5, q = 2),
list(method = "lv", mode = "inner", max_dist = 1)
)
data(misspellings)
sub_misspellings <- misspellings[sample(seq_len(nrow(misspellings)), nsamp), ]
words <- as.data.frame(DICTIONARY)
results <- data.frame()for (p in params) {
set.seed(seed)
bench <- microbenchmark(
fuzzy = fuzzy <- stringdist_join(
sub_misspellings, words,
by = c(misspelling = "word"),
method = p$method, mode = p$mode,
max_dist = p$max_dist, q = p$q
),
fozzie = fozzie <- fozzie_string_join(
sub_misspellings, words,
by = c(misspelling = "word"),
method = p$method, how = p$mode,
max_distance = p$max_dist, q = p$q
),
times = 10
)
# Check for equal output
if (!isTRUE(all.equal(fuzzy, fozzie))) {
message("Mismatch detected for method: ", p$method)
}
# Convert benchmark results to df, append to running list
df <- as.data.frame(bench)
df$method <- p$method
df$n_comps <- nrow(sub_misspellings) * nrow(words)
df$os <- Sys.info()["sysname"]
results <- rbind(results, df)
}We compute the average runtime (in milliseconds) and compare performance across methods and sample sizes.
summary_stats <- aggregate(
time ~ expr + method + n_comps,
data = results,
FUN = function(x) mean(x)
)
summary_df <- data.frame(
expr = summary_stats$expr,
method = summary_stats$method,
n_comps = summary_stats$n_comps,
mean_time = summary_stats$time / 1e6
)
wide_df <- reshape(
summary_df,
idvar = c("n_comps", "method"),
timevar = "expr",
direction = "wide"
)
wide_df$mean_ratio <- wide_df$mean_time.fuzzy / wide_df$mean_time.fozzie
clean_df <- tibble(wide_df[order(wide_df$method), c(
"method", "n_comps", "mean_time.fuzzy", "mean_time.fozzie", "mean_ratio"
)])
print(clean_df)## # A tibble: 3 × 5
## method n_comps mean_time.fuzzy mean_time.fozzie mean_ratio
## <chr> <int> <dbl> <dbl> <dbl>
## 1 jaccard 2013700 1384. 19.2 72.0
## 2 lv 2013700 1187. 7.02 169.
## 3 osa 2013700 1197. 7.12 168.
If you wish to perform your own benchmarks of fozziejoin
and fuzzyjoin, please consider these factors that may
influence your results.
The ouput of fozziejoin is not guaranteed to be
identical to fuzzyjoin, especially for normalized
distances. For instance, these cosine distance joins produce different
results:
fozzie_cos <- fozzie_string_join(
sub_misspellings, words,
by = c(misspelling = 'word'),
method='cosine', q = 3, max_distance = 0.5,
distance_col = 'dist'
)
fuzzy_cos <- stringdist_join(
sub_misspellings, words,
by = c(misspelling = 'word'),
method = 'cosine', q = 3, max_dist = 0.5,
distance_col = 'dist'
)Upon further exploration, fozziejoin results contain
additional rows that fuzzyjoin does not. Each row has a
cosine distance of around 0.5, the threshold for inclusion. This implies
the difference results from floating point rounding errors.
joinby <- c('misspelling', 'correct', 'word')
only_fozzie <- anti_join(fozzie_cos, fuzzy_cos, by = joinby)
print(table(only_fozzie$dist))##
## 0.5
## 49
only_fuzzy <- anti_join(fuzzy_cos, fozzie_cos, by = joinby)
print(paste("Fozzie contains all fuzzy rows:", nrow(only_fuzzy) == 0))## [1] "Fozzie contains all fuzzy rows: TRUE"
All fozziejoin implementations currently support
multithreading. In contrast, multithreading in fuzzyjoin is
limited to specific cases. For example, stringdist_join()
benefits from parallelism via the stringdist package, while
functions like difference_join() remain
single-threaded.
As a result, fozziejoin’s relative performance advantage
tends to increase with the number of available threads — especially on
large datasets. Performance tends to be slightly worse on Windows than
Unix systems due to the relatively efficiency of multithreading via
rayon.
fozziejoin uses algorithmic optimizations to reduce the
number of comparisons required during fuzzy joins. In contrast,
fuzzyjoin typically performs naive pairwise comparisons
across all values. As a result, the relative performance of
fozziejoin depends on the structure and distribution of the
input data.
In some cases, these methods could increase runtime, such as for very small datasets. Such slowdowns are typically negligible. Meanwhile, the potential performance gains can be dramatic, such as datasets where string values are duplicated many times.
fozzie_string_joinThis function deduplicates strings on the left and right sides before
computing string distances. As a result, fozzie_string_join
becomes increasingly efficient when either input contains duplicate
values. In contrast, fuzzyjoin::stringdist_join() performs
comparisons across all rows, regardless of duplication.
fozzie_difference_joinThis method bins right-hand values using a width equal to
max_distance. Each left-hand value is then compared to at
most three bins. The effectiveness of this strategy depends on the
distribution of values and the chosen max_distance. When
values are well-separated, binning drastically reduces comparisons,
leading to faster joins than fuzzyjoin.
fozzie_interval_joinThis join uses an Adelson-Velsky and Landis (AVL) tree to index
intervals. Intervals that are more than max_distance units
apart are excluded from evaluation. This pruning strategy avoids
unnecessary comparisons and scales well with large datasets.