Fuzzy Matching with fozziejoin

Jon Downs

2026-03-04

Introduction

Fuzzy matching is a powerful technique for joining datasets that contain noisy or inconsistent fields, such as names with typos, small differences in dates or numeric values, or overlapping intervals. R users may be familiar with the fuzzyjoin package for such tasks. fozziejoin is a performance-minded alternative that supports fast, flexible fuzzy joins across multiple data types. Although it is not a drop-in replacement for fuzzyjoin, we hope the package feels familiar to longtime users.

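In this vignette, we demonstrate fuzzy joins using the following fozziejoin utilities: fozzie_string_join, fozzie_difference_join, fozzie_distance_join, fozzie_interval_join, fozzie_temporal_join, and fozzie_temporal_interval_join.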

The goal is to broadly introduce the package to new users and showcase its capabilities.

String Distance joins

To demonstrate string matching, we use the babynames dataset. We randomly sample 10 names and introduce a single-character mutation to simulate noisy input.

library(babynames)
library(fozziejoin)
library(tibble)

# Seed for reproducibility
set.seed(1337)

# Restrict to names from years 2000 or later
babynames <- babynames[babynames$year >= 2000, ]

# Sample rows from babynames dataset
sample_df <- babynames[sample(nrow(babynames), 10), 'name']

# Mutate a single character in the 'name' field for sample
mutate_char <- function(x) {
  if (nchar(x) == 0) return(x)
  pos <- sample(1:nchar(x), 1)
  new_char <- sample(letters, 1)
  substr(x, pos, pos) <- new_char
  return(x)
}
sample_df$name <- sapply(sample_df$name, mutate_char)

Next, we attempt to join the modified names back to the original dataset using trigram-based Jaccard similarity. fozziejoin supports all the same string distance algorithms as fuzzyjoin, though some implementations (soundex and jw) differ slightly.

fozzie <- fozzie_string_join(
    babynames, sample_df, how='inner', method='jaccard', q=3,
    by = c('name')
)
print(head(fozzie))
## # A tibble: 6 × 6
##    year sex   name.x       n    prop name.y        
##   <dbl> <chr> <chr>    <int>   <dbl> <chr>         
## 1  2000 F     Emily    25953 0.0130  Olunatimilehin
## 2  2000 F     Brianna  12878 0.00646 Cairiana      
## 3  2000 F     Victoria 10923 0.00548 Cairiana      
## 4  2000 F     Haley     9070 0.00455 Taleka        
## 5  2000 F     Hailey    7831 0.00393 Olunatimilehin
## 6  2000 F     Maria     6852 0.00343 Cairiana
print(nrow(fozzie))
## [1] 79089
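
To build intuition for the trigram-based Jaccard matching used above, the following base R sketch computes the distance by hand. The helpers `qgrams()` and `jaccard_distance()` are hypothetical names written for this illustration, not part of fozziejoin, and the package's own implementation may differ in details such as how strings shorter than q are handled.

```r
# Split a string into its set of overlapping q-grams (case-sensitive).
qgrams <- function(x, q = 3) {
  n <- nchar(x)
  if (n < q) return(character(0))
  unique(substring(x, 1:(n - q + 1), q:n))
}

# Jaccard distance: 1 - (shared q-grams / all distinct q-grams).
jaccard_distance <- function(a, b, q = 3) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  union_size <- length(union(ga, gb))
  # Convention for this sketch only: two strings with no q-grams are identical.
  if (union_size == 0) return(0)
  1 - length(intersect(ga, gb)) / union_size
}

jaccard_distance("Emily", "Olunatimilehin")
## [1] 0.9285714
jaccard_distance("Brianna", "Cairiana")
## [1] 0.7777778
```

Names that share even a single trigram, such as Emily and Olunatimilehin (both contain "mil"), have a distance below 1, which is consistent with pairs like these appearing in the join output above.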

Note that the returned object is a tibble in this case. If either input dataset is a tibble, a tibble is returned. Otherwise, a base data.frame is returned.

# If neither input is a `tibble`, a `data.frame` is returned.
fozzie_df <- fozzie_string_join(
    as.data.frame(babynames),
    as.data.frame(sample_df),
    how='inner',
    method='jaccard',
    q=3,
    by = c('name')
)
head(fozzie_df)
##   year sex   name.x     n       prop         name.y
## 1 2000   F    Emily 25953 0.01300980 Olunatimilehin
## 2 2000   F  Brianna 12878 0.00645552       Cairiana
## 3 2000   F Victoria 10923 0.00547551       Cairiana
## 4 2000   F    Haley  9070 0.00454664         Taleka
## 5 2000   F   Hailey  7831 0.00392555 Olunatimilehin
## 6 2000   F    Maria  6852 0.00343479       Cairiana

Difference and Distance joins

fozziejoin also supports numeric-based fuzzy joins. These are useful for measurements, non-geometric coordinates, or scores.

Most join functions allow you to return the fuzzy matching metric via the distance_col argument. When multiple distances are returned, the distance output columns will be named using the pattern {distance_col}_{left column name}_{right column name}.

# Simulate data
size <- 1000
df1 <- tibble(
  x = round(runif(size, min = 0, max = 100), 2),
  y = round(runif(size, min = 0, max = 100), 2)
)
df2 <- tibble(
  x = round(runif(size, min = 0, max = 100), 2),
  y = round(runif(size, min = 0, max = 100), 2)
)
# Absolute difference join (per column)
diff_join <- fozzie_difference_join(
  df1, df2, max_distance=1, distance_col = 'diff'
)
## Joining by: c("x", "y")
print(head(diff_join))
## # A tibble: 6 × 6
##     x.x   y.x   x.y   y.y diff_x_x diff_y_y
##   <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 25.0   17.6 25.0   17.7  0          0.150
## 2  3.51  79.3  3.51  78.6  0          0.680
## 3 75.7   50   75.7   50.5  0          0.460
## 4 94.4   32.2 94.4   32.7  0.01000    0.440
## 5 22.6   25.8 22.6   26.2  0.01000    0.41 
## 6 60.8   28.0 60.8   28.3  0.01000    0.350
# Manhattan distance join (across all columns)
dist_join <- fozzie_distance_join(
  df1, df2, method='manhattan', max_distance=1, distance_col='dist'
)
## Joining by: c("x", "y")
print(head(dist_join))
## # A tibble: 6 × 5
##     x.x   y.x   x.y   y.y  dist
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  13.2  62.2  13.4  61.8 0.68 
## 2  55.1  89.7  54.6  89.2 0.980
## 3  29.8  20.5  30.2  20.9 0.780
## 4  70.0  65.1  70.7  65.2 0.810
## 5  98.6  72.8  99.2  73.0 0.830
## 6   3.8  87.6   3.2  87.8 0.74
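
The contrast between the two joins above can be seen on a single pair of points. The base R sketch below uses made-up coordinates and does not call fozziejoin. Whether the package treats the threshold as inclusive is an assumption here, so the example avoids boundary values.

```r
# Two hypothetical points with columns x and y
a <- c(x = 10.0, y = 20.0)
b <- c(x = 10.6, y = 20.7)
max_distance <- 1

# Difference join logic: each column must be within max_distance on its own
per_column <- abs(a - b)
difference_match <- all(per_column <= max_distance)

# Distance join logic: one combined distance across all columns
manhattan <- sum(abs(a - b))
euclidean <- sqrt(sum((a - b)^2))

difference_match            # TRUE: 0.6 and 0.7 are each within 1
manhattan <= max_distance   # FALSE: 0.6 + 0.7 is about 1.3
euclidean <= max_distance   # TRUE: sqrt(0.85) is about 0.92
```

The same pair can therefore match under a per-column difference join yet fail a Manhattan distance join with the same max_distance.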

Interval joins

Interval joins allow you to match records based on overlapping ranges, which is useful for genomic intervals, time windows, or numeric spans. fozziejoin supports flexible interval matching with control over overlap behavior and precision.

In this example, we simulate two datasets with randomly generated intervals and use fozzie_interval_join() to find overlapping pairs.

size <- 1000

# Simulate left data
starts1 <- runif(size, min = 0, max = 500)
ends1 <- starts1 + runif(size, min = 0, max = 10)
df1 <- tibble(start = starts1, end = ends1)

# Simulate right data
starts2 <- runif(size, min = 0, max = 500)
ends2 <- starts2 + runif(size, min = 0, max = 10)
df2 <- tibble(start = starts2, end = ends2)

# Perform interval join using real-valued ranges
real_olaps <- fozzie_interval_join(
  df1, df2,
  by = c(start = "start", end = "end"),
  how = "inner",
  overlap_type = "any",
  maxgap = 0,
  minoverlap = 0,
  interval_mode = "real"
)
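
As a rough illustration of what an "any" overlap means, the sketch below tests two closed intervals in base R. The helper names are hypothetical, and how fozziejoin treats endpoints, maxgap, and minoverlap internally is not covered here, so read this as a sketch under stated assumptions and not as the package's algorithm.

```r
# Two closed intervals [s1, e1] and [s2, e2] overlap when each one
# starts no later than the other ends (assumption: endpoints are inclusive).
intervals_overlap <- function(s1, e1, s2, e2) {
  s1 <= e2 & s2 <= e1
}

# Length of the shared region (0 when the intervals do not overlap).
overlap_length <- function(s1, e1, s2, e2) {
  pmax(0, pmin(e1, e2) - pmax(s1, s2))
}

intervals_overlap(1.0, 4.5, 3.2, 8.0)  # TRUE
overlap_length(1.0, 4.5, 3.2, 8.0)     # about 1.3
intervals_overlap(1.0, 2.0, 2.5, 3.0)  # FALSE
```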

Interval Modes Explained

If the user does not specify interval_mode, a mode is chosen automatically based on the input data. If all values are integer, then integer mode is used. Otherwise, real mode is used.
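
The automatic choice described above could be sketched as follows. `choose_interval_mode()` is a hypothetical helper, not a fozziejoin function, and it assumes that "integer" refers to R's integer storage type; the package may instead check for whole-number values.

```r
# Hypothetical helper: pick a mode from the storage type of the join columns.
choose_interval_mode <- function(...) {
  cols <- list(...)
  # Assumption: "integer" means every column is stored as an R integer vector.
  if (all(vapply(cols, is.integer, logical(1)))) "integer" else "real"
}

choose_interval_mode(c(1L, 5L), c(3L, 9L))    # "integer"
choose_interval_mode(c(1.5, 5.0), c(3L, 9L))  # "real"
```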

Temporal joins

Temporal joins are also available via fozzie_temporal_join and fozzie_temporal_interval_join. Under the hood, they extend the difference and interval joins. These functions work with both POSIX timestamps and Date types, but the two cannot be mixed: all join columns must be of the same type.

For POSIX timestamps, users may specify the distance unit as days, hours, minutes, seconds, ms, us, or ns.

df1 <- data.frame(time = as.POSIXct(c(
  "2023-01-01 12:00:00", "2023-01-01 13:00:00"
)))
df2 <- data.frame(time = as.POSIXct(c(
  "2023-01-01 12:00:05", "2023-01-01 14:00:00"
)))

result <- fozzie_temporal_inner_join(
  df1, df2, by = c("time"), max_distance = 10, unit = "seconds"
)
print(head(result))
##                time.x              time.y
## 1 2023-01-01 12:00:00 2023-01-01 12:00:05
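
The match above can be checked by hand with base R's `difftime()`. This sketch only illustrates the arithmetic and does not call fozziejoin; the time zone is fixed to UTC here as an assumption so the result does not depend on the local setting.

```r
t1 <- as.POSIXct("2023-01-01 12:00:00", tz = "UTC")
t2 <- as.POSIXct("2023-01-01 12:00:05", tz = "UTC")
t3 <- as.POSIXct("2023-01-01 14:00:00", tz = "UTC")

# Absolute gaps in seconds between the left timestamp and each right timestamp
gap_close <- abs(as.numeric(difftime(t2, t1, units = "secs")))
gap_far <- abs(as.numeric(difftime(t3, t1, units = "secs")))

gap_close        # 5
gap_far          # 7200
gap_close <= 10  # TRUE: within a 10-second window
gap_far <= 10    # FALSE: two hours apart
```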

For Date class objects, the unit must be days. This is the default option.

# An error results if matching on `Date` with unit other than `days`
df1$date <- as.Date(df1$time)
df2$date <- as.Date(df2$time)
result <- fozzie_temporal_inner_join(
  df1, df2, by = c("date"), max_distance = 10, unit = "seconds"
)
## Error in `fozzie_temporal_join()`:
## ! When joining on Date columns, unit must be 'days'.
# Succeeds
result <- fozzie_temporal_inner_join(
  df1, df2, by = c("date"), max_distance = 10
)

fozzie_temporal_interval_join uses interval_mode='real' in all cases.

df1 <- data.frame(
  start = as.Date(c("2023-01-01", "2023-01-05")),
  end = as.Date(c("2023-01-03", "2023-01-07"))
)
df2 <- data.frame(
  start = as.Date(c("2023-01-02", "2023-01-06")),
  end = as.Date(c("2023-01-04", "2023-01-08"))
)

result <- fozzie_temporal_interval_inner_join(
  df1, df2,
  by = c(start = "start", end = "end"),
  overlap_type = "any",
  maxgap = 0,
  minoverlap = 0,
  unit = "days"
)

head(result)
##      start.x      end.x    start.y      end.y
## 1 2023-01-01 2023-01-03 2023-01-02 2023-01-04
## 2 2023-01-05 2023-01-07 2023-01-06 2023-01-08

Summary of fozziejoin Join Types

| Join Type | Input Type(s) | Matching Logic | Key Options / Notes |
|---|---|---|---|
| fozzie_string_join | Character | String similarity (e.g. cosine, hamming) | method, q, distance_col |
| fozzie_difference_join | Numeric | Absolute difference per column | max_distance, distance_col |
| fozzie_distance_join | Numeric | Euclidean or Manhattan distance | method, max_distance, distance_col |
| fozzie_interval_join | Ranges (start/end) | Overlapping intervals | overlap_type, maxgap, minoverlap, interval_mode |
| fozzie_temporal_join | POSIXct / Date | Time difference within unit | unit, max_distance |
| fozzie_temporal_interval_join | POSIXct / Date ranges | Overlapping time intervals | unit, overlap_type, maxgap, minoverlap |