Fuzzy Matching with fozziejoin

Introduction

Fuzzy matching is a powerful technique for joining datasets that contain noisy or inconsistent fields, such as names with typos, dates or numeric values that differ slightly, or overlapping intervals. R users may be familiar with the fuzzyjoin package for such tasks. fozziejoin is a performance-minded alternative that supports fast, flexible fuzzy joins across multiple data types. While it is not a drop-in replacement for fuzzyjoin, we hope that the package feels familiar to longtime fuzzyjoin users.

In this vignette, we demonstrate fuzzy joins using the following fozziejoin utilities:

fozzie_string_join
fozzie_difference_join
fozzie_distance_join
fozzie_interval_join
fozzie_temporal_join
fozzie_temporal_interval_join

The goal is to broadly introduce the package to new users and showcase its capabilities.

String Distance joins

To demonstrate string matching, we use the babynames dataset. We randomly sample 10 names and introduce a single-character mutation to simulate noisy input.

library(babynames)
library(fozziejoin)
library(tibble)

# Seed for reproducibility
set.seed(1337)

# Restrict to names from years 2000 or later
babynames <- babynames[babynames$year >= 2000, ]

# Sample rows from babynames dataset
sample_df <- babynames[sample(nrow(babynames), 10), 'name']

# Mutate a single character in the 'name' field for sample
mutate_char <- function(x) {
  if (nchar(x) == 0) return(x)
  pos <- sample(1:nchar(x), 1)
  new_char <- sample(letters, 1)
  substr(x, pos, pos) <- new_char
  return(x)
}
sample_df$name <- sapply(sample_df$name, mutate_char)

Next, we attempt to join the modified names back to the original dataset using trigram-based Jaccard similarity. fozziejoin supports all the same string distance algorithms as fuzzyjoin, though some implementations (soundex and jw) differ slightly.

fozzie <- fozzie_string_join(
    babynames, sample_df, how='inner', method='jaccard', q=3,
    by = c('name')
)
print(head(fozzie))

## # A tibble: 6 × 6
##    year sex   name.x       n    prop name.y        
##   <dbl> <chr> <chr>    <int>   <dbl> <chr>         
## 1  2000 F     Emily    25953 0.0130  Olunatimilehin
## 2  2000 F     Brianna  12878 0.00646 Cairiana      
## 3  2000 F     Victoria 10923 0.00548 Cairiana      
## 4  2000 F     Haley     9070 0.00455 Taleka        
## 5  2000 F     Hailey    7831 0.00393 Olunatimilehin
## 6  2000 F     Maria     6852 0.00343 Cairiana

print(nrow(fozzie))

## [1] 79089
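
To make the metric used above concrete, here is a minimal base-R sketch of q-gram Jaccard distance. The helpers `qgrams()` and `jaccard_distance()` are hypothetical names written for illustration, not fozziejoin functions, and the package's own implementation may handle edge cases (such as strings shorter than `q`) differently.

```r
# Hypothetical helper: the set of unique q-grams in a single string
qgrams <- function(s, q = 3) {
  n <- nchar(s)
  if (n < q) return(character(0))
  unique(substring(s, 1:(n - q + 1), q:n))
}

# Hypothetical helper: Jaccard distance = 1 - |intersection| / |union|
jaccard_distance <- function(a, b, q = 3) {
  qa <- qgrams(a, q)
  qb <- qgrams(b, q)
  n_union <- length(union(qa, qb))
  # Assumption: two strings with no q-grams at all are treated as identical
  if (n_union == 0) return(0)
  1 - length(intersect(qa, qb)) / n_union
}

# One mutated character: 4 shared trigrams out of 6 distinct ones
d_close <- jaccard_distance("Brianna", "Briannx")      # 1 - 4/6, about 0.333

# Only the trigram "mil" is shared, out of 14 distinct ones
d_far <- jaccard_distance("Emily", "Olunatimilehin")   # 1 - 1/14, about 0.929
```

Any two names that share a single trigram have a distance below 1, which is consistent with loosely related pairs such as Emily and Olunatimilehin appearing in the output above. If the join function accepts a `max_distance` argument, as the numeric joins in this vignette do, a tighter threshold would presumably narrow the result.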

Note that the returned object is a tibble in this case. If either input dataset is a tibble, a tibble is returned. Otherwise, a base data.frame is returned.

# If neither input is a `tibble`, a `data.frame` is returned.
fozzie_df <- fozzie_string_join(
    as.data.frame(babynames),
    as.data.frame(sample_df),
    how='inner',
    method='jaccard',
    q=3,
    by = c('name')
)
head(fozzie_df)

##   year sex   name.x     n       prop         name.y
## 1 2000   F    Emily 25953 0.01300980 Olunatimilehin
## 2 2000   F  Brianna 12878 0.00645552       Cairiana
## 3 2000   F Victoria 10923 0.00547551       Cairiana
## 4 2000   F    Haley  9070 0.00454664         Taleka
## 5 2000   F   Hailey  7831 0.00392555 Olunatimilehin
## 6 2000   F    Maria  6852 0.00343479       Cairiana

Difference and Distance joins

fozziejoin also supports numeric-based fuzzy joins. These are useful for measurements, non-geometric coordinates, or scores.

Difference joins filter based on the absolute difference between columns, evaluated individually.
Distance joins use multi-dimensional metrics like "euclidean" or "manhattan" to evaluate all columns simultaneously.

Most join functions allow you to return the fuzzy matching metric via the distance_col argument. When multiple distances are returned, the distance output columns will be named using the pattern {distance_col}_{left column name}_{right column name}.

# Simulate data
size <- 1000
df1 <- tibble(
  x = round(runif(size, min = 0, max = 100), 2),
  y = round(runif(size, min = 0, max = 100), 2)
)
df2 <- tibble(
  x = round(runif(size, min = 0, max = 100), 2),
  y = round(runif(size, min = 0, max = 100), 2)
)

# Absolute difference join (per column)
diff_join <- fozzie_difference_join(
  df1, df2, max_distance=1, distance_col = 'diff'
)

## Joining by: c("x", "y")

print(head(diff_join))

## # A tibble: 6 × 6
##     x.x   y.x   x.y   y.y diff_x_x diff_y_y
##   <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 25.0   17.6 25.0   17.7  0          0.150
## 2  3.51  79.3  3.51  78.6  0          0.680
## 3 75.7   50   75.7   50.5  0          0.460
## 4 94.4   32.2 94.4   32.7  0.01000    0.440
## 5 22.6   25.8 22.6   26.2  0.01000    0.41 
## 6 60.8   28.0 60.8   28.3  0.01000    0.350

# Manhattan distance join (across all columns)
dist_join <- fozzie_distance_join(
  df1, df2, method='manhattan', max_distance=1, distance_col='dist'
)

## Joining by: c("x", "y")

print(head(dist_join))

## # A tibble: 6 × 5
##     x.x   y.x   x.y   y.y  dist
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  13.2  62.2  13.4  61.8 0.68 
## 2  55.1  89.7  54.6  89.2 0.980
## 3  29.8  20.5  30.2  20.9 0.780
## 4  70.0  65.1  70.7  65.2 0.810
## 5  98.6  72.8  99.2  73.0 0.830
## 6   3.8  87.6   3.2  87.8 0.74
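
The two joins above can accept different pairs even with the same threshold. The base-R sketch below uses two hypothetical points and assumes an inclusive `<=` comparison against `max_distance`; whether the package treats the boundary as inclusive is not stated here.

```r
# Two hypothetical points, one from each table
p <- c(x = 10.0, y = 20.0)
q <- c(x = 10.8, y = 20.7)
max_distance <- 1

# Difference join logic: every column is checked on its own
per_column <- abs(p - q)                      # 0.8 and 0.7
difference_match <- all(per_column <= max_distance)

# Distance join logic: all columns are combined into one number
manhattan <- sum(per_column)                  # 0.8 + 0.7 = 1.5
euclidean <- sqrt(sum(per_column^2))          # sqrt(0.64 + 0.49)
manhattan_match <- manhattan <= max_distance
euclidean_match <- euclidean <= max_distance
```

In this sketch the pair passes the per-column check but fails both combined metrics, so a distance join is the stricter choice when several columns must be close at the same time.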

Interval Joins

Interval joins allow you to match records based on overlapping ranges — useful for genomic intervals, time windows, or numeric spans. fozziejoin supports flexible interval matching with control over overlap behavior and precision.

In this example, we simulate two datasets with randomly generated intervals and use fozzie_interval_join() to find overlapping pairs.

size <- 1000

# Simulate left data
starts1 <- runif(size, min = 0, max = 500)
ends1 <- starts1 + runif(size, min = 0, max = 10)
df1 <- tibble(start = starts1, end = ends1)

# Simulate right data
starts2 <- runif(size, min = 0, max = 500)
ends2 <- starts2 + runif(size, min = 0, max = 10)
df2 <- tibble(start = starts2, end = ends2)

# Perform interval join using real-valued ranges
real_olaps <- fozzie_interval_join(
  df1, df2,
  by = c(start = "start", end = "end"),
  how = "inner",
  overlap_type = "any",
  maxgap = 0,
  minoverlap = 0,
  interval_mode = "real"
)

Interval Modes Explained

“integer” mode is designed for discrete, integer-based intervals — similar to how the IRanges package handles genomic ranges. It assumes endpoints are whole numbers and uses inclusive logic. As an example, [1, 9] would match to [10, 11], as these ranges are touching in integer space.
“real” mode supports continuous numeric ranges — ideal for floating-point values like scores or measurements. This mode behaves more like foverlaps() from data.table, allowing precise control over overlap boundaries.

If the user does not specify interval_mode, a mode is chosen automatically based on the input data. If all values are integer, then integer mode is used. Otherwise, real mode is used.
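
One plausible way to read the two modes is as different definitions of the gap between two intervals. The sketch below is not taken from the package source: `interval_gap()` and `intervals_match()` are hypothetical helpers, and the rule "match when the gap is at most `maxgap`" is an assumption.

```r
# Hypothetical helper: gap between two closed intervals [s1, e1] and [s2, e2].
# Negative values mean the intervals overlap.
interval_gap <- function(s1, e1, s2, e2, mode = c("real", "integer")) {
  mode <- match.arg(mode)
  gap <- max(s1, s2) - min(e1, e2)
  # In integer space no whole number lies between 9 and 10,
  # so adjacent intervals are given a gap of zero.
  if (mode == "integer") gap <- gap - 1
  gap
}

# Assumption: a pair matches when its gap does not exceed maxgap
intervals_match <- function(s1, e1, s2, e2, mode, maxgap = 0) {
  interval_gap(s1, e1, s2, e2, mode) <= maxgap
}

touching_integer <- intervals_match(1, 9, 10, 11, mode = "integer")  # TRUE
touching_real    <- intervals_match(1, 9, 10, 11, mode = "real")     # FALSE
overlap_real     <- intervals_match(1, 9, 5, 12, mode = "real")      # TRUE
```

Under these assumptions, [1, 9] and [10, 11] match in integer mode but not in real mode, which agrees with the example given for integer mode above.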

Temporal joins

Temporal joins are also available via fozzie_temporal_join and fozzie_temporal_interval_join. Under the hood, they are an extension of the difference and interval joins. While these functions are designed to work with both POSIX timestamps and Date types, users may not mix and match. All join columns must be of the same type.

For POSIX timestamps, users may specify the distance unit as days, hours, minutes, seconds, ms, us, or ns.

df1 <- data.frame(time = as.POSIXct(c(
  "2023-01-01 12:00:00", "2023-01-01 13:00:00"
)))
df2 <- data.frame(time = as.POSIXct(c(
  "2023-01-01 12:00:05", "2023-01-01 14:00:00"
)))

result <- fozzie_temporal_inner_join(
  df1, df2, by = c("time"), max_distance = 10, unit = "seconds"
)
print(head(result))

##                time.x              time.y
## 1 2023-01-01 12:00:00 2023-01-01 12:00:05
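
The result above can be checked by hand in base R. This sketch compares every left timestamp with every right timestamp and keeps pairs at most 10 seconds apart. It is only an illustration of the matching rule, assuming an inclusive threshold, and says nothing about how the package performs the search internally. The explicit `tz = "UTC"` is added here for reproducibility.

```r
left  <- as.POSIXct(c("2023-01-01 12:00:00", "2023-01-01 13:00:00"), tz = "UTC")
right <- as.POSIXct(c("2023-01-01 12:00:05", "2023-01-01 14:00:00"), tz = "UTC")

# POSIXct values are seconds since the epoch, so plain subtraction gives seconds
gap_secs <- abs(outer(as.numeric(left), as.numeric(right), "-"))

# Keep the (row, col) index pairs that fall within 10 seconds
matches <- which(gap_secs <= 10, arr.ind = TRUE)

manual_join <- data.frame(
  time.x = left[matches[, "row"]],
  time.y = right[matches[, "col"]]
)
```

Only the 12:00:00 and 12:00:05 pair survives, since every other combination is roughly an hour or more apart.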

For Date class objects, the unit must be days. This is the default option.

# An error results if matching on `Date` with unit other than `days`
df1$date <- as.Date(df1$time)
df2$date <- as.Date(df2$time)
result <- fozzie_temporal_inner_join(
  df1, df2, by = c("date"), max_distance = 10, unit = "seconds"
)

## Error in `fozzie_temporal_join()`:
## ! When joining on Date columns, unit must be 'days'.

# Succeeds
result <- fozzie_temporal_inner_join(
  df1, df2, by = c("date"), max_distance = 10
)

fozzie_temporal_interval_join uses interval_mode='real' in all cases.

df1 <- data.frame(
  start = as.Date(c("2023-01-01", "2023-01-05")),
  end = as.Date(c("2023-01-03", "2023-01-07"))
)
df2 <- data.frame(
  start = as.Date(c("2023-01-02", "2023-01-06")),
  end = as.Date(c("2023-01-04", "2023-01-08"))
)

result <- fozzie_temporal_interval_inner_join(
  df1, df2,
  by = c(start = "start", end = "end"),
  overlap_type = "any",
  maxgap = 0,
  minoverlap = 0,
  unit = "days"
)

head(result)

##      start.x      end.x    start.y      end.y
## 1 2023-01-01 2023-01-03 2023-01-02 2023-01-04
## 2 2023-01-05 2023-01-07 2023-01-06 2023-01-08
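
The overlap rule behind this result can be sketched in base R. The example assumes closed intervals that overlap when each one starts no later than the other ends, which corresponds to the `overlap_type = "any"`, `maxgap = 0` settings only under that assumption. It is an illustration, not the package's implementation.

```r
# Dates converted to day counts so that plain numeric comparison applies
x_start <- as.numeric(as.Date(c("2023-01-01", "2023-01-05")))
x_end   <- as.numeric(as.Date(c("2023-01-03", "2023-01-07")))
y_start <- as.numeric(as.Date(c("2023-01-02", "2023-01-06")))
y_end   <- as.numeric(as.Date(c("2023-01-04", "2023-01-08")))

# overlaps[i, j] is TRUE when left interval i and right interval j intersect:
# x_start[i] <= y_end[j] and y_start[j] <= x_end[i]
overlaps <- outer(x_start, y_end, "<=") & t(outer(y_start, x_end, "<="))

# Index pairs of overlapping intervals
pairs <- which(overlaps, arr.ind = TRUE)
```

Here the first left interval overlaps only the first right interval and the second overlaps only the second, matching the two rows returned above.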

Summary of `fozziejoin` Join Types

Join Type	Input Type(s)	Matching Logic	Key Options / Notes
`fozzie_string_join`	Character	String similarity (e.g. cosine, hamming)	`method`, `q`, `distance_col`
`fozzie_difference_join`	Numeric	Absolute difference per column	`max_distance`, `distance_col`
`fozzie_distance_join`	Numeric	Euclidean or Manhattan distance	`method`, `max_distance`, `distance_col`
`fozzie_interval_join`	Ranges (start/end)	Overlapping intervals	`overlap_type`, `maxgap`, `minoverlap`, `interval_mode`
`fozzie_temporal_join`	POSIXct / Date	Time difference within unit	`unit`, `max_distance`
`fozzie_temporal_interval_join`	POSIXct / Date ranges	Overlapping time intervals	`unit`, `overlap_type`, `maxgap`, `minoverlap`

Jon Downs

2026-03-04
