fozziejoin

Fuzzy matching is a powerful technique for joining datasets that
contain noisy or inconsistent fields, such as names with typos, small
differences in date or numeric value, or overlapping intervals. R users
may be familiar with the fuzzyjoin package for such tasks.
fozziejoin is a performance-minded alternative that
supports fast, flexible fuzzy joins across multiple data types. While it
is not a drop-in replacement for fuzzyjoin, we hope that
the package feels familiar to longtime users.
In this vignette, we demonstrate fuzzy joins using the following
fozziejoin utilities:
- fozzie_string_join
- fozzie_difference_join
- fozzie_distance_join
- fozzie_interval_join
- fozzie_temporal_join
- fozzie_temporal_interval_join

The goal is to broadly introduce the package to new users and showcase its capabilities.
To demonstrate string matching, we use the babynames
dataset. We randomly sample 10 names and introduce a single-character
mutation to simulate noisy input.
library(babynames)
library(fozziejoin)
library(tibble)
# Seed for reproducibility
set.seed(1337)
# Restrict to names from years 2000 or later
babynames <- babynames[babynames$year >= 2000, ]
# Sample rows from babynames dataset
sample_df <- babynames[sample(nrow(babynames), 10), 'name']
# Mutate a single character in the 'name' field for sample
mutate_char <- function(x) {
  if (nchar(x) == 0) return(x)
  pos <- sample(1:nchar(x), 1)
  new_char <- sample(letters, 1)
  substr(x, pos, pos) <- new_char
  return(x)
}
sample_df$name <- sapply(sample_df$name, mutate_char)

Next, we attempt to join the modified names back to the original
dataset using trigram-based Jaccard similarity. fozziejoin
supports all the same string distance algorithms as
fuzzyjoin, though some implementations
(soundex and jw) differ slightly.
fozzie <- fozzie_string_join(
babynames, sample_df, how='inner', method='jaccard', q=3,
by = c('name')
)
print(head(fozzie))
## # A tibble: 6 × 6
## year sex name.x n prop name.y
## <dbl> <chr> <chr> <int> <dbl> <chr>
## 1 2000 F Emily 25953 0.0130 Olunatimilehin
## 2 2000 F Brianna 12878 0.00646 Cairiana
## 3 2000 F Victoria 10923 0.00548 Cairiana
## 4 2000 F Haley 9070 0.00455 Taleka
## 5 2000 F Hailey 7831 0.00393 Olunatimilehin
## 6 2000 F Maria 6852 0.00343 Cairiana
## [1] 79089
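To make the matching metric concrete, here is a minimal base R sketch of a trigram Jaccard distance. The helpers `qgrams()` and `jaccard_distance()` are hypothetical names written for illustration. They are not part of fozziejoin and may differ from its implementation in details such as case handling or the treatment of strings shorter than `q`.

```r
# Hypothetical helper: the set of unique q-grams in a single string
qgrams <- function(x, q = 3) {
  n <- nchar(x)
  if (n < q) return(character(0))
  first <- seq_len(n - q + 1)
  unique(substring(x, first, first + q - 1))
}

# Hypothetical helper: Jaccard distance = 1 - |intersection| / |union|
jaccard_distance <- function(a, b, q = 3) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  union_size <- length(union(ga, gb))
  # Assumption: the distance is left undefined when neither string has a q-gram
  if (union_size == 0) return(NA_real_)
  1 - length(intersect(ga, gb)) / union_size
}

# "Brianna" and "Cairiana" share the trigrams "ria" and "ian" out of 9 in total
jaccard_distance("Brianna", "Cairiana")  # 1 - 2/9, about 0.778
jaccard_distance("Emily", "Emily")       # 0
```

Under this definition, two names need only one trigram in common to score below 1, which is one reason a permissive distance threshold can yield many loosely related pairs.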
Note the return type of the object is tibble in this
case. If one or both input datasets are a tibble, a
tibble is returned. Otherwise, a base
data.frame is returned.
# If neither input is a `tibble`, a `data.frame` is returned.
fozzie_df <- fozzie_string_join(
as.data.frame(babynames),
as.data.frame(sample_df),
how='inner',
method='jaccard',
q=3,
by = c('name')
)
head(fozzie_df)
## year sex name.x n prop name.y
## 1 2000 F Emily 25953 0.01300980 Olunatimilehin
## 2 2000 F Brianna 12878 0.00645552 Cairiana
## 3 2000 F Victoria 10923 0.00547551 Cairiana
## 4 2000 F Haley 9070 0.00454664 Taleka
## 5 2000 F Hailey 7831 0.00392555 Olunatimilehin
## 6 2000 F Maria 6852 0.00343479 Cairiana
fozziejoin also supports numeric-based fuzzy joins.
These are useful for measurements, non-geometric coordinates, or
scores. fozzie_difference_join compares the absolute difference in each
column separately, while fozzie_distance_join uses a method of
"euclidean" or "manhattan" to evaluate all
columns simultaneously.

Most join functions allow you to return the fuzzy matching metric via
the distance_col argument. When multiple distances are
returned, the distance output columns will be named using the pattern
{distance_col}_{left column name}_{right column name}.
# Simulate data
size <- 1000
df1 <- tibble(
x = round(runif(size, min = 0, max = 100), 2),
y = round(runif(size, min = 0, max = 100), 2)
)
df2 <- tibble(
x = round(runif(size, min = 0, max = 100), 2),
y = round(runif(size, min = 0, max = 100), 2)
)

# Absolute difference join (per column)
diff_join <- fozzie_difference_join(
df1, df2, max_distance=1, distance_col = 'diff'
)
## Joining by: c("x", "y")
## # A tibble: 6 × 6
## x.x y.x x.y y.y diff_x_x diff_y_y
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 25.0 17.6 25.0 17.7 0 0.150
## 2 3.51 79.3 3.51 78.6 0 0.680
## 3 75.7 50 75.7 50.5 0 0.460
## 4 94.4 32.2 94.4 32.7 0.01000 0.440
## 5 22.6 25.8 22.6 26.2 0.01000 0.41
## 6 60.8 28.0 60.8 28.3 0.01000 0.350
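The per-column logic and the distance column naming pattern can be sketched in base R on a tiny example. This is an illustration only. It assumes that a pair is kept when every column is within `max_distance`, and it uses a brute-force cross join that the package itself presumably avoids.

```r
left  <- data.frame(x = c(1.0, 5.0), y = c(2.0, 9.0))
right <- data.frame(x = c(1.4, 5.2), y = c(2.5, 3.0))

# Enumerate every left/right row pair (brute force, for illustration only)
idx <- expand.grid(i = seq_len(nrow(left)), j = seq_len(nrow(right)))
pairs <- data.frame(
  x.x = left$x[idx$i],  y.x = left$y[idx$i],
  x.y = right$x[idx$j], y.y = right$y[idx$j]
)

# Name the distance columns {distance_col}_{left column name}_{right column name}
distance_col <- "diff"
pairs[[paste(distance_col, "x", "x", sep = "_")]] <- abs(pairs$x.x - pairs$x.y)
pairs[[paste(distance_col, "y", "y", sep = "_")]] <- abs(pairs$y.x - pairs$y.y)

# Assumption: a pair matches only if every column is within max_distance
max_distance <- 1
matches <- pairs[pairs$diff_x_x <= max_distance & pairs$diff_y_y <= max_distance, ]
matches
```

Only the pair (1.0, 2.0) and (1.4, 2.5) survives. The pair (5.0, 9.0) and (5.2, 3.0) is close in `x` but differs by 6 in `y`.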
# Manhattan distance join (across all columns)
dist_join <- fozzie_distance_join(
df1, df2, method='manhattan', max_distance=1, distance_col='dist'
)
## Joining by: c("x", "y")
## # A tibble: 6 × 5
## x.x y.x x.y y.y dist
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 13.2 62.2 13.4 61.8 0.68
## 2 55.1 89.7 54.6 89.2 0.980
## 3 29.8 20.5 30.2 20.9 0.780
## 4 70.0 65.1 70.7 65.2 0.810
## 5 98.6 72.8 99.2 73.0 0.830
## 6 3.8 87.6 3.2 87.8 0.74
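The two numeric joins can disagree about the same pair of points. The base R sketch below uses made-up coordinates to compare a per-column check with the Manhattan and Euclidean distances at the same threshold. It illustrates the metrics themselves, not the package internals.

```r
p <- c(x = 0, y = 0)
q <- c(x = 0.8, y = 0.8)
max_distance <- 1

per_column <- abs(p - q)             # 0.8 in each column
manhattan  <- sum(abs(p - q))        # 1.6
euclidean  <- sqrt(sum((p - q)^2))   # sqrt(1.28), about 1.13

all(per_column <= max_distance)      # TRUE: each column is within 1
manhattan <= max_distance            # FALSE: the combined distance exceeds 1
euclidean <= max_distance            # FALSE
```

A combined distance is therefore the stricter choice at a given `max_distance` whenever more than one column differs.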
Interval joins allow you to match records based on overlapping ranges
— useful for genomic intervals, time windows, or numeric spans.
fozziejoin supports flexible interval matching with control
over overlap behavior and precision.
In this example, we simulate two datasets with randomly generated
intervals and use fozzie_interval_join() to find
overlapping pairs.
size <- 1000
# Simulate left data
starts1 <- runif(size, min = 0, max = 500)
ends1 <- starts1 + runif(size, min = 0, max = 10)
df1 <- tibble(start = starts1, end = ends1)
# Simulate right data
starts2 <- runif(size, min = 0, max = 500)
ends2 <- starts2 + runif(size, min = 0, max = 10)
df2 <- tibble(start = starts2, end = ends2)
# Perform interval join using real-valued ranges
real_olaps <- fozzie_interval_join(
df1, df2,
by = c(start = "start", end = "end"),
how = "inner",
overlap_type = "any",
maxgap = 0,
minoverlap = 0,
interval_mode = "real"
)

“integer” mode is designed for discrete, integer-based intervals
— similar to how the IRanges package handles genomic
ranges. It assumes endpoints are whole numbers and uses inclusive logic.
As an example, [1, 9] would match to [10, 11], as these ranges are
touching in integer space.
“real” mode supports continuous numeric ranges — ideal for
floating-point values like scores or measurements. This mode behaves
more like foverlaps() from data.table,
allowing precise control over overlap boundaries.
If the user does not specify interval_mode, a mode is
chosen automatically based on the input data. If all values are
integer, then integer mode is used. Otherwise,
real mode is used.
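The [1, 9] and [10, 11] example can be reproduced with a small base R sketch. The helper `interval_gap()` is hypothetical and is not part of fozziejoin. It assumes that integer mode counts the whole numbers strictly between two ranges, that real mode measures the distance between endpoints, and that a pair matches when the gap is at most `maxgap`.

```r
# Hypothetical helper, not a fozziejoin function
interval_gap <- function(start1, end1, start2, end2,
                         mode = c("real", "integer")) {
  mode <- match.arg(mode)
  # Distance from the later start to the earlier end;
  # zero or negative when the ranges overlap or share an endpoint
  gap <- max(start1, start2) - min(end1, end2)
  # In integer space there is no whole number between 9 and 10
  if (mode == "integer") gap <- gap - 1
  max(gap, 0)
}

maxgap <- 0
interval_gap(1, 9, 10, 11, mode = "integer") <= maxgap  # TRUE: adjacent ranges
interval_gap(1, 9, 10, 11, mode = "real") <= maxgap     # FALSE: gap of 1
```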
Temporal joins are also available via
fozzie_temporal_join and
fozzie_temporal_interval_join. Under the hood, they are an
extension of the difference and interval joins. While these functions
are designed to work with both POSIX timestamps and
Date types, users may not mix and match. All join columns
must be of the same type.
For POSIX timestamps, users may specify the distance unit as
days, hours, minutes, seconds, ms, us, or ns.
df1 <- data.frame(time = as.POSIXct(c(
"2023-01-01 12:00:00", "2023-01-01 13:00:00"
)))
df2 <- data.frame(time = as.POSIXct(c(
"2023-01-01 12:00:05", "2023-01-01 14:00:00"
)))
result <- fozzie_temporal_inner_join(
df1, df2, by = c("time"), max_distance = 10, unit = "seconds"
)
print(head(result))
## time.x time.y
## 1 2023-01-01 12:00:00 2023-01-01 12:00:05
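The single matching row can be checked by hand with base R's `difftime()`. This sketch fixes the time zone to UTC for reproducibility, which is an assumption the example above does not make.

```r
left_time   <- as.POSIXct("2023-01-01 12:00:00", tz = "UTC")
right_times <- as.POSIXct(
  c("2023-01-01 12:00:05", "2023-01-01 14:00:00"), tz = "UTC"
)

# Absolute time difference in seconds between the left row and each right row
gap_secs <- abs(as.numeric(difftime(right_times, left_time, units = "secs")))
gap_secs        # 5 7200

# Only the first right-hand timestamp is within 10 seconds
gap_secs <= 10  # TRUE FALSE
```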
For Date class objects, the unit must be days. This is
the default option.
# An error results if matching on `Date` with unit other than `days`
df1$date <- as.Date(df1$time)
df2$date <- as.Date(df2$time)
result <- fozzie_temporal_inner_join(
df1, df2, by = c("date"), max_distance = 10, unit = "seconds"
)
## Error in `fozzie_temporal_join()`:
## ! When joining on Date columns, unit must be 'days'.
fozzie_temporal_interval_join uses
interval_mode='real' in all cases.
df1 <- data.frame(
start = as.Date(c("2023-01-01", "2023-01-05")),
end = as.Date(c("2023-01-03", "2023-01-07"))
)
df2 <- data.frame(
start = as.Date(c("2023-01-02", "2023-01-06")),
end = as.Date(c("2023-01-04", "2023-01-08"))
)
result <- fozzie_temporal_interval_inner_join(
df1, df2,
by = c(start = "start", end = "end"),
overlap_type = "any",
maxgap = 0,
minoverlap = 0,
unit = "days"
)
head(result)
## start.x end.x start.y end.y
## 1 2023-01-01 2023-01-03 2023-01-02 2023-01-04
## 2 2023-01-05 2023-01-07 2023-01-06 2023-01-08
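The two matches above follow from a plain overlap test on the `Date` endpoints. The brute-force base R sketch below checks every pair with closed-interval logic, which is an assumption about how real mode treats shared endpoints. It is not the package's algorithm.

```r
left <- data.frame(
  start = as.Date(c("2023-01-01", "2023-01-05")),
  end   = as.Date(c("2023-01-03", "2023-01-07"))
)
right <- data.frame(
  start = as.Date(c("2023-01-02", "2023-01-06")),
  end   = as.Date(c("2023-01-04", "2023-01-08"))
)

# Enumerate all left/right row pairs
idx <- expand.grid(i = seq_len(nrow(left)), j = seq_len(nrow(right)))

# Two ranges overlap when each one starts no later than the other ends
overlaps <- left$start[idx$i] <= right$end[idx$j] &
  right$start[idx$j] <= left$end[idx$i]

idx[overlaps, ]  # pairs (1, 1) and (2, 2)
```

The cross pairs fail because, for example, the second left range starts on 2023-01-05, one day after the first right range ends.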
fozziejoin Join Types

| Join Type | Input Type(s) | Matching Logic | Key Options / Notes |
|---|---|---|---|
| fozzie_string_join | Character | String similarity (e.g. cosine, hamming) | method, q, distance_col |
| fozzie_difference_join | Numeric | Absolute difference per column | max_distance, distance_col |
| fozzie_distance_join | Numeric | Euclidean or Manhattan distance | method, max_distance, distance_col |
| fozzie_interval_join | Ranges (start/end) | Overlapping intervals | overlap_type, maxgap, minoverlap, interval_mode |
| fozzie_temporal_join | POSIXct / Date | Time difference within unit | unit, max_distance |
| fozzie_temporal_interval_join | POSIXct / Date ranges | Overlapping time intervals | unit, overlap_type, maxgap, minoverlap |