---
title: "Fuzzy Matching with `fozziejoin`"
author: "Jon Downs"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fuzzy Matching with `fozziejoin`}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

### Introduction

Fuzzy matching is a powerful technique for joining datasets that contain noisy or inconsistent fields, such as names with typos, small differences in date or numeric value, or overlapping intervals. R users may be familiar with the `fuzzyjoin` package for such tasks.

`fozziejoin` is a performance-minded alternative that supports fast, flexible fuzzy joins across multiple data types. While it is not a drop-in replacement for `fuzzyjoin`, we hope that the package feels familiar to longtime users.

In this vignette, we demonstrate fuzzy joins using the following `fozziejoin` utilities:

- `fozzie_string_join`
- `fozzie_difference_join`
- `fozzie_distance_join`
- `fozzie_interval_join`
- `fozzie_temporal_join`
- `fozzie_temporal_interval_join`

The goal is to broadly introduce the package to new users and showcase its capabilities.

### String Distance joins

To demonstrate string matching, we use the `babynames` dataset. We randomly sample 10 names and introduce a single-character mutation to simulate noisy input.

```{r}
library(babynames)
library(fozziejoin)
library(tibble)

# Seed for reproducibility
set.seed(1337)

# Restrict to names from years 2000 or later
babynames <- babynames[babynames$year >= 2000, ]

# Sample rows from babynames dataset
sample_df <- babynames[sample(nrow(babynames), 10), 'name']

# Mutate a single character in the 'name' field for sample
mutate_char <- function(x) {
  if (nchar(x) == 0) return(x)
  pos <- sample(1:nchar(x), 1)
  new_char <- sample(letters, 1)
  substr(x, pos, pos) <- new_char
  return(x)
}
sample_df$name <- sapply(sample_df$name, mutate_char)
```

Next, we attempt to join the modified names back to the original dataset using trigram-based Jaccard similarity.
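
To make the metric concrete, here is a minimal base R sketch of a trigram-based Jaccard distance. The helpers `qgrams()` and `jaccard_distance()` are hypothetical names introduced only for illustration; they are not part of `fozziejoin`, and the package's own implementation may treat edge cases (such as strings shorter than `q`) differently.

```{r}
# Hypothetical helper: the set of unique q-grams contained in a string
qgrams <- function(s, q = 3) {
  n <- nchar(s)
  if (n < q) return(character(0))
  unique(substring(s, 1:(n - q + 1), q:n))
}

# Hypothetical helper: Jaccard distance = 1 - |intersection| / |union|
jaccard_distance <- function(a, b, q = 3) {
  A <- qgrams(a, q)
  B <- qgrams(b, q)
  n_union <- length(union(A, B))
  # Convention chosen for this sketch only: two empty sets are identical
  if (n_union == 0) return(0)
  1 - length(intersect(A, B)) / n_union
}

jaccard_distance("Sophia", "Sophie")
```

In this sketch, "Sophia" and "Sophie" share three of five distinct trigrams, so the distance is 0.4. A single-character mutation changes more trigrams the closer it falls to the middle of a name, which is one reason short names can be harder to recover.
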
`fozziejoin` supports all the same string distance algorithms as `fuzzyjoin`, though some implementations (`soundex` and `jw`) differ slightly.

```{r}
fozzie <- fozzie_string_join(
  babynames, sample_df, how='inner', method='jaccard', q=3, by = c('name')
)
print(head(fozzie))
print(nrow(fozzie))
```

Note that the result is a `tibble` in this case. If either input is a `tibble`, a `tibble` is returned; otherwise, a base `data.frame` is returned.

```{r}
# If neither input is a `tibble`, a `data.frame` is returned.
fozzie_df <- fozzie_string_join(
  as.data.frame(babynames),
  as.data.frame(sample_df),
  how='inner',
  method='jaccard',
  q=3,
  by = c('name')
)
head(fozzie_df)
```

### Difference and Distance joins

`fozziejoin` also supports numeric-based fuzzy joins. These are useful for measurements, non-geometric coordinates, or scores.

- **Difference joins** filter based on the absolute difference between columns, evaluated individually.
- **Distance joins** use multi-dimensional metrics like `"euclidean"` or `"manhattan"` to evaluate all columns simultaneously.

Most join functions allow you to return the fuzzy matching metric via the `distance_col` argument. When multiple distances are returned, the distance output columns are named using the pattern `{distance_col}_{left column name}_{right column name}`.

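
The distinction between the two join families can be illustrated for a single pair of rows in base R. This is only a sketch of the matching rule, not of how `fozziejoin` evaluates it; the example points are made up, and treating the threshold as inclusive (`<=`) is an assumption.

```{r}
# Two hypothetical points, one from each table
left  <- c(x = 10.0, y = 20.0)
right <- c(x = 10.6, y = 20.7)
max_distance <- 1

# Difference-style rule: each column must be within the threshold on its own
per_column <- abs(left - right)
difference_match <- all(per_column <= max_distance)

# Distance-style rule: combine all columns into one number, then compare
manhattan <- sum(per_column)
euclidean <- sqrt(sum(per_column^2))
manhattan_match <- manhattan <= max_distance
euclidean_match <- euclidean <= max_distance

c(difference = difference_match,
  manhattan  = manhattan_match,
  euclidean  = euclidean_match)
```

Here the pair passes the per-column rule (differences of about 0.6 and 0.7) and the Euclidean rule (about 0.92), but fails the Manhattan rule (about 1.3). The same `max_distance` can therefore produce different match sets depending on the join type and method.
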

```{r}
# Simulate data
size <- 1000
df1 <- tibble(
  x = round(runif(size, min = 0, max = 100), 2),
  y = round(runif(size, min = 0, max = 100), 2)
)
df2 <- tibble(
  x = round(runif(size, min = 0, max = 100), 2),
  y = round(runif(size, min = 0, max = 100), 2)
)
```

```{r}
# Absolute difference join (per column)
diff_join <- fozzie_difference_join(
  df1, df2, max_distance=1, distance_col = 'diff'
)
print(head(diff_join))

# Manhattan distance join (across all columns)
dist_join <- fozzie_distance_join(
  df1, df2, method='manhattan', max_distance=1, distance_col='dist'
)
print(head(dist_join))
```

### Interval Joins

Interval joins match records based on overlapping ranges, which is useful for genomic intervals, time windows, or numeric spans. `fozziejoin` supports flexible interval matching with control over overlap behavior and precision.

In this example, we simulate two datasets with randomly generated intervals and use `fozzie_interval_join()` to find overlapping pairs.

```{r}
size <- 1000

# Simulate left data
starts1 <- runif(size, min = 0, max = 500)
ends1 <- starts1 + runif(size, min = 0, max = 10)
df1 <- tibble(start = starts1, end = ends1)

# Simulate right data
starts2 <- runif(size, min = 0, max = 500)
ends2 <- starts2 + runif(size, min = 0, max = 10)
df2 <- tibble(start = starts2, end = ends2)

# Perform interval join using real-valued ranges
real_olaps <- fozzie_interval_join(
  df1, df2,
  by = c(start = "start", end = "end"),
  how = "inner",
  overlap_type = "any",
  maxgap = 0,
  minoverlap = 0,
  interval_mode = "real"
)
```

#### Interval Modes Explained

- "integer" mode is designed for discrete, integer-based intervals, similar to how the `IRanges` package handles genomic ranges. It assumes endpoints are whole numbers and uses inclusive logic. As an example, [1, 9] would match [10, 11], as these ranges are touching in integer space.
- "real" mode supports continuous numeric ranges, which is ideal for floating-point values like scores or measurements.
  This mode behaves more like `foverlaps()` from `data.table`, allowing precise control over overlap boundaries.

If the user does not specify `interval_mode`, a mode is chosen automatically based on the input data. If all values are of type `integer`, "integer" mode is used. Otherwise, "real" mode is used.

### Temporal joins

Temporal joins are also available via `fozzie_temporal_join` and `fozzie_temporal_interval_join`. Under the hood, they are an extension of the difference and interval joins.

While these functions are designed to work with both `POSIX` timestamps and `Date` types, the two cannot be mixed: all join columns must be of the same type. For `POSIX` timestamps, users may specify the distance unit as days, hours, minutes, seconds, ms, us, or ns.

```{r}
df1 <- data.frame(time = as.POSIXct(c(
  "2023-01-01 12:00:00", "2023-01-01 13:00:00"
)))
df2 <- data.frame(time = as.POSIXct(c(
  "2023-01-01 12:00:05", "2023-01-01 14:00:00"
)))

result <- fozzie_temporal_inner_join(
  df1, df2, by = c("time"), max_distance = 10, unit = "seconds"
)
print(head(result))
```

For `Date` class objects, the unit must be days. This is the default option.

```{r, error=TRUE}
# An error results if matching on `Date` with unit other than `days`
df1$date <- as.Date(df1$time)
df2$date <- as.Date(df2$time)
result <- fozzie_temporal_inner_join(
  df1, df2, by = c("date"), max_distance = 10, unit = "seconds"
)
```

```{r}
# Succeeds
result <- fozzie_temporal_inner_join(
  df1, df2, by = c("date"), max_distance = 10
)
```

`fozzie_temporal_interval_join` uses `interval_mode='real'` in all cases.

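
As a reminder of what that implies, the contrast between the two interval modes described earlier can be sketched in base R. The helper `intervals_match()` is hypothetical and not part of `fozziejoin`; it assumes closed intervals and `maxgap = 0`, and the package's actual boundary handling may differ.

```{r}
# Hypothetical helper: do the closed intervals [s1, e1] and [s2, e2] match?
intervals_match <- function(s1, e1, s2, e2, mode = c("real", "integer")) {
  mode <- match.arg(mode)
  # Gap between the two intervals (zero or negative when they overlap)
  gap <- max(s1, s2) - min(e1, e2)
  if (mode == "integer") {
    # Adjacent integers such as 9 and 10 have no whole number between them,
    # so a gap of 1 still counts as touching
    gap <= 1
  } else {
    # In continuous space the intervals must actually share a point
    gap <= 0
  }
}

intervals_match(1, 9, 10, 11, mode = "integer")
intervals_match(1, 9, 10, 11, mode = "real")
```

In this sketch, `[1, 9]` and `[10, 11]` match only in integer mode, since in continuous space they are a full unit apart.
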

```{r}
df1 <- data.frame(
  start = as.Date(c("2023-01-01", "2023-01-05")),
  end = as.Date(c("2023-01-03", "2023-01-07"))
)
df2 <- data.frame(
  start = as.Date(c("2023-01-02", "2023-01-06")),
  end = as.Date(c("2023-01-04", "2023-01-08"))
)

result <- fozzie_temporal_interval_inner_join(
  df1, df2,
  by = c(start = "start", end = "end"),
  overlap_type = "any",
  maxgap = 0,
  minoverlap = 0,
  unit = "days"
)
head(result)
```

### Summary of `fozziejoin` Join Types

| Join Type                       | Input Type(s)         | Matching Logic                            | Key Options / Notes                                     |
|---------------------------------|-----------------------|-------------------------------------------|---------------------------------------------------------|
| `fozzie_string_join`            | Character             | String similarity (e.g. cosine, hamming)  | `method`, `q`, `distance_col`                           |
| `fozzie_difference_join`        | Numeric               | Absolute difference per column            | `max_distance`, `distance_col`                          |
| `fozzie_distance_join`          | Numeric               | Euclidean or Manhattan distance           | `method`, `max_distance`, `distance_col`                |
| `fozzie_interval_join`          | Ranges (start/end)    | Overlapping intervals                     | `overlap_type`, `maxgap`, `minoverlap`, `interval_mode` |
| `fozzie_temporal_join`          | POSIXct / Date        | Time difference within unit               | `unit`, `max_distance`                                  |
| `fozzie_temporal_interval_join` | POSIXct / Date ranges | Overlapping time intervals                | `unit`, `overlap_type`, `maxgap`, `minoverlap`          |