---
title: "Introduction to RapidFuzz"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to RapidFuzz}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(RapidFuzz)
```

## Overview

`RapidFuzz` provides high-performance string similarity and distance functions powered by the C++ library [rapidfuzz-cpp](https://github.com/rapidfuzz/rapidfuzz-cpp). It is useful for tasks such as record linkage, fuzzy matching, typo correction, and deduplication.

This vignette demonstrates the main features of the package using data readily available in R.

---

## 1. Basic String Distances

### Levenshtein Distance

The Levenshtein distance counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

```{r}
levenshtein_distance("kitten", "sitting")
levenshtein_normalized_similarity("kitten", "sitting")
```

### Comparing Multiple Metrics

Let's compare how different metrics score the same pair of strings:

```{r}
s1 <- "California"
s2 <- "Kalifornia"

data.frame(
  metric = c("Levenshtein", "Damerau-Levenshtein", "Hamming", "Jaro",
             "Jaro-Winkler", "LCSseq", "OSA", "Indel"),
  distance = c(
    levenshtein_distance(s1, s2),
    damerau_levenshtein_distance(s1, s2),
    hamming_distance(s1, s2),
    round(jaro_distance(s1, s2), 4),
    round(jaro_winkler_distance(s1, s2), 4),
    lcs_seq_distance(s1, s2),
    osa_distance(s1, s2),
    indel_distance(s1, s2)
  ),
  normalized_similarity = c(
    round(levenshtein_normalized_similarity(s1, s2), 4),
    round(damerau_levenshtein_normalized_similarity(s1, s2), 4),
    round(hamming_normalized_similarity(s1, s2), 4),
    round(jaro_normalized_similarity(s1, s2), 4),
    round(jaro_winkler_normalized_similarity(s1, s2), 4),
    round(lcs_seq_normalized_similarity(s1, s2), 4),
    round(osa_normalized_similarity(s1, s2), 4),
    round(indel_normalized_similarity(s1, s2), 4)
  )
)
```

---

## 2. Fuzzy Matching with Fuzz Ratios

The `fuzz_*` family of functions provides different strategies for comparing strings, especially useful when word order or partial matches matter.

```{r}
# Exact content, different case/spacing
fuzz_ratio("New York City", "new york city")

# Partial match: one string is contained in the other
fuzz_partial_ratio("York", "New York City")

# Word order doesn't matter
fuzz_token_sort_ratio("City of New York", "New York City")

# Common tokens
fuzz_token_set_ratio("New York City NY", "New York City")

# Weighted ratio (best overall heuristic)
fuzz_WRatio("New York City", "new york city!!")
```

### New in v1.1.0: Partial Token Ratios

These combine the benefits of token-based comparison with partial matching:

```{r}
fuzz_partial_token_sort_ratio("Museum of Modern Art", "Modern Art Museum NYC")
fuzz_partial_token_set_ratio("Museum of Modern Art", "Modern Art Museum NYC")
fuzz_partial_token_ratio("Museum of Modern Art", "Modern Art Museum NYC")
```

---

## 3. Matching Against a List of Choices

A common task is finding the best match for a query within a list of options. `RapidFuzz` provides three extract functions for this.

### Using US State Names

```{r}
# Misspelled state names
queries <- c("Kalifornia", "Nwe York", "Texs", "Florda", "Pensylvania")
states <- state.name

# Find the best match for each misspelled name
results <- lapply(queries, function(q) {
  best <- extract_best_match(q, states, score_cutoff = 0)
  data.frame(
    query = q,
    best_match = best$choice,
    score = round(best$score, 2)
  )
})
do.call(rbind, results)
```

### Extract Top-N Matches

```{r}
# Find top 5 states similar to "New"
extract_matches("New", states, score_cutoff = 50, limit = 5,
                scorer = "PartialRatio")
```

### Extract All Matches Above a Threshold

```{r}
# All states with > 70% similarity to "North"
extract_similar_strings("North", states, score_cutoff = 70)
```

---

## 4. Choosing the Right Scorer

The `extract_matches()` function supports 10 different scorers.
The best choice depends on your data:

```{r}
query <- "san francisco"
cities <- c("San Francisco", "San Fernando", "Santa Fe", "San Diego",
            "Francisco", "South San Francisco", "San Fran")

scorers <- c("Ratio", "PartialRatio", "TokenSortRatio", "TokenSetRatio",
             "WRatio", "QRatio", "PartialTokenSortRatio",
             "PartialTokenSetRatio", "PartialTokenRatio", "TokenRatio")

results <- lapply(scorers, function(sc) {
  m <- extract_matches(query, cities, score_cutoff = 0, limit = 3, scorer = sc)
  data.frame(scorer = sc, rank1 = m$choice[1], score1 = round(m$score[1], 1))
})
do.call(rbind, results)
```

---

## 5. String Preprocessing

The `processString()` function helps normalize strings before comparison:

```{r}
# Trim + lowercase
processString(" São Paulo ", processor = TRUE, asciify = FALSE)

# Trim + lowercase + ASCII transliteration
processString(" São Paulo ", processor = TRUE, asciify = TRUE)

# ASCII only
processString("Ñoño", processor = FALSE, asciify = TRUE)
```

This is especially useful for matching names with accented characters:

```{r}
# Without preprocessing
fuzz_ratio("São Paulo", "sao paulo")

# With preprocessing
fuzz_ratio(
  processString("São Paulo", processor = TRUE, asciify = TRUE),
  processString("sao paulo", processor = TRUE, asciify = TRUE)
)
```

---

## 6. Edit Operations

Edit operations show exactly what transformations are needed to convert one string into another.

```{r}
# Levenshtein edit operations
ops <- get_editops("saturday", "sunday")
ops
```

```{r}
# Apply the operations
editops_apply_str(ops, "saturday", "sunday")
```

```{r}
# LCSseq edit operations
lcs_seq_editops("kitten", "sitting")
```

---

## 7. Prefix and Postfix Matching

Useful for comparing strings that share beginnings or endings:

```{r}
# Same prefix "inter"
prefix_similarity("international", "internet")
prefix_normalized_similarity("international", "internet")

# Same postfix "tion"
postfix_similarity("education", "formation")
postfix_normalized_similarity("education", "formation")
```

---

## 8. Practical Example: Record Linkage

A real-world scenario: matching messy data against a clean reference list.

```{r}
# Simulated "dirty" records
dirty <- c("J. Smith", "Jane M. Doe", "Bob Johnson Jr",
           "Alice Wonderland", "Charlie Browne")

# Clean reference list
clean <- c("John Smith", "Jane Mary Doe", "Robert Johnson Junior",
           "Alice Wonder", "Charles Brown", "David Lee")

# Match each dirty record to the best clean record
matches <- lapply(dirty, function(d) {
  best <- extract_best_match(d, clean, score_cutoff = 0)
  data.frame(
    dirty_record = d,
    matched_to = best$choice,
    confidence = round(best$score, 1)
  )
})
do.call(rbind, matches)
```

---

## 9. Performance Comparison with Base R

`RapidFuzz` is implemented in C++ and is significantly faster than pure R alternatives for string matching tasks.
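To give a sense of what the benchmark measures, here is a minimal pure-R Levenshtein implementation using the classic two-row dynamic-programming recurrence. This is an illustrative sketch only: `lev_r()` is a hypothetical helper written for this vignette, not part of the package, and it is exactly the kind of interpreted loop the C++ backend avoids.

```r
# Minimal pure-R Levenshtein distance via dynamic programming.
# prev holds edit distances for the previous row of the DP table.
lev_r <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  prev <- 0:length(b)              # distances from the empty prefix of a
  for (i in seq_along(a)) {
    cur <- i                       # deleting the first i characters of a
    for (j in seq_along(b)) {
      cur[j + 1] <- min(prev[j + 1] + 1,            # deletion
                        cur[j] + 1,                 # insertion
                        prev[j] + (a[i] != b[j]))   # substitution (0 if equal)
    }
    prev <- cur
  }
  prev[length(b) + 1]
}

lev_r("kitten", "sitting")  # 3, matching levenshtein_distance() above
```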
```{r}
# Compare performance: RapidFuzz vs base R adist
s1 <- paste(sample(letters, 100, replace = TRUE), collapse = "")
s2 <- paste(sample(letters, 100, replace = TRUE), collapse = "")

bench <- system.time(
  for (i in 1:1000) levenshtein_distance(s1, s2)
)
bench_base <- system.time(
  for (i in 1:1000) adist(s1, s2)
)

data.frame(
  method = c("RapidFuzz", "base::adist"),
  time_1000_calls = c(bench["elapsed"], bench_base["elapsed"])
)
```

---

## Summary

| Task | Recommended Functions |
|------|-----------------------|
| Simple distance/similarity | `levenshtein_*`, `hamming_*` |
| Transpositions matter | `damerau_levenshtein_*`, `osa_*` |
| Fuzzy matching (general) | `fuzz_WRatio`, `fuzz_QRatio` |
| Partial string matching | `fuzz_partial_ratio`, `fuzz_partial_token_*` |
| Word-order independent | `fuzz_token_sort_ratio`, `fuzz_token_set_ratio` |
| Find best match in list | `extract_best_match`, `extract_matches` |
| Names with accents | `processString()` + any metric |
| Common prefix/suffix | `prefix_*`, `postfix_*` |
| Edit operations detail | `get_editops`, `lcs_seq_editops`, `osa_editops` |
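Finally, the Levenshtein results from Section 1 can be cross-checked against base R's `adist()`, which also computes (generalized) Levenshtein distance. This is a quick sanity check that requires no extra packages:

```r
# Cross-check the Section 1 examples with base R's adist(),
# which also computes Levenshtein edit distance.
adist("kitten", "sitting")        # 3 edits: k->s, e->i, insert g
adist("California", "Kalifornia") # 1 edit: C->K

# adist() is vectorized over the choices, so it can also serve as a
# rough base-R stand-in for extract_best_match():
state.name[which.min(adist("Texs", state.name))]  # "Texas"
```

Note that `adist()` returns raw distances rather than normalized similarity scores, so results are comparable to `levenshtein_distance()` but not to the `fuzz_*` ratios.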