| Title: | Robust Probabilistic Matching for German Company Names |
| Version: | 0.1.3 |
| Description: | A pipeline for matching messy company name strings against a clean dictionary (e.g., 'Orbis'). Implements a cascading strategy: Exact -> Fuzzy ('zoomerjoin') -> 'FTS5' ('SQLite') -> Rarity Weighted. References: Beniamino Green (2025) https://beniamino.org/zoomerjoin/; https://www.sqlite.org/fts5.html. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Imports: | data.table, stringi, stringdist, zoomerjoin, DBI, RSQLite, cli, progressr, httr, jsonlite, glue, purrr, readr, dplyr |
| Suggests: | testthat |
| NeedsCompilation: | no |
| Packaged: | 2026-03-06 14:07:45 UTC; getingin |
| Author: | Giulian Etingin-Frati [aut, cre] |
| Maintainer: | Giulian Etingin-Frati <etingin-frati@kof.ethz.ch> |
| Repository: | CRAN |
| Date/Publication: | 2026-03-06 14:20:02 UTC |
Internal Azure Chat Completion Wrapper (Custom Endpoint)
Description
Sends a request to a custom Azure-like endpoint (e.g. /openai/v1/responses).
Usage
azure_chat_request(
  system_msg,
  user_msg,
  endpoint,
  api_key,
  deployment,
  api_version = "2024-04-14"
)
Arguments
system_msg |
String. The instructions for the LLM. |
user_msg |
String. The specific case to evaluate. |
endpoint |
String. Base URL. |
api_key |
String. API Key. |
deployment |
String. Model/Deployment name. |
api_version |
String. API version (unused in this custom path but kept for compatibility). |
Value
A character string (the JSON response) or NULL on failure.
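Because the wrapper returns NULL on failure instead of raising an error, callers need an explicit NULL check. The sketch below shows one way such a contract could be built and consumed in base R. The helper with_null_on_failure() is hypothetical and not part of the package; the real wrapper may handle failures differently.

```r
# Hypothetical helper (not part of the package): run a request function and
# turn any error into NULL, mirroring the documented return contract.
with_null_on_failure <- function(request_fn) {
  tryCatch(request_fn(), error = function(e) NULL)
}

# A stand-in for a successful call that returns a JSON string.
ok <- with_null_on_failure(function() '{"decision": "match"}')

# A stand-in for a failed call (for example, a network error).
bad <- with_null_on_failure(function() stop("request failed"))

# Callers check for NULL before trying to parse the response.
if (is.null(bad)) message("Request failed; skipping this case.")
```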
Match Company Names against a Dictionary
Description
Runs a cascading matching pipeline: Exact -> Fuzzy (Zoomer) -> FTS5 -> Rarity. Queries matched in an earlier step are excluded from all later steps.
Usage
match_companies(
  queries,
  dictionary,
  query_col = "company_name",
  dict_col = "company_name",
  unique_id_col = "query_id",
  dict_id_col = "orbis_id",
  threshold_jw = 0.8,
  threshold_zoomer = 0.4,
  threshold_rarity = 1,
  n_cores = 1
)
Arguments
queries |
Data frame. Must contain the columns specified in query_col and unique_id_col. |
dictionary |
Data frame. Must contain the columns specified in dict_col and dict_id_col. |
query_col |
String. Column name for company names in queries. |
dict_col |
String. Column name for company names in dictionary. |
unique_id_col |
String. ID column in queries. |
dict_id_col |
String. ID column in dictionary. |
threshold_jw |
Numeric (0-1). Minimum Jaro-Winkler similarity. Default 0.8. |
threshold_zoomer |
Numeric (0-1). Jaccard threshold for blocking. Default 0.4. |
threshold_rarity |
Numeric. Minimum score for rarity matching. Default 1.0. |
n_cores |
Integer. Number of cores (reserved for future parallel implementation). |
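To give a feel for the scale of threshold_zoomer, the sketch below computes a Jaccard similarity over character bigrams in base R. It is illustrative only: the shingle width of 2 and the helper names shingles() and jaccard() are assumptions, and zoomerjoin approximates this kind of similarity with its own hashing scheme instead of computing it pairwise.

```r
# Hypothetical helpers (not part of the package).
# Split a string into its unique character n-grams ("shingles").
shingles <- function(s, n = 2) {
  chars <- strsplit(s, "")[[1]]
  if (length(chars) < n) return(s)
  idx <- seq_len(length(chars) - n + 1)
  unique(vapply(idx, function(i) paste(chars[i:(i + n - 1)], collapse = ""),
                character(1)))
}

# Jaccard similarity: shared shingles divided by all distinct shingles.
jaccard <- function(a, b, n = 2) {
  sa <- shingles(a, n)
  sb <- shingles(b, n)
  length(intersect(sa, sb)) / length(union(sa, sb))
}

# 6 shared bigrams out of 9 distinct ones, so this pair clears a 0.4 threshold.
sim <- jaccard("siemens", "siemens ag")
```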
Value
A data.table containing query_id, dict_id, and match_type.
Examples
# Create sample query data
queries <- data.frame(
  query_id = 1:3,
  company_name = c("BMW", "Siemens AG", "Deutsche Bank")
)
# Create sample dictionary
dictionary <- data.frame(
  orbis_id = c("D001", "D002", "D003"),
  company_name = c("BMW AG", "Siemens Aktiengesellschaft", "Commerzbank AG")
)
# Match companies (uses multi-threaded Rust internals via zoomerjoin)
results <- match_companies(
  queries = queries,
  dictionary = dictionary,
  query_col = "company_name",
  dict_col = "company_name",
  unique_id_col = "query_id",
  dict_id_col = "orbis_id"
)
print(results)
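The cascading idea, where each stage only sees queries that earlier stages left unmatched, can be sketched in base R with two stages. This is not the package's implementation: the function cascade_match() is hypothetical, and the fuzzy stage uses a normalized Levenshtein similarity from utils::adist() as a stand-in for the Jaro-Winkler, zoomerjoin, FTS5, and rarity stages.

```r
# Hypothetical two-stage cascade (not the package's code).
cascade_match <- function(q, d, threshold = 0.8) {
  out <- data.frame(query = q, match = NA_character_,
                    match_type = "Unmatched", stringsAsFactors = FALSE)

  # Stage 1: exact string equality.
  hit <- match(q, d)
  exact <- !is.na(hit)
  out$match[exact] <- d[hit[exact]]
  out$match_type[exact] <- "Exact"

  # Stage 2: fuzzy matching, restricted to queries still unmatched.
  todo <- which(!exact)
  if (length(todo) > 0 && length(d) > 0) {
    dist <- utils::adist(q[todo], d)
    # Normalize by the longer string (at least 1 to avoid division by zero).
    len <- pmax(outer(nchar(q[todo]), nchar(d), pmax), 1)
    sim <- 1 - dist / len
    best <- apply(sim, 1, which.max)
    best_sim <- sim[cbind(seq_along(todo), best)]
    ok <- best_sim >= threshold
    out$match[todo[ok]] <- d[best[ok]]
    out$match_type[todo[ok]] <- "Fuzzy"
  }
  out
}

res <- cascade_match(
  q = c("bmw", "siemens", "deutsche bank"),
  d = c("bmw", "siemens ag", "commerzbank"),
  threshold = 0.6
)
print(res)
```

Running the cheap exact stage first keeps the more expensive fuzzy comparison limited to the remaining queries, which is the main reason for ordering the stages from strict to lenient.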
Normalize Company Names
Description
Standardizes company names by lowercasing, removing legal suffixes, translating characters to ASCII, and removing noise words.
Usage
normalize_company_name(x)
Arguments
x |
A character vector of company names. |
Value
A character vector of normalized names.
Examples
# Normalize a single company name
normalize_company_name("BMW AG")
normalize_company_name("Siemens GmbH & Co. KG")
# Normalize multiple names
companies <- c("Deutsche Bank AG", "VW Group", "BASF SE")
normalize_company_name(companies)
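The normalization steps described above can be approximated in base R as follows. This is a rough sketch, not the package's code: the function normalize_sketch(), the legal-form list, the noise-word list, and the umlaut transliteration rule (for example, u-umlaut to "ue") are all assumptions, and the output of normalize_company_name() may differ.

```r
# Hypothetical re-implementation for illustration only.
normalize_sketch <- function(x) {
  # Transliterate German umlauts and sharp s to ASCII (assumed mapping).
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
  to   <- c("ae", "oe", "ue", "ae", "oe", "ue", "ss")
  for (i in seq_along(from)) x <- gsub(from[i], to[i], x, fixed = TRUE)

  x <- tolower(x)

  # Replace punctuation and other symbols with spaces.
  x <- gsub("[^a-z0-9 ]+", " ", x)

  # Drop legal-form tokens and noise words (both lists are assumptions).
  legal <- c("gmbh", "mbh", "ag", "se", "kg", "co", "ohg", "ug", "ev")
  noise <- c("group", "holding")
  pattern <- paste0("\\b(", paste(c(legal, noise), collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x)

  # Collapse repeated spaces and trim the ends.
  trimws(gsub(" +", " ", x))
}

normalized <- normalize_sketch(
  c("BMW AG", "Siemens GmbH & Co. KG", "Deutsche Bank AG", "VW Group")
)
print(normalized)
```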
Internal OpenAI/Local LLM Chat Completion Wrapper
Description
Sends a request to a standard OpenAI-compatible endpoint (e.g. /v1/chat/completions). Used for both OpenAI's official API and local LLMs (Ollama, LM Studio, vLLM).
Usage
openai_chat_request(system_msg, user_msg, endpoint, api_key, model)
Arguments
system_msg |
String. The instructions for the LLM. |
user_msg |
String. The specific case to evaluate. |
endpoint |
String. Base URL (e.g., "http://localhost:11434/v1"). |
api_key |
String. API Key. Often not required for local LLMs. |
model |
String. Model name. |
Value
A character string (the JSON response) or NULL on failure.
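For orientation, an OpenAI-compatible chat completion request carries a model name and a list of role-tagged messages. The sketch below assembles such a request in base R without sending it. The helper build_chat_request(), the model name "llama3", and the way the base URL is joined with "/chat/completions" are assumptions for illustration; the package's internal wrapper may construct the request differently.

```r
# Hypothetical helper (not part of the package): assemble URL and body.
build_chat_request <- function(system_msg, user_msg, endpoint, model) {
  list(
    # Strip trailing slashes from the base URL before appending the path.
    url = paste0(sub("/+$", "", endpoint), "/chat/completions"),
    body = list(
      model = model,
      messages = list(
        list(role = "system", content = system_msg),
        list(role = "user", content = user_msg)
      )
    )
  )
}

req <- build_chat_request(
  system_msg = "You verify company name matches.",
  user_msg   = "Employer: BMW | Registry: BMW AG",
  endpoint   = "http://localhost:11434/v1/",
  model      = "llama3"
)

# Sending it with httr could look like this (not run here; needs a server):
# resp <- httr::POST(
#   req$url,
#   httr::add_headers(Authorization = paste("Bearer", api_key)),
#   body = req$body,
#   encode = "json"
# )
```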
Validate Matches using LLM (Azure OpenAI)
Description
Sends doubtful matches (not "Perfect" or "Unmatched") to an LLM for verification. Supports resuming from interruptions via chunk files.
Usage
validate_matches_llm(
  data,
  query_name_col,
  dict_name_col,
  output_dir = tempdir(),
  filename_stem = "match_validation",
  batch_size = 20,
  api_key = NULL,
  endpoint = NULL,
  deployment = NULL,
  engine = c("azure", "openai", "local")
)
Arguments
data |
Data frame. Must contain the columns specified by query_name_col and dict_name_col. |
query_name_col |
String. Column containing the user's query name (Employer). |
dict_name_col |
String. Column containing the dictionary match name (Registry). |
output_dir |
String. Directory to save temporary chunks and final results. Defaults to tempdir(). |
filename_stem |
String. Base name for output files. |
batch_size |
Integer. Number of rows to process before saving a chunk. |
api_key |
String. API Key. Defaults to |
endpoint |
String. API Endpoint. Defaults to |
deployment |
String. Deployment or model name. Defaults to |
engine |
String. One of "azure", "openai", or "local". |
Value
A data frame with added LLM_decision and LLM_reason columns.
Examples
## Not run:
# Sample matched data
matched_data <- data.frame(
  employer_name = c("BMW", "Siemens"),
  registry_name = c("BMW AG", "SAP SE"),
  dict_id = c("D001", "D002"),
  match_type = c("Fuzzy", "Fuzzy")
)
# Validate using LLM (requires Azure credentials)
validated <- validate_matches_llm(
  data = matched_data,
  query_name_col = "employer_name",
  dict_name_col = "registry_name"
)
print(validated)
## End(Not run)
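Resuming from chunk files can be pictured as follows: each batch is written to disk once it is processed, and a later run reads existing chunks instead of recomputing them. The sketch is a base R illustration under stated assumptions. The function process_in_chunks(), the chunk file naming, the .rds format, and the stand-in decision function are all hypothetical and do not reflect the package's actual file layout.

```r
# Hypothetical chunked processing with resume (not the package's code).
process_in_chunks <- function(data, decide, output_dir,
                              stem = "match_validation", batch_size = 20) {
  if (nrow(data) == 0) return(data)
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  starts <- seq(1, nrow(data), by = batch_size)

  parts <- lapply(seq_along(starts), function(i) {
    path <- file.path(output_dir, sprintf("%s_chunk_%04d.rds", stem, i))
    # Resume: reuse a chunk that a previous run already saved.
    if (file.exists(path)) return(readRDS(path))
    rows <- starts[i]:min(starts[i] + batch_size - 1, nrow(data))
    chunk <- data[rows, , drop = FALSE]
    chunk$LLM_decision <- decide(chunk)
    saveRDS(chunk, path)
    chunk
  })
  do.call(rbind, parts)
}

# Demo with a stand-in for the LLM call that counts how often it is invoked.
demo_dir <- file.path(tempdir(), "chunk_demo")
unlink(demo_dir, recursive = TRUE)
n_calls <- 0
fake_decide <- function(chunk) {
  n_calls <<- n_calls + 1
  rep("match", nrow(chunk))
}

df <- data.frame(id = 1:5)
first  <- process_in_chunks(df, fake_decide, demo_dir, batch_size = 2)
second <- process_in_chunks(df, fake_decide, demo_dir, batch_size = 2)
# The second run reads the three saved chunks and makes no further calls.
```

A smaller batch_size loses less work on interruption at the cost of more files on disk.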