\documentclass{article} %\VignetteIndexEntry{Example} \usepackage[utf8]{inputenc} \begin{document} \SweaveOpts{concordance=TRUE} \title{Using KOR.addrlink} \author{Daniel Sch\"urmann} \date{February 29, 2024} \maketitle \section{Introduction} Consider a data set with semi-structured address data, e.g. street and house number as a concatenated string, wrongly spelled street names or non-existing house numbers. This data set (referred to as df\_match) should be mapped to a complete list of valid addresses within the given municipality. The latter data set is called df\_ref and may include further information like coordinates of addresses or district information. KOR.addrlink tries to solve this problem specifically for German municipalities as the package is specialized on German address schemes. \section{Reference data} First, a complete list of reference addresses (df\_ref) is needed. An example data.fame named "Adressen" is shown below. <<>>= library(KOR.addrlink) Adressen[c(sample(which(is.na(Adressen$HNRZ)), 4), sample(which(!is.na(Adressen$HNRZ)), 2)),] @ The columns used for the matching procedure are STRNAME (street name), HNR (house number) and HNRZ (additional letter). This vignette illustrates the merging workflow on two sample data sets called df1 and df2. \section{Example 1} df1 has address information in columns gross\_strasse and housnr. The columns Var1 and Var2 provide non-address related information about the individuals. Row 1183 shows that the column hausnr needs to be split into house number and additional letter before addresses can be matched. The function split\_number is provided for that task. <<>>= df1[1180:(1183+6),] @ split\_number takes hausnr and creates a data.frame with columns "Hausnummer" (house number) and "Hausnummernzusatz" (additional letter). <<>>= df1 <- cbind(df1, split_number(df1$hausnr)) df1[1180:(1183+6),] @ addrlink merges the two data sets. For both data sets, the columns referring to steet name, house number and additional letter need to be specified in exactly that order (parameter col\_ref and col\_match). <<>>= # column hausnr is no longer needed df1 <- within(df1, rm(hausnr)) df1_matched <- addrlink(df_ref = Adressen, col_ref = c("STRNAME", "HNR", "HNRZ"), df_match = df1, col_match = c("gross_strasse", "Hausnummer", "Hausnummernzusatz")) @ The result is a list with two data.frames \begin{itemize} \item ret: The merged data set \item QA: Indicators showing the match quality \end{itemize} <<>>= head(df1_matched$ret) table(df1_matched$QA$qAddress) @ qAdress states the stage within the matching procedure that yielded the match. Out of the 10000 records, 9670 could be merged directly. 72 had a valid street name, but an invalid house number. 157 records had (possibly) misspelled street names and 101 records could not be matched at all. \section{Example 2} The second data set has a single column "Adresse", which includes street names and house numbers. Thus, this column needs to be split by the function split\_address. <<>>= head(within(df2, Adresse <- trimws(Adresse))) @ split\_number creates a data.frame with columns "Strasse" (street) "Hausnummer" (house number) and "Hausnummernzusatz" (additional letter) from the column "Adresse". <<>>= df2 <- cbind(df2, split_address(df2$Adresse)) within(df2, Adresse <- trimws(Adresse))[23:(23+6),] @ Again, addrlink merges the two data sets. The parameter fuzzy\_threshold sets the threshold for fuzzy matching of misspelled street names. A value of 1 means no fuzzy matching and 0 means forced fuzzy matches for all records. If a steet name could be matched, but the provided house number does not exist, addrlink may randomly assign a valid house number to that record. A seed is always set to ensure reproducibility. Customization is possible via the parameter seed. <<>>= # column Adresse is no longer needed df2 <- within(df2, rm(Adresse)) df2_matched <- addrlink(df_ref = Adressen, col_ref = c("STRNAME", "HNR", "HNRZ"), df_match = df2, col_match = c("Strasse", "Hausnummer", "Hausnummernzusatz"), fuzzy_threshold = .9, seed = 1234) @ <<>>= head(df2_matched$ret) table(df2_matched$QA$qAddress) @ 49 records had invalid house numbers and one record was matched by fuzzy matching. This record can be inspected in detail. <<>>= id <- which(df2_matched$QA$qAddress == 3) df2_matched$ret[id,] df2_matched$QA[id,] @ In this case the fuzzy matching procedure was most likely correct (St.-Georg-Str. matched SANKT-GEORG-STRA{\ss}E). The number of cases with correct street name and randomly assigned house numbers is 10. <<>>= sum(df2_matched$QA$qscore == 0) @ \end{document}