---
title: "Example session for Weight-based deduplication"
author: "Andreas Borg, Murat Sariyar"
output: html_document
vignette: >
  %\VignetteIndexEntry{Weight-based deduplication}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
# Save the current options before changing any of them, so that the
# cleanup chunk at the end restores the original output width as well.
backup_options <- options()
options(width = 60)
```

This document shows an example session using the package *RecordLinkage*. A single data set is deduplicated, with weights calculated by an EM algorithm. Linking two data sets differs only in the step of generating record pairs, where `compare.linkage` takes the place of `compare.dedup`.

## Generating record pairs

```{r load-library, echo=FALSE, results='hide'}
library(RecordLinkage)
```

The data to be deduplicated is expected to reside in a data frame or matrix, each row containing one record. Example data sets of 500 and 10000 records are included in the package as `RLdata500` and `RLdata10000`.

```{r load-data}
data(RLdata500)
RLdata500[1:5,]
```

Record pairs for deduplication are generated with `compare.dedup`. In this example, blocking is set to return only record pairs which agree in at least two components of the subdivided date of birth, resulting in 810 pairs. The argument `identity` preserves the true matching status for later evaluation.

```{r compare-dedup}
# Each element of blockfld names two date-of-birth columns that must agree;
# a pair is kept if it satisfies at least one of the three combinations.
pairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                       blockfld = list(c(5,6), c(6,7), c(5,7)))
summary(pairs)
```

## Weight calculation

Weights are calculated by means of an EM algorithm. This step is computationally intensive and might take a while. The histogram is computed with `plot = FALSE`, so the resulting weight distribution is printed as bin breaks and counts rather than drawn.

```{r em-weights}
pairs <- emWeights(pairs)
hist(pairs$Wdata, plot = FALSE)
```

## Classification

To determine thresholds, record pairs within a given range of weights can be printed using `getPairs`^[The output of `getPairs` is shortened in this document.].
In this case, 24 is set as the upper and -7 as the lower threshold, separating links, possible links and non-links. The summary shows the resulting contingency table and error measures.

```{r get-pairs-hidden, results='hide'}
# Record pairs with weights between 20 and 30
getPairs(pairs, 30, 20)
```

```{r get-pairs-shown, echo=FALSE}
getPairs(pairs, 30, 20)[23:36,]
```

```{r em-classify}
pairs <- emClassify(pairs, threshold.upper = 24, threshold.lower = -7)
summary(pairs)
```

To review the record pairs classified as possible links, `getPairs` can be restricted to them via the argument `show`. A list with the ids of linked pairs can be extracted from the output of `getPairs` with the argument `single.rows` set to `TRUE`.

```{r final-pairs}
# Possible links only, for manual review
possibles <- getPairs(pairs, show = "possible")
possibles[1:6,]

# Links with one row per record pair, reduced to the two record ids
links <- getPairs(pairs, show = "links", single.rows = TRUE)
link_ids <- links[, c("id1", "id2")]
link_ids
```

```{r cleanup, echo=FALSE, results='hide'}
options(backup_options)
```
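
The three-way split that two thresholds produce can be sketched in base R. The following is a minimal illustration and not the package's implementation: `classify_by_weight` is a hypothetical helper, the weights are made up, and treating a weight equal to a threshold as belonging to the higher class is an assumption.

```{r threshold-sketch}
# Hypothetical helper, not part of RecordLinkage: assign each weight to
# "L" (link), "P" (possible link) or "N" (non-link).
# Assumption: a weight equal to a threshold falls into the higher class.
classify_by_weight <- function(weights, upper, lower) {
  stopifnot(is.numeric(weights), upper >= lower)
  classes <- rep("N", length(weights))   # default: non-link
  classes[weights >= lower] <- "P"       # at or above the lower threshold
  classes[weights >= upper] <- "L"       # at or above the upper threshold
  factor(classes, levels = c("N", "P", "L"))
}

# Made-up weights, chosen only to exercise all three classes
example_weights <- c(-12.5, -7, 3.2, 23.9, 24, 31.4)
example_classes <- classify_by_weight(example_weights, upper = 24, lower = -7)
table(example_classes)
```

Moving the two thresholds closer together shrinks the group of possible links that needs manual review, at the cost of more pairs being assigned automatically.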