%\VignetteIndexEntry{Record Linkage with Extreme Value Theory}
%\VignetteEngine{knitr::knitr}

<<include=FALSE>>=
# Save the current options first so that they can be restored at the end
backup_options <- options()
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
options(width = 60)
@

\documentclass[a4paper]{article}
%\usepackage[ansinew]{inputenc}
%\usepackage[ngerman]{babel}

\begin{document}

\title{Classifying record pairs by means of Extreme Value Theory}
\author{Andreas Borg, Murat Sariyar}
\maketitle

This document gives a practical description of a procedure for record
linkage by means of Extreme Value Theory (EVT). No labeled training data
are needed, but the user has to select an interval of thresholds in a mean
residual life plot (also known as a mean excess plot).

In the following, the data set \texttt{RLdata500} is used. As
classification with EVT is weight-based, weights have to be calculated for
the record pairs to be classified. Here, an EM algorithm is applied.

<<>>=
library(RecordLinkage)
@

<<>>=
data(RLdata500)
# Blocking fields: one blocking pass per list element
bf <- list(1, 3, 5, 6, 7)
# Build comparison patterns; string comparison on columns 1 to 4
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
  blockfld = bf, strcmp = 1:4)
# Compute weights with the EM algorithm
rpairs <- emWeights(rpairs)
@

Calling \texttt{getParetoThreshold} opens a mean residual life (MRL) plot
for the computed weights, as shown in Figure~\ref{fig:mrl-plot}. From this
plot, an interval has to be selected in which the graph shows a relatively
long and approximately linear descent. Usually such an interval can be
found in the range between 0 and 20 for weights computed with
\texttt{emWeights}, or between 0.5 and 0.9 for weights computed with
\texttt{epiWeights}. Figure~\ref{fig:mrl-with-interval} shows the same MRL
plot with the appropriate segment marked.

The interval is selected by clicking on the endpoints of the desired
segment of the graph. In some cases the right endpoint coincides with the
edge of the graph; then only the left endpoint has to be selected. See the
documentation of \texttt{identify} for more information on selecting
points on a plot.
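
The preference for an approximately linear segment has a standard
justification in extreme value theory, sketched here as background only.
The notation ($X$, $u$, $u_0$, $\sigma$, $\xi$) is introduced for
illustration and is not taken from the package documentation. For weights
$x_1, \dots, x_n$, an MRL plot displays an empirical estimate of the mean
excess function $e(u)$:
\[
  e(u) = \mathrm{E}\left[\, X - u \mid X > u \,\right],
  \qquad
  \hat{e}_n(u) =
  \frac{\sum_{i=1}^{n} (x_i - u)\, \mathbf{1}\{x_i > u\}}
       {\sum_{i=1}^{n} \mathbf{1}\{x_i > u\}} .
\]
If the excesses over a threshold $u_0$ follow a generalized Pareto
distribution with scale $\sigma > 0$ and shape $\xi < 1$, then for every
$u \ge u_0$ within the support
\[
  e(u) = \frac{\sigma + \xi\,(u - u_0)}{1 - \xi},
\]
which is linear in $u$ with slope $\xi / (1 - \xi)$. Under this assumption,
a linear descent of the plot corresponds to a negative shape parameter
$\xi$, and the left end of the linear segment indicates from which point on
the generalized Pareto model is plausible.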
<<eval=FALSE>>=
## Not run:
getParetoThreshold(rpairs)
@

\begin{figure}[b]
<<>>=
plotMRL(rpairs)
@
\caption{MRL plot}
\label{fig:mrl-plot}
\end{figure}

\begin{figure}[b]
<<>>=
plotMRL(rpairs)
# Mark the endpoints of the selected interval
abline(v = c(1.2, 12.8), col = "red", lty = "dashed")
# Redraw the part of the MRL curve inside the interval in red
l <- mrl(rpairs$Wdata)
in_interval <- l$x > 1.2 & l$x < 12.8
points(l$x[in_interval], l$y[in_interval], col = "red", type = "l")
@
\caption{MRL plot with appropriate graph segment marked}
\label{fig:mrl-with-interval}
\end{figure}

As an alternative to interactive selection, the interval can be passed to
\texttt{getParetoThreshold} via the \texttt{interval} argument. In either
case, the return value is a threshold that can be used directly for
classification.

<<>>=
# Same interval as marked in the figure above
threshold <- getParetoThreshold(rpairs, interval = c(1.2, 12.8))
result <- emClassify(rpairs, threshold)
summary(result)
@

<<include=FALSE>>=
# Restore the options saved at the beginning
options(backup_options)
@

\end{document}