---
title: "Example Session for Supervised Classification"
author: "Andreas Borg, Murat Sariyar"
output: html_document
vignette: >
  %\VignetteIndexEntry{Supervised Classification}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
# Save the current options before changing any of them,
# so that the cleanup chunk restores the original width as well
backup_options <- options()
options(width = 60)
```

This document shows an example session for supervised classification with the package *RecordLinkage*, applied to the deduplication of a single data set. Linking two data sets differs only in the step of generating record pairs. See also the vignette on Fellegi-Sunter deduplication for general information on using the package.

## Generating comparison patterns

```{r load-library, results='hide', echo=FALSE}
library(RecordLinkage)
```

In this session, a training set with 500 matches and 500 non-matches is generated from the included data set `RLdata10000`. The classifiers are calibrated on this training set and subsequently evaluated on record pairs from the set `RLdata500`.

```{r generate-pairs}
data(RLdata500)
data(RLdata10000)
# Sample 500 matching and 500 non-matching pairs as training data
train_pairs <- compare.dedup(
  RLdata10000,
  identity = identity.RLdata10000,
  n_match = 500,
  n_non_match = 500
)
# Build the record pairs of the evaluation set
eval_pairs <- compare.dedup(RLdata500, identity = identity.RLdata500)
```

## Training

`trainSupv` handles the calibration of supervised classifiers, which are selected through the argument `method`. In the following, a single decision tree (`rpart`), a bootstrap aggregation of decision trees (`bagging`), and a support vector machine (`svm`) are calibrated.

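Before calibrating, it can be worth confirming that the sampled training set has the intended class balance. The following is a minimal sketch, assuming that the true match status is stored in the `is_match` column of `train_pairs$pairs` (1 for a match, 0 for a non-match); the counts should reflect the `n_match` and `n_non_match` arguments passed to `compare.dedup`.

```{r inspect-training}
# Count the true status of the sampled training pairs
# (assumed coding: 1 = match, 0 = non-match; NA would mean unknown)
table(train_pairs$pairs$is_match, useNA = "ifany")
```
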
```{r training}
model_rpart <- trainSupv(train_pairs, method = "rpart")
model_bagging <- trainSupv(train_pairs, method = "bagging")
model_svm <- trainSupv(train_pairs, method = "svm")
```

## Classification

`classifySupv` handles classification for all supervised classifiers. It takes as arguments the structure returned by `trainSupv`, which contains the classification model, and the set of record pairs to classify.

```{r classification}
result_rpart <- classifySupv(model_rpart, eval_pairs)
result_bagging <- classifySupv(model_bagging, eval_pairs)
result_svm <- classifySupv(model_svm, eval_pairs)
```

## Results

### Rpart

```{r results-rpart, echo=FALSE}
summary(result_rpart)
```

### Bagging

```{r results-bagging, echo=FALSE}
summary(result_bagging)
```

### SVM

```{r results-svm, echo=FALSE}
summary(result_svm)
```

```{r cleanup, echo=FALSE, results='hide'}
options(backup_options)
```

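### Comparison

To view the three classifiers side by side, one option is to compute a few rates directly from the result objects. The sketch below is not part of the package: `link_rates` is a hypothetical helper, and it assumes that each result stores the true status in `pairs$is_match` (1 for a match) and the predicted class in `prediction`, with `"L"` denoting a link.

```{r compare-classifiers}
# Hypothetical helper (not part of RecordLinkage).
# sensitivity: share of true matches that were classified as links
# precision:   share of predicted links that are true matches
link_rates <- function(result) {
  is_match <- result$pairs$is_match == 1
  is_link <- result$prediction == "L"
  c(
    sensitivity = sum(is_match & is_link) / sum(is_match),
    precision = sum(is_match & is_link) / sum(is_link)
  )
}

# One column per classifier
sapply(
  list(rpart = result_rpart, bagging = result_bagging, svm = result_svm),
  link_rates
)
```

A classifier that predicts no links at all yields `NaN` for precision, since the denominator is zero in that case.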