---
title: "riskutility: Comprehensive Disclosure Risk and Data Utility Assessment for Anonymized and Synthetic Data in R"
author:
  - name: Matthias Templ
    affiliation: "School of Business, FHNW University of Applied Sciences and Arts Northwestern Switzerland"
    email: matthias.templ@gmail.com
    orcid: 0000-0002-8638-5276
  - name: Oscar Thees
    affiliation: "School of Business, FHNW University of Applied Sciences and Arts Northwestern Switzerland"
    orcid: 0009-0001-9378-4988
abstract: >
  The R package **riskutility** provides a comprehensive framework for measuring disclosure risk and data utility in synthetic and anonymized datasets. It implements over 30 risk measures spanning six paradigms --- frequency-based privacy models ($k$-anonymity, $l$-diversity, $t$-closeness), attribution-based (DCAP, TCAP, WEAP, DiSCO), ML-based (RAPID), distance-based (DCR, NNDR, IMS, RF proximity), record linkage, and membership inference (DOMIAS, NNAA) together with the GDPR anonymization criteria (singling out, linkability) --- alongside a broad suite of utility measures including propensity scores, distributional distances (Hellinger, energy, MMD), dependence structure fidelity (copula, contingency), regression fidelity, subgroup-stratified utility, and Train on Synthetic, Test on Real (TSTR). The multivariate Risk-Utility map (`rumap()`) provides a unified framework for jointly evaluating multiple risk and utility dimensions. All functions share a consistent S3 API with `print()`, `summary()`, and `plot()` methods and accept a `synth_pair` container. This paper describes the package architecture, demonstrates the complete workflow from synthetic data generation to comprehensive risk-utility assessment, and presents a case study comparing three synthesis approaches.
keywords:
  - synthetic data
  - disclosure risk
  - data utility
  - privacy
  - statistical disclosure control
  - R
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{riskutility: Comprehensive Disclosure Risk and Data Utility Assessment for Anonymized and Synthetic Data in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = TRUE,
  message = FALSE
)
library(riskutility)
```

# Introduction {#sec-intro}

Statistical agencies and data custodians face a fundamental challenge: releasing data that are useful for analysis while protecting individual privacy. Two main approaches exist: (i) traditional anonymization methods such as perturbation, suppression, and generalization [@hundepool2012statistical; @templ2017statistical], and (ii) synthetic data generation via statistical models [@nowok2016synthpop; @templ2017simulation]. Both require rigorous evaluation of *disclosure risk* and *data utility* --- yet existing tools assess these two dimensions in isolation or cover only a narrow subset of the relevant metrics.

The R package sdcMicro [@templ2015sdcmicro] provides frequency-based risk estimation but limited utility assessment. synthpop [@nowok2016synthpop] offers the CAP disclosure metric and propensity-score utility but not distance-based or membership inference methods. In Python, SDMetrics covers distance-based metrics and Anonymeter implements GDPR failure criteria, but neither addresses attribution risk. No existing *R* package provides a unified framework that spans all risk families and supports multivariate comparison of multiple synthesis approaches; the most comprehensive evaluation suites (the Python libraries synthcity and SynthEval) sit outside the R ecosystem.

The **riskutility** package for R [@R2025] fills this gap.
It implements over 30 risk measures spanning six paradigms --- frequency-based privacy models, attribution-based (CAP), ML-based (RAPID), distance-based, record linkage, and membership inference --- alongside more than a dozen utility functions covering global, distributional, structural, and predictive assessment. All functions share a consistent S3 API with `print()`, `summary()`, and `plot()` methods and accept a `synth_pair` container that bundles data and metadata.

The need for rigorous evaluation is well established in the statistical disclosure control literature and has gained urgency as synthetic data enters regulatory frameworks. The European Article 29 Working Party [@wp29anonymisation] identifies three criteria for anonymization: singling out, linkability, and inference. **riskutility** measures the first two directly (`singling_out()`, `linkability()`); the inference criterion is approached through the attribution-based CAP and RAPID measures rather than a dedicated inference-attack function. Importantly, synthetic data does not automatically satisfy these criteria [@stadler2022synthetic], making empirical assessment essential.

The distinction between *general* and *specific* utility [@snoke2018general] is important: general utility measures (propensity scores, distributional distances) assess overall distributional fidelity, while specific utility measures (regression fidelity, TSTR) assess whether particular analyses yield similar results on original and synthetic data. A thorough evaluation should include both.

**riskutility** provides empirical privacy metrics rather than formal mathematical guarantees. Unlike differential privacy, which provides worst-case bounds, our metrics quantify observed risk on concrete datasets. This approach is appropriate for practical SDC workflows where the protection mechanism is not differentially private. For differentially private synthesizers, formal epsilon budgets should be used alongside empirical evaluation.
**Quick start.** A comprehensive risk assessment requires just a few lines:

```{r intro-hook, eval=FALSE}
library(riskutility)
pair <- synth_pair(original, synthetic,
                   key_vars = c("age", "sex", "region"),
                   target_var = "income")
report <- disclosure_report(pair)
print(report)
```

This single call computes attribution-based risk (DCAP, TCAP, WEAP, DiSCO, RAPID), distance-based risk (DCR, NNDR, IMS, dRisk, hitting rate), privacy models ($k$-anonymity, $l$-diversity, $t$-closeness), and membership inference measures (singling out, linkability, NNAA) --- producing a pass/fail summary across all families.

For comparing multiple synthesis approaches, `rumap()` computes risk and utility measures simultaneously, normalizes them to a common scale, and identifies Pareto-optimal methods (Section \@ref(sec-case-rumap)).

**Paper overview.** Section \@ref(sec-background) introduces the disclosure threat taxonomy and reviews related software. Section \@ref(sec-design) describes the package architecture and the `synth_pair` container. Sections \@ref(sec-risk) and \@ref(sec-utility) present the risk and utility measures with mathematical formulations and worked examples. Section \@ref(sec-comprehensive) demonstrates the complete practitioner workflow on a realistic case study, comparing three synthesis approaches with `disclosure_report()` and `rumap()`. Section \@ref(sec-discussion) discusses limitations, remediation strategies, and future work. Section \@ref(sec-computational) provides computational details.
```{r table1-scope, echo=FALSE}
scope <- data.frame(
  Category = c("Privacy models", "Attribution (CAP)", "ML-based (RAPID)",
               "Distance-based", "Record linkage", "Membership inference",
               "Utility measures", "Frameworks"),
  Functions = c(8, 4, 4, 6, "1 (8 methods)", 6, 15, 3),
  Paradigm = c("Equivalence class", "Matching", "Prediction",
               "Nearest neighbor", "Linkage", "Attack simulation",
               "Various", "Composite"),
  `Applicable to` = c(rep("Both", 8)),
  check.names = FALSE
)
knitr::kable(scope, caption = "Package scope at a glance. Both = applicable to traditionally anonymized and synthetic data.")
```

All analyses in this paper were conducted in R [@R2025].

# Background: Threat Taxonomy and Related Software {#sec-background}

## Disclosure Threat Taxonomy {#sec-threats}

Every risk measure in riskutility addresses one or more of four disclosure threats. The classical SDC taxonomy [@hundepool2012statistical; @templ2017statistical] recognizes three types: identity, attribute, and inferential disclosure. We adapt this to four threats: inferential disclosure is treated together with attribute disclosure (prediction-based measures such as `rapid()` are listed there), and *membership disclosure* (from the ML privacy literature, @shokri2017membership) and *memorization* (relevant to generative models) are added as distinct categories, reflecting the broader scope of modern synthetic data evaluation:

- **Identity disclosure**: An attacker links a released record to a specific individual. Measured by: `recordLinkage()`, `kanonymity()`, `individual_risk()`, `hitting_rate()`.
- **Attribute disclosure**: An attacker learns a sensitive attribute value through quasi-identifier (QI) matching. Measured by: `dcap()`, `tcap()`, `weap()`, `disco()`, `rapid()`.
- **Membership disclosure**: An attacker determines whether an individual's data was used to create the released dataset. Measured by: `mia_classifier()`, `domias()`, `nnaa()`, `singling_out()`, `delta_presence()`.
- **Memorization**: A generative model reproduces training records verbatim or near-verbatim. Measured by: `ims()`, `dcr()`, `nndr()`.
```{r table2-threats, echo=FALSE}
threats <- data.frame(
  Threat = c("Identity", "Attribute", "Membership", "Memorization"),
  Definition = c(
    "Attacker links record to individual",
    "Attacker learns sensitive value via linkage",
    "Attacker determines if individual is in dataset",
    "Generator reproduces training records"
  ),
  `Key measures` = c(
    "recordLinkage, kanonymity, individual_risk",
    "dcap, tcap, weap, disco, rapid",
    "mia_classifier, domias, nnaa, singling_out",
    "ims, dcr, nndr"
  ),
  check.names = FALSE
)
knitr::kable(threats, caption = "Disclosure threat taxonomy.")
```

## Risk Assessment Paradigms

riskutility implements six risk assessment paradigms, each addressing different threats from the taxonomy above. We summarize each here; full definitions and worked examples are in Section \@ref(sec-risk).

**Frequency-based (privacy models).** These methods assess privacy properties of a single dataset based on quasi-identifier frequencies. $k$-Anonymity [@samarati1998protecting] requires each combination of quasi-identifiers to appear at least $k$ times. $l$-Diversity [@machanavajjhala2007ldiversity] and $t$-closeness [@li2007tcloseness] progressively strengthen the protection of sensitive attributes within equivalence classes. See Section \@ref(sec-privacy-models).

**Attribution-based (CAP family).** The Correct Attribution Probability framework [@taub2018differential] measures whether an attacker can infer sensitive attribute values by matching quasi-identifiers between original and released data. DCAP provides an aggregate measure; TCAP gives per-record risk scores. See Section \@ref(sec-cap).

**ML-based (RAPID).** Risk of Attribute Prediction-Induced Disclosure [@thees2026beyond] trains predictive models on the released data and evaluates them on the original. Accurate predictions indicate information leakage. RAPID captures complex non-linear relationships and variable interactions that the CAP matching approach may miss. See Section \@ref(sec-rapid).
**Distance-based.** The holdout method compares distances from synthetic records to training data vs. holdout data. If synthetic records are systematically closer to training records, the generator has memorized rather than generalized. However, @yao2025dcr demonstrate that passing distance-based tests does not guarantee privacy (the "DCR Delusion"). See Section \@ref(sec-distance).

**Record linkage.** `recordLinkage()` directly simulates a re-identification attack using, among other methods, deterministic (Gower distance), probabilistic (Fellegi-Sunter), PRAM-aware, predictive (propensity-score), or random forest linkage [@fellegi1969theory; @domingoferrer2003disclosure]. See Section \@ref(sec-recordlinkage).

**Membership inference.** Shadow model attacks [@shokri2017membership], density-based detection [@hu2023domias], and GDPR failure criteria [@stadler2022synthetic] assess whether an attacker can determine if a specific individual's data was used during synthesis. See Section \@ref(sec-membership).

## Utility Assessment Paradigms

Utility measures quantify how well the released data preserves the statistical properties of the original. We organize them into four groups (details in Section \@ref(sec-utility)):

**Global utility** measures use propensity scores to quantify how distinguishable the released data is from the original [@woo2009global; @snoke2018general]. A single number summarizes overall data quality.

**Distributional utility** measures (Wasserstein, Hellinger, KS test, energy distance, MMD) compare marginal or joint distributions, identifying which specific variables or relationships are poorly reproduced.

**Structural utility** measures (copula fidelity, contingency fidelity, PCA comparison, correlation matrix comparison) assess whether multivariate relationships --- the correlations and interactions that analysts depend on --- are preserved.
**Predictive utility** measures (TSTR, regression fidelity, feature importance stability) test whether models trained on released data generalize to real-world predictions.

## Related Software

**In R.** sdcMicro [@templ2015sdcmicro] provides the most comprehensive SDC toolkit, including frequency-based risk estimation, but focuses primarily on *applying* anonymization methods. synthpop [@nowok2016synthpop] provides the CAP disclosure metric and propensity-score utility, but does not cover distance-based or membership inference methods. Neither package provides a unified framework for comparing multiple risk and utility measures simultaneously.

**In Python.** SDMetrics (part of the Synthetic Data Vault) provides distance-based and statistical metrics for synthetic data quality. Anonymeter [@giomi2023anonymeter] implements the three GDPR failure criteria from @stadler2022synthetic (singling out, linkability, inference). Neither covers attribution-based metrics (CAP, RAPID) or traditional privacy models ($k$-anonymity, $l$-diversity).

```{r table3-comparison, echo=FALSE}
comp <- data.frame(
  Measure = c("k-Anonymity", "l-Diversity", "t-Closeness", "DCAP/TCAP", "RAPID",
              "DCR/NNDR", "Record linkage", "MIA / GDPR criteria",
              "pMSE / SPECKS", "R-U map"),
  sdcMicro = c("freq()", "ldiversity()", "--", "--", "--", "dRisk()", "--", "--", "--", "--"),
  synthpop = c("--", "--", "--", "disclosure()", "--", "--", "--", "--", "utility.gen()", "--"),
  `SDMetrics (Py)` = c("--", "--", "--", "--", "--", "Yes", "--", "--", "--", "--"),
  `Anonymeter (Py)` = c("--", "--", "--", "--", "--", "--", "--", "Yes", "--", "--"),
  riskutility = c("kanonymity()", "ldiversity()", "tcloseness()", "dcap(), tcap()",
                  "rapid()", "dcr(), nndr()", "recordLinkage()",
                  "mia_classifier(), singling_out()", "propscore(), specks()", "rumap()"),
  check.names = FALSE
)
knitr::kable(comp, caption = "Risk/utility measures across R and Python packages.")
```

Py = Python.
# Software Design and Architecture {#sec-design}

## Design Philosophy

Existing R packages for statistical disclosure control --- sdcMicro [@templ2015sdcmicro], synthpop [@nowok2016synthpop], and simPop [@templ2017simulation] --- focus primarily on *applying* protection methods, with risk and utility assessment provided as secondary features. riskutility inverts this emphasis: it is a dedicated evaluation package, designed to be used *after* protection has been applied, regardless of which tool generated the released data.

The package follows four design principles:

1. **Consistent S3 architecture.** Every major function returns a typed S3 object with `print()`, `summary()`, and `plot()` methods. We chose S3 over S4 classes for three reasons: lighter memory footprint, simpler method dispatch for the typical R user, and easier interoperability with data.table and ggplot2 objects. All risk/utility classes follow the same pattern, making the API predictable once one class is learned.

2. **Direction conventions.** Risk measures are oriented so that higher values always indicate higher disclosure risk. Utility measures are oriented so that higher values indicate higher utility (better preservation of statistical properties) wherever the measure permits. The exceptions are measures such as pMSE and Wasserstein distance, which naturally use a "lower is better" scale; their interpretation is noted in the documentation, and `rumap()` applies the necessary direction transformation when normalizing to $[0, 1]$.

3. **Minimal core dependencies.** Frequency-based privacy models, CAP metrics, and distance-based measures require only base R, data.table, and ggplot2. ML-based methods require optional packages --- ranger for random forests in RAPID, xgboost for gradient boosting, and rpart for classification trees. These are loaded conditionally via `requireNamespace()` and produce informative error messages when absent.

4. **Integration over competition.** Rather than reimplementing synthesis or anonymization, the package provides `from_synthpop()`, `from_simPop()`, and `from_sdcMicro()` constructors that extract original and released data from objects created by these packages, wrapping them in the `synth_pair` container (Section \@ref(sec-synth-pair)). This ensures users can evaluate any protection method with a single consistent interface.

## The synth_pair Container {#sec-synth-pair}

The central data structure in riskutility is the `synth_pair` object, which bundles original data, released data, variable roles, and metadata into a single container:

```{r synth-pair-demo, eval=FALSE}
pair <- synth_pair(original, synthetic,
                   key_vars = c("age", "gender", "region"),
                   target_var = "income",
                   holdout = holdout_data)
```

The constructor stores the original and synthetic data frames alongside their dimensions, automatically detects categorical (`cat_vars`) and numeric (`num_vars`) columns, and retains user-specified quasi-identifiers (`key_vars`), sensitive attribute (`target_var`), and optional holdout data. This metadata eliminates a common source of error: specifying different `key_vars` for different risk measures on the same dataset.

Once constructed, every risk and utility function in the package accepts a `synth_pair` object as its first argument via S3 dispatch:

```{r synth-pair-dispatch, eval=FALSE}
# All functions accept synth_pair --- no parameter repetition:
dcap(pair)                      # Attribution risk
rapid(pair, model_type = "rf")  # ML-based risk
propscore(pair)                 # Propensity score utility
disclosure_report(pair)         # Full risk report
rumap(pair)                     # Risk-Utility map
```

## S3 Method Pattern {#sec-s3-pattern}

Every exported risk/utility class follows an identical three-part pattern:

```{r s3-pattern, eval=FALSE}
# 1. Two equivalent calling conventions:
result <- dcap(pair)                                    # synth_pair method
result <- dcap(X, Y, key_vars = ..., target_var = ...)  # default method

# 2. Inspection:
print(result)         # One-screen summary with key statistic
s <- summary(result)  # Detailed statistics (returns summary.dcap)
print(s)              # Formatted multi-line output

# 3. Visualization:
plot(result, which = 1)    # Plot type 1
plot(result, which = 1:2)  # Multiple plot types
```

The generic function dispatches via `UseMethod()`. The `synth_pair` method extracts `original`, `synthetic`, `key_vars`, and `target_var` from the container and delegates to the default method, which performs the actual computation. The return object is a list with a class attribute (e.g., `"dcap"`). The `summary()` method returns a typed summary object (e.g., `"summary.dcap"`) with its own `print()` method, separating computation from display.

Plot methods use an integer `which` parameter to select among multiple visualization types. The number of available plot types varies by class, from one (simple measures) to seven (`rumap`).

## Integration with the R Ecosystem {#sec-integration}

The `from_*` family of constructors bridges riskutility with the three main R packages for statistical disclosure control and synthetic data generation:

```{r integration, eval=FALSE}
# From synthpop: pass synds object + original data
pair <- from_synthpop(synds_object, original_data,
                      key_vars = c("age", "sex"),
                      target_var = "income")

# From simPop: original data extracted automatically from simPopObj
pair <- from_simPop(simPopObj,
                    key_vars = c("age", "sex"),
                    target_var = "income")

# From sdcMicro: variable roles extracted from sdcMicroObj
pair <- from_sdcMicro(sdcMicroObj)
```

Each constructor returns a standard `synth_pair` object. `from_sdcMicro()` additionally extracts variable roles (quasi-identifiers, sensitive attributes, sample weights) from the sdcMicro S4 object, so that `key_vars` and `target_var` need not be specified manually. `from_synthpop()` supports multiple syntheses via the `m` parameter, selecting a specific synthetic dataset from a `synds` object.
`from_simPop()` extracts sample weights when available, enabling weighted risk calculations.

# Risk Measures {#sec-risk}

This section presents the six risk measure families, each illustrated with a worked example using the same running dataset.

```{r running-data}
set.seed(42)
n <- 500
original <- data.frame(
  age = sample(18:85, n, replace = TRUE),
  sex = factor(sample(c("M", "F"), n, replace = TRUE)),
  education = factor(sample(c("Primary", "Secondary", "Tertiary"), n,
                            replace = TRUE, prob = c(0.3, 0.5, 0.2))),
  region = factor(sample(paste0("R", 1:5), n, replace = TRUE)),
  income = round(rlnorm(n, log(40000), 0.5))
)

# Synthetic: independent draws (low risk expected)
synthetic <- data.frame(
  age = sample(18:85, n, replace = TRUE),
  sex = factor(sample(c("M", "F"), n, replace = TRUE)),
  education = factor(sample(c("Primary", "Secondary", "Tertiary"), n,
                            replace = TRUE, prob = c(0.3, 0.5, 0.2))),
  region = factor(sample(paste0("R", 1:5), n, replace = TRUE)),
  income = round(rlnorm(n, log(40000), 0.5))
)

key_vars <- c("age", "sex", "education", "region")
target_var <- "income"
pair <- synth_pair(original, synthetic,
                   key_vars = key_vars, target_var = target_var)

# Train/holdout split for distance-based metrics
set.seed(123)
train_idx <- sample(n, size = floor(0.7 * n))
train_data <- original[train_idx, ]
holdout_data <- original[-train_idx, ]
```

## Privacy Models and Frequency-Based Risk {#sec-privacy-models}

These methods assess privacy properties of a *single* dataset based on quasi-identifier frequencies. Originally developed for traditionally anonymized data, they apply equally to synthetic data. They do not compare original and released data; instead, they evaluate structural properties of the released data alone.

**$k$-Anonymity** [@samarati1998protecting] partitions records into equivalence classes (ECs) based on quasi-identifier values.
The dataset satisfies $k$-anonymity if every EC contains at least $k$ records: $k = \min_i |\text{EC}(\mathbf{q}_i)|$, where $\mathbf{q}_i$ is the quasi-identifier vector of record $i$. Small ECs are vulnerable to identity disclosure because an attacker who knows a target's quasi-identifiers can narrow the set of candidate records to fewer than $k$. $k$-Anonymity protects against identity disclosure but not attribute disclosure: if all $k$ records in a class share the same sensitive value, the attribute is trivially revealed.

**$l$-Diversity** [@machanavajjhala2007ldiversity] strengthens $k$-anonymity by requiring that each EC contains at least $l$ distinct values of the sensitive attribute. For $l \geq 2$ this rules out the homogeneity attack, in which all records in an EC share the same sensitive value and the attribute is disclosed without re-identification.

**$t$-Closeness** [@li2007tcloseness] further requires that the distribution of the sensitive attribute within each EC is close to its overall distribution. The distance is measured by the Earth Mover's Distance (EMD), and $t$ is the maximum EMD across all ECs.

```{r privacy-models}
# k-Anonymity: minimum equivalence class size
k_res <- riskutility::kanonymity(synthetic, key_vars = key_vars)
k_res

# l-Diversity: sensitive attribute diversity per EC
l_res <- riskutility::ldiversity(synthetic, key_vars = key_vars,
                                 sensitive_var = target_var)
print(l_res)

# t-Closeness: EMD between EC and overall distribution
t_res <- riskutility::tcloseness(synthetic, key_vars = key_vars,
                                 sensitive_var = target_var)
t_res
```

With 68 unique age values, 2 sex levels, 3 education levels, and 5 regions, there are up to $68 \times 2 \times 3 \times 5 = 2040$ possible QI combinations for only $n = 500$ records. Most equivalence classes are singletons, yielding $k = 1$ --- this is expected for fine-grained quasi-identifiers and does not by itself indicate a problem with the synthetic data.
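
The equivalence-class logic behind $k$-anonymity and $l$-diversity can be sketched in a few lines of base R. This is only an illustration on hypothetical toy data (the data frame `toy` and its columns are invented here), using the distinct-values notion of $l$-diversity described above; it is not the implementation behind `kanonymity()` or `ldiversity()`.

```r
# Hypothetical toy release (invented values, not the running example)
toy <- data.frame(
  age_group = c("18-39", "18-39", "18-39", "40-64", "40-64", "65+", "65+"),
  sex       = c("F", "F", "F", "M", "M", "F", "F"),
  disease   = c("flu", "flu", "flu", "asthma", "flu", "asthma", "diabetes")
)
qi <- c("age_group", "sex")

# One equivalence-class label per record: the observed QI combination
ec <- interaction(toy[qi], drop = TRUE)

# k-anonymity: size of the smallest equivalence class
k <- min(table(ec))

# l-diversity (distinct-values variant): fewest distinct sensitive values in any EC
l <- min(tapply(toy$disease, ec, function(s) length(unique(s))))

# Size of the QI cross-product in the running example: 68 ages x 2 x 3 x 5
n_cells <- length(18:85) * 2 * 3 * 5
```

In this toy release every QI combination occurs at least twice, yet one class contains only a single disease value, which is the situation $l$-diversity is designed to flag.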
The three models form a hierarchy: $k$-anonymity guards against identity disclosure, $l$-diversity against homogeneity attacks, and $t$-closeness against skewness attacks.

Additional frequency-based measures include `individual_risk()` for per-record re-identification probability based on EC frequencies [@franconi2004individual; @skinner2002measure], `attacker_risk()` for scenario-based assessment under prosecutor, journalist, and marketer attacker models [@hundepool2012statistical], `suda()` for detecting records unique on small QI subsets, and `population_uniqueness()` for estimating population-level uniques via super-population models [@reiter2005estimating].

```{r table5-privacy, echo=FALSE}
privacy <- data.frame(
  Function = c("kanonymity()", "ldiversity()", "tcloseness()", "suda()",
               "individual_risk()", "population_uniqueness()",
               "attacker_risk()", "epsilon_identifiability()"),
  Input = rep("Single dataset", 8),
  `Key output` = c("Min EC size", "Min distinct values per EC",
                   "Max EMD across ECs", "SUDA scores",
                   "Per-record frequency risk", "Estimated pop. uniques",
                   "Scenario-based risk", "Identifiability fraction"),
  `Threats` = c("Identity", "Attribute", "Attribute", "Identity",
                "Identity", "Identity", "Identity", "Identity"),
  check.names = FALSE
)
knitr::kable(privacy, caption = "Privacy models overview.")
```

## Attribution-Based Risk: The CAP Family {#sec-cap}

The Correct Attribution Probability framework [@taub2018differential] measures attribute disclosure: can an attacker infer a sensitive value by matching quasi-identifiers between original and released data? For each original record $i$ with quasi-identifier values $\mathbf{q}_i$, the attacker finds all records in the released data whose quasi-identifiers match $\mathbf{q}_i$ (the *equivalence class*). If the sensitive attribute is homogeneous within this class, the attacker learns the true value.
The **Targeted CAP** (TCAP) gives each record a risk score between 0 and 1: $\text{TCAP}_i = \Pr(\text{correct attribution} \mid \mathbf{q}_i)$. The mean CAP across all records is $\overline{\text{CAP}} = n^{-1} \sum_i \text{CAP}_i$ (returned as `cap`), where $\text{CAP}_i$ is the per-record attribution probability. The **Differential CAP** subtracts the baseline (modal-class) attribution rate, $\text{DCAP} = \overline{\text{CAP}} - \text{baseline}$ (returned as `dcap`; @taub2018differential), so a value near zero indicates no attribution gain over that baseline. The `summary()` method also reports the risk ratio $\overline{\text{CAP}} / \text{baseline}$ to contextualize the result.

The **WEAP** (Within-EC Attribution Probability) evaluates risk from the released data alone, without access to the original, making it suitable for data custodians who cannot share the original data with an auditor. **DiSCO** (Disclosive in Synthetic, Correct in Original) identifies records that are both confidently attributed in the released data and correctly attributed in the original.

```{r cap-demo}
# TCAP: per-record risk (most informative member of CAP family)
tcap_res <- tcap(pair)
summary(tcap_res)
plot(tcap_res)
```

Since our running example uses independently generated synthetic data (no relationship to the original), TCAP values should be close to the baseline attribution probability. Records with TCAP above 0.1 warrant closer inspection.
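
To make the matching step concrete, the following base R sketch computes a per-record attribution probability, its mean, and the difference to a modal-class baseline on hypothetical toy data. The data frames `orig` and `rel` are invented for illustration, and two choices are assumptions made here rather than documented package behavior: records without a QI match receive a score of zero, and the baseline is the modal-class share of the target in the original data.

```r
# Hypothetical toy data: 'orig' is the original, 'rel' the released data
orig <- data.frame(
  region = c("N", "N", "S", "S", "S", "W"),
  sex    = c("F", "M", "F", "F", "M", "M"),
  status = c("low", "high", "low", "low", "high", "low")
)
rel <- data.frame(
  region = c("N", "N", "S", "S", "S", "S"),
  sex    = c("F", "F", "F", "F", "M", "M"),
  status = c("low", "high", "low", "low", "high", "high")
)
qi <- c("region", "sex")

# Collapse the quasi-identifiers into one matching key per record
key_orig <- do.call(paste, c(orig[qi], sep = "|"))
key_rel  <- do.call(paste, c(rel[qi],  sep = "|"))

# Per-record score: among released records sharing the QIs of original
# record i, the share that also carries its true sensitive value
cap_i <- vapply(seq_len(nrow(orig)), function(i) {
  match_qi <- key_rel == key_orig[i]
  if (!any(match_qi)) return(0)  # assumption: no QI match counts as zero
  mean(rel$status[match_qi] == orig$status[i])
}, numeric(1))

cap_mean <- mean(cap_i)

# Assumed baseline: share of the modal class of the target in the original
baseline <- max(prop.table(table(orig$status)))
dcap_val <- cap_mean - baseline
```

Here `dcap_val` is slightly negative: matching on the quasi-identifiers does no better than always guessing the most frequent value, which is the pattern one would hope to see for a low-risk release.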
```{r cap-table, echo=FALSE}
cap <- data.frame(
  Metric = c("DCAP", "TCAP", "WEAP", "DiSCO"),
  `Requires original?` = c("Yes", "Yes", "No", "Yes"),
  `Per-record?` = c("No", "Yes", "Yes", "Yes"),
  `Measures` = c("Mean attribution probability", "Individual attribution risk",
                 "Within-EC homogeneity", "Correct + confident attribution"),
  `Low risk` = c("ratio < 1.5", "< 0.1 per record", "< 0.1", "< 5%"),
  check.names = FALSE
)
knitr::kable(cap, caption = "CAP family comparison with interpretation thresholds.")
```

## ML-Based Risk: RAPID {#sec-rapid}

Risk of Attribute Prediction-Induced Disclosure [@thees2026beyond] takes a fundamentally different approach to attribute disclosure. Instead of matching quasi-identifiers, RAPID trains a predictive model $\hat{f}$ on the released data $(Y_{\mathcal{K}}, Y_s)$ and evaluates its predictions on the original data: $\hat{s}_i = \hat{f}(X_{\mathcal{K},i})$.

For numeric targets, a record is *at risk* when the prediction error falls below a threshold $\epsilon$: $e(s_i, \hat{s}_i) < \epsilon$, where $e(\cdot, \cdot)$ is a configurable error metric (symmetric percentage error by default). The RAPID score is the fraction of at-risk records: $\text{RAPID} = n^{-1} \sum_i \mathbf{1}(e(s_i, \hat{s}_i) < \epsilon)$. For categorical targets, a different evaluation applies: a record is at risk when a gain or ratio score exceeds a threshold.

```{r rapid-demo, warning=FALSE}
rapid_res <- rapid(pair, model_type = "lm")
summary(rapid_res)
plot(rapid_res, which = c(1, 3))
```

With independently generated synthetic data, we expect the RAPID score to be close to the baseline. The threshold sensitivity plot (`which = 3`) shows how the at-risk fraction changes across threshold values.

RAPID complements the CAP family in two ways. First, it captures non-linear relationships and variable interactions.
Second, it provides inferential tools: `rapid_test()` computes a permutation-based $p$-value, `confint()` provides bootstrap confidence intervals, and `rapid_threshold_select()` optimizes the threshold in a data-driven manner.

```{r rapid-models, echo=FALSE}
models <- data.frame(
  Model = c("lm", "rf", "cart", "gbm", "logit"),
  Package = c("stats", "ranger", "rpart", "xgboost", "stats"),
  Numeric = c("Yes", "Yes", "Yes", "Yes", "No"),
  Categorical = c("No", "Yes", "Yes", "Yes", "Yes"),
  Interactions = c("Manual", "Automatic", "Automatic", "Automatic", "Manual"),
  check.names = FALSE
)
knitr::kable(models, caption = "RAPID model backends.")
```

| RAPID Score | Risk Level | Interpretation |
|-------------|------------|----------------|
| < 0.05 | Low | ML model cannot predict target much better than baseline |
| 0.05--0.15 | Moderate | Some predictive signal from synthetic data |
| 0.15--0.30 | Elevated | Significant predictive leakage |
| > 0.30 | High | Strong evidence of disclosure risk |

## Distance-Based Risk {#sec-distance}

Distance-based methods detect *memorization*: the failure mode where a generative model reproduces training records verbatim or near-verbatim. The key idea is the **holdout method** --- split the original data into a training set $T$ (used for synthesis) and a holdout set $H$ (unseen by the generator). For each synthetic record $y_j$, compute the Distance to Closest Record in $T$ ($d_T$) and in $H$ ($d_H$). If the generator has generalized, $d_T$ and $d_H$ should be comparable. If it has memorized, $d_T$ will be systematically smaller:

$$\text{DCR\_share} = n^{-1} \sum_j \mathbf{1}\bigl(d_T(y_j) < d_H(y_j)\bigr), \qquad \text{DCR\_ratio} = \frac{\bar{d}_T}{\bar{d}_H}$$

Here the sum runs over the synthetic records. A DCR share meaningfully above 0.5 (the package flags shares above 0.55), or a ratio clearly below 1, suggests memorization.
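
The holdout comparison can be reproduced on simulated data with base R alone. The sketch below is a hedged illustration, not the code behind `dcr()`: it assumes purely numeric data and plain Euclidean distance, and the helper `closest_dist()` and all objects are hypothetical names introduced here. The synthetic set is deliberately built as noisy copies of training records so that the memorization signal is visible.

```r
set.seed(1)

# Hypothetical numeric data: training set, holdout set, and a deliberately
# "leaky" synthetic set (training records plus small noise)
train   <- matrix(rnorm(200 * 2), ncol = 2)
holdout <- matrix(rnorm(200 * 2), ncol = 2)
synth   <- train[1:100, ] + matrix(rnorm(100 * 2, sd = 0.01), ncol = 2)

# Euclidean distance from each row of 'a' to its closest row in 'b'
# (assumption: the package may use a different metric, e.g. for mixed data)
closest_dist <- function(a, b) {
  apply(a, 1, function(row) sqrt(min(colSums((t(b) - row)^2))))
}

d_T <- closest_dist(synth, train)    # distance to closest training record
d_H <- closest_dist(synth, holdout)  # distance to closest holdout record

dcr_share <- mean(d_T < d_H)         # share of records closer to training data
dcr_ratio <- mean(d_T) / mean(d_H)   # ratio of mean closest distances
```

Because the synthetic records are near-copies of training rows, `dcr_share` lands far above 0.5 and `dcr_ratio` far below 1. Replacing `synth` with fresh independent draws should move both toward the values expected under generalization.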
The **NNDR** (Nearest Neighbor Distance Ratio) provides a complementary view: for each synthetic record, it is the ratio of the distance to its nearest neighbor over the distance to its second-nearest neighbor. A ratio near 0 indicates a single dominant match; near 1 indicates no distinctive match. **IMS** (Identical Match Share) counts exact copies. When an explicit holdout is unavailable, `holdout_fraction` automatically splits the original data.

```{r distance-demo, warning=FALSE}
dcr_res <- dcr(pair, holdout_fraction = 0.2)
summary(dcr_res)
plot(dcr_res, which = 1)
```

**The DCR Delusion.** @yao2025dcr show that DCR can fail to detect privacy leakage: datasets deemed "private" by DCR may still be vulnerable to membership inference attacks. Their central recommendation is that DCR be interpreted relative to a proper null distribution rather than in absolute terms; `dcr()` implements exactly this, comparing the observed share against a permutation null (`null_test`) and reporting a Wilcoxon test alongside the point estimate. Even so, distance-based metrics should always be complemented with other risk families.

```{r distance-table, echo=FALSE}
dist <- data.frame(
  Metric = c("DCR", "NNDR", "IMS", "RF proximity", "dRisk",
             "Hitting rate", "Epsilon ID", "Delta-presence"),
  Holdout = c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No"),
  Detects = c("Memorization", "Memorization", "Exact copies",
              "Memorization (non-linear)", "Close records", "Close records",
              "Identifiability", "Membership bounds"),
  `Low risk` = c("share < 0.55", "share < 0.55", "< 0.01", "ratio near 1",
                 "< 0.05", "< 0.05", "< 0.01", "> 0.5"),
  check.names = FALSE
)
knitr::kable(dist, caption = "Distance-based and proximity risk measures.")
```

**RF proximity** offers a data-adaptive alternative: it trains a random forest to discriminate original from synthetic records and measures how often synthetic records share terminal nodes with training vs.
holdout records, capturing non-linear proximity that fixed distance metrics may miss. Use `rf_privacy()` when complex interactions are expected. ## Record Linkage Risk {#sec-recordlinkage} The `recordLinkage()` function directly simulates a re-identification attack by linking each original record to the most similar record(s) in the anonymized dataset. Eight methods are implemented, spanning deterministic (Gower distance), probabilistic (Fellegi-Sunter, @fellegi1969theory), PRAM-aware, predictive (propensity score), random forest proximity, rank-based (RBRL, @muralidhar2016rankbased), robust Mahalanobis [@templ2008robust], and autoencoder embedding [@guo2016entity]. Three matching modes are available: independent (many-to-one), bijective (one-to-one via Hungarian algorithm, @herranz2016gdbrl), and optimal transport (Sinkhorn). For full details on all methods and matching modes, see `?recordLinkage`. ```{r recordlinkage-table, echo=FALSE} rl <- data.frame( Method = c("Deterministic", "Probabilistic", "PRAM", "Predictive", "RF", "RBRL", "Mahalanobis", "Embedding"), Distance = c("Gower", "Fellegi-Sunter", "Transition prob.", "Propensity", "RF proximity", "Rank-based", "Mahalanobis", "Autoencoder"), `Mixed types` = c("Yes", "Yes", "Categorical", "Yes", "Yes", "Yes", "Numeric", "Yes"), Matching = c("All 3", "All 3", "All 3", "All 3", "All 3", "Independent", "All 3", "All 3"), check.names = FALSE ) knitr::kable(rl, caption = "Record linkage methods. All 3 = independent, bijective, OT.") ``` ## Membership Inference and Anonymization Failure Criteria {#sec-membership} This section groups two related but distinct concerns. The first three measures (`mia_classifier()`, `domias()`, `nnaa()`) assess *membership disclosure* --- whether a membership inference attack (MIA) can determine if a specific individual's data was used during synthesis. 
The singling out and linkability attacks operationalize two of the GDPR anonymization criteria [@wp29anonymisation], following the attack-based approach of @giomi2023anonymeter.

**NNAA** (Nearest Neighbor Adversarial Accuracy, @yale2020generation) is based on the adversarial accuracy of a nearest-neighbour two-sample comparison, $\text{AA}(A,S) = \tfrac{1}{2}\bigl[\Pr(d_{AS} > d_{AA}) + \Pr(d_{SA} > d_{SS})\bigr]$. The reported privacy loss is $\text{AA}(\text{holdout}, S) - \text{AA}(\text{train}, S)$; a positive value means synthetic records resemble training records more closely than holdout records, indicating memorization:

```{r nnaa-demo}
nnaa_res <- nnaa(train_data, synthetic, holdout = holdout_data,
                 method = "gower", seed = 42)
print(nnaa_res)
```

| Privacy Loss | Interpretation |
|-------------|----------------|
| Near 0 | No detectable leakage (ideal) |
| 0.01--0.05 | Minor leakage, likely acceptable |
| > 0.10 | Significant memorization |

**Singling out** and **linkability** operationalize two of the Article 29 Working Party's three anonymization criteria (the third, inference, is addressed by the attribution-based CAP and RAPID measures):

```{r membership-demo}
so_res <- singling_out(original, synthetic, n_attacks = 500, n_cols = 3,
                       mode = "multivariate", seed = 42)
print(so_res)

link_res <- linkability(original, synthetic, n_attacks = 500,
                        n_neighbors = 1, seed = 42)
print(link_res)
```

```{r membership-table, echo=FALSE}
mia <- data.frame(
  Metric = c("MIA classifier", "DOMIAS", "NNAA", "Singling out",
             "Linkability", "delta-Presence"),
  `Attack type` = c("Shadow model", "Density overfitting", "Nearest neighbor",
                    "Predicate-based", "Record linkage", "Membership bounds"),
  Holdout = c("Yes", "Yes", "Yes", "Yes", "Yes", "No"),
  `GDPR criterion` = c("--", "--", "--", "Art. 29 WP", "Art. 29 WP", "--"),
  `Low risk` = c("< 0.55", "< 0.6", "< 0.05", "< 0.1", "< 0.1", "> 0.5"),
  check.names = FALSE
)
knitr::kable(mia, caption = "Membership inference and GDPR measures.")
```

## Cross-Family Comparison {#sec-rosetta}

No single metric tells the full story. Applying all families to the same dataset reveals complementary and sometimes contradictory information:

```{r rosetta}
# Near-copy: original + small noise (high risk expected)
set.seed(99)
near_copy <- original
near_copy$age <- near_copy$age + sample(-1:1, n, replace = TRUE)
near_copy$income <- near_copy$income + round(rnorm(n, 0, 500))
pair_risky <- synth_pair(original, near_copy,
                         key_vars = key_vars, target_var = target_var)

# Compare key metrics across the two datasets
comparison <- data.frame(
  Metric = c("DCAP", "RAPID (lm)", "IMS"),
  Safe = c(
    dcap(pair)$dcap,
    rapid(pair, model_type = "lm", verbose = FALSE)$rapid,
    ims(pair)$ims
  ),
  Risky = c(
    dcap(pair_risky)$dcap,
    rapid(pair_risky, model_type = "lm", verbose = FALSE)$rapid,
    ims(pair_risky)$ims
  )
)
comparison$Safe <- round(comparison$Safe, 4)
comparison$Risky <- round(comparison$Risky, 4)
knitr::kable(comparison, caption = "Cross-family comparison: safe vs. risky synthetic data.")
```

The near-copy shows elevated risk across all families, but the magnitude and interpretation differ. Attribution measures quantify *information leakage*; distance-based measures quantify *memorization*. These complementary perspectives mean that a dataset can pass one family's tests while failing another's --- a thorough evaluation uses at least one measure from each family.

# Data Utility Measures {#sec-utility}

A dataset that passes all risk checks but destroys the analytical value of the data is useless. Utility measures quantify how well the released data preserves the statistical properties of the original.

## Global Utility: Propensity Scores {#sec-utility-quick}

Global utility measures give a single-number verdict by asking: *can a classifier tell original and synthetic records apart?* The propensity score method [@woo2009global; @snoke2018general] pools original ($X$, $n_X$ records) and synthetic ($Y$, $n_Y$ records) data, labels them (0/1), and fits a classifier. The **pMSE** (propensity score Mean Squared Error) measures how well the model discriminates:

$$\text{pMSE} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{p}_i - c\right)^2$$

where $N = n_X + n_Y$ and $c = n_Y / N$. If original and synthetic records are indistinguishable, pMSE $\approx 0$.

```{r utility-quick, warning=FALSE}
prop_res <- propscore(pair)
summary(prop_res)
```

| pMSE Value | Interpretation |
|------------|----------------|
| < 0.01 | Excellent fidelity |
| 0.01--0.05 | Good fidelity |
| 0.05--0.10 | Moderate differences |
| > 0.10 | Poor fidelity |

## Univariate Diagnostics {#sec-utility-univariate}

When global utility is poor, per-variable measures identify which variables are responsible. For **numeric** variables, the Wasserstein distance measures the cost of transforming one distribution into another. For **categorical** variables, the Hellinger distance measures distributional overlap:

$$H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{k=1}^{K} \left(\sqrt{p_k} - \sqrt{q_k}\right)^2}$$

```{r utility-univariate}
# Hellinger distance for categorical variables
h_res <- hellinger(original, synthetic, vars = c("sex", "education"))
print(h_res)

# CI proximity: confidence interval overlap for means
cip_res <- ci_proximity(original, synthetic, vars = c("age", "income"))
print(cip_res)
```

The **CI proximity** measure [@karr2006framework] compares confidence intervals of summary statistics (means) between original and synthetic data. An overlap near 1 means the intervals coincide; a relative error near 0 means point estimates are close.
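
The Hellinger formula in this section can be checked by hand on small categorical vectors. The base-R sketch below is only an illustration of the formula: `hellinger_dist()` is a hypothetical helper, not the package's `hellinger()`, and missing values are not handled.

```r
# Hellinger distance between two categorical samples (base R only).
# Hypothetical helper for illustration; not the package's hellinger().
hellinger_dist <- function(x, y) {
  lev <- union(unique(as.character(x)), unique(as.character(y)))
  p <- as.numeric(table(factor(x, levels = lev))) / length(x)  # proportions in x
  q <- as.numeric(table(factor(y, levels = lev))) / length(y)  # proportions in y
  sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)
}

orig_edu  <- c(rep("Primary", 25), rep("Secondary", 50), rep("Tertiary", 25))
syn_same  <- sample(orig_edu)  # same proportions, shuffled order
syn_shift <- c(rep("Primary", 50), rep("Secondary", 25), rep("Tertiary", 25))

h_same     <- hellinger_dist(orig_edu, syn_same)
h_shift    <- hellinger_dist(orig_edu, syn_shift)
h_disjoint <- hellinger_dist(rep("A", 10), rep("B", 10))
print(c(same = h_same, shifted = h_shift, disjoint = h_disjoint))
```

Identical category proportions give a distance of 0, and samples with no category in common give the maximum of 1, which is what the $1/\sqrt{2}$ normalization is for.
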

## Multivariate and Structural Utility {#sec-utility-structural}

Marginal distributions can match perfectly while joint distributions diverge. The **energy distance** [@szekely2013energy] is a multivariate two-sample statistic sensitive to differences in both location and scale (lower values indicate closer joint distributions):

```{r utility-structural}
e_res <- energy_distance(original[, c("age", "income")],
                         synthetic[, c("age", "income")], seed = 42)
print(e_res)
```

The **MMD** (Maximum Mean Discrepancy, @gretton2012kernel) provides a kernel-based alternative supporting exact computation and random Fourier features (RFF) for large datasets:

```{r mmd-demo}
mmd_res <- mmd(original[, c("age", "income")],
               synthetic[, c("age", "income")],
               kernel = "gaussian", method = "rff",
               n_features = 500, seed = 42)
print(mmd_res)
```

**Copula fidelity** compares the empirical copula (rank dependence structure) using the Cramér-von Mises statistic on pairwise copula CDFs. **Contingency fidelity** [@snoke2018general] is its categorical complement, computing total variation distance between bivariate contingency tables:

```{r fidelity-demo}
cop_res <- copula_fidelity(original, synthetic, vars = c("age", "income"))
print(cop_res)

ctf_res <- contingency_fidelity(original, synthetic,
                                vars = c("sex", "education", "region"))
print(ctf_res)
```

## Predictive Utility {#sec-utility-predictive}

**TSTR** (Train on Synthetic, Test on Real, @zhao2021ctgan) trains a predictive model on the synthetic data and evaluates performance on held-out real data.
The ratio of TSTR-to-TRTR performance, where TRTR (Train on Real, Test on Real) is the same model trained on real data instead, quantifies how well predictive relationships are preserved:

```{r tstr-demo, warning=FALSE, eval=requireNamespace("ranger", quietly=TRUE)}
set.seed(42)
tstr_res <- tstr(pair, target_var = "income", model = "rf",
                 test_fraction = 0.3, seed = 42)
print(tstr_res)
```

**Regression fidelity** [@karr2006framework] fits the same regression model on both datasets and compares coefficient estimates via CI overlap, standardized bias, and significance agreement:

```{r regression-demo}
reg_res <- regression_fidelity(original, synthetic,
                               formula = income ~ age + sex + education)
summary(reg_res)
plot(reg_res, which = 1)
```

**Tail fidelity** assesses how well extreme values are preserved --- critical for applications where tail behavior matters (financial risk, rare diseases):

```{r tail-demo}
tail_res <- tail_fidelity(original, synthetic, vars = c("age", "income"),
                          percentile = 95, tails = "both")
print(tail_res)
```

**Subgroup utility** [@snoke2018general] applies any utility measure to each subgroup defined by a grouping variable, identifying groups with low utility:

```{r subgroup-demo}
su_res <- subgroup_utility(original, synthetic, group_var = "region",
                           utility_fun = energy_distance,
                           threshold = 0.5, seed = 42)
print(su_res)
```

The conservative `utility_score` is the worst subgroup score. A `ratio` near 1 indicates homogeneous utility; below 0.5 indicates substantial disparity.
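
The TSTR-to-TRTR comparison described earlier in this section can be sketched with a linear model on simulated data. This is a minimal illustration under stated assumptions, not what `tstr()` does internally: the data, the toy synthesizer, the `rmse()` helper, and the choice to express the ratio as TSTR error over TRTR error are all assumptions made for the example.

```r
# Toy TSTR vs. TRTR comparison using lm() (base R only; all data simulated).
set.seed(1)
n_rec <- 500
real <- data.frame(age = runif(n_rec, 20, 70))
real$income <- 1000 + 50 * real$age + rnorm(n_rec, sd = 300)

# Hold out 30% of the real data for testing
test_idx   <- sample(n_rec, size = round(0.3 * n_rec))
real_test  <- real[test_idx, ]
real_train <- real[-test_idx, ]

# Hypothetical synthesizer: simulate income from a model fitted to real_train
gen <- lm(income ~ age, data = real_train)
synth <- data.frame(age = runif(nrow(real_train), 20, 70))
synth$income <- predict(gen, newdata = synth) +
  rnorm(nrow(synth), sd = sigma(gen))

# Root mean squared prediction error of `model` on `newdata`
rmse <- function(model, newdata) {
  sqrt(mean((newdata$income - predict(model, newdata = newdata))^2))
}

rmse_tstr <- rmse(lm(income ~ age, data = synth), real_test)       # train on synthetic
rmse_trtr <- rmse(lm(income ~ age, data = real_train), real_test)  # train on real

# Assumed convention: error ratio TSTR / TRTR, where values near 1 mean the
# synthetic data supports a model about as accurate as one trained on real data
tstr_ratio <- rmse_tstr / rmse_trtr
print(c(TSTR = rmse_tstr, TRTR = rmse_trtr, ratio = tstr_ratio))
```

Both models are scored on the same held-out real records, so the ratio isolates what is lost by training on synthetic rather than real data.
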

```{r table7-utility, echo=FALSE}
util <- data.frame(
  `Use case` = c(rep("Quick assessment", 2), rep("Univariate", 3),
                 rep("Multivariate", 4), rep("Predictive", 3), "Subgroup"),
  Function = c("propscore()", "specks()", "compare_wasserstein()",
               "hellinger()", "ci_proximity()", "energy_distance()", "mmd()",
               "copula_fidelity()", "contingency_fidelity()", "tstr()",
               "regression_fidelity()", "compare_feature_importance()",
               "subgroup_utility()"),
  `Data type` = c("Mixed", "Mixed", "Numeric", "Categorical", "Numeric",
                  "Numeric", "Numeric", "Numeric", "Categorical", "Mixed",
                  "Mixed", "Mixed", "Mixed"),
  Interpretation = c("< 0.1: good", "< 0.05: good", "Lower = better",
                     "< 0.1: good", "> 0.8: good", "Lower = better",
                     "Lower = better", "< 0.1: good", "< 0.05: good",
                     "ratio near 1: good", "overlap > 0.8: good",
                     "High corr: good", "min > 0.5: good"),
  check.names = FALSE
)
knitr::kable(util, caption = "Utility measures by use case.")
```

# Comprehensive Assessment: A Case Study {#sec-comprehensive}

This section demonstrates the complete practitioner workflow on a realistic dataset, comparing three synthesis approaches with different privacy-utility trade-offs.

## Scenario and Data {#sec-case-data}

Consider a statistical agency that wants to release a survey dataset ($n = 1000$) containing demographic variables (age, sex, education, region) and a sensitive income variable.

```{r case-data}
set.seed(123)
N <- 1000
edu_levels <- c("Primary", "Secondary", "Tertiary")
age_groups <- c("20-29", "30-39", "40-49", "50-59", "60-69")

orig <- data.frame(
  age_group = factor(sample(age_groups, N, replace = TRUE)),
  sex = factor(sample(c("M", "F"), N, replace = TRUE)),
  education = factor(sample(edu_levels, N, replace = TRUE,
                            prob = c(0.25, 0.50, 0.25))),
  region = factor(sample(paste0("R", 1:4), N, replace = TRUE))
)

edu_effect <- c(Primary = 0, Secondary = 0.3, Tertiary = 0.7)
age_effect <- c("20-29" = 0, "30-39" = 0.15, "40-49" = 0.3,
                "50-59" = 0.4, "60-69" = 0.35)
orig$income <- round(exp(
  10 +
    age_effect[as.character(orig$age_group)] +
    edu_effect[as.character(orig$education)] +
    rnorm(N, 0, 0.4)
))

qi <- c("age_group", "sex", "education", "region")
sens <- "income"
```

We create three synthetic datasets spanning the privacy-utility spectrum:

```{r case-synthesis}
set.seed(456)

# Method A: Independent marginals (safest, but destroys correlations)
synA <- data.frame(
  age_group = factor(sample(age_groups, N, replace = TRUE)),
  sex = factor(sample(c("M", "F"), N, replace = TRUE)),
  education = factor(sample(edu_levels, N, replace = TRUE,
                            prob = c(0.25, 0.50, 0.25))),
  region = factor(sample(paste0("R", 1:4), N, replace = TRUE)),
  income = sample(orig$income, N, replace = TRUE)
)

# Method B: Category-preserving bootstrap with income noise
idx_B <- sample(N, N, replace = TRUE)
synB <- orig[idx_B, ]
rownames(synB) <- NULL
synB$income <- round(synB$income * exp(rnorm(N, 0, 0.15)))
swap_idx <- sample(N, round(0.2 * N))
synB$age_group[swap_idx] <- factor(sample(age_groups, length(swap_idx),
                                          replace = TRUE))

# Method C: Near-copy with minimal perturbation (risky)
synC <- orig
synC$income <- round(synC$income * exp(rnorm(N, 0, 0.03)))
```

## Step 1: Quick Risk Screening with disclosure_report() {#sec-case-report}

The `disclosure_report()` function computes multiple risk measures, evaluates each against a threshold, and produces a pass/fail assessment:

```{r case-report, warning=FALSE}
pair_A <- synth_pair(orig, synA, key_vars = qi, target_var = sens)
pair_B <- synth_pair(orig, synB, key_vars = qi, target_var = sens)
pair_C <- synth_pair(orig, synC, key_vars = qi, target_var = sens)

rep_A <- disclosure_report(pair_A, compute = c("attribution", "privacy"),
                           seed = 42, verbose = FALSE)
rep_B <- disclosure_report(pair_B, compute = c("attribution", "privacy"),
                           seed = 42, verbose = FALSE)
rep_C <- disclosure_report(pair_C, compute = c("attribution", "privacy"),
                           seed = 42, verbose = FALSE)

verdicts <- data.frame(
  Method = c("A: Independent", "B: Bootstrap+noise", "C: Near-copy"),
  Overall = c(rep_A$overall_risk, rep_B$overall_risk, rep_C$overall_risk),
  Pass = c(rep_A$n_pass, rep_B$n_pass, rep_C$n_pass),
  Warn = c(rep_A$n_warn, rep_B$n_warn, rep_C$n_warn)
)
knitr::kable(verdicts, caption = "Quick risk screening across three methods.")
```

Two patterns emerge. First, attribution metrics differentiate the three methods in the expected order: Method A (independent) has the lowest attribution risk, Method C (near-copy) the highest. Second, privacy models flag all three methods because they evaluate the released data's structure alone --- with 120 possible QI combinations and $n = 1000$ records, some equivalence classes are small regardless of synthesis method. This illustrates a key lesson: privacy models and attribution metrics answer different questions and should be interpreted together.

## Step 2: Comparative Assessment with rumap() {#sec-case-rumap}

The `rumap()` function implements the multivariate Risk-Utility framework of @thees2026beyond. Traditional R-U analysis plots a single risk measure against a single utility measure, producing a two-dimensional trade-off curve. This can be misleading: a method may appear optimal on one pair of measures while performing poorly on another. `rumap()` computes multiple risk and utility measures simultaneously, normalizes to $[0, 1]$, and identifies **Pareto-optimal** methods.
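
The notion of Pareto optimality used here can be illustrated on made-up scores. The sketch below is not the algorithm inside `rumap()`: the raw values, the min-max normalization, and the helpers `minmax()` and `is_dominated()` are hypothetical choices for illustration, with one composite risk and one composite utility score per method.

```r
# Toy Pareto-front identification on hypothetical composite scores (base R).
raw <- data.frame(
  method  = c("A", "B", "C", "D"),
  risk    = c(0.05, 0.20, 0.45, 0.30),  # lower is better
  utility = c(0.20, 0.70, 0.95, 0.50)   # higher is better
)

# Min-max normalization to [0, 1] (assumed; the package may normalize differently)
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
scores <- transform(raw, risk = minmax(risk), utility = minmax(utility))

# Method i is dominated if some other method is at least as good on both
# dimensions and strictly better on at least one
is_dominated <- function(i, df) {
  any(df$risk <= df$risk[i] & df$utility >= df$utility[i] &
        (df$risk < df$risk[i] | df$utility > df$utility[i]))
}

scores$pareto <- !vapply(seq_len(nrow(scores)), is_dominated,
                         logical(1), df = scores)
print(scores)
```

In this toy example method D is dominated by method B (lower risk and higher utility), while A, B, and C each represent a different trade-off and remain on the Pareto front.
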

```{r case-rumap, warning=FALSE}
set.seed(42)
ru <- rumap(orig,
            list("A: Independent" = synA,
                 "B: Bootstrap+noise" = synB,
                 "C: Near-copy" = synC),
            risk_measures = c("dcap", "tcap", "ims"),
            utility_measures = c("pmse", "wasserstein"),
            key_vars = qi, target_var = sens, seed = 42)
print(ru)
```

```{r case-rumap-scatter, fig.width=8, fig.height=6}
plot(ru, which = 1)  # R-U scatterplot with Pareto front
```

The R-U scatterplot places each method in the composite risk-utility plane. Methods in the lower-right corner (low risk, high utility) are preferred.

```{r case-rumap-heatmap, fig.width=8, fig.height=5}
plot(ru, which = 2)  # Heatmap of individual measures
```

The heatmap reveals *why* the methods differ. Method A achieves low risk across all measures but has poor utility. Method C has excellent utility but elevated attribution risk. Method B balances the two.

## Step 3: Decision {#sec-case-decision}

The analysis supports a structured decision:

- **Method A** is appropriate when data is released to the general public and any re-identification would be unacceptable.
- **Method B** is appropriate for controlled-access research environments where moderate risk is acceptable.
- **Method C** should be rejected --- its risk profile is too close to the original data.

This iterative workflow --- screen with `disclosure_report()`, compare with `rumap()`, and refine synthesis parameters --- is the core use case that riskutility is designed to support.

# Summary and Discussion {#sec-discussion}

## Contributions

The riskutility package makes five contributions to the R ecosystem for statistical disclosure control:

1. **Comprehensive coverage.** It is the first R package to unify all six risk assessment paradigms --- frequency-based privacy models, attribution (CAP), ML-based (RAPID), distance-based, record linkage, and membership inference --- under a single API.
2. **Novel implementations.** It provides the first R implementations of RAPID, two of the three GDPR failure criteria from @stadler2022synthetic (singling out and linkability), $t$-closeness, DOMIAS density-based membership inference, and eight-method record linkage with bijective and optimal transport matching.
3. **Unified API.** The `synth_pair` container and consistent S3 class pattern eliminate parameter repetition and ensure practitioners can switch between risk measures without learning new interfaces.
4. **Multivariate R-U mapping.** The `rumap()` function implements the framework of @thees2026beyond for comparing multiple synthesis approaches on multiple risk and utility dimensions simultaneously.
5. **Ecosystem integration.** The `from_sdcMicro()`, `from_synthpop()`, and `from_simPop()` constructors allow practitioners to evaluate data produced by any of the three main R packages.

## Partially vs Fully Synthetic Data

The privacy-utility evaluation differs depending on the synthesis approach [@drechsler2011synthetic]:

- **Fully synthetic data**: All records are synthetic; all risk metrics are directly applicable.
- **Partially synthetic data**: Only sensitive values are replaced. Attribution metrics (DCAP, RAPID) are particularly relevant because original quasi-identifiers provide a direct matching key. Distance-based metrics should focus on the synthesized variables.
- **Multiple synthetic datasets**: When $m > 1$ datasets are generated, evaluate each separately and report the worst-case risk across all $m$ releases.

## Remediation

If disclosure risk is too high:

1. Add more noise to the synthesis process
2. Reduce granularity of quasi-identifiers (e.g., age groups instead of exact age)
3. Apply additional anonymization techniques (suppression, generalization)
4. Re-synthesize with different model settings
5. Re-evaluate with **riskutility** --- iterate until risk-utility balance is acceptable

If utility is too low:

1. Use a more flexible synthesizer (CART, Bayesian network, GAN)
2. Reduce the amount of perturbation
3. Identify affected variables/subgroups (`subgroup_utility()`, per-variable Hellinger)
4. Consider partially synthetic data (synthesize only sensitive variables)

## Limitations and Recommendations

**No formal privacy guarantees.** All measures provide empirical risk assessment. A low DCAP score does not prove that no attacker can succeed. Empirical and formal approaches (differential privacy) are complementary.

**Key variable selection.** Results depend heavily on the choice of quasi-identifiers. Practitioners should base QI selection on a realistic threat model, not on convenience.

**Threshold interpretation.** The pass/fail thresholds used by `disclosure_report()` are pragmatic defaults. Different contexts require different thresholds.

We recommend the following minimal evaluation protocol:

1. **Always run** `disclosure_report()` with `compute = "all"` as a first screening.
2. **For publication-quality assessment**, use `rumap()` to compare multiple synthesis approaches and identify Pareto-optimal methods.
3. **For regulatory compliance** (GDPR), include singling out and linkability tests.
4. **Interpret distance-based metrics cautiously.** Following @yao2025dcr, do not rely on DCR/NNDR alone.

## Future Work

Four extensions are planned: (i) a Shiny dashboard for interactive evaluation; (ii) integration with differential privacy frameworks; (iii) computational optimizations for large datasets ($n > 50{,}000$); and (iv) population-level risk estimation from sample data.

# Computational Details {#sec-computational}

All computations were performed using R `r getRversion()` [@R2025] with the riskutility package version `r packageVersion("riskutility")`. Core dependencies include data.table for efficient data manipulation and ggplot2 for all visualizations.
ML-based methods require optional packages: ranger, rpart, xgboost, and caret, loaded conditionally via `requireNamespace()`.

```{r scalability-table, echo=FALSE}
scale_df <- data.frame(
  Metric = c("dcap()", "dcr()", "kanonymity()", "energy_distance()",
             "mmd(method='rff')", "propscore()", "rumap()"),
  `n=1000` = c("< 1 s", "< 1 s", "< 1 s", "< 1 s", "< 1 s", "~1 s", "~10 s"),
  `n=10000` = c("~5 s", "~10 s", "~1 s", "~2 s", "~1 s", "~5 s", "~60 s"),
  `n=100000` = c("~60 s", "~5 min", "~5 s", "~30 s", "~5 s", "~30 s", "depends"),
  Scaling = c("O(n*k)", "O(n^2)", "O(n log n)", "O(n^2)", "O(n*D)",
              "O(n*p)", "Sum of components"),
  check.names = FALSE
)
knitr::kable(scale_df, caption = "Approximate runtimes for key metrics.")
```

```{r session-info}
sessionInfo()
```

# References {-}