clinCompare: Dataset Comparison with CDISC Validation

Introduction

clinCompare is an R package for comparing datasets at the dataset, variable, and observation level. For clinical trial data, an optional CDISC validation layer checks SDTM and ADaM conformance automatically. The package is designed for statistical programmers, data managers, and regulatory professionals who need to ensure data quality and compliance with industry standards.

Key Features

Compare dimensions, variable names, data types, and values in a single call
Key-based row matching with auto-detected CDISC ID variables
CDISC validation for 51 SDTM domains and 14 ADaM datasets
Export results to HTML, plain text, or Excel
Batch compare entire submissions across two directories
Numeric tolerance for floating-point comparisons

Getting Started

library(clinCompare)

Basic Dataset Comparison

Comparing Two Data Frames

The compare_datasets() function gives a comprehensive overview: dimension checks, variable comparison, type mismatches, and row-level value differences.

baseline <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE     = c(45, 52, 38),
  SEX     = c("M", "F", "M"),
  RACE    = c("WHITE", "WHITE", "ASIAN"),
  stringsAsFactors = FALSE
)

updated <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE     = c(45, 53, 38),
  SEX     = c("M", "F", "F"),
  RACE    = c("WHITE", "WHITE", "ASIAN"),
  stringsAsFactors = FALSE
)

result <- compare_datasets(baseline, updated)
result

## 
## ================================================== 
##   clinCompare: Dataset Comparison
## ================================================== 
## 
##   Status: DIFFERENCES FOUND
## 
##   Base dataset:    3 rows x 4 columns
##   Compare dataset: 3 rows x 4 columns
## 
##   Shared columns:       4
## 
## -------------------------------------------------- 
##   Value Comparison
## -------------------------------------------------- 
##   2 difference(s) found in 2 of 4 column(s)
##   2 of 3 row(s) affected (66.7%)
## 
##   Per-Column Summary:
##   Column               Type          Differences   Largest Diff
##   ------------------------------------------------------------
##   AGE                  numeric                 1              1
##   SEX                  character               1              -
## 
##   Differences in 'AGE' (showing 1 of 1):
##  Row Base Compare Diff
##  2   52   53      -1  
## 
##   Differences in 'SEX' (showing 1 of 1):
##  Row Base Compare
##  3   M    F      
## 
## -------------------------------------------------- 
##   Summary: 2 values differ in 'AGE' and 'SEX', affecting 2 rows of 3. 
## ================================================== 
## 
##   Try next:
##     get_all_differences(result) : extract all diffs as a data frame
##     export_report(result, "report.html") : save as HTML report
##     export_report(result, "report.xlsx") : save as Excel workbook
##     compare_datasets(df1, df2, tolerance = 1) : largest numeric diff is 1

The result is a structured list you can drill into programmatically:

# Per-column difference counts
result$observation_comparison$discrepancies

## USUBJID     AGE     SEX    RACE 
##       0       1       1       0

# Row-level details for a specific variable
result$observation_comparison$details$SEX

##   Row Value_in_df1 Value_in_df2
## 1   3            M            F
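
As a small illustration of drilling in, the sketch below lists the columns with at least one difference and stacks their per-column detail tables into one data frame. It uses a hand-built stand-in list (`obs`) that mirrors only the structure printed above. The shape of the `AGE` detail table is an assumption, since only the `SEX` table is shown.

```r
# Stand-in for result$observation_comparison (structure copied from the output
# above; the AGE detail table is ASSUMED to look like the SEX one).
obs <- list(
  discrepancies = c(USUBJID = 0, AGE = 1, SEX = 1, RACE = 0),
  details = list(
    AGE = data.frame(Row = 2, Value_in_df1 = 52, Value_in_df2 = 53),
    SEX = data.frame(Row = 3, Value_in_df1 = "M", Value_in_df2 = "F",
                     stringsAsFactors = FALSE)
  )
)

# Columns with at least one differing value
changed <- names(obs$discrepancies)[obs$discrepancies > 0]

# Stack the per-column tables; values are converted to character so that
# numeric and text columns can share one table
stacked <- do.call(rbind, lapply(changed, function(v) {
  d <- obs$details[[v]]
  data.frame(
    Variable = v,
    Row      = d$Row,
    Base     = as.character(d$Value_in_df1),
    Compare  = as.character(d$Value_in_df2),
    stringsAsFactors = FALSE
  )
}))
stacked
```

With a real `result` object, `obs` would be replaced by `result$observation_comparison`, provided the detail tables have the columns assumed here.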

Comparing Variables

Use compare_variables() to focus on structural differences between two datasets – column names, data types, and variable ordering.

df_a <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE     = c(45, 52),
  SEX     = c("M", "F"),
  stringsAsFactors = FALSE
)

df_b <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE     = c(45L, 52L),
  WEIGHT  = c(75.5, 80.2),
  stringsAsFactors = FALSE
)

compare_variables(df_a, df_b)

## $discrepancies
## [1] 2
## 
## $details
## $details$common_columns
## [1] "USUBJID" "AGE"    
## 
## $details$extra_in_df1
## [1] "SEX"
## 
## $details$extra_in_df2
## [1] "WEIGHT"
## 
## $details$data_type_comparisons
## $details$data_type_comparisons[[1]]
## $details$data_type_comparisons[[1]]$column
## [1] "USUBJID"
## 
## $details$data_type_comparisons[[1]]$type_df1
## [1] "character"
## 
## $details$data_type_comparisons[[1]]$type_df2
## [1] "character"
## 
## 
## $details$data_type_comparisons[[2]]
## $details$data_type_comparisons[[2]]$column
## [1] "AGE"
## 
## $details$data_type_comparisons[[2]]$type_df1
## [1] "numeric"
## 
## $details$data_type_comparisons[[2]]$type_df2
## [1] "integer"
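
The nested list above reports the type of every shared column, whether or not the types agree. To see just the mismatches, the same information can be rebuilt as a flat table. The sketch below does this in base R directly from the two data frames; it is not how compare_variables() works internally, and the table layout is only one possible choice.

```r
df_a <- data.frame(USUBJID = c("SUBJ01", "SUBJ02"), AGE = c(45, 52),
                   SEX = c("M", "F"), stringsAsFactors = FALSE)
df_b <- data.frame(USUBJID = c("SUBJ01", "SUBJ02"), AGE = c(45L, 52L),
                   WEIGHT = c(75.5, 80.2), stringsAsFactors = FALSE)

# Columns present in both data frames
common <- intersect(names(df_a), names(df_b))

# First class of each shared column in each data frame
types <- data.frame(
  column   = common,
  type_df1 = unname(vapply(df_a[common], function(x) class(x)[1], character(1))),
  type_df2 = unname(vapply(df_b[common], function(x) class(x)[1], character(1))),
  stringsAsFactors = FALSE
)

# Keep only the shared columns whose types disagree
mismatched <- types[types$type_df1 != types$type_df2, ]
mismatched
```

Here only `AGE` is flagged, because `c(45, 52)` is stored as double ("numeric") while `c(45L, 52L)` is stored as integer.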

Comparing Observations

Use compare_observations() for row-by-row value comparison on common columns:

df1 <- data.frame(
  ID    = c(1, 2, 3),
  SCORE = c(80, 90, 70),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  ID    = c(1, 2, 3),
  SCORE = c(80, 95, 70),
  stringsAsFactors = FALSE
)

compare_observations(df1, df2)

## $discrepancies
##    ID SCORE 
##     0     1 
## 
## $details
## $details$SCORE
##   Row Value_in_df1 Value_in_df2
## 1   2           90           95
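
The feature list mentions a numeric tolerance, and the compare_datasets() output earlier suggests a `tolerance` argument. Whether compare_observations() accepts the same argument is not shown here. To illustrate the idea, the following base R sketch flags a pair of values only when they differ by more than a chosen tolerance. The helper `differs()` is hypothetical, and `df2` adds a tiny floating-point perturbation that is not in the example above.

```r
df1 <- data.frame(ID = c(1, 2, 3), SCORE = c(80, 90, 70))
# Third score carries a tiny perturbation, added here for illustration
df2 <- data.frame(ID = c(1, 2, 3), SCORE = c(80, 95, 70.0000001))

# Hypothetical helper: TRUE where two numeric vectors differ by more than `tol`
differs <- function(x, y, tol = 0) {
  one_na <- xor(is.na(x), is.na(y))   # missing on one side only
  gap    <- abs(x - y) > tol          # NA where either side is missing
  one_na | (!is.na(gap) & gap)
}

exact  <- which(differs(df1$SCORE, df2$SCORE))              # any difference
noise  <- which(differs(df1$SCORE, df2$SCORE, tol = 1e-6))  # ignore tiny noise
coarse <- which(differs(df1$SCORE, df2$SCORE, tol = 5))     # ignore gaps up to 5
```

With these inputs, `exact` flags rows 2 and 3, `noise` flags only row 2, and `coarse` flags nothing because a gap of exactly 5 does not exceed the tolerance.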

Data Preparation

Cleaning Data

Remove exact duplicate rows and standardize text case before comparing. Note that rows differing only in case are not merged in the output below:

messy <- data.frame(
  NAME  = c("Alice", "alice", "Bob", "Bob"),
  SCORE = c(100, 100, 85, 85),
  stringsAsFactors = FALSE
)

clean_dataset(messy, remove_duplicates = TRUE, convert_to_case = "upper")

##    NAME SCORE
## 1 ALICE   100
## 2 ALICE   100
## 3   BOB    85
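
The output above keeps two ALICE rows. If rows that differ only by case should collapse into one, an option is to normalize case first and deduplicate afterwards. This is a minimal base R sketch of that ordering, not a description of clean_dataset() itself:

```r
messy <- data.frame(
  NAME  = c("Alice", "alice", "Bob", "Bob"),
  SCORE = c(100, 100, 85, 85),
  stringsAsFactors = FALSE
)

# Normalize case first ...
normalized <- messy
normalized$NAME <- toupper(normalized$NAME)

# ... then drop rows that have become exact duplicates
deduped <- unique(normalized)
rownames(deduped) <- NULL
deduped
```

This leaves one ALICE row and one BOB row. Whether that is the right result depends on whether case differences are meaningful in the source data.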

Sorting and Filtering

Sort both datasets by the same columns so that rows line up before comparison:

df_unsorted1 <- data.frame(
  REGION = c("West", "East", "North"),
  SALES  = c(150, 200, 180)
)

df_unsorted2 <- data.frame(
  REGION = c("East", "North", "West"),
  SALES  = c(210, 185, 160)
)

prepped <- prepare_datasets(df_unsorted1, df_unsorted2, sort_columns = "REGION")
prepped$df1

##   REGION SALES
## 1   East   200
## 2  North   180
## 3   West   150

prepped$df2

##   REGION SALES
## 1   East   210
## 2  North   185
## 3   West   160
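
To show why identical sorting matters for a row-by-row comparison, here is a base R sketch that sorts both data frames by the same key and then lines the values up side by side. The helper `sort_by_cols()` is hypothetical and stands in for the sorting step only. The `Diff` column uses base minus compare, matching the sign in the printed reports above.

```r
df_unsorted1 <- data.frame(REGION = c("West", "East", "North"),
                           SALES  = c(150, 200, 180),
                           stringsAsFactors = FALSE)
df_unsorted2 <- data.frame(REGION = c("East", "North", "West"),
                           SALES  = c(210, 185, 160),
                           stringsAsFactors = FALSE)

# Hypothetical helper: sort a data frame by one or more columns
sort_by_cols <- function(df, cols) {
  ord <- do.call(order, unname(as.list(df[cols])))
  out <- df[ord, , drop = FALSE]
  rownames(out) <- NULL
  out
}

a <- sort_by_cols(df_unsorted1, "REGION")
b <- sort_by_cols(df_unsorted2, "REGION")

# Rows can be compared by position only if the keys now line up
stopifnot(identical(a$REGION, b$REGION))

aligned <- data.frame(REGION = a$REGION, Base = a$SALES, Compare = b$SALES,
                      Diff = a$SALES - b$SALES, stringsAsFactors = FALSE)
aligned
```

Without the sort, row 1 would pair West (150) with East (210), and every row would be reported as different for the wrong reason.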

Group-Wise Comparison

Compare datasets within specific subgroups, which is useful for multi-site or multi-arm studies:

site_data_v1 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE    = c(45, 52, 38, 61)
)

site_data_v2 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE    = c(45, 53, 38, 62)
)

by_site <- compare_by_group(site_data_v1, site_data_v2, group_vars = "SITEID")
names(by_site)

## [1] "SITE01" "SITE02"
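
The contents of each element of `by_site` are not shown here. To make the idea of a group-wise comparison concrete, the base R sketch below splits both versions by `SITEID` and counts `AGE` differences within each site. It assumes both versions contain the same subjects in the same order within every site.

```r
site_data_v1 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE    = c(45, 52, 38, 61),
  stringsAsFactors = FALSE
)
site_data_v2 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE    = c(45, 53, 38, 62),
  stringsAsFactors = FALSE
)

# One data frame per site, for each version
v1_by_site <- split(site_data_v1, site_data_v1$SITEID)
v2_by_site <- split(site_data_v2, site_data_v2$SITEID)

# Count AGE differences per site, comparing rows by position within a site
age_diffs <- vapply(names(v1_by_site), function(site) {
  sum(v1_by_site[[site]]$AGE != v2_by_site[[site]]$AGE)
}, integer(1))
age_diffs
```

Each site shows one changed `AGE` value in this example, which makes it easy to see that the changes are not confined to a single site.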

CDISC Comparison

What is CDISC?

CDISC (Clinical Data Interchange Standards Consortium) provides standardized formats for regulatory submissions:

SDTM (Study Data Tabulation Model): Standardized tabulations of the data collected in a clinical trial
ADaM (Analysis Data Model): Derived datasets used for statistical analysis

CDISC validation checks whether datasets conform to these standards, which regulators expect for submissions. For official CDISC standards documentation, see https://www.cdisc.org/standards.

Auto-Detecting CDISC Domains

clinCompare auto-detects the CDISC domain of a dataset using column matching, ADaM indicator columns, and filename hints:

dm_data <- data.frame(
  STUDYID  = rep("STUDY01", 3),
  USUBJID  = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE      = c(45, 62, 51),
  SEX      = c("M", "F", "M"),
  RACE     = c("WHITE", "BLACK", "ASIAN"),
  ARMCD    = c("TRT", "PBO", "TRT"),
  ARM      = c("Treatment", "Placebo", "Treatment"),
  stringsAsFactors = FALSE
)

detect_cdisc_domain(dm_data)

## Warning: Ambiguous domain detection: 'DM' (58%) vs 'ADSL' (56%). Specify
## `domain` and `standard` explicitly for reliable results.

## $standard
## [1] "SDTM"
## 
## $domain
## [1] "DM"
## 
## $confidence
## [1] 0.5798462
## 
## $message
## [1] "SDTM domain 'DM' detected with 58% confidence (38% required vars present, 100% of columns explained)"
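
Because detection can be ambiguous, a script may want to fall back to an explicitly supplied domain when confidence is low. The sketch below shows one way to do that. The `detected` list is a stand-in copied from the output above, and both the 0.80 cutoff and the helper `resolve_domain()` are hypothetical choices, not part of clinCompare.

```r
# Stand-in for the list printed above (values copied from that output)
detected <- list(standard = "SDTM", domain = "DM", confidence = 0.5798462)

# Hypothetical policy: trust auto-detection only at or above this cutoff
min_confidence <- 0.80

# Hypothetical helper: pick the detected domain or an explicit fallback
resolve_domain <- function(detected, fallback_domain, fallback_standard,
                           cutoff = min_confidence) {
  if (!is.null(detected$confidence) && detected$confidence >= cutoff) {
    list(domain = detected$domain, standard = detected$standard,
         source = "auto")
  } else {
    list(domain = fallback_domain, standard = fallback_standard,
         source = "explicit")
  }
}

choice <- resolve_domain(detected,
                         fallback_domain   = "DM",
                         fallback_standard = "SDTM")
choice$source
```

At 58% confidence the fallback is used, so `choice$domain` and `choice$standard` come from the explicit arguments rather than from detection.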

Comparing with CDISC Validation

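cdisc_compare() is the flagship function. In one call it compares two datasets, detects the CDISC domain and key variables, performs key-based row matching, and validates against CDISC standards. The example below passes domain and standard explicitly, as the ambiguity warning in the previous section recommends.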

dm_v1 <- data.frame(
  STUDYID  = rep("STUDY01", 3),
  USUBJID  = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE      = c(45, 62, 51),
  SEX      = c("M", "F", "M"),
  RACE     = c("WHITE", "BLACK", "ASIAN"),
  ARMCD    = c("TRT", "PBO", "TRT"),
  ARM      = c("Treatment", "Placebo", "Treatment"),
  RFSTDTC  = c("2024-01-15", "2024-01-16", "2024-01-17"),
  stringsAsFactors = FALSE
)

dm_v2 <- data.frame(
  STUDYID  = rep("STUDY01", 3),
  USUBJID  = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE      = c(45, 62, 52),
  SEX      = c("M", "F", "M"),
  RACE     = c("WHITE", "BLACK", "ASIAN"),
  ARMCD    = c("TRT", "PBO", "TRT"),
  ARM      = c("Treatment", "Placebo", "Treatment"),
  RFSTDTC  = c("2024-01-15", "2024-01-16", "2024-01-17"),
  stringsAsFactors = FALSE
)

cdisc_result <- cdisc_compare(dm_v1, dm_v2, domain = "DM", standard = "SDTM")

## ID variables auto-detected for SDTM DM: STUDYID, USUBJID

cdisc_result

## 
## ================================================== 
##   clinCompare: CDISC Comparison Results
## ================================================== 
## 
##   Domain:              DM (SDTM)
##   Base dataset:        3 rows x 8 columns
##   Compare dataset:     3 rows x 8 columns
##   Matching:            key-based (STUDYID, USUBJID)
## 
##   Differences:         0 attribute, 1 value
## 
## -------------------------------------------------- 
##   Value Comparison
## -------------------------------------------------- 
##   1 difference(s) found in 1 of 6 column(s)
##   1 of 3 row(s) affected (33.3%)
## 
##   Per-Column Summary:
##   Column               Type          Differences   Largest Diff
##   ------------------------------------------------------------
##   AGE                  numeric                 1              1
## 
##   Differences in 'AGE' (showing 1 of 1):
##  STUDYID USUBJID Base Compare Diff
##  STUDY01 SUBJ03  51   52      -1  
## 
##   CDISC Compliance:    FAIL (14 errors, 8 warnings)
## ================================================== 
## 
##   Try next:
##     get_all_differences(result) : extract all diffs as a data frame
##     export_report(result, "report.html") : save as HTML report
##     export_report(result, "report.xlsx") : save as Excel workbook
##     print_cdisc_validation(result$cdisc_validation_df1) : base dataset issues
##     print_cdisc_validation(result$cdisc_validation_df2) : compare dataset issues
##     generate_cdisc_report(result) : full CDISC compliance report
##     cdisc_compare(..., tolerance = 1) : largest numeric diff is 1
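
The report above identifies the differing row by its keys (STUDYID, USUBJID) rather than by row number. To illustrate what key-based matching buys, here is a base R sketch using merge(). It is a conceptual stand-in, not the package's matching code, and the compare dataset is deliberately reordered to show that row order no longer matters.

```r
base <- data.frame(
  STUDYID = "STUDY01",
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE     = c(45, 62, 51),
  stringsAsFactors = FALSE
)
# Same subjects in a different row order, with one changed AGE
comp <- data.frame(
  STUDYID = "STUDY01",
  USUBJID = c("SUBJ03", "SUBJ01", "SUBJ02"),
  AGE     = c(52, 45, 62),
  stringsAsFactors = FALSE
)

keys <- c("STUDYID", "USUBJID")

# Pair rows by key; non-key columns get a suffix per source
matched <- merge(base, comp, by = keys, suffixes = c(".base", ".compare"))

# Keep pairs whose AGE differs; Diff is base minus compare
changed <- matched[matched$AGE.base != matched$AGE.compare, ]
changed$Diff <- changed$AGE.base - changed$AGE.compare
changed
```

A positional comparison of these two data frames would flag all three rows. Matching on keys flags only SUBJ03.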

Validating a Single Dataset

Use validate_cdisc() to check a dataset against CDISC standards without comparing it to another dataset:

validation <- validate_cdisc(dm_v1, domain = "DM", standard = "SDTM")
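
The structure of the returned `validation` object is not shown here. As a rough idea of the kind of check a validator performs, the base R sketch below tests for the presence of required variables and for date strings in `YYYY-MM-DD` form. The `required_dm` vector is an illustrative subset of variables that the SDTM Implementation Guide marks as Required for DM. It is an assumption for this example, not clinCompare's metadata, and should be checked against the IG version in use.

```r
dm_v1 <- data.frame(
  STUDYID  = rep("STUDY01", 3),
  USUBJID  = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE      = c(45, 62, 51),
  SEX      = c("M", "F", "M"),
  RACE     = c("WHITE", "BLACK", "ASIAN"),
  ARMCD    = c("TRT", "PBO", "TRT"),
  ARM      = c("Treatment", "Placebo", "Treatment"),
  RFSTDTC  = c("2024-01-15", "2024-01-16", "2024-01-17"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: illustrative subset of Required DM variables
required_dm <- c("STUDYID", "DOMAIN", "USUBJID", "SUBJID",
                 "SITEID", "SEX", "COUNTRY")

# Required variables that the dataset does not contain
missing_required <- setdiff(required_dm, names(dm_v1))
missing_required

# Subjects whose RFSTDTC is not a complete YYYY-MM-DD date (simplified check;
# ISO 8601 also allows partial dates and times, which this pattern rejects)
bad_dates <- dm_v1$USUBJID[!grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dm_v1$RFSTDTC)]
bad_dates
```

For the small example dataset, four of the listed variables are absent and all dates pass the simplified pattern.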

Extracting All Differences

get_all_differences() returns every value-level difference as a single long-format data frame, making it easy to filter, count, or export:

diffs <- get_all_differences(cdisc_result)
diffs

##   STUDYID USUBJID Variable Base Compare Diff  PctDiff
## 1 STUDY01  SUBJ03      AGE   51      52   -1 1.960784
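
Once the differences are in long format, ordinary data frame operations apply. The sketch below works on a stand-in copied from the printed output, so it runs without the package. The 1 percent threshold is a hypothetical choice. The last step checks that the printed row is consistent with `Diff` being base minus compare and `PctDiff` being the absolute difference relative to the base value. That is an inference from this one row, not documented behavior.

```r
# Stand-in mirroring the data frame printed above
diffs <- data.frame(
  STUDYID = "STUDY01", USUBJID = "SUBJ03", Variable = "AGE",
  Base = 51, Compare = 52, Diff = -1, PctDiff = 1.960784,
  stringsAsFactors = FALSE
)

# Number of differences per variable and per subject
per_variable <- table(diffs$Variable)
per_subject  <- table(diffs$USUBJID)

# Keep only numeric changes larger than 1 percent (hypothetical threshold)
big <- diffs[!is.na(diffs$PctDiff) & diffs$PctDiff > 1, ]

# Recompute the derived columns from Base and Compare
recomputed_diff <- diffs$Base - diffs$Compare
recomputed_pct  <- 100 * abs(recomputed_diff) / abs(diffs$Base)
```

The same filtering would apply to the real `diffs` object, assuming its columns are named as in the printed output.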

Exporting Reports

export_report() auto-detects the output format from the file extension:

# HTML report
export_report(cdisc_result, file.path(tempdir(), "dm_report.html"))

## Report written to: /var/folders/40/7745jn2j13q9cnp73bsd5_dc0000gn/T//RtmppUVSRG/dm_report.html

# Text report
export_report(cdisc_result, file.path(tempdir(), "dm_report.txt"))

## Report written to: /var/folders/40/7745jn2j13q9cnp73bsd5_dc0000gn/T//RtmppUVSRG/dm_report.txt

Excel export requires the openxlsx package:

# Excel workbook with Summary, Variable Diffs, Value Diffs, and CDISC tabs
export_report(cdisc_result, file.path(tempdir(), "dm_report.xlsx"))
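
Since the Excel format depends on an optional package, a script can check for it first and fall back to HTML. The sketch below only builds the output path. The export_report() call is left commented out because it needs clinCompare and the `cdisc_result` object from above.

```r
# Use Excel when openxlsx is installed, otherwise fall back to HTML
has_openxlsx <- requireNamespace("openxlsx", quietly = TRUE)
ext <- if (has_openxlsx) "xlsx" else "html"

report_path <- file.path(tempdir(), paste0("dm_report.", ext))
report_path

# export_report(cdisc_result, report_path)
```

This keeps a batch job from stopping on machines where openxlsx is missing.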

Batch Comparing a Submission

compare_submission() scans two directories, matches files by name, and runs cdisc_compare() on every matched pair. Domain, standard, and key variables are all auto-detected per file.

results <- compare_submission(
  base_dir    = "submission_v1/",
  compare_dir = "submission_v2/",
  output_file = "submission_diff.xlsx"
)
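
Matching files by name can be pictured with a few lines of base R. The sketch below creates two throwaway directories under tempdir() and reports which files would be paired and which exist on one side only. The directory names and the `.csv` extension are placeholders. Which file formats compare_submission() reads is not stated here.

```r
# Two throwaway directories with partially overlapping file names
base_dir    <- file.path(tempdir(), "submission_v1_demo")
compare_dir <- file.path(tempdir(), "submission_v2_demo")
dir.create(base_dir, showWarnings = FALSE)
dir.create(compare_dir, showWarnings = FALSE)
invisible(file.create(file.path(base_dir,    c("dm.csv", "ae.csv", "vs.csv"))))
invisible(file.create(file.path(compare_dir, c("dm.csv", "ae.csv", "lb.csv"))))

base_files    <- list.files(base_dir)
compare_files <- list.files(compare_dir)

matched      <- intersect(base_files, compare_files)  # pairs to compare
only_base    <- setdiff(base_files, compare_files)    # missing from the new version
only_compare <- setdiff(compare_files, base_files)    # new in the new version

matched
only_base
only_compare
```

Listing the unmatched files before a batch run is a cheap way to catch a dataset that was dropped or renamed between submissions.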

CDISC Coverage

clinCompare ships with hand-curated metadata for 51 SDTM domains (IG 3.4, with 3.3 support) and 14 ADaM datasets (IG 1.3, with 1.2/1.1 provenance tracking).

SDTM domains: AE, AG, BE, BS, CE, CM, CO, CP, DA, DD, DM, DS, DV, EC, EG, EX, FA, GF, HO, IE, IS, LB, MB, MH, MI, ML, MS, PC, PE, PP, PR, QS, RELREC, RS, SC, SE, SM, SS, SU, SUPPQUAL, SV, TA, TD, TE, TI, TM, TR, TS, TU, TV, VS.

ADaM datasets: ADAE, ADCM, ADEG, ADEFF, ADEX, ADLB, ADMH, ADPC, ADPP, ADRS, ADSL, ADTR, ADTTE, ADVS.

Disclaimer: clinCompare is a quality-assurance and exploratory analysis tool. It is not a substitute for official CDISC compliance validation software (e.g., Pinnacle 21). For regulatory submissions, always cross-reference with your organization’s validated tools.

Summary

clinCompare provides a complete workflow for dataset comparison in clinical trials: compare any two data frames with compare_datasets(), add CDISC validation with cdisc_compare(), batch process entire submissions with compare_submission(), and export results to HTML, text, or Excel with export_report().

For more information and additional examples, visit the GitHub repository.