clinCompare is an R package for comparing datasets at the dataset, variable, and observation level. For clinical trial data, an optional CDISC validation layer checks SDTM and ADaM conformance automatically. The package is designed for statistical programmers, data managers, and regulatory professionals who need to ensure data quality and compliance with industry standards.
The compare_datasets() function gives a comprehensive
overview: dimension checks, variable comparison, type mismatches, and
row-level value differences.
baseline <- data.frame(
USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
AGE = c(45, 52, 38),
SEX = c("M", "F", "M"),
RACE = c("WHITE", "WHITE", "ASIAN"),
stringsAsFactors = FALSE
)
updated <- data.frame(
USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
AGE = c(45, 53, 38),
SEX = c("M", "F", "F"),
RACE = c("WHITE", "WHITE", "ASIAN"),
stringsAsFactors = FALSE
)
result <- compare_datasets(baseline, updated)
result
##
## ==================================================
## clinCompare: Dataset Comparison
## ==================================================
##
## Status: DIFFERENCES FOUND
##
## Base dataset: 3 rows x 4 columns
## Compare dataset: 3 rows x 4 columns
##
## Shared columns: 4
##
## --------------------------------------------------
## Value Comparison
## --------------------------------------------------
## 2 difference(s) found in 2 of 4 column(s)
## 2 of 3 row(s) affected (66.7%)
##
## Per-Column Summary:
## Column Type Differences Largest Diff
## ------------------------------------------------------------
## AGE numeric 1 1
## SEX character 1 -
##
## Differences in 'AGE' (showing 1 of 1):
## Row Base Compare Diff
## 2 52 53 -1
##
## Differences in 'SEX' (showing 1 of 1):
## Row Base Compare
## 3 M F
##
## --------------------------------------------------
## Summary: 2 values differ in 'AGE' and 'SEX', affecting 2 rows of 3.
## ==================================================
##
## Try next:
## get_all_differences(result) : extract all diffs as a data frame
## export_report(result, "report.html") : save as HTML report
## export_report(result, "report.xlsx") : save as Excel workbook
## compare_datasets(df1, df2, tolerance = 1) : largest numeric diff is 1
The result is a structured list you can drill into programmatically:
## USUBJID AGE SEX RACE
## 0 1 1 0
## Row Value_in_df1 Value_in_df2
## 1 3 M F
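For intuition, the per-column difference counts shown above can be reproduced with a few lines of base R. This is a minimal sketch that assumes both data frames share the same columns and row order and contain no missing values. It is not clinCompare's internal implementation, and `count_diffs` is a name chosen here purely for illustration.

```r
# Illustrative only: count positional value differences per column in base R.
baseline <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 52, 38),
  SEX = c("M", "F", "M"),
  RACE = c("WHITE", "WHITE", "ASIAN"),
  stringsAsFactors = FALSE
)
updated <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 53, 38),
  SEX = c("M", "F", "F"),
  RACE = c("WHITE", "WHITE", "ASIAN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assumes identical column names, row order, and no NAs.
count_diffs <- function(df1, df2) {
  vapply(names(df1), function(col) sum(df1[[col]] != df2[[col]]), integer(1))
}

diff_counts <- count_diffs(baseline, updated)
diff_counts
```

Real datasets usually need extra handling for missing values and reordered rows, which is where a dedicated comparison function earns its keep.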
Use compare_variables() to focus on structural
differences between two datasets – column names, data types, and
variable ordering.
df_a <- data.frame(
USUBJID = c("SUBJ01", "SUBJ02"),
AGE = c(45, 52),
SEX = c("M", "F"),
stringsAsFactors = FALSE
)
df_b <- data.frame(
USUBJID = c("SUBJ01", "SUBJ02"),
AGE = c(45L, 52L),
WEIGHT = c(75.5, 80.2),
stringsAsFactors = FALSE
)
compare_variables(df_a, df_b)
## $discrepancies
## [1] 2
##
## $details
## $details$common_columns
## [1] "USUBJID" "AGE"
##
## $details$extra_in_df1
## [1] "SEX"
##
## $details$extra_in_df2
## [1] "WEIGHT"
##
## $details$data_type_comparisons
## $details$data_type_comparisons[[1]]
## $details$data_type_comparisons[[1]]$column
## [1] "USUBJID"
##
## $details$data_type_comparisons[[1]]$type_df1
## [1] "character"
##
## $details$data_type_comparisons[[1]]$type_df2
## [1] "character"
##
##
## $details$data_type_comparisons[[2]]
## $details$data_type_comparisons[[2]]$column
## [1] "AGE"
##
## $details$data_type_comparisons[[2]]$type_df1
## [1] "numeric"
##
## $details$data_type_comparisons[[2]]$type_df2
## [1] "integer"
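The same structural questions (which columns are shared, which are unique to one side, and whether shared columns have the same type) can be answered in base R with set operations. The sketch below is only an illustration of the idea. The object names are assumptions, and it does not reproduce how compare_variables() counts discrepancies.

```r
# Illustrative only: structural comparison of two data frames in base R.
df_a <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE = c(45, 52),
  SEX = c("M", "F"),
  stringsAsFactors = FALSE
)
df_b <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE = c(45L, 52L),
  WEIGHT = c(75.5, 80.2),
  stringsAsFactors = FALSE
)

common    <- intersect(names(df_a), names(df_b))  # columns present in both
only_in_a <- setdiff(names(df_a), names(df_b))    # columns only in df_a
only_in_b <- setdiff(names(df_b), names(df_a))    # columns only in df_b

# First class of each shared column, side by side.
first_class <- function(x) class(x)[1]
type_table <- data.frame(
  column = common,
  type_a = unname(vapply(df_a[common], first_class, character(1))),
  type_b = unname(vapply(df_b[common], first_class, character(1))),
  stringsAsFactors = FALSE
)
type_table$mismatch <- type_table$type_a != type_table$type_b
type_table
```

Here `AGE` is flagged because `c(45, 52)` is stored as numeric (double) while `c(45L, 52L)` is stored as integer.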
Use compare_observations() for row-by-row value
comparison on common columns:
df1 <- data.frame(
ID = c(1, 2, 3),
SCORE = c(80, 90, 70),
stringsAsFactors = FALSE
)
df2 <- data.frame(
ID = c(1, 2, 3),
SCORE = c(80, 95, 70),
stringsAsFactors = FALSE
)
compare_observations(df1, df2)
## $discrepancies
## ID SCORE
## 0 1
##
## $details
## $details$SCORE
## Row Value_in_df1 Value_in_df2
## 1 2 90 95
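To make the row-level logic concrete, the following base R sketch locates differing rows in a numeric column and shows how a numeric tolerance could suppress small differences. The helper `numeric_diff_rows` and its "strictly greater than tolerance" rule are assumptions made for this example, not the package's documented behavior.

```r
# Illustrative only: positional row-by-row comparison of one numeric column.
df1 <- data.frame(ID = c(1, 2, 3), SCORE = c(80, 90, 70))
df2 <- data.frame(ID = c(1, 2, 3), SCORE = c(80, 95, 70))

# Hypothetical helper: rows whose absolute difference exceeds the tolerance.
numeric_diff_rows <- function(x, y, tolerance = 0) {
  which(abs(x - y) > tolerance)
}

rows <- numeric_diff_rows(df1$SCORE, df2$SCORE)
score_details <- data.frame(
  Row = rows,
  Value_in_df1 = df1$SCORE[rows],
  Value_in_df2 = df2$SCORE[rows]
)
score_details

# With a tolerance of 5, the 90 vs 95 difference is no longer reported.
rows_with_tolerance <- numeric_diff_rows(df1$SCORE, df2$SCORE, tolerance = 5)
length(rows_with_tolerance)
```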
Remove duplicates and standardize text case before comparing:
messy <- data.frame(
NAME = c("Alice", "alice", "Bob", "Bob"),
SCORE = c(100, 100, 85, 85),
stringsAsFactors = FALSE
)
clean_dataset(messy, remove_duplicates = TRUE, convert_to_case = "upper")
## NAME SCORE
## 1 ALICE 100
## 2 ALICE 100
## 3 BOB 85
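Note that the output above still contains two `ALICE` rows. That is what one would expect if exact duplicates are removed before the case conversion, since "Alice" and "alice" are distinct strings at that point. The base R sketch below illustrates how the order of the two steps changes the result; it is an illustration of the effect, not a description of clean_dataset() internals.

```r
# Illustrative only: the order of deduplication and case conversion matters.
messy <- data.frame(
  NAME = c("Alice", "alice", "Bob", "Bob"),
  SCORE = c(100, 100, 85, 85),
  stringsAsFactors = FALSE
)

# Option 1: drop exact duplicates first, then upper-case.
dedup_then_upper <- unique(messy)
dedup_then_upper$NAME <- toupper(dedup_then_upper$NAME)

# Option 2: upper-case first, then drop duplicates.
upper_then_dedup <- messy
upper_then_dedup$NAME <- toupper(upper_then_dedup$NAME)
upper_then_dedup <- unique(upper_then_dedup)

nrow(dedup_then_upper)  # 3 rows: ALICE, ALICE, BOB
nrow(upper_then_dedup)  # 2 rows: ALICE, BOB
```

If case-insensitive deduplication is what you need, applying a second deduplication pass to the cleaned result is one way to get it.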
Prepare two datasets identically before comparison:
df_unsorted1 <- data.frame(
REGION = c("West", "East", "North"),
SALES = c(150, 200, 180)
)
df_unsorted2 <- data.frame(
REGION = c("East", "North", "West"),
SALES = c(210, 185, 160)
)
prepped <- prepare_datasets(df_unsorted1, df_unsorted2, sort_columns = "REGION")
prepped$df1
## REGION SALES
## 1 East 200
## 2 North 180
## 3 West 150
## REGION SALES
## 1 East 210
## 2 North 185
## 3 West 160
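Sorting both inputs by the same key is what makes a positional comparison meaningful. As a rough base R equivalent of the sorting step, assuming a helper named `sort_by` that exists only in this example:

```r
# Illustrative only: sort two data frames by the same columns before comparing.
df_unsorted1 <- data.frame(
  REGION = c("West", "East", "North"),
  SALES = c(150, 200, 180),
  stringsAsFactors = FALSE
)
df_unsorted2 <- data.frame(
  REGION = c("East", "North", "West"),
  SALES = c(210, 185, 160),
  stringsAsFactors = FALSE
)

# Hypothetical helper: order rows by one or more columns and reset row names.
sort_by <- function(df, cols) {
  sorted <- df[do.call(order, unname(as.list(df[cols]))), , drop = FALSE]
  rownames(sorted) <- NULL
  sorted
}

sorted1 <- sort_by(df_unsorted1, "REGION")
sorted2 <- sort_by(df_unsorted2, "REGION")

# After sorting, row i refers to the same region in both data frames.
identical(sorted1$REGION, sorted2$REGION)
```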
Compare datasets within specific subgroups, which is useful for multi-site or multi-arm studies:
site_data_v1 <- data.frame(
SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
SUBJID = c("S01", "S02", "S03", "S04"),
AGE = c(45, 52, 38, 61)
)
site_data_v2 <- data.frame(
SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
SUBJID = c("S01", "S02", "S03", "S04"),
AGE = c(45, 53, 38, 62)
)
by_site <- compare_by_group(site_data_v1, site_data_v2, group_vars = "SITEID")
names(by_site)
## [1] "SITE01" "SITE02"
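Conceptually, a grouped comparison splits both datasets on the grouping variable and compares the matching pieces. The base R sketch below shows that split-and-compare pattern for the `AGE` column. It assumes both versions contain the same groups with rows in the same order, and it is not the package's implementation.

```r
# Illustrative only: split both versions by site, then compare group by group.
site_data_v1 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE = c(45, 52, 38, 61),
  stringsAsFactors = FALSE
)
site_data_v2 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE = c(45, 53, 38, 62),
  stringsAsFactors = FALSE
)

split_v1 <- split(site_data_v1, site_data_v1$SITEID)
split_v2 <- split(site_data_v2, site_data_v2$SITEID)

# Number of AGE differences within each site.
age_diffs_by_site <- vapply(
  names(split_v1),
  function(site) sum(split_v1[[site]]$AGE != split_v2[[site]]$AGE),
  integer(1)
)
age_diffs_by_site
```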
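CDISC (Clinical Data Interchange Standards Consortium) provides standardized formats for regulatory submissions, including SDTM (Study Data Tabulation Model) for tabulated study data and ADaM (Analysis Data Model) for analysis datasets.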
CDISC validation ensures that datasets meet industry standards and regulatory requirements. For official CDISC standards documentation, see https://www.cdisc.org/standards.
clinCompare auto-detects the CDISC domain of a dataset using column matching, ADaM indicator columns, and filename hints:
dm_data <- data.frame(
STUDYID = rep("STUDY01", 3),
USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
AGE = c(45, 62, 51),
SEX = c("M", "F", "M"),
RACE = c("WHITE", "BLACK", "ASIAN"),
ARMCD = c("TRT", "PBO", "TRT"),
ARM = c("Treatment", "Placebo", "Treatment"),
stringsAsFactors = FALSE
)
detect_cdisc_domain(dm_data)
## Warning: Ambiguous domain detection: 'DM' (58%) vs 'ADSL' (56%). Specify
## `domain` and `standard` explicitly for reliable results.
## $standard
## [1] "SDTM"
##
## $domain
## [1] "DM"
##
## $confidence
## [1] 0.5798462
##
## $message
## [1] "SDTM domain 'DM' detected with 58% confidence (38% required vars present, 100% of columns explained)"
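The column-matching idea can be pictured as scoring each candidate domain by how many of its characteristic columns appear in the dataset. The sketch below is a deliberately simplified illustration: the signature lists are abbreviated and hypothetical (they are not clinCompare's metadata or a complete CDISC definition), and the scores it produces will not match the confidence values printed above.

```r
# Illustrative only: score candidate domains by the share of signature columns present.
# The signature lists below are hypothetical and abbreviated for the example.
signature_columns <- list(
  DM   = c("STUDYID", "DOMAIN", "USUBJID", "SUBJID", "SEX", "ARMCD", "ARM"),
  ADSL = c("STUDYID", "USUBJID", "SUBJID", "SITEID", "TRT01P", "SAFFL", "AGE", "SEX")
)

# Fraction of a domain's signature columns found in the dataset.
score_domain <- function(signature, cols) mean(signature %in% cols)

dm_cols <- c("STUDYID", "USUBJID", "AGE", "SEX", "RACE", "ARMCD", "ARM")
scores <- vapply(signature_columns, score_domain, numeric(1), cols = dm_cols)
best_domain <- names(scores)[which.max(scores)]

scores
best_domain
```

When two candidates score closely, as in the warning above, passing `domain` and `standard` explicitly avoids relying on the guess.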
cdisc_compare() is the flagship function. It compares
two datasets, auto-detects the CDISC domain and key variables, performs
key-based row matching, and validates against CDISC standards – all in
one call.
dm_v1 <- data.frame(
STUDYID = rep("STUDY01", 3),
USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
AGE = c(45, 62, 51),
SEX = c("M", "F", "M"),
RACE = c("WHITE", "BLACK", "ASIAN"),
ARMCD = c("TRT", "PBO", "TRT"),
ARM = c("Treatment", "Placebo", "Treatment"),
RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17"),
stringsAsFactors = FALSE
)
dm_v2 <- data.frame(
STUDYID = rep("STUDY01", 3),
USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
AGE = c(45, 62, 52),
SEX = c("M", "F", "M"),
RACE = c("WHITE", "BLACK", "ASIAN"),
ARMCD = c("TRT", "PBO", "TRT"),
ARM = c("Treatment", "Placebo", "Treatment"),
RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17"),
stringsAsFactors = FALSE
)
cdisc_result <- cdisc_compare(dm_v1, dm_v2, domain = "DM", standard = "SDTM")
## ID variables auto-detected for SDTM DM: STUDYID, USUBJID
##
## ==================================================
## clinCompare: CDISC Comparison Results
## ==================================================
##
## Domain: DM (SDTM)
## Base dataset: 3 rows x 8 columns
## Compare dataset: 3 rows x 8 columns
## Matching: key-based (STUDYID, USUBJID)
##
## Differences: 0 attribute, 1 value
##
## --------------------------------------------------
## Value Comparison
## --------------------------------------------------
## 1 difference(s) found in 1 of 6 column(s)
## 1 of 3 row(s) affected (33.3%)
##
## Per-Column Summary:
## Column Type Differences Largest Diff
## ------------------------------------------------------------
## AGE numeric 1 1
##
## Differences in 'AGE' (showing 1 of 1):
## STUDYID USUBJID Base Compare Diff
## STUDY01 SUBJ03 51 52 -1
##
## CDISC Compliance: FAIL (14 errors, 8 warnings)
## ==================================================
##
## Try next:
## get_all_differences(result) : extract all diffs as a data frame
## export_report(result, "report.html") : save as HTML report
## export_report(result, "report.xlsx") : save as Excel workbook
## print_cdisc_validation(result$cdisc_validation_df1) : base dataset issues
## print_cdisc_validation(result$cdisc_validation_df2) : compare dataset issues
## generate_cdisc_report(result) : full CDISC compliance report
## cdisc_compare(..., tolerance = 1) : largest numeric diff is 1
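Key-based matching matters because two extracts of the same domain are not guaranteed to list subjects in the same order. The base R sketch below contrasts a positional comparison with a join on `STUDYID` and `USUBJID` using `merge()`. It illustrates the concept with made-up, reordered data and is not how cdisc_compare() is implemented.

```r
# Illustrative only: positional vs key-based comparison when row order differs.
base_dm <- data.frame(
  STUDYID = "STUDY01",
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE = c(45, 62, 51),
  stringsAsFactors = FALSE
)
compare_dm <- data.frame(
  STUDYID = "STUDY01",
  USUBJID = c("SUBJ03", "SUBJ01", "SUBJ02"),  # same subjects, different order
  AGE = c(52, 45, 62),
  stringsAsFactors = FALSE
)

# Positional comparison reports spurious differences on every row.
positional_diffs <- sum(base_dm$AGE != compare_dm$AGE)

# Key-based comparison: align rows on the ID variables first.
keys <- c("STUDYID", "USUBJID")
matched <- merge(base_dm, compare_dm, by = keys, suffixes = c("_base", "_compare"))
changed <- matched[matched$AGE_base != matched$AGE_compare, ]

positional_diffs
changed
```

Only SUBJ03 has a genuine change (51 vs 52); the other positional mismatches are artifacts of row order.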
Use validate_cdisc() to check a dataset against CDISC
standards without comparing it to another dataset:
get_all_differences() returns every value-level
difference as a single long-format data frame, making it easy to filter,
count, or export:
## STUDYID USUBJID Variable Base Compare Diff PctDiff
## 1 STUDY01 SUBJ03 AGE 51 52 -1 1.960784
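The `Diff` and `PctDiff` values in the row above are consistent with `Diff = Base - Compare` and `PctDiff = |Diff| / |Base| * 100`. These formulas are inferred from this single output row rather than taken from the package source, so treat them as an assumption. The arithmetic can be checked directly:

```r
# Illustrative only: reproduce the Diff and PctDiff values from the row above.
# The formulas are inferred from the printed output, not from the package source.
base_value    <- 51
compare_value <- 52

diff_value <- base_value - compare_value             # -1
pct_diff   <- abs(diff_value) / abs(base_value) * 100

diff_value
round(pct_diff, 6)
```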
export_report() auto-detects the output format from the
file extension:
## Report written to: /var/folders/40/7745jn2j13q9cnp73bsd5_dc0000gn/T//RtmppUVSRG/dm_report.html
## Report written to: /var/folders/40/7745jn2j13q9cnp73bsd5_dc0000gn/T//RtmppUVSRG/dm_report.txt
Excel export requires the openxlsx package:
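Before requesting an `.xlsx` report, it can be worth checking that openxlsx is installed. The sketch below also shows one way a format could be derived from a file extension using `tools::file_ext()`. The helper `detect_format` and its format labels are hypothetical and are not part of clinCompare.

```r
# Illustrative only: map a file extension to a report format and guard Excel output.
# `detect_format` is a hypothetical helper, not a clinCompare function.
detect_format <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    html = "html",
    txt  = "text",
    xlsx = "excel",
    stop("Unsupported report extension: ", ext)
  )
}

detect_format("dm_report.html")
detect_format("dm_report.txt")

# TRUE if openxlsx is installed; does not attach the package.
has_openxlsx <- requireNamespace("openxlsx", quietly = TRUE)
if (!has_openxlsx) {
  message("Install openxlsx with install.packages(\"openxlsx\") to enable Excel export.")
}
```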
compare_submission() scans two directories, matches
files by name, and runs cdisc_compare() on every matched
pair. Domain, standard, and key variables are all auto-detected per
file.
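Matching files by name across two directories can be sketched in base R with `list.files()` and `intersect()`. The example below builds two throwaway directories with empty placeholder files so that it runs anywhere. The directory layout and file names are invented for illustration, and the sketch covers only the file-matching step, not the comparison itself.

```r
# Illustrative only: find files that exist under the same name in two directories.
base_dir    <- tempfile("base_")
compare_dir <- tempfile("compare_")
dir.create(base_dir)
dir.create(compare_dir)

# Empty placeholder files standing in for real datasets.
invisible(file.create(file.path(base_dir, c("dm.csv", "ae.csv"))))
invisible(file.create(file.path(compare_dir, c("dm.csv", "vs.csv"))))

base_files    <- list.files(base_dir)
compare_files <- list.files(compare_dir)

matched_files <- intersect(base_files, compare_files)  # present in both
only_in_base  <- setdiff(base_files, compare_files)    # missing from compare
only_in_comp  <- setdiff(compare_files, base_files)    # new in compare

matched_files
only_in_base
only_in_comp
```

Reporting the unmatched files alongside the matched pairs helps catch datasets that were dropped or added between submissions.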
clinCompare ships with hand-curated metadata for 51 SDTM domains (IG 3.4, with 3.3 support) and 14 ADaM datasets (IG 1.3, with 1.2/1.1 provenance tracking).
SDTM domains: AE, AG, BE, BS, CE, CM, CO, CP, DA, DD, DM, DS, DV, EC, EG, EX, FA, GF, HO, IE, IS, LB, MB, MH, MI, ML, MS, PC, PE, PP, PR, QS, RELREC, RS, SC, SE, SM, SS, SU, SUPPQUAL, SV, TA, TD, TE, TI, TM, TR, TS, TU, TV, VS.
ADaM datasets: ADAE, ADCM, ADEG, ADEFF, ADEX, ADLB, ADMH, ADPC, ADPP, ADRS, ADSL, ADTR, ADTTE, ADVS.
Disclaimer: clinCompare is a quality-assurance and exploratory analysis tool. It is not a substitute for official CDISC compliance validation software (e.g., Pinnacle 21). For regulatory submissions, always cross-reference with your organization’s validated tools.
clinCompare provides a complete workflow for dataset comparison in
clinical trials: compare any two data frames with
compare_datasets(), add CDISC validation with
cdisc_compare(), batch process entire submissions with
compare_submission(), and export results to HTML, text, or
Excel with export_report().
For more information and additional examples, visit the GitHub repository.