--- title: "Summarise missing data" output: html_document: pandoc_args: [ "--number-offset=1,0" ] number_sections: yes toc: yes vignette: > %\VignetteIndexEntry{missing_data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(CDMConnector) CDMConnector::requireEunomia() ``` In this vignette, we explore how *OmopSketch* functions can serve as a valuable tool for summarising missingness in databases containing electronic health records mapped to the OMOP Common Data Model. ## Create a mock cdm To illustrate the package’s functionality, we begin by loading the required packages and connecting to a test CDM using the Eunomia dataset. ```{r, warning=FALSE} library(dplyr) library(DBI) library(duckdb) library(OmopSketch) # Connect to Eunomia database con <- DBI::dbConnect(duckdb::duckdb(), CDMConnector::eunomiaDir()) cdm <- CDMConnector::cdmFromCon( con = con, cdmSchema = "main", writeSchema = "main", cdmName = "Eunomia" ) cdm ``` ## Summary of missing data A common first step in data quality assessment is to identify missing values. In this contest, missing data are defined as either NA values or concept IDs equal to 0. You can use the `summariseMissingData()` function to summarise missingness across the clinical tables in the CDM: ```{r, warning=FALSE} result_missingData <- summariseMissingData(cdm, omopTableName = "observation_period") result_missingData |> glimpse() ``` ### Summarise by OMOP CDM table You can choose to summarise missing data for specific OMOP CDM tables using the argument `omopTableName`. ```{r, warning=FALSE, eval = FALSE} result_missingData <- summariseMissingData(cdm = cdm, omopTableName = c("observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence","device_exposure", "measurement", "observation", "death")) ``` ### Summarise by sex You can choose to summarise missing data by sex by setting the argument `sex` to `TRUE`. ```{r, warning=FALSE, eval = FALSE} result_missingData <- summariseMissingData(cdm = cdm, omopTableName = c("observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence","device_exposure", "measurement", "observation", "death"), sex = TRUE) ``` ### Summarise by age group You can choose to summarise missing data by age group by creating a list defining the age groups you want to use. ```{r, warning=FALSE, eval = FALSE} result_missingData <- summariseMissingData(cdm = cdm, omopTableName = c("observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence","device_exposure", "measurement", "observation", "death"), ageGroup = list(c(0, 17), c(18, 64), c(65, 150))) ``` ### Summarise by date and/or time interval You can also summarise missing data within a specific date range or across defined time intervals using the `dateRange` and `interval` arguments. The `interval` argument supports "overall" (no time stratification), "years", "quarters", or "months". ```{r, warning=FALSE, eval = FALSE} result_missingData <- summariseMissingData(cdm = cdm, omopTableName = c("observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence","device_exposure", "measurement", "observation", "death"), interval = "years", dateRange = as.Date(c("2012-01-01", "2019-01-01"))) ``` ### Summarise by column You can also choose to summarise missing data for specific columns in the OMOP CDM tables using the argument `col`. ```{r, warning=FALSE, eval = FALSE} result_missingData <- summariseMissingData(cdm = cdm, omopTableName = c("observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence","device_exposure", "measurement", "observation", "death"), col = c("observation_period_start_date", "observation_period_end_date")) ``` ### Summarise in sample of OMOP CDM Finally, you can summarise missing data on a sample of records from the OMOP CDM tables by specifying the desired number of records with the `sample` argument. ```{r, warning=FALSE, eval = FALSE} result_missingData <- summariseMissingData(cdm = cdm, omopTableName = c("observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence","device_exposure", "measurement", "observation", "death"), sample = 1000) ``` ### Visualise summary results You can present these results using the function `tableMissingData()`. ```{r, warning = FALSE} result_missingData <- summariseMissingData(cdm, omopTableName = c("condition_occurrence", "drug_exposure", "procedure_occurrence"), sex = TRUE, ageGroup = list(c(0, 17), c(18, 64), c(65, 150)), interval = "years", dateRange = as.Date(c("2012-01-01", "2019-01-01")), sample = 100000 ) result_missingData |> tableMissingData() ``` This table can either be of type [gt](https://gt.rstudio.com/) (default) or [flextable](https://davidgohel.github.io/flextable/). ```{r, warning = FALSE} result_missingData |> tableMissingData(type = "flextable") ``` Finally, disconnect from the cdm ```{r, warning=FALSE} PatientProfiles::mockDisconnect(cdm = cdm) ```