metacore


Programming for clinical trial data analysis tends to be highly standardized. With data standards such as CDISC, expectations are clearly defined. Within these programming activities, there’s ample room for the use of metadata. Metadata can be used for several purposes, such as applying dataset attributes, establishing sort sequences, and working with controlled terminology. Despite CDISC standards, organizations tend to have their own means of storing metadata, whether in Excel spreadsheets, databases, or other formats.

The purpose of metacore is to establish a common foundation for the use of metadata within an R session. This is done by creating an R object that holds the necessary data in a standardized, immutable structure (using R6), making it easy to extract information when needed. Users can read in their metadata from their various sources. To make this easy, we’ve provided some helper functions, including readers that can read directly from Define.xml 2.0. With a common and consistent object in memory, other packages that support these workflows have a shared foundation on which to build tools that leverage metadata. This reduces the need to hold different data structures containing metadata and instead allows programs to pull this information from a centralized source.

Installation

You can install the current development version of metacore from GitHub with:

devtools::install_github("atorus-research/metacore")

Structure

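A metacore object is made up of seven tables, which are connected by a series of identifying columns. The goal of these tables is to normalize the information as much as possible while keeping like information together. Each table has a basic theme to make it easier to remember. They are as follows:

- ds_spec: dataset-level information
- ds_vars: the bridge between dataset-level and variable-level information
- var_spec: variable-level information
- value_spec: value-level information
- derivations: derivation information
- codelist: code lists, permitted value lists, and external libraries
- supp: information needed to create supplemental tables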

Here is a schema of how all this fits together:

ds_spec

This table covers the basic information about each dataset. There is only a single row per dataset, with the following information:

ds_vars

This table contains the information that bridges between purely dataset level and purely variable level. There is one row per dataset per variable:

var_spec

This table contains the purely variable-level information. The goal is to have a single row per variable that is common across all datasets, which helps ensure variables follow the CDISC standard. This isn’t always possible, so if the information for a given variable differs across datasets, the variable is recorded as dataset.variable in the variable column.
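The dataset.variable convention can be illustrated with a small base R sketch. This is not metacore's API: the table below is a plain data frame built by hand, the label column and the lookup_label helper are hypothetical, and the variable names are made up for illustration. It only shows how a lookup might prefer a dataset-specific row and fall back to the shared one.

```r
# Hypothetical var_spec-style table (hand-built; not produced by metacore).
var_spec <- data.frame(
  variable = c("USUBJID", "AVAL", "ADLB.AVAL"),
  label    = c("Unique Subject Identifier", "Analysis Value", "Lab Analysis Value"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: prefer the "dataset.variable" row when one exists,
# otherwise fall back to the row shared by all datasets.
lookup_label <- function(spec, dataset, variable) {
  specific <- paste(dataset, variable, sep = ".")
  hit <- spec$label[spec$variable == specific]
  if (length(hit) == 0) {
    hit <- spec$label[spec$variable == variable]
  }
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

adlb_label <- lookup_label(var_spec, "ADLB", "AVAL")  # dataset-specific row
advs_label <- lookup_label(var_spec, "ADVS", "AVAL")  # shared row
```

Here adlb_label resolves to the ADLB-specific entry, while advs_label falls back to the common AVAL entry.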

value_spec

This table contains the information at the value level. There will be at least one row per dataset/variable combination. There is more than one row per combination if the combination has values with differing metadata, for instance LBORRES, whose data type can differ depending on the value. The information contained is as follows:
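As a rough illustration of the row rule above, the following base R sketch builds a hand-made value-level table in which LBORRES has two rows with different types. The column names (including where) and the condition strings are assumptions made for this example, not a statement of metacore's actual columns.

```r
# Hypothetical value_spec-style table; column names are assumptions.
value_spec <- data.frame(
  dataset  = c("LB", "LB", "LB"),
  variable = c("LBTESTCD", "LBORRES", "LBORRES"),
  type     = c("text", "float", "text"),
  where    = c(NA, "LBTESTCD == 'GLUC'", "LBTESTCD == 'COLOR'"),
  stringsAsFactors = FALSE
)

# Group the types by dataset/variable combination.
key <- paste(value_spec$dataset, value_spec$variable, sep = ".")
types_by_var <- split(value_spec$type, key)

# Number of value-level rows per combination (always at least one).
n_rows <- vapply(types_by_var, length, integer(1))

# Combinations whose values carry differing metadata have more than one row.
multi_row <- names(n_rows)[n_rows > 1]
```

In this sketch only LB.LBORRES ends up in multi_row, because it is the only combination whose metadata varies by value.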

derivations

This table has all the derivation information, with one row per derivation ID and the following information:

codelist

This table contains the code lists, permitted value lists, and external libraries nested within a tibble. There is only a single row per list/library, with the following information:
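The idea of nesting each list inside a single row can be sketched with a list column. The example below uses a base R data frame rather than a tibble to stay dependency-free, and the column names (code_id, name, codes, code, decode) are assumptions for illustration only.

```r
# Hypothetical codelist-style table: one row per list.
codelist <- data.frame(
  code_id = c("YN", "SEX"),
  name    = c("No Yes Response", "Sex"),
  stringsAsFactors = FALSE
)

# Each element of the list column holds the full set of codes for that row.
codelist$codes <- list(
  data.frame(code = c("N", "Y"), decode = c("No", "Yes"),
             stringsAsFactors = FALSE),
  data.frame(code = c("F", "M"), decode = c("Female", "Male"),
             stringsAsFactors = FALSE)
)

# Pull out one nested list and decode a value.
sex_codes <- codelist$codes[[which(codelist$code_id == "SEX")]]
decoded   <- sex_codes$decode[sex_codes$code == "F"]
```

Keeping the codes nested means the outer table stays at one row per list, however many terms each list contains.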

supp

This table contains the information needed to create supplemental tables. If you want to add a variable that will go into a supplemental qualifier, you can create it as normal (i.e. label information going into the var_spec table, and derivation and origin going into the value_spec table), but you need to flag it as supplemental in the ds_vars table and add a row to the supp table. There is only a single row per dataset/variable, with the following information:
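The two-step rule above (flag the variable in ds_vars, then add a matching supp row) lends itself to a simple consistency check. The sketch below uses hand-built data frames; the supp_flag and idvar column names and the example variables are assumptions, not taken from metacore itself.

```r
# Hypothetical ds_vars-style table with a supplemental flag (assumed column).
ds_vars <- data.frame(
  dataset   = c("AE", "AE", "AE"),
  variable  = c("USUBJID", "AETERM", "AETRTEM"),
  supp_flag = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical supp-style table: one row per dataset/variable.
supp <- data.frame(
  dataset  = "AE",
  variable = "AETRTEM",
  idvar    = "AESEQ",
  stringsAsFactors = FALSE
)

# Every variable flagged as supplemental should have a row in supp.
flagged      <- ds_vars[ds_vars$supp_flag, c("dataset", "variable")]
flagged_keys <- paste(flagged$dataset, flagged$variable, sep = ".")
supp_keys    <- paste(supp$dataset, supp$variable, sep = ".")
missing_supp <- setdiff(flagged_keys, supp_keys)
```

An empty missing_supp means the flag and the supp table agree; any entries would point to flagged variables that still need a supp row.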

To get more information about the metacore objects and how to build a specification reader, please see our vignettes.

Future Development

This is an alpha release of this package, so if you have ideas for future improvements, please add them to the issue log.