split and unite functions

Split and unite are complementary sets of functions for manipulating dataframes in R. They are designed for summarised_result objects (see the R package omopgenerics), but they can also be used with dataframes of other classes.

summarised_result

First, let’s load the relevant libraries and generate a mock summarised_result object to use in the following examples.

library(visOmopResults)
library(dplyr)
mock_sr <- mockSummarisedResult()
mock_sr |> glimpse()
#> Rows: 126
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level      <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ strata_name      <chr> "overall", "age_group &&& sex", "age_group &&& sex", …
#> $ strata_level     <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& …
#> $ variable_name    <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name    <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value   <chr> "4919829", "9611305", "6176201", "4600876", "1033323"…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…

A summarised_result contains three types of name-level paired columns, which are targeted by the set of unite and split functions: the group columns, which typically contain information about cohorts; the strata columns, which hold the stratification applied to each group; and the additional columns, which include further information not covered by group and strata.

Split functions

The idea of the split functions is to pivot the “name” column (e.g. group_name): each value in that column becomes a new column in the dataframe, whose values are taken from the corresponding “level” column (e.g. group_level).
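As a rough illustration of this pivot, here is a minimal base R sketch on a toy table. The helper split_simple and the toy data are hypothetical and are not part of visOmopResults; the sketch ignores the “&&&” keyword and appends new columns at the end rather than in place.

```r
# Toy table with one name-level pair (hypothetical data, not from the package).
toy <- data.frame(
  group_name     = c("cohort_name", "cohort_name"),
  group_level    = c("cohort1", "cohort2"),
  estimate_value = c("10", "20"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: each distinct value of the "name" column becomes a new
# column, filled with the matching value of the "level" column.
split_simple <- function(x, name, level) {
  for (nm in unique(x[[name]])) {
    x[[nm]] <- ifelse(x[[name]] == nm, x[[level]], NA_character_)
  }
  # Drop the original name-level pair.
  x[setdiff(names(x), c(name, level))]
}

toy_split <- split_simple(toy, "group_name", "group_level")
names(toy_split)
#> [1] "estimate_value" "cohort_name"
```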

splitGroup(), splitStrata(), and splitAdditional()

For instance, the splitGroup function will target the group_name-group_level columns as seen below.

mock_sr |> splitGroup() |> glimpse()
#> Rows: 126
#> Columns: 12
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ cohort_name      <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ strata_name      <chr> "overall", "age_group &&& sex", "age_group &&& sex", …
#> $ strata_level     <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& …
#> $ variable_name    <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name    <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value   <chr> "4919829", "9611305", "6176201", "4600876", "1033323"…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…

Similarly to splitGroup, the function splitStrata will split the strata_name and strata_level columns, while splitAdditional will split the additional name-level pair. Finally, the function splitAll will split group, strata, and additional at once. Note that after using splitStrata on our summarised_result object, we no longer have a strata_name-strata_level pair; instead, we have two new columns corresponding to the stratifications, age_group and sex.

mock_sr |> splitStrata() |> glimpse()
#> Rows: 126
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level      <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ age_group        <chr> "overall", "<40", ">=40", "<40", ">=40", "overall", "…
#> $ sex              <chr> "overall", "Male", "Male", "Female", "Female", "Male"…
#> $ variable_name    <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name    <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value   <chr> "4919829", "9611305", "6176201", "4600876", "1033323"…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
mock_sr |> splitAdditional() |> glimpse()
#> Rows: 126
#> Columns: 11
#> $ result_id      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ cdm_name       <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock",…
#> $ group_name     <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_na…
#> $ group_level    <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1", …
#> $ strata_name    <chr> "overall", "age_group &&& sex", "age_group &&& sex", "a…
#> $ strata_level   <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& Fe…
#> $ variable_name  <chr> "number subjects", "number subjects", "number subjects"…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ estimate_name  <chr> "count", "count", "count", "count", "count", "count", "…
#> $ estimate_type  <chr> "integer", "integer", "integer", "integer", "integer", …
#> $ estimate_value <chr> "4919829", "9611305", "6176201", "4600876", "1033323", …
mock_sr |> splitAll() |> glimpse()
#> Rows: 126
#> Columns: 10
#> $ result_id      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ cdm_name       <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock",…
#> $ cohort_name    <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1", …
#> $ age_group      <chr> "overall", "<40", ">=40", "<40", ">=40", "overall", "ov…
#> $ sex            <chr> "overall", "Male", "Male", "Female", "Female", "Male", …
#> $ variable_name  <chr> "number subjects", "number subjects", "number subjects"…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ estimate_name  <chr> "count", "count", "count", "count", "count", "count", "…
#> $ estimate_type  <chr> "integer", "integer", "integer", "integer", "integer", …
#> $ estimate_value <chr> "4919829", "9611305", "6176201", "4600876", "1033323", …

Keyword: &&&

Looking at the results above, observe that the splitting was done not only by the values in the “name” column, but also within values containing the keyword “&&&”. That is, “age_group &&& sex” was split into the age_group and sex columns, instead of generating a single column called “age_group &&& sex”.
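The effect of the keyword can be pictured with base R string splitting. This is only a sketch of the idea, assuming the separator is the literal string " &&& " as it appears in the mock data; it is not the package's implementation.

```r
name  <- "age_group &&& sex"
level <- "<40 &&& Male"

# Split both strings on the " &&& " separator
# (fixed = TRUE matches it literally rather than as a regular expression).
cols   <- strsplit(name,  " &&& ", fixed = TRUE)[[1]]
values <- strsplit(level, " &&& ", fixed = TRUE)[[1]]

# Pair each new column name with its value.
pairs <- setNames(values, cols)
pairs
#> age_group       sex 
#>     "<40"    "Male"
```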

splitNameLevel()

The function splitNameLevel provides a more tailored splitting of the dataframe. This function can take any dataframe, with no restrictions on the naming of the name-level pair columns, since these can be specified in the name and level arguments.

For instance, let’s use it on the following table:

data_to_split <- tibble(
  denominator = "general_population",
  outcome = "stroke",
  input_arguments = c("wash_out &&& previous_observation"),
  input_arguments_values = c("60 &&& 180")
)
data_to_split 
#> # A tibble: 1 × 4
#>   denominator        outcome input_arguments              input_arguments_values
#>   <chr>              <chr>   <chr>                        <chr>                 
#> 1 general_population stroke  wash_out &&& previous_obser… 60 &&& 180

data_to_split |>
  splitNameLevel(
    name = "input_arguments",
    level = "input_arguments_values"
  )
#> # A tibble: 1 × 4
#>   denominator        outcome wash_out previous_observation
#>   <chr>              <chr>   <chr>    <chr>               
#> 1 general_population stroke  60       180

In addition to the overall argument, the function splitNameLevel has the keep argument, which sets whether to keep the original name-level columns after the splitting.

Unite functions

The unite functions are the complement of the split ones. They generate name-level pair columns from targeted columns within a dataframe.
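To make the complementarity concrete, here is a small base R sketch of a round trip on a single row: column names and values are collapsed into one name-level pair and then recovered. It assumes the " &&& " separator and uses plain paste and strsplit rather than the package functions.

```r
cols   <- c("denominator_cohort_name", "outcome_cohort_name")
values <- c("general_population", "stroke")

# "Unite": collapse the column names and their values into one name-level pair.
name  <- paste(cols, collapse = " &&& ")
level <- paste(values, collapse = " &&& ")

# "Split": recover the original pieces from the pair.
cols_back   <- strsplit(name,  " &&& ", fixed = TRUE)[[1]]
values_back <- strsplit(level, " &&& ", fixed = TRUE)[[1]]

identical(cols_back, cols) && identical(values_back, values)
#> [1] TRUE
```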

uniteGroup(), uniteStrata(), and uniteAdditional()

To work with summarised_result objects, we have the uniteGroup, uniteStrata, and uniteAdditional functions, which generate the group, strata, and additional name-level columns, respectively, from a given set of columns. For instance, in the following example we want to create the group_name and group_level columns:

to_unite_group <- tibble(
  denominator_cohort_name = c("general_population", "older_than_60", "younger_than_60"),
  outcome_cohort_name = c("stroke", "stroke", "stroke")
)

to_unite_group |>
  uniteGroup(cols = c("denominator_cohort_name", "outcome_cohort_name"))
#> # A tibble: 3 × 2
#>   group_name                                      group_level                  
#>   <chr>                                           <chr>                        
#> 1 denominator_cohort_name &&& outcome_cohort_name general_population &&& stroke
#> 2 denominator_cohort_name &&& outcome_cohort_name older_than_60 &&& stroke     
#> 3 denominator_cohort_name &&& outcome_cohort_name younger_than_60 &&& stroke

Apart from the cols argument (the columns to unite), there is the ignore argument, which defaults to ignore = c(NA, "overall"). Levels listed in ignore are dropped from the output. For example, if we do not ignore any levels in this case, NA values appear in the output:

to_unite_strata <- tibble(
  age = c(NA, ">40", "<=40", NA, NA, NA, NA, NA, ">40", "<=40"),
  sex = c(NA, NA, NA, "F", "M", NA, NA, NA, "F", "M"),
  region = c(NA, NA, NA, NA, NA, "North", "South", "Center", NA, NA)
)

to_unite_strata |>
  uniteStrata(cols = c("age", "sex", "region"),
              ignore = character())
#> # A tibble: 10 × 2
#>    strata_name            strata_level        
#>    <chr>                  <chr>               
#>  1 age &&& sex &&& region NA &&& NA &&& NA    
#>  2 age &&& sex &&& region >40 &&& NA &&& NA   
#>  3 age &&& sex &&& region <=40 &&& NA &&& NA  
#>  4 age &&& sex &&& region NA &&& F &&& NA     
#>  5 age &&& sex &&& region NA &&& M &&& NA     
#>  6 age &&& sex &&& region NA &&& NA &&& North 
#>  7 age &&& sex &&& region NA &&& NA &&& South 
#>  8 age &&& sex &&& region NA &&& NA &&& Center
#>  9 age &&& sex &&& region >40 &&& F &&& NA    
#> 10 age &&& sex &&& region <=40 &&& M &&& NA

By default (ignore = c(NA, "overall")), the output contains only the names and levels of the non-NA values, and rows where all values are NA are set to “overall”.

to_unite_strata |>
  uniteStrata(cols = c("age", "sex", "region"))
#> # A tibble: 10 × 2
#>    strata_name strata_level
#>    <chr>       <chr>       
#>  1 overall     overall     
#>  2 age         >40         
#>  3 age         <=40        
#>  4 sex         F           
#>  5 sex         M           
#>  6 region      North       
#>  7 region      South       
#>  8 region      Center      
#>  9 age &&& sex >40 &&& F   
#> 10 age &&& sex <=40 &&& M
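The role of ignore can be sketched for a single row in base R. The helper unite_row below is hypothetical and only mirrors the behaviour described above; it is not the package's code.

```r
# Hypothetical helper: build one name-level pair for a single row,
# skipping the values listed in `ignore`.
unite_row <- function(row, ignore = c(NA, "overall")) {
  keep <- !(row %in% ignore)
  # If every value is ignored, fall back to "overall".
  if (!any(keep)) {
    return(c(name = "overall", level = "overall"))
  }
  c(
    name  = paste(names(row)[keep], collapse = " &&& "),
    level = paste(row[keep], collapse = " &&& ")
  )
}

row_partial <- c(age = ">40", sex = "F", region = NA_character_)
row_empty   <- c(age = NA_character_, sex = NA_character_, region = NA_character_)

unite_row(row_partial)                        # name "age &&& sex", level ">40 &&& F"
unite_row(row_empty)                          # name "overall", level "overall"
unite_row(row_partial, ignore = character())  # nothing ignored, so NA is kept as text
```

With an empty ignore, paste turns the missing value into the text "NA", which is why the earlier output shows levels such as ">40 &&& F &&& NA".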

uniteNameLevel()

Lastly, the function uniteNameLevel, like splitNameLevel, provides more flexibility in naming the name-level columns, in addition to the keep argument (FALSE by default), which sets whether to keep the targeted columns. For instance, if we repeat the previous example with keep set to TRUE, we obtain the following output:

to_unite_strata |>
  uniteNameLevel(cols = c("age", "sex", "region"),
                 name = "name",
                 level = "level",
                 keep = TRUE)
#> # A tibble: 10 × 5
#>    age   sex   region name        level     
#>    <chr> <chr> <chr>  <chr>       <chr>     
#>  1 <NA>  <NA>  <NA>   overall     overall   
#>  2 >40   <NA>  <NA>   age         >40       
#>  3 <=40  <NA>  <NA>   age         <=40      
#>  4 <NA>  F     <NA>   sex         F         
#>  5 <NA>  M     <NA>   sex         M         
#>  6 <NA>  <NA>  North  region      North     
#>  7 <NA>  <NA>  South  region      South     
#>  8 <NA>  <NA>  Center region      Center    
#>  9 >40   F     <NA>   age &&& sex >40 &&& F 
#> 10 <=40  M     <NA>   age &&& sex <=40 &&& M