Use case for nplyr

The package README uses the gapminder dataset to demonstrate nplyr’s functionality. In this case (and other similar cases) the output from nplyr’s nested operations could be obtained by unnesting and performing grouped dplyr operations.

library(nplyr)

gm_nest <- 
  gapminder::gapminder_unfiltered %>%
  tidyr::nest(country_data = -continent)

gm_nest
#> # A tibble: 6 × 2
#>   continent country_data        
#>   <fct>     <list>              
#> 1 Asia      <tibble [578 × 5]>  
#> 2 Europe    <tibble [1,302 × 5]>
#> 3 Africa    <tibble [637 × 5]>  
#> 4 Americas  <tibble [470 × 5]>  
#> 5 FSU       <tibble [139 × 5]>  
#> 6 Oceania   <tibble [187 × 5]>

# we can use nplyr to perform operations on the nested data
gm_nest %>%
  nest_filter(country_data, year == max(year)) %>%
  nest_mutate(country_data, pop_millions = pop/1000000) %>%
  slice_head(n = 1) %>%
  tidyr::unnest(country_data)
#> # A tibble: 43 × 7
#>    continent country           year lifeExp        pop gdpPercap pop_millions
#>    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
#>  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
#>  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
#>  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
#>  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
#>  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
#>  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
#>  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
#>  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
#>  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
#> 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
#> # … with 33 more rows

# in this case, we could have obtained the same result with tidyr and dplyr
gm_nest %>%
  tidyr::unnest(country_data) %>%
  group_by(continent) %>%
  filter(year == max(year)) %>%
  mutate(pop_millions = pop/1000000) %>%
  ungroup() %>%
  filter(continent == "Asia")
#> # A tibble: 43 × 7
#>    continent country           year lifeExp        pop gdpPercap pop_millions
#>    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
#>  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
#>  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
#>  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
#>  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
#>  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
#>  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
#>  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
#>  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
#>  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
#> 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
#> # … with 33 more rows

Why, then, might we need to use nplyr? Well, in other scenarios, it may be far more convenient to work with nested data frames or it may not even be possible to unnest!

A motivating example

Consider a set of surveys that an organization might use to gather market data. It is common for organizations to have separate surveys for separate purposes but gather the same baseline set of data across all surveys (for example, a respondent’s age and gender may be recorded across all surveys, but each survey will have a different set of questions). Let’s use two fake surveys with the below questions for this example:

Survey 1: Job
  1. How old are you? (multiple choice)
  2. What city do you live in? (multiple choice)
  3. What field do you work in? (multiple choice)
  4. Overall, how satisfied are you with your job? (multiple choice)
  5. What is your annual salary? (numeric entry)
Survey 2: Personal Life
  1. How old are you? (multiple choice)
  2. What city do you live in (multiple choice)
  3. What field do you work in? (multiple choice)
  4. Overall, how satisfied are you with your personal life (multiple choice)
  5. Please provide additional detail (text entry)

In this scenario, both surveys are collecting demographic information — age, location, and industry — but differ in the questions. A convenient way to get the response files into the environment would be to use purrr::map() to read in each file to a nested data frame.

path <- "https://raw.githubusercontent.com/markjrieke/nplyr/main/data-raw/"

surveys <- 
  tibble::tibble(survey_file = c("job_survey", "personal_survey")) %>%
  mutate(survey_data = purrr::map(survey_file, ~readr::read_csv(paste0(path, .x, ".csv"))))

surveys
#> # A tibble: 2 × 2
#>   survey_file     survey_data         
#>   <chr>           <list>              
#> 1 job_survey      <spc_tbl_ [500 × 6]>
#> 2 personal_survey <spc_tbl_ [750 × 6]>

tidyr::unnest() can usually handle idiosyncrasies in layout when unnesting but in this case unnesting throws an error!

surveys %>%
  tidyr::unnest(survey_data)
#> Error in `list_unchop()`:
#> ! Can't combine `x[[1]]$Q5` <double> and `x[[2]]$Q5` <character>.

This is because the surveys share column names but not necessarily column types! In this case, both data frames contain a column named “Q5”, but in job_survey it’s a double and in personal_survey it’s a character.

surveys %>%
  slice(1) %>%
  tidyr::unnest(survey_data) %>%
  glimpse()
#> Rows: 500
#> Columns: 7
#> $ survey_file <chr> "job_survey", "job_survey", "job_survey", "job_survey", "j…
#> $ survey_name <chr> "job", "job", "job", "job", "job", "job", "job", "job", "j…
#> $ Q1          <dbl> 100, 81, 51, 81, 80, 32, 65, 57, 43, 94, 25, 83, 61, 66, 8…
#> $ Q2          <chr> "Austin", "San Antonio", "Austin", "Austin", "Dallas", "Fo…
#> $ Q3          <chr> "Consulting", "Consulting", "Consulting", "Technology", "C…
#> $ Q4          <chr> "Somewhat dissatisfied", "Neither satisfied nor dissatisfi…
#> $ Q5          <dbl> 163, 48, 190, 25, 143, 233, 43, 243, 158, 235, 245, 195, 2…

surveys %>%
  slice(2) %>%
  tidyr::unnest(survey_data) %>%
  glimpse()
#> Rows: 750
#> Columns: 7
#> $ survey_file <chr> "personal_survey", "personal_survey", "personal_survey", "…
#> $ survey_name <chr> "personal", "personal", "personal", "personal", "personal"…
#> $ Q1          <dbl> 91, 32, 40, 23, 88, 69, 96, 40, 57, 40, 39, 70, 29, 38, 57…
#> $ Q2          <chr> "Austin", "San Antonio", "San Antonio", "Austin", "Dallas"…
#> $ Q3          <chr> "Energy", "Healthcare", "Consulting", "Consulting", "Techn…
#> $ Q4          <chr> "Neither satisfied nor dissatisfied", "Extremely satisfied…
#> $ Q5          <chr> "Blandit eros! A, ligula facilisis imperdiet! Interdum pla…

We could potentially get around this issue with unnesting by reading in all columns as characters via readr::read_csv(x, col_types = cols(.default = "c")), but this presents its own challenges. Q5 would still be better represented as a double in job_survey and from the survey question text Q4 has similar, but distinctly different, meanings across the survey files.

This is where nplyr comes into play! Rather than malign the data types or create separate objects for each survey file, we can use nplyr to perform operations directly on the nested data frames.

surveys <- 
  surveys %>%
  nest_mutate(survey_data,
              age_group = if_else(Q1 < 65, "Adult", "Retirement Age")) %>%
  nest_group_by(survey_data, Q3) %>%
  nest_add_count(survey_data, 
                 name = "n_respondents_in_industry") %>%
  nest_mutate(survey_data, 
              median_industry_age = median(Q1)) %>%
  nest_ungroup(survey_data)

surveys %>%
  slice(1) %>%
  tidyr::unnest(survey_data)
#> # A tibble: 500 × 10
#>    survey_file survey_name    Q1 Q2    Q3    Q4       Q5 age_g…¹ n_res…² media…³
#>    <chr>       <chr>       <dbl> <chr> <chr> <chr> <dbl> <chr>     <int>   <dbl>
#>  1 job_survey  job           100 Aust… Cons… Some…   163 Retire…     107    57  
#>  2 job_survey  job            81 San … Cons… Neit…    48 Retire…     107    57  
#>  3 job_survey  job            51 Aust… Cons… Extr…   190 Adult       107    57  
#>  4 job_survey  job            81 Aust… Tech… Extr…    25 Retire…     108    61.5
#>  5 job_survey  job            80 Dall… Cons… Extr…   143 Retire…     107    57  
#>  6 job_survey  job            32 Fort… Ener… Some…   233 Adult        99    59  
#>  7 job_survey  job            65 Dall… Cons… Some…    43 Retire…     107    57  
#>  8 job_survey  job            57 Hous… Heal… Some…   243 Adult        75    65  
#>  9 job_survey  job            43 Dall… Gove… Neit…   158 Adult       111    57  
#> 10 job_survey  job            94 Fort… Heal… Extr…   235 Retire…      75    65  
#> # … with 490 more rows, and abbreviated variable names ¹​age_group,
#> #   ²​n_respondents_in_industry, ³​median_industry_age

surveys %>%
  slice(2) %>%
  tidyr::unnest(survey_data)
#> # A tibble: 750 × 10
#>    survey_file     surve…¹    Q1 Q2    Q3    Q4    Q5    age_g…² n_res…³ media…⁴
#>    <chr>           <chr>   <dbl> <chr> <chr> <chr> <chr> <chr>     <int>   <dbl>
#>  1 personal_survey person…    91 Aust… Ener… Neit… Blan… Retire…     145      61
#>  2 personal_survey person…    32 San … Heal… Extr… Elem… Adult       131      62
#>  3 personal_survey person…    40 San … Cons… Some… Eget… Adult       149      61
#>  4 personal_survey person…    23 Aust… Cons… Extr… Scel… Adult       149      61
#>  5 personal_survey person…    88 Dall… Tech… Neit… Aene… Retire…     150      61
#>  6 personal_survey person…    69 Fort… Tech… Neit… Inte… Retire…     150      61
#>  7 personal_survey person…    96 Hous… Heal… Extr… Blan… Retire…     131      62
#>  8 personal_survey person…    40 Hous… Cons… Extr… Scel… Adult       149      61
#>  9 personal_survey person…    57 Fort… Ener… Extr… Pede… Adult       145      61
#> 10 personal_survey person…    40 Fort… Heal… Extr… Phar… Adult       131      62
#> # … with 740 more rows, and abbreviated variable names ¹​survey_name,
#> #   ²​age_group, ³​n_respondents_in_industry, ⁴​median_industry_age