---
title: "Getting Started with rollup Package"
author: "Ju Young Ahn"
date: "2024-08-27"
output: 
  html_document:
  toc: true
  toc_float: true
  toc_collapsed: true
toc: true
vignette: >
  %\VignetteIndexEntry{rollup}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(rollup))
```
`rollup`: A Tidy implementation of `GROUPING SETS`, `WITH ROLLUP`, and `WITH CUBE`, which are powerful extensions of the `GROUP BY` clause that compute multiple group-by clauses in a single statement in `SQL`. This package operates on top of the `dplyr` and performs the same functions as `SQL`.

## Installation
```{r warning=FALSE, eval=FALSE}
# From CRAN
install.packages("rollup")
 
# From Github
library(devtools)
devtools::install_github("JuYoungAhn/rollup")
```

## In a Nutshell
- The functions of `rollup` package allow you to simplify multiple `group_by` operations into a single, concise statement. 
- This makes data aggregation easier and more efficient.
```{r warning=FALSE}
mtcars %>% group_by(vs, am) %>% grouping_sets("vs","am",c("vs","am"), NA) %>% 
  summarize(n=n(), avg_mpg=mean(mpg))

mtcars %>% group_by(vs, am) %>% with_rollup() %>% 
  summarize(n=n(), avg_mpg=mean(mpg))

mtcars %>% group_by(vs, am) %>% with_cube() %>% 
  summarize(n=n(), avg_mpg=mean(mpg))
```

## Practical example
- This example shows how to compute the average pageview count grouped by various combinations of gender and age.
- In this section, you will see **why the rollup package is useful** by exploring practical data examples.

### Web service data 
##### Description of data
- date_id : yyyy-mm-dd
- id : user unique id
- gender : male(M), female(F)
- age : age band (categorical)
- page_view_cnt : pageview count of user on date_id
- product_view_cnt_cat : decile category of the product view count for a user on date_id.
```{r setup}
library(dplyr)
library(rollup)
data("web_service_data") # web_service_data of rollup package
web_service_data %>% head
```

### grouping_sets
- `grouping_sets()` allows you to perform multiple `group_by` operations simultaneously, producing combined results in a single output.
- `grouping_sets('a')` is equivalent to the single grouping set operation `group_by(a)`.
- `grouping_sets('a','b')` is equivalent to row binding of `group_by(a)` and `group_by(b)`.
- `grouping_sets(c('a','b'),'a','b', NA)` is equivalent to row binding of `group_by(a,b)`, `group_by(a)`, `group_by(b)` and without `group_by` operation. 
```{r warning=FALSE}
library(tidyr)
# compute average of `page_view_cnt` group by "gender", "age", and "gender & age", along with the overall average. NA in the output table represents overall aggregates.
web_service_data %>% filter(date_id == '2024-06-30' & gender != "N") %>% 
  group_by(gender, age) %>% grouping_sets('gender', 'age', c('gender','age'), NA) %>% 
  summarize(avg_pv_cnt = mean(page_view_cnt))

# compute average of `page_view_cnt` group by "gender & age & product_view_cnt_cat" along with the marginal average with regard to "product_view_cnt_cat".
web_service_data %>% filter(date_id == '2024-06-30' & gender != "N") %>% 
  group_by(gender, age, product_view_cnt_cat) %>% 
  grouping_sets('product_view_cnt_cat', c('product_view_cnt_cat', 'gender','age')) %>% 
  summarize(avg_pv_cnt = mean(page_view_cnt)) %>% 
  pivot_wider(names_from = product_view_cnt_cat, values_from = avg_pv_cnt)
```

### with_cube
- `with_cube()` automatically generates all possible combinations of specified variables in `group_by` clause. 
- `with_cube()` function is a simplified way of expressing `grouping_sets()`. 
- `with_cube()` is equivalent to using `grouping_sets()` with all combinations of the specified columns.
- For example, `group_by(a,b,c)` followed by `with_cube()` equals to `grouping_sets(c('a','b','c'), c('a','b'), c('a','c'), c('b','c'), 'a', 'b', 'c', NA)`.
- `with_cube()` is particularly useful when you want to include total aggregates of both rows and columns in a cross table.

```{r warning=FALSE}
# This produces a table with average page view counts grouped by gender and age, including total aggregates across all combinations.
web_service_data %>% filter(date_id == '2024-06-30' & gender != "N") %>% 
  group_by(gender, age) %>% with_cube() %>% 
  summarize(avg_pv_cnt = mean(page_view_cnt)) %>% 
  pivot_wider(names_from = age, values_from = avg_pv_cnt)
```

### with_rollup
- `with_rollup()` creates hierarchical aggregations by progressively reducing the number of grouping variables.
- `with_rollup()` is particulary useful **when variables have a hierarchy**, because all possible combinations are not necessary.
- `group_by(a,b)` followed by `with_rollup()` equals to `grouping_sets(c('a','b'), 'a', NA)`.
- `group_by(a,b,c)` followed by `with_rollup()` equals to `grouping_sets(c('a','b','c'), ('a','b'), ('a'), NA)`.

```{r warning=FALSE}
# The variables "age_big" and "age" have a hierarchy. 
web_service_data_processed <- web_service_data %>% mutate(
  age_big = case_when(
    age %in% c(10,20,30) ~ 'young',
    age %in% c(40,50,60) ~ 'old'  
  )
)

# If there are aggregates "age_big & age", marginal aggregates for "age" are not necessary.
# The following code computes aggregates for "age_big & age", "age_big", and entire data set.
web_service_data_processed %>% group_by(age_big, age) %>% 
  with_rollup() %>% summarize(
  user_cnt = n_distinct(id),
  avg_pv_cnt = mean(page_view_cnt)
)
```