---
title: "Where"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Where}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
library(where)
library(dplyr)
library(data.table)
library(ggplot2)
```

The `where` package has one main function, `run()`, which provides a clean syntax for vectorising the use of NSE (non-standard evaluation), for example in `ggplot2`, `dplyr`, or `data.table`. There are also two infix wrappers, `%where%` and `%for%`, that provide arguably cleaner syntax. A typical example might look like

```{r}
subgroups <- .(all = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)

(iris %>%
  filter(x) %>%
  summarise(across(Sepal.Length:Petal.Width, mean),
            .by = Species)) %for% subgroups
```

Here we have a population dataset and various subpopulations of interest, and we want to apply the same code over all subpopulations. If the subpopulations were a partition of the data (for example, a census population divided into 5-year age bands), then we could use `group_by()` in `dplyr` or faceting in `ggplot2`, for example, to apply the same code over all subpopulations. In general, however, the subpopulations will not be so easy to apply over, for example if some are defined by age, others by gender, and others by a combination of the two. Even a single variable that allows multiple options to be selected (for example, ethnicity in the New Zealand Census) can define subpopulations (in this case ethnic groups) that cannot be vectorised over with the partitioning functionality (such as grouping and faceting) in standard packages. The `where` package makes these examples straightforward.

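To make the partitioning point concrete, here is a minimal base R sketch showing that the subgroups used in this vignette overlap, so no single grouping column can represent them. It does not use the `where` package: `alist()` is a stand-in for `.()`, and the names `conditions`, `membership`, `group_sizes` and `n_overlapping` are chosen only for illustration.

```r
# Base R only; alist() is a stand-in for the package's `.()` capture
# function (an assumption made for illustration).
conditions <- alist(
  all        = TRUE,
  long_sepal = Sepal.Length > 6,
  long_petal = Petal.Length > 5.5
)

# Evaluate each condition against iris to get one logical membership
# vector per subgroup (rep_len() expands the scalar TRUE).
membership <- sapply(conditions, function(cond) {
  rep_len(eval(cond, iris), nrow(iris))
})

# Number of rows in each subgroup.
group_sizes <- colSums(membership)

# Rows that belong to more than one subgroup: a single grouping column
# (as used by group_by() or faceting) cannot represent these.
n_overlapping <- sum(rowSums(membership) > 1)
```

Because the subgroup sizes add up to more than the number of rows in `iris`, the subgroups cannot be a partition of the data.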
## Simple example

As a running example we will use the `iris` dataset and the following (largely unnatural) subpopulations of irises:

- the full population,
- irises with sepal length more than 6, and
- irises with petal length more than 5.5.

These subgroups can be defined by capturing their filter conditions with the `.()` function:

```{r subgroups}
subgroups <- .(all = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)
```

Using these subgroups directly in standard R is tricky. For example, we could form the separate populations with repeated code:

```{r repetition}
# With base R
iris
iris[iris[["Sepal.Length"]] > 6, ]    # or with(iris, iris[Sepal.Length > 6, ])
iris[iris[["Petal.Length"]] > 5.5, ]  # or with(iris, iris[Petal.Length > 5.5, ])

# With dplyr
iris
filter(iris, Sepal.Length > 6)
filter(iris, Petal.Length > 5.5)

# With data.table
iris
as.data.table(iris)[Sepal.Length > 6]
as.data.table(iris)[Petal.Length > 5.5]
```

or this could be done by first explicitly capturing expressions (as done above with `.()`) and then evaluating them:

```{r eval}
lapply(subgroups, function(group) with(iris, iris[eval(group), ]))
```

This requires some comfort with managing expressions in R and can quickly get messy with more complex queries, particularly if we want to apply across more than one set of expressions.

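To illustrate how quickly the manual approach grows with a second set of expressions, here is a base R sketch that fills a template with every pair of subgroup and summary function. It is not how the `where` package is implemented: `alist()` stands in for `.()`, and `template`, `filled` and `results` are illustrative names.

```r
# Base R only; alist() stands in for the package's `.()` (an assumption).
subgroups <- alist(all = TRUE, long_sepal = Sepal.Length > 6)
functions <- alist(mean = mean, max = max)

# Template expression with two placeholder symbols: `subgroup` and `summary`.
template <- quote(summary(iris[with(iris, subgroup), "Petal.Length"]))

# Substitute every (subgroup, summary) pair into the template and evaluate.
results <- lapply(subgroups, function(sg) {
  lapply(functions, function(fn) {
    # do.call() is needed so that substitute() sees the stored template
    # rather than the symbol `template` itself.
    filled <- do.call(substitute,
                      list(template, list(subgroup = sg, summary = fn)))
    eval(filled)
  })
})
```

Even for this small case, the nested `lapply()` calls and the `do.call(substitute, ...)` idiom obscure the analysis code itself.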
The `run()` function hides these manipulations:

```{r}
run(with(iris, iris[subgroup, ]), subgroup = subgroups)

# or
with(iris, iris[x, ]) %for% subgroups
```

## More interesting examples

A standard group by and summarise operation:

```{r filter_summarise}
library(dplyr)

subgroups <- .(all = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)
functions <- .(mean, sum, prod)

run(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width, summary),
              .by = Species),
  subgroup = subgroups,
  summary = functions
)
```

The same using `data.table`:

```{r filter_summarise_dt}
library(data.table)

df <- as.data.table(iris)

run(df[subgroup, lapply(.SD, functions), keyby = "Species",
       .SDcols = Sepal.Length:Petal.Width],
    subgroup = subgroups,
    functions = functions)
```

Producing the same `ggplot` over the different populations:

```{r ggplot}
library(ggplot2)

plots <- run(
  ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal(),
  subgroup = subgroups
)

Map(function(plot, name) plot + ggtitle(name), plots, names(plots))
```

Or different plots for the full population:

```{r ggplots}
run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(), geom_smooth())
)
```

### A limitation

A natural extension of the previous example can fail in a non-obvious way, because expressions are evaluated differently than might be intended. For example, the following does not work

```{r fail_compound_geom, eval = FALSE}
# Fails
run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(), geom_smooth(), geom_quantile() + geom_rug())
)
```

since, for the third plot, it tries to evaluate

```{r fail_compound_geom2, eval = FALSE}
# Fails
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  (geom_quantile() + geom_rug()) +
  theme_minimal()
```

and `geom_quantile() + geom_rug()` throws an error, because `ggplot2` layers can only be added to a plot, not to each other.

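The grouping behaviour behind this failure can be seen without `ggplot2`. The base R sketch below assumes that `run()` substitutes expressions in the same way as base `substitute()`, which is consistent with the expansion shown above but is not confirmed here. Subtraction is used in place of `+` because it makes the grouping visible in the result.

```r
# Template with a placeholder `plot` between two other terms.
template <- quote(10 - plot - 1)

# Substituting a compound expression inserts it as ONE sub-call,
# as if it had been written in brackets: 10 - (5 - 2) - 1.
filled <- do.call(substitute, list(template, list(plot = quote(5 - 2))))

nested_value  <- eval(filled)     # grouping kept by substitution
textual_value <- 10 - 5 - 2 - 1   # what pasting in the text would give
```

The two values differ, which is the numeric analogue of `geom_quantile() + geom_rug()` being evaluated on its own before it is added to the plot.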
This particular use case can be accomplished by putting the separate `geom`s in a list:

```{r compound_geom}
run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point = geom_point(),
           smooth = geom_smooth(),
           quantilerug = list(geom_quantile(), geom_rug()))
)

# or by separating out the combined geoms as a function (also using a list)
geom_quantilerug <- function() list(geom_quantile(), geom_rug())

run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point = geom_point(),
           smooth = geom_smooth(),
           quantilerug = geom_quantilerug())
)
```

## `run()` in a function

We can call `run()` from within a function to further hide details. For example, we could produce subpopulation summaries for the different species of iris:

```{r function_on_parts}
population_summaries <- function(df) run(with(df, df[subgroup, ]), subgroup = subgroups)

as.data.table(iris)[, .(population_summaries(.SD)), keyby = "Species"]
```

As a more general example, if we are undertaking an analysis of different subpopulations, then we could fix the populations in a function and apply code immediately over all groups:

```{r apply_over_pops}
on_subpopulations <- function(expr, populations = subgroups) {
  eval(substitute(run(expr, subgroup = populations),
                  list(expr = substitute(expr))))
}

on_subpopulations(as.data.table(iris)[subgroup])

on_subpopulations(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width, mean),
              .by = Species)
)

on_subpopulations(
  ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)
```

As with the DRY (Don't Repeat Yourself) principle in general, this isolation makes it straightforward to add a new subpopulation, here by editing the subgroups:

```{r extra_subpop}
subgroups <- .(all = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5,
               versicolor = Species == "versicolor")
```

Taking things to the absurd, we can also isolate out the analysis code:

```{r}
analyses <- .(subset = as.data.table(iris)[subgroup],
              summarise = iris %>%
                filter(subgroup) %>%
                summarise(across(Sepal.Length:Petal.Width, mean),
                          .by = Species),
              plot = ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
                geom_point() +
                theme_minimal())

lapply(analyses, function(expr) do.call("on_subpopulations", list(expr)))
```

### A small warning

The `ggplot` example

```{r}
on_subpopulations(
  ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)
```

does not give identical results to executing the `ggplot` code directly with the given subgroups, since the ggplot object stores the execution environment, which will be different.

If this is important, it can be remedied by capturing the calling environment and passing it on in the `on_subpopulations()` function:

```{r}
on_subpopulations <- function(expr, populations = subgroups) {
  e <- parent.frame()
  eval(substitute(run(expr, subgroup = populations, e = e),
                  list(expr = substitute(expr))))
}
```

## Infix notation

As some syntactic sugar, there are also two infix versions of `run()`:

- `%where%` is a full infix version of `run()`, taking the expression as the left argument and a named list of values to be substituted as the right argument.
- `%for%` has slightly simplified syntax but only allows one substitution, for the symbol `x`.

```{r infixed}
as.data.table(iris)[subgroup, lapply(.SD, summary), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %where%
  list(subgroup = subgroups[1:3],
       summary = functions)

# note `subgroup` replaced with `x`
as.data.table(iris)[x, lapply(.SD, mean), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %for% subgroups
```

Complex expressions (for example, with pipes or `+`) need to be wrapped in `()` or `{}`. For example

```{r infixed_bracketed}
(iris %>%
  filter(x) %>%
  summarise(across(Sepal.Length:Petal.Width, mean),
            .by = Species)) %for% subgroups
```

An additional `%with%` function provides a similar syntax to `%where%` for standard evaluation:

```{r with}
(a + b) %with% {
  a = 1
  b = 2
}
```
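
Returning to the bracketing rule for infix use: for `+`, the need for brackets follows from R's operator precedence, where `%name%` operators bind more tightly than `+`. The sketch below inspects how R parses an expression with and without brackets. `%op%` is a hypothetical operator name used only for illustration, and it does not need to be defined because the expressions are parsed but never evaluated.

```r
# `%op%` is a hypothetical operator; the parser treats every %name%
# operator the same way, so it need not exist to inspect the parse.
unbracketed <- quote(a + b %op% c)
bracketed   <- quote((a + b) %op% c)

# Without brackets the top-level call is `+`, so only `b` reaches %op%.
top_unbracketed <- as.character(unbracketed[[1]])
lhs_unbracketed <- deparse(unbracketed[[3]][[2]])

# With brackets the whole sum is the left-hand argument of %op%.
top_bracketed <- as.character(bracketed[[1]])
lhs_bracketed <- deparse(bracketed[[2]])
```

In the unbracketed form an operator such as `%for%` would receive only the last term of a `ggplot` sum, rather than the whole plot expression.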