--- title: "Package rvec" author: - name: John Bryant output: bookdown::html_document2: base_format: rmarkdown::html_vignette fig_caption: yes toc: true toc_depth: 2 number_sections: true pkgdown: as_is: true vignette: > %\VignetteIndexEntry{Package rvec} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Aims of **rvec** {#sec:aims} Many statistical methods, particularly Bayesian methods, produce random draws from a distribution. A useful feature of these draws is that they can be used to make inferences about derived quantities. The procedure is: - Step 1. Calculate the derived quantity for each of the random draws. - Step 2. Summaries the distribution of these derived quantities. If, for instance, we have randoms draws of age-specific mortality rates, and we want make inferences about life expectancy (a summary indicator for mortality rates), then we proceed as follows: - Step 1. Derive life expectancy for each set of age-specific mortality rates. - Step 2. Calculate means, medians, or other statistics for these life expectancies. For more on the theory behind manipulating random draws, and for an argument that R needs high-level tools to help with this manipulation, see @kerman2007manipulating. Package **rvec** provides tools for working with random draws. The draws are held in a structure called an rvec, which can, for many purposes, be treated like an ordinary R vector, and manipulated using ordinary base R and [tidyverse](https://www.tidyverse.org) code. **rvec** also contains functions for summarizing across random draws. # Examples ## Toy example We begin with a toy example, to illustrate basic functionality. ```{r} library(rvec) l <- list(c(3, 1, 0)) theta <- rvec(l) theta ``` The header `[1]>` describe the structure of `theta`: - `_dbl` indicates that `theta` is composed of [doubles](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/double) - `<3>` indicates that `theta` holds three random draws, and - `[1]` indicates that each draw has length 1. We can perform standard mathematical operations: ```{r} theta^2 + 1 ``` `theta` recycles to match the length of other vectors, ```{r} beta <- theta + c(1, -1) beta ``` `beta` consists of three draws: 1. `c(4, 2)`, obtained from adding `3` and `c(1, -1)`, 2. `c(2, 0)`, obtained from adding `1` and `c(1, -1)`, and 3. `c(1, -1)`, obtained from adding `0` and `c(1, -1)`. To summarize across random draws, we use `draws_*` functions, e.g. ```{r} draws_mean(beta) ``` ## Divorce rates {#sec:divorce} Our next example is more involved, and includes the use of some standard tidyverse packages. ```{r} library(dplyr, warn.conflicts = FALSE) library(tidyr) library(ggplot2) ``` We analyse a posterior sample from a Bayesian model of divorce rates in New Zealand. The rates are divorces per thousand people per year, disaggregated by age and sex. The rates are not stored as an rvec, but instead in a 'data base' format, where each row describes a single draw. ```{r} divorce ``` First we convert from database format to rvec format. ```{r} divorce_rv <- divorce |> collapse_to_rvec(value = rate) divorce_rv ``` When the number of draws is large, the print method displays ` (<2.5% quantile>, <97.5% quantile>)` for the distribution, rather than the individual draws. We define the 'total divorce rate' to be the number of divorces that a person would expect to experience over their lifetime under prevailing divorce rates. The total divorce rate can be calculated as ```{r} divorce_rv |> group_by(sex) |> summarise(TDR = sum(rate) * 5 / 1000) ``` We summarize across draws using `draws_ci()`, which, by default, calculates medians and 95% credible intervals. Function `draws_ci()` returns a tibble rather than a vector, so, following standard `mutate` rules, we do not explicitly create new columns. ```{r} divorce_rv |> group_by(sex) |> summarise(tdr = sum(rate) * 5 / 1000) |> mutate(draws_ci(tdr)) ``` Next we calculate the ratio between female and male divorce rates, ```{r} divorce_ratio <- divorce_rv |> pivot_wider(names_from = sex, values_from = rate) |> mutate(ratio = Female / Male) |> mutate(draws_ci(ratio)) ``` and graph the result ```{r, fig.width = 7, fig.height = 4} ggplot(divorce_ratio, aes(x = age, ymin = ratio.lower, y = ratio.mid, ymax = ratio.upper)) + geom_pointrange() ``` # Structure of an rvec {#sec:structure} The class `"rvec"` has four subclasses: - `"rvec_dbl"`, which holds doubles, e.g. `3.142`, `-1.01`. - `"rvec_int"`, which holds integers, e.g. `42`, `-1`. - `"rvec_lgl"`, which holds `TRUE`, `FALSE`, and `NA`. - `"rvec_chr"`, which hold characters, e.g. `"a"`, `"Thomas Bayes"`. Internally, an rvec is a matrix, with each row representing one unknown quantity, and each column representing one draw from the joint distribution of the unknown quantities, | | Draw 1 | Draw 2 | $\dots$ | Draw $n$ | |--------------|:-------------:|:-------------:|:--------:|:-------------:| | Quantity 1 | $\theta_{11}$ | $\theta_{12}$ | $\dots$ | $\theta_{1n}$ | | Quantity 2 | $\theta_{21}$ | $\theta_{22}$ | $\dots$ | $\theta_{2n}$ | | $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | | Quantity $m$ | $\theta_{m1}$ | $\theta_{m2}$ | $\dots$ | $\theta_{mn}$ | Ordinary functions are applied independently to each column. For instance, calling `sum()` on an rvec creates a new rvec with structure | | Draw 1 | Draw 2 | $\dots$ | Draw $n$ | |--------------|:-------------------------:|:-------------------------:|:--------:|:-------------------------:| | Quantity 1 | $\sum_{i=1}^m\theta_{i1}$ | $\sum_{i=1}^m\theta_{i2}$ | $\dots$ | $\sum_{i=1}^m\theta_{in}$ | Functions with a `draws_` prefix are applied independently to each row. For instance, calling `draws_mean()` on an rvec creates a new numeric vector with structure | | Value | |--------------|:------------------------------------:| | Quantity 1 | $\frac{1}{n}\sum_{j=1}^n\theta_{1j}$ | | Quantity 2 | $\frac{1}{n}\sum_{j=1}^n\theta_{2j}$ | | $\vdots$ | $\vdots$ | | Quantity $m$ | $\frac{1}{n}\sum_{j=1}^n\theta_{mj}$ | Each rvec holds a fixed number of draws. Two rvecs can only be used together in a function if 1. both rvecs contain the same number of draws, or 2. one of the rvecs contains a single draw. # Creating rvecs An individual rvec can be created from a list of vectors, ```{r} x <- list(LETTERS, letters) rvec(x) ``` a matrix, ```{r} x <- matrix(rnorm(2000), nrow = 2) rvec(x) ``` or an atomic vector ```{r} x <- c(TRUE, FALSE) rvec(x) ``` Function `rvec()` chooses from classes `"rvec_dbl"`, `"rvec_int"`, `"rvec_lgl"`, and `"rvec_chr"`, based on the input. To enforce a particular class, use function `rvec_dbl()`, `rvec_int()`, `rvec_lgl()`, or `rvec_chr()`, ```{r} x <- list(1:3) rvec(x) rvec_dbl(x) rvec_chr(x) ``` When the raw data take the form of a database with one draw per row, the most efficient way to create rvecs is to use `collapse_to_rvec()`. See Section \@ref(sec:divorce) for an example. Section \@ref(sec:prob) shows how to create an rvec consisting of draws from a standard probability distribution. # Mathematical and logical operations Mathematical and logical operations are applied independently to each draw. ```{r} x <- rvec(list(c(TRUE, FALSE), c(TRUE, TRUE))) all(x) any(x) ``` User-defined functions that consist entirely of standard mathematical and logical operations should work with no modifications. ```{r} logit <- function(p) log(p / (1-p)) tibble( x = rvec(list(c(0.2, 0.4), c(0.6, 0.9))), y = logit(x) ) ``` Multiplying an rvec by a matrix produces an rvec, ```{r} m <- rbind(c(1, 1), c(0, 1)) x <- rvec(list(1:2, 3:4)) m %*% x ``` **rvec** contains a suite of functions for summarising weighted data: - `weighted_mad()` - `weighted_mean()` - `weighted_median()` - `weighted_sd()` - `weighted_var()` All of these are built on functions from package [matrixStats](https://CRAN.R-project.org/package=matrixStats). The elements of an rvec do not have a well-defined order when there is more than one draw. Functions `sort()` and `order()` fail when called on an rvec, unless `n_draw` is 1. Ranking does, however, have a useful interpretation. We apply the ranking operation independently to each draw, and return the results as an integer rvec. ```{r} divorce_ratio |> select(age, ratio) |> mutate(rank = rank(ratio)) ``` The rank of the 15-19 age group is uncertain, while the rank of the 65+ age group is estimated precisely. # Probability distributions {#sec:prob} Standard R probability functions such as `dnorm()` or `rbinom()` do not allow rvec arguments. Package **rvec** provides modified functions that do. For instance, ```{r} y <- rvec(list(c(-1, 0.2), c(3, -7))) mu <- rvec(list(c(0, 1), c(0, -1))) dnorm_rvec(y, mean = mu, sd = 3) rbinom_rvec(n = 2, size = round(y+10), prob = 0.8) ``` The return value from an **rvec** probability function is an rvec if and only if at least one argument to the function is an rvec -- with one exception. The exception is random variate functions, where a value can be supplied for a special argument called `n_draw`. When a value for `n_draw` is supplied, the return value is an rvec with `n_draw` draws, ```{r} rnorm_rvec(n = 3, mean = 100, sd = 10, n_draw = 2) ``` This is a convenient way to create inputs to a simulation. # Manipulating rvecs ## Subsetting Standard R ways of selecting elements from vectors work with rvecs. ```{r} x <- rvec(list(a = 1:2, b = 3:4, c = 5:6)) x[1] ## element number x[c("a", "c")] ## element name x[c(TRUE, FALSE, TRUE)] ## logical flag ``` ## If-Else {#sec:ifelse} The standard R function `ifelse()` does not work at all with rvecs. The tidyverse function `if_else()` works when the `true`, `false`, or `missing` arguments are rvecs, ```{r} x <- rvec(list(1:2, 3:4)) if_else(condition = c(TRUE, FALSE), true = x, false = -x) ``` `if_else()`does not work, however, when the `condition` argument is an rvec. When the `condition` argument is an rvec, we need the **rvec** function `if_else_rvec()`, ```{r} if_else_rvec(x <= 2, x, 2) ``` Function `if_else_rvec()` can be used to independently transform or recode values across different draws, ```{r} x <- rvec(list(c(1, 3.3), c(NA, -2))) x x_recode <- if_else_rvec(is.na(x), 99, x) x_recode ``` ## Combining The standard R concatenation function `c()` works with rvecs, ```{r} x1 <- rvec(list(c(0.1, 0.2), c(0.3, 0.4))) x2 <- rvec(list(c(0.5, 0.6), c(0.7, 0.8))) c(x1, x2) ``` Unfortunately, `cbind()` and `rbind()` cannot be made to work properly on raw rvecs, ```{r} rbind(x1, x2) cbind(x1, x2) ``` though `cbind()` does work if the rvecs are contained in data frames ```{r} df1 <- data.frame(x1) df2 <- data.frame(x2) cbind(df1, df2) ``` Tidyverse equivalents such as `dplyr::bind_rows()`, `dbplyr::bind_cols()`, `vctrs::vec_cbind()`, and `vctrs::vec_rbind()` *do* work with rvecs, ```{r} library(vctrs, warn.conflicts = FALSE) vec_cbind(a = x1, b = x2) ``` Base R function `sapply()` does not work with rvecs (unless `simplify` is set to `FALSE`). **rvec** supplies a function called `map_rvec()` (based on map functions in package [purrr](https://purrr.tidyverse.org)) that does the same job: ```{r} l <- list(a = rvec(list(c(1, 4))), b = rvec(list(c(9, 16)))) l map_rvec(l, sqrt) ``` ## Coercing Function `as.matrix()` returns the data underlying an rvec. ```{r} m <- matrix(1:6, nr = 2) m x <- rvec(m) x as.matrix(x) ``` Function `as_list_col()` returns a list of vectors ```{r} as_list_col(x) ``` Functions such as [point_interval](https://mjskay.github.io/ggdist/reference/point_interval.html) in package [ggdist](https://mjskay.github.io/ggdist/) accept lists of vector. A good way to access the sophisticated plotting facilities in **ggdist**, or in packages such as [tidybayes](https://mjskay.github.io/tidybayes/) and [bayesplot](https://mc-stan.org/bayesplot/), is to use `as_list_col()` to convert an rvec to a list column. Function `expand_from_rvec()` is the inverse of function `collapse_to_rvec()`, introduced in Section \@ref(sec:divorce). ```{r} divorce |> head(2) divorce |> collapse_to_rvec(values = rate) |> head(2) divorce |> collapse_to_rvec(values = rate) |> expand_from_rvec() |> head(2) ``` # Summarising distributions Most functions in **rvec** are concerned with deriving random vectors from other random vectors: that is, with the column-wise calculations described in Section \@ref(sec:structure). But once we have derived the random vectors, we typically want to summarise them, using statistics such as means or quantiles: that is, we want to carry out row-wise calculations. The functions for carrying out row-wise calculations on rvecs are: - `draws_all()` - `draws_any()` - `draws_median()` - `draws_mean()` - `draws_mode()` - `draws_ci()` - `draws_quantile()` - `draws_fun()` Internally, most of these functions call functions from [matrixStats](https://CRAN.R-project.org/package=matrixStats), which means they are fast. `draws_ci()`, which calculates credible intervals, is the draws function that is used most often, ```{r} divorce_rv <- divorce |> collapse_to_rvec(value = rate) divorce_rv divorce_rv |> mutate(draws_ci(rate)) ``` # Other packages The first R package to provide a specialized object for handling multiple draws was [rv](https://CRAN.R-project.org/package=rv). The specialized object, called an rv, can be manipulated and summarized much like an rvec. However, in software terms, an rv is not strictly a vector (calling `is.vector()` on one returns `FALSE`) and an rv does not always behave as expected inside a data frame. It is therefore not well suited to tidyverse-style work flows. R package [posterior](https://CRAN.R-project.org/package=posterior) provides several data structures for handling multiple draws, including one, called a rvar, that is similar to an rvec. An rvar is, however, is not limited to a single dimension, and has special facilities for dealing with multiple chains (as produced by Markov chain Monte Carlo methods.) These features are essential for some analyses, but they can make rvars harder to master, and they are not needed for most tidyverse-style work flows. Whereas rvecs interpret summary functions such as `mean()` and `sum()` as operations to be applied independently on each draw, rvars interpret them as operations to be applied across draws. The result is that code written for ordinary R vectors will often work on rvecs, but need modification to work on rvars. The tidyverse function `count()`, for instance, works with rvecs but not rvars. # References