--- title: "tf Vectors and Operations" author: "Jeff Goldsmith, Fabian Scheipl" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_width: 7 vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{tf Vectors and Operations} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "##", fig.width = 6, fig.height = 4, out.width = "90%" ) library("tidyverse") library("viridisLite") theme_set(theme_minimal() + theme(legend.position = "bottom")) options( ggplot2.continuous.colour = "viridis", ggplot2.continuous.fill = "viridis" ) scale_colour_discrete <- scale_colour_viridis_d scale_fill_discrete <- scale_fill_viridis_d library("tidyfun") pal_5 <- viridis(7)[-(1:2)] set.seed(1221) ``` This vignette introduces the `tf` class, as well as the `tfd` and `tfb` subclasses, and focuses on vectors of this class. It also illustrates operations for `tf` vectors. # `tf`-Class: Definition ## `tf`-class **`tf`** is a new data type for (vectors of) functional data: - an abstract superclass for functional data in 2 forms: - as (argument, value)-tuples: subclass **`tfd`**, also irregular or sparse - or in basis representation: subclass **`tfb`** represents each observed function as a weighted sum of a fixed dictionary of known "basis functions". - basically, a `list` of numeric vectors (... since `list`s work well as columns of data frames ...) - with additional attributes that define *function-like* behavior: - how to **evaluate** the given "functions" for new arguments - their **domain**: the range of valid argument values - `S3` based ## Example Data First we extract a `tf` vector from the `tidyfun::dti_df` dataset containing fractional anisotropy tract profiles for the corpus callosum (`cca`). When printed, `tf` vectors show the first few `arg` and `value` pairs for each subject. ```{r load-dti} data("dti_df") cca <- dti_df$cca cca ``` We also extract a simple 5-element vector of functions on a regular grid: ```{r ex-def} cca_five <- cca[1:5, seq(0, 1, length.out = 93), interpolate = TRUE] rownames(cca_five) <- LETTERS[1:5] cca_five <- tfd(cca_five, signif = 2) cca_five ``` For illustration, we plot the vector `cca_five` below. ```{r, ex-fig} plot(cca_five, xlim = c(-0.15, 1), col = pal_5) text(x = -0.1, y = cca_five[, 0.07], labels = names(cca_five), col = pal_5) ``` ## **`tf`** subclass: **`tfd`** **`tfd`** objects contain "raw" functional data: - represented as a list of **`evaluations`** $f_i(t)|_{t=t'}$ and corresponding **`arg`**ument vector(s) $t'$ - has a **`domain`**: the range of valid **`arg`**s. ```{r} cca_five |> tf_evaluations() |> str() cca_five |> tf_arg() |> str() cca_five |> tf_domain() ``` - each **`tfd`**-vector contains an **`evaluator`** function that defines how to inter-/extrapolate `evaluations` between `arg`s ```{r} tf_evaluator(cca_five) |> str() tf_evaluator(cca_five) <- tf_approx_spline ``` - **`tfd`** has two subclasses: one for regular data with a common grid and one for irregular or sparse data. The `cca` data are irregular (values are missing for some subjects at some arguments) but the example below more clearly illustrates support for sparse and irregular data using CD4 cell counts from a longitudinal study included in **`refund`**. ```{r} cd4_vec <- tfd(refund::cd4) cd4_vec[1:2] cd4_vec[1:2] |> tf_arg() |> str() cd4_vec[1:20] |> plot(pch = "x", col = viridis(20)) ``` ## **`tf`** subclass: **`tfb`** Functional data in basis representation: - represented as a list of **`coefficients`** and a common **`basis_matrix`** of basis function evaluations on a vector of `arg`-values. - contains a **`basis`** function that defines how to evaluate the basis functions for new **`arg`**s and how to differentiate or integrate it. - (internal) flavors: - `tfb_spline`: uses **`mgcv`**-spline bases - `tfb_fpc`: uses functional principal components - significant memory and time savings: ```{r} refund::DTI$cca |> object.size() |> print(units = "Kb") cca |> object.size() |> print(units = "Kb") cca |> tfb(verbose = FALSE) |> object.size() |> print(units = "Kb") ``` ### **`tfb_spline`**: spline basis - default for `tfb()` - accepts all arguments of **`mgcv`**'s `s()`-syntax: basis type `bs`, basis dimension `k`, penalty order `m`, etc... - also does non-Gaussian fits: `family` argument - all exponential families - but also: $t$-distribution, ZI-Poisson, Beta, ... ```{r, message = TRUE} cca_five_b <- cca_five |> tfb() cca_five_b[1:2] cca_five[1:2] |> tfb(bs = "tp", k = 55) # functions represent ratios in (0,1), so a Beta-distribution is more appropriate: cca_five[1:2] |> tfb(bs = "ps", m = c(2, 1), family = mgcv::betar(link = "cloglog")) ``` ### Penalization: **Function-specific (default), none**, prespecified (`sp`), or global: ```{r} layout(t(1:2)) cca_five |> plot() cca_five_b |> plot(col = "red") cca_five |> tfb(k = 35, penalized = FALSE) |> lines(col = "blue") cca_five |> tfb(sp = 0.001) |> lines(col = "orange") ``` Right plot shows smoothing with function-specific penalization in red, without penalization in blue, and with manually set strong smoothing (`sp` $\to 0$) in orange. **"Global" smoothing**: 1. estimate smoothing parameters for subsample (~10\%) of curves 2. apply geometric mean of estimated smoothing parameters to smooth *all* curves **Advantages:** - (much) faster than optimizing penalization for each curve - should scale well for largish datasets **Disadvantages** - no real borrowing of information across curves (very sparse or functional fragment data, e.g.) - still requires more observations than basis functions *per curve* - subsample could miss small subgroups with different roughness, over-/undersmooth parts of the data, see below. ```{r, echo = FALSE} set.seed(1212) raw <- c( tf_rgp(5, scale = 0.2, nugget = 0.05, arg = 101L) - 5, tf_rgp(5, scale = 0.02, nugget = 0.05, arg = 101L), tf_rgp(5, scale = 0.002, nugget = 0.05, arg = 101L) + 5 ) ``` Dataset with heterogeneous roughness: ```{r} layout(t(1:3)) clrs <- scales::alpha(sample(viridis(15)), 0.5) plot(raw, main = "raw", col = clrs) plot(tfb(raw, k = 55), main = "separate", col = clrs) plot(tfb(raw, k = 55, global = TRUE), main = "global", col = clrs) ``` ### **`tfb`** FPC-based - uses first few eigenfunctions computed from a simple unregularized (weighted) SVD of the data matrix by default - corresponding FPC basis and mean function saved as `tfd`-object - observed functions are linear combinations of those. - amount of "smoothing" can be controlled (roughly!) by setting the minimal *percentage of variance explained* `pve` ```{r} cca_five_fpc <- cca_five |> tfb_fpc(pve = 0.999) cca_five_fpc cca_five_fpc_lowrank <- cca_five |> tfb_fpc(pve = 0.6) cca_five_fpc_lowrank ``` ```{r} layout(t(1:2)) cca_five |> plot() cca_five_fpc |> plot(col = "red", ylab = "tfb_fpc(cca_five)") cca_five_fpc_lowrank |> lines(col = "blue", lty = 2) ``` `tfb_fpc` is currently only implemented for data on identical (but possibly non-equidistant) grids. The **`{refunder}`** `rfr_fpca`-functions provide FPCA methods appropriate for highly irregular and sparse data and regularized/smoothed FPCA. # `tf`-Class: Methods **`tidyfun`** implements almost all types of operations that are available for conventional numerical or logical vectors for `tf`-vectors as well, so you can: ### subset & subassign: ```{r} cca_five[1:2] cca_five[1:2] <- cca_five[2:1] cca_five ``` ### compare & compute: ```{r, echo = FALSE} n_cca_five <- names(cca_five) cca_five <- unname(cca_five) ``` ```{r} cca_five[1] + cca_five[1] == 2 * cca_five[1] log(exp(cca_five[2])) == cca_five[2] (cca_five - (2:-2)) != cca_five ``` ```{r, echo = FALSE} names(cca_five) <- n_cca_five ``` ### summarize across a vector of functions: Compute functional summaries like mean functions, functional standard deviations or variances or functional data depths over a vector of functional data: ```{r} c(mean = mean(cca_five), sd = sd(cca_five)) tf_depth(cca_five) ## Modified Band-2 Depth (à la Sun/Genton/Nychka, 2012), others to come. median(cca_five) == cca_five[which.max(tf_depth(cca_five))] summary(cca_five) ``` ### summarize each function over its domain: Compute summaries for each function like its mean or extreme values, quantiles, etc. ```{r} tf_fmean(cca_five) # mean of each function's evaluations tf_fmax(cca_five) # max of each function's evaluations # 25%-tile of each f(t) for t > .5: tf_fwise(cca_five, \(x) quantile(x$value[x$arg > 0.5], prob = 0.25)) |> unlist() ``` `tf_fwise` can be used to define custom statistics for each function that can depend on both its `value` and its `arg`. In addition, **`tidyfun`** provides methods for operations that are specific for functional data: ## Methods for "functional" operations ### evaluate: `tf`-objects have a special `[`-operator: Its second argument specifies `arg`ument values at which to evaluate the functions and has some additional options, so it's easy to get point values for `tf` objects, in `matrix` or `data.frame` formats: ```{r, warning = FALSE} cca_five[1:2, seq(0, 1, length.out = 3)] cca_five["B", seq(0, 0.15, length.out = 3), interpolate = FALSE] cca_five[1:2, seq(0, 1, length.out = 7), matrix = FALSE] |> str() ``` ### (simple, local) smoothing ```{r} layout(t(1:3)) cca_five |> plot(alpha = 0.2, ylab = "lowess") cca_five |> tf_smooth("lowess") |> lines(col = pal_5) cca_five |> plot(alpha = 0.2, ylab = "rolling median (k=5)") cca_five |> tf_smooth("rollmedian", k = 5) |> lines(col = pal_5) cca_five |> plot(alpha = 0.2, ylab = "Savitzky-Golay (quartic, 11 steps)") cca_five |> tf_smooth("savgol", fl = 11) |> lines(col = pal_5) ``` ### differentiate & integrate: ```{r} layout(t(1:3)) cca_five |> plot(col = pal_5) cca_five |> tf_smooth() |> tf_derive() |> plot(col = pal_5, ylab = "tf_derive(tf_smooth(cca_five))") cca_five |> tf_integrate(definite = FALSE) |> plot(col = pal_5) ``` ```{r} cca_five |> tf_integrate() ``` ### query **`tidyfun`** makes it easy to find (ranges of) `arg`uments $t$ satisfying a condition on `value` $f(t)$ (and `arg`ument $t$): ```{r} cca_five |> tf_anywhere(value > 0.65) cca_five[1:2] |> tf_where(value > 0.6, "all") cca_five[2] |> tf_where(value > 0.6, "range") cca_five |> tf_where(value > 0.6 & arg > 0.5, "first") ``` ### zoom & query ```{r, ex-fig2} cca_five |> plot(xlim = c(-0.15, 1), col = pal_5, lwd = 2) text(x = -0.1, y = cca_five[, 0.07], labels = names(cca_five), col = pal_5, cex = 1.5) median(cca_five) |> lines(col = pal_5[3], lwd = 4) ``` ```{r} # where are the first maxima of these functions? cca_five |> tf_where(value == max(value), "first") # where are the first maxima of the later part (t > .5) of these functions? cca_five[c("A", "D")] |> tf_zoom(0.5, 1) |> tf_where(value == max(value), "first") # which f_i(t) are below the functional median anywhere for 0.2 < t < 0.6? # (t() needed here so we're comparing column vectors to column vectors...) cca_five |> tf_zoom(0.2, 0.6) |> tf_anywhere(value <= t(median(cca_five)[, arg])) ```