---
title: "Data Wrangling"
author: "Jeff Goldsmith, Fabian Scheipl"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_width: 7
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Data Wrangling}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.width = 6,
  fig.height = 4,
  out.width = "90%"
)

library(tidyverse)
library(viridisLite)

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)
scale_colour_discrete <- scale_colour_viridis_d
scale_fill_discrete <- scale_fill_viridis_d

library("tidyfun")

pal_5 <- viridis(7)[-(1:2)]
set.seed(1221)
```

# Data manipulation using **`tidyfun`**

The goal of **`tidyfun`** is to provide accessible and well-documented software that **makes functional data analysis in `R` easy**.

In this vignette, we explore some aspects of data manipulation that are possible using **`tidyfun`** workflows with `tf` vectors, emphasizing compatibility with the **`tidyverse`**. Other vignettes have examined the **`tfd`** & **`tfb`** data types, and how to convert common formats for functional data (e.g. matrices, long- and wide-format data frames, **`fda`** objects) into these new data types. Because our goal is "tidy" data manipulation for functional data analysis, the result of data conversion processes has been a data frame in which a column contains the functional data of interest. This vignette starts from that point.

Throughout, we make use of some visualization tools -- these are explained in more detail in the [visualization](https://tidyfun.github.io/tidyfun/articles/x04_Visualization.html) vignette.

# Example datasets

This vignette uses the `tidyfun::chf_df` and `tidyfun::dti_df` datasets.
The first contains minute-by-minute observations of log activity counts (stored as a `tfd` vector called `activity`) over seven days for each of 47 subjects with congestive heart failure. In addition to `id` and `activity`, we observe several covariates.

```{r view_chf}
data(chf_df)
chf_df
```

A quick plot of the first 5 curves:

```{r plot_chf}
chf_df |>
  slice(1:5) |>
  tf_ggplot(aes(tf = activity)) +
  geom_line(alpha = 0.1)
```

The `tidyfun::dti_df` dataset contains fractional anisotropy (FA) tract profiles for the corpus callosum (cca) and the right corticospinal tract (rcst), along with several covariates.

```{r view_dti}
data(dti_df)
dti_df
```

A quick plot of the `cca` tract profiles is below.

```{r plot_dti}
dti_df |>
  tf_ggplot(aes(tf = cca)) +
  geom_line(alpha = 0.05)
```

# Existing `tidyverse` functions

Data frames that use **`tidyfun`** to store functional observations can be manipulated using tools from **`dplyr`**, including `select` and `filter`:

```{r}
chf_df |>
  select(id, day, activity) |>
  filter(day == "Mon") |>
  tf_ggplot(aes(tf = activity)) +
  geom_line(alpha = 0.05)
```

Operations using `group_by` and `summarize` also work -- let's look at some daily averages of these activity profiles:

```{r}
chf_df |>
  group_by(day) |>
  summarize(mean_act = mean(activity)) |>
  tf_ggplot(aes(tf = mean_act, color = day)) +
  geom_line()
```

One can `mutate` functional observations -- here we exponentiate the log activity counts to obtain the original recordings:

```{r}
chf_df |>
  slice(1:5) |>
  mutate(exp_act = exp(activity)) |>
  tf_ggplot(aes(tf = exp_act)) +
  geom_line(alpha = 0.2)
```

Functions for data manipulation from **`tidyr`** are also supported.
We illustrate by using `pivot_wider` to create new `tfd`-columns containing the activity profiles for each day of the week:

```{r}
chf_df |>
  select(id, day, activity) |>
  pivot_wider(
    names_from = day,
    values_from = activity
  )
```

(Note that this has made the data less "tidy" and is therefore not generally recommended, but may be useful in some cases).

It's also possible to join datasets based on non-functional keys. To illustrate, we'll first create a pair of datasets:

```{r}
monday_df <- chf_df |>
  filter(day == "Mon") |>
  select(id, monday_act = activity)

friday_df <- chf_df |>
  filter(day == "Fri") |>
  select(id, friday_act = activity)
```

These can be joined using the `id` variable as a key (and then tidied using `pivot_longer`):

```{r}
monday_df |>
  left_join(friday_df, by = "id") |>
  pivot_longer(monday_act:friday_act, names_to = "day", values_to = "activity")
```

Similar tidying can be done for the DTI data -- let's look at average RCST tract values for gender and case status:

```{r}
dti_df |>
  group_by(case, sex) |>
  summarize(mean_rcst = mean(rcst, na.rm = TRUE)) |>
  tf_ggplot(aes(tf = mean_rcst, color = case)) +
  geom_line(linewidth = 2) +
  facet_grid(~sex)
```

# `tf` helper functions in tidy workflows

Some **`dplyr`** functions are useful in conjunction with **`tf`** helper functions. For example, one might use `filter` with `tf_anywhere()` to filter based on the values of observed functions:

```{r}
like_to_move_it_move_it <- chf_df |>
  filter(tf_anywhere(activity, value > 9))

glimpse(like_to_move_it_move_it)

like_to_move_it_move_it |>
  tf_ggplot(aes(tf = activity, colour = id)) +
  geom_line()
```

A second example of this functionality in the DTI data is below.

```{r}
dti_df |>
  filter(tf_anywhere(cca, value < 0.26)) |>
  tf_ggplot(aes(tf = cca)) +
  geom_line()
```

The existing `mutate` function can be combined with several `tf` helpers, including `tf_smooth()`, `tf_zoom()`, and `tf_derive()`.
One can smooth existing observations using `tf_smooth`:

```{r}
chf_df |>
  filter(id == 1) |>
  mutate(smooth_act = tf_smooth(activity)) |>
  tf_ggplot(aes(tf = smooth_act)) +
  geom_line()
```

This can be combined with previous steps, like `group_by` and `summarize`, to build intuition through descriptive plots and summaries:

```{r}
chf_df |>
  group_by(day) |>
  summarize(mean_act = mean(activity)) |>
  mutate(smooth_mean = tf_smooth(mean_act)) |>
  tf_ggplot(aes(color = day)) +
  geom_line(aes(tf = mean_act), alpha = 0.2) +
  geom_line(aes(tf = smooth_mean), linewidth = 2)
```

One can also extract observations over a subset of the full domain using `tf_zoom`:

```{r}
chf_df |>
  filter(id == 1) |>
  mutate(daytime_act = tf_zoom(activity, 360, 1200)) |>
  tf_ggplot(aes(tf = daytime_act)) +
  geom_line(alpha = 0.2)
```

We can also convert from `tfd` to `tfb` inside a `mutate` statement as part of a data processing pipeline:

```{r}
dti_df <- dti_df |>
  mutate(cca_tfb = tfb(cca, k = 25))
```

It's also possible to compute derivatives as part of a processing pipeline:

```{r}
dti_df |>
  slice(1:10) |>
  mutate(
    cca_raw_deriv = tf_derive(cca),
    cca_tfb_deriv = tf_derive(cca_tfb)
  ) |>
  tf_ggplot() +
  geom_line(aes(tf = cca_raw_deriv), alpha = 0.3, linewidth = 0.3, color = "blue") +
  geom_line(aes(tf = cca_tfb_deriv), alpha = 0.3, linewidth = 0.3, color = "red") +
  ylab("d/dt f(t)")
```

# Working with `data.table`

**`tidyfun`** functional data objects work within **`data.table`** as well. However, there is one specific known caveat when calculating the mean of a `tf` vector within a **`data.table`** context: By default, **`data.table`** will use optimized routines for summary statistics, which may cause `mean` to dispatch to **`data.table`**'s own optimized implementation, leading to unexpected behavior for `tf` vectors.
To ensure the correct mean calculation for `tf` vectors, one solution is to disable the optimization for the summary step and let `mean()` dispatch normally. Note that the relevant **`data.table`** option is spelled `datatable.optimize`:

```{r, eval = FALSE}
# Turn off data.table's query optimization for this call only, so that
# `mean()` dispatches to the method for `tf` vectors.
withr::with_options(
  list(datatable.optimize = 0),
  data.table::as.data.table(chf_df)[, list(mean_act = mean(activity)), by = day]
)
```

This caveat applies specifically to `mean()` and is the only known issue.
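
To check whether the optimization is active for a given call, one option is to pass `verbose = TRUE` to the **`data.table`** subsetting call and read the messages it prints. The chunk below is a minimal sketch on a toy table with a plain numeric column; `toy_dt`, `g` and `x` are hypothetical names introduced only for illustration and are not part of **`tidyfun`**. It is not evaluated here because it assumes **`data.table`** is installed.

```{r, eval = FALSE}
library(data.table)

# Toy data (hypothetical, for illustration only): two groups, numeric values.
toy_dt <- data.table(
  g = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 10, 20, 30)
)

# Default settings: the verbose messages report how `j` was optimized.
optimized <- toy_dt[, list(m = mean(x)), by = g, verbose = TRUE]

# Optimization switched off, then the previous option value is restored.
old_opts <- options(datatable.optimize = 0)
unoptimized <- toy_dt[, list(m = mean(x)), by = g, verbose = TRUE]
options(old_opts)

# For plain numeric columns both routes should agree; only the way `mean()`
# is dispatched differs.
all.equal(optimized, unoptimized)
```

For ordinary numeric columns the two results should match, so the switch only matters for classes such as `tf` vectors that rely on their own `mean()` method.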