---
title: "Pathing and Table Structure"
author: "Gabriel Becker"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Pathing and Table Structure}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

# Introduction

`rtables` models both the row- and column-structure of a table as trees. These trees collectively reflect the layout instructions used to declare the table's structure. We can use this to describe locations within the row-, column-, or cell-space of a table in a semantically self-describing way. We call these semantically meaningful locations *paths*.

A *path* is an ordered sequence of names which declares both a route for traversing the tree structure of the relevant dimension and, consequently, a corresponding subset of the table in that dimension. *Column* paths may contain only split names and names of facets generated from those splits. *Row* paths can additionally contain names of tables corresponding to `analyze` calls, the `"@content"` directive, which steps from a facet into the table generated by `summarize_row_groups` containing its marginal summary row(s), and names of individual rows. The location of an individual cell or rectangular group of cells is then defined by a (row path, column path) pair.

As of `rtables` version `0.6.13`, any structural element of a successfully built[^1] table is guaranteed to correspond to a unique row path, column path, or combination thereof.

## An Illustrative Example

Consider a table with non-trivial structure in both the column and row dimensions:
```{r}
library(rtables)

keep_rc <- c("ASIAN", "WHITE") ## chosen for brevity

afun <- function(x) {
  list(
    Mean = rcell(mean(x), format = "xx.x"),
    Median = rcell(median(x), format = "xx.x")
  )
}

lyt <- basic_table() |>
  split_cols_by("ARM", split_fun = keep_split_levels(c("A: Drug X", "C: Combination"))) |>
  split_cols_by("SEX", split_fun = keep_split_levels(c("F", "M"))) |>
  add_overall_col("All") |>
  split_rows_by("RACE", split_fun = keep_split_levels(keep_rc)) |>
  summarize_row_groups() |>
  split_rows_by("STRATA1") |>
  summarize_row_groups() |>
  analyze("AGE", afun = afun) |>
  analyze("BMRKR1", nested = FALSE, show_labels = "visible")

tbl <- build_table(lyt, DM)
tbl
```

We can get a first look at the row- and column-structure of our table (albeit in different formats) with `table_structure` and `col_info`.


```{r}
table_structure(tbl)
```

```{r}
col_info(tbl)
```

We can use paths to declare intuitive substructures of our table. We illustrate
this using `[`, which interprets character vector indices as individual paths in
the respective dimension.

Note that while we will use these paths to subset our table for illustrative purposes,
they are more often used to specify where something should happen within the larger
table, which we discuss in the following section.

### Row Paths

Row paths, in isolation, describe horizontal slices of our table. We can see all the valid row paths (including an optional "root" beginning value which is technically correct but not necessary to include) via `row_paths_summary`.

```{r}
rpsummry <- row_paths_summary(tbl)
```

In addition to displaying a nicely formatted summary, this returns a `data.frame`
containing the same information in a programmatically accessible form. In particular,
`path` is a list-valued column whose values can be used directly as row paths:

```{r}
head(rpsummry)

tbl[rpsummry$path[[6]], ]
```

The `c("RACE", "ASIAN")` row path refers to the horizontal slice of our table
containing all rows that represent analysis on Asian patients. We see that we get the expected subtable:


```{r}
tbl[c("RACE", "ASIAN"), ]
```

Similarly, we can get the group summary and analysis rows for stratum B of White patients via


```{r}
tbl[c("RACE", "WHITE", "STRATA1", "B"), ]
```

Notice this is a strict subtable in the structural sense, which means we do not get
the race-level group summary here. We can see this because our structure and our path
now start with `"B"`:


```{r}
table_structure(tbl[c("RACE", "WHITE", "STRATA1", "B"), ])
```

As mentioned above, to "path into" a group summary we use the `"@content"` directive:

```{r}
tbl[c("RACE", "ASIAN", "@content"), ]
```

We can path down to analysis tables and then individual rows via their name, which unlike other structures
tends to be identical to their label:

```{r}
tbl[c("RACE", "WHITE", "STRATA1", "B", "AGE"), ]
tbl[c("RACE", "WHITE", "STRATA1", "B", "AGE", "Median"), ]
```


### Column Paths

Similar to row paths, we can get information about column paths via `col_paths_summary`:

```{r}
col_paths_summary(tbl)
```

We can then describe vertical slices of our table via these paths (we use `head`, which
subsets via absolute position to limit the amount of output here):


```{r}
head(tbl[, c("ARM", "A: Drug X")])
head(tbl[, c("ARM", "C: Combination", "SEX", "M")])
head(tbl[, c("All", "All")])
```

Note that despite being displayed next to each other, the last two columns of
our table have fundamentally different paths. This is due to `add_overall_col`
adding a **non-nested** additional split rather than adding an additional implicit
combination level to the `ARM` split.
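
Pairing a row path with a column path pins down a single cell. The chunk below is a minimal sketch on a deliberately small table, separate from the one used above; `cell_afun`, `lyt_cell`, `tbl_cell`, and `cell_val` are illustrative names chosen here and are not part of `rtables`.

```{r}
library(rtables)

## illustrative analysis function: a single row named "Mean"
cell_afun <- function(x) in_rows(Mean = rcell(mean(x), format = "xx.x"))

lyt_cell <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("SEX", split_fun = keep_split_levels(c("F", "M"))) |>
  analyze("AGE", afun = cell_afun)

tbl_cell <- build_table(lyt_cell, DM)

## row path + column path = one cell; value_at() returns its raw
## (unformatted) value
cell_val <- value_at(
  tbl_cell,
  rowpath = c("SEX", "F", "AGE", "Mean"),
  colpath = c("ARM", "A: Drug X")
)
cell_val
```

The returned value is the raw mean, before the `"xx.x"` format is applied at rendering time.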

## Uniqueness of Paths

As of version `0.6.13`, `rtables` enforces uniqueness of names within groups
of direct sibling structures in both row and column[^1] space, thus guaranteeing unique paths to every substructure in the table.

In row space, it does this by appending `[k]` to
the names of elements which would otherwise have an identical name to a
previous sibling, where `k` runs over a sequence of integers such that all siblings have unique names. This affects the paths to these substructures[^2], as we see from the informative messages below:


```{r}
lytdup <- basic_table() |>
  analyze("STRATA1") |>
  split_rows_by("STRATA1") |>
  analyze("AGE")

tbldup <- build_table(lytdup, DM)
tbldup
```

```{r}
row_paths_summary(tbldup)
```

This allows us to path to all elements of the row structure, which was not possible
in previous (`<0.6.13`) `rtables` versions:


```{r}
tbldup[c("STRATA1", "A"), ]

tbldup[c("STRATA1[2]", "A"), ]
```

## Wildcards In Paths

Many, though not all, `rtables` functions which accept row or column paths
support the `"*"` path wildcard. Where supported, the wildcard will match
any *name* present at that step in the table structure, leading to (potentially)
multiple matches. Note that `"*"` will never behave as the `"@content"` directive, which
must always be used explicitly.

```{r}
tbl[c("RACE", "*", "STRATA1", "B", "AGE", "Median"), ]
```


Multiple wildcards can appear in a path, with each wildcard applied recursively
within the full combined set of matches from all wildcards earlier in the path.


```{r}
tbl[c("RACE", "*", "STRATA1", "*", "AGE", "Median"), ]
```


Note that while the `[` method does support wildcards, we are only
using that to illustrate the behavior, as the tables resulting from 
using wildcard paths with `[` are generally not going to be very useful.


For row paths (currently the only supported dimension), we can resolve a path with one or more
wildcards into the set of fully specified paths that match it in our table
using the `tt_normalize_row_path` utility function:

```{r}
tt_normalize_row_path(tbl, c("RACE", "*", "STRATA1", "*", "AGE", "Median"))
```

We can also test whether a row path (including those containing wildcards) exists
in our table with `tt_row_path_exists`:

```{r}
tt_row_path_exists(tbl, c("RACE", "*", "STRATA1", "*", "AGE", "Median"))
```

```{r}
tt_row_path_exists(tbl, c("RACE", "*", "STRATA1", "*", "FAKEFAKEFAKE", "Median"))
```


Note also that each `"*"` wildcard matches only *a single step*; there is
currently no directive that searches for a match anywhere in the relevant
(sub)structure.

Thus we get
```{r}
tt_normalize_row_path(tbl, c("*", "Mean"))
```

This is the case even though there are other `"Mean"` elements elsewhere in our row structure.


Though the above utilities don't currently exist for column paths (which are implemented
differently in ways not relevant to end users), generally those mechanisms which support
wildcards in row space and also accept a column path support wildcards for column paths
as well:


```{r}
tbl[, c("ARM", "*", "SEX", "F")]
```

## Operating On Tables With Paths

In addition to subsetting via paths, which as we mentioned is likely to be of limited utility, many aspects of a table can be selectively inspected or changed using paths.

We will explore some of these throughout this section.

### Column Counts

We can set the visibility of column counts for *a set of sibling facets* (though not currently
get it, an oversight that will likely be remedied in a future version):
```{r}
tbl2 <- head(tbl)
facet_colcounts_visible(tbl2, c("ARM", "A: Drug X", "SEX")) <- TRUE
tbl2
```

NB: unlike virtually all other functions which accept paths, `facet_colcounts_visible` accepts the path to ***the parent of the facets you'd like to change the colcount visibility for***. This is because direct siblings cannot have different column count visibilities, so pathing
to individual facets would lead to an invalid table.

We can also get or modify the value of any particular column count (note there is no "s" in the function name here):

```{r}
facet_colcount(tbl2, c("ARM", "A: Drug X", "SEX", "M")) <- 5
tbl2
```

If we need to mix visible and non-visible counts within a direct sibling group,
the best we can do is set one to `NA`, which will leave a blank space there:

```{r}
facet_colcount(tbl2, c("ARM", "A: Drug X", "SEX", "F")) <- NA_integer_
tbl2
```

### Section Dividers

Section dividers are character(s) that are printed in a line after a
particular row or subtable during rendering to differentiate sections
of a table (they are most often, and by default, " " to create a blank line).

```{r}
tbl3 <- tbl

section_div_at_path(tbl3, c("RACE", "*")) <- "*"
section_div_at_path(tbl3, c("RACE", "*", "STRATA1", "B")) <- "+"
tbl3
```

Section dividers have a *least specific to most specific* order of precedence, 
with only the least specific applicable section divider displayed after any
given row. See `?section_div` for more details.

### Sorting Within An `rtables` Table

Sorting rows in a table occurs in a path-specific way. See the sorting section
in the [pruning and sorting vignette](sorting_pruning.html#Sorting) for a 
detailed discussion of this.
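
As a brief, hedged illustration of what "path-specific" means here, the sketch below sorts the `STRATA1` facets within each `RACE` facet by the counts in their content rows. The table is a small stand-in built only for this example (`lyt_sort`, `tbl_sort`, and `tbl_sorted` are illustrative names), and it relies on `sort_at_path` with the `cont_n_allcols` scoring function, both of which are covered in detail in the vignette linked above.

```{r}
library(rtables)

lyt_sort <- basic_table() |>
  split_rows_by("RACE", split_fun = keep_split_levels("ASIAN")) |>
  summarize_row_groups() |>
  split_rows_by("STRATA1") |>
  summarize_row_groups() |>
  analyze("AGE", afun = function(x) in_rows(Mean = rcell(mean(x), format = "xx.x")))

tbl_sort <- build_table(lyt_sort, DM)

## the path names the structure whose children are reordered: here the
## STRATA1 split inside every RACE facet (via the "*" wildcard);
## cont_n_allcols() scores each facet by the count in its content row
tbl_sorted <- sort_at_path(
  tbl_sort,
  path = c("RACE", "*", "STRATA1"),
  scorefun = cont_n_allcols
)
tbl_sorted
```

Only the children of the structure at the given path are reordered; siblings elsewhere in the table keep their original order.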

### Extracting And Modifying Values

We saw that the `[` method interprets character indices as paths. Beyond that,
the `value_at` and `cell_values` getters and setters accept paths as well.
See the [subsetting vignette](subsetting_tables.html).
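
A row path need not end at a single row. In the sketch below, which again uses a small illustrative table rather than the one built earlier (`lyt_vals`, `tbl_vals`, and `age_vals` are names chosen for this example), the row path stops at an `AGE` analysis table, so `cell_values` returns the values for every row beneath it, restricted to the column named by the column path.

```{r}
library(rtables)

lyt_vals <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("SEX", split_fun = keep_split_levels(c("F", "M"))) |>
  analyze("AGE", afun = function(x) {
    in_rows(
      Mean = rcell(mean(x), format = "xx.x"),
      Median = rcell(median(x), format = "xx.x")
    )
  })

tbl_vals <- build_table(lyt_vals, DM)

## row path ends at the AGE table, so both of its rows are captured;
## the column path narrows the result to a single column
age_vals <- cell_values(
  tbl_vals,
  rowpath = c("SEX", "M", "AGE"),
  colpath = c("ARM", "B: Placebo")
)
str(age_vals)
```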

### Footnotes

Pathing can also be used to add referential footnotes to rows, columns,
or cells. This is discussed in the [title and footer](title_footer.html)
and [subsetting](subsetting_tables.html) vignettes.
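
For orientation, a minimal sketch of path-based footnotes follows, using the `fnotes_at_path<-` setter described in those vignettes. The table and the footnote texts are made up for this example (`lyt_fn` and `tbl_fn` are illustrative names).

```{r}
library(rtables)

lyt_fn <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("SEX", split_fun = keep_split_levels(c("F", "M"))) |>
  analyze("AGE", afun = function(x) in_rows(Mean = rcell(mean(x), format = "xx.x")))

tbl_fn <- build_table(lyt_fn, DM)

## a row path alone attaches the footnote to the row
fnotes_at_path(tbl_fn, rowpath = c("SEX", "F", "AGE", "Mean")) <- "Row-level note"

## a row path plus a column path attaches it to a single cell
fnotes_at_path(
  tbl_fn,
  rowpath = c("SEX", "M", "AGE", "Mean"),
  colpath = c("ARM", "A: Drug X")
) <- "Cell-level note"

tbl_fn
```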

## Table Structure And Technical Details

Here we will go into a bit more detail of how layouts, table structure, and pathing
are related. This is largely for informational purposes and most of it will
not be directly relevant to end-users who are simply creating tables.

### Row Structure

`rtables` is row-dominant (as opposed to R's `data.frame`s, which are column
dominant). This means that tables are modeled as a (generalized) collection
of rows, rather than columns. More accurately, a table is modeled as a collection
of children, which can themselves have children, and so on, until ultimately all of the "leaf" children
in the defined tree-graph are individual rows.

We can see this using the `tree_children` function. The table we've been
working with throughout this vignette has two direct children, one containing
all of the structure generated underneath the initial `"RACE"` split, and one
containing the unnested analysis of `"BMRKR1"`:


```{r}
tree_children(tbl)
```

For convenience, we will define a `multi_step_children` function which
recursively retrieves children from the table, and then from those children,
etc. For informational purposes, we will print the "path step" taken each time,
thus building up our path as we descend through the tree structure.

```{r}
multi_step_children <- function(tbl, indices) {
  print(obj_name(tbl))
  ret <- tree_children(tbl)
  for (i in indices) {
    print(obj_name(ret[[i]]))
    ret <- tree_children(ret[[i]])
  }
  ret
}
```


Thus we can see that the first of our table's children has the path `c("root", "RACE")`
and has children for each race in our table (recall the "root" path element is
correct but optional):

```{r}
multi_step_children(tbl, 1)
```

Each of these children under `"RACE"` is a subtable. 

The children under our `BMRKR1` analysis, on the other hand, are rows (in this
case only one row, in fact):

```{r}
multi_step_children(tbl, 2)
```

Within each race subtable, we see a table corresponding to the `STRATA1` split:


```{r}
multi_step_children(tbl, c(1, 1))
```

```{r}
multi_step_children(tbl, c(1, 1, 1))
```


And finally, within each stratum facet is a table representing the analysis of `AGE`:

```{r}
multi_step_children(tbl, c(1, 1, 1, 2))
```

And within each of those `AGE` analysis tables, like our `BMRKR1` top level
analysis table, we have a collection of rows:

```{r}
multi_step_children(tbl, c(1, 1, 1, 2, 1))
```

Thus we see that `analyze` calls create tables (called `ElementaryTable`s) containing
individual rows as children, while `split_rows_by` (and siblings) calls create a
subtable with children that are a table for each facet declared by the split operation:

```{r}
## child is AGE analysis table within RACE->WHITE->STRATA1->A
multi_step_children(tbl, c(1, 2, 1, 1))
```

```{r}
## children are individual rows of that AGE table
multi_step_children(tbl, c(1, 2, 1, 1, 1))
```
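
The positional descent above has a name-based counterpart: `tt_at_path` retrieves the structure a row path points to. The chunk below is a small sketch on a stand-alone table (`lyt_nav`, `tbl_nav`, and `age_tab` are illustrative names, not part of `rtables`), showing that pathing to an analysis lands on an `ElementaryTable` whose children are rows.

```{r}
library(rtables)

lyt_nav <- basic_table() |>
  split_rows_by("SEX", split_fun = keep_split_levels(c("F", "M"))) |>
  analyze("AGE", afun = function(x) in_rows(Mean = rcell(mean(x), format = "xx.x")))

tbl_nav <- build_table(lyt_nav, DM)

## descend by name rather than by position
age_tab <- tt_at_path(tbl_nav, c("SEX", "F", "AGE"))

obj_name(age_tab)
class(age_tab)

## names of the individual rows held by the analysis table
vapply(tree_children(age_tab), obj_name, character(1))
```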


### Label And Group Summary Rows

For technical and historical reasons, label rows and so-called "content rows"
(which are essentially marginal analyses at a non-leaf point in the tree
graph defined by the parent-child relationships discussed above) are 
modeled separately.

Given a (sub)table, the content table (containing the content rows) and label
can be retrieved by `content_table` and `obj_label`, respectively. Note that
`obj_label` returns a string, not a row, as the label row is an internal detail
not currently exposed.

Recall that our `multi_step_children` function returns *the set of children
at a location*, so we must subset one additional time to arrive at a single child:


```{r}
tb <- multi_step_children(tbl, c(1, 1, 1))[[2]] ## second ie B strata
tb

content_table(tb)
```

Typically (i.e., by default) label rows for (sub)tables that have a non-empty
content table are not visible when rendering, but they do still exist:

```{r}
obj_label(tb)
```

Thus we see that:

- `split_rows_by*` layout instructions create a single subtable with the split name, which contains a facet for each value of the split;
- individual `analyze` instructions create a subtable containing individual rows defined by the `afun` used;
- repeated or multi-var `analyze` instructions create a parent subtable with a child for each individual analyzed var, as above; and
- `summarize_row_groups` creates content tables on *the children of* the table for that split.
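
These rules can be read directly off the row paths of a table. The sketch below builds a small stand-alone table (`lyt_struct`, `tbl_struct`, and `struct_paths` are illustrative names) with one split, one group summary, and one analysis, then lists the path of every displayed row using `row_paths`.

```{r}
library(rtables)

lyt_struct <- basic_table() |>
  split_rows_by("SEX", split_fun = keep_split_levels(c("F", "M"))) |>
  summarize_row_groups() |>
  analyze("AGE", afun = function(x) in_rows(Mean = rcell(mean(x), format = "xx.x")))

tbl_struct <- build_table(lyt_struct, DM)

## one path per displayed row, in display order
struct_paths <- row_paths(tbl_struct)
vapply(struct_paths, paste, character(1), collapse = " -> ")
```

Paths for group summary rows pass through `"@content"`, while paths for analysis rows pass through the name of the analysis table (`"AGE"`) before reaching the row name.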


### Column Structure

For largely historical reasons, and due to the fact that the `rtables`
object model is row-dominant, the exact way that column structure is
modeled is an arcane implementation detail not useful to end users
(much more so than the row structure explored above). Thus we will
largely gloss over it here.

For our purposes here it suffices to say that the analog of the
subtables representing split instructions are implicit in column
space after the first split, as opposed to explicit as we saw
them to be in row space. That said, the relationship between layout
instructions and resulting paths in the table remains valid and useful.

We can see this by looking again at our column paths summary:

```{r}
col_paths_summary(tbl)
```

Column paths have a more rigid structure than their row-based counterparts.
Because column space has no analog to `analyze` layout instructions,
all paths corresponding to facets or individual columns come
in the form of one or more pairs of the form (split name, split value).

Virtually all useful column paths will be of the form above. The only exception
to this is when setting column count visibility to a set of facets via
`facet_colcounts_visible<-`, for which we path to the implicit parent
structure whose children are the facets we are interested in. We saw
this in action in the previous section.

To summarize, as for row space, the relationship between layout
instructions and column paths is as follows: column split instructions
create structures pathable via (split name, split value/facet name)
pairs. Because there is no `analyze` analog for column splitting,
this paradigm is sufficient to understand and predict all column paths.
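
The pair structure can be checked programmatically with `col_paths`, which returns the path of each individual column. The chunk below is a sketch on a small stand-alone table with two nested column splits (`lyt_cols`, `tbl_cols`, and `leaf_paths` are illustrative names).

```{r}
library(rtables)

lyt_cols <- basic_table() |>
  split_cols_by("ARM") |>
  split_cols_by("SEX", split_fun = keep_split_levels(c("F", "M"))) |>
  analyze("AGE", afun = function(x) in_rows(Mean = rcell(mean(x), format = "xx.x")))

tbl_cols <- build_table(lyt_cols, DM)

## two nested splits give two (split name, facet name) pairs per column
leaf_paths <- col_paths(tbl_cols)
lengths(leaf_paths)

## the split names sit at the odd positions of every path
unique(lapply(leaf_paths, function(p) p[c(TRUE, FALSE)]))
```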


[^1]: In `rtables` `0.6.13` table layouts which would result in non-unique paths in column space will fail to build. This will likely change to be more in line with the behavior in row space in a future release.

[^2]: The result-data.frame / ARD machinery knows to remove these uniquification artifacts, so these modifications to the names will not be reflected there.