---
title: "Disproportionality analysis"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 4
vignette: >
  %\VignetteIndexEntry{Disproportionality analysis}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---
# Introduction
```{r, include=FALSE}

# There are notes about data.table using too many cores when you try to submit a table with data.table-dependencies
# to cran. This is supposed to prevent that note. Also included in tests. 
data.table::setDTthreads(1)

library(pvda)
knitr::opts_chunk$set(echo = FALSE)
```

During the life cycle of a medicinal product, knowledge of the safety profile 
continues to accumulate post market introduction. Adverse event reports 
collected through spontaneous reporting play an instrumental part in this 
development.

Adverse event report databases often cover a wide range of drugs 
and events, beyond what can be feasibly monitored through manual work. 
Disproportionality analysis is thus often used to highlight drug-event-combinations 
of potential interest for further qualitative manual case-by-case assessment. 

The `pvda` package provides three disproportionality estimators: 
the proportional reporting rate (PRR), the reporting odds ratio (ROR) 
and the information component (IC). It takes report level data as input, and
outputs estimates for each drug-event-combination, possibly stratified by some
variable of interest, e.g. age group and/or sex. 

# Executing a disproportionality analysis
The workhorse of this package is the function `da`. The `da`-function takes 
patient level data in a data frame as input (here `drug_event_df`) and produces 
estimates and summary counts as demonstrated below.

```{r}
#| echo = TRUE 

da_1 <- 
  drug_event_df |>
  da()
```

The `drug_event_df` is a simulated data set with drugs named Drug_A - Drug_Z 
and events Event_1 - Event_1000. Further details about the simulation are 
documented in the data object (execute `?df_drug_event`).

## Specifying the column names

The da-function will examine the passed data set for column names contained in
the passed parameter list "df_colnames". The default column names are "report_id", "drug" and "event". If we rename the report-id column in "drug_event_df" we need to specify the new name in the call to da, as shown below:

```{r}
#| echo = TRUE 

da_1 <-
  drug_event_df |> 
  dplyr::rename(reportID = report_id) |> 
  da(df_colnames = list(report_id = "reportID"))
```

## Summarizing and printing results

The 'da_1' object can then be passed to a summary and a print function.

```{r, echo = TRUE, results = FALSE, message = FALSE}
summary(da_1)
```

![](summary_console_printout.png){width=110%}

```{r, echo = TRUE, results = FALSE, message = FALSE}
print(da_1, n=3)
```

![](print_console_printout.png){width=110%}

(Technical note: As printing of colors in the console is not easily reproduced within this vignette , we provide screenshots when applicable. )

## Passing correctly structured input data
The input data needs to be in a report-level format. every line in the data 
frame corresponds to a drug-event pair from a specific report.

In the drug_event_df-example data, the first three rows of drug_event_df are 
from the same report, with report_id = 1. The first row reports a Drug_D and 
an adverse event named Event_5.The next three rows are from another report, 
where for instance drug B has been reported for two different events, event 15 
and event 33. We can ignore the group-column for now.

```{r}
#| echo = TRUE 
drug_event_df[1:6, ]
```

## Extracting the results
For covenience, the output object of da is a list of class "da" containing
two objects, the results ("da_df") and the input parameters ("input_params"). 
At some point you might want to extract the results or the input parameters list, 
here we demonstrate some alternatives: 

```{r}
#| echo = TRUE 
names(da_1)

# Different ways of extracting the results data frame
# da_1$da_df
# da_1[['da_df']]
# da_1 |> purrr::pluck("da_df")

```

## How to run a subgroup analysis?
The `da` function builds a comparator from the passed data, e.g. for `prr` by
excluding the drug of interest from the comparator, sometimes referred to as the 
background. This implies that any subgroup analysis of e.g. specific age groups 
can be achieved by creating a new data frame with the subgroup of interest, 
and then passing it to `da`.  

If a more exhaustive grouping is required, one can provide a grouping variable 
`group_by` to `da`. This executes a call to `da` for each grouping and 
return the results as a single data frame. The grouping could for instance 
be age groups, sex, countries, regions or a combination thereof, but needs to 
be passed as a single variable (character or numeric). 

The drug_event_df contains two groups, in the column named "group". We can pass
"group" as the `group_by` parameter in `df_colnames` as below.

```{r}
#| echo = TRUE

drug_event_df |> 
  head()

da_grouped <- 
  drug_event_df |> 
  da(df_colnames = list(group_by="group")) 

print(da_grouped$da_df)

```

In the output we note the third column is "group", i.e. the column name we passed.
We also note that the drug-event-pairs are kept together, e.g. first and second row
contains the same drug-event-combination, "Drug_Z" and "Event_1". This is to 
simplify comparisons within a drug-event-pair across different groups, e.g. if
the same pattern is seen across both males and females. The ordering of 
the drug-events in the output is made by averaging the selected disproportionality 
estimate (parameter `sort_by`), i.e. the top two rows have the highest
average "ic2.5". 

# Frequently Asked Questions

## How are drug-event pairs occurring several times in one report counted?
If the same drug-event pair occurs several times in one report, these contributes 
to counts only once. That is, an observed count of five means that there were 
five different reports containing the drug-event-pair (at least once). It cannot
be due to a single report containing the same drug-event pair five times due 
to repeated events or data quality issues. This is demonstrated below, where we 
duplicate the first row of drug_event_df and still arrive at an observed count of 1.

```{r}
#| echo = TRUE 
first_row <- drug_event_df[1,]

first_row |> 
  dplyr::bind_rows(first_row) |> 
  da() |> 
  purrr::pluck("da_df") |> 
  dplyr::pull(obs)

```

## How is the expected calculated?
Detailed formulas for PRR, RRR and IC are documented within each function (
"?prr", "?rrr", "?ic"), but overall one can note that all expected counts are 
derived from the same contingency table of those who were and were not exposed 
to the drug, and did and did not experience the adverse event. The sample
used for the contingency table is the data frame passed to the da function. See
also the grouping parameter.

## Where is exp_ic in the output, or what is exp_prr?
A common question regards the output column names, ic is preceded by "exp_prr". 
This is as the information component uses the same expected count as the 
Relative Reporting Rate (RRR), which is perhaps more well known in the 
pharmacovigilance community.

```{r}
#| echo = TRUE 

# Note the fourth column name:
drug_event_df |> 
  da() |> 
  purrr::pluck("da_df") |> 
  dplyr::select(drug:ic97.5) |> 
  print(n=1)

```

## Rule of N
The function `da` has several input parameters. One is the `rule_of_N`, by default set to 3, 
which is sometimes referred to as "rule of three". This sets ROR and PRR-values 
to NA if the observed count is less than the specified N. For completeness, note that the default shrinkage 
in the IC acts as a built in 'rule of 3', (i.e. the shrinkage of +0.5 prevents 
the lower bound to exceed 0 at the default significance level of 95\%).

## Controlling the number of digits
The function `da` has a parameter `number_of_digits` controls the rounding of non-count values in the output, including all expected counts, uncertainty bounds and point estimates.

```{r, echo = TRUE, results = FALSE, message = FALSE}
#| warning = FALSE

drug_event_df |>
  da(
    rule_of_N = 3,
    number_of_digits = 8
  ) |> print(n = 3)
```

![](print_console_many_digits.png)

## Can I use pvda to get IC, PRR or ROR, when I already have the observed and expected counts? 
Yes, you can calculate point estimates with confidence or credibility intervals 
for specific drug-event-combination if you already have the number of observed 
reports, and other required counts such as the database total count, number of reports 
with the exposure and number of reports with the event. This can be helpful
for instance to quality check results, e.g. from papers or other software.

```{r}
#| echo = TRUE
prr(obs = 10, n_drug = 1000, n_event_prr = 200, n_tot_prr = 10000)
ror(a = 10, b = 20, c = 200, d = 10000)
ic(obs = 10, exp = 5)
```

For further details on the specific input parameters, consult the documentation 
of each function. 

## How fast is pvda?
The data.table-package (through "dtplyr") is used for fast execution. A test of 
the execution speed was made using the `vaers` R package, available on 
[gitlab](https://gitlab.com/iembry/vaers.git). The `vaers` package contains 
data from VAERS from years 1990 - 2018, resulting in 4 146 778 rows to
be processed by `da` in `pvda`. Execution time on a regular laptop was less 
than 10 seconds.

```{r, include=FALSE}
# plot_udfs_as_graph <- function(package_name = "pvda"){

# library(pvda)
# package_name <- "pvda"
# ns_names <- getNamespaceExports(package_name)
# ns_names_w_id <- dplyr::bind_cols("id" = seq_along(ns_names), "name" = ns_names)
# edge_output <- list()
#
# for (i in seq_along(ns_names)) {
#   # i = 9
#
#   ns_name_i <- ns_names[i]
#   code_function_i <- body(get(ns_name_i)) |> paste0((as.character()), collapse = "")
#
#   # Find first occurrence of prefix calls in code_function_i
#   ns_calls <- stringr::str_c(ns_names, "(")
#   # ns_calls_with_comma <- stringr::str_c("", ns_names, "(")
#   # Avoid "is.numeric" being counted as ic
#   code_function_i <- gsub(x = code_function_i, pattern = "as.numeric", replacement = "", fixed = T)
#   # str_replace didn't manage this.
#
#   first_call_pos <- stringr::str_locate(
#     stringr::fixed(code_function_i),
#     stringr::fixed(c(ns_calls))
#   )[, 1]
#
#   calls <- na.omit(dplyr::bind_rows(
#     "function_id" = ns_names_w_id$id,
#     "function_name" = ns_names_w_id$name,
#     "pos" = first_call_pos
#   )) |>
#     dplyr::arrange(pos) |>
#     dplyr::select(-pos) |>
#     dplyr::filter(!function_name %in% c("custom_colours", "custom_ggtheme", "sign_lvl_to_quantile_prob"))
#
#   edges <- dplyr::bind_cols(i, calls) |>
#     dplyr::select(-function_name)
#   colnames(edges) <- c("from", "to")
#   edges <- edges |>
#     dplyr::filter(from != to)
#   edge_output[[i]] <- edges
# }
#
# edges <- do.call(rbind.data.frame, edge_output)
# edge_df <- DiagrammeR::create_edge_df(edges$from, edges$to)
#
# nodes_with_edges <- c(unique(edges$from), unique(edges$to))
#
# nodes <- dplyr::bind_cols(
#   "id" = 1:length(ns_names),
#   "label" = ns_names
# )[1:length(ns_names) %in% nodes_with_edges, ]
# node_df <- DiagrammeR::create_node_df(length(nodes$id),
#   label = nodes$label,
#   fontsize = 6,
#   shape = "rectangle",
#   fixedsize = FALSE
# )
#
# node_df$id <- nodes$id
#
#
# graph <- DiagrammeR::create_graph(
#   nodes_df = node_df,
#   edges_df = edge_df
# )
#
# suppressWarnings(DiagrammeR::render_graph(graph,
#   layout = "tree",
#   title = "Overview of package functions"
# ))
```

```{r, echo=FALSE, fig.height = 8, fig.width = 8}
# suppressWarnings(DiagrammeR::render_graph(graph,
#   layout = "tree",
#   title = "Overview of package functions"
# ))
```