---
title: "Introduction to predictsr"
output: html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to predictsr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
cache: true
---

The [`predictsr`](https://github.com/Biodiversity-Futures-Lab/predictsr) package accesses the PREDICTS database (Hudson et al, 2013) from within R, conveniently as a dataframe. It uses the [Natural History Museum Data Portal](https://data.nhm.ac.uk/) to download the latest versions of the PREDICTS database and some related metadata. 

The [PREDICTS](https://www.nhm.ac.uk/our-science/research/projects/predicts/science.html) database comprises over 4 million measurements of species at sites across the world. The data are mostly collected from arthropods (mainly insects), and we cover about 2% of the species known by science. All major terrestrial plant, animal, and fungal groups are covered. There have been 2 public releases of PREDICTS data. The first was in [2016](https://data.nhm.ac.uk/dataset/the-2016-release-of-the-predicts-database-v1-1) and the second was in [2022](https://data.nhm.ac.uk/dataset/release-of-data-added-to-the-predicts-database-november-2022); each consisted of about 3 million and 1 million records, respectively. We include both, as a single dataset, in this package.

To get started, let's load in the database into R and poke around. To do so you will use the `GetPredictsData` function, which pulls in the data from the data portal. I'll repeat the default option below, but you can take either 2016, 2022, or both as the extracts to fetch. It reads in both extracts into a single dataframe:

```{r setup}
predicts <- predictsr::GetPredictsData(extract = c(2016, 2022))
str(predicts)
```

So we can see that there are over 4 million records in the combined PREDICTS extracts. Let's look at a set of summary statistics for the database:

```{r}
taxa <- predicts[
  !duplicated(predicts[, c("Source_ID", "Study_name", "Taxon_name_entered")]),
]
species_counts <- length(
  unique(taxa[taxa$Rank %in% c("Species", "Infraspecies"), "Taxon"])
) +  nrow(taxa[!taxa$Rank %in% c("Species", "Infraspecies"), ])

print(glue::glue(
  "This database has {length(unique(predicts$SS))} studies across ",
  "{length(unique(predicts$SSBS))} sites, in ",
  "{length(unique(predicts$Country))} countries, and with ",
  "{species_counts} species."
))
```

So over 30,000 sites, 101 countries, and 12,892 species in this dataframe! There are a couple of important columns that should be noted. The `SS` column is what we use to identify studies when dealing with PREDICTS data. This is the concantenation of the `Source_ID`, and the `Study_number` columns. Another important identifier is the `SBBS`, the concatenation of `Source_ID`, `Study_number`, `Block` and `Site_number`, which is what we use to identify single sites in the database.

Let's also check the ranges of sample collection in the database:

```{r}
print(glue::glue(
  "Earliest sample collection (midpoint): {min(predicts$Sample_midpoint)}, ",
  "latest sample collection (midpoint): {max(predicts$Sample_midpoint)}"
))
```

## Accessing site-level summaries

We also include access to the site-level summaries from the full release; to get *these data* you will need to use the `GetSitelevelSummaries` function. The function call is very similar to pull in the summaries for the same data as above:

```{r}
summaries <- predictsr::GetSitelevelSummaries(extract = c(2016, 2022))
str(summaries)
```

Investigating the summary data closer we see that there are a number of missing columns between the two dataframes:

```{r}
print(names(predicts)[!(names(predicts) %in% names(summaries))])
```

This is because all of the measurement-level data has been dropped from the dataframe. Now indeed we could try and replicate the creation of the `summaries` dataframe through some {dplyr} operations. These would be the following (roughly):

**Note:** I'm not sure on the summary statistics for this one, there seems to be some differences, and there are a couple sites that don't match up.

```{r}
summaries_rep <- predicts |>
  dplyr::mutate(
    Higher_taxa = paste(sort(unique(Higher_taxon)), collapse = ","),
    N_samples = length(Measurement),
    Rescaled_sampling_effort = mean(Rescaled_sampling_effort),
    .by = SSBS
  ) |>
  dplyr::select(
    dplyr::all_of(names(summaries))
  ) |>
  dplyr::distinct() |>
  dplyr::arrange(SSBS)

summaries_copy <- summaries |>
  subset(SSBS %in% summaries_rep$SSBS) |>
  dplyr::arrange(SSBS)

all.equal(summaries_copy, summaries_rep)
```

## Accessing descriptions of the columns in PREDICTS

As we've seen already, there are 67 (!) columns in the PREDICTS database, and within the NHM Data portal releases, we have included a description of the data that is used in each of these columns. You can access this via the `GetColumnDescriptions` function:

```{r}
descriptions <- predictsr::GetColumnDescriptions()
str(descriptions)
```

So this includes the `Column` name, the resolution of the `Column` (`Applies_to`), whether it is in the PREDICTS extract (`Diversity_extract`), whether it is in the site-level summaries (`Site_extract`), the datatype of the `Column`*, whether it is guaranteed to be nonempty, any additional `Notes`, and any information on the range of values that the `Column` may be expected to take (`Validation`).

## Glossary

To clarify some of the PREDICTS jargon, we include the following table of definitions, from the dataframes we have worked with thus far:

```{r, echo = FALSE, results = 'asis'}
descriptions_sub <-  subset(
  descriptions,
  Column %in% c(
    "Source_ID", "SS", "Block", "SSBS", "Diversity_metric_type", "Measurement"
  )
)

for (i in seq_along(descriptions_sub$Column)) {
  cat(
    paste0(
      "* `", descriptions_sub$Column[i], "` (`", descriptions_sub$Type[i], "`): "
    )
  )
  notes <- descriptions_sub$Notes[i] |>
    (\(s) gsub("\\*", "  -", s))() |>
    (\(s) gsub("\n\n", "  \n", s))() |>
    (\(s) gsub("\n", "  \n", s))()

  if (notes == " ") {
    notes <- "As title."
  }

  cat(notes)
  cat("  \n")
}
```

**Notes:** When we refer to a "study" we typically refer to it from the `SS` that identifies it. Referring to an "extract" simply refers to a PREDICTS database release, either from 2016 or 2022 (or both). For further complete documentation see the SI in Hudson et. al. (2017).

## Notes

The [NHM Data Portal API](https://data.nhm.ac.uk/about/download) has no rate limits so be considerate with your requests. Make sure you save the data somewhere, or use a tool like [`targets`](https://docs.ropensci.org/targets/index.html) to save you from re-running workflows.

## References

Hudson, Lawrence N., et al. "The PREDICTS database: a global database of how local terrestrial biodiversity responds to human impacts." Ecology and evolution 4.24 (2014): 4701-4735. <doi.org/10.1002/ece3.1303>.

Hudson, Lawrence N., et al. "The database of the PREDICTS (projecting responses of ecological diversity in changing terrestrial systems) project." Ecology and evolution 7.1 (2017): 145-188 <doi.org/10.1002/ece3.2579>