BeeBDC logo of a cuckoo bee sweeping up occurrence records in South America

BeeBDC: an occurrence data cleaning package


Overview

The reliable implementation of biodiversity data continues to be a challenge for researchers. We present the BeeBDC package, which provides novel and updated functions for flagging, cleaning, visualising, and analysing occurrence datasets. Our package is general and can be applied to any taxon; however, we also provide some functions and data that are specific to bee occurrence data because of the input data they rely on. We add new functionality while keeping the conventions of other fantastic R packages, especially bdc and CoordinateCleaner, and removing many dependencies on sp-related packages. Hence, our package name is Bee Biodiversity Data Cleaning (BeeBDC).

We provide a full workflow that uses BeeBDC, bdc, and CoordinateCleaner to clean occurrence data in our Articles page and encourage users to read and also cite this primary publication. For our parallelised implementation of iChao and iNEXT species richness estimations, cite this primary publication.

The BeeBDC vignettes

The BeeBDC vignettes are split into several parts, depending on what you’re hoping to do with the package. So far, they are broadly split into (1) the data cleaning workflow and (2) species richness estimation.

  1. Data cleaning is broken into
  2. A short, but complete, vignette to estimate species richness

BeeBDC’s structure

The BeeBDC toolkit is intentionally organized using conventions in bdc and CoordinateCleaner.

Like in the bdc package, we provide a suggested workflow here. While our functions can mostly be run out of order, there are a few exceptions mentioned throughout the documentation. Additionally, many functions require the database_id column that is generated early on in the BeeBDC or bdc workflows. When running very large datasets (e.g., the global bee occurrence dataset), you may require a machine with a large amount of RAM (at least ~32 GB). However, we do try to provide workarounds, especially by allowing some functions to be run in manageable chunks.

Paper DOI - https://doi.org/10.1101/2023.06.30.547152
Package GitHub - https://github.com/jbdorey/BeeBDC/
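The idea of a per-record identifier combined with chunked processing can be illustrated in base R. This is only a minimal sketch and not BeeBDC's own implementation: the helper processInChunks(), its chunkSize argument, the identifier format, and the toy data are all hypothetical; only the database_id column name comes from the workflow described here.

```r
# Toy occurrence table; in practice this could hold millions of rows
occurrences <- data.frame(
  scientificName = c("Apis mellifera", "Bombus terrestris", "Apis mellifera",
                     "Megachile rotundata", "Bombus terrestris"),
  decimalLatitude = c(-34.4, 51.5, -33.9, 40.7, 48.9),
  stringsAsFactors = FALSE
)

  # Add a unique identifier per record (illustrative format only)
occurrences$database_id <- paste0("example_", seq_len(nrow(occurrences)))

  # Hypothetical helper: apply a function to successive row chunks and re-combine
processInChunks <- function(data, chunkSize, fun) {
    # Assign each row to a chunk of at most chunkSize rows
  chunkIndex <- ceiling(seq_len(nrow(data)) / chunkSize)
  chunks <- split(data, chunkIndex)
  results <- lapply(chunks, fun)
  combined <- do.call(rbind, results)
  rownames(combined) <- NULL
  combined
}

  # Example: mark southern-hemisphere records, two rows at a time
flagged <- processInChunks(occurrences, chunkSize = 2, fun = function(chunk) {
  chunk$southern <- chunk$decimalLatitude < 0
  chunk
})
```

Because each chunk is processed independently, this pattern only suits operations that work row by row; steps that compare records across the whole dataset (such as de-duplication) would need a different approach.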

Workflow figure from Dorey et al. 2023

Installation

Install BeeBDC

You can install BeeBDC from CRAN or GitHub.

  # Install BeeBDC from CRAN
install.packages("BeeBDC")

  # Or using the development version from GitHub (keeping in mind this may not be as stable)
remotes::install_github("https://github.com/jbdorey/BeeBDC.git", 
                          # To use the development version use "devel"; otherwise choose "main"
                        ref = "devel", force = TRUE)

Install sf and terra

First time using the sf or terra packages?

The first time that you use terra or sf on a new computer you may need to install some dependencies. Try to install the terra and sf packages first but then come back here if that doesn’t work.

Windows:

On Windows, you need to first install Rtools to get a C++ compiler that R can use. You need a recent version of Rtools42 (rtools42-5355-5357).

macOS:

On macOS, you can use MacPorts or Homebrew.

With MacPorts you can do

sudo port install R-terra

With Homebrew, you need to first install GDAL:

brew install pkg-config

brew install gdal

Followed by (note the additional configuration argument needed for Homebrew)

  # Install terra
install.packages("terra", type = "source", configure.args = "--with-proj-lib=$(brew --prefix)/lib/")
  # install sf
install.packages("sf", type = "source", configure.args = "--with-proj-lib=$(brew --prefix)/lib/")

library(terra)
library(sf)

Load the package with:

library(BeeBDC)

Optional packages

Because BeeBDC provides broad functionality that might not be required by all users, some dependencies are optional (but required for some functions). Optional packages can be downloaded prior to starting your workflow, if desired. However, you will be prompted to download these packages if they aren’t already installed when you run those functions. The packages BiocManager and devtools may also be required to download some extra packages.

  1. The package, rnaturalearthhires, is a data package that allows the usage of higher-resolution country maps and is very useful for multiple BeeBDC functions.
  # rnaturalearthhires can be installed using devtools from their GitHub page
if (!require("devtools", quietly = TRUE))
    install.packages("devtools")
  # Install rnaturalearthhires
devtools::install_github("ropensci/rnaturalearthhires")
  2. The package, ComplexHeatmap, is only used for one BeeBDC function (BeeBDC::chordDiagramR()) and is less critical.
  # ComplexHeatmap can be installed using BiocManager
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  # Install ComplexHeatmap
BiocManager::install("ComplexHeatmap")
  3. The package taxadb is used by BeeBDC::taxadbToBeeBDC() to download and transform taxonomy files for any taxon and from multiple providers (e.g., GBIF and ITIS) to work with BeeBDC.
  # taxadb can be installed using install.packages
install.packages("taxadb")
  4. The packages iNEXT and SpadeR can be downloaded in order to estimate species richness using occurrence data (see the species richness estimation vignette). Implemented in the functions BeeBDC::iNEXTwrapper(), BeeBDC::ChaoWrapper(), BeeBDC::ggRichnessWrapper(), and BeeBDC::richnessEstimateR().
  # iNEXT and SpadeR can be installed using install.packages
install.packages("iNEXT")
install.packages("SpadeR")
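If you would like to see which of these optional packages are already available before starting, a quick check along the following lines may help. This is a sketch using base R only; the variable names are illustrative and it is not part of BeeBDC.

```r
  # Optional packages named in this section
optionalPackages <- c("rnaturalearthhires", "ComplexHeatmap", "taxadb",
                      "iNEXT", "SpadeR")

  # requireNamespace() returns TRUE/FALSE without attaching the package
isInstalled <- vapply(optionalPackages, requireNamespace,
                      FUN.VALUE = logical(1), quietly = TRUE)

  # Report anything that is still missing
missingPackages <- names(isInstalled)[!isInstalled]
if (length(missingPackages) > 0) {
  message("Not yet installed: ", paste(missingPackages, collapse = ", "))
} else {
  message("All optional packages are installed.")
}
```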

BeeBDC workflow components

1. Data merge

Integrate and merge different datasets from the major data repositories - GBIF, SCAN, iDigBio, the USGS, and ALA.

2. Data preparation

The reading in and formatting of the major and minor [bee] occurrence repositories as well as some data modifications. This section is mostly, but not entirely, related to bee occurrence data.

3. Initial flags

Flagging and carpentry of several, mostly general, data issues. See bdc’s pre-filter for more related functions.

4. Taxonomy

Harmonisation of scientific names against a taxonomy downloaded from taxadb, from the provided Discover Life website’s taxonomic reference, or a custom taxonomy.

5. Space

Flagging of erroneous, suspicious, and low-precision geographic coordinates.
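To give a feel for what a coordinate flag looks like, here is a minimal base R sketch. It does not use BeeBDC, bdc, or CoordinateCleaner functions: the flag column names (.exampleCoordsInRange and .exampleNotZeroZero) and the toy data are hypothetical, and we assume the convention that TRUE means a record passes and FALSE means it is flagged. The decimalLatitude and decimalLongitude columns follow Darwin Core naming.

```r
  # Toy occurrence records with some deliberately problematic coordinates
occurrences <- data.frame(
  database_id = paste0("example_", 1:5),
  decimalLatitude  = c(-34.4, 95.0, 0, NA, 40.7),
  decimalLongitude = c(150.9, 10.0, 0, 20.0, -190.0)
)

lat <- occurrences$decimalLatitude
lon <- occurrences$decimalLongitude

  # FALSE where coordinates are missing or outside the valid range
occurrences$.exampleCoordsInRange <- !is.na(lat) & !is.na(lon) &
  abs(lat) <= 90 & abs(lon) <= 180

  # FALSE where a record sits exactly at 0, 0 (often a placeholder value)
occurrences$.exampleNotZeroZero <- !(lat %in% 0 & lon %in% 0)

  # Keep only the records that pass both example flags
cleaned <- occurrences[occurrences$.exampleCoordsInRange &
                         occurrences$.exampleNotZeroZero, ]
```

Keeping flags as separate logical columns, rather than deleting rows immediately, lets you review what each test caught before filtering.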

6. Time

Flagging and, whenever possible, correction of inconsistent collection dates.

7. De-duplication

8. Filtering

9. Figures and tables

10. Species richness estimation

11. Datasets

We provide two full datasets that are downloadable using the below two functions

We further provide five test datasets that are available with BeeBDC.

  # Access the test taxonomy file
system.file("extdata", "testTaxonomy.rda", package="BeeBDC") |> load()
  # View the file
View(testTaxonomy)
  # Access the test checklist file
system.file("extdata", "testChecklist.rda", package="BeeBDC") |> load()
  # View the file
View(testChecklist)
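View() opens a data viewer and is mainly useful in an interactive session (e.g., RStudio). In a script, something like the following sketch may be more convenient. It assumes the same file location as above, uses only base R, and loads into a separate environment so that existing objects are not overwritten; the variable names are illustrative.

```r
  # Locate a bundled test file; system.file() returns "" if it cannot be found
taxonomyPath <- system.file("extdata", "testTaxonomy.rda", package = "BeeBDC")

if (nzchar(taxonomyPath)) {
    # Load into a separate environment so nothing in the workspace is overwritten
  testEnv <- new.env()
  loadedNames <- load(taxonomyPath, envir = testEnv)
    # Print a compact summary of each loaded object
  for (objectName in loadedNames) {
    str(get(objectName, envir = testEnv), max.level = 1)
  }
} else {
  message("BeeBDC (or the test file) was not found; install the package first.")
}
```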

Package info

Package website

See the BeeBDC package website (https://jbdorey.github.io/BeeBDC/reference/index.html) for a detailed explanation of each module.

Getting help

This package is maintained by Dr James B Dorey, Lecturer in Biological Sciences at the University of Wollongong, Australia.

If you encounter a clear bug, please file an issue here. For questions or suggestions, flick us an email (jdorey@uow.edu.au).

Citations

This package and its data sets were created with the support of, and as a part of, the iDigBees project.

The iDigBees logo with a colourful bee and the iDigBees text on the right