BeeBDC: an occurrence data cleaning package

Overview

The reliable implementation of biodiversity data continues to be a challenge for researchers. We present the BeeBDC package, which provides novel and updated functions for flagging, cleaning, visualising, and analysing occurrence datasets. Our package is general and can be applied to any taxon; however, we also provide some functions and data that are specific to bee occurrence data because of the input data they require. We add new functionality and keep the conventions used in other fantastic R packages, especially bdc and CoordinateCleaner, while also removing many dependencies on sp-related packages. Hence, our package name is Bee Biodiversity Data Cleaning (BeeBDC).

We provide a full workflow that uses BeeBDC, bdc, and CoordinateCleaner to clean occurrence data in our Articles page and encourage users to read and also cite this primary publication. For our parallelised implementation of iChao and iNEXT species richness estimations, cite this primary publication.

The BeeBDC vignettes

The BeeBDC vignettes are split into several parts, depending on what you’re hoping to do with the package. So far, they are broadly split into (1) the data cleaning workflow and (2) species richness estimation.

1. Data cleaning is broken into
- 1.1 The full cleaning workflow
- 1.2 A more basic workflow
- 1.3 A bee-data specific workflow to prepare those bee datasets
2. A short, but complete, vignette to estimate species richness

BeeBDC’s structure

The BeeBDC toolkit is intentionally organised using the conventions in bdc and CoordinateCleaner.

Like in the bdc package, we provide a suggested workflow here. While our functions can mostly be run out of order, there are a few exceptions mentioned throughout the documentation. Additionally, many functions require the database_id column that is generated early on in the BeeBDC or bdc workflows. When running very large datasets (e.g., the global bee occurrence dataset), you may require a machine with at least ~32 GB of RAM. However, we do try to provide work-arounds, especially by allowing some functions to be broken into consumable chunks. Paper DOI - https://doi.org/10.1101/2023.06.30.547152; Package GitHub - https://github.com/jbdorey/BeeBDC/
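
The idea of breaking a large dataset into consumable chunks can be sketched in base R as follows. This is only an illustration of the general approach and is not BeeBDC's implementation; the helper processInChunks() and the .southern flag column are hypothetical names made up for this example.

```r
# Hypothetical helper (not part of BeeBDC): apply a function to a data frame
# in row chunks and bind the results back together.
processInChunks <- function(data, chunkSize, fun) {
  # Assign each row to a chunk index: 1, 1, ..., 2, 2, ...
  chunkIndex <- ceiling(seq_len(nrow(data)) / chunkSize)
  chunks <- split(data, chunkIndex)
  # Work on one chunk at a time so only a small piece is processed at once
  processed <- lapply(chunks, fun)
  out <- do.call(rbind, processed)
  rownames(out) <- NULL
  out
}

# Toy data standing in for a large occurrence dataset
occurrences <- data.frame(
  database_id = paste0("id_", 1:10),
  decimalLatitude = seq(-45, 45, length.out = 10)
)

# Example per-chunk operation: flag records in the southern hemisphere
flagged <- processInChunks(
  occurrences,
  chunkSize = 3,
  fun = function(chunk) {
    chunk$.southern <- chunk$decimalLatitude < 0
    chunk
  }
)
```

The row order is preserved, so the output can be compared directly against the input. For the real chunked functions, see BeeBDC::jbd_CfC_chunker() and BeeBDC::jbd_Ctrans_chunker() below.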

Installation

Install BeeBDC

You can install BeeBDC from CRAN or GitHub.

  # Install BeeBDC from CRAN
install.packages("BeeBDC")

  # Or using the development version from GitHub (keeping in mind this may not be as stable)
remotes::install_github("https://github.com/jbdorey/BeeBDC.git", 
                          # To use the development version use "devel"; otherwise choose "main"
                        ref = "devel", force = TRUE)

Install sf and terra

First time using the sf or terra packages?

The first time that you use terra or sf on a new computer you may need to install some dependencies. Try to install the terra and sf packages first but then come back here if that doesn’t work.

Windows:

On Windows, you need to first install Rtools to get a C++ compiler that R can use. You need a recent version of Rtools42 (rtools42-5355-5357).

macOS:

On macOS, you can use MacPorts or Homebrew.

With MacPorts you can do

sudo port install R-terra

With Homebrew, you need to first install GDAL:

brew install pkg-config

brew install gdal

Followed by (note the additional configuration argument needed for Homebrew)
  # Install terra
install.packages("terra", type = "source", configure.args = "--with-proj-lib=$(brew --prefix)/lib/")
  # install sf
install.packages("sf", type = "source", configure.args = "--with-proj-lib=$(brew --prefix)/lib/")

library(terra)
library(sf)

Load the package with:

library(BeeBDC)

Optional packages

Because BeeBDC provides broad functionality that might not be required by all users, some dependencies are optional (but required for some functions). Optional packages can be downloaded prior to starting your workflow, if desired. However, you will be prompted to download these packages if they aren’t already installed when you run those functions. The packages BiocManager and devtools may also be required to download some extra packages.

The package, rnaturalearthhires, is a data package that allows the usage of higher-resolution country maps and is very useful for multiple BeeBDC functions.

  # rnaturalearthhires can be installed using devtools from their github page
if (!require("devtools", quietly = TRUE))
    install.packages("devtools")
  # Install rnaturalearthhires
devtools::install_github("ropensci/rnaturalearthhires")

The package, ComplexHeatmap, is only used for one BeeBDC function (BeeBDC::chordDiagramR()) and is less critical.

  # ComplexHeatmap can be installed using BiocManager
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  # Install ComplexHeatmap
BiocManager::install("ComplexHeatmap")

The package taxadb is used by BeeBDC::taxadbToBeeBDC() to download and transform taxonomy files for any taxon and from multiple providers (e.g., GBIF and ITIS) to work with BeeBDC.

  # taxadb can be installed using install.packages
    install.packages("taxadb")

The packages iNEXT and SpadeR can be downloaded in order to estimate species richness using occurrence data (see the species richness estimation vignette). They are used by the functions BeeBDC::iNEXTwrapper(), BeeBDC::ChaoWrapper(), BeeBDC::ggRichnessWrapper(), and BeeBDC::richnessEstimateR().

  # iNEXT and SpadeR can be installed using install.packages
    install.packages("iNEXT")
    install.packages("SpadeR")
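
If you would like to check which of the optional packages above are already installed before starting, something like the following base R sketch should work. It only reports what is missing and does not install anything.

```r
# Optional packages named in this section
optionalPackages <- c("rnaturalearthhires", "ComplexHeatmap", "taxadb",
                      "iNEXT", "SpadeR")

# requireNamespace() returns TRUE or FALSE without attaching the package
installed <- vapply(optionalPackages, requireNamespace,
                    FUN.VALUE = logical(1), quietly = TRUE)

# Names of the optional packages that still need to be installed
missingPackages <- names(installed)[!installed]
print(missingPackages)
```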

BeeBDC workflow components

1. Data merge

Integrate and merge different datasets from the major data repositories - GBIF, SCAN, iDigBio, the USGS, and ALA.

atlasDownloader() Downloads ALA data and creates a new file in the path to put those data. This function can also request downloads from other atlases (see here). However, it will only send the download to your email and you must do the rest yourself at this point.
repoMerge() Locates data from GBIF, ALA, iDigBio, and SCAN within a directory and reads it in along with its eml metadata.
repoFinder() Find GBIF, ALA, iDigBio, and SCAN files in a directory.
importOccurrences() Looks for and imports the most-recent version of the occurrence data created by the BeeBDC::repoMerge() function.
USGS_formatter() The function finds, imports, formats, and creates metadata for the USGS dataset.
formattedCombiner() Merges the Darwin Core version of the USGS dataset that was created using BeeBDC::USGS_formatter() with the main dataset.
dataSaver() Used at the end of 1.x in the example workflow in order to save the occurrence dataset and its associated eml metadata.

2. Data preparation

The reading in and formatting of the major and minor [bee] occurrence repositories as well as some data modifications. This section is mostly, but not entirely, related to bee occurrence data.

fileFinder() A function which can be used to find files within a user-defined directory based on a user-provided character string.
PaigeIntegrater() Replaces publicly available data with data that has been manually cleaned and error-corrected for use in the paper Chesshire, P. R., Fischer, E. E., Dowdy, N. J., Griswold, T., Hughes, A. C., Orr, M. J., . . . McCabe, L. M. (In Press). Completeness analysis for over 3000 United States bee species identifies persistent data gaps. Ecography.
readr_BeeBDC() Read in a variety of data files that are specific to certain smaller data providers. There is an internal readr function for each dataset and each one of these functions is called by readr_BeeBDC. While these functions are internal, they are displayed in the documentation of readr_BeeBDC for clarity.
idMatchR() This function attempts to match database_ids from a prior bdc or BeeBDC run in order to keep this column somewhat consistent between iterations. However, not all records contain sufficient information for this to work flawlessly.

3. Initial flags

Flagging and carpentry of several, mostly general, data issues. See bdc’s pre-filter for more related functions.

countryNameCleanR() This is a basic function for a user to manually fix some country name inconsistencies.
jbd_CfC_chunker() Because the BeeBDC::jbd_country_from_coordinates() function is very RAM-intensive, this wrapper allows a user to specify chunk-sizes and only analyse a small portion of the occurrence data at a time. The prefix jbd_ is used to highlight the difference between this function and the original bdc::bdc_country_from_coordinates().
jbd_Ctrans_chunker() Because the BeeBDC::jbd_coordinates_transposed() function is very RAM-intensive, this wrapper allows a user to specify chunk-sizes and only analyse a small portion of the occurrence data at a time. The prefix jbd_ is used to highlight the difference between this function and the original bdc::bdc_coordinates_transposed(). These functions will preferably use the countryCode column generated by bdc::bdc_country_standardized().
jbd_coordCountryInconsistent() Compares the stated country name in an occurrence record with the record’s coordinates using rnaturalearth data. The prefix jbd_ is meant to distinguish this function from the original bdc::bdc_coordinates_country_inconsistent(). This function will preferably use the countryCode and country_suggested columns generated by bdc::bdc_country_standardized(); please run it on your dataset prior to running this function.
flagAbsent() Flags occurrences that are “ABSENT” for the occurrenceStatus (or some other user-specified) column.
GBIFissues() This function will flag records which are subject to a user-specified vector of GBIF issues.
flagRecorder() This function is used to save the flag data for your occurrence data as you run the BeeBDC script. It will read and append existing files, if asked to. Your flags should also be saved in the occurrence file itself automatically.
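
As a rough picture of what a flag column looks like, the base R sketch below marks "ABSENT" records in a toy dataset. It is not the code used by BeeBDC::flagAbsent(); the flag column name .occurrenceAbsent and the treatment of missing values as passes are assumptions made for this example. As elsewhere in the workflow, FALSE means that a record failed the test.

```r
# Toy occurrence data with an occurrenceStatus column
occ <- data.frame(
  database_id = c("id_1", "id_2", "id_3", "id_4"),
  occurrenceStatus = c("PRESENT", "ABSENT", NA, "absent"),
  stringsAsFactors = FALSE
)

# TRUE = record passes; FALSE = record is flagged as an absence.
# Missing statuses are treated as passes here (an assumption).
occ$.occurrenceAbsent <- !(toupper(occ$occurrenceStatus) %in% "ABSENT")

# Number of flagged records
sum(!occ$.occurrenceAbsent)
```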

4. Taxonomy

Harmonisation of scientific names against a taxonomy downloaded from taxadb, from the provided Discover Life website’s taxonomic reference, or a custom taxonomy.

taxadbToBeeBDC() Uses taxadb to download a species taxonomy from any of its sources and transforms it into the BeeBDC format, which can then be exported as a .csv or into the R environment to be fed directly into BeeBDC::harmoniseR(). This means that the taxonomy from ANY taxon can be used. See also BeeBDC::beesTaxonomy() for the best global bee taxonomy.
harmoniseR() Uses the Discover Life taxonomy to harmonise bee occurrences and flag those that do not match the checklist. This function could be hijacked to service other taxa if a user matched the format of the beesTaxonomy file. BeeBDC::harmoniseR() prefers to use the names_clean column that is generated by bdc::bdc_clean_names(). While this is not required, you may find better results by running that function on your dataset first.
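
Conceptually, harmonisation is a lookup of each occurrence name against a taxonomy table. The base R sketch below shows only that core idea. The real BeeBDC::harmoniseR() does considerably more, and the two-column taxonomy layout (name, acceptedName) and the flag name .invalidName are assumptions made for this illustration rather than the BeeBDC taxonomy format.

```r
# Hypothetical, simplified taxonomy: every known name maps to an accepted name
taxonomy <- data.frame(
  name = c("Apis mellifera", "Apis mellifica", "Bombus terrestris"),
  acceptedName = c("Apis mellifera", "Apis mellifera", "Bombus terrestris"),
  stringsAsFactors = FALSE
)

# Toy occurrences: one synonym, one accepted name, one misspelling
occ <- data.frame(
  scientificName = c("Apis mellifica", "Bombus terrestris", "Apis melifera"),
  stringsAsFactors = FALSE
)

# Look up each occurrence name in the taxonomy
idx <- match(occ$scientificName, taxonomy$name)
occ$acceptedName <- taxonomy$acceptedName[idx]

# TRUE = name found in the taxonomy; FALSE = no match (flagged)
occ$.invalidName <- !is.na(idx)
```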

5. Space

Flagging of erroneous, suspicious, and low-precision geographic coordinates.

jbd_coordinates_precision() This function flags occurrences where BOTH latitude and longitude values are rounded. This contrasts with the original function, bdc::bdc_coordinates_precision(), which will flag occurrences where only one of latitude OR longitude is rounded. The BeeBDC approach saves occurrences that may have had terminal zeros rounded in one coordinate column.
diagonAlley() A simple function that looks for potential latitude and longitude fill-down errors by identifying consecutive occurrences with coordinates at regular intervals. This is accomplished by using a sliding window with the length determined by minRepeats.
coordUncerFlagR() To use this function, the user must choose a column, probably “coordinateUncertaintyInMeters” and a threshold above which occurrences will be flagged for geographic uncertainty.
countryOutlieRs() This function flags country-level outliers using the checklist provided with this package. For additional context and column names, see beesChecklist.
continentOutlieRs() This function flags continent-level outliers using the checklist provided with this package. This function works much the same as countryOutlieRs(), but at a lower resolution. For additional context and column names, see beesChecklist.
jbd_create_figures() Creates figures (i.e., bar plots, maps, and histograms) reporting the results of data quality tests implemented in the bdc and BeeBDC packages. Works like bdc::bdc_create_figures(), but it allows the user to specify a save path.
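
The difference between flagging when BOTH coordinates are rounded and flagging when EITHER is rounded can be seen in a small base R sketch. This is not the code used by either package; the helper isRounded(), the tolerance, and the output column names are assumptions made for the example.

```r
# Hypothetical helper: TRUE when a value has fewer than ndec decimal places
isRounded <- function(x, ndec = 2) {
  abs(x - round(x, ndec - 1)) < 1e-9
}

occ <- data.frame(
  decimalLatitude  = c(-34.4, -34.40512, -34.0, -27.46794),
  decimalLongitude = c(150.9, 150.9, 150.87321, 153.02809)
)

latRounded <- isRounded(occ$decimalLatitude)
lonRounded <- isRounded(occ$decimalLongitude)

# TRUE = pass, FALSE = flagged
# Flag only when both coordinates are rounded (the approach described above)
occ$passBoth <- !(latRounded & lonRounded)
# Flag when either coordinate is rounded (the stricter alternative)
occ$passEither <- !(latRounded | lonRounded)
```

In this toy dataset, only the first record fails the "both" test, while the first three records fail the "either" test. Records two and three are the kind that may simply have lost terminal zeros in one column.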

6. Time

Flagging and, whenever possible, correction of inconsistent collection dates.

dateFindR() A function made to search other columns for dates and add them to the eventDate column. The function searches the columns locality, fieldNotes, locationRemarks, and verbatimEventDate for the relevant information.
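
The general idea of rescuing dates from other columns can be sketched in base R as below. This toy version only looks for one date format (YYYY-MM-DD) in one column (fieldNotes) and is far simpler than BeeBDC::dateFindR(); the regular expression and the variable names are assumptions made for this example.

```r
occ <- data.frame(
  eventDate = c("2019-03-14", NA, NA),
  fieldNotes = c("netted on flowers",
                 "collected 2005-11-02 at light",
                 "no date recorded"),
  stringsAsFactors = FALSE
)

# Find the first YYYY-MM-DD pattern in each fieldNotes entry
pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
m <- regexpr(pattern, occ$fieldNotes)
found <- rep(NA_character_, nrow(occ))
hasMatch <- !is.na(m) & m > 0
found[hasMatch] <- regmatches(occ$fieldNotes, m)

# Keep only strings that parse as real dates
found[is.na(as.Date(found, format = "%Y-%m-%d"))] <- NA

# Fill eventDate only where it is missing and a date was found
toFill <- is.na(occ$eventDate) & !is.na(found)
occ$eventDate[toFill] <- found[toFill]
```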

7. De-duplication

dupeSummary() This function uses user-specified inputs and columns to identify duplicate occurrence records. Duplicates are identified iteratively and will be tallied up, duplicate pairs clustered, and sorted at the end of the function. The function is designed to work with Darwin Core data with a database_id column, but it is also modifiable to work with other columns.
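
A single pass of duplicate detection on user-chosen columns might look like the base R sketch below. BeeBDC::dupeSummary() works iteratively over several sets of columns and also clusters duplicate pairs, which this sketch does not attempt; the comparison columns chosen here and the flag name .duplicates are assumptions made for the example.

```r
occ <- data.frame(
  database_id = c("a1", "a2", "a3", "a4"),
  scientificName = c("Perdita octomaculata", "Perdita octomaculata",
                     "Centris rhodopus", "Perdita octomaculata"),
  decimalLatitude = c(40.1234, 40.1234, -10.5, 40.1234),
  decimalLongitude = c(-75.5678, -75.5678, -60.25, -75.5678),
  eventDate = c("2001-06-01", "2001-06-01", "1999-01-15", "2002-06-01"),
  stringsAsFactors = FALSE
)

# Columns that must all match for two records to count as duplicates
compareCols <- c("scientificName", "decimalLatitude",
                 "decimalLongitude", "eventDate")

# duplicated() marks the second and later copies; the first copy is kept
isDupe <- duplicated(occ[, compareCols])

# TRUE = keep, FALSE = flagged as a duplicate
occ$.duplicates <- !isDupe
```

Here record a2 is flagged as a duplicate of a1, while a4 is kept because its eventDate differs.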

8. Filtering

manualOutlierFindeR() Uses expert-identified outliers with source spreadsheets that may be edited by users. The function will also use the duplicates file made using BeeBDC::dupeSummary() to identify duplicates of the expert-identified outliers and flag those as well. The function will add a flagging column called .expertOutlier where records that are FALSE are the expert outliers.
summaryFun() Using all flag columns (column names starting with “.”), this function either creates or updates the .summary flag column which is FALSE when ANY of the flag columns are FALSE. Columns can be excluded and removed after creating the .summary column. Additionally, the occurrence dataset can be filtered to only those where .summary = TRUE at the end of the function.
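
The logic of the .summary column can be written in a few lines of base R. This is a sketch of the rule described above, not the source of BeeBDC::summaryFun(); the toy flag columns are made up, and treating NA flags as passes is an assumption.

```r
occ <- data.frame(
  database_id = c("id_1", "id_2", "id_3"),
  .coordinates_empty = c(TRUE, TRUE, FALSE),
  .duplicates = c(TRUE, FALSE, TRUE)
)

# All flag columns start with "."; leave out any existing .summary column
flagCols <- setdiff(grep("^\\.", names(occ), value = TRUE), ".summary")
flagMatrix <- as.matrix(occ[, flagCols, drop = FALSE])

# .summary is FALSE when ANY flag column is FALSE
# (NA flags are ignored, i.e. treated as passes, in this sketch)
occ$.summary <- unname(rowSums(!flagMatrix, na.rm = TRUE) == 0)

# Optionally keep only the records that passed every flag
cleaned <- occ[occ$.summary, ]
```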

9. Figures and tables

chordDiagramR() This function outputs a figure which shows the relative size and direction of occurrence points duplicated between data providers, such as SCAN, GBIF, ALA, etc. This function requires the outputs generated by BeeBDC::dupeSummary().
dupePlotR() Creates a plot with two bar graphs. One shows the absolute number of duplicate records for each data source while the other shows the proportion of records that are duplicated within each data source. This function requires a dataset that has been run through BeeBDC::dupeSummary().
plotFlagSummary() Creates a compound bar plot that shows the proportion of records that pass or fail each flag (rows) and for each data source (columns). The function can also optionally return a point map for a user-specified species when plotMap = TRUE. This function requires that your dataset has been run through some filtering functions so that it can display logical columns starting with “.”.
summaryMaps() Builds an output figure that shows the number of species and the number of occurrences per country. Breaks the data into classes for visualisation. Users may filter data to their taxa of interest to produce figures of interest.
interactiveMapR() Uses the occurrence data (preferably uncleaned) and outputs, to a specified directory, interactive .html maps that can be opened in your browser. The maps can highlight if an occurrence has passed all filtering (.summary == TRUE) or failed at least one filter (.summary == FALSE). This can be modified by first running BeeBDC::summaryFun() to set the columns that you want to be highlighted. It can also highlight occurrences flagged as expert-identified or country outliers.
dataProvTables() This function will attempt to find and build a table of data providers that have contributed to the input data, especially using the ‘institutionCode’ column. It will also look for a variety of other columns to find data providers using an internally set sequence of if-else statements. Hence, this function is quite specific for bee data, but should work for other taxa in similar institutions.
flagSummaryTable() Takes a flagged dataset and returns the total number of fails (FALSE) per flag (columns starting with “.”) and per species. Users may define the column to group the summary by. While it is intended to work with the scientificName column, users may select any grouping column (e.g., country).
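
A per-species count of failed flags, similar in spirit to what flagSummaryTable() reports, can be sketched in base R as follows. The toy data and column names are made up for the example, and the output layout is not the one returned by BeeBDC.

```r
occ <- data.frame(
  scientificName = c("Agapostemon tyleri", "Agapostemon tyleri",
                     "Centris rhodopus", "Centris rhodopus"),
  .coordinates_empty = c(TRUE, FALSE, TRUE, TRUE),
  .duplicates = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Logical matrix of the flag columns (names starting with ".")
flagCols <- grep("^\\.", names(occ), value = TRUE)
fails <- !as.matrix(occ[, flagCols, drop = FALSE])
# Convert TRUE/FALSE fails to 1/0 so that they can be summed
storage.mode(fails) <- "integer"

# Sum the fails per group; swap in another column (e.g. country) to regroup
failTable <- rowsum(fails, group = occ$scientificName)
print(failTable)
```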

10. Species richness estimation

diversityPrepR() Takes your occurrence dataset along with a taxonomy and checklist in order to produce a file that’s ready to be passed into the BeeBDC::richnessEstimateR() function in order to estimate species richness using iChao (non-parametric species richness; BeeBDC::ChaoWrapper()) and iNEXT (Hill numbers; BeeBDC::iNEXTwrapper()) for countries, continents, or the entire globe.
richnessEstimateR() Takes an output dataset from BeeBDC::diversityPrepR() to estimate species richness using iChao (non-parametric species richness; BeeBDC::ChaoWrapper()) and iNEXT (Hill numbers; BeeBDC::iNEXTwrapper()) for countries, continents, and/or the entire globe. Has parallel functionality.
iNEXTwrapper() A wrapper for iNEXT::iNEXT() to interpolate and extrapolate Hill numbers with order q (rarefy species richness). The wrapper has the ability to estimate species richness for multiple sites (or countries) at once and to do this using multiple cores.
ChaoWrapper() A wrapper for SpadeR::ChaoSpecies() to non-parametrically estimate species richness. The wrapper has the ability to estimate species richness for multiple sites (or countries) at once and to do this using multiple cores.
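
To give a feel for non-parametric richness estimation, the sketch below computes the classic Chao1 estimator in base R from the number of species observed once (f1) and twice (f2). Please note that this is NOT the iChao estimator used through SpadeR::ChaoSpecies(), which is more involved; the chao1() helper and the toy counts are made up for illustration.

```r
# Hypothetical helper: classic Chao1 lower-bound richness estimate
chao1 <- function(counts) {
  counts <- counts[counts > 0]
  sObs <- length(counts)   # observed number of species
  f1 <- sum(counts == 1)   # singletons
  f2 <- sum(counts == 2)   # doubletons
  if (f2 > 0) {
    sObs + f1^2 / (2 * f2)
  } else {
    # Bias-corrected form, used when there are no doubletons
    sObs + f1 * (f1 - 1) / 2
  }
}

# Toy data: number of occurrence records per species at one site
recordsPerSpecies <- c(10, 5, 3, 2, 2, 1, 1, 1, 1)

estimate <- chao1(recordsPerSpecies)
print(estimate)
```

With nine observed species, four singletons, and two doubletons, the estimate is 9 + 16 / 4 = 13 species.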

11. Datasets

We provide two full datasets that are downloadable using the two functions below.

beesTaxonomy() Downloads the taxonomic information for the bees of the world. The source of each name is listed under “source”, but most are derived from the Discover Life website. The data will be sourced from the BeeBDC article’s Figshare. Please see also BeeBDC::taxadbToBeeBDC() for the download of any other taxonomy (for any taxa or for bees).
beesChecklist() Downloads the table that contains taxonomic and country information for the bees of the world based on data collated on Discover Life. The data will be sourced from the BeeBDC article’s Figshare.

We further provide five test datasets that are available with BeeBDC.

BeeBDC::bees3sp This test dataset includes 105 random occurrence records from three bee species. The included species are: “Agapostemon tyleri Cockerell, 1917”, “Centris rhodopus Cockerell, 1897”, and “Perdita octomaculata (Say, 1824)”.
BeeBDC::beesRaw A small bee occurrence dataset with flags generated by BeeBDC used to run example script and test functions. For data types, see ColTypeR().
BeeBDC::beesFlagged A small bee occurrence dataset with flags generated by BeeBDC used to run example script and test functions. For data types, see ColTypeR().
BeeBDC::beesCountrySubset A very small bee occurrence dataset with the columns “scientificName” and “country_suggested” and data for four countries, Fiji, Uganda, Vietnam, and Zambia. This is the test dataset for the species richness functions.
There are also two small test datasets of the beesTaxonomy and beesChecklist in the system files of the package that are filtered to include only those species that occur in bees3sp, beesRaw, and beesFlagged. These are accessible as follows but are only used internally for tests.

  # Access the test taxonomy file
system.file("extdata", "testTaxonomy.rda", package="BeeBDC") |> load()
  # View the file
View(testTaxonomy)
  # Access the test checklist file
system.file("extdata", "testChecklist.rda", package="BeeBDC") |> load()
  # View the file
View(testChecklist)

Package info

Package website

See the BeeBDC package website (https://jbdorey.github.io/BeeBDC/reference/index.html) for a detailed explanation of each module.

Getting help

This package is maintained by Dr James B Dorey, Lecturer in Biological Sciences at the University of Wollongong, Australia.

If you encounter a clear bug, please file an issue here. For questions or suggestions, flick us an email (jdorey@uow.edu.au).

Citations

Original paper, dataset, and package citation: Dorey, J.B., Fischer, E.E., Chesshire, P.R., Nava-Bolaños, A., O’Reilly, R.L., Bossert, S., Collins, S.M., Lichtenberg, E.M., Tucker, E., Smith-Pardo, A., Falcon-Brindis, A., Guevara, D.A., Ribeiro, B.R., de Pedro, D., Hung, J.K.-L., Parys, K.A., McCabe, L.M., Rogan, M.S., Minckley, R.L., Velazco, S.J.E., Griswold, T., Zarrillo, T.A., Jetz, W., Sica, Y.V., Orr, M.C., Guzman, L.M., Ascher, J., Hughes, A.C. & Cobb, N.S. (2023) A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Scientific Data, 10, 1–17. https://www.doi.org/10.1038/S41597-023-02626-W
- Figshare live data link: https://doi.org/10.25451/flinders.21709757
Species richness estimation citation: Dorey J. B., Gilpin, A.-M., Johnson, N., Esquerre, D., Hughes, A. C., Ascher, J. S., & Orr, M. C. (2026). Estimating global bee species richness and taxonomic gaps. Nature Communications. https://doi.org/10.1038/s41467-026-69029-4
BeeBDC package citation: Dorey, J. B., O’Reilly, R. L., Bossert, S. & Fischer, E. (2023). BeeBDC: an occurrence data cleaning package. R package version 1.3.4. url: https://github.com/jbdorey/BeeBDC
Discover Life citation (for use of bee taxonomy and checklist): Ascher, J.S. & Pickering, J. (2026) Discover Life bee species guide and world checklist (Hymenoptera: Apoidea: Anthophila). https://www.discoverlife.org/mp/20q?guide=Apoidea_species

This package and its datasets were created with the support, and as a part, of the iDigBees project.