--- title: "Quick start guide" author: "Martin Westgate, Dax Kellie, Shandiya Balasubramaniam" date: '2025-05-01' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quick start guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- `galaxias` is an R package that helps users bundle their data into a standardised format optimised for storing, documenting, and sharing biodiversity data. This standardised format is called a [**Darwin Core Archive**](https://ipt.gbif.org/manual/en/ipt/latest/dwca-guide#what-is-darwin-core-archive-dwc-a)---a zip file containing data and metadata that conform to the [Darwin Core Standard](https://dwc.tdwg.org/), the accepted data standard of the [Global Biodiversity Information Facility (GBIF)](https://gbif.org) and its partner nodes (e.g. the Atlas of Living Australia). Sharing Darwin Core Archives with data infrastructures allows data to be reconstructed and aggregated accurately. Let's see how to prepare a Darwin Core Archive using `galaxias`. ```{r, eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE} # load packages now to avoid messages later library(galaxias) library(lubridate) library(dplyr) ``` ## Getting started Here we have an existing R project containing data collected over the course of a research project. Our project uses a fairly standard folder structure. ```{r} #| eval: false #| echo: false #| warning: false #| message: false devtools::load_all() ``` ``` ├── README.md : Description of the repository ├── my-project-name.Rproj : RStudio project file ├── data : Folder to store cleaned data | └── my_data.csv ├── data-raw : Folder to store original/source data | └── my_raw_data.csv ├── plots : Folder containing plots/dataviz └── scripts : Folder with analytic coding scripts ``` Let's see how galaxias can help us to package our data as a Darwin Core Archive. ## Use standardised data in an archive Data that we wish to share are in the `data` folder. They might look something like this: ```{r} #| echo: false #| message: false #| warning: false library(galaxias) library(tibble) my_data <- tibble( latitude = c(-35.310, -35.273), longitude = c(149.125, 149.133), date = c("14-01-2023", "15-01-2023"), time = c("10:23", "11:25"), species = c("Callocephalon fimbriatum", "Eolophus roseicapilla"), location_id = c("ARD001", "ARD001") ) ``` ```{r} #| collapse: true #| comment: "#>" my_data ``` First, we'll need to standardise our data to conform to the Darwin Core Standard. `suggest_workflow()` can help by summarising our dataset and suggesting the steps we should take. ```{r} #| collapse: true #| comment: "#>" my_data |> suggest_workflow() ``` Following the advice of `suggest_workflow()`, we can use the `set_` functions to standardise `my_data`. `set_` functions work a lot like `dplyr::mutate()`: they modify existing columns or create new columns. The suffix of each `set_` function gives an indication of the type of data it accepts (e.g. `set_coordinates()`, `set_scientific_name`), and function arguments are valid Darwin Core terms to use as column names. Each `set_` function also checks to make sure that each column contains valid data according to Darwin Core Standard. ```{r} #| echo: false #| eval: false # I think we can get rid of this chunk? 
```{r}
#| echo: true
#| message: false
#| collapse: true
#| comment: "#>"
library(lubridate)

my_data_dwc <- my_data |>
  # basic requirements of Darwin Core
  set_occurrences(occurrenceID = sequential_id(),
                  basisOfRecord = "humanObservation") |>
  # place and time
  set_coordinates(decimalLatitude = latitude,
                  decimalLongitude = longitude) |>
  set_locality(country = "Australia",
               locality = "Canberra") |>
  set_datetime(eventDate = lubridate::dmy(date),
               eventTime = lubridate::hm(time)) |>
  # taxonomy
  set_scientific_name(scientificName = species,
                      taxonRank = "species") |>
  set_taxonomy(kingdom = "Animalia",
               family = "Cacatuidae")

my_data_dwc
```

You may have noticed that we added some columns that were not included in the advice of `suggest_workflow()` (`country`, `locality`, `taxonRank`, `kingdom`, `family`). We encourage users to specify additional information where possible to avoid ambiguity once their data are shared.

To use our standardised data in a Darwin Core Archive, we select the columns that use valid Darwin Core terms as column names; invalid columns won't be accepted when we build our Darwin Core Archive. Ours is an occurrence-based dataset (each row contains information at the observation level, as opposed to the site/survey level), so we'll select columns that match names in `occurrence_terms()`.

```{r}
#| warning: false
#| message: false
library(dplyr)

my_data_dwc_occ <- my_data_dwc |>
  select(any_of(occurrence_terms()))

my_data_dwc_occ
```

Now we can specify that we wish to use `my_data_dwc_occ` in our Darwin Core Archive with `use_data()`, which saves the dataset in the `data-publish` folder under the correct file name, `occurrences.csv`.

```{r}
#| eval: false
use_data(my_data_dwc_occ)
```

If we look again at our file structure, we find that our data has been added to a new folder:

```
├── README.md
├── my-project-name.Rproj
├── data
|   └── my_data.csv
├── data-publish           : New folder to store data for publication
|   └── occurrences.csv    : Data formatted as per the Darwin Core Standard
├── data-raw
|   └── my_raw_data.csv
├── plots
└── scripts
```

## Add metadata

A critical part of a Darwin Core Archive is a metadata statement: this tells users who owns the data, what the data were collected for, and what uses they can be put to (i.e. a data licence). To get an example statement, call `use_metadata_template()`.

```{r, eval=FALSE}
use_metadata_template()
```

By default, this creates an R Markdown template named `metadata.Rmd` in your working directory. We can edit this template to include information about our dataset, then specify that we wish to use it in our Darwin Core Archive with `use_metadata()`.
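To check how the edits are coming along, one option is to print the first few lines of the template to the console. A minimal sketch using base R (assuming `metadata.Rmd` sits in your working directory):

```{r}
#| eval: false
# Preview the first few lines of the metadata template
readLines("metadata.Rmd") |>
  head(10) |>
  cat(sep = "\n")
```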
```{r, eval=FALSE}
use_metadata("metadata.Rmd")
```

This converts our metadata statement to Ecological Metadata Language (`EML`), the accepted metadata format for Darwin Core Archives, and saves it as `eml.xml` in the `data-publish` folder.

## Build an archive

At the end of the above process, we should have a folder named `data-publish` that contains at least two files:

- One or more `.csv` files containing data (e.g. `occurrences.csv`, `events.csv`, `multimedia.csv`)
- An `eml.xml` file containing our metadata

We can now run `build_archive()` to build our Darwin Core Archive!

```{r eval=FALSE}
build_archive()
```

Running `build_archive()` first checks whether we have a 'schema' document (`meta.xml`) in our `data-publish` folder. This is a machine-readable `xml` document that describes the content and structure of the archive's data files. A schema document is required in a Darwin Core Archive; if it is missing, `build_archive()` will build one. We can also build a schema document ourselves using `use_schema()`.

At the end of this process, you should have a Darwin Core Archive zip file (`dwc-archive.zip`) in your parent directory. You should also have a `data-publish` folder in your working directory containing standardised data files (e.g. `occurrences.csv`), a metadata statement in EML format (`eml.xml`), and a schema document (`meta.xml`).

## Check archive

There are two ways to check whether the contents of your Darwin Core Archive meet the Darwin Core Standard.

The first is to run local checks on the files in the directory that will be used to build the archive. `check_directory()` checks the csv and xml files in that directory against Darwin Core Standard criteria, using the same checking functionality that is built into the `set_` functions. This is especially useful if you standardised your data to Darwin Core terms using functions outside of `galaxias`/`corella`, such as `dplyr::mutate()`.

```{r}
#| eval: false
check_directory()
```

The second is to check whether a complete Darwin Core Archive meets an institution's Darwin Core criteria via an API. For example, we can test an archive against GBIF's API tests.

```{r}
#| eval: false
# Check against the GBIF API
check_archive("dwc-archive.zip",
              email = "your-email",
              username = "your-username",
              password = "your-password")
```

## Publish/share your archive

The final step is to share your completed Darwin Core Archive with a data infrastructure like the Atlas of Living Australia (ALA). To share with the ALA, you can launch our data submission process in your browser by calling:

```{r}
#| eval: false
submit_archive()
```

This function gives you the option to open a GitHub issue where you can attach your archive. We will run the `galaxias` test suite on your dataset and respond as soon as we can. If you'd prefer not to use GitHub, you can send your file and a brief description to [support@ala.org.au](mailto:support@ala.org.au).
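For reference, the full pipeline in this vignette condenses to a handful of calls. This sketch assumes the standardised tibble `my_data_dwc_occ` and an edited `metadata.Rmd` from the steps above:

```{r}
#| eval: false
use_data(my_data_dwc_occ)      # write occurrences.csv to data-publish/
use_metadata("metadata.Rmd")   # convert metadata to eml.xml in data-publish/
check_directory()              # run local Darwin Core checks
build_archive()                # zip data, metadata and schema into dwc-archive.zip
submit_archive()               # start the ALA submission process
```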