---
title: "Quick start guide"
author: "Martin Westgate & Dax Kellie"
date: '2026-02-11'
output:
rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Quick start guide}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
`galah` is an R interface to biodiversity data hosted by the Global Biodiversity
Information Facility ([GBIF](https://www.gbif.org)) and its subsidiary node
organisations. GBIF and its partner nodes collate and store observations of
individual life forms using the ['Darwin Core'](https://dwc.tdwg.org) data
standard.
# Installation
To install from CRAN:
``` r
install.packages("galah")
```
Or install the development version from GitHub:
``` r
install.packages("remotes")
remotes::install_github("AtlasOfLivingAustralia/galah")
```
Load the package
``` r
library(galah)
```
# Configuration
Begin by choosing which organisation you would like `galah` to query,
and providing your registration information for that organisation.
``` r
galah_config(atlas = "GBIF",
username = "user1",
email = "email@email.com",
password = "my_password")
```
The full list of supported queries by organisation is as follows:
# Getting data
`galah` is a `dplyr` extension package; rather than using pipes to amend
a `tibble` in your workspace, you amend a query, which is then sent to your
chosen organisation. These pipes differ from traditional syntax in two ways:
- they begin with a function - usually `galah_call()` - instead of a `tibble`
- they end with one of `dplyr`'s evaluation functions, usually `collect()`
So an example query might be to find the number of records per year:
``` r
galah_config(atlas = "Australia")
galah_call() |> # open a pipe
filter(year >= 2020) |> # choose rows to keep
count(year) |> # count the number of rows
collect() # retrieve query from the server
```
```
## # A tibble: 7 × 2
## year count
##
## 1 2024 11889930
## 2 2023 11007491
## 3 2022 9430065
## 4 2025 9142677
## 5 2021 8695248
## 6 2020 7311836
## 7 2026 309836
```
Or to find the number of categories present in a dataset, for example how many
species are present:
``` r
galah_call() |>
identify("Crinia") |> # filters by taxonomic names
distinct(speciesID) |> # keep only unique values
count() |>
collect()
```
```
## # A tibble: 1 × 1
## count
##
## 1 17
```
You can 'glimpse' a data download before you run it, to check all the data
you need is included:
``` r
galah_call() |>
identify("Eolophus roseicapilla") |>
filter(year == 2010) |>
glimpse() |>
collect()
```
```
## Rows: 21,984
## Columns: 8
## $ taxonConceptID "https://biodiversity.org.au/afd/taxa/9b4ad548-8bb3-486a-ab0a-905506c463ea", "https://biodiversity.org.au…
## $ eventDate 1.272672e+12, 1.289002e+12, 1.291014e+12
## $ scientificName "Eolophus roseicapilla", "Eolophus roseicapilla", "Eolophus roseicapilla"
## $ decimalLatitude -25.98833, -37.83032, -35.41707
## $ decimalLongitude 152.0442, 144.9812, 138.6868
## $ basisOfRecord "HUMAN_OBSERVATION", "HUMAN_OBSERVATION", "HUMAN_OBSERVATION"
## $ dataResourceName "BirdLife Australia, Birdata", "eBird Australia", "eBird Australia"
## $ occurrenceStatus "PRESENT", "ABSENT", "ABSENT"
```
And, once satisfied that your parameters are correct, download the records
themselves:
``` r
galah_call() |>
identify("Eolophus roseicapilla") |>
filter(year == 2010) |>
select(eventDate, decimalLatitude, species) |>
collect()
```
```
## # A tibble: 21,984 × 3
## eventDate decimalLatitude species
##
## 1 NA -36.5 Eolophus roseicapilla
## 2 NA -38.2 Eolophus roseicapilla
## 3 NA -37.0 Eolophus roseicapilla
## 4 NA -37.7 Eolophus roseicapilla
## 5 NA -35.6 Eolophus roseicapilla
## 6 NA -31.1 Eolophus roseicapilla
## 7 NA -38.2 Eolophus roseicapilla
## 8 NA -38.2 Eolophus roseicapilla
## 9 NA -38.2 Eolophus roseicapilla
## 10 NA -38.2 Eolophus roseicapilla
## # ℹ 21,974 more rows
```
This works because many of the functions in `dplyr` are "generic", meaning
it is possible to write extensions that apply them to new object classes.
In our case, `galah_call()` creates a new object class called a
`data_request` for which we have written new extensions. This means that galah
will not interfere with your use of `filter()` and friends on your tibbles.
Supported `dplyr` verbs that modify queries are as follows:
- `arrange.data_request()`
- `count.data_request()`
- `distinct.data_request()`
- `filter.data_request()`
- `glimpse.data_request()`
- `group_by.data_request()`
- `select.data_request()`
- `slice_head.data_request()`
Additional verbs are:
- `apply_profile()`
- `geolocate()` or `st_crop.data_request()`
- `identify.data_request()`
- `unnest()`
It is good practice to download your data in as few steps as possible,
to minimize impacts on the server, and to ensure you can get a single
DOI for your data. See the
[download data reproducibly](download-data-reproducibly.html) vignette
for details.
# Finding information
Building queries using `filter()` requires that you know two things:
- what **fields** (columns) are present in the dataset you are searching
- what **values** exist for those fields
Finding this information requires looking for metadata:
``` r
request_metadata(type = "fields") |>
collect()
```
```
## # A tibble: 639 × 3
## id description type
##
## 1 abcdTypeStatus fields
## 2 acceptedNameUsage Accepted name fields
## 3 acceptedNameUsageID Accepted name fields
## 4 accessRights Access rights fields
## 5 annotationsDoi fields
## 6 annotationsUid Referenced by publication fields
## 7 assertionUserId Assertions by user fields
## 8 assertions Record issues fields
## 9 assertionsCount fields
## 10 associatedMedia Associated Media fields
## # ℹ 629 more rows
```
You can browser this tibble using `View()` or search it using `filter()`.
Once you have found a field that you want to include in your query, you
can find values for that field using `unnest()`:
``` r
request_metadata() |>
filter(fields == "cl22") |>
unnest() |>
collect()
```
```
## # A tibble: 11 × 1
## cl22
##
## 1 New South Wales
## 2 Victoria
## 3 Queensland
## 4 South Australia
## 5 Western Australia
## 6 Northern Territory
## 7 Tasmania
## 8 Australian Capital Territory
## 9 Macquarie Island
## 10 Coral Sea Islands
## 11 Ashmore and Cartier Islands
```
Different types of metadata are available; see `?request_metadata` for
a full list.
# Wrapper functions
While `dplyr` syntax is very flexible, there are cases where it is easier
to simply say the sort of data you want, rather than create a database
query to implement it. For this reason, several common use cases have
their own wrapper functions.
The `atlas_` family of functions act like `collect()`, but enforce
a particular type of data to be returned, such as record counts:
``` r
galah_call() |>
filter(year == 2025) |>
atlas_counts() # note no need for a `count()` function
```
```
## # A tibble: 1 × 1
## count
##
## 1 9142677
```
Or occurrences:
``` r
galah_call() |>
identify("Eolophus roseicapilla") |>
filter(year == 2000,
cl22 == "Australian Capital Territory") |>
atlas_occurrences() |>
print(n = 6)
```
```
## # A tibble: 2,032 × 9
## recordID scientificName taxonConceptID decimalLatitude decimalLongitude eventDate basisOfRecord occurrenceStatus
##
## 1 0026d29f-b6ab-4… Eolophus rose… https://biodi… -35.4 149. 2000-08-07 00:00:00 HUMAN_OBSERV… PRESENT
## 2 0062d446-007b-4… Eolophus rose… https://biodi… -35.3 149. 2000-03-10 00:00:00 HUMAN_OBSERV… PRESENT
## 3 00a62ee0-1e08-4… Eolophus rose… https://biodi… -35.2 149. 2000-01-29 00:00:00 HUMAN_OBSERV… PRESENT
## 4 00ab2f4d-326f-4… Eolophus rose… https://biodi… -35.4 149. 2000-09-25 00:00:00 HUMAN_OBSERV… PRESENT
## 5 00ae4631-ea59-4… Eolophus rose… https://biodi… -35.3 149. 2000-02-12 00:00:00 HUMAN_OBSERV… PRESENT
## 6 00b6c8ec-e7b9-4… Eolophus rose… https://biodi… -35.2 149. 2000-02-05 00:00:00 HUMAN_OBSERV… PRESENT
## # ℹ 2,026 more rows
## # ℹ 1 more variable: dataResourceName
```
`atlas_species()` replaces the need for `distinct()` call, while `atlas_media()`
is a shortcut to a complex workflow that incorporates both data and metadata
calls. Finally, metadata calls can be made more efficiently using the `show_all()`
and `show_values()` functions. These take the same arguments as the `type`
argument in `request_metadata()`, but use non-standard evaluation, so they
don't require quotes. They are also evaluated immediately rather than lazily:
``` r
show_all(fields)
```
```
## # A tibble: 639 × 3
## id description type
##
## 1 abcdTypeStatus fields
## 2 acceptedNameUsage Accepted name fields
## 3 acceptedNameUsageID Accepted name fields
## 4 accessRights Access rights fields
## 5 annotationsDoi fields
## 6 annotationsUid Referenced by publication fields
## 7 assertionUserId Assertions by user fields
## 8 assertions Record issues fields
## 9 assertionsCount fields
## 10 associatedMedia Associated Media fields
## # ℹ 629 more rows
```
You can check the
[look up information](https://galah.ala.org.au/R/articles/look_up_information.html)
vignette for further details.