Finding & Downloading Data

Required R Packages & Preparations

First, we install and load the relevant packages in our R session. You will see that I do this in a somewhat peculiar fashion. Why do I use a custom function as seen below instead of the library() function in R? I want my scripts to be transferable across systems (e.g., different computers I work on) and installations of R. Some of those have the packages I need installed while others don’t. The function I coded below, called install.load.package(), takes the name of an R package, checks whether it is installed, and installs it if it isn’t. Subsequently, the function loads the package in question. By applying this function to a vector of package names, we can easily load all packages we need without having to run installation functions separately or accidentally overwriting an existing installation.

For this part of the tutorial, we need these packages:

## Custom install & load function
install.load.package <- function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, repos = "http://cran.us.r-project.org")
  }
  require(x, character.only = TRUE)
}
## names of packages we want installed (if not installed yet) and loaded
package_vec <- c(
  "rgbif", # for access to gbif
  "knitr" # for rmarkdown table visualisations
)
## executing install & load for each package
sapply(package_vec, install.load.package)
## rgbif knitr 
##  TRUE  TRUE
Do not forget that you need to have set your GBIF credentials within R as demonstrated in the setup section.
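If you need to set these credentials for the current session only, you can also do so via environment variables - a minimal sketch (rgbif reads the GBIF_USER, GBIF_PWD, and GBIF_EMAIL variables; the placeholder values below are, of course, hypothetical):

## Set GBIF credentials for the current session only (placeholder values are hypothetical)
Sys.setenv(
  GBIF_USER = "my_gbif_username",
  GBIF_PWD = "my_gbif_password",
  GBIF_EMAIL = "my.email@example.com"
)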

Study Characteristics

To make these exercises more tangible to you, I will walk you through a mock study within which we will find relevant GBIF-mediated data, download this data, load it into R and clean it, and finally join it with relevant abiotic data ready for a subsequent analysis or species distribution modelling exercise (a common use-case of GBIF-mediated data).

Organisms

Let’s first focus on our organisms of interest: conifers and beeches. Why these two? For a few reasons, mainly:

  1. These are sessile organisms, which lend themselves well to species distribution modelling.
  2. These are long-lived species, which lend themselves well to species distribution modelling and its use of climate data across long time-scales.
  3. We are all familiar with at least some of these species.
  4. I think they are neat. They form the backdrop for a lot of wonderful hikes for me - just look at them:

Area

Looking at the world all at once, while often something that macroecologists strive for, is unfeasible for our little seasonal school. Consequently, we need to limit our study area to a region that is relevant and familiar to us. Since the seasonal school is organised within Germany, Germany seems like a good study region. Plus, we know we have conifers and beeches there.

Timeframe

Lastly, we need to settle on a timeframe over which to investigate our study organisms in their environments. Let’s choose a large time-window, both to align with the fact that our species of interest are long-lived and to demonstrate some neat functionality for retrieving non-standard environmental data later: 1970 to 2020 should be fine!

Finding Data with rgbif

We’ve got our study organisms, we know when and where to look for them - now how can GBIF help us find relevant data?

Resolving Taxonomic Names

First, we need to identify the backbone keys by which GBIF indexes data for conifers and beeches. To do so, we make use of the scientific names used in taxonomy and search for Pinaceae and Fagaceae (the conifer and beech families, respectively) within the GBIF backbone taxonomy. We do so with the name_backbone() function:

Pinaceae_backbone <- name_backbone(name = "Pinaceae")
knitr::kable(Pinaceae_backbone) # I do this here for a nice table output in the html page you are looking at
| usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | family | kingdomKey | phylumKey | classKey | orderKey | familyKey | synonym | class | verbatim_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3925 | Pinaceae | Pinaceae | FAMILY | ACCEPTED | 96 | EXACT | Plantae | Tracheophyta | Pinales | Pinaceae | 6 | 7707728 | 194 | 640 | 3925 | FALSE | Pinopsida | Pinaceae |
Pinaceae_key <- Pinaceae_backbone$familyKey

Fagaceae_backbone <- name_backbone(name = "Fagaceae")
knitr::kable(Fagaceae_backbone) # I do this here for a nice table output in the html page you are looking at
| usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | family | kingdomKey | phylumKey | classKey | orderKey | familyKey | synonym | class | verbatim_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4689 | Fagaceae | Fagaceae | FAMILY | ACCEPTED | 96 | EXACT | Plantae | Tracheophyta | Fagales | Fagaceae | 6 | 7707728 | 220 | 1354 | 4689 | FALSE | Magnoliopsida | Fagaceae |
Fagaceae_key <- Fagaceae_backbone$familyKey
We now have the keys by which GBIF indexes the relevant taxonomic families for our use-case!
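As an aside, if you ever need to resolve more than a handful of names, recent versions of rgbif also offer name_backbone_checklist(), which matches a whole vector of names in one call - a minimal sketch:

## Match several names against the GBIF backbone in one call
tree_backbones <- name_backbone_checklist(c("Pinaceae", "Fagaceae"))
tree_backbones$usageKey # the family-level keys, in input order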

Data Discovery

How do we know whether GBIF even mediates any data for the taxonomic families we are interested in within our study area and across our time frame of interest? Well, we could query a download from GBIF; however, doing so takes time. Alternatively, we can use the occ_search() or occ_count() functions to get a much, much faster overview of data availability. Since occ_search() all too easily invites sub-optimal data download practices, let’s focus on occ_count() for now and forget the other function even exists (personally, I have never found a reason to use occ_search() over occ_count() for the purpose of data discovery).

Let’s build the data discovery call step by step for Pinaceae and then apply the same to the Fagaceae:

Notice that the document you are looking at is frozen in time and the exact numbers you will obtain running the code below will most definitely differ from the ones shown here as new data is added to GBIF.
  1. How many observations does GBIF mediate for Pinaceae?
occ_count(familyKey = Pinaceae_key)
## [1] 4958280
  2. How many observations does GBIF mediate for Pinaceae in Germany?
occ_count(
  familyKey = Pinaceae_key,
  country = "DE" # ISO 3166-1 alpha-2 country code
)
## [1] 313109
  3. How many observations does GBIF mediate for Pinaceae in Germany between 1970 and 2020?
occ_count(
  familyKey = Pinaceae_key,
  country = "DE",
  year = "1970,2020" # year-span identified by comma between start- and end-year
)
## [1] 238098
You can now discover data mediated by GBIF according to the common characteristics by which ecologists query data through GBIF! Nevertheless, there are additional arguments by which you might want to refine your search further; the documentation of occ_count() is a good place to start looking for those.
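For example, assuming your version of rgbif supports passing the basisOfRecord argument through occ_count() (recent versions do), we could count only human observations - a sketch:

## Count only human observations of Pinaceae in Germany, 1970 to 2020
occ_count(
  familyKey = Pinaceae_key,
  country = "DE",
  year = "1970,2020",
  basisOfRecord = "HUMAN_OBSERVATION"
)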

Finally, let’s look at how much data we can obtain for Fagaceae and also for Pinaceae and Fagaceae at the same time:

occ_count(
  familyKey = Fagaceae_key,
  country = "DE",
  year = "1970,2020"
)
## [1] 230461
occ_count(
  familyKey = paste(Fagaceae_key, Pinaceae_key, sep = ";"), # multiple entries that don't indicate a span or series are separated with a semicolon
  country = "DE",
  year = "1970,2020"
)
## [1] 468559

Who would have thought? If we add the number of observations available for Pinaceae to those available for Fagaceae, we get the number of observations available for Pinaceae and Fagaceae combined.
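We can verify this quickly in R (remember that live counts will have moved on from the frozen numbers shown here):

## Sanity check: the two family-level counts should sum to the combined count
occ_count(familyKey = Pinaceae_key, country = "DE", year = "1970,2020") +
  occ_count(familyKey = Fagaceae_key, country = "DE", year = "1970,2020")
## [1] 468559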

Downloading Data with rgbif

This is how you should obtain data for publication-level research and reports.

To ensure reproducibility and data richness of our downloads, we must make a formal download query to the GBIF API and await processing of our data request - an asynchronous download. To do so, we need to do three things in order:

  1. Specify the data we want and request processing and download from GBIF
  2. Download the data once it is processed
  3. Load the downloaded data into R
Time between registering a download request and retrieval of data may vary according to how busy the GBIF API and servers are, as well as the size and complexity of the data requested. GBIF will only handle a maximum of three download requests per user simultaneously.
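If you lose track of how many requests you have in flight, you can list your recent download requests and their processing statuses - a minimal sketch (credentials are read from the environment variables set up earlier):

## List your most recent download requests and inspect their statuses
my_downloads <- occ_download_list(limit = 10)
head(my_downloads$results)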

Let’s tackle these step-by-step.

Data Request at GBIF

To start this process of obtaining GBIF-mediated occurrence records, we first need to ask GBIF to begin processing the data we require. This is done with the occ_download(...) function. Making a data request at GBIF via the occ_download(...) function comes with two important considerations:

  1. occ_download(...) specific syntax
  2. data query metadata

Syntax and Query through occ_download()

The occ_download(...) function - while powerful - requires us to confront a new syntax which translates the download request as we have seen it so far into a form the GBIF API can understand. To do so, we use a host of rgbif functions built around the GBIF predicate DSL (domain-specific language). These functions (with a few exceptions, which you can see by calling the documentation - ?download_predicate_dsl) all take two arguments:

  • key - this is a core term which we want to target for our request
  • value - this is the value for the core term which we are interested in

Finally, the relevant functions and how they relate key to value are:

  • pred(...): equals
  • pred_lt(...): lessThan
  • pred_lte(...): lessThanOrEquals
  • pred_gt(...): greaterThan
  • pred_gte(...): greaterThanOrEquals
  • pred_like(...): like
  • pred_within(...): within (only for geospatial queries, and only accepts a WKT string)
  • pred_notnull(...): isNotNull (only for stating that you want a key to be not null)
  • pred_isnull(...): isNull (only for stating that you want a key to be null)
  • pred_and(...): and (accepts multiple individual predicates, separating them by either “and” or “or”)
  • pred_or(...): or (accepts multiple individual predicates, separating them by either “and” or “or”)
  • pred_not(...): not (negates whatever predicate is passed in)
  • pred_in(...): in (accepts a single key but many values; stating that you want to search for all the values)

Let us use these to query the data we are interested in and add a new parameter into our considerations - we only want human observations of our trees:

res <- occ_download(
  pred_or(
    pred("taxonKey", Pinaceae_key),
    pred("taxonKey", Fagaceae_key)
  ),
  pred("basisOfRecord", "HUMAN_OBSERVATION"),
  pred("country", "DE"),
  pred_gte("year", 1970),
  pred_lte("year", 2020)
)
You just made a data query with GBIF - congratulations!
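As an aside, the two taxonKey predicates combined via pred_or() above could equivalently be expressed with pred_in(), which scales better once you target more than two taxa - a sketch of the same request (not run here, as it would register a second, identical download):

## Equivalent request using pred_in() for both family keys at once
res_alt <- occ_download(
  pred_in("taxonKey", c(Pinaceae_key, Fagaceae_key)),
  pred("basisOfRecord", "HUMAN_OBSERVATION"),
  pred("country", "DE"),
  pred_gte("year", 1970),
  pred_lte("year", 2020)
)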

If you navigate to your downloads tab on the GBIF website, you should now see the data request being processed:

You can expand this view of the request to get more information:

Finally, when the data request is processed and ready for download, you will receive an E-mail telling you so and the view of the request on the webpage will change:

You don’t need to sit idly looking at the GBIF webpage or waiting for the E-mail telling you that your requested data is ready - you can let R do the waiting for you.

Downloading Requested Data

Instead of waiting for the confirmation that our query has been processed, let’s just use an rgbif function to do the waiting for us and then continue on to download the processed data once it is ready:

## Check with GBIF whether the data is ready; this function finishes running when the request is done and returns its metadata
res_meta <- occ_download_wait(res, status_ping = 10, quiet = FALSE)
## Download the data as a .zip file (you can specify a path)
res_get <- occ_download_get(res)
You now have the GBIF-mediated and processed data on your hard drive as a .zip file named after the GBIF request ID.
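Should you return to your analysis in a fresh R session, there is no need to re-run the request - occ_download_get() also accepts the download key of an earlier request directly (the key below is a hypothetical placeholder):

## Retrieve an already-processed download by its key in a later session
## "0000000-000000000000000" is a hypothetical placeholder key
res_get <- occ_download_get("0000000-000000000000000", path = ".", overwrite = TRUE)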

Loading Downloaded Data into R & Saving Data

All that is left to do is to load the data we just downloaded into R, reformat it, and save it to our hard drive in a format that is easier to load and use with R and other software further down the line:

## Load the data into R
res_data <- occ_download_import(res_get)
## Save the data as a .csv file for easy re-use later on
write.csv(res_data, file = "NFDI4Bio_GBIF.csv")

For the sake of reproducibility, I would always recommend you also save the GBIF query object itself. We will use this when discussing how to cite data downloaded via GBIF.

save(res, file = "NFDI4Bio_GBIF.RData")
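In a later session, both objects are then easily restored:

## Restore the saved query object (res) and re-read the occurrence data
load("NFDI4Bio_GBIF.RData")
res_data <- read.csv("NFDI4Bio_GBIF.csv")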
There we go! You now know how to find and download data mediated by GBIF.

Session Info

## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Norwegian Bokmål_Norway.utf8  LC_CTYPE=Norwegian Bokmål_Norway.utf8   
## [3] LC_MONETARY=Norwegian Bokmål_Norway.utf8 LC_NUMERIC=C                            
## [5] LC_TIME=Norwegian Bokmål_Norway.utf8    
## 
## time zone: Europe/Oslo
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.48  rgbif_3.8.1
## 
## loaded via a namespace (and not attached):
##  [1] styler_1.10.3     sass_0.4.9        utf8_1.2.4        generics_0.1.3   
##  [5] xml2_1.3.6        blogdown_1.19     stringi_1.8.4     httpcode_0.3.0   
##  [9] digest_0.6.37     magrittr_2.0.3    evaluate_0.24.0   grid_4.4.0       
## [13] bookdown_0.40     fastmap_1.2.0     R.oo_1.26.0       R.cache_0.16.0   
## [17] plyr_1.8.9        jsonlite_1.8.8    R.utils_2.12.3    whisker_0.4.1    
## [21] crul_1.5.0        urltools_1.7.3    httr_1.4.7        purrr_1.0.2      
## [25] fansi_1.0.6       scales_1.3.0      oai_0.4.0         lazyeval_0.2.2   
## [29] jquerylib_0.1.4   cli_3.6.3         rlang_1.1.4       triebeard_0.4.1  
## [33] R.methodsS3_1.8.2 bit64_4.0.5       munsell_0.5.1     cachem_1.1.0     
## [37] yaml_2.3.10       tools_4.4.0       dplyr_1.1.4       colorspace_2.1-1 
## [41] ggplot2_3.5.1     curl_5.2.2        vctrs_0.6.5       R6_2.5.1         
## [45] lifecycle_1.0.4   stringr_1.5.1     bit_4.0.5         pkgconfig_2.0.3  
## [49] pillar_1.9.0      bslib_0.8.0       gtable_0.3.6      data.table_1.16.0
## [53] glue_1.7.0        Rcpp_1.0.13       xfun_0.47         tibble_3.2.1     
## [57] tidyselect_1.2.1  htmltools_0.5.8.1 rmarkdown_2.28    compiler_4.4.0