Working with the GBIF Backbone
Preamble, Package-Loading, and GBIF API Credential Registering:
## Custom install & load function
install.load.package <- function(x) {
  # install the package from CRAN only if it cannot be loaded yet
  if (!require(x, character.only = TRUE)) {
    install.packages(x, repos = "https://cran.us.r-project.org")
  }
  require(x, character.only = TRUE)
}
## names of packages we want installed (if not installed yet) and loaded
package_vec <- c(
  "rgbif",
  "knitr" # for rmarkdown table visualisations
)
## executing install & load for each package
sapply(package_vec, install.load.package)
## rgbif knitr
## TRUE TRUE
options(gbif_user = "my gbif username")
options(gbif_email = "my registered gbif e-mail")
options(gbif_pwd = "my gbif password")
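Hard-coding credentials in a script makes it easy to leak them when sharing code. The following is a minimal sketch of an alternative, assuming the credentials are stored in the environment variables `GBIF_USER`, `GBIF_EMAIL` and `GBIF_PWD` (for example in a `.Renviron` file). The helper name `register_gbif_credentials` is hypothetical and not part of rgbif:

```r
# Hypothetical helper: copy GBIF credentials from environment variables into
# the R options used above, and report which ones are missing.
register_gbif_credentials <- function() {
  creds <- c(
    gbif_user  = Sys.getenv("GBIF_USER"),
    gbif_email = Sys.getenv("GBIF_EMAIL"),
    gbif_pwd   = Sys.getenv("GBIF_PWD")
  )
  missing <- names(creds)[creds == ""]
  if (length(missing) > 0) {
    warning("Missing GBIF credentials: ", paste(missing, collapse = ", "))
  }
  # only register the credentials that are actually set
  do.call(options, as.list(creds[creds != ""]))
  invisible(missing)
}

register_gbif_credentials()
```

This keeps the script itself free of secrets; the warning lists any credential that still needs to be set before data downloads are attempted.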
Most of the time, GBIF users query data for individual species, so we will establish a comparable use case here. For most of this material, I will be focussing on Lagopus muta - the rock ptarmigan (see below). I have particularly fond memories of these birds flying alongside an uncle of mine and me on a topptur ski trip in Lofoten earlier this year. The species also lends itself well to a demonstration of rgbif functionality.
The ways in which we record and report species identities are arguably more varied than the recorded species identities themselves. For example, while binomial nomenclature is widely adopted across scientific research, the same species may still be referred to via different binomial names with descriptor or subspecies suffixes. In addition, particularly when dealing with citizen science data, species names may not always be recorded according to binomial nomenclature but rather via vernacular names.
The GBIF Backbone Taxonomy circumvents these issues on the data-management side as it assigns unambiguous keys to taxonomic units of interest - these are known as taxonKeys.
Matching what you require to how GBIF indexes its data is therefore vital to ensure you retrieve the data you need accurately and in full. Before we can discover the data themselves, we first need to find the identifiers under which they are indexed.
Finding the taxonKeys
To identify the relevant taxonKeys for our study organism (Lagopus muta), we will use the name_backbone(...) function to match our binomial species name to the GBIF backbone as follows:
sp_name <- "Lagopus muta"
sp_backbone <- name_backbone(name = sp_name)
Let’s look at the output of this function call:
knitr::kable(sp_backbone)
usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | family | genus | species | kingdomKey | phylumKey | classKey | orderKey | familyKey | genusKey | speciesKey | synonym | class | verbatim_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5227679 | Lagopus muta (Montin, 1781) | Lagopus muta | SPECIES | ACCEPTED | 99 | EXACT | Animalia | Chordata | Galliformes | Phasianidae | Lagopus | Lagopus muta | 1 | 44 | 212 | 723 | 9331 | 2473369 | 5227679 | FALSE | Aves | Lagopus muta |
The data frame / tibble returned by the name_backbone(...) function contains important information regarding the confidence and type of match achieved between the input species name and the GBIF backbone. In addition, it lists all relevant taxonKeys. Of particular interest for most use cases are the following columns:
- usageKey: The taxonKey by which this species is indexed in the GBIF backbone.
- matchType: This can be one of:
  - EXACT: binomial input matched 1-1 to backbone
  - FUZZY: binomial input was matched to backbone assuming misspelling
  - HIGHERRANK: binomial input is not a species-level name, but indexes a higher-rank taxonomic group
  - NONE: no match could be made
Let’s extract the usageKey of Lagopus muta in the GBIF backbone for later use in this workshop.
sp_key <- sp_backbone$usageKey
sp_key
## [1] 5227679
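Before reusing a usageKey downstream, it can be worth confirming that the match is what we expect. The sketch below runs offline on a stand-in data frame whose values are copied from the table above. The names `sp_backbone_demo` and `extract_species_key` are hypothetical and not part of rgbif:

```r
# Stand-in for the one-row result of name_backbone("Lagopus muta"),
# reduced to the columns used here (values copied from the table above).
sp_backbone_demo <- data.frame(
  usageKey = 5227679,
  rank = "SPECIES",
  status = "ACCEPTED",
  confidence = 99,
  matchType = "EXACT",
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the usageKey only for a single, exact,
# species-level match; otherwise stop with an informative message.
extract_species_key <- function(backbone) {
  ok <- nrow(backbone) == 1 &&
    identical(backbone$matchType, "EXACT") &&
    identical(backbone$rank, "SPECIES")
  if (!ok) {
    stop("No unambiguous species-level match; inspect the backbone result.")
  }
  backbone$usageKey
}

sp_key_demo <- extract_species_key(sp_backbone_demo)
sp_key_demo
```

Whether to insist on an EXACT match is a design choice; for a curated species list a FUZZY match may be acceptable after manual inspection.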
Resolving Taxonomic Names
Not all species identities are as straightforwardly matched to the GBIF backbone and there is more information stored in the GBIF backbone which may be relevant to users. Here, I would like to spend some time delving further into these considerations.
Matching Input to Backbone
To widen the backbone matching, we can set verbose = TRUE in the name_backbone(...) function, which also returns alternative matches. Doing so for Lagopus muta, we obtain the following:
sp_backbone <- name_backbone(name = sp_name, verbose = TRUE)
knitr::kable(sp_backbone)
usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | family | genus | species | kingdomKey | phylumKey | classKey | orderKey | familyKey | genusKey | speciesKey | synonym | class | verbatim_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5227679 | Lagopus muta (Montin, 1781) | Lagopus muta | SPECIES | ACCEPTED | 99 | EXACT | Animalia | Chordata | Galliformes | Phasianidae | Lagopus | Lagopus muta | 1 | 44 | 212 | 723 | 9331 | 2473369 | 5227679 | FALSE | Aves | Lagopus muta |
It seems that, even with widened backbone matching, Lagopus muta is a precise enough specification to yield a single direct match.
To demonstrate how this widened backbone matching can result in multiple matches, let’s consider Calluna vulgaris - the common heather and my favourite plant:
sp_backbone2 <- name_backbone(name = "Calluna vulgaris", verbose = TRUE)
knitr::kable(t(sp_backbone2))
field | 1 | 2 | 3 | 4 |
---|---|---|---|---|
usageKey | 2882482 | 8208549 | 3105380 | 7918820 |
scientificName | Calluna vulgaris (L.) Hull | Calluna vulgaris Salisb., 1802 | Carlina vulgaris L. | Carlina vulgaris Schur |
canonicalName | Calluna vulgaris | Calluna vulgaris | Carlina vulgaris | Carlina vulgaris |
rank | SPECIES | SPECIES | SPECIES | SPECIES |
status | ACCEPTED | SYNONYM | ACCEPTED | DOUBTFUL |
confidence | 97 | 97 | 70 | 64 |
matchType | EXACT | EXACT | FUZZY | FUZZY |
kingdom | Plantae | Plantae | Plantae | Plantae |
phylum | Tracheophyta | Tracheophyta | Tracheophyta | Tracheophyta |
order | Ericales | Ericales | Asterales | Asterales |
family | Ericaceae | Ericaceae | Asteraceae | Asteraceae |
genus | Calluna | Calluna | Carlina | Carlina |
species | Calluna vulgaris | Calluna vulgaris | Carlina vulgaris | Carlina vulgaris |
kingdomKey | 6 | 6 | 6 | 6 |
phylumKey | 7707728 | 7707728 | 7707728 | 7707728 |
classKey | 220 | 220 | 220 | 220 |
orderKey | 1353 | 1353 | 414 | 414 |
familyKey | 2505 | 2505 | 3065 | 3065 |
genusKey | 2882481 | 2882481 | 3105349 | 3105349 |
speciesKey | 2882482 | 2882482 | 3105380 | 7918820 |
synonym | FALSE | TRUE | FALSE | FALSE |
class | Magnoliopsida | Magnoliopsida | Magnoliopsida | Magnoliopsida |
acceptedUsageKey | NA | 2882482 | NA | NA |
verbatim_name | Calluna vulgaris | Calluna vulgaris | Calluna vulgaris | Calluna vulgaris |
Here, you can see how fuzzy matching has resulted in an erroneous match with a different plant: Carlina vulgaris - the thistle - also a neat plant, but not the one I was after here.
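One way to guard against such erroneous matches is to keep only EXACT matches and resolve synonyms to their accepted key. The following is a minimal sketch, assuming a stand-in data frame (`calluna_demo`, a hypothetical name) with values copied from the table above rather than a live API call:

```r
# Stand-in for name_backbone("Calluna vulgaris", verbose = TRUE),
# reduced to the columns needed for filtering.
calluna_demo <- data.frame(
  usageKey = c(2882482, 8208549, 3105380, 7918820),
  canonicalName = c("Calluna vulgaris", "Calluna vulgaris",
                    "Carlina vulgaris", "Carlina vulgaris"),
  status = c("ACCEPTED", "SYNONYM", "ACCEPTED", "DOUBTFUL"),
  confidence = c(97, 97, 70, 64),
  matchType = c("EXACT", "EXACT", "FUZZY", "FUZZY"),
  acceptedUsageKey = c(NA, 2882482, NA, NA),
  stringsAsFactors = FALSE
)

# Drop fuzzy matches, which here point to a different genus (Carlina).
exact_only <- calluna_demo[calluna_demo$matchType == "EXACT", ]

# Synonyms carry an acceptedUsageKey; accepted names use their own usageKey.
resolved_keys <- ifelse(
  is.na(exact_only$acceptedUsageKey),
  exact_only$usageKey,
  exact_only$acceptedUsageKey
)
resolved_key <- unique(resolved_keys)
resolved_key
```

In this example both exact matches resolve to the same key, 2882482, so the accepted name and its synonym collapse to a single taxon.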
Competing Name Matches
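By horribly misspelling our binomial input, we can provoke matches of type FUZZY (a match achieved despite deviations from the supplied string) or HIGHERRANK (a match indexing a higher rank, such as the genus itself). In the output shown below, the top result is reported as NONE, and the alternatives are genus-level candidates flagged as EXACT or FUZZY: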
knitr::kable(name_backbone("Lagopus mut", verbose = TRUE))
confidence | matchType | synonym | usageKey | scientificName | canonicalName | rank | status | kingdom | phylum | order | family | genus | kingdomKey | phylumKey | classKey | orderKey | familyKey | genusKey | class | acceptedUsageKey | verbatim_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100 | NONE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Lagopus mut |
69 | EXACT | FALSE | 2473369 | Lagopus Brisson, 1760 | Lagopus | GENUS | ACCEPTED | Animalia | Chordata | Galliformes | Phasianidae | Lagopus | 1 | 44 | 212 | 723 | 9331 | 2473369 | Aves | NA | Lagopus mut |
68 | EXACT | TRUE | 3233696 | Lagopus (Gren. & Godr.) Fourr. | Lagopus | GENUS | SYNONYM | Plantae | Tracheophyta | Lamiales | Plantaginaceae | Plantago | 6 | 7707728 | 220 | 408 | 2420 | 3189695 | Magnoliopsida | 3189695 | Lagopus mut |
68 | EXACT | TRUE | 3233247 | Lagopus Bernh. | Lagopus | GENUS | SYNONYM | Plantae | Tracheophyta | Fabales | Fabaceae | Trifolium | 6 | 7707728 | 220 | 1370 | 5386 | 2973363 | Magnoliopsida | 2973363 | Lagopus mut |
68 | EXACT | TRUE | 8132572 | Lagopus Hill | Lagopus | GENUS | SYNONYM | Plantae | Tracheophyta | Fabales | Fabaceae | Trifolium | 6 | 7707728 | 220 | 1370 | 5386 | 2973363 | Magnoliopsida | 2973363 | Lagopus mut |
68 | EXACT | TRUE | 6007644 | Lagopus Reichenbach, 1817 | Lagopus | GENUS | SYNONYM | Animalia | Arthropoda | Lepidoptera | Noctuidae | Callopistria | 1 | 54 | 216 | 797 | 7015 | 8875134 | Insecta | 8875134 | Lagopus mut |
43 | FUZZY | TRUE | 4825684 | Lagomus McEnery, 1859 | Lagomus | GENUS | SYNONYM | Animalia | Chordata | Lagomorpha | Prolagidae | Prolagus | 1 | 44 | 359 | 785 | 5468 | 2436678 | Mammalia | 2436678 | Lagopus mut |
38 | FUZZY | FALSE | 4834660 | Lagodus Pomel, 1852 | Lagodus | GENUS | DOUBTFUL | Animalia | Chordata | NA | NA | Lagodus | 1 | 44 | 359 | NA | NA | 4834660 | Mammalia | NA | Lagopus mut |
To truly see how competing name identifiers can make it difficult to identify the correct usageKey, we must turn away from Lagopus muta. Instead, let us look at Glocianus punctiger - a species of weevil:
knitr::kable(name_backbone("Glocianus punctiger", verbose = TRUE))
usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | family | kingdomKey | phylumKey | classKey | orderKey | familyKey | synonym | class | acceptedUsageKey | genus | species | genusKey | speciesKey | verbatim_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4239 | Curculionidae | Curculionidae | FAMILY | ACCEPTED | 97 | HIGHERRANK | Animalia | Arthropoda | Coleoptera | Curculionidae | 1 | 54 | 216 | 1470 | 4239 | FALSE | Insecta | NA | NA | NA | NA | NA | Glocianus punctiger |
11356251 | Glocianus punctiger (C.R.Sahlberg, 1835) | Glocianus punctiger | SPECIES | SYNONYM | 97 | EXACT | Animalia | Arthropoda | Coleoptera | Curculionidae | 1 | 54 | 216 | 1470 | 4239 | TRUE | Insecta | 1187423 | Rhynchaenus | Rhynchaenus punctiger | 1187150 | 1187423 | Glocianus punctiger |
4464480 | Glocianus punctiger (Gyllenhal, 1837) | Glocianus punctiger | SPECIES | SYNONYM | 97 | EXACT | Animalia | Arthropoda | Coleoptera | Curculionidae | 1 | 54 | 216 | 1470 | 4239 | TRUE | Insecta | 1178810 | Ceuthorhynchus | Ceuthorhynchus punctiger | 8265946 | 1178810 | Glocianus punctiger |
Here, we find that there exist two competing identifiers for Glocianus punctiger in the GBIF backbone, one for each of the competing descriptors (C.R.Sahlberg, 1835 and Gyllenhal, 1837). Both are listed as synonyms, and each points to a different accepted name. To query data for all Glocianus punctiger records, we should thus always use both keys: 11356251 and 4464480.
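Collecting the competing keys can be done programmatically rather than by reading them off the table. The sketch below works on a stand-in data frame (`glocianus_demo`, a hypothetical name) with values copied from the output above:

```r
# Stand-in for name_backbone("Glocianus punctiger", verbose = TRUE),
# reduced to the columns needed here.
glocianus_demo <- data.frame(
  usageKey = c(4239, 11356251, 4464480),
  rank = c("FAMILY", "SPECIES", "SPECIES"),
  status = c("ACCEPTED", "SYNONYM", "SYNONYM"),
  matchType = c("HIGHERRANK", "EXACT", "EXACT"),
  acceptedUsageKey = c(NA, 1187423, 1178810),
  stringsAsFactors = FALSE
)

# Keep the species-level exact matches and drop the family-level fallback.
species_rows <- glocianus_demo[
  glocianus_demo$rank == "SPECIES" & glocianus_demo$matchType == "EXACT",
]

# Keys of the competing name usages, plus the accepted taxa they point to.
query_keys <- species_rows$usageKey
accepted_keys <- unique(species_rows$acceptedUsageKey)

query_keys
accepted_keys
```

Unlike a simple synonym, the two usages here resolve to two different accepted keys, which is why neither key on its own can be assumed to cover all records.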
Matching Names and Backbone for several Species
The above use of name_backbone(...) can be executed for multiple species at once by using the name_backbone_checklist(...) function instead. So let’s do so for our target species as well as my favourite plant:
checklist_df <- name_backbone_checklist(c(sp_name, "Calluna vulgaris"))
knitr::kable(checklist_df)
usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | family | genus | species | kingdomKey | phylumKey | classKey | orderKey | familyKey | genusKey | speciesKey | synonym | class | verbatim_name | verbatim_index |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5227679 | Lagopus muta (Montin, 1781) | Lagopus muta | SPECIES | ACCEPTED | 99 | EXACT | Animalia | Chordata | Galliformes | Phasianidae | Lagopus | Lagopus muta | 1 | 44 | 212 | 723 | 9331 | 2473369 | 5227679 | FALSE | Aves | Lagopus muta | 1 |
2882482 | Calluna vulgaris (L.) Hull | Calluna vulgaris | SPECIES | ACCEPTED | 97 | EXACT | Plantae | Tracheophyta | Ericales | Ericaceae | Calluna | Calluna vulgaris | 6 | 7707728 | 220 | 1353 | 2505 | 2882481 | 2882482 | FALSE | Magnoliopsida | Calluna vulgaris | 2 |
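The verbatim_name column makes it straightforward to map the matched keys back onto a longer list of recorded names. This is a minimal sketch, assuming a stand-in checklist with values copied from the table above; the `recorded_names` vector is an illustrative input, not real data:

```r
# Stand-in for the checklist result shown above (selected columns only).
checklist_demo <- data.frame(
  usageKey = c(5227679, 2882482),
  matchType = c("EXACT", "EXACT"),
  verbatim_name = c("Lagopus muta", "Calluna vulgaris"),
  stringsAsFactors = FALSE
)

# Illustrative input: names as they might appear in a field data sheet,
# including repeats and a different order than the checklist.
recorded_names <- c("Calluna vulgaris", "Lagopus muta", "Calluna vulgaris")

# Look up the usageKey for each recorded name via its verbatim name.
record_keys <- checklist_demo$usageKey[
  match(recorded_names, checklist_demo$verbatim_name)
]
record_keys
```

Names that are absent from the checklist would come back as NA, which gives a simple flag for entries that still need matching.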
Name Suggestions and Lookup
To catch other commonly used or relevant names for a species of interest, you can use the name_suggest(...) function. This is particularly useful when data mining publications or data sets for records which can be grouped to the same species although they might be recorded with different names:
sp_suggest <- name_suggest(sp_name)$data
knitr::kable(t(head(sp_suggest)))
key | 5227679 | 5227684 | 5227713 | 5227686 | 5227681 | 5227692 |
canonicalName | Lagopus muta | Lagopus muta rupestris | Lagopus muta welchi | Lagopus muta townsendi | Lagopus muta kurilensis | Lagopus muta helvetica |
rank | SPECIES | SUBSPECIES | SUBSPECIES | SUBSPECIES | SUBSPECIES | SUBSPECIES |
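To illustrate how such suggestions can be grouped to the same species, the sketch below reduces each suggested name to its binomial. It uses a stand-in data frame (`suggest_demo`, a hypothetical name) with values copied from the table above, and assumes that the first two words of a canonical name form the binomial:

```r
# Stand-in for head(name_suggest("Lagopus muta")$data).
suggest_demo <- data.frame(
  key = c(5227679, 5227684, 5227713, 5227686, 5227681, 5227692),
  canonicalName = c(
    "Lagopus muta", "Lagopus muta rupestris", "Lagopus muta welchi",
    "Lagopus muta townsendi", "Lagopus muta kurilensis",
    "Lagopus muta helvetica"
  ),
  rank = c("SPECIES", rep("SUBSPECIES", 5)),
  stringsAsFactors = FALSE
)

# Keep only the first two words (genus and species epithet) of each name.
suggest_demo$binomial <- sub("^(\\S+ \\S+).*$", "\\1", suggest_demo$canonicalName)

# Count how many suggested names fall under each binomial.
table(suggest_demo$binomial)
```

This simple string rule would not suit names with hybrid markers or other unusual formats, so it is best treated as a first pass.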
To trawl GBIF-mediated data sets for records of a specific species, one may use the name_lookup(...) function:
sp_lookup <- name_lookup(sp_name)$data
knitr::kable(head(sp_lookup))
key | scientificName | nameKey | datasetKey | nubKey | parentKey | parent | kingdom | order | family | species | kingdomKey | orderKey | familyKey | speciesKey | canonicalName | authorship | nameType | taxonomicStatus | rank | origin | numDescendants | numOccurrences | habitats | nomenclaturalStatus | threatStatuses | synonym | phylum | genus | phylumKey | classKey | genusKey | class | taxonID | extinct | constituentKey | publishedIn | basionymKey | basionym | accordingTo | acceptedKey | accepted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
123212008 | Lagopus muta | 5972798 | a5dd063e-f45b-4a54-8b94-8fa3adf7f1e1 | 5227679 | 167183824 | Phasianidae | Animalia | Galliformes | Phasianidae | Lagopus muta | 167183684 | 167183822 | 167183824 | 123212008 | Lagopus muta | SCIENTIFIC | ACCEPTED | SPECIES | SOURCE | 0 | 0 | NA | NA | NA | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
114449074 | Lagopus muta | 5972798 | 3772da2f-daa1-4f07-a438-15a881a2142c | 5227679 | 183207232 | Lagopus | Animalia | Galliformes | Tetraonidae | NA | 183203277 | 183207227 | 183207228 | NA | Lagopus muta | SCIENTIFIC | ACCEPTED | NA | SOURCE | 0 | 0 | NA | NA | NA | FALSE | Chordata | Lagopus | 183205906 | 183206633 | 183207232 | Aves | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
133167086 | Lagopus muta | 5972798 | 47f16512-bf31-410f-b272-d151c996b2f6 | 5227679 | 135274878 | Phasianidae | Animalia | Galliformes | Phasianidae | Lagopus muta | 135274602 | 135274874 | 135274878 | 133167086 | Lagopus muta | SCIENTIFIC | ACCEPTED | SPECIES | SOURCE | 0 | 0 | NA | NA | NA | FALSE | NA | NA | NA | 135274603 | NA | Aves | 1613 | FALSE | NA | NA | NA | NA | NA | NA | NA | |
119341248 | Lagopus muta | 5972798 | 4f1047ac-a19d-41a8-98eb-d968b2548b53 | 5227679 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Lagopus muta | SCIENTIFIC | ACCEPTED | NA | SOURCE | 0 | 0 | NA | NA | NEAR_THREATENED | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
104151733 | Lagopus muta | NA | fab88965-e69d-4491-a04d-e3198b626e52 | 5227679 | 104151714 | Lagopus | Metazoa | Galliformes | Phasianidae | Lagopus muta | 103832354 | 104149839 | 104150497 | 104151733 | Lagopus muta | NA | SCIENTIFIC | ACCEPTED | SPECIES | SOURCE | 0 | 0 | NA | NA | NA | FALSE | Chordata | Lagopus | 103882489 | 104106614 | 104151714 | Aves | 64668 | NA | NA | NA | NA | NA | NA | NA | NA |
177659687 | Lagopus muta | NA | 6b6b2923-0a10-4708-b170-5b7c611aceef | 5227679 | 177659682 | Lagopus | Metazoa | Galliformes | Phasianidae | Lagopus muta | 177651702 | 177659367 | 177659587 | 177659687 | Lagopus muta | NA | SCIENTIFIC | ACCEPTED | SPECIES | SOURCE | 0 | 0 | NA | NA | NA | FALSE | Chordata | Lagopus | 177654008 | 177656782 | 177659682 | Aves | 64668 | NA | NA | NA | NA | NA | NA | NA | NA |
Here, we see clearly that Lagopus muta is recorded slightly differently across the datasets mediated by GBIF (for example, with differing keys and higher taxonomy), but every record carries the same nubKey (5227679), so GBIF can still find them all for us.
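This can be checked directly on the lookup result. The sketch below uses a stand-in data frame (`lookup_demo`, a hypothetical name) holding three columns copied from the table above:

```r
# Stand-in for head(name_lookup("Lagopus muta")$data), selected columns only.
lookup_demo <- data.frame(
  key = c(123212008, 114449074, 133167086, 119341248, 104151733, 177659687),
  nubKey = rep(5227679, 6),
  family = c("Phasianidae", "Tetraonidae", "Phasianidae", NA,
             "Phasianidae", "Phasianidae"),
  stringsAsFactors = FALSE
)

# Each dataset assigns its own key to the name ...
n_dataset_keys <- length(unique(lookup_demo$key))

# ... but all of them point to the same backbone taxon.
backbone_keys <- unique(lookup_demo$nubKey)

# The higher taxonomy differs between datasets, including missing values.
family_counts <- table(lookup_demo$family, useNA = "ifany")

n_dataset_keys
backbone_keys
family_counts
```

Six distinct dataset-level keys collapse to a single nubKey, even though one dataset places the species in Tetraonidae and another reports no family at all.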
Lastly, to gain a better understanding of the variety of vernacular names by which our species is known, we can use the name_usage(..., data = "vernacularNames") function as follows:
sp_usage <- name_usage(key = sp_key, data = "vernacularNames")$data
knitr::kable(head(sp_usage))
taxonKey | vernacularName | language | source | sourceTaxonKey | country | area | preferred |
---|---|---|---|---|---|---|---|
5227679 | Alpenschneehuhn | deu | Multilingual IOC World Bird List, v11.2 | 123212008 | NA | NA | NA |
5227679 | Alpenschneehuhn | deu | Taxon list of animals with German names (worldwide) compiled at the SMNS | 116803956 | DE | NA | NA |
5227679 | Alpenschneehuhn | deu | EUNIS Biodiversity Database | 101137652 | NA | NA | NA |
5227679 | Alpensneeuwhoen | nld | EUNIS Biodiversity Database | 101137652 | NA | NA | NA |
5227679 | Alpensneeuwhoen | nld | Multilingual IOC World Bird List, v11.2 | 123212008 | NA | NA | NA |
5227679 | Fjeldrype | dan | Multilingual IOC World Bird List, v11.2 | 123212008 | NA | NA | NA |
name_usage(...) can be tuned to output different information and its documentation gives a good overview of this. Simply call ?name_usage to look for yourself.
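Because the same vernacular name is often reported by several sources, it can help to reduce the result to unique name and language pairs. This is a minimal sketch on a stand-in data frame (`vernacular_demo`, a hypothetical name) with values copied from the vernacular names table above:

```r
# Stand-in for head(name_usage(key = 5227679, data = "vernacularNames")$data),
# reduced to the name and language columns.
vernacular_demo <- data.frame(
  vernacularName = c("Alpenschneehuhn", "Alpenschneehuhn", "Alpenschneehuhn",
                     "Alpensneeuwhoen", "Alpensneeuwhoen", "Fjeldrype"),
  language = c("deu", "deu", "deu", "nld", "nld", "dan"),
  stringsAsFactors = FALSE
)

# Drop rows that repeat the same name in the same language.
unique_names <- unique(vernacular_demo)

# Sort by language code for easier reading.
unique_names <- unique_names[order(unique_names$language), ]
unique_names
```

The six rows shown above reduce to three distinct names, one each for Danish, German and Dutch.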