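When working with biodiversity data, it is important to verify
taxonomic names against an authoritative list and correct any out-of-date
names. The APCalign package simplifies this process by aligning names to
the Australian Plant Census (APC) and the Australian Plant Name Index
(APNI) and, where possible, updating them to APC-accepted names.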
To demonstrate how to use APCalign, we will use an example dataset,
gbif_lite, which is documented in ?gbif_lite.
dim(gbif_lite)
#> [1] 129 7
gbif_lite |> print(n = 6)
#> # A tibble: 129 × 7
#> species infraspecificepithet taxonrank decimalLongitude decimalLatitude scientificname
#> <chr> <chr> <chr> <dbl> <dbl> <chr>
#> 1 Tetratheca… <NA> SPECIES 145. -37.4 Tetratheca ci…
#> 2 Peganum ha… <NA> SPECIES 139. -33.3 Peganum harma…
#> 3 Calotis mu… <NA> SPECIES 115. -24.3 Calotis multi…
#> 4 Leptosperm… <NA> SPECIES 151. -34.0 Leptospermum …
#> 5 Lepidosper… <NA> SPECIES 142. -37.3 Lepidosperma …
#> 6 Enneapogon… <NA> SPECIES 129. -17.8 Enneapogon po…
#> # ℹ 123 more rows
#> # ℹ 1 more variable: verbatimscientificname <chr>
The first step is to retrieve the entire APC and APNI name databases
and store them locally as taxonomic resources. We achieve this using
load_taxonomic_resources(). The resources are compressed as
parquet files to speed up download and local loading.
There are two versions of the databases that you can retrieve with
the stable_or_current_data argument:
- stable will retrieve the most recent archived version of the
databases from our GitHub releases. This is the default option.
- current will retrieve the up-to-date databases directly from the APC
and APNI website.

Note that the databases are reasonably large, so the initial retrieval
of the core data will take a few minutes. Once the taxonomic resources
have been stored locally, subsequent retrievals will take less time.
Retrieving current resources will always take longer, since it accesses
the latest information from the website in an uncompressed format.
# Benchmarking the retrieval of `stable` or `current` resources
stable_start_time <- Sys.time()
stable_resources <- load_taxonomic_resources(stable_or_current_data = "stable")
#> Loading resources......done
stable_end_time <- Sys.time()
current_start_time <- Sys.time()
current_resources <- load_taxonomic_resources(stable_or_current_data = "current")
#> Loading resources......done
current_end_time <- Sys.time()
# Compare times
stable_end_time - stable_start_time
#> Time difference of 16.48976 secs
For a more reproducible workflow, we recommend specifying the exact
stable version you want to use.
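The examples that follow pass an object called resources. Here is a minimal sketch of pinning a version. It assumes that load_taxonomic_resources() accepts a version argument. The helper name load_pinned_resources and the placeholder version string are hypothetical, so check the package documentation for the releases that are actually available.

```r
# Hypothetical helper: wraps load_taxonomic_resources() so that the stable
# version is always stated explicitly. The `version` argument is an
# assumption here; confirm it against ?load_taxonomic_resources.
load_pinned_resources <- function(version) {
  stopifnot(is.character(version), length(version) == 1)
  APCalign::load_taxonomic_resources(
    stable_or_current_data = "stable",
    version = version
  )
}

# Usage (not run: needs the APCalign package and a network connection).
# Replace the placeholder with a real release tag:
# resources <- load_pinned_resources(version = "<release tag>")
```

Recording the version alongside your analysis means the same name databases can be retrieved again later.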
Now we can query our taxonomic names against the taxonomic resources
we just retrieved using create_taxonomic_update_lookup().
This all-in-one function will align your names to the APC and APNI,
update them to APC-accepted names where possible, and provide a
suggested name, selecting the accepted_name when available, and
otherwise providing an APNI name or a name where only a genus-level
alignment is possible.

If you would like to learn more about each of these steps, take a look at the section Closer look at name alignment and updating with ‘APCalign’.
library(dplyr)
updated_gbif_names <- gbif_lite |>
pull(species) |>
create_taxonomic_update_lookup(resources = resources)
#> Checking alignments of 121 taxa
#> -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked
updated_gbif_names |>
print(n = 6)
#> # A tibble: 129 × 12
#> original_name aligned_name accepted_name suggested_name genus taxon_rank taxonomic_dataset
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Tetratheca c… Tetratheca … Tetratheca c… Tetratheca ci… Tetr… species APC
#> 2 Peganum harm… Peganum har… Peganum harm… Peganum harma… Pega… species APC
#> 3 Calotis mult… Calotis mul… Calotis mult… Calotis multi… Calo… species APC
#> 4 Leptospermum… Leptospermu… Leptospermum… Leptospermum … Lept… species APC
#> 5 Lepidosperma… Lepidosperm… Lepidosperma… Lepidosperma … Lepi… species APC
#> 6 Enneapogon p… Enneapogon … Enneapogon p… Enneapogon po… Enne… species APC
#> # ℹ 123 more rows
#> # ℹ 5 more variables: taxonomic_status <chr>, scientific_name_authorship <chr>,
#> # aligned_reason <chr>, update_reason <chr>, number_of_collapsed_taxa <dbl>
The original_name is the taxon name used in your original data. The
aligned_name is the taxon name we used to link with the APC to identify
any synonyms. The accepted_name is the currently accepted taxon name
used by the Australian Plant Census. The suggested_name is the best
possible name option for the original_name.
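One way to picture how these columns relate is the rule stated elsewhere in this document: suggested_name takes the accepted_name when one is available and otherwise falls back to another name. The sketch below is a simplified illustration in base R, not the package's implementation. The taxon names are placeholders, and the fallback to aligned_name ignores the genus-level updating that the package may also apply.

```r
# Toy lookup table with placeholder names (not real taxa).
lookup_toy <- data.frame(
  original_name = c("Genus alfa", "Genus beta"),
  aligned_name  = c("Genus alpha", "Genus beta"),
  accepted_name = c("Genus alpha", NA),
  stringsAsFactors = FALSE
)

# Prefer the accepted name; fall back to the aligned name when none exists.
lookup_toy$suggested_name <- ifelse(
  is.na(lookup_toy$accepted_name),
  lookup_toy$aligned_name,
  lookup_toy$accepted_name
)

lookup_toy
```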
‘APCalign’ can also provide the state/territory distribution for established status (native/introduced) from the APC.
We can access the established status data by state/territory using
create_species_state_origin_matrix():
# Retrieve status data by state/territory
status_matrix <- create_species_state_origin_matrix(resources = resources)
Here is a breakdown of all possible values for origin:
library(purrr)
library(janitor)
# Obtain unique values
status_matrix |>
select(-species) |>
flatten_chr() |>
tabyl()
#> flatten_chr(select(status_matrix, -species)) n percent
#> doubtfully naturalised 1120 2.371003e-03
#> formerly naturalised 277 5.863998e-04
#> native 40336 8.538997e-02
#> native and doubtfully naturalised 9 1.905270e-05
#> native and naturalised 136 2.879075e-04
#> native and uncertain origin 2 4.233933e-06
#> naturalised 8765 1.855521e-02
#> not present 421606 8.925258e-01
#> presumed extinct 101 2.138136e-04
#> uncertain origin 22 4.657327e-05
You can also obtain the breakdown of species by established status
for a particular state/territory using state_diversity_counts():
state_diversity_counts("NSW", resources = resources)
#> # A tibble: 7 × 3
#> origin state num_species
#> <chr> <chr> <table[1d]>
#> 1 doubtfully naturalised NSW 93
#> 2 formerly naturalised NSW 8
#> 3 native NSW 5958
#> 4 native and doubtfully naturalised NSW 2
#> 5 native and naturalised NSW 34
#> 6 naturalised NSW 1580
#> 7 presumed extinct NSW 8
Using the established status data and state/territory information, we
can check whether a plant taxon is native anywhere in Australia using
native_anywhere_in_australia():
library(dplyr)
updated_gbif_names |>
sample_n(1) |> # Choosing a random species
pull(suggested_name) |> # Extracting the suggested name for this species
native_anywhere_in_australia(resources = resources)
#> # A tibble: 1 × 2
#> species native_anywhere_in_aus
#> <chr> <chr>
#> 1 Solanum prinophyllum considered native to Australia by APC
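The idea behind a "native anywhere" check can be sketched on a toy status matrix in base R. This is an illustration only. The species names and the two state columns are placeholders, and treating any status that begins with "native" as native is an assumption, not necessarily the rule the package applies.

```r
# Toy status matrix: one row per species, one column per state/territory.
status_toy <- data.frame(
  species = c("Genus alpha", "Genus beta"),
  NSW = c("native", "naturalised"),
  Qld = c("not present", "naturalised"),
  stringsAsFactors = FALSE
)

# Every column except `species` holds an established status.
state_cols <- setdiff(names(status_toy), "species")

# A species counts as native somewhere if any state status starts with "native".
is_native_somewhere <- apply(
  status_toy[state_cols], 1,
  function(status) any(grepl("^native", status))
)

data.frame(species = status_toy$species, native_anywhere = is_native_somewhere)
```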
create_taxonomic_update_lookup is a simple wrapper function for novice
users who want to quickly check and standardise taxon names. More
experienced users can take a look at the sub-functions match_taxa(),
align_taxa() and update_taxonomy() to see how taxon names are processed,
aligned and updated.
The function align_taxa will:
- Clean up names: standardise_names, strip_names and strip_names_extra
standardise infraspecific taxon designations and clean up punctuation
and whitespace.
- Determine the taxon_rank to which the name can be resolved, based on
its syntax.
- Format names that can only be resolved to genus as a genus sp. name,
with additional information/notes provided as part of the original name
in square brackets, as in Acacia sp. [skinny leaves] or
Acacia sp. [Broken Hill].
- Record the taxonomic_reference (APC or APNI) of each name-alignment.

Note that align_taxa does not seek to update outdated taxonomy. That
process occurs during the update_taxonomy process. align_taxa instead
aligns each name input to the closest match amongst names documented by
the APC and APNI.
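The kind of cleaning described above (whitespace, punctuation and infraspecific designations) can be sketched in base R. This is a simplified stand-in, not the code behind standardise_names, strip_names or strip_names_extra. The helper name clean_taxon_name and the specific substitutions are assumptions chosen for illustration.

```r
# Hypothetical helper: a minimal name-cleaning pass.
clean_taxon_name <- function(x) {
  x <- trimws(x)             # drop leading/trailing whitespace
  x <- gsub("\\s+", " ", x)  # collapse runs of whitespace to one space
  # Unify subspecies markers ("ssp", "ssp.", "subsp") to "subsp."
  x <- gsub("\\b(ssp|subsp)\\.?(?=\\s)", "subsp.", x, perl = TRUE)
  # Unify variety markers ("var") to "var."
  x <- gsub("\\bvar\\.?(?=\\s)", "var.", x, perl = TRUE)
  x
}

clean_taxon_name("  Acacia   aneura ssp  intermedia ")
```

Cleaning names before matching means that differences in spacing or abbreviation style are less likely to prevent an exact match.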
library(dplyr)
aligned_gbif_taxa <- gbif_lite |>
pull(species) |>
align_taxa(resources = resources)
#> Checking alignments of 121 taxa
#> -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked
aligned_gbif_taxa |>
print(n = 6)
#> # A tibble: 129 × 7
#> original_name cleaned_name aligned_name taxonomic_dataset taxon_rank aligned_reason
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Tetratheca ciliata Tetratheca … Tetratheca … APC species Exact match o…
#> 2 Peganum harmala Peganum har… Peganum har… APC species Exact match o…
#> 3 Calotis multicaulis Calotis mul… Calotis mul… APC species Exact match o…
#> 4 Leptospermum triner… Leptospermu… Leptospermu… APC species Exact match o…
#> 5 Lepidosperma latera… Lepidosperm… Lepidosperm… APC species Exact match o…
#> 6 Enneapogon polyphyl… Enneapogon … Enneapogon … APC species Exact match o…
#> # ℹ 123 more rows
#> # ℹ 1 more variable: alignment_code <chr>
For every aligned_name, align_taxa() will provide an aligned_reason,
which you can review as a table of counts:
library(janitor)
aligned_gbif_taxa |>
pull(aligned_reason) |>
tabyl() |>
tibble()
#> # A tibble: 6 × 4
#> `pull(aligned_gbif_taxa, aligned_reason)` n percent valid_percent
#> <chr> <int> <dbl> <dbl>
#> 1 Exact match of taxon name to an APC-accepted canonical name o… 118 0.915 0.929
#> 2 Exact match of taxon name to an APC-known canonical name once… 6 0.0465 0.0472
#> 3 Exact match of taxon name to an APNI-listed canonical name on… 1 0.00775 0.00787
#> 4 Exact match of the first two words of the taxon name to an AP… 1 0.00775 0.00787
#> 5 Exact match of the first word of the taxon name to an APC-acc… 1 0.00775 0.00787
#> 6 <NA> 2 0.0155 NA
There are arguments in align_taxa that allow you to select which of the
50 matching algorithms are activated/deactivated and the degree of
fuzziness of the fuzzy matching function:
- fuzzy_matches turns fuzzy matching on / off (it defaults to TRUE).
- fuzzy_abs_dist and fuzzy_rel_dist control the degree of fuzzy matching
(they default to fuzzy_abs_dist = 3 & fuzzy_rel_dist = 0.2).
- imprecise_fuzzy_matches turns imprecise fuzzy matching on / off (it
defaults to FALSE; when TRUE, it uses fuzzy_abs_dist = 5 &
fuzzy_rel_dist = 0.25).
- APNI_matches turns matches to the APNI list on/off (it defaults to
TRUE).
- identifier allows you to specify a text string that is added to
genus-level matches, indicating the site, study, etc.,
e.g. Acacia sp. [Blue Mountains].
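To give a feel for how an absolute and a relative distance limit can work together, here is a minimal sketch using base R's adist() edit distance. It is not the package's matching algorithm. The helper name within_fuzzy_limits is hypothetical, and defining the relative distance as the edit distance divided by the length of the input name is an assumption.

```r
# Hypothetical helper: TRUE when `candidate` is close enough to `input`
# under both an absolute and a relative edit-distance limit.
within_fuzzy_limits <- function(input, candidate,
                                fuzzy_abs_dist = 3, fuzzy_rel_dist = 0.2) {
  edits <- as.integer(utils::adist(input, candidate))  # Levenshtein distance
  relative <- edits / nchar(input)                     # assumed definition
  edits <= fuzzy_abs_dist && relative <= fuzzy_rel_dist
}

# A misspelt epithet that differs by two character edits
within_fuzzy_limits("Acacia anuera", "Acacia aneura")

# The same pair fails once the absolute limit is tightened
within_fuzzy_limits("Acacia anuera", "Acacia aneura", fuzzy_abs_dist = 1)
```

Combining both limits keeps short names from matching on the absolute limit alone, while still capping the number of edits allowed in long names.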
update_taxonomy() uses the information generated by align_taxa() to,
whenever possible, update names to APC-accepted names.
updated_gbif_taxa <- aligned_gbif_taxa |>
update_taxonomy(resources = resources)
updated_gbif_taxa |>
print(n = 6)
#> # A tibble: 129 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Tetratheca ciliata Tetratheca c… Tetratheca c… Tetratheca ci… Tetr… Elaeo… species
#> 2 Peganum harmala Peganum harm… Peganum harm… Peganum harma… Pega… Nitra… species
#> 3 Calotis multicaulis Calotis mult… Calotis mult… Calotis multi… Calo… Aster… species
#> 4 Leptospermum trinervium Leptospermum… Leptospermum… Leptospermum … Lept… Myrta… species
#> 5 Lepidosperma laterale Lepidosperma… Lepidosperma… Lepidosperma … Lepi… Cyper… species
#> 6 Enneapogon polyphyllus Enneapogon p… Enneapogon p… Enneapogon po… Enne… Poace… species
#> # ℹ 123 more rows
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> # taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> # subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> # taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> # row_number <dbl>, number_of_collapsed_taxa <dbl>
The APC includes all previously recorded taxonomic names for a
current taxon concept, designating the currently-accepted name as
taxonomic_status: accepted, while previously used or inappropriately
used names for the taxon concept have alternative taxonomic statuses
documented (e.g. taxonomic synonym, orthographic variant, misapplied).
The APC includes a column acceptedNameUsageID that links a taxon name
with an alternative taxonomic status to the current taxon name, allowing
outdated/inappropriately used names to be synced to their current name.
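The linking role of acceptedNameUsageID can be illustrated with a toy table in base R. This is a sketch of the idea only. The identifiers and names are placeholders, the other column names are assumptions, and the helper resolve_accepted is hypothetical.

```r
# Toy APC-like table: each row points to the ID of its accepted name.
apc_toy <- data.frame(
  taxon_ID = c("id-1", "id-2", "id-3"),
  canonical_name = c("Genus alpha", "Genus beta", "Genus gamma"),
  taxonomic_status = c("accepted", "taxonomic synonym", "orthographic variant"),
  acceptedNameUsageID = c("id-1", "id-1", "id-1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow acceptedNameUsageID to the accepted name.
resolve_accepted <- function(name, apc) {
  row <- match(name, apc$canonical_name)
  accepted_row <- match(apc$acceptedNameUsageID[row], apc$taxon_ID)
  apc$canonical_name[accepted_row]  # NA when the name is not in the table
}

resolve_accepted(c("Genus beta", "Genus gamma", "Genus delta"), apc_toy)
```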
Note: Names listed on the APNI but absent from the APC are those that
are designated as taxonomic_dataset: APNI by APCalign. These are names
that are currently unknown to the APC. Over time, this list shrinks, as
taxonomists link ever more occasionally used name variants to an
APC-accepted taxon. However, for now, names listed only on the APNI
cannot be updated.
update_taxonomy() divides names into lists based on the taxon_rank and
taxonomic_dataset assigned by align_taxa, as each list requires
different updating algorithms.
- Names with taxon_rank = species/infraspecific and
taxonomic_dataset = APC can be updated to an APC-accepted name.
- A suggested_name is provided, selecting the accepted_name when
available, and otherwise the aligned_name, but with, if possible, an
updated, APC-accepted genus name.

Taxonomic splits refer to instances where a single taxon concept
is subsequently split into multiple taxon concepts. For such taxa, when
the aligned_name is the “old” taxon concept name, it is impossible to
know which of the currently accepted taxon concepts the name represents.
The function update_taxonomy includes an argument taxonomic_splits,
offering three alternative outputs for taxon concepts that have been
split.
- most_likely_species is the default value, and returns the
accepted_name of the original taxon_concept; alternative names are
documented in square brackets as part of the suggested name
(Acacia aneura [alternative possible names: Acacia minyura (pro parte misapplied) | Acacia paraneura (pro parte misapplied) | Acacia quadrimarginea (misapplied)]).
- return_all returns all currently accepted names that were split from
the original taxon_concept; this leads to an increase in the number of
rows in the output table. (Acacia aneura, Acacia minyura and
Acacia paraneura are each output as a separate row, each with a unique
taxon_ID)
- collapse_to_higher_taxon declares that for split names, there is no
way to be certain about which accepted name is appropriate and therefore
that the best possible match is at the genus level; no accepted_name is
returned, the taxon_rank is demoted to genus and the suggested name
documents the possible species-level names in square brackets
(Acacia sp. [collapsed names: Acacia aneura (accepted) | Acacia minyura (pro parte misapplied) | Acacia paraneura (pro parte misapplied)])
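The shape of the three outputs can be mimicked with a small base R function. This is a simplified illustration, not the logic inside update_taxonomy. The helper resolve_split is hypothetical, and it omits the status annotations (such as "pro parte misapplied") that appear in the real suggested names.

```r
# Hypothetical helper: build a suggested name for a split taxon concept.
resolve_split <- function(accepted, alternatives,
                          taxonomic_splits = c("most_likely_species",
                                               "return_all",
                                               "collapse_to_higher_taxon")) {
  taxonomic_splits <- match.arg(taxonomic_splits)
  genus <- sub(" .*$", "", accepted)  # first word of the accepted name
  switch(taxonomic_splits,
    # One name, with the alternatives noted in square brackets
    most_likely_species = paste0(
      accepted, " [alternative possible names: ",
      paste(alternatives, collapse = " | "), "]"
    ),
    # One element per candidate name
    return_all = c(accepted, alternatives),
    # Genus-level name, with all candidates noted in square brackets
    collapse_to_higher_taxon = paste0(
      genus, " sp. [collapsed names: ",
      paste(c(accepted, alternatives), collapse = " | "), "]"
    )
  )
}

alternatives <- c("Acacia minyura", "Acacia paraneura")
resolve_split("Acacia aneura", alternatives)
resolve_split("Acacia aneura", alternatives, "return_all")
resolve_split("Acacia aneura", alternatives, "collapse_to_higher_taxon")
```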
library(dplyr)
aligned_gbif_taxa |>
update_taxonomy(taxonomic_splits = "most_likely_species",
resources = resources) |>
filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 1 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Acacia aneura Acacia aneura Acacia aneura Acacia aneura [alternat… Acac… Fabac… species
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> # taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> # subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> # taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> # row_number <dbl>, number_of_collapsed_taxa <dbl>
aligned_gbif_taxa |>
update_taxonomy(taxonomic_splits = "return_all",
resources = resources) |>
filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 3 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Acacia aneura Acacia aneura Acacia aneura Acacia aneura Acacia Fabaceae species
#> 2 Acacia aneura Acacia aneura Acacia minyura Acacia minyura Acacia Fabaceae species
#> 3 Acacia aneura Acacia aneura Acacia paraneura Acacia paraneura Acacia Fabaceae species
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> # taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> # subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> # taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> # row_number <dbl>, number_of_collapsed_taxa <dbl>
aligned_gbif_taxa |>
update_taxonomy(taxonomic_splits = "collapse_to_higher_taxon",
resources = resources) |>
filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 1 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Acacia aneura Acacia aneura Acacia sp. Acacia sp. [collapsed n… Acac… Fabac… species
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> # taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> # subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> # taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> # row_number <dbl>, number_of_collapsed_taxa <dbl>