APCalign

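When working with biodiversity data, it is important to verify taxonomic names against an authoritative list and correct any out-of-date names. The APCalign package simplifies this process by retrieving the Australian Plant Census (APC) and Australian Plant Name Index (APNI) name databases, aligning your taxon names to those lists, and updating outdated names to APC-accepted names.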

Installation

install.packages("remotes")
remotes::install_github("traitecoevo/APCalign")
library(APCalign)

To demonstrate how to use APCalign, we will use the example dataset gbif_lite, which is documented in ?gbif_lite.

dim(gbif_lite)
#> [1] 129   7

gbif_lite |> print(n = 6)
#> # A tibble: 129 × 7
#>   species     infraspecificepithet taxonrank decimalLongitude decimalLatitude scientificname
#>   <chr>       <chr>                <chr>                <dbl>           <dbl> <chr>         
#> 1 Tetratheca… <NA>                 SPECIES               145.           -37.4 Tetratheca ci…
#> 2 Peganum ha… <NA>                 SPECIES               139.           -33.3 Peganum harma…
#> 3 Calotis mu… <NA>                 SPECIES               115.           -24.3 Calotis multi…
#> 4 Leptosperm… <NA>                 SPECIES               151.           -34.0 Leptospermum …
#> 5 Lepidosper… <NA>                 SPECIES               142.           -37.3 Lepidosperma …
#> 6 Enneapogon… <NA>                 SPECIES               129.           -17.8 Enneapogon po…
#> # ℹ 123 more rows
#> # ℹ 1 more variable: verbatimscientificname <chr>

Retrieve taxonomic resources

The first step is to retrieve the entire APC and APNI name databases and store them locally as taxonomic resources. We achieve this using load_taxonomic_resources(). The resources are compressed as parquet files to speed download and local loading.

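The stable_or_current_data argument selects which of two versions of the databases to retrieve: "stable" retrieves an archived, versioned copy of the databases, while "current" retrieves the latest information directly from the APC and APNI website.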

Note that the databases are reasonably large so the initial retrieval of the core data will take a few minutes. Once the taxonomic resources have been stored locally, subsequent retrievals will take less time. Retrieving current resources will always take longer since it is accessing the latest information from the website in an uncompressed format.

# Benchmarking the retrieval of `stable` or `current` resources
stable_start_time <- Sys.time()
stable_resources <- load_taxonomic_resources(stable_or_current_data = "stable")
#> Loading resources......done
stable_end_time <-  Sys.time()

current_start_time <- Sys.time()
current_resources <- load_taxonomic_resources(stable_or_current_data = "current")
#> Loading resources......done
current_end_time <-  Sys.time()

# Compare times
stable_end_time - stable_start_time
#> Time difference of 16.48976 secs
current_end_time - current_start_time

For a more reproducible workflow, we recommend specifying the exact stable version you want to use.

resources <- load_taxonomic_resources(stable_or_current_data = "stable", version = "0.0.2.9000")
#> Loading resources......done

Align and update plant taxon names

Now we can query our taxonomic names against the taxonomic resources we just retrieved using create_taxonomic_update_lookup(). This all-in-one function cleans up your taxon names, aligns them to names listed in the APC and APNI, and, wherever possible, updates them to APC-accepted names.

If you would like to learn more about each of these steps, take a look at the section Closer look at name standardisation with ‘APCalign’.

library(dplyr)

updated_gbif_names <- gbif_lite |> 
  pull(species) |> 
  create_taxonomic_update_lookup(resources = resources)
#> Checking alignments of 121 taxa
#>   -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked

updated_gbif_names |> 
  print(n = 6)
#> # A tibble: 129 × 12
#>   original_name aligned_name accepted_name suggested_name genus taxon_rank taxonomic_dataset
#>   <chr>         <chr>        <chr>         <chr>          <chr> <chr>      <chr>            
#> 1 Tetratheca c… Tetratheca … Tetratheca c… Tetratheca ci… Tetr… species    APC              
#> 2 Peganum harm… Peganum har… Peganum harm… Peganum harma… Pega… species    APC              
#> 3 Calotis mult… Calotis mul… Calotis mult… Calotis multi… Calo… species    APC              
#> 4 Leptospermum… Leptospermu… Leptospermum… Leptospermum … Lept… species    APC              
#> 5 Lepidosperma… Lepidosperm… Lepidosperma… Lepidosperma … Lepi… species    APC              
#> 6 Enneapogon p… Enneapogon … Enneapogon p… Enneapogon po… Enne… species    APC              
#> # ℹ 123 more rows
#> # ℹ 5 more variables: taxonomic_status <chr>, scientific_name_authorship <chr>,
#> #   aligned_reason <chr>, update_reason <chr>, number_of_collapsed_taxa <dbl>

The original_name is the taxon name used in your original data. The aligned_name is the taxon name we used to link with the APC to identify any synonyms. The accepted_name is the currently accepted taxon name used by the Australian Plant Census. The suggested_name is the best possible name option for the original_name.
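
As a rough illustration of how these columns can be used together, the sketch below separates names that were changed from names that have no APC-accepted name. It runs on a small invented data frame rather than on the real lookup table; only the column names come from the output above, and every value is a placeholder.

```r
# Toy lookup table: column names follow the output above, values are invented
lookup <- data.frame(
  original_name = c("Genus unchanged", "Genus oldname", "Genus unplaced"),
  aligned_name  = c("Genus unchanged", "Genus oldname", "Genus unplaced"),
  accepted_name = c("Genus unchanged", "Genus newname", NA),
  stringsAsFactors = FALSE
)

# Names for which no APC-accepted name is available
unresolved <- lookup[is.na(lookup$accepted_name), ]

# Names whose accepted name differs from the name that was supplied
changed <- lookup[!is.na(lookup$accepted_name) &
                    lookup$original_name != lookup$accepted_name, ]

changed$original_name
unresolved$original_name
```

The same two filters should apply to updated_gbif_names, since it carries the same original_name and accepted_name columns.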

Plant established status across states/territories

‘APCalign’ can also provide the state/territory distribution for established status (native/introduced) from the APC.

We can access the established status data by state/territory using create_species_state_origin_matrix().

# Retrieve status data by state/territory 
status_matrix <- create_species_state_origin_matrix(resources = resources)

Here is a breakdown of all possible values for origin:

library(purrr)
library(janitor)

# Obtain unique values
status_matrix |> 
  select(-species) |> 
  flatten_chr() |> 
  tabyl()
#>  flatten_chr(select(status_matrix, -species))      n      percent
#>                        doubtfully naturalised   1120 2.371003e-03
#>                          formerly naturalised    277 5.863998e-04
#>                                        native  40336 8.538997e-02
#>             native and doubtfully naturalised      9 1.905270e-05
#>                        native and naturalised    136 2.879075e-04
#>                   native and uncertain origin      2 4.233933e-06
#>                                   naturalised   8765 1.855521e-02
#>                                   not present 421606 8.925258e-01
#>                              presumed extinct    101 2.138136e-04
#>                              uncertain origin     22 4.657327e-05

You can also obtain the breakdown of species by established status for a particular state/territory using state_diversity_counts().

state_diversity_counts("NSW", resources = resources)
#> # A tibble: 7 × 3
#>   origin                            state num_species
#>   <chr>                             <chr> <table[1d]>
#> 1 doubtfully naturalised            NSW     93       
#> 2 formerly naturalised              NSW      8       
#> 3 native                            NSW   5958       
#> 4 native and doubtfully naturalised NSW      2       
#> 5 native and naturalised            NSW     34       
#> 6 naturalised                       NSW   1580       
#> 7 presumed extinct                  NSW      8

Using the established status data and state/territory information, we can check whether a plant taxon is native anywhere in Australia using native_anywhere_in_australia().

library(dplyr)

updated_gbif_names |> 
  sample_n(1) |>  # Choosing a random species
  pull(suggested_name) |> # Extracting this APC accepted name
  native_anywhere_in_australia(resources = resources) 
#> # A tibble: 1 × 2
#>   species              native_anywhere_in_aus               
#>   <chr>                <chr>                                
#> 1 Solanum prinophyllum considered native to Australia by APC
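
To make the idea behind this check concrete, here is a minimal sketch of how a "native anywhere" flag could be derived from a species-by-state status matrix. It is not the package's implementation: the toy matrix, the state columns and the rule that any status beginning with "native" counts as native are all assumptions made for illustration.

```r
# Toy status matrix with invented species; the real matrix comes from
# create_species_state_origin_matrix()
toy_status <- data.frame(
  species = c("Species A", "Species B", "Species C"),
  NSW = c("native", "naturalised", "not present"),
  Qld = c("not present", "naturalised", "native and naturalised"),
  stringsAsFactors = FALSE
)

# Assumption: a status counts as native if it starts with "native"
state_cols <- setdiff(names(toy_status), "species")
is_native <- vapply(
  toy_status[state_cols],
  function(status) grepl("^native", status),
  logical(nrow(toy_status))
)

# A species is flagged if it is native in at least one state/territory
toy_status$native_somewhere <- rowSums(is_native) > 0
toy_status[, c("species", "native_somewhere")]
```

Under this rule, combined statuses such as "native and naturalised" are treated as native; a stricter rule would need an exact comparison against "native".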

Closer look at name standardisation with ‘APCalign’

create_taxonomic_update_lookup() is a simple wrapper function for novice users who want to quickly check and standardise taxon names. More experienced users can take a look at the sub-functions match_taxa(), align_taxa() and update_taxonomy() to see how taxon names are processed, aligned and updated.

Aligning names to APC and APNI

The function align_taxa will:

  1. Clean up your taxonomic names
    • The functions standardise_names, strip_names and strip_names_extra standardise infraspecific taxon designations and clean up punctuation and whitespace
  2. Find the best alignment with APC or APNI for your taxonomic name using the function match_taxa
    • A taxonomic name flows through a progression of 50 match algorithms until it is able to be aligned to a name on either the APC or APNI list.
    • These include exact and fuzzy matches. Fuzzy matches are designed to capture small spelling mistakes and syntax errors in phrase names.
    • These include matches to the entire name string and matches on just select words in the sequence.
    • The sequence of matches has been carefully curated to align names with the fewest mistakes.
  3. Determine the taxon_rank to which the name can be resolved, based on its syntax.
    • For names that can only be resolved to genus, reformats the name to offer a standardised genus sp. name, with additional information/notes provided as part of the original name in square brackets, as in Acacia sp. [skinny leaves] or Acacia sp. [Broken Hill]
  4. Determine the taxonomic_reference (APC or APNI) of each name-alignment.

Note that align_taxa does not seek to update outdated taxonomy; that happens in update_taxonomy. Instead, align_taxa aligns each input name to the closest match amongst the names documented by the APC and APNI.
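
The progression described above (clean, try an exact match, fall back to looser matches, and record the reason) can be sketched in a few lines of base R. This is only an illustration of the general technique with three steps instead of 50; the reference list, the helper names clean_name and match_one, and the max_dist threshold are hypothetical and are not part of APCalign.

```r
# Hypothetical reference list standing in for the APC/APNI names
reference <- c("Acacia aneura", "Banksia serrata", "Eucalyptus saligna")

# Illustrative cleaning step: trim the ends and collapse repeated whitespace
clean_name <- function(x) {
  gsub("\\s+", " ", trimws(x))
}

# Try progressively looser matches and record which step succeeded
match_one <- function(name, reference, max_dist = 3) {
  cleaned <- clean_name(name)

  # Step 1: exact match on the whole cleaned name
  if (cleaned %in% reference) {
    return(c(aligned_name = cleaned, aligned_reason = "exact match"))
  }

  # Step 2: fuzzy match on the whole name, using edit distance
  d <- drop(utils::adist(cleaned, reference))
  if (min(d) <= max_dist) {
    return(c(aligned_name = reference[which.min(d)],
             aligned_reason = "fuzzy match"))
  }

  # Step 3: exact match on the first word only, resolved to genus rank,
  # keeping the rest of the input as a note in square brackets
  genus <- sub(" .*", "", cleaned)
  if (genus %in% sub(" .*", "", reference)) {
    note <- sub("^\\S+\\s*", "", cleaned)
    return(c(aligned_name = paste0(genus, " sp. [", note, "]"),
             aligned_reason = "genus match"))
  }

  c(aligned_name = NA_character_, aligned_reason = NA_character_)
}

inputs <- c("  Acacia   aneura ", "Banksia serata", "Eucalyptus unknownii")
do.call(rbind, lapply(inputs, match_one, reference = reference))
```

Ordering the steps from strictest to loosest means a name is only ever matched by a looser rule when every stricter rule has failed, which is the reason the order of the match sequence matters.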

library(dplyr)

aligned_gbif_taxa <- gbif_lite |> 
  pull(species) |> 
  align_taxa(resources = resources)
#> Checking alignments of 121 taxa
#>   -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked

aligned_gbif_taxa |> 
  print(n = 6)
#> # A tibble: 129 × 7
#>   original_name        cleaned_name aligned_name taxonomic_dataset taxon_rank aligned_reason
#>   <chr>                <chr>        <chr>        <chr>             <chr>      <chr>         
#> 1 Tetratheca ciliata   Tetratheca … Tetratheca … APC               species    Exact match o…
#> 2 Peganum harmala      Peganum har… Peganum har… APC               species    Exact match o…
#> 3 Calotis multicaulis  Calotis mul… Calotis mul… APC               species    Exact match o…
#> 4 Leptospermum triner… Leptospermu… Leptospermu… APC               species    Exact match o…
#> 5 Lepidosperma latera… Lepidosperm… Lepidosperm… APC               species    Exact match o…
#> 6 Enneapogon polyphyl… Enneapogon … Enneapogon … APC               species    Exact match o…
#> # ℹ 123 more rows
#> # ℹ 1 more variable: alignment_code <chr>

For every aligned_name, align_taxa() will provide an aligned_reason, which you can review as a table of counts:

library(janitor)

aligned_gbif_taxa |> 
  pull(aligned_reason) |> 
  tabyl() |> 
  tibble() 
#> # A tibble: 6 × 4
#>   `pull(aligned_gbif_taxa, aligned_reason)`                          n percent valid_percent
#>   <chr>                                                          <int>   <dbl>         <dbl>
#> 1 Exact match of taxon name to an APC-accepted canonical name o…   118 0.915         0.929  
#> 2 Exact match of taxon name to an APC-known canonical name once…     6 0.0465        0.0472 
#> 3 Exact match of taxon name to an APNI-listed canonical name on…     1 0.00775       0.00787
#> 4 Exact match of the first two words of the taxon name to an AP…     1 0.00775       0.00787
#> 5 Exact match of the first word of the taxon name to an APC-acc…     1 0.00775       0.00787
#> 6 <NA>                                                               2 0.0155       NA

Configuring matching precision and aligned output

Arguments in align_taxa allow you to select which of the 50 matching algorithms are activated or deactivated, and to set the degree of fuzziness of the fuzzy matching function:

  • fuzzy_matches turns fuzzy matching on / off (it defaults to TRUE).
  • fuzzy_abs_dist and fuzzy_rel_dist control the degree of fuzzy matching (they default to fuzzy_abs_dist = 3 & fuzzy_rel_dist = 0.2).
  • imprecise_fuzzy_matches turns imprecise fuzzy matching on / off (it defaults to FALSE; when TRUE, it uses fuzzy_abs_dist = 5 & fuzzy_rel_dist = 0.25).
  • APNI_matches turns matches to the APNI list on/off (it defaults to TRUE).
  • identifier allows you to specify a text string that is added to genus-level matches to indicate the site, study, etc., e.g. Acacia sp. [Blue Mountains]
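
To give a feel for how an absolute and a relative distance limit interact, the sketch below accepts a fuzzy match only when the edit distance satisfies both limits. The helper within_fuzzy_limits is hypothetical, and dividing the edit distance by the number of characters in the input name is an assumption about how the relative distance is measured; the argument names and default values are the ones listed above.

```r
# Hypothetical helper: accept a candidate only if the edit distance is within
# BOTH the absolute limit and the relative limit.
# Assumption: relative distance = edit distance / characters in the input name
within_fuzzy_limits <- function(name, candidate,
                                fuzzy_abs_dist = 3, fuzzy_rel_dist = 0.2) {
  d <- drop(utils::adist(name, candidate))
  d <= fuzzy_abs_dist && d / nchar(name) <= fuzzy_rel_dist
}

# One edit in a 14-character name: passes the default limits
within_fuzzy_limits("Banksia serata", "Banksia serrata")

# Four edits in a 24-character name: fails the default absolute limit of 3
within_fuzzy_limits("Eukaliptus kamaldulenzis", "Eucalyptus camaldulensis")

# The same pair passes with the looser, "imprecise" limits
within_fuzzy_limits("Eukaliptus kamaldulenzis", "Eucalyptus camaldulensis",
                    fuzzy_abs_dist = 5, fuzzy_rel_dist = 0.25)
```

Using two limits guards against different failure modes: the absolute limit caps the number of edits allowed in long names, while the relative limit stops short names from being matched after a proportionally large change.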

Updating to APC-accepted names

update_taxonomy() uses the information generated by align_taxa() to, whenever possible, update names to APC-accepted names.

updated_gbif_taxa <- aligned_gbif_taxa |> 
  update_taxonomy(resources = resources)

updated_gbif_taxa |> 
  print(n = 6)
#> # A tibble: 129 × 21
#>   original_name           aligned_name  accepted_name suggested_name genus family taxon_rank
#>   <chr>                   <chr>         <chr>         <chr>          <chr> <chr>  <chr>     
#> 1 Tetratheca ciliata      Tetratheca c… Tetratheca c… Tetratheca ci… Tetr… Elaeo… species   
#> 2 Peganum harmala         Peganum harm… Peganum harm… Peganum harma… Pega… Nitra… species   
#> 3 Calotis multicaulis     Calotis mult… Calotis mult… Calotis multi… Calo… Aster… species   
#> 4 Leptospermum trinervium Leptospermum… Leptospermum… Leptospermum … Lept… Myrta… species   
#> 5 Lepidosperma laterale   Lepidosperma… Lepidosperma… Lepidosperma … Lepi… Cyper… species   
#> 6 Enneapogon polyphyllus  Enneapogon p… Enneapogon p… Enneapogon po… Enne… Poace… species   
#> # ℹ 123 more rows
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> #   taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> #   subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> #   taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> #   row_number <dbl>, number_of_collapsed_taxa <dbl>

Taxonomic resources used for updating names

  • The APC includes all previously recorded taxonomic names for a current taxon concept, designating the currently-accepted name as taxonomic_status: accepted, while previously used or inappropriately used names for the taxon concept have alternative taxonomic statuses documented (e.g. taxonomic synonym, orthographic variant, misapplied).

  • The APC includes a column acceptedNameUsageID that links a taxon name with an alternative taxonomic status to the current taxon name, allowing outdated/inappropriately used names to be synced to their current name.
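
The linkage described in the two points above amounts to a lookup from a name to its acceptedNameUsageID, and from that ID back to the accepted name. The sketch below shows the idea on an invented table; the IDs and names are placeholders, the helper update_name is hypothetical, and matching acceptedNameUsageID against a taxon_ID column is an assumption about how the link resolves.

```r
# Toy APC-style table: column names follow the text, values are invented
apc <- data.frame(
  taxon_ID = c("id1", "id2", "id3", "id4", "id5"),
  canonical_name = c("Genus current", "Genus former", "Genus mispelt",
                     "Other current", "Other former"),
  taxonomic_status = c("accepted", "taxonomic synonym", "orthographic variant",
                       "accepted", "misapplied"),
  acceptedNameUsageID = c("id1", "id1", "id1", "id4", "id4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow acceptedNameUsageID to the accepted name
update_name <- function(aligned_name, apc) {
  name_row <- match(aligned_name, apc$canonical_name)
  accepted_row <- match(apc$acceptedNameUsageID[name_row], apc$taxon_ID)
  apc$canonical_name[accepted_row]
}

# Synonyms resolve to their accepted name, accepted names resolve to
# themselves, and names absent from the table return NA
update_name(c("Genus former", "Genus current", "Other former", "Genus absent"),
            apc)
```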

Note: Names listed on the APNI but absent from the APC are designated as taxonomic_dataset: APNI by APCalign. These names are currently unknown to the APC. Over time this list shrinks, as taxonomists link more of the occasionally used name variants to an APC-accepted taxon. For now, however, names listed only on the APNI cannot be updated.

Name updates at different taxonomic levels

  • update_taxonomy() divides names into lists based on the taxon_rank and taxonomic_dataset assigned by align_taxa, as each list requires different updating algorithms.
  • Only taxonomic names that are designated as taxon_rank = species/infraspecific and taxonomic_dataset = APC can be updated to an APC-accepted name.
  • For all other taxa, it may be possible to align the genus-name to an APC-accepted genus.
  • For all taxa, a suggested_name is provided, selecting the accepted_name when available, and otherwise the aligned_name, but with, if possible, an updated, APC-accepted genus name.
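
The rule for choosing a suggested_name in the last point can be written as a simple fallback. The sketch below is an illustration on invented data rather than the package's code: the genus_updates lookup (old genus to APC-accepted genus) is hypothetical, and only the column names are taken from the text.

```r
# Toy aligned table with invented names
aligned <- data.frame(
  aligned_name  = c("Genus former", "Oldgenus sp. [site A]", "Newgenus unlisted"),
  accepted_name = c("Genus current", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical genus-level updates: old genus name -> APC-accepted genus name
genus_updates <- c(Oldgenus = "Newgenus")

# Swap in the updated genus where one is known, otherwise keep the genus
genus <- sub(" .*", "", aligned$aligned_name)
new_genus <- ifelse(genus %in% names(genus_updates),
                    genus_updates[genus], genus)
with_new_genus <- paste0(new_genus, sub("^\\S+", "", aligned$aligned_name))

# Prefer the accepted_name; otherwise fall back to the aligned_name with
# its genus updated
aligned$suggested_name <- ifelse(is.na(aligned$accepted_name),
                                 with_new_genus, aligned$accepted_name)
aligned
```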

Taxonomic splits

  • Taxonomic splits refer to instances where a single taxon concept is subsequently split into multiple taxon concepts. For such taxa, when the aligned_name is the “old” taxon concept name, it is impossible to know which of the currently accepted taxon concepts the name represents.

  • The function update_taxonomy includes an argument taxonomic_splits, offering three alternative outputs for taxon concepts that have been split.

    1. most_likely_species is the default value, and returns the accepted_name of the original taxon_concept; alternative names are documented in square brackets as part of the suggested name (Acacia aneura [alternative possible names: Acacia minyura (pro parte misapplied) | Acacia paraneura (pro parte misapplied) | Acacia quadrimarginea (misapplied)]).

    2. return_all returns all currently accepted names that were split from the original taxon_concept; this leads to an increase in the number of rows in the output table. (Acacia aneura, Acacia minyura and Acacia paraneura are each output as a separate row, each with a unique taxon_ID)

    3. collapse_to_higher_taxon declares that for split names, there is no way to be certain about which accepted name is appropriate and therefore that the best possible match is at the genus level; no accepted_name is returned, the taxon_rank is demoted to genus and the suggested name documents the possible species-level names in square brackets (Acacia sp. [collapsed names: Acacia aneura (accepted) | Acacia minyura (pro parte misapplied) | Acacia paraneura (pro parte misapplied)])

library(dplyr)

aligned_gbif_taxa |> 
  update_taxonomy(taxonomic_splits = "most_likely_species", 
                  resources = resources)  |> 
  filter(original_name == "Acacia aneura")  # Subsetting Acacia aneura as an example 
#> # A tibble: 1 × 21
#>   original_name aligned_name  accepted_name suggested_name           genus family taxon_rank
#>   <chr>         <chr>         <chr>         <chr>                    <chr> <chr>  <chr>     
#> 1 Acacia aneura Acacia aneura Acacia aneura Acacia aneura [alternat… Acac… Fabac… species   
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> #   taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> #   subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> #   taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> #   row_number <dbl>, number_of_collapsed_taxa <dbl>

aligned_gbif_taxa |> 
  update_taxonomy(taxonomic_splits = "return_all",
                  resources = resources)  |> 
  filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 3 × 21
#>   original_name aligned_name  accepted_name    suggested_name   genus  family   taxon_rank
#>   <chr>         <chr>         <chr>            <chr>            <chr>  <chr>    <chr>     
#> 1 Acacia aneura Acacia aneura Acacia aneura    Acacia aneura    Acacia Fabaceae species   
#> 2 Acacia aneura Acacia aneura Acacia minyura   Acacia minyura   Acacia Fabaceae species   
#> 3 Acacia aneura Acacia aneura Acacia paraneura Acacia paraneura Acacia Fabaceae species   
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> #   taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> #   subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> #   taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> #   row_number <dbl>, number_of_collapsed_taxa <dbl>

aligned_gbif_taxa |> 
  update_taxonomy(taxonomic_splits = "collapse_to_higher_taxon",
                  resources = resources)  |> 
  filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 1 × 21
#>   original_name aligned_name  accepted_name suggested_name           genus family taxon_rank
#>   <chr>         <chr>         <chr>         <chr>                    <chr> <chr>  <chr>     
#> 1 Acacia aneura Acacia aneura Acacia sp.    Acacia sp. [collapsed n… Acac… Fabac… species   
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> #   taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> #   subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> #   taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> #   row_number <dbl>, number_of_collapsed_taxa <dbl>