The disdat
package is primarily a data package,
supplying the data described in Elith et al. (2020). There are
data for 6 regions in the world, for a total of 226 anonymised species
including birds, vascular plants, reptiles and bats. Each data set has
presence-only (and optionally background) training data to build models,
and presence/absence data to evaluate models. These data were compiled
by a species distribution modeling working group sponsored by the
National Center for Ecological Analysis and Synthesis (NCEAS), at UC
Santa Barbara, USA (project ID: 4980). That group, or subsets of its
members, have published 13 papers using some or all of the data for
various explorations of model performance. Those papers are detailed in
Elith et al. (2020).
The data are now released both on Open Science Framework primarily as .csv files (link supplied in first reference, below) and here as an R package. Both sets of data are similar; in addition the OSF dataset includes raster files not supplied here (561MB zipped) and additional metadata files. Elith et al. (2020) detail the background to the data including suppliers, sources of both species and environmental data, steps for cleaning and preparation for modeling. Some of that information is also provided in the help files in this package, but note that Elith et al. (2020) provide the full documentation needed to understand the data.
Several functions are supplied, with help pages, in the
disdat
package, to load the data into your workspace or get
it for modelling: disPo
, disBg
,
disPa
, disEnv
and disBorder
. The
help files explain the nuances of these, and give examples of their
application to load all data from all regions, or subsets of it. A
function disPredictors
returns the names of all predictor
variables in the dataset and disCRS
returns the information
about coordinate reference system of each region.
Finally, the function disMapBook
enables you to generate
maps, in PDF format, showing the location of PO and PA data for all
species within one or more regions, with base plots looking like
this:
In the remainder of this vignette we provide a few examples of how to model and evaluate using these data, and how to make your own target group background sample to help you to deal with biases in the data.
In this section, code is provided as an example of one approach for using these data for modeling and evaluation. Here we model all 226 species, using a random forest (RF) model with default settings.
As you read the code you will notice two things: 1. for any one region and biological group, the evaluation data comes in two files: environmental data, that can be predicted to, and species data, for evaluating the predictions. 2. that two regions (AWT and NSW) have more than one set of evaluation files, because there is more than one biological group (and therefore more than one set of surveys) for those regions. Both these points are explained in the help files for the data, and in Elith et al. (2020).
The way we organise the species data for modeling suits the
randomForest
package - each modeling method has different
requirements.
# loading the library
library(disdat)
library(forcats) # for handling factor variables
library(randomForest)
Prepare for modeling:
# set output directory for prediction csv files:
<- "~/Desktop/modeling_nceas"
outdir # provide names for regions to be modeled - here we model all 6:
<- c("AWT", "CAN", "NSW", "NZ", "SA", "SWI")
regions # specify names of all categorical variables across all regions:
<- c("ontveg", "vegsys", "toxicats", "age", "calc")
categoricalvars
# check the model's output directory
if (file.exists(outdir)) {
print("The directory already exists!")
else {
} dir.create(file.path(outdir))
print("The directory is created.")
}
Now start the modeling, working through one region at a time, getting the data for modeling and prediction, ensuring categorical predictors are represented correctly, and fitting a random forest model. Note: this is modelling all 226 species and will take some minutes to complete.
<- 0
n for (r in regions) {
# reading presence-only and background species data for this region, one
# file per region:
<- disPo(r)
presences <- disBg(r)
background
# extract names for all species
<- unique(presences$spid)
species
# now for each species, prepare data for modeling and evaluation with
# randomForests, model, predict to the evaluation file with environmental
# data.
for (s in species) {
# subset presence records of species for this species
<- presences[presences$spid == s, ]
sp_presence # add background data
<- rbind(sp_presence, background)
pr_bg
# find the evaluation file – for some regions this means identifying
# the taxonomic group
if (r %in% c("AWT", "NSW")) {
<- sp_presence[, "group"][1]
grp <- disEnv(r, grp)
evaluation else {
} <- disEnv(r)
evaluation
}
# convert categorical vars to factor in both training and testing data.
# We use the package forcats to ensure that the levels of the factor in
# the evaluation data match those in the training data, regardless of
# whether all levels are present in the evaluation data.
for (i in 1:ncol(pr_bg)) {
if (colnames(pr_bg)[i] %in% categoricalvars) {
<- colnames(pr_bg)[i]
fac_col <- as.factor(pr_bg[, fac_col])
pr_bg[, fac_col] <- as.factor(evaluation[, fac_col])
evaluation[, fac_col] <- forcats::fct_expand(evaluation[, fac_col],
evaluation[, fac_col] levels(pr_bg[, fac_col]))
<- forcats::fct_relevel(evaluation[, fac_col],
evaluation[, fac_col] levels(pr_bg[, fac_col]))
}
}
# extract the relevant columns for modelling
<- pr_bg[, c("occ", disPredictors(r))]
model_data # convert the response to factor for RF model to return probabilities
$occ <- as.factor(model_data$occ)
model_data
# start modeling! We use the 'try' notation so if a species fails to
# fit, the loop will continue.
<- n + 1
n if (inherits(try(mod <- randomForest(occ ~ ., data = model_data, ntree = 500)),
"try-error")) {
print(paste("Error for species", s, "from", r))
}
# make and save the prediction
<- evaluation[, 1:3]
out_file $spid <- sp_presence$spid[1]
out_file$group <- sp_presence$group[1]
out_file$prediction <- predict(mod, evaluation, type = "prob")[, "1"] # prediction for presences
out_filewrite.csv(out_file, paste(outdir, "/", r, "_", s, "_rf", ".csv", sep = ""),
row.names = FALSE)
print(n)
} }
Here we give an example of evaluating the predictive performance of the model using area under the ROC curve (AUC). We get the predictions we have made, read in the relevant presence-absence observations, and evaluate the predictions.
library(dplyr)
library(ROCR)
# specify the output folder that contains species predictions, and in which to
# save the models results
<- "~/Desktop/modeling_nceas"
modelsdir
# get the names of all the prediction files
<- list.files(modelsdir, pattern = ".csv$", full.names = FALSE)
sp_preds
# a data.frame to save the evaluation result
<- data.frame(region = rep(NA, length(sp_preds)), species = NA, auc = NA)
rf_eval
# now a loop to evaluate each species, one at a time:
<- 0
n for (i in sp_preds) {
<- read.csv(file.path(modelsdir, i), stringsAsFactors = FALSE)
pred # get information on the species (s) and region (r) for each file:
<- unlist(strsplit(i, "_"))[1]
r <- unlist(strsplit(i, "_"))[2]
s
# now evaluate predictions, finding the right column for this species find
# the evaluation file – for some regions this means identifying the
# taxonomic group
if (r %in% c("AWT", "NSW")) {
<- pred[pred$spid == s, "group"][1]
grp <- disPa(r, grp)
evals else {
} <- disPa(r)
evals
}
<- n + 1
n <- prediction(pred$prediction, evals[, s]) %>%
perf performance("auc")
$region[n] <- r
rf_eval$species[n] <- s
rf_eval$auc[n] <- as.numeric(perf@y.values)
rf_evalprint(s)
}
mean(rf_eval$auc)
hist(rf_eval$auc)
# auc mean for region
%>%
rf_eval group_by(region) %>%
summarise(mean(auc))
This gives an example for one region, NSW, for creating target group background (TGB) samples. We chose NSW because it has several groups of species, so is one of the more complex examples.
The main thing to understand here is that for TGB samples the intention is to create a sample consistent with protocols used for the presence records. So, for example, if you had cleaned presence records so there was only one record for each unique location, you would also screen the TGB samples in the same way.
For these NCEAS data, the presence-only records were cleaned (per species) to one per grid cell. When making the TGB sample (e.g. for birds) a naïve approach would be to simply gather all the records for birds, and use those as a TGB sample. This would not be correct. You may have multiple records in any one cell (e.g. because a recorder may have gone to that location and recorded several species). The idea is to represent survey effort, and the survey effort is not a lot greater for recording many species at one location (or in this case, in one cell). Hence we will reduce the set of records for any designated “group” to one record per grid cell.
To do this, you will need to download the rasters from OSF – see the link in the first reference below - see the /data/Environment folder. You can download all rasters at once by highlighting the “Environment” folder and clicking “Download as zip”.
The code below assumes you have set your working directory, and the environmental rasters are in a sub-folder “env_grids”, with the rasters for each region within their respective regional folders
Load the required packages:
library(terra)
Read in the data for NSW:
<- disPo("NSW")
po <- rast("env_grids/nsw/mi.tif") nsw.raster
Check how many groups and records in these PO data, and think about what target groups to propose. Here there are 8 groups (see main paper), but it is sensible to collapse some of these, given the way that recorders are likely to survey. Following Phillips et al. (2009) we propose four target groups: bat, bird, plant, reptile:
table(po$group)
<- list(bat = "ba", bird = c("db", "nb"), plant = c("ot", "ou", "rt",
target.groups "ru"), reptile = "sr")
Now, for each target group, extract the relevant rows of data, find which cell each record is in, then identify and remove the duplicates:
<- list()
tgb.data
for (i in 1:4) {
<- match(po$group, target.groups[[i]])
getgroup <- po[!is.na(getgroup), ]
dat <- cellFromXY(nsw.raster, dat[, c("x", "y")])
samplecellID <- duplicated(samplecellID)
dup <- dat[!dup, ]
tgb.data[[i]]
}
names(tgb.data) <- names(target.groups)
Elith, J., Graham, C.H., Valavi, R., Abegg, M., Bruce, C., Ferrier, S., Ford, A., Guisan, A., Hijmans, R.J., Huettmann, F., Lohmann, L.G., Loiselle, B.A., Moritz, C., Overton, J.McC., Peterson, A.T., Phillips, S., Richardson, K., Williams, S., Wiser, S.K., Wohlgemuth, T. & Zimmermann, N.E. (2020) Presence-only and presence-absence data for comparing species distribution modeling methods. Biodiversity Informatics, 15, 69-80.
Phillips, S.J., Dudík, M., Elith, J., Graham, C.H., Lehmann, A., Leathwick, J. & Ferrier, S. (2009) Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. Ecological Applications, 19, 181–197.