This module of the bdc package extracts the collection year, whenever possible, from complete and legitimate date information and flags dubious (e.g., 07/07/10), illegitimate (e.g., 1300, 2100), or missing (e.g., 0 or NA) collecting years.
Check here how to install the bdc package.
Read the database created in the Space module of the bdc package. It is also possible to read any dataset containing the **required** fields to run the functions (more details here).
database <-
  readr::read_csv(here::here("Output/Intermediate/03_space_database.csv"))
⚠️IMPORTANT:
The results of the VALIDATION tests used to flag data quality are appended to this database in separate fields and returned as TRUE (✅ ok) or FALSE (❌ check carefully).
VALIDATION. This function flags records lacking event date information (e.g., empty or NA).
check_time <-
  bdc_eventDate_empty(data = database, eventDate = "verbatimEventDate")
#>
#> bdc_eventDate_empty:
#> Flagged 64 records.
#> One column was added to the database.
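The idea behind this check can be sketched in base R. This is only an illustration of the kind of test involved, not the bdc implementation; the helper name `flag_empty_date` and the toy vector are hypothetical.

```r
# Hypothetical helper (not part of bdc): TRUE when a date is supplied,
# FALSE when the value is NA, empty, or only whitespace.
flag_empty_date <- function(x) {
  x <- trimws(as.character(x))
  !(is.na(x) | x == "")
}

# Toy input invented for illustration
dates <- c("2010-07-07", "", NA, "  ", "1998")
flags <- flag_empty_date(dates)
flags
```

As in the bdc workflow, FALSE marks the records that need attention.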
ENRICHMENT. This function extracts four-digit years from unambiguously interpretable collecting dates.
check_time <-
  bdc_year_from_eventDate(data = check_time, eventDate = "verbatimEventDate")
#>
#> bdc_year_from_eventDate:
#> Four-digit year were extracted from 51 records.
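To make "unambiguously interpretable" concrete, here is a minimal base R sketch that keeps a year only when the date string contains a standalone run of exactly four digits. The helper `extract_year` and its regular expression are assumptions made for illustration; the rules bdc applies internally may differ.

```r
# Hypothetical helper (not part of bdc): return the first standalone
# four-digit run as an integer year, or NA when none is found.
extract_year <- function(x) {
  x <- as.character(x)
  # Four digits that are not preceded or followed by another digit
  pos <- regexpr("(?<![0-9])[0-9]{4}(?![0-9])", x, perl = TRUE)
  year <- rep(NA_integer_, length(x))
  hit <- !is.na(pos) & pos > 0
  year[hit] <- as.integer(substr(x[hit], pos[hit], pos[hit] + 3L))
  year
}

# Toy input invented for illustration; "07/07/10" has no four-digit year
years <- extract_year(c("2010-07-07", "07/07/10", "12 March 1998", NA))
years
```

A two-digit year such as the one in 07/07/10 is left as NA rather than guessed, which matches the cautious spirit of this step.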
VALIDATION. This function identifies records with illegitimate or potentially imprecise collecting years. A year is flagged when it is out of range (e.g., in the future) or earlier than a threshold year supplied by the user (e.g., 1900). Older records are more likely to be imprecise because their coordinates are typically derived from locality descriptions during georeferencing.
check_time <-
  bdc_year_outOfRange(data = check_time,
                      eventDate = "year",
                      year_threshold = 1900)
#>
#> bdc_year_outOfRange:
#> Flagged 0 records.
#> One column was added to the database.
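The range test can be sketched as follows. The helper `flag_year_range` is hypothetical, and treating missing years as passing is an assumption, chosen because the run above flags 0 records even though many records have no year.

```r
# Hypothetical helper (not part of bdc): TRUE when the year lies between
# the user threshold and the current year, FALSE otherwise.
flag_year_range <- function(year, year_threshold = 1900) {
  current_year <- as.integer(format(Sys.Date(), "%Y"))
  ok <- year >= year_threshold & year <= current_year
  # Assumption: missing years are not flagged by this test
  ok[is.na(year)] <- TRUE
  ok
}

# Toy input invented for illustration
in_range <- flag_year_range(c(1850, 1950, 2100, NA), year_threshold = 1900)
in_range
```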
Here we create a column named .summary that sums up the results of all VALIDATION tests. This column is FALSE when a record is flagged as FALSE in any data quality test (❌ check carefully: potentially invalid or suspect record).
check_time <- bdc_summary_col(data = check_time)
#> Column '.summary' already exist. It will be updated
#>
#> bdc_summary_col:
#> Flagged 70 records.
#> One column was added to the database.
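The logic of the summary column is a row-wise AND over the flag columns. The sketch below assumes, for illustration only, that flag columns are the ones whose names start with a dot; `summarise_flags` and the toy data frame are hypothetical and do not reproduce the bdc code.

```r
# Hypothetical helper (not part of bdc): combine all dot-prefixed logical
# columns into a single .summary column.
summarise_flags <- function(data) {
  flag_cols <- grep("^\\.", names(data), value = TRUE)
  flag_cols <- setdiff(flag_cols, ".summary")  # ignore an existing summary
  # A record passes only if every individual test is TRUE
  data$.summary <- Reduce(`&`, data[flag_cols])
  data
}

# Toy data invented for illustration
toy <- data.frame(
  id = 1:3,
  .eventDate_empty = c(TRUE, FALSE, TRUE),
  .year_outOfRange = c(TRUE, TRUE, FALSE)
)
toy <- summarise_flags(toy)
toy$.summary
```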
Here we create a report summarizing the results of all tests of the
bdc package. The report can be automatically saved if
save_report = TRUE.
report <-
  bdc_create_report(data = check_time,
database_id = "database_id",
workflow_step = "time",
save_report = FALSE)
report
Here we create figures (bar plots and histograms) to make the
interpretation of the results of data quality tests easier. See some
examples below. Figures can be automatically saved if
save_figures = TRUE.
figures <-
  bdc_create_figures(data = check_time,
database_id = "database_id",
workflow_step = "time",
save_figures = FALSE)
# Check figures using
figures$time_year_BAR
Save the original database containing the results of all data quality tests appended in separate columns. You can use qs::qsave() instead of write_csv to save a large database in a compressed format.
check_time %>%
  readr::write_csv(.,
                   here::here("Output", "Intermediate", "04_time_database.csv"))
Let’s remove potentially erroneous or suspect records flagged by the data quality tests applied in all modules of the bdc package to get a “clean”, “fitness-for-use” database. Note that 25% of the original records (45 out of 180) were considered fit for use after the data-cleaning process.
output <-
  check_time %>%
  dplyr::filter(.summary == TRUE) %>%
  bdc_filter_out_flags(data = ., col_to_remove = "all")
#>
#> bdc_fiter_out_flags:
#> The following columns were removed from the database:
#> .uncer_terms, .rou, .val, .equ, .zer, .cap, .cen, .urb, .otl, .gbf, .inst, .dpl, .eventDate_empty, .year_outOfRange, .summary
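The filtering step amounts to keeping rows where .summary is TRUE, dropping the flag columns, and checking what share of records survives. The base R sketch below uses a toy data frame invented for illustration and assumes flag columns are dot-prefixed; it is not the bdc implementation.

```r
# Toy data invented for illustration
toy <- data.frame(
  id = 1:4,
  .summary = c(TRUE, FALSE, TRUE, FALSE)
)

# Keep only records that passed every test
kept <- toy[toy$.summary, , drop = FALSE]

# Drop the flag columns (assumed here to be the dot-prefixed ones)
kept <- kept[, !startsWith(names(kept), "."), drop = FALSE]

# Share of the original records retained, in percent
pct_kept <- 100 * nrow(kept) / nrow(toy)
pct_kept
```

The same arithmetic applied to the real database gives the 45 out of 180 records reported for this workflow.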
You can use qs::qsave() instead of write_csv to save a large database in a compressed format.
# use qs::qsave() to save the database in a compressed format and then qs::qread() to load the database
output %>%
  readr::write_csv(.,
                   here::here("Output", "Intermediate", "05_cleaned_database.csv"))
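For the compressed alternative, a round trip with the qs package might look like the sketch below. It assumes qs is installed, writes to a temporary file rather than the Output folder, and uses a toy data frame invented for illustration.

```r
# Toy data invented for illustration
toy <- data.frame(id = 1:3, year = c(1998L, 2010L, NA))

# Guarded so the sketch is skipped when qs is not installed
if (requireNamespace("qs", quietly = TRUE)) {
  path <- tempfile(fileext = ".qs")
  qs::qsave(toy, path)       # write in compressed format
  restored <- qs::qread(path) # read it back
} else {
  restored <- toy
}
```

Unlike a CSV file, this format keeps column types such as integer and logical, so flag columns do not need to be re-parsed on load.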