You are developing a family of R packages that extend tidy data workflows with richer semantic and provenance-aware capabilities. The work began from practical experience building tidyverse-based data pipelines and repeatedly encountering the same limitation: while tidy datasets are highly efficient and semantically clear within a given workflow, much of their meaning remains implicit and dependent on the contextual knowledge of their creator. Once exported, serialized, or transferred across environments, this contextual information is often lost.
The dataset package introduces semantically enriched
vectors and data frames that preserve explicit metadata throughout the
workflow lifecycle. However, fully formal semantic annotation is verbose
and cognitively demanding. Constructing semantically complete
RDF-compatible objects is appropriate only for mature stages of a
workflow.
In practice, semantic stabilisation is usually incremental.
Observational data often arrive with partially inconsistent, incomplete,
or ambiguous labels. Before a variable can mature into a formally
defined vector created with labelled::labelled() or
dataset::defined(), analysts typically perform several
rounds of semantic harmonisation.
The prelabelled class supports this intermediate
stage.
Unlike formally defined semantic vectors, prelabelled
vectors tolerate:
This vignette demonstrates how provisional semantic assertions can be incrementally stabilised while preserving the original observational evidence.
We begin with a small dataset containing country observations. The dataset is intentionally inconsistent: some observations use full country names, while others already use ISO 3166 alpha-2 country codes.
Such ambiguity is extremely common in operational analytical workflows, particularly when datasets are merged from multiple sources or manually curated over time.
country_data_1 <- data.frame(
  country = c("Andorra", "LI", "San Marino", "AD", "Liechtenstein"),
  time = c(2020, 2020, 2020, 2021, 2021),
  value = c(1.2, 2.4, 3.1, 1.3, 2.5)
)

We now create a lightweight semantic mapping.
The goal is not yet to create a formally closed semantic vocabulary. Instead, we begin stabilising the semantics incrementally by mapping some observational values to candidate semantic assertions.
Values that are not explicitly mapped remain self-describing.
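The idea of a partial mapping with self-describing leftovers can be illustrated in base R. This is a minimal sketch, not the package's implementation: `sketch_prelabel()` and `country_map_1` are hypothetical names chosen for illustration, and the mapping itself is an assumption consistent with the three full country names in the data.

```r
# Hypothetical helper (not the package's prelabel()): attach a provisional
# vocabulary to a character vector without altering its values.
sketch_prelabel <- function(x, labels) {
  x <- as.character(x)
  # Values without an explicit mapping describe themselves.
  unmapped <- setdiff(unique(x), names(labels))
  vocabulary <- c(labels, setNames(unmapped, unmapped))
  attr(x, "prelabel") <- vocabulary
  x
}

# Assumed mapping: only the full country names are mapped explicitly.
country_map_1 <- c(
  "Andorra"       = "AD",
  "Liechtenstein" = "LI",
  "San Marino"    = "SM"
)

country <- c("Andorra", "LI", "San Marino", "AD", "Liechtenstein")
provisional <- sketch_prelabel(country, labels = country_map_1)

# The vocabulary holds the three explicit mappings plus LI and AD
# mapped to themselves; the observed values are untouched.
attr(provisional, "prelabel")
```

Keeping the vocabulary in an attribute rather than overwriting the values is what lets the mapping be revised later without re-reading the source data.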
The resulting vector preserves the original observational values
while attaching a provisional semantic vocabulary in the
"prelabel" attribute.
print(country_data_1$country)
#> [1] "Andorra" "LI" "San Marino" "AD"
#> [5] "Liechtenstein"
#> attr(,"prelabel")
#>       Andorra Liechtenstein    San Marino            LI            AD
#>          "AD"          "LI"          "SM"          "LI"          "AD"
#> attr(,"class")
#> [1] "prelabelled" "character"

This separation between the original observational values and the provisional semantic vocabulary is a central design principle of the prelabelled class.
The observational values remain unchanged, while semantic operationalisation may evolve iteratively over time.
Using as.character() operationalises the semantic
assertions into a semantically stabilised character vector.
country_data_2 <- data.frame(
  country = as.character(country_data_1$country),
  time = country_data_1$time,
  value = country_data_1$value
)
country_data_2
#>   country time value
#> 1      AD 2020   1.2
#> 2      LI 2020   2.4
#> 3      SM 2020   3.1
#> 4      AD 2021   1.3
#> 5      LI 2021   2.5

Mapped observations are converted into their candidate semantic assertions, while unmatched values remain self-describing.
This allows analysts to gradually reduce semantic ambiguity without destroying the original observational evidence.
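The conversion step can be pictured as a lookup with a fallback. The following base R sketch assumes that behaviour; `sketch_resolve()` is a hypothetical helper and does not reproduce the package's as.character() method.

```r
# Hypothetical helper: resolve each value through a named lookup vector,
# falling back to the value itself when no mapping exists.
sketch_resolve <- function(x, vocabulary) {
  hit <- match(x, names(vocabulary))  # position of the first matching name
  ifelse(is.na(hit), x, unname(vocabulary)[hit])
}

vocabulary <- c(
  "Andorra"       = "AD",
  "Liechtenstein" = "LI",
  "San Marino"    = "SM"
)
country <- c("Andorra", "LI", "San Marino", "AD", "Liechtenstein")

sketch_resolve(country, vocabulary)
#> [1] "AD" "LI" "SM" "AD" "LI"
```

Because `country` itself is never modified, the lookup can be repeated with a revised vocabulary at any time.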
The next dataset contains a more difficult form of semantic ambiguity.
Some observations use ISO 3166 alpha-2 country codes, while others use ISO 3166 alpha-3 codes or full country names. Although the observations are semantically related, they do not yet form a stable closed vocabulary.
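The code that follows refers to `country_data_3$country`, but the definition of that dataset is not shown. A plausible reconstruction, assuming the six values that appear in the printed output of the prelabelled vector and omitting any further columns the original may have had:

```r
# Assumption: reconstructed from the printed values of the prelabelled
# vector; the original definition and its other columns are not shown.
country_data_3 <- data.frame(
  country = c("AD", "AND", "LI", "LIE", "SMR", "San Marino"),
  stringsAsFactors = FALSE
)
```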
The prelabelled workflow does not require complete
semantic resolution from the outset.
Instead, semantic stabilisation can proceed incrementally:
country_map_3 <- c(
  "Andorra" = "AD",
  "Andorra" = "AND",
  "Liechtenstein" = "LI",
  "San Marino" = "SM",
  "San Marino" = "SMR"
)
prelabelled_country <- prelabel(
  country_data_3$country,
  labels = country_map_3
)

This approach is particularly useful in exploratory analytical workflows, archival reconstruction, metadata harmonisation, and cross-dataset integration tasks.
prelabelled_country
#> [1] "AD" "AND" "LI" "LIE" "SMR"
#> [6] "San Marino"
#> attr(,"prelabel")
#>       Andorra       Andorra Liechtenstein    San Marino    San Marino
#>          "AD"         "AND"          "LI"          "SM"         "SMR"
#>            AD           AND            LI           LIE           SMR
#>          "AD"         "AND"          "LI"         "LIE"         "SMR"
#> attr(,"class")
#> [1] "prelabelled" "character"

While as.character() provides lightweight semantic coercion, which may be more useful after semantic stabilisation, the as_character() method creates a provenance-preserving semantic workspace.
as_character(prelabelled_country)
#> [1] "AD" "AND" "LI" "LIE" "SMR" "SM"
#> attr(,"prelabel")
#>       Andorra       Andorra Liechtenstein    San Marino    San Marino
#>          "AD"         "AND"          "LI"          "SM"         "SMR"
#>            AD           AND            LI           LIE           SMR
#>          "AD"         "AND"          "LI"         "LIE"         "SMR"
#> attr(,"original_values")
#> [1] "AD" "AND" "LI" "LIE" "SMR"
#> [6] "San Marino"
#> attr(,"original_values")attr(,"prelabel")
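#>       Andorra       Andorra Liechtenstein    San Marino    San Marino
#>          "AD"         "AND"          "LI"          "SM"         "SMR"
#>            AD           AND            LI           LIE           SMR
#>          "AD"         "AND"          "LI"         "LIE"         "SMR"

The resulting vector retains the provisional vocabulary in the "prelabel" attribute and the original observational values in the "original_values" attribute.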
This allows analysts to continue semantic refinement workflows while preserving reversibility and provenance awareness.
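Reversibility follows from carrying the input alongside the resolved values. A minimal base R sketch of that pattern, assuming first-match lookup as the printed output suggests; `sketch_resolve_keep()` is a hypothetical helper, not the package's as_character() method.

```r
# Hypothetical helper: resolve values through the vocabulary and keep both
# the vocabulary and the untouched input as attributes.
sketch_resolve_keep <- function(x, vocabulary) {
  hit <- match(x, names(vocabulary))  # first matching name wins
  resolved <- ifelse(is.na(hit), x, unname(vocabulary)[hit])
  attr(resolved, "prelabel") <- vocabulary
  attr(resolved, "original_values") <- x
  resolved
}

vocabulary <- c(
  "Andorra"       = "AD",
  "Andorra"       = "AND",
  "Liechtenstein" = "LI",
  "San Marino"    = "SM",
  "San Marino"    = "SMR"
)
observed <- c("AD", "AND", "LI", "LIE", "SMR", "San Marino")

workspace <- sketch_resolve_keep(observed, vocabulary)

# Resolved values: only "San Marino" has a matching name, so it becomes "SM".
as.vector(workspace)
#> [1] "AD"  "AND" "LI"  "LIE" "SMR" "SM"

# The original observations can be recovered at any time.
attr(workspace, "original_values")
```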
The goal of prelabelled vectors is not to replace
formally defined semantic vectors.
Instead, they provide a lightweight preparatory stage for incremental semantic stabilisation.
Once semantic ambiguity has been sufficiently reduced,
prelabelled vectors may mature into formally defined
semantic vectors created with labelled::labelled() or
dataset::defined(). For further information, see
vignette("defined", package = "dataset") (Working with
semantic vectors: Semantic vectors with defined()).
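One way to judge whether ambiguity has been sufficiently reduced is to look for labels that still map to more than one code. The check below is an illustrative base R sketch under that assumed reading of "ambiguity"; it is not a function provided by the package.

```r
# Provisional vocabulary from the alpha-2 / alpha-3 example.
vocabulary <- c(
  "Andorra"       = "AD",
  "Andorra"       = "AND",
  "Liechtenstein" = "LI",
  "San Marino"    = "SM",
  "San Marino"    = "SMR"
)

# Group the codes by label and count the distinct codes per label.
codes <- split(unname(vocabulary), names(vocabulary))
n_codes <- vapply(codes, function(v) length(unique(v)), integer(1))

# Labels that still carry more than one candidate code.
ambiguous <- names(n_codes)[n_codes > 1]
ambiguous
#> [1] "Andorra"    "San Marino"
```

An empty result would indicate that each label resolves to a single code, which is one reasonable precondition for moving to a formally defined vector.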
In this sense, semantic enrichment becomes an iterative analytical workflow rather than a single terminal annotation step.