teal_data reproducibility

NEST CoreDev

Reproducibility of teal_data objects

Reproducibility is a primary function of the qenv class, which teal_data inherits from. Every data modification in a teal_data object is performed in an encapsulated environment, separate from the global environment.

It is important to note that the reproducibility of this object is limited only to the data-code relationship. Other aspects such as the reliability of the data source, reproducibility of the R session (including package versions), and creation and use of objects from other environments (e.g. .GlobalEnv) cannot be verified properly by teal_data. It is advisable to always begin analysis in a new session and run all code that pertains to the analysis within the teal_data object.
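The recommendation above can be sketched as follows. This is only an illustration: the object names `offset` and `x` are made up for the example, and the point is that every value the analysis relies on is created inside the teal_data object rather than borrowed from the global environment.

```r
library(teal.data)

# Create all objects inside the teal_data object so that each step is
# recorded as code alongside the data it produces.
data <- within(teal_data(), {
  offset <- 10        # hypothetical helper value, defined in the encapsulated environment
  x <- 1:5 + offset   # depends only on objects created in the same environment
})

# The stored code contains everything needed to recreate `x`.
cat(get_code(data), sep = "\n")
```

Had `offset` been defined in the global environment instead, the stored code would refer to an object that teal_data knows nothing about.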

Verification

Verification status

Every teal_data object has a verification status, which states whether the contents of its environment can be reproduced by the stored code. teal_data objects that are instantiated empty are verified, whereas objects instantiated with data and code are unverified, because the supplied code is not executed and therefore need not reproduce the data. Objects instantiated with data only are unverified as well, since there is no code to reproduce them.

When evaluating code in a teal_data object, the code that is stored is the same as the code that is executed, so it is reproducible by definition. Therefore, evaluating code in a teal_data object does not change its verification status.

The verification status is always printed when inspecting a teal_data object. Also, when retrieving code, unverified objects add a warning to the code stating that it has not passed verification.

library(teal.data)

data_empty <- teal_data()
data_empty # is verified
## ✅︎ verified teal_data object
## <environment: 0x7feee372c858> [L]
## Parent: <environment: package:teal.data>
data_empty <- within(data_empty, i <- head(iris))
data_empty # remains verified
## ✅︎ verified teal_data object
## <environment: 0x7feeeb19ada0> [L]
## Parent: <environment: package:teal.data>
## Bindings:
## • i: <df[,5]> [L]
data_with_data <- teal_data(i = head(iris), code = "i <- head(iris)")
data_with_data # is unverified
## ✖ unverified teal_data object
## <environment: 0x7feee0f90118> [L]
## Parent: <environment: package:teal.data>
## Bindings:
## • i: <df[,5]> [L]
data_with_data <- within(data_with_data, i$rand <- sample(nrow(i)))
data_with_data # remains unverified
## ✖ unverified teal_data object
## <environment: 0x7feee9aaf930> [L]
## Parent: <environment: package:teal.data>
## Bindings:
## • i: <df[,6]> [L]
cat(get_code(data_with_data)) # warning is prepended
## warning('Code was not verified for reproducibility.')
## i <- head(iris)
## i$rand <- sample(nrow(i))

Verification process

In order to confirm that the code stored in teal_data exactly reproduces the contents of the environment, one must run the verify() function. This causes the code to be evaluated and the results to be compared to the contents of the environment. If the code executes without errors and the results are the same as the contents already present in the environment, the verification is successful and the object’s state will be changed to verified. Otherwise an error will be raised.

verified

library(teal.data)

data <- data.frame(x = 11:20)
data$id <- seq_len(nrow(data))

data_right <- teal_data(
  data = data,
  code = quote({
    data <- data.frame(x = 11:20)
    data$id <- seq_len(nrow(data))
  })
)
(data_right_verified <- verify(data_right)) # returns verified object
## ✅︎ verified teal_data object
## <environment: 0x7feeea042a20> [L]
## Parent: <environment: package:teal.data>
## Bindings:
## • data: <df[,2]> [L]

unverified

data_wrong <- teal_data(
  data = data,
  code = quote({
    data <- data.frame(x = 11:20)
  })
)
verify(data_wrong) # fails verification, raises error
## Error: Code verification failed.
## Object(s) recreated with code that have different structure in data_wrong:
##   • data
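Because a failed verification raises an error, code that calls verify() non-interactively may want to catch it. The following is a minimal sketch using base R's tryCatch(). The fallback of returning NULL is an assumption made for the example, not something teal.data prescribes.

```r
library(teal.data)

data <- data.frame(x = 11:20)
data$id <- seq_len(nrow(data))

# The code omits the `id` column, so it does not reproduce `data`.
data_wrong <- teal_data(
  data = data,
  code = "data <- data.frame(x = 11:20)"
)

# Catch the verification error instead of letting it stop the script.
result <- tryCatch(
  verify(data_wrong),
  error = function(e) {
    message("Verification failed: ", conditionMessage(e))
    NULL  # illustrative fallback value
  }
)

is.null(result)
```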

Retrieving code

The get_code function retrieves the code stored in a teal_data object. A plain get_code(<teal_data>) returns all of the code, whereas the datanames argument restricts the result to the code needed to create the named objects.

library(teal.data)

data <- within(teal_data(), {
  i <- iris
  m <- mtcars
  head(i)
})
cat(get_code(data)) # retrieve all code
## i <- iris
## m <- mtcars
## head(i)
cat(get_code(data, datanames = "i")) # retrieve code for `i`
## i <- iris

Note that when retrieving code for a specific dataset, the result contains only the code used to create that dataset, not the code that uses it.

Tracking object dependencies

Calling get_code with datanames specified initiates an analysis of the stored code, in which object dependencies are automatically discovered. If object x is created with an expression that uses object y, the lines that create object y must also be returned. This is quite effective when objects are created by simple assignments like x <- foo(y). However, in rare cases discovering dependencies is impossible, e.g. when opening connections to databases or when objects are created by side effects (functions acting on their calling environment implicitly rather than returning a value that is then assigned). In such cases the code author must manually tag code lines that are required for a dataset by adding a special comment to the lines: # @linksto x will cause the line to be included when retrieving code for x.
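Both cases can be sketched as below. The object names (`x`, `y`, `z`, `increment_x`) are hypothetical, and the second part passes the code as a string through `teal.code::eval_code()` on the assumption that comments, including the `# @linksto` tag, are only preserved when the code is supplied as text. The `datanames` argument is used as in the rest of this document; check ?get_code for the argument name in your installed version.

```r
library(teal.data)

# 1. Dependencies discovered automatically: `y` is built from `x`.
data <- within(teal_data(), {
  x <- 1:5
  y <- x * 2
  z <- "unrelated"
})
code_y <- get_code(data, datanames = "y")
cat(code_y, sep = "\n")  # expected to include the line creating `x`, but not `z`

# 2. Side effect tagged manually: increment_x() modifies `x` without assigning to it.
data_tagged <- teal.code::eval_code(teal_data(), "
  x <- 0
  increment_x <- function() x <<- x + 1
  increment_x() # @linksto x
")
code_x <- get_code(data_tagged, datanames = "x")
cat(code_x, sep = "\n")  # the tagged call should be part of the code for `x`
```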

See ?get_code for a detailed explanation and examples.