provTraceR

The provTraceR package displays information about files used or created by an R script or a series of R scripts. The package uses provenance collected by rdtLite and stored in prov-json format. Output from provTraceR can be used to help manage files and to identify the input files needed to reproduce an analysis.

Installation

To install from GitHub:

devtools::install_github("End-to-end-provenance/provTraceR")

To load the package after installation:

library("provTraceR")

Usage

This package includes two functions:

  1. To use existing provenance to trace file lineage:
prov.trace(scripts, prov.dir=NULL, file.details=FALSE, console=TRUE, save=FALSE, save.dir=NULL, check=TRUE)
  1. To run one or more scripts, collect provenance, and trace file lineage:
prov.trace.run(scripts, prov.dir=NULL, file.details=FALSE, console=TRUE, save=FALSE, save.dir=NULL, check=TRUE, prov.tool="rdtLite", details=FALSE, ...)

The scripts parameter may contain a single script name, a vector of script names, or a text file (with extension .txt) of script names.

For prov.trace only: If more than one script is specified, the order of the scripts must match the order of execution as recorded in the provenance; otherwise an error message is displayed. For console sessions, set scripts = “console”.

For prov.trace.run only: The provenance collection tool specified by prov.tool must be “rdtLite” or “rdt”. If details = TRUE, fine-grained provenance is collected. Other optional parameters (…) are passed to rdtLite or rdt. Scripts are executed in the order listed.

It is assumed that provenance for each script is stored under a single provenance directory set by the prov.dir option. If not, the provenance directory may be specified with the prov.dir parameter. Timestamped provenance and provenance in scattered locations are not currently supported.

Files are matched by hash value. INPUTS lists files that are required to run the script or scripts. These include files read by a script and not written by an earlier script or previously written by the same script. OUTPUTS lists files written by the script or scripts. EXCHANGES lists files with the same hash value that were written by one script and read by a later script; if the location changed, both locations are listed.

If file.details = TRUE, additional details are displayed, including script execution timestamps, file timestamps, file hash values, and saved file names.

Results of both functions are returned as a string.

If console = TRUE (the default), results are displayed in the console.

If save = TRUE, results are saved to the file prov-trace.txt.

The save.dir parameter determines where the results file is saved. If NULL (the default), the R session temporary directory is used. If a period (.), the current working directory is used. Otherwise the directory specified by save.dir is used.

If check = TRUE (the default), each file recorded in the provenance is checked against the user’s file system. A dash (-) in the output indicates that the file no longer exists, a plus (+) indicates that the file exists but the hash value has changed, and a colon (:) indicates that the file exists and the hash value is unchanged. If check = FALSE, no comparison is made.

Example

In this example, three R scripts are used to gap fill, harmonize, and combine data from two meteorological stations to create a single dataset. The script names are contained in the file “update-hf300.txt”.

In the first case, the prov.trace.run function is used to run the scripts, collect provenance, and display summary file information.

prov.trace.run("update-hf300.txt")

Console output (below) shows the save message for each script from rdtLite followed by output from prov.trace.run. Scripts are numbered in the order of execution. Each line shows the script number, a symbol indicating whether the file has changed since provenance was collected, and the file path and name.

[1] "Saving prov.json in C:/Prov/prov_gap-fill-shaler"
[1] "Saving prov.json in C:/Prov/prov_combine-shaler-fisher"
[1] "Saving prov.json in C:/Prov/prov_calculate-hf-annual-monthly"

SCRIPTS:

1 : C:/TraceR/gap-fill-shaler.R 
2 : C:/TraceR/combine-shaler-fisher.R 
3 : C:/TraceR/calculate-hf-annual-monthly.R 

INPUTS:

1 : C:/TraceR/amherst-ma-1964-2002.csv 
1 : C:/TraceR/bedford-ma-1964-2002.csv 
1 : C:/TraceR/hf000-02-daily-e.csv 
2 : C:/TraceR/hf001-06-daily-m.csv 
2 : C:/TraceR/hf001-08-hourly-m.csv 

OUTPUTS:

1 : C:/TraceR/hf-shaler-gap-filled.csv 
2 : C:/TraceR/hf-shaler-fisher-overlap.csv 
2 : C:/TraceR/hf300-05-daily-m.csv 
2 : C:/TraceR/hf300-06-daily-e.csv 
3 : C:/TraceR/hf300-01-annual-m.csv 
3 : C:/TraceR/hf300-02-annual-e.csv 
3 : C:/TraceR/hf300-03-monthly-m.csv 
3 : C:/TraceR/hf300-04-monthly-e.csv 

EXCHANGES:

1 > 2 : C:/TraceR/hf-shaler-gap-filled.csv 
2 > 3 : C:/TraceR/hf300-05-daily-m.csv

In the second case, the prov.trace function is used to display detailed file information contained in the provenance without running the scripts.

prov.trace("update-hf300.txt", file.details=TRUE)

For each file, the console output (below) shows the file timestamp, the file hash value and algorithm, and the path and name of the saved copy of the file on the provenance directory. For scripts the execution time stamp is also shown.

SCRIPTS:

1 : C:/TraceR/gap-fill-shaler.R 
        Timestamp: 2019-10-19T09.42.45EDT 
        Hash:      9ab73da3681ae9cbe85efb912550e432 / md5 
        Saved:     C:/Prov/prov_gap-fill-shaler/scripts/gap-fill-shaler.R 
        Executed:  2020-07-08T10.21.30EDT 

2 : C:/TraceR/combine-shaler-fisher.R 
        Timestamp: 2019-10-19T09.41.59EDT 
        Hash:      848a20e2696b1fb7c9bdeec27df059f5 / md5 
        Saved:     C:/Prov/prov_combine-shaler-fisher/scripts/combine-shaler-fisher.R 
        Executed:  2020-07-08T10.21.35EDT 

3 : C:/TraceR/calculate-hf-annual-monthly.R 
        Timestamp: 2019-10-19T10.16.12EDT 
        Hash:      213661ba5f7e4de68d2205c9fe8c0922 / md5 
        Saved:     C:/Prov/prov_calculate-hf-annual-monthly/scripts/calculate-hf-annual-monthly.R 
        Executed:  2020-07-08T10.21.41EDT 


INPUTS:

1 : C:/TraceR/amherst-ma-1964-2002.csv 
        Timestamp: 2019-10-16T10.51.53EDT 
        Hash:      06c82be1ceeec8f41216ee670f485d77 / md5 
        Saved:     C:/Prov/prov_gap-fill-shaler/data/2-amherst-ma-1964-2002.csv 

1 : C:/TraceR/bedford-ma-1964-2002.csv 
        Timestamp: 2019-10-17T10.43.55EDT 
        Hash:      d7f8e08fd84f4b75941325cd82ca7768 / md5 
        Saved:     C:/Prov/prov_gap-fill-shaler/data/3-bedford-ma-1964-2002.csv 

1 : C:/TraceR/hf000-02-daily-e.csv 
        Timestamp: 2019-10-16T10.37.42EDT 
        Hash:      e9f67f7074eb68059385c683d0410c01 / md5 
        Saved:     C:/Prov/prov_gap-fill-shaler/data/1-hf000-02-daily-e.csv 

2 : C:/TraceR/hf001-06-daily-m.csv 
        Timestamp: 2020-06-01T09.07.21EDT 
        Hash:      5e515ea3e7080543fba92b9b9114810f / md5 
        Saved:     C:/Prov/prov_combine-shaler-fisher/data/2-hf001-06-daily-m.csv 

2 : C:/TraceR/hf001-08-hourly-m.csv 
        Timestamp: 2019-10-17T11.34.57EDT 
        Hash:      af36c84e4c0b8f72632eba5661506129 / md5 
        Saved:     C:/Prov/prov_combine-shaler-fisher/data/3-hf001-08-hourly-m.csv 


OUTPUTS:

1 : C:/TraceR/hf-shaler-gap-filled.csv 
        Timestamp: 2020-07-08T10.21.34EDT 
        Hash:      a5022c912b1ec50e8cd4c20d8ed636cf / md5 
        Saved:     C:/Prov/prov_gap-fill-shaler/data/4-hf-shaler-gap-filled.csv 

2 : C:/TraceR/hf-shaler-fisher-overlap.csv 
        Timestamp: 2020-07-08T10.21.40EDT 
        Hash:      f7334fb30cf16c566f8e1de2b7643cf2 / md5 
        Saved:     C:/Prov/prov_combine-shaler-fisher/data/6-hf-shaler-fisher-overlap.csv 

2 : C:/TraceR/hf300-05-daily-m.csv 
        Timestamp: 2020-07-08T10.21.39EDT 
        Hash:      1c9eabddcd5474e11e36168234a1cfae / md5 
        Saved:     C:/Prov/prov_combine-shaler-fisher/data/4-hf300-05-daily-m.csv 

2 : C:/TraceR/hf300-06-daily-e.csv 
        Timestamp: 2020-07-08T10.21.40EDT 
        Hash:      e463c55ff22f56c2fe5e7a69758d3339 / md5 
        Saved:     C:/Prov/prov_combine-shaler-fisher/data/5-hf300-06-daily-e.csv 

3 : C:/TraceR/hf300-01-annual-m.csv 
        Timestamp: 2020-07-08T10.21.42EDT 
        Hash:      e4969c413d3abce641335ad418b51f5c / md5 
        Saved:     C:/Prov/prov_calculate-hf-annual-monthly/data/2-hf300-01-annual-m.csv 

3 : C:/TraceR/hf300-02-annual-e.csv 
        Timestamp: 2020-07-08T10.21.42EDT 
        Hash:      cc99e71c31fd4696d99e68e724497dc5 / md5 
        Saved:     C:/Prov/prov_calculate-hf-annual-monthly/data/3-hf300-02-annual-e.csv 

3 : C:/TraceR/hf300-03-monthly-m.csv 
        Timestamp: 2020-07-08T10.21.42EDT 
        Hash:      de8267dba4643b5d174d4a3140bd9414 / md5 
        Saved:     C:/Prov/prov_calculate-hf-annual-monthly/data/4-hf300-03-monthly-m.csv 

3 : C:/TraceR/hf300-04-monthly-e.csv 
        Timestamp: 2020-07-08T10.21.42EDT 
        Hash:      bf6841b2b01b81c87b30f843b6dda0b1 / md5 
        Saved:     C:/Prov/prov_calculate-hf-annual-monthly/data/5-hf300-04-monthly-e.csv 


EXCHANGES:

1 > 2 : C:/TraceR/hf-shaler-gap-filled.csv 
        Timestamp: 2020-07-08T10.21.34EDT 
        Hash:      a5022c912b1ec50e8cd4c20d8ed636cf / md5 
        Saved out: C:/Prov/prov_gap-fill-shaler/data/6-hf-shaler-fisher-overlap.csv 
        Saved in:  C:/Prov/prov_combine-shaler-fisher/data/1-hf-shaler-gap-filled.csv 

2 > 3 : C:/TraceR/hf300-05-daily-m.csv 
        Timestamp: 2020-07-08T10.21.39EDT 
        Hash:      1c9eabddcd5474e11e36168234a1cfae / md5 
        Saved out: C:/Prov/prov_combine-shaler-fisher/data/4-hf300-03-monthly-m.csv 
        Saved in:  C:/Prov/prov_calculate-hf-annual-monthly/data/1-hf300-05-daily-m.csv 

In both cases, the colon after the script number for each file indicates that the file has not changed since the provenance was collected.