Introduction
This document describes how to use the dataone R package to
upload data to DataONE, and how to perform maintenance operations on the
data after upload.
The dataone R package provides methods to enable R scripts
to interact with DataONE Coordinating Nodes (CN) and Member Nodes (MN),
to search for, download, upload and update data and metadata. The
dataone R package takes care of the details of calling the
corresponding DataONE web service on a DataONE node. For example, the
dataone createObject R method calls the DataONE web service
MNStorage.create() that uploads a dataset to a DataONE MN.
Before uploading any data to a DataONE MN, it is necessary to obtain
a DataONE user identity that will be provided with each request to
upload or update data. The method that DataONE uses to achieve this is
known as user identity authentication, and requires that an
authentication token, which is a character string, be provided
during upload. The process to obtain this token is described in the
DataONE Federation vignette, in the section DataONE
User Authentication With Tokens, which is viewable with the R
command vignette("dataone-overview"). (Note: DataONE originally used
X.509 certificates for authentication, which are still supported.)
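As a minimal sketch of how a token might be supplied from within a script, the example below reads the token string from an environment variable and stores it in an R option. The environment variable name DATAONE_TEST_TOKEN and the token value are made up for illustration. The option names dataone_test_token (test environments) and dataone_token (production) are assumptions here and should be confirmed against the authentication section of the vignette mentioned above.

```r
# Hypothetical environment variable holding the token string. In practice it
# would be set outside the script so the token is never stored in code.
Sys.setenv(DATAONE_TEST_TOKEN = "example-token-string")

token <- Sys.getenv("DATAONE_TEST_TOKEN")
if (nzchar(token)) {
  # Assumed option name for DataONE test environments such as STAGING;
  # the production equivalent is assumed to be 'dataone_token'.
  options(dataone_test_token = token)
}
```

Keeping the token out of the script itself avoids accidentally sharing it when the script is published alongside the data.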
Uploading A Package Using uploadDataPackage
Datasets and metadata can be uploaded individually or as a
collection. Such a collection, whether contained in local R objects or
existing on a DataONE repository, will be informally referred to as a
package or ‘data package’. Figure 1 is a diagram of a typical DataONE
package showing a metadata file that describes, or documents, the data
granules that the package contains.
The steps necessary to prepare and upload a package to DataONE using
the uploadDataPackage method are shown below. A complete script that
uses these steps is shown here, with detailed explanations of the steps
following.
library(dataone)
library(datapack)
library(uuid)
dp <- new("DataPackage")
emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)
dp <- addMember(dp, metadataObj)
sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData)
dp <- addMember(dp, sourceObj, metadataObj)
progFile <- system.file("extdata/filterObs.R", package="dataone")
progObj <- new("DataObject", format="application/R", filename=progFile, mediaType="text/x-rsrc")
dp <- addMember(dp, progObj, metadataObj)
outputData <- system.file("extdata/Strix-occidentalis-obs.csv", package="dataone")
outputObj <- new("DataObject", format="text/csv", filename=outputData)
dp <- addMember(dp, outputObj, metadataObj)
myAccessRules <- data.frame(subject="https://orcid.org/0000-0002-2192-403X", permission="changePermission")
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp, public=TRUE, accessRules=myAccessRules, quiet=FALSE)
This particular package contains the R script filterObs.R, the input
file OwlNightj.csv that was read by the script, and the output file
Strix-occidentalis-obs.csv that was created by the R script, which was
run at a previous time.
The following sections describe each line of the above script in
detail.
1. Create a DataPackage object.
In order to use uploadDataPackage, it is necessary to prepare an R
DataPackage object, which is a container for the set of files that will
be included in the package. The following commands load the required
libraries and create an empty DataPackage object that will be added to
later:
library(dataone)
library(datapack)
library(uuid)
dp <- new("DataPackage")
When using the uploadDataPackage method, data structures
that are required by DataONE are created, configured and uploaded
automatically with the package. These data structures include a ResourceMap
that details the contents of the package, and SystemMetadata objects
that contain DataONE system information for each of the science datasets
and associated science metadata.
A dataone DataObject is a container that holds
both the data bytes and the system information for a metadata file, data
or other type of file. A DataObject is created for each file that will
be included in a DataPackage.
2. Prepare a metadata file that will describe the files in the package
The next step is to prepare a metadata file that will describe the
science datasets and other files in the package. The most common
metadata format used in the DataONE network is the Ecological
Metadata Language (EML). Other supported formats include FGDC, ISO
19115 and others. Additional information about EML is available at https://knb.ecoinformatics.org/#external//emlparser/docs/index.html.
Detailed directions regarding authoring metadata documents are
outside the scope of this document.
DataONE requires that any file uploaded to a member node have a
unique identifier associated with it.
When a DataObject is created, a unique identifier is generated for
the DataObject if one is not specified using the id parameter. This
automatically generated identifier has the format “urn:uuid:” followed
by a UUID, for example
“urn:uuid:c3443142-6260-4ea5-aaa1-1114981e04ad”.
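If a specific identifier is preferred, it can be passed with the id parameter instead. The following is a minimal sketch, assuming the datapack and uuid packages are installed; the temporary CSV file and the variable name myId exist only for this illustration.

```r
library(datapack)
library(uuid)

# Temporary CSV file used only for this demonstration.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), csvfile, row.names = FALSE)

# Build an identifier in the same "urn:uuid:" style that is used by default.
myId <- paste0("urn:uuid:", UUIDgenerate())
obj <- new("DataObject", id = myId, format = "text/csv", filename = csvfile)

# The DataObject reports the identifier that was passed in.
getIdentifier(obj)
```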
The following commands create the DataObject for the science
metadata, using an automatically generated identifier:
emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)
Now add the metadata object to the DataPackage:
dp <- addMember(dp, metadataObj)
Files are considered members of a package when they are enumerated
and described by a metadata file, and a relationship between the
metadata and data object is explicitly stated.
DataONE (and the dataone R package) has adopted the package
guidelines detailed by the DataONE package implementation. In this
specification, the relationship that links a metadata object and a
science object is the CiTO (Citation Typing Ontology) documents
relationship.
This relationship between the science metadata and data objects is
added to the DataPackage automatically for each data object, provided
that the metadata object is added first and is then referenced as each
DataObject is added.
As the metadata object has already been added, it can be referenced
as each DataObject is added.
dp <- addMember(dp, sourceObj, metadataObj)
Since metadataObj is included as the third argument here, the CiTO
documents relationship will automatically be added between metadataObj
and sourceObj.
Alternatively, this relationship between the metadata and science
objects can be made explicitly using the insertRelationship() method:
dp <- addMember(dp, metadataObj)
dp <- addMember(dp, sourceObj)
dp <- insertRelationship(dp, getIdentifier(metadataObj), getIdentifier(sourceObj))
Note that the relationship type, which is set with the predicate
argument of insertRelationship(), does not have to be specified in this
case, as the CiTO documents relationship is the default value for
insertRelationship().
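For comparison, the sketch below states the predicate explicitly and then lists the relationships recorded in the package. It assumes the datapack package is installed and that http://purl.org/spar/cito/documents is the URI of the CiTO documents relationship. The two temporary files are placeholders standing in for real metadata and data.

```r
library(datapack)

# Two small temporary files standing in for metadata and data.
metaFile <- tempfile(fileext = ".xml")
writeLines("<metadata/>", metaFile)
dataFile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), dataFile, row.names = FALSE)

metaObj <- new("DataObject", format = "eml://ecoinformatics.org/eml-2.1.1",
               filename = metaFile)
dataObj <- new("DataObject", format = "text/csv", filename = dataFile)

pkg <- new("DataPackage")
pkg <- addMember(pkg, metaObj)
pkg <- addMember(pkg, dataObj)

# State the predicate explicitly rather than relying on the default.
pkg <- insertRelationship(pkg,
                          subjectID = getIdentifier(metaObj),
                          objectIDs = getIdentifier(dataObj),
                          predicate = "http://purl.org/spar/cito/documents")

# List the relationships now recorded in the package.
rels <- getRelationships(pkg)
rels[, c("subject", "predicate", "object")]
```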
3. Create and add a DataObject for each data file
A DataObject must be created for each metadata file, data file or any
other type of file that will be included in the package.
A dataone SystemMetadata R object will be created automatically and
stored in each DataObject. The information from the SystemMetadata R
object will be used by DataONE to maintain low-level information about
the dataset, such as the access policy, the user identity of the
rightsholder (the user identity that can modify access to the dataset),
which Member Nodes it can be replicated to, etc.
The example below creates a DataObject for a science dataset:
sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData)
dp <- addMember(dp, sourceObj, metadataObj)
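The system metadata that was generated for a DataObject can be inspected before upload. The following is a minimal sketch, assuming the datapack package is installed and that the system metadata is held in the sysmeta slot of the DataObject; a temporary file is used in place of a real dataset.

```r
library(datapack)

# Temporary CSV file used only for this demonstration.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:20, y = 11:30), csvfile, row.names = FALSE)
obj <- new("DataObject", format = "text/csv", filename = csvfile)

# The SystemMetadata object is assumed to be held in the 'sysmeta' slot.
sm <- obj@sysmeta
sm@identifier   # automatically generated identifier
sm@formatId     # format that was passed when the DataObject was created
sm@size         # size of the file in bytes
```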
An optional user argument can be specified when creating a
DataObject, which will be used to set the DataONE submitter and
rightsholder of the dataset when it is uploaded. The
rightsholder is granted all access privileges to the object.
If user is not specified for a DataObject, then the
submitter and rightsholder for an object will automatically be set, when
the object is uploaded to DataONE, to the DataONE user that created the
authentication token or X.509 certificate.
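As a minimal sketch of the user argument, the example below creates a DataObject for a temporary file and then reads back the submitter and rightsholder from its system metadata. It assumes the datapack package is installed; the ORCID is the example identity used elsewhere in this document.

```r
library(datapack)

# Temporary CSV file used only for this demonstration.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), csvfile, row.names = FALSE)

# Example user identity, as used elsewhere in this document.
orcid <- "https://orcid.org/0000-0002-2192-403X"
obj <- new("DataObject", format = "text/csv", filename = csvfile, user = orcid)

# Both values are expected to be the identity passed as 'user'.
obj@sysmeta@submitter
obj@sysmeta@rightsHolder
```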
Now DataObjects for an R script and for a file created by the R
script will be created:
progFile <- system.file("extdata/filterObs.R", package="dataone")
progObj <- new("DataObject", format="application/R", filename=progFile, mediaType="text/x-rsrc")
dp <- addMember(dp, progObj, mo=metadataObj)
outputData <- system.file("extdata/Strix-occidentalis-obs.csv", package="dataone")
outputObj <- new("DataObject", format="text/csv", filename=outputData)
dp <- addMember(dp, outputObj, mo=metadataObj)
Important Note: files previously uploaded to DataONE can be
added to the DataPackage. For more information, see the section “Adding
Existing DataONE Files To A Package”.
4. Determine what access your data and metadata should have
DataONE provides a mechanism that allows data submitters to control
access to their data.
The levels of access available to objects in DataONE are “read”,
“write”, and “changePermission”.
The “read” permission allows a user the ability to view the content
of a DataONE object. The “write” permission allows a user the ability to
change the content of an object via update services. The
“changePermission” permission allows the ability to change the access
policy for an object and includes both read and write permissions.
The access rules that are added to DataObjects in a DataPackage will
determine the access that is granted to users accessing the package
after it is uploaded to DataONE.
Each of these permissions can be granted to a single user, a group of
users, or the special public user, which means all users.
Each object in DataONE can have one or more access rules that control
the access of that object. The complete set of access rules for an
object is referred to as its access policy.
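An access policy can be drafted as a data frame with one row per access rule before it is attached to any object. The sketch below uses base R only; the helper checkAccessRules is hypothetical, written for this illustration, and simply rejects permission names other than the three listed above.

```r
# Permission names listed above.
validPermissions <- c("read", "write", "changePermission")

# Hypothetical helper: stop if any rule uses an unknown permission.
checkAccessRules <- function(rules) {
  stopifnot(is.data.frame(rules),
            all(c("subject", "permission") %in% names(rules)))
  bad <- setdiff(as.character(rules$permission), validPermissions)
  if (length(bad) > 0) {
    stop("Unknown permission(s): ", paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

# A policy with two rules: public read, plus full control for one user.
policy <- data.frame(
  subject = c("public", "https://orcid.org/0000-0002-2192-403X"),
  permission = c("read", "changePermission"),
  stringsAsFactors = FALSE
)
checkAccessRules(policy)
```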
Access rules can be added to each DataObject individually
after it has been created.
Alternatively, access rules can be specified for all package members
when a package is uploaded using uploadDataPackage. This method is
shown in the section “5. Upload the DataPackage”.
To grant read permission to all users:
sourceObj <- setPublicAccess(sourceObj)
Access rules for an individual DataONE user identity can also be
added to the access policy.
Access rules are added to a DataObject using the addAccessRule method.
The following access rule will grant the user with the ORCID
https://orcid.org/0000-0002-2192-403X changePermission access to the
dataset:
myAccessRules <- data.frame(subject="https://orcid.org/0000-0002-2192-403X", permission="changePermission")
sourceObj <- addAccessRule(sourceObj, myAccessRules)
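Several rules can be added in one call by passing a data frame with more than one row, and the result can be checked locally. The following is a minimal sketch, assuming the datapack package is installed and that the access policy is stored in the system metadata of the DataObject; the temporary file stands in for a real dataset.

```r
library(datapack)

# Temporary CSV file used only for this demonstration.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), csvfile, row.names = FALSE)
obj <- new("DataObject", format = "text/csv", filename = csvfile)

# One row per access rule.
rules <- data.frame(
  subject = c("public", "https://orcid.org/0000-0002-2192-403X"),
  permission = c("read", "changePermission"),
  stringsAsFactors = FALSE
)
obj <- addAccessRule(obj, rules)

# Inspect the resulting access policy and test for a specific rule.
obj@sysmeta@accessPolicy
hasAccessRule(obj@sysmeta, "public", "read")
```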
DataONE user identities and user authentication are described in
section DataONE User Authentication in the vignette dataone-overview
(to view this vignette, type this command in the R console:
vignette("dataone-overview")).
5. Upload the DataPackage
When all DataObjects have been added to the DataPackage, call the
uploadDataPackage method to upload the entire DataPackage.
As mentioned previously, as an alternative to adding access rules to
each DataObject individually before adding it to the DataPackage, the
access rules can be specified once when the package is uploaded to
DataONE. For example, to add public access to every object in the
package, and add the custom access rule shown above, the public and
accessRules arguments are used when calling uploadDataPackage:
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp, public=TRUE, accessRules=myAccessRules, quiet=FALSE)
message(sprintf("Uploaded package with identifier: %s", packageId))
(Note that the example uses a DataONE test environment
STAGING, and not the production environment.)
After uploadDataPackage has been called successfully, the
package can be viewed on the member node and searched for using the
DataONE search facility. Note that if objects in DataONE are not publicly
readable, and the authenticated user performing the search isn’t granted
access in an object’s access policy, then the objects will not be
viewable or discoverable via the search facility for that user.
Uploading Individual Data And Metadata Files
A single data or metadata file can be uploaded to a DataONE MN using
the createObject method. When uploading a single file using
this method, additional information must be supplied to DataONE that
controls how DataONE interacts with the uploaded file. This additional
information is stored in DataONE as a system metadata object
and contains information such as who can access or update the file, how
many copies of the file should be maintained, whether the file has been
superseded by another object, etc. The system metadata information that
will be uploaded to DataONE is collected and stored in an R object type
datapack::SystemMetadata, as shown below:
library(dataone)
library(datapack)
library(uuid)
library(digest)
# Create a system metadata object for a data file.
# Just for demonstration purposes, create a temporary data file.
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
format <- "text/csv"
size <- file.info(csvfile)$size
sha256 <- digest(csvfile, algo="sha256", serialize=FALSE, file=TRUE)
# Generate a unique identifier for the dataset
pid <- sprintf("urn:uuid:%s", UUIDgenerate())
sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha256)
sysmeta <- addAccessRule(sysmeta, "public", "read")
Alternatively, the system metadata could have been created with a
seriesId. The seriesId is explained in the
dataone-overview vignette. The following example shows the
creation of a SystemMetadata object using the optional
seriesId:
# Create a system metadata object for a data file.
# Just for demonstration purposes, create a temporary data file.
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
format <- "text/csv"
size <- file.info(csvfile)$size
sha256 <- digest(csvfile, algo="sha256", serialize=FALSE, file=TRUE)
# Generate a unique identifier for the dataset
pid <- sprintf("urn:uuid:%s", UUIDgenerate())
# The seriesId can be any unique character string.
seriesId <- sprintf("urn:uuid:%s", UUIDgenerate())
sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha256, seriesId=seriesId)
A unique identifier must be specified for each system metadata object,
whether or not a seriesId is used.
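Because the size and checksum are entered by hand when system metadata is built this way, it can be worth confirming locally that they still match the file before uploading. The sketch below assumes the datapack and digest packages are installed. The helper sysmetaMatchesFile is hypothetical and written for this illustration, and the identifier is a placeholder.

```r
library(datapack)
library(digest)

# Hypothetical helper: TRUE when the size and SHA-256 checksum recorded in
# 'sysmeta' agree with the file on disk.
sysmetaMatchesFile <- function(sysmeta, file) {
  sizeOk <- isTRUE(sysmeta@size == file.info(file)$size)
  sumOk <- identical(tolower(sysmeta@checksum),
                     digest(file, algo = "sha256", serialize = FALSE, file = TRUE))
  sizeOk && sumOk
}

# Temporary data file and matching system metadata for demonstration.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:20, y = 11:30), csvfile, row.names = FALSE)
sysmeta <- new("SystemMetadata",
               identifier = "urn:uuid:example-identifier",  # placeholder PID
               formatId = "text/csv",
               size = file.info(csvfile)$size,
               checksum = digest(csvfile, algo = "sha256",
                                 serialize = FALSE, file = TRUE))

sysmetaMatchesFile(sysmeta, csvfile)
```

If the file is edited after the system metadata is created, the check returns FALSE and the size and checksum should be recomputed.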
The dataset can now be uploaded to DataONE with the associated system
metadata:
cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
response <- createObject(mn, pid, csvfile, sysmeta)
Note that for this example, the DataONE test environment
STAGING is used, and not the production environment.
Maintaining Uploaded Datasets
After data has been uploaded to DataONE, maintenance operations can
be performed on these objects using the methods described in the
following sections.
Replace an object with a newer version (MNode: updateObject)
The updateObject method updates an existing object by creating a
new object identified by a new PID on the Member Node. The new object
replaces and obsoletes the old object. An obsoleted object in
DataONE does not appear in search results; however, it is still available
for download if the identifier is known.
# Update object from previous example with a new version
updateid <- sprintf("urn:uuid:%s", UUIDgenerate())
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
size <- file.info(csvfile)$size
sha256 <- digest(csvfile, algo="sha256", serialize=FALSE, file=TRUE)
# Start with the old object's sysmeta, then modify it to match
# the new object. We could have also created a sysmeta from scratch.
sysmeta <- getSystemMetadata(mn, pid)
sysmeta@identifier <- updateid
sysmeta@size <- size
sysmeta@checksum <- sha256
sysmeta@obsoletes <- pid
# Now update the object on the member node.
response <- updateObject(mn, pid, csvfile, updateid, sysmeta)
# Get the new, updated sysmeta and check it to ensure that the update
# worked, i.e. "obsoletes" is the old pid that was replaced by the update.
updsysmeta <- getSystemMetadata(mn, updateid)
updsysmeta@obsoletes
The Member Node marks the replaced object as obsolete by setting a
property in that object's system metadata. As noted above, the
obsoleted object remains available for download if its PID is known.
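The system metadata changes made in the update example can be gathered into a small function so that they are applied consistently. The following is a minimal sketch, assuming the datapack, digest and uuid packages are installed. The helper sysmetaForUpdate is hypothetical and works only on local objects, so it can be tried without contacting a Member Node.

```r
library(datapack)
library(digest)
library(uuid)

# Hypothetical helper: derive system metadata for a replacement file from
# the system metadata of the object being replaced.
sysmetaForUpdate <- function(oldSysmeta, newFile, newPid) {
  newSysmeta <- oldSysmeta
  newSysmeta@identifier <- newPid
  newSysmeta@size <- file.info(newFile)$size
  newSysmeta@checksum <- digest(newFile, algo = "sha256",
                                serialize = FALSE, file = TRUE)
  # Record the algorithm so it matches the checksum computed above.
  newSysmeta@checksumAlgorithm <- "SHA-256"
  newSysmeta@obsoletes <- oldSysmeta@identifier
  newSysmeta
}

# Local stand-ins for the old object and its replacement file.
oldPid <- sprintf("urn:uuid:%s", UUIDgenerate())
oldSysmeta <- new("SystemMetadata", identifier = oldPid, formatId = "text/csv")
newFile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:20, y = 11:30), newFile, row.names = FALSE)
newPid <- sprintf("urn:uuid:%s", UUIDgenerate())

newSysmeta <- sysmetaForUpdate(oldSysmeta, newFile, newPid)
newSysmeta@obsoletes
```

The returned object would then be passed to updateObject together with the old and new identifiers, as in the example above.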
Remove an object from DataONE search
An object can be removed from searches done with the DataONE search
mechanism by calling the archive method with the PID of the
object. This operation does not delete the object bytes, but instead
updates the system metadata for the object to set the archived
flag to true. The object can still be referenced with its PID and
downloaded, but it will not appear in any search results.
Objects that are archived cannot be updated using the
updateObject method. Once an object is archived, it cannot be
un-archived.
The following statement archives the object that was just created in
the previous example with the updateObject method.
response <- archive(mn, updateid)
The following commands can be used to verify that the object was
archived.
sysmeta <- getSystemMetadata(mn, updateid)
sysmeta@archived
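Since archived objects cannot be updated, a script may want to check the archived flag before attempting an update. The sketch below assumes the datapack package is installed. The helper isUpdatable is hypothetical, and the two SystemMetadata objects are built locally with placeholder identifiers rather than retrieved with getSystemMetadata.

```r
library(datapack)

# Hypothetical helper: TRUE when the system metadata is not flagged archived.
isUpdatable <- function(sysmeta) {
  !isTRUE(sysmeta@archived)
}

# Local stand-ins for system metadata retrieved from a Member Node.
activeSm <- new("SystemMetadata", identifier = "urn:uuid:example-active",
                formatId = "text/csv")
archivedSm <- new("SystemMetadata", identifier = "urn:uuid:example-archived",
                  formatId = "text/csv", archived = TRUE)

isUpdatable(activeSm)
isUpdatable(archivedSm)
```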