Searching DataONE Data Holdings

2022-06-10

The primary method to locate data in the DataONE network is to use the web based search tool located at https://search.dataone.org.

Data can also be discovered programmatically from R using the dataone R package method query. Both the web page and the R method access the same underlying search mechanism, the Apache Foundation Solr search engine that runs on the DataONE Coordinating Node. The DataONE Solr index, similar to a catalog or database, contains information for every dataset that is available from the DataONE network.

The same search mechanism is available on DataONE Member Nodes. However, the search index on a member node only contains dataset information for datasets that are contained on that member node.

Information about the Solr search engine can be obtained at Solr Resources.

Additional information about searching DataONE can be viewed at Content Discovery.

Using The query Method

Some familiarity with Solr is helpful when using the query method effectively, however basic queries can be very powerful and the examples in this document can provide a starting point. As an alternative to composing queries using Solr syntax, a simpler search mechanism is available with the query method that uses name, value pairs (See the section: *A Simplified Search”).

Additional information about the query method can be obtained from the R help facility, e.g. ?query from the R command line.

Query and return an R list

The following example queries DataONE and returns values in an R list, with each value converted from the Solr result to the appropriate R data type:

library(dataone)
cn <- CNode("PROD")
queryParamList <- list(q="id:doi*", rows="5", fq="abstract:carbon", fl="id,title,dateUploaded,abstract,datasource,size")
result <- query(cn, solrQuery=queryParamList, as="list")

The ‘solrQuery’ argument takes a list of query parameters that are sent to the DataONE Solr search engine and control how the search is performed. The name of each list element is a Solr keyword which is combined with the list element value to create each Solr query term. All these terms are combined to create the complete Solr query. The queryParamlist in the example above will be used to construct this Solr query:

?q=id:doi*&rows=5,&fq=abstract:carbon&fl=id,title,dateUploaded,abstract,datasource,size

The q=id:doi* is the main query term and specifies that all DataONE objects that have an id field value beginning with doi* should be returned.

The &fq=abstract:carbon term is a filter query, which filters the results so that only results with an abstract field containing the word carbon will be returned.

The &fl=id,title,dateUploaded,abstract,datasource,size term is a field specifier, so only those fields in the list will be included in the result set.

The &rows=5 term specifies that a maximum of five results will be returned.

The result object contains all the data values found and returned from the query. Each element of the returned list contains information for one dataset. Each returned attribute for a dataset can be accessed with the appropriate element name, for example, to access the title information of the first dataset returned, use the R statement:

result[[1]]$title

To print out selected information for all returned values, use:

ids <- lapply(result, function(x) {
  message(sprintf("id: %s", x$id))
  message(sprintf("origin member node: %s", x$datasource))
  message(sprintf("title: %s", x$title))
  message(sprintf("date uploaded: %s", x$dateUploaded))
  x$id
})

The complete list of possible searchable values stored for a dataset can be viewed using getQueryEngineDescription():

cn <- CNode()
getQueryEngineDescription(cn, "solr")

However, the values available for a particular dataset may be a subset of these, depending on the metadata provided when the dataset is uploaded to DataONE.

The DataONE Coordinating Node (CN) contains metadata about datasets from all Member Nodes (MN) in the network. As the above example shows, sending a query to the CN may find matching datasets located on potentially any Member Node in the network.

Return values as a data frame

To return results as a data frame, use the as parameter as shown below. In addition all values will be stored as character, because parse=FALSE is specified:

cn <- CNode("PROD")
result <- query(cn, queryParamList, as="data.frame", parse=FALSE)

The result is a data frame with each row containing information for one dataset. To print all returned DataONE identifiers from the result:

result[,'id']

Using a Simplified Search.

A search may be performed by just specifying query fields and values using the searchTerms parameter. For example, to search for datasets that mention ‘kelp’ in the abstract and have an attribute description that contains the word ‘biomass’, the following query could be used:

cn <- CNode("PROD")
mn <- getMNode(cn, "urn:node:KNB")
mySearchTerms <- list(abstract="kelp", attribute="biomass")
result <- query(cn, searchTerms=mySearchTerms, as="data.frame")

Using the searchTerms parameter causes the query method to construct a Solr query based on the list, that is passed on to the DataONE node specified.

The names in the searchTerm list are the query field names available from the Solr query engine being used. These names can be determined using the getQueryEngineDescription function.

Search just a member node

If it is known which Member Node holds the data of interest, then a search can be limited to just that MN by sending the search directly to that MN instead of the CN. For example, if the dataset of interest is on The Knowledge Network for Biocomplexity (KNB) Member Node, then the search is performed with the statements:

# Query the data holdings on a member node
cn <- CNode("PROD")
mn <- getMNode(cn, "urn:node:KNB")
queryParams <- list(q="abstract:habitat", fl="id,title,abstract") 
result <- query(mn, queryParams, as="data.frame", parse=FALSE)
# Choose the first matchin PID
pid <- result[1,'id']

Filter query results for a specific member node

An alternative way to locate datasets on a particular node is to send the query to the CN but limit the returned results to data holdings that originated from the specific member node by using Solr filter query parameter (&fq) can be used for the datasource field:

cn <- CNode("PROD")
queryParams <- 'q=id:*&fl=id,title&fq=datasource:"urn:node:KNB"&rows=5'
result <- query(cn, queryParams, as="data.frame", parse=FALSE)
result[,'id']