Introduction
In this vignette, CTX Chemical API will be explored.
NOTE: Please see the introductory vignette for an overview of the ctxR package and initial set up instruction with API key storage.
The foundation of toxicology, toxicokinetics, and exposure is embedded in the physics and chemistry of chemical-biological interactions. The accurate characterization of chemical structure linked to commonly used identifiers, such as names and Chemical Abstracts Service Registry Numbers (CASRNs), is essential to support both predictive modeling of the data as well as dissemination and application of the data for chemical safety decisions.
With cheminformatics as the backbone for research efforts, sources of available data through the CTX Chemical API include:
- Chemical structures, nomenclature, synonyms, IDs, list associations, physicochemical property, environmental fate and transport data from the Distributed Structure-Searchable Toxicity (DSSTox) database. For early references, see (Richard, A. et al. 2002), (Richard, A. et al. 2006), and (Richard, A. et al 2008).
- Predictions from Toxicity Estimation Software Tool (TEST) suite of QSAR models. For early references, see (Martin, T. et al. 2001), (Martin, T. et al. 2007), and (Young, D. et al. 2008).
More information on Chemicals and Chemistry Data can be found here: https://www.epa.gov/comptox-tools/downloadable-computational-toxicology-data#SCD.
Functions
Several ctxR functions are used to access the CTX Chemical API data.
Chemical Details Resource
Get chemical data
get_chemical_details()
retrieves chemical detail data
either using the chemical identifier DTXSID or DTXCID. Alternate
parameter “projection” determines the type of data returned. Examples
for each are provided below:
Chemical Property Resource
get_chemical_by_property_range()
retrieves data for
chemicals that have a specified property within the input range.
get_chem_info()
retrieves specific chemical information
for an input chemical. This includes both experimental and predicted
values by default, but providing “experimental” or “predicted” to the
type parameter will return the specific associated information.
Chemical Fate Resource
get_fate_by_dtxsid()
retrieves chemical fate data.
Chemical Search Resource
Chemicals can be searched using string values. Examples for each are provided by the following:
Subset for MS-Ready Structures
MS-Ready data can be retrieved using a variety of input information. Examples for each are provided below:
List Resource
There are several lists of chemicals one can access. These can be filtered by the type, name, inclusion of a specific chemical, or name of list.
Lists containing a specific chemical
get_lists_containing_chemical()
retrieves a list of
names of chemical lists, each of which contains the specified
chemical.
Chemical File Resource
There a mrv, mol, and image files that can be accessed using either the DTXSID or DTXCID. Examples are provided below:
Get mrv by DTXSID or DTXCID
get_chemical_mrv()
retrieves mrv file information for a
chemical specified either by DTXSID or DTXCID.
Get mol by DTXSID or DTXCID
get_chemical_mol()
retrieves mol file information for a
chemical specified either by DTXSID or DTXCID.
Example Use Case: Comparing Physico-chemical Properties Across Chemical Lists
The fourth Drinking Water Contaminant Candidate List (CCL4) is a set of chemicals that “…are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems….” Moreover, this list “…was announced on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants….” The National-Scale Air Toxics Assessments (NATA) is “… EPA’s ongoing comprehensive evaluation of air toxics in the United States… a state-of-the-science screening tool for State/Local/Tribal agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks… use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them….”
These lists can be found in the CCD at CCL4 with additional information at CCL4 information and NATADB with additional information at NATA information. The quotes from the previous paragraph were excerpted from list detail descriptions found using the CCD links.
In this example use case, physico-chemical Properties data will be compared between a water contaminant priority and an air toxics list.
Obtain Lists of Chemicals
First, confirm the chemical list to query.
options(width = 100)
ccl4_information <- get_public_chemical_list_by_name('CCL4')
print(ccl4_information, trunc.cols = TRUE)
#> id type label visibility
#> 1 443 federal WATER|EPA: Chemical Contaminants - CCL 4 PUBLIC
#> longDescription
#> 1 The Contaminant Candidate List (CCL) is a list of contaminants that, at the time of publication, are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems. Contaminants listed on the CCL may require future regulation under the Safe Drinking Water Act (SDWA). EPA announced the <a href='https://www.epa.gov/ccl/contaminant-candidate-list-4-ccl-4-0' target='_blank'>fourth Drinking Water Contaminant Candidate List (CCL 4)</a> on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants. The group of cyanotoxins on CCL 4 includes, but is not limited to: anatoxin-a, cylindrospermopsin, microcystins, and saxitoxin. The CCL Chemical Candidate Lists are versioned iteratively and this description navigates between the various versions of the lists. The list of substances displayed below represents only the chemical CCL 4 contaminants. For the versioned lists, please use the hyperlinked lists below.<br/><br/> \r\n\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL5' target='_blank'>CCL5 - November 2022</a> <br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL4' target='_blank'>CCL4 - November 2016</a> \r\n This list<br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL3' target='_blank'>CCL3 - October 2009</a> <br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL2' target='_blank'>CCL2 - February 2005</a><br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL1' target='_blank'>CCL1 - March 1998</a><br/><br/>
#> updatedAt listName chemicalCount createdAt
#> 1 2022-10-26T21:14:27Z CCL4 100 2017-12-28T17:58:36Z
#> shortDescription
#> 1 The Contaminant Candidate List (CCL) is a list of contaminants that are known or anticipated to occur in public water systems. Version 4 is known as CCL 4.
natadb_information <- get_public_chemical_list_by_name('NATADB')
print(natadb_information, trunc.cols = TRUE)
#> id type label visibility
#> 1 454 federal EPA: National-Scale Air Toxics Assessment (NATA) PUBLIC
#> longDescription
#> 1 The National-Scale Air Toxics Assessment (NATA) is EPA's ongoing comprehensive evaluation of air toxics in the United States. EPA developed the NATA as a state-of-the-science screening tool for State/Local/Tribal Agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks. NATA assessments do not incorporate refined information about emission sources but, rather, use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them.\r\n\r\nNATA provides estimates of the risk of cancer and other serious health effects from breathing (inhaling) air toxics in order to inform both national and more localized efforts to identify and prioritize air toxics, emission source types and locations which are of greatest potential concern in terms of contributing to population risk. This in turn helps air pollution experts focus limited analytical resources on areas and or populations where the potential for health risks are highest. Assessments include estimates of cancer and non-cancer health effects based on chronic exposure from outdoor sources, including assessments of non-cancer health effects for Diesel Particulate Matter (PM). Assessments provide a snapshot of the outdoor air quality and the risks to human health that would result if air toxic emissions levels remained unchanged.
#> updatedAt listName chemicalCount createdAt
#> 1 2018-11-16T21:42:01Z NATADB 163 2018-02-21T12:04:16Z
#> shortDescription
#> 1 The National-Scale Air Toxics Assessment (NATA) is EPA's ongoing comprehensive evaluation of air toxics in the United States.
Next, retrieve the list of chemicals associated with each list.
ccl4 <- get_chemicals_in_list('ccl4')
ccl4 <- data.table::as.data.table(ccl4)
natadb <- get_chemicals_in_list('NATADB')
natadb <- data.table::as.data.table(natadb)
We examine the dimensions of the data, the column names, and display a single row for illustrative purposes.
Access Physico-chemical Property Data for Chemical Lists
Next, physico-chemical properties for all chemicals in each list can
be retrieved. The function get_chem_info()
will be used to
batch search for a list of DTXSIDs.
ccl4_phys_chem <- get_chem_info_batch(ccl4$dtxsid)
#> Warning in get_chem_info_batch(ccl4$dtxsid): Setting type to ''!
natadb_phys_chem <- get_chem_info_batch(natadb$dtxsid)
#> Warning in get_chem_info_batch(natadb$dtxsid): Setting type to ''!
Observe that this returns a single data.table for each query, and the
data.table contains the physico-chemical properties available from the
CompTox Chemicals Dashboard for each chemical in the query. Note, a
warning message was triggered,
Warning: Setting type to ''!
, which indicates the the
parameter type
was not given a value. A default value is
set within the function and more information can be found in the
associated documentation. We examine the set of physico-chemical
properties for the first chemical in CCL4.
Before any deeper analysis, consider the dimensions of the data and the column names.
Next, we display the unique values for the columns
propertyID
and propType
.
ccl4_phys_chem[, unique(propertyId)]
#> [1] "boiling-point" "logkow-octanol-water" "melting-point" "vapor-pressure"
#> [5] "water-solubility" "density" "flash-point" "henrys-law"
#> [9] "index-of-refraction" "logkoa-octanol-air" "molar-refractivity" "molar-volume"
#> [13] "polarizability" "surface-tension" "thermal-conductivity" "viscosity"
#> [17] "pka-acidic-apparent" "pka-basic-apparent"
Let’s explore this further by examining the mean of the “boiling-point” and “melting-point” data.
ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value))]
#> Mean
#> <num>
#> 1: 252.6593
ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value)),
by = .(propType)]
#> propType Mean
#> <char> <num>
#> 1: experimental 250.5943
#> 2: predicted 253.8196
ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value))]
#> Mean
#> <num>
#> 1: 34.91613
ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value)),
by = .(propType)]
#> propType Mean
#> <char> <num>
#> 1: experimental 23.18876
#> 2: predicted 49.99417
These results tell us about some of the reported physico-chemical properties of the data sets.
The mean “boiling-point” is 251.1072 degrees Celsius for CCL4, with mean values of 250.5943 and 251.4001 for experimental and predicted, respectively. The mean “melting-point” is 33.93924 degrees Celsius for CCL4, with mean values of 23.18876 and 47.98422 for experimental and predicted, respectively.
To explore all the values of the physico-chemical
properties and calculate their means, we can do the following procedure.
First we look at all the physico-chemical properties individually, then
group them by each property (“boiling-point”, “melting-point”, etc…),
and then additionally group those by property type (“experimental” vs
“predicted”). In the grouping, we look at the columns
value
, unit
, propertyID
and
propType
. We also demonstrate how take the mean of the
values for each grouping.
head(ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], ])
#> Empty data.table (0 rows and 10 cols): name,value,id,source,description,propType...
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], .(propType, value, unit),
by = .(propertyId)]
#> Empty data.table (0 rows and 4 cols): propertyId,propType,value,unit
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], .(value, unit),
by = .(propertyId, propType)]
#> Empty data.table (0 rows and 4 cols): propertyId,propType,value,unit
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], .(Mean_value = sapply(.SD, mean)),
by = .(propertyId, unit), .SDcols = c("value")]
#> Empty data.table (0 rows and 3 cols): propertyId,unit,Mean_value
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], .(Mean_value = sapply(.SD, mean)),
by = .(propertyId, unit, propType),
.SDcols = c("value")][order(propertyId)]
#> Empty data.table (0 rows and 4 cols): propertyId,unit,propType,Mean_value
Review Physico-Chemical Properties Across Chemical Lists
We consider exploring the differences in mean predicted and
experimental values for a variety of physico-chemical properties in an
effort to understand better the CCL4 and NATADB lists. In particular, we
examine “vapor-pressure”, “henrys-law”, and “boiling-point” and plot the
means by chemical for these using boxplots. We then compare the values
by grouping by both data set and propType
value.
Vapor Pressure
Begin by grouping pulled data by DTXSID, and also by DTXSID and property type.
ccl4_vapor_all <- ccl4_phys_chem[propertyId %in% 'vapor-pressure',
.(mean_vapor_pressure = sapply(.SD, mean)),
.SDcols = c('value'), by = .(dtxsid)]
natadb_vapor_all <- natadb_phys_chem[propertyId %in% 'vapor-pressure',
.(mean_vapor_pressure = sapply(.SD, mean)),
.SDcols = c('value'), by = .(dtxsid)]
ccl4_vapor_grouped <- ccl4_phys_chem[propertyId %in% 'vapor-pressure',
.(mean_vapor_pressure = sapply(.SD, mean)),
.SDcols = c('value'),
by = .(dtxsid, propType)]
natadb_vapor_grouped <- natadb_phys_chem[propertyId %in% 'vapor-pressure',
.(mean_vapor_pressure =
sapply(.SD, mean)),
.SDcols = c('value'),
by = .(dtxsid, propType)]
Examine summary statistics.
summary(ccl4_vapor_all)
#> dtxsid mean_vapor_pressure
#> Length:83 Min. : 0.000
#> Class :character 1st Qu.: 0.000
#> Mode :character Median : 0.002
#> Mean : 224.118
#> 3rd Qu.: 3.866
#> Max. :6007.242
summary(ccl4_vapor_grouped)
#> dtxsid propType mean_vapor_pressure
#> Length:140 Length:140 Min. : 0.000
#> Class :character Class :character 1st Qu.: 0.000
#> Mode :character Mode :character Median : 0.069
#> Mean : 265.007
#> 3rd Qu.: 10.125
#> Max. :7252.700
summary(natadb_vapor_all)
#> dtxsid mean_vapor_pressure
#> Length:145 Min. : 0.000
#> Class :character 1st Qu.: 0.007
#> Mode :character Median : 1.001
#> Mean : 225.872
#> 3rd Qu.: 89.931
#> Max. :9298.713
summary(natadb_vapor_grouped)
#> dtxsid propType mean_vapor_pressure
#> Length:267 Length:267 Min. : 0.000
#> Class :character Class :character 1st Qu.: 0.008
#> Mode :character Mode :character Median : 1.521
#> Mean : 243.404
#> 3rd Qu.: 98.721
#> Max. :9412.390
With such a large range of values covering several orders of magnitude, log transform the data. This data from both chemical lists can also be plotted individually and by property type.
ccl4_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
#> dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
#> <char> <num> <num>
#> 1: DTXSID0020153 1.436092e+00 0.3619259
#> 2: DTXSID0020446 1.211530e-06 -13.6236263
#> 3: DTXSID0020573 1.016181e-08 -18.4046290
#> 4: DTXSID0020600 1.060439e+03 6.9664387
#> 5: DTXSID0020814 1.713402e-08 -17.8821999
#> 6: DTXSID0021541 3.623345e+03 8.1951529
#> 7: DTXSID0021917 1.447043e+02 4.9746920
#> 8: DTXSID0024052 2.579787e-07 -15.1703889
#> 9: DTXSID1020437 2.153143e+02 5.3720986
#> 10: DTXSID1021407 9.093477e-04 -7.0027830
#> 11: DTXSID1021740 6.808950e+00 1.9182379
#> 12: DTXSID1021798 5.779040e-02 -2.8509326
#> 13: DTXSID1024174 5.616588e-06 -12.0897863
#> 14: DTXSID1024338 7.125250e-08 -16.4570359
#> 15: DTXSID1026164 2.444570e-01 -1.4087159
#> 16: DTXSID1037484 4.124746e-07 -14.7010911
#> 17: DTXSID1037486 4.206599e-07 -14.6814412
#> 18: DTXSID1037567 4.633640e-08 -16.8873380
#> 19: DTXSID2020684 3.733735e-03 -5.5903463
#> 20: DTXSID2021028 1.127656e+00 0.1201408
#> 21: DTXSID2021317 1.563302e+01 2.7493857
#> 22: DTXSID2021731 2.360838e+02 5.4641866
#> 23: DTXSID2022333 1.685188e+00 0.5218768
#> 24: DTXSID2037506 8.393291e-06 -11.6880778
#> 25: DTXSID2052156 3.878108e-09 -19.3679185
#> 26: DTXSID3020203 1.876293e+03 7.5370530
#> 27: DTXSID3020833 2.420272e+02 5.4890503
#> 28: DTXSID3020964 2.336795e-01 -1.4538048
#> 29: DTXSID3024366 5.501190e+01 4.0075495
#> 30: DTXSID3024869 1.885188e-02 -3.9711425
#> 31: DTXSID3031864 2.479965e-06 -12.9072661
#> 32: DTXSID3032464 1.727766e-06 -13.2686813
#> 33: DTXSID3042219 3.106110e+00 1.1333711
#> 34: DTXSID3074313 1.481020e-11 -24.9357050
#> 35: DTXSID4020533 4.036368e+01 3.6979302
#> 36: DTXSID4021503 1.722775e+02 5.1491065
#> 37: DTXSID4022361 3.226160e-06 -12.6442179
#> 38: DTXSID4022367 1.643378e-08 -17.9239267
#> 39: DTXSID4022448 1.773129e-05 -10.9401797
#> 40: DTXSID4022991 1.440972e-10 -22.6605333
#> 41: DTXSID4032611 5.263403e-04 -7.5495627
#> 42: DTXSID4034948 3.744327e-08 -17.1004389
#> 43: DTXSID5020023 2.545125e+02 5.5393500
#> 44: DTXSID5020576 5.737673e-09 -18.9762122
#> 45: DTXSID5020601 1.428200e+00 0.3564147
#> 46: DTXSID5021207 4.463567e+02 6.1011185
#> 47: DTXSID5024182 7.171160e+00 1.9700674
#> 48: DTXSID5039224 7.963770e+02 6.6800727
#> 49: DTXSID50867064 1.522060e-03 -6.4876906
#> 50: DTXSID6020301 6.007243e+03 8.7007211
#> 51: DTXSID6020856 2.909907e-01 -1.2344638
#> 52: DTXSID6021030 3.562765e-05 -10.2423885
#> 53: DTXSID6021032 8.306000e-01 -0.1856069
#> 54: DTXSID6022422 1.921524e-06 -13.1623921
#> 55: DTXSID6024177 7.035691e-02 -2.6541743
#> 56: DTXSID6037483 4.847560e-08 -16.8422053
#> 57: DTXSID6037485 4.976030e-08 -16.8160484
#> 58: DTXSID6037568 2.875770e-07 -15.0617752
#> 59: DTXSID7020005 6.776980e-02 -2.6916386
#> 60: DTXSID7020637 1.836866e+03 7.5158163
#> 61: DTXSID7021029 3.323853e+00 1.2011248
#> 62: DTXSID7024241 2.031631e-06 -13.1066715
#> 63: DTXSID7047433 1.073489e-08 -18.3497663
#> 64: DTXSID8020044 2.447645e+01 3.1977114
#> 65: DTXSID8020090 5.073765e-01 -0.6785019
#> 66: DTXSID8020597 2.037895e-01 -1.5906676
#> 67: DTXSID8020832 1.520092e+03 7.3265265
#> 68: DTXSID8021062 1.180510e-01 -2.1366388
#> 69: DTXSID8022292 1.528695e-08 -17.9962663
#> 70: DTXSID8022377 1.016181e-08 -18.4046290
#> 71: DTXSID8023846 1.702425e-06 -13.2834569
#> 72: DTXSID8023848 8.277807e-06 -11.7019325
#> 73: DTXSID8025541 1.949228e-05 -10.8454918
#> 74: DTXSID8031865 8.386721e-01 -0.1759355
#> 75: DTXSID9020243 8.999120e-08 -16.2235539
#> 76: DTXSID9021390 4.408660e+00 1.4835708
#> 77: DTXSID9021427 3.637282e-01 -1.0113483
#> 78: DTXSID9022366 1.506794e-09 -20.3132814
#> 79: DTXSID9023380 1.149256e-08 -18.2815660
#> 80: DTXSID9023914 3.131080e-04 -8.0689624
#> 81: DTXSID9024142 2.739923e-09 -19.7153359
#> 82: DTXSID9032113 2.981382e-08 -17.3282936
#> 83: DTXSID9032329 7.041032e-07 -14.1663408
#> dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
ccl4_vapor_grouped[, log_transform_mean_vapor_pressure :=
log(mean_vapor_pressure)]
#> dtxsid propType mean_vapor_pressure log_transform_mean_vapor_pressure
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 experimental 1.229990e+00 0.2070060
#> 2: DTXSID0020153 predicted 1.504793e+00 0.4086556
#> 3: DTXSID0020446 experimental 6.899220e-08 -16.4892724
#> 4: DTXSID0020446 predicted 1.592376e-06 -13.3502831
#> 5: DTXSID0020573 predicted 1.016181e-08 -18.4046290
#> ---
#> 136: DTXSID9024142 predicted 2.059810e-09 -20.0006521
#> 137: DTXSID9032113 experimental 1.279970e-08 -18.1738441
#> 138: DTXSID9032113 predicted 3.548520e-08 -17.1541501
#> 139: DTXSID9032329 experimental 8.000180e-07 -14.0386316
#> 140: DTXSID9032329 predicted 6.721317e-07 -14.2128116
natadb_vapor_all[, log_transform_mean_vapor_pressure :=
log(mean_vapor_pressure)]
#> dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
#> <char> <num> <num>
#> 1: DTXSID0020153 1.436092e+00 0.3619259
#> 2: DTXSID0020448 4.899910e+01 3.8918019
#> 3: DTXSID0020523 2.824828e-04 -8.1718931
#> 4: DTXSID0020529 7.603507e-04 -7.1817307
#> 5: DTXSID0020600 1.060439e+03 6.9664387
#> ---
#> 141: DTXSID9020293 3.124725e-02 -3.4658239
#> 142: DTXSID9020299 1.177018e-06 -13.6525264
#> 143: DTXSID9020827 2.930390e-06 -12.7403750
#> 144: DTXSID9021138 3.619153e-03 -5.6215152
#> 145: DTXSID9041522 6.433997e-05 -9.6513295
natadb_vapor_grouped[, log_transform_mean_vapor_pressure :=
log(mean_vapor_pressure)]
#> dtxsid propType mean_vapor_pressure log_transform_mean_vapor_pressure
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 experimental 1.229990e+00 0.2070060
#> 2: DTXSID0020153 predicted 1.504793e+00 0.4086556
#> 3: DTXSID0020448 experimental 5.329670e+01 3.9758744
#> 4: DTXSID0020448 predicted 4.756657e+01 3.8621301
#> 5: DTXSID0020523 experimental 3.900320e-04 -7.8492818
#> ---
#> 263: DTXSID9020299 experimental 2.199890e-06 -13.0271032
#> 264: DTXSID9020299 predicted 8.360607e-07 -13.9945647
#> 265: DTXSID9020827 predicted 2.930390e-06 -12.7403750
#> 266: DTXSID9021138 predicted 3.619153e-03 -5.6215152
#> 267: DTXSID9041522 predicted 6.433997e-05 -9.6513295
Finally, compare both chemical lists simultaneously. To accomplish
this, add a column to each data.table denoting to which chemical list
the rows correspond and then combine the rows from both data sets
together using the function rbind()
.
ccl4_vapor_grouped[, set := 'CCL4']
#> dtxsid propType mean_vapor_pressure log_transform_mean_vapor_pressure set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 experimental 1.229990e+00 0.2070060 CCL4
#> 2: DTXSID0020153 predicted 1.504793e+00 0.4086556 CCL4
#> 3: DTXSID0020446 experimental 6.899220e-08 -16.4892724 CCL4
#> 4: DTXSID0020446 predicted 1.592376e-06 -13.3502831 CCL4
#> 5: DTXSID0020573 predicted 1.016181e-08 -18.4046290 CCL4
#> ---
#> 136: DTXSID9024142 predicted 2.059810e-09 -20.0006521 CCL4
#> 137: DTXSID9032113 experimental 1.279970e-08 -18.1738441 CCL4
#> 138: DTXSID9032113 predicted 3.548520e-08 -17.1541501 CCL4
#> 139: DTXSID9032329 experimental 8.000180e-07 -14.0386316 CCL4
#> 140: DTXSID9032329 predicted 6.721317e-07 -14.2128116 CCL4
natadb_vapor_grouped[, set := 'NATADB']
#> dtxsid propType mean_vapor_pressure log_transform_mean_vapor_pressure set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 experimental 1.229990e+00 0.2070060 NATADB
#> 2: DTXSID0020153 predicted 1.504793e+00 0.4086556 NATADB
#> 3: DTXSID0020448 experimental 5.329670e+01 3.9758744 NATADB
#> 4: DTXSID0020448 predicted 4.756657e+01 3.8621301 NATADB
#> 5: DTXSID0020523 experimental 3.900320e-04 -7.8492818 NATADB
#> ---
#> 263: DTXSID9020299 experimental 2.199890e-06 -13.0271032 NATADB
#> 264: DTXSID9020299 predicted 8.360607e-07 -13.9945647 NATADB
#> 265: DTXSID9020827 predicted 2.930390e-06 -12.7403750 NATADB
#> 266: DTXSID9021138 predicted 3.619153e-03 -5.6215152 NATADB
#> 267: DTXSID9041522 predicted 6.433997e-05 -9.6513295 NATADB
all_vapor_grouped <- rbind(ccl4_vapor_grouped, natadb_vapor_grouped)
vapor_box <- ggplot(all_vapor_grouped,
aes(set, log_transform_mean_vapor_pressure)) +
geom_boxplot(aes(color = propType))
vapor <- ggplot(all_vapor_grouped, aes(log_transform_mean_vapor_pressure)) +
geom_boxplot((aes(color = set))) +
coord_flip()
Plot the combined data. Boxplots are colored based on the property type, with mean log transformed vapor pressure plotted for each chemical list and property type, or by chemical list alone.
In the box plots above, a general trend indicates that that the NATADB chemical list has a higher mean vapor pressure than the CCL4 chemical list.
Henry’s Law constant
Henry’s Law constant can be explored in a similar fashion. Begin by grouping data by DTXSID, and also by DTXSID and property type.
ccl4_hlc_all <- ccl4_phys_chem[propertyId %in% 'henrys-law',
.(mean_hlc = sapply(.SD, mean)),
.SDcols = c('value'), by = .(dtxsid)]
natadb_hlc_all <- natadb_phys_chem[propertyId %in% 'henrys-law',
.(mean_hlc = sapply(.SD, mean)),
.SDcols = c('value'), by = .(dtxsid)]
ccl4_hlc_grouped <- ccl4_phys_chem[propertyId %in% 'henrys-law',
.(mean_hlc = sapply(.SD, mean)),
.SDcols = c('value'),
by = .(dtxsid, propType)]
natadb_hlc_grouped <- natadb_phys_chem[propertyId %in% 'henrys-law',
.(mean_hlc = sapply(.SD, mean)),
.SDcols = c('value'),
by = .(dtxsid, propType)]
Examine summary statistics.
summary(ccl4_hlc_all)
#> dtxsid mean_hlc
#> Length:82 Min. :0.0000000
#> Class :character 1st Qu.:0.0000000
#> Mode :character Median :0.0000007
#> Mean :0.0076308
#> 3rd Qu.:0.0000216
#> Max. :0.4922550
summary(ccl4_hlc_grouped)
#> dtxsid propType mean_hlc
#> Length:110 Length:110 Min. :0.0000000
#> Class :character Class :character 1st Qu.:0.0000000
#> Mode :character Mode :character Median :0.0000020
#> Mean :0.0063874
#> 3rd Qu.:0.0001217
#> Max. :0.4922550
summary(natadb_hlc_all)
#> dtxsid mean_hlc
#> Length:144 Min. :0.0000000
#> Class :character 1st Qu.:0.0000002
#> Mode :character Median :0.0000328
#> Mean :0.0086311
#> 3rd Qu.:0.0012426
#> Max. :0.4922550
summary(natadb_hlc_grouped)
#> dtxsid propType mean_hlc
#> Length:210 Length:210 Min. :0.0000000
#> Class :character Class :character 1st Qu.:0.0000003
#> Mode :character Mode :character Median :0.0001028
#> Mean :0.0073349
#> 3rd Qu.:0.0028132
#> Max. :0.4922550
Again, log transform the data as it is positive and covers several orders of magnitude.
ccl4_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
#> dtxsid mean_hlc log_transform_mean_hlc
#> <char> <num> <num>
#> 1: DTXSID0020153 2.918720e-03 -5.8366101
#> 2: DTXSID0020446 1.527830e-09 -20.2994174
#> 3: DTXSID0020573 3.748870e-06 -12.4940561
#> 4: DTXSID0020600 1.479610e-04 -8.8185618
#> 5: DTXSID0020814 2.049460e-07 -15.4005193
#> 6: DTXSID0021541 8.825620e-03 -4.7300964
#> 7: DTXSID0021917 4.922550e-01 -0.7087584
#> 8: DTXSID0024052 2.338400e-10 -22.1763840
#> 9: DTXSID1020437 5.639950e-03 -5.1778801
#> 10: DTXSID1021407 1.768600e-06 -13.2453223
#> 11: DTXSID1021740 8.824295e-06 -11.6380018
#> 12: DTXSID1021798 8.664450e-06 -11.6562821
#> 13: DTXSID1024174 2.964310e-07 -15.0314514
#> 14: DTXSID1024338 1.646800e-10 -22.5270169
#> 15: DTXSID1026164 1.977445e-06 -13.1337050
#> 16: DTXSID1037484 1.196270e-09 -20.5440575
#> 17: DTXSID1037486 1.185530e-09 -20.5530759
#> 18: DTXSID1037567 4.339640e-10 -21.5580595
#> 19: DTXSID2020684 3.194620e-06 -12.6540424
#> 20: DTXSID2021028 3.662790e-06 -12.5172854
#> 21: DTXSID2021317 2.493330e-03 -5.9941361
#> 22: DTXSID2021731 4.553975e-06 -12.2995101
#> 23: DTXSID2022333 8.026420e-03 -4.8250167
#> 24: DTXSID2037506 1.313660e-09 -20.4504487
#> 25: DTXSID2052156 3.785200e-10 -21.6947522
#> 26: DTXSID3020203 4.129470e-02 -3.1870211
#> 27: DTXSID3020833 5.902805e-04 -7.4349127
#> 28: DTXSID3020964 2.389895e-05 -10.6416760
#> 29: DTXSID3024366 1.485460e-04 -8.8146159
#> 30: DTXSID3024869 1.776260e-06 -13.2410005
#> 31: DTXSID3031864 1.803350e-11 -24.7387900
#> 32: DTXSID3032464 8.835860e-06 -11.6366921
#> 33: DTXSID3042219 1.047975e-02 -4.5583105
#> 34: DTXSID3074313 2.046920e-11 -24.6120998
#> 35: DTXSID4020533 4.847905e-06 -12.2369639
#> 36: DTXSID4021503 3.771570e-03 -5.5802639
#> 37: DTXSID4022361 2.458220e-08 -17.5212432
#> 38: DTXSID4022367 1.047830e-09 -20.6765445
#> 39: DTXSID4022448 9.004310e-09 -18.5255625
#> 40: DTXSID4022991 1.238900e-11 -25.1142121
#> 41: DTXSID4032611 1.477090e-05 -11.1228515
#> 42: DTXSID4034948 1.537700e-09 -20.2929780
#> 43: DTXSID5020023 1.214215e-04 -9.0162426
#> 44: DTXSID5020576 9.440170e-08 -16.1757068
#> 45: DTXSID5020601 3.864950e-09 -19.3713171
#> 46: DTXSID5021207 1.374810e-04 -8.8920248
#> 47: DTXSID5024182 3.307185e-07 -14.9219983
#> 48: DTXSID5039224 6.639505e-05 -9.6198881
#> 49: DTXSID50867064 1.183860e-08 -18.2519005
#> 50: DTXSID6020301 4.061025e-02 -3.2037348
#> 51: DTXSID6020856 3.217130e-09 -19.5547762
#> 52: DTXSID6021030 9.187830e-07 -13.9002159
#> 53: DTXSID6021032 3.249700e-04 -8.0317777
#> 54: DTXSID6022422 1.067460e-07 -16.0528137
#> 55: DTXSID6024177 2.553800e-07 -15.1805132
#> 56: DTXSID6037483 5.542140e-10 -21.3134702
#> 57: DTXSID6037485 5.623550e-10 -21.2988878
#> 58: DTXSID6037568 8.289390e-09 -18.6082895
#> 59: DTXSID7020005 8.830950e-08 -16.2424181
#> 60: DTXSID7020637 3.432355e-07 -14.8848490
#> 61: DTXSID7021029 3.650960e-05 -10.2179353
#> 62: DTXSID7024241 3.283070e-06 -12.6267316
#> 63: DTXSID7047433 6.785440e-08 -16.5059016
#> 64: DTXSID8020044 4.999450e-06 -12.2061827
#> 65: DTXSID8020090 2.019270e-06 -13.1127745
#> 66: DTXSID8020597 5.999740e-08 -16.6289646
#> 67: DTXSID8020832 7.367270e-03 -4.9107081
#> 68: DTXSID8021062 1.159700e-05 -11.3647641
#> 69: DTXSID8022292 2.387120e-08 -17.5505931
#> 70: DTXSID8022377 3.748870e-06 -12.4940561
#> 71: DTXSID8023846 4.938440e-09 -19.1262163
#> 72: DTXSID8023848 9.954840e-09 -18.4252070
#> 73: DTXSID8025541 4.150640e-07 -14.6948331
#> 74: DTXSID8031865 1.916920e-10 -22.3751312
#> 75: DTXSID9021390 3.432295e-04 -7.9771112
#> 76: DTXSID9021427 5.556870e-07 -14.4030607
#> 77: DTXSID9022366 5.167830e-09 -19.0808130
#> 78: DTXSID9023380 4.156230e-09 -19.2986574
#> 79: DTXSID9023914 5.035540e-11 -23.7119153
#> 80: DTXSID9024142 2.692760e-06 -12.8249439
#> 81: DTXSID9032113 3.098320e-07 -14.9872356
#> 82: DTXSID9032329 1.566850e-06 -13.3664433
#> dtxsid mean_hlc log_transform_mean_hlc
ccl4_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
#> dtxsid propType mean_hlc log_transform_mean_hlc
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 predicted 2.91872e-03 -5.836610
#> 2: DTXSID0020446 predicted 1.52783e-09 -20.299417
#> 3: DTXSID0020573 predicted 3.74887e-06 -12.494056
#> 4: DTXSID0020600 experimental 1.48000e-04 -8.818298
#> 5: DTXSID0020600 predicted 1.47922e-04 -8.818825
#> ---
#> 106: DTXSID9023914 experimental 5.03002e-11 -23.713012
#> 107: DTXSID9023914 predicted 5.04106e-11 -23.710820
#> 108: DTXSID9024142 predicted 2.69276e-06 -12.824944
#> 109: DTXSID9032113 predicted 3.09832e-07 -14.987236
#> 110: DTXSID9032329 predicted 1.56685e-06 -13.366443
natadb_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
#> dtxsid mean_hlc log_transform_mean_hlc
#> <char> <num> <num>
#> 1: DTXSID0020153 2.918720e-03 -5.836610
#> 2: DTXSID0020448 2.806465e-03 -5.875830
#> 3: DTXSID0020523 8.611785e-08 -16.267549
#> 4: DTXSID0020529 5.417975e-08 -16.730959
#> 5: DTXSID0020600 1.479610e-04 -8.818562
#> ---
#> 140: DTXSID9020293 2.943670e-06 -12.735853
#> 141: DTXSID9020299 4.846860e-10 -21.447520
#> 142: DTXSID9020827 2.040880e-07 -15.404715
#> 143: DTXSID9021138 5.672420e-08 -16.685065
#> 144: DTXSID9041522 8.981400e-06 -11.620355
natadb_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
#> dtxsid propType mean_hlc log_transform_mean_hlc
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 predicted 2.91872e-03 -5.836610
#> 2: DTXSID0020448 experimental 2.82000e-03 -5.871018
#> 3: DTXSID0020448 predicted 2.79293e-03 -5.880664
#> 4: DTXSID0020523 experimental 8.59999e-08 -16.268920
#> 5: DTXSID0020523 predicted 8.62358e-08 -16.266180
#> ---
#> 206: DTXSID9020299 predicted 4.84686e-10 -21.447520
#> 207: DTXSID9020827 experimental 2.03000e-07 -15.410060
#> 208: DTXSID9020827 predicted 2.05176e-07 -15.399398
#> 209: DTXSID9021138 predicted 5.67242e-08 -16.685065
#> 210: DTXSID9041522 predicted 8.98140e-06 -11.620355
Finally, compare both chemical lists simultaneously. To accomplish
this, add a column to each data.table denoting to which chemical list
the rows correspond and then combine the rows from both data sets
together using the function rbind()
.
ccl4_hlc_grouped[, set := 'CCL4']
#> dtxsid propType mean_hlc log_transform_mean_hlc set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 predicted 2.91872e-03 -5.836610 CCL4
#> 2: DTXSID0020446 predicted 1.52783e-09 -20.299417 CCL4
#> 3: DTXSID0020573 predicted 3.74887e-06 -12.494056 CCL4
#> 4: DTXSID0020600 experimental 1.48000e-04 -8.818298 CCL4
#> 5: DTXSID0020600 predicted 1.47922e-04 -8.818825 CCL4
#> ---
#> 106: DTXSID9023914 experimental 5.03002e-11 -23.713012 CCL4
#> 107: DTXSID9023914 predicted 5.04106e-11 -23.710820 CCL4
#> 108: DTXSID9024142 predicted 2.69276e-06 -12.824944 CCL4
#> 109: DTXSID9032113 predicted 3.09832e-07 -14.987236 CCL4
#> 110: DTXSID9032329 predicted 1.56685e-06 -13.366443 CCL4
natadb_hlc_grouped[, set := 'NATADB']
#> dtxsid propType mean_hlc log_transform_mean_hlc set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 predicted 2.91872e-03 -5.836610 NATADB
#> 2: DTXSID0020448 experimental 2.82000e-03 -5.871018 NATADB
#> 3: DTXSID0020448 predicted 2.79293e-03 -5.880664 NATADB
#> 4: DTXSID0020523 experimental 8.59999e-08 -16.268920 NATADB
#> 5: DTXSID0020523 predicted 8.62358e-08 -16.266180 NATADB
#> ---
#> 206: DTXSID9020299 predicted 4.84686e-10 -21.447520 NATADB
#> 207: DTXSID9020827 experimental 2.03000e-07 -15.410060 NATADB
#> 208: DTXSID9020827 predicted 2.05176e-07 -15.399398 NATADB
#> 209: DTXSID9021138 predicted 5.67242e-08 -16.685065 NATADB
#> 210: DTXSID9041522 predicted 8.98140e-06 -11.620355 NATADB
all_hlc_grouped <- rbind(ccl4_hlc_grouped, natadb_hlc_grouped)
hlc_box <- ggplot(all_hlc_grouped, aes(set, log_transform_mean_hlc)) +
geom_boxplot(aes(color = propType))
hlc <- ggplot(all_hlc_grouped, aes(log_transform_mean_hlc)) +
geom_boxplot(aes(color = set)) +
coord_flip()
Again, in both grouping by propType
and aggregating all
results together by chemical list, NATADB chemicals generally higher
mean Henry’s Law Constant value than CCL4 chemicals.
Boling Point
Boiling Point data be explored. Begin by grouping data by DTXSID, and also by DTXSID and property type.
ccl4_boiling_all <- ccl4_phys_chem[propertyId %in% 'boiling-point',
.(mean_boiling_point = sapply(.SD, mean)),
.SDcols = c('value'), by = .(dtxsid)]
natadb_boiling_all <- natadb_phys_chem[propertyId %in% 'boiling-point',
.(mean_boiling_point =
sapply(.SD, mean)),
.SDcols = c('value'), by = .(dtxsid)]
ccl4_boiling_grouped <- ccl4_phys_chem[propertyId %in% 'boiling-point',
.(mean_boiling_point =
sapply(.SD, mean)),
.SDcols = c('value'),
by = .(dtxsid, propType)]
natadb_boiling_grouped <- natadb_phys_chem[propertyId %in% 'boiling-point',
.(mean_boiling_point =
sapply(.SD, mean)),
.SDcols = c('value'),
by = .(dtxsid, propType)]
Calculate summary statistics.
summary(ccl4_boiling_all)
#> dtxsid mean_boiling_point
#> Length:91 Min. : -34.92
#> Class :character 1st Qu.: 167.40
#> Mode :character Median : 306.29
#> Mean : 350.26
#> 3rd Qu.: 390.63
#> Max. :3377.66
summary(ccl4_boiling_grouped)
#> dtxsid propType mean_boiling_point
#> Length:143 Length:143 Min. : -40.78
#> Class :character Class :character 1st Qu.: 115.13
#> Mode :character Mode :character Median : 208.03
#> Mean : 318.70
#> 3rd Qu.: 383.44
#> Max. :4825.00
summary(natadb_boiling_all)
#> dtxsid mean_boiling_point
#> Length:152 Min. :-38.45
#> Class :character 1st Qu.: 85.98
#> Mode :character Median :187.18
#> Mean :186.92
#> 3rd Qu.:275.57
#> Max. :584.48
summary(natadb_boiling_grouped)
#> dtxsid propType mean_boiling_point
#> Length:295 Length:295 Min. :-87.78
#> Class :character Class :character 1st Qu.: 82.20
#> Mode :character Mode :character Median :180.00
#> Mean :181.67
#> 3rd Qu.:262.46
#> Max. :685.00
Since some of the boiling point values have negative values, log transformation of these values will result in warnings as NaNs are produced.
Finally, compare both chemical lists simultaneously. To accomplish
this, add a column to each data.table denoting to which chemical list
the rows correspond and then combine the rows from both data sets
together using the function rbind()
.
ccl4_boiling_grouped[, set := 'CCL4']
#> dtxsid propType mean_boiling_point set
#> <char> <char> <num> <char>
#> 1: DTXSID0020153 experimental 178.9778 CCL4
#> 2: DTXSID0020153 predicted 180.1880 CCL4
#> 3: DTXSID0020446 experimental 182.5000 CCL4
#> 4: DTXSID0020446 predicted 340.0280 CCL4
#> 5: DTXSID0020573 predicted 398.2052 CCL4
#> ---
#> 139: DTXSID9024142 predicted 452.0113 CCL4
#> 140: DTXSID9032113 predicted 384.5745 CCL4
#> 141: DTXSID9032119 experimental 990.0000 CCL4
#> 142: DTXSID9032119 predicted 482.9800 CCL4
#> 143: DTXSID9032329 predicted 448.4297 CCL4
natadb_boiling_grouped[, set := 'NATADB']
#> dtxsid propType mean_boiling_point set
#> <char> <char> <num> <char>
#> 1: DTXSID0020153 experimental 178.97780 NATADB
#> 2: DTXSID0020153 predicted 180.18800 NATADB
#> 3: DTXSID0020448 experimental 96.14168 NATADB
#> 4: DTXSID0020448 predicted 97.64523 NATADB
#> 5: DTXSID0020523 experimental 113.00000 NATADB
#> ---
#> 291: DTXSID9021138 predicted 263.44175 NATADB
#> 292: DTXSID9021261 experimental 685.00000 NATADB
#> 293: DTXSID9021261 predicted 483.96000 NATADB
#> 294: DTXSID9041522 experimental 340.00000 NATADB
#> 295: DTXSID9041522 predicted 334.18150 NATADB
all_boiling_grouped <- rbind(ccl4_boiling_grouped, natadb_boiling_grouped)
boiling_box <- ggplot(all_boiling_grouped, aes(set, mean_boiling_point)) +
geom_boxplot(aes(color = propType))
boiling <- ggplot(all_boiling_grouped, aes(mean_boiling_point)) +
geom_boxplot(aes(color = set)) +
coord_flip()
A visual inspection of this set of graphs is not as clear as in the previous cases. Note that the predicted values for each data set tend to be higher than the experimental. The mean of CCL4, by predicted and experimental appears to be greater than the corresponding means for NATADB, as does the overall mean, but the interquartile ranges of these different groupings yield slightly different results. This gives us a sense that the picture for boiling point is not as clear cut between experimental and predicted for these two chemical lists as it was in the previous physico-chemical properties investigated.
To summarize the observations, across the various physico-chemical properties for chemicals in these chemical lists, there are indeed differences between the mean values of various physico-chemical properties when grouped by predicted or experimental.
- For “vapor-pressure”, the means of predicted values tend to be a little lower than experimental, though they are much closer in the case of NATADB than CCL4.
- The trend of lower predicted means compared to experimental means is more clearly demonstrated for “henrys-law” values in both data sets.
- In the case of “boiling-point”, the predicted values are greater than the experimental values, though this is much more pronounced in CCL4 while the set of means for NATADB are again fairly close.
Conclusion
In this vignette, a variety of functions that access different types
of data found in the Chemical
endpoints of the CTX APIs
were explored. While this exploration was not exhaustive, it provides a
basic introduction to how one may access data and work with it.
Additional endpoints and corresponding functions exist and we encourage
the user to explore these while keeping in mind the examples contained
in this vignette.