Welcome to RaMS! This vignette is designed to provide examples using the package at various levels of complexity. Let’s jump right in.
If you have your own data, feel free to load it here. If not, there are a couple of small example files you’re welcome to use in the “extdata” folder. The first of these contains DDA data from a pooled sample, while the others are individual samples. I’ll be using these throughout. For more details on the origin of these files, see the minification vignette.
library(RaMS)
# Locate the file directory
msdata_dir <- system.file("extdata", package = "RaMS")
# Identify the files of interest
data_files <- list.files(msdata_dir, pattern = "mzML", full.names = TRUE)[1:4]
# Check that the files identified are the ones expected
basename(data_files)
#> [1] "Blank_129I_1L_pos_20240207-MS3.mzML.gz"
#> [2] "LB12HL_AB.mzML.gz"
#> [3] "LB12HL_CD.mzML.gz"
#> [4] "LB12HL_EF.mzML.gz"
There’s only one function to worry about in RaMS: the aptly named grabMSdata. This function has a couple of arguments with sensible defaults, but you’ll always need to tell it two things: one, which files you’d like to process; and two, the data you’d like to obtain from those files.
Let’s start simple, with a single file and the most basic information about it.
A TIC reports the total intensity measured by the mass analyzer during each scan, so the data is parsed into two columns: retention time (rt) and intensity (int). This makes it easy to read and simple to plot:
single_file <- data_files[2]
msdata <- grabMSdata(single_file, grab_what = "TIC")
#>
#> Reading file LB12HL_AB.mzML.gz... 0.09 secs
#> Reading TIC...0.09 secs
#> Binding files together into single data.table
#> Total time: 0.19 secs
| rt | int | filename |
|---|---|---|
| 4.009000 | 24680889 | LB12HL_AB.mzML.gz |
| 4.024533 | 22832946 | LB12HL_AB.mzML.gz |
| 4.040133 | 23881353 | LB12HL_AB.mzML.gz |
Since we asked for a single thing, the TIC, our msdata object is a list with a single entry: the TIC. Let’s plot that data:
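A minimal base-R sketch of that plot, assuming the msdata object created above:
plot(msdata$TIC$rt, msdata$TIC$int, type = "l",
     xlab = "Retention time (min)", ylab = "Intensity")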
Simple enough!
A BPC is just like a TIC except that it records the maximum intensity measured, rather than the sum of all intensities. This data is also collected by the mass analyzer and doesn’t need to be calculated.
msdata <- grabMSdata(single_file, grab_what = "BPC")
#>
#> Reading file LB12HL_AB.mzML.gz... 0.08 secs
#> Reading BPC...0.11 secs
#> Binding files together into single data.table
#> Total time: 0.2 secs
Since the data is parsed in a “tidy” format, it plays nicely with popular packages such as ggplot2. Let’s use that to plot our BPC instead of the base R plotting system:
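A minimal sketch, assuming ggplot2 is installed:
library(ggplot2)
ggplot(msdata$BPC) + geom_line(aes(x=rt, y=int))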
The advantages of tidy data and ggplot become clear when we load more than one file at a time because we can group and color by the third column, the name of the file from which the data was read. RaMS will also automatically enable a loading bar if more than one file is loaded; this can be disabled by setting the verbosity parameter to 0.
msdata <- grabMSdata(data_files[2:4], grab_what = "BPC")
#> | | | 0% | |======================= | 33% | |=============================================== | 67% | |======================================================================| 100%
#> Total time: 0.46 secs
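For example, grouping and coloring by filename might look something like this (a sketch, assuming ggplot2 is already loaded):
ggplot(msdata$BPC) + geom_line(aes(x=rt, y=int, color=filename))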
And of course, this means that all of ggplot’s aesthetic power can be brought to your chromatograms as well, so customize away!
ggplot(msdata$BPC) +
geom_polygon(aes(x=rt, y=int, color=filename), lwd=1, fill="#FFFFFF44") +
theme(legend.position = "inside", legend.position.inside=c(0.8, 0.7),
plot.title = element_text(face="bold"),
axis.title = element_text(size = 15)) +
scale_y_continuous(labels = c(0, "250M", "500M"), breaks = c(0, 2.5e8, 5e8)) +
scale_colour_manual(values = c("#2596be", "#6c25be", "#bea925")) +
labs(x="Retention time (minutes)", y="Intensity",
title = "Base peak chromatogram", color="Files:") +
coord_cartesian(xlim = c(7.50, 9), ylim = c(0, 5e8))
RaMS also provides some basic file metadata extraction capability, although the focus for this package is on the actual data and other MS packages handle file metadata much more elegantly. This is one area where there are major differences between mzML and mzXML file types - the mzXML file type simply doesn’t encode as much metadata as the mzML file type, so RaMS can’t extract it.
# Since the minification process strips some metadata, I use the
# less-minified DDA file here
grabMSdata(files = data_files[1], grab_what = "metadata")
#>
#> Reading file Blank_129I_1L_pos_20240207-MS3.mzML.gz... 0.06 secs
#> Reading file metadata...0.04 secs
#> Binding files together into single data.table
#> Total time: 0.1 secs
#> $metadata
#> source_file inst_data config_data timestamp
#> <char> <char> <list> <POSc>
#> 1: Blank_129I_1L_pos_20240207-MS3.raw None found <data.frame[1x3]> <NA>
#> n_spectra n_chromatograms ms_levels mz_lowest mz_highest lambda_lowest
#> <int> <int> <char> <num> <num> <num>
#> 1: 227 0 MS1, MS3, MS2 0 959.8531 NA
#> lambda_highest rt_start rt_end centroided polarity
#> <num> <num> <num> <lgcl> <char>
#> 1: NA 46.01383 48.98667 TRUE positive
#> filename
#> <char>
#> 1: Blank_129I_1L_pos_20240207-MS3.mzML.gz
MS1 data can be extracted just as easily, by supplying “MS1” to the grab_what argument of the grabMSdata function.
msdata <- grabMSdata(data_files[2:4], grab_what = "MS1")
#> | | | 0% | |======================= | 33% | |=============================================== | 67% | |======================================================================| 100%
#> Total time: 1.03 secs
| rt | mz | int | filename |
|---|---|---|---|
| 4.009 | 139.0503 | 1800550.12 | LB12HL_AB.mzML.gz |
| 4.009 | 148.0967 | 206310.81 | LB12HL_AB.mzML.gz |
| 4.009 | 136.0618 | 71907.15 | LB12HL_AB.mzML.gz |
So we’ve now got the mz column, corresponding to the mass-to-charge ratio (m/z) of an ion. This means that we can now filter our data for specific masses and separate out molecules with different masses.
Note that this also makes the data much larger in R’s memory - so don’t go loading hundreds of files simultaneously. If that’s necessary, check out the section below on saving space.
Because RaMS returns data.tables rather than normal data.frames, indexing is super-fast and a bit more intuitive than with base R. Below, I also use the pmppm function from RaMS to produce a mass range from an initial mass and spectrometer accuracy (here, 5 parts-per-million) and select only retention times that are between 4 and 9 minutes.
library(data.table)
adenine_mz <- 136.06232
adenine_data <- msdata$MS1[mz%between%pmppm(adenine_mz, ppm=5)][rt%between%c(4, 9)]
ggplot(adenine_data) + geom_line(aes(x=rt, y=int, color=filename))
This makes it easy to grab the data for multiple compounds of interest with a simple loop, provided here by the purrr package of the tidyverse:
library(tidyverse)
mzs_of_interest <- c(Adenine=136.06232, Valine=118.0865, Homarine=138.055503)
mass_data <- imap(mzs_of_interest, function(mz_i, name){
cbind(msdata$MS1[mz%between%pmppm(mz_i, ppm=10)], name)
}) %>%
bind_rows() %>%
filter(rt%between%c(4, 9))
ggplot(mass_data) +
geom_line(aes(x=rt, y=int, color=filename)) +
facet_wrap(~name, ncol = 1, scales = "free_y")
RaMS also handles MS2 data elegantly. Request it with the “MS2” option for grab_what, although it’s often a good idea to grab the MS1 data alongside.
msdata <- grabMSdata(data_files[1], grab_what = c("MS1", "MS2", "MS3"))
#>
#> Reading file Blank_129I_1L_pos_20240207-MS3.mzML.gz... 0.08 secs
#> Reading MS1 data...0.01 secs
#> Reading MS2 data...0.02 secs
#> Reading MS3 data...0.04 secs
#> Binding files together into single data.table
#> Total time: 0.16 secs
| rt | premz | fragmz | int | voltage | filename |
|---|---|---|---|---|---|
| 46.45483 | 351.0818 | 51.10470 | 106.65444 | 40 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.45483 | 351.0818 | 52.49430 | 90.74608 | 40 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.45483 | 351.0818 | 54.45892 | 88.82258 | 40 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
DDA data can be plotted nicely with ggplot2 as well. Typically it makes sense to filter for a precursor mass, then render the fragments obtained.
iodine_MS2 <- msdata$MS2[premz%between%pmppm(351.0818, 5)][rt==min(rt)]
iodine_MS2$int <- iodine_MS2$int/max(iodine_MS2$int)
ggplot(iodine_MS2) +
geom_point(aes(x=fragmz, y=int)) +
geom_segment(aes(x=fragmz, xend=fragmz, y=int, yend=0)) +
scale_y_continuous(breaks = c(0, .5, 1), labels = c("0%", "50%", "100%")) +
labs(x="Fragment m/z", y="")
This is also the perfect place to enable some interactivity with packages such as plotly, making data exploration not only simple but also highly intuitive.
## Not run to save space in the vignette:
library(plotly)
compound_MS1 <- msdata$MS1 %>%
filter(mz%between%pmppm(351.0818, 10)) %>%
filter(!str_detect(filename, "DDA")) %>%
slice_max(int, by = rt)
compound_MS2 <- msdata$MS2[premz%between%pmppm(351.0818, 10)] %>%
group_by(rt) %>%
arrange(desc(int)) %>%
slice(1:10) %>%
summarise(frags=paste(
paste(round(fragmz, digits = 3), round(int), sep = ": "), collapse = "\n"),
.groups="drop"
) %>%
mutate(int=approx(x = compound_MS1$rt, y=compound_MS1$int, xout = rt)$y)
plot_ly(compound_MS1) %>%
add_trace(type="scatter", mode="lines", x=~rt, y=~int, hoverinfo="none") %>%
add_trace(type="scatter", mode="markers", x=~rt, y=~int,
text=~frags, hoverinfo="text", showlegend=FALSE,
marker=list(color="black"), data = compound_MS2) %>%
layout(annotations=list(x=min(compound_MS2$rt), y=median(compound_MS2$int)*10,
text="Mouse over to see\nMSMS fragments"))
Easy access to MS2 data also allows us to rapidly perform simple operations such as searching for a specific fragment mass. For example, if we want to find other precursor masses that produced a fragment of 57.07, we simply subset the MS2 data for fragments in a range around that mass:
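One way to perform that subset, reusing the msdata object from the MS1/MS2/MS3 call above (the 5 ppm window here is an assumption, not a value from the original analysis):
msdata$MS2[fragmz%between%pmppm(57.07, ppm=5)]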
| rt | premz | fragmz | int | voltage | filename |
|---|---|---|---|---|---|
| 46.45483 | 351.0818 | 57.07004 | 2886.5396 | 40 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.52917 | 351.0818 | 57.07003 | 2928.7546 | 40 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.60383 | 351.0814 | 57.06982 | 171.9509 | 40 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.60383 | 351.0814 | 57.07001 | 4307.4131 | 40 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.75333 | 351.0817 | 57.07002 | 6987.7876 | 40 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.82800 | 351.0818 | 57.06979 | 150.1399 | 40 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
Support for MS3 data was added in RaMS 1.4 and includes an additional “prepremz” column that contains the original MS1 value selected for later fragmentation.
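The first few rows of that table can be inspected with something like the sketch below, assuming the msdata object from the call above:
head(msdata$MS3)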
| rt | prepremz | premz | fragmz | int | voltage | filename |
|---|---|---|---|---|---|---|
| 46.27817 | 351.0817 | 57.07004 | 41.02417 | 118.76534 | 60 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.27817 | 351.0817 | 57.07004 | 41.06103 | 106.31035 | 60 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.27817 | 351.0817 | 57.07004 | 42.34753 | 105.90051 | 60 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.27817 | 351.0817 | 57.07004 | 42.35089 | 80.45853 | 60 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.27817 | 351.0817 | 57.07004 | 42.35569 | 86.08349 | 60 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
| 46.27817 | 351.0817 | 57.07004 | 42.57482 | 101.63557 | 60 | Blank_129I_1L_pos_20240207-MS3.mzML.gz |
mzMLs can also contain precompiled chromatograms. Version 1.3 of RaMS enabled the extraction of these chromatograms just like MS1 or MS2 data and added the new chroms slot to hold them. However, these need to be requested using grab_what = "chroms" and will create this new slot in the returned object. Chromatograms are returned as a data.table with seven columns: chromatogram type (usually TIC, BPC or SRM info), chromatogram index, target mz, product mz, retention time (rt), intensity (int), and the filename. This allows simple plotting of various chromatograms such as those returned by MRM:
chrom_file <- system.file("extdata", "wk_chrom.mzML.gz", package = "RaMS")
msdata_chroms <- grabMSdata(chrom_file, verbosity = 0, grab_what = "chroms")
given_chrom <- msdata_chroms$chroms[chrom_type=="SRM iletter1"]
ptitle <- with(given_chrom, paste0(
unique(chrom_type), ": Target m/z = ", unique(target_mz), "; Product m/z = ",
unique(product_mz)
))
plot(given_chrom$rt, given_chrom$int, type="l", main=ptitle)
Version 1.3 also enabled support for DAD (diode array detection) data. This data type can be requested using grab_what = "DAD" and will return an additional item in the list with columns for retention time, wavelength, and intensity.
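A minimal sketch of such a request is shown below; the file path is hypothetical and the wavelength column is assumed to be named lambda:
## Not run: requires a file containing DAD data
msdata_dad <- grabMSdata("my_DAD_file.mzML", grab_what = "DAD")
ggplot(msdata_dad$DAD[lambda==254]) + geom_line(aes(x=rt, y=int))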
The sections below will likely be of interest to users who are already familiar with RaMS and are looking to optimize speed, reduce memory requirements, or are otherwise interested in the details of what RaMS does under the hood. If you’re just getting started, I strongly recommend applying RaMS to your own data before you read on. For a more detailed analysis of the first two sections on saving space and speeding things up, consider also browsing the speed & size comparison vignette.
Mass-spec files are typically tens or hundreds of megabytes in size, which means that the simultaneous analysis of many files can easily exceed a computer’s memory. Since RaMS stores all data in R’s working memory, this can become a problem for large analyses. However, much of the usage envisioned for RaMS on this scale doesn’t require access to the entire file, the entire time. Instead, users are typically interested in a few masses of interest or a specific time window. This means that while each file still needs to be read into R in full to find the data of interest, extraneous data can be discarded before the next file is loaded.
This functionality can be enabled by passing “EIC” and/or “EIC_MS2” to the grab_what argument of grabMSdata, along with a vector of masses to extract (mz) and the instrument’s ppm accuracy. When this is enabled, files are read into R’s memory sequentially, the mass window is extracted, and the rest of the data is discarded.
all_data <- grabMSdata(data_files[-1], grab_what = c("MS1", "MS2"))
#> | | | 0% | |======================= | 33% | |=============================================== | 67% | |======================================================================| 100%
#> Total time: 1.03 secs
mzs_of_interest <- c(adenine=136.06232, valine=118.0865)
small_data <- grabMSdata(data_files[-1], grab_what = c("EIC", "EIC_MS2"),
mz=mzs_of_interest, ppm = 5)
#> | | | 0% | |======================= | 33% | |=============================================== | 67% | |======================================================================| 100%
#> Total time: 1.07 secs
all_data$MS1 %>%
mutate(type="All data") %>%
rbind(small_data$EIC %>% mutate(type="Extracted data only")) %>%
filter(!str_detect(filename, "DDA")) %>%
filter(rt%between%c(5, 15)) %>%
group_by(rt, filename, type) %>%
summarise(TIC=sum(int), .groups="drop") %>%
ggplot() +
geom_line(aes(x=rt, y=TIC, color=filename)) +
facet_wrap(~type, ncol = 1)
As expected, the size of the small_data object is much smaller than the all_data object, here by a factor of about 20x. For files that haven’t already been “minified”, that size reduction will be even more significant. Of course, this comes with the cost of needing to re-load the data a second time if a new mass feature becomes of interest, but this shrinkage is especially valuable for targeted analyses where the analytes of interest are known in advance.
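One way to check the difference yourself is with base R’s object.size (a sketch; the exact numbers will depend on the files loaded):
format(object.size(all_data), units = "MB")
format(object.size(small_data), units = "MB")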
A second way of reducing file size is to constrain the retention time dimension rather than the m/z dimension. This can be done with the rtrange argument, which expects a length-two vector corresponding to the lower and upper bounds on retention time. This is useful when a small time window is of interest, and only the data between those bounds is relevant. Below, peaks are deliberately sliced in half to show that this is filtering on retention time instead of mass.
small_data <- grabMSdata(data_files[-1], grab_what = c("MS1", "MS2"), rtrange = c(6, 8))
#> | | | 0% | |======================= | 33% | |=============================================== | 67% | |======================================================================| 100%
#> Total time: 0.87 secs
all_data$MS1 %>%
mutate(type="All data") %>%
rbind(small_data$MS1 %>% mutate(type="Extracted data only")) %>%
filter(!str_detect(filename, "DDA")) %>%
group_by(rt, filename, type) %>%
summarise(TIC=sum(int), .groups="drop") %>%
ggplot() +
geom_line(aes(x=rt, y=TIC, color=filename)) +
facet_wrap(~type, ncol = 1)
The savings are also worth noting here, and constraining the retention time speeds up data retrieval slightly as well. Since m/z and intensity values are stored as encoded binary within each scan of mzML and mzXML files while retention time information is stored as plain-text attributes, eliminating scans with a retention time window removes the need to decode the intensity and m/z information in those scans. However, decoding is rarely the rate-limiting step; for more information about speeding things up, continue to the next section.
So, RaMS isn’t fast enough for you? Let’s see what we can do to improve that. The first step in speeding things up is discovering what’s slow. Typically, this is the process of reading in the mzML/mzXML file rather than any processing that occurs afterward, but this is not always the case. To examine bottlenecks, RaMS includes timing information that’s produced if the verbosity argument is set to 2 or higher.
all_data <- grabMSdata(data_files, grab_what = c("MS1", "MS2"), verbosity = 2)
#> | | | 0%
#> Reading file Blank_129I_1L_pos_20240207-MS3.mzML.gz... 0.09 secs
#> Reading MS1 data...0.01 secs
#> Reading MS2 data...0.02 secs
#> | |================== | 25%
#> Reading file LB12HL_AB.mzML.gz... 0.19 secs
#> Reading MS1 data...0.12 secs
#> Reading MS2 data...0.02 secs
#> | |=================================== | 50%
#> Reading file LB12HL_CD.mzML.gz... 0.2 secs
#> Reading MS1 data...0.16 secs
#> Reading MS2 data...0.02 secs
#> | |==================================================== | 75%
#> Reading file LB12HL_EF.mzML.gz... 0.3 secs
#> Reading MS1 data...0.15 secs
#> Reading MS2 data...0.02 secs
#> | |======================================================================| 100%
#> Binding files together into single data.table
#> Total time: 1.3 secs
In general, slow file read times can be improved by compressing the data. mzML files are highly compressible, with options to compress both the data itself using the zlib method and the files as a whole using gzip.

gzip is the simplest one to use, as a plethora of methods exist to compress files this way, including online sites. RaMS can read the data directly from a gzipped file, no decompression necessary, so this is an easy way to reduce file size and read times. Sharp-eyed users will have noticed that the demo files are already gzipped, which is part of the reason they are so small.

zlib compression is slightly trickier, and is most often performed with tools such as ProteoWizard’s msconvert tool with the option “-z”.
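For example, one way to gzip an mzML file from within R is via the R.utils package (a sketch; the package and filename below are assumptions, not part of RaMS):
R.utils::gzip("my_file.mzML", remove = FALSE)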
Read times may also be slow if files are being accessed via the Internet, either through a VPN or network drive. If your files are stored elsewhere, consider first moving those files somewhere more local before reading data from them.
If the bottleneck appears when reading MS1 data, consider restricting the retention time range with the rtrange argument or using more detailed profiling tools such as RStudio’s “Profile” options or the profvis package. Pull requests that improve data processing speed are always welcome!
While the package is not currently set up for parallel processing, this is a potential future feature if a strong need is demonstrated.
While there’s only one main function (grabMSdata) in RaMS, you may have noticed that two other functions have been exposed that perform similar tasks: grabMzmlData and grabMzxmlData. The main function grabMSdata serves as a wrapper around these two: it detects the file type, adds the “filename” column to the data, and loops over multiple files if provided. However, there’s often reason to use these internal functions separately.
For one, the objects themselves are smaller because they don’t have the filename column attached yet. You as the user will need to keep track of which data belongs to which files in this case.
Another use case might be applying functions to each file individually, perhaps aligning to a reference chromatogram or identifying peaks. Rather than spending the time to bind the files together and immediately separate them again, these functions have been exposed to skip that step.
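A minimal sketch of calling one of these functions directly on a single file, assuming the data_files vector from earlier:
single_file_data <- grabMzmlData(data_files[2], grab_what = "MS1")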
Finally, these functions are useful for parallelization. Because iterating over each mass-spec file is often the largest reasonable chunk, these functions can be passed directly to parallel processes like mclapply or doParallel. However, parallelization is a beast best handled by individual users because its actual implementation often differs wildly and its utility depends strongly on individual setups (remember that parallelization won’t help with slow I/O times, so it may not always improve data processing speed).
## Not run:
# Option 1: parallel::parLapply
library(parallel)
cl <- makeCluster(getOption("cl.cores", detectCores()-1))
output_data <- parLapply(data_files, grabMzmlData, grab_what="everything", cl = cl)
stopCluster(cl)

# Option 2: foreach + doParallel
library(foreach)
library(doParallel)
registerDoParallel(detectCores()-1)
output_data <- foreach (i=data_files) %dopar% {
  RaMS::grabMzmlData(i, grab_what="everything")
}
stopImplicitCluster()
RaMS is possible because mzML and mzXML documents are fundamentally XML-based. This means that we can leverage speedy and robust XML parsing packages such as xml2 to extract the data. Under the hood, RaMS relies on XPath navigation to collect various bits of mass-spec data, and the format of mzML and mzXML files provides the tags necessary. That means a lot of RaMS code consists of lines like the following:
## Not run:
# Find all precursor m/z nodes in the document via their XPath
data_nodes <- xml2::xml_find_all(mzML_nodes, xpath="//d1:precursorMz")
# Pull the requested attribute out of each node
raw_data <- xml2::xml_attr(data_nodes, "value")
…plus a lot of data handling to get the output into the tidy format.
The other tricky bit of data extraction is converting the (possibly compressed) binary data into R-friendly objects. This is usually handled with code like that shown below. Many of the settings can be deduced from the file, but sometimes compression types need to be guessed at and will throw a warning if so.
## Not run:
# Decode the base64-encoded string into its original bytes
decoded_binary <- base64enc::base64decode(binary)
raw_binary <- as.raw(decoded_binary)
# Decompress using the compression type recorded in the file metadata (e.g. zlib)
decomp_binary <- memDecompress(raw_binary, type = file_metadata$compression)
# Reinterpret the raw bytes as doubles using the encoded precision (4 or 8 bytes)
final_binary <- readBin(decomp_binary, what = "double",
                        n=length(decomp_binary)/file_metadata$mz_precision,
                        size = file_metadata$mz_precision)
# See https://github.com/ProteoWizard/pwiz/issues/1301
Fundamentally, mass-spectrometry data is formatted as a ragged array, with an unknown number of m/z and intensity values for a given scan. This makes it difficult to encode neatly without interpolating, but tidy data provides a solution by stacking those arrays rather than aligning them into some sort of matrix.
This ragged shape is also the reason that subsetting mass-spec data by retention time is trivial - grab the scans that correspond to the desired retention times and you’re done. Subsetting by mass, on the other hand, requires decoding each and every scan’s m/z and intensity data. If you’re reading a book and only want a couple chapters, it’s easy to flip to those sections. If you’re looking instead for every time a specific word shows up, you’ve gotta read the whole thing.
For more information about the mzML data format and its history, check out the specification at https://www.psidev.info/mzML.
Vignette last built on 2024-10-08