The package diemr
incorporates the diagnostic index
expectation maximisation algorithm used to estimate which genomic
alleles belong to which side of a barrier to gene flow. To start using
diemr
, load the package or install it from CRAN if it is
not yet available:
# Attempt to load a package, if the package was not available, install and load
if(!require("diemr", character.only = TRUE)){
install.packages("diemr", dependencies = TRUE)
library("diemr", character.only = TRUE)
}
Next, assemble paths to all files containing the data to be used by
diemr
. Here, we will use a tiny example dataset that is
included in the package for illustrating the analysis workflow. A good
practice is to check that all files contain data in the correct format
for all individuals and markers. Additionally, the analysis will need a
list with ploidies for all genomic compartments and
individuals, and a vector with indices of
samples that will be included in the analysis.
filepaths <- system.file("extdata", "data7x3.txt", package = "diemr")
# Analysing six individuals
samples <- 1:6
# Assuming diploid markers of all individuals
ploidies = rep(list(rep(2, nchar(readLines(filepaths[1])[1]) - 1)), length(filepaths))
CheckDiemFormat(files = filepaths,
ChosenInds = samples,
ploidy = ploidies)
# File check passed: TRUE
# Ploidy check passed: TRUE
If the CheckDiemFormat()
function fails, work through
the error messages and fix the stored input files accordingly. The
algorithm repeatedly accesses data from the harddisk, so seeing the
passed file and variable check prior to the analysis is critical.
Starting from diemr 1.4
, ploidy might now be assumed to
be diploid for all individuals and all sites across all compartments by
default. Use this only when compartments for the sex chromosomes are not
identified.
To estimate the marker polarities, their diagnostic indices and
support, run the function diem()
with default settings.
Here, we have only one file with data, so paralelisation is unnecessary,
and we set nCores = 1
. The Windows operating system
only allows for nCores = 1. Other operating systems can process
multiple genomic compartments (e.g. chromosomes) in parallel, the
analysis of different genomic compartment files running on multiple
processors.
res <- diem(files = filepaths,
ploidy = ploidies,
markerPolarity = list(c(FALSE, FALSE, TRUE)),
ChosenInds = samples,
nCores = 1)
The result is a list, where the element res$DI
contains
a table with marker polarities, their diagnostic indices and
support.
res$DI
# newPolarity DI Support Marker
# 1 FALSE -4.872256 15.930181 1
# 2 TRUE -3.520647 18.633399 2
# 3 TRUE -13.274822 6.130625 3
The column newPolarity
means that marker 1 should be
imported for subsequent analyses as is, and markers 2 and 3 should be
imported with 0 replaced with 2 and 2 replaced with 0 (hereafter
‘flipped’ 0↔︎2). The marker 3 has the lowest diagnostic index and low
support, indicating that the genotypes scored at this marker are poorly
related to the barrier to gene flow arbitrated by the data.
With the marker polarities optimised to detect a barrier to gene flow, a plot of the polarised genome will show how genomic regions cross the barrier. First, the genotypes need to be imported with optimal marker polarities. Second, individual hybrid indices need to be calculated from the polarised genotypes. And last, the data will be plotted as a raster image.
genotypes <- importPolarized(file = filepaths,
changePolarity = res$markerPolarity,
ChosenInds = samples)
h <- apply(X = res$I4,
MARGIN = 1,
FUN = pHetErrOnStateCount)[1, ]
plotPolarized(genotypes = genotypes,
HI = h[samples])
CAUTION: This is just a quick start to get you started! For real datasets you will use the diagnostic index (DI) to filter the full set of sites you have analysed with
diem
in order to plot only those markers relevant to any barrier detected in the analysis.
The diemr
package uses a consise genome representation.
Let’s have a small dataset of three markers genotyped for seven
individuals.
The genotypes encoded as 0
represent homozygotes for an
allele attributed to barrier side A, 1
are heterozygous
genotypes, 2
are homozygotes for another allele, attributed
to barrier side B, and U
(symbol “_” is also allowed)
represents an unknown state or a third (fourth) allele. The power of
diem
lies in the assurance that the user does not need to
determine the true assignment to
barrier sides A and B before the analysis and the specific genotypes
encoded as 0
and 2
respectively can be
arbitrary.
The leading S
on each line of the input file ensures
that the marker genotypes are read in as a string on all operating
systems. The S
is dropped during import of the genotypes,
and the dataset is imported as a character matrix of all sites.
Some genomic compartments differ between individuals in their ploidy. For example, markers located on chromosome X in mammals will be diploid in females, but haploid in males. Ploidy differences between individuals influence calculation of the hybrid index, which in turn has an effect on the diem analysis.
To set up the diem analysis with multiple compartments, the
markers with different individual ploidies must be stored in separate
files. The file analysed in the Quick
start chapter could contain markers from autosomes and an
additional file will contain markers from an X chromosome, with
individuals 2 and 6 being males. The respective ploidies for the second
genomic compartment will be c(2, 1, 2, 2, 2, 1, 2)
.
Arguments files
and ploidy
will need to
reflect the information, taking care that the order of filenames
corresponds to the order of elements in the list of ploidies.
diem
cannot check that the order of the elements is
correct, only that the information is in the correct format.
filepaths2 <- c(system.file("extdata", "data7x3.txt", package = "diemr"),
system.file("extdata", "data7x10.txt", package = "diemr"))
ploidies2 <- list(rep(2, 7),
c(2, 1, 2, 2, 2, 1, 2))
CheckDiemFormat(files = filepaths2,
ChosenInds = samples,
ploidy = ploidies2)
# File check passed: TRUE
# Ploidy check passed: TRUE
# Set random seed for repeatibility of null polarities (optional)
set.seed(39583782)
# Run diem with verbose = TRUE to store hybrid indices with ploidy-aware allele counts
res2 <- diem(files = filepaths2,
ploidy = ploidies2,
markerPolarity = FALSE,
ChosenInds = samples,
nCores = 1,
verbose = TRUE)
Plotting polarised genomes from multiple compartments requires
separate import of the compartment data. The polarities in the
res2$markerPolarity
element are combined across all
compartments.
# Import polarized genotypes for all compartments
genotypes2 <- importPolarized(files = filepaths2,
changePolarity = res2$markerPolarity,
ChosenInds = samples)
# Load individual hybrid indices from a stored file
h2 <- unlist(read.table("diagnostics/HIwithOptimalPolarities.txt"))
# Plot the polarised genotypes
plotPolarized(genotypes = genotypes2,
HI = h2[samples])
Plotting all sites gives a good first impression on the diversity in
the data. However, to examine the barrier to gene flow, markers with
high diagnostic index will be more informative. Use the
ChosenSites
argument to specify which markers to
display.
# Select a threshold for the top diagnostic markers.
threshold <- quantile(res2$DI$DI, prob = 0.6)
# Create a vector identifying the diagnostic markers at the given threshold
markers <- res2$DI$DI > threshold
# Import only the selected markers
genotypes3 <- importPolarized(files = filepaths2,
changePolarity = res2$markerPolarity,
ChosenInds = samples,
ChosenSites = markers)
# Calculate hybrid index from diagnostic markers
h3 <- apply(genotypes3, 1, FUN = function(x) pHetErrOnStateCount(sStateCount(x)))[1, ]
# Plot diagnostic markers
plotPolarized(genotypes3, h3)
diemr
from the
source?Make sure to set the R
working directory to the folder,
where the package tarball is stored, or include a full path to the file
within the quotes. Update the version number to the specific file.
There are two options. First, use diem
with argument
verbose = TRUE
and hybrid indices will be stored in a text
file in the diagnostics folder in the working directory. The stored
values will not be ploidy-aware. Additionally, these
hybrid indices are blind to the diagnostic index of the markers. Hybrid
indices should be calculated only on the most diagnostic markers. To
calculate the hybrid indices without the small data correction use the
I4
matrix in the diem
output (See FAQ on
Hybrid indices below on how to first filter markers based on their
diagnosticity).
apply(res$I4, MARGIN = 1, FUN = pHetErrOnStateCount)
# [,1] [,2] [,3] [,4] [,5] [,6]
# p 0.5000000 0 0.3333333 0.5000000 0.8333333 1.0000000
# Het 0.3333333 0 0.6666667 0.3333333 0.3333333 0.0000000
# Err 0.0000000 0 0.0000000 0.0000000 0.0000000 0.3333333
To calculate the hybrid indices while ignoring uninformative sites
(which will force all hybrid indices towards 0.5), filter the
importPolarised
data by the DI of each site.
diemr
?Yes. Multiple barriers to gene flow between multiple groups of
samples can be identified iteratively with help from the argument
ChosenInds
. For example, let’s assume that the individual 2
was identified as belonging to one side of the barrier and being
separated from other by the steepest change in the hybrid index. In the
next diem
run, we exclude the individual 2.
samples2 <- c(1, 3:6)
CheckDiemFormat(files = filepaths,
ChosenInds = samples2,
ploidy = ploidies)
# File check passed: TRUE
# Ploidy check passed: TRUE
res2 <- diem(files = filepaths,
ChosenInds = samples2,
ploidy = ploidies,
nCores = 1,
markerPolarity = list(c(FALSE, FALSE, TRUE)))
# calculate hybrid indices from updated I4
h.res2 <- apply(res2$I4,
MARGIN = 1,
FUN = pHetErrOnStateCount)[1, ]
# set names for the hybrid index values
h.res2 <- setNames(h.res2, nm = samples2)
# 1 3 4 5 6
# 0.50 0.33 0.50 0.83 1.00
# calculate the center of the maximum hybrid index change
diffs <- data.frame(rollmean = zoo::rollmean(sort(h.res2), k = 2),
diff = diff(sort(h.res2), lag = 1))
h.res2.c <- diffs$rollmean[which.max(diffs$diff)]
# [1] 0.6666667
Since the center of the barrier is at 0.67 now, diem
separated individuals 1, 3, and 4 from a group that includes 5, 6.
Combined with the result from the first diem
run, we have
identified three groups in the dataset: (2), (1, 3, 4), and (5, 6).
The input data probably contains invariant sites. These cannot be
polarised, and so will keep their initial random polarisation, making
all hybrid indices tend to 0.5. A filter removing all invariant sites
before analysis will speed up analysis, but it will not eliminate the
problem because (1) sequencing errors will make invariant sites appear
variant, or (2) variant sites can have variation irrelevant to a
diem
barrier.
We recommend you filter sites by diagnostic index (DI) after the
diem
analysis to recalculate hybrid indices (and plot
genotypes) with only the most diagnostic markers.
# Select 40% of markers with the highest diagnostic index
threshold <- quantile(res2$DI$DI, prob = 0.6)
genotypes3 <- genotypes2[, res2$DI$DI > threshold]
# Recalculate I4 and hybrid indices
h3 <- apply(genotypes3,
MARGIN = 1,
FUN = \(x) pHetErrOnStateCount(sStateCount(x)))[1, ]
# Plot the polarised markers
plotPolarized(genotypes3, h3)
diemr
?To use diemr
in a publication, please cite (Baird et al. 2023).