Performing polyploid QTL analysis using polyqtlR

Introduction

Installing polyqtlR

IBD probabilities

Data structures
HMM for IBD estimation
“Heuristic” method for IBD estimation
High marker densities
Interpolating IBDs

Visualising IBD haplotypes

Genotypic Information

Performing a QTL scan

Including experimental blocks
Running an initial QTL scan
Visualising QTL results
Significance thresholds
BLUEs
Adding genetic co-factors
Comparing multiple analyses

Exploring the QTL peak

Compare QTL results with offspring

Single Marker Analysis

Meiosis & Recombinations

Meiosis report
Preferential pairing
Recombination landscape
Searching for recombinant individuals
Fishing for specific alleles

Correcting genotyping errors

Concluding remarks

References

Introduction

It is nowadays increasingly feasible to conduct genetic analyses in polyploid populations thanks to developments in genotyping technologies as well as tools designed to deal with these genotypes. polyqtlR is one such tool that provides a number of functions to perform quantitative trait locus (QTL) mapping in F1 populations of outcrossing, heterozygous polyploid species. For more details on the methodology, please see the 2021 Bioinformatics publication of Bourke et al..

polyqtlR assumes that a number of prior steps have already been performed - in particular, that an integrated linkage map for the mapping population has been generated. The package has been developed to utilise the output of the mapping software polymapR, although any alternative mapping software that produces phased and integrated linkage maps could also be used. However, the input files may have to be altered in such cases. Currently, bi-allelic SNP markers are the expected raw genotypic data, which is also the data format used by polymapR in map construction. However, some functions can also accept probabilistic genotypic data, for example in the estimation of IBD probabilities. Full functionality of probabilistic genotypes in the package has yet to be implemented but is planned for future releases. The assignment of marker dosages in polyploid species can be achieved using a number of R packages such as fitPoly (Voorrips, Gort, and Vosman 2011). More background on the steps of dosage-calling and polyploid linkage mapping can be found in a review of the topic by (Peter M. Bourke, Voorrips, et al. 2018).

The QTL analysis functions primarily rely on identity-by-descent (IBD) probabilities, which are the probabilities of inheritance of particular combinations of parental haplotypes. Single-marker QTL detection methods are also available. The package was originally developed with the IBD output of TetraOrigin (Zheng et al. 2016) in mind. However, the TetraOrigin package was implemented in the proprietary software Mathematica, which many users may not have access to without buying a licence. Since 2021 an updated method called PolyOrigin (extended for multi-parental populations) has been released in the OpenSource julia software (Zheng et al. 2021). Alternative options to estimate IBD probabilities for multiple ploidy levels (2x, 3x, 4x and 6x) are also provided directly polyqtlR.

Genotyping errors are a regular feature of modern marker datasets. Although linkage mapping software such as polymapR has been shown to be relatively robust against genotyping errors (Peter M. Bourke, Geest, et al. 2018), if present in sufficiently large proportions (~10 % errors or more) issues may arise with map accuracy and QTL detection. Therefore, a number of functions have been included in polyqtlR to check map accuracy using the estimated IBD probabilities, and to re-impute marker genotypes if required. These imputed genotypes may then be used in an iterative process to re-estimate the linkage map and re-run a QTL analysis.

This tutorial goes through each of the steps in a typical QTL analysis using the example datasets provided with the package, outlining the different options that are available. However, it is certainly recommended to consult the documentation that accompanies each of the functions by running the ? command before the function name (e.g. ?QTLscan).

Installing `polyqtlR`

polyqtlR can be installed using the following call from within R:

install.packages("polyqtlR")

There are a number of package dependencies which should be installed, e.g. Rcpp, foreach, or doParallel. Usually these will be automatically installed as well but if not, you can always install them yourself one by one, e.g.

install.packages("Rcpp")
install.packages("foreach")
install.packages("doParallel")

before re-trying the installation of polyqtlR. Eventually when all dependencies are on your computer, you should be able to successfully run the following command (i.e. without any error message):

library(polyqtlR)

It is also a good time to load the datasets that you will need to perform a typical analysis, namely a phased maplist, a set of SNP marker genotypes (discrete dosage scores) and a set of trait data (phenotypes):

data("phased_maplist.4x", "SNP_dosages.4x", "Phenotypes_4x")

In the example that follows we are using a simulated tetraploid dataset with 2 chromosomes for simplicity.

IBD probabilities

Currently two options to estimate IBD probabilities in an F1 population exist within polyqtlR. The first uses a heuristic approach originally developed for tetraploids but later applied to hexaploid populations (Peter M. Bourke 2014; Geest et al. 2017) and implemented here in the polyqtlR::estimate_IBD function. This can be very computationally efficient at higher ploidy levels (and has been generalised to all ploidy levels), but the algorithm ignores partially-informative marker information. In datasets with large numbers of haplotype-specific markers this is not such an issue, but in datasets with fewer such markers the accuracy of the resulting IBD probabilities is compromised. The method behind TetraOrigin (Zheng et al. 2016) for offspring IBD probability estimation given a known parental marker phasing has also been implemented in the polyqtlR::estimate_IBD function (this is the default method used). Note that unlike the original TetraOrigin software, parental marker phasing is not re-estimated. However, this is usually a direct output of the linkage mapping procedure (e.g. using polymapR).

Data structures

As the IBD data are the most central objects in this package it is worth spending a moment describing them. In polyqtlR, IBD probabilities are organised in nested list form. The first level of structure are the linkage groups. In our example dataset, the list has two elements corresponding to the two linkage groups, each of which is itself a list containing the following elements:

IBDtype The type of IBD probability, either “genotypeIBD” or “haplotypeIBD”
IBDarray A 3d array of IBD probabilities, with dimensions “locus”,“genotype class”, “individual”
map The integrated linkage map (marker names and cM positions, in order)
parental_phase The phasing of the markers in the parental map
marginal.likelihoods A list of marginal likelihoods of different valencies if method “hmm” was used, otherwise NULL
valency The predicted valency that maximised the marginal likelihood, per offspring. For method “heur”, NULL
offspring The offspring names
biv_dec Whether bivalents only (TRUE) or also multivalents were allowed (FALSE) in the procedure
gap Here NULL, but later this can hold a numeric value (e.g. 1 cM) if IBDs are ‘splined’ or interpolated.
genocodes Ordered list of genotype codes used to represent the different genotype classes
pairing Log likelihoods of each of the different pairing scenarios considered
ploidy The ploidy of parent 1
ploidy2 The ploidy of parent 2
method The method used, either “hmm” (Hidden Markov Model), “heur” (heuristic) or “hmm_TO” (Hidden Markov Model TetraOrigin)
error The offspring error prior (eps) used in the calculation.

All functions within polyqtlR for estimating or importing IBD probabilities will automatically return them in this form.

HMM for IBD estimation

The estimate_IBD function of polyqtlR with the default setting method = "hmm" estimates offspring IBD probabilities using a Hidden Markov Model (HMM) developed by Zheng et al (Zheng et al. 2016) in the TetraOrigin package but generalised in polyqtlR to multiple ploidy levels. Currently, diploid (2x), triploid (3x = 4x $\times$ 2x), tetraploid (4x) and hexaploid (6x) populations are handled. genotypes can either be discrete (i.e. integer marker dosage scores), or probabilistic (i.e. the probabilities of each of the ploidy + 1 possible scores). For triploids, tetraploids and hexaploids, the possibility of incorporating double reduction (Peter M. Bourke et al. 2015) (i.e. allowing for the phenomenon of multivalent pairing) is available, although this can have serious performance implications for hexaploids (and in hexaploids, the option of allowing multivalent pairing in both parents for the same chromosome is by default disabled as it requires very high RAM requirements (> 32 Gb). Use argument full_multivalent_hexa = TRUE to allow multivalents simultaneously at the same position from both hexaploid parents).

Some of the code used to estimate these IBD probabilities has been programmed in C++ to improve performance, and relies on both the Rcpp and RcppArmadillo packages to provide the interface between R, C++ and the Armadillo C++ library.

The expected format of input files is that used by the mapping software polymapR (Peter M. Bourke, Geest, et al. 2018). If you have used other software for map construction and/or genotype calling, you will need to convert your input files to the correct format. There are already some tools available to do this (although it should be quite straightforward if you look at the expected format using the example data files provided here). For example, if you generated the phased linkage map using mappoly (Mollinari and Garcia 2019), you can convert your map to the required format using the convert_mappoly_to_phased.maplist function (check out the help using ?convert_mappoly_to_phased.maplist for details).

The genotypes can be either discrete or probabilistic. If the genotypes are discrete, they must be provided as a matrix of marker dosage scores (for example as provided in this tutorial in SNP_dosages.4x) with marker names as row-names, and individual names (including the two parents) as column-names. Checks are performed and non-numeric data is converted to numeric data if needed. Alternatively, probabilistic genotypes can be provided which can either be the direct output of the fitpoly (Voorrips, Gort, and Vosman 2011) function saveMarkerModels, or from some other polyploid-appropriate genotype-calling software. For example, if polyRAD (Clark, Lipka, and Sacks 2019) or updog (Gerard et al. 2018) were used for genotype calling, a conversion step is needed. Functions for doing this are provided by polymapR. Check out ?polymapR::convert_polyRAD or ?polymapR::convert_updog for details.

We run the function estimate_IBD using the phased linkage map phased_maplist.4x as generated by polymapR, in this case allowing for multivalents (bivalent_decoding = FALSE), as follows:

IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
                       genotypes = SNP_dosages.4x,
                       ploidy = 4,
                       bivalent_decoding = FALSE,
                       ncores = 4)

Note that the default setting of bivalent_decoding is TRUE, as the function evaluates faster if only bivalents are considered (the difference in run-time becomes more pronounced for hexaploids). However, allowing multivalents in the model gives a more complete and correct picture.

Here we chose to enable parallel processing to speed up the calculations. Note that jobs are split across linkage groups, so with 5 linkage groups it would have made more sense to use 5 cores if they were available. Use parallel::detectCores() to display how many CPU cores are available, and use 1 or 2 less than this at the very most. In this dataset there are only 2 linkage groups, so no more than 2 cores are needed.

nc <- parallel::detectCores() - 1

Optional: Importing IBDs from TetraOrigin or PolyOrigin

If so desired, the TetraOrigin package (Mathematica software) or the PolyOrigin package (julia software) can be used to estimate IBD probabilities rather than using the estimate_IBD function. The results should be quite similar for diploids or tetraploids. If mapping in triploid or hexaploid populations, polyqtlR::estimate_IBD should be used.

TetraOrigin

TetraOrigin produces by default a large .txt output file after running the function “inferTetraOrigin”. However, it is convenient to produce a summary of this output using the “saveAsSummaryITO” function, which produces a .csv file summarising the results. Information on how to use TetraOrigin is provided with the package itself, downloadable from GitHub.

The function import_IBD imports the .csv output of TetraOrigin::saveAsSummaryITO. It takes a number of arguments, such as folder, which is the folder containing the output of TetraOrigin. For example, if we wanted to import IBDs from the subfolder “TetraOrigin” for a species with 5 linkage groups, we would run the following:

IBD_4x <- import_IBD(method = "TO", #TO for TetraOrigin
                     folder = "TetraOrigin",
                     filename = paste0("TetraOrigin_Output_bivs_LinkageGroup",1:5,"_Summary"),
                     bivalent_decoding = TRUE)

Here it is assumed that the TetraOrigin summary filenames are “TetraOrigin_Output_bivs_LinkageGroup1_Summary.csv” etc. If all goes well, the following messages will be printed per linkage group:

Importing map data under description inferTetraOrigin-Summary,Genetic map of biallelic markers
Importing parental phasing under description inferTetraOrigin-Summary,MAP of parental haplotypes
Importing IBD data under description inferTetraOrigin-Summary,Conditonal genotype probability

PolyOrigin

A more recent version of the TetraOrigin software has been released in julia, extended for multi-parental populations of either diploid or tetraploid species. Details of the method can be found in Zheng et al. (2021) (Zheng et al. 2021). The software, as well as vignettes detailing how to use it can also be found on GitHub.

Assuming the output files are in a folder called “PolyOrigin” with filestem “PolyOrigin_Output”, a similar call to import the IBDs would be:

IBD_4x <- import_IBD(method = "PO", #PO for PolyOrigin
                     folder = "PolyOrigin",
                     filename = "PolyOrigin_Output")

Note that bivalent_decoding or error do not need to be specified when method = "PO", as these parameters are looked up from the PolyOrigin log file. Note also that previously the summary output of TetraOrigin was split up per linkage group. In PolyOrigin, the output from all linkage groups is kept together.

“Heuristic” method for IBD estimation

In cases where the results are needed quickly, or where there are very large numbers of markers, or for ploidy levels above 6, it is convenient to use the heuristic approach to IBD estimation. We do this using the estimate_IBD function as follows:

IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
                       genotypes = SNP_dosages.4x,
                       method = "heur",
                       ploidy = 4)

Note that the attribute IBDtype is now haplotypeIBD, referring to the fact that these IBDs are the probabilities of inheriting each parental haplotype at a locus, as opposed to the probabilities of inheriting particular combinations of parental alleles at a locus (genotypeIBD). Although similar, certain downstream functions such as exploreQTL can only work with the former (although you will still be able to visualise QTL allele effects with the visualiseQTL function). The accuracy of these IBDs is generally lower than if method = "hmm" had been used (and therefore the power of subsequent QTL analyses will be somewhat reduced).

High marker densities

Particularly at higher ploidy levels (6x), it becomes computationally expensive to estimate IBDs using the hmm method. Our experience has shown that for hexaploids with a mapping population size of 400 or more individuals running on 4 cores, about 200 - 300 markers per chromosome can be accommodated on an “above-average” desktop computer (16 GB RAM). Running on a single core will reduce the committed memory, but the evaluation time will therefore be much longer. 250 markers per chromosome should be already enough to estimate IBDs with high accuracy - including extra markers over this amount will add incrementally less information (following the law of diminishing returns).

The function thinmap can be used to make a selection of markers that tries to maximise their distribution across the genome and across parental homologues. The argument bin_size is used to specify the size of the bins in which markers are selected - increasing this results in fewer markers being used. The ideal bin_size is 1 cM, although at higher ploidies and with higher marker densities, wider bins may be needed. The function is called as follows:

thinned_maplist.4x <- thinmap(maplist = phased_maplist.4x,
                              dosage_matrix = SNP_dosages.4x)

## 87 markers from a possible 93 on LG1 were included.
## 
## 89 markers from a possible 93 on LG2 were included.

The object thinned_maplist.4x can then be used in place of phased_maplist.4x in a call to estimate_IBD.

Interpolating IBDs

Regardless of how they were generated, IBD probabilities are estimated at all marker positions provided. When we perform an initial scan for QTL, it is often more efficient to look for QTL at a grid of evenly-spaced positions (such as at every 1 cM). This is because the genotype data has been replaced with multi-point IBD estimates at the marker positions. Information at e.g. ten different markers within a 0.5 cM window is virtually identical and therefore just one representative for this locus should approximate the full information, while reducing the number of statistical tests. This becomes particularly relevant when we generate significance thresholds using permutation tests later, which are computationally quite demanding.

It is therefore recommended to interpolate the probabilities using the spline_IBD function as follows:

IBD_4x.spl <- spline_IBD(IBD_list = IBD_4x, 
                         gap = 1) #grid at 1 cM spacing

Visualising IBD haplotypes

Before performing a QTL analysis, it is a good idea to visualise the IBD probabilities as these can give indications about the quality of the genotypes, the linkage maps and the parental marker phasing. The function visualiseHaplo was developed originally to examine the inherited haplotypes of offspring with extreme (usually favourable) trait scores to see whether their inherited alleles are consistent with a predicted QTL model (we return to this later). But as a first look at the data quality, the function is also useful.

Haplotypes of linkage group 1 of the first nine F1 offspring can be visualised as follows:

visualiseHaplo(IBD_list = IBD_4x,
               display_by = "name",
               linkage_group = 1,
               select_offspring = colnames(SNP_dosages.4x)[3:11], #cols 1+2 are parents
               multiplot = c(3,3)) #plot layout in 3x3 grid

Here the quality of the predicted haplotypes appears to be high - dark colours signifying high confidence (probabilities close to 1) and small numbers of recombinations predicted. Regions of dark red signify double reduction, which were permitted during the estimation of IBDs using estimate_IBD. In poorer-quality datasets, considerably more “noise” may be present, with less certainty about the correct configuration per offspring.

The function returns a list of 2 elements (recombinants and allele_fish) which do not concern us here but which we will return to later. Note that we selected the offspring to display using the select_offspring argument - using the option select_offspring = all will return the whole population. We have also opted to display_by = "name", as opposed to display_by = "phenotype", in which case further arguments are required. For convenience the plots were combined into a 3 x 3 grid using the multiplot argument. For more options, including how to overlay recombination events, see ?visualiseHaplo.

An example of IBD probabilities that show much higher levels of uncertainty might look something like the following:

Here, there are unrealistically-high numbers of recombinations predicted, suggesting that the algorithm had difficulty correctly assigning genotype classes. In such cases it may be worthwhile to re-examine the steps prior to IBD estimation, in particular genotype calling. We will return to this issue again later.

Genotypic Information

Another approach to assessing the quality of the IBD probabilities, as well as understanding the “QTL detection landscape” to some degree is to estimate the Genotypic Information Coefficient (GIC). The GIC is calculated per homologue across the whole population, and ranges from 0 (no information) to 1 (full information). To maximise QTL detection power and precision, we would like the GIC to be uniform and as high as possible (Peter M. Bourke et al. 2019). Note that the GIC is only defined for bivalent pairing, and therefore estimates of GIC from a multivalent-aware HMM are based on offspring predicted to have come from a bivalent-based meiosis for that linkage group (this tends to be most of the population anyway).

We calculate the GIC as follows:

GIC_4x <- estimate_GIC(IBD_list = IBD_4x)

We can also visualise the output of this function as follows:

visualiseGIC(GIC_list = GIC_4x)

At the telomeres there is usually a tapering off of information, a result of having shared marker information from one side only in these regions. It is often possible to discern the relationship between marker coverage and GIC directly from this visualisation, although it should be noted that non-simplex marker alleles provide incomplete information about inheritance. So for example in an autotetraploid, a marker in duplex condition in parent 1, phased to homologues 1 and 3, does not provide much information about the inheritance of homologues 1 and 3. This is because on average, $\frac{2}{3}$ of the population will inherit only one copy of the marker allele, which may have come from either homologue 1 or homologue 3. Without neighbouring marker information to support a decision, we remain unsure, as both possibilities are equally likely.

Performing a QTL scan

In order to perform a QTL scan, we also need phenotypes. There is already an example set of phenotypes included with the package which we previously loaded with the call data(Phenotypes_4x). We first look at the required structure of this (data-frame):

dim(Phenotypes_4x)

## [1] 150   3

head(Phenotypes_4x)

##   year  geno    pheno
## 1 2014 F1_01 53.63065
## 2 2014 F1_02 59.02528
## 3 2014 F1_03 52.96245
## 4 2014 F1_04 48.23597
## 5 2014 F1_05 52.02645
## 6 2014 F1_06 44.62646

Here we see that there are 3 columns in this data-frame, namely “year”, “geno” and “pheno”. There can be as many extra columns as you like for different traits, but at a minumum there must always be two columns, one corresponding to the genotype identifiers (the names of the F1 individuals in the mapping population) and one corresponding to trait scores of some kind (numeric). If a trait was qualitatively scored, for example as “resistant” or “susceptible”, then you first need to re-code this as (e.g.) 1 and 0.

Including experimental blocks

Note also that there is a separate column for “year” in this dataset. In general, only 1 level of experimental blocking is allowed. If there were replicated observations within blocks, then a more complicated nested block analysis would be needed. In such instances, it would probably be best to first derive best linear unbiased estimates (BLUEs) for the trait, with the full nested block structure provided. See the section on BLUEs below for more details. However, the QTL functions provided within polyqtlR are unable to inherit variance components from such an approach, and can only deal with the means in this case.

Running an initial QTL scan

Assuming we have a simple blocking structure as is the case here (single measurements per genotype repeated over 3 years) we can proceed with a preliminary QTL scan as follows:

qtl_LODs.4x <- QTLscan(IBD_list = IBD_4x,
                       Phenotype.df = Phenotypes_4x,
                       genotype.ID = "geno",
                       trait.ID = "pheno",
                       block = "year")

QTLscan by default uses an additive-effects model (Peter M. Bourke et al. 2021). Since package v.0.1.0 it is also possible to run a QTL scan using the full model as described in (Hackett, Bradshaw, and Bryan 2014), which uses the genotypic IBD probabilties rather than the haplotype IBD probabilities as regressors in the detection model. Although this risks over-fitting (as the model is over-parametrised, particularly if genotype probabilities were estimated allowing for double reduction which adds many extra genotype classes), it may be useful to also check and compare the results after significance thresholds have been calculated (see below). The full model can be accessed in the QTLscan function by setting argument allelic_interaction = TRUE.

The function gives a summary table of the trait over the different blocks, which can be useful for identifying possibly outlying observations. Because the function calls on the lm function from base R, missing values are not allowed in the trait scores. QTLscan performs some imputation if blocks are present, by estimating block effects and using them and the observed values to predict a missing phenotype. If this behaviour is not desired, then manually removing these datapoints beforehand is necessary. QTLscan can only work if the genotype identifiers (the offspring names) match completely between the trait data and the IBD data. If this is not the case, an error will result. The function reports the number of offspring with matching genotype and phenotype identifiers, so the user can record the population size that was used in the analysis.

Visualising QTL results

The object returned by QTLscan is a list with a number of items, including:

QTL.res A matrix of QTL results
Perm.res The results of a permutation test, if run (see below). In our example here, this field is NULL, i.e empty.
Residuals Vector of residual phenotypes after blocks or co-factors have been fitted.
Map The linkage map upon which the IBD probabilities were based.

More details on the output can be found by running ?QTLscan. There are a number of functions in the package for visualising the results. One option is to plot per linkage group, which can be achieved using the plotQTL function and grid layout (layout = 'g'). An example call would be as follows:

plotQTL(LOD_data = qtl_LODs.4x,
        layout = "g",
        colour = "darkgreen")

As before, details of the options available for this function can be accessed using ?plotQTL.

An alternative is to plot the results of the whole-genome scan on a single track, which since package v.0.1.0 is the default plotting behaviour of the plotQTL function:

plotQTL(LOD_data = qtl_LODs.4x,
        colour = "dodgerblue")

plotQTL can also visualise the results of a number of analysis together. By default, the superimposed plots are re-scaled so that threshold lines overlap (argument rescale = TRUE). This is covered in more detail in the section on co-factors.

Significance thresholds

Although it already appears like we have an interesting peak on LG 1, we are not sure whether this region is truly associated with our trait or not. One method of establishing whether the association is significant is to run a permutation test. Currently the package has two options. The first uses the full structure of your data as provided. However, it is much faster to run a permutation test when your data has a simple structure, with no experimental blocks or no genetic co-factors. To deal with experimental blocks, you should estimate BLUEs first before running the permutation test. For genetic co-factors, you can first run an analysis with the co-factor(s) included, and then use the residuals as your phenotype. However, it is also possible (but not recommended) to simply run a permutation test without these steps, using the QTLscan function as follows:

qtl_LODs.4x <- QTLscan(IBD_list = IBD_4x,
                       Phenotype.df = Phenotypes_4x,
                       genotype.ID = "geno",
                       trait.ID = "pheno",
                       block = "year",
                       perm_test = TRUE,
                       ncores = nc)

If we now plot the results, the threshold line is automatically included:

plotQTL(LOD_data = qtl_LODs.4x,
        colour = "dodgerblue")

## LG_names not detected in LOD_data. Re-creating LG names for plot...

However, if you ran this step you will have noticed that it is quite slow. Therefore, it is recommended to first estimate a mean phenotypic value per individual over years before running the permutation test, which we do here using best linear unbiased estimates in the nlme package which fits a linear mixed model with genotype as fixed effect.

BLUEs

We can estimate the best linear unbiased estimates using the BLUE function as follows:

blues <- BLUE(data = Phenotypes_4x,
              model = pheno~geno,
              random = ~1|year,
              genotype.ID = "geno")

The resulting object is a data-frame with columns “geno” and “blue”, which we need in our revised call to QTLscan:

qtl_LODs.4x <- QTLscan(IBD_list = IBD_4x,
                       Phenotype.df = blues,
                       genotype.ID = "geno",
                       trait.ID = "blue",
                       perm_test = TRUE,
                       ncores = nc)

If we now plot the results, the threshold line is automatically included:

plotQTL(LOD_data = qtl_LODs.4x,
        colour = "dodgerblue")

Note that the LOD scores have now increased, but the significance threshold has also increased. This has to do with how the LOD score is estimated, using a ratio of the RSS from your fitted model at a putative QTL position with that of the NULL model with no QTL. In the case of blocks, the NULL model already contains your blocking term, and so a portion of the phenotypic variance is already accounted for when calculating the LOD score. When using BLUEs as your phenotype, you first correct your phenotypes for block effects, but the NULL model now has no explanatory variables, which artificially inflates your LODs. However, a permutation test will account for these sorts of discrepancies, and the region of the genome that lies above the threshold should be the same.

Adding genetic co-factors

In many situations we may wish to include already-identified QTL positions as fixed genetic effects in a QTL search. This can help to reduce the amount of unwanted background variance in the phenotypic data which can lead to greater power to detect QTL at other positions. Another possible reason could be to test whether multiple peaks on a single linkage group are independent. There are two options for including co-factors in the package. We could use the automated function check_cofactors which performs all steps automatically, or we can run a manual co-factor analysis using QTLscan by specifying the cofactor_df argument.

Manual co-factor analysis

We first describe the manual approach using QTLscan directly, as this clearly demonstrates how co-factors can be included, and offers the most flexibility. To this we supply a data-frame with at least two columns: the first one giving the linkage group identifier and the second column giving the cM position (note that column identifiers are not mandatory, the function looks for LG in column 1 and cM in column 2 by itself). Since package v.0.1.0 it is possible to add a third column, to specify whether the co-factor should be added as an additive-effects-only QTL, or whether allelic interactions are also to be modelled whemn fitting the co-factor.

Before we add the QTL peak as a co-factor, we need to know where it is. We search for the peak as follows:

findPeak(qtl_LODs.4x, linkage_group = 1)

##   chromosome          marker position      LOD  thresh
## 9          1 LG1_12.3_0x1_h5     12.3 12.29909 5.27213

## [1] 12.3

The peak position is therefore at 12.3 cM. We include this as a co-factor as follows:

qtl_LODs.4x_cofactor <- QTLscan(IBD_list = IBD_4x,
                                Phenotype.df = Phenotypes_4x,
                                genotype.ID = "geno",
                                trait.ID = "pheno",
                                block = "year",
                                cofactor_df = data.frame("LG" = 1,
                                                         "cM" = 12.3),
                                perm_test = FALSE,
                                ncores = nc)#nc is the number of cores, defined earlier

Note that we have not performed a permutation test here. This is possible, but again the running time is somewhat longer than one would like. Therefore, we can use the residuals of the fitted NULL model (including blocks and the co-factor) as a new set of phenotypes as a way to again access the optimised permutation test of QTLscan:

new_pheno <- qtl_LODs.4x_cofactor$Residuals
head(new_pheno)

##    geno       Pheno Block
## 1 F1_01  3.06894507  2014
## 2 F1_02  4.80132691  2014
## 3 F1_03  0.09938343  2014
## 4 F1_04 -1.65306259  2014
## 5 F1_05  0.96184614  2014
## 6 F1_06 -5.48341785  2014

colnames(new_pheno)

## [1] "geno"  "Pheno" "Block"

Note the colnames of the new_pheno, we will need these. We now estimate BLUEs using these as our phenotypes (to account for block effects) and re-run the analysis as follows:

blues <- BLUE(data = new_pheno,
              model = Pheno~geno,
              random = ~1|Block,
              genotype.ID = "geno")

qtl_LODs.4x_cofactor <- QTLscan(IBD_list = IBD_4x,
                                Phenotype.df = blues,
                                genotype.ID = "geno",
                                trait.ID = "blue",
                                perm_test = TRUE,
                                ncores = nc)

plotQTL(LOD_data = qtl_LODs.4x_cofactor,
        colour = "red")

Note that if we wanted to include multiple co-factors, we could do this by supplying a data-frame with multiple rows in cofactor_df. As before, check out ?QTLscan for full details.

Automatic co-factor analysis

A useful alternative to a manual co-factor analysis is to use an automatic co-factor analysis which attempts to build a multi-QTL model by maximising the signficance at all detected QTL peaks. To use this function, we first estimate BLUEs over blocks, and then run the check_cofactors function as follows:

blues <- BLUE(data = Phenotypes_4x,
              model = pheno~geno,
              random = ~1|year,
              genotype.ID = "geno")

QTLmodel <- check_cofactors(IBD_list = IBD_4x,
                            Phenotype.df = blues,
                            genotype.ID = "geno",
                            trait.ID = "blue",
                            LOD_data = qtl_LODs.4x,
                            ncores = nc)


QTLmodel

##   LG   cM deltaLOD CofactorID
## 1  1 12.3 7.026958         NA

The output gives 4 columns: the LG of the QTL, its cM position, the difference between the LOD at the peak and the significance threshold (ie. a threshold-adjusted LOD score) and the CofactorID. Note that the last one is NA in our case, as this QTL peak was detected without including any other genetic co-factor in the model. For more details refer as usual to the help documentation ?check_cofactors.

It can be useful to directly run the PVE function after this to look at the percentage variance explained of the QTL model and all sub-models (note that this function is also presented in its own separate section later. The output of check_cofactors can be passed directly to PVE:

PVE(IBD_list = IBD_4x,
    Phenotype.df = blues,
    genotype.ID = "geno",
    trait.ID = "blue",
    QTL_df = QTLmodel)

## Warning in PVE(IBD_list = IBD_4x, Phenotype.df = blues, genotype.ID = "geno", : Maximal QTL model has a single QTL!
## A single QTL scan using function QTLscan provides comparable results!

## $`1 QTL model`
##    Q1 
## 67.79

Comparing multiple analyses

It is often convenient to plot the results of different analyses together. For example comparing the results of a single marker analysis with an IBD-based approach, or plotting an initial QTL scan versus the results when a set of genetic co-factors were also included. In cases where significance thresholds were determined in separate analyses, this will soon lead to cluttered-looking plots with multiple significance thresholds on top of each other. One possibility is to disable the visualisation of significance thresholds when rendering each plot. However, it is often of interest to compare the results in terms of the presence (or absence) of significant QTL peaks. For this reason the default behaviour of plotQTL when multiple results are passed to LOD_data is to re-scale the overlaid plots so that the significance thresholds for each set of results overlaps perfectly. In this case, LOD_data should be a list of separate QTL results, each of which is the output of a separate call to the function QTLscan or singleMarkerRegression. Re-scaling can be disabled by setting rescale = FALSE.

We can compare the results obtained earlier from the original and co-factor analyses as follows:

plotQTL(LOD_data = list(qtl_LODs.4x,
                        qtl_LODs.4x_cofactor))

Exploring the QTL peak

Once we have detected a QTL, we may also wish to determine what the most likely QTL segregation type is. In other words, we would like to know the source of favourable QTL alleles, or find out which particular alleles lead to an improvement (or reduction) in a trait of interest. However, in order to be able to do this using the exploreQTL function, we again need to have calculated the best linear unbiased estimates, but without co-factors (where the variation associated with our discovered QTL has essentially been removed from the data!). Therefore, first go back and re-estimate the BLUEs using the original phenotypes:

blues <- BLUE(data = Phenotypes_4x,
              model = pheno~geno,
              random = ~1|year,
              genotype.ID = "geno")

We then use these adjusted means in our call to the exploreQTL function as follows:

qtl_explored <- exploreQTL(IBD_list = IBD_4x,
                           Phenotype.df = blues,
                           genotype.ID = "geno",
                           trait.ID = "blue",
                           linkage_group = 1,
                           LOD_data = qtl_LODs.4x)

## Fitting specified QTL models at position 12.3 cM on linkage group 1
## 
## There are 50 individuals with matching identifiers between phenotypic and genotypic datasets.

## 
## The QTL configuration for trait blue at 12.3 cM on linkage group 1 with seg type:
##  Qooo x oooo and additive gene action had BIC 8.0435

Note that we need only specify the linkage_group - the peak LOD position on that linkage group is taken by default. Alternatively, a specific position can be supplied using argument cM (check out ?exploreQTL for extra details).

The function by default plots the BIC (Bayesian Information Criterion) for each of the models supplied. The BIC gives a measure of the likelihood of the different QTL models tested. Note that we did not actually supply a list of different models to test to the function - a pre-defined list was looked up by the function instead. Different models can be supplied using the QTLconfig argument. To see the default list that was supplied in the case of a tetraploid, use the following code:

default_QTLconfig <- get("segList_4x",envir = getNamespace("polyqtlR"))
default_QTLconfig[1:2]

## [[1]]
## [[1]]$homs
## [1] 1
## 
## [[1]]$mode
## [1] "a"
## 
## 
## [[2]]
## [[2]]$homs
## [1] 2
## 
## [[2]]$mode
## [1] "a"

length(default_QTLconfig)

## [1] 224

For a triploid population we would get “segList_3x” etc. Here we see the first 2 models out of a total of 224 models tested, the first being “Qooo x oooo” (“$homs” = 1) and additive (“$mode” = “a”), the second being “oQoo x oooo” and additive, etc.

From the plot we see that model 1 was actually the most likely. The function also reports what this model was, but we can also check out the first element of the previously-defined default_QTLconfig as follows:

default_QTLconfig[[1]] #replace 1 with 27 to see the second most-likely model

## $homs
## [1] 1
## 
## $mode
## [1] "a"

The green dashed line is the default cut-off for competing QTL models, which is any model within 6 BIC units of the minimum BIC. This can be adjusted using the deltaBIC argument. Currently, the function only considers simple additive and dominant bi-allelic QTL models, although testing multi-allelic QTL models is also possible and has already been implemented previously (Peter M. Bourke et al. 2019). Please contact the author if you would require more model flexibility (however for many purposes it is arguably better to stick to a range of simpler models to prevent over-fitting).

It can also be helpful to visualise the QTL effects for comparison purposes. This can be achieved as follows:

visualiseQTLeffects(IBD_list = IBD_4x,
                    Phenotype.df = blues,
                    genotype.ID = "geno",
                    trait.ID = "blue",
                    linkage_group = 1,
                    LOD_data = qtl_LODs.4x)

The visualisation appears to be consistent with the predicted model, i.e. that the allele originating from homologue 1 of parent 1 contributes positively to the trait. Note that there is a limitation to predicting the effects of specific alleles within a mapping population, as there is no reference against which to compare. We have opted here to display the differences between the overall (fixed) population mean and the mean trait value of individuals carrying each of homologues in turn. As such, we are using the whole population as a reference, even though it is a mix of individuals with different allelic compositions. This is not ideal, but is probably the best we can do. An alternative would be to choose one of the parental alleles as a reference level and determine the contrasts between that allele and each of the others in turn. Here again there would be a mixing of signals, as polyploid individuals carry multiple alleles. In short, the results presented here from the exploration of the QTL peak should be interpreted with caution. Larger population sizes may help clarify the resolution of allele effects, but the best situation would be to compare the effects of specific alleles across multiple populations.

Compare QTL results with offspring

We might also like to see whether the QTL results tally with the allelic conformation of the population if sorted by phenotype. For this, we use the display_by = "phenotype" argument in the visualiseHaplo function. First, we look at the distribution of (BLUE) phenotypes of the population:

hist(x = blues$blue)

Suppose we are only interested in offspring with a trait value higher than 44. We can choose to only display the haplotypes of these offspring as follows:

visualiseHaplo(IBD_list = IBD_4x,
               display_by = "phenotype",
               Phenotype.df = blues,
               genotype.ID = "geno",
               trait.ID = "blue",
               pheno_range = c(44,max(blues$blue)),
               linkage_group = 1,
               multiplot = c(3,3))

All these offspring seem to carry homologue 1 as anticipated, although certain individuals (like F1_24) possess recombinations in the region of the QTL, which might be of interest for further fine-mapping work. However, the main point here is that the results seem to be consistent with the predictions from the BIC analysis.

Single Marker Analysis

The usual method of QTL detection in polyploid mapping populations is to use single-marker approaches, looking for associations between marker allele dosages and the trait of interest. The function singleMarkerRegresion does this, again using a permutation test to determine significance thresholds. While IBD-based analyses require markers to be present on a linkage map, single-marker analyses have no such constraints and therefore all markers can be included in the analysis, even if they did not map. By also providing a linkage map, visualisations will be more informative - since any unmapped markers will be assigned to chromosome 0 for plotting, with no meaningful order.

A single marker analysis can be performed as follows:

qtl_SMR.4x <- singleMarkerRegression(dosage_matrix = SNP_dosages.4x,
                                     Phenotype.df = blues,
                                     genotype.ID = "geno",
                                     trait.ID = "blue",
                                     return_R2 = TRUE,
                                     perm_test = TRUE,
                                     ncores = nc,
                                     maplist = phased_maplist.4x)

The results can be plotted alongside the IBD-based results:

plotQTL(LOD_data =  list(qtl_LODs.4x,
                         qtl_SMR.4x),
        plot_type = c("lines","points"), #IBD results as lines, SMR results as points..
        pch = 19,
        colour = c("darkgreen","navyblue"))

PVE

It is often useful to estimate the percentage of variance explained (PVE) by your maximal QTL model. The function PVE computes this, as well as looking at all possible sub-models. In this example it is pretty trivial, as there was only a single QTL detected which explains about 31% of the phenotypic variation:

PVE(IBD_list = IBD_4x,
    Phenotype.df = Phenotypes_4x,
    genotype.ID = "geno",
    trait.ID = "pheno",
    block = "year",
    QTL_df = data.frame("LG" = 1,"cM" = 12.3))

## Warning in PVE(IBD_list = IBD_4x, Phenotype.df = Phenotypes_4x, genotype.ID = "geno", : Maximal QTL model has a single QTL!
## A single QTL scan using function QTLscan provides comparable results!

## Summary of trait: pheno over blocks
## 
## |     |     Min.|  1st Qu.|   Median|     Mean|  3rd Qu.|     Max.| NA's|
## |:----|--------:|--------:|--------:|--------:|--------:|--------:|----:|
## |2014 | 43.05715| 49.12318| 52.85265| 51.83483| 54.36611| 60.27919|    0|
## |2015 | 25.78358| 28.65918| 31.90446| 31.94833| 34.36686| 41.13060|    0|
## |2016 | 33.28601| 36.58176| 40.40502| 39.72945| 42.19982| 46.08645|    0|

## $`1 QTL model`
##    Q1 
## 31.52

Meiosis & Recombinations

We have by now identified the major QTL affecting our trait, and both explored and visualised the allelic effects at the QTL peak positions. However there are a number of remaining topics that could lead to a more complete understanding, leveraging the information contained in the IBD probabilities.

Meiosis report

It is interesting to look at what went on in parental meiosis in the population, as it may give an indication on e.g. the pairing behaviour of chromosomes during meiosis. For many polyploid crops this is not necessarily a well-understood process, so we provide some tools to help give insights into whether your species of interest is autopolyploid, allopolyploid or something in between (“segmental allopolyploid”).

To this end, we run the function meiosis_report as follows:

mr <- meiosis_report(IBD_list = IBD_4x)

Preferential pairing

Apart from visualising the deviations (in counts) from a purely random bivalent pairing model, the function meiosis_report also analyses all predicted bivalent pairings and performs an uncorrected $\chi^2$ test to detect deviations from a polyomic model. If multivalents were also included during IBD estimation these are also counted and recorded, but do not contribute to the $\chi^2$ test or output plots. Note that the colour of the output plot conveys information about the results of the $\chi^2$ test: green signifying no strong deviation from a polysomic model, while red colours indicating a more severe deviation from a polysomic model. In this example, there is no suggestion of anything but polysomic inheritance, associated with autopolyploidy.

An alternative approach to visualise the pairing behaviour of parental homologues is provided by the function visualisePairing. This function displays deviations from random expectations as before, but using line thickness to represent the magnitude of the deviation (by default, an excess of pairing counts is coloured red and a lack of counts coloured blue). One argument needed in order for the function to work is datawidemax, which is the data-wide maximum deviation. This is then given the line width as specified by max.lwd, and all other deviations are scaled accordingly. See ?visualisePairing for more details.

The function can be run using the output of meiosis_report as follows:

par(mfrow = c(1,2))
visualisePairing(meiosis_report.ls = mr,
                 parent = "P1",
                 datawidemax = 5)

Recombination landscape

It can be very informative to perform an analysis of all the predicted recombinations (from cross-over events), as these can indicate issues such as poor map order, problematic individuals (carrying too many predicted recombinations) or errors in marker phase. The position of these can also be biologically relevant, indicating recombination hot-spots or cold-spots. They also provide a means to correct genotyping errors, if we take the view that genotyping errors are a more likely cause of multiple recombination events in an individual rather than cross-overs (which they usually are). In short, an analysis of the recombination landscape offers a lot of scope for improved understanding on many levels.

Within polyqtlR a number of tools are provided for this. We start by looking at the function count_recombinations, which generates a summary of the recombination events per linkage group and per individual:

recom.ls <- count_recombinations(IBD_4x)

Note that this function has only been implemented in the context of bivalent pairing - a likely future update is to also consider and count recombination events from multivalents.

The output is a rather long list giving, per linkage group and per individual the probability of the most likely pairing configuration and the most likely positions of the recombination break-point (taken as the midpoint of the two bounding positions which define the interval). Although we do not know the precise break-point around cross-over events, this value can be taken as indicative of the likely location. To then make sense of this list we call on the visualisation function plotRecLS (plot the Recombination LandScape) to plot the results and return some summary information from the plots:

layout(matrix(c(1,3,2,3),nrow=2)) #make tidy layout
RLS_summary <- plotRecLS(recom.ls) #capture the function output as well

The green lines show a fitted spline to the recombination count data - which are also returned by the function. Indeed, if you explore the output saved in RLS_summary you should see that there are two items: $per_LG and per_individual. In the per-linkage group item, recombination rates that have been interpolated using the splines (rec.rates) are given (per linkage group). In the per-individual item, a named vector of the genome-wide counts of recombinations is given, which can be useful in identifying offspring that perhaps should not have been included in the mapping.

It is also of interest to overlay the predicted recombinations with the visualised haplotypes. This can be done with the visualiseHaplo function, similar to what we did earlier but now adding the argument recombination_data to the call:

visualiseHaplo(IBD_list = IBD_4x,
               display_by = "name",
               linkage_group = 1,
               select_offspring = colnames(SNP_dosages.4x)[3:11], #cols 1+2 are parents
               multiplot = c(3,3), #plot layout in 3x3 grid
               recombination_data = recom.ls)

It is interesting to note the problems with reconstructing the inheritance probabilities of individual F1_01 - compare this to the reconstruction earlier when we also allowed for multivalents and double reduction. F1_04 has also been reconstructed differently, but here the bivalent-only based prediction does not seem to have caused serious conflicts.

Searching for recombinant individuals

We have analysed the recombination landscape, but suppose for the sake of argument we were interested in all offspring that had inherited homologues 1 and 3 from linkage group 1 in the region of our detected QTL. Although we only found evidence for a positive effect of homologue 1, we could imagine a different dataset where there was a second QTL on homologue 3 some distance away. If we wanted to stack these QTL, then recombinant offspring could offer the possibility of stacking these QTL in coupling linkage. We can search for these individuals using the visualiseHaplo function.

We display the offspring by “name” as before. However, we now also provide details of the recombinant_scan homologues that we are interested in. The output of this initial function call will be used in the subsequent step, so we save the output as “visHap1”. We will also use multi-plotting for convenience. Note that the plots have been suppressed here (as there are too many of them to begin with):

visHap1 <- visualiseHaplo(IBD_list = IBD_4x,
                          display_by = "name",
                          select_offspring = "all",
                          linkage_group = 1,
                          cM_range = c(1.56,50),
                          recombinant_scan = c(1,3),
                          multiplot = c(4,2))

We now look at the output of the function:

visHap1

## $recombinants
## [1] "F1_02" "F1_03" "F1_16" "F1_22" "F1_26" "F1_41"
## 
## $allele_fish
## NULL

Six offspring have been identified as having a recombination in this region. There may be others, as this function is only searching among the IBD results for potential recombinants and may miss some with less clear features. However, six is already enough to continue with. We pass the names of these individuals to the function again (in place of “all” offspring) using the select_offspring argument, and specifying the individuals mentioned in visHap1$recombinants, and specifying the desired region of linkage_group 1 using cM_range as follows:

visualiseHaplo(IBD_list = IBD_4x,
               display_by = "name",
               select_offspring = visHap1$recombinants,
               linkage_group = 1,
               cM_range = c(1.56,50),#note: start position = first marker @1.56 cM
               recombinant_scan = c(1,3),
               multiplot = c(3,2),
               recombination_data = recom.ls) # we can add this track for clarity

In each of these individuals we can see evidence for a recombination between homologues 1 and 3 in the specified region.

Note - currently this process takes two steps, which could be confusing. A planned update will allow this to be done in a single step, using the recombination_data instead.

Fishing for specific alleles

The other output associated with the function is $allele_fish, which allows us to fish out individuals that carry a particular allele or combinations of alleles around a particular locus. Suppose we were just interested in the offspring that carried homologue 1 of chromosome 1 around the QTL, so in the region up to 15 cM say. We can “fish out” these individuals as follows:

visHap2 <- visualiseHaplo(IBD_list = IBD_4x,
                          display_by = "name",
                          select_offspring = "all",
                          linkage_group = 1,
                          cM_range = c(1.56,15),
                          allele_fish = 1,
                          multiplot = c(4,2))

Note that these individuals must carry homologue 1 for almost all of the specified segment - therefore specifying a longer segment could also result in fewer individuals (as they are more likely to carry recombined chromosomes). As before, we can inspect the output of this function call as follows:

visHap2

## $recombinants
## NULL
## 
## $allele_fish
## $allele_fish$h1
##  [1] "F1_02" "F1_08" "F1_10" "F1_15" "F1_17" "F1_19" "F1_23" "F1_31" "F1_32"
## [10] "F1_34" "F1_35" "F1_40" "F1_43" "F1_44" "F1_46" "F1_49"

Here there are 16 offspring carrying that particular segment of homologue 1. We can visualise them using a call to the visualiseHaplo function as before:

visualiseHaplo(IBD_list = IBD_4x,
               display_by = "name",
               select_offspring = visHap2$allele_fish$h1,
               linkage_group = 1,
               cM_range = c(1.56,15),
               multiplot = c(4,4))

If all went well, these offspring should all possess the homologue 1 allele in this particular region. Examining the plots shows this to be the case.

Correcting genotyping errors

One final topic that is useful to include here is that of error correction in our marker genotypes. Particularly when using some next-generation sequencing based methods for genotyping, errors in the data can become quite a problem. One of the many applications of IBD probabilities (and their method of estimation) is to try to determine what the offspring in a population actually inherited in terms of their parental alleles. In the Hidden Markov Model introduced earlier, we did not discuss choice of prior for genotyping errors. The function uses a default prior of 0.01. What this means in practice is that we put a very high weight on our genotype observations: with such a prior in a tetraploid, the probability that an offspring score of ‘1’ is in fact a true ‘1’ is 0.99, while the probability of it being 0, 2, 3 or 4 is 0.01 (and therefore the probability of its true state being 0 is only 0.01/4 i.e. only 0.25 %). This is clearly a very strong prior on our observed genotype values and can lead to many erroneous recombinations being predicted.

A very simple solution to this issue when there seems to be evidence of many genotyping errors (see previous haplotype plot for an example) is to increase the error prior when estimating the IBD probabilities, to effectively suppress incorrect prediction of double recombination events. In general, a prior of 0.1 or even 0.2 will suppress spurious recombinations from being predicted. Note that an error prior of 0.8 in a tetraploid means we actually consider all possible dosage scores as equally likely (with a probability of 0.2). However, setting an error prior at this level means that you consider your data to be random noise.

To facilitate this procedure, the function maxL_IBD is provided which runs the step of IBD estimation over the range of errors provided, merging the results by selecting those that maximise the marginal likelihood of the data, per individual (and per linkage group). A default set of error priors (0.01, 0.05, 0.1, 0.2) are checked which should adequetly cover the range of possible error rates commonly encountered. To run this function, use the following call (basically takes the same arguments as estimate_IBD but with errors instead of error to signify multiple error priors being applied):

IBD_multiError <- maxL_IBD(phased_maplist = phased_maplist.4x,
                           genotypes = SNP_dosages.4x,
                           ploidy = 4,
                           bivalent_decoding = FALSE,
                           errors = c(0.01,0.02,0.05,0.1,0.2),
                           ncores = 4)

The output is now a list, but to access the IBD object we have been using here throughout, simply extract the maxL_IBD element of the output, i.e.

IBD_4x <- IBD_multiError$maxL_IBD

Once you are happy with the results (again, using the visualisation tools described earlier) you have one further possibility open to you: to re-impute your marker genotypes given these “smoothed” IBD probabilities. This can be achieved using the function impute_dosages. To demonstrate this here, we will artificially introduce some genotyping errors (something you normally would not do), and then re-estimate the IBDs using a higher error prior:

# Copy our dosage matrix to a new matrix:
error_dosages <- SNP_dosages.4x

# Only work with offspring errors for now:
F1cols <- 3:ncol(error_dosages)

# Count number of (offspring) dosage scores:
ndose <- length(error_dosages[,F1cols])

# Generate a vector of random positions (10% of total nr):
set.seed(42)
error.pos <- sample(1:ndose,round(ndose*0.1))

# Replace real values with these random scores (simple error simulation):
error_dosages[,F1cols][error.pos] <- sapply(error_dosages[,F1cols][error.pos], function(x) sample(setdiff(0:4,x),1))

# Re-estimate IBD probabilities, using the maxL_IBD function:
IBD_multiError <- maxL_IBD(phased_maplist = phased_maplist.4x,
                           genotypes = error_dosages,
                           ploidy = 4,
                           errors = c(0.01, 0.02, 0.05, 0.1, 0.3), 
                           ncores = nc)

IBD_4x.err <- IBD_multiError$maxL_IBD

# Re-impute marker dosages
new_dosages <- impute_dosages(IBD_list=IBD_4x.err,dosage_matrix=error_dosages,
                              min_error_prior = 0.01,
                              rounding_error = 0.2)

## ____________________________________________________________________________
## 
## 186 out of a total possible 186 markers were imputed
## In the original dataset there were 0 % missing values among the 186 markers
## In the imputed dataset there are now 1.4 % missing values among these, using a rounding threshold of 0.2
## 
## ____________________________________________________________________________
## The % of markers with changed dosage scores is as follows (0 = no change):
## error.matrix
##     0     1     2     3     4 
## 89.44  4.18  2.88  1.72  0.35

# Check the results:
round(length(which(error_dosages[,F1cols] != SNP_dosages.4x[,F1cols])) / ndose,2)

## [1] 0.1

round(length(which(new_dosages[,F1cols] != SNP_dosages.4x[,F1cols])) / ndose,2)

## [1] 0.01

So having had 10 % genotyping errors to begin with, we have reduced the error rate in the data to 1 %, but also introduced 1.4 % missing values (these are often around true recombination break-points where we cannot easily decide which parental homologue was inherited). A missing value is an error of sorts, but has a much smaller impact on linkage map quality than a true genotyping error. These corrected genotypes may then be used in an iterative process to improve the underlying linkage map, although usually a single or at most 2 iterations should be used. Previous experience has shown that this approach can also lead to significant improvements in subsequent QTL mapping.

Concluding remarks

We hope you have been able to successfully run a QTL analysis and explore polyploid meiotic dynamics using your data and the methods described here. The package may still be further improved and developed, and feedback on performance, suggestions for improvements / extensions and bug reporting is most welcome; please contact the developer Peter Bourke (peter.bourke at wur.nl).

References

Bourke, Peter M. 2014. “QTL Analysis in Polyploids: Model Testing and Power Calculations for an Autotetraploid.” MSc Thesis, Wageningen University & Research.

Bourke, Peter M., Geert van Geest, Roeland E. Voorrips, Johannes Jansen, Twan Kranenburg, Arwa Shahin, Richard G. F. Visser, Paul Arens, Marinus J. M. Smulders, and Chris Maliepaard. 2018. “polymapR - linkage analysis and genetic map construction from F[1] populations of outcrossing polyploids.” Bioinformatics 34 (20): 3496–3502.

Bourke, Peter M, Christine A Hackett, Roeland E Voorrips, Richard GF Visser, and Chris Maliepaard. 2019. “Quantifying the power and precision of QTL analysis in autopolyploids under bivalent and multivalent genetic models.” G3: Genes, Genomes, Genetics 9 (7): 2107–22.

Bourke, Peter M., Roeland E. Voorrips, Richard G. F. Visser, and Chris Maliepaard. 2018. “Tools for genetic studies in experimental populations of polyploids.” Frontiers in Plant Science 9 (513). https://doi.org/10.3389/fpls.2018.00513.

Bourke, Peter M, Roeland E Voorrips, Christine A Hackett, Geert van Geest, Johan H Willemsen, Paul Arens, Marinus JM Smulders, Richard GF Visser, and Chris Maliepaard. 2021. “Detecting Quantitative Trait Loci and Exploring Chromosomal Pairing in Autopolyploids Using polyqtlR.” Bioinformatics 37 (21): 3822–29.

Bourke, Peter M, Roeland E Voorrips, Richard GF Visser, and Chris Maliepaard. 2015. “The Double-Reduction Landscape in Tetraploid Potato as Revealed by a High-Density Linkage Map.” Genetics 201 (3): 853–63.

Clark, Lindsay V, Alexander E Lipka, and Erik J Sacks. 2019. “polyRAD: Genotype calling with uncertainty from sequencing data in polyploids and diploids.” G3: Genes, Genomes, Genetics 9 (3): 663–73.

Geest, Geert van, Peter M Bourke, Roeland E Voorrips, Agnieszka Marasek-Ciolakowska, Yanlin Liao, Aike Post, Uulke van Meeteren, Richard GF Visser, Chris Maliepaard, and Paul Arens. 2017. “An Ultra-Dense Integrated Linkage Map for Hexaploid Chrysanthemum Enables Multi-Allelic QTL Analysis.” Theoretical and Applied Genetics 130 (12): 2527–41.

Gerard, David, Luis Felipe Ventorim Ferrão, Antonio Augusto Franco Garcia, and Matthew Stephens. 2018. “Genotyping polyploids from messy sequencing data.” Genetics 210 (3): 789–807.

Hackett, Christine A, John E Bradshaw, and Glenn J Bryan. 2014. “QTL Mapping in Autotetraploids Using SNP Dosage Information.” Theoretical and Applied Genetics 127: 1885–1904.

Mollinari, Marcelo, and Antonio Augusto Franco Garcia. 2019. “Linkage Analysis and Haplotype Phasing in Experimental Autopolyploid Populations with High Ploidy Level Using Hidden Markov Models.” G3: Genes, Genomes, Genetics 9 (10): 3297–3314.

Voorrips, Roeland E, Gerrit Gort, and Ben Vosman. 2011. “Genotype calling in tetraploid species from bi-allelic marker data using mixture models.” BMC Bioinformatics 12 (1): 172.

Zheng, Chaozhi, Rodrigo R Amadeu, Patricio R Munoz, and Jeffrey B Endelman. 2021. “Haplotype Reconstruction in Connected Tetraploid F1 Populations.” Genetics 219 (2): iyab106.

Zheng, Chaozhi, Roeland E. Voorrips, Johannes Jansen, Christine A. Hackett, Julie Ho, and Marco C. A. M. Bink. 2016. “Probabilistic Multilocus Haplotype Reconstruction in Outcrossing Tetraploids.” Genetics 203 (1): 119–31.