Ancestral Sequence Reconstruction

Klaus Schliep

Graz University of Technology
klaus.schliep@gmail.com

2024-09-17

Introduction

These notes describe the ancestral sequence reconstruction using the phangorn package (Schliep 2011). phangorn provides several methods to estimate ancestral character states with either Maximum Parsimony (MP) or Maximum Likelihood (ML). For more background on all the methods see e.g. (Felsenstein 2004) or (Yang 2006).

Parsimony reconstructions

To reconstruct ancestral sequences we first load some data and reconstruct a tree:

library(phangorn)
fdir <- system.file("extdata/trees", package = "phangorn")
primates <- read.phyDat(file.path(fdir, "primates.dna"),
                        format = "interleaved")
tree <- pratchet(primates, trace=0) |> acctran(primates) |> makeNodeLabel()
parsimony(tree, primates)
## [1] 746

For parsimony analysis of the edge length represent the observed number of changes. Reconstructing ancestral states therefore defines also the edge lengths of a tree. However there can exist several equally parsimonious reconstructions or states can be ambiguous and therefore edge length can differ. In phangorn all the ancestral reconstructions for parsimony used to be based on the fitch algorithm (below version 3.0) and needed bifurcating trees. However trees can get pruned afterwards using the function multi2di from ape. Recently we replaced the acctran routine with a method based on the sankoff algorithm adopting the algorithm for joint reconstruction (Pupko et al. 2000) and breaking ties at random. This has the additional benefit that to infer phylogenies with multifurcations.

“MPR” reconstructs the ancestral states for each (internal) node as if the tree would be rooted in that node. However the nodes are not independent of each other. If one chooses one state for a specific node, this can restrict the choice of neighboring nodes (figures 2 and 3). There is also an option “POSTORDER” which is only a one pass algotrithm, which we use for teaching purposes. The function acctran (accelerated transformation) assigns edge length and internal nodes to the tree (Swofford and Maddison 1987).

anc.pars <- anc_pars(tree, primates)

The plotSeqLogo function is a wrapper around the from the ggseqlogo function in the ggseqlogo package (Wagih 2024) and provides a simple way to show proportions of a nucleotides of ancestral states (see figure 1).

plotSeqLogo(anc.pars, node=getRoot(tree), 1, 20)
Fig 1. Ancestral reconstruction for a node.
Fig 1. Ancestral reconstruction for a node.
plotAnc(anc.pars, 17)
title("MPR")
Fig 2. Ancestral reconstruction using MPR.
Fig 2. Ancestral reconstruction using MPR.

Likelihood reconstructions

phangorn also offers the possibility to estimate ancestral states using ML. The advantages of ML over parsimony is that the reconstruction accounts for different edge lengths. Currently a marginal construction is implemented (see [Yang (2006)](Koshi and Goldstein 1996)), but the joint reconstruction (Pupko et al. 2000) only for models without rate variation (e.g. gamma models) or invariant sites.

fit <- pml(tree, primates)
fit <- optim.pml(fit, model="F81", control = pml.control(trace=0))

We can assign the ancestral states according to the highest likelihood (“ml”): \[ P(x_r = A) = \frac{L(x_r=A)}{\sum_{k \in \{A,C,G,T\}}L(x_r=k)} \] and the highest posterior probability (“bayes”) criterion: \[ P(x_r=A) = \frac{\pi_A L(x_r=A)}{\sum_{k \in \{A,C,G,T\}}\pi_k L(x_r=k)}, \] where \(L(x_r)\) is the joint probability of states at the tips and the state at the root \(x_r\) and \(\pi_i\) are the estimated base frequencies of state \(i\). Both methods agree if all states (base frequencies) have equal probabilities.

anc.ml <- anc_pml(fit)

The differences of the two approaches for a specific site (17) are represented in the following figures.

plotAnc(anc.ml, 17)
title("ML")
Fig 4. Ancestral reconstruction the using the maximum likelihood.
Fig 4. Ancestral reconstruction the using the maximum likelihood.
#plotAnc(anc.bayes, 17)
#title("Bayes")

Fitting for discrete comparative data

Often have already a phylogeny and only want estimate the ancestral reconstruction for this tree. This is a common problem in phylogentic comparative methods and we can use the function ace in the ape (Paradis and Schliep 2019), fitDiscrete in the geiger (Pennell et al. 2014) or fitMK in the phytools (Revell 2012) package. Here we want to show how to fit these models using optim.pml.

First we load a tree and create some data.

data("bird.orders")
x <- c(rep(0, 5), rep(1, 18))
x[c(20,22,23)] <- 2
x <- factor(x)
names(x) <- bird.orders$tip.label
dat <- phyDat(x, "USER", levels=c(0,1,2))

We than set up the pml object and optimize the model. Instead of optimizing the edge length we only optimize the rate.

fit <- pml(bird.orders, dat)
fit_ER <- optim.pml(fit, optEdge = FALSE, optRate=TRUE, 
                    control = pml.control(trace=0))
fit_ER
## model: Mk 
## loglikelihood: -16.47 
## unconstrained loglikelihood: 0 
## 
## Rate matrix:
##   0 1 2
## 0 0 1 1
## 1 1 0 1
## 2 1 1 0
## 
## Base frequencies:  
##      0      1      2 
## 0.3333 0.3333 0.3333 
## 
## Rate: 0.007846

We can also fit the symmetric (model=“SYM”) or ordered metristic model (model=“ORDERED”).

fit_SYM <- optim.pml(fit, optEdge = FALSE, optRate=TRUE, model="SYM", 
                    control = pml.control(trace=0))
fit_SYM
## model: SYM 
## loglikelihood: -15.31 
## unconstrained loglikelihood: 0 
## 
## Rate matrix:
##           0      1         2
## 0 0.000e+00 0.2747 1.604e-06
## 1 2.747e-01 0.0000 1.000e+00
## 2 1.604e-06 1.0000 0.000e+00
## 
## Base frequencies:  
##      0      1      2 
## 0.3333 0.3333 0.3333 
## 
## Rate: 0.00678

We can compare the estimate with the one from ace from ape.

fit_ace <- ace(x, bird.orders, model="SYM", type = "d")
## Warning in sqrt(diag(solve(h))): NaNs wurden erzeugt

The log-likelihood values differ slightly as in phangorn the values get multiplied by the state frequencies. Thus if we add log(1/3) as we have three states to ace estimate the two estimates are almost identical.

fit_SYM$logLik
## [1] -15.31
fit_ace$loglik+log(1/3)
## [1] -15.31
all.equal(fit_SYM$logLik, fit_ace$loglik+log(1/3))
## [1] "Mean relative difference: 1.229e-07"
anc_SYM <- anc_pml(fit_SYM)
plotAnc(anc_SYM)

More complicated models can be applied using defining the rate matrix as shown in the vignette Markov models and transition rate matrices. The “ARD” model is currently not available as phangorn only fits reversible models.

Session info

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux 12 (bookworm)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
## 
## locale:
##  [1] LC_CTYPE=de_AT.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_AT.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=de_AT.UTF-8   
##  [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Vienna
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phangorn_2.12.1 ape_5.8-0.1    
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-0      gtable_0.3.5      jsonlite_1.8.8    dplyr_1.1.4      
##  [5] compiler_4.4.1    highr_0.11        tidyselect_1.2.1  Rcpp_1.0.13      
##  [9] parallel_4.4.1    jquerylib_0.1.4   scales_1.3.0      yaml_2.3.10      
## [13] fastmap_1.2.0     lattice_0.22-6    ggplot2_3.5.1     R6_2.5.1         
## [17] labeling_0.4.3    generics_0.1.3    igraph_2.0.3      knitr_1.48       
## [21] tibble_3.2.1      munsell_0.5.1     bslib_0.8.0       pillar_1.9.0     
## [25] rlang_1.1.4       utf8_1.2.4        fastmatch_1.1-4   cachem_1.1.0     
## [29] xfun_0.47         quadprog_1.5-8    sass_0.4.9        cli_3.6.3        
## [33] withr_3.0.1       magrittr_2.0.3    digest_0.6.37     grid_4.4.1       
## [37] rstudioapi_0.16.0 lifecycle_1.0.4   nlme_3.1-165      vctrs_0.6.5      
## [41] evaluate_0.24.0   glue_1.7.0        farver_2.1.2      codetools_0.2-20 
## [45] ggseqlogo_0.2     fansi_1.0.6       colorspace_2.1-1  rmarkdown_2.28   
## [49] tools_4.4.1       pkgconfig_2.0.3   htmltools_0.5.8.1

References

Felsenstein, Joseph. 2004. Inferring Phylogenies. Sunderland: Sinauer Associates.
Koshi, Jeffrey M., and Richard A. Goldstein. 1996. Probabilistic reconstruction of ancestral protein sequences.” Journal of Molecular Evolution 42 (2): 313–20. https://doi.org/10.1007/BF02198858.
Paradis, Emmanuel, and Klaus Schliep. 2019. “Ape 5.0: An Environment for Modern Phylogenetics and Evolutionary Analyses in r.” Bioinformatics 35 (3): 526–28. https://doi.org/10.1093/bioinformatics/bty633.
Pennell, M. W., J. M. Eastman, G. J. Slater, J. W. Brown, J. C. Uyeda, R. G. Fitzjohn, M. E. Alfaro, and L. J. Harmon. 2014. “Geiger V2.0: An Expanded Suite of Methods for Fitting Macroevolutionary Models to Phylogenetic Trees.” Bioinformatics 30: 2216–18. https://doi.org/10.1093/bioinformatics/btu181.
Pupko, Tal, Itsik Pe, Ron Shamir, and Dan Graur. 2000. A Fast Algorithm for Joint Reconstruction of Ancestral Amino Acid Sequences.” Molecular Biology and Evolution 17 (6): 890–96. https://doi.org/10.1093/oxfordjournals.molbev.a026369.
Revell, Liam J. 2012. “Phytools: An r Package for Phylogenetic Comparative Biology (and Other Things).” Methods in Ecology and Evolution 3: 217–23.
Schliep, Klaus Peter. 2011. “Phangorn: Phylogenetic Analysis in R.” Bioinformatics 27 (4): 592–93. https://doi.org/10.1093/bioinformatics/btq706.
Swofford, D. L., and W. P. Maddison. 1987. “Reconstructing Ancestral Character States Under Wagner Parsimony.” Math. Biosci. 87: 199–229.
Yang, Ziheng. 2006. Computational Molecular Evolution. Oxford: Oxford University Press.