With the function pml_bb
from phangorn (Schliep 2011) a lot of steps have become easier
and shorter. If you want to have more control over all of the used
parameters, it is also possible to use the older functions,
e.g. optim_pml
. The data is the same as in the vignette
Estimating phylogenetic trees with phangorn:
library(ape)
library(phangorn)
fdir <- system.file("extdata/trees", package = "phangorn")
primates <- read.phyDat(file.path(fdir, "primates.dna"),
format = "interleaved")
As a starting tree, we calculate a neighbor joining tree:
## model: JC
## loglikelihood: -3075
## unconstrained loglikelihood: -1230
##
## Rate matrix:
## a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
##
## Base frequencies:
## a c g t
## 0.25 0.25 0.25 0.25
The function pml
returns an object of class
pml
. This object contains the data, the tree and many
different parameters of the model like the likelihood. There are many
generic functions for the class pml
available, which allow
the handling of these objects.
## [1] AICc BIC anova glance logLik plot print simSeq update vcov
## see '?methods' for accessing help and source code
The object fit just estimated the likelihood for the tree it got
supplied, but the branch length are not optimized for the Jukes-Cantor
(Jukes and Cantor 1969) model yet, which
can be done with the function optim.pml
.
## optimize edge weights: -3075 --> -3068
## optimize edge weights: -3068 --> -3068
## optimize topology: -3068 --> -3068 NNI moves: 1
## optimize edge weights: -3068 --> -3068
## optimize topology: -3068 --> -3068 NNI moves: 0
## 'log Lik.' -3068 (df=25)
With the default values pml
will estimate a Jukes-Cantor
model. That means equal base frequencies and all transition rates are
equal. The generic function update
allows to change
parameters manually. This is not what we usually want to do. However we
might want to supply a different tree or change the number of rate
categories.
## model: F81+G(4)+I
## loglikelihood: -3037
## unconstrained loglikelihood: -1230
## Proportion of invariant sites: 0.2
## Model of rate heterogeneity: Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 1
## Rate Proportion
## 1 0.0000 0.2
## 2 0.1712 0.2
## 3 0.5959 0.2
## 4 1.2500 0.2
## 5 2.9829 0.2
##
## Rate matrix:
## a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
##
## Base frequencies:
## a c g t
## 0.37481 0.40160 0.03911 0.18448
In the line above we changed the model to a (discrete) rate across site model with 4 rate categories (using the default shape parameter of 1), to 0.2 invariant sites and supply empirical base frequencies.
fitGTR <- optim.pml(fitF81, model="GTR", optInv=TRUE, optGamma=TRUE,
rearrangement = "NNI", control = pml.control(trace = 0))
fitGTR
## model: GTR+G(4)+I
## loglikelihood: -2611
## unconstrained loglikelihood: -1230
## Proportion of invariant sites: 0.006978
## Model of rate heterogeneity: Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 3.081
## Rate Proportion
## 1 0.0000 0.006978
## 2 0.3982 0.248256
## 3 0.7411 0.248256
## 4 1.0905 0.248256
## 5 1.7983 0.248256
##
## Rate matrix:
## a c g t
## a 0.0000 0.953568 64.023928 0.813
## c 0.9536 0.000000 0.005149 24.800
## g 64.0239 0.005149 0.000000 1.000
## t 0.8130 24.799977 1.000000 0.000
##
## Base frequencies:
## a c g t
## 0.37481 0.40160 0.03911 0.18448
We will change the model to the GTR + \(\Gamma(4)\) + I model and then optimize all the parameters.
With the control parameters the thresholds for the fitting process
can be changed. Here we want just to suppress output during the fitting
process. For larger trees the NNI rearrangements often get stuck in a
local maximum. We added two stochastic algorithms to improve topology
search. The first (set rearrangement="stochastic"
) performs
stochastic rearrangements similar as in (Nguyen
et al. 2015), which makes random NNI permutation to the tree,
which than gets optimized to escape local optima. The second option
(rearrangement="ratchet"
) perform the likelihood ratchet
(Vos 2003).
While these algorithms may find better trees they will also take more time.
fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
rearrangement = "stochastic", control = pml.control(trace = 0))
fitGTR
## model: GTR+G(4)+I
## loglikelihood: -2608
## unconstrained loglikelihood: -1230
## Proportion of invariant sites: 0.007411
## Model of rate heterogeneity: Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 2.994
## Rate Proportion
## 1 0.0000 0.007411
## 2 0.3917 0.248147
## 3 0.7366 0.248147
## 4 1.0906 0.248147
## 5 1.8110 0.248147
##
## Rate matrix:
## a c g t
## a 0.0000 0.72394 74.28572 0.6022
## c 0.7239 0.00000 0.00403 26.0053
## g 74.2857 0.00403 0.00000 1.0000
## t 0.6022 26.00533 1.00000 0.0000
##
## Base frequencies:
## a c g t
## 0.37481 0.40160 0.03911 0.18448
We can compare nested models for the JC and GTR + \(\Gamma(4)\) + I model using likelihood ratio statistic
## Likelihood Ratio Test Table
## Log lik. Df Df change Diff log lik. Pr(>|Chi|)
## 1 -3068 25
## 2 -2608 35 10 921 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with the Shimodaira-Hasegawa test
## Trees ln L Diff ln L p-value
## [1,] 1 -2608 0.0 0.5006
## [2,] 2 -3068 460.5 0.0000
or with the AIC
## [1] 6187
## [1] 5286
## [1] 5298
## [1] 5406
At last we may want to apply standard bootstrap to test how well the edges of the tree are supported. This has already been shown in the vignette Estimating phylogenetic trees with phangorn.
Now we can plot the tree with the bootstrap support values on the
edges and also look at consensusNet
to identify potential
conflict.
phangorn has several functions to generate tree topologies,
which may are interesting for simulation studies. allTrees
computes all possible bifurcating tree topologies either rooted or
unrooted for up to 10 taxa. One has to keep in mind that the number of
trees is growing exponentially, use howmanytrees
from
ape as a reminder.
trees <- allTrees(5)
par(mfrow=c(3,5), mar=rep(0,4))
for(i in 1:15)plot(trees[[i]], cex=1, type="u")
nni
returns a list of all trees which are one nearest
neighbor interchange away.
## 4 phylogenetic trees
rNNI
and rSPR
generate trees which are a
defined number of NNI (nearest neighbor interchange) or SPR (subtree
pruning and regrafting) away.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux 12 (bookworm)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
##
## locale:
## [1] LC_CTYPE=de_AT.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_AT.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=de_AT.UTF-8
## [7] LC_PAPER=de_AT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Vienna
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phangorn_2.12.1 ape_5.8-0.1
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-0 gtable_0.3.5 jsonlite_1.8.8 dplyr_1.1.4
## [5] compiler_4.4.1 highr_0.11 tidyselect_1.2.1 Rcpp_1.0.13
## [9] parallel_4.4.1 jquerylib_0.1.4 scales_1.3.0 yaml_2.3.10
## [13] fastmap_1.2.0 lattice_0.22-6 ggplot2_3.5.1 R6_2.5.1
## [17] labeling_0.4.3 generics_0.1.3 igraph_2.0.3 knitr_1.48
## [21] tibble_3.2.1 munsell_0.5.1 bslib_0.8.0 pillar_1.9.0
## [25] rlang_1.1.4 utf8_1.2.4 fastmatch_1.1-4 cachem_1.1.0
## [29] xfun_0.47 quadprog_1.5-8 sass_0.4.9 cli_3.6.3
## [33] withr_3.0.1 magrittr_2.0.3 digest_0.6.37 grid_4.4.1
## [37] rstudioapi_0.16.0 lifecycle_1.0.4 nlme_3.1-165 vctrs_0.6.5
## [41] evaluate_0.24.0 glue_1.7.0 farver_2.1.2 codetools_0.2-20
## [45] ggseqlogo_0.2 fansi_1.0.6 colorspace_2.1-1 rmarkdown_2.28
## [49] tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.8.1