Intro to validating growth rate estimates via simulation

In this vignette, we’ll walk through how we simulated a birth-death branching process to validate our growth rate methods. The simulation procedure is a direct implementation of results by Amaury Lambert in a recent paper. We’ll show that these results allow for extremely fast generation of a sampled tree of size \(n\) from a birth-death process run for time \(T\) with fixed birth and death rates. The key point is that we don’t have to simulate this with computationally intensive algorithms (Gillespie, etc.); the math provides a shortcut which lets us generate a tree with 100 tips on a single core in under a second, or a tree with 1000 tips in about 8 seconds. In R!

The applications of this fast generation of sampled trees are countless. For us, these results allowed us to simulate thousands of trees to check our methods for growth rate estimation from phylogenetic tree reconstruction, which is detailed in our paper. We provide an intro to simulation in this vignette, with a full reproduction of our simulation results in the “Reproduce simulation results” article on our package website.

Setup

First, we’ll have to load the packages we want to use. We’ll be plotting the trees using the ape package function ape::plot.phylo() along with some other ape functions. If you have cloneRate installed, you already have the ape package. We’ll also be using ggplot2 to make our plots, which can be installed from CRAN.

# Load and attach our package cloneRate
library(cloneRate)

# Load and attach ape, which will be installed if you've installed cloneRate
library(ape)

# Install ggplot2 from CRAN if necessary, then load and attach it with library()
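# install.packages("ggplot2")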
library(ggplot2)

We’ll also set the color palette which we’ll use for most of the plotting. The palette is available here.

colorPal <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

1 Simulating trees

In this section, we’ll simulate ultrametric and mutation-based trees. Ultrametric just means that the tips are all the same distance from the root. In our specific case, the ultrametric trees will have edge lengths in units of time, so ultrametric means that the tips are sampled at the same time. Mutation-based trees are those with edge lengths in units of mutations. Mutation-based trees will typically not be ultrametric, because there will be some fluctuations in the number of mutations acquired.

We’ll be generating trees of class phylo, which is a fairly straightforward encoding of phylogenetic trees in R. The ape documentation describing the class phylo can be found here.

1.1 Ultrametric trees

Let’s start with ultrametric trees. First, let’s simulate a single tree and have a look at the class phylo. We’ll use the function simUltra() from our package to do this, which simulates an ultrametric tree. We’ll set a birth rate a, a death rate b, an age for the tree cloneAge, and the number of sampled tips n. Note that we use cloneAge instead of \(T\) for the time because T in R means TRUE.

# Generate the tree
tree <- simUltra(a = 1, b = 0, cloneAge = 20, n = 100, addStem = TRUE)

# We see that the tree is of class phylo
class(tree)
#> [1] "phylo"

# Preview the tree
print(tree)
#> 
#> Phylogenetic tree with 100 tips and 100 internal nodes.
#> 
#> Tip labels:
#>   t8, t82, t63, t21, t19, t72, ...
#> 
#> Rooted; includes branch lengths.

Again, the best place to get more info about the class phylo is the ape documentation linked above. Our tree simulation functions also attach a metadata data.frame to each tree. Let’s take a look at the metadata included in our tree.

print(tree$metadata)
#>   r a b cloneAge   n runtime_seconds addStem
#> 1 1 1 0       20 100           1.403    TRUE

We see that the columns are:

  - r: the net growth rate (a - b) used to simulate the tree
  - a: the birth rate
  - b: the death rate
  - cloneAge: the age of the clone, i.e. the time the birth-death process was run
  - n: the number of sampled tips
  - runtime_seconds: the time (in seconds) it took to simulate the tree
  - addStem: whether a stem edge was added from time 0 to the first split

Now let’s plot the tree.

plot.phylo(tree, direction = "downwards", show.tip.label = FALSE)
axisPhylo(side = 2, backward = FALSE, las = 1)
title(main = "Simulated tree", ylab = "Time")

In the tree we plotted above, we see that the “stem” is the edge extending from time 0 to the first split. We call these splits coalescence events and they’ll be very important for our methods of growth rate estimation.

Also, we see that the tree is ultrametric, meaning that all of the tips are sampled at the same time, in this case 20 units of time after the birth-death process began. As a default, we’ll talk about units of time in years, but that is arbitrary. Just make sure the units of the growth rate (per year) are the same as the time units (years).
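
As a quick aside (not part of the cloneRate workflow), we can confirm the ultrametric property and pull out the coalescence times directly with ape:

# Confirm that all tips are equidistant from the root
ape::is.ultrametric(tree)

# Coalescence (branching) times, i.e. distances from internal nodes to the tips
head(sort(ape::branching.times(tree), decreasing = TRUE))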

Let’s apply our growth rate estimates to this tree, seeing if our estimates are close to the value of r that we used to generate the tree. We have two functions for estimating the growth rate of an ultrametric tree, and we’ll compare their performance later on:

  1. maxLikelihood(), which finds the maximum likelihood estimate of the growth rate
  2. internalLengths(), which estimates the growth rate from the sum of the internal edge lengths

Both methods are detailed in our paper.

maxLikelihood(tree)$estimate
#> [1] 0.9173629

internalLengths(tree)$estimate
#> [1] 0.8549125

We know that our actual growth rate is 1, so we can evaluate how our estimates are doing. One estimate doesn’t really tell us much, though. Let’s try 100. To do this, we just set the nTrees parameter equal to 100 in our simUltra() call. If you have the parallel package installed, you can parallelize this process by setting nCores, but we’ll leave it at 1 for now (see the sketch after the code below). By printing the elapsed time, we see that with only one core it takes a couple of minutes.

ptm <- proc.time()
tree.list <- simUltra(a = 1, b = 0, cloneAge = 20, n = 100, nTrees = 100, nCores = 1, addStem = TRUE)
print(proc.time()[["elapsed"]] - ptm[["elapsed"]])
#> [1] 149.59
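
If the parallel package is installed, the same call can be spread across multiple cores by raising nCores. As a sketch (not run), we might use all but one of the available cores; the actual speedup will depend on your machine:

# Not run: parallelize tree generation across multiple cores
# tree.list <- simUltra(
#   a = 1, b = 0, cloneAge = 20, n = 100, nTrees = 100,
#   nCores = parallel::detectCores() - 1, addStem = TRUE
# )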

Let’s apply our methods to these trees. We can pass either a single phylo object or a list of phylo objects to our growth rate functions. Either way, they return a data.frame with one row per tree containing that tree’s estimate.

resultsMaxLike <- maxLikelihood(tree.list)

resultsLengths <- internalLengths(tree.list)

We gave each method 100 trees to estimate, so let’s plot those estimates. Remember that all the input trees had a growth rate of 1, so we want the estimates to be close to 1. We’ll use ggplot2’s density plot for this:

# Combine for ggplot formatting
resultsCombined <- rbind(resultsLengths, resultsMaxLike)

# Plot, adding a vertical line at r=1 because that's the true growth rate
ggplot(resultsCombined) +
  geom_density(aes(x = estimate, color = method), linewidth = 1.5) +
  geom_vline(xintercept = exampleUltraTrees[[1]]$metadata$r) +
  theme_bw() +
  theme(
    axis.text.y = element_blank(), axis.ticks.y = element_blank(),
    legend.title = element_blank()
  ) +
  xlab("Net growth rate estimate (r)") +
  ylab("Density") +
  scale_color_manual(labels = c("Internal lengths", "Max. likelihood"), values = c("black", "#009E73"))

Not bad, but we’ll want to test this more formally over a large range of parameter values. More on that later in this vignette. For now, let’s shift our attention to mutation trees.

1.2 Mutation trees

There are two ways to generate mutation trees:

  1. Convert existing ultrametric trees to mutation-based trees with ultra2mut()
  2. Simulate mutation-based trees directly with simMut()

Either way, the process of generating mutation trees consists of:

  1. Generating an ultrametric tree with edge lengths in units of time
  2. Drawing the number of mutations on each edge from a Poisson distribution with mean equal to the mutation rate nu multiplied by the edge length (see the sketch below)

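Here’s a minimal sketch of that second step, assuming the per-edge mutation counts are Poisson with mean nu times the edge length. The toMutations() helper below is hypothetical and for illustration only; the package’s ultra2mut() is the real implementation.

# Hypothetical helper (illustration only, assuming Poisson mutation counts):
# replace each edge length in time with a draw of the number of mutations
toMutations <- function(ultraTree, nu) {
  mutTree <- ultraTree
  mutTree$edge.length <- stats::rpois(
    n = length(ultraTree$edge.length),
    lambda = nu * ultraTree$edge.length
  )
  mutTree
}
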
Let’s take the ultrametric trees we just generated and convert them to mutation-based trees. For this, we provide the function ultra2mut():

# Set mutation rate equal to 10 muts/year for all trees
mutTree.list <- ultra2mut(tree.list, nu = 10)

Alternatively, we can generate a new set of mutation trees using the function simMut(). We won’t run this because it will take another few minutes, and we already have the mutation-based trees generated above.

# Set params for ultra tree + a mutation rate
mutTree.list2 <- simMut(a = 1, b = 0, cloneAge = 20, n = 100, nTrees = 100, nu = 100, nCores = 1)

If we want to estimate the growth rate from a mutation tree, we use the sharedMuts() function, which works very similarly to the internalLengths() function. However, instead of summing the internal edge lengths in units of time, the sharedMuts() function counts the internal/shared mutations and uses the mutation rate to convert back to time. So, if \(M_i\) is the number of internal or shared mutations, \(\nu\) (nu) is the mutation rate, and \(n\) is the number of samples, the growth rate estimate is \(r = n\nu / M_i\).
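
To make the formula concrete, here’s a by-hand version of the estimate for the first of the mutation trees we just generated, treating the shared mutations as the total length of edges that do not end in a tip. Depending on how the stem edge is handled, this may differ slightly from what sharedMuts() reports.

# By-hand shared mutations estimate for the first converted tree
mutTree <- mutTree.list[[1]]

# Edges ending in a tip are external; all others are internal (shared)
isTipEdge <- mutTree$edge[, 2] <= ape::Ntip(mutTree)
M_i <- sum(mutTree$edge.length[!isTipEdge])

# r = n * nu / M_i, with nu = 10 as set above
ape::Ntip(mutTree) * 10 / M_i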

Let’s apply the sharedMuts() function to the trees that we converted to mutation-based. Remember that these mutation-based trees are generated from the ultrametric trees we applied our functions maxLikelihood() and internalLengths() to.

resultsShared <- sharedMuts(mutTree.list, nu = 10)

Let’s plot the estimates all together. First, we’ll combine the data.frames produced. The sharedMuts() function outputs an additional column corresponding to the mutation rate nu. Because of this, using rbind on the full data.frames will give an error. We only want to look at the estimates for now, so we’ll just combine the necessary columns from each data.frame.

# Combine the columns with the estimates
colsUse <- c("lowerBound", "estimate", "upperBound", "method")
resultsAll <- rbind(resultsShared[, colsUse], resultsCombined[, colsUse])

# Plot, adding a vertical line at r=1 because that's the true growth rate
ggplot(resultsAll) +
  geom_density(aes(x = estimate, color = method), linewidth = 1.5) +
  geom_vline(xintercept = exampleUltraTrees[[1]]$metadata$r) +
  theme_bw() +
  theme(
    axis.text.y = element_blank(), axis.ticks.y = element_blank(),
    legend.title = element_blank()
  ) +
  xlab("Net growth rate estimate (r)") +
  ylab("Density") +
  scale_color_manual(labels = c("Internal lengths", "Max. likelihood", "Shared Muts."), values = c(colorPal[1], colorPal[4], colorPal[6]))

It looks like we’re on the right track, though from this plot alone it’s hard to tell which method is best. Maximum likelihood seems a bit biased on the high side, but its estimates fall in a narrower range. Let’s do a more quantitative analysis now that we have the hang of it.

2 Quantitative comparisons

In this section, we’ll use our ability to generate trees in order to test our methods. We’ll focus mainly on the ultrametric trees, noting that the mutation-based approach based on shared mutations is analogous to the internal lengths method. We’ll see how well our methods perform, where they perform well, and where they fail.

2.1 Max likelihood vs. internal lengths vs. shared mutations

We want to know which estimate is the best. We’ll start by making a quantitative comparison of the 100 estimates we already have from our simulations in the previous section. 100 trees should be enough to get an idea of which estimate is most accurate, but we can always simulate more. We won’t do that here, though, because it’ll take a bit longer and, well, we have to stop somewhere.

2.1.1 Root Mean Square Error (RMSE)

# Calculate the RMSE
groundTruth <- 1
rmse <- unlist(lapply(
  split(resultsAll, resultsAll$method),
  function(x) {
    sqrt(sum((x$estimate - groundTruth)^2) / length(x$estimate))
  }
))

print(rmse)
#>    lengths    maxLike sharedMuts 
#>  0.1068260  0.1013800  0.1139182

2.1.2 Mean

# Calculate the mean
simMean <- unlist(lapply(
  split(resultsAll, resultsAll$method),
  function(x) {
    mean(x$estimate)
  }
))

print(simMean)
#>    lengths    maxLike sharedMuts 
#>   1.014857   1.037438   1.015132

2.1.3 Standard deviation

# Calculate the standard deviation of our estimates
simSD <- unlist(lapply(
  split(resultsAll, resultsAll$method),
  function(x) {
    sd(x$estimate)
  }
))

print(simSD)
#>    lengths    maxLike sharedMuts 
#>  0.1063208  0.0946886  0.1134776

What we saw in the density plot is quantified in the mean, standard deviation, and RMSE. Maximum likelihood usually performs best by Root Mean Square Error (RMSE), even though its mean is a bit higher (usually around 1.04). With only 100 trees there will be larger fluctuations, so in a given run maximum likelihood might end up with a higher RMSE than the other methods. The shared mutations method typically performs a bit worse than its ultrametric cousin, internalLengths(). The randomness of the Poisson mutation accumulation likely leads to a wider spread of estimates from sharedMuts(), which explains the difference between the two methods.

2.1.4 Coverage probability

Up to now, we’ve only looked at the estimate itself. We note that our methods also provide confidence intervals. The default, with alpha = 0.05, is to provide 95% confidence intervals. While you can change that if you’d like, 95% seems to be pretty standard, so we’ll use that. If our 95% confidence interval is accurate, we’d expect the ground truth growth rate to fall within our confidence interval 95% of the time. Let’s check:

# Calculate the coverage probability of our estimates
groundTruth <- 1
simCoverage <- unlist(lapply(
  split(resultsAll, resultsAll$method),
  function(x) {
    # Set insideInterval to TRUE if inside interval and FALSE if outside
    insideInterval <- (x$lowerBound < groundTruth & x$upperBound > groundTruth)
    # Count TRUE as 1 and FALSE as 0. See the fraction inside interval
    sum(insideInterval) / length(insideInterval)
  }
))

print(simCoverage)
#>    lengths    maxLike sharedMuts 
#>       0.94       0.93       0.93

We’re right around 95%, which is great! You might expect that maxLikelihood(), which typically has the lowest RMSE, would also have the best coverage probability. In fact, the confidence interval from maxLikelihood() is narrower, so it can miss the ground truth slightly more often. The opposite is true for sharedMuts(), which has the widest confidence interval to account for the randomness of mutation accumulation. With only 100 trees, the coverage probability will only be a rough estimate.
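
To check the claim about interval widths, we can compare the average confidence interval width for each method (output not shown, since the exact values will fluctuate from run to run):

# Mean confidence interval width for each method
unlist(lapply(
  split(resultsAll, resultsAll$method),
  function(x) {
    mean(x$upperBound - x$lowerBound)
  }
))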

3 Varying parameters

So far, we’ve looked at the same parameters. Now, we’ll see what happens when we change some important parameters about the birth-death process.

3.1 Varying n

One of the most interesting results to explore further is how the accuracy of the estimates depends on the number of samples. For this, we’ll look at a few different values of n. To keep things from running too long, we’ll re-use our trees from n = 100 and add 100 trees at each of two new n values: a small value, n = 10, and an intermediate value, n = 30. Let’s show how to run simUltra() with a vector input for n by first generating the vector, which we’ll call n_vec.

n_vec <- c(rep(10, 100), rep(30, 100))

For now, we’ll keep the net growth rate a - b equal to 1, but this time we’ll let the individual birth and death rates vary.

a_vec <- stats::runif(n = 200, min = 1, max = 4)
b_vec <- a_vec - 1

In total, our n_vec is of length 200, so we’ll want to make sure a_vec and b_vec are also of length 200, and we’ll have to set nTrees = 200. When generating multiple trees, each parameter can be either a single value or a vector of length nTrees; this applies to both simUltra() and simMut(). Let’s generate some more trees. Note: this will also take a minute or two.

n10_n30_trees <- simUltra(
  a = a_vec, b = b_vec, cloneAge = 20, n = n_vec,
  nTrees = length(n_vec)
)

Now, let’s apply our estimates and combine the data.frames:

# Apply our estimates for ultrametric trees
n10_n30_maxLike <- maxLikelihood(n10_n30_trees)
n10_n30_lengths <- internalLengths(n10_n30_trees)

# Combine the estimates
n10_n30_both <- rbind(n10_n30_maxLike, n10_n30_lengths)

Finally, we can add our resultsCombined data.frame from before, which has the results of applying maxLikelihood() and internalLengths() to the set of 100 ultrametric trees at n = 100.

# Merge the results data.frames
results_vary_n <- rbind(n10_n30_both, resultsCombined)

Let’s make a plot for each of these, showing the performance. We’ll use ggplot2::facet_wrap() in order to show the same plot at various n values. Although we have a different number of trees from the n = 100 case, the density plot will normalize this.

ggplot(results_vary_n) +
  geom_density(aes(x = estimate, color = method), linewidth = 1.5) +
  geom_vline(xintercept = exampleUltraTrees[[1]]$metadata$r) +
  theme_bw() +
  theme(
    axis.text.y = element_blank(), axis.ticks.y = element_blank(),
    legend.title = element_blank()
  ) +
  xlim(0, 3) +
  xlab("Net growth rate estimate (r)") +
  ylab("Density") +
  scale_color_manual(
    labels = c("Internal lengths", "Max. likelihood"),
    values = c("black", "#009E73")
  ) +
  facet_wrap(~ factor(paste0("n = ", n), levels = c("n = 10", "n = 30", "n = 100")),
    ncol = 1, strip.position = "bottom", scales = "free", dir = "v"
  )
#> Warning: Removed 3 rows containing non-finite values (`stat_density()`).

As you can see, both estimates are highly dependent on the number of samples. If you’d like to see the same density values on the y-axis for all three n values, set scales = "fixed" in the facet_wrap() function.

If you’re interested, you can apply the same quantitative measures to this data as we did before in Quantitative comparisons. We won’t repeat that here, but we show this in Figure 2 of our paper. This information is also available in the Supplementary tables, which are also available as .csv files.

3.2 Varying r (and/or cloneAge)

As we noted before, we can simulate many trees at once using simUltra() and simMut(). Here, we’ll use simUltra() to generate 100 trees at various growth rates. We do a similar analysis in Figures 3 and 4 of our paper, noting that our methods struggle with small growth rates. Let’s see if we come to the same conclusion here.

3.2.1 Varying r or cloneAge?

First, we have to address something that might otherwise lead to confusion: varying r is effectively the same as varying cloneAge. When you think about it, it makes sense. If we simulate a population with a net growth rate of 1 per year for 20 years, it should look the same as one with a growth rate of 0.5 per year for 40 years. We’ll show this, but first consider that the units are arbitrary, so long as they’re consistent. A growth rate of 1/12 with a clone age of 20*12 is exactly the same as a growth rate of 1 with a clone age of 20 if the units change from years to months.
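
As a small illustration of the units point (a sketch, not run as part of this vignette), we can simulate the same process in per-year and per-month units. The two trees are statistically equivalent; the only numerical difference is that the per-month tree’s edge lengths are 12 times longer, because they’re measured in months.

# Not run: per-year rates for 30 years...
treeYears <- simUltra(a = 2, b = 1, cloneAge = 30, n = 50)

# ...vs. the same process in per-month units for 360 months
treeMonths <- simUltra(a = 2 / 12, b = 1 / 12, cloneAge = 30 * 12, n = 50)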

This all might be a bit confusing, so let’s plot it. First, we’ll simulate a tree with a growth rate of 2 for 30 years and compare it to a tree with a growth rate of 0.5 over the same 30 years. These trees should look different. Finally, we’ll add a tree that has a growth rate of 0.5 but runs for 120 years. If what we said above is true, the last tree should look like the first tree, because a rate of 0.5 for 120 years is equivalent to a rate of 2 for 30 years. Let’s find out:

# First tree, r = a - b = 2
tree1 <- simUltra(a = 2.5, b = .5, cloneAge = 30, n = 50)

# Second tree, with r = a - b = 0.5
tree2 <- simUltra(a = 1, b = .5, cloneAge = 30, n = 50)

# Third tree, with r = 0.5 but cloneAge = 120
tree3 <- simUltra(a = 1, b = .5, cloneAge = 120, n = 50)

Now, let’s plot using par() to show all three trees in one figure.

oldpar <- graphics::par(mfrow = c(1, 3))
ape::plot.phylo(tree1, direction = "downwards", show.tip.label = F, main = "r = 2, cloneAge = 30")
ape::plot.phylo(tree2, direction = "downwards", show.tip.label = F, main = "r = 0.5, cloneAge = 30")
ape::plot.phylo(tree3, direction = "downwards", show.tip.label = F, main = "r = 0.5, cloneAge = 120")

# reset par settings
graphics::par(oldpar)

While the trees on the left and the right aren’t identical due to the stochastic nature of the birth-death process, they are similar, and certainly different from the tree in the middle.

3.2.2 Performance across r values

Now we can arbitrarily choose whether to vary r or cloneAge. For consistency, we’ll vary r, as we do in Figures 3 and 4 of our paper. Here, we have two options:

  1. Simulate at fixed r values, showing which are good and which are problematic
  2. Simulate at random r values and then try to decipher which estimates are good and which aren’t

Option 1 is essentially what we did in Varying n, but with r instead of n, so repeating it here wouldn’t add much. Option 2 is also more realistic: we can pretend we don’t know the ground truth, show how we decide whether our methods are applicable, and only then compare to the ground truth.

Let’s simulate 1000 trees with n = 50 samples and various growth rates r ranging from 0.1 to 1 per year, run for 40 years. This will take a while, so we won’t actually evaluate this code as part of this vignette. However, we fully reproduce the results from our recent paper in the article “Reproduce simulation results” on our package website.

# Uniform distribution of r used to generate b_vec
r_vec <- stats::runif(n = 1000, min = 0.1, max = 1)
a_vec <- stats::runif(n = 1000, min = 1, max = 3)
b_vec <- a_vec - r_vec

# Input to simUltra()
vary_r_trees <- simUltra(a = a_vec, b = b_vec, cloneAge = 40, n = 50, nTrees = length(a_vec))

If we were to run the code above, we could analyze the results the same way we have so far, with our internalLengths() and maxLikelihood() functions. However, we’d also have to apply our diagnostic, which tells us which clones are good candidates for our methods and which are not. If we run a clone that’s not a good candidate, we’ll get a warning. Let’s generate a tree with a very low growth rate and plot it.

# Let's call it slowClone
slowClone <- simUltra(a = .15, b = .05, n = 50, cloneAge = 40)

# Plot the tree
ape::plot.phylo(slowClone, direction = "downwards", show.tip.label = F)

We see that this tree has long internal branches and short external branches. This indicates that the process has not been “supercritical” enough (i.e. the growth rate and time are too low). Quantitatively, this means the ratio of external to internal edge lengths is low. Let’s try to apply our methods:

# Apply our estimates
maxLikelihood(slowClone)
#> Warning in maxLikelihood(slowClone): External to internal lengths ratio is less than or equal to 3,
#>             which means max. likelihood method may not be applicable. Consider
#>             using birthDeathMCMC() function, which avoids this issue.
#>   lowerBound  estimate upperBound cloneAgeEstimate sumInternalLengths
#> 1  0.1117909 0.1455219   0.179253         46.43419           375.8563
#>   sumExternalLengths extIntRatio  n alpha runtime_s  method
#> 1           490.5726    1.305213 50  0.05     0.009 maxLike
internalLengths(slowClone)
#> Warning in internalLengths(slowClone): External to internal lengths ratio is less than or equal to 3,
#>             which means internal lengths method may not be applicable. Consider
#>             using birthDeathMCMC() function, which avoids this issue.
#>   lowerBound  estimate upperBound cloneAgeEstimate sumInternalLengths
#> 1 0.09615632 0.1330296  0.1699028          47.0795           375.8563
#>   sumExternalLengths extIntRatio  n alpha runtime_s  method
#> 1           490.5726    1.305213 50  0.05     0.005 lengths

Our estimates are off by quite a bit (the ground truth growth rate is 0.1), and we see a warning. The warning tells us that the external to internal edge length ratio is less than or equal to 3, so this clone is likely not a good candidate for our methods. For details on how we came up with 3 as a cutoff, see Figure 4 of our paper or the reproduced analysis in the “Reproduce simulation results” article on our package website. The real advantage of this diagnostic is that we can look at a tree and calculate this ratio to tell us right away whether our methods are applicable, which we do by hand below. Again, if you’re interested in the quantitative side of this, you can apply the metrics from the Quantitative comparisons section, which we do in the longer-form analysis in the article mentioned above.
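
Since the diagnostic is just the total external edge length divided by the total internal edge length, we can also compute it directly from the tree. Here is a rough by-hand version (the handling of the stem edge may differ slightly from the package’s internal calculation):

# Edges ending in a tip are external; all others are internal
isTipEdge <- slowClone$edge[, 2] <= ape::Ntip(slowClone)

# External-to-internal edge length ratio; values <= 3 trigger the warning above
sum(slowClone$edge.length[isTipEdge]) / sum(slowClone$edge.length[!isTipEdge])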

4 References

The method for simulating the trees is a direct result of the work of Amaury Lambert in the following paper. The mathematical methods for estimating growth rates also build in large part on the same work.

And here’s a final link to our paper for more of the details of the methods and data analysis.

There aren’t too many colors in this vignette, but we tried to use colorblind friendly colors, specifically pulling colors from a palette designed by Bang Wong and available here.