1 Introduction

Delay distributions play a crucial role in various fields, including epidemiology, reliability analysis, and survival analysis. These distributions describe the time between two events of interest, such as the incubation period of a disease or the time to failure of a component. Accurately estimating and calculating these distributions is essential for understanding the underlying processes and making informed decisions.[1] However, estimating these distributions can be challenging due to various factors, including censoring and truncation of the observed data.[2]

The estimation of delay distributions often faces the following challenges:

The primarycensored package aims to address these challenges by providing tools to manipulate primary censored delay distributions. By accounting for the censoring and truncation present in the data, the package enables more accurate estimation and use of the underlying true distribution.

In this vignette, we will provide a quick introduction to the main functions and concepts in the primarycensored package. We will cover the mathematical formulation of the problem, demonstrate the usage of the key functions, and provide signposting on how to learn more.

2 Packages in this getting started vignette.

As well as the primarycensored package this vignette also uses ggplot2.

# Load packages
library(primarycensored)
library(ggplot2)
# Set seed for reproducibility
set.seed(123)

3 Generating random samples with rprimarycensored()

This function generates random samples from a primary event censored distribution. It adjusts the distribution by accounting for the primary event distribution, potential truncation at a maximum delay (\(D\)), and secondary event censoring.

The mathematical formulation for generating random samples from a primary event censored distribution is as follows:

  1. Generate primary event times (\(p\)) from the specified primary event distribution (\(f_p\)) within the primary event window (\(pwindow\)):

\[p \sim f_p(x), \quad 0 \leq x \leq pwindow\]

  1. Generate delays (\(d\)) from the specified delay distribution (\(f_d\)) with parameters \(\theta\):

\[d \sim f_d(x; \theta)\]

  1. Calculate the total delays (\(t\)) by adding the primary event times and the delays:

\[t = p + d\]

  1. Apply truncation to ensure that the delays are within the specified range \([0, D]\):

\[t_{truncated} = \{t \mid 0 \leq t < D\}\]

  1. Round the truncated delays to the nearest secondary event window (\(swindow\)):

\[t_{valid} = \lfloor \frac{t_{truncated}}{swindow} \rfloor \times swindow\]

Here’s an example of how to use rprimarycensored() to sample from a log-normal distribution with and without secondary interval censoring. For simplicity we will use a daily censoring window for both events.

n <- 1e4
meanlog <- 1.5
sdlog <- 0.75
obs_time <- 10
pwindow <- 1

# Random samples without secondary censoring
samples <- rprimarycensored(
  n,
  rdist = rlnorm, rprimary = runif,
  pwindow = pwindow, swindow = 0, D = obs_time,
  meanlog = meanlog, sdlog = sdlog
)
# Random samples with secondary censoring
samples_sc <- rprimarycensored(
  n,
  rdist = rlnorm, rprimary = runif,
  pwindow = pwindow, swindow = 1, D = obs_time,
  meanlog = meanlog, sdlog = sdlog
)
# Calculate the PMF for the samples with secondary censoring
samples_sc_pmf <- data.frame(
  pmf =
    table(samples_sc) /
      sum(table(samples_sc))
)
# Compare the samples with and without secondary censoring to the true
# distribution
ggplot() +
  geom_density(
    data = data.frame(samples = samples),
    aes(x = samples),
    fill = "#4292C6",
    col = "#252525",
    alpha = 0.5
  ) +
  geom_col(
    data = samples_sc_pmf,
    aes(
      x = as.numeric(as.character(pmf.samples_sc)),
      y = pmf.Freq
    ),
    fill = "#20b986",
    col = "#252525",
    alpha = 0.5,
    width = 0.9,
    just = 0
  ) +
  geom_function(
    fun = dlnorm,
    args = list(meanlog = meanlog, sdlog = sdlog),
    color = "#252525",
    linewidth = 1
  ) +
  labs(
    title = "Comparison of Samples from Log-Normal Distribution",
    x = "Samples",
    y = "Density",
    caption = paste0(
      "Blue density: Samples without secondary censoring\n",
      "Green bars: Samples with secondary censoring\n",
      "Black line: True log-normal distribution without truncation"
    )
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

Neither distribution matches the true distribution due to the truncation at D which biases both observed distributions towards shorter delays, as well as the primary and secondary event censoring.

4 Compute the primary event censored cumulative distribution function (CDF) for delays with pprimarycensored()

This function computes the primary event censored cumulative distribution function (CDF) for a given set of quantiles. It adjusts the CDF of delay distribution by accounting for the primary event distribution and potential truncation at a maximum delay (\(D\)).

The primary event censored CDF, (\(F_{\text{cens}}(q)\)), is given by:

\[ F_{\text{cens}}(q) = \int_{0}^{pwindow} F(q - p) \cdot f_{\text{primary}}(p) \, dp \]

where \(F\) is the CDF of the delay distribution, \(f_{\text{primary}}\) is the PDF of the primary event times, and \(pwindow\) is the primary event window.

If the maximum delay \(D\) is finite, the CDF is normalized by dividing by \(F_{\text{cens}}(D)\):

\[ F_{\text{cens,norm}}(q) = \frac{F_{\text{cens}}(q)}{F_{\text{cens}}(D)} \]

where \(F_{\text{cens,norm}}(q)\) is the normalized CDF.

Let’s compare the empirical CDF of our samples without secondary censoring to the theoretical CDF computed using pprimarycensored():

empirical_cdf <- ecdf(samples)
theoretical_cdf <- pprimarycensored(
  seq(0, obs_time, length.out = 100),
  pdist = plnorm, dprimary = dunif,
  pwindow = pwindow, D = obs_time,
  meanlog = meanlog, sdlog = sdlog
)

# Create a data frame for plotting
cdf_data <- data.frame(
  x = seq(0, obs_time, length.out = 100),
  Theoretical = theoretical_cdf,
  Empirical = empirical_cdf(seq(0, obs_time, length.out = 100))
)

# Plot the empirical and theoretical CDFs
ggplot(cdf_data, aes(x = x)) +
  geom_step(aes(y = Theoretical), color = "black", linewidth = 1) +
  geom_step(aes(y = Empirical), color = "#4292C6", linewidth = 1) +
  labs(
    title = "Comparison of Empirical and Theoretical CDFs",
    x = "Delay",
    y = "Cumulative Probability",
    caption = paste0(
      "Blue line: Empirical CDF from samples without secondary censoring\n",
      "Black line: Theoretical CDF computed using pprimarycensored()"
    )
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

The theoretical CDF matches the empirical CDF very well, confirming that pprimarycensored() is working as expected.

5 Compute the primary event censored probability mass function (PMF) with dprimarycensored()

This function computes the primary event censored probability mass function (PMF) for a given set of quantiles using the CDF. On top of accounting for the primary event distribution and truncation, it also accounts for secondary event censoring.

The primary event censored PMF, (\(f_{\text{cens}}(d)\)), is given by:

\[ f_{\text{cens}}(d) = F_{\text{cens}}(d + \text{swindow}) - F_{\text{cens}}(d) \]

where (\(F_{\text{cens}}\)) is the potentially right truncated primary event censored CDF and (\(\text{swindow}\)) is the secondary event window.

Let’s compare the empirical PMF of our samples with secondary censoring to the theoretical PMF computed using dprimarycensored():

# Calculate the theoretical PMF using dprimarycensored
theoretical_pmf <- dprimarycensored(
  0:(obs_time - 1),
  pdist = plnorm, dprimary = dunif,
  pwindow = pwindow, swindow = 1, D = obs_time,
  meanlog = meanlog, sdlog = sdlog
)

pmf_df <- data.frame(
  x = 0:obs_time,
  pmf = c(theoretical_pmf, 0)
)

# Plot the empirical and theoretical PMFs
ggplot() +
  geom_col(
    data = samples_sc_pmf,
    aes(
      x = as.numeric(as.character(pmf.samples_sc)),
      y = pmf.Freq
    ),
    fill = "#20b986",
    col = "#252525",
    alpha = 0.5,
    width = 0.9,
    just = 0
  ) +
  geom_step(
    data = pmf_df,
    aes(x = x, y = pmf),
    color = "black",
    linewidth = 1
  ) +
  labs(
    title = "Comparison of Samples from Log-Normal Distribution",
    x = "Samples",
    y = "Density",
    caption = paste0(
      "Green bars: Empirical PMF from samples with secondary censoring\n",
      "Black line: Theoretical PMF computed using dprimarycensored()"
    )
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

Again the theoretical PMF matches the empirical PMF very well, confirming that dprimarycensored() is also working as expected.

6 Other key functionality

In addition to these main functions, the package also includes:

7 Learning more

1. Charniga, K., Park, S. W., Akhmetzhanov, A. R., Cori, A., Dushoff, J., Funk, S., Gostic, K. M., Linton, N. M., Lison, A., Overton, C. E., Pulliam, J. R. C., Ward, T., Cauchemez, S., & Abbott, S. (2024). Best practices for estimating and reporting epidemiological delay distributions of infectious diseases using public health surveillance and healthcare data. arXiv [stat.ME]. http://arxiv.org/abs/2405.08841

2. Park, S. W., Akhmetzhanov, A. R., Charniga, K., Cori, A., Davies, N. G., Dushoff, J., Funk, S., Gostic, K., Grenfell, B., Linton, N. M., Lipsitch, M., Lison, A., Overton, C. E., Ward, T., & Abbott, S. (2024). Estimating epidemiological delay distributions for infectious diseases. bioRxiv. https://doi.org/10.1101/2024.01.12.24301247