fssg

fssg is a small package focused on allowing users to quickly test multiple parametric survival models in one line of code, by stream-lining the building of many flexsurv models.

This package was originally inspired some of my other research on Broad Genomic Profiling, where tens to hundreds of genomic mutations are tested in one ‘batch’ using a single tissue sample. This type of approach has many names, but can be colloquially referred to as a “shotgun” approach, which is where the package gets it’s name: FlexSurv ShotGun (fssg).

To get started with fssg, we’ll simply load our package, which will also grab the main dependency flexsurv.

library(fssg)
library(flexsurv)
#> Loading required package: survival

The main powerhouse behind this package is the parametric survival modeling package flexsurv. flexsurv provides a straightforward way to build parametric survival models, including many of the most commonly used distributions in survival modeling (weibull, gamma, etc.). This is a very useful tool, but is limited in parametric distribution options. Thankfully, the authors of flexsurv included a way for users to declare and use their own custom defined distribution. This means that as long as we know what we’re doing, we can create models using whatever distribution we’d like!

Custom Distributions

Included in this package are some functions for the creation of custom flexsurv-compatible distributions. As long as we have distribution functions for the distribution (i.e. pnorm and dnorm), and understand the bounds of our parameters, then we can construct a distribution object using fssg_dist.

What do we need in order to build a custom distribution?

  1. A name! Used in the fssg back-end, but technically isn’t that important if you’re not doing batch testing via fssg.
  2. A vector of parameters. This lets flexsurv know what parameters need to be initialized and optimized in the fitting process.
  3. An indicator for the main variable we want to change depending on the model covariates. flexsurv calls this the ‘location’ variable, as it originally focused primarily on location-scale distributions, but non-location parameters may be supplied as well. Please note, that this can only be one parameter in flexsurv, but we can allow other parameters to vary by supplying a list into the “anc” parameter in flexsurvreg. See ?flexsurvreg for more information on this.
  4. Transformation functions that change our parameter domains into the real number line for better optimization. flexsurv also requests for the inverse functions to allow for back transformation to the original parameter scale. These are supplied to the distribution object as a list of functions. Most often, when parameters must be greater than 0, the transforms function is something like log, while the inverse transformation is exp.
  5. Initial values. The optimization process requires a starting point. This is often the trickiest part of specifying a distribution, as poor initial values will cause the optimization to fail. To avoid these problems, it’s recommend to have your initial values supplied as function which is dependent on the survival times. This allows the initial values to be dynamically determined by the actual shape of your data, and lowers the chance failure due to bad initial values.
  6. Distribution functions. flexsurv actually allows these to be specified in one of two primary ways: with a density and distribution function (‘d’, ‘p’), or with a Hazard and Cumulative Hazard function (‘h’,‘H’). It also helps to include the ‘q’ function when possible. fssg includes some helper functions to construct these distribution functions based off a ‘d’ and ‘p’ function (See ?survivify).

Let’s try an example. fssg contains some custom distribution functions for the Gamma-Gompertz distribution (e.g. dgamgomp, pgamgomp). These distributions functions are specified in a similar fashion to the native R distribution functions like pnorm, dnorm.

# 'Density' function
dgamgomp(x=1, b=1, sigma=1, beta=1)
#> [1] 0.3678794

# 'Distribution' function
pgamgomp(q=1, b=1, sigma=1, beta=1)
#> [1] 0.6321206


# Constructing our distribution object
fssg_gamgomp <- fssg_dist(
    name='gamgomp',                       # a simple name
    pars=c('b','sigma','beta'),           # parameters are scale, shape, shape respectively
    location='b',                         # name of the parameter we want to vary 
    transforms=c(log,log,log),            # transformation functions for each of our parameters (all must be >0)
    inv.transforms=c(exp,exp,exp),        # back-transformation functions for each of our parameters
    inits=function(t){c(1/max(t), 1, 1)}, # function that constructs the initial parameter estimates for optimization
    d = dgamgomp,                         # density function
    p = pgamgomp,                         # distribution function.
    q = quantilify(pgamgomp),             # quantile function (optional) 
    h = hazardify(dgamgomp, pgamgomp),    # hazard function (optional)
    H = cumhazardify(pgamgomp),           # cumulative hazard function (optional)
    fullname='gamma_gompertz'             # secondary name(s) which is used in the back-end of `fssg` to better label outputs
  )

We can then use this distribution object directly with the normal flexsurvreg function. It’s recommended that you specify the dfns parameter explicitly, as otherwise it will attempt to use globally declared d and p variables based on your distribution name.

# Sample data set provided in the package
data('pseudo', package='fssg')  

flexsurvreg(
  formula = Surv(time, death) ~ 1, 
  data = pseudo, 
  dist = fssg_gamgomp, 
  dfns = list(fssg_gamgomp$d, fssg_gamgomp$p)
) -> f1

print(f1)
#> Call:
#> flexsurvreg(formula = Surv(time, death) ~ 1, data = pseudo, dist = fssg_gamgomp, 
#>     dfns = list(fssg_gamgomp$d, fssg_gamgomp$p))
#> 
#> Estimates: 
#>        est       L95%      U95%      se      
#> b      1.57e-01  1.50e-01  1.65e-01  3.75e-03
#> sigma  5.08e-01  4.50e-01  5.74e-01  3.17e-02
#> beta   1.86e+06  1.10e+06  3.13e+06  4.96e+05
#> 
#> N = 10000,  Events: 5228,  Censored: 4772
#> Total time at risk: 930539
#> Log-likelihood = -23777.51, df = 3
#> AIC = 47561.02
plot(f1)

To save users some time, we’ve already specified a large number of custom distributions which can be cleanly accessed using fssg_dist_list. This function creates a list of each included distribution object. If you want to grab a specific distribution object from this list, you can directly grab it using get_fssg_dist.

get_fssg_dist('gamma_gompertz')
#> $name
#> [1] "gamgomp"
#> 
#> $pars
#> [1] "b"     "sigma" "beta" 
#> 
#> $location
#> [1] "b"
#> 
#> $transforms
#> $transforms[[1]]
#> function (x, base = exp(1))  .Primitive("log")
#> 
#> $transforms[[2]]
#> function (x, base = exp(1))  .Primitive("log")
#> 
#> $transforms[[3]]
#> function (x, base = exp(1))  .Primitive("log")
#> 
#> 
#> $inv.transforms
#> $inv.transforms[[1]]
#> function (x)  .Primitive("exp")
#> 
#> $inv.transforms[[2]]
#> function (x)  .Primitive("exp")
#> 
#> $inv.transforms[[3]]
#> function (x)  .Primitive("exp")
#> 
#> 
#> $inits
#> function (t) 
#> {
#>     c(1/max(t), 1, 1)
#> }
#> <bytecode: 0x00000290f301a730>
#> <environment: 0x00000290fa290438>
#> 
#> $d
#> function (x, b, sigma, beta, log = FALSE) 
#> {
#>     return(dgamgomp_c(x, b, sigma, beta, log))
#> }
#> <bytecode: 0x00000290ef38b648>
#> <environment: namespace:fssg>
#> 
#> $p
#> function (q, b, sigma, beta, lower.tail = TRUE, log.p = FALSE) 
#> {
#>     return(pgamgomp_c(q, b, sigma, beta, lower.tail, log.p))
#> }
#> <bytecode: 0x00000290ef384500>
#> <environment: namespace:fssg>
#> 
#> $q
#> function (p, ...) 
#> {
#>     lower = -1e+06
#>     upper = 1e+100
#>     lower_init <- if (p_function(0, ...) == 0) 
#>         0
#>     else lower
#>     sapply(p, function(pi) {
#>         if (is.na(pi)) {
#>             return(NA_real_)
#>         }
#>         if (pi < 0 || pi > 1) {
#>             return(NaN)
#>         }
#>         if (pi == 0) {
#>             return(-Inf)
#>         }
#>         if (pi == 1) {
#>             return(Inf)
#>         }
#>         stats::uniroot(function(x) p_function(x, ...) - pi, lower = lower_init, 
#>             upper = upper)$root
#>     })
#> }
#> <bytecode: 0x00000290f33c14c0>
#> <environment: 0x00000290fa27aee8>
#> 
#> $h
#> function (x, ...) 
#> {
#>     fx <- d_function(x, ...)
#>     Fx <- p_function(q = x, ...)
#>     fx/(1 - Fx)
#> }
#> <bytecode: 0x00000290f3405178>
#> <environment: 0x00000290fa27b2b0>
#> 
#> $H
#> function (x, ...) 
#> {
#>     sx <- 1 - p_function(q = x, ...)
#>     -log(sx)
#> }
#> <bytecode: 0x00000290f340b808>
#> <environment: 0x00000290fa27b5c0>
#> 
#> $fullname
#> [1] "gamma_gompertz"

We can also use the fssg function to build an individual flexsurvreg model by giving it individual model names! More on this function later.

fssg(
  formula = Surv(time, death) ~ 1, 
  data = pseudo, 
  model='gamma_gompertz'
) -> model1
#> Beginning...
#> ------------------------------------------------------------------------
#> gamma_gompertz
#> gamma_gompertz: 2.13 sec elapsed
#> ------------------------------------------------------------------------
#> Final Summary!
#> Model with best AIC: gamma_gompertz
#> Model with best BIC: gamma_gompertz
#> Kapow!: 2.24 sec elapsed

model1$models$gamma_gompertz
#> Call:
#> flexsurv::flexsurvreg(formula = formula, data = data, dist = i, 
#>     dfns = list(d = i$d, p = i$p), method = opt_method)
#> 
#> Estimates: 
#>        est       L95%      U95%      se      
#> b      1.57e-01  1.50e-01  1.65e-01  3.75e-03
#> sigma  5.08e-01  4.50e-01  5.74e-01  3.17e-02
#> beta   1.86e+06  1.10e+06  3.13e+06  4.96e+05
#> 
#> N = 10000,  Events: 5228,  Censored: 4772
#> Total time at risk: 930539
#> Log-likelihood = -23777.51, df = 3
#> AIC = 47561.02

The fssg Function

The titular fssg function provides a one-line way to test the fit of several distributions on your data. Simply provide it your model formula and data, and watch it run. By default, the function will attempt to build all available flexsurv and fssg parametric distribution models on your data. Progress will be printed by default, and any models that failed will list the corresponding error. The function always prints a summary of the models it ran, and by default returns a list with two items: the summary table as a data.frame, and a list containing each successful model.

fssg(
  survival::Surv(time, status) ~ x, 
  data = survival::aml
) -> aml_model
#> Beginning...
#> ------------------------------------------------------------------------
#> genf
#> genf: 0.05 sec elapsed
#> ------------------------------------------------------------------------
#> genf.orig
#> genf.orig: 0.27 sec elapsed
#> ------------------------------------------------------------------------
#> gengamma
#> gengamma: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> gengamma.orig
#> gengamma.orig: 0.22 sec elapsed
#> ------------------------------------------------------------------------
#> exp
#> exp: 0 sec elapsed
#> ------------------------------------------------------------------------
#> weibull
#> weibull: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> weibullPH
#> weibullPH: 0 sec elapsed
#> ------------------------------------------------------------------------
#> lnorm
#> lnorm: 0 sec elapsed
#> ------------------------------------------------------------------------
#> gamma
#> gamma: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> gompertz
#> gompertz: 0 sec elapsed
#> ------------------------------------------------------------------------
#> llogis
#> llogis: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> betaprime
#> betaprime: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> cauchy_scale
#> cauchy_scale: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> cauchy_location
#> cauchy_location: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> dagum
#> dagum: 0.07 sec elapsed
#> ------------------------------------------------------------------------
#> exp_log
#> exp_log: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> gen_extr_val
#> gen_extr_val: 0.03 sec elapsed
#> ------------------------------------------------------------------------
#> gen_extr_val_location
#> gen_extr_val_location: 0.05 sec elapsed
#> ------------------------------------------------------------------------
#> fatigue2_scale
#> fatigue2_scale: 0 sec elapsed
#> ------------------------------------------------------------------------
#> fatigue3_scale
#> fatigue3_scale: 0.06 sec elapsed
#> ------------------------------------------------------------------------
#> fatigue3_location
#> fatigue3_location: 0.07 sec elapsed
#> ------------------------------------------------------------------------
#> fold_norm_scale
#> fold_norm_scale: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> fold_norm_location
#> fold_norm_location: 0.06 sec elapsed
#> ------------------------------------------------------------------------
#> feller_pareto
#> feller_pareto: 0.11 sec elapsed
#> ------------------------------------------------------------------------
#> frechet
#> frechet: 0.03 sec elapsed
#> ------------------------------------------------------------------------
#> gamma_gompertz
#> Error in gamma_gompertz model : Error in optim(method = "Nelder-Mead", par = c(b = -5.08140436498446, : non-finite finite-difference value [1]
#> gamma_gompertz: 0.1 sec elapsed
#> ------------------------------------------------------------------------
#> gumbel
#> gumbel: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> gumbel_T2
#> gumbel_T2: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> hypertab_a
#> hypertab_a: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> hypertab_b
#> hypertab_b: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> inv_chisq
#> inv_chisq: 0 sec elapsed
#> ------------------------------------------------------------------------
#> inv_gamma
#> inv_gamma: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> inv_gaussian_scale
#> inv_gaussian_scale: 0.03 sec elapsed
#> ------------------------------------------------------------------------
#> inv_gaussian_location
#> inv_gaussian_location: 0.03 sec elapsed
#> ------------------------------------------------------------------------
#> inv_lind
#> inv_lind: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> laplace
#> laplace: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> levy
#> Error in levy model : Error in optim(method = "Nelder-Mead", par = c(location = 0, scale = 3.13549421592915, : non-finite finite-difference value [1]
#> levy: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> lindley
#> lindley: 0 sec elapsed
#> ------------------------------------------------------------------------
#> log_cauchy_location
#> log_cauchy_location: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> lomax_scale
#> lomax_scale: 0.03 sec elapsed
#> ------------------------------------------------------------------------
#> lomax_shape
#> lomax_shape: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> nakagami
#> nakagami: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> pareto_1
#> Error in pareto_1 model : Error in optim(method = "Nelder-Mead", par = c(scale = 1.38629436111989, : non-finite finite-difference value [1]
#> pareto_1: 0.04 sec elapsed
#> ------------------------------------------------------------------------
#> pareto_2
#> Error in pareto_2 model : Error in optim(method = "Nelder-Mead", par = c(location = 1.6094379124341, : non-finite finite-difference value [1]
#> pareto_2: 0.04 sec elapsed
#> ------------------------------------------------------------------------
#> pareto_3
#> Error in pareto_3 model : Error in optim(method = "Nelder-Mead", par = c(location = 1.6094379124341, : non-finite finite-difference value [1]
#> pareto_3: 0.06 sec elapsed
#> ------------------------------------------------------------------------
#> pareto_4
#> Error in pareto_4 model : Error in optim(method = "Nelder-Mead", par = c(location = 1.6094379124341, : non-finite finite-difference value [1]
#> pareto_4: 0.08 sec elapsed
#> ------------------------------------------------------------------------
#> rayleigh
#> rayleigh: 0 sec elapsed
#> ------------------------------------------------------------------------
#> rice_scale
#> rice_scale: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> rice_location
#> rice_location: 0.01 sec elapsed
#> ------------------------------------------------------------------------
#> shift_gomp
#> shift_gomp: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> sinmad
#> sinmad: 0.03 sec elapsed
#> ------------------------------------------------------------------------
#> scale_inv_chisq
#> scale_inv_chisq: 0.02 sec elapsed
#> ------------------------------------------------------------------------
#> Final Summary!
#> Model with best AIC: inv_lind
#> Model with best BIC: inv_lind
#> Kapow!: 1.93 sec elapsed

# Summary Table
aml_model$summary
#>                dist_name dist_source dist_ran      aic      bic    loglik
#> 1               inv_lind        fssg     TRUE 162.5793 164.8503 -79.28967
#> 2  inv_gaussian_location        fssg     TRUE 162.8084 166.2149 -78.40420
#> 3        scale_inv_chisq        fssg     TRUE 163.1561 166.5626 -78.57806
#> 4              inv_gamma        fssg     TRUE 163.1561 166.5626 -78.57806
#> 5              gumbel_T2        fssg     TRUE 163.2167 166.6232 -78.60835
#> 6         fatigue2_scale        fssg     TRUE 163.2254 166.6319 -78.61270
#> 7                  lnorm    flexsurv     TRUE 163.8552 167.2617 -78.92761
#> 8             hypertab_b        fssg     TRUE 164.0928 167.4993 -79.04641
#> 9             hypertab_a        fssg     TRUE 164.6250 168.0315 -79.31252
#> 10                llogis    flexsurv     TRUE 164.7053 168.1118 -79.35266
#> 11               lindley        fssg     TRUE 165.0135 167.2845 -80.50675
#> 12              gengamma    flexsurv     TRUE 165.1502 169.6921 -78.57508
#> 13             betaprime        fssg     TRUE 165.1612 169.7032 -78.58061
#> 14               frechet        fssg     TRUE 165.1707 169.7127 -78.58536
#> 15                 dagum        fssg     TRUE 165.2174 169.7593 -78.60868
#> 16    inv_gaussian_scale        fssg     TRUE 166.0326 169.4391 -80.01629
#> 17 gen_extr_val_location        fssg     TRUE 166.2521 170.7941 -79.12605
#> 18                 gamma    flexsurv     TRUE 166.2720 169.6785 -80.13601
#> 19         gengamma.orig    flexsurv     TRUE 166.3323 170.8743 -79.16617
#> 20                   exp    flexsurv     TRUE 166.5746 168.8456 -81.28729
#> 21                sinmad        fssg     TRUE 166.5899 171.1319 -79.29494
#> 22       cauchy_location        fssg     TRUE 166.6615 170.0680 -80.33077
#> 23          cauchy_scale        fssg     TRUE 167.0362 170.4426 -80.51808
#> 24               weibull    flexsurv     TRUE 167.0433 170.4498 -80.52165
#> 25             weibullPH    flexsurv     TRUE 167.0433 170.4498 -80.52165
#> 26               laplace        fssg     TRUE 167.1507 170.5572 -80.57534
#> 27                  genf    flexsurv     TRUE 167.1508 172.8283 -78.57542
#> 28             genf.orig    flexsurv     TRUE 167.1518 172.8293 -78.57591
#> 29                gumbel        fssg     TRUE 167.2659 170.6724 -80.63297
#> 30            shift_gomp        fssg     TRUE 167.6457 171.0522 -80.82285
#> 31         feller_pareto        fssg     TRUE 168.3251 175.1381 -78.16257
#> 32          gen_extr_val        fssg     TRUE 168.4282 172.9701 -80.21409
#> 33              nakagami        fssg     TRUE 168.4514 171.8579 -81.22571
#> 34              gompertz    flexsurv     TRUE 168.4822 171.8886 -81.24108
#> 35           lomax_shape        fssg     TRUE 168.5243 171.9308 -81.26216
#> 36               exp_log        fssg     TRUE 168.5765 171.9830 -81.28824
#> 37           lomax_scale        fssg     TRUE 168.5771 171.9836 -81.28857
#> 38   log_cauchy_location        fssg     TRUE 169.8337 173.2402 -81.91684
#> 39       fold_norm_scale        fssg     TRUE 169.8585 175.5360 -79.92926
#> 40        fatigue3_scale        fssg     TRUE 170.6104 175.1524 -81.30521
#> 41              rayleigh        fssg     TRUE 173.4173 175.6883 -84.70864
#> 42     fatigue3_location        fssg     TRUE 173.7757 178.3177 -82.88787
#> 43            rice_scale        fssg     TRUE 175.4175 178.8240 -84.70875
#> 44    fold_norm_location        fssg     TRUE 176.1256 181.8031 -83.06279
#> 45         rice_location        fssg     TRUE 189.1910 192.5975 -91.59550
#> 46             inv_chisq        fssg     TRUE 197.1167 199.3877 -96.55836
#> 47        gamma_gompertz        fssg    FALSE       NA       NA        NA
#> 48                  levy        fssg    FALSE       NA       NA        NA
#> 49              pareto_1        fssg    FALSE       NA       NA        NA
#> 50              pareto_2        fssg    FALSE       NA       NA        NA
#> 51              pareto_3        fssg    FALSE       NA       NA        NA
#> 52              pareto_4        fssg    FALSE       NA       NA        NA

# List of Models
head(aml_model$models)
#> $genf
#> Call:
#> flexsurv::flexsurvreg(formula = formula, data = data, dist = current_dist, 
#>     method = opt_method)
#> 
#> Estimates: 
#>                 data mean  est        L95%       U95%       se       
#> mu                     NA   3.26e+00   2.37e+00   4.16e+00   4.58e-01
#> sigma                  NA   8.27e-01   5.47e-01   1.25e+00   1.74e-01
#> Q                      NA  -7.45e-01  -2.59e+00   1.10e+00   9.43e-01
#> P                      NA   6.19e-04   6.44e-50   5.94e+42   3.34e-02
#> xNonmaintained   5.22e-01  -7.02e-01  -1.39e+00  -1.41e-02   3.51e-01
#>                 exp(est)   L95%       U95%     
#> mu                     NA         NA         NA
#> sigma                  NA         NA         NA
#> Q                      NA         NA         NA
#> P                      NA         NA         NA
#> xNonmaintained   4.96e-01   2.49e-01   9.86e-01
#> 
#> N = 23,  Events: 18,  Censored: 5
#> Total time at risk: 678
#> Log-likelihood = -78.57542, df = 5
#> AIC = 167.1508
#> 
#> 
#> $genf.orig
#> Call:
#> flexsurv::flexsurvreg(formula = formula, data = data, dist = current_dist, 
#>     method = opt_method)
#> 
#> Estimates: 
#>                 data mean  est        L95%       U95%       se       
#> mu                     NA   3.26e+00   2.37e+00   4.16e+00   4.58e-01
#> sigma                  NA   1.11e+00   7.25e-02   1.70e+01   1.55e+00
#> s1                     NA   1.34e+03   3.00e-27   5.95e+32   4.66e+04
#> s2                     NA   1.80e+00   1.26e-02   2.59e+02   4.57e+00
#> xNonmaintained   5.22e-01  -7.02e-01  -1.39e+00  -1.40e-02   3.51e-01
#>                 exp(est)   L95%       U95%     
#> mu                     NA         NA         NA
#> sigma                  NA         NA         NA
#> s1                     NA         NA         NA
#> s2                     NA         NA         NA
#> xNonmaintained   4.96e-01   2.49e-01   9.86e-01
#> 
#> N = 23,  Events: 18,  Censored: 5
#> Total time at risk: 678
#> Log-likelihood = -78.57591, df = 5
#> AIC = 167.1518
#> 
#> 
#> $gengamma
#> Call:
#> flexsurv::flexsurvreg(formula = formula, data = data, dist = current_dist, 
#>     method = opt_method)
#> 
#> Estimates: 
#>                 data mean  est      L95%     U95%     se       exp(est)
#> mu                   NA     3.2633   2.3662   4.1603   0.4577       NA 
#> sigma                NA     0.8274   0.5476   1.2501   0.1742       NA 
#> Q                    NA    -0.7449  -2.5931   1.1032   0.9430       NA 
#> xNonmaintained   0.5217    -0.7021  -1.3902  -0.0141   0.3510   0.4955 
#>                 L95%     U95%   
#> mu                   NA       NA
#> sigma                NA       NA
#> Q                    NA       NA
#> xNonmaintained   0.2490   0.9860
#> 
#> N = 23,  Events: 18,  Censored: 5
#> Total time at risk: 678
#> Log-likelihood = -78.57508, df = 4
#> AIC = 165.1502
#> 
#> 
#> $gengamma.orig
#> Call:
#> flexsurv::flexsurvreg(formula = formula, data = data, dist = current_dist, 
#>     method = opt_method)
#> 
#> Estimates: 
#>                 data mean  est        L95%       U95%       se       
#> shape                  NA   2.46e-01   6.59e-02   9.22e-01   1.66e-01
#> scale                  NA   1.33e-04   1.89e-16   9.35e+07   1.85e-03
#> k                      NA   2.24e+01   1.60e+00   3.13e+02   3.01e+01
#> xNonmaintained   5.22e-01  -7.55e-01  -1.51e+00  -1.37e-03   3.84e-01
#>                 exp(est)   L95%       U95%     
#> shape                  NA         NA         NA
#> scale                  NA         NA         NA
#> k                      NA         NA         NA
#> xNonmaintained   4.70e-01   2.21e-01   9.99e-01
#> 
#> N = 23,  Events: 18,  Censored: 5
#> Total time at risk: 678
#> Log-likelihood = -79.16617, df = 4
#> AIC = 166.3323
#> 
#> 
#> $exp
#> Call:
#> flexsurv::flexsurvreg(formula = formula, data = data, dist = current_dist, 
#>     method = opt_method)
#> 
#> Estimates: 
#>                 data mean  est      L95%     U95%     se       exp(est)
#> rate                 NA    0.01655  0.00789  0.03471  0.00625       NA 
#> xNonmaintained  0.52174    0.95809  0.01046  1.90572  0.48349  2.60672 
#>                 L95%     U95%   
#> rate                 NA       NA
#> xNonmaintained  1.01052  6.72428
#> 
#> N = 23,  Events: 18,  Censored: 5
#> Total time at risk: 678
#> Log-likelihood = -81.28729, df = 2
#> AIC = 166.5746
#> 
#> 
#> $weibull
#> Call:
#> flexsurv::flexsurvreg(formula = formula, data = data, dist = current_dist, 
#>     method = opt_method)
#> 
#> Estimates: 
#>                 data mean  est      L95%     U95%     se       exp(est)
#> shape                NA      1.264    0.892    1.793    0.225       NA 
#> scale                NA     60.889   33.828  109.599   18.260       NA 
#> xNonmaintained    0.522     -0.929   -1.679   -0.180    0.383    0.395 
#>                 L95%     U95%   
#> shape                NA       NA
#> scale                NA       NA
#> xNonmaintained    0.187    0.836
#> 
#> N = 23,  Events: 18,  Censored: 5
#> Total time at risk: 678
#> Log-likelihood = -80.52165, df = 3
#> AIC = 167.0433

In many cases, the user may only want to try a subset of the available models. This can be done by giving fssg only the names of the desired distributions. These should be strings and correspond to either the name or fullname field of an included distribution object.

fssg(
  survival::Surv(time, death) ~ gender, 
  data = pseudo, 
  models = c('gamma_gompertz','gompertz','gumbel','gumbel_T2')
)$summary
#> Beginning...
#> ------------------------------------------------------------------------
#> gompertz
#> gompertz: 0.23 sec elapsed
#> ------------------------------------------------------------------------
#> gamma_gompertz
#> gamma_gompertz: 3.33 sec elapsed
#> ------------------------------------------------------------------------
#> gumbel
#> gumbel: 0.58 sec elapsed
#> ------------------------------------------------------------------------
#> gumbel_T2
#> gumbel_T2: 1.22 sec elapsed
#> ------------------------------------------------------------------------
#> Final Summary!
#> Model with best AIC: gamma_gompertz
#> Model with best BIC: gamma_gompertz
#> Kapow!: 5.47 sec elapsed
#>        dist_name dist_source dist_ran      aic      bic    loglik
#> 1 gamma_gompertz        fssg     TRUE 47562.28 47591.12 -23777.14
#> 2         gumbel        fssg     TRUE 47737.97 47759.60 -23865.99
#> 3      gumbel_T2        fssg     TRUE 48181.63 48203.26 -24087.81
#> 4       gompertz    flexsurv     TRUE 48206.60 48228.23 -24100.30

Failed models

Due to the large number of distributions considered in the fssg function, it is likely that one or more of them will fail to work with the provided data. This is typically due to one of three reasons:

  1. Poor initial values
  1. Optimized into invalid values
  1. Failure to identify parameters

Initial Values

fssg includes a basic diagnostic function for checking whether the init function for a distribution will fail or not based on specific data. This checks if the initial distribution parameters provided by init will allow the distribution function to return valid values for each observed time values.

check_inits(times = pseudo$time,  get_fssg_dist('gumbel'))

check_inits(times = pseudo$time,  get_fssg_dist('gumbel_T2'))

check_inits(times = pseudo$time,  get_fssg_dist('lomax'))

Individual Models

Each model created by fssg can be interacted with in the same way you would interact with flexsurv models. Here’s a quick example of how to look into a specific model using the Gumbel distribution.

fssg(survival::Surv(time, status) ~ age + sex, 
  data = survival::cancer, 
  models = c('gamma_gompertz','gompertz','gumbel','gumbel_T2'), progress = F) -> output
#> Beginning...
#> ------------------------------------------------------------------------
#> Final Summary!
#> Model with best AIC: gamma_gompertz
#> Model with best BIC: gompertz
#> Kapow!: 0.5 sec elapsed

output$models$gumbel
#> Call:
#> flexsurv::flexsurvreg(formula = formula, data = data, dist = i, 
#>     dfns = list(d = i$d, p = i$p), method = opt_method)
#> 
#> Estimates: 
#>        data mean  est        L95%       U95%       se         exp(est) 
#> mu            NA  161.71987   88.62247  234.81728   37.29528         NA
#> sigma         NA  244.44378   65.47783  912.56476  164.28841         NA
#> age     62.44737   -0.00878   -0.02685    0.00930    0.00922    0.99126
#> sex      1.39474    0.43641    0.11710    0.75573    0.16292    1.54715
#>        L95%       U95%     
#> mu            NA         NA
#> sigma         NA         NA
#> age      0.97350    1.00934
#> sex      1.12423    2.12917
#> 
#> N = 228,  Events: 165,  Censored: 63
#> Total time at risk: 69593
#> Log-likelihood = -1148.943, df = 4
#> AIC = 2305.887

Sex appears to be a significant variable here, as the confidence interval does not contain zero. Let’s see what this exp(est) of 1.56 amounts to visually.

plot(
  output$models$gumbel, 
  newdata = data.frame(sex=c(1,2), age=70), col=c('blue','red'),
  type='survival',
  lwd=2,
)
legend(
  'topright',
  legend = c('Male','Female'),
  col =c('blue','red'),
  lwd=2
)

It looks like males have a considerably higher hazard rate over time according to this model.

On the other hand, age is not significantly associated with survival time.

plot(
  output$models$gumbel, 
  newdata = data.frame(sex=c(1), age=c(65,80)), col=c('blue','red'),
  type='survival',
  lwd=2
)
legend(
  'topright',
  legend = c('Age 65','Age 80'),
  col =c('blue','red'),
  lwd=2
)

This makes sense, as the two survival lines appear to be very similar.

Spline Models

fssg also includes some options for building spline models using flexsurvspline. fssg includes two arguments which are used to build splines:

It’s worth noting that flexsurvspline allows for three different scales to be used in the spline estimation: ‘hazard’, ‘odds’ and ‘normal’. fssg will attempt to build splines using each of these scales, and labels to output model accordingly.

fssg(
  survival::Surv(time, status) ~ x, 
  data = survival::aml, 
  models = c('weibull'), 
  spline = c('rp','wy'), # include both methods
  max_knots = 3          # attempt up to 3 knots
) -> spline_models
#> Beginning...
#> ------------------------------------------------------------------------
#> weibull
#> weibull: 0 sec elapsed
#> Loading required namespace: splines2
#> ------------------------------------------------------------------------
#> Spline
#> Error in spline_rp_hazard_2 model : Error in optim(method = "Nelder-Mead", par = c(gamma0 = -4.78433978320387, : function cannot be evaluated at initial parameters
#> Error in spline_rp_hazard_3 model : Error in optim(method = "Nelder-Mead", par = c(gamma0 = -5.05412785907675, : function cannot be evaluated at initial parameters
#> Error in spline_wy_hazard_2 model : Error in optim(method = "Nelder-Mead", par = c(gamma0 = -2.26513881676761, : function cannot be evaluated at initial parameters
#> Error in spline_wy_hazard_3 model : Error in optim(method = "Nelder-Mead", par = c(gamma0 = -2.27523490522575, : function cannot be evaluated at initial parameters
#> weibull: 1.55 sec elapsed
#> ------------------------------------------------------------------------
#> Final Summary!
#> Model with best AIC: spline_wy_normal_1
#> Model with best BIC: spline_wy_normal_1
#> Kapow!: 1.66 sec elapsed

spline_models$summary
#>             dist_name dist_source dist_ran      aic      bic    loglik
#> 1  spline_wy_normal_1    flexsurv     TRUE 165.8190 170.3610 -78.90950
#> 2  spline_rp_normal_1    flexsurv     TRUE 165.8190 170.3610 -78.90950
#> 3  spline_wy_normal_3    flexsurv     TRUE 166.0324 172.8454 -77.01621
#> 4  spline_rp_normal_3    flexsurv     TRUE 166.0325 172.8454 -77.01623
#> 5    spline_rp_odds_1    flexsurv     TRUE 166.7050 171.2469 -79.35248
#> 6    spline_wy_odds_1    flexsurv     TRUE 166.7050 171.2469 -79.35248
#> 7    spline_wy_odds_3    flexsurv     TRUE 166.7448 173.5578 -77.37241
#> 8    spline_rp_odds_3    flexsurv     TRUE 166.7448 173.5578 -77.37241
#> 9  spline_rp_normal_2    flexsurv     TRUE 167.0326 172.7101 -78.51630
#> 10 spline_wy_normal_2    flexsurv     TRUE 167.0326 172.7101 -78.51630
#> 11            weibull    flexsurv     TRUE 167.0433 170.4498 -80.52165
#> 12   spline_wy_odds_2    flexsurv     TRUE 167.1662 172.8437 -78.58309
#> 13   spline_rp_odds_2    flexsurv     TRUE 167.1662 172.8437 -78.58310
#> 14 spline_rp_hazard_1    flexsurv     TRUE 167.5169 172.0588 -79.75843
#> 15 spline_wy_hazard_1    flexsurv     TRUE 167.5169 172.0588 -79.75843

# Best model has 1 knot
# plot(spline_models$models$spline_wy_normal_1)

# Model with 3 knots
# plot(spline_models$models$spline_wy_normal_3)

Fit Statistics

By default, fssg provides AIC, BIC, and Log Likelihood for the models in the summary table.

# Example using our models on the aml dataset
head(aml_model$summary)
#>               dist_name dist_source dist_ran      aic      bic    loglik
#> 1              inv_lind        fssg     TRUE 162.5793 164.8503 -79.28967
#> 2 inv_gaussian_location        fssg     TRUE 162.8084 166.2149 -78.40420
#> 3       scale_inv_chisq        fssg     TRUE 163.1561 166.5626 -78.57806
#> 4             inv_gamma        fssg     TRUE 163.1561 166.5626 -78.57806
#> 5             gumbel_T2        fssg     TRUE 163.2167 166.6232 -78.60835
#> 6        fatigue2_scale        fssg     TRUE 163.2254 166.6319 -78.61270

Included in this package is a function that compiles a wide variety of fit statistics for a model, including concordance, AUCs, Mean Absolute Error (MAE), Integrated Absolute Error (IAE), Integrated Square Error (ISE), Brier Scores, and optionally the very slow to calculate Integrated Brier Scores.

get_fit_stats(aml_model$models$inv_lind, ibs=T) -> fitstats
#> Loading required namespace: survAUC
#> Loading required namespace: SurvMetrics

fitstats
#> $Harrel.C.Index
#> [1] 0.6190476
#> 
#> $Uno.C.Index
#> [1] 0.6144108
#> 
#> $iAUC
#> [1] 0.689442
#> 
#> $iAUC.IQR
#> [1] 0.6506163
#> 
#> $iAUC.Q10.Q90
#> [1] 0.6942003
#> 
#> $iAUC.Full
#> [1] 0.689442
#> 
#> $AUC.Median
#> [1] 0.6097697
#> 
#> $AUC.Mean
#> [1] 0.584504
#> 
#> $C.Index
#> [1] NA
#> 
#> $MAE
#> [1] NA
#> 
#> $IAE
#> [1] 2.9773
#> 
#> $ISE
#> [1] 0.3439
#> 
#> $IAE.IQR
#> [1] 3.7387
#> 
#> $ISE.IQR
#> [1] 0.5323
#> 
#> $IAE.Q10.Q90
#> [1] 3.11
#> 
#> $ISE.Q10.Q90
#> [1] 0.3611
#> 
#> $IAE.Full
#> [1] 2.9773
#> 
#> $ISE.Full
#> [1] 0.3439
#> 
#> $Brier.Median
#> [1] 0.405603
#> 
#> $Brier.Mean
#> [1] 0.232592
#> 
#> $IBS
#> [1] 0.146813
#> 
#> $IBS.Full
#> [1] 0.107497

More specifics on these metrics can be found in the vignette “Fit_Statistics”.