--- title: "fssg" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{fssg} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` `fssg` is a small package focused on allowing users to quickly test multiple parametric survival models in one line of code, by stream-lining the building of many `flexsurv` models. This package was originally inspired some of my other research on Broad Genomic Profiling, where tens to hundreds of genomic mutations are tested in one 'batch' using a single tissue sample. This type of approach has many names, but can be colloquially referred to as a "shotgun" approach, which is where the package gets it's name: *F*lex*S*urv *S*hot*G*un (`fssg`). To get started with `fssg`, we'll simply load our package, which will also grab the main dependency `flexsurv`. ```{r setup} library(fssg) library(flexsurv) ``` The main powerhouse behind this package is the parametric survival modeling package `flexsurv`. `flexsurv` provides a straightforward way to build parametric survival models, including many of the most commonly used distributions in survival modeling (weibull, gamma, etc.). This is a very useful tool, but is limited in parametric distribution options. Thankfully, the authors of `flexsurv` included a way for users to declare and use their own custom defined distribution. This means that as long as we know what we're doing, we can create models using whatever distribution we'd like! ## Custom Distributions Included in this package are some functions for the creation of custom `flexsurv`-compatible distributions. As long as we have distribution functions for the distribution (i.e. `pnorm` and `dnorm`), and understand the bounds of our parameters, then we can construct a distribution object using `fssg_dist`. What do we need in order to build a custom distribution? 1. A name! Used in the `fssg` back-end, but technically isn't that important if you're not doing batch testing via `fssg`. 2. A vector of parameters. This lets `flexsurv` know what parameters need to be initialized and optimized in the fitting process. 3. An indicator for the main variable we want to change depending on the model covariates. `flexsurv` calls this the 'location' variable, as it originally focused primarily on location-scale distributions, but non-location parameters may be supplied as well. Please note, that this can only be one parameter in `flexsurv`, but we can allow other parameters to vary by supplying a list into the "anc" parameter in `flexsurvreg`. See `?flexsurvreg` for more information on this. 4. Transformation functions that change our parameter domains into the real number line for better optimization. `flexsurv` also requests for the inverse functions to allow for back transformation to the original parameter scale. These are supplied to the distribution object as a list of functions. Most often, when parameters must be greater than 0, the transforms function is something like `log`, while the inverse transformation is `exp`. 5. Initial values. The optimization process requires a starting point. This is often the trickiest part of specifying a distribution, as poor initial values will cause the optimization to fail. To avoid these problems, it's recommend to have your initial values supplied as function which is dependent on the survival times. This allows the initial values to be dynamically determined by the actual shape of your data, and lowers the chance failure due to bad initial values. 6. Distribution functions. `flexsurv` actually allows these to be specified in one of two primary ways: with a density and distribution function ('d', 'p'), or with a Hazard and Cumulative Hazard function ('h','H'). It also helps to include the 'q' function when possible. `fssg` includes some helper functions to construct these distribution functions based off a 'd' and 'p' function (See `?survivify`). Let's try an example. `fssg` contains some custom distribution functions for the Gamma-Gompertz distribution (e.g. `dgamgomp`, `pgamgomp`). These distributions functions are specified in a similar fashion to the native R distribution functions like `pnorm`, `dnorm`. ```{r} # 'Density' function dgamgomp(x=1, b=1, sigma=1, beta=1) # 'Distribution' function pgamgomp(q=1, b=1, sigma=1, beta=1) # Constructing our distribution object fssg_gamgomp <- fssg_dist( name='gamgomp', # a simple name pars=c('b','sigma','beta'), # parameters are scale, shape, shape respectively location='b', # name of the parameter we want to vary transforms=c(log,log,log), # transformation functions for each of our parameters (all must be >0) inv.transforms=c(exp,exp,exp), # back-transformation functions for each of our parameters inits=function(t){c(1/max(t), 1, 1)}, # function that constructs the initial parameter estimates for optimization d = dgamgomp, # density function p = pgamgomp, # distribution function. q = quantilify(pgamgomp), # quantile function (optional) h = hazardify(dgamgomp, pgamgomp), # hazard function (optional) H = cumhazardify(pgamgomp), # cumulative hazard function (optional) fullname='gamma_gompertz' # secondary name(s) which is used in the back-end of `fssg` to better label outputs ) ``` We can then use this distribution object directly with the normal `flexsurvreg` function. It's recommended that you specify the `dfns` parameter explicitly, as otherwise it will attempt to use globally declared d and p variables based on your distribution `name`. ```{r, message=F} # Sample data set provided in the package data('pseudo', package='fssg') flexsurvreg( formula = Surv(time, death) ~ 1, data = pseudo, dist = fssg_gamgomp, dfns = list(fssg_gamgomp$d, fssg_gamgomp$p) ) -> f1 print(f1) plot(f1) ``` To save users some time, we've already specified a large number of custom distributions which can be cleanly accessed using `fssg_dist_list`. This function creates a list of each included distribution object. If you want to grab a specific distribution object from this list, you can directly grab it using `get_fssg_dist`. ```{r} get_fssg_dist('gamma_gompertz') ``` We can also use the `fssg` function to build an individual `flexsurvreg` model by giving it individual model names! More on this function later. ```{r} fssg( formula = Surv(time, death) ~ 1, data = pseudo, model='gamma_gompertz' ) -> model1 model1$models$gamma_gompertz ``` ## The `fssg` Function The titular `fssg` function provides a one-line way to test the fit of several distributions on your data. Simply provide it your model formula and data, and watch it run. By default, the function will attempt to build all available `flexsurv` and `fssg` parametric distribution models on your data. Progress will be printed by default, and any models that failed will list the corresponding error. The function always prints a summary of the models it ran, and by default returns a list with two items: the summary table as a data.frame, and a list containing each successful model. ```{r} fssg( survival::Surv(time, status) ~ x, data = survival::aml ) -> aml_model # Summary Table aml_model$summary # List of Models head(aml_model$models) ``` In many cases, the user may only want to try a subset of the available models. This can be done by giving `fssg` only the names of the desired distributions. These should be strings and correspond to either the `name` or `fullname` field of an included distribution object. ```{r} fssg( survival::Surv(time, death) ~ gender, data = pseudo, models = c('gamma_gompertz','gompertz','gumbel','gumbel_T2') )$summary ``` ## Failed models Due to the large number of distributions considered in the `fssg` function, it is likely that one or more of them will fail to work with the provided data. This is typically due to one of three reasons: 1. Poor initial values * Error's such as "initial value in 'vmmin' is not finite" are evidence of this 2. Optimized into invalid values * Error's such as "non-finite finite-difference value" are evidence of successful initial values, but failure's in the optimization process. This happens when `optim` perturbs a parameters, but the new perturbed value leads to invalid results such as `NaN` or `Inf`. 3. Failure to identify parameters * Error's like "system is exactly singular" are indicative of hessian computation issues. Can occur when the parameters are not estimable for the given data. ### Initial Values `fssg` includes a basic diagnostic function for checking whether the `init` function for a distribution will fail or not based on specific data. This checks if the initial distribution parameters provided by `init` will allow the distribution function to return valid values for each observed time values. ```{r, message=F} check_inits(times = pseudo$time, get_fssg_dist('gumbel')) check_inits(times = pseudo$time, get_fssg_dist('gumbel_T2')) check_inits(times = pseudo$time, get_fssg_dist('lomax')) ``` ## Individual Models Each model created by `fssg` can be interacted with in the same way you would interact with `flexsurv` models. Here's a quick example of how to look into a specific model using the Gumbel distribution. ```{r} fssg(survival::Surv(time, status) ~ age + sex, data = survival::cancer, models = c('gamma_gompertz','gompertz','gumbel','gumbel_T2'), progress = F) -> output output$models$gumbel ``` Sex appears to be a significant variable here, as the confidence interval does not contain zero. Let's see what this exp(est) of 1.56 amounts to visually. ```{r} plot( output$models$gumbel, newdata = data.frame(sex=c(1,2), age=70), col=c('blue','red'), type='survival', lwd=2, ) legend( 'topright', legend = c('Male','Female'), col =c('blue','red'), lwd=2 ) ``` It looks like males have a considerably higher hazard rate over time according to this model. On the other hand, age is not significantly associated with survival time. ```{r} plot( output$models$gumbel, newdata = data.frame(sex=c(1), age=c(65,80)), col=c('blue','red'), type='survival', lwd=2 ) legend( 'topright', legend = c('Age 65','Age 80'), col =c('blue','red'), lwd=2 ) ``` This makes sense, as the two survival lines appear to be very similar. ## Spline Models `fssg` also includes some options for building spline models using `flexsurvspline`. `fssg` includes two arguments which are used to build splines: - `spline`: By default this is NA, but will create spline models when supplied with the name(s) of the desired spline-methods. Include 'rp' or 'wy' for Royston-Parmar natural cubic spline, or Wang-Yan alternative natural cubic spline respectively. Note: The Wang-Yan version requires the package `splines2ns`. - `max_knots`: An integer specifying the maximum number of knots to be considered in the spline models. `fssg` will attempt to fit splines using every integer of knots from 1 to `max_knots`. Knots are placed at equally spaced quantiles based on the survival time distribution. It's worth noting that `flexsurvspline` allows for three different scales to be used in the spline estimation: 'hazard', 'odds' and 'normal'. `fssg` will attempt to build splines using each of these scales, and labels to output model accordingly. ```{r} fssg( survival::Surv(time, status) ~ x, data = survival::aml, models = c('weibull'), spline = c('rp','wy'), # include both methods max_knots = 3 # attempt up to 3 knots ) -> spline_models spline_models$summary # Best model has 1 knot # plot(spline_models$models$spline_wy_normal_1) # Model with 3 knots # plot(spline_models$models$spline_wy_normal_3) ``` ## Fit Statistics By default, `fssg` provides AIC, BIC, and Log Likelihood for the models in the summary table. ```{r} # Example using our models on the aml dataset head(aml_model$summary) ``` Included in this package is a function that compiles a wide variety of fit statistics for a model, including concordance, AUCs, Mean Absolute Error (MAE), Integrated Absolute Error (IAE), Integrated Square Error (ISE), Brier Scores, and optionally the very slow to calculate Integrated Brier Scores. ```{r} get_fit_stats(aml_model$models$inv_lind, ibs=T) -> fitstats fitstats ``` More specifics on these metrics can be found in the vignette "Fit_Statistics".