---
title: "Basic Usage"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Basic Usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(poissonsuperlearner)
library(riskRegression)
```

# Introduction

This vignette gives a minimal example of how to use `poissonsuperlearner` to fit piecewise-constant hazard models and combine them in a Poisson super learner. The package works by expanding subject-level time-to-event data to a long Poisson format on a grid of time intervals, fitting one or more base learners to the corresponding cause-specific hazards, and optionally combining them through a meta-learner. The same workflow can be used both for a single fitted learner via `fit_learner()` and for an ensemble via `Superlearner()`.

# Generate synthetic Steno data

We first simulate a small synthetic data set with `simulateStenoT1()`. The simulator is inspired by the Steno Type-1 risk engine and generates correlated baseline covariates such as age, diabetes duration, LDL, systolic blood pressure, HbA1c, albuminuria, eGFR, smoking, and physical activity. Event times are then generated from Weibull proportional hazards models. In scenario `"alpha"`, the event mechanism is linear in the covariates, and with `competing_risks = TRUE` the data contain two causes: cardiovascular disease (CVD) and death without prior CVD. In scenario `"beta"`, the CVD hazard additionally includes nonlinear hinge-squared effects of age and LDL. Here we use the `"alpha"` scenario with competing risks to illustrate the cause-specific output of the package on a subset of the covariates.

```{r}
set.seed(42)
d <- simulateStenoT1(
  n = 150,
  scenario = "alpha",
  competing_risks = TRUE
)
head(d[, .(id, time, event, age, diabetes_duration, value_LDL, sex)])
```

The observed follow-up time is stored in `time` and the event indicator in `event`.
With competing risks, `event = 0` denotes censoring, `event = 1` corresponds to CVD, and `event = 2` to death without prior CVD.

# Define a small learner library

Each learner models the piecewise-constant hazard on a grid of time intervals. In the examples below, the grid is controlled by `number_of_nodes = 5`, which creates a quantile-based time grid used internally for the long Poisson representation. We define two simple learners based on `glmnet`: one unpenalized Poisson model and one lasso-type learner.

```{r}
lglmnet0 <- Learner_glmnet(
  covariates = c("sex", "diabetes_duration"),
  cross_validation = FALSE,
  lambda = 0,
  intercept = TRUE
)
lglmnet1 <- Learner_glmnet(
  covariates = c("value_Smoking", "value_LDL"),
  cross_validation = TRUE,
  alpha = 1,
  intercept = TRUE
)
learners_list <- list(
  glm = lglmnet0,
  lasso = lglmnet1
)
```

# Alternative learner definitions and wrapper arguments

The package currently provides three learner classes: `Learner_glmnet()`, `Learner_gam()`, and `Learner_hal()`. The code below is shown for illustration only and is therefore not run. It highlights how learners are defined by initializing Reference Class objects and then passing them to `fit_learner()` or `Superlearner()`.

```{r, eval = FALSE}
l_glmnet <- Learner_glmnet(
  covariates = c("sex", "diabetes_duration", "value_LDL"),
  add_nodes = TRUE,
  cross_validation = TRUE,
  alpha = 1
)
l_gam <- Learner_gam(
  covariates = c("s(age)", "s(value_LDL)", "sex"),
  add_nodes = TRUE,
  method = "fREML",
  discrete = TRUE
)
l_hal <- Learner_hal(
  covariates = c("age", "diabetes_duration", "value_LDL"),
  add_nodes = TRUE,
  cross_validation = TRUE,
  num_knots = c(10L, 5L),
  max_degree = 2L
)
```

Some learner arguments are specific to the piecewise-constant hazard workflow implemented in `poissonsuperlearner`. In particular, `covariates` defines the terms used in the long-format hazard model, and `add_nodes = TRUE` adds the interval-specific node effects that represent the baseline hazard across the time grid.
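The long Poisson representation behind these node effects follows the classical "Poisson trick" for piecewise-constant hazards. As a rough, self-contained illustration — using the `survival` package on toy single-event data, not the package's own expansion code — follow-up is split at the grid nodes and a Poisson GLM with a log person-time offset estimates one hazard per interval:

```{r}
# Generic sketch of the Poisson trick (not poissonsuperlearner internals)
library(survival)

set.seed(1)
toy <- data.frame(
  id    = 1:50,
  time  = rexp(50, rate = 0.3),   # follow-up time
  event = rbinom(50, 1, 0.7),     # 0/1 event indicator
  x     = rnorm(50)               # one covariate
)

# Split each subject's follow-up at the interval boundaries ("nodes");
# survSplit adds tstart and an interval index, and carries the event
# indicator only into the final interval
long <- survSplit(Surv(time, event) ~ ., data = toy,
                  cut = c(1, 2, 3, 4), episode = "interval")

# Piecewise-constant hazard: one Poisson rate per interval, with the
# log of person-time spent in the interval as offset
fit <- glm(event ~ factor(interval) + x,
           family = poisson(),
           offset = log(time - tstart),
           data = long)
```

In `poissonsuperlearner`, `add_nodes = TRUE` plays the role of the `factor(interval)` terms in this sketch.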
For the HAL learner, `num_knots`, `max_degree`, and `maxit_prefit` also control the package-specific basis construction and screening step used before the final penalized fit.

A useful feature of the package is that, except for HAL, the learners mainly wrap existing fitting routines. `Learner_glmnet()` forwards additional arguments in `...` to `glmnet::glmnet()` or `glmnet::cv.glmnet()`, and `Learner_gam()` forwards them to `mgcv::bam()`. This means that most tuning can be done with the familiar arguments of the underlying packages rather than through a separate package-specific interface.

For GAMs, smooth terms are specified directly in the `covariates` vector using standard `mgcv` syntax. For example, if a spline effect of age is desired, one simply writes `"s(age)"`; tensor products and other `mgcv` terms can be passed in the same way.

The HAL learner is more specialized. It includes several parameters that are specific to the piecewise-constant hazard implementation, such as the number of basis cutpoints per interaction order (`num_knots`) and the maximum interaction degree (`max_degree`), while still relying on the underlying `glmnet` fitting routine for the penalized Poisson regression step. In other words, HAL combines a package-specific basis expansion with the familiar `glmnet` optimization controls available through `...`.

# Fit an ensemble and individual learners

`Superlearner()` fits all learners, obtains cross-validated predictions for the meta-learning step, and then fits one cause-specific meta-learner per cause. For comparison, `fit_learner()` fits a single learner directly.
`Superlearner()` and `fit_learner()` share a common interface:

- The time grid is supplied either as an explicit vector of time nodes (`nodes`) or via `number_of_nodes`, in which case the grid is constructed from the observed quantiles.
- `nfold` specifies the number of folds used for internal cross-validation.
- If an `id` variable is provided, it identifies which rows belong to the same individual; if `id` is omitted, each row is treated as a separate observation.
- The `status` variable must be coded as `0` for censoring and positive integers for the competing event types.
- `event_time` is the event or follow-up time variable on which the model is built.

```{r}
sl_model <- Superlearner(
  d,
  id = "id",
  status = "event",
  event_time = "time",
  learners = learners_list,
  number_of_nodes = 5,
  nfold = 3
)
l0_model <- fit_learner(
  d,
  id = "id",
  learner = lglmnet0,
  status = "event",
  event_time = "time",
  number_of_nodes = 5
)
l1_model <- fit_learner(
  d,
  id = "id",
  learner = lglmnet1,
  status = "event",
  event_time = "time",
  number_of_nodes = 5
)
```

# Inspect the fitted objects

The fitted objects have `print()`, `summary()`, and `coef()` methods. `print()` is useful for a quick look at the stored fitted object. For a fitted super learner, you can print the stacked meta-learner or one of the stored base learners.

```{r}
print(sl_model, cause = 1)
print(sl_model, cause = 1, model = "glm")
print(l0_model, cause = 1)
```

`summary()` provides a more informative overview. For a fitted super learner, it prints a compact description of the ensemble; the number of causes, learners, folds, and nodes; the cross-validated Poisson deviance used to compare the base learners; and the cause-specific meta-learner coefficients. These coefficients show how the ensemble combines the learners for each cause.

```{r}
summary(sl_model)
```

You can also summarize one of the base learners stored inside the ensemble, or a learner fitted directly with `fit_learner()`.
```{r}
summary(sl_model, cause = 1, model = "glm")
summary(l0_model, cause = 1)
```

`coef()` extracts model coefficients. For a fitted super learner, `coef()` returns the meta-learner coefficients by default, while the `model` argument can be used to extract coefficients from one of the stored base learners.

```{r}
coef(sl_model, cause = 1)
coef(sl_model, cause = 1, model = "glm")
coef(l0_model, cause = 1)
```

# Predict hazards, survival, and absolute risk

The low-level `predict()` method returns a data set with one row per requested prediction time. In addition to the original covariates, it includes the predicted cause-specific piecewise hazards (`pwch_1`, `pwch_2`, ...), the predicted survival function, and the absolute risk for the chosen cause.

```{r}
pred_sl <- predict(
  sl_model,
  newdata = d[1:2],
  times = c(2, 5),
  cause = 1
)
pred_sl
```

For interoperability with `riskRegression`, the package also provides `predictRisk()`. This returns a matrix with one column per requested time. Risk predictions can be obtained directly from a learner fitted with `fit_learner()`:

```{r}
cbind(
  glm_direct = predictRisk(l0_model, newdata = d[1, ], times = 5, cause = 1),
  lasso_direct = predictRisk(l1_model, newdata = d[1, ], times = 5, cause = 1)
)
```

The same is possible from the fitted super learner object. Setting `model = "sl"` uses the ensemble prediction, while setting `model` equal to a learner label such as `"glm"` or `"lasso"` uses the corresponding stored base learner from the fitted ensemble.
```{r}
cbind(
  sl = predictRisk(sl_model, newdata = d[1, ], times = 5, cause = 1, model = "sl"),
  glm_from_ensemble = predictRisk(sl_model, newdata = d[1, ], times = 5, cause = 1, model = "glm"),
  lasso_from_ensemble = predictRisk(sl_model, newdata = d[1, ], times = 5, cause = 1, model = "lasso")
)
```

This illustrates the two main prediction workflows supported by the package: using a single learner fitted with `fit_learner()` or using the fitted ensemble returned by `Superlearner()`.
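Because `predictRisk()` methods are available, the fitted objects can in principle also be passed to `riskRegression::Score()` for a formal comparison of prediction performance. The chunk below is an illustration only and is not run; it assumes that `Score()` handles these objects like any other model with a `predictRisk()` method:

```{r, eval = FALSE}
# Sketch: compare the two directly fitted learners with riskRegression
x <- Score(
  list(glm = l0_model, lasso = l1_model),
  formula = Hist(time, event) ~ 1,
  data = d,
  times = c(2, 5),
  cause = 1,
  metrics = c("AUC", "Brier")
)
summary(x)
```

In practice one would evaluate on independent data or use the cross-validation facilities of `Score()` rather than the training data `d`.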