Broglio et al. (2014) presented an example from a hypothetical trial. We will use a similar setup for this example, and break up the pieces to make clear the argument choices for the package.
The setting is a two-arm randomized trial where patients are equally randomized to either a control or a treatment arm. The primary endpoint is overall survival (OS) measured from the date of randomization to the date of death from any cause or last follow-up. The expected OS rate at 12-months for the control arm is 30%. The minimum sample size is 100 and the maximum sample size is 300. For simplicity, it is assumed that there is no attrition. The maximum follow-up period for each subject is 12-months. (Note: this is slightly different from Broglio et al. (2014), as they assumed a follow-up period of 12-months on all subjects after accrual is complete.) Thus, if accrual is stopped early for predicted success or the trial continues accrual to the maximum sample size of 300 patients, the primary analysis of OS will be conducted after each subject has completed 12-months of follow-up.
From this information, we have:
block = 2
and
rand_ratio = c(1, 1)
(default parameters)end_of_study = 12
hazard_control = prop_to_haz(1 - 0.30, endtime = 12)
(note
that the input argument is the failure proportion, not the survival
proportion)cutpoint = 0
(default
parameter)N_total = 300
prop_loss = 0
Sample size selection analyses are planned starting when 100 patients are enrolled and after every additional 25 patients are enrolled. Early stopping for futility is allowed starting with the 100 patient sample size selection analysis and \(F_n\) is 10%. Stopping accrual early for predicted success is only allowed starting with the 200 patient sample size selection analysis and \(S_n\) is 90%. It is expected that an average of 5 patients per month will be enrolled, with no change in speed for the duration of the trial.
From this information, we have:
interim_look = seq(100, 275, 25)
Fn = rep(0.10, 8)
Sn = c(1, rep(0.9, 7))
.lambda = 5
and lambda_time = NULL
(default
parameter)Note that the first value of Sn
is 1. This is because
the trial is not allowed to stop for predicted success at the first
interim analysis of \(n = 100\). The
remaining elements of Sn
are 0.9, corresponding to 90%.
The primary analysis is a two-sided log-rank test, with success declared at the \(\alpha = 0.05\) level.
From this information, we have:
alternative = "two.sided"
and method = "logrank"
prob_ha = 0.95
Note that prob_ha
is set as \(1 - 0.05\). This allows us to interchange
between test, including Bayesian tests (method = "bayes"
),
which requires an analogous posterior probability threshold. The
parameter h0
is ignored when using a log-rank test, as it
is not meaningful to have success margins.
The operating characteristics will be determined using 500 simulated trials. At each interim analysis, we will use 100 imputations and assume independent weakly-informative Gamma(0.1, 0.1) prior distributions for the treatment and control arm event time hazard rate parameters. As this is computationally expensive overall, we will exploit the option to parallelize the simulations over multiple cores.
N_trials = 500
N_impute = 100
prior = c(0.1, 0.1)
ncores = 8
(note: this is not
currently possible on Windows machines)Similar to above, the parameter N_mcmc
is not required
when using a log-rank test, meaning we do not need to enter a value for
this argument. Since we do not allow for attrition, the data at the
final analysis will be complete, and we can set
imputed_final = FALSE
. If attrition occurred and we planned
to impute the final analysis dataset, we could change this to
imputed_final = TRUE
.
Initially, we want to determine the power to detect a significant treatment effect when the OS rate at 12-months for the treatment arm is 50%.
hc <- prop_to_haz(0.7, endtime = 12)
ht <- prop_to_haz(0.5, endtime = 12)
out_power <- sim_trials(
hazard_treatment = ht,
hazard_control = hc,
cutpoint = 0,
N_total = 300,
lambda = 5,
lambda_time = 0,
interim_look = seq(100, 275, 25),
end_of_study = 12,
prior = c(0.1, 0.1),
block = 2,
rand_ratio = c(1, 1),
prop_loss = 0,
alternative = "two.sided",
Fn = rep(0.10, 8),
Sn = c(1, rep(0.9, 7)),
prob_ha = 0.95,
N_impute = 100,
N_trials = 500,
method = "logrank",
ncores = 8)
The simulations take approximately 3 minutes to run on 2 GHz Quad-Core Intel i5 MacBook Pro.
It is straightforward to calculate the type I error under this
design. The only change required is to set the
hazard_treatment
argument to the same as the
hazard_control
argument (i.e. the null case). We can make
use of the update()
function to avoid having to type
everything else over again.
summarise_sims(list(out_power$sims, out_t1error$sims))
#> # A tibble: 2 × 8
#> scenario power stop_success stop_futility stop_max_N mean_N sd_N
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.934 0.916 0.0400 0.044 171. 53.3
#> 2 2 0.068 0.0560 0.846 0.0980 210. 47.6
#> # ℹ 1 more variable: stop_and_fail <dbl>
The type I error under this design is slightly too large to be
considered acceptable. This was to be expected, since we kept the \(P\)-value threshold as 0.05 despite having
multiple interim looks. However, we note that only simulated
N_trials = 500
trials, meaning if the type I error was
truly 0.05, then values in the interval
(0.05 + c(-1, 1) * 1.96 * sqrt(0.05 * (1 - 0.05) / 500)
)
would be consistent with this.
In practice, we need to use a more stringent threshold in order to
control the overall type I error. This can be achieved by trial and
error. For example, if we use \(P <
0.04\) (applied using the argument prob_ha = 0.96
),
we find the operating characteristics are more acceptable.
out_power2 <- update(out_power, prob_ha = 0.96)
out_t1error2 <- update(out_power2, hazard_treatment = hc)
summarise_sims(list(out_power2$sims, out_t1error2$sims))
#> # A tibble: 2 × 8
#> scenario power stop_success stop_futility stop_max_N mean_N sd_N
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.92 0.89 0.0440 0.0660 175. 56.2
#> 2 2 0.048 0.0420 0.868 0.09 203. 46.8
#> # ℹ 1 more variable: stop_and_fail <dbl>
Assuming the treatment arm has an OS rate of 50% at 12-months, the trial would be expected to stop early 89% of time, with an average sample size of 175. Overall, the power is 92%. Conversely, if the treatment arm OS rate is the same as the control arm, 87% of the trials stopped early for expected futility.
Once we have identified a suitable design, we would typically re-run the simulations using a larger number of simulations and, perhaps, imputations.
Broglio KR, Connor JT, Berry SM. Not too big, not too small: a Goldilocks approach to sample size selection. Journal of Biopharmaceutical Statistics, 2014; 24(3): 685–705.