Example: GeneSearch Breast Lymph Node Assay

Problem

Broglio et al. (2014) describe an example diagnostic study from Julian et al. (2008). This was a single-arm diagnostic study of the GeneSearch Breast Lymph Node (BLN) Assay for patients with breast cancer undergoing sentinel lymph node (SLN) biopsy. The new BLN test yielded a result during the surgery, i.e. intra-operatively, which yielded a positive or negative test for metastasis. The gold standard reference test was histologic evaluation. Whilst histologic evaluation may not be known immediately, we assume it is available soon after, meaning both test results (BLN and reference) are known immediately.

The trial would be declared successful if the new test sensitivity (\(\pi_1\)) and specificity (\(\pi_0\)) satisfy the following posterior probability inequalities:

\[ P[\pi_1 > 0.70 | \text{Data}] \ge 0.985 \, \text{ and } P[\pi_0 > 0.90 | \text{Data}] \ge 0.985. \]

Based on this, we know that

sens_pg <- 0.7
spec_pg <- 0.9

succ_sens <- 0.985,
succ_spec <- 0.985,

endpoint <- "both"

The thresholds of 0.985 were reported as being chosen to control the overall type I error at <5%. In general, when there are interim looks at the data, a type I error penalty must be paid. The choice of thresholds can be determined using simulation (in the null model space) at multiple sequential thresholds, and identify which threshold(s) lead to type I error below the required value.

Their study did not specify what the assumed sensitivity, specificity, or true prevalence was. From the pilot (beta) study reported by Julian et al. (2008), we assume that the true values are:

sens_true <- 0.824
spec_true <- 0.963
prev_true <- 0.2

There was no prevalence value reported from the pilot study, so we assume prev_true <- 0.20.

The prior distributions for prior_sens, prior_spec, and prior_prev, are all assumed to be independent \(\text{Beta}(0.1, 0.1)\). This is passed as:

prior_sens <- c(0.1, 0.1)
prior_spec <- c(0.1, 0.1)
prior_prev <- c(0.1, 0.1)

The trial sample size selection analyses were planned starting when \(n=200\) patients were enrolled, and after every additional \(n=50\) patients, up to a maximum of \(n=700\). Hence, we can set:

n_at_looks <- seq.int(200, 700, by = 50)

Simulation

We can simulate trials under this design using the adaptDiag package function multi_trial() function as follows:

library(adaptDiag)

fit_power <- multi_trial(
  sens_true = 0.824,
  spec_true = 0.963,
  prev_true = 0.20,
  endpoint = "both",
  sens_pg = 0.7,
  spec_pg = 0.9,
  prior_sens = c(0.1, 0.1),
  prior_spec = c(0.1, 0.1),
  prior_prev = c(0.1, 0.1),
  succ_sens = 0.985,
  succ_spec = 0.985,
  n_at_looks = seq(200, 700, 50),
  n_mc = 10000,
  n_trials = 200,
  ncores = 1L)

Here, we have only simulated 200 trials here, using a total of 1 core. In general, n_trials should be larger, and ncores should be set as large as possible dependent on the machine being used to perform the calculations. The Monte Carlo integration performed for futility assessment is based on n_mc = 10000 draws.

To determine the operating characteristics, we need to also specify the futility stopping rule. That is, we wish to calculate the posterior predictive probability of eventual trial success at the maximum sample size. Using conjugacy, the posterior predictive distribution of cases is Beta-Binomial (Broglio et al. 2014). For the GeneSearch BLN Assay study, the futility stopping rule was set at 5% – that is, if the posterior predictive probability of eventual study success is <0.05, the study will terminate early.

We add in an additional criteria that the study cannot stop for early success or futilty until a fixed number of reference positive cases are observed. We implement these 2 criteria and extract the operating characteristic as follows:

summarise_trials(fit_power, min_pos = 30, fut = 0.05)
#>                    n
#> decision            200 250 300 350 400 450 500 550 600 650 700
#>   early win          64  37  25  13  15  11   5   8   5   1   0
#>   late win            0   0   0   0   0   0   0   0   0   0   2
#>   no stopping         0   0   0   0   0   0   0   0   0   0   1
#>   stop for futility   1   2   1   0   1   1   0   1   4   2   0
#>   power stop_futility n_avg      sens      spec mean_pos
#> 1  0.93         0.065   319 0.8439725 0.9636661    63.73

The printed output shows a table with columns listing the sample size looks, and rows listing the reasons for the trial stopping. Below this table we also see estimates of the study power, average sample size, proportion of trials stopping for futility, as well as posterior estimates of the sensitivity and specificity.

We also wish to evaluate the type I error. We can straightforwardly calculate this by assuming the true values are equal to the null.

fit_type1 <- update(fit_power,
                    sens_true = 0.7,
                    spec_true = 0.9)

summarise_trials(fit_type1, min_pos = 30, fut = 0.05)
#>                    n
#> decision            200 250 300 350 400 450 500 600
#>   stop for futility 156  18   9   7   5   1   1   3
#>   power stop_futility n_avg      sens      spec mean_pos
#> 1     0             1   228 0.6928179 0.9012423   45.405

The operating characteristics show the type I error is well controlled at the 5% level. Note: in practice, one should run many simulations (at least 10,000) to get accurate estimates of the type I error.

References

Broglio KR, Connor JT, Berry SM. Not too big, not too small: a Goldilocks approach to sample size selection. Journal of Biopharmaceutical Statistics, 2014; 24(3): 685–705.

Julian TB, Blumencranz P, Deck K, Whitworth P, Berry DA, Berry SM, Rosenberg A, et al. Novel intraoperative molecular test for sentinel lymph node metastases in patients with early-stage breast cancer. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology, 2008; 26(20): 3338–3345.