Introduction

Risk scores predicting future events are widely used in many fields. A predictive risk score is usually deployed with the intent of avoiding the events it predicts (or, occasionally, bringing them about): the score is used to plan anticipatory interventions which avert the event efficiently (e.g., @hippisley17; @nashef12; @koopman08).

Such risk-score-directed anticipatory action means that, when analysed retrospectively, the risk score can appear to perform poorly, since high-risk samples have undergone intervention directed by the score itself [@lenert19; @sperrin19]. This makes updating risk scores difficult [@perdomo20]. A simple fix allowing risk scores to be updated safely is to withhold risk score generation and distribution from a random subset of the population, on whom an updated risk score can then be trained [@liley21].

Such a 'hold-out' set of individuals should be neither too small, since the risk score trained on it would then be excessively inaccurate, nor too large, since individuals in the hold-out set forgo any benefit of the risk score. Our manuscript (partly) and this vignette (predominantly) serve to demonstrate the emergence of an optimal holdout set size under 'natural' conditions. We will also demonstrate the dependence of this optimal holdout size on system parameters, particularly the 'learning curve' of the risk score.
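To fix intuition: write the total cost as \(L(n) = k_1 n + k_2(n)(N-n)\), where \(k_1\) is the mean per-sample cost in the hold-out set (under baseline care) and \(k_2(n)\) is the mean per-sample cost in the intervention set when the risk score is trained on \(n\) samples. If \(k_2(n)\) decays towards an asymptote as \(n\) grows, \(L(n)\) can have an interior minimum. A minimal base-R sketch, assuming a purely illustrative power-law learning curve with made-up parameter values:

# Illustrative only: assumed power-law learning curve k2(n) = a*n^(-b) + c
N <- 5000; k1 <- 0.4                 # hypothetical per-sample holdout cost
a <- 3; b <- 0.7; c0 <- 0.3          # hypothetical learning-curve parameters
L <- function(n) k1*n + (a*n^(-b) + c0)*(N - n)
optimize(L, interval = c(10, N/2))   # interior minimum: the optimal holdout size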

Setup

Assumptions

Suppose we begin with a population of \(N\) samples on which we observe \(p\) covariates, represented by an \(N \times p\) matrix \(X\). Each sample has an outcome of either 1 or 0. We denote by \(Y\) the set of \(N\) outcomes which would occur should no risk-score-guided intervention take place. We presume that our \(N\) samples represent our entire population, so we have the option of fitting a risk score to some 'hold-out' subset of the \(N\) samples, and using this risk score to pre-empt the outcome on the remaining 'intervention' subset.

Although this obviously does not make sense temporally, we can think of the 'intervention' set as coming from a subsequent 'epoch': a period during which a particular risk score is in use. In each epoch, we hold out a fixed proportion of the \(N\) samples as a hold-out set.

We presume that we may take a binary action to attempt to avoid the event \(Y=1\), and that we take this action if the assessed risk is \(>0.5\). The eventual cost for any sample can thus be summarised with a contingency table (true positives, false positives, and so on). We assume that in the hold-out set, some better-than-random 'baseline' assessment of risk is made, corresponding to whatever the protocol was before a risk score was introduced. For this better-than-random assessment, we use the Bayes-optimal (oracle) predictor on a single covariate.
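For concreteness, the total cost of a set of predictions is the sum of the elementwise product of the confusion matrix with a cost matrix. A small worked example, using the per-outcome costs defined in the setup below (c_tn = 0, c_tp = 0.5, c_fp = 0.5, c_fn = 1) and a hypothetical confusion matrix:

# Hypothetical confusion matrix: rows = true class (0,1), columns = predicted class (0,1)
confus <- matrix(c(800, 40,     # 800 true negatives,  40 false positives
                    60, 100),   #  60 false negatives, 100 true positives
                 nrow = 2, byrow = TRUE)
cost_mat <- rbind(c(0, 0.5),    # c_tn, c_fp
                  c(1, 0.5))    # c_fn, c_tp
sum(confus * cost_mat)          # total cost: 40*0.5 + 60*1 + 100*0.5 = 130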

We consider an underlying generative model for \(f(x)=\textrm{logit}\{P(Y=1|X=x)\}\) as either

(a) linear, or
(b) linear with ten added interaction terms

in \(x\). We consider risk scores fitted using either

(i) logistic regression, or
(ii) a random forest.

Model type (i) can approximate the oracle (Bayes-optimal) predictor for \(Y|X\) arbitrarily well in setting (a) but not in setting (b). Model type (ii) can approximate the Bayes-optimal predictor arbitrarily well in either setting, but will be less accurate than model type (i) in setting (a).
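As a minimal single-draw sketch of these combinations, using the same package helpers as the full simulation below (here setting (b) with model (ii); swapping ninters = 0 and model_family = "log_reg" gives the other combinations):

# Single draw from setting (b), fitted with model (ii)
library("OptHoldoutSize")
set.seed(1)
X <- gen_preds(nobs = 1000, npreds = 7, ninters = 10)   # 10 interaction terms: setting (b)
coefs <- c(rnorm(7, sd = 1/sqrt(7)), rnorm(10, sd = 2/sqrt(7)))
Y <- gen_resp(X, coefs = coefs)$classes
dat <- cbind(X[, 1:7], Y = factor(Y))                   # models see main effects only
fit <- model_train(dat, model_family = "rand_forest")   # model (ii); "log_reg" for (i)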

General setup

We briefly set up the general parameters for this simulation:


# Set random seed
seed=1234
set.seed(seed)

# Packages
library("ranger")            # Random forests
library("matrixStats")       # Column-wise SDs (colSds), used in plotting code below
library("OptHoldoutSize")    # Functions for OHS estimation

# Initialisation of patient data
n_iter <- 500           # Number of point estimates to be calculated
nobs <- 5000            # Number of observations, i.e patients (N)
npreds <- 7             # Number of predictors (p)

# Parameters to loop over
families <- c("log_reg","rand_forest")   # Model family: (i,ii) respectively
interactions <- c(0,10)                  # Number of interaction terms: (a,b) respectively

# The baseline assessment uses an oracle (Bayes-optimal) predictor on max_base_powers covariates.
max_base_powers <- 1 

# Check holdout sizes within these two proportions of total
min_frac <- 0.02
max_frac <- 0.15
num_sizes <- 50

# Fraction of patients assigned to the holdout set
frac_ho <- seq(min_frac, max_frac, length.out = num_sizes)


# Cost function: total cost is defined from the contingency table, as follows:
c_tn <- 0 # Cost of true neg
c_tp <- 0.5 # Cost of true pos
c_fp <- 0.5 # Cost of false pos
c_fn <- 1 # Cost of false neg

Results

Emergence of optimal holdout size

For each holdout size, we simulate 500 datasets \(X,Y\), using the same underlying function for \(P(Y=1|X=x)\) (that is, the same coefficients) and evaluate the costs in the holdout and intervention sets and in total.

Detailed code to run the simulation is shown below (not run).



# Cost of true negative, false positive etc for computing total cost
cost_mat <- rbind(c(c_tn, c_fp), c(c_fn, c_tp))

# Generate coefficients for underlying model and for predictions in h.o. set
set.seed(seed + 1)

# Set ground truth coefficients, and the accuracy at baseline
coefs_general <- rnorm(npreds,sd=1/sqrt(npreds))
coefs_base <- gen_base_coefs(coefs_general, max_base_powers = max_base_powers)

# Run simulation
old_progress=0
for (ninters in interactions) {

  # If ninters>0, append coefficients for interaction terms to coefficients
  if (ninters>0)
    coefs_generate=c(coefs_general,rnorm(ninters,sd=2/sqrt(npreds)))
  else
    coefs_generate=coefs_general

  for (family in families) {

    # Index of baseline covariate set (only one, since max_base_powers = 1)
    base_vars <- 1

    ## Initialise arrays to populate

    # Record total cost in intervention and holdout sets of various sizes over n_iter resamplings
    costs_inter_resample <- costs_ho_resample <- matrix(nrow = n_iter, ncol = num_sizes)

    # Record total cost (cost in intervention set + cost in holdout set)
    costs_tot_resample <- array(0, dim = c(max_base_powers, n_iter, num_sizes))

    # Standard deviations of cost estimates at each holdout size
    costs_sd <- array(0, dim = c(max_base_powers, num_sizes))


    # Sweep through h.o. set sizes of interest
    for (i in 1:num_sizes) {

      # Sweep through resamplings
      for (b in 1:n_iter) {

        progress <- 100 * (((i - 1) * n_iter) + b) / (num_sizes * n_iter)
        if (abs(floor(old_progress) - floor(progress)) > 0) {
          cat(floor(progress), "%\n")
        }

        # Set random seed
        set.seed(seed + b + i*n_iter)

        # Generate dataset
        X <- gen_preds(nobs, npreds, ninters)

        # Generate labels
        newdata <- gen_resp(X, coefs = coefs_generate)
        Y <- newdata$classes

        # This contains only non-interaction columns and is used to train models.
        Xtrain <- X[,1:npreds]

        # Combined dataset
        pat_data <- cbind(Xtrain, Y)
        pat_data$Y = factor(pat_data$Y)

        # For each holdout size, split data into intervention and holdout set
        mask <- split_data(pat_data, frac_ho[i])
        data_interv <- pat_data[!mask,]
        data_hold <- pat_data[mask,]

        # Train model
        trained_model <- model_train(data_hold, model_family = family)

        # Predict
        class_pred <- model_predict(data_interv, trained_model,
                                    return_type = "class",
                                    threshold = 0.5, model_family = family)

        # Predictions made at baseline (oracle/Bayes optimal predictor on subset of covariates)
        base_pred <- oracle_pred(data_hold, coefs_base[base_vars, ])


        ## Generalised cost function with a specific cost for fn, fp, tp and tn

        # Generate confusion matrices
        confus_inter <- table(factor(data_interv$Y, levels=0:1),
                              factor(class_pred, levels=0:1))

        confus_hold <- table(factor(data_hold$Y, levels=0:1),
                             factor(base_pred, levels=0:1))

        costs_inter_resample[b, i] <- sum(confus_inter * cost_mat)
        costs_ho_resample[b, i] <- sum(confus_hold * cost_mat)
        costs_tot_resample[base_vars, b, i] <- costs_ho_resample[b, i] + costs_inter_resample[b, i]

        # Progress
        old_progress <- progress
      }

    }

    # Calculate Standard Deviations and Errors
    for (base_vars in 1:max_base_powers) {
      costs_sd[base_vars, ] <-
        apply(costs_tot_resample[base_vars, , ], 2, sd)
    }
    costs_se <- costs_sd / sqrt(n_iter)

    # Store data for this model family and number of interactions
    data_list=list(max_base_powers=max_base_powers,
                   frac_ho=frac_ho,
                   costs_tot_resample=costs_tot_resample,
                   costs_ho_resample=costs_ho_resample,
                   costs_inter_resample=costs_inter_resample,
                   base_vars=base_vars,
                   costs_sd=costs_sd,
                   costs_se=costs_se)
    assign(paste0("data_",family,"_inter",ninters),data_list)
  }
}

# Collate and save all data
data_example_simulation=list()
ind=1
for (family in families) {
  for (ninters in interactions) {
    xname=paste0("data_",family,"_inter",ninters)
    data_example_simulation[[ind]]=get(xname)
    names(data_example_simulation)[ind]=xname
    ind=ind+1
  }
}
save(data_example_simulation,file="data/data_example_simulation.RData")

The following figure shows the emergence of an optimal holdout size in setting (a), model (i). The leftmost panel (panel 1) shows the empirical cost function \(L(n)\) and its standard deviation. The two rightmost panels (panels 2 and 3) show the constituents of \(L(n)\): the contribution \(k_2(n)(N-n)\) from the intervention set and the contribution \(k_1 n\) from the holdout set. The middle panel additionally shows the empirical function \(k_2(n)\), effectively the 'learning curve' of the predictive score.

It is evident that the location of the optimal holdout size depends on the balance between the slopes of the blue curves in panels 2 and 3.
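As a quick check, the empirical minimiser of \(L(n)\) can be read off directly from the simulated costs (assuming data_example_simulation is loaded and nobs = 5000, as in the code below):

# Empirical optimal holdout size for setting (a), model (i)
data(data_example_simulation)
X <- data_example_simulation$data_log_reg_inter0
n_ho <- X$frac_ho * 5000                          # nobs = 5000, as above
mean_cost <- colMeans(X$costs_tot_resample[1, , ])
n_ho[which.min(mean_cost)]                        # holdout size minimising mean total cost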

(Figure: emergence of the optimal holdout size. Panel 1: empirical cost function \(L(n)\) with SD band. Panel 2: intervention-set cost \(k_2(n)(N-n)\), with the learning curve \(k_2(n)\) on the right axis. Panel 3: holdout-set cost \(k_1 n\).)



# Load data
data(data_example_simulation)
X=data_example_simulation$data_log_reg_inter0
for (i in 1:length(X)) assign(names(X)[i],X[[i]])


# Holdout set size in absolute terms
n_ho=frac_ho*nobs

# Colour intensities
r_alpha=0.2
b_alpha=0.6


oldpar=par(mfrow=c(1,3))



## Draw cost function

# Cost values
l_n=colMeans(costs_tot_resample[1,,])
l_sd=costs_sd[1, ]

# Initialise plot for cost function
oldpar1=par(mar=c(4,4,3,4))
plot(0, 0, type = "n",
     ylab = "L(n)",
     xlab = "Holdout set size (n)",
     xlim=range(n_ho),
     ylim = range(l_n + 2*l_sd*c(1,-1),na.rm=T)
)

# Plot Cost line
lines(n_ho,l_n,pch = 16,lwd = 1,col = "blue")

# Standard Deviation Bands
polygon(c(n_ho, rev(n_ho)),
        c(l_n - l_sd,
          rev(l_n + l_sd)),
        col = rgb(0,0,1,alpha=b_alpha),
        border = NA)

# Add legend
legend("topright", legend = c("L(n)", "SD"),
       col = c("blue",rgb(0,0,1, alpha = b_alpha)),
       lty=c(1,NA),
       pch=c(NA,16),
       bty="n",bg="white",
       ncol=1
)

par(oldpar1)





## Draw k2 function

# Evaluate k2(n)*(N-n)
k2n=colMeans(costs_inter_resample)
sd_k2n=colSds(costs_inter_resample)

# Evaluate k2
k2=k2n/(nobs-n_ho)
sd_k2=sd_k2n/(nobs-n_ho)

# Scaling factor (for plotting)
sc=mean(nobs-n_ho)


# Initialise plot for k2 function
oldpar2=par(mar=c(4,4,3,4))
yr_k2=range(k2n+3*sd_k2n*c(1,-1),na.rm=F)
plot(0, 0, type = "n",
     ylab = expression(paste("k"[2],"(n)(N-n)")),
     xlab = "Holdout set size (n)",
     xlim=range(n_ho),
     ylim = yr_k2
)

# Line for estimated k2(n)
lines(n_ho,k2*sc,col="red",lty=2)

# Add standard deviation of k2(n). 
polygon(c(n_ho, rev(n_ho)),
        c(k2 + sd_k2,rev(k2-sd_k2))*sc,
        col = rgb(1,0,0,alpha=r_alpha),
        border = NA)

axis(4,at=sc*pretty(yr_k2/sc),labels=pretty(yr_k2/sc))
mtext(expression(paste("k"[2],"(n)")), side=4, line=3,cex=0.5)

# Line for estimated k2(n)*(N-n)
lines(n_ho,k2n,col="blue",lty=2)

# Add standard deviation for k2(n)*(N-n)
polygon(c(n_ho, rev(n_ho)),
        c(k2n + sd_k2n,rev(k2n-sd_k2n)),
        col = rgb(0,0,1,alpha=b_alpha),
        border = NA)

# Add legend
legend("topright", legend = c(expression(paste("k"[2],"(n)(N-n)")), "SD",
                              expression(paste("k"[2],"(n)")), "SD"),
       col = c("blue",rgb(0,0,1, alpha = b_alpha),
               "red",rgb(1,0,0, alpha = r_alpha)),
       lty=c(2,NA,2,NA),
       pch=c(NA,16,NA,16),
       bty="n",bg="white",
       ncol=2
)

par(oldpar2)






## Draw k1*n function

# Evaluate k1*n (empirical holdout-set cost)
k1n=colMeans(costs_ho_resample)
sd_k1n=colSds(costs_ho_resample)

# Initialise plot for k1*n function
oldpar3=par(mar=c(4,4,3,4))
plot(0, 0, type = "n",
     ylab = expression(paste("k"[1],"n")),
     xlab = "Holdout set size (n)",
     xlim=range(n_ho),
     ylim = range(k1n+3*sd_k1n*c(1,-1),na.rm=F)
)

# Line for estimated k1*n
lines(n_ho,k1n,col="blue",lty=2)

# Add standard deviation band
polygon(c(n_ho, rev(n_ho)),
        c(k1n + sd_k1n,rev(k1n-sd_k1n)),
        col = rgb(0,0,1,alpha=b_alpha),
        border = NA)

# Add legend
legend("topleft", legend = c(expression(paste("k"[1],"n")), "SD"),
       col = c("blue",rgb(0,0,1, alpha = b_alpha)),
       lty=c(2,NA),
       pch=c(NA,16),
       bty="n",bg="white",
       ncol=1
)
par(oldpar3)

par(oldpar)

Behaviour of empirical cost function and optimal holdout size

The following figure shows empirical cost functions \(L(n)\) and empirical curves \(k_2(n)\) using both underlying models ((a) and (b)) and risk score types ((i) and (ii)). Parametric approximations to empirical cost functions and empirical \(k_2(n)\) are additionally shown, as well as optimal holdout size and minimum cost estimates derived from these approximations.

It is evident that the random forest (risk score (ii)) leads to better performance in setting (b) but worse performance in setting (a). A more accurate risk score does not necessarily lead to a lower optimal holdout size.

It should be noted that the optimal holdout size often occurs at a value of \(n\) at which the performance of the risk score is substantially worse than its optimal performance.
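The parametric approximations are produced by fitting a power law \(k_2(n) \approx a n^{-b} + c\) to the empirical learning curve with powersolve(), then minimising \(L(n) = k_1 n + k_2(n)(N-n)\). A sketch of this procedure, using base R's optimize() for the final minimisation (iterating powersolve can refine the fit, and the plotting code below instead takes the minimum over the evaluated grid):

# Parametric OHS estimate for setting (a), model (i); assumes nobs = 5000 as above
data(data_example_simulation)
X <- data_example_simulation$data_log_reg_inter0
n_ho <- X$frac_ho * 5000
k2_emp <- colMeans(X$costs_inter_resample)/(5000 - n_ho)  # empirical k2(n)
theta <- powersolve(n_ho, k2_emp, init = c(1, 1, 1))$par  # fit k2(n) ~ a*n^(-b) + c
k1 <- median(colMeans(X$costs_ho_resample)/n_ho)          # per-sample holdout cost
L_fit <- function(n) k1*n + (theta[1]*n^(-theta[2]) + theta[3])*(5000 - n)
optimize(L_fit, interval = range(n_ho))                   # OHS and minimum fitted cost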

(Figure: empirical and fitted cost functions \(L(n)\) and learning curves \(k_2(n)\) for each combination of setting (a)/(b) and model (i)/(ii), with estimated optimal holdout size and minimum cost marked.)


data(data_example_simulation)

oldpar=par(mfrow=c(2,2))


# Loop through values of 'families' and 'interactions'
for (xf in 1:length(families)) {
  for (xi in 1:length(interactions)) {
    family=families[xf]
    ninters=interactions[xi]

    X=data_example_simulation[[paste0("data_",family,"_inter",ninters)]]
    for (i in 1:length(X)) assign(names(X)[i],X[[i]])

    # Plot title
    if (xi==1 & xf==1) ptitle="Setting (a), model (i)"
    if (xi==2 & xf==1) ptitle="Setting (b), model (i)"
    if (xi==1 & xf==2) ptitle="Setting (a), model (ii)"
    if (xi==2 & xf==2) ptitle="Setting (b), model (ii)"

    # Y axis range
    if (xi==1) yl=c(2100,2350) else yl=c(2450,2700)

    # Holdout set size in absolute terms
    n_ho=frac_ho*nobs

    # Colour intensities
    r_alpha=0.2
    b_alpha=0.6

    # Create figure
    oldpar1=par(mar=c(4,4,3,4))
    plot(0, 0, type = "n",
         ylab = "L(n)",
         xlab = "Holdout set size (n)",
         main=ptitle,
         xlim=range(n_ho),
         ylim = yl
    )

    ### Draw learning curve with axis on right

    # Compute
    k2=colMeans(costs_inter_resample)/(nobs-n_ho)

    # Scaling
    if (xi==2) scvec=c(0.48,20) else scvec=c(0.42,20)
    k2_sc=function(x) yl[1] + (yl[2]-yl[1])*(x-scvec[1])*scvec[2] # Scaling transform
    i_k2_sc=function(x) (x-yl[1])/(scvec[2]*(yl[2]-yl[1])) + scvec[1] # Inverse

    # Add standard deviation band for k2(n); costs are divided by (nobs-n_ho)
    sd_k2=colSds(costs_inter_resample)/(nobs-n_ho)
    polygon(c(n_ho, rev(n_ho)),
            k2_sc(c(k2 + sd_k2,rev(k2-sd_k2))),
            col = rgb(1,0,0,alpha=r_alpha),
            border = NA)

    # Add axis and label
    axis(4,at=k2_sc(pretty(i_k2_sc(yl))),labels=pretty(i_k2_sc(yl)))
    mtext(expression(paste("k"[2],"(n)")), side=4, line=3)

    # Estimate power-law parameters theta for k2(n) ~ a*n^(-b) + c
    theta=c(1,1,1);
    for (i in 1:5) theta=powersolve(n_ho,k2,init=theta)$par
    k1=median(colMeans(costs_ho_resample)/(nobs*frac_ho))
    k2=theta[1]*(n_ho)^(-theta[2]) + theta[3]

    # Line for estimated k2 curve
    lines(n_ho,
          k2_sc(k2),
          col="red",lty=2)

    # Line for estimated cost function
    xcost=n_ho*k1 + k2*(nobs-n_ho)
    lines(n_ho,
          xcost,
          col="blue",lty=2)

    # Add minimum L and OHS
    ohs=n_ho[which.min(xcost)]
    minL=min(xcost)
    lines(c(ohs,ohs),c(0,minL),lty=2)
    abline(h=minL,lty=2)
    points(ohs,minL,pch=16)

    # Finally, add cost function (on top)
    # Plot Cost line
    lines(n_ho,
          colMeans(costs_tot_resample[base_vars, , ]),
          pch = 16,
          lwd = 1,
          col = "blue")

    # Standard Deviation Bands
    polygon(c(n_ho, rev(n_ho)),
            c(colMeans(costs_tot_resample[1,,]) - costs_sd[1, ],
              rev(colMeans(costs_tot_resample[1,,]) + costs_sd[1, ])),
            col = rgb(0,0,1,alpha=b_alpha),
            border = NA)

    # Add legend
    legend("topright", legend = c("L(n)", "SD","Fitted",
                                  "OHS",
                                  expression(paste("k"[2],"(n)")), "SD","Fitted",
                                  "Min. L"),
           col = c("blue",rgb(0,0,1, alpha = b_alpha),"blue","black",
                   "red",rgb(1,0,0, alpha = r_alpha),"red","black"),
           lty=c(1,NA,2,NA,
                 1,NA,2,2),
           pch=c(NA,16,NA,16,
                 NA,16,NA,NA),
           bty="n",bg="white",
           ncol=2
    )

    par(oldpar1)
  }
}

par(oldpar)


References