Primary and Secondary Analysis for Micro-Randomized Trial (MRT)

Tianchen Qian (t.qian@uci.edu), Shaolin Xiang, Zhaoxi Cheng

2023-07-01

Introduction

The MRTAnalysis package provides user-friendly functions to conduct primary and secondary analyses for micro-randomized trial (MRT), where the proximal outcomes are continuous or binary and the intervention option is binary. For continuous outcomes, the estimated causal effects are on the additive scale. For binary outcomes, the estimated causal effects are on the log relative risk scale. In particular, this package can be used to

MRT is an experimental design for optimizing mobile health interventions. The marginal and the moderated causal excursion effects are the quantities of interest in primary and secondary analyses for MRT. In this tutorial we briefly review MRT and causal excursion effects, and illustrate the use of the estimators implemented in this package for conducting primary and secondary analyses of MRT: wcls() (weighted and centered least squares, for MRT with continuous outcomes) and emee() (estimator for marginal excursion effect, for MRT with binary outcome).

Data Structure of an MRT

In an MRT, each participant is repeatedly randomized among treatment options many times throughout the trial. Suppose there are \(n\) participants, and for the \(i\)-th participant, they are enrolled in the trial for \(m_i\) decision points. (In many MRT, the number of decision points \(m_i\) is the same for all participants. This package also automatically handles the setting where \(m_i\) is different for different participants, so we present the data structure in a more general way.)

For the \(i\)-th participant at the \(t\)-th decision point, we use the triplet \((X_{it}, A_{it}, Y_{it})\) to denote the data collected, where

Availability

An MRT may include availability considerations. When it is inappropriate or unethical to deliver interventions to an individual, that individual is considered “unavailable”, and no intervention will be delivered at that decision point so \(A_{it} = 0\).

Mathematically, we use \(I_{it}\) denotes the availability status of participant \(i\) at decision point \(t\): \(I_{it} = 1\) if available and \(I_{it} = 0\) if unavailable. Temporal-wise, availability \(I_{it}\) is determined before \(A_{it}\), and one can conceptualize it by considering \(I_{it}\) to be measured at the same time as \(X_{it}\).

Prepare Your Data Set for Analysis

To use any of the estimators in this package, you need to prepare your data set as a data.frame in long format, meaning that each row records observations from a decision point of a participant (i.e., \((X_{it}, A_{it}, Y_{it})\) (and possibly other variables — see below) for a specific \((i,t)\) combination). The data.frame should be sorted so that consecutive rows should be from adjacent decision points from the same participant. Furthermore, the data set should contain the following columns:

The data set may also contain the following columns, depending on your MRT:

Causal Excursion Effect Estimation for MRT with Continuous Outcomes

library(MRTAnalysis)
current_options <- options(digits = 3) # save current options for restoring later

Fully Marginal Causal Excursion Effect

The following code uses wcls() to estimate the fully marginal causal excursion effect from a synthetic data set data_mimicHeartSteps that mimics the HeartSteps V1 MRT (Klasnja and others (2015)). This data set contains observations for 37 individuals at 210 different time points. The data set contains the following variables:

A summary of data_mimicHeartSteps is as follows:

head(data_mimicHeartSteps, 10)
#>    userid decision_point day_in_study logstep_30min logstep_30min_lag1
#> 1       1              1            0        2.3902             0.0000
#> 2       1              2            0       -0.6931             2.3902
#> 3       1              3            0        2.4647            -0.6931
#> 4       1              4            0        0.1207             2.4647
#> 5       1              5            0        0.8322             0.1207
#> 6       1              6            1        1.8450             0.8322
#> 7       1              7            1        4.6315             1.8450
#> 8       1              8            1        4.1929             4.6315
#> 9       1              9            1       -0.0344             4.1929
#> 10      1             10            1       -0.1495            -0.0344
#>    logstep_pre30min is_at_home_or_work intervention rand_prob avail
#> 1            -0.693                  1            0       0.6     0
#> 2             2.196                  1            0       0.6     1
#> 3             4.589                  1            1       0.6     1
#> 4             3.179                  1            1       0.6     1
#> 5             3.295                  0            0       0.6     0
#> 6             4.666                  1            0       0.6     0
#> 7             3.774                  0            0       0.6     1
#> 8            -0.693                  1            1       0.6     1
#> 9            -0.693                  0            1       0.6     1
#> 10           -0.693                  1            1       0.6     1

In the following function call of wcls(), we specify the proximal outcome variable by outcome = logstep_30min. We specify the treatment variable by treatment = intervention. We specify the constant randomization probability by rand_prob = 0.6. We specify the fully marginal effect as the quantity to be estimated by setting moderator_formula = ~1. We use logstep_pre30min as a control variable by setting control_formula = ~logstep_pre30min. We specify the availability variable by availability = avail.

fit1 <- wcls(
    data = data_mimicHeartSteps,
    id = "userid",
    outcome = "logstep_30min",
    treatment = "intervention",
    rand_prob = 0.6,
    moderator_formula = ~1,
    control_formula = ~logstep_pre30min,
    availability = "avail"
)
#> Constant randomization probability 0.6 is used.
#> Constant numerator probability 0.5 is used.
summary(fit1)
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min", 
#>     treatment = "intervention", rand_prob = 0.6, moderator_formula = ~1, 
#>     control_formula = ~logstep_pre30min, availability = "avail")
#> 
#> $causal_excursion_effect
#>             Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2 p-value
#> (Intercept)    0.157   0.031   0.284 0.0622       6.4   1  34  0.0162

The summary() function provides the estimated causal excursion effect as well as the 95% confidence interval, standard error, and p-value. The only row in the output $causal_excursion_effect is named (Intercept), indicating that this is the fully marginal effect (like an intercept in the causal effect model). In particular, the estimated marginal excursion effect is 0.157, with 95% confidence interval (0.031, 0.284), and p-value 0.016. The confidence interval and the p-value are based on a small sample correction technique that is based on Hotelling’s T distribution, so the Hotelling’s T statistic (Hotelling) and the degrees of freedom (df1 and df2) are also printed. See Boruvka and others (2018) for details on the small sample correction.

One can include more control variables, e.g., by setting control_formula = ~logstep_pre30min + logstep_30min_lag1 + is_at_home_or_work. This is illustrated by the following code. Different choices of control variables should lead to similar causal effect estimates, but better control variables (i.e., those that are correlated with the proximal outcome) usually decrease the standard error of the causal effect estimates. This is the case here: the standard error of the marginal causal excursion effect decreases slightly from 0.062 to 0.061 after we included two additional control variables.

fit2 <- wcls(
    data = data_mimicHeartSteps,
    id = "userid",
    outcome = "logstep_30min",
    treatment = "intervention",
    rand_prob = 0.6,
    moderator_formula = ~1,
    control_formula = ~ logstep_pre30min + logstep_30min_lag1 + is_at_home_or_work,
    availability = "avail"
)
#> Constant randomization probability 0.6 is used.
#> Constant numerator probability 0.5 is used.
summary(fit2)
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min", 
#>     treatment = "intervention", rand_prob = 0.6, moderator_formula = ~1, 
#>     control_formula = ~logstep_pre30min + logstep_30min_lag1 + 
#>         is_at_home_or_work, availability = "avail")
#> 
#> $causal_excursion_effect
#>             Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2 p-value
#> (Intercept)    0.159  0.0351   0.283 0.0607      6.84   1  32  0.0135

One can ask summary() to print out the fitted coefficients for the control variables as well, by setting show_control_fit = TRUE. This is illustrated by the following code. However, we caution against interpreting the fitted coefficients for the control variables, because these coefficients are only interpretable when the control model is correctly specified, which can rarely be true in an MRT setting.

summary(fit2, show_control_fit = TRUE)
#> Interpreting the fitted coefficients for control variables is not recommended.
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min", 
#>     treatment = "intervention", rand_prob = 0.6, moderator_formula = ~1, 
#>     control_formula = ~logstep_pre30min + logstep_30min_lag1 + 
#>         is_at_home_or_work, availability = "avail")
#> 
#> $causal_excursion_effect
#>             Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2 p-value
#> (Intercept)    0.159  0.0351   0.283 0.0607      6.84   1  32  0.0135
#> 
#> $control_variables
#>                    Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2  p-value
#> (Intercept)          1.8373  1.7222  1.9524 0.0565   1056.59   1  32 0.000000
#> logstep_pre30min     0.3400  0.2997  0.3803 0.0198    295.31   1  32 0.000000
#> logstep_30min_lag1   0.0399  0.0181  0.0618 0.0107     13.84   1  32 0.000764
#> is_at_home_or_work   0.1355  0.0263  0.2447 0.0536      6.39   1  32 0.016585

Moderated Causal Excursion Effect

The following code uses wcls() to estimate the causal excursion effect moderated by the location of the individual, is_at_home_or_work. This is achieved by setting moderator_formula = ~is_at_home_or_work.

fit3 <- wcls(
    data = data_mimicHeartSteps,
    id = "userid",
    outcome = "logstep_30min",
    treatment = "intervention",
    rand_prob = 0.6,
    moderator_formula = ~is_at_home_or_work,
    control_formula = ~ logstep_pre30min + logstep_30min_lag1 + is_at_home_or_work,
    availability = "avail"
)
#> Constant randomization probability 0.6 is used.
#> Constant numerator probability 0.5 is used.
summary(fit3)
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min", 
#>     treatment = "intervention", rand_prob = 0.6, moderator_formula = ~is_at_home_or_work, 
#>     control_formula = ~logstep_pre30min + logstep_30min_lag1 + 
#>         is_at_home_or_work, availability = "avail")
#> 
#> $causal_excursion_effect
#>                    Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2 p-value
#> (Intercept)           0.109 -0.0288   0.247 0.0677     2.606   1  31   0.117
#> is_at_home_or_work    0.135 -0.1661   0.436 0.1475     0.835   1  31   0.368

The moderated causal excursion effect is modeled as a linear form: \(\beta_0 + \beta_1 \cdot \text{is_at_home_or_work}\). The output $causal_excursion_effect contains two rows, one for \(\beta_0\) (with row name (Intercept)) and the other for \(\beta_1\) (with row name is_at_home_or_work). Here, \(\beta_0\) is the causal excursion effect when the individual is not at home or work (estimate = 0.109, 95% CI = (-0.029, 0.247), p = 0.117), and \(\beta_1\) is the difference in the causal excursion effects between when at home or work and when at other locations (estimate = 0.135, 95% CI = (-0.166, 0.435), p = 0.368).

One can also ask summary() to calculate and print the estimated coefficients for \(\beta_0 + \beta_1\), the causal excursion effect when the individual is at home or work, by using the lincomb optional argument. This is illustrated by the following code. We set lincomb = c(1, 1), i.e., asks summary() to print out \([1, 1] \times (\beta_0, \beta_1)^T = \beta_0 + \beta_1\). The third row in $causal_excursion_effect, named (Intercept) + is_at_home_or_work, is the fitted result for this \(\beta_0 + \beta_1\) coefficient combination.

summary(fit3, lincomb = c(1, 1))
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min", 
#>     treatment = "intervention", rand_prob = 0.6, moderator_formula = ~is_at_home_or_work, 
#>     control_formula = ~logstep_pre30min + logstep_30min_lag1 + 
#>         is_at_home_or_work, availability = "avail")
#> 
#> $causal_excursion_effect
#>                                  Estimate 95% LCL 95% UCL StdErr Hotelling df1
#> (Intercept)                         0.109 -0.0288   0.247 0.0677     2.606   1
#> is_at_home_or_work                  0.135 -0.1661   0.436 0.1475     0.835   1
#> (Intercept) + is_at_home_or_work    0.244  0.0846   0.403 0.1260     3.753   2
#>                                  df2 p-value
#> (Intercept)                       31 0.11661
#> is_at_home_or_work                31 0.36787
#> (Intercept) + is_at_home_or_work  31 0.00187

Based on the output, the causal excursion effect at home or work is estimated to be 0.244, with 95% CI (0.085, 0.403) and p-value 0.002.

Causal Excursion Effect Estimation for MRT with Binary Outcomes

The syntax of emee() is exactly the same as wcls(). We illustrate the use of emee() below for completeness.

Fully Marginal Causal Excursion Effect

The following code uses emee() to estimate the fully marginal causal excursion effect from a synthetic data set data_binary. This data set contains observations for 100 individuals at 30 different time points. The data set contains the following variables:

A summary of data_binary is as follows:

head(data_binary, 30)
#>    userid time time_var1 time_var2 Y A avail rand_prob
#> 1       1    1    0.0333         0 0 0     1       0.3
#> 2       1    2    0.0667         0 0 1     1       0.5
#> 3       1    3    0.1000         0 0 0     1       0.7
#> 4       1    4    0.1333         0 1 0     0       0.3
#> 5       1    5    0.1667         0 0 0     0       0.5
#> 6       1    6    0.2000         0 0 0     0       0.7
#> 7       1    7    0.2333         0 0 0     1       0.3
#> 8       1    8    0.2667         0 0 1     1       0.5
#> 9       1    9    0.3000         0 0 0     1       0.7
#> 10      1   10    0.3333         0 1 0     1       0.3
#> 11      1   11    0.3667         0 0 0     1       0.5
#> 12      1   12    0.4000         0 0 1     1       0.7
#> 13      1   13    0.4333         0 1 0     1       0.3
#> 14      1   14    0.4667         0 0 0     1       0.5
#> 15      1   15    0.5000         0 1 1     1       0.7
#> 16      1   16    0.5333         1 0 0     1       0.3
#> 17      1   17    0.5667         1 1 1     1       0.5
#> 18      1   18    0.6000         1 1 0     1       0.7
#> 19      1   19    0.6333         1 1 0     1       0.3
#> 20      1   20    0.6667         1 1 0     1       0.5
#> 21      1   21    0.7000         1 0 1     1       0.7
#> 22      1   22    0.7333         1 1 0     1       0.3
#> 23      1   23    0.7667         1 1 1     1       0.5
#> 24      1   24    0.8000         1 1 0     0       0.7
#> 25      1   25    0.8333         1 1 1     1       0.3
#> 26      1   26    0.8667         1 1 1     1       0.5
#> 27      1   27    0.9000         1 1 0     1       0.7
#> 28      1   28    0.9333         1 1 1     1       0.3
#> 29      1   29    0.9667         1 0 1     1       0.5
#> 30      1   30    1.0000         1 1 1     1       0.7

In the following function call of emee(), we specify the proximal outcome variable by outcome = Y. We specify the treatment variable by treatment = A. We specify the randomization probability by rand_prob = rand_prob (the first rand_prob is an argument of emee(); the second rand_prob is a column in data_binary). We specify the fully marginal effect as the quantity to be estimated by setting moderator_formula = ~1. We use time_var1 and time_var2 as control variables by setting control_formula = ~ time_var1 + time_var2. We specify the availability variable by availability = avail.

fit4 <- emee(
    data = data_binary,
    id = "userid",
    outcome = "Y",
    treatment = "A",
    rand_prob = "rand_prob",
    moderator_formula = ~1,
    control_formula = ~ time_var1 + time_var2,
    availability = "avail"
)
#> Constant numerator probability 0.5 is used.
summary(fit4)
#> $call
#> emee(data = data_binary, id = "userid", outcome = "Y", treatment = "A", 
#>     rand_prob = "rand_prob", moderator_formula = ~1, control_formula = ~time_var1 + 
#>         time_var2, availability = "avail")
#> 
#> $causal_excursion_effect
#>             Estimate 95% LCL 95% UCL StdErr t_value df  p-value
#> (Intercept)    0.341   0.241    0.44   0.05    6.81 96 8.54e-10

The summary() function provides the estimated causal excursion effect as well as the 95% confidence interval, standard error, and p-value. The only row in the output $causal_excursion_effect is named (Intercept), indicating that this is the fully marginal effect (like an intercept in the causal effect model). In particular, the estimated marginal excursion effect is 0.341 (on the log relative risk scale), with 95% confidence interval (0.241, 0.44), and p-value\(<0.001\).

One can ask summary() to print out the fitted coefficients for the control variables as well, by setting show_control_fit = TRUE. This is illustrated by the following code. However, we caution against interpreting the fitted coefficients for the control variables, because these coefficients are only interpretable when the control model is correctly specified, which can rarely be true in an MRT setting.

summary(fit4, show_control_fit = TRUE)
#> Interpreting the fitted coefficients for control variables is not recommended.
#> $call
#> emee(data = data_binary, id = "userid", outcome = "Y", treatment = "A", 
#>     rand_prob = "rand_prob", moderator_formula = ~1, control_formula = ~time_var1 + 
#>         time_var2, availability = "avail")
#> 
#> $causal_excursion_effect
#>             Estimate 95% LCL 95% UCL StdErr t_value df  p-value
#> (Intercept)    0.341   0.241    0.44   0.05    6.81 96 8.54e-10
#> 
#> $control_variables
#>             Estimate 95% LCL 95% UCL StdErr t_value df  p-value
#> (Intercept)   -1.580 -1.7154   -1.45  0.068  -23.26 96 3.13e-41
#> time_var1      0.672  0.2878    1.06  0.194    3.47 96 7.76e-04
#> time_var2      0.248  0.0268    0.47  0.112    2.22 96 2.84e-02

Moderated Causal Excursion Effect

The following code uses emee() to estimate the causal excursion effect moderated by time_var1. This is achieved by setting moderator_formula = ~time_var1.

fit5 <- emee(
    data = data_binary,
    id = "userid",
    outcome = "Y",
    treatment = "A",
    rand_prob = "rand_prob",
    moderator_formula = ~time_var1,
    control_formula = ~ time_var1 + time_var2,
    availability = "avail"
)
#> Constant numerator probability 0.5 is used.
summary(fit5)
#> $call
#> emee(data = data_binary, id = "userid", outcome = "Y", treatment = "A", 
#>     rand_prob = "rand_prob", moderator_formula = ~time_var1, 
#>     control_formula = ~time_var1 + time_var2, availability = "avail")
#> 
#> $causal_excursion_effect
#>             Estimate 95% LCL 95% UCL StdErr t_value df p-value
#> (Intercept)   0.0811  -0.180   0.342  0.132   0.616 95  0.5391
#> time_var1     0.4293   0.051   0.808  0.191   2.253 95  0.0266

The moderated causal excursion effect is modeled as a linear form: \(\beta_0 + \beta_1 \cdot \text{time_var1}\). The output $causal_excursion_effect contains two rows, one for \(\beta_0\) (with row name (Intercept)) and the other for \(\beta_1\) (with row name time_var1). Here, \(\beta_0\) is the causal excursion effect when time_var1\(=0\) (estimate = 0.081, 95% CI = (-0.180, 0.342), p = 0.54), and \(\beta_1\) is the slope of time_var1 in the causal excursion effect model (estimate = 0.429, 95% CI = (0.051, 0.808), p = 0.03).

One can also ask summary() to calculate and print the linear combination of coefficients and their confidence interval, standard error, and p-value, by using the lincomb optional argument. The following code sets lincomb = rbind(c(1, 0.0333), c(1, 0.5), c(1, 1)), i.e., asks summary() to print out the estimates for \[ \begin{bmatrix}1 & 0.0333\\ 1 & 0.5\\ 1 & 1 \end{bmatrix}\times\begin{bmatrix}\beta_{0}\\ \beta_{1} \end{bmatrix}=\begin{bmatrix}\beta_{0}+0.0333\beta_{1}\\ \beta_{0}+0.5\beta_{1}\\ \beta_{0}+\beta_{1} \end{bmatrix}. \] Because \(\beta_1\) is the slope of time_var1, which is a scaled version of decision time index that starts at 0.0333 and ends at 1, \(\beta_0 + 0.0333\beta_1\), \(\beta_0 + 0.5\beta_1\) and \(\beta_0 + \beta_1\) are the causal excursion effects at the beginning of the study, mid-way during the study, and at the end of the study, respectively. The 3rd to 5th rows in $causal_excursion_effect show these results. Note that the interpretation is under the assumption that the causal excursion effect changes linearly over time.

summary(fit5, lincomb = rbind(c(1, 0.0333), c(1, 0.5), c(1, 1)))
#> $call
#> emee(data = data_binary, id = "userid", outcome = "Y", treatment = "A", 
#>     rand_prob = "rand_prob", moderator_formula = ~time_var1, 
#>     control_formula = ~time_var1 + time_var2, availability = "avail")
#> 
#> $causal_excursion_effect
#>                                Estimate 95% LCL 95% UCL StdErr t_value df
#> (Intercept)                      0.0811  -0.180   0.342 0.1316   0.616 95
#> time_var1                        0.4293   0.051   0.808 0.1905   2.253 95
#> (Intercept) + 0.0333*time_var1   0.0954  -0.154   0.345 0.1258   0.759 95
#> (Intercept) + 0.5*time_var1      0.2958   0.183   0.409 0.0569   5.202 95
#> (Intercept) + time_var1          0.5105   0.341   0.680 0.0854   5.978 95
#>                                 p-value
#> (Intercept)                    5.39e-01
#> time_var1                      2.66e-02
#> (Intercept) + 0.0333*time_var1 4.50e-01
#> (Intercept) + 0.5*time_var1    1.13e-06
#> (Intercept) + time_var1        3.93e-08
options(current_options) # restore old options

Bibliography

Below are some references:

Reference

Boruvka, A., Almirall, D., Witkiewitz, K. and Murphy, S. A. (2018). Assessing time-varying causal effect moderation in mobile health. Journal of the American Statistical Association 113, 1112–1121.
Klasnja, P., Hekler, E. B., Shiffman, S., Boruvka, A., Almirall, D., Tewari, A. and Murphy, S. A. (2015). Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology 34, 1220.
Qian, T., Walton, A. E., Collins, L. M., Klasnja, P., Lanza, S. T., Nahum-Shani, I., Rabbi, M., Russell, M. A., Walton, M. A., Yoo, H., et al. (2022). The microrandomized trial for developing digital interventions: Experimental design and data analysis considerations. Psychological Methods.
Qian, T., Yoo, H., Klasnja, P., Almirall, D. and Murphy, S. A. (2021). Estimating time-varying causal excursion effects in mobile health with binary outcomes. Biometrika 108, 507–527.