--- title: "Additional Predictor with Maximum Effect Size" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{intro} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} library(knitr) opts_chunk$set( collapse = TRUE, comment = "#>" ) options(rmarkdown.html_vignette.check_title = FALSE) ``` # Introduction This vignette of package **`maxEff`** ([Github](https://github.com/tingtingzhan/maxEff), [RPubs](https://rpubs.com/tingtingzhan/maxEff)) documents three methods of selecting one from many `numeric` predictors for a regression model, to ensure that the additional predictor has the maximum effect size. The three methods are implemented in functions `add_num()`, `add_dummy()` and `add_dummy_partition()`. ## Note to Users Examples in this vignette require that the `search` path has ```{r setup} library(maxEff) library(groupedHyperframe) library(survival) library(rpart) ``` Users should remove parameter `mc.cores = 1L` from all examples and use the default option, which engages all CPU cores on the current host for macOS. The authors are forced to have `mc.cores = 1L` in this vignette in order to pass `CRAN`'s submission check. ## Terms and Abbreviations ```{r echo = FALSE, results = 'asis'} c( '', 'Forward pipe operator', '`?base::pipeOp` introduced in `R` 4.1.0', '`abs`', 'Absolute value', '`base::abs`', '`coxph`', 'Cox model', '`survival::coxph`', '`CRAN`, `R`', 'The Comprehensive R Archive Network', 'https://cran.r-project.org', '`factor`', 'Factor, or categorical variable', '`base::factor`', '`function`', '`R` function', '``base::`function` ``', '`groupedHyperframe`', 'Grouped hyper data frame', ' `groupedHyperframe::as.groupedHyperframe`', '`head`', 'First parts of an object', '`utils::head`; `utils:::head.default`', '`hypercolumns`, `hyperframe`', '(Hyper columns of) hyper data frame', '`spatstat.geom::hyperframe`', '`labels`', 'Labels from object', '`base::labels`; `maxEff::labels.node1`', '`levels`', 'Levels of a `factor`', '`base::levels`', '`listof`', 'List of objects', '`stats::listof`', '`logistic`', 'Logistic regression model', '`stats::glm(., family = binomial(\'logit\'))`', '`matrix`', 'Matrix', '`base::matrix`', '`partition`', 'Stratified partition', '`maxEff::statusPartition`, `caret::createDataPartition`', '`PFS`', 'Progression/recurrence free survival', 'https://en.wikipedia.org/wiki/Progression-free_survival', '`predict`', 'Model prediction', '`stats::predict`; `maxEff::predict.add_num`; `maxEff::predict.add_dummy`', '`quantile`', 'Quantile', '`stats::quantile`', '`rpart`', 'Recursive partitioning and regression trees', '`rpart::rpart`', '`S3`, `generic`, `methods`', '`S3` object oriented system', '`base::UseMethod`; `utils::methods`; `utils::getS3method`; https://adv-r.hadley.nz/s3.html', '`sort_by`', 'Sort an object by some criterion', '`base::sort_by`; `maxEff::sort_by.add_`', '`subset`', 'Subsets of object by conditions', '`base::subset`; `maxEff::subset.add_dummy`', '`Surv`', 'Survival object', '`survival::Surv`', '`update`', 'Update and re-fit a model call', '`stats::update`' ) |> matrix(nrow = 3L, dimnames = list(c('Term / Abbreviation', 'Description', 'Reference'), NULL)) |> t.default() |> as.data.frame.matrix() |> kable() ``` ## Acknowledgement This work is supported by NCI R01CA222847 ([I. Chervoneva](https://orcid.org/0000-0002-9104-4505), [T. Zhan](https://orcid.org/0000-0001-9971-4844), and [H. Rui](https://orcid.org/0000-0002-8778-261X)) and R01CA253977 (H. Rui and I. Chervoneva). # Handy Tools ## `node1()`: Dichotomization via First Node of Recursive Partitioning Consider the following recursive partitioning and regression example, ```{r} data(cu.summary, package = 'rpart') (r = rpart(Price ~ Mileage, data = cu.summary, cp = .Machine$double.eps, maxdepth = 1L)) ``` Function `node1()` creates a dichotomizing rule, i.e., a **`function`**, based on the first node of partitioning, ```{r} (foo = r |> node1()) ``` The dichotomizing rule `foo` converts a `numeric` object to a `logical` object. ```{r} set.seed(125); rnorm(6, mean = 24.5) |> foo() ``` Developers may obtain the `numeric` cutoff value of `foo`, or a brief text of to describe `foo`, for further analysis. ```{r} foo |> get_cutoff() foo |> labels() ``` ## `statusPartition()`: Stratified Partition by Survival Status Function `statusPartition()` performs stratified partition based on the survival status of a `Surv` response, to avoid the situation that Cox model is degenerate if all subjects are censored. This is a deviation from the very popular function `caret::createDataPartition()`, which stratifies `Surv` response by the `quantile`s of the survival time. ```{r} y = (survival::capacitor) |> with(expr = Surv(time, status)) set.seed(15); id = y |> statusPartition(times = 1L, p = .5) table(y[id[[1L]], 2L]) / table(y[,2L]) # balanced by status set.seed(15); id0 = y |> caret::createDataPartition(times = 1L, p = .5) table(y[id0[[1L]], 2L]) / table(y[,2L]) # *not* balanced by status ``` # Data Preparation Data set `Ki67` in package **`groupedHyperframe`** is a `groupedHyperframe`. ```{r} data(Ki67, package = 'groupedHyperframe') Ki67 ``` ```{r} s = Ki67 |> aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01)) ``` Candidate of additional predictors are stored in a `numeric`-`hypercolumn` *`logKi67.quantile`*. Users are encouraged to learn more about `groupedHyperframe` class and function `aggregate_quantile()` from package **`groupedHyperframe`** [vignette](https://rpubs.com/tingtingzhan/groupedHyperframe). Partition into a training (80%) and test (20%) set. ```{r} set.seed(234); id = sample.int(n = nrow(s), size = nrow(s)*.8) |> sort.int() s0 = s[id, , drop = FALSE] # training set s1 = s[-id, , drop = FALSE] # test set ``` Let's consider a starting model of endpoint `PFS` with predictor `Tstage` on the training set `s0`. ```{r} summary(m <- coxph(PFS ~ Tstage, data = s0)) ``` # Additional `numeric` Predictor with Maximum Effect Size Function `add_num()` treats each additional predictor as a `numeric` variable, and `update`s the starting model with each additional predictor. Function `add_num()` returns an `'add_num'` object, which is a `listof` objects with an internal class `'add_num_'`. The `S3` generic `sort_by()` sorts the `'add_num'` object by the `abs`olute value of the regression coefficient (i.e., effect size) of the additioanal `numeric` predictor. The `S3` generic `head()` chooses the top `n` element from the object returned from the previous step. ```{r} set.seed(14837); m1 = m |> add_num(x = ~ logKi67.quantile, mc.cores = 1L) |> sort_by(y = abs(effsize)) |> head(n = 2L) m1 ``` The S3 generic `predict()` uses model `m1` on the test set `s1`. ```{r} m1[1L] |> predict(newdata = s1) ``` # Additional `logical` Predictor with Maximum Effect Size ## `add_dummy()`: Naive Practice Function `add_dummy()` partitions each additional `numeric` predictor into a `logical` variable using function `node1()`, and `update`s the starting model by adding in each of the dichotomized `logical` predictor. Function `add_dummy()` returns an `'add_dummy'` object, which is a `listof` `node1` objects. The `S3` generic `subset()` subsets the the `'add_dummy'` object by the balance of partition of the additional predictor. The `S3` generic `sort_by()` sorts the `'add_dummy'` object by the `abs`olute value of regression coefficient (i.e., effect size) of the additional `logical` predictor. The `S3` generic `head()` chooses the top `n` element from the object returned from the previous step. ```{r} set.seed(14837); m2 = m |> add_dummy(x = ~ logKi67.quantile, mc.cores = 1L) |> subset(subset = p1 > .05 & p1 < .95) |> sort_by(y = abs(effsize)) |> head(n = 2L) m2 ``` The S3 generic `predict()` uses model `m2` on the test set `s1`. ```{r} m2[1L] |> predict(newdata = s1) ``` ## `add_dummy_partition()`: via Repeated Partitions Function `add_dummy_partition()` partitions each additional `numeric` predictor into a `logical` variable in the following steps. 1. Generate multiple, i.e., repeated, partitions. 2. For each partition, create a dichotomizing rule (via function `node1()`) on the training set. Apply this dichotomizing rule on the test set and obtain the estimated regression coefficient (i.e., effect size) of the additional `logical` predictor. 3. Among all partitions, select the one with median effect size of the additional `logical` predictor. Function `add_dummy_partition()` also returns an `'add_dummy'` object. ```{r} set.seed(14837); m3 = m |> add_dummy_partition(~ logKi67.quantile, times = 20L, mc.cores = 1L) |> subset(subset = p1 > .15 & p1 < .85) |> sort_by(y = abs(effsize), decreasing = TRUE) |> head(n = 2L) m3 ``` ```{r} m3[1L] |> predict(newdata = s1) ```