--- title: "Binary Protected Attributes" authors: - name: "Jianhui Gao" orcid: "0000-0003-0915-1473" affiliation: 1 - name: "Benjamin Smith" orcid: "0009-0007-2206-0177" affiliation: 1 - name: "Benson Chou" orcid: "0009-0007-0265-033X" affiliation: 1 - name: "Jessica Gronsbell" orcid: "0000-0002-5360-5869" affiliation: 1 affiliations: - name: "University of Toronto" index: 1 output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Binary Protected Attributes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction We illustrate the usage of the `fairmetrics` package through a case study using a publicly available dataset of 1,776 ICU patients from the MIMIC-II clinical database, focusing on predicting 28-day mortality and evaluating disparities in model performance across sex. The following packages are used for the analysis along with the `fairmetrics` package: ```{r setup, message=FALSE, warning=FALSE} # Packages we are using for the analysis library(dplyr) library(corrplot) library(randomForest) library(pROC) library(SpecsVerification) library(kableExtra) library(naniar) # Our package library(fairmetrics) ``` The "Data Preprocessing" section discusses the dataset, the handling of missing data, model construction and standard predictive model evaluation through train-test splitting for binary classification. The "Fairness Evaluation" section shows how to evaluate the model's fairness toward binary protected attributes with the `fairmetrics` package. ```{=html} ``` # Data Preprocessing The dataset used in this analysis is the MIMIC II clinical database [2], which has been previously studied to explore the relationship between indwelling arterial catheters in hemodynamically stable patients and respiratory failure in relation to mortality outcomes [3]. It includes 46 variables which cover demographics and clinical characteristics (including white blood cell count, heart rate during ICU stays and others) along with a 28-day mortality indicator (`day_28_flg`) for 1,776 patients. The data has been made publicly available by [PhysioNet](https://physionet.org/content/mimic2-iaccd/1.0/) [4] and is available in the `fairmetrics` package as the `mimic` dataset. ## Handling Missing Data We first assess the extent of missingness in the dataset. For each variable, we calculate both the total number and the percentage of missing values with `naniar::miss_var_summary()`. ```{r} # Loading mimic dataset # (available in fairmetrics) data("mimic") missing_data_summary<- naniar::miss_var_summary(mimic, digits= 3) kableExtra::kable(missing_data_summary, booktabs = TRUE, escape = FALSE) %>% kableExtra::kable_styling( latex_options = "hold_position" ) ``` To ensure data quality, the following procedure is applied to handle missing data: - Removal of that variables which had more than 10% missing values. Three variables had more than 10% of missing values: body mass index (`bmi`; 26.2%), first partial pressure of oxygen (`po2_first`; 10.5%), and first partial pressure of carbon dioxide (`pco2_first`; 10.5%). - Remaining missing values are imputed using the median value of the variable which they belong to. 
```{r}
# Remove columns with more than 10% missing values
columns_to_remove <- missing_data_summary %>%
  dplyr::filter(pct_miss > 10) %>%
  dplyr::pull(variable)

mimic <- dplyr::select(mimic, -dplyr::one_of(columns_to_remove))

# Impute remaining missing values with the column median
mimic <- mimic %>%
  dplyr::mutate(
    dplyr::across(
      dplyr::where(~ any(is.na(.))),
      ~ ifelse(is.na(.), median(., na.rm = TRUE), .)
    )
  )
```

We additionally remove the `sepsis_flg` column, as it contains a single unique value across all observations and therefore provides no useful information for model training.

```{r}
# Identify columns that have only one unique value
cols_with_one_value <- sapply(mimic, function(x) length(unique(x)) == 1)

# Subset the data frame to remove these columns
mimic <- mimic[, !cols_with_one_value]
```

## Model Building

Before training the model, we further remove variables that are directly determined by patient outcomes to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_%28machine_learning%29). In particular, we inspect the correlation matrix of the numeric features and exclude the hospital expiration flag (`hosp_exp_flg`), the ICU expiration flag (`icu_exp_flg`), the mortality censoring day (`mort_day_censored`), and the censoring flag (`censor_flg`), all of which are strongly associated with patient outcomes.

```{r}
# Remove columns that are highly correlated with the outcome variable
corrplot::corrplot(cor(select_if(mimic, is.numeric)), method = "color", tl.cex = 0.5)

mimic <- mimic %>%
  dplyr::select(-c("hosp_exp_flg", "icu_exp_flg", "mort_day_censored", "censor_flg"))
```

We split the dataset into training and testing sets: the first 700 patients form the training set and the remaining 1,076 patients form the testing set. The random forest (RF) model is trained with 1,000 trees and 6 randomly sampled candidate variables at each split (approximately the square root of the number of predictors). After training, the model attains an overall area under the receiver operating characteristic curve (AUC) of 0.90 and an overall accuracy of 0.88 on the test set.

```{r}
# Use the first 700 patients to train the model
train_data <- mimic %>%
  dplyr::filter(dplyr::row_number() <= 700)

# Fit a random forest model
set.seed(123)
rf_model <- randomForest::randomForest(factor(day_28_flg) ~ ., data = train_data, ntree = 1000)

# Test the model on the remaining data
test_data <- mimic %>%
  dplyr::filter(dplyr::row_number() > 700)
test_data$pred <- predict(rf_model, newdata = test_data, type = "prob")[, 2]

# Check the AUC
roc_obj <- pROC::roc(test_data$day_28_flg, test_data$pred)
roc_auc <- pROC::auc(roc_obj)
roc_auc
```

# Fairness Evaluation

To evaluate fairness, we use the testing set results, with patient gender as the binary protected attribute and 28-day mortality (`day_28_flg`) as the outcome of interest.

```{r}
# Recode the gender variable explicitly for readability
test_data <- test_data %>%
  dplyr::mutate(gender = ifelse(gender_num == 1, "Male", "Female"))
```

Since many fairness metrics require binary predictions, we threshold the predicted probabilities using a fixed cutoff. We set the cutoff to 0.41, which keeps the overall false positive rate (FPR) at approximately 5%.

```{r}
# Control the overall false positive rate (FPR) at 5% by setting a threshold.
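# One way to arrive at such a cutoff (a sketch; not necessarily how the value
# 0.41 was chosen): take the 95th percentile of the predicted probabilities
# among observed survivors, so that roughly 5% of true negatives fall above it.
quantile(test_data$pred[test_data$day_28_flg == 0], probs = 0.95)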
cut_off <- 0.41
test_data %>%
  dplyr::mutate(pred = ifelse(pred > cut_off, 1, 0)) %>%
  dplyr::filter(day_28_flg == 0) %>%
  dplyr::summarise(fpr = mean(pred))
```

To calculate the fairness metrics for the model, we pass the test data and its predicted probabilities to the `get_fairness_metrics()` function.

```{r}
fairness_result <- fairmetrics::get_fairness_metrics(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  group2 = "age",
  condition = ">=60",
  probs = "pred",
  cutoff = cut_off
)

kableExtra::kable(fairness_result, booktabs = TRUE, escape = FALSE) %>%
  kableExtra::kable_styling(full_width = FALSE) %>%
  kableExtra::pack_rows("Independence-based criteria", 1, 2) %>%
  kableExtra::pack_rows("Separation-based criteria", 3, 6) %>%
  kableExtra::pack_rows("Sufficiency-based criteria", 7, 7) %>%
  kableExtra::pack_rows("Other criteria", 8, 10) %>%
  kableExtra::kable_styling(
    full_width = FALSE,
    font_size = 10, # Controls font size manually
    latex_options = "hold_position"
  )
```

From the resulting fairness metrics we note:

- Independence is likely violated, as evidenced by the statistical parity metric, which shows a 9% difference (95% CI: [5%, 13%]), or a ratio of 2.12 (95% CI: [1.49, 3.04]), between females and males. The measures in the independence category indicate that the model predicts a significantly higher mortality rate for females, even after conditioning on age.
- With respect to separation, all metrics show significant disparities. For instance, equal opportunity shows a -24% difference (95% CI: [-39%, -9%]) in the false negative rate (FNR) between females and males, indicating that the model is less likely to detect males at risk of mortality than females.
- On the other hand, the sufficiency criterion is satisfied, as predictive parity shows no significant difference between males and females (difference: -4%, 95% CI: [-21%, 13%]; ratio: 0.94, 95% CI: [0.72, 1.23]). This suggests that, given the same prediction, the actual mortality rates are similar for males and females.
- Among the additional metrics that assess calibration and discrimination, Brier score parity and overall accuracy equality do not show significant disparities. However, treatment equality shows a statistically significant difference (difference: -2.21, 95% CI: [-4.35, -0.07]; ratio: 0.32, 95% CI: [0.15, 0.68]), indicating that males have a substantially higher false negative to false positive ratio than females. This suggests that male patients are more likely to be missed by the model than to be incorrectly flagged.

# Appendix A: Confidence Interval Construction

The function `get_fairness_metrics()` computes Wald-type confidence intervals for both group-specific and disparity metrics using the nonparametric bootstrap [7]. To illustrate the construction of the confidence intervals (CIs), consider the false positive rate ($\textrm{FPR}$). Let $\widehat{\textrm{FPR}}_a$ and $\textrm{FPR}_a$ denote the estimated and true FPR in group $A = a$. Then the difference $\widehat{\Delta}_{\textrm{FPR}} = \widehat{\textrm{FPR}}_{a_1} - \widehat{\textrm{FPR}}_{a_0}$ satisfies (e.g., [Gronsbell et al., 2018](https://doi.org/10.1093/jrsssb/qkad107)):

$$
\sqrt{n}\left(\widehat{\Delta}_{\textrm{FPR}} - \Delta_{\textrm{FPR}}\right) \overset{d}{\to} \mathcal{N}(0, \sigma^2).
$$

We estimate the standard error of $\widehat{\Delta}_{\textrm{FPR}}$ using bootstrap resampling within groups and form a Wald-style confidence interval:

$$
\widehat{\Delta}_{\textrm{FPR}} \pm z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left(\widehat{\Delta}_{\textrm{FPR}}\right).
$$

For **ratios**, such as $\widehat{\rho}_{\textrm{FPR}} = \widehat{\textrm{FPR}}_{a_1} / \widehat{\textrm{FPR}}_{a_0}$, we apply a log transformation and use the delta method:

$$
\log(\widehat{\rho}_{\textrm{FPR}}) \pm z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right].
$$

Exponentiating the bounds yields a confidence interval for the ratio on the original scale:

$$
\left[ \exp\left\{\log(\widehat{\rho}_{\textrm{FPR}}) - z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right]\right\},\ \exp\left\{\log(\widehat{\rho}_{\textrm{FPR}}) + z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right]\right\} \right].
$$
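To make the construction concrete, below is a minimal sketch of a within-group bootstrap for the FPR difference between the two gender groups, reusing the `test_data` and `cut_off` objects from the analysis above. It illustrates the procedure only and is not the internal implementation of `get_fairness_metrics()`.

```{r}
# Minimal sketch: within-group bootstrap Wald CI for the FPR difference.
# Illustrative only; not the internals of fairmetrics::get_fairness_metrics().
set.seed(123)

# Empirical FPR: share of observed survivors flagged as high risk
fpr <- function(df) mean(df$pred[df$day_28_flg == 0] > cut_off)

females <- test_data[test_data$gender == "Female", ]
males <- test_data[test_data$gender == "Male", ]

# Resample within each group to preserve the group sizes
boot_diffs <- replicate(1000, {
  fpr(females[sample(nrow(females), replace = TRUE), ]) -
    fpr(males[sample(nrow(males), replace = TRUE), ])
})

# Wald-style 95% CI: point estimate +/- z * bootstrap standard error
est <- fpr(females) - fpr(males)
se <- sd(boot_diffs)
c(estimate = est, lower = est - qnorm(0.975) * se, upper = est + qnorm(0.975) * se)
```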
# Appendix B: Practical Fairness Considerations

In the above example, both **separation**- and **sufficiency**-based fairness criteria are important:

- **Separation** ensures that patients at true risk are identified across groups, reducing under-detection in specific sub-populations.
- **Sufficiency** ensures that interventions are based solely on predicted need, not on protected attributes.

However, 28-day mortality differs by sex (19% for females vs. 14% for males), making it impossible to satisfy multiple fairness criteria simultaneously. Given this, enforcing **independence** is likely not advisable in this case, as it could blind the model to true mortality differences. The model does, however, **violate separation**, potentially leading to higher false negative rates among male patients and delayed interventions. See [Hort et al. (2022)](https://arxiv.org/abs/2207.07068) for strategies to reduce disparities in separation-based metrics.
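The base rates cited above can be checked directly. As a quick sanity check, the following computes the observed 28-day mortality by gender in the test set (rates in the full cohort may differ slightly):

```{r}
# Observed 28-day mortality rate by gender in the test set
test_data %>%
  dplyr::group_by(gender) %>%
  dplyr::summarise(mortality_rate = mean(day_28_flg))
```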
# References

1. Gao, J. et al. What is Fair? Defining Fairness in Machine Learning for Health. arXiv (2024).
2. Raffa, J. Clinical data from the MIMIC-II database for a case study on indwelling arterial catheters (version 1.0). PhysioNet (2016).
3. Raffa, J. D., Ghassemi, M., Naumann, T., Feng, M. & Hsu, D. Data Analysis. In: Secondary Analysis of Electronic Health Records. Springer, Cham (2016).
4. Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000).
5. Hsu, D. J. et al. The association between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure. CHEST 148, 1470–1476 (2015).
6. Gronsbell, J. L. & Cai, T. Semi-supervised approaches to efficient evaluation of model prediction performance. Journal of the Royal Statistical Society Series B (Statistical Methodology) 80, 579–594 (2018).
7. Efron, B. & Tibshirani, R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1, 54–75 (1986).
8. Hort, M., Chen, Z., Zhang, J. M., Harman, M. & Sarro, F. Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey. arXiv (2022).