--- title: "Assessing Model Fairness Across Binary Protected Attributes" authors: - name: "Jianhui Gao" orcid: "0000-0003-0915-1473" affiliation: 1 - name: "Benjamin Smith" orcid: "0009-0007-2206-0177" affiliation: 1 - name: "Benson Chou" orcid: "0009-0007-0265-033X" affiliation: 1 - name: "Jessica Gronsbell" orcid: "0000-0002-5360-5869" affiliation: 1 affiliations: - name: "University of Toronto" index: 1 output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Assessing Model Fairness Across Binary Protected Attributes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) set.seed(123) ``` # Introduction This vingette demonstrates how to obtain, report and interpret model fairness metrics for binary protected attributes with the `fairmetrics` package. We illustrate this through a case study based on a preprocessed version of the MIMIC-II clinical database [1], which has been previously studied to explore the relationship between indwelling arterial catheters in hemodynamically stable patients and respiratory failure in relation to mortality outcomes [2]. The original, unprocessed dataset is publicly available through [PhysioNet](https://physionet.org/content/mimic2-iaccd/1.0/) [3]. A preprocessed version of this dataset is included in the `fairmetrics` package as the `mimic_preprocessed` and is used in this vingette. # Data Split and Model Construction In this setting, we construct a model that will predict 28-day mortality (`day_28_flg`). To do this, we split the dataset into a training and testing sets and fit a random forest model. The first 700 patients are used as the training set and the remaining patients are used as the testing set. After the model is fit, it is used to predict 28-day mortality. The predicted probabilities are saved as new column in the testing data and are used to assess model fairness. ```{r, message=FALSE, warning=FALSE} # Load required libraries library(dplyr) library(fairmetrics) library(randomForest) # Set seed for reproducibility set.seed(1) # Use 700 labels to train on the mimic_preprocessed dataset train_data <- mimic_preprocessed %>% filter(row_number() <= 700) # Test the model on the remaining data test_data <- mimic_preprocessed %>% filter(row_number() > 700) # Fit a random forest model rf_model <- randomForest(factor(day_28_flg) ~ ., data = train_data, ntree = 1000) # Save model prediction test_data$pred <- predict(rf_model, newdata = test_data, type = "prob")[, 2] ``` # Fairness Evaluation The `fairmetrics` package is used to assess model fairness across binary protected attributes. This means that the number of unique values in the protected attribute column need to be exactly two. To evaluate fairness of the random forest model which we fit, we examine patient gender as the binary protected attribute. ```{r} # Recode gender variable explicitly for readability: test_data <- test_data %>% mutate(gender = ifelse(gender_num == 1, "Male", "Female")) ``` Since many fairness metrics require binary predictions, we threshold the predicted probabilities using a fixed cutoff. We set a threshold of 0.41 to maintain the overall false positive rate (FPR) at approximately 5%. To evaluate specific fairness metrics, its possible to do so with the `eval_*` functions (for a list of the functions contained in `fairmetrics` see [here](https://jianhuig.github.io/fairmetrics/reference/index.html)). 
To evaluate specific fairness metrics, use the `eval_*` functions (for a list of the functions contained in `fairmetrics`, see [here](https://jianhuig.github.io/fairmetrics/reference/index.html)). For example, if we are interested in calculating the statistical parity of our model across gender (here assumed to be binary), we write:

```{r}
eval_stats_parity(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41,
  message = TRUE
)
```

The returned data frame gives the positive prediction rate for each group defined by the binary protected attribute (`GroupFemale` and `GroupMale` in this case), the difference and ratio between the groups, and bootstrap confidence intervals for the estimated difference and ratio. For inference, a confidence interval that contains 0 (for the difference) or 1 (for the ratio) indicates no significant evidence of a violation.

The same syntax extends to the other `eval_*` functions. Additionally, to calculate several fairness metrics (and their confidence intervals) simultaneously, we pass the test data with its predicted results to the `get_fairness_metrics` function.

```{r}
get_fairness_metrics(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41
)
```

With the `get_fairness_metrics` function, it is possible to examine the bootstrap confidence intervals to determine which fairness criteria show evidence of violation. From the above output, Statistical Parity, Equal Opportunity, Predictive Equality, Balance for Positive Class, and Balance for Negative Class show evidence of being violated, while Positive Predictive Parity, Negative Predictive Parity, Brier Score Parity, and Overall Accuracy Parity do not. Treatment Equality remains ambiguous: when assessed by the difference, the 95% bootstrap confidence interval crosses 0, indicating no evidence that the criterion is violated, but when assessed by the ratio, the interval excludes 1, which does indicate a violation.

> __NOTE:__
>
> Statistical inference from bootstrap confidence intervals should be interpreted with caution. A confidence interval crossing 0 (for differences) or 1 (for ratios) means the evidence is inconclusive rather than proof of the absence of unfairness. Apparent violations may also reflect sampling variability rather than systematic bias. Always complement these results with domain knowledge, sensitivity analyses, and additional fairness diagnostics before drawing strong conclusions about a specific fairness assessment.

While `fairmetrics` focuses on assessing fairness of models across binary protected attributes, protected attributes with more than two groups can be handled through "one-vs-all" comparisons and a little data wrangling to create the appropriate columns, as the sketch below shows.
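In the following sketch, `ethnicity` is a hypothetical multi-group column used only to illustrate the recoding; it is not present in `mimic_preprocessed`, so the chunk is not evaluated.

```{r, eval=FALSE}
# One-vs-all comparison for a hypothetical multi-group attribute `ethnicity`:
# compare one group ("Group A") against all remaining groups combined.
test_data_ova <- test_data %>%
  mutate(ethnicity_bin = ifelse(ethnicity == "Group A", "Group A", "All others"))

eval_stats_parity(
  data = test_data_ova,
  outcome = "day_28_flg",
  group = "ethnicity_bin",
  probs = "pred",
  cutoff = 0.41,
  message = TRUE
)
```

Repeating this recoding for each group in turn yields a full set of one-vs-all comparisons.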
# Conditional Fairness Evaluation

To evaluate a model's fairness for a specific subgroup, simply subset the data. For example, to evaluate conditional statistical parity among subjects aged 60 and older using the previously fitted random forest model, subset the test data and pass it to `eval_stats_parity`.

```{r}
eval_stats_parity(
  data = subset(test_data, age >= 60),
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41,
  message = TRUE
)
```

The bootstrap confidence intervals for both the difference and the ratio provide evidence that statistical parity is violated between male and female subjects aged 60 and over.

A similar example can be constructed with `get_fairness_metrics` for a more comprehensive view of the conditional fairness of the model.

```{r}
get_fairness_metrics(
  data = subset(test_data, age >= 60),
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41
)
```

Conditioning on subjects aged 60 and older, Statistical Parity, Balance for Positive Class, and Balance for Negative Class show evidence of being violated based on the 95% bootstrap confidence intervals, while Predictive Equality, Positive Predictive Parity, Negative Predictive Parity, Brier Score Parity, and Overall Accuracy Parity do not. As in the unconditioned case, Treatment Equality remains ambiguous because the difference and ratio confidence intervals disagree: if the difference is considered, there is evidence that Treatment Equality is violated; if the ratio is considered, there is not.

# Appendix: Confidence Interval Construction

The function `get_fairness_metrics()` computes Wald-type confidence intervals for both group-specific and disparity metrics using the nonparametric bootstrap. To illustrate the construction of confidence intervals (CIs), we use the following example involving the false positive rate ($\textrm{FPR}$).

Let $\widehat{\textrm{FPR}}_a$ and $\textrm{FPR}_a$ denote the estimated and true FPR in group $A = a$. Then the difference $\widehat{\Delta}_{\textrm{FPR}} = \widehat{\textrm{FPR}}_{a_1} - \widehat{\textrm{FPR}}_{a_0}$ satisfies (e.g., [Gronsbell et al., 2018](https://doi.org/10.1093/jrsssb/qkad107)):

$$
\sqrt{n}(\widehat{\Delta}_{\textrm{FPR}} - \Delta_{\textrm{FPR}}) \overset{d}{\to} \mathcal{N}(0, \sigma^2).
$$

We estimate the standard error of $\widehat{\Delta}_{\textrm{FPR}}$ using bootstrap resampling within groups, and form a Wald-style confidence interval:

$$
\widehat{\Delta}_{\textrm{FPR}} \pm z_{1-\alpha/2} \cdot \widehat{\textrm{se}}(\widehat{\Delta}_{\textrm{FPR}}).
$$

For **ratios**, such as $\widehat{\rho}_{\textrm{FPR}} = \widehat{\textrm{FPR}}_{a_1} / \widehat{\textrm{FPR}}_{a_0}$, we apply a log transformation and use the delta method:

$$
\log(\widehat{\rho}_{\textrm{FPR}}) \pm z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right].
$$

Exponentiating the bounds yields a confidence interval for the ratio on the original scale:

$$
\left[ \exp\left\{\log(\widehat{\rho}_{\textrm{FPR}}) - z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right]\right\},\ \exp\left\{\log(\widehat{\rho}_{\textrm{FPR}}) + z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right]\right\} \right].
$$
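To make the difference CI concrete, here is a minimal sketch of the bootstrap construction, assuming the `test_data`, the `gender` groups, and the 0.41 cutoff from earlier. The number of resamples and other details of the package's internal implementation may differ.

```{r}
# Minimal sketch: bootstrap Wald-type CI for the FPR difference across gender.
# FPR within each group = P(pred >= cutoff | day_28_flg == 0).
fpr_diff <- function(df, cutoff = 0.41) {
  df %>%
    filter(day_28_flg == 0) %>%
    group_by(gender) %>%
    summarise(fpr = mean(pred >= cutoff), .groups = "drop") %>%
    pull(fpr) %>%
    diff() # FPR difference between the two groups
}

# Resample within each gender group and recompute the difference
boot_diffs <- replicate(500, {
  resampled <- test_data %>%
    group_by(gender) %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    ungroup()
  fpr_diff(resampled)
})

# Wald-style 95% CI: point estimate +/- z * bootstrap SE
est <- fpr_diff(test_data)
se <- sd(boot_diffs)
c(lower = est - qnorm(0.975) * se, upper = est + qnorm(0.975) * se)
```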
# References

1. Raffa, J. (2016). Clinical data from the MIMIC-II database for a case study on indwelling arterial catheters (version 1.0). PhysioNet.
2. Raffa, J. D., Ghassemi, M., Naumann, T., Feng, M., & Hsu, D. (2016). Data analysis. In Secondary Analysis of Electronic Health Records. Springer, Cham.
3. Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23), e215–e220.
4. Gao, J., et al. (2024). What is fair? Defining fairness in machine learning for health. arXiv.
5. Gronsbell, J. L., & Cai, T. (2018). Semi-supervised approaches to efficient evaluation of model prediction performance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80, 579–594.
6. Hort, M., Chen, Z., Zhang, J. M., Harman, M., & Sarro, F. (2022). Bias mitigation for machine learning classifiers: A comprehensive survey. arXiv.
7. Hsu, D. J., et al. (2015). The association between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure. CHEST, 148, 1470–1476.
8. Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1(1).