---
title: "Example: Multi-Ethnicity Categorization"
author: "Vinh Nguyen"
date: "`r Sys.Date()`"
output:
  prettydoc::html_pretty:
    theme: architect
    highlight: github
    toc: true
vignette: >
  %\VignetteIndexEntry{Example: Multi-Ethnicity Categorization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Background on Multi-Ethnicity Categorization

When it comes to disaggregating student outcomes by ethnicity, colleges typically rely on an ethnicity categorization that assigns each student to a single race/ethnicity category. However, it is not uncommon for students to identify as a member of more than one race/ethnicity group. Categorization of ethnicity is typically simplified by assigning these multi-ethnicity students to a "2 or more Races" group or by assigning students to a single race/ethnicity category using a predefined rule set.

For example, in the case of [IPEDS](https://nces.ed.gov/ipeds/report-your-data/race-ethnicity-collecting-data-for-reporting-purposes) race and ethnicity reporting, which mirrors the US Census, students are asked on their college application (CCC Apply for California Community Colleges) if they are "Hispanic / Latino" (yes or no) in one question. Then, in a subsequent question, students are asked to check all races that they identify with (e.g., White, Black, Asian, etc.). If a student identifies as Hispanic or Latino, then that student would be grouped into the "Hispanic / Latino" ethnicity group in IPEDS reporting regardless of how many additional race/ethnicity boxes are checked.

Conducting a disproportionate impact (DI) analysis using a single ethnicity categorization has the potential to skew results when some students are left out of the groups they identify with (the impact can be large depending on the institution and/or the size of these groups), and could also mask student groups that remain hidden under a single categorization formula.

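The masking effect of such a rule set can be seen in a small sketch. The function below is hypothetical (it is not part of `DisImpact`) and implements only a simplified version of the IPEDS-style rule described above; the argument names and the `'Unknown'` label are assumptions made for illustration.

```{r}
# Hypothetical helper (not part of DisImpact): collapse one student's responses
# into a single reporting category using a simplified IPEDS-style rule.
assign_single_category <- function(is_hispanic, races) {
  if (is_hispanic) return('Hispanic / Latino')     # Hispanic overrides any races checked
  if (length(races) == 0) return('Unknown')        # assumed label when no race is reported
  if (length(races) > 1) return('2 or more Races') # several races checked
  races                                            # exactly one race checked
}

# Made-up responses for four students
assign_single_category(TRUE, c('White'))
assign_single_category(TRUE, character(0))
assign_single_category(FALSE, c('Black', 'Asian'))
assign_single_category(FALSE, c('Asian'))
```

In this sketch, a student who identifies as Hispanic and White receives the same label as a student who identifies as Hispanic only, so the White identification is no longer visible once outcomes are disaggregated.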
A more inclusive approach to DI analysis would be to include students in all ethnicity groups that they identify with. For example, if a student identifies as Hispanic and White, then they should be included in both the Hispanic group and the White group. Similarly, if a student identifies as Black and Asian, then they should be included in both the Black group and the Asian group.

Such an analysis is certainly feasible, but it faces practical implementation challenges for at least two reasons:

1. The proportions of the ethnicity groups do not add up to 100%, leading to a potentially distorted view and confusion among the audience.
2. Working with more granular ethnicity data from multiple variables increases the complexity of the analysis considerably, reducing the likelihood that a multi-ethnicity view is adopted in most DI analyses.

In this vignette, we illustrate how the `DisImpact` package could be adapted to carry out a multi-ethnicity analysis using the `di_iterate` function as the workhorse, and manipulating the returned summary data set.

# Multi-Ethnicity Data Format

In the case of single ethnicity categorization, ethnicity is usually stored in a single variable or column that lists the ethnicity group for each student (row). In the case of multi-ethnicity data, where a student could correspond to multiple groups, there are multiple ways to store such information. Here, we describe three common approaches:

1. A *single variable/column* that lists all student groups that a student (row) corresponds to in a single data cell, usually delimited by a comma or some other delimiter. For example, the data cell could have the value `Asian, Black, Hispanic` if the student falls into these three groups.
2. A *wide format* consisting of the same number of flags as there are groups, where each flag is binary (1/0), indicating group membership in each column. For example, if there are 9 groups, then there would be 9 additional variables/columns with names such as `Flag_Group_1`, ..., `Flag_Group_9`, where each variable takes on a value of 1 or 0, with 1 indicating group membership and 0 indicating non-membership.
3. A *long format* that has each student corresponding to the same number of rows as the number of groups they are members of, with an ethnicity column similar to that of a single ethnicity categorization. For example, if a student self-identifies as Asian, Black, and Hispanic, then the student would have three rows in the data set, with an ethnicity column having the following three values in the three corresponding rows: `Asian`, `Black`, `Hispanic`.

The second, wide format is preferred when it comes to conducting a DI analysis using the `DisImpact` package. The included `student_equity` data set contains ethnicity flags, and these variables will be used in a multi-ethnicity DI analysis.

# Multi-Ethnicity Example Data Set

As seen in the Scaling DI [vignette](./Scaling-DI-Calculations.html), one could repeat DI calculations over various success variables, group (disaggregation) variables, and cohort variables using the `di_iterate` function. The original intent of the `di_iterate` function is to take in a student-level data set and output a data set with summary results of disaggregation that could be referenced in a dashboard tool like Tableau or PowerBI. A pre-calculated data set (the output of `di_iterate`) makes it relatively easy to visualize disaggregation, equity gaps, and disproportionate impact across many outcome variables, cohort variables, disaggregation variables, and scenarios (subsets) by filtering on the appropriate rows (summarized results).

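Because the wide format is preferred for `DisImpact`, data stored in the single-column format described earlier would first need to be reshaped into flags. The following is a minimal base R sketch under stated assumptions: the data frame `df_toy`, its `ethnicity_list` column, and the comma delimiter are all hypothetical, and the `EthnicityFlag_` prefix is borrowed from the naming used in `student_equity`.

```{r}
# Hypothetical toy data: one row per student, groups listed in a single cell
df_toy <- data.frame(
  id = 1:3
  , ethnicity_list = c('Asian, Black, Hispanic', 'White', 'Hispanic, White')
  , stringsAsFactors = FALSE
)

# Split each cell on the comma and trim surrounding white space
groups_by_student <- lapply(strsplit(df_toy$ethnicity_list, ','), trimws)

# Distinct group names across all students
all_groups <- sort(unique(unlist(groups_by_student)))

# Create one 1/0 flag column per group (wide format)
for (g in all_groups) {
  is_member <- vapply(groups_by_student, function(x) g %in% x, logical(1))
  df_toy[[paste0('EthnicityFlag_', g)]] <- as.integer(is_member)
}

df_toy
```

Each student keeps a single row, and a student who lists several groups has a 1 in each corresponding flag column. A long-format data set could be handled in a similar way after collecting each student's groups into a list.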
The following snippet illustrates `di_iterate` on the `student_equity` data set using default options:

```{r, warning=FALSE}
# Load some necessary packages
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(forcats)
library(DisImpact)

# Load student equity data set
data(student_equity)

# Calculate DI over several scenarios
df_di_summary <- di_iterate(data=student_equity
                          , success_vars=c('Math', 'English', 'Transfer')
                          , group_vars=c('Ethnicity', 'Gender')
                          , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                          , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                            )
```

In addition to the `Ethnicity` variable, the `student_equity` data set also contains ethnicity flags that are more granular, based on what students report. A student could be assigned to more than one category. For example, a student could fall into the Asian and South East Asian categories. Similarly, a student could fall into both the White and South West Asian / North African (SWANA) categories.

```{r}
head(student_equity)
## # Correlation to show overlap
## cor(student_equity[, str_detect(names(student_equity), 'EthnicityFlag')])
```

# Analysis of Multi-Ethnicity Data Using DisImpact

For a multi-ethnicity analysis, one could pass a list of ethnicity flags to the `group_vars` parameter of `di_iterate`, similar to how `Gender` and `Ethnicity` were passed in the previous example to create `df_di_summary`. However, since the flags are binary (1's and 0's), and the ethnicity group names are in the variable names themselves (e.g., `EthnicityFlag_Asian`), the user needs to filter on the rows corresponding to the groups of interest (rows where the flag value is 1), extract the group names from the variable names, and store the group names in the `group` column of the returned summary data set. The code for the `student_equity` data set below illustrates these steps.

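Before working with the full data set, the filter-and-rename step can be seen in isolation. The sketch below uses a made-up data frame, `toy_summary`, that only mimics the `disaggregation` and `group` columns of a `di_iterate` result; the rows are hypothetical, and base R functions (`sub`, `gsub`) stand in for the `stringr` calls used in the vignette.

```{r}
# Hypothetical stand-in for a di_iterate summary: one row per flag value
toy_summary <- data.frame(
  disaggregation = rep(c('EthnicityFlag_Asian'
                       , 'EthnicityFlag_PacificIslander'
                       , 'EthnicityFlag_AANAPI'), each=2)
  , group = rep(c('0', '1'), times=3)
  , stringsAsFactors = FALSE
)

# Keep only the rows describing members of each group (flag value of 1)
toy_members <- toy_summary[toy_summary$group == '1', ]

# Derive readable group names from the flag variable names
toy_labels <- sub('^EthnicityFlag_', '', toy_members$disaggregation) # drop the prefix
toy_labels <- gsub('([A-Z])', ' \\1', toy_labels)                    # space before each capital letter
toy_labels <- sub('^ ', '', toy_labels)                              # drop the leading space
toy_labels <- sub('A A N A P I', 'AANAPI', toy_labels, fixed=TRUE)   # restore the acronym

toy_members$group <- toy_labels
toy_members$disaggregation <- 'Multi-Ethnicity'
toy_members
```

Inserting a space before every capital letter splits camel-case names such as `PacificIslander` into words, but it also splits acronyms, which is why the final substitution for `AANAPI` is needed.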
```{r}
# Identify the ethnicity flag variables
want_vars <- names(student_equity)[str_detect(names(student_equity), '^EthnicityFlag')]
want_vars <- want_vars[!str_detect(want_vars, 'Unknown')] # Remove Unknown
want_vars <- want_vars[!str_detect(want_vars, 'Two')] # Remove Two or More Races
want_vars # Ethnicity Flags of interest

# Number of students
## Total
student_equity %>%
  group_by(Cohort) %>%
  tally

## Each group
student_equity %>%
  select(Cohort, one_of(want_vars)) %>%
  group_by(Cohort) %>%
  summarize_all(.funs=sum) %>%
  as.data.frame
## Observation: students can be in more than 1 group

# Convert the ethnicity flags to character as required by di_iterate
for (varname in want_vars) {
  student_equity[[varname]] <- as.character(student_equity[[varname]])
}

# DI analysis
df_di_summary_mult_eth <- di_iterate(data=student_equity
                                   , success_vars=c('Math', 'English', 'Transfer')
                                   , group_vars=want_vars # specify the list of ethnicity flag variables
                                   , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                                   , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                                   , di_80_index_reference_groups='all but current'
                                     ) %>%
  filter(group=='1') %>% # Ethnicity flags have 1's and 0's; filter on just the 1 group as that is of interest
  # filter((group=='1') | (disaggregation=='- None' & group=='- All')) %>%
  mutate(group=str_replace(disaggregation, 'EthnicityFlag_', '') %>%
           gsub(pattern='([A-Z])', replacement=' \\1', x=.) %>%
           str_replace('^ ', '') %>%
           str_replace('A A N A P I', 'AANAPI') # Rather than show '1', identify the ethnicity group names and assign them to group
       , disaggregation='Multi-Ethnicity' # Originally is a list of variable names corresponding to the various ethnicity flags; call this disaggregation 'Multi-Ethnicity'
         )

# Check if re-assignments are correct
table(df_di_summary_mult_eth$disaggregation, useNA='ifany')
table(df_di_summary_mult_eth$group, useNA='ifany')

# Illustration: the group proportions add up to more than 100% since a student could be counted in more than 1 group
df_di_summary_mult_eth %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Transfer', cohort=='2018') %>%
  select(group, n) %>%
  mutate(Proportion=n / sum(student_equity$Cohort=='2018')) %>%
  mutate(Sum_Proportion=sum(Proportion))
```

# Visualizing in Dashboard Platform

Once a DI summary data set for multi-ethnicity is available, it could be combined with other summary data sets to be used in dashboard development as described in the Scaling DI [vignette](./Scaling-DI-Calculations.html).

```{r}
# Combine
df_di_summary_combined <- bind_rows(
  df_di_summary
  , df_di_summary_mult_eth # Could first filter on rows of interest (e.g., just the categorizations of interest to the institution)
)

# Disaggregation: Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  as.data.frame

# Disaggregation: Multi-Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Multi-Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  as.data.frame
```

```{r, fig.width=9, fig.height=5}
# Disaggregation: Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>%
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point(aes(size=factor(di_indicator_ppg, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  ## geom_point(aes(size=factor(di_indicator_80_index, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#1b9e77', '#d95f02', '#7570b3', '#e7298a', '#66a61e', '#e6ab02'), name='Ethnicity') +
  labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = '- All' | College Status = '- All' | Outcome = 'Math' | Disaggregation = 'Ethnicity'"))
```

```{r, fig.width=9, fig.height=5}
# Disaggregation: Multi-Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Multi-Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>%
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point(aes(size=factor(di_indicator_ppg, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  ## geom_point(aes(size=factor(di_indicator_80_index, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99'), name='Multi-Ethnicity') +
  labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = '- All' | College Status = '- All' | Outcome = 'Math' | Disaggregation = 'Multi-Ethnicity'"))
```

# Appendix: R and R Package Versions

This vignette was generated using an R session with the following packages. Readers replicating the code may see some discrepancies caused by version mismatches.

```{r}
sessionInfo()
```