---
title: "ISCA - Guide"
bibliography: references.bib 
vignette: >
  %\VignetteIndexEntry{ISCA - Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  rmarkdown::html_vignette:
    includes:
      in_header: logo.txt
---


```{r setup}
library(ISCA)
```

## Why ISCA?

ISCA stands for “Inductive Subgroup Comparison Approach”, a data-driven approach to identifying and comparing groups in data originally proposed in Drouhot [-@Drouhot]. Specifically, ISCA identifies the main subgroups within Group 1 (a majority or reference group), and matches members of Group 2 (a minority group) to different subgroups from the majority based on social similarity. In other words, ISCA allows minority members to be compared to socially similar members of the majority. For instance, when comparing two ethnic groups (a majority and a minority group), ISCA might identify a subgroup comprising men of low social status in the majority group. It will then identify similar members of the minority group (also men of low social status) and match them to the appropriate subgroup within the majority, so that the comparison is more appropriate than would be if we compared all members to the majority group to all members of the minority groups. ISCA is especially useful when majority and minority groups to be compared are internally heterogeneous, resulting in different baselines within the majority group against which the minority group should be compared (again, to illustrate, for many comparisons it is best to study low status minority members vis-à-vis low status majority members rather than majority members in general).

On a more theoretical level, ISCA helps researchers avoid essentializing minority and majority groups as (e.g. ethnically, racially, religiously) homogenous categories in their quantitative analyses, and recover the internal social structures of social groups. Unlike common regression approach which obscure intra-group heterogeneity by yielding a single estimate for each pre-defined, fixed (nominal) group category, ISCA relies on a mixture of fuzzy clustering, Monte Carlo simulation, and regression analysis to assess minority and majority groups within a population as socially comparable subgroups, yielding an estimate for each subgroup. Estimates can then be compared across subgroups, as well as to a grand mean capturing traditional, single estimates at the group level. Together, these comparisons allow to see which empirical subgroups may be driving overall trends observed at the aggregate level, and, conversely, which pointed trends at the subgroup level get lost when only looking at nominal group-level estimates. For an in-depth discussion, see Drouhot 2021: 804-811. 
If you are already familiar with the logic of ISCA, you can skip the next section and move right on to the first function.

## What is ISCA? Some Background (adapted from Drouhot 2021)

ISCA consists of three distinct methodological steps. 

1. Through fuzzy cluster analysis— a family of data partitioning techniques from the larger umbrella of unsupervised machine learning [@Kaufman_etal; @Molina_Garip] — the first step is to inductively identify the key subgroups within the majority population.

2. Minority groups are assigned to one of the subgroups making up the majority population on the basis of (social) similarity.

3. Hypotheses are tested and outcomes of interest are modelled in these matched minority-majority subgroups.

Thus, ISCA effectively switches the relevant categories of analysis in intergroup comparisons from nominal (e.g., religious, ethnic, or otherwise categorically defined) groups to data-driven subgroups and allows for within as well as the more traditional between-group comparisons.

From the point of view of the integration / assimilation literature within the social sciences, ISCA allows for the study of distinct assimilation/integration pathways by taking intragroup heterogeneity among both the majority and minority population into account and is characterized by three key features. 

1. It is **inductive**, insofar as subgroups of interest emerge from the data rather than prenotions from the researcher. This is a point worth emphasizing: you could decide to compare low-income immigrants to low-income natives, for example. But you cannot, by definition, know in advance whether income is the right dimension to organize your comparison. Additionally, it is possible for several dimensions (income, gender, age, urban location, etc.) to consolidate into subgroups [@Blau]. In other words, what matters for group heterogeneity may well be specific configurations of variables rather than specific variables. ISCA relies on the inherent reflexivity afforded by unsupervised machine-learning approaches: while domain-specific knowledge remains key to interpreting results, researchers need not impose assumptions about the structure of the data or the analytical appropriateness of a given social category in organizing heterogeneity within the data.

2. ISCA is **probabilistic** because it does not rely on “hard” clustering assignment (such as that obtained with k-means clustering), where observations can belong to one cluster only, as such an approach would lead to reifying subgroups themselves and possibly overstating cross-cluster differences. Rather, it relies on fuzzy clustering, where membership in each cluster is uncertain and expressed through a membership score, which is then used to assign cases to groups in a stochastic and thus truly probabilistic manner. <br>  <br> More specifically, each individual observation receives *k* membership scores, where *k* is the number of clusters. Membership scores range from 0 to 1, and the sum of each individual’s membership scores adds up to 1. If all individuals have a very high probability of belonging to only one cluster, then fuzzy and hard clustering do not differ much. However, if membership probabilities are balanced across clusters, hard clustering may result in arbitrary, and possibly misleading group assignments.^[Membership scores are calculated as follows: $w_{i_j}=\frac{1}{\sum_{k=1}^{c} (|| x_i - c_j || / || x_i - c_k ||)^{2/(m-1)}}$] 
<br>

3. Finally, ISCA is **iterative**. As there exists significant uncertainty around subgroup boundaries (as expressed by membership scores from fuzzy clustering that are balanced), results following stochastic assignment may misrepresent the underlying uncertainty about cluster membership when membership is assigned only once. Thus, ISCA relies on Monte Carlo simulation and multiple iterations of the assignment and modelling steps to obtain stable results. Rather than a statistical or analytical nuisance, the procedure hence regards assignment uncertainty as meaningful since it reflects the blurry boundaries between ideal-typical subgroups making up the majority and minority categories of interest.

The following three functions allow researchers to conduct the ISCA in R.

## Function / Step 1

**Identifying Heterogeneity in Majority Group through Fuzzy Clustering and Stochastically Assign Individuals to a Subgroup**

The function translates to step 1 and 2 (see Drouhot 2021: 808-10). First, identify and choose variables that are known to be associated with your outcome variable based on past research. In the original implementation, ISCA does not divide up a majority group sample directly in terms of the outcome variable but uses (demographic) variables that are known to be related to it. Note, however, that splitting groups directly in terms of the outcome of interest is of course possible, and can be appropriate depending on the research questions and related substantive concerns.

To obtain subgroups in the majority sample, the ISCA package then relies on the fuzzy c-means clustering algorithm [@Bezdek] and specifically the `cmeans()` function of the `e1071`-package [@e1071]. It is advisable to transform continuous variables in dummies (e.g., coded 0 for values below the median and 1 for values above) to avoid an arbitrary weighting of attributes due to different scales, which would affect the clustering results in undesirable ways. 

You need to decide the number of clusters you would like to model. There is a large literature on this across social sciences, as well as computer and data science. Hence, a large number of diagnostic tests exist to determine the most appropriate number, each with strength and weakness; for instance, Drouhot (2021: 839-40 in Appendix B) used 6 tests from the fclust package [@fclust]. Additionally, users can compare multiple solutions and go for the one where each subgroup is the most easily interpretable. In particular, clustering solution yielding either one very large, multiple very small, or redundant groups should be approached with caution.

For transparency, the function uses the following arguments within the `cmeans()` function:  `iter.max=100`, `verbose=TRUE`, `dist="manhattan"`, `method="cmeans"`, `m=1.5` (default, but can be adjusted in the `ISCA_random_assignments()`-function), `rate.par = NULL`.

Once you know the number of clusters you want to model and the variables with which the clusters should be created, the first function from the ISCA-package can be used.

The `ISCA_random_assignments()`-function requires the following arguments:

* `data=` A dataset containing all relevant variables

* `filter=` Specification of the variable name that contains information on majority / minority status.

* `majority_group=` The specific value within the variable specified in the previous filter-argument indicating majority status. This could be either a numeric value or a character string.

* `minority_group=` The specific value(s) indicating minority status in the filter variable. This could be either a numeric value or a character string. It can be one single minority group or a vector of several minority groups.

* `fuzzifier =` The fuzzifier is a value larger than 1 determining the extent of overlap between clusters. A value of 1 effectively makes fuzzy c-means equivalent to hard k-means. By default, the value is 1.5.

* `n_clusters=` Number of clusters to be created.
                                      
* `draws=` Number of probabilistic draws. If not specified, the default is 500.

* `cluster_vars=` Vector specifying the variables that should be used to create the clusters.

The output is a dataframe that has one new column containing the cluster assignment for each probabilistic draw. This means that the output is based on the originally provided dataset and all variables including those not used to create the clusters remain unchanged so that they could be used in later analyses (e.g., for step 2 and 3).  The output is the foundation of the other two functions in the ISCA-package. For purposes of illustration, the output below is from a simulated dataset containing 1000 observations. One observation can be interpreted as one individual. The output below only shows the first six individuals in this fictitious dataset.

```{r }
data("sim_data")
head(sim_data)

ISCA_step1 <- ISCA_random_assignments(data=sim_data, filter=native, majority_group=1, minority_group=c(0), fuzzifier = 1.5,
                                      n_clusters=3, draws=5, cluster_vars= c("female", "age", "education", "income"))

head(ISCA_step1)
```

In this example, four variables (i.e., female, age, education, income) are used to create 4 clusters to which majority and minority subjects are assigned probabilistically across 5 iterations. The variable in the data set that stores information on minority vs. majority status is called 'native'. In it, value 1 indicates majority status, the values 0 indicates minority status. In that particular example, clustering results (as indicated by column A1 to A5; 5 multinomial assignments based on the distribution formed by membership function in the 3 clusters) shows substantial stability across assignments A1-A5, indicating that the distribution of membership functions is probably very skewed towards one cluster. Note that for observation #1, the assignment is less stable than other observations. In this particular example based on a fictitious and random data, the obtained subgroups do not coalesce on all variables; rather, income appear most influential in organizing subgroups (i.e., subgroups may be heterogeneous in terms of age and religiosity for instance, but homogenous in terms of income). Note that the discrimination and religiosity variables are part of the dataset but were not used to create the clusters because the clusters focus more on social structural indicators and the inclusion of the outcome variable is not always recommended (see Drouhot 2021).

Note also that the function probabilistically assigns members of both majority and minority groups to clusters. Specifically, the function first starts by probabilistically assigning majority group members to the empirically defined clusters, and then continues to probabilistically assign each minority group member to the cluster they most closely resemble. In other words, the clusters are created based only on the majority group, and minority group members are subsequently matched to a cluster of reference based on social similarity. This is why in the `head(ISCA_step1)` output above the first six individuals are all natives, indicated by the value one. When displaying the last 6 rows of `ISCA_step1`, we can see that the output contains minority members as well:

```{r}
tail(ISCA_step1, 6)
```


## Function / Step 2

The second function gives an overview of the distribution of the variables of interest per cluster and across all iterations. That includes:

* a grand mean (i.e., the mean of the mean per variable and cluster and across iteration), 

* a grand standard deviation (i.e., the mean of the standard deviation per variable and cluster, and across iteration),

* a cluster error (i.e., the standard deviation of the mean value per variable and cluster, and across iteration) for each variable. 

The function also computes a grand mean and cluster error for a count variable for each cluster.

The substantive goal for the analyst here is to better understand the obtained clusters, by interpreting them based on domain-specific and other expert knowledge of the populations and context at hand. When producing a descriptive table, remember that observations should first be filtered to include only members of the majority group since they are the ones determining the social structure obtained at step 1 (minority members are only matched to clusters of majority members they most closely resemble). Additionally, for purposes of comparison, the table could also include variables that were not part of the cluster building, for example other independent or dependent variables.

Variables of interest could be the variables that were used to build the clusters. This helps to understand the clusters and assign substantive meaning to them. In that case the dataset should first be filtered to include only members of the majority group. The table could also include variables that were not part of the cluster building, for example other independent or dependent variables.

Variables with only two values are automatically recognized as categorical, dichotomous variables and no grand standard deviation is calculated. If you work with multinomial factor variables, it is best to include them as dummy-variables.

The `ISCA_clustertable()`-function requires the following arguments:

* `data=` The dataset including all relevant variables and the random assignments from the first ISCA_random_assignments()-function.

* `cluster_vars=` A vector specifying the variables of interest.

* `draws=` Number of probabilistic draws. The number of draws should be equal to the number of draws specified in the first step. If not specified, the default is 500.


```{r }
majority_only <- ISCA_step1[ISCA_step1["native"] == 1, ]

result_ISCA_clustertable <- ISCA_clustertable(data = majority_only, 
                                              cluster_vars = c("native", "female", "age", "education", "income"),
                                              draws = 5)

head(result_ISCA_clustertable, 50)
```

Remember that the function examples are based on a simple artificial dataset. Interpreting the clusters here is therefore not very straightforward and useful. With real life data, clearer patterns are likely to emerge, and are to be interpreted by the researcher based on domain-specific knowledge. 

## Function / Step 3

**Within-Cluster Modelling of Outcomes and Cross-Group Comparisons**

The third function runs OLS regressions for each cluster and iteration. The output contains a list with two entries which can be extracted. The first entry contains the OLS regression estimates for each cluster averaged across iterations, including estimates for a model which pools all clusters. The second entry contains averaged adjusted R-squared values per cluster to indicate model fit.

The `ISCA_modeling()`-function requires the following arguments:

* `data=` The dataset including all relevant variables and the random assignments from the first ISCA_random_assignments()-function.

* `model_spec=` A model specification similar to the `lm()`-function.

* `draws=` Number of probabilistic draws. The number of draws should be equal to the number of draws specified in the first and second step. If not specified, the default is 500.

* `n_clusters=` Number of clusters. This value should be equal to the number of clusters specified in the first and second step.

* `weights=` A vector specifying the variable in which the weights are stored. The default is `NONE`.

Note that the goal here becomes to compare majority and minority groups within each cluster as well as differences between clusters within a given categorically defined groups, particularly whether minority consistently differ from majority members and whether predictors affect the main outcome of interest in a consistent manner across subgroups. Additionally, if a supplementary model at the categorical (i.e., nominal) group level is run, the comparison of the cluster-specific estimates with group-level estimates is particularly meaningful because it allows to see which subgroups drive aggregate trends, and conversely, which significant trends at the subgroup level get obscured at the aggregate level.

Currently, there are no tests for the differences in effects between clusters, although these can be obtained from ad hoc testing as such through the Paternoster test for the equality of coefficients [@Paternoster_etal].

```{r }
ISCA_modeling_results <- ISCA_modeling(data= ISCA_step1, 
                                   model_spec="religiosity ~ native + female + age + education + discrimination",
                                   draws = 5, n_clusters = 3, weights = NULL)

estimates <- as.data.frame(ISCA_modeling_results[1])
head(estimates, 30) # OLS estimates per cluster across iterations

rsquared <- as.data.frame(ISCA_modeling_results[2])
head(rsquared)  # Corresponding Adjusted R Squared values per cluster across iterations
```

In this example, the variables native, female, age, education and discrimination are regressed on religiosity.
No weights are specified. Results are averaged across 5 iterations and estimates are calculated for each of the three clusters. Additionally, there is a model 0 which pooled all clusters. Note that unlike in the second function (i.e., `ISCA_clustertable()`), clusters are not represented as columns but are instead stored in the "cluster" variable (i.e., as rows). This way, users can use the output for post-processing tasks, such as plotting the estimates.

The code below helps to transform the output so that clusters are depicted as columns. This might ease the comparison of the effect coefficients across clusters. Remember that 0 indicates the pooled model.

```{r }
cluster_as_columns <- estimates %>%
  tidyr::pivot_longer(cols = c(mean_coefficients, mean_std.error, mean_p_value),
               names_to = "Statistics",
               values_to = "value") %>%
  tidyr::pivot_wider(names_from = cluster, values_from = value)

print(cluster_as_columns)
```

## References