--- title: 'Short R Tutorial: Validating Sequence Analysis Typologies Using Parametric Bootstrap' author: "Matthias Studer" output: rmarkdown::html_vignette bibliography: manual.bib vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Short R Tutorial: Validating Sequence Analysis Typologies Using Parametric Bootstrap} %\VignetteEncoding{UTF-8} --- # Introduction ```{r, include=FALSE} knitr::knit_hooks$set(time_it = local({ now <- NULL function(before, options) { if (before) { # record the current time before each chunk now <<- Sys.time() } else { # calculate the time difference after a chunk res <- difftime(Sys.time(), now, units = "secs") # return a character string to show the time paste("Time for this code chunk to run:", round(res, 2), "seconds") } } })) knitr::opts_chunk$set(dev = "png", dev.args = list(type = "cairo-png"), time_it=TRUE) ``` This document provides a very quick introduction to the `R` code needed to use parametric bootstraps for typology validation in sequence analysis. Readers interested in the methods and the exact interpretation of the results are referred to: - Studer, M. (2021). Validating Sequence Analysis Typologies Using Parametric Bootstraps. *Sociological Methodology 51(2)*, 290--318. https://doi.org/10.1177/00811750211014232 You are kindly asked to cite the above reference if you use the methods presented in this document. **Warning!!** To avoid lengthy computations (and overloading the CRAN server), we restricted the number of bootstraps to 50. We recommend using a higher value (i.e., 1000). Let's start by setting the seed for reproducible results. ```{r, message=FALSE} set.seed(1) ``` # Creating the State Sequence Object For this example, we use the `mvad` dataset. Let's start with the creation of the state sequence object. ```{r, message=FALSE} ## Loading the TraMineR library library(TraMineR) ## Loading the data data(mvad) ## State properties mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training") mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training") mvad.shortlab <- c("EM","FE","HE","JL","SC","TR") ## Creating the state sequence object mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab, labels = mvad.lab, xtstep = 6) ``` # Creating the typology We will now create a typology using cluster analysis. Readers interested in more detail are referred to the `WeightedCluster` library manual (also available as a vignette), which goes into the details of the creation and computation of cluster quality measures. We start by computing dissimilarities with the `seqdist` function using the Hamming distance. We then use Ward clustering to create a typology of the trajectories. For this step, we recommend the use of the `fastcluster` library, which considerably speed up the computations. ```{r, cache=TRUE, message=FALSE} ## Using fastcluster for hierarchical clustering library(fastcluster) ## Distance computation diss <- seqdist(mvad.seq, method="HAM") ## Hierarchical clustering hc <- hclust(as.dist(diss), method="ward.D") ``` We can now compute several cluster quality indices using `as.clustrange` function from two to ten groups. ```{r, message=FALSE} # Loading the WeightedCluster library library(WeightedCluster) # Computing cluster quality measures. clustqual <- as.clustrange(hc, diss=diss, ncluster=10) clustqual ``` # Parametric Bootstrap Parametric bootstrap aims to provide baseline values obtained by clustering *similar* but *non-clustered* data [@Studer2021]. This can be computed using the `seqnullcqi` function with the following parameters: - `R`: number of bootstraps. - `model`: The null model (see table below - `seqdist.args`: list of arguments passed to `seqdist` (should be identical to first call to seqdist). - `hclust.method`: hierarchical clustering method (should be identical to orginal clustering). - `kmedoid`: If `TRUE`, use PAM (and the `wcKMedRange` function) instead of hierarchical clustering. - `parallel`: If `TRUE`, use parallel computing to speed up the computations. ## Combined Randomization The following `R` code estimate expected values of the cluster quality indices when clustering similar sequences that are not clustered according to the `"combined"` model, using the Hamming distance and Ward hierarchical clustering. We set `parallel=TRUE` to use parallel computing. You can use `progressbar=TRUE` to show a progress bar and an estimation of the computation remaining time (not meaningful here within a document): ```{r} bcq.combined <- seqnullcqi(mvad.seq, clustqual, R=50, model="combined", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE) ``` Once the parametric bootstrap is computed (may take a while...), the results are stored in the `bcq.combined` object. Printing the object (just by writing its name), already provides several information, the standardized cluster quality indices and the associated inconclusive intervals. Here, 2, 9 and 10 groups stand out. ```{r, size="tiny"} bcq.combined ``` To get non-standardized values, use `norm=FALSE`. Notice that the ASW inconclusive intervals are well below the values recommended by Kaufman and Rousseeuw (over 0.5). ```{r, size="tiny"} print(bcq.combined, norm=FALSE) ``` Several plots can then be used to inspect the results using the `plot` command and the `type` argument. First, one can look at the sequences generated by the null model by using `type="seqdplot"`. ```{r, fig.width=8, fig.height=5, results="hide", dev="png"} plot(bcq.combined, type="seqdplot") ``` The overall distribution of the CQI values can be plotted using `type="density"`. In this case, one also needs to specify the CQI to be used. All tested number of groups are found to be significant. Any CQI computed by `as.clustrange()` can be used here. To show the density of the average silhouette width (`"ASW"`), one can use: ```{r, fig.width=8, fig.height=5, results="hide", dev="png"} plot(bcq.combined, stat="ASW", type="density") ``` By using `type="line"`, we plot the obtained and bootstrapped CQI values depending on the number of groups. Here again ```{r, fig.width=8, fig.height=5, results="hide", dev="png"} plot(bcq.combined, stat="ASW", type="line") ``` ## Randomized Sequencing To use another null model, one needs to change the `model` argument of the `seqnullcqi` function. The randomized sequencing keep the duration attached to each state, but randomizes the ordering of the spells. It can be used to uncover sequencing structure of the data. ```{r} bcq.seq <- seqnullcqi(mvad.seq, clustqual, R=50, model="sequencing", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE) ``` We can then plot the results as before. Notice that solutions between 3 and 6 are below the critical line. ```{r, fig.width=8, fig.height=5, results="hide", dev="png"} plot(bcq.seq, stat="ASW", type="line") ``` ## Randomized Duration The randomized duration keeps the same ordering of the states, but randomizes the time spent in each spell. It can be used to uncover the duration-related structure of the data. ```{r} bcq.dur <- seqnullcqi(mvad.seq, clustqual, R=50, model="duration", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE) ``` We can then plot the results as before. The solutions 3 and 4 groups solutions are below the "significance line". Otherwise, the ranking of the solutions is the same. ```{r, fig.width=8, fig.height=5, results="hide", dev="png"} plot(bcq.dur, stat="ASW", type="line") ``` ## State Independence The state independence null model generates sequence, position by position, independently of the previous state. This is a quite unrealistic assumption for longitudinal data, but a common one in statistical modeling. ```{r, cache=TRUE} bcq.stateindep <- seqnullcqi(mvad.seq, clustqual, R=50, model="stateindep", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE) ``` Bootstrapped CQI values are extremely low compared to our clustering, meaning that we have a strong longitudinal structure (not surprising!). ```{r, fig.width=8, fig.height=5, results="hide", dev="png"} plot(bcq.stateindep, stat="ASW", type="line") ``` ## First-order Markov Null Model The first-order Markov null model generates sequences using time-invariant transition rates. As a result, the generated sequences are often quite similar to the observed ones. This model can uncover structure stemming from time-dependent transition rates. ```{r, cache=TRUE} bcq.Markov <- seqnullcqi(mvad.seq, clustqual, R=50, model="Markov", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE) ``` ```{r, fig.width=8, fig.height=5, results="hide", dev="png"} plot(bcq.Markov, stat="ASW", type="line") ``` # Choosing a Solution The various null models lead to the same conclusions and ranking of the solutions. Solutions between 3 and 6 groups were not always above the critical lines (in the sequencing null model for instance), and can be avoided. We generally saw good clustering quality for a clustering in 9 groups. The solution is shown below. ```{r, fig.width=10, fig.height=12, results="hide"} seqdplot(mvad.seq, clustqual$clustering$cluster9, border=NA) ```