--- title: "Multi-label Classification with MLPUGS" author: "Mikhail Popov" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Multi-label Classification with MLPUGS} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ## Introduction Suppose we have a dataset $D$ of $n$ observations and a label set $L$. Each $i$-th instance can have one or more labels $(S_i \subseteq L)$. In the multi-label classification problem, we are given $x_*$ and we are interested in $\hat{S}_*$. The most well known approach to this problem is *binary relevance* (**BM**), which transforms the problem into one binary classification problem for each label. The problem with this approach is that it ignores the relationship between the response variables, akin to fitting two multiple regression models when one should instead be using multivariate regression. Classifier chains (Read et al., 2009) (**CC**) is a type of **BM** that makes use of the correlations between the labels. Each link in the chain is trained to predict label $l_j \in L$ using a feature space extended to include $(l_1,\ldots,l_{j-1})$. An *Ensemble of Classifier Chains* (**ECC**) trains $m$ **CC** classifiers where each $C_k$ is trained with a random subset of $D$ and is likely to be unique and able to give different multi-label predictions. Of course, the problem with this approach is that to make a prediction for any of the labels, we need to know what the other labels are, which we don't because we also need to predict those. This is where the idea of Gibbs sampling comes in. We start with random predictions, then proceed label-by-label, conditioning on the most recent predictions within each iteration. After the burn-in, we should have have a stable multivariate distribution of the labels for each observation. ## Classification Pipeline **MLPUGS** (**M**ulti-**l**abel **p**rediction **u**sing **G**ibbs **s**ampling) is a wrapper that takes any binary classification algorithm as a base classifier and constructs an **ECC** and uses Gibbs sampling to make the multi-label predictions. In short, the steps are: 1. Train an **ECC** using any base classifier that can predict classes and probabilities. 2. Use it to make predictions (using Gibbs sampling). 3. Collapse multiple iterations and models into a final set of predictions. ```{r example, eval = FALSE} ecc(x, y) %>% predict(newdata) %>% [summary|validate] ``` We will go through each of the steps, including an evaluation step, in the example below. ### Note on Parallelization This package was designed to take advantage of multiple cores unless the OS is Windows. On a quad-core processor it's recommended to parallel train an ensemble of 3 models. On 6-core and 8-core processors the recommended number of models is 5 and 7, respectively. Predictions are also performed in parallel, if the user allows it. ## Example: Movies Suppose we wanted to predict whether a movie would have a good (at least 80%) critic rating on Rotten Tomatoes, Metacritic, and Fandango based on the user ratings on those websites, along with the user ratings on IMDB.com. Multi-label prediction via classifier chains allows us to use the correlation between those websites (a critically accepted movie on one review score aggregation website is likely to have high rating on another). ```{r setup} library(MLPUGS) data("movies") ``` ```{r data_head, eval = FALSE} head(movies) ``` ```{r formatted_data_head, echo = FALSE} knitr::kable(head(movies)) ``` We are going use a pre-made training dataset `movies_train` (60% of `movies`) to train our **ECC** and `movies_test` (remaining 40%) for assessing the accuracy of our predictions. ```{r load_datasets} data("movies_train"); data("movies_test") ``` ### Training an Ensemble of Classifier Chains (ECC) There is no built-in classifier as of the writing of this vignette, so `train_ecc` requires us to give it an appropriate classifier to train. In a future release, `MLPUGS` will include a classifier to work-out-of-the-box, although the package was written to allow for user-supplied classifiers. We could, for example, use the `randomForest` package, in which case our code will look like: ```{r train, eval = FALSE} fit <- ecc(movies_train[, -(1:3)], movies_train[1:3], 3, randomForest::randomForest, replace = TRUE) ``` This will give us 3 models, forming an ensemble of classifier chains. Each set of classifier chains was trained on a random 95% of the available training data. (If we had trained 1 set of classifier chains, that model would have used all of the training data.) ### Prediction Using Gibbs Sampling (PUGS)

Photo by Dídac Balanzó ([https://www.flickr.com/photos/fotodak/8968262720](https://www.flickr.com/photos/fotodak/8968262720))


Multi-label prediction is tricky with classifier chains because each label's classifier was trained using the other labels. That is, we must predict an unobserved set of values using `L-1` sets of unobserved values. To this end, we employ the Gibbs Sampler. There is no built-in classifier as of the writing of this vignette, so `predict` requires us to give it an appropriate prediction function. We chose to train a `randomForest`, so we must supply a corresponding function for making the predictions. ```{r predict_rf, eval = FALSE} pugs <- predict(fit, movies_test[, -(1:3)], burn.in = 500, n.iters = 1500, thin = 15, .f = randomForest:::predict.randomForest, type = "prob") ``` This will give us a total of 30 predictions (10 iterations, 3 sets of chains) for each movie's rating on Rotten Tomatoes, Metacritic, and Fandango. ### Gathering Predictions When we collapse the predictions using `summary`, we can ask it to provide us with binary classifications or probabilistic classifications. Binary classification works by picking the most-predicted classification between the iterations and then the models. ```{r gather_prob, eval = FALSE} y_pred <- summary(pugs, type = "prob") ``` ```{r compare_prob, echo = FALSE, eval = FALSE} rownames(y_pred) <- rownames(movies_test) knitr::kable(head(y_pred, 5), digits = 3) ``` **Table 2**: Probabilistic classifications for the first 5 movies in the validation set. | | rotten_tomatoes| metacritic| fandango| |:------------------------------|---------------:|----------:|--------:| |Avengers: Age of Ultron (2015) | 0.767| 0.287| 0.979| |Do You Believe? (2015) | 0.310| 0.131| 0.883| |Hot Tub Time Machine 2 (2015) | 0.001| 0.000| 0.018| |The Water Diviner (2015) | 0.223| 0.023| 0.983| |Top Five (2014) | 0.541| 0.016| 0.766| ```{r gather_class, eval = FALSE} y_pred <- summary(pugs, type = "class") ``` ```{r compare_class, echo = FALSE, eval = FALSE} rownames(y_pred) <- rownames(movies_test) knitr::kable(head(y_pred, 5), digits = 3) ``` **Table 3**: Binary classifications for the first 5 movies in the validation set. | | rotten_tomatoes| metacritic| fandango| |:------------------------------|---------------:|----------:|--------:| |Avengers: Age of Ultron (2015) | 1| 0| 1| |Do You Believe? (2015) | 0| 0| 1| |Hot Tub Time Machine 2 (2015) | 0| 0| 0| |The Water Diviner (2015) | 0| 0| 1| |Top Five (2014) | 1| 0| 1| An advantage of using probabilistic classification over binary is that you can experiment with the threshold you use. You may find, for example, that the intuitive and initial threshold of 0.5 doesn't work as well as, say, 0.62. ## Evaluation So how well did our ensemble of classifier chains perform? Here are true classifications of first 5 movies in the test (validation) set: ```{r,echo=FALSE} knitr::kable(movies_test[1:5, 1:3], caption="**Table 4**: True classifications for the first 5 movies in the test (validation) set.") ``` Let's calculate some established measures of accuracy for multi-label classification: ```{r,eval=FALSE} validate(pugs, movies_test[, 1:3]) ``` ```{r,eval=FALSE,echo=FALSE} temp <- as.data.frame(t(validate(pugs, movies_test[, 1:3]))) colnames(temp) <- "Measurement" temp <- cbind(temp, Description = c( "provides a steep penalty for predictions that are both confident and wrong", "average per-obs exact classification", "average per-obs classification with partial matches", "per-label classification with partial matches", "average per-example per-class total error")) knitr::kable(temp, digits = 4) ``` | | Measurement|Description | |:-----------------|-----------:|:--------------------------------------------------------------------------| |Logarithmic.Loss | 0.3180|provides a steep penalty for predictions that are both confident and wrong | |Exact.Match.Ratio | 0.6441|average per-obs exact classification | |Labelling.F.score | 0.7853|average per-obs classification with partial matches | |Retrieval.F.score | 0.6034|per-label classification with partial matches | |Hamming.Loss | 0.1299|average per-example per-class total error | Not bad! ## Further Reading Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2011). Classifier chains for multi-label classification. URL: [http://www.cs.waikato.ac.nz/~eibe/pubs/chains.pdf](http://www.cs.waikato.ac.nz/~eibe/pubs/chains.pdf) Casella, G., & George, E. I. (1992). Explaining the Gibbs Sampler. The American Statistician, 46(3), 167–174. URL: [http://www.stat.ufl.edu/archived/casella/OlderPapers/ExpGibbs.pdf](http://www.stat.ufl.edu/archived/casella/OlderPapers/ExpGibbs.pdf)