--- title: "Multi-label Classification with MLPUGS" author: "Mikhail Popov" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Multi-label Classification with MLPUGS} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ## Introduction Suppose we have a dataset $D$ of $n$ observations and a label set $L$. Each $i$-th instance can have one or more labels $(S_i \subseteq L)$. In the multi-label classification problem, we are given $x_*$ and we are interested in $\hat{S}_*$. The most well known approach to this problem is *binary relevance* (**BM**), which transforms the problem into one binary classification problem for each label. The problem with this approach is that it ignores the relationship between the response variables, akin to fitting two multiple regression models when one should instead be using multivariate regression. Classifier chains (Read et al., 2009) (**CC**) is a type of **BM** that makes use of the correlations between the labels. Each link in the chain is trained to predict label $l_j \in L$ using a feature space extended to include $(l_1,\ldots,l_{j-1})$. An *Ensemble of Classifier Chains* (**ECC**) trains $m$ **CC** classifiers where each $C_k$ is trained with a random subset of $D$ and is likely to be unique and able to give different multi-label predictions. Of course, the problem with this approach is that to make a prediction for any of the labels, we need to know what the other labels are, which we don't because we also need to predict those. This is where the idea of Gibbs sampling comes in. We start with random predictions, then proceed label-by-label, conditioning on the most recent predictions within each iteration. After the burn-in, we should have have a stable multivariate distribution of the labels for each observation. ## Classification Pipeline **MLPUGS** (**M**ulti-**l**abel **p**rediction **u**sing **G**ibbs **s**ampling) is a wrapper that takes any binary classification algorithm as a base classifier and constructs an **ECC** and uses Gibbs sampling to make the multi-label predictions. In short, the steps are: 1. Train an **ECC** using any base classifier that can predict classes and probabilities. 2. Use it to make predictions (using Gibbs sampling). 3. Collapse multiple iterations and models into a final set of predictions. ```{r example, eval = FALSE} ecc(x, y) %>% predict(newdata) %>% [summary|validate] ``` We will go through each of the steps, including an evaluation step, in the example below. ### Note on Parallelization This package was designed to take advantage of multiple cores unless the OS is Windows. On a quad-core processor it's recommended to parallel train an ensemble of 3 models. On 6-core and 8-core processors the recommended number of models is 5 and 7, respectively. Predictions are also performed in parallel, if the user allows it. ## Example: Movies Suppose we wanted to predict whether a movie would have a good (at least 80%) critic rating on Rotten Tomatoes, Metacritic, and Fandango based on the user ratings on those websites, along with the user ratings on IMDB.com. Multi-label prediction via classifier chains allows us to use the correlation between those websites (a critically accepted movie on one review score aggregation website is likely to have high rating on another). 
```{r setup}
library(MLPUGS)
data("movies")
```

```{r data_head, eval = FALSE}
head(movies)
```

```{r formatted_data_head, echo = FALSE}
knitr::kable(head(movies))
```

We are going to use a pre-made training dataset `movies_train` (60% of `movies`) to train our **ECC** and `movies_test` (the remaining 40%) to assess the accuracy of our predictions.

```{r load_datasets}
data("movies_train"); data("movies_test")
```

### Training an Ensemble of Classifier Chains (ECC)

There is no built-in classifier as of the writing of this vignette, so `ecc` requires us to give it an appropriate classifier to train. In a future release, `MLPUGS` will include a classifier so that it works out of the box, although the package was written to allow for user-supplied classifiers. We could, for example, use the `randomForest` package, in which case our code will look like this:

```{r train, eval = FALSE}
fit <- ecc(movies_train[, -(1:3)], movies_train[1:3], 3,
           randomForest::randomForest, replace = TRUE)
```

This will give us 3 models, forming an ensemble of classifier chains. Each set of classifier chains was trained on a random 95% of the available training data. (If we had trained 1 set of classifier chains, that model would have used all of the training data.)

### Prediction Using Gibbs Sampling (PUGS)
*Photo by Dídac Balanzó ([https://www.flickr.com/photos/fotodak/8968262720](https://www.flickr.com/photos/fotodak/8968262720))*
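With the ensemble trained, the prediction step follows the pipeline sketched earlier: the fitted **ECC** is passed to `predict()` together with the unlabeled feature columns of the test set. The call below is only a sketch; beyond `fit` and the new data, any additional arguments (for example, how the base classifier's own prediction function or the number of Gibbs iterations is supplied) are omitted here because they depend on the package's actual interface, so consult the package documentation for the definitive call.

```{r predict_sketch, eval = FALSE}
# A sketch, not the definitive interface; check the package documentation
# (help(package = "MLPUGS")) for the predict method's actual arguments.
predictions <- predict(fit, movies_test[, -(1:3)])
```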