A Brief Introduction to D2MCS

Miguel Ferreiro Diaz

2022-08-23

Introduction

D2MCS is an object-oriented framework able to identify and exploit the intrinsic characteristic of input data to (i) accurately distribute features in groups (feature clustering) and (ii) design and deploy effective MCS models. Below are included code snippets belonging to the different stages of D2MCS framework: (i) Data Manipulation, (ii) Feature Clustering, (iii) MCS Creation and (iv) Classification Results.

Furthermore, the package provides the facility to read the descriptions and details of all functions through the help(D2MCS) command.

Stage 1: Data Manipulation

The first step starts using the DatasetLoader class to convert the data to be analyzed into the structure compatible with D2MCS called Dataset (or HDDataset in case the dataset cannot be stored in memory). The following code fragment shows the parameters included in the load function after instantiating the DatasetLoader class.

data.loader <- DatasetLoader$new()
data <- data.loader$load(filepath, header = TRUE, sep = ",", 
                         skip.lines = 0, normalize.names = FALSE, 
                         ignore.columns = NULL)

Once the loading process is completed and the dataset is available in a Dataset object, it is possible to perform different methods divided into three main categories taking into account their behaviour: (i) dataset information obtainer, (ii) dataset column removal and (iii) dataset splitting operation. The following code snippet shows some of the different functions in each category.

## DATASET INFORMATION OBTAINER
data$getNcol()
data$getNrow()
data$getColumnNames()
data$getDataset()

## DATASET COLUMN REMOVAL
data$cleanData(columns = NULL, 
               remove.funcs = NULL, 
               remove.na = FALSE, 
               remove.const = FALSE)

## DATASET HANDLING AND SPLITTING
data$createPartitions(num.folds = NULL, 
                      percent.folds = NULL,
                      class.balance = NULL)
subset <- data$createSubset(num.folds = NULL, 
                            column.id = NULL,
                            opts = list(remove.na = TRUE, 
                                        remove.const = FALSE),
                            class.index = NULL, 
                            positive.class = NULL)
train <- data$createTrain(num.folds = NULL, 
                          class.index, 
                          positive.class, 
                          opts = list(remove.na = TRUE, 
                                      remove.const = FALSE))

Using the createPartitions() method, the dataset is divided in order to use the divisions to create the data structures required in the following phases, using the createSubset() and createTrain() methods. While the first method performs the creation of a Subset object used both for clustering operations and for validation purposes. On the other hand, the second method is responsible for creating a Trainset object necessary to perform the model training stage.

Stage 2: Feature Clustering

After creating a Subset object, the stage two based on the distribution of features in clusters starts. The code snippet below exemplifies the three steps necessary to create and execute the clustering strategy called DependencyBasedStrategy included by default in D2MCS.

## FEATURE-CLUSTERING ALGORITHM CREATION
conf <- DependencyBasedStrategyConfiguration$new() 
dbs <- DependencyBasedStrategy$new(subset, 
                                   heuristic, 
                                   configuration = conf)

## FEATURE-CLUSTERING ALGORITHM EXECUTION 
dbs$execute(verbose = FALSE)

## FEATURE-CLUSTERING ALGORITHM FUNCTIONALITIES
dbs$getBestClusterDistribution()
dbs.train <- dbs$createTrain(subset, 
                             num.clusters = NULL, 
                             num.groups = NULL, 
                             include.unclustered = FALSE)

Stage 3: MCS Creation

From the previous execution of the selected clustering strategy, a Trainset object is obtained to be used as input for the SMC creation phase. This stage is divided into three main steps (i) D2MCS framework initialization, (ii) MCS behaviour customization options and (iii) execution of MCS discovery operation.

## D2MCS FRAMEWORK INITIALIZATION
d2mcs <- D2MCS$new(dir.path, 
                   num.cores = 2, 
                   socket.type = "PSOCK", 
                   outfile = NULL)

## MCS BEHAVIOUR CUSTOMIZATION OPTIONS
trFunction <- TwoClass$new(method, 
                           number, 
                           savePredictions, 
                           classProbs, 
                           allowParallel, 
                           verboseIter, 
                           seed = NULL)

## EXECUTION OF MCS DISCOVERY OPERATION
trained <- d2mcs$train(train.set, 
                       train.function, 
                       num.clusters = NULL, 
                       model.recipe = DefaultModelFit$new(), 
                       ex.classifiers = c(), 
                       ig.classifiers = c(), 
                       metrics = NULL, 
                       saveAllModels = FALSE)

Using the code fragment shown previously, the training of the ML models provided by the caret library is performed in order to find out which models offer the best performance for the dataset, taking into account the indicated parameters.

Stage 4: Classification Results

After building the MCS, starts the next stage related to the classification of the data. To this end, the D2MCS tool needs to use the TrainOutput structure obtained by training the MCS, the dataset to be predicted and the voting schemes chosen to combine the results of the different MSC models.

## VOTING SCHEMES AVAILABLE IN THE CLASSIFICATION STAGE
voting.types <- c(SingleVoting$new(voting.schemes, 
                                   metrics),                                    
                  CombinedVoting$new(voting.schemes, 
                                     combined.metrics, 
                                     methodology, 
                                     metrics))

## EXECUTE THE CLASSIFICATION STAGE
predictions <- d2mcs$classify(train.output, 
                              subset, 
                              voting.types, 
                              positive.class = NULL)

## COMPUTE THE PERFORMANCE OF EACH VOTING SCHEME
predictions$getPerformances(test.set, 
                            measures, 
                            voting.names = NULL, 
                            metric.names = NULL, 
                            cutoff.values = NULL)

## OBTAIN THE PREDICTIONS OBTAINED OF EACH VOTING SCHEME USED
prediction$getPredictions(voting.names = NULL, 
                          metric.names = NULL, 
                          cutoff.values = NULL, 
                          type = NULL, 
                          target = NULL, 
                          filter = FALSE)

When the classification stage is completed, the tool produces a ClassificationOutput object to allow the user to obtain information about the classification performance (getPerformances() method) and to observe in depth the predictions obtained (getPredictions() method).

Development

The D2MCS package is also available in a development version at the Github development page: github.com/drordas/D2MCS