Introduction to SuperML

Manish Saraswat

2024-02-18

SuperML R package is designed to unify the model training process in R like Python. Generally, it’s seen that people spend lot of time in searching for packages, figuring out the syntax for training machine learning models in R. This behaviour is highly apparent in users who frequently switch between R and Python. This package provides a python´s scikit-learn interface (fit, predict) to train models faster.

In addition to building machine learning models, there are handy functionalities to do feature engineering

This ambitious package is my ongoing effort to help the r-community build ML models easily and faster in R.

Install

You can install latest cran version using (recommended):

install.packages("superml")

You can install the developmemt version directly from github using:

devtools::install_github("saraswatmks/superml")

Caveats on superml installation

For machine learning, superml is based on the existing R packages. Hence, while installing the package, we don’t install all the dependencies. However, while training any model, superml will automatically install the package if its not found. Still, if you want to install all dependencies at once, you can simply do:

install.packages("superml", dependencies=TRUE)

Examples - Machine Learning Models

This package uses existing r-packages to build machine learning model. In this tutorial, we’ll use data.table R package to do all tasks related to data manipulation.

Regression Data

We’ll quickly prepare the data set to be ready to served for model training.

load("../data/reg_train.rda")
# if the above doesn't work, you can try: load("reg_train.rda")
# superml::check_package("caret")
library(data.table)
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(superml)

library(Metrics)
#> 
#> Attaching package: 'Metrics'
#> The following objects are masked from 'package:caret':
#> 
#>     precision, recall

head(reg_train)
#>    Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
#> 1:  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
#> 2:  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
#> 3:  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
#> 4:  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
#> 5:  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
#> 6:  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
#>    Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
#> 1:    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
#> 2:    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
#> 3:    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
#> 4:    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
#> 5:    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
#> 6:    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
#>    HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
#> 1:     2Story           7           5      2003         2003     Gable  CompShg
#> 2:     1Story           6           8      1976         1976     Gable  CompShg
#> 3:     2Story           7           5      2001         2002     Gable  CompShg
#> 4:     2Story           7           5      1915         1970     Gable  CompShg
#> 5:     2Story           8           5      2000         2000     Gable  CompShg
#> 6:     1.5Fin           5           5      1993         1995     Gable  CompShg
#>    Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
#> 1:     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
#> 2:     MetalSd     MetalSd       None          0        TA        TA     CBlock
#> 3:     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
#> 4:     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
#> 5:     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
#> 6:     VinylSd     VinylSd       None          0        TA        TA       Wood
#>    BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
#> 1:       Gd       TA           No          GLQ        706          Unf
#> 2:       Gd       TA           Gd          ALQ        978          Unf
#> 3:       Gd       TA           Mn          GLQ        486          Unf
#> 4:       TA       Gd           No          ALQ        216          Unf
#> 5:       Gd       TA           Av          GLQ        655          Unf
#> 6:       Gd       TA           No          GLQ        732          Unf
#>    BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
#> 1:          0       150         856    GasA        Ex          Y      SBrkr
#> 2:          0       284        1262    GasA        Ex          Y      SBrkr
#> 3:          0       434         920    GasA        Ex          Y      SBrkr
#> 4:          0       540         756    GasA        Gd          Y      SBrkr
#> 5:          0       490        1145    GasA        Ex          Y      SBrkr
#> 6:          0        64         796    GasA        Ex          Y      SBrkr
#>    1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
#> 1:      856      854            0      1710            1            0        2
#> 2:     1262        0            0      1262            0            1        2
#> 3:      920      866            0      1786            1            0        2
#> 4:      961      756            0      1717            1            0        1
#> 5:     1145     1053            0      2198            1            0        2
#> 6:      796      566            0      1362            1            0        1
#>    HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
#> 1:        1            3            1          Gd            8        Typ
#> 2:        0            3            1          TA            6        Typ
#> 3:        1            3            1          Gd            6        Typ
#> 4:        0            3            1          Gd            7        Typ
#> 5:        1            4            1          Gd            9        Typ
#> 6:        1            1            1          TA            5        Typ
#>    Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
#> 1:          0        <NA>     Attchd        2003          RFn          2
#> 2:          1          TA     Attchd        1976          RFn          2
#> 3:          1          TA     Attchd        2001          RFn          2
#> 4:          1          Gd     Detchd        1998          Unf          3
#> 5:          1          TA     Attchd        2000          RFn          3
#> 6:          0        <NA>     Attchd        1993          Unf          2
#>    GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
#> 1:        548         TA         TA          Y          0          61
#> 2:        460         TA         TA          Y        298           0
#> 3:        608         TA         TA          Y          0          42
#> 4:        642         TA         TA          Y          0          35
#> 5:        836         TA         TA          Y        192          84
#> 6:        480         TA         TA          Y         40          30
#>    EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
#> 1:             0         0           0        0   <NA>  <NA>        <NA>
#> 2:             0         0           0        0   <NA>  <NA>        <NA>
#> 3:             0         0           0        0   <NA>  <NA>        <NA>
#> 4:           272         0           0        0   <NA>  <NA>        <NA>
#> 5:             0         0           0        0   <NA>  <NA>        <NA>
#> 6:             0       320           0        0   <NA> MnPrv        Shed
#>    MiscVal MoSold YrSold SaleType SaleCondition SalePrice
#> 1:       0      2   2008       WD        Normal    208500
#> 2:       0      5   2007       WD        Normal    181500
#> 3:       0      9   2008       WD        Normal    223500
#> 4:       0      2   2006       WD       Abnorml    140000
#> 5:       0     12   2008       WD        Normal    250000
#> 6:     700     10   2009       WD        Normal    143000

split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
# remove features with 90% or more missing values
# we will also remove the Id column because it doesn't contain
# any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])

xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]

# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]

for(c in cat_cols){
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA'

# removing noise column
noise <- c('GrLivArea','TotalBsmtSF')

xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]

# fill missing value with  -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1

KNN Regression

knn <- KNNTrainer$new(k = 2,prob = T,type = 'reg')
knn$fit(train = xtrain, test = xtest, y = 'SalePrice')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type='raw')
rmse(actual = xtest$SalePrice, predicted=labels)
#> [1] 50619.75

SVM Regression

svm <- SVMTrainer$new()
svm$fit(xtrain, 'SalePrice')
pred <- svm$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)

Simple Regresison

lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#> 
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Coefficients: (1 not defined because of singularities)
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    1.296e+06  1.605e+06   0.807 0.419765    
#> MSSubClass    -1.596e+02  4.528e+01  -3.524 0.000445 ***
#> MSZoning      -2.418e+03  1.744e+03  -1.387 0.165864    
#> LotFrontage   -5.934e+01  3.400e+01  -1.745 0.081257 .  
#> LotArea        4.490e-01  1.366e-01   3.287 0.001051 ** 
#> Street        -4.059e+04  2.088e+04  -1.944 0.052238 .  
#> LotShape       8.265e+01  2.098e+03   0.039 0.968591    
#> LandContour   -1.059e+03  2.272e+03  -0.466 0.641316    
#> Utilities             NA         NA      NA       NA    
#> LotConfig      1.520e+03  1.092e+03   1.392 0.164303    
#> LandSlope      1.006e+04  5.030e+03   2.001 0.045718 *  
#> Neighborhood   5.015e+02  1.987e+02   2.524 0.011761 *  
#> Condition1    -1.763e+03  9.432e+02  -1.870 0.061859 .  
#> Condition2    -9.393e+03  2.832e+03  -3.317 0.000944 ***
#> BldgType      -6.226e+02  1.974e+03  -0.315 0.752598    
#> HouseStyle     1.973e+02  8.855e+02   0.223 0.823775    
#> OverallQual    1.680e+04  1.364e+03  12.316  < 2e-16 ***
#> OverallCond    5.159e+03  1.225e+03   4.211 2.79e-05 ***
#> YearBuilt      2.392e+02  8.954e+01   2.672 0.007678 ** 
#> YearRemodAdd   1.355e+02  8.129e+01   1.667 0.095914 .  
#> RoofStyle      5.727e+03  1.921e+03   2.981 0.002945 ** 
#> RoofMatl      -1.288e+04  2.533e+03  -5.083 4.47e-07 ***
#> Exterior1st   -1.895e+03  6.889e+02  -2.750 0.006069 ** 
#> Exterior2nd    1.246e+03  6.551e+02   1.902 0.057430 .  
#> MasVnrType     3.984e+03  1.695e+03   2.350 0.018986 *  
#> MasVnrArea     3.162e+01  7.220e+00   4.379 1.32e-05 ***
#> ExterQual      2.974e+03  2.364e+03   1.258 0.208725    
#> ExterCond      2.453e+03  2.420e+03   1.014 0.310879    
#> Foundation    -2.473e+03  1.525e+03  -1.622 0.105184    
#> BsmtQual       1.493e+03  1.488e+03   1.004 0.315833    
#> BsmtCond      -4.395e+02  1.453e+03  -0.302 0.762416    
#> BsmtExposure   2.758e+03  9.849e+02   2.800 0.005211 ** 
#> BsmtFinType1  -1.829e+03  8.187e+02  -2.235 0.025668 *  
#> BsmtFinSF1     7.017e+00  6.055e+00   1.159 0.246795    
#> BsmtFinType2  -1.110e+03  1.219e+03  -0.911 0.362676    
#> BsmtFinSF2     1.368e+01  9.628e+00   1.421 0.155567    
#> BsmtUnfSF      1.118e+00  5.842e+00   0.191 0.848287    
#> Heating       -1.301e+03  3.584e+03  -0.363 0.716799    
#> HeatingQC     -2.912e+03  1.404e+03  -2.074 0.038313 *  
#> CentralAir     5.515e+03  5.488e+03   1.005 0.315212    
#> Electrical     1.643e+03  1.523e+03   1.079 0.280997    
#> `1stFlrSF`     3.679e+01  7.274e+00   5.058 5.09e-07 ***
#> `2ndFlrSF`     3.062e+01  6.024e+00   5.083 4.47e-07 ***
#> LowQualFinSF   2.060e+01  2.313e+01   0.891 0.373282    
#> BsmtFullBath   1.058e+04  3.048e+03   3.470 0.000544 ***
#> BsmtHalfBath  -1.596e+03  4.871e+03  -0.328 0.743172    
#> FullBath       9.015e+03  3.216e+03   2.804 0.005156 ** 
#> HalfBath       3.567e+03  2.997e+03   1.190 0.234208    
#> BedroomAbvGr  -5.661e+03  1.943e+03  -2.913 0.003666 ** 
#> KitchenAbvGr  -1.355e+04  6.414e+03  -2.112 0.034914 *  
#> KitchenQual    6.223e+03  1.796e+03   3.465 0.000554 ***
#> TotRmsAbvGrd   5.676e+03  1.455e+03   3.902 0.000102 ***
#> Functional    -4.137e+03  1.583e+03  -2.614 0.009080 ** 
#> Fireplaces    -1.663e+02  2.554e+03  -0.065 0.948105    
#> FireplaceQu    4.029e+03  1.404e+03   2.870 0.004191 ** 
#> GarageType     1.664e+02  1.380e+03   0.121 0.904026    
#> GarageYrBlt    1.581e+00  5.738e+00   0.276 0.782978    
#> GarageFinish  -1.441e+01  1.525e+03  -0.009 0.992463    
#> GarageCars     1.323e+04  3.501e+03   3.780 0.000167 ***
#> GarageArea     1.683e+01  1.183e+01   1.423 0.155049    
#> GarageQual     8.273e+03  3.293e+03   2.513 0.012149 *  
#> GarageCond    -1.480e+03  2.165e+03  -0.684 0.494158    
#> PavedDrive    -2.902e+03  3.090e+03  -0.939 0.347968    
#> WoodDeckSF     2.260e+01  9.139e+00   2.473 0.013569 *  
#> OpenPorchSF    2.219e+01  1.829e+01   1.213 0.225405    
#> EnclosedPorch  2.817e+01  1.862e+01   1.513 0.130725    
#> `3SsnPorch`    5.053e+01  3.686e+01   1.371 0.170751    
#> ScreenPorch    8.503e+01  2.091e+01   4.067 5.15e-05 ***
#> PoolArea      -8.682e+01  2.844e+01  -3.052 0.002334 ** 
#> Fence         -9.121e+02  1.421e+03  -0.642 0.520980    
#> MiscVal        2.612e-01  1.844e+00   0.142 0.887346    
#> MoSold        -2.057e+02  3.821e+02  -0.538 0.590532    
#> YrSold        -1.056e+03  7.972e+02  -1.324 0.185735    
#> SaleType       3.283e+03  1.351e+03   2.431 0.015250 *  
#> SaleCondition  1.627e+03  1.265e+03   1.286 0.198766    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 1025713223)
#> 
#>     Null deviance: 6.2770e+12  on 1023  degrees of freedom
#> Residual deviance: 9.7443e+11  on  950  degrees of freedom
#> AIC: 24226
#> 
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 36414.68

Lasso Regression

lf <- LMTrainer$new(family = "gaussian", alpha = 1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 40647.6

Ridge Regression

lf <- LMTrainer$new(family = "gaussian", alpha=0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 41033.56

Logistic Regression with CV

lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)

Random Forest

rf <- RFTrainer$new(n_estimators = 500,classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>               tmp.order.tmp..decreasing...TRUE..
#> OverallQual                         805396946851
#> GarageCars                          522277512570
#> GarageArea                          492504056036
#> 1stFlrSF                            456146530362
#> YearBuilt                           321001356223
#> FullBath                            273711059272
#> BsmtFinSF1                          256640207793
#> GarageYrBlt                         242955028927
#> BsmtQual                            221599078735
#> TotRmsAbvGrd                        197896142442
#> LotArea                             187832123674
#> ExterQual                           172679747974
#> 2ndFlrSF                            165575198771
#> YearRemodAdd                        153466157731
#> FireplaceQu                         138693929431
#> MasVnrArea                          128673493393
#> KitchenQual                         117443303312
#> Fireplaces                          111705942290
#> OpenPorchSF                          92631484839
#> Foundation                           83286121952
#> LotFrontage                          78827754312
#> BsmtUnfSF                            65668598298
#> BsmtFinType1                         61625988292
#> WoodDeckSF                           57948560794
#> Neighborhood                         57616557231
#> HeatingQC                            47085794011
#> BedroomAbvGr                         44560418156
#> GarageType                           41061811034
#> MSSubClass                           38186219440
#> OverallCond                          32843017602
#> Exterior2nd                          32477077925
#> HalfBath                             31853557276
#> MoSold                               31439094626
#> Exterior1st                          27829430593
#> HouseStyle                           27529302283
#> YrSold                               26883223060
#> RoofStyle                            26445458267
#> BsmtFullBath                         25895731346
#> LotShape                             22357815068
#> BsmtExposure                         21438623304
#> GarageFinish                         20395336117
#> MSZoning                             17500907349
#> SaleCondition                        17274027042
#> MasVnrType                           17169758549
#> ScreenPorch                          16281657555
#> GarageQual                           15715484041
#> LandContour                          15003410671
#> CentralAir                           14567933496
#> LotConfig                            14394740952
#> BldgType                             13477548912
#> GarageCond                           13395753051
#> SaleType                             13189102573
#> EnclosedPorch                        10687715135
#> BsmtFinSF2                           10438619040
#> Condition1                           10308218603
#> LandSlope                             9906284042
#> ExterCond                             8994393632
#> Functional                            7360746298
#> BsmtFinType2                          6937544624
#> RoofMatl                              6765618719
#> KitchenAbvGr                          6604190117
#> BsmtCond                              6435711990
#> Fence                                 6259391880
#> LowQualFinSF                          5518554748
#> PavedDrive                            4503508900
#> Heating                               3676274614
#> Condition2                            3363484182
#> BsmtHalfBath                          2930090851
#> Electrical                            2626782717
#> 3SsnPorch                             2014013123
#> MiscVal                               1804559185
#> PoolArea                              1794636554
#> Street                                 317976991
#> Utilities                                      0
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 33814.05

Xgboost

xgb <- XGBTrainer$new(objective = "reg:linear"
                      , n_estimators = 500
                      , eval_metric = "rmse"
                      , maximize = F
                      , learning_rate = 0.1
                      ,max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)

Grid Search

xgb <- XGBTrainer$new(objective = "reg:linear")

gst <- GridSearchCV$new(trainer = xgb,
                             parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'))
gst$fit(xtrain, "SalePrice")
gst$best_iteration()

Random Search

rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
                             parameters = list(n_estimators = c(5,10),
                             max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'),
                             n_iter = 3)
rst$fit(xtrain, "SalePrice")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 5
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0.005850878
#> 
#> $accuracy_sd
#> [1] 0.002902616
#> 
#> $auc_avg
#> [1] NaN
#> 
#> $auc_sd
#> [1] NA

Binary Classification Data

Here, we will solve a simple binary classification problem (predict people who survived on titanic ship). The idea here is to demonstrate how to use this package to solve classification problems.

Data Preparation

# load class
load('../data/cla_train.rda')
# if the above doesn't work, you can try: load("cla_train.rda")

head(cla_train)
#>    PassengerId Survived Pclass
#> 1:           1        0      3
#> 2:           2        1      1
#> 3:           3        1      3
#> 4:           4        1      1
#> 5:           5        0      3
#> 6:           6        0      3
#>                                                   Name    Sex Age SibSp Parch
#> 1:                             Braund, Mr. Owen Harris   male  22     1     0
#> 2: Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
#> 3:                              Heikkinen, Miss. Laina female  26     0     0
#> 4:        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
#> 5:                            Allen, Mr. William Henry   male  35     0     0
#> 6:                                    Moran, Mr. James   male  NA     0     0
#>              Ticket    Fare Cabin Embarked
#> 1:        A/5 21171  7.2500              S
#> 2:         PC 17599 71.2833   C85        C
#> 3: STON/O2. 3101282  7.9250              S
#> 4:           113803 53.1000  C123        S
#> 5:           373450  8.0500              S
#> 6:           330877  8.4583              Q

# split the data
split <- createDataPartition(y = cla_train$Survived,p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]

# encode categorical variables - shorter way
for(c in c('Embarked','Sex','Cabin')) {
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA'

# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]

# drop these features
to_drop <- c('PassengerId','Ticket','Name')

xtrain <- xtrain[,-c(to_drop), with=F]
xtest <- xtest[,-c(to_drop), with=F]

Now, our data is ready to be served for model training. Let’s do it.

KNN Classification

knn <- KNNTrainer$new(k = 2,prob = T,type = 'class')
knn$fit(train = xtrain, test = xtest, y = 'Survived')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type = 'raw')
auc(actual = xtest$Survived, predicted = labels)
#> [1] 0.6385027

Naive Bayes Classification

nb <- NBTrainer$new()
nb$fit(xtrain, 'Survived')
pred <- nb$predict(xtest)
#> Warning: predict.naive_bayes(): more features in the newdata are provided as
#> there are probability tables in the object. Calculation is performed based on
#> features to be found in the tables.
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7771836

SVM Classification

#predicts labels
svm <- SVMTrainer$new()
svm$fit(xtrain, 'Survived')
pred <- svm$predict(xtest)
auc(actual = xtest$Survived, predicted=pred)

Logistic Regression

lf <- LMTrainer$new(family = "binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#> 
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  1.830070   0.616894   2.967  0.00301 ** 
#> Pclass      -0.980785   0.192493  -5.095 3.48e-07 ***
#> Sex          2.508241   0.230374  10.888  < 2e-16 ***
#> Age         -0.041034   0.009309  -4.408 1.04e-05 ***
#> SibSp       -0.235520   0.117715  -2.001  0.04542 *  
#> Parch       -0.098742   0.137791  -0.717  0.47361    
#> Fare         0.001281   0.002842   0.451  0.65230    
#> Cabin        0.008408   0.004786   1.757  0.07899 .  
#> Embarked     0.248088   0.166616   1.489  0.13649    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 831.52  on 623  degrees of freedom
#> Residual deviance: 564.76  on 615  degrees of freedom
#> AIC: 582.76
#> 
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.8832145

Lasso Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)

Ridge Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)

Random Forest

rf <- RFTrainer$new(n_estimators = 500,classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")

pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                                67.80128
#> Fare                               57.97193
#> Age                                48.37045
#> Pclass                             24.64915
#> Cabin                              21.45972
#> SibSp                              13.51637
#> Parch                              10.45743
#> Embarked                           10.23844

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7976827

Xgboost

xgb <- XGBTrainer$new(objective = "binary:logistic"
                      , n_estimators = 500
                      , eval_metric = "auc"
                      , maximize = T
                      , learning_rate = 0.1
                      ,max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)

pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)

Grid Search

xgb <- XGBTrainer$new(objective="binary:logistic")
gst <-GridSearchCV$new(trainer = xgb,
                             parameters = list(n_estimators = c(10,50),
                             max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'))
gst$fit(xtrain, "Survived")
gst$best_iteration()

Random Search

rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
                             parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'),
                             n_iter = 3)
rst$fit(xtrain, "Survived")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0.7964744
#> 
#> $accuracy_sd
#> [1] 0.03090914
#> 
#> $auc_avg
#> [1] 0.7729436
#> 
#> $auc_sd
#> [1] 0.04283084

Let’s create some new feature based on target variable using target encoding and test a model.

# add target encoding features
xtrain[, feat_01 := smoothMean(train_df = xtrain,
                        test_df = xtest,
                        colname = "Embarked",
                        target = "Survived")$train[[2]]]
xtest[, feat_01 := smoothMean(train_df = xtrain,
                               test_df = xtest,
                               colname = "Embarked",
                               target = "Survived")$test[[2]]]

# train a random forest
# Random Forest
rf <- RFTrainer$new(n_estimators = 500,classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                               69.787235
#> Fare                              60.832089
#> Age                               52.982604
#> Pclass                            24.419818
#> Cabin                             21.419274
#> SibSp                             13.112177
#> Parch                             10.175269
#> feat_01                            6.675399
#> Embarked                           6.450819

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8018717