The goal of randomForestVIP
is to tune and select a
Random Forest model with high accuracy and interpretability. This is
done by tuning the Random Forest based on the accuracy and variable
importance metrics associated with each model. To accomplish this,
functions are available to tabulate and plot results designed to help
the user select an optimal model.
The function mtry_compare
may be used to tune the
hyper-parameter mtry based on model performance and variable importance
metrics. This grid-search technique provides tables and plots showing
the effect of mtry on each of the assessment metrics. It also returns
each of the evaluated models to the user.
This package also contains functions for assessing relationships
among the predictor variables and between the predictors and response.
These are relevant for any predictive model, not just Random Forests.
Metrics such as partial correlations and variance inflation factors are
available for a variety of modeling techniques (not just linear
regressions). These are tabulated and plotted for the analyst using the
functions partial_cor
and robust_vifs
.
The package also provides superior ggplot2
variable
importance plots for individual models using the function
ggvip
. This function is a highly aesthetic and editable
improvement upon the function randomForest::varImpPlot
and
other basic importance graphics.
All of the plots generated by these functions are developed with
ggplot2
techniques so that the user has the ability to edit
and improve further upon the plots.
For methodology see “Contributions to Random Forest Variable Importance with Applications in R” https://digitalcommons.usu.edu/etd/8587/.
library(randomForestVIP)
library(MASS)
library(EZtune)
To introduce the functionality of randomForestVIP
, we
look at modeling the Boston housing data (found in the MASS package). We
want to build a Random Forest model with a view towards both accuracy
and interpretability. We begin by running some preliminary diagnostics
on our data.
set.seed(1234)
<- partial_cor(medv ~ ., data = Boston, model = lm)
pcs $plot_y_part_cors pcs
<- robust_vifs(medv ~ ., data = Boston, model = lm)
rv $plot_lin_vifs rv
These functions assess concerns with collinearity. Notice that the
VIFs from robust_vifs
are all less than 10. The partial
correlations with the response from partial_cor
are a type
of pseudo-importance assessing the importance each variable does not
share with the others. Now we tune our model by assessing four different
mtry values in the mtry_compare
function.
set.seed(1)
<- mtry_compare(medv ~ .,
m data = Boston, sqrt = TRUE,
mvec = c(1, 4, 9, 13), num_var = 7
)$gg_model_errors m
$model_errors
m#> mtry mse
#> 1 1 18.96051
#> 2 4 10.11247
#> 3 9 10.13187
#> 4 13 10.36482
According to the accuracy plot and table above, our best choice is when mtry is 4. However, the accuracy for the best model is notably only slightly better than the models with mtry set to 9 and 13. We now look at the variable importance metrics across the different models.
$gg_var_imp_error m
The top two variables are consistently identified as more important than the other variables and their order remains unchanged across mtry. However, the variables ‘nox’ and ‘dis’ switch order as mtry increases. Pollution (nox) has a strong negative correlation with distance to employment centers (dis). This makes sense if the employment centers are responsible for much of the pollution. If many home buyers consider distance to work more important than pollution when selecting a house, ‘dis’ is more likely to be a causal driver of price than ‘nox’. By this reasoning, the model where mtry is 9 appears to be superior to the model where mtry is 4, despite mtry of 4 yielding slightly more accurate results.
We now take our selected model and build individual importance plots
for it using ggvip
.
<- ggvip(m$rf9)$both_vips g
The plot above resembles a standard variable importance plot, but possesses superior tick labels and editing capabilities for the analyst.
We have used the randomForestVIP
package to tune a
strong model for prediction and with reasonably useful importance
values. This was accomplished by assessing variable importance and
accuracy metrics across the hyper-parameter mtry.
library(randomForestVIP)
To further demonstrate the functionality of
randomForestVIP
, we provide another example. This time
using classification data. We look at modeling the Lichen data (found in
the EZtune package) with a view towards both accuracy and
interpretability. The response is presence or absence (coded 0 or 1) of
a lichen species, Lobaria oregana. We begin by running preliminary
diagnostics on our data using partial_cor
and
robust_vifs
.
set.seed(1234)
<- EZtune::lichen[, -c(1, 3:8)]
lichen
pairs(lichen[, c(16, 20, 26)])
cor(lichen[, c(16, 20, 26)])
#> MinTempAve AmbVapPressAve Elevation
#> MinTempAve 1.0000000 0.9973158 -0.9781953
#> AmbVapPressAve 0.9973158 1.0000000 -0.9745659
#> Elevation -0.9781953 -0.9745659 1.0000000
<- partial_cor(factor(LobaOreg) ~ .,
pcs data = lichen, model = lm,
num_var = 15
)$plot_y_part_cors pcs
<- robust_vifs(factor(LobaOreg) ~ .,
rv data = lichen, model = lm,
num_var = 15
)$plot_nonlin_vifs rv
These variables exhibit high collinearity. To illustrate this
observation, consider the pairs plots above for ‘MinTempAve’,
‘Elevation’, and ‘AmbVapPressAve’. Most of the VIFs from
robust_vifs
exceed the standard threshold. The partial
correlations with the response from partial_cor
are a type
of pseudo-importance assessing the importance each variable does not
share with the others. Now we tune our Random Forest model across four
mtry values.
set.seed(100)
<- mtry_compare(factor(LobaOreg) ~ .,
m data = lichen, sqrt = TRUE,
mvec = c(1, 5, 19, 33), num_var = 7
)$gg_model_errors m
$model_errors
m#> mtry misclass_rate
#> 1 1 0.1797619
#> 2 5 0.1607143
#> 3 19 0.1571429
#> 4 33 0.1619048
According to the accuracy plot and table above, our best choice is when mtry is 19. However, the accuracy for the best model is only slightly better than the models with mtry set to 5 and 33. We now look at the variable importance metrics across the different models.
$gg_var_imp_error m
There are 3 variables to focus on. ‘MinTempAve’, ‘Elevation’, and ‘AmbVapPressAve’ were all shown to be highly correlated above. These variables appear to be the most importance variable when mtry is small. However, as mtry increases, the importance of ‘Elevation’ drops off a bit, and the importance of ‘AmbVapPressAve’ drops even more. After seeing these changes, a researcher might consider how these variables actually affect lichen presence. They would find that ‘MinTempAve’ informs freezing which directly contributes to lichen presence. They would also realize that ‘Elevation’ indirectly causes lichen presence since ‘Elevation’ drives ‘MinTempAve’. ‘AmbVapPressAve’ can be assumed to be a byproduct of ‘Elevation’ and is not a feature that should have much of a causal impact on lichen presences. While it is highly predictive, it is not something a scientist would prescribe for inducing the response. In this example, as mtry increases, casual variables rise while collinear byproducts fall.
No solution is perfect, but mtry of 33 yields results that match the intuition about the effect our predictors have on the response.
We now take our selected model and build individual importance plots
for it using ggvip
.
<- ggvip(m$rf33, num_var = 12)$both_vips g
We have used the randomForestVIP
package to tune a model
for prediction and with superior importance values. This was
accomplished by assessing variable importance and accuracy metrics
across the hyper-parameter mtry.