mcvis: Multi-collinearity Visualization

Kevin Wang, Chen Lin and Samuel Mueller

2021-07-30

Introduction

The mcvis package provides functions for detecting multi-collinearity (also known as collinearity) in linear regression. In simple terms, the mcvis method graphically identifies the variables that most strongly drive collinearity.

Basic usage

Suppose that we have a simple scenario in which one predictor \(X_1\) is almost linearly dependent on two other predictors, \(X_2\) and \(X_3\), and is therefore strongly correlated with both. This near-linear dependence among the three variables is sufficient to cause collinearity, which manifests as large variances (and hence large standard errors) of the estimated regression coefficients. We illustrate this with a simple example:

## Simulating some data
set.seed(1)
p = 6
n = 100

X = matrix(rnorm(n*p), ncol = p)
X[,1] = X[,2] + X[,3] + rnorm(n, 0, 0.01)

y = rnorm(n)
summary(lm(y ~ X))
#> 
#> Call:
#> lm(formula = y ~ X)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.56042 -0.73579 -0.05585  0.86967  2.20334 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.02084    0.11157   0.187    0.852
#> X1           10.14768   10.34285   0.981    0.329
#> X2          -10.08175   10.33068  -0.976    0.332
#> X3          -10.30688   10.34038  -0.997    0.321
#> X4            0.04175    0.11321   0.369    0.713
#> X5            0.07191    0.09563   0.752    0.454
#> X6           -0.16951    0.11482  -1.476    0.143
#> 
#> Residual standard error: 1.094 on 93 degrees of freedom
#> Multiple R-squared:  0.06683,    Adjusted R-squared:  0.006628 
#> F-statistic:  1.11 on 6 and 93 DF,  p-value: 0.3625
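
For reference (a quick check, not part of the mcvis method), the pairwise correlations show that X1 is strongly, though not perfectly, correlated with X2 and X3 individually:

## Quick check (not part of mcvis): X1 is strongly, but not perfectly,
## correlated with X2 and X3, despite being an almost exact linear
## combination of the two.
round(cor(X[, 1:3]), 2)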

The mcvis method highlights the major collinearity-causing variables on a bipartite graph. This graph has three major components:

+ The top row renders the "tau" statistics. By default, only one tau statistic is shown (\(\tau_p\), where \(p\) is the number of predictors). This statistic measures the extent of collinearity in the data and relates to the eigenvalues of the correlation matrix of \(X\) (see the eigenvalue check sketched below).
+ The bottom row renders all of the original predictors.
+ The two rows are linked through the MC-indices that we have developed, represented as lines of varying shade and thickness. Darker, thicker lines correspond to larger MC-index values, indicating which predictors contribute more to collinearity.

If you are interested in how the tau statistics and the resampling-based MC-indices are calculated, see our paper: Lin, C., Wang, K. Y. X., & Mueller, S. (2020). mcvis: A new framework for collinearity discovery, diagnostic and visualization. Journal of Computational and Graphical Statistics.
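
As a rough illustration only (this mirrors, but is not, the internal tau computation of mcvis), the near-linear dependence among the first three predictors shows up as one eigenvalue of the correlation matrix of X that is close to zero:

## Rough illustration, not the internal tau computation of mcvis:
## the near-dependence X1 ~ X2 + X3 yields one near-zero eigenvalue
## of the predictor correlation matrix.
eigen(cor(X))$values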

library(mcvis)
mcvis_result = mcvis(X = X)

mcvis_result
#>        X1   X2   X3 X4 X5 X6
#> tau6 0.51 0.25 0.24  0  0  0

This matrix of MC-indices is the main numeric output of mcvis, and our visualisation techniques focus on displaying it. Below is the main mcvis plot, a bipartite graph visualising the last row of this matrix.

plot(mcvis_result)

We also provide an igraph version of the mcvis bipartite graph.

plot(mcvis_result, type = "igraph")

(Extension) Why not just look at the correlation matrix?

In practice, high pairwise correlation between variables is not a necessary condition for collinearity. The mplot package (Tarr et al. 2018) provides a simulated dataset in which some columns are linear combinations of other columns plus noise. In this case, the cause of the collinearity is not apparent from the correlation matrix.

The mcvis plot identifies the 8th variable (x8) as the main cause of collinearity in these data. Inspecting the data-generating code for this simulation, we see that x8 is a linear combination of all other predictor variables (plus noise). This knowledge can provide important guidance when interpreting model selection results.

## Simulation taken from the `mplot` package:
## generating a dataset with multi-collinearity.
n = 50
set.seed(8) # a seed of 2 also works
x1 = rnorm(n, 0.22, 2)
x7 = 0.5 * x1 + rnorm(n, 0, sd = 2)
x6 = -0.75 * x1 + rnorm(n, 0, 3)
x3 = -0.5 - 0.5 * x6 + rnorm(n, 0, 2)
x9 = rnorm(n, 0.6, 3.5)
x4 = 0.5 * x9 + rnorm(n, 0, sd = 3)
x2 = -0.5 + 0.5 * x9 + rnorm(n, 0, sd = 2)
x5 = -0.5 * x2 + 0.5 * x3 + 0.5 * x6 - 0.5 * x9 + rnorm(n, 0, 1.5)
x8 = x1 + x2 - 2 * x3 - 0.3 * x4 + x5 - 1.6 * x6 - 1 * x7 + x9 + rnorm(n, 0, 0.5)
y = 0.6 * x8 + rnorm(n, 0, 2)
artificialeg = round(data.frame(x1, x2, x3, x4, x5, x6, x7, x8, x9, y), 1)
X = artificialeg[, 1:9]
round(cor(X), 2)
#>       x1    x2    x3    x4    x5    x6    x7    x8    x9
#> x1  1.00  0.00  0.14 -0.07 -0.02 -0.37  0.46  0.36 -0.22
#> x2  0.00  1.00  0.31  0.30 -0.60  0.00 -0.29  0.24  0.53
#> x3  0.14  0.31  1.00  0.04 -0.28 -0.66 -0.08 -0.01  0.13
#> x4 -0.07  0.30  0.04  1.00 -0.48  0.01  0.02 -0.07  0.62
#> x5 -0.02 -0.60 -0.28 -0.48  1.00  0.38  0.17 -0.30 -0.75
#> x6 -0.37  0.00 -0.66  0.01  0.38  1.00  0.02 -0.50 -0.08
#> x7  0.46 -0.29 -0.08  0.02  0.17  0.02  1.00 -0.43 -0.29
#> x8  0.36  0.24 -0.01 -0.07 -0.30 -0.50 -0.43  1.00  0.27
#> x9 -0.22  0.53  0.13  0.62 -0.75 -0.08 -0.29  0.27  1.00

mcvis_result = mcvis(X)
mcvis_result
#>        x1   x2   x3 x4   x5   x6   x7   x8   x9
#> tau9 0.01 0.01 0.29  0 0.03 0.31 0.02 0.32 0.02
plot(mcvis_result)
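
As a quick confirmation of what the plot suggests (an illustrative aside, not part of the mcvis method), regressing x8 on the remaining predictors shows that it is almost fully determined by them:

## Illustrative check (not part of mcvis): x8 is almost fully explained by
## the other predictors, consistent with the MC-indices above.
summary(lm(x8 ~ ., data = X))$r.squared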

Shiny implementation

We also offer a Shiny app implementation of mcvis in our package. Given an mcvis_result object in the R session, simply call the shiny_mcvis function to launch the Shiny app.

class(mcvis_result)
#> [1] "mcvis"
shiny_mcvis(mcvis_result)

Reference

Lin, C., Wang, K. Y. X., & Mueller, S. (2020). mcvis: A new framework for collinearity discovery, diagnostic and visualization. Journal of Computational and Graphical Statistics.

Tarr, G., Müller, S., & Welsh, A. H. (2018). mplot: An R package for graphical model stability and variable selection procedures. Journal of Statistical Software, 83(9), 1-28.

Session Info

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] mcvis_1.0.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.1   xfun_0.24          bslib_0.2.5.1      purrr_0.3.4       
#>  [5] reshape2_1.4.4     lattice_0.20-41    colorspace_2.0-0   vctrs_0.3.8       
#>  [9] generics_0.1.0     htmltools_0.5.1.1  yaml_2.2.1         utf8_1.2.2        
#> [13] rlang_0.4.11       jquerylib_0.1.4    pillar_1.6.1       later_1.2.0       
#> [17] glue_1.4.2         DBI_1.1.1          RColorBrewer_1.1-2 lifecycle_1.0.0   
#> [21] plyr_1.8.6         stringr_1.4.0      munsell_0.5.0      gtable_0.3.0      
#> [25] psych_2.1.3        evaluate_0.14      labeling_0.4.2     knitr_1.33        
#> [29] fastmap_1.1.0      httpuv_1.6.1       parallel_4.0.3     fansi_0.5.0       
#> [33] highr_0.9          Rcpp_1.0.6         xtable_1.8-4       scales_1.1.1      
#> [37] promises_1.2.0.1   jsonlite_1.7.2     tmvnsim_1.0-2      farver_2.1.0      
#> [41] mime_0.11          mnormt_2.0.2       ggplot2_3.3.3      digest_0.6.27     
#> [45] stringi_1.7.3      dplyr_1.0.6.9000   shiny_1.6.0        grid_4.0.3        
#> [49] tools_4.0.3        magrittr_2.0.1     sass_0.4.0         tibble_3.1.3      
#> [53] crayon_1.4.1       pkgconfig_2.0.3    ellipsis_0.3.2     assertthat_0.2.1  
#> [57] rmarkdown_2.9      R6_2.5.0           igraph_1.2.6       nlme_3.1-152      
#> [61] compiler_4.0.3