--- title: "Sampling Properties of Variable Selection" author: "John Maindonald" date: "01/01/2019" output: bookdown::html_document2: base_format: rmarkdown::html_vignette number_sections: false link-citations: yes bibliography: vigref.bib vignette: > %\VignetteIndexEntry{Sampling Properties of Variable Selection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup} knitr::opts_chunk$set(echo=FALSE) ``` The function `varselect()` in the _leaps_ package can be used for variable selection. Available approaches are forward, backward, and exhaustive selection. The _DAAG_ package has the functions `bestsetNoise()` and `bsnVaryNvar()` that are designed to give insight on the sampling properties of output from the function `lm()`, when one of these variable selection approaches has been used to choose the explanatory variables that appear in the model. ```{r loadDAAG} suppressMessages(library(DAAG, quietly=TRUE, warn.conflicts=FALSE)) ``` ## $p$-value based variable selection, with data that are pure noise The function `bestsetNoise()` (_DAAG_) can be used to experiment with the behaviour of various variable selection techniques with data that is purely noise. @m-b, Section 6.5, pp.~197-198, gives examples of the use of this function. For example, try: ```{r bestset, eval=FALSE, echo=TRUE} bestsetNoise(m=100, n=40, nvmax=3) bestsetNoise(m=100, n=40, method="backward", nvmax=3) ``` The analyses will typically yield a model that, if assessed using output from the R function `lm()`, appears to have highly (but spuriously) statistically significant explanatory power, with one or more coefficients that appear (again spuriously) significant at a level of around $p$=0.01 or less. The function `bestsetNoise()` has provision to specify the model matrix. Model matrices with uncorrelated columns of independent Normal data, which is the default, are not a good match to most practical situations. ## $p$-values, vs number of variables available for selection As above, datasets of random normal data were created, always with 100 observations and with the number of variables varying between 3 and 50. For three variables, there was no selection, while in other cases the ``best'' three variables were selected, by exhaustive search. Figure \@ref(fig:exhaust) plots the p-values for the 3 variables that were selected against the total number of variables. The fitted line estimates the median $p$-value, as a function of `nvar`. The function `bsnVaryNvar()` that is used for the calculations makes repeated calls to `bestsetNoise()`. Similar results will be obtained from use of forward or backward selection. ```{r cap1, echo=F} cap1 <- "$p$-values from the R function `lm()`, versus number of variables available for selection." ``` ```{r exhaust, eval=TRUE, echo=FALSE} #| fig.width: 5 #| fig.height: 3.75 #| message: FALSE #| out.width: "60%" #| fig.cap: cap1 ## Code suppressPackageStartupMessages(library(qgam, quietly=TRUE)) set.seed(37) # Use to reproduce graph that is shown bsnVaryNvar(m=100, nvar=3:50, nvmax=3) ``` Code is: ```{r exhaust, eval=FALSE, echo=TRUE} ``` When all 3 variables are taken, the $p$-values are expected to average 0.5. Notice that, for selection of the best 3 variables out of 10, the median $p$-value has reduced to about 0.1. ## Reference