--- title: "Simulation Tools Provided With the Selectboost Package" shorttitle: "Selectboost: Simulation Tools" author: - name: "Frédéric Bertrand and Myriam Maumy-Bertrand" affiliation: - Université de Strasbourg and CNRS - IRMA, labex IRMIA email: frederic.bertrand@utt.fr date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Simulation Tools Provided With the Selectboost Package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} #file.edit(normalizePath("~/.Renviron")) LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE") #LOCAL=TRUE knitr::opts_chunk$set(purl = LOCAL) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Contents This vignette details the simulations tools provided with the selectboost package by providing five examples of use. If you are a Linux/Unix or a Macos user, you can install a version of SelectBoost with support for `doMC` from [github](https://github.com) with: ```{r, eval = FALSE} devtools::install_github("fbertran/SelectBoost", ref = "doMC") ``` # First example ## Aim We want to creates $NDatasets=200$ datasets with $\textrm{length}(group)=10$ variables and $N=10$ observations. In that example we want $9$ groups: - $x_1$ and $x_{10}$ belong to the first group and the intra-group Pearson correlation for this group is equal to $.95$, - $x_2$ belongs to the second group, - $x_3$ belongs to the third group, - ... - $x_9$ belongs to the ninth group. ## Correlation structure The correlation structure of the explanatory variables of the dataset is provided by `group` and the intra-group Pearson correlation value for each of the groups by `cor_group`. A value must be provided even for single variable groups and the number of variables is length of the `group` vector. Use the `simulation_cor` function to create the correlation matrix (`CM`). ```{r, eval = LOCAL} library(SelectBoost) group<-c(1:9,1) #10 variables cor_group<-rep(0.95,9) CM<-simulation_cor(group,cor_group) CM ``` ## Explanatory dataset generation Then generation of an explanatory dataset with $N=10$ observations is made by the `simulation_X` function. ```{r, eval = LOCAL} set.seed(3141) N<-10 X<-simulation_X(N,CM) X ``` ## Response derivation A response can now be added to the dataset by the `simulation_Data` function. We have to specifiy the support of the response, i.e. the explanatory variables that will be used in the linear model created to compute the response. The support is given by the `supp` vector whose entries are either $0$ or $1$. The length of the `supp` vector must be equal to the number of explanatory variables and if the $i$\th entry is equal to $1$, it means that the $i$\th variable will be used to derive the response value, whereas if the $i$\th entry is equal to $0$, it means that the $i$\th variable will not be used to derive the response value (`beta<-rep(0,length(supp))`). The values of the coefficients for the explanatory variables that are in the support of the response are random (either absolute value and sign) and given by `beta[which(supp==1)]<-runif(sum(supp),minB,maxB)*(rbinom(sum(supp),1,.5)*2-1)`. Hence, the user can specify their minimal absolute value with the `minB` option and their maximal absolute value with the `maxB` option. The `stn` option is a scaling factor for the noise added to the response vector (`(t(beta)%*%var(X)%*%beta)/stn`, with `X` the matrix of explanatory variables). The higher the `stn` value, the smaller the noise: for instance for a given `X` dataset, an `stn` value $\alpha$ times larger will result in a noise exactly $\sqrt{\alpha}$ times smaller. ```{r, eval = LOCAL} set.seed(3141) supp<-c(1,1,1,0,0,0,0,0,0,0) minB<-1 maxB<-2 stn<-50 firstdataset=simulation_DATA(X,supp,minB,maxB,stn) firstdataset ``` ## Multiple datasets and checks To generate multiple datasets, repeat steps 2 and 3, for instance use a `for` loop. We create $NDatasets=200$ datasets and assign them to the objects `DATA_exemple1_nb_1` to `DATA_exemple1_nb_200`. ```{r, cache=TRUE, eval = LOCAL} set.seed(3141) NDatasets=200 for(i in 1:NDatasets){ X<-simulation_X(N,CM) assign(paste("DATA_exemple1_nb_",i,sep=""),simulation_DATA(X,supp,minB,maxB,stn)) } ``` We now check the correlation structure of the explanatory variable. First we compute the mean correlation matrix. ```{r, cache=TRUE, eval = LOCAL} corr_sum=matrix(0,length(group),length(group)) for(i in 1:NDatasets){ corr_sum=corr_sum+cor(get(paste("DATA_exemple1_nb_",i,sep=""))$X) } corr_mean=corr_sum/NDatasets ``` Then we display and plot that the mean correlation matrix. ```{r, fig.width=7, eval = LOCAL} corr_mean plot(abs(corr_mean)) ``` ```{r, eval = LOCAL} coef_sum=rep(0,length(group)) names(coef_sum)<-paste("x",1:length(group),sep="") error_counter=0 for(i in 1:NDatasets){ tempdf=data.frame(cbind(Y=get(paste("DATA_exemple1_nb_",i,sep=""))$Y, get(paste("DATA_exemple1_nb_",i,sep=""))$X)) tempcoef=coef(lm(Y~.-1,data=tempdf)) if(is.null(tempcoef)){ cat("Error in lm fit, skip coefficients\n") error_counter=error_counter+1 } else{ coef_sum=coef_sum+abs(tempcoef) } } error_counter coef_mean=coef_sum/NDatasets ``` All fits were sucessful. Then we display and plot that the mean coefficient vector values. ```{r, eval = LOCAL} coef_mean barplot(coef_mean) abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue") ``` Reduce the noise in the response for the new responses by a factor $\sqrt{5000/50}=10$. $1/stn\cdot \beta_{support}^t\mathrm{Var}(X)\beta_{support}$ where $\beta_{support}$ is the vector of coefficients wh ```{r, cache=TRUE, eval = LOCAL} set.seed(3141) stn <- 5000 for(i in 1:NDatasets){ X<-simulation_X(N,CM) assign(paste("DATA_exemple1_bis_nb_",i,sep=""),simulation_DATA(X,supp,minB,maxB,stn)) } ``` Since it is the same explanatory dataset for response generation, we can compare the $\sigma$ between those $NDatasets=200$ datasets. ```{r, cache=TRUE, eval = LOCAL} stn_ratios=rep(0,NDatasets) for(i in 1:NDatasets){ stn_ratios[i]<-get(paste("DATA_exemple1_nb_",i,sep=""))$sigma/ get(paste("DATA_exemple1_bis_nb_",i,sep=""))$sigma } all(sapply(stn_ratios,all.equal,10)) ``` All the ratios are equal to 10 as anticipated. Since, the correlation structure is the same as before, we do not need to check it again. As befor, we infer the coefficients values of a linear model using the `lm` function. ```{r, cache=TRUE, eval = LOCAL} coef_sum_bis=rep(0,length(group)) names(coef_sum_bis)<-paste("x",1:length(group),sep="") error_counter_bis=0 for(i in 1:NDatasets){ tempdf=data.frame(cbind(Y=get(paste("DATA_exemple1_bis_nb_",i,sep=""))$Y, get(paste("DATA_exemple1_bis_nb_",i,sep=""))$X)) tempcoef=coef(lm(Y~.-1,data=tempdf)) if(is.null(tempcoef)){ cat("Error in lm fit, skip coefficients\n") error_counter_bis=error_counte_bisr+1 } else{ coef_sum_bis=coef_sum_bis+abs(tempcoef) } } error_counter_bis coef_mean_bis=coef_sum_bis/NDatasets ``` All fits were sucessful. Then we display and plot that the mean coefficient vector values. As expected, the noise reduction enhances the retrieval of the `true` mean coefficient absolute values by the models. ```{r, eval = LOCAL} coef_mean_bis barplot(coef_mean_bis) abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue") ``` The simulation process looks sucessfull. What are the confidence indices for those variables? # Second example ## Aim We want to creates $NDatasets=200$ datasets with $\textrm{length}(group)=50$ variables and $N=20$ observations. In that example we want $1$ group: - $x_1$, \dots, $x_{50}$ belong to the same group and the intra-group Pearson correlation for this group is equal to $.5$. - only the first five variables $x_1$, \dots, $x_{5}$ are explanatory variables for the response. ## Correlation structure ```{r, eval = LOCAL} group<-rep(1,50) #50 variables cor_group<-rep(0.5,1) ``` ## Explanatory variables and response ```{r, cache=TRUE, eval = LOCAL} set.seed(3141) N<-20 supp<-c(1,1,1,1,1,rep(0,45)) minB<-1 maxB<-2 stn<-50 for(i in 1:200){ C<-simulation_cor(group,cor_group) X<-simulation_X(N,C) assign(paste("DATA_exemple2_nb_",i,sep=""),simulation_DATA(X,supp,minB,maxB,stn)) } ``` ## Checks We now check the correlation structure of the explanatory variable. First we compute the mean correlation matrix. ```{r, eval = LOCAL} corr_sum=matrix(0,length(group),length(group)) for(i in 1:NDatasets){ corr_sum=corr_sum+cor(get(paste("DATA_exemple2_nb_",i,sep=""))$X) } corr_mean=corr_sum/NDatasets ``` Then we display and plot that the mean correlation matrix. ```{r, fig.keep='none', fig.width=7, eval = LOCAL} corr_mean[1:10,1:10] plot(abs(corr_mean)) ``` ```{r, cache=TRUE, eval = LOCAL} coef_sum=rep(0,length(group)) coef_lasso_sum=rep(0,length(group)) names(coef_sum)<-paste("x",1:length(group),sep="") names(coef_lasso_sum)<-paste("x",1:length(group),sep="") error_counter=0 for(i in 1:NDatasets){ tempdf=data.frame(cbind(Y=get(paste("DATA_exemple2_nb_",i,sep=""))$Y, get(paste("DATA_exemple2_nb_",i,sep=""))$X)) tempcoef=coef(lm(Y~.-1,data=tempdf)) require(lars) lasso.1 <- lars::lars(x=get(paste("DATA_exemple2_nb_",i,sep=""))$X, y=get(paste("DATA_exemple2_nb_",i,sep=""))$Y, type="lasso", trace=FALSE, normalize=FALSE, intercept = FALSE) # cv.lars() uses crossvalidation to estimate optimal position in path cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple2_nb_",i,sep=""))$X, y=get(paste("DATA_exemple2_nb_",i,sep=""))$Y, plot.it=FALSE, type="lasso") # Use the "+1SE rule" to find best model: # Take the min CV and add its SE ("limit"). # Find smallest model that has its own CV within this limit (at "s.cv.1") limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)] s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))] # Print out coefficients at optimal s. coef_lasso_sum=coef_lasso_sum+abs(coef(lasso.1, s=s.cv.1, mode="fraction")) if(is.null(tempcoef)){ cat("Error in lm fit, skip coefficients\n") error_counter=error_counter+1 } else{ coef_sum=coef_sum+abs(tempcoef) } } error_counter coef_mean=coef_sum/NDatasets coef_lasso_mean=coef_lasso_sum/NDatasets ``` With regular least squares and lasso estimators all fits were sucessful, yet only 20 variables coefficients could be estimated with regular least squares estimates for the linear model. Then we display and plot that the mean coefficient vector values for the least squares estimates. ```{r, eval = LOCAL} coef_mean barplot(coef_mean) abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue") ``` ```{r, eval = LOCAL} coef_lasso_mean barplot(coef_lasso_mean,ylim=c(0,1.5)) abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue") ``` The simulation process looks sucessfull: the lasso estimates retrives mostly the correct variables, yet the other ones are also selected sometimes. What are the confidence indices for those variables? # Third Example ## Aim We want to creates $NDatasets=200$ datasets with $\textrm{length}(supp)=100$ variables and $N=24$ observations. In that example we use real data for the `X` variables that we sample from all the $1650$ probesets that are differentially expressed between the two conditions US and S. The main interest of that simulation is that the correlation structure of the `X` dataset will be a real one. ## Data and response generations First retrieve the datasets and get the differentially expressed probesets. Run the code to get additionnal plots. ```{r, fig.keep='none', cache=TRUE, eval = LOCAL} require(CascadeData) data(micro_S) data(micro_US) require(Cascade) micro_US<-as.micro_array(micro_US,c(60,90,240,390),6) micro_S<-as.micro_array(micro_S,c(60,90,240,390),6) S<-geneSelection(list(micro_S,micro_US),list("condition",c(1,2),1),-1) Sel<-micro_S@microarray[S@name,] ``` ```{r, fig.keep='none', cache=TRUE, eval = LOCAL} summary(S) ``` ```{r, fig.keep='last', cache=TRUE, eval = LOCAL} plot(S) ``` Generates the datasets sampling for each of them 100 probesets expressions among the 1650 that were selected and linking the response to the expressions of the first five probesets. ```{r, cache=TRUE, eval = LOCAL} set.seed(3141) supp<-c(1,1,1,1,1,rep(0,95)) minB<-1 maxB<-2 stn<-50 NDatasets=200 for(i in 1:NDatasets){ X<-t(as.matrix(Sel[sample(1:nrow(Sel),100),])) Xnorm<-t(t(X)/sqrt(colSums(X*X))) assign(paste("DATA_exemple3_nb_",i,sep=""),simulation_DATA(Xnorm,supp,minB,maxB,stn)) } ``` ## Checks Here are the plots of an example of correlation structure, namely for `DATA_exemple3_nb_200$X`. Run the code to get the graphics. ```{r, eval=FALSE} plot(cor(Xnorm)) mixOmics::cim(cor(Xnorm)) ``` ```{r, cache=TRUE, eval = LOCAL} coef_sum=rep(0,length(supp)) coef_lasso_sum=rep(0,length(supp)) names(coef_sum)<-paste("x",1:length(supp),sep="") names(coef_lasso_sum)<-paste("x",1:length(supp),sep="") error_counter=0 for(i in 1:NDatasets){ tempdf=data.frame(cbind(Y=get(paste("DATA_exemple3_nb_",i,sep=""))$Y, get(paste("DATA_exemple3_nb_",i,sep=""))$X)) tempcoef=coef(lm(Y~.-1,data=tempdf)) require(lars) lasso.1 <- lars::lars(x=get(paste("DATA_exemple3_nb_",i,sep=""))$X, y=get(paste("DATA_exemple3_nb_",i,sep=""))$Y, type="lasso", trace=FALSE, normalize=FALSE, intercept = FALSE) # cv.lars() uses crossvalidation to estimate optimal position in path cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple3_nb_",i,sep=""))$X, y=get(paste("DATA_exemple3_nb_",i,sep=""))$Y, plot.it=FALSE, normalize=FALSE, intercept = FALSE, type="lasso") # Use the "+1SE rule" to find best model: # Take the min CV and add its SE ("limit"). # Find smallest model that has its own CV within this limit (at "s.cv.1") limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)] s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))] # Print out coefficients at optimal s. coef_lasso_sum=coef_lasso_sum+abs(coef(lasso.1, s=s.cv.1, mode="fraction")) if(is.null(tempcoef)){ cat("Error in lm fit, skip coefficients\n") error_counter=error_counter+1 } else{ coef_sum=coef_sum+abs(tempcoef) } } error_counter coef_mean=coef_sum/NDatasets coef_lasso_mean=coef_lasso_sum/NDatasets ``` With regular least squares and lasso estimators all fits were sucessful, yet only 20 variables coefficients could be estimated with regular least squares estimates for the linear model. Then we display and plot that the mean coefficient vector values for the least squares estimates. ```{r, eval = LOCAL} coef_mean barplot(coef_mean) abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue") ``` ```{r, eval = LOCAL} coef_lasso_mean barplot(coef_lasso_mean,ylim=c(0,1.5)) abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue") ``` The simulation process looks sucessfull: the lasso estimates retrives mostly the correct variables, yet the other ones are also selected sometimes. What are the confidence indices for those variables? # Fourth Example ## Aim We want to creates $NDatasets=101$ datasets with $\textrm{length}(supp)=100$ variables and $N=18$ observations. In that example we use real data for the variables that are the $101$ probesets that are the more differentially expressed between the two conditions US and S. We create $101$ datasets by leaving one of the variables out each time and using it as the response that shall be predicted. We also only use for the explanatory variables the observations that are the measurements for the 1st, 2nd and 3rd timepoints and for the responses the observations that are the measurements of the same variables for the 2nd, 3rd and 4th timepoints. The main interest of that simulation is that the correlation structure of the `X` dataset will be a real one and that it is a typical setting for cascade network reverse-engineering in genomics or proteomics, see the `Cascade` package for more details. ## Data and response generations First retrieve the datasets and get the differentially expressed probesets. Run the code to get additionnal plots. ```{r, fig.keep='none', cache=TRUE, eval = LOCAL} require(CascadeData) data(micro_S) data(micro_US) require(Cascade) micro_US<-as.micro_array(micro_US,c(60,90,240,390),6) micro_S<-as.micro_array(micro_S,c(60,90,240,390),6) S<-geneSelection(list(micro_S,micro_US),list("condition",c(1,2),1),101) Sel<-micro_S@microarray[S@name,] ``` ```{r, fig.keep='none', cache=TRUE, eval = LOCAL} summary(S) ``` ```{r, fig.keep='last', cache=TRUE, eval = LOCAL} plot(S) ``` ```{r, cache=TRUE, eval = LOCAL} suppt<-rep(1:4,6) supp<-c(1,1,1,1,1,rep(0,95)) #not used since we use one of the probeset expressions as response minB<-1 #not used since we use one of the probeset expressions as response maxB<-2 #not used since we use one of the probeset expressions as response stn<-50 #not used since we use one of the probeset expressions as response NDatasets<-101 set.seed(3141) for(i in 1:NDatasets){ #the explanatory variables are the values for the 1st, 2nd and 3rd timepoints X<-t(as.matrix(Sel[-i,suppt!=4])) Xnorm<-t(t(X)/sqrt(colSums(X*X))) DATA<-simulation_DATA(Xnorm,supp,minB,maxB,stn) #the reponses are the values for the 2nd, 3rd and 4th timepoints DATA$Y<-as.vector(t(Sel[i,suppt!=1])) assign(paste("DATA_exemple4_nb_",i,sep=""),DATA) rm(DATA) } ``` ## Checks Here are the plots of an example of correlation structure, namely for `DATA_exemple3_nb_200$X`. Run the code to get the graphics. ```{r, eval=FALSE} plot(cor(Xnorm)) mixOmics::cim(cor(Xnorm)) ``` ```{r, cache=TRUE, eval = LOCAL} coef_sum=rep(0,length(supp)+1) coef_lasso_sum=rep(0,length(supp)+1) names(coef_sum)<-paste("x",1:(length(supp)+1),sep="") names(coef_lasso_sum)<-paste("x",1:(length(supp)+1),sep="") error_counter=0 for(i in 1:NDatasets){ tempdf=data.frame(cbind(Y=get(paste("DATA_exemple4_nb_",i,sep=""))$Y, get(paste("DATA_exemple4_nb_",i,sep=""))$X)) tempcoef=coef(lm(Y~.-1,data=tempdf)) require(lars) lasso.1 <- lars::lars(x=get(paste("DATA_exemple4_nb_",i,sep=""))$X, y=get(paste("DATA_exemple4_nb_",i,sep=""))$Y, type="lasso", trace=FALSE, normalize=FALSE, intercept = FALSE) # cv.lars() uses crossvalidation to estimate optimal position in path cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple4_nb_",i,sep=""))$X, y=get(paste("DATA_exemple4_nb_",i,sep=""))$Y, plot.it=FALSE, normalize=FALSE, intercept = FALSE, type="lasso") # Use the "+1SE rule" to find best model: # Take the min CV and add its SE ("limit"). # Find smallest model that has its own CV within this limit (at "s.cv.1") limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)] s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))] # Print out coefficients at optimal s. coef_lasso_sum[-i]=coef_lasso_sum[-i]+abs(coef(lasso.1, s=s.cv.1, mode="fraction")) if(is.null(tempcoef)){ cat("Error in lm fit, skip coefficients\n") error_counter=error_counter+1 } else{ coef_sum[-i]=coef_sum[-i]+abs(tempcoef) } } error_counter coef_mean=coef_sum/NDatasets coef_lasso_mean=coef_lasso_sum/NDatasets ``` With regular least squares and lasso estimators all fits were sucessful, yet only 20 variables coefficients could be estimated with regular least squares estimates for the linear model. Then we display and plot that the mean coefficient vector values for the least squares estimates. ```{r, eval = LOCAL} head(coef_mean, 40) barplot(coef_mean) abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue") ``` ```{r, eval = LOCAL} head(coef_lasso_mean, 40) barplot(coef_lasso_mean) ``` Some probesets seem explanatory for many other ones (=hubs). What are the confidence indices for those variables? # Fifth Example ## Aim We want to creates $NDatasets=200$ datasets with $\textrm{length}(group)=500$ variables and $N=25$ observations. In that example we want $1$ group: - $x_1$, \dots, $x_{500}$ belong to the same group and the intra-group Pearson correlation for this group is equal to $.5$. - only the first five variables $x_1$, \dots, $x_{5}$ are explanatory variables for the response. ## Correlation structure ```{r, cache=TRUE, eval = LOCAL} group<-rep(1,500) #500 variables cor_group<-rep(0.5,1) ``` ## Explanatory variables and response ```{r, cache=TRUE, eval = LOCAL} set.seed(3141) N<-25 supp<-c(1,1,1,1,1,rep(0,495)) minB<-1 maxB<-2 stn<-50 for(i in 1:NDatasets){ C<-simulation_cor(group,cor_group) X<-simulation_X(N,C) assign(paste("DATA_exemple5_nb_",i,sep=""),simulation_DATA(X,supp,minB,maxB,stn)) } ``` ## Checks We now check the correlation structure of the explanatory variable. First we compute the mean correlation matrix. ```{r, eval = LOCAL} corr_sum=matrix(0,length(group),length(group)) for(i in 1:NDatasets){ corr_sum=corr_sum+cor(get(paste("DATA_exemple5_nb_",i,sep=""))$X) } corr_mean=corr_sum/NDatasets ``` Then we display and plot that the mean correlation matrix. ```{r, fig.keep='none', fig.width=7, eval = LOCAL} corr_mean[1:10,1:10] plot(abs(corr_mean)) ``` ```{r, cache=TRUE, eval = LOCAL} coef_sum=rep(0,length(group)) coef_lasso_sum=rep(0,length(group)) names(coef_sum)<-paste("x",1:length(group),sep="") names(coef_lasso_sum)<-paste("x",1:length(group),sep="") error_counter=0 for(i in 1:NDatasets){ tempdf=data.frame(cbind(Y=get(paste("DATA_exemple5_nb_",i,sep=""))$Y, get(paste("DATA_exemple5_nb_",i,sep=""))$X)) tempcoef=coef(lm(Y~.-1,data=tempdf)) require(lars) lasso.1 <- lars::lars(x=get(paste("DATA_exemple5_nb_",i,sep=""))$X, y=get(paste("DATA_exemple5_nb_",i,sep=""))$Y, type="lasso", trace=FALSE, normalize=FALSE, intercept = FALSE) # cv.lars() uses crossvalidation to estimate optimal position in path cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple5_nb_",i,sep=""))$X, y=get(paste("DATA_exemple5_nb_",i,sep=""))$Y, plot.it=FALSE, type="lasso") # Use the "+1SE rule" to find best model: # Take the min CV and add its SE ("limit"). # Find smallest model that has its own CV within this limit (at "s.cv.1") limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)] s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))] # Print out coefficients at optimal s. coef_lasso_sum=coef_lasso_sum+abs(coef(lasso.1, s=s.cv.1, mode="fraction")) if(is.null(tempcoef)){ cat("Error in lm fit, skip coefficients\n") error_counter=error_counter+1 } else{ coef_sum=coef_sum+abs(tempcoef) } } error_counter coef_mean=coef_sum/NDatasets coef_lasso_mean=coef_lasso_sum/NDatasets ``` With regular least squares and lasso estimators all fits were sucessful, yet only 20 variables coefficients could be estimated with regular least squares estimates for the linear model. Then we display and plot that the mean coefficient vector values for the least squares estimates. ```{r, eval = LOCAL} head(coef_mean, 40) barplot(coef_mean) abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue") ``` ```{r, eval = LOCAL} head(coef_lasso_mean, 40) barplot(coef_lasso_mean,ylim=c(0,1.5)) abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue") ``` The simulation process looks sucessfull: the lasso estimates retrives mostly the correct variables, yet the other ones are also selected sometimes. What are the confidence indices for those variables?