GLMMselect: Bayesian model selection for generalized linear mixed models

Shuangshuang Xu, Marco A. R. Ferreira, Erica M. Porter, and Christopher T. Franck

library(GLMMselect)

Introduction

The GLMMselect package provides a novel Bayesian model selection approach for the analysis of Poisson and binary data. GLMMselect contains functions to simultaneously select fixed effects and random effects for non-Gaussian data. In the GLMMselect package, we use pseudo likelihood method to solve the problem that the random effects cannot be analytically integrated out of GLMMs. In addition, we develop a fractional Bayes factor (FBF) approach to obtain posterior probabilities of the competing models. Full details on the methods implemented in GLMMselect can be found in the manuscript (Xu et al. 202X).

This vignette contains a estimated example for count data and a case study presented in the manuscript (Xu et al. 202X) to illustrate how GLMMselect works. For the simulated example, the count data are simulated from Poisson GLMM with four candidate fixed effects and two types of random effects (spatial random effects and overdispersion random effects).

Function

The function GLMMselect() implemented in the GLMMselect package is described below:

Model

The generalized linear mixed model used in the GLMMselect package is \[ g(\pmb{Y})=X\pmb{\beta}+\sum_{q=1}^Q Z_q \pmb{\alpha}_q,\] where

Currently, the GLMMselect package can analyze binary responses (family = "bernoulli") and Poisson responses (family = "poisson"). The manuscript that develops the methods in GLMMselect (Xu et al. 202X) provides details on the priors for \(\pmb{\beta}\) and \(\tau_q\), as well as details about the FBF approach used by GLMMselect.

Examples

The GLMMselect() function requires a vector of observations (either Bernoulli or assumed Poisson distributed), a matrix of covariates, a list of design matrices for random effects, and a list of covariance matrices for random effects.

The vector of observations must be a numeric \(n \times 1\) vector. In the GLMMselect package, there is an example simulated from the Poisson generalized linear mixed model with four candidate covariates, a vector of spatial random effects, and a vector of overdispersion random effects. Here are the first five elements of the Poisson simulated vector of observations:

data("Y")
Y[1:5]
#> [1]  33  47 187  26  35

The covariate matrix passed to the function contains all candidate covariates. Each column corresponds to a candidate covariate. Each row corresponds to an observation. In this example, the covariate matrix contains four candidate covariates. The covariates in the first two columns are in the true model. Here are the first five rows of the covariate matrix:

data("X")
X[1:5,]
#>            [,1]        [,2]       [,3]        [,4]
#> [1,]  0.3586051 -1.39887035 -1.2441304 -0.97847040
#> [2,] -0.1766957  0.09031447  0.5870933 -1.19242832
#> [3,]  0.7548008  0.54639426 -0.2518881  0.02433473
#> [4,]  0.0854506 -0.97412286  0.1203769  0.04743402
#> [5,]  0.5478787 -1.19069820 -1.3385666 -0.28752443

The design matrices for candidate random effects are passed to the GLMMselect function as a list. In this example, this is a list of two design matrices for two types of random effects. The first component is for spatial random effects. The second component is for overdispersion random effects. Here are the first five rows and the first five columns for each of these design matrices:

data("Z")
Z[[1]][1:5,1:5]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1
Z[[2]][1:5,1:5]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1

The covariance matrices for the candidate random effects are also passed to the GLMMselect function as a list. In this example, this is a list of two covariance matrices for two types of random effects. The first component is for spatial random effects. The second componet is for overdispersion random effects. Here are the first five rows and the first five columns for each of these covariance matrices:

data("Sigma")
Sigma[[1]][1:5,1:5]
#>           [,1]      [,2]      [,3]      [,4]      [,5]
#> [1,] 1.2861009 0.7911009 0.4988302 0.3047474 0.1670184
#> [2,] 0.7911009 0.9938302 0.5970181 0.3611012 0.2041574
#> [3,] 0.4988302 0.5970181 0.8561012 0.4964282 0.2877719
#> [4,] 0.3047474 0.3611012 0.4964282 0.7827719 0.4448714
#> [5,] 0.1670184 0.2041574 0.2877719 0.4448714 0.7498331
Sigma[[2]][1:5,1:5]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1

Data Y,X, Sigma, and Z above are attached in the GLMMselect package.

Example: Analysis of the simulated dataset

In this example, we use GLMMselect to analyze Poisson count data with an approximate reference (AR) prior for the variance components of random effects.

Model_selection_output <- GLMMselect(Y=Y, X=X, Sigma=Sigma, Z=Z, 
                                     family="poisson", prior="AR", offset=NULL)
Model_selection_output$BestModel
#> $covariate_inclusion
#> [1] 1 2
#> 
#> $random_effect_inclusion
#> [1] 1
Model_selection_output$PosteriorProb
#>    x1 x2 x3 x4 r1 r2   MPP
#> 1   1  1  0  0  1  0 0.624
#> 2   1  1  1  0  1  0 0.137
#> 3   1  1  0  1  1  0 0.114
#> 4   1  1  1  1  1  0 0.055
#> 5   1  1  0  0  0  1 0.032
#> 6   1  1  0  0  1  1 0.022
#> 7   1  1  1  0  0  1 0.006
#> 8   1  1  0  1  0  1 0.004
#> 9   1  1  1  0  1  1 0.002
#> 10  1  1  0  1  1  1 0.002
Model_selection_output$FixedEffect
#>       Est    SD   PIP
#> x1  1.028 0.028 1.000
#> x2  0.997 0.027 1.000
#> x3 -0.012 0.020 0.202
#> x4  0.000 0.023 0.177
Model_selection_output$RandomEffect
#>      Est    SD   PIP
#> r1 0.042 0.016 0.956
#> r2 0.012 0.006 0.070

GLMMselect() outputs the indices of the covariates and the indices of the random effects which are in the best model. The true model contains the first two covariates and spatial random effects. GLMMselect correctly identifies these two covariates and the spatial random effects.

Example: Analysis of Scotland lip cancer data

Here is a case study to help illustrate how to use GLMMselect. This dataset provides the number of male lip cancer cases in the 56 counties of Scotland during the period 1975-1980, as well as the percentage of the work force employed in agriculture, fishing, or forestry (AFF) as a covariate. The model we consider is \[ y_i|\mu_i \stackrel{ind}{\sim} \mbox{Poisson}(\mu_i), \ \ \ i=1\dots 56, \\ \log(\mu_i) = \log(n_i)+\pmb{x}_i^T\pmb{\beta}+\alpha_{1i}+\alpha_{2i}, \\ \pmb{\alpha}_1 \sim N(\pmb{0},\tau_1 \Sigma_1), \\ \pmb{\alpha}_2 \sim N(\pmb{0}, \tau_2 \Sigma_2). \]

\(n_i\) is the expected number of lip cancer cases in the \(i^{th}\) county, which are assumed to be known constants. \(\pmb{\alpha}_1\) is a vector of spatial random effects. \(\pmb{\alpha}_2\) is a vector of overdispersion random effects.

In this example, we use GLMMselect to analyze lip cancer data with a half Cauchy (HC) prior for variance components of random effects. Data lipcancer_Y,lipcancer_X, lipcancer_Sigma, and lipcancer_Z are attached in the GLMMselect package.

lip_cancer_output <- GLMMselect(Y=lipcancer_Y, X=lipcancer_X, Sigma=lipcancer_Sigma, Z=lipcancer_Z,
                                family="poisson", prior="HC", offset=lipcancer_offset)
lip_cancer_output$BestModel
#> $covariate_inclusion
#> [1] 1
#> 
#> $random_effect_inclusion
#> [1] 1
lip_cancer_output$PosteriorProb
#>   x1 r1 r2   MPP
#> 1  1  1  0 0.905
#> 2  0  1  0 0.084
#> 3  1  1  1 0.009
#> 4  0  1  1 0.002
#> 5  1  0  1 0.000
#> 6  0  0  1 0.000
#> 7  1  0  0 0.000
#> 8  0  0  0 0.000
lip_cancer_output$FixedEffect
#>      Est    SD   PIP
#> x1 0.428 0.128 0.914
lip_cancer_output$RandomEffect
#>      Est    SD   PIP
#> r1 0.486 0.157 1.000
#> r2 0.000 0.052 0.011

See also

ref.ICAR provides an objective Bayesian approach for modeling spatially correlated areal data using an intrinsic conditional autoregressive prior on a vector of spatial random effects.

Reference

Xu, Shuangshuang, Ferreira, M. A. R., Porter, E. M., and Franck, C. T. (202X). Bayesian Model Selection for Generalized Linear Mixed Models, Biometrics, under review.