--- title: "VARshrink 0.3: Shrinkage Estimation Methods for Vector Autoregressive Models (A Brief Version)" author: "Namgil Lee, Heon-Young Yang, and Sung-Ho Kim*" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: number_sections: true fig_width: 7 fig_height: 6 fig_caption: true abstract: > We introduce an R software package, **VARshrink**, for providing shrinkage estimation methods for vector autoregressive (VAR) models. Contrary to the standard ordinary least squares method, shrinkage estimation methods can be applied to high-dimensional VAR models with dimensionality greater than the number of observations. While existing R packages for VAR shrinkage are mostly based on parametric, Bayesian estimation methods using informative priors, the **VARshrink** aims to be an integrative R package delivering nonparametric, parametric, and semiparametric methods in a unified and consistent manner. The current version of **VARshrink** contains four shrinkage estimation methods, which are the multivarate ridge regression, a nonparametric shrinkage method, a full Bayesian approach using noninformative priors, and a semiparametric Bayesian approach using informative priors. VAR estimation problems are formulated as a multivariate regression form, so that all the shrinkage estimation methods can be run by one interface function named `VARshrink()`. We clearly present mathematical expressions of shrinkage estimators of the shrinkage estimation methods. The effective number of parameters can be calculated based on a shrinkage intensity parameter value, which plays an important role for further statistical analyses. We provide a sample R code for demonstration of the package and model comparison.

keywords: Bayes estimation, vector autoregression (VAR), high-dimensionality, shrinkage, multivariate time series bibliography: bibVARshrink.bib vignette: > %\VignetteIndexEntry{Article for VARshrink} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup0, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction {#sec:intro} Let $\mathbf{y}_t = (y_{t1},y_{t2},\ldots,y_{tK})^\top$ denote a $K\times 1$ vector of endogenous variables. A vector autoregressive (VAR) model of order $p$ can be expressed as $$ \mathbf{y}_t = \mathbf{A}_1 \mathbf{y}_{t-1} + \cdots + \mathbf{A}_p \mathbf{y}_{t-p} + \mathbf{C} \mathbf{d}_t + \boldsymbol\epsilon_t, \qquad\qquad\qquad (1) $$\label{VAReqnA} where $\mathbf{d}_t$ is an $L\times 1$ vector of deterministic regressors, $\boldsymbol\epsilon_t$ is a $K\times 1$ noise vector, and $\mathbf{A}_1,\ldots,\mathbf{A}_p$ and $\mathbf{C}$ are coefficient matrices [@Hamilton94; @Tsay05]. ## Backgrounds Recently, several R packages have been developed for parameter estimation and forecasting using stochastic time series models. The **forecast** package provides various methods and tools for univariate time series models such as the ARIMA and ETS models [@Hyndman18]. The **MTS** package has been developed for a wide variety of multivariate linear time series models and multivariate volatility models such as the VARMA, multivariate EWMA, and low-dimensional BEKK models [@Tsay18]. The **vars** package provides methods for multivariate time series analysis using the VAR, SVAR, and SVEC models [@Pfaff18]. In this study, we focus on shrinkage estimation of VAR model parameters. Shrinkage estimation methods have been playing crucial roles in high-dimensional statistical modeling; see, e.g., @Beltra2013shrinkageEEG, @Bohm2009shrinkage, @Fiecas2011aoas and @LedoitWolf2004shrinkcov. For VAR models, several shrinkage estimation methods have been suggested such as a Stein-type nonparametric shrinkage estimation method [@Rhein07c], Bayesian VARs using informative priors [@Banbura10;@Doan84;@Koop10BayesianVAR;@Litterman86], Bayesian VARs using noninformative priors [@Ni05;@Sun04], and a semiparametric Bayesian approach adopting a modified $K$-fold cross validation [@LeeChoiKim2016]. Due to its popularity in macroeconomic time series analysis, several Bayesian VAR methods have been implemented in R packages. For example, the function `BVAR()` in **MTS** implements a Bayesian VAR method using an informative prior [@Tsay18]; the package **bvarsv** implements Bayesian VAR models with stochastic volatility and time-varying parameters [@Krueger2015bvarsv;@Koop10BayesianVAR;@Primiceri2005timevarying]. On the other hand, we note that other types of shrinkage methods which have been implemented for other purposes than multivariate time series analysis can be applied to shrinkage estimation of VAR parameters. For instance, the function `cov.shrink()` in the package **corpcor** has been implemented to compute shrinkage estimates of covariances, but it has been applied to estimate VAR coefficients in @Rhein07c [@Schafer17]. Moreover, we note that VAR models can be reformulated into multivariate regression problems, so that penalized least squares methods can be used for shrinkage estimation of VAR parameters, e.g., the functions `lm.gls()` for generalized least squares and `lm.ridge()` for ridge regression in the package **MASS** [@Ripley18]; the function `glmnet()` for Lasso and Elastic-Net regularized generalized linear models in the package `glmnet()` [@Frideman18]; the function `linearRidge()` for ridge regression in the package **ridge** [@Moritz18]. ## Main Purpose While Bayesian approaches have been widely used in the literature, we note that nonparametric and semiparametric approaches have advantages in the case of high-dimensional VARs with more than several hundreds of time series variables due to their relatively low computational costs [@Rhein07c]. Despite of their relatively high computational costs, Bayesian approaches can impose proper assumptions on the multivariate time series data flexibly, such as VAR roots near unity and correlations between noise processes [@LeeChoiKim2016]. In this sense, a semiparametric approach can be a trade-off between nonparametric and parametric approaches [@LeeChoiKim2016]. In this study, we developed an integrative R package, **VARshrink**, for implementing nonparametric, parametric, and semiparametric approaches for shrinkage estimation of VAR model parameters. By providing a simple interface function, `VARshrink()`, for running various types of shrinkage estimation methods, the performance of the methods can be easily compared. We note that the package **vars** implemented an ordinary least squares method for VAR models by the function `VAR()`. The function `VARshrink()` was built to extend `VAR()` to shrinkage estimation methods, so that the **VARshrink** package takes advantage of the tools and methods available in the **vars** package. For example, we demonstrate the use of model selection criteria such as AIC and BIC in this paper, where the methods `AIC(), BIC(),` and `logLik()` can handle multivariate t-distribution and the effective number of parameters in **VARshrink**. This paper is a *brief version* of the original manuscript. This paper is organized as follows. In Section [2](#sec:models), we explain the formulation of VAR models in a multivariate regression problem, which simplifies implementation of the package. In Section [3](#sec:methods), we describe the common interface function and the four shrinkage estimation methods included in the package. We clearly present closed form expressions for the shrinkage estimators inferred by the shrinkage methods, so that we can indicate the role of the shrinkage intensity parameters in each method. In addition, we explain how the effective number of parameters can be computed for shrinkage estimators. In Section [4](#sec:numer), we present numerical experiments using benchmark data and simulated data *briefly* for comparing performances of the shrinkage estimation methods. Discussion and conclusions are provided in Section [5](#sec:concl). --- # Models {#sec:models} In general, we can rewrite the model equation Eq. (1) in the form of a multivariate regression as $$ \mathbf{y}_t = \mathbf{\Psi}^\top \mathbf{x}_t + \boldsymbol{\epsilon}_t, \qquad\qquad\qquad (5) $$\label{VAReqnPsi} where $\mathbf{\Psi} = (\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_p, \mathbf{C})^\top$ is a $(Kp + L) \times K$ matrix of coefficients, $\mathbf{x}_t = (\mathbf{y}_{t-1}^\top, \ldots, \mathbf{y}_{t-p}^\top, \mathbf{d}_t^\top)^\top$ is a $(Kp + L)\times 1$ vector of regressors. For estimation of VAR parameters from the observed time series data $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_T$, we define data matrices as \begin{equation} \begin{split} \mathbf{Y} &= ( \mathbf{y}_{p+1}, \mathbf{y}_{p+2}, \ldots, \mathbf{y}_{T} )^\top \in \mathbb{R}^{N \times K}, \\ \mathbf{X} &= ( \mathbf{x}_{p+1}, \mathbf{x}_{p+2}, \ldots, \mathbf{x}_{T} )^\top \in \mathbb{R}^{N \times (Kp + L)}, \end{split} \end{equation} with $N = T-p$. Then, we can rewrite Eq. (5) in a matrix form as $$ \mathbf{Y} = \mathbf{X} \mathbf{\Psi} + \mathbf{E} \in \mathbb{R}^{N \times K}, \qquad\qquad\qquad (6) $$\label{VAReqnMat} with $\mathbf{E} = (\boldsymbol\epsilon_{p+1}, \ldots, \boldsymbol\epsilon_T)^\top$ and $N = T-p$. --- # Shrinkage Estimation Methods {#sec:methods} In this section, we will describe the shrinkage estimation methods implemented in this package. The methods are used for estimating the VAR coefficient matrix $\mathbf{\Psi}$ alone or both of the $\mathbf{\Psi}$ and $\mathbf{\Sigma}$ in Eq. (6). We provide a common R function interface `VARshrink()` for running the estimation methods, which is defined by ``` VARshrink(y, p = 1, type = c("const", "trend", "both", "none"), season = NULL, exogen = NULL, method = c("ridge", "ns", "fbayes", "sbayes", "kcv"), lambda = NULL, lambda_var = NULL, dof = Inf, ...) ``` The input arguments are described as follows. - `y`: A T-by-K matrix of endogenous variables - `p`: Integer for the lag order - `type`: Type of deterministic regressors to include. 1) "const" - the constant: $\mathbf{d}_t = 1$. 2) "trend" - the trend: $\mathbf{d}_t = t$. 3) "both" - both the constant and the trend: $\mathbf{d}_t = (t, 1)^\top$. 4) "none" - no deterministic regressors. - `season`: An integer value of frequency for inclusion of centered seasonal dummy variables. - `exogen`: A T-by-L matrix of exogenous variables. - `method`: 1) "ridge" - multivariate ridge regression. 2) "ns" - a Stein-type nonparametric shrinkage method. 3) "fbayes" - a full Bayesian shrinkage method using noninformative priors. 4) "sbayes" - a semiparametric Bayesian shrinkage method using parameterized cross validation. 5) "kcv" - a semiparametric Bayesian shrinkage method using K-fold cross validation - `lambda, lambda_var`: Shrinkage parameter value(s). Use of this parameter is slightly different for each method: the same value does not imply the same shrinkage estimates. See description in the following subsections for the use of shrinkage parameters in each method. - `dof`: Degree of freedom of multivariate t-distribution for noise. Valid only for `method = "fbayes"` and `method = "sbayes"`. `dof = Inf` means multivariate normal distribution. The output value is an object of class "varshrinkest", which inherits the class "varest" in the package **vars**, and an object of class "varest" can be obtained by the function `VAR()` in the package **vars**. The input arguments and the output value indicate that `VARshrink()` is an extension of `VAR()` in the **vars** package. As a result, almost all the methods and functions included in the package **vars** are available for the package **VARshrink**, such as `fevd(), Phi(), plot()`. The methods and functions available in the **VARshrink** package are summarized in Table 1. The names of the functions for class "varshrinkest" has suffix "_sh" in order to distinguish them from the functions for class "varest". We remark that the classes "varshrinkest", "varshirf", and "varshsum" inherit the classes "varest", "varirf", and "varsum" of the **vars** package, respectively. ```{r, echo = FALSE, warning=FALSE} text_tbl <- data.frame( FM = c("VAR", "fevd", "irf", "predict", "summary", "arch.test_sh", "normality.test_sh", "serial.test_sh", "stability_sh"), CL = c("varshrinkest, varest", "varfevd", "varshirf, varirf", "varprd", "varshsum, varsum", "varcheck", "varcheck", "varcheck", "varstabil"), MC = c("coef, fevd, fitted, irf, logLik, Phi, plot, predict, print, Psi, resid, summary", "plot, print", "plot, print", "plot, print", "print", "plot, print", "plot, print", "plot, print", "plot, print"), FC = c("Acoef_sh, arch.test_sh, Bcoef_sh, BQ_sh, causality_sh, normality.test_sh, restrict_sh, roots_sh, serial.test_sh, stability_sh", " ", " ", "fanchart", " ", " ", " ", " ", " ") ) colnames(text_tbl) <- c("Function or method", "Class", "Methods for class", "Functions for class") kableExtra::column_spec( knitr::kable(text_tbl, caption = "Table 1. Structure of the package VARshrink."), 1:4, width = "7em", border_left = FALSE, border_right = FALSE) ``` ## Multivariate Ridge Regression The ridge regression method is a kind of penalized least squares (PLS) method, which produces a biased estimate of the VAR coefficient [@Hoerl70]. Formally speaking, the ridge regression estimator of $\mathbf{\Psi}$ can be obtained by minimizing the penalized sum of squared prediction errors (SSPE) as \begin{equation} \widehat{ \mathbf{\Psi} }^{\text{R}} (\lambda) = \arg\min_{\mathbf{\Psi}} \ \frac{1}{N} \left\|\mathbf{Y} - \mathbf{X} \mathbf{\Psi} \right\|_F^2 + \lambda \left\| \mathbf{\Psi} \right\|_F^2, \end{equation} where $\|\mathbf{A}\|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij}^2 }$ is the Frobenius norm of a matrix $\mathbf{A}$, $N = T-p$, and $\lambda \geq 0$ is called the regularization parameter or the shrinkage parameter. The ridge regression estimator $\widehat{ \mathbf{\Psi} }^\text{R} (\lambda)$ can be expressed in the closed form \begin{equation} \widehat{ \mathbf{\Psi} }^\text{R} (\lambda) = \left( \mathbf{X}^\top \mathbf{X} + N \lambda \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{Y}, \qquad \lambda \geq 0. \end{equation} The shrinkage parameter $\lambda$ for the ridge regression can be automatically determined by using the generalized cross-validation (GCV) [@Golub79]. The GCV selects the value of $\lambda$ where the GCV score given below is minimized: \begin{equation} GCV(\lambda) = \frac{1}{N} \left\| (\mathbf{I} - \mathbf{H}(\lambda) \mathbf{Y} ) \right\|^2_\text{F} \left/ \left[ \frac{1}{N} Trace(\mathbf{I} - \mathbf{H}(\lambda) ) \right]^2 \right. , \end{equation} where $\mathbf{H}(\lambda) = \mathbf{X}^\top \left(\mathbf{X}^\top \mathbf{X} + N \lambda \mathbf{I} \right)^{-1} \mathbf{X}^\top$. In this package, the interface to the shrinkage estimation methods is provided by the function `VARshrink(method = "ridge", ...)`. If the input argument `lambda` is set at a value or a vector of values, then corresponding GCV score is computed automatically for each $\lambda$ value, and the VAR coefficients with the smallest GCV score is selected. If `lambda = NULL`, then the default value of `c(0.0001, 0.0005, 0.001, 0.005, ..., 10, 50)` is used. For example, simulated time series data of length $T=100$ were generated based on a multivariate normal distribution for noise and a VAR model with $p=1$, $K=2$, $\mathbf{A}_1 = 0.5\mathbf{I}_2$, $\mathbf{C}=(0.2, 0.7)^\top$, and $\mathbf{\Sigma} = 0.1^2\mathbf{I}_2$ as follows: ```{r setup, include = FALSE} library(VARshrink) ``` ```{r, results = "hide", message=FALSE} set.seed(1000) myCoef <- list(A = list(matrix(c(0.5, 0, 0, 0.5), 2, 2)), c = c(0.2, 0.7)) myModel <- list(Coef = myCoef, Sigma = diag(0.1^2, 2), dof = Inf) Y <- simVARmodel(numT = 100, model = myModel, burnin = 10) resu_estim <- list() ``` Then, the multivariate ridge regression can be carried out for VAR models as follows. The result printed on the screen shows all the `lambda` values considered and the corresponding GCV values. The VAR parameters are estimated using the lambda value with the minimum GCV value. ```{r} resu_estim$`Ridge regression` <- VARshrink(Y, p = 1, type = "const", method = "ridge", lambda = NULL) resu_estim$`Ridge regression` ``` The method `summary()` is available for objects of class "varshrinkest" as follows: ```{r} summary(resu_estim$`Ridge regression`) ``` ## Nonparametric Shrinkage (NS) The nonparametric shrinkage (NS) estimation method for VAR models, proposed by @Rhein07c, produces an estimate of $\mathbf{\Psi}$ based on a James-Stein type shrinkage of sample covariance matrices [@Rhein07a;@Schafer05b]. We will briefly describe the NS method. In the NS method, the data matrices $\mathbf{X}$ and $\mathbf{Y}$ are mean-corrected so that each column has the mean of zero. Let $\mathbf{Z} = [\mathbf{X}, \mathbf{Y}] \in \mathbb{R}^{N \times (Kp + K + L)}$ be a combined data matrix. The sample covariance matrix of $\mathbf{Z}$ is partitioned as \begin{equation} \mathbf{S}_\text{ZZ} = \frac{1}{N - 1} \mathbf{Z}^\top \mathbf{Z} = \begin{bmatrix} \mathbf{S}_\text{XX} & \mathbf{S}_\text{XY} \\ \mathbf{S}_\text{XY}^\top & \mathbf{S}_\text{YY} \end{bmatrix} \in \mathbb{R}^{(Kp + K + L) \times (Kp + K + L)}, \end{equation} where $\mathbf{S}_\text{XX} = (N - 1)^{-1} \mathbf{X}^\top \mathbf{X}$, $\mathbf{S}_\text{XY} = (N - 1)^{-1} \mathbf{X}^\top \mathbf{Y}$, and $\mathbf{S}_\text{YY} = (N - 1)^{-1} \mathbf{Y}^\top \mathbf{Y}$. The matrix $\mathbf{S}_\text{ZZ}$ can be decomposed as \begin{equation} \mathbf{S}_\text{ZZ} = \mathbf{D}_\text{Z}^{1/2} \mathbf{R}_\text{ZZ} \mathbf{D}_\text{Z}^{1/2}, \end{equation} where $\mathbf{R}_\text{ZZ}$ is the sample correlation matrix and $\mathbf{D}_\text{Z} = \text{diag}(s_{11}, s_{22}, \ldots, s_{Kp + K + L, Kp + K + L})$ is a diagonal matrix with diagonal elements of sample variances. Shrinkage estimates of the correlation matrix and the variances can be written as \begin{equation} \widehat{\mathbf{R}}_\text{ZZ} = (1-\lambda) \mathbf{R}_\text{ZZ} + \lambda \mathbf{I} \quad \text{and} \quad \widehat{\mathbf{D}}_\text{Z} = \text{diag}(\hat{s}_{11}, \hat{s}_{22}, \ldots, \hat{s}_{Kp+K+L,Kp+K+L}) \end{equation} with \begin{equation} \hat{s}_{ii} = (1-\lambda_v) s_{ii} + \lambda_v s_\text{med}, \quad i = 1, 2, \ldots, Kp + K + L, \end{equation} where $s_\text{med}$ is a median of all the sample variances $s_{ii}$, and $0\leq \lambda, \lambda_v \leq 1$ are shrinkage parameters. The estimated covariance matrix can be computed by \begin{equation} \widehat{\mathbf{S}}_\text{ZZ}(\lambda, \lambda_v) = \widehat{\mathbf{D}}_\text{Z}^{1/2} \widehat{\mathbf{R}}_\text{ZZ} \widehat{\mathbf{D}}_\text{Z}^{1/2}. \end{equation} The values of the shrinkage parameters $\lambda$ and $\lambda_v$ are determined by the James-Stein type shrinkage method, which we call the NS method described in [@Rhein07a;@Schafer05b]. The ordinary least squares estimate $\widehat{\mathbf{\Psi}}^{\text{OLS}}$ of $\mathbf{\Psi}$ is given by $\widehat{\mathbf{\Psi}}^{\text{OLS}} = \mathbf{S}_\text{XX}^{-1} \mathbf{S}_\text{XY}$. We define the NS estimate of $\mathbf{\Psi}$ as \begin{equation} \widehat{ \mathbf{\Psi} }^\text{N} (\lambda, \lambda_v) = \widehat{\mathbf{S}}_\text{XX}^{-1} \widehat{\mathbf{S}}_\text{XY}, \qquad 0\leq \lambda, \lambda_v \leq 1. \end{equation} where $\widehat{\mathbf{S}}_\text{XX}$ and $\widehat{\mathbf{S}}_\text{XY}$ are parts of the estimated covariance matrix, \begin{equation} \widehat{\mathbf{S}}_\text{ZZ} (\lambda, \lambda_v) = \begin{bmatrix} \widehat{\mathbf{S}}_\text{XX} & \widehat{\mathbf{S}}_\text{XY} \\ \widehat{\mathbf{S}}_\text{XY}^\top & \widehat{\mathbf{S}}_\text{YY} \end{bmatrix}. \end{equation} In the package **VARshrink**, the function `VARshrink(method = "ns", ...)` provides an interface with the NS method. In specific, the package **corpcor** [@Schafer17] includes the R function `cov.shrink()`, which can determine $\lambda$ and $\lambda_v$ and estimate the covariance matrix $\widehat{\mathbf{S}}_\text{ZZ}(\lambda, \lambda_v)$. The function `VARshrink()` in the **VARshrink** package infers the NS estimates of VAR coefficients, $\widehat{\mathbf{\Psi}}^\text{N} (\lambda, \lambda_v)$, by using the covariance matrix $\widehat{\mathbf{S}}_\text{ZZ}(\lambda, \lambda_v)$. If the input arguments `lambda` and `lambda_var` are set as `lambda = NULL` and `lambda_var = NULL`, then $\lambda$ and $\lambda_v$ are determined automatically. We can find the estimated $\lambda$ and $\lambda_v$ values on the printed screen. For example, ```{r} resu_estim$`Nonparametric shrinkage` <- VARshrink(Y, p = 1, type = "const", method = "ns", lambda = NULL, lambda_var = NULL) resu_estim$`Nonparametric shrinkage` ``` The `summary()` shows statistical inference on the estimated VAR coefficients together with the estimated scale matrix $\mathbf{\Sigma}$ of multivariate t-distribution for noise. ```{r} summary(resu_estim$`Nonparametric shrinkage`) ``` ## Full Bayesian Shrinkage @Ni05 and @Sun04 studied Bayesian estimation methods using noninformative priors for VAR models, where they used Markov Chain Monte Carlo (MCMC) methods for estimating coefficient matrices, noise covariance matrices, and other hyperparameters in Bayesian VAR models. In @Ni05, various Bayesian estimators were compared using several types of loss functions, noninformative priors, and multivariate normal and t-distributions. Among them, Bayesian estimators using a certain type of noninformative priors showed higher accuracies than the other Bayesian estimators in simulated experiments. The noninformative prior for coefficient matrices was called the shrinkage prior, and the prior for the noise covariance matrix was called the reference prior. In the package **VARshrink**, we can obtain Bayesian estimators using the shrinkage prior and the reference prior, which are the noninformative priors that yielded the highest accuracies in the simulation results [@Ni05]. As a Bayesian estimator of VAR parameters, the minimizer of the posterior expectation of quadratic loss is computed, which is the mean of the posterior distribution. Additionally, the minimizer of the posterior expectation of LINEX loss is also computed [@Ni05;@Zellner86]. In this section, we will explain the model assumptions in more detail. First, the noise vectors are independent and identically distributed from multivariate t-distributions with the degree of freedom $\nu$, i.e., $\boldsymbol\epsilon_t \sim t_\nu (\mathbf{0}, \mathbf{\Sigma})$. It can be expressed as a hierarchical model as \begin{equation} \begin{split} ( \boldsymbol\epsilon_t | q_t ) & \sim \text{N}_K (\mathbf{0}, q_t^{-1} \mathbf{\Sigma}). \qquad\qquad\qquad (7) \\ q_t & \sim \text{Gamma}(\nu/2, \nu/2). \end{split} \end{equation}\label{noise_tdist_hierarchy} Second, we denote the vectorized version of $\mathbf{\Psi} \in\mathbb{R}^{(J/K)\times K}$ by $\boldsymbol\psi = \text{vec}(\mathbf{\Psi}) \in\mathbb{R}^{J}$. Here $J = K (Kp + L)$. The shrinkage prior $\pi_\text{S}(\boldsymbol\psi)$ for $\boldsymbol\psi \in\mathbb{R}^{J}$ is taken as $\pi_\text{S}(\boldsymbol\psi) \propto \left\| \boldsymbol\psi \right\|^{ -(J-2) }.$ By introducing a latent variable $\lambda>0$, the shrinkage prior can also be expressed as a scale mixture of multivariate normals as\footnote{\cite{Ni05} used the letter $\delta$ to denote $\delta = \lambda^{-1}$.} \begin{equation} \begin{split} (\boldsymbol\psi | \lambda) & \sim \text{N}_{J} (\mathbf{0}, \lambda^{-1} \mathbf{I}_J), \qquad\qquad\qquad (8) \\ \pi (\lambda) & \propto 1. \end{split} \end{equation}\label{psi_normal_const_prior} Third, the reference prior $\pi_\text{R}(\mathbf{\Sigma})$ for $\mathbf{\Sigma}$ is taken as \begin{equation} \pi_\text{R}(\mathbf{\Sigma}) \propto \left| \mathbf{\Sigma} \right|^{-1} \prod_{1\leq i\leq j\leq K} (\lambda_i - \lambda_j)^{-1}, \end{equation} where $\lambda_1 > \lambda_2 > \cdots > \lambda_K$ are eigenvalues of $\mathbf{\Sigma}$. Note that no hyperparameters are involved in the shrinkage prior and the reference prior, since they are noninformative priors. The Gibbs MCMC method makes use of the hierarchical expression of $\boldsymbol\epsilon_t$. The Gibbs MCMC samples the parameters $(\boldsymbol\psi, \lambda, \mathbf{\Sigma}, \mathbf{Q}, \nu)$ with $\mathbf{Q} = \text{diag}(q_{p+1}, \ldots, q_T)$ from conditional posterior distributions [@Ni05]. We remark that the mean of the conditional posterior distribution, $\pi(\boldsymbol\psi | \lambda, \mathbf{\Sigma}, \mathbf{Q}, \nu; \mathbf{Y})$ of $\boldsymbol\psi$ is given by \begin{equation} \widehat{\boldsymbol\psi}^{F} (\lambda) = \left[\left(\mathbf{\Sigma}^{-1} \otimes \left( \mathbf{X}^\top \mathbf{Q} \mathbf{X} \right) \right) + \lambda \mathbf{I}_J \right]^{-1} \text{vec}\left( \mathbf{X}^\top \mathbf{Q} \mathbf{Y} \mathbf{\Sigma}^{-1} \right), \qquad \lambda > 0. \qquad\qquad\qquad (9) \end{equation}\label{expr_fbayes_psihat} Note that if $\mathbf{\Sigma} = \mathbf{I}_K$ and $q_{p+1}=\cdots=q_T = 1$, then the estimator becomes the ridge regression estimator. That is, Bayesian shrinkage estimators have more flexible and abundant expressions, even though more computational effort is required to estimate more parameters. In the package **VARshrink**, the function `VARshrink(method = "fbayes", ...)` plays the role as an interface with the full Bayesian shrinkage method with the shrinkage prior and the reference prior. The shrinkage parameter $\lambda$ is not set at a fixed value, i.e., the arguments `lambda` and `lambda_var` are of no use here. Instead, the mean of the posterior distribution, $\hat{\delta} = \mathbb{E}\left[ \lambda^{-1} | \mathbf{Y} \right]$ is estimated during the MCMC process [@Ni05], and we define $\hat{\lambda} = \hat{\delta}^{-1}$. There are several arguments to be specified as follows. - `dof`: If `dof = Inf`, then a multivariate normal distribution is applied and and the weight $\mathbf{Q}$ is not estimated. If `dof` is a finite value, then we apply the multivariate t-distribution with a fixed degree of freedom $\nu=$ `dof` and estimate $\mathbf{Q}$. If `dof = NULL`, we estimate both the degree of freedom $\nu$ and the weight $\mathbf{Q}$. In this case, the package **ars** is required for sampling the parameter $\nu$ from its conditional posterior distribution. - `burnincycle, mcmccycle`: The Gibbs MCMC method samples a series of parameter values of the length, `burnincycle + mcmccycle`. `burnincycle` is the number of sampled parameter values to be discarded in the beginning, and `mcmccycle` is the number to be attained and used for computing the parameter estimates. By default, we set `burnincycle = 1000` and `mcmccycle = 2000`. For example, we run the full Bayesian shrinkage method with a fixed $\nu = 6$ as follows. ```{r} resu_estim$`Full Bayes (fixed dof)` <- VARshrink(Y, p = 1, type = "const", method = "fbayes", dof = 6, burnincycle = 1000, mcmccycle = 2000) resu_estim$`Full Bayes (fixed dof)` ``` By using `summary()`, we can find the estimated scale matrix $\mathbf{\Sigma}$ of multivariate t-distribution for noise. ```{r} summary(resu_estim$`Full Bayes (fixed dof)`) ``` Instead, we can estimate the degree of freedom, $\nu$, of multivariate t-distribution by the argument `dof = NULL` as follows: ```{r} resu_estim$`Full Bayes (estim dof)` <- VARshrink(Y, p = 1, type = "const", method = "fbayes", dof = NULL, burnincycle = 1000, mcmccycle = 2000) resu_estim$`Full Bayes (estim dof)` ``` ## Semiparametric Bayesian Shrinkage {#sec:sbayes} Whereas full Bayesian shrinkage methods estimate all the hyperparameters including latent variables via MCMC methods, semiparametric Bayesian methods estimate some of the hyperparameters by a certain nonparametric method and estimate the rest in a Bayesian framework. The semiparametric approach is advantageous especially when the dimensionality of the model is so high that standard MCMC methods are not computationally tractable. @LeeChoiKim2016 assumed scale mixtures of multivariate normal distributions for noise vectors as in Eq. (7). The prior distribution for the model coefficients, $\boldsymbol\psi = \text{vec}(\mathbf{\Psi})$, was set as multivariate normal distributions, similarly to Eq. (8). Here, we can choose either the conjugate prior (CJ) and non-conjugate (NCJ) prior as follows: (i) The conjugate prior for $\boldsymbol\psi \in\mathbb{R}^J$ is expressed by \begin{equation} (\boldsymbol\psi | \lambda, \mathbf{\Sigma}) \sim \text{N}_{J} \left(\mathbf{0}, \frac{1-\lambda}{(N - 1) \lambda} \mathbf{\Sigma} \otimes \mathbf{I} \right), \qquad 0<\lambda < 1. \end{equation} (ii) The non-conjugate prior for $\boldsymbol\psi \in\mathbb{R}^J$ is expressed by \begin{equation} (\boldsymbol\psi | \lambda) \sim \text{N}_{J} \left(\mathbf{0}, \frac{1-\lambda}{(N - 1) \lambda} \mathbf{I}_J \right), \qquad 0<\lambda < 1. \end{equation} The scale matrix $\mathbf{\Sigma}$ for noise is included in the equation for conjugate prior distribution but not for the non-conjugate prior distribution. The non-conjugate prior is quite similar to the full Bayesian formulation in Eq. (8). However, the main difference is that, in the semiparametric Bayesian approach, the shrinkage parameter $\lambda$ should be estimated explicitly via a nonparametric method, but in the full Bayes approach, it is a latent variable which should be sampled and estimated implicitly via an MCMC method. The prior distribution for $\mathbf\Sigma$ was set as an inverse Wishart distribution as \begin{equation} (\mathbf{\Sigma} | \mathbf{L}_0, m_0) \sim \text{InvWishart}( \mathbf{L}_0, m_0), \qquad \mathbf{L}_0 \succ \mathbf{0}, m_0 > K - 1. \end{equation} where $\mathbf{L}_0 \succ \mathbf{0}$ means that $\mathbf{L}_0$ is positive definite. Once the shrinkage parameter $\lambda$ is set at a fixed value, the other parameters, $\boldsymbol\psi, \mathbf{\Sigma}$, and $\mathbf{Q}$ can be estimated iteratively in a Bayesian framework efficiently. Briefly speaking, we consider estimating the parameters $\boldsymbol\psi$ and $\mathbf{\Sigma}$ at the maximum point (i.e., the mode) of the marginal posterior density function $\pi (\boldsymbol\psi, \mathbf{\Sigma} | \lambda; \mathbf{Y})$. In the case of the non-conjugate prior, the mode, $(\widehat{\boldsymbol\psi}^\text{S} (\lambda), \widehat{\mathbf{\Sigma}}^\text{S} (\lambda))$, is expressed as \begin{equation} \widehat{\boldsymbol\psi}^{S} (\lambda) = \left[\left( (\widehat{\mathbf{\Sigma}}^\text{S})^{-1} \otimes \left( \mathbf{X}^\top \widehat{\mathbf{Q}}^\text{S} \mathbf{X} \right) \right) + \frac{ (N - 1) \lambda}{1-\lambda} \mathbf{I}_J \right]^{-1} \text{vec}\left( \mathbf{X}^\top \widehat{\mathbf{Q}}^\text{S} \mathbf{Y} ( \widehat{\mathbf{\Sigma}}^\text{S})^{-1} \right), \qquad\qquad\qquad (10) \end{equation}\label{expr_sbayes_psihat} and \begin{equation} \widehat{\mathbf{\Sigma}}^\text{S} (\lambda) = \frac{1}{m_0 + T + K + 1} \left( \mathbf{L}_0 + \mathbf{Y}^\top \widehat{\mathbf{Q}} \mathbf{Y} - \mathbf{Y}^\top \widehat{\mathbf{Q}} \mathbf{X} \widehat{\mathbf{\Psi}}^\text{S} (\lambda) \right), \end{equation} for $0< \lambda < 1$, where $\widehat{\mathbf{Q}}^\text{S} = \text{diag}(\hat{q}_{p+1}, \ldots, \hat{q}_T)$ is obtained in an iterative manner.\footnote{We note that the expression in Eq. (10) is equivalent to that in Eq. (9) in a full Bayesian framework.} The values $\widehat{\mathbf{Q}}^\text{S}$ is also expressed as \begin{equation} \hat{q}_t = h \left( \left\| (\widehat{\mathbf{\Sigma}}^\text{S})^{-1/2} \hat{\boldsymbol\epsilon}_t \right\|^2 \right), \qquad t = p+1, \ldots, T, \end{equation} where $h(x)$ is defined depending on the noise distribution and $\hat{\boldsymbol\epsilon}_t$ is the residual given by $\hat{\boldsymbol\epsilon}_t = \mathbf{y}_t - \widehat{\mathbf{\Psi}}^{\text{S}\top} \mathbf{x}_t$ [@LeeChoiKim2016]. The shrinkage parameter $\lambda$ is determined via an internal nonparametric method, which is called the parameterized cross validation (PCV). This algorithm can be considered as a modified $K$-fold cross validation, especially for estimating the shrinkage parameter of VAR models; see @LeeChoiKim2016 for a detailed explanation. In addition, the semiparametric Bayesian shrinkage method adopts the idea of shrinking both the correlations and variances from the NS method. As a result, there are two shrinkage parameters $\lambda$ and $\lambda_v$, where $0 \leq \lambda \leq 1$ is used for the shrinkage estimation of the VAR coefficient matrix $\Psi$ and noise covariance matrix $\mathbf{\Sigma}$ while $0 \leq \lambda_v \leq 1$ is used for the shrinkage estimation of the variances of the variables $y_{tj}, j = 1, \ldots, K$. In the package **VARshrink**, the function `VARshrink(method = "sbayes", ...)` is for the semiparametric Bayesian shrinkage method. There are several input arguments to these functions as follows: - `dof`: If `dof = Inf`, we use a multivariate normal distribution for noise vectors. Otherwise, `dof` can be set as a finite number $\nu$ between $0$ and $\infty$ for the multivariate t-distribution with the degree of freedom $\nu$. If `dof = NULL`, then $\nu$ is automatically selected by the $K$-fold cross validation, with $K$ `= num_fold`. `dof = Inf` by default. - `lambda, lambda_var`: `lambda = NULL` and `lambda_var = NULL` implies that the shrinkage parameters $\lambda$ and $\lambda_v$ are estimated automatically with $0\leq \lambda, \lambda_v \leq 1$. - `prior_type`: Either `prior_type = "NCJ"` or `prior_type = "CJ"`, which implies non-conjugate prior or conjugate prior. The default value is "NCJ". - `num_folds`: The number of folds for the parameterized cross validation method for determining $\lambda$ value. `num_folds = 5` by default. It works only when `lambda = NULL` or `dof = NULL`. - `m0`: The value of the hyperparameter $m_0$ of the inverse Wishart prior of $\mathbf{\Sigma}$. `m0 = `$K$ by default. On the other hand, the other hyperparameter $\mathbf{L}_0$ of the inverse Wishart prior is set by $\mathbf{L}_0 = (m_0 + K + 1) \mathbf{I}_K$. For example, we can run the semiparametric shrinkage method as follows. ```{r} resu_estim$`Semi Bayes (fixed dof)` <- VARshrink(Y, p = 1, type = "const", method = "sbayes", dof = 6, lambda = NULL, lambda_var = NULL, prior_type = "NCJ", num_folds = 5, m0 = ncol(Y)) resu_estim$`Semi Bayes (fixed dof)` summary(resu_estim$`Semi Bayes (fixed dof)`) ``` We can also let the software package to choose the degree of freedom parameter $\nu$ automatically by setting `dof = NULL`. In this case, the package uses simply a $K$-fold cross validation to find an optimal value of $\nu$. ```{r} resu_estim$`Semi Bayes (estim dof)` <- VARshrink(Y, p = 1, type = "const", method = "sbayes", dof = NULL, lambda = NULL, lambda_var = NULL, prior_type = "NCJ", num_folds = 5, m0 = ncol(Y)) resu_estim$`Semi Bayes (estim dof)` ``` ## K-fold Cross Validation for Semiparametric Shrinkage The **VARshrink** includes an implementation of the K-fold cross validation (CV) method for selecting shrinkage parameters. In the current version of the package, the K-fold CV method can select the $\lambda$ and $\lambda_v$ values for the semiparametric Bayesian shrinkage estimator described in Section [3.4](#sec:sbayes). Note the the semiparametric shrinkage method in the previous section selects a $\lambda$ value by using the PCV method and selects a $\lambda_v$ value by a Stein-type nonparametric shrinkage method. The K-fold CV method can be run as follows. The arguments to `VARshrink()` are same to those for the semiparametric Bayesian shrinkage method except for the argument `method = "kcv"`. ```{r} resu_estim$`K-fold CV (fixed dof)` <- VARshrink(Y, p = 1, type = "const", method = "kcv", dof = 6, lambda = NULL, lambda_var = NULL, prior_type = "NCJ", num_folds = 5, m0 = ncol(Y)) resu_estim$`K-fold CV (fixed dof)` ``` Degree of freedom of multivariate t-distribution for noise can be selected automatically by setting the argument `dof = Inf` as follows. ```{r, results = "hide"} resu_estim$`K-fold CV (estim dof)` <- VARshrink(Y, p = 1, type = "const", method = "kcv", dof = NULL, lambda = NULL, lambda_var = NULL, prior_type = "NCJ", num_folds = 5, m0 = ncol(Y)) ``` After all, the function `calcSSE_Acoef()` computes the sum of squared errors (SSEs) between two VAR model parameters, $\{\mathbf{A}_k^{(1)}\}$ and $\{\mathbf{A}_k^{(2)}\}$, as $SSE = \sum_{k=1}^p \sum_{i,j} (\mathbf{A}^{(1)}_{kij} - \mathbf{A}^{(2)}_{kij})^2$. For example, Table 2 shows the SSEs of the estimated VAR coefficients. ```{r} resu_sse <- data.frame(SSE = sapply(resu_estim, function(x) calcSSE_Acoef(Acoef_sh(x), myCoef$A))) ``` ```{r, echo = FALSE} knitr::kable(round(resu_sse, 3), caption = "Table 2. Sum of squared errors of VAR coefficients estimated by the shrinkage methods.") ``` --- # Numerical Experiments {#sec:numer} In this section, we apply the shrinkage estimation methods in the package **VARshrink** to a benchmark data set. Using the benchmark data set, we demonstrate the use of information criteria such as Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) for comparison of VAR models. In this case, the effective number of parameters for an shrinkage estimate has to be re-calculated based on the shrinkage intensity parameter value. ## Benchmark Data The Canada data set is a benchmark macroeconomic data included in the package **vars**. It contains four time series variables representing employment(`e`), labor productivity(`prod`), real wage(`rw`), and unemployment rate(`U`), with $84$ observations. We took difference on the data to remove trend, yielding $T = 83$. Figure 1 shows the differencing result. ```{r diffCanada, out.width='70%', fig.cap = "Figure 1. The benchmark data set obtained by differencing the Canada data."} data(Canada, package = "vars") Y = diff(Canada) plot(Y, cex.lab = 1.3) ``` The shrinkage methods are run with the option `const = "const"` since the mean of the data has not been corrected. We set the lag order $p$ at `p = 1, 2, 3` to compare several VAR models. For the semiparametric Bayesian method (`method = "sbayes"`), the degree of freedom $\nu$ was automatically selected. In addition, we ran the K-fold CV method (`method = "kcv"`) for choosing $\lambda$ and $\lambda_v$ values of the semiparametric Bayesian shrinkage estimator. ``` set.seed(1000) resu_model <- array(NA, dim = c(5, 2, 3), dimnames = list(c("Ridge regression", "Nonparametric shrinkage", "Full Bayes", "Semi Bayes", "K-fold CV"), c("AIC", "BIC"), c("p=1", "p=2", "p=3"))) for (p in 1:3) { EstimRidge <- VARshrink(Y, p = p, type = "const", method = "ridge") resu_model["Ridge regression", , p] <- c(AIC(EstimRidge), BIC(EstimRidge)) EstimNS <- VARshrink(Y, p = p, type = "const", method = "ns") resu_model["Nonparametric shrinkage", , p] <- c(AIC(EstimNS), BIC(EstimNS)) EstimFB <- VARshrink(Y, p = p, type = "const", method = "fbayes", dof = NULL) resu_model["Full Bayes", , p] <- c(AIC(EstimFB), BIC(EstimFB)) EstimSB <- VARshrink(Y, p = p, type = "const", method = "sbayes", dof = NULL, prior_type = "NCJ") resu_model["Semi Bayes", , p] <- c(AIC(EstimSB), BIC(EstimSB)) EstimKCV <- VARshrink(Y, p = p, type = "const", method = "kcv", dof = NULL, prior_type = "NCJ") resu_model["K-fold CV", , p] <- c(AIC(EstimKCV), BIC(EstimKCV)) } ``` We can compare several models by computing their AIC and BIC. The following results in Table 3 indicate that the NS method produced better VAR coefficients than those of the other methods, and that the AIC took the minimum at $p=3$ while the BIC took the minimum at $p=2$. ```{r modelcomp, echo = FALSE} load("table3_modelcomp.RData") knitr::kable(round(resu_model, 1), caption = "Table 3. Information criteria (AIC, BIC) for model comparison.") ``` The estimated parameters by the NS method with $p=2$ can be analyzed further by using the methods and functions in Table 1. For example, we can perform time series forecasting as in Figure 2. ```{r pred, fig.cap="Figure 2. A 10-step ahead Forecasting of time series by the VAR model estimated by the nonparametric shrinkage method. The differenced Canada data were modeled by a VAR(2) model selected at the minimum BIC."} plot(predict(VARshrink(Y, p = 2, type = "const", method = "ns")), names = "U") ``` --- # Conclusions {#sec:concl} We wrote an R software package **VARshrink** for shrinkage estimation of VAR model parameters. The shrinkage methods included in this package are the multivariate ridge regression [@Hoerl70; @Golub79], a nonparametric shrinkage method [@Rhein07c], a full Bayesian shrinkage method [@Ni05], and a semiparametric Bayesian shrinkage method [@LeeChoiKim2016]. An advantage of this package is the integration of the nonparametric, parametric, and semiparametric methods in one frame via a common interface function `VARshrink()`, which makes it easy and convenient to use various types of shrinkage estimation methods. Moreover, we note that the shrinkage estimation methods implemented in the current version have not been widely implemented in R packages in the context of VAR models. We demonstrated the use of model selection criteria as AIC and BIC by using benchmark time series data. We note that computation of the log-likelihood of an estimated model must consider the actual distribution assumption of each method. We explained that the multivariate normal distribution is not the only choice for a distribution of noise, but another distributions such as the multivariate t-distribution can be chosen. Moreover, an effective number of parameters must be calculated for computing the AIC and BIC values. In this case, a large number of shrinkage parameter value leads to a small value of the effective number of parameters. In the package **VARshrink**, effective number of parameters is computed automatically based on the selected shrinkage parameter value. Shrinkage methods are quite crucial especially for high dimensional VAR models. Bayesian approaches have been developed widely for VAR models in the literature for various purposes. However, the computational time for carrying out MCMC processes may be intractably high for high dimensional VAR models. For this reason, it is important to use nonparametric and semiparametric shrinkage estimation methods together to produce computationally feasible estimates of model parameters. The R package **VARshrink** is the first step to an integrative and general types of software packages for VAR models. Moreover, this package can be extended to include other shrinkage methods such as Bayesian shrinkage methods using several types of different prior distributions. # References {-}