A Guide to the SpatialGEV Package

Introduction to the GEV-GP Model

The generalized extreme value (GEV) distribution is often used to analyze sequences of maxima within non-overlapping time periods. An example of this type of data is the monthly maximum rainfall levels recorded over years at a weather station. Since there are typically a large number of weather stations within a country or a state, it is more ideal to have a model that can borrow information from nearby weather stations to increase inference and prediction accuracy. Such spatial information is often pooled using the Gaussian process.

The GEV-GP model is a hierarchical model with a data layer and a spatial random effects layer. Let $${\boldsymbol{x}}_1, \ldots, {\boldsymbol{x}}_n \in \mathbb{R}^2$$ denote the geographical coordinates of $$n$$ locations, and let $$y_{ik}$$ denote the extreme value measurement $$k$$ at location $$i$$, for $$k = 1, \ldots, n_i$$. The data layer specifies that each observation $$y_{ik}$$ has a generalized extreme value distribution, denoted by $$y \sim \mathrm{GEV}(a, b, s)$$, whose CDF is given by $$$F(y\mid a, b, s) = \begin{cases} \exp\left\{-\left(1+s\frac{y-a}{b}\right)^{-\frac{1}{s}}\right\} \ \ &s\neq 0,\\ \exp\left\{-\exp\left(-\frac{y-a}{b}\right)\right\} \ \ &s=0, \end{cases} \label{eqn:gev-distn}$$$ where $$a\in\mathbb{R}$$, $$b>0$$, and $$s\in\mathbb{R}$$ are location, scale, and shape parameters, respectively. The support of the GEV distribution depends on the parameter values: $$y$$ is bounded below by $$a-b/s$$ when $$s>0$$, bounded above by $$a-b/s$$ when $$s<0$$, and unbounded when $$s=0$$. To capture the spatial dependence in the data, we assume some or all of the GEV parameters in the data layer are spatially varying. Thus they are introduced as random effects in the model.

A Gaussian process $$z({\boldsymbol{x}})\sim \mathcal{GP}(\mu({\boldsymbol{x}}_{cov}), k({\boldsymbol{x}}, {\boldsymbol{x}}'))$$ is fully characterized by its mean $$\mu({\boldsymbol{x}}_{cov})$$ and kernel function $$k({\boldsymbol{x}}, {\boldsymbol{x}}') = {\mathrm{Cov}}( z({\boldsymbol{x}}), z({\boldsymbol{x}}') )$$, which captures the strength of the spatial correlation between locations. The mean is a function of parameters $$\boldsymbol{\beta}$$ and covariates $${\boldsymbol{x}}_{cov}=(x_1,\ldots,x_p)'$$. We assume that given the locations, the data follow independent GEV distributions each with their own parameters. The complete GEV-GP hierarchical model then becomes \begin{aligned} y_{ik} \mid a({\boldsymbol{x}}_i), b({\boldsymbol{x}}_i), s & {\overset{ind}{\sim}}\mathrm{GEV}\big( a({\boldsymbol{x}}_i), \exp( b({\boldsymbol{x}}_i) ), \exp(s)\big)\\ a({\boldsymbol{x}}) \mid \boldsymbol{\beta}_a, {\boldsymbol{\theta}}_a &\sim \mathcal{GP}\big( {\boldsymbol{X}}_a\boldsymbol{\beta}_a, k({\boldsymbol{x}}, {\boldsymbol{x}}' \mid {\boldsymbol{\theta}}_a) \big)\\ log(b)({\boldsymbol{x}}) \mid \boldsymbol{\beta}_b, {\boldsymbol{\theta}}_b &\sim \mathcal{GP}\big( {\boldsymbol{X}}_b\boldsymbol{\beta}_b, k({\boldsymbol{x}}, {\boldsymbol{x}}' \mid {\boldsymbol{\theta}}_b) \big)\\ s({\boldsymbol{x}}) \mid \boldsymbol{\beta}_s, {\boldsymbol{\theta}}_s &\sim \mathcal{GP}\big( {\boldsymbol{X}}_s\boldsymbol{\beta}_s, k({\boldsymbol{x}}, {\boldsymbol{x}}' \mid {\boldsymbol{\theta}}_a) \big). \end{aligned} In this package, a uniform prior $$\pi({\boldsymbol{\theta}}) \propto 1$$ is specified on the fixed effect and hyperparameters ${\boldsymbol{\theta}}=(s, \boldsymbol{\beta}_a, \boldsymbol{\beta}_b, \boldsymbol{\beta}_s, {\boldsymbol{\theta}}_a, {\boldsymbol{\theta}}_b, {\boldsymbol{\theta}}_s).$

What Does SpatialGEV Do?

The package provides an interface to estimate the approximate joint posterior distribution of the spatial random effects $$a$$ and $$b$$ in the GEV-GP model. The main functionalities of the package are:

• Method to fit the GEV-GP model and sample from the approximate joint posterior distribution of $$a$$ and $$b$$

• Method to sample from the posterior predictive distributions $$p({\boldsymbol{y}}_{\mathrm{new}} \mid {\boldsymbol{y}}_{\mathrm{observed}})$$ at new locations

Details about the approximate posterior inference can be found in Chen, Ramezan, and Lysy (2021).

Installation

SpatialGEV depends on the package TMB to perform the Laplace approximation. Make sure you have TMB installed following their instruction before installing SpatialGEV. Moreover, SpatialGEV uses several functions from the INLA package for approximating the Matérn covariance with the SPDE representation and for creating meshes on the spatial domain. If the user would like to use the SPDE approximation, please first install INLA. Since INLA is not on CRAN, it needs to be downloaded following their instruction here.

To install SpatialGEV, run the following:

devtools::install_github("meixichen/SpatialGEV")

Using the SpatialGEV Package

Exploratory analysis

We now demonstrate how to use this package through a simulation study. The simulated data used in this example comes with the package as a list variable simulatedData2, which contains the following:

• locs: A $$400\times 2$$ data frame of spatial coordinates (longitudes and latitudes)

• a: A length $$400$$ vector of the true values of $$a_i, \ i = 1,\ldots,400$$ at the 400 locations

• logb: A length $$400$$ vector of the true values of log-transformed $$b_i, \ i=1,\ldots,400$$ at the 400 locations

• logs: A length $$400$$ vector of the true values of log-transformed $$s_i, \ i=1,\ldots,400$$ at the 400 locations

a, logb and logs are simulated from Gaussian random fields using the R package SpatialExtremes. Using the simulated GEV parameters, we generate 50 to 70 observed data $${\boldsymbol{y}}_i$$ at each location $$i, \ i=1,\ldots,400$$.

Spatial variation of $$a$$ and $$log(b)$$ can be viewed by plotting them on regular lattices:

Number of observations at each location is shown in the figure below.

Below are histograms of observations at $$8$$ randomly sampled locations.

Model fitting

To fit the GEV-GP model to this simulated dataset, the first step is calling the spatialGEV_fit() function, for which several arguments must be provided:

• y: A list of $$n$$ vectors, each of which contains all data collected at one location.

• locs: A $$n \times 2$$ coordinate matrix.

• random: A character string indicating which GEV parameters are treated as spatial random effects. It can be one of “a”, “ab”, and “abs”, where “a”, “b”, and “s” represent the GEV location, scale, and shape parameters, respectively. Note that in the model $$b$$ is always estimated on the log scale since it is constrained to be positive.

• init_param: A list of initial parameters to be passed to the optimizer. Call ?spatialGEV_fit() and see Details for which parameters need to be included in the list.

• reparam_s: A flag for reparametrizing the GEV shape parameter $$s$$ - either “zero”, “unconstrained”, “negative”, or “positive”. For example, if reparam_s = "positive", the model works with the log-transformed shape parameter. Call ?spatialGEV_fit() and see Details for more information about this argument.

There are other arguments which user might want to specify to override the defaults:

• kernel: The kernel used for the Gaussian process(es) describing the spatial random effect(s). Currently 3 kernels are implemented: the default exponential kernel (kernel="exp"), the Matérn kernel (kernel="matern"), and the SPDE approximation to the Matérn kernel (kernel="spde") based on Lindgren, Rue, and Lindström (2011).

• X_a, X_b, X_s: Design matrices of covariates for the spatial random effects “a”, “b”, and “s”. The defaults are column matrices of $$1$$s, i.e., only an intercept parameter $$\beta_0$$ will be estimated for each spatial random effect.

• s_prior: A vector $$(\mu, \sigma)$$. Optionally a normal prior with parameters $$(\mu, \sigma)$$ can be specified on the fixed effect shape parameter $$s$$, or its reparametrized version depending on the value of reparam_s. When s_prior is not specified, a uniform prior is used.

• beta_prior: A named list of length-2 vectors $$(\mu, \sigma)$$. Optionally normal priors $$\mathcal{N}(\mu, \sigma^2)$$ can be put on the regression coefficients $$\boldsymbol{\beta}$$ for the random effects. To specify a normal prior on a random effect, name the vector beta_a or beta_b or beta_s. E.g., beta_prior=list(beta_a=c(0, 100)) puts a $$\mathcal{N}(0,100)$$ prior on $$a$$. When beta_prior is not specified, uniform priors are applied.

• matern_pc_prior: A named list with elements provided using the matern_pc_prior() function. Optionally penalized complexity priors can be put on the Matérn covariance hyperparameters. See ?matern_pc_prior() for more details.

The code below fits a GEV-GP model to the simulated data. The model assumes that all spatial random effects follow Gaussian processes with the exponential kernel: $k({\boldsymbol{x}}, {\boldsymbol{x}}') = \sigma^2\exp\left(-\frac{||{\boldsymbol{x}}-{\boldsymbol{x}}'||}{\ell}\right).$ No covariates are included in this model, so by default an intercept parameter $$\beta_0$$ is estimated for the GP of each spatial random effect.

Posterior sampling

To obtain posterior samples of $$a$$, $$b$$, and $$s$$, we pass the fitted model object fit to spatialGEV_sample(), which takes in three arguments:

• model: An object of class spatialGEVfit, which is the output of spatialGEV_fit().

• n_draw: Number of samples to draw from the posterior distribution $$p(a, b, s \mid {\boldsymbol{Y}}, {\boldsymbol{X}})$$.

• observation: If set to TRUE, the function will also draw from the posterior predictive distribution $$p(y^{\mathrm{rep}} \mid {\boldsymbol{Y}})$$ at the observed locations. This is useful for Bayesian model checking, which we will demonstrate shortly.

• loc_ind: A vector of location indices at which samples will be drawn. Default is all locations.

The following line of code draws 2000 samples from the posterior distribution and the posterior predictive distribution:

Then use summary() to view the summary statistics of the posterior samples.

pos_summary <- summary(sam)
pos_summary$param_summary[c(1:5, (n_loc+1):(n_loc+5), (2*n_loc+1):(2*n_loc+5), (3*n_loc):(nrow(pos_summary$param_summary))),]
#>                    2.5%        25%       50%       75%      97.5%       mean
#> a1          60.37275783 61.5980928 62.210164 62.835360 63.9804803 62.2141060
#> a2          60.75855130 61.7379378 62.328065 62.859586 63.9297423 62.3079065
#> a3          60.81983443 61.9089171 62.431818 62.940149 63.9610251 62.4305312
#> a4          60.79951615 61.8344749 62.345769 62.881036 63.8523161 62.3527635
#> a5          60.74117629 61.7295232 62.198760 62.745477 63.7537897 62.2326125
#> log_b1       2.93220334  2.9719966  2.991997  3.014360  3.0553563  2.9930003
#> log_b2       2.93448172  2.9723931  2.992898  3.011996  3.0494946  2.9924037
#> log_b3       2.94528959  2.9782006  2.996280  3.014380  3.0515383  2.9964359
#> log_b4       2.94725158  2.9830574  2.999908  3.017881  3.0510366  2.9998425
#> log_b5       2.95743662  2.9905772  3.008068  3.025838  3.0566551  3.0079155
#> s1          -3.86044147 -3.2227490 -2.921353 -2.609795 -2.0274468 -2.9200414
#> s2          -3.80428531 -3.2219404 -2.931187 -2.610808 -2.0574936 -2.9274155
#> s3          -3.76864653 -3.2296526 -2.950027 -2.664731 -2.1404922 -2.9526005
#> s4          -3.71662682 -3.1967825 -2.940971 -2.666217 -2.1067600 -2.9334609
#> s5          -3.67186841 -3.1723053 -2.923136 -2.676004 -2.2098599 -2.9288787
#> s400        -3.45638367 -2.8673479 -2.571656 -2.296527 -1.7403831 -2.5842184
#> beta_a      57.52397467 58.9584185 59.670918 60.446429 61.9572194 59.7032630
#> beta_b       2.90031130  2.9653742  3.001447  3.035666  3.1069824  3.0004916
#> beta_s      -3.19469003 -2.7156268 -2.459331 -2.212726 -1.7199864 -2.4620612
#> log_sigma_a -0.48482824  0.4901036  1.081240  1.610590  2.6947695  1.0606942
#> log_ell_a    0.00258028  1.3207450  2.001316  2.668811  3.9966672  1.9918314
#> log_sigma_b -7.24392340 -6.0568859 -5.344071 -4.671646 -3.4885019 -5.3640991
#> log_ell_b   -0.22138081  1.4165251  2.310469  3.212735  4.8795805  2.3244647
#> log_sigma_s -1.83110995 -1.0777626 -0.689746 -0.312738  0.3793501 -0.7062119
#> log_ell_s    0.08662430  0.9420228  1.363870  1.809115  2.6334788  1.3711216

Data preprocessing

We first grid the data using cells of length $$0.5^{\circ}$$. By doing this, weather stations that are apart by less than $$0.5^{\circ}$$ in longitude/latitude are grouped together in the same grid cell. From now on, we refer to each grid cell as a location.

For each location, we find the maximum snowfall amount each year and only keep locations where there are at least two years of records.

Now we fit the GEV-GP model to the data using the exponential kernel function. Both $$a$$ and $$b$$ are treated as spatial random effects. $$s$$ is constrained to be a positive constant. Note that here we have specified a $$\mathcal{N}(-5,5)$$ prior on the log-transformed shape parameter. This is because we found that the shape parameter is estimated close to 0 and such a prior ensures model fitting procedure is numerically stable.

Next, 5000 samples are drawn from the joint posterior distribution of all parameters.

From the samples we can calculate the 5-year, 10-year, and 25-year return levels at all locations, which are the upper $$1/5$$%, $$1/10$$%, and $$1/25$$% quantiles of the extreme value distributions at these locations.

Plotted below are the return levels.

References

Chen, Meixi, Reza Ramezan, and Martin Lysy. 2021. “Fast Approximate Inference for Spatial Extreme Value Models.” arXiv. https://arxiv.org/abs/2110.07051.

Lindgren, F. K., H. Rue, and J. Lindström. 2011. “An Explicit Link Between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach.” Journal of the Royal Statistical Society, Series B 73: 423–98.