--- title: "Spatial sampling with SamplingStrata" author: "Marco Ballin, Giulio Barcaroli" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: SamplingStrata.bib vignette: > %\VignetteIndexEntry{Spatial sampling with SamplingStrata} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} options(width = 999) knitr::opts_chunk$set(echo = TRUE, fig.width=6, fig.height=4, out.width = 6, out.heigth = 4) ``` ```{r load, eval=TRUE, echo=FALSE, include=FALSE} load("spatial_vignette.RData") ``` # Optimization with the *spatial* method Let us suppose we want to design a sample survey with $k$ $Z$ target variables, each one of them correlated to one or more of the available $Y$ frame variables. When frame units are georeferenced or geocoded, the presence of spatial auto-correlation can be investigated. This can be done by executing for instance the Moran test on the target variables: if the null hypothesis is rejected (i.e. the hypothesis of the presence of spatial auto-correlation is accepted) then we should take into account also this variance component. As indicated by @degruijter2016 and @degruijter2019, in case $Z$ is the target variable, omitting as negligible the *fpc* factor, the sampling variance of its estimated mean is: \begin{equation} \label{eq1} V(\hat{\bar{Z}}) = \sum_{h=1}^{H}(N_{h}/N)^{2} S_{h}^{2}/n_{h} \end{equation} We can write the variance in each stratum $h$ as: \begin{equation} \label{eq2} S_{h}^{2} = \dfrac{1}{N_{h}^{2}} \sum_{i=1}^{N_{h-1}}\sum_{j=i+1}^{N_{h}}(z_{i}-z_{j})^{2} \end{equation} The optimal determination of strata is obtained by minimizing the quantity $O$: \begin{equation} \label{eq3} O = \sum_{h=1}^{H} \dfrac{1}{N_{h}^{2}} \{ \sum_{i=1}^{N_{h-1}} \sum_{j=i+1}^{N_{h}} (z_{i}-z_{j})^{2}\}^{1/2} \end{equation} Obviously, values $z$ are not known, but only their predictions, obtained by means of a regression model. So, in Equation \ref{eq3} we can substitute $(z_{i}-z_{j})^{2}$ with \begin{equation} \label{eq5} D_{ij}^{2} = \dfrac{(\tilde{z}_{i}-\tilde{z}_{j})^{2}}{R^{2}} + V(e_{i}) + V(e_{j}) - 2Cov(e_{i},e_{j}) \end{equation} where $R^{2}$ is the squared correlation coefficient indicating the fitting of the regression model, and $V(e_{i})$, $V(e_{j})$ are the model variances of the residuals. The spatial auto-correlation component is contained in the term $Cov(e_{i},e_{j})$. In particular, the quantity $D_{ij}$ is calculated in this way: \begin{equation} \label{eq6} D_{ij}^{2} = \dfrac{(\tilde{z}_{i}-\tilde{z}_{j})^{2}}{R^{2}} + (s_{i}^{2} + s_{j}^{2}) - 2 s_{i} s_{j} e^{-k (d_{ij}/range)} \end{equation} where $d_{ij}$ is the Euclidean distance between two units i and j in the frame (calculated using their geographical coordinates, that must be expressed in meters), the $s_{i}$ and $s_{j}$ are estimates of the prediction errors in the single points and *range* is the maximum distance below which spatial auto-correlation can be observed among points. The value of *range* can be determined by an analysis of the spatial *variogram*. To summarize, when frame units can be geo-referenced, the proposed procedure is the following: * acquire coordinates of the geographic location of the units in the population of interest; * fit a *kriging* model (or any other spatial model) on data for each $Z$; * obtain predicted values together with prediction errors for each $Z$ and associate them to each unit in the frame; * perform the optimization step. # Example In order to illustrate the "*spatial*" method, we make use of a dataset generally employed as an example of spatially correlated phenomena (in this case, the concentration of four heavy metals in a portion of the river Meuse). This dataset comes with the library "*sp*": ```{r, eval = FALSE,echo=FALSE, message=FALSE} library(sp) # locations (155 observed points) data("meuse") # grid of points (3103) data("meuse.grid") meuse.grid$id <- c(1:nrow(meuse.grid)) coordinates(meuse)<-c("x","y") coordinates(meuse.grid)<-c("x","y") ``` ```{r, eval = FALSE,echo=FALSE,include=FALSE} par(mfrow=c(1,2)) plot(meuse.grid) title("Meuse territory (3103 points)",sub="with reported distance from the river",cex.main=0.8,cex.sub=0.8) plot(meuse) title("Subset of 155 points",sub="with also observed metals concentration",cex.main=0.8,cex.sub=0.8) par(mfrow=c(1,1)) ``` ![](spatial_figures/unnamed-chunk-2-1.png) We analyse the territorial distribution of the *lead* and *zinc* concentration, and model (by using the *universal kriging*) their relations with distance from the river, using the subset of 155 points on which these values have been jointly observed: ```{r, eval = F,echo=TRUE,message=FALSE,fig.width=6,fig.height=8} library(automap) kriging_lead = autoKrige(log(lead) ~ dist, meuse, meuse.grid) plot(kriging_lead,sp.layout = NULL, justPosition = TRUE) ``` ![](spatial_figures/unnamed-chunk-3-1.png) ```{r, eval = F,echo=TRUE,message=FALSE,fig.width=6,fig.height=8} kriging_zinc = autoKrige(log(zinc) ~ dist, meuse, meuse.grid) plot(kriging_zinc, sp.layout = list(pts = list("sp.points", meuse))) ``` ![](spatial_figures/unnamed-chunk-3-2.png) It is possible to calculate the fitting of the two models in this way: ```{r, eval = F,echo=TRUE,message=FALSE,fig.width=6,fig.height=8} r2_lead <- 1 - kriging_lead$sserr/sum((log(meuse$lead)-mean(log(meuse$lead)))^2) r2_lead ## [1] 0.9999997 r2_zinc <- 1 - kriging_zinc$sserr/sum((log(meuse$zinc)-mean(log(meuse$zinc)))^2) r2_zinc ## [1] 0.9999999 ``` Using these *kriging* models, we are able to predict the values of lead and zinc concentration on the totality of the 3,103 points in the Meuse territory: ```{r, eval = F,echo=TRUE} df <- NULL df$id <- meuse.grid$id df$lead.pred <- kriging_lead$krige_output@data$var1.pred df$lead.var <- kriging_lead$krige_output@data$var1.var df$zinc.pred <- kriging_zinc$krige_output@data$var1.pred df$zinc.var <- kriging_zinc$krige_output@data$var1.var df$lon <- meuse.grid$x df$lat <- meuse.grid$y df$dom1 <- 1 df <- as.data.frame(df) head(df) ## id lead.pred lead.var zinc.pred zinc.var lon lat dom1 ## 1 1 5.509360 0.1954937 6.736502 0.2007150 181180 333740 1 ## 2 2 5.546006 0.1716895 6.785460 0.1749260 181140 333700 1 ## 3 3 5.488913 0.1784052 6.698883 0.1826314 181180 333700 1 ## 4 4 5.388320 0.1855561 6.558216 0.1906426 181220 333700 1 ## 5 5 5.584415 0.1463018 6.841612 0.1465346 181100 333660 1 ## 6 6 5.525538 0.1533757 6.749216 0.1549663 181140 333660 1 ``` The aim is now to produce the optimal stratification of the 3,103 points under a precision constraint of 1% on the target estimates of the mean *lead* and *zinc* concentrations: ```{r, eval = F,echo=TRUE, message=FALSE, warning=FALSE} library(SamplingStrata) frame <- buildFrameSpatial(df=df, id="id", X=c("lead.pred","zinc.pred"), Y=c("lead.pred","zinc.pred"), variance=c("lead.var","zinc.var"), lon="lon", lat="lat", domainvalue = "dom1") cv <- as.data.frame(list(DOM=rep("DOM1",1), CV1=rep(0.01,1), CV2=rep(0.01,1), domainvalue=c(1:1) )) cv # DOM CV1 CV2 domainvalue # 1 DOM1 0.01 0.01 1 ``` To this aim, we carry out the optimization step by indicating the method *spatial*: ```{r, eval = F,echo=TRUE, message=FALSE, warning=FALSE} set.seed(1234) solution <- optimStrata ( method = "spatial", errors=cv, framesamp=frame, iter = 15, pops = 10, nStrata = 5, fitting = c(r2_lead,r2_zinc), range = c(kriging_lead$var_model$range[2],kriging_zinc$var_model$range[2]), kappa=1, writeFiles = FALSE, showPlot = FALSE, parallel = FALSE) framenew <- solution$framenew outstrata <- solution$aggr_strata ## ## Input data have been checked and are compliant with requirements ## Sequential optimization as parallel = FALSE, defaulting number of cores = 1 ## *** Domain : 1 1 ## Number of strata : 3103 ## *** Sample cost: 61.92901 ## *** Number of strata: 4 ``` ![](spatial_figures/unnamed-chunk-7-1.png) obtaining the following optimized strata: ```{r, eval = F,echo=TRUE} plotStrata2d(framenew,outstrata,domain=1,vars=c("X1","X2"), labels=c("Lead","Zinc")) ``` ![](spatial_figures/unnamed-chunk-8-1.png) that can be visualised in this way: ```{r, eval = F,echo=TRUE} frameres <- SpatialPointsDataFrame(data=framenew, coords=cbind(framenew$LON,framenew$LAT) ) frameres2 <- SpatialPixelsDataFrame(points=frameres[c("LON","LAT")], data=framenew) frameres2$LABEL <- as.factor(frameres2$LABEL) spplot(frameres2,c("LABEL"), col.regions=bpy.colors(5)) ``` ![](spatial_figures/unnamed-chunk-9-1.png) We can now proceed with the selection of the sample, using the function 'selectSampleSpatial' (which is a wrapper of function 'lpm2_kdtree' in package 'SamplingBigData', see @lisic2018b), in order to ensure a spatially balanced sample: ```{r, eval = F,echo=TRUE} s <- selectSampleSpatial(framenew,outstrata,coord_names=c("LON","LAT")) ## ## *** Sample has been drawn successfully *** ## 62 units have been selected from 4 strata ``` whose units are so distributed in the territory: ```{r, eval = F,echo=TRUE,message=FALSE,fig.width=6, fig.height=8} s <- selectSampleSpatial(framenew,outstrata,coord_names = c("LON","LAT")) coordinates(s) <- ~LON+LAT proj4string(s) <- CRS("+init=epsg:28992") s$LABEL <- as.factor(s$LABEL) library(mapview) mapview(s,zcol="LABEL", map.types = c("OpenStreetMap")) ``` ![](spatial_figures/unnamed-chunk-10-1.jpeg) # References