Spatial sampling with SamplingStrata

Marco Ballin, Giulio Barcaroli

2022-11-14

Optimization with the spatial method

Let us suppose we want to design a sample survey with \(k\) \(Z\) target variables, each one of them correlated to one or more of the available \(Y\) frame variables.

When frame units are georeferenced or geocoded, the presence of spatial auto-correlation can be investigated. This can be done by executing for instance the Moran test on the target variables: if the null hypothesis is rejected (i.e. the hypothesis of the presence of spatial auto-correlation is accepted) then we should take into account also this variance component.

As indicated by deGruijter et al. (2016) and deGruijter, Wheeler, and Malone (2019), in case \(Z\) is the target variable, omitting as negligible the fpc factor, the sampling variance of its estimated mean is:

\[\begin{equation} \label{eq1} V(\hat{\bar{Z}}) = \sum_{h=1}^{H}(N_{h}/N)^{2} S_{h}^{2}/n_{h} \end{equation}\]

We can write the variance in each stratum \(h\) as:

\[\begin{equation} \label{eq2} S_{h}^{2} = \dfrac{1}{N_{h}^{2}} \sum_{i=1}^{N_{h-1}}\sum_{j=i+1}^{N_{h}}(z_{i}-z_{j})^{2} \end{equation}\]

The optimal determination of strata is obtained by minimizing the quantity \(O\):

\[\begin{equation} \label{eq3} O = \sum_{h=1}^{H} \dfrac{1}{N_{h}^{2}} \{ \sum_{i=1}^{N_{h-1}} \sum_{j=i+1}^{N_{h}} (z_{i}-z_{j})^{2}\}^{1/2} \end{equation}\]

Obviously, values \(z\) are not known, but only their predictions, obtained by means of a regression model. So, in Equation we can substitute \((z_{i}-z_{j})^{2}\) with

\[\begin{equation} \label{eq5} D_{ij}^{2} = \dfrac{(\tilde{z}_{i}-\tilde{z}_{j})^{2}}{R^{2}} + V(e_{i}) + V(e_{j}) - 2Cov(e_{i},e_{j}) \end{equation}\]

where \(R^{2}\) is the squared correlation coefficient indicating the fitting of the regression model, and \(V(e_{i})\), \(V(e_{j})\) are the model variances of the residuals. The spatial auto-correlation component is contained in the term \(Cov(e_{i},e_{j})\).

In particular, the quantity \(D_{ij}\) is calculated in this way:

\[\begin{equation} \label{eq6} D_{ij}^{2} = \dfrac{(\tilde{z}_{i}-\tilde{z}_{j})^{2}}{R^{2}} + (s_{i}^{2} + s_{j}^{2}) - 2 s_{i} s_{j} e^{-k (d_{ij}/range)} \end{equation}\]

where \(d_{ij}\) is the Euclidean distance between two units i and j in the frame (calculated using their geographical coordinates, that must be expressed in meters), the \(s_{i}\) and \(s_{j}\) are estimates of the prediction errors in the single points and range is the maximum distance below which spatial auto-correlation can be observed among points. The value of range can be determined by an analysis of the spatial variogram.

To summarize, when frame units can be geo-referenced, the proposed procedure is the following:

Example

In order to illustrate the “spatial” method, we make use of a dataset generally employed as an example of spatially correlated phenomena (in this case, the concentration of four heavy metals in a portion of the river Meuse). This dataset comes with the library “sp”:

We analyse the territorial distribution of the lead and zinc concentration, and model (by using the universal kriging) their relations with distance from the river, using the subset of 155 points on which these values have been jointly observed:

library(automap)
kriging_lead = autoKrige(log(lead) ~ dist, meuse, meuse.grid)
plot(kriging_lead,sp.layout = NULL, justPosition = TRUE)

kriging_zinc = autoKrige(log(zinc) ~ dist, meuse, meuse.grid)
plot(kriging_zinc, sp.layout = list(pts = list("sp.points", meuse)))

It is possible to calculate the fitting of the two models in this way:

r2_lead <- 1 - kriging_lead$sserr/sum((log(meuse$lead)-mean(log(meuse$lead)))^2)
r2_lead
## [1] 0.9999997
r2_zinc <- 1 - kriging_zinc$sserr/sum((log(meuse$zinc)-mean(log(meuse$zinc)))^2)
r2_zinc
## [1] 0.9999999

Using these kriging models, we are able to predict the values of lead and zinc concentration on the totality of the 3,103 points in the Meuse territory:

df <- NULL
df$id <- meuse.grid$id
df$lead.pred <- kriging_lead$krige_output@data$var1.pred
df$lead.var <- kriging_lead$krige_output@data$var1.var
df$zinc.pred <- kriging_zinc$krige_output@data$var1.pred
df$zinc.var <- kriging_zinc$krige_output@data$var1.var
df$lon <- meuse.grid$x
df$lat <- meuse.grid$y
df$dom1 <- 1
df <- as.data.frame(df)
head(df)
##   id lead.pred  lead.var zinc.pred  zinc.var    lon    lat dom1
## 1  1  5.509360 0.1954937  6.736502 0.2007150 181180 333740    1
## 2  2  5.546006 0.1716895  6.785460 0.1749260 181140 333700    1
## 3  3  5.488913 0.1784052  6.698883 0.1826314 181180 333700    1
## 4  4  5.388320 0.1855561  6.558216 0.1906426 181220 333700    1
## 5  5  5.584415 0.1463018  6.841612 0.1465346 181100 333660    1
## 6  6  5.525538 0.1533757  6.749216 0.1549663 181140 333660    1

The aim is now to produce the optimal stratification of the 3,103 points under a precision constraint of 1% on the target estimates of the mean lead and zinc concentrations:

library(SamplingStrata)
frame <- buildFrameSpatial(df=df,
                      id="id",
                      X=c("lead.pred","zinc.pred"),
                      Y=c("lead.pred","zinc.pred"),
                      variance=c("lead.var","zinc.var"),
                      lon="lon",
                      lat="lat",
                      domainvalue = "dom1")
cv <- as.data.frame(list(DOM=rep("DOM1",1),
                         CV1=rep(0.01,1),
                         CV2=rep(0.01,1),
                         domainvalue=c(1:1) ))
cv
#    DOM  CV1  CV2 domainvalue
# 1 DOM1 0.01 0.01           1

To this aim, we carry out the optimization step by indicating the method spatial:

set.seed(1234)
solution <- optimStrata (
  method = "spatial",
  errors=cv, 
  framesamp=frame,
  iter = 15,
  pops = 10,
  nStrata = 5,
  fitting = c(r2_lead,r2_zinc),
  range = c(kriging_lead$var_model$range[2],kriging_zinc$var_model$range[2]),
  kappa=1,
  writeFiles = FALSE,
  showPlot = FALSE,
  parallel = FALSE)
framenew <- solution$framenew
outstrata <- solution$aggr_strata
## 
## Input data have been checked and are compliant with requirements
## Sequential optimization as parallel = FALSE, defaulting number of cores = 1
##  *** Domain :  1   1
##  Number of strata :  3103
##  *** Sample cost:  61.92901
##  *** Number of strata:  4

obtaining the following optimized strata:

plotStrata2d(framenew,outstrata,domain=1,vars=c("X1","X2"),
             labels=c("Lead","Zinc"))

that can be visualised in this way:

frameres <- SpatialPointsDataFrame(data=framenew, coords=cbind(framenew$LON,framenew$LAT) )
frameres2 <- SpatialPixelsDataFrame(points=frameres[c("LON","LAT")], data=framenew)
frameres2$LABEL <- as.factor(frameres2$LABEL)
spplot(frameres2,c("LABEL"), col.regions=bpy.colors(5))

We can now proceed with the selection of the sample, using the function ‘selectSampleSpatial’ (which is a wrapper of function ‘lpm2_kdtree’ in package ‘SamplingBigData,’ see Lisic and Grafström (2018)), in order to ensure a spatially balanced sample:

s <- selectSampleSpatial(framenew,outstrata,coord_names=c("LON","LAT"))
## 
## *** Sample has been drawn successfully ***
##  62  units have been selected from  4  strata

whose units are so distributed in the territory:

s <- selectSampleSpatial(framenew,outstrata,coord_names = c("LON","LAT"))
coordinates(s) <- ~LON+LAT
proj4string(s) <- CRS("+init=epsg:28992")
s$LABEL <- as.factor(s$LABEL)
library(mapview)
mapview(s,zcol="LABEL", map.types = c("OpenStreetMap"))

References

deGruijter, J. J., A. B. McBratney, B. Minasny, I. Wheeler, B. P. Malone, and U. Stockmann. 2016. “Farm-Scale Soil Carbon Auditing.” Geoderma, no. 265: 12–130.
deGruijter, J. J., I. Wheeler, and B. P. Malone. 2019. “Using Model Predictions of Soil Carbon in Farm-Scale Auditing - a Sofwtare Tool.” Agricultural Systems, no. 169(C): 24–30.
Lisic, J., and A. Grafström. 2018. SamplingBigData: Sampling Methods for Big Data. https://cran.r-project.org/package=SamplingBigData.