%Sweave("DClusterm.Rnw", encoding = "UTF-8", keep.source = FALSE); system("pdflatex DClusterm.tex"); system("bibtex DClusterm"); system("pdflatex DClusterm.tex"); system("pdflatex DClusterm.tex") %system("pdflatex DClusterm.tex") %system("bibtex DClusterm") <>= #JSS style options(prompt = "R> ", continue = "+", width = 70, useFancyQuotes = FALSE) @ \documentclass[article, nojss]{jss} \usepackage{thumbpdf} \usepackage{amssymb} %% need no \usepackage{Sweave.sty} %\usepackage{shmwlabels} %\VignetteIndexEntry{Model-based detection of disease clusters} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% almost as usual \author{Virgilio G\'omez-Rubio\\Universidad de Castilla-La Mancha \And Paula Moraga\\Lancaster University \AND John Molitor\\Oregon State University \And Barry Rowlingson\\Lancaster University } %\title{Extending the \pkg{R-INLA} Package for Spatial Statistics} %\title{Some Spatial Statistical Extensions to \pkg{R-INLA}} \title{\pkg{DClusterm}: Model-based Detection of Disease Clusters} %% for pretty printing and a nice hypersummary also set: \Plainauthor{Virgilio G\'omez-Rubio et al.}%, Paula Moraga} \Plaintitle{DClusterm: Model-based detection of disease clusters} %\Shorttitle{} %% a short title (if necessary) %% an abstract and keywords \Abstract{ The detection of regions with unusually high risk plays an important role in disease mapping and the analysis of public health data. In particular, the detection of groups of areas (i.e., clusters) where the risk is significantly high is often conducted by Public Health authorities. Many methods have been proposed for the detection of these disease clusters, most of them based on moving windows, such as Kulldorff's spatial scan statistic. Here we describe a model-based approach for the detection of disease clusters implemented in the \pkg{DClusterm} package. Our model-based approach is based on representing a large number of possible clusters by dummy variables and then fitting many generalized linear models to the data where these covariates are included one at a time. Cluster detection is done by performing a variable or model selection among all fitted models using different criteria. Because of our model-based approach, cluster detection can be performed using different types of likelihoods and latent effects. We cover the detection of spatial and spatio-temporal clusters, as well as how to account for covariates, zero-inflated datasets and overdispersion in the data. } \Keywords{disease cluster, spatial statistics, \proglang{R}} \Plainkeywords{disease cluster, spatial statistics, R} \Address{ Virgilio G\'omez-Rubio\\ Department of Mathematics\\ School of Industrial Engineering\\ University of Castilla-La Mancha\\ 02071 Albacete, Spain\\ E-mail: \email{virgilio.gomez@uclm.es}\\ URL: \url{https://becarioprecario.github.io}\\ \\ Paula Moraga\\ Centre for Health Informatics, Computing and Statistics (CHICAS)\\ Lancaster Medical School\\ Lancaster University\\ Lancaster, LA1 4YW\\ United Kingdom\\ E-mail: \email{p.e.moraga-serrano@lancaster.ac.uk}\\ URL: \url{https://paula-moraga.github.io/}\\ \\ John Molitor\\ College of Public Health and Human Sciences\\ Oregon State University\\ Corvallis, Oregon 97331, United States\\ E-mail: \email{John.Molitor@oregonstate.edu}\\ URL: \url{http://health.oregonstate.edu/people/molitor-john}\\ \\ Barry Rowlingson\\ Centre for Health Informatics, Computing and Statistics (CHICAS)\\ Lancaster Medical School\\ Lancaster University\\ Lancaster, LA1 4YW\\ United Kingdom\\ E-mail: \email{b.rowlingson@lancaster.ac.uk}\\ URL: \url{http://www.lancaster.ac.uk/fhm/about-us/people/barry-rowlingson}\\ } \begin{document} \maketitle \section{Introduction} The analysis of epidemiological data at small area level often involves accounting for possible risk factors and other important covariates using different types of regression models. However, it is not uncommon that after a number of covariates have been accounted for, residuals still show a spatial distribution that defines some groups of areas with unusual high epidemiological risk. Hence, in many occasions it is not clear whether all spatial risk factors have been included in our model. Public health data are often aggregated over small administrative areas due to issues of confidentiality. However, individual data are often available as well, and generalized linear models \citep[GLM,][]{McCullaghNelder:1989} is a common framework used in disease mapping for modeling aggregated and individual-level information. GLMs not only model Poisson or binomial responses, but they can also link the outcome to a linear predictor on the covariates (and, possibly, other effects). However, until recently, it was not clear how to use GLMs to detect clusters of disease, a group of contiguous areas with significant high risk. In order to detect disease clusters, the most widely used method is probably the spatial scan statistic (SSS) proposed by \citet{Kulldorff:1997}. Given a possible cluster $z$, the SSS will compare the relative risk in the cluster $\theta_z$ to the relative risk outside the cluster $\theta_{\overline{z}}$ using the following test: % \begin{equation} \begin{array}{cc} H_0: & \theta_z=\theta_{\overline{z}}\\ H_1: & \theta_z>\theta_{\overline{z}}\\ \end{array} \end{equation} % This test is performed via the use of a likelihood ratio statistic, where many different possible clusters are tested in turn by changing the areas in $z$ and the most likely cluster (i.e., the one with the highest value of the test statistic) is selected. Significance of this cluster is assessed with a Monte Carlo test that also provides the $p$-value. Several \proglang{R} packages include an implementation of the SSS. \pkg{DCluster} \citep{DCluster:2005} implements the orginal version of the SSS for spatial binomial and Poisson data and can compute cluster significancy by means of bootstrap or a Gumbel distribution. Package \pkg{SpatialEpi} \citep{SpatialEpi:2016} also implements the SSS for binomial or Poisson data and assesses significancy using a Monte Carlo test. \pkg{rsatscan} \citep{rsatscan:2015} provides a set of functions to create the necessary data files and run the \pkg{SaTScan} software (for Windows) from \proglang{R}. Hence, spatio-temporal detection of disease clusters can also be performed. Package \pkg{scanstatistics} \citep{scanstatistics:2018} implements a number of space-time scan statistics, that includes spatio-temporal versions of the SSS. Similarly as the original SSS, these implementations lack of a general way of adjusting for covariates and assessing the significancy of non-primary clusters. Other packages which can be used for the detection of spatial clusters are described now. The \pkg{smerc} package \citep{smerc:2015} implements different scan statistics, including some for the detection of non-circular clusters. Package \pkg{AMOEBA} \citep{AMOEBA:2014} includes a function to detect spatial clusters based on the Getis-Ord statistic. Bayesian Dirichlet processes for spatial clustering, as developed in \citet{Molitoretal:2010}, are implemented in the \pkg{PReMiuM} package \citep{PReMiuM:2015}. A version of the spatial scan statistic adapted to detect clusters in social networks is available in package \pkg{SNscan} \citep{SNscan:2016}. Package \pkg{SpatialEpiApp} \citep{Moraga:2017} includes a \pkg{shiny} application for the analysis of spatial and spatio-temporal disease mapping that includes cluster detection using the SSS. Methods for the detection of spatial clusters with case-control point data are available in package \pkg{smacpod} \citep{smacpod:2018}. The \pkg{surveillance} package \citep{Salmonetal:2016,Meyeretal:2017} also implements several methods for the detection of spatial and spatio-temporal clustering. Finally, package \pkg{ClustGeo} \citep{ClustGeo:2017} implements hierarchical clustering with spatial constraints that can be used to partition regions into different groups of contiguous areas. In this paper we describe a novel approach to disease cluster detection that provides a link between GLMs and SSS. We have implemented this approach via the free and open source \pkg{DClusterm} package \citep{DClusterm} for the \proglang{R} statistical software \citep{R:2018}. This approach involves fitting many different GLMs for which dummy variables that represent possible clusters are included one at a time. Cluster detection is based on selecting a number of dummy cluster variables using variable selection methods. This paper is organized as follows. Section \ref{sec:GLM} will introduce the link between GLM and SSS. Next, in Section \ref{sec:mixed} we show how to include random effects in the detection of disease clusters to account for over-dispersion. The detection of disease clusters for zero-inflated data is discussed in Section \ref{sec:zeroinfl}. Section \ref{sec:spacetime} describes how to extend these ideas to detect clusters in space and time. Finally, a discussion and some final remarks are provided in Section \ref{sec:disc}. \section{Generalized linear models for cluster detection} \label{sec:GLM} \subsection{General description} \label{subsec:GLMdesc} An explicit link between GLMs and the SSS is provided by \citet{Jung:2009} and \citet{ZhangLinCSDA:2009} who show that the test statistic for a given cluster is equivalent to fitting a GLM using a cluster variable as predictor. This cluster variable is a dummy variable which is 1 for the areas in the cluster and 0 for the areas outside the cluster. By including cluster covariates in the model we obtain an estimate of the increased risk (as measured by its associated coefficient) and its significance (by means of the associated $p$-value, for example). We demonstrate our aproach by first considering a Poisson model with expected counts $E_i$ and observed cases $O_i$ modeled as: % \begin{eqnarray} O_i \sim Po(E_i \theta_i) \nonumber\\ \log(\mu_i) = \log(E_i) + \alpha + \beta x_i \end{eqnarray} % Here $\mu_i$ is the mean of the Poisson distribution (which is equal to $E_i \theta_i$), $\log(E_i)$ denotes an offset to account for the population ``exposure'' (age, gender, etc.), $x_i$ represents a covariate of the outcome of interest and $\theta_i$ is the relative risk and it measures any deviation (increase or decrease) in the incidence of the disease from the expected number of cases. A simple and raw estimate of the relative risk $\theta_i$ that does not require covariates is the standardized mortality ratio (SMR), which is defined as $O_i / E_i$ \citep{WallerGotway:2004}. After fitting this model, we obtain estimates $\hat\alpha$ and $\hat\beta$ which account for the effects of non-cluster covariates in the model. We include the cluster covariates as follows: % \begin{equation} \log(\mu_i) = \log(E_i) + \hat\alpha + \hat\beta x_i + \gamma_j c_i^{(j)} \end{equation} % The overall intercept is now $\log(E_i) + \hat\alpha + \hat\beta x_i$ and $c_i^{(j)}$ denotes a dummy variable associated with cluster $j$, with $j$ taking values from 1 to the number of clusters being tests, defined as % \begin{equation} c^{(j)}_i= \left\{ \begin{array}{cl} 1 & \textrm{if area } i \textrm{ belongs to cluster } j\\ 0 & \textrm{otherwise} \end{array} \right. \end{equation} % Here, $\gamma_j$ is a measure of the risk within cluster $j$. We are only interested in clusters whose coefficient is significantly higher than 0 (i.e., increased risk). Hence those with a significant negative coefficient will be ignored. We use model selection techniques to select the most important cluster in the study region. As such, the log-likelihood can be used to compare the model with the cluster variables to the null model (i.e., the one with no cluster covariates at~all). It is possible to perform cluster detection without considering non-cluster based covariates in the model. Here, cluster detection accounting for the non-cluster based covariates will likely provide a different number of clusters than models fit without these variables. By comparing the clusters detected in both models (with and without non-cluster based covariates), we will be able to find which clusters are linked to underlying risk factors included in the model and which clusters remain unexplained by these other variables. In the examples that we include in this paper, we will consider both scenarios to better understand how cluster detection works. Similar approaches to detection of disease clusters have been proposed by \citet{BilanciaDemarinis:2014,GomezRubioetal:2018} who describe a similar approach to the detection of disease clusters using Bayesian hierarchical models. The Integrated Nested Laplace Approximation \citep[INLA,][]{Rueetal:2009} is used in both cases for model fitting as it provides computational benefits over other computationally expensive methods, such as Markov Chain Monte Carlo. %NY8 Example %\input{NY.tex} \subsection{Leukemia in upstate New York} \label{subsec:ex:NY} We first consider an example where we model counts of leukemia cases in upstate New York. These data are provided in the \code{NY8} dataset, available in package \pkg{DClusterm}. It provides cases of leukemia in different census tracts in upstate New York. This data set has been analyzed by several authors \citep{Walletetal:1992,WallerGotway:2004}. The location of leukemia is thought to be linked to the use of Trichloroethene (TCE) by several companies in the area. Figure~\ref{fig:NYmap} (using color palettes from \pkg{RColorBrewer}, \citealp{RColorBrewer:2014}) shows the standardized mortality ratios of the census tracts and the locations of the industries using TCE. In order to measure exposure, the inverse of the distance to the nearest TCE site has been used (variable \code{PEXPOSURE} in the dataset). In addition, two other socioeconomic covariates have been used: the proportion of people aged 65 or more (variable \code{PCTAGE65P}) and the proportion of people who own their home (variable \code{PCTOWNHOME}). Hence, our first action is to load the \pkg{DClusterm} package and the \code{NY8} dataset: %some required packages, such as \pkg{xts} \citep{xts:2014}, and %the dataset itself. <>= library("DClusterm") data("NY8") @ A number of cases could not be linked to their actual location and they were distributed uniformly over the study area, making the counts real numbers instead of integers \citep[see,][for details]{WallerGotway:2004}. We have rounded these values as we intend to use a Poisson likelihood for the analysis. Furthermore, expected counts are computed using the overall incidence ratio (i.e., total number of cases divided by the total population). Age-sex standardization is not possible in this case as this information is not available in our dataset. <>= NY8$Observed <- round(NY8$Cases) NY8$Expected <- NY8$POP8 * sum(NY8$Observed) / sum(NY8$POP8) NY8$SMR <- NY8$Observed / NY8$Expected @ The centres of the areas will be stored in columns named \code{x} and \code{y}. This will be used later when ordering the areas by increasing distance to the putative cluster centre. If the location of the main populated cities are available these could be used but here we will use function \code{coordinates()} instead: <>= NY8$x <- coordinates(NY8)[, 1] NY8$y <- coordinates(NY8)[, 2] @ As \pkg{DClusterm} is designed to detect clusters in space and time, it will always expect data to be from one of the classes in packages \pkg{sp}, for purely spatial data, or \pkg{spacetime} \citep{Pebesma:2012}, for spatio-temporal data. <>= NY8st <- STFDF(as(NY8, "SpatialPolygons"), xts(1, as.Date("1972-01-01")), NY8@data, endTime = as.POSIXct(strptime(c("1972-01-01"), "%Y-%m-%d"), tz = "GMT")) @ %Cite package \nocite{RColorBrewer:2014} \begin{figure}[h] \centering <>= library("RColorBrewer") #Save plots p1 <- spplot(NY8, "SMR", cuts = 8, col.regions = brewer.pal(9, "Oranges"), main = "Standardized mortality ratio", sp.layout = list(list("sp.points", TCE, col = "red"))) p2 <- spplot(NY8, "PCTOWNHOME", cuts = 8, col.regions = brewer.pal(9, "Blues"), main = "PCTOWNHOME") p3 <- spplot(NY8, "PCTAGE65P", cuts = 8, col.regions = brewer.pal(9, "Greens"), main = "PCTAGE65P") p4 <- spplot(NY8, "PEXPOSURE", cuts = 8, col.regions = brewer.pal(9, "Reds"), main = "PEXPOSURE") #Display plots print(p1, position = c(0, 0.5, 0.5, 1), more = TRUE) print(p2, position = c(0.5, 0.5, 1, 1), more = TRUE) print(p3, position = c(0, 0, 0.5, 0.5), more = TRUE) print(p4, position = c(0.5, 0, 1, 0.5)) @ \caption{Standardized mortality ratios and covariates of the incidence of Leukemia in upstate New York dataset. \code{PEXPOSURE} is the inverse of the distance to the nearest TCE site, \code{PCTAGE65P} is the proportion people aged 65 or more, and \code{PCTOWNHOME} is the proportion of people who own their home. The red crosses in the top-left map mark the locations of the TCE sites.} \label{fig:NYmap} \end{figure} \subsection{Cluster detection} \subsubsection{Cluster detection with no covariates} First of all, a model with no covariates will be fitted and used as a baseline, so that other models can be compared to this one (for example, using the AIC or the log--likelihood) to assess whether they provide a better fit. <>= ny.m0 <- glm(Observed ~ offset(log(Expected)) + 1, family = "poisson", data = NY8) @ Function \code{DetectClustersModel()} will take this baseline model (using argument \code{model0}), create the cluster dummy variables and test them in turn. Then, those clusters with a highest significance will be reported. Argument \code{thegrid} will take a 2-column \code{data.frame} (with names \code{x} and {y}) with the centres of possible clusters. If the grid of cluster centres is not defined, then a rectangular grid is used with a distance between adjacent points defined by argument \code{step}. Dummy cluster variables are created around these points by adding areas to the cluster until a certain proportion of the population (defined by argument \code{fractpop}), with the column to be used to compute the population at risk in the cluster defined by parameter \code{ClusterSizeContribution}, or until a certain distance about the centre (defined by argument \code{radius}) has been reached. When testing for significant cluster variables, argument \code{alpha} defines the significance level. \code{DetectClustersModel()} can detect spatial and spatio-temporal clusters, which is why its first argument is a spatial or space-time object. The type of clusters that are investigated is defined by argument \code{typeCluster}. In the example we have used \code{typeCluster = "S"} in order to detect spatial clusters. For spatio-temporal clusters \code{typeCluster = "ST"} must be used, as described in Section~\ref{sec:spacetime}. The output of \code{DetectClustersModel()} contains all clusters found at significance level given by the parameter \code{alpha}. In the detection process, multiple tests are performed simultaneously which can increase the likelihood of incorrectly declaring a group of areas as a cluster. This situation is known as multiple testing problem and to avoid it several statistical methods have been developed. Here, we deal with this problem using the Bonferroni correction which uses a stricter significance level to declare a cluster \citep{Dunn:1958}. Specifically, a cluster is declared if its $p$-value is smaller than alpha divided by the number of clusters tested. In addition, overlapping clusters can be removed with function \code{slimkncluster()} (see below) so that only clusters with the greatest significancy are reported. This will also help to reduce the problem of multiple testing as only one possible cluster around a given cluster center is reported, similarly as the SSS. The output of \code{DetectClustersModel()} has a column called \code{alpha\_bonferroni} that denotes the level of significance adjusted for multiple testing using the Bonferroni correction. Thus, users can avoid the multiple testing problem by filtering the output and keeping only the clusters with $p$-value less than \code{alpha\_bonferroni}. Other options include the number of replicates for Monte Carlo tests (argument \code{R}) if cluster assessment is done by simulation. By default, Monte Carlo tests are not used. \pkg{DClusterm} allows for parallel computing using several cores as implemented in package \pkg{parallel}. The number of cores to use is defined in option \code{mc.cores} (now 2 cores are used): <>= options(mc.cores = 2) @ In the following example, to reduce the computational burden, we have only looked for clusters around 5 areas (whose rows in \code{NY8} are defined in variable \code{idxcl}). In a real application we advise the use of all locations (area centroids or actual locations of individual data). <>= idxcl <- c(120, 12, 89, 139, 146) ny.cl0 <- DetectClustersModel(NY8, thegrid = as.data.frame(NY8)[idxcl, c("x", "y")], fractpop = 0.15, alpha = 0.05, radius = Inf, step = NULL, typeCluster = "S", R = NULL, model0 = ny.m0, ClusterSizeContribution = "POP8") @ Below is a summary of the clusters detected. Dates are ignored as this is a purely spatial cluster analysis. In the case of spatio-temporal clusters, dates shown define the temporal range of the cluster. Values \code{x} and \code{y} defined the cluster centre, \code{size} is the number of areas included in the cluster, \code{statistic} is the value of the test statistic and \code{risk} represents the point estimate of the associated cluster coefficient. Also, note that only clusters with a lower \code{pvalue} than argument \code{alpha} are returned. \code{cluster} indicates whether the cluster is a significant one. Finally, note how detected clusters are ordered by increasing value of \code{pvalue}, so that the most significant clusters are reported first. <<>>= ny.cl0 @ Function \code{get.knclusters()} can be used to extract the indices of the areas in the different clusters detected. Similarly, a vector with the cluster to which area belongs can be obtained with function \code{get.allknclusters()}. The detected clusters plus a ``no cluster'' category are the levels of this categorical variable, which can be added to a spatial object as follows: <<>>= NY8$CLUSTER0 <- get.allknclusters(NY8, ny.cl0) @ The areas and centers of the clusters detected are shown in Figure~\ref{fig:NYcl}. Because of the lack of adjustment for covariates these clusters show regions of high risk based on the raw data (observed and expected counts) alone. %\begin{figure}[h!] %\centering <>= cl0.centres <- SpatialPoints(ny.cl0[, 1:2]) spplot(NY8, "CLUSTER0", col.regions = c("white", "gray"), col = "#4D4D4D", sp.layout = list(list("sp.points", cl0.centres, pch = 19, col = "black"))) @ %\caption{Clusters detected when no covariates are included in the model.} %\label{fig:NYcl0} %\end{figure} \subsubsection{Cluster detection after adjusting for covariates} Similarly, clusters can be detected after adjusting for significant risk factors. First of all, we will fit a Poisson regression with the 3 covariates mentioned earlier. As it can be seen, two of them are clearly significant and the third one has a $p$-value very close to $0.05$: <<>>= ny.m1 <- glm(Observed ~ offset(log(Expected)) + PCTOWNHOME + PCTAGE65P + PEXPOSURE, family = "poisson", data = NY8) summary(ny.m1) @ As we have three covariates in the model, the offset included in the model will be different now, which may produce that the detected clusters may be different in this case. Cluster detection is performed as in the previous example, but now we use the model that adjusts for covariates instead: <>= ny.cl1 <- DetectClustersModel(NY8, thegrid = as.data.frame(NY8)[idxcl, c("x", "y")], fractpop = 0.15, alpha = 0.05, typeCluster = "S", R = NULL, model0 = ny.m1, ClusterSizeContribution = "POP8") @ <<>>= ny.cl1 @ Similarly as in the previous example, we can use function \code{get.allkncluster()} to get the indices of the areas in the clusters and add a new variable to the \code{SpatialPolygonsDataFrame}: <<>>= NY8$CLUSTER1 <- get.allknclusters(NY8, ny.cl1) @ Figure~\ref{fig:NYcl} shows the clusters detected after adjusting for covariates (using \pkg{gridExtra}, \citealp{gridExtra:2016}, and \pkg{latticeExtra}, \citealp{latticeExtra:2016}). Compared to the example with no covariate adjustment, the most significant cluster has disappeared. Hence, that cluster has been explained by the effect of the covariates. Another cluster is a bit smaller in size, which means that covariates only explain a small part of it. In all remaining clusters, cluster significance has been reduced by the effect of the covariates. \begin{figure}[!h] \centering <>= library("gridExtra") library("latticeExtra") ny.map0 <- spplot(NY8, c("CLUSTER0"), col.regions = c("white", "gray"), col = "#4D4D4D", colorkey = FALSE) + layer(lpoints(ny.cl0$x, ny.cl0$y, pch = 19)) ny.map1 <- spplot(NY8, c("CLUSTER1"), col.regions = c("white", "gray"), col = "#4D4D4D", colorkey = FALSE) + layer(lpoints(ny.cl1$x, ny.cl1$y, pch = 19)) #Syracuse City syracuse <- which(NY8$AREANAME == "Syracuse city") ny.map0.s <- spplot(NY8[syracuse, ], c("CLUSTER0"), col.regions = c("white", "gray"), col = "#4D4D4D", colorkey = FALSE) + layer(lpoints(ny.cl0$x, ny.cl0$y, pch = 19)) ny.map1.s <- spplot(NY8[syracuse, ], c("CLUSTER1"), col.regions = c("white", "gray"), col = "#4D4D4D", colorkey = FALSE) + layer(lpoints(ny.cl1$x, ny.cl1$y, pch = 19)) grid.arrange(ny.map0, ny.map1, ny.map0.s, ny.map1.s, ncol = 2) @ \caption{Clusters detected with no covariate adjustment (left) and after adjusting for covariates (right) in the whole study region (top row) and Syracuse city (bottom row). Areas in clusters are shaded in gray and cluster centres are represented by blue dots.} \label{fig:NYcl} \end{figure} Finally, function \code{slimknclusters()} could have been used to remove overlapping clusters with less significant coefficients. This procedure will sort all detected clusters in decreasing order according to the value of \code{statistic} and remove those clusters which overlap with another cluster with a larger value of \code{statistics}. This will be similar in spirit to the SSS as this will only consider a single cluster around each area and will help to control for the multiple testing problem inherent to the problem of the detection of disease clusters. Function \code{slimknclusters()} takes three arguments. The first one is the spatial or spatio-temporal object used in the call to \code{DetectClustersModel}, the second one is the table with the clusters detected and the third one is the minimum cluster size to consider. By default, this is 1, and it will set the minimum number of areas in a cluster to report it. For example, the following code could be used to reduce the number of possible clusters to those non-overlapping cluster with a size of at least three in the cluster detection with covariates: <<>>= slimknclusters(NY8, ny.cl1, 3) @ In this case, as there are no overlapping clusters the results do not change. \section{Mixed-effects models for cluster detection} \label{sec:mixed} Random effects can be incorporated into our models to account for unmeasured risk factors. In this context, random effects will model area-specific intrinsic variation not explained by other covariates or cluster variables in the model. Cluster detection will be performed as usual, but we should keep in mind that by including random effects and dummy cluster variables there may be a clash between the two. By using dummy variables we are intentionally looking for unexplained spatial variation in the data. Hence, random effects should aim at modelling a different structure in the data. For example, if spatial random effects are included, these may tend to model the clusters and then an identifiability problem between them and the dummy cluster covariates may occur, which may lead to poor estimates of the random effects and/or the coefficients of the dummy cluster covariates. For the Poisson case, this will mean that the relative risk can be modelled as: % \begin{eqnarray} \log(\mu_i) & = &\log(E_i) + \alpha + \beta x_i + \gamma_j c^{(j)}_i + u_i\nonumber\\ u_i & \sim & N(0, \sigma^2_u) \end{eqnarray} % where $u_i$ represents a random effect normally distributed with zero mean and variance $\sigma^2_u$. Note that random effects can be defined to be spatially correlated, as suggested by \citet{BilanciaDemarinis:2014}. However, this can produce a clash between the dummy cluster variables and the random effects. Random effects are particularly useful to model over-dispersion in count data, which occurs when the variance in the data is greater than the variance defined by the statistical model. For example, for a Poisson model the mean and variance should be equal. Hence, by including random effects in a Poisson GLM the variance of the data can be different to the mean. \subsection{Leukemia in upstate New York} We go back to the example on the leukemia incidence in upstate New York to show how models can include random effects and, at the same time, detect disease clusters. In this particular example, random effects will be important in order to reflect any over-dispersion present in the data. For this reason, our first step here is to test the data for over-dispersion using Dean's $P_B$ and $P'_B$ score tests \citep[see,][for details]{Dean:1992}. These two tests have been implemented in functions \code{DeanB()} and \code{DeanB2()} in the \pkg{DCluster} \citep{DCluster:2005} package. They both take a \code{glm} object and perform the score tests: <<>>= DeanB(ny.m0) DeanB2(ny.m0) @ From the results, it is clear that when no covariates are included data are clearly over-dispersed. Hence, a Poisson distribution will not be appropriate to model the observed counts in each tract. The same tests applied to the model with covariates produce a similar result: <<>>= DeanB(ny.m1) DeanB2(ny.m1) @ Although $p$-values have increased, they are both small and we may still consider that data are over-dispersed. Hence, we will aim at detecting clusters using a Poisson regression with independent random effects to account for census tract-level heterogeneity. \subsection{Cluster detection} \subsubsection{Cluster detection with no covariates} The baseline model with no covariate will now be fitted with function \code{glmer()} from package \pkg{lme4} \citep{lme4:2015}. This function is similar to \code{glm()} for GLMs but it will allow us to include random effects in the model as part of the \code{formula} argument. <<>>= library("lme4") ny.mm0 <- glmer(Observed ~ offset(log(Expected)) + (1 | AREANAME), data = as(NY8, "data.frame"), family = "poisson") @ <>= ny.clmm0 <- DetectClustersModel(NY8, thegrid = as.data.frame(NY8)[idxcl, c("x", "y")], fractpop = 0.15, alpha = 0.05, typeCluster = "S", R = NULL, model0 = ny.mm0, ClusterSizeContribution = "POP8") @ <<>>= ny.clmm0 @ After accounting for overdispersion, the number of clusters detected is \Sexpr{nrow(ny.clmm0)}. These are shown in Figure~\ref{fig:NYclmm}. The largest cluster detected before has now disappeared and only the two smallest clusters remain. This may be due to the fact that the first cluster has the smallest SMR and risk. Hence, allowing for extra-variation will make the discrepancy between observed and expected less extreme and this cluster will be the first one to be declared as non-significant. The two remaining clusters have the same size and a very similar risk as in the previous case. <>= #Compute SMR in clusters cls <- get.knclusters(NY8, ny.cl0) lapply(cls, function(X) { res <- c(sum(NY8$Observed[X]), sum(NY8$Expected[X])) c(res, res[1]/res[2]) }) @ \subsubsection{Cluster detection with covariates} The mixed-effects model with covariates can be fitted as follows: <<>>= ny.mm1 <- glmer(Observed ~ offset(log(Expected)) + PCTOWNHOME + PCTAGE65P + PEXPOSURE + (1 | AREANAME), data = as(NY8, "data.frame"), family = "poisson") #summary(ny.mm1) @ The estimates of the coefficients are similar to that of the fixed-effects model, but the associated $p$-values are somewhat higher. This is probably due to the inclusion of random effects in the model. <>= ny.clmm1 <- DetectClustersModel(NY8, thegrid = as.data.frame(NY8)[idxcl, c("x", "y")], fractpop = 0.15, alpha = 0.05, typeCluster = "S", R = NULL, model0 = ny.mm1, ClusterSizeContribution = "POP8") @ <<>>= ny.clmm1 @ When overdispersion and covariates are included in the model, only one of the clusters remains with size 1 and a slightly reduced estimate of the cluster coefficient. This should not come as a surprise given that we have already seen how including covariates explains some of the extra-variation and that by including random effects the significance of the clusters is also reduced. A summary of the clusters detected can be found in Figure~\ref{fig:NYclmm}. As in the example in Section~\ref{subsec:ex:NY}, we have created two cluster variables to display the clusters obtained: <<>>= NY8$CLUSTERMM0 <- get.allknclusters(NY8, ny.clmm0) NY8$CLUSTERMM1 <- get.allknclusters(NY8, ny.clmm1) @ \begin{figure}[!h] \centering <>= ny.mapmm0 <- spplot(NY8, c("CLUSTERMM0"), col.regions = c("white", "gray"), col = "#4D4D4D", colorkey = FALSE) + layer(lpoints(ny.clmm0$x, ny.clmm0$y, pch = 19)) ny.mapmm1 <- spplot(NY8, c("CLUSTERMM1"), col.regions = c("white", "gray"), col = "#4D4D4D", colorkey = FALSE) + layer(lpoints(ny.clmm1$x, ny.clmm1$y, pch = 19)) #Syracuse ny.mapmm0.s <- spplot(NY8[syracuse, ], c("CLUSTERMM0"), col.regions = c("white", "gray"), col = "#4D4D4D", colorkey = FALSE) + layer(lpoints(ny.clmm0$x, ny.clmm0$y, pch = 19)) ny.mapmm1.s <- spplot(NY8[syracuse, ], c("CLUSTERMM1"), col.regions = c("white", "gray"), col = "#4D4D4D", colorkey = FALSE) + layer(lpoints(ny.clmm1$x, ny.clmm1$y, pch = 19)) grid.arrange(ny.mapmm0, ny.mapmm1, ny.mapmm0.s, ny.mapmm1.s, ncol = 2) @ \caption{Clusters detected with no covariate adjustment (left) and after adjusting for covariates (right) in the whole study region (top row) and Syracuse city (bottom row) using a mixed-effects model to account for overdispersion of the data. Areas in clusters are shaded in gray and cluster centers are represented by blue dots.} \label{fig:NYclmm} \end{figure} \section{Zero-inflated models for cluster detection} \label{sec:zeroinfl} The analysis of rare diseases often involves datasets where there are many areas with zero counts, leading to zero-inflated data. In this situation the Poisson or binomial likelihoods may not be suitable to fit a model, because there is an excess of areas with zeros with respect to the number predicted by the pure Poisson or binomial model, and other distributions for the data should be used. \citet{RD:2010} discuss this issue and they have extended model-based cluster detection methods to account for zero-inflation. For count data, a zero-inflated Poisson model could be used. In this case, observed number of cases come from a mixture distribution of a Poisson and a distribution with all mass at zero. Probabilities are as follows: % \begin{equation} Pr(O_i=n_i)= \left\{ \begin{array}{ll} \pi_i+(1-\pi_i)Po(0|\theta_iE_i) & n_i=0\\ (1-\pi_i)Po(n_i|\theta_iE_i) & n_i=1, 2, \ldots\\ \end{array} \right. \end{equation} % Here $Po(O_i|\theta_i E_i)$ represents the probability of observing $O_i$ cases using a Poisson distribution with mean $\theta_i E_i$. $\pi_i$ represents the proportion of zeroes observed that do not come from the Poisson distribution. Relative risks $\theta_i$ can be modeled using a log-linear model to depend on some relevant risk factors. Also, it is common that all $\pi_i$'s are taken equal to a single value $\pi$. %\input{Navarre.tex} \subsection{Brain cancer in Navarre (Spain)} \citet{Ugarteetal:2006} analyze the incidence of brain cancer in Navarre (Spain) at the health district level. Figure~\ref{fig:Navarre} shows the standardized mortality ratios. As it can be seen there are many areas where the SMR is zero because there are no cases in those areas. \citet{Ugarteetal:2004} have also reported a significant zero-inflation of these data compared to a Poisson distribution. For cluster detection, the method implemented in \code{DClusterm} is similar to the one used in \citet{RD:2010} for the detection of disease clusters of rare diseases. <>= library("DClusterm") data("Navarre") @ \begin{figure}[!h] <>= print(spplot(brainnav, "SMR", cuts = 8, col.regions = brewer.pal(9, "Oranges"))) @ \caption{SMR of brain cancer in Navarre (Spain).} \label{fig:Navarre} \end{figure} \subsection{Cluster detection} %\subsubsection{Cluster detection with no covariates} Before starting our cluster detection methods, we will check the appropriateness of a Poisson distribution for this data. Fitting a log-linear model (with no covariates) gives the following model: <<>>= nav.m0 <- glm(OBSERVED ~ offset(log(EXPECTED)) + 1, family = "poisson", data = brainnav) @ Furthermore, a quasipoisson model has been fitted in order to assess any extra-variation in the data: % <<>>= nav.m0q <- glm(OBSERVED ~ offset(log(EXPECTED)) + 1, data = brainnav, family = "quasipoisson") @ % The dispersion parameter in the previous model is \Sexpr{round(summary(nav.m0q)$dispersion, 2)}, which is higher than 1. This may mean that the Poisson distribution is not appropriate in this case due to the excess of zeros. For this reason, and following \citet{Ugarteetal:2004}, a zero-inflated Poisson model has been fitted using function \code{zeroinfl()} from package \pkg{pscl} \citep{pscl:2008}. Here is the resulting model: <<>>= library("pscl") nav.m0zip <- zeroinfl(OBSERVED ~ offset(log(EXPECTED)) + 1 | 1, data = brainnav, dist = "poisson", x = TRUE) summary(nav.m0zip) @ Hence, the zero-inflated Poisson model will be used now to detect clusters of disease. As in the example on the New York leukemia dataset, a \pkg{sp} object will store all the information. The column for the expected counts must be named \code{Expected}, and this is our first step. Note also that, because only one time period is considered, data will have a single value and it is the 1st of January of 1990. Function \code{DetectClustersModel()} will perform the cluster detection using a \code{zeroinfl} model. This provides a very flexible way of handling different types of models in \proglang{R} for cluster detection. <>= nav.cl0 <- DetectClustersModel(brainnav, coordinates(brainnav), fractpop = 0.25, alpha = 0.05, typeCluster = "S", R = NULL, model0 = nav.m0zip, ClusterSizeContribution = "EXPECTED") @ The output will show the following clusters: % <<>>= nav.cl0 @ % As it can be seen, two clusters (with a $p$-value lower than $0.05$) are detected. However, they overlap and we will just consider the one with the lowest $p$-value, which is shown in Figure~\ref{fig:Navarrecl}. An index for the areas in each of the detected clusters can be obtained with the \code{knbinary()} function. It will return a \code{data.frame} with all the dummy cluster variables, i.e., the returned \code{data.frame} will have as many columns as clusters and a number of rows equal to the number of areas. Entries will be 1 if the corresponding areas are in a cluster and 0 otherwise. These indices can be used for a number of analyses, such as checking whether two clusters overlap or computing the number of times an area is included in a cluster. In the following example we obtain the representation of all the clusters detected and the first one, the most significant, is added as a new column to the original \code{SpatialPolygonsDataFrame} to be displayed in Figure~\ref{fig:Navarrecl}. <<>>= nav.clusters <- get.knclusters(brainnav, nav.cl0) brainnav$CLUSTER <- "" brainnav$CLUSTER [ nav.clusters[[1]] ] <- "CLUSTER" brainnav$CLUSTER <- as.factor(brainnav$CLUSTER) @ \begin{figure}[!h] \centering <>= print(spplot(brainnav, "CLUSTER", col = "#4D4D4D", col.regions = c("white", "grey")) ) @ \caption{Cluster of brain cancer detected in Navarre (Spain).} \label{fig:Navarrecl} \end{figure} \section{Spatio-temporal clusters} \label{sec:spacetime} %\input{NM.tex} \citet{Jung:2009} discusses how to extend model-based approaches for the detection of spatial disease clusters to space and time. \citet{GomezRubioetal:2018} propose the following model: % \begin{equation} \log(\mu_{i,t}) = \log(E_{i,t}) + \gamma_j c^{(j)}_{i,t} \label{eq:stcluster} \end{equation} % where $\mu_{i,t}$ is the mean of area $i$ at time $t$ and $c^{(j)}_{i,t}$ a cluster dummy variable for spatio-temporal cluster $j$. Random effects can also be included in Equation~\ref{eq:stcluster} as described in Section~\ref{sec:mixed} and zero-inflated distributions can also be considered as in Section~\ref{sec:zeroinfl}. Note how now data are indexed according to space and time. Dummy cluster variables are defined as in the spatial case, by considering areas in the cluster according to their distance to the cluster center, for data within a particular time period. When defining a temporal cluster, areas are aggregated using all possible temporal windows up to a predefined temporal range. \subsection{Brain cancer in New Mexico} The \code{brainNM} dataset (included in \pkg{DClusterm}) contains yearly cases of brain cancer in New Mexico from 1973 to 1991 (inclusive) in a \pkg{spacetime} object. The data set has been taken from the \pkg{SaTScan} website and the area boundaries from the U.S. Census Bureau. In addition, the location of Los Alamos National Laboratory (LANL) has been included (from Wikipedia). Inverse distance to this site can be used to test for increased risk in the areas around the Laboratory as no other covariates are available. <>= library("DClusterm") @ <<>>= data("brainNM") @ Expected counts have been obtained using age and sex standardization over the whole period of time. Hence, yearly differences are likely to be seen when plotting the data. Standardized mortality ratios have been plotted in Figure~\ref{fig:NMSMR}. \begin{figure}[!h] \centering <>= print(stplot(brainst[, , "SMR"], cuts = 8, col.regions = brewer.pal(9, "Oranges"))) @ \caption{Standardized mortality ratios of brain cancer in New Mexico.} \label{fig:NMSMR} \end{figure} \subsection{Cluster detection} \subsubsection{Cluster detection with no covariates} Similarly as in the purely spatial case, a Poisson model with no covariates will be fitted first: <<>>= nm.m0 <- glm(Observed ~ offset(log(Expected)) + 1, family = "poisson", data = brainst) summary(nm.m0) @ Before proceeding to disease cluster detection, we have extracted the centroids of the counties in New Mexico by using function \code{coordinates()} from the \code{sp} slot in the \code{STFDF} object that stores the data. <>= NM.coords <- coordinates(brainst@sp) @ Cluster detection with function \code{DetectClustersModel()} takes arguments \code{minDateUser} and \code{maxDateUser} to define the start and end date of the period which is considered when looking for clusters. In this example the time period has been constrained to 1985--1989. Furthermore, \code{typeCluster = "ST"} is used to look for spatio-temporal clusters. Note that it is also possible to look for spatial clusters by aggregating observed and expected cases over the whole temporal window with \code{typeCluster = "S"}. % <>= nm.cl0 <- DetectClustersModel(brainst, NM.coords, minDateUser = "1985-01-01", maxDateUser = "1989-01-01", fractpop = 0.15, alpha = 0.05, typeCluster = "ST", R = NULL, model0 = nm.m0, ClusterSizeContribution = "Expected") @ % \Sexpr{nrow(nm.cl0)} possible clusters have been found this time. However, note that most of them overlap and may be different configurations of the same cluster. However, as mentioned earlier, secondary overlapping clusters will be removed in the analysis: <<>>= nm.cl0.s <- slimknclusters(brainst, nm.cl0) nm.cl0.s @ Hence, only \Sexpr{nrow(nm.cl0.s)} remain after removing overlapping clusters and these will be the only ones considered in the analysis. \subsubsection{Cluster detection after adjusting for covariates} In this case, we will use the inverse of the distance to LANL as a covariate as no other information about the areas is available. Distances have been computed using function \code{spDistsN1()} from package \pkg{sp} \citep{sp:2005}. Given that coordinates are expressed in longitude and latitude great circle distances are used. <<>>= dst <- spDistsN1(pts = NM.coords, pt = losalamos, longlat = TRUE) @ Distances need to be put together in a way that values are available for all time periods. In this case, given that distances do not change over time, a vector is created by repeating the vector of distances as many times as time slots (years) we have in the dataset. <<>>= nyears <- length(unique(brainst$Year)) brainst$IDLANL <- rep(1 / dst, nyears) @ With all these data we are now able to fit a baseline model. <<>>= nm.m1 <- glm(Observed ~ offset(log(Expected)) + IDLANL, family = "poisson", data = brainst) summary(nm.m1) @ Note how now the included covariate is not significant. For illustrative purposes, we will still keep the covariate in our model for the cluster detection. However, non-significant covariates will have a tiny impact on the clusters detected as they will not produce a change in the expected number of cases. <>= nm.cl1 <- DetectClustersModel(brainst, NM.coords, fractpop = 0.15, alpha = 0.05, minDateUser = "1985-01-01", maxDateUser = "1989-01-01", typeCluster = "ST", R = NULL, model0 = nm.m1, ClusterSizeContribution = "Expected") @ The number of clusters detected in this case is \Sexpr{nrow(nm.cl1)}, very similar to the \Sexpr{nrow(nm.cl0)} found when no covariates where included in the model. This was expected as the included covariate was not significant. Similarly as in the previous example, non-overlapping clusters will be removed: <<>>= nm.cl1.s <- slimknclusters(brainst, nm.cl1) nm.cl1.s @ Now, the same \Sexpr{nrow(nm.cl1.s)} clusters are detected. Note how the first cluster covers three years (1986, 1987 and 1988). In order to exploit the output from \code{DetectClustersModel()}, function \code{get.stclusters()} will take the data and this output to return a list with the indices of the areas in the clusters. The next example shows how to add a new variable to \code{brainst} with the space-time regions in the most significant cluster, which is displayed in Figure~\ref{fig:NMcluster}. <<>>= stcl <- get.stclusters(brainst, nm.cl0.s) brainst$CLUSTER <- "" brainst$CLUSTER[ stcl[[1]] ] <- "CLUSTER" @ \begin{figure}[!h] \centering <>= print(stplot(brainst[, , "CLUSTER"], at = c(0, 0.5, 1.5), col = "#4D4D4D", col.regions = c("white", "gray"))) @ \caption{Most significant spatio-temporal cluster of brain cancer detected in New Mexico.} \label{fig:NMcluster} \end{figure} %\section{Bivariate models for cluster detection} %\label{sec:bivar} % % %FIXME: We may remove this section... \section{Discussion} \label{sec:disc} In this paper we have introduced \pkg{DClusterm}, a new package for the \proglang{R} statistical computing software for the detection of disease clusters using a model-based approach, following recent developments by several authors. Clusters are represented by dummy variables that are introduced into a generalized linear model and different likelihoods can be used to account for different types of data. Because of this model-based approach, fixed effects (to consider relevant risk factors) and random effects (to account for other non-spatial unmeasured risk factors) can be put in the linear predictor as well. In our examples we have considered well-known datasets to show how the functions in the \pkg{DClusterm} package tackle the problem of cluster detection. The results are similar to those found in relevant papers where the same datasets have been analyzed using a similar methodology. In particular, we have considered the case of the detection of clusters in space and space-time, zero-inflated data and over-dispersed data. \section{Acknowledgments} This package was initially developed by Paula Moraga as a Google Summer of Code project. V. G\'omez-Rubio has been supported by grants PPIC-2014-001 and SBPLY/17/180501/000491, funded by Consejer\'ia de Educaci\'on, Cultura y Deportes (JCCM, Spain) and FEDER, and grant MTM2016-77501-P, funded by Ministerio de Econom\'ia y Competitividad (Spain). \bibliography{DClusterm} %\bibliographystyle{chicago} \end{document}