ERGM equations

George G Vega Yon

2023-06-13

The likelihood of an Exponential Random Graph Model (ERGM) is defined as follows:

\[ \text{P}\left(\left.\mathbf{Y}= \mathbf{y}\vphantom{X = x}\right|\vphantom{\mathbf{Y}= \mathbf{y}}X = x\right) = \frac{% \text{exp}\left\{{\theta}^{\text{t}}g\left(\mathbf{y}, x\right)\right\} % }{% \sum_{\mathbf{y}'\in\mathcal{Y}} \text{exp}\left\{{\theta}^{\text{t}}g\left(\mathbf{y}', x\right)\right\} % },\quad \forall \mathbf{y}\in\mathcal{Y} \]

Where \(\mathbf{y}\in\mathcal{Y}\) is a random graph, \(X\) is a vector of attributes, \(\theta\) is a column-vector of length \(k\) (the model parameters), and \(g\left(\cdot\right)\) is a function that returns a column-vector of sufficient statistics, also of length \(k\). In general, from a computational point of view, it is easier to work with the likelihood function centered at the observed sufficient statistics, that is:

\[ \text{P}\left(\left.\mathbf{Y}= \mathbf{y}\vphantom{X = x}\right|\vphantom{\mathbf{Y}= \mathbf{y}}X = x\right) = \frac{1}{% \sum_{\mathbf{y}'\in\mathcal{Y}} \text{exp}\left\{{\theta}^{\text{t}}\Delta{}g\left(\mathbf{y}',x\right)\right\} % },\quad \forall \mathbf{y}\in\mathcal{Y} \]

Where \(\Delta{}g\left(\mathbf{y}'\right) = g\left(\mathbf{y}',x\right) - g\left(\mathbf{y}, x\right)\). In the case of ergmito, we usually look at a pooled model with \(n\) networks, i.e.

\[ \prod_{i \in N}\text{P}\left(\left.\mathbf{Y}= \mathbf{y}_i\vphantom{X = x_i}\right|\vphantom{\mathbf{Y}= \mathbf{y}_i}X = x_i\right) = \prod_{i \in N}\frac{1}{% \sum_{\mathbf{y}_i'\in\mathcal{Y}} \text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}_i'\right)\right\} % },\quad \forall \mathbf{y}_i\in\mathcal{Y} \]

Where \(N\equiv\{1,\dots, n\}\) is a vector of indices.

Log-likelihood

In the case of a single network, the model’s log-likelihood is given by

\[ \text{log}\left(\text{P}\left(\cdot\right)\right) = - \text{log}\left( % \sum_{\mathbf{y}'\in\mathcal{Y}}\text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}'\right)\right\} % \right) \] In general, we can reduce the computational complexity of these calculations by looking at isomorphic sufficient statistics, that is, by grouping elements based on the set of unique vectors of sufficient statistics:

\[ - \text{log}\left( % \sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % \right) \]

Where \(\mathcal{S}\) is the support of the sufficient statistics under \(\mathcal{Y}\), \(\mathbf{s}\in\mathcal{S}\) is one of its realizations, and \(\mathbf{w}_\mathbf{s}\equiv|\left\{\mathbf{y}\in\mathcal{Y}: \Delta{}g\left(\mathbf{y}\right) = \mathbf{s}\right\}|\) is the number of networks in \(\mathcal{Y}\) whose centered sufficient statistics equal \(\mathbf{s}\). Furthermore, we can write this in matrix form:
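To illustrate the grouping idea, the weights \(\mathbf{w}_\mathbf{s}\) can be obtained by tabulating duplicated rows of the per-graph statistics. The following is a minimal base-R sketch with made-up centered statistics (not ergmito's internal data structures):

```r
# Hypothetical centered sufficient statistics, one row per graph in Y
allstats <- rbind(
  c( 0,  0),
  c(-1,  0), c(-1,  0), c(-1,  0),
  c(-1, -1), c(-1, -1), c(-1, -1),
  c(-2, -1)
)

# Group identical rows: S holds the unique vectors, W their multiplicities
key <- apply(allstats, 1, paste, collapse = ",")
S   <- allstats[!duplicated(key), , drop = FALSE]
W   <- as.vector(table(key)[unique(key)])
```

With these, the sum over the 8 graphs reduces to a weighted sum over 4 unique statistic vectors.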

\[ - % \text{log}\left( % \mathbf{W}\times \text{exp}\left\{\mathbf{S}\times \theta\right\} % \right) \]

Where \(\mathbf{W}\equiv\{\mathbf{w}_\mathbf{s}\}_{\mathbf{s}\in\mathcal{S}}\) is a row vector of length \(w\), and \(\mathbf{S}\) is a matrix of size \(w\times k\). The log-likelihood of a pooled model is simply the sum of the individual ones.
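As a sketch (with hypothetical values of \(\mathbf{W}\), \(\mathbf{S}\), and \(\theta\), not the package's internal code), the matrix form of the log-likelihood can be evaluated directly in base R:

```r
# Hypothetical unique centered statistics (w x k) and their weights
S <- rbind(c(0, 0), c(-1, 0), c(-1, -1), c(-2, -1))
W <- c(1, 3, 3, 1)
theta <- c(0.5, -0.25)

# log-likelihood: -log(W x exp(S x theta))
ll <- -log(drop(W %*% exp(S %*% theta)))
```

The `drop()` call collapses the resulting 1 x 1 matrix into a scalar.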

Gradient

The partial derivative of the log-likelihood with respect to \(j\)-th parameter equals:

\[ \frac{\delta}{\delta\theta_j}\text{log}\left(\text{P}\left(\cdot\right)\right) = % - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }{ % \sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }, \quad\forall j \]

We can also write this using matrix notation as follows:

\[ \nabla\text{log}\left(\text{P}\left(\cdot\right)\right) = % - % {\mathbf{S}}^{\text{t}}\times \left[ \mathbf{W}\circ \text{exp}\left\{\mathbf{S}\times \theta\right\} \right]/ % \lambda({\theta}) \]

Where \(\circ\) is the element-wise product and \(\lambda({\theta}) = \mathbf{W}\times \text{exp}\left\{\mathbf{S}\times \theta\right\}\).
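The gradient in matrix notation can be sketched as follows, again with hypothetical \(\mathbf{W}\), \(\mathbf{S}\), and \(\theta\); as a sanity check, the first component is compared against a central-difference approximation of the log-likelihood:

```r
S <- rbind(c(0, 0), c(-1, 0), c(-1, -1), c(-2, -1))  # hypothetical centered statistics
W <- c(1, 3, 3, 1)                                   # hypothetical weights
theta <- c(0.5, -0.25)

lambda <- drop(W %*% exp(S %*% theta))               # normalizing constant
grad   <- -t(S) %*% (W * exp(S %*% theta)) / lambda  # k x 1 gradient

# Numerical check via central differences of the log-likelihood
ll  <- function(p) -log(drop(W %*% exp(S %*% p)))
eps <- 1e-6
num <- (ll(theta + c(eps, 0)) - ll(theta - c(eps, 0))) / (2 * eps)
```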

Hessian

In the case of the Hessian, each \((j, l)\)-th element, \(\frac{\delta^2}{\delta\theta_j\delta\theta_l}\text{log}\left(\text{P}\left(\cdot\right)\right)\), can be computed as:

\[ \frac{% -\left(\sum_{\mathbf{s}'\in\mathcal{S}}\mathbf{s}_j'\mathbf{s}_l'\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{{\theta}^{\text{t}} \mathbf{s}'\right\}\right) }{% \lambda(\theta)% } + \frac{% \left(\sum_{\mathbf{s}'\in\mathcal{S}}\mathbf{s}_j'\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{{\theta}^{\text{t}} \mathbf{s}'\right\}\right) \left(\sum_{\mathbf{s}'\in\mathcal{S}}\mathbf{s}_l'\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{{\theta}^{\text{t}} \mathbf{s}'\right\}\right) }{% \lambda(\theta)^2% } \] Where \(\mathbf{s}_j'\) is the \(j\)-th element of the vector \(\mathbf{s}'\). Once again, we can simplify this using matrix notation:

\[ \frac{% -\mathbf{W}\times\left[\mathbf{S}_j\circ\mathbf{S}_l \circ \text{exp}\left\{\mathbf{S}\times\theta\right\}\right]% }{% \lambda(\theta)% } + \frac{% \left(\mathbf{W}\times\left[\mathbf{S}_j \circ \text{exp}\left\{\mathbf{S}\times\theta\right\}\right]\right) \left(\mathbf{W}\times\left[\mathbf{S}_l \circ \text{exp}\left\{\mathbf{S}\times\theta\right\}\right]\right) }{% \lambda(\theta)^2% } \]

Where \(\mathbf{S}_j\) is the \(j\)-th column of the matrix \(\mathbf{S}\).
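The element-wise formula above can be sketched in base R with hypothetical values; note that the result is (minus) a weighted covariance matrix of the statistics, so it should come out symmetric and negative definite here:

```r
S <- rbind(c(0, 0), c(-1, 0), c(-1, -1), c(-2, -1))  # hypothetical centered statistics
W <- c(1, 3, 3, 1)
theta <- c(0.5, -0.25)

E      <- drop(exp(S %*% theta))
lambda <- sum(W * E)
k      <- ncol(S)

# (j, l)-th component of the Hessian
H <- matrix(NA_real_, k, k)
for (j in seq_len(k)) for (l in seq_len(k)) {
  H[j, l] <- -sum(W * S[, j] * S[, l] * E) / lambda +
    sum(W * S[, j] * E) * sum(W * S[, l] * E) / lambda^2
}
```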

Limiting values

In the case that the MLE does not exist, i.e. \(\theta_k\to\pm\infty\), which occurs when the observed sufficient statistics are not in the interior of the support (for example, a fully connected network), the limits of the log-likelihood, the gradient, and the Hessian are still finite. This is relevant, as some special care needs to be taken when dealing with these cases.

While models in which the MLEs diverge may seem uninteresting, the fact that ERGMs describe a discrete rather than a continuous random variable makes this type of event easier to observe than in other families of distributions. Furthermore, for methods such as bootstrapping, forward/backward selection, or other classes of algorithms that involve some level of automation during the model-fitting process, it is important to know how to deal correctly with the non-existence of MLEs. Fortunately, as we will show, the log-likelihood and its derivatives are well defined in such cases.

The main principle here is what happens to the argument of the exponential function that appears throughout these expressions. Depending on the value of \(\mathbf{s}_k\), two cases follow:

  1. The observed value of the \(k\)-th sufficient statistic is at the lower bound, \(\mathbf{s}_k = \min_{\mathbf{s}'\in\mathcal{S}}\mathbf{s}_{k}'\). This happens, for example, when trying to fit a model with triangles where there are no observed triangles. In this case, since \(\mathbf{s}_k' \geq 0\) for all \(\mathbf{s}'\), the MLE of \(\theta_k\) diverges to \(-\infty\); thus, in the limit, the expression \[ \lim_{\theta_k\to-\infty}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}'\right\} = \left\{\begin{array}{ll} % \text{exp}\left\{\sum_{j\neq k}\mathbf{s}_j'\theta_j\right\} &\quad\text{if }\mathbf{s}_k' = 0 \\ 0 & \quad\text{if }\mathbf{s}_k' > 0 \end{array}\right. \]

  2. The observed value of the \(k\)-th sufficient statistic is at the upper bound, \(\mathbf{s}_k = \max_{\mathbf{s}'\in\mathcal{S}}\mathbf{s}_k'\). The most common case is when the sufficient statistic is saturated, for example, in a fully connected graph, where the MLE diverges to infinity, \(\theta_k\to+\infty\). Similar to before, since \(\mathbf{s}_k' \leq 0\) for all \(\mathbf{s}'\), in the limit the previous expression is well defined: \[ \lim_{\theta_k\to+\infty}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}'\right\} = \left\{\begin{array}{ll} % \text{exp}\left\{\sum_{j\neq k}\mathbf{s}_j'\theta_j\right\} &\quad\text{if }\mathbf{s}_k' = 0 \\ 0 & \quad\text{if }\mathbf{s}_k' < 0 \end{array}\right. \]

The two previous points can be interpreted as follows: in the limit, only graphs whose \(k\)-th sufficient statistic matches that of the observed data contribute to the model. Therefore, the key to computing the limiting values of the log-likelihood, gradient, and Hessian is partitioning the summations over \(\mathbf{s}'\in\mathcal{S}\) into two sets as follows:

\[\begin{align} \mathcal{S}_0 & \equiv \{\mathbf{s}\in\mathcal{S}: \mathbf{s}_u = 0, \forall u\in U\} \\ \mathcal{S}_1 & \equiv \mathcal{S}\setminus\mathcal{S}_0 \end{align}\]

Where \(U\) is the set of indices of observed sufficient statistics that are on the boundary and thus have an MLE that diverges to \(\pm\infty\). In this partition, \(\mathcal{S}_0\) contains all the realizations of sufficient statistics for which the statistics in \(U\) are equal to the observed ones. With this definition in hand, we now show the limiting behavior of the log-likelihood, gradient, and Hessian when one or more observed sufficient statistics are not in the interior of the support.
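The partition can be sketched in base R with hypothetical statistics; here we suppose (for illustration only) that the second statistic is on the boundary, so \(U = \{2\}\):

```r
S <- rbind(c(0, 0), c(-1, 0), c(-1, -1), c(-2, -1))  # hypothetical centered statistics
W <- c(1, 3, 3, 1)
U <- 2                                               # indices of diverging terms

# S0: rows where s_u = 0 for every u in U; S1: the rest
in_S0 <- rowSums(S[, U, drop = FALSE] != 0) == 0
S0 <- S[in_S0, , drop = FALSE]; W0 <- W[in_S0]
S1 <- S[!in_S0, , drop = FALSE]; W1 <- W[!in_S0]
```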

Log-likelihood

Using the partition of \(\mathcal{S}\), the log-likelihood can be written as follows

\[ -\text{log}\left( % \sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{j\not\in U}\mathbf{s}_j'\theta_j\right\} + % \sum_{\mathbf{s}'\in\mathcal{S}_1}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}'\right)\right\} \right) \]

Then, WLOG, as \(\theta_u\to-\infty\) (or \(+\infty\) if \(\mathbf{s}_u\) is at the upper bound), we have:

\[ \begin{align} & \lim_{\theta_u\to-\infty} -\text{log}\left( % \sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{j\not\in U}\mathbf{s}_j'\theta_j\right\} + % \sum_{\mathbf{s}'\in\mathcal{S}_1}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}'\right)\right\} \right) \\ & = -\text{log}\left( \lim_{\theta_u\to-\infty} % \sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{j\not\in U}\mathbf{s}_j'\theta_j\right\} + % \sum_{\mathbf{s}'\in\mathcal{S}_1}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{{\theta}^{\text{t}}\Delta g\left(\mathbf{y}'\right)\right\} \right) \\ & = -\text{log}\left( % \sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{j\not\in U}\mathbf{s}_j'\theta_j\right\} \right) \end{align} \]

Where the second equality follows from the continuity of the logarithm function on \((0, +\infty)\). We can see this in the following example:

Suppose that we have five networks of size 4, like the ones included in the ergmito package (fivenets), and we wanted to fit the model fivenets ~ edges + triadcensus(15), with triadcensus(15) equal to the count of fully connected triads, also known as triad class 300 in Davis and Leinhardt's triadic classification system. Since the fivenets dataset has no fully connected triad, the MLE of that term diverges:

library(ergmito)
data(fivenets)
(ans <- ergmito(fivenets ~ edges + triadcensus(15)))
## Warning: The observed statistics (target.statistics) are near or at the
## boundary of its support, i.e. the Maximum Likelihood Estimates maynot exist or
## be hard to be estimated. In particular, the statistic(s) "edges",
## "triadcensus.300".
## 
## ERGMito estimates
## note: A subset of the parameters estimates was replaced with +/-Inf,  and the Hessian matrix is not p.s.d.
## 
## Call:
## ergmito(model = fivenets ~ edges + triadcensus(15))
## 
##  Coefficients:
##           edges  triadcensus.300  
##         -0.6851             -Inf

The log-likelihood of this model can be computed directly with ergmito by using a large negative number in place of -Inf, or by using the limiting expression of the log-likelihood:

# Approach using the limiting value
l <- with(ans$formulae, {
  sapply(1:nnets(ans), function(i) {
    # Preparing suff stats
    S <- t(t(stats_statmat[[i]]) - target_stats[i, ])
    W <- stats_weights[[i]]
    theta <- coef(ans)["edges"]
    
    # Index of members of S0
    s0idx <- which(S[,2] == 0)
    
    - log(sum(W[s0idx] * exp(S[s0idx,1] * theta)))
  })
})
sum(l)
## [1] -38.16338
# Which should be equivalent to the approach with large negative number
ans$formulae$loglik(c(coef(ans)["edges"], -1e100))
## [1] -38.16338

Gradient

The gradient for \(\theta_j\) equals

\[ - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }{ % \sum_{\mathbf{s}\in\mathcal{S}}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % } = % - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p\right\} + \sum_{\mathbf{s}\in\mathcal{S}_1}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p\right\} + % \sum_{\mathbf{s}\in\mathcal{S}_1}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % } \]

WLOG, if \(\mathbf{s}_u = \min_{\mathbf{s}'\in\mathcal{S}}\mathbf{s}_u'\), the limit of the above expression as \(\theta_u\to-\infty\) is evaluated as follows:

\[ \begin{align} \lim_{\theta_u\to-\infty} & - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p\right\} + \sum_{\mathbf{s}\in\mathcal{S}_1}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % }{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p\right\} + % \sum_{\mathbf{s}\in\mathcal{S}_1}\mathbf{w}_\mathbf{s}\text{exp}\left\{{\theta}^{\text{t}}\mathbf{s}\right\} % } \\ & = % - \frac{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\mathbf{s}_j\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p\right\} }{ % \sum_{\mathbf{s}\in\mathcal{S}_0}\mathbf{w}_\mathbf{s}\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p\right\} } \end{align} \]

If \(j = u\), the above expression reduces to 0, since \(\mathbf{s}'_u = 0\ \forall\ \mathbf{s}'\in \mathcal{S}_0\); otherwise, it is well defined. We can alternatively calculate the gradient of the triad 300 model using the above expression:

g <- with(ans$formulae, {
  sapply(1:nnets(ans), function(i) {
    # Preparing suff stats
    S <- t(t(stats_statmat[[i]]) - target_stats[i, ])
    W <- stats_weights[[i]]
    theta <- coef(ans)["edges"]
    
    # Index of members of S0
    s0idx <- which(S[,2] == 0)
    
    - sum(W[s0idx] * S[s0idx,1] * exp(S[s0idx,1] * theta))/
      sum(W[s0idx] * exp(S[s0idx,1] * theta))
  })
})
sum(g)
## [1] 0.002926372
# Which is equivalent to
ans$formulae$grad(c(coef(ans)["edges"], -1e100))
##             [,1]
## [1,] 0.002926372
## [2,] 0.000000000

Hessian

Just like in the other cases, we can rewrite the Hessian by distinguishing between \(\mathcal{S}_0\) and \(\mathcal{S}_1\). Without rewriting the whole expression (which would simply take too much space), and WLOG, the \((j,l)\)-th component of the limiting Hessian as \(\theta_u\to-\infty\) equals:

\[ \begin{align} & \frac{% -\left(\sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{s}_j'\mathbf{s}_l'\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p'\right\}\right) }{% \sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p'\right\}% } + \\ & \quad \frac{% \left(\sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{s}_j'\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p'\right\}\right) \left(\sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{s}_l'\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p'\right\}\right) }{% \left(\sum_{\mathbf{s}'\in\mathcal{S}_0}\mathbf{w}_{\mathbf{s}'}\text{exp}\left\{\sum_{p\not\in U}\theta_p\mathbf{s}_p'\right\}\right)^2% } \end{align} \]

In the case that either \(j\) or \(l\) equals \(u\), the Hessian component equals 0; for other components it is generally non-zero. Again, using the example with the triad 300 term, we can compute the Hessian as follows:

h <- with(ans$formulae, {
  sapply(1:nnets(ans), function(i) {
    # Preparing suff stats
    S <- t(t(stats_statmat[[i]]) - target_stats[i, ])
    W <- stats_weights[[i]]
    theta <- coef(ans)["edges"]
    
    # Index of members of S0
    s0idx <- which(S[,2] == 0)
    
    # First part
    - sum(W[s0idx] * S[s0idx,1]^2 * exp(S[s0idx,1] * theta))/
      sum(W[s0idx] * exp(S[s0idx,1] * theta)) + 
      # Second bit
      sum(W[s0idx] * S[s0idx,1] * exp(S[s0idx,1] * theta)) ^ 2/
      sum(W[s0idx] * exp(S[s0idx,1] * theta))^2
  })
})
sum(h)
## [1] -12.97199

Which should be equal to using the full Hessian function, but with a very large negative value for the parameter associated with the triad 300 statistic:

ans$formulae$hess(c(coef(ans)["edges"], -1e100))
##           [,1] [,2]
## [1,] -12.97199    0
## [2,]   0.00000    0

This last result is useful for then applying the Moore-Penrose generalized inverse to obtain a pseudo-covariance matrix, which in some cases can be better than nothing. Furthermore, the limiting expressions of the log-likelihood, gradient, and Hessian have fewer terms to consider, which in principle results in faster calculations.
