Conditional moments tests where proposed by Tauchen (1985) and Newey (1985). As these tests are quite alike Lagrange multiplier (or score) tests, which is, along with the likelihood ratio and the Wald tests one of the three “classical” tests. These three tests have in common to test a set of hypothesis which enables to go from a unconstrained (or larger) model to a constrained (or smaller) model:

- the Wald test is based on the values of the vector implied the restrictions evaluated for the constrained model,
- the likelihood ratio is based on the comparison of the fits of both models (measured by the value of the log-likelihood function),
- the score test is based on the value of the score (ie the vector of the first derivatives) of the log-likelihood of the unconstrained models using the estimates of the constrained model.

For the special case where the hypothesis can be writen as a linear
combination of the coefficients, the unconstrained model can always be
parametrized in a way such that the likelihood function of the larger
model is: \(\ln L (\beta_1, \beta_2)\), with under H_{0}: \(\beta_2 = 0\)
and the larger model reduces to the smaller one. For the score test,
the statistic is based on the gradient of the log-likelihood computed
with the parameters obtained by estimating the constrained model:
\(\hat{\beta}_1 ^c, 0\), which is:

\[ \frac{\partial \ln L}{\partial \beta}(\hat{\beta}_1^c, 0) = \left( \begin{array}{c} \frac{\partial \ln L}{\partial \beta_1}(\hat{\beta}_1^c, 0) \\ \frac{\partial \ln L}{\partial \beta_2}(\hat{\beta}_1^c, 0) \end{array} \right) = \left( \begin{array}{c} 0 \\ \frac{\partial \ln L}{\partial \beta_2}(\hat{\beta}_1^c, 0) \end{array} \right) \]

As an example, consider a linear or log-linear gaussian model for which one wishes to test the normality hypothesis. For the smaller model, we have:

\[ y_n = \alpha + \beta x_n + \epsilon_n \mbox{ with } \epsilon_n \sim N(0, \sigma_\epsilon) \]

for the linear model and:

\[ \ln y_n = \alpha + \beta x_n + \epsilon_n \mbox{ with } \epsilon_n \sim N(0, \sigma_\epsilon) \]

for the log-linear model. A potential larger model consists on taking a Box-Cox transformation of \(y\), denoted \(y^{(\lambda)}\):

\[ y^{(\lambda)} = \alpha + \beta x_n + \epsilon_n \mbox{ with } \epsilon_n \sim N(0, \sigma_\epsilon) \]

with:

\[ y^{(\lambda)} = \left\{ \begin{array}{lcl} \frac{y_n^\lambda - 1}{\lambda} & \mbox{if} & \lambda \neq 0 \\ \ln y_n & \mbox{if} & \lambda = 0 \end{array} \right. \]

which obviously reduce to the smaller log-linear model for \(\lambda = 0\). For \(\lambda = 1\), \(y ^ {(\lambda)} = y - 1\) and the larger model reduce to the linear model except that \(1\) should be substracted from the left hand side of the equation, which means that the intercept of the linear model fitted by ordinary least squares should be reduced by 1.

The log-likelihood of the larger model is:

\[ \ln L (\beta, \sigma, \lambda) = -\frac{N}{2}\ln 2\pi - N \ln \sigma - (1 - \lambda) \sum_{n=1}^N \ln y_n - \frac{1}{2\sigma ^ 2}\sum_{n=1}^N\left(y^{(\lambda)} - \beta^\top x_n\right) ^ 2 \]

The gradient of the log-likelihood is:

\[ \frac{\partial \ln L}{\partial \theta}(\theta) = \left( \begin{array}{c} \frac{\partial \ln L}{\partial \beta}\\ \frac{\partial \ln L}{\partial \sigma}\\ \frac{\partial \ln L}{\partial \lambda}\\ \end{array} \right) = \left( \begin{array}{c} \frac{1}{\sigma ^ 2} \sum_{n=1}^N \left(y^{(\lambda)} - \beta^\top x_n\right) x_n\\ -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{n=1}^N \left(y^{(\lambda)} - \beta^\top x_n\right) ^ 2\\ \sum_{n=1} ^ N \ln y_n - \frac{1}{\sigma ^ 2}\sum_{n=1}^N \left(y_n ^ {(\lambda)} - \beta^\top x_n\right) \left(y_n^{(\lambda)} \ln y_n - \frac{y_n ^ {(\lambda)} - \ln y_n}{\lambda}\right) \end{array} \right) \]

For the constrained linear (\(\hat{\lambda}_{i} = 1\)) and log-linear (\(\hat{\lambda}_o = 0\)) models, we have respectively:

\[ \frac{\partial \ln L}{\partial \theta}(\hat{\theta}_i) = \left( \begin{array}{c} 0 \\ 0 \\ \sum_{n=1} ^ N \ln y_n - \frac{1}{2\hat{\sigma}_c ^ 2}\sum_{n=1}^N \left(\ln y_n - \hat{\beta}_c^\top x_n\right) \left(1 + y(\ln y - 1)\right) \end{array} \right) \]

\[ \frac{\partial \ln L}{\partial \theta}(\hat{\theta}_o) = \left( \begin{array}{c} 0 \\ 0 \\ \sum_{n=1} ^ N \ln y_n - \frac{1}{2\hat{\sigma}_c ^ 2}\sum_{n=1}^N \left(\ln y_n - \hat{\beta}_c^\top x_n\right) \ln y_n^2 \end{array} \right) \]

The following function returns the log-likelihood for a box-cox model,
given a vector of parameters `coef`

, a matrix of covariates `X`

and a
vector of response `y`

. It has further arguments:

`sum`

: if`FALSE`

, invidual contributions to the log-likelihood (a vector of length \(N\)) and to the gradient (a matrix of dimensions \(N\times K\) is returned,`gradient`

: if`TRUE`

the analytical gradient is computed and returned as an attribute,`hessian`

: if`TRUE`

the analytical hessian is computed and returned as an attribute.

```
lnL_bc <- function(coef, X, y, sum = FALSE, gradient = FALSE, hessian = FALSE){
y <- y
N <- length(y)
K <- length(coef) - 3
beta <- coef[1:(K + 1)]
sigma <- coef[K + 2]
lambda <- coef[K + 3]
bX <- as.numeric(X %*% beta)
if (lambda == 0){
bc <- log(y)
bc_l <- 1 / 2 * log(y) ^ 2
bc_ll <- 1 / 3 * log(y) ^ 3
}
else{
bc <- (y ^ lambda - 1) / lambda
bc_l <- bc * log(y) - (bc - log(y)) / lambda
bc_ll <- bc_l * log(y) - (bc_l * lambda - (bc - log(y))) / lambda ^ 2
}
e <- bc - bX
lnL <- - 1 / 2 * log(2 * pi) - log(sigma) - (1 - lambda) * log(y) -
1 / (2 * sigma ^ 2) * e ^ 2
if (gradient){
g_beta <- 1 / sigma ^ 2 * e * X
g_sigma <- - 1 / sigma + 1 / (sigma ^ 3) * e ^ 2
g_lambda <- log(y) - 1 / (sigma ^ 2) * e * bc_l
G <- cbind(g_beta, g_sigma, g_lambda)
}
if (hessian){
h_bb <- - 1 / sigma ^ 2 * crossprod(X)
h_bs <- apply(- 2 / sigma ^ 3 * X * e, 2, sum)
h_bl <- apply( 1 / sigma ^ 2 * bc_l * X, 2, sum)
h_ss <- sum(1 / sigma ^ 2 - 3 / sigma ^ 4 * e ^ 2)
h_sl <- sum(2 / sigma ^ 3 * e * bc_l)
h_ll <- sum(- 1 / sigma ^ 2 * (e * bc_ll + bc_l ^ 2))
H <- rbind(cbind(h_bb, sigma = h_bs, lambda = h_bl),
sigma = c(h_bs, h_ss, h_sl),
lambda = c(h_bl, h_sl, h_ll))
}
if (sum){
lnL <- sum(lnL)
if (gradient) G <- apply(G, 2, sum)
}
if (gradient) attr(lnL, "gradient") <- G
if (hessian) attr(lnL, "hessian") <- H
lnL
}
```

Consider as an example the `cars`

data set which is provided by base
**R**. The response is `dist`

(the stoping distance) and the only
covariate is `speed`

(the speed of the car). The smaller linear and
log-linear models can be computed using `lm`

:

```
lin_lm <- lm(dist ~ speed + I(speed ^ 2), cars)
log_lm <- update(lin_lm, log(.) ~ .)
```

as stated previously, the linear model is a special case of the box-cox model if one is substracted from the intercept.

```
coefs_lin <- coef(lin_lm)
coefs_lin[1] <- coefs_lin[1] - 1
```

To use the `lnL_bc`

function, we need to extract the response vector
and the covariates matrix:

```
X <- model.matrix(lin_lm)
y <- model.response(model.frame(lin_lm))
```

Finally we can evaluate the log-likelihood of the box-cox model for
the two special cases of the linear and the log-linear models using
the coefficients obtained using `lm`

(\(\sigma\) is a parameter in the
log-likelihood, it is extracted from the `lm`

models using the
provided `sigma2`

function which differs from the `sigma`

function by
the fact that the deviance is divided by the number of observations
and not the number of degrees of freedom of the regression).

```
sigma2 <- function(x) sigma(x) * sqrt(df.residual(x) / nobs(x))
coefs_lin <- c(coefs_lin, sigma = sigma2(lin_lm), lambda = 1)
coefs_log <- c(coef(log_lm), sigma = sigma2(log_lm), lambda = 0)
lnL_lin <- lnL_bc(coefs_lin, X, y, sum = TRUE, gradient = TRUE, hessian = FALSE)
lnL_log <- lnL_bc(coefs_log, X, y, sum = TRUE, gradient = TRUE, hessian = FALSE)
lnL_lin
## [1] -205.386
## attr(,"gradient")
## (Intercept) speed I(speed^2) g_sigma g_lambda
## 1.041376e-15 1.951564e-14 4.263673e-13 6.175616e-16 -2.184459e+01
lnL_log
## [1] -202.3903
## attr(,"gradient")
## (Intercept) speed I(speed^2) g_sigma g_lambda
## -2.936210e-13 -3.541334e-12 -4.704148e-11 2.620126e-13 2.956768e+01
```

According to the log-likelihood, the log-linear model fits fairly
better the data than the linear one. As expected, all the elements of
the gradient vector except the last one is 0. The derivative of the
log-likelihood with respect with \(\lambda\) is negative for the linear
model and positive for the log-linear model. This means that the
log-likelihood should increase for some values of \(\lambda\) between 0
and 1. To fit the box-cox log-likelihood model, we use the
`maxLik::maxLik`

function. It is a good general practise to provide
“good” (and different) starting values to unsures that the non-linear
optimization process converges, and that it converges to a global
maximum. We therefore run twice `maxLik`

, with starting values
corresponding to the linear and to the non-linear model:

```
library("maxLik")
bc_lin <- maxLik(lnL_bc, start = coefs_lin, X = X, y = y,
sum = TRUE, gradient = TRUE, hessian = TRUE)
bc_log <- maxLik(lnL_bc, start = coefs_log, X = X, y = y,
sum = TRUE, gradient = TRUE, hessian = TRUE)
```

We first check that the fitted model is the same for the two sets of starting values:

`cbind(coef(bc_lin), coef(bc_log))`

```
## [,1] [,2]
## (Intercept) 0.213367880 0.213370726
## speed 0.582112891 0.582111088
## I(speed^2) -0.005530642 -0.005530664
## sigma 1.365042140 1.365033776
## lambda 0.372872290 0.372870677
```

and we then summarize the fitted model:

`summary(bc_lin)`

```
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 15 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-Likelihood: -197.3795
## 5 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## (Intercept) 0.213368 1.351837 0.158 0.87459
## speed 0.582113 0.232726 2.501 0.01237 *
## I(speed^2) -0.005531 0.006108 -0.905 0.36524
## sigma 1.365042 0.650610 2.098 0.03590 *
## lambda 0.372872 0.131795 2.829 0.00467 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
```

As expected, the fitted value of \(\lambda\) lies between 0 and 1 and is statistically different from 0 and from 1. Note also that the log-likelihood of the box-cox model (-197.38) is larger than the one of the log-linear model (-202.39).

If the hypothesis corresponding to the smaller model are true, the distribution of the score vector is:

\[ g(\tilde{\theta}) \sim N(0, I(\theta)) \]

Were \(I\) is the information matrix. Therefore if H_{0} is true:

\[ g(\tilde{\theta}) ^ \top I(\theta) ^ {-1} g(\tilde{\theta}) \]

is a \(\chi^2\) with \(J\) degrees of freedom (\(J\) being the number of hypothesis, one in our example). \(I(\theta)=-\mbox{E}\left(\frac{\partial^2 \ln L}{\partial \theta \partial \theta^\top}(\theta)\right)\) needs to be estimate and two natural estimators are:

- the empirical variance of the gradient: \(\sum_{n=1} ^ N g(\tilde{\theta}, x_n, y_n) g(\tilde{\theta}, x_n, y_n)^\top = G(\tilde{\theta}, X, y) ^ \top G(\tilde{\theta}, X, y)\) (also called the outer-product of the gradient), where \(G\) is an \(N\times K\) matrix for which each line is the contribution of an observation to the gradient,
- the opposite of the hessian: \(-H(\tilde{\theta}) = -\frac{\partial^2 \ln L}{\partial \theta \partial \theta^\top}(\tilde{\theta})\).

The \(G\) matrix is obtained using the `lnL_bc`

function with `sum = FALSE`

and `gradient = TRUE`

and the \(H\) matrix using `hessian = TRUE`

.

```
lnL_lin <- lnL_bc(coefs_lin, X, y, sum = FALSE, gradient = TRUE, hessian = TRUE)
lnL_log <- lnL_bc(coefs_log, X, y, sum = FALSE, gradient = TRUE, hessian = TRUE)
G_lin <- attr(lnL_lin, "gradient")
H_lin <- attr(lnL_lin, "hessian")
G_log <- attr(lnL_log, "gradient")
H_log <- attr(lnL_log, "hessian")
g_lin <- apply(G_lin, 2, sum)
g_log <- apply(G_log, 2, sum)
```

The test statistics can then be computed using any of the two estimators of the information matrix:

```
as.numeric(g_lin %*% solve(- H_lin) %*% g_lin)
## [1] 22.48473
as.numeric(g_lin %*% solve(crossprod(G_lin)) %*% g_lin)
## [1] 18.08425
as.numeric(g_log %*% solve(- H_log) %*% g_log)
## [1] 8.797853
as.numeric(g_log %*% solve(crossprod(G_log)) %*% g_log)
## [1] 9.428458
```

The two variants of the same test statistics are very close whatever the approximation of the information matrix used. The critical value of a \(\chi^2\) with one degree of freedom at the 5% level being 3.84, the two hypothesis of linear and log normality are strongly rejected.

Note that \(g(\tilde{\theta}) = G(\tilde{\theta})^\top \iota\) where \(\iota\) is a vector of one of length \(N\). Therefore the outer product variant of the test is:

\[ \iota^\top G(\tilde{\theta}) \left[G(\tilde{\theta}) ^ \top G(\tilde{\theta})\right] ^ {-1} G(\tilde{\theta}) ^ \top \iota = \iota^\top M_G(\tilde{\theta}) \iota \]

where \(M_G(\tilde{\theta}) = G(\tilde{\theta})\left[G(\tilde{\theta}) ^ \top G(\tilde{\theta})\right] ^ {-1} G(\tilde{\theta}) ^ \top\) is the projection matrix on the columns of \(G(\tilde{\theta})\). \(M_G(\tilde{\theta}) \iota\) is therefore the fitted values of the regression of a column of ones with the columns of \(G(\tilde{\theta})\) and, as \(M_G(\tilde{\theta})\) is idempotent, the test statistic is the sum of the fitted values of this regression. The (uncetered as there is no intercept in this regression) \(R^2\) is this sum of the fitted values divided by the total variation of the response, which is equal to \(N\) (the response being a vector of ones of length \(N\)). Therefore, the test statistic can be obtained as \(N\) times the \(R^2\) of the regression of \(1\) one on the columns of \(G\):

```
iota <- rep(1, length(y))
lm_ones_lin <- lm(iota ~ G_lin - 1)
lm_ones_log <- lm(iota ~ G_log - 1)
summary(lm_ones_lin)$r.squared * nobs(lm_ones_lin)
## [1] 18.08425
summary(lm_ones_log)$r.squared * nobs(lm_ones_log)
## [1] 9.428458
```

The conditional moment test approach consists on constructing a vector
of moments which are 0 under H_{0}. Denoting \(\mu_n\) such a vector, we
have, for the normality hypothesis:

\[ \mu_n(\theta) = \left( \begin{array}{c} \mu_{3n}\\ \mu_{4n} \end{array} \right)= \left(\begin{array}{c} \mbox{E}(\epsilon_n ^ 3) \\ \mbox{E}(\epsilon_n ^ 4 - 3 \sigma ^ 4) \end{array}\right) = 0 \]

Denoting \(\hat{\epsilon}_n = y_n - \tilde{\beta} ^ \top x_n\) or \(\hat{\epsilon}_n = \ln y_n - \tilde{\beta} ^ \top x_n\) the residuals of the linear or the log-linear model, the empirical moments are:

\[ m_n(\tilde{\theta}) = \left( \begin{array}{c} \hat{\epsilon}_n ^ 3 \\ \hat{\epsilon}_n ^ 4 - 3 \sigma ^ 4 \end{array} \right) \]

Define : \(M (\tilde{\theta}) ^ \top = \left(m_1, m_2, \ldots, m_N\right)\) an \(N\times r\) matrix for which each line is the moment vector for on observation and \(W(\theta)= \frac{1}{N}\sum_n \mbox{E}\left(\frac{\partial m_n}{\partial \theta}(\theta)\right)\) a \(K \times r\) matrix containing the derivatives of the empirical moments with the parameters of the model. Defining:

\[ Q = \left[M(\tilde{\theta}) - G(\tilde{\theta}) I(\theta) ^ {-1} W(\theta)\right]^\top \left[M(\tilde{\theta}) - G(\tilde{\theta}) I(\theta) ^ {-1} W(\theta)\right] \]

the test statistic is:

\[ m(\tilde{\theta}) ^ \top Q ^ {-1} m(\tilde{\theta}) \]

and under H_{o} follows a \(\chi^2\) with \(r\) (the length of \(m\)) degrees
of freedom.

\(Q\) plays the role of the information matrix for the score test. As previously, different flavours of the test can be obtained depending on how \(Q\) is estimated:

- numerically, using as previously \(\hat{I}_N = G(\tilde{\theta}) ^ \top G(\tilde{\theta})\) and \(\hat{W}_N = G(\tilde{\theta}) ^ \top M(\tilde{\theta})\),
- using the analytical derivatives, using as previously \(\hat{I}_A = - H(\tilde{\theta})\) and \(\hat{W}_A = - \frac{\partial M}{\partial \theta}(\tilde{\theta})\)

In this latter case, the test statistic can also be obtained as \(N\) times the \(R^2\) of the regression of a column of one on the columns of \(G\) and \(M\). The following function performs the test.

```
cm_norm <- function(x, type = c("analytical", "opg", "reg")){
type <- match.arg(type)
X <- model.matrix(x)
y <- model.response(model.frame(x))
N <- length(y)
K <- length(coef(x))
coefs <- c(coef(x), sigma = sigma2(x))
beta <- coefs[1:K]
sigma <- coefs[K + 1]
epsilon <- as.numeric(y - X %*% beta)
G <- cbind(1 / sigma ^ 2 * X * epsilon,
sigma = - 1 / sigma + 1 / sigma ^ 3 * epsilon ^ 2)
M <- cbind(asym = epsilon ^ 3, kurt = epsilon ^ 4 - 3 * sigma ^ 4)
m <- apply(M, 2, sum)
if (type == "analytical"){
I <- - rbind(cbind(- 1 / sigma ^ 2 * crossprod(X),
sigma = - 2 / sigma ^ 3 * apply(X * epsilon, 2, sum)),
sigma = c(- 2 / sigma ^ 3 * apply(X * epsilon, 2, sum),
1 / sigma ^ 2 - 3 / sigma ^ 4 * sum(epsilon ^ 2)))
W <- - rbind(cbind(asym = - 3 * apply(epsilon ^ 2 * X, 2, sum),
kurt = - 4 * apply(epsilon ^ 3 * X, 2, sum)),
sigma = c(0, - 12 * N / sigma ^ 3))
}
if (type == "opg"){
I <- crossprod(G)
W <- crossprod(G, M)
}
if (type != "reg"){
Q <- crossprod(M - G %*% solve(I) %*% W)
stat <- as.numeric(m %*% solve(Q) %*% m)
}
else stat <- summary(lm(rep(1, N) ~ G + M - 1))$r.squared * N
stat
}
```

For the linear model, we get:

```
cm_norm(lin_lm, "analytical")
## [1] 12.2857
cm_norm(lin_lm, "opg")
## [1] 13.41273
cm_norm(lin_lm, "reg")
## [1] 13.41273
```

and the normality hypothesis is therefore strongly rejected. For the log-linear model:

```
cm_norm(log_lm, "analytical")
## [1] 0.07940664
cm_norm(log_lm, "opg")
## [1] 0.6255081
cm_norm(log_lm, "reg")
## [1] 0.6255081
```

on the contrary, the normality hypothesis is not rejected.

Newey, Whitney K. 1985. “Maximum Likelihood Specification Testing and Conditional Moment Tests.” *Econometrica* 53 (5): 1047–70. https://www.jstor.org/stable/1911011.

Tauchen, George. 1985. “Diagnostic Testing and Evaluation of Maximum Likelihood Models.” *Journal of Econometrics* 30 (1): 415–43. https://doi.org/https://doi.org/10.1016/0304-4076(85)90149-6.