Influential Observations Diagnostics

Joshua French

2022-09-23

An influential observation dramatically changes the fitted model based on whether it is included or excluded from the analysis.

library(api2lm)

Influential observations are often extreme with respect to their response values or their associated predictors.

home_sales data

To facilitate our discussion we will use the home_sales data set in the api2lm package, which contains information about homes sold in King County, WA between May 2014 and May 2015. The data set is a data frame with 216 rows and 8 columns:

We load the data below.

data(home_sales, package = "api2lm")

The validity of the assumptions we want to check depend on the model fit to the data.

We will regress the price variable on all of the remaining variables in home_sales.

lmod <- lm(price ~ ., data = home_sales)
coef(lmod)
##        (Intercept)           bedrooms          bathrooms        sqft_living 
##       5.377543e+00       2.604201e-03       3.149792e-02       9.987617e-05 
##           sqft_lot             floors      waterfrontyes      conditionfair 
##      -8.773790e-07       8.630242e-02       3.152404e-01      -3.128967e-01 
##   conditionaverage      conditiongood conditionvery_good 
##      -1.278938e-01      -9.122793e-02      -2.185523e-02

Outliers

We describe two common approaches for identifying outliers.

The simplest approach for identifying outliers is to use an index plot of the studentized residuals to determine the observations with extreme residuals.

The other approach is to compare the set of studentized residuals to the appropriate quantile of a \(t_{n-p-1}\) distribution.

To evaluate whether a single observation is an outlier, we compare its studentized residual to \(t^{\alpha/2}_{n-p-1}\), i.e., the \(1-\alpha/2\) quantile of a \(t\) distribution with \(n-p-1\) degrees of freedom.

To identify the outliers from the \(n\) observations of our fitted model, we should adjust the previous idea using the Bonferroni correction to address the multiple comparisons problem.

Outlier example


The outlier_plot function is a convenient way to create an index plot of the studentized residuals from a fitted model.

outlier_plot has four main arguments that we need to know about:

We use the outlier_plot function to identify the 2 most extreme outliers for our fitted model. Observations 214 and 33 have the most extreme studentized residuals and observation 214 is more extreme than the Bonferroni-adjusted \(t\) quantile.

outlier_plot(lmod, id_n = 2)

The outlier_test function can be used to identify the observations with studentized residuals more extreme than the appropriate Bonferroni-corrected \(t\) quantiles. The main arguments to the outlier_test function are:

Running outlier_test on a fitted model will return the identified outliers and their Bonferroni-adjusted p-values (adj_pvalue). If the adjusted p-value is less than alpha, then the observation is declared an outlier.

We use outlier_test to see that observation 214 is an outlier according to this test.

outlier_test(lmod)
##          stat       pvalue adj_pvalue
## 214 -3.948387 5.414913e-05 0.01169621

Leverage Points

The leverage values are the diagonal elements of the hat matrix \(\mathbf{H}=\mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T\). The \(i\)th leverage value is given by \(h_i=\mathbf{H}_{i,i}\), the \(i\)th diagonal position of the hat matrix.

We can extract the leverage values of a fitted model using the hatvalues function.

An observation is declared to be a leverage point if its leverage value is unusually large.

Kutner et al. (2005) suggest two different thresholds for identifying a leverage point.

Leverage example


We can create an index plot of the leverage values of a fitted model using the leverage_plot function.

The leverage_plot function takes a few main arguments:

We use the leverage_plot function to identify leverage points for the model fit to the home_sales data.

leverage_plot(lmod)

Observations 33 and 179 are leverage points with leverage values greater than 0.5. Observation 214 has a leverage value nearly at 0.5, so we should consider it a leverage point.

Influential Observations

An influential observation dramatically changes the fitted model based on whether it is included or excluded from the analysis.

We will now discuss several other measures of influence.

DFBETA and DFBETAS

The DFBETA measure of influence is an \(n\times p\) matrix of statistics that measures the change in the estimated regression coefficients when we fit the model using all \(n\) observations versus when we fit the model leaving one observation out at a time.

The \(i\)th row of DFBETA is

\[ \text{DFBETA}_i = \hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)}. \]

Identifying influential observations using the DFBETA matrix can be difficult because the sampling variance associated with each coefficient can be very different.

The DFBETAS matrix is a transformation of the DFBETA matrix that has been scaled so that the individual statistics have similar sampling variability.

DFBETAS is also an \(n\times p\) matrix of statistics with the \(j\)th element of the \(i\)th row being defined as

\[\text{DFBETAS}_{i,j} = \frac{\hat{\beta}_{j-1} - \hat{\beta}_{j-1(i)}}{\hat{\mathrm{se}}(\hat{\beta}_{j-1})},\quad i=1,2,\ldots,n, \quad j=0,1,\ldots,p-1.\]

A DFBETAS statistic less than -1 or greater than 1 is often used as an indicator that the associated observation is influential in determining the fitted model.

One way of identifying influential observations is to create index plots the DFBETA or DFBETAs values for each regression coefficient and determine the estimates that substantially change when observation \(i\) is excluded from analysis.

The DFBETA and DFBETAS matrices can be obtained by using the dfbeta and dfbetas functions, respectively.

DFBETAS example


We can use the dfbetas_plot function to get index plots of the DFBETAS statistics for each regression coefficient.

Some of the main arguments to the dfbetas_plot function are:

Reference lines are automatically added at \(\pm 1\) to indicate observations whose DFBETAS values are particularly extreme.

We plot the bedrooms, bathrooms, sqft_living, and sqft_lot variables for the fitted model of our home_sales data and label the two most extreme observations.

dfbetas_plot(lmod, id_n = 2,
             regressors = ~ bedrooms + bathrooms + sqft_living + sqft_lot)

We can see from these plots that:

DFFITS

Welsch and Kuh (1977) proposed measuring an observation’s influence through the DFFITS statistic, which is the difference between its fitted value and its LOO predicted response value.

The DFFITS statistic for observations \(i\) defined as

\[\text{DFFITS}_i = \hat{Y}_i - \hat{Y}_{i(i)}.\]

Belsley et al. (2005) suggest that observation \(i\) is influential if \(|\text{DFFITS}_i|>2\sqrt{p/n}\).

An index plot of the DFFITS statistics for all observations will indicate the observations most influential in changing their associated predicted value.

The DFFITS statistics for each observation can be obtained using the dffits function.

DFFITS example


We can create an index plot of our DFFITS statistics using the dffits_plot function.

Some of the main arguments to the dffits_plot function are:

Horizontal reference lines are automatically added at \(\pm 2\sqrt{p/n}\) to identify influential observations.

We create an index plot of the DFFITS statistics below. We identify the observations with the 3 most extreme DFFITS statistics.

dffits_plot(lmod, id_n = 3)

We see that observations 33, 179, and 214 all have particularly extreme DFFITS statistics. Several other observations have statistics more extreme than \(\pm 1\) (run dffits_test(lmod) for a complete list).

Cook’s Distance

Cook (1977) proposed the Cook’s distance to summarize the potential influence of an observation with a single statistic.

The Cook’s distance for the \(i\)th observation is

\[D_i=\frac{\sum_{k=1}^n (Y_k - \hat{Y}_{k(i)})^2}{p \widehat{\sigma}^2} = \frac{1}{p} r_i^2 \frac{h_i}{1-h_i}.\]

\[ r_i = \frac{\hat{\epsilon}_i}{\widehat{\sigma}\sqrt{1-h_i}}. \]

An index plot of the Cook’s distances can be used to identify observations with unusually large Cook’s distances.

Cook’s distance example


An index plot of the Cook’s distances can be created using the cooks_plot function.

Some of the main arguments to the cooks_plot function are:

A horizontal reference line is automatically added at \(F^{0.5}_{p,n-p}\) to identify influential observations.

We create an index plot of the Cook’s distances below. We identify the observations with the 3 most extreme Cook’s distances.

cooks_plot(lmod, id_n = 3)

We see that observations 33, 179, and 214 all have particularly extreme Cook’s distances, indicating those observations are influential in some sense. This is consistent with our previous results.

Influence plots

An influence plot is a scatter plot of the studentized or standardized residuals versus the leverage values. The size of each point in the plot is proportional to either its Cook’s distance or DFFITS statistic.

In general, observations that are extreme with respect to their residuals, leverage values, or Cook’s distance/DFFITS statistics are more likely be be potentially influential.

Influence plot example


An influence plot can be created using the influence_plot function from the api2lm package.

Some of the main arguments to the influence_plot function are:

We create in influence plot for the model fit to the home_sales data below.

influence_plot(lmod)

References

Belsley, D. A., Kuh, E., & Welsch, R. E. (2005). Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley & Sons.

Cook, R. D. (1977). Detection of Influential Observation in Linear Regression. Technometrics, 19(1), 15–18. https://doi.org/10.2307/1268249

Kutner, Michael H, Christopher J Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models, 5th Edition. McGraw-Hill/Irwin, New York.

Welsch, R. E., & Kuh, E. (1977). Linear regression diagnostics (No. w0173). National Bureau of Economic Research.