--- title: "Selecting Cases In lavaan_rerun" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Selecting Cases In lavaan_rerun} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette illustrates how to select cases when calling `lavaan_rerun()` from the package [semfindr](https://sfcheung.github.io/semfindr/). Instead of refitting the model *n* times, each time with each case removed, it is possible to use some criteria to select cases to be included in the analysis. This is useful when the time to fit a model is long and/or the sample size is large. This can also be used together with the approximate approach: Use the approximate approach to identify potentially influential case and then compute the exact influence statistics for these cases. See `vignette("casewise_scores", package = "semfindr")` for further information on the the approximate approach. The sample data set `pa_dat` will be used for illustration: ```{r} library(semfindr) dat <- pa_dat # Add case id dat <- cbind(id = paste0("case", seq_len(nrow(dat))), dat) head(dat) ``` The following model is fitted to the data set: ```{r} library(lavaan) mod <- " iv1 ~~ iv2 m1 ~ iv1 + iv2 dv ~ m1 " fit <- sem(mod, dat) ``` # Row Numbers or Case IDs Suppose, for some reasons, users want to refit a model only with selected rows removed. For example, only rows 1, 4, 15, and 18 should be selected. This can be done using the argument `to_rerun` of `lavaan_rerun()`: ```{r} rerun_out <- lavaan_rerun(fit, to_rerun = c(1, 4, 15, 18)) ``` Only four reruns in the output: ```{r} rerun_out est_change(rerun_out) ``` If user supplied case IDs are used, then the value of `to_rerun` should be a vector of these case IDs: ```{r} rerun_out <- lavaan_rerun(fit, case_id = dat$id, to_rerun = c("case1", "case4", "case15", "case18")) ``` Only four reruns in the output. User supplied case IDs are used in the output: ```{r} rerun_out est_change(rerun_out) ``` # Mahalanobis Distance on Residuals Users can select cases using their rankings on the Mahalanobis distance computed using the regression-based residuals. This is possible only for models with observed variables only (i.e., path models). This is analogous to selecting cases based on their residuals in a multiple regression model. A path model can have more than one endogenous variable. The residuals of a case on all endogenous variables will be computed (as differences between observed scores and implied scores computed by `implied_scores()`), and the Mahalanobis distance will be computed using these residuals. This is done using the argument `resid_md_top`. Users specify the top *x* cases on this distance to be selected for refitting a model. ```{r} rerun_out <- lavaan_rerun(fit, case_id = dat$id, resid_md_top = 5) ``` Five cases were selected, as shown below: ```{r} rerun_out est_change(rerun_out) ``` Note that selecting cases by this method *can* miss some influential cases. As in multiple regression, a case that is influential on the results needs not be a case that is poorly predicted by the exogenous variables. Therefore, this method should be used with caution. # Mahalanobis Distance on All Variables Users can select cases using their rankings on the Mahalanobis distance computed using all observed variables. This is done using the argument `md_top`. Users specify the top *x* cases on this distance to be selected for refitting a model. ```{r} rerun_out <- lavaan_rerun(fit, case_id = dat$id, md_top = 5) ``` Five cases were selected, as shown below: ```{r} rerun_out est_change(rerun_out) ``` Note that selecting cases by this method *can* miss some influential cases (Pek & MacCallum, 2011). Unlike multiple regression, this distance is not a measure of leverage. For a path model, this distance used distances from the centroid on *all* observed variables, including exogenous variables and endogenous variables. For a model with latent factors, this distance is affected by both residuals and values predicted by the latent factors. Therefore, this method should be used with caution. # Final Remarks If feasible, it is recommended to refit a model once for each case, such that the influential of all cases can be considered together. The methods above are included when the processing time is slow and so only selected cases are to be explored. For the final model(s), `lavaan_rerun()` using all cases are recommended, to serve as a final check on the sensitivity of the results to individual cases. # Reference Pek, J., & MacCallum, R. (2011). Sensitivity analysis in structural equation models: Cases and their influence. *Multivariate Behavioral Research, 46*(2), 202–228. https://doi.org/10.1080/00273171.2011.561068