This article demonstrates how to use lav_betaselect() from the package betaselectr to standardize selected variables in a model fitted by lavaan and to form confidence intervals for the parameters.
The sample dataset from the package betaselectr will be used in this demonstration:
library(betaselectr)
head(data_test_medmod)
#> dv iv mod med cov1 cov2
#> 1 7.487873 11.42573 16.65805 42.28988 54.14051 15.56069
#> 2 8.474931 16.64790 22.66332 42.08692 39.21125 17.61286
#> 3 11.206539 14.81278 22.80955 32.76869 31.97963 20.77333
#> 4 10.148827 15.79632 22.94451 43.96807 42.72187 15.66971
#> 5 7.421606 14.29621 24.51562 37.10942 42.74174 21.97132
#> 6 6.846435 12.00819 25.22163 35.46051 30.85914 22.35444
This is the path model, fitted by lavaan::sem():
library(lavaan)
#> This is lavaan 0.6-19
#> lavaan is FREE software! Please report any bugs.
mod <-
"
med ~ iv + mod + iv:mod + cov1 + cov2
dv ~ med + iv + cov1 + cov2
"
fit <- sem(mod,
data_test_medmod)
The model has a moderator, mod, posited to moderate the effect from iv to med. The product term is iv:mod.
These are the results:
summary(fit)
#> lavaan 0.6-19 ended normally after 2 iterations
#>
#> Estimator ML
#> Optimization method NLMINB
#> Number of model parameters 11
#>
#> Number of observations 200
#>
#> Model Test User Model:
#>
#> Test statistic 2.303
#> Degrees of freedom 2
#> P-value (Chi-square) 0.316
#>
#> Parameter Estimates:
#>
#> Standard errors Standard
#> Information Expected
#> Information saturated (h1) model Structured
#>
#> Regressions:
#> Estimate Std.Err z-value P(>|z|)
#> med ~
#> iv -6.373 0.985 -6.473 0.000
#> mod -3.899 0.614 -6.346 0.000
#> iv:mod 0.286 0.039 7.340 0.000
#> cov1 -0.093 0.070 -1.327 0.185
#> cov2 0.242 0.133 1.823 0.068
#> dv ~
#> med 0.092 0.011 8.098 0.000
#> iv 0.227 0.038 5.896 0.000
#> cov1 -0.006 0.013 -0.454 0.650
#> cov2 0.030 0.025 1.230 0.219
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|)
#> .med 60.292 6.029 10.000 0.000
#> .dv 2.087 0.209 10.000 0.000
We can request the standardized solution using lavaan::standardizedSolution():
standardizedSolution(fit,
output = "text")
#>
#> Regressions:
#> est.std Std.Err z-value P(>|z|) ci.lower ci.upper
#> med ~
#> iv -1.855 0.259 -7.158 0.000 -2.363 -1.347
#> mod -1.956 0.280 -6.988 0.000 -2.504 -1.407
#> iv:mod 3.588 0.428 8.390 0.000 2.750 4.426
#> cov1 -0.077 0.058 -1.332 0.183 -0.189 0.036
#> cov2 0.105 0.057 1.836 0.066 -0.007 0.218
#> dv ~
#> med 0.459 0.052 8.845 0.000 0.357 0.560
#> iv 0.331 0.053 6.279 0.000 0.228 0.434
#> cov1 -0.024 0.054 -0.454 0.650 -0.130 0.081
#> cov2 0.066 0.054 1.233 0.218 -0.039 0.171
#>
#> Variances:
#> est.std Std.Err z-value P(>|z|) ci.lower ci.upper
#> .med 0.656 0.050 13.243 0.000 0.559 0.753
#> .dv 0.569 0.050 11.353 0.000 0.471 0.667
However, for this model, there are several problems:
The product term, iv:mod, is also standardized. This is inappropriate. One simple but underused solution is to standardize the variables before forming the product term (Friedrich, 1982).
The confidence intervals are formed using the delta method, which has been found to be inferior to methods such as the nonparametric percentile bootstrap confidence interval for the standardized solution (Falk, 2018). Although there are situations in which the delta-method confidence intervals and the nonparametric percentile bootstrap confidence intervals can be similar (e.g., the sample size is large and the sample estimates are not extreme), it is still safer to at least try both methods and compare the results.
There are cases in which some variables are measured in meaningful units and do not need to be standardized. For example, if cov1 is age measured in years, then age is more meaningful than “standardized age”.
In path analysis, categorical variables are usually represented by dummy variables, each of them having only two possible values (0 or 1). It is not meaningful to standardize the dummy variables.
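The second problem above concerns how the intervals are formed. As a rough illustration only, the following base R sketch forms a nonparametric percentile bootstrap confidence interval for a standardized regression slope. The data are simulated, and the names dat, std_slope, n_boot, and boot_est are hypothetical. The key point is that the variables are re-standardized within each bootstrap sample, so the sampling variation of the standard deviations is reflected in the interval. This is not the package's implementation.

```r
# Simulated data (an assumption for illustration; not the package's dataset)
set.seed(2345)
n <- 200
x <- rnorm(n, mean = 15, sd = 2)
y <- 2 + 0.3 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Standardize all columns of d, then return the slope of y on x
std_slope <- function(d) {
  z <- as.data.frame(scale(d))
  coef(lm(y ~ x, data = z))[["x"]]
}

# Point estimate from the original sample
est <- std_slope(dat)

# Resample rows with replacement; standardization is redone in each sample
n_boot <- 1000
boot_est <- replicate(n_boot, {
  i <- sample.int(nrow(dat), replace = TRUE)
  std_slope(dat[i, , drop = FALSE])
})

# Percentile interval: the 2.5th and 97.5th percentiles of the estimates
ci_boot <- quantile(boot_est, probs = c(0.025, 0.975))
est
ci_boot
```

The interval can be asymmetric around the point estimate, which is one way in which it can differ from a delta-method interval.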
lav_betaselect()
The function lav_betaselect() can be used to solve these problems by:
standardizing variables before product terms are formed,
standardizing only variables for which standardization can facilitate interpretation, and
forming confidence intervals that take into account selected standardization.
We call the coefficients computed by this kind of standardization betas-select (\(\beta{s}_{Select}\), \(\beta_{Select}\) in singular form), to differentiate them from coefficients computed by standardizing all variables, including product terms.
Suppose we only need to solve the first problem, with the product term computed after iv and mod are standardized:
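The call itself is not shown at this point. Judging from the output ("Standard Error: Nil") and from the later call, which adds arguments to lav_betaselect(fit), it is presumably the default call with only the fitted model supplied. A minimal sketch under that assumption, with the model refitted so that the block is self-contained:

```r
library(lavaan)
library(betaselectr)

# Same path model as above
mod <-
"
med ~ iv + mod + iv:mod + cov1 + cov2
dv ~ med + iv + cov1 + cov2
"
fit <- sem(mod, data_test_medmod)

# Assumed default call: no standard errors or confidence intervals requested
fit_beta <- lav_betaselect(fit)
fit_beta
```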
This is the output if printed using the default options:
#>
#> Selected Standardization:
#>
#> Standard Error: Nil
#>
#> Parameter Estimates Settings:
#>
#> Standard errors: Standard
#> Information: Expected
#> Information saturated (h1) model: Structured
#>
#> Regressions:
#> BetaSelect
#> med ~
#> iv -1.855
#> mod -1.956
#> iv:mod 0.400
#> cov1 -0.077
#> cov2 0.105
#> dv ~
#> med 0.459
#> iv 0.331
#> cov1 -0.024
#> cov2 0.066
#>
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, dv, iv, med, mod
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#> original estimates and betas-select.
#> - Product terms (iv:mod) have variables standardized before computing
#> them. The product term(s) is/are not standardized.
Compared to the solution with the product term standardized, the coefficient of iv:mod changed substantially from 3.588 to 0.400. As shown by Cheung et al. (2022), the coefficient of the standardized product term (iv:mod) can be substantially different from that of the properly standardized product term (the product of standardized iv and standardized mod).
The footnote also indicates which variables are standardized and notes that product terms are formed after standardization.
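Why the two versions of the product-term coefficient differ can be seen in a small base R sketch. It uses simulated data and lm() rather than the package or its dataset, and the names x, w, y, and dat are hypothetical. Version (a) rescales the coefficient as if the product x * w were an ordinary variable. Version (b) standardizes x, w, and y first and then forms the product.

```r
# Simulated data with nonzero means (an assumption for illustration)
set.seed(1234)
n <- 200
x <- rnorm(n, mean = 15, sd = 2)
w <- rnorm(n, mean = 20, sd = 3)
y <- 5 - 0.6 * x - 0.4 * w + 0.03 * x * w + rnorm(n)
dat <- data.frame(x = x, w = w, y = y)

# Unstandardized coefficient of the product term
fit_raw <- lm(y ~ x * w, data = dat)
b_xw <- coef(fit_raw)[["x:w"]]

# (a) Treat the product as a variable and standardize it
beta_product_std <- b_xw * sd(dat$x * dat$w) / sd(dat$y)

# (b) Standardize x, w, and y first, then form the product
dat_z <- as.data.frame(scale(dat))
fit_z <- lm(y ~ x * w, data = dat_z)
beta_std_first <- coef(fit_z)[["x:w"]]

# (b) is the same as rescaling by the product of the two SDs
beta_rescaled <- b_xw * sd(dat$x) * sd(dat$w) / sd(dat$y)

c(product_standardized = beta_product_std,
  standardized_first = beta_std_first,
  rescaled = beta_rescaled)
```

The standard deviation of x * w depends on the means of x and w, while the product of the two standard deviations does not, which is why (a) can be much larger than (b) when the means are far from zero.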
Suppose we want to address both the first and the second problems, with the product term computed after iv and mod are standardized, and with bootstrap confidence intervals that take into account the sampling variation of the standardizers (the standard deviations).
We can call lav_betaselect() again, with additional arguments set:
fit_beta <- lav_betaselect(fit,
std_se = "bootstrap",
bootstrap = 5000,
iseed = 2345,
parallel = "snow",
ncpus = 20)
#> Finding product terms in the model ...
#> Finished finding product terms.
#>
#> Compute bootstrapping standardized solution:
These are the additional arguments:
std_se: The method to compute the standard errors as well as the confidence intervals. Set to "bootstrap" for nonparametric bootstrapping.
bootstrap: The number of bootstrap samples to draw (5000 in this example).
iseed: The seed for the random number generator used for bootstrapping. Set this to an integer to make the results reproducible.
parallel: The method to be used for parallel processing. It will be passed to lavaan::bootstrapLavaan(). Supported values are "none", "snow", and "multicore".
ncpus: The number of CPU cores to use if parallel is not "none". Default is parallel::detectCores(logical = FALSE) - 1, or the number of physical cores minus one.
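The value ncpus = 20 used above suits the machine on which the example was run. As a hedged sketch, a value can be derived on the current machine in the same way as the stated default, with a fallback because detectCores() can return NA on some platforms. The names n_physical and ncpus_to_use are hypothetical.

```r
# Number of physical cores; may be NA if it cannot be determined
n_physical <- parallel::detectCores(logical = FALSE)

# Leave one core free, but never request fewer than one
ncpus_to_use <- if (is.na(n_physical)) 1L else max(1L, n_physical - 1L)
ncpus_to_use
```

The result can then be supplied as ncpus in the call to lav_betaselect().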
This is the output if printed with default options:
#>
#> Selected Standardization:
#>
#> Standard Error: Nonparametric bootstrap
#> Bootstrap samples: 5000
#> Confidence Interval: Percentile
#> Level of Confidence: 95.0%
#>
#> Parameter Estimates Settings:
#>
#> Standard errors: Standard
#> Information: Expected
#> Information saturated (h1) model: Structured
#>
#> Regressions:
#> BetaSelect SE Z p-value Sig CI.Lo CI.Hi CI.Sig
#> med ~
#> iv -1.855 0.248 -7.490 0.000 *** -2.307 -1.332 Sig.
#> mod -1.956 0.281 -6.950 0.000 *** -2.453 -1.348 Sig.
#> iv:mod 0.400 0.047 8.565 0.000 *** 0.298 0.481 Sig.
#> cov1 -0.077 0.057 -1.353 0.185 -0.186 0.038 n.s.
#> cov2 0.105 0.061 1.725 0.094 . -0.019 0.219 n.s.
#> dv ~
#> med 0.459 0.052 8.828 0.000 *** 0.348 0.553 Sig.
#> iv 0.331 0.051 6.480 0.000 *** 0.229 0.431 Sig.
#> cov1 -0.024 0.058 -0.418 0.686 -0.137 0.090 n.s.
#> cov2 0.066 0.058 1.139 0.259 -0.050 0.178 n.s.
#>
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, dv, iv, med, mod
#> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> - Standard errors, p-values, and confidence intervals are not computed
#> for betas-select which are fixed in the standardized solution.
#> - P-values for betas-select are asymmetric bootstrap p-value computed
#> by the method of Asparouhov and Muthén (2021).
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#> original estimates and betas-select.
#> - Product terms (iv:mod) have variables standardized before computing
#> them. The product term(s) is/are not standardized.
In this dataset, with 200 cases, the delta-method confidence intervals are close to the bootstrap confidence intervals, except obviously for the product term because the coefficient of the product term has substantially different values in the two solutions.
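To make the comparison concrete, the limits printed above can be placed side by side. The numbers below are copied by hand from the two outputs, with the product term left out, and the data frame ci and its column names are only for illustration.

```r
# Limits copied from the delta-method and bootstrap outputs above
ci <- data.frame(
  path = c("med~iv", "med~mod", "med~cov1", "med~cov2",
           "dv~med", "dv~iv", "dv~cov1", "dv~cov2"),
  delta_lo = c(-2.363, -2.504, -0.189, -0.007, 0.357, 0.228, -0.130, -0.039),
  delta_hi = c(-1.347, -1.407, 0.036, 0.218, 0.560, 0.434, 0.081, 0.171),
  boot_lo = c(-2.307, -2.453, -0.186, -0.019, 0.348, 0.229, -0.137, -0.050),
  boot_hi = c(-1.332, -1.348, 0.038, 0.219, 0.553, 0.431, 0.090, 0.178)
)

# Differences between the bootstrap and delta-method limits
ci$diff_lo <- ci$boot_lo - ci$delta_lo
ci$diff_hi <- ci$boot_hi - ci$delta_hi
ci

# Largest absolute difference across all limits
max_diff <- max(abs(c(ci$diff_lo, ci$diff_hi)))
max_diff
```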
Suppose we also want to address the third issue and standardize only some of the variables. This can be done using either to_standardize or not_to_standardize.
Use to_standardize when the number of variables to standardize is much smaller than the number of variables not to standardize.
Use not_to_standardize when the number of variables to standardize is much larger than the number of variables not to standardize.
For example, suppose we only need to standardize iv, med, cov1, and cov2. This is the call to do this, setting to_standardize to c("iv", "med", "cov1", "cov2"):
fit_beta_select_1 <- lav_betaselect(fit,
                                    std_se = "bootstrap",
                                    to_standardize = c("iv", "med", "cov1", "cov2"),
                                    bootstrap = 5000,
                                    iseed = 2345,
                                    parallel = "snow",
                                    ncpus = 20)
If we want to standardize all variables except for dv and mod, we can use this call, and set not_to_standardize to c("mod", "dv"):
fit_beta_select_2 <- lav_betaselect(fit,
std_se = "bootstrap",
not_to_standardize = c("mod", "dv"),
bootstrap = 5000,
iseed = 2345,
parallel = "snow",
ncpus = 20)
The results of these calls are identical, and only those of the second version are printed:
#> Selected Standardization:
#>
#> Standard Error: Nonparametric bootstrap
#> Bootstrap samples: 5000
#> Confidence Interval: Percentile
#> Level of Confidence: 95.0%
#>
#> Parameter Estimates Settings:
#>
#> Standard errors: Standard
#> Information: Expected
#> Information saturated (h1) model: Structured
#>
#> Regressions:
#> BetaSelect SE Z p-value Sig CI.Lo CI.Hi CI.Sig
#> med ~
#> iv -1.855 0.248 -7.490 0.000 *** -2.307 -1.332 Sig.
#> mod -0.407 0.059 -6.950 0.000 *** -0.510 -0.280 Sig.
#> iv:mod 0.083 0.010 8.565 0.000 *** 0.062 0.100 Sig.
#> cov1 -0.077 0.057 -1.353 0.185 -0.186 0.038 n.s.
#> cov2 0.105 0.061 1.725 0.094 . -0.019 0.219 n.s.
#> dv ~
#> med 0.878 0.116 7.567 0.000 *** 0.635 1.092 Sig.
#> iv 0.634 0.100 6.337 0.000 *** 0.430 0.826 Sig.
#> cov1 -0.047 0.112 -0.418 0.686 -0.265 0.168 n.s.
#> cov2 0.126 0.111 1.137 0.259 -0.093 0.345 n.s.
#>
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, iv, med
#> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> - Standard errors, p-values, and confidence intervals are not computed
#> for betas-select which are fixed in the standardized solution.
#> - P-values for betas-select are asymmetric bootstrap p-value computed
#> by the method of Asparouhov and Muthén (2021).
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#> original estimates and betas-select.
#> - Product terms (iv:mod) have variables standardized before computing
#> them. The product term(s) is/are not standardized.
The footnotes show that, by specifying that dv and mod are not standardized, all the other four variables are standardized: iv, med, cov1, and cov2. Therefore, in this case, it is more convenient to use not_to_standardize.
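The two arguments describe complementary sets of variables. A small base R sketch of that relationship for this model, where the vector names all_vars, not_std, and to_std are hypothetical:

```r
# All variables in the model (the product term is not listed separately)
all_vars <- c("dv", "iv", "mod", "med", "cov1", "cov2")

# Variables to leave in their original units
not_std <- c("mod", "dv")

# The complement is the set that to_standardize would need to list
to_std <- setdiff(all_vars, not_std)
to_std
```

Listing the shorter of the two sets keeps the call easier to read and to check.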
When reporting betas-select, researchers need to state which variables are standardized and which are not. This can be done in table notes, or in a column of the parameter estimate tables.
The output of lav_betaselect() can be printed with show_Bs.by set to TRUE to demonstrate the second approach:
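The print call is not shown at this point. A minimal sketch of what it could look like follows. To keep it quick to run, it assumes a refit without bootstrapping, so the SE and confidence interval columns in the output below would be absent. The object name fit_sel is hypothetical.

```r
library(lavaan)
library(betaselectr)

# Same path model as above
mod <-
"
med ~ iv + mod + iv:mod + cov1 + cov2
dv ~ med + iv + cov1 + cov2
"
fit <- sem(mod, data_test_medmod)

# Selective standardization without bootstrapping (an assumption made for speed)
fit_sel <- lav_betaselect(fit,
                          not_to_standardize = c("mod", "dv"))

# Add the column listing the variables standardized for each coefficient
print(fit_sel, show_Bs.by = TRUE)
```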
#> Regressions:
#> BetaSelect SE Z p-value Sig CI.Lo CI.Hi CI.Sig Selected
#> med ~
#> iv -1.855 0.248 -7.490 0.000 *** -2.307 -1.332 Sig. iv,med
#> mod -0.407 0.059 -6.950 0.000 *** -0.510 -0.280 Sig. med
#> iv:mod 0.083 0.010 8.565 0.000 *** 0.062 0.100 Sig. iv,med
#> cov1 -0.077 0.057 -1.353 0.185 -0.186 0.038 n.s. cov1,med
#> cov2 0.105 0.061 1.725 0.094 . -0.019 0.219 n.s. cov2,med
#> dv ~
#> med 0.878 0.116 7.567 0.000 *** 0.635 1.092 Sig. med
#> iv 0.634 0.100 6.337 0.000 *** 0.430 0.826 Sig. iv
#> cov1 -0.047 0.112 -0.418 0.686 -0.265 0.168 n.s. cov1
#> cov2 0.126 0.111 1.137 0.259 -0.093 0.345 n.s. cov2
#>
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, iv, med
#> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> - Standard errors, p-values, and confidence intervals are not computed
#> for betas-select which are fixed in the standardized solution.
#> - P-values for betas-select are asymmetric bootstrap p-value computed
#> by the method of Asparouhov and Muthén (2021).
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#> original estimates and betas-select.
#> - The column 'Selected' lists variable(s) standardized when computing
#> the standardized coefficient of a parameter. ('NA' for user-defined
#> parameters because they are computed from other standardized
#> parameters.)
#> - Product terms (iv:mod) have variables standardized before computing
#> them. The product term(s) is/are not standardized.
When calling lav_betaselect(), variables with only two values in the dataset are assumed to be categorical and will not be standardized by default. This can be overridden by setting skip_categorical_x to FALSE, though this is not recommended.
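The rule described above can be mimicked in base R to check in advance which variables in a dataset would be treated as categorical. This is a sketch of the rule as stated, not the package's own code. The helper is_two_valued and the toy data frame are hypothetical.

```r
# TRUE if a variable has exactly two distinct non-missing values
is_two_valued <- function(x) {
  length(unique(x[!is.na(x)])) == 2
}

# Toy data: a continuous variable and a dummy variable
dat <- data.frame(
  age = c(23, 35, 41, 29, 52),
  group = c(0, 1, 1, 0, 1)
)

two_valued <- sapply(dat, is_two_valued)
two_valued
```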
In structural equation modeling, there are situations in which standardizing all variables is not appropriate, or in which standardization needs to be done before forming product terms. We are not aware of tools that can do appropriate standardization and form confidence intervals that take into account the selective standardization. By promoting the use of betas-select using lav_betaselect(), we hope to make it easier for researchers to do appropriate standardization when reporting SEM results.