Abstract
This vignette explains the installation of the
scDIFtest package and provides an illustration of item-wise DIF detection with the
scDIFtest function using a subset of the SPISA data.
The score-based test framework for parameter instability has been proposed for testing measurement invariance in measurement models. Until now, the focus was on (a) testing the invariance of all parameters simultaneously, or (b) testing the invariance of a single parameter in the model. However, in educational and psychological assessments, the appropriateness of each item is of interest. For instance, the detection of differential item functioning (DIF) plays an important role in validating new items. The
scDIFtest package provides a user-friendly method for detecting DIF by automatically and efficiently applying the tests from the score-based test framework to the individual items in the assessment. The main function of the
scDIFtest package is the
scDIFtest function, which is a wrapper around the sctest function.
To detect DIF with the
scDIFtest package, first, the appropriate Item Response Theory (IRT) or Factor Analysis (FA) model should be fitted using the
mirt package. The
scDIFtest function can be used directly on the resulting
mirt object. Hence, in addition to
scDIFtest, the package
mirt will typically also be loaded in the
R session. For now,
scDIFtest only works for IRT/FA models that were fitted using the
mirt package, but we aim to extend this to other packages that fit IRT/FA models using maximum likelihood estimation.
In order to fit the IRT model and analyze DIF with
scDIFtest, the following steps are necessary:
multipleGroup function implemented in the
mirt package (Chalmers 2012)
In the sections that follow, these steps will be explained in detail.
The scDIFtest package is installed using the following commands:
Because the mirt package (Chalmers 2012) is required for fitting the IRT/FA model of interest, it should also be installed (using
In this vignette, a subset of the
SPISA data is used. This data is part of the
psychotree package, and it can be accessed when the
psychotree package is installed. To load the
The SPISA data is a subsample from the general knowledge quiz “Studentenpisa” conducted online by the German weekly news magazine SPIEGEL (Trepte and Verbeet 2010). The data contain the quiz results from 45 questions as well as socio-demographic data for 1075 university students from Bavaria (Trepte and Verbeet 2010). Although there were 45 questions addressing different topics, this illustration is limited to the analysis of the nine science questions (items 37 to 45). To analyze the data with
mirt, the responses are converted to a data frame.
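The data preparation described above could look like the following. This is a minimal sketch, assuming that the responses are stored as a 0/1 matrix in the `spisa` column of the SPISA data frame; the object name `resp` is taken from the model calls shown later in this vignette.

```r
# Load the SPISA data that ships with the psychotree package
data("SPISA", package = "psychotree")

# Keep only the nine science questions (items 37 to 45) and convert the
# response matrix to a data frame, as mirt expects.
# Assumption: the responses live in the matrix column `spisa`.
resp <- as.data.frame(SPISA$spisa[, 37:45])

# Quick sanity check on the dimensions
dim(resp)
```

Because the response matrix has no column names, the nine items end up with the default names V1 to V9, which are the item labels used in the results below.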
In addition to the responses, the SPISA data also contains five socio-demographic variables (i.e., person covariates):
summary(SPISA[,2:6])
#>     gender         age          semester    elite            spon
#>  female:417   Min.   :18.0   2      :173   no :836   never    :303
#>  male  :658   1st Qu.:21.0   4      :123   yes:239   <1/month :127
#>               Median :23.0   6      :116             1-3/month:107
#>               Mean   :23.1   1      :105             1/week   : 79
#>               3rd Qu.:25.0   5      : 99             2-3/week : 73
#>               Max.   :40.0   3      : 98             4-5/week : 60
#>                              (Other):361             daily    :326
In this illustration, we will try to detect DIF along the following three covariates:
age of the student in years (numeric covariate)
gender of the student (unordered categorical covariate)
spon, which is the frequency of accessing the SPIEGEL ONline (SPON) magazine (ordered categorical covariate)
It is important to note that, for the package to work, the parameters in the assumed IRT model need to be estimated using either the
mirt or the
multipleGroup function from the mirt package. The
multipleGroup function can model impact between groups of persons, which is not possible with the
mirt function. Modeling impact is important when the goal is to detect DIF (DeMars 2010). In this illustration, for instance, we test whether there is impact with respect to gender by comparing a model that allows ability differences between male and female students with a model that assumes there are no group differences in ability. The relative fit of these two models is compared, and the best fitting model is selected for the DIF analysis. The general idea is that we want to avoid (a) false cases of DIF detection that can be attributed to ability differences and (b) not detecting DIF that is masked due to not modeling ability differences.
First, the mirt package is loaded in the R session:
Then the two models are fit and compared. Note that in general we do not recommend using
verbose = FALSE, but for this vignette it is more convenient.
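The two model fits can be sketched as follows. The `mirt` and `multipleGroup` calls are copied from the model descriptions printed in the anova output of this vignette; the data preparation lines are repeated only so that the chunk runs on its own, and they assume that the responses are stored in `SPISA$spisa`.

```r
library(mirt)

# Data preparation (assumption: responses are in the matrix column `spisa`)
data("SPISA", package = "psychotree")
resp <- as.data.frame(SPISA$spisa[, 37:45])

# Model without impact: one 2PL model for all students
fit_2PL <- mirt(data = resp, model = 1, itemtype = "2PL", verbose = FALSE)

# Model with impact: item parameters are constrained to be equal across
# gender groups, while the latent mean and variance are freed
fit_multiGroup <- multipleGroup(
  data = resp, model = 1, group = SPISA$gender,
  invariance = c("free_means", "slopes", "intercepts", "free_var"),
  verbose = FALSE
)
```

Constraining the slopes and intercepts while freeing the latent mean and variance means the two models differ only in whether ability differences between the groups are allowed, which is the comparison of interest here.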
The comparison of the two models with
anova yields the following results:
anova(fit_2PL, fit_multiGroup)
#>
#> Model 1: multipleGroup(data = resp, model = 1, group = SPISA$gender, invariance = c("free_means",
#>     "slopes", "intercepts", "free_var"), verbose = FALSE)
#> Model 2: mirt(data = resp, model = 1, itemtype = "2PL", verbose = FALSE)
#>        AIC     AICc    SABIC       HQ      BIC    logLik      X2  df   p
#> 1 10139.62 10140.41 10175.69 10177.34 10239.22 -5049.808     NaN NaN NaN
#> 2 10161.68 10162.33 10194.16 10195.64 10251.33 -5062.843 -26.069 509   1
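The multipleGroup model with ability differences between male and female test takers fits the data best (lower AIC and BIC). The likelihood ratio columns are not interpretable as printed (negative \(X^2\), \(p = 1\)), but the two log-likelihoods imply \(X^2 = 2 \times (5062.843 - 5049.808) = 26.07\) for two additional parameters, which corresponds to a small \(p\)-value. It seems that male and female students differ with respect to the assessed science knowledge. Therefore, the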
multipleGroup model is used in the DIF detection analysis.
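The likelihood ratio test can be recomputed by hand from the values printed in the anova output. The sketch below uses base R only; the numbers are copied from that output, and the variable names are illustrative. It relies on the standard definition AIC = -2 logLik + 2k to recover the number of estimated parameters k of each model.

```r
# Values copied from the anova() output
logLik_multiGroup <- -5049.808
logLik_2PL        <- -5062.843
AIC_multiGroup    <- 10139.62
AIC_2PL           <- 10161.68

# Recover the number of estimated parameters from AIC = -2 * logLik + 2 * k
k_multiGroup <- round((AIC_multiGroup + 2 * logLik_multiGroup) / 2)
k_2PL        <- round((AIC_2PL + 2 * logLik_2PL) / 2)

# Likelihood ratio statistic and its degrees of freedom
X2 <- 2 * (logLik_multiGroup - logLik_2PL)
df <- k_multiGroup - k_2PL

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(X2, df = df, lower.tail = FALSE)

c(X2 = X2, df = df, p_value = p_value)
```

The two extra parameters are consistent with freeing the latent mean and variance in the second gender group.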
In the (sub)sections that follow, DIF is tested for three different covariates: gender, age, and
spon, but only the DIF analysis for gender is explained in more detail. Yet the
R commands used are the same for any covariate. The interpretation is given for all of the covariates.
To test item-wise DIF along gender, the
scDIFtest function is used with the fitted model object and
gender as the
DIF_covariate argument. Note that the
scDIFtest package has to be loaded first.
The resulting object is assigned to
DIF_gender. For a readable version of the results, the
summary method returns a summary of the results as a data frame.
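The DIF analysis for gender can be sketched as follows. The `scDIFtest` call mirrors the calls shown later in this vignette for age and spon; the object name `summary_gender` is illustrative, and the data preparation and model fit are repeated only so that the chunk runs on its own (assuming the responses are stored in `SPISA$spisa`).

```r
library(mirt)
library(scDIFtest)

# Data preparation and model fit, as described earlier in this vignette
data("SPISA", package = "psychotree")
resp <- as.data.frame(SPISA$spisa[, 37:45])
fit_multiGroup <- multipleGroup(
  data = resp, model = 1, group = SPISA$gender,
  invariance = c("free_means", "slopes", "intercepts", "free_var"),
  verbose = FALSE
)

# Item-wise score-based DIF tests along gender
DIF_gender <- scDIFtest(fit_multiGroup, DIF_covariate = SPISA$gender)

# Summary of the results as a data frame, one row per item
summary_gender <- summary(DIF_gender)
summary_gender
```

Keeping the summary as a data frame makes it easy to filter or sort the items, for example by the corrected p-value.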
In the two subsections that follow, the results regarding the analyses of item-wise DIF by
spon will be interpreted.
For the gender covariate, the print method gives the following results:
DIF_gender
#>
#> Score Based DIF-tests for 9 items
#>   Person covariate: SPISA$gender
#>   Test statistic type: Lagrange Multiplier Test for Unordered Groups
#>
#>    item_type n_est_pars       stat      p_value        p_fdr
#> V1       2PL          2  0.4141020 8.129782e-01 9.146005e-01
#> V2       2PL          2  8.3162505 1.563685e-02 4.691054e-02
#> V3       2PL          2  4.8449033 8.870388e-02 1.995837e-01
#> V4       2PL          2 32.7335352 7.798358e-08 7.018522e-07
#> V5       2PL          2  3.2679379 1.951535e-01 3.512763e-01
#> V6       2PL          2  0.4159221 8.122387e-01 9.146005e-01
#> V7       2PL          2 30.3499936 2.567927e-07 1.155567e-06
#> V8       2PL          2  0.1517182 9.269468e-01 9.269468e-01
#> V9       2PL          2  0.5925442 7.435851e-01 9.146005e-01
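First, three lines give some general information: the number of items that were tested (9), the person covariate that was used (SPISA$gender), and the type of test statistic (in this case the Lagrange Multiplier Test for Unordered Groups;
LMuo; Merkle and Zeileis 2013; Merkle, Fan, and Zeileis 2014).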
After these three lines, a table with the main results is printed with one line for each item that was included in the DIF detection analysis. The columns of the table represent:
item_type: the type of IRT model used for each item (in this case the two-parameter logistic model (2PL))
n_est_pars: the number of estimated parameters for each item
stat: the value of the test statistic per item (in this case the Lagrange Multiplier statistic for unordered groups)
p_value: the \(p\)-value per item
p_fdr: the false-discovery-rate corrected \(p\)-value (Benjamini and Hochberg 1995)
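The corrected column can be checked with base R. This is a minimal sketch, assuming that the correction corresponds to the Benjamini-Hochberg procedure as implemented in `p.adjust`; the uncorrected values are copied from the output printed for the gender covariate, and the variable names are illustrative.

```r
# Uncorrected p-values copied from the printed results for gender
p_value <- c(V1 = 8.129782e-01, V2 = 1.563685e-02, V3 = 8.870388e-02,
             V4 = 7.798358e-08, V5 = 1.951535e-01, V6 = 8.122387e-01,
             V7 = 2.567927e-07, V8 = 9.269468e-01, V9 = 7.435851e-01)

# Benjamini-Hochberg false-discovery-rate correction
p_fdr <- p.adjust(p_value, method = "BH")

# Items whose corrected p-value falls below the .05 level
flagged <- names(p_fdr)[p_fdr < 0.05]

p_fdr
flagged
```

Up to rounding of the copied inputs, the result agrees with the corrected values in the printed table.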
The printed output indicates that, when a significance level of \(.05\) is used, DIF along
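gender is detected most clearly in item V4 and in item V7 (both \(p < .001\)): these two items function differently, depending on the gender of the students. Item V2 also falls below this level (\(p = 0.016\), FDR-corrected \(p = 0.047\)), although far less clearly.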
When one or more items are selected using the
item_selection argument of the print method, the corresponding
sctest objects (or M-fluctuation tests) are printed.
print(DIF_gender, item_selection = c("V4", "V7"))
#>
#> DIF-test for V4
#>   Person covariate: SPISA$gender
#>   Test statistic type: Lagrange Multiplier Test for Unordered Groups
#>
#>  M-fluctuation test
#>
#> data:  resp
#> f(efp) = 32.734, p-value = 7.798e-08
#>
#>
#> DIF-test for V7
#>   Person covariate: SPISA$gender
#>   Test statistic type: Lagrange Multiplier Test for Unordered Groups
#>
#>  M-fluctuation test
#>
#> data:  resp
#> f(efp) = 30.35, p-value = 2.568e-07
Note that here the uncorrected \(p\)-values are given.
The results for the DIF-detection analysis with
age as the covariate are:
DIF_age <- scDIFtest(fit_multiGroup, DIF_covariate = SPISA$age)
summary_age <- summary(DIF_age)
summary_age
#>    item_type n_est_pars      stat     p_value      p_fdr
#> V1       2PL          2 1.0593393 0.378630317 0.56794548
#> V2       2PL          2 0.7508117 0.859974883 0.96747174
#> V3       2PL          2 1.3579887 0.097556732 0.21950265
#> V4       2PL          2 1.6092879 0.022393893 0.06718168
#> V5       2PL          2 1.0936080 0.332120746 0.56794548
#> V6       2PL          2 1.6830445 0.013808746 0.06213936
#> V7       2PL          2 0.5720489 0.989797256 0.98979726
#> V8       2PL          2 0.7729229 0.830878151 0.96747174
#> V9       2PL          2 1.9126378 0.002656523 0.02390871
In this case, the Double Maximum Test for continuous numeric orderings (
dm; Merkle and Zeileis 2013; Merkle, Fan, and Zeileis 2014) is used. The results indicate that, based on the uncorrected \(p\)-values, DIF along
age is detected in three items: V4 (\(p = 0.022\)), V6 (\(p = 0.014\)), and V9 (\(p = 0.003\)). After the false-discovery-rate correction, only V9 remains below \(.05\) (corrected \(p = 0.024\)). Note that the score-based framework has the power to detect DIF along numeric covariates without assuming a particular functional form of the DIF.
The results for the DIF-detection analysis with
spon as the covariate are:
DIF_spon <- scDIFtest(fit_multiGroup, DIF_covariate = SPISA$spon)
DIF_spon
#>
#> Score Based DIF-tests for 9 items
#>   Person covariate: SPISA$spon
#>   Test statistic type: Maximum Lagrange Multiplier Test for Ordered
#>     Groups
#>
#>    item_type n_est_pars     stat    p_value     p_fdr
#> V1       2PL          2 1.868941 0.77865040 0.8759817
#> V2       2PL          2 6.342694 0.13831369 0.4507635
#> V3       2PL          2 2.390256 0.66339331 0.8529343
#> V4       2PL          2 3.597938 0.43124151 0.6468623
#> V5       2PL          2 7.536444 0.08292608 0.4507635
#> V6       2PL          2 4.847357 0.26086019 0.5319609
#> V7       2PL          2 1.304980 0.89473667 0.8947367
#> V8       2PL          2 6.174822 0.15025448 0.4507635
#> V9       2PL          2 4.553582 0.29553382 0.5319609
In this case, the maximum Lagrange-Multiplier-Test (
maxLMo; Merkle and Zeileis 2013; Merkle, Fan, and Zeileis 2014) is used. Since all tests result in large \(p\)-values, we conclude that no DIF was detected along the spon covariate.
scDIFtest is a user-friendly and efficient wrapper around the
sctest function of the strucchange package.
scDIFtest can be used to detect item-wise DIF along both categorical and continuous DIF covariates. Note, however, that for now the functionality is compatible only with IRT models fit using the
mirt package.
Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological) 57 (1): 289–300.
Chalmers, R. Philip. 2012. “mirt: A Multidimensional Item Response Theory Package for the R Environment.” Journal of Statistical Software 48 (6): 1–29. https://doi.org/10.18637/jss.v048.i06.
Debeer, Dries. 2020. scDIFtest: Item-Wise Score-Based DIF Tests.
DeMars, Christine E. 2010. “Type I Error Inflation for Detecting DIF in the Presence of Impact.” Educational and Psychological Measurement 70 (6): 961–72. https://doi.org/10.1177/0013164410366691.
Merkle, Edgar C, Jinyan Fan, and Achim Zeileis. 2014. “Testing for Measurement Invariance with Respect to an Ordinal Variable.” Psychometrika 79 (4): 569–84.
Merkle, Edgar C, and Achim Zeileis. 2013. “Tests of Measurement Invariance Without Subgroups: A Generalization of Classical Methods.” Psychometrika 78 (1): 59–82.
Trepte, Sabine, and Markus Verbeet, eds. 2010. Allgemeinbildung in Deutschland - Erkenntnisse Aus Dem SPIEGEL Studentenpisa-Test. Wiesbaden: VS Verlag.