For the most part, this document will present the functionalities of
the function surveysd::calc.stError()
which generates point
estimates and standard errors for user-supplied estimation
functions.
In order to use a dataset with calc.stError()
, several
weight columns have to be present. Each weight column corresponds to a
bootstrap sample. In the following examples, we will use the data from
demo.eusilc()
and attach the bootstrap weights using
draw.bootstrap()
and recalib()
. Please refer
to the documentation of those functions for more detail.
library(surveysd)
set.seed(1234)
<- demo.eusilc(prettyNames = TRUE)
eusilc <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
dat_boot strata = "region", period = "year")
<- recalib(dat_boot, conP.var = "gender", conH.var = "region",
dat_boot_calib epsP = 1e-2, epsH = 2.5e-2, verbose = FALSE)
:= nrow(.SD) == 1, by = .(year, hid)]
dat_boot_calib[, onePerson
## print part of the dataset
1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)] dat_boot_calib[
year | povertyRisk | eqIncome | onePerson | pWeight | w1 | w2 | w3 | w4 | w5 |
---|---|---|---|---|---|---|---|---|---|
2010 | FALSE | 16090.69 | FALSE | 504.5696 | 0.4477743 | 1007.5651 | 0.4466992 | 0.4489062 | 1019.721 |
2010 | FALSE | 16090.69 | FALSE | 504.5696 | 0.4477743 | 1007.5651 | 0.4466992 | 0.4489062 | 1019.721 |
2010 | FALSE | 16090.69 | FALSE | 504.5696 | 0.4477743 | 1007.5651 | 0.4466992 | 0.4489062 | 1019.721 |
2011 | FALSE | 16090.69 | FALSE | 504.5696 | 0.4237016 | 968.2711 | 0.4468813 | 0.4654705 | 1005.100 |
2011 | FALSE | 16090.69 | FALSE | 504.5696 | 0.4237016 | 968.2711 | 0.4468813 | 0.4654705 | 1005.100 |
The parameters fun
and var
in
calc.stError()
define the estimator to be used in the error
analysis. There are two built-in estimator functions
weightedSum()
and weightedRatio()
which can be
used as follows.
<- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
povertyRate <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum) totalIncome
Those functions calculate the ratio of persons at risk of poverty (in percent) and the total income. By default, the results are calculated separately for each reference period.
$Estimates povertyRate
year | n | N | val_povertyRisk | stE_povertyRisk |
---|---|---|---|---|
2010 | 14827 | 8182222 | 14.44422 | 0.5838199 |
2011 | 14827 | 8182222 | 14.77393 | 0.6771848 |
2012 | 14827 | 8182222 | 15.04515 | 0.4907275 |
2013 | 14827 | 8182222 | 14.89013 | 0.4756386 |
2014 | 14827 | 8182222 | 15.14556 | 0.5195149 |
2015 | 14827 | 8182222 | 15.53640 | 0.4808834 |
2016 | 14827 | 8182222 | 15.08315 | 0.3566648 |
2017 | 14827 | 8182222 | 15.42019 | 0.5592589 |
$Estimates totalIncome
year | n | N | val_eqIncome | stE_eqIncome |
---|---|---|---|---|
2010 | 14827 | 8182222 | 162750998071 | 960851034 |
2011 | 14827 | 8182222 | 161926931417 | 901915958 |
2012 | 14827 | 8182222 | 162576509628 | 1198230549 |
2013 | 14827 | 8182222 | 163199507862 | 1474893389 |
2014 | 14827 | 8182222 | 163986275009 | 1528829971 |
2015 | 14827 | 8182222 | 163416275447 | 1338217239 |
2016 | 14827 | 8182222 | 162706205137 | 1207314432 |
2017 | 14827 | 8182222 | 164314959107 | 1322744977 |
Columns that use the val_
prefix denote the point
estimate belonging to the “main weight” of the dataset, which is
pWeight
in case of the dataset used here.
Columns with the stE_
prefix denote standard errors
calculated with bootstrap replicates. The replicates result in using
w1
, w2
, …, w10
instead of
pWeight
when applying the estimator.
n
denotes the number of observations for the year and
N
denotes the total weight of those persons.
In order to define a custom estimator function to be used in
fun
, the function needs to have at least two arguments like
the example below.
## define custom estimator
<- function(x, w) {
myWeightedSum sum(x*w)
}
## check if results are equal to the one using `surveysd::weightedSum()`
<- calc.stError(dat_boot_calib, var = "eqIncome", fun = myWeightedSum)
totalIncome2 all.equal(totalIncome$Estimates, totalIncome2$Estimates)
## [1] TRUE
The parameters x
and w
can be assumed to be
vectors with equal length with w
being numeric weight
vector and x
being the column defined in the
var
argument. It will be called once for each period (in
this case year
) and for each weight column (in this case
pWeight
, w1
, w2
, …,
w10
).
Custom estimators using additional parameters can also be supplied
and parameter add.arg
can be used to set the additional
arguments for the custom estimator.
## use add.arg-argument
<- function(x, w, b) {
fun sum(x*w*b)
}= list(b="onePerson")
add.arg
<- calc.stError(dat_boot_calib, var = "povertyRisk", fun = fun,
err.est period.mean = 0, add.arg=add.arg)
$Estimates err.est
year | n | N | val_povertyRisk | stE_povertyRisk |
---|---|---|---|---|
2010 | 14827 | 8182222 | 273683.9 | 12521.133 |
2011 | 14827 | 8182222 | 261883.6 | 9478.316 |
2012 | 14827 | 8182222 | 243083.9 | 8099.185 |
2013 | 14827 | 8182222 | 238004.4 | 12366.440 |
2014 | 14827 | 8182222 | 218572.1 | 16077.990 |
2015 | 14827 | 8182222 | 219984.1 | 14706.184 |
2016 | 14827 | 8182222 | 201753.9 | 14324.090 |
2017 | 14827 | 8182222 | 196881.2 | 9721.776 |
# compare with direct computation
<- dat_boot_calib[,fun(povertyRisk,pWeight,b=onePerson),
compare.value =c("year")]
byall((compare.value$V1-err.est$Estimates$val_povertyRisk)==0)
## [1] TRUE
The above chunk computes the weighted poverty ratio for single person households.
In our example the variable povertyRisk
is a boolean and
is TRUE
if the income is less than 60% of the weighted
median income. Thus it directly depends on the original weight vector
pWeight
. To further reduce the estimated error one should
calculate for each bootstrap replicate weight \(w\) the weighted median income \(medIncome_{w}\) and then define \(povertyRisk_w\) as
\[ povertyRisk_w = \cases{1 \quad\text{if Income}<0.6\cdot medIncome_{w}\\ 0 \quad\text{else}} \]
The estimator can then be applied to the new variable \(povertyRisk_w\). This can be realized using a custom estimator function.
# custom estimator to first derive poverty threshold
# and then estimate a weighted ratio
<- function(x, w) {
povmd <- laeken::weightedMedian(x, w)*0.6
md <- x < md
pmd60 # weighted ratio is directly estimated inside the function
return(sum(w[pmd60])/sum(w)*100)
}
<- calc.stError(
err.est var = "povertyRisk", fun = weightedRatio,
dat_boot_calib, fun.adjust.var = povmd, adjust.var = "eqIncome")
$Estimates err.est
year | n | N | val_povertyRisk | stE_povertyRisk |
---|---|---|---|---|
2010 | 14827 | 8182222 | 14.44422 | 0 |
2011 | 14827 | 8182222 | 14.77393 | 0 |
2012 | 14827 | 8182222 | 15.04515 | 0 |
2013 | 14827 | 8182222 | 14.89013 | 0 |
2014 | 14827 | 8182222 | 15.14556 | 0 |
2015 | 14827 | 8182222 | 15.53640 | 0 |
2016 | 14827 | 8182222 | 15.08315 | 0 |
2017 | 14827 | 8182222 | 15.42019 | 0 |
The approach shown above is only valid if not grouping variables are
supplied (parameter group
). If grouping variables are
supplied one should use parameters fun.adjust.var
and
adjust.var
such that the \(povertyRisk_w\) is first calculated for
each period
and then used for each grouping in
group
.
# using fun.adjust.var and adjust.var to estimate povmd60 indicator
# for each period and bootstrap weight before applying the weightedRatio
<- function(x, w) {
povmd2 <- laeken::weightedMedian(x, w)*0.6
md <- x < md
pmd60 return(as.integer(pmd60))
}
# set adjust.var="eqIncome" so the income vector is used to estimate
# the povmd60 indicator for each bootstrap weight
# and the resulting indicators are passed to function weightedRatio
<- "gender"
group <- calc.stError(
err.est var = "povertyRisk", fun = weightedRatio, group = "gender",
dat_boot_calib, fun.adjust.var = povmd2, adjust.var = "eqIncome")
$Estimates err.est
year | n | N | gender | val_povertyRisk | stE_povertyRisk |
---|---|---|---|---|---|
2010 | 7267 | 3979572 | male | 12.02660 | 0.6237242 |
2010 | 7560 | 4202650 | female | 16.73351 | 0.7234573 |
2010 | 14827 | 8182222 | NA | 14.44422 | 0.6330581 |
2011 | 7267 | 3979572 | male | 12.81921 | 0.5870065 |
2011 | 7560 | 4202650 | female | 16.62488 | 0.6907626 |
2011 | 14827 | 8182222 | NA | 14.77393 | 0.6103274 |
2012 | 7267 | 3979572 | male | 13.76065 | 0.5105388 |
2012 | 7560 | 4202650 | female | 16.26147 | 0.7726815 |
2012 | 14827 | 8182222 | NA | 15.04515 | 0.6236907 |
2013 | 7267 | 3979572 | male | 13.88962 | 0.5193635 |
2013 | 7560 | 4202650 | female | 15.83754 | 0.6342728 |
2013 | 14827 | 8182222 | NA | 14.89013 | 0.5654875 |
2014 | 7267 | 3979572 | male | 14.50351 | 0.5988502 |
2014 | 7560 | 4202650 | female | 15.75353 | 0.5824421 |
2014 | 14827 | 8182222 | NA | 15.14556 | 0.5411586 |
2015 | 7267 | 3979572 | male | 15.12289 | 0.4298710 |
2015 | 7560 | 4202650 | female | 15.92796 | 0.5197222 |
2015 | 14827 | 8182222 | NA | 15.53640 | 0.3928424 |
2016 | 7267 | 3979572 | male | 14.57968 | 0.3692397 |
2016 | 7560 | 4202650 | female | 15.55989 | 0.4607921 |
2016 | 14827 | 8182222 | NA | 15.08315 | 0.2784766 |
2017 | 7267 | 3979572 | male | 14.94816 | 0.3491172 |
2017 | 7560 | 4202650 | female | 15.86717 | 0.4099472 |
2017 | 14827 | 8182222 | NA | 15.42019 | 0.3168769 |
In case an estimator should be applied to several columns of the
dataset, var
can be set to a vector containing all
necessary columns.
<- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio)
multipleRates $Estimates multipleRates
year | n | N | val_povertyRisk | stE_povertyRisk | val_onePerson | stE_onePerson |
---|---|---|---|---|---|---|
2010 | 14827 | 8182222 | 14.44422 | 0.5838199 | 14.85737 | 0.2572088 |
2011 | 14827 | 8182222 | 14.77393 | 0.6771848 | 14.85737 | 0.2778282 |
2012 | 14827 | 8182222 | 15.04515 | 0.4907275 | 14.85737 | 0.2909612 |
2013 | 14827 | 8182222 | 14.89013 | 0.4756386 | 14.85737 | 0.3458258 |
2014 | 14827 | 8182222 | 15.14556 | 0.5195149 | 14.85737 | 0.4394058 |
2015 | 14827 | 8182222 | 15.53640 | 0.4808834 | 14.85737 | 0.3810024 |
2016 | 14827 | 8182222 | 15.08315 | 0.3566648 | 14.85737 | 0.3105411 |
2017 | 14827 | 8182222 | 15.42019 | 0.5592589 | 14.85737 | 0.3030293 |
Here we see the relative number of persons at risk of poverty and the relative number of one-person households.
The groups
argument can be used to calculate estimators
for different subsets of the data. This argument can take the grouping
variable as a string that refers to a column name (usually a factor) in
dat
. If set, all estimators are not only split by the
reference period but also by the grouping variable. For simplicity, only
one reference period of the above data is used.
<- subset(dat_boot_calib, year == 2010)
dat2 for (att in c("period", "weights", "b.rep"))
attr(dat2, att) <- attr(dat_boot_calib, att)
To calculate the ratio of persons at risk of poverty for each federal
state of Austria, group = "region"
can be used.
<- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region")
povertyRates $Estimates povertyRates
year | n | N | region | val_povertyRisk | stE_povertyRisk |
---|---|---|---|---|---|
2010 | 549 | 260564 | Burgenland | 19.53984 | 1.8042764 |
2010 | 733 | 377355 | Vorarlberg | 16.53731 | 3.2896891 |
2010 | 924 | 535451 | Salzburg | 13.78734 | 2.2585866 |
2010 | 1078 | 563648 | Carinthia | 13.08627 | 1.7469304 |
2010 | 1317 | 701899 | Tyrol | 15.30819 | 1.3943715 |
2010 | 2295 | 1167045 | Styria | 14.37464 | 1.3500431 |
2010 | 2322 | 1598931 | Vienna | 17.23468 | 1.0428496 |
2010 | 2804 | 1555709 | Lower Austria | 13.84362 | 1.7034375 |
2010 | 2805 | 1421620 | Upper Austria | 10.88977 | 0.8709790 |
2010 | 14827 | 8182222 | NA | 14.44422 | 0.5838199 |
The last row with region = NA
denotes the aggregate over
all regions. Note that the columns N
and n
now
show the weighted and unweighted number of persons in each region.
In case more than one grouping variable is used, there are several
options of calling calc.stError()
depending on whether
combinations of grouping levels should be regarded or not. We will
consider the variables gender
and region
as
our grouping variables and show three options on how
calc.stError()
can be called.
Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore
\[n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.\]
The last row is again the estimate for the whole period.
<- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
povertyRates group = c("gender", "region"))
$Estimates povertyRates
year | n | N | gender | region | val_povertyRisk | stE_povertyRisk |
---|---|---|---|---|---|---|
2010 | 549 | 260564 | NA | Burgenland | 19.53984 | 1.8042764 |
2010 | 733 | 377355 | NA | Vorarlberg | 16.53731 | 3.2896891 |
2010 | 924 | 535451 | NA | Salzburg | 13.78734 | 2.2585866 |
2010 | 1078 | 563648 | NA | Carinthia | 13.08627 | 1.7469304 |
2010 | 1317 | 701899 | NA | Tyrol | 15.30819 | 1.3943715 |
2010 | 2295 | 1167045 | NA | Styria | 14.37464 | 1.3500431 |
2010 | 2322 | 1598931 | NA | Vienna | 17.23468 | 1.0428496 |
2010 | 2804 | 1555709 | NA | Lower Austria | 13.84362 | 1.7034375 |
2010 | 2805 | 1421620 | NA | Upper Austria | 10.88977 | 0.8709790 |
2010 | 7267 | 3979572 | male | NA | 12.02660 | 0.6303407 |
2010 | 7560 | 4202650 | female | NA | 16.73351 | 0.6300745 |
2010 | 14827 | 8182222 | NA | NA | 14.44422 | 0.5838199 |
state
and
gender
Split the data by all combinations of the two grouping variables. This will result in a larger output-table of the size
\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.\]
<- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
povertyRates group = list(c("gender", "region")))
$Estimates povertyRates
year | n | N | gender | region | val_povertyRisk | stE_povertyRisk |
---|---|---|---|---|---|---|
2010 | 261 | 122741.8 | male | Burgenland | 17.414524 | 2.2332286 |
2010 | 288 | 137822.2 | female | Burgenland | 21.432598 | 2.0899871 |
2010 | 359 | 182732.9 | male | Vorarlberg | 12.973259 | 3.1093204 |
2010 | 374 | 194622.1 | female | Vorarlberg | 19.883637 | 3.7576712 |
2010 | 440 | 253143.7 | male | Salzburg | 9.156964 | 1.9028612 |
2010 | 484 | 282307.3 | female | Salzburg | 17.939382 | 2.5374889 |
2010 | 517 | 268581.4 | male | Carinthia | 10.552148 | 2.0614209 |
2010 | 561 | 295066.6 | female | Carinthia | 15.392924 | 1.9756282 |
2010 | 650 | 339566.5 | male | Tyrol | 12.857542 | 1.0371712 |
2010 | 667 | 362332.5 | female | Tyrol | 17.604861 | 2.2145631 |
2010 | 1128 | 571011.7 | male | Styria | 11.671247 | 1.5111686 |
2010 | 1132 | 774405.4 | male | Vienna | 15.590616 | 1.3348673 |
2010 | 1167 | 596033.3 | female | Styria | 16.964539 | 1.3758548 |
2010 | 1190 | 824525.6 | female | Vienna | 18.778813 | 0.9304295 |
2010 | 1363 | 684272.5 | male | Upper Austria | 9.074690 | 1.1711956 |
2010 | 1387 | 772593.2 | female | Lower Austria | 16.372949 | 1.8366058 |
2010 | 1417 | 783115.8 | male | Lower Austria | 11.348283 | 1.6437512 |
2010 | 1442 | 737347.5 | female | Upper Austria | 12.574205 | 0.7710579 |
2010 | 14827 | 8182222.0 | NA | NA | 14.444218 | 0.5838199 |
In this case, the estimates and standard errors are calculated for
The number of rows in the output is therefore
\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.\]
<- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
povertyRates group = list("gender", "region", c("gender", "region")))
$Estimates povertyRates
year | n | N | gender | region | val_povertyRisk | stE_povertyRisk |
---|---|---|---|---|---|---|
2010 | 261 | 122741.8 | male | Burgenland | 17.414524 | 2.2332286 |
2010 | 288 | 137822.2 | female | Burgenland | 21.432598 | 2.0899871 |
2010 | 359 | 182732.9 | male | Vorarlberg | 12.973259 | 3.1093204 |
2010 | 374 | 194622.1 | female | Vorarlberg | 19.883637 | 3.7576712 |
2010 | 440 | 253143.7 | male | Salzburg | 9.156964 | 1.9028612 |
2010 | 484 | 282307.3 | female | Salzburg | 17.939382 | 2.5374889 |
2010 | 517 | 268581.4 | male | Carinthia | 10.552148 | 2.0614209 |
2010 | 549 | 260564.0 | NA | Burgenland | 19.539836 | 1.8042764 |
2010 | 561 | 295066.6 | female | Carinthia | 15.392924 | 1.9756282 |
2010 | 650 | 339566.5 | male | Tyrol | 12.857542 | 1.0371712 |
2010 | 667 | 362332.5 | female | Tyrol | 17.604861 | 2.2145631 |
2010 | 733 | 377355.0 | NA | Vorarlberg | 16.537310 | 3.2896891 |
2010 | 924 | 535451.0 | NA | Salzburg | 13.787343 | 2.2585866 |
2010 | 1078 | 563648.0 | NA | Carinthia | 13.086268 | 1.7469304 |
2010 | 1128 | 571011.7 | male | Styria | 11.671247 | 1.5111686 |
2010 | 1132 | 774405.4 | male | Vienna | 15.590616 | 1.3348673 |
2010 | 1167 | 596033.3 | female | Styria | 16.964539 | 1.3758548 |
2010 | 1190 | 824525.6 | female | Vienna | 18.778813 | 0.9304295 |
2010 | 1317 | 701899.0 | NA | Tyrol | 15.308191 | 1.3943715 |
2010 | 1363 | 684272.5 | male | Upper Austria | 9.074690 | 1.1711956 |
2010 | 1387 | 772593.2 | female | Lower Austria | 16.372949 | 1.8366058 |
2010 | 1417 | 783115.8 | male | Lower Austria | 11.348283 | 1.6437512 |
2010 | 1442 | 737347.5 | female | Upper Austria | 12.574205 | 0.7710579 |
2010 | 2295 | 1167045.0 | NA | Styria | 14.374637 | 1.3500431 |
2010 | 2322 | 1598931.0 | NA | Vienna | 17.234683 | 1.0428496 |
2010 | 2804 | 1555709.0 | NA | Lower Austria | 13.843623 | 1.7034375 |
2010 | 2805 | 1421620.0 | NA | Upper Austria | 10.889773 | 0.8709790 |
2010 | 7267 | 3979571.7 | male | NA | 12.026600 | 0.6303407 |
2010 | 7560 | 4202650.3 | female | NA | 16.733508 | 0.6300745 |
2010 | 14827 | 8182222.0 | NA | NA | 14.444218 | 0.5838199 |