--- title: "Error estimation" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: df_print: kable toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{error estimation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} options(rmarkdown.html_vignette.check_title = FALSE) knitr::opts_chunk$set(echo = TRUE) ``` For the most part, this document will present the functionalities of the function `surveysd::calc.stError()` which generates point estimates and standard errors for user-supplied estimation functions. ## Prerequisites In order to use a dataset with `calc.stError()`, several weight columns have to be present. Each weight column corresponds to a bootstrap sample. In the following examples, we will use the data from `demo.eusilc()` and attach the bootstrap weights using `draw.bootstrap()` and `recalib()`. Please refer to the documentation of those functions for more detail. ```{r} library(surveysd) set.seed(1234) eusilc <- demo.eusilc(prettyNames = TRUE) dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight", strata = "region", period = "year") dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region", epsP = 1e-2, epsH = 2.5e-2, verbose = FALSE) dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)] ## print part of the dataset dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)] ``` ## Estimator functions The parameters `fun` and `var` in `calc.stError()` define the estimator to be used in the error analysis. There are two built-in estimator functions `weightedSum()` and `weightedRatio()` which can be used as follows. ```{r} povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio) totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum) ``` Those functions calculate the ratio of persons at risk of poverty (in percent) and the total income. By default, the results are calculated separately for each reference period. ```{r} povertyRate$Estimates totalIncome$Estimates ``` Columns that use the `val_` prefix denote the point estimate belonging to the "main weight" of the dataset, which is `pWeight` in case of the dataset used here. Columns with the `stE_` prefix denote standard errors calculated with bootstrap replicates. The replicates result in using `w1`, `w2`, ..., `w10` instead of `pWeight` when applying the estimator. `n` denotes the number of observations for the year and `N` denotes the total weight of those persons. ### Custom estimators In order to define a custom estimator function to be used in `fun`, the function needs to have at least two arguments like the example below. ```{r} ## define custom estimator myWeightedSum <- function(x, w) { sum(x*w) } ## check if results are equal to the one using `surveysd::weightedSum()` totalIncome2 <- calc.stError(dat_boot_calib, var = "eqIncome", fun = myWeightedSum) all.equal(totalIncome$Estimates, totalIncome2$Estimates) ``` The parameters `x` and `w` can be assumed to be vectors with equal length with `w` being numeric weight vector and `x` being the column defined in the `var` argument. It will be called once for each period (in this case `year`) and for each weight column (in this case `pWeight`, `w1`, `w2`, ..., `w10`). Custom estimators using additional parameters can also be supplied and parameter `add.arg` can be used to set the additional arguments for the custom estimator. ```{r} ## use add.arg-argument fun <- function(x, w, b) { sum(x*w*b) } add.arg = list(b="onePerson") err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = fun, period.mean = 0, add.arg=add.arg) err.est$Estimates # compare with direct computation compare.value <- dat_boot_calib[,fun(povertyRisk,pWeight,b=onePerson), by=c("year")] all((compare.value$V1-err.est$Estimates$val_povertyRisk)==0) ``` The above chunk computes the weighted poverty ratio for single person households. ### Adjust variable depending on bootstrap weights In our example the variable `povertyRisk` is a boolean and is `TRUE` if the income is less than 60% of the weighted median income. Thus it directly depends on the original weight vector `pWeight`. To further reduce the estimated error one should calculate for each bootstrap replicate weight $w$ the weighted median income $medIncome_{w}$ and then define $povertyRisk_w$ as $$ povertyRisk_w = \cases{1 \quad\text{if Income}<0.6\cdot medIncome_{w}\\ 0 \quad\text{else}} $$ The estimator can then be applied to the new variable $povertyRisk_w$. This can be realized using a custom estimator function. ```{r} # custom estimator to first derive poverty threshold # and then estimate a weighted ratio povmd <- function(x, w) { md <- laeken::weightedMedian(x, w)*0.6 pmd60 <- x < md # weighted ratio is directly estimated inside the function return(sum(w[pmd60])/sum(w)*100) } err.est <- calc.stError( dat_boot_calib, var = "povertyRisk", fun = weightedRatio, fun.adjust.var = povmd, adjust.var = "eqIncome") err.est$Estimates ``` The approach shown above is only valid if not grouping variables are supplied (parameter `group`). If grouping variables are supplied one should use parameters `fun.adjust.var` and `adjust.var` such that the $povertyRisk_w$ is first calculated for each `period` and then used for each grouping in `group`. ```{r} # using fun.adjust.var and adjust.var to estimate povmd60 indicator # for each period and bootstrap weight before applying the weightedRatio povmd2 <- function(x, w) { md <- laeken::weightedMedian(x, w)*0.6 pmd60 <- x < md return(as.integer(pmd60)) } # set adjust.var="eqIncome" so the income vector is used to estimate # the povmd60 indicator for each bootstrap weight # and the resulting indicators are passed to function weightedRatio group <- "gender" err.est <- calc.stError( dat_boot_calib, var = "povertyRisk", fun = weightedRatio, group = "gender", fun.adjust.var = povmd2, adjust.var = "eqIncome") err.est$Estimates ``` ### Multiple estimators In case an estimator should be applied to several columns of the dataset, `var` can be set to a vector containing all necessary columns. ```{r} multipleRates <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio) multipleRates$Estimates ``` Here we see the relative number of persons at risk of poverty and the relative number of one-person households. ## Grouping The `groups` argument can be used to calculate estimators for different subsets of the data. This argument can take the grouping variable as a string that refers to a column name (usually a factor) in `dat`. If set, all estimators are not only split by the reference period but also by the grouping variable. For simplicity, only one reference period of the above data is used. ```{r} dat2 <- subset(dat_boot_calib, year == 2010) for (att in c("period", "weights", "b.rep")) attr(dat2, att) <- attr(dat_boot_calib, att) ``` To calculate the ratio of persons at risk of poverty for each federal state of Austria, `group = "region"` can be used. ```{r} povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region") povertyRates$Estimates ``` The last row with `region = NA` denotes the aggregate over all regions. Note that the columns `N` and `n` now show the weighted and unweighted number of persons in each region. ### Several grouping variables In case more than one grouping variable is used, there are several options of calling `calc.stError()` depending on whether combinations of grouping levels should be regarded or not. We will consider the variables `gender` and `region` as our grouping variables and show three options on how `calc.stError()` can be called. #### Option 1: All regions and all genders Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore $$n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.$$ The last row is again the estimate for the whole period. ```{r} povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = c("gender", "region")) povertyRates$Estimates ``` #### Option 2: All combinations of `state` and `gender` Split the data by all combinations of the two grouping variables. This will result in a larger output-table of the size $$n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.$$ ```{r} povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = list(c("gender", "region"))) povertyRates$Estimates ``` #### Option 3: Cobination of Option 1 and Option 2 In this case, the estimates and standard errors are calculated for * every gender, * every state and * every combination of state and gender. The number of rows in the output is therefore $$n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.$$ ```{r} povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = list("gender", "region", c("gender", "region"))) povertyRates$Estimates ```