--- title: "Introduction to samplingin" # author: "Choerul Afifanto" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to samplingin} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) library(samplingin) library(dplyr) library(magrittr) set.seed(114) ``` ## Population DataFrame We'll use the dataset `pop_dt`. The dataset contains tabulation of Indonesia's population based on the results of the 2020 population census by regency/city and gender from BPS-Statistics Indonesia . ```{r} dim(pop_dt) pop_dt %>% head() ``` ## Allocation DataFrame The dataset used is `alokasi_dt` which is a dataset consisting of sample allocations for each province for sampling purposes. ```{r} dim(alokasi_dt) alokasi_dt ``` ## Simple Random Sampling (SRS) A simple random sample is a randomly selected subset of a population. In this sampling method, each member of the population has an exactly equal chance of being selected. The following is the syntax for simple random sampling. Use parameter `method = 'srs'` ```{r} dtSampling_srs = doSampling( pop = pop_dt, alloc = alokasi_dt, nsample = "n_primary", seed = 7891, method = "srs", ident = c("kdprov"), type = "U" ) ``` Displaying the primary sampling result #### Population Sampled ```{r} head(dtSampling_srs$pop) ``` #### Units Sampled ```{r} head(dtSampling_srs$sampledf) dtSampling_srs$sampledf %>% nrow ``` #### Sampling Details ```{r} head(dtSampling_srs$details) ``` ## Systematic Random Sampling Systematic random sampling is a method to select samples at a particular preset interval. Using population and allocation data that has been provided previously, we will carry out systematic random sampling by utilizing the `doSampling` function from `samplingin` package. Use parameter `method = 'systematic'` ### Primary Units Sampling The following is the syntax for sampling the primary units ```{r} dtSampling_u = doSampling( pop = pop_dt, alloc = alokasi_dt, nsample = "n_primary", seed = 2, method = "systematic", ident = c("kdprov"), type = "U" ) ``` Displaying the primary sampling result #### Population Sampled ```{r} head(dtSampling_u$pop) ``` #### Units Sampled ```{r} head(dtSampling_u$sampledf) dtSampling_u$sampledf %>% nrow ``` #### Sampling Details ```{r} head(dtSampling_u$details) ``` ### Secondary Units Sampling To perform sampling for secondary units, we utilize the population results from prior sampling, which have been marked for the selected primary units. Parameters in `doSampling` are added with `is_secondary=TRUE`. ```{r} alokasi_dt_p = alokasi_dt %>% mutate(n_secondary = 2*n_primary) dtSampling_p = doSampling( pop = dtSampling_u$pop, alloc = alokasi_dt_p, nsample = "n_secondary", seed = 243, method = "systematic", ident = c("kdprov"), type = "P", is_secondary = TRUE ) ``` It can be seen that there are still 2 units that have not been selected as samples. To view the allocation that has not yet been selected as samples, it is as follows: ```{r} dtSampling_p$details %>% filter(n_deficit>0) ``` Displaying the secondary sampling result #### Population Sampled ```{r} head(dtSampling_p$pop) ``` Flags for primary and secondary units ```{r} dtSampling_p$pop %>% count(flags) ``` #### Units Sampled ```{r} head(dtSampling_p$sampledf) dtSampling_p$sampledf %>% nrow ``` #### Sampling Details ```{r} head(dtSampling_p$details) ``` ## PPS Systematic Sampling PPS systematic sampling is a method of sampling from a finite population in which a size measure is available for each population unit before sampling and where the probability of selecting a unit is proportional to its size. Units with larger sizes have more chance to be selected. We will use `doSampling` function with parameter `method = 'pps'` and `auxVar = 'Total'` for its auxiliary variable. ```{r} dtSampling_pps = doSampling( pop = pop_dt, alloc = alokasi_dt, nsample = "n_primary", seed = 321, method = "pps", auxVar = "Total", ident = c("kdprov"), type = "U" ) ``` Displaying the PPS sampling result #### Population Sampled ```{r} head(dtSampling_pps$pop) ``` #### Units Sampled ```{r} head(dtSampling_pps$sampledf) dtSampling_pps$sampledf %>% nrow ``` #### Sampling Details ```{r} head(dtSampling_pps$details) ``` ## Sampling using Stratification For sampling that utilizes stratification, the `doSampling` function includes additional parameter called `strata`. The strata variable must be available in the population and the allocation being used. For example, in the `pop_dt` data, information about `strata` is added, namely `strata_kabkot`, which indicates information about districts (strata_kabkot = 1) and cities (strata_kabkot = 2). ```{r} pop_dt_strata = pop_dt %>% mutate( strata_kabkot = ifelse(substr(kdkab,1,1)=='7', 2, 1) ) alokasi_dt_strata = pop_dt_strata %>% group_by(kdprov,strata_kabkot) %>% summarise( jml_kabkota = n() ) %>% ungroup %>% left_join( alokasi_dt %>% select(kdprov,n_primary) %>% rename(n_alloc = n_primary) ) alokasi_dt_strata = alokasi_dt_strata %>% get_allocation(n_alloc = "n_alloc", group = c("kdprov"), pop_var = "jml_kabkota") dtSampling_strata = doSampling( pop = pop_dt_strata, alloc = alokasi_dt_strata, nsample = "n_primary", seed = 3512, method = "systematic", strata = "strata_kabkot", ident = c("kdprov"), type = "U" ) ``` Displaying the sampling result with stratification #### Population Sampled ```{r} head(dtSampling_strata$pop) ``` #### Units Sampled ```{r} head(dtSampling_strata$sampledf) dtSampling_strata$sampledf %>% nrow dtSampling_strata$sampledf %>% count(strata_kabkot) ``` #### Sampling Details ```{r} head(dtSampling_strata$details) ``` ## Sampling with Implicit Stratification So that the characteristics of the selected sample are distributed according to certain variables, sampling sometimes employs **implicit stratification**. For instance, if you aim to obtain samples distributed according to the total population, you can add the parameter `implicitby = 'Total'` when conducting sampling. ```{r} dtSampling_implicit = doSampling( pop = pop_dt_strata, alloc = alokasi_dt_strata, nsample = "n_primary", seed = 3512, method = "systematic", strata = "strata_kabkot", implicitby = "Total", ident = c("kdprov"), type = "U" ) ``` Displaying the sampling result with implicit stratification #### Population Sampled ```{r} head(dtSampling_implicit$pop) ``` #### Units Sampled ```{r} head(dtSampling_implicit$sampledf) dtSampling_implicit$sampledf %>% nrow dtSampling_implicit$sampledf %>% count(strata_kabkot) ``` #### Sampling Details ```{r} head(dtSampling_implicit$details) ``` ## Sampling with Predetermined Random Number Sometimes, the random numbers for sampling have already been determined beforehand. Thus, for sampling using those predetermined random numbers, the `samplingin` package accommodates this by adding the parameter `predetermined_rn`, which takes the value of the variable storing the predetermined random numbers. For example, if the random numbers are stored in the allocation data frame under the variable name `arand`, thus we add `predetermined_rn = 'arand'` ```{r} set.seed(988) alokasi_dt_arand = alokasi_dt_strata %>% mutate(arand = runif(n(),0,1)) alokasi_dt_arand %>% as.data.frame %>% head(10) dtSampling_prn = doSampling( pop = pop_dt_strata, alloc = alokasi_dt_arand, nsample = "n_primary", seed = 974, method = "systematic", strata = "strata_kabkot", predetermined_rn = "arand", ident = c("kdprov"), type = "U" ) ``` Displaying the sampling result with predetermined random number #### Population Sampled ```{r} head(dtSampling_prn$pop) ``` #### Units Sampled ```{r} head(dtSampling_prn$sampledf) dtSampling_prn$sampledf %>% nrow ``` #### Sampling Details ```{r} head(dtSampling_prn$details) ``` ## Allocate predetermined allocations to smaller levels One of the supporting functions in the `samplingin` package is `get_allocation`. This function aims to allocate sample allocations to lower levels using the *proportional allocation* method based on the square root of the specified variable. For example, sample allocations are available at the Province level, which will be allocated to lower levels such as Districts/Cities using the *proportional allocation* method based on the square root of the total population (`Total`). ```{r} set.seed(242) alokasi_prov = alokasi_dt %>% select(-jml_kabkota, -n_primary) %>% mutate(init_alloc = as.integer(runif(n(), 100, 200))) %>% as.data.frame() alokasi_prov %>% head(10) alokasi_prov %>% summarise(sum(init_alloc)) alokasi_kab = pop_dt %>% left_join(alokasi_prov) %>% get_allocation(n_alloc = "init_alloc", group = c("kdprov"), pop_var = "Total") %>% as.data.frame() alokasi_kab %>% head(10) alokasi_kab %>% summarise(sum(n_primary)) alokasi_kab %>% group_by(kdprov) %>% summarise(sum(n_primary)) # check all.equal( alokasi_prov, alokasi_kab %>% group_by(kdprov) %>% summarise(init_alloc=sum(n_primary)) %>% ungroup() %>% as.data.frame() ) ```