---
title: "OBL: Optimum Block Length for Bootstrap Methods"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{OBL}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: "economia.bib"
logo: obl.PNG
---

```{r setup2, include = FALSE}
knitr::opts_chunk$set(echo = TRUE,
  comment = "",
  collapse = TRUE,
  error = TRUE, # do not interrupt in case of errors
  message = FALSE,
  warning = FALSE,
  comma = function(x) format(x, digits = 2, big.mark = ",")
)
```

```{r setup}
library(OBL)
```

## Introduction

The ARIMA method has limitations, among others with small sample sizes. Although analyses of small-sample series are available in a few cases, there is currently no widely applicable and easily accessible method for small-sample inference. Methods such as Edgeworth expansions involve a lot of algebra (which may discourage users) and apply only in very special cases. The regular bootstrap method, a potential alternative, fails on the grounds of conflicting assumptions: it assumes that observations are independent and identically distributed (i.i.d.), whereas time series data are typically dependent.

## The Need for Block Bootstrap Methods

To avoid the i.i.d. assumption of the regular bootstrap method of @efron1979 while still maintaining the dependence structure of time series data, one can retain a reasonable amount of that dependence by slicing the series into a number of chunks (blocks), each of length *l*. This way the dependence structure within each block is kept. Instead of sampling each observation randomly with replacement (as the traditional bootstrap would do), the blocks are sampled. Only the dependence structure between blocks is distorted (serial correlation is broken at the block boundaries), while the i.i.d. assumption is preserved for the resampled blocks. In this way, the i.i.d. assumption of the regular bootstrap and the serial correlation of typical time series data are reconciled in one method. Methods that achieve these two opposing objectives are collectively called Block Bootstrap Methods.

## Statement of Problem

The main challenge with block bootstrap procedures is the sensitivity of the Root Mean Squared Error (RMSE) to the choice of block length (*l*) or of the number of blocks (*m*). This is one problem defined in two ways; the `OBL: Optimum Block Length` package approaches it through **the choice of block length**. Several methods can be used (briefly explained below), and each method has numerous candidate block lengths that must be considered. This is the problem that the `OBL: Optimum Block Length` package sets out to solve.

## About the Package

The OBL package provides the optimum block length for five (5) different block bootstrap methods, namely:

1. The Non-overlapping Block Bootstrap (NBB) uses the method described in @Carlstein1986, which splits the original series into non-overlapping blocks and then resamples the blocks multiple times (the number of replicates is named *R*) to form a new series.

2. The Moving Block Bootstrap (MBB), otherwise called the Overlapping Block Bootstrap, uses the method described in @kunsch1989, which splits the original series into overlapping blocks and then resamples the blocks multiple times (*R*) to form a new series.

3. The Circular Block Bootstrap (CBB) uses the method described in @Politis1992, an improvement on the MBB of @kunsch1989 in which provision is made for observations at the tail end of the original series that would otherwise be cut off from resampling simply because the left-over element(s) do not make up the predetermined block length.
   This happens when the length of the original series is not divisible by $n - l + 1$, where $n$ is the length of the original series and $l$ is the predetermined block length, $1 < l < n$. The provision is made by completing the left-over block with the first element(s) of the original series, so that the series forms a circle. Afterwards, the blocks are resampled multiple times (*R*) to form a new series.

4. The Tapered Moving Block Bootstrap (TMBB, unpublished) is designed to reduce the number of under-represented extreme members of the series from $2l$ to just 2. Reducing the number of under-represented elements helps to improve the model evaluation metrics (RMSE and MAE). Afterwards, the blocks are resampled multiple times (*R*) to form a new series.

5. The Tapered Circular Block Bootstrap (TCBB, unpublished) is an extension of TMBB in which the last block contains the first element of the parent series as its last sub-series element. It is designed to reduce the number of under-represented extreme members of the series from $2l$ (in the case of MBB) and from 2 (in the case of TMBB) to just 1. Reducing the number of under-represented elements helps to improve the model evaluation metrics (which leads to a reduced RMSE). Afterwards, the blocks are resampled multiple times (*R*) to form a new series.

The package also checks every possible block length *l* (where $1 < l < n$ and $n$ is the length of the original time series) for each method, computing the RMSE for every candidate block length and identifying the one with the minimum value.
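
The block constructions described in the list above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper names (`nbb_blocks()`, `mbb_blocks()`, `cbb_blocks()`, `resample_blocks()`) are hypothetical and are not functions of the OBL package, the constructions follow the usual textbook definitions, and the package's internal implementation may differ. The tapered variants are left out because their tapering scheme is not spelled out here.

```r
# Illustrative helpers (hypothetical names, NOT part of the OBL package).

# Non-overlapping blocks: consecutive chunks of length l.
# A shorter left-over chunk at the end of the series is dropped.
nbb_blocks <- function(x, l) {
  m <- length(x) %/% l
  lapply(seq_len(m), function(i) x[((i - 1) * l + 1):(i * l)])
}

# Moving (overlapping) blocks: every window of length l,
# which gives n - l + 1 blocks in total.
mbb_blocks <- function(x, l) {
  n <- length(x)
  lapply(seq_len(n - l + 1), function(i) x[i:(i + l - 1)])
}

# Circular blocks: the first l - 1 observations are appended to the end,
# so a block can start at each of the n positions.
cbb_blocks <- function(x, l) {
  n <- length(x)
  wrapped <- c(x, x[seq_len(l - 1)])
  lapply(seq_len(n), function(i) wrapped[i:(i + l - 1)])
}

# Draw whole blocks with replacement and join them into one
# bootstrap series of the original length n.
resample_blocks <- function(blocks, n) {
  l <- length(blocks[[1]])
  k <- ceiling(n / l)
  picked <- sample(length(blocks), size = k, replace = TRUE)
  unlist(blocks[picked])[seq_len(n)]
}

x <- 1:10
length(nbb_blocks(x, 3))  # 3 blocks; the 10th observation is left over
length(mbb_blocks(x, 3))  # 8 blocks (n - l + 1)
length(cbb_blocks(x, 3))  # 10 blocks; the last one is c(10, 1, 2)

# One bootstrap replicate built from moving blocks
set.seed(1)
x_star <- resample_blocks(mbb_blocks(x, 3), length(x))
```

Repeating the last step *R* times for a given *l* yields the *R* resampled series on which an error measure such as the RMSE could then be computed.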
The minimum RMSE value for each method is collected in a data frame (with three (3) columns, namely Methods, lb and RMSE) so that users of the `OBL: Optimum Block Length` package can choose the method and block length with the smallest RMSE from the output data frame.

## Installation

You can install the development version from GitHub with:

```{r installation, include = TRUE, eval = FALSE}
install.packages("devtools")
devtools::install_github("sta189332/OBL")
```

## Description

The optimum block length for a time series depends on the characteristics of that particular series. Block bootstrap users therefore need a flexible way of choosing the optimum block length, one that is concise, clear and easy to use. The `OBL: Optimum Block Length` package was created to meet this need.

The package helps users search for the block length and the method with the minimum RMSE value. The `blockboot()` function produces a data frame with three (3) columns (Methods, lb and RMSE). The `lolliblock()` function plots a lollipop chart of the data frame produced by `blockboot()`. It shows the optimum block length for each method, with colours ranging from red to green: red marks the method with the worst performance (the highest RMSE), while green marks the method with the smallest RMSE. The corresponding block length of each method is shown in the **legend** with matching colours.

## Usage

At a minimum, `blockboot()` requires `ts`, which should be a univariate time series, and `R`, the number of resampling replicates.
```{r usage_blockboot, include = TRUE, eval = FALSE}
blockboot(ts, R, seed, n_cores, methods = c("optnbb", "optmbb", "optcbb", "opttmbb", "opttcbb"))
```

Likewise, at a minimum `lolliblock()` requires `ts`, which should be a univariate time series, and `R`, the number of resampling replicates.

```{r usage_lolliblock, include = TRUE, eval = FALSE}
lolliblock(ts, R, seed, n_cores, methods = c("optnbb", "optmbb", "optcbb", "opttmbb", "opttcbb"))
```

### Argument

:::::::::::::: {.columns}
::: {.column width="20%"}

ts

R

seed

n_cores

methods

:::
::: {.column width="70%"}

univariate time series data

number of replicates for resampling

RNG seed

number of core(s) to be used on your operating system

optional; if specified, it must be any combination of the following: "optnbb", "optmbb", "optcbb", "opttmbb", "opttcbb"

:::
::::::::::::::

### Output

The `blockboot()` function outputs a data frame with 5 rows and 3 columns, which are "Methods", "lb" and "RMSE". The method with the minimum RMSE value is the preferred one.

### Examples

```{r simulate1, include = TRUE, eval = FALSE}
# simulate univariate time series data
set.seed(289805)
ts <- arima.sim(n = 10, model = list(ar = 0.8, order = c(1, 0, 0)), sd = 1)

# get the optimal block length table
OBL::blockboot(ts = ts, R = 100, seed = 6, n_cores = 2)
#   Methods lb      RMSE
# 1     nbb  9 0.2402482
# 2     mbb  9 0.1023012
# 3     cbb  8 0.2031448
# 4    tmbb  4 0.2654746
# 5    tcbb  9 0.4048711
```

The `lolliblock()` function outputs a lollipop chart with 5 pops, one for each of the 5 methods, in 5 distinct colours: the red lollipop marks the least desirable method (the highest RMSE) and the green lollipop marks the preferred method (the lowest RMSE). The **legend** beside the chart indicates the optimum block length for each method.
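
As a small sketch of how the table returned by `blockboot()` might be used, the snippet below rebuilds the example output shown above by hand (the values are copied from that output, not recomputed) and ranks the methods by RMSE using base R only. The object names `res`, `ranked` and `best` are illustrative.

```r
# Example output of blockboot(), typed in by hand from the table above
res <- data.frame(
  Methods = c("nbb", "mbb", "cbb", "tmbb", "tcbb"),
  lb      = c(9, 9, 8, 4, 9),
  RMSE    = c(0.2402482, 0.1023012, 0.2031448, 0.2654746, 0.4048711),
  stringsAsFactors = FALSE
)

# Rank the methods from lowest to highest RMSE
ranked <- res[order(res$RMSE), ]

# The first row holds the preferred method and its block length
best <- ranked[1, ]
best$Methods  # "mbb"
best$lb       # 9
```

For this example the first row of `ranked` (mbb) is the method that the chart would show in green, and the last row (tcbb) is the one it would show in red.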
### Examples 2

```{r simulate2, include = TRUE, eval = FALSE}
# simulate univariate time series data
set.seed(289805)
ts <- arima.sim(n = 10, model = list(ar = 0.8, order = c(1, 0, 0)), sd = 1)

# plot the lollipop chart of the optimal block lengths
OBL::lolliblock(ts = ts, R = 100, seed = 6, n_cores = 2)
```

## Reference