--- title: "Design Principles for {ColOpenData}" output: rmarkdown::html_vignette: df_print: tibble vignette: > %\VignetteIndexEntry{Design Principles for {ColOpenData}} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = identical(tolower(Sys.getenv("NOT_CRAN")), "true") ) ``` ::: {style="text-align: justify;"} This vignette outlines the design decisions that have been taken during the development of the ColOpenData R package, and provides some of the reasoning, and possible pros and cons of each decision. This document is primarily intended to be read by those interested in understanding the code within the package and for potential package contributors. ::: ## Scope ::: {style="text-align: justify;"} ColOpenData provides an easy access to Colombian Open Government Data from two main data sources: Departamento Administrativo Nacional de Estadística (DANE), and Instituto de Hidrología, Meteorología y Estudios Ambientales (IDEAM). These include Demographic, Geospatial, Climate, and Population Projections' datasets. The package centralizes information an provides harmonized and cleaned datasets for easier data analysis. In addition, the package includes a series of documentation and dictionaries to help the user navigate the open datasets available. Further information regarding included documentation, naming conventions, and uses can be found here: [Documentation and Datasets](https://epiverse-trace.github.io/ColOpenData/articles/documentation_and_dictionaries.html) ::: ## Output ::: {style="text-align: justify;"} The output for each function is a `data.frame` with the requested data. When downloading geospatial data, which includes maps, an `sf` object is created. ::: ## Design Decisions ::: {style="text-align: justify;"} A manual cleaning process was necessary to address issues like inconsistent headers in the original datasets, while harmonization ensured alignment with tidy data standards and national naming conventions. To avoid repeating these steps at each download, and ensure stable and independent access to open data, the already processed datasets were stored on a private server for later redistribution. The datasets list and dictionaries are stored inside the package data, to avoid unnecessary downloads of frequently consulted datasets. Additional particularities include: - The datasets are only available in the original language (Spanish). However, dictionaries and dataset lists are available in English and Spanish. - Geospatial datasets include very detailed and segregated structures (down to block level). To avoid extremely long download times, a simplified version was included for all the maps, allowing the users to select either of them according to their needs. - Climate datasets are stored as they were acquired from the source, so filtering and aggregation occurs during each user's request. - Dictionaries are only provided for geospatial data, since they were the only ones originally provided from the source. ::: ## Dependencies ::: {style="text-align: justify;"} The package dependencies are: - `{checkmate}` - `{config}` - `{dplyr}` - `{magrittr}` - `{rlang}` - `{sf}` - `{stringdist}` - `{tidyr}` - `{utils}` ::: ## Additional Considerations ::: {style="text-align: justify;"} The datasets included in the package contain changes which may alter the structure, format, or content, meaning the data does not reflect the official dataset. The package is developed independently, with no endorsement or involvement from these institutions or any Colombian government body. The authors of ColOpenData are not liable for how users utilize the data, and users are responsible for any outcomes from their use or analysis of the data. :::