---
title: "The ROCKproject file format"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The ROCKproject file format}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette describes version 1.0 of the ROCK project file format.

ROCK project files have the extension `.ROCKproject` and are ZIP archives. They contain two things:

- Files containing the data, ideally in a deliberately designed set of sub-directories to facilitate tracing the data through different stages of processing and analysis;
- Files containing settings and directives for applications and processing of the data.

The former are raw data files and ROCK files. ROCK files are plain text files with the `.rock` extension. The latter are YAML files. Of these, the only required one is the `_ROCKproject.yml` file.

This file must always be a regular YAML file that contains a map with key `_ROCKproject`. This map in turn must contain maps with keys `project`, `codebook`, `sources`, and `workflow`.

The `project` map contains project metadata, such as the project's `title`, its `authors`, optional (but strongly recommended!) author identifiers in `authorIds`, the project's `version`, the version of the ROCK standard used in the project (with key `ROCK_version`), the version of the ROCK project file format (with key `ROCK_project_version`), the date the project was created (with key `date_created`), and the date the project was last modified (with key `date_modified`).

The `codebook` map contains the project's codebook, either embedded or by linking to it. The `codebook` key can also have value `~` (NULL) if no codebook information is specified (or if the codebook is embedded in the ROCK files). Valid keys within the `codebook` map are `urcid`, `embedded`, and `local`. The `urcid` key can store the project's Unique ROCK Codebook Identifier (i.e. its URCID) as a URL to a ROCK codebook in spreadsheet (`.xlsx` or `.ods`) or YAML (`.yml` or `.rock`) format.

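The structural requirement described above (a `_ROCKproject` map holding the four required keys) can be sketched as a small check in R. This is only an illustration: the function name `missing_rock_project_keys` and the stand-in list are hypothetical and not part of any package, and the list is assumed to be the result of parsing `_ROCKproject.yml` with a YAML parser such as `yaml::read_yaml()`.

```r
# Hypothetical helper: report which required keys are missing from a parsed
# _ROCKproject.yml file (assumed to be available as a nested R list).
missing_rock_project_keys <- function(parsed) {
  if (!is.list(parsed) || !("_ROCKproject" %in% names(parsed))) {
    return("_ROCKproject")
  }
  required <- c("project", "codebook", "sources", "workflow")
  # Compare names rather than values: `codebook: ~` is parsed as NULL,
  # but the key itself is still present.
  setdiff(required, names(parsed[["_ROCKproject"]]))
}

# Stand-in for a parsed file; `codebook` is deliberately NULL.
example <- list(
  "_ROCKproject" = list(
    project  = list(title = "The Alice Study"),
    codebook = NULL,
    sources  = list(extension = ".rock"),
    workflow = list()
  )
)

missing_rock_project_keys(example)  # character(0): nothing is missing
```

Checking key names instead of values keeps a `codebook` of `~` valid, which matches the description above.
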
The `sources` map specifies where the project's data resides. This is specified in terms of regular expressions. The first valid key is `extension`, which is not a regular expression but can be used to conveniently specify that files with a given extension must be imported. It is used if `regex` is `~` (NULL, i.e. unspecified). However, if a value is specified for `regex`, a program importing a ROCK project should ignore whatever is specified for `extension`. The value stored in the `dirsToIncludeRegex` key should be a regular expression indicating which directories contain the data (i.e. the ROCK files forming the project). The `recursive` key can be `true` or `false` and indicates whether all subdirectories of matched directories should be imported too. The `dirsToExcludeRegex` regular expression can be used to ignore directories. In addition, if `filesToIncludeRegex` is specified, only files matching that regular expression should be imported; and if `filesToExcludeRegex` is specified, files matching that regular expression should be ignored.

Finally, the `workflow` map describes the workflow and data management template used in this project. It consists of a `pipeline` and `actions`. The `pipeline` is a sequence of stages, each with an identifier (in key `stage`); the directory containing files in that stage (in key `dirName`; note that this is a single directory name, not a regular expression!); and a sequence of one or more next stages (with key `nextStages`). Each element in `nextStages` has a `nextStageId` key and an `actionId` key. The `nextStageId` specifies to which stage files transfer (i.e. are saved) when the action with the corresponding `actionId` is executed.

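The selection rules for the `sources` map can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `select_source_files` and the file names are hypothetical, directory expressions are assumed to be matched against the directory part of each relative path (with a trailing slash), file expressions against the file name, and `recursive` is not modelled because the paths are assumed to have been listed already.

```r
# Hypothetical helper: decide which files to import, given relative paths
# (using forward slashes) and the `sources` map as an R list.
select_source_files <- function(paths, sources) {
  dirs  <- paste0(dirname(paths), "/")
  files <- basename(paths)
  keep  <- rep(TRUE, length(paths))

  # Directory filters (assumption: matched against the directory part).
  if (!is.null(sources[["dirsToIncludeRegex"]])) {
    keep <- keep & grepl(sources[["dirsToIncludeRegex"]], dirs)
  }
  if (!is.null(sources[["dirsToExcludeRegex"]])) {
    keep <- keep & !grepl(sources[["dirsToExcludeRegex"]], dirs)
  }

  # `regex` takes precedence; `extension` is only used when `regex` is NULL.
  if (!is.null(sources[["regex"]])) {
    keep <- keep & grepl(sources[["regex"]], files)
  } else if (!is.null(sources[["extension"]])) {
    keep <- keep & endsWith(files, sources[["extension"]])
  }

  # File filters (assumption: matched against the file name).
  if (!is.null(sources[["filesToIncludeRegex"]])) {
    keep <- keep & grepl(sources[["filesToIncludeRegex"]], files)
  }
  if (!is.null(sources[["filesToExcludeRegex"]])) {
    keep <- keep & !grepl(sources[["filesToExcludeRegex"]], files)
  }
  paths[keep]
}

# Hypothetical archive contents.
paths <- c(
  "data/010---raw-sources/interview-1.rock",
  "data/010---raw-sources/interview-1.docx",
  "archive/old.rock",
  "data/020---cleaned-sources/interview-1.rock"
)
sources <- list(
  extension = ".rock", regex = NULL, dirsToIncludeRegex = "data/",
  dirsToExcludeRegex = NULL, filesToIncludeRegex = NULL,
  filesToExcludeRegex = NULL
)

select_source_files(paths, sources)  # keeps the first and fourth path
```

Setting `regex` to, for example, `"\\.docx$"` in the same list would select only the `.docx` file, because `extension` is then ignored.
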
The `actions` are stored in a sequence where each element has an `actionId`; a `language` specifying the programming language the action is written in; one or more `dependencies` (typically packages that need to be loaded in that programming environment before the `script` can be executed); and a `script` section specifying the commands to run to execute that action. In this script, two placeholders can be used: `{currentStage::dirName}` will be replaced with the contents of `dirName` for the current stage; and `{nextStage::dirName}` will be replaced with the contents of `dirName` for the next stage. The latter part of these expressions (`dirName` in both of these examples) can be replaced by other keys specified in each stage to allow setting parameters in the pipeline specification.

An example of a `_ROCKproject.yml` file is included below.

```yaml
_ROCKproject:
  project:
    title: "The Alice Study"                    # Any character string
    authors: "Author names as string"           # Any character string
    authorIds:
      - display_name: "Talea Cornelius"         # Any character string
        orcid: "0000-0001-7181-0981"            # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
        shorcid: "ip6b381"                      # Any character string matching ^[0-9a-zA-Z]+$
      - display_name: "Gjalt-Jorn Peters"       # Any character string
        orcid: "0000-0002-0336-9589"            # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
        shorcid: "it36ll9"                      # Any character string matching ^[0-9a-zA-Z]+$
    version: "1.1"                              # Anything matching regex [0-9]+(\\.[0-9]+)*
    ROCK_version: 1                             # Anything matching regex [0-9]+(\\.[0-9]+)*
    ROCK_project_version: 1                     # Anything matching regex [0-9]+(\\.[0-9]+)*
    date_created: "2023-03-01 20:03:51 UTC"     # Anything matching that date format, preferably converted to UTC timezone
    date_modified: "2023-03-08 20:03:51 UTC"    # Anything matching that date format, preferably converted to UTC timezone
  codebook:
    urcid: ""
    embedded: ~
    local: ""
  sources:
    extension: ".rock"                          # Any valid extension
    regex: ~                                    # Any regex or ~
    dirsToIncludeRegex: data/                   # Any regex or ~
    recursive: true                             # true or false
    dirsToExcludeRegex: ~                       # Any regex or ~
    filesToIncludeRegex: ~                      # Any regex or ~
    filesToExcludeRegex: ~                      # Any regex or ~
  workflow:
    pipeline:
      - stage: raw                              # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/010---raw-sources"       # Any valid directory name, using a forward slash as separator
        nextStages:
          - nextStageId: clean                  # A different stage identifier or ~
            actionId: cleanSource
          - nextStageId: uids                   # A different stage identifier or ~
            actionId: addUIDs
      - stage: clean                            # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/020---cleaned-sources"   # Any valid directory name, using a forward slash as separator
        nextStages:
          - nextStageId: uids                   # A different stage identifier or ~
            actionId: addUIDs
      - stage: uids                             # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/030---sources-with-uids" # Any valid directory name, using a forward slash as separator
        nextStage: coded                        # A different stage identifier or ~
      - stage: coded                            # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/040---coded-sources"     # Any valid directory name, using a forward slash as separator
        nextStage: masked                       # A different stage identifier or ~
      - stage: masked                           # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/090---masked-sources"    # Any valid directory name, using a forward slash as separator
        nextStage: ~                            # A different stage identifier or ~
    actions:
      - actionId: addUIDs                       # String, referenced from the stages
        language: R                             # Language, has to be matched to interpreter
        dependencies: rock                      # Dependencies to be loaded before running the script
        script: |                               # Literal block style string
          rock::prepend_ids_to_sources(
            input = {currentStage::dirName},
            output = {nextStage::dirName}
          );
```
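
The placeholder mechanism described for action scripts can be sketched in base R as follows. This is an illustration only: the function name `fill_placeholders` is hypothetical, the two stages are assumed to be available as R lists, and the sketch inserts each value verbatim, since whether an implementation should add quotes around substituted values is not specified here.

```r
# Hypothetical helper: fill {currentStage::key} and {nextStage::key}
# placeholders in an action script from two stage lists.
fill_placeholders <- function(script, currentStage, nextStage) {
  stages <- list(currentStage = currentStage, nextStage = nextStage)
  for (stageName in names(stages)) {
    for (key in names(stages[[stageName]])) {
      value <- stages[[stageName]][[key]]
      # Only single atomic values (such as dirName) are substituted;
      # nested entries such as nextStages are skipped.
      if (length(value) == 1 && is.atomic(value)) {
        placeholder <- paste0("{", stageName, "::", key, "}")
        script <- gsub(placeholder, as.character(value), script, fixed = TRUE)
      }
    }
  }
  script
}

# Two stages taken from the example file, written as R lists.
raw  <- list(stage = "raw",  dirName = "data/010---raw-sources")
uids <- list(stage = "uids", dirName = "data/030---sources-with-uids")

script <- paste0(
  "rock::prepend_ids_to_sources(",
  "input = {currentStage::dirName}, output = {nextStage::dirName});"
)

fill_placeholders(script, raw, uids)
# "rock::prepend_ids_to_sources(input = data/010---raw-sources, output = data/030---sources-with-uids);"
```

Because the loop runs over every key in each stage, any additional scalar key added to a stage becomes available as a placeholder in the same way as `dirName`.
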