--- title: "Inspect SGICs" output: rmarkdown::html_vignette description: > This vingette shows you how to use the package for checking SGIcs and related survey data on plausibility vignette: > %\VignetteIndexEntry{inspect_sgics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Linking survey data with SGICs (Subject Generated Identification-Codes)? Awesome! Just remember, you need to validate those IDs. That's how you get clean data and make sure the link-up goes smoothly. This vignette shows you: - How to perform plausibility checks on different SGIC components. - How to perform plausibility checks on non-SGIC variables that may serve as additional identifiers. - How to detect duplicate cases using a combination of variables as unique identifiers. To check the plausibility of ID-related variables in a dataset, `trustmebro` provides several functions beginning with the prefix *inspect*. Every *inspect*-function returns a boolean value, indicating whether a value has passed or failed the plausibility check. We\`ll start by loading trustmebro and dplyr: ```{r setup, message=FALSE} library(trustmebro) library(dplyr) ``` # Data: sailor_students The survey data we use is the `trustmebro::sailor_students` dataset. It contains fictional student assessment data from students of the sailor moon universe. ```{r} sailor_students ``` # SGIC Plausibility The variable `sgic` stores SGICs created by students. Each SGIC is a seven-character string created according to the following instructions: Characters 1-3 (letters): - First letter of given name (1st character) - Last letter of given name (2nd character) - First letter of family name (3rd character) Characters 4-7 (digits): - Birthday (4th and 5th character) - Month of birth (6th and 7th character) ## Check Character IDs We can use `trustmebro::inspect_characterid` to check if the provided SGICs adhere to the expected pattern of three letters followed by four digits. The expected structure can be defined using the regular expression `"^[A-Za-z]{3}[0-9]{4}$"`, which we can then pass to the function using the `pattern =` argument. For seamless integration into your data workflow, this function can be conveniently combined with `dplyr::mutate`: ```{r} sailor_students %>% mutate(structure_check = inspect_characterid( sgic, pattern = "^[A-Za-z]{3}[0-9]{4}$")) %>% select(sgic, structure_check) ``` We created `trustmebro::inspect_characterid` with SGICs in mind, but of course, any other non-SGIC strings can also be checked using a specified regular expression. ## Check Birthdate-Components Since the SGIC should end with a date of birth, you can verify the plausibility of this date of birth using `trustmebro::inspect_birthdaymonth`. This function checks if a string contains exactly four digits representing a valid date of birth. As before, you can combine `trustmebro::inspect_birthdaymonth` with `dplyr::mutate` to generate a plausibility check variable: ```{r} sailor_students %>% mutate(birthdate_check = inspect_birthdaymonth(sgic)) %>% select(sgic, birthdate_check) ``` Some SGICs only use the single day or month a person was born. In this case, you can use of `trustmebro::inspect_birthday` or `trustmebro::inspect_birthmonth` accordingly. # Non-SGIC variables' plausibility Besides a SGIC, other variables in a given dataset might be used to identify cases. As mentioned above, `trustmebro::inspect_characterid` can be used for any string that should follow a specific pattern. Furthermore, this package also provides functions for checking other data types beyond strings. ## Check Numbers We can use `trustmebro::inspect_numberid` to check if a number matches an expected length. In our dataset, `school` should be a five-digit number. combined with `dplyr::mutate`, we can add a plausibility variable for the schoolnumber, just as we did before: ```{r} sailor_students %>% mutate(school_check = inspect_numberid(school, 5)) %>% select(school, school_check) ``` ## Check the presence of a value within the recode map In the process of using non-SGIC variables as identifiers, categorical data is often recoded to ensure consistency within a workflow. We can use `trustmebro::inspect_valinvec` to check if a value exists in a recode map. The recode map should be a named vector, where the names represent the keys. In our dataset, we want to inspect if all values in `gender` conform to this recode map: ```{r} recode_gender <- c(Male = "M", Female = "F") ``` The function checks if a value is present as a key. Combine with `dplyr::mutate` to add a variable that contains the check results: ```{r} sailor_students %>% mutate(gender_check = inspect_valinvec(gender, recode_gender)) %>% select(gender, gender_check) ``` # Identify Duplicate Cases So far, we've checked if `SGIC`, `school` and `gender` contain plausible values. Last, we want to ensure that these variables, when used together as identifiers, uniquely identify a single case and that there are no duplicate entries based on these variables. `trustmebro::find_dupes` checks whether the combination of identifiers is unique by adding a has_dupes variable to the dataset. To find duplicates in your data, use it like this: ```{r} sailor_students %>% find_dupes(school, sgic, gender) %>% select(school, sgic, gender, has_dupes) ```