---
title: "Privacy-Preserving Data Anonymization with privacyR"
author: "Vikrant Dev Rathore"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Privacy-Preserving Data Anonymization with privacyR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

The `privacyR` package helps you anonymize sensitive data in healthcare and research datasets. It provides tools to protect patient privacy while keeping your data useful for analysis.

## Installation

```{r install, eval = FALSE}
# Install from CRAN
install.packages("privacyR")
```

## Basic Usage

### Anonymizing Patient Identifiers

Anonymize patient IDs while keeping referential integrity (same IDs get the same anonymized value):

```{r anonymize_id}
library(privacyR)

# Original patient IDs
patient_ids <- c("P001", "P002", "P003", "P001", "P002")
print(patient_ids)

# Anonymize IDs
anonymized_ids <- anonymize_id(patient_ids, seed = 123)
print(anonymized_ids)

# Note: Same original IDs map to same anonymized IDs
```

### Anonymizing Patient Names

```{r anonymize_names}
# Original names
names <- c("John Doe", "Jane Smith", "Bob Johnson", "John Doe")
print(names)

# Anonymize names
anonymized_names <- anonymize_names(names, seed = 123)
print(anonymized_names)
```

### Anonymizing Dates

Two methods are available: shifting or rounding.

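
As a rough illustration of the two ideas, independent of how `privacyR` implements them internally, the following base R sketch shifts a vector of dates by one random offset and rounds dates down to the start of a month or year. The helper names `shift_dates_sketch()` and `round_dates_sketch()` and the `max_shift` parameter are hypothetical and are not part of the package.

```r
# Hypothetical helpers for illustration only; not part of privacyR.

# Shift every date by the same random, non-zero offset (in days), so the
# intervals between dates are unchanged.
# Note: this sketch calls set.seed() and does not restore the RNG state.
shift_dates_sketch <- function(dates, max_shift = 365, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  offset <- sample(c(-max_shift:-1, 1:max_shift), size = 1)
  dates + offset
}

# Round dates down to the first day of their month or year.
round_dates_sketch <- function(dates, granularity = c("month", "year")) {
  granularity <- match.arg(granularity)
  fmt <- if (granularity == "month") "%Y-%m-01" else "%Y-01-01"
  as.Date(format(dates, fmt))
}

example_dates <- as.Date(c("2020-01-15", "2020-03-20", "2020-06-10"))

shifted <- shift_dates_sketch(example_dates, seed = 42)

# Compare the gaps between consecutive dates before and after shifting
diff(example_dates)
diff(shifted)

# Coarsen the same dates to month and year buckets
round_dates_sketch(example_dates, "month")
round_dates_sketch(example_dates, "year")
```

Using a single offset for the whole vector is what keeps the intervals intact. Drawing a separate offset per date would hide more, but would also destroy the relative timing.
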
#### Date Shifting

Shifting moves all dates by the same amount, preserving relative time differences:

```{r anonymize_dates_shift}
# Original dates
dates <- as.Date(c("2020-01-15", "2020-03-20", "2020-06-10"))
print(dates)

# Shift dates
shifted_dates <- anonymize_dates(dates, method = "shift", seed = 123)
print(shifted_dates)

# Relative differences are preserved
diff_original <- as.numeric(dates[2] - dates[1])
diff_shifted <- as.numeric(shifted_dates[2] - shifted_dates[1])
cat("Original difference:", diff_original, "days\n")
cat("Shifted difference:", diff_shifted, "days\n")
```

#### Date Rounding

Rounding reduces precision by grouping dates into buckets (day, week, month, year, etc.):

```{r anonymize_dates_round}
# Round to month
rounded_month <- anonymize_dates(dates, method = "round", granularity = "month", seed = 123)
print(rounded_month)

# Round to year
rounded_year <- anonymize_dates(dates, method = "round", granularity = "year", seed = 123)
print(rounded_year)
```

### Anonymizing Locations

```{r anonymize_locations}
# Original locations
locations <- c("New York, NY", "Los Angeles, CA", "Chicago, IL", "New York, NY")
print(locations)

# Generalize locations
generalized <- anonymize_locations(locations, method = "generalize", seed = 123)
print(generalized)

# Or remove locations entirely
removed <- anonymize_locations(locations, method = "remove", seed = 123)
print(removed)
```

## Working with Data Frames

The `anonymize_dataframe()` function provides a convenient way to anonymize entire data frames:

```{r anonymize_dataframe}
# Create sample patient data
patient_data <- data.frame(
  patient_id = c("P001", "P002", "P003", "P001"),
  name = c("John Doe", "Jane Smith", "Bob Johnson", "John Doe"),
  dob = as.Date(c("1980-01-15", "1975-03-20", "1990-06-10", "1980-01-15")),
  admission_date = as.Date(c("2020-01-10", "2020-02-15", "2020-03-20", "2020-01-10")),
  location = c("New York, NY", "Los Angeles, CA", "Chicago, IL", "New York, NY"),
  diagnosis = c("Hypertension", "Diabetes", "Hypertension", "Hypertension"),
  age = c(40, 45, 30, 40)
)

cat("Original data:\n")
print(patient_data)

# Anonymize the entire data frame
anonymized_data <- anonymize_dataframe(patient_data, seed = 123)

cat("\nAnonymized data:\n")
print(anonymized_data)
```

### Auto-detection

By default, `anonymize_dataframe()` automatically detects columns based on naming patterns and data types:

```{r auto_detect}
# The function automatically detects:
# - ID columns: patient_id, subject_id, etc.
# - Name columns: name, patient_name, etc.
# - Date columns: date, dob, admission_date, etc.
# - Location columns: location, address, city, etc.

# You can also manually specify columns
manual_anon <- anonymize_dataframe(
  patient_data,
  id_cols = "patient_id",
  name_cols = "name",
  date_cols = c("dob", "admission_date"),
  location_cols = "location",
  auto_detect = FALSE,
  seed = 123
)
```

## Best Practices

1. **Seeds and reproducibility**:
   - The `seed` parameter is optional (default: `NULL`). When `seed = NULL`, the package still maintains referential integrity using a deterministic hash-based approach, so the same inputs always produce the same outputs.
   - For explicit reproducibility across sessions, provide a seed:

   ```{r best_practices, eval = FALSE}
   anonymized <- anonymize_dataframe(data, seed = 12345)
   ```

   - **Note**: The package always restores your R session's random number generator state after anonymization, so your own random number generation is never affected.

2. **Referential integrity** is maintained automatically: the same original values get the same anonymized values, which preserves relationships in your data. This works even when `seed = NULL`.

3. **Date anonymization**:
   - Use "shift" to preserve relative time differences
   - Use "round" to reduce precision (e.g., month-year format)

4. **Location anonymization**:
   - Use "generalize" to keep some location info
   - Use "remove" when location is too sensitive

5. **Validate your results**: make sure the anonymized data still works for your analysis.

## Privacy Considerations

Keep in mind:

- Complete anonymization is difficult to achieve
- You may need additional privacy measures depending on your use case
- Consider consulting privacy experts for sensitive data
- Review relevant regulations (HIPAA, GDPR, etc.) for your jurisdiction

## Getting Help

For more information, see the package documentation:

```{r help, eval = FALSE}
?anonymize_dataframe
help(package = "privacyR")
```
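
Returning to the validation advice under Best Practices: one simple check, sketched here in base R, is to confirm that the mapping between original and anonymized identifiers is one-to-one. The helper `is_one_to_one()` is hypothetical and not part of `privacyR`, and the pseudonyms below are stand-ins built with `match()` rather than output from the package.

```r
# Hypothetical helper for illustration only; not part of privacyR.
# Returns TRUE when each original value maps to exactly one anonymized value
# and each anonymized value maps back to exactly one original value.
is_one_to_one <- function(original, anonymized) {
  stopifnot(length(original) == length(anonymized))
  pairs <- unique(data.frame(original, anonymized, stringsAsFactors = FALSE))
  !anyDuplicated(pairs$original) && !anyDuplicated(pairs$anonymized)
}

original_ids <- c("P001", "P002", "P003", "P001", "P002")

# Stand-in pseudonyms: repeated IDs receive the same label.
# In practice, pass the output of your anonymization step here instead.
pseudonyms <- sprintf("ID_%03d", match(original_ids, unique(original_ids)))

# Consistent mapping
is_one_to_one(original_ids, pseudonyms)

# Inconsistent mapping: "P001" appears under two different pseudonyms
broken <- c("ID_001", "ID_002", "ID_003", "ID_004", "ID_002")
is_one_to_one(original_ids, broken)
```

A check like this only confirms that joins and repeated-measures structure survive anonymization. It says nothing about re-identification risk, which needs separate review.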