---
title: "Validation tools for identifying and repairing errors in pedigrees"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Validation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)

```

# Introduction

The `BGmisc` R package offers a comprehensive suite of functions tailored for extended behavior genetics analysis, including model identification, calculating relatedness, pedigree conversion, and pedigree simulation. This vignette provides an overview of the validation tools available in the package, designed to identify and repair errors in pedigrees. 

In an ideal world, you would have perfect pedigrees with no errors. However, in the real world, pedigrees are often incomplete, contain errors, or are missing data. The `BGmisc` package provides tools to identify these errors, which is particularly useful for large pedigrees where manual inspection is not feasible. 
While some errors in the package can be automatically repaired, the vast majority require manual inspection. It is often not possible to automatically repair errors in pedigrees, as the correct solution may not be obvious, or may depend on additional information that is not universally available.

# Identifying and Repairing Errors in Pedigrees

## ID Validation

One common issue in pedigree data is the presence of duplicate IDs. There are two main types of ID duplication: within-row duplication and across-row duplication. Within-row duplication occurs when an individual's parents' IDs are incorrectly listed as their own ID. Across-row duplication occurs when two or more individuals share the same ID.

 The `checkIDs` function in BGmisc helps identify by kinds of duplicates. Here's how to use it:

```{r,checkIDs}
library(BGmisc)
# Create a sample dataset
df <- ped2fam(potter, famID = "newFamID", personID = "personID")

# Call the checkIDs function
result <- checkIDs(df, repair = FALSE)
print(result)

#> $all_unique_ids
#> [1] TRUE
#> 
#> $total_non_unique_ids
#> [1] 0
#> 
#> $total_own_father
#> [1] 0
#> 
#> $total_own_mother
#> [1] 0
#> 
#> $total_duplicated_parents
#> [1] 0
#> 
#> $total_within_row_duplicates
#> [1] 0
#> 
#> $within_row_duplicates
#> [1] FALSE
```

In this example, the `checkIDs` function returns a list with several elements. The `all_unique_ids` element indicates whether all IDs in the dataset are unique. The `total_non_unique_ids` element indicates the total number of non-unique IDs. The `total_own_father` and `total_own_mother` elements indicate the total number of individuals whose father's and mother's IDs match their own ID, respectively. The `total_duplicated_parents` element indicates the total number of individuals with duplicated parent IDs. The `total_within_row_duplicates` element indicates the total number of within-row duplicates. The `within_row_duplicates` element indicates whether there are any within-row duplicates in the dataset. As the output shows, there are no duplicates in the sample dataset. 


### Between-Person Duplicates

Let us now consider a scenario where there are between-person duplicates in the dataset. The `checkIDs` function can identify these duplicates and, if the `repair` argument is set to `TRUE`, attempt to repair them. In the example below, we have created two between-person duplicates. First, we have overwritten the `personID` of one person with their sibling's ID. Second, we have added a copy of Dudley Dursley to the dataset.



```{r, repair}
# Create a sample dataset with duplicates
df <- ped2fam(potter, famID = "newFamID", personID = "personID")

# Sibling overwrite
df$personID[df$name == "Vernon Dursley"] <- df$personID[df$name == "Marjorie Dursley"]

# Add a copy of Dudley Dursley
df <- rbind(df, df[df$name == "Dudley Dursley",])
```

Now, let's call the `sumarizeFamilies` function to see what the dataset looks like.

```{r}
library(tidyverse)

summarizeFamilies(df, famID = "newFamID", personID = "personID")$family_summary %>% glimpse()
```

If we didn't know to look for duplicates, we might not notice the issue. Indeed, only of the duplicates was selected as are founder member. However, the `checkIDs` function can help us identify and repair these errors:

```{r}
# Call the checkIDs
result <- checkIDs(df)

print(result)
```

As we can see from this output, there are `r result$total_non_unique_ids` non-unique IDs in the dataset, specifically `r result$non_unique_ids`. Let's take a peek at the duplicates:

```{r}

df %>% filter(personID %in% result$non_unique_ids) %>%
  arrange(personID)

```

Yep, these are  definitely the duplicates.

```{r}
df_repair <- checkIDs(df, repair = TRUE)

df_repair %>% filter(ID %in% result$non_unique_ids) %>%
  arrange(ID)

result <- checkIDs(df_repair)

print(result)
```

Great! The function was able to repair the full duplicate, without any manual intervention. That still leaves us with the sibling overwrite, but that's a more complex issue that would require manual intervention. We'll leave that for now.


### Handling Within-Row Duplicates

Sometimes, an individual's parents' IDs may be incorrectly listed as their own ID, leading to within-row duplicates. The checkIDs function can also identify these errors:

```{r within}
# Create a sample dataset with within-person duplicate parent IDs

df <- ped2fam(potter, famID = "newFamID", personID = "personID")

df$momID[df$name == "Vernon Dursley"] <- df$personID[df$name == "Vernon Dursley"]

# Check for within-row duplicates
result <- checkIDs(df, repair = FALSE)
print(result)
```

In this example, we have created a within-row duplicate by setting the momID of Vernon Dursley to his own ID. The `checkIDs` function correctly identifies this error. 

## Verifying Sex Coding

Another common issue in pedigree data is incorrect coding of biological sex. In genetic studies, ensuring accurate recording of biological sex in pedigree data is crucial for analyses that rely on this information. The `checkSex` function in `BGmisc` helps identify and repair errors related to biological sex coding, such as inconsistencies where an individual's sex is incorrectly recorded. An example of this would be a parent who is biologically male, but listed as a mother. The `checkSex` function can help identify and correct such errors. 

It is essential to distinguish between biological sex (genotype) and gender identity (phenotype). Biological sex is based on chromosomes and other biological characteristics, while gender identity is a broader, richer, personal, deeply-held sense of being male, female, a blend of both, neither, or another gender entirely. While `checkSex` focuses on biological sex necessary for genetic analysis, we respect and recognize the full spectrum of gender identities beyond the binary. The developers of this package affirm their support for folx in the LGBTQ+ community.


The `checkSex` function in `BGmisc` performs two main tasks: identifying possible errors and inconsistencies for variables related to biological sex. The function is capable of validating the sex coding in a pedigree and optionally repairing the sex coding based on specified logic. Here’s how you can use the `checkSex` function to validate and optionally repair sex coding in a pedigree dataset:

```{r}
# Validate sex coding
results <- checkSex(potter, code_male = 1, code_female = 0, verbose = TRUE, repair = FALSE)
print(results)
```

In this example, the `checkSex` function checks the unique values in the sex column and identifies any inconsistencies in the sex coding of parents. The function returns a list containing validation results, such as the unique values found in the sex column and any inconsistencies in the sex coding of parents.

If incorrect sex codes are found, you can attempt to repair them automatically using the repair argument:

```{r}
# Repair sex coding
df_fix <- checkSex(potter, code_male = 1, code_female = 0, verbose = TRUE, repair = TRUE)
print(df_fix)
```

When the repair argument is set to TRUE, the function attempts to repair the sex coding based on specified logic. It recodes the sex variable based on the most frequent sex values found among parents. This ensures that the sex coding is consistent and accurate, which is essential for constructing valid genetic pedigrees.

<!--
## Practical Example: Cleaning a Pedigree

Below is an example of how to clean and repair a pedigree dataset using the BGmisc package. This example is based on the approach Mason typically takes for data cleaning.

```{r, eval = FALSE}
# note, is broken right now
# Load necessary libraries and datasets
library(tidyverse)
library(BGmisc)
set.seed(123)
# Create a sample dataset similar to the one used in Mason's approach
sample_data <- data.frame(
  ID = 1:10,
  name = c("Person1", "Person2", "Person3", "Person4", "Person5", "Person6", "Person7", "Person8", "Person9", "Person10"),
  dadID = c(NA, NA, 1, 1, 3, 3, 5, 5, 7, 7),
  momID = c(NA, NA, 2, 2, 4, 4, 6, 6, 7, 8),
  sex = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  byr = runif(10, 1900, 2000),
  dyr = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
)



summarizePedigrees(sample_data)


# Clean the sample dataset
cleaned_data <- sample_data %>%
  janitor::remove_empty(c("rows", "cols")) %>%
  mutate(
    sex_factor = as.factor(case_when(sex == 1 ~ "male", sex == 0 ~ "female"))
  )

# Check for duplicate IDs
temp_check <- checkIDs(cleaned_data, verbose = TRUE, repair = FALSE)
 all_duplicated_ids   <- cbind(temp_check$non_unique_ids, temp_check$duplicated_parents_ids)

cleaned_data <- cleaned_data %>%
  mutate(
    duplicated = case_when(ID %in% temp_check$non_unique_ids ~ 1, TRUE ~ 0),
    duplicated_parent = case_when(dadID %in% all_duplicated_ids  | momID %in% all_duplicated_ids  ~ 1, TRUE ~ 0),
    duplicated_source_ID = case_when(ID %in% all_duplicated_ids ~ ID, dadID %in% all_duplicated_ids  ~ dadID, momID %in% all_duplicated_ids  ~ momID, TRUE ~ NA_integer_),
    alteredlinks = 0
  )

# Display and manually correct specific errors
cleaned_data %>%
  filter(duplicated == 1 | duplicated_parent == 1) %>%
  arrange(duplicated_source_ID, ID) %>%
  print(n = Inf)

# Perform specific corrections
cleaned_data <- cleaned_data %>%
  mutate(
    alteredlinks = case_when(ID == 9  ~ 1, TRUE ~ alteredlinks),
    ID = case_when(ID == 7 & round(byr, digits = 0) == 2020 ~ ID + 1e6, TRUE ~ ID)
  )

# Final check for remaining duplicates
final_check <- checkIDs(cleaned_data, verbose = TRUE, repair = FALSE)
print(final_check)

```
-->

# Conclusion
This vignette demonstrates how to use the BGmisc package to identify and repair errors in pedigrees. By leveraging functions like checkIDs, checkSex, and recodeSex, you can ensure the integrity of your pedigree data, facilitating accurate analysis and research.