---
title: "Genetic File Information"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Genetic File Information}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(BinaryDosage)
```

The routines <span style="font-family:Courier">getbdinfo</span>, <span style="font-family:Courier">getvcfinfo</span>, and <span style="font-family:Courier">getgeninfo</span> return a list with information about the data in the files. The list returned by each of these routines has a section common to them all and a list, <span style="font-family:Courier">additionalinfo</span>, that is specific to the file type.

## Common section

The common section has the following elements:

- filename - Character value with the complete path and file name of the file with the genetic data
- usesfid - Logical value indicating if the subject data has family IDs
- samples - Data frame containing the following information about the subjects
  + fid - Character value with family IDs
  + sid - Character value with the individual IDs
- onchr - Logical value indicating if all the SNPs are on the same chromosome
- snpidformat - Integer indicating the format of the SNP IDs as follows
  + 0 - Unknown for VCF and GEN files or user specified for binary dosage files
  + 1 - chromosome:location
  + 2 - chromosome:location:referenceallele:alternateallele
  + 3 - chromosome:location_referenceallele_alternateallele
- snps - Data frame containing the following values
  + chromosome - Character value indicating what chromosome the SNP is on
  + location - Integer value with the location of the SNP on the chromosome
  + snpid - Character value with the ID of the SNP
  + reference - Character value of the reference allele
  + alternate - Character value of the alternate allele
- snpinfo - List that contains the following information
  + aaf - Numeric vector with the alternate allele frequencies
  + maf - Numeric vector with the minor allele frequencies
  + avgcall - Numeric vector with the imputation average call
  + rsq - Numeric vector with the imputation r squared value
- datasize - Numeric vector indicating the size of data in the file for each SNP
- indices - Numeric vector indicating the starting location in the file for each SNP
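
The three known <span style="font-family:Courier">snpidformat</span> layouts can be illustrated with a short sketch. The data frame and the helper <span style="font-family:Courier">make_snpid</span> below are hypothetical and are not part of the package; they only show how an ID in formats 1, 2, and 3 relates to the other columns of <span style="font-family:Courier">snps</span>.

```{r}
# Hypothetical SNP table laid out like the snps data frame described above
snps <- data.frame(
  chromosome = c("1", "1"),
  location = c(10000L, 11000L),
  reference = c("A", "C"),
  alternate = c("G", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the SNP ID implied by snpidformat 1, 2, or 3
make_snpid <- function(snps, snpidformat) {
  # format() keeps large locations out of scientific notation
  loc <- format(snps$location, scientific = FALSE, trim = TRUE)
  base <- paste(snps$chromosome, loc, sep = ":")
  switch(as.character(snpidformat),
    "1" = base,
    "2" = paste(base, snps$reference, snps$alternate, sep = ":"),
    "3" = paste(base, snps$reference, snps$alternate, sep = "_"),
    stop("snpidformat 0 does not define a layout for the SNP ID")
  )
}

make_snpid(snps, 1L)
make_snpid(snps, 2L)
make_snpid(snps, 3L)
```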

The list returned has its class value set to "genetic-info".

The <span style="font-family:Courier">datasize</span> and <span style="font-family:Courier">indices</span> values are only returned if the parameter <span style="font-family:Courier">index</span> is set to <span style="font-family:Courier">TRUE</span>.
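
As a rough illustration of how the common section might be inspected, the sketch below builds a stand-in list by hand. Every value in it is made up, and the object is not produced by any of the package routines; only the element names and the class come from the description above. The index check uses <span style="font-family:Courier">length</span> so that it gives the same answer whether the values are absent or empty, which is an assumption about how a missing index is represented.

```{r}
# Hypothetical stand-in for a returned list; every value here is made up
info <- list(
  filename = "example.vcf",
  usesfid = TRUE,
  samples = data.frame(fid = c("F1", "F1", "F2"),
                       sid = c("I1", "I2", "I3"),
                       stringsAsFactors = FALSE),
  onchr = TRUE,
  snpidformat = 1L,
  snps = data.frame(chromosome = c("1", "1"),
                    location = c(10000L, 11000L),
                    snpid = c("1:10000", "1:11000"),
                    reference = c("A", "C"),
                    alternate = c("G", "T"),
                    stringsAsFactors = FALSE),
  snpinfo = list(aaf = c(0.25, 0.75)),
  datasize = c(12, 12),
  indices = c(100, 112)
)
class(info) <- "genetic-info"

# Check the class and count subjects and SNPs
inherits(info, "genetic-info")
nrow(info$samples)
nrow(info$snps)

# TRUE only when both index vectors are present and non-empty
has_index <- length(info$indices) > 0 && length(info$datasize) > 0
has_index

# Minor allele frequency implied by the alternate allele frequency
pmin(info$snpinfo$aaf, 1 - info$snpinfo$aaf)
```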

## Binary Dosage Additional Information

The additional information returned for binary dosage files contains the following information.

- format - Numeric value with the format of the binary dosage file
- subformat - Numeric value with the subformat of the binary dosage file
- headersize - Integer value with the size of the header in the binary dosage file
- numgroups - Integer value with the number of groups of subjects in the binary dosage file. This is usually the number of binary dosage files merged together to form the file
- groups - Integer vector with the size of each group

This list has its class value set to "bdose-info".
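
The relationship between <span style="font-family:Courier">numgroups</span> and <span style="font-family:Courier">groups</span> can be sketched with a hand-built list. The values below are made up, and the mapping from subject to group assumes that subjects are stored in group order, which the description above does not state.

```{r}
# Hypothetical additionalinfo for a file merged from two binary dosage files
bdinfo <- list(format = 4, subformat = 2, headersize = 500L,
               numgroups = 2L, groups = c(3L, 2L))
class(bdinfo) <- "bdose-info"

# groups holds one size per group
length(bdinfo$groups) == bdinfo$numgroups

# Total number of subjects across all groups
sum(bdinfo$groups)

# Group number for each subject, assuming subjects are stored in group order
group_of_subject <- rep(seq_len(bdinfo$numgroups), times = bdinfo$groups)
group_of_subject
```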

## VCF File Additional Information

The additional information returned for VCF files contains the following information.

- gzipped - Logical value indicating if the file has been compressed using gzip
- headerlines - Integer value indicating the number of lines in the header
- headersize - Numeric value indicating the size of the header in bytes
- quality - Character vector containing the values in the QUAL column
- filter - Character vector containing the values in the FILTER column
- info - Character vector containing the values in the INFO column
- format - Character vector containing the values in the FORMAT column
- datacolumns - Data frame summarizing the entries in the FORMAT value containing the following information
  + numcolumns - Integer value indicating the number of values in the FORMAT value
  + dosage - Integer value indicating the column containing the dosage value
  + genotypeprob - Integer value indicating the column containing the genotype probabilities
  + genotype - Integer value indicating the column containing the genotype call

This list has its class value set to "vcf-info".

The values for quality, filter, info, and format can have a length of 0 if all the values are missing. They will have a length of 1 if all the values are equal. The number of rows in the datacolumns data frame will be equal to the length of the format value.
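
The length rule above means these vectors may need expanding before they can be lined up with the SNPs. The helper <span style="font-family:Courier">expand_vcf_field</span> below is a hypothetical sketch and is not part of the package. It assumes that a vector that is neither empty nor of length 1 holds one value per SNP.

```{r}
# Hypothetical helper: expand a quality, filter, info, or format vector
# to one value per SNP
expand_vcf_field <- function(x, nsnps) {
  # Length 0: all values are missing
  if (length(x) == 0L) return(rep(NA_character_, nsnps))
  # Length 1: all values are equal
  if (length(x) == 1L) return(rep(x, nsnps))
  # Otherwise assume one value per SNP
  stopifnot(length(x) == nsnps)
  x
}

expand_vcf_field(character(0), 3L)
expand_vcf_field("PASS", 3L)
expand_vcf_field(c("GT:DS", "GT:DS:GP", "GT:DS"), 3L)
```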

## GEN File Additional Information

The additional information returned for GEN files contains the following information.

- gzipped - Logical value indicating if the GEN file has been compressed using gzip
- headersize - Integer value indicating the size of the header in bytes
- format - Integer value indicating the number of genotype probabilities for each subject with the following meanings
  + 1 - Dosage only
  + 2 - $\Pr(g=0)$ and $\Pr(g=1)$
  + 3 - $\Pr(g=0)$, $\Pr(g=1)$, and $\Pr(g=2)$
- startcolumn - Integer value indicating in which column the genetic data starts
- sep - Character value indicating what value separates the columns

$g$ indicates the number of alternate alleles the subject has.

This list has its class value set to "gen-info".
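
To show how formats 2 and 3 relate to a dosage, the sketch below computes the expected number of alternate alleles, $\Pr(g=1) + 2\Pr(g=2)$, from the genotype probabilities. The function <span style="font-family:Courier">dosage_from_probs</span> is hypothetical and is not part of the package. For format 2 it assumes the three probabilities sum to one, so that $\Pr(g=2) = 1 - \Pr(g=0) - \Pr(g=1)$.

```{r}
# Hypothetical helper: expected number of alternate alleles.
# When p2 is not supplied (format 2), it is derived by assuming
# the three genotype probabilities sum to one.
dosage_from_probs <- function(p0, p1, p2 = 1 - p0 - p1) {
  p1 + 2 * p2
}

# Format 2: only Pr(g=0) and Pr(g=1) are available
dosage_from_probs(p0 = 0.25, p1 = 0.5)

# Format 3: all three probabilities are available
dosage_from_probs(p0 = 0.25, p1 = 0.5, p2 = 0.25)
```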