---
title: "Using Format 5 Binary Dosage Files"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Using Format 5 Binary Dosage Files}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, echo = FALSE}
library(BinaryDosage)
```

# Introduction

Format 5 is a binary dosage file format designed for reading imputed genotype
data from bgzipped, tabix-indexed VCF files — such as those returned by the
[Michigan Imputation Server](https://imputationserver.sph.umich.edu) using
[minimac](https://genome.sph.umich.edu/wiki/Minimac).

Format 5 differs from earlier formats (1–4) in two important ways.

First, it uses **per-SNP gzip compression**. Each SNP's dosage and genotype
probability values are compressed independently and written as a contiguous
block in the data file. This means any SNP can be decompressed in isolation
without reading surrounding data.

Second, the metadata — sample IDs, the SNP table, byte offsets, and file
information — is stored in a companion **RDS file** rather than embedded in a
binary header. This makes the metadata immediately accessible from R without
reading the data file.

A Format 5 dataset consists of two files.

- **`.bdose`** — the data file. Begins with a 4-byte magic number followed by
  one gzip-compressed block per SNP.
- **`.bdose.bdi`** — an RDS file containing a list of class `"genetic-info"`
  with the sample table, SNP table, byte offsets, and format metadata. The
  name is always the `.bdose` filename with `.bdi` appended (e.g.
  `set1a.bdose.bdi`). This structure is identical to the one returned by
  `getbdinfo`, `getvcfinfo`, and `getgeninfo` for other file types.

# Example files

The BinaryDosage package includes a small bgzipped VCF file, *set1a.vcf.gz*,
which contains dosage (DS) and genotype probability (GP) data for 60 subjects
and 10 SNPs on chromosome 1. This file is used in all examples below.

The output files in each example are written to a temporary directory so that
nothing is left behind after the vignette runs.

```{r files}
vcf_file   <- system.file("extdata", "set1a.vcf.gz", package = "BinaryDosage")
bd4_file   <- system.file("extdata", "vcf1a.bdose",  package = "BinaryDosage")
bdose_file <- file.path(tempdir(), "set1a.bdose")
```

# Converting a VCF file to Format 5

The `vcftobd` function converts a bgzipped VCF file to a Format 5 file pair.
The VCF file must contain the FORMAT fields `DS` (dosage) and `GP` (genotype
probabilities), as produced by minimac and the Michigan Imputation Server.

The function takes the following parameters.

- `vcffile` — path to the bgzipped VCF file.
- `bdose_file` — path for the output `.bdose` file. The companion `.bdi`
  metadata file is written automatically to `paste0(bdose_file, ".bdi")`.
- `region` — optional genomic region string in bcftools format (e.g.
  `"chr21"` or `"chr21:1-5000000"`). Requires a tabix index. Default `NULL`
  processes the entire file.

The following code converts *set1a.vcf.gz* to a Format 5 file pair.

```{r convert, eval = TRUE, echo = TRUE, message = FALSE, warning = FALSE}
if (requireNamespace("vcfppR", quietly = TRUE)) {
  vcftobd(vcffile = vcf_file, bdose_file = bdose_file)
} else {
  updatebd(bdfiles = bd4_file, bdose_file = bdose_file)
}
```

To convert only a specific region of the file, supply the `region` parameter.
The VCF file must have a tabix index (a `.tbi` file alongside it).

```{r convert_region, eval = F, echo = T}
vcftobd(vcffile    = vcf_file,
         bdose_file = bdose_file,
         region     = "1:10000-15000")
```

# Loading Format 5 file information

The `getbdinfo` function reads and validates a Format 5 file pair, returning
a list of class `"genetic-info"` that is used by `getbd5snp` to retrieve
individual SNPs. It automatically detects the Format 5 magic header and
delegates to the internal Format 5 reader.

The function takes the following parameters.

- `bdose_file` — path to the `.bdose` file. The companion `.bdi` file is
  read automatically from `paste0(bdose_file, ".bdi")`.

```{r load_info, eval = TRUE, echo = T, message = F, warning = F}
bd5 <- getbdinfo(bdose_file)
```

The object returned has the same structure as the list returned by `getbdinfo`
for older formats. For a full description of all fields see
[Genetic File Information](geneticfileinfo.html). The key fields are described
below.

## File and format metadata

```{r meta, eval = TRUE, echo = T}
cat("Class:          ", class(bd5), "\n")
cat("Uses family ID: ", bd5$usesfid, "\n")
cat("One chromosome: ", bd5$onechr, "\n")
cat("SNP ID format:  ", bd5$snpidformat, "\n")
cat("Format:         ", bd5$additionalinfo$format, "\n")
cat("Header size:    ", bd5$additionalinfo$headersize, "bytes\n")
```

## Sample information

The `samples` element is a data frame with one row per subject. The `fid`
column holds the family ID (empty for VCF files, which carry no family
information) and the `sid` column holds the subject ID.

```{r samples, eval = TRUE, echo = T}
cat("Number of samples:", nrow(bd5$samples), "\n")
knitr::kable(head(bd5$samples, 5), caption = "First 5 samples")
```

## SNP information

The `snps` element is a data frame with one row per SNP.

```{r snps, eval = TRUE, echo = T}
cat("Number of SNPs:", nrow(bd5$snps), "\n")
knitr::kable(bd5$snps, caption = "SNP table")
```

## Byte offsets

The `indices` element is a numeric vector of byte offsets into the `.bdose`
file, one per SNP, giving the start of that SNP's compressed block. These are
used internally by `getbd5snp`.

```{r indices, eval = TRUE, echo = T}
cat("Indices (bytes):", bd5$indices, "\n")
```

# Reading SNP data

The `getbd5snp` function decompresses and returns the dosage and genotype
probability data for a single SNP. The general-purpose `getsnp` function also
works with Format 5 files and can be used as an alternative.

For details on all three Format 5 SNP reader functions — including
`getbd5snp_buf` (pre-allocated output vectors) and `getbd5snp_con` (persistent
file connection for high-throughput loops) — see the
[Reading SNPs from Format 5 Files](bd5snpreaders.html) vignette.

The function takes the following parameters.

- `bd5info` — object returned by `getbdinfo`.
- `snp` — the SNP to retrieve, either as a 1-based integer index or as a
  character SNP ID matching a value in `bd5info$snps$snpid`.

The function returns a list with four numeric vectors, each of length equal
to the number of samples.

- `dosage` — DS values in \[0, 2\]; `NA` = missing.
- `p0` — $\Pr(g=0)$ values in \[0, 1\]; `NA` = missing.
- `p1` — $\Pr(g=1)$ values in \[0, 1\]; `NA` = missing.
- `p2` — $\Pr(g=2)$ values in \[0, 1\]; `NA` = missing.

## Reading a SNP by index

The following code retrieves the first SNP by its 1-based index.

```{r snp_by_index, eval = TRUE, echo = T, message = F, warning = F}
snp1 <- getbd5snp(bd5info = bd5, snp = 1L)

snp1_df <- data.frame(
  SampleID = bd5$samples$sid,
  Dosage   = round(snp1$dosage, 4),
  P_00     = round(snp1$p0,     4),
  P_01     = round(snp1$p1,     4),
  P_11     = round(snp1$p2,     4)
)

knitr::kable(head(snp1_df, 10),
             caption = paste("SNP", bd5$snps$snpid[1], "— first 10 subjects"))
```

## Reading a SNP by ID

A SNP can also be retrieved by passing its SNP ID as a character string.

```{r snp_by_id, eval = TRUE, echo = T, message = F, warning = F}
snp_id <- "1:12000:T:C"
snp3   <- getbd5snp(bd5info = bd5, snp = snp_id)

snp3_df <- data.frame(
  SampleID = bd5$samples$sid,
  Dosage   = round(snp3$dosage, 4),
  P_00     = round(snp3$p0,     4),
  P_01     = round(snp3$p1,     4),
  P_11     = round(snp3$p2,     4)
)

knitr::kable(head(snp3_df, 10),
             caption = paste("SNP", snp_id, "— first 10 subjects"))
```

# Applying a function to all SNPs

The simplest option is `bdapply`, which handles the loop internally and uses
`getbd5snp_con` automatically for Format 5 files.

```{r apply_bdapply, eval = TRUE, echo = T, message = F, warning = F}
getaaf <- function(dosage, p0, p1, p2) mean(dosage, na.rm = TRUE) / 2

aaf_list  <- bdapply(bdinfo = bd5, func = getaaf)
aaf_table <- data.frame(snpid = bd5$snps$snpid,
                        aaf   = round(unlist(aaf_list), 4))

knitr::kable(aaf_table, caption = "Alternate allele frequencies via bdapply")
```

For manual loops, `getbd5snp_buf` and `getbd5snp_con` are faster than
`getbd5snp` because they avoid allocating a new list on every call.
`getbd5snp_con` is the fastest option as it also keeps the file open across
all iterations. See the
[Reading SNPs from Format 5 Files](bd5snpreaders.html) vignette for details
and a performance comparison.

```{r apply_con, eval = TRUE, echo = T, message = F, warning = F}
n_snps <- nrow(bd5$snps)
n_samp <- nrow(bd5$samples)
dosage <- numeric(n_samp)
p0     <- numeric(n_samp)
p1     <- numeric(n_samp)
p2     <- numeric(n_samp)

con <- openbd5con(bd5)
aaf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
  getbd5snp_con(bd5info = bd5, snp = i,
                dosage = dosage, p0 = p0, p1 = p1, p2 = p2,
                bd5con = con)
  aaf[i] <- mean(dosage, na.rm = TRUE) / 2
}
closebd5con(con)

knitr::kable(data.frame(snpid = bd5$snps$snpid, aaf = round(aaf, 4)),
             caption = "Alternate allele frequencies via getbd5snp_con")
```

```{r cleanup, include = FALSE, eval = TRUE}
unlink(c(bdose_file, paste0(bdose_file, ".bdi")))
```