---
title: 'Alakazam: Reconstruction of Ig lineage trees'
author: "Jason Anthony Vander Heiden"
date: '`r Sys.Date()`'
output:
  pdf_document:
    dev: pdf
    fig_height: 4
    fig_width: 7.5
    highlight: pygments
    toc: yes
  html_document:
    fig_height: 4
    fig_width: 7.5
    highlight: pygments
    theme: readable
    toc: yes
  md_document:
    fig_height: 4
    fig_width: 7.5
    preserve_yaml: no
    toc: yes
geometry: margin=1in
fontsize: 11pt
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Lineage reconstruction}
  %\usepackage[utf8]{inputenc}
---

Reconstruction of an Ig lineage requires the following steps:

1. Load an AIRR tab-delimited database file and select a clone
2. Preprocess the clone to remove gap characters and duplicate sequences
3. Run PHYLIP, parse the output, and modify the tree topology

## Example data

A small example AIRR database, `ExampleDb`, is included in the `alakazam` package. 
Lineage reconstruction requires the following fields (columns) to be present 
in the AIRR file: 

* `sequence_id`
* `sequence_alignment` 
* `germline_alignment`
* `v_call`
* `j_call`
* `junction_length`
* `clone_id`

For details about the AIRR format, visit the [AIRR Community documentation site](https://docs.airr-community.org/en/stable/datarep/rearrangements.html).

```{r, eval=TRUE, warning=FALSE, message=FALSE}
# Load required packages
library(alakazam)
library(igraph)
library(dplyr)

# Select a clone from the example database
data(ExampleDb)
sub_db <- subset(ExampleDb, clone_id == 3138)
```

## Preprocess a clone

Before a lineage can be constructed, the sequences must first be cleaned of gap
(-, .) characters added by IMGT, duplicate sequences must be removed, and
annotations must be combined for each cluster of duplicate sequences. 
Optionally, "ragged" ends of sequences (such as those that may occur from primer template
switching) may also be cleaned by masking mismatched positions and the leading
and trailing ends of each sequence. The function `makeChangeoClone` is a wrapper
function which combines these steps and returns a `ChangeoClone` object which
may then be passed into the lineage reconstruction function.

Two arguments to `makeChangeoClone` control which annotations are retained
following duplicate removal. Unique values appearing within columns given by the 
`text_fields` arguments will be concatenated into a single string delimited by a
"," character. Values appearing within columns given by the 
`num_fields` arguments will be summed.

```{r, eval=TRUE}
# This example data set does not have ragged ends
# Preprocess clone without ragged end masking (default)
clone <- makeChangeoClone(sub_db, text_fields=c("sample_id", "c_call"), 
                          num_fields="duplicate_count")

# Show combined annotations
clone@data[, c("sample_id", "c_call", "duplicate_count")]
```

## Run PHYLIP

Lineage construction uses the `dnapars` (maximum parsimony) application of the
PHYLIP package. The function `buildPhylipLineage` performs a number of steps to
execute `dnapars`, parse its output, and modify the tree topology to meet the
criteria of an Ig lineage. This function takes as input a `ChangeoClone` object
output by `makeChangeoClone` and returns an igraph `graph` object. The igraph
`graph` object will contain clone annotations as graph attributes, sequence 
annotations as vertex attributes, and mutations along edges as edge attributes.

The system call to `dnapars` requires a temporary folder to store input and 
output. This is created in the system temporary location (according to 
`base::tempfile`), and is not deleted by default (only because automatically 
deleting files is somewhat rude). In most cases, you will want to set 
`rm_temp=TRUE` to delete this folder.

```{r, eval=FALSE}
# Run PHYLIP and parse output
phylip_exec <- "~/apps/phylip-3.69/dnapars"
graph <- buildPhylipLineage(clone, phylip_exec, rm_temp=TRUE)
```

```{r, echo=FALSE, warning=FALSE, message=FALSE}
# Load data instead of running phylip
# Clone 3138 is at index 23
graph <- ExampleTrees[[23]]
```

```{r, eval=TRUE, warning=FALSE, message=FALSE}
# The graph has shared annotations for the clone
data.frame(clone_id=graph$clone,
           junction_length=graph$junc_len,
           v_gene=graph$v_gene,
           j_gene=graph$j_gene)

# The vertices have sequence specific annotations
data.frame(sequence_id=V(graph)$name, 
           c_call=V(graph)$c_call,
           duplicate_count=V(graph)$duplicate_count)
```

## Plotting of the lineage tree

Plotting of a lineage tree may be done using the built-in functions of the 
igraph package. The default edge and vertex labels are edge weights and sequence 
identifiers, respectively. 

```{r, eval=TRUE}
# Plot graph with defaults
plot(graph)
```

The default layout and attributes are not very pretty. We can modify the 
graphical parameter in the usual igraph ways. A tree layout can be built using 
the `layout_as_tree` layout with assignment of the root position to the 
germline sequence, which is named "Germline" in the object returned by 
`buildPhylipLineage`.

```{r, eval=TRUE}
# Modify graph and plot attributes
V(graph)$color <- "steelblue"
V(graph)$color[V(graph)$name == "Germline"] <- "black"
V(graph)$color[grepl("Inferred", V(graph)$name)] <- "white"
V(graph)$label <- V(graph)$c_call
E(graph)$label <- ""

# Remove large default margins
par(mar=c(0, 0, 0, 0) + 0.1)
# Plot graph
plot(graph, layout=layout_as_tree, edge.arrow.mode=0, vertex.frame.color="black",
     vertex.label.color="black", vertex.size=40)
# Add legend
legend("topleft", c("Germline", "Inferred", "Sample"), 
       fill=c("black", "white", "steelblue"), cex=0.75)
```

Which is much better.

## Batch processing lineage trees

Multiple lineage trees may be generated at once, by splitting the Change-O 
data.frame on the clone column.

```{r, eval=TRUE, warning=FALSE, results="hide"}
# Preprocess clones
clones <- ExampleDb %>%
    group_by(clone_id) %>%
    do(CHANGEO=makeChangeoClone(., text_fields=c("sample_id", "c_call"), 
                                num_fields="duplicate_count"))
```

```{r, eval=FALSE}
# Build lineages
phylip_exec <- "~/apps/phylip-3.69/dnapars"
graphs <- lapply(clones$CHANGEO, buildPhylipLineage, 
                 phylip_exec=phylip_exec, rm_temp=TRUE)
```

```{r, echo=FALSE, warning=FALSE, message=FALSE}
# Load data instead of running phylip
graphs <- ExampleTrees
```

```{r, eval=TRUE}
# Note, clones with only a single sequence will not be processed.
# A warning will be generated and NULL will be returned by buildPhylipLineage
# These entries may be removed for clarity
graphs[sapply(graphs, is.null)] <- NULL

# The set of tree may then be subset by node count for further 
# analysis, if desired.
graphs <- graphs[sapply(graphs, vcount) >= 5]
```

## Converting between graph, phylo, and newick formats

While much of analysis in `alakazam` focuses on using `igraph` `graph` objects,
R `phylo` objects are capable of being used by a rich set of phylogenetic analysis
tools in R. Further, stand-alone phylogenetics programs typically import and export
trees in Newick format. 

To convert to trees in `graph` format to `phylo` format, use `graphToPhylo`. These
objects can now be used by functions detailed in other R phylogenetics packages such
as `ape`.

```{r, eval=TRUE, show=FALSE}
# Modify graph and plot attributes
V(graph)$color <- categorical_pal(8)[1]
V(graph)$label <- V(graph)$name
E(graph)$label <- E(graph)$weight
```

```{r, eval=TRUE, warning=FALSE, message=FALSE}
# Convert to phylo
phylo <- graphToPhylo(graph)

# Plot using ape
plot(phylo, show.node.label=TRUE)
```

To import lineage trees as `phylo` objects from Newick files, use the `read.tree` function 
provided in the `ape` package.  To export lineage trees as a Newick file, use the `write.tree`
function provided in `ape`.

```{r, eval=FALSE}
# Read in Newick tree as phylo object
phylo <- ape::read.tree("example.tree")

# Write tree file in Newick format
ape::write.tree(phylo, file="example.tree")
```

To convert this `phylo` object to a `graph` object, use the `phyloToGraph` function with the 
germline sequence ID specified using the `germline` option. Note that while some of the nodes in 
more complex trees may rotate during this process, their topological relationships will remain 
the same.

```{r, eval=TRUE}
# Convert to graph object
graph <- phyloToGraph(phylo, germline="Germline")
```