diemr: Diagnostic index expectation maximisation in R

Natália Martínková

2024-04-28

Quick start

The package diemr incorporates the diagnostic index expectation maximisation algorithm used to estimate which genomic alleles belong to which side of a barrier to geneflow. To start using diemr, load the package or install it from CRAN if it is not yet available:

# Attempt to load a package, if the package was not available, install and load
if(!require("diemr", character.only = TRUE)){
    install.packages("diemr", dependencies = TRUE)
    library("diemr", character.only = TRUE)
}

Next, assemble paths to all files containing the data to be used by diemr. Here, we will use a tiny example dataset that is included in the package for illustrating the analysis workflow. A good practice is to check that all files contain data in the correct format for all individuals and markers. Additionally, the analysis will need a list with ploidies for all genomic compartments and individuals, and a vector with indices of samples that will be included in the analysis.

filepaths <- system.file("extdata", "data6x3.txt", package = "diemr")
# Assuming diploid markers of all individuals
ploidies <- list(rep(2, 6))
# Analysing all individuals
samples <- 1:6
CheckDiemFormat(files = filepaths, 
                ChosenInds = samples,
                ploidy = ploidies)
# File check passed: TRUE
# Ploidy check passed: TRUE

If the CheckDiemFormat() function fails, work through the error messages and fix the stored input files accordingly. The algorithm repeatedly accesses data from the harddisk, so seeing the passed file and variable check prior to the analysis is critical.

To estimate the marker polarities, their diagnostic indices and support, run the function diem() with default settings. Here, we have only one file with data, so paralelisation is unnecessary, and we set nCores = 1. This setting must always be used on Windows. Other operating systems can process multiple genomic compartments (e.g.  chromosomes) in parallel, and the analysis of different genomic compartment files can run on multiple processors.

res <- diem(files = filepaths, 
            ploidy = ploidies, 
            markerPolarity = list(c(FALSE, FALSE, TRUE)),
            ChosenInds = samples, 
            nCores = 1)

The result is a list, where the element res$DI contains a table with marker polarities, their diagnostic indices and support.

res$DI
#   newPolarity         DI   Support Marker
# 1       FALSE  -4.872256 15.930181      1
# 2        TRUE  -3.520647 18.633399      2
# 3        TRUE -13.274822  6.130625      3

The column newPolarity means that marker 1 should be imported for subsequent analyses as is, and markers 2 and 3 should be imported with 0 replaced with 2 and 2 replaced with 0. The marker 3 has the lowest diagnostic index and low support, indicating that the genotypes scored at this marker are poorly related to the barrier to geneflow arbitrated by the data.

With the marker polarities optimised to detect a barrier to geneflow, a plot of the polarised genome will show how genomic regions cross the barrier. First, the genotypes need to be imported with optimal marker polarities. Second, individual hybrid indices need to be calculated from the polarised genotypes. And last, the data will be plotted as a raster image.

genotypes <- importPolarized(file = filepaths, 
                             changePolarity = res$markerPolarity, 
                             ChosenInds = samples)
                             
h <- apply(X = res$I4, 
           MARGIN = 1, 
           FUN = pHetErrOnStateCount)[1, ]
           
plotPolarized(genotypes = genotypes,
              HI = h)

Input data

The diemr package uses a consise genome representation. Let’s have a small dataset of three markers genotyped for six individuals.

S001122
S121000
S02221U

The genotypes encoded as 0 represent homozygots for an allele attributed to barrier side A, 1 are heterozygous genotypes, 2 are homozygots for another allele, attributed to barrier side B, and U (symbol “_” is also allowed) represents an unknown state or a third (fourth) allele. The power of diem lies in the safety that the user does not need to determine the true assignment to the barrier sides A and B before the analysis and the specific genotypes encoded as 0 and 2 respectively can be arbitrary.

The leading S on each line of the input file is needed to ensure that the marker genotypes are read in as a string on all operating systems. The S is dropped during import of the genotypes, and the dataset is imported as a character matrix of all markers.

Multiple compartments with different ploidies

Some genomic compartments differ between individuals in their ploidy. For example, markers located on chromosome X in mammals will be diploid in females, but haploid in males. Ploidy differences between individuals influence calculation of the hybrid index, which in turn has an effect on the diem analysis.

To set up the diem analysis with multiple compartments, the markers with different individual ploidies must be stored in separate files. The file analysed in the Quick start chapter could contain markers from autosomes and an additional file will contain markers from an X chromosome, with individuals 2 and 6 being males. The respective ploidies for the second genomic compartment will be c(2, 1, 2, 2, 2, 1).

Arguments files and ploidy will need to reflect the information, taking care that the order of filenames corresponds to the order of elements in the list of ploidies. diem cannot check that the order of the elements is correct, only that the information is in the correct format.

filepaths2 <- c(system.file("extdata", "data6x3.txt", package = "diemr"),
                system.file("extdata", "data7x10.txt", package = "diemr"))
               
ploidies2 <- list(rep(2, 6),
                  c(2, 1, 2, 2, 2, 1))

CheckDiemFormat(files = filepaths2, 
                ChosenInds = samples,
                ploidy = ploidies2)
# File check passed: TRUE
# Ploidy check passed: TRUE

# Set random seed for repeatibility of null polarities (optional)
set.seed(39583782)

# Run diem with verbose = TRUE to store hybrid indices with ploidy-aware allele counts
res2 <- diem(files = filepaths2, 
             ploidy = ploidies2, 
             markerPolarity = FALSE,
             ChosenInds = samples, 
             nCores = 1,
             verbose = TRUE)

Plotting polarised genomes from multiple compartments requires separate import of the compartment data. The polarities in the res2$markerPolarity element are combined across all compartments, and extracting them requires knowledge of the number of markers in each compartment. Alternatively, the marker polarities from each compartment can be extracted from the list in the res2$PolarityChanges element.

# Import each compartment into a list
genotypes2 <- list(importPolarized(file = filepaths2[1], 
                       changePolarity = res2$markerPolarity[1:3], 
                       ChosenInds = samples),
                  importPolarized(file = filepaths2[2], 
                       changePolarity = res2$markerPolarity[4:13], 
                       ChosenInds = samples)) 
                       
# Bind compartment genotypes into one matrix
genotypes2 <- Reduce(cbind, genotypes2)

# Load individual hybrid indices from a stored file
h2 <- unlist(read.table("diagnostics/HIwithOptimalPolarities.txt"))

# Plot the polarised genotypes
plotPolarized(genotypes = genotypes2,
              HI = h2)

Frequently asked questions

  1. How can I install diemr from the source?
install.packages(package = "diemr_1.1.tar.gz",
                 repos = NULL, type = "source")

Make sure to set the R working directory to the folder, where the package tarball is stored, or include a full path to the file within the quotes. Update the version number to the specific file.

  1. How can I calculate the hybrid indices from my polarised data?

There are two options. First, use diem with argument verbose = TRUE and the hybrid indices will be stored in a text file in the diagnostics folder in the working directory. The stored values will be ploidy-aware. However, these hybrid indices could include the bias introduced by the small data correction if the data warranted it. The user might prefer to calculate the hybrid indices without the small data correction. For this, use the I4 matrix in the diem output.

apply(res$I4, MARGIN = 1, FUN = pHetErrOnStateCount)
#          [,1] [,2]      [,3]      [,4]      [,5]      [,6]
# p   0.5000000    0 0.3333333 0.5000000 0.8333333 1.0000000
# Het 0.3333333    0 0.6666667 0.3333333 0.3333333 0.0000000
# Err 0.0000000    0 0.0000000 0.0000000 0.0000000 0.3333333
  1. I expect more than two groups of samples in my data. Can I use diemr?

Yes. Multiple barriers to geneflow between multiple groups of samples can be identified iteratively with help from the argument ChosenInds. For example, let’s assume that the individual 2 was identified as belonging to one side of the barrier and being separated from other by the steepest change in the hybrid index. In the next diem run, we exclude the individual 2, adjusting the ploidies to reflect the omission.

ploidies2 <- list(c(2, 2, 2, 2, 2))
samples2 <- c(1, 3:6)
CheckDiemFormat(files = filepaths, 
                ChosenInds = samples2, 
                ploidy = ploidies2)
# File check passed: TRUE
# Ploidy check passed: TRUE

res2 <- diem(files = filepaths, 
             ChosenInds = samples2,
             ploidy = ploidies2,
             nCores = 1,
             markerPolarity = list(c(FALSE, FALSE, TRUE)))
             
# calculate hybrid indices from updated I4
h.res2 <- apply(res2$I4, 
                MARGIN = 1, 
                FUN = pHetErrOnStateCount)[1, ]
                
# set names for the hybrid index values
h.res2 <- setNames(h.res2, nm = samples2)
#    1    3    4    5    6 
# 0.50 0.33 0.50 0.83 1.00 

# calculate the center of the maximum hybrid index change
diffs <- data.frame(rollmean = zoo::rollmean(sort(h.res2), k = 2),
                    diff = diff(sort(h.res2), lag = 1))
h.res2.c <- diffs$rollmean[which.max(diffs$diff)]
# [1] 0.6666667

Since the center of the barrier is at 0.67 now, diem separated individuals 1, 3, and 4 from a group that includes 5, 6. Combined with the result from the first diem run, we have identified three groups in the dataset: (2), (1, 3, 4), and (5, 6).

  1. The hybrid indices in my dataset are too similar. I expected them to look more like the rescaled hybrid indices. What is wrong?

The input data probably contains invariant sites, which weight the hybrid indices to be similar between individuals. Make sure to use only variable sites. Alternatively, filter sites with lowest diagnostic index after the diem analysis and recalculate the hybrid indices with the filtered data.

  1. How to cite diemr?

To use diemr in a publication, please cite:

Baird S. J. E., Petruzela J., Jaron I., Skrabanek P., Martinkova N. 2022. Genome polarisation for detecting barriers to geneflow. Methods in Ecology and Evolution, doi: 10.1111/2041-210X.14010.