poolABC


1. Introduction

An implementation of Approximate Bayesian Computation (ABC) methods applied to pooled sequencing (Pool-seq) data is available in R language in the package poolABC. The purpose of this vignette is to provide an in-depth overview of the capabilities of the package, highlighting the usage of its main functions.

library(poolABC)

The initial sections of this vignette detail how to import pooled sequencing data from files in the _rc format. We also show how to randomly select multiple subsets of the observed data, compute summary statistics for those subsets and use those summary statistics as target for parameter estimation and model selection with ABC.


This vignette also teaches users how to simulate Pool-seq data under pre-defined models. Note that the simulation of Pool-seq data requires functions included in the poolHelper package. We then exemplify how the imported data and the simulated data can be used to perform parameter inference. The poolABC package allows the use of genome-wide multilocus data for ABC by using multiple subsets of simulated and observed loci.


Briefly, we obtain one set of summary statistics for each set of simulated loci and for each random subset of observed data. Each set of summary statistics computed from a unique subset of observed data is used as an independent target for parameter estimation. Thus, with the poolABC package, users obtain one posterior distribution, for each parameter of interest and for each subset of observed loci. Then, our package allows users to combine those multiple posteriors to obtain a single estimate per parameter. This merging is performed with the Epanechnikov kernel and weighting according to the distance between the mean summary statistics of a subset of loci and the mean across the genome, giving more weight to subsets of loci with a mean closer to the overall mean. Finally, the poolABC package also includes functions to compute point estimates and produce plots of those merged posterior distributions.



2. Import dataset

This package uses pooled sequencing data stored on _rc format files. These _rc files are created by running the SNP-frequency-diff.pl function of popoolation2. Briefly, this is an example of a typical _rc file with only two populations:

data.frame(chr = c("NC297", "NC297"), pos = c(3530, 5450), rc = c("A", "T"), allele_count = 2,
    allele_states = c("A/G", "T/A"), deletion_sum = 0, snp_type = "pop", major_alleles = c("AA",
        "TT"), minor_alleles = c("GG", "AN"), maa_1 = c("54/55", "51/54"), maa_2 = c("76/78",
        "96/96"), mia_1 = c("1/55", "3/54"), mia_2 = c("2/78", "0/96"))
#>     chr  pos rc allele_count allele_states deletion_sum snp_type major_alleles
#> 1 NC297 3530  A            2           A/G            0      pop            AA
#> 2 NC297 5450  T            2           T/A            0      pop            TT
#>   minor_alleles maa_1 maa_2 mia_1 mia_2
#> 1            GG 54/55 76/78  1/55  2/78
#> 2            AN 51/54 96/96  3/54  0/96

More information about these files can be found at: https://sourceforge.net/p/popoolation2/wiki/Main/


If you have your data in _rc files in a folder of your computer, you can simply use the importContigs function.

# load multiple files and organize information by contig
files <- importContigs(path = "/home/myawesomedata", pops = c(1, 4, 7, 10))

The path input of this function indicates the path to the folder where the _rc files are located. By default, the importContigs function will import all files present in the folder that include the _rc pattern in their name. The index of populations to import is defined by the pops input argument. For instance, the input pops = c(1, 4, 7, 10) will import the major and minor allele for the first, fourth, seventh and tenth population in the _rc files.


The importContigs function also includes several optional input arguments. The files input argument allows you to specify the index of the files to import. For instance, by setting files = 1:5, only the first five files listed in the output of list.files(path = path) will be imported. Additionally, specific contigs can be removed from the data by adding their names to the remove input argument.


The min.minor input argument allows you to filter the data by the number of minor allele reads. For instance, if min.minor = 2 all sites where the total number of minor allele reads across all populations of the pops input argument is below 2, will be removed from the data. Alternatively, by setting filter = TRUE, you can filter the data by the frequency of the minor allele. When filter = TRUE, the user can define a threshold for the minimum allowed frequency of the minor allele. If no threshold is defined, the importContigs function will assume that at least one minor allele read per site should exist. Finally, it is possible to include an header when importing the data. This header can be created with the createHeader function.



3. Random subset of loci

Random windows of a given size (in base pairs) can be selected from the imported data with the pickWindows function. The data imported with the importContigs function is a list with all the elements required for the pickWindows function. You can assign each of those list elements to an individual object or use them directly as input argument of the pickWindows function.

# randomly select blocks of a given size from several contigs
blocks <- pickWindows(freqs, positions, range, rMajor, rMinor, coverage, window = 1000,
    nLoci = 100)

With this function, users can randomly select a subset of the complete pooled sequencing data at their disposal. More specifically, the pickWindows function allows users to randomly select nLoci blocks of window size (in base pairs) from the data imported in the previous section. In other words, this function will randomly select select nLoci contigs and then select one random block with a user defined size (defined by the window input) per contig.



4. Compute stats for observed data

The next step is the computation of a set of summary statistics from the observed data. To compute summary statistics from the observed data, we can use the statsContig function. This function will compute the same set of summary statistics used in the simulations from the multiple random subset of loci obtained in the previous step.

# compute a set of observed summary statitstics
obs <- statsContig(randomWindows = blocks, nPops = 4, stat.names = stat.names)

Note that we are using the blocks object created with the pickWindows function as the randomWindows input argument. The statsContig function will compute summary statistics from those randomly selected blocks of observed data. Also, the use of names for the summary statistics is strongly recommended. To ensure that the set of observed summary statistics is named, we should obtain the name of the simulated summary statistics and include those in the stat.names input argument of the statsContig function.



5. Simulate data for two or four-population models

With this package, you also have the ability to simulate pooled sequencing data under three different models by using the poolSim function. The model input argument allows the user to define which model to simulate. At the moment, this package includes three different models: an isolation with migration model with two populations, a model representing a single origin of two divergent ecotypes and a third model representing a parallel origin of those ecotypes.


To simulate data using the two populations model, you have to define the mean depth of coverage and the variance of the coverage for those two populations. You also need to create a list with the number of individuals per pool and per population. In the next chunk, you can see how to simulate data using this model:

# set the mean and variance of the coverage
sMean <- c(84.34, 66.76)
sVars <- c(1437.22, 912.43)
# create a list containing the information of the pool sizes by population
size <- rep(list(rep(10, 10)), 2)
# run simulation for a two-populations model
sims <- poolSim(model = "2pops", nDip = 200, nPops = 2, nLoci = 100, nSites = 2000,
    mutrate = 2e-08, size = size, mean = sMean, variance = sVars, minimum = 15, maximum = 180,
    min.minor = 1, Nref = c(25000, 25000), ratio = c(0.1, 3), pool = c(5, 180), seq = c(1e-04,
        0.001), split = c(0, 3), CW = c(1e-13, 0.001), WC = c(1e-13, 0.001), bT = c(0,
        0.5))

The poolSim function requires several input arguments, that are explained in detail in the help page of the function. However, note that most of those input arguments define the minimum and maximum values for a variety of relevant parameters. To simulate data using a four populations model:

# set the mean and variance of the coverage
sMean <- c(84.34, 66.76, 65.69, 68.83)
sVars <- c(1437.22, 912.43, 848.02, 1028.23)
# create a list containing the information of the pool sizes by population
size <- rep(list(rep(5, 20)), 4)
# run simulation for a four-populations model
sims <- poolSim(model = "Single", nDip = 400, nPops = 4, nLoci = 100, nSites = 2000,
    mutrate = 2e-08, size = size, mean = sMean, variance = sVars, minimum = 15, maximum = 180,
    min.minor = 2, Nref = c(25000, 25000), ratio = c(0.1, 3), pool = c(5, 180), seq = c(1e-04,
        0.001), split = c(0, 3), CW = c(1e-13, 0.001), WC = c(1e-13, 0.001), CC = c(1e-13,
        0.001), WW = c(1e-13, 0.001), ANC = c(1e-13, 0.001), bT = c(0, 0.5), bCW = c(0,
        0.5), bWC = c(0, 0.5))


5.1. multiple simulations

The poolSim function can be used to perform a single simulation. However, most of the times, you will want to perform thousands of simulations. One way to accomplish this is to use replicate function together with our poolSim function. We recommend that you do the following:

# set the mean and variance of the coverage
sMean <- c(84.34, 66.76)
sVars <- c(1437.22, 912.43)
# create a list containing the information of the pool sizes by population
size <- rep(list(rep(5, 20)), 2)
# define how many simulations to run
nSims <- 10
# run one batch of simulations
sims <- t(replicate(n = nSims, unlist(poolSim(model = "2pops", nDip = 200, nPops = 2,
    nLoci = 100, nSites = 1000, mutrate = 2e-08, size = size, mean = sMean, variance = sVars,
    minimum = 20, maximum = 185, min.minor = 2, Nref = c(25000, 25000), ratio = c(0.1,
        3), pool = c(5, 180), seq = c(1e-04, 0.001), split = c(0, 3), CW = c(1e-13,
        0.001), WC = c(1e-13, 0.001), bT = c(0, 0.5)))))

By using the replicate function, you can perform multiple simulations. By unlisting and then transposing the output of those simulations, you obtain a matrix where each row corresponds to a different simulation and each column is a different parameter or summary statistic.



6. Perform parameter estimation

The observed summary statistics computed in the previous sections and the simulations performed in the previous one can then be used to perform parameter estimation with Approximate Bayesian Computation (ABC).


We included with this package a small dataset simulated under the two populations model. This includes one matrix (sumstats) with the summary statistics computed from the simulated data, one matrix (params) with the simulated parameter values and a final matrix (limits) with the minimum and maximum value of the prior distribution for each parameter.

# load the data included in the package
data("sumstats"); data("params"); data("limits")

The poolABC package aims at streamlining the process of parameter inference with Pool-seq data. One of the key components of that design is the ABC function.


By using this function, users can simultaneously perform parameter estimation with ABC for multiple targets. The ABC function requires the data, imported with the importContigs function and then uses both the pickWindows and statsContig functions to select multiple random subset of loci from the observed data and compute a set of observed summary statistics for each of those subsets. Thus, for each subset of loci we obtain a vector of summary statistics and each vector acts as an independent target for parameter estimation. The ABC function can be used by doing:

# parameter estimation with ÃBC function
myabc <- ABC(nPops = 2, ntrials = 10, freqs = freqs, positions = positions, range = range,
    rMajor = rMajor, rMinor = rMinor, coverage = coverage, window = 1000, nLoci = 100,
    limits = limits, params = params, sumstats = sumstats, tol = 0.01, method = "regression")

The ntrials input argument defines the number of independent targets for parameter estimation. In this example, we are performing parameter inference for 10 different targets. Each of those targets was obtained by computing summary statistics from windows of 1000 base pairs (window = 1000) from 100 (nLoci = 100) randomly selected contigs of the observed data.


Note that you should also define the method and tolerance rate, tol, to use. The tol is defined as the percentage of accepted simulation. You should strive to keep a low tolerance rate, to avoid accepting simulations that are too distant from the observed data, but it is also important to avoid very stringent tolerance rates that may lead to few accepted values. A typical value of tol = 0.01 or tol = 0.05 is recommended but you should test different tol values in the cross-validation analysis (see more about this in subsequent sections).


This package implements two ABC algorithms for constructing the posterior distribution from the accepted simulations: a rejection method and a regression-based correction using a local linear regression. When method is “rejection”, simulations are accepted if the Euclidean distance between the set of summary statistics computed from the simulated data and the target is sufficiently small and these accepted simulations are considered a sample from the posterior distribution. When method is “regression”, an additional step is used to correct for the imperfect match between the summary statistics computed from the simulated data and the summary statistics computed from the observed data. For this reason, we recommend that you select the regression method because it will, most often than not, lead to more precise parameter estimates.



7. Merge multiple posteriors

After using the ABC function to perform parameter estimation with Approximate Bayesian Computation for several targets, we need to merge the multiple posteriors obtained (one for each target) into a single posterior per parameter.


This can be performed with the mergepost function. One of the required input arguments of this function is the global input. This input should be a vector with the observed summary statistics computed from the entire dataset. We recommend that you use the pickWindows function to select a large number of loci and then use that random selection as the input argument of the statsContig function.

# load multiple files and organize information by contig
blocks <- pickWindows(freqs = freqs, positions = positions, range = range, rMajor = rMajor,
    rMinor = rMinor, coverage = coverage, window = 1000, nLoci = 800)
# compute a set of summary statistics from the observed data
global <- statsContig(randomWindows = blocks, nPops = 2, stat.names = stat.names)

The global vector can then be used in the mergepost function. The remaining required input arguments are the matrix with the target for the parameter inference and the list containing the posteriors (post) for each target and parameter. It is also possible to include the regression weights in the wtreg option.

# merge posterior distributions
myabc <- mergepost(target = myabc$target, global = global, post = myabc$adjusted,
    wtreg = myabc$weights)

Briefly, this function will merge the different posteriors into a single one, using different weighting methods for the merging. Details about the various elements of the mergepost output can be found in the help page of the function. Note that the merge, weighted, merge_reg and weighted_reg entries contain, for each parameter, a locfit object, obtained after merging the multiple posteriors using the corresponding method. The merged_stat, weighted_stat, merge_reg_stat and weighted_reg_stat are the posterior point estimates for the corresponding merging method.



8. Posterior point estimates and plots

Users can then plot the resulting merged posterior distribution with the plot_weighted function. You should include your matrix with the simulated parameter values in the prior input argument of this function to also plot the prior distribution of the chosen parameter. This allows for a comparison, in the same plot, of the prior and posterior shape.


You should include the output of the mergepost function as the merged_posterior input argument and a matrix with the limits of the prior distribution for each parameter. Then, you just need to define which parameter to plot with the index input argument.

# plot the density estimation of a parameter
plot_weighted(prior = params, merged_posterior = myabc, limits = limits, index = 2)

The plot_weighted function plots the posterior density of the chosen parameter, together with the prior distribution of the same plot.



9. Model selection

The poolABC package also allows users to perform model selection by estimating the posterior model probabilities, comparing two scenarios of ecotype formation: the single and the parallel origin scenario. The modelSelect function can be used to perform model selection with ABC.


One of the required input arguments of the modelSelect function is the index. This is a vector of model indices that should have the same length as nrow(sumstats) to indicate to which model a particular row of sumstats belongs. The remaining input arguments are explained in the help page of the function. As before, you should also define the tolerance rate (tol) and the method to use. A tolerance of 0.01 and the “regression” method are recommended.

# create a vecto of model indices
index <- c(rep("model1", nrow(sumstats)/2), rep("model2", nrow(sumstats)/2))
# select a random simulation to act as target
target <- sumstats[10, ]
# perform model selection with ABC
mysel <- modelSelect(target = target, index = index, sumstats = sumstats, tol = 0.1,
    method = "regression")
# display the structure of the mysel object
str(mysel)
#> List of 6
#>  $ method : chr "regression"
#>  $ indices: Factor w/ 2 levels "model1","model2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ pred   : 'table' num [1:2(1d)] 0.541 0.459
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:2] "model1" "model2"
#>  $ ss     : num [1:1000, 1:14] 0 0.0106 0.00118 0.01243 0 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:14] "Sf" "Sx1" "Sx2" "SS" ...
#>  $ weights: num [1:1000] 0.2723 1 0.0146 0.0171 0.3713 ...
#>  $ nmodels: Named int [1:2] 5000 5000
#>   ..- attr(*, "names")= chr [1:2] "model1" "model2"

The output of the modelSelect function is a list with six entries. To quickly view the results of the model selection, you can use the summary_modelSelect function. This function will provide an easy to read display of the posterior model probabilities and Bayes factors. The only required input argument of the summary_modelSelect function is the object created with the the modelSelect function.

# check results of model selection
msel <- summary_modelSelect(object = mysel)
#> Data:
#>  results based on 1000 posterior samples
#> 
#> Models a priori:
#>  model1, model2
#> 
#> Models a posteriori:
#>  model1, model2
#> 
#> Proportion of accepted simulations (rejection):
#> model1 model2 
#>   0.51   0.49 
#> 
#> Posterior model probabilities (mnlogistic):
#> model1 model2 
#>  0.541  0.459

As you can see, by running the summary.modelSelect function we get an output with the proportion of accepted simulation for each model under a rejection method and posterior model probabilities under the regression method. If we print the object itself

# print results of model selection 
msel
#> $rejection
#> $rejection$Prob
#> model1 model2 
#>   0.51   0.49 
#> 
#> $rejection$BayesF
#>           model1   model2
#> model1 1.0000000 1.040816
#> model2 0.9607843 1.000000
#> 
#> 
#> $mnlogistic
#> $mnlogistic$Prob
#>    model1    model2 
#> 0.5406397 0.4593603 
#> 
#> $mnlogistic$BayesF
#>           model1   model2
#> model1 1.0000000 1.176941
#> model2 0.8496605 1.000000

we can also see the Bayes factors between pairs of models for both the rejection and the regression methods.



10. Cross validation for Approximate Bayesian Computation

A fundamental part of any ABC analysis is the validation of the results obtained in the parameter estimation and model selection steps. The poolABC package includes tools to perform cross validation for both analysis, computing prediction errors for both parameter inference and model selection.


10.1 Parameter inference

One important component of this validation process is the calculation of prediction errors for each parameter. This allows us to evaluate the confidence of the estimates and the effect of various point estimates and/or tolerance rates.


To perform a leave-one-out cross validation for ABC, you can use the simulationABC function. This functions requires the simulated parameter values, params, the simulated summary statistics, sumstats and a matrix with the limits of the prior distributions. You should also define the size of the cross-validation sample, nval, the tolerance rate, tol, and the type of ABC algorithm to be applied in the method input.

# perform an Approximate Bayesian Computation simulation study
mycv <- simulationABC(params = params, sumstats = sumstats, limits = limits, nval = 100,
    tol = 0.01, method = "regression")
# display the structure of the mycv object
str(mycv, max.level = 2)
#> List of 3
#>  $ true: num [1:100, 1:8] 1.09 0.364 0.579 1.332 0.306 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ rej :List of 4
#>   ..$ mode  : num [1:100, 1:8] 1.888 0.276 0.27 0.709 0.099 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ median: num [1:100, 1:8] 1.87 0.396 0.313 0.934 0.262 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ mean  : num [1:100, 1:8] 1.777 0.459 0.348 1.033 0.295 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ error : num [1:3, 1:8] 0.411 0.176 0.176 0.331 0.207 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>  $ reg :List of 4
#>   ..$ mode  : num [1:100, 1:8] 0.549 0.291 0.497 1.458 0.18 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ median: num [1:100, 1:8] 0.686 0.311 0.51 1.415 0.198 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ mean  : num [1:100, 1:8] 0.77 0.335 0.524 1.395 0.221 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ error : num [1:3, 1:8] 0.155 0.142 0.135 0.198 0.195 ...
#>   .. ..- attr(*, "dimnames")=List of 2

The output of the simulationABC function is a list with three elements. Details about each list element are available in the help page of the function. A quick way to visualize the results of the leave-one-out cross validation is to plot the the cross-validation results.


The poolABC package includes the plot_errorABC function to allow this visual evaluation of the quality of the estimation. This function requires as input the output of the simulationABC function. Additionally, you need to define the ABC algorithm (either “reg” for regression or “rej” for rejection) and which point estimate (“mode”, “median” or “mean”) to plot. You should also define which parameter to plot, by selecting the corresponding index.

# plot the prediction errors 
plot_errorABC(x = mycv, method = "reg", statistic = "median", index = 1)

As you can see, this produces a plot with the true parameter value in x-axis and the estimate value of the parameter in the y-axis. Thus, the closer the points are to the diagonal line, the higher is the quality of the estimation.


10.2 Model selection

It is also possible to evaluate how much confidence we should place in the model selection results by performing a leave-one-out cross validation for model selection with ABC via subsequent calls to the function modelSelect. Briefly, several simulations from each model are selected to act as validation simulations, while the remaining simulations are used as training simulations. For each validation simulation, the function modelSelect is called to estimate the posterior model probabilities.

# perform a leave-one-out cross validation for model selection
modelSim <- sim_modelSel(index = index, sumstats = sumstats, nval = 100, tol = 0.1)
# display the structure of the modelSim object
str(modelSim)
#> List of 5
#>  $ cvsamples  : Named int [1:200] 3353 2818 3561 1671 1513 3309 2166 1897 3942 3356 ...
#>   ..- attr(*, "names")= chr [1:200] "model11" "model12" "model13" "model14" ...
#>  $ true       : chr [1:200] "model1" "model1" "model1" "model1" ...
#>  $ estimated  : chr [1:200] "model2" "model1" "model1" "model2" ...
#>  $ model.probs: num [1:200, 1:2] 0.494 0.532 0.501 0.486 0.456 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:200] "model1" "model1" "model1" "model1" ...
#>   .. ..$ : chr [1:2] "model1" "model2"
#>  $ models     : chr [1:2] "model1" "model2"

The output of this leave-one-out cross validation for model selection is a list with 5 different elements that can be used in the error_modelSel function to compute the confusion matrix and the mean misclassification probabilities of models.


Users can also define a threshold for the posterior model probabilities. This threshold corresponds to the minimum posterior probability of assignment. Thus, a simulation where the posterior probability of any model is below the threshold will not be assigned to a model and will instead be classified as unclear.

# compute the mean misclassification probabilities  
mselError <- error_modelSel(object = modelSim)
#> Confusion matrix based on 100 samples for each model
#> 
#>        model1 model2
#> model1     41     59
#> model2     51     49
#> 
#> Mean model posterior probabilities (mnlogistic)
#> 
#>        model1 model2
#> model1  0.484  0.516
#> model2  0.506  0.494
#> 
#> Posterior probabilities of correctly assigned model1 models
#> 
#>    model1 incorrect 
#>      0.53      0.47 
#> 
#> Posterior probabilities of correctly assigned model2 models
#> 
#> incorrect    model2 
#>     0.464     0.536 
#> 
#> Posterior probabilities when model1 is estimated as model2
#> 
#> model1 model2 
#>  0.453  0.547 
#> 
#> Posterior probabilities when model2 is estimated as model1
#> 
#> model1 model2 
#>  0.546  0.454

The error_modelSel function outputs the confusion matrix and the mean model posterior probabilities obtained with the regression method. It will also output other useful information such as the mean posterior probability of correctly assigned models and the mean posterior probability when each model is not correctly assigned.


For a more visual interpretation of these results, it is also possible to display a barplot of the model misclassification. By using the plot_msel function we can plot the confusion matrix, either in colour (if color = TRUE) or in grey (if color = FALSE).

# display a barplot of model misclassification
plot_msel(object = mselError)

Using the output of the error_modelSel function as the input of the plot_msel function, we can produce this barplot showing the proportion of simulations classified to any of the models.