zalpha

The zalpha package contains statistics for identifying areas of the genome that have undergone a selective sweep. The idea behind these statistics is to find areas of the genome that are highly correlated, as this can be a sign that a sweep has occurred recently in the vicinity. For more information on the statistics, please see the paper by Jacobs et al. (2016)[1] referenced below.

A simple example

Here we have a dataset containing five humans with a single pair of chromosomes each. The chromosome has 20 SNPs, at base pair 100, 200, 300, …, 2000.

Person 1 Chromosome A AGGAAGGGATACGGTTATAC
Chromosome B CGGAACTGATCAGCTCAGGG
Person 2 Chromosome A CGGTACTGTCACGGGCATGG
Chromosome B ATGTAGGGTCCAGCTCTGAC
Person 3 Chromosome A ATGTCGGCATCCAGGCAGAC
Chromosome B ATGAACGCATCAACTTTGAG
Person 4 Chromosome A CGGTCGGCTCCAGCTTTTGG
Chromosome B CTGTCCGCTCCCGGGTTTGC
Person 5 Chromosome A CGCAACGGACACGCGCATGC
Chromosome B CTGACGTCACCCAGTTTGAG

For this simple example all that is needed is:

The snps dataset

The snps dataset is a data frame that comes with the zalpha package. It is identical to the simple example above, but with 0s and 1s instead of ACGTs.

Realistically, the dataset would be much bigger. It is highly recommended to only use SNPs with a minor allele frequency of over 5%, as it is hard to find correlations between rare alleles. Any missing values should be coded as NA.

The snps dataset can be loaded using the code:

library(zalpha)
data(snps)
## This is what the dataset looks like:
snps
#>    bp_positions cM_distances chrom_1 chrom_2 chrom_3 chrom_4 chrom_5 chrom_6
#> 1           100       0.0001       1       0       0       1       1       1
#> 2           200       0.0002       1       1       1       0       0       0
#> 3           300       0.0005       1       1       1       1       1       1
#> 4           400       0.0008       0       0       1       1       1       0
#> 5           500       0.0010       1       1       1       1       0       1
#> 6           600       0.0015       1       0       0       1       1       0
#> 7           700       0.0017       1       0       0       1       1       1
#> 8           800       0.0019       0       0       0       0       1       1
#> 9           900       0.0023       1       1       0       0       1       1
#> 10         1000       0.0024       0       0       1       1       0       0
#> 11         1100       0.0026       1       0       1       0       0       0
#> 12         1200       0.0027       1       0       1       0       1       0
#> 13         1300       0.0032       1       1       1       1       0       0
#> 14         1400       0.0037       1       0       1       0       1       0
#> 15         1500       0.0039       0       0       1       0       1       0
#> 16         1600       0.0040       1       0       0       0       0       1
#> 17         1700       0.0041       0       0       0       1       0       1
#> 18         1800       0.0045       1       0       1       0       0       0
#> 19         1900       0.0048       1       0       0       1       1       1
#> 20         2000       0.0049       0       1       1       0       0       1
#>    chrom_7 chrom_8 chrom_9 chrom_10
#> 1        0       0       0        0
#> 2        1       0       1        0
#> 3        1       1       0        1
#> 4        1       1       0        0
#> 5        0       0       1        0
#> 6        1       0       0        1
#> 7        1       1       1        0
#> 8        1       1       0        1
#> 9        0       0       1        1
#> 10       1       1       1        1
#> 11       0       0       1        0
#> 12       0       1       1        1
#> 13       1       1       1        0
#> 14       0       1       0        1
#> 15       0       1       1        0
#> 16       1       1       0        1
#> 17       1       1       0        1
#> 18       1       1       1        0
#> 19       0       0       0        1
#> 20       1       0       0        1

This data set contains information about each of the SNPs. The first column gives the physical location of the SNP along the chromosome, in whatever units is useful to the user (usually bp or Kb). In this example, the positions are in base pairs (bp).

The next column is the genetic distance of the SNP from the start of the chromosome. Ignore this column for now.

The final columns are the SNP alleles for each of the chromosomes in the population. Each SNP must be biallelic, but can contain any value, for example 0s and 1s, or ACGTs. The data can contain missing values, however it is recommended that the cut off is 10% missing at most. Missing values should be coded as NA. It is also recommended to use a minor allele frequency of 5% or higher.

Note: There is no requirement to put data into a data frame - all that is required is a vector of SNP positions and a matrix of SNP values.

Zalpha

To test for selection, the user can use the Zalpha function. This function assigns the first SNP in the dataset as the “target locus”, calculates the \(Z_{\alpha}\) value, then moves on to the next SNP making that the target locus, until every SNP in the dataset has been considered. It works by calculating correlations between alleles on each side of the target locus and averaging them. To do this, the function needs three inputs:

results<-Zalpha(snps$bp_positions,3000,as.matrix(snps[,3:12]))
results
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zalpha
#>  [1]         NA         NA         NA         NA 0.09893585 0.10944106
#>  [7] 0.11602964 0.11215782 0.12419556 0.12962040 0.12583275 0.11980489
#> [13] 0.10802033 0.12064462 0.10873917 0.10267168         NA         NA
#> [19]         NA         NA
plot(results$position,results$Zalpha)

The output is in the form of a list and shows the positions of each of the SNPs and the \(Z_{\alpha}\) value calculated for each SNP. The NAs are because there were not enough SNPs on one side of the target locus for an accurate \(Z_{\alpha}\) value to be calculated. This is controlled by the parameters minRandL and minRL, which have defaults 4 and 25 respectively. minRandL specifies the minimum number of SNPs that must be to the left and right of the target SNP within the window for \(Z_{\alpha}\) to be calculated. minRL is the product of these numbers.

The graph shows a sharp increase in \(Z_{\alpha}\) values in the centre of this region, which could indicate the presence of a sweep. The user should compare the values across the whole genome to find outliers.

Say the user is only interested in the output of Zalpha for a particular region of the chromosome; this is achieved by setting the “X” parameter to the lower and upper bounds of the region.

Zalpha(snps$bp_positions,3000,as.matrix(snps[,3:12]),X=c(500,1000))
#> $position
#> [1]  500  600  700  800  900 1000
#> 
#> $Zalpha
#> [1] 0.09893585 0.10944106 0.11602964 0.11215782 0.12419556 0.12962040

That concludes the simple example of the Zalpha function!

It is recommended that the user uses the Zalpha_all function, as this function will calculate all the statistics in the zalpha package in one go, rather than running all of the statistics separately. More information on the Zalpha_all function can be found further down this vignette. Read on for information on the other statistics in the package and what they require.

Adjusting for expected correlations between SNPs

There are many reasons apart from selection that pairs of SNPs could be more correlated than the rest of the genome, including regions of low recombination and genetic drift. This package allows the user to correct for expected correlations between SNPs. There are multiple functions included in this package that adjust for expected correlations, all of which have an example below. First however, the new inputs will be described. The extra inputs required are:

Returning to the snps example dataset, we can now consider the second column of the dataset “cM_distances”.

snps$cM_distances
#>  [1] 0.0001 0.0002 0.0005 0.0008 0.0010 0.0015 0.0017 0.0019 0.0023 0.0024
#> [11] 0.0026 0.0027 0.0032 0.0037 0.0039 0.0040 0.0041 0.0045 0.0048 0.0049

Each value is the genetic distance of the SNP from the start of the chromosome. This could be in centimorgans (cM), linkage disequilibrium units (LDU) or any other way of measuring genetic distance, as long as it is additive (i.e. the distance between SNP A and SNP C is equal to the distance between SNP A and SNP B plus SNP B and SNP C).

There are many ways of calculating the genetic distances between SNPs. Some software that could be used include LDhat[2], pyrho[3], FastEPRR[4], and LDJump[5].

LD Profile

Using an LD (linkage disequilibrium) profile allows the user to adjust for variable recombination rates along the chromosome. An LD profile is a basic look-up table. It tells the user what the expected correlation between two SNPs is, given the genetic distance between them. Here is the example LD profile provided with the zalpha R package:

data(LDprofile)
LDprofile
#>       bin        rsq        sd    Beta_a    Beta_b
#> 1  0.0000 0.12214545 0.2275545 0.2764342 1.8116558
#> 2  0.0001 0.09335562 0.2197797 0.2708984 2.0267642
#> 3  0.0002 0.12218997 0.2370013 0.2698494 1.6785461
#> 4  0.0003 0.09042594 0.2005840 0.2639042 2.2719719
#> 5  0.0004 0.09846861 0.2415291 0.2498579 1.5550545
#> 6  0.0005 0.05234892 0.1645840 0.2734530 3.5077770
#> 7  0.0006 0.09849803 0.2171552 0.2805981 2.0719602
#> 8  0.0007 0.09234729 0.2185910 0.2729321 1.8642312
#> 9  0.0008 0.05612510 0.1463234 0.3270302 4.5627156
#> 10 0.0009 0.05799673 0.1569512 0.2875387 3.8010222
#> 11 0.0010 0.06451333 0.1803580 0.2695071 2.7318059
#> 12 0.0011 0.07202593 0.1980737 0.2575224 2.2259595
#> 13 0.0012 0.10457653 0.2452326 0.2527777 1.5974515
#> 14 0.0013 0.05750545 0.1681630 0.2854224 3.3140481
#> 15 0.0014 0.09774452 0.2388637 0.1924497 0.6386457
#> 16 0.0015 0.06229074 0.1752179 0.1791416 0.6853134
#> 17 0.0016 0.08488753 0.1917311 0.3031776 2.7157927
#> 18 0.0017 0.08160874 0.2037504 0.2639457 2.1465406
#> 19 0.0018 0.08745139 0.2037958 0.2704832 2.2651764
#> 20 0.0019 0.07123330 0.1840094 0.2810918 2.8468435
#> 21 0.0020 0.09485261 0.2191376 0.2723791 1.9259026
#> 22 0.0021 0.05804671 0.1458408 0.3207629 4.4316165
#> 23 0.0022 0.06447594 0.1502141 0.3368500 4.2842122
#> 24 0.0023 0.08147056 0.2012367 0.2690378 2.2633659
#> 25 0.0024 0.09434107 0.2250389 0.2557325 1.8280552
#> 26 0.0025 0.07046571 0.1831713 0.2857604 2.8830237
#> 27 0.0026 0.08397797 0.1926024 0.2900191 2.4698181
#> 28 0.0027 0.05834056 0.1662758 0.3033592 3.2711990
#> 29 0.0028 0.06702507 0.1662532 0.2933794 3.2574873
#> 30 0.0029 0.05820796 0.1723149 0.2682267 2.8738054
#> 31 0.0030 0.09507890 0.2085120 0.2716050 2.1487500
#> 32 0.0031 0.04551497 0.1239696 0.3296394 5.8488772
#> 33 0.0032 0.04241460 0.1352978 0.3349360 5.5737334
#> 34 0.0033 0.10255730 0.2259408 0.2718446 1.9065634
#> 35 0.0034 0.05181171 0.1523572 0.2978946 4.0344464
#> 36 0.0035 0.06537539 0.1633711 0.2937305 3.4617464
#> 37 0.0036 0.07133690 0.1728133 0.2845848 3.0456476
#> 38 0.0037 0.09034151 0.2365844 0.2438361 1.6296772
#> 39 0.0038 0.06661469 0.1718022 0.2842834 3.1988517
#> 40 0.0039 0.11322964 0.2325874 0.2031107 0.6843425
#> 41 0.0040 0.10068963 0.2067405 0.2918418 2.1743057
#> 42 0.0041 0.09924437 0.2321804 0.2601480 1.7887389
#> 43 0.0042 0.07122869 0.1813322 0.2701038 2.7201233
#> 44 0.0043 0.11335340 0.2081721 0.2869211 2.0950029
#> 45 0.0044 0.11286222 0.2357003 0.2679826 1.6981865
#> 46 0.0045 0.05429260 0.1622758 0.2940277 3.7098450
#> 47 0.0046 0.07505181 0.1992234 0.2679613 2.3091843
#> 48 0.0047 0.08873456 0.1996523 0.2754293 2.3102141
#> 49 0.0048 0.06982331 0.1902250 0.2603511 2.4806350
#> 50 0.0049 0.07837201 0.1944029 0.2745399 2.5207091

The LD profile contains data about the expected correlation between SNPs given the genetic distance between them. The columns in the example are:

If we know two SNPs are 0.00017 cM apart, this LD profile tells us that we expect the r2 value to be 0.093, with a standard deviation of 0.22, and that the expected distribution of r2 values for SNPs this far apart is Beta(0.27,2.03).

The package contains a function for creating an LD profile. This is explained lower down this vignette. The vignette continues by using the example LDprofile dataset supplied.

Zalpha_expected

The expected \(Z_{\alpha}\) value (denoted \(Z_{\alpha}^{E[r^2]}\)) can be calculated for a chromosome given an LD profile and the genetic distances between each SNP in the chromosome. Instead of calculating the r2 values between SNPs, the function uses the expected correlations. It does this by working out the genetic distance between each pair of SNPs and uses the r2 values given in the LD profile for SNPs that far apart.

Zalpha_expected(snps$bp_positions, 3000, snps$cM_distances, LDprofile$bin, LDprofile$rsq)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zalpha_expected
#>  [1]         NA         NA         NA         NA 0.08633502 0.08169169
#>  [7] 0.07981817 0.08104324 0.08134506 0.08015626 0.07984135 0.08082988
#> [13] 0.08263229 0.08040794 0.07822704 0.08289555         NA         NA
#> [19]         NA         NA

Note that this statistic does not use the SNP value data. Once \(Z_{\alpha}^{E[r^2]}\) has been calculated, it can be combined with the \(Z_{\alpha}\) results to adjust for recombination, for example by computing \(Z_{\alpha}\)/\(Z_{\alpha}^{E[r^2]}\). Outliers in this new combined statistic could be potential selection candidates.

Other functions that take into account variable recombination rates are Zalpha_rsq_over_expected, Zalpha_log_rsq_over_expected, Zalpha_Zscore, and Zalpha_BetaCDF. These statistics all use the actual r2 values from the data combined with the expected r2 values from the LD profile in various ways.

Examples of these statistics are here:

Zalpha_rsq_over_expected(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, LDprofile$bin, LDprofile$rsq)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zalpha_rsq_over_expected
#>  [1]       NA       NA       NA       NA 1.229709 1.455017 1.557973 1.473750
#>  [9] 1.663279 1.748810 1.655937 1.560111 1.348384 1.540937 1.403390 1.302492
#> [17]       NA       NA       NA       NA
Zalpha_log_rsq_over_expected(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, LDprofile$bin, LDprofile$rsq)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zalpha_log_rsq_over_expected
#>  [1]          NA          NA          NA          NA -2.80109183 -2.52953641
#>  [7] -2.89862430 -0.06997296 -0.04968498 -0.05720747 -0.09096136 -0.14235949
#> [13] -0.18846597 -0.11839984 -0.18570016 -3.28861224          NA          NA
#> [19]          NA          NA
Zalpha_Zscore(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, LDprofile$bin, LDprofile$rsq, LDprofile$sd)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zalpha_Zscore
#>  [1]         NA         NA         NA         NA 0.08320337 0.17092137
#>  [7] 0.21026515 0.17941815 0.24933136 0.28017067 0.25141352 0.21744923
#> [13] 0.13918111 0.21735113 0.16446541 0.12300862         NA         NA
#> [19]         NA         NA
Zalpha_BetaCDF(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, LDprofile$bin, LDprofile$Beta_a, LDprofile$Beta_b)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zalpha_BetaCDF
#>  [1]        NA        NA        NA        NA 0.5553324 0.6013140 0.6135663
#>  [8] 0.6114601 0.6153488 0.6073154 0.5895785 0.6060782 0.5871007 0.6161310
#> [15] 0.5889128 0.5730982        NA        NA        NA        NA

Note that not all the statistics need all the columns from the LD profile.

Zbeta

The Zbeta function works in the same way as the Zalpha function but evaluates correlations between pairs of SNPs where one is to the left of the target locus and the other is to the right. It is useful to use the \(Z_{\beta}\) statistic in conjunction with the \(Z_{\alpha}\) statistic, as they behave differently depending on how close to fixation the sweep is. For example, while a sweep is in progress both \(Z_{\alpha}\) and \(Z_{\beta}\) would be higher than other areas of the chromosome without a sweep present. However, when a sweep reaches near-fixation, \(Z_{\beta}\) would decrease whereas \(Z_{\alpha}\) would remain high. Combining \(Z_{\alpha}\) and \(Z_{\beta}\) into new statistics such as \(Z_{\alpha}\)/\(Z_{\beta}\) is one way of analysing this.

The Zbeta function requires the exact same inputs as the Zalpha function. Here is an example:

results<-Zbeta(snps$bp_positions,3000,as.matrix(snps[,3:12]))
results
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zbeta
#>  [1]        NA        NA        NA        NA 0.1280042 0.1298619 0.1219965
#>  [8] 0.1071535 0.1124896 0.1121871 0.1033178 0.1185118 0.1212802 0.1281512
#> [15] 0.1275420 0.1442328        NA        NA        NA        NA
plot(results$position,results$Zbeta)

Comparing this to the \(Z_{\alpha}\) graph in the earlier example, we can see that the value of \(Z_{\beta}\) decreases where \(Z_{\alpha}\) increases. This could indicate that, if there is a sweep at this locus, it is near-fixation.

There is an equivalent Zbeta function for each of the Zalpha variations. Here is an example for each of them:

Zbeta_expected(snps$bp_positions, 3000, snps$cM_distances,
               LDprofile$bin, LDprofile$rsq)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zbeta_expected
#>  [1]         NA         NA         NA         NA 0.07935940 0.07840546
#>  [7] 0.07845889 0.07726474 0.07734390 0.07801402 0.07743300 0.07714027
#> [13] 0.07798668 0.07880907 0.08001374 0.08138610         NA         NA
#> [19]         NA         NA
Zbeta_rsq_over_expected(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, 
                        LDprofile$bin, LDprofile$rsq)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zbeta_rsq_over_expected
#>  [1]       NA       NA       NA       NA 1.662441 1.753478 1.628663 1.463948
#>  [9] 1.543292 1.559820 1.468623 1.689309 1.687198 1.759164 1.737935 1.934010
#> [17]       NA       NA       NA       NA
Zbeta_log_rsq_over_expected(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, 
                            LDprofile$bin, LDprofile$rsq)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zbeta_log_rsq_over_expected
#>  [1]          NA          NA          NA          NA -5.81626185 -6.65751023
#>  [7] -7.49094344 -6.58013806 -6.69181952 -6.54908550 -7.83646879 -6.18038047
#> [13] -7.39743178 -6.45648923 -6.00894964  0.09510038          NA          NA
#> [19]          NA          NA
Zbeta_Zscore(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, 
             LDprofile$bin, LDprofile$rsq, LDprofile$sd)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zbeta_Zscore
#>  [1]        NA        NA        NA        NA 0.2644152 0.2959959 0.2473133
#>  [8] 0.1761311 0.2115564 0.2148552 0.1755580 0.2596216 0.2590911 0.2916302
#> [15] 0.2848765 0.3617791        NA        NA        NA        NA
Zbeta_BetaCDF(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, 
              LDprofile$bin, LDprofile$Beta_a, LDprofile$Beta_b)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zbeta_BetaCDF
#>  [1]        NA        NA        NA        NA 0.5904780 0.5810107 0.5627583
#>  [8] 0.5507821 0.5595338 0.5637204 0.5466796 0.5865750 0.5672179 0.5854016
#> [15] 0.5899818 0.6315494        NA        NA        NA        NA

Diversity statistics LR and L_plus_R

These statistics show the diversity around the target locus. LR calculates the number of SNPs to the left of the target locus multiplied by the number of SNPs to the right. L_plus_R is the total number of pairs of SNPs on the left and the right of the target locus. The idea behind these statistics is that if the diversity is low, there might be a sweep in this region.

Care should be taken when interpreting these statistics if diversity has been altered by filtering and, when using the Zalpha_all function below, the use of minRL and minRandL parameters.

Zalpha_all

Zalpha_all is the recommended function for using this package. It will run all the statistics included in the package (\(Z_{\alpha}\) and \(Z_{\beta}\) variations), so the user does not have to run multiple functions to calculate all the statistics they want. The function will only calculate the statistics it has been given the appropriate inputs for, so it is flexible.

For example, this code will only run Zalpha, Zbeta and the two diversity statistics LR and L_plus_R, as an LD profile was not supplied:

Zalpha_all(snps$bp_positions,3000,as.matrix(snps[,3:12]))
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $LR
#>  [1]  0 15 30 45 60 70 78 84 88 90 90 88 84 78 70 60 45 30 15  0
#> 
#> $L_plus_R
#>  [1] 105 105 106 108 111 101  93  87  83  81  81  83  87  93 101 111 108 106 105
#> [20] 105
#> 
#> $Zalpha
#>  [1]         NA         NA         NA         NA 0.09893585 0.10944106
#>  [7] 0.11602964 0.11215782 0.12419556 0.12962040 0.12583275 0.11980489
#> [13] 0.10802033 0.12064462 0.10873917 0.10267168         NA         NA
#> [19]         NA         NA
#> 
#> $Zbeta
#>  [1]        NA        NA        NA        NA 0.1280042 0.1298619 0.1219965
#>  [8] 0.1071535 0.1124896 0.1121871 0.1033178 0.1185118 0.1212802 0.1281512
#> [15] 0.1275420 0.1442328        NA        NA        NA        NA

Supplying an LD profile and genetic distances for each SNP will result in more of the statistics being calculated.

There are many ways that the resulting statistics can be combined to give new insights into the data, see Jacobs et al. (2016)[1].

Identifying regions under selection

To find candidate regions for selection, first calculate the statistics across the chromosome, including any combined statistics that may be of interest. It is then suggested to find the maximum value for windows of around 200 Kb for each statistic (minimum values for the diversity statistics). Any regions that are outliers compared to the rest of the chromosome could be considered candidates and can be investigated further.

create_LDprofile

An LD profile is required to adjust for expected correlations. It is a basic look-up table that tells the user what the expected correlation is between a pair of SNPs, given the genetic distance between them. To create a simple LD profile, the user just needs two things:

The user also needs to tell the function what bin_size to use, and optionally if they want to calculate Beta distribution parameters (this requires the fitdistrplus package to be installed).

The function considers all pairs of SNPs. It separates the pairs of SNPs into bins based on the genetic distance between them. The correlation between each pair of SNPs is calculated. In each bin, the average and the standard deviation of these correlations is calculated. If beta_params=TRUE, then a beta distribution will be fitted to the correlations in each bin too.

We can use the snps dataset as an example of this:

create_LDprofile(snps$cM_distances,as.matrix(snps[,3:12]),bin_size = 0.001,beta_params = TRUE)
#>     bin        rsq        sd    Beta_a   Beta_b  n
#> 1 0.000 0.10226664 0.1318628 0.8232427 6.435184 59
#> 2 0.001 0.14412346 0.1745857 0.7323277 3.960080 51
#> 3 0.002 0.09538328 0.1122312 0.9962786 8.322913 41
#> 4 0.003 0.11193736 0.1303776 1.0432924 7.113517 28
#> 5 0.004 0.19393939 0.1963924 1.4733393 4.979831 11

This code has created an LD profile with 6 columns. These are:

There is one more optional input parameter - max_dist - which sets the maximum distance SNPs can be apart for calculating for the LD profile. For real world data, Jacobs et al. (2016)[1] recommend using distances up to 2 cM assigned to bins of size 0.0001 cM. Without this parameter, the code will generate bins up to the maximum distance between pairs of SNPs, which is likely to be inefficient as most distances will not be used. max_dist should be big enough to cover the genetic distances between pairs of SNPs within the window size given when the \(Z_{\alpha}\) statistics are run. Any pairs with genetic distances bigger than max_dist will be assigned the values in the maximum bin of the LD profile.

Ideally, we would want to generate an LD profile based on genetic data without selection but exactly matching the other population parameters for our data. This could be done using simulated data (using software such as MSMS[6] or SLiM[7]). We could use another genetic dataset containing a similar population. Alternatively, we could generate an LD profile using the same dataset that we are analysing for selection. Care should be taken that bins are big enough to have a lot of data in so expected r2 values are not overly affected by outliers.

Realistically, the user will not have just one chromosome of data for creating the LD profile, but will likely have a whole genome. So far, we have used a vector of genetic distances and a SNP value matrix in our example. However, with multiple chromosomes there will be a vector of genetic distances and a SNP value matrix for each chromosome, and it would be good to use all that information to create the LD profile. Therefore, the function has been written to accept multiple vectors of genetic distances and multiple SNP value matrices via lists.

The dist parameter will accept a vector or a list of vectors.

The x parameter will accept a matrix or a list of matrices.

For example, if we use the snps dataset but this time pretend it is three different chromosomes.

## Generate three chromosomes of data - cM distances and SNP values
chrom1_cM_distances<-snps$cM_distances
chrom1_snp_values<-as.matrix(snps[,3:12])

chrom2_cM_distances<-snps$cM_distances
chrom2_snp_values<-as.matrix(snps[,3:12])

chrom3_cM_distances<-snps$cM_distances
chrom3_snp_values<-as.matrix(snps[,3:12])

## create a list of the cM distances
cM_distances_list<-list(chrom1_cM_distances,chrom2_cM_distances,chrom3_cM_distances)

## create a list of SNP value matrices
snp_values_list<-list(chrom1_snp_values,chrom2_snp_values,chrom3_snp_values)

## create the LD profile using the lists as the dist and x parameters
create_LDprofile(cM_distances_list,snp_values_list,bin_size = 0.001,beta_params = TRUE)
#>     bin        rsq        sd    Beta_a   Beta_b   n
#> 1 0.000 0.10226664 0.1311114 0.6691595 5.517397 177
#> 2 0.001 0.14412346 0.1734333 0.5921975 3.371973 153
#> 3 0.002 0.09538328 0.1113074 0.7402206 6.697932 123
#> 4 0.003 0.11193736 0.1287972 0.7396069 5.557230  84
#> 5 0.004 0.19393939 0.1901561 1.0279480 3.889066  33

Care should be taken that the chromosomes stay in the same order in each list.

Congratulations! You should now be able to create your own LD profile and use the zalpha package.

References

[1] Jacobs, G.S., Sluckin, T.J., and Kivisild, T. Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps. Genetics, 2016. 203(4): p. 1807

[2] McVean, G. A. T., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R., and Donnelly, P. The Fine-Scale Structure of Recombination Rate Variation in the Human Genome. Science, 2004. 304(5670): 581-584.

[3] Spence, J.P. and Song, Y.S. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. Science Advances, 2019. 5(10): eaaw9206.

[4] Gao, F., Ming, C., Hu, W. J., and Li, H. P. New Software for the Fast Estimation of Population Recombination Rates (FastEPRR) in the Genomic Era. G3-Genes Genomes Genetics, 2016. 6(6): 1563-1571.

[5] Hermann, P., Heissl, A., Tiemann-Boege, I., and Futschik, A. LDJump: Estimating variable recombination rates from population genetic data. Molecular Ecology Resources, 2019. 19(3): 623-638.

[6] Ewing, G. and Hermisson, J. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics, 2010. 26(16):2064-2065.

[7] Haller, B.C. and Messer, P.W. SLiM 3: Forward Genetic Simulations Beyond the Wright–Fisher Model. Molecular Biology and Evolution, 2019. 36(3):632-637.