Getting Started with Capture Hi-C Analysis Engine

2021-11-03

Background

CHiCANE is a toolkit for identifying chromatin regions that interact more often than expected by chance given the specific interaction dynamics of a capture Hi-C (CHi-C) experiment. It offers extensive flexibility in implementing interaction calling via numerous optional parameters so that it can be tailored to a wide range of CHi-C experiments. Additionally, CHiCANE includes multiple helper functions for use in both pre- and post-processing/analysis of the interaction calls.

For details of statistical concepts implemented by CHiCANE and their application to both the region CHi-C and promoter CHi-C datasets, please see our manuscript: Holgerson, Erle, Gillespie, Andrea, et al. “Identifying High Confidence Capture Hi-C Interactions Using CHiCANE” (Nature Protocols 2021).

Quickstart

The main function for running the CHiCANE method is chicane(). This takes a bam file and bed files with all restriction fragments and targeted restriction fragments as input, and produces a table with fragment interactions and associated p-values.

The package comes with a subset of reads from a capture Hi-C experiment on the Bre80 normal epithelial breast tissue cell line (Baxter et al. 2018). The experiment targeted several breast cancer risk loci in non-coding regions of the genome, and aimed to associate these risk variants with genes. Two of the risk SNPs are rs13387042 and rs16857609, and reads that map to them have been included in the file Bre80_2q35.bam.

library(chicane);
#> Loading required package: gamlss.tr
#> Loading required package: gamlss.dist
#> Loading required package: MASS
#> Loading required package: gamlss
#> Loading required package: splines
#> Loading required package: gamlss.data
#> 
#> Attaching package: 'gamlss.data'
#> The following object is masked from 'package:datasets':
#> 
#>     sleep
#> Loading required package: nlme
#> Loading required package: parallel
#>  **********   GAMLSS Version 5.3-4  **********
#> For more on GAMLSS look at https://www.gamlss.com/
#> Type gamlssNews() to see new features/changes/bug fixes.
#> Loading required package: data.table
# example BAM file, baits, and restriction fragments
bam <- system.file('extdata', 'Bre80_2q35.bam', package = 'chicane');
baits <- system.file('extdata', '2q35.bed', package = 'chicane');
fragments <- system.file('extdata', 'GRCh38_HindIII_chr2.bed.gz', package = 'chicane'); # HindIII fragments on chromosome 2

if( bedtools.installed() ) {
  chicane.results <- chicane(
    bam = bam,
    baits = baits,
    fragments = fragments
    );

    print( chicane.results[ 1:10 ] );
}
#> Taking input= as a system command ('sh /tmp/RtmpdNJTjh/Rinsta536160f7fb70/chicane/bin/prepare_bam.sh -b /tmp/RtmpdNJTjh/Rinsta536160f7fb70/chicane/extdata/Bre80_2q35.bam -c /tmp/RtmpdNJTjh/Rinsta536160f7fb70/chicane/extdata/2q35.bed -f /tmp/RtmpdNJTjh/Rinsta536160f7fb70/chicane/extdata/GRCh38_HindIII_chr2.bed.gz -d /tmp/RtmpdNJTjh/Rbuilda536136379101/chicane/vignettes ') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.
#>                    target.id                  bait.id target.chr target.start
#>  1: chr2:217491009-217492415 chr2:217035649-217042355       chr2    217491009
#>  2: chr2:216650000-216651600 chr2:217430817-217437011       chr2    216650000
#>  3: chr2:217290048-217292530 chr2:217430817-217437011       chr2    217290048
#>  4: chr2:216269087-216273140 chr2:217430817-217437011       chr2    216269087
#>  5: chr2:216551051-216551968 chr2:217430817-217437011       chr2    216551051
#>  6: chr2:216879929-216883553 chr2:217430817-217437011       chr2    216879929
#>  7: chr2:216696003-216698549 chr2:217430817-217437011       chr2    216696003
#>  8: chr2:216687613-216696003 chr2:217430817-217437011       chr2    216687613
#>  9: chr2:216906471-216908390 chr2:217430817-217437011       chr2    216906471
#> 10: chr2:217162758-217166250 chr2:217430817-217437011       chr2    217162758
#>     target.end bait.chr bait.start  bait.end bait.to.bait bait.trans.count
#>  1:  217492415     chr2  217035649 217042355        FALSE                0
#>  2:  216651600     chr2  217430817 217437011        FALSE                0
#>  3:  217292530     chr2  217430817 217437011        FALSE                0
#>  4:  216273140     chr2  217430817 217437011        FALSE                0
#>  5:  216551968     chr2  217430817 217437011        FALSE                0
#>  6:  216883553     chr2  217430817 217437011        FALSE                0
#>  7:  216698549     chr2  217430817 217437011        FALSE                0
#>  8:  216696003     chr2  217430817 217437011        FALSE                0
#>  9:  216908390     chr2  217430817 217437011        FALSE                0
#> 10:  217166250     chr2  217430817 217437011        FALSE                0
#>     target.trans.count  distance count expected      p.value   q.value
#>  1:                  0  452710.0    28 6.222070 0.0001352420 0.2346449
#>  2:                  0  783114.0    16 2.992273 0.0004798454 0.8172368
#>  3:                  0  142625.0    33 8.889424 0.0010581410 0.8172368
#>  4:                  0 1162800.5     8 2.101973 0.0019527803 0.8172368
#>  5:                  0  882404.5    11 2.722403 0.0021130042 0.8172368
#>  6:                  0  552173.0    11 3.822236 0.0036069810 0.8172368
#>  7:                  0  736638.0    16 4.162590 0.0043689364 0.8172368
#>  8:                  0  742106.0    15 3.999764 0.0055330101 0.8172368
#>  9:                  0  526483.5    11 4.150239 0.0063231596 0.8172368
#> 10:                  0  269410.0    24 8.030272 0.0063767687 0.8172368

The chicane function returns a data.table object sorted by q-value. Each row contains information about the target and bait fragments, and the observed and expected number of reads linking the two fragments.

Pre-processing data

Before running the main chicane() function, a minumum of three input files must be obtained. The chicane() function, which is the main interaction calling step, can be run on these three input files directly as it will work as a wrapper function to prepare the data and the run interaction calling in a single step. Alternatively, there is a prepare.data() function which allows a user to run data preparation on these same three input files to make an interactions data table, which can subsequently be used as input for interaction calling separately in a step-wise fashion (as well as a number of other useful functions as discussed below). This section dicusses how to obtain each of the three required inputs, followed by a discussion of the optional preparation of an interactions data table.

The first required input is at least one BAM file containing mapped interaction reads. To convert the BAM file into a text file, R calls the shell script prepare_bam.sh. This script relies on bedtools to identify the restriction fragments corresponding to each read (based on the second required input, the \(in\) \(silico\) genome digest) and select reads with at least one end overlapping the baitmap file (the third required input). The reads are also filtered to remove those in which both reads map to the same fragment. The resulting data is read into a data.table object, and the number of reads linking any two fragments is calculated.

The second required input is the \(in\) \(silico\) digest file, which contains the coordinates of the fragments created by the restriction enzyme (\(e.g.\) HindIII) used in the experiment, in BED format. For users that have a digested genome output from HiCUP, this needs to be converted to a strict 3 column, header-free BED format to be compatible with bedtools. CHiCANE includes a helper function, convert.hicup.digest.bed(), which can be used to convert a digested genome in HiCUP format to a BED file of the restiction digest fragments. Any BED file containing the coordinates of restriction enzyme digested fragments of the genome build used according to the experiment are appropriate to be used with CHiCANE. For the purposes of this vignette, the fragments variable is assigned to the HindIII fragments of GRCh38 chromosome 2 (sample data included in the package) at the top of this document. Here is an example of what this should look like showing the top six entries in the example fragments file included in the package:

chr2     0 15298
chr2 15298 16049
chr2 16049 18132
chr2 18132 24992
chr2 24992 28070
chr2 28070 28179

The third and final input required is a baitmap in BED format containing the coordinates of only the restriction digested fragments which contain a bait from the capture step of the experiment. This can be obtained by conducting a bedtools intersect command on the fragment digest above and a BED file with the coordinates of the individual probes included on the array used in the capture step. The following is an example command to create the baitmap, then limit it to only unique fragments with an overlap of at least 20bp:

# intersect restriction fragments file with coordinates for probes included on the array
bedtools intersect -wo -a /path/to/fragments.bed -b /path/to/capture_regions.bed \
> /path/to/fragments_overlap_full.bed
# limit to overlap of at least 20bp and unique fragments only
awk '$7>=20 {print $1 "\t" $2 "\t" $3}' /path/to/fragments_overlap_full.bed | uniq > /path/to/baitmap.bed

The final output of the above command (baitmap.bed) is then used as the baits parameter in either the chicane or prepare.data function calls. At the top of this vignette we have set baits to the sample baitmap at the 2q35 locus which is included in the package. As an example of how this file should look, this is how the BED file for the sample baits at 2q35 would appear:

chr2 217035649 217042355
chr2 217430817 217437011

NOTE: The BED format used for the fragments and baits parameters is a header-free, tab-separated, 3 column (chr, start, end) format. Containing strictly 3 columns is necessary both as input to the CHiCANE function calls and for the snippet of code above which creates the baitmap to function properly.

To speed up model fitting, it is possible to pre-process the data with the prepare.data() function and run chicane directly on the resulting data table. Furthermore, running in this step-wise manner is useful for those wanting to add custom covariates not already included in the data table or test variations of parameters and/or distibutions in the interaction calling. To first compare biological replicates using the compare.replicates() function, prepare.data() should be run on each replicate separately to assess concordance before deciding to merge them. This processing can take a while (~45 minutes for a 19GB BAM file on a commodity computer with 16GB RAM), and only needs to be done once for each BAM.

if( bedtools.installed() ) {
  interaction.data <- prepare.data(
    bam = bam,
    baits = baits,
    fragments = fragments
    );
}
#> Taking input= as a system command ('sh /tmp/RtmpdNJTjh/Rinsta536160f7fb70/chicane/bin/prepare_bam.sh -b /tmp/RtmpdNJTjh/Rinsta536160f7fb70/chicane/extdata/Bre80_2q35.bam -c /tmp/RtmpdNJTjh/Rinsta536160f7fb70/chicane/extdata/2q35.bed -f /tmp/RtmpdNJTjh/Rinsta536160f7fb70/chicane/extdata/GRCh38_HindIII_chr2.bed.gz -d /tmp/RtmpdNJTjh/Rbuilda536136379101/chicane/vignettes ') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.

The interaction data object contains details of the fragments, and the number of reads linking the two fragments.

if( bedtools.installed() ) print(interaction.data);
#>                      target.id                  bait.id target.chr target.start
#>    1: chr2:200013520-200014510 chr2:217035649-217042355       chr2    200013520
#>    2: chr2:200018581-200020582 chr2:217035649-217042355       chr2    200018581
#>    3: chr2:200020582-200023537 chr2:217430817-217437011       chr2    200020582
#>    4: chr2:200033118-200034885 chr2:217430817-217437011       chr2    200033118
#>    5: chr2:200040560-200041826 chr2:217430817-217437011       chr2    200040560
#>   ---                                                                          
#> 4477: chr2:229867689-229872871 chr2:217035649-217042355       chr2    229867689
#> 4478: chr2:229872871-229875614 chr2:217035649-217042355       chr2    229872871
#> 4479: chr2:229882404-229887121 chr2:217430817-217437011       chr2    229882404
#> 4480: chr2:229894321-229904945 chr2:217035649-217042355       chr2    229894321
#> 4481: chr2:229917543-229919836 chr2:217430817-217437011       chr2    229917543
#>       target.end bait.chr bait.start  bait.end bait.to.bait bait.trans.count
#>    1:  200014510     chr2  217035649 217042355        FALSE                0
#>    2:  200020582     chr2  217035649 217042355        FALSE                0
#>    3:  200023537     chr2  217430817 217437011        FALSE                0
#>    4:  200034885     chr2  217430817 217437011        FALSE                0
#>    5:  200041826     chr2  217430817 217437011        FALSE                0
#>   ---                                                                       
#> 4477:  229872871     chr2  217035649 217042355        FALSE                0
#> 4478:  229875614     chr2  217035649 217042355        FALSE                0
#> 4479:  229887121     chr2  217430817 217437011        FALSE                0
#> 4480:  229904945     chr2  217035649 217042355        FALSE                0
#> 4481:  229919836     chr2  217430817 217437011        FALSE                0
#>       target.trans.count distance count
#>    1:                  0 17024987     1
#>    2:                  0 17019420     1
#>    3:                  0 17411854     2
#>    4:                  0 17399912     1
#>    5:                  0 17392721     1
#>   ---                                  
#> 4477:                  0 12831278     1
#> 4478:                  0 12835240     1
#> 4479:                  0 12450848     1
#> 4480:                  0 12860631     1
#> 4481:                  0 12484776     1

The interactions parameter of the chicane() function can take either a data.table object from prepare.data or the path to such a table.

if( bedtools.installed() ) {
  file.name <- tempfile('interaction_data');
  write.table(interaction.data, file.name, row.names = FALSE);
  chicane.results <- chicane(interactions = file.name); 
}

Statistical model

CHiCANE models the expected number of paired reads containing two restriction fragments as a function of distance between them and how frequently the bait interacts with other fragments.

In the CHiCANE model, \(Y_{ij}\) represents the number of reads linking bait \(i\) and target fragment \(j\), such that it is assumed to follow a distribution with a mean \(\mu_{ij}\) as:

\[\begin{equation} \mu_{ij} = \beta_0 + \beta_1\log(d_{ij}) + \beta_2\log(t_i + 1) \end{equation}\]

where \(d_{ij}\) represents the distance between the two fragments, and \(t_i\) represents the number of reads linking the bait with a \(trans\) fragment.

For bait to bait interactions, terms are added to the model to adjust for \(trans\) counts of the other end, fragment \(j\), and the product of \(trans\) counts of both fragments as:

\[\begin{equation} \mu_{ij} = \beta_0 + \beta_1\log(d_{ij}) + \beta_2\log(t_i + 1) + \beta_3\log(t_j + 1) + \beta_4\log(t_i + 1)\log(t_j + 1) \end{equation}\]

Each possible interaction is assaigned an expected \(\mu_{ij}\) and a \(p\)-value for the observed counts \(y_{ij}\) versus what is expected by chance is calculated as:

\[\begin{equation} p = P(Y_{ij} \geq y_{ij}) \end{equation}\]

This probability is dependent on how the distribution is modelled.

Distribution of the counts

The CHiCANE method supports several different distributions of the counts \(Y_{ij}\) conditional on the mean \(\mu_{ij}\). By default the counts are assumed to follow a negative binomial distribution with \(E(Y_{ij}) = exp (\mu_{ij})\). After fitting the model by a maximum likelihood over all possible interactions, the fitted model is used to estimate a \(p\)-value for each pair. Poisson, truncated Poisson, or truncated negative binomial distributions are also supported and can override the default negative binomial by being passed as the distribution parameter.

Negative binomial

Negative binomial is the default distribution in CHiCANE as it is deemed more appropriate for modelling datasets with high variance in comparison to Poisson due to its inclusion of a term for overdispersion.

Using this default distribution, the counts are modelled as \(Y \sim NB(\mu, \theta)\), with mean \(\mu\) and dispersion term \(\theta\). The variance of the counts conditional on the mean is then

\[\begin{equation} Var(Y) = \mu + \frac{\mu^2}{\theta} \end{equation}\]

This model implements the glm.nb() function in the MASS package.

If the model fails to converge due to a lack of overdispersion in the data chicane() will attempt to diagnose the error and if this is found to be the case a Poisson distribution will be implemented instead.

Poisson

The Poisson distribution assumes that \(Y \sim \text{Poisson}(\mu)\), such that \(Var(Y) = \mu\). As the Poisson model does not allow variance of counts to exceed the mean, capture Hi-C counts usually possesses more variability than can be explained using this distribution, the result of which is false positives.

Truncated negative binomial

By default CHiCANE only includes observed interactions, rather than all possible interactions, \(i.e.\) leaving out zero counts (see Inclusion of Zeros). One way to account for this is in the statistical model, by fitting a truncated negative binomial distribution in the chicane() call which implements the GAMLSS R package (Stasinopoulos and Rigby 2016). It assigns \(P(Y = 0) = 0\), and assumes proportional probabilities to the negative binomial model for values \(y > 0\).

Truncated poisson

The truncated Poisson distribution is also supported, which behaves much like the truncated negative binomial model, but rather assumes the probabilities are proportional to Poisson probabilities for values \(y>0\).

It is worth noting that specifying either of the truncated distributions will significantly slow down the model fitting.

Adjusting for covariates

CHiCANE allows users to specify additional covariates that can be added to the model with the adjustment.terms parameter. For example, we can add a term to adjust for the chromosome of the target fragment as shown below.

data(bre80);
adjusted.results <- chicane(
  interactions = bre80, 
  adjustment.terms = 'target.chr'
  );

print( adjusted.results[ 1:5 ] );
#>                   target.id                  bait.id target.chr target.start
#> 1:    chr19:1155128-1228414 chr2:217035649-217042355      chr19      1155128
#> 2: chr1:117125276-117135137 chr2:217035649-217042355       chr1    117125276
#> 3:   chr1:14950583-14951791 chr2:217035649-217042355       chr1     14950583
#> 4:   chr1:49076960-49080684 chr2:217035649-217042355       chr1     49076960
#> 5: chr3:145680459-145682483 chr2:217035649-217042355       chr3    145680459
#>    target.end bait.chr bait.start  bait.end bait.to.bait bait.trans.count
#> 1:    1228414     chr2  217035649 217042355        FALSE             9494
#> 2:  117135137     chr2  217035649 217042355        FALSE             9494
#> 3:   14951791     chr2  217035649 217042355        FALSE             9494
#> 4:   49080684     chr2  217035649 217042355        FALSE             9494
#> 5:  145682483     chr2  217035649 217042355        FALSE             9494
#>    target.trans.count distance count expected   p.value   q.value
#> 1:                  2       NA     2 1.003963 0.2656989 0.6417435
#> 2:                  2       NA     2 1.003992 0.2657096 0.6417435
#> 3:                  2       NA     2 1.003992 0.2657096 0.6417435
#> 4:                  2       NA     2 1.003992 0.2657096 0.6417435
#> 5:                  2       NA     2 1.004014 0.2657179 0.6417435

The adjustment.terms parameter also supports expressions such as log transformations. Multiple adjustment terms can be added by passing as a vector.

Custom covariates

Any adjustment term must correspond to a column already present in the data. If a user would like to adjust for anything that is not already present in the CHiCANE output (\(e.g.\) GC content), there’s a three step approach:

  1. Run prepare.data() on your BAM file(s)
  2. Add extra columns containing the desired covariate to the resulting data frame
  3. Run chicane() by setting the interactions parameter to the output from step 2

Filtering baits and targets

By default, all baits and targets are included when fitting the CHiCANE model. An alternative approach is to filter out baits and fragments with low (Dryden et al. 2014) or high (Cairns et al. 2016) degree of “interactibility,” which is inferred via \(trans\) counts for each bait. CHiCANE also supports this filtering through the bait.filters and target.filters parameters.

Each of these take a vector of length two, where each element corresponds to the proportion of fragments that should be considered to fall below the “lower limit” and “upper limit.” For example, passing in bait.filters = c(0.3, 0.9) means that the baits with the lowest 30% or highest 10% of \(trans\) counts will be removed before fitting the model.

Filtering fragments may affect multiple testing correction by changing the number of tests performed.

Merging biological replicates

The helper function compare.replicates() can be used to assess concordance between biological replicates before deciding whether to merge them to improve signal to noise ratio. This is done by conducting the prepare.data() step on each BAM separately before passing to the compare.replicates() function.

CHiCANE can merge replicates at the data processing step. If more than one BAM file is available, these can be passed as a vector to the bam parameter of the chicane() and prepare.data() functions. The replicate.merging.method parameter determines how replicates are merged. Available options are ‘sum’ and ‘weighted-sum,’ with the former being the default.

Assessing model fits

In order to assess goodness of fit for the model implemented, CHiCANE outputs rootogram model fit plots and statistics to the directory provided through the interim.data.directory parameter in the chicane() function. The default NULL will skip these so it is important for a user to set this directory in order assess the model fit.

Setting distance bins

Adjusting for distance is done in a piecewise linear fashion. \(cis\) data is ordered by distance, and split into equally sized bins. The count model is then fit separately within each bin, and results are merged at the end by concatenating the results from the individual model fits.

By default, CHiCANE will split the data into distance quantiles, \(i.e.\) 100 equally sized groups. However, if the resulting datasets are too small, the number of distance bins is halved until all resulting datasets are large enough for the model to be fit. To override this default, pass the desired number of distance bins to distance.bins. For example, setting distance.bins = 4 results in the count model being fit separately in each quartile. \(trans\) and \(cis\) interactions are fit separately.

Multiple testing correction

Finally, CHiCANE applies the Benjamini-Hochberg method of multiple testing correction separately for each bait. To change this to a global multiple testing correction, set multiple.testing.correction = 'global' in the main chicane() function.

The chicane() function’s final output is a q-value sorted data table of all interactions included in the model fitting. This full interactions table is useful in some instances such as the locus plot and background interactions functionality discussed below. However, its large size is rather unwieldy for most downstream analysis, so it is useful to also filter for only interactions below a desired q-value threshold and save this more workable table as well.

Interpretation of interaction peaks

In order to assist users with interpretation of interaction calls, CHiCANE also includes several helper functions for this purpose designed to be applicable to a myriad of experimental designs as delineated below and in more detail in our aforementioned manuscript.

Annotation with bedtools

The first step in interpretation of the interaction calls is likely to be annotation of the target fragments. Selecting just columns 3-5 from the CHiCANE interactions data table and saving the resulting coordinates as a BED file will then allow a bedtools intersect to be done between the target fragments and a genome annotation file also in BED format. The resulting output will indicate the genes that any target fragments overlap with.

Similarly, an intersect could be performed with VCF files containing variant calls or BED files containing ChIP-Seq peak calls for histone marks or CTCF binding sites. There are many disparate datasets which use a file type compatible with bedtools intersect and numerous options for the command to customise the output.

Visualising interaction peaks in WashU Epigenome Browser

CHiCANE also contains a function to convert its results into a format compatible with the WashU Epigenome Browser (Li et al. 2019), convert.standard.format(). The path to significant interaction calling results is provided to the chicane.results parameter and the desired name for the output file is passed to file.name. Once standard format is obtained it can be uploaded to the browser from the ‘Tracks’ dropdown, selecting ‘Local Text tracks.’ In the upload window select ‘long-range format by CHiCANE’ as the text file type and then your standard format file in ‘Browse.’ Once the track is loaded in the browser, you can right click on the track and set the Display Mode to ‘ARC.’

Locus plots

CHiCANE results are also usable with the Gviz (Hahne and Ivanek 2016) package to visualise along with highly customisable genome anootation tracks. Additionally, the create.locus.plot() function in the CHiCANE package will make a basic implementation of such a plot with 6 tracks:

  1. Ideogram of selected chromosome
  2. Genome axis of the chromosome
  3. Genome annotation provided by the user
  4. A feature annotation displaying coordinates of any feature as provided by user in BED format (\(e.g.\) baitmap, TADs, CTCF)
  5. Raw counts of fragments from full interaction data table
  6. Interactions below user-specified FDR threshold

The user needs to provide the genome build to the genome parameter and coordinates of the locus desired to be plotted to the chr, start, and end parameters. A genome annotation file in GTF format needs to be provided to gene.data. The genomic feature chosen to be included will be passed as a BED format file to genomic.features and the desired name for this track is passed as feature.name. Finally, the path to the FULL interaction calls table is passed to interaction.data and the name for the output plot is set by file.name. By default only interactions below an FDR threshold of 0.05 will be included in the interactions track; a user can override this default by setting the `fdr.filter’ parameter. A brief example keeping all other parameters at default is shown below:

if( bedtools.installed() ) {
  create.locus.plot(
    genome = 'hg38',
    chr = 'chr2', start = 111100000, end = 112200000,
    gene.data = '/path/to/gene_data.gtf',
    genomic.features = '/path/to/baitmap.bed',
    feature.name = 'baits',
    interaction.data = '/path/to/interaction_calls.txt',
    file.name = '/path/to/locus_plot.pdf'
    ); 
}

Background Interactions

In order to accurately estimate histone marker enrichment, the helper function stratified.enrichment.sample() will randomise interactions which match the number and proportion of observed interactions which serve as background for comparison, stratified by distance bins: 0-100kbp, 100kbp-1Mbp, 1Mbp-10Mbp, 10Mbp+, and \(trans\). Full paths to data tables containing significant and insignificant calls are passed to significant.results and nonsignificant.results, respectively.

Example Workflow (T-47D breast cancer cell line)

The code in this workflow is intended to be run on Baxter T-47D from the Zenodo directory of Supplementary Data for our manuscript at: https://zenodo.org/record/4073433

First pre-processing of input files needs to be completed as discussed above (see Pre-Processing Data). The BAM files referred to in this vignette have been created using HiCUP, the digested fragments were processed with the convert.hicup.digest.bed() function, and the baitmap was created via bedtools intersect between the aforementioned fragments and array bait probes used in the capture step. The fragments used here (hg38, HindIII digested) are found in Zenodo directory under data/ and the baitmap is found in data/captured_fragments/T47D/ as used in the prepare.data() call below.

The interactions data table can then be prepared with these processed inputs (aligned BAMs, fragments BED, and baitmap BED). The resulting output can then be saved before being passed to the main chicane() function for interaction calling. Once completed save both the full interaction calls results as well as calls filtered for significance to be used for different purposes.

if( bedtools.installed() ) {

  # initial prep of interactions data table
  interaction.data <- prepare.data(
    bam = list.files('data/bams/T47D', pattern = '.bam', full.names = TRUE),
    baits = 'data/captured_fragments/T47D/Baxter_captured_fragments.bed',
    fragments = 'data/GRCh38_HindIII_fragments.bed'
    );

  # save interactions data table
  file.name <- tempfile('interaction_data');
  write.table(interaction.data, file.name, row.names = FALSE);

  # run interaction calling and save model fit plots and statistics
  chicane.results <- chicane(
    interactions = file.name, 
    interim.data.dir = 'graphs/T47D/model_fits'
    ); 

  # save full interaction calls table
  write.table(
    chicane.results, 
    file = 'data/chicane/T47D/interaction_calls.txt', 
    row.names = FALSE, 
    quote = FALSE, 
    sep = '\t'
    );

  # filter for significant only and save these results separately
  significant.results <- chicane.results[q.value < 0.05]

  write.table(
    significant.results, 
    file = 'data/chicane/T47D/interaction_calls_significant.txt', 
    row.names = FALSE, 
    quote = FALSE, 
    sep = '\t'
    );

}

After completion of interaction calling, the model fit plots and statistics in the interim.data.dir should be examined to ensure goodness of fit (detailed discussion of this in manuscript). Additionally, it is useful to examine the proportion of bait-to-bait, \(cis\) and \(trans\) interactions in the significant calls and the number of interaction calls by distance bins, which can be done with the following code:

if( bedtools.installed() ) {

  # calculate proportions by interaction type
  total <- nrow(significant.results);
  trans.prop <- sum(is.na(significant.results$distance))/total;
  cis.prop <- sum(!is.na(significant.results$distance))/total;
  b2b.prop <- sum(significant.results$bait.to.bait)/total;
  int.data <- c('trans' = trans.prop, 'cis' = cis.prop, 'bait.to.bait' = b2b.prop);
  print(int.data);

  # get number of interactions by distance bins
  binned.data <- NULL;

  distance.bins <- list(
    "0-10kbp" = c(0, 1e4),
    "10-100kbp" = c(1e4, 1e5),
    "100kbp-1Mbp" = c(1e5, 1e6),
    "1-10Mbp" = c(1e6, 1e7),
    "10-100Mbp" = c(1e7, 1e8),
    "100-1000Mbp" = c(1e8, 1e9)
    );

  for(dist.i in names(distance.bins)){
    bin.sum <- length(which(significant.results$distance >= distance.bins[[dist.i]][1] & significant.results$distance < distance.bins[[dist.i]][2]));
    binned.data <- cbind(binned.data, bin.sum);
    }

  colnames(binned.data) <- names(distance.bins);
  print(binned.data);

}

Once the interaction calls are determined to be reasonable, it is useful to examine the significant interactions in context. Uploading the files to the WashU Epigenome Browser is a quick and user-friendly way to do so. Use the CHiCANE helper function to easily convert to the correct format for the browser.

if( bedtools.installed() ) {

  # make a browser compatible file from significant interactions
  browser.file <- tempfile('significant_calls_standard.txt')
  convert.standard.format(
    chicane.results = 'data/chicane/T47D/interaction_calls_significant.txt',
    file.name = browser.file
    );

}

Viewing in the browser will help to elucidate interaction peaks for loci of interest, then a user can make a plot of interactions at a particular locus using another of CHiCANE’s helper functions. In the following example we plot interactions at the 2q35 locus.

if( bedtools.installed() ) {

  # set genome annotation file 
  gene.ann <- system.file('extdata', 'gencode_2q35.gtf', package = 'chicane');

  # generate 2q35 locus plot
  create.locus.plot(
    genome = 'hg38',
    chr = 'chr2',
    start = 216600000,
    end = 217140000,
    gene.data = gene.ann,
    genomic.features = 'data/captured_fragments/T47D/Baxter_captured_fragments.bed',
    feature.name = 'baits',
    interaction.data = 'data/chicane/T47D/interaction_calls.txt',
    file.name = 'graphs/T47D/2q35.pdf'
    );

}

Session Info

sessionInfo();
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-conda_cos6-linux-gnu (64-bit)
#> Running under: CentOS Linux 8 (Core)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /data/scratch/DBC/UBCN/BCRBIOIN/SHARED/software/conda_envs/R-3.6.0/lib/R/lib/libRblas.so
#> 
#> locale:
#> [1] C
#> 
#> attached base packages:
#> [1] parallel  splines   stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] chicane_0.1.8     data.table_1.14.0 gamlss.tr_5.1-7   gamlss_5.3-4     
#> [5] nlme_3.1-151      gamlss.data_6.0-1 gamlss.dist_5.3-2 MASS_7.3-53.1    
#> 
#> loaded via a namespace (and not attached):
#>  [1] bslib_0.2.4          compiler_3.6.0       formatR_1.11        
#>  [4] jquerylib_0.1.4      bedr_1.0.7           futile.logger_1.4.3 
#>  [7] iterators_1.0.13     R.methodsS3_1.8.1    futile.options_1.0.1
#> [10] R.utils_2.10.1       tools_3.6.0          testthat_3.0.2      
#> [13] digest_0.6.27        jsonlite_1.7.2       evaluate_0.14       
#> [16] lattice_0.20-41      rlang_0.4.11         foreach_1.5.1       
#> [19] Matrix_1.3-2         yaml_2.2.1           VennDiagram_1.6.20  
#> [22] xfun_0.22            stringr_1.4.0        knitr_1.34          
#> [25] sass_0.3.1           grid_3.6.0           R6_2.5.1            
#> [28] survival_3.2-10      rmarkdown_2.11       lambda.r_1.2.4      
#> [31] magrittr_2.0.1       codetools_0.2-18     htmltools_0.5.1.1   
#> [34] stringi_1.5.3        R.oo_1.24.0

Acknowledgements

Development of CHiCANE was supported by Breast Cancer Now.

References

Baxter, Joseph S, Olivia C Leavy, Nicola H Dryden, Sarah Maguire, Nichola Johnson, Vita Fedele, Nikiana Simigdala, et al. 2018. “Capture Hi-c Identifies Putative Target Genes at 33 Breast Cancer Risk Loci.” Nature Communications 9 (1): 1028.
Cairns, Jonathan, Paula Freire-Pritchett, Steven W Wingett, Csilla Várnai, Andrew Dimond, Vincent Plagnol, Daniel Zerbino, et al. 2016. “CHiCAGO: Robust Detection of DNA Looping Interactions in Capture Hi-c Data.” Genome Biology 17 (1): 127.
Dryden, Nicola H, Laura R Broome, Frank Dudbridge, Nichola Johnson, Nick Orr, Stefan Schoenfelder, Takashi Nagano, et al. 2014. “Unbiased Analysis of Potential Targets of Breast Cancer Susceptibility Loci by Capture Hi-c.” Genome Research 24 (11): 1854–68.
Hahne, Florian, and Robert Ivanek. 2016. “Statistical Genomics: Methods and Protocols.” In, edited by Ewy Mathé and Sean Davis, 335–51. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4939-3578-9_16.
Li, Daofeng, Silas Hsu, Deepak Purushotham, Renee L Sears, and Ting Wang. 2019. WashU Epigenome Browser update 2019.” Nucleic Acids Research 47 (W1): W158–65. https://doi.org/10.1093/nar/gkz348.
Stasinopoulos, Mikis, and Bob Rigby. 2016. Gamlss.tr: Generating and Fitting Truncated ‘Gamlss.family’ Distributions. https://CRAN.R-project.org/package=gamlss.tr.