macrosyntR

Sami El Hilali

2022/11/09

An R package for evaluation of pair-wise synteny conservation at the genome-wide scale. It takes a table of orthologs and genome annotation files formatted as BED to automatically infer significantly conserved linkage groups, and order them on an oxford grid.

Overview :

It has 7 functions :

Function description
load_orthologs() integrates genomic coordinates (bedfiles) of the orthologs of the species to compare
compute_macrosynteny() compares all the chromosomes to each other and identifies the significantly conserved linkage groups
reorder_macrosynteny() takes an orthologs_df (from load_orthologs()) and outputs an orthologs_df with chromosome levels reordered by cluster and amount of orthologs (run alone or by setting plot_oxford_grid(…,auto_order_clusters = TRUE) )
plot_macrosynteny() draws a dotplot displaying the significant linkage groups with their relative amount of orthologs
plot_oxford_grid() draws an Oxford grid from an orthologs_df (output of either load_orthologs() or reorder_macrosynteny()
reverse_species_order() returns an orthologs_df where sp1 became sp2 and the other way around
get_syntenic_genes() compute all blocks containing at least two consecutive genes (relative order) in both species and returns the details as a table
reorder_multiple_macrosyntenies() takes an orthologs_df and reorder like reorder_macrosynteny() but for more than two species
plot_chord_diagram() takes an orthologs_df with two or more species and draws the orthologs as lines in a chord diagram
compute_linkage_groups() takes an orthologs_df and computes the conserved linkage groups

Step-by-step tutorial :

We demonstrate the usage of the package using a subset of publicly available data (samples attached to the package).

1 - Pre-processing :

1.1 - Foreword :

This package doesn’t compute the orthologs. I recommend to compute pairwise orthology as reciprocal best hits using rbhxpress. It is fast and accurate as it uses diamond blast.

To use more than 2 species, I recommend using OrthoFinder (Emms and Kelly 2019) as it’s easy to derive single copy orthologs from it’s output.

Drawing the plots using this package require to have the following data :

  • A two (or more) columns table of orthologs (reciprocal best hits). Each gene must appear only once in the table.
  • genomic coordinates on each species for all the orthologs

let’s say I have the following orthologs :

sp1.gene.x1  sp2.gene.y1   
sp1.gene.X2  sp2.gene.y2   
...   
sp1.gene.xn  sp2.gene.yn   

then, the genomic coordinates files must look like (BED format) :

  • species1 :
chr4    200    600    sp1.gene.x1   
chr8     10    400    sp1.gene.x2   
...   
chr12   900    980    sp1.gene.xn   
  • species2 :
chr1    100    200    sp2.gene.y1   
chr6     50    200    sp2.gene.y2   
...   
chr8   300    480    sp2.gene.yn   

1.2 - Example using two species (rbhXpress) :

I’m going to show usage of this package by comparing the data of the lancelet Branchiostoma floridae (Simakov et al. 2020) with the data of the vestimentifera (giant tubeworm) Paraescarpia echinospica (Sun et al. 2021).

Download the sequences of proteins (fasta format) and their genomic coordinates :

  • B.floridae : The data are available on ncbi at https://www.ncbi.nlm.nih.gov/genome/?term=txid7739
    get the protein sequences at by clicking the “Download sequences in FASTA format for protein”.
    get the genomic coordinates by clicking “Download genome annotation in tabular format” and further click download as csv.

  • P.echinospica :
    The data are available on figshare under the doi : 10.6084/m9.figshare.15050478

Compute the reciprocal best hits of the fasta sequences. Using rbhXpress you can achieve it by typing the following in your terminal :

 # In the terminal :
 # call rbhXpress with using 6 threads :
 bash rbhXpress -a GCF_000003815.2_Bfl_VNyyK_protein.faa -b Pec_ragoo_v1.0.pep.fasta -o Bflo_vs_Pech.tab -t 6
 

To convert the genome annotation to the bed file format, I’m using the following command lines (if unfamiliar with this you can use a spreadsheet software). The concept is to keep the chrom, chromStart, chromEnd mandatory fields plus the name optional field that links the genomic region with the protein product :

 # In the terminal :
 # B.floridae CSV file to bed
tail -n +2 proteins_75_971579.csv | cut -d "," -f1,3,4,9  | \
sed -e 's/\"//g' -e 's/,/\t/g' -e 's/chromosome /BFL/g' > Bflo.bed

 # P.echinospica gff file to bed
fgrep "gene" Pec_genes.ragoo_v1.0.gff | cut -f1,4,5,9 | cut -d ";" -f 1 | \
fgrep "Superscaffold" | sed -e 's/ID=//g' -e 's/Superscaffold/PEC/g' > Pech.bed

Please note that the example dataset attached to this package is a subset. I kept only 2500 ortholog pairs to lower the compilation time.

1.3 - Example using 3 species (Orthofinder) :

If we wanted to add one species to the previous study, we would need to get the data and compute the single copy orthologs.
We will use here the data for the scallop Patinopecten yessoensis (Wang et al. 2017). These data were shared by the authors upon request and a sample is attached with the package.

The first step is to compute the orthologs using orthofinder (Emms and Kelly 2019).
When done, you can extract the Single copy orthologs using the following command line :


fgrep -f <path_to_your_orthofinder_run>/Orthogroups/Orthogroups_SingleCopyOrthologues.txt \
<path_to_your_orthofinder_run>/Orthogroups/Orthogroups.tsv > Single_copy_orthologs.tab

One important thing about orthofinder’s output is the encoding that will create issues when loaded in R.

If for the encoding you have like me :


file Single_copy_orthologs.tab
# ASCII text, with CRLF line terminators

These CRLF line terminators are a problem and you should replace it by regular “” as line terminator.
It can be achieved using the following command :


tr  '\015\012/' '\n' < Single_copy_orthologs.tab | awk '($0 != "") {print}' > Single_copy_orthologs.tsv

2 - Load the data :

Now the data are ready to be loaded into R using macrosyntR :

library(macrosyntR)

my_orthologs_table <- load_orthologs(orthologs_table = system.file("extdata","Bflo_vs_Pech.tab",package="macrosyntR"),
                                     bedfiles = c(system.file("extdata","Bflo.bed",package="macrosyntR"),
                                     system.file("extdata","Pech.bed",package="macrosyntR")))

head(my_orthologs_table)
#>              sp2.ID         sp1.ID sp1.Chr sp1.Start  sp1.End sp1.Index sp2.Chr
#> 1 PE_Scaf10003_0.10 XP_035673525.1    BFL4  10090273 10092836        58    PEC6
#> 2  PE_Scaf10003_0.9 XP_035673513.1    BFL4  10410588 10419275        59    PEC6
#> 3  PE_Scaf10005_7.3 XP_035697237.1   BFL14  11770697 11784649        95    PEC7
#> 4  PE_Scaf10008_0.9 XP_035685062.1    BFL8  12933159 12935194        96   PEC10
#> 5  PE_Scaf10013_0.3 XP_035689348.1   BFL10  16401513 16412200        97    PEC1
#> 6 PE_Scaf10029_0.13 XP_035678904.1    BFL6   2279527  2280933        20    PEC7
#>   sp2.Start  sp2.End sp2.Index
#> 1  12027052 12028934        34
#> 2  12056157 12076766        35
#> 3  66094757 66132196       389
#> 4  58579881 58586985       243
#> 5  71030249 71051389       265
#> 6  53587028 53595680       306

or alternatively :


my_orthologs_with_3_sp <- load_orthologs(orthologs_table = system.file("extdata","Single_copy_orthologs.tsv",package="macrosyntR"),
                                     bedfiles = c(system.file("extdata","Bflo.bed",package="macrosyntR"),
                                                  system.file("extdata","Pech.bed",package="macrosyntR"),
                                                  system.file("extdata","Pyes.bed",package="macrosyntR")))

head(my_orthologs_with_3_sp)
#>      sp3.ID           sp2.ID         sp1.ID sp1.Chr sp1.Start  sp1.End
#> 1 PY_T03519 PE_Scaf12272_5.7 XP_035683654.1    BFL8   4420511  4433370
#> 2 PY_T03527  PE_Scaf1648_0.2 XP_035695060.1    BFL1  20618380 20653810
#> 3 PY_T03529  PE_Scaf8980_9.6 XP_035685159.1    BFL8  12537504 12550583
#> 4 PY_T03538  PE_Scaf3040_8.1 XP_035683392.1    BFL8   9015804  9046881
#> 5 PY_T03547  PE_Scaf3304_4.6 XP_035684798.1    BFL8  11658472 11714434
#> 6 PY_T03548  PE_Scaf8642_8.3 XP_035683725.1    BFL8  11972351 11999594
#>   sp1.Index sp2.Chr sp2.Start  sp2.End sp2.Index sp3.Chr sp3.Start sp3.End
#> 1        16   PEC10  49429739 49461457       197    chr1    323789  338491
#> 2       315   PEC10  14476621 14515269        41    chr1    628200  721503
#> 3        92   PEC10  57785129 57816235       233    chr1    750556  791349
#> 4        59   PEC10  52592843 52611634       216    chr1   1057964 1137351
#> 5        83   PEC10  43961347 44009459       169    chr1   1509829 1562935
#> 6        88   PEC10  62411366 62440096       262    chr1   1616962 1641735
#>   sp3.Index
#> 1         1
#> 2         2
#> 3         3
#> 4         4
#> 5         5
#> 6         6

3 - Compute linkage groups :

Let’s vizualize the pairs of chromosomes that have a significant amount of orthologs using compute_macrosynteny(). We can visualize the results on a dot plot using plot_macrosnyteny() and see the distributions of orthologs on an oxford grid using plot_oxford_grid()


# compute significance :
macrosynteny_df <- compute_macrosynteny(my_orthologs_table)
head(macrosynteny_df)
#>   sp1.Chr sp2.Chr orthologs         pval significant    pval_adj
#> 1    BFL1    PEC1         7 1.000000e+00          no 1.000000000
#> 2    BFL1   PEC10        53 3.830506e-05          no 0.008197283
#> 3    BFL1   PEC11         2 1.000000e+00          no 1.000000000
#> 4    BFL1   PEC12        11 9.989689e-01          no 1.000000000
#> 5    BFL1   PEC13         5 9.999991e-01          no 1.000000000
#> 6    BFL1    PEC2         6 1.000000e+00          no 1.000000000

# visualize the loaded data on a oxford grid :
plot_oxford_grid(my_orthologs_table,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

# Visualize the results of the test of significance :
plot_macrosynteny(macrosynteny_df,
                  sp1_label = "B.floridae",
                  sp2_label = "P.echinospica")

To get these (sub)chromosomal associations of two or more species, we can use the function compute_linkage_groups()

my_linkage_groups <- compute_linkage_groups(my_orthologs_with_3_sp)
head(my_linkage_groups)
#> # A tibble: 6 × 5
#>   sp2.Chr sp3.Chr sp1.Chr     n LGs  
#>   <fct>   <fct>   <fct>   <int> <chr>
#> 1 PEC8    chr5    BFL1      244 a    
#> 2 PEC7    chr8    BFL6      194 b    
#> 3 PEC5    chr2    BFL13     169 c    
#> 4 PEC1    chr6    BFL5      152 d    
#> 5 PEC11   chr9    BFL2      135 e    
#> 6 PEC13   chr7    BFL7      135 f

4 - Reorder chromosome levels to group the linkage groups in clusters :

4.1 - Automatic reordering using a network-based greedy algorithm :

Reordering the chromosomes using a network based greedy algorithm can be performed by calling the function reorder_macrosynteny. It returns an orthologs_df with reordered levels in sp1.Chr and sp2.Chr. These columns are factors where the levels determine the plotting order. You’ll see the results of the clustering, when drawing the oxford grid of this newly generated orthologs data.frame


# visualize the loaded data on a oxford grid :
my_orthologs_table_reordered <- reorder_macrosynteny(my_orthologs_table)
plot_oxford_grid(my_orthologs_table_reordered,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

# compute significance and visualize on a dotplot :
macrosynteny_df_reordered <- compute_macrosynteny(my_orthologs_table_reordered)
plot_macrosynteny(macrosynteny_df_reordered,
                  sp1_label = "B.floridae",
                  sp2_label = "P.echinospica")

4.2 - Manually reorder/subset the Chromosomes :

If you would like to subset some chromosomes of interest and manually reorder them you can still take advantage of functions implemented to handle data.frames. This task is out of the scope of this package, and can achieved using base R :

# select only the orthologs falling in the chromosomes of interest and plot: 
subset_of_orthologs <- subset(my_orthologs_table, sp1.Chr %in% c("BFL13","BFL15","BFL2","BFL3") & sp2.Chr %in% c("PEC2","PEC5","PEC11"))

plot_oxford_grid(subset_of_orthologs,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

# reorder :
subset_of_orthologs$sp2.Chr <- factor(subset_of_orthologs$sp2.Chr,levels = c("PEC5","PEC11","PEC2"))
plot_oxford_grid(subset_of_orthologs,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

# Compute and plot macrosynteny :
macrosynteny_of_subset <- compute_macrosynteny(subset_of_orthologs)
plot_macrosynteny(macrosynteny_of_subset,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

5 - Plot directly with reordering the linkage groups and coloring :

The reordering can be performed on the row when calling plot_oxford_grid() by setting the reorder argument to TRUE.


# visualize the loaded data on a oxford grid  with reordering and coloring by cluster :
plot_oxford_grid(my_orthologs_table,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica",
                 reorder = TRUE,
                 color_by = "clust")

# redo and color by sp1.Chr instead :
plot_oxford_grid(my_orthologs_table,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica",
                 reorder = TRUE,
                 color_by = "sp1.Chr")

6 - Customize the plots :

Plots returned from both plot_oxford_grid() and plot_macrosynteny() are ggplot objects and it is possible to further customizing them using the vocabulary from the ggplot2 package.

6.1 - customize the legend :

As the functions plot_oxford_grid(), plot_macrosynteny() and plot_chord_diagram() return a ggplot2 object, you can customize the plots using ggplot2::theme() function. For example, if you want to display the legend on the right of the plot you can do :


library(ggplot2)

# legend on right (works also with "top" and "left") :
plot_macrosynteny(macrosynteny_df_reordered) +
  theme(legend.position = "right")

6.2 - change the color palette :

plot_oxford_grid() features an option color_palette, but it is also possible to set it using scale_color_manual() from ggplot2 :


# Check how many colors are necessary :
print(length(unique(my_orthologs_table_reordered$sp2.Chr)))
#> [1] 14

# change color_palette using plot_oxford_grid option color_palette :
color_palette_Pechinospica_chromosomes <- c("#A52A2A", "#FFD39B", "#66CDAA", "#8EE5EE", "#7FFF00", "#FFD700", "#FF7F00", "#474747", "#6495ED", "#FF3030", "#0000EE", "#FF1493", "#8A2BE2", "#080808")


plot_oxford_grid(my_orthologs_table_reordered,
                 color_by = "sp2.Chr",
                 color_palette = color_palette_Pechinospica_chromosomes)


# change the colors in plot_macrosynteny using ggplot2 functions :

plot_macrosynteny(macrosynteny_df_reordered) +
scale_color_manual(values = c("gray45","darkgreen")) +
  theme(legend.position = "right")
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

7 - Integrate additional information using colors :

The data returned by load_orthologs(), reorder_macrosynteny(), and compute_macrosynteny() are data.frame objects. I can thus add external information based on protein IDs by merging it with another dataframe, and then take advantage of using colors to display the new information on the plot :


library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> Les objets suivants sont masqués depuis 'package:stats':
#> 
#>     filter, lag
#> Les objets suivants sont masqués depuis 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Let's color only the orthologs that were previously selected in the part 3.2 :
my_orthologs_table_modified <- my_orthologs_table_reordered %>%
  mutate(selected = "no") %>%
  mutate(selected = replace(selected,sp1.ID %in% subset_of_orthologs$sp1.ID,"yes"))

plot_oxford_grid(my_orthologs_table_modified,
                 color_by = "selected",
                 color_palette = c("black","firebrick"))


# set the argument shade_non_significant to FALSE to have colors on all the genes of interest :

plot_oxford_grid(my_orthologs_table_modified,
                 color_by = "selected",
                 shade_non_significant = FALSE,
                 color_palette = c("black","firebrick"))

8 - plot a chord diagram :

To plot the conservation of macrosynteny accross two or more species (here 3) on a chord diagram and coloring according to linkage groups we use the function plot_chord_diagram() :


plot_chord_diagram(my_orthologs_with_3_sp,
                   species_labels = c("B. flo","P. ech", "P. yes"),
                   color_by = "LGs") +
  theme(legend.position = "none")

We can change the chromosome names to improve the readability of the plot using stringr::str_replace() and factor levels as following :


# Change the chromosome names to keep only numbers
levels(my_orthologs_with_3_sp$sp1.Chr) <- stringr::str_replace(levels(my_orthologs_with_3_sp$sp1.Chr),"BFL","")
levels(my_orthologs_with_3_sp$sp2.Chr) <- stringr::str_replace(levels(my_orthologs_with_3_sp$sp2.Chr),"PEC","")
levels(my_orthologs_with_3_sp$sp3.Chr) <- stringr::str_replace(levels(my_orthologs_with_3_sp$sp3.Chr),"chr","")


plot_chord_diagram(my_orthologs_with_3_sp,
                   species_labels = c("B. flo","P. ech", "P. yes"),
                   color_by = "LGs") +
  theme(legend.position = "none")