An R package for evaluation of pair-wise synteny conservation at the genome-wide scale. It takes a table of orthologs and genome annotation files formatted as BED to automatically infer significantly conserved linkage groups, and order them on an oxford grid.
It has 7 functions :
Function | description |
---|---|
load_orthologs() | integrates genomic coordinates (bedfiles) of the orthologs of the species to compare |
compute_macrosynteny() | compares all the chromosomes to each other and identifies the significantly conserved linkage groups |
reorder_macrosynteny() | takes an orthologs_df (from load_orthologs()) and outputs an orthologs_df with chromosome levels reordered by cluster and amount of orthologs (run alone or by setting plot_oxford_grid(…,auto_order_clusters = TRUE) ) |
plot_macrosynteny() | draws a dotplot displaying the significant linkage groups with their relative amount of orthologs |
plot_oxford_grid() | draws an Oxford grid from an orthologs_df (output of either load_orthologs() or reorder_macrosynteny() |
reverse_species_order() | returns an orthologs_df where sp1 became sp2 and the other way around |
get_syntenic_genes() | compute all blocks containing at least two consecutive genes (relative order) in both species and returns the details as a table |
reorder_multiple_macrosyntenies() | takes an orthologs_df and reorder like reorder_macrosynteny() but for more than two species |
plot_chord_diagram() | takes an orthologs_df with two or more species and draws the orthologs as lines in a chord diagram |
compute_linkage_groups() | takes an orthologs_df and computes the conserved linkage groups |
We demonstrate the usage of the package using a subset of publicly available data (samples attached to the package).
Branchiostoma floridae (Simakov et al. 2020)
Patinopecten yessoensis (Wang et al. 2017)
Paraescarpia echinospica (Sun et al. 2021)
This package doesn’t compute the orthologs. I recommend to compute pairwise orthology as reciprocal best hits using rbhxpress. It is fast and accurate as it uses diamond blast.
To use more than 2 species, I recommend using OrthoFinder (Emms and Kelly 2019) as it’s easy to derive single copy orthologs from it’s output.
Drawing the plots using this package require to have the following data :
let’s say I have the following orthologs :
sp1.gene.x1 sp2.gene.y1
sp1.gene.X2 sp2.gene.y2
...
sp1.gene.xn sp2.gene.yn
then, the genomic coordinates files must look like (BED format) :
chr4 200 600 sp1.gene.x1
chr8 10 400 sp1.gene.x2
...
chr12 900 980 sp1.gene.xn
chr1 100 200 sp2.gene.y1
chr6 50 200 sp2.gene.y2
...
chr8 300 480 sp2.gene.yn
I’m going to show usage of this package by comparing the data of the lancelet Branchiostoma floridae (Simakov et al. 2020) with the data of the vestimentifera (giant tubeworm) Paraescarpia echinospica (Sun et al. 2021).
Download the sequences of proteins (fasta format) and their genomic coordinates :
B.floridae : The data are available on ncbi at https://www.ncbi.nlm.nih.gov/genome/?term=txid7739
get the protein sequences at by clicking the “Download sequences in
FASTA format for protein”.
get the genomic coordinates by clicking “Download genome annotation in
tabular format” and further click download as csv.
P.echinospica :
The data are available on figshare under the doi : 10.6084/m9.figshare.15050478
Compute the reciprocal best hits of the fasta sequences. Using rbhXpress you can achieve it by typing the following in your terminal :
# In the terminal :
# call rbhXpress with using 6 threads :
bash rbhXpress -a GCF_000003815.2_Bfl_VNyyK_protein.faa -b Pec_ragoo_v1.0.pep.fasta -o Bflo_vs_Pech.tab -t 6
To convert the genome annotation to the bed file format, I’m using the following command lines (if unfamiliar with this you can use a spreadsheet software). The concept is to keep the chrom, chromStart, chromEnd mandatory fields plus the name optional field that links the genomic region with the protein product :
# In the terminal :
# B.floridae CSV file to bed
tail -n +2 proteins_75_971579.csv | cut -d "," -f1,3,4,9 | \
sed -e 's/\"//g' -e 's/,/\t/g' -e 's/chromosome /BFL/g' > Bflo.bed
# P.echinospica gff file to bed
fgrep "gene" Pec_genes.ragoo_v1.0.gff | cut -f1,4,5,9 | cut -d ";" -f 1 | \
fgrep "Superscaffold" | sed -e 's/ID=//g' -e 's/Superscaffold/PEC/g' > Pech.bed
Please note that the example dataset attached to this package is a subset. I kept only 2500 ortholog pairs to lower the compilation time.
If we wanted to add one species to the previous study, we would need
to get the data and compute the single copy orthologs.
We will use here the data for the scallop Patinopecten yessoensis (Wang et al. 2017).
These data were shared by the authors upon request and a sample is
attached with the package.
The first step is to compute the orthologs using orthofinder (Emms
and Kelly 2019).
When done, you can extract the Single copy orthologs using the following
command line :
fgrep -f <path_to_your_orthofinder_run>/Orthogroups/Orthogroups_SingleCopyOrthologues.txt \
<path_to_your_orthofinder_run>/Orthogroups/Orthogroups.tsv > Single_copy_orthologs.tab
One important thing about orthofinder’s output is the encoding that will create issues when loaded in R.
If for the encoding you have like me :
file Single_copy_orthologs.tab
# ASCII text, with CRLF line terminators
These CRLF line terminators are a problem and you should replace it
by regular “” as line terminator.
It can be achieved using the following command :
tr '\015\012/' '\n' < Single_copy_orthologs.tab | awk '($0 != "") {print}' > Single_copy_orthologs.tsv
Now the data are ready to be loaded into R using macrosyntR :
library(macrosyntR)
<- load_orthologs(orthologs_table = system.file("extdata","Bflo_vs_Pech.tab",package="macrosyntR"),
my_orthologs_table bedfiles = c(system.file("extdata","Bflo.bed",package="macrosyntR"),
system.file("extdata","Pech.bed",package="macrosyntR")))
head(my_orthologs_table)
#> sp2.ID sp1.ID sp1.Chr sp1.Start sp1.End sp1.Index sp2.Chr
#> 1 PE_Scaf10003_0.10 XP_035673525.1 BFL4 10090273 10092836 58 PEC6
#> 2 PE_Scaf10003_0.9 XP_035673513.1 BFL4 10410588 10419275 59 PEC6
#> 3 PE_Scaf10005_7.3 XP_035697237.1 BFL14 11770697 11784649 95 PEC7
#> 4 PE_Scaf10008_0.9 XP_035685062.1 BFL8 12933159 12935194 96 PEC10
#> 5 PE_Scaf10013_0.3 XP_035689348.1 BFL10 16401513 16412200 97 PEC1
#> 6 PE_Scaf10029_0.13 XP_035678904.1 BFL6 2279527 2280933 20 PEC7
#> sp2.Start sp2.End sp2.Index
#> 1 12027052 12028934 34
#> 2 12056157 12076766 35
#> 3 66094757 66132196 389
#> 4 58579881 58586985 243
#> 5 71030249 71051389 265
#> 6 53587028 53595680 306
or alternatively :
<- load_orthologs(orthologs_table = system.file("extdata","Single_copy_orthologs.tsv",package="macrosyntR"),
my_orthologs_with_3_sp bedfiles = c(system.file("extdata","Bflo.bed",package="macrosyntR"),
system.file("extdata","Pech.bed",package="macrosyntR"),
system.file("extdata","Pyes.bed",package="macrosyntR")))
head(my_orthologs_with_3_sp)
#> sp3.ID sp2.ID sp1.ID sp1.Chr sp1.Start sp1.End
#> 1 PY_T03519 PE_Scaf12272_5.7 XP_035683654.1 BFL8 4420511 4433370
#> 2 PY_T03527 PE_Scaf1648_0.2 XP_035695060.1 BFL1 20618380 20653810
#> 3 PY_T03529 PE_Scaf8980_9.6 XP_035685159.1 BFL8 12537504 12550583
#> 4 PY_T03538 PE_Scaf3040_8.1 XP_035683392.1 BFL8 9015804 9046881
#> 5 PY_T03547 PE_Scaf3304_4.6 XP_035684798.1 BFL8 11658472 11714434
#> 6 PY_T03548 PE_Scaf8642_8.3 XP_035683725.1 BFL8 11972351 11999594
#> sp1.Index sp2.Chr sp2.Start sp2.End sp2.Index sp3.Chr sp3.Start sp3.End
#> 1 16 PEC10 49429739 49461457 197 chr1 323789 338491
#> 2 315 PEC10 14476621 14515269 41 chr1 628200 721503
#> 3 92 PEC10 57785129 57816235 233 chr1 750556 791349
#> 4 59 PEC10 52592843 52611634 216 chr1 1057964 1137351
#> 5 83 PEC10 43961347 44009459 169 chr1 1509829 1562935
#> 6 88 PEC10 62411366 62440096 262 chr1 1616962 1641735
#> sp3.Index
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
Let’s vizualize the pairs of chromosomes that have a significant amount of orthologs using compute_macrosynteny(). We can visualize the results on a dot plot using plot_macrosnyteny() and see the distributions of orthologs on an oxford grid using plot_oxford_grid()
# compute significance :
<- compute_macrosynteny(my_orthologs_table)
macrosynteny_df head(macrosynteny_df)
#> sp1.Chr sp2.Chr orthologs pval significant pval_adj
#> 1 BFL1 PEC1 7 1.000000e+00 no 1.000000000
#> 2 BFL1 PEC10 53 3.830506e-05 no 0.008197283
#> 3 BFL1 PEC11 2 1.000000e+00 no 1.000000000
#> 4 BFL1 PEC12 11 9.989689e-01 no 1.000000000
#> 5 BFL1 PEC13 5 9.999991e-01 no 1.000000000
#> 6 BFL1 PEC2 6 1.000000e+00 no 1.000000000
# visualize the loaded data on a oxford grid :
plot_oxford_grid(my_orthologs_table,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
# Visualize the results of the test of significance :
plot_macrosynteny(macrosynteny_df,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
To get these (sub)chromosomal associations of two or more species, we can use the function compute_linkage_groups()
<- compute_linkage_groups(my_orthologs_with_3_sp)
my_linkage_groups head(my_linkage_groups)
#> # A tibble: 6 × 5
#> sp2.Chr sp3.Chr sp1.Chr n LGs
#> <fct> <fct> <fct> <int> <chr>
#> 1 PEC8 chr5 BFL1 244 a
#> 2 PEC7 chr8 BFL6 194 b
#> 3 PEC5 chr2 BFL13 169 c
#> 4 PEC1 chr6 BFL5 152 d
#> 5 PEC11 chr9 BFL2 135 e
#> 6 PEC13 chr7 BFL7 135 f
Reordering the chromosomes using a network based greedy algorithm can be performed by calling the function reorder_macrosynteny. It returns an orthologs_df with reordered levels in sp1.Chr and sp2.Chr. These columns are factors where the levels determine the plotting order. You’ll see the results of the clustering, when drawing the oxford grid of this newly generated orthologs data.frame
# visualize the loaded data on a oxford grid :
<- reorder_macrosynteny(my_orthologs_table)
my_orthologs_table_reordered plot_oxford_grid(my_orthologs_table_reordered,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
# compute significance and visualize on a dotplot :
<- compute_macrosynteny(my_orthologs_table_reordered)
macrosynteny_df_reordered plot_macrosynteny(macrosynteny_df_reordered,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
If you would like to subset some chromosomes of interest and manually reorder them you can still take advantage of functions implemented to handle data.frames. This task is out of the scope of this package, and can achieved using base R :
# select only the orthologs falling in the chromosomes of interest and plot:
<- subset(my_orthologs_table, sp1.Chr %in% c("BFL13","BFL15","BFL2","BFL3") & sp2.Chr %in% c("PEC2","PEC5","PEC11"))
subset_of_orthologs
plot_oxford_grid(subset_of_orthologs,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
# reorder :
$sp2.Chr <- factor(subset_of_orthologs$sp2.Chr,levels = c("PEC5","PEC11","PEC2"))
subset_of_orthologsplot_oxford_grid(subset_of_orthologs,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
# Compute and plot macrosynteny :
<- compute_macrosynteny(subset_of_orthologs)
macrosynteny_of_subset plot_macrosynteny(macrosynteny_of_subset,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
The reordering can be performed on the row when calling plot_oxford_grid() by setting the reorder argument to TRUE.
# visualize the loaded data on a oxford grid with reordering and coloring by cluster :
plot_oxford_grid(my_orthologs_table,
sp1_label = "B.floridae",
sp2_label = "P.echinospica",
reorder = TRUE,
color_by = "clust")
# redo and color by sp1.Chr instead :
plot_oxford_grid(my_orthologs_table,
sp1_label = "B.floridae",
sp2_label = "P.echinospica",
reorder = TRUE,
color_by = "sp1.Chr")
Plots returned from both plot_oxford_grid() and plot_macrosynteny() are ggplot objects and it is possible to further customizing them using the vocabulary from the ggplot2 package.
As the functions plot_oxford_grid(), plot_macrosynteny() and plot_chord_diagram() return a ggplot2 object, you can customize the plots using ggplot2::theme() function. For example, if you want to display the legend on the right of the plot you can do :
library(ggplot2)
# legend on right (works also with "top" and "left") :
plot_macrosynteny(macrosynteny_df_reordered) +
theme(legend.position = "right")
plot_oxford_grid() features an option color_palette, but it is also possible to set it using scale_color_manual() from ggplot2 :
# Check how many colors are necessary :
print(length(unique(my_orthologs_table_reordered$sp2.Chr)))
#> [1] 14
# change color_palette using plot_oxford_grid option color_palette :
<- c("#A52A2A", "#FFD39B", "#66CDAA", "#8EE5EE", "#7FFF00", "#FFD700", "#FF7F00", "#474747", "#6495ED", "#FF3030", "#0000EE", "#FF1493", "#8A2BE2", "#080808")
color_palette_Pechinospica_chromosomes
plot_oxford_grid(my_orthologs_table_reordered,
color_by = "sp2.Chr",
color_palette = color_palette_Pechinospica_chromosomes)
# change the colors in plot_macrosynteny using ggplot2 functions :
plot_macrosynteny(macrosynteny_df_reordered) +
scale_color_manual(values = c("gray45","darkgreen")) +
theme(legend.position = "right")
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
The data returned by load_orthologs(), reorder_macrosynteny(), and compute_macrosynteny() are data.frame objects. I can thus add external information based on protein IDs by merging it with another dataframe, and then take advantage of using colors to display the new information on the plot :
library(dplyr)
#>
#> Attachement du package : 'dplyr'
#> Les objets suivants sont masqués depuis 'package:stats':
#>
#> filter, lag
#> Les objets suivants sont masqués depuis 'package:base':
#>
#> intersect, setdiff, setequal, union
# Let's color only the orthologs that were previously selected in the part 3.2 :
<- my_orthologs_table_reordered %>%
my_orthologs_table_modified mutate(selected = "no") %>%
mutate(selected = replace(selected,sp1.ID %in% subset_of_orthologs$sp1.ID,"yes"))
plot_oxford_grid(my_orthologs_table_modified,
color_by = "selected",
color_palette = c("black","firebrick"))
# set the argument shade_non_significant to FALSE to have colors on all the genes of interest :
plot_oxford_grid(my_orthologs_table_modified,
color_by = "selected",
shade_non_significant = FALSE,
color_palette = c("black","firebrick"))
To plot the conservation of macrosynteny accross two or more species (here 3) on a chord diagram and coloring according to linkage groups we use the function plot_chord_diagram() :
plot_chord_diagram(my_orthologs_with_3_sp,
species_labels = c("B. flo","P. ech", "P. yes"),
color_by = "LGs") +
theme(legend.position = "none")
We can change the chromosome names to improve the readability of the plot using stringr::str_replace() and factor levels as following :
# Change the chromosome names to keep only numbers
levels(my_orthologs_with_3_sp$sp1.Chr) <- stringr::str_replace(levels(my_orthologs_with_3_sp$sp1.Chr),"BFL","")
levels(my_orthologs_with_3_sp$sp2.Chr) <- stringr::str_replace(levels(my_orthologs_with_3_sp$sp2.Chr),"PEC","")
levels(my_orthologs_with_3_sp$sp3.Chr) <- stringr::str_replace(levels(my_orthologs_with_3_sp$sp3.Chr),"chr","")
plot_chord_diagram(my_orthologs_with_3_sp,
species_labels = c("B. flo","P. ech", "P. yes"),
color_by = "LGs") +
theme(legend.position = "none")