Authors: Sophie Le Bars, Mohamed Soudy, and Enrico Glaab
This vignette provides a guide for using the XYomics package to analyze sex-related patterns in single-cell RNA-sequencing (scRNA-seq) data. We will walk through a complete workflow using simulated data demonstrating how to identify and interpret sex-specific gene expression changes at the cell-type level. To ensure fast execution, all heavy computations are precomputed and loaded from the package.
The tutorial covers: 1. load a simulated scRNA-seq dataset. 2. Standard preprocessing using the Seurat package. 3. Performing differential expression analysis using different strategies: sex-stratified analysis and interaction analysis. 4. Detecting and categorizing sex-specific DEGs for each cell type. 5. Visualizing gene expression using dot plots. 6. Performing pathway enrichment analysis to uncover biological functions. 7. Constructing and visualizing protein-protein networks.
We start by load a simulated scRNA-seq dataset containing 10 simulated samples. The code for generating this simulated dataset can be found in the Gitlab in the data_vignette folder.
library(Seurat)
library(XYomics)
library(dplyr)
sim.seurat <- readRDS(
system.file("extdata", "sc_toy_seurat.rds", package = "XYomics")
)
We performed standard scRNA-seq preprocessing, including normalization, feature selection, and scaling. We then apply dimensionality reduction (PCA and UMAP) and clustering to identify cell populations.
# Precomputed during data generation
# sim.seurat <- NormalizeData(sim.seurat)
# sim.seurat <- FindVariableFeatures(sim.seurat)
# sim.seurat <- ScaleData(sim.seurat)
# sim.seurat <- RunPCA(sim.seurat)
# sim.seurat <- FindNeighbors(sim.seurat)
# sim.seurat <- FindClusters(sim.seurat)
# sim.seurat <- RunUMAP(sim.seurat)
DimPlot(sim.seurat)
For the analysis, the Seurat object must contain metadata columns for cell type (cell_type), sex (sex), and condition (status). Here, we added mock annotations to our simulated data.
table(sim.seurat$cell_type)
##
## cell type 1 cell type 2 cell type 3 cell type 4 cell type 5
## 202 206 184 211 197
table(sim.seurat$sex)
##
## F M
## 500 500
table(sim.seurat$status)
##
## TG WT
## 600 400
When analyzing sex differences, researchers must be aware of the pitfalls of associated statistical analyses, including the limitations of sex-stratified analyses and the challenges of analyzing interactions between sex and disease state.
Sex-stratified analyses use standard statistical tests for differential molecular abundance analysis to test for disease-associated changes in each sex separately. However, a pure sex-stratified analysis may misclassify a change as sex-specific if it uses a standard significance threshold to assess both the presence and absence of an effect. Stochastic variation in significance scores around a chosen threshold may lead to the erroneous detection of significance specific to only one sex, especially if the p-value in the other sex marginally exceeds the chosen threshold. In addition, such an analysis may miss sex-modulated changes, where significant changes in both sexes share the same direction but differ significantly in magnitude; these changes require cross-sex comparisons for accurate detection.
Interaction analysis formally tests whether the relationship between disease and molecular changes differs significantly between males and females. Not only can such interaction terms reveal complexities in disease mechanisms that might otherwise be obscured in analyses that do not consider SABV, but they also have the potential to detect changes that are limited to the magnitude of an effect, an aspect that sex-stratified analyses do not capture. Nevertheless, robust estimation of interaction effects requires large sample sizes, which are often not available due to the costs associated with advanced molecular profiling techniques such as single-cell RNA sequencing.
The XYomics package provides functions for both types of analyses.
We use sex_stratified_analysis_sc() to identify DEGs between conditions separately for males and females within each cell type. We use sex_stratified_pipeline_sc() to obtain the categorized DEGs for each cell type in our seurat object.
# Run for all cell types
# if you want to have the female and male degs
results_degs <- sex_stratified_analysis_sc(sim.seurat)
# The differential expression method can be specified using the `test.use` parameter.
# Available options include "MAST", "bimod", "poisson", "LR", and any other
# method supported by Seurat::FindMarkers().
#
# The `min_samples` parameter defines the minimum number of cells required
# to perform the analysis. The default value is 3.
#
# Internal QC results generated by XYomics can be accessed via:
# results_degs$validation
#
# This QC output includes assessment of sex and phenotype balance, per-group cell
# counts, detection of under-represented groups below the minimum cell threshold,
# and overall design imbalance warnings, as well as dataset sparsity metrics.
# if you want the categorized DEGs by cell type more lenient than the default thesholds as it is simulated data
results <- sex_stratified_pipeline_sc(sim.seurat, min_abs_logfc = 0.1, alpha = 0.1)
# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also run on one or multiple cell types, for example:
# target_cell_type_flag = "cell type 1", "cell type 2"
results_degs <- readRDS(
system.file("extdata", "sc_results_degs_small.rds", package = "XYomics")
)
results <- readRDS(
system.file("extdata", "sc_results_small.rds", package = "XYomics")
)
head(results$`cell type 1`)
## DEG_Type Gene_Symbols Male_avg_logFC Male_FDR
## male-specific1 male-specific EXOSC1 2.8642113 6.875386e-09
## male-specific2 male-specific NINJ1 2.0378847 1.905962e-06
## male-specific3 male-specific NT5C 2.1459254 3.396873e-04
## male-specific4 male-specific RAB22A 1.8442810 1.534290e-03
## male-specific5 male-specific NIF3L1 1.2654157 4.691819e-02
## male-specific6 male-specific NLRP3 0.9819638 7.825030e-02
## Female_avg_logFC Female_FDR
## male-specific1 -0.2228460 1
## male-specific2 0.1816786 1
## male-specific3 0.2135438 1
## male-specific4 0.3144854 1
## male-specific5 -0.0471756 1
## male-specific6 0.2874025 1
We use sex_interaction_pipeline_sc() to perform a formal interaction analysis, directly comparing the condition effect between sexes for all cell types.
# Example for one cell type (e.g., "cell type 1")
interaction_results <- sex_interaction_pipeline_sc(sim.seurat)
# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also specify one or multiple cell types, for example:
# target_cell_type_flag = c("cell type 1", "cell type 2")
#
# The `min_samples` parameter defines the minimum number of cells required
# to perform the interaction analysis. The default value is 20.
interaction_results <- readRDS(
system.file("extdata", "sc_interaction_results_small.rds", package = "XYomics")
)
head(interaction_results$`cell type 1`$summary_stats)# signficant sex-modulated DEGs: interaction_results$`cell type 1`$sig_results
## cell_type n_total_genes n_sig_genes
## 1 cell type 1 100 29
Dot plots are an effective way to visualize gene expression in scRNA-seq data. Here, we show the expression of the top male-specific, female-specific, sex-neutral and sex-dimorphic genes across all cell types, split by sex and condition.
# Get top gene from each category for 'cell type 1'
top_male_gene <- results$`cell type 1` %>% filter(DEG_Type == "male-specific") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_female_gene <- results$`cell type 1` %>% filter(DEG_Type == "female-specific") %>% top_n(1, -Female_FDR) %>% pull(Gene_Symbols)
top_dimorphic_gene <- results$`cell type 1` %>% filter(DEG_Type == "sex-dimorphic") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_neutral_gene <- results$`cell type 1` %>% filter(DEG_Type == "sex-neutral") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_genes <- c(top_male_gene, top_female_gene, top_dimorphic_gene, top_neutral_gene)
generate_dotplot_sc(sim.seurat, top_genes, target_cell_type_flag = "cell type 1")
# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also specify one or multiple cell types, for example:
# target_cell_type_flag = c("cell type 1", "cell type 2")
We perform pathway enrichment analysis to identify biological pathways over-represented in our DEG categories. The categorized_enrich() function automates this for all categories.
# Run for all cell types
pathway_category <- lapply(
results,
function(cell) categorized_enrich(cell, enrichment_db = "GO")
)
# The `enrichment_db` parameter can be set to "GO", "KEGG", "REACTOME".
# If the user wants to use a custom database, a TERM2GENE data frame must be
# provided via the `custom_db` parameter.
#
# Gene identifiers can be either any Gene identifier type supported by OrgDb
#' (e.g., "SYMBOL", "ENTREZID", "ENSEMBL", "UNIPROT")
# using the `gene_type` parameter.
#
# The `return_df` parameter controls the output format:
# if set to TRUE, results are returned as data frames;
# by default (FALSE), enrichResult objects are returned.
# Run for one specific cell type
# pathway_category_one <- categorized_enrich(results$`cell type 1`, return_df =T, enrichment_db = "GO")
pathway_category <- readRDS(
system.file("extdata", "sc_pathway_single_small.rds", package = "XYomics")
)
pathway_category_one <- readRDS(
system.file("extdata", "sc_pathway_single_small_df.rds", package = "XYomics")
)
head(pathway_category_one$`female-specific`)
## ID
## GO:1904507 GO:1904507
## GO:0060382 GO:0060382
## GO:1904505 GO:1904505
## GO:0033617 GO:0033617
## GO:2000819 GO:2000819
## GO:0008535 GO:0008535
## Description
## GO:1904507 positive regulation of telomere maintenance in response to DNA damage
## GO:0060382 regulation of DNA strand elongation
## GO:1904505 regulation of telomere maintenance in response to DNA damage
## GO:0033617 mitochondrial cytochrome c oxidase assembly
## GO:2000819 regulation of nucleotide-excision repair
## GO:0008535 respiratory chain complex IV assembly
## GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue
## GO:1904507 1/3 15/18860 0.06666667 419.1111 20.43257 0.002384231
## GO:0060382 1/3 17/18860 0.05882353 369.8039 19.18795 0.002701842
## GO:1904505 1/3 19/18860 0.05263158 330.8772 18.14516 0.003019386
## GO:0033617 1/3 27/18860 0.03703704 232.8395 15.20524 0.004288885
## GO:2000819 1/3 29/18860 0.03448276 216.7816 14.66765 0.004606092
## GO:0008535 1/3 31/18860 0.03225806 202.7957 14.18283 0.004923231
## p.adjust qvalue geneID Count
## GO:1904507 0.04561299 0.01356908 ACTL6A 1
## GO:0060382 0.04561299 0.01356908 ACTL6A 1
## GO:1904505 0.04561299 0.01356908 ACTL6A 1
## GO:0033617 0.04561299 0.01356908 COX20 1
## GO:2000819 0.04561299 0.01356908 ACTL6A 1
## GO:0008535 0.04561299 0.01356908 COX20 1
You can then visualize them using plot_enrichment_dotplots() function.
# visualize via dotplot
plot <- plot_enrichment_dotplots(pathway_category)
print(plot)
$all
Finally, we construct a protein-protein interaction network to explore the interactions between our DEGs.
We first fetch a protein-protein interaction network from the STRING database or directly without the package (hormonal_network_metacore.rds). Then, we use the PCSF algorithm via ppi_pipeline() to extract a relevant subnetwork based on “prizes” derived from our DEG p-values for all cell types.
# Fetch STRING network (can be replaced with a custom network)
# g <- get_string_network(organism = "9606", score_threshold = 900)
# Load a pre-existing network from a file
g <- readRDS(system.file("extdata", "string_example_network.rds", package = "XYomics")) # You can also download the hormonal network "hormonal_network_metacore.rds" instead of "string_example_network.rds".
### Run for all cell type
network_results <- ppi_pipeline(results, g)
# By default, `target_cell_type_flag = "all"`, meaning that all cell types are plotted.
# You can also specify one or multiple cell types, for example:
# target_cell_type_flag = c("cell type 1", "cell type 2")
# Example for one specific cell type
network_results_one <- ppi_pipeline(results,g , target_cell_type_flag = "cell type 2")
network_results <- readRDS(
system.file("extdata", "sc_network_results_small.rds", package = "XYomics")
)
We use plot_network() to visualize the network, highlighting nodes based on their degree (i.e., significance). Each node also includes a barplot showing logFC values, blue for males and pink for females.
female_network <- network_results$`cell type 2`$female_network
#Generate network visualization
plot_network(female_network, "female-specific", results$`cell type 2`, "Cell type 2", show_barplot = T)
#Generate all_plots for all categories and all cell types
all_plots <- plot_network_pipeline(network_results, results)
The generate_cat_report function can be used to compile all results into a single HTML report.
# This command generates a comprehensive HTML report for single-cell dataset
generate_cat_report(results, pathway_category, network_results)