This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples. The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable. Four files are necessary for the analysis: mutation information file, BAM file, and mutation supporting read ID information file.
This tsv or tsv.gz file should contain at least these columns: Sample
Mut_type Chr Pos Ref Alt SimpleRepeat_TRF Neighborhood_sequence
sample_name 1-snv chr1 108130741 C T N
CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC
SimpleRepeat_TRF: Whether the mutation locates at a simple repeat
sequence or not (“Y” or “N”).
Neighborhood_sequence: [5’-20 bases] + [Alt sequence] + [3’-20
bases]
Sample, Mut_type, Chr, Pos, Ref, and Alt should be set exactly.
If you do not know the SimpleRepeat_TRF, Mut_type, or
Neighborhood_sequence, enter “-”. Automatically detected.
(mandatory, if multiple samples are processed in a batch)
Seven to ten columns are necessary (without column names).
Optional columns can be deleted if they are not applicable.
[sample name] [mutation information tsv file] [BAM file] [read length]
[adapter sequence read 1] [optional: adapter sequence read 2] [sample
type: Human or Mouse] [panel name] [optional: reference genome fastq
file] [optional: simple repeat region bed file]
PC9 ./source/CCLE.tsv ./source/Cell_line/PC9.bam 127
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Human TOP
A375 ./source/CCLE.tsv.gz ./source/Cell_line/A375.bam 127
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Hg38 TOP ./reference/hg38.fa
./reference/simpleRepeat.bed.gz
(optional, but mandatory with cram files)
(optional, but mandatory to detect simple repeat derived artifacts)
This pipeline contains 8 filtering processes.
Filter 1 : Shorter-supporting lengths distribute too unevenly to occur
(1-1 and 1-2).
Filter 1-1: P-values are less than the threshold_p(default:
10^(-6)).
Filter 1-2: The shorter-supporting lengths distributed over less than
75% of the read length.
Filter 2 : Hairpin-structure induced error detection (2-1 and
2-2).
Filter 2-1: Palindromic sequences exist within 150 bases. Filter 2-2:
>=50% mutation-supporting reads contains a reverse complementary
sequence of the opposite strand consisting >= 15 bases.
Filter 3 : 3’-/5’-supporting lengths are too unevenly distributed to
occur (3-1 and 3-3).
Filter 3-1: P-values are less than the threshold_p(default:
10^(-6)).
Filter 3-2: The distributions of 3’-/5’-supporting lengths are within
75% of the read length.
Filter 4 : >=15% mutations were called by chimeric reads comprising
two distant regions.
Filter 5 : >=50% mutations were called by soft-clipped reads.
Filter 6 : Mutations locating at simple repeat sequences.
Filter 7 : Mutations locating at a >=15 homopolymer.
Filter 8 : >=10% low quality bases (Quality score <18) in the
mutation supporting reads.
Supporting lengths are adjusted considering small repeat sequences around the mutations.
Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N]
$ Rscript MicroSEC.R ~ \
~/source/Sample_list_test.txt N
$ Rscript MicroSEC.R ~ \
~/source/sample_info_test.tsv.gz Y
If you want to know the progress visually, [progress bar Y/N] should
be Y.
Results are saved in a tsv file.
github url: https://github.com/MANO-B/MicroSEC
# initialize
msec <- NULL
homology_search <- NULL
mut_depth <- NULL
# test data
sample_name <- "sample"
read_length <- 150
adapter_1 <- "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
adapter_2 <- "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
organism <- "hg38"
# load mutation information
df_mutation <- fun_load_mutation(
system.file("extdata", "mutation_list.tsv", package = "MicroSEC"),
"sample",
BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
24)
df_bam <- fun_load_bam(
system.file("extdata", "sample.bam", package = "MicroSEC"))
# another example data
# data(exampleMutation)
# data(exampleBam)
# df_mutation <- exampleMutation
# df_bam <- exampleBam
# load genomic sequence
ref_genome <- fun_load_genome(organism)
chr_no <- fun_load_chr_no(organism)
# analysis
result <- fun_read_check(df_mutation = df_mutation,
df_bam = df_bam,
ref_genome = ref_genome,
sample_name = sample_name,
read_length = read_length,
adapter_1 = adapter_1,
adapter_2 = adapter_2,
short_homology_search_length = 4,
min_homology_search = 40,
progress_bar = progress_bar)
msec_read_checked <- result[[1]]
homology_searched <- result[[2]]
mut_depth_checked <- result[[3]]
# search homologous sequences
msec_homology_searched = fun_homology(msec = msec_read_checked,
df_distant = homology_searched,
min_homology_search = 40,
ref_genome = ref_genome,
chr_no = chr_no,
progress_bar = progress_bar)
# statistical analysis
msec_summarized <- fun_summary(msec_homology_searched)
msec_analyzed <- fun_analysis(msec = msec_summarized,
mut_depth = mut_depth_checked,
short_homology_search_length = 4,
min_homology_search = 40,
threshold_p = 10 ^ (-6),
threshold_hairpin_ratio = 0.50,
threshold_short_length = 0.75,
threshold_distant_homology = 0.15,
threshold_soft_clip_ratio = 0.50,
threshold_low_quality_rate = 0.1,
homopolymer_length = 15)
# save the results as a tsv.gz file.
#fun_save(msec_analyzed, "~/MicroSEC_test.tsv.gz")
msec_analyzed
#> Sample Mut_type Chr Pos Ref Alt SimpleRepeat_TRF
#> 1 sample 2-snv chr11 17719997 CC GG N
#> 2 sample 2-snv chr11 69295601 CT AA N
#> 3 sample 2-snv chr15 98707826 TC CG N
#> 4 sample 1-ins chr2 47783391 C CA N
#> 5 sample 2-del chr2 60920434 CAG C N
#> 6 sample 1-snv chr3 72446880 G C N
#> 7 sample 2-snv chr3 179198865 CAC AAG N
#> 8 sample 2-del chr3 181742887 TTA T Y
#> 9 sample 1-snv chr5 180618885 G A N
#> 10 sample 2-snv chr6 29725282 CC GG N
#> 11 sample 2-snv chr6 88143884 GA TC N
#> 12 sample 2-snv chr7 64068966 CG AA N
#> 13 sample 2-snv chr8 23681315 GC CA N
#> Neighborhood_sequence read_length total_read
#> 1 CCCCGCGGCGGTGCACCCGGGGCCGGGCGCACGTGAGGACGA 150 9
#> 2 CTTGTCTGTAGTCTGTCCCCAATGGGGACAGGGACTCGTTGC 150 28
#> 3 CAACTACGCCCTGGTCATCTCGGAGATGACCAATCTCAAGGA 150 17
#> 4 GGATGCGGCCTGGAGCGAGGCATGGGCCTGGGCCCAGGCCCT 150 13
#> 5 TGGTCTTGAACTCCTGACATCGTGATCCACCCACCTTGGCC 150 34
#> 6 CTCGCACACCAGGCGGTGGCCGCGGCGCCCCGGGAAGGGCC 150 16
#> 7 CAGGTGAACTGTGGGGCATCAAGTTGATGCCCCCAAGAATCCT 150 23
#> 8 GGGGGAAAGGACAGAGCAGTTTATATATATATATATATATA 150 6
#> 9 AGCCTGGCGAGCTCCACCATAGCGCGGAAGCGTCCGCGCTG 150 19
#> 10 ATGTGCAGCACGAGGGGCTGGGCCAGCCCCTCATCCTGAGAT 150 9
#> 11 CCTCGGCAGACGTGTCTGTGTCCACAGACATGGTTACCTTGG 150 23
#> 12 GCCTGGACTCTGCTCAGCAGAATTTGTATAGGGATGTGATGT 150 11
#> 13 CCAGGGAGGCCCGGGAGAAGCACTCCTCTTTCAGGGCCGGCA 150 34
#> soft_clipped_read flag_hairpin pre_support_length post_support_length
#> 1 3 8 143 10
#> 2 13 28 139 140
#> 3 7 17 139 140
#> 4 11 12 131 20
#> 5 16 0 65 144
#> 6 16 0 138 5
#> 7 7 23 139 141
#> 8 1 0 100 111
#> 9 12 0 145 19
#> 10 4 9 139 140
#> 11 8 21 144 10
#> 12 7 0 47 131
#> 13 34 32 135 9
#> short_support_length pre_farthest post_farthest
#> 1 10 205 10
#> 2 9 242 282
#> 3 10 338 187
#> 4 20 211 20
#> 5 65 65 258
#> 6 5 190 5
#> 7 10 268 383
#> 8 72 114 111
#> 9 19 158 19
#> 10 10 205 155
#> 11 10 393 10
#> 12 47 47 243
#> 13 9 301 9
#> low_quality_base_rate_under_q18 low_quality_pre low_quality_post
#> 1 0.015555556 0.011111111 0.000000000
#> 2 0.006428571 0.010714286 0.003571429
#> 3 0.009803922 0.005882353 0.005882353
#> 4 0.010769231 0.015384615 0.000000000
#> 5 0.003725490 0.000000000 0.000000000
#> 6 0.009583333 0.018750000 0.012500000
#> 7 0.005217391 0.000000000 0.004347826
#> 8 0.091111111 0.000000000 0.166666667
#> 9 0.009824561 0.021052632 0.000000000
#> 10 0.004444444 0.000000000 0.000000000
#> 11 0.021449275 0.004347826 0.000000000
#> 12 0.020000000 0.009090909 0.027272727
#> 13 0.006862745 0.000000000 0.011764706
#> distant_homology_rate soft_clipped_rate prob_filter_1 prob_filter_3_pre
#> 1 0.00000000 0.3333333 5.080526e-05 1.173347e-03
#> 2 0.00000000 0.4642857 6.775881e-22 5.890075e-01
#> 3 0.00000000 0.4117647 1.362376e-19 4.838936e-01
#> 4 0.07692308 0.8461538 3.198724e-11 1.943905e-06
#> 5 1.00000000 0.4705882 9.553664e-03 1.214764e-04
#> 6 0.00000000 1.0000000 3.370855e-15 7.600607e-06
#> 7 0.00000000 0.3043478 2.439900e-13 1.000000e+00
#> 8 0.33333333 0.1666667 1.438974e-04 5.013675e-01
#> 9 0.00000000 0.6315789 2.128166e-08 2.319433e-12
#> 10 0.00000000 0.4444444 5.263540e-10 6.122053e-01
#> 11 0.00000000 0.3478261 1.667818e-18 7.358585e-10
#> 12 0.27272727 0.6363636 9.045576e-03 2.661244e-03
#> 13 0.23529412 1.0000000 2.853725e-24 2.396907e-12
#> prob_filter_3_post filter_1_mutation_intra_hairpin_loop
#> 1 2.136914e-05 FALSE
#> 2 8.042059e-01 TRUE
#> 3 6.816295e-01 TRUE
#> 4 8.192000e-10 TRUE
#> 5 1.042664e-04 FALSE
#> 6 3.552714e-15 TRUE
#> 7 1.000000e+00 TRUE
#> 8 1.562500e-02 FALSE
#> 9 3.876247e-12 TRUE
#> 10 7.240429e-01 TRUE
#> 11 3.030645e-20 TRUE
#> 12 3.880853e-03 FALSE
#> 13 2.295862e-28 TRUE
#> filter_2_hairpin_structure filter_3_microhomology_induced_mutation
#> 1 TRUE FALSE
#> 2 TRUE FALSE
#> 3 TRUE FALSE
#> 4 TRUE TRUE
#> 5 FALSE FALSE
#> 6 FALSE TRUE
#> 7 TRUE FALSE
#> 8 FALSE FALSE
#> 9 FALSE TRUE
#> 10 TRUE FALSE
#> 11 TRUE TRUE
#> 12 FALSE FALSE
#> 13 TRUE TRUE
#> filter_4_highly_homologous_region filter_5_soft_clipped_reads
#> 1 FALSE FALSE
#> 2 FALSE FALSE
#> 3 FALSE FALSE
#> 4 FALSE TRUE
#> 5 TRUE FALSE
#> 6 FALSE TRUE
#> 7 FALSE FALSE
#> 8 FALSE FALSE
#> 9 FALSE TRUE
#> 10 FALSE FALSE
#> 11 FALSE FALSE
#> 12 TRUE TRUE
#> 13 TRUE TRUE
#> filter_6_simple_repeat filter_7_mutation_at_homopolymer filter_8_low_quality
#> 1 FALSE FALSE FALSE
#> 2 FALSE FALSE FALSE
#> 3 FALSE FALSE FALSE
#> 4 FALSE FALSE FALSE
#> 5 FALSE FALSE FALSE
#> 6 FALSE FALSE FALSE
#> 7 FALSE FALSE FALSE
#> 8 TRUE FALSE TRUE
#> 9 FALSE FALSE FALSE
#> 10 FALSE FALSE FALSE
#> 11 FALSE FALSE FALSE
#> 12 FALSE FALSE FALSE
#> 13 FALSE FALSE FALSE
#> msec_filter_123 msec_filter_1234 msec_filter_all
#> 1 Artifact suspicious Artifact suspicious Artifact suspicious
#> 2 Artifact suspicious Artifact suspicious Artifact suspicious
#> 3 Artifact suspicious Artifact suspicious Artifact suspicious
#> 4 Artifact suspicious Artifact suspicious Artifact suspicious
#> 5 Artifact suspicious Artifact suspicious
#> 6 Artifact suspicious Artifact suspicious Artifact suspicious
#> 7 Artifact suspicious Artifact suspicious Artifact suspicious
#> 8 Artifact suspicious
#> 9 Artifact suspicious Artifact suspicious Artifact suspicious
#> 10 Artifact suspicious Artifact suspicious Artifact suspicious
#> 11 Artifact suspicious Artifact suspicious Artifact suspicious
#> 12 Artifact suspicious Artifact suspicious
#> 13 Artifact suspicious Artifact suspicious Artifact suspicious
#> comment
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8 too repetitive to analyze homology, filter 4: sequence is too repetitive,
#> 9
#> 10
#> 11
#> 12
#> 13
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Asia/Tokyo
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] MicroSEC_2.1.6
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 utf8_1.2.4
#> [3] generics_0.1.3 SparseArray_1.2.4
#> [5] bitops_1.0-7 lattice_0.22-6
#> [7] stringi_1.8.3 digest_0.6.35
#> [9] magrittr_2.0.3 grid_4.3.2
#> [11] evaluate_0.23 fastmap_1.1.1
#> [13] Matrix_1.6-5 jsonlite_1.8.8
#> [15] restfulr_0.0.15 GenomeInfoDb_1.38.8
#> [17] fansi_1.0.6 BSgenome_1.70.2
#> [19] XML_3.99-0.16.1 Biostrings_2.70.3
#> [21] codetools_0.2-20 jquerylib_0.1.4
#> [23] abind_1.4-5 cli_3.6.2
#> [25] rlang_1.1.3 crayon_1.5.2
#> [27] XVector_0.42.0 Biobase_2.62.0
#> [29] withr_3.0.0 DelayedArray_0.28.0
#> [31] cachem_1.0.8 yaml_2.3.8
#> [33] S4Arrays_1.2.1 tools_4.3.2
#> [35] parallel_4.3.2 BiocParallel_1.36.0
#> [37] dplyr_1.1.4 GenomeInfoDbData_1.2.11
#> [39] Rsamtools_2.18.0 SummarizedExperiment_1.32.0
#> [41] BiocGenerics_0.48.1 vctrs_0.6.5
#> [43] R6_2.5.1 matrixStats_1.3.0
#> [45] BiocIO_1.12.0 stats4_4.3.2
#> [47] lifecycle_1.0.4 BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [49] rtracklayer_1.62.0 zlibbioc_1.48.2
#> [51] stringr_1.5.1 S4Vectors_0.40.2
#> [53] IRanges_2.36.0 pkgconfig_2.0.3
#> [55] pillar_1.9.0 bslib_0.8.0
#> [57] glue_1.7.0 GenomicAlignments_1.38.2
#> [59] xfun_0.43 tibble_3.2.1
#> [61] GenomicRanges_1.54.1 tidyselect_1.2.1
#> [63] MatrixGenerics_1.14.0 rstudioapi_0.16.0
#> [65] knitr_1.46 rjson_0.2.21
#> [67] htmltools_0.5.8.1 rmarkdown_2.26
#> [69] compiler_4.3.2 RCurl_1.98-1.14