The goal of bbmix is to genotype variants using RNA-seq reads. We developed a 2-step approach. First, we selected RNA-seq reads that overlap known exonic bi-allelic SNPs to learn the parameters for a mixture of three Beta-Binomial distributions underlying the three possible genotypes. In the second step we obtained posterior probabilities for each of the genotypes using the parameters learnt in step 1
Installation has been tested on R 3.5.1. Installation time is estimated as 2 minutes.
## Within R:
install.packages("devtools") # if you don't already have the package
library(devtools)
devtools::install_git(url = "https://gitlab.com/evigorito/bbmix.git/")
Preparation of the required input files are described in bbmix_pipeline.
We have shown that training the model with the same sample that we want to learn genotypes, with a pool of reads from all the study samples or with an external sample did not affect much the accuracy of genotype calls. Here we provide examples on how to call genotypes in this 3 scenarios.
The function to call is call_gt. We provide an example for calling genotypes on chromosome 22 for the genome in a bottle NA12878 sample. Genome coordinates are in built 37.
Arguments
allele_counts_f: file name with allele counts for SNPs (details on how to generate this file are in the pipeline.)
depth : minumun number of reads overlapping a SNP for genotypes to be called, defaults to 10
stan_f : full name to stan object with model fit to extract mean of parameters. Defaults to NULL, using the model trained with genome in a bottle reads. Otherwise this object can be generated with fit_bb function.
legend_f : name for legend file with allele frequencies to apply for prior. Requited columns are “position a0 a1” and the name for the column to extract allele frequencies. The file need to be compressed and compatible with gunzip.
pop : column name from legend_f for the population to extract allele frequencies, defaults to EUR
prob : threshold for the posterior probability to make hard calls, defaults to 0.99
fisher_f : file with SNP annotations with Fisher exact test for detecting strand bias
fisher : cut-off for Fisher strand bias, defaults to 30
cluster_f : file with SNP annotation of whether there are in clusters of at least 3 within 35bp
out : file name to save output
## Retrive input files for running call_gt
counts_f <- system.file("extdata/input", "NA12878.chr22.Q20.allelicCounts.txt",
package = "bbmix",
mustWork = TRUE)
legend <- system.file("extdata/input", "1000GP_Phase3_chr22.legend.gz",
package = "bbmix", mustWork = TRUE)
fisher_f <- system.file("extdata/input", "chr22.FS.Q20.alleleCounts.txt",
package = "bbmix", mustWork = TRUE)
cluster_f <- system.file("extdata/input", "fSNPs_22_RP_maf0_01_cluster3window35.txt",
package = "bbmix", mustWork = TRUE)
## Choose your output directory
out <- "path/to/output_dir/NA12878.chrom22.gt.txt"
## Run call_gt:
call_gt(allele_counts_f = counts_f,
legend_f = legend,
fisher_f = fisher_f,
cluster_f = cluster_f,
out = out)
1- We first call the function fit_bb and then call_gt as follows:
Arguments for fit_bb
counts_f: file name with allele counts for SNPs (details on how to generate this file are in the pipeline, same format as for allele_counts_f in call_gt function.)
N : Number of SNPs to train the model, defaults to 1000
prefix : prefix to add to files when saving (suffix: _stan_beta_binom_fit.rds), suggested prefix is sample name, defaults to NULL (stan_beta_binom_fit.rds)
alpha_p : Beta prior parameter for model training, default [1,10,499]
beta_p : Beta prior parameter for model training, default [499, 10, 1]
out : directory to save output from model fitting
## Retrive input files for running call_gt
counts_f <- system.file("extdata/input", "NA12878.chr22.Q20.allelicCounts.txt",
package = "bbmix",
mustWork = TRUE)
## Choose your output directory
out <- "path/to/output_dir"
## Run fit_bb:
fit_bb(counts_f = counts_f,
out = out)
2- Call genotypes using call_gt
We use the same arguments as above except that now we use the argument “stan_f” with the output file generated by the function fit_bb:
## Run call_gt:
call_gt(allele_counts_f = counts_f,
stan_f = output_from_fit_bb,
legend_f = legend,
fisher_f = fisher_f,
cluster_f = cluster_f,
out = out)
The function poolreads can be use for preparing the input file ‘counts_fit’ for bbmix. poolreads takes the following arguments: * files: vector with the name of the files to pool the reads from * N : number of SNPs to select for each sample, defaults to 1000 * d : read depth for a SNP to be included in the pool, defaults to 10 * out : name to write output file
Then call fit_bb for model fitting as in previous example using the output file from poolreads as input for ‘counts_f’ argument. Last, call cal_gt with stan_f argument generated by fit_bb.
## Inspecting output files
gt <- data.table::fread(system.file("extdata/output", "gt.NA12878.chr22.txt",
package = "bbmix",
mustWork = TRUE))
head(gt)
#> CHROM POS REF ALT NREF NALT total EUR p0 p1
#> 1: 22 17538189 T C 14 0 14 0.07952286 0.9996466 0.0003534366
#> 2: 22 17538439 C T 15 0 15 0.50994036 0.9968645 0.0031355337
#> 3: 22 17538808 T C 23 0 23 0.79522863 0.9985178 0.0014822119
#> 4: 22 17538980 C G 12 0 12 0.59244533 0.9888215 0.0111784524
#> 5: 22 17539427 G T 18 0 18 0.79522863 0.9949742 0.0050258204
#> 6: 22 17539439 T C 13 0 13 0.79522863 0.9787282 0.0212717969
#> p2 expected_GT expected_sd GT SNPcluster FS
#> 1: 1.271118e-29 0.0003534366 0.0001142123 0 FALSE 3.802112
#> 2: 5.602803e-29 0.0031355337 0.0010943268 0 FALSE 2.178861
#> 3: 4.304680e-39 0.0014822119 0.0008511031 0 FALSE 9.150132
#> 4: 4.721349e-24 0.0111784524 0.0029755483 NA FALSE 0.000000
#> 5: 3.173322e-32 0.0050258204 0.0021620213 0 FALSE 4.615766
#> 6: 8.830021e-25 0.0212717969 0.0061388626 NA FALSE 19.158656
The table list the chromosome (CHROM), position (POS), reference (REF) and alternative alleles (ALT), the number of reads overlapping the reference allele (NREF), the alternative allele (NALT) and total. The next column is the effector allele frequency for the selected ethnicity, followed by the posterior probabilities for genotypes homozygous reference (p0), heterozygous (p1) or hom alternative (p2). Then, the expected genotype (expected_GT, dosage), its the standard deviation (expected_sd) and hard calls based on parameter “prob” follows in column GT. Last, the quality control measures for SNP in clusters or Fisher strand bias phred scaled p-value are labelled as SNPclusters and FS.