README

The goal of bbmix is to genotype variants using RNA-seq reads. We developed a 2-step approach. First, we selected RNA-seq reads that overlap known exonic bi-allelic SNPs to learn the parameters for a mixture of three Beta-Binomial distributions underlying the three possible genotypes. In the second step we obtained posterior probabilities for each of the genotypes using the parameters learnt in step 1

Install bbmix from GitLab:

Installation has been tested on R 3.5.1. Installation time is estimated as 2 minutes.

## Within R:
install.packages("devtools") # if you don't already have the package
library(devtools)
devtools::install_git(url = "https://gitlab.com/evigorito/bbmix.git/")

Running bbmix

Preparation of the required input files are described in bbmix_pipeline.

We have shown that training the model with the same sample that we want to learn genotypes, with a pool of reads from all the study samples or with an external sample did not affect much the accuracy of genotype calls. Here we provide examples on how to call genotypes in this 3 scenarios.

Calling genotypes with the model trained with external default sample (NA12878)

The function to call is call_gt. We provide an example for calling genotypes on chromosome 22 for the genome in a bottle NA12878 sample. Genome coordinates are in built 37.

Arguments

allele_counts_f: file name with allele counts for SNPs (details on how to generate this file are in the pipeline.)

depth : minumun number of reads overlapping a SNP for genotypes to be called, defaults to 10

stan_f : full name to stan object with model fit to extract mean of parameters. Defaults to NULL, using the model trained with genome in a bottle reads. Otherwise this object can be generated with fit_bb function.

legend_f : name for legend file with allele frequencies to apply for prior. Requited columns are “position a0 a1” and the name for the column to extract allele frequencies. The file need to be compressed and compatible with gunzip.

pop : column name from legend_f for the population to extract allele frequencies, defaults to EUR

prob : threshold for the posterior probability to make hard calls, defaults to 0.99

fisher_f : file with SNP annotations with Fisher exact test for detecting strand bias

fisher : cut-off for Fisher strand bias, defaults to 30

cluster_f : file with SNP annotation of whether there are in clusters of at least 3 within 35bp

out : file name to save output

## Retrive input files for running call_gt
counts_f <- system.file("extdata/input", "NA12878.chr22.Q20.allelicCounts.txt",
        package = "bbmix",
        mustWork = TRUE)

legend <- system.file("extdata/input", "1000GP_Phase3_chr22.legend.gz",
       package = "bbmix", mustWork = TRUE)

fisher_f <- system.file("extdata/input", "chr22.FS.Q20.alleleCounts.txt",
     package = "bbmix", mustWork = TRUE)

cluster_f <- system.file("extdata/input", "fSNPs_22_RP_maf0_01_cluster3window35.txt",
      package = "bbmix", mustWork = TRUE)

## Choose your output directory
out <- "path/to/output_dir/NA12878.chrom22.gt.txt"

## Run call_gt:
call_gt(allele_counts_f = counts_f,
    legend_f = legend,
    fisher_f = fisher_f,
    cluster_f = cluster_f,
    out = out)

Calling genotypes with the model trained with a sample of choice

1- We first call the function fit_bb and then call_gt as follows:

Arguments for fit_bb

counts_f: file name with allele counts for SNPs (details on how to generate this file are in the pipeline, same format as for allele_counts_f in call_gt function.)

N : Number of SNPs to train the model, defaults to 1000

prefix : prefix to add to files when saving (suffix: _stan_beta_binom_fit.rds), suggested prefix is sample name, defaults to NULL (stan_beta_binom_fit.rds)

alpha_p : Beta prior parameter for model training, default [1,10,499]

beta_p : Beta prior parameter for model training, default [499, 10, 1]

out : directory to save output from model fitting

## Retrive input files for running call_gt
counts_f <- system.file("extdata/input", "NA12878.chr22.Q20.allelicCounts.txt",
        package = "bbmix",
        mustWork = TRUE)

## Choose your output directory
out <- "path/to/output_dir"

## Run fit_bb:
fit_bb(counts_f = counts_f,
    out = out)

2- Call genotypes using call_gt

We use the same arguments as above except that now we use the argument “stan_f” with the output file generated by the function fit_bb:

## Run call_gt:
call_gt(allele_counts_f = counts_f,
    stan_f = output_from_fit_bb,
    legend_f = legend,
    fisher_f = fisher_f,
    cluster_f = cluster_f,
    out = out)

Calling genotypes with a pool of samples

The function poolreads can be use for preparing the input file ‘counts_fit’ for bbmix. poolreads takes the following arguments: * files: vector with the name of the files to pool the reads from * N : number of SNPs to select for each sample, defaults to 1000 * d : read depth for a SNP to be included in the pool, defaults to 10 * out : name to write output file

Then call fit_bb for model fitting as in previous example using the output file from poolreads as input for ‘counts_f’ argument. Last, call cal_gt with stan_f argument generated by fit_bb.

Calling genotypes output

## Inspecting output files
gt <- data.table::fread(system.file("extdata/output", "gt.NA12878.chr22.txt",
                                  package = "bbmix",
                              mustWork = TRUE))

head(gt)
#>    CHROM      POS REF ALT NREF NALT total        EUR        p0           p1
#> 1:    22 17538189   T   C   14    0    14 0.07952286 0.9996466 0.0003534366
#> 2:    22 17538439   C   T   15    0    15 0.50994036 0.9968645 0.0031355337
#> 3:    22 17538808   T   C   23    0    23 0.79522863 0.9985178 0.0014822119
#> 4:    22 17538980   C   G   12    0    12 0.59244533 0.9888215 0.0111784524
#> 5:    22 17539427   G   T   18    0    18 0.79522863 0.9949742 0.0050258204
#> 6:    22 17539439   T   C   13    0    13 0.79522863 0.9787282 0.0212717969
#>              p2  expected_GT  expected_sd GT SNPcluster        FS
#> 1: 1.271118e-29 0.0003534366 0.0001142123  0      FALSE  3.802112
#> 2: 5.602803e-29 0.0031355337 0.0010943268  0      FALSE  2.178861
#> 3: 4.304680e-39 0.0014822119 0.0008511031  0      FALSE  9.150132
#> 4: 4.721349e-24 0.0111784524 0.0029755483 NA      FALSE  0.000000
#> 5: 3.173322e-32 0.0050258204 0.0021620213  0      FALSE  4.615766
#> 6: 8.830021e-25 0.0212717969 0.0061388626 NA      FALSE 19.158656

The table list the chromosome (CHROM), position (POS), reference (REF) and alternative alleles (ALT), the number of reads overlapping the reference allele (NREF), the alternative allele (NALT) and total. The next column is the effector allele frequency for the selected ethnicity, followed by the posterior probabilities for genotypes homozygous reference (p0), heterozygous (p1) or hom alternative (p2). Then, the expected genotype (expected_GT, dosage), its the standard deviation (expected_sd) and hard calls based on parameter “prob” follows in column GT. Last, the quality control measures for SNP in clusters or Fisher strand bias phred scaled p-value are labelled as SNPclusters and FS.