This vignette documents the functions in the BinaryDosage package that convert GEN files to binary dosage files.
Note: The examples below use functions to access information in binary dosage files. Information about these functions can be found in the vignette Using Binary Dosage Files. Data returned by the function getbdinfo contains information about a binary dosage file. Information on the data return by getbdinfo can be found in the vignette Genetic File Information.
GEN files are a useful way to store genetic data. They are text files and can be easily parsed. The output files returned from the imputation software Impute2 are returned in this format.
Uncompressed GEN files can be very large, 100s of GB. Because of this they are quite often compressed. This makes the files much smaller but greatly increases the read time. The BinaryDosage package supports reading of gzip compressed GEN files.
There appears to have been changes to the GEN file over the years and it also appears people have created GEN-like file formats. The BinaryDosage package can support many GEN-like file formats.
The BinaryDosage package has a routine to convert GEN files into a binary format that maintains the dosage, genetic and probabilities. This results in a file about 10-15% the size of the uncompressed VCF file with much faster, 200-300x, read times. In comparison, Using gzip to compress the GEN file reduces the size of the file to about 5% its original size but makes run times slower.
Routines were written to help debug the conversion routine. It was discovered these routines were quite useful for accessing data in the GEN file and are now included in the package. This document contains instructions on how to use these routines.
The GEN file may have a header. If it does have a header the format the first N entries must be the column names for the SNP information variables. The following values identify the subjects and can have either of the following formats ordered by subject
If the GEN file does not have a header, the subject information must be in a sample file that can be read with read.table. If there is only one column the subject ID is set to this value and the family ID is set to "“. Otherwise, the family ID value is set to the value of the first column and the subject ID value is set to the second column. If the first value of the subject ID and family ID are both”0“, they are deleted. If family ID and subject ID are equal for all subjects, the family ID value is set to”".
Note: If a sample file is provided, the header is ignored.
The body GEN file must have the following format. The first N columns must contain information about the SNP. These columns must contain the following values
The chromosome number may also be in the first N columns.
Note: The first three columns of the GEN file used to be snp_id, rs_id, and position. In many cases these values got change to chromosome, snp_id, and position.
The remaining columns must contain the genotype probabilities sorted by subject. The genotype probabilities can be in any of the following formats.
Note: The number of genotype probabilities must agree with the number of subjects identified in the header or sample file.
There are several sample files included with the BinaryDosage package. The file names will be found by using the system.file command in the examples. This will be used many times in the examples.
The binary dosage files created will be temporary files. They will be created using the tempfile command. This will also be used many times in the examples. All output files will use the default format of 4. For information on other formats see the vignette Binary Dosage Formats.
The gentobd routine converts GEN files to the binary dosage format. Many different formats of GEN files by the BinaryDosage package. The following sections show how to convert GEN files in various formats to the binary dosage format.
The gentobd takes the following parameters
The default values for the gentobd require a sample file meaning there is no header in the GEN file and the first five columns be chromosome, SNP ID, location, reference allele, and alternate allele, respectively. The genotype data must have the three genotype values and the file must not be compressed.
The following code reads the gen file set3b.chr.imp using the set3b.sample sample file. These files are in the format mentioned above.
gen3bchrfile <- system.file("extdata", "set3b.chr.imp", package = "BinaryDosage")
sample3bfile <- system.file("extdata", "set3b.sample", package = "BinaryDosage")
bdfile3b_chr <- tempfile()
gentobd(genfiles = c(gen3bchrfile, sample3bfile), bdfiles = bdfile3b_chr)
bdinfo3b_chr <- getbdinfo(bdfiles = bdfile3b_chr)
The snpcolumns parameter lists the column numbers for the chromosome, SNP ID, location, reference allele, and alternate allele, respectively.
The following code reads in set1b.imp. This file has the SNP data in the following order, chromosome, location, SNP ID, reference allele, alternate allele. The file also has a header so there is no sample file.
gen1bfile <- system.file("extdata", "set1b.imp", package = "BinaryDosage")
bdfile1b <- tempfile()
gentobd(genfiles = gen1bfile, bdfiles = bdfile1b, snpcolumns = c(1L, 3L, 2L, 4L,
5L), header = TRUE)
bdinfo1b <- getbdinfo(bdfiles = bdfile1b)
Quite often the chromosome is not part of the GEN file and first column has the value ‘--’. In this case the SNP ID is often in the format <chromosome>:<additional SNP data>. In this case, the chromosome column number (first value in snpcolumns) can be set to 0L and the gentobd routine will extract the chromosome from the SNP ID value.
The following code reads in set3b.imp. This is in same format as set3b.chr.imp except that there is no column for the chromosome value. The value of 0L will be used for the chromosome column to indicate to get the chromosome value from the SNP ID value.
gen3bfile <- system.file("extdata", "set3b.imp", package = "BinaryDosage")
sample3bfile <- system.file("extdata", "set3b.sample", package = "BinaryDosage")
bdfile3b <- tempfile()
gentobd(genfiles = c(gen3bfile, sample3bfile), bdfiles = bdfile3b, snpcolumns = c(0L,
2L:5L))
bdinfo3b <- getbdinfo(bdfiles = bdfile3b)
Sometimes the GEN file has more SNP information than the five values mentioned earlier. In this case the genotype probabilities start in a column number other than 6. The value of the startcolumn is the column number that the genotype probabilities start
The following code reads in set4b.imp. It has an extra column in the SNP data in column 2. The snpcolumns and startcolumn has been set to handle this. The value of impformat has also been set since there are only 2 genotype probabilities in the file.
gen4bfile <- system.file("extdata", "set4b.imp", package = "BinaryDosage")
sample4bfile <- system.file("extdata", "set4b.sample", package = "BinaryDosage")
bdfile4b <- tempfile()
gentobd(genfiles = c(gen4bfile, sample4bfile), bdfiles = bdfile4b, snpcolumns = c(1L,
2L, 4L, 5L, 6L), startcolumn = 7L, impformat = 2L)
bdinfo4b <- getbdinfo(bdfiles = bdfile4b)
The impformat parameter is an integer from 1 to 3 that indicates how many genotype probabilities are in the file for each person. The value of 1 indicates that the value is the dosage value for the subject.
The following codes reads in file set2b.imp. This file contains only the dosage values for the subjects. The SNP information is not in the default order so the values of snpcolumns has to be specified (see above).
gen2bfile <- system.file("extdata", "set2b.imp", package = "BinaryDosage")
sample2bfile <- system.file("extdata", "set2b.sample", package = "BinaryDosage")
bdfile2b <- tempfile()
gentobd(genfiles = c(gen2bfile, sample2bfile), bdfiles = bdfile2b, snpcolumns = c(1L,
3L, 2L, 4L, 5L), impformat = 1L)
bdinfo2b <- getbdinfo(bdfiles = bdfile2b)
The chromosome parameter is a character value that used when the chromosome column value in snpcolumns is set to -1L.
The following code reads in set3b.imp, setting the chromosome value to 1.
gen3bfile <- system.file("extdata", "set3b.imp", package = "BinaryDosage")
sample3bfile <- system.file("extdata", "set3b.sample", package = "BinaryDosage")
bdfile3bm1 <- tempfile()
gentobd(genfiles = c(gen3bfile, sample3bfile), bdfiles = bdfile3bm1, snpcolumns = c(-1L,
2L:5L), chromosome = "1")
bdinfo3bm1 <- getbdinfo(bdfiles = bdfile3bm1)
The header parameter is a character vector of length 1 or 2. These indicate if the GEN file an sample file have headers, respectively. If the first value is TRUE, the second value is ignored as the subjects IDs are in the header of the GEN file.
The following code reads in set3b.imp using the sample file set3bnh.sample which has no header.
gen3bfile <- system.file("extdata", "set3b.imp", package = "BinaryDosage")
sample3bnhfile <- system.file("extdata", "set3bnh.sample", package = "BinaryDosage")
bdfile3bnh <- tempfile()
gentobd(genfiles = c(gen3bfile, sample3bnhfile), bdfiles = bdfile3bnh, snpcolumns = c(0L,
2L:5L), header = c(FALSE, FALSE))
bdinfo3bnh <- getbdinfo(bdfiles = bdfile3bnh)
The gz parameter is a logical value that indicates if the GEN file is compressed using gzip. The sample file is always assumed to be uncompressed.
The following code reads in set4b.imp.gz file using the sample file set4b.sample.
gen4bgzfile <- system.file("extdata", "set4b.imp.gz", package = "BinaryDosage")
sample4bfile <- system.file("extdata", "set4b.sample", package = "BinaryDosage")
bdfile4bgz <- tempfile()
gentobd(genfiles = c(gen4bgzfile, sample4bfile), bdfiles = bdfile4bgz, snpcolumns = c(1L,
2L, 4L, 5L, 6L), startcolumn = 7L, impformat = 2L, gz = TRUE)
bdinfo4bgz <- getbdinfo(bdfiles = bdfile4bgz)
The separator parameter is a character variable. This character separates the columns in the GEN file. Multiple copies of the separator are considered to be separator.
The bdfiles parameter is a character vector of length 1 or 3. These are the names of the binary dosage, family, and map files. If the format of the binary dosage file is 4, the only value needed is the name of the binary dosage file.
The format parameter determines the format of the binary dosage files. Formats 1, 2 and 3 consist of three files, binary dosage, family, and map. Format 4 combines all of these into one file.
The subformat parameter determines what information is in the binary dosage files. All formats can have subformats 1 and 2. A subformat value of 1 indicates that only the dosage values are written to the binary dosage file and a value of 2 indicates that the dosage and genotype probabilities are written to the binary dosage file. Formats 3 and 4 can also have subformat values of 3 and 4. These values have the same meaning as 1 and 2 respectively but have a slightly reordered header in the binary dosage file to improve read speed.
The snpidformat options specifies how the SNP ID is written to the binary dosage file. The default value is 0. This tells the code to use the SNP IDs that are in the GEN file. Other values that can be supplied to the function creates a SNP ID from the chromosome, location, reference, and alternate allele values.
When the snpidformat is set to 1, the SNP ID is written in the format
Chromosome:Location
When the snpidformat is set to 2, the SNP ID is written in the format
Chromosome:Location:Reference Allele:Alternate Allele
When the snpidformat is set to 3, the SNP ID is written in the format
Chromosome:Location_Reference Allele_Alternate Allele
When the snpidformat is set to -1, the SNP ID is not written to the binary dosage file. When the binary dosage file is read, the SNP ID is generated using the format for snpidformat equal to 2. This reduces the size of the binary dosage file.
When using binary dosage format 4.x it is possible to store additional information about the SNPs in the file. This is information consists of the following values
It is possible to calculate the alternate and minor allele frequency without the imputation information file. It is also possible to estimate the imputation r-squared. See the vignette Estimating Imputed R-squares for more information on the r-squared estimate.
The value for bdoptions is a vector of character values that can be “aaf”, “maf”, “rsq”, or and combination of these values. The values indicate to calculate alternate allele frequency, minor allele frequency, and imputation r-squared respectively.
The following routines are available for accessing information contained in VCF files
The getgeninfo routine returns information about a GEN file. For more information about the data returned see Genetic File Information. This information needs to be passed to genapply so it can efficiently read the GEN file.
The parameters passed to getgeninfo are
All of these parameters have the same meaning as in the gentobd routine above. There is on additional parameter index. This is a logical value indicating if the GEN file should be indexed for quicker reading. This is useful when using genapply. However the index parameter can not be TRUE if the file is compressed.
The genapply routine applies a function to all SNPs in the GEN file. The routine returns a list with length equal to the number of SNPs in the GEN file. Each element in the list is the value returned by the user supplied function. The routine takes the following parameters.
The user supplied function must have the following parameters.
The user supplied function can have other parameters. These parameters need to passed to the genapply routine.
There is a function in the BinaryDosage package named getaaf that calculates the alternate allele frequencies and is the format needed by genapply routine. The following uses getaaf to calculate the alternate allele frequency for each SNP in the set3b.chr.imp file using the vcfapply routine.
gen3bchrfile <- system.file("extdata", "set3b.chr.imp", package = "BinaryDosage")
sample3bfile <- system.file("extdata", "set3b.sample", package = "BinaryDosage")
geninfo <- getgeninfo(genfiles = c(gen3bchrfile, sample3bfile), index = TRUE)
aaf <- unlist(genapply(geninfo = geninfo, getaaf))
altallelefreq <- data.frame(SNP = geninfo$snps$snpid, aafcalc = aaf)
knitr::kable(altallelefreq, caption = "Calculated aaf", digits = 3)
SNP | aafcalc |
---|---|
1:10000_C_A | 0.362 |
1:11000_T_C | 0.010 |
1:12000_T_C | 0.202 |
1:13000_T_C | 0.399 |
1:14000_G_C | 0.169 |
1:15000_A_C | 0.621 |
1:16000_G_A | 0.505 |
1:17000_C_A | 0.479 |
1:18000_C_G | 0.250 |
1:19000_T_G | 0.241 |