The ICSKAT package implements the Interval-Censored Sequence Kernel Association Test (ICSKAT), the interval-censored Burden test, and the ICSKAT-Optimal (ICSKATO) test for inference on sets of features (e.g. all the SNPs in a gene or pathway) in genetic association studies.
Interval-censored data are a type of failure time data (also known as time-to-event data) quite commonly found in modern genetic datasets including the UK Biobank (UKB) and St. Jude Lifetime Cohort Study (SJLIFE). Many people are more familiar with right-censored data, which are a special case of interval-censored data. In right-censored data, an event time is either known to happen exactly or is only known to fall after some last observation time. In interval-censored data, an event time is not known exactly but is only known to fall in some interval (e.g between 0 and 20, or between 19.5 and 22.3, or between 50 and infinity).
For example, in SJLIFE, childhood cancer subjects who have turned 18 and have survived at least five years past their initial cancer treatment are invited back at regular intervals for comprehensive, multi-day clinical checkups. These patients are at extremely high risk of many chronic conditions, and they suffer an average of three severe or life-threatening conditions by age 40. At the (free) visits, the patients undergo a variety of diagnostic tests across many organ systems. Many diseases, for instance bone mineral density deficiency, are only able to be found at these visits because patients cannot self-diagnose the conditions on their own at home. Thus, if a subject visited the hospital for check-ups at age 20, 25, and 30, and bone mineral density deficiency was found at the age 25 visit, the disease onset time is only known to fall between 20 and 25.
In UKB, some patients visit assessment centers multiple times, and each time they take a questionnaire asking about a variety of health outcomes. For example, one question asks whether they have had any fractures. If the patients answers “no” at the first visit and “yes” at the second visit, then the fracture is only known to have occured between the first and second visits.
There are some ad-hoc ways to work with interval-censored data, including dichotomizing the occurrence of the event and working with binary outcomes (e.g. logistic regression) or imputing the occurrence time to fall in the middle of the interval and applying right-censored survival analysis methodology. However applying interval-censored methodology to interval-censored data often results in better operating characteristics (e.g. better control of Type I error rate or more power).
There are three steps to running the ICSKAT test with this package:
Suppose we are interested in testing whether a specific gene is associated with time to bone mineral density deficiency. We will simulate event times for 10,000 subjects under a proportional hazards model with baseline cumulative hazard function H(t)=t. We will set four observations times at times 1, 2, 3, and 4, with each subject’s exact visit times drawn from a Uniform(-0.25, 0.25) distribution surrounding these times. Each subject will have a 10% chance of missing any given visit. Our genetic data will consist of 50 SNPs in the gene of interest, and for each patient we have their minor allele count (0,1,2) at each of the 50 SNPs. Additionally suppose have non-genetic covariates for each subject’s gender and whether they take daily vitamins (both binary):
library(ICSKAT)
set.seed(0)
n <- 10^4
q <- 50
# generate data
# all SNPs have minor allele frequency 0.3 in this toy example
gMat <- matrix(data=rbinom(n=n*q, size=2, prob=0.3), nrow=n)
# gender and whether they take daily vitamins
xMat <- matrix(data=rbinom(n=n*2, size=1, prob=0.5), nrow=n)
# the baseline cumulative hazard function
bhFunInv <- function(x) {x}
# observation times
obsTimes <- 1:4
# no effect of either gender or daily vitamins
etaVec <- rep(0, n)
# generate data
outcomeDat <- gen_IC_data(bhFunInv = bhFunInv, obsTimes = obsTimes, windowHalf = 0.25,
probMiss = 0.1, etaVec = etaVec)
lt <- outcomeDat$leftTimes
rt <- outcomeDat$rightTimes
# indicators of left- and right-censoring
tpos_ind <- as.numeric(lt > 0)
obs_ind <- as.numeric(rt != Inf)
# make design matrix with cubic spline terms
dmats <- make_IC_dmat(xMat, lt, rt, obs_ind, tpos_ind)
# fit null model - only need to do this once for each genetic set (note there is no information
# on the SNPs used here)
nullFit <- ICSKAT_fit_null(init_beta = rep(0, 5), left_dmat = dmats$left_dmat, right_dmat=dmats$right_dmat,
obs_ind = obs_ind, tpos_ind = tpos_ind, lt = lt, rt = rt)
# perform the ICSKAT and Burden tests
icskatOut <- ICskat(left_dmat = dmats$left_dmat, right_dmat=dmats$right_dmat, lt = lt, rt = rt,
obs_ind = obs_ind, tpos_ind = tpos_ind, gMat = gMat, null_beta = as.numeric(nullFit$beta_fit), Itt = nullFit$Itt)
icskatOut$p_SKAT
## [1] 0.6011683
## [1] 0.256489
## [1] 0.4500931
If we want to test another SNP set (e.g. another gene), we don’t need to run the make_IC_dmat or ICSKAT_fit_null functions again. Just run ICskat() on the new genotype matrix.
# another gene with 100 SNPs in it
gMat2 <- matrix(data=rbinom(n=n*100, size=2, prob=0.3), nrow=n)
# we don't need to run the make_IC_dmat or ICSKAT_fit_null functions again as long
# as the event times and non-genetic covariates haven't changed.
icskatOut2 <- ICskat(left_dmat = dmats$left_dmat, right_dmat=dmats$right_dmat, lt = lt, rt = rt,
obs_ind = obs_ind, tpos_ind = tpos_ind, gMat = gMat2, null_beta = as.numeric(nullFit$beta_fit), Itt = nullFit$Itt)
icskatOut2$p_SKAT
## [1] 0.8072349
## [1] 0.4946116
## [1] 0.7827799
Questions or novel applications? Please let me know! Contact information can be found in the package description.