The analysis of multivariate count data arises in numerous fields including genomics, image analysis, text mining, and sport analytics. The multinomial logit model is limiting due to its restrictive mean-variance structure. Moreover, it assumes that counts of different categories are negatively correlated. Models that allow over-dispersion and possess more flexible positive and/or negative correlation structures offer more realism. We implement four models in the R package MGLM
: multinomial logit (MN), Dirichlet multinomial (DM), generalized Dirichlet multinomial (GDM), and negative mutinomial (NegMN). Distribution fitting, regression, hypothesis testing, and variable selection are treated in a unified framework. The multivariate count data we analyze here has \(d\) categories.
The function MGLMfit
fits various multivariate discrete distributions and outputs a list with the maximum likelihood estimate (MLE) and relevant statistics.
When fitting distributions, i.e. no covariates involved, MN is a sub-model of DM, and DM is a sub-model of GDM. <!–%We can perform the likelihood ratio test (LRT) to make comparison between the three models.–>
MGLMfit
outputs the p-value of the likelihood ratio test (LRT) for comparing the fitted model with the most commonly used multinomial model. The NegMN model does not have a nesting relationship with any of the other three models. Therefore, no LRT is performed when fitting a NegMN distribution.
We first generate data from a multinomial distribution. Note the multinomial parameter (must be positive) supplied to the rmn
function is automatically scaled to be a probability vector.
library(MGLM)
set.seed(123)
n <- 200
d <- 4
alpha <- rep(1, d)
m <- 50
Y <- rmn(n, m, alpha)
Multinomial distribution fitting, although trivial, is implemented.
mnFit <- MGLMfit(Y, dist="MN")
show(mnFit)
## estimate SE
## alpha_1 0.2568 0.030891
## alpha_2 0.2467 0.030483
## alpha_3 0.2451 0.030416
## alpha_4 0.2514 0.030676
##
## Distribution: Multinomial
## Log-likelihood: -1457.788
## BIC: 2931.471
## AIC: 2921.576
## LRT test p value:
## Iterations:
As a comparison, we fit the DM distribution to the same data set. The results indicate that using a more complex model on the multinomial data shows no advantage.
compareFit <- MGLMfit(Y, dist="DM")
show(compareFit)
## estimate SE
## alpha_1 5270901 786946676
## alpha_2 5063596 755995891
## alpha_3 5030755 751092797
## alpha_4 5160065 770398732
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1457.788
## BIC: 2936.769
## AIC: 2923.576
## LRT test p value: 1.000
## Iterations: 35
Both the DM parameter estimates and their standard errors are large, indicating possible overfitting by the DM model. This is confirmed by the fact that the p-value of the LRT for comparing MN to DM is close to 1.
DM is a Dirichlet mixture of multinomials and allows over-dispersion. Similar to the MN model, it assumes that the counts of any two different categories are negatively correlated. We generate the data from the DM model and fit the DM distribution.
set.seed(123)
n <- 200
d <- 4
alpha <- rep(1, d)
m <- 50
Y <- rdirmn(n, m, alpha)
dmFit <- MGLMfit(Y, dist="DM")
show(dmFit)
## estimate SE
## alpha_1 0.976670 0.076589
## alpha_2 0.995142 0.079255
## alpha_3 1.006120 0.080033
## alpha_4 0.904500 0.072547
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -2011.225
## BIC: 4043.644
## AIC: 4030.451
## LRT test p value: <0.0001
## Iterations: 4
The estimate is very close to the true value with small standard errors. The LRT shows that the DM model is significantly better than the MN model.
GDM model uses \(d-2\) more parameters than the DM model and allows both positive and negative correlations among different categories. DM is a sub-model of GDM. Here we fit a GDM model to the above DM sample.
gdmFit <- MGLMfit(Y, dist="GDM")
show(gdmFit)
## estimate SE
## alpha_1 1.158474 0.123403
## alpha_2 0.993293 0.437239
## alpha_3 0.839967 0.106374
## beta_1 3.706863 0.226414
## beta_2 1.979389 0.094645
## beta_3 0.759641 0.084403
##
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -2007.559
## BIC: 4046.907
## AIC: 4027.117
## LRT test p value: <0.0001
## Iterations: 27
GDM yields a slightly larger log-likelihood value but a larger BIC, suggesting DM as a preferred model. p-value indiciates GDM is still significantly better thant the MN model. <!–%This is confirmed by a formal LRT
% <
set.seed(124)
n <- 200
d <- 4
alpha <- rep(1, d-1)
beta <- rep(1, d-1)
m <- 50
Y <- rgdirmn(n, m, alpha, beta)
gdmFit <- MGLMfit(Y, dist="GDM")
gdmFit
## estimate SE
## alpha_1 1.019835 0.103480
## alpha_2 0.826125 0.109316
## alpha_3 0.774387 0.092568
## beta_1 1.062112 0.095566
## beta_2 0.846216 0.108725
## beta_3 0.924559 0.135008
##
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -1820.616
## BIC: 3673.021
## AIC: 3653.231
## LRT test p value: <0.0001
## Iterations: 24
NegMN model is a multivariate analog of the negative binomial model. It assumes positive correlation among the counts. The following code generates data from the NegMN model and fit the NegMN distribution,
set.seed(1220)
n <- 100
d <- 4
p <- 5
prob <- rep(0.2, d)
beta <- 10
Y <- rnegmn(n, beta, prob)
negmnFit <- MGLMfit(Y, dist="NegMN")
show(negmnFit)
## estimate SE
## p_1 0.188151 0.009584
## p_2 0.194311 0.009837
## p_3 0.191511 0.009722
## p_4 0.196177 0.009914
## phi 12.313935 2.266097
##
## Distribution: Negative Multinomial
## Log-likelihood: -1104.579
## BIC: 2232.184
## AIC: 2219.158
## LRT test p value:
## Iterations: 4
Because MN is not a sub-model of NegMN, no LRT is performed here.
In regression, the \(n \times p\) covariate matrix \(X\) is similar to that used in the glm
function. The response should be a \(n \times d\) count matrix. Unlike estimating a parameter vector \(\beta\) in GLM, we need to estimate a parameter matrix \(B\) when the responses are multivariate. The dimension of the parameter matrix depends on the model:
The GDM model provides the most flexibility, but also requires most parameters. In the function MGLMreg
for regression, the option dist="MN"
, "DM"
, "GDM"
or "NegMN"
specifies the model.
The rows \(B_{j,\cdot}\) of the parameter matrix correspond to covariates. By default, the function output the Wald test statistics and p-values for testing \(H_0: B_{j,\cdot}={\bf 0}\) vs \(H_a: B_{j, \cdot}\neq {\bf 0}\). If specifying the option LRT=TRUE
, the function also outputs LRT statistics and p-values.
Next, we demonstrate that model mis-specification results in failure of hypothesis testing. We simulate response data from the GDM model. Covariates \(X_1\) and \(X_2\) have no effect while \(X_3\), \(X_4\), \(X_5\) have different effect sizes.
set.seed(1234)
n <- 200
p <- 5
d <- 4
X <- matrix(runif(p * n), n, p)
alpha <- matrix(c(0.6, 0.8, 1), p, d - 1, byrow=TRUE)
alpha[c(1, 2),] <- 0
Alpha <- exp(X %*% alpha)
beta <- matrix(c(1.2, 1, 0.6), p, d - 1, byrow=TRUE)
beta[c(1, 2),] <- 0
Beta <- exp(X %*% beta)
m <- runif(n, min=0, max=25) + 25
Y <- rgdirmn(n, m, Alpha, Beta)
We fit various regression models and test significance of covariates.
mnReg <- MGLMreg(Y~0+X, dist="MN")
show(mnReg)
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "MN")
##
## Coefficients:
## Col_1 Col_2 Col_3
## [1,] 0.277063 -0.182760 -0.123204
## [2,] 0.543064 0.422730 0.246523
## [3,] 0.333252 0.517605 0.221851
## [4,] 0.356843 0.486722 0.565427
## [5,] -0.302455 0.192508 0.323713
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 24.63244 1.842859e-05
## X2 21.99680 6.533133e-05
## X3 23.10908 3.832310e-05
## X4 25.07475 1.489470e-05
## X5 49.37327 1.086326e-10
##
## Distribution: Multinomial
## Log-likelihood: -2194.448
## BIC: 4468.371
## AIC: 4418.896
## Iterations: 5
The Wald test shows that all predictors, including the null predictors \(X_1\) and \(X_2\), are significant.
dmReg <- MGLMreg(Y~0+X, dist="DM")
show(dmReg)
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "DM")
##
## Coefficients:
## Col_1 Col_2 Col_3 Col_4
## [1,] 0.154137 -0.118264 -0.188339 -0.013172
## [2,] 0.183264 0.142034 -0.183394 -0.333886
## [3,] 1.143146 1.254828 1.092635 0.811259
## [4,] 0.392703 0.545421 0.590015 0.131132
## [5,] 0.226350 0.660108 0.939599 0.487034
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 3.349794 0.501082
## X2 7.845339 0.097411
## X3 25.497386 0.000040
## X4 8.735121 0.068072
## X5 23.136042 0.000119
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1683.961
## BIC: 3473.889
## AIC: 3407.922
## Iterations: 7
Wald test declares that \(X1\), \(X2\) and \(X4\) have not effects, but \(X3\) and \(X5\) are significant.
gdmReg <- MGLMreg(Y~0+X, dist="GDM")
show(gdmReg)
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "GDM")
##
## Coefficients:
## alpha_Col_2 alpha_Col_1 alpha_Col_3 beta_Col_2 beta_Col_1 beta_Col_3
## [1,] -0.283917 0.190503 0.315706 -0.400203 0.684651 0.467592
## [2,] -0.209171 0.395541 0.014426 -0.508254 0.552645 -0.171410
## [3,] 1.090140 1.243785 1.171786 1.304955 1.537526 0.916033
## [4,] 0.296819 0.405333 0.809087 0.487838 0.573695 0.305889
## [5,] 0.601841 0.030123 1.281219 1.430931 0.260197 0.853498
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 9.424379 1.510802e-01
## X2 4.865683 5.611523e-01
## X3 26.828718 1.559064e-04
## X4 12.607034 4.971848e-02
## X5 38.637877 8.427803e-07
##
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -1676.399
## BIC: 3511.748
## AIC: 3412.798
## Iterations: 21
When using the correct model GDM, the Wald test is able to differentiate the null effects from the significant ones. GDM regression yields the highest log-likelihood and smallest BIC.
negReg <- MGLMreg(Y~0+X, dist="NegMN", regBeta=FALSE)
show(negReg)
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "NegMN", regBeta = FALSE)
##
## Coefficients:
## $alpha
## Col_1 Col_2 Col_3 Col_4
## X1 0.243606 -0.216368 -0.156525 -0.031371
## X2 0.061896 -0.055885 -0.232635 -0.479780
## X3 -0.160914 0.022688 -0.271238 -0.495127
## X4 -0.176181 -0.043828 0.034781 -0.530126
## X5 -0.607974 -0.115829 0.012937 -0.315857
##
## $phi
## phi
## 13.77531
##
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 24.99903 5.033252e-05
## X2 24.90077 5.267448e-05
## X3 29.32830 6.704198e-06
## X4 28.18693 1.143075e-05
## X5 60.50179 2.275444e-12
##
## Distribution: Negative Multinomial
## Log-likelihood: -2908.896
## BIC: 5929.056
## AIC: 5859.792
## Iterations: 15
Again, the Wald test declares all predictors to be significant.