The R package BFI
(Bayesian
Federated Inference) provides several
functions to carry out the Bayesian Federated Inference method for two
types of models (GLM
and Survival
) using
multicenter data without the need to combine or share them. This
tutorial focuses on GLM
models. Two commonly used families
for GLM
models, "binomial"
and
"gaussian"
, are available for this version of the package.
The mostly using functions include bfi()
,
MAP.estimation()
, and inv.prior.cov()
. In the
following, we will see how the BFI
package can be applied
to real datasets included in the package.
Before we go on, we first install and load the BFI
package:
By using the following code, we can see that there are two available
datasets in the package: trauma
and
Nurses
.
The trauma
dataset can be utilized for the
"binomial"
family and Nurses
dataset can be
used for "gaussian"
family. To avoid repetition, we will
only use the trauma
dataset. By loading the package, the
datasets included will be loaded and can be inspected as follows:
## [1] 371 6
## sex age hospital ISS GCS mortality
## 1 1 20 3 24 15 0
## 2 0 38 3 34 13 0
## 3 0 37 3 50 15 0
## 4 0 17 3 43 4 1
## 5 0 49 3 29 15 0
## 6 0 30 3 22 15 0
## 7 1 84 2 66 3 1
This dataset consists of data of 371 trauma patients from three
hospitals (peripheral hospital without a neuro-surgical unit,
'status=1'
, peripheral hospital with a neuro-surgical unit,
status=2
, and academic medical center,
status=3
).
As we can see, it has 6 columns:
## [1] "sex" "age" "hospital" "ISS" "GCS" "mortality"
The covariates sex
(dichotomous), age
(continuous), ISS
(Injury Severity Score, continuous), and
GCS
(Glasgow Coma Scale, continuous) are the predictors,
and mortality
is the response variable.
hospital
is a categorical variable which indicates the
hospitals involved in the study. For more information about this dataset
use
We will analyze the data with a logistic
regression
model. First we standardize the covariates. This is not necessary for
the analysis, but is done for the interpretability of the accuracy of
the estimates.
trauma$age <- scale(trauma$age)
trauma$ISS <- scale(trauma$ISS)
trauma$GCS <- scale(trauma$GCS)
trauma$hospital <- as.factor(trauma$hospital)
By using the following code we can see there are three hospitals involved in the study:
## [1] 3
Therefore, the MAP.estimation
function should be applied
to these 3 local datasets separately to obtain the MAP estimations. Note
that, in practice, we do not have access to the combined data, and each
center should perform the analysis independently and send the output to
the central server, as follows:
# Center 1:
X1 <- data.frame(sex=trauma$sex[trauma$hospital==1],
age=trauma$age[trauma$hospital==1],
ISS=trauma$ISS[trauma$hospital==1],
GCS=trauma$GCS[trauma$hospital==1])
Lambda1 <- inv.prior.cov(X1, lambda=0.01, L=3, family="binomial")
fit1 <- MAP.estimation(y=trauma$mortality[trauma$hospital==1], X=X1, family="binomial", Lambda=Lambda1)
summary(fit1)
##
## Summary of the model:
##
## Formula: y ~ sex + age + ISS + GCS
## Family: 'binomial'
## Link: 'Logit'
##
## Coefficients:
##
## Estimate Std.Dev CI 2.5% CI 97.5%
## (Intercept) -2.1226 0.8074 -3.7050 -0.5402
## sex -5.2796 2.6305 -10.4352 -0.1239
## age 1.8864 0.7210 0.4732 3.2996
## ISS 2.3741 0.9251 0.5611 4.1872
## GCS -2.4522 0.8295 -4.0781 -0.8264
##
## Dispersion parameter (sigma2): 1
## log Lik Posterior: -10.74
## Convergence: 0
# Center 2:
X2 <- data.frame(sex=trauma$sex[trauma$hospital==2],
age=trauma$age[trauma$hospital==2],
ISS=trauma$ISS[trauma$hospital==2],
GCS=trauma$GCS[trauma$hospital==2])
Lambda2 <- inv.prior.cov(X2, lambda=0.01, L=3, family="binomial")
fit2 <- MAP.estimation(y=trauma$mortality[trauma$hospital==2], X=X2, family="binomial", Lambda=Lambda2)
summary(fit2)
##
## Summary of the model:
##
## Formula: y ~ sex + age + ISS + GCS
## Family: 'binomial'
## Link: 'Logit'
##
## Coefficients:
##
## Estimate Std.Dev CI 2.5% CI 97.5%
## (Intercept) -0.7854 0.3789 -1.5280 -0.0427
## sex -0.7850 0.6999 -2.1568 0.5868
## age 1.9601 0.4662 1.0465 2.8737
## ISS 0.5216 0.3673 -0.1983 1.2415
## GCS -1.8737 0.4115 -2.6803 -1.0672
##
## Dispersion parameter (sigma2): 1
## log Lik Posterior: -36.87
## Convergence: 0
# Center 3:
X3 <- data.frame(sex=trauma$sex[trauma$hospital==3],
age=trauma$age[trauma$hospital==3],
ISS=trauma$ISS[trauma$hospital==3],
GCS=trauma$GCS[trauma$hospital==3])
Lambda3 <- inv.prior.cov(X3, lambda=0.01, L=3, family="binomial")
fit3 <- MAP.estimation(y=trauma$mortality[trauma$hospital==3], X=X3, family="binomial", Lambda=Lambda3)
summary(fit3)
##
## Summary of the model:
##
## Formula: y ~ sex + age + ISS + GCS
## Family: 'binomial'
## Link: 'Logit'
##
## Coefficients:
##
## Estimate Std.Dev CI 2.5% CI 97.5%
## (Intercept) -2.3144 0.4020 -3.1024 -1.5264
## sex 0.1580 0.5714 -0.9619 1.2780
## age 1.3031 0.2955 0.7239 1.8824
## ISS 0.2967 0.2638 -0.2203 0.8138
## GCS -2.4221 0.4130 -3.2316 -1.6126
##
## Dispersion parameter (sigma2): 1
## log Lik Posterior: -50.63
## Convergence: 0
It can be seen that all algorithms converged
(Convergence: 0
). To see more information about the
dataset, such as the number of observations and parameters, we can use
the output of the MAP.estimation
function as follows:
## [1] 49
## [1] 5
## [1] 106
## [1] 216
Additionally, before conducting the analysis, we can use the
n.par
function to retrieve this information.
Then, only the outputs from the local centers (i.e.,
fit1
,fit2
, and fit3
) should be
sent to the central server for further analysis. On the central server,
the bfi()
function can be used to obtain the BFI
estimations:
theta_hats <- list(fit1$theta_hat, fit2$theta_hat, fit3$theta_hat)
A_hats <- list(fit1$A_hat, fit2$A_hat, fit3$A_hat)
Lambda_com <- inv.prior.cov(X1, lambda=0.01, L=3, family="binomial")
Lambdas <- list(Lambda1, Lambda2, Lambda3, Lambda_com)
BFI_fits <- bfi(theta_hats, A_hats, Lambdas)
summary(BFI_fits, cur_mat=TRUE)
##
## Summary of the model:
##
## Family: 'binomial'
## Link: 'Logit'
##
## Coefficients:
##
## Estimate Std.Dev CI 2.5% CI 97.5%
## (Intercept) -1.4434 0.2465 -1.9265 -0.9602
## sex -0.2473 0.4187 -1.0680 0.5733
## age 1.2189 0.2190 0.7897 1.6481
## ISS 0.4939 0.1945 0.1127 0.8751
## GCS -1.7375 0.2491 -2.2258 -1.2492
##
## Dispersion parameter (sigma2): 1
##
## Minus the Curvature Matrix:
##
## (Intercept) sex age ISS GCS
## (Intercept) 30.4330 7.7926 0.7577 8.3798 -16.6609
## sex 7.7926 7.8026 1.3061 1.8414 -4.6925
## age 0.7577 1.3061 31.5230 -4.8355 15.7494
## ISS 8.3798 1.8414 -4.8355 30.7882 -11.7190
## GCS -16.6609 -4.6925 15.7494 -11.7190 34.4509
To compare the performance of the BFI methodology, we can combine the datasets and obtain the MAP estimations based on the combined data:
# MAP estimates of the combined data:
X_combined <- data.frame(sex=trauma$sex,
age=trauma$age,
ISS=trauma$ISS,
GCS=trauma$GCS)
Lambda <- inv.prior.cov(X=X_combined, lambda=0.1, L=3, family="binomial")
fit_comb <- MAP.estimation(y=trauma$mortality, X=X_combined, family="binomial", Lambda=Lambda)
summary(fit_comb, cur_mat=TRUE)
##
## Summary of the model:
##
## Formula: y ~ sex + age + ISS + GCS
## Family: 'binomial'
## Link: 'Logit'
##
## Coefficients:
##
## Estimate Std.Dev CI 2.5% CI 97.5%
## (Intercept) -1.6110 0.2368 -2.0752 -1.1468
## sex -0.3346 0.3946 -1.1079 0.4388
## age 1.3575 0.2034 0.9588 1.7562
## ISS 0.5490 0.1852 0.1860 0.9119
## GCS -1.9808 0.2348 -2.4411 -1.5205
##
## Dispersion parameter (sigma2): 1
## log Lik Posterior: -109.8
## Convergence: 0
##
## Minus the Curvature Matrix:
##
## (Intercept) sex age ISS GCS
## (Intercept) 33.9101 8.5588 2.4320 8.3617 -18.2744
## sex 8.5588 8.6588 1.8922 1.4043 -4.5371
## age 2.4320 1.8922 37.5028 -6.1562 18.0250
## ISS 8.3617 1.4043 -6.1562 33.7608 -12.8879
## GCS -18.2744 -4.5371 18.0250 -12.8879 38.8457
Now, we can see the difference between the BFI and combined estimates:
## (Intercept) sex age ISS GCS
## [1,] 0.02811718 0.007613421 0.01921903 0.003037933 0.05919514
which are close to zero, as expected!
For more details see the following references.
Jonker M.A., Pazira H. and Coolen A.C.C. (2024). Bayesian federated inference for estimating statistical models based on non-shared multicenter data sets, Statistics in Medicine, 43(12): 2421-2438. https://doi.org/10.1002/sim.10072
Pazira H., Massa E., Weijers J.A.M., Coolen A.C.C. and Jonker M.A. (2024). Bayesian Federated Inference for Survival Models, arXiv. https://arxiv.org/abs/2404.17464
Jonker M.A., Pazira H. and Coolen A.C.C. (2024b). Bayesian Federated Inference for regression models with heterogeneous multi-center populations, arXiv. https://arxiv.org/abs/2402.02898
If you find any errors, have any suggestions, or would like to request that something be added, please file an issue at issue report or send an email to: hassan.pazira@radboudumc.nl or Marianne.Jonker@radboudumc.nl.