An Introduction to BFI

Hassan Pazira

Marianne Jonker

2024-07-03

Overview

The R package BFI (Bayesian Federated Inference) provides several functions to carry out the Bayesian Federated Inference method for two types of models (GLM and Survival) using multicenter data without the need to combine or share them. This tutorial focuses on GLM models. Two commonly used families for GLM models, "binomial" and "gaussian", are available for this version of the package. The mostly using functions include bfi(), MAP.estimation(), and inv.prior.cov(). In the following, we will see how the BFI package can be applied to real datasets included in the package.

How to use it?

Before we go on, we first install and load the BFI package:

# Install and load the BFI package from CRAN:
install.packages("BFI")
library(BFI)

By using the following code, we can see that there are two available datasets in the package: trauma and Nurses.

data(package = "BFI")

The trauma dataset can be utilized for the "binomial" family and Nurses dataset can be used for "gaussian" family. To avoid repetition, we will only use the trauma dataset. By loading the package, the datasets included will be loaded and can be inspected as follows:

# Get the number of rows and columns
dim(trauma)
## [1] 371   6
# To get an idea of the dataset, print the first 7 rows
head(trauma, 7)
##   sex age hospital ISS GCS mortality
## 1   1  20        3  24  15         0
## 2   0  38        3  34  13         0
## 3   0  37        3  50  15         0
## 4   0  17        3  43   4         1
## 5   0  49        3  29  15         0
## 6   0  30        3  22  15         0
## 7   1  84        2  66   3         1

This dataset consists of data of 371 trauma patients from three hospitals (peripheral hospital without a neuro-surgical unit, 'status=1', peripheral hospital with a neuro-surgical unit, status=2, and academic medical center, status=3).

As we can see, it has 6 columns:

(col_name <- colnames(trauma))
## [1] "sex"       "age"       "hospital"  "ISS"       "GCS"       "mortality"

The covariates sex (dichotomous), age (continuous), ISS (Injury Severity Score, continuous), and GCS (Glasgow Coma Scale, continuous) are the predictors, and mortality is the response variable. hospital is a categorical variable which indicates the hospitals involved in the study. For more information about this dataset use

# Get some info about the dataset from the help file
?trauma

We will analyze the data with a logistic regression model. First we standardize the covariates. This is not necessary for the analysis, but is done for the interpretability of the accuracy of the estimates.

trauma$age <- scale(trauma$age)
trauma$ISS <- scale(trauma$ISS) 
trauma$GCS <- scale(trauma$GCS) 
trauma$hospital <- as.factor(trauma$hospital)

By using the following code we can see there are three hospitals involved in the study:

length(levels(trauma$hospital))
## [1] 3

MAP estimations

Therefore, the MAP.estimation function should be applied to these 3 local datasets separately to obtain the MAP estimations. Note that, in practice, we do not have access to the combined data, and each center should perform the analysis independently and send the output to the central server, as follows:

# Center 1:
X1 <- data.frame(sex=trauma$sex[trauma$hospital==1],
                 age=trauma$age[trauma$hospital==1],
                 ISS=trauma$ISS[trauma$hospital==1],
                 GCS=trauma$GCS[trauma$hospital==1])
Lambda1 <- inv.prior.cov(X1, lambda=0.01, L=3, family="binomial")
fit1 <- MAP.estimation(y=trauma$mortality[trauma$hospital==1], X=X1, family="binomial", Lambda=Lambda1)
summary(fit1)
## 
## Summary of the model:
## 
##    Formula: y ~ sex + age + ISS + GCS 
##     Family: 'binomial' 
##       Link: 'Logit'
## 
## Coefficients:
## 
##             Estimate Std.Dev  CI 2.5% CI 97.5%
## (Intercept)  -2.1226  0.8074  -3.7050  -0.5402
## sex          -5.2796  2.6305 -10.4352  -0.1239
## age           1.8864  0.7210   0.4732   3.2996
## ISS           2.3741  0.9251   0.5611   4.1872
## GCS          -2.4522  0.8295  -4.0781  -0.8264
## 
## Dispersion parameter (sigma2):  1 
##             log Lik Posterior:  -10.74 
##                   Convergence:  0
# Center 2:
X2 <- data.frame(sex=trauma$sex[trauma$hospital==2],
                 age=trauma$age[trauma$hospital==2],
                 ISS=trauma$ISS[trauma$hospital==2],
                 GCS=trauma$GCS[trauma$hospital==2])
Lambda2 <- inv.prior.cov(X2, lambda=0.01, L=3, family="binomial")
fit2 <- MAP.estimation(y=trauma$mortality[trauma$hospital==2], X=X2, family="binomial", Lambda=Lambda2)
summary(fit2)
## 
## Summary of the model:
## 
##    Formula: y ~ sex + age + ISS + GCS 
##     Family: 'binomial' 
##       Link: 'Logit'
## 
## Coefficients:
## 
##             Estimate Std.Dev CI 2.5% CI 97.5%
## (Intercept)  -0.7854  0.3789 -1.5280  -0.0427
## sex          -0.7850  0.6999 -2.1568   0.5868
## age           1.9601  0.4662  1.0465   2.8737
## ISS           0.5216  0.3673 -0.1983   1.2415
## GCS          -1.8737  0.4115 -2.6803  -1.0672
## 
## Dispersion parameter (sigma2):  1 
##             log Lik Posterior:  -36.87 
##                   Convergence:  0
# Center 3:
X3 <- data.frame(sex=trauma$sex[trauma$hospital==3],
                 age=trauma$age[trauma$hospital==3],
                 ISS=trauma$ISS[trauma$hospital==3],
                 GCS=trauma$GCS[trauma$hospital==3])
Lambda3 <- inv.prior.cov(X3, lambda=0.01, L=3, family="binomial")
fit3 <- MAP.estimation(y=trauma$mortality[trauma$hospital==3], X=X3, family="binomial", Lambda=Lambda3)
summary(fit3)
## 
## Summary of the model:
## 
##    Formula: y ~ sex + age + ISS + GCS 
##     Family: 'binomial' 
##       Link: 'Logit'
## 
## Coefficients:
## 
##             Estimate Std.Dev CI 2.5% CI 97.5%
## (Intercept)  -2.3144  0.4020 -3.1024  -1.5264
## sex           0.1580  0.5714 -0.9619   1.2780
## age           1.3031  0.2955  0.7239   1.8824
## ISS           0.2967  0.2638 -0.2203   0.8138
## GCS          -2.4221  0.4130 -3.2316  -1.6126
## 
## Dispersion parameter (sigma2):  1 
##             log Lik Posterior:  -50.63 
##                   Convergence:  0

It can be seen that all algorithms converged (Convergence: 0). To see more information about the dataset, such as the number of observations and parameters, we can use the output of the MAP.estimation function as follows:

# number of samples in center 1
fit1$n
## [1] 49
# number of parameters in center 1
fit1$np
## [1] 5
# number of samples in center 2
fit2$n
## [1] 106
# number of samples in center 3
fit3$n
## [1] 216

Additionally, before conducting the analysis, we can use the n.par function to retrieve this information.

BFI estimations

Then, only the outputs from the local centers (i.e., fit1,fit2, and fit3) should be sent to the central server for further analysis. On the central server, the bfi() function can be used to obtain the BFI estimations:

theta_hats <- list(fit1$theta_hat, fit2$theta_hat, fit3$theta_hat)
A_hats     <- list(fit1$A_hat, fit2$A_hat, fit3$A_hat)
Lambda_com <- inv.prior.cov(X1, lambda=0.01, L=3, family="binomial")
Lambdas    <- list(Lambda1, Lambda2, Lambda3, Lambda_com)
BFI_fits   <- bfi(theta_hats, A_hats, Lambdas)
summary(BFI_fits, cur_mat=TRUE)
## 
## Summary of the model:
## 
##     Family: 'binomial' 
##       Link: 'Logit'
## 
## Coefficients:
## 
##             Estimate Std.Dev CI 2.5% CI 97.5%
## (Intercept)  -1.4434  0.2465 -1.9265  -0.9602
## sex          -0.2473  0.4187 -1.0680   0.5733
## age           1.2189  0.2190  0.7897   1.6481
## ISS           0.4939  0.1945  0.1127   0.8751
## GCS          -1.7375  0.2491 -2.2258  -1.2492
## 
## Dispersion parameter (sigma2):  1 
## 
## Minus the Curvature Matrix: 
## 
##             (Intercept)     sex     age      ISS      GCS
## (Intercept)     30.4330  7.7926  0.7577   8.3798 -16.6609
## sex              7.7926  7.8026  1.3061   1.8414  -4.6925
## age              0.7577  1.3061 31.5230  -4.8355  15.7494
## ISS              8.3798  1.8414 -4.8355  30.7882 -11.7190
## GCS            -16.6609 -4.6925 15.7494 -11.7190  34.4509

Comparison

To compare the performance of the BFI methodology, we can combine the datasets and obtain the MAP estimations based on the combined data:

# MAP estimates of the combined data:
X_combined  <- data.frame(sex=trauma$sex,
                          age=trauma$age,
                          ISS=trauma$ISS,
                          GCS=trauma$GCS)
Lambda   <- inv.prior.cov(X=X_combined, lambda=0.1, L=3, family="binomial")
fit_comb  <- MAP.estimation(y=trauma$mortality, X=X_combined, family="binomial", Lambda=Lambda) 
summary(fit_comb, cur_mat=TRUE)
## 
## Summary of the model:
## 
##    Formula: y ~ sex + age + ISS + GCS 
##     Family: 'binomial' 
##       Link: 'Logit'
## 
## Coefficients:
## 
##             Estimate Std.Dev CI 2.5% CI 97.5%
## (Intercept)  -1.6110  0.2368 -2.0752  -1.1468
## sex          -0.3346  0.3946 -1.1079   0.4388
## age           1.3575  0.2034  0.9588   1.7562
## ISS           0.5490  0.1852  0.1860   0.9119
## GCS          -1.9808  0.2348 -2.4411  -1.5205
## 
## Dispersion parameter (sigma2):  1 
##             log Lik Posterior:  -109.8 
##                   Convergence:  0 
## 
## Minus the Curvature Matrix: 
## 
##             (Intercept)     sex     age      ISS      GCS
## (Intercept)     33.9101  8.5588  2.4320   8.3617 -18.2744
## sex              8.5588  8.6588  1.8922   1.4043  -4.5371
## age              2.4320  1.8922 37.5028  -6.1562  18.0250
## ISS              8.3617  1.4043 -6.1562  33.7608 -12.8879
## GCS            -18.2744 -4.5371 18.0250 -12.8879  38.8457

Now, we can see the difference between the BFI and combined estimates:

# Squared Errors:
(fit_comb$theta_hat - BFI_fits$theta_hat)^2
##      (Intercept)         sex        age         ISS        GCS
## [1,]  0.02811718 0.007613421 0.01921903 0.003037933 0.05919514

which are close to zero, as expected!

For more details see the following references.

References

Jonker M.A., Pazira H. and Coolen A.C.C. (2024). Bayesian federated inference for estimating statistical models based on non-shared multicenter data sets, Statistics in Medicine, 43(12): 2421-2438. https://doi.org/10.1002/sim.10072

Pazira H., Massa E., Weijers J.A.M., Coolen A.C.C. and Jonker M.A. (2024). Bayesian Federated Inference for Survival Models, arXiv. https://arxiv.org/abs/2404.17464

Jonker M.A., Pazira H. and Coolen A.C.C. (2024b). Bayesian Federated Inference for regression models with heterogeneous multi-center populations, arXiv. https://arxiv.org/abs/2402.02898

Contact

If you find any errors, have any suggestions, or would like to request that something be added, please file an issue at issue report or send an email to: or .