In the section on univariate, bivariate and trivariate entropies, we saw that the bivariate entropy of two variables \(X\) and \(Y\) is bounded according to \[H(X) \leq H(X,Y) \leq H(X)+H(Y) \ .\] The increment between the lower bound and the bivariate entropy is equal to the expected conditional entropy \[EH(Y|X)=H(X,Y)-H(X)\] which measures how far we are from the functional dependence \(X\rightarrow Y\) (meaning that \(X\) uniquely determines \(Y\)). This measure is equal to 0 if and only if \(p(x,y) = p(x,+)\) for every pair \((x,y)\) with \(p(x,y)>0\), where \(p(x,+)\) is the marginal probability of \(X=x\); that is, if and only if \(X\) uniquely determines \(Y\).
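The bounds and the zero case can be checked numerically. The following is a minimal Python sketch, not code from the package: the helper `entropy()` and the toy vectors `x`, `y` and `w` are hypothetical, and entropies are assumed to be measured in bits (base-2 logarithms).

```python
from collections import Counter
from math import log2

def entropy(observations):
    """Empirical Shannon entropy in bits of a list of hashable outcomes."""
    n = len(observations)
    counts = Counter(observations)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Hypothetical toy data: y is a function of x, w is not.
x = [0, 0, 1, 1, 2, 2]
y = [0, 0, 1, 1, 1, 1]  # each value of x occurs with exactly one value of y
w = [0, 1, 0, 1, 0, 1]  # each value of x occurs with two values of w

h_x = entropy(x)
h_y = entropy(y)
h_xy = entropy(list(zip(x, y)))  # bivariate entropy H(X,Y)
h_xw = entropy(list(zip(x, w)))  # bivariate entropy H(X,W)

# Expected conditional entropies as increments over the lower bound H(X).
eh_y_given_x = h_xy - h_x  # 0: X uniquely determines Y
eh_w_given_x = h_xw - h_x  # 1 bit: two equally likely outcomes of W per x

print(round(eh_y_given_x, 3), round(eh_w_given_x, 3))
```

Here `eh_y_given_x` is zero because every observed pair satisfies the functional dependence condition, while `eh_w_given_x` equals one bit.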
Similarly, trivariate entropies for triples of variables \(X,Y,Z\) are bounded by \[ H(X,Y) \leq H(X,Y,Z) \leq H(X,Z) + H(Y,Z) - H(Z) \] and the increment between the trivariate entropy and its lower bound is equal to the expected conditional entropy given by \[EH(Z|X,Y) = H(X,Y,Z)-H(X,Y)\] which is non-negative and equal to 0 if and only if there is functional dependence \((X,Y)\rightarrow Z\). Thus, \(EH(Z|X,Y)\) measures the prediction uncertainty when \((X,Y)\) is used to predict \(Z\).
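As a reminder of why this difference is an expected value, it can be written as an average of the entropies of \(Z\) within each outcome of \((X,Y)\). The notation \(p(x,y,z)\) and \(p(x,y,+)\) is assumed here by analogy with \(p(x,+)\) above, and base-2 logarithms are assumed: \[EH(Z|X,Y)=\sum_{x,y}p(x,y,+)\,H(Z|X=x,Y=y)=-\sum_{x,y,z}p(x,y,z)\log_2\frac{p(x,y,z)}{p(x,y,+)}\] Every term in the last sum is non-negative since \(p(x,y,z)\leq p(x,y,+)\), and the sum vanishes exactly when \(p(x,y,z)=p(x,y,+)\) for all triples with \(p(x,y,z)>0\), which is the functional dependence \((X,Y)\rightarrow Z\).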
The quantity \(EH=EH(Z|X,Y)\) is a logarithmic measure of how many outcomes of \(Z\) there are on average when the outcomes of \(X\) and \(Y\) are given. If \(EH\) is rounded to its closest integer, we get an unambiguous prediction value for \(Z\) based on the predictors \(X\) and \(Y\) when \(EH < 0.5\), two prediction values for \(Z\) when \(0.5\leq EH < 1.5\), and so on. Thus, prediction power is a decreasing function of \(EH\).
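The rounding rule can be sketched in a few lines of Python. The function names `rounded_eh()` and `describe()` are hypothetical, and halves are assumed to round upwards so that \(EH = 0.5\) falls in the second interval, as stated above. The values are the \(EH\) entries involving years taken from the output for status shown later in this section.

```python
from math import floor

def rounded_eh(eh):
    """Round EH to the nearest integer, with halves rounded up (0.5 -> 1)."""
    return floor(eh + 0.5)

def describe(eh):
    """Translate a rounded EH into the wording used in the text."""
    k = rounded_eh(eh)
    if k == 0:
        return "unambiguous prediction value"
    if k == 1:
        return "two prediction values"
    # The text only says "and so on" for larger values, so no count is claimed.
    return f"rounded EH = {k}"

# EH(status | years, Y) for each second predictor Y, copied from the output.
eh_with_years = {
    "gender": 0.670, "office": 0.493, "age": 0.573, "practice": 0.682,
    "lawschool": 0.554, "cowork": 0.691, "advice": 0.667, "friend": 0.684,
}

labels = {y: describe(eh) for y, eh in eh_with_years.items()}
for y, label in labels.items():
    print(f"years + {y}: {label}")
```

Under this rule only the pair years and office falls below 0.5; every other pair involving years rounds to 1.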
We create a dataframe dyad.var consisting of dyad variables as described and created in the section on variable domains and data editing. Similar analyses can be performed on observed and/or transformed dataframes with vertex or triad variables. The first six rows of dyad.var are:
##   status gender office years age practice lawschool cowork advice friend
## 1      3      3      0     8   8        1         0      0      3      2
## 2      3      3      3     5   8        3         0      0      0      0
## 3      3      3      3     5   8        2         0      0      1      0
## 4      3      3      0     8   8        1         6      0      1      2
## 5      3      3      0     8   8        0         6      0      1      1
## 6      3      3      1     7   8        1         6      0      1      1
The function prediction_power() computes prediction power when pairs of variables in a given dataframe are used to predict a third variable from the same dataframe. The variable to be predicted and the dataframe containing it are given as input arguments, and the output is an upper triangular matrix giving the expected conditional entropies for pairs of row and column variables of the matrix, i.e. \(EH(Z|X,Y)\). The diagonal gives \(EH(Z|X)\), that is, the expected conditional entropy when only one variable is used as a predictor. Note that the row and column representing the variable being predicted contain only NAs.
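To make the layout of this matrix concrete, here is a minimal Python sketch of the computation as described above. It is not the package's implementation: the names `joint_entropy()` and `prediction_power_sketch()`, the use of `None` in place of NA, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

def joint_entropy(rows, names):
    """Empirical entropy in bits of the joint distribution of the named columns."""
    n = len(rows)
    counts = Counter(tuple(row[name] for name in names) for row in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def prediction_power_sketch(target, rows):
    """Upper triangular matrix of EH(target | X, Y); None plays the role of NA."""
    names = list(rows[0].keys())
    matrix = {r: {c: None for c in names} for r in names}
    for i, x in enumerate(names):
        for y in names[i:]:  # upper triangle, diagonal included
            if target in (x, y):
                continue  # row and column of the target stay NA
            predictors = [x] if x == y else [x, y]
            matrix[x][y] = (joint_entropy(rows, predictors + [target])
                            - joint_entropy(rows, predictors))
    return matrix

# Hypothetical toy data: z is determined by a and b together, but by neither alone.
rows = [{"z": a ^ b, "a": a, "b": b} for a in (0, 1) for b in (0, 1)]
result = prediction_power_sketch("z", rows)

print(result["a"]["a"])  # EH(z | a)    -> 1.0
print(result["a"]["b"])  # EH(z | a, b) -> 0.0
print(result["b"]["a"])  # lower triangle -> None
```

In the toy data each single predictor leaves one bit of uncertainty about `z`, while the pair removes it entirely, so the off-diagonal cell is smaller than both diagonal cells.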
Assume we are interested in predicting the variable status (that is, whether a lawyer in the data set is an associate or partner). Calling prediction_power() with status as the variable to be predicted and dyad.var as the dataframe gives the following output:
##             status gender office years   age practice lawschool cowork advice friend
## status        NA     NA     NA    NA    NA       NA        NA     NA     NA     NA
## gender        NA  1.375  1.180 0.670 0.855    1.304     1.225  1.306  1.263  1.270
## office        NA     NA  2.147 0.493 0.820    1.374     1.245  1.373  1.325  1.334
## years         NA     NA     NA 2.265 0.573    0.682     0.554  0.691  0.667  0.684
## age           NA     NA     NA    NA 1.877    1.089     0.958  1.087  1.052  1.058
## practice      NA     NA     NA    NA    NA    2.446     1.388  1.459  1.410  1.427
## lawschool     NA     NA     NA    NA    NA       NA     3.335  1.390  1.337  1.350
## cowork        NA     NA     NA    NA    NA       NA        NA  2.419  1.400  1.411
## advice        NA     NA     NA    NA    NA       NA        NA     NA  2.781  1.407
## friend        NA     NA     NA    NA    NA       NA        NA     NA     NA  3.408
The prediction powers of different predictors are easier to compare in a prediction plot, which displays a color matrix with rows for \(X\) and columns for \(Y\), with darker cells indicating higher prediction power for \(Z\). This is shown for the prediction of status:
Obviously, the darkest color is obtained when the variable to be predicted is included among the predictors. The cells show prediction power for a single predictor on the diagonal and for two predictors symmetrically off the diagonal. Some findings are as follows: good predictors for status are years in combination with any other variable, and age in combination with any other variable. The best sole predictor is gender.
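These findings can also be read off the numerical output directly. The Python sketch below copies the upper triangle of the matrix for status printed above and ranks the entries; the variable names are hypothetical and the code is independent of the package.

```python
names = ["gender", "office", "years", "age", "practice",
         "lawschool", "cowork", "advice", "friend"]

# Upper triangle of the printed matrix; each list starts at the diagonal entry.
upper = {
    "gender":    [1.375, 1.180, 0.670, 0.855, 1.304, 1.225, 1.306, 1.263, 1.270],
    "office":    [2.147, 0.493, 0.820, 1.374, 1.245, 1.373, 1.325, 1.334],
    "years":     [2.265, 0.573, 0.682, 0.554, 0.691, 0.667, 0.684],
    "age":       [1.877, 1.089, 0.958, 1.087, 1.052, 1.058],
    "practice":  [2.446, 1.388, 1.459, 1.410, 1.427],
    "lawschool": [3.335, 1.390, 1.337, 1.350],
    "cowork":    [2.419, 1.400, 1.411],
    "advice":    [2.781, 1.407],
    "friend":    [3.408],
}

single = {}  # diagonal: one predictor
pairs = {}   # off-diagonal: two predictors
for i, x in enumerate(names):
    for offset, eh in enumerate(upper[x]):
        y = names[i + offset]
        if offset == 0:
            single[x] = eh
        else:
            pairs[(x, y)] = eh

# Lower EH means higher prediction power.
best_single = min(single, key=single.get)
best_pair = min(pairs, key=pairs.get)
worst_years_pair = max(eh for pair, eh in pairs.items() if "years" in pair)

print(best_single, single[best_single])  # gender 1.375
print(best_pair, pairs[best_pair])       # ('office', 'years') 0.493
print(worst_years_pair)                  # 0.691
```

Every pair that includes years has \(EH\) below 0.7, which is lower than any pair without years, and the smallest diagonal entry belongs to gender.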
Frank, O., & Shafie, T. (2016). Multivariate entropy analysis of network data. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 129(1), 45-63.