In population size estimation, methods based on back-calculation (multiplier methods) are a popular approach to estimating the size of a target population which is partially hidden and not directly enumerable from existing data. The basis behind this approach is to use a subpopulation with known count and a estimate of the proportion of the target population belonging to this subgroup to estimate the size of the target population. However, this basic method is not applicable when many subgroup counts are available, in which case evidence could be synthesized to provide a more accurate estimate of the target population size. If subgroups are mutually exclusive, a tree-like structure can be created with the target population at the root node, and children (or grandchildren) of the root representing these subgroups with known counts. In this case, to combine available evidence a weighted sum of estimates from each back-calculated path can be made, which is automated and achieved with this package using the weighted multiplier method (WMM) (Flynn and Gustafson 2024b). Variance-minimizing weights are used to provide an estimate of the root population size on any admissible tree-structured data, as well as options for visual rendering of trees for reporting of results.
The implementation behind these functions is described at length elsewhere (Flynn and Gustafson 2024a), and is based on previously developed methodology (Flynn and Gustafson 2024b). A more extensive application to real-world data can be found in (Flynn, Gustafson, and Irvine 2024).
The makeTree
function creates a tree object from any
admissible dataframe; for a dataframe to be admissible, it must contain
specific column labels. These columns include the following:
‘from’ (string, node label)
‘to’ (string, node label)
‘Estimate’ (+ integer)
‘Total’ (+ integer)
‘Count’ (+ integer)
The root node is the assumed to represent the target population for
which estimation is sought. The makeTree
function will
accept a data.frame
object with these columns, and create a
data.tree
object (from data.tree
package) from
data.frame
; it also checks the dataframe columns and
structure to ensure the data.tree
object can be used for
root node population size estimation.
Further details on each column are as follows:
‘from’ (node label) and ‘to’ (node label), which encode the tree structure. from and to describe the edge for that row of data.
‘Estimate’ (+ integer), ‘Total’ (+ integer), which are used as parameters in a Beta(Estimate+1,Total-Estimate+1) distribution at each branch, or, if ‘Total’ is equal for all branches in a sibling group, then a Dirichlet distributions with parameters Estimate+1 from each child of a sibling group is used. Estimate and Total are assumed to come from surveys of size Total (a sample of the population at node from), and observe Estimate number of those individuals at Total which move to the node described by to. Used for branch probabilities only.
‘Count’ (for terminal nodes with marginal counts). Count column is NA for rows where to nodes are not leaves, and also for all leaves without a marginal count.
A Population (logical) column is not needed, but can be
added if Estimate and Total come from population
numbers, rather than samples. A Description column (string) is
also possible to include if particular descriptions are desired on the
visual diagram made by drawTree
function.
An example of this package in use can be seen in the following:
## create admissible dataset
treeData <- data.frame("from" = c("Z", "Z", "A", "A"),
"to" = c("A", "B", "C", "D"),
"Estimate" = c(4, 34, 9, 1),
"Total" = c(11, 70, 10, 10),
"Count" = c(NA, 500, NA, 50),
"Population" = c(FALSE, FALSE, FALSE, FALSE),
"Description" = c("First child of the root", "Second child of the root",
"First grandchild", "Second grandchild"))
## make tree object using makeTree
tree <- makeTree(treeData)
tree
#> levelName
#> 1 Z
#> 2 ¦--A
#> 3 ¦ ¦--C
#> 4 ¦ °--D
#> 5 °--B
The drawTree
function creates a descriptive diagram of a
tree object created using makeTree
, which allows the user
to visualize the tree before it is used for WMM estimation. Prints
descriptions or node labels on nodes, and probabilities based on
previous surveys to branches. Specifically, node descriptions are given
within respective nodes if provided by Description column in
dataframe used in makeTree
, and branch probabilities
calculated using the ratio of data columns Estimate over
Total are given along tree branches.
The function has three arguments, the first being the
makeTree
tree object the user would like the render. The
user may also specify an argument probs
, which takes the
values TRUE
/FALSE
and determines whether
probabilities will be displayed along tree branches. Lastly, the
argument desc
takes the values
TRUE
/FALSE
and determines whether descriptions
will be displayed in each node; for desc=TRUE
, a
Description column must be included as part of the
data.frame
used in the makeTree
function tree
construction.
A use case, using the above tree created by makeTree
, is
as follows:
This function is the central functionality of the
AutoWMM
package, and performs WMM estimation on the tree
created with makeTree
. It compute an estimate of the size
of a target population represented by the root node of tree-structure
data. The wmmTree
function accepts a tree object made using
the makeTree
function, and generates an estimate of the
root node (the target population) by combining multiple back-calculated
path estimates. This function will generate a weighted estimate using
variance-minimizing weights, which combines back-calculated estimates of
the root via the multiplier method. It will incorporate data from all
leaves with both 1) known marginal leaf counts at the terminal point of
the root to leaf path, and 2) available branch probability estimates for
each segment along the root-to-leaf path. These paths are deemed
“informative” (Flynn and Gustafson
2024b).
The wmmTree
function uses the following arguments:
tree
: A tree object created using
makeTree
function.
sample_length
: The desired number of estimates of
the target (root) node.
method
: The method of back-calculation. Current
version only supports the default multiplier method to produce
back-calculated estimates using an internal function,
method = "mmEstimate"
, unless the tree is a special case
using single-source sibling data only. In the latter case, the method
uses closed forms to generate path specific means and variances and WMM
estimate (see single.source
argument below).
int.type
: The type of confidence interval desired.
The default, and recommended, interval produced is the central 95% using
quantiles (int.type = "quants"
}. Setting this argument to
var
generates a 95% confidence interval based on sample
variance, while the cox
option for this argument generates
the Cox interval for log-normal data.
single.source
: This logical argument should be set
to TRUE
only when all sibling groups which have
branches that are used in back-calculated paths are fully informed by a
single source of data. In this special case, sampling can be bypassed,
and closed forms are used to generate the WMM root estimate and it’s
uncertainty; at this time, this method provides an approximation only as
paths are assumed independent.
Back-calculation from each leaf proceeds as follows. Surveys or literature estimates are assumed to inform Estimate and Total columns of dataframe. These branches are then sampled ‘sample_length’ number of times (i.e., number of runs) using a variety of methods, depending on the sources of data being used to inform sibling branches. For example, two sibling branches informed by a single survey which observed the movement from parent to child node of Estimate individuals in a survey with sample size Total, can be assigned a Beta(Estimate+1, Total-Estimate+1) distribution. For more complex configurations of source knowledge, importance sampling and rejection schemes are also employed to ensure consistency among sibling branch groups and root estimates for a given run. For each leaf with non-empty Count, we back-calculate by multiplying by the sampled inverse probabilities of each branch along the root-to-leaf path.
The function ultimately generates sample_length
number
of weighted estimates of the root target population. Using functionality
of data.tree
package on makeTree
object also
provides:
confidence intervals of estimates of the root provided by leaves
with marginal counts, as node$uncertainty
probability samples can be accessed as
node$probability_samples
root estimates from terminal nodes with non-empty count as stored
as node$targetEst_samples
The output of the function is a list of four entries; WMM root node estimate and corresponding uncertainty, a vector of estimates given on each run, and the weights associated with each path which provides a root node estimate (an informative path). Specifically, these four outputs are as follows:
The first is a list
with four entries: in the first
position is the WMM root node size estimate given by the synthesis
across all informative paths. The second entry is the uncertainty
associated with the root node size estimate. In the third entry we find
a vector of weights which sum to one, with length equal to the number of
informative paths and the weight associated with those paths. In the
fourth and final position, a vector of length sample_length
of estimates of the target (root) population size given by the WMM for
each run.
The second entry can alternatively be accessed using the
Get
function through data.tree
to access the
node attribute uncertainty
; it gives the 95% confidence
intervals for estimates given by leaves with marginal counts, as well as
the root node.
The third entry can alternatively be accessed using the
Get
function through data.tree
to access the
node attribute probability_samples
; it returns vectors of
probabilities sampled for all branches leading into each node.
The fourth entry can alternatively be accessed using the
Get
function through data.tree
to access the
node attribute targetEst_samples
; it returns vectors of
root estimates calculated for all paths with marginal counts on the
respective root-to-leaf path.
The calculation of estimates relies on the internal functions
confInts
, ko.weights
,
meanlogEstimates
, mmEstimate
,
root.confInt
, rootEsts
,
logEstimates
, and sampleBeta
.
The function can be demonstrated using the tree created in the
makeTree
section above:
## perform root node estimation
## small sample_length was chosen for efficiency across machines
Zhats <- wmmTree(tree, sample_length = 3)
#> using variance-weighted mean with multiplier method sampled path estimates
The user can then print the estimates of the root node generated by each iteration, the weights of each branch, the final estimate of the root node population size calculated using the WMM, and the final rounded estimate of the root:
# print the estimates of the root node generated by the iterations
Zhats$estimates
#> [,1]
#> D 1124.66
#> D.1 966.21
#> D.2 1044.96
# prints the weights of each branch
Zhats$weights
#> D B
#> [1,] 0.05070794 0.9492921
# prints the final estimate of the root node by WMM
Zhats$root
#> Z
#> 1043.27
# prints the final rounded estimate of the root with conf. int.
Zhats$uncertainty
#> Z
#> lower 970
#> upper 1121
The user may also use the data.tree
functionality with
the makeTree
object to obtain the average root estimate
with a 95% confidence interval, the samples generated from each path
which provided the root estimate, as well as the sampled probabilities
for each branch over the iterations:
## show the average root estimate with 95\% confidence interval, as well as
## average estimates with confidence intervals for each node with a marginal
## count
tree$Get('uncertainty')
#> Z A C D B
#> lower 970 NA NA 386.2427 943.8715
#> upper 1121 NA NA 2058.3142 1105.7360
## show the samples generated from each path which provides root estimates
tree$Get('targetEst_samples')
#> $Z
#> numeric(0)
#>
#> $A
#> numeric(0)
#>
#> $C
#> numeric(0)
#>
#> $D
#> D D D
#> 2077.6884 1722.6841 357.0131
#>
#> $B
#> [1] 1088.3875 936.8208 1106.6567
## show the probabilities sampled at each branch leading into the given node
tree$Get('probability_samples')
#> Z A C D B
#> [1,] NA 0.5406048 0.9554847 0.04451534 0.4593952
#> [2,] NA 0.4662800 0.9377531 0.06224687 0.5337200
#> [3,] NA 0.5481887 0.7445207 0.25547932 0.4518113
A second example using a slightly larger tree can be seen as follows:
## create 2nd admissible dataset
## this example handles many branch sampling cases, including all siblings informed from different surveys, same survey, and mixed case, as well as some siblings not informed and the rest from different surveys, same survey, and mixed case.
treeData2 <- data.frame("from" = c("Z", "Z", "Z",
"A", "A",
"B", "B", "B",
"C", "C", "C",
"H", "H", "H",
"K", "K", "K"),
"to" = c("A", "B", "C",
"D", "E",
"F", "G", "H",
"I", "J", "K",
"L", "M", "N",
"O", "P", "Q"),
"Estimate" = c(24, 34, 12,
9, 1,
NA, 19, 1,
NA, 2, 1,
20, 10, 12,
5, 3, NA),
"Total" = c(70, 70, 70,
10, 11,
NA, 30, 8,
NA, 12, 12,
40, 40, 40,
10, 10, NA),
"Count" = c(NA, NA, NA,
50, NA,
NA, 15, NA,
NA, 10, NA,
NA, NA, 20,
5, 2, NA))
## make tree object using makeTree
tree2 <- makeTree(treeData2)
#> WARNING: No 'Population' column exists. We assume all
#> values are sample estimates
## perform root node estimation
Zhats <- wmmTree(tree2, sample_length = 3)
#> using variance-weighted mean with multiplier method sampled path estimates
Zhats$estimates # print the estimates of the root node generated by the 15 iterations
#> [,1]
#> A 30.98
#> A.1 73.72
#> A.2 169.16
Zhats$weights # prints the weights of each branch
#> D G N J O P
#> [1,] 1.64132 -0.08525576 -1.355679 0.670305 -0.6686254 0.7979354
Zhats$root # prints the final estimate of the root node by WMM
#> Z
#> 72.83
Zhats$uncertainty # prints the final rounded estimate of the root with conf. int.
#> Z
#> lower 32
#> upper 162
## show the average root estimate with 95\% confidence interval, as well as average estimates with confidence intervals for each node with a marginal count
tree2$Get('uncertainty')
#> Z A D E B F G H L M N C I J K
#> lower 32 NA 228.0368 NA NA NA 40.09115 NA NA NA 522.1902 NA NA 132.3488 NA
#> upper 162 NA 421.1833 NA NA NA 72.23021 NA NA NA 1003.8735 NA NA 359.9365 NA
#> O P Q
#> lower 156.2951 190.7389 NA
#> upper 5045.4052 3200.3460 NA
## show the samples generated from each path which provides root estimates
tree2$Get('targetEst_samples')
#> $Z
#> numeric(0)
#>
#> $A
#> numeric(0)
#>
#> $D
#> A A A
#> 226.1279 431.3649 267.5290
#>
#> $E
#> numeric(0)
#>
#> $B
#> numeric(0)
#>
#> $F
#> numeric(0)
#>
#> $G
#> G G G
#> 52.08562 73.48397 39.54266
#>
#> $H
#> numeric(0)
#>
#> $L
#> numeric(0)
#>
#> $M
#> numeric(0)
#>
#> $N
#> H H H
#> 912.0151 1008.9567 507.0873
#>
#> $C
#> numeric(0)
#>
#> $I
#> numeric(0)
#>
#> $J
#> J J J
#> 206.4076 129.2891 370.6265
#>
#> $K
#> numeric(0)
#>
#> $O
#> O O O
#> 1379.7363 139.3682 5401.7303
#>
#> $P
#> P P P
#> 841.1112 176.4098 3433.5338
#>
#> $Q
#> numeric(0)
## show the probabilities sampled at each branch leading into the given node
tree2$Get('probability_samples')
#> Z A D E B F G H
#> [1,] NA 0.2982193 0.7414471 0.2585529 0.4882268 NA 0.5898639 0.1169601
#> [2,] NA 0.2970065 0.3902646 0.6097354 0.4186533 NA 0.4875780 0.1061275
#> [3,] NA 0.2612823 0.7153015 0.2846985 0.6281103 NA 0.6039339 0.2041082
#> L M N C I J K O
#> [1,] 0.4186382 0.1973286 0.3840332 0.2135539 NA 0.2268647 0.03399441 0.4991821
#> [2,] 0.3081928 0.2456633 0.4461439 0.2843401 NA 0.2720195 0.20431137 0.6175549
#> [3,] 0.4441151 0.2482392 0.3076458 0.1106073 NA 0.2439381 0.01515880 0.5520625
#> P Q
#> [1,] 0.3275380 NA
#> [2,] 0.1951537 NA
#> [3,] 0.3474080 NA
After WMM estimation has been performed on a makeTree
object, the countTree
function renders a diagram of the
tree, much like drawTree
, but showing the root estimate
generated using wmmTree
in the root node, as well as the
marginal leaf counts (data column Count) which contributed to
the weighted estimate displayed within the corresponding leaf nodes. The
mean of the sampled branch probabilities generated using the
wmmTree
method are also displayed along each branch.
Functionality can be demonstrated using the trees defined above:
The estTree
function is for use after
wmmTree
has been applied to a makeTree
tree
object. The function allows the user to visualize the tree with the root
size estimate given by wmmTree
displayed in the root node,
and the root estimate given by each particular path which contributed to
the weighted estimate displayed in the corresponding leaf node. It also
displays average of probability samples generated using
wmmTree
method on each branch.
Functionality can again be demonstrated using the trees defined above: