library(dtGAP)
Decision trees are prized for their simplicity and interpretability
but often fail to reveal underlying data structures. Generalized
Association Plots (GAP) excel at illustrating complex associations yet
are typically unsupervised. We introduce dtGAP, a novel
framework that embeds supervised correlation and
distance measures into GAP for enriched decision-tree
visualization. dtGAP offers confusion matrix maps,
decision-tree matrix maps, predicted class membership maps, and
evaluation panels. The dtGAP package is available on GitHub
(https://github.com/hanmingwu1103/dtGAP) and CRAN (https://CRAN.R-project.org/package=dtGAP).
Let’s begin with the penguins dataset! Running the
dtGAP() function can be as simple as:
penguins <- na.omit(penguins)
dtGAP(
data_all = penguins, model = "party", show = "all",
trans_type = "percentize", target_lab = "species",
simple_metrics = TRUE,
label_map_colors = c(
"Adelie" = "#50046d", "Gentoo" = "#fcc47f",
"Chinstrap" = "#e15b76"
),
show_col_prox = FALSE, show_row_prox = FALSE,
y_eval_start = 220,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9)
)
By default, dtGAP visualizes the entire dataset, but you can
focus on just the training or testing split using the show
argument, which takes 'all', 'train', or
'test'. Similarly, you can choose between two tree models
via the model argument, which can be either
'rpart' or 'party'.
When you choose model = "rpart" (classic CART), each
node shows its class-membership probabilities and
displays the percentage of samples in each branch.
dtGAP(
data_all = Psychosis_Disorder, show = "all",
trans_type = "none", target_lab = "UNIQID", print_eval = FALSE
)
In contrast, with model = "party" (conditional inference
trees), dtGAP will annotate each internal node with its
split-variable p-value and display the percentage of
samples in each branch. You can also customize the label mapping and
colors.
dtGAP(
data_all = Psychosis_Disorder, model = "party", show = "all",
trans_type = "none", target_lab = "UNIQID", print_eval = FALSE,
label_map = c("0" = "bipolar", "1" = "schizophrenia"),
label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f")
)
First, choose a suitable data transformation via the
trans_type argument, which can be
'none', 'percentize',
'normalize', or 'scale'.
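To give a feel for what these options could mean on a single numeric column, here is a minimal base R sketch. It assumes the usual readings of the names (empirical percentiles, min-max rescaling, and z-scores); dtGAP's internal definitions may differ in detail, and the vector x is made up for illustration.

```r
# Hypothetical feature values (not from any dtGAP dataset)
x <- c(2, 4, 4, 10)

# "percentize" (assumed): empirical cumulative proportion of each value
percentized <- ecdf(x)(x)

# "normalize" (assumed): min-max rescaling to the [0, 1] interval
normalized <- (x - min(x)) / (max(x) - min(x))

# "scale" (assumed): centre to mean 0 and rescale to standard deviation 1
scaled <- as.numeric(scale(x))

print(rbind(percentized, normalized, scaled))
```

Putting all features on a comparable scale this way keeps one wide-ranging variable from dominating the heatmap colors.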
Before sorting, we build two proximity measures, one for the columns
and one for the rows. The linkage can be "CT"
(centroid), "SG" (single), or "CP"
(complete). Use any method from seriation to reorder rows and
columns.
> seriation::list_seriation_methods("dist")
#> [1] "ARSA" "BBURCG" "BBWRCG" "Enumerate"
#> [5] "GSA" "GW" "GW_average" "GW_complete"
#> [9] "GW_single" "GW_ward" "HC" "HC_average"
#> [13] "HC_complete" "HC_single" "HC_ward" "Identity"
#> [17] "isomap" "isoMDS" "MDS" "MDS_angle"
#> [21] "metaMDS" "monoMDS" "OLO" "OLO_average"
#> [25] "OLO_complete" "OLO_single" "OLO_ward" "QAP_2SUM"
#> [29] "QAP_BAR" "QAP_Inertia" "QAP_LS" "R2E"
#> [33] "Random" "Reverse" "Sammon_mapping" "SGD"
#> [37] "Spectral" "Spectral_norm" "SPIN_NH" "SPIN_STS"
#> [41] "TSP" "VAT"
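As a rough illustration of what a distance-based seriation method does, the sketch below reorders a small distance matrix using average-linkage hierarchical clustering from base R. This is a stand-in in the spirit of the "HC_average" entry above, not a call into dtGAP; with the seriation package installed, seriation::seriate(d, method = "HC_average") followed by seriation::get_order() would be the more direct route. The data are simulated.

```r
set.seed(1)

# Simulated data: 8 samples, 5 features (hypothetical)
x <- matrix(rnorm(40), nrow = 8, dimnames = list(paste0("s", 1:8), NULL))
d <- dist(x)

# Average-linkage hierarchical clustering yields a linear order of the samples
ord <- hclust(d, method = "average")$order

# Apply the same permutation to rows and columns of the distance matrix
d_sorted <- as.matrix(d)[ord, ord]

print(rownames(d_sorted))
```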
Also, when show = "all", use
sort_by_data_type = TRUE to preserve the original
train/test grouping; set it to FALSE if you’d rather
intermix samples from both sets when ordering.
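The difference between the two settings can be pictured with a toy ordering in base R. The column position below is a hypothetical seriation coordinate, and the way dtGAP combines grouping and ordering internally is an assumption here; the sketch only shows grouped versus intermixed sorting.

```r
# Hypothetical samples with a train/test label and a seriation coordinate
samples <- data.frame(
  id = 1:6,
  set = c("train", "test", "train", "test", "train", "train"),
  position = c(0.9, 0.1, 0.4, 0.7, 0.2, 0.6)
)

# Grouped: keep train and test blocks together, order within each block
set_rank <- factor(samples$set, levels = c("train", "test"))
grouped <- samples[order(set_rank, samples$position), ]

# Intermixed: ignore the split and order all samples together
mixed <- samples[order(samples$position), ]

print(grouped$id)
print(mixed$id)
```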
How do we measure the quality of the sorting?
dtGAP computes the cRGAR, an average of node-specific anti-Robinson scores weighted by each node’s sample fraction, to quantify order quality.
dtGAP(
data_all = Psychosis_Disorder, model = "party", show = "all",
trans_type = "none", target_lab = "UNIQID",
label_map = c("0" = "bipolar", "1" = "schizophrenia"),
label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f"),
seriate_method = "GW_average", sort_by_data_type = FALSE
)
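The weighting behind cRGAR can be sketched in a few lines of base R. The per-node scores and node sizes below are invented for illustration, and how dtGAP computes each node's anti-Robinson score is not shown here; only the sample-fraction weighting described above is reproduced.

```r
# Hypothetical anti-Robinson scores for three terminal nodes
node_score <- c(node_a = 0.10, node_b = 0.25, node_c = 0.05)

# Hypothetical number of samples falling into each node
node_n <- c(node_a = 40, node_b = 25, node_c = 35)

# Each node contributes in proportion to its share of the samples
node_frac <- node_n / sum(node_n)
crgar <- sum(node_frac * node_score)

print(crgar)
```

Because the weights sum to 1, large nodes dominate the overall value, so a poorly ordered small node has limited influence.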
When you set print_eval = TRUE, dtGAP will
append an evaluation panel containing two sections:
Data Information
- Dataset name, model, and train/test sample sizes.
- Column proximity method, linkage, seriation algorithm, and cRGAR score.
Train/Test Metrics
- Full confusion-matrix report (default): uses caret::confusionMatrix() to show accuracy, kappa, sensitivity, specificity, etc.
- Simple metrics: if you set simple_metrics = TRUE, you’ll instead get six key measures from the yardstick package: accuracy, balanced accuracy, kappa, precision, recall, and specificity.
dtGAP(
data_all = Psychosis_Disorder, model = "party", show = "all",
label_map = c("0" = "bipolar", "1" = "schizophrenia"),
label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f"),
trans_type = "none", target_lab = "UNIQID", simple_metrics = TRUE
)
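For readers who want to see how the six simple metrics relate to a confusion matrix, the following base R sketch computes them from hypothetical two-class counts. It uses the standard textbook definitions rather than calling yardstick, so rounding and class-ordering conventions in the actual panel may differ.

```r
# Hypothetical confusion-matrix counts (positive class chosen arbitrarily)
tp <- 40  # true positives
fn <- 10  # false negatives
fp <- 5   # false positives
tn <- 45  # true negatives
n <- tp + fn + fp + tn

accuracy <- (tp + tn) / n
recall <- tp / (tp + fn)            # also called sensitivity
specificity <- tn / (tn + fp)
precision <- tp / (tp + fp)
balanced_accuracy <- (recall + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
expected <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(c(
  accuracy = accuracy, balanced_accuracy = balanced_accuracy,
  kappa = cohen_kappa, precision = precision,
  recall = recall, specificity = specificity
))
```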
If the default conditional tree is not desired, you can create your
own tree (e.g. with rpart) and wrap as.party()
around this object to plug it into dtGAP(). As an example, we
will examine data on COVID-19 cases in Wuhan from 2020-01-10 to
2020-02-18, taken from a recent study.
dtGAP(
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "train",
label_map = c("0" = "Survival", "1" = "Death"),
label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
simple_metrics = TRUE,
show_col_prox = FALSE, show_row_prox = FALSE,
y_eval_start = 200,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9)
)
You can print measures evaluating the conditional decision tree’s
performance by setting print_eval = TRUE. By default, we
show 5 measures for classification tasks and 4 measures for
regression tasks.
dtGAP(
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "test",
label_map = c("0" = "Survival", "1" = "Death"),
label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
simple_metrics = TRUE,
show_col_prox = FALSE, show_row_prox = FALSE,
y_eval_start = 200,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9)
)
Compared with classification, interpreting a regression tree can be challenging. A heatmap, however, can make the structure more transparent by showing how observations cluster within each terminal node. Here’s an example:
dtGAP(
data_all = galaxy, task = "regression",
target_lab = "target", show = "all",
trans_type = "percentize", model = "party",
simple_metrics = TRUE, y_eval_start = 220,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9)
)
Variable importance and split-variable labels panel
- col_var_imp: sets the bar fill color (e.g. "orange", "#2c7bb6").
- var_imp_bar_width: adjusts bar thickness (default 0.8).
- var_imp_fontsize / split_var_fontsize: control the font size (default 5).
- split_var_bg: background color behind each split-variable name (default "darkgreen").
Color
Define the RColorBrewer palette and number of shades.
- Col_Prox_palette (e.g. "RdBu", "Viridis") and Col_Prox_n_colors
- Row_Prox_palette and Row_Prox_n_colors
- sorted_dat_palette and sorted_dat_n_colors
Use display.brewer.all() to display all available
RColorBrewer palettes.
You can customize the color schemes and font sizes in the visualization to match your preferences.
dtGAP(
data_all = Psychosis_Disorder, show = "all", trans_type = "none",
target_lab = "UNIQID", simple_metrics = TRUE, col_var_imp = "blue",
split_var_bg = "darkblue", Col_Prox_palette = "RdYlGn",
type_palette = "Set2",
Row_Prox_palette = "Spectral",
var_imp_fontsize = 7, split_var_fontsize = 7,
sorted_dat_palette = "Oranges", sorted_dat_n_colors = 9,
label_map = c("0" = "bipolar", "1" = "schizophrenia"),
label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f")
)
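Several examples in this document pass raw_value_col a vector built with colorRampPalette(). As a small base R illustration of what that expression produces, the sketch below expands the same four anchor colors into nine interpolated hex codes; nothing here depends on dtGAP.

```r
# Anchor colors used in the examples of this document
anchors <- c("#33286b", "#26828e", "#75d054", "#fae51f")

# colorRampPalette() returns a function; calling it with 9 gives 9 colors
palette9 <- colorRampPalette(anchors)(9)

print(palette9)

# The first and last entries reproduce the outer anchors (in upper case)
print(c(first = palette9[1], last = palette9[9]))
```

Changing the number in the final call is the simplest way to get a coarser or finer color gradient from the same anchors.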
You can also choose whether to display the row or column proximity.
dtGAP(
data_all = Psychosis_Disorder, model = "party", show = "all",
trans_type = "none", target_lab = "UNIQID",
seriate_method = "GW_average",
label_map = c("0" = "bipolar", "1" = "schizophrenia"),
label_map_colors = c("bipolar" = "#50046d", "schizophrenia" = "#fcc47f"),
show_row_prox = FALSE, show_col_prox = FALSE
)
While extreme tree visualizations may reduce immediate
interpretability, they effectively illustrate the structural
adaptability of our layout algorithm in the context of increasing tree
complexity. The horizontal positioning of tree components is governed by
the tree_p parameter in dtGAP(), which
determines the proportion of the overall canvas dedicated to the tree
structure. Adjusting tree_p helps mitigate issues such as
branch overlapping by providing adequate spacing between nodes.
dtGAP(
data_all = wine_quality_red, target_lab = "target",
show = "all", model = "party", simple_metrics = TRUE,
show_col_prox = FALSE, show_row_prox = FALSE,
y_eval_start = 40,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9),
show_row_names = FALSE
)
dtGAP(
data_all = wine_quality_red, target_lab = "target",
show = "all", model = "party", simple_metrics = TRUE,
tree_p = 0.4,
show_col_prox = FALSE, show_row_prox = FALSE,
y_eval_start = 40,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9),
show_row_names = FALSE
)
Sometimes you may want to focus the heatmap on a subset of features
while keeping the tree trained on all variables. The
select_vars parameter lets you specify which variables to
display—the tree still uses every feature for splitting, but only the
selected ones appear in the heatmap panels.
dtGAP(
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "test",
select_vars = c("LDH", "Lymphocyte"),
label_map = c("0" = "Survival", "1" = "Death"),
label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
simple_metrics = TRUE,
show_col_prox = FALSE, show_row_prox = FALSE,
y_eval_start = 200,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9)
)
Note that select_vars must be a character vector of
column names that exist in the data (excluding the target). Variable
importance values are rescaled to sum to 1 for the selected subset.
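The rescaling mentioned above amounts to dividing the selected importances by their sum. A minimal base R sketch, using made-up importance values and assuming this simple proportional rescaling is what dtGAP applies:

```r
# Hypothetical importance values for all variables in the tree
var_imp <- c(LDH = 0.5, hs_CRP = 0.3, Lymphocyte = 0.2)

# Variables chosen for display, as in select_vars
selected <- c("LDH", "Lymphocyte")

# Keep the selected entries and rescale them so they sum to 1
sub_imp <- var_imp[selected]
rescaled <- sub_imp / sum(sub_imp)

print(round(rescaled, 3))
```

The relative ranking of the selected variables is unchanged; only the scale of the bars shifts.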
If you have already trained a decision tree outside of
dtGAP, you can pass it directly using the fit
parameter. This is useful when you want to use a specific tree
configuration or compare a custom model with the built-in options.
dtGAP() accepts rpart, party,
and train (caret) objects. The model type is automatically
detected, and the tree is converted internally. You can optionally
supply your own variable importance vector via
user_var_imp.
library(rpart)
# Train a custom rpart tree with specific parameters
custom_tree <- rpart(
Outcome ~ ., data = train_covid,
control = rpart.control(maxdepth = 3, cp = 0.01)
)
dtGAP(
fit = custom_tree,
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "test",
label_map = c("0" = "Survival", "1" = "Death"),
label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
simple_metrics = TRUE,
show_col_prox = FALSE, show_row_prox = FALSE,
y_eval_start = 200,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9)
)
Set interactive = TRUE to launch a Shiny-based
interactive heatmap viewer powered by InteractiveComplexHeatmap.
This lets you hover, click, and zoom into the heatmap panels directly in
your browser.
dtGAP(
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "test",
interactive = TRUE, print_eval = FALSE
)
Note: InteractiveComplexHeatmap must be installed
separately from Bioconductor:
BiocManager::install("InteractiveComplexHeatmap")
In interactive mode, only the heatmap panels are displayed (the tree
panel is omitted, as InteractiveComplexHeatmap handles
ComplexHeatmap objects).
The compare_dtGAP() function lets you compare two or
more tree models side-by-side on a single wide canvas. Each model gets
its own tree + heatmap panel with a label header.
compare_dtGAP(
models = c("rpart", "party"),
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "test",
label_map = c("0" = "Survival", "1" = "Death"),
label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
simple_metrics = TRUE,
show_col_prox = FALSE, show_row_prox = FALSE,
y_eval_start = 200,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9)
)
Supported models include "rpart", "party",
"C50", and "caret". The default page width is
594 mm (two A4 pages side-by-side); you can adjust it with
total_w.
dtGAP extends beyond single decision trees with three functions for
conditional random forests via partykit::cforest:
- train_rf(): train a conditional random forest
- rf_summary(): ensemble-level summary (variable importance + representative tree)
- rf_dtGAP(): visualize any individual tree from the forest using the full dtGAP pipeline
train_rf() fits a cforest and returns the
forest object, normalized variable importance, and the number of
trees.
rf <- train_rf(
data_train = train_covid,
target_lab = "Outcome",
ntree = 50
)
names(rf)
#> [1] "forest" "var_imp" "ntree"
rf$var_imp
#> LDH hs_CRP Lymphocyte
#> 0.5 0.3 0.2
rf_summary() provides an overview of the fitted random
forest. It displays a variable importance barplot and identifies the
representative tree—the individual tree whose
predictions agree most closely with the full ensemble.
result <- rf_summary(
data_train = train_covid,
data_test = test_covid,
target_lab = "Outcome",
ntree = 50,
top_n_vars = 3
)
result$rep_tree_index
#> [1] 11
The returned rep_tree_index tells you which tree best
represents the ensemble, which you can then visualize with
rf_dtGAP().
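The idea of a representative tree can be sketched in base R: collect each tree's predictions, form the ensemble prediction by majority vote, and pick the tree that agrees with it most often. The prediction matrix below is invented, and the majority-vote and agreement measures are assumptions for illustration; rf_summary() may use a different agreement criterion internally.

```r
# Hypothetical 0/1 predictions: rows are observations, columns are trees
tree_preds <- cbind(
  tree1 = c(0, 0, 1, 1, 0, 1),
  tree2 = c(0, 1, 1, 1, 0, 0),
  tree3 = c(0, 0, 1, 0, 0, 1),
  tree4 = c(1, 0, 1, 1, 0, 1)
)

# Ensemble prediction by majority vote across trees
ensemble <- as.integer(rowMeans(tree_preds) > 0.5)

# Share of observations on which each tree matches the ensemble
agreement <- colMeans(tree_preds == ensemble)

# The representative tree is the one with the highest agreement
rep_tree_index <- unname(which.max(agreement))

print(round(agreement, 3))
print(rep_tree_index)
```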
rf_dtGAP() extracts a single tree from the forest and
renders it through the full dtGAP pipeline (decision tree + heatmap +
evaluation). The title automatically shows “Tree k/N”.
rf_dtGAP(
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "test",
tree_index = 1, ntree = 50,
label_map = c("0" = "Survival", "1" = "Death"),
label_map_colors = c("Survival" = "#50046d", "Death" = "#fcc47f"),
simple_metrics = TRUE,
show_col_prox = FALSE, show_row_prox = FALSE,
y_eval_start = 200,
raw_value_col = colorRampPalette(
c("#33286b", "#26828e", "#75d054", "#fae51f")
)(9)
)
save_dtGAP() exports the dtGAP visualization to PNG,
PDF, or SVG files. The format is automatically inferred from the file
extension, or you can set it explicitly. Dimensions are specified in
millimeters (default A4 landscape: 297 x 210 mm).
# Save as PNG (300 dpi)
save_dtGAP(
file = "my_plot.png",
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "test",
print_eval = FALSE
)
# Save as PDF
save_dtGAP(
file = "my_plot.pdf",
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "test",
print_eval = FALSE
)
# Custom dimensions (wide format)
save_dtGAP(
file = "wide_plot.svg",
width = 500, height = 250,
data_train = train_covid, data_test = test_covid,
target_lab = "Outcome", show = "test",
print_eval = FALSE
)
All dtGAP() arguments can be passed through
..., so you can customize colors, metrics, and layout just
as you would with dtGAP() directly.
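To relate the millimeter dimensions to raster output, the base R sketch below converts the A4 landscape size to pixels at 300 dpi and reads the format from a file extension. The helper mm_to_px() is hypothetical and only illustrates the arithmetic; it is not part of dtGAP, and save_dtGAP() may handle these steps differently.

```r
# Hypothetical helper: millimeters to pixels at a given resolution
mm_to_px <- function(mm, dpi = 300) {
  round(mm / 25.4 * dpi)  # 25.4 mm per inch
}

# A4 landscape (297 x 210 mm) at 300 dpi
a4_px <- c(width = mm_to_px(297), height = mm_to_px(210))
print(a4_px)

# Inferring an output format from the file name
ext <- tolower(tools::file_ext("wide_plot.svg"))
print(ext)
```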
Han-Ming Wu, Chia-Yu Chang, and Chun-houh Chen (2025), dtGAP: Supervised matrix visualization for decision trees based on the GAP framework. R package version 0.0.2, (https://github.com/hanmingwu1103/dtGAP).
References: