Download a copy of the vignette to follow along here: feature_weights.Rmd
The distance metrics used in metasnf can all apply custom weights to the included features. The code below outlines how to generate and use a weights_matrix (a matrix containing feature weights).
library(metasnf)
# Make sure to throw in all the data you're interested in visualizing for this
# data_list, including out-of-model measures and confounding features.
data_list <- generate_data_list(
    list(income, "household_income", "demographics", "ordinal"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(fav_colour, "favourite_colour", "demographics", "categorical"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)
#> Warning in generate_data_list(list(income, "household_income", "demographics",
#> : 188 subject(s) dropped due to incomplete data.
summarize_dl(data_list)
#> name type domain length width
#> 1 household_income ordinal demographics 87 2
#> 2 pubertal_status continuous demographics 87 2
#> 3 favourite_colour categorical demographics 87 2
#> 4 anxiety ordinal behaviour 87 2
#> 5 depressed ordinal behaviour 87 2
set.seed(42)
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    min_k = 20,
    max_k = 50
)
weights_matrix <- generate_weights_matrix(
    data_list,
    nrow = 20
)
head(weights_matrix)
#> household_income pubertal_status colour cbcl_anxiety_r cbcl_depress_r
#> [1,] 1 1 1 1 1
#> [2,] 1 1 1 1 1
#> [3,] 1 1 1 1 1
#> [4,] 1 1 1 1 1
#> [5,] 1 1 1 1 1
#> [6,] 1 1 1 1 1
By default, the weights are all 1. This is what batch_snf uses when no weights_matrix is supplied.
If you have custom feature weights you’d like to use, you can populate this matrix manually. It has one column per feature (column order does not matter), and its number of rows should match the number of rows in the settings_matrix.
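As a rough illustration of manual population, the base R sketch below builds a weights matrix by hand. The feature names are copied from the printed weights_matrix above; the specific weight values, the object name custom_weights, and the idea that a hand-built matrix of this shape can be used in place of the generated one are assumptions made for the example.

```r
# Feature names as printed in the weights_matrix above
features <- c(
    "household_income",
    "pubertal_status",
    "colour",
    "cbcl_anxiety_r",
    "cbcl_depress_r"
)

# Must equal nrow(settings_matrix); 20 matches the example above
n_settings <- 20

# Start from the default of all weights equal to 1
custom_weights <- matrix(
    1,
    nrow = n_settings,
    ncol = length(features),
    dimnames = list(NULL, features)
)

# Hypothetical choice: double the weight of anxiety in every row
custom_weights[, "cbcl_anxiety_r"] <- 2

# Hypothetical choice: down-weight colour in the last 10 rows only
custom_weights[11:20, "colour"] <- 0.5

head(custom_weights)
```

In practice you could equally start from the output of generate_weights_matrix and overwrite columns in the same way, which avoids retyping the feature names.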
If you are just looking to broaden the space of cluster solutions you generate, you can use some of the built-in randomization options for the weights:
# Random uniformly distributed values
generate_weights_matrix(
    data_list,
    nrow = 5,
    fill = "uniform"
)
#> household_income pubertal_status colour cbcl_anxiety_r cbcl_depress_r
#> [1,] 0.08161542 0.3198375 0.8328815 0.9943410 0.3955367
#> [2,] 0.40378037 0.4627980 0.3132912 0.7119147 0.9593465
#> [3,] 0.83551451 0.9353873 0.2794196 0.4951427 0.1132382
#> [4,] 0.59499701 0.5917005 0.7100717 0.8079317 0.2355968
#> [5,] 0.35140389 0.5460431 0.3481677 0.5611197 0.5104740
# Random exponentially distributed values
generate_weights_matrix(
    data_list,
    nrow = 5,
    fill = "exponential"
)
#> household_income pubertal_status colour cbcl_anxiety_r cbcl_depress_r
#> [1,] 0.5123907 0.1624127 1.6042481 3.7447548 1.53441037
#> [2,] 3.9471338 0.4178442 0.2354796 0.3647522 0.22186034
#> [3,] 0.4215409 0.2394908 0.1519102 0.8262260 0.03348363
#> [4,] 1.4107604 2.3230736 2.0428148 0.2279961 0.48877057
#> [5,] 0.1756311 0.5256458 1.3623835 0.1072554 0.24304379
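To get a feel for what these two fill options produce, the sketch below draws comparable matrices with base R. It assumes that "uniform" corresponds to draws from Uniform(0, 1) and "exponential" to draws from an exponential distribution with rate 1. The printed values above are consistent with that, but the exact parameters used by generate_weights_matrix are not stated here.

```r
set.seed(42)

features <- c(
    "household_income",
    "pubertal_status",
    "colour",
    "cbcl_anxiety_r",
    "cbcl_depress_r"
)
n_rows <- 5

# Assumed equivalent of fill = "uniform": every weight lies strictly in (0, 1)
uniform_weights <- matrix(
    runif(n_rows * length(features)),
    nrow = n_rows,
    dimnames = list(NULL, features)
)

# Assumed equivalent of fill = "exponential": positive weights, occasionally large
exponential_weights <- matrix(
    rexp(n_rows * length(features), rate = 1),
    nrow = n_rows,
    dimnames = list(NULL, features)
)

range(uniform_weights)
range(exponential_weights)
```

Uniform weights keep every feature within a bounded range, while exponential weights occasionally let a single feature dominate a row, which tends to explore more extreme weightings.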
Once you’re happy with your weights_matrix, you can pass it into batch_snf:
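A minimal sketch of that call, reusing the data_list, settings_matrix, and weights_matrix objects created above. The argument name weights_matrix and the result name solutions_matrix are assumptions based on the object names used in this vignette; check the batch_snf help page for the exact signature in your installed version.

```r
# Reuses data_list, settings_matrix, and weights_matrix created above.
# The argument name `weights_matrix` is an assumption; see ?batch_snf.
solutions_matrix <- batch_snf(
    data_list,
    settings_matrix,
    weights_matrix = weights_matrix
)
```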
The specific implementation of the weights during distance matrix calculations depends on the distance metric used, which you can learn more about in the distance metrics vignette.
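As a rough picture of what weighting can look like for a Euclidean metric, the base R sketch below multiplies each column by its weight before calling dist. This is one common convention, not necessarily the one metasnf uses (another scales columns by the square root of the weight); the toy data and the names feature_a and feature_b are hypothetical.

```r
# Hypothetical toy data: 4 observations, 2 continuous features
x <- matrix(
    c(1, 2, 3, 4,
      10, 10, 20, 40),
    ncol = 2,
    dimnames = list(NULL, c("feature_a", "feature_b"))
)

# Hypothetical weights: feature_b contributes a tenth as much as feature_a
weights <- c(feature_a = 1, feature_b = 0.1)

# Scale each column by its weight, then compute ordinary Euclidean distances
weighted_x <- sweep(x, 2, weights, `*`)

unweighted_dist <- dist(x)
weighted_dist <- dist(weighted_x)

unweighted_dist
weighted_dist
```

Without weights, the distances are driven almost entirely by feature_b because of its larger scale. After down-weighting, feature_a contributes on a comparable footing.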
To understand precisely how your weights are used, you also need to consider the SNF scheme. Depending on which scheme is specified in a settings_matrix row, the feature columns involved in each distance matrix calculation can differ substantially.
For example, in the domain scheme, all features of the same domain are concatenated prior to distance matrix calculation. If any domain contains multiple types of features (e.g., continuous and categorical), the mixed distance metric (Gower’s method by default) will be used for that domain, and the weights will be applied, but only on a per-domain basis.
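The sketch below illustrates the domain-scheme idea with a stand-in: two features of different types are concatenated into one hypothetical "demographics" data frame, and a weighted Gower distance is computed with cluster::daisy. metasnf's own Gower implementation is not shown here, so treat daisy and the toy values as assumptions chosen only to show weights acting within a single concatenated domain.

```r
library(cluster)  # recommended package that ships with most R installations

# Hypothetical "demographics" domain after concatenation:
# one continuous feature and one categorical feature
demographics <- data.frame(
    pubertal_status = c(1.0, 2.5, 3.0, 4.5),
    colour = factor(c("red", "blue", "red", "green"))
)

# Equal weights versus down-weighting the categorical feature
equal_dist <- daisy(demographics, metric = "gower", weights = c(1, 1))
custom_dist <- daisy(demographics, metric = "gower", weights = c(1, 0.2))

equal_dist
custom_dist
```

The weights change the relative contribution of the two features within this one domain. They have no way to shift influence between this domain and another, since each domain gets its own distance matrix.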
Here’s a more concrete example of how data set-up and SNF scheme can influence the feature weighting process: consider generating a data_list where every input dataframe contains only one feature. If that data_list is processed exclusively using the “individual” SNF scheme, feature weights won’t matter. This is because the individual scheme calculates a separate distance matrix for every input dataframe before fusing them together with SNF. Any time a distance matrix is calculated, it will be for a single feature only, so the purpose of feature weighting (changing the relative contributions of input features during the distance matrix calculations) is lost.
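One way to see this with base R, assuming weights are applied by multiplying a feature's column before the distance calculation: with a single feature, the weight only rescales every pairwise distance by the same constant, so the relative structure of the distance matrix is unchanged. The values and the weight of 3 below are hypothetical.

```r
# Hypothetical single-feature data frame with 4 observations
x <- matrix(c(2, 5, 9, 14), ncol = 1)

# Hypothetical weight for that single feature
w <- 3

d_unweighted <- dist(x)
d_weighted <- dist(x * w)

# Every weighted distance is the same constant multiple of the unweighted one
ratio <- as.vector(d_weighted) / as.vector(d_unweighted)
ratio
```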