Data is synthesised by sampling from a multivariate cumulative distribution (a copula), using the simstudy package.
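The general idea of copula-based sampling can be sketched in base R. This is only an illustration of the technique, not the code used by RESIDE or simstudy; the variable names (`age`, `weight`), the marginal distributions and the target correlation of 0.6 are all assumptions made for the example.

```r
# A minimal Gaussian copula sketch (illustrative only, not RESIDE's internals).
set.seed(42)
n <- 5000

# Hypothetical target correlation between two variables.
target_cor <- matrix(c(1.0, 0.6,
                       0.6, 1.0), nrow = 2, byrow = TRUE)

# Step 1: draw correlated standard normals using the Cholesky factor.
z <- matrix(rnorm(n * 2), ncol = 2) %*% chol(target_cor)

# Step 2: map each column to the uniform scale with the normal CDF.
u <- pnorm(z)

# Step 3: apply the quantile function of each (assumed) marginal distribution.
age    <- qnorm(u[, 1], mean = 50, sd = 10)
weight <- qgamma(u[, 2], shape = 20, rate = 0.25)

# The rank order correlation of the result is close to, but not exactly,
# the correlation specified on the normal scale.
rank_cor <- cor(age, weight, method = "spearman")
print(rank_cor)
```

Because the dependence is carried on the rank scale, each variable keeps its own marginal shape while the pair remains correlated.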
Data can be synthesised from marginal distributions using the synthesise_data() function:
library(RESIDE)
marginals <- import_marginal_distributions()
simulated_data <- synthesise_data(marginals)
User-specified correlations can be added to the synthesised data by supplying a correlation matrix. An empty correlation matrix can be generated using the export_empty_cor_matrix() function, supplying the marginals imported using import_marginal_distributions() and a folder path respectively:
library(RESIDE)
marginals <- import_marginal_distributions()
export_empty_cor_matrix(marginals, folder_path = tempdir())
The exported CSV file will be a symmetric table which looks like:
Correlations should then be added to the CSV file without modifying the column or row names. The values should be rank order correlations. Categorical variables are represented as dummy variables, named in the format variable name, underscore, category name, e.g. SEX_F. Note that the correlation matrix must be symmetric and positive semi-definite.
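Before importing, it can help to check a candidate matrix yourself. The following base R sketch tests the two requirements above; the variable names (`AGE`, `HEIGHT`, `SEX_F`) and the correlation values are hypothetical, and the check is an assumption about what a reasonable validation looks like rather than the one RESIDE performs internally.

```r
# Hypothetical correlation matrix with the same row and column names.
vars <- c("AGE", "HEIGHT", "SEX_F")
cor_mat <- matrix(
  c( 1.0, 0.3, -0.2,
     0.3, 1.0,  0.5,
    -0.2, 0.5,  1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(vars, vars)
)

# Symmetric: the matrix equals its own transpose.
is_symmetric <- isSymmetric(unname(cor_mat))

# Positive semi-definite: no eigenvalue is negative
# (a small tolerance allows for floating point error).
eigenvalues <- eigen(cor_mat, symmetric = TRUE, only.values = TRUE)$values
is_psd <- all(eigenvalues >= -1e-8)

print(c(symmetric = is_symmetric, positive_semi_definite = is_psd))
```

Rank order correlations for existing data can be computed with `cor(x, y, method = "spearman")`.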
Once the correlations have been added to the CSV file, they can be imported using the import_cor_matrix() function:
library(RESIDE)
correlation_matrix <- import_cor_matrix()
By default, the correlation matrix is read from a file with the exported filename (correlation_matrix.csv) in the current working directory. This can be changed with the file_path parameter of the import_cor_matrix() function, which accepts a relative or absolute file path.
The import_cor_matrix() function will produce an error if the matrix is not symmetric and positive semi-definite, or if the file does not exist.
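As a sketch of importing from a non-default location, the snippet below builds a path and handles the error cases described above. It assumes the matrix was exported to tempdir() under the default filename; the surrounding checks and messages are illustrative and not part of RESIDE.

```r
# Assumed location: the folder used for the export (tempdir()) and the
# default exported filename.
cor_path <- file.path(tempdir(), "correlation_matrix.csv")

correlation_matrix <- NULL
if (file.exists(cor_path)) {
  # import_cor_matrix() errors on an invalid matrix, so catch and report it.
  correlation_matrix <- tryCatch(
    RESIDE::import_cor_matrix(file_path = cor_path),
    error = function(e) {
      message("Could not import the correlation matrix: ", conditionMessage(e))
      NULL
    }
  )
} else {
  message("No correlation matrix found at: ", cor_path)
}
```

If the import fails, `correlation_matrix` stays `NULL`, which makes it straightforward to stop before calling synthesise_data().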
With a correlation matrix, data can now be synthesised with the user-specified correlations using the synthesise_data() function, specifying the correlation matrix imported by the import_cor_matrix() function:
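library(RESIDE)
# Import the marginal distributions
marginals <- import_marginal_distributions()
# Export an empty correlation matrix, then fill in the CSV file
export_empty_cor_matrix(marginals)
# Import the completed correlation matrix
correlation_matrix <- import_cor_matrix()
# Synthesise data with the user-specified correlations
simulated_data <- synthesise_data(
  marginals,
  correlation_matrix
)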
NB: It is not possible to maintain all the marginal distributions exactly when correlations are specified. This is a known limitation and is unlikely to change.