README

DataFakeR is an R package designed to help you generate sample of fake data preserving specified assumptions about the original one.

Installation

Learning DataFakeR

You may find there a list of useful articles that will guide you through the package functionality.

Usage

Configure schema YAML structure

Define custom simulation methods if needed

books <- function(n, add_second = FALSE) {
  first <- c("Learning", "Amusing", "Hiding", "Symbols", "Hunting", "Smile")
  second <- c("Of", "On", "With", "From", "In", "Before")
  third <- c("My", "Your", "The", "Common", "Mysterious", "A")
  fourth <- c("Future", "South", "Technology", "Forest", "Storm", "Dreams")
  second_res <- NULL
  if (add_second) {
    second_res <- sample(second, n, replace = TRUE)
  }
  paste(
    sample(first, n, replace = TRUE), second_res, 
    sample(third, n, replace = TRUE), sample(fourth, n, replace = TRUE)
  )
}

simul_spec_character_book <- function(n, unique, spec_params, ...) {
  spec_params$n <- n
  
  DataFakeR::unique_sample(
    do.call(books, spec_params), 
    spec_params = spec_params, unique = unique
  )
}

set_faker_opts(
  opt_simul_spec_character = opt_simul_spec_character(book = simul_spec_character_book)
)

Source schema (and check table and column dependencies)

options("dfkr_verbose" = TRUE) # set `dfkr_verbose` option to see the workflow progress
sch <- schema_source("schema_books.yml")

schema_plot_deps(sch)

schema_plot_deps(sch, "books")

Run data simulation

sch <- schema_simulate(sch)
#> =====> Simulating table 'books' started..
#>   ===> Simulating column 'author' started..
#>   ===> Simulating column 'title' started..
#>   ===> Simulating column 'genre' started..
#>   ===> Simulating column 'bought' started..
#>   ===> Simulating column 'amount' started..
#>   ===> Simulating column 'book_id' started..
#>   ===> Simulating column 'purchase_id' started..
#> =====> Simulating table 'borrowed' started..
#>   ===> Simulating column 'book_id' started..
#>   ===> Simulating column 'user_id' started..

Check the results

schema_get_table(sch, "books")
#> # A tibble: 10 × 7
#>    book_id      author                   title                           
#>    <chr>        <chr>                    <chr>                           
#>  1 DormAmus2021 Dorman Abshire           Amusing In Common Forest        
#>  2 Dr. Symb2020 Dr. Montie Kihn          Symbols In My Future            
#>  3 SharAmus2021 Sharde Howell MD         Amusing With Your Forest        
#>  4 Dr. Lear2020 Dr. Maggie Lind          Learning From A Storm           
#>  5 NathSmil2020 Nathanael Upton-Prosacco Smile Of Common Future          
#>  6 AnasSmil2021 Anastacia Dickens        Smile In Common Forest          
#>  7 RyleSymb2020 Ryleigh Brekke           Symbols From Mysterious Storm   
#>  8 HortAmus2020 Hortense Rosenbaum       Amusing Before Common Technology
#>  9 MariHidi2021 Mariana Auer-Sauer       Hiding On The Forest            
#> 10 TrisSmil2021 Tristen Larkin           Smile With The South            
#>    genre     bought     amount purchase_id        
#>    <chr>     <date>      <int> <chr>              
#>  1 Adventure 2021-04-13     17 purchase_2021-04-13
#>  2 Horror    2020-03-16     81 purchase_2020-03-16
#>  3 Adventure 2021-01-06     55 purchase_2021-01-06
#>  4 Adventure 2020-02-02     NA purchase_2020-02-02
#>  5 Adventure 2020-04-13     93 purchase_2020-04-13
#>  6 Romance   2021-03-02      2 purchase_2021-03-02
#>  7 Horror    2020-08-09     42 purchase_2020-08-09
#>  8 Adventure 2020-10-12     NA purchase_2020-10-12
#>  9 Horror    2021-05-27     47 purchase_2021-05-27
#> 10 Horror    2021-05-30     72 purchase_2021-05-30

schema_get_table(sch, "borrowed")
#> # A tibble: 30 × 2
#>    book_id      user_id   
#>    <chr>        <chr>     
#>  1 DormAmus2021 PKPFJGYlKQ
#>  2 SharAmus2021 YiitBNRqgN
#>  3 RyleSymb2020 ZmFaiKZrsn
#>  4 RyleSymb2020 hKKanzSLlW
#>  5 AnasSmil2021 vvTGnzCNAP
#>  6 DormAmus2021 BZcsAzAjzm
#>  7 RyleSymb2020 gEfcYAuUVw
#>  8 SharAmus2021 oVcYOaJXBc
#>  9 HortAmus2020 YDCQQTGlce
#> 10 AnasSmil2021 uLrpKuAFVd
#> # … with 20 more rows

Acknowledgment

The package was created thanks to Roche support and contributions from RWD Insights Engineering Team.

DataFakeR

Overview

DataFakeR 0.1.3 is now available!