‘archetyper’ initializes data mining and data science projects by generating common workflow components as well as the peripheral files needed to support technical best practices.
The lifecycle of a data mining project generally includes the following components:
Integration
Exploration
Enrichment
Modeling
Evaluation
Presentation
Deployment
Additionally, a well-formed data mining project will include:
Centralized code for common libraries, functions, and constants
Version control (e.g. git)
Unit testing
A readme file
Adherence to syntax and style
Externalized properties for secret information (when necessary)
Logging
Use of relative directories
The generate() function creates a new project with the files and directories needed to support the data mining and data science workflow.
generate("majestic_12")
list.files("majestic_12")
[1] "data_input" "data_output" "data_working" "docs" "majestic_12.Rproj"
[6] "models" "R" "readme.md" ".gitignore"
The R code for the data workflow will be in the R/ directory.
list.files("majestic_12/R/")
[1] "0_test.R" "1_integrate.R" "2_enrich.R" "3_model.R"
[5] "4_evaluate.R" "5_present.Rmd" "api.R" "common.R"
[9] "explore.R" "lint.R" "mediator.R" "utilities.R"
The base work-flow files include integrate.R, enrich.R, model.R, evaluate.R, present.Rmd, and api.R.
1_integrate.R is responsible for acquiring and integrating data across data sources, as well as standardizing the data types and naming conventions of columnar data. It is recommended that the output of the integration step be prepared as a two-dimensional tibble, with rows as observations and columns as raw candidate features. The output should be saved as a version-controlled feather file in the data_working/ directory for downstream use.
2_enrich.R is responsible for reading integrated data from the data_working/ directory and enriching the raw features into more informative features specific to the model being applied. Tasks in this stage might include feature engineering, outlier removal, imputation, feature selection, and the assignment of testing partition labels. As with the integration step, the output of the enrichment step should be a two-dimensional tibble saved as a version-controlled feather file in the data_working/ directory for downstream use.
3_model.R should read the training partitions from the enriched dataset stored in the data_working/ directory and train the desired model(s). Model objects generated within the model.R file should be stored in the models/ directory using consistent version control and naming conventions.
4_evaluate.R should read the testing partitions from the enriched dataset stored in the data_working/ directory and apply the trained model from the models/ directory. Model results are appended to the testing dataset and persisted as a .csv file in the data_output/ directory. A feather file is not generated in this step, as this data may be shared directly with non-technical stakeholders. Visualizations related to performance on the testing dataset (such as ROC curves and box plots) and model metadata (e.g. coefficients) may be generated at this phase and persisted in the data_working/ directory.
5_present.Rmd is an R Markdown document template that demonstrates the assembly of data and charts stored within the data_output/ directory.
explore.R exists outside of the sequential work-flow and is used primarily for transient data analysis needed for data integration and feature preparation.
api.R is a RESTful API template generated by ‘archetyper’. The API can be tested locally or deployed to an internal server. Note that the ‘archetyper’ API template does not enforce authentication or authorization; developers should therefore consult with their security team prior to deploying an API within their organization. The RESTful service in the api.R file is implemented using the ‘plumber’ package.
Additional files are created to serve supporting functions:
test.R stores unit and integration tests, with an example unit test (using the ‘testthat’ package).
mediator.R is responsible for the contiguous execution of each component, with conditions to stop execution upon failure at any step. The mediator.R file is named after the Gang of Four mediator pattern, as it orchestrates calls to each component and enforces isolation between those components. The mediator.R file also executes unit tests prior to the execution of the work-flow. In the final step, the present.Rmd file is rendered and its output (a PDF document, for example) is stored in the docs/ directory.
common.R is responsible for identifying libraries common across components (e.g. ‘dplyr’, ‘stringr’, ‘magrittr’, etc.), storing project constants, inheriting utilities, and initializing a centralized logger. Each component in the work-flow sources the common.R file.
utilities.R is designed to store functions that are necessary across components. The common.R file sources the utilities.R file, so each file in the work-flow that sources common.R also has access to the common utility methods.
.gitignore ignores files that should not be stored in version control, including config.yml. The base .gitignore file included in the ‘archetyper’ template was generated with the ‘gitignore’ R package.
lint.R includes linting commands for each file in the work-flow to enforce proper style and syntax conventions.
readme.md is designed to provide project documentation and details on how the user will interact with the code.
.Rproj is a file indicating that the directory includes an R project, and it contains metadata related to the project itself. This file also serves as the marker of the root directory so that relative directories can be used in the project’s .R files. Relative directories are supported by the ‘here’ package.
A directory structure designed to logically separate the data artifacts produced throughout the work-flow is also generated by the ‘archetyper’ package. These directories include:
data_input/ is used to store unprocessed data read in through the integrate.R file.
data_working/ stores working data throughout the data mining workflow. This data includes versioned snapshots of both integrated and enriched data, as well as other files and objects that might be necessary to present findings from the model.
models/ stores version-controlled model objects.
data_output/ stores version-controlled plain-text files with the appended model results.
docs/ stores the output of R Markdown files (and other documents resulting from the analysis).
For traceability, files and objects (e.g. models) throughout the project are named according to a standard naming convention:
[ project_name ]_[ file_name ]_[ YYYY_MM_DD_HH:MM ].[ file_extension ]
This structure, in conjunction with the persistent state of each component, allows each component script to be run independently without sourcing all the preceding components.
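The naming convention above can be sketched with two small base R helpers. Both versioned_name() and latest_version() are hypothetical names introduced here for illustration; they are not part of ‘archetyper’, and the generated templates may build and resolve file names differently.

```r
# Hypothetical helpers (not part of 'archetyper') illustrating the naming
# convention [project_name]_[file_name]_[YYYY_MM_DD_HH:MM].[file_extension].

# Build a timestamped file name.
versioned_name <- function(project, file, ext, time = Sys.time()) {
  stamp <- format(time, "%Y_%m_%d_%H:%M")
  sprintf("%s_%s_%s.%s", project, file, stamp, ext)
}

# Find the most recent version of an artifact in a directory.
# The timestamp is zero-padded and ordered from year to minute, so the
# alphabetically last match is also the newest one.
latest_version <- function(dir, project, file, ext) {
  prefix <- sprintf("%s_%s_", project, file)
  suffix <- paste0(".", ext)
  candidates <- list.files(dir)
  matches <- candidates[startsWith(candidates, prefix) &
                          endsWith(candidates, suffix)]
  if (length(matches) == 0) {
    stop("no versions of '", file, "' found in ", dir)
  }
  file.path(dir, max(matches))
}

# Example: a name for an integrated dataset written right now.
versioned_name("majestic_12", "integrated", "feather")
```

A downstream script could call latest_version() to pick up the newest snapshot written by an earlier step, which is what lets each component run on its own. Note that ":" is not permitted in file names on Windows, so a different separator would be needed there.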
A database connection type of ‘odbc’ or ‘jdbc’ can be passed to generate() through the db_connection argument to create scaffolding helpful for database connections.
ODBC
A db_connection argument of ‘odbc’ will generate a connection code snippet in the integrate.R file.
library(odbc)
con <- dbConnect(odbc::odbc(), "dev_database")
sql <- "select my_value from my_table"
result_df <- dbGetQuery(con, sql)
A file to store the database DML and DDL (for data preparation occurring in the database prior to the integration step) is additionally generated when using ‘odbc’:
dml_ddl.sql stores SQL statements and other database scripts used to prepare and extract data from the source systems.
Note that when using ODBC, the user must update the appropriate ODBC configuration files (e.g. odbcinst.ini, odbc.ini).
JDBC
The following files and directories will be created with a ‘jdbc’ argument:
dml_ddl.sql stores SQL statements and other database scripts used to prepare and extract data from the source systems.
config.yml stores secret information, such as user database credentials.
drivers/ is a directory for storing database driver JARs. Note that the user must provide the JARs, and the class path names for those JARs, specific to the database type being accessed.
A db_connection argument of ‘jdbc’ will additionally generate a connection code snippet in the integrate.R file. Note that the credentials are sourced from the config.yml file so that they are not exposed in the source code. The config.yml file is listed in the .gitignore file so that it is not committed to source control.
library(RJDBC)
db_credentials <- config::get("dev_database")
drv <- RJDBC::JDBC(driverClass = db_credentials$driver_class, classPath = Sys.glob("drivers/*"))
con <- dbConnect(drv, db_credentials$connection_string, db_credentials$username, db_credentials$password)
sql <- "select my_value from my_table"
result_df <- dbGetQuery(con, sql)
When using a JDBC connection, the user must provide appropriate driver JARs in the drivers/ directory, as well as user credentials, class path, and connection string in the config.yml file.
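For reference, a config.yml that matches the field names read by the JDBC snippet (driver_class, connection_string, username, password) might look like the following. This is only a sketch: it assumes the ‘config’ package's convention of a top-level default block, and every value shown is a placeholder, with PostgreSQL used purely as an illustration.

```yaml
# Placeholder values only; replace them with the details of your own database.
default:
  dev_database:
    driver_class: "org.postgresql.Driver"
    connection_string: "jdbc:postgresql://db.example.com:5432/analytics"
    username: "my_user"
    password: "my_password"
```

With a file of this shape in the project root, config::get("dev_database") returns a list whose elements can be accessed as db_credentials$username and so on.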
The exclude argument prevents specified files from being generated.
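# Values assumed for illustration: any writable location works for
# project_directory, and the project is assumed to be created beneath it.
project_name <- "majestic_12"
project_directory <- tempdir()
project_path <- file.path(project_directory, project_name)
project_path_r <- file.path(project_path, "R")
generate(project_name = project_name, path = project_directory, exclude = c("api.R", "utilities.R", "readme.md", "lint.R", ".gitignore"))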
list.files(project_path)
[1] "data_input" "data_output" "data_working" "docs" "majestic_12.Rproj"
[6] "models" "R"
list.files(project_path_r)
[1] "0_test.R" "1_integrate.R" "2_enrich.R" "3_model.R" "4_evaluate.R" "5_present.Rmd"
[7] "common.R" "explore.R" "mediator.R"
The ‘archetyper’ project is pre-packaged with a working demo project that predicts hospital readmission rates based on publicly-available structural characteristics and complication rates.
The demo project can be generated by running the generate_demo() function.
archetyper::generate_demo()
list.files("hospital_readmissions_demo/")
[1] "data_input" "data_output" "data_working" "docs" "hospital_readmissions_demo.Rproj"
[6] "models" "R" "readme.md" ".gitignore"
Once the demo project has been created, the project should be opened in RStudio.
The full data-mining/data-science life-cycle can be triggered by executing the mediator.R file. The contents of the mediator.R file are shown below:
cat(readChar("hospital_readmissions_demo/R/mediator.R", 1e5))
##--------------------------------------------------------------------------
## The mediator file will execute the linear data processing work-flow. -
##--------------------------------------------------------------------------
source("R/common.R")
tryCatch({
  info(logger, "running tests...")
  source("R/0_test.R")
  info(logger, "gathering and integrating data...")
  source("R/1_integrate.R")
  info(logger, "enriching base data...")
  source("R/2_enrich.R")
  info(logger, "building model(s)...")
  source("R/3_model.R")
  info(logger, "applying model(s) to test partitions...")
  source("R/4_evaluate.R")
  info(logger, "building presentation materials...")
  rmarkdown::render("R/5_present.Rmd", "pdf_document", output_dir = "docs")
  info(logger, "workflow is complete.")
},
error = function(cond) {
  log4r::error(logger, str_c("Script error: ", cond))
}
)
Note that the file includes a centralized logger to distinguish levels of severity (using the ‘log4r’ package) as well as relative directory references (using the ‘here’ package). Comments in all files were created using the ‘bannerCommenter’ package.
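The stop-on-first-failure behaviour of the tryCatch block in mediator.R can be seen in isolation with a small base R sketch. The step functions and run_steps() are hypothetical stand-ins for the sourced work-flow scripts, and message() stands in for the logger.

```r
# Hypothetical stand-ins for the sourced work-flow scripts.
steps <- list(
  integrate = function() "integrated",
  enrich    = function() stop("enrichment failed"),
  model     = function() "modeled"  # never reached
)

# Run each step in order; the first error aborts the remaining steps
# and is handled once, in a single place.
run_steps <- function(steps) {
  completed <- character(0)
  tryCatch({
    for (name in names(steps)) {
      steps[[name]]()
      completed <- c(completed, name)
    }
  },
  error = function(cond) {
    message("Script error: ", conditionMessage(cond))
  })
  completed
}

completed <- run_steps(steps)
```

Here only the integrate step completes: the error raised by enrich is reported by the handler and the model step is skipped, mirroring how a failure in one sourced file prevents the later files from running.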
A set of publicly-available data files will be loaded by the integrate.R file. The integration step joins and transforms the source files according to Tidy Data principles, and persists the integrated data into the data_working/ directory.
The enrichment step creates features better suited to the modeling process by applying feature engineering methods (such as scaling and centering the numeric features), outlier removal (using Cook’s distance), imputation (using predictive mean matching, PMM), and feature selection (by removing highly-correlated features). Additionally, training labels are assigned through stratified random sampling. The enriched results are stored in feather format in the data_working/ directory.
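As an illustration of assigning partition labels through stratified random sampling, the base R sketch below labels a fixed share of every stratum as training data. The helper assign_partition(), the 80% training fraction, and the toy hospitals data frame are all assumptions made for this example; the demo's own implementation may differ.

```r
# Hypothetical helper: label each row "train" or "test" so that roughly
# train_fraction of every stratum lands in the training partition.
assign_partition <- function(strata, train_fraction = 0.8) {
  n <- length(strata)
  # Random rank of each row within its own stratum.
  rank_in_stratum <- ave(runif(n), strata, FUN = rank)
  # Number of rows in each row's stratum.
  stratum_size <- ave(rep(1, n), strata, FUN = sum)
  ifelse(rank_in_stratum <= round(train_fraction * stratum_size),
         "train", "test")
}

# Toy data: 10 rows in one stratum and 20 in another.
set.seed(42)
hospitals <- data.frame(state = rep(c("AL", "TX"), times = c(10, 20)))
hospitals$partition <- assign_partition(hospitals$state)
table(hospitals$state, hospitals$partition)
```

Because the split is made within each stratum, both states contribute 80% of their rows to training (8 of 10 and 16 of 20) regardless of the random seed; only which rows are chosen varies.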
In the modeling step, a stepwise linear regression is applied to the training partition of the enriched dataset. The model coefficients and performance statistics are stored in the data_output/ directory, and the model itself is stored in the models/ directory.
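A stepwise linear regression of this kind can be sketched with base R's lm() and step(). The example uses the built-in mtcars dataset as a stand-in for the training partition; the formula, the use of saveRDS(), and the output location are assumptions for illustration rather than the demo's actual code.

```r
# Start from a full linear model on a built-in dataset (a stand-in for the
# training partition) and let step() add or drop terms by AIC.
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
stepwise_model <- step(full_model, direction = "both", trace = 0)

# Persist the fitted model; saveRDS() is an assumed storage format.
model_path <- file.path(tempdir(), "stepwise_model.rds")
saveRDS(stepwise_model, model_path)

# Coefficients can be collected into a plain table for later reporting.
coefficients_df <- data.frame(
  term = names(coef(stepwise_model)),
  estimate = unname(coef(stepwise_model))
)
```

Saving the model object separately from its coefficient table matches the split between a models/ directory for reusable objects and a data_output/ directory for plain-text results.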
In the evaluation step, the trained model from the models/ directory is applied to the testing partitions of the enriched dataset. The testing data with the appended predictions, along with performance statistics from the testing dataset, are stored in the data_output/ directory as a .csv file.
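The evaluation pattern (predict on held-out rows, append the predictions, compute a statistic, and write plain text) might look like the base R sketch below. The split of mtcars, the choice of RMSE as the performance statistic, and the output path are assumptions for illustration.

```r
# Toy split of a built-in dataset; these are not the demo's partitions.
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

model <- lm(mpg ~ wt + hp, data = train)

# Append predictions to the testing data.
test$predicted_mpg <- predict(model, newdata = test)

# A simple performance statistic on the held-out rows.
rmse <- sqrt(mean((test$mpg - test$predicted_mpg)^2))

# Persist as plain text so the results can be shared directly.
output_path <- file.path(tempdir(), "evaluation_results.csv")
write.csv(test, output_path, row.names = TRUE)
```

Writing a .csv rather than a binary format keeps the scored testing data readable by stakeholders who do not work in R.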
Finally, an R Markdown report is produced, using files sourced from the data_output/ directory.
The api.R file generates a sample RESTful API that uses the trained model from the models/ directory. The demo API can be called with the sample request body below:
{
"dt": {"state": "AL",
"hospital_type": "acute_care_hospitals",
"hospital_ownership": "government_hospital_district_or_authority",
"emergency_services": "Yes",
"ehr_interop": "Y",
"denominator": 1.4791,
"PSI_10": -1.3895,
"PSI_11": 0.597,
"PSI_12": -0.9487,
"PSI_13": -0.102,
"PSI_14": -1.1131,
"PSI_15": -0.9439,
"PSI_3": 0.029,
"PSI_6": -0.5752,
"PSI_8": -1.4081,
"PSI_9": -0.3028,
"denominator_ln": 1.3489}
}
Note that the demo project was designed simply to illustrate the functionality of the ‘archetyper’ project. It was not designed to be a production-ready or publication-worthy model.