replacer
, a value
replacement utility currently based on package data.table
,
is intended for outside-database update of data. It requires the
preparation of a lookup file which is a list of replacement
requests with and, in special circumstances, without an
index column.
The input and output data files are comma-separated values file format (csv). Various other file formats can therefore be processed after conversion.
Data processing such as cleanup, file format conversion and the appendage of new data are outside the scope of this utility.
There could be several lookup files associated with one data file and vice versa which could be batch-processed.
This vignette defines the terms related to the replacement and to the elements of the lookup file, and details the procedure of creating an efficient lookup file suggesting ways to maintain it relevant for future updates. It continues by introducing the two User-intended functions, for single file and batch file processing, and describes their internal workflow. Finally, it presents a screen output result obtained with a lookup without index and introduces the datasets that are used in examples.
This utility is accessible to beginners to R
working
with outside-database files. It facilitates complex dataset updates with
minimal prompt time by employing User-friendly functions which
automatically follow a decision tree rooted in the presence of User-made
index.
The choice for data.table
is motivated by its concise
and sound code, effectiveness and efficiency in processing large and
complex datasets, and its independence on packages, or verses,
other
than base R
itself; this utility will only install
data.table
on User’s machine, had it had not been installed
already.
Hence, familiarity with basic R
commands is necessary.
Familiarity with data.table
is not, yet strongly
recommended!
The following terms are specific to this utility:
lookup table: a csv format data file containing at least 4 standard columns with standard column names:
simple (or 1:1) replacement: one value from newVals replaces one value from oldVals
multiple (or 1:many) replacement: one value from newVals replaces identical values from oldVals
split: in this context, recursive partition of multiple replacements into simple and multiple replacements
generic value: restricted replacement value used for multiple replacements of missing data
involved column: data column with the name listed in vars column of lookup
request: implicit information containing value type, value missingness, value uniqueness, value location, replacement type, replacement value - all in one row, spanning the standard columns of lookup
Lookup may contain a combination of simple and multiple replacements, simple replacements only, multiple replacements only (i.e. missing only, duplicated values only, or mixed) as needed. Multiple replacements can be split as needed.
On special circumstances, column “id” may be absent from lookup while the presence of the other three standard columns is mandatory. Modifying standard columns’ names will result in error. Any extra columns in lookup will be automatically removed during internal conformance checks on lookup.
Multiple replacements apply to missing or duplicated non-missing values. While replacement values of duplicated values could be regarded as generic, they are entirely decided by the User. They are not restricted in this sense.
In this version, replacements of missing data are restricted to words formed with the word roots MIS|mis, PRE|pre, UN|un, ABS|abs and the adverbs YES|yes, NO|no for missing values of type character and with a number equal to- or larger than 3 integer 9s (i.e. 999, 9999 etc.) for missing values of double or integer types.
Short sentences beginning with any word root or adverb above may be accepted as replacements.
Typical data replacement workflow: the initial data file and the
corresponding lookup file (here named “data.csv” and
“lookup.csv”) are saved in a directory, here named
dir. Once the package is installed, the following commands
should be typed at the R
prompt:
> require(replacer)
> dir = 'C:/path/to/directory'
> replaceVals(dir) ## to update the values in 'data.csv'
Results will show on screen (see Section “Screen Output Example” which presents the result of a different data/lookup pair and requests set). A csv file similar to Out-Data is saved on dir.
Rw | a | b | uninvlvd |
---|---|---|---|
1 | aa | 7.2 | 0.6 |
2 | 17.6 | 0.1 | |
3 | cc | 1.6 | 1.6 |
4 | dd | 0.8 | 1.2 |
5 | ee | -14.0 | 5.0 |
6 | cc | 0.1 | |
7 | 1.6 | ||
8 | -5.0 | 3.3 | |
9 | mm | 1.6 | 9.6 |
10 | 5.2 | ||
11 | |||
12 | cc |
Rw | vars | oldVals | newVals | id | source |
---|---|---|---|---|---|
1 | a | ZZ | 2 | source1 | |
2 | a | CC | 8 | source2 | |
3 | a | cc | DD | 0 | source3 |
4 | a | mm | UU | 9 | source4 |
5 | a | cc | CA | 12 | source5 |
6 | a | YES | source6 | ||
7 | b | 1.607 | 1.1 | 0 | source7 |
8 | b | 1.607 | 8.8 | 7 | source8 |
9 | b | 999 | source9 |
Rw | a | b | uninvlvd |
---|---|---|---|
1 | aa | 7.2 | 0.6 |
2 | ZZ | 17.6 | 0.1 |
3 | DD | 1.1 | 1.6 |
4 | dd | 0.8 | 1.2 |
5 | ee | -14.0 | 5.0 |
6 | DD | 0.1 | |
7 | YES | 8.8 | |
8 | CC | -5.0 | 3.3 |
9 | UU | 1.1 | 9.6 |
10 | YES | 5.2 | |
11 | YES | 999.0 | |
12 | CA | 999.0 |
This small example covers the complete set of lookup requests with index present. Column “Rw” is here for convenience only. It should not be manually created.
The left-hand side table, In-Data, is the original data file and contains 3 columns of interest. The two columns (“a” and “b”) involved in replacements, are of character respectively, of numeric data types and both contain missing, duplicates and unique values. The “uninvlvd” column in In-Data is not listed in lookup’s vars, it is not involved in replacements and therefore, remains unchanged in Out-Data. Column source in Lookup is not standard and therefore, is automatically removed from computations.
The Out-Data result was achieved with a lookup file with User-made index; the replacement requests were as follows:
A User-made index may or, may not be necessary in lookup:
A User-made index is the safest value replacement procedure; it requires little insight in the structure of data and insures fast review of large datasets. When there is a full range of request types including splitting, the index should contain the value types shown in the above example, in lookup column id.
There could be several distinct sets of duplicated values in any data column. If the purpose is multiple replacements of any/all of these sets (or subsets of these sets) then, for each subset, the replacement request should take only one row in lookup having 0 as index value. Corresponding replacement values are at User’s discretion.
Same rules as above apply to multiple replacement of sets of missing values with two differences: 1) the corresponding index value is empty or NA, and 2) the generic value is restricted to a word or short sentence starting with any word root or adverb listed in “Definitions”.
For generic values of numeric type, data type preservation (see Note 2) is not required (i.e. it is not required to enter 9999.0 instead of 9999 in the id row corresponding to missing values of double data type).
Further revisions may bring more generic value customization if necessary.
To request simple replacements, whether having unique, duplicated or missing values in oldVals column, the index should contain the corresponding data file row number. Sorting lookup by vars would maintain focus on one data column at a time.
Standard columns oldVals and newVals in lookup, revert to the character data type upon completion since they contain numeric and character values combined. This behavior is normal.
In conclusion, a User-made index covers all types of replacements, including splits.
There are special circumstances when the index is absent in lookup:
There is no need for index when lookup contains only multiple replacement requests (Section “Screen Output Example”). An extension of this case involves splitting on missing values and is explained in the sub-section below.
This case requires a deeper insight in the structure of the data file
such as knowing in advance which values are duplicated. When used
standalone, helper function whichDups()
- part of this
utility - finds the duplicated values in all columns of the data
file.
The requirements for an effective lookup table without index are:
rows in lookup must have the same succession as the succession of values in data. This applies to each involved column (e.g. if two data values located at rows 3 and 2000, need be updated in data column “X”, the corresponding requests for “X” in lookup should be located in adjacent rows, in the same order)
simple replacements: if values set for replacement within any or all involved data columns are unique (either missing or not), simple replacements without index are possible
mixed simple and multiple replacements: if any involved data column contains unique, duplicated and/or missing values and some or all of the duplicated and/or missing values are set for multiple replacements that do not undergo splitting, replacements without index are possible
splitting: not implemented for duplicated values when User-made index is absent. Although there are ways for implementing duplicate values splits without index, the necessary User time and data structure insight plead for the implementation of a User-made index. A special case of splitting is, however, allowed for missing values (see Note 3)
Note 1 Whether the index is present or not in lookup, at the time of running the updates the row order in the data file must be identical to the row order in the data file as it was at the completion of lookup. If necessary, an index named other than “id” could be built and preserved inside the dataset and it will be treated as uninvolved column. Re-ordering the dataset by this index before run will retrieve the row order at the completion of lookup and will keep the lookup relevant for future data updates.
Note 2 It is recommended that User preserves the
data type between values set for replacement and their replacements
(i.e. integer/double to integer/double etc.). Although the helper
function con2fcoales()
automatically solves such
discrepancies in the background, if an error message similar to
“Item 2 is type integer but the first item is type double. Please
coerce …” appears on screen, it means that 1) the requirement was
obviously violated and that 2) the severity of violation was beyond the
mitigation capability of con2fcoales()
.
In such cases, a review of all output messages and a review of the data file and the associated lookup should ensue.
Note 3 Due to their special type, splitting on missing values across data columns is allowed when index is absent, i.e. if one involved data column contains one missing value and other involved columns contain missing values set (or not) for multiple replacements, splitting is allowed and the unique missing value can be replaced with a non-generic replacement value.
There are two functions designed for direct use and set to display a series of messages and comments informing the User on the computational path taken and on findings along the way. While the messages can be turned off, the comments remain visible.
Single file replacements are processed by function
replaceVals()
. When the data and lookup files are named
simply “data.csv” and “lookup.csv” the function only
requires the path to the directory where these files were stored,
written as length 1 quoted character, with forward
“/” or double backward “\\” slash and
without end slash.
At start, this function performs a series of conformance checks on data and lookup, excludes non-standard columns, separates the data into involved and uninvolved columns then, rejoins them on exit to preserve the data format.
Batch file replacements are processed by function
bReplace()
. When messages are set to TRUE
it offers an wide range of custom messages/comments for each data/lookup
input pair and request type. During run, bReplace()
calls
replaceVals()
function as many times as data/lookup pairs
are on the names list, displays a named list of messages, comments,
updated data and counts related to duplicated and/or missing values
requests, for each pair.
Both functions save their updated data to the directory above, in csv file format.
Helper function sReplace()
(described in the Manual) is
the data replacement workhorse for this utility. When called by
replaceVals()
, it firstly checks for index presence in
lookup. Upon result, the function moves along the branches of a decision
tree:
The function starts by identifying duplicated and/or missing values within the involved columns as well as eventual splits on missing values.
If lookup requires multiple and simple replacements, the function separates lookup into maximum 3 subsets: 1) of multiple replacements for duplicated values, for which it later creates an internal index list, 2) of multiple replacements for missing values for which an internal index is not necessary, and 3) the remainder subset containing unique values, including unique missing values, for which it also creates an internal index list.
Requests for splitting on duplicated values with no index stop the function with an error.
The function creates an internal index list of row numbers
corresponding to all elements of distinct subsets of duplicated values
found within each involved data column and loops the function
data.table::set()
for replacements.
As already mentioned, no index is created for multiple replacements of missing values as there is only one generic value per data column (this utility uses no other information on missing data beside the data type. For a different approach on this subject the User is directed to data imputation literature).
The subset of missing values is then reshaped, and the columns are coalesced with corresponding data columns, for each generic value present in lookup.
As stated above, simple replacements of unique values without
User-made index are possible. Once the internal index is created, this
subset is reshaped, joined with the data on index and
then, the corresponding columns are coalesced (please see these
terms in data.table
Manual).
The function subsets the lookup using the special index values 0 and/or NA (or empty). At maximum, 3 subsets are formed as above. The replacement process is similar with the process used for absent index with the difference that the unique values already have the User-made index of row numbers.
In conclusion, whether single or batch file processing, all request types are processed in one run while Users have the ability of monitoring the internal workflow.
Presented below is an update performed on the “Chile” dataset from
package carData
.
At the R
prompt, type or just copy/paste the following
commands:
help(Chile, package = "carData")
require(replacer)
dir = system.file("extdata", package = "replacer")
replaceVals(dir, 'chile.csv', 'chile_nadup.csv', save = FALSE)
Screen output:
reading data from: C:/R/R-4.1.0/library/replacer/extdata ...
completed reading data ...
checking standard columns in lookup ...
starting replacements ...
found no request for 1:1 replacements ...
searching for 1:many replacements ...
found request for multiple duplicated value replacments: creating index ...
replacing multiple duplicated values ...
found request for multiple missing value replacements: replacing ...
helper function has completed!
$` updated_chile_using_chile_nadup`
region population sex age education income statusquo vote
1: North 200000 Male 65 Primary 35000 1.00820 Y
2: North 200000 Male 29 Post-secondary 7500 -1.29617 N
3: North 200000 Female 38 Primary 15000 1.23072 Y
4: North 200000 Female 49 Primary 35000 -1.03163 N
5: North 200000 Female 23 Secondary 35000 -1.10496 N
---
2696: Metro Santiago 15000 Male 42 Primary 15000 -1.26247 N
2697: Metro Santiago 15000 Female 28 Primary 15000 1.32950 Y
2698: Metro Santiago 15000 Female 44 Primary 75000 1.42045 Y
2699: Metro Santiago 15000 Male 21 Secondary 75000 0.18315 PREFERS NO ANSWER
2700: Metro Santiago 15000 Male 20 Post-secondary 35000 1.38179 Y
$` multiple_dups_repl_counts`
vars oldVals newVals education population region sex
1: education P Primary 1107 NA NA NA
2: education PS Post-secondary 462 NA NA NA
3: education S Secondary 1120 NA NA NA
4: population 175000 200000 NA 140 NA NA
5: population 25000 50000 NA 360 NA NA
6: region C Central NA NA 600 NA
7: region M Metro Santiago NA NA 100 NA
8: region N North NA NA 322 NA
9: region S South NA NA 718 NA
10: region SA City of Santiago NA NA 960 NA
11: sex F Female NA NA NA 1379
12: sex M Male NA NA NA 1321
$` multiple_NAs_repl_counts`
region population sex age education income statusquo vote
0 0 0 1 11 98 17 168
This update has replaced the abbreviations used in the source data with full names and complete words, as well as, updated the population totals on few regions with fictitious values. Missing values in column vote were replaced by a short sentence using the word root “PRE” listed above.
The requests were processed through a lookup table without User-made index.
The first output block named “updated_chile_using_chile_nadup” displays the head and the tail of the updated data file.
To view the complete updated dataset on screen, type:
upData = replaceVals(dir, 'chile.csv', 'chile_nadup.csv', save = FALSE)[[1]]
View(upData)
Next block, named “multiple_dups_repl_counts”, displays a count of duplicated value updates processed on each involved column, by value and replacement value. In this example, the left-hand side of this block contains the complete cases of lookup.
Finally, the third block named “multiple_NAs_repl_counts”
shows counts of missing values replaced by generic values within
respective involved columns. In other cases, unrequested replacements
within any of the involved columns will be commented in the screen
output (see replaceVals()
Examples).
All block names change according to input and request type; these names show what was processed and what was not. In case of error, the displayed messages/comments should hint of where in the process the error occurred.
Although artificial, the datasets in this package cover situations
encountered in practice. They contain fictitious values inserted for
exemplification purpose with the exception of “chile” dataset,
a copy of the “Chile” dataset from package carData
;
however, some replacement values within associated lookup files are also
fictitious.
The batch-file example (see bReplace()
Examples)
contains the names of data and associated lookup files presented below.
Duplicated values and counts of missing values are shown below for all
these datasets. When several distinctive subsets of duplicated values
are present in an involved column the column’s name appears indexed.
|
|
||||||||||||||||||||
Associated lookup files:
“lookup_id.csv”: mixed simple/multiple replacements of unique, duplicated and missing data, similar to the typical replacement workflow example. The file contains standard columns only.
“lookup_idsimple.csv”: simple replacements i.e. without 0 or NA values in “id”. The file contains standard columns only.
|
|
||||||||||||||||||||||||
Associated lookup file:
“chile_id.csv”: splits of multiple replacements of duplicated values. Also, multiple replacements of missing data. The file includes non-standard column “source”.
|
|
||||||||||||||||||||||||
Associated lookup file:
“lookup.csv”: mixed simple and multiple replacements of unique, duplicated and missing values. The file contains standard columns only.
|
|
||||||||||||
Associated lookup file:
“lookup_unique”: split across columns on missing values. The file contains standard columns only.
“lookupDUP”, “lookupNA”: multiple replacements only of duplicated or missing values in “data_id” dataset. Both files contain standard columns only.
“chile_nadup”: multiple replacements only of duplicated and missing values in “chile” dataset. The file contains non-standard column “source” .
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] replacer_1.0.2 kableExtra_1.3.4 knitr_1.39 data.table_1.14.2
#>
#> loaded via a namespace (and not attached):
#> [1] rstudioapi_0.13 xml2_1.3.3 magrittr_2.0.3 munsell_0.5.0
#> [5] rvest_1.0.2 viridisLite_0.4.0 colorspace_2.0-3 R6_2.5.1
#> [9] rlang_1.0.4 fastmap_1.1.0 highr_0.9 stringr_1.4.0
#> [13] httr_1.4.4 tools_4.2.1 webshot_0.5.3 xfun_0.32
#> [17] cli_3.3.0 jquerylib_0.1.4 systemfonts_1.0.4 htmltools_0.5.3
#> [21] yaml_2.3.5 digest_0.6.29 lifecycle_1.0.1 formatR_1.12
#> [25] sass_0.4.2 glue_1.6.2 cachem_1.0.6 evaluate_0.16
#> [29] rmarkdown_2.15 stringi_1.7.8 compiler_4.2.1 bslib_0.4.0
#> [33] scales_1.2.0 svglite_2.1.0 jsonlite_1.8.0