In contrast to other programming languages, R has no widely
established and undisputed style guide (e.g. PEP 8 for Python). As a
data scientist, I helped to establish a company wide R style guide.
While it mainly relies on the tidyverse style guide, we
generally decided to be more explicit in our coding practice. This
includes that we always refer to functions from non-native R packages
with the double colon operator ::
. While it is relatively
easy to establish such a convention in new projects, it is challenging
to adapt ongoing projects and legacy code. origin
allows
for much faster conversions of both legacy code as well as currently
written code.
origin
The main purpose is to add pkg::
to an R function call,
i.e. it changes code like this:
origin
In general, you can either originize some selected text (more on that
later in Addins), a whole script, or a all scripts in a specific folder,
e.g. your project folder. There is a specifically designed function for
each purpose yet they all share the same options. Therefore, only
originize_file()
is extensively presented as an example
with its default options.
originize_file(file = "testscript.R",
pkgs = .packages(),
overwrite = TRUE,
ask_before_applying_changes = TRUE,
ignore_comments = TRUE,
check_conflicts = TRUE,
add_base_packages = FALSE,
check_base_conflicts = TRUE,
check_local_conflicts = TRUE,
excluded_functions = list(dplyr = c("%>%", "across"),
data.table = c(":=", "%like%"),
# exclude from all packages:
c("first", "last")),
verbose = TRUE,
use_markers = TRUE)
pkgs
: which packages to check for functions used in the
code (see Considered Packages). The default are all
packages attached via library
or require
overwrite
: actually insert pkg::
into the
code. Otherwise, logging shows only what would happen. Note
that ask_before_applying_changes
still allows to keep
control over your code before origin
changes anything.ask_before_applying_changes
: whether changes should be
applied immediately or the user must approve them first.check_conflicts
: should origin
check for
potential namespace conflicts, i.e. a used function is defined in more
than one considered package. User input is required to solve the issue.
Strongly encouraged to be set to TRUE
.add_base_packages
: should base packages also be added,
e.g. base::sum()
.check_base_conflicts
: Should origin also check for
conflicts with base R functions.check_local_conflicts
: Should origin also check for
conflicts with locally defined functions anywhere in your project? Note
that it does not check the environment but solely parses files and scans
them for function definitions.excluded_functions
: a (named) list of functions to
exclude from checking.verbose
: some sort of logging is performed, either in
the console or via the markers tab in RStudio.use_markers
: whether to use the Markers tab in
RStudio.Besides using regular R functions to originize files, there are also
useful addins delivered with origin
. These addins are
designed to be used on-the-fly while coding. You can either originize
selected text, the currently opened file, or all scripts in the
currently opened project. However, to have as much control as when using
functions, each function argument corresponds to an option that can be
set and used inside the addins, e.g.
Actually, most function arguments of origin
first check
whether an option has been declared and uses the assigned value as its
default. This allows for equal outcomes regardless whether you use the
addin or a function sequentially.
Since origin
changes files on disk, it is very important
that the user has full control over what happens and user input is
required before critical steps.
Most importantly, the user must be aware of what the originized file(s) would look like. For this, all changes and potential missed changes are presented, either in the Markers tab (recommended) or in the console.
pkg::
is
inserted prior to a functionorigin
highlights such cases in the
logging output.%>%
are exported by packages but cannot be called with
the pkg::fun()
convention. Such functions are highlighted
by default to point the user that these stem from a package. When using
dplyr-style code, consider to exclude the pipe-operator via
exclude_functions
.Due to the variety of R packages, function names must not be unique
across all packages out there. By default, R masks priorly imported
functions by those imported afterwards. origin
mimics this
rule by applying a higher priority to those packages that are listed
first. In case there is a conflict regarding a used
function, These functions are listed along with the packages from which
they stem.
Used functions in mutliple Packages!
filter: dplyr, stats first: data.table, dplyr
Order in which relevant packges are evaluated: data.table >> dplyr >> stats
Do you want to proceed? 1: YES 2: NO
As packages mask each others functions, the same applies to locally
defined custom functions. In case you defined your own last
function in your project, origin
should
not add dplyr::
to it. Therefore, your
project is searched for function definitions and local functions have
higher priority than those exported by packages. Note that, depending on
the project size, this process can take quite some time. In this case,
set the argument/option path_to_local_functions
to a
subdirectory or check_local_conflicts
to FALSE
to skip this feature.
Locally defined and used functions mask exported functions from packages
last: dplyr
Local functions have higher priority. In case you want to use an exported version of a function listed above set pkg::fun manually
Got it? 1: YES 2: NO 3: Show files
When originizing
a complete folder or project, many R
scripts might be checked. In case the user is unaware that there are
many files in the selected folder, resulting in a long run time of
origin
, a warning is triggered and user input is
required.
You are about to originize 99 files.
Proceed? 1: YES 2: NO 3: Show files
Before the proposed changes are applied eventually, a final user input is required.
Happy with the result? 😀
1: YES 2: NO
Whether or not to add pkg::
to each (imported) function
is a controversial issue in the R
community. While the tidyverse style guide does not mention explicit
namespacing, R Packages and the Google
R style guide are in favor of it.
Pros
Cons
(minimal) performance issue
more writing required
longer code
infix functions like %>%
cannot be called via
magrittr::%>%
and workarounds are still required here.
Either use
library(magrittr, include.only = "%>%")
`%>%` <- magrittr::`%>%`
calling library()
on top of a script clearly
indicates which packages are needed. A not yet installed package throws
an error right away, not until a function cannot be found later in the
script. However, one can use the include_only
argument and
set it to NULL
. No functions are attached into the search
list then.
library(magrittr, include_only = NULL)
As a new feature origin origin exports the function
check_pkg_usage
. Given you take over a project or just
built a huge barrage of library
calls over time. Which of
those are actually still needed. Just run all those
library(...)
calls and then call
check_pkg_usage()
== Package Usage Report ================================================
-- Used Packages: 2 ----------------------------------------------------
v data.table
v testthat
-- Unused Packages: 1 --------------------------------------------------
i dplyr
-- Possible Namespace Conflicts: 1 -----------------------------------
x last data.table >> dplyr
-- Specifically (`pkg::fun()`) further used Packages: 2 ----------------
i purrr
-- Functions with unknown origin: 1 ------------------------------------
x map
The output shows - we had attached 3 packages: {data.table},
{testthat}, and {dplyr} - functions from {data.table} and {testthat} are
used - {dplyr} functions are not used - a namespace conflict for the
function last
between {data.table} and {dplyr} -
additionally, we use purrr:: at some occasions - we use the
map()
function that is not exported from {data.table},
{testthat}, or {dplyr}. Note that map
is exported from
{purrr} that is used elsewhere but here our code would fail since
{purrr} is not attached and `map cannot be found.
A markers Tab shows all unknown functions and unknown packages that are used explicitly
Having a closer look into result
as.data.frame(result)
#> pkg fun n_calls namespaced conflict conflict_pkgs
#> 1 base %in% 53 FALSE FALSE NA
#> 2 base .packages 8 FALSE FALSE NA
#> 3 base Filter 3 FALSE FALSE NA
#> 4 base Map 1 FALSE FALSE NA
#> 5 base Reduce 5 FALSE FALSE NA
#> ...
It first shows a lot of base functions. That is, even though their are not explicitly attached, base r packages are always attached. The print output does not show them but if you want to deep dive into the functions that are used in the project they are available
#> pkg fun n_calls namespaced conflict conflict_pkgs
#> 110 data.table %like% 10 FALSE FALSE NA
#> 111 data.table := 1 FALSE FALSE NA
#> 112 data.table CJ 1 FALSE FALSE NA
#> 113 data.table as.data.table 1 FALSE FALSE NA
#> 114 data.table as.data.table 3 TRUE FALSE NA
#> 115 data.table last 2 TRUE FALSE NA
#> 116 data.table last 1 FALSE TRUE dplyr
Going further, there are a bunch of {data.table} functions that have
been used. Some are listed twice because they were sometimes called via
data.table::
, sometimes not. Furthermore, last
is marked with conflict = TRUE
. This is because {dplyr}
does export a last
function, as well. However, since
{data.table} has the higher priority than {dplyr} in this project,
{origin} considers it as an {data.table} function. Note that if a
function is namespaced via ::
, no conflict is given.
Finally, at the end of the output:
#> pkg fun n_calls namespaced conflict conflict_pkgs
#> 219 <NA> map 1 FALSE NA NA
#> 220 dplyr <NA> 0 NA NA NA
Here we see the map
function that would not be assigned
to one of the given packages and the {dplyr} package that has not been
used.
Locally defined functions are also detected via parsing. These also do have a higher priority than exported function from other packages.