This is intended for GNU R package developers who want to give package users the option to effortlessly download and mange optional data for their package.
When package authors want to ship data for their package, they will
quickly hit the package size limit on CRAN (which is 5 MB as of
September 2019). The solution is to host the data elsewhere and download
it on demand when the user requests it, then store it for future use.
This is what pkgfilecache allows you to do. You can put your files onto
a web server of your choice, take the MD5 sums, and have pkgfilecache
download them locally unless they already exist and have the correct MD5
hash. Users can then access the data in a convenient way, similar to
accessing files shipped in inst/extdata
via
system.file
. They can also erase the data if it is no
longer needed.
Technically, this package permanently stores the files under a subdir
of the directory returned by rappdirs::user_data_dir
. For
my Ubuntu Linux system, that is /home/myuser/.local/share
,
but the exact location is operating system-dependent and you should not
care about it.
You have put some file onto a server that can be accessed by the public via HTTP or HTTPS. Optionally, you know their MD5 checksums.
In the following example, we will assume that the package you develop
(the one that requires the extra data in the package cache), is called
yourpackage
. And that you have these two files on the
server:
You can use this package to manage any files you want in normal application code, and this is explained here. If you are a package author and want to give your users the option to effortlessly download optional data for your package, make sure to also read the section ‘Giving users the option to download extra package data’ below.
Let’s first make files hosted on our server available on the client, in the package cache. First, define the files you want:
library("pkgfilecache")
pkg_info = get_pkg_info("yourpackage"); # Something to identify the package that uses the package file cache.
local_filenames = c("file1.txt", "file2.txt"); # How the files should be called in the local package file cache
urls = c("https://your.server/yourpackage/large_file1.txt", "https://your.server/yourpackage/large_file2.txt"); # Remote URLs where to download files from
md5sums = c("35261471bcd198583c3805ee2a543b1f", "85ffec2e6efb476f1ee1e3e7fddd86de"); # MD5 checksums. Optional but recommended.
Now it is time to make them available:
The return value res
is a named list with several
entries. The field res$file_status
is a logical vector
indicating for each file whether it now exists locally. Note that the
function will first check to see whether the files are already in the
package cache. If you supplied md5sums, they will also be checked. Only
files which did not pass the check will be downloaded, so it is save to
call this function every time you want to make sure that the files
exist. (E.g., in example code you give).
You can also get a list of the filename which are now available (no
matter whether they have been downloaded or were already available) from
res$available
. And you can get a list of file which should
have been downloaded, but for which retrieval failed, from
res$missing
.
The example above stores all files in the base directory of the package cache. To store a file in a local subdirectory, pass a list of vectors instead. Each entry in the list will be interpreted as the relative path to the local file:
That example will download the files to
dir1/file1.txt
and dir2/file2.txt
. The
directories will be created automatically if they do not exist.
Let’s now get the full path for a file in the package cash, so that we can use it in our application:
wanted_local_file = "file1.txt";
file_path = get_filepath(pkg_info, wanted_local_file, mustWork=TRUE);
If the file is in a local sub directory, use a list with one element, a vector of path elements:
local_filenames = c("file1.txt", "file2.txt");
deleted = remove_cached_files(pkg_info, local_filenames);
The return value deleted
is a logical vector indicating
for each file whether it was deleted. IMPORTANT: Files that did not
exist in the first place were not deleted. To check which files really
exist, read on.
Do one of the following, depending on whether you want MD5 sum checks:
files_exist = are_files_available(pkg_info, local_filenames); # no MD5 check
files_exist_and_have_correct_md5 = are_files_available(pkg_info, local_filenames, md5sums=md5sums); # with MD5 check
The return values are logical vectors indicating for each file whether it exists (and, in the second example, whether the MD5 sum is as expected).
This part is for package authors and gives a suggestion for an interface that allows users to download optional data for your package. Make sure you have read the section ‘Using the functions in your script’ before reading on.
I recommend to have the following public functions in your package, which users can then call if they need to manage optional package data:
download_optional_data()
: This function should download
the optional data if needed (i.e., only if they do not exist
already).list_optional_data()
: This function should return a
vector of local files in the package cache that are currently
available.get_optional_data_file(filename, mustWork=TRUE)
: This
function should return the full path to a file in the package cache,
based on the filename.delete_all_optional_data()
: This function should delete
the optional data from the package cache to free up disk space.Here are example implementations for the functions above. You can copy and paste them, all you need to do afterwards is:
yourpackage
with the name of
your package.download_optional_data
function: also replace
the local_files
, urls
and md5sums
with your optional data information.Here are the functions:
#' @title Download optional data for this package if required.
#'
#' @description Ensure that the optioanl data is available locally in the package cache. Will try to download the data only if it is not available.
#'
#' @return Named list. The list has entries: "available": vector of strings. The names of the files that are available in the local file cache. You can access them using get_optional_data_file(). "missing": vector of strings. The names of the files that this function was unable to retrieve.
#'
#' @export
download_optional_data <- function() {
pkg_info = pkgfilecache::get_pkg_info("yourpackage"); # to identify the package using the cache
# Replace these with your optional data files.
local_filenames = c("file1.txt", "file2.txt"); # How the files should be called in the local package file cache
urls = c("https://your.server/yourpackage/large_file1.txt", "https://your.server/yourpackage/large_file2.txt"); # Remote URLs where to download files from
md5sums = c("35261471bcd198583c3805ee2a543b1f", "85ffec2e6efb476f1ee1e3e7fddd86de"); # MD5 checksums. Optional but recommended.
cfiles = pkgfilecache::ensure_files_available(pkg_info, local_filenames, urls, md5sums=md5sums);
cfiles$file_status = NULL;
return(cfiles);
}
#' @title Get file names available in package cache.
#'
#' @description Get file names of optional data files which are available in the local package cache. You can access these files with get_optional_data_file().
#'
#' @return vector of strings. The file names available, relative to the package cache.
#'
#' @export
list_optional_data <- function() {
pkg_info = pkgfilecache::get_pkg_info("yourpackage");
return(pkgfilecache::list_available(pkg_info));
}
#' @title Access a single file from the package cache by its file name.
#'
#' @param filename, string. The filename of the file in the package cache.
#'
#' @param mustWork, logical. Whether an error should be created if the file does not exist. If mustWork=FALSE and the file does not exist, the empty string is returned.
#'
#' @return string. The full path to the file in the package cache. Use this in your application code to open the file.
#'
#' @export
get_optional_data_filepath <- function(filename, mustWork=TRUE) {
pkg_info = pkgfilecache::get_pkg_info("yourpackage");
return(pkgfilecache::get_filepath(pkg_info, filename, mustWork=mustWork));
}
#' @title Delete all data in the package cache.
#'
#' @return integer. The return value of the unlink() call: 0 for success, 1 for failure. See the unlink() documentation for details.
#'
#' @export
delete_all_optional_data <- function() {
pkg_info = pkgfilecache::get_pkg_info("yourpackage");
return(pkgfilecache::erase_file_cache(pkg_info));
}