Implementing a session cache in R packages


It’s been about 8 months since I started working with the USACE Portland District, and while I’ve found myself doing a lot of Python (and Jython) scripting, I’ve still had opportunities to build some R packages. One effort I took on was to rewrite the dssrip package, which was written by Evan Heisman and others to read HEC-DSS files into R. While dssrip worked well, it required some manual configuration to get set up and did not pass the various R checks required for publication on CRAN. What started out as a modest effort to clean up the source code to meet CRAN requirements ballooned into a complete redesign of the package, including adding DSS write support and automating the installation of the required Java libraries. I’m pretty happy with the result, and it gave me the opportunity to dive into the HEC-DSS Java interface (which has helped a lot with my Jython projects!).

Like the original dssrip package, my rewrite, dssrip2, uses rJava to access the HEC-DSS Java classes and methods. Dealing with the DSS file handles is not trivial: you don’t want to be constantly closing and re-opening DSS files, as this can significantly slow down operations, but you also don’t want to open the same file twice, as this can cause DSS to go into “multi-user access” mode (which also slows down operations). My original design made the user open and close DSS files explicitly. This worked fine but didn’t sit well with me for a few reasons. It felt clunky to have to create a file handle and pass that to the various read/write functions rather than letting the user simply pass a filename, and I didn’t like that the user was exposed to the underlying Java object reference. It also wasn’t very safe, since a user could easily overwrite the file reference by accident. What I really wanted was a way to store the file handles internally in the package and retrieve them when the user supplied a file path (and create a new handle when necessary).

The simplest implementation of the “file store” would be a named list of file handles: provide the name, get back the file handle. However, anyone who writes R packages will eventually discover that package namespaces are sealed on load. This presents a conundrum: you can’t simply create a placeholder variable in the package, because any later attempt to modify that variable fails with a “locked binding” error, and trying to add a new variable instead fails with a message along the lines of “cannot add bindings to a locked environment”. In the past I’ve used tricks like creating an environment in the package .onLoad() function which can be accessed and modified, but I was intrigued by the way the ggplot2 package implements its last_plot() function:

# copied from https://github.com/tidyverse/ggplot2/blob/04a5ef274e912bee76180154b25d8bca0206feb1/R/plot-last.R
.plot_store <- function() {
  .last_plot <- NULL

  list(
    get = function() .last_plot,
    set = function(value) .last_plot <<- value
  )
}
.store <- .plot_store()

set_last_plot <- function(value) .store$set(value)


last_plot <- function() .store$get()

The .plot_store() function is essentially a wrapper around a .last_plot variable, together with functions for reading or overwriting its value. The package then creates a .store object on load by calling .plot_store(), which provides access to the set() and get() functions. Because the .last_plot variable lives in the environment created by the call to .plot_store() rather than in the package namespace itself, it does not get locked when the namespace is sealed.
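To see the difference in behaviour outside of a package, here is a minimal sketch that imitates a sealed namespace with an ordinary locked environment. This is only an illustration, not code from ggplot2 or dssrip2: the names `fake_ns`, `placeholder`, `cache_env`, `make_store()` and `err()` are all made up, and a real namespace is sealed by R itself rather than by an explicit call to `lockEnvironment()`.

```r
# A generic store built the same way as .plot_store(): the state lives in
# the environment created by the function call.
make_store <- function() {
  value <- NULL
  list(
    get = function() value,
    set = function(new) value <<- new
  )
}

# Imitate a sealed package namespace (hypothetical stand-in).
fake_ns <- new.env()
fake_ns$placeholder <- NULL      # a plain placeholder variable
fake_ns$cache_env <- new.env()   # the "environment" trick
fake_ns$store <- make_store()    # the closure-based store
lockEnvironment(fake_ns, bindings = TRUE)

# Helper: return the error message of an expression, or NA if it succeeds.
err <- function(expr) {
  tryCatch({ expr; NA_character_ }, error = function(e) conditionMessage(e))
}

# Overwriting the placeholder fails: its binding is locked.
msg_modify <- err(assign("placeholder", 1, envir = fake_ns))

# Adding a new variable fails: the environment is locked.
msg_add <- err(assign("new_var", 1, envir = fake_ns))

# The nested environment is not locked, so its contents can still change.
assign("handle", "some value", envir = fake_ns$cache_env)
env_value <- get("handle", envir = fake_ns$cache_env)

# The closure's own environment is not locked either, so set() still works.
fake_ns$store$set("first")
fake_ns$store$set("second")
store_value <- fake_ns$store$get()
state_locked <- environmentIsLocked(environment(fake_ns$store$set))
```

Both workarounds rely on the same fact: locking an environment locks its own bindings, but not the contents of other environments that those bindings point to. The closure version simply hides that second environment behind get() and set().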

It took trivial modifications to turn this into a file handle store. I also added functions to list file handles in the store and “drop” file handles (call the DSS file handle’s close() method and remove the reference from the store).

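# .jcall() is provided by the rJava package.
.file_store <- function() {
  # Open DSS file handles, keyed by file path.
  .file_list <- list()

  list(
    # Return the handle for a path (NULL if the path is not in the store).
    get = function(filepath) {
      .file_list[[filepath]]
    },
    # Open the file and store its handle, unless one is already stored.
    set = function(filepath, ...) {
      if (!(filepath %in% names(.file_list))) {
        .file_list[[filepath]] <<- .jcall("hec/heclib/dss/HecDss",
          "Lhec/heclib/dss/HecDss;", method = "open", filepath, ...)
      }
    },
    # Close the handle and remove it from the store, if it is present.
    drop = function(filepath) {
      if (filepath %in% names(.file_list)) {
        .file_list[[filepath]]$close()
        .file_list[[filepath]] <<- NULL
      }
    },
    # Return the paths of all files currently in the store.
    list = function() {
      names(.file_list)
    }
  )
}
.store <- .file_store()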
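The store above needs rJava and the HEC-DSS libraries to run, so here is a rough, self-contained sketch of the same pattern with the Java call swapped out for a pluggable opener. Everything in it is hypothetical: `make_handle_store()`, `mock_open()` and `mock_close()` are stand-ins I made up for illustration, and the "handles" are plain R lists rather than Java object references.

```r
# Same structure as the file store, but handles are created and closed by
# user-supplied functions so the sketch runs without rJava or HEC-DSS.
make_handle_store <- function(opener, closer) {
  handles <- list()
  list(
    get = function(path) handles[[path]],
    set = function(path, ...) {
      # Only open the file if it is not already in the store.
      if (!(path %in% names(handles))) {
        handles[[path]] <<- opener(path, ...)
      }
      invisible(NULL)
    },
    drop = function(path) {
      # Close and forget the handle, if there is one.
      if (path %in% names(handles)) {
        closer(handles[[path]])
        handles[[path]] <<- NULL
      }
      invisible(NULL)
    },
    list = function() names(handles)
  )
}

# Mock opener/closer that record what they were asked to do.
opened <- 0
closed <- character()
mock_open <- function(path) {
  opened <<- opened + 1
  list(path = path)
}
mock_close <- function(handle) {
  closed <<- c(closed, handle$path)
}

store <- make_handle_store(mock_open, mock_close)
store$set("a.dss")
store$set("a.dss")  # no-op: the file is already in the store
store$set("b.dss")
listed <- store$list()
handle_a <- store$get("a.dss")

# "Close all" is just drop() applied to every path in the store.
for (p in store$list()) store$drop(p)
remaining <- store$list()
```

The second call to set() with the same path does not invoke the opener again, which is the property that keeps a file from being opened twice.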

The logic is pretty simple. The set() function calls the Java logic needed to create a file handle and stores the handle in .file_list using the supplied filepath as the element name. If there is already an entry in the store for a given file path, the set() function does nothing. The get() function simply returns the file handle for the supplied path. In the package code, I use a strict implementation of normalizePath() to ensure file paths are consistently formatted regardless of how the supplied path is formatted or whether it is an absolute or relative path. One hiccup I discovered is that normalizePath(..., mustWork = FALSE) does not always provide the same letter case for the expanded directory, so my implementation additionally calls normalizePath(..., mustWork = TRUE) on the directory of the supplied path to ensure paths for new files are constructed consistently with those for existing files.
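The path handling described above could look something like the following sketch. This is not the actual dssrip2 implementation: `normalize_store_path()` is a hypothetical helper, and it assumes the directory of the supplied path already exists even when the file itself does not.

```r
# Hypothetical helper: build a consistent key for a file that may not exist.
normalize_store_path <- function(path) {
  # The directory must exist, so mustWork = TRUE returns its canonical form
  # (or raises an error if the directory is missing).
  dir <- normalizePath(dirname(path), winslash = "/", mustWork = TRUE)
  # Re-attach the file name, which is allowed to refer to a new file.
  file.path(dir, basename(path))
}

# Refer to the same not-yet-existing file in two different ways.
base <- tempfile("dss-demo-")
dir.create(base)
old_wd <- setwd(base)
relative <- normalize_store_path("new.dss")
absolute <- normalize_store_path(file.path(base, ".", "new.dss"))
setwd(old_wd)

same_key <- identical(relative, absolute)
```

Because only the directory is normalized with mustWork = TRUE, relative and absolute references to a new file end up with the same key in the store.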

With this approach, users only ever need to provide file paths and don’t need to think about managing file handles explicitly. I still provide user-visible functions to close one or all of the DSS files in the store, but most of the time users won’t have to think about managing DSS files at all. This is a simple and flexible way to create session stores in R packages, and I can already see other use cases for caching results in R packages.

