Whew! So that was a long hiatus.
Things have changed quite a bit for me in the last year and a half. About a year ago I moved up to Washington State and began a job with the Department of Ecology as a software developer and GIS specialist with the Spills Prevention, Preparedness, and Response program. It’s been an interesting ride, and I’ve had the opportunity to develop some new skills and proficiency in SQL Database architecture, reproducible data pipelines, and high-performance computing as a developer supporting the Safety of Oil Transportation Act risk modeling work—which just so happens to be implemented in R, so I was able to jump right in.
My work on the risk model has required handling and processing billions of data records describing maritime traffic; geographic points of interest such as berths, anchorages, and waypoints; ship attributes; ocean bathymetry; and modeled wind and ocean currents. That’s, um, slightly more data than R is generally equipped to handle, so we use a series of Microsoft SQL Server databases for data storage and retrieval (never mind that MS SQL Server has extremely mediocre support for spatial data).
The DBI and odbc packages provide the
tools we need to interface with the databases from R… mostly.
DBI can’t handle the SQL Server
spatial types directly, so those objects get replaced with an unhelpful
character string in the resulting R data frame:
```r
library(DBI)
library(odbc)

# connect to database
db_con <- dbConnect(odbc(), server = "my_server", database = "my_db")

# query data
dbGetQuery(db_con, "SELECT PointID, Shape FROM my_table")
##   PointID Shape
## 1       5 \036i
## 2       3 \036i
## 3      86 \036i
## 4      27 \036i
## ...
```
This isn’t that surprising since MS SQL Server spatial data formats aren’t the same as those used by R. You can cast the spatial data columns into Well-Known Binary (WKB) or Well-Known Text (WKT) in the SQL statement, but this doesn’t immediately result in something usable either:
```r
dbGetQuery(db_con, "SELECT PointID, Shape.STAsBinary() AS Shape FROM my_table")
##   PointID      Shape
## 1       5 blob[21 B]
## 2       3 blob[21 B]
## 3      86 blob[21 B]
## 4      27 blob[21 B]
## ...
```
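Casting to WKT instead gives plain character strings, which at least print legibly but still aren’t spatial objects (a sketch against the same hypothetical table):

```r
# Cast to WKT instead; the result is just a character column
# (table and column names are the hypothetical ones from above)
dbGetQuery(db_con, "SELECT PointID, Shape.STAsText() AS Shape FROM my_table")
```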
Most of the advice online suggests using
st_read() (from the
sf package) to get the
spatial data in a usable form:
```r
library(sf)

st_read(db_con, query = "SELECT PointID, Shape.STAsBinary() AS Shape FROM my_table")
## Simple feature collection with 154 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## CRS:           NA
## First 10 features:
##   PointID                 Shape
## 1       5 POINT (-81.47276 3...
## 2       3 POINT (-81.23989 3...
## 3      86 POINT (-80.45634 3...
## 4      27 POINT (-76.00897 3...
## ...
```
This works, but it precludes the use of some of the fanciest features of
DBI; specifically, using
dbBind() to parameterize
statements (which is critical if you, say, need to grab wind and ocean
current data for tens of thousands of vessels scattered across the
entire Salish Sea at arbitrary times throughout the year). This means
that you need to either pull the whole dataset into R at once
(not feasible for me) or use
lapply()/etc. to pull data
down in chunks, which is much slower than
dbBind() and starts getting
messy fast if you need to do anything remotely complex.
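For context, the parameterized-query pattern that st_read() rules out looks roughly like this (a sketch; the table name and parameter values are hypothetical):

```r
library(DBI)

# Prepare the statement once...
stmt <- dbSendQuery(
  db_con,
  "SELECT PointID, Shape FROM my_table WHERE PointID = ?"
)

# ...then bind new parameters and fetch as many times as needed,
# letting the server reuse the prepared statement
dbBind(stmt, list(5L))
res <- dbFetch(stmt)
dbClearResult(stmt)
```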
Thankfully, there’s a way to use
DBI to get spatial data without
sf. Instead, we can use the
wk package to parse the
WKB objects directly after pulling them into R. However, this
trips over the plain data frame returned by dbGetQuery(),
so we’ll get an error unless we convert the output to a tibble
object first (e.g., using the as_tibble() function from dplyr):
```r
library(dplyr)
library(wk)

# get the data
dbGetQuery(db_con, "SELECT PointID, Shape.STAsBinary() AS Shape FROM my_table") |>
  # convert to tibble
  as_tibble() |>
  # parse the WKB
  mutate(Shape = as_wkb(Shape))
## # A tibble: 154 x 2
##   PointID Shape
##     <int> <wk_wkb>
## 1       5 POINT (-81.47276 3...
## 2       3 POINT (-81.23989 3...
## 3      86 POINT (-80.45634 3...
## 4      27 POINT (-76.00897 3...
## ...
```
Once this is done, we can easily convert the dataset to a spatial
data frame using
sf::st_set_geometry(), or keep it as a wk
object until we actually need sf’s functionality.
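As a sketch (assuming the parsed query result from above is stored in `pts`, and assuming EPSG:4326, since the CRS isn’t stated in the examples), that conversion might look like:

```r
library(sf)
library(wk)

# Translate the wk_wkb vector into an sfc geometry column;
# wk's sfc_writer() handler does this without a text round trip
geom <- st_set_crs(wk_handle(pts$Shape, sfc_writer()), 4326)

# Attach it to get a regular sf data frame
pts_sf <- st_set_geometry(pts, geom)
```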
The primary advantage of doing things this way is that we can now use
dbBind() with spatial tables. Even better, we can use this strategy
with dbplyr and take advantage of lazy evaluation:
```r
library(dbplyr)

tbl(db_con, "my_table") |>
  filter(PointID %in% c(5L, 27L)) |>
  mutate(Shape = sql("Shape.STAsBinary()")) |>
  select(PointID, Shape) |>
  collect() |>
  mutate(Shape = as_wkb(Shape))
## # A tibble: 2 x 2
##   PointID Shape
##     <int> <wk_wkb>
## 1       5 POINT (-81.47276 3...
## 2      27 POINT (-76.00897 3...
```
Note that we need to use
mutate() (rather than
select()) together with
sql() to extract
the “Shape” field as WKB objects, but otherwise we can use all the
features and functionality of
dbplyr. This lets us perform much more
of the work on the SQL side to e.g., take advantage of indexes and the
server’s computing power. However, I’m not sure how well it will
perform when, e.g., filtering by hundreds or thousands of values,
so I may need to fall back to
dbBind() for those cases.
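Putting the pieces together, a parameterized spatial query might look like this (a sketch; table, columns, and parameter values are hypothetical):

```r
library(DBI)
library(dplyr)
library(wk)

# Prepare a query that casts the geometry to WKB on the server
stmt <- dbSendQuery(
  db_con,
  "SELECT PointID, Shape.STAsBinary() AS Shape FROM my_table WHERE PointID = ?"
)

# Bind parameters, fetch, and parse the WKB on the R side
dbBind(stmt, list(27L))
res <- dbFetch(stmt) |>
  as_tibble() |>
  mutate(Shape = as_wkb(Shape))
dbClearResult(stmt)
```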
That’s it! Working with spatial data this way has not only sped up
my code, but also made it more readable—especially for some of my
colleagues who do not have much experience with SQL databases but are
quite familiar with
dplyr syntax. There is still a bit of
fiddling required to write the spatial data back to SQL Server,
since you need to first write the data as WKB objects and then
convert them to SQL Server geometry/geography objects.
I’ll make another post if I figure out a smarter way to do that too.
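For the record, the two-step write-back looks something like the sketch below: push the WKB into a staging table as binary, then convert it server-side with geometry::STGeomFromWKB(). All table and column names are hypothetical, as is the SRID, and I’m not claiming this is the best approach.

```r
library(DBI)
library(wk)

# 1. Stage the data with the geometry serialized as WKB;
#    blob columns are written out as a binary SQL type
staging <- data.frame(PointID = pts_sf$PointID)
staging$Shape <- blob::new_blob(unclass(as_wkb(pts_sf$Shape)))
dbWriteTable(db_con, "my_table_staging", staging, overwrite = TRUE)

# 2. Convert to a native geometry column on the server
#    (4326 is an assumed SRID)
dbExecute(db_con, "
  SELECT PointID, geometry::STGeomFromWKB(Shape, 4326) AS Shape
  INTO my_table_new
  FROM my_table_staging
")
```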