Whew! So that was a long hiatus.
Things have changed quite a bit for me in the last year and a half. About a year ago I moved up to Washington State and began a job with the Department of Ecology as a software developer and GIS specialist with the Spills Prevention, Preparedness, and Response program. It’s been an interesting ride, and I’ve had the opportunity to develop some new skills and proficiency in SQL Database architecture, reproducible data pipelines, and high-performance computing as a developer supporting the Safety of Oil Transportation Act risk modeling work—which just so happens to be implemented in R, so I was able to jump right in.
My work on the risk model has required handling and processing billions of data records describing maritime traffic; geographic points of interest such as berths, anchorages, and waypoints; ship attributes; ocean bathymetry; and modeled wind and ocean currents. That’s, um, slightly more data than R is generally equipped to handle, so we use a series of Microsoft SQL Server databases for data storage and retrieval (never mind that MS SQL Server has extremely mediocre support for spatial data).
The DBI and odbc packages provide the tools we need to interface with the databases from R… mostly. DBI can’t handle the SQL Server geometry or geography spatial types directly, so those objects get replaced with an unhelpful character string in the resulting R data frame:
library(DBI)
library(odbc)
# connect to database
db_con <- dbConnect(odbc(), server = "my_server", database = "my_db")
# query data
dbGetQuery(db_con, "SELECT PointID, Shape FROM my_table")
## PointID Shape
## 1 5 \036i
## 2 3 \036i
## 3 86 \036i
## 4 27 \036i
## ...
This isn’t that surprising since MS SQL Server spatial data formats aren’t the same as those used by R. You can cast the spatial data columns to Well-Known Binary (WKB) or Well-Known Text (WKT) in the SQL statement, but this doesn’t immediately result in something usable either:
dbGetQuery(db_con, "SELECT PointID, Shape.STAsBinary() AS Shape FROM my_table")
## PointID Shape
## 1 5 blob[21 B]
## 2 3 blob[21 B]
## 3 86 blob[21 B]
## 4 27 blob[21 B]
## ...
Most of the advice online suggests using st_read() (from the sf package) to get the spatial data in a usable form:
library(sf)
st_read(db_con, query = "SELECT PointID, Shape.STAsBinary() AS Shape FROM my_table")
## Simple feature collection with 154 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## CRS: NA
## First 10 features:
## PointID Shape
## 1 5 POINT (-81.47276 3...
## 2 3 POINT (-81.23989 3...
## 3 86 POINT (-80.45634 3...
## 4 27 POINT (-76.00897 3...
## ...
This works, but it precludes the use of some of the fanciest capabilities of DBI; specifically, using dbBind() to parameterize statements (which is critical if you, say, need to grab wind and ocean current data for tens of thousands of vessels scattered across the entire Salish Sea at arbitrary times throughout the year). This means that you need to either pull the whole dataset into R at once (not feasible for me) or use for loops/lapply()/etc. to pull data down in chunks, which is much slower than dbBind() and starts getting messy fast if you need to do anything remotely complex.
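For context, the parameterized pattern I mean looks roughly like this (a minimal sketch against the same hypothetical table as above):

```r
# prepare a parameterized statement once...
stmt <- dbSendQuery(
  db_con,
  "SELECT PointID, Shape.STAsBinary() AS Shape FROM my_table WHERE PointID = ?"
)

# ...then bind values and fetch as many times as needed
dbBind(stmt, list(5L))
first_point <- dbFetch(stmt)

dbBind(stmt, list(27L))
second_point <- dbFetch(stmt)

# release the statement when finished
dbClearResult(stmt)
```

Because the statement is prepared once and re-executed with new bindings, the server can reuse the query plan instead of parsing a fresh query for every chunk.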
Thankfully, there’s a way to use DBI to get spatial data without relying on sf. Instead, we can use the wk package to parse the WKB objects directly after pulling them into R. However, this requires using list columns, so we’ll get an error unless we convert the output to a tbl_df object first (e.g., using the dplyr package).
library(dplyr)
library(wk)
# get the data
dbGetQuery(db_con, "SELECT PointID, Shape.STAsBinary() AS Shape FROM my_table") |>
# convert to tibble
as_tibble() |>
# parse the WKB
mutate(Shape = as_wkb(Shape))
## # A tibble: 154 x 2
## PointID Shape
## <int> <wk_wkb>
## 1 5 POINT (-81.47276 3...
## 2 3 POINT (-81.23989 3...
## 3 86 POINT (-80.45634 3...
## 4 27 POINT (-76.00897 3...
## ...
Once this is done, we can easily convert the dataset to a spatial data frame using sf::st_set_geometry()—or keep it as a tbl_df object until we need to call sf functions.
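As a sketch, assuming the tibble from the previous step is stored as pts (a hypothetical name), the conversion might look like:

```r
library(sf)
library(wk)

# translate the wk_wkb column into an sfc geometry column,
# then attach it to the data frame as its geometry
pts_sf <- st_set_geometry(pts, wk_handle(pts$Shape, sfc_writer()))

# the CRS isn't stored in the WKB, so set it explicitly
# (EPSG:4326 here is a placeholder -- use whatever your data actually uses)
pts_sf <- st_set_crs(pts_sf, 4326)
```

The wk_handle()/sfc_writer() pair converts WKB to sf geometry without a round trip through text, which keeps this step cheap even for large pulls.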
The primary advantage of doing things this way is that we can now use dbBind() with spatial tables. Even better, we can apply the same strategy to dbplyr workflows and take advantage of lazy evaluation:
library(dbplyr)
tbl(db_con, "my_table") |>
  filter(PointID %in% c(5L, 27L)) |>
  mutate(Shape = sql("Shape.STAsBinary()")) |>
  select(PointID, Shape) |>
  collect() |>
  mutate(Shape = as_wkb(Shape))
## # A tibble: 2 x 2
## PointID Shape
## <int> <wk_wkb>
## 1 5 POINT (-81.47276 3...
## 2 27 POINT (-76.00897 3...
Note that we need to use mutate() (rather than select()) in conjunction with sql() to extract the “Shape” field as WKB objects, but otherwise we can use all the features and functionality of dbplyr. This lets us perform much more of the work on the SQL side to, e.g., take advantage of indexes and the server’s computing power. However, I’m not sure how well it will perform when, e.g., filtering by hundreds or thousands of values, so I may need to fall back to dbGetQuery()/dbBind() for those cases.
That’s it! Working with spatial data this way has not only sped up my code, but also made it more readable—especially for some of my colleagues who do not have much experience with SQL databases but are quite familiar with dplyr syntax. There is still a bit of fiddling required to write the spatial data back to SQL Server, since you need to first write the data as WKB objects and then convert them to SQL Server geometry/geography objects afterward. I’ll make another post if I figure out a smarter way to do that too.
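For now, one way that two-step write-back could look (a sketch only: the staging table, column names, and SRID 4326 are all assumptions, and pts is the hypothetical tibble from earlier):

```r
library(blob)

# step 1: write the WKB to a staging table as a varbinary column
# (unclass() turns the wk_wkb vector into a plain list of raw vectors)
staged <- pts |>
  mutate(Shape = new_blob(unclass(Shape)))
dbWriteTable(db_con, "my_staging", staged, overwrite = TRUE)

# step 2: convert the WKB to geometry objects on the SQL Server side
dbExecute(db_con, "
  INSERT INTO my_table (PointID, Shape)
  SELECT PointID, geometry::STGeomFromWKB(Shape, 4326)
  FROM my_staging
")
```

Doing the WKB-to-geometry conversion in SQL keeps the heavy lifting on the server, at the cost of managing a staging table.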