I’ve been working on the Russian River Estuary this summer, taking the boat out to do CTD casts along our regular transect and poring through the data we’ve collected since 2011. One of the things I’ve been doing is looking through the timelapse imagery of the river mouth we’ve been collecting and trying to identify when a closure begins, the idea being that we may see some temporal patterns in the evolution of water quality parameters relative to the onset of mouth closure. This requires downloading the multitude of timelapse images from the Bodega Ocean Observing Node website, which total almost 13 GB. The images are hosted in a password-protected section of the site, and the web interface—while pretty handy for viewing the latest images—doesn’t provide any options for batch downloads.
In an earlier post, I showed how to use R to download files. This time, I’m going to show you how to download a bunch of files, and (semi)automate getting the list of file URLs to download.
The first thing to do is get a list of URLs for all the files you want to download. In my case I am working with the index file at http://bml.ucdavis.edu/boon/dyn/rrmc/archive/2011/, which is essentially a bare directory listing: a series of anchor tags pointing at the individual image files.
Depending on your source file you might be able to do this with download.file, but since I’m working with a password-protected site I opted to use the RCurl package. I first read the webpage using getURL, which returns the entire page source as a single string; I then use textConnection and readLines to turn it into a vector of strings. For security purposes, I use readline to prompt for the username and password instead of including them in plain text in my source code.
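Here is a minimal sketch of that step; the variable names are just illustrative, and the index URL is the 2011 archive mentioned above:

```r
library(RCurl)

base_url <- "http://bml.ucdavis.edu/boon/dyn/rrmc/archive/2011/"

# prompt for credentials rather than hard-coding them in the script
username <- readline("Username: ")
password <- readline("Password: ")

# getURL returns the entire page source as one long string
page <- getURL(base_url, userpwd = paste(username, password, sep = ":"))

# split that string into a vector of lines
page_lines <- readLines(textConnection(page))
```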
If this is a one-shot thing for you, you can skip the next bit of code, as it’s probably easier to just paste the HTML into Excel and manually pull out the links you want. I’m going to show you a more sophisticated solution using regular expressions, which are useful if you e.g. need to repeatedly scrape a site or download a specific subset of files from a page that contains tons of links. In my case I want all the jpeg files listed, so I create a regular expression to pull just these links out and give me a vector of URLs. I map out the link strings using gregexpr and extract them using regmatches, as in the snippet below.
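Something along these lines, continuing from the page_lines vector above (note that perl = TRUE is needed for the assertions in the pattern):

```r
# match jpeg filenames of the form YYYYMMDDhhmm.jpg inside anchor tags
link_pattern <- "(?<=<a href=\")([0-9]{12})(.jpg)(?=\">)"

# gregexpr finds every match in each line; regmatches pulls out the matched text
matches <- gregexpr(link_pattern, page_lines, perl = TRUE)
file_names <- unlist(regmatches(page_lines, matches))
```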
Let me break down the regular expression (?<=<a href=\")([0-9]{12})(.jpg)(?=\">) a bit. First, recognize that parentheses () denote groups of characters, so (.jpg) searches for the sequence .jpg in the string (strictly speaking, an unescaped . matches any single character, so \\.jpg would be more precise, but it does the job here). The pattern ([0-9]{12}) says to look for any 12-character sequence of digits 0-9; I use this because all of my image names are simply numeric datetimes (YYYYMMDDhhmm). The pattern sequence ([0-9]{12})(.jpg) is therefore the filename of a given image. The special pattern (?<=) is a look-behind assertion; I use the pattern (?<=<a href=\") to look for the sequence of characters <a href=" as the start of my search pattern, but I don’t want it to actually be included in the result. Similarly, the pattern (?=) is a look-ahead assertion, and I use (?=\">) to look for the sequence "> as the end of my search pattern without including it in the result.
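A quick check on a single (made-up) line from the index page confirms that only the filename is returned, without the surrounding markup:

```r
link_pattern <- "(?<=<a href=\")([0-9]{12})(.jpg)(?=\">)"
test_line <- '<a href="201106150900.jpg">201106150900.jpg</a>'
regmatches(test_line, gregexpr(link_pattern, test_line, perl = TRUE))[[1]]
#> [1] "201106150900.jpg"
```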
Now I have a list of URLs for the files I want to download. Note that they are all relative links; depending on your needs you may need to do more legwork to format your URLs properly. In my case I just need to slap http://bml.ucdavis.edu/boon/dyn/rrmc/archive/2011/ in front of them. The next step is to loop through the URLs and download each file. Because I’m downloading images, I use getBinaryURL to read the image from the URL and writeBin to write the contents to a file.
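A sketch of that loop, reusing file_names and the credentials from the earlier snippets (the destination folder is just an example):

```r
dest_dir <- "rrmc_images"
dir.create(dest_dir, showWarnings = FALSE)

for (f in file_names) {
  # getBinaryURL returns a raw vector; writeBin dumps it straight to disk
  img <- getBinaryURL(paste0(base_url, f),
                      userpwd = paste(username, password, sep = ":"))
  writeBin(img, file.path(dest_dir, f))
}
```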
Since I’m downloading a ton of files, I’ll probably want to wrap the loop contents in some kind of tryCatch statement so that I can go back and check if any files failed to download. One solution might be to check the size of the result of getBinaryURL, since a failure would probably return something smaller than the image itself.
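One way to set that up, with an arbitrary 10 kB cutoff for flagging suspiciously small responses:

```r
failed <- character(0)

for (f in file_names) {
  ok <- tryCatch({
    img <- getBinaryURL(paste0(base_url, f),
                        userpwd = paste(username, password, sep = ":"))
    # a raw vector much smaller than a real image probably means a failure
    if (length(img) < 10000) stop("response too small to be an image")
    writeBin(img, file.path(dest_dir, f))
    TRUE
  }, error = function(e) FALSE)
  if (!ok) failed <- c(failed, f)
}

# 'failed' now lists any files to go back and retry
```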
That’s it! Using these methods I now have over 26,000 images of the river mouth. Now if only I could figure out a way to automate the actual analysis…