Automatic data requests from the ESO archive

The ESO Raw Data Archive is a reference archive for data obtained with any of the ESO telescopes, including VISTA, VST, etc. Currently there is no API to query the archive and retrieve data, although scripts to access the archive programmatically are available from the archive FAQ.

The official solution relies on shell scripts that query the different archive URLs and submit and retrieve forms using wget. Another option is to use Python with urllib or requests to do the same, i.e. extract the relevant URLs involved in the request and download workflow and submit the correct information to each of them. Both methods rely on knowing the URL endpoints for the different stages of the workflow.
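As an illustration of that second option, a query could be submitted directly with requests along these lines. Note that the endpoint and field names below are placeholders used only to sketch the idea; the real URLs and parameters have to be read off the archive forms:

import requests

# Hypothetical endpoint and field names, for illustration only; they are
# not the actual archive endpoints or parameters.
QUERY_URL = "https://archive.example.org/wdb/query"

params = {
    "night": "2017 06 09",
    "instrument": "VIRCAM",
    "max_rows_returned": 10,
}

response = requests.get(QUERY_URL, params=params)
response.raise_for_status()
print(response.text)  # HTML page listing the matching records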

While both approaches above work and are currently in use, in this post I will describe an alternative method to automate archive queries and requests using Python and the Splinter module. Splinter automates browser actions, such as visiting URLs and interacting with page elements. In short, the difference is that we use a browser driver to open the main archive web page and automate the actions a user would otherwise perform by hand.

The method is simple and general enough to be applicable to any other archive web interface.

Requirements

  • Python 3.6+
  • Python modules: requests, splinter
  • ChromeDriver installed in the PATH
  • Username and password for the ESO archive

Code

We start by opening a browser connection to the main archive web page. Note the use of the headless variable: if False, a browser window will appear on your screen and you will be able to follow the steps as they happen; if True, the script runs headless.

import re
import time

import requests
from splinter import Browser

headless = False
url = "http://archive.eso.org/eso/eso_archive_main.html"
archive = Browser("chrome", headless=headless)
archive.visit(url)

Let’s fill in some fields in the form:

# Fill in night of observation
archive.find_by_name("night").fill("2017 06 09")

# Select instrument(s)
archive.find_by_value("VIRCAM").first.click()

# Set maximum number of records to return
archive.find_by_name("max_rows_returned").first.fill(10)

Once we have filled in all the required boxes and ticked the relevant options, we can submit the form:

archive.find_by_id("search").click()

This will take us to a page containing the list of available records (images). Each record is preceded by a check box to select the ones we want to request; there is, however, a button to mark all records at once:

# Mark all datasets
archive.find_by_id("ibmarkall").click()

# Submit request
archive.find_by_value("Request marked datasets").first.click()

The archive will then ask for your username and password on the login page. This is a simple form that can be filled in and submitted as follows:

archive.fill_form({"username": "<username>", "password": "<password>"})
archive.find_by_xpath("//button[@type='submit']").first.click()

After successful login, the next page shows a list of requested files and a confirmation form with some default values. All we need to do here is to submit the form:

archive.find_by_name("submit").first.click()

The final page shows the status of the request and reloads periodically until all the files are in the retrieval area and the request is finished. At this point the request is marked as “Completed” and one can download the shell script provided.

# Wait until the page displays the "Completed" status
while "Completed" not in archive.html:
    time.sleep(2)

# Obtain the link to the download script
href = archive.find_link_by_partial_text('downloadRequest').first
url = re.compile('(https://.*)"').search(href.outer_html).group(1)

# Retrieve the download script and save it locally
result = requests.get(url, cookies=archive.cookies.all())
script = result.content.decode('utf-8')
with open(href.value, 'w') as fh:
    fh.write(script)

All that remains now is to execute the script that has been downloaded and saved. Alternatively, one can parse the script, extract the URLs of the data files and use requests to fetch and save the data, as sketched below.
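As a rough sketch of that second option, assuming the shell script embeds one full HTTPS link per file (the exact format of the script may differ), the URLs can be extracted with a regular expression and fetched while reusing the session cookies from the browser; whether those cookies are sufficient depends on how the archive protects the data links:

import os

# Extract the HTTPS links embedded in the download script (assumes the
# script contains full URLs; adjust the pattern if the format differs)
# and fetch each file, reusing the cookies from the browser session.
data_urls = re.findall(r'https://[^\s"]+', script)

for data_url in data_urls:
    filename = os.path.basename(data_url)
    result = requests.get(data_url, cookies=archive.cookies.all())
    result.raise_for_status()
    with open(filename, "wb") as out:
        out.write(result.content)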

Note that, while the above summarises the procedure, we still have to take care of the cases where the request returns no data (e.g. we specified a night on which no data were acquired), the login credentials are wrong, or there is a problem with the network connection.
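A simple way to add such checks is to test for the presence of the relevant elements or messages before proceeding. As a sketch, using Splinter's presence helpers (the element id is the one used above, while the error message text is just a hypothetical placeholder):

# After submitting the query: bail out if the results page has no datasets.
if not archive.is_element_present_by_id("ibmarkall", wait_time=10):
    raise RuntimeError("No datasets found for the selected night/instrument")

# After submitting the login form: the message text below is a hypothetical
# placeholder and should be replaced with whatever the archive reports.
if archive.is_text_present("Login failed"):
    raise RuntimeError("Archive login failed; check username and password")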

A more complete implementation of the code is available on GitHub at https://github.com/eddienko/esoarchive.
