Generate static site pages after JavaScript has run #15639

Open
henare opened this issue Jan 10, 2018 · 4 comments
Comments

henare commented Jan 10, 2018

Problem

We currently have one page that uses JavaScript to progressively enhance it by displaying up-to-date data from a remote API. At build time the data is fetched from the GitHub API using Ruby, and that's what the published static site shows. When a user visits the page the JS also runs and fetches the most up-to-date data from the API.

A problem with this is that the logic to fetch this data exists in both Ruby and JS. If our static site generator ran after the JS had executed, we'd be able to remove the Ruby code that fetches this data, reducing the amount of code we need to maintain. Because the static site would have been generated with the data included, the user experience would be the same as it is now.

This duplication isn't a big problem for one page, but as part of the Democratic Commons project we're about to start using JS to pull in a lot more data from Wikidata, which makes this change worthwhile.

Proposed Solution

From everypolitician/democratic-commons-tasks#36 (comment)

...switch to crawling these pages with Capybara / PhantomJS...

Acceptance Criteria

The static Countries Needed page is generated already populated with data, but without using the Ruby code that currently fetches that data from the GitHub API.

Related Issues

henare commented Jan 11, 2018

There are two parts to the static site generation: deciding which pages to scrape, and scraping the actual pages. Both of these tasks are currently handled by wget. It scrapes the pages, and spiders the whole site using its recursive option (from the seed start page we give it).

Let's discuss both separately to decide our approach.

Deciding which pages to scrape

We currently set wget's mirror option (which sets infinite recursion) and point it at the seed start page. From there it spiders pretty much every page on the site (except some pages we deliberately don't point to from the seed start page).
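
For reference, the current approach is roughly equivalent to something like this (the start URL and output directory are placeholders, not the project's real build settings):

```sh
# Rough sketch of the existing generation step: spider the whole site from
# the seed start page and save every page reachable from it.
# The host and output directory are illustrative placeholders.
wget --mirror \
     --page-requisites \
     --convert-links \
     --adjust-extension \
     --no-host-directories \
     --directory-prefix=static-site \
     http://localhost:3000/
```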

For this issue we could run the new method on just the subset of pages that need JavaScript run (currently only one page, but soon to be a couple of hundred, still by no means the whole site), keeping wget for the rest of the site, or we could switch wholesale to the new scraping method for every page.

Using one generation method for the whole site should make things less complex, more maintainable, and less error-prone, and it's probably where we want to end up eventually. However, we should iterate towards it rather than doing it in one big go, as a wholesale switch would be riskier and more work up-front.

So that means we should initially plan to scrape just the subset of pages that need JavaScript run. We know which pages these are so we can specify them rather than discovering them through spidering or some other discovery technique.

The simplest thing we could do is hardcode the URLs we want. But very soon we're going to want to generate this list programmatically, as the Wikidata project will have a whole stack of URLs generated from the list of countries/legislatures. Probably the next simplest thing we could do, similar to the seed page, is to have a page on the site that lists the URLs for this process to fetch.

How to scrape the pages

There are a few different options out there for scraping a page after JS has run and these are in turn available in a few different programming languages. Preferably we'd want to use one of the languages that this project already uses - Ruby, JavaScript or shell - so let's focus on tools available that support those languages:

I've spent a bit of time looking at each of these and can't see any obvious functional reasons to use one over the others. It seems like it would be possible to do what we need with any of them. So then the deciding factor is what's most usable for us.

Most of the project is currently Ruby, and capybara-webkit is lean, has some of the most easily satisfied dependencies, and is actively developed, so that seems like the sensible place to start. It's also being installed in our pull request for the start of the Wikidata work that this issue really relates to.
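
As a minimal sketch of what that could look like (the URL, readiness check, and output path are placeholders, not the final implementation):

```ruby
# Sketch only: scrape a single page with capybara-webkit after its JS has run.
# The URL, readiness class and output path below are illustrative placeholders.
require 'capybara/webkit'

session = Capybara::Session.new(:webkit)
session.visit('http://localhost:3000/needed/')

# Block until the page signals that its JavaScript has finished
# (see the readiness discussion in the next comment).
session.has_css?('html.js-loaded', wait: 10)

File.write('static-site/needed/index.html', session.html)
```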

henare commented Jan 11, 2018

Here are a couple of issues I've been thinking about as I work through this. The first is a problem we need to solve; the second isn't a problem now but could be if/when we switch the whole site to being generated in the same way.

How do we know when JS on the page is finished?

Update: thanks to a suggestion in Slack I've found the recommended way and implemented it.

The whole idea of this issue is to capture the page after the JS has run, but how do we know when it's done?

We could check that the elements we expect the JS to have inserted are there. But they'd be different on each page, and what if a page correctly has no elements added?

We could just wait some fixed amount of time, but that risks generating broken pages (if the JS hasn't finished), it's generally flaky and non-deterministic, and it would be slow because of the unnecessary waiting.

So if those two options are out, it means we need some way for the page to signal that it's ready to be scraped. It's not uncommon for JS to add a class to the page to say it's done something (see the next issue below!), so that could be a simple fix. It would mean changes to the JS on pages that need to be scraped, but that should be simple, right?

Unfortunately it isn't if you have a bunch of asynchronous calls, like the three AJAX queries we have on this page. Instead you'd need to make more substantial changes and wrap those calls in something like jQuery.when().
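
For example, something along these lines would give the scraper a single, deterministic signal to wait for (the endpoint URLs and marker class are illustrative, not the page's real code):

```js
// Illustrative only: the endpoint URLs and marker class name are placeholders.
// Each $.getJSON returns a promise; $.when settles once all of them have.
$.when(
  $.getJSON('/api/countries.json'),
  $.getJSON('/api/legislatures.json'),
  $.getJSON('/api/terms.json')
).always(function () {
  // Signal to the scraping process that the page has finished updating.
  document.documentElement.classList.add('js-loaded');
});
```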

Page's no-js class is removed by rendering the page with JS enabled

Update: the suggested solution below has been implemented.

We currently have a no-js class on the <html> element that's subsequently removed by JS. This indicates whether JS has run on the page, and it's used by the CSS in a couple of cases.

Obviously, if we run pages through a process that executes the JS, that class will be removed, and that's what we'd publish on the static site.

I think the best solution would be to get our scraping process to add that class back before it saves the page. That way this won't trip us up down the line.
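
One way to do that (a sketch, assuming the capybara-webkit approach above; `output_path` is a placeholder):

```ruby
# Sketch only: restore the no-js class via the driver just before the page is
# saved, so the published static page still progressively enhances for visitors.
session.execute_script("document.documentElement.classList.add('no-js');")
File.write(output_path, session.html)
```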

henare commented Jan 16, 2018

I've been pushing work in progress to this branch: seat-count-wikidata...javascript-page-generation-capybara-refactor

That branch's history may change, so here's what it looks like as of today: a939115^...d798785

henare added a commit that referenced this issue Jan 17, 2018
#15639

We want to be able to generate some pages as part of the static site
build after JavaScript has run on that page. This will allow these pages
to be populated with data from JavaScript but appear to progressively
enhance when deployed as part of our static site.

This script will fetch a list of URLs from an endpoint on the
application that should be scraped with JS enabled. It will then use
capybara-webkit to fetch those pages, run JS, and save those pages to
disk. The target files will be in the same location as `wget` uses which
will allow us to overwrite the static files already generated by `wget`
with these JS-enhanced versions.

Subsequent commits will need to:

* Create the endpoint that this script uses to determine which URLs to
  scrape
* Update the build script to run this after `wget` has run
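
A minimal sketch of the shape of that script (the endpoint path, base URL, readiness class, and path mapping are assumptions for illustration, not the actual committed code):

```ruby
# Sketch only, not the committed script: fetch the URL list from the app,
# visit each URL with capybara-webkit, and save the JS-rendered HTML to the
# same location wget writes to. All names and paths here are placeholders.
require 'open-uri'
require 'capybara/webkit'

base_url = 'http://localhost:3000'
session  = Capybara::Session.new(:webkit)

URI.parse("#{base_url}/js_generated_pages.txt").read.each_line do |line|
  url = line.strip
  next if url.empty?

  session.visit(url)
  session.has_css?('html.js-loaded', wait: 10)

  # Map the URL to the file wget would have written, then overwrite it.
  path = File.join('static-site', URI.parse(url).path, 'index.html')
  File.write(path, session.html)
end
```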
henare added a commit that referenced this issue Jan 17, 2018
#15639

This exposes a dynamically generated text file with a list of URLs that
should be generated with JS enabled by the static site generation
process. This can be read by the script added in d8b3c9f to determine
what URLs it should scrape.

It currently only contains a single line for the one page we want to
scrape after JS has been run. In the future it should be relatively
easy to add new URLs to this list including lists of URLs
programmatically generated.

Now that we have a script to do the JS scraping and a way of generating
a list of URLs for this process, our next step is to add this to the
build process.
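
For illustration, the endpoint is just plain text with one URL per line, so today it holds a single entry and future Wikidata-driven pages can simply be appended (these URLs are placeholders, not the real routes):

```text
https://example.org/needed/
https://example.org/countries/placeholder-country/
```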
henare added a commit that referenced this issue Jan 17, 2018
#15639

In d8b3c9f and 4bb70ee we added a way to generate some of our static
pages after JavaScript had finished running. This commit adds this step
into the static site generation build script.

After `wget` has finished running this script overwrites some of the
static files it generated with the (currently) few pages we want
enhanced with JS.
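
The ordering described might look roughly like this in the build script (the script name, paths, and site URL are assumptions, not the project's actual files):

```sh
# Hypothetical sketch of the build ordering only; all names are placeholders.
# 1. Mirror the whole site with wget, as before.
wget --mirror --directory-prefix=static-site http://localhost:3000/

# 2. Overwrite the JS-enhanced pages with versions rendered after JS has run.
bundle exec ruby bin/generate_javascript_pages.rb
```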
henare commented Jan 23, 2018

After thinking more about the kind of pages we're really going to need, and creating issues in this repo for them, I've come up against another issue with the approach in #15640.

The idea behind that PR is that we'll expand it to scrape a whole bunch of pages, and that the list of pages could be programmatically generated in our Ruby world. However, because we're going to start with a list of countries generated from Wikidata, we're going to have to come up with a different way.

I guess the most obvious solution is to collect links to the country pages from the base country list page (once that's been generated by JS). However, that ties the scraping quite tightly to how the pages are built, which is something I was hoping to avoid.
