Generate static site pages after JavaScript has run #15639

Open
henare opened this issue Jan 10, 2018 · 4 comments
Comments

henare commented Jan 10, 2018

Problem

We currently have one page that uses JavaScript to progressively enhance it by displaying up-to-date data from a remote API. At build time the data is fetched from the GitHub API using Ruby, and that's what the published static site shows. When a user visits the page the JS also runs and fetches the most up-to-date data from the API.

A problem with this is that the logic to fetch this data exists in both Ruby and JS. If our static site generator ran after the JS had executed, we'd be able to remove the Ruby code that fetches this data, reducing the amount of code we need to maintain. Because the static site would have been generated with the data included, the user experience would be the same as it is now.

This duplication isn't a big problem for one page, but as part of the Democratic Commons project we're about to start using JS to pull in a lot more data from Wikidata, which makes this change worthwhile.

Proposed Solution

From everypolitician/democratic-commons-tasks#36 (comment)

...switch to crawling these pages with Capybara / PhantomJS...

Acceptance Criteria

The static Countries Needed page is generated already populated with data, but without using the Ruby code that currently fetches that data from the GitHub API.

Related Issues

henare commented Jan 11, 2018

There are two parts to the static site generation: deciding which pages to scrape, and scraping the actual pages. Both of these tasks are currently handled by wget. It scrapes the pages, and spiders the whole site using its recursive option (from the seed start page we give it).

Let's discuss both separately to decide our approach.

Deciding which pages to scrape

We currently set wget's mirror option (which sets infinite recursion) and point it at the seed start page. From there it spiders pretty much every page on the site (except some pages we deliberately don't point to from the seed start page).
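
For reference, the current approach is roughly equivalent to something like this (the start URL and output directory are placeholders, not the project's real build settings):

```sh
# Rough sketch of the existing generation step: spider the whole site from
# the seed start page and save every page reachable from it.
# The host and output directory are illustrative placeholders.
wget --mirror \
     --page-requisites \
     --convert-links \
     --adjust-extension \
     --no-host-directories \
     --directory-prefix=static-site \
     http://localhost:3000/
```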

For this issue we could run the new method on just the subset of pages that need JavaScript run (currently only one page, but soon to be a couple of hundred, still by no means the whole site), keeping wget for the rest of the site, or we could switch wholesale to the new scraping method for every page.

Using one generation method for the whole site should make things less complex, more maintainable, and less error-prone, and it's probably where we want to end up eventually. However, we should iterate towards it rather than doing it in one big go, as a wholesale switch would be riskier and more work up-front.

So that means we should initially plan to scrape just the subset of pages that need JavaScript run. We know which pages these are so we can specify them rather than discovering them through spidering or some other discovery technique.

The simplest thing we could do is hardcode the URLs we want. But very soon we're going to want to generate this list programmatically, as the Wikidata project will have a whole stack of URLs generated from the list of countries/legislatures. Probably the next simplest thing we could do, similar to the seed page, is to have a page on the site that lists the URLs for this process to fetch.

How to scrape the pages

There are a few different options out there for scraping a page after JS has run and these are in turn available in a few different programming languages. Preferably we'd want to use one of the languages that this project already uses - Ruby, JavaScript or shell - so let's focus on tools available that support those languages:

I've spent a bit of time looking at each of these and can't see any obvious functional reasons to use one over the others. It seems like it would be possible to do what we need with any of them. So then the deciding factor is what's most usable for us.

Most of the project is currently Ruby, and capybara-webkit is lean, has some of the most easily satisfied dependencies, and is actively developed, so that seems like the sensible place to start. It's also being installed in our pull request for the start of the Wikidata work that this issue really relates to.
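
As a minimal sketch of what that could look like (the URL, readiness check, and output path are placeholders, not the final implementation):

```ruby
# Sketch only: scrape a single page with capybara-webkit after its JS has run.
# The URL, readiness class and output path below are illustrative placeholders.
require 'capybara/webkit'

session = Capybara::Session.new(:webkit)
session.visit('http://localhost:3000/needed/')

# Block until the page signals that its JavaScript has finished
# (see the readiness discussion in the next comment).
session.has_css?('html.js-loaded', wait: 10)

File.write('static-site/needed/index.html', session.html)
```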

henare commented Jan 11, 2018

Here are a couple of issues I've been thinking about as I work through this. The first is a problem we need to solve; the second isn't a problem now but could be if/when we switch the whole site to being generated in the same way.

How do we know when JS on the page is finished?

Update: thanks to a suggestion in Slack I've found the recommended way and implemented it.

The whole idea of this issue is to capture the page after the JS has run, but how do we know when it's done?

We could check that the elements we expect the JS to have inserted are there. But they'd be different on each page, and what if a page correctly has no elements added?

We could just wait some fixed amount of time, but that risks generating broken pages (if the JS hasn't finished), it's generally flaky and non-deterministic, and it would be slow because of the unnecessary waiting.

So if those two options are out, it means we need some way for the page to signal that it's ready to be scraped. It's not uncommon for JS to add a class to the page to say it's done something (see the next issue below!), so that could be a simple fix. It would mean changes to the JS on pages that need to be scraped, but that should be simple, right?

Unfortunately it isn't if you have a bunch of asynchronous calls, like the three AJAX queries we have on this page. Instead you'd need to make more substantial changes and wrap those calls in something like jQuery.when().
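
For example, something along these lines would give the scraper a single, deterministic signal to wait for (the endpoint URLs and marker class are illustrative, not the page's real code):

```js
// Illustrative only: the endpoint URLs and marker class name are placeholders.
// Each $.getJSON returns a promise; $.when settles once all of them have.
$.when(
  $.getJSON('/api/countries.json'),
  $.getJSON('/api/legislatures.json'),
  $.getJSON('/api/terms.json')
).always(function () {
  // Signal to the scraping process that the page has finished updating.
  document.documentElement.classList.add('js-loaded');
});
```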

Page's no-js class is removed by rendering the page with JS enabled

Update: the suggested solution below has been implemented.

We currently have a no-js class on the <html> element that's subsequently removed by JS. This indicates whether JS has run on the page, and it's used by the CSS in a couple of cases.

Obviously, if we run pages through a process that executes the JS, that class will be removed, and that's what we'd publish on the static site.

I think the best solution would be to get our scraping process to add that class back before it saves the page. That way this won't trip us up down the line.
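
One way to do that (a sketch, assuming the capybara-webkit approach above; `output_path` is a placeholder):

```ruby
# Sketch only: restore the no-js class via the driver just before the page is
# saved, so the published static page still progressively enhances for visitors.
session.execute_script("document.documentElement.classList.add('no-js');")
File.write(output_path, session.html)
```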

henare commented Jan 16, 2018

I've been pushing work in progress to this branch: seat-count-wikidata...javascript-page-generation-capybara-refactor

That branch's history may change, so here's what it looks like as of today: a939115^...d798785

henare added a commit that referenced this issue Jan 17, 2018
#15639

We want to be able to generate some pages as part of the static site
build after JavaScript has run on that page. This will allow these pages
to be populated with data from JavaScript but appear to progressively
enhance when deployed as part of our static site.

This script will fetch a list of URLs from an endpoint on the
application that should be scraped with JS enabled. It will then use
capybara-webkit to fetch those pages, run JS, and save those pages to
disk. The target files will be in the same location as `wget` uses which
will allow us to overwrite the static files already generated by `wget`
with these JS-enhanced versions.

Subsequent commits will need to:

* Create the endpoint that this script uses to determine which URLs to
  scrape
* Update the build script to run this after `wget` has run
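
A minimal sketch of the shape of that script (the endpoint path, base URL, readiness class, and path mapping are assumptions for illustration, not the actual committed code):

```ruby
# Sketch only, not the committed script: fetch the URL list from the app,
# visit each URL with capybara-webkit, and save the JS-rendered HTML to the
# same location wget writes to. All names and paths here are placeholders.
require 'open-uri'
require 'capybara/webkit'

base_url = 'http://localhost:3000'
session  = Capybara::Session.new(:webkit)

URI.parse("#{base_url}/js_generated_pages.txt").read.each_line do |line|
  url = line.strip
  next if url.empty?

  session.visit(url)
  session.has_css?('html.js-loaded', wait: 10)

  # Map the URL to the file wget would have written, then overwrite it.
  path = File.join('static-site', URI.parse(url).path, 'index.html')
  File.write(path, session.html)
end
```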
henare added a commit that referenced this issue Jan 17, 2018
#15639

This exposes a dynamically generated text file with a list of URLs that
should be generated with JS enabled by the static site generation
process. This can be read by the script added in d8b3c9f to determine
what URLs it should scrape.

It currently only contains a single line for the one page we want to
scrape after JS has been run. In the future it should be relatively
easy to add new URLs to this list including lists of URLs
programmatically generated.

Now that we have a script to do the JS scraping and a way of generating
a list of URLs for this process, our next step is to add this to the
build process.
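
For illustration, the endpoint is just plain text with one URL per line, so today it holds a single entry and future Wikidata-driven pages can simply be appended (these URLs are placeholders, not the real routes):

```text
https://example.org/needed/
https://example.org/countries/placeholder-country/
```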
henare added a commit that referenced this issue Jan 17, 2018
#15639

In d8b3c9f and 4bb70ee we added a way to generate some of our static
pages after JavaScript had finished running. This commit adds this step
into the static site generation build script.

After `wget` has finished running this script overwrites some of the
static files it generated with the (currently) few pages we want
enhanced with JS.
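
The ordering described might look roughly like this in the build script (the script name, paths, and site URL are assumptions, not the project's actual files):

```sh
# Hypothetical sketch of the build ordering only; all names are placeholders.
# 1. Mirror the whole site with wget, as before.
wget --mirror --directory-prefix=static-site http://localhost:3000/

# 2. Overwrite the JS-enhanced pages with versions rendered after JS has run.
bundle exec ruby bin/generate_javascript_pages.rb
```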
henare commented Jan 23, 2018

After thinking more about the kind of pages we're really going to need, and creating issues in this repo for them, I've come up against another issue with the approach in #15640.

The idea behind that PR is that we'll expand it to scrape a whole bunch of pages, and that the list of pages could be programmatically generated in our Ruby world. However, because we're going to start with a list of countries generated from Wikidata, we're going to have to come up with a different way.

I guess the most obvious solution is to collect links to the country pages from the base country list page (once that's been generated by JS). However, that ties the scraping quite tightly to how the pages are built, which is something I was hoping to avoid.
