Master of Puppets

I’ve been working with Google’s new Puppeteer library the past week or so. It’s basically the next generation headless browser API: you can use it for testing, scraping, performance profiling, and more; their list of use cases is better than anything I could post here.

I’m using it to crawl our Angular application. We have a service that generates a server-side rendered version of our components (we are working on a custom process for this, but it turns out to be very tricky if you’re using 3rd party libraries). However, using this service requires a call out to their server to create the pre-rendered version (so search engines can index more than just an empty body tag for our site!), and that extra call adds a noticeable lag to our page load.

We do have a local cache on our own server: once a page is rendered, it’s stored in that cache for a while and all future requests for it are nice and fast. Ideally, though, no actual user should see that initial slow load when the cache doesn’t have the page they’re looking for.
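
To make that concrete, here’s a rough sketch of the cache-or-prerender flow described above. Express is assumed, and the cache, bot check, and prerender URL are all placeholders rather than our actual setup:

const express = require('express');
const fetch = require('node-fetch'); //any HTTP client works here

const app = express();
const cache = new Map(); //stand-in for the real server-side cache

const isBot = req => /googlebot|bingbot|baiduspider/i.test(req.headers['user-agent'] || '');

app.get('*', async (req, res, next) => {
    if (!isBot(req)) return next(); //regular users just get the normal SPA shell

    const hit = cache.get(req.originalUrl);
    if (hit) return res.send(hit); //already rendered: nice and fast

    //cache miss: call out to the prerender service (this is the slow load)
    const response = await fetch(
        'https://prerender.example.com/render?url=' + encodeURIComponent(req.originalUrl)
    );
    const html = await response.text();
    cache.set(req.originalUrl, html);
    res.send(html);
});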

Side note: I know this is a bit of a small problem. It will affect very few users: specifically, only people who find a “deep” page in our SPA via a Google search or link, when that page hasn’t been visited for a while and so isn’t in our cache. However, we are at the point where we are trying to optimize as much as possible pre-launch, and this project’s time has come!

So, we need a reliable, preferably automated way to crawl all our routes.  Puppeteer is perfect for this operation.  It has a fairly large API, but performing the basics is really simple:

//be sure to npm install puppeteer
const puppeteer = require('puppeteer');

//note: await is only valid inside an async function, so in the real script
//the lines below live inside one (e.g. an async main())

//create a browser instance
const browser = await puppeteer.launch();

//open a page
const page = await browser.newPage();

//navigate- the waitUntil option is important for some sites (like SPAs):
//it makes sure we don't continue until the page has fully loaded
//(current Puppeteer versions call this 'networkidle0')
await page.goto('https://your-url.com', { waitUntil: 'networkidle0' });

//evaluate gives you access to the DOM- use good old querySelector type methods!
//here we grab every anchor tag, but any selector works
const myLinks = await page.evaluate(() => {
    const links = Array.from(document.querySelectorAll('a'));
    return links.map(link => link.href);
});

Note that everything uses async/await, which is really great. It makes asynchronous code easier to read and reason about (depending on who you ask…). You will, however, need Node 7.6 or higher.

After the above code, you should have an array of links from whatever selector you passed to querySelectorAll. The functional programmer in me wanted to just forEach that array, visit each link with a new Puppeteer ‘goto’ command, and call it good. However, that doesn’t work: forEach ignores the promise an async callback returns, so all the crawls fire at once and any errors escape your error handling. With async/await, you need to use a for/of loop and await each crawl:

for (const link of myLinks) {
    try {
        await crawlPage(link);
    } catch(err) {
        handleError(link, err);
    }
}
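
For comparison, here’s roughly what the forEach version would look like and why it falls apart (a sketch):

//looks reasonable, but forEach ignores the promise the async callback returns...
myLinks.forEach(async link => {
    await crawlPage(link);
});
//...so all the crawls kick off at once, nothing downstream waits for them to finish,
//and a rejected crawl can't be caught by a try/catch wrapped around the forEach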

Try/catch is the preferred (only?) error handling mechanism for async/await. Also, at this point, I had abstracted my crawl and error processes out into their own methods. Basically, crawlPage just opens a new page, goes to it, saves the result to a log, emits the result (more on that in the next post!), and returns page.close(), which itself returns a promise, so awaiting crawlPage also waits for the page to close.

handleError records and emits the error source (url) and message.

async function crawlPage(url) {
    const page = await browser.newPage();
    const pageResult = await page.goto(url);
    //pageResult is a Puppeteer Response- ok is true for 2xx status codes
    //(newer Puppeteer versions expose it as a method, pageResult.ok())
    const resultObj = { success: pageResult.ok, url };
    //resultGroup is a module-level array collecting every crawl result
    resultGroup.push(resultObj);
    //crawlEmitter is the custom EventEmitter described below
    crawlEmitter.emit('crawled', resultObj);
    return page.close();
}

function handleError(url, err) {
    const errorObj = { success: false, url, err };
    resultGroup.push(errorObj);
    crawlEmitter.emit('error', errorObj);
}
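
For completeness, here’s a sketch of one way the fragments above could be wired together in a single script. The runCrawl name and the exact setup are illustrative rather than my actual code:

const puppeteer = require('puppeteer');
const EventEmitter = require('events');

const crawlEmitter = new EventEmitter(); //swapped for the websocket version in part 2
const resultGroup = [];                  //collects a result object per crawled page
let browser;                             //shared by crawlPage above

async function runCrawl() {
    browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://your-url.com', { waitUntil: 'networkidle0' });

    const myLinks = await page.evaluate(() =>
        Array.from(document.querySelectorAll('a')).map(link => link.href)
    );

    for (const link of myLinks) {
        try {
            await crawlPage(link);
        } catch (err) {
            handleError(link, err);
        }
    }

    await browser.close();
    return resultGroup;
}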

All very cool and useful stuff. We can run this process over our application at a regular interval to refresh the local cache, preventing an actual user from ever experiencing the slow load.
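
As a sketch of that regular refresh (assuming the runCrawl function from the wiring example above; the schedule itself is arbitrary):

//option 1: run the script on a schedule via cron, e.g. a crontab entry like
//  0 * * * * /usr/bin/node /path/to/crawl.js
//option 2: keep a long-running Node process and re-crawl on an interval
setInterval(() => {
    runCrawl().catch(err => console.error('scheduled crawl failed:', err));
}, 60 * 60 * 1000); //once an hour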

Next week will be part 2 of this saga. I wanted two options: run this from the command line or as a cron job, and have a nice UI that shows the last crawl and its stats and lets a non-technical user kick off a crawl (including crawling by category; our URL routing comes partly from a JSON config file). That meant I would sometimes need to log to the console (when running as a command line tool) and other times return data to a browser.

This is where crawlEmitter comes in: it’s a custom Node EventEmitter (not hard to create) that sends data to a frontend via a websocket. The fun part was aliasing my crawlEmitter: if it doesn’t exist when the main function starts, it falls back to a good old console.log!
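
Here’s a minimal sketch of that fallback idea (the websocket side assumes something like the ws package hands you a socket; the real version shows up in part 2):

const EventEmitter = require('events');

function getCrawlEmitter(socket) {
    if (!socket) {
        //no UI attached (command line / cron run): same emit() shape, but just log
        return { emit: (event, payload) => console.log(`[${event}]`, payload) };
    }
    //UI attached: forward crawl events to the frontend over the websocket
    const emitter = new EventEmitter();
    emitter.on('crawled', result => socket.send(JSON.stringify({ type: 'crawled', result })));
    emitter.on('error', result => socket.send(JSON.stringify({ type: 'error', result })));
    return emitter;
}

Either way, the crawl code just calls crawlEmitter.emit(…) and doesn’t care who is listening.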
