Pulling the strings

So, back to the puppet show.

Actually, I guess this is going to be a bit more of a tangent, but it did come up while working on the new Puppeteer project (to crawl our Angular application and store server-rendered pages in a local cache), so it still counts…

We wanted to have both a command line option and a graphic UI option to run the crawl service.  The UI would need a backend that keeps a websocket open and broadcasts an update each time a new url is crawled (to show success or failure, and give stats at the end).  Socket.io works great for this- just install it in your Node project, and you can socket.emit messages and data to your frontend, which listens with socket.on (use the same event name for both emit and on to coordinate the two).
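For illustration, the pairing looks roughly like this (the port, the ‘crawled’ event name, and the data shape are all placeholders, not our actual setup):

//server side (Node)- push a result out to every connected client after each crawl
const io = require('socket.io')(3000);
io.emit('crawled', { url: 'https://your-url.com/some-route', success: true });

//browser side (socket.io client)- listen under the same event name
const socket = io('http://localhost:3000');
socket.on('crawled', result => console.log(result.url, result.success));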

However, in the command line option, there would be no socket.  With this configuration, a user would just run the command and the messages/data should print to the console.  So we have a shared program that needs two different interfaces.  I had already created a “crawl-runner.js” file with my main “run” function.  It handles the browser init, page creation, and navigation in headless Chrome (using Puppeteer), and it also handles storing and responding with the results.  It was set up to use a simple native Node EventEmitter- which worked fine for interfacing with websockets.  In fact, we could probably just cut out the middleman and eliminate the EventEmitter- just socket.emit directly from the crawler.

But either way, we will have to switch to console.log when using the command line option.  How to reuse the logic from crawl-runner.js in the command line version?  We can pass the emitter as an optional argument to “run” and if it’s not there, alias that name to console.log:
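Something along these lines (a sketch- the real run function takes more arguments and does a lot more, so “options” here is just a stand-in):

//crawl-runner.js (sketch)
async function run(options, crawlEmitter) {
    //check at the top: no emitter means command line mode,
    //so alias "emit" to a plain console.log
    if (!crawlEmitter) {
        crawlEmitter = { emit: (event, data) => console.log(event, data) };
    }

    //...browser init, page creation, navigation...
    crawlEmitter.emit('crawled', { url: 'https://your-url.com/some-route', success: true });
}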

When the program is run in interactive UI mode (via a dashboard on our Angular app), crawlEmitter is passed to run, and the socket interface works.  When it’s run as a command line application, we still call “crawlEmitter.emit” with the message and data we want to send, but the check at the top of the function calls “console.log” whenever “crawlEmitter.emit” is called (because there is no real crawlEmitter in this case).

Another option would be to simply pass the function we want to use as a broadcaster into run: crawlEmitter.emit as the 2nd argument for the dashboard version, or console.log for the command line version.  That might be a better, more readable solution, so I’m thinking about switching (I haven’t tested it yet- but I don’t see any reason it shouldn’t work).
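Roughly like this (untested, and the “broadcast” parameter name is mine):

//broadcast can be any function that takes an event name and some data
async function run(options, broadcast = console.log) {
    //...crawl...
    broadcast('crawled', { url: 'https://your-url.com/some-route', success: true });
}

//dashboard version- bind emit so it keeps crawlEmitter as its "this"
run(options, crawlEmitter.emit.bind(crawlEmitter));

//command line version- just let it default to console.log
run(options);

The one wrinkle is that emit relies on its EventEmitter as “this”, so you’d want to pass the bound version rather than the bare method.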

One of the most fun things about programming is how many roads you can take to one final product.  The trick is finding the balance between most efficient and most understandable – and always being open to finding a new route!

Master of Puppets

I’ve been working with Google’s new Puppeteer library the past week or so.  It’s basically the next generation headless browser API- you can use it for testing, scraping, performance profiling- their list is better than anything I can post here.

I’m using it to crawl our Angular application.  We have a service to generate a server-side rendered version of our components (we are working on a custom process for this, but it turns out to be very tricky if you’re using 3rd party libraries).  However, using this service requires a call out to their server to create the pre-rendered version (so search engines can index more than just an empty body tag for our site!).  This extra call adds a noticeable lag to our page load.

We do have a local cache on our own server- once a page is rendered, it’s stored in that cache for a while and all future requests are nice and fast.  Ideally, however, no actual user should see that initial slow load if the cache doesn’t have the page they’re looking for.

Side note: I know this is a bit of a small problem.  It will affect very few users – specifically, only people who find a “deep” page in our SPA via a Google search or link and that page hasn’t been visited for a while so it’s not in our cache.  However, we are at the point where we are trying to optimize as much as possible pre-launch, and this project’s time has come!

So, we need a reliable, preferably automated way to crawl all our routes.  Puppeteer is perfect for this operation.  It has a fairly large API, but performing the basics is really simple:

//be sure to npm install puppeteer
const puppeteer = require('puppeteer');

//await only works inside an async function (Node 7.6+), so wrap everything in one
(async () => {
    //create a browser instance
    const browser = await puppeteer.launch();

    //open a page
    const page = await browser.newPage();

    //navigate- waiting for the network to go idle is important for some sites
    //(like SPAs), so we don't read the DOM before the page has fully loaded
    await page.goto('https://your-url.com', { waitUntil: 'networkidle0' });

    //evaluate gives you access to the DOM- use good old querySelector type methods!
    //(swap 'a' for whatever selector matches the links you care about)
    const myLinks = await page.evaluate(() => {
        const links = Array.from(document.querySelectorAll('a'));
        return links.map(link => link.href);
    });

    //...next: crawl each of myLinks (see below)
})();

Note that everything is using async/await – which is really great. It makes asynchronous code easier to read and reason about (depending on who you ask…).  However, you will need Node 7.6 or higher, and the awaits have to live inside an async function (hence the wrapper above).

After the above code, you should have an array of links from whatever selector you passed to querySelectorAll.  The functional programmer in me wanted to just forEach that array, visit each link with a new Puppeteer ‘goto’ command, and call it good.  However, that doesn’t work- forEach fires off each async callback without waiting for it, so nothing actually awaits the crawls.  With async/await, you need a for/of loop and await each crawl:

for (const link of myLinks) {
    try {
        await crawlPage(link);
    } catch(err) {
        handleError(link, err);
    }
}

Try/catch is the preferred (only?) error handling mechanism for async/await.  Also, at this point, I had abstracted my crawl and error processes out into their own methods.  Basically, crawlPage just opens a new page, goes to it, saves the result to a log, emits the result (more on that in the next post!), and returns page.close()- which itself returns a promise (allowing crawlPage to work as an async function).

handleError records and emits the error source (url) and message.

async function crawlPage(url) {
    //browser, resultGroup, and crawlEmitter are defined at the module level
    const page = await browser.newPage();
    const pageResult = await page.goto(url);
    //pageResult.ok() is true for any 2xx response
    const resultObj = { success: pageResult.ok(), url };
    resultGroup.push(resultObj);
    crawlEmitter.emit('crawled', resultObj);
    return page.close();
}

function handleError(url, err) {
    const errorObj = { success: false, url, err };
    resultGroup.push(errorObj);
    crawlEmitter.emit('error', errorObj);
}

All very cool and useful stuff.  We can run this process over our application to refresh the local cache at a regular interval- preventing an actual user from ever experiencing the slow load.

Next week will be part 2 of this saga.  I wanted two options: run this from the command line or as a cron job, and have a nice UI that will show the last crawl, stats, and allow a non-technical user to crawl (including crawling by different categories- our url routing comes partly from a JSON config file).  This meant I would sometimes need to log to the console (running as a command line tool), and other times send data back to a browser.  This is where crawlEmitter comes in- it’s a custom Node EventEmitter (not hard to create) that sends data to a frontend via a websocket.  The fun part was aliasing my crawlEmitter- if it doesn’t exist when the main function starts, it falls back to a good old console.log!
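As for crawling by category, the idea is just to read the route groups out of that JSON config and build the url list from whichever group was requested- roughly (the file name and shape here are hypothetical, not our actual config):

//routes-config.json might look something like:
//{ "blog": ["/blog/post-1", "/blog/post-2"], "products": ["/products/widget"] }
const routes = require('./routes-config.json');

function urlsForCategory(category) {
    return (routes[category] || []).map(path => `https://your-url.com${path}`);
}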