Master of Puppets

I’ve been working with Google’s new Puppeteer library for the past week or so.  It’s basically the next-generation headless browser API- you can use it for testing, scraping, performance profiling- the list in their docs is better than anything I can post here.

I’m using it to crawl our Angular application.  We have a service that generates a server-side rendered version of our components (we’re working on a custom process for this, but it turns out to be very tricky if you’re using 3rd-party libraries).  However, using this service requires a call out to their server to create the pre-rendered version (so search engines can index more than just an empty body tag for our site!).  That extra call adds a noticeable lag to our page load.

We do have a local cache on our own server- once a page is rendered, it’s stored in that cache for a while and all future requests are nice and fast.  Ideally, however, no actual user should see that initial slow load if the cache doesn’t have the page they’re looking for.

Side note: I know this is a bit of a small problem.  It will affect very few users – specifically, only people who find a “deep” page in our SPA via a Google search or link and that page hasn’t been visited for a while so it’s not in our cache.  However, we are at the point where we are trying to optimize as much as possible pre-launch, and this project’s time has come!

So, we need a reliable, preferably automated way to crawl all our routes.  Puppeteer is perfect for this operation.  It has a fairly large API, but performing the basics is really simple:

//be sure to npm install puppeteer
const puppeteer = require('puppeteer');

//note: await only works inside an async function (or a top-level ES module)-
//wrap the steps below accordingly

//create a browser instance
const browser = await puppeteer.launch();

//open a page
const page = await browser.newPage();

//navigate- important for some sites (like SPAs): wait until the network goes
//idle so the page is fully loaded
await page.goto('https://your-url.com', { waitUntil: 'networkidle0' });

//evaluate gives you access to the DOM- use good old querySelector type methods!
const myLinks = await page.evaluate(() => {
    const links = Array.from(document.querySelectorAll('a')); //whatever selector you need
    return links.map(link => link.href);
});

Note that everything is using async/await – which is really great. It makes asynchronous code easier to read and reason about (depending on who you ask…).  However, you will need Node 7.6 or higher.

After the above code, you should have an array of links from whatever selector you passed to querySelectorAll.  The functional programmer in me wanted to just forEach that array, visit each link with a new Puppeteer ‘goto’ command, and call it good.  However, that doesn’t work- forEach won’t await an async callback, so all the requests fire at once instead of one at a time.  With async/await, you need a for/of loop and an await on each crawl:

for (const link of myLinks) {
    try {
        await crawlPage(link);
    } catch (err) {
        handleError(link, err);
    }
}

Try/catch is the preferred (only?) error handling mechanism for async/await. Also, at this point I had abstracted my crawl and error processes out into their own functions. Basically, crawlPage opens a new page, goes to the url, saves the result to a log, emits the result (more on that in the next post!), and returns page.close()- which itself returns a promise (a natural fit, since crawlPage is an async function).

handleError records and emits the error source (url) and message.

async function crawlPage(url) {
    const page = await browser.newPage();
    const pageResult = await page.goto(url, { waitUntil: 'networkidle0' });
    //response.ok() is a method in current Puppeteer versions
    const resultObj = { success: pageResult.ok(), url };
    resultGroup.push(resultObj);
    crawlEmitter.emit('crawled', resultObj);
    return page.close();
}

function handleError(url, err) {
    const errorObj = { success: false, url, err };
    resultGroup.push(errorObj);
    crawlEmitter.emit('error', errorObj);
}

All very cool and useful stuff.  We can run this process over our application to refresh the local cache at a regular interval- preventing an actual user from ever experiencing the slow load.
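To tie the snippets together: the functions above lean on a shared browser, resultGroup, and crawlEmitter that never got declared in the excerpts, so here’s a rough sketch of the wiring (names like refreshCache are my own):

//rough wiring for the snippets above- the declarations and names here are a sketch
const puppeteer = require('puppeteer');
const EventEmitter = require('events');

const resultGroup = [];                   //shared result log
const crawlEmitter = new EventEmitter();  //more on this below
let browser;                              //crawlPage above expects this in scope

async function refreshCache(links) {
    browser = await puppeteer.launch();
    for (const link of links) {
        try {
            await crawlPage(link);
        } catch (err) {
            handleError(link, err);
        }
    }
    await browser.close();
    return resultGroup;
}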

Next week will be part 2 of this saga.  I wanted two options: run this from the command line or as a cron job, and have a nice UI that shows the last crawl and its stats and lets a non-technical user kick off a crawl (including crawling by category- our url routing comes partly from a JSON config file).  That meant I would sometimes need to log to the console (running as a command line tool) and sometimes send data to a browser.  This is where crawlEmitter comes in- it’s a custom Node EventEmitter (not hard to create) that sends data to a frontend via a websocket.  The fun part was aliasing my crawlEmitter- if it doesn’t exist when the main function starts, it falls back to a good old console.log!
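A minimal sketch of that fallback (makeEmit is my name for it- the real emitter gets wired to the websocket elsewhere):

//minimal sketch of the console.log fallback- makeEmit is my name for it
function makeEmit(crawlEmitter) {
    //UI mode: forward events through the emitter (wired to a websocket elsewhere)
    //CLI mode: no emitter exists, so alias emit to plain old console.log
    return crawlEmitter
        ? (event, payload) => crawlEmitter.emit(event, payload)
        : (event, payload) => console.log(event, payload);
}

//command line: const emit = makeEmit();             -> logs to the console
//web UI:       const emit = makeEmit(crawlEmitter); -> pushes over the websocket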

Deep Streams

Usually when I think of Node, I think about web servers.  That’s mostly what I use it for when writing code- setting up a simple test server for a proof of concept, or bringing in Express and its ecosystem for more production-ready projects.

But in reality, I use Node for a whole lot more.  Of course, just about anything npm-related is a use of Node- but it also powers all the awesome developer tools we don’t even really need to think about much anymore.  When a minifier runs over your JavaScript code before you push to production, Node is probably doing that magic.  Same for a bundler.  It’s behind the scenes in quite a bit of frontend workflow these days.  I use it that way all the time, but I hadn’t had much chance to actually write any of those dev tools until recently.

The task was pretty simple- I had a js file with a bunch of arrays and objects someone else had entered.  The formatting had been mangled somewhere along the way- there were long strings of spaces everywhere.  I wanted to strip them out but leave any single spaces (to preserve multi-word strings).  Now I know any halfway good code editor has a search/replace feature that could handle this, but I could see it being a nice little utility to write in Node.  That way, I could run it over an entire directory if necessary (it probably won’t ever be necessary, but I really wanted to do this short little side project).

My first iteration was a super simple script using the fs module.  First, a nice path splitter utility in a separate file.  It takes a file path, splits it into path and extension, and inserts ‘-‘ plus whatever new string you want.  This prevents overwriting the original file (though this part would be unnecessary if you do want to overwrite- just pass the same file name to the write function):
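Reconstructing from that description (the original snippet was embedded separately, and these names are my stand-ins):

//path-splitter.js- my reconstruction; module and function names are stand-ins
const path = require('path');

//splitPath('data/stuff.js', 'clean') -> 'data/stuff-clean.js'
function splitPath(filePath, insert) {
    const { dir, name, ext } = path.parse(filePath);
    return path.join(dir, `${name}-${insert}${ext}`);
}

module.exports = splitPath;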

Then we can use that in our script to strip the multi-spaces and write out a new file:
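Again a reconstruction- the file names and regex are my assumptions, though the rule is just the one described above (two or more spaces collapse to one):

//strip-spaces.js- sketch of the fs version; names are assumptions
const fs = require('fs');
const splitPath = require('./path-splitter');

const inputFile = process.argv[2]; //e.g. node strip-spaces.js mangled.js

fs.readFile(inputFile, 'utf8', (err, data) => {
    if (err) throw err;
    //runs of two or more spaces collapse to one- single spaces survive
    const cleaned = data.replace(/ {2,}/g, ' ');
    fs.writeFile(splitPath(inputFile, 'clean'), cleaned, err => {
        if (err) throw err;
    });
});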

All very cool. But I really like the notion of streams in Node. What if the file I need to manipulate were really large? With my current setup, the whole file gets read into memory before being written back out. But that’s what streams are for! So I rewrote the script with a custom transform stream. It wasn’t all that difficult- once I realized that the required method on your custom stream (which extends the Transform class) has to be named _transform. Leave out the underscore and it simply won’t work- Node never finds the method it expects.

Again, in a separate file (small modules for the win!), I defined my custom stream:
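A sketch of that stream, reconstructed from the description (class and file names are mine):

//multi-space-strip.js- sketch; class and file names are mine
const { Transform } = require('stream');

class MultiSpaceStrip extends Transform {
    //note the underscore- the Transform machinery calls _transform internally
    _transform(chunk, encoding, callback) {
        this.push(chunk.toString().replace(/ {2,}/g, ' '));
        callback();
    }
    //(a run of spaces split across two chunks could slip through- fine for
    //a quick utility like this one)
}

module.exports = MultiSpaceStrip;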

Then it was just a matter of importing that and the path splitting utility created for the original fs version (code reuse for the win!) and using a couple of stream creators Node includes (fs.createReadStream and fs.createWriteStream), which make streams from files automatically:
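Roughly like so (file names are again my stand-ins):

//strip-spaces-stream.js- sketch; file names are stand-ins
const fs = require('fs');
const MultiSpaceStrip = require('./multi-space-strip');
const splitPath = require('./path-splitter');

const inputFile = process.argv[2];

fs.createReadStream(inputFile)
    .pipe(new MultiSpaceStrip())
    .pipe(fs.createWriteStream(splitPath(inputFile, 'clean')));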

Both methods (fs and stream) are simple and concise. Both have their place. Creating a custom transform stream was probably unnecessary for this task, but would be very useful for very large files. Either way, it was a fun quick dive into some other corners of Node!

Staying on Script

I like npm scripts.  I’m sure real programmers can memorize long strings of command line gibberish without even waking up the full 10% of our brain capacity movies tell me is all we have access to, but I mess those commands up all the time.

Don’t get me wrong- it’s fun to be able to cruise around a system on the command line.  You feel like you really know secret stuff.  I like that you don’t really even need a monitor to get real work done on a server- just a secure login method and some useful commands.  ssh in, check out where you are and where you can go (ls -la), change directories and do it again (cd /var/www && ls -la).  See something you want to check out?  Cat is your friend (cat package.json).  See something you want to change?  Use Vim (vim package.json).  Though be sure to install it first (sudo apt-get install vim -y) and be ready to look up the commands just to get around the page (note: I am not a Vim expert- I only use it when I have no other option).

But when it’s time to get to work, it’s really nice to just be able to type ‘npm start’ and have your dev environment spin up.  Particularly when you’re a less than stellar typist.

npm scripts just go in your package.json file- in the appropriately named “scripts” object.  They can be extremely simple- for example, a project built with the Angular CLI automatically sets up a couple: “test”: “ng test”, “start”: “ng serve”.  These just make it easier to find those (admittedly already simple) commands.  A developer will know they can run “npm start”, but if they’re not familiar with the Angular CLI, they might not know about “ng serve” right away.  npm scripts work well as aliases- unifying those commands under one umbrella- and that’s one less thing for me to try to remember.
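In package.json, that looks like:

{
  "scripts": {
    "start": "ng serve",
    "test": "ng test"
  }
}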

You can extend those simple commands as well.  I made a simple edit to my start command: “start”: “ng serve --open”.  This passes the --open flag to ng serve, automatically opening my default browser.  There are flags for many other options- check out the docs!

But where npm scripts have really shined for me is in abstracting away any directory changes and environment settings that have to happen in order to fire up other aspects of a project.  Sure- when you’re just working on the frontend, it’s easy to remember that you run “npm start” from the same directory your package.json resides in.  But one project I’m working on uses .NET Core as a backend API.  It runs locally, so one option is to fire up the full Visual Studio program, but that feels big and slow compared to VS Code now.  Luckily, .NET Core now comes with a command line interface.  I can run the command “dotnet run”, and my backend dev server starts right up.

But there are complications.  This executable is not run from the same directory as package.json- it’s technically a completely different project in a completely different folder.  An environment variable also has to be set to “Development” in order for it to run properly.

But npm scripts don’t care!  A script will just take your commands and run them.  So, my new script is:

"backend-start": "cd ../path/to/dotnet/project/folder && set ASPNETCORE_ENVIRONMENT=Development && dotnet run"
The path really is longer- this is just the way a .NET project is initialized.  The point is- instead of typing everything to the right of the colon, I just type npm run backend-start from my package.json directory.  The npm script changes directory, sets my env variable, and starts the process- pretty cool and useful!  I created another script to publish the backend and can now forget those file paths, environment settings, and other commands.  More brain space for me!
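For reference, the publish version is along these lines (the path is still a placeholder, and the exact flags are my guess- dotnet publish is the real CLI command):

"backend-publish": "cd ../path/to/dotnet/project/folder && dotnet publish -c Release"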