Master of Puppets

I’ve been working with Google’s new Puppeteer library the past week or so.  It’s basically the next generation headless browser API- you can use it for testing, scraping, performance profiling- their list is better than anything I can post here.

I’m using it to crawl our Angular application. We use a service that generates server-side rendered versions of our components (we’re working on a custom process for this, but it turns out to be very tricky if you’re using 3rd party libraries). However, using this service requires a call out to their server to create the pre-rendered version (so search engines can index more than just an empty body tag for our site!). This extra call adds a noticeable lag to our page load.

We do have a local cache on our own server- once a page is rendered, it’s stored in that cache for a while and all future requests are nice and fast.  Ideally, however, no actual user should see that initial slow load if the cache doesn’t have the page they’re looking for.

Side note: I know this is a bit of a small problem.  It will affect very few users – specifically, only people who find a “deep” page in our SPA via a Google search or link and that page hasn’t been visited for a while so it’s not in our cache.  However, we are at the point where we are trying to optimize as much as possible pre-launch, and this project’s time has come!

So, we need a reliable, preferably automated way to crawl all our routes.  Puppeteer is perfect for this operation.  It has a fairly large API, but performing the basics is really simple:

//be sure to npm install puppeteer
const puppeteer = require('puppeteer');

//note: await only works inside an async function (there's no top-level await
//in Node), so assume these snippets run inside one- e.g. (async () => { ... })();

//create a browser instance
const browser = await puppeteer.launch();

//open a page
const page = await browser.newPage();

//navigate- important for some sites (like SPAs): wait until the network goes
//quiet so the page is fully loaded ('networkidle' in early Puppeteer versions;
//newer versions want 'networkidle0' or 'networkidle2')
await page.goto('https://your-url.com', { waitUntil: 'networkidle0' });

//evaluate gives you access to the DOM- use good old querySelector type methods!
//('a' is a placeholder- pass whatever selector you need)
const myLinks = await page.evaluate(() => {
    const links = Array.from(document.querySelectorAll('a'));
    return links.map(link => link.href);
});

Note that everything is using async/await – which is really great. It makes asynchronous code easier to read and reason about (depending on who you ask…). However, you will need Node 7.6 or higher, and since there’s no top-level await, snippets like the above have to live inside an async function.

After the above code, you should have an array of links from whatever selector you passed to querySelectorAll. The functional programmer in me wanted to just forEach that array, visit each link with a new Puppeteer ‘goto’ command, and call it good. However, that doesn’t work: forEach ignores the promises an async callback returns, so all the crawls fire at once and there’s nothing to await. With async/await, you need to use a for/of loop and await each crawl:

for (const link of myLinks) {
    try {
        await crawlPage(link);
    } catch (err) {
        handleError(link, err);
    }
}
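
For contrast, here’s the forEach version I reached for first- it looks innocent, but since forEach discards the promises its callback returns, every crawl fires simultaneously and errors never reach a surrounding try/catch:

//don't do this: forEach throws the promises away, so all the crawls
//start at once and there's nothing for the caller to await
myLinks.forEach(async (link) => {
    await crawlPage(link);
});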

Try/catch is the preferred (only?) error handling mechanism for async/await. Also, at this point, I had abstracted my crawl and error processes out into their own methods. Basically, crawlPage just opens a new page, goes to it, saves the result to a log, emits the result (more on that in the next post!), and returns the result of page.close()- which itself returns a promise (allowing crawlPage to be an async function).

handleError records and emits the error source (url) and message.

async function crawlPage(url) {
    const page = await browser.newPage();
    const pageResult = await page.goto(url);
    //heads up: in newer Puppeteer versions ok is a method- pageResult.ok()
    const resultObj = { success: pageResult.ok, url };
    //resultGroup is a module-level array collecting results; crawlEmitter is
    //a custom EventEmitter (more on it below)
    resultGroup.push(resultObj);
    crawlEmitter.emit('crawled', resultObj);
    return page.close();
}

function handleError(url, err) {
    const errorObj = { success: false, url, err };
    resultGroup.push(errorObj);
    crawlEmitter.emit('error', errorObj);
}

All very cool and useful stuff.  We can run this process over our application to refresh the local cache at a regular interval- preventing an actual user from ever experiencing the slow load.
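
For example, a crontab entry along these lines (the script path is hypothetical) would re-crawl every hour:

# re-run the crawler at the top of every hour
0 * * * * node /path/to/crawl.js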

Next week will be part 2 of this saga. I wanted two options: run this from the command line or as a cron job, or drive it from a nice UI that shows the last crawl and its stats and lets a non-technical user kick off a crawl (including crawling by different categories- our url routing comes partly from a JSON config file). This meant I would need to log to the console sometimes (when running as a command line tool) and return data to a browser at other times. This is where crawlEmitter comes in- it’s a custom Node EventEmitter (not hard to create) that sends data to a frontend via a websocket. The fun part was aliasing my crawlEmitter- if it doesn’t exist when the main function starts, it falls back to a good old console.log!
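
A minimal sketch of that fallback idea (the names here are my stand-ins, not necessarily the actual implementation):

//if no websocket-backed emitter exists (command line / cron mode),
//fall back to an emitter that just logs to the console
const EventEmitter = require('events');

function getEmitter(websocketEmitter) {
    if (websocketEmitter) {
        return websocketEmitter; //UI mode: events flow to the frontend over the websocket
    }
    const fallback = new EventEmitter();
    fallback.on('crawled', result => console.log('crawled:', result));
    fallback.on('error', errorObj => console.error('crawl error:', errorObj));
    return fallback;
}

const crawlEmitter = getEmitter(/* pass the websocket emitter when the UI is running */);

Either way, the crawl code just calls crawlEmitter.emit and doesn’t have to care who’s listening.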

Hold My Beer

The rallying cry of a redneck about to do something really stupid.  I think we should have a version in the programming world where it means: “I just spent half the day writing this code and not compiling/running it once- here goes!”  You’re likely about to hurt yourself in a spectacular manner.

But this time, for the first time ever, it actually worked (results not typical).  I was adding a really simple module to allow an admin to add new activity types (for audit logging).  Usually, I recompile the .NET code at multiple steps along the way, but with Visual Studio, that has become less important- it will yell at me with those red squiggly lines if I really mess something up.  It’s my complete newbieness with C# that usually causes the issues- generally trying to use JS syntax and finding that it just doesn’t work that way.  Imagine my surprise when I finally ran the app and it not only compiled without error, but also did what it was supposed to do!

But no one would recommend this as a best practice.  And really, most modern development tools have checks against this built in.  There are Visual Studio’s aforementioned red squigglies that burrow through your eyes and into your brain, relentlessly mocking you and your stupid mistakes.  Also, on the front end, when working with a transpiled language like TypeScript (which we are, and it’s awesome), you usually have a “TypeScript compile watch” constantly running.  Whenever you save a file that’s being watched (usually all .ts and .html files within your app folder), that watcher will automatically attempt to re-transpile the entire thing.  If you’ve done something really dumb, the command prompt will yell at you (no red squigglies, but oh well – the word ‘squigglies’, by the way, also gets a red squiggly line!).
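
That watch task is usually just the TypeScript compiler running in watch mode from your project folder, along the lines of:

tsc --watch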

Of course, an app can compile or transpile and still be completely broken, so it’s definitely good to test frequently after small changes.  Even when your dev tools don’t force you to.

There’s a great line in the Mel Gibson (I know) ode to violence and America, “The Patriot”.  He’s giving his young sons quick tips on how to kill British soldiers in the most effective manner possible.  He says “Aim small, miss small”.  It’s cheesy, and I have no idea if it actually makes you a better shot, but it is solid advice to co-opt for writing code.  Add or modify something small and check your results.  You might miss- but it will take a much smaller correction than if you’d spent hours editing multiple files across your entire project, only to find you’ve broken the whole thing.  Or missed that evil Redcoat!

The metaphor might have broken down a bit at the end there, but I think you get the point.

Testing my patience

As promised, I’m trying to get into automated testing with JavaScript (as a start).  It was a tough sell for me- the idea of writing tests to cover every aspect of your code, on top of writing the code itself, seemed like doubling my workload.  Not to mention the complexities of testing in a browser-based environment.  After some reading, basic testing seemed to make sense- check that a function returns what it should- but what about a bunch of jQuery DOM manipulation functions?  They don’t return anything, but have multiple side effects- how do I test those?

Of course, wiser folks are shaking their heads and telling me that tests are a great way of catching errors before they become bugs- or of eliminating the need to manually test every edit/addition and then scatter console.log statements all over your code when you can’t figure out what’s going wrong.

But what really got me to dig in and start testing was the idea of regression.  A code base starts out small, but eventually it will likely get very large.  The goal is modularity- one aspect not necessarily altering another- but that doesn’t always work, and sometimes modules have to interact with each other.

A code base is also a living thing- it has to be updated frequently: bugs squished, features added, etc.  And, frequently, when you edit or add to that code base, you’re at risk of breaking something somewhere else- or something right there in the feature you’re editing.  Automated testing really makes sense for this.  Cover your code with enough tests as you go, and any future edits/additions that break something deep inside (or in the same feature) will break those tests- you’ll know right away.

With my main project- a brand new Angular 2 application- I’ve started writing tests using Jasmine.  And it’s really great so far.  The hardest part (as with any JavaScript framework these days) was setup.  Configuring Jasmine to run tests on Angular services (still working on components) and getting SystemJS to play along was a bit tricky.  The docs at angular.io and jasmine.github.io are great, but I did run into one ‘gotcha’.

As a total test newbie, I wanted to do a couple basic ones just to make sure everything was linked up properly.  So I threw in a couple:

it('should add correctly', () => {
    expect(2+2).toEqual(4);
});

But for some reason, in the Angular 2 template test environment, all my tests were running twice. They correctly passed or failed, but everything was duplicated.

The dual culprits were my stupidity and my inexperience.  In the template test env, Angular has you do a SystemJS config in your main test index.html page (where Jasmine will show your results).  After that, you use System.import('your-test-file-here'), then chain a .then(window.onload) call.  Import returns a promise that resolves when your test file is found and loaded.  Calling window.onload then re-fires Jasmine’s bootstrapping- the boot script hooks its runner onto window.onload- so the newly loaded specs actually get executed (Jasmine doesn’t have a ‘fire’ method you could call directly).  The page’s natural onload had already kicked the runner once, and the chained call kicked it again- hence every test running twice.
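
As a rough sketch (the spec path here is a placeholder), that bootstrapping looks something like:

System.import('app/loader.service.spec')
    .then(window.onload)  //Jasmine's boot script hangs its runner on window.onload
    .catch(console.error.bind(console));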

I took out .then(window.onload) and it worked great- each test ran once.  But then I went further.  I wrote some basic tests for an actual Angular service (showing/hiding a loader component).  And the tests broke again- without that call, none of the Angular code got loaded in time to run.  So I added .then(window.onload) back in- and it worked great!  Because I wanted to test the tests, I stumbled into a bug that no one had probably even noticed (you probably can’t really call it a bug- everything worked as it was supposed to, technically).
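
For flavor, a spec along those lines might look something like this- LoaderService and its visible flag are stand-in names, not the real service:

//hypothetical spec- assumes a LoaderService with show/hide methods and a
//visible flag has been imported into the test file
describe('LoaderService', () => {
    let service;

    beforeEach(() => {
        service = new LoaderService();
    });

    it('should show the loader', () => {
        service.show();
        expect(service.visible).toBe(true);
    });

    it('should hide the loader', () => {
        service.hide();
        expect(service.visible).toBe(false);
    });
});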

So, the moral is: if you’re testing Angular, use the window.onload call- it’s necessary.  If you’re just running tests to test your test setup, leave it out.

Test (had to say it one more time).

Double Check

Some lessons are learned the hard way.  Fortunately, some are only learned the almost-hard way.  This is a story about the latter.

I was working on a user profile feature: a neat little slide-out div that allows a logged-in user to edit their name and email, and reset their password.  As with most web development projects, it was a bit of an iceberg.  Creating the slider and simple form was fairly easy, but also on the list of necessary items were: JavaScript validation and visual feedback, PHP processing of the submitted info on the server side (plus server-side validation of that info), and SQL statements for updating the database.  The last two are where I learned my almost-hard lesson.

On a couple of different recent projects, we’ve used a nice ORM system.  Short for object-relational mapping, an ORM basically prevents you from ever having to write SQL directly.  Django uses one- you just create classes in your models.py file to ‘model’ your data.  .NET has one too (I think)- add a new .cs file to your models folder and you can create the table with get and set methods, type tagging, etc.  Those frameworks then produce the SQL commands (specific to the type of database you’re using) and you’re good to go.

But on this project, it’s an older setup – just a simple PHP/MySQL system.  I’m generally ok writing SQL, but when you get out of practice, you make stupid mistakes.  I grabbed the right table, used an UPDATE command on the rows matching the client id, and tested.  What was missing?  Well, in this system, there are two important keys for users: the client_id (basically a group- one client can have many users) and the user_id (the actual unique id for a specific user).  By filtering on only the client id, I’d created a method to update all users related to a client at once (not exactly a useful function).
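
In SQL terms, the slip looks something like this (the table name, SET columns, and id values are my guesses at the schema; client_id and user_id come straight from the story):

-- the buggy version: matches every user in the client's group
UPDATE users SET name = 'Testy Guy', email = 'test@test.com' WHERE client_id = 42;

-- the fix: the unique user_id pins the update to exactly one row
UPDATE users SET name = 'Testy Guy', email = 'test@test.com' WHERE client_id = 42 AND user_id = 137;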

I tested the update and saw my user’s profile save the correct info- great!  But then I checked the rest of the users for that test client- they were all named ‘Testy Guy’ and had the email ‘test@test.com’.  After a few moments of panic, I realized that my mistake, while really stupid and dangerous, was not as bad as I’d thought.  It had only updated the users for that one test client (I had, after all, grabbed the client_id), and I was still working in a full test environment (test-only database and code base).

So, an almost-bad mistake.  Still, a really careless one.  But a reminder and a good lesson to always check my work.  In this case, adding an important client-side check:

  • Checking that the unique username is not null or a blank string when submitting (this will ensure that the server code gets all the info it needs)

And an even more important couple of server-side checks:

  • Check that username again, and return false with an error if it’s blank
  • And use the mysqli_num_rows function (on a preliminary SELECT for that user) to ensure only one row- and therefore only one user- is about to be updated.  If that check finds more or less than one, return false with an error.

And it renewed my interest in getting into automated testing.  I know- I’m a monster for not doing this sooner, but it hasn’t been pushed at any job I’ve worked so far.  And it’s not really a popular course/tutorial topic.  There’s a rush (I know I’ve felt it) to get something working and up on the screen- but the more I read about (and experiment with) truly testing code, the more I realize that it will pay off in the long run.  More on that next week!