Pulling the strings

So, back to the puppet show.

Actually, I guess this is going to be a bit more of a tangent, but it did come up when working on the new Puppeteer project (to crawl our Angular application and store server rendered pages in a local cache), so it still counts…

We wanted to have both a command line option and a graphical UI option to run the crawl service.  The UI would need a backend that keeps a websocket open and broadcasts updates whenever a new url is crawled (to show success or failure, and give stats at the end).  Socket.io works great for this- just install it on your Node project, and you can socket.emit messages and data to your frontend, which can listen with socket.on (use the same event name for both emit and on to coordinate the two).

However, in the command line option, there would be no socket.  With this configuration, a user would just run the command and the messages/data should print to the console.  So we have a shared program that needs two different interfaces.  I had already created a “crawl-runner.js” file with my main “run” function.  It handles the browser init, page creation, and navigation in headless Chrome (using Puppeteer), and it also handles storing and responding with the results.  It was set up to use a simple native Node EventEmitter- which worked fine for interfacing with websockets.  In fact, we could probably just cut out the middleman and eliminate the EventEmitter- just socket.emit directly from the crawler.

But either way, we will have to switch to console.log when using the command line option.  How to reuse the logic from crawl-runner.js in the command line version?  We can pass the emitter as an optional argument to “run” and if it’s not there, alias that name to console.log:

When the program is run in interactive, UI mode (via a dashboard on our Angular app), crawlEmitter is passed to run, and the socket interface works.  When it’s run as a command line application, we still call “crawlEmitter.emit” with the message and data we want to send, but the check at the top of the function will call “console.log” whenever “crawlEmitter.emit” is called (because there is no crawlEmitter in this case).

Another option would be to simply pass the function we want to use as a broadcaster into run.  So, pass crawlEmitter.emit as the 2nd argument for the dashboard version, or console.log for the command line version.  That might be a better, more readable solution, so I’m thinking about switching (I haven’t tested this yet- but I don’t see any reason it shouldn’t work).
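A sketch of what that variant might look like (untested, as noted above- and note the bind, so emit keeps its emitter as `this`):

```javascript
// Pass the broadcast function itself into run; console.log is a sensible
// default for the command line case.
async function run(urls, broadcast = console.log) {
  for (const url of urls) {
    // ...crawl the page here...
    broadcast('crawled', { success: true, url });
  }
}

// UI mode (bind is important- a bare crawlEmitter.emit loses its `this`):
// run(urls, crawlEmitter.emit.bind(crawlEmitter));
// CLI mode:
// run(urls);
```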

One of the most fun things about programming is how many roads you can take to one final product.  The trick is finding the balance between most efficient and most understandable – and always being open to finding a new route!



Quick tangent this week, then back to the puppet show!

We launched a website this past week.  I’d love to say we were able to use all the cool new deployment tools, but for various reasons (stack choice, database connectors, developing and deploying on Windows, major version and architecture updates), it just wasn’t possible.  So, we had to shut down the old server, remotely log in, make some upgrades and update some versions, cross our fingers, and re-start.

Always a risky procedure.  Particularly with a two person team.  Really a one and a half person team.  I do a bit of dev ops, but am mostly a Javascript guy.  Apache config files aren’t really my strong suit.

My partner on this was updating Django versions, then trying to install the right db connector to work with SqlServer.  It was tricky, but eventually he got there.  I updated Node, built our Angular application, and deployed the front end server.  For the most part, it worked, but there were a few ‘gotchas’.

One involved enabling https.  This was something we didn’t test in development- we have one SSL certificate and it was in use for the production site.  Enabling it on the Express server was simple enough, but I didn’t realize that there are 3 important parts: the key, the certificate (I had both of those), and the chain.  Most modern browsers only need the first two, but some require the chain (I’m looking at you, Firefox on Mac).  In Node, the chain bundle also needs to be split back into its individual certificates before being passed along.  But we got past that hurdle after a bit of frantic Googling.

The other interesting one involved server side rendering our Angular application.  Like most modern JS apps, ours uses some 3rd party libraries.  We could never get Angular’s built in SSR to work (though we are still trying), so we’re using a 3rd party service to do the render.  Our server contacts that 3rd party’s servers, gets the rendered version of a page, and returns it (and then the SPA loads and takes over).  Google’s spiders are happy.  But our users aren’t- they just had to wait over 2 seconds to get a page.  However, we can store those renders in a local cache, so on any subsequent visit, it’s blazing fast again (I’m documenting the crawl and storing of those in the Puppeteer series- see entry one from last week- it’s super fun!).

Back to the related launch issue: we were getting cross origin request errors.  But only if the url included ‘www’.  For example: if I went to domain.tld, our site loaded, it made a delayed https call to our internal news article api, loaded those, and everything was great.  But if I went to http://www.domain.tld, boom- CORS error.  The error referenced a disallowed header.

The lesson this reinforced was to use your dev tools.  I hit F12 and checked the network tab.  Clicked the red response and inspected the request and response headers.  Then I made a successful (no ‘www’) request and compared those.  The ‘www’ request was getting an extra header related to the prerendering service (still not sure why it wasn’t present on all requests).  So, on our Apache server (running the news API- which just returns JSON), we just had to add that header to the allowed list.  I definitely don’t have that command memorized, but once I knew exactly what I was Googling for, it took about 2 minutes to find and implement.
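The fix looked something like this in the Apache config (the header name below is a placeholder- use whichever header shows up in the failing request- and mod_headers has to be enabled):

```apache
# requires mod_headers (a2enmod headers)
# X-Example-Prerender-Header is a placeholder for the actual offending header
Header set Access-Control-Allow-Headers "Content-Type, X-Example-Prerender-Header"
```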

Use your dev tools.  Words to live by in this business.  When something doesn’t work, there’s always an answer- particularly in this world where everything is online.  Dev tools help narrow down that search.  I went from a panicked search of “site works without ‘www’ but not with ‘www’” to “apache configuration allow headers”.  The first got me nothing useful.  The second solved my problem almost immediately.


Master of Puppets

I’ve been working with Google’s new Puppeteer library the past week or so.  It’s basically the next generation headless browser API- you can use it for testing, scraping, performance profiling- their list is better than anything I can post here.

I’m using it to crawl our Angular application.  We have a service to generate a server-side rendered version of our components (we are working on a custom process for this, but it turns out to be very tricky if you’re using 3rd party libraries).  However, using this service requires a call out to their server to create the pre-rendered version (so search engines can index more than just an empty body tag for our site!).  This extra call adds a noticeable lag on our page load.

We do have a local cache on our own server- once a page is rendered, it’s stored in that cache for a while and all future requests are nice and fast.  Ideally, however, no actual user should see that initial slow load if the cache doesn’t have the page they’re looking for.

Side note: I know this is a bit of a small problem.  It will affect very few users – specifically, only people who find a “deep” page in our SPA via a Google search or link and that page hasn’t been visited for a while so it’s not in our cache.  However, we are at the point where we are trying to optimize as much as possible pre-launch, and this project’s time has come!

So, we need a reliable, preferably automated way to crawl all our routes.  Puppeteer is perfect for this operation.  It has a fairly large API, but performing the basics is really simple:

//be sure to npm install puppeteer
const puppeteer = require('puppeteer');

//note: the awaits below need to run inside an async function

//create a browser instance
const browser = await puppeteer.launch();

//open a page
const page = await browser.newPage();

//important for some sites (like SPAs)- make sure we wait until the page is fully loaded
//(this was 'networkidle' in early Puppeteer; newer versions use 'networkidle0' or 'networkidle2')
await page.goto('https://your-url.com', { waitUntil: 'networkidle0' });

//evaluate gives you access to the DOM- use good old querySelector type methods!
//('htmlselector' is a placeholder- 'a' would grab every link on the page)
const myLinks = await page.evaluate(() => {
    const links = Array.from(document.querySelectorAll('htmlselector'));
    return links.map(link => link.href);
});

Note that everything is using async/await – which is really great. It makes asynchronous code easier to read and reason about (depending on who you ask…).  However, you will need Node 7.6 or higher.

After the above code, you should have an array of links from whatever selector you passed to querySelectorAll.  The functional programmer in me wanted to just forEach that array, visit each in a new puppeteer ‘goto’ command, and call it good.  However, that doesn’t work- forEach fires an async callback for every item without waiting for any of them to finish.  With async/await, you need to use a for/of loop and await each crawl:

for (let link of myLinks) {
    try {
        await crawlPage(link);
    } catch (err) {
        handleError(link, err);
    }
}

Try/catch is the preferred (only?) error handling mechanism for async/await. Also, at this point, I had abstracted my crawl and error processes out into their own methods. Basically, crawlPage just opens a new page, goes to it, saves the result to a log, emits the result (more on that in the next post!), and returns the page.close() method- which itself returns a promise (allowing crawlPage to be an async function).

handleError records and emits the error source (url) and message.

async function crawlPage(url) {
    const page = await browser.newPage();
    //goto resolves with the response- ok() is true for 2xx status codes
    const pageResult = await page.goto(url);
    let resultObj = { success: pageResult.ok(), url };
    crawlEmitter.emit('crawled', resultObj);
    return page.close();
}

function handleError(url, err) {
    const errorObj = { success: false, url, err };
    //note: an 'error' event with no listener attached will throw in Node
    crawlEmitter.emit('error', errorObj);
}

All very cool and useful stuff.  We can run this process over our application to refresh the local cache at a regular interval- preventing an actual user from ever experiencing the slow load.

Next week will be part 2 of this saga.  I wanted two options: run this from the command line or as a cron job, and have a nice UI that will show the last crawl, stats, and allow a non-technical user to crawl (including crawl by different categories- our url routing comes partly from a JSON config file).  This meant I would need to log to the console sometimes (running as a command line tool), or return data to a browser other times.  This is where crawlEmitter comes in- it’s a custom Node EventEmitter (not hard to create) that sends data to a frontend via a websocket.  The fun part was aliasing my crawlEmitter- if it doesn’t exist when the main function starts, it falls back to a good old console.log!

Little Johnny Angular Has Trouble Focusing in Class

Short post this week, but working on a longer one in the next couple weeks regarding Google’s new Puppeteer project.  Spoiler- I think it’s pretty cool.

I have worked with the latest generation of Angular for a couple years now.  It’s generally great, occasionally frustrating, and often interesting.  However, until this past week, I hadn’t really had a need to work with directives.  Specifically, attribute directives, but it turns out they can be very useful!

The challenge: I was creating a search feature for a new website.  I had the actual search function working well (and the data being searched over was small enough to include in an in-memory JSON structure, so it’s a nice and fast search-as-you-type setup).  All that was left was to put a UI on this thing.  I put together a nice full screen display with a large input element that fires the search function as a user types.  It will show/hide based on an ngIf conditional.  All simple enough in Angular land- and something we do all the time (by the way, I also tried this with the [hidden] element attribute, but ran into the same difficulty described below).

So, click the search icon, the background fades in and search input slides down.  The user can search and results pop right up.  All great- but the input field autofocus was not cooperating.  It would fire properly the very first time ngIf became true, but never again after that.  If a user closed the search display, then opened again for another search, no autofocus.

Not a huge deal- but an inconvenience for users, and something that I should be able to make work.  It makes sense- the built in autofocus attribute was probably only ever designed to fire on a page load- but that only happens once (ideally) with an angular application.  I tried firing a .focus() event in my controller, but it didn’t seem to have any effect.

The answer was a directive.  Specifically, an attribute directive.  Directives were all over Angular v1, but in the latest generation, you can usually get away with just having components and services.  In this case, I could just make a super simple directive that grabs a reference to its element (the Angular way- via ElementRef) and fires .focus() in the right lifecycle hook (some references said it would work in ngOnInit, but I had to use ngAfterViewInit).

Boom- the search input element gets focused every time the search display opens!


Deep Streams

Usually when I think of Node, I think about web servers.  That’s mostly what I use it for when writing code- setting up a simple test server for a proof of concept, or bringing in Express and its ecosystem for more production-ready projects.

But in reality, I use Node for a whole lot more.  Of course, just about anything NPM related is a use of Node- but it also powers all the awesome developer tools that we don’t even really need to think about much anymore.  When a minifier runs over your Javascript code before you push to production- Node is probably doing that magic.  Same for a bundler.  It’s behind the scenes on quite a bit of frontend workflow these days.  I use it for such all the time, but I hadn’t had much chance to really write any of those dev tools until recently.

The task was pretty simple- I had a js file with a bunch of arrays and objects someone else had entered.  The formatting had been mangled somewhere along the way- there were long strings of spaces everywhere- I wanted to strip them out, but leave any single spaces (to preserve multi word strings).  Now I know: any halfway good code editor will have the search/replace feature to handle this, but I could see this being a nice little utility to write in Node.  That way, I could run it over an entire directory if necessary (probably won’t ever be necessary, but I really wanted to do this short little side project).

My first iteration was a super simple script using the fs module.  First, a nice path splitter utility in a separate file.  This takes a string, splits it out by path and extension, and inserts '-' and whatever new string you want.  This prevents overwriting the original file (though this part would be unnecessary if you do want to overwrite- just pass the same file name to the write function):

Then we can use that in our script to strip multi-spaces and return a new file:

All very cool. But I really like the notion of streams in Node. What if the file I need to manipulate is really large? With my current setup, it might take a lot of memory to read the entire file, then write it out. But that’s what streams are for! So I rewrote the script with a custom transform stream. It wasn’t all that difficult- as soon as I realized that the required method on your custom stream (that extends the Transform class) has to be named _transform. If you leave out the underscore, it will not work (recognizes Transform as undefined).

Again, in a separate file (small modules for the win!), I defined my custom stream:

Then it was just a matter of importing that and the path splitting utility created for the original fs version (code reuse for the win!) and running a couple Node-included streams (createReadStream and createWriteStream) that can make a stream from a file automatically:

Both methods (fs and stream) are simple and concise. Both have their place. Creating a custom transform stream was probably unnecessary for this task, but would be very useful for very large files. Either way, it was a fun quick dive into some other corners of Node!

Static Signal

Express is awesome.

I’ve worked with a few backend technologies, but Node will probably always be my favorite because JS was the first language I really picked up.  Python (Django) and C# (.NET) are both cool, but being able to write JS on the server is great- no more context switching and ending my Python lines with semicolons!  You can create some really cool stuff with Node if you mind the event loop.  If you’re experienced coding for a browser, it shouldn’t be too difficult- the principle of not blocking the loop applies in both places!

Anyway, back to Express.  It is a very useful wrapper for Node- making it easy to handle requests and send responses, set headers, and even bring in security packages (hello helmet!).

One of the best parts is the static file serving process.  As a learning exercise, I wrote a Node process to serve static files- it works, but it’s not pretty, and I’m sure it’s not as robust as Express offers.  Just do the normal Express app setup, then call:

app.use(express.static(path.join(__dirname, 'directory-path-here')));

If you want your code to be a little more flexible, use Node’s __dirname global.  It gives you the directory name of the current module- very useful to make sure your code is portable (for running in dev vs production, for example).

However, there is one little “gotcha” with serving static via express- particularly when creating a single page application.  express.static can be passed an options object as its second argument.  Without anything passed, it uses sensible defaults- one of which is looking for an ‘index’ file in the directory it’s given.  Makes sense- it’s a good place to start when looking for what file to serve.  If you’re creating a single page application, you probably want to return index.html for just about everything.

But we wanted to include a nice alternate homepage for people using very old browsers (think IE9 and below).  Those won’t load our fancy Angular application, but we didn’t want them to just see a page of gibberish.  So we created a simple HTML page with some vanilla JS interactions on a contact form (felt like the good old days!).  But when we went to configure the routing, we just couldn’t get the server to return anything but index.html- even when we set a conditional in our route to check the browser name/version and return a different file.

The answer was in the documentation (of course).  That options object can be passed an “index” property.  This is a boolean that defaults to true, but if you set it to false, Express won’t automatically serve up the index.  You can control what gets served- just remember that now you have to return index.html manually (after any other possible files, depending on your setup).

app.use(express.static(path.join(__dirname, '/'), {index: false}));

There we go!  Now people forced to use IE8 (and I can’t think of anyone who would voluntarily use it these days…) can still view our super important content!

Living in a Material World

The more I work with React, the cooler it seems.  I know- I’m way behind.  React is already uncool, Vue is the new hotness.

However, every time I start writing custom HTML components in plain ol’ Javascript, I end up with a pattern that looks a lot like React.  I then say to myself: “Why am I recreating a crappy version of React instead of just using React?  People smarter than me have spent many hours making a much more optimized version of this monstrosity I’m building”.  Then I look around to make sure no one heard me talking to myself.

It’s at that point that I run the “create-react-app my-app” command and really get down to business.

I started a project with React almost a year and a half ago, but just couldn’t wrap my head around the router.  It’s not that React Router (v4) wasn’t well built (though I might have thrown some insults at it in a frustration-fueled rage), but I’d been using Angular at work, and the router philosophies are pretty different between those two.

I started a new project with React (integrating some Philips Hue lights into a UI) a couple weeks ago, and decided to give it all another shot.  The first few hours of figuring out the routing were still a struggle, but then it kind of clicked.  The <Router> declaration in React is a bit like the router-outlet in Angular, with the declaration of the route all in one.  I don’t think I’d really grasped that originally, and it led to a mess of an application.

But the router isn’t the point of this short post.  This one is about integrating Material-UI with React and how cool composable components are when you start to understand how to really use them (note that I said “start to understand”- I have a long way to go).

So, I’d already created some simple components to get everything working.  I’d also cooked up a simple Node server to serve as a fake API.  I hadn’t bought the smart lights yet- but wanted to test, so I just copied the JSON structure from the documentation, pasted it into a file on my computer, and set up a server on localhost to respond to the same endpoints the actual API would and serve up the JSON.  It works surprisingly well!

One of those components displays all the lights available (2 currently).  That LightPanel component is made up of LightSwitch components- one for each light.  The focus early on was to get them working.  Once I reached that point, the time had come to actually make them look presentable.

Too often this is where I end up burning a lot of time.  I have a habit of trying to create nice looking css on my own- but this time I learned my lesson.  I’d let the experts help out and use the Material UI library.  It integrated pretty easily.  As a test, I switched my LightSwitch jsx from simple custom radio buttons to the Toggle component and it instantly looked better.  Success!

But the really cool part was when I decided to use the Card element for my main LightPanel grid.  Again- this will show all the switches, their status, and allow a user to turn them on or off.  Each light gets its own Card in the grid.  The functionality to get this info and do these actions was already there, I just had to integrate it into the Card element.  I thought the best way would be to just embed my LightSwitch component into the Card’s ‘text’ section.  This worked, but didn’t look quite right.  I realized that it would really go best in the Card’s ‘title’ section.

But the title section is meant to be passed info as attributes- not have text inside tags.  Instead of <CardTitle>My text here</CardTitle>, it should be <CardTitle title="My text here" />.  I wanted to basically embed a custom React component into the "title" attribute.

Thinking “there’s no way this will work”, that’s exactly what I did:
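It looked roughly like this (from memory- the prop names on my own LightSwitch component are illustrative, but Card, CardTitle, and the title prop come from Material-UI):

```jsx
<Card>
  <CardTitle
    title={
      <LightSwitch
        light={light}
        onToggle={this.handleToggle}
      />
    }
  />
</Card>
```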

I saved and waited for my command line output to give me a wall of red text, but none appeared.  Fingers crossed, I opened my browser, and there it was, in all its glory.  The card with my custom switch component embedded as the title attribute- toggle functionality and all!

So thanks all around: to React, to Material UI, to the wonders of Javascript!