Pulling the strings

So, back to the puppet show.

Actually, I guess this is going to be a bit more of a tangent, but it did come up when working on the new Puppeteer project (to crawl our Angular application and store server-rendered pages in a local cache), so it still counts…

We wanted to have both a command line option and a graphic UI option to run the crawl service.  The UI would need a backend that keeps a websocket open and broadcasts updates whenever a new url is crawled (to show success or failure, and give stats at the end).  Socket.io works great for this- just install it on your Node project, and you can socket.emit messages and data to your frontend, which can listen with socket.on (use the same event name for both emit and on to coordinate the two).
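
In rough terms, it looks something like this (just a sketch- io here is the socket.io server instance, and the event and data names are my own, not the project’s):

//server side (Node)
io.on('connection', socket => {
    socket.emit('crawled', { url: '/some-route', success: true });
});

//frontend- listen with the same event name
socket.on('crawled', data => console.log(data));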

However, in the command line option, there would be no socket.  With this configuration, a user would just run the command and the messages/data should print to the console.  So we have a shared program that needs two different interfaces.  I had already created a “crawl-runner.js” file with my main “run” function.  It would handle the browser init, page creation, and navigation in headless Chrome (using Puppeteer).  It also handles storing and responding with the results.  It was set up to use a simple native Node EventEmitter- which worked fine for interfacing with websockets.  In fact, we could probably just cut out the middleman and eliminate the EventEmitter- just socket.emit directly from the crawler.

But either way, we will have to switch to console.log when using the command line option.  How do we reuse the logic from crawl-runner.js in the command line version?  We can pass the emitter as an optional argument to “run”, and if it’s not there, alias that name to console.log:
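
A minimal sketch of that check (the parameter and event names here are stand-ins, not the exact crawl-runner.js code):

async function run(routes, crawlEmitter) {
    //command line mode: no emitter was passed in, so alias emit to console.log
    if (!crawlEmitter) {
        crawlEmitter = { emit: (event, data) => console.log(event, data) };
    }

    //...the rest of the crawl logic calls crawlEmitter.emit either way
    crawlEmitter.emit('crawled', { url: routes[0], success: true });
}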

When the program is run in interactive, UI mode (via a dashboard on our Angular app), crawlEmitter is passed to run, and the socket interface works.  When it’s run as a command line application, we still call “crawlEmitter.emit” with the message and data we want to send, but the check at the top of the function will call “console.log” whenever “crawlEmitter.emit” is called (because there is no crawlEmitter in this case).

Another option would be to simply pass the function we want to use as a broadcaster into run.  So, pass crawlEmitter.emit as the 2nd argument for the dashboard version, or console.log for the command line version.  That might be a better, more readable solution, so I’m thinking about switching (I haven’t tested this yet- but I don’t see any reason it shouldn’t work).
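
In sketch form (assuming the second parameter is just called broadcast- names are mine):

//inside crawl-runner.js
async function run(routes, broadcast) {
    for (let url of routes) {
        //...crawl the url, then report the result...
        broadcast('crawled', { url, success: true });
    }
}

//dashboard version- bind emit so it keeps its EventEmitter context
run(routes, crawlEmitter.emit.bind(crawlEmitter));

//command line version
run(routes, console.log);

One small catch: emit would need to be bound (or wrapped in an arrow function) so it keeps its EventEmitter context- something to check before actually switching over.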

One of the most fun things about programming is how many roads you can take to one final product.  The trick is finding the balance between most efficient and most understandable – and always being open to finding a new route!


Master of Puppets

I’ve been working with Google’s new Puppeteer library the past week or so.  It’s basically the next generation headless browser API- you can use it for testing, scraping, performance profiling- their list is better than anything I can post here.

I’m using it to crawl our Angular application.  We have a service to generate a server-side rendered version of our components (we are working on a custom process for this, but it turns out to be very tricky if you’re using 3rd party libraries).  However, using this service requires a call out to their server to create the pre-rendered version (so search engines can index more than just an empty body tag for our site!).  This extra call adds a noticeable lag to our page load.

We do have a local cache on our own server- once a page is rendered, it’s stored in that cache for a while and all future requests are nice and fast.  Ideally, however, no actual user should see that initial slow load if the cache doesn’t have the page they’re looking for.

Side note: I know this is a bit of a small problem.  It will affect very few users – specifically, only people who find a “deep” page in our SPA via a Google search or link and that page hasn’t been visited for a while so it’s not in our cache.  However, we are at the point where we are trying to optimize as much as possible pre-launch, and this project’s time has come!

So, we need a reliable, preferably automated way to crawl all our routes.  Puppeteer is perfect for this operation.  It has a fairly large API, but performing the basics is really simple:

//be sure to npm install puppeteer
const puppeteer = require('puppeteer');

//note: await only works inside an async function, so everything below
//should live inside one (an async main function, for example)

//create a browser instance
const browser = await puppeteer.launch();

//open a page
const page = await browser.newPage();

//navigate
await page.goto('https://your-url.com');

//important for some sites (like SPAs)- make sure we wait until the page is fully loaded
//(in newer Puppeteer versions this option is 'networkidle0' or 'networkidle2')
await page.waitForNavigation({ waitUntil: 'networkidle' });

//evaluate gives you access to the DOM- use good old querySelector type methods!
//('htmlselector' is a placeholder- pass whatever selector matches your links)
const myLinks = await page.evaluate(async () => {
    const links = Array.from(document.querySelectorAll('htmlselector'));
    return Promise.resolve(links.map(link => link.href));
});

Note that everything is using async/await – which is really great. It makes asynchronous code easier to read and reason about (depending on who you ask…).  However, you will need Node 7.6 or higher.

After the above code, you should have an array of links matching whatever selector you passed to querySelectorAll.  The functional programmer in me wanted to just forEach that array, visit each link with a new puppeteer ‘goto’ command, and call it good.  However, that doesn’t work- forEach won’t wait for an async callback, so all the requests fire at once.  With async/await, you need to use a for/of loop and await each crawl:

for(let link of myLinks) {
    try {
        await crawlPage(link);
    } catch(err) {
        handleError(link, err);
    }
}

Try/catch is the preferred (only?) error handling mechanism for async/await. Also, at this point, I had abstracted my crawl and error processes out into their own methods. Basically, crawlPage just opens a new page, goes to it, saves the result to a log, emits the result (more on that in the next post!), and returns the page.close() method- which itself returns a promise (allowing crawlPage to be an async function).

handleError records and emits the error source (url) and message.

async function crawlPage(url) {
    const page = await browser.newPage();
    const pageResult = await page.goto(url);
    //pageResult is the Puppeteer Response (note: in current Puppeteer, ok is a method- pageResult.ok())
    let resultObj = { success: pageResult.ok, url };
    //resultGroup and crawlEmitter are defined elsewhere in the module
    resultGroup.push(resultObj);
    crawlEmitter.emit('crawled', resultObj);
    //page.close() returns a promise, which works as the async function's return value
    return page.close();
}

function handleError(url, err) {
    const errorObj = { success: false, url, err };
    resultGroup.push(errorObj);
    crawlEmitter.emit('error', errorObj);
}

All very cool and useful stuff.  We can run this process over our application to refresh the local cache at a regular interval- preventing an actual user from ever experiencing the slow load.

Next week will be part 2 of this saga.  I wanted two options: run this from the command line or as a cron job, and have a nice UI that will show the last crawl, stats, and allow a non-technical user to crawl (including crawl by different categories- our url routing comes partly from a JSON config file).  This meant I would need to log to the console sometimes (running as a command line tool), or return data to a browser at other times.  This is where crawlEmitter comes in- it’s a custom Node EventEmitter (not hard to create) that sends data to a frontend via a websocket.  The fun part was aliasing my crawlEmitter- if it doesn’t exist when the main function starts, it falls back to a good old console.log!
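
The core of the emitter side looks something like this (a sketch- the names are mine, and it assumes a socket.io server instance called io is already set up):

const EventEmitter = require('events');

//crawlEmitter is just a plain Node EventEmitter
const crawlEmitter = new EventEmitter();

//whenever the crawler reports a result, relay it out over the websocket
crawlEmitter.on('crawled', resultObj => {
    io.emit('crawled', resultObj);
});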

Deep Streams

Usually when I think of Node, I think about web servers.  That’s mostly what I use it for when writing code- setting up a simple test server for a proof of concept, or bringing in Express and its ecosystem for more production-ready projects.

But in reality, I use Node for a whole lot more.  Of course, just about anything NPM related is a use of Node- but it also powers all the awesome developer tools that we don’t even really need to think about much anymore.  When a minifier runs over your Javascript code before you push to production- Node is probably doing that magic.  Same for a bundler.  It’s behind the scenes on quite a bit of frontend workflow these days.  I rely on those tools all the time, but I hadn’t had much chance to actually write one until recently.

The task was pretty simple- I had a js file with a bunch of arrays and objects someone else had entered.  The formatting had been mangled somewhere along the way- there were long strings of spaces everywhere- I wanted to strip them out, but leave any single spaces (to preserve multi-word strings).  Now I know: any halfway decent code editor will have a search/replace feature to handle this, but I could see this being a nice little utility to write in Node.  That way, I could run it over an entire directory if necessary (it probably won’t ever be necessary, but I really wanted to do this short little side project).

My first iteration was a super simple script using the fs module.  First, a nice path splitter utility in a separate file.  This takes a string, splits it out by path and extension, and inserts ‘-‘ and whatever new string you want.  This prevents overwriting the original file (though this part would be unnecessary if you do want to overwrite- just pass the same file name to the write function):
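
A sketch of what that utility might look like (the file and function names here are my stand-ins):

//path-splitter.js
//split a file path at the extension and insert '-' plus a new string,
//so the output file doesn't overwrite the original
function pathSplitter(filePath, newString) {
    const dotIndex = filePath.lastIndexOf('.');
    const name = filePath.slice(0, dotIndex);
    const extension = filePath.slice(dotIndex);
    return `${name}-${newString}${extension}`;
}

module.exports = pathSplitter;

So pathSplitter('data.js', 'stripped') hands back 'data-stripped.js'.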

Then we can use that in our script to strip multi-spaces and return a new file:
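
Roughly like this (again a sketch, reusing the pathSplitter stand-in from above):

const fs = require('fs');
const pathSplitter = require('./path-splitter');

//file to clean up, passed on the command line
const filePath = process.argv[2];

fs.readFile(filePath, 'utf8', (err, data) => {
    if (err) throw err;
    //collapse any run of two or more spaces down to a single space
    const stripped = data.replace(/ {2,}/g, ' ');
    //write to a new file (e.g. data-stripped.js) so the original stays intact
    fs.writeFile(pathSplitter(filePath, 'stripped'), stripped, err => {
        if (err) throw err;
    });
});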

All very cool. But I really like the notion of streams in Node. What if the file I need to manipulate is really large? With my current setup, it might take a lot of memory to read the entire file, then write it out. But that’s what streams are for! So I rewrote the script with a custom transform stream. It wasn’t all that difficult once I realized that the required method on your custom stream (the one that extends the Transform class) has to be named _transform. If you leave out the underscore, it will not work (Node treats the transform function as undefined).

Again, in a separate file (small modules for the win!), I defined my custom stream:
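
Something along these lines (a sketch- the important part is the _transform name; the file and class names are mine):

//space-stripper.js
const { Transform } = require('stream');

class SpaceStripper extends Transform {
    //note the underscore- the method Node calls internally must be named _transform
    _transform(chunk, encoding, callback) {
        const stripped = chunk.toString().replace(/ {2,}/g, ' ');
        callback(null, stripped);
    }
}

module.exports = SpaceStripper;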

Then it was just a matter of importing that and the path splitting utility created for the original fs version (code reuse for the win!) and using a couple of built-in fs methods (createReadStream and createWriteStream) that can create a stream from a file automatically:
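
Something like this (a sketch, with the same assumed file names as above):

const fs = require('fs');
const SpaceStripper = require('./space-stripper');
const pathSplitter = require('./path-splitter');

const filePath = process.argv[2];

//read the original file, pipe it through the custom transform, write the result to a new file
fs.createReadStream(filePath)
    .pipe(new SpaceStripper())
    .pipe(fs.createWriteStream(pathSplitter(filePath, 'stripped')));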

Both methods (fs and stream) are simple and concise. Both have their place. Creating a custom transform stream was probably unnecessary for this task, but would be very useful for very large files. Either way, it was a fun quick dive into some other corners of Node!

Static Signal

Express is awesome.

I’ve worked with a few backend technologies, but Node will probably always be my favorite because JS was the first language I really picked up.  Python (Django) and C# (.NET) are both cool, but being able to write JS on the server is great- no more context switching and ending my Python lines with semicolons!  You can create some really cool stuff with Node if you mind the event loop.  If you’re experienced coding for a browser, it shouldn’t be too difficult- the principle of not blocking the loop applies in both places!

Anyway, back to Express.  It is a very useful wrapper for Node- making it easy to handle requests and send responses, set headers, and even bring in security packages (hello helmet!).

One of the best parts is the static file serving process.  As a learning exercise, I once wrote a Node process to serve static files- it works, but it’s not pretty, and I’m sure it’s not as robust as what Express offers.  Just do the normal Express app setup, then call:

app.use(express.static(path.join(__dirname, 'directory-path-here')));

If you want your code to be a little more flexible, use Node’s __dirname global.  It gives you the directory name of the current module- very useful to make sure your code is portable (for running in dev vs production, for example).

However, there is one little “gotcha” with serving static files via Express- particularly when creating a single page application.  express.static can be passed an options object as its second argument.  Without anything passed, it uses sensible defaults- one of which is looking for an ‘index’ file in the directory it’s given.  Makes sense- it’s a good place to start when looking for what file to serve.  If you’re creating a single page application, you probably want to return index.html for just about everything.

But we wanted to include a nice alternate homepage for people using very old browsers (think IE9 and below).  Those won’t load our fancy Angular application, but we didn’t want them to just see a page of gibberish.  So we created a simple HTML page with some vanilla JS interactions on a contact form (felt like the good old days!).  But when we went to configure the routing, we just couldn’t get the server to return anything but index.html- even when we set a conditional in our route to check the browser name/version and return a different file.

The answer was in the documentation (of course).  That options object can be passed an “index” property.  This is a boolean that defaults to true, but if you set it to false, Express won’t automatically serve up the index.  You can control what gets served- just remember that now you have to return index.html manually (after any other possible files, depending on your setup).

app.use(express.static(path.join(__dirname, '/'), {index: false}));
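
From there, the catch-all route decides what to send.  A rough sketch (the real browser check we used was more involved, and fallback.html is a made-up name):

//old IE identifies itself with 'MSIE' in the user agent string
app.get('*', (req, res) => {
    const userAgent = req.headers['user-agent'] || '';
    if (userAgent.includes('MSIE')) {
        res.sendFile(path.join(__dirname, 'fallback.html'));
    } else {
        res.sendFile(path.join(__dirname, 'index.html'));
    }
});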

There we go!  Now people forced to use IE8 (and I can’t think of anyone who would voluntarily use it these days…) can still view our super important content!

Living in a Material World

The more I work with React, the cooler it seems.  I know- I’m way behind.  React is already uncool, Vue is the new hotness.

However, every time I start writing custom HTML components in plain ol’ Javascript, I end up with a pattern that looks a lot like React.  I then say to myself: “Why am I recreating a crappy version of React instead of just using React?  People smarter than me have spent many hours making a much more optimized version of this monstrosity I’m building”.  Then I look around to make sure no one heard me talking to myself.

It’s at that point that I run the “create-react-app my-app” command and really get down to business.

I started a project with React almost a year and a half ago, but just couldn’t wrap my head around the router.  It’s not that React Router (v4) wasn’t well built (though I might have thrown some insults at it in a frustration-fueled rage), but I’d been using Angular at work, and the router philosophies are pretty different between those two.

I started a new project with React (integrating some Philips Hue lights into a UI) a couple weeks ago, and decided to give it all another shot.  The first few hours of figuring out the routing were still a struggle, but then it kind of clicked.  The <Router> declaration in React is a bit like Angular’s router-outlet, with the route declarations all in one place.  I don’t think I’d really grasped that originally, and it led to a mess of an application.

But the router isn’t the point of this short post.  This one is about integrating Material-UI with React and how cool composable components are when you start to understand how to really use them (note that I said “start to understand”- I have a long way to go).

So, I’d already created some simple components to get everything working.  I’d also cooked up a simple Node server to serve as a fake API.  I hadn’t bought the smart lights yet- but wanted to test, so I just copied the JSON structure from the documentation, pasted it into a file on my computer, and set up a server on localhost to respond to the same endpoints the actual API would and serve up the JSON.  It works surprisingly well!

One of those components displays all the lights available (2 currently).  That LightPanel component is made up of LightSwitch components- one for each light.  The focus early on was to get them working.  Once I reached that point, the time had come to actually make them look presentable.

Too often this is where I end up burning a lot of time.  I have a habit of trying to create nice looking CSS on my own- but this time I learned my lesson.  I’d let the experts help out and use the Material UI library.  It integrated pretty easily.  As a test, I switched my LightSwitch JSX from simple custom radio buttons to the Toggle component and it instantly looked better.  Success!

But the really cool part was when I decided to use the Card element for my main LightPanel grid.  Again- this will show all the switches, their status, and allow a user to turn them on or off.  Each light gets its own Card in the grid.  The functionality to get this info and do these actions was already there, I just had to integrate it into the Card element.  I thought the best way would be to just embed my LightSwitch component into the Card’s ‘text’ section.  This worked, but didn’t look quite right.  I realized that it would really go best in the Card’s ‘title’ section.

But the title section is meant to be passed info as attributes- not have text inside tags.  Instead of <CardTitle>My text here</CardTitle>, it should be <CardTitle title="My text here" />.  I wanted to basically embed a custom React component into the “title” attribute.

Thinking “there’s no way this will work”, that’s exactly what I did:
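
Something like this (a sketch- my actual LightSwitch takes a few more props, but the shape is the same):

<Card>
    <CardTitle title={<LightSwitch light={light} onToggle={this.handleToggle} />} />
    <CardText>
        {/* other light details can live down here */}
    </CardText>
</Card>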

I saved and waited for my command line output to give me a wall of red text, but none appeared.  Fingers crossed, I opened my browser, and there it was, in all its glory.  The card with my custom switch component embedded as the title attribute- toggle functionality and all!

So thanks all around: to React, to Material UI, to the wonders of Javascript!

When in Rome

I really like learning new programming concepts and methods.  The beautiful and frustrating thing about software development is that it’s kind of endless.  No matter how much you learn, there’s almost always going to be something you don’t know.  As long as you don’t let that thought overwhelm you, and focus on the process instead of the outcome, it can be a fun ride.

And the internet is a great place for learning.  Don’t get me wrong- it’s also full of inaccurate, opinion-tainted drivel, but you can definitely find really good resources.  How to tell them apart?  The compiler (or browser, in the case of the web) doesn’t lie.  Learn something, then try it on your own on a real project- if it works, you’ve found a good resource.

A couple months back I posted about Wes Bos’ great courses (seriously- if you want to learn, go get his stuff).  I also find Pluralsight very helpful (sometimes you feel like laying down and being lazy- why not watch videos too!).  Free Code Camp is another great avenue for learning.  I started that track a few years ago but drifted away due to work and time limitations.  When I went back a few weeks ago, I found even more great problems and projects to practice on.

One of them is the Roman Numeral Conversion algorithm problem.  It’s listed as an “intermediate” difficulty question, but the instructions are simple: “Convert the given number into a roman numeral”.  For some reason, this one was giving me a really hard time- I just couldn’t figure out how to do it without making a big map of numbers linked to their corresponding roman numerals.

So I decided to do exactly that.  I would solve the problem in the worst way I possibly could.  If my only idea was to make a big map, I’d just make a big map.  And something interesting happened along the way.  Patterns began to emerge.  Using my big map, I had a bunch of conditional logic, depending on the length of the number.  But I realized I could break that into a single loop.  I had a whole bunch of logic in that loop that really belonged inside a helper function, so that was the next step- abstract it out to a helper.  Finally, I realized that instead of the map (much smaller now), if I used an array, I could match the index up with the loop counter variable in my main function- passing the correct Roman Numeral letters without having to manually do so each time.
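
The final pass ended up with roughly this shape (a sketch from memory, not the exact Gist code):

//each index holds the 'one', 'five', and 'ten' symbols for that decimal place
const numerals = [
    ['I', 'V', 'X'],   //ones
    ['X', 'L', 'C'],   //tens
    ['C', 'D', 'M'],   //hundreds
    ['M', '', '']      //thousands
];

//the helper: build the numeral for a single digit from its three symbols
function digitToRoman(digit, [one, five, ten]) {
    if (digit <= 3) return one.repeat(digit);
    if (digit === 4) return one + five;
    if (digit <= 8) return five + one.repeat(digit - 5);
    return one + ten; //digit is 9
}

function convertToRoman(num) {
    //reverse the digits so the loop counter lines up with the numerals index
    const digits = String(num).split('').reverse();
    let result = '';
    for (let i = 0; i < digits.length; i++) {
        result = digitToRoman(Number(digits[i]), numerals[i]) + result;
    }
    return result;
}

convertToRoman(1984); //'MCMLXXXIV'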

The code is available in some Github Gists- first, second, and final passes.  Apologies if there are errors- never used the Gist feature before, but it’s pretty cool!

I bet there are many ways to improve it- the logic in the helper seems unwieldy, but the best thing I learned from this exercise was the general process.  Get something working, see patterns, edit, repeat.

Strippin’ Tags

A while back, I helped create a Django back end for an update of an existing website.  The website’s front end was AngularJS, and we didn’t really get how the two should integrate.  The result?  A bit of a mess.  Django template views inside Angular HTML template files.  We had to modify the bracket style used for Angular (as it conflicts with Django’s).  The folder structure was confusing, the code was confusing- it worked, but it wasn’t a great solution.

Now I understand what we should have done: use Django to create an API for the Angular front end.  Just return that sweet, sweet JSON to the app and have Angular do the templating.  We are in the process of this update (as well as migrating this one from AngularJS to Angular the Next Generation), and I ran into an interesting project.

The original site had a Django-driven blog.  This was not integrated into the main AngularJS app in any way- a user clicked the ‘news’ link and they were taken to a completely new page with a traditional Django blog setup.  It made creation and updating easy- we just used Django’s admin panel for new posts, and the default templating views to handle any sorting (by category, date, etc).

But it doesn’t really fit.  Having the blog as a module within the Angular (now to be v4.2.4) application makes more sense.  With our new API approach it will work, but it will take some extra work.

One aspect is the admin panel.  We won’t be using the built in Django admin panel- instead, we’ve created an Angular-driven admin panel.  Converting the data returned by Django’s ORM to JSON was a bit tricky at first, but it’s flowing smoothly at this point (might cover that in a future post, as it was a bit of a process).

Another aspect is searching through blog posts (on the user interface), and that’s where we come to the cool project of the week.  One of the benefits of doing this extra work is getting the hip “instant updates” feel of a single page app within our blog’s UI.  When someone types in the search bar, they see the blog list below filter immediately.

But I noticed some test posts were coming up in almost all search results.  They happened to be the test posts with images in them.  The reason?  We are using an HTML editor toolbar for the admin area.  When a site admin posts a new blog, they use this toolbar to format text, add links, or post images (not everyone posting will be a developer or have source code access).

The toolbar has a cool feature where it encodes any images uploaded to base-64 format.  According to Wikipedia, “Base64 is a group of similar binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation.”

I don’t really know what that means, but I do know it means I don’t have to set up an ‘images’ folder for my blog module and save a path to those images with a relation to the blog post in my database.  When it converts to base-64, it gives me a string of text that a modern browser can translate into an image.  That image src can be saved right along with the rest of the HTML body.

Very helpful – but back to the search feature.  It turns out that the search was iterating over everything stored in the blog’s body- including the html code.  I would have had to fix this anyway, but it became really apparent with the image issue.  Because the string of text representing an image was so long, it naturally contained matches to most strings I was searching (at least the simple ones people would start with).  It might not have mattered on a traditional request-response site- a person would type their whole search string then submit and avoid seeing the wrong results.  But in a live reload search, the problem was obvious.

My first thought was a terrible one: Why not store a plain text version of the blog body in the database?  That way, we just return it alongside the rest of the JSON data and use that for the search.  But that really doesn’t make sense- it’s just bloating our database tables with duplicate info and increasing the size of the JSON object returned on each request.

Then I remembered how cool Javascript can be- and that it comes with awesome built in array features like ‘map’.  So, why not return the data as we have been (with the body as HTML) and manipulate the body just within the search function?  We can search over that manipulated body, but display the original HTML.

Turns out that this works pretty well.  In our main blog component, we initialize a search array on a service:

this.blogViewService.searchArray = ['author', 'title', 'body_plain_text', 'category'];

This is a convention we’re using on a different project- the array contains the property names we want to search by (those property names appear within each object in the array we will be searching over). That searchArray is passed to our search service with some other info.  In this case, ‘body_plain_text’ doesn’t exist on the object- but we’re going to create it as we go.

For the search, we cheat a bit and use Angular’s built in form input subscription.  A form input field in Angular has an observable you can hook into to get the data as a user types (called valueChanges).  You can then subscribe and do any searching there.  All we need to do is make sure to transform each object a bit in the process:

this.blogViewService.searchSubscription = this.blogViewService.term.valueChanges
    //wait 200ms after the last keystroke before reacting
    .debounceTime(200)
    .subscribe(result => {
        let itemsToSearch = this.blogViewService.originalItems
            .map(item => {
                //append a tag-stripped copy of the body to search over,
                //replacing each tag with a space so words don't run together
                item['body_plain_text'] = String(item['body']).replace(/<[^>]+>/gm, ' ');
                return item;
            });
        //we pass to another service to do the actual filtering of originalItems here
    });

We start by assigning a reference to this subscription (searchSubscription) so we can unsubscribe on the component’s destruction (to avoid memory leaks).  Then we hook into the form input observable (valueChanges).  I put a slight delay on the process at this point with debounceTime(200)- when the news/blog section gets big enough that returning it all up front doesn’t make sense, we will have to hit the database in this search.  debounceTime(timeHereInMs) is a great RxJS built-in that handles debouncing your calls.

Finally, we get to the actual change- the originalItems array is mapped, but none of the existing properties within each object are actually changed.  Instead, we append a ‘body_plain_text’ property to each object, built by running a regex over the body to strip the HTML tags.  Originally, the regex replaced tags with nothing, but then we had words joining together (whenever they sat on either side of a tag), so replacing with a single space preserves the integrity of the search.  We never change the original ‘body’ property- this is where our HTML lives and is what we use for the actual display.

I’m sure there will be edge cases where this might not work and we have to tweak the process, but it’s a good start.  I also don’t think this is technically how you’re supposed to use .map- it’s a functional programming staple, and I’m using it to append a property to an object then return it back to the array.  Definitely not functional programming!