SmartMonsters

Friday, April 8, 2016

Formatting Database Reports Using Serverless Computing

Serverless computing is the obvious endpoint of contemporary technology evolution.  From onsite hardware to self-managed datacenters to virtualization to public cloud, the trend resolves in a very attractive way: your business logic executes “somewhere”, and you don't need to worry where.  An automated platform manages scalability, availability, and security; you're billed only for the cycles you use, not for the idle headroom you once needed to provision.  On AWS the future is Lambda: we’ve chosen database reporting for our initial experiment.

In the old days we’d run a suite of weekly cron jobs on the database server, firing report scripts and piping their output to mailx.  Business analysts, product owners, and technology managers would receive human-readable columnar reports in their inboxes, which they'd study and archive however they liked.  Pretty straightforward.  It's possible to duplicate the same concept in the cloud, and this is how we initially started: install mailx on one of the MongoDB secondaries, have cron run the weekly jobs, and mail the not-particularly-human-readable JSON output to the usual suspects, the thought being we'd get that JSON human-formatted real soon now.

This strategy takes poor advantage of the AWS platform.  A more sophisticated approach is to save report outputs to a weekly bucket on S3, then mail a link to that bucket to the interested parties.  This centralizes the report archive in a pre-organized repository, and gets the weekly email down from a mini-storm to a single message with links.  Easy change, then: instead of piping report output to mailx, pipe it to a super-simple shell script which reads it from stdin, figures out if a new S3 bucket is needed, creates one if yes, and writes a JSON file to the bucket.  Here’s our version, for your Googling pleasure:

#!/bin/bash

# Writes input from standard in to a date-named prefix in an S3 bucket.
# The report file name is passed as param $1 to the script.
# 2016.04.04 MP.

DATE=$(date +%Y-%m-%d)
S3_BUCKET=<base_directory>
S3_PREFIX="$S3_BUCKET/$DATE"

# Create the bucket if it doesn't exist:
if aws s3 ls "s3://$S3_BUCKET" 2>&1 | grep -q 'NoSuchBucket'
then
    aws s3 mb "s3://$S3_BUCKET"
fi

# Write stdin to a new file under the date prefix:
aws s3 cp - "s3://$S3_PREFIX/$1.$DATE.json" --content-type text/plain \
    --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers
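In the crontab, each report's query output is simply piped into this script, along the lines of: mongo weekly_signups.js | report_to_s3.sh weekly_signups (the script and report names here are illustrative, not our actual ones).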

Now we simply add a concluding cron job to send an email with the links.  Easy-peasy.  No need for mailx anymore: let's use AWS SES instead, which offers enhanced security — no more mailx credentials inside a pseudo-user's home directory.  Instead, SES access is managed by IAM.  Nice.
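For concreteness, here's a minimal sketch of what that concluding notification step could look like using SES through boto3.  The addresses, region, and report URL are placeholders, and the Source address would have to be verified in SES; the point is that IAM supplies the credentials, so nothing secret lives on the box.

#!/usr/bin/env python
# Minimal sketch of the weekly notification step using SES via boto3.
# Addresses and the report URL are placeholders; IAM supplies credentials.
import boto3

ses = boto3.client('ses', region_name='us-east-1')

report_url = 'https://s3.amazonaws.com/<base_directory>/2016-04-08/'  # illustrative

ses.send_email(
    Source='reports@example.com',            # must be a verified SES sender
    Destination={'ToAddresses': ['analysts@example.com']},
    Message={
        'Subject': {'Data': 'Weekly reports are ready'},
        'Body': {'Text': {'Data': "This week's reports: " + report_url}}
    }
)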

This outcome already leverages serverlessness.  S3 is an abstracted service, where users pay no attention to concepts of physical location, disk capacity, backup, redundancy, scale, or Web server infrastructure.  SES and IAM are also serverless.  They're "somewhere" but we don't care where.  That's a start.

But here's the thing.  As noted, the reports are flippin' ugly.  MongoDB’s JSON output is serial rather than columnar, and there's no built-in utility analogous to SQL*Plus which can easily convert JSON blobs to spiffy tables.  So let's fix this.  We'll write our own down-and-dirty little formatting tool converting JSON arrays to tabular HTML.  Where should that tool run?  Lambda's perfect.  With Lambda we don't maintain or pay for EC2 instances with inevitable idle time; we simply let Lambda fire up our converter as the weekly cron jobs require.  It's an event-driven compute pipeline: cron fires the reports and pipes them to a script which writes them to S3; S3 notifies Lambda that new JSON files have been written; Lambda converts them to HTML; and SES mails a notice.
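To make the converter's job concrete before the implementation story, here's a rough sketch as a Python Lambda handler (Python being where the addendum below ends up anyway).  The bucket layout and field names are illustrative, and the real first version, described next, was written in Node.js.

# Sketch of the JSON-to-HTML converter as a Python Lambda handler.
# Assumes each report file is a JSON array of flat objects; names are illustrative.
import json
import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # The S3 notification tells us which report was just written.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    rows = json.loads(body)                            # e.g. [{"user": "a", "count": 3}, ...]
    columns = sorted(rows[0].keys()) if rows else []   # alphabetical, lacking better metadata

    html = ['<table border="1">',
            '<tr>' + ''.join('<th>%s</th>' % c for c in columns) + '</tr>']
    for row in rows:
        html.append('<tr>' + ''.join('<td>%s</td>' % row.get(c, '') for c in columns) + '</tr>')
    html.append('</table>')

    # Write the formatted report next to the JSON original.
    s3.put_object(Bucket=bucket,
                  Key=key.replace('.json', '.html'),
                  Body='\n'.join(html),
                  ContentType='text/html')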

I chose to write the converter as a Node.js script.  My inexperienced guess was that this would be less cumbersome than a Java program, avoiding the hassles of frequently uploading new iterations during development and generally, as the lore insists, being faster to work with.  I was wrong in every assumption.

First, non-trivial Node.js programs can't be entered directly into the Lambda inline editor because necessary Node.js packages aren't available there.  In our case, the "async" package is required to manage the asynchronous programming model of AWS API calls to S3.  So, you know: le sigh.  I ended up spending about 12 hours on the converter; some of that time was my Node.js learning curve, but far too much of it was the save-zip-upload-test-logscan cycle which even Node.js non-noobs are likely to experience.

Second, the lack of compiler support made the cycle even more onerous: every little typo or missing semicolon made it all the way through the loop, bubbling up in the log rather than under my fingers while I typed the code.  While I realize that Node.js has considerable real-world momentum, I can't imagine using it for nontrivial production applications.  I strongly suspect that any large-scale production code base could be written in Java in one-fourth the time, and would be easier and less expensive to maintain over its life cycle.  Experience will tell, of course.  But you heard it here first.

Additionally, there's a little bit of lag, often 15-20 seconds, between running a test iteration and viewing Lambda's log output in CloudWatch.  For retrospectively scanning production logs, this will seldom be a problem, and in fact CloudWatch logs themselves can be processed by Lambda functions, a pretty great thing.  During development it's a hindrance.

Summary: the experience was more of a chore than it should have been.

Ultimately, the manual save-zip-upload-test-logscan process can’t scale to production.  The AWS CLI and SDKs include commands for pushing function code to Lambda.  Developers who prefer the command line can homebrew their own Continuous Integration / Continuous Delivery scripts; larger shops will want to automate their CI/CD pipelines with Jenkins or equivalent.  This'll be prominent on our enhancements list.
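Even short of a full pipeline, the upload step itself can be scripted.  A sketch using boto3's Lambda API, with a made-up function name and zip path:

# Sketch: push a freshly zipped function to Lambda instead of using the console.
# Function name and zip path are illustrative.
import boto3

lam = boto3.client('lambda')

with open('json_to_html.zip', 'rb') as f:
    lam.update_function_code(FunctionName='json-to-html-converter',
                             ZipFile=f.read())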

Future TODOs: add a step generating a single HTML index file with links to the reports; mail a link to that index file, instead of two dozen links to the individual reports.  Spiffy up the report format, which right now is down-and-dirty unstyled HTML circa 1994.  Get the Lambda functions into Git and add the CI/CD pipeline to Jenkins.  Consolidate multiple report formatters to a single converter, presumably relying on metadata such as column names added to the scripts generating JSON report output.  You could reasonably sum these up with the single word "professionalize".  My POC is pretty amateur.
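As a sketch of that first TODO, the index could be built by listing the week's prefix and writing one more HTML file.  The bucket name and prefix here are illustrative:

# Sketch: build a single index.html of links to the week's reports.
# Bucket name and prefix are illustrative.
import boto3

s3 = boto3.client('s3')
bucket, prefix = 'sm-reports', '2016-04-08/'

listing = s3.list_objects(Bucket=bucket, Prefix=prefix)
links = ['<li><a href="https://s3.amazonaws.com/%s/%s">%s</a></li>'
         % (bucket, obj['Key'], obj['Key'])
         for obj in listing.get('Contents', [])
         if obj['Key'].endswith('.html')]

s3.put_object(Bucket=bucket, Key=prefix + 'index.html',
              Body='<ul>\n' + '\n'.join(links) + '\n</ul>',
              ContentType='text/html')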

But it works, and that's impressive.  No production infrastructure to maintain, just pure code execution which automatically scales, is inherently secure, and is effectively always available.  For the few cents a month which runtime will cost, this is really seriously something.  Lambda can be used in the event-driven architecture described here, or fronted by AWS's API Gateway service, which provides RESTful front ends to Lambda functions.  Perhaps the most serious downside to the Lambda ecology today is the lack of persistent database connections: every db request requires connection/authentication/authorization, which is likely too much lag for many real-world production scenarios.  You have to suspect that'll be fixed soon.

Very psyched to enter the new world.

-----------------------

Addendum 2016.04.21:

In retrospect, it would have been simpler to design a one-step solution from the outset.  For example, a Python script to query the db, transform JSON to HTML, and write the final result to S3.  If I were starting from zip, that would be attractive.  But I started with MongoDB "native" JavaScript queries, so, with those already in place, it was simpler to leverage S3's event-driven integration with Lambda for the HTML transformations.

Since the original post I've consolidated what began as individual functions per report into a genericized formatter accepting directives passed as input; it's now the responsibility of the JavaScript query scripts to pass those configuration variables to the HTML processor.  This makes it easy to add new reports — no need to update the HTML converter.  I've also found a package called "node-lambda" (https://www.npmjs.com/package/node-lambda) which promises to eliminate or at least simplify the inefficient code-save-zip-upload-test-logscan loop I naively began with.  Report to follow if it works out.
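For the curious, the directive idea amounts to a small envelope the query scripts wrap around each report's rows, something along these lines (the field names are illustrative rather than our exact ones):

# Sketch of the "directives" envelope a query script might now emit.
# Field names are illustrative; the point is that the column order, title, etc.
# travel with the data, so the generic formatter never needs per-report edits.
report = {
    'title':   'Weekly signups',
    'columns': ['date', 'source', 'count'],   # column order for the HTML table
    'rows':    [{'date': '2016-04-18', 'source': 'organic',  'count': 42},
                {'date': '2016-04-19', 'source': 'referral', 'count': 17}]
}
# The formatter renders report['rows'] in report['columns'] order under
# report['title'], instead of hard-coding anything about the individual report.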

But lastly, I'm pretty sold on abandoning Node.js for Python.  Node's asynchronous model is just frankly cumbersome.  The resulting code is ugly, hard to scan, error-prone, and slow to work with.  Python has its own quirks, particularly pymongo's departures from canonical MongoDB shell syntax — "find_one" instead of "findOne", and different quoting rules.  But it's clean, and straightforward to work with.  Watch this space.
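A tiny example of the gap, with made-up collection and field names:

# Mongo shell:   db.users.findOne({created: {$gte: cutoff}})
# pymongo wants snake_case method names and quoted keys/operators:
from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient()['reports']                      # database name is illustrative
cutoff = datetime.utcnow() - timedelta(days=7)
doc = db.users.find_one({'created': {'$gte': cutoff}})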
