Monday, September 29, 2014

Migrating SmartMonsters to The Cloud. Part Three: Build Automation and Continuous Deployment.

[Intro 2018.05.11. This was written in 2014, when the concept of "Continuous Deployment" was new-ish, not universally adopted, and lacking mature tools. AWS now provides managed services which do much of what we hand-rolled four years ago. I'll describe our migration to their services in future posts.]

For years we've pushed Web content from desktop to production without human intervention: a very early example of what's now known as "Continuous Deployment".  A cron job on the Web server updated from SCM every fifteen minutes.  That was it: if you checked in page changes, they were gonna go live, baby!  It was up to you to test them first.  Code, as opposed to content, was pushed the old-school way: 100% manually.
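That cron-driven pull looked something like this (a reconstructed sketch; the path and the use of Subversion are illustrative):

```shell
# Illustrative crontab entry: every fifteen minutes, pull the latest
# content from SCM straight into the Web server's document root.
*/15 * * * * cd /var/www/html && svn update --quiet
```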

The Cloud-based process we’ve recently implemented fully automates CI/CD of every component where that's a reasonable goal.  The TriadCity game server is the major exception — we wouldn't want to bump players off every time we commit.  All other components, from Web sites to robots to AI resources, are now built, tested, and deployed with full automation, without downtime, every time one of our developers commits code.

Here’s the flow:

  • Developers are responsible for comprehensive tests.  Test suites must pass locally before developers commit to SCM.  This remains a manual step.
  • Commit to SCM triggers a Jenkins CI build in The Cloud.  Jenkins pulls, builds, and tests.  Build or test failures halt the process and trigger nastygrams.
  • If the tests pass, Web sites, robots, AIML interpreters, and other components are pushed by Jenkins directly to production.  Most of these deploy to Elastic Beanstalk environments.  Jenkins' Elastic Beanstalk Plugin writes new artifacts to S3, then calls EB's REST API to trigger the swap.  EB handles the roll-over invisibly to end-users.  There was no real work in this: I simply configured the plugin to deploy to EB as a post-build step.
  • For some components, acceptance tests follow deployment.  That’s backward; we'll fix it.  For example, the AIML files enabling many of our chatterbots are validated in two steps: first for syntactic correctness during the build phase, second for semantic correctness after deployment, once it's possible to ask the now-live bots for known responses.  We do this on live bots 'cos it's simple: a separate acceptance-test environment will make it correct.  There'll be a pipeline with an intermediate step: on build success, Jenkins promotes into the acceptance-test environment, where it runs the final semantic checks.  Only on success there will Jenkins finally promote to production.  Failure anywhere stops the pipeline and broadcasts nastygrams.  We'll also run a handful of final post-deploy smoke tests.
  • TriadCity game server updates are semi-automated, with a final manual step.  As noted, we don't want to bump users at unannounced moments.  We’d rather trigger reboots manually.  Everything short of the reboot, however, is automated.  Jenkins remote-triggers a shell script on the TC server which pulls the latest build from Artifactory in The Cloud.  That build sits on the TC server, where it's up to a human to trigger the bounce.  Because we pull-build-test-and-deliver following every code push, there may be many builds delivered without ever being launched.  TC's administrators launch when they choose to.  The good news: the latest build is already present on the server, and is guaranteed to have passed its tests.
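The Elastic Beanstalk hand-off in the third step above amounts to three API calls. Here's a hedged sketch using boto3 (the Jenkins plugin does the equivalent for us; application, environment, and bucket names are illustrative):

```python
def version_label(app_name, build_number):
    """Compose the version label and S3 key for a build artifact.
    Pure string logic, so it's testable without AWS credentials."""
    label = "{}-build-{}".format(app_name, build_number)
    key = "{}/{}.zip".format(app_name, label)
    return label, key

def deploy(app_name, env_name, bucket, build_number, artifact_path):
    """Upload the artifact, register it as an EB application version,
    then swap the environment onto it. EB rolls over invisibly."""
    import boto3  # deferred so the pure helper above runs offline
    label, key = version_label(app_name, build_number)
    boto3.client("s3").upload_file(artifact_path, bucket, key)
    eb = boto3.client("elasticbeanstalk")
    eb.create_application_version(
        ApplicationName=app_name,
        VersionLabel=label,
        SourceBundle={"S3Bucket": bucket, "S3Key": key},
    )
    eb.update_environment(EnvironmentName=env_name, VersionLabel=label)
```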
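The two-phase AIML check in the fourth step can be sketched like this. A minimal sketch: the syntactic pass is just XML well-formedness plus a root-tag check, and the semantic pass asks a bot known questions; the function names and the `ask_bot` hook are illustrative, not our actual code:

```python
import xml.etree.ElementTree as ET

def aiml_is_syntactically_valid(aiml_text):
    """Build-phase check: the file must be well-formed XML
    with an <aiml> root element."""
    try:
        root = ET.fromstring(aiml_text)
    except ET.ParseError:
        return False
    return root.tag == "aiml"

def aiml_is_semantically_valid(ask_bot, probes):
    """Post-deploy check: ask the live bot each known question and
    compare answers. `ask_bot` is whatever transport reaches the
    bot (HTTP in our case)."""
    return all(ask_bot(q) == expected for q, expected in probes.items())
```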

One-button rollbacks are possible on EB instances.  Simply re-deploy an older build via EB's Web console.  Obviously, safe rollbacks are possible only when there have been no backward-incompatible schema changes in the databases.  Since most of our DBs are schemaless, this is not usually difficult.  We can similarly roll back the TriadCity game server with a simple script.
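A rollback amounts to pointing the environment at an earlier version label. A hedged sketch (again boto3; in practice a click in EB's Web console does the same thing):

```python
def pick_rollback_target(version_labels, current_label):
    """Given version labels ordered oldest-to-newest, return the
    label immediately preceding the currently deployed one."""
    idx = version_labels.index(current_label)
    if idx == 0:
        raise ValueError("nothing older to roll back to")
    return version_labels[idx - 1]

def rollback(env_name, version_labels, current_label):
    """Re-deploy the previous application version; EB handles
    the swap invisibly to end-users."""
    import boto3  # deferred so the pure helper above runs offline
    target = pick_rollback_target(version_labels, current_label)
    boto3.client("elasticbeanstalk").update_environment(
        EnvironmentName=env_name, VersionLabel=target,
    )
    return target
```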

SmartMonsters releases continuously, often many times a day.  We can go further. With 100% virtualization it's possible to bundle the environment with the release.  This vision of infrastructure as code is more radical than the contemporary DevOps movement.  I believe it's the future of release automation.

[Afterword 2018.05.11. We're now using CodeBuild and CodePipeline for nearly all components. CodeDeploy coming real soon now. Future posts will elaborate.]

Jacob Lawrence, The Great Migration Panel 45
