SmartMonsters

Monday, September 29, 2014

Migrating SmartMonsters to The Cloud. Part Three: Build Automation and Continuous Deployment.

[Intro 2018.05.11. This was written in 2014, when the concept of "Continuous Deployment" was new-ish, not universally adopted, and lacking mature tools. AWS now provides managed services which do much of what we hand-rolled four years ago. I'll describe our migration to their services in future posts.]

For years we've pushed Web content from desktop to production without human intervention: a very early example of what's now known as "Continuous Deployment".  A cron job on the Web server updated from SCM every fifteen minutes.  That was it: if you checked in page changes, they were gonna go live, baby!  It was up to you to test them first.  Code, as opposed to content, was pushed the old-school way: 100% manually.

The Cloud-based process we’ve recently implemented fully automates CI/CD of every component where that's a reasonable goal.  The TriadCity game server is the major exception: we wouldn't want to bump players off every time we commit.  All other components, from Web sites to robots to AI resources, are now built, tested, and deployed with full automation, without downtime, every time one of our developers commits code.

Here’s the flow:

  • Developers are responsible for comprehensive tests.  Test suites must pass locally before developers commit to SCM.  This remains a manual step.
  • Commit to SCM triggers a Jenkins CI build in The Cloud.  Jenkins pulls, builds, and tests.  Build or test failures halt the process and trigger nastygrams.
  • If the tests pass, Web sites, robots, AIML interpreters and other components are pushed by Jenkins directly to production.  Most of these deploy to Elastic Beanstalk environments.  Jenkins' Elastic Beanstalk Plugin writes new artifacts to S3, then calls EB's REST API to trigger the swap.  EB handles the roll-over invisibly to end-users.  There was no real work in this: I simply configured the plugin to deploy to EB as a post-build step.  (A sketch of what the plugin does under the hood follows this list.)
  • For some components, acceptance tests follow deployment.  That's backward; we'll fix it.  For example, the AIML files enabling many of our chatterbots are validated in two steps: first for syntactic correctness during the build phase, second for semantic correctness after deployment, once it's possible to ask the now-live bots for known responses.  We do this on live bots 'cos it's simple; a separate acceptance test environment will make it correct.  There'll be a pipeline with an intermediate step: on build success, Jenkins promotes into the acceptance test environment, where it runs the final semantic checks.  Only on success there will Jenkins finally promote to production.  Failure anywhere stops the pipeline and broadcasts nastygrams.  We'll also run a handful of final post-deploy smoke tests.  (A sketch of one of these semantic checks also follows this list.)
  • TriadCity game server updates are semi-automated, with a final manual step.  As noted, we don't want to bump users at unannounced moments; we’d rather trigger reboots manually.  Build and deploy short of reboot, however, is automated.  Jenkins remote-triggers a shell script on the TC server which pulls the latest build from Artifactory in The Cloud.  That build sits on the TC server, where it's up to a human to trigger the bounce.  Because we pull, build, test and deliver following every code push, there may be many builds delivered without ever being launched.  TC's administrators will launch when they choose to.  The good news is, the latest build is already present on the server and is guaranteed to have passed its tests.
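
For the curious, here's roughly what the Elastic Beanstalk plugin is doing on our behalf, sketched with the AWS SDK for Java (builder-style clients from a recent 1.x SDK). The bucket, application, environment, and version names are placeholders, not our real configuration:

    import java.io.File;

    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.elasticbeanstalk.AWSElasticBeanstalk;
    import com.amazonaws.services.elasticbeanstalk.AWSElasticBeanstalkClientBuilder;
    import com.amazonaws.services.elasticbeanstalk.model.CreateApplicationVersionRequest;
    import com.amazonaws.services.elasticbeanstalk.model.S3Location;
    import com.amazonaws.services.elasticbeanstalk.model.UpdateEnvironmentRequest;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class DeployToBeanstalk {
        public static void main(String[] args) {
            String bucket = "smartmonsters-builds";          // placeholder names throughout
            String key = "web-site/web-site-1.2.3.war";
            String versionLabel = "build-1.2.3";

            // 1. Upload the build artifact to S3.
            AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
            s3.putObject(bucket, key, new File("target/web-site-1.2.3.war"));

            // 2. Register the artifact as a new application version.
            AWSElasticBeanstalk eb = AWSElasticBeanstalkClientBuilder.standard()
                    .withRegion(Regions.US_EAST_1).build();
            eb.createApplicationVersion(new CreateApplicationVersionRequest()
                    .withApplicationName("smartmonsters-web")
                    .withVersionLabel(versionLabel)
                    .withSourceBundle(new S3Location(bucket, key)));

            // 3. Point the running environment at the new version; EB swaps it in without downtime.
            eb.updateEnvironment(new UpdateEnvironmentRequest()
                    .withEnvironmentName("smartmonsters-web-prod")
                    .withVersionLabel(versionLabel));
        }
    }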
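
And here's the flavor of the semantic checks we run against a live bot: ask a question whose answer we already know, and assert on the reply. A minimal JUnit sketch; the endpoint URL and parameter names are assumptions, not our real service:

    import static org.junit.Assert.assertTrue;

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.Scanner;

    import org.junit.Test;

    public class OscarSemanticIT {

        private static final String BASE = "https://aiml.example.com/interpret";   // hypothetical endpoint

        private String ask(String bot, String input) throws Exception {
            String query = "?bot=" + URLEncoder.encode(bot, "UTF-8")
                         + "&input=" + URLEncoder.encode(input, "UTF-8");
            HttpURLConnection conn = (HttpURLConnection) new URL(BASE + query).openConnection();
            try (InputStream in = conn.getInputStream();
                 Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                return scanner.hasNext() ? scanner.next() : "";
            }
        }

        @Test
        public void oscarKnowsHisOwnName() throws Exception {
            String reply = ask("oscar", "WHAT IS YOUR NAME");
            assertTrue("expected the bot to introduce himself", reply.contains("Oscar"));
        }
    }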

One-button rollbacks are possible on EB instances.  Simply re-deploy an older build via EB's Web console.  Obviously, safe rollbacks are possible only when there have been no backward-incompatible schema changes in the databases.  Since most of our DBs are schemaless, this is not usually difficult.  We can similarly roll back the TriadCity game server with a simple script.

SmartMonsters releases continuously, often many times a day.  We can go further. With 100% virtualization it's possible to bundle the environment with the release.  This vision of infrastructure as code is more radical than the contemporary DevOps movement.  I believe it's the future of release automation.

[Afterword 2018.05.11. We're now using CodeBuild and CodePipeline for nearly all components. CodeDeploy coming real soon now. Future posts will elaborate.]

Jacob Lawrence, The Great Migration Panel 45

Wednesday, September 24, 2014

Migrating SmartMonsters to The Cloud. Part Two: Breaking Up the Bots.

TriadCity’s robots Death, Help, Marvin and Oscar are AIML-based chatterbots architected monolithically. Each bot is its own application: TriadCity client code, AIML interpreter, AIML files, plus each bot's layer of individual behaviors. They're built and deployed independently of each other. The plus side is they're free to evolve and scale independently; the downside is there's a lot of duplication.

While migrating to AWS we want to break them down into sharable components. We especially want a shared AIML interpreter accessible from multiple protocols and device types. Advantages:

  • Enables a new generation of bots accessed independently of TriadCity. For example, we can turn Oscar into his own application, accessing a single scalable AIML brain from mobile devices, Web sites, Twitter feeds, and the existing TriadCity bot without having to maintain separate AIML repositories for each client.
  • Enables AIML updates independent of the Java code for these multiple platforms, doable in real time without rebooting the bots.
  • The shared AIML interpreter can scale independently, becoming fully elastic in The Cloud.
  • Centralized AIML files can be validated globally within a consolidated staged CI/CD pipeline with appropriate tests at each stage.
  • It's easier to swap out or enhance AIML interpreters without having to rebuild and redistribute an army of dependents.

Here are the migration steps:

  1. Pull the AIML interpreter out of the bot code, into a standalone application accessed as a RESTful service on Elastic Beanstalk.
  2. Redesign the AIML interpreter to serve all of the bots. The REST API will be super-simple: a GET call parameterized with the bot name and the input String, returning a JSON object.  EB provides elasticity.  (A sketch of the endpoint follows this list.)
  3. Remove the AIML interpreter from the Web site, having the appropriate JSPs now RESTfully GET their conversations from the service on EB.
  4. Remove the AIML interpreter from the TriadCity bots, having them similarly call into the service.  I decided while I was at it to consolidate the four existing bots into a single JAR rather than continuing to build each as a separate project.  This is simpler, and there’s very little Java code distinguishing the bots.  A configuration file determines which bot to run; the consolidated code figures out which bot-centric classes to load.
  5. Configure Jenkins to manage Continuous Delivery.  The CI build fires when updates are made to SCM.  The build lifecycle now includes two cycles of AIML validation: a syntactic pass ensures XML correctness before changes are pushed to production; a semantic pass validates AIML rules by comparing expected to actual outputs.  (A sketch of the syntactic pass follows this list.)
  6. Log interpreter inputs which fail to match AIML rules, triggering default responses.  This feedback will let our botmasters know what their users are interested in, and how their bots are performing.  Write logs to DynamoDB, with a simple schema: bot name, <input>, <topic>, <that>.  Schedule a weekly cron to report on these failures.
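
The sketch promised in step 2: roughly what the interpreter's resource looks like as a JAX-RS endpoint. The AimlEngine type here is a stand-in for the shared interpreter; the real class names and JSON shape differ:

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.QueryParam;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.core.Response;

    // Stand-in for the shared AIML interpreter; the real API differs.
    interface AimlEngine {
        String respond(String botName, String input);
    }

    @Path("/interpret")
    public class InterpreterResource {

        private final AimlEngine engine;

        public InterpreterResource(AimlEngine engine) {
            this.engine = engine;
        }

        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public Response interpret(@QueryParam("bot") String bot,
                                  @QueryParam("input") String input) {
            if (bot == null || input == null) {
                return Response.status(Response.Status.BAD_REQUEST).build();
            }
            // Delegate to the shared interpreter and wrap the reply in a small JSON object.
            String reply = engine.respond(bot, input);
            String json = String.format("{\"bot\":\"%s\",\"response\":\"%s\"}",
                                        bot, reply.replace("\"", "\\\""));
            return Response.ok(json).build();
        }
    }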
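
And the syntactic pass from step 5 is essentially just a SAX parse of every AIML file, failing the build on the first malformed document. A minimal sketch, run from the Jenkins build with the file paths as arguments:

    import java.io.File;

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.helpers.DefaultHandler;

    public class AimlSyntaxCheck {
        public static void main(String[] args) throws Exception {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);

            boolean ok = true;
            for (String path : args) {                       // e.g. aiml/*.aiml
                try {
                    SAXParser parser = factory.newSAXParser();
                    parser.parse(new File(path), new DefaultHandler());   // throws on malformed XML
                } catch (Exception e) {
                    System.err.println(path + ": " + e.getMessage());
                    ok = false;
                }
            }
            if (!ok) {
                System.exit(1);                              // non-zero exit fails the Jenkins build
            }
        }
    }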

Migration took two working days. To my surprise, DynamoDB logging required a full third day.  I struggled to get connected: it turns out the published Developer Guide is missing one key point and is incorrect on another.  To connect, you have to specify the AWS region, duh.  And integer keys are not necessary; Strings are fine, and there's a "@DynamoDBAutoGeneratedKey" annotation in the Java SDK which'll auto-gen them for you.
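
For the record, here's roughly what the logging entity looks like with the DynamoDBMapper annotations. Table and attribute names are illustrative, and the builder-style client assumes a recent 1.x SDK; note the explicit region and the auto-generated String key:

    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAutoGeneratedKey;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

    @DynamoDBTable(tableName = "AimlMisses")                 // illustrative table name
    public class AimlMiss {

        private String id;
        private String botName;
        private String input;
        private String topic;
        private String that;

        @DynamoDBHashKey
        @DynamoDBAutoGeneratedKey                            // String key, generated on save
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }

        @DynamoDBAttribute
        public String getBotName() { return botName; }
        public void setBotName(String botName) { this.botName = botName; }

        @DynamoDBAttribute
        public String getInput() { return input; }
        public void setInput(String input) { this.input = input; }

        @DynamoDBAttribute
        public String getTopic() { return topic; }
        public void setTopic(String topic) { this.topic = topic; }

        @DynamoDBAttribute
        public String getThat() { return that; }
        public void setThat(String that) { this.that = that; }

        public static void main(String[] args) {
            // The region must be set explicitly: the point the Developer Guide glosses over.
            AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                    .withRegion(Regions.US_EAST_1).build();
            DynamoDBMapper mapper = new DynamoDBMapper(client);

            AimlMiss miss = new AimlMiss();
            miss.setBotName("oscar");
            miss.setInput("WHO WROTE MOBY DICK");
            miss.setTopic("*");
            miss.setThat("*");
            mapper.save(miss);                               // id is auto-generated here
        }
    }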

One more enhancement seems obvious.  Writing logs directly to the DB is synchronous: our RESTful API server will block a thread during the transaction.  We can avoid the block by posting log messages asynchronously to a queue.  It's half a day to update the interpreter to write to SQS instead of directly to the DB.  An EB "worker tier" consumer pulls from the queue and does the DB write.
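
A sketch of that change, using the asynchronous SQS client so the REST thread never waits on the write. The queue name and JSON message shape are illustrative; the worker-tier app consumes the same queue and does the DynamoDB save:

    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.sqs.AmazonSQSAsync;
    import com.amazonaws.services.sqs.AmazonSQSAsyncClientBuilder;
    import com.amazonaws.services.sqs.model.SendMessageRequest;

    public class MissQueue {

        private final AmazonSQSAsync sqs = AmazonSQSAsyncClientBuilder.standard()
                .withRegion(Regions.US_EAST_1).build();

        // Illustrative queue name; resolved once at startup.
        private final String queueUrl = sqs.getQueueUrl("aiml-miss-log").getQueueUrl();

        /** Fire-and-forget: enqueue the log message and return immediately. */
        public void log(String botName, String input, String topic, String that) {
            String body = String.format(
                    "{\"bot\":\"%s\",\"input\":\"%s\",\"topic\":\"%s\",\"that\":\"%s\"}",
                    botName, input, topic, that);
            sqs.sendMessageAsync(new SendMessageRequest(queueUrl, body));
        }
    }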

The platform is now in place upon which we'll build a new generation of bots more advanced than before. Watch this space.

Jacob Lawrence, One Way Ticket

Monday, September 15, 2014

Migrating SmartMonsters to The Cloud. Part One: Problems and Legacy State.

No-one owns servers anymore. The advantages of Cloud Computing are so overwhelmingly obvious that hardware is simply no longer sensible. If we were starting SmartMonsters today we'd begin on AWS. But, we started in 1999, putting us in a similar situation to the majority of 15-year-old companies in the Internet space: we have a complex physical infrastructure hosted in a data center. Migrating to AWS will be a nontrivial project.

It's not simply a matter of lift-and-shift. Although it's possible to simply substitute virtualized servers for physical ones, virtualization per se isn't the advantage. Where you really gain is in leveraging the platform services which AWS provides on top of virtualization. That leverage shrinks your management footprint, offloading traditional IT tasks of monitoring and maintenance to AWS, while at the same time making it much easier to build "elastic" systems which scale automatically and self-recover from outages. The key is to break legacy monolithic apps into components accessed as services, either self-managed or provided by the platform.

For example, our chatterbots Oscar, Marvin, Help, Death and Eliza all currently run in small EC2 instances, logging in to the TriadCity server in the same way that humans would. That's fine, but if we break them into components — say by exposing a shared AIML interpreter via RESTful interface — then the Web site, the bots in TriadCity, and perhaps a new generation of bots running on mobile devices could all call into a shared and elastically scalable AIML service. At the same time we can move their relational databases into AWS' Relational Database Service, offloading management to the vendor. This distributed design would allow all components to evolve and scale independently of each other. It enables flexibility, scalability, and automatic recovery from outages while at the same time considerably decreasing cost.

Decomposition confronts technical issues without simple solutions. For example, the Web site's Forum section is powered by heavily modified open source code that is now deeply integrated into, and tightly coupled with, the site's authentication, authorization and persistence frameworks. This infrastructure aggressively caches to local memory. Without modification, multiple elastic instances would not sync state, so a user's experience could differ from one click to the next. We can leverage AWS' managed ElastiCache service to share the cache between instances. We'll have to modify existing code to do that.

This is a common problem. Legacy architectures designed to be "integrated" back when that was a marketing bullet rather than a swear word are really the norm. In my consulting practice I try to educate clients not to build new monoliths running on self-managed EC2. Yah it'll be "The Cloud". But it'll be the same problems without the advantages.

While migrating we're going to re-architect, breaking things down wherever possible into loosely coupled components. We'll move as many of those components as we can to managed services provided by AWS, for example SCM, email, databases, caches, message queues, directory services, file storage, and many others. At the same time we'll leverage AWS best practices for security, elasticity, resilience and performance. And cost. In the next weeks I'll write several posts here outlining how it goes.


Initial State

  • SCM is by CVS - remember that? - on a VPS in the colo. This SCM manages not only source code but also company docs. Deployment of new or edited JSPs to the production web server is automated by CVS update via cron.
  • Email is self-administered Postfix, on a VPS in the colo.
  • MongoDB instances are spread among VPS and physical hardware in the colo. There’s a primary and two secondaries.
  • PostgreSQL is on a VPS in the colo.
  • A single Tomcat instance serves the entire Web site, which is 100% dynamic — every page is a JSP. This instance experiences a nagging memory leak, probably related to the forum code, which no-one has bothered to track down yet; we simply reboot the site every night.
  • The TriadCity game server is physical hardware, with 16 cores and 32 GB of RAM, racked in the colo.
  • The Help, Oscar, Marvin, Death and Eliza chatterbots run on small EC2 IaaS instances on AWS.
  • Jira and Greenhopper are on-demand, hosted by Atlassian.
  • Jenkins build servers are on VPS in the colo.
  • DNS is administered by the colo provider.
  • Firewall rules, NAT, and port forwarding are also administered by the provider.

Anticipated End State

  • SCM by Git, hosted by a commercial provider such as GitHub or CodeCommit.
  • Company docs moved off of SCM, onto S3 or Google Drive.
  • Corporate email by Google, maintaining the smartmonsters.com domain and addresses. Website email notices and newsletter blasts via SES.
  • MongoDB instances hosted in EC2, geographically dispersed. There’ll continue to be a primary and two secondaries. Backups will be via daily snapshots of one or another secondary. It may be optimal to occasionally snapshot the primary as well: Amazon recommends shorter-lived instances whenever that’s feasible.
  • PostgreSQL via RDS.
  • Caches migrated to a shared ElastiCache cluster.
  • The remainder of the dynamic Web site served by Elastic Beanstalk; sessions shared via DynamoDB. At least two instances, geographically dispersed.
  • Static web content such as images hosted on S3 and served by CloudFront.
  • Long-term backups such as legacy SQL dumps stored on Glacier.
  • The chatterbots broken into two layers: an AIML interpreter, accessed via REST by the Java-based bot code, which in other respects is very similar to any other TriadCity client. The AIML interpreter will probably be served by Elastic Beanstalk; the bots can continue to run from EC2 instances. The interpreter will have to be adapted to save state in DynamoDB. The AIML files can live in S3, but note that the program will have to be adapted to read files from the local file system as well, in case we choose to deploy it in a different way. We may or may not make those instances elastic.
  • Jira and Greenhopper remain at Atlassian.
  • Jenkins builds will run on EC2.
  • A micro EC2 instance will host an Artifactory repo to be used by Continuous Deployment.
  • DNS will move to Route 53; the advantage to us is self-administration.
  • The TriadCity game server will remain on physical hardware inside the colo. It's not yet cost-effective for a company as small as ours to virtualize servers of that scale.
  • Firewall rules, NAT and port forwarding for the TriadCity server remain with the colo provider. Security groups, key management and so on for the AWS infrastructure will be self-administered via IAM and related services.
  • Monitoring of the AWS resources will be via CloudWatch.
  • The AWS resources will probably be defined by CloudFormation, but we don't know enough about it yet.
  • Continuous Deployment will have to be re-thought. Today, a shell script on the Web server runs CVS update periodically, a super-simple CD solution which has worked with no intervention for years. The virtualized infrastructure may allow more sophistication without much additional cost. For example it may be sensible to automate deployments from Jenkins which not only push code, but create new Beanstalk instances while terminating the old ones, a strategy which will keep the instances always young per AWS best practices. Beanstalk will keep these deployments invisible to end users, with no app downtime. More exploration is in order.

You'll note our planned dependence on AWS. Are we concerned about "vendor lock-in"?

No. "Vendor lock-in" is a legacy IT concept implied by long acquisition cycles and proprietary languages. While it's true we'll be tying code and deployments to AWS APIs, migrating to Azure or another cloud provider would not be a daunting project. For example, AWS exposes cache APIs, so does Azure. Migration would imply only minor changes to API calls. Our apps rely on the Strategy design pattern: we'd only have to write new Strategy implementations which call the changed APIs. Trivial concern compared to the advantages.


Conclusion

This is a journey which very many companies will be taking in the coming decade. The advantages of resilience, elasticity, and cost are overwhelming. Please note that I design and manage these migrations for a living. Please contact me if I can be of help.

Jacob Lawrence, The Great Migration Panel 18