SmartMonsters

Monday, September 15, 2014

Migrating SmartMonsters to The Cloud. Part One: Problems and Legacy State.

No-one owns servers anymore. The advantages of Cloud Computing are so overwhelming that owning hardware simply no longer makes sense. If we were starting SmartMonsters today we'd begin on AWS. But we started in 1999, which puts us in the same situation as most 15-year-old companies in the Internet space: we have a complex physical infrastructure hosted in a data center, and migrating it to AWS will be a nontrivial project.

It's not simply a matter of lift-and-shift. Although we could substitute virtualized servers one-for-one for physical ones, virtualization per se isn't the advantage. Where you really gain is in leveraging the platform services AWS provides on top of virtualization. That leverage shrinks your management footprint, offloading traditional IT tasks of monitoring and maintenance to AWS, while making it much easier to build "elastic" systems which scale automatically and recover from outages on their own. The key is to break legacy monolithic apps into components accessed as services, either self-managed or provided by the platform.

For example, our chatterbots Oscar, Marvin, Help, Death and Eliza all currently run in small EC2 instances, logging in to the TriadCity server the same way a human would. That's fine, but if we break them into components, say by exposing a shared AIML interpreter behind a RESTful interface, then the Web site, the bots in TriadCity, and perhaps a new generation of bots running on mobile devices could all call into a single, elastically scalable AIML service. At the same time we can move their relational databases into AWS' Relational Database Service, offloading management to the vendor. This distributed design would let each component evolve and scale independently, enabling flexibility, resilience, and automatic recovery from outages while considerably decreasing cost.
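To make that concrete, here's a rough sketch of what the bot-side call into such a service might look like. Nothing here exists yet: the service URL, the path, and the parameter names are placeholders, and the real interface will certainly differ once we design it.

    // Rough sketch of a bot calling a shared AIML service over REST.
    // The endpoint, path, and parameter names are hypothetical.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class AimlClient {

        private final String serviceUrl;  // e.g. "http://aiml.smartmonsters.internal/respond"

        public AimlClient(String serviceUrl) {
            this.serviceUrl = serviceUrl;
        }

        /** Send one line of user input; return the bot's reply. */
        public String respond(String botId, String userId, String input) throws Exception {
            String form = "bot=" + URLEncoder.encode(botId, "UTF-8")
                        + "&user=" + URLEncoder.encode(userId, "UTF-8")
                        + "&input=" + URLEncoder.encode(input, "UTF-8");

            HttpURLConnection conn = (HttpURLConnection) new URL(serviceUrl).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(form.getBytes("UTF-8"));
            out.close();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder reply = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                reply.append(line);
            }
            in.close();
            return reply.toString();
        }
    }

The point is that the same tiny client could be embedded in the Web site, the in-game bots, or a mobile app, while the interpreter behind the URL scales on its own.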

Decomposition raises technical issues without simple solutions. For example, the Web site's Forum section is powered by heavily modified open source code now deeply integrated into (read: tightly coupled with) the site's authentication, authorization and persistence frameworks. This infrastructure aggressively caches to local memory. Without modification, multiple elastic instances would not share state, so what a user sees could change from one click to the next as requests land on different instances. We can leverage AWS' managed ElastiCache service to share the cache between instances, but we'll have to modify existing code to do that.
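As a first approximation the change looks something like the sketch below: instead of reading and writing a per-instance in-memory map, the forum code talks to a memcached cluster managed by ElastiCache. This uses the open-source spymemcached client; the endpoint hostname is a placeholder, and the real work will be threading this through the existing forum code rather than writing a tidy new class.

    // Sketch: a shared cache backed by an ElastiCache memcached cluster,
    // so every web instance sees the same cached state. Endpoint name is
    // a placeholder; cached values must be Serializable.
    import net.spy.memcached.AddrUtil;
    import net.spy.memcached.MemcachedClient;

    public class SharedForumCache {

        private final MemcachedClient client;

        public SharedForumCache(String endpoint) throws java.io.IOException {
            // e.g. "forum-cache.abc123.cfg.use1.cache.amazonaws.com:11211"
            client = new MemcachedClient(AddrUtil.getAddresses(endpoint));
        }

        public void put(String key, Object value, int ttlSeconds) {
            client.set(key, ttlSeconds, value);   // visible to every instance
        }

        public Object get(String key) {
            return client.get(key);               // same view from any instance
        }
    }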

This is a common problem. Legacy architectures designed to be "integrated" back when that was a marketing bullet rather than a swear word are really the norm. In my consulting practice I try to educate clients not to build new monoliths running on self-managed EC2. Yah it'll be "The Cloud". But it'll be the same problems without the advantages.

While migrating we're going to re-architect, breaking things down wherever possible into loosely coupled components. We'll move as many of those components as we can to managed services provided by AWS, for example SCM, email, databases, caches, message queues, directory services, and file storage. At the same time we'll leverage AWS best practices for security, elasticity, resilience and performance. And cost. Over the coming weeks I'll write several posts here outlining how it goes.


Initial State

  • SCM is by CVS - remember that? - on a VPS in the colo. This SCM manages not only source code but also company docs. Deployment of new or edited JSPs to the production web server is automated by CVS update via cron.
  • Email is self-administered Postfix, on a VPS in the colo.
  • MongoDB instances are spread among VPS and physical hardware in the colo. There’s a primary and two secondaries.
  • PostgreSQL is on a VPS in the colo.
  • A single Tomcat instance serves the entire Web site, which is 100% dynamic — every page is a JSP. This instance experiences a nagging memory leak, probably related to the forum code, which no-one has bothered to track down yet; we simply reboot the site every night.
  • The TriadCity game server is physical hardware, with 16 cores and 32GB of RAM, racked in the colo.
  • The Help, Oscar, Marvin, Death and Eliza chatterbots run on small EC2 IaaS instances on AWS.
  • Jira and Greenhopper are on-demand, hosted by Atlassian.
  • Jenkins build servers are on VPS in the colo.
  • DNS is administered by the colo provider.
  • Firewall rules, NAT, and port forwarding are also administered by the provider.

Anticipated End State

  • SCM by Git, hosted by a commercial provider such as GitHub or CodeCommit.
  • Company docs moved off of SCM, onto S3 or Google Drive.
  • Corporate email by Google, maintaining the smartmonsters.com domain and addresses. Website email notices and newsletter blasts via SES.
  • MongoDB instances hosted in EC2, geographically dispersed. There’ll continue to be a primary and two secondaries. Backups will be via daily snapshots of one or another secondary. It may be optimal to occasionally snapshot the primary as well: Amazon recommends shorter-lived instances whenever that’s feasible.
  • PostgreSQL via RDS.
  • Caches migrated to a shared ElastiCache cluster.
  • The remainder of the dynamic Web site served by Elastic Beanstalk; sessions shared via DynamoDB. At least two instances, geographically dispersed.
  • Static web content such as images hosted on S3 and served by CloudFront.
  • Long-term backups such as legacy SQL dumps stored on Glacier.
  • The chatterbots broken into two layers: an AIML interpreter accessed via REST, and the Java-based bot code, which in other respects is very similar to any other TriadCity client. The AIML interpreter will probably be served by Elastic Beanstalk; the bots can continue to run from EC2 instances. The interpreter will have to be adapted to save state in DynamoDB. The AIML files can live in S3, but the interpreter will also have to be able to read them from the local file system, in case we choose to deploy it in a different way (see the sketch after this list). We may or may not make those instances elastic.
  • Jira and Greenhopper remain at Atlassian.
  • Jenkins builds will run on EC2.
  • A micro EC2 instance will host an Artifactory repo to be used by Continuous Deployment.
  • DNS will move to Route 53; the advantage to us is self-administration.
  • The TriadCity game server will remain on physical hardware inside the colo. It's not yet cost-effective for a company as small as ours to virtualize servers of that scale.
  • Firewall rules, NAT and port forwarding for the TriadCity server remain with the colo provider. Security groups, key management and so on for the AWS infrastructure will be self-administered via IAM and related services.
  • Monitoring of the AWS resources will be via CloudWatch.
  • The AWS resources will probably be defined by CloudFormation, but we don't know enough about it yet.
  • Continuous Deployment will have to be re-thought. Today a shell script on the Web server runs CVS update periodically, a super-simple CD solution which has worked without intervention for years. The virtualized infrastructure may allow more sophistication without much additional cost. For example, it may be sensible to automate deployments from Jenkins which not only push code but create new Beanstalk instances while terminating the old ones, a strategy which keeps the instances always young per AWS best practices. Beanstalk can keep these deployments invisible to end users, with no app downtime. More exploration is in order; a first sketch of what a Jenkins-driven deployment step might look like follows this list.
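Here's the sketch promised above for the AIML interpreter's file loading: read from S3 when deployed on AWS, fall back to the local file system otherwise. It uses the S3 client from the AWS SDK for Java; the bucket, key prefix, and the AIML_SOURCE switch are made up for illustration.

    // Illustrative loader: AIML files from S3 when running on AWS,
    // from local disk otherwise. Bucket, prefix and the AIML_SOURCE
    // environment switch are placeholders.
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;
    import java.io.FileInputStream;
    import java.io.InputStream;

    public class AimlLoader {

        public InputStream open(String fileName) throws Exception {
            if ("s3".equals(System.getenv("AIML_SOURCE"))) {
                // On Beanstalk/EC2: credentials come from the instance role.
                AmazonS3Client s3 = new AmazonS3Client();
                S3Object obj = s3.getObject("smartmonsters-aiml", "bots/" + fileName);
                return obj.getObjectContent();
            }
            // Anywhere else: plain old local file.
            return new FileInputStream("/opt/aiml/" + fileName);
        }
    }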
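And here's one possible shape for the Jenkins-driven deployment step mentioned in the last bullet, using the Elastic Beanstalk API from the AWS SDK for Java. It assumes the freshly built WAR has already been uploaded to S3; the application, environment, bucket and key names are placeholders, and we haven't settled on this approach. A true blue/green variant would create a whole new environment and swap CNAMEs rather than updating in place.

    // Sketch of a post-build deployment step a Jenkins job might run:
    // register the new WAR as an application version, then point the
    // environment at it. All names below are placeholders.
    import com.amazonaws.services.elasticbeanstalk.AWSElasticBeanstalkClient;
    import com.amazonaws.services.elasticbeanstalk.model.CreateApplicationVersionRequest;
    import com.amazonaws.services.elasticbeanstalk.model.S3Location;
    import com.amazonaws.services.elasticbeanstalk.model.UpdateEnvironmentRequest;

    public class BeanstalkDeploy {

        public static void main(String[] args) {
            String versionLabel = "build-" + System.currentTimeMillis();
            AWSElasticBeanstalkClient eb = new AWSElasticBeanstalkClient();

            // Register the new build as an application version.
            eb.createApplicationVersion(new CreateApplicationVersionRequest()
                    .withApplicationName("smartmonsters-web")
                    .withVersionLabel(versionLabel)
                    .withSourceBundle(new S3Location("smartmonsters-builds", "web/site.war")));

            // Roll the environment forward; Beanstalk replaces instances
            // behind the scenes with no visible downtime.
            eb.updateEnvironment(new UpdateEnvironmentRequest()
                    .withEnvironmentName("smartmonsters-web-prod")
                    .withVersionLabel(versionLabel));
        }
    }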

You'll note our planned dependence on AWS. Are we concerned about "vendor lock-in"?

No. "Vendor lock-in" is a legacy IT concept implied by long acquisition cycles and proprietary languages. While it's true we'll be tying code and deployments to AWS APIs, migrating to Azure or another cloud provider would not be a daunting project. For example, AWS exposes cache APIs, so does Azure. Migration would imply only minor changes to API calls. Our apps rely on the Strategy design pattern: we'd only have to write new Strategy implementations which call the changed APIs. Trivial concern compared to the advantages.


Conclusion

This is a journey a great many companies will be taking in the coming decade. The advantages of resilience, elasticity, and cost are overwhelming. I design and manage these migrations for a living; please contact me if I can be of help.

Jacob Lawrence, The Great Migration Panel 18
