Thursday, October 16, 2014

Migrating SmartMonsters to the Cloud. Part Four: the Web Site.

SmartMonsters' Web site was one of the first in our experience to be fully dynamic. Every page is a JSP. Written c. 2001, it aimed at something similar to elaborate turnkey systems such as Kana: content personalization based on user profiles.

Although that model lost its appeal for us once advertising became unviable on smaller Web properties, it still has practical uses. With little effort we can reconfigure displayed pages for users who self-identify as visually impaired, and the same profiles make a sophisticated security model easy to enable.

Architecturally, the site suffers from age-related drawbacks. Sessions are old-school server-side JEE sessions, which require replication between instances. Per-instance content caches minimize database roundtrips but guarantee that instances drift out of sync. The forum software caches especially aggressively, making inconsistency certain.

We'll fix these pre-Cloud disadvantages during migration.

With JEE sessions we don't want stickiness via the load balancer. If Elastic Beanstalk scales down the number of instances, we'll lose any nonreplicated JEE sessions that existed only on the evaporating instances. Replication is necessary. Memcached is the tried-and-true choice; I opted instead to have EB manage replication, with session records stored in DynamoDB. Some configuration is required; see here for an example. Once configured, it operates transparently. There's an odd downside, though: the auto-generated DynamoDB session records don't include a TTL attribute, so old session records linger until someone deletes them manually. I was surprised.
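
Because nothing expires those records, some periodic job has to find and delete them. Below is a minimal sketch of just the selection step, assuming each record exposes its session id and a last-access timestamp, and that sessions time out after a fixed inactivity interval. `SessionRecord` and `ExpiredSessionSweeper` are hypothetical names; the actual DynamoDB scan and delete calls are deliberately left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical view of one stored session record: its id and when it was last used.
final class SessionRecord {
    final String sessionId;
    final Instant lastAccessed;

    SessionRecord(String sessionId, Instant lastAccessed) {
        this.sessionId = sessionId;
        this.lastAccessed = lastAccessed;
    }
}

// Hypothetical sweeper: selects records that have been idle longer than the session timeout.
final class ExpiredSessionSweeper {
    private final Duration maxInactive;

    ExpiredSessionSweeper(Duration maxInactive) {
        this.maxInactive = maxInactive;
    }

    // Returns the ids of records whose last access is older than (now - maxInactive).
    List<String> findExpired(List<SessionRecord> records, Instant now) {
        Instant cutoff = now.minus(maxInactive);
        List<String> expired = new ArrayList<>();
        for (SessionRecord r : records) {
            if (r.lastAccessed.isBefore(cutoff)) {
                expired.add(r.sessionId);
            }
        }
        return expired;
    }
}
```

In practice this would run on a schedule against the results of a table scan, deleting the returned ids; setting `maxInactive` at or above the container's session timeout keeps it from removing sessions that are still valid.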

In-memory caches that previously used Ehcache can be migrated to memcached on ElastiCache. Elastic Beanstalk server instances can then share a central ElastiCache cluster. We chose Memcached over Redis because the cached Forum data are frequently Maps or other serialized Java objects, which the main Java Redis client, Jedis, can't store without hand-rolled serialization. Internally our caching API is exposed via the Strategy pattern, so it was straightforward to write a memcached implementation of the strategy interface that talks to ElastiCache.
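
The Strategy arrangement might look roughly like this. `CacheStrategy`, `LocalMapCache`, `RemoteCache` and `RemoteClient` are hypothetical names, not our actual classes; `RemoteClient` stands in for whichever memcached client library talks to ElastiCache, and its method shapes loosely follow spymemcached's `get(key)` and `set(key, exp, value)`. Callers only ever see the interface, so swapping implementations is a wiring change.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The strategy interface that application code programs against.
interface CacheStrategy {
    Object get(String key);
    void put(String key, Object value, int ttlSeconds);
}

// Per-instance strategy, roughly the role Ehcache played before: fast, but not shared between instances.
final class LocalMapCache implements CacheStrategy {
    // ConcurrentHashMap rejects null keys and values.
    private final Map<String, Object> map = new ConcurrentHashMap<>();

    public Object get(String key) { return map.get(key); }

    // TTL is ignored in this simplified in-memory version.
    public void put(String key, Object value, int ttlSeconds) { map.put(key, value); }
}

// Hypothetical minimal surface of a memcached client; a real client library would sit behind it.
interface RemoteClient {
    Object get(String key);
    void set(String key, int ttlSeconds, Object value);
}

// Shared strategy: delegates to the remote cache cluster, so every app instance sees the same data.
final class RemoteCache implements CacheStrategy {
    private final RemoteClient client;

    RemoteCache(RemoteClient client) { this.client = client; }

    public Object get(String key) { return client.get(key); }

    public void put(String key, Object value, int ttlSeconds) { client.set(key, ttlSeconds, value); }
}
```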

As an aside, offloading cache memory to its own cluster lets us downsize the main app servers. It also makes it inexpensive to cache, more broadly, any dynamic information that's costly to compute, for example derived data like these. Ultimately these weekly calculations live in a database, but we can now easily do the classic cache-first lookup, saving the db roundtrip.
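
The classic cache-first (cache-aside) lookup, sketched under assumptions: any `Map` stands in for the shared memcached cluster, and the loader function stands in for the database query that reads the weekly calculations. `CacheAside`, `getOrLoad` and `withLocalMap` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Cache-aside: check the cache first, fall back to the expensive source on a miss, then populate the cache.
final class CacheAside<K, V> {
    private final Map<K, V> cache;          // stand-in for the shared remote cache
    private final Function<K, V> loader;    // stand-in for the database roundtrip
    private final AtomicInteger loads = new AtomicInteger();  // how many times the loader actually ran

    CacheAside(Map<K, V> cache, Function<K, V> loader) {
        this.cache = cache;
        this.loader = loader;
    }

    // Convenience factory backed by a local ConcurrentHashMap, for testing without a cache server.
    static <K, V> CacheAside<K, V> withLocalMap(Function<K, V> loader) {
        return new CacheAside<K, V>(new ConcurrentHashMap<K, V>(), loader);
    }

    V getOrLoad(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;                  // hit: no database roundtrip
        }
        V value = loader.apply(key);        // miss: fetch or compute
        loads.incrementAndGet();
        if (value != null) {
            cache.put(key, value);          // later lookups hit the cache
        }
        return value;
    }

    int loadCount() { return loads.get(); }
}
```

Two concurrent misses on the same key can both reach the database; for read-mostly derived data like the weekly calculations that duplication is harmless, which is part of why the pattern is so common.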

There are a couple of opportunities for decomposition:

First, email notifications can be dropped fire-and-forget onto a message queue, from which a very small consumer instance hands them off to SES. These are join confirmation emails, password-change confirmations, and so on. On a larger scale, we can send newsletter blasts via the same mechanism. We can then shrink the Web application instances, since they no longer need the headroom.
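
A minimal in-process sketch of the fire-and-forget handoff, assuming a `BlockingQueue` in place of the real message queue and a hypothetical `Mailer` interface in place of the SES call. `EmailMessage`, `NotificationPublisher` and `NotificationConsumer` are likewise illustrative names. The Web tier only enqueues and returns immediately; the consumer drains the queue and sends.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical notification message: recipient, subject, body.
final class EmailMessage {
    final String to, subject, body;

    EmailMessage(String to, String subject, String body) {
        this.to = to;
        this.subject = subject;
        this.body = body;
    }
}

// Hypothetical sending interface; in production this would wrap the SES client.
interface Mailer {
    void send(EmailMessage msg);
}

// Producer side, called from the Web app: enqueue and return without waiting for delivery.
final class NotificationPublisher {
    private final BlockingQueue<EmailMessage> queue;

    NotificationPublisher(BlockingQueue<EmailMessage> queue) { this.queue = queue; }

    boolean publish(EmailMessage msg) {
        return queue.offer(msg);  // non-blocking; returns false only if a bounded queue is full
    }
}

// Consumer side, run on the small worker instance: take messages in order and hand them to the mailer.
final class NotificationConsumer implements Runnable {
    private final BlockingQueue<EmailMessage> queue;
    private final Mailer mailer;

    NotificationConsumer(BlockingQueue<EmailMessage> queue, Mailer mailer) {
        this.queue = queue;
        this.mailer = mailer;
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                mailer.send(queue.take());  // blocks until a message arrives
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // restore the interrupt flag and exit
        }
    }
}
```

With a real queue service the same producer/consumer split holds, but the consumer would also need to delete or acknowledge each message after a successful send.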

Second, we can serve static content from CloudFront.  Images, primarily.  We end up with a more "standard" architecture than before, where static content is served independently of dynamically-generated pages.
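
One way pages can point at CloudFront is a small helper that rewrites static asset paths onto the distribution's host while leaving dynamic URLs alone. A sketch under assumptions: `StaticUrls` is a hypothetical name, the base URL is a placeholder for the real distribution domain, and the extension list is illustrative.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: routes static assets to the CDN host, leaves dynamic pages on the app servers.
final class StaticUrls {
    private static final Set<String> STATIC_EXTENSIONS = Set.of("png", "jpg", "jpeg", "gif", "css", "js");
    private final String cdnBase;  // placeholder for the CloudFront distribution URL

    StaticUrls(String cdnBase) {
        // Drop a trailing slash so joining with a root-relative path never doubles it.
        this.cdnBase = cdnBase.endsWith("/") ? cdnBase.substring(0, cdnBase.length() - 1) : cdnBase;
    }

    // Assumes root-relative paths without query strings, e.g. "/images/logo.png".
    String url(String path) {
        int dot = path.lastIndexOf('.');
        String ext = dot >= 0 ? path.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
        return STATIC_EXTENSIONS.contains(ext) ? cdnBase + path : path;
    }
}
```

Because only the URL prefix changes, pages can switch between serving images locally and from the CDN by changing one configuration value.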

In the end we've broken a formerly monolithic JEE app into distributed components which interact to generate the final user experience. The decomposition adds minor complexity while allowing the components to scale independently. We also save money by downsizing the main application servers.

There were two unexpected architectural gotchas:  

It was necessary to redesign the mechanism behind the TriadCity Who's On page. Previously the Web application communicated with the TriadCity game server via RMI. That was reasonable with co-located servers, but very ugly when the Web app is in the cloud and the game server is still in co-lo. I redesigned the game server's startup to write its version, boot time and other details to the database; logins, the TriadCity date and other details were already there.
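
The shape of that redesign, sketched with assumptions: `ServerStatus`, `StatusStore`, `InMemoryStatusStore` and `WhosOnView` are hypothetical names, and the in-memory store stands in for the real database table. The game server writes its row once at boot; the Web app only reads it, so nothing crosses the co-lo/cloud boundary except ordinary database traffic.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical snapshot the game server records when it boots.
final class ServerStatus {
    final String version;
    final Instant bootTime;

    ServerStatus(String version, Instant bootTime) {
        this.version = version;
        this.bootTime = bootTime;
    }
}

// Hypothetical persistence boundary; in production this would be a database table.
interface StatusStore {
    void save(ServerStatus status);
    Optional<ServerStatus> load();
}

// In-memory stand-in for the database, for illustration only.
final class InMemoryStatusStore implements StatusStore {
    private final AtomicReference<ServerStatus> row = new AtomicReference<>();

    public void save(ServerStatus status) { row.set(status); }

    public Optional<ServerStatus> load() { return Optional.ofNullable(row.get()); }
}

// Web-app side: reads the stored row instead of calling the game server over RMI.
final class WhosOnView {
    private final StatusStore store;

    WhosOnView(StatusStore store) { this.store = store; }

    // Human-readable summary, or a fallback if the game server hasn't written its row yet.
    String summary(Instant now) {
        return store.load()
            .map(s -> "Version " + s.version + ", up " + Duration.between(s.bootTime, now).toHours() + " hours")
            .orElse("Server status unavailable");
    }
}
```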

I was surprised to find that we saved non-Serializable objects to HttpSession: Loggers, and some ugly Freemarker utilities. Loggers, no problem: just grab them statically rather than storing an instance. Freemarker: blecch. It was necessary to untangle the mess we created by embedding Forum software which wasn't designed to be embeddable.
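
Replicated sessions have to serialize everything stored in them, so a non-Serializable field anywhere in an attribute's object graph breaks replication. A minimal sketch of the Logger fix, assuming `java.util.logging` (the post doesn't say which logging library is in use) and a hypothetical `UserPrefs` session attribute: the logger is a `static` field, so it is never part of the serialized instance state. `SerializationCheck` is also a hypothetical helper that mimics, roughly, what a replicating session manager does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.logging.Logger;

// Hypothetical session attribute. Static fields are not serialized, so the Logger never travels with the session.
final class UserPrefs implements Serializable {
    private static final long serialVersionUID = 1L;
    private static final Logger LOG = Logger.getLogger(UserPrefs.class.getName());

    // An instance field such as "private final Logger log = ..." would cause NotSerializableException.
    private final boolean largeText;

    UserPrefs(boolean largeText) {
        this.largeText = largeText;
        LOG.fine("UserPrefs created");
    }

    boolean largeText() { return largeText; }
}

// Round-trips an object through Java serialization, as session replication would.
final class SerializationCheck {
    static Object roundTrip(Serializable value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }
}
```

Running `SerializationCheck.roundTrip` over each session attribute type in a unit test is a cheap way to catch this class of problem before replication does.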

I bravely tested elasticity with simulated traffic against the production environment.  Worked as advertised: additional instances spin up automatically, and sessions are shared.  Then I terminated a production instance. EB brought a replacement up within a few seconds.  Very nice. 

There was absolutely no downtime; the migration was entirely transparent. With the new EB environment online and tested, I simply changed the DNS (Route 53) to point to EB. That was it.

Jacob Lawrence, The Great Migration, Panel 40