SmartMonsters

Thursday, October 16, 2014

Migrating SmartMonsters to the Cloud. Part Four: the Web Site.

SmartMonsters' Web site was one of the first in our experience to be fully dynamic.  Every page is a JSP.  Written c. 2001, it pursued a goal similar to that of elaborate turnkey systems such as Kana: content personalization based on user profiles.

Although we lost interest in that model as advertising became unviable on smaller Web properties, it still has practical uses.  With little effort we can reconfigure displayed pages for users who self-identify as visually impaired, while easily enabling a sophisticated security model.

Architecturally, the site suffers from age-related drawbacks.  Sessions are old-school server-side JEE requiring replication between instances.  Single-instance content caches minimize database roundtrips, but ensure that instances will be out of sync.  The forum software caches especially aggressively, guaranteeing inconsistency.  

We'll fix these pre-Cloud disadvantages during migration.

With JEE sessions we don't want stickiness via the load balancer: if autoscaling terminates instances, we'll lose nonreplicated JEE sessions which were unique to the evaporating instances. Replication is necessary. Memcached is tried-and-true; I opted instead to have EB manage replication via DynamoDB. Some configuration is required, see here for an example. Once configured it operates transparently. There's an odd downside though: the auto-generated DynamoDB session records don't include a TTL attribute, meaning old session records linger permanently until manually deleted. Surprised.

In-memory caches which previously used Ehcache can be migrated to memcached on ElastiCache.  Elastic Beanstalk server instances can then share a central ElastiCache cluster.  Memcached was chosen over Redis because the cached Forum data are frequently Maps or serialized objects, which the main Java Redis client, Jedis, can't handle.  Internally our caching API is exposed via the Strategy OO pattern, so it was straightforward to write a memcached implementation of the pattern's Interface which talks to ElastiCache.
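
Here's roughly what that implementation looks like, sketched against the spymemcached client.  The Cache interface is a simplified stand-in for our real Strategy contract, and the endpoint is whatever ElastiCache reports for the cluster.

    import java.io.IOException;
    import java.net.InetSocketAddress;

    import net.spy.memcached.MemcachedClient;

    // Simplified stand-in for our internal caching Strategy contract.
    interface Cache {
        Object get(String key);
        void put(String key, Object value, int ttlSeconds);
        void remove(String key);
    }

    // Concrete Strategy backed by memcached on ElastiCache.
    public class ElastiCacheMemcachedCache implements Cache {
        private final MemcachedClient client;

        public ElastiCacheMemcachedCache(String endpoint, int port) throws IOException {
            // endpoint is the ElastiCache cluster's configuration endpoint.
            this.client = new MemcachedClient(new InetSocketAddress(endpoint, port));
        }

        public Object get(String key) {
            return client.get(key);                // null on a miss
        }

        public void put(String key, Object value, int ttlSeconds) {
            client.set(key, ttlSeconds, value);    // value must be Serializable
        }

        public void remove(String key) {
            client.delete(key);
        }
    }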

As an aside, offloading cache memory to its own cloud allows the main app servers to downscale.  This makes it inexpensive to generalize caching of dynamic information that's costly to compute, for example derived data like these.  Ultimately these weekly calculations live in a database, but we can now easily do the classic cache-first lookup, saving the db roundtrip.
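
The lookup itself is the classic cache-aside pattern.  A sketch, with hypothetical cache, database and WeeklyStats names:

    // Classic cache-first lookup; all names here are illustrative.
    WeeklyStats stats = (WeeklyStats) cache.get("weeklyStats:" + playerID);
    if (stats == null) {                             // miss: do the db roundtrip
        stats = database.loadWeeklyStats(playerID);
        cache.put("weeklyStats:" + playerID, stats, 7 * 24 * 3600);  // one week
    }
    return stats;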

There are a couple of opportunities for decomposition:

First, email notifications can be dropped fire-and-forget onto a message queue from which a very tiny consumer instance hands them off to SES.  These are join confirmation emails, password change confirmations, and so on. On a larger scale we can send newsletter blasts via the same mechanism. We can then shrink the Web application instances since they no longer need the headroom.
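
The producer side is nearly a one-liner against SQS.  A sketch using the AWS SDK for Java; the queue URL and message shape are invented for illustration:

    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.SendMessageRequest;

    public class EmailNotifier {
        private static final String QUEUE_URL =
                "https://sqs.us-east-1.amazonaws.com/123456789012/email-notifications";
        private final AmazonSQSClient sqs = new AmazonSQSClient();

        // Fire-and-forget: drop the notification on the queue and move on.
        // The tiny consumer instance on the other end hands it to SES.
        public void sendJoinConfirmation(String address) {
            String body = "{\"type\":\"JOIN_CONFIRMATION\",\"to\":\"" + address + "\"}";
            sqs.sendMessage(new SendMessageRequest(QUEUE_URL, body));
        }
    }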

Second, we can serve static content from CloudFront.  Images, primarily.  We end up with a more "standard" architecture than before, where static content is served independently of dynamically-generated pages.

In the end we've broken a formerly monolithic JEE app into distributed components which interact to generate the final user experience. The decomposition adds minor complexity while allowing the components to scale independently. We see a cost optimization from downsizing the main application servers.

There were two unexpected architectural gotchas:  

It was necessary to redesign the mechanism behind the TriadCity Who's On page. Previously the Web application communicated with the TriadCity game server via RMI. That was reasonable with co-located servers, but very ugly when the Web app is in the cloud and the game server is still in co-lo. I redesigned the game server startup to write server version, boot time and other details to the database; logins, TriadCity date and other details were already there.

I was surprised to find that we saved non-Serializables to HttpSession.  Loggers, and some ugly Freemarker utilities. Loggers, no probs: just grab them statically, don’t save a class instance.  Freemarker: blecch.  It was necessary to detangle the mess we created by embedding Forum software which wasn't designed to be embeddable.
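
The Logger fix, for the record; ForumHelper is a hypothetical class name, and java.util.logging stands in for whatever framework you prefer:

    // Reference Loggers statically; never stash them in HttpSession.
    private static final java.util.logging.Logger log =
            java.util.logging.Logger.getLogger(ForumHelper.class.getName());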

I bravely tested elasticity with simulated traffic against the production environment.  Worked as advertised: additional instances spin up automatically, and sessions are shared.  Then I terminated a production instance. EB brought a replacement up within a few seconds.  Very nice. 

There was absolutely no downtime; the migration was entirely transparent. With the new EB environment online and tested, I simply changed the DNS (Route 53) to point to EB. That was it.

Jacob Lawrence, The Great Migration Panel 40

Monday, September 29, 2014

Migrating SmartMonsters to The Cloud. Part Three: Build Automation and Continuous Deployment.

[Intro 2018.05.11. This was written in 2014, when the concept of "Continuous Deployment" was new-ish, not universally adopted, and lacking mature tools. AWS now provides managed services which do much of what we hand-rolled four years ago. I'll describe our migration to their services in future posts.]

For years we've pushed Web content from desktop to production without human intervention: a very early example of what's now known as "Continuous Deployment".  A cron job on the Web server updated from SCM every fifteen minutes.  That was it: if you checked-in page changes, they were gonna go live, baby!  It was up to you to test them first.  Code as opposed to content was pushed the old school way: 100% manual.

The Cloud-based process we’ve recently implemented fully automates CI/CD of every component where that's a reasonable goal.  The TriadCity game server is the major exception — we wouldn't want to bump players off every time we commit.  All other components, from Web sites to robots to AI resources, are now built, tested, and deployed with full automation, without downtime, every time one of our developers commits code.

Here’s the flow:

  • Developers are responsible for comprehensive tests.  Test suites must pass locally before developers commit to SCM.  This remains a manual step.
  • Commit to SCM triggers a Jenkins CI build in The Cloud.  Jenkins pulls, builds, and tests.  Build or test failures halt the process and trigger nastygrams.
  • If the tests pass, Web sites, robots, AIML interpreters and other components are pushed by Jenkins directly to production.  Most of these deploy to Elastic Beanstalk environments. Jenkins' Elastic Beanstalk Plugin writes new artifacts to S3, then calls EB's REST API to trigger the swap.  EB handles the roll-over invisibly to end-users.  There was no real work in this: I simply configured the plugin to deploy to EB as a post-build step.
  • For some components, acceptance tests follow deployment.  That’s backward — we'll fix it.  For example, the AIML files enabling many of our chatterbots are validated in two steps: first for syntactic correctness during the build phase, second for semantic correctness after deployment, once it's possible to ask the now-live bots for known responses (see the sketch after this list).  We do this on live bots 'cos it's simple: a separate acceptance test environment will make it correct.  There'll be a pipeline with an intermediate step: on build success, Jenkins promotes into the acceptance test environment, where it runs the final semantic checks.  Only on success there will Jenkins finally promote to production.  Failure anywhere stops the pipeline and broadcasts nastygrams.  We'll run a handful of final post-deploy smoke tests.
  • TriadCity game server updates are semi-automated with a final manual step. As noted, we don't want to bump users at unannounced moments.  We’d rather trigger reboots manually.  Build and deploy short of reboot however is automated. Jenkins remote-triggers a shell script on the TC server which pulls the latest build from Artifactory in The Cloud.  That build sits on the TC server where it's up to a human to trigger the bounce.  Because we pull-build-test-and-deliver following every code push, there may be many builds delivered without ever being launched.  TC's administrators will launch when they choose to.  The good news is, the latest build's already present on the server; and is guaranteed to have passed its tests.
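
To make the semantic pass concrete, here's a minimal sketch of the kind of post-deploy check we mean: ask a live bot a question with a known canned answer, and fail the Jenkins step on a mismatch.  The endpoint, bot name and expected text are all illustrative.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BotSemanticCheck {
        public static void main(String[] args) throws Exception {
            // Illustrative endpoint; the real API is a simple parameterized GET.
            URL url = new URL("http://bots.example.com/ask?bot=Oscar&input=hello");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String body = in.readLine();
                if (body == null || !body.contains("Hello there")) {
                    System.err.println("Semantic check failed, got: " + body);
                    System.exit(1);   // non-zero exit fails the Jenkins step
                }
            }
            System.out.println("Semantic check passed.");
        }
    }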

One-button rollbacks are possible on EB instances.  Simply re-deploy an older build via EB's Web console.  Obviously safe rollbacks are possible only when there have been no backward-incompatible schema changes in the databases.  Since most of our DBs are schemaless this is not usually difficult.  We can similarly roll-back the TriadCity game server with a simple script.

SmartMonsters releases continuously, often many times a day.  We can go further. With 100% virtualization it's possible to bundle the environment with the release.  This vision of infrastructure as code is more radical than the contemporary DevOps movement.  I believe it's the future of release automation.

[Afterward 2018.05.11. We're now using CodeBuild and CodePipeline for nearly all components. CodeDeploy coming real soon now. Future posts will elaborate.]

Jacob Lawrence, The Great Migration Panel 45

Wednesday, September 24, 2014

Migrating SmartMonsters to the Cloud. Part Two: Breaking Up the Bots.

TriadCity’s robots Death, Help, Marvin and Oscar are AIML-based chatterbots architected monolithically. Each bot is its own application: TriadCity client code, AIML interpreter, AIML files, plus each bot's layer of individual behaviors. They're built and deployed independently of each other. The plus side is they're free to evolve and scale independently; the downside is there's a lot of duplication.

While migrating to AWS we want to break them down to sharable components. We especially want a shared AIML interpreter accessible from multiple protocols and device types. Advantages:

  • Enables a new generation of bots accessed independently of TriadCity. For example, we can turn Oscar into his own application, accessing a single scalable AIML brain from mobile devices, Web sites, Twitter feeds, and the existing TriadCity bot without having to maintain separate AIML repositories for each client.
  • Enables AIML updates independent of the Java code for these multiple platforms, doable in real time without rebooting the bots.
  • The shared AIML interpreter can scale independently, becoming fully elastic in The Cloud.
  • Centralized AIML files can be validated globally within a consolidated staged CI/CD pipeline with appropriate tests at each stage.
  • It's easier to swap-out or enhance AIML interpreters without having to rebuild and redistribute an army of dependents.

Here are the migration steps:

  1. Pull the AIML interpreter out of the bot code, into a standalone application accessed as a RESTful service on Elastic Beanstalk.
  2. Redesign the AIML interpreter to serve all of the bots. The REST API will be super-simple: a GET call parameterized with the bot name and the input String, returning a JSON object (sketched after this list).  EB provides elasticity.
  3. Remove the AIML interpreter from the Web site, having the appropriate JSPs now RESTfully GET their conversations from the service on EB.
  4. Remove the AIML interpreter from the TriadCity bots, having them similarly call into the service.  I decided while I was at it to consolidate the four existing bots into a single JAR rather than continuing to build each as a separate project.  This is simpler, and there’s very little Java code distinguishing the bots.  A configuration file determines which bot to run; the consolidated code figures out which bot-centric classes to load.
  5. Configure Jenkins to manage Continuous Delivery.  The CI build fires when updates are made to SCM.  The build lifecycle now includes two cycles of AIML validation.  A syntactic pass ensures XML correctness before changes are pushed to production; a semantic pass validates AIML rules by comparing expected to actual outputs.
  6. Log interpreter inputs which fail to match AIML rules, triggering default responses.  This feedback will let our botmasters know what their users are interested in, and how their bots are performing.  Write logs to DynamoDB, with a simple schema: bot name, <input>, <topic>, <that>.  Schedule a weekly cron to report on these failures.
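
For illustration, here's the shape step 2's GET API might take as a JAX-RS resource.  The path, parameter names, and the AimlInterpreters delegate are invented stand-ins, not our actual code.

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.QueryParam;
    import javax.ws.rs.core.MediaType;

    @Path("/ask")
    public class InterpreterResource {
        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public String ask(@QueryParam("bot") String bot,
                          @QueryParam("input") String input) {
            // AimlInterpreters.forBot() is a hypothetical lookup of the
            // shared interpreter state for the named bot.
            String response = AimlInterpreters.forBot(bot).respondTo(input);
            return "{\"bot\":\"" + bot + "\",\"response\":\"" + response + "\"}";
        }
    }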

Migration took two working days. To my surprise, DynamoDB logging required a full third day.  I struggled to get connected — turns out the published Developer Guide is missing one key point, and incorrect on another.  To connect you have to specify the AWS region, duh.  Integer keys are not necessary, Strings are fine; there's a "@DynamoDBAutoGeneratedKey" annotation in the Java SDK which'll auto-gen them for you.
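
For the record, here's the shape of the working result: a DynamoDBMapper-annotated log record with an auto-generated String key, plus the region setting the Guide omits.  Table and attribute names are illustrative.

    import com.amazonaws.regions.Region;
    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.datamodeling.*;

    @DynamoDBTable(tableName = "AimlMisses")
    public class AimlMiss {
        private String id;
        private String botName;
        private String input;
        // <topic> and <that> accessors omitted for brevity.

        @DynamoDBHashKey
        @DynamoDBAutoGeneratedKey      // String key, auto-generated: no Integers needed
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }

        @DynamoDBAttribute
        public String getBotName() { return botName; }
        public void setBotName(String botName) { this.botName = botName; }

        @DynamoDBAttribute
        public String getInput() { return input; }
        public void setInput(String input) { this.input = input; }

        public static void log(AimlMiss miss) {
            AmazonDynamoDBClient client = new AmazonDynamoDBClient();
            client.setRegion(Region.getRegion(Regions.US_EAST_1));  // the missing
                                                                    // point: set the region
            new DynamoDBMapper(client).save(miss);
        }
    }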

One more enhancement seems obvious.  Writing logs directly to db is synchronous: our RESTful API server will block a thread during the transaction.  We can avoid the block by posting log messages asynchronously to a queue. It's half a day to update the interpreter to write to SQS instead of directly to DB. An EB "worker tier" consumer pulls from the queue and does the DB write.

The platform is now in place upon which we'll build a new generation of bots more advanced than before. Watch this space.

Jacob Lawrence, One Way Ticket

Monday, September 15, 2014

Migrating SmartMonsters to The Cloud. Part One: Problems and Legacy State.

No-one owns servers anymore. The advantages of Cloud Computing are so overwhelmingly obvious that hardware is simply no longer sensible. If we were starting SmartMonsters today we'd begin on AWS. But, we started in 1999, putting us in a similar situation to the majority of 15-year-old companies in the Internet space: we have a complex physical infrastructure hosted in a data center. Migrating to AWS will be a nontrivial project.

It's not simply a matter of lift-and-shift. Although it's possible to simply substitute virtualized servers for physical ones, virtualization per se isn't the advantage. Where you really gain is in leveraging the platform services which AWS provides on top of virtualization. That leverage shrinks your management footprint, offloading traditional IT tasks of monitoring and maintenance to AWS, while at the same time making it much easier to build "elastic" systems which scale automatically and self-recover from outages. The key is to break legacy monolithic apps into components accessed as services, either self-managed or provided by the platform.

For example, our chatterbots Oscar, Marvin, Help, Death and Eliza all currently run in small EC2 instances, logging in to the TriadCity server in the same way that humans would. That's fine, but if we break them into components — say by exposing a shared AIML interpreter via RESTful interface — then the Web site, the bots in TriadCity, and perhaps a new generation of bots running on mobile devices could all call into a shared and elastically scalable AIML service. At the same time we can move their relational databases into AWS' Relational Database Service, offloading management to the vendor. This distributed design would allow all components to evolve and scale independently of each other. It enables flexibility, scalability, and automatic recovery from outages while at the same time considerably decreasing cost.

Decomposition confronts technical issues without simple solutions. For example, the Web site's Forum section is powered by heavily modified open source code now deeply integrated into — tightly coupled with — the site's authentication, authorization and persistence frameworks. This infrastructure aggressively caches to local memory. Without modification, multiple elastic instances would not sync state, so that user visits from one click to the next would likely differ. We can leverage AWS' managed ElastiCache service to share cache between instances. We'll have to modify existing code to do that.

This is a common problem. Legacy architectures designed to be "integrated" back when that was a marketing bullet rather than a swear word are really the norm. In my consulting practice I try to educate clients not to build new monoliths running on self-managed EC2. Yah it'll be "The Cloud". But it'll be the same problems without the advantages.

While migrating we're going to re-architect, breaking things down wherever possible into loosely coupled components. We'll move as many of those components as we can to managed services provided by AWS, for example SCM, email, databases, caches, message queues, directory services, file storage, and many others. At the same time we'll leverage AWS best practices for security, elasticity, resilience and performance. And cost. In the next weeks I'll write several posts here outlining how it goes.


Initial State

  • SCM is by CVS - remember that? - on a VPS in the colo. This SCM manages not only source code but also company docs. Deployment of new or edited JSPs to the production web server is automated by CVS update via cron.
  • Email is self-administered Postfix, on a VPS in the colo.
  • MongoDB instances are spread among VPS and physical hardware in the colo. There’s a primary and two secondaries.
  • PostgreSQL is on a VPS in the colo.
  • A single Tomcat instance serves the entire Web site, which is 100% dynamic — every page is a JSP. This instance experiences a nagging memory leak, probably related to the forum code, which no-one has bothered to track down yet; we simply reboot the site every night.
  • The TriadCity game server is physical hardware, with 16 cores and 32gb RAM, racked in the colo.
  • The Help, Oscar, Marvin, Death and Eliza chatterbots run on small EC2 IaaS instances on AWS.
  • Jira and Greenhopper are on-demand, hosted by Atlassian.
  • Jenkins build servers are on VPS in the colo.
  • DNS is administered by the colo provider.
  • Firewall rules, NAT, and port forwarding are also administered by the provider.

Anticipated End State

  • SCM by Git, hosted by a commercial provider such as GitHub or CodeCommit.
  • Company docs moved off of SCM, onto S3 or Google Drive.
  • Corporate email by Google, maintaining the smartmonsters.com domain and addresses. Website email notices and newsletter blasts via SES.
  • MongoDB instances hosted in EC2, geographically dispersed. There’ll continue to be a primary and two secondaries. Backups will be via daily snapshots of one or another secondary. It may be optimal to occasionally snapshot the primary as well: Amazon recommends shorter-lived instances whenever that’s feasible.
  • PostgreSQL via RDS.
  • Caches migrated to a shared ElastiCache cluster.
  • The remainder of the dynamic Web site served by Elastic Beanstalk; sessions shared via DynamoDB. At least two instances, geographically dispersed.
  • Static web content such as images hosted on S3 and served by CloudFront.
  • Long-term backups such as legacy SQL dumps stored on Glacier.
  • The chatterbots broken into two layers: an AIML interpreter, accessed via REST by the Java-based bot code which in other respects is very similar to any other TriadCity client. The AIML interpreter will probably be served by Elastic Beanstalk; the bots can continue to run from EC2 instances. The interpreter will have to be adapted to save state in DynamoDB; the AIML files can live in S3; but note that the program will have to be adapted to also be able to read files from the local file system, in case we choose to deploy it in a different way. We may or may not make those instances elastic.
  • Jira and Greenhopper remain at Atlassian.
  • Jenkins builds will run on EC2.
  • A micro EC2 instance will host an Artifactory repo to be used by Continuous Deployment.
  • DNS will move to Route 53; the advantage to us is self-administration.
  • The TriadCity game server will remain on physical hardware inside the colo. It's not yet cost-effective for a company as small as ours to virtualize servers of that scale.
  • Firewall rules, NAT and port forwarding for the TriadCity server remain with the colo provider. Security groups, key management and so on for the AWS infrastructure will be self-administered via IAM and related services.
  • Monitoring of the AWS resources will be via CloudWatch.
  • The AWS resources will probably be defined by CloudFormation, but, we don't know enough about it yet.
  • Continuous Deployment will have to be re-thought. Today, a shell script on the Web server runs CVS update periodically, a super-simple CD solution which has worked with no intervention for years. The virtualized infrastructure may allow more sophistication without much additional cost. For example it may be sensible to automate deployments from Jenkins which not only push code, but create new Beanstalk instances while terminating the old ones, a strategy which will keep the instances always young per AWS best practices. Beanstalk will keep these deployments invisible to end users, with no app downtime. More exploration is in order.

You'll note our planned dependence on AWS. Are we concerned about "vendor lock-in"?

No. "Vendor lock-in" is a legacy IT concept implied by long acquisition cycles and proprietary languages. While it's true we'll be tying code and deployments to AWS APIs, migrating to Azure or another cloud provider would not be a daunting project. For example, AWS exposes cache APIs, so does Azure. Migration would imply only minor changes to API calls. Our apps rely on the Strategy design pattern: we'd only have to write new Strategy implementations which call the changed APIs. Trivial concern compared to the advantages.


Conclusion

This is a journey which very many companies will be taking in the coming decade. The advantages of resilience, elasticity, and cost are overwhelming. Please note that I design and manage these migrations for a living. Please contact me if I can be of help.

Jacob Lawrence, The Great Migration Panel 18

Friday, May 16, 2014

Migration notes: from RDBMS to MongoDB

The relational paradigm maps poorly to OO design. Objects don't have relations. When you use an ORM for CRUD you end up with a denormalized schema lacking the point of the relational paradigm. If your OO design relies on inheritance you can potentially end up with absurdly large numbers of tables. In TriadCity we have hundreds of Item subtypes with varying properties; mapping these to traditional normalized RDBMS would result in spaghetti schema. Somebody has to win: you end up with either a lousy database or lousy OO design. By contrast the document data model maps intuitively to OO. Because MongoDB is schemaless you can put all those hundreds of Item subtypes into the same Collection, hydrating them without need for OO mapping and reporting on them in a perfectly granular way. Schemalessness removes the impedance mismatch.

Migration turned out to be a larger project than expected. This isn't the fault of the technology. It was trivial to migrate game object persistence. Without hyperbole, an hour or two. But of course, then we got ambitious. We do vastly more with data than merely game CRUD. For example, we're beginning to use rich data analytics of in-game auctions to enable AI-driven characters' bidding behavior. Previously that logic would have lived in a vendor-specific stored procedure, and would have been less comprehensive at runtime than what we can achieve now. We're better served by a combination of MongoDB plus Neo4j for real-time lookups. I'm happy as can be to have this logic live in Java inside the game engine, rather than a proprietary stored procedure language in the database. But, implementation in the new technology took some effort.

I chose not to replicate the previously relational schema in MongoDB. It would have been incrementally easier to simply map tables to Collections 1-for-1. Instead, I chose to take advantage of MongoDB's ability to store related information in a single document via nested arrays. Here's an example. Our RDBMS used normalized tables to relate basic user account information to: demographics; settings and preferences; security roles and access restrictions; a listing of bounced email addresses; and a table of locked accounts. So, a parent table and five children, related by the unique ID of the parent account. In MongoDB, it makes better sense to keep all of this related information in a single record. So, our Person document has an array field for demographics, another for roles and privileges, another for settings and preferences, and simple boolean flags for bounced emails or global account lock. This consolidation is straightforward to code to, and results in a human-readable document browsable via your favorite administrative GUI.
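
A sketch of that consolidated Person document, built with the same old-style driver API used later in this post.  Field names and values are illustrative.

    import java.util.Arrays;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class PersonExample {
        public static void main(String[] args) throws Exception {
            DB db = new MongoClient("localhost").getDB("smartmonsters");
            DBObject person = new BasicDBObject("_id", 12345L)
                    .append("name", "exampleUser")
                    .append("demographics",
                            new BasicDBObject("country", "US").append("yearOfBirth", 1980))
                    .append("rolesAndPrivileges", Arrays.asList("player", "builder"))
                    .append("preferences", new BasicDBObject("ansiColor", true))
                    .append("emailBounced", false)     // was a separate child table
                    .append("accountLocked", false);   // was another child table
            db.getCollection("People").save(person);   // upserts on _id
        }
    }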

The need for transactional guarantees goes away. In a SQL world most transactions are second-order: they're really artifacts of the relational model itself, where an update to a Person per above might require writes to multiple tables. Post-migration we literally have not one use case where a transaction is necessary. We do rely on write guarantees in MongoDB: our writing Java thread blocks until it receives an appropriate acknowledgement. The Java driver handles them automagically.

Object-oriented queries in MongoDB's Java API turn out to be very intuitive to internalize. They read like almost-English and make immediate sense. Here's an example and its SQL equivalent. This looks up the most recent sale price of a Player's in-game house. MongoDB:

    DBObject where = new BasicDBObject("playerHouseID", playerHouseID);
    DBObject fields = new BasicDBObject("price", true).append("timestamp", true);
    DBObject sort = new BasicDBObject("timestamp", DESCENDING);
    DBObject found = this.mongo.db.getCollectionFromString("PlayerHouseSales").findOne(where, fields, sort);
    if (found == null) { // this player house has never been sold
        return -1;
    }
    return (Long) found.get("price");

The SQL equivalent minus framework boilerplate would have been something like:

    "SELECT * FROM (SELECT price FROM player_house_sales WHERE player_house_id = ? ORDER BY timestamp DESC) WHERE rownum = 1"

I find the OO version vastly easier to conceptualize.

Object-oriented queries like these make it straightforward to eliminate the 200 or so stored procedures which had tied us to one specific RDBMS vendor. Nearly all of these existed because it's easier to implement branching logic in a stored procedure language than in SQL. Typically the branching test was whether a row already existed, thus insert or update. The MongoDB Java driver's save() method handles upsert for you, automagically converting as the case warrants. A few circumstances which are marginally more complicated are easily handled in Java. This is a huge, huge win. All of our data persistence and analytics logic are now in Java in a single location in a single code base — no longer scattered among multiple languages and disparate scripts in different locations in SCM.

I don't know what the best practice is for sequences. Sequences aren't recommended for _id unique keys. But, there are circumstances where an incrementing recordNo is handy. For example, I like to assign a recordNo to each new Room, Exit, Item and so on. A dynamic count() won't do for this, because these objects are occasionally deleted after creation. I'd nevertheless like to know the grand total of how many Rooms have ever been created, without keeping a separate audit trail. It seems straightforward to create a sequence document in the database with one field for each sequence; you then call the $inc operator to do what it sounds like. Since this is not the _id unique key we're talking about, I ran with it.
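
Here's the approach in sketch form: one counter document, bumped atomically with $inc via findAndModify so the post-increment value comes back in a single round trip.  Collection and field names are illustrative.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBObject;

    public final class Sequences {
        // Atomically increment and return the named counter, e.g. "roomRecordNo".
        public static long next(DB db, String counterName) {
            DBObject query = new BasicDBObject("_id", "sequences");
            DBObject bump = new BasicDBObject("$inc", new BasicDBObject(counterName, 1L));
            DBObject updated = db.getCollection("Counters").findAndModify(
                    query, null, null,
                    false,    // remove: no
                    bump,
                    true,     // returnNew: read the post-increment value
                    true);    // upsert: create the counter document on first use
            return ((Number) updated.get(counterName)).longValue();
        }
    }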

I'd like to see more syntactic sugar in the query API.  Common operations such as summing or averaging columns are cumbersome today.  There's a new-ish Aggregation Framework seemingly designed for large-scale offline reporting.  For ad hoc in-game queries, I'd prefer the Java API, and would like to have common operations like summing or finding max or min values highly runtime optimized, as they are in the RDBMS we're replacing.  There are corners of the MongoDB programming world which to me seem immature, like these.  Development seems very active, so hopefully these gaps will be filled soonish.

I've encountered one problem not yet solved.  Sparse unique indexes don't seem to work.  Inserting more than one record with a null value into a field with a sparse unique index fails with a duplicate value error on the null, which the "sparse" part is supposed to finesse.  (A possible explanation: sparse indexes skip documents which omit the field entirely, but a document storing an explicit null still has the field, so it's indexed.  If our writes store explicit nulls rather than omitting the field, that would account for the duplicates.)  Google suggests others have experienced similar glitches.  For the time being I've turned off the index, and am continuing to research.  Really do want this working.

Runtime performance is excellent. Superior in most cases to the SQL equivalent. For example, loading 18k Rooms from RDBMS including their default Items and Mobiles typically required about 40 seconds; with MongoDB it's usually 15. On my MacBook Pro with 8 cores and 16gb RAM, a complete game boot in the SQL world took around 60 seconds; it's half that with MongoDB.

[For a client, I've subsequently done a shootout of very high-volume write processing between MongoDB and Oracle, where "high-volume" means hundreds of millions of writes. MongoDB was an order of magnitude faster under load.]

MongoDB's system requirements are demanding. MongoDB likes to put the entire data set into RAM. It assumes a 3-node replica cluster with several gigs memory per node. Our data set really isn't that big, so we've got it shoehorned into small VPS instances in the colo. Without a lot of hands-on lore, I can't say yet if we'll turn out to need to bump up these instance sizes. [Update 2014.10.31: instance sizes unchanged. All's well.] [Update 2018.05.12: after migrating to AWS the Mongos run happily on m1.medium instances.]

MongoDB's automated replication is nifty for high-availability and offloading queries from the writeable system. We run our analytics queries using map/reduce against mongodumps taken from one of the secondaries, freeing the primary for production writes. Nothing's necessary to enable this behavior, it works out of the box. For very large data sets, sharding is easy-peasy. There's a handy connector linking MongoDB with Hadoop, for offline map/reduce queries. The database is designed for these topologies, which in many RDBMS worlds are cumbersome to achieve.

Granted this enhanced ability to easily analyze 14 years of comprehensive log data, we can begin to enable vastly more powerful and realistic in-game AI. Here's an example. The data show us which Items sell most frequently at auction; what the most successful starting prices are; what the average, minimum and maximum final auction prices are for each Item type; how many bids are common; upper and lower bounds on all these things; medians; distributions; curves; etc. etc. They can help us predict which Players or AI-driven characters might be interested in buying which things. By pulling the "big data" from MongoDB and graphing it with Neo4j, we can enable super-zippy real-time queries like, "Should I bid on the current auction Item? If yes, how much? What should my top bid be based on the bid history so far? If I buy it at the right price, will I later be able to profit from re-auctioning it? When's the best time to re-auction? Are there any Players currently online who may be interested in the Items in my possession? Are there any robots or other AIs who might be interested?" This has far-reaching implications not only for the quality of in-game AI, but also for our ability to expand and make the in-game economy more rich and realistic. In addition to in-game merchants who buy Items from Players, we can now have AI-driven characters buy and sell from auction, creating another sales channel from which Players can benefit.

A note re Mongo and JSON. If your company models its business objects in JSON, a highly-performant JSON database makes great sense. It's far more straightforward than shoehorning JSON into the relational paradigm designed for an earlier world.

SmartMonsters is now fully-committed to "polyglot persistence". The principals are MongoDB, Neo4J, DynamoDB, and PostgreSQL. We plan to migrate the PostgreSQL to AWS' RDS. Very much looking forward to a world where AWS offers a managed MongoDB which we don't have to take care of ourselves.

Jacob Lawrence, The Great Migration

Sunday, April 6, 2014

NPC movement with Neo4j

TriadCity’s sophisticated "subsumption architecture" for NPC behaviors depends on accurate, comprehensive knowledge of game world geography. Unlike traditional MUDs where NPC movement is statically confined to home zones and is largely random, TriadCity is alive with highly individualized NPCs capable of purposeful navigation throughout the City.

For years the code enabling this directed movement has been a simple homegrown weighted graph mapping all possible paths from every point A to every other point B. Weights were based primarily on movement costs by alignment. Separate graphs were computed for Evil, Good and Neutral paths, plus specialized graphs for Cops and for residents of the Capitoline neighborhood — two categories of NPCs with restricted movement possibilities. The graphs were computed at server boot, following instantiation of all Rooms and the Exits linking them. They were static: unchanging after initial computation, essentially freezing NPC knowledge of world geography at the state of its existence at boot. And they were monolithic: an NPC could look up a path from the Neutral graph, or the Cop graph, but not a hybrid NeutralCop graph. For a long time this was "good enough for now." While inflexible, it enabled accurate if not always subtle NPC navigation from any arbitrary origin to any chosen destination. NPCs simply looked up their path and, following Dijkstra, off they went.

TriadCity eventually outgrew this simple design. We want to be able to accurately model arbitrary restrictions on NPC paths, for example, excluding NPCs from certain streets based on Role, or gender, or other classification. It's important for NPCs to react to more than one category of restriction. And we want their movement behaviors to be as "human" as possible, so, not necessarily always taking the shortest possible path, if the shortcut for instance leads through someone's private garden, or in one pub door and out another. Usually real people would go around the garden, or stay on the sidewalk outside the pub. We grew especially frustrated with our simple system's inability to track dynamic changes to the game world at runtime. It would have required a major project to extend our down & dirty proprietary code to handle these features.

Instead, we've chosen the excellent NoSQL graph database Neo4j as our solution. While the approach is essentially the same — a weighted graph based on Dijkstra — Neo4j's comprehensive query-ability allows us enormous flexibility at runtime. It easily enables the kinds of ambitions described above, with impressively little code.

We run Neo4j in embedded mode. Easy-peasy, since our server's written in Java. The database is populated during server boot after Rooms and Exits are loaded. Nodes are Rooms, Relationships are built from outbound Exits leading to destination Rooms. The algorithm's straightforward. Iterate the Rooms, define a Node for each; iterate a second time, catalog each Room's outbound Exits, look up the destination Room each Exit leads to, find the Node representing that destination, and define a Relationship between the origin and destination Nodes. We only need one RelationshipType: LEADS_TO. As each Relationship is defined we flag it with boolean properties representing the restrictions we want to impose: isNoCops, isNoBeasts, isNoHumans, isNoDeathSuckers, isPerformerOnly, isCapitoline, hasDoor, and so on; these properties belong to Exits and destination Rooms, so we simply look for them and set the flags appropriately.

At runtime, NPC path lookups are via a PathFinder built by passing GraphAlgoFactory's static dijkstra() method a custom CostEvaluator which looks for the boolean property flags defined on Relationships and totals costs accordingly. For example if the calling NPC has the Cop Role, and a Relationship is flagged isNoCops, the CostEvaluator will assign a prohibitive cost to that Relationship — we arbitrarily use 100,000 — effectively removing it from consideration. When PathFinder searches for the least-cost path it'll choose one which bypasses that expensive Relationship. Managing the scenario where we want NPCs to go around a pub although the shortest path leads through it is straightforward: we simply assign a cost of 100 to all Relationships flagged hasDoor, so that generally NPCs will prefer not to move through doors unless they're the only paths available. Nice.
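
A condensed sketch of that runtime lookup against the embedded 2.x-era API.  The property names and cost numbers follow the description above; the class shape and the single-Role signature are illustrative simplifications.

    import org.neo4j.graphalgo.CostEvaluator;
    import org.neo4j.graphalgo.GraphAlgoFactory;
    import org.neo4j.graphalgo.PathFinder;
    import org.neo4j.graphalgo.WeightedPath;
    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.PathExpanders;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;

    public class NpcPathfinder {
        enum RelTypes implements RelationshipType { LEADS_TO }

        private static final double PROHIBITIVE = 100000.0;

        public WeightedPath findPath(Node origin, Node destination, final boolean isCop) {
            CostEvaluator<Double> costs = new CostEvaluator<Double>() {
                public Double getCost(Relationship r, Direction direction) {
                    if (isCop && Boolean.TRUE.equals(r.getProperty("isNoCops", false))) {
                        return PROHIBITIVE;    // effectively removed from consideration
                    }
                    double cost = 1.0;         // base movement cost
                    if (Boolean.TRUE.equals(r.getProperty("hasDoor", false))) {
                        cost += 100.0;         // discourage cutting through doors
                    }
                    return cost;
                }
            };
            PathFinder<WeightedPath> finder = GraphAlgoFactory.dijkstra(
                    PathExpanders.forTypeAndDirection(RelTypes.LEADS_TO, Direction.OUTGOING),
                    costs);
            return finder.findSinglePath(origin, destination);
        }
    }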

Implementing Neo4j required just a single class of about 300 lines, and took about half a working day including time spent tuning the costs.

Runtime performance is excellent. The Neo4j solution takes longer to bootstrap — about 5 seconds for 18,000 Rooms on my MacBook Pro, compared to just a few milliseconds for our deprecated home baked version. But, its runtime lookup performance is superior. We judge overall AI processing load by tracking average completion times of our one-per-second Behaviors pulse which triggers Behaviors throughout the game world. Average Behavior computation on my laptop was steady in the range of 120 to 150 milliseconds with the old grapher; with Neo4j it's been consistently around 80, an efficiency gain of 50% or better. Memory use looks about the same.

So far, runtime movement appears to be perfect. Everybody goes where they're supposed to, and they now do it with the subtle enhancements highlighted above. Importantly for us, it's now very easy to tune movement by simply adding new property flags to Relationships and looking for them in our custom CostEvaluator. And, we can easily enhance the system to adapt to runtime changes in the game world itself, for example when builders add new Rooms or 4d mazes rearrange themselves.

I'm impressed with how easy it's been to include Neo4j in our project. The API is intuitive and very nicely documented. I was able to take and tweak a Dijkstra example from the Neo4j web site with very little effort. Very pleased!

Neo4J Graph

Saturday, February 8, 2014

All AI in TriadCity is Event-Driven. Actually, Pretty Much Everything in TriadCity is Event-Driven.

All NPC behaviors in TriadCity are event-driven. Not just NPC AI, but all change everywhere in the virtual world, from the motion of subway trains to room descriptions morphing at night to the growth of fruit on trees. Here's a quick sketch of some TC event basics.

In TriadCity every action is encapsulated in an Event object. This is straightforward in OO Land. Take for example player-generated actions. Player Commands are expressed in the server code as Strategy Pattern objects encapsulating the necessary algorithm. Each Command will generate one or more Events as its outcome(s). The Steal command generates a ThiefEvent, which it parameterizes differently depending on success or failure. Commands instantiate the appropriate Event type at the appropriate moment within their algorithms, handing off the new Event instance to the current Room, which propagates it wherever it needs to go.

Here's the structure:

public interface Event {
        long getID();
        long getEventType();
        long getVisibility();
        String getSourceID();
        String getTargetID();
        String getRoomID();
        String getMessageForSource();
        String getMessageForTarget();
        String getMessageForRoom();
        // ...
}

The Steal command populates ThiefEvent with the information shown above and a ton more. The eventType will vary depending whether the Steal succeeds or fails; the assigned visibility will be computed depending on the Thief's Skill level and multiple variables. The messageForSource will be: "You steal the silver coins from the drunken Texan", or "You fail to steal the silver coins from the well-dressed man, and are detected." Appropriate messages are computed for the target and for bystanders in the Room. When ready, the ThiefEvent is passed to the Room, which in turn passes it to everybody and everything currently present.

What happens next?

Up to world authors.

One form of NPC behavior is triggered when Events are received. The Behavior instance tests the Event against its triggering conditions. Am I supposed to fire when the Event is a ThiefEvent and its getEventType() == STEAL_FAILS_EVENT_TYPE and the NPC I'm assigned to is the Event.getTarget()? Or in English, did someone just try to steal from me and fail? If the world author has assigned the TriggeredAttackBehavior to an NPC and given it those trigger conditions, then the victim will attack the Thief when those conditions are met. This is a form of rule-based AI, expressed in OO patterns.
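
In sketch form, with hypothetical Behavior and NPC types standing in for our real classes:

    // Illustrative only: a Behavior that fires when its trigger conditions match.
    public class TriggeredAttackBehavior implements Behavior {
        private final NPC owner;
        private final long triggerEventType;   // e.g. STEAL_FAILS_EVENT_TYPE

        public TriggeredAttackBehavior(NPC owner, long triggerEventType) {
            this.owner = owner;
            this.triggerEventType = triggerEventType;
        }

        public void onEvent(Event evt) {
            // Did someone just try to steal from me and fail?
            if (evt.getEventType() == triggerEventType
                    && owner.getObjectID().equals(evt.getTargetID())) {
                owner.attack(evt.getSourceID());   // hypothetical NPC methods
            }
        }
    }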

Who'll see what happened? Depends on their SeeSkill and other attributes. If observers are skilled enough, they may be able to watch Thieves stealing from victims. The victim won't see it happen, but observers with high enough SeeSkill might. One of many many forms of character subjectivity in TriadCity.

OO allows TriggeredBehaviors to be niftily flexible. Behaviors can be triggered if they receive any MovementEvent, or only if the MovementEvent.getEventType() == BOW_EVENT_TYPE. So: any movement in your Room and you'll jump to your feet on guard. But, you'll only Bow if someone Bows to you first.

All Behaviors are Event-triggered, even ones which appear random, for instance pigeons moving around the Gray Plaza. Although their movement Behavior is indeed semi-random, it's triggered by a worldwide pulse called the DoBehaviorsEvent, which is propagated everywhere in the game world once per second by an Executor running a background Thread. Every Behavior tests that DoBehaviorsEvent: do I ignore it, or randomly do something, or do something not randomly? TimeEvents trigger actions such as closing shops and going home, and so on.

Events are integral to Player subjectivity. One form this takes is that Player objects include a Consciousness Strategy object via composition. That Consciousness object acts as a filter on the Events passed through it. It has the ability to modify or ignore the Event.getMessageForXYZ() methods it encounters, or generate entirely new ones. This is one of the ways we can make you see things which maybe aren't even "there".

Events in TriadCity are like the chemical "messages" flying around your brain, telling you when to breathe and fire your heart muscle and decide to go online to play games.

Neo4j Graph

Friday, February 7, 2014

Why Java's a Good Platform for MMORPG Servers

I recently found a discussion of programming languages on a MMORPG hobbyist bulletin board. Much of the lore there was garbled, revolving around old but persistent myths of the performance characteristics of C, C++, and Java. I thought I'd share some hard-won wisdom.

To first state my claim to bona fides in this realm. In real life I'm a cloud-scale software architect at the senior executive level, meaning I've got a couple of decades' experience building really big commercial systems with bazillions of users. Plus I've got twenty years hands-on writing TriadCity, which I hope I can say without boasting is the most technologically sophisticated textual RPG to date. So here's the true dope.

Under super heavy load — truly massively multiplayer scale — Java has a slight advantage over other languages in scalability and perceived responsiveness because under load its garbage collection mechanism makes more efficient use of CPU cycles than hand-rolled memory management. Yes, I realize that contradicts the claims that many developers will make, but, it's not hard to understand. If you profile an Apache web server under very heavy stress, you'll find it spending a large percentage of its cycles calling malloc() and free(), where every call requires some number of CPU instructions. At its core it's just doing lots of String processing, but there you go: memory in, memory out, many many millions and millions of times once the load gets going. At some point, memory management will crowd business logic out of the processor, and perceived degradation is no longer graceful.

By contrast, Java allocates its heap in just one go, when its runtime initializes.  Garbage collection is a background process and on multi-core systems does its thing pretty efficiently. CPU time can be dedicated to business logic instead of system overhead. This plus team productivity are the two reasons Java dominates the market for enterprise application servers.

To highlight some qualifiers. A well-designed and written system in C or C++ will outperform a badly designed one in Java. "Better" is contextual: do you care about throughput, latency, graceful degradation under stress, or some other criterion? Maybe you're super proficient in one language over another: use that one.

Note in general that OO languages are a better match for game servers than procedural ones. I started TriadCity in C back in the day because at that time I was more proficient in C, and there was plenty of open source C game code to model things on. But I found in practice that some of the concepts I wanted to express just didn't "feel" intuitive in C, and it became harder and harder to be productive in the code base once it got large enough. Switching to Java solved these problems because thinking in objects maps more intuitively to virtual world concepts; and because, at least in my experience, it's far easier to organize a very large OO code base for extensibility and maintainability. I recently read a different hobbyist RPG board where the lead developer of a very popular old-school MUD written in C complained about the effort required to implement a new character class. I'm pretty sure I can tell you why it was so rough for him. I don't think it would have been as much of a chore in TriadCity.

To underscore this point: OO code bases scale better. Today TriadCity has about 40,000 server-side classes. No, seriously. But I can find them all easily, because organizing OO code is straightforward, and Java's package concept helps. More importantly, thoughtful OO design insulates you against brittleness, enabling evolution. Adding that new character class would have been less daunting in an OO world.

Neo4j Graph

Thursday, February 6, 2014

Death to Mazes Like Moria!

I suppose that mazes are a staple of the MUD tradition. Starting with Adventure, you're underground and you're lost. Throw in some orcs and you're in Moria. Whatevs.

We follow the tradition with lotso mazes, but, we try to make them interesting. This post'll be about the technology rather than the content, but, we really do try to make the content interesting too.

TriadCity Maze types are implemented via the Strategy OO pattern. There's that word "Strategy" again. We do a lot of that. A Java Interface called Maze defines the contract; multiple concrete implementations can be swapped out at runtime. Among the implications: world authors not only have a ton of maze algorithms at their disposal, but we can actually swap out algorithms at runtime whenever that seems the thing to do.

TwoDMaze is a simple x/y grid of Rooms, always in a rectangular pattern. World authors can specify how many Rooms wide versus tall, and the Builder tools will prompt them if they provide the wrong number of Rooms. Thus a TwoDMaze defined as x=7 y=5 has gotta have 35 Rooms assigned, no more, no less. TwoDMaze puts the Rooms in consistent order across the grid, that is, Room 1 will always be the upper left corner, and the Rooms will be assigned in left-to-right order at every server boot. Exits between Rooms will be assigned dynamically according to what maze-drawing algorithm is chosen by the world author. You guessed it, these algorithms are again Strategy objects. We've got Kruskal, Prim, Depth-First, and a pile of others. Exits will be computed and distributed around the maze at server boot.

Add a z dimension and it's a ThreeDMaze. Have the Exits dynamically reassigned at random or semi-random during server uptime and it's a FourDMaze — think Harry and Cedric racing for the cup.

We can vary all of these by not assigning the Rooms in consistent order. Just shuffle the Room[] before laying out the grid. Now there'll always be a Room 1, but it won't always be in the upper left corner. It could be anywhere, meaning that special fountain or other object of interest will have to be searched for anew after every server boot. Or shuffle them every hour, or shuffle the grid and the Exits and really confuse people.
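
The shuffle itself is a one-liner:

    // rooms is the author-assigned Room[]; Arrays.asList is array-backed,
    // so shuffling the list shuffles the array in place.
    java.util.Collections.shuffle(java.util.Arrays.asList(rooms));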

But wait, there's more. We don't have to always assign ordinary FreeExits — the Exits that allow movement between Rooms without impediment. They can be RandomDestinationTeleportationExits or any other Exit type we like, in whatever proportion we like. Yes, there's one extremely notorious maze full of nasty critters with pointy sharp teeth offering only a single teleporter to get out — no way you're going back the way you came. Bring a large party with supplies. Find that teleporter before you run out of grenades.

More still. Authors can assign Behaviors to the Rooms or the Exits, making things even more interesting. TelGar's writing a big slag heap you have to climb to reach something valuable. The slag heap kicks up tonso dust so there's no telling exactly where you are, and because the ground is loose, you're constantly being forced downward as the earth slides out from under you. It's a compelling adventure, enabled by combining a dynamically-changing Maze type with a SlideRoomBehavior which forces you to move in a certain direction, like a slide in a playground. You can eventually get to the top, but it's a struggle.

These Strategy Pattern objects allow the server code to remain tidy and easy to maintain, while putting a lot of power into the hands of world authors. Authors can assign whatever crazy algorithm best suits their thematic purpose. Just how lost do you want the punters to get? You've got lots of options. Somebody invents a new maze-generation algorithm? Great, we'll add it. Easy-peasy.

Death to mazes like Moria!

Color Maze Wallpapers

Tuesday, February 4, 2014

Using the "Strategy" OO Pattern for Player Subjectivity

Uniquely to any "game" we're familiar with, TriadCity imposes elaborate subjectivities on its players.

This means that if you and I walk our characters into the same Room at the same time, we might see it described differently depending on our characters' attributes, skills, moral alignment, history, and other variables. The type and degree of subjectivity is largely controlled by the world author who wrote the Room. Most authors tread lightly, but there's no reason they can't make things wildly different if that makes sense for their context.

It also means that one of our characters could be Schizophrenic. We can impose any manner of "unreal" experiences on characters, from highly distorted Room descriptions to conversations with other characters who don't actually exist.

The code which enables these two forms of subjectivity is conceptually simple. Each form relies on the Strategy OO pattern, applied in different places.  I'll briefly sketch them both.

Rooms, Items, Exits, Characters all share common attributes inherited from their parent class, WorldObject. These include Descriptions, Smells, Sounds, Tastes, and Touches. I'm capitalizing the nouns because these are Java Interfaces which WorldObject possesses by composition. A code snip for illustration:

public abstract class WorldObject {
        protected Descriptions descriptions;
        protected Smells smells;
        protected Sounds sounds;
        protected Tastes tastes;
        protected Touches touches;

        public final String getDescription(WorldObject observer) {
                return this.descriptions.getDescription(observer);
        }
        public final String getSmell(WorldObject observer) {
                return this.smells.getSmell(observer);
        }
        public final String getSound(WorldObject observer) {
                return this.sounds.getSound(observer);
        }
        public final String getTaste(WorldObject observer) {
                return this.tastes.getTaste(observer);
        }
        public final String getTouch(WorldObject observer) {
                return this.touches.getTouch(observer);
        }
}

Descriptions, Smells, Sounds, Tastes and Touches are Strategy Pattern Interfaces defining the simple contract which will be implemented by a rich hierarchy of concrete classes. At runtime, the correct implementing class is assigned to each WorldObject according to the directions given to that WorldObject by the author who wrote it.

For example, a world author may write a Room and choose to have its descriptions vary by hour of day. Using our authoring tools, she assigns it the TimeBasedDescriptions type, and writes a description for dawn, dusk, day, and night. At server boot, that Room will be assigned the type of Descriptions she specified, and populated with her writing. The TimeBasedDescriptions class has the responsibility of figuring out what time it is whenever its getDescription() method is called, and returning the right one.

That's not subjective, though. Although the descriptions vary by time of day, the same description will be shown to every character who enters the Room. If she wants to, our world author can be more adventurous, assigning concrete Description types from our library. She could choose AlignmentAndTimeBasedDescriptions which will show descriptions which vary by both time of day and character moral alignment; or Descriptions which vary by Gender, or health condition, or attribute status. Each concrete implementing class knows how to calculate the description which is appropriate for the observer. As you can see, the code for WorldObject remains conceptually simple.
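
A hypothetical sketch of one such concrete class; the real implementation differs in detail, and GameClock is a stand-in for however the server tracks TriadCity time:

    public class TimeBasedDescriptions implements Descriptions {
        private final String dawn, day, dusk, night;

        public TimeBasedDescriptions(String dawn, String day, String dusk, String night) {
            this.dawn = dawn;
            this.day = day;
            this.dusk = dusk;
            this.night = night;
        }

        public String getDescription(WorldObject observer) {
            // Figure out what time it is right now, return the right writing.
            switch (GameClock.currentPeriod()) {
                case DAWN: return dawn;
                case DAY:  return day;
                case DUSK: return dusk;
                default:   return night;
            }
        }
    }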

A second and often more radical flavor of subjectivity is implemented by a Strategy Interface called Consciousness. Each character has a composed Consciousness object which acts as a filter, transforming messages received via game Events, or not, according to its logic. The TriadCharacter class includes these snips:

public abstract class TriadCharacter extends WorldObject {
        protected Consciousness consciousness;

        public boolean processEvent(MovementEvent evt) {
                return this.consciousness.processEvent(evt);
        }
        public boolean processEvent(SenseEvent evt) {
                return this.consciousness.processEvent(evt);
        }
        public boolean processEvent(SocialEvent evt) {
                return this.consciousness.processEvent(evt);
        }
}

Concrete Consciousness implementations include SleepingConsciousness, FightingConsciousness, SchizophrenicConsciousness, IncapacitatedConsciousness, and many others. Their job is to analyze incoming Events and determine whether to pass along the messages associated with those Events unchanged, or dynamically altered. A simple change might be to squelch an Event altogether if the character is asleep. A radical one might be to modify a ChatEvent or TellEvent, or even generate entirely new ones, if the character is experiencing SchizophrenicConsciousness. In these cases, the Consciousness type is assigned by the server at runtime, not by world authors, and it'll be swapped out for a different one as circumstances change. So these transformations are extremely dynamic.
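
Two tiny illustrative implementations make the filtering idea concrete.  Only the SenseEvent overload is shown, and setMessageForTarget() is a hypothetical mutator:

    // Asleep: squelch the Event entirely; the character perceives nothing.
    public class SleepingConsciousness implements Consciousness {
        public boolean processEvent(SenseEvent evt) {
            return false;   // swallow the Event; no message reaches the player
        }
    }

    // Schizophrenic: pass Events along, but sometimes distort their messages.
    // The transform and its probability are invented for illustration.
    public class SchizophrenicConsciousness implements Consciousness {
        public boolean processEvent(SenseEvent evt) {
            if (Math.random() < 0.1) {
                evt.setMessageForTarget("You hear whispering, very close by.");
            }
            return true;
        }
    }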

Using the Strategy pattern via composition allows the class hierarchy to remain reasonably flat and really very simple. It would be possible for example to create a massive class tree resulting in things like TriadCharacterWithDescriptionsThatVaryByAlignmentAndTimeWithSchizophrenicConsciousness. But for obvious reasons that would be a nightmare to maintain. Composition keeps the WorldObject tree simple; Strategy objects chosen at runtime provide the rich library of behaviors.

a coloured rendering of a complex graph network

Thursday, January 30, 2014

"Subsumption" Architecture for NPC AI

TriadCity employs multiple AI strategies to drive NPC behaviors. Most of these are quite simple, AI being in my opinion far more artificial than intelligent. One of our most successful strategies to date is based on Rodney Brooks' "subsumption architecture" which we've borrowed from robotics. Here's a short description of how we use it.

Subsumption rejects the approach of modeling intelligence via symbolic representation, choosing instead to mimic lifelike behaviors by building high-level abstractions from layered libraries of low-level components. Lowest-level components might be as simple as "I hit something"; the next layer up might include avoidance strategies such as "back up and try a slightly random vector", or "move left several inches and try again". A next level up might abstract from particulars to "explore the world". Behaviors become more removed from details the higher they are in a pyramid of abstraction, with "behave like a human" at the top.

In TriadCity we implement subsumption by composing higher-level behaviors from a vast library of detailed lower-level ones. Low-level examples are "open the door", "look around the room", "pick up an object". These correspond to TriadCity's library of player-typable commands. There's a building-block behavior corresponding to each TC command. An intermediate abstraction might be "move from current location to the Virtual Vegas Casino", which involves querying a directed graph, moving along the retrieved path, opening any doors encountered, and so on. The next highest level might be "play slot machines in the Virtual Vegas Casino", which would require successfully moving to the correct location in the game world and executing the Playslot command. Higher still might be "gamble in the Virtual Vegas Casino", which would make a randomized selection between playing slots or roulette or etc. World authors don't need to interact with subsumed low-level components. They just assign high-level behaviors and the architecture manages details.

A common example in TC is the assignment of weighted "behavior groups" based on time of day. Let's say you assign your NPC two behavior groups, one for the dawn, day, and dusk hours, and a second for night time. Your daylight group may include sending the NPC to work; or having it hunt slave killers in Zaroff Park; or eat lunch at a favorite restaurant; or watch the chariot races in the Circus Maximus. You'd assign each of these possibilities a weight determining its likelihood of being chosen, plus a weight determining its likelihood of being swapped out for a different one. By night you might have your NPC eat dinner somewhere; go see a movie; go shopping; go back to hunting its favorite slave killers; or go home to sleep. The result is strikingly rich. If you follow this fellow around, he performs his varied activities with human-like unpredictability. He won't simply circle the same static course over and over, and so for example if there's some NPC you'd really like to stalk, you'll have to go find him first because he could be all over the place doing things that are appropriate to his character. This is trivial for world authors to work with, allowing rich individualization of non-player characters with little effort. It puts considerable AI sophistication into play without forcing authors to interact with it.
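
A minimal sketch of the weighted selection underneath; names are invented, and the real scheduler also handles the swap-out weights and time-of-day groups:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class BehaviorGroup {
        private static final Random RNG = new Random();
        private final List<Behavior> behaviors = new ArrayList<Behavior>();
        private final List<Integer> weights = new ArrayList<Integer>();
        private int totalWeight = 0;

        public void add(Behavior behavior, int weight) {
            behaviors.add(behavior);
            weights.add(weight);
            totalWeight += weight;
        }

        // Choose a behavior with probability proportional to its weight.
        public Behavior choose() {
            int roll = RNG.nextInt(totalWeight);
            for (int i = 0; i < behaviors.size(); i++) {
                roll -= weights.get(i);
                if (roll < 0) {
                    return behaviors.get(i);
                }
            }
            throw new IllegalStateException("unreachable");
        }
    }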

These "subsumption" behaviors contribute verisimilitude to the game world. In TC you don't get a lot of NPCs standing around waiting to be killed. We even send our judges home from court at night. There's continual motion and variation: you can stand on a popular street corner people-watching NPCs on their way to and from jobs and hobbies and murders and freeing slaves.

Richard Bartle wrote, "It would be great to have a virtual city with 100,000 virtual inhabitants, each making real time decisions as to how to spend their virtual lives. We may have to wait some time before we get this, though." Not really. TriadCity's done it for years.

Neo4j Graph 2