tag:blogger.com,1999:blog-87204243193301486982024-02-08T09:10:30.643-08:00The TriadCity Developers' BlogMark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.comBlogger18125tag:blogger.com,1999:blog-8720424319330148698.post-83365839718306981462018-07-16T22:09:00.000-07:002018-07-18T20:47:14.409-07:00AWS can be exasperatingly almost<p class="workbook">
We have a specific architectural need. We'd prefer to handle it with an AWS managed service. It's almost possible. But not quite.
</p>
<p class="workbook">
To explain. Our
<a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a>
server lives in the Oregon Region. With our current level of traffic we’ve decided not to shard the game world. So everyone on planet earth who visits TriadCity arrives at that single large-ish EC2 instance.
</p>
<p class="workbook">
To help with scalability we offload as much computation as possible. Much of the AI is computed by clients in a <a href="http://triadcitydevelopers.blogspot.com/2017/05/distributing-ai-computation-to-clients.html">distributed-grid architecture similar to SETI@home</a>. With the same goal in mind we offload TLS termination. Easy-peasy: incoming connections terminate TLS at a Classic Load Balancer, which forwards to the game server. We’d really prefer a Network Load Balancer: it’s cheaper and performs better. But, it doesn't terminate TLS. See? Exasperatingly almost.
</p>
<p class="workbook">
Now. We have some legacy native clients. But most players today connect over WebSockets from HTML5/jQuery clients in their browsers. We open a WSS connection to the CLB, which terminates TLS and passes the packets through.
</p>
<p class="workbook">
The architectural problem with WebSockets is our lack of control over socket parameters. With native clients we can set a long connection timeout, say five seconds. Then when a user in Seoul or Sydney or London connects to our infrastructure in Oregon, the client doesn't hang up if network latency causes connection negotiation to take a second or two. Not so with WS. The browser vendors control the connect timeout. It's the truly major downside of keeping an API simple. We're stuck with whatever their settings are, and they're pretty freakin' short. The result: users in international locales experience frequent connect failures.
</p>
<p class="workbook">
So let's work around that with proxies in Seoul, Sydney, London, and Virginia. We'll use Route 53's Latency Based Routing, so that clients transparently connect to whichever proxy gives them the best performance. WebSockets will connect to the proxy within its timeout period; the proxy will forward the still-encrypted TLS stream to our CLB in Oregon. Our schools win, too! (You might have to live in California to get that joke.)
</p>
<p class="workbook">
Straightforward in principle. But I'd sure like to not manage this infrastructure myself. Let's use the NLB for proxying! Perfect! But: nope, sorry: it can't forward across Regions. What if we peer the VPCs? You'd think the load balancers would be able to see instances in the peered VPC, but, not. Seriously? Yah. Not. Instances can see instances in peered VPCs, but load balancers can't.
</p>
<p class="workbook">
Well, fuck.
</p>
<p class="workbook">
There's actually no AWS-native way to do this. You have to use EC2 instances running HAProxy or NGINX or whatever your favorite flavor is.
</p>
<p class="workbook">
And the pain doesn't stop there. For High Availability you have to run a cluster of EC2s behind an ELB. So now we've got an ELB, EC2s in multiple AZs, proxy software, blah blah, all of which we have to pay for, monitor, worry about, concern ourselves with. Fuck fuck fuck. If the LBs could forward across Regions or to instances in peer'd VPCs, we wouldn't have to do any of this.
</p>
<p class="workbook">
So where are we today? At the moment we're just proving that the concept works. I've got small EC2s in several Regions running minimal little socat. So far so good. It's not super scalable and it's not HA. Maybe when we have more money.
</p>
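In production those proxy boxes run socat, with something along the lines of <code>socat TCP-LISTEN:8443,fork,reuseaddr TCP:our-clb.example.com:443</code> (hostname and ports hypothetical). Conceptually the proxy is nothing but a blind byte relay that never decrypts the stream. Here's a minimal Python sketch of the same idea:

```python
import socket
import threading

def pipe(src, dst):
    """Copy bytes one direction until either side closes."""
    try:
        while (chunk := src.recv(4096)):
            dst.sendall(chunk)
    except OSError:
        pass
    finally:
        for s in (src, dst):
            try:
                s.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass

def make_listener(port=0):
    """Bind a listening socket; port 0 picks any free port."""
    s = socket.socket()
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

def relay(listener, upstream_host, upstream_port):
    """Accept connections and shuttle bytes, unmodified, to the upstream."""
    while True:
        client, _ = listener.accept()
        upstream = socket.create_connection((upstream_host, upstream_port))
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()
```

Because the relay never touches the payload, TLS still terminates at the CLB in Oregon; the proxy exists only to complete the TCP and WebSocket handshakes quickly, close to the user.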
<p class="workbook">
Dear AWS: please tweak your LBs to allow forwarding across Regions. Or to see instances in peer'd VPCs. Either would be fine.
</p>
<p class="workbook">
Meanwhile you’re exasperatingly almost.
</p>
<figure class="artwork" style="width: 50%;">
<a href="http://prodotti.hostingextra.it/offerte/illimitate/server-dedicati"><img alt="Unplugged Cable" src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/unplugged_cable.jpg" /></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-69899187829077613152017-06-03T17:00:00.000-07:002018-05-13T12:16:54.664-07:00Skynet Versus Dunning-Kruger: Who Will Win?<p class="workbook">
I remember vividly my first encounter with computer programming. It was 1975 and it was a simple space-conquest game the kids played at school. Perhaps an Altair, perhaps running Altair BASIC. I wanted to add some rules so I opened the source and typed them in natural language. Of course the compiler barfed, and the lesson I drew from that experience was extremely to the point: "I thought computers were supposed to be smart. This is the stupidest thing I've ever seen."
</p>
<p class="workbook">
It was a naive opinion but entirely right.
</p>
<p class="workbook">
Now for twenty years I've programmed highly advanced "artificial intelligence" driving the complex, rich, imaginary world <a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a>. It's still the stupidest thing I've ever seen. It's brilliant and nuanced and entirely fake.
</p>
<p class="workbook">
Humanity need not fear computer "intelligence". The proper fear is computer stupidity, paired with the hubristic self-congratulation of programmers and architects and managers and defense officials and generals who genuinely imagine themselves capable of forecasting every conceivable scenario and use case and edge case and intersection. Where the end of the species comes not at the hand of Skynet, but of Dunning-Kruger.
</p>
<figure class="artwork" style="width: 50%;">
<a href="http://www.jacobandgwenlawrence.org"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/JacobLawrence.1.jpg" alt="Jacob Lawrence painting" /></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-85087993943159950682017-05-31T17:12:00.000-07:002018-07-17T08:11:27.948-07:00Distributing AI Computation to the Clients<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a> relies on a lot of AI. AI-driven characters have employers to report to, homes to sweep up, meals to cook, hobbies to pursue, dates to go out on. Sports leagues play games which simulate real life; AI-driven commodity speculators calculate optimal bids. It's a lot of AI. The old-school way to scale would be to verticalize the game server: more CPUs, more RAM, bigger-faster-more. The game-y way would be to shard the game world across multiple servers horizontally, which for now we want to avoid. Either way, we're a tiny little struggling game company and we want to keep our costs as close to nil as possible. Besides, brute force seems really kinda conceptually offensive for tasks like these.
</p>
<p class="workbook">
There's a better way. We can borrow cycles from players' computers by offloading portions of these computations to our TriadCity clients. It's a form of grid computing which distributes our scaling problem across many connected machines.
</p>
<p class="workbook">
We do a lot of this. I'll note two contexts here which are interesting to me.
</p>
<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/MOTD/motd.2014-01-14.jsp">NLP</a> is handled by the client. Players type natural English: "Show me what’s inside the red bag"; or, "What’s in the bag?" Client code parses those inputs before transmitting them to the server, translating them into canonical MUD-style commands based on old-timey DIKUmud. In this example, the client sends "l in red bag" or "l in bag". Offloading this computation is very helpful in keeping the server small. Individually these computations aren't taxing, but scaled to tens of thousands per second they demand nontrivial horsepower. Pushing that CPU cost to the client is a simple form of grid computing which is easy to implement and makes sense in our context.
</p>
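To make the translation concrete, here's a toy sketch of the client-side idea. The patterns, and every command beyond the "l in" example quoted above, are illustrative, not our actual grammar:

```python
import re

# A few illustrative patterns mapping natural English onto canonical
# MUD-style commands; the real client grammar is far larger.
PATTERNS = [
    (re.compile(r"^(?:show me )?what(?:'|’)?s (?:inside|in) (?:the )?(.+?)\??$", re.I),
     lambda m: "l in " + m.group(1)),
    (re.compile(r"^look at (?:the )?(.+)$", re.I),
     lambda m: "l " + m.group(1)),
]

def to_canonical(utterance):
    """Translate a natural-language input into a canonical command,
    passing it through untouched if nothing matches."""
    text = utterance.strip()
    for pattern, rewrite in PATTERNS:
        match = pattern.match(text)
        if match:
            return rewrite(match)
    return text
```

The server only ever sees the canonical form, so the client-side grammar can grow without any server changes.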
<p class="workbook">
Somewhat more elaborately, we use the clients to pre-compute sporting events, for example, chariot races modeled with accurate 2D physics. Think of SETI@Home: we push a data blat to the client with the stats of the chariot teams, their starting positions and race strategies; the client computes the race move-by-move, and sends a computed blat back to the server. The server caches these as scripts for eventual playback. To players the races <i>appear to be</i> real-time, but, they’re actually pre-determined by these client-side calculations. The arithmetic really isn’t that complex, but, it’s easy to distribute, and we want to take advantage of whatever server-side processor savings we can.
</p>
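Hand-waving the physics entirely, the round trip looks something like this sketch: the incoming blat carries team stats, the client ticks through the race deterministically, and returns a frame-by-frame script the server can cache for playback. All field names and the toy "physics" here are invented for illustration:

```python
import json

def run_race(blat_json):
    """Client side: turn a race-definition blat into a replayable script.
    Toy deterministic 'physics': each tick every team advances by its speed."""
    blat = json.loads(blat_json)
    positions = {t["name"]: float(t["start"]) for t in blat["teams"]}
    speeds = {t["name"]: float(t["speed"]) for t in blat["teams"]}
    finish = blat["track_length"]
    frames = []
    while max(positions.values()) < finish:
        for name in positions:
            positions[name] += speeds[name]
        frames.append(dict(positions))
    winner = max(positions, key=positions.get)
    # The reply blat: a script the server caches and replays as if live.
    return json.dumps({"winner": winner, "frames": frames})
```

Determinism matters: because the outcome follows entirely from the input blat, the server could in principle re-run any race to spot-check a client it doesn't trust.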
<p class="workbook">
There's no inconvenience to players. Consumer computers spend bajillions of cycles idling between keystrokes. We borrow some. Nobody notices.
</p>
<p class="workbook">
The major downside is that TriadCity players are locked to a specialized client: for example, they can't use dumb telnet clients. This is an obstacle for blind players. Telnet clients are simple for screen readers to manage; our GUI clients have been challenging to integrate with them. As a workaround we've written a specialized <a href="https://www.smartmonsters.com/TriadCity/MOTD/motd.2014-10-17.jsp">talking client</a>, which many sighted players turn out to prefer; I use it most of the time. But this isn't really where we want to live. We could enable telnet clients and forgo NLP for their users. We dislike that idea a lot, since it would violate our commitment to enabling blind players as first-class citizens in our universe, but we may go ahead and do it anyway based on player requests. Haven't decided yet.
</p>
<p class="workbook">
Pushing AI to the client keeps the server small-ish. It's currently an 8-core, 16GB one-unit pizza box in a co-lo, which we intend to migrate to AWS in a few months. Despite all the AI, it can handle tens of thousands of players no sweat. If we actually had that many players we'd be thrilled.
</p>
<p class="workbook">
Please bring your friends.
</p>
<figure class="artwork" style="width: 50%;">
<a href="https://www.extremetech.com/computing/140082-intel-finally-releases-9500-series-itanium-announces-plans-to-merge-itanium-and-xeon"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/IntelTitanium9500.jpg" alt="Intel Itanium 9500" /></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-49966988059500633312016-05-12T17:22:00.000-07:002018-05-13T12:08:19.191-07:00Simulating Human Geographical Knowledge With Neo4j<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a> uses multiple AI architectures to model human behavior in different contexts. Even the simplest is pretty sophisticated: the <a href="https://triadcitydevelopers.blogspot.com/2014/01/subsumption-architecture-for-npc-ai.html">subsumption architecture</a> for in-game automata does a great job of keeping the City vibrant, where <a href="https://triadcityauthors.blogspot.com/2014/04/the-critical-mass-problem-lot-of.html">AI-driven characters</a> have homes to go to, jobs to be employed at, hobbies to cultivate, and relationships to explore. At the higher end of the spectrum, we use external bots which log in to the game world just as human players do, mimicking human decision-making and in a very limited way mimicking human speech.
</p>
<p class="workbook">
There's a trade-off between in-game and external AI which might not be immediately intuitive. While external AI has access to greater compute resources, in-game AI has access to the database. Think of movement. An AI-driven in-game character finds its way around the world by querying a <a href="https://triadcitydevelopers.blogspot.com/2014/01/subsumption-architecture-for-npc-ai.html">weighted directed graph</a> which the game server keeps in memory. The model goes like this: in-game AI are residents of the City; they've been there a while, and they know their way around. External AI face a different problem. If we want to model human behaviors, they can't simply query the database to ask how to get from point A to point B. Like humans they must explore the world, building their own mental representation of its geography, analogous to the paper or Visio maps drawn by players.
</p>
<p class="workbook">
Here's a practical problem external AI characters must solve to survive. "My canteen's empty; I'm thirsty. Where is that fountain I passed a while back?" A human will probably use ordinary geospatial reasoning to reckon a reasonable path. A naive AI can walk back through a history of its moves. A more sophisticated, human-like AI needs an internal representation of where it's been, what it's found there, and how to navigate the most reasonable path back.
</p>
<p class="workbook">
I've recently enhanced our external bots to build persistent knowledge maps of the world as they explore, using the excellent open-source graph database <a href="http://neo4j.com/">Neo4j</a>. This is <a href="https://triadcitydevelopers.blogspot.com/2014/01/subsumption-architecture-for-npc-ai.html">the same technology the TriadCity game server uses</a>. The server's use case is extremely simple. At boot, once the game world is loaded into memory, a Neo4j graph of all of the <a href="http://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/rooms.jsp">Rooms</a> and their relationships is made, with properties describing path attributes. At runtime, AI characters simply query the graph for a path and off they go. The database isn't persisted, it's re-created at each boot, and is updated on the fly as World Builders add or delete Rooms. The external AI use case is more nuanced, where the bot must build a persisted graph room-by-room as it explores. Fortunately the Neo4j code is again very simple.
</p>
<p class="workbook">
Each time the bot enters a Room, it asks the graph if that Room is already present. If no, it creates the new Node, populating its properties with information it may want later: is there a water source here, a food source, a shop, a skill teacher? Along with the new Node, it adds a Relationship from the originating Node to the new one. A single simple LEADS_TO RelationshipType is all that's needed. If yes — the Room was already present in the graph — it adds the Relationship linking the origin and destination Nodes. With a little work we can add the cost of the move to the new Relationship: cache the bot’s energy before the move and compare it to when the move is complete; make that cost a property of the Relationship. The robot now has a comprehensive map of all the Rooms it's visited, the paths between them, and their costs. Finding the closest Room with a water source is a simple shortest-weighted-path traversal.
</p>
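Stripped of the database machinery, the bot's map is just a weighted directed graph, and "find the nearest water source" is a shortest-weighted-path search. Here's a plain-Python sketch of the same logic; Neo4j replaces the hand-rolled Dijkstra below with a single traversal query, all names are illustrative, and the per-edge cost stands in for the energy delta described above:

```python
import heapq

class WorldMap:
    """The bot's persistent picture of everywhere it's been."""
    def __init__(self):
        self.props = {}   # room id -> properties observed there
        self.edges = {}   # room id -> {neighbor: movement cost}

    def visit(self, origin, room, cost=1.0, **properties):
        """Record arriving in `room` from `origin`, LEADS_TO-style."""
        self.props.setdefault(room, {}).update(properties)
        if origin is not None:
            self.edges.setdefault(origin, {})[room] = cost

    def cheapest_path(self, start, want):
        """Dijkstra to the nearest room whose properties satisfy `want`,
        e.g. want=lambda p: p.get('water'). Returns (cost, path) or None."""
        queue = [(0.0, start, [start])]
        seen = set()
        while queue:
            cost, room, path = heapq.heappop(queue)
            if room in seen:
                continue
            seen.add(room)
            if want(self.props.get(room, {})):
                return cost, path
            for nbr, step in self.edges.get(room, {}).items():
                if nbr not in seen:
                    heapq.heappush(queue, (cost + step, nbr, path + [nbr]))
        return None
```

The graph database earns its keep once the map is large, persistent, and shared with richer queries; but the mental model is exactly this.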
<p class="workbook">
Movement behavior is now more intelligent. Bots already knew not to re-enter the Room they previously left. Now they can prioritize directions they haven't yet explored. If a Room has Exits north, south and west, and the bot's previously explored north and west, next time it'll go south. The resulting behavior is less random, more like the way a systematic human would do things. Also, the bot can take the least expensive path back to a water source; that is, it'll get onto the trail rather than blundering through thickets. Nice.
</p>
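The exploration priority reduces to a few lines. A hypothetical sketch, with direction names as plain strings:

```python
import random

def next_direction(exits, explored, came_from=None):
    """Prefer exits we haven't tried yet; never bounce straight back
    through the door we just came from unless it's the only option."""
    choices = [d for d in exits if d != came_from]
    if not choices:
        return came_from
    untried = [d for d in choices if d not in explored]
    return random.choice(untried or choices)
```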
<p class="workbook">
Survival skills are improved. The bot now knows the comparative distances to multiple water and food sources, and can compare their relative merits. It keeps a catalog of shops with their Item types and can return to buy something useful to it once it has the necessary scratch. It retains the locations of hidden treasures it can revisit if its current strategy is wealth acquisition, and it knows where the ATMs are so it can deposit its loot. It can potentially assist human players with directions, or bring them food or water if they're without — I'm thinking of implementing an <a href="http://www.smithsonianmag.com/history/a-brief-history-of-the-st-bernard-rescue-dog-13787665/?no-ist">Alpine Rescue Dog</a> who can bring supplies to stranded players and guide them out of mazes. Yes, it'll have a barrel around its neck. The barrel was previously doable — now the rescue is, too.
</p>
<p class="workbook">
There are some nuances to manage. One obvious one is that in TriadCity, Room Relationships aren't 100% reciprocal. Certain mazes are especially problematic: if you enter a Room from the East, returning to the East doesn't necessarily take you back to the Room you originated from. Sneaky us. So there's an impedance mismatch: our graph model treats a Relationship as implying a return path, but TriadCity reality isn't always so. Also, of course, the mazes change. The solution we've implemented for now is simply not to persist maze graphs, although that's not particularly satisfying. More thought required.
</p>
<p class="workbook">
A perhaps less obvious problem is closed doors. If the bot finds an exit West, and tries to move through it, but the door is closed, we have to parse the failure message ("Perhaps you should open the &lt;thing&gt; first?") and delete the Relationship. The ugly part is that the same message can apply to either doors or containers: "Perhaps you should open the refrigerator first?" We can try to disambiguate by keeping a cache of the last several commands, doing our best to figure out whether a move succeeded or not.
</p>
<p class="workbook">
But Threads are our enemy here: events and messages are pushed to us from the server, and the Thread driving bot behaviors doesn't stop for 'net lag. The danger is that the graph becomes garbled if the bot fails to recognize that while it intended to move, it in fact did not. We address this today by periodically walking the bot to a known location. If that traversal fails to arrive at the expected Room, we delete the portions of the graph added since the most recent successful walk-back, and begin graphing again from wherever we happen to be, hoping that before long we'll meet up with Rooms we already know and can link the new graph segment to the old. Not entirely pretty, and we're open to suggestion.
</p>
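The walk-back recovery amounts to journaling: remember everything added since the last known-good checkpoint, and throw it away if the verification walk fails. A schematic sketch, not our actual code:

```python
class JournaledMap:
    """A map that can roll back everything learned since a checkpoint."""
    def __init__(self):
        self.edges = set()    # (origin, destination) pairs
        self.journal = []     # edges added since the last checkpoint

    def add_edge(self, origin, destination):
        edge = (origin, destination)
        if edge not in self.edges:
            self.edges.add(edge)
            self.journal.append(edge)

    def checkpoint(self):
        """Call after a successful walk-back to a known Room."""
        self.journal.clear()

    def rollback(self):
        """Walk-back failed: the bot isn't where it thinks it is.
        Discard everything recorded since the last good checkpoint."""
        for edge in self.journal:
            self.edges.discard(edge)
        self.journal.clear()
```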
<p class="workbook">
None of that is Neo4j's fault. The graph db is a stellar addition to our external AI. It lets the robots "think" similarly to the ways humans do.
</p>
<p class="workbook">
Here's how one of our bots thinks of Sanctuary Island, the exact geographical center of the City. Experienced TriadCity players will note that its north and south poles are reversed. But that doesn't matter at all from the bot's POV. The point is that it can now easily return to the Temple of the King to <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Character/leveling.jsp">level</a>, or the Fountain to refill a canteen, and so on. This image was generated by Neo4j's built-in graph visualizer, which is pretty spiffy:
</p>
<figure>
<img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/Neo4JGraph.1.png" alt="Neo4j graph of Sanctuary Island" />
</figure>
<p class="workbook">
Embedding Neo4j this way was <a href="https://triadcitydevelopers.blogspot.com/2014/04/npc-movement-with-neo4j.html">again</a> extremely simple. One class, a few dozen lines of code, easy-peasy. My one complaint is that Cypher, the Neo4j query language, has been a moving target. The change which tripped me up is related to revised indexing schemes which first deprecated, then eliminated Cypher constructs such as "START". The first edition of O'Reilly's <a href="http://www.amazon.com/exec/obidos/ASIN/1491930896/smartmonsters">Graph Databases</a> is out of date; the one I've linked to should be current. These syntax changes cost me some Internet research. Time to buy the new edition.
</p>
<p class="workbook">
I'll close by noting an interesting possible future enhancement, now that we have Neo4j embedded with the robot code. Seems the perfect platform for implementing decision trees. The bots I’ve been describing are rules-based, via <a href="http://www.drools.org/">Drools</a>. Enhancing their decision-making realism via decision trees is attractive — I've long planned to explore this idea for modelling addiction in-game. Perhaps our external bots will be the prototype platform for this type of AI. Watch this space.
</p>
<p class="workbook">
[Addendum 2016.05.21: view of Sanctuary Island and part of the Northwest Third from space. Doesn't this graph look like a three-dimensional
sphere — a universe?]
</p>
<figure>
<img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/Neo4JGraph.2.png" alt="Neo4J graph" />
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-17629119699128794862016-04-08T17:45:00.000-07:002018-05-13T11:50:57.687-07:00Formatting Database Reports Using Serverless Computing<p class="workbook">
Serverless computing is the obvious endpoint of contemporary technology evolution. From onsite hardware to self-managed datacenters to virtualization to public cloud, the trend resolves in a very attractive way with the end result that your business logic executes “somewhere”, you don't need to worry where. An automated platform manages scalability, availability, and security; you're billed only for the cycles you use, not for the idle headroom you once needed to provide. On AWS the future is <a href="https://aws.amazon.com/lambda/">Lambda</a>: we’ve chosen database reporting for our initial experience.
</p>
<p class="workbook">
In the old days we’d run a suite of weekly cron jobs on the database server firing report scripts, piping their output to mailx. Business analysts, product owners and technology managers would receive human-readable columnar reports in their inboxes, which they'd study and archive however they liked. Pretty straightforward. It's possible to duplicate the same concept in the cloud, and this is initially how we started: install mailx on one of the MongoDB secondaries, have cron run weekly jobs, mailing not particularly human-readable JSON output to the usual suspects — thought being we'd get that JSON human-formatted real soon now.
</p>
<p class="workbook">
This strategy takes poor advantage of the AWS platform. A more sophisticated approach is to save report outputs to a weekly bucket on <a href="https://aws.amazon.com/s3/">S3</a>, then mail a link to that bucket to the interested parties. This centralizes the report archive in a pre-organized repository, and gets the weekly email down from a mini-storm to a single message with links. Easy change, then: instead of piping report output to mailx, pipe it to a super-simple shell script which reads it from stdin, figures out if a new S3 bucket is needed, creates one if yes, and writes a JSON file to the bucket. Here’s our version, for your Googling pleasure:
</p>
<blockquote>
<p class="workbook">
#!/bin/bash
<br>
# Writes input from standard in to an S3 bucket named by date.
<br>
# The file name is passed as param $1 to the script.
<br>
# 2016.04.04 MP.
</p>
<p class="workbook">
DATE=`date +%Y-%m-%d`
<br>
S3_BUCKET=&lt;base_directory&gt;/$DATE
</p>
<p class="workbook">
# Create the bucket if it doesn't exist:
<br>
if aws s3 ls "s3://$S3_BUCKET" 2&gt;&amp;1 | grep -q 'NoSuchBucket'
<br>
then
<br>
aws s3 mb "s3://$S3_BUCKET"
<br>
fi
</p>
<p class="workbook">
# Write stdin to a new file in the bucket:
<br>
aws s3 cp - "s3://$S3_BUCKET/$1.$DATE.json" --content-type text/plain --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers
</p>
</blockquote>
<p class="workbook">
Now we simply add a concluding cron job to send an email with the links. Easy-peasy. No need for mailx anymore: let's use AWS <a href="https://aws.amazon.com/ses/">SES</a> instead, which offers enhanced security — no more mailx credentials inside a pseudo-user's home directory. Instead, SES access is managed by <a href="https://aws.amazon.com/iam/">IAM</a>. Nice.
</p>
<p class="workbook">
This outcome already leverages serverlessness. S3 is an abstracted service, where users pay no attention to concepts of physical location, disk capacity, backup, redundancy, scale, or Web server infrastructure. SES and IAM are also serverless. They're "somewhere" but we don't care where. That's a start.
</p>
<p class="workbook">
But here's the thing. As noted, the reports are flippin' ugly. MongoDB's JSON output is serial rather than columnar, and there's no built-in utility analogous to SQL*Plus which can easily convert JSON blobs to spiffy tables. So let's fix this. We'll write our own down-and-dirty little formatting tool converting JSON arrays to tabular HTML. Where should that tool run? Lambda's perfect. With Lambda we don't maintain or pay for <a href="https://aws.amazon.com/ec2/">EC2</a> instances with inevitable idle time; we simply let Lambda fire up our converter as the weekly cron jobs require. It's an event-driven compute pipeline: cron fires reports, pipes them to a script which writes them to S3, S3 notifies Lambda that new JSON files are written, Lambda converts them to HTML, and SES mails a notice.
</p>
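Our converter was Node.js, but the gist fits in a few lines of any language. Here's a simplified Python sketch; column handling and styling are far more naive than even our down-and-dirty version:

```python
import html
import json

def json_to_table(json_text, columns=None):
    """Render a JSON array of flat objects as a bare-bones HTML table."""
    rows = json.loads(json_text)
    if columns is None:
        # Default column order: the keys of the first row.
        columns = list(rows[0]) if rows else []
    out = ["<table>",
           "<tr>" + "".join(f"<th>{html.escape(str(c))}</th>" for c in columns) + "</tr>"]
    for row in rows:
        cells = "".join(f"<td>{html.escape(str(row.get(c, '')))}</td>" for c in columns)
        out.append("<tr>" + cells + "</tr>")
    out.append("</table>")
    return "\n".join(out)
```

In the Lambda version, the S3 "object created" event hands the function the bucket and key; the function fetches the JSON, runs a transform like this one, and writes the HTML back alongside it.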
<p class="workbook">
I chose to write the converter as a Node.js script. My inexperienced guess was that this would be less cumbersome than a Java program, avoiding the hassles of frequently uploading new iterations during development and generally, as the lore insists, being faster to work with. I was wrong in every assumption.
</p>
<p class="workbook">
First, non-trivial Node.js programs can't be entered directly into the Lambda inline editor because necessary Node.js packages aren't available there. In our case, the "async" package is required to manage the asynchronous programming model of AWS API calls to S3. So, you know: <i>le sigh.</i> I ended up spending about 12 hours on the converter, where some of that time was my Node.js learning curve, but far too much of it was the inevitable save-zip-upload-test-logscan cycle which even Node.js non-noobs are likely to experience.
</p>
<p class="workbook">
Second, lack of compiler support made the cycle inevitably more onerous, as every little typo or missing semicolon made it all the way through the loop, to bubble up in the log rather than under my fingers while typing the code. While I realize that Node.js has considerable real-world momentum, I can't imagine using it for nontrivial production applications. I strongly suspect that any large-scale production code base can be written in Java in one-fourth the time, and will be easier and less expensive to maintain over its life cycle. Experience will tell, of course. But you heard it here first.
</p>
<p class="workbook">
Additionally, there's a little bit of lag, often 15-20 seconds, between running a test iteration and viewing Lambda's log output in <a href="https://aws.amazon.com/cloudwatch/">CloudWatch</a>. For retrospectively scanning production logs, this will seldom be a problem, and in fact CloudWatch logs themselves can be processed by Lambda functions, a pretty great thing. During development it's a hindrance.
</p>
<p class="workbook">
Summary: the experience was more of a chore than it should have been.
</p>
<p class="workbook">
Ultimately, the manual save-zip-upload-test-logscan process can’t scale to production. The AWS CLI includes APIs for pushing functions to Lambda. Developers who prefer the command-line can homebrew their own Continuous Integration / Continuous Delivery scripts. Larger shops will want to automate their CI/CD pipelines with Jenkins or equivalent. This'll be prominent in our enhancements list.
</p>
<p class="workbook">
Future TODOs: add a step generating a single HTML index file with links to the reports; mail a link to that index file, instead of two dozen links to the individual reports. Spiffy up the report format, which right now is down-and-dirty unstyled HTML circa 1994. Get the Lambda functions into Git and add the CI/CD pipeline to Jenkins. Consolidate multiple report formatters to a single converter, presumably relying on metadata such as column names added to the scripts generating JSON report output. You could reasonably sum these up with the single word "professionalize". My POC is pretty amateur.
</p>
<p class="workbook">
But it works, and that's impressive. No production infrastructure to maintain, just pure code execution which automatically scales, is inherently secure, and is highly available. For the few cents a month which runtime will cost, this is really seriously something. Lambda can be used in the event-driven architecture described here, or can be fronted by AWS's <a href="https://aws.amazon.com/api-gateway/">API Gateway</a> service providing RESTful front-ends to Lambda functions. Perhaps the most serious downside to the Lambda ecosystem today is lack of persistent database connections: every db request requires connection/authentication/authorization, which is likely too much lag for many real-world production scenarios. You have to suspect that'll be fixed soon.
</p>
<p class="workbook">
Very psyched to enter the new world.
</p>
<p class="workbook">-----------------------</p>
<p class="workbook">
Addendum 2016.04.21:</p><p>In retrospect, it would have been simpler to design a one-step solution from the outset. For example, a Python script to query the db, transform JSON to HTML, and write the final result to S3. If I were starting from zip, that would be attractive. But I started with MongoDB "native" JavaScript queries, so, with those already in place, it was simpler to leverage S3's event-driven integration to Lambda for the HTML transformations.
</p>
<p class="workbook">
Since the original post I've consolidated what began as individual functions per report to a genericized formatter accepting directives passed as input; it's now the responsibility of the JavaScript query scripts to pass those configuration variables to the HTML processor. This makes it easy to add new reports — no need to update the HTML converter. I've also found a package called "node-lambda" (<a href="https://www.npmjs.com/package/node-lambda">https://www.npmjs.com/package/node-lambda</a>) which promises to eliminate or at least simplify the inefficient code-save-zip-upload-test-logscan loop I naively began with. Report to follow if it works out.
</p>
<p class="workbook">
But lastly, I'm pretty sold on abandoning Node.js for Python. Node's asynchronous model is just frankly cumbersome. The resulting code is ugly, hard to scan, error-prone, and slow to work with. Python has its own quirks, particularly pymongo's departures from canonical MongoDB shell syntax ("find_one" instead of "findOne", and different quoting rules). But it's clean, and straightforward to work with. Watch this space.
</p>
<figure class="artwork" style="width: 50%;">
<a href="http://prodotti.hostingextra.it/offerte/illimitate/server-dedicati"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/unplugged_cable.jpg" alt="Unplugged Cable" /></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-66927926720070966472014-10-16T18:54:00.000-07:002018-05-13T11:47:05.598-07:00Migrating SmartMonsters to the Cloud. Part Four: the Web Site.<p class="workbook">
<a href="https://www.smartmonsters.com/Welcome/index.jsp">SmartMonsters' Web site</a> was one of the first in our experience to be fully dynamic. Every page is a JSP. Written c. 2001, the goal was similar to elaborate turnkey systems such as Kana: content personalization based on user profiles.
</p>
<p class="workbook">
Although we lost interest in that model as advertising became unviable on smaller Web properties, it still has practical uses. With little effort we can reconfigure displayed pages for users who self-identify as visually impaired, while easily enabling a sophisticated security model.
</p>
<p class="workbook">
Architecturally, the site suffers from age-related drawbacks. Sessions are old-school server-side JEE requiring replication between instances. Single-instance content caches minimize database roundtrips, but ensure that instances will be out of sync. The forum software caches especially aggressively, guaranteeing inconsistency.
</p>
<p class="workbook">
We'll fix these pre-Cloud disadvantages during migration.
</p>
<p class="workbook">
With JEE sessions we don't want stickiness via the load balancer: if Auto Scaling terminates instances, we'd lose any nonreplicated JEE sessions unique to the evaporating instances. Replication is necessary. Memcached is tried-and-true; I opted instead to have EB manage replication via DynamoDB. Some configuration is required, see <a href="https://packageprogrammer.wordpress.com/2014/01/13/aws-elastic-beanstalk-and-dynamodb-session-manager-for-tomcat/">here</a> for an example. Once configured it operates transparently. There's an odd downside though: the auto-generated DynamoDB session records don't include a TTL attribute, so old session records linger permanently until manually deleted. That surprised me.
</p>
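<p class="workbook">
One fix, assuming you enable DynamoDB's TTL feature on the session table, is to stamp each record with your own epoch-seconds expiry attribute so stale sessions are purged automatically. A minimal sketch — the class and method names here are hypothetical, only the epoch arithmetic is real:
</p>

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class SessionTtl {
    // DynamoDB's TTL feature deletes items once the epoch-seconds value in a
    // designated attribute has passed. Compute that value from the session's
    // creation time plus the configured session timeout.
    static long expiresAt(Instant createdAt, long sessionTimeoutMinutes) {
        return createdAt.plus(sessionTimeoutMinutes, ChronoUnit.MINUTES).getEpochSecond();
    }

    public static void main(String[] args) {
        // A session created at midnight with a 30-minute timeout expires at 00:30.
        long ttl = expiresAt(Instant.parse("2018-05-13T00:00:00Z"), 30);
        System.out.println(ttl);  // 1526171400
    }
}
```

The attribute would be written alongside the auto-generated session record; DynamoDB then handles the cleanup without a cron job.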
<p class="workbook">
In-memory caches which previously used Ehcache can be migrated to memcached on ElastiCache. Elastic Beanstalk server instances can then share a central ElastiCache cloud. We chose memcached over Redis because the cached Forum data are frequently Maps or serialized objects which the main Java Redis client, Jedis, can't handle. Internally our caching API is exposed via the Strategy OO pattern, so it was straightforward to write a memcached implementation of the pattern's Interface which talks to ElastiCache.
</p>
<p class="workbook">
As an aside, offloading cache memory to its own cloud allows the main app servers to downscale. This makes it inexpensive to generalize caching of dynamic information that's costly to compute, for example derived data like <a href="https://www.smartmonsters.com/TriadCity/Newspaper/index.jsp">these</a>. Ultimately these weekly calculations live in a database, but we can now easily do the classic cache-first lookup, saving the db roundtrip.
</p>
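<p class="workbook">
In outline the Strategy-style caching API and the cache-first lookup look like this. The interface and class names are hypothetical, and an in-memory map stands in for the real memcached implementation, but the read-through pattern is the one described above:
</p>

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// The Strategy interface: swap implementations (Ehcache, memcached/ElastiCache,
// in-memory) without touching calling code.
interface CacheStrategy {
    Object get(String key);
    void put(String key, Object value);
}

// Stand-in implementation; the production version would call ElastiCache
// through a memcached client such as spymemcached.
class InMemoryCacheStrategy implements CacheStrategy {
    private final Map<String, Object> store = new ConcurrentHashMap<>();
    public Object get(String key) { return store.get(key); }
    public void put(String key, Object value) { store.put(key, value); }
}

public class CacheFirstLookup {
    // Classic cache-first lookup: return the cached value if present,
    // otherwise compute it (e.g. the db roundtrip) and cache the result.
    static Object lookup(CacheStrategy cache, String key, Supplier<Object> compute) {
        Object cached = cache.get(key);
        if (cached != null) return cached;
        Object fresh = compute.get();
        cache.put(key, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        CacheStrategy cache = new InMemoryCacheStrategy();
        System.out.println(lookup(cache, "weekly-report", () -> "expensive result"));
        System.out.println(lookup(cache, "weekly-report", () -> "recomputed"));  // served from cache
    }
}
```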
<p class="workbook">
There are a couple of opportunities for decomposition:
</p>
<p class="workbook">
First, email notifications can be dropped fire-and-forget onto a message queue from which a very tiny consumer instance hands them off to SES. These are join confirmation emails, password change confirmations, and so on. On a larger scale we can send newsletter blasts via the same mechanism. We can then shrink the Web application instances since they no longer need the headroom.
</p>
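<p class="workbook">
The fire-and-forget flow, sketched with a <code>BlockingQueue</code> standing in for SQS. The web app enqueues a notification and returns immediately; a tiny consumer drains the queue and hands each message off to the mailer (SES in our case). Message format and names here are illustrative:
</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class NotificationQueue {
    // Consumer side (small worker instance): drain the queue and "send".
    static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> sent = new ArrayList<>();
        while (!queue.isEmpty()) {
            sent.add(queue.take());  // a real consumer would call SES here
        }
        return sent;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Producer side (web app request thread): enqueue and move on —
        // no SMTP latency on the request thread.
        queue.put("new-user@example.com|Welcome to TriadCity");
        queue.put("player@example.com|Your password was changed");

        System.out.println(drain(queue).size() + " notifications handed to the mailer");
    }
}
```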
<p class="workbook">
Second, we can serve static content from CloudFront. Images, primarily. We end up with a more "standard" architecture than before, where static content is served independently of dynamically-generated pages.
</p>
<p class="workbook">
In the end we've broken a formerly monolithic JEE app into distributed components which interact to generate the final user experience. The decomposition adds minor complexity while allowing the components to scale independently. We see a cost optimization from downsizing the main application servers.
</p>
<p class="workbook">
There were two unexpected architectural gotchas:
</p>
<p class="workbook">
It was necessary to redesign the mechanism behind the TriadCity <a href="https://www.smartmonsters.com/TriadCity/whosOn.jsp">Who's On</a> page. Previously the Web application communicated with the TriadCity game server via RMI. Reasonable with co-located servers, but very ugly when the Web app is in the cloud and the game server is still in co-lo. I redesigned the game server startup to write server version, boot time and other details to the database; logins, TriadCity date and other details were already there.
</p>
<p class="workbook">
I was surprised to find that we saved non-Serializables to HttpSession. Loggers, and some ugly Freemarker utilities. Loggers, no probs: just grab them statically, don’t save a class instance. Freemarker: blecch. It was necessary to detangle the mess we created by embedding Forum software which wasn't designed to be embeddable.
</p>
<p class="workbook">
I bravely tested elasticity with simulated traffic against the production environment. Worked as advertised: additional instances spin up automatically, and sessions are shared. Then I terminated a production instance. EB brought a replacement up within a few seconds. Very nice.
</p>
<p class="workbook">
There was absolutely no downtime; the migration was entirely transparent. With the new EB environment online and tested, I simply changed the DNS (Route 53) to point to EB. That was it.
</p>
<figure class="artwork" style="width: 50%;">
<a href="http://realchangenews.org/2017/01/04/capturing-flight-jacob-lawrence-migration-series-comes-sam"><img alt="Jacob Lawrence, The Great Migration Panel 40" src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/JacobLawrence.2.jpg" /></a>
</figure>
<p class="workbook">
<b>Migrating SmartMonsters to The Cloud. Part Three: Build Automation and Continuous Deployment.</b> (2014-09-29)
</p>
<p class="workbook">
[Intro 2018.05.11. This was written in 2014, when the concept of "Continuous Deployment" was new-ish, not universally adopted, and lacking mature tools. AWS now provides managed services which do much of what we hand-rolled four years ago. I'll describe our migration to their services in future posts.]
</p>
<p class="workbook">
For years we've pushed Web content from desktop to production without human intervention: a very early example of what's now known as "Continuous Deployment". A cron job on the Web server updated from SCM every fifteen minutes. That was it: if you checked-in page changes, they were gonna go live, baby! It was up to you to test them first. Code as opposed to content was pushed the old school way: 100% manual.
</p>
<p class="workbook">
The Cloud-based process we've recently implemented fully automates CI/CD of every component where that's a reasonable goal. The <a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a> game server is the major exception: we wouldn't want to bump players off every time we commit. All other components, from Web sites to robots to AI resources, are now built, tested, and deployed with full automation, without downtime, every time one of our developers commits code.
</p>
<p class="workbook">
Here’s the flow:
</p>
<ul>
<li>Developers are responsible for comprehensive tests. Test suites must pass locally before developers commit to SCM. This remains a manual step.</li>
<li>Commit to SCM triggers a Jenkins CI build in The Cloud. Jenkins pulls, builds, and tests. Build or test failures halt the process and trigger nastygrams.</li>
<li>If the tests pass, Web sites, robots, AIML interpreters and other components are pushed by Jenkins directly to production. Most of these deploy to Elastic Beanstalk environments. Jenkins' Elastic Beanstalk Plugin writes new artifacts to S3, then calls EB's REST API to trigger the swap. EB handles the roll-over invisibly to end-users. There was no real work in this: I simply configured the plugin to deploy to EB as a post-build step.</li>
<li>For some components, acceptance tests follow deployment. That's backward, and we'll fix it. For example, the AIML files enabling many of our chatterbots are validated in two steps: first for syntactic correctness during the build phase, second for semantic correctness after deployment, once it's possible to ask the now-live bots for known responses. We do this on live bots 'cos it's simple; a separate acceptance test environment will make it correct. There'll be a pipeline with an intermediate step: on build success, Jenkins promotes into the acceptance test environment, where it runs the final semantic checks. Only on success there will Jenkins finally promote to production. Failure anywhere stops the pipeline and broadcasts nastygrams. We'll run a handful of final post-deploy smoke tests.</li>
<li>TriadCity game server updates are semi-automated with a final manual step. As noted, we don't want to bump users at unannounced moments; we'd rather trigger reboots manually. Build and deploy short of reboot, however, is automated. Jenkins remote-triggers a shell script on the TC server which pulls the latest build from Artifactory in The Cloud. That build sits on the TC server until a human triggers the bounce. Because we pull-build-test-and-deliver following every code push, many builds may be delivered without ever being launched. TC's administrators launch when they choose to. The good news is, the latest build is already present on the server, and is guaranteed to have passed its tests.</li>
</ul>
<p class="workbook">
One-button rollbacks are possible on EB instances: simply re-deploy an older build via EB's Web console. Obviously, safe rollbacks are possible only when there have been no backward-incompatible schema changes in the databases; since most of our DBs are schemaless this is not usually difficult. We can similarly roll back the TriadCity game server with a simple script.
</p>
<p class="workbook">
<a href="https://www.smartmonsters.com">SmartMonsters</a> releases continuously, often many times a day. We can go further. With 100% virtualization it's possible to bundle the environment with the release. This vision of infrastructure as code is more radical than the contemporary DevOps movement. I believe it's the future of release automation.
</p>
<p class="workbook">
[Afterword 2018.05.11. We're now using CodeBuild and CodePipeline for nearly all components. CodeDeploy coming real soon now. Future posts will elaborate.]
</p>
<figure class="artwork" style="width: 50%;">
<a href="https://kenanmalik.wordpress.com/2015/02/15/jacob-lawrence-and-the-great-migration-5/"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/JacobLawrence.3.jpg" alt="Jacob Lawrence, The Great Migration Panel 45"></a>
</figure>
<p class="workbook">
<b>Migrating SmartMonsters to the Cloud. Part Two: Breaking Up the Bots.</b> (2014-09-24)
</p>
<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a>’s robots <a href="https://www.smartmonsters.com/TriadCity/Bots/Death.jsp">Death</a>, <a href="https://www.smartmonsters.com/TriadCity/Bots/Help.jsp">Help</a>, <a href="https://www.smartmonsters.com/TriadCity/Bots/Marvin.jsp">Marvin</a> and <a href="https://www.smartmonsters.com/TriadCity/Bots/Oscar.jsp">Oscar</a> are AIML-based chatterbots architected monolithically. Each bot is its own application: TriadCity client code, AIML interpreter, AIML files, plus each bot's layer of individual behaviors. They're built and deployed independently of each other. The plus side is they're free to evolve and scale independently; the downside is there's a lot of duplication.
</p>
<p class="workbook">
While migrating to AWS we want to break them down into shareable components. We especially want a shared AIML interpreter accessible from multiple protocols and device types. Advantages:
</p>
<ul>
<li style="margin: 0 0 6px 0;">Enables a new generation of bots accessed independently of TriadCity. For example, we can turn Oscar into his own application, accessing a single scalable AIML brain from mobile devices, Web sites, Twitter feeds, and the existing TriadCity bot without having to maintain separate AIML repositories for each client.</li>
<li style="margin: 0 0 6px 0;">Enables AIML updates independent of the Java code for these multiple platforms, doable in real time without rebooting the bots.</li>
<li style="margin: 0 0 6px 0;">The shared AIML interpreter can scale independently, becoming fully elastic in The Cloud.</li>
<li style="margin: 0 0 6px 0;">Centralized AIML files can be validated globally within a consolidated staged CI/CD pipeline with appropriate tests at each stage.</li>
<li style="margin: 0 0 6px 0;">It's easier to swap-out or enhance AIML interpreters without having to rebuild and redistribute an army of dependents.</li>
</ul>
<p class="workbook">
Here are the migration steps:
</p>
<ol>
<li style="margin: 0 0 6px 0;">Pull the AIML interpreter out of the bot code, into a standalone application accessed as a RESTful service on Elastic Beanstalk.</li>
<li style="margin: 0 0 6px 0;">Redesign the AIML interpreter to serve all of the bots. The REST API will be super-simple: a GET call parameterized with the bot name and the input String. Return a JSON object. EB provides elasticity.</li>
<li style="margin: 0 0 6px 0;">Remove the AIML interpreter from the Web site, having the appropriate JSPs now RESTfully GET their conversations from the service on EB.</li>
<li style="margin: 0 0 6px 0;">Remove the AIML interpreter from the TriadCity bots, having them similarly call into the service. I decided while I was at it to consolidate the four existing bots into a single JAR rather than continuing to build each as a separate project. This is simpler, and there’s very little Java code distinguishing the bots. A configuration file determines which bot to run; the consolidated code figures out which bot-centric classes to load.</li>
<li style="margin: 0 0 6px 0;">Configure Jenkins to manage Continuous Delivery. The CI build fires when updates are made to SCM. The build lifecycle now includes two cycles of AIML validation. A syntactic pass ensures XML correctness before changes are pushed to production; a semantic pass validates AIML rules by comparing expected to actual outputs.</li>
<li style="margin: 0 0 6px 0;">Log interpreter inputs which fail to match AIML rules, triggering default responses. This feedback will let our botmasters know what their users are interested in, and how their bots are performing. Write logs to DynamoDB, with a simple schema: bot name, &lt;input&gt;, &lt;topic&gt;, &lt;that&gt;. Schedule a weekly cron to report on these failures.
</li>
</ol>
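<p class="workbook">
The super-simple REST surface from step 2 can be sketched with the JDK's built-in <code>HttpServer</code>. The endpoint path, parameter names, and canned reply here are all hypothetical; a real handler would run the input through the AIML interpreter:
</p>

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class BotService {
    // Parse ?bot=oscar&input=hello into a map. The JDK server doesn't do
    // this for you; frameworks would.
    static Map<String, String> queryParams(URI uri) {
        Map<String, String> params = new HashMap<>();
        String q = uri.getQuery();
        if (q == null) return params;
        for (String pair : q.split("&")) {
            String[] kv = pair.split("=", 2);
            params.put(URLDecoder.decode(kv[0], StandardCharsets.UTF_8),
                       kv.length > 1 ? URLDecoder.decode(kv[1], StandardCharsets.UTF_8) : "");
        }
        return params;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/chat", exchange -> {
            Map<String, String> p = queryParams(exchange.getRequestURI());
            // A real handler would feed p.get("input") to the shared AIML brain.
            String json = "{\"bot\":\"" + p.get("bot") + "\",\"response\":\"Hello!\"}";
            byte[] body = json.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
        System.out.println("listening on port " + server.getAddress().getPort());
        server.stop(0);  // demo only; a real service keeps running (EB supplies elasticity)
    }
}
```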
<p class="workbook">
Migration took two working days. To my surprise, DynamoDB logging required a full third day. I struggled to get connected — turns out the published Developer Guide is missing one key point, and incorrect on another. To connect you have to specify the AWS region, duh. Integer keys are not necessary, Strings are fine; there's a "@DynamoDBAutoGeneratedKey" annotation in the Java SDK which'll auto-gen them for you.
</p>
<p class="workbook">
One more enhancement seems obvious. Writing logs directly to db is synchronous: our RESTful API server will block a thread during the transaction. We can avoid the block by posting log messages asynchronously to a queue. It's half a day to update the interpreter to write to SQS instead of directly to DB. An EB "worker tier" consumer pulls from the queue and does the DB write.
</p>
<p class="workbook">
The platform is now in place upon which we'll build a new generation of bots more advanced than before. Watch this space.
</p>
<figure class="artwork" style="width: 50%;">
<a href="https://www.moma.org/calendar/exhibitions/1495"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/JacobLawrence.4.jpg" alt="Jacob Lawrence, One Way Ticket"></a>
</figure>
<p class="workbook">
<b>Migrating SmartMonsters to The Cloud. Part One: Problems and Legacy State.</b> (2014-09-15)
</p>
<p class="workbook">
No-one owns servers anymore. The advantages of <a href="https://aws.amazon.com/what-is-cloud-computing/">Cloud Computing</a> are so overwhelmingly obvious that hardware is simply no longer sensible. If we were starting <a href="http://www.smartmonsters.com">SmartMonsters</a> today we'd begin on AWS. But, we started in 1999, putting us in a similar situation to the majority of 15-year-old companies in the Internet space: we have a complex physical infrastructure hosted in a <a href="http://www.osiriscomm.com/">data center</a>. Migrating to AWS will be a nontrivial project.
</p>
<p class="workbook">
It's not simply a matter of lift-and-shift. Although it's possible to simply substitute virtualized servers for physical ones, virtualization per se isn't the advantage. Where you really gain is in leveraging the platform services which AWS provides on top of virtualization. That leverage shrinks your management footprint, offloading traditional IT tasks of monitoring and maintenance to AWS, while at the same time making it much easier to build "elastic" systems which scale automatically and self-recover from outages. The key is to break legacy monolithic apps into components accessed as services, either self-managed or provided by the platform.
</p>
<p class="workbook">
For example, our chatterbots <a href="https://www.smartmonsters.com/TriadCity/Bots/Oscar.jsp">Oscar</a>, <a href="https://www.smartmonsters.com/TriadCity/Bots/Marvin.jsp">Marvin</a>, <a href="https://www.smartmonsters.com/TriadCity/Bots/Help.jsp">Help</a>, <a href="https://www.smartmonsters.com/TriadCity/Bots/Death.jsp">Death</a> and Eliza all currently run in small <a href="https://aws.amazon.com/ec2/?p=tile">EC2</a> instances, logging in to the TriadCity server in the same way that humans would. That's fine, but if we break them into components — say by exposing a shared AIML interpreter via RESTful interface — then the Web site, the bots in TriadCity, and perhaps a new generation of bots running on mobile devices could all call into a shared and elastically scalable AIML service. At the same time we can move their relational databases into AWS' <a href="https://aws.amazon.com/rds/?p=tile">Relational Database Service</a>, offloading management to the vendor. This distributed design would allow all components to evolve and scale independently of each other. It enables flexibility, scalability, and automatic recovery from outages while at the same time considerably decreasing cost.
</p>
<p class="workbook">
Decomposition confronts technical issues without simple solutions. For example, the Web site's <a href="https://www.smartmonsters.com/Fora/forums/list.page">Forum</a> section is powered by heavily modified open source code now deeply integrated into — tightly coupled with — the site's authentication, authorization and persistence frameworks. This infrastructure aggressively caches to local memory. Without modification, multiple elastic instances would not sync state, so that user visits from one click to the next would likely differ. We can leverage AWS' managed <a href="https://aws.amazon.com/elasticache/?p=tile">ElastiCache</a> service to share cache between instances. We'll have to modify existing code to do that.
</p>
<p class="workbook">
This is a common problem. Legacy architectures designed to be "integrated" back when that was a marketing bullet rather than a swear word are really the norm. In my consulting practice I try to educate clients not to build new monoliths running on self-managed <a href="https://aws.amazon.com/ec2/?p=tile">EC2</a>. Yah it'll be "The Cloud". But it'll be the same problems without the advantages.
</p>
<p class="workbook">
While migrating we're going to re-architect, breaking things down wherever possible into loosely coupled components. We'll move as many of those components as we can to managed services provided by AWS, for example SCM, email, databases, caches, message queues, directory services, file storage, and many others. At the same time we'll leverage <a href="https://aws.amazon.com/architecture/well-architected/">AWS best practices</a> for security, elasticity, resilience and performance. And cost. In the next weeks I'll write several posts here outlining how it goes.
</p>
<br>
<p class="workbook">
<b>Initial State</b>
</p>
<ul>
<li>SCM is by CVS - remember that? - on a VPS in the colo. This SCM manages not only source code but also company docs. Deployment of new or edited JSPs to the production web server is automated by CVS update via cron.</li>
<li>Email is self-administered Postfix, on a VPS in the colo.</li>
<li>MongoDB instances are spread among VPS and physical hardware in the colo. There’s a primary and two secondaries.</li>
<li>PostgreSQL is on a VPS in the colo.</li>
<li>A single Tomcat instance serves the entire Web site, which is 100% dynamic — every page is a JSP. This instance experiences a nagging memory leak, probably related to the forum code, which no-one has bothered to track down yet; we simply reboot the site every night.</li>
<li>The TriadCity game server is physical hardware, with 16 cores and 32gb RAM, racked in the colo.</li>
<li>The Help, Oscar, Marvin, Death and Eliza chatterbots run on small EC2 IaaS instances on AWS.</li>
<li>Jira and Greenhopper are on-demand, hosted by Atlassian.</li>
<li>Jenkins build servers are on VPS in the colo.</li>
<li>DNS is administered by the colo provider.</li>
<li>Firewall rules, NAT, and port forwarding are also administered by the provider.</li>
</ul>
<p class="workbook">
<b>Anticipated End State</b>
</p>
<ul>
<li>SCM by Git, hosted by a commercial provider such as GitHub or <a href="https://aws.amazon.com/codecommit/?p=tile">CodeCommit</a>.</li>
<li>Company docs moved off of SCM, onto <a href="https://aws.amazon.com/s3/?p=tile">S3</a> or Google Drive.</li>
<li>Corporate email by Google, maintaining the smartmonsters.com domain and addresses. Website email notices and newsletter blasts via <a href="https://aws.amazon.com/ses/?p=tile">SES</a>.</li>
<li>MongoDB instances hosted in EC2, geographically dispersed. There’ll continue to be a primary and two secondaries. Backups will be via daily snapshots of one or another secondary. It may be optimal to occasionally snapshot the primary as well: Amazon recommends shorter-lived instances whenever that’s feasible.</li>
<li>PostgreSQL via RDS.</li>
<li>Caches migrated to a shared ElastiCache cluster.</li>
<li>The remainder of the dynamic Web site served by Elastic Beanstalk; sessions shared via <a href="https://aws.amazon.com/dynamodb/?hp=tile&so-exp=below">DynamoDB</a>. At least two instances, geographically dispersed.</li>
<li>Static web content such as images hosted on S3 and served by <a href="https://aws.amazon.com/cloudfront/?p=tile">CloudFront</a>.</li>
<li>Long-term backups such as legacy SQL dumps stored on <a href="https://aws.amazon.com/glacier/">Glacier</a>.</li>
<li>The chatterbots broken into two layers: an AIML interpreter, accessed via REST by the Java-based bot code which in other respects is very similar to any other TriadCity client. The AIML interpreter will probably be served by Elastic Beanstalk; the bots can continue to run from EC2 instances. The interpreter will have to be adapted to save state in DynamoDB; the AIML files can live in S3; but note that the program will have to be adapted to also be able to read files from the local file system, in case we choose to deploy it in a different way. We may or may not make those instances elastic.</li>
<li>Jira and Greenhopper remain at Atlassian.</li>
<li>Jenkins builds will run on EC2.</li>
<li>A micro EC2 instance will host an Artifactory repo to be used by Continuous Deployment.</li>
<li>DNS will move to <a href="https://aws.amazon.com/route53/?p=tile">Route 53</a>; advantage to us is self-administration.</li>
<li>The TriadCity game server will remain on physical hardware inside the colo. It's not yet cost-effective for a company as small as ours to virtualize servers of that scale.</li>
<li>Firewall rules, NAT and port forwarding for the TriadCity server remain with the colo provider. Security groups, <a href="https://aws.amazon.com/kms/?p=tile">key management</a> and so on for the AWS infrastructure will be self-administered via <a href="https://aws.amazon.com/iam/?p=tile">IAM</a> and related services.</li>
<li>Monitoring of the AWS resources will be via <a href="https://aws.amazon.com/cloudwatch/?p=tile">CloudWatch</a>.</li>
<li>The AWS resources will probably be defined by <a href="https://aws.amazon.com/cloudformation/?p=tile">CloudFormation</a>, but, we don't know enough about it yet.</li>
<li>Continuous Deployment will have to be re-thought. Today, a shell script on the Web server runs CVS update periodically, a super-simple CD solution which has worked with no intervention for years. The virtualized infrastructure may allow more sophistication without much additional cost. For example it may be sensible to automate deployments from Jenkins which not only push code, but create new Beanstalk instances while terminating the old ones, a strategy which will keep the instances always young per AWS best practices. Beanstalk will keep these deployments invisible to end users, with no app downtime. More exploration is in order.</li>
</ul>
<p class="workbook">
You'll note our planned dependence on AWS. Are we concerned about "vendor lock-in"?
</p>
<p class="workbook">
No. "Vendor lock-in" is a legacy IT concept implied by long acquisition cycles and proprietary languages. While it's true we'll be tying code and deployments to AWS APIs, migrating to Azure or another cloud provider would not be a daunting project. For example, AWS exposes cache APIs, so does Azure. Migration would imply only minor changes to API calls. Our apps rely on the Strategy design pattern: we'd only have to write new Strategy implementations which call the changed APIs. Trivial concern compared to the advantages.
</p>
<br>
<p class="workbook">
<b>Conclusion</b>
</p>
<p class="workbook">
This is a journey many companies will take in the coming decade. The advantages in resilience, elasticity, and cost are overwhelming. I design and manage these migrations for a living; please contact me if I can be of help.
</p>
<figure class="artwork" style="width: 50%;">
<a href="http://www.seattleartmuseum.org/migration"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/JacobLawrence.5.jpg" alt="Jacob Lawrence, The Great Migration Panel 18"></a>
</figure>
<p class="workbook">
<b>Migration notes: from RDBMS to MongoDB</b> (2014-05-16)
</p>
<p class="workbook">
The relational paradigm maps poorly to OO design: objects don't have relations. When you use an ORM for CRUD you end up with a denormalized schema that defeats the point of the <i>relational</i> paradigm. If your OO design relies on inheritance you can end up with absurdly large numbers of tables. In <a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a> we have hundreds of <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/items.jsp">Item</a> subtypes with varying properties; mapping these to a traditionally normalized RDBMS would produce spaghetti schema. Something has to give: you end up with either a lousy database or a lousy OO design. By contrast the document data model maps intuitively to OO. Because <a href="http://www.mongodb.org/">MongoDB</a> is schemaless you can put all those hundreds of Item subtypes into the same Collection, hydrating them without OO mapping and reporting on them in a perfectly granular way. Schemalessness removes the impedance mismatch.
</p>
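<p class="workbook">
For instance, two Item subtypes with completely different properties can live side by side in one Items Collection. The documents below are illustrative, not our actual schema:
</p>

```json
[
  { "_id": "...", "class": "Sword",     "name": "rusty short sword", "damage": 4,    "twoHanded": false },
  { "_id": "...", "class": "Newspaper", "name": "TriadCity Times",   "edition": 512, "readable": true }
]
```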
<p class="workbook">
Migration turned out to be a larger project than expected. This isn't the fault of the technology. It was trivial to migrate game object persistence. Without hyperbole, an hour or two. But of course, then we got ambitious. We do vastly more with data than merely game CRUD. For example, we're beginning to use rich data analytics of in-game auctions to enable AI driven characters' bidding behavior. Previously that logic would have lived in a vendor-specific stored procedure, and would have been less comprehensive at runtime than what we can achieve now. We're better off with a combination of MongoDB plus <a href="http://www.neo4j.org/">Neo4J</a> for real-time lookups. I'm happy as can be to have this logic live in Java inside the game engine, rather than a proprietary stored procedure language in the database. But, implementation in the new technology took some effort.
</p>
<p class="workbook">
I chose not to replicate the previously relational schema in MongoDB. It would have been incrementally easier to simply map tables to Collections 1-for-1. Instead, I chose to take advantage of MongoDB's ability to store related information in a single document via nested arrays. Here's an example. Our RDBMS used normalized tables to relate basic user account information to: demographics; settings and preferences; security roles and access restrictions; a listing of bounced email addresses; and a table of locked accounts. So, a parent table and five children, related by the unique ID of the parent account. In MongoDB, it makes better sense to keep all of this related information in a single record. So, our Person document has an array field for demographics, another for roles and privileges, another for settings and preferences, and simple boolean flags for bounced emails or global account lock. This consolidation is straightforward to code to, and results in a human-readable document browsable via your favorite administrative GUI.
</p>
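<p class="workbook">
Roughly, the consolidated Person document looks like this. Field names and values are illustrative, not our actual schema:
</p>

```json
{
  "_id": "...",
  "name": "oscar",
  "email": "oscar@example.com",
  "demographics": [ { "country": "US" }, { "yearOfBirth": 1970 } ],
  "roles": [ "player", "forum-moderator" ],
  "preferences": [ { "visuallyImpairedLayout": false }, { "timezone": "America/Los_Angeles" } ],
  "emailBounced": false,
  "accountLocked": false
}
```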
<p class="workbook">
The need for transactional guarantees goes away. In a SQL world most transactions are second-order: they're really artifacts of the relational model itself, where an update to a Person per above might require writes to multiple tables. Post-migration we literally have not one use case where a transaction is necessary. We do rely on write guarantees in MongoDB: our writing Java thread blocks until it receives an appropriate acknowledgement. The Java driver handles them automagically.
</p>
<p class="workbook">
Object-oriented queries in MongoDB's Java API turn out to be very intuitive to internalize. They read like almost-English and make immediate sense. Here's an example and its SQL equivalent. This looks up the most recent sale price of a Player's in-game house. MongoDB:
</p>
<pre class="workbook">
DBObject where = new BasicDBObject("playerHouseID", playerHouseID);
DBObject fields = new BasicDBObject("price", true).append("timestamp", true);
DBObject sort = new BasicDBObject("timestamp", DESCENDING);
DBObject found = this.mongo.db.getCollectionFromString("PlayerHouseSales").findOne(where, fields, sort);
if (found == null) {  // this player house has never been sold
    return -1;
}
return (Long) found.get("price");
</pre>
<p class="workbook">
The SQL equivalent minus framework boilerplate would have been something like:
</p>
<pre class="workbook">
SELECT * FROM (
    SELECT price FROM player_house_sales
    WHERE player_house_id = ?
    ORDER BY timestamp DESC
) WHERE rownum = 1
</pre>
<p class="workbook">
I find the OO version vastly easier to conceptualize.
</p>
<p class="workbook">
Object-oriented queries like these make it straightforward to eliminate the 200 or so stored procedures which had tied us to one specific RDBMS vendor. Nearly all of these existed because it's easier to implement branching logic in a stored procedure language than in SQL. Typically the branching test was whether a row already existed, thus insert or update. The MongoDB Java driver's save() method handles upsert for you, automagically converting as the case warrants. A few circumstances which are marginally more complicated are easily handled in Java. This is a huge win. All of our data persistence and analytics logic is now in Java in a single location in a single code base — no longer scattered among multiple languages and disparate scripts in different locations in SCM.
</p>
<p class="workbook">
I don't know what the best practice is for sequences. Sequences aren't recommended for _id unique keys. But, there are circumstances where an incrementing recordNo is handy. For example, I like to assign a recordNo to each new <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/rooms.jsp">Room</a>, <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/exits.jsp">Exit</a>, <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/items.jsp">Item</a> and so on. A dynamic count() won't do for this, because these objects are occasionally deleted after creation. I'd nevertheless like to know the grand total of how many Rooms have ever been created, without keeping a separate audit trail. It seems straightforward to create a sequence document in the database with one field for each sequence; you then call the $inc operator to do what it sounds like. Since this is not the _id unique key we're talking about, I ran with it.
</p>
<p class="workbook">
I'd like to see more syntactic sugar in the query API. Common operations such as summing or averaging columns are cumbersome today. There's a new-ish Aggregation Framework seemingly designed for large-scale offline reporting. For ad hoc in-game queries, I'd prefer the Java API, and would like common operations like summing or finding max or min values to be heavily optimized at runtime, as they are in the RDBMSs we're replacing. Corners of the MongoDB programming world like these still seem immature to me. Development seems very active, so hopefully the gaps will be filled soonish.
</p>
<p class="workbook">
I've encountered one problem not yet solved. Sparse unique indexes don't seem to work. Inserting more than one record with a null value into a field with a sparse unique index fails with a duplicate value error on the null, which the "sparse" part is supposed to finesse. Google suggests others have experienced similar glitches. For the time being I've turned off the index, and am continuing to research. Really do want this working.
</p>
<p class="workbook">
Runtime performance is excellent. Superior in most cases to the SQL equivalent. For example, loading 18k Rooms from the RDBMS including their default Items and Mobiles typically required about 40 seconds; with MongoDB it's usually 15. On my MacBook Pro with 8 cores and 16GB of RAM, a complete game boot in the SQL world took around 60 seconds; it's half that with MongoDB.
</p>
<p class="workbook">
[For a client, I've subsequently done a shootout of very high-volume write processing between MongoDB and Oracle, where "high-volume" means hundreds of millions of writes. MongoDB was an order of magnitude faster under load.]
</p>
<p class="workbook">
MongoDB's system requirements are demanding. MongoDB likes to put the entire data set into RAM, and it assumes a 3-node replica cluster with several gigs of memory per node. Our data set really isn't that big, so we've got it shoehorned into small VPS instances in the colo. Without a lot of hands-on lore, I can't say yet whether we'll need to bump up these instance sizes. [Update 2014.10.31: instance sizes unchanged. All's well.] [Update 2018.05.12: after migrating to AWS the Mongos run happily on m1.medium instances.]
</p>
<p class="workbook">
MongoDB's automated replication is nifty for high-availability and offloading queries from the writeable system. We run our analytics queries using map/reduce against mongodumps taken from one of the secondaries, freeing the primary for production writes. Nothing's necessary to enable this behavior, it works out of the box. For very large data sets, sharding is easy-peasy. There's a handy connector linking MongoDB with <a href="http://hadoop.apache.org/">Hadoop</a>, for offline map/reduce queries. The database is designed for these topologies, which in many RDBMS worlds are cumbersome to achieve.
</p>
<p class="workbook">
Granted this enhanced ability to easily analyze 14 years of comprehensive log data, we can begin to enable vastly more powerful and realistic in-game AI. Here's an example. The data show us which Items sell most frequently at auction; what the most successful starting prices are; what the average, minimum and maximum final auction prices are for each Item type; how many bids are common; upper and lower bounds on all these things; medians; distributions; curves; etc. etc. They can help us predict which Players or AI-driven characters might be interested in buying which things. By pulling the "big data" from MongoDB and graphing it with <a href="http://www.neo4j.org/">Neo4j</a>, we can enable super-zippy real-time queries like, "Should I bid on the current auction Item? If yes, how much? What should my top bid be based on the bid history so far? If I buy it at the right price, will I later be able to profit from re-auctioning it? When's the best time to re-auction? Are there any Players currently online who may be interested in the Items in my possession? Are there any robots or other AIs who might be interested?" This has far-reaching implications not only for the quality of in-game AI, but also for our ability to expand and make the in-game economy more rich and realistic. In addition to in-game merchants who buy Items from Players, we can now have AI-driven characters buy and sell from auction, creating another sales channel from which Players can benefit.
</p>
<p class="workbook">
A note re Mongo and JSON. If your company models its business objects in JSON, a highly-performant JSON database makes great sense. It's far more straightforward than shoehorning JSON into the relational paradigm designed for an earlier world.
</p>
<p class="workbook">
SmartMonsters is now fully committed to "polyglot persistence". The principals are MongoDB, Neo4j, DynamoDB, and PostgreSQL. We plan to migrate the PostgreSQL to AWS' <a href="https://aws.amazon.com/rds/?p=tile">RDS</a>. Very much looking forward to a world where AWS offers a managed MongoDB which we don't have to take care of ourselves.
</p>
<figure class="artwork" style="width: 50%;">
<a href="https://www.mongodb.com/"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/JacobLawrence.6.jpg" alt="Jacob Lawrence, The Great Migration"></img></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-37986420044657835502014-04-06T10:56:00.000-07:002018-05-13T11:26:48.693-07:00NPC movement with Neo4j<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a>’s sophisticated "<a href="https://triadcitydevelopers.blogspot.com/2014/01/subsumption-architecture-for-npc-ai.html">subsumption architecture</a>" for NPC behaviors depends on accurate, comprehensive knowledge of game world geography. Unlike traditional MUDs where NPC movement is statically confined to home zones and is largely random, TriadCity is alive with highly individualized NPCs capable of purposeful navigation throughout the City.
</p>
<p class="workbook">
For years the code enabling this directed movement has been a simple homegrown weighted graph mapping all possible paths from every point A to every other point B. Weights were based primarily on movement costs by alignment. Separate graphs were computed for Evil, Good and Neutral paths, plus specialized graphs for Cops and for residents of the <a href="https://www.smartmonsters.com/TriadCity/MOTD/archives/motd.2010-04-04.jsp">Capitoline</a> neighborhood — two categories of NPCs with restricted movement possibilities. The graphs were computed at server boot, following instantiation of all <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/rooms.jsp">Rooms</a> and the <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/exits.jsp">Exits</a> linking them. They were static: unchanging after initial computation, essentially freezing NPC knowledge of world geography at the state of its existence at boot. And they were monolithic: an NPC could look up a path from the Neutral graph, or the Cop graph, but not a hybrid NeutralCop graph. For a long time this was "good enough for now." While inflexible, it enabled accurate if not always subtle NPC navigation from any arbitrary origin to any chosen destination. NPCs simply looked up their path and, following Dijkstra, off they went.
</p>
<p class="workbook">
TriadCity eventually outgrew this simple design. We want to be able to accurately model arbitrary restrictions on NPC paths, for example, excluding NPCs from certain streets based on Role, or gender, or other classification. It's important for NPCs to react to more than one category of restriction. And we want their movement behaviors to be as "human" as possible, so, not necessarily always taking the shortest possible path, if the shortcut for instance leads through someone's private garden, or in one pub door and out another. Usually real people would go around the garden, or stay on the sidewalk outside the pub. We grew especially frustrated with our simple system's inability to track dynamic changes to the game world at runtime. It would have required a major project to extend our down & dirty proprietary code to handle these features.
</p>
<p class="workbook">
Instead, we've chosen the excellent NoSQL graph database <a href="http://www.neo4j.org">Neo4j</a> as our solution. While the approach is essentially the same — a weighted graph based on Dijkstra — Neo4j's comprehensive query-ability allows us enormous flexibility at runtime. It easily enables the kinds of ambitions described above, with impressively little code.
</p>
<p class="workbook">
We run Neo4j in embedded mode. Easy-peasy, since our server's written in Java. The database is populated during server boot after Rooms and Exits are loaded. Nodes are Rooms; Relationships are built from outbound Exits leading to destination Rooms. The algorithm's straightforward. Iterate the Rooms, defining a Node for each; then iterate a second time, cataloging each Room's outbound Exits: look up the destination Room each Exit leads to, find the Node representing that destination, and define a Relationship between the origin and destination Nodes. We only need one RelationshipType: LEADS_TO. As each Relationship is defined we flag it with boolean properties representing the restrictions we want to impose: isNoCops, isNoBeasts, isNoHumans, isNoDeathSuckers, isPerformerOnly, isCapitoline, hasDoor, and so on. These properties belong to Exits and destination Rooms, so we simply look for them and set the flags appropriately.
</p>
<p class="workbook">
At runtime, NPC path lookups are via a PathFinder built by passing GraphAlgoFactory's static dijkstra() method a custom CostEvaluator which looks for the boolean property flags defined on Relationships and totals costs accordingly. For example, if the calling NPC has the Cop Role and a Relationship is flagged isNoCops, the CostEvaluator will assign a prohibitive cost to that Relationship — we arbitrarily use 100,000 — effectively removing it from consideration. When PathFinder searches for the least-cost path it'll choose one which bypasses that expensive Relationship. Managing the scenario where we want NPCs to go around a pub rather than through it is equally straightforward: we simply assign a cost of 100 to all Relationships flagged hasDoor, so that NPCs generally prefer not to move through doors unless they're the only paths available. Nice.
</p>
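<p class="workbook">
The cost idea is easy to see in miniature. The sketch below is plain Java, not Neo4j's API (the real version plugs a CostEvaluator into GraphAlgoFactory.dijkstra()); the graph shape, flag names, and class names are illustrative only.
</p>

```java
import java.util.*;

// Minimal sketch of flag-based edge costs feeding Dijkstra.
// Flag names mirror the post; the graph representation is invented, not Neo4j's.
public class PathCostSketch {
    static class Edge {
        final int to;
        final boolean isNoCops, hasDoor;
        Edge(int to, boolean isNoCops, boolean hasDoor) {
            this.to = to; this.isNoCops = isNoCops; this.hasDoor = hasDoor;
        }
    }

    // The "CostEvaluator": a prohibitive cost effectively removes an edge from consideration.
    static int cost(Edge e, boolean npcIsCop) {
        if (npcIsCop && e.isNoCops) return 100_000;
        if (e.hasDoor) return 100;   // prefer not to cut through pubs and private gardens
        return 1;                    // ordinary street movement
    }

    // Plain Dijkstra returning the least total cost from start to goal.
    static int cheapestCost(List<List<Edge>> graph, int start, int goal, boolean npcIsCop) {
        int[] dist = new int[graph.size()];
        Arrays.fill(dist, Integer.MAX_VALUE);
        dist[start] = 0;
        PriorityQueue<int[]> pq = new PriorityQueue<>(Comparator.comparingInt((int[] a) -> a[1]));
        pq.add(new int[]{start, 0});
        while (!pq.isEmpty()) {
            int[] cur = pq.poll();
            if (cur[1] > dist[cur[0]]) continue;  // stale queue entry
            for (Edge e : graph.get(cur[0])) {
                int nd = cur[1] + cost(e, npcIsCop);
                if (nd < dist[e.to]) { dist[e.to] = nd; pq.add(new int[]{e.to, nd}); }
            }
        }
        return dist[goal];
    }

    public static void main(String[] args) {
        // Three Rooms: 0 -> 1 is a shortcut flagged isNoCops; 0 -> 2 -> 1 is the long way round.
        List<List<Edge>> g = new ArrayList<>();
        for (int i = 0; i < 3; i++) g.add(new ArrayList<>());
        g.get(0).add(new Edge(1, true, false));   // direct, but closed to Cops
        g.get(0).add(new Edge(2, false, false));
        g.get(2).add(new Edge(1, false, false));
        System.out.println(cheapestCost(g, 0, 1, false)); // civilian takes the shortcut: 1
        System.out.println(cheapestCost(g, 0, 1, true));  // a Cop detours: 2
    }
}
```

<p class="workbook">
Note the prohibitive 100,000 never makes a flagged Exit strictly impassable; it just guarantees that any detour which exists will be cheaper.
</p>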
<p class="workbook">
Implementing Neo4j required just a single class of about 300 lines, and took about half a working day including time spent tuning the costs.
</p>
<p class="workbook">
Runtime performance is excellent. The Neo4j solution takes longer to bootstrap — about 5 seconds for 18,000 Rooms on my MacBook Pro, compared to just a few milliseconds for our deprecated home baked version. But, its runtime lookup performance is superior. We judge overall AI processing load by tracking average completion times of our one-per-second Behaviors pulse which triggers Behaviors throughout the game world. Average Behavior computation on my laptop was steady in the range of 120 to 150 milliseconds with the old grapher; with Neo4j it's been consistently around 80, an efficiency gain of 50% or better. Memory use looks about the same.
</p>
<p class="workbook">
So far, runtime movement appears to be perfect. Everybody goes where they're supposed to, and they now do it with the subtle enhancements highlighted above. Importantly for us, it's now very easy to tune movement by simply adding new property flags to Relationships and looking for them in our custom CostEvaluator. And, we can easily enhance the system to adapt to runtime changes in the game world itself, for example when builders add new Rooms or 4d mazes rearrange themselves.
</p>
<p class="workbook">
I'm impressed with how easy it's been to include Neo4j in our project. The API is intuitive and very nicely documented. I was able to take and tweak a Dijkstra example from the Neo4j web site with very little effort. Very pleased!
</p>
<figure class="artwork" style="width: 50%;">
<a href="http://chrisvoncsefalvay.com/10-tips-for-passing-the-neo4j-certified-professional-examination/"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/Neo4JCloud.png" alt="Neo4J Graph"></img></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-89753135099634836532014-02-08T11:28:00.000-08:002018-05-13T11:20:48.654-07:00All AI in TriadCity is Event-Driven. Actually, Pretty Much Everything in TriadCity is Event-Driven.<p class="workbook">
All NPC behaviors in <a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a> are event-driven. Not just NPC AI, but all change everywhere in the virtual world, from the motion of subway trains to room descriptions morphing at night to the growth of fruit on trees. Here's a quick sketch of some TC event basics.
</p>
<p class="workbook">
In TriadCity every action is encapsulated in an Event object. This is straightforward in OO Land. Take for example player-generated actions. <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Commands/index.jsp">Player Commands</a> are expressed in the server code as <a href="https://triadcitydevelopers.blogspot.com/2014/02/we-use-strategy-oo-design-pattern-for.html">Strategy Pattern</a> objects encapsulating the necessary algorithm. Each Command will generate one or more Events as its outcome(s). The <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Commands/complete/steal.jsp">Steal</a> command generates a ThiefEvent, which it parameterizes differently depending on success or failure. Commands instantiate the appropriate Event type at the appropriate moment within their algorithms, handing off the new Event instance to the current Room, which propagates it wherever it needs to go.
</p>
<p class="workbook">
Here's the structure:
</p>
<blockquote>
public interface Event {
<br> long getID();
<br> long getEventType();
<br> long getVisibility();
<br> String getSourceID();
<br> String getTargetID();
<br> String getRoomID();
<br> String getMessageForSource();
<br> String getMessageForTarget();
<br> String getMessageForRoom();
<br> ....
<br>}
</blockquote>
<p class="workbook">
The Steal command populates ThiefEvent with the information shown above and a ton more. The eventType will vary depending on whether the Steal succeeds or fails; the assigned visibility will be computed depending on the <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Character/roles/thief.jsp">Thief</a>'s <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Character/skills/skills.jsp">Skill</a> level and multiple variables. The messageForSource will be: "You steal the silver coins from the drunken Texan", or "You fail to steal the silver coins from the well-dressed man, and are detected." Appropriate messages are computed for the target and for bystanders in the <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/rooms.jsp">Room</a>. When ready, the ThiefEvent is passed to the Room, which in turn passes it to everybody and everything currently present.
</p>
<p class="workbook">
What happens next?
</p>
<p class="workbook">
Up to world authors.
</p>
<p class="workbook">
One form of NPC behavior is triggered when Events are received. The Behavior instance tests the Event against its triggering conditions. Am I supposed to fire when the Event is a ThiefEvent and its getEventType() == STEAL_FAILS_EVENT_TYPE and the NPC I'm assigned to is the Event.getTarget()? Or in English, did someone just try to steal from me and fail? If the world author has assigned the TriggeredAttackBehavior to an NPC and given it those trigger conditions, then the victim will attack the Thief when those conditions are met. This is a form of rule-based AI, expressed in OO patterns.
</p>
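<p class="workbook">
Here's that trigger test in miniature, assuming a much-simplified Event shape; the real TriadCity interfaces carry far more state, and these names are illustrative.
</p>

```java
// Sketch of an Event-triggered rule, with a deliberately tiny Event interface.
// STEAL_FAILS_EVENT_TYPE and the IDs are invented for the example.
public class TriggeredBehaviorSketch {
    static final long STEAL_FAILS_EVENT_TYPE = 42; // illustrative constant

    interface Event {
        long getEventType();
        String getTargetID();
    }

    // "Did someone just try to steal from me and fail?" expressed as a trigger predicate.
    static boolean shouldAttack(Event e, String myID) {
        return e.getEventType() == STEAL_FAILS_EVENT_TYPE && myID.equals(e.getTargetID());
    }

    public static void main(String[] args) {
        Event failedSteal = new Event() {
            public long getEventType() { return STEAL_FAILS_EVENT_TYPE; }
            public String getTargetID() { return "npc-17"; }
        };
        System.out.println(shouldAttack(failedSteal, "npc-17")); // the victim retaliates: true
        System.out.println(shouldAttack(failedSteal, "npc-99")); // a bystander does not: false
    }
}
```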
<p class="workbook">
Who'll see what happened? Depends on their <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Character/skills/tableOfSkills.jsp">SeeSkill</a> and other <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Character/attributes.jsp">attributes</a>. If observers are skilled enough, they may be able to watch Thieves stealing from victims. The victim won't see it happen, but observers with high enough SeeSkill might. One of many many forms of <a href="https://triadcityauthors.blogspot.com/2015/05/literary-modernism-in-triadcity.html">character subjectivity</a> in TriadCity.
</p>
<p class="workbook">
OO allows TriggeredBehaviors to be niftily flexible. Behaviors can be triggered if they receive any MovementEvent, or only if the MovementEvent.getEventType() == BOW_EVENT_TYPE. So: any movement in your Room and you'll jump to your feet on guard. But, you'll only Bow if someone Bows to you first.
</p>
<p class="workbook">
All Behaviors are Event-triggered, even ones which appear random, for instance pigeons moving around the Gray Plaza. Although their movement Behavior is indeed semi-random, it's triggered by a worldwide pulse called the DoBehaviorsEvent, which is propagated everywhere in the game world once per second by an Executor running a background Thread. Every Behavior tests that DoBehaviorsEvent: do I ignore it, or randomly do something, or do something not randomly? TimeEvents trigger actions such as closing shops and going home, and so on.
</p>
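<p class="workbook">
A sketch of such a pulse using a ScheduledExecutorService, standing in for the real propagation machinery. The period is shortened from one second so the demo finishes quickly; the names here are illustrative, not TriadCity's.
</p>

```java
import java.util.concurrent.*;

// A once-per-tick pulse standing in for DoBehaviorsEvent propagation.
// The real server fans the event out to every Room; here a latch counts ticks.
public class BehaviorPulseSketch {
    // Returns true once `ticks` pulses have fired within the timeout.
    static boolean runPulse(int ticks, long periodMillis) {
        ScheduledExecutorService pulse = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(ticks);
        // Each tick would carry a DoBehaviorsEvent; each Behavior decides whether to act on it.
        pulse.scheduleAtFixedRate(remaining::countDown, 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            return remaining.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pulse.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // TriadCity pulses once per second; a 50ms period keeps the demo quick.
        System.out.println(runPulse(3, 50)); // true
    }
}
```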
<p class="workbook">
Events are integral to Player subjectivity. One form this takes is that Player objects include a Consciousness Strategy object via composition. That Consciousness object acts as a filter on the Events passed through it. It has the ability to modify or ignore the Event.getMessageForXYZ() methods it encounters, or generate entirely new ones. This is one of the ways we can make you see things which maybe aren't even "there".
</p>
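<p class="workbook">
In miniature, with an assumed bare-bones contract — the real Consciousness Strategy filters whole Events rather than plain strings, and these implementations are invented:
</p>

```java
// Sketch of the Consciousness strategy as a message filter.
// A lucid consciousness passes messages through; a distorted one rewrites them.
public class ConsciousnessSketch {
    interface Consciousness {
        String filter(String messageForSource);
    }

    static final Consciousness LUCID = msg -> msg;
    static final Consciousness FEVERISH = msg -> "The room swims. " + msg + " Or did it?";

    // The Player sees only what its composed Consciousness lets through.
    static String perceive(Consciousness c, String message) {
        return c.filter(message);
    }

    public static void main(String[] args) {
        System.out.println(perceive(LUCID, "A pigeon lands nearby."));
        System.out.println(perceive(FEVERISH, "A pigeon lands nearby."));
    }
}
```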
<p class="workbook">
Events in TriadCity are like the chemical "messages" flying around your brain, telling you when to breathe and fire your heart muscle and decide to go online to play games.
</p>
<figure class="artwork" style="width: 50%;">
<a href="https://bioinformaticsreview.com/20160527/cytoscape-js-a-graph-library-for-network-visualization-and-analysis/"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/Neo4JGraph.3.png" alt="Neo4j Graph"></img></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-46992500393145164022014-02-07T11:55:00.000-08:002018-05-13T11:10:41.107-07:00Why Java's a Good Platform for MMORPG Servers<p class="workbook">
I recently found a discussion of programming languages on a MMORPG hobbyist bulletin board. Much of the lore there was garbled, revolving around old but persistent myths of the performance characteristics of C, C++, and Java. I thought I'd share some hard-won wisdom.
</p>
<p class="workbook">
To first state my claim to bona fides in this realm. In real life I'm a cloud-scale software architect at the senior executive level, meaning I've got a couple of decades' experience building really big commercial systems with bazillions of users. Plus I've got twenty years hands-on writing <a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a>, which I hope I can say without boasting is the most technologically sophisticated textual RPG to date. So here's the true dope.
</p>
<p class="workbook">
Under super heavy load — truly <i>massively</i> multiplayer scale — Java has a slight advantage over other languages in scalability and perceived responsiveness because under load its garbage collection mechanism makes more efficient use of CPU cycles than hand-rolled memory management. Yes, I realize that contradicts the claims that many developers will make, but, it's not hard to understand. If you profile an Apache web server under very heavy stress, you'll find it spending a large percentage of its cycles calling malloc() and free(), where every call requires some number of CPU instructions. At its core it's just doing lots of String processing, but there you go: memory in, memory out, many many millions and millions of times once the load gets going. At some point, memory management will crowd business logic out of the processor, and perceived degradation is no longer graceful.
</p>
<p class="workbook">
By contrast, Java allocates its heap in just one go, when its runtime initializes. Garbage collection is a background process and on multi-core systems does its thing pretty efficiently. CPU time can be dedicated to business logic instead of system overhead. This plus team productivity are the two reasons Java dominates the market for enterprise application servers.
</p>
<p class="workbook">
To highlight some qualifiers. A well-designed and written system in C or C++ will outperform a badly designed one in Java. "Better" is contextual: do you care about throughput, latency, graceful degradation under stress, or some other criterion? Maybe you're super proficient in one language over another: use that one.
</p>
<p class="workbook">
Note in general that OO languages are a better match for game servers than procedural ones. I started TriadCity in C back in the day because at that time I was more proficient in C, and there was plenty of open source C game code to model things on. But I found in practice that some of the concepts I wanted to express just didn't "feel" intuitive in C, and it became harder and harder to be productive in the code base once it got large enough. Switching to Java solved these problems because thinking in objects maps more intuitively to virtual world concepts; and because, at least in my experience, it's far easier to organize a very large OO code base for extensibility and maintainability. I recently read a different hobbyist RPG board where the lead developer of a very popular old-school MUD written in C complained about the effort required to implement a new character class. I'm pretty sure I can tell you why it was so rough for him. I don't think it would have been as much of a chore in TriadCity.
</p>
<p class="workbook">
To underscore this point: OO code bases scale better. Today TriadCity has about 40,000 server-side classes. No, seriously. But I can find them all easily, because organizing OO code is straightforward, and Java's package concept helps. More importantly, thoughtful OO design insulates you against brittleness, enabling evolution. Adding that new character class would have been less daunting in an OO world.
</p>
<figure class="artwork" style="width: 50%;">
<a href="http://www.internetactu.net/2009/06/02/yochai-benkler-depasser-lanalyse-de-la-topologie-des-reseaux/"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/Neo4JGraph.4.png" alt="Neo4j Graph"></img></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-55222548568757958862014-02-06T12:10:00.000-08:002018-05-13T11:07:08.459-07:00Death to Mazes Like Moria!<p class="workbook">
I suppose that mazes are a staple of the MUD tradition. Starting with Adventure, you're underground and you're lost. Throw in some orcs and you're in Moria. Whatevs.
</p>
<p class="workbook">
We follow the tradition with lotso mazes, but, we try to make them interesting. This post'll be about the technology rather than the content, but, we really do try to make the content interesting too.
</p>
<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a> Maze types are implemented via the <a href="https://triadcitydevelopers.blogspot.com/2013/12/scalable-command-routing-via-strategy.html">Strategy OO pattern</a>. There's that word "Strategy" again. We do a lot of that. A Java Interface called Maze defines the contract; multiple concrete implementations can be swapped out at runtime. Among the implications: world authors not only have a ton of maze algorithms at their disposal, but we can actually <i>swap out algorithms</i> at runtime whenever that seems the thing to do.
</p>
<p class="workbook">
TwoDMaze is a simple x/y grid of <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/rooms.jsp">Rooms</a>, always in a rectangular pattern. World authors can specify how many Rooms wide versus tall, and the Builder tools will prompt them if they provide the wrong number of Rooms. Thus a TwoDMaze defined as x=7 y=5 has gotta have 35 Rooms assigned, no more, no less. TwoDMaze puts the Rooms in consistent order across the grid, that is, Room 1 will always be the upper left corner, and the Rooms will be assigned in left-to-right order at every server boot. <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/exits.jsp">Exits</a> between Rooms will be assigned dynamically according to whichever maze-drawing algorithm the world author chooses. You guessed it, these algorithms are again Strategy objects. We've got Kruskal, Prim, Depth-First, and a pile of others. Exits will be computed and distributed around the maze at server boot.
</p>
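<p class="workbook">
For illustration, here's a depth-first maze-carver over an x-by-y grid, in the spirit of those Strategy algorithms. The Room numbering and the Exit representation are invented for the sketch, not TriadCity's actual classes; swapping the Random seed (or shuffling the Room order first) gives a different layout every boot.
</p>

```java
import java.util.*;

// Depth-first maze carving over a width-by-height Room grid.
// Rooms are numbered left-to-right, top-to-bottom; exits.get(a) holds
// the neighbors Room a has a carved Exit to. The result is a spanning tree.
public class DepthFirstMazeSketch {
    static Map<Integer, Set<Integer>> carve(int width, int height, Random rng) {
        Map<Integer, Set<Integer>> exits = new HashMap<>();
        Deque<Integer> stack = new ArrayDeque<>();
        Set<Integer> visited = new HashSet<>();
        stack.push(0);
        visited.add(0);
        while (!stack.isEmpty()) {
            int room = stack.peek();
            int x = room % width, y = room / width;
            // Collect unvisited grid neighbors of the current Room.
            List<Integer> unvisited = new ArrayList<>();
            if (x > 0 && !visited.contains(room - 1)) unvisited.add(room - 1);
            if (x < width - 1 && !visited.contains(room + 1)) unvisited.add(room + 1);
            if (y > 0 && !visited.contains(room - width)) unvisited.add(room - width);
            if (y < height - 1 && !visited.contains(room + width)) unvisited.add(room + width);
            if (unvisited.isEmpty()) { stack.pop(); continue; } // dead end: backtrack
            int next = unvisited.get(rng.nextInt(unvisited.size()));
            exits.computeIfAbsent(room, k -> new HashSet<>()).add(next); // carve both directions
            exits.computeIfAbsent(next, k -> new HashSet<>()).add(room);
            visited.add(next);
            stack.push(next);
        }
        return exits;
    }

    public static void main(String[] args) {
        // A 7x5 TwoDMaze has 35 Rooms; every Room ends up reachable.
        Map<Integer, Set<Integer>> maze = carve(7, 5, new Random(1));
        System.out.println(maze.size()); // 35
    }
}
```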
<p class="workbook">
Add a z dimension and it's a ThreeDMaze. Have the Exits dynamically reassigned at random or semi-random during server uptime and it's a FourDMaze — think Harry and Cedric racing for the cup.
</p>
<p class="workbook">
We can vary all of these by not assigning the Rooms in consistent order. Just shuffle the Room[] before laying out the grid. Now there'll always be a Room 1, but it won't always be in the upper left corner. It could be anywhere, meaning that special fountain or other object of interest will have to be searched for anew after every server boot. Or shuffle them every hour, or shuffle the grid <i>and</i> the Exits and really confuse people.
</p>
<p class="workbook">
But wait, there's more. We don't have to always assign ordinary FreeExits — the Exits that allow movement between Rooms without impediment. They can be RandomDestinationTeleportationExits or any other Exit type we like, in whatever proportion we like. Yes, there's one extremely notorious maze full of nasty critters with pointy sharp teeth offering only a single teleporter to get out — no way you're going back the way you came. Bring a large party with supplies. Find that teleporter before you run out of grenades.
</p>
<p class="workbook">
More still. Authors can assign <a href="https://triadcitydevelopers.blogspot.com/2014/02/all-ai-in-triadcity-is-event-driven.html">Behaviors</a> to the Rooms or the Exits, making things even more interesting. <a href="https://www.smartmonsters.com/Fora/posts/list/1867.page">TelGar</a>'s writing a big slag heap you have to climb to reach something valuable. The slag heap kicks up tonso dust so there's no telling exactly where you are, and because the ground is loose, you're constantly being forced downward as the earth slides out from under you. It's a compelling adventure, enabled by combining a dynamically-changing Maze type with a SlideRoomBehavior which forces you to move in a certain direction, like a slide in a playground. You can eventually get to the top, but it's a struggle.
</p>
<p class="workbook">
These Strategy Pattern objects allow the server code to remain tidy and easy to maintain, while putting a lot of power into the hands of world authors. Authors can assign whatever crazy algorithm best suits their thematic purpose. Just how lost do you want the punters to get? You've got lots of options. Somebody invents a new maze-generation algorithm? Great, we'll add it. Easy-peasy.
</p>
<p class="workbook">
Death to mazes like Moria!
</p>
<figure class="artwork" style="width: 50%;">
<a href="https://wallpaperstock.net/color-maze_wallpapers_28693_1280x1024_1.html"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/ColorMazeWallpapers.jpg" alt="Color Maze Wallpapers"></img></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-39580255981696957792014-02-04T12:53:00.000-08:002018-05-13T10:50:01.458-07:00Using the "Strategy" OO Pattern for Player Subjectivity<p class="workbook">
Uniquely to any "game" we're familiar with, <a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a> imposes <a href="https://triadcityauthors.blogspot.com/2015/05/literary-modernism-in-triadcity.html">elaborate subjectivities</a> on its players.
</p>
<p class="workbook">
This means that if you and I walk our characters into the same <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/rooms.jsp">Room</a> at the same time, we might see it described differently depending on our characters' <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Character/attributes.jsp">attributes</a>, <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Character/skills/skills.jsp">skills</a>, <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Character/alignment.jsp">moral alignment</a>, history, and other variables. The type and degree of subjectivity is largely controlled by the world author who wrote the Room. Most authors tread lightly, but there's no reason they can't make things wildly different if that makes sense for their context.
</p>
<p class="workbook">
It also means that one of our characters could be <a href="https://www.smartmonsters.com/TriadCity/MOTD/motd.2015-10-10.jsp">Schizophrenic</a>. We can impose any manner of "unreal" experiences on characters, from highly distorted Room descriptions to conversations with other characters who don't actually exist.
</p>
<p class="workbook">
The code which enables these two forms of subjectivity is conceptually simple. Each form relies on the <a href="https://triadcitydevelopers.blogspot.com/2013/12/scalable-command-routing-via-strategy.html">Strategy OO pattern</a>, applied in different places. I'll briefly sketch them both.
</p>
<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/rooms.jsp">Rooms</a>, <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/items.jsp">Items</a>, <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/exits.jsp">Exits</a>, <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Objects/characters.jsp">Characters</a> all share common attributes inherited from their parent class, WorldObject. These include Descriptions, Smells, Sounds, Tastes, and Touches. I'm capitalizing the nouns because these are Java Interfaces which WorldObject possesses by composition. A code snip for illustration:
</p>
<blockquote>
public abstract class WorldObject {
<br> protected Descriptions descriptions;
<br> protected Smells smells;
<br> protected Sounds sounds;
<br> protected Tastes tastes;
<br> protected Touches touches;
<br><br> public final String getDescription(WorldObject observer) {
<br> return this.descriptions.getDescription(observer);
<br> }
<br> public final String getSmell(WorldObject observer) {
<br> return this.smells.getSmell(observer);
<br> }
<br> public final String getSound(WorldObject observer) {
<br> return this.sounds.getSound(observer);
<br> }
<br> public final String getTaste(WorldObject observer) {
<br> return this.tastes.getTaste(observer);
<br> }
<br> public final String getTouch(WorldObject observer) {
<br> return this.touches.getTouch(observer);
<br> }
<br>}
</blockquote>
<p class="workbook">
Descriptions, Smells, Sounds, Tastes and Touches are Strategy Pattern Interfaces defining the simple contract which will be implemented by a rich hierarchy of concrete classes. At runtime, the correct implementing class is assigned to each WorldObject according to the directions given to that WorldObject by the author who wrote it.
</p>
<p class="workbook">
For example, a world author may write a Room and choose to have its descriptions vary by hour of day. Using our authoring tools, she assigns it the TimeBasedDescriptions type, and writes a description for dawn, dusk, day, and night. At server boot, that Room will be assigned the type of Descriptions she specified, and populated with her writing. The TimeBasedDescriptions class has the responsibility of figuring out what time it is whenever its getDescription() method is called, and returning the right one.
</p>
<p class="workbook">
That's not subjective, though: although the descriptions vary by time of day, every character who enters the Room sees the same one. If she wants to be more adventurous, our world author can assign other concrete Descriptions types from our library. She could choose AlignmentAndTimeBasedDescriptions, which varies descriptions by both time of day and the observing character's moral alignment; or Descriptions which vary by Gender, health condition, or attribute status. Each concrete implementing class knows how to calculate the description appropriate for its observer. As you can see, the code for WorldObject remains conceptually simple.
</p>
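<p class="workbook">
For illustration, here's a minimal sketch of what a concrete Descriptions Strategy might look like. The Descriptions interface and getDescription() signature come from the snippet above; the hour boundaries, the injected game clock, and the Map-backed storage are assumptions for the sketch, not our actual implementation.
</p>

```java
import java.util.Map;
import java.util.function.IntSupplier;

// Strategy contract from the WorldObject snippet above. The real contract
// takes a WorldObject observer; Object keeps this sketch self-contained.
interface Descriptions {
    String getDescription(Object observer);
}

// Hypothetical sketch of TimeBasedDescriptions: bucket the current game
// hour into dawn/day/dusk/night and return the author's text for that
// period. Hour boundaries here are invented.
class TimeBasedDescriptions implements Descriptions {
    private final Map<String, String> byPeriod;
    private final IntSupplier gameHour; // injected game clock, 0-23

    TimeBasedDescriptions(Map<String, String> byPeriod, IntSupplier gameHour) {
        this.byPeriod = byPeriod;
        this.gameHour = gameHour;
    }

    private static String period(int hour) {
        if (hour >= 5 && hour < 7) return "dawn";
        if (hour >= 7 && hour < 18) return "day";
        if (hour >= 18 && hour < 20) return "dusk";
        return "night";
    }

    @Override
    public String getDescription(Object observer) {
        // This implementation ignores the observer: same text for everyone.
        return byPeriod.get(period(gameHour.getAsInt()));
    }
}
```

<p class="workbook">
A subjective variant like AlignmentAndTimeBasedDescriptions would simply consult the observer as well as the clock before choosing its text; WorldObject never changes.
</p>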
<p class="workbook">
A second and often more radical flavor of subjectivity is implemented by a Strategy Interface called Consciousness. Each character has a composed Consciousness object which acts as a filter, transforming messages received via game Events, or not, according to its logic. The TriadCharacter class includes these snips:
</p>
<blockquote>
public abstract class TriadCharacter extends WorldObject {
<br> protected Consciousness consciousness;
<br> public boolean processEvent(MovementEvent evt) {
<br> return this.consciousness.processEvent(evt);
<br> }
<br> public boolean processEvent(SenseEvent evt) {
<br> return this.consciousness.processEvent(evt);
<br> }
<br> public boolean processEvent(SocialEvent evt) {
<br> return this.consciousness.processEvent(evt);
<br> }
<br>}
</blockquote>
<p class="workbook">
Concrete Consciousness implementations include SleepingConsciousness, FightingConsciousness, SchizophrenicConsciousness, IncapacitatedConsciousness, and many others. Their job is to analyze incoming Events and determine whether to pass along the messages associated with those Events unchanged, or dynamically altered. A simple change might be to squelch an Event altogether if the character is asleep. A radical one might be to modify a ChatEvent or TellEvent, or even generate entirely new ones, if the character is experiencing SchizophrenicConsciousness. In these cases, the Consciousness type is assigned by the server at runtime, not by world authors, and it'll be swapped out for a different one as circumstances change. So these transformations are extremely dynamic.
</p>
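<p class="workbook">
A hedged sketch of the shape of two Consciousness implementations. The simplified Event hierarchy and the particular distortion applied are assumptions for illustration; the real server overloads processEvent() per Event type as shown above, and its transformations are much richer.
</p>

```java
// Minimal event types standing in for the game's real Event hierarchy.
class Event { String message; Event(String message) { this.message = message; } }
class SenseEvent extends Event { SenseEvent(String m) { super(m); } }
class ChatEvent extends Event { ChatEvent(String m) { super(m); } }

// Strategy contract: return false to squelch the event entirely, true to
// pass its (possibly altered) message along to the character.
interface Consciousness {
    boolean processEvent(Event evt);
}

// A sleeping character perceives nothing: squelch everything.
class SleepingConsciousness implements Consciousness {
    public boolean processEvent(Event evt) { return false; }
}

// Hypothetical sketch: distort chat text before passing it through.
class SchizophrenicConsciousness implements Consciousness {
    public boolean processEvent(Event evt) {
        if (evt instanceof ChatEvent) {
            evt.message = evt.message + " ...or did no one say that?";
        }
        return true;
    }
}
```

<p class="workbook">
Swapping a character's filter is then a one-line assignment to the consciousness field, which is what makes these transformations so dynamic.
</p>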
<p class="workbook">
Using the Strategy pattern via composition allows the class hierarchy to remain reasonably flat and simple. It would be possible, for example, to create a massive class tree resulting in things like TriadCharacterWithDescriptionsThatVaryByAlignmentAndTimeWithSchizophrenicConsciousness, which would be a nightmare to maintain. Composition keeps the WorldObject tree simple; Strategy objects chosen at runtime provide the rich library of behaviors.
</p>
<figure class="artwork" style="width: 50%;">
<a href="https://mipt.ru/cdpo/programs/natural-science/math-economy.php"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/GraphNetwork.jpg" alt="a coloured rendering of a complex graph network"></img></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-13118690741109871952014-01-30T06:04:00.000-08:002018-05-13T06:59:05.656-07:00"Subsumption" Architecture for NPC AI<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a> employs multiple AI strategies to drive NPC behaviors. Most of these are quite simple, AI being in my opinion far more artificial than intelligent. One of our most successful strategies to date is based on Rodney Brooks' "<a href="https://en.wikipedia.org/wiki/Subsumption_architecture">subsumption architecture</a>" which we've borrowed from robotics. Here's a short description of how we use it.
</p>
<p class="workbook">
Subsumption rejects the approach of modeling intelligence via symbolic representation, choosing instead to <em>mimic</em> lifelike behaviors by building high-level abstractions from layered libraries of low-level components. Lowest-level components might be as simple as "I hit something"; the next layer up might include avoidance strategies such as "back up and try a slightly random vector", or "move left several inches and try again". A next level up might abstract from particulars to "explore the world". Behaviors become more removed from details the higher they are in a pyramid of abstraction, with "behave like a human" at the top.
</p>
<p class="workbook">
In TriadCity we implement subsumption by composing higher-level behaviors from a vast library of detailed lower-level ones. Low-level examples are "open the door", "look around the room", "pick up an object". These correspond to <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Commands/index.jsp">TriadCity's library of player-typable commands</a>: there's a building-block behavior corresponding to each TC command. An intermediate abstraction might be "move from current location to the Virtual Vegas Casino", which involves querying a directed graph, moving along the retrieved path, opening any doors encountered, and so on. The next level up might be "play slot machines in the Virtual Vegas Casino", which would require successfully moving to the correct location in the game world and executing the <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Commands/complete/playslot.jsp">Playslot</a> command. Higher still might be "gamble in the Virtual Vegas Casino", which would make a randomized selection among playing slots, roulette, and so on. World authors don't need to interact with subsumed low-level components. They just assign high-level behaviors and the architecture manages details.
</p>
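<p class="workbook">
The layering can be sketched in a few lines. This is a toy illustration, not our actual behavior classes: the Behavior interface, the class names, and the StringBuilder standing in for acting on the world are all invented; only the layered shape matches what's described above.
</p>

```java
import java.util.List;
import java.util.Random;

// Each Behavior is a building block; higher layers compose lower ones.
interface Behavior {
    void perform(StringBuilder log); // the log stands in for acting on the world
}

// Lowest layer: one building block per player-typable command.
class CommandBehavior implements Behavior {
    private final String command;
    CommandBehavior(String command) { this.command = command; }
    public void perform(StringBuilder log) { log.append(command).append('\n'); }
}

// Intermediate layer: a fixed sequence, e.g. "go to the casino, then play".
class SequenceBehavior implements Behavior {
    private final List<Behavior> steps;
    SequenceBehavior(List<Behavior> steps) { this.steps = steps; }
    public void perform(StringBuilder log) { steps.forEach(s -> s.perform(log)); }
}

// Higher layer: "gamble" picks randomly among concrete gambling behaviors.
class RandomChoiceBehavior implements Behavior {
    private final List<Behavior> choices;
    private final Random rng;
    RandomChoiceBehavior(List<Behavior> choices, Random rng) {
        this.choices = choices;
        this.rng = rng;
    }
    public void perform(StringBuilder log) {
        choices.get(rng.nextInt(choices.size())).perform(log);
    }
}
```

<p class="workbook">
An author who assigns the top-level "gamble" behavior never sees the graph queries and door-openings buried in the layers below it.
</p>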
<p class="workbook">
A common example in TC is the assignment of weighted "behavior groups" based on time of day. Let's say you assign your NPC two behavior groups, one for the dawn, day, and dusk hours, and a second for night. The daylight group may include sending the NPC to work; hunting slave killers in <a href="https://www.smartmonsters.com/TriadCity/MOTD/motd.2014-01-24.jsp">Zaroff Park</a>; eating lunch at a favorite restaurant; or watching the chariot races in the <a href="https://www.smartmonsters.com/TriadCity/MOTD/archives/motd.2012-08-12.jsp">Circus Maximus</a>. You assign each of these possibilities a weight determining its likelihood of being chosen, plus a weight determining its likelihood of being swapped out for a different one. By night you might have your NPC eat dinner somewhere; see a movie; go shopping; go back to hunting its favorite slave killers; or go home to sleep. The result is strikingly rich. Follow this fellow around and he performs his varied activities with human-like unpredictability. He won't simply circle the same static course over and over: if there's an NPC you'd really like to stalk, you'll have to find him first, because he could be anywhere in the city, doing things appropriate to his character. This is trivial for world authors to work with, allowing rich individualization of non-player characters with little effort. It puts considerable AI sophistication into play without forcing authors to interact with it.
</p>
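<p class="workbook">
The weighted choice itself is simple. A minimal sketch, assuming a cumulative-weight table; the class name and representation are invented for illustration:
</p>

```java
import java.util.List;
import java.util.Random;

// Hypothetical weighted behavior group: each entry carries a weight
// determining how likely it is to be chosen for the current period.
class WeightedGroup<T> {
    private final List<T> items;
    private final int[] cumulative; // running totals of the weights
    private final Random rng;

    WeightedGroup(List<T> items, int[] weights, Random rng) {
        this.items = items;
        this.rng = rng;
        this.cumulative = new int[weights.length];
        int sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i];
            cumulative[i] = sum;
        }
    }

    // Pick an item with probability proportional to its weight.
    T pick() {
        int r = rng.nextInt(cumulative[cumulative.length - 1]);
        for (int i = 0; i < cumulative.length; i++) {
            if (r < cumulative[i]) return items.get(i);
        }
        throw new IllegalStateException("unreachable");
    }
}
```

<p class="workbook">
The swap-out weights work the same way: when a behavior completes, a second weighted roll decides whether to keep it or draw a new one from the group.
</p>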
<p class="workbook">
These "subsumption" behaviors contribute verisimilitude to the game world. In TC you don't get a lot of NPCs standing around waiting to be killed. We even send our judges home from court at night. There's continual motion and variation: you can stand on a popular street corner people-watching NPCs on their way to and from jobs and hobbies and murders and freeing slaves.
</p>
<p class="workbook">
<a href="https://www.amazon.com/exec/obidos/ASIN/0131018167/smartmonsters">Richard Bartle</a> wrote, "It would be great to have a virtual city with 100,000 virtual inhabitants, each making real time decisions as to how to spend their virtual lives. We may have to wait some time before we get this, though." Not really. TriadCity's done it for years.
</p>
<figure class="artwork" style="width: 50%;">
<a href="https://linkurio.us/blog/tag/gephi/"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/Neo4JGraph.5.png" alt="Neo4j Graph 2"></img></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-4043767069378414462013-12-14T06:59:00.000-08:002018-05-13T10:29:03.311-07:00Why Not Telnet? 'Cos grid computing.<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a> being text-based, it would have been possible to design its server to communicate with clients via the standard telnet protocol. This is the strategy taken by classical MUDs including DikuMUD and CircleMUD, and by many popular proprietary MUDs today. It allows players to use their choice of free telnet clients, including ones optimized for the most common MUD experiences such as command aliasing and so on.
</p>
<p class="workbook">
Our choice to write proprietary clients for TriadCity is determined by TC's scale and complexity. Unlike conventional MUDs, TC's server has many hundreds of thousands of AI decisions to compute from moment to moment, determining NPC actions and generating the <a href="https://triadcityauthors.blogspot.com/2015/05/literary-modernism-in-triadcity.html">often complex subjectivities</a> displayed to human players. We use proprietary clients to distribute some of the processing. TriadCity is less like old-school client-server with a smart server and dumb clients, and more like a distributed computational grid capitalizing on the horsepower made available by players. These are some of the tasks performed by clients:
</p>
<ul>
<li><a href="https://www.smartmonsters.com/TriadCity/MOTD/motd.2014-01-14.jsp">"Natural language" processing</a>. We use a two-stage parser, where the first and more computationally demanding step lives in the clients. Clients send only perfectly-formed, canonical commands over the wire, performing the heavy lifting so the server doesn't have to. While the processing requirement for a single user command is not burdensome, offloading that requirement allows the server to scale more easily to MMORPG levels.</li>
<li>Distributed arithmetic. Wherever sustained computation is required — for example, 2D vector arithmetic for <a href="https://www.smartmonsters.com/TriadCity/MOTD/archives/motd.2012-08-12.jsp">chariot races</a> — portions of those computations can be parallelized and distributed. Think SETI@Home. In that regard the TriadCity client/server architecture is something like a single distributed computer.</li>
<li>Character state management. Housekeeping such as hunger and thirst calculations, <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Character/attributes.jsp">attribute</a> computations, and consciousness state can live happily in the clients. Much of the arithmetic driving character subjectivity and situational capability can be offloaded.</li>
<li>Network encryption. Telnet's not secure. TC's clients encrypt network I/O, so your password is safe at Starbucks.</li>
<li>Firewall immunity. TC's clients allow connection through company firewalls, so we can happily diminish players' productivity at work. Telnet usually doesn't.</li>
<li>Bandwidth optimizing. TC's clients transmit only when the communication is guaranteed to be correctly formed. The server doesn't have to invest cycles processing garbage.</li>
<li>Integrated support for assistive technologies. Our HTML5 client implements WAI-ARIA Live Regions; the Java applet has an integrated screen reader.</li>
<li>Additional security measures. Because we control the clients, it's easy to know when Bad Guys On The Net are trying to poke the server. Swish-and-flick! — into the blacklist bin with them. Easy-peasy.</li>
</ul>
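<p class="workbook">
To illustrate the first parser stage, here's a hedged sketch of client-side canonicalization: expand aliases and normalize whitespace so only well-formed, canonical commands cross the wire. The alias table, class name, and canonical forms are invented for illustration; they're not our actual client code.
</p>

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical first-stage parser living in the client.
class ClientParser {
    private static final Map<String, String> ALIASES = Map.of(
        "n", "north", "s", "south", "l", "look", "'", "say");

    // Returns the canonical command line, or empty if the input is garbage
    // the server should never see.
    static Optional<String> canonicalize(String raw) {
        String trimmed = raw.trim().replaceAll("\\s+", " ");
        if (trimmed.isEmpty()) return Optional.empty();
        String[] parts = trimmed.split(" ", 2);
        String verb = ALIASES.getOrDefault(parts[0].toLowerCase(), parts[0].toLowerCase());
        return Optional.of(parts.length == 1 ? verb : verb + " " + parts[1]);
    }
}
```

<p class="workbook">
The second stage on the server then only has to dispatch already-validated commands, which is most of what makes the offload worthwhile.
</p>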
<p class="workbook">
Most of this architectural logic is about helping the server scale to MMORPG levels without having to shard the game world or add other complexity. We routinely test with 10k simulated players, and despite its significant AI computational load the server does just fine.
</p>
<p class="workbook">
Of course, we don't have 10k simultaneous players in real life. <a href="https://www.smartmonsters.com/TriadCity/referPromo.jsp">Invite your friends</a>.
</p>
<figure class="artwork" style="width: 50%;">
<a href="https://commons.wikimedia.org/wiki/File:The_protein_interaction_network_of_Treponema_pallidum.png"><img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/Neo4JGraph.6.png" alt="Neo4j Graph 3"></img></a>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0tag:blogger.com,1999:blog-8720424319330148698.post-13777330671754866932013-12-13T09:39:00.000-08:002018-05-13T10:16:44.931-07:00Scalable Command Routing Via the Strategy and HashedAdaptor OO Design Patterns<p class="workbook">
<a href="https://www.smartmonsters.com/TriadCity/index.jsp">TriadCity</a>'s <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Commands/index.jsp">command vocabulary</a> is far larger than the 400 or so commands typed by players. Every Admin and Builder action is implemented as a command — there are tens of thousands altogether.
</p>
<p class="workbook">
In C, you'd probably route user input through a simple switch() statement. CircleMUD does this. The parser and command router are tightly coupled inside one master controlling statement which switch()es on the user's first word of input. Logic can be parceled out to functions, or handled right there inside the switch() statement itself. This is fine for a couple of hundred commands, but growing it to tens of thousands would be a performance and maintenance disaster.
</p>
<p class="workbook">
We ensure infinite command routing scalability for TriadCity by combining the Strategy and Hashed Adapter OO patterns.
</p>
<p class="workbook">
<a href="https://en.wikipedia.org/wiki/Strategy_pattern">Strategy</a> encapsulates an algorithm as an object, allowing algorithms to be selected and substituted at runtime. In Java you'll typically implement Strategy via an Interface defining the algorithm's contract. Classes implementing that Interface define the available algorithms. In TriadCity we have a Command Interface defining our Strategy contract:
</p>
<blockquote>
public interface Command {
<br> public boolean doCommand(<multiple-args-go-here>);
<br>}
</blockquote>
<p class="workbook">
And we have many thousands of classes implementing that Interface, for example, <a href="https://www.smartmonsters.com/TriadCity/PlayerGuide/Game/Commands/complete/chat.jsp">Chat</a>, which implements the doCommand() method to iterate all NPCs and logged-in Players, sending the arguments to the chat channel.
</p>
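<p class="workbook">
For illustration, a Chat-like implementation might look as follows. The argument list is invented here (the real doCommand() contract is elided above), and StringBuilders stand in for connected players' output streams.
</p>

```java
import java.util.List;

// The Command Strategy contract, with a simplified argument list.
interface Command {
    boolean doCommand(String actor, String args, List<StringBuilder> sessions);
}

// Hypothetical sketch of Chat: broadcast the arguments to every session.
// The real implementation iterates all NPCs and logged-in Players.
class Chat implements Command {
    public boolean doCommand(String actor, String args, List<StringBuilder> sessions) {
        for (StringBuilder session : sessions) {
            session.append(actor).append(" chats: ").append(args).append('\n');
        }
        return true;
    }
}
```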
<p class="workbook">
The Hashed Adapter pattern achieves scalable routing by stashing these concrete Command instances in a HashMap keyed by command name. Each concrete Command is a Singleton; we populate the map with command names as keys, instantiated Command Singletons as values. The resulting routing algorithm is super-simple:
</p>
<blockquote>
Command command = this.commands.get(commandName);
<br>command.doCommand(<multiple-args-passed-here>);
</blockquote>
<p class="workbook">
There's other logic wrapped around these lines, for example authorization checking for security. But that's the idea.
</p>
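<p class="workbook">
Putting the two patterns together, a minimal sketch of the Hashed Adapter registry and dispatch. The CommandRouter class name and the one-argument contract are assumptions for illustration; the real server wraps dispatch with authorization checks as noted.
</p>

```java
import java.util.HashMap;
import java.util.Map;

// Simplified Command contract for the sketch.
interface Command {
    boolean doCommand(String args);
}

// Hashed Adapter: command names map to Command Singletons, so dispatch
// is a single constant-time lookup however many commands exist.
class CommandRouter {
    private final Map<String, Command> commands = new HashMap<>();

    void register(String name, Command command) {
        commands.put(name, command);
    }

    // Returns false for unknown commands; the real server also checks
    // the caller's authorization before dispatching.
    boolean route(String commandName, String args) {
        Command command = commands.get(commandName);
        return command != null && command.doCommand(args);
    }
}
```

<p class="workbook">
Adding the ten-thousandth command is then just one more register() call at boot, with no central switch() to touch.
</p>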
<p class="workbook">
By using the Hashed Adapter this way, routing stays a single constant-time lookup no matter how large the command vocabulary grows. Server processing is very efficient, and for developers there's no nightmare switch() statement of ten million lines to maintain. And because Strategy makes each command its own class, tens of thousands of Command implementations remain easy to navigate and maintain.
</p>
<figure class="artwork" style="width: 50%;">
<img src="https://s3-us-west-2.amazonaws.com/sm-blogs-images/MachineLearningGraph.png" alt="Machine Learning Graph"></img>
</figure>Mark Phillipshttp://www.blogger.com/profile/18322659956471758922noreply@blogger.com0