We have a specific architectural need. We'd prefer to handle it with an AWS managed service. It's almost possible. But not quite.
To explain: our TriadCity server lives in the Oregon Region. With our current level of traffic we’ve decided not to shard the game world. So everyone on planet Earth who visits TriadCity arrives at that single large-ish EC2 instance.
To help with scalability we offload as much computation as possible. Much of the AI is computed by clients in a distributed-grid architecture similar to SETI@home. With the same goal in mind we offload TLS termination. Easy-peasy: incoming connections terminate TLS at a Classic Load Balancer, which forwards to the game server. We’d really prefer a Network Load Balancer: it’s cheaper and performs better. But, it doesn't terminate TLS. See? Exasperatingly almost.
Now. We have some legacy native clients. But most players today connect over WebSockets from HTML5/jQuery clients in their browsers. We open a WSS connection to the CLB, which terminates TLS and passes the packets through.
The architectural problem with WebSockets is our lack of control over socket parameters. With native clients we can set a long connection timeout, say, five seconds. Then when a user in Seoul or Sydney or London connects to our infrastructure in Oregon, the client doesn't hang up if network latency causes connection negotiation to take a second or two. Not so with WS. The browser vendors control the connect timeout. It's the truly major downside of keeping an API simple. We're stuck with whatever their settings are — and they're pretty freakin short. With the result that users in international locales experience frequent connect failures.
So let's work around that with proxies in Seoul, Sydney, London, and Virginia. We'll use Route 53's Latency Based Routing, so that clients transparently connect to whichever proxy gives them the best performance. WebSocket connections complete against the nearby proxy well inside the browser's timeout; the proxy then forwards the still-encrypted TLS stream to our CLB in Oregon. Our schools win, too! (You might have to live in California to get that joke.)
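Defining the latency records themselves is simple enough. Purely as a sketch, assuming the v1 Java SDK's Route 53 client, here's roughly what the Seoul record looks like; the zone ID, hostname and proxy IP are placeholders, and you'd UPSERT one of these per proxy Region:

import com.amazonaws.services.route53.AmazonRoute53;
import com.amazonaws.services.route53.AmazonRoute53ClientBuilder;
import com.amazonaws.services.route53.model.*;

/**
 * Sketch: create (or update) one latency-based A record pointing at the Seoul proxy.
 * Zone ID, hostname and IP are placeholders. Repeat per proxy Region; Route 53 then
 * answers each client with whichever record gives it the lowest latency.
 */
public class LatencyRecord {
    public static void main(String[] args) {
        AmazonRoute53 route53 = AmazonRoute53ClientBuilder.defaultClient();
        ResourceRecordSet seoul = new ResourceRecordSet()
                .withName("play.example-game.com.")
                .withType(RRType.A)
                .withTTL(60L)
                .withSetIdentifier("proxy-seoul")       // must be unique within the record set
                .withRegion("ap-northeast-2")           // the latency-based routing key
                .withResourceRecords(new ResourceRecord("203.0.113.10"));
        route53.changeResourceRecordSets(new ChangeResourceRecordSetsRequest()
                .withHostedZoneId("ZXXXXXXXXXXXXX")
                .withChangeBatch(new ChangeBatch().withChanges(
                        new Change().withAction(ChangeAction.UPSERT).withResourceRecordSet(seoul))));
    }
}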
Straightforward in principle. But, I'd sure like to not manage this infrastructure myself. Let's use the NLB for proxying! Perfect! But: nope, sorry: can't forward across Regions. What about if we peer the VPCs? You’d think the load balancers would be able to see instances in the peered VPC, but, not. Seriously? Yah. Not. Instances can see instances in peer'd VPCs, but, load balancers can't.
Well, fuck.
There's actually no AWS-native way to do this. You have to use EC2 instances running HAProxy or NGINX or whatever your favorite flavor is.
And the pain doesn't stop there. For High Availability you have to run a cluster of EC2s behind an ELB. So now we've got an ELB, EC2s in multiple AZs, proxy software, blah blah, all of which we have to pay for, monitor, worry about, concern ourselves with. Fuck fuck fuck. If the LBs could forward across Regions or to instances in peer'd VPCs, we wouldn't have to do any of this.
So where are we today? At the moment we're just proving that the concept works. I've got small EC2s in several Regions running minimal little socat. So far so good. It's not super scalable and it's not HA. Maybe when we have more money.
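All socat is doing for us is shoveling bytes in both directions without ever peeking inside the TLS stream. Just to make that concrete, here's a toy pass-through relay in Java; the listen port and origin hostname are placeholders, and the real thing is, of course, just socat:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

/**
 * Toy TCP pass-through relay: accept locally, forward the raw (still-encrypted)
 * bytes to the origin, and copy the responses back. Hostname and port are placeholders.
 */
public class Relay {
    public static void main(String[] args) throws IOException {
        try (ServerSocket listener = new ServerSocket(8443)) {
            while (true) {
                Socket client = listener.accept();
                Socket origin = new Socket("origin.example-game.com", 443);
                pump(client, origin);   // client -> origin
                pump(origin, client);   // origin -> client
            }
        }
    }

    // Copy bytes from one socket to the other on a background thread.
    private static void pump(Socket from, Socket to) {
        new Thread(() -> {
            try (InputStream in = from.getInputStream();
                 OutputStream out = to.getOutputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    out.flush();
                }
            } catch (IOException ignored) {
            } finally {
                try { from.close(); to.close(); } catch (IOException ignored2) { }
            }
        }).start();
    }
}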
Dear AWS: please tweak your LBs to allow forwarding across Regions. Or to see instances in peer'd VPCs. Either would be fine.
I remember vividly my first encounter with computer programming. It was 1975 and it was a simple space-conquest game the kids played at school. Perhaps an Altair, perhaps running Altair BASIC. I wanted to add some rules so I opened the source and typed them in natural language. Of course the compiler barfed, and the lesson I drew from that experience was extremely to the point: "I thought computers were supposed to be smart. This is the stupidest thing I've ever seen."
It was a naive opinion but entirely right.
Now for twenty years I've programmed highly advanced "artificial intelligence" driving the complex, rich, imaginary world TriadCity. It's still the stupidest thing I've ever seen. It's brilliant and nuanced and entirely fake.
Humanity need not fear computer "intelligence". The proper fear is computer stupidity, paired with the hubristic self-congratulation of programmers and architects and managers and defense officials and generals who genuinely imagine themselves capable of forecasting every conceivable scenario and use case and edge case and intersection. Where the end of the species comes not at the hand of Skynet, but of Dunning-Kruger.
TriadCity relies on a lot of AI. AI-driven characters have employers to report to, homes to sweep up, meals to cook, hobbies to pursue, dates to go out on. Sports leagues play games which simulate real life; AI-driven commodity speculators calculate optimal bids. It's a lot of AI. The old-school way to scale would be to verticalize the game server: more CPUs, more RAM, bigger-faster-more. The game-y way would be to shard the game world across multiple servers horizontally, which for now we want to avoid. Either way, we're a tiny little struggling game company and we want to keep our costs as close to nil as possible. Besides, brute force seems really kinda conceptually offensive for tasks like these.
There's a better way. We can borrow cycles from players' computers by offloading portions of these computations to our TriadCity clients. It's a form of grid computing which distributes our scaling problem across many connected machines.
We do a lot of this. I'll note two contexts here which are interesting to me.
NLP is handled by the client. Players type natural English: "Show me what’s inside the red bag"; or, "What’s in the bag?" Client code parses those inputs before transmitting them to the server, translating them to canonical MUD-style commands based on old-timey DIKUmud. In this example, the client sends "l in red bag" or "l in bag". Offloading this computation is very helpful in keeping the server small. Individually these computations aren't taxing, but when you scale them to tens of thousands per second you're talking nontrivial horsepower. Pushing that CPU cost to the client is a simple form of grid computing which is easy to implement and makes sense in our context.
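Purely to show the flavor of the translation (the real parser lives in the browser-side JavaScript and its grammar is far richer), here's a toy sketch of the pattern-to-command mapping; the regex and class names are invented for illustration:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy sketch: map a couple of natural-English "look inside" phrasings to MUD commands. */
public class ToyParser {
    private static final Pattern LOOK_IN = Pattern.compile(
            "(?i)^(?:show me )?what's (?:inside|in) (?:the )?(.+?)\\??$");

    public static String toMudCommand(String input) {
        String normalized = input.trim().replace('\u2019', '\'');   // curly apostrophes -> straight
        Matcher m = LOOK_IN.matcher(normalized);
        if (m.matches()) {
            return "l in " + m.group(1);    // e.g. "l in red bag"
        }
        return normalized;                  // fall through untranslated
    }

    public static void main(String[] args) {
        System.out.println(toMudCommand("Show me what's inside the red bag"));   // l in red bag
        System.out.println(toMudCommand("What's in the bag?"));                  // l in bag
    }
}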
Somewhat more elaborately, we use the clients to pre-compute sporting events, for example, chariot races modeled with accurate 2D physics. Think of SETI@Home: we push a data blat to the client with the stats of the chariot teams, their starting positions and race strategies; the client computes the race move-by-move, and sends a computed blat back to the server. The server caches these as scripts for eventual playback. To players the races appear to be real-time, but, they’re actually pre-determined by these client-side calculations. The arithmetic really isn’t that complex, but, it’s easy to distribute, and we want to take advantage of whatever server-side processor savings we can.
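As a hand-wavy sketch of the shape of that computation, with invented team stats and a deliberately dumbed-down one-dimensional "physics", here's the blat-in, script-out contract; seeding the random generator means the server could re-verify the result if it ever cared to:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Toy sketch: pre-compute a race tick-by-tick and return the script the server replays. */
public class RaceSim {
    static class Team {
        final String name; final double speed; final double stamina; double pos = 0;
        Team(String name, double speed, double stamina) {
            this.name = name; this.speed = speed; this.stamina = stamina;
        }
    }

    static List<String> simulate(List<Team> teams, double trackLength, long seed) {
        Random rng = new Random(seed);              // deterministic given the blat the server sent
        List<String> script = new ArrayList<>();
        for (int tick = 0; ; tick++) {
            for (Team t : teams) {
                double fatigue = Math.max(0.3, 1.0 - tick / (t.stamina * 100));
                t.pos += t.speed * fatigue * (0.9 + 0.2 * rng.nextDouble());
                script.add(String.format("%d %s %.1f", tick, t.name, t.pos));
                if (t.pos >= trackLength) {
                    script.add("WINNER " + t.name);
                    return script;                  // ship this back to the server for playback
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Team> teams = Arrays.asList(new Team("Reds", 9.5, 6), new Team("Blues", 9.8, 4));
        simulate(teams, 1000, 42L).forEach(System.out::println);
    }
}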
There's no inconvenience to players. Consumer computers spend bajillions of cycles idling between keystrokes. We borrow some. Nobody notices.
The major downside is that TriadCity players are locked to a specialized client: they can't, for example, use dumb telnet clients. This is an obstacle for blind players. Telnet clients are simple for screen readers to manage; our GUI clients have been challenging to integrate with them. As a workaround we've written a specialized talking client (which many sighted players turn out to prefer; I use it most of the time myself). But this isn't really where we want to live. We could enable telnet clients and forgo NLP for their users. We dislike that idea a lot, since it would violate our commitment to enabling blind players as first-class citizens in our universe, but we may go ahead and do it anyway based on player requests. Haven't decided yet.
Pushing AI to the client keeps the server small-ish. It's currently an 8-core, 16GB one-unit pizza box in a co-lo, which we intend to migrate to AWS in a few months. Despite all the AI, it can handle tens of thousands of players no sweat. If we actually had that many players we'd be thrilled.
TriadCity uses multiple AI architectures to model human behavior in different contexts. Even the simplest is pretty sophisticated: the subsumption architecture for in-game automata does a great job of keeping the City vibrant, where AI-driven characters have homes to go to, jobs to be employed at, hobbies to cultivate, and relationships to explore. At the higher end of the spectrum, we use external bots which log in to the game world just as human players do, mimicking human decision-making and in a very limited way mimicking human speech.
There's a trade-off between in-game and external AI which might not be immediately intuitive. While external AI has access to greater compute resources, in-game AI has access to the database. Think of movement. An AI-driven in-game character finds its way around the world by querying a weighted directed graph which the game server keeps in memory. The model goes like this: in-game AI are residents of the City; they've been there a while, and they know their way around. External AI face a different problem. If we want to model human behaviors, they can't simply query the database to ask how to get from point A to point B. Like humans they must explore the world, building their own mental representation of its geography, analogous to the paper or Visio maps drawn by players.
Here's a practical problem external AI characters must solve to survive. "My canteen's empty; I'm thirsty. Where is that fountain I passed a while back?" A human will probably use ordinary geospatial reasoning to reckon a reasonable path. A naive AI can walk back through a history of its moves. A more sophisticated, human-like AI needs an internal representation of where it's been, what it's found there, and how to navigate the most reasonable path back.
I've recently enhanced our external bots to build persistent knowledge maps of the world as they explore, using the excellent open-source graph database Neo4j. This is the same technology the TriadCity game server uses. The server's use case is extremely simple. At boot, once the game world is loaded into memory, a Neo4j graph of all the Rooms and their relationships is built, with properties describing path attributes. At runtime, AI characters simply query the graph for a path and off they go. The database isn't persisted; it's re-created at each boot and updated on the fly as World Builders add or delete Rooms. The external AI use case is more nuanced: the bot must build a persisted graph room-by-room as it explores. Fortunately the Neo4j code is again very simple.
Each time the bot enters a Room, it asks the graph if that Room is already present. If no, it creates the new Node, populating its properties with information it may want later: is there a water source here, a food source, a shop, a skill teacher? Along with the new Node, it adds a Relationship from the originating Node to the new one. A single simple LEADS_TO RelationshipType is all that's needed. If yes (the Room was already present in the graph), it simply adds the Relationship linking the origin and destination Nodes. With a little work we can add the cost of the move to the new Relationship: cache the bot’s energy before the move, compare it with the energy remaining once the move completes, and make that difference a property of the Relationship. The robot now has a comprehensive map of all the Rooms it's visited, the paths between them, and their costs. Finding the closest Room with a water source is a simple shortest-weighted-path traversal.
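Here's roughly what that per-Room bookkeeping looks like against the embedded Neo4j API. It's a simplified sketch, not our production class: the property names ("name", "hasWater", "cost") are illustrative, and the GraphDatabaseService is assumed to be already initialized elsewhere.

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

/** Sketch of the bot's map-building: one Node per Room, one LEADS_TO edge per observed move. */
public class BotAtlas {
    enum Labels implements Label { Room }
    enum RelTypes implements RelationshipType { LEADS_TO }

    private final GraphDatabaseService graphDb;

    public BotAtlas(GraphDatabaseService graphDb) { this.graphDb = graphDb; }

    /** Record a move from one Room to another, plus whatever we learned on arrival. */
    public void recordMove(String fromRoom, String toRoom, boolean hasWater, double energySpent) {
        try (Transaction tx = graphDb.beginTx()) {
            Node origin = findOrCreate(fromRoom);
            Node destination = findOrCreate(toRoom);
            destination.setProperty("hasWater", hasWater);
            Relationship path = origin.createRelationshipTo(destination, RelTypes.LEADS_TO);
            path.setProperty("cost", energySpent);   // energy before the move minus energy after
            tx.success();
        }
    }

    private Node findOrCreate(String roomName) {
        Node room = graphDb.findNode(Labels.Room, "name", roomName);
        if (room == null) {                          // first visit: add the Node
            room = graphDb.createNode(Labels.Room);
            room.setProperty("name", roomName);
        }
        return room;
    }
}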
Movement behavior is now more intelligent. Bots already knew not to re-enter the Room they previously left. Now they can prioritize directions they haven't yet explored. If a Room has Exits north, south and west, and the bot's previously explored north and west, next time it'll go south. The resulting behavior is less random, more like the way a systematic human would do things. Also, the bot can take the least expensive path back to a water source; that is, it'll get onto the trail rather than blundering through thickets. Nice.
Survival skills are improved. The bot now knows the comparative distances to multiple water and food sources, and can compare their relative merits. It keeps a catalog of shops with their Item types and can return to buy something useful to it once it has the necessary scratch. It retains the locations of hidden treasures it can revisit if its current strategy is wealth acquisition, and it knows where the ATMs are so it can deposit its loot. It can potentially assist human players with directions, or bring them food or water if they're without — I'm thinking of implementing an Alpine Rescue Dog who can bring supplies to stranded players and guide them out of mazes. Yes, it'll have a barrel around its neck. The barrel was previously doable — now the rescue is, too.
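And the traversal itself, again a sketch rather than a copy of our code: Neo4j's built-in Dijkstra over the illustrative "cost" property picks the cheapest known route back to water.

import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphalgo.WeightedPath;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.PathExpanders;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.ResourceIterator;
import org.neo4j.graphdb.Transaction;

/** Sketch: cheapest known path to a Room flagged as a water source, or null if none is known. */
public class WaterFinder {
    // Same names as the sketch above; Neo4j matches Labels and RelationshipTypes by name.
    enum Labels implements Label { Room }
    enum RelTypes implements RelationshipType { LEADS_TO }

    private static final PathFinder<WeightedPath> DIJKSTRA = GraphAlgoFactory.dijkstra(
            PathExpanders.forTypeAndDirection(RelTypes.LEADS_TO, Direction.OUTGOING), "cost");

    public static WeightedPath cheapestPathToWater(GraphDatabaseService graphDb, Node here) {
        WeightedPath best = null;
        try (Transaction tx = graphDb.beginTx();
             ResourceIterator<Node> rooms = graphDb.findNodes(Labels.Room, "hasWater", true)) {
            while (rooms.hasNext()) {
                WeightedPath candidate = DIJKSTRA.findSinglePath(here, rooms.next());
                if (candidate != null && (best == null || candidate.weight() < best.weight())) {
                    best = candidate;
                }
            }
            tx.success();
        }
        return best;
    }
}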
There are some nuances to manage. One obvious one is that in TriadCity, Room Relationships aren't 100% reciprocal. Certain mazes are especially problematic, where if you enter a Room from the East, returning to the East doesn't necessarily take you back to the Room you originated from. Sneaky us. So there's an impedance mismatch: Neo4j Relationships can be traversed in either direction, but TriadCity reality isn't always so. Also, of course, the mazes change. The solution we've implemented for now is simply not to persist maze graphs, although that's not particularly satisfying. More thought required.

A perhaps less obvious problem is closed doors. If the bot finds an exit West and tries to move through it, but the door is closed, we have to parse the failure message ("Perhaps you should open the <thing> first?") and delete the Relationship. The ugly problem is that "Perhaps you should open the <thing> first" can apply to either doors or containers: "Perhaps you should open the refrigerator first?" We can try to disambiguate by keeping a cache of the last several commands, doing our best to figure out whether a move succeeded or not. But Threads are our enemy here: events and messages are pushed to us from the server, and the Thread driving bot behaviors doesn't stop for 'net lag. The danger is that the graph could become garbled if the bot fails to recognize that, while it intended to move, it in fact did not.

We address this today by periodically walking the bot to a known location. If that traversal fails to arrive at the expected Room, we delete the portions of the graph added since the most recent successful walk-back and begin graphing again from wherever we happen to be, with the hope that before long we'll meet up with Rooms we already know and can link the new graph segment to the old. Not entirely pretty, and we're open to suggestions.
None of that is Neo4j's fault. The graph db is a stellar addition to our external AI. It lets the robots "think" similarly to the ways humans do.
Here's how one of our bots thinks of Sanctuary Island, the exact geographical center of the City. Experienced TriadCity players will note that its north and south poles are reversed. But that doesn't matter at all from the bot's POV. The point is that it can now easily return to the Temple of the King to level, or the Fountain to refill a canteen, and so on. This image was generated by Neo4j's built-in graph visualizer, which is pretty spiffy:
Embedding Neo4j this way was again extremely simple. One class, a few dozen lines of code, easy-peasy. My one complaint is that Cypher, the Neo4j query language, has been a moving target. The change which tripped me up is related to revised indexing schemes which first deprecated, then eliminated Cypher constructs such as "START". The first edition of O'Reilly's Graph Databases is out of date; the one I've linked to should be current. These syntax changes cost me some Internet research. Time to buy the new edition.
I'll close by noting an interesting possible future enhancement, now that we have Neo4j embedded with the robot code. Seems the perfect platform for implementing decision trees. The bots I’ve been describing are rules-based, via Drools. Enhancing their decision-making realism via decision trees is attractive — I've long planned to explore this idea for modelling addiction in-game. Perhaps our external bots will be the prototype platform for this type of AI. Watch this space.
[Addendum 2016.05.21: view of Sanctuary Island and part of the Northwest Third from space. Doesn't this graph look like a three-dimensional sphere — a universe?]
Serverless computing is the obvious endpoint of contemporary technology evolution. From onsite hardware to self-managed datacenters to virtualization to public cloud, the trend resolves in a very attractive way: your business logic executes “somewhere”, and you don't need to worry where. An automated platform manages scalability, availability, and security; you're billed only for the cycles you use, not for the idle headroom you once needed to provision. On AWS the future is Lambda: we’ve chosen database reporting for our initial experience.
In the old days we’d run a suite of weekly cron jobs on the database server firing report scripts, piping their output to mailx. Business analysts, product owners and technology managers would receive human-readable columnar reports in their inboxes, which they'd study and archive however they liked. Pretty straightforward. It's possible to duplicate the same concept in the cloud, and this is initially how we started: install mailx on one of the MongoDB secondaries, have cron run weekly jobs, mailing not particularly human-readable JSON output to the usual suspects — thought being we'd get that JSON human-formatted real soon now.
This strategy takes poor advantage of the AWS platform. A more sophisticated approach is to save report outputs to a weekly bucket on S3, then mail a link to that bucket to the interested parties. This centralizes the report archive in a pre-organized repository, and gets the weekly email down from a mini-storm to a single message with links. Easy change, then: instead of piping report output to mailx, pipe it to a super-simple shell script which reads it from stdin, figures out if a new S3 bucket is needed, creates one if yes, and writes a JSON file to the bucket. Here’s our version, for your Googling pleasure:
#!/bin/bash
# Writes input from standard in to an S3 bucket, under a prefix named by date.
# The file name is passed as param $1 to the script.
# 2016.04.04 MP.

S3_BUCKET=sm-reports
DATE=$(date +%Y.%m.%d)   # date format is illustrative; use whatever suits your archive

# Create the bucket if it doesn't exist:
if aws s3 ls "s3://$S3_BUCKET" 2>&1 | grep -q 'NoSuchBucket'
then
    aws s3 mb "s3://$S3_BUCKET"
fi

# Write stdin to a new file in the date-named prefix, world-readable:
aws s3 cp - "s3://$S3_BUCKET/$DATE/$1.$DATE.json" --content-type text/plain \
    --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers
Now we simply add a concluding cron job to send an email with the links. Easy-peasy. No need for mailx anymore: let's use AWS SES instead, which offers enhanced security — no more mailx credentials inside a pseudo-user's home directory. Instead, SES access is managed by IAM. Nice.
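The concluding job doesn't need to be much more than a single SES call. A sketch with the v1 Java SDK, addresses and body obviously placeholders (the CLI's aws ses send-email would work just as well):

import com.amazonaws.services.simpleemail.AmazonSimpleEmailService;
import com.amazonaws.services.simpleemail.AmazonSimpleEmailServiceClientBuilder;
import com.amazonaws.services.simpleemail.model.*;

/**
 * Sketch: mail the week's report links via SES. Addresses and body text are
 * placeholders; credentials come from the instance's IAM role, not a mailx config.
 */
public class ReportMailer {
    public static void main(String[] args) {
        AmazonSimpleEmailService ses = AmazonSimpleEmailServiceClientBuilder.defaultClient();
        ses.sendEmail(new SendEmailRequest()
                .withSource("reports@example-game.com")
                .withDestination(new Destination().withToAddresses("analysts@example-game.com"))
                .withMessage(new Message()
                        .withSubject(new Content("Weekly reports"))
                        .withBody(new Body().withText(new Content(
                                "This week's reports: https://sm-reports.s3.amazonaws.com/...")))));
    }
}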
This outcome already leverages serverlessness. S3 is an abstracted service, where users pay no attention to concepts of physical location, disk capacity, backup, redundancy, scale, or Web server infrastructure. SES and IAM are also serverless. They're "somewhere" but we don't care where. That's a start.
But here's the thing. As noted, the reports are flippin' ugly. MongoDB’s JSON output is serial rather than columnar, and there's no built-in utility analogous to SQL*Plus which can easily convert JSON blobs into spiffy tables. So let's fix this. We'll write our own down-and-dirty little formatting tool converting JSON arrays to tabular HTML. Where should that tool run? Lambda's perfect. With Lambda we don't maintain or pay for EC2 instances with inevitable idle time; we simply let Lambda fire up our converter as the weekly cron jobs require. It's an event-driven compute pipeline: cron fires the reports, pipes them to a script which writes them to S3, S3 notifies Lambda that new JSON files have been written, Lambda converts them to HTML, and SES mails a notice.
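The Lambda end of that pipeline is just an S3-triggered handler. Ours is written in Node.js, as described below, but the skeleton is the same in any supported runtime; here's a hedged sketch of the hookup in Java, with the HTML conversion reduced to a trivial placeholder:

import java.util.Scanner;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

/**
 * Sketch of the S3-triggered converter: for each new .json report, write a
 * sibling .html object. The "conversion" here just wraps the JSON in <pre>;
 * the real converter builds tables from the report arrays.
 */
public class ReportFormatter implements RequestHandler<S3Event, String> {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    @Override
    public String handleRequest(S3Event event, Context context) {
        event.getRecords().forEach(record -> {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();   // may need URL-decoding
            if (!key.endsWith(".json")) return;

            // Slurp the JSON report:
            String json;
            try (Scanner sc = new Scanner(s3.getObject(bucket, key).getObjectContent(), "UTF-8")) {
                json = sc.useDelimiter("\\A").hasNext() ? sc.next() : "";
            }

            // Placeholder "formatting"; the real thing emits an HTML table:
            String html = "<html><body><pre>" + json + "</pre></body></html>";
            s3.putObject(bucket, key.replace(".json", ".html"), html);
        });
        return "ok";
    }
}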
I chose to write the converter as a Node.js script. My inexperienced guess was that this would be less cumbersome than a Java program, avoiding the hassles of frequently uploading new iterations during development and generally, as the lore insists, being faster to work with. I was wrong in every assumption.
First, non-trivial Node.js programs can't be entered directly into the Lambda inline editor because necessary Node.js packages aren't available there. In our case, the "async" package is required to manage the asynchronous programming model of AWS API calls to S3. So, you know: le sigh. I ended up spending about 12 hours on the converter, where some of that time was my Node.js learning curve, but far too much of it was the inevitable save-zip-upload-test-logscan cycle which even Node.js non-noobs are likely to experience.
Second, lack of compiler support made the cycle inevitably more onerous, as every little typo or missing semicolon made it all the way through the loop, to bubble up in the log rather than under my fingers while typing the code. While I realize that Node.js has considerable real-world momentum, I can't imagine using it for nontrivial production applications. I strongly suspect that any large-scale production code base can be written in Java in one-fourth the time, and will be easier and less expensive to maintain over its life cycle. Experience will tell, of course. But you heard it here first.
Additionally, there's a little bit of lag, often 15-20 seconds, between running a test iteration and viewing Lambda's log output in CloudWatch. For retrospectively scanning production logs, this will seldom be a problem, and in fact CloudWatch logs themselves can be processed by Lambda functions, a pretty great thing. During development it's a hindrance.
Summary: the experience was more of a chore than it should have been.
Ultimately, the manual save-zip-upload-test-logscan process can’t scale to production. The AWS CLI includes APIs for pushing functions to Lambda. Developers who prefer the command-line can homebrew their own Continuous Integration / Continuous Delivery scripts. Larger shops will want to automate their CI/CD pipelines with Jenkins or equivalent. This'll be prominent in our enhancements list.
Future TODOs: add a step generating a single HTML index file with links to the reports; mail a link to that index file, instead of two dozen links to the individual reports. Spiffy up the report format, which right now is down-and-dirty unstyled HTML circa 1994. Get the Lambda functions into Git and add the CI/CD pipeline to Jenkins. Consolidate multiple report formatters to a single converter, presumably relying on metadata such as column names added to the scripts generating JSON report output. You could reasonably sum these up with the single word "professionalize". My POC is pretty amateur.
But it works, and that's impressive. No production infrastructure to maintain, just pure code execution which automatically scales, is inherently secure, and is guaranteed always available. For the few cents a month which runtime will cost, this is really seriously something. Lambda can be used in the event-driven architecture described here, or can be fronted by AWS's API Gateway service providing RESTful front-ends to Lambda functions. Perhaps the most serious downside to the Lambda ecology today is lack of persistent database connections: every db request requires connection/authentication/authorization, which is likely too much lag for many real-world production scenarios. You have to suspect that'll be fixed soon.
Very psyched to enter the new world.
-----------------------
Addendum 2016.04.21:
In retrospect, it would have been simpler to design a one-step solution from the outset. For example, a Python script to query the db, transform JSON to HTML, and write the final result to S3. If I were starting from zip, that would be attractive. But I started with MongoDB "native" JavaScript queries, so, with those already in place, it was simpler to leverage S3's event-driven integration to Lambda for the HTML transformations.
Since the original post I've consolidated what began as individual functions per report to a genericized formatter accepting directives passed as input; it's now the responsibility of the JavaScript query scripts to pass those configuration variables to the HTML processor. This makes it easy to add new reports — no need to update the HTML converter. I've also found a package called "node-lambda" (https://www.npmjs.com/package/node-lambda) which promises to eliminate or at least simplify the inefficient code-save-zip-upload-test-logscan loop I naively began with. Report to follow if it works out.
Lastly, I'm pretty sold on abandoning Node.js for Python. Node's asynchronous model is just frankly cumbersome. The resulting code is ugly, hard to scan, error-prone, and slow to work with. Python has its own quirks, particularly pymongo's departures from canonical MongoDB shell syntax ("find_one" instead of "findOne", and different quoting rules). But it's clean and straightforward to work with. Watch this space.