SmartMonsters

Thursday, May 12, 2016

Simulating Human Geographical Knowledge With Neo4j

TriadCity uses multiple AI architectures to model human behavior in different contexts. Even the simplest is pretty sophisticated: the subsumption architecture for in-game automata does a great job of keeping the City vibrant, where AI-driven characters have homes to go to, jobs to be employed at, hobbies to cultivate, and relationships to explore. At the higher end of the spectrum, we use external bots which log in to the game world just as human players do, mimicking human decision-making and in a very limited way mimicking human speech.

There's a trade-off between in-game and external AI which might not be immediately intuitive. While external AI has access to greater compute resources, in-game AI has access to the database. Think of movement. An AI-driven in-game character finds its way around the world by querying a weighted directed graph which the game server keeps in memory. The model goes like this: in-game AI are residents of the city, they've been there a while, and they know their way around. External AI face a different problem. If we want to model human behaviors, they can't simply query the database to ask how to get to point B. Like humans, they must explore the world, building their own mental representation of its geography analogous to the paper or Visio maps drawn by players.

Here's a practical problem external AI characters must solve to survive. "My canteen's empty; I'm thirsty. Where is that fountain I passed a while back?" A human will probably use ordinary geospatial reasoning to reckon a reasonable path. A naive AI can walk back through a history of its moves. A more sophisticated, human-like AI needs an internal representation of where it's been, what it's found there, and how to navigate the most reasonable path back.

I've recently enhanced our external bots to build persistent knowledge maps of the world as they explore, using the excellent open-source graph database Neo4j. This is the same technology the TriadCity game server uses. The server's use case is extremely simple. At boot, once the game world is loaded into memory, a Neo4j graph of all of the Rooms and their relationships is made, with properties describing path attributes. At runtime, AI characters simply query the graph for a path and off they go. The database isn't persisted, it's re-created at each boot, and is updated on the fly as World Builders add or delete Rooms. The external AI use case is more nuanced, where the bot must build a persisted graph room-by-room as it explores. Fortunately the Neo4j code is again very simple.

Each time the bot enters a Room, it asks the graph if that Room is already present. If no, it creates the new Node, populating its properties with information it may want later: is there a water source here, a food source, a shop, a skill teacher? Along with the new Node, it adds a Relationship from the originating Node to the new one. A single simple LEADS_TO RelationshipType is all that's needed. If yes — the Room was already present in the graph — it adds the Relationship linking the origin and destination Nodes. With a little work we can add the cost of the move to the new Relationship: cache the bot’s energy before the move and compare it to when the move is complete; make that cost a property of the Relationship. The robot now has a comprehensive map of all the Rooms it's visited, the paths between them, and their costs. Finding the closest Room with a water source is a simple shortest-weighted-path traversal.
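For the curious, here's a rough sketch of that bookkeeping. It's plain Python with dictionaries standing in for Neo4j, not our actual bot code: the class name KnowledgeMap, its method names, the "water" property and the Room names are all invented for illustration, and the path search is an ordinary Dijkstra rather than Neo4j's own traversal.

```python
import heapq


class KnowledgeMap:
    """In-memory stand-in for the bot's persisted graph of visited Rooms."""

    def __init__(self):
        self.rooms = {}     # room id -> properties, e.g. {"water": True}
        self.leads_to = {}  # origin id -> {destination id: cost of the move}

    def record_move(self, origin, destination, properties, energy_before, energy_after):
        """Create the destination Room if it's new, then link origin to it."""
        if destination not in self.rooms:
            self.rooms[destination] = dict(properties)
        self.rooms.setdefault(origin, {})
        # Cost of the move is the energy spent making it.
        cost = energy_before - energy_after
        self.leads_to.setdefault(origin, {})[destination] = cost

    def nearest(self, start, prop):
        """Return (total cost, path) to the cheapest Room with prop set, or None."""
        queue = [(0, start, [start])]
        settled = set()
        while queue:
            cost, room, path = heapq.heappop(queue)
            if room in settled:
                continue
            settled.add(room)
            if self.rooms.get(room, {}).get(prop):
                return cost, path
            for nxt, step in self.leads_to.get(room, {}).items():
                if nxt not in settled:
                    heapq.heappush(queue, (cost + step, nxt, path + [nxt]))
        return None


bot_map = KnowledgeMap()
bot_map.record_move("Camp", "Thicket", {}, 100, 92)
bot_map.record_move("Thicket", "Fountain", {"water": True}, 92, 84)
bot_map.record_move("Fountain", "Trail", {}, 84, 82)
bot_map.record_move("Trail", "Camp", {}, 82, 80)
bot_map.record_move("Camp", "Trail", {}, 80, 78)
bot_map.record_move("Trail", "Fountain", {"water": True}, 78, 76)

print(bot_map.nearest("Camp", "water"))  # (4, ['Camp', 'Trail', 'Fountain'])
```

Note that the thirsty bot at Camp picks the Trail (total cost 4) over the Thicket (total cost 16), even though both routes are two moves long.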

Movement behavior is now more intelligent. Bots already knew not to re-enter the Room they previously left. Now they can prioritize directions they haven't yet explored. If a Room has Exits north, south and west, and the bot's previously explored north and west, next time it'll go south. The resulting behavior is less random, more like the way a systematic human would do things. Also, the bot can take the least expensive path back to a water source; that is, it'll get onto the trail rather than blundering through thickets. Nice.
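The exit-picking rule is small enough to sketch in a few lines of Python. This is an illustration only, not the bot's real code: the function name choose_exit and its parameters are made up, and I'm assuming the bot knows which direction leads straight back to the Room it just left.

```python
import random


def choose_exit(exits, explored, came_from=None, rng=random):
    """Pick the next direction: unexplored first, never straight back if avoidable."""
    # Drop the way we came in, unless it's the only way out.
    candidates = [d for d in exits if d != came_from] or list(exits)
    # Prefer directions the graph has no Relationship for yet.
    fresh = [d for d in candidates if d not in explored]
    return rng.choice(fresh or candidates)


print(choose_exit(["north", "south", "west"], explored={"north", "west"}))  # south
```

When every Exit has already been explored the choice falls back to random among the candidates, which keeps the old wandering behavior as the default.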

Survival skills are improved. The bot now knows the comparative distances to multiple water and food sources, and can compare their relative merits. It keeps a catalog of shops with their Item types and can return to buy something useful to it once it has the necessary scratch. It retains the locations of hidden treasures it can revisit if its current strategy is wealth acquisition, and it knows where the ATMs are so it can deposit its loot. It can potentially assist human players with directions, or bring them food or water if they're without — I'm thinking of implementing an Alpine Rescue Dog who can bring supplies to stranded players and guide them out of mazes. Yes, it'll have a barrel around its neck. The barrel was previously doable — now the rescue is, too.

There are some nuances to manage. One obvious one is that in TriadCity, Room Relationships aren't 100% reciprocal. Certain mazes are especially problematic, where if you enter a Room from the East, returning to the East doesn't necessarily take you back to the Room you originated from. Sneaky us. So there's an impedance mismatch: Neo4j Relationships are directed, but unless a query pins the direction they're traversed both ways as if reciprocal, and TriadCity reality isn't always so. Also, of course, the mazes change. The solution we've implemented for now is to simply not persist maze graphs, although that's not particularly satisfying. More thought required.

A perhaps less obvious problem is closed doors. If the bot finds an exit West, and tries to move through it, but the door is closed, we have to parse the failure message, "Perhaps you should open the <thing> first?", and delete the Relationship. The ugly problem is that "Perhaps you should open the <thing> first?" can apply to either doors or containers: "Perhaps you should open the refrigerator first?" We can try to disambiguate by keeping a cache of the last several commands, doing our best to figure out whether a move succeeded or not. But Threads are our enemy here, where events and messages are pushed to us from the server, and the Thread driving bot behaviors doesn't stop for 'net lag.

The danger is that the graph could become garbled if the bot fails to recognize that while it intended to move, it in fact did not. We address this today by periodically walking the bot to a known location; if that traversal fails to arrive at the expected Room, we delete the portions of the graph that were added since the most recent successful walk-back, and begin graphing again from wherever we happen to be, with the hope that before long we'll meet up with Rooms we already know, and can link the new graph segment to the old. Not entirely pretty, and we're open to suggestion.
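One way to picture the walk-back repair is as a journal of everything added since the last checkpoint. The sketch below is a simplified Python illustration under that assumption, not the bot's actual implementation; CheckpointedMap, add_move and walk_back_result are hypothetical names, and a real version would issue deletes against the persisted graph instead of editing dictionaries.

```python
class CheckpointedMap:
    """Directed Room graph that can discard everything added since a checkpoint."""

    def __init__(self):
        self.rooms = set()
        self.edges = {}     # origin -> {destination: cost}
        self._journal = []  # additions since the last successful walk-back

    def add_move(self, origin, destination, cost):
        """Record a move, journaling any Room or Relationship that is new."""
        for room in (origin, destination):
            if room not in self.rooms:
                self.rooms.add(room)
                self._journal.append(("room", room))
        if destination not in self.edges.setdefault(origin, {}):
            self._journal.append(("edge", origin, destination))
        self.edges[origin][destination] = cost

    def walk_back_result(self, expected_room, actual_room):
        """Commit if the walk-back arrived where expected, else roll back.

        Returns True when the journaled additions were kept.
        """
        if expected_room == actual_room:
            self._journal.clear()
            return True
        # Undo in reverse order so Relationships go before their Rooms.
        for entry in reversed(self._journal):
            if entry[0] == "edge":
                self.edges.get(entry[1], {}).pop(entry[2], None)
            else:
                self.rooms.discard(entry[1])
                self.edges.pop(entry[1], None)
        self._journal.clear()
        return False
```

The sketch only journals additions, so a changed cost on an already-known Relationship survives a rollback. That seemed a tolerable simplification for illustration.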

None of that is Neo4j's fault. The graph db is a stellar addition to our external AI. It lets the robots "think" similarly to the ways humans do.

Here's how one of our bots thinks of Sanctuary Island, the exact geographical center of the City. Experienced TriadCity players will note that its north and south poles are reversed. But that doesn't matter at all from the bot's POV. The point is that it can now easily return to the Temple of the King to level, or the Fountain to refill a canteen, and so on. This image was generated by Neo4j's built-in graph visualizer, which is pretty spiffy:

Embedding Neo4j this way was again extremely simple. One class, a few dozen lines of code, easy-peasy. My one complaint is that Cypher, the Neo4j query language, has been a moving target. The change which tripped me up is related to revised indexing schemes which first deprecated, then eliminated Cypher constructs such as "START". The first edition of O'Reilly's Graph Databases is out of date; the one I've linked to should be current. These syntax changes cost me some Internet research. Time to buy the new edition.

I'll close by noting an interesting possible future enhancement, now that we have Neo4j embedded with the robot code. Seems the perfect platform for implementing decision trees. The bots I’ve been describing are rules-based, via Drools. Enhancing their decision-making realism via decision trees is attractive — I've long planned to explore this idea for modelling addiction in-game. Perhaps our external bots will be the prototype platform for this type of AI. Watch this space.

[Addendum 2016.05.21: view of Sanctuary Island and part of the Northwest Third from space. Doesn't this graph look like a three-dimensional sphere — a universe?]

[Image: Neo4j graph]

Friday, April 8, 2016

Formatting Database Reports Using Serverless Computing

Serverless computing is the obvious endpoint of contemporary technology evolution.  From onsite hardware to self-managed datacenters to virtualization to public cloud, the trend resolves in a very attractive way: your business logic executes “somewhere”, and you don't need to worry where. An automated platform manages scalability, availability, and security; you're billed only for the cycles you use, not for the idle headroom you once needed to provide.  On AWS the future is Lambda: we’ve chosen database reporting for our initial experience.

In the old days we’d run a suite of weekly cron jobs on the database server firing report scripts, piping their output to mailx.  Business analysts, product owners and technology managers would receive human-readable columnar reports in their inboxes, which they'd study and archive however they liked.  Pretty straightforward.  It's possible to duplicate the same concept in the cloud, and this is initially how we started: install mailx on one of the MongoDB secondaries, have cron run weekly jobs, mailing not particularly human-readable JSON output to the usual suspects — thought being we'd get that JSON human-formatted real soon now.

This strategy takes poor advantage of the AWS platform.  A more sophisticated approach is to save report outputs to a weekly bucket on S3, then mail a link to that bucket to the interested parties.  This centralizes the report archive in a pre-organized repository, and gets the weekly email down from a mini-storm to a single message with links.  Easy change, then: instead of piping report output to mailx, pipe it to a super-simple shell script which reads it from stdin, figures out if a new S3 bucket is needed, creates one if yes, and writes a JSON file to the bucket.  Here’s our version, for your Googling pleasure:

#!/bin/bash

# Writes input from standard in to an S3 bucket named by date.
# The file name is passed as param $1 to the script.
# 2016.04.04 MP.

DATE=$(date +%Y-%m-%d)
# Substitute your own reports bucket for <base_directory>.
S3_BUCKET="<base_directory>/$DATE/"

# Create the bucket if it doesn't exist:
if aws s3 ls "s3://$S3_BUCKET" 2>&1 | grep -q 'NoSuchBucket'
then
    aws s3 mb "s3://$S3_BUCKET"
fi

# Write stdin to a new file in the bucket:
aws s3 cp - "s3://${S3_BUCKET}$1.$DATE.json" --content-type text/plain --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers

Now we simply add a concluding cron job to send an email with the links.  Easy-peasy.  No need for mailx anymore: let's use AWS SES instead, which offers enhanced security — no more mailx credentials inside a pseudo-user's home directory.  Instead, SES access is managed by IAM.  Nice.

This outcome already leverages serverlessness.  S3 is an abstracted service, where users pay no attention to concepts of physical location, disk capacity, backup, redundancy, scale, or Web server infrastructure.  SES and IAM are also serverless.  They're "somewhere" but we don't care where.  That's a start.

But here's the thing.  As noted, the reports are flippin' ugly.  MongoDB’s JSON output is serial rather than columnar, and there's no built-in utility analogous to sql*plus which can easily convert JSON blobs to spiffy tables.  So let's fix this.  We'll write our own down and dirty little formatting tool converting JSON arrays to tabular HTML.  Where should that tool run?  Lambda's perfect.  With Lambda we don't maintain or pay for EC2 instances with inevitable idle time, we simply let Lambda fire up our converter as the weekly cron jobs require.  It's an event-driven compute pipeline, where cron fires reports, pipes them to a script which writes them to S3, S3 notifies Lambda that new JSON files are written, Lambda converts them to HTML, and SES mails a notice.
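To give a flavor of what the formatting step does, here's a minimal sketch in Python. My converter was written in Node.js, so treat this as an illustration of the idea and not the code we run: the function name json_to_html_table and the optional columns parameter are inventions, and it assumes each report is a JSON array of flat objects.

```python
import html
import json


def json_to_html_table(json_text, columns=None):
    """Render a JSON array of flat objects as an unstyled HTML table."""
    rows = json.loads(json_text)
    if columns is None:
        # Default column order: first appearance across all rows.
        columns = []
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
    header = "".join(f"<th>{html.escape(str(c))}</th>" for c in columns)
    lines = ["<table>", f"<tr>{header}</tr>"]
    for row in rows:
        # Missing fields become empty cells; values are escaped for HTML.
        cells = "".join(f"<td>{html.escape(str(row.get(c, '')))}</td>" for c in columns)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(json_to_html_table('[{"player": "A&B", "logins": 3}]'))
```

Escaping every cell matters more than it looks, since report values such as player names can contain characters that would otherwise break the table.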

I chose to write the converter as a Node.js script.  My inexperienced guess was that this would be less cumbersome than a Java program, avoiding the hassles of frequently uploading new iterations during development and generally, as the lore insists, being faster to work with.  I was wrong in every assumption.

First, non-trivial Node.js programs can't be entered directly into the Lambda inline editor because necessary Node.js packages aren't available there.  In our case, the "async" package is required to manage the asynchronous programming model of AWS API calls to S3.  So, you know: le sigh.  I ended up spending about 12 hours on the converter, where some of that time was my Node.js learning curve, but far too much of it was the inevitable save-zip-upload-test-logscan cycle which even Node.js non-noobs are likely to experience.

Second, lack of compiler support made the cycle inevitably more onerous, as every little typo or missing semicolon made it all the way through the loop, to bubble up in the log rather than under my fingers while typing the code.  While I realize that Node.js has considerable real-world momentum, I can't imagine using it for nontrivial production applications.  I strongly suspect that any large-scale production code base can be written in Java in one-fourth the time, and will be easier and less expensive to maintain over its life cycle.  Experience will tell, of course.  But you heard it here first.

Additionally, there's a little bit of lag, often 15-20 seconds, between running a test iteration and viewing Lambda's log output in CloudWatch.  For retrospectively scanning production logs, this will seldom be a problem, and in fact CloudWatch logs themselves can be processed by Lambda functions, a pretty great thing.  During development it's a hindrance.

Summary: the experience was more of a chore than it should have been.

Ultimately, the manual save-zip-upload-test-logscan process can’t scale to production.  The AWS CLI includes commands for pushing functions to Lambda.  Developers who prefer the command-line can homebrew their own Continuous Integration / Continuous Delivery scripts.  Larger shops will want to automate their CI/CD pipelines with Jenkins or equivalent.  This'll be prominent in our enhancements list.
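As a starting point for that kind of homebrew script, here's a hedged Python sketch of the zip-and-upload half of the loop. It's not something we've put into production. The helper names build_zip and deploy are made up, and I'm assuming the caller passes in a boto3 Lambda client, whose update_function_code call accepts the zipped bytes.

```python
import io
import zipfile


def build_zip(files):
    """Return an in-memory zip archive built from {archive name: source text}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, source in files.items():
            info = zipfile.ZipInfo(name)
            info.compress_type = zipfile.ZIP_DEFLATED
            # Mark the file world-readable so the Lambda runtime can load it.
            info.external_attr = 0o644 << 16
            archive.writestr(info, source)
    return buffer.getvalue()


def deploy(lambda_client, function_name, files):
    """Push new code for an existing function.

    lambda_client is assumed to be a boto3 client, e.g. boto3.client("lambda").
    """
    return lambda_client.update_function_code(
        FunctionName=function_name,
        ZipFile=build_zip(files),
    )
```

Passing the client in as a parameter keeps the sketch free of credentials and makes it easy to exercise with a stand-in object before pointing it at a real account.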

Future TODOs: add a step generating a single HTML index file with links to the reports; mail a link to that index file, instead of two dozen links to the individual reports.  Spiffy up the report format, which right now is down-and-dirty unstyled HTML circa 1994.  Get the Lambda functions into Git and add the CI/CD pipeline to Jenkins.  Consolidate multiple report formatters to a single converter, presumably relying on metadata such as column names added to the scripts generating JSON report output.  You could reasonably sum these up with the single word "professionalize".  My POC is pretty amateur.

But it works, and that's impressive.  No production infrastructure to maintain, just pure code execution which automatically scales, is inherently secure, and is guaranteed always available.  For the few cents a month which runtime will cost, this is really seriously something.  Lambda can be used in the event-driven architecture described here, or can be fronted by AWS's API Gateway service providing RESTful front-ends to Lambda functions.  Perhaps the most serious downside to the Lambda ecology today is lack of persistent database connections: every db request requires connection/authentication/authorization, which is likely too much lag for many real-world production scenarios.  You have to suspect that'll be fixed soon.

Very psyched to enter the new world.

-----------------------

Addendum 2016.04.21:

In retrospect, it would have been simpler to design a one-step solution from the outset.  For example, a Python script to query the db, transform JSON to HTML, and write the final result to S3.  If I were starting from zip, that would be attractive.  But I started with MongoDB "native" JavaScript queries, so, with those already in place, it was simpler to leverage S3's event-driven integration to Lambda for the HTML transformations.
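For anyone who hasn't seen the S3-to-Lambda hand-off, the receiving end looks roughly like the Python sketch below. The real function is Node.js, so this is illustration only: new_report_objects is a hypothetical helper, and the handler here just selects the reports to convert without doing the conversion. The event layout it reads (Records, then s3.bucket.name and s3.object.key) is the shape S3 uses for its notifications.

```python
from urllib.parse import unquote_plus


def new_report_objects(event):
    """Return (bucket, key) pairs for each object named in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys arrive URL-encoded in the notification, so decode them.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def handler(event, context):
    """Lambda entry point: pick out the JSON reports that still need converting."""
    return [(bucket, key) for bucket, key in new_report_objects(event)
            if key.endswith(".json")]
```

Filtering on the .json suffix is one simple way to keep the function from reacting to the HTML files it writes back, assuming they land in the same bucket.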

Since the original post I've consolidated what began as individual functions per report to a genericized formatter accepting directives passed as input; it's now the responsibility of the JavaScript query scripts to pass those configuration variables to the HTML processor.  This makes it easy to add new reports — no need to update the HTML converter.  I've also found a package called "node-lambda" (https://www.npmjs.com/package/node-lambda) which promises to eliminate or at least simplify the inefficient code-save-zip-upload-test-logscan loop I naively began with.  Report to follow if it works out.

But lastly, I'm pretty sold on abandoning Node.js for Python.  Node's asynchronous model is just frankly cumbersome.  The resulting code is ugly, hard to scan, error-prone, and slow to work with.  Python is also quirky, particularly pymongo's departures from canonical MongoDB shell syntax: "find_one" instead of "findOne", and different quoting rules.  But it's clean, and straightforward to work with.  Watch this space.
