Friday, May 16, 2014

Migration notes: from RDBMS to MongoDB

The relational paradigm maps poorly to OO design. Objects don't have relations. When you use an ORM for CRUD you end up with a denormalized schema that defeats the point of the relational paradigm. If your OO design relies on inheritance you can end up with absurdly large numbers of tables. In TriadCity we have hundreds of Item subtypes with varying properties; mapping these to a traditionally normalized RDBMS would result in spaghetti schema. Somebody has to lose: you end up with either a lousy database or a lousy OO design. By contrast the document data model maps intuitively to OO. Because MongoDB is schemaless you can put all those hundreds of Item subtypes into the same Collection, hydrating them without need for OO mapping and reporting on them in a perfectly granular way. Schemalessness removes the impedance mismatch.
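Concretely, documents of entirely different shapes can live side by side in the one Items Collection. A sketch — the field names here are illustrative, not our literal schema:

    { "_id" : ObjectId("..."), "type" : "Sword",  "name" : "rapier", "damage" : 8, "twoHanded" : false }
    { "_id" : ObjectId("..."), "type" : "Potion", "name" : "healing draught", "doses" : 3 }

Each document carries only the properties its subtype needs; hydration code switches on the type field instead of joining subtype tables.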

Migration turned out to be a larger project than expected. This isn't the fault of the technology. It was trivial to migrate game object persistence. Without hyperbole, an hour or two. But of course, then we got ambitious. We do vastly more with data than merely game CRUD. For example, we're beginning to use rich data analytics of in-game auctions to enable AI-driven characters' bidding behavior. Previously that logic would have lived in a vendor-specific stored procedure, and would have been less comprehensive at runtime than what we can achieve now. We're better served by a combination of MongoDB plus Neo4j for real-time lookups. I'm happy as can be to have this logic live in Java inside the game engine, rather than a proprietary stored procedure language in the database. But, implementation in the new technology took some effort.

I chose not to replicate the previously relational schema in MongoDB. It would have been incrementally easier to simply map tables to Collections one-for-one. Instead, I chose to take advantage of MongoDB's ability to store related information in a single document via nested arrays. Here's an example. Our RDBMS used normalized tables to relate basic user account information to: demographics; settings and preferences; security roles and access restrictions; a listing of bounced email addresses; and a table of locked accounts. So, a parent table and five children, related by the unique ID of the parent account. In MongoDB, it makes better sense to keep all of this related information in a single record. So, our Person document has an array field for demographics, another for roles and privileges, another for settings and preferences, and simple boolean flags for bounced emails or global account lock. This consolidation is straightforward to code to, and results in a human-readable document browsable via your favorite administrative GUI.
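A sketch of what such a consolidated Person document might look like — the field names here are illustrative, not our literal schema:

    {
        "_id" : ObjectId("..."),
        "name" : "zelda",
        "demographics" : [ { "key" : "country", "value" : "US" } ],
        "roles" : [ "player", "builder" ],
        "preferences" : [ { "key" : "ansiColor", "value" : "true" } ],
        "emailBounced" : false,
        "accountLocked" : false
    }

One read fetches everything the login path needs, where the relational version required a six-way join.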

The need for transactional guarantees goes away. In a SQL world most transactions are second-order: they're really artifacts of the relational model itself, where an update to a Person per above might require writes to multiple tables. Post-migration we literally have not one use case where a transaction is necessary. We do rely on write guarantees in MongoDB: our writing Java thread blocks until it receives an appropriate acknowledgement. The Java driver handles these acknowledgements automagically.

Object-oriented queries in MongoDB's Java API turn out to be very intuitive. They read like almost-English and make immediate sense. Here's an example and its SQL equivalent. This looks up the most recent sale price of a Player's in-game house. MongoDB:

    DBObject where = new BasicDBObject("playerHouseID", playerHouseID);
    DBObject fields = new BasicDBObject("price", true).append("timestamp", true);
    DBObject sort = new BasicDBObject("timestamp", DESCENDING);    // DESCENDING == -1
    DBObject found = this.mongo.db.getCollectionFromString("PlayerHouseSales").findOne(where, fields, sort);
    if (found == null) {    // this player house has never been sold
        return -1;
    }
    return (Long) found.get("price");

The SQL equivalent minus framework boilerplate would have been something like:

    "SELECT * FROM (SELECT price FROM player_house_sales WHERE player_house_id = ? ORDER BY timestamp DESC) WHERE rownum = 1"

I find the OO version vastly easier to conceptualize.

Object-oriented queries like these make it straightforward to eliminate the 200 or so stored procedures which had tied us to one specific RDBMS vendor. Nearly all of these existed because it's easier to implement branching logic in a stored procedure language than in SQL. Typically the branching test was whether a row already existed, thus insert or update. The MongoDB Java driver's save() method handles upsert for you, automagically converting as the case warrants. A few circumstances which are marginally more complicated are easily handled in Java. This is a huge, huge win. All of our data persistence and analytics logic is now in Java in a single location in a single code base — no longer scattered among multiple languages and disparate scripts in different locations in SCM.
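For instance, the whole exists-then-insert-or-update branch collapses to a single call. A sketch against the legacy Java driver, with a hypothetical collection name:

    DBObject person = ...;    // may or may not already contain an "_id"
    // save() inserts when the "_id" is absent or new, updates when it already exists
    this.mongo.db.getCollection("People").save(person);

The branching the stored procedure used to do now happens inside the driver.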

I don't know what the best practice is for sequences. Sequences aren't recommended for _id unique keys. But, there are circumstances where an incrementing recordNo is handy. For example, I like to assign a recordNo to each new Room, Exit, Item and so on. A dynamic count() won't do for this, because these objects are occasionally deleted after creation. I'd nevertheless like to know the grand total of how many Rooms have ever been created, without keeping a separate audit trail. It seems straightforward to create a sequence document in the database with one field for each sequence, then use the $inc operator to increment it atomically. Since this is not the _id unique key we're talking about, I ran with it.
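A sketch of the pattern, using the legacy Java driver's findAndModify — the collection and sequence names here are hypothetical:

    DBObject query = new BasicDBObject("_id", "roomRecordNo");
    DBObject update = new BasicDBObject("$inc", new BasicDBObject("value", 1L));
    // remove == false; returnNew == true and upsert == true: create the counter
    // if it doesn't yet exist, and return the freshly incremented value
    DBObject seq = this.mongo.db.getCollection("Sequences")
            .findAndModify(query, null, null, false, update, true, true);
    long nextRoomNo = (Long) seq.get("value");

Because findAndModify is atomic, two threads creating Rooms concurrently can't receive the same number.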

I'd like to see more syntactic sugar in the query API. Common operations such as summing or averaging columns are cumbersome today. There's a new-ish Aggregation Framework seemingly designed for large-scale offline reporting. For ad hoc in-game queries, I'd prefer the Java API, and would like to have common operations like summing or finding max or min values highly runtime optimized, as they are in the RDBMS we're replacing. Corners of the MongoDB programming world like these seem immature to me. Development seems very active, so hopefully these gaps will be filled soonish.
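To be fair, the Aggregation Framework can express these operations today; it's the ceremony required from Java that grates. A sketch, with hypothetical collection and field names:

    // average and maximum sale price, grouped by Item type
    DBObject group = new BasicDBObject("$group",
            new BasicDBObject("_id", "$itemType")
                .append("avgPrice", new BasicDBObject("$avg", "$price"))
                .append("maxPrice", new BasicDBObject("$max", "$price")));
    AggregationOutput out = this.mongo.db.getCollection("AuctionSales").aggregate(group);
    for (DBObject row : out.results()) { ... }

All of that to say SELECT AVG(price), MAX(price) ... GROUP BY item_type.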

I've encountered one problem not yet solved. Sparse unique indexes don't seem to work. Inserting more than one record with a null value into a field with a sparse unique index fails with a duplicate value error on the null, which the "sparse" part is supposed to finesse. Google suggests others have experienced similar glitches. For the time being I've turned off the index, and am continuing to research. I really do want this working.

Runtime performance is excellent. Superior in most cases to the SQL equivalent. For example, loading 18k Rooms from the RDBMS including their default Items and Mobiles typically required about 40 seconds; with MongoDB it's usually 15. On my MacBook Pro with 8 cores and 16GB RAM, a complete game boot in the SQL world took around 60 seconds; it's half that with MongoDB.

[For a client, I've subsequently done a shootout of very high-volume write processing between MongoDB and Oracle, where "high-volume" means hundreds of millions of writes. MongoDB was an order of magnitude faster under load.]

MongoDB's system requirements are demanding. MongoDB likes to put the entire data set into RAM. It assumes a 3-node replica cluster with several gigs of memory per node. Our data set really isn't that big, so we've got it shoehorned into small VPS instances in the colo. Without a lot of hands-on lore, I can't say yet if we'll turn out to need to bump up these instance sizes. [Update 2014.10.31: instance sizes unchanged. All's well.] [Update 2018.05.12: after migrating to AWS the Mongos run happily on m1.medium instances.]

MongoDB's automated replication is nifty for high availability and for offloading queries from the writeable system. We run our analytics queries using map/reduce against mongodumps taken from one of the secondaries, freeing the primary for production writes. Nothing's necessary to enable this behavior; it works out of the box. For very large data sets, sharding is easy-peasy. There's a handy connector linking MongoDB with Hadoop, for offline map/reduce queries. The database is designed for these topologies, which in many RDBMS worlds are cumbersome to achieve.

Granted this enhanced ability to easily analyze 14 years of comprehensive log data, we can begin to enable vastly more powerful and realistic in-game AI. Here's an example. The data show us which Items sell most frequently at auction; what the most successful starting prices are; what the average, minimum and maximum final auction prices are for each Item type; how many bids are common; upper and lower bounds on all these things; medians; distributions; curves; etc. etc. They can help us predict which Players or AI-driven characters might be interested in buying which things. By pulling the "big data" from MongoDB and graphing it with Neo4j, we can enable super-zippy real-time queries like, "Should I bid on the current auction Item? If yes, how much? What should my top bid be based on the bid history so far? If I buy it at the right price, will I later be able to profit from re-auctioning it? When's the best time to re-auction? Are there any Players currently online who may be interested in the Items in my possession? Are there any robots or other AIs who might be interested?" This has far-reaching implications not only for the quality of in-game AI, but also for our ability to expand the in-game economy and make it more rich and realistic. In addition to in-game merchants who buy Items from Players, we can now have AI-driven characters buy and sell from auction, creating another sales channel from which Players can benefit.

A note re Mongo and JSON. If your company models its business objects in JSON, a highly performant JSON database makes great sense. It's far more straightforward than shoehorning JSON into a relational paradigm designed for an earlier world.

SmartMonsters is now fully committed to "polyglot persistence". The principal stores are MongoDB, Neo4j, DynamoDB, and PostgreSQL. We plan to migrate the PostgreSQL to AWS' RDS. Very much looking forward to a world where AWS offers a managed MongoDB which we don't have to take care of ourselves.

Jacob Lawrence, The Great Migration