Monday, July 16, 2018

AWS can be exasperatingly almost

We have a specific architectural need. We'd prefer to handle it with an AWS managed service. It's almost possible. But not quite.

To explain. Our TriadCity server lives in the Oregon Region. With our current level of traffic we’ve decided not to shard the game world. So everyone on planet earth who visits TriadCity arrives at that single large-ish EC2 instance.

To help with scalability we offload as much computation as possible. Much of the AI is computed by clients in a distributed-grid architecture similar to SETI@home. With the same goal in mind we offload TLS termination. Easy-peasy: incoming connections terminate TLS at a Classic Load Balancer, which forwards to the game server. We’d really prefer a Network Load Balancer: it’s cheaper and performs better. But, it doesn't terminate TLS. See? Exasperatingly almost.

Now. We have some legacy native clients. But most players today connect over WebSockets from HTML5/JQuery clients in their browsers. We open a WSS connection to the CLB which terminates TLS and passes the packets through.

The architectural problem with WebSockets is our lack of control over socket parameters. With native clients we can set a long connection timeout, say, five seconds. Then when a user in Seoul or Sydney or London connects to our infrastructure in Oregon, the client doesn't hang up if network latency causes connection negotiation to take a second or two. Not so with WS. The browser vendors control the connect timeout. It's the truly major downside of keeping an API simple. We're stuck with whatever their settings are — and they're pretty freakin short. With the result that users in international locales experience frequent connect failures.

So let's work around that with proxies in Seoul, Sydney, London, and Virginia. We'll use Route 53's Latency Based Routing, so that clients transparently connect to whichever proxy gives them the best performance. WebSockets will connect to the proxy in less than its timeout period; the proxy will forward non terminated TLS to our CLB in Oregon. Our schools win, too! (You might have to live in California to get that joke.)

Straightforward in principle. But, I'd sure like to not manage this infrastructure myself. Let's use the NLB for proxying! Perfect! But: nope, sorry: can't forward across Regions. What about if we peer the VPCs? You’d think the load balancers would be able to see instances in the peered VPC, but, not. Seriously? Yah. Not. Instances can see instances in peer'd VPCs, but, load balancers can't.

Well, fuck.

There's actually no AWS-native way to do this. You have to use EC2 instances running HAProxy or NGINX or whatever your favorite flavor is.

And the pain doesn't stop there. For High Availability you have to run a cluster of EC2s behind an ELB. So now we've got an ELB, EC2s in multiple AZs, proxy software, blah blah, all of which we have to pay for, monitor, worry about, concern ourselves with. Fuck fuck fuck. If the LBs could forward across Regions or to instances in peer'd VPCs, we wouldn't have to do any of this.

So where are we today? At the moment we're just proving that the concept works. I've got small EC2s in several Regions running minimal little socat. So far so good. It's not super scalable and it's not HA. Maybe when we have more money.

Dear AWS: please tweak your LBs to allow forwarding across Regions. Or to see instances in peer'd VPCs. Either would be fine.

Meanwhile you’re exasperatingly almost.

Unplugged Cable