Perry and I used to joke about which would get released first: FableLab’s next game or Couchbase 2.0. And yes, he won. But that does mean I get the option to use the new version to power my next game. Besides key operational improvements, 2.0 also added several key features that were missing in a side-by-side comparison with other document-store choices like MongoDB. Back in 2011, it seemed like a no-brainer that we would upgrade. But after spending a year and a half with a live deployment of Membase 1.7/1.8, I am finding good reasons to use the new Amazon DynamoDB instead.
Reason 1: Growing memory usage.
The Couchbase cluster needs to keep the metadata of every single key in memory, even if the values are not in the working set. Here’s how the memory usage broke down for our live cluster:
Total users: 5 million
Keys in the cluster: 300 million
Average key size: 20 bytes
Metadata per key: 120 bytes
Total copies: 2 (original + 1 replica)
The metadata alone came out to 78GB for us, which was larger than the actual data in the working set! And all of it must remain in cluster memory at all times. We ran a cluster of 8 m2.xlarge servers, and metadata ate up 60% of the 131GB of cluster memory.
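As a rough check on that figure, here is a minimal sketch of the arithmetic using the numbers listed above. It assumes the footprint is simply (key size + metadata per key) × keys × copies, and that the GB figures are binary gigabytes (GiB). Both are my reading of the numbers, not an official Couchbase formula, and the function name is made up for illustration.

```python
# Figures quoted above for the live cluster.
KEYS = 300_000_000
AVG_KEY_BYTES = 20
METADATA_BYTES_PER_KEY = 120
COPIES = 2                # original + 1 replica
CLUSTER_MEMORY_GIB = 131  # 8 x m2.xlarge, as reported above


def metadata_footprint_bytes(keys, key_bytes, metadata_bytes, copies):
    """Memory pinned by keys and their metadata (assumed formula, see text)."""
    return keys * (key_bytes + metadata_bytes) * copies


total_bytes = metadata_footprint_bytes(
    KEYS, AVG_KEY_BYTES, METADATA_BYTES_PER_KEY, COPIES
)
total_gib = total_bytes / 2**30
share = total_gib / CLUSTER_MEMORY_GIB

print(f"{total_gib:.1f} GiB, {share:.0%} of cluster memory")
# prints "78.2 GiB, 60% of cluster memory"
```

Note that nothing in the formula depends on how many players are active. Only the total key count matters, which is the whole problem.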
We probably could collate more user data into fewer keys to go lower than the average of 30 k/v pairs per player. But the point here is that as the game grew, so did the memory requirement, regardless of the working set, because we couldn’t just delete old and inactive players who hadn’t logged in for a year! Animal Party had a stabilized player base of about 300k monthly actives, but we still needed enough memory for 5m players’ metadata. Yikes.
If you want more detail on calculating Couchbase memory usage, check out: http://www.couchbase.com/docs/couchbase-manual-1.8/couchbase-bestpractice-sizing-ram.html
Reason 2: Recovery
The inability to perform online compaction in 1.x was a real issue for us, especially when servers had to be restarted. Without compaction, databases take more and more time to warm up after a reboot, and at our size that meant several hours of downtime. The auto-compaction in 2.0 should reduce this issue, but warm-up time will still be a problem the next time AWS goes into a tizzy. Granted, no one yet knows how the DynamoDB service will be affected during a future AWS outage. But for a small team like ours, I’d rather put more of the onus on the Amazon engineers than on us in a recovery situation.
After doing a bit more research, it turns out that the two issues above are not exclusive to Couchbase. Several other NoSQL solutions have similar problems.
What about DynamoDB? Judging from its spec, it should allow a game to age gracefully past its peak. You can dial down the access rate as concurrent users drop, and the amount you pay for additional storage as data grows is very small compared to the additional memory needed to hold it in a NoSQL database. Increasing or decreasing the DynamoDB access rate takes about 10 minutes, so it’s also possible to run a script to ramp up and down to match the daily traffic cycle, which is challenging to set up for any other solution.
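To make the ramp-up/ramp-down idea concrete, here is a minimal sketch of what such a script might look like. It assumes `client` is a boto3 DynamoDB client (for example `boto3.client("dynamodb")`), whose `update_table` call accepts a `ProvisionedThroughput` setting. The peak window, the capacity numbers, and the helper names are all hypothetical. DynamoDB also limits how often throughput can be decreased per day, so check the current limits before scheduling anything like this.

```python
def capacity_for_hour(hour, peak_units, trough_units, peak_hours=range(16, 24)):
    """Pick capacity units for an hour of the day (hypothetical peak window)."""
    return peak_units if hour in peak_hours else trough_units


def apply_capacity(client, table_name, read_units, write_units):
    """Ask DynamoDB to change a table's provisioned throughput.

    `client` is assumed to be a boto3 DynamoDB client.
    """
    return client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )
```

A scheduled job could call `capacity_for_hour` with the current hour and pass the result to `apply_capacity`, skipping the call when the target matches what is already provisioned.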
There are some ways around the aging issue for Couchbase. Old data can be identified and pulled out of the cluster into a different storage system that’s suitable for data archiving. When a user tries to retrieve the old data, the system pulls it out of the archive and restores it into Couchbase. However, if we’re going down that route, we might as well use in-memory databases like VoltDB that provide transaction and SQL support.
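The archive-and-restore workaround could be sketched roughly like this. Plain dicts stand in for both the Couchbase bucket and the archive storage, and the class and method names are made up for illustration. A real version would use the actual client libraries and would need to handle failures between the two stores.

```python
class ArchivingStore:
    """Read-through wrapper: hot data in memory, old data in an archive."""

    def __init__(self, hot, archive):
        self.hot = hot          # stand-in for the Couchbase bucket
        self.archive = archive  # stand-in for cheap archival storage

    def archive_keys(self, keys):
        """Move the given keys out of the hot store into the archive."""
        for key in keys:
            if key in self.hot:
                self.archive[key] = self.hot.pop(key)

    def get(self, key):
        """Return a value, restoring it from the archive on a miss."""
        if key in self.hot:
            return self.hot[key]
        if key in self.archive:
            value = self.archive.pop(key)
            self.hot[key] = value  # restore so later reads hit memory
            return value
        return None
```

Once a key is archived, it no longer costs any cluster memory, metadata included, until its owner comes back.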
There have been a lot of success stories about scaling up quickly with NoSQL, like OMGPOP with Draw Something, but there isn’t as much discussion about managing the data in the late stage of a product’s life cycle. Zynga certainly has a lot of knowledge and proprietary solutions for this issue (they use Couchbase as well, and were one of the earliest adopters who also contributed to the technology), though it will be something indie studios will have to tackle as we go.