I got an email from EC2 a few days ago saying that a number of the instances that serve Animal Party have pending reboots coming up. It turns out that ALL 8 of the nodes in our membase cluster are scheduled for a reboot. Sure, a reboot on any database server is expected to cause some interruption, but I know from past experience that warming up a membase node after a reboot can take a really long time, especially if the data on disk is highly fragmented. Since I’m waiting for our scheduled downtime to kick in, I figured it’s a good time to share what I’ve learned about membase, especially how and why we are doing the cluster migration.
Our cluster currently holds about 240m key-value pairs, for a total of about 160GB of data. We have only about half of that data in memory, but it’s enough to cover our working set, and the cluster hardly reads from disk at all. However, when a member node is rebooted, it must warm up all the items that the node is “master” of.
What do I mean by “master”? We have a replication count of 1 in our cluster, so while any item (a key-value pair) exists on two nodes at any given time, all writes to that item take place on a single node to ensure consistency. When a membase server reboots, it needs to “warm up” by loading all the data that it is master of from disk. Since we have ~32m items per node, this can take a while.
Another thing that makes this warm-up take a long time is the fact that membase uses the sqlite3 engine for persisting data to disk. Sqlite3 stores its data in a B-tree, and when items are deleted, the underlying B-tree pages are merely marked as “free”. Later on, when new items are stored, their content can be spread over different pages, causing fragmentation. So if the membase cluster sees a lot of deletes or expirations, which ours does, the warm-up time will slowly increase over time. This fragmentation issue will be addressed in the next major release, Couchbase 2.0, since it will be replacing sqlite3 with CouchDB. But in the meantime, this is a real problem that we need to deal with in production.
While it’s possible to vacuum the sqlite database, it’s not really a viable option for us because data persistence needs to be suspended during the process, leaving the system vulnerable to data loss. And we know EBS has pretty awful disk performance, so a vacuum is surely going to take a really long time with the amount of data we have.
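To make the free-page and vacuum behaviour a bit more concrete, here is a minimal Python sketch against a throwaway SQLite database. It is not membase’s actual data files or schema: the table name `kv`, the row count, the payload size, and the `freelist_demo` helper are all made up for illustration, and it assumes SQLite’s auto-vacuum is off (which the sketch sets explicitly). It inserts items, deletes most of them, and reports the total page count and the free-page count before and after a `VACUUM`.

```python
import os
import sqlite3
import tempfile


def freelist_demo(rows=5000, payload_bytes=500):
    """Insert rows, delete ~90% of them, VACUUM, and report page stats.

    Returns a dict mapping a stage name to (page_count, freelist_count).
    All names and sizes here are illustrative assumptions.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")
        # isolation_level=None gives autocommit mode; VACUUM cannot
        # run inside an open transaction.
        conn = sqlite3.connect(path, isolation_level=None)
        # Make sure deleted pages go to the free list instead of
        # being reclaimed automatically.
        conn.execute("PRAGMA auto_vacuum = NONE")
        conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")

        conn.execute("BEGIN")
        conn.executemany(
            "INSERT INTO kv VALUES (?, ?)",
            ((f"key{i}", b"x" * payload_bytes) for i in range(rows)),
        )
        conn.execute("COMMIT")

        def pages():
            total = conn.execute("PRAGMA page_count").fetchone()[0]
            free = conn.execute("PRAGMA freelist_count").fetchone()[0]
            return total, free

        stats = {"after_insert": pages()}

        # Delete roughly 90% of the items; emptied pages are only
        # marked free, the file itself does not shrink.
        conn.execute("DELETE FROM kv WHERE rowid % 10 != 0")
        stats["after_delete"] = pages()

        # VACUUM rebuilds the database file and drops the free pages.
        conn.execute("VACUUM")
        stats["after_vacuum"] = pages()

        conn.close()
        return stats


if __name__ == "__main__":
    for stage, (total, free) in freelist_demo().items():
        print(f"{stage}: {total} pages, {free} free")
```

On a toy database like this the vacuum is instant; the point is only that deletes leave free pages behind until the file is rebuilt, which is the expensive step on a large dataset.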
After I talked to Perry Krug from Couchbase (who has always been extremely helpful) about the situation, he suggested doing a complete cluster migration by rebalancing in 8 new nodes while removing the 8 old nodes. This will take care of the reboot issue since we’ll be moving to new instances, and we will also get a free vacuuming of our data. As a side bonus, Perry said we can use this opportunity to upgrade our cluster from version 188.8.131.52 to 1.7.2 too. I don’t know if this “rebalance your entire cluster” upgrade path works for older versions, since you need to mix two versions of membase server in your cluster during the rebalance, but I am glad it works for 1.7.2.
One caveat Perry mentioned was that the webconsole UI does not allow you to remove all the nodes, but this can be circumvented with the command line tools. After all the new nodes are added to the existing cluster and show up as “pending for rebalance” in the webconsole, this command will remove all the old nodes and rebalance their data onto the new ones:
/opt/membase/bin/membase rebalance -c 10.0.0.10 -u <username> -p <password> --server-remove=10.0.0.1 --server-remove=10.0.0.2 …
(assuming 10.0.0.10 is one of the new nodes)
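With 8 nodes to remove, typing out every `--server-remove` flag by hand is easy to get wrong, so one option is to generate the command from a list. The Python sketch below only builds and prints the command line, it does not run it. The `build_rebalance_command` helper is hypothetical, and the old-node addresses 10.0.0.1 through 10.0.0.8 are placeholder assumptions for illustration, not our real hosts.

```python
import shlex


def build_rebalance_command(new_node, old_nodes, username, password,
                            cli="/opt/membase/bin/membase"):
    """Return the rebalance command as an argv list (nothing is executed)."""
    args = [cli, "rebalance", "-c", new_node, "-u", username, "-p", password]
    # One --server-remove flag per old node, as in the command above.
    args += [f"--server-remove={node}" for node in old_nodes]
    return args


# Placeholder addresses (assumption): substitute the real old nodes.
old_nodes = [f"10.0.0.{i}" for i in range(1, 9)]
cmd = build_rebalance_command("10.0.0.10", old_nodes,
                              "<username>", "<password>")

# Print a shell-quoted version to review before running it by hand.
print(shlex.join(cmd))
```

Printing first and running by hand keeps a human in the loop, which seems prudent for a command that removes every old node in one go.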
I’ll write up a postmortem after the actual migration, and I hope I will have nothing but good news to report…