After being up all night babysitting the rebalance process, I am happy to report that it was a rather uneventful night of maintenance. The rebalance itself took 8-9 hours to complete, and then it took another hour for all the replicas to be written to disk. Theoretically, I didn’t need to take the site down while the rebalance was happening, but I took the game down just to be safe and not compromise the game experience.
Disk access was once again the bottleneck throughout the rebalance process. One of the reasons we went with a larger number of smaller nodes rather than a smaller number of bigger nodes is to spread our disk activity across more EBS volumes during a rebalance, conceptually similar to a RAID 0. We do take on a higher risk of hardware failure simply by having more nodes in the cluster, but the disk performance gain is definitely worth it.
Some folks are doing a RAID 0 setup across multiple EBS volumes, as described on alestic, but I haven’t tried it personally. If anyone has attempted that setup, especially in a production environment, please share your experience in the comments!
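For reference, the basic shape of that setup is to stripe several attached EBS volumes into one `mdadm` RAID 0 array and put a filesystem on top. A minimal sketch is below; the device names, volume count, and mount point are assumptions (they depend on how you attach the volumes), and again, this isn’t something I’ve run in production myself.

```shell
# Hypothetical sketch: stripe four attached EBS volumes into a single
# RAID 0 array. Device names (/dev/sdf..sdi) and the mount point are
# assumptions -- adjust to match your own attachments.
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/sdf /dev/sdg /dev/sdh /dev/sdi

# Put a filesystem on the array and mount it where the data lives.
mkfs -t xfs /dev/md0
mkdir -p /data
mount /dev/md0 /data
```

The appeal is the same as our many-small-nodes approach: reads and writes fan out across multiple EBS volumes instead of hammering one. The flip side is also the same as any RAID 0: losing any single volume loses the whole array.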