Distributed Computing is Hard: A Cassandra Migration Horror Story

I won’t lie (or conveniently fail to mention) that I have lost many nights of sleep due to Cassandra. I’ve certainly reflected and asked myself, “Was it really worth it?” Some of the sleepless nights were due to encountering previously unknown bugs, which have since been fixed. Other sleepless nights were caused by bad and misinformed decisions myself and my co-workers made while performing various C* operations. Implemented correctly, distributed computing brings lots of potential to your application. You can improve performance by distributing work across many physical (and inexpensive!) machines. Additionally, a database like Cassandra was designed from the beginning with replication in mind. Ensuring there are multiple copies of a dataset across multiple nodes and datacenters in distant geographical regions is not an afterthought (unlike MySQL replication). However, the many advantages and benefits of distributed computing come with the tradeoff of increased complexity.

Distributed logic and designs will inevitably cause an increase in complexity in application logic. When done right however, the rewards are obvious and easy to appreciate. Operationally, while it might be possible to get away with a single non-sharded MySQL instance installed via apt-get/emerge/yum/etc, operations with Cassandra need to be taken seriously to achieve desired performance and uptime of the cluster. Or, if you currently shard data across multiple MySQL instances, knowing that Cassandra deals with sharding and replication for you might be a huge benefit and upsell for Cassandra. But, unfortunately there is no such thing as a free lunch. For example, although Cassandra will remove all of your homegrown database abstraction and sharding code, you ultimately ended simply moving that logic from your code to Cassandra. Luckily, given the number of people and corporations of all sizes using Cassandra in production combined with an engaged and involved community, it’s fair to assume and argue that Cassandra’s equivalent of your MySQL sharding code will be better than your old homegrown solution. In summary: deploying and maintaining a Cassandra cluster isn’t free (and I’m not talking financially, I’m simply saying this isn’t something you can consider a set it and forget it application component) - and planning and accounting for production operations requirements should be strongly considered in a decision to replace MySQL with Cassandra.

My Personal Cassandra Migration Horror Story

Our initial use of Cassandra (now over two years ago) was limited to a very small subset of our data running on a 3-node cluster running C* 0.8.6 deployed in a single datacenter. A few months later, after our initial success with Cassandra we decided to expand our usage and start a project to replace MySQL with C*. We started a migration project and one of the first tasks we knew we had to do was grow our cluster to accommodate the additional write and read load that would be added while we continued to migrate data and application logic from MySQL to C*.

We wanted to initially grow the cluster from 3 nodes to 12, and as this was still during the Cassandra stone ages, v-nodes did not yet exist. As we weren’t simply doubling the cluster size, I researched and planned to add the additional nodes and perform the various move and cleanup operations tasks required to add nodes to the cluster evenly spaced on the ring. So, armed with what I thought was enough information, I fired up a few screen sessions and, naïvely (and in error), ran `nodetool move <token>` on every single node in the cluster at the same time. The boxes quickly ran out of I/O, C* instances started going into GC and I was left with a cascading failure across all the nodes and a totally broken cluster. In retrospect, it’s painfully obvious that a move operation would need to physically transfer many GBs of data between nodes in the cluster. However, at the time I still hadn’t fully wrapped my head around distributed computing and the other added variables (now obvious) such as network latency and overhead and the impact of a saturated uplink to a rack from the core.

If I could give myself some advice in the past: step away from the keyboard before actually deploying Cassandra into production and read the documentation!

Stay tuned this week for more posts from Michael Kjellman. This post is an excerpt from 'Frustrated with MySQL? Improving the scalability, reliability and performance of your application by migrating from MySQL to Cassandra.' In the meantime, check out our other Cassandra Week posts!