An Advanced Cassandra Data Modeling Guide

Data Modeling and Why I Love Schemas
If I had to pick one area to focus on while migrating from MySQL to Cassandra, it would without hesitation be creating a good data model for your dataset that takes Cassandra’s storage design and database architecture appropriately into account.

When I first started with Cassandra data modeling, the database originally lacked current features such as CQL. Additionally it was not required or generally encouraged (or at the very least considered a best-practice) to have a formal schema defined for column families/tables. At the time, while migrating from MySQL to Cassandra, the thought of storing data without a schema seemed exciting. It was the “future” and was a total change from the way I was taught about data modeling in school. A few years and many Cassandra releases later, I can easily admit that I’m glad the schema is back! It turns out I actually love schemas.

With MySQL, like other relational databases, it’s almost a given that a good deal of thought and planning would be put into your applications database schema. Just because Cassandra is considered a “NoSQL” database, does not and should not diminish or reduce the amount of time you should spend on your new data model and C* schema. It actually should take longer to come up with the “perfect” data model for your MySQL dataset in Cassandra (especially the first time) as you need to think long and hard if a given data model will accommodate for future application logic while still allowing for efficient reads. Remember that you are entering a world without JOINS and that data for a given query requiring multiple reads from multiple column family and rows most likely will be split across multiple physical nodes in the cluster.

It’s very hard to fix a bad Cassandra data model in the future
Creating a thoughtful and conscious data model in Cassandra from the very beginning is very important. A bad data model can easily ruin and erase any of the benefits you want by migrating to Cassandra in the first place. With MySQL, the lack of a thoughtful or poor data model can frequently be worked around and accommodated thanks to the various relational database features (for example, the use of complex JOINS). While these MySQL queries might be slow and expensive, given enough time and resources it’s possible to get the exact desired result from the dataset. With Cassandra, it is much harder to retroactively “fix” a poor data model. First, the lack of JOINS in Cassandra removes complex reads as a hacked solution to a bad data model. Additionally, thanks to the power and architecture of Cassandra, it becomes very easy to store more rows and data than imaginable with MySQL. With increased amounts of data stored, comes an increased complexity in successfully get the exact data needed within the given performance boundaries required by your application. A SELECT query containing only 30 rows will return quickly and predictably. Performing a query over 5 million rows requires processing significantly more IO. Just as more data in MySQL made complex JOINS more difficult, accommodating for a C* data model that requires the iteration over multiple nodes and rows will be slow, inefficient, and most likely not work at all. Obviously, faster database responses are always better in any application; so don’t let your data model be the cause of slow database latency in your application!

Denormalize ALL THE THINGS: Increase the number of writes to reduce and simplify reads
Denormalization is the concept that a data model should be designed so that a given query can be served from the results from one row and query. Instead of doing multiple reads from multiple tables and rows to gather all the required data for a response, instead modify your application logic to insert the required data multiple times into every row that might need it in the future This way, all required data can be available in just one read which prevents multiple lookups. In the next section I’ll give an example of a highly relational MySQL schema and how you might denormalize one component to create a data model that fits C* well.

With earlier Cassandra versions, writes tended to be significantly faster and more efficient than reads. However, if you benchmark C* v1.2+ you might find almost equal read and write latencies. For example, in one of our production clusters we currently average 1.45ms for reads and 1.13ms for writes. In previous versions, it was easy to argue that a write heavy data model was always ideal with Cassandra as it would reduce the number of required reads and thus directly increase application performance by reducing overall application latency caused by a read against the Cassandra cluster. Even though it’s no longer valid to use the disparity in latency between reads and writes as the only reason to encourage a write heavy data model, there are many other reasons to creating a data model that sacrifices and increases the number of writes to reduce the number of reads.

Data Model Example #1: Migrate a MySQL database that currently stores documentation for hosts, IPs, and all other operations data on production and development machines at your organization
To help make this example more concrete, I’ll use part of the actual RackTables schema, which is an open-source MySQL powered application that aids in documentation of various attributes and objects (network devices, physical hosts, VM instances) for an Operations or IT group.

The RackTables data model revolves around the concept of various Objects linked together using complex relationships to associate multiple unrelated objects with each other. For instance, let’s say we want to document that host is plugged in to power socket 4 of the managed PDU in datacenter rack 12 in row 4.

Using a data model similar to RackTables is very common in the relational MySQL world. RackTables would creates 4 Objects in a table: One for the managed PDU, one for each of the physical power sockets present on the PDU, one for the host, and one for the physical rack at the datacenter. In a separate table, RackTables then will insert a row for each relationship between the Objects that needs to be tracked. For instance, we’ll need to have the association’s table have an entry recording that rack 12 has an association host. We’ll also need an association from the rack to the PDU, the host to the PDU and the PDU’s power socket. Clearly there are lots of associations between different Objects.

When the user wants to later use RackTables to see what physical power socket host is plugged into, a series of SELECT queries must be made to JOIN the relationships from the relationship or association table for the various Objects, with the actual Object’s metadata in the Object table.

Re-implementing this highly relational data model mostly as is with Cassandra is a recipe for disaster. Instead, we can implement a data model that works well with Cassandra by denormalizing the data for the Objects we want to document. As a quick refresher, denormalization is the process of adding redundant data in each write to both simplify and increase the performance of reads. Each host or network device that we track can have one or more other pieces of data stored in it’s row about other objects it’s related to.

In our RackTables case, we could use a data model that denormalizes each Object by including all the additional information needed during a read about any other possibly associated Objects directly in the Objects row itself. For instance, we could add a CQL3 List or Map Collection column to the host object’s key itself. That way we could track all of the power sockets that host is plugged into. If we wanted to then retrieve what power socket’s the host was plugged into, we only need to do one read in Cassandra for the row key for the host (which in this example, for simplicity in this example we will just use the box’s hostname).

Example CQL3 queries (CQL Language Spec and Documentation) for implementing a data model appropriate for Cassandra to migrate current MySQL based application, RackTables.

CQL Sample Code

By de-normalizing the data on insert and updated the PDU socket port information into a separate column in the host’s record, with one read or a given hosts row, we were able to get all of the information we would need to be able to show a host object’s metadata and all of the information we’d like to display about it’s associated objects. Most importantly we did this without querying multiple tables and merging and reconciling the results by using a data model that duplicates data and writes our host information in a de-normalized manner contrary to the original highly relational MySQL data model.

Data Model Example #2: History and Activity Tracking
It’s common to implement some form of tracking in almost every application. For example tracking changes made to our systems documentation application from Data Model Example #1. As inserts into MySQL can be extremely expensive, it’s fairly common practice to bulk up pending INSERTS and UPDATES for a given record in the application code to bulk and de-duplicate multiple requests and reduce load on the database. In MySQL we likely would update the same column and row with the current value. With Cassandra, we can take a different approach by embracing denormalization in our data model. Instead of updating the same value over and over, we can insert a history line for each action taken by a given user. If the volume or inserts or updates is high, we can insert each history audit line with a TTL. This way after a given period of time has elapsed, Cassandra will automatically invalidate that row and tombstone the data for eventual collection. If we wanted to be able to recall the last 30 days of audit activity for a given user, we can SELECT all rows from our user activity table for the given username. When we insert the data, we can insert with a 30-day TTL which means the results returned will be all inserted audit history lines for the given user over the 30 day period.

I’m fond of this data model to also track changing attributes as it also gives the added benefit of historical values in addition to the current value. MySQL mentality has trained most people to do everything possible to prevent writes in their application. With the performance of Cassandra and it’s ability to deal with a high number of concurrent writes, leverage this to think of a data model for your Cassandra table that reduces and hopefully avoids a data model that requires reading the current value before updating and adding the current value. Instead tracking the current value in a de-normalized activity stream like data model improves data integrity and consistency. As an added benefit, by keeping the historical values, we can use something like Hadoop to run a Map / Reduce job over all our audit history lines and look for and create an alert on abnormal detected trends.

Data Model Example #3: A Queue
Although using MySQL as a queue is obviously far from ideal, I’ve seen many examples where applications use MySQL in some form as a queue. Don't be tempted to use Cassandra as a queue. Using Cassandra as a queue is an anti-pattern. Due to Cassandra’s Dynamo architecture, a queue workload will generate many tombstones. If you ultimately try to use Cassandra as a queue, you will end up with increasingly slow and unpredictable performance.

If you need a queue, instead you could evaluate some of the current popular queuing options such as Kafka and RabbitMQ.

Eventual Consistency
When planning a data model and later implementing your application code, make sure to consider what consistency level (i.e. ONE, QUORUM, SERIAL) is required for each query. What consistency level is required by your application for a given mutation to ensure integrity of inserted data?

Client Language Support
MySQL has broad language support, certainly helped by it’s status as one of the most popular databases for 18+ years. Thankfully, there is also a good client story with Cassandra with many languages supported by a broad number of developers. Here's a good list of Cassandra client options.

Stay tuned this week for more posts from Michael Kjellman. This post is an excerpt from 'Frustrated with MySQL? Improving the scalability, reliability and performance of your application by migrating from MySQL to Cassandra.' In the meantime, check out our other Cassandra Week posts!