Cassandra Internals – Tricks!

2010-03-20 cassandra

In my previous posts, I covered how Cassandra reads and writes data. In this post, I want to explain some of the trickery that Cassandra uses to provide a scalable distributed system.

Gossip

Cassandra is a cluster of individual nodes – there’s no “master” node or single point of failure – so each node must actively verify the state of the other cluster members. They do this with a mechanism known as gossip. Each node ‘gossips’ to 1-3 other nodes every second about the state of each node in the cluster. The gossip data is versioned so that any change for a node will quickly propagate throughout the entire cluster. In this way, every node will know the current state of every other node: whether it is bootstrapping, running normally, etc.
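To make the versioning concrete, here is a minimal Java sketch of the idea (the class and field names are mine, not Cassandra's): each node remembers the highest version it has seen for every peer, and a gossip exchange simply keeps whichever entry is newer, so any change propagates through the cluster in a handful of rounds.

    import java.util.HashMap;
    import java.util.Map;

    // A sketch of versioned gossip state; illustrative names only.
    public class GossipSketch {
        // What a node believes about one peer. Any local change (status,
        // load, etc.) bumps the owning node's version number.
        static class EndpointState {
            final String status;  // e.g. "NORMAL", "BOOTSTRAPPING"
            final long version;   // monotonically increasing per node
            EndpointState(String status, long version) {
                this.status = status;
                this.version = version;
            }
        }

        // This node's view of the whole cluster, keyed by peer address.
        final Map<String, EndpointState> cluster = new HashMap<>();

        // Merge one entry received from a gossip partner: the higher
        // version wins, so stale information never overwrites fresh.
        void merge(String peer, EndpointState remote) {
            EndpointState local = cluster.get(peer);
            if (local == null || remote.version > local.version) {
                cluster.put(peer, remote);
            }
        }
    }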

Hinted Handoff

In my post on writing, I mentioned that Cassandra stores a copy of the data on N nodes. The client can select a consistency level for each write based on the importance of the data – for example, ConsistencyLevel.QUORUM means that a majority of those N nodes must acknowledge the write for it to be considered successful.
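The quorum arithmetic itself is simple; here it is as a tiny Java sketch (a worked example, not Cassandra's API):

    // With N replicas, QUORUM means floor(N/2) + 1 acknowledgements.
    public class QuorumMath {
        static int quorum(int replicationFactor) {
            return replicationFactor / 2 + 1;
        }

        public static void main(String[] args) {
            System.out.println(quorum(3)); // 2 – survives one dead replica
            System.out.println(quorum(5)); // 3 – survives two dead replicas
        }
    }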

What happens if one of those nodes goes down? How do those writes propagate to that node later? Cassandra uses a technique known as hinted handoff: the data is written to another random node X, which stores it and replays it to node Y when Y comes back online (remember that gossip will quickly tell X when that happens). Hinted handoff ensures that node Y quickly catches up with the rest of the cluster. Note that if hinted handoff failed for some reason, read repair would still eventually “fix” the stale data, but only once a client asked for it.
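Here is a rough Java sketch of that flow from node X's point of view, with invented names and the networking elided: X queues mutations on behalf of the dead node and replays them when gossip reports it alive again.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A sketch of hinted handoff; illustrative names only.
    public class HintedHandoffSketch {
        // Hints held locally on behalf of replicas that were down.
        final Map<String, List<String>> hintsByDeadNode = new HashMap<>();

        // Called during a write when gossip says a replica is down:
        // keep the mutation locally so it is not lost.
        void storeHint(String deadNode, String mutation) {
            hintsByDeadNode.computeIfAbsent(deadNode, k -> new ArrayList<>())
                           .add(mutation);
        }

        // Called when gossip reports a node back online: replay its hints.
        void onNodeAlive(String node) {
            List<String> hints = hintsByDeadNode.remove(node);
            if (hints != null) {
                for (String mutation : hints) {
                    deliver(node, mutation); // re-send the original write
                }
            }
        }

        void deliver(String node, String mutation) { /* network send elided */ }
    }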

Hinted writes are not readable (since node X is not officially one of the N copies), so they don’t count toward write consistency. If Cassandra is configured for three copies and two of those nodes are down, it is impossible to fulfill a ConsistencyLevel.QUORUM write.

Anti-Entropy

The final trick up Cassandra’s proverbial sleeve is anti-entropy (AE). The AE service explicitly ensures that the nodes in the cluster agree on the current data. If read repair or hinted handoff haven’t done the job due to some set of circumstances, the AE service ensures that nodes reach eventual consistency. The AE service runs during “major compactions” (the equivalent of rebuilding a table in an RDBMS), so it is a relatively heavyweight process that runs infrequently. AE uses a Merkle tree to determine where within the tree of column family data the nodes disagree, then repairs each of those branches.
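Here is a simplified Java sketch of the comparison step, assuming both replicas build trees of the same shape over the same token ranges (the hashing of actual row data is elided and the names are mine): matching subtree hashes prune the comparison, so only the disagreeing ranges ever need to be repaired.

    import java.util.List;

    // A sketch of Merkle-tree comparison; illustrative names only.
    public class MerkleSketch {
        static class Node {
            long hash;         // hash of all row data under this subtree
            Node left, right;  // both null for a leaf
            String range;      // the token range a leaf covers
        }

        // Walk both trees in lock-step. If the hashes agree, the whole
        // subtree agrees and is skipped; otherwise recurse until the
        // disagreeing leaf ranges are found.
        static void diff(Node a, Node b, List<String> outOfSync) {
            if (a.hash == b.hash) return;   // subtrees agree: prune
            if (a.left == null) {           // leaf: this range needs repair
                outOfSync.add(a.range);
                return;
            }
            diff(a.left, b.left, outOfSync);
            diff(a.right, b.right, outOfSync);
        }
    }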

This is the last post in my series on Cassandra. I hope you enjoyed them! Please leave a comment if you have questions or if I’ve made an error above.