Mike Perham

On Ruby, software and the Internet

Cassandra Internals – Tricks!

March 20th, 2010 · 4 Comments

In my previous posts, I covered how Cassandra reads and writes data. In this post, I want to explain some of the trickery that Cassandra uses to provide a scalable distributed system.

Gossip

Cassandra is a cluster of individual nodes – there’s no “master” node or single point of failure – so each node must actively verify the state of the other cluster members. They do this with a mechanism known as gossip. Each node ‘gossips’ to 1-3 other nodes every second about the state of each node in the cluster. The gossip data is versioned so that any change for a node will quickly propagate throughout the entire cluster. In this way, every node will know the current state of every other node: whether it is bootstrapping, running normally, etc.

Hinted Handoff

In writing, I mentioned that Cassandra stores a copy of the data on N nodes. The client can select a consistency level for a write based on the importance of the data – for example, ConsistencyLevel.QUORUM means that a majority of those N nodes must reply success for the write to be considered successful.

What happens if one of those nodes goes down? How do those writes propagate to that node later? Cassandra uses a technique known as hinted handoff, where the data is written to anther random node X to be stored and replayed for node Y when it comes back online (remember that gossip will quickly tell X when Y comes online). Hinted handoff ensures that node Y will quickly match the rest of the cluster. Note that read repair would still eventually “fix” the old data if hinted handoff did not work for some reason but only once the client asked for that data.

Hinted writes are not readable (since node X is not officially one of the N copies) so they don’t count toward write consistency. If Cassandra is configured for three copies and two of those nodes are down, it would be impossible to fulfill a ConsistencyLevel.QUORUM write.

Anti-Entropy

The final trick up Cassandra’s proverbial sleeve is anti-entropy. AE explicitly ensures that the nodes in the cluster agree on the current data. If read repair or hinted handoff don’t work due to some set of circumstances, the AE service will ensure that nodes reach eventual consistency. The AE service runs during “major compactions” (the equivalent of rebuilding a table in an RDBMS) so it is a relatively heavyweight process that runs infrequently. AE uses a Merkle Tree to determine where within the tree of column family data the nodes disagree and then repairs each of those branches.

This is the last post in my series on Cassandra. I hope you enjoyed them! Please leave a comment if you have questions or if I’ve made an error above.

Tags: Software

4 responses so far ↓

  • 1 Brian S. // Mar 21, 2010 at 8:31 pm

    Hey Mike,
    I’ve thoroughly enjoyed your series of posts on Cassandra’s internals. Thanks!

  • 2 Cassandra内部机制 – 技巧 « a db thinker's home // May 30, 2010 at 9:23 am

    [...] – 技巧 Cassandra内部运作 – 技巧 By Mike Perham Translated By [...]

  • 3 Shen // Jul 8, 2010 at 12:58 am

    Hi Mike,
    f I have A,B,C and D nodes(RF=1), and write ConsistencyLevel=ANY, so A coordinator node sends a write request to another node(e.g. B node), but B node is down during write operation, what happend? return failure message to client immediately? or write a hint to another node(e.g. C node) and return success message to client? or could anyone can give me a real case?

  • 4 Mike Perham // Jul 8, 2010 at 9:28 am

    I believe it will write a hint to C and return success. The Cassandra wiki on Hinted Handoff should have more details.

Leave a Comment