Cassandra Internals – Writing

We’ve started using Cassandra as our next-generation data storage engine at OneSpot (replacing a very large PostgreSQL machine with a cluster of EC2 machines), so I’ve been working with it for the last few weeks. As an infrastructure nerd and a big believer in understanding the various layers of the stack, I’ve been reading up on how Cassandra works and wanted to write a summary others can benefit from. Since Cassandra is known for very good write performance, I thought I’d cover that first.

The first thing to understand is that Cassandra is designed to run on many machines; from what I’ve heard, Twitter runs a cluster of 45. It doesn’t make much sense to run Cassandra on a single machine, since you lose the benefits of a system with no single point of failure.

Your client sends a write request to a single, random Cassandra node. This node acts as a proxy and writes the data to the cluster. The nodes form a “ring”, and writes are replicated to N nodes according to a replication placement strategy. With the RackAwareStrategy, Cassandra determines the “distance” of each node from the current node for reliability and availability purposes, where “distance” falls into three buckets: same rack as the current node, same data center as the current node, or a different data center. You configure Cassandra to write each piece of data to N nodes for redundancy: the first copy goes to the primary node for that data, the second copy to the next node along the ring that sits in another data center, and the remaining copies to machines in the same data center as the first. This ensures that a single failure does not take down the entire cluster, and that the cluster stays available even if an entire data center goes offline.
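
To make the ring idea concrete, here is a toy sketch of replica placement on a token ring. The node names and tokens are made up for illustration, and the structure is heavily simplified; Cassandra’s RandomPartitioner similarly maps keys to MD5 tokens and walks the ring clockwise to pick replicas.

```python
import hashlib
from bisect import bisect_right

def token_for(key: str) -> int:
    """Map a key to a position on the ring (MD5, like the RandomPartitioner)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # nodes: {name: token}; sorting by token forms the ring
        self.sorted = sorted((tok, name) for name, tok in nodes.items())
        self.tokens = [tok for tok, _ in self.sorted]

    def replicas(self, key, n=3):
        """Return the n nodes responsible for key: the primary is the first
        node whose token follows the key's token, and the remaining replicas
        are the next nodes walking clockwise around the ring (wrapping)."""
        i = bisect_right(self.tokens, token_for(key)) % len(self.sorted)
        return [self.sorted[(i + k) % len(self.sorted)][1] for k in range(n)]

# Hypothetical 4-node cluster with hand-picked tokens
ring = Ring({"node-a": 0, "node-b": 2**125, "node-c": 2**126, "node-d": 2**127})
print(ring.replicas("user:42"))  # three distinct node names
```

A real placement strategy like RackAwareStrategy would additionally skip ring positions to satisfy the rack/data-center constraints described above; this sketch only shows the basic “next nodes on the ring” walk.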

So the write request goes from your client to a single random node, which sends the write to N different nodes according to the replication placement strategy. There are many edge cases here (nodes down, nodes being added to the cluster, etc.) which I won’t go into, but the proxy node waits for the N successes and then returns success to the client.
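
The proxy’s fan-out-and-wait behavior can be sketched like this. `replica_write` is a stand-in for the real inter-node message, and all names here are illustrative, not Cassandra’s actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def replica_write(node, mutation):
    # In reality this is a network call to another Cassandra node.
    return (node, "ok")

def coordinate_write(replicas, mutation):
    """Send the write to every replica in parallel and return success
    only once all N replicas have acknowledged."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica_write, n, mutation) for n in replicas]
        acks = [f.result() for f in futures]  # block until all replies arrive
    return all(status == "ok" for _, status in acks)

print(coordinate_write(["node-a", "node-b", "node-c"], {"key": "user:42"}))  # True
```

In practice the number of acknowledgements the proxy waits for is tunable (the consistency level), but waiting for all N is the simplest case to picture.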

Each of those N nodes gets that write request in the form of a “RowMutation” message. The node performs two actions for this message:

  • Append the mutation to the commit log for transactional purposes
  • Update an in-memory Memtable structure with the change

And it’s done. This is why Cassandra is so fast for writes: the slowest part is appending to a file. Unlike a traditional database, Cassandra does not update data in place on disk or update indices, so there are no intensive synchronous disk operations to block the write.
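
Those two steps might look something like the following toy sketch. The `CommitLog`, `Memtable`, and `apply_mutation` pieces here are simplified stand-ins, not Cassandra’s actual classes:

```python
import json
import os
import tempfile
from collections import defaultdict

class CommitLog:
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, mutation: dict):
        # Sequential append; the only disk I/O on the write path.
        self.f.write((json.dumps(mutation) + "\n").encode())
        self.f.flush()
        os.fsync(self.f.fileno())

class Memtable:
    def __init__(self):
        self.rows = defaultdict(dict)  # row key -> {column: value}

    def apply(self, mutation):
        self.rows[mutation["key"]].update(mutation["columns"])

def apply_mutation(log, memtable, mutation):
    log.append(mutation)      # 1. durable record for crash recovery
    memtable.apply(mutation)  # 2. fast in-memory update; no data files touched

log = CommitLog(os.path.join(tempfile.mkdtemp(), "commitlog"))
mt = Memtable()
apply_mutation(log, mt, {"key": "user:42", "columns": {"name": "Ada"}})
print(mt.rows["user:42"])  # {'name': 'Ada'}
```

The key property is that the commit log is append-only, so even the durable part of the write is a cheap sequential operation rather than a random seek.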

There are several asynchronous operations which occur regularly:

  • A “full” Memtable is written out to a disk-based structure called an SSTable, so the amount of data held only in memory stays bounded.
  • The set of SSTables that exist for a given ColumnFamily is merged into one large SSTable, an operation known as compaction. The old SSTables are then obsolete and can be garbage collected at some point in the future.
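
Flush and compaction can be pictured with a deliberately simplified model, where an “SSTable” is just a sorted, immutable list of (key, columns) pairs. Everything below is illustrative only:

```python
def flush(memtable: dict) -> list:
    """Write a full memtable out as a sorted, immutable 'SSTable'."""
    return sorted(memtable.items())

def compact(sstables: list) -> list:
    """Merge several SSTables into one. For each key the newest value wins,
    so the input list is ordered oldest-first and later tables overwrite."""
    merged = {}
    for table in sstables:  # oldest first
        for key, columns in table:
            merged.setdefault(key, {}).update(columns)
    return sorted(merged.items())

old = flush({"a": {"x": 1}, "b": {"x": 2}})
new = flush({"b": {"x": 9}, "c": {"x": 3}})
print(compact([old, new]))  # [('a', {'x': 1}), ('b', {'x': 9}), ('c', {'x': 3})]
```

This is also where “updates” are resolved: the same key may appear in several SSTables, and compaction keeps only the newest version, discarding the stale ones.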

There are lots of edge cases and complications beyond what I’ve talked about so far. I highly recommend reading the Cassandra wiki pages for ArchitectureInternals and Operations at the very least. Distributed systems are hard and Cassandra is no different.

Please leave a comment if you have a correction or want to add detail – I’m not a Cassandra developer so I’m sure there’s a mistake or two hidden up there.

25 thoughts on “Cassandra Internals – Writing”

  1. So, if each write only writes to N nodes, how do the other nodes eventually get updated so that they’re eventually consistent for reads?

  2. Chris, not every node has a copy of the data, just N nodes where typically N = 3. The fact that the write will eventually propagate to those three nodes makes Cassandra eventually consistent.

  3. Mike, great article. I hope you can write one about reading from Cassandra and how that works. I am curious about your EC2 experience, if you can share some details:
    1. What kind of performance are you getting? How many writes vs. reads do you do per minute?
    2. Have you used EBS or local disks?
    3. What would you recommend, from your experience, to those who would like to try out Cassandra on EC2?
    4. Someone suggested using ZooKeeper for adding locks on top of Cassandra? What are the pros and cons of doing that?

  4. Nice article. Just curious: when you said “Unlike a database, Cassandra does not update data in-place on disk, nor update indices, so there’s no intensive synchronous disk operations to block the write”, which database were you referring to? Oracle can modify data blocks in memory and only write the redo log to disk, then move on. It seems to me that Cassandra is similar to what Oracle is doing in this respect, isn’t it?

  5. Oracle uses its dbwr/lgwr/ckpt system processes to do similar jobs to what Cassandra does. I’d describe it as turning synchronous jobs into asynchronous jobs, and random I/O into sequential I/O. Cassandra does both, via the commit log and SSTables.

  6. Mike, nice write-up. One question though: does that mean that we cannot update our data in Cassandra? Do we have to delete and insert a new record if we want to update our data?
  7. Joshua, no, it just means that Cassandra does not perform updates in place internally. Everything is an insert; old values are “garbage collected” during the compaction operation. To you it looks like an update.

  8. Hi Mike,

    Thanks for the post.

    I have a question about “Append the mutation to the commit log for transactional purposes”: is this a synchronous disk operation? That is, does the commit wait for this disk write?

  9. Hello Mike!
    Does this mean that only as much data is stored on disk, in SSTables, as can fit into memory? The things you wrote about updates make me think that. Or, if the record is not in memory, how will it get updated on disk?
    My other question is about hardware. As many servers as possible, I guess. What kind of servers are usually used (or used at your project), and how many of them? At your project, if you can remember, just for an example. What was the read/write performance, if it’s not a secret?
    Btw, still a useful, interesting post. Thank you!

Comments are closed.