Mike Perham

On Ruby, software and the Internet

Cassandra Internals – Writing

March 13th, 2010 · 19 Comments

Cassandra logo

We’ve started using Cassandra as our next-generation data storage engine at OneSpot (replacing a very large Postgresql machine with a cluster of EC2 machines) and so I’ve been using it for the last few weeks. As I’m an infrastructure nerd and a big believer in understanding the various layers in the stack, I’ve been reading up a bit on how Cassandra works and wanted to write a summary for others to benefit from. Since Cassandra is known to have very good write performance, I thought I would cover that first.

First thing to understand is that Cassandra wants to run on many machines. From what I’ve heard, Twitter uses a cluster of 45 machines. It doesn’t make a lot of sense to run Cassandra on a single machine as you are losing the benefits of a system with no single point of failure.

Your client sends a write request to a single, random Cassandra node. This node acts as a proxy and writes the data to the cluster. The cluster of nodes is stored as a “ring” of nodes and writes are replicated to N nodes using a replication placement strategy. With the RackAwareStrategy, Cassandra will determine the “distance” from the current node for reliability and availability purposes where “distance” is broken into three buckets: same rack as current node, same data center as current node, or a different data center. You configure Cassandra to write data to N nodes for redundancy and it will write the first copy to the primary node for that data, the second copy to the next node in the ring in another data center, and the rest of the copies to machines in the same data center as the proxy. This ensures that a single failure does not take down the entire cluster and the cluster will be available even if an entire data center goes offline.

So the write request goes from your client to a single random node, which sends the write to N different nodes according to the replication placement strategy. There are many edge cases here (nodes are down, nodes being added to the cluster, etc) which I won’t go into but the node waits for the N successes and then returns success to the client.

Each of those N nodes gets that write request in the form of a “RowMutation” message. The node performs two actions for this message:

  • Append the mutation to the commit log for transactional purposes
  • Update an in-memory Memtable structure with the change

And it’s done. This is why Cassandra is so fast for writes: the slowest part is appending to a file. Unlike a database, Cassandra does not update data in-place on disk, nor update indices, so there’s no intensive synchronous disk operations to block the write.

There are several asynchronous operations which occur regularly:

  • A “full” Memtable structure is written to a disk-based structure called an SSTable so we don’t get too much data in-memory only.
  • The set of temporary SSTables which exist for a given ColumnFamily are merged into one large SSTable. At this point the temporary SSTables are old and can be garbage collected at some point in the future.

There are lots of edge cases and complications beyond what I’ve talked about so far. I highly recommend reading the Cassandra wiki pages for ArchitectureInternals and Operations at the very least. Distributed systems are hard and Cassandra is no different.

Please leave a comment if you have a correction or want to add detail – I’m not a Cassandra developer so I’m sure there’s a mistake or two hidden up there.

Tags: Software

19 responses so far ↓

  • 1 Jonathan Ellis // Mar 14, 2010 at 12:37 pm

    Nice writeup, Mike! Thanks!

  • 2 Sal // Mar 14, 2010 at 4:59 pm

    Nice post. However, I sincerely hope this is part one of a series :)

  • 3 Joe Stein // Mar 17, 2010 at 2:52 pm

    I too hope there is more, if so please add it to the news feed of the LinkedIn group http://www.linkedin.com/groups?home=&gid=2822930&trk=anet_ug_hm I already added this one

  • 4 Chris R. // Mar 17, 2010 at 4:01 pm

    So, if each write only writes to N nodes, how do the other nodes eventually get updated so that they’re eventually consistent for reads?

  • 5 Mike Perham // Mar 17, 2010 at 4:54 pm

    Chris, not every node has a copy of the data, just N nodes where typically N = 3. The fact that the write will eventually propagate to those three nodes makes Cassandra eventually consistent.

  • 6 Roger Schildmeijer // Mar 19, 2010 at 1:34 pm

    Nice A4 summary, good work.

  • 7 Lenin Gali // Mar 20, 2010 at 4:30 am

    Mike, great article. I hope you can write one about reading from cassandra and how that works. I am curious about your EC2 experience, if you can share some details on
    1.what kind of performance are you getting, how many writes vs reads do you do per min?
    2. have you used EBS or local disks and
    3. what would you recommend from your experience to those who would like to try out cassandra on EC2?
    4. Some one suggested to use Zookeeper for adding locks on top Cassandra? what are pros and cons of doing that?

  • 8 Cheatsheet: 2010 04.01 ~ 04.07 - gOODiDEA.NET // Apr 6, 2010 at 9:20 pm

    [...] Cassandra Internals – Writing – Cassandra Write Operation Performance Explained [...]

  • 9 OCTO talks ! » no:sql(eu) and NoSQL: What’s the deal? // Apr 22, 2010 at 12:47 pm

    [...] – NoSQL is about data usages and reminds us that the most important – and I guess the forgotten part in our traditional systems – is to understand how data is used. A system can live with inconsistent data (in fact this is already the case if you are using database Master/Slave replication, caching or asynchronous approaches). This is not a big deal in most of the cases but we need to keep that in mind (I must admit that we, as architects, often forget about this concern since we have been using RDBMS and ACID transactions for a long time). What happens for the business if an operation (let’s say a withdrawal) is consistent but my balance stays inconsistent a couple of seconds or even a couple of minutes? Will my data be accessed in a highly concurrent manner and so do I have a high probability to get conflicts on that data? In case of conflicts, do I need complex conflict resolutions strategies for read-repair? more specifically, do I need a Vector Clock model or a much simpler but also efficient mechanism based on timestamps like in Cassandra? How many revisions of my objects do I need to keep? For instance, Riak stores revisions of objects that can be used for conflicts resolutions but the number of revisions you need to keep has to be correlated to the probability of having a conflict on that data. How much time do I have between the moment my data is written and the moment that same data is read? And what is the probability of being in an inconsistent state? If there is one hour between the write and the read, the probability is quite low. Is my data mainly used for reads or writes ? Some of these systems have been optimized for writes and less for reads. For instance Cassandra does many more disk accesses during a read than during a write [...]

  • 10 OCTO talks ! » no:sql(eu) et NoSQL : qu’en est il? // Apr 22, 2010 at 5:00 pm

    [...] – NoSQL concerne l’utilisation de la donnée et nous rappelle qu’un point important – et je pense souvent oublié dans nos SIs traditionnels – est de comprendre comment la donnée est utilisée. Un système peut vivre avec de la donnée inconsistante. En fait, c’est déjà le cas si vous utilisez les mécanismes de réplication Master/Slave, du cache ou simplement l’utilisation de messages asynchrones. Juste se demander comment sera utiliser la donnée. Que se passera-t-il du point de vue du métier si mon opération (disons un retrait) est consistant mais que mon solde reste inconsistant pendant quelques secondes ou même quelques minutes? Ma donnée sera-t-elle accéder de manière concurrente? Y aura-t-il une forte probabilité de conflit sur cette donnée? En cas de conflit, de quelle stratégie de « read-repair » aurai-je besoin (un modèle type Vector Clock ou plus simplement l’approche timestamp de Cassandra)? Combien de temps y a t-il entre le moment ou j’écris ma donnée et le moment où elle est lue? Du coup, quelle est la probabilité de « Stale State »? S’il y a un une heure entre les deux opérations, la probabilité est faible… Ma donnée est-elle plutôt utilisée en écriture ou en lecture? Certains de ces systèmes ont été optimisés pour l’écriture. Par exemple, Cassandra réalise plus d’accès disque pour une lecture que pour une écriture. [...]

  • 11 Steve Lihn // May 23, 2010 at 8:55 pm

    Nice article. Just curious when you said “Unlike a database, Cassandra does not update data in-place on disk, nor update indices, so there’s no intensive synchronous disk operations to block the write.”, which database are you referring to? Oracle can modify data blocks in memory and only write the redo log to disk, then move on. It seems to be Cassandra is similar to what Oracle is doing in this respect, isn’t it?

  • 12 jametong // May 29, 2010 at 1:38 am

    Oracle has use dbwr/lgwr/ckpt systm processes do similar jobs as Cassandra does. I refer it as change sync jobs to async jobs, and change random i/o to sequential i/o. Cassandra does it both in Commit log and SSTable.

  • 13 Cassandra内核介绍–写操作 « a db thinker's home // May 29, 2010 at 11:03 pm

    [...] Cassandra内核介绍–写操作 by Mike Perham Translated by [...]

  • 14 Joshua Partogi // Sep 19, 2010 at 4:36 pm

    Mike, nice write up. One question though, does that mean that we can not update our data in cassandra? Does that mean we have to delete and insert new record if we want to update our data?

    Cheers

  • 15 Mike Perham // Sep 19, 2010 at 8:36 pm

    Joshua, no it just means that Cassandra does not perform updates internally. Everything is an insert. Old values are “garbage collected” during the compaction operation. To you it looks like an update.

  • 16 Ankit Jhalaria // Sep 22, 2010 at 6:35 pm

    Awesome Post!! Helped me a lot.
    Thanks

  • 17 Cassandra Write Operation Performance Explained | All in one for social - Blog // Nov 8, 2010 at 2:00 am

    [...] Performance, Technology by Nguyễn Bá Khoa — Leave a comment08/11/2010 Number of View: 6An ☞ interesting explanation of how Cassandra write ops are happening:client submits its write request to a single, random [...]

  • 18 夏客行 » Cassandra内核介绍–写操作 // Dec 3, 2010 at 8:29 pm

    [...] Cassandra内核介绍–写操作 by Mike Perham Translated by Jametong [...]

  • 19 amalendu // Apr 14, 2011 at 1:08 pm

    Hi MIke,

    Thanks for the post .

    I have a question :

    “Append the mutation to the commit log for transactional purposes”

    Is this a Synchronous Disk Operation ?

    means, does commit wait for this disk write ?

Leave a Comment