<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mike Perham &#187; Software</title>
	<atom:link href="http://www.mikeperham.com/category/software/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mikeperham.com</link>
	<description>On Ruby, software and the Internet</description>
	<lastBuildDate>Sat, 22 May 2010 03:05:29 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Detecting Duplicate Images with Phashion</title>
		<link>http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/</link>
		<comments>http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/#comments</comments>
		<pubDate>Sat, 22 May 2010 03:05:29 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=556</guid>
		<description><![CDATA[Recently I was given a ticket to implement a &#8220;near-duplicate&#8221; image detector.  Look at these three images:
The original image files have different bytesizes and different sizes but they show essentially the same thing.  This is what we call a &#8220;near-duplicate&#8221; and the problem was that when displaying an automatically generated image gallery for [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I was given a ticket to implement a &#8220;near-duplicate&#8221; image detector.  Look at these three images:<br />

<a href='http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/earns-apple/' title='Earns Apple'><img width="86" height="86" src="http://www.mikeperham.com/wp-content/uploads/2010/05/86x86-0a1e.jpeg" class="attachment-thumbnail" alt="" title="Earns Apple" /></a>
<a href='http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/86x86-83d6/' title='86x86-83d6'><img width="86" height="86" src="http://www.mikeperham.com/wp-content/uploads/2010/05/86x86-83d6.jpeg" class="attachment-thumbnail" alt="" title="86x86-83d6" /></a>
<a href='http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/86x86-a855/' title='86x86-a855'><img width="86" height="86" src="http://www.mikeperham.com/wp-content/uploads/2010/05/86x86-a855.jpeg" class="attachment-thumbnail" alt="" title="86x86-a855" /></a>
<br />
The original image files have different bytesizes and different sizes but they show essentially the same thing.  This is what we call a &#8220;near-duplicate&#8221; and the problem was that when displaying an automatically generated image gallery for a given subject, we were sometimes showing duplicate images due to slight differences in the images.</p>
<p>Obviously we can&#8217;t use something like an MD5 or SHA1 fingerprint &#8211; we have to create a fingerprint based on the content of the image, not the exact bytes.  This is what the <a href="http://phash.org">pHash library</a> does.  A &#8220;perceptual hash&#8221; is a 64-bit value based on the discrete cosine transform of the image&#8217;s frequency spectrum data.  Similar images will have hashes that are close in terms of <a href="http://en.wikipedia.org/wiki/Hamming_distance">Hamming distance</a>.  That is, a binary hash value of 1000 is closer to 0000 than 0011 is, because it has only one bit different whereas the latter value has two bits different.  The duplicate threshold defines how many bits must be different between two hashes for the two associated images to be considered different images.  Our testing showed that 15 bits is a good value to start with: it detected all duplicates with a minimum of false positives.</p>
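<p>To make the Hamming distance comparison concrete, here is a minimal sketch in plain Ruby.  The helper names and the strict less-than comparison are assumptions for illustration, not the pHash or Phashion API; only the 15-bit starting threshold comes from the testing described above.</p>

```ruby
# Number of bit positions at which two non-negative integer hashes differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

# Hypothetical helper: two hashes count as near-duplicates when fewer than
# `threshold` bits differ (15 is the starting value mentioned above).
def near_duplicate?(hash_a, hash_b, threshold = 15)
  hamming_distance(hash_a, hash_b) < threshold
end

puts hamming_distance(0b1000, 0b0000) # one bit differs
puts hamming_distance(0b0011, 0b0000) # two bits differ
```

<p>Raising the threshold catches more near-duplicates at the cost of more false positives, so it is worth tuning against your own image set.</p>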
<p><a href="http://github.com/mperham/phashion">Phashion</a> is my new Ruby wrapper for the pHash library and wraps just enough of the pHash API to implement the described functionality.  Here&#8217;s the test in the test suite which verifies that Phashion considers the images to be duplicates:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">  <span style="color:#9966CC; font-weight:bold;">def</span> assert_duplicate<span style="color:#006600; font-weight:bold;">&#40;</span>a, b<span style="color:#006600; font-weight:bold;">&#41;</span>
    assert a.<span style="color:#9900CC;">duplicate</span>?<span style="color:#006600; font-weight:bold;">&#40;</span>b<span style="color:#006600; font-weight:bold;">&#41;</span>, <span style="color:#996600;">&quot;#{a.filename} not dupe of #{b.filename}&quot;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
  <span style="color:#9966CC; font-weight:bold;">def</span> test_duplicate_detection
    files = <span style="color:#006600; font-weight:bold;">%</span>w<span style="color:#006600; font-weight:bold;">&#40;</span>86x86<span style="color:#006600; font-weight:bold;">-</span>0a1e.<span style="color:#9900CC;">jpeg</span> 86x86<span style="color:#006600; font-weight:bold;">-</span>83d6.<span style="color:#9900CC;">jpeg</span> 86x86<span style="color:#006600; font-weight:bold;">-</span>a855.<span style="color:#9900CC;">jpeg</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    images = files.<span style="color:#9900CC;">map</span> <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span> <span style="color:#6666ff; font-weight:bold;">Phashion::Image</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;#{File.dirname(__FILE__) + '/../test/'}#{f}&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#125;</span>
    assert_duplicate images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span>, images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#93;</span>
    assert_duplicate images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#93;</span>, images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">2</span><span style="color:#006600; font-weight:bold;">&#93;</span>
    assert_duplicate images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span>, images<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">2</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>pHash does have much more functionality, including video and audio support and persistent MVP tree support for similarity searches based on previously processed files, but I have not wrapped any of those APIs.  Try it out and let me know what you think!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Stream Processing and &#8220;Trending&#8221; Data</title>
		<link>http://www.mikeperham.com/2010/05/05/stream-processing-and-trending-data/</link>
		<comments>http://www.mikeperham.com/2010/05/05/stream-processing-and-trending-data/#comments</comments>
		<pubDate>Wed, 05 May 2010 19:01:35 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=553</guid>
		<description><![CDATA[The Britney Spears Problem is a fantastic article from American Scientist about real-time processing of streaming data to determine trends.  I love discovering clever new algorithms and the &#8220;majority algorithm&#8221; is simple, easy to implement but something you probably wouldn&#8217;t think up yourself if solving the same problem.  If you&#8217;ve ever wondered how [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.americanscientist.org/issues/id.3822,y.0,no.,content.true,page.2,css.print/issue.aspx">The Britney Spears Problem</a> is a fantastic article from American Scientist about real-time processing of streaming data to determine trends.  I love discovering clever new algorithms and the &#8220;majority algorithm&#8221; is simple, easy to implement but something you probably wouldn&#8217;t think up yourself if solving the same problem.  If you&#8217;ve ever wondered how Twitter&#8217;s trending feature is implemented, this is probably a good place to start.</p>
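<p>For the curious, here is a minimal sketch of the idea, assuming the &#8220;majority algorithm&#8221; in the article is the Boyer-Moore majority vote.  The method names are mine, not the article&#8217;s.</p>

```ruby
# One pass over the stream, keeping only a candidate and a counter.
def majority_candidate(stream)
  candidate = nil
  count = 0
  stream.each do |item|
    if count.zero?
      candidate = item   # adopt a new candidate
      count = 1
    elsif item == candidate
      count += 1         # another vote for the candidate
    else
      count -= 1         # a vote against cancels one vote for
    end
  end
  candidate
end

# The first pass only yields a candidate; a second pass confirms that it
# really appears in more than half of the items.
def majority(items)
  candidate = majority_candidate(items)
  items.count(candidate) > items.size / 2 ? candidate : nil
end
```

<p>The appeal is the memory footprint: one candidate and one integer, no matter how long the stream is.</p>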
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/05/05/stream-processing-and-trending-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Risk and Startups</title>
		<link>http://www.mikeperham.com/2010/04/20/risk-and-startups/</link>
		<comments>http://www.mikeperham.com/2010/04/20/risk-and-startups/#comments</comments>
		<pubDate>Tue, 20 Apr 2010 15:25:04 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Personal]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=524</guid>
		<description><![CDATA[I&#8217;ve worked at 7-8 startups in the last 12 years, learning along the way that I love the freedom and flexibility that a small company affords.  You pay a good price for that freedom though in the form of risk: your job will be measured in terms of months and years, not decades.  [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve worked at 7-8 startups in the last 12 years, learning along the way that I love the freedom and flexibility that a small company affords.  You pay a good price for that freedom though in the form of risk: your job will be measured in terms of months and years, not decades.  My parents spent decades at their jobs working for large corporations; that kind of job security does not exist at a startup.</p>
<p><strong>An Analogy</strong></p>
<p>Risk is something that you either purposefully manage or you roll the dice with your life, sometimes literally.  I ride/race a motorcycle as my main hobby away from the computer.  Riding a moto is a risky activity and I do several things to manage that risk:</p>
<ul>
<li>Always wear a helmet, gloves and jacket</li>
<li>Ride a relatively low power bike</li>
<li>Take every MSF training course available</li>
<li>Refuse to ride in groups</li>
</ul>
<p>Do these guarantee I won&#8217;t crash?  Certainly not but I hope they will lessen the odds and minimize any damage if I do.</p>
<p><strong>Managing Risks</strong></p>
<p>As engineers, what are the risks of working at a startup?  The main risk is the company failing and going bankrupt.  A second, related risk is being laid off.  In both cases, your job and paycheck are at risk.  How do we manage those risks?  I have three tactics to manage the risk of working at a startup.</p>
<p>1) Make it as easy as possible to find a job</p>
<p>You could make yourself essential to the operation of the company; that helps with layoffs but does not help with bankruptcy and has the drawback that you will start from square one at the next startup.  My strategy has been to make myself a valuable developer, independent of any one startup, by working on open source software and maintaining a high quality blog that evangelizes myself and my work.  This is a last resort strategy: if anything happens to make my job disappear, ideally I can interview and find another job within days.  This recently proved successful when I announced my upcoming move to San Francisco and had 20-30 inquiries over the next few days.</p>
<p>2) Exercise common sense and your math skills</p>
<p>Do you know your startup&#8217;s monthly burn rate, cash reserves and revenue?  I&#8217;d bet that the majority of people at startups do not.  Get those numbers and figure out how many months the company has before it has no money.  Just a few months left?  Would it be difficult to raise more money?  Are you part of a &#8220;layer of fat&#8221; that could be laid off to cut the burn rate?  Is revenue rising or dropping?  Are you getting more customers?  These are questions you should be asking yourself every month to evaluate the health of your startup.  At some point you will need to leave on your own terms, before you are forced out by bankruptcy or layoffs.  I left FiveRuns last year when these questions made bankruptcy look unavoidable.  Leaving on my own terms meant I could take a few weeks to interview around to find the right job.</p>
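<p>The arithmetic is simple enough to sketch.  The method name and the figures in the usage line are hypothetical; plug in your own company&#8217;s numbers.</p>

```ruby
# Months of runway: cash on hand divided by net monthly burn
# (monthly spending minus monthly revenue).
def runway_months(cash, monthly_burn, monthly_revenue)
  net_burn = monthly_burn - monthly_revenue
  return Float::INFINITY if net_burn <= 0 # revenue already covers costs
  cash.to_f / net_burn
end

# Hypothetical figures: $600k in the bank, $150k spent and $50k earned per month.
puts runway_months(600_000, 150_000, 50_000)
```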
<p>3) Stick with Success</p>
<p>They say failure is the best way to learn but in my experience nothing breeds success more than previous success.  I try to stick with entrepreneurs that have past successes.  As developers, we want to work with smart developers, yes, but you also want to work with great business guys who have a network of contacts, know how to raise funding and can navigate the company to a successful exit.  I can interview a person to learn if they are a good developer but I can&#8217;t interview a CEO to learn if they are a good CEO.  I have only two metrics:</p>
<ul>
<li>do they have a reasonable business plan with a way to make money?</li>
<li>have they had previous startup successes?</li>
</ul>
<p>The &#8220;halo&#8221; effect is very real.  VCs are more willing to talk to someone who has previous success and knows the funding process.  People are more willing to work at a company run by someone with previous success.  Press is easier to get and customers are easier to talk to if they already know the company as the latest effort by a successful entrepreneur.</p>
<p>4) Educate yo&#8217;self (Extra bonus tip!)</p>
<p>You may know computer science but how much do you know about management or finance?  Read a management book.  I recommend anything by Peter Drucker &#8211; he literally invented the science of management and his writing really opened my eyes.  Read a book on business finance.  You&#8217;re not trying to become an expert in these fields but when you learn a little bit about the other major roles in a startup, you&#8217;ll be able to evaluate your startup&#8217;s current situation more accurately.</p>
<p>Even with all this, you will fail often.  I&#8217;ve been part of two moderately successful exits and several bankruptcies.  I&#8217;ve only been caught flat-footed once and tried to learn as much as I could from that experience.  No matter what happens the startup experience is rewarding but with a little foresight you can minimize the inevitable risk to yourself and your livelihood.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/04/20/risk-and-startups/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Using ActiveRecord with EventMachine</title>
		<link>http://www.mikeperham.com/2010/03/30/using-activerecord-with-eventmachine/</link>
		<comments>http://www.mikeperham.com/2010/03/30/using-activerecord-with-eventmachine/#comments</comments>
		<pubDate>Tue, 30 Mar 2010 05:25:14 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Rails]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[eventmachine]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=494</guid>
		<description><![CDATA[Given all my work with Fibers and EventMachine over the last three months, it should come as no surprise that I&#8217;ve been working on infrastructure based on Fibers and EventMachine to get maximum scalability without the callback style of code which I dislike for many reasons.  Watch my talk on scaling with EventMachine if [...]]]></description>
			<content:encoded><![CDATA[<p>Given all my work with Fibers and EventMachine over the last three months, it should come as no surprise that I&#8217;ve been working on infrastructure based on Fibers and EventMachine to get maximum scalability without the callback style of code which I dislike for many reasons.  <a href="/2010/01/27/scalable-ruby-processing-with-eventmachine/">Watch my talk on scaling with EventMachine</a> if you need more background on the problem.</p>
<p>Now that I have RabbitMQ, Cassandra, Solr and the Amazon AWS services evented, the only holdup was ActiveRecord.  Some people may advocate using another ORM layer but when you have 2-3 other Rails apps, all sharing 100+ models, you can&#8217;t afford to maintain two separate ORM layers.  Plus, frankly I like the Rails stack: it works pretty well, is thoroughly documented and every Ruby developer is familiar with it.</p>
<p>So what do we need to do to get AR working event-style?  At a high level, there are two things required:</p>
<ul>
<li>The database driver itself must be modified to send SQL asynchronously.  The postgresql driver, for instance, calls the <code>exec(sql)</code> method for all traffic to the database.  So we just need to provide an exec method which uses Fibers under the covers to work asynchronously.</li>
<li>AR&#8217;s connection pooling needs to be Fiber-safe.  Out of the box, it is Thread-safe.  Since we are using an execution model based on a single Thread with multiple Fibers, all the Fibers would try to use the same connection, with disastrous consequences.</li>
</ul>
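<p>The first requirement can be sketched without EventMachine at all.  In this minimal illustration, <code>FakeAsyncDriver</code> is a hypothetical stand-in for an evented driver and <code>FiberedConnection</code> is not the em_postgresql code; the point is only how <code>exec</code> can look synchronous to the caller while the Fiber is parked until the result arrives.</p>

```ruby
require 'fiber'

# Hypothetical stand-in for an evented driver: it queues each query with a
# callback and fires the callbacks later, as a reactor loop would.
class FakeAsyncDriver
  def initialize
    @pending = []
  end

  def send_query(sql, &callback)
    @pending << [sql, callback]
  end

  def run
    until @pending.empty?
      sql, callback = @pending.shift
      callback.call("result of #{sql}")
    end
  end
end

# exec() looks blocking to the caller, but it parks the current Fiber and
# is resumed by the driver callback once the result is ready.
class FiberedConnection
  def initialize(driver)
    @driver = driver
  end

  def exec(sql)
    fiber = Fiber.current
    @driver.send_query(sql) { |result| fiber.resume(result) }
    Fiber.yield
  end
end

driver = FakeAsyncDriver.new
conn = FiberedConnection.new(driver)
log = []
fibers = %w[a b].map do |name|
  Fiber.new do
    log << "#{name} start"
    log << "#{name} got #{conn.exec("SELECT '#{name}'")}"
  end
end
fibers.each(&:resume) # both queries are issued before either one completes
driver.run            # the "reactor" delivers results and resumes each Fiber
puts log
```

<p>Both Fibers start before either gets a result, which is the concurrency win, and the calling code never sees a callback.</p>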
<p>These are the things that em_postgresql does.</p>
<ul>
<li><a href="http://github.com/mperham/em_postgresql/blob/master/lib/postgres_connection.rb">postgres_connection</a> is a basic, EM-aware Postgres driver.  It provides the Fibered <code>exec()</code> method which makes the whole thing asynchronous.</li>
<li><a href="http://github.com/mperham/em_postgresql/blob/master/lib/active_record/connection_adapters/em_postgresql_adapter.rb">em_postgresql_adapter.rb</a> wraps postgres_connection to make it a proper ActiveRecord driver.</li>
<li><a href="http://github.com/mperham/em_postgresql/blob/master/lib/active_record/patches.rb">patches.rb</a> overrides a bunch of AR&#8217;s internal connection pooling to make it Fiber-friendly.</li>
</ul>
<p>Unfortunately the latter makes one hack necessary &#8211; we have to have a list of current Fibers to release any lingering connections associated with those Fibers.  The Threaded version can use <code>Thread.list</code> but Ruby does not provide an equivalent method for Fibers.  Instead I require the application to register a FiberPool with AR to clear stale connections.</p>
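<p>Here is a rough sketch of what Fiber-friendly pooling means, under the assumption that connections are keyed by <code>Fiber.current</code> instead of <code>Thread.current</code>.  <code>FiberKeyedPool</code> and its methods are hypothetical names, not the code in patches.rb, and the strings stand in for real connections.</p>

```ruby
require 'fiber'

class FiberKeyedPool
  def initialize(size)
    @available = Array.new(size) { |i| "conn-#{i}" } # placeholder connections
    @reserved = {}                                   # Fiber => connection
  end

  # Each Fiber checks out its own connection and keeps it for its lifetime.
  def connection
    @reserved[Fiber.current] ||= (@available.pop || raise("pool exhausted"))
  end

  # Ruby has no Fiber.list, so the caller passes in the Fibers it knows about.
  # Connections held by dead or unknown Fibers go back into the pool.
  def release_stale(known_fibers)
    @reserved.keys.each do |fiber|
      next if known_fibers.include?(fiber) && fiber.alive?
      @available.push(@reserved.delete(fiber))
    end
  end

  def reserved_count
    @reserved.size
  end
end
```

<p>The <code>known_fibers</code> argument plays the role of the registered FiberPool: without it the pool has no way to discover which Fibers have died.</p>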
<p>So what does it all mean?  Well, here&#8217;s <a href="http://github.com/mperham/em_postgresql/blob/master/examples/app.rb">a Sinatra application</a> that uses plain old ActiveRecord and <strong>is completely asynchronous</strong>!  Try <code>ab -n 100 -c 20 http://localhost:9292/test</code> to hit the app with 20 concurrent connections; it will process them all in parallel, without any painful threading issues (autoloading, misbehaving extensions, etc).  Awesome!</p>
<p>You can probably guess what&#8217;s next.  Coming soon: the whole Rails stack, running asynchronously&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/03/30/using-activerecord-with-eventmachine/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Cassandra Internals &#8211; Tricks!</title>
		<link>http://www.mikeperham.com/2010/03/20/cassandra-internals-tricks/</link>
		<comments>http://www.mikeperham.com/2010/03/20/cassandra-internals-tricks/#comments</comments>
		<pubDate>Sat, 20 Mar 2010 16:59:16 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[cassandra]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=478</guid>
		<description><![CDATA[In my previous posts, I covered how Cassandra reads and writes data.  In this post, I want to explain some of the trickery that Cassandra uses to provide a scalable distributed system.
Gossip
Cassandra is a cluster of individual nodes &#8211; there&#8217;s no &#8220;master&#8221; node or single point of failure &#8211; so each node must actively [...]]]></description>
			<content:encoded><![CDATA[<p>In my previous posts, I covered how Cassandra <a href="/2010/03/17/cassandra-internals-reading/">reads</a> and <a href="/2010/03/13/cassandra-internals-writing/">writes</a> data.  In this post, I want to explain some of the trickery that Cassandra uses to provide a scalable distributed system.</p>
<p><strong>Gossip</strong></p>
<p>Cassandra is a cluster of individual nodes &#8211; there&#8217;s no &#8220;master&#8221; node or single point of failure &#8211; so each node must actively verify the state of the other cluster members.  They do this with a mechanism known as <a href="http://wiki.apache.org/cassandra/ArchitectureGossip">gossip</a>.  Each node &#8216;gossips&#8217; to 1-3 other nodes every second about the state of each node in the cluster.  The gossip data is versioned so that any change for a node will quickly propagate throughout the entire cluster.  In this way, every node will know the current state of every other node: whether it is bootstrapping, running normally, etc. </p>
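<p>The versioning idea can be sketched in a few lines.  This is an illustration only, assuming each node keeps a map of node name to a versioned status; the data layout and method name are mine, not Cassandra&#8217;s.</p>

```ruby
# state: { node_name => { version: Integer, status: Symbol } }
# Merge a peer's view into ours, keeping whichever entry has the higher version.
def merge_gossip(local, remote)
  local.merge(remote) do |_node, mine, theirs|
    theirs[:version] > mine[:version] ? theirs : mine
  end
end

mine   = { "n1" => { version: 3, status: :normal },
           "n2" => { version: 1, status: :bootstrapping } }
theirs = { "n2" => { version: 2, status: :normal },
           "n3" => { version: 1, status: :normal } }
merged = merge_gossip(mine, theirs)
puts merged.inspect
```

<p>Because every exchange only ever moves an entry forward to a newer version, repeated rounds of gossip converge on the same view everywhere.</p>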
<p><strong>Hinted Handoff</strong></p>
<p>In <a href="/2010/03/13/cassandra-internals-writing/">writing</a>, I mentioned that Cassandra stores a copy of the data on N nodes.  The client can select a consistency level for a write based on the importance of the data &#8211; for example, ConsistencyLevel.QUORUM means that a majority of those N nodes must reply success for the write to be considered successful.</p>
<p>What happens if one of those nodes goes down?  How do those writes propagate to that node later?  Cassandra uses a technique known as <a href="http://wiki.apache.org/cassandra/HintedHandoff">hinted handoff</a>, where the data is written to another random node X to be stored and replayed for node Y when it comes back online (remember that gossip will quickly tell X when Y comes online).  Hinted handoff ensures that node Y will quickly match the rest of the cluster.  Note that read repair would still eventually &#8220;fix&#8221; the old data if hinted handoff did not work for some reason but only once the client asked for that data.</p>
<p>Hinted writes are not readable (since node X is not officially one of the N copies) so they don&#8217;t count toward write consistency.  If Cassandra is configured for three copies and two of those nodes are down, it would be impossible to fulfill a ConsistencyLevel.QUORUM write.</p>
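<p>The quorum arithmetic behind that example is worth spelling out.  A minimal sketch with hypothetical helper names:</p>

```ruby
# A quorum is a strict majority of the N replicas.
def quorum(replicas)
  replicas / 2 + 1
end

# Hinted writes do not count, so only live replicas can satisfy the quorum.
def quorum_write_possible?(replicas, live_replicas)
  live_replicas >= quorum(replicas)
end

puts quorum(3)                       # replicas needed out of three
puts quorum_write_possible?(3, 1)    # two of three replicas are down
```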
<p><strong>Anti-Entropy</strong></p>
<p>The final trick up Cassandra&#8217;s proverbial sleeve is <a href="http://wiki.apache.org/cassandra/ArchitectureAntiEntropy">anti-entropy</a>.  AE explicitly ensures that the nodes in the cluster agree on the current data.  If read repair or hinted handoff don&#8217;t work due to some set of circumstances, the AE service will ensure that nodes reach eventual consistency.  The AE service runs during &#8220;major compactions&#8221; (the equivalent of rebuilding a table in an RDBMS) so it is a relatively heavyweight process that runs infrequently.  AE uses a <a href="http://en.wikipedia.org/wiki/Hash_tree">Merkle Tree</a> to determine where within the tree of column family data the nodes disagree and then repairs each of those branches.</p>
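<p>To show why a Merkle tree helps, here is a small sketch that finds which rows differ between two replicas by comparing hashes from the root down.  It is not Cassandra&#8217;s implementation; it assumes both replicas hold the same number of rows in the same order, and the method names are mine.</p>

```ruby
require 'digest'

# Build the tree bottom-up.  Returns an array of levels, leaves first, root last.
def merkle_levels(values)
  level = values.map { |v| Digest::SHA256.hexdigest(v) }
  levels = [level]
  while level.size > 1
    level = level.each_slice(2).map { |pair| Digest::SHA256.hexdigest(pair.join) }
    levels << level
  end
  levels
end

# Walk down from the root, descending only into subtrees whose hashes differ.
# Returns the indices of the leaves that disagree.
def differing_leaves(a_levels, b_levels, depth = a_levels.size - 1, index = 0)
  return [] if a_levels[depth][index] == b_levels[depth][index]
  return [index] if depth.zero?
  children = [2 * index, 2 * index + 1].select { |i| i < a_levels[depth - 1].size }
  children.flat_map { |i| differing_leaves(a_levels, b_levels, depth - 1, i) }
end

replica_a = merkle_levels(%w[r1 r2 r3 r4])
replica_b = merkle_levels(%w[r1 r2 rX r4])
puts differing_leaves(replica_a, replica_b).inspect
```

<p>Matching subtrees are skipped entirely, so two nodes that mostly agree only need to exchange a handful of hashes to locate the rows that need repair.</p>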
<p>This is the last post in my series on Cassandra.  I hope you enjoyed them!  Please leave a comment if you have questions or if I&#8217;ve made an error above.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/03/20/cassandra-internals-tricks/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Cassandra Internals &#8211; Reading</title>
		<link>http://www.mikeperham.com/2010/03/17/cassandra-internals-reading/</link>
		<comments>http://www.mikeperham.com/2010/03/17/cassandra-internals-reading/#comments</comments>
		<pubDate>Thu, 18 Mar 2010 02:06:59 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[cassandra]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=444</guid>
		<description><![CDATA[
In my previous post, I discussed how writes happen in Cassandra and why they are so fast.  Now we&#8217;ll look at reads and learn why they are slow.
Reading and Consistency
One of the fundamental theorems in distributed systems is Brewer&#8217;s CAP theorem: distributed systems can have Consistency, Availability and Partition-tolerance properties but can only guarantee [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://incubator.apache.org/cassandra/media/img/cassandra_logo.png" alt="Cassandra logo" /></p>
<p>In my <a href="/2010/03/13/cassandra-internals-writing/">previous post</a>, I discussed how writes happen in Cassandra and why they are so fast.  Now we&#8217;ll look at reads and learn why they are slow.</p>
<p><strong>Reading and Consistency</strong></p>
<p>One of the fundamental theorems in distributed systems is <a href="http://en.wikipedia.org/wiki/CAP_theorem">Brewer&#8217;s CAP theorem</a>: distributed systems can have Consistency, Availability and Partition-tolerance properties but can only guarantee two.  In the case of Cassandra, they guarantee AP and loosen consistency to what is known as <em>eventual consistency</em>.  Consider a write and a read that are very close together in time.  Let&#8217;s say you have a key &#8220;A&#8221; with a value of &#8220;123&#8221; in your cluster.  Now you update &#8220;A&#8221; to be &#8220;456&#8221;.  The write is sent to N different nodes, each of which takes some time to write the value.  Now you ask for a read of &#8220;A&#8221;.  Some of those nodes might still have &#8220;123&#8221; for the value while others have &#8220;456&#8221;.  They will all eventually return &#8220;456&#8221; but it is not guaranteed when (in practice, usually just a few milliseconds).  You&#8217;ll see why this is important in a second.</p>
<p>Reads are similar to writes in that your client makes a read request to a single random node in the Cassandra cluster (aka the Storage Proxy).  The proxy determines the nodes in the ring (based on the replica placement strategy) that hold the N copies of the data to be read and makes a read request to each node.  Because of the eventual consistency limitations, Cassandra allows the client to select the strength of the read consistency:</p>
<ul>
<li>Single read &#8211; the proxy returns the first response it gets.  Can easily return stale data.</li>
<li>Quorum read &#8211; the proxy <strong>waits for a majority to respond with the same value</strong>. This makes it much more difficult to get stale data (nodes would have to go down) but slower.</li>
</ul>
<p>In the background, the proxy also performs <em>read repair</em> on any inconsistent responses.  The proxy will send a write request to any nodes returning older values to ensure that the nodes return the latest value in the future.  There are a number of edge cases here that I&#8217;m not clear how Cassandra deals with:</p>
<ul>
<li>What if an even number of nodes reply, with half returning a value of &#8220;X&#8221; and the other half returning a value of &#8220;Y&#8221;?  Since each column value is timestamped, presumably it will use the timestamp as a tie breaker.</li>
<li>What if two nodes return &#8220;X&#8221; with an old timestamp and one node returns &#8220;Y&#8221; with a newer timestamp?  Does quorum override the clock?</li>
<li>What if the clocks on the cluster nodes are out of sync?</li>
</ul>
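<p>One plausible way to resolve these cases is to let the newest timestamp win and repair every replica that returned something older.  The sketch below assumes exactly that rule; it is my guess at the behavior, not something confirmed above, and the <code>Response</code> struct and method name are hypothetical.</p>

```ruby
Response = Struct.new(:node, :value, :timestamp)

# Assumption: the response with the newest timestamp wins, regardless of how
# many replicas returned an older value.  Returns the winning value and the
# list of nodes that need a read-repair write.
def resolve_read(responses)
  latest = responses.max_by(&:timestamp)
  stale = responses.select { |r| r.timestamp < latest.timestamp }.map(&:node)
  [latest.value, stale]
end

responses = [
  Response.new("n1", "X", 10),
  Response.new("n2", "X", 10),
  Response.new("n3", "Y", 20)
]
value, to_repair = resolve_read(responses)
puts value
puts to_repair.inspect
```

<p>Note that this rule depends entirely on the timestamps being trustworthy, which is exactly why the clock-sync question matters.</p>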
<p><strong>Scanning ranges</strong></p>
<p>Cassandra works fine as a key/value store: you give it the key and it will return the value for that key.  But this is often not enough to answer critical questions: what if I want to read all users whose last name starts with Z?  Or read all orders placed between 2010-02-01 and 2010-03-01?  To do this, Cassandra must know how to determine the nodes which hold the corresponding values.  This is done with a <em>partitioner</em>.  By default, Cassandra uses a <em>RandomPartitioner</em> which is guaranteed to spread the load evenly across your cluster but cannot be used for range scanning.  Instead a ColumnFamily can be configured to use an <em>OrderPreservingPartitioner</em>, which knows how to map a range of keys directly onto one or more nodes.  In essence, it knows which node(s) hold the data for your alphabetically-challenged users and for February&#8217;s orders.</p>
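<p>The difference between the two partitioners is easier to see side by side.  This is a deliberately simplified sketch: the node names and key ranges are hypothetical, and hashing modulo the node count stands in for Cassandra&#8217;s actual token ring.</p>

```ruby
require 'digest'

NODES = %w[node-a node-b node-c].freeze

# Random-style placement: hash the key, so neighboring keys scatter across nodes.
def random_partition(key)
  NODES[Digest::MD5.hexdigest(key).to_i(16) % NODES.size]
end

# Order-preserving placement: each node owns a contiguous range of keys.
RANGES = { "node-a" => "a".."h", "node-b" => "i".."q", "node-c" => "r".."z" }.freeze

def ordered_partition(key)
  RANGES.find { |_node, range| range.cover?(key[0].downcase) }&.first
end

# A range scan only has to visit the nodes whose key range overlaps the query.
def nodes_for_range(first, last)
  RANGES.select { |_node, range| range.first <= last && first <= range.last }.keys
end

puts ordered_partition("Zimmer")
puts nodes_for_range("g", "j").inspect
```

<p>With hashed placement there is no equivalent of <code>nodes_for_range</code>: adjacent keys can land anywhere, so a range query would have to ask every node.</p>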
<p><strong>Reading on an Individual Node</strong></p>
<p>So all of that distributed system nonsense aside, what does each node do when performing a read?  Recall that Cassandra has two levels of storage: Memtable and SSTable.  The Memtable read is relatively painless &#8211; we are operating in memory so the data is relatively small and iterating through the contents is as fast as possible.  To scan the SSTable, Cassandra uses a row-level column index and bloom filter to find the necessary blocks on disk, deserializes them and determines the actual data to return.  There&#8217;s a lot of disk IO here which ultimately makes the read latency higher than a similar DBMS.  Cassandra does provide some row caching which solves much of that latency.</p>
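<p>If you haven&#8217;t met a bloom filter before, here is a toy version.  It is a sketch of the general technique, not Cassandra&#8217;s implementation; the class name, sizes and the use of salted MD5 as the hash functions are all assumptions made for brevity.</p>

```ruby
require 'digest'

class BloomFilter
  def initialize(bits = 1024, hashes = 3)
    @bits = bits
    @hashes = hashes
    @field = Array.new(bits, false)
  end

  def add(key)
    positions(key).each { |i| @field[i] = true }
  end

  # false means "definitely not here"; true means "possibly here".
  def maybe_include?(key)
    positions(key).all? { |i| @field[i] }
  end

  private

  # Derive several bit positions per key by salting one hash function.
  def positions(key)
    (0...@hashes).map { |seed| Digest::MD5.hexdigest("#{seed}:#{key}").to_i(16) % @bits }
  end
end

filter = BloomFilter.new
filter.add("row-1")
puts filter.maybe_include?("row-1")
```

<p>The useful property for a read path is the &#8220;definitely not here&#8221; answer: a negative result lets the node skip the disk read for that file altogether.</p>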
<p>That&#8217;s a whirlwind tour of Cassandra&#8217;s read path.  Take a look at the <a href="http://wiki.apache.org/cassandra/StorageConfiguration">StorageConfiguration</a> wiki page for much more content on this subject.  Next up, I&#8217;ll discuss some of the various &#8220;tricks&#8221; Cassandra uses to solve the myriad of edge cases inherent in distributed systems.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/03/17/cassandra-internals-reading/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Cassandra Internals &#8211; Writing</title>
		<link>http://www.mikeperham.com/2010/03/13/cassandra-internals-writing/</link>
		<comments>http://www.mikeperham.com/2010/03/13/cassandra-internals-writing/#comments</comments>
		<pubDate>Sat, 13 Mar 2010 21:24:55 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[cassandra]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=439</guid>
		<description><![CDATA[
We&#8217;ve started using Cassandra as our next-generation data storage engine at OneSpot (replacing a very large Postgresql machine with a cluster of EC2 machines) and so I&#8217;ve been using it for the last few weeks.  As I&#8217;m an infrastructure nerd and a big believer in understanding the various layers in the stack, I&#8217;ve been [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://incubator.apache.org/cassandra/media/img/cassandra_logo.png" alt="Cassandra logo" /></p>
<p>We&#8217;ve started using Cassandra as our next-generation data storage engine at <a href="http://www.onespot.com">OneSpot</a> (replacing a very large PostgreSQL machine with a cluster of EC2 machines) and so I&#8217;ve been using it for the last few weeks.  As I&#8217;m an infrastructure nerd and a big believer in understanding the various layers in the stack, I&#8217;ve been reading up a bit on how Cassandra works and wanted to write a summary for others to benefit from.  Since Cassandra is known to have very good write performance, I thought I would cover that first.</p>
<p>The first thing to understand is that Cassandra wants to run on many machines.  From what I&#8217;ve heard, Twitter uses a cluster of 45 machines.  It doesn&#8217;t make a lot of sense to run Cassandra on a single machine because you lose the benefits of a system with no single point of failure.</p>
<p>Your client sends a write request to a single, random Cassandra node.  This node acts as a proxy and writes the data to the cluster.  The cluster is organized as a &#8220;ring&#8221; of nodes and writes are replicated to N nodes using a <em>replication placement strategy</em>.  With the RackAwareStrategy, Cassandra determines the &#8220;distance&#8221; from the current node for reliability and availability purposes, where &#8220;distance&#8221; is broken into three buckets: same rack as the current node, same data center as the current node, or a different data center.  You configure Cassandra to write data to N nodes for redundancy and it will write the first copy to the primary node for that data, the second copy to the next node in the ring <em>in another data center</em>, and the rest of the copies to machines in the same data center as the proxy.  This ensures that a single failure does not take down the entire cluster and that the cluster remains available even if an entire data center goes offline.</p>
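<p>To make the &#8220;ring&#8221; idea concrete, here&#8217;s a minimal sketch in Ruby of the <em>simplest</em> possible placement: hash the key to a token, find the first node at or after that token, and take the next N nodes clockwise.  It deliberately ignores racks and data centers, so it is not the RackAwareStrategy described above.  The <code>Node</code> struct, the node names, the evenly spaced tokens and the MD5-based <code>token_for</code> helper are all assumptions made up for illustration, not Cassandra&#8217;s actual code.</p>

```ruby
require "digest/md5"

# Hypothetical ring member: a name plus the token it owns on the ring.
Node = Struct.new(:name, :token)

# Assumed token space for this sketch.
RING_SIZE = 2**127

# Map a row key onto the ring (illustrative hashing only).
def token_for(key)
  Digest::MD5.hexdigest(key).to_i(16) % RING_SIZE
end

# Pick n replicas: start at the first node whose token is at or past the
# key's token (wrapping to the start of the ring), then walk clockwise.
def replicas_for(key, ring, n)
  sorted = ring.sort_by { |node| node.token }
  key_token = token_for(key)
  start = sorted.index { |node| node.token >= key_token } || 0
  sorted.rotate(start).first(n)
end

# Made-up four-node ring with evenly spaced tokens.
RING = [
  Node.new("node-a", 0),
  Node.new("node-b", RING_SIZE / 4),
  Node.new("node-c", RING_SIZE / 2),
  Node.new("node-d", 3 * RING_SIZE / 4)
]

puts replicas_for("user:42", RING, 3).map { |node| node.name }.join(", ")
```

<p>A rack-aware strategy would replace the plain clockwise walk with one that skips ahead until it finds a node in another data center for the second copy.</p>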
<p>So the write request goes from your client to a single random node, which sends the write to N different nodes according to the replication placement strategy.  There are many edge cases here (nodes are down, nodes being added to the cluster, etc.) which I won&#8217;t go into, but the proxy node waits for the N successes and then returns success to the client.</p>
<p>Each of those N nodes gets that write request in the form of a &#8220;RowMutation&#8221; message.  The node performs two actions for this message:</p>
<ul>
<li>Append the mutation to the commit log for transactional purposes</li>
<li>Update an in-memory Memtable structure with the change</li>
</ul>
<p>And it&#8217;s done.  This is why Cassandra is so fast for writes: the slowest part is appending to a file.  Unlike a traditional relational database, Cassandra does not update data in-place on disk, nor update indices, so there are no intensive <em>synchronous</em> disk operations to block the write.</p>
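<p>The two steps above can be sketched in a few lines of Ruby.  This is a toy model under stated assumptions, not Cassandra&#8217;s implementation: the <code>TinyNode</code> class, the tab-separated log format and the sample row are all hypothetical.  The point is only that the write does one sequential append and one in-memory update, with nothing rewritten in place on disk.</p>

```ruby
require "tempfile"

# Hypothetical stand-in for a single node's write path.
class TinyNode
  attr_reader :memtable

  def initialize(commit_log_path)
    @commit_log = File.open(commit_log_path, "a")  # append-only file
    @memtable = {}                                 # row key => { column => value }
  end

  # Apply one mutation: append to the commit log first, then update memory.
  def apply(row_key, column, value)
    @commit_log.puts([row_key, column, value].join("\t"))
    @commit_log.flush
    (@memtable[row_key] ||= {})[column] = value
  end

  def close
    @commit_log.close
  end
end

log = Tempfile.new("commitlog")
node = TinyNode.new(log.path)
node.apply("user:1", "name", "mike")
node.apply("user:1", "lang", "ruby")
node.close

puts File.read(log.path)      # one line per mutation, in arrival order
puts node.memtable.inspect    # latest value per column, held in memory
```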
<p>There are several asynchronous operations which occur regularly:</p>
<ul>
<li>A &#8220;full&#8221; Memtable structure is written to a disk-based structure called an SSTable so that too much data doesn&#8217;t live only in memory.</li>
<li>The set of temporary SSTables which exist for a given ColumnFamily is merged into one large SSTable.  At this point the temporary SSTables are obsolete and can be garbage collected at some point in the future.</li>
</ul>
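<p>Here&#8217;s a rough sketch of those two background operations, assuming a very simplified model: an &#8220;SSTable&#8221; is just a frozen, sorted array of key/value pairs held in memory, and a Memtable is &#8220;full&#8221; after two entries.  The <code>TinyStore</code> class and the <code>MEMTABLE_LIMIT</code> threshold are made up for the example, and unlike Cassandra everything here runs synchronously.</p>

```ruby
# Assumed threshold, kept tiny so the example flushes quickly.
MEMTABLE_LIMIT = 2

# Hypothetical store: one memtable plus a list of immutable "SSTables".
class TinyStore
  attr_reader :memtable, :sstables

  def initialize
    @memtable = {}
    @sstables = []   # oldest first
  end

  def write(key, value)
    @memtable[key] = value
    flush if @memtable.size >= MEMTABLE_LIMIT
  end

  # Write the full memtable out as a sorted, frozen table and start a new one.
  def flush
    @sstables.push(@memtable.sort.freeze)
    @memtable = {}
  end

  # Merge every SSTable into one; newer tables win when a key repeats.
  def compact
    merged = {}
    @sstables.each do |table|
      table.each { |key, value| merged[key] = value }
    end
    @sstables = [merged.sort.freeze]
  end
end

store = TinyStore.new
store.write("b", 1)
store.write("a", 2)   # memtable is full: first SSTable is written
store.write("a", 3)
store.write("c", 4)   # second SSTable is written
puts store.sstables.size
store.compact
p store.sstables
```

<p>After compaction the older value for <code>"a"</code> is gone, which is why the old tables can be thrown away.</p>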
<p>There are lots of edge cases and complications beyond what I&#8217;ve talked about so far.  I highly recommend reading the Cassandra wiki pages for <a href="http://wiki.apache.org/cassandra/ArchitectureInternals">ArchitectureInternals</a> and <a href="http://wiki.apache.org/cassandra/Operations">Operations</a> at the very least.  Distributed systems are hard and Cassandra is no different.</p>
<p>Please leave a comment if you have a correction or want to add detail &#8211; I&#8217;m not a Cassandra developer so I&#8217;m sure there&#8217;s a mistake or two hidden up there.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/03/13/cassandra-internals-writing/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Changelog vs Commitlog</title>
		<link>http://www.mikeperham.com/2010/02/18/changelog-vs-commitlog/</link>
		<comments>http://www.mikeperham.com/2010/02/18/changelog-vs-commitlog/#comments</comments>
		<pubDate>Fri, 19 Feb 2010 03:19:42 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=419</guid>
		<description><![CDATA[One of the things I really like about some software projects is when they provide an actual changelog or release notes.  RabbitMQ released 1.7.2 the other day and I asked the developers if they could link to a changelog.  They pointed me to this page.  Unfortunately this is not exactly what I [...]]]></description>
			<content:encoded><![CDATA[<p>One of the things I really like about some software projects is when they provide an actual changelog or release notes.  RabbitMQ released 1.7.2 the other day and I asked the developers if they could link to a changelog.  They pointed me to <a href="http://hg.rabbitmq.com/rabbitmq-server/log">this page</a>.  Unfortunately this is not exactly what I had in mind.  To me, a changelog is a brief overview of the changes in a version <em>that is digestible by the end user</em>.  The key factor is that a changelog is not machine-generated but written by a project developer for the project&#8217;s users.  The RabbitMQ changelog is far too verbose (one entry per commit, along with merge noise).</p>
<p>Here are a few examples of good changelogs: <a href="http://github.com/mperham/memcache-client/blob/master/History.rdoc">memcache-client</a>, <a href="http://java.sun.com/javase/6/webnotes/6u18.html">Java</a>, <a href="http://github.com/tenderlove/nokogiri/blob/master/CHANGELOG.rdoc">Nokogiri</a>, <a href="http://github.com/defunkt/resque/blob/master/HISTORY.md">Resque</a>, <a href="http://code.google.com/p/redis/wiki/Redis_1_2_0_Changelog">Redis</a>.</p>
<p>Personally I consider a changelog one of the best indicators of a well-run OSS project. If you run an OSS project, please consider supplying release notes or a changelog so that other developers can follow your project with ease!</p>
<p>Update: looks like I just missed the changelog for RabbitMQ.  Alexis was kind enough to point me to the release notes in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/02/18/changelog-vs-commitlog/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Varnish on 32-bit systems</title>
		<link>http://www.mikeperham.com/2010/01/18/varnish-on-32-bit-systems/</link>
		<comments>http://www.mikeperham.com/2010/01/18/varnish-on-32-bit-systems/#comments</comments>
		<pubDate>Mon, 18 Jan 2010 15:37:16 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=396</guid>
		<description><![CDATA[We run three small EC2 instances for content caching purposes at OneSpot.  These systems are 32-bit machines with 1.7GB of RAM.  Originally we figured even on a small system Varnish could flood a 100Mb line so we wouldn&#8217;t need a more expensive, large EC2 instance.  This blog post explains why this turned [...]]]></description>
			<content:encoded><![CDATA[<p>We run three small EC2 instances for content caching purposes at OneSpot.  These systems are 32-bit machines with 1.7GB of RAM.  Originally we figured even on a small system Varnish could flood a 100Mb line so we wouldn&#8217;t need a more expensive, large EC2 instance.  This blog post explains why this turned out to be a poor choice.</p>
<p>Executive summary: Varnish really, really wants to run on a 64-bit system.  Don&#8217;t run it on 32-bit systems if possible.</p>
<p>Varnish wants to memory map the entire cache.  This means the entire cache needs to be able to fit into virtual memory.  On a 64-bit system, virtual memory is effectively unlimited.  On a 32-bit system, processes usually have access to a maximum of 3GB of virtual memory.  Since you also need to allocate stack space and other standard process requirements, in practice people don&#8217;t recommend more than 2GB of cache space for Varnish on 32-bit systems.  Pretty small for a web content cache.  If you want Varnish to use an entire disk for a cache, it must run on a 64-bit system.</p>
<p>We had a few minutes of outage recently due to this architecture.  We read some <a href="http://kristian.blog.linpro.no/2009/05/25/common-varnish-issues/">Varnish tuning tips</a> and decided to modify our default configuration.  Specifically we raised the minimum thread count from 1 to 500.  Because, after all, &#8220;<em>idle threads are cheap</em>&#8221;.  But they are only cheap on 64-bit systems where allocating hundreds of MB for extra stack space is a no-brainer!  When we rolled out this change, the process ran out of memory and couldn&#8217;t allocate the extra threads.  Klaxons went off and I rolled back the changes.  Over the next few months, we&#8217;ll be upgrading our caches to 64-bit so that we don&#8217;t need to worry about sizing issues moving forward.</p>
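<p>A quick back-of-the-envelope calculation shows why idle threads stop being cheap on 32-bit.  The sketch below assumes each thread reserves 8MB of stack, which is a common Linux default; that number is an assumption for illustration, and the real figure depends on the system&#8217;s stack limit and Varnish&#8217;s own settings.  The 3GB address space figure comes from the paragraph above.</p>

```ruby
MB = 1024 * 1024
ADDRESS_SPACE = 3 * 1024 * MB   # roughly what a 32-bit process can address
STACK_PER_THREAD = 8 * MB       # ASSUMPTION: a common Linux default stack size

# Virtual memory reserved for thread stacks alone.
def stack_reservation(threads, stack_bytes = STACK_PER_THREAD)
  threads * stack_bytes
end

[1, 100, 500].each do |threads|
  needed = stack_reservation(threads)
  fits = ADDRESS_SPACE >= needed
  puts "#{threads} threads reserve #{needed / MB} MB of stack; fits in address space: #{fits}"
end
```

<p>Under that assumption, 500 threads would want 4000MB of stack, which is more than the whole 3GB address space before the memory-mapped cache gets a single byte.</p>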
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2010/01/18/varnish-on-32-bit-systems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Event-Driven Applications</title>
		<link>http://www.mikeperham.com/2009/12/01/event-driven-applications/</link>
		<comments>http://www.mikeperham.com/2009/12/01/event-driven-applications/#comments</comments>
		<pubDate>Wed, 02 Dec 2009 02:52:07 +0000</pubDate>
		<dc:creator>Mike Perham</dc:creator>
				<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.mikeperham.com/?p=386</guid>
		<description><![CDATA[Getting concurrency in Ruby is tough: Ruby 1.8 threads are green so they don&#8217;t execute concurrently.  Ruby 1.9 threads are native but they don&#8217;t execute concurrently due to the GIL (global interpreter lock) necessary to ensure thread-safety with native extensions.  Only JRuby provides a stable, concurrent Ruby VM today.  On top of [...]]]></description>
			<content:encoded><![CDATA[<p>Getting concurrency in Ruby is tough: Ruby 1.8 threads are green so they don&#8217;t execute in parallel.  Ruby 1.9 threads are native but they don&#8217;t execute in parallel either, due to the GIL (global interpreter lock) necessary to ensure thread-safety with native extensions.  Only JRuby provides a stable, concurrent Ruby VM today.  On top of that, writing thread-safe code is tough &#8211; code execution is non-deterministic and so everyone gets it wrong, the code is hard to test and bugs are painful to track down.</p>
<p>For these reasons, I would argue that IO-intensive applications need to use either an event-driven application model or a language designed for concurrency like <a href="http://github.com/richhickey/clojure">Clojure</a>.  Since I like to work with Ruby, the former is the route to follow.</p>
<p>This overview is important to understand because the main deployment pattern with Rails apps is to instantiate 5-10 Rails processes, which can each handle one request at a time.  If a request takes 5-10 seconds to process (maybe it is calling Amazon S3 or SimpleDB), that entire Rails process is stuck waiting for the data.  Even a multi-threaded Rails application is limited due to the GIL.  For this reason, people use a message queue to handle long-running tasks but often that just passes the buck: now the message queue processor is the one stuck for 5-10 seconds instead.  You don&#8217;t have a user waiting for a response but you are still limited in how fast you can process the queue based on the amount of memory you have and the number of daemon processes you can start.</p>
<p><img src="http://www.mikeperham.com/wp-content/uploads/2009/12/EventMachineLogo.png" alt="EventMachineLogo" title="EventMachineLogo" width="413" height="66" class="alignnone size-full wp-image-388" /><img src="http://www.mikeperham.com/wp-content/uploads/2009/12/neverblock.jpg" alt="neverblock" title="neverblock" width="218" height="67" class="alignnone size-full wp-image-389" /></p>
<p>This is where an event-driven model would help immensely.  The fundamental tools at your disposal are <a href="http://github.com/espace/neverblock/">NeverBlock</a> and <a href="http://github.com/eventmachine/eventmachine">EventMachine</a>.  EventMachine provides the <em>reactor</em>, the fundamental &#8220;switch&#8221; in your application which decides what code is ready to run now, and NeverBlock provides various drop-in replacements for the common Ruby code used for network and IO: MySQL and Postgres database drivers, TCP sockets, etc.  Using these, the message queue processor can process many messages at the same time: there&#8217;s never any parallel execution but as one message performs some IO request, EventMachine and NeverBlock will seamlessly switch to handle another message while waiting for the IO response.  That&#8217;s the fundamental difference from threaded code: instead of switching threads at a non-deterministic point in the future, event-driven code only switches when the code tries to perform IO.  Your code does not need to be thread-safe because your code will not be interrupted while modifying variables and data structures in memory.</p>
<p>Sounds good, right?  Well, a few caveats:</p>
<ul>
<li>CPU-intensive processes won&#8217;t gain much.  There&#8217;s still only a single actual thread of execution under the covers so event-driven applications will only take advantage of a single processor/core.</li>
<li>Your application should run on Ruby 1.9 to take advantage of Fibers.  Fibers have been backported to Ruby 1.8 but I encourage you to try Ruby 1.9.  Most extensions are Ruby 1.9 safe now and Rails is fully supported on Ruby 1.9.  Without Fibers, your application code needs to change dramatically to work as success/error callbacks.  With Fibers, your code needs little change and can be written in the more familiar procedural style.</li>
<li>Application exception handling becomes tricky, just as with threads.  It&#8217;s easy to lose an exception.</li>
</ul>
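<p>To give a feel for the reactor-plus-Fibers idea, here&#8217;s a toy sketch in plain Ruby 1.9 with no EventMachine or NeverBlock involved.  The <code>ToyReactor</code> class is hypothetical, and &#8220;IO&#8221; is simulated by <code>Fiber.yield</code>: a real reactor would resume a fiber when its socket or query is actually ready, whereas this one simply takes turns.  What it does show is that control only changes hands at the IO points, so the handler code reads top to bottom like ordinary procedural code.</p>

```ruby
# Toy reactor: runs each handler in a Fiber and switches only at simulated IO.
class ToyReactor
  attr_reader :log

  def initialize
    @ready = []   # fibers that can be resumed
    @log = []     # records the order in which code actually ran
  end

  # Register a handler that performs `requests` simulated IO calls.
  def spawn(name, requests)
    fiber = Fiber.new do
      requests.times do |i|
        @log.push("#{name}: start IO #{i}")
        Fiber.yield :waiting            # hand control back instead of blocking
        @log.push("#{name}: got response #{i}")
      end
      :done                             # final value returned by the last resume
    end
    @ready.push(fiber)
  end

  # Resume fibers in turn until every handler has finished.
  def run
    until @ready.empty?
      fiber = @ready.shift
      state = fiber.resume
      @ready.push(fiber) unless state == :done
    end
  end
end

reactor = ToyReactor.new
reactor.spawn("msg-1", 1)
reactor.spawn("msg-2", 1)
reactor.run
puts reactor.log
```

<p>Both messages start their IO before either one gets its response, even though only one piece of code is ever running at a time.</p>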
<p>Next time, we&#8217;ll take a deeper look into some event-driven code and how it works.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mikeperham.com/2009/12/01/event-driven-applications/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
