Mike Perham

On Ruby, software and the Internet

Detecting Duplicate Images with Phashion

May 21st, 2010 · 7 Comments

Recently I was given a ticket to implement a “near-duplicate” image detector. Look at these three images:


The original image files have different bytesizes and different sizes but they show essentially the same thing. This is what we call a “near-duplicate” and the problem was that when displaying an automatically generated image gallery for a given subject, we were sometimes showing duplicate images due to slight differences in the images.

Obviously we can’t use something like an MD5 or SHA1 fingerprint – we have to create a fingerprint based on the content of the image, not the exact bytes. This is what the pHash library does. A “perceptual hash” is a 64-bit value based on the discrete cosine transform of the image’s frequency spectrum data. Similar images will have hashes that are close in terms of Hamming distance. That is, a binary hash value of 1000 is closer to 0000 than 0011 because it only has one bit different whereas the latter value has two bits different. The duplicate threshold defines how many bits must be different between two hashes for the two associated images to be considered different images. Our testing showed that 15 bits is a good value to start with, it detected all duplicates with a minimum of false positives.

Phashion is my new Ruby wrapper for the pHash library and wraps just enough of the pHash API to implement the described functionality. Here’s the test in the test suite which verifies that Phashion considers the images to be duplicates:

  def assert_duplicate(a, b)
    assert a.duplicate?(b), "#{a.filename} not dupe of #{b.filename}"
  end
  def test_duplicate_detection
    files = %w(86x86-0a1e.jpeg 86x86-83d6.jpeg 86x86-a855.jpeg)
    images = files.map {|f| Phashion::Image.new("#{File.dirname(__FILE__) + '/../test/'}#{f}")}
    assert_duplicate images[0], images[1]
    assert_duplicate images[1], images[2]
    assert_duplicate images[0], images[2]
  end

pHash does have much more functionality, including video and audio support and persistent MVP tree support for similarity searches based on previously processed files, but I have not wrapped any of those APIs. Try it out and let me know what you think!

→ 7 CommentsTags: Ruby · Software

Stream Processing and “Trending” Data

May 5th, 2010 · 1 Comment

The Britney Spears Problem is a fantastic article from American Scientist about real-time processing of streaming data to determine trends. I love discovering clever new algorithms and the “majority algorithm” is simple, easy to implement but something you probably wouldn’t think up yourself if solving the same problem. If you’ve ever wondered how Twitter’s trending feature is implemented, this is probably a good place to start.

→ 1 CommentTags: Software

bayes_motel – Bayesian classification for Ruby

April 28th, 2010 · 4 Comments

Bayesian classification is an algorithm which allows us to categorize documents probabilistically. I recently started playing with Twitter data and realized there was no Ruby gem which would allow me to build a spam detector for tweets. The classifier gem just works on a set of text by figuring out which words appear in a category but a tweet is much more complicated than that. A tweet looks like this:

{:text=>"Firesale prices, too! RT @nirajc: Time to change your Facebook password. Hacker selling 1.5m accounts. http://bit.ly/dryY7", 
:truncated=>false, :created_at=>"Fri Apr 23 18:26:51 +0000 2010", :coordinates=>nil, :geo=>nil, :favorited=>false,
:source=>"<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>",  :place=>nil, :contributors=>nil,
:user=>{:verified=>false, :profile_text_color=>"666666", :friends_count=>226, :created_at=>"Wed Oct 08 07:15:23 +0000 2008",
:profile_link_color=>"2FC2EF", :favourites_count=>12, :description=>"All the news that's fit to tweet (and most that isn't)",
:lang=>"en", :profile_sidebar_fill_color=>"252429", :location=>"Brooklyn, NY", :following=>nil, :notifications=>nil,
:time_zone=>"Eastern Time (US & Canada)", :statuses_count=>981, :profile_sidebar_border_color=>"181A1E", 
:profile_image_url=>"http://a1.twimg.com/profile_images/834612904/Photo_on_2010-04-16_at_00.38__3_normal.jpg", 
:profile_background_image_url=>"http://s.twimg.com/a/1271725794/images/themes/theme9/bg.gif", :protected=>false, 
:contributors_enabled=>false, :url=>"http://www.aolnews.com", :screen_name=>"carlfranzen", :name=>"Carl Franzen", 
:profile_background_tile=>false, :profile_background_color=>"1A1B1F", :id=>16645918, :geo_enabled=>false, 
:utc_offset=>-18000, :followers_count=>174}, :id=>12717456105}

As you can see, a tweet is just a hash of variables. So which variables are a better indicator of spam? I don’t know and chances are you don’t either. But if we create a corpus of ham tweets and a corpus of spam tweets, we can train a Bayesian classifier with the two datasets and it will figure out which variable values are seen often in spam and which in ham.

Some variables don’t work, statistically speaking:

  • :id, :created_at – these variables are unique for each tweet which means they are useless for classification. BayesMotel will trim any variable values that don’t appear in more than 3% of the corpus.
  • :followers_count – this is probably a pretty good spam/ham indicator in general, but not as a simple number. There are millions of possible values (@aplusk has 4.5 million followers) but we are only training on hundreds or thousands of tweets. What would be better is the binary logarithm of the followers_count to create discrete buckets: 32-64 followers = 5, 1024-2048 = 10 and so on. I’d bet any tweet with a value greater than 12 or so (i.e. 4096+ followers) is very likely to be ham.

There are additional things we could do to improve our spam detector:

  • We aren’t deep inspecting the value of the tweet text. It might be useful to have variables like “text_link_count” or “text_hashtag_count” to provide basic metrics for the tweet text content.
  • We aren’t performing any timeline checks or storing previous tweet state – spammers tend to tweet the same text over and over and their tweets all contain links. This is beyond the scope of a generic Bayesian system.

I wrote bayes_motel based on my research this last weekend. Give it a try and send a pull request if you make changes you’d like to see. The test suite gives more detail about the API and has a few thousand tweets to use as sample data. Happy coding!

→ 4 CommentsTags: Ruby

Risk and Startups

April 20th, 2010 · 11 Comments

I’ve worked at 7-8 startups in the last 12 years, learning along the way that I love the freedom and flexibility that a small company affords. You pay a good price for that freedom though in the form of risk: your job will be measured in terms of months and years, not decades. My parents spent decades at their jobs working for large corporations; that kind of job security does not exist at a startup.

An Analogy

Risk is something that you either purposefully manage or you roll the dice with your life, sometimes literally. I ride/race a motorcycle as my main hobby away from the computer. Riding a moto is a risky activity and I do several things to manage that risk:

  • Always wear a helmet, gloves and jacket
  • Ride a relatively low power bike
  • Taken every MSF training course available
  • Refuse to ride in groups

Do these guarantee I won’t crash? Certainly not but I hope they will lessen the odds and minimize any damage if I do.

Managing Risks

As engineers, what are the risks of working at a startup? The main risk is the company failing and going bankrupt. A second, related risk is being laid off. In both cases, your job and paycheck are at risk. How do we manage those risks? I have three tactics to manage the risk of working at a startup.

1) Make it as easy as possible to find a job

You could make yourself essential to the operation of the company; that helps with layoffs but does not help with bankruptcy and has the drawback that you will start from square one at the next startup. My strategy has been to make myself a valuable developer, independent of any one startup, by working on open source software and maintaining a high quality blog that evangelizes myself and my work. This is a last resort strategy: if anything happens to make my job disappear, ideally I can interview and find another job within days. This recently proved successful when I announced my upcoming move to San Francisco and had 20-30 inquiries over the next few days.

2) Exercise common sense and your math skills

Do you know your startup’s monthly burn rate, cash reserves and revenue? I’d bet that the majority of people at startups do not. Get those numbers and figure out how many months the company has before it has no money. Just a few months left? Would it be difficult to raise more money? Are you part of a “layer of fat” that could be laid off to cut the burn rate? Is revenue rising or dropping? Are you getting more customers? These are questions you should be asking yourself every month to evaluate the health of your startup. At some point you will need to leave on your own terms, before you are forced out by bankruptcy or layoffs. I left FiveRuns last year when these questions made bankruptcy look unavoidable. Leaving on my own terms meant I could take a few weeks to interview around to find the right job.

3) Stick with Success

They say failure is the best way to learn but in my experience nothing breeds success more than previous success. I try to stick with entrepreneurs that have past successes. As developers, we want to work with smart developers, yes, but you also want to work with great business guys who have a network of contacts, know how to raise funding and can navigate the company to a successful exit. I can interview a person to learn if they are a good developer but I can’t interview a CEO to learn if they are a good CEO. I have only two metrics:

  • do they have a reasonable business plan with a way to make money?
  • have they had previous startup successes?

The “halo” effect is very real. VCs are more willing to talk to someone who has previous success and knows the funding process. People are more willing to work at a company run by someone with previous success. Press is easier to get and customers are easier to talk to if they already know the company as the latest effort by a successful entrepreneur.

4) Educate yo’self (Extra bonus tip!)

You may know computer science but how much do you know about management or finance? Read a management book. I recommend anything by Peter Drucker – he literally invented the science of management and his writing really opened my eyes. Read a book on business finance. You’re not trying to become an expert in these fields but when you learn a little bit about the other major roles in a startup, you’ll be able to evaluate your startup’s current situation more accurately.

Even with all this, you will fail often. I’ve been part of two moderately successful exits and several bankruptcies. I’ve only been caught flat-footed once and tried to learn as much as I could from that experience. No matter what happens the startup experience is rewarding but with a little foresight you can minimize the inevitable risk to yourself and your livelihood.

→ 11 CommentsTags: Personal · Software

Phat News

April 6th, 2010 · No Comments

Gregg and Nathaniel (both of whom are notorious Gowalla cheats, which I would never do, no sir) chat a bit about Phat in the latest episode of Ruby5.

The Changelog crew also gave their take on Phat in a recent posting.

I’ve spent 100s of hours working on the technology behind Phat over the last six months. If you think it’s awesome, please consider recommending me on Working with Rails. I’m not asking for money, just an electronic thumbs up from my fellow Ruby community members. Thanks!

→ No CommentsTags: Ruby

Introducing Phat, an Asynchronous Rails app

April 3rd, 2010 · 20 Comments

Phat is my new Rails 2.3.5 application which runs 100% asynchronous, supporting many concurrent requests in a single Ruby process.

This is a new breed of Rails application which uses a new mode of execution available in Ruby 1.9: single Thread, multiple Fiber. Existing modes of execution suck:

  • Single thread harkens back to the days of Rails 1.x, where you started N mongrels to handle up to N concurrent requests.
  • Multiple threads is better but still has fundamental issues in Ruby. Autoloading is simply broken and Ruby’s thread implementation does not scale at all due to the GIL.

Here’s a sample action which uses memcached and the database. There’s nothing odd here – it’s the same old Rails API and codebase we are used to as Ruby developers, it just executes differently under the covers.

class HelloController < ApplicationController
  def world
    site_ids = Rails.cache.fetch 'site_ids', :expires_in => 1.minute do
      Site.all.map(&:id)
    end
    render :text => site_ids
  end
end

How does it work? If you want the nitty-gritty, watch my talk on EventMachine and Fibers. Everything that does network access ideally should be modified to be Fiber-aware. I’ve updated many gems to be Fiber-aware: memcache-client, em_postgresql (and activerecord), cassandra, bunny and rsolr to name a few. You’ll also need to run thin as your app server, since all of this code assumes it is executing within EventMachine.

Additionally we need to ensure that each request runs in its own Fiber. My new gem, rack-fiber_pool, will do this for you, just add it as Rack middleware in config/environment.rb. Here’s the basic configuration:

# Asynchronous DNS lookup
require 'em-resolv-replace'
require 'rack/fiber_pool'
# Pull in the evented memcache-client.
# You'll need to configure config.cache_store as normal.
require 'memcache/event_machine'
 
Rails::Initializer.run do |config|
  config.cache_store = :mem_cache_store
  # Run each request in a Fiber
  config.middleware.use Rack::FiberPool
  # Get rid of Rack::Lock so we don't kill our concurrency
  config.threadsafe!
end

Additionally we need to configure Postgresql and disable ActionController’s reloader mutex as it really doesn’t like fibered execution. This is ok because remember – there’s only a single thread executing in our process!

With that done, we can try some tests to see how we scale now. EventMachine works best when you have significant network latency. A simple test with database access over coffeeshop WiFi:

Without EventMachine:
Requests per second: 4.39 [#/sec] (mean)

With EventMachine:
Requests per second: 21.31 [#/sec] (mean)

That’s it! There’s no magic here: you can make your Rails app a “phat” app by following the same guidelines above. Fire up one thin instance per processor/core, put nginx in front of it and it should scale like crazy!

→ 20 CommentsTags: Rails

Using ActiveRecord with EventMachine

March 30th, 2010 · 3 Comments

Given all my work with Fibers and EventMachine over the last three months, it should come as no surprise that I’ve been working on infrastructure based on Fibers and EventMachine to get maximum scalability without the callback style of code which I dislike for many reasons. Watch my talk on scaling with EventMachine if you need more background on the problem.

Now that I have RabbitMQ, Cassandra, Solr and the Amazon AWS services evented, the only holdup was ActiveRecord. Some people may advocate using another ORM layer but when you have 2-3 other Rails apps, all sharing 100+ models, you can’t afford to maintain two separate ORM layers. Plus, frankly I like the Rails stack: it works pretty well, is thoroughly documented and every Ruby developer is familiar with it.

So what do we need to do to get AR working event-style? At a high level, there’s two things required:

  • The database driver itself must be modified to send SQL asynchronously. The postgresql driver, for instance, calls the exec(sql) method for all traffic to the database. So we just need to provide an exec method which uses Fibers under the covers to work asynchronously.
  • AR’s connection pooling needs to be Fiber-safe. Out of the box, it is Thread-safe. Since we are using an execution model based on a single Thread with multiple Fibers, all the Fibers would try to use the same connection, with disastrous consequences.

These are the things that em_postgresql does.

  • postgres_connection is a basic, EM-aware Postgres driver. It provides the Fibered exec() method which makes the whole thing asynchronous.
  • em_postgresql_adapter.rb wraps postgres_connection to make it a proper ActiveRecord driver.
  • patches.rb overrides a bunch of AR’s internal connection pooling to make it Fiber-friendly.

Unfortunately the latter makes one hack necessary – we have to have a list of current Fibers to release any lingering connections associated with those Fibers. The Threaded version can use Thread.list but Ruby does not provide an equivalent method for Fibers. Instead I require the application to register a FiberPool with AR to clear stale connections.

So what does it all mean? Well, here’s a Sinatra application that uses plain old ActiveRecord and is completely asynchronous! Try ab -n 100 -c 20 http://localhost:9292/test to hit the app with 20 concurrent connections; it will process them all in parallel, without any painful threading issues (autoloading, misbehaving extensions, etc). Awesome!

You should guess what’s next. Coming soon: the whole Rails stack, running asynchronously…

→ 3 CommentsTags: Rails · Software

Cassandra Internals – Tricks!

March 20th, 2010 · 4 Comments

In my previous posts, I covered how Cassandra reads and writes data. In this post, I want to explain some of the trickery that Cassandra uses to provide a scalable distributed system.

Gossip

Cassandra is a cluster of individual nodes – there’s no “master” node or single point of failure – so each node must actively verify the state of the other cluster members. They do this with a mechanism known as gossip. Each node ‘gossips’ to 1-3 other nodes every second about the state of each node in the cluster. The gossip data is versioned so that any change for a node will quickly propagate throughout the entire cluster. In this way, every node will know the current state of every other node: whether it is bootstrapping, running normally, etc.

Hinted Handoff

In writing, I mentioned that Cassandra stores a copy of the data on N nodes. The client can select a consistency level for a write based on the importance of the data – for example, ConsistencyLevel.QUORUM means that a majority of those N nodes must reply success for the write to be considered successful.

What happens if one of those nodes goes down? How do those writes propagate to that node later? Cassandra uses a technique known as hinted handoff, where the data is written to anther random node X to be stored and replayed for node Y when it comes back online (remember that gossip will quickly tell X when Y comes online). Hinted handoff ensures that node Y will quickly match the rest of the cluster. Note that read repair would still eventually “fix” the old data if hinted handoff did not work for some reason but only once the client asked for that data.

Hinted writes are not readable (since node X is not officially one of the N copies) so they don’t count toward write consistency. If Cassandra is configured for three copies and two of those nodes are down, it would be impossible to fulfill a ConsistencyLevel.QUORUM write.

Anti-Entropy

The final trick up Cassandra’s proverbial sleeve is anti-entropy. AE explicitly ensures that the nodes in the cluster agree on the current data. If read repair or hinted handoff don’t work due to some set of circumstances, the AE service will ensure that nodes reach eventual consistency. The AE service runs during “major compactions” (the equivalent of rebuilding a table in an RDBMS) so it is a relatively heavyweight process that runs infrequently. AE uses a Merkle Tree to determine where within the tree of column family data the nodes disagree and then repairs each of those branches.

This is the last post in my series on Cassandra. I hope you enjoyed them! Please leave a comment if you have questions or if I’ve made an error above.

→ 4 CommentsTags: Software

Ruby Open Files

March 19th, 2010 · No Comments

Get the number of open files for each of your Ruby processes:

sudo lsof | grep ruby | ruby -e 'h=Hash.new(0);$<.each_line {|line| h[line.split[1]] += 1};p h'

Example output:

{"3268"=>808, "4513"=>399, "4795"=>237, "5067"=>178, "5083"=>16, "23751"=>108}

→ No CommentsTags: Ruby

Cassandra Internals – Reading

March 17th, 2010 · 5 Comments

Cassandra logo

In my previous post, I discussed how writes happen in Cassandra and why they are so fast. Now we’ll look at reads and learn why they are slow.

Reading and Consistency

One of the fundamental thereoms in distributed systems is Brewer’s CAP theorem: distributed systems can have Consistency, Availability and Partition-tolerance properties but can only guarantee two. In the case of Cassandra, they guarantee AP and loosen consistency to what is known as eventual consistency. Consider a write and a read that are very close together in time. Let’s say you have a key “A” with a value of “123″ in your cluster. Now you update “A” to be “456″. The write is sent to N different nodes, each of which takes some time to write the value. Now you ask for a read of “A”. Some of those nodes might still have “123″ for the value while others have “456″. They will all eventually return “456″ but it is not guaranteed when (in practice, usually just a few milliseconds). You’ll see why this is important in a second.

Reads are similar to writes in that your client makes a read request to a single random node in the Cassandra cluster (aka the Storage Proxy). The proxy determines the nodes in the ring (based on the replica placement strategy) that hold the N copies of the data to be read and makes a read request to each node. Because of the eventual consistency limitations, Cassandra allows the client select the strength of the read consistency:

  • Single read – the proxy returns the first response it gets. Can easily return stale data.
  • Quorum read – the proxy waits for a majority to respond with the same value. This makes it much more difficult to get stale data (nodes would have to go down) but slower.

In the background, the proxy also performs read repair on any inconsistent responses. The proxy will send a write request to any nodes returning older values to ensure that the nodes return the latest value in the future. There are a number of edge cases here that I’m not clear how Cassandra deals with:

  • What if an even number of nodes reply, with half returning a value of “X” and the other half returning a value of “Y”? Since each column value is timestamped, presumably it will use the timestamp as a tie breaker.
  • What if two nodes return “X” with an old timestamp and one node returns “Y” with a newer timestamp? Does quorum override the clock?
  • What if the clocks on the cluster nodes are out of sync?

Scanning ranges

Cassandra works fine as a key/value store: you give it the key and it will return the value for that key. But this is often not enough to answer critical questions: what if I want to read all users whose last name starts with Z? Or read all orders placed between 2010-02-01 and 2010-03-01? To do this, Cassandra must know how to determine the nodes which hold the corresponding values. This is done with a partitioner. By default, Cassandra uses a RandomPartitioner which is guaranteed to spread the load evenly across your cluster but cannot be used for range scanning. Instead a ColumnFamily can be configured to use an OrderPreservingPartitioner, which knows how to map a range of keys directly onto one or more nodes. In essence, it knows which node(s) hold the data for your alphabetically-challanged users and for February’s orders.

Reading on an Individual Node

So all of that distributed system nonsense aside, what does each node do when performing a read? Recall that Cassandra has two levels of storage: Memtable and SSTable. The Memtable read is relatively painless – we are operating in memory so the data is relatively small and iterating through the contents is fast as possible. To scan the SSTable, Cassandra uses a row-level column index and bloom filter to find the necessary blocks on disk, deserializes them and determines the actual data to return. There’s a lot of disk IO here which ultimately makes the read latency higher than a similar DBMS. Cassandra does provide some row caching which solves much of that latency.

That’s a whirlwind tour of Cassandra’s read path. Take a look at the StorageConfiguration wiki page for much more content on this subject. Next up, I’ll discuss some of the various “tricks” Cassandra uses to solve the myriad of edge cases inherent in distributed systems.

→ 5 CommentsTags: Software