Mike Perham

On Ruby, software and the Internet

Entries Tagged as 'Software'

Detecting Duplicate Images with Phashion

May 21st, 2010 · 7 Comments

Recently I was given a ticket to implement a “near-duplicate” image detector. Look at these three images:
The original image files have different byte sizes and different dimensions, but they show essentially the same thing. This is what we call a “near-duplicate” and the problem was that when displaying an automatically generated image gallery for [...]

[Read more →]

Tags: Ruby · Software

Stream Processing and “Trending” Data

May 5th, 2010 · 1 Comment

The Britney Spears Problem is a fantastic article from American Scientist about real-time processing of streaming data to determine trends. I love discovering clever new algorithms, and the “majority algorithm” is simple and easy to implement, yet something you probably wouldn’t think up yourself when solving the same problem. If you’ve ever wondered how [...]

[Read more →]

Tags: Software
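
For readers who want a feel for the “majority algorithm” mentioned above, here is a minimal sketch in Ruby. It assumes the algorithm meant is the one usually called the Boyer-Moore majority vote; the method names `majority_candidate` and `majority` are made up for illustration and do not come from the article. The first pass keeps one candidate and one counter, so memory stays constant however long the stream is. The second pass is needed because the first pass only yields a candidate, not proof that it holds a strict majority.

```ruby
# Pass 1: find the only element that *could* be a strict majority.
# Uses a single candidate and a single counter (constant memory).
def majority_candidate(stream)
  candidate = nil
  count = 0
  stream.each do |item|
    if count.zero?
      # No current candidate: adopt this item.
      candidate = item
      count = 1
    elsif item == candidate
      count += 1
    else
      # A differing item cancels out one vote for the candidate.
      count -= 1
    end
  end
  candidate
end

# Pass 2: confirm the candidate really appears in more than half the items.
# Returns nil when no strict majority exists.
def majority(items)
  candidate = majority_candidate(items)
  items.count(candidate) > items.size / 2 ? candidate : nil
end
```

Calling `majority(%w[a b a c a])` returns `"a"`, while `majority([1, 2, 3])` returns `nil` because no element fills more than half the list. On a true stream that cannot be replayed, the verification pass is not available, so the result of the first pass has to be treated as a candidate only.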

Risk and Startups

April 20th, 2010 · 11 Comments

I’ve worked at seven or eight startups in the last 12 years, learning along the way that I love the freedom and flexibility that a small company affords. You pay a good price for that freedom, though, in the form of risk: your job will be measured in terms of months and years, not decades. [...]

[Read more →]

Tags: Personal · Software

Using ActiveRecord with EventMachine

March 30th, 2010 · 3 Comments

Given all my work with Fibers and EventMachine over the last three months, it should come as no surprise that I’ve been working on infrastructure based on Fibers and EventMachine to get maximum scalability without the callback style of code which I dislike for many reasons. Watch my talk on scaling with EventMachine if [...]

[Read more →]

Tags: Rails · Software
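
To illustrate the idea of using Fibers to avoid callback-style code, here is a rough sketch that runs in plain Ruby. It does not use EventMachine or ActiveRecord: `ToyReactor`, `async_fetch` and `sync_fetch` are hypothetical stand-ins invented for this example, and the real infrastructure described in the post may work quite differently. The point is only the pattern: the callback resumes a paused Fiber, so the calling code reads top to bottom.

```ruby
require 'fiber' # provides Fiber.current on older Rubies; harmless on newer ones

# Stand-in for an event loop: a queue of blocks to run later.
class ToyReactor
  def initialize
    @pending = []
  end

  # Schedule a block to run on a later turn of the loop.
  def defer(&block)
    @pending << block
  end

  # Run scheduled blocks until none remain (blocks may schedule more).
  def run
    @pending.shift.call until @pending.empty?
  end
end

# Hypothetical callback-style API: the result arrives later via the block.
def async_fetch(reactor, key, &callback)
  reactor.defer { callback.call("value-for-#{key}") }
end

# Fiber wrapper: looks synchronous to the caller, but pauses the current
# Fiber until the callback fires and resumes it with the result.
def sync_fetch(reactor, key)
  fiber = Fiber.current
  async_fetch(reactor, key) { |value| fiber.resume(value) }
  Fiber.yield
end

reactor = ToyReactor.new
results = []

# Must run inside a Fiber so that sync_fetch has something to pause.
Fiber.new do
  user = sync_fetch(reactor, "user")
  post = sync_fetch(reactor, "post")
  results << user << post
end.resume

reactor.run
```

After `reactor.run` returns, `results` holds both values in order, even though neither was available at the moment `sync_fetch` was called. The trade-off is that every such call has to happen inside a Fiber, which is why this style usually needs support from the surrounding infrastructure.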

Cassandra Internals – Tricks!

March 20th, 2010 · 4 Comments

In my previous posts, I covered how Cassandra reads and writes data. In this post, I want to explain some of the trickery that Cassandra uses to provide a scalable distributed system.
Gossip
Cassandra is a cluster of individual nodes – there’s no “master” node or single point of failure – so each node must actively [...]

[Read more →]

Tags: Software

Cassandra Internals – Reading

March 17th, 2010 · 5 Comments

In my previous post, I discussed how writes happen in Cassandra and why they are so fast. Now we’ll look at reads and learn why they are slow.
Reading and Consistency
One of the fundamental theorems in distributed systems is Brewer’s CAP theorem: distributed systems can have Consistency, Availability and Partition-tolerance properties but can only guarantee [...]

[Read more →]

Tags: Software

Cassandra Internals – Writing

March 13th, 2010 · 13 Comments

We’ve started using Cassandra as our next-generation data storage engine at OneSpot (replacing a very large PostgreSQL machine with a cluster of EC2 machines) and so I’ve been using it for the last few weeks. As I’m an infrastructure nerd and a big believer in understanding the various layers in the stack, I’ve been [...]

[Read more →]

Tags: Software

Changelog vs Commitlog

February 18th, 2010 · 4 Comments

One of the things I really like about some software projects is when they provide an actual changelog or release notes. RabbitMQ released 1.7.2 the other day and I asked the developers if they could link to a changelog. They pointed me to this page. Unfortunately this is not exactly what I [...]

[Read more →]

Tags: Software

Varnish on 32-bit systems

January 18th, 2010 · No Comments

We run three small EC2 instances for content caching purposes at OneSpot. These systems are 32-bit machines with 1.7GB of RAM. Originally we figured even on a small system Varnish could flood a 100Mb line so we wouldn’t need a more expensive, large EC2 instance. This blog post explains why this turned [...]

[Read more →]

Tags: Software

Event-Driven Applications

December 1st, 2009 · 1 Comment

Getting concurrency in Ruby is tough: Ruby 1.8 threads are green so they don’t execute concurrently. Ruby 1.9 threads are native but they don’t execute concurrently due to the GIL (global interpreter lock) necessary to ensure thread-safety with native extensions. Only JRuby provides a stable, concurrent Ruby VM today. On top of [...]

[Read more →]

Tags: Ruby · Software