Entries Tagged as 'Software'
Detecting Duplicate Images with Phashion
May 21st, 2010 · 7 Comments
Recently I was given a ticket to implement a “near-duplicate” image detector. Look at these three images:
The original image files have different file sizes and different dimensions, but they show essentially the same thing. This is what we call a “near-duplicate”, and the problem was that when displaying an automatically generated image gallery for [...]
Stream Processing and “Trending” Data
May 5th, 2010 · 1 Comment
The Britney Spears Problem is a fantastic article from American Scientist about real-time processing of streaming data to determine trends. I love discovering clever new algorithms, and the “majority algorithm” is simple and easy to implement, yet it is something you probably wouldn’t think up yourself if you were solving the same problem. If you’ve ever wondered how [...]
Tags: Software
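For readers who have not seen it, here is a minimal Python sketch of a majority algorithm. It assumes the article means the Boyer-Moore majority vote, which finds the element occurring in more than half of a stream using a single pass and constant memory. The function names majority_candidate and is_majority are illustrative and do not come from the article.

```python
from typing import Hashable, Iterable, Optional, Sequence


def majority_candidate(stream: Iterable[Hashable]) -> Optional[Hashable]:
    """Return the only element that could be a strict majority of the stream.

    Assumed technique: Boyer-Moore majority vote. One pass, O(1) memory.
    The result is only a candidate; if no element occurs in more than half
    of the stream, the returned value is arbitrary.
    """
    candidate = None
    count = 0
    for item in stream:
        if count == 0:
            # No current candidate: adopt this item.
            candidate, count = item, 1
        elif item == candidate:
            # Another vote for the current candidate.
            count += 1
        else:
            # A differing item cancels one vote for the candidate.
            count -= 1
    return candidate


def is_majority(items: Sequence[Hashable], candidate: Hashable) -> bool:
    """Second pass: confirm the candidate occurs in more than half the items."""
    occurrences = sum(1 for item in items if item == candidate)
    return occurrences * 2 > len(items)


searches = ["britney", "obama", "britney", "iphone", "britney"]
winner = majority_candidate(searches)
print(winner, is_majority(searches, winner))
```

The verification pass matters because the first pass always returns something, even when no element has a true majority. On a real stream that cannot be replayed, that second pass is not available, so the result has to be treated as a candidate only.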
Risk and Startups
April 20th, 2010 · 11 Comments
I’ve worked at seven or eight startups in the last 12 years, learning along the way that I love the freedom and flexibility that a small company affords. You pay a steep price for that freedom, though, in the form of risk: your job will be measured in terms of months and years, not decades. [...]
Using ActiveRecord with EventMachine
March 30th, 2010 · 3 Comments
Given all my work with Fibers and EventMachine over the last three months, it should come as no surprise that I’ve been building infrastructure on top of them to get maximum scalability without the callback style of code, which I dislike for many reasons. Watch my talk on scaling with EventMachine if [...]
Cassandra Internals – Tricks!
March 20th, 2010 · 4 Comments
In my previous posts, I covered how Cassandra reads and writes data. In this post, I want to explain some of the trickery that Cassandra uses to provide a scalable distributed system.
Gossip
Cassandra is a cluster of individual nodes – there’s no “master” node or single point of failure – so each node must actively [...]
Tags: Software
Cassandra Internals – Reading
March 17th, 2010 · 5 Comments
In my previous post, I discussed how writes happen in Cassandra and why they are so fast. Now we’ll look at reads and learn why they are slow.
Reading and Consistency
One of the fundamental theorems in distributed systems is Brewer’s CAP theorem: distributed systems can have Consistency, Availability and Partition-tolerance properties but can only guarantee [...]
Tags: Software
Cassandra Internals – Writing
March 13th, 2010 · 13 Comments
We’ve started using Cassandra as our next-generation data storage engine at OneSpot (replacing a very large PostgreSQL machine with a cluster of EC2 machines), so I’ve been using it for the last few weeks. As I’m an infrastructure nerd and a big believer in understanding the various layers in the stack, I’ve been [...]
Tags: Software
Changelog vs Commitlog
February 18th, 2010 · 4 Comments
One of the things I really like about some software projects is when they provide an actual changelog or release notes. RabbitMQ released 1.7.2 the other day and I asked the developers if they could link to a changelog. They pointed me to this page. Unfortunately this is not exactly what I [...]
Tags: Software
Varnish on 32-bit systems
January 18th, 2010 · No Comments
We run three small EC2 instances for content caching purposes at OneSpot. These systems are 32-bit machines with 1.7GB of RAM. Originally we figured even on a small system Varnish could flood a 100Mb line so we wouldn’t need a more expensive, large EC2 instance. This blog post explains why this turned [...]
Tags: Software
Event-Driven Applications
December 1st, 2009 · 1 Comment
Getting concurrency in Ruby is tough: Ruby 1.8 threads are green, so they don’t execute in parallel. Ruby 1.9 threads are native, but they still don’t execute in parallel because of the GIL (global interpreter lock), which is necessary to ensure thread safety with native extensions. Only JRuby provides a stable, concurrent Ruby VM today. On top of [...]