Distributed Locking with Redis and Ruby

2016-04-25

It can happen: sometimes you need to severely curtail access to a resource. Maybe you use a 3rd party API where you can only make one call at a time. To handle this extreme case, you need an extreme tool: a distributed lock.

Distributed locks are dangerous: hold the lock for too long and your system throughput plummets. They can easily become a major chokepoint for your app’s performance and scalability.

Recently a blog post talked about using Redis for distributed locking with Sidekiq. I tried the code and it didn’t even work. It did, however, give me the idea to test Sidekiq Enterprise’s Rate Limiting API, which provides a flexible “concurrent” limiter, against other Ruby gems that provide a similar lock.

Please Note: I’m not talking about Redlock and other algorithms that provide fault-tolerant locking via distributed consensus. Those algorithms are slower and much harder to get correct; I would never trust myself, or anyone else who’s not a Computer Science Ph.D., to write one. In this post, I’m talking about using a single Redis instance to coordinate many worker processes distributed across many machines. This is sufficiently safe and robust for most businesses.

The Setup

I tested four different distributed lock gems, including sidekiq-ent. With any of them we can create a distributed lock which ensures our system executes a block of code exclusively, even with dozens of processes. One thing to understand: sidekiq-ent’s Rate Limiting API does not need to run within a Sidekiq process - it can be used in any Ruby process: puma, unicorn, passenger, sidekiq, etc.

All locking libraries provide similar semantics. You define:

how long should the code wait for the lock before giving up?
how long before the lock times out?

The lock has to have a timeout, as that’s the only way to recover from a process crash while holding a lock. Libraries “wait” in two different ways. redis-semaphore and sidekiq-ent block, efficiently waiting to be notified when they can take the lock. The other two gems poll regularly, forcing an unfortunate tradeoff: polling more often means slamming Redis with unnecessary work, while polling less often means taking longer to notice that the lock has become free.
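To make the two timeouts and the polling tradeoff concrete, here is a minimal sketch of a polling lock. It does not talk to Redis: FakeStore is a hypothetical in-memory stand-in whose set_nx method mimics the semantics of the Redis SET command with the NX and PX options. PollingLock, WaitTimeout and poll_interval are likewise names invented for illustration, and none of the gems tested here is necessarily implemented this way.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for a single Redis instance.
class FakeStore
  attr_reader :requests

  def initialize
    @mutex = Mutex.new
    @data = {}      # key => [owner token, expiry time]
    @requests = 0   # how many calls hit the store
  end

  # Mimics `SET key token NX PX ttl`: succeeds only if the key is absent or expired.
  def set_nx(key, token, ttl)
    @mutex.synchronize do
      @requests += 1
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      _owner, expires_at = @data[key]
      if expires_at && expires_at > now
        false
      else
        @data[key] = [token, now + ttl]
        true
      end
    end
  end

  # Deletes the key only if `token` still owns it, so a slow holder cannot
  # release a lock that expired and was re-acquired by someone else.
  def release(key, token)
    @mutex.synchronize do
      @requests += 1
      owner, _expires_at = @data[key]
      @data.delete(key) if owner == token
    end
  end
end

class PollingLock
  class WaitTimeout < StandardError; end

  def initialize(store, key, wait_timeout:, lock_timeout:, poll_interval: 0.01)
    @store = store
    @key = key
    @wait_timeout = wait_timeout    # how long to wait for the lock before giving up
    @lock_timeout = lock_timeout    # how long before a held lock expires on its own
    @poll_interval = poll_interval  # shorter interval = more requests to the store
  end

  def synchronize
    token = SecureRandom.hex(8)
    deadline = monotonic_now + @wait_timeout
    until @store.set_nx(@key, token, @lock_timeout)
      raise WaitTimeout, "gave up waiting for #{@key}" if monotonic_now >= deadline
      sleep @poll_interval
    end
    begin
      yield
    ensure
      @store.release(@key, token)
    end
  end

  private

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

store = FakeStore.new
lock = PollingLock.new(store, "api-lock", wait_timeout: 1, lock_timeout: 5)
lock.synchronize { puts "holding the lock" }
puts "requests sent to the store: #{store.requests}"
```

Every failed attempt in the until loop is one more request to the store, which is why a shorter poll_interval costs more work. A blocking implementation avoids that loop by waiting to be notified instead.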

The Test

I created a benchmark exercising all four APIs. The code executes 100 “jobs” using 25 threads. Each job sleeps for 0.1 sec while holding the lock, meaning that a perfect run will take 10.0 sec. Gist of the actual benchmark code here.
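For readers who don't want to open the gist, the shape of such a benchmark can be sketched roughly as follows. This is not the gist itself: run_jobs is a hypothetical helper, and a plain in-process Mutex stands in for the distributed lock so the sketch runs without Redis. The usage line uses scaled-down parameters to finish quickly; the post's run corresponds to jobs: 100, threads: 25, hold: 0.1.

```ruby
require "benchmark"

# Runs `jobs` jobs across `threads` worker threads. Each job holds `lock`
# for `hold` seconds. `lock` is anything that responds to #synchronize.
def run_jobs(lock, jobs: 100, threads: 25, hold: 0.1)
  queue = Queue.new
  jobs.times { |i| queue << i }
  completed = 0

  workers = Array.new(threads) do
    Thread.new do
      loop do
        begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        lock.synchronize do
          sleep hold      # simulate work done while holding the lock
          completed += 1  # safe: only one thread is inside the lock at a time
        end
      end
    end
  end

  workers.each(&:join)
  completed
end

# Scaled-down run: 20 jobs * 0.01 sec means a perfect run takes 0.2 sec.
completed = nil
timing = Benchmark.measure do
  completed = run_jobs(Mutex.new, jobs: 20, threads: 5, hold: 0.01)
end
puts "completed #{completed} jobs"
puts timing
```

The timings reported in this post come from the actual benchmark in the gist, run against Redis, not from this sketch.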

sidekiq-ent
  0.110000   0.100000   0.210000 ( 10.433794)
redis-semaphore
  0.150000   0.150000   0.300000 ( 10.487963)
pmckee11-redis-lock
  0.460000   0.550000   1.010000 ( 10.718958)
ruby_redis_lock
  0.280000   0.250000   0.530000 ( 11.655952)

The third column shows the total CPU time in seconds (user plus system), and the value in parentheses is the elapsed wall-clock time. sidekiq-ent’s limiter used 0.21 seconds of CPU time; the others varied from 0.3 to 1.0 seconds.

The theoretical perfect runtime is 10 sec (100 jobs * 0.1 sec sleep), so sidekiq-ent adds about 4% overhead. The latter two gems added notably more overhead. Note that in the gist I had to modify pmckee11-redis-lock to disable exponential backoff; otherwise it would die with a timeout after several minutes.
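The overhead figure is simple arithmetic on the wall-clock times above. As a quick sketch (overhead_percent is a helper name made up here):

```ruby
IDEAL_SECONDS = 10.0 # 100 jobs * 0.1 sec sleep each

# Wall-clock times from the benchmark results above.
REAL_SECONDS = {
  "sidekiq-ent"         => 10.433794,
  "redis-semaphore"     => 10.487963,
  "pmckee11-redis-lock" => 10.718958,
  "ruby_redis_lock"     => 11.655952
}.freeze

# Percentage of extra wall-clock time relative to the perfect run.
def overhead_percent(real, ideal = IDEAL_SECONDS)
  ((real - ideal) / ideal * 100).round(1)
end

REAL_SECONDS.each do |name, real|
  puts format("%-20s %5.1f%%", name, overhead_percent(real))
end
```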

Metrics

Unfortunately the other three libraries give you no insight into actual lock usage. sidekiq-ent’s concurrent limiter, by contrast, offers real-time metrics so you can understand how the lock is performing. It can answer questions like:

how heavily is this lock contended?
is the lock ever timing out?
how often is the lock granted immediately vs forcing your code to wait?
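To illustrate the kind of bookkeeping behind the last question, here is a rough sketch of a lock that counts immediate versus delayed acquisitions. It wraps a local Mutex and is purely hypothetical: InstrumentedLock and its counters are invented for this example and say nothing about how sidekiq-ent collects or names its metrics. It also does not track timeouts.

```ruby
# Hypothetical wrapper that counts how the lock was acquired.
class InstrumentedLock
  attr_reader :immediate, :waited

  def initialize
    @lock = Mutex.new
    @stats = Mutex.new # protects the counters themselves
    @immediate = 0     # lock was free when requested
    @waited = 0        # caller had to wait for another holder
  end

  def synchronize
    if @lock.try_lock
      @stats.synchronize { @immediate += 1 }
    else
      @lock.lock
      @stats.synchronize { @waited += 1 }
    end
    begin
      yield
    ensure
      @lock.unlock
    end
  end
end

lock = InstrumentedLock.new
threads = Array.new(4) { Thread.new { lock.synchronize { sleep 0.01 } } }
threads.each(&:join)
puts "immediate: #{lock.immediate}, waited: #{lock.waited}"
```

A high ratio of waited to immediate acquisitions is a sign that the lock is heavily contended.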

You can read the metric definitions in the wiki. Here’s the UI:

Limiter Web UI

What have we learned?

The other libraries give you the basics of a distributed lock, but two are lacking in performance and all are missing the metrics necessary to debug problems. Some good things about Sidekiq Enterprise’s concurrent limiter:

it provided the highest-performance distributed lock of the four tested here
it blocks rather than polling or sleeping, so it won’t slam Redis with superfluous requests or burn CPU
it can limit access to N callers, not just 1
it provides much better visibility with real-time metrics about limiter usage

If you are using Sidekiq today, the Enterprise upgrade will drop right in. You can find it here.