Optimizing Sidekiq
2015-10-14
Sidekiq has a reputation for being much faster than its competition, but there’s always room for improvement. I recently rewrote its internals and made it six times faster. Here’s how!
It’s been quite a while since I’ve touched Sidekiq’s core design. That was intentional: for the last year, Sidekiq has stabilized and become reliable infrastructure that Ruby developers can trust when building their applications. That didn’t stop me from wondering, though: what would happen if I changed this or tweaked that?
Recently I decided to embark on an experiment: how hard would it be to remove Celluloid? Could I convert Sidekiq to use bare threads? I like Celluloid and how much easier it makes concurrent programming, but it’s also Sidekiq’s largest dependency; as Celluloid changes, Sidekiq must accommodate those changes.
1. Get a Baseline
The first thing I did was write a load testing script that executes 100,000 no-op jobs so I could judge whether a change was effective. The script creates 100,000 jobs in Redis, boots Sidekiq with 25 worker threads, and prints the current state every 2 seconds until the queue is drained. This script ran in 125 seconds on MRI 2.2.3.
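A minimal sketch of such a script, assuming a hypothetical no-op `LoadWorker` class; Sidekiq itself would be booted separately (e.g. `sidekiq -c 25 -r ./loader.rb`):

```ruby
require 'sidekiq'
require 'sidekiq/api' # for Sidekiq::Queue

# Hypothetical no-op job; we only measure Sidekiq's own overhead.
class LoadWorker
  include Sidekiq::Worker
  def perform(idx); end
end

# Enqueue 100,000 jobs in one bulk push to keep setup out of the timing.
Sidekiq::Client.push_bulk('class' => LoadWorker,
                          'args'  => (0...100_000).map { |i| [i] })

start = Time.now
queue = Sidekiq::Queue.new # the 'default' queue
until queue.size == 0
  puts "Pending: #{queue.size}"
  sleep 2
end
puts "Drained in #{(Time.now - start).round(1)} seconds"
```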
2. Bare Threads
Once a baseline was established, I spent a few days porting Sidekiq’s core to use nothing but plain old Threads. It wasn’t easy, but once the system was stable again the improvement was impressive: the load testing script now ran in 57 seconds. Every abstraction has a cost and a benefit; Celluloid lets you reason about and build a concurrent system much more quickly, but it does have a small runtime cost.
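For flavor, here’s a rough sketch of the bare-thread shape; the names are illustrative, not Sidekiq’s actual internals. Each processor owns a plain Thread running a fetch-and-execute loop:

```ruby
# Illustrative sketch: a processor as a plain Thread instead of an actor.
class Processor
  def initialize(manager)
    @manager = manager
    @done    = false
  end

  def start
    @thread ||= Thread.new { run }
  end

  def terminate(wait = false)
    @done = true
    @thread.join if wait && @thread
  end

  private

  def run
    until @done
      job = @manager.fetch # blocking; returns nil on timeout
      process(job) if job
    end
  end

  def process(job)
    # decode and execute the job; error handling elided in this sketch
  end
end
```

You lose the actor mailbox and supervision that Celluloid provides, so shutdown and error handling must be coded by hand.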
3. Asynchronous Status
Once the rewritten core was stable and the test suite was passing again, I ran ruby-prof on the load testing script to look for low-hanging fruit. The profiler showed that the processor threads were spending most of their time sending job status data to Redis. Sidekiq has 25 processor threads to execute jobs concurrently, and each thread called Redis at the start and finish of every job; you get precise status, but at the cost of two network round trips per job. To optimize this, I changed the processor threads to update a global status structure in memory, and changed the process’s heartbeat, which contacts Redis every few seconds, to push that status to Redis as part of the heartbeat. If Sidekiq is processing 1,000 jobs/sec, this saves 1,999 round trips every second! Result? The load testing script ran in 20 seconds.
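Here’s a hedged sketch of the idea; the names and structure are illustrative, not Sidekiq’s actual code. Processors record status in a shared thread-safe map, and the heartbeat flushes it to Redis in a single pipelined round trip:

```ruby
require 'sidekiq'
require 'concurrent' # assumes concurrent-ruby for a thread-safe map

WORK_STATE = Concurrent::Map.new

# Called by each processor thread: memory writes only, no network.
def job_started(tid, payload)
  WORK_STATE[tid] = { 'payload' => payload, 'run_at' => Time.now.to_i }
end

def job_finished(tid)
  WORK_STATE.delete(tid)
end

# Called by the heartbeat every few seconds: one pipelined round trip
# replaces two round trips per job.
def flush_status(conn, identity)
  conn.pipelined do
    conn.del("#{identity}:workers")
    WORK_STATE.each_pair do |tid, hash|
      conn.hset("#{identity}:workers", tid, Sidekiq.dump_json(hash))
    end
  end
end
```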
4. Parallel Fetch
I made the last major change when I noticed that MRI was using 100% of CPU and JRuby was using 150% during the script’s execution. Only 150%??? I have four cores in this laptop; why isn’t it using 300% or more? I had a hunch: Sidekiq has always used a single Fetcher thread to retrieve jobs from Redis one at a time. To test my theory, I introduced 1ms of latency into the Redis network connection using Shopify’s nifty Toxiproxy gadget, and the script’s execution time immediately shot up to over five minutes! The processor threads were starving, waiting for that single thread to deliver jobs to them one at a time over the slow network.
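With the toxiproxy gem, injecting that latency looks roughly like this; the proxy name and ports are assumptions, and Sidekiq would be pointed at the proxy’s listen port instead of Redis directly:

```ruby
require 'toxiproxy'

# Assumed setup: Sidekiq's Redis URL points at 127.0.0.1:22220.
Toxiproxy.populate([{
  name:     'redis',
  listen:   '127.0.0.1:22220',
  upstream: '127.0.0.1:6379'
}])

# Add 1ms of downstream latency for the duration of the block.
Toxiproxy[:redis].downstream(:latency, latency: 1).apply do
  run_load_test # hypothetical helper that drives the script above
end
```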
I refactored things to move the fetch code into the processor thread itself. Now all 25 processor threads call Redis and block, waiting for a job to appear. This, along with the async status change, should make Sidekiq much more resilient to Redis latency. With fetch happening in parallel, the script ran in 20 seconds again, even with 1ms of latency. JRuby 9000 now uses >300% CPU and processes 7,000 jobs/sec!
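Conceptually, the fetch now lives in each processor’s loop, something like this sketch (queue name and timeout are illustrative; the real fetch BRPOPs across all configured queues):

```ruby
require 'sidekiq'

# Each processor thread blocks on Redis itself now.
def fetch
  Sidekiq.redis do |conn|
    # Block for up to 2 seconds waiting for a job to appear.
    queue, job = conn.brpop('queue:default', 2)
    job # nil on timeout; the processor loop simply calls fetch again
  end
end
```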
Bonus: Memory and Latency!
I also ran the script with GC disabled. With no optimizations, Sidekiq executed 10,000 jobs using 1257MB of memory. With all optimizations, Sidekiq executed the same number of jobs using 151MB. In other words, the optimizations result in 8.3x less garbage.
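The measurement technique is blunt but simple: with GC disabled, the growth in the process’s RSS approximates the total garbage created. A rough sketch, where `run_jobs` is a hypothetical in-process driver for the no-op jobs:

```ruby
# With GC off, RSS growth approximates total garbage created.
GC.disable
rss_before = `ps -o rss= -p #{Process.pid}`.to_i # kilobytes
run_jobs(10_000) # hypothetical driver executing the no-op jobs
rss_after = `ps -o rss= -p #{Process.pid}`.to_i
puts "Garbage created: #{((rss_after - rss_before) / 1024.0).round} MB"
```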
But that’s not all! I measured job execution latency before and after: the time required for a client in one process to create a job and push it to Redis, and for Sidekiq to pick it up and execute the worker. Latency dropped from 22ms to 10ms.
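One simple way to measure this end to end, sketched with a hypothetical worker: embed the enqueue timestamp in the job arguments and have the worker report the delta.

```ruby
require 'sidekiq'

# Hypothetical worker for measuring end-to-end latency.
class LatencyWorker
  include Sidekiq::Worker
  def perform(enqueued_at)
    ms = ((Time.now.to_f - enqueued_at) * 1000).round
    Sidekiq.logger.info "end-to-end latency: #{ms}ms"
  end
end

# In the client process:
LatencyWorker.perform_async(Time.now.to_f)
```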
| Version | Latency | Garbage created when processing 10,000 jobs | Time to process 100,000 jobs | Throughput |
|---------|---------|---------------------------------------------|------------------------------|------------|
| 3.5.1 | 22ms | 1257 MB | 125 sec | 800 jobs/sec |
| 4.0.0 | 10ms | 151 MB | 22 sec | 4500 jobs/sec |
Drawbacks?
There are a few trade-offs to consider with these changes:
- more Redis connections in use. Previously only the single Fetcher thread would block on Redis; now each processor thread blocks on Redis, meaning you must have more Redis connections than your concurrency setting. Sidekiq’s default connection pool sizing of (concurrency + 2) will work great; see the sketch after this list
- job status on the Busy tab in the Web UI isn’t real-time; when the page renders, it may be delayed by up to a few seconds
- Celluloid is no longer required by Sidekiq, so if your application uses it, you will need to pull it in and initialize it yourself
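For illustration, here’s what sizing the pool explicitly could look like; with the default of (concurrency + 2) you shouldn’t need to set this at all:

```ruby
require 'sidekiq'

Sidekiq.configure_server do |config|
  # With 25 processor threads, 25 + 2 = 27 connections cover every
  # blocking fetch plus a little headroom.
  config.redis = { url: 'redis://localhost:6379/0', size: 27 }
end
```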
Conclusion
Keep in mind what we are talking about here: the overhead of executing no-op jobs. In a real application this overhead is dwarfed by the jobs’ actual execution time, so don’t expect to see radical speedups in your own application jobs. That said, this dramatic lowering of per-job overhead is still a nice win for all Sidekiq users, especially those with lots of very fast jobs.
This effort will become Sidekiq 4.0 coming later this Fall. All of this is made possible by sales of Sidekiq Pro and Sidekiq Enterprise. If you rely on Sidekiq every day, please upgrade and support my work.
See the GitHub pull request for all the gory details.