The Clymb 2013

2014-01-04

At the start of 2103, I was given full control of The Clymb’s site operations. Now that a year has passed, I want to post a year end summary with lessons learned. We had a few problems but the year was mostly positive.

New Hardware

This time last year we moved from Rackspace Cloud servers to Rackspace dedicated hardware with the thought that we could grow 10x on a more stable cluster. Previously we were running about 10 cloud servers to power the site and would spin up another 5 if we knew a big sale would drive higher traffic levels than normal (e.g. Black Friday).

We moved to 4 dedicated machines: 3 apps and 1 db. Each has a LOT more memory and cores than a Cloud box. The costs are approximately the same. In practice this has worked out fantastically. The dedicated machines scale at least an order of magnitude better than the Cloud boxes. Rackspace is not cheap but the network has proven very reliable and the support staff very knowledgable on the lower levels of the stack, i.e. Linux and OS setup/tuning.

The dedicated boxes crushed Black Friday. They had no problems at all, despite us having 2x the traffic of our previous best day.

Infrastructure

Our stack is very vanilla, with simple tools where possible. I eschew complexity as that is where production problems tend to crop up in my experience.1

ElasticSearch has proven the least reliable part of our stack: getting indexing and sharding working reliably took several months with a handful of extended outages. Luckily we built in a “Search is not available right now” fallback into the code so users don’t get a broken page when ES did go haywire.

Unicorn is expensive. Its lack of concurrency means it consumes most of the memory on our app servers. One of my tasks for 2014 is moving to a hybrid concurrency model via Puma or Phusion Passenger. Sidekiq has proven that Rails can be multithreaded, we just need to start using threads on the web side.

Failures

We had several production issues during the year:

I can’t think of any other user-facing disruptions.

Fun Stuff

Other issues, while maybe not user-facing, cropped up and demanded attention:

TODO

In 2014, we plan to:

Conclusion

Overall I’m quite happy with the year. I was never woken up by a phone call and I sleep soundly at night. I like to think that validates the choices we’ve made with our infrastructure; let’s hope 2014 proves as reliable.


  1. This is why I avoid distributed data stores as their administration and failure modes tend to be much more complex and thus difficult to make reliable. ↩︎

  2. Replication has proven reliable as long as we aren’t doing something crazy on the slave, like locking tables for over an hour, or using a read/write account to access it. cough ↩︎

  3. Memory bloat is common for Ruby daemons. Since we deploy and restart Sidekiq daily, this hasn’t posed a problem. monit restarts Sidekiq if it goes above a certain threshold on weekends. ↩︎