The Clymb 2013

At the start of 2103, I was given full control of The Clymb’s site operations. Now that a year has passed, I want to post a year end summary with lessons learned. We had a few problems but the year was mostly positive.

New Hardware

This time last year we moved from Rackspace Cloud servers to Rackspace dedicated hardware with the thought that we could grow 10x on a more stable cluster. Previously we were running about 10 cloud servers to power the site and would spin up another 5 if we knew a big sale would drive higher traffic levels than normal (e.g. Black Friday).

We moved to 4 dedicated machines: 3 apps and 1 db. Each has a LOT more memory and cores than a Cloud box. The costs are approximately the same. In practice this has worked out fantastically. The dedicated machines scale at least an order of magnitude better than the Cloud boxes. Rackspace is not cheap but the network has proven very reliable and the support staff very knowledgable on the lower levels of the stack, i.e. Linux and OS setup/tuning.

The dedicated boxes crushed Black Friday. They had no problems at all, despite us having 2x the traffic of our previous best day.

Infrastructure

Our stack is very vanilla, with simple tools where possible. I eschew complexity as that is where production problems tend to crop up in my experience.1

  • Percona 5.5 – 100% uptime, replication issues2
  • Redis 2.4 – 100% uptime
  • memcached 1.4 – 100% uptime
  • nginx 1.1 – 100% uptime
  • sidekiq – 100% uptime, memory bloat3
  • elasticsearch 0.90 – some downtime, configuration and setup complexity bit us
  • unicorn 4.6 – 100% uptime, expensive

ElasticSearch has proven the least reliable part of our stack: getting indexing and sharding working reliably took several months with a handful of extended outages. Luckily we built in a “Search is not available right now” fallback into the code so users don’t get a broken page when ES did go haywire.

Unicorn is expensive. Its lack of concurrency means it consumes most of the memory on our app servers. One of my tasks for 2014 is moving to a hybrid concurrency model via Puma or Phusion Passenger. Sidekiq has proven that Rails can be multithreaded, we just need to start using threads on the web side.

Failures

We had several production issues during the year:

  • In February we switched over to our new dedicated hardware cluster and almost immediately had downtime. The issue was tracked down to use of a hostname for our statsd metrics server. The statsd client was resolving the hostname for every metric it sent, causing 1000s of DNS requests per second when site traffic picked up. In response we switched the client to use a hard coded IP address and installed a caching DNS resolver on localhost.

  • In April our image domain expired, no one was monitoring the email address it had been registered with. Product images were dead from midnight to about 8am. In response we consolidated all our domains away from GoDaddy to another registrar and renewed them.

  • In December we had an engineer cough, me, cough who accidentally ran an UPDATE in production that he meant to run on localhost. This led to 500 orders being short on inventory and ruining the holidays for a few of our customers. I started using our read-only slave except where I explicitly need to write something right now.

I can’t think of any other user-facing disruptions.

Fun Stuff

Other issues, while maybe not user-facing, cropped up and demanded attention:

  • We had one Russian IP address performing automated dictionary login attacks. We banned the IP address and then implemented rate limiting via nginx to ensure it can’t happen again.
  • A few times people have tried crawling us without any rate limiting. If it was persistent, we banned their IP.
  • There was once or twice where bad data caused MySQL queries to return huge volumes of data, spiking various graphs and resulting in miserable performance. We found the problem in the slow query log and quickly fixed the data but don’t have a more comprehensive solution.

TODO

In 2014, we plan to:

  • upgrade Linux kernel for TCP performance improvements
  • upgrade to Percona 5.6
  • upgrade to Redis 2.8 for Lua and SCAN support
  • upgrade to nginx 1.5 for SSL performance improvements
  • switch from unicorn to Puma for memory efficiency

Conclusion

Overall I’m quite happy with the year. I was never woken up by a phone call and I sleep soundly at night. I like to think that validates the choices we’ve made with our infrastructure; let’s hope 2014 proves as reliable.


  1. This is why I avoid distributed data stores as their administration and failure modes tend to be much more complex and thus difficult to make reliable. 

  2. Replication has proven reliable as long as we aren’t doing something crazy on the slave, like locking tables for over an hour, or using a read/write account to access it. cough 

  3. Memory bloat is common for Ruby daemons. Since we deploy and restart Sidekiq daily, this hasn’t posed a problem. monit restarts Sidekiq if it goes above a certain threshold on weekends. 

8 thoughts on “The Clymb 2013”

  1. Great writeup Mike. We’re only starting to feel scaling issues here at Reverb, and we’re learning from the stuff you’re working on. Are you on Ruby 2.1? We just switched in production and I’m seeing what appears to be a memory leak that was not present under 2.0. On the other hand, GC time is way down.

  2. Hey Mike-
    Good read. What are your thoughts on TorqueBox? Or the newest alpha codeline TorqBox? Seems like that would be a good candidate for getting to a solid threaded web position. FYI – I have a robust Sidekiq stack running on TB with various middleware like Benchmarks, Failures, Sidetiq, and socialpanda’s Monitor & Superworker.
    thx/chad

  3. Heh! At iversity.com we use almost the same stack.

    The major difference is that we use Percona XTraDB Cluster and we have 2 DB servers with master-master(active passive) replication. In plus to more reliable data storage and more fault tolerant infrastructure cluster allows us:

    - switch off any machine for system software/hardware upgrade at any time
    - handle bigger load

    We use haproxy and Octopus gem(https://github.com/tchandy/octopus) to split DB Read/Write requests and load balance them between cluster machines.

  4. Mike, thanks for detailed retro. Love seeing how others approach system architecture at scale. Are you planning to try different Ruby implementations with your switch to Puma?

  5. It appears that with ruby 2.1 our sidekiq process started growing very quickly. It could be a leak in our code or something in Ruby 2.1 interacting with sidekiq. We have downgraded to 2.0 for now. Just thought you’d want to know Mike.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>