At the start of 2103, I was given full control of The Clymb's site operations. Now that a year has passed, I want to post a year end summary with lessons learned. We had a few problems but the year was mostly positive.
This time last year we moved from Rackspace Cloud servers to Rackspace dedicated hardware with the thought that we could grow 10x on a more stable cluster. Previously we were running about 10 cloud servers to power the site and would spin up another 5 if we knew a big sale would drive higher traffic levels than normal (e.g. Black Friday).
We moved to 4 dedicated machines: 3 apps and 1 db. Each has a LOT more memory and cores than a Cloud box. The costs are approximately the same. In practice this has worked out fantastically. The dedicated machines scale at least an order of magnitude better than the Cloud boxes. Rackspace is not cheap but the network has proven very reliable and the support staff very knowledgable on the lower levels of the stack, i.e. Linux and OS setup/tuning.
The dedicated boxes crushed Black Friday. They had no problems at all, despite us having 2x the traffic of our previous best day.
Our stack is very vanilla, with simple tools where possible. I eschew complexity as that is where production problems tend to crop up in my experience.1
- Percona 5.5 -- 100% uptime, replication issues2
- Redis 2.4 -- 100% uptime
- memcached 1.4 -- 100% uptime
- nginx 1.1 -- 100% uptime
- sidekiq -- 100% uptime, memory bloat3
- elasticsearch 0.90 -- some downtime, configuration and setup complexity bit us
- unicorn 4.6 -- 100% uptime, expensive
ElasticSearch has proven the least reliable part of our stack: getting indexing and sharding working reliably took several months with a handful of extended outages. Luckily we built in a "Search is not available right now" fallback into the code so users don't get a broken page when ES did go haywire.
Unicorn is expensive. Its lack of concurrency means it consumes most of the memory on our app servers. One of my tasks for 2014 is moving to a hybrid concurrency model via Puma or Phusion Passenger. Sidekiq has proven that Rails can be multithreaded, we just need to start using threads on the web side.
We had several production issues during the year:
In February we switched over to our new dedicated hardware cluster and almost immediately had downtime. The issue was tracked down to use of a hostname for our statsd metrics server. The statsd client was resolving the hostname for every metric it sent, causing 1000s of DNS requests per second when site traffic picked up. In response we switched the client to use a hard coded IP address and installed a caching DNS resolver on localhost.
In April our image domain expired, no one was monitoring the email address it had been registered with. Product images were dead from midnight to about 8am. In response we consolidated all our domains away from GoDaddy to another registrar and renewed them.
In December we had an engineer cough, me, cough who accidentally ran an UPDATE in production that he meant to run on localhost. This led to 500 orders being short on inventory and ruining the holidays for a few of our customers. I started using our read-only slave except where I explicitly need to write something right now.
I can't think of any other user-facing disruptions.
Other issues, while maybe not user-facing, cropped up and demanded attention:
- We had one Russian IP address performing automated dictionary login attacks. We banned the IP address and then implemented rate limiting via nginx to ensure it can't happen again.
- A few times people have tried crawling us without any rate limiting. If it was persistent, we banned their IP.
- There was once or twice where bad data caused MySQL queries to return huge volumes of data, spiking various graphs and resulting in miserable performance. We found the problem in the slow query log and quickly fixed the data but don't have a more comprehensive solution.
In 2014, we plan to:
- upgrade Linux kernel for TCP performance improvements
- upgrade to Percona 5.6
- upgrade to Redis 2.8 for Lua and SCAN support
- upgrade to nginx 1.5 for SSL performance improvements
- switch from unicorn to Puma for memory efficiency
Overall I'm quite happy with the year. I was never woken up by a phone call and I sleep soundly at night. I like to think that validates the choices we've made with our infrastructure; let's hope 2014 proves as reliable.
- This is why I avoid distributed data stores as their administration and failure modes tend to be much more complex and thus difficult to make reliable. [return]
- Replication has proven reliable as long as we aren't doing something crazy on the slave, like locking tables for over an hour, or using a read/write account to access it. cough [return]
- Memory bloat is common for Ruby daemons. Since we deploy and restart Sidekiq daily, this hasn't posed a problem. monit restarts Sidekiq if it goes above a certain threshold on weekends. [return]