Monitoring Redis

2017-04-20

Redis is widely used by the Ruby community but, like any complex piece of infrastructure, isn’t well understood by many of its users. I wanted to write a blog post that would answer the question: How can I check on the health of my Redis server? Here’s a few things you can do to better understand your Redis server.

Ask for Info

The info command is the easiest way to get an overall view of Redis:

$ redis-cli info
# Server
redis_version:3.2.5
tcp_port:6379
uptime_in_seconds:1313002
uptime_in_days:15
executable:/usr/local/opt/redis/bin/redis-server
config_file:/usr/local/etc/redis.conf

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:1043920
used_memory_human:1019.45K
used_memory_rss:16564224
used_memory_rss_human:15.80M
used_memory_peak:18961872
used_memory_peak_human:18.08M
total_system_memory:8589934592
total_system_memory_human:8.00G
used_memory_lua:60416
used_memory_lua_human:59.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:15.87
mem_allocator:libc

# Stats
total_connections_received:396
total_commands_processed:588823
instantaneous_ops_per_sec:0
total_net_input_bytes:136139888
total_net_output_bytes:216183171
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:95
evicted_keys:0
keyspace_hits:6978
keyspace_misses:20802
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:1007
migrate_cached_sockets:0

# CPU
used_cpu_sys:67.83
used_cpu_user:42.98
used_cpu_sys_children:0.07
used_cpu_user_children:0.07

# Keyspace
db4:keys=25,expires=24,avg_ttl=892611707
db5:keys=37,expires=36,avg_ttl=829671792
db8:keys=4,expires=2,avg_ttl=4423347833

(output trimmed a bit)

There’s a few very important pieces to note in the output:

config_file shows where Redis is getting its static configuration. If you need to tweak a setting, this is where to go.
connected_clients shows how many network client connections are currently connected. It’s normal for this to be in the hundreds or even thousands for a very busy server.
used_memory_* is important to see how much RAM is used by Redis.

Memory is the most important thing to note here. Your entire dataset size must fit into machine RAM. If it does not, the resulting swapping will lead to terrible performance and massive latency spikes.

redis-cli --stat will give you the most important details in a real-time view you can watch or scrape:

$ redis-cli --stat
------- data ------ --------------------- load -------------------- - child -
keys       mem      clients blocked requests            connections
31         1009.00K 1       0       681483 (+0)         576
31         1009.00K 1       0       681484 (+1)         576
31         1009.00K 1       0       681485 (+1)         576

redis-cli is a very powerful tool with lots of nice features within it. It’s worth running --help and playing with the various options and modes.

Look out for Latency

Latency is the time difference between request and response, e.g. the client sends a command and gets a result back in 10ms. We want this round trip to be as fast as possible. Really bad latency can trigger “Operation timed out” exceptions in your application. There are two general latency sources:

Network - the network between the client and Redis can be congested.
Internal - Redis itself might have commands or internal operations which add significant latency.

Network

You can see the current network latency easily:

$ redis-cli --latency
min: 0, max: 1, avg: 0.29 (632 samples)

This shows a minimum of 0ms and max of 1ms, averaging 290µs, latency to my localhost server. Ideal conditions.

You can also easily watch latency history, here with a default window of 15 sec:

$ redis-cli --latency-history
min: 0, max: 7, avg: 0.24 (1321 samples) -- 15.00 seconds range
min: 0, max: 256, avg: 0.55 (1280 samples) -- 15.01 seconds range
min: 0, max: 3062, avg: 3.86 (1051 samples) -- 15.98 seconds range
min: 0, max: 490, avg: 0.69 (1255 samples) -- 15.01 seconds range

You can see the first window looks pretty good, the second window had a spike of 256ms, and the third had a spike of 3062ms (ouch!).

I triggered these spikes by running redis-cli debug sleep 0.5. Not recommended in production.

Internal

Redis also has a latency event and monitoring subsystem to track commands execution latency and operations (like forking) which might block command processing. Use the LATENCY commands to monitor and collect these events. Some commands can be naturally slow, see SLOWLOG below.

Watch for Slow Commands

Because it is single-threaded, it’s vitally important that all commands to Redis process quickly or one slow command can block everyone else for a significant amount of time. Redis includes a configurable SLOWLOG command to watch for these bad apples.

My Redis is configured by Homebrew to log any commands that take longer than 10ms, a sensible default. Run this on your instance and see if you find anything suspicious:

$ redis-cli slowlog get 10
1) 1) (integer) 3
   2) (integer) 1492461257
   3) (integer) 11825
   4)  1) "LPUSH"
       2) "queue:default"
       3) "{\"queue\":\"default\",\"jid\":\"df55c4e27c48e63f0d5212b4\",\"class\":\"LoadWorker\",\"args\":[0],\"created_at\":1492461257.208411,\"enqueued_at\"... (33 more bytes)"
       4) "{\"queue\":\"default\",\"jid\":\"4989e31cea7ecf78e93b7f9f\",\"class\":\"LoadWorker\",\"args\":[1],\"created_at\":1492461257.2084169,\"enqueued_at... (34 more bytes)"
      [snip]
      32) "... (9971 more arguments)"

There are four parts to each entry: 3 is a unique identifier, 1492461257 is the timestamp when it happened and 11825 is the number of microseconds for execution, 11.825ms. The last part is the command with arguments.

I happen to know that entry is an LPUSH to the default queue of 10,000 jobs. It’s a testcase for Sidekiq::Client’s push_bulk API. Since we’re enqueueing 10,000 jobs at once, it’s not surprising that might take almost 12ms.

It is safe to run SLOWLOG in production and highly encouraged. In fact, I would play with slowlog-log-slower-than values until you find a setting that catches unexpectedly slow things but does not contain a lot of “normal” commands from your application. If you find the slowlog is constantly full, redesign your system so it doesn’t run those slow commands. For instance, if you want to enqueue 10,000 jobs, perhaps you could call push_bulk 10 times with 1000 jobs each, so each invocation only takes 1/10th as long. If a piece of OSS is running a slow comamnd, open an issue so the maintainer knows about the problem. I’m not too proud to admit it’s happened several times with Sidekiq; I fix them as fast as I can!

Extra Credit

When I was managing technical operations, my team was responsible for monitoring our infrastructure. We would do things like:

Write a cron job to dump the slowlog to a daily email if it is not empty. Treat any entries as issues to be investigated and fixed.
Set up a dashboard for the Redis server. Have dedicated graphs for connected_clients, instantaneous_ops_per_sec, and used_memory_rss. Treat big changes in these graphs as incidents to be investigated.

Part of owning your infrastructure is monitoring its health proactively; understanding these tools and having quick access to this data will be invaluable when debugging failures and outages.