If you’ve watched tv recently, you’ve perhaps seen the introduction to Elementary , the latest in a string of Sherlock Holmes inspired shows. The intro depicts a machine in which a series of events set off by a marble results in the caging of a figurine. Watching it, you might question the complexity of the events, but that is the essence of a Rube Goldberg machine—something that is intentionally over engineered to perform what would normally be a simple task.
But what does this have to do with shipping?
A few months ago, one of our Redis machines became overloaded. We were able to quickly kill the offending script, work on a fix, and relaunch without further problems.
While the relaunch may have been uneventful (as we hope all launches are), the story of how we ended up in this situation is quite interesting. We hope that by sharing what we learned we can help others from building the same Rube Goldberg machine.
To follow along, you should know that our development stack is in Python, and this particular case involved Redis, Memcached and Celery—the Python library for dealing with queues and background workers.
To provide shipping rate quotes as quickly as possible, we built our own rating engine. This rates engine breaks down the entire United States by zip code to return prices for shipping to and from any location, resulting in a potential 498,501 entries (or keys) in our system.
The zip code entries are contained in a flat file, and we needed to get this data into Memcached for our production system to query every time we make a shipping rate request. Our tests showed that the round-trip to Memcached is ~10ms, meaning it would take 13 hours every time we wanted to seed Memcache with that data. We needed a faster solution to seed the cache.
We attempted to speed this up by writing a script to read the flat file in batches into Celery tasks and and having Celery push it to Redis, which would than push it to Memcached.
But as Mike Tyson once said, “everyone has a plan until they get punched in the mouth.”
It turns out that we had a bug in our batching code:
batch_size = 500 # number of lines to batch together
flat_file = load_zones_file() # reads flatfile
for i in xrange(0, len(flat_file), batch):
batch = zones_table[i:(i+1)*batch_size]# sends the task to Redis to be read & processed by worker
That code has a tiny and very costly bug. Think about the size of batch at i=0 , it works fine. However when i>0, the size of batch increased with i ! Specifically, the size of i th batch is i*batch_size whereas it really is supposed to be just batch_size
Hence the size of all batches sent to Redis is: (1+n)*n/2*batch_size , where n=len(flat_file) . So we just sent ~n 2 *batch_sizelines to Redis instead of n*batch_size . That was a pretty costly bug. The fix is obviously to just multiply by batch_size in the start index:
batch = zones_table[i*batch_size:(i+1)*batch_size]
While we were looking for the bug, we also discovered that Memcached supports a set_multi in it’s binary protocol, which Django exposes as django.core.cache.cache.set_many .
According to Django source code, this is much more efficient for obvious reasons (I’d imagine reusing the connection to Memcached for one).
So what did we learn?
1. Look for library-supported ways of batching or improving performance. Often times others have run into similar problems and there’s probably a right solution there for you. In this case if we had known about django.core.cache.cache.set_many , we could have avoided the whole situation. This is actually what we ended up employing as our solution, and it’s been working great.
2. Redis’ DEL Operator can be very handy. It is a destructive operation so be careful with it, it came in handy when we decided to go with django’s set_many and get rid of the old tasks.
3. Redis will not always free up (return) memory to the OS when keys are removed. Hence the OS’s memory metrics can be misleading, use the Redis CLI or monitors that used it (Elasticache’s “Bytes Used For Cache”) to monitor actual memory usage.
We're hiring at Shippo! If building simple systems is up your alley, and shipping code that sends millions of packages around the world peaks your interest, check out our jobs page!