Modifying Redis for Fun and Profit
At Simulmedia, we have a lot of data representing the US television audience. Our goal is to present that data via a web interface that is as performant as we have come to expect from the big web tech companies, yet at a price scale that is suitable for a smaller company. Historically, we stored our data on a Hadoop cluster and accessed it via Hive. The advancements within the Hadoop ecosystem for SQL-like querying have been steadily progressing. Spark, Impala, Presto, and even Hive itself have all led to closing the gap between Hadoop and traditional data warehousing MPPs (Massively Parallel Processing). We wanted to look outside the infrastructure we had built for an alternative solution as a fresh start. We chose Redis (http://redis.io/) for its performance, stability, licensing, simplicity, and its diverse language support.
One of the applications of the data that we wanted to improve was the selection of audiences based on user attributes. This was written as Hive queries and on average would take 40 minutes. If the cluster was overloaded and the query was inserted on a queue, it would take even longer. While looking at our usage, it was clear that we only cared about the recency of the television audience. The composition of the audience historically wasn't as important as the composition now. It seemed viable to write an ETL (Extract, Transform, Load) to process the data into a caching architecture. When we evaluated the size of the data, it would fit in memory on a larger instance. This is when a proof of concept was written using Redis. Initial performance tests showed ~100ms as a potential target for servicing data access requests. From 40 minutes to 100ms was a huge jump that placed web responsiveness in the realm of possibility.
The data team then started on the task of building an ETL that would load all attributes into Redis. This would be executed as a full data load, and would reinitialize the Redis server so all relevant data was replicated. Thus, the most recent audience was always selectable from Redis. Using a full bulk loading ETL meant we didnt have to cache data from the SQL engine upon request. Along the way, we encountered a couple of problems and worked around them. Let's dive into these issues.
Bulk load failures due to Persistence
The first problem we encountered was that while bulk loading into Redis, we had left Redis Persistence enabled. The idea being that in the face of failure the time to boot is less than the time to rebuild the cache. However, default settings for the Persistence model is to save every minute after 10000 keys have been written to. Effectively, in the middle of a bulk load, Redis does a fork and relies on the COW (copy-on-write) feature in Linux so that old parent pages are left untouched while the background process writes to disk. We had sized our Redis instances according to their final cache size, with some additional head room. Depending on when the background save is triggered, this means the high water mark for memory usage may be closer to double. There would be one copy of the older working set, and one copy of the newer working set. The OOM (Out Of Memory) killer was invoked by the Linux kernel, which kills the task that is actively requesting memory. This means that the Redis server that we are bulk loading to would be killed. While one could easily double the instance size to account for this. Fortunately, we aren't writing data into Redis, as it is just a cache. By disabling automatic background saves, the ETL can issue the BGSAVE command to manually persist at the end of its bulk load.
Faster bulk loading via Python C module
This bulk load is a normal ETL process, and generally ETL processes, specially nightly ones, only have to complete by the start of business hours. One of the strengths of Redis is that it runs practically everywhere. So, as we were building out the software, developers were executing the ETL on laptops and ephemeral servers. If we could improve the performance of bulk loading, we could collectively spend less time waiting for the data sets being loaded. Some of us were transferring the persistence files around and starting Redis that way. That was great when sitting on the same network, but much slower transferring across the open internet.
We then attacked the bulk load mechanism itself, to see if that could be improved. We had been using the redis-py package, and loading pretty naively from Python. First up, pipelining. Tests showed that it helped, naively pipelining the whole bulk load ate up memory and our full data set is large enough to cause stress on a developers laptop. Presumably this memory consumption was because redis-py or hiredis was caching the whole pipeline before writing. When playing with the number of key writes issued before executing the pipeline, we were able to find a balance between performance and memory consumption. Of course, as most engineers are wont to do, we thought we could do better.
Next, we decided to write to a file in Redis Mass Insertion Protocol. This is now called Redis Serialization Protocol, or RESP. By translating our key writes into RESP, and writing to a temporary file, we could execute redis-cli from within Python. Viola, much better performance. Although curiously, now we were constrained by the CPU time being spent writing out to the file from Python. This is easily solved by actually consuming a large dictionary of all keys and their values into a Python C module. We let C do the processing of the dictionary, and write to the temporary file. This was the point where performance was fast enough that we stopped there. We open sourced the Python C module, known as BulkRedis, available on GitHub ( https://github.com/Simulmedia/bulkredis).
In the future, if someone has a desire to seek more performance with this process, the C module may directly connect to the Redis socket, and write the RESP commands while processing the dictionary. It remains to be seen how much this would affect performance.
One final note about bulk loading to Redis. In the length of time spent loading, we are unable to service other requests. It would be nice if we could write to another DB in Redis, and atomically switch the DB indices such that operations reading the cache would never be interrupted. Again, ETLs generally run at night when business isn't being run, as this isn't an always-on application. Some other users of Redis may find it useful to switch DBs though.
Slow ZUNIONSTORE performance
After the ETL to load cached data of users attributes was complete, we looked at performance of our REST API. Without any further caching, hitting the API, getting a response from Redis, and returning to the application was consistently sub-100ms by a wide margin. Naturally, we would like to extend that performance to further data operations. In another part of our application we look at a user's behavior over time. This is classic time series view of the data. If you think about it as a normal web server, let's say we want to look at the number of users logged in per day and we would like to intersect this by classes of users. The natural key is the day, and we would filter each days rows by the list of users returned from a subquery. This is fast enough, but doesnt return sub-100ms.
If we think again about this problem, we have the list of users already selected via the Redis intersection on users' attributes. Perhaps we could pivot the data and store that in Redis such that the key is the user, and the values are their days logged in. Then, we can use the AGGREGATE SUM functionality of ZUNIONSTORE to count users logged in. It seems like a viable idea, and easy to test.
Except once the test was set up, ZUNIONSTOREs performance was nowhere near fast enough across many keys. We almost gave up right there, but on an off chance, that night a quick perusal of Redis code hit upon a rather simple performance improvement. The next day, once we forked Redis repository, we implemented the performance fix. Sure enough, performance drastically improved. At Simulmedia, we like to give back, so we immediately sent it out as a pull request ( https://github.com/antirez/redis/pull/1786). Antirez was able to look at the mod, and noticed that the dictionary size was known in advance, so we could resize the dictionary up-front. This prevents resizing the dictionary while writing to it, and further improves performance. The community response was very positive, so Antirez moved the mod into the stable 2.8 release. We are very pleased with the Redis team, with such a quick turn around we were happy to abandon our repo to use the normal stable release.
Solving performance via custom Redis commands
Unfortunately, performance still wasn't good enough. It took multiple seconds to get a response from Redis, and Redis only services one command at a time. This wouldn't scale very well, and didn't hit our goal of sub-100ms. The data layout looks fine, it's just the time to walk the dictionaries that gets in our way. Plus, the ZUNIONSTORE isnt very cache friendly. Effectively, we are just walking a bunch of C arrays and adding together their values into a single array that we then send as a response.
We decided to write a custom command for Redis. This custom command would ideally walk a multi-dimensional C array in two for loops. This is accomplished by loading the keys into contiguous C arrays in one command. By handing a list of keys, the load command will walk their dictionaries and construct contiguous C arrays where the index of each array is the result of all the key values intersected. The weights are the data actually stored in the array itself.
Then, when servicing requests on behalf of the API, there is a second command. It takes the list of keys that are the intersection of users according to their attributes, and spins through the arrays in a tight for loop. Upon testing on a deployed server, with production data, we reached ~110ms to execute this command. This was close enough to our performance goals to be considered a success. By placing a challenge in front of us, it gave us the freedom to push the envelope and achieve what we didn't think was possible.
As a final note, this custom command is a two stage process. Once to load and allocate memory, and again the servicing command itself. This does mean we had to modify Redis startup scripts to run the custom load command upon restart. It would be nice to hook up the load command to occur every time Redis starts up. In addition, micro-optimization may tweak the performance and improve it. As we didn't bother prefetching cache lines, or unrolling loops. Also, the result array is allocated for every request, another avenue to improve. Perhaps we will look at this in the future.
Initially, we thought going from 40 minutes to a responsive web request was a large challenge. The engineering team was able to accomplish this by leveraging Redis and a bit of custom work. This resulted in a lot of energy and momentum carrying forward to the next project, creating a flywheel effect that allows engineering to continue delivering great products to the business.