How to easily scale a non-consistent cache based on redis ?

vmarchaud

2022-12-10

cache, caching, redis, scaling, technology

Introduction

Like most people working in the software engineering field, i’ve dealt for a long time with cache mechanisms. From the one in my browser than the one i’ve deployed in production to avoid making too many read into my database, there are pretty much eveywhere.

When i’ve joined Reelevant in 2019 as a infrastructure/backend engineer, most of my time was dedicated to improving the setup we had and one of my first project was to tackle our cache cluster.
For the context, we use Redis (like a lot of people) and we store a lot of different things; from a API response to whole images or gifs (ie. ~1kb to ~5Mb) and we had few problems:

When the cache was not responding all of our applications was slow, not because we were fetching everything from the database but because we were waiting for the cache to answer !
Our applications didn’t start if the cache was down
We were paying ~1500 euros/month for a cluster with 20G of memory, which was too much.
For this cluster, the provider was replicating the dataset across 2 secondary instances in case the primary had a problem, so even though our cluster had 60G installed, we could only use 20G.

I’m pretty sure a lot of people can come up with a solution for each problems with too much trouble. However instead of trying to fix each problem and certainly increasing a bit the complexity of the setup, i’ve instead decided to avoid making a quick fix and instead went back to the drawing board with a list of features we wanted.

The famous theorem

When someone want to store data in a distributed system, they will encounter the CAP theorem which state that you only can have 2 out of the 3 properties in the system:

consistency: you always get the up to date result
availability: the system is always working (but not necesarelly giving the right result)
network partition tolerance: if two host can’t talk to each other, the system will works and give a correct result

Even though the network partition is rare, it can still happens and you want to avoid beig down when it’s the case.
To go back to our drawing board for our cache, i went with the theory that we rarely care that the result isn’t consistent, which might help in our case.
Let me explain myself: when i do a query to find the latest netflix shows i didn’t see yet, the result is either discarded because there is a new recommendation or generally at some point of time i want to do a new one (which is the Time To Live or TTL of the query result), if there is a network partition i don’t care about loosing the data in the cache because i can surely re-compute it from my actual database (costing a bit more latency).

To go further, it’s not a big problem if there’s a lot of cache miss in case of a network partition, our database should be able to take the load and stay up (again, expecting that it will be slower than the cache).
You might argue that this assumption might be wrong and that the database might not handle that traffic and i agree that it’s possible but then i would take time to improve the database and avoid designing my cache to work around that.
That’s just personal preference in how i design architecture, not a silver bullet that work everytime for every use cases.

Anyway, we just want a cache that is always available and not specially consistent in case of network partition.

Why not just sum individual redis instances ?

As said previously, i find it sad that i can’t use just every bit of RAM available, i don’t care about replicating the data as i’m fine loosing it.
So i went looking for ways to “sum” different instances redis so my applications just think they have one redis instance but actually i can split data across multiple instances.

Splitting the data is actually straightforward in theory: given a set of hosts and a input key (here, the redis key) i want a algorithm that will consistenly give the same output. Fortunally load balancers are already great at this, most of them offers way to split traffic based on url on any information in a request.

Enter the ring a well known load balancer: Envoy ! Envoy can actually talk the redis protocol and route request depending on the requested key (with the MAGLEV strategy).
This routing strategy use the list of available hosts where it can route a given request so it avoid sending requests to offline/non-healthy hosts which means that as soon as as one host is considered offline envoy will recompute and send the request to another instance.

As you can see here, we are giving up the result (consistency) to prefer availability.

A big advantage of this setup is that you easily scale horizontally with more instances as your load increases, which is nice to have as side perk of this setup.

Where’s the catch ?

Yeah you know there is one. This comes with a major drawback that you can’t make a operations that involves multiple keys. Here’s the common operation that you cannot do anymore:

SCAN (altrough it could in the future since this is just involving reads)
MULTI as it involves doing the operation atomaticaly, it just can’t be sure of the order of operation if individual operation are done in different servers.
INFO (or generally any command asking for server’s info)
& more (like pubsub or streams, see list of supported commands)

Since you cannot scan for keys, invalidating cache is a bit tricky too. Taking back my previous query for an individual’s suggested tv shows and say i want to invalidate the cache for everyone, i would want to delete keys based on a pattern but i can’t do that anymore (at least not without connecting to each individual redis instances).

You have two solutions:

store individual keys into a list, when you want to invalidate them you just delete all the keys present in the list (with two distinct requests, again you can’t do transactions)
store all results inside a hashmap and invalidate the whole object.

Depending on the data cached i always found one of these solutions to work fine.

We’ll take different situations to understand what happens and if we are okay with it:

If one host is added, chance are that the hashing change the routing so i might loose the data
If one host is lost, again the routing change and i loose my data.
If one host is slow, i can configure Envoy to automatically give up the request after a specific timeout (i did choose 100ms) and return a error.
If i want more cache capacity, just add an host (again loosing data).

As you can see if there is a problem, the system will always “drop” data but again in our design that’s fine.

Wait a minute, i didn’t actually drop the data ? It is still there, just not reachable anymore.

Indeed ! The simple fix is to set good enough TTL for every data that you insert in redis, i personally range from few minutes to multiple months … or do i need to ?

Another case i want handled is the behavior when one individual instance don’t have any more memory to cache, but that can be configured easily with redis easily with the maxmemory-policy configuration set to allkeys-lru value.
This will automatically evict keys that aren’t used anymore when we need memory, which will automatically drop old data that aren’t reachable since by definition they are not used.

Setup

The setup is actually simple since you just need to deploy two kind of instances (redis and envoy) that will always be the same.
Here’s configuration file i’m using, taking into account that you are using Kubernetes

Envoy

static_resources:
  listeners:
  - name: redis_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 6379
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: egress_redis
          settings:
            op_timeout: 100ms # automatically timeout
          prefix_routes:
            catch_all_route:
              cluster: redis_cluster
  clusters:
  - name: redis_cluster
    connect_timeout: 1s
    type: STRICT_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: MAGLEV
    load_assignment:
      cluster_name: redis_cluster
      # instances will be automatically discovered with kubernetes dns
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: redis.cache.svc.cluster.local #
                port_value: 6379
    health_checks:
    # even if kubernetes can do healthcheck, we just check ourselves wit hthe connection we already have
      - timeout: 2s
        interval: 5s
        unhealthy_threshold: 2
        healthy_threshold: 3
        custom_health_check:
          name: envoy.health_checkers.redis
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy

Redis

bind 0.0.0.0
port 6379
logfile ""
protected-mode yes
tcp-backlog 511
timeout 0
[...]
# we wont want to store the database in file
stop-writes-on-bgsave-error no
appendonly no
no-appendfsync-on-rewrite no
save ""
[...]
# max memory (in our case 9G)
maxmemory 9000000000
maxmemory-policy allkeys-lru
[...]

If you are not running in kubernetes, a simple TCP load balancer can do the trick to allow your applications to reach envoy. We actually did that initially because we were running the system in normal GCE instances:

Conclusion

As soon as you decide that it’s acceptable for your cache to drop data (or more precisely increase your cache miss rate), the setup can be a lot simpler which was the case for us.

Going back to our initial problems, some of the issues needed to be solved on the application code for which we wrote our library on top of ioredis called cache-cache:

it configure aggresive timeouts
handle errors by returning nothing (or calling the memoized function)
can automatically cache using a hashmap (to solve our invalidation problem)

Also by using simple redis instances we dramatically cut the cost from ~1500 to 200 euros/month, first because we only paid for 20G if we needed 20G and second because we moved the setup from a 3rd party provider that managed the instance for us to autoscaled GCP instances.

Going back this is easily to best design choice that we made, we only had one or two network partition since then (~3 years) on GCP so even though it was rare, we did have them and it didn’t bring our applications down :)

Recently (in 2022) we migrated this setup from GCE instances to Kubernetes hosted on OVH which improved performances and cost even more, you can read about that here.