Cache invalidation is probably one of the hardest things in computer programming. I understand it as finding a subtle compromise between completeness, redundancy and complexity. I would like to tap into this topic in a context of caching queries built via ORM.
I will move from basic ideas building upon them as needed and diving into more and more specifics as the post goes.
Completeness and redundancy
Let’s start from some general considerations. I define abovesaid completeness as a characteristic of invalidation procedure describing how frequent and under what circumstances data can become dirty and how long it will remain accessible. And redundancy will be a frequency and a volume of cache invalidated needlessly.
An example will help us perceive these two concepts better. Let’s look at common time-invalidated cache. On one hand, it inevitably leads to dirty data between update and cache timeout, making this algorithm inherently incomplete. On other hand, we can easily reduce incompleteness by reducing cache timeout, which, in it’s turn, will increase redundancy – clean cache data will be invalidated more frequently, which will lead to more cache misses. And for ideal completeness (no dirty data) we need to set timeout to zero.
There are lots of scenarios where it’s ok to use stale data: popular articles list doesn’t change fast and it’s not a big deal if user count of your social network is off by a couple of thousands. But then there are some scenarios where you need immediate invalidation, go to next section for that.
Probably, the only way to achieve ideal invalidation completeness is to invalidate each time you change data. These are elements of such system we need to think about:
- cached things,
- a dependency matrix explaining what thing to invalidate on what event.
There are obviously different strategies to define those in your code. I’ll start from simplest one – manual definition.
Coding dependencies by hand
First, cached things, they will probably look like this (in sort of python pseudo-code):
1 2 3 4 5 6 7 8 9
Remember this is pseudo-code.
register_cached_item() doesn’t need to be a function call, it could be a decorator in Python, a macro in lisp or a class in Java.
Ok, say we have some place in our code that adds post, we then need to add something like this there:
1 2 3 4 5 6 7 8 9 10 11
And if we delete post somewhere we need to make invalidation there too. We should obviously abstract our code into some
invalidate_post() function and call it from both places. So far, so good. What about updating? The thing is it’s not enough to just call our invalidation procedure from there:
1 2 3 4 5 6
We need to invalidate on both old and new states of a post on update. How we get old state is a separate not an easy question, but suppose we have both states, should we just call
invalidate_post() twice for each of them? Not so efficient, but that would work.
There is a problem with our code though. It’s tightly coupled – update logic knows about cache logic. Even bigger problem is that our cache logic is scattered. We define cached thing in one place and invalidate in another or several other places. This means there will come some day when someone will add a cached thing and just forget to add corresponding invalidation, producing hard to find bug.
Bringing things together
Fortunately, there is a single solution to both problems – events. We can write fetching and invalidation logic for each thing in one place and then register both cached thing and event listener:
1 2 3 4 5 6 7
As invalidate procedures would be tiresomely repetitive we can dry this up to (and even further using particular language sugar features):
1 2 3
Looks like we’ve come up with pretty solid system. It’s dry, but retains flexibility of still mostly manual system. We, however, need to declare each thing we plan to cache in advance. We also need to provide argument constructing functions. There got to be a smarter way.
We have not even started utilizing a power of ORM. Its query is not a mere text and plain arguments, it is a structure which includes condition tree. And some smart code could use that information to determine when query cache should be invalidated, saving work for lazy guys like me.
Suppose we cache a query:
We should drop that cache when we add, update or delete post with its old or new state satisfying
category_id = 2 and published condition. So the time we save that cache we should write along “invalidator” like that:
Then if some post changes we look up all invalidators and check their conditions with post at hand, deleting cache keys corresponding to holding ones. That could become very inefficient once we’ll have lots of queries cached.
The reason we will end up with lots of invalidators is that we have cached lots of different queries. In common use, however, most queries will differ only in parameters not structure. Maybe separating those will help us? Let’s try on larger example. Here be the queries:
1 2 3 4 5 6 7 8
Unseparated invalidators first:
1 2 3 4 5 6 7
We can use some tricks to make this more regular.
ored conditions could be split into two,
in is basically syntax sugar for
or and boolean tests could be substituted with equalities. Applying these we get to:
1 2 3 4 5
Note that after this transformation all conditions became simple conjunctions. And we are finally ready to separate condition scheme from data:
1 2 3 4 5 6 7 8 9 10 11
Time to try modeling invalidation procedure. Say we are adding this post to our stock:
S1 we can see that there is at most one conjunction of that scheme satisfying our state! Even better, we can build it from scheme and object data by looking at field values: each known value is equal only to itself. Too bad the trick won’t work with
S2 cause our post id, 42, is greater than many things. To find what queries of scheme
S2 should be invalidated one needs to look through all
S2:* conjunctions and that could be a lot.
This is probably a case for trade-off. Dropping all conditions but equalities we will sacrifice invalidation granularity, but simplify and speed up the procedure. Simplified invalidators will look like:
1 2 3 4 5 6 7 8 9 10 11
There are some points worth noting here. First,
S2 is now an empty scheme with a single empty conjunction which is always true, which means
K6 will be invalidated on any post change. Second, schemes are now just sets of field names, pretty neat.
So far I tried to stay language and platform agnostic, but that journey came to an end. Welcome to dirty reality in the next section.
A best way to represent scheme is probably alphabetically sorted list of field names, it’s easily serializable and it makes building a conjunction for an object pretty straightforward.
Extracting conjunctions from a query tree could be tricky. One might want to employ fuzzy logic: look at
not (f > 0 and g != 1), if we drop
f > 0right away, we’ll end up with
g = 1, which is not equivalent to
f <= 0 or g = 1. Lost empty conjunction along the way!
Considering invalidator structure (sets) and taking into account that it is generally a good idea to keep all interdependent data together (cache and invalidators) this is an excellent case to use Redis. Using it we can easily add keys to conjunctions, we can even do
SUNIONof conjunctions to fetch all dirty keys in one shot.
Sure, I went ahead and used that tips in my Django caching solution. Hope it or the ideas it embodies would be helpful.