First geek post, OK first post in a while… Been a busy year already!
The requirement: build out a notification service that can support 1M events an hour, each of which _could_ trigger a notification or notifications based on a ruleset of 100K rules. No small task. After some digging around, I came up with an implementation that uses Drools to optimize the rule checking and meet the needed event pace. Drools builds out a rules tree that can decide, for a particular event, that only 2K of those rules even need to be evaluated, rather than iterating over the full 100K. Very necessary optimization.
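For the geeks who want to see what that looks like in practice, here's a minimal sketch of the Drools side. The session name, the event class, and the rule packaging are placeholders for illustration, not our actual setup:

```java
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class RuleProcessor {

    // Hypothetical event POJO; the real one carries whatever the
    // upstream change events carry.
    public record ChangeEvent(String type, String entityId) {}

    public static void main(String[] args) {
        // Load the rules packaged on the classpath (kmodule.xml defines the
        // "notification-rules" session name -- a placeholder here).
        KieServices kieServices = KieServices.Factory.get();
        KieContainer container = kieServices.getKieClasspathContainer();
        KieSession session = container.newKieSession("notification-rules");

        // Insert one event; the rule network (built once from the full rule
        // set) only walks the branches whose constraints this event can
        // possibly match, instead of scanning every rule.
        session.insert(new ChangeEvent("ACCOUNT_UPDATED", "42"));
        int fired = session.fireAllRules();
        System.out.println("Rules fired: " + fired);

        session.dispose();
    }
}
```

The key point: building the session (and the rule network behind it) happens once; each inserted event only touches the slice of rules it can actually match.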
As it turns out, 100K is more than I could get Drools to bear in a single session. Never fear, though: we have multiple rule “types”. If I send each change event in parallel through N rule processors, each of which handles one or more rule types, I’ve probably divided the 100K rules up reasonably and can still meet the performance requirements. Put the events on a Kafka topic that all of the rule processors pull from, and I’ve got the parallelization that lets me meet my performance requirement.
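Roughly what one of those rule processors looks like on the consuming side, assuming plain Kafka consumer groups; the topic name, the group naming, and the argument handling are all hypothetical placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RuleTypeProcessor {

    public static void main(String[] args) {
        // Which rule types this particular processor owns -- handed in at
        // deploy time (hypothetical argument handling).
        String ruleTypeGroup = args.length > 0 ? args[0] : "rule-type-A";

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // One consumer group per rule-type processor, so every processor sees
        // every change event but only evaluates its own rule subset.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "rule-processor-" + ruleTypeGroup);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("change-events")); // topic name is a placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hand the event to the Drools session holding only this
                    // processor's slice of the 100K rules.
                    evaluateAgainstRules(ruleTypeGroup, record.value());
                }
            }
        }
    }

    private static void evaluateAgainstRules(String ruleTypeGroup, String eventJson) {
        // Placeholder for the session.insert(...) / fireAllRules() call.
    }
}
```

Because each rule-type processor runs as its own consumer group, every processor sees every event; scaling a single rule type further just means adding consumers to that group, up to the topic's partition count.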
So, today I wire key elements of our solution up in a development environment. The rule set is intended to be cached in a Redis instance, and my task today was to figure out how to deploy that Redis instance within Kubernetes with an appropriate set of configurations for its workload. I’ve never used Redis before in any significant way, so this was an interesting problem: should it be a standalone single node that handles both reads and writes? Should it instead use replication, so writes go through one node and reads go through a set of replicas? How much CPU and memory? Let’s assume I’ll get those wrong or they’ll need to vary across environments: how do I turn on metrics and visibility into resource usage so I know when we need to change things?
Got all of that worked out for at least a first cut. I go to deploy using docker-compose as a stand-in for Kubernetes. Awesome – it stands up, and I figure out how to adjust our code’s database connection to use the replication model I’ve chosen (i.e., read from one node, write to another). Great – now I wire my local environment up to the upstream system it loads its data from, to see how this works with whatever they’ve got loaded. Hey, it’s development – I’m expecting a paltry data set, just enough to let me show the read/write interactions work out OK.
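A rough sketch of that kind of read/write split, here using the Lettuce client; the hostnames and keys are placeholders:

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.sync.RedisCommands;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterreplica.MasterReplica;
import io.lettuce.core.masterreplica.StatefulRedisMasterReplicaConnection;

public class RuleCache {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create();

        // Point at the primary (hostname is a placeholder for whatever the
        // docker-compose / Kubernetes service name ends up being); Lettuce
        // discovers the replicas from the primary's replication info.
        StatefulRedisMasterReplicaConnection<String, String> connection =
                MasterReplica.connect(client, StringCodec.UTF8,
                        RedisURI.create("redis://redis-primary:6379"));

        // Reads go to a replica when one is available; writes always go to
        // the primary.
        connection.setReadFrom(ReadFrom.REPLICA_PREFERRED);

        RedisCommands<String, String> commands = connection.sync();
        commands.set("rule:12345", "{ serialized rule goes here }"); // routed to the primary
        String rule = commands.get("rule:12345");                    // routed to a replica
        System.out.println(rule);

        connection.close();
        client.shutdown();
    }
}
```

With `REPLICA_PREFERRED`, reads fall back to the primary if no replica is reachable, which is handy while the topology is still settling in a dev environment.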
Guess what? The “paltry data set” is 1M+ rows. Not only 1M+ rows, but 1M+ rows all of a single event type – no concept of sharding the events across different processors here, no siree. It’s either back to the drawing board or `kill -9` whatever data generation process the upstream system is using. I’m suspecting the latter: someone’s gotten overeager with the “we can build it, so let’s pump it full!” mindset. Will be interesting to see if the 1M+ number is larger tomorrow!
If none of the above makes sense to you, well, my non-techie analogy would be: your budget is 100K for a house. You’ve got 100K, or at least a path to get to it. But you’re surprised to discover that all of the houses in your reasonable travel area for work are 1M+. And your lease is about to expire. Find a solution!