Multiple queues
Learning Objectives
A new requirement: our distributed cron system needs to be able to schedule jobs to run in multiple clusters (e.g. one in Europe, one in America). Imagine that we want to support users whose data is stored in specific locations and who want to make sure their cron jobs run near that data.
Just as we are simulating multiple computers with docker-compose, we don't need to set up any real clusters for this - just write the program as though it had multiple sets of consumer workers.
You don’t need to set up multiple Kafka clusters for this - this extension is just about having multiple sets of consumer jobs, which we notionally call clusters.
- Define a set of clusters in our program (two is fine, cluster-a and cluster-b)
- Each cluster should have its own Kafka topic
- Update the job format in the crontab file so that jobs must specify what cluster to run in (Note: This will diverge your crontab file format from the standard one - this is fine)
- Run separate consumers that are configured to read from each cluster-specific topic
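The steps above can be sketched as a small routing layer: parse the extended crontab line, validate the cluster name, and map each cluster to its own topic. The exact field layout (cluster as the sixth field) and the topic-naming scheme are assumptions for illustration, not a required format.

```python
# Clusters we support; assumed names from the exercise.
CLUSTERS = {"cluster-a", "cluster-b"}

def parse_crontab_line(line: str) -> dict:
    """Split '<min> <hour> <dom> <mon> <dow> <cluster> <command...>'.

    The cluster field is our extension; it diverges from standard crontab.
    """
    fields = line.split()
    schedule, cluster, command = fields[:5], fields[5], " ".join(fields[6:])
    if cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster: {cluster}")
    return {"schedule": schedule, "cluster": cluster, "command": command}

def topic_for(cluster: str) -> str:
    # One jobs topic per cluster, e.g. 'jobs-cluster-a' (naming is arbitrary).
    return f"jobs-{cluster}"

job = parse_crontab_line("*/5 * * * * cluster-a echo hello")
```

Each consumer process would then subscribe only to the topic for its configured cluster.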
Test that our new program and Kafka configuration work as expected.
Think
Imagine in real life you had a deployed system that didn’t need clusters specified, and then wanted to add the ability to choose clusters.
How would you do this sort of a migration in a running production environment, where you could not drop existing jobs?
Handling Errors
Learning Objectives
What happens if there is a problem running a job? For some kinds of jobs, maybe the right thing is to retry it. For some, it isn't. It probably depends on what the job was doing.
Exercise
Think about what jobs should probably be retried and what jobs shouldn’t.
What are the common characteristics of each?
This should be a configurable property of our cron jobs: update our program to add a maximum number of attempts to the job configurations and message format.
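One way to carry the retry budget is to embed it in the job message itself. This is a minimal sketch; the field names (attempts_remaining, etc.) are assumptions, not a fixed schema.

```python
import json

def make_job_message(command: str, cluster: str, max_attempts: int = 3) -> bytes:
    """Serialize a job, including how many attempts it is still allowed."""
    return json.dumps({
        "command": command,
        "cluster": cluster,
        "attempts_remaining": max_attempts,
    }).encode()

msg = json.loads(make_job_message("echo hi", "cluster-a", max_attempts=2))
```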
However: we don’t want to risk retry jobs displacing first-time runs of other jobs. This is why some queue-based systems use separate queues for retries.
Reading
We can create a second set of topics for jobs that fail the first time and need to be retried (we need one retry topic for each cluster). If a job fails, the consumer should write the job to the corresponding retry topic for the cluster (and decrement the remaining allowed attempts in the job definition).
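The failure-handling decision described above can be sketched as a pure function: decrement the remaining attempts, and either return the retry topic to publish to or signal that the job should be discarded. The retry-topic naming is an assumption.

```python
def retry_topic_for(cluster: str) -> str:
    # One retry topic per cluster, e.g. 'jobs-retry-cluster-a'.
    return f"jobs-retry-{cluster}"

def handle_failure(job: dict):
    """Return (topic, updated_job) to re-enqueue, or None to discard."""
    remaining = job["attempts_remaining"] - 1
    if remaining <= 0:
        return None  # retry budget exhausted: drop the job
    updated = {**job, "attempts_remaining": remaining}
    return retry_topic_for(job["cluster"]), updated

result = handle_failure({"command": "x", "cluster": "cluster-a", "attempts_remaining": 2})
```

Keeping this logic separate from the Kafka producer/consumer code makes it easy to test without a broker.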
Exercise
Run some instances of your consumer program that read from your retry queues (this can be a command-line option in your consumer).
Define a job that fails and observe your retry consumers retrying and eventually discarding it.
Define a job that randomly fails some percent of the time, and observe your retry consumers retrying and eventually completing it.
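For the randomly-failing job, a helper like the following can stand in for the real command while testing; the function and its fail_rate parameter are illustrative, not part of any required interface.

```python
import random

def flaky_job(fail_rate: float, rng: random.Random) -> bool:
    """Simulate a job that fails with probability fail_rate.

    Returns True on success, False on failure. Passing in the RNG
    makes the behaviour reproducible in tests.
    """
    return rng.random() >= fail_rate
```

In a real crontab entry, the equivalent might be a command that exits nonzero some fraction of the time.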