Multiple queues
Learning Objectives
A new requirement: our distributed cron system needs to be able to schedule jobs to run in multiple clusters (e.g. one in Europe, one in America). Imagine that we want to support users whose data is stored in specific locations and who want to make sure their cron jobs run near that data.
Just as we are simulating multiple computers with docker-compose, we don't need to set up any real clusters for this - just write the program as though it had multiple sets of consumer workers.
You don’t need to set up multiple Kafka clusters for this - this extension is just about having multiple sets of consumer jobs, which we notionally call clusters.
- Define a set of clusters in our program (two is fine, cluster-a and cluster-b)
- Each cluster should have its own Kafka topic
- Update the job format in the crontab file so that jobs must specify what cluster to run in (Note: This will diverge your crontab file format from the standard one - this is fine)
- Run separate consumers that are configured to read from each cluster-specific topic
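The steps above can be sketched as a small routing layer: parse the extended crontab line, validate the cluster name, and map each cluster to its own topic. The exact field layout (cluster as the sixth field) and the topic-naming scheme are assumptions for illustration, not a required format.

```python
# Clusters we support; assumed names from the exercise.
CLUSTERS = {"cluster-a", "cluster-b"}

def parse_crontab_line(line: str) -> dict:
    """Split '<min> <hour> <dom> <mon> <dow> <cluster> <command...>'.

    The cluster field is our extension; it diverges from standard crontab.
    """
    fields = line.split()
    schedule, cluster, command = fields[:5], fields[5], " ".join(fields[6:])
    if cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster: {cluster}")
    return {"schedule": schedule, "cluster": cluster, "command": command}

def topic_for(cluster: str) -> str:
    # One jobs topic per cluster, e.g. 'jobs-cluster-a' (naming is arbitrary).
    return f"jobs-{cluster}"

job = parse_crontab_line("*/5 * * * * cluster-a echo hello")
```

Each consumer process would then subscribe only to the topic for its configured cluster.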
Test that our new program and Kafka configuration work as expected.
Think
Imagine in real life you had a deployed system that didn’t need clusters specified, and then wanted to add the ability to choose clusters.
How would you do this sort of a migration in a running production environment, where you could not drop existing jobs?
Handling Errors
Learning Objectives
What happens if there is a problem running a job? For some kinds of jobs, maybe the right thing is to retry it. For some, it isn't. It probably depends on what the job was doing.
Exercise
Think about what jobs should probably be retried and what jobs shouldn’t.
What are the common characteristics of each?
This should be a configurable property of our cron jobs: update our program to add a maximum number of attempts to the job configurations and message format.
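One way to carry the retry budget is to embed it in the job message itself. This is a minimal sketch; the field names (attempts_remaining, etc.) are assumptions, not a fixed schema.

```python
import json

def make_job_message(command: str, cluster: str, max_attempts: int = 3) -> bytes:
    """Serialize a job, including how many attempts it is still allowed."""
    return json.dumps({
        "command": command,
        "cluster": cluster,
        "attempts_remaining": max_attempts,
    }).encode()

msg = json.loads(make_job_message("echo hi", "cluster-a", max_attempts=2))
```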
However: we don’t want to risk retry jobs displacing first-time runs of other jobs. This is why some queue-based systems use separate queues for retries.
Reading
We can create a second set of topics for jobs that fail the first time and need to be retried (we need one retry topic for each cluster). If a job fails, the consumer should write the job to the corresponding retry topic for the cluster (and decrement the remaining allowed attempts in the job definition).
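The failure-handling decision described above can be sketched as a pure function: decrement the remaining attempts, and either return the retry topic to publish to or signal that the job should be discarded. The retry-topic naming is an assumption.

```python
def retry_topic_for(cluster: str) -> str:
    # One retry topic per cluster, e.g. 'jobs-retry-cluster-a'.
    return f"jobs-retry-{cluster}"

def handle_failure(job: dict):
    """Return (topic, updated_job) to re-enqueue, or None to discard."""
    remaining = job["attempts_remaining"] - 1
    if remaining <= 0:
        return None  # retry budget exhausted: drop the job
    updated = {**job, "attempts_remaining": remaining}
    return retry_topic_for(job["cluster"]), updated

result = handle_failure({"command": "x", "cluster": "cluster-a", "attempts_remaining": 2})
```

Keeping this logic separate from the Kafka producer/consumer code makes it easy to test without a broker.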
Exercise
Run some instances of your consumer program that read from your retry queues (this can be a command-line option in your consumer).
Define a job that fails and observe your retry consumers retrying and eventually discarding it.
Define a job that randomly fails some percent of the time, and observe your retry consumers retrying and eventually completing it.
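For the randomly-failing job, a helper like the following can stand in for the real command while testing; the function and its fail_rate parameter are illustrative, not part of any required interface.

```python
import random

def flaky_job(fail_rate: float, rng: random.Random) -> bool:
    """Simulate a job that fails with probability fail_rate.

    Returns True on success, False on failure. Passing in the RNG
    makes the behaviour reproducible in tests.
    """
    return rng.random() >= fail_rate
```

In a real crontab entry, the equivalent might be a command that exits nonzero some fraction of the time.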