Distributed Tracing
In the past we have added metrics to our programs and collected and aggregated those metrics using the Prometheus monitoring tool. Metrics are a widely-used methodology for understanding the behaviour of our systems at a statistical level: what percentage of requests are being completed successfully, what is the 90th percentile latency, what is our current cache hit rate or queue length. These kinds of queries are very useful for telling us whether our systems seem to be healthy overall or not, and, in some cases, may provide useful insights into problems or inefficiencies.
However, one thing that metrics are not normally very good for is understanding how user experience varies between different types of requests, why particular requests are latency outliers, and how a single user request flows through backend services (many complex web services involve dozens of backend services or datastores). It may be possible to answer some of these questions with log analysis, but there is a better solution, designed for exactly this problem: distributed tracing.
Distributed tracing has two key concepts: traces and spans. A trace represents a whole request or transaction. Traces are uniquely identified by trace IDs. Traces are made up of a set of spans, each tagged with the trace ID of the trace it belongs to. Each span is a unit of work: a remote procedure call or web request to a specific service, a method execution, or perhaps the time that a message spends in a queue. Spans can have child spans. There are specific tools that are designed to collect and store distributed traces, and to perform useful queries against them.
One of the key aspects of distributed tracing is that when services call other services the trace ID is propagated to those calls (in HTTP-based systems this is done using a special HTTP traceparent header) so that the overall trace may be assembled. This is necessary because each service in a complex chain of calls independently posts its spans to the distributed trace collector. The collector uses the trace ID to assemble the spans together, like a jigsaw puzzle, so that we can see a holistic view of an entire operation.
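To make this concrete, here is a small sketch in Go (using the OpenTelemetry SDK) of a service injecting the traceparent header into an outgoing HTTP request before calling another service. The downstream URL is invented for the example, and the snippet assumes the rest of your tracing setup (tracer provider, spans) is already in place.

```go
package propagationexample

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Register the W3C Trace Context propagator; without this the default
	// propagator is a no-op and no traceparent header would be written.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// callDownstream makes an HTTP request to another service, passing along the
// current trace context so the downstream service's spans join the same trace.
func callDownstream(ctx context.Context) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://inventory:8080/stock", nil)
	if err != nil {
		return err
	}

	// Writes a header like:
	//   traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
	// (version - trace ID - parent span ID - flags).
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```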
OpenTelemetry (also known as OTel) is the main industry standard for distributed tracing. It governs the format of traces and spans, and how traces and spans are collected. It is worth spending some time exploring the OTel documentation, particularly the Concepts section. The awesome-opentelemetry repo is another very comprehensive set of resources.
Distributed tracing is a useful technique for understanding how complex systems are operating.
Using Honeycomb
Honeycomb is an observability platform for storing, querying, and visualising trace data.
Honeycomb provides API endpoints where we can upload trace spans. Honeycomb assembles spans which belong to the same trace. We can then view, query, and inspect those entire traces, seeing how our requests flowed through a system.
We will experiment with Honeycomb locally with a single program running on one computer, to practice uploading and interpreting spans.
Sign up to Honeycomb for free.
Exercise
Write a small standalone command line application which:
- Picks a random number of iterations between 2 and 10 (we'll call it `n`).
- `n` times, creates a span, sleeps for a random amount of time between 10ms and 5s, then uploads the span.
- Between each span, sleeps for a random amount of time between 100ms and 5s.
Each time you run your program, it should use a unique trace ID, but within one program execution, all spans should have the same trace ID.
There are standard libraries for creating and sending OTel spans, such as in Go and in Java.
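As a rough sketch, the Go version of such a program might look something like this. It assumes your Honeycomb API key is in a HONEYCOMB_API_KEY environment variable; the endpoint, header names, and span names here are illustrative, so check the Honeycomb and OpenTelemetry documentation for the exact configuration your setup needs.

```go
package main

import (
	"context"
	"log"
	"math/rand"
	"os"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Send spans to Honeycomb over OTLP/HTTP. The API key goes in the
	// x-honeycomb-team header (some setups also need an x-honeycomb-dataset
	// header - check the Honeycomb docs for your account type).
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("api.honeycomb.io"),
		otlptracehttp.WithHeaders(map[string]string{
			"x-honeycomb-team": os.Getenv("HONEYCOMB_API_KEY"),
		}),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	otel.SetTracerProvider(tp)
	defer func() { _ = tp.Shutdown(ctx) }() // flush buffered spans before exiting

	tracer := otel.Tracer("tracing-exercise")

	// A single root span per run gives each run its own unique trace ID;
	// the per-iteration spans are its children, so they all share that ID.
	ctx, root := tracer.Start(ctx, "run")
	defer root.End()

	n := rand.Intn(9) + 2 // between 2 and 10 iterations
	for i := 0; i < n; i++ {
		_, span := tracer.Start(ctx, "work")
		time.Sleep(randomDuration(10*time.Millisecond, 5*time.Second))
		span.End()

		// Sleep between spans.
		time.Sleep(randomDuration(100*time.Millisecond, 5*time.Second))
	}
}

// randomDuration returns a random duration in [min, max).
func randomDuration(min, max time.Duration) time.Duration {
	return min + time.Duration(rand.Int63n(int64(max-min)))
}
```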
Exercise
Run your program 10 times, making sure it uploads its spans to Honeycomb with your API key.
Explore the Honeycomb UI. Try to work out:
- What was the biggest `n` generated by one of your program runs?
- Which was the fastest run? What was `n` for that run?
- What was the longest individual sleep performed in your program during a span?
- What was the longest individual sleep between spans in your program?
Distributed Tracing in Kafka
We know that metrics can help give us aggregate information about all actions, and distributed tracing can help us better understand the flow of particular requests through systems.
A single cron job invocation is like a user request. It originates in one system (the producer), then flows through Kafka, and may run on one consumer (if it succeeds the first time) or on more than one consumer (if it fails and needs to be retried).
We can use distributed tracing to trace individual cron job invocations.
Exercise
Add span publishing to your producer and consumers.
To end up assembled in the same trace, all of the services will need to know to use the same trace ID. You may need to modify your job data format to enable this.
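One possible approach (a sketch, not the only way to do it) is to serialise the OpenTelemetry trace context into the job message on the producer side and restore it on the consumer side. The Job struct, its trace_carrier field, and the tracer names below are invented for illustration, and the snippet assumes a TextMapPropagator and tracer provider have been registered as in the earlier exercise.

```go
package jobtracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// Job is a hypothetical cron-job message; the TraceCarrier field is the only
// addition needed to carry the trace context through Kafka.
type Job struct {
	ID           string            `json:"id"`
	Command      string            `json:"command"`
	TraceCarrier map[string]string `json:"trace_carrier"`
}

// Producer side: start a span for the enqueue and stash its trace context in
// the job before it is serialised and published to Kafka.
func prepareJob(ctx context.Context, job *Job) {
	ctx, span := otel.Tracer("cron-producer").Start(ctx, "enqueue-job")
	defer span.End()

	carrier := propagation.MapCarrier{}
	otel.GetTextMapPropagator().Inject(ctx, carrier)
	job.TraceCarrier = carrier

	// ... marshal the job to JSON and publish it to the Kafka topic as before ...
}

// Consumer side: rebuild the context from the job, so the "run-job" span is
// recorded as part of the same trace, even after a retry on another consumer.
func handleJob(job *Job) {
	ctx := otel.GetTextMapPropagator().Extract(context.Background(),
		propagation.MapCarrier(job.TraceCarrier))

	_, span := otel.Tracer("cron-consumer").Start(ctx, "run-job")
	defer span.End()

	// ... execute the job, passing ctx on to any further instrumented calls ...
}
```

Because the carrier travels with the job, a retried job produces additional consumer spans in the same trace, which is what lets you answer the retry questions below.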
Run your system, publishing to Honeycomb, and inspect the traces in Honeycomb. Identify:
- How long jobs spend waiting in Kafka between the producer and consumer. What was the longest time a job waited there? What was the shortest time?
- What was the largest number of retries any job took?
- How many jobs failed all of their retries?
- Which jobs fail the most or the least?