Distributed Tracing: OpenTelemetry and Jaeger

In a complex system, a single user request might:

Hit your FastAPI Gateway.
Which calls an Auth Service.
Which calls a Database.
Which calls an AI Model.

If that request takes 5 seconds, where is the bottleneck? Tracing allows you to see the exact lifecycle of that single request.

1. Spans and Traces

Trace: The entire journey of a request from start to finish.
Span: A single "Step" in that journey (e.g., "SQL Query", "External API Call").
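
To make the two terms concrete, here is a minimal, standard-library-only sketch of the data model. The Span class and its fields are hypothetical simplifications for illustration, not the OpenTelemetry API. The point it tries to show: every span carries the ID of the trace it belongs to and the ID of its parent span, which is what lets a tracing backend reassemble the journey.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass
class Span:
    """One timed step in a request (illustrative, not the OTel class)."""
    name: str
    trace_id: str              # shared by every span in the same request
    parent_id: Optional[str]   # None for the root span
    start_ms: int
    end_ms: int
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


trace_id = uuid.uuid4().hex  # one ID for the whole request

root = Span("GET /orders", trace_id, None, start_ms=0, end_ms=120)
auth = Span("Auth Check", trace_id, root.span_id, start_ms=5, end_ms=30)
query = Span("SQL Query", trace_id, root.span_id, start_ms=35, end_ms=110)

# A "trace" is simply every span that shares the same trace_id.
trace = [root, auth, query]
for span in trace:
    indent = "  " if span.parent_id else ""
    print(f"{indent}{span.name}: {span.duration_ms} ms")
```

Real tracing libraries generate and propagate these IDs for you; the sketch only shows what they keep track of.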

2. Using OpenTelemetry (OTel)

OpenTelemetry is a vendor-neutral standard for observability data (traces, metrics, and logs). In FastAPI, the opentelemetry-instrumentation-fastapi package adds middleware that automatically creates a span for every incoming request.

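from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()

# Add tracing middleware so every incoming request produces a span.
# Spans are only exported once a tracer provider and exporter are configured.
FastAPIInstrumentor.instrument_app(app)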

3. Visualizing with Jaeger

Jaeger is an open-source tracing backend that stores your traces and displays them in a web UI. It shows a timeline of each request, highlighting which spans took the most time.

Why Tracing is Better than Logs:

A log tells you that something failed. A trace shows you where in the chain it failed, how long each step took, and, if you attach them as span attributes, the inputs and outputs of the steps leading up to the failure.

4. Sampling

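In high-traffic APIs (10,000+ RPS), tracing 100% of requests is rarely worth it: recording and exporting every span adds overhead and fills up your storage. Instead, we use Sampling (e.g., trace 1% of requests), which at that volume still yields enough traces for a representative view of performance. Keep in mind that uniform sampling can miss rare failures.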

Visualizing the Trace Timeline

gantt
    title Request Trace for /process-payment
    dateFormat  X
    axisFormat %s 
    
    section Gateway
    FastAPI Handler      :0, 500
    section Services
    Auth Check           :10, 50
    Balance Check        :60, 150
    Stripe API Call      :160, 480
    section Database
    Log Transaction      :485, 495

Summary

Distributed Tracing: Essential for microservices and complex apps.
OpenTelemetry: The vendor-neutral standard for generating and exporting traces.
Spans: The building blocks of a trace.
Bottleneck Detection: Use traces to find out exactly why a request is slow.

In the next lesson, we wrap up Module 18 with Exercises on observability and monitoring.

Exercise: The Bottleneck Detective

Look at the Gantt chart above.

Which part of the request is taking the most time?
If you wanted to speed up this API, which service would you focus on optimizing first?
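
One way to check your answer is to turn the chart into numbers. The sketch below copies the offsets from the Gantt chart, assuming each pair is a start and end time in milliseconds, and computes how long each span lasted plus the time the handler spent outside the listed child spans. The variable names are illustrative.

```python
# (start_ms, end_ms) pairs copied from the Gantt chart above.
handler = (0, 500)
children = {
    "Auth Check": (10, 50),
    "Balance Check": (60, 150),
    "Stripe API Call": (160, 480),
    "Log Transaction": (485, 495),
}

durations = {name: end - start for name, (start, end) in children.items()}

# The child spans never overlap, so the handler's own time is the total
# minus everything spent waiting on children.
handler_self_time = (handler[1] - handler[0]) - sum(durations.values())

# Print the spans from slowest to fastest.
for name, ms in sorted(durations.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<16} {ms:>4} ms")
print(f"{'Handler (self)':<16} {handler_self_time:>4} ms")
```

Run it after you have written down your own answer, then compare.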