Module 9 Exercises: Cluster Observability

See the unseen. Practice querying metrics, searching logs, and mapping traces to become a master of cluster performance.

In Module 9, we transformed our cluster from a "Black Box" into a "Transparent Machine." You learned how to use the Metrics Server, Prometheus, Loki, and Jaeger. These exercises will put those tools to the test in a real-world scenario.


Exercise 1: The "Hunting" Expedition

  1. Preparation: Deploy 10 random pods. One of them must have an intentional memory leak (e.g., a Python script that keeps appending to a list; see the sketch after this list).
  2. Task: Use kubectl top pods to find the "Leak."
  3. Analysis: What is the difference between CPU% and MEMORY% in kubectl top nodes? How does this output help you decide if it's time to add a new node?
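
A minimal sketch of the leaky workload for step 1, assuming you run it in a pod built from any Python base image (the script and its numbers are illustrative, not prescribed by the exercise):

# leak.py - grows memory without bound by appending to a list forever
import time

data = []
while True:
    data.append("x" * 10_000_000)  # hold on to roughly 10 MB more each loop
    time.sleep(1)                  # slow enough to watch the climb in kubectl top pods

The other nine pods can be any ordinary workloads; they only exist so the leaker has to be found rather than guessed.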

Exercise 2: Master of PromQL

  1. Goal: You want to create a Grafana dashboard that alerts you when a pod restarts.
  2. Task: Write the PromQL query to find the total number of pod restarts in the production namespace over the last hour.
  3. Bonus: Write a query to find which pod has the highest memory usage relative to its Request. (Hint: You need to divide container_memory_usage_bytes by kube_pod_container_resource_requests, filtered to the memory resource.)

Exercise 3: The Log Detective

  1. Setup: You have a distributed AI app with a Frontend, API, and Worker.
  2. Scenario: A user reports an "Internal Server Error."
  3. Task: Use LogQL in the Grafana "Explore" view to:
    • Filter by namespace="ai-prod".
    • Search for the string "error" (case-insensitive).
    • Find the Request-ID associated with that error.
    • Use that Request-ID to find the corresponding logs in the Worker pod.

Exercise 4: Distributed Bottleneck Search

  1. Scenario: You look at a Jaeger trace for a /summarize request. You see:
    • app-frontend: 50ms
    • app-api: 100ms
    • vector-db-search: 9,500ms
    • llm-call: 400ms
  2. Analysis: What is the bottleneck? Is the LLM (OpenAI/AWS) the primary cause of latency?
  3. Action: If you were the lead engineer, which service or infrastructure component would you investigate next?

Solutions (Self-Check)

Exercise 1 Answer:

The "Leak" will be the pod whose memory usage is constantly increasing while its CPU usage stays flat. MEMORY% in kubectl top nodes is the most critical metric for stability; once a node hits 100%, the OOM killer starts!

Exercise 2 Solution:

increase(kube_pod_container_status_restarts_total{namespace="production"}[1h])
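
For the bonus question, one possible query is sketched below. The exact label names (for example resource="memory") depend on your kube-state-metrics version, so adjust it to match your setup:

topk(5,
  sum by (pod) (container_memory_usage_bytes{namespace="production", container!=""})
  /
  sum by (pod) (kube_pod_container_resource_requests{namespace="production", resource="memory"})
)

A result above 1 means the pod is already using more memory than it requested.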

Exercise 3 Hint:

You can use a line filter with a case-insensitive regex: {namespace="ai-prod"} |~ "(?i)error". Once you find the ID, the query becomes {namespace="ai-prod"} |= "REQ-12345".
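
If your services emit structured JSON logs, you can also pull the request ID straight out of the error lines. A sketch, assuming the field is called request_id (your app may name it differently):

{namespace="ai-prod"} |~ "(?i)error" | json | line_format "{{.request_id}}"

With the ID in hand, narrow the second query to the Worker pods using whatever label your Deployment sets, for example {namespace="ai-prod", app="worker"} |= "REQ-12345" (the app label here is an assumption).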

Exercise 4 Logic:

The Vector DB Search is the massive bottleneck (9.5 seconds). The LLM is surprisingly fast (400ms). You should investigate:

  1. Vector DB index health.
  2. Network latency between API and Vector DB.
  3. Whether the search query is too complex or the index is unoptimized.

Summary of Module 9

Congratulations! You have completed the Observability Module.

  • You are a master of the Metrics Server and kubectl top.
  • You can write complex PromQL queries to extract insights.
  • You can search millions of logs effortlessly using Loki and LogQL.
  • You can map the "Mind" of a microservice, tracing each request end to end, using Jaeger.

In Module 10: Security in Kubernetes, we will move beyond observation and learn how to secure the cluster against hackers and internal misconfigurations.
