
Module 9 Exercises: Cluster Observability
See the unseen. Practice querying metrics, searching logs, and mapping traces to become a master of cluster performance.
In Module 9, we transformed our cluster from a "Black Box" into a "Transparent Machine." You learned how to use the Metrics Server, Prometheus, Loki, and Jaeger. These exercises will put those tools to the test in a real-world scenario.
Exercise 1: The "Hunting" Expedition
- Preparation: Deploy 10 random pods. One of them must have an intentional memory leak (e.g., a Python script that keeps appending to a list; a sketch follows after this list).
- Task: Use kubectl top pods to find the "Leak."
- Analysis: What is the difference between CPU% and MEMORY% in kubectl top nodes? How does this output help you decide if it's time to add a new node?
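A minimal sketch of the deliberate leak, assuming a plain Python container; the filename, chunk size, and sleep interval are illustrative rather than prescribed by the exercise.

```python
# leak.py: illustrative memory leak for Exercise 1. The list is never
# released, so the container's working set grows by roughly 1 MiB per second.
import time

leaked = []
while True:
    leaked.append(bytearray(1024 * 1024))  # hold on to another ~1 MiB
    time.sleep(1)
```

Run it in one of the ten pods with any stock Python image and it will stand out within a few minutes of watching kubectl top pods.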
Exercise 2: Master of PromQL
- Goal: You want to create a Grafana dashboard that alerts you when a pod restarts.
- Task: Write the PromQL query to find the total number of pod restarts in the production namespace over the last hour.
- Bonus: Write a query to find which pod has the highest memory usage relative to its Request. (Hint: You need to divide container_memory_usage_bytes by kube_pod_container_resource_requests.)
Exercise 3: The Log Detective
- Setup: You have a distributed AI app with a Frontend, API, and Worker.
- Scenario: A user reports an "Internal Server Error."
- Task: Use LogQL in the Grafana "Explore" view to:
  - Filter by namespace="ai-prod".
  - Search for the string "error" (case-insensitive).
  - Find the Request-ID associated with that error.
  - Use that Request-ID to find the corresponding logs in the Worker pod.
Exercise 4: Distributed Bottleneck Search
- Scenario: You look at a Jaeger trace for a /summarize request. You see:
  - app-frontend: 50 ms
  - app-api: 100 ms
  - vector-db-search: 9,500 ms
  - llm-call: 400 ms
- Analysis: What is the bottleneck? Is the LLM (OpenAI/AWS) the primary cause of latency?
- Action: If you were the lead engineer, which service or infrastructure component would you investigate next?
Solutions (Self-Check)
Exercise 1 Answer:
The "Leak" will be the pod whose memory usage is constantly increasing while its CPU usage stays flat.
MEMORY% in kubectl top nodes is the most critical metric for stability; once a node runs out of memory, the kernel's OOM killer starts terminating processes!
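A quick way to run the hunt from the command line, as a sketch (the --sort-by flag is available in reasonably recent kubectl releases):

```bash
# List pods ordered by memory so the leaker rises to the top;
# re-run it a few times and watch whose number keeps climbing.
kubectl top pods --sort-by=memory

# Node-level view: CPU% and MEMORY% are usage relative to each node's
# allocatable capacity, which is what tells you whether to add a node.
kubectl top nodes
```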
Exercise 2 Solution:
increase(kube_pod_container_status_restarts_total{namespace="production"}[1h])
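Bonus sketch: the usage-to-request ratio. Exact label names depend on your cAdvisor and kube-state-metrics versions, so treat this as a starting point rather than a drop-in answer.

```promql
# Memory usage divided by the memory request, per container;
# values above 1 mean the container uses more than it requested.
# Wrap the whole expression in topk(1, ...) to single out the worst offender.
sum by (namespace, pod, container) (container_memory_usage_bytes{container!=""})
  /
sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory"})
```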
Exercise 3 Hint:
You can use a line filter with a case-insensitive regex: {namespace="ai-prod"} |~ "(?i)error".
Once you find the ID, the query becomes {namespace="ai-prod"} |= "REQ-12345".
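Putting the two steps together as a sketch; the app="worker" label is an assumption about how the Worker pods are labelled in your cluster, so adjust the stream selector to match your deployment.

```logql
# Step 1: surface the error lines (|~ with (?i) makes the match case-insensitive)
{namespace="ai-prod"} |~ "(?i)error"

# Step 2: pivot on the Request-ID you found (REQ-12345 is the example ID from
# the hint) and narrow the stream selector to the Worker pods
{namespace="ai-prod", app="worker"} |= "REQ-12345"
```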
Exercise 4 Logic:
The Vector DB Search is the massive bottleneck (9.5 seconds). The LLM is surprisingly fast (400ms). You should investigate:
- Vector DB index health.
- Network latency between API and Vector DB.
- Whether the search query is too complex or the index is unoptimized.
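To put numbers on it (assuming the spans run sequentially): the trace totals roughly 50 + 100 + 9,500 + 400 = 10,050 ms, and vector-db-search alone accounts for about 94% of that, so shaving even half of it off would help far more than optimizing the LLM call.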
Summary of Module 9
Congratulations! You have completed the Observability Module.
- You are a master of the Metrics Server and kubectl top.
- You can write complex PromQL queries to extract insights.
- You can search millions of logs effortlessly using Loki and LogQL.
- You can map the "Mind" of a microservice using Jaeger.
In Module 10: Security in Kubernetes, we will move beyond observation and learn how to secure the cluster against hackers and internal misconfigurations.