Module 13 Lesson 5: Monitoring Performance Metrics
Visualizing the health of your cluster. Using Prometheus and Grafana to track tokens-per-second and VRAM usage.
Monitoring: The AI Dashboard
If you are running a scaled AI system for a team, you need more than just a terminal. You need to see "Live" graphs of your hardware health and your token economy.
1. Key Metrics to Track
- Tokens Per Second (t/s): The most important metric for user satisfaction (a worked example follows this list).
- VRAM Utilization: Are you close to a crash?
- Queue Length: How many people are currently waiting for an answer?
- Model Distribution: Which models are being used the most (Llama vs CodeLlama)?
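You can measure tokens per second directly: every non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds spent generating them). A minimal sketch, assuming Ollama on its default port and a llama3 model already pulled locally:

```python
import requests

# Ask Ollama for a completion and compute t/s from the timing fields
# in the /api/generate response (eval_duration is in nanoseconds).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
data = resp.json()

tokens = data["eval_count"]            # tokens generated
seconds = data["eval_duration"] / 1e9  # nanoseconds -> seconds
print(f"{tokens} tokens in {seconds:.2f}s = {tokens / seconds:.1f} t/s")
```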
2. Using the /api/ps Endpoint
Ollama's /api/ps endpoint (the JSON counterpart of the ollama ps command) provides a snapshot of what is running. It answers questions like:
- Which models are currently in RAM?
- How much VRAM is each model using?
- When does the keep_alive timer expire?
You can write a simple Python script to poll this every 5 seconds and send it to a database.
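Here is a minimal sketch of that polling loop. The name, size_vram, and expires_at fields come from Ollama's documented /api/ps response; swap the print for an insert into your database of choice:

```python
import time
import requests

OLLAMA = "http://localhost:11434"  # assumes the default Ollama port

def poll_loaded_models():
    """Print a one-line snapshot of every model currently in memory."""
    models = requests.get(f"{OLLAMA}/api/ps", timeout=5).json().get("models", [])
    for m in models:
        vram_gb = m.get("size_vram", 0) / 1e9
        print(f'{m["name"]}: {vram_gb:.1f} GB VRAM, expires {m.get("expires_at", "?")}')

while True:
    poll_loaded_models()
    time.sleep(5)
```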
3. The Prometheus + Grafana Stack
Professional AI engineers use Prometheus to collect metrics and Grafana to visualize them.
- Ollama Exporter: There are community tools on GitHub (like ollama-exporter) that connect Prometheus directly to Ollama (a do-it-yourself sketch follows this list).
- Visuals: You can build a dashboard that shows a "Big Green Number" for your current cluster speed.
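If you would rather not depend on a community exporter, a few lines of Python with the official prometheus_client library do the same job. This is a sketch, not a production exporter: the metric names and port 9877 are arbitrary choices, and it assumes Ollama on its default port.

```python
import time
import requests
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names -- rename to fit your own scheme.
VRAM_BYTES = Gauge("ollama_model_vram_bytes", "VRAM used per loaded model", ["model"])
LOADED = Gauge("ollama_models_loaded", "Number of models currently in memory")

def collect():
    """Refresh the gauges from Ollama's /api/ps snapshot."""
    models = requests.get("http://localhost:11434/api/ps", timeout=5).json().get("models", [])
    LOADED.set(len(models))
    for m in models:
        VRAM_BYTES.labels(model=m["name"]).set(m.get("size_vram", 0))

if __name__ == "__main__":
    start_http_server(9877)  # Prometheus scrapes http://<host>:9877/metrics
    while True:
        collect()
        time.sleep(5)
```

Point a Prometheus scrape job at port 9877 and the gauges become queryable, which is exactly what the dashboard and the alerts below build on.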
4. Setting Up Alerting
Monitoring should warn you BEFORE the system fails.
- Alert: "VRAM > 95% for 1 minute."
- Alert: "Average response time > 10 seconds."
This allows you to either clear the cache or tell your teammates: "The AI is under heavy load right now, expect delays."
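The first alert translates into a Prometheus rule like the sketch below. It uses the hypothetical ollama_model_vram_bytes metric from the exporter above and assumes a 24 GB card; adjust the threshold for your GPU.

```yaml
groups:
  - name: ollama
    rules:
      - alert: OllamaVramNearLimit
        # sum of per-model VRAM from the exporter sketch;
        # 22.8e9 bytes is 95% of a 24 GB card
        expr: sum(ollama_model_vram_bytes) > 22.8e9
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Ollama VRAM above 95% for 1 minute"
```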
Key Takeaways
- Monitoring ensures your local AI cluster stays healthy and fast.
- Tokens Per Second is your primary KPI (Key Performance Indicator).
- The /api/ps endpoint is the source of truth for runtime state.
- Grafana is the best way to visualize AI performance for non-technical stakeholders.