
Scaling Models: Quotas and Performance
Going viral? Learn how to handle thousands of users, request quota increases, and optimize throughput.
Scaling Models
Your app hit the front page of Hacker News. Now what?
Quotas
By default, the API enforces a rate limit, for example 60 RPM (requests per minute).
- Check: Go to GCP Console -> IAM & Admin -> Quotas.
- Monitor: Are you hitting the ceiling?
- Request: Click "Edit Quotas" to request a higher limit well before launch; until the increase is approved, back off gracefully when you do hit the limit (see the sketch after this list).
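A minimal sketch of graceful backoff, assuming a hypothetical `call_gemini()` helper and a client that surfaces quota errors as exceptions mentioning "429"; adapt the exception handling to whatever your actual SDK raises:

```python
import random
import time

def call_gemini(prompt: str) -> str:
    # Hypothetical placeholder -- replace with your real client call.
    return f"(model answer to: {prompt})"

def call_with_backoff(prompt: str, max_retries: int = 5) -> str:
    """Retry quota (429) errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call_gemini(prompt)
        except Exception as err:
            # Only retry rate-limit errors; re-raise anything else immediately.
            if "429" not in str(err) or attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus jitter so clients don't retry in lockstep.
            time.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError("unreachable")
```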
Caching
If 50% of your users ask "What is this app?", don't call Gemini every time.
- Semantic Cache: Check whether an incoming question is similar in meaning to one you have already answered (for example, by comparing embeddings). If so, return the cached answer instead of calling the model (see the sketch below).
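Here is a toy sketch of the idea. The `SemanticCache` class and `toy_embed` helper are illustrative, not a library API: in practice `embed_fn` would be a real embedding model (the hash-based stand-in below only exists so the demo runs), and 0.9 is an arbitrary similarity threshold.

```python
import hashlib
import math

class SemanticCache:
    """Reuse an answer when a new question is 'close enough' to an old one."""

    def __init__(self, embed_fn, threshold: float = 0.9):
        self.embed_fn = embed_fn      # maps text -> vector (list of floats)
        self.threshold = threshold    # cosine similarity needed for a cache hit
        self.entries = []             # list of (embedding, cached answer)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, question: str):
        query = self.embed_fn(question)
        for emb, answer in self.entries:
            if self._cosine(query, emb) >= self.threshold:
                return answer         # cache hit: no model call needed
        return None                   # cache miss: caller queries the model

    def put(self, question: str, answer: str):
        self.entries.append((self.embed_fn(question), answer))

def toy_embed(text: str):
    # Toy stand-in for a real embedding model: hash words into a small vector.
    vec = [0.0] * 32
    for word in text.lower().split():
        word = word.strip("?!.,")
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % 32] += 1.0
    return vec

cache = SemanticCache(embed_fn=toy_embed)
cache.put("What is this app?", "It's a demo that answers questions with Gemini.")
print(cache.get("what is this app"))  # similar question -> cached answer, no API call
```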
Latency Scaling
If one request takes 10 seconds and a single process handles requests synchronously, that process can serve only 6 RPM.
- Async: Python's asyncio lets a single process handle hundreds of concurrent requests while they wait on I/O. Make sure your web server (Uvicorn/Gunicorn) is configured for concurrency (see the sketch below).
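A minimal sketch of the async pattern, using a fake `call_model` coroutine in which `asyncio.sleep` stands in for the 10-second API call, plus a semaphore to cap in-flight requests so you stay under your quota:

```python
import asyncio

async def call_model(prompt: str) -> str:
    # Fake model call: asyncio.sleep stands in for ~10s of network waiting.
    await asyncio.sleep(10)
    return f"answer to: {prompt}"

async def handle_many(prompts, max_in_flight: int = 50):
    """One process, many concurrent requests: while one request waits on the
    network, the event loop makes progress on the others."""
    semaphore = asyncio.Semaphore(max_in_flight)  # cap concurrency to respect quotas

    async def one(prompt: str) -> str:
        async with semaphore:
            return await call_model(prompt)

    return await asyncio.gather(*(one(p) for p in prompts))

if __name__ == "__main__":
    # 50 requests of 10s each finish in ~10s of wall-clock time, not 500s.
    answers = asyncio.run(handle_many([f"question {i}" for i in range(50)]))
    print(len(answers), "answers")
```

The same idea applies inside a web app: run Uvicorn with async route handlers (and multiple workers under Gunicorn) so each process can keep many requests in flight.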
Summary
Scaling AI is mostly about quota management and async I/O.
In the final lesson of this module, we discuss Costs.