
Rate Limiting and Token Quotas: Resource Governance
Learn how to protect your AI infrastructure from abuse and over-spending. Master the implementation of 'Token Buckets' and tiered user quotas.
When you move from a prototype to a production system with thousands of users, your Token Liability becomes your biggest risk. A single aggressive user (or a bot) can burn through your monthly API budget in a single afternoon.
Rate Limiting is the practice of restricting the frequency of requests. Token Quotas go a step further, restricting the total volume of tokens consumed over a period (e.g., 24 hours).
In this lesson, we learn how to build a Governance Layer that keeps your infrastructure stable and your billing predictable.
1. Request Rate vs. Token Rate
- Requests per Minute (RPM): Limits how many times a user can call POST /chat.
- Risk: A user can send one 128k-context request per minute and still bankrupt you.
- Tokens per Minute (TPM): Limits the total token load on your infrastructure.
- Reliability: TPM, not RPM, is the metric that determines your financial stability.
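The risk in the RPM bullet above is easy to quantify. A quick back-of-the-envelope calculation (the $2.50 per million input tokens price is illustrative, not tied to any specific provider):

```python
# Why RPM alone is insufficient: a worked example.
# Assumed price of $2.50 per 1M input tokens (illustrative only).
PRICE_PER_TOKEN = 2.50 / 1_000_000

rpm_limit = 1                 # a "strict" limit: 1 request per minute
tokens_per_request = 128_000  # ...but each request maxes out the context window

tokens_per_day = rpm_limit * 60 * 24 * tokens_per_request
cost_per_day = tokens_per_day * PRICE_PER_TOKEN

print(f"Tokens/day: {tokens_per_day:,}")    # 184,320,000
print(f"Cost/day:   ${cost_per_day:,.2f}")  # $460.80
```

One well-behaved-looking user at 1 RPM can still cost hundreds of dollars a day, which is exactly why the next sections limit tokens, not requests.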
2. The Token Bucket Algorithm
The Token Bucket is the most elegant way to handle burst traffic while maintaining a long-term average.
- The Bucket: Every user has a "Bucket" that can hold 10,000 tokens.
- The Drain: When a user makes a request for 2,000 tokens, the bucket drops to 8,000.
- The Refill: Every second, 100 new tokens are added to the bucket (up to the max).
- The Block: If the bucket is empty, the user gets a 429 Too Many Requests error.
```mermaid
graph TD
    A[Incoming Request: 500 tokens] --> B{Bucket >= 500?}
    B -->|Yes| C[Execute Request]
    C --> D[Decrement Bucket]
    B -->|No| E[Reject with 429]
    F[Refill Timer: +100 tokens/sec] --> G[Update All Buckets]
```
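The four steps above can be sketched as a minimal in-memory bucket. This is a single-process illustration only; a distributed deployment needs the Redis approach covered in the next section.

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket (single process only).

    capacity:    maximum tokens the bucket can hold
    refill_rate: tokens added per second, capped at capacity
    """
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def _refill(self):
        # Add tokens for the elapsed time, never exceeding capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_consume(self, requested: int) -> bool:
        """Drain the bucket and return True if enough tokens remain."""
        self._refill()
        if self.tokens >= requested:
            self.tokens -= requested
            return True
        return False

bucket = TokenBucket(capacity=10_000, refill_rate=100)
print(bucket.try_consume(2_000))  # True  -> roughly 8,000 tokens left
print(bucket.try_consume(9_000))  # False -> not enough tokens yet
```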
3. Implementation: Redis-Based Token Limiter (Python)
Using Redis is the standard for distributed rate limiting.
Python Code: Token Quota Middleware
```python
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

DEFAULT_QUOTA = 100_000  # tokens per user per period

def check_token_quota(user_id, requested_tokens):
    key = f"quota:{user_id}"
    # Initialize the counter on first use (SET NX = only if absent),
    # so DECRBY does not start from 0 and go negative.
    r.set(key, DEFAULT_QUOTA, nx=True)
    current_tokens = int(r.get(key))
    if current_tokens < requested_tokens:
        return False, "Quota Exceeded"
    # Note: GET followed by DECRBY is not atomic as a pair; concurrent
    # requests can race between the check and the decrement.
    r.decrby(key, requested_tokens)
    return True, "Authorized"
```
In your FastAPI endpoint:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.post("/v1/chat")
async def chat(request: ChatRequest):
    # PRE-CALCULATE tokens using tiktoken (Module 1.1)
    est_tokens = count_tokens(request.prompt)
    allowed, msg = check_token_quota(request.user_id, est_tokens)
    if not allowed:
        raise HTTPException(status_code=429, detail=msg)
    return execute_llm_call(request)
```
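One caveat: the GET-then-DECRBY pattern in check_token_quota has a race window, since two concurrent requests can both pass the check before either decrements. A common remedy is a Redis Lua script, which executes atomically on the server. A sketch under that assumption (the script and helper names here are illustrative, not from the lesson):

```python
# Lua executes atomically inside Redis: nothing can interleave between
# the read and the write. KEYS[1] = quota key, ARGV[1] = requested
# tokens, ARGV[2] = default quota for a first-time user.
QUOTA_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or ARGV[2])
local requested = tonumber(ARGV[1])
if current < requested then
    return -1
end
redis.call('SET', KEYS[1], current - requested)
return current - requested
"""

def check_token_quota_atomic(r, user_id, requested_tokens,
                             default_quota=100_000):
    """Atomic check-and-decrement via EVAL (r is a redis.Redis client)."""
    remaining = r.eval(QUOTA_LUA, 1, f"quota:{user_id}",
                       requested_tokens, default_quota)
    if remaining < 0:
        return False, "Quota Exceeded"
    return True, "Authorized"
```

Because the check and the write happen inside one EVAL call, no second request can sneak in between them, regardless of how many API workers share the Redis instance.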
4. Tiered User Quotas
In a SaaS business model, you align Token Access with Revenue.
- Free Tier: 10k tokens / day. Access to small models (GPT-4o mini) only.
- Pro Tier: 500k tokens / day. Access to expert models.
- Enterprise: Custom quotas.
Token Efficiency Tip: When a "Pro" user hits their daily limit, instead of cutting them off entirely, you can Auto-Downgrade them to a Tier 1 model. This provides a "Degraded but Working" experience that keeps the user happy without increasing your bill.
5. Summary and Key Takeaways
- RPM is not enough: You must limit Tokens Per Minute (TPM).
- Token Buckets: Use Redis to manage distributed, high-performance quotas.
- Pre-Calculation: Estimate tokens with a tokenizer before calling the API, so over-quota requests are blocked before you pay for them.
- Alignment: Map quotas directly to your pricing tiers to ensure every user is profitable.
In the next lesson, Cost-Aware Routing at Scale, we look at how to manage these quotas across a multi-model fleet.
Exercise: The Quota Stress Test
- Set a user quota to 500 tokens.
- Attempt to send a 600-token prompt.
- Observe: Did your middleware block it?
- Now, send two 300-token prompts.
- Analyze: The first should pass, the second should fail.
- Calculations: If you have 1,000 such "Blocked" users per day, how much money did your middleware save you from spending on the LLM provider?
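Here is a self-contained harness for the exercise (no Redis, so the behavior is easy to observe); the $2.50 per million tokens price in the savings estimate is an assumption for illustration:

```python
quota = {"user-1": 500}  # tokens remaining for the day

def check(user_id: str, requested: int) -> bool:
    if quota[user_id] < requested:
        return False          # middleware blocks: would return HTTP 429
    quota[user_id] -= requested
    return True

# Steps 1-3: one 600-token prompt against a 500-token quota
assert check("user-1", 600) is False

# Steps 4-5: two 300-token prompts
assert check("user-1", 300) is True    # first passes (500 -> 200 left)
assert check("user-1", 300) is False   # second fails (only 200 left)

# Step 6: savings estimate, assuming an illustrative $2.50 / 1M tokens
blocked_users = 1_000
blocked_tokens = blocked_users * 600
savings = blocked_tokens * 2.50 / 1_000_000
print(f"Daily savings: ${savings:.2f}")  # $1.50
```

At these illustrative prices the per-day saving looks small, but the same middleware also caps the worst case: without it, those 1,000 users could each have sent far more than 600 tokens.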