
Rate Limiting and Token Quotas: Resource Governance
Learn how to protect your AI infrastructure from abuse and over-spending. Master the implementation of 'Token Buckets' and tiered user quotas.
When you move from a prototype to a production system with thousands of users, your Token Liability becomes your biggest risk. A single aggressive user (or a bot) can burn through your monthly API budget in a single afternoon.
Rate Limiting is the practice of restricting the frequency of requests. Token Quotas go a step further, restricting the total volume of tokens consumed over a period (e.g., 24 hours).
In this lesson, we learn how to build a Governance Layer that keeps your infrastructure stable and your billing predictable.
1. Request Rate vs. Token Rate
- Requests per Minute (RPM): Limits how many times a user can call POST /chat.
- Risk: A user can send one 128k-context request per minute and still bankrupt you.
- Tokens per Minute (TPM): Limits the total token load on your infrastructure.
- Reliability: TPM, not RPM, is the metric that determines your financial stability.
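The risk in the RPM bullet above is easy to quantify. A quick back-of-the-envelope calculation (the $2.50 per million input tokens price is illustrative, not tied to any specific provider):

```python
# Why RPM alone is insufficient: a worked example.
# Assumed price of $2.50 per 1M input tokens (illustrative only).
PRICE_PER_TOKEN = 2.50 / 1_000_000

rpm_limit = 1                 # a "strict" limit: 1 request per minute
tokens_per_request = 128_000  # ...but each request maxes out the context window

tokens_per_day = rpm_limit * 60 * 24 * tokens_per_request
cost_per_day = tokens_per_day * PRICE_PER_TOKEN

print(f"Tokens/day: {tokens_per_day:,}")    # 184,320,000
print(f"Cost/day:   ${cost_per_day:,.2f}")  # $460.80
```

One well-behaved-looking user at 1 RPM can still cost hundreds of dollars a day, which is exactly why the next sections limit tokens, not requests.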
2. The Token Bucket Algorithm
The Token Bucket is the most elegant way to handle burst traffic while maintaining a long-term average.
- The Bucket: Every user has a "Bucket" that can hold 10,000 tokens.
- The Drain: When a user makes a request for 2,000 tokens, the bucket drops to 8,000.
- The Refill: Every second, 100 new tokens are added to the bucket (up to the max).
- The Block: If the bucket is empty, the user gets a 429 Too Many Requests error.
```mermaid
graph TD
    A[Incoming Request: 500 tokens] --> B{Bucket >= 500?}
    B -->|Yes| C[Execute Request]
    C --> D[Decrement Bucket]
    B -->|No| E[Reject with 429]
    F[Refill Timer: +100 tokens/sec] --> G[Update All Buckets]
```
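The four steps above can be sketched as a minimal in-memory bucket. This is a single-process illustration only; a distributed deployment needs the Redis approach covered in the next section.

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket (single process only).

    capacity:    maximum tokens the bucket can hold
    refill_rate: tokens added per second, capped at capacity
    """
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def _refill(self):
        # Add tokens for the elapsed time, never exceeding capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_consume(self, requested: int) -> bool:
        """Drain the bucket and return True if enough tokens remain."""
        self._refill()
        if self.tokens >= requested:
            self.tokens -= requested
            return True
        return False

bucket = TokenBucket(capacity=10_000, refill_rate=100)
print(bucket.try_consume(2_000))  # True  -> roughly 8,000 tokens left
print(bucket.try_consume(9_000))  # False -> not enough tokens yet
```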
3. Implementation: Redis-Based Token Limiter (Python)
Using Redis is the standard for distributed rate limiting.
Python Code: Token Quota Middleware
```python
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

DEFAULT_QUOTA = 100_000  # tokens per user per period

def check_token_quota(user_id, requested_tokens):
    key = f"quota:{user_id}"
    # Initialize the counter on first use (SET NX = only if absent),
    # so DECRBY does not start from 0 and go negative.
    r.set(key, DEFAULT_QUOTA, nx=True)
    current_tokens = int(r.get(key))
    if current_tokens < requested_tokens:
        return False, "Quota Exceeded"
    # Note: GET followed by DECRBY is not atomic as a pair; concurrent
    # requests can race between the check and the decrement.
    r.decrby(key, requested_tokens)
    return True, "Authorized"
```
In your FastAPI endpoint:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.post("/v1/chat")
async def chat(request: ChatRequest):
    # PRE-CALCULATE tokens using tiktoken (Module 1.1)
    est_tokens = count_tokens(request.prompt)
    allowed, msg = check_token_quota(request.user_id, est_tokens)
    if not allowed:
        raise HTTPException(status_code=429, detail=msg)
    return execute_llm_call(request)
```
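One caveat: the GET-then-DECRBY pattern in check_token_quota has a race window, since two concurrent requests can both pass the check before either decrements. A common remedy is a Redis Lua script, which executes atomically on the server. A sketch under that assumption (the script and helper names here are illustrative, not from the lesson):

```python
# Lua executes atomically inside Redis: nothing can interleave between
# the read and the write. KEYS[1] = quota key, ARGV[1] = requested
# tokens, ARGV[2] = default quota for a first-time user.
QUOTA_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or ARGV[2])
local requested = tonumber(ARGV[1])
if current < requested then
    return -1
end
redis.call('SET', KEYS[1], current - requested)
return current - requested
"""

def check_token_quota_atomic(r, user_id, requested_tokens,
                             default_quota=100_000):
    """Atomic check-and-decrement via EVAL (r is a redis.Redis client)."""
    remaining = r.eval(QUOTA_LUA, 1, f"quota:{user_id}",
                       requested_tokens, default_quota)
    if remaining < 0:
        return False, "Quota Exceeded"
    return True, "Authorized"
```

Because the check and the write happen inside one EVAL call, no second request can sneak in between them, regardless of how many API workers share the Redis instance.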
4. Tiered User Quotas
In a SaaS business model, you align Token Access with Revenue.
- Free Tier: 10k tokens / day. Access to small models (GPT-4o mini) only.
- Pro Tier: 500k tokens / day. Access to expert models.
- Enterprise: Custom quotas.
Token Efficiency Tip: When a "Pro" user hits their daily limit, instead of cutting them off entirely, you can Auto-Downgrade them to a Tier 1 model. This provides a "Degraded but Working" experience that keeps the user happy without increasing your bill.
5. Summary and Key Takeaways
- RPM is not enough: You must limit Tokens Per Minute (TPM).
- Token Buckets: Use Redis to manage distributed, high-performance quotas.
- Pre-Calculation: Estimate tokens with a tokenizer before calling the API, so over-quota requests are blocked before you pay for them.
- Alignment: Map quotas directly to your pricing tiers to ensure every user is profitable.
In the next lesson, Cost-Aware Routing at Scale, we look at how to manage these quotas across a multi-model fleet.
Exercise: The Quota Stress Test
- Set a user quota to 500 tokens.
- Attempt to send a 600-token prompt.
- Observe: Did your middleware block it?
- Now, send two 300-token prompts.
- Analyze: The first should pass, the second should fail.
- Calculations: If you have 1,000 such "Blocked" users per day, how much money did your middleware save you from spending on the LLM provider?
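Here is a self-contained harness for the exercise (no Redis, so the behavior is easy to observe); the $2.50 per million tokens price in the savings estimate is an assumption for illustration:

```python
quota = {"user-1": 500}  # tokens remaining for the day

def check(user_id: str, requested: int) -> bool:
    if quota[user_id] < requested:
        return False          # middleware blocks: would return HTTP 429
    quota[user_id] -= requested
    return True

# Steps 1-3: one 600-token prompt against a 500-token quota
assert check("user-1", 600) is False

# Steps 4-5: two 300-token prompts
assert check("user-1", 300) is True    # first passes (500 -> 200 left)
assert check("user-1", 300) is False   # second fails (only 200 left)

# Step 6: savings estimate, assuming an illustrative $2.50 / 1M tokens
blocked_users = 1_000
blocked_tokens = blocked_users * 600
savings = blocked_tokens * 2.50 / 1_000_000
print(f"Daily savings: ${savings:.2f}")  # $1.50
```

At these illustrative prices the per-day saving looks small, but the same middleware also caps the worst case: without it, those 1,000 users could each have sent far more than 600 tokens.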