Rate Limiting and Token Quotas: Resource Governance

Learn how to protect your AI infrastructure from abuse and over-spending. Master the implementation of 'Token Buckets' and tiered user quotas.

When you move from a prototype to a production system with thousands of users, your Token Liability becomes your biggest risk. A single aggressive user (or a bot) can burn through your monthly API budget in a single afternoon.

Rate Limiting is the practice of restricting the frequency of requests. Token Quotas go a step further, restricting the total volume of tokens consumed over a period (e.g., 24 hours).

In this lesson, we learn how to build a Governance Layer that keeps your infrastructure stable and your billing predictable.


1. Request Rate vs. Token Rate

  • Requests per Minute (RPM): Limits how many times a user can call POST /chat.
    • Risk: A user can send one 128k context request per minute and still bankrupt you.
  • Tokens per Minute (TPM): Limits the total token throughput, and therefore the compute load and spend, on your infrastructure.
    • Reliability: This is the metric that actually bounds your bill.
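A rough back-of-the-envelope comparison shows why an RPM cap alone fails. All numbers here are illustrative assumptions, including the per-token price:

```python
# Illustrative numbers: an RPM cap alone does not bound spend.
RPM_LIMIT = 60            # requests per minute
MAX_CONTEXT = 128_000     # tokens a single request may carry
PRICE_PER_1K = 0.005      # assumed $/1K input tokens; check your provider

# Worst case under RPM-only limiting: every request is a full-context request.
worst_case_tpm = RPM_LIMIT * MAX_CONTEXT
worst_case_cost_per_min = worst_case_tpm / 1000 * PRICE_PER_1K

print(f"Worst-case TPM under RPM-only limit: {worst_case_tpm:,}")
print(f"Worst-case spend per minute: ${worst_case_cost_per_min:,.2f}")

# With a TPM cap instead, spend is bounded regardless of request size.
TPM_LIMIT = 100_000
capped_cost_per_min = TPM_LIMIT / 1000 * PRICE_PER_1K
print(f"Capped spend per minute: ${capped_cost_per_min:.2f}")
```

Under these assumed numbers, RPM-only limiting allows roughly 75x more spend per minute than a 100k TPM cap.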

2. The Token Bucket Algorithm

The Token Bucket is the most elegant way to handle burst traffic while maintaining a long-term average.

  1. The Bucket: Every user has a "Bucket" that can hold 10,000 tokens.
  2. The Drain: When a user makes a request for 2,000 tokens, the bucket drops to 8,000.
  3. The Refill: Every second, 100 new tokens are added to the bucket (up to the max).
  4. The Block: If the bucket is empty, the user gets a 429 Too Many Requests error.
```mermaid
graph TD
    A[Incoming Request: 500 tokens] --> B{Bucket > 500?}
    B -->|Yes| C[Execute Request]
    C --> D[Decrement Bucket]
    B -->|No| E[Reject with 429]

    F[Refill Timer: +100 tokens/sec] --> G[Update All Buckets]
```
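The four steps above can be sketched as a minimal in-memory bucket. This is a single-process illustration only (a distributed version follows in the next section); the capacity and refill rate mirror the example numbers:

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket (single process; illustrative only)."""

    def __init__(self, capacity=10_000, refill_rate=100):
        self.capacity = capacity          # max tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added back per second
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = time.monotonic()

    def _refill(self):
        # Lazily add tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_consume(self, requested: int) -> bool:
        """Drain the bucket and return True if enough tokens remain."""
        self._refill()
        if self.tokens >= requested:
            self.tokens -= requested
            return True
        return False  # caller should respond with HTTP 429

bucket = TokenBucket(capacity=10_000, refill_rate=100)
print(bucket.try_consume(2_000))   # True: bucket drops to roughly 8,000
print(bucket.try_consume(9_000))   # False: not enough tokens left
```

Refilling lazily on each call, rather than with a background timer, keeps the implementation free of threads while producing the same long-term average.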

3. Implementation: Redis-Based Token Limiter (Python)

Using Redis is the standard for distributed rate limiting.

Python Code: Token Quota Middleware

```python
import redis
from fastapi import HTTPException

r = redis.Redis(host='localhost', port=6379, db=0)

DAILY_QUOTA = 100_000  # tokens per user per 24 hours

def check_token_quota(user_id, requested_tokens):
    key = f"quota:{user_id}"
    # Create the counter with a 24h expiry if it does not exist yet
    r.set(key, DAILY_QUOTA, nx=True, ex=86_400)

    # DECRBY is atomic, so two concurrent requests cannot both slip
    # past a separate "read, then write" check
    remaining = r.decrby(key, requested_tokens)
    if remaining < 0:
        # Refund the tokens we just took; the request is rejected
        r.incrby(key, requested_tokens)
        return False, "Quota Exceeded"
    return True, "Authorized"

# In your FastAPI endpoint
@app.post("/v1/chat")
async def chat(request: ChatRequest):
    # PRE-CALCULATE tokens using tiktoken (Module 1.1)
    est_tokens = count_tokens(request.prompt)

    allowed, msg = check_token_quota(request.user_id, est_tokens)
    if not allowed:
        raise HTTPException(status_code=429, detail=msg)

    return execute_llm_call(request)
```

4. Tiered User Quotas

In a SaaS business model, you align Token Access with Revenue.

  • Free Tier: 10k tokens / day. Access to small models (GPT-4o mini) only.
  • Pro Tier: 500k tokens / day. Access to expert models.
  • Enterprise: Custom quotas.

Token Efficiency Tip: When a "Pro" user hits their daily limit, instead of cutting them off entirely, you can Auto-Downgrade them to a Tier 1 model. This provides a "Degraded but Working" experience that keeps the user happy without increasing your bill.
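The tier table and the auto-downgrade fallback can be expressed as a simple lookup. The tier names and daily limits mirror the list above; the model IDs and function names are illustrative:

```python
# Tier configuration mirroring the list above; model IDs are illustrative.
TIERS = {
    "free":       {"daily_tokens": 10_000,  "models": ["gpt-4o-mini"]},
    "pro":        {"daily_tokens": 500_000, "models": ["gpt-4o", "gpt-4o-mini"]},
    "enterprise": {"daily_tokens": None,    "models": ["gpt-4o", "gpt-4o-mini"]},
}

FALLBACK_MODEL = "gpt-4o-mini"  # cheap "Tier 1" model used after quota is spent

def select_model(tier: str, tokens_used_today: int, requested_model: str) -> str:
    cfg = TIERS[tier]
    limit = cfg["daily_tokens"]
    # Users never get models outside their tier.
    if requested_model not in cfg["models"]:
        return FALLBACK_MODEL
    # Auto-downgrade: over-quota Pro users get a degraded-but-working model
    # instead of a hard 429.
    if tier == "pro" and limit is not None and tokens_used_today >= limit:
        return FALLBACK_MODEL
    return requested_model

print(select_model("pro", 120_000, "gpt-4o"))   # gpt-4o
print(select_model("pro", 500_000, "gpt-4o"))   # gpt-4o-mini (downgraded)
```

Keeping the tier table as data rather than branching logic makes it trivial to adjust limits or add tiers without touching the routing code.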


5. Summary and Key Takeaways

  1. RPM is not enough: You must limit Tokens Per Minute (TPM).
  2. Token Buckets: Use Redis to manage distributed, high-performance quotas.
  3. Pre-Calculation: Estimate tokens with tiktoken before calling the API, so over-quota requests are rejected before they cost you anything.
  4. Alignment: Map quotas directly to your pricing tiers to ensure every user is profitable.

In the next lesson, Cost-Aware Routing at Scale, we look at how to manage these quotas across a multi-model fleet.


Exercise: The Quota Stress Test

  1. Set a user quota to 500 tokens.
  2. Attempt to send a 600-token prompt.
  3. Observe: Did your middleware block it?
  4. Now, send two 300-token prompts.
  5. Analyze: The first should pass, the second should fail.
  • Calculations: If you have 1,000 such "Blocked" users per day, how much money did your middleware save you from spending on the LLM provider?
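For the final calculation, here is a back-of-the-envelope sketch. The per-token price is an assumption; substitute your provider's actual rate:

```python
BLOCKED_USERS_PER_DAY = 1_000
TOKENS_PER_BLOCKED_REQUEST = 600
PRICE_PER_1K_TOKENS = 0.005   # assumed $/1K tokens; check your provider's pricing

blocked_tokens = BLOCKED_USERS_PER_DAY * TOKENS_PER_BLOCKED_REQUEST
daily_savings = blocked_tokens / 1000 * PRICE_PER_1K_TOKENS

print(f"Tokens blocked per day: {blocked_tokens:,}")        # 600,000
print(f"Estimated daily savings: ${daily_savings:.2f}")      # $3.00
print(f"Estimated monthly savings: ${daily_savings * 30:.2f}")  # $90.00
```

Small per-request amounts compound quickly: at this assumed price, one middleware function pays for itself many times over each month.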

Congratulations on completing Module 16 Lesson 1! You are now a production-ready architect.
