Module 5 Lesson 2: Latency and Throttling

Handling the Load. Understanding Bedrock's rate limits and how to optimize for the fastest response times.

Handling the Load: Latency and Limits

In development, you are the only user. In production, 1,000 users might hit your API at once. If you aren't prepared for throttling and latency bottlenecks, your users will see errors and timeouts instead of answers.

1. Understanding Throttling

AWS Bedrock enforces quotas (rate limits) on requests and tokens per minute. If you exceed them, you get a ThrottlingException.

  • Solution: Implement exponential backoff: wait 2 seconds, then 4, then 8, before retrying.
  • Boto3 retries throttled calls automatically, but you should tune its retry configuration for heavy loads (see the sketch below).
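
A minimal sketch of both ideas, assuming the bedrock-runtime client and an Anthropic-style request body supplied by the caller; the retry counts and sleep times are illustrative, not prescriptive:

```python
import time
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Ask boto3 for more built-in retries; "adaptive" mode also rate-limits
# the client side to help stay under the account quota.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
bedrock = boto3.client("bedrock-runtime", config=retry_config)

def invoke_with_backoff(model_id: str, body: str, max_retries: int = 4):
    """Exponential backoff on top of boto3's built-in retries."""
    for attempt in range(max_retries):
        try:
            return bedrock.invoke_model(modelId=model_id, body=body)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise  # not a throttle; surface it immediately
            time.sleep(2 ** (attempt + 1))  # wait 2s, 4s, 8s, 16s
    raise RuntimeError("Still throttled after all retries")
```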

2. Factors Affecting Latency

  1. Model Size: Claude 3 Opus (large) is roughly five times slower than Claude 3 Haiku (small).
  2. Output Length: The more tokens the model generates, the longer the call takes.
  3. Region: Calling a US-hosted model from a server in Tokyo adds roughly 200 ms of network round-trip time. (A timing sketch follows this list.)
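
The first two factors are levers you control on every call. As a hedged sketch, assuming Claude 3 Haiku is available in your chosen region and the Anthropic Messages request format, here is one way to pick a small model, cap the output length, and time the result:

```python
import json
import time
import boto3

# Illustrative choices: a nearby region, a small model, and a capped output.
# Model availability varies by region, so check yours before copying this.
bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")  # Tokyo

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 200,  # shorter outputs return sooner
    "messages": [{"role": "user", "content": "Summarize AWS Bedrock in two sentences."}],
})

start = time.perf_counter()
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # small model, lower latency
    body=body,
)
answer = json.loads(response["body"].read())["content"][0]["text"]
print(answer)
print(f"Round trip: {time.perf_counter() - start:.2f}s")
```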

3. Visualizing Scaling Strategies

```mermaid
graph TD
    U[1,000 Users] --> API[FastAPI Orchestrator]
    API -->|Demand| B1[Bedrock Region 1]
    API -->|Demand| B2[Bedrock Region 2]

    API --> Cache[Redis Cache: Quick Answer]
```
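
The cache box in the diagram is the cheapest win: a repeated prompt never has to reach Bedrock at all. Below is a hypothetical cache-first helper, assuming a local Redis instance and the redis-py client; key naming and TTL are arbitrary choices for illustration.

```python
import hashlib
import json
import boto3
import redis  # pip install redis; assumes a Redis server on localhost

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
bedrock = boto3.client("bedrock-runtime")

def cached_answer(prompt: str, model_id: str, ttl_seconds: int = 3600) -> str:
    """Return a cached answer if we have one; otherwise call Bedrock and cache it."""
    key = "bedrock:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit  # quick answer: no Bedrock round trip, no tokens billed

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = bedrock.invoke_model(modelId=model_id, body=body)
    answer = json.loads(response["body"].read())["content"][0]["text"]
    cache.set(key, answer, ex=ttl_seconds)  # expire stale answers after an hour
    return answer
```

Caching only helps with repeated or near-identical prompts; the two-region fan-out in the diagram is what absorbs genuinely unique traffic.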

4. Provisioned Throughput

If your business depends on a guaranteed response time (e.g., a stock trading bot), you can "Reserve" capacity using Provisioned Throughput.

  • Pros: Zero Throttling, consistent speed.
  • Cons: Expensive: capacity is billed by the hour for reserved model units (with optional 1-month or 6-month commitment terms), not per token.
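
The invoke_model call itself does not change with Provisioned Throughput: you pass the ARN of the provisioned resource as the modelId. The ARN below is a placeholder; you receive the real one when you purchase the capacity (in the console or via aws bedrock create-provisioned-model-throughput).

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Placeholder ARN: replace with the one returned when you create
# provisioned throughput in your account.
PROVISIONED_MODEL_ARN = (
    "arn:aws:bedrock:us-east-1:123456789012:provisioned-model/example12345"
)

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 200,
    "messages": [{"role": "user", "content": "Give me a one-line market summary."}],
})

# Same call as on-demand; only the modelId changes.
response = bedrock.invoke_model(modelId=PROVISIONED_MODEL_ARN, body=body)
print(json.loads(response["body"].read())["content"][0]["text"])
```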

Summary

  • Throttling occurs when you exceed your account's RPM (requests per minute) or TPM (tokens per minute) quotas.
  • Smaller models and shorter outputs are the most effective ways to reduce latency.
  • Regional proximity matters for real-time applications.
  • Provisioned Throughput is the enterprise solution for guaranteed availability.
