Module 17 Lesson 2: Scaling and Retries
Handling the Peak. Advanced strategies for dealing with Bedrock's rate limits using exponential backoff and request queuing.
Scaling Up: Retries and Queues
When your app goes viral, you will start hitting ThrottlingException errors: AWS limits how many requests you can send to Bedrock per minute (RPM). If you don't handle these "429 Too Many Requests" responses, your users will just see "Something went wrong."
1. Exponential Backoff
Instead of retrying immediately (which only causes more throttling), your code should wait longer and longer between attempts, as in the sketch after this list:
- Try 1: Fail $\rightarrow$ Wait 1s.
- Try 2: Fail $\rightarrow$ Wait 2s.
- Try 3: Fail $\rightarrow$ Wait 4s.
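Here is a minimal sketch of that retry loop in Python with boto3. The model ID and request body assume an Anthropic Claude model on Bedrock; swap in whatever model and payload your app actually uses.

```python
import json
import time

import boto3
from botocore.exceptions import ClientError

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # assumed model; use your own

bedrock = boto3.client("bedrock-runtime")

def invoke_with_backoff(prompt: str, max_attempts: int = 5) -> dict:
    """Call Bedrock, doubling the wait after each ThrottlingException."""
    delay = 1.0  # first wait: 1 second
    for attempt in range(1, max_attempts + 1):
        try:
            response = bedrock.invoke_model(
                modelId=MODEL_ID,
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 256,
                    "messages": [{"role": "user", "content": prompt}],
                }),
            )
            return json.loads(response["body"].read())
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise  # not a rate-limit problem, so don't retry
            if attempt == max_attempts:
                raise  # out of retries; surface the error to the caller
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, ...
```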
2. Jitter
If 100 users all fail at the same time and wait exactly 2 seconds, they will all hit the server again at the same millisecond, causing another failure. Adding Jitter (a random +/- 100ms) prevents these "Thundering Herds."
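One simple way to add it is to fold a small random offset into the delay calculation from the sketch above (the 100 ms range here is just an illustrative value):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0) -> float:
    """Exponential delay (1s, 2s, 4s, ...) plus up to +/- 100 ms of random jitter."""
    jitter = random.uniform(-0.1, 0.1)
    return max(0.0, base * (2 ** attempt) + jitter)

# attempt 0 -> ~1s, attempt 1 -> ~2s, attempt 2 -> ~4s, each slightly offset
```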
3. Visualizing the Retry Pattern
```mermaid
graph TD
    Req[User Request] --> B[Bedrock Call]
    B -->|Success| Out[Answer]
    B -->|Fail: Throttled| W[Wait: 1s + Jitter]
    W --> B
```
4. Request Queuing
For tasks that don't need to be instant (like generating a weekly summary), use a queue (Amazon SQS); a sketch of both sides follows the list below.
- User submits request $\rightarrow$ Put in SQS.
- Worker pulls from SQS at a slow, controlled rate that doesn't trigger throttling.
- Worker updates the DB when done.
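A rough sketch of that flow, assuming a hypothetical queue URL and using stub `summarize` / `save_result` functions to stand in for the Bedrock call and the DB update:

```python
import json
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/summary-jobs"  # assumed

def summarize(text: str) -> str:
    """Placeholder for the Bedrock call (use the backoff helper from above)."""
    return "summary of: " + text[:40]

def save_result(user_id: str, result: str) -> None:
    """Placeholder for the DB update."""
    print(f"saved for {user_id}: {result}")

def submit_job(user_id: str, text: str) -> None:
    """Producer: enqueue the work instead of calling Bedrock directly."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user_id": user_id, "text": text}),
    )

def worker_loop() -> None:
    """Worker: pull jobs one at a time and pace calls to stay under the RPM quota."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            result = summarize(job["text"])
            save_result(job["user_id"], result)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        time.sleep(1)  # controlled pace between batches
```

Because the worker, not the user, decides how fast messages are pulled, the Bedrock call rate stays flat even when submissions spike.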
Summary
- Throttling is inevitable at scale.
- Exponential Backoff is the primary defense.
- Jitter prevents synchronization of retrying users.
- Queues are the best way to handle non-real-time bulk AI tasks.