Module 17 Lesson 2: Scaling and Retries
Handling the Peak. Advanced strategies for dealing with Bedrock's rate limits using exponential backoff and request queuing.
Scaling Up: Retries and Queues
When your app goes viral, you will start hitting ThrottlingException errors: AWS limits how many requests you can send to Bedrock per minute (RPM). If you don't handle these "429 Too Many Requests" responses, your users will just see "Something went wrong."
1. Exponential Backoff
Instead of retrying immediately (which only causes more throttling), your code should wait longer and longer between attempts, as in the sketch after this list:
- Try 1: Fail $\rightarrow$ Wait 1s.
- Try 2: Fail $\rightarrow$ Wait 2s.
- Try 3: Fail $\rightarrow$ Wait 4s.
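Here is a minimal sketch of that retry loop in Python with boto3. The model ID and request body assume an Anthropic Claude model on Bedrock; swap in whatever model and payload your app actually uses.

```python
import json
import time

import boto3
from botocore.exceptions import ClientError

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # assumed model; use your own

bedrock = boto3.client("bedrock-runtime")

def invoke_with_backoff(prompt: str, max_attempts: int = 5) -> dict:
    """Call Bedrock, doubling the wait after each ThrottlingException."""
    delay = 1.0  # first wait: 1 second
    for attempt in range(1, max_attempts + 1):
        try:
            response = bedrock.invoke_model(
                modelId=MODEL_ID,
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 256,
                    "messages": [{"role": "user", "content": prompt}],
                }),
            )
            return json.loads(response["body"].read())
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise  # not a rate-limit problem, so don't retry
            if attempt == max_attempts:
                raise  # out of retries; surface the error to the caller
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, ...
```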
2. Jitter
If 100 users all fail at the same time and wait exactly 2 seconds, they will all hit the server again at the same millisecond, causing another failure. Adding Jitter (a random +/- 100ms) prevents these "Thundering Herds."
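One simple way to add it is to fold a small random offset into the delay calculation from the sketch above (the 100 ms range here is just an illustrative value):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0) -> float:
    """Exponential delay (1s, 2s, 4s, ...) plus up to +/- 100 ms of random jitter."""
    jitter = random.uniform(-0.1, 0.1)
    return max(0.0, base * (2 ** attempt) + jitter)

# attempt 0 -> ~1s, attempt 1 -> ~2s, attempt 2 -> ~4s, each slightly offset
```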
3. Visualizing the Retry Pattern
```mermaid
graph TD
    Req[User Request] --> B[Bedrock Call]
    B -->|Success| Out[Answer]
    B -->|Fail: Throttled| W[Wait: 1s + Jitter]
    W --> B
```
4. Request Queuing
For tasks that don't need to be instant (like generating a weekly summary), use a queue (Amazon SQS); a sketch of both sides follows the list below.
- User submits request $\rightarrow$ Put in SQS.
- Worker pulls from SQS at a slow, controlled rate that doesn't trigger throttling.
- Worker updates the DB when done.
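A rough sketch of that flow, assuming a hypothetical queue URL and using stub `summarize` / `save_result` functions to stand in for the Bedrock call and the DB update:

```python
import json
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/summary-jobs"  # assumed

def summarize(text: str) -> str:
    """Placeholder for the Bedrock call (use the backoff helper from above)."""
    return "summary of: " + text[:40]

def save_result(user_id: str, result: str) -> None:
    """Placeholder for the DB update."""
    print(f"saved for {user_id}: {result}")

def submit_job(user_id: str, text: str) -> None:
    """Producer: enqueue the work instead of calling Bedrock directly."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user_id": user_id, "text": text}),
    )

def worker_loop() -> None:
    """Worker: pull jobs one at a time and pace calls to stay under the RPM quota."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            result = summarize(job["text"])
            save_result(job["user_id"], result)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        time.sleep(1)  # controlled pace between batches
```

Because the worker, not the user, decides how fast messages are pulled, the Bedrock call rate stays flat even when submissions spike.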
Summary
- Throttling is inevitable at scale.
- Exponential Backoff is the primary defense.
- Jitter prevents synchronization of retrying users.
- Queues are the best way to handle non-real-time bulk AI tasks.