
Speed vs. Substance: Sync, Async, and Streaming Patterns
Master the timing of AI. Learn when to use synchronous, asynchronous, and streaming response patterns to balance user experience, cost, and technical limits.
The Timing of Intelligence
When building a Generative AI application, the "Wait" is the enemy. Because Large Language Models can take anywhere from 1 to 60 seconds to respond, you cannot use traditional, synchronous web patterns for everything.
In the AWS Certified Generative AI Developer – Professional exam, you must demonstrate that you know which "Integration Pattern" to use based on the complexity of the task and the expectations of the user.
1. Synchronous Patterns (Standard REST)
The application waits ("blocks") until the AI returns the final result.
- Best For: Low-latency tasks (Classification, translation of short strings, intent detection).
- Limit: Most web infrastructure times out at around 30 seconds, and AWS API Gateway's integration timeout is 29 seconds. If the AI takes 31 seconds, the user gets a 504 error, even if the AI was 99% done.
- Pro Developer Tool: bedrock-runtime.invoke_model() (see the sketch below).
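A minimal synchronous sketch, assuming a short classification task; the model ID and prompt are illustrative, not prescriptive:

import boto3
import json

client = boto3.client('bedrock-runtime')

def classify_sentiment(text):
    # Blocking call: the function returns only after the full response arrives
    response = client.invoke_model(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 50,
            "messages": [{"role": "user", "content": f"Classify the sentiment of: {text}"}]
        })
    )
    # The body is read in one piece; fine for short, fast tasks
    return json.loads(response['body'].read())['content'][0]['text']

print(classify_sentiment("I love this product!"))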
2. Asynchronous Patterns (Fire and Forget)
The application submits the request and receives an "OK, I'm working on it" message. The result is delivered later via a callback or retrieved via a status check.
- Best For: High-latency tasks (Summarizing a 100-page PDF, generating a long story, complex agent reasoning).
- Mechanism: API Gateway enqueues the job in Amazon SQS as a buffer; a Lambda function picks it up and calls Bedrock; once done, the result is saved to DynamoDB or S3 (see the worker sketch after the diagram).
- User Experience: Show a progress bar or send an email/notification when the job is done.
sequenceDiagram
participant FE as Frontend
participant API as API Gateway
participant SQS as SQS Queue
participant L as Lambda
participant B as Bedrock
FE->>API: 1. POST /process-video
API->>SQS: 2. Enqueue Job
API-->>FE: 3. Return JobID: #123
L->>SQS: 4. Poll and Process
L->>B: 5. Invoke (Async)
B-->>L: 6. Final Result
L->>FE: 7. Notify via WebSocket/DB
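A minimal worker sketch for steps 4-7, assuming an SQS event source mapping on the Lambda function, a hypothetical JobResults DynamoDB table, and a job payload containing jobId and prompt:

import boto3
import json

bedrock = boto3.client('bedrock-runtime')
table = boto3.resource('dynamodb').Table('JobResults')  # hypothetical table name

def handler(event, context):
    # Steps 4-6: SQS triggers Lambda, which makes the slow Bedrock call
    for record in event['Records']:
        job = json.loads(record['body'])
        response = bedrock.invoke_model(
            modelId='anthropic.claude-3-sonnet-20240229-v1:0',
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 2048,
                "messages": [{"role": "user", "content": job['prompt']}]
            })
        )
        result = json.loads(response['body'].read())['content'][0]['text']
        # Step 7: persist the result so the frontend can poll (or be notified)
        table.put_item(Item={'JobId': job['jobId'], 'Result': result})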
3. Streaming Patterns (Real-Time Magic)
The model sends the response word-by-word (tokens) as they are generated.
- Best For: Chat/Conversation apps. Streaming creates the perception of speed by showing the user the first tokens immediately (low TTFT: Time to First Token).
- Mechanism: Uses the EventStream protocol over HTTP.
- Pro Developer Tool: bedrock-runtime.invoke_model_with_response_stream().
Code Example: Handling a Stream in Python
import boto3
import json

client = boto3.client('bedrock-runtime')

def run_stream():
    # Request a token-by-token stream instead of a single blocking response
    response = client.invoke_model_with_response_stream(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": "Write a 500-word essay."}]
        })
    )
    # Process chunks as they arrive; this is what keeps TTFT low
    for event in response.get('body'):
        chunk = json.loads(event.get('chunk').get('bytes').decode())
        if chunk['type'] == 'content_block_delta':
            print(chunk['delta']['text'], end="", flush=True)

run_stream()
4. Pattern Selection Matrix
| Use Case | Recommended Pattern | Primary Reason |
|---|---|---|
| Simple Translation | Sync | Fast; fits within the 29-second timeout. |
| Interactive Chat | Streaming | Critical for user-perceived performance. |
| Batch Content Generation | Async | Protects system against timeouts and spikes. |
| Data Enrichment | Async | Allows for massive parallel processing. |
5. Professional Guardrail: The "Wait" Limit
In the exam, look for questions about the API Gateway limit. If a question says: "A model sometimes takes 45 seconds to respond. Users are seeing 504 Gateway Timeout errors," what should you do?
- Increase the timeout? (No: API Gateway's integration timeout is a hard limit.)
- Switch to Async or Streaming? (Yes; see the sketch below.)
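As a sketch of the async fix, the API-facing Lambda enqueues the job and returns 202 Accepted within milliseconds, sidestepping the timeout entirely; the queue URL below is hypothetical:

import boto3
import json
import uuid

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/genai-jobs'  # hypothetical

def handler(event, context):
    # Respond immediately; the slow Bedrock call happens in the background worker
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({'jobId': job_id, 'prompt': json.loads(event['body'])['prompt']})
    )
    return {'statusCode': 202, 'body': json.dumps({'jobId': job_id})}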
Knowledge Check: Test Your Pattern Knowledge
A leading media company wants to provide a real-time 'Transcription and Translation' service for live news broadcasts. The goal is to minimize the delay between the audio being spoken and the text appearing on screen. Which integration pattern should the developer choose?
Summary
Sync is for speed, Async is for scale, and Streaming is for satisfaction. Mastering these three is the hallmark of a Professional Developer. This concludes Module 6. In the next module, we look at the logic that ties multiple model calls together: Multi-Step GenAI Workflows.
Next Module: The Logic Engine: Orchestration with Step Functions and Lambda