
Speed vs. Substance: Sync, Async, and Streaming Patterns
Master the timing of AI. Learn when to use synchronous, asynchronous, and streaming response patterns to balance user experience, cost, and technical limits.
The Timing of Intelligence
When building a Generative AI application, the "Wait" is the enemy. Because Large Language Models can take anywhere from 1 to 60 seconds to respond, you cannot use traditional, synchronous web patterns for everything.
In the AWS Certified Generative AI Developer – Professional exam, you must demonstrate that you know which "Integration Pattern" to use based on the complexity of the task and the expectations of the user.
1. Synchronous Patterns (Standard REST)
The application waits ("blocks") until the AI returns the final result.
- Best For: Low-latency tasks (Classification, translation of short strings, intent detection).
- Limit: Most web infrastructure times out at around 30 seconds, and AWS API Gateway's integration timeout is 29 seconds. If the AI takes 31 seconds, the user gets a 504 error, even if the AI was 99% done.
- Pro Developer Tool: bedrock-runtime.invoke_model() (see the sketch below).
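A minimal synchronous sketch, assuming a short classification task; the model ID and prompt are illustrative, not prescriptive:

import boto3
import json

client = boto3.client('bedrock-runtime')

def classify_sentiment(text):
    # Blocking call: the function returns only after the full response arrives
    response = client.invoke_model(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 50,
            "messages": [{"role": "user", "content": f"Classify the sentiment of: {text}"}]
        })
    )
    # The body is read in one piece; fine for short, fast tasks
    return json.loads(response['body'].read())['content'][0]['text']

print(classify_sentiment("I love this product!"))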
2. Asynchronous Patterns (Fire and Forget)
The application submits the request and receives an "OK, I'm working on it" message. The result is delivered later via a callback or retrieved via a status check.
- Best For: High-latency tasks (Summarizing a 100-page PDF, generating a long story, complex agent reasoning).
- Mechanism: API Gateway enqueues the job in Amazon SQS as a buffer; a Lambda function picks it up and calls Bedrock; once done, the result is saved to DynamoDB or S3 (see the worker sketch after the diagram).
- User Experience: Show a progress bar or send an email/notification when the job is done.
sequenceDiagram
participant FE as Frontend
participant API as API Gateway
participant SQS as SQS Queue
participant L as Lambda
participant B as Bedrock
FE->>API: 1. POST /process-video
API->>SQS: 2. Enqueue Job
API-->>FE: 3. Return JobID: #123
L->>SQS: 4. Poll and Process
L->>B: 5. Invoke (Async)
B-->>L: 6. Final Result
L->>FE: 7. Notify via WebSocket/DB
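A minimal worker sketch for steps 4-7, assuming an SQS event source mapping on the Lambda function, a hypothetical JobResults DynamoDB table, and a job payload containing jobId and prompt:

import boto3
import json

bedrock = boto3.client('bedrock-runtime')
table = boto3.resource('dynamodb').Table('JobResults')  # hypothetical table name

def handler(event, context):
    # Steps 4-6: SQS triggers Lambda, which makes the slow Bedrock call
    for record in event['Records']:
        job = json.loads(record['body'])
        response = bedrock.invoke_model(
            modelId='anthropic.claude-3-sonnet-20240229-v1:0',
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 2048,
                "messages": [{"role": "user", "content": job['prompt']}]
            })
        )
        result = json.loads(response['body'].read())['content'][0]['text']
        # Step 7: persist the result so the frontend can poll (or be notified)
        table.put_item(Item={'JobId': job['jobId'], 'Result': result})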
3. Streaming Patterns (Real-Time Magic)
The model sends the response word-by-word (tokens) as they are generated.
- Best For: Chat/Conversation apps. Streaming creates the perception of speed by showing the user the first tokens immediately (low TTFT: Time to First Token).
- Mechanism: Uses the EventStream protocol over HTTP.
- Pro Developer Tool: bedrock-runtime.invoke_model_with_response_stream().
Code Example: Handling a Stream in Python
import boto3
import json

client = boto3.client('bedrock-runtime')

def run_stream():
    # Request a token-by-token stream instead of a single blocking response
    response = client.invoke_model_with_response_stream(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": "Write a 500-word essay."}]
        })
    )
    # Process chunks as they arrive; this is what keeps TTFT low
    for event in response.get('body'):
        chunk = json.loads(event.get('chunk').get('bytes').decode())
        if chunk['type'] == 'content_block_delta':
            print(chunk['delta']['text'], end="", flush=True)

run_stream()
4. Pattern Selection Matrix
| Use Case | Recommended Pattern | Primary Reason |
|---|---|---|
| Simple Translation | Sync | Fast; fits within the 29-second timeout. |
| Interactive Chat | Streaming | Critical for user-perceived performance. |
| Batch Content Generation | Async | Protects system against timeouts and spikes. |
| Data Enrichment | Async | Allows for massive parallel processing. |
5. Professional Guardrail: The "Wait" Limit
In the exam, look for questions about the API Gateway limit. If a question says: "A model sometimes takes 45 seconds to respond. Users are seeing 504 Gateway Timeout errors," what should you do?
- Increase the timeout? (No: API Gateway's integration timeout is a hard limit.)
- Switch to Async or Streaming? (Yes; see the sketch below.)
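As a sketch of the async fix, the API-facing Lambda enqueues the job and returns 202 Accepted within milliseconds, sidestepping the timeout entirely; the queue URL below is hypothetical:

import boto3
import json
import uuid

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/genai-jobs'  # hypothetical

def handler(event, context):
    # Respond immediately; the slow Bedrock call happens in the background worker
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({'jobId': job_id, 'prompt': json.loads(event['body'])['prompt']})
    )
    return {'statusCode': 202, 'body': json.dumps({'jobId': job_id})}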
Knowledge Check: Test Your Pattern Knowledge
A leading media company wants to provide a real-time 'Transcription and Translation' service for live news broadcasts. The goal is to minimize the delay between the audio being spoken and the text appearing on screen. Which integration pattern should the developer choose?
Summary
Sync is for speed, Async is for scale, and Streaming is for satisfaction. Mastering these three is the hallmark of a Professional Developer. This concludes Module 6. In the next module, we look at the logic that ties multiple model calls together: Multi-Step GenAI Workflows.
Next Module: The Logic Engine: Orchestration with Step Functions and Lambda