
Deploying Agents to Production: Serverless vs. Containerized
Take your agents from local scripts to production scale. Explore the pros and cons of Serverless (AWS Lambda/GCP Functions) and Containerized (Docker/K8s) deployment models for Gemini ADK projects.
A Python script running in your terminal is a prototype. A production agent is infrastructure. When you deploy a Gemini ADK application, you must decide how to manage its lifecycle, how to scale it to thousands of users, and how to protect its secrets in a cloud environment.
In this lesson, we will compare the two most common deployment patterns: Serverless (for simplicity and cost) and Containerized (for control over complex, high-traffic workloads). We will also learn how to wrap your agent in a FastAPI web server, creating a professional API endpoint that can be consumed by any frontend.
1. Choosing Your Infrastructure
A. Serverless (AWS Lambda / Google Cloud Functions)
- Best for: Low-frequency tasks, cost-sensitive prototypes, and simple message-based agents (a minimal handler sketch appears at the end of this section).
- Pros: You only pay when code runs; near-infinite horizontal scaling.
- Cons: "Cold Start" latency can add seconds to the first request; strict execution time limits (e.g., 15 mins).
B. Containerized (Docker, Cloud Run, ECS, Kubernetes)
- Best for: High-traffic chatbots, long-running research agents, and complex multi-agent orchestrations.
- Pros: Persistent connections; no cold starts (when you keep minimum instances warm); full control over the Python environment.
- Cons: More expensive (you pay for the uptime); higher management complexity.
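To make the serverless option concrete, here is a minimal sketch of an AWS Lambda handler for a one-shot Gemini agent. It is illustrative only: it assumes GEMINI_API_KEY is set as an environment variable, that the google-generativeai package is bundled in the deployment artifact, and that the event payload carries a user_input field (real API Gateway events wrap the body differently).
import json
import os
import google.generativeai as genai
# Configure once at module load; the instance is reused across warm invocations.
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
model = genai.GenerativeModel("gemini-1.5-flash")
def lambda_handler(event, context):
    # Illustrative payload shape: {"user_input": "..."}
    user_input = event.get("user_input", "")
    response = model.generate_content(user_input)
    return {
        "statusCode": 200,
        "body": json.dumps({"text": response.text}),
    }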
2. Wrapping the Agent in FastAPI
In production, you don't run a while True: loop. Instead, you create a REST endpoint that accepts a user_input and returns a model_response.
from fastapi import FastAPI
import google.generativeai as genai
import os
app = FastAPI()
# Configuration
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
model = genai.GenerativeModel('gemini-1.5-flash')
@app.post("/chat")
async def chat_endpoint(user_input: str, session_id: str):
# In production, you would fetch history from Redis here!
chat = model.start_chat(history=[])
response = chat.send_message(user_input)
return {
"text": response.text,
"session_id": session_id
}
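The comment in the endpoint points at the missing piece: per-session history. Below is a minimal sketch of how that fetch might look, assuming a Redis instance reachable with the redis-py package (the key prefix and the one-hour TTL are illustrative).
import json
import redis
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
def load_history(session_id: str) -> list:
    # Stored as a JSON list of {"role": ..., "parts": [...]} dicts,
    # the shape that model.start_chat(history=...) accepts.
    raw = r.get(f"history:{session_id}")
    return json.loads(raw) if raw else []
def save_history(session_id: str, history: list) -> None:
    # Expire idle sessions after one hour.
    r.set(f"history:{session_id}", json.dumps(history), ex=3600)
Inside chat_endpoint you would call load_history(session_id) instead of passing history=[], and call save_history after each reply.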
3. Containerization with Docker
To deploy this FastAPI app to Google Cloud Run or AWS ECS, we need a Dockerfile.
# Use a slim Python image
FROM python:3.11-slim
# Set work directory
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy source code
COPY . .
# Expose port and start FastAPI
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
4. Managing Production Secrets
NEVER put your Gemini API key in your Dockerfile or commit it to your Git repository.
The Production Secret Workflow:
- Store: Upload your key to AWS Secrets Manager or Google Secret Manager.
- Access: At runtime, your cloud platform (Lambda, ECS, or Cloud Run) injects the secret into an environment variable based on a secret reference you configure on the service.
- Use: Your code reads os.getenv("GEMINI_API_KEY"). (If injection is not available, you can fetch the secret directly at startup, as sketched below.)
5. Architectural Diagram: Production Scale
graph TD
A[Mobile/Web Client] --> B[API Gateway]
B --> C[Load Balancer]
C --> D[Cloud Run / ECS - Agent Workers]
D <--> E[Redis - Session State]
D <--> F[Google Gemini API]
D <--> G[Postgres - Logs]
style F fill:#4285F4,color:#fff
style D fill:#34A853,color:#fff
6. Cold Starts and Latency Management
In Serverless (Lambda), the first user to hit the endpoint after a period of inactivity might wait several seconds (10 or more with heavy dependencies) for the Python environment to "wake up."
Mitigation Strategies:
- Warm-up Pings: Set a timer to ping your endpoint every 5 minutes to keep it "awake" (a minimal sketch follows this list).
- Provisioned Concurrency: (AWS) Pay a small fee to keep N instances always running.
- Switch to always-on containers: If sub-second latency is vital, move to Cloud Run with "Min Instances > 0."
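One simple way to implement the warm-up ping is to expose a cheap health endpoint on the FastAPI app from earlier and have a scheduler (EventBridge, Cloud Scheduler, or a plain cron job) hit it every few minutes. The route below is illustrative and deliberately avoids calling Gemini, so each ping costs no tokens.
@app.get("/health")
async def health():
    # Returning a constant keeps the instance warm without spending tokens.
    return {"status": "ok"}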
7. CI/CD for Agents
Deploying an agent is risky because a prompt change can "break" the agent's behavior.
The Pipeline:
- Code Review: Standard PR process.
- Unit Tests: Test tool functions (Module 12.3).
- Prompt Evaluation: Run the new prompt against a "Golden Dataset" to ensure no regressions in accuracy (a minimal sketch follows this list).
- Deploy: Use a canary or Blue/Green rollout (e.g., route 5% of traffic to the new agent first, then cut over once metrics look healthy).
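The prompt-evaluation step can start small: replay a handful of known inputs through the new configuration and check each answer for an expected keyword. Below is a minimal sketch, assuming a hypothetical golden_dataset.json of {"input": ..., "must_contain": ...} records and an illustrative 90% pass threshold.
import json
import os
import google.generativeai as genai
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
model = genai.GenerativeModel("gemini-1.5-flash")
def run_golden_eval(path: str = "golden_dataset.json", pass_rate: float = 0.9) -> bool:
    # Each case: {"input": "question", "must_contain": "expected keyword"}.
    with open(path) as f:
        cases = json.load(f)
    passed = 0
    for case in cases:
        response = model.generate_content(case["input"])
        if case["must_contain"].lower() in response.text.lower():
            passed += 1
    return passed / len(cases) >= pass_rate
The CI job fails the build if run_golden_eval returns False, blocking the Deploy step.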
8. Summary and Exercises
Deployment is the Reality Check for your code.
- FastAPI is the standard interface for agent APIs.
- Docker provides environment consistency.
- Secret Managers provide enterprise security.
- Warm-up pings and provisioned concurrency manage serverless latency.
Exercises
- Infrastructure Choice: You are building an agent that runs once a day at 2 AM to generate a report. Which deployment model would you choose and why?
- Dockerfile Optimization: Why did we use python:3.11-slim in the example instead of the full python:3.11 image? What are the benefits for cloud deployment?
- Secret Invalidation: What happens to your production agent if you "Revoke" your API key in Google AI Studio? How would you update your production secret without re-deploying your entire code base?
In the next lesson, we will look at Managing Agentic State at Scale, exploring how to handle millions of sessions across multiple servers.