Deploying Agents to Production: Serverless vs. Containerized

Take your agents from local scripts to production scale. Explore the pros and cons of Serverless (AWS Lambda/GCP Functions) and Containerized (Docker/K8s) deployment models for Gemini ADK projects.
A Python script running in your terminal is a prototype. A production agent is infrastructure. When you deploy a Gemini ADK application, you must decide how to manage its lifecycle, how to scale it to thousands of users, and how to protect its secrets in a cloud environment.

In this lesson, we will compare the two most common deployment patterns: Serverless (for low cost and simplicity) and Containerized (for control and heavy workloads). We will also learn how to wrap your agent in a FastAPI web server, creating a professional API endpoint that can be consumed by any frontend.


1. Choosing Your Infrastructure

A. Serverless (AWS Lambda / Google Cloud Functions)

  • Best for: Low-frequency tasks, cost-sensitive prototypes, and simple message-based agents.
  • Pros: You only pay when code runs; near-infinite horizontal scaling.
  • Cons: "Cold Start" latency can add seconds to the first request; strict execution time limits (e.g., 15 mins).

B. Containerized (Docker, Cloud Run, ECS, Kubernetes)

  • Best for: High-traffic chatbots, long-running research agents, and complex multi-agent orchestrations.
  • Pros: Persistent connections; no cold starts when minimum instances are kept warm; full control over the Python environment.
  • Cons: More expensive (you pay for uptime); higher management complexity.

2. Wrapping the Agent in FastAPI

In production, you don't run a while True: loop. Instead, you create a REST endpoint that accepts the user's input and returns the model's response.

from fastapi import FastAPI
from pydantic import BaseModel
import google.generativeai as genai
import os

app = FastAPI()

# Configuration
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
model = genai.GenerativeModel('gemini-1.5-flash')

# A JSON request body is the idiomatic way to accept POST data in FastAPI
class ChatRequest(BaseModel):
    user_input: str
    session_id: str

@app.post("/chat")
async def chat_endpoint(request: ChatRequest):
    # In production, you would fetch history from Redis here (see the sketch below)!
    chat = model.start_chat(history=[])
    response = chat.send_message(request.user_input)

    return {
        "text": response.text,
        "session_id": request.session_id
    }
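
The inline comment marks the real production gap: session history. Below is a minimal sketch of a Redis-backed session store, assuming the redis-py client and a locally reachable Redis instance (both assumptions, not part of the lesson's stack):

import json
import redis

# Host and port are placeholders; point these at your managed Redis instance
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_history(session_id: str) -> list:
    # History is stored as a JSON-encoded list of chat turns
    raw = r.get(f"history:{session_id}")
    return json.loads(raw) if raw else []

def save_history(session_id: str, history: list) -> None:
    # Expire idle sessions after one hour to bound memory usage
    r.set(f"history:{session_id}", json.dumps(history), ex=3600)

In the endpoint, you would replace history=[] with history=load_history(request.session_id) and persist the updated chat.history after each reply (converting the SDK's message objects to plain dicts for JSON storage).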

3. Containerization with Docker

To deploy this FastAPI app to Google Cloud Run or AWS ECS, we need a Dockerfile.

# Use a slim Python image
FROM python:3.11-slim

# Set work directory
WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source code
COPY . .

# Expose port and start FastAPI
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

4. Managing Production Secrets

NEVER hard-code your Gemini API key in your Dockerfile or commit it to a GitHub repository.

The Production Secret Workflow:

  1. Store: Upload your key to AWS Secrets Manager or Google Secret Manager.
  2. Access: At runtime, your cloud environment (Lambda or ECS) injects the secret into an environment variable, as configured in your function or task definition (or your code fetches it directly, as sketched below).
  3. Use: Your code reads os.getenv("GEMINI_API_KEY").
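
If you would rather fetch the key yourself at startup than rely on injected environment variables, the Secret Manager client makes it a few lines. A minimal sketch using the google-cloud-secret-manager package; the project and secret IDs are placeholders:

from google.cloud import secretmanager

def get_gemini_key(project_id: str = "my-project", secret_id: str = "gemini-api-key") -> str:
    client = secretmanager.SecretManagerServiceClient()
    # "latest" resolves to the newest enabled version of the secret
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

Because the code always reads the latest version, rotating the key means adding a new secret version and restarting the workers, with no rebuild of the image.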

5. Architectural Diagram: Production Scale

graph TD
    A[Mobile/Web Client] --> B[API Gateway]
    B --> C[Load Balancer]
    C --> D[Cloud Run / ECS - Agent Workers]
    D <--> E[Redis - Session State]
    D <--> F[Google Gemini API]
    D <--> G[Postgres - Logs]
    
    style F fill:#4285F4,color:#fff
    style D fill:#34A853,color:#fff

6. Cold Starts and Latency Management

In Serverless (Lambda), the first user to visit after a period of inactivity might wait several seconds (10 or more with heavy Python dependencies) for the Python environment to "wake up."

Mitigation Strategies:

  1. Warm-up Pings: Set a timer to ping your endpoint every 5 minutes to keep it "awake" (a minimal sketch follows this list).
  2. Provisioned Concurrency: (AWS) Pay a small fee to keep N instances always running.
  3. Switch to Containers: If sub-second latency is vital, move to Cloud Run with "Min Instances > 0" so at least one instance stays warm.
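
A warm-up ping can be as small as a scheduled function that hits a health route. A minimal sketch written as an AWS Lambda handler; the URL and /health route are placeholders for your own deployment:

import urllib.request

AGENT_URL = "https://your-agent.example.com/health"  # placeholder endpoint

def lambda_handler(event, context):
    # Triggered every 5 minutes by an EventBridge schedule to keep the stack warm
    with urllib.request.urlopen(AGENT_URL, timeout=10) as resp:
        return {"status": resp.status}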

7. CI/CD for Agents

Deploying an agent is risky because a prompt change can "break" the agent's behavior even when every unit test passes.

The Pipeline:

  • Code Review: Standard PR process.
  • Unit Tests: Test tool functions (Module 12.3).
  • Prompt Evaluation: Run the new prompt against a "Golden Dataset" to ensure no regressions in accuracy (see the sketch after this list).
  • Deploy: Use a canary rollout (send 5% of traffic to the new agent design first) before promoting it to all users.
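
The "Golden Dataset" step is the one unique to agents. Below is a minimal sketch of such a regression check, assuming a hypothetical golden.json file of {"input", "expected_keyword"} pairs; real pipelines often use an LLM judge or structured scoring instead of keyword matching:

import json
import os
import google.generativeai as genai

genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
model = genai.GenerativeModel('gemini-1.5-flash')

def test_golden_dataset():
    with open("golden.json") as f:
        cases = json.load(f)
    failures = []
    for case in cases:
        response = model.generate_content(case["input"])
        # Crude accuracy proxy: the expected keyword must appear in the answer
        if case["expected_keyword"].lower() not in response.text.lower():
            failures.append(case["input"])
    # Any failure blocks the deploy stage of the pipeline
    assert not failures, f"Prompt regression on: {failures}"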

8. Summary and Exercises

Deployment is the Reality Check for your code.

  • FastAPI is the standard interface for agent APIs.
  • Docker provides environment consistency.
  • Secret Managers provide enterprise security.
  • Warm-up pings and provisioned concurrency manage serverless latency.

Exercises

  1. Infrastructure Choice: You are building an agent that runs once a day at 2 AM to generate a report. Which deployment model would you choose and why?
  2. Dockerfile Optimization: Why did we use python:3.11-slim in the example instead of the full python:3.11 image? What are the benefits for cloud deployment?
  3. Secret Invalidation: What happens to your production agent if you "Revoke" your API key in Google AI Studio? How would you update your production secret without re-deploying your entire code base?

In the next lesson, we will look at Managing Agentic State at Scale, exploring how to handle millions of sessions across multiple servers.
