FastAPI for AI: Async Clients and Model Serving

FastAPI has become a popular choice for AI backends. Why? Because AI is slow, and FastAPI's async support lets the server keep handling other requests while it waits. Whether you are calling an external LLM (like Gemini or OpenAI) or running your own model (like Llama or Stable Diffusion), the same patterns apply.

In this lesson, we learn how to build production-grade AI wrappers.

1. The Async AI Pattern

Calling an LLM takes time. A single prompt can take 2 to 10 seconds. If you call a synchronous client (like requests) inside an async def route, the event loop is blocked and every other request waits until that call returns.

The Solution: Use the Async version of the AI client.

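from fastapi import FastAPI
from openai import AsyncOpenAI

app = FastAPI()
# Create one client at import time and reuse it across requests
client = AsyncOpenAI(api_key="sk-...")

@app.post("/ask-ai")
async def ask_ai(prompt: str):
    # A bare 'str' parameter is read from the query string: POST /ask-ai?prompt=...
    # This 'await' lets the server handle other users while the AI thinks
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return {"answer": response.choices[0].message.content}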
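To see why the await matters, here is a minimal sketch that simulates five concurrent requests using only the standard library. The handler names and the 0.2-second delay are hypothetical stand-ins for a real LLM call; no network or API key is involved.

```python
import asyncio
import time

DELAY = 0.2  # hypothetical latency; a real LLM call takes seconds

async def blocking_handler(prompt: str) -> str:
    time.sleep(DELAY)  # sync wait: the event loop is frozen for everyone
    return f"answer to: {prompt}"

async def async_handler(prompt: str) -> str:
    await asyncio.sleep(DELAY)  # async wait: control returns to the event loop
    return f"answer to: {prompt}"

async def serve_all(handler, prompts):
    """Run every 'request' concurrently on one event loop and time the batch."""
    start = time.perf_counter()
    answers = await asyncio.gather(*(handler(p) for p in prompts))
    return list(answers), time.perf_counter() - start

if __name__ == "__main__":
    prompts = ["a", "b", "c", "d", "e"]
    _, blocked = asyncio.run(serve_all(blocking_handler, prompts))
    _, overlapped = asyncio.run(serve_all(async_handler, prompts))
    print(f"sync call inside async def: {blocked:.2f}s")
    print(f"awaited async call:         {overlapped:.2f}s")
```

The blocking version takes roughly five times the delay because each request waits for the one before it. The awaited version finishes in about one delay because all five waits overlap.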

2. Pydantic for Structured AI Output

The biggest problem with AI is that it's "unpredictable." You ask for JSON, but it gives you a poem. By combining Pydantic models with the "Structured Outputs" features that many LLM APIs now offer, you can constrain the response to your schema and then validate it in your code, so malformed output is rejected before it reaches the user.

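from pydantic import BaseModel, Field

class AIResponse(BaseModel):
    summary: str
    # Only these three labels are accepted; anything else fails validation
    sentiment: str = Field(pattern="^(Positive|Negative|Neutral)$")
    tags: list[str]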
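As a rough sketch of how this schema can act as a gate, the snippet below validates raw model text with Pydantic v2's model_validate_json. The parse_ai_output helper and the sample strings are made up for illustration; in a real app the raw text would come from the LLM response.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError

class AIResponse(BaseModel):
    summary: str
    sentiment: str = Field(pattern="^(Positive|Negative|Neutral)$")
    tags: list[str]

def parse_ai_output(raw: str) -> Optional[AIResponse]:
    """Return a validated AIResponse, or None if the text doesn't fit the schema."""
    try:
        return AIResponse.model_validate_json(raw)
    except ValidationError:
        # Covers both invalid JSON and JSON that breaks the schema
        return None

good = '{"summary": "Fast delivery", "sentiment": "Positive", "tags": ["shipping"]}'
bad_label = '{"summary": "Fast delivery", "sentiment": "Great", "tags": ["shipping"]}'
poem = "Roses are red, violets are blue"

if __name__ == "__main__":
    for raw in (good, bad_label, poem):
        print(parse_ai_output(raw))
```

Returning None is one possible design; an API might instead retry the LLM call or respond with an error status when validation fails.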

3. Handling Timeouts and Retries

AI APIs fail. Models go down, or you hit rate limits. Your FastAPI code should include robust error handling (Module 7).

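Timeouts: Set a limit on each call (for example, 10 seconds) so the user isn't stuck forever.
Exponential Backoff: If a call fails with a transient error (such as a rate limit), wait 1s, then 2s, then 4s before trying again, and give up after a fixed number of attempts.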
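Both ideas can be combined in one small helper. This is a sketch using only asyncio, not the retry mechanism of any particular AI SDK; call_with_retries, its parameters, and the choice of ConnectionError as the "transient" error are assumptions for illustration.

```python
import asyncio

async def call_with_retries(make_call, *, timeout=10.0, max_attempts=4, base_delay=1.0):
    """Await make_call() with a per-attempt timeout and exponential backoff.

    make_call is a zero-argument function that returns a fresh coroutine,
    because a coroutine object cannot be awaited twice.
    """
    for attempt in range(max_attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller turn this into an error response
            await asyncio.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Inside a route this might be used as `await call_with_retries(lambda: client.chat.completions.create(...))`, with the final exception mapped to an HTTP error for the user.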

4. Serving Your Own Models (PyTorch / TensorFlow)

If you are running your own model on your own GPU:

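Load the model ONCE during startup, not inside the request. Recent FastAPI versions recommend a lifespan handler for this; the older @app.on_event("startup") hook still works but is deprecated.
Serve it in a threadpool: Model inference is CPU/GPU heavy. Define the endpoint with def instead of async def. FastAPI runs def endpoints in a worker thread, so inference doesn't block the event loop.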

Visualizing the AI API Flow

sequenceDiagram
    participant U as User
    participant F as FastAPI
    participant AI as AI Model (LLM)
    
    U->>F: POST /generate (Prompt)
    F->>AI: Async Call (Waiting...)
    Note over F: Server handles other requests
    AI-->>F: AI Result (Text/JSON)
    F->>F: Pydantic Validation
    F-->>U: Reliable JSON Response

Summary

Async-First: Never call a sync AI client inside an async def route; use the async client and await it.
Pydantic: Use it to validate that the AI returned structured, typed data.
Lifecycle: Load heavy models during app startup, not inside the request.
Reliability: AI is unpredictable; your API shouldn't be.

In the next lesson, we’ll look at Streaming Responses, the secret to making AI feel "Live."

Exercise: The AI Guard

You are building an AI Support Bot.

If the AI takes 15 seconds to respond, what happens to your FastAPI server if you use async def and await an async client?
What happens if you use standard def with a sync client?
Which one allows you to handle 100 concurrent users on a single CPU?