CI/CD for LLM Applications: Automated AI Pipelines

Traditional CI/CD (Continuous Integration / Continuous Deployment) ensures that code compiles and tests pass. LLMOps (MLOps for LLMs) ensures that your AI behavior is stable.

If you change a single word in a system prompt, it can cause the model to stop returning JSON and start returning haikus. You cannot "just push" a prompt change. You need an automated pipeline to verify that your change didn't break the agent.

1. The LLM CI/CD Workflow

Unlike standard software, our pipeline includes a "Reasoning Evaluation" step.

graph TD
    A[Code/Prompt Change] --> B[Linting & Unit Tests]
    B --> C[Small Model Eval: Check formatting]
    C --> D[Golden Dataset Eval: Check logic]
    D --> E{Pass Threshold > 95%?}
    E -- Yes --> F[Deploy to Staging]
    E -- No --> G[Reject & Notify Dev]
    F --> H[Blue/Green Deployment to Production]

2. Automated Prompt Testing

In your CI pipeline (GitHub Actions, GitLab CI), you should run a script that sends 20-50 tests to your new prompt version.

What to check for automatically:

Format: Did the model return valid JSON?
Schema: Did it include the required keys (id, summary, priority)?
Toxic Content: Did the change make the model more susceptible to prompt injection?

3. Evaluation Hubs: The "Unit Test" for Ideas

We use tools like Promptfoo or LangChain Eval to run these suites.

Example Config (eval.yaml):

prompts: [prompts/v2_system_prompt.txt]
providers: [openai:gpt-4o]
tests:
  - vars:
      user_input: "Cancel my order"
    assert:
      - type: contains
        value: "order_cancellation_confirmed"
      - type: javascript
        value: output.length < 200

4. Blue/Green and Canary Deployments

When you deploy a new model or prompt, you shouldn't give it to 100% of your users instantly.

Canary Release: Send 5% of traffic to the "New Agent." Monitor the logs for exceptions or a drop in "Thums-up" feedback.
AB Testing: Send 50% of users to Prompt A and 50% to Prompt B and see which one converts better.

5. Environment Variables vs. Hardcoded Prompts

One of the cornerstones of LLMOps is Separation.

Dev Env: Uses gpt-4o-mini to save money.
Prod Env: Uses gpt-4o or claude-3-5-sonnet. Your CI/CD pipeline should automatically inject the correct Model ID and API keys based on the target environment.

Summary

LLMOps is about automating the verification of probabilistic outputs.
Golden Datasets are the core of your CI pipeline.
Promptfoo and similar tools act as the "Test Runner" for AI.
Canary Deployments prevent a bad prompt from ruining the experience for all users.

In the next lesson, we will look at Evaluating LLM Outputs in Production, focusing on how to track quality after the deployment.

Exercise: The CI/CD Architect

You want to automate the testing of your "Legal AI" bot whenever a developer changes the system_instructions.txt file.

List the 3 steps of your GitHub Actions pipeline:

What is the first "Safety" check?
What is the "Logic" check?
What happens if the logic check passes for 18 out of 20 tests? (Is that a Success or a Failure?)

Answer Logic:

Linting/Unit Tests: Ensure the code around the prompt is valid.
Golden Dataset Run: Run the 20 most important legal edge cases.
Threshold Check: If 18/20 = 90%, and your threshold is 95%, the build FAILS. You cannot lower the standards of a legal bot!