Local vs Hosted Models (Ollama vs Bedrock)

Compare local and cloud model deployments for multimodal RAG systems and learn when to use each approach.

Choosing between local and cloud-hosted models is one of the most important architectural decisions for RAG systems.

Deployment Models Overview

graph TD
    A[Model Deployment] --> B[Local/On-Premise]
    A --> C[Cloud-Hosted]
    
    B --> D[Ollama]
    B --> E[vLLM]
    B --> F[Self-Hosted LLaVA]
    
    C --> G[AWS Bedrock]
    C --> H[OpenAI API]
    C --> I[Anthropic API]
    
    style B fill:#d1ecf1
    style C fill:#fff3cd

Local Models with Ollama

What is Ollama?

Ollama is a framework for running LLMs locally with minimal setup.

# Install Ollama
curl https://ollama.ai/install.sh | sh

# Run a multimodal model
ollama run llava

# Or a text model
ollama run llama2
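
Once a model is running, Ollama serves a small HTTP API on localhost port 11434. A minimal sketch of calling its `/api/generate` endpoint from the standard library (the endpoint and payload shape are Ollama's documented API; the helper names are my own):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the response text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Note that the request never leaves the machine — the privacy argument below falls out of this directly.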

Supported Multimodal Models

# Available through Ollama
models = {
    "llava": "7B multimodal (text + vision)",
    "llava:13b": "Larger, more capable",
    "bakllava": "Optimized LLaVA variant",
    "moondream": "Tiny vision model (1.8B)"
}

Local RAG Architecture

graph TD
    A[Data Sources] --> B[Local Ingestion]
    B --> C[Local Preprocessing]
    C --> D[Ollama Embeddings]
    D --> E[Chroma DB]
    
    F[User Query] --> E
    E --> G[Retrieved Context]
    G --> H[Ollama LLaVA]
    H --> I[Response]
    
    J[All On-Premise] -.-> B & C & D & E & H
    
    style J fill:#d4edda
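
The flow above can be sketched end to end in a few lines. The `toy_embed` function below is a deliberate stand-in for a real embedding model (bag-of-words instead of Ollama embeddings), so only the retrieval wiring — embed, score, take top-k — is representative:

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for an Ollama embedding call: a sparse bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Score every document against the query and return the top k.
    q = toy_embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:k]

docs = ["ollama runs llava locally", "bedrock hosts claude", "chroma stores vectors"]
print(retrieve("run llava with ollama", docs, k=1))  # → ['ollama runs llava locally']
```

In a real deployment, `toy_embed` would be an Ollama embedding request and the scoring loop would live inside Chroma.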

Advantages of Local Models

1. Complete Data Privacy

Your Data → Your Hardware → Your Network
NO external API calls
NO data leaves your infrastructure

Critical for:

  • Healthcare (HIPAA)
  • Finance (SOX, PCI-DSS)
  • Legal (attorney-client privilege)
  • Government (classified data)
  • R&D (trade secrets)

2. No API Costs

# Local cost calculation
Hardware: $5,000-20,000 (one-time)
Electricity: ~$50-200/month

vs

Cloud API: $0.10-0.50 per query × 100K queries/month = $10,000-50,000/month

Break-even: Usually under 6 months for high-volume systems
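
A quick sanity check on that break-even claim, using mid-range figures from the rough estimates above (these are the quoted ballpark numbers, not measured costs):

```python
def breakeven_months(hardware: float, power_per_month: float,
                     cloud_per_query: float, queries_per_month: int) -> float:
    """Months until the one-time hardware cost is offset by avoided API spend."""
    cloud_monthly = cloud_per_query * queries_per_month
    monthly_savings = cloud_monthly - power_per_month
    return hardware / monthly_savings

# Mid-range figures: $12.5K hardware, $125/mo power, $0.20/query, 100K queries/mo
print(round(breakeven_months(12_500, 125, 0.20, 100_000), 1))  # → 0.6
```

At that volume the hardware pays for itself in under a month; at lower volumes or cheaper per-query pricing, the break-even stretches out toward the six-month figure.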

3. No Rate Limits

Bedrock: 10-50 req/s throttling
Ollama: Limited only by your hardware

4. No Vendor Lock-in

  • Switch models anytime
  • No dependency on external services
  • Control over model versions

5. Low Latency (Same Network)

Cloud API: 500ms-3s (network + processing)
Local: 100-500ms (processing only)

Disadvantages of Local Models

1. Lower Accuracy

graph LR
    A[Quality Spectrum] --> B[LLaVA 7B]
    A --> C[LLaVA 13B]
    A --> D[GPT-4V]
    A --> E[Claude 3.5]
    
    B --> F[Good]
    E --> G[Excellent]
    
    style B fill:#fff3cd
    style E fill:#d4edda

Open-source models lag 6-12 months behind frontier models.

2. Infrastructure Burden

You must manage:

  • GPU servers (often multiple)
  • Model storage (~10-50GB per model)
  • VRAM requirements (16-80GB)
  • Monitoring and updates
  • High availability
  • Scaling

3. Upfront Capital Expense

Minimum viable setup:
- GPU server: $5,000-10,000
- Storage: $1,000-3,000
- Networking: $500-2,000

Total: $6,500-15,000 before first query

4. Operational Complexity

# Cloud (simple)
response = bedrock.invoke_model(...)

# vs

# Local (complex)
# - Ensure GPU drivers updated
# - Monitor VRAM usage
# - Handle model loading
# - Implement request queuing
# - Set up load balancing
# - Configure auto-scaling

Cloud Models with AWS Bedrock

What is AWS Bedrock?

Bedrock is AWS's managed service for foundation models.

import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body={...}
)
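
The `body` argument must be a JSON string in Anthropic's Messages format. A small helper for building it (the `anthropic_version` value is what Bedrock's Anthropic integration expects; the `max_tokens` default is an arbitrary choice):

```python
import json

def build_claude_body(prompt: str, max_tokens: int = 512) -> str:
    """JSON request body for Anthropic Claude models on Bedrock."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    })
```

Usage would then be `bedrock.invoke_model(modelId=..., body=build_claude_body("Summarize this document"))`, with the response body parsed back out of `response["body"]`.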

Available Models

  • Anthropic: Claude 3.5 Sonnet, Opus, Haiku
  • Stability AI: Image generation
  • Cohere: Embeddings and reranking
  • Amazon Titan: Text and embeddings

Cloud RAG Architecture

graph LR
    A[S3 Documents] --> B[Bedrock KB]
    B --> C[Automatic Indexing]
    C --> D[Vector Store]
    
    E[User Query] --> F[Bedrock Agent]
    F --> D
    D --> G[Claude 3.5]
    G --> H[Response]
    
    style B fill:#ff9800
    style G fill:#ff9800

Advantages of Cloud Models

1. State-of-the-Art Quality

Claude 3.5 Sonnet accuracy >> LLaVA
Continuous improvements without effort

2. Zero Infrastructure Management

No servers
No GPUs
No maintenance
No scaling worries

3. Pay-Per-Use

# Only pay for what you use
Low volume: $10-100/month
High volume: $1,000-10,000/month

No upfront investment

4. Instant Scalability

1 query/second → 1,000 queries/second
Automatic, no configuration

5. Built-in Integrations

graph TD
    A[Bedrock] --> B[Knowledge Bases]
    A --> C[Agents]
    A --> D[Guardrails]
    A --> E[Model Evaluation]
    
    F[AWS Ecosystem] --> G[S3]
    F --> H[Lambda]
    F --> I[IAM]
    F --> J[CloudWatch]

Disadvantages of Cloud Models

1. Data Privacy Concerns

Your data → AWS network → Model APIs

Not suitable for:
- Highly regulated data
- Trade secrets
- Customer PII (without encryption)

2. Ongoing Costs

A very high-volume system:
100K queries/day × $0.20/query = $20K/day ≈ $600K/month

3. Vendor Lock-in

  • Dependent on AWS uptime
  • Subject to pricing changes
  • API changes require code updates
  • Region availability limitations

4. Latency (Network)

Internet roundtrip: 50-200ms
Processing: 1-3s
Total: ~1-3.2s

Hybrid Approaches

Architecture: Best of Both Worlds

graph TD
    A{Data Classification} --> B[Public/General]
    A --> C[Confidential/PII]
    
    B --> D[Cloud Bedrock]
    C --> E[Local Ollama]
    
    F{Query Complexity} --> G[Simple]
    F --> H[Complex]
    
    G --> E
    H --> D
    
    style D fill:#fff3cd
    style E fill:#d1ecf1

Strategy:

  • Simple queries + sensitive data → Local
  • Complex analysis + public data → Cloud
  • Most queries → Start local, escalate to cloud if needed

Tiered Model System

# Conceptual: Tiered routing
def route_query(query, data):
    if is_sensitive(data):
        return ollama_local(query, data)
    elif is_complex(query):
        return bedrock_claude(query, data)
    else:
        return ollama_local(query, data)
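
A runnable version of that router, with deliberately crude stand-ins for `is_sensitive` and `is_complex` (the markers, threshold, and backend names are illustrative placeholders, not a real classification policy):

```python
SENSITIVE_MARKERS = {"ssn", "diagnosis", "account_number"}  # hypothetical policy

def is_sensitive(data: str) -> bool:
    # Placeholder: real systems would use a data-classification service.
    return any(marker in data.lower() for marker in SENSITIVE_MARKERS)

def is_complex(query: str) -> bool:
    # Crude proxy: long or comparative questions go to the stronger cloud model.
    return len(query.split()) > 20 or "compare" in query.lower()

def route_query(query: str, data: str) -> str:
    """Return which backend should handle this query."""
    if is_sensitive(data):
        return "ollama_local"      # sensitive data never leaves the building
    if is_complex(query):
        return "bedrock_claude"    # hard questions get the frontier model
    return "ollama_local"          # default: cheap and private

print(route_query("summarize this", "patient diagnosis notes"))  # → ollama_local
```

The key property is that sensitivity is checked first: a complex query over confidential data still stays local.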

Decision Framework

graph TD
    START{Start Here} --> Q1{Data privacy critical?}
    
    Q1 -->|Yes| LOCAL[Local Only]
    Q1 -->|No| Q2{High volume?}
    
    Q2 -->|Yes - >10K/day| Q3{Complex queries?}
    Q2 -->|No - <10K/day| CLOUD{Budget flexible?}
    
    Q3 -->|Yes| HYBRID[Hybrid Approach]
    Q3 -->|No| LOCAL
    
    CLOUD -->|Yes| FULLCLOUD[Cloud Only]
    CLOUD -->|No| HYBRID
    
    style LOCAL fill:#d1ecf1
    style FULLCLOUD fill:#fff3cd
    style HYBRID fill:#d4edda

Decision Criteria Table

| Factor | Choose Local | Choose Cloud | Consider Hybrid |
|---|---|---|---|
| Data Sensitivity | HIPAA, PCI-DSS | Public data | Mixed |
| Volume | >50K/day | <10K/day | 10K-50K/day |
| Budget | CapEx available | OpEx preferred | Flexible |
| Accuracy Needs | Acceptable quality | Highest quality | Variable |
| Team Size | DevOps team | Small team | Medium team |
| Latency | <200ms | <3s OK | Variable |

Real-World Examples

Example 1: Healthcare Startup

Scenario: Medical diagnosis assistant

Choice: Local (Ollama)

Reasoning:

  • HIPAA compliance is far simpler when patient data never leaves their infrastructure
  • Medium query volume (~5K/day)
  • Acceptable latency (<1s)
  • CapEx budget available

Example 2: E-Commerce Site

Scenario: Product recommendation RAG

Choice: Cloud (Bedrock)

Reasoning:

  • Public product data
  • Variable traffic (Black Friday spikes)
  • Need best accuracy for conversions
  • Small engineering team
  • OpEx budget preferred

Example 3: Enterprise Knowledge Base

Scenario: Internal documentation search

Choice: Hybrid

Reasoning:

  • Mix of public and confidential docs
  • High volume (~50K/day)
  • Complex queries need best models
  • Large engineering team
  • Both CapEx and OpEx budgets

Migration Path

graph LR
    A[Start: Proof of Concept] --> B[Local Ollama]
    B --> C{Validate Use Case}
    C -->|Success| D{Scale Needed?}
    C -->|Failure| E[End]
    
    D -->|Yes| F[Add Cloud for Complex]
    D -->|No| G[Stay Local]
    
    F --> H[Hybrid Production]
    G --> I[Local Production]

Recommendation: Start local, add cloud as needed.

Key Takeaways

  1. Local (Ollama): Privacy, cost-effective at scale, more complexity
  2. Cloud (Bedrock): Best quality, managed, pay-per-use
  3. Hybrid: Flexibility, optimize cost and quality
  4. Decision factors: Privacy, volume, budget, team size

In the next lesson, we'll explore the detailed trade-offs in cost, latency, privacy, and performance.
