
# Local vs Hosted Models (Ollama vs Bedrock)

Compare local and cloud model deployments for multimodal RAG systems and learn when to use each approach.

Choosing between local and cloud-hosted models is one of the most important architectural decisions for RAG systems.
## Deployment Models Overview

```mermaid
graph TD
    A[Model Deployment] --> B[Local/On-Premise]
    A --> C[Cloud-Hosted]
    B --> D[Ollama]
    B --> E[vLLM]
    B --> F[Self-Hosted LLaVA]
    C --> G[AWS Bedrock]
    C --> H[OpenAI API]
    C --> I[Anthropic API]
    style B fill:#d1ecf1
    style C fill:#fff3cd
```
## Local Models with Ollama

### What is Ollama?

Ollama is a framework for running LLMs locally with minimal setup.

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run a multimodal model
ollama run llava

# Or a text model
ollama run llama2
```
### Supported Multimodal Models

```python
# Available through Ollama
models = {
    "llava": "7B multimodal (text + vision)",
    "llava:13b": "Larger, more capable",
    "bakllava": "Optimized LLaVA variant",
    "moondream": "Tiny vision model (1.8B)"
}
```
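Ollama serves these models through a local REST API (by default at `http://localhost:11434`), and LLaVA requests attach images as base64 strings. The sketch below only builds the JSON payload for the `/api/generate` endpoint, so nothing goes over the network; the `build_llava_request` helper is our own name, not part of Ollama.

```python
import base64
import json

def build_llava_request(prompt: str, image_bytes: bytes, model: str = "llava"):
    """Build the JSON payload for Ollama's /api/generate endpoint.

    Images travel as base64-encoded strings; the Ollama server
    decodes them before passing them to the vision model.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one complete response instead of chunks
    }

payload = build_llava_request("Describe this chart.", b"\x89PNG...")
print(json.dumps(payload)[:60])
```

POSTing this payload to `http://localhost:11434/api/generate` with any HTTP client completes the round trip.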
### Local RAG Architecture

```mermaid
graph TD
    A[Data Sources] --> B[Local Ingestion]
    B --> C[Local Preprocessing]
    C --> D[Ollama Embeddings]
    D --> E[Chroma DB]
    F[User Query] --> E
    E --> G[Retrieved Context]
    G --> H[Ollama LLaVA]
    H --> I[Response]
    J[All On-Premise] -.-> B & C & D & E & H
    style J fill:#d4edda
```
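The flow above (embed → store → retrieve) can be sketched without any dependencies. The toy `embed` below is a word-count stand-in for a real Ollama embedding model, and the `index` list stands in for Chroma; the point is the shape of the pipeline, not the quality of the matching.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real local pipeline
    # would call an Ollama embedding model and store vectors in Chroma.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "GPU servers require careful VRAM monitoring",
    "Chroma stores embeddings for local retrieval",
    "Black Friday traffic spikes stress cloud budgets",
]
index = [(d, embed(d)) for d in docs]  # stand-in for the vector store

def retrieve(query: str, k: int = 1):
    q = embed(query)
    ranked = sorted(index, key=lambda p: cosine(q, p[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

print(retrieve("embeddings for retrieval"))
```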
## Advantages of Local Models

### 1. Complete Data Privacy

Your data → your hardware → your network. No external API calls; no data leaves your infrastructure.

Critical for:

- Healthcare (HIPAA)
- Finance (SOX, PCI-DSS)
- Legal (attorney-client privilege)
- Government (classified data)
- R&D (trade secrets)
### 2. No API Costs

```
Hardware:    $5,000-20,000 (one-time)
Electricity: ~$50-200/month

vs.

Cloud API:   $0.10-0.50 per query × 100K queries/month = $10,000-50,000/month
```

Break-even typically arrives in under six months for high-volume systems.
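The break-even claim is simple arithmetic: divide the one-time hardware cost by the monthly savings over the cloud bill. The figures fed in below are illustrative mid-points of the ranges above, not measurements.

```python
def breakeven_months(hardware_cost, monthly_electricity, monthly_cloud_bill):
    """Months until a one-time hardware purchase pays for itself
    versus a recurring cloud API bill."""
    monthly_savings = monthly_cloud_bill - monthly_electricity
    if monthly_savings <= 0:
        return float("inf")  # cloud is the cheaper option at this volume
    return hardware_cost / monthly_savings

# Mid-range figures from the estimates above (illustrative):
months = breakeven_months(12_500, 125, 10_000)
print(f"break-even in {months:.1f} months")
```

At low volumes the savings term shrinks or goes negative, which is exactly why the pay-per-use column wins for small deployments.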
### 3. No Rate Limits

- Bedrock: throttled, often in the range of 10-50 requests/second (quota-dependent)
- Ollama: limited only by your hardware
### 4. No Vendor Lock-in

- Switch models anytime
- No dependency on external services
- Control over model versions
### 5. Low Latency (Same Network)

- Cloud API: 500 ms-3 s (network + processing)
- Local: 100-500 ms (processing only)
## Disadvantages of Local Models

### 1. Lower Accuracy

```mermaid
graph LR
    A[Quality Spectrum] --> B[LLaVA 7B]
    A --> C[LLaVA 13B]
    A --> D[GPT-4V]
    A --> E[Claude 3.5]
    B --> F[Good]
    E --> G[Excellent]
    style B fill:#fff3cd
    style E fill:#d4edda
```

Open-source models typically lag 6-12 months behind frontier models.
### 2. Infrastructure Burden

You must manage:

- GPU servers (often multiple)
- Model storage (~10-50 GB per model)
- VRAM requirements (16-80 GB)
- Monitoring and updates
- High availability
- Scaling
### 3. Upfront Capital Expense

Minimum viable setup:

- GPU server: $5,000-10,000
- Storage: $1,000-3,000
- Networking: $500-2,000
- Total: $6,500-15,000 before the first query
### 4. Operational Complexity

```python
# Cloud (simple)
response = bedrock.invoke_model(...)

# vs.

# Local (complex), where you must also:
# - keep GPU drivers updated
# - monitor VRAM usage
# - handle model loading
# - implement request queuing
# - set up load balancing
# - configure auto-scaling
```
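One of those chores, request queuing, can be handled with stdlib primitives: a single worker drains a queue so a one-GPU box is never asked to run two generations at once. `fake_generate` below is a placeholder for the actual model call; a real deployment would add backpressure and error handling.

```python
import queue
import threading

def fake_generate(prompt):
    # Stand-in for a GPU-bound Ollama call; serializing requests
    # through one worker keeps the model from being overloaded.
    return f"answer to: {prompt}"

requests_q = queue.Queue()

def worker():
    while True:
        prompt, reply_q = requests_q.get()
        reply_q.put(fake_generate(prompt))
        requests_q.task_done()

threading.Thread(target=worker, daemon=True).start()

def ask(prompt, timeout=5):
    """Enqueue a request and block until the worker replies."""
    reply_q = queue.Queue()
    requests_q.put((prompt, reply_q))
    return reply_q.get(timeout=timeout)

print(ask("why queue requests at all?"))
```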
## Cloud Models with AWS Bedrock

### What is AWS Bedrock?

Bedrock is AWS's managed service for foundation models.

```python
import json

import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({...}),  # model-specific request, serialized to a JSON string
)
```
### Available Models

- Anthropic: Claude 3.5 Sonnet, Opus, Haiku
- Stability AI: image generation
- Cohere: embeddings and reranking
- Amazon Titan: text and embeddings
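Claude models on Bedrock accept multimodal input via the Anthropic Messages format, with images embedded as base64. The helper below (our own name, not an AWS API) only constructs the request body, so the example stays offline; the result is what you would pass as `body=` to `invoke_model`.

```python
import base64
import json

def build_claude_image_request(prompt: str, png_bytes: bytes, max_tokens: int = 512):
    """Build the Messages-format body that Claude models on Bedrock
    expect for a combined text + image request."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(png_bytes).decode("ascii")}},
                {"type": "text", "text": prompt},
            ],
        }],
    })

body = build_claude_image_request("What does this diagram show?", b"\x89PNG...")
print(json.loads(body)["messages"][0]["role"])
```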
### Cloud RAG Architecture

```mermaid
graph LR
    A[S3 Documents] --> B[Bedrock KB]
    B --> C[Automatic Indexing]
    C --> D[Vector Store]
    E[User Query] --> F[Bedrock Agent]
    F --> D
    D --> G[Claude 3.5]
    G --> H[Response]
    style B fill:#ff9800
    style G fill:#ff9800
```
## Advantages of Cloud Models

### 1. State-of-the-Art Quality

Claude 3.5 Sonnet is substantially more accurate than LLaVA, and hosted models improve continuously with no effort on your part.

### 2. Zero Infrastructure Management

No servers, no GPUs, no maintenance, no scaling worries.
### 3. Pay-Per-Use

You only pay for what you use:

- Low volume: $10-100/month
- High volume: $1,000-10,000/month
- No upfront investment

### 4. Instant Scalability

Going from 1 query/second to 1,000 queries/second is automatic, with no configuration.
### 5. Built-in Integrations

```mermaid
graph TD
    A[Bedrock] --> B[Knowledge Bases]
    A --> C[Agents]
    A --> D[Guardrails]
    A --> E[Model Evaluation]
    F[AWS Ecosystem] --> G[S3]
    F --> H[Lambda]
    F --> I[IAM]
    F --> J[CloudWatch]
```
## Disadvantages of Cloud Models

### 1. Data Privacy Concerns

Your data traverses the AWS network on its way to the model APIs. Not suitable for:

- Highly regulated data
- Trade secrets
- Customer PII (without encryption)
### 2. Ongoing Costs

High-volume systems add up fast:

```
100K queries/day × $0.20/query = $20K/day ≈ $600K/month
```
### 3. Vendor Lock-in

- Dependent on AWS uptime
- Subject to pricing changes
- API changes require code updates
- Region availability limitations
### 4. Latency (Network)

- Internet round trip: 50-200 ms
- Processing: 1-3 s
- Total: roughly 1-3.2 s
## Hybrid Approaches

### Architecture: Best of Both Worlds

```mermaid
graph TD
    A{Data Classification} --> B[Public/General]
    A --> C[Confidential/PII]
    B --> D[Cloud Bedrock]
    C --> E[Local Ollama]
    F{Query Complexity} --> G[Simple]
    F --> H[Complex]
    G --> E
    H --> D
    style D fill:#fff3cd
    style E fill:#d1ecf1
```

Strategy:

- Simple queries or sensitive data → local
- Complex analysis of public data → cloud
- Most queries → start local, escalate to cloud if needed
### Tiered Model System

```python
# Conceptual: tiered routing
def route_query(query, data):
    if is_sensitive(data):
        return ollama_local(query, data)
    elif is_complex(query):
        return bedrock_claude(query, data)
    else:
        return ollama_local(query, data)
```
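A runnable version of that conceptual router, with placeholder predicates: real systems would classify data via metadata tags or a dedicated classifier rather than the keyword checks assumed here, and the returned strings stand in for actual model calls.

```python
def is_sensitive(data: str) -> bool:
    # Hypothetical check; swap in metadata tags or a PII classifier.
    return any(tag in data.lower() for tag in ("ssn", "patient", "confidential"))

def is_complex(query: str) -> bool:
    # Crude proxy: long, multi-clause questions go to the stronger model.
    return len(query.split()) > 12 or "compare" in query.lower()

def route_query(query: str, data: str) -> str:
    if is_sensitive(data):
        return "local"   # privacy always wins
    if is_complex(query):
        return "cloud"   # pay for quality only when it's needed
    return "local"       # cheap default

print(route_query("summarize this", "patient record"))
print(route_query("compare Q3 and Q4 revenue drivers", "public earnings report"))
```

The ordering matters: the sensitivity check short-circuits before complexity, so confidential data never reaches the cloud branch.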
## Decision Framework

```mermaid
graph TD
    START{Start Here} --> Q1{Data privacy critical?}
    Q1 -->|Yes| LOCAL[Local Only]
    Q1 -->|No| Q2{High volume?}
    Q2 -->|"Yes: >10K/day"| Q3{Complex queries?}
    Q2 -->|"No: <10K/day"| CLOUD{Budget flexible?}
    Q3 -->|Yes| HYBRID[Hybrid Approach]
    Q3 -->|No| LOCAL
    CLOUD -->|Yes| FULLCLOUD[Cloud Only]
    CLOUD -->|No| HYBRID
    style LOCAL fill:#d1ecf1
    style FULLCLOUD fill:#fff3cd
    style HYBRID fill:#d4edda
```
### Decision Criteria Table

| Factor | Choose Local | Choose Cloud | Consider Hybrid |
|---|---|---|---|
| Data sensitivity | HIPAA, PCI-DSS | Public data | Mixed |
| Volume | >50K/day | <10K/day | 10K-50K/day |
| Budget | CapEx available | OpEx preferred | Flexible |
| Accuracy needs | Acceptable quality | Highest quality | Variable |
| Team size | DevOps team | Small team | Medium team |
| Latency | <200 ms | <3 s OK | Variable |
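One way to encode the table as code, with the volume thresholds taken directly from it. `choose_deployment` is a hypothetical helper; privacy deliberately short-circuits everything else, matching the decision flow.

```python
def choose_deployment(privacy_critical: bool, daily_queries: int,
                      budget_flexible: bool) -> str:
    """Encode the decision table: privacy trumps everything,
    then volume, then budget preference."""
    if privacy_critical:
        return "local"
    if daily_queries > 50_000:
        return "local"           # API bills dominate at this scale
    if daily_queries < 10_000:
        return "cloud" if budget_flexible else "hybrid"
    return "hybrid"              # the 10K-50K/day middle band

print(choose_deployment(True, 5_000, True))
print(choose_deployment(False, 100_000, True))
print(choose_deployment(False, 30_000, False))
```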
## Real-World Examples

### Example 1: Healthcare Startup

**Scenario:** Medical diagnosis assistant
**Choice:** Local (Ollama)
**Reasoning:**

- HIPAA requires that data not leave the infrastructure
- Medium query volume (~5K/day)
- Acceptable latency (<1 s)
- CapEx budget available
### Example 2: E-Commerce Site

**Scenario:** Product recommendation RAG
**Choice:** Cloud (Bedrock)
**Reasoning:**

- Public product data
- Variable traffic (Black Friday spikes)
- Need best accuracy for conversions
- Small engineering team
- OpEx budget preferred
### Example 3: Enterprise Knowledge Base

**Scenario:** Internal documentation search
**Choice:** Hybrid
**Reasoning:**

- Mix of public and confidential docs
- High volume (~50K/day)
- Complex queries need the best models
- Large engineering team
- Both CapEx and OpEx budgets
## Migration Path

```mermaid
graph LR
    A[Start: Proof of Concept] --> B[Local Ollama]
    B --> C{Validate Use Case}
    C -->|Success| D{Scale Needed?}
    C -->|Failure| E[End]
    D -->|Yes| F[Add Cloud for Complex]
    D -->|No| G[Stay Local]
    F --> H[Hybrid Production]
    G --> I[Local Production]
```

Recommendation: start local, add cloud as needed.
## Key Takeaways

- **Local (Ollama):** privacy, cost-effective at scale, more operational complexity
- **Cloud (Bedrock):** best quality, fully managed, pay-per-use
- **Hybrid:** flexibility to optimize both cost and quality
- **Decision factors:** privacy, volume, budget, team size

In the next lesson, we'll explore the detailed trade-offs in cost, latency, privacy, and performance.