
# Local vs Hosted Models (Ollama vs Bedrock)

Compare local and cloud model deployments for multimodal RAG systems and learn when to use each approach.

Choosing between local and cloud-hosted models is one of the most important architectural decisions for RAG systems.
## Deployment Models Overview

```mermaid
graph TD
    A[Model Deployment] --> B[Local/On-Premise]
    A --> C[Cloud-Hosted]
    B --> D[Ollama]
    B --> E[vLLM]
    B --> F[Self-Hosted LLaVA]
    C --> G[AWS Bedrock]
    C --> H[OpenAI API]
    C --> I[Anthropic API]
    style B fill:#d1ecf1
    style C fill:#fff3cd
```
## Local Models with Ollama

### What is Ollama?

Ollama is a framework for running LLMs locally with minimal setup.

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run a multimodal model
ollama run llava

# Or a text model
ollama run llama2
```
### Supported Multimodal Models

```python
# Available through Ollama
models = {
    "llava": "7B multimodal (text + vision)",
    "llava:13b": "Larger, more capable",
    "bakllava": "Optimized LLaVA variant",
    "moondream": "Tiny vision model (1.8B)"
}
```
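Ollama serves these models through a local REST API (by default at `http://localhost:11434`), and LLaVA requests attach images as base64 strings. The sketch below only builds the JSON payload for the `/api/generate` endpoint, so nothing goes over the network; the `build_llava_request` helper is our own name, not part of Ollama.

```python
import base64
import json

def build_llava_request(prompt: str, image_bytes: bytes, model: str = "llava"):
    """Build the JSON payload for Ollama's /api/generate endpoint.

    Images travel as base64-encoded strings; the Ollama server
    decodes them before passing them to the vision model.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one complete response instead of chunks
    }

payload = build_llava_request("Describe this chart.", b"\x89PNG...")
print(json.dumps(payload)[:60])
```

POSTing this payload to `http://localhost:11434/api/generate` with any HTTP client completes the round trip.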
### Local RAG Architecture

```mermaid
graph TD
    A[Data Sources] --> B[Local Ingestion]
    B --> C[Local Preprocessing]
    C --> D[Ollama Embeddings]
    D --> E[Chroma DB]
    F[User Query] --> E
    E --> G[Retrieved Context]
    G --> H[Ollama LLaVA]
    H --> I[Response]
    J[All On-Premise] -.-> B & C & D & E & H
    style J fill:#d4edda
```
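The flow above (embed → store → retrieve) can be sketched without any dependencies. The toy `embed` below is a word-count stand-in for a real Ollama embedding model, and the `index` list stands in for Chroma; the point is the shape of the pipeline, not the quality of the matching.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real local pipeline
    # would call an Ollama embedding model and store vectors in Chroma.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "GPU servers require careful VRAM monitoring",
    "Chroma stores embeddings for local retrieval",
    "Black Friday traffic spikes stress cloud budgets",
]
index = [(d, embed(d)) for d in docs]  # stand-in for the vector store

def retrieve(query: str, k: int = 1):
    q = embed(query)
    ranked = sorted(index, key=lambda p: cosine(q, p[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

print(retrieve("embeddings for retrieval"))
```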
## Advantages of Local Models

### 1. Complete Data Privacy

Your data → your hardware → your network. No external API calls; no data leaves your infrastructure.

Critical for:

- Healthcare (HIPAA)
- Finance (SOX, PCI-DSS)
- Legal (attorney-client privilege)
- Government (classified data)
- R&D (trade secrets)
### 2. No API Costs

```
Hardware:    $5,000-20,000 (one-time)
Electricity: ~$50-200/month

vs.

Cloud API:   $0.10-0.50 per query × 100K queries/month = $10,000-50,000/month
```

Break-even typically arrives in under six months for high-volume systems.
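The break-even claim is simple arithmetic: divide the one-time hardware cost by the monthly savings over the cloud bill. The figures fed in below are illustrative mid-points of the ranges above, not measurements.

```python
def breakeven_months(hardware_cost, monthly_electricity, monthly_cloud_bill):
    """Months until a one-time hardware purchase pays for itself
    versus a recurring cloud API bill."""
    monthly_savings = monthly_cloud_bill - monthly_electricity
    if monthly_savings <= 0:
        return float("inf")  # cloud is the cheaper option at this volume
    return hardware_cost / monthly_savings

# Mid-range figures from the estimates above (illustrative):
months = breakeven_months(12_500, 125, 10_000)
print(f"break-even in {months:.1f} months")
```

At low volumes the savings term shrinks or goes negative, which is exactly why the pay-per-use column wins for small deployments.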
### 3. No Rate Limits

- Bedrock: throttled, often in the range of 10-50 requests/second (quota-dependent)
- Ollama: limited only by your hardware
### 4. No Vendor Lock-in

- Switch models anytime
- No dependency on external services
- Control over model versions
### 5. Low Latency (Same Network)

- Cloud API: 500 ms-3 s (network + processing)
- Local: 100-500 ms (processing only)
## Disadvantages of Local Models

### 1. Lower Accuracy

```mermaid
graph LR
    A[Quality Spectrum] --> B[LLaVA 7B]
    A --> C[LLaVA 13B]
    A --> D[GPT-4V]
    A --> E[Claude 3.5]
    B --> F[Good]
    E --> G[Excellent]
    style B fill:#fff3cd
    style E fill:#d4edda
```

Open-source models typically lag 6-12 months behind frontier models.
### 2. Infrastructure Burden

You must manage:

- GPU servers (often multiple)
- Model storage (~10-50 GB per model)
- VRAM requirements (16-80 GB)
- Monitoring and updates
- High availability
- Scaling
### 3. Upfront Capital Expense

Minimum viable setup:

- GPU server: $5,000-10,000
- Storage: $1,000-3,000
- Networking: $500-2,000
- Total: $6,500-15,000 before the first query
### 4. Operational Complexity

```python
# Cloud (simple)
response = bedrock.invoke_model(...)

# vs.

# Local (complex), where you must also:
# - keep GPU drivers updated
# - monitor VRAM usage
# - handle model loading
# - implement request queuing
# - set up load balancing
# - configure auto-scaling
```
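One of those chores, request queuing, can be handled with stdlib primitives: a single worker drains a queue so a one-GPU box is never asked to run two generations at once. `fake_generate` below is a placeholder for the actual model call; a real deployment would add backpressure and error handling.

```python
import queue
import threading

def fake_generate(prompt):
    # Stand-in for a GPU-bound Ollama call; serializing requests
    # through one worker keeps the model from being overloaded.
    return f"answer to: {prompt}"

requests_q = queue.Queue()

def worker():
    while True:
        prompt, reply_q = requests_q.get()
        reply_q.put(fake_generate(prompt))
        requests_q.task_done()

threading.Thread(target=worker, daemon=True).start()

def ask(prompt, timeout=5):
    """Enqueue a request and block until the worker replies."""
    reply_q = queue.Queue()
    requests_q.put((prompt, reply_q))
    return reply_q.get(timeout=timeout)

print(ask("why queue requests at all?"))
```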
## Cloud Models with AWS Bedrock

### What is AWS Bedrock?

Bedrock is AWS's managed service for foundation models.

```python
import json

import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({...}),  # model-specific request, serialized to a JSON string
)
```
### Available Models

- Anthropic: Claude 3.5 Sonnet, Opus, Haiku
- Stability AI: image generation
- Cohere: embeddings and reranking
- Amazon Titan: text and embeddings
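Claude models on Bedrock accept multimodal input via the Anthropic Messages format, with images embedded as base64. The helper below (our own name, not an AWS API) only constructs the request body, so the example stays offline; the result is what you would pass as `body=` to `invoke_model`.

```python
import base64
import json

def build_claude_image_request(prompt: str, png_bytes: bytes, max_tokens: int = 512):
    """Build the Messages-format body that Claude models on Bedrock
    expect for a combined text + image request."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(png_bytes).decode("ascii")}},
                {"type": "text", "text": prompt},
            ],
        }],
    })

body = build_claude_image_request("What does this diagram show?", b"\x89PNG...")
print(json.loads(body)["messages"][0]["role"])
```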
### Cloud RAG Architecture

```mermaid
graph LR
    A[S3 Documents] --> B[Bedrock KB]
    B --> C[Automatic Indexing]
    C --> D[Vector Store]
    E[User Query] --> F[Bedrock Agent]
    F --> D
    D --> G[Claude 3.5]
    G --> H[Response]
    style B fill:#ff9800
    style G fill:#ff9800
```
## Advantages of Cloud Models

### 1. State-of-the-Art Quality

Claude 3.5 Sonnet is substantially more accurate than LLaVA, and hosted models improve continuously with no effort on your part.

### 2. Zero Infrastructure Management

No servers, no GPUs, no maintenance, no scaling worries.
### 3. Pay-Per-Use

You only pay for what you use:

- Low volume: $10-100/month
- High volume: $1,000-10,000/month
- No upfront investment

### 4. Instant Scalability

Going from 1 query/second to 1,000 queries/second is automatic, with no configuration.
### 5. Built-in Integrations

```mermaid
graph TD
    A[Bedrock] --> B[Knowledge Bases]
    A --> C[Agents]
    A --> D[Guardrails]
    A --> E[Model Evaluation]
    F[AWS Ecosystem] --> G[S3]
    F --> H[Lambda]
    F --> I[IAM]
    F --> J[CloudWatch]
```
## Disadvantages of Cloud Models

### 1. Data Privacy Concerns

Your data traverses the AWS network on its way to the model APIs. Not suitable for:

- Highly regulated data
- Trade secrets
- Customer PII (without encryption)
### 2. Ongoing Costs

High-volume systems add up fast:

```
100K queries/day × $0.20/query = $20K/day ≈ $600K/month
```
### 3. Vendor Lock-in

- Dependent on AWS uptime
- Subject to pricing changes
- API changes require code updates
- Region availability limitations
### 4. Latency (Network)

- Internet round trip: 50-200 ms
- Processing: 1-3 s
- Total: roughly 1-3.2 s
## Hybrid Approaches

### Architecture: Best of Both Worlds

```mermaid
graph TD
    A{Data Classification} --> B[Public/General]
    A --> C[Confidential/PII]
    B --> D[Cloud Bedrock]
    C --> E[Local Ollama]
    F{Query Complexity} --> G[Simple]
    F --> H[Complex]
    G --> E
    H --> D
    style D fill:#fff3cd
    style E fill:#d1ecf1
```

Strategy:

- Simple queries or sensitive data → local
- Complex analysis of public data → cloud
- Most queries → start local, escalate to cloud if needed
### Tiered Model System

```python
# Conceptual: tiered routing
def route_query(query, data):
    if is_sensitive(data):
        return ollama_local(query, data)
    elif is_complex(query):
        return bedrock_claude(query, data)
    else:
        return ollama_local(query, data)
```
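A runnable version of that conceptual router, with placeholder predicates: real systems would classify data via metadata tags or a dedicated classifier rather than the keyword checks assumed here, and the returned strings stand in for actual model calls.

```python
def is_sensitive(data: str) -> bool:
    # Hypothetical check; swap in metadata tags or a PII classifier.
    return any(tag in data.lower() for tag in ("ssn", "patient", "confidential"))

def is_complex(query: str) -> bool:
    # Crude proxy: long, multi-clause questions go to the stronger model.
    return len(query.split()) > 12 or "compare" in query.lower()

def route_query(query: str, data: str) -> str:
    if is_sensitive(data):
        return "local"   # privacy always wins
    if is_complex(query):
        return "cloud"   # pay for quality only when it's needed
    return "local"       # cheap default

print(route_query("summarize this", "patient record"))
print(route_query("compare Q3 and Q4 revenue drivers", "public earnings report"))
```

The ordering matters: the sensitivity check short-circuits before complexity, so confidential data never reaches the cloud branch.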
## Decision Framework

```mermaid
graph TD
    START{Start Here} --> Q1{Data privacy critical?}
    Q1 -->|Yes| LOCAL[Local Only]
    Q1 -->|No| Q2{High volume?}
    Q2 -->|"Yes: >10K/day"| Q3{Complex queries?}
    Q2 -->|"No: <10K/day"| CLOUD{Budget flexible?}
    Q3 -->|Yes| HYBRID[Hybrid Approach]
    Q3 -->|No| LOCAL
    CLOUD -->|Yes| FULLCLOUD[Cloud Only]
    CLOUD -->|No| HYBRID
    style LOCAL fill:#d1ecf1
    style FULLCLOUD fill:#fff3cd
    style HYBRID fill:#d4edda
```
### Decision Criteria Table

| Factor | Choose Local | Choose Cloud | Consider Hybrid |
|---|---|---|---|
| Data sensitivity | HIPAA, PCI-DSS | Public data | Mixed |
| Volume | >50K/day | <10K/day | 10K-50K/day |
| Budget | CapEx available | OpEx preferred | Flexible |
| Accuracy needs | Acceptable quality | Highest quality | Variable |
| Team size | DevOps team | Small team | Medium team |
| Latency | <200 ms | <3 s OK | Variable |
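One way to encode the table as code, with the volume thresholds taken directly from it. `choose_deployment` is a hypothetical helper; privacy deliberately short-circuits everything else, matching the decision flow.

```python
def choose_deployment(privacy_critical: bool, daily_queries: int,
                      budget_flexible: bool) -> str:
    """Encode the decision table: privacy trumps everything,
    then volume, then budget preference."""
    if privacy_critical:
        return "local"
    if daily_queries > 50_000:
        return "local"           # API bills dominate at this scale
    if daily_queries < 10_000:
        return "cloud" if budget_flexible else "hybrid"
    return "hybrid"              # the 10K-50K/day middle band

print(choose_deployment(True, 5_000, True))
print(choose_deployment(False, 100_000, True))
print(choose_deployment(False, 30_000, False))
```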
## Real-World Examples

### Example 1: Healthcare Startup

**Scenario:** Medical diagnosis assistant
**Choice:** Local (Ollama)
**Reasoning:**

- HIPAA requires that data not leave the infrastructure
- Medium query volume (~5K/day)
- Acceptable latency (<1 s)
- CapEx budget available
### Example 2: E-Commerce Site

**Scenario:** Product recommendation RAG
**Choice:** Cloud (Bedrock)
**Reasoning:**

- Public product data
- Variable traffic (Black Friday spikes)
- Need best accuracy for conversions
- Small engineering team
- OpEx budget preferred
### Example 3: Enterprise Knowledge Base

**Scenario:** Internal documentation search
**Choice:** Hybrid
**Reasoning:**

- Mix of public and confidential docs
- High volume (~50K/day)
- Complex queries need the best models
- Large engineering team
- Both CapEx and OpEx budgets
## Migration Path

```mermaid
graph LR
    A[Start: Proof of Concept] --> B[Local Ollama]
    B --> C{Validate Use Case}
    C -->|Success| D{Scale Needed?}
    C -->|Failure| E[End]
    D -->|Yes| F[Add Cloud for Complex]
    D -->|No| G[Stay Local]
    F --> H[Hybrid Production]
    G --> I[Local Production]
```

Recommendation: start local, add cloud as needed.
## Key Takeaways

- **Local (Ollama):** privacy, cost-effective at scale, more operational complexity
- **Cloud (Bedrock):** best quality, fully managed, pay-per-use
- **Hybrid:** flexibility to optimize both cost and quality
- **Decision factors:** privacy, volume, budget, team size

In the next lesson, we'll explore the detailed trade-offs in cost, latency, privacy, and performance.