
Cost and Latency Considerations
Optimize the ROI of your Claude-based RAG system by balancing model choice, token count, and performance.
Building a RAG proof-of-concept is easy; building a cost-effective, low-latency production system is hard. When serving Claude (via the Anthropic API or AWS Bedrock), you must manage two variables at once: dollars and milliseconds.
Model Selection Trade-offs
| Model | Cost (input / output, per million tokens) | Latency | Use Case |
|---|---|---|---|
| Claude 3 Haiku | $0.25 / $1.25 | Fastest | Simple Q&A, summarization |
| Claude 3.5 Sonnet | $3 / $15 | Fast | Production RAG (best balance) |
| Claude 3 Opus | $15 / $75 | Slowest | Deep reasoning, complex legal analysis |
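One practical consequence of the table: you rarely need a single model for all traffic. Below is a minimal routing sketch; the model IDs and the `classify_complexity()` heuristic are illustrative assumptions, not a recommended policy.

```python
# A minimal model-routing sketch: send cheap traffic to Haiku and reserve
# stronger models for harder queries. Model IDs and the heuristic are
# placeholders; check Anthropic's docs for current model IDs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODELS = {
    "simple": "claude-3-haiku-20240307",
    "balanced": "claude-3-5-sonnet-20240620",
    "complex": "claude-3-opus-20240229",
}

def classify_complexity(question: str) -> str:
    """Naive heuristic: long, multi-part questions go to a stronger model."""
    if len(question) > 500 or "compare" in question.lower():
        return "complex"
    if len(question) > 150:
        return "balanced"
    return "simple"

def answer(question: str, context: str) -> str:
    model = MODELS[classify_complexity(question)]
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text
```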
Latency Drivers in RAG
- Preprocessing: OCR and document parsing can add 1-5 seconds.
- Retrieval: a Chroma vector search usually takes 10-100 ms.
- Re-Ranking: can add 500 ms to 2 s, depending on document count.
- Generation: governed by time to first token (TTFT) and throughput (tokens per second); for long answers this usually dominates end-to-end latency. To find your own bottleneck, time each stage, as in the sketch below.
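Here is a minimal per-stage timing sketch using only the standard library. The `retrieve`, `rerank`, and `generate` arguments stand in for your own pipeline functions; only the measurement pattern is the point.

```python
# Time each RAG stage separately so you know which one to optimize.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000  # milliseconds

def run_query(question: str, retrieve, rerank, generate) -> dict:
    timings: dict[str, float] = {}
    with timed("retrieval", timings):
        docs = retrieve(question)
    with timed("rerank", timings):
        docs = rerank(question, docs)
    with timed("generation", timings):
        answer = generate(question, docs)
    timings["total"] = sum(timings.values())
    return {"answer": answer, "timings_ms": timings}
```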
Cost Optimization Techniques
- Prompt Caching: cached input tokens are billed at roughly 10% of the base input price (cache writes cost slightly more), so stable, repeated contexts like system prompts and shared documents earn close to a 90% discount (see the sketch after this list).
- Shortening History: truncate chat history to the last 3-5 exchanges instead of resending the whole conversation.
- Smart Chunking: retrieve only the top 3 highly relevant chunks instead of 10 mediocre ones.
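Here is a minimal prompt-caching sketch against the Anthropic Messages API, following the documented pattern of marking a stable content block with `cache_control`. Note that only prompts above a minimum size (roughly 1,024 tokens for Sonnet) are cacheable; verify the details for your SDK version.

```python
# Mark the large, stable RAG context for caching so that repeat queries
# reuse it at the reduced cache-read rate.
import anthropic

client = anthropic.Anthropic()

def ask(question: str, shared_context: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=[
            {"type": "text",
             "text": "Answer strictly from the provided documents."},
            # The big, reusable block: written to the cache on the first
            # request, then billed at the cache-read rate on repeats.
            {"type": "text",
             "text": shared_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```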
Pricing Models (AWS Bedrock)
- On-Demand: pay per token. Best for variable or unpredictable traffic.
- Provisioned Throughput: commit to dedicated model units at a fixed hourly rate. Essential for consistent, high-volume production; a break-even sketch follows this list.
- Batch Inference: submit large sets of non-urgent queries asynchronously at a discount (historically around 50% of on-demand rates; check current pricing).
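A back-of-envelope way to compare the first two options is to compute the break-even daily volume. Every rate in this sketch is a placeholder assumption; substitute current Bedrock pricing for your model and region.

```python
# Break-even sketch: above what daily token volume does a provisioned
# unit beat on-demand billing? All prices below are placeholders.
ON_DEMAND_PER_MTOK = 3.00     # assumed $/million input tokens
PROVISIONED_PER_HOUR = 40.00  # assumed $/model-unit/hour (placeholder)

def breakeven_tokens_per_day(on_demand_per_mtok: float = ON_DEMAND_PER_MTOK,
                             provisioned_per_hour: float = PROVISIONED_PER_HOUR) -> float:
    """Daily token volume above which a provisioned unit is cheaper."""
    daily_provisioned_cost = provisioned_per_hour * 24
    return daily_provisioned_cost / on_demand_per_mtok * 1_000_000

print(f"Break-even: {breakeven_tokens_per_day():,.0f} tokens/day")
# With these placeholder rates: 320,000,000 tokens/day.
```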
Real-World Math
If each RAG query sends 5,000 tokens of context:
- 1,000 queries = 5 million input tokens.
- At $3 per million input tokens (Sonnet), that's $15.
- At 100,000 queries per day, that's $1,500/day on input tokens alone; output tokens ($15 per million on Sonnet) come on top. Optimization is not optional.
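The same arithmetic as a reusable helper, extended to include output tokens. Prices default to the Sonnet list rates from the table above; the per-query token counts are assumptions you should replace with measured values.

```python
# Daily cost estimate for a RAG workload, counting input and output tokens.
def daily_cost(queries_per_day: int,
               input_tokens_per_query: int = 5_000,
               output_tokens_per_query: int = 500,
               input_price_per_mtok: float = 3.00,
               output_price_per_mtok: float = 15.00) -> float:
    input_cost = queries_per_day * input_tokens_per_query / 1e6 * input_price_per_mtok
    output_cost = queries_per_day * output_tokens_per_query / 1e6 * output_price_per_mtok
    return input_cost + output_cost

print(f"${daily_cost(100_000):,.2f}/day")  # $2,250.00 with these assumptions
```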
Exercises
- Calculate the monthly cost of a RAG app for 500 employees, assuming each employee asks 5 questions per day.
- If you switch from Sonnet to Haiku, how much do you save?
- Use a tool like LangSmith to trace a single query. Where is the most time being spent?