
Cost and Latency Considerations
Optimize the ROI of your Claude-based RAG system by balancing model choice, token count, and performance.
Building a RAG proof-of-concept is easy; building a cost-effective, low-latency production system is hard. When serving Claude (via the Anthropic API or AWS Bedrock), you must manage two variables at once: dollars and milliseconds.
Model Selection Trade-offs
| Model | Cost (input / output, per million tokens) | Latency | Use Case |
|---|---|---|---|
| Claude 3 Haiku | $0.25 / $1.25 | Fastest | Simple Q&A, summarization |
| Claude 3.5 Sonnet | $3 / $15 | Fast | Production RAG (best balance) |
| Claude 3 Opus | $15 / $75 | Slowest | Deep reasoning, complex legal analysis |
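One practical consequence of the table: you rarely need a single model for all traffic. Below is a minimal routing sketch; the model IDs and the `classify_complexity()` heuristic are illustrative assumptions, not a recommended policy.

```python
# A minimal model-routing sketch: send cheap traffic to Haiku and reserve
# stronger models for harder queries. Model IDs and the heuristic are
# placeholders; check Anthropic's docs for current model IDs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODELS = {
    "simple": "claude-3-haiku-20240307",
    "balanced": "claude-3-5-sonnet-20240620",
    "complex": "claude-3-opus-20240229",
}

def classify_complexity(question: str) -> str:
    """Naive heuristic: long, multi-part questions go to a stronger model."""
    if len(question) > 500 or "compare" in question.lower():
        return "complex"
    if len(question) > 150:
        return "balanced"
    return "simple"

def answer(question: str, context: str) -> str:
    model = MODELS[classify_complexity(question)]
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text
```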
Latency Drivers in RAG
- Preprocessing: OCR and document parsing can add 1-5 seconds.
- Retrieval: a Chroma vector search usually takes 10-100 ms.
- Re-Ranking: can add 500 ms to 2 s, depending on document count.
- Generation: governed by time to first token (TTFT) and throughput (tokens per second); for long answers this usually dominates end-to-end latency. To find your own bottleneck, time each stage, as in the sketch below.
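Here is a minimal per-stage timing sketch using only the standard library. The `retrieve`, `rerank`, and `generate` arguments stand in for your own pipeline functions; only the measurement pattern is the point.

```python
# Time each RAG stage separately so you know which one to optimize.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000  # milliseconds

def run_query(question: str, retrieve, rerank, generate) -> dict:
    timings: dict[str, float] = {}
    with timed("retrieval", timings):
        docs = retrieve(question)
    with timed("rerank", timings):
        docs = rerank(question, docs)
    with timed("generation", timings):
        answer = generate(question, docs)
    timings["total"] = sum(timings.values())
    return {"answer": answer, "timings_ms": timings}
```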
Cost Optimization Techniques
- Prompt Caching: cached input tokens are billed at roughly 10% of the base input price (cache writes cost slightly more), so stable, repeated contexts like system prompts and shared documents earn close to a 90% discount (see the sketch after this list).
- Shortening History: truncate chat history to the last 3-5 exchanges instead of resending the whole conversation.
- Smart Chunking: retrieve only the top 3 highly relevant chunks instead of 10 mediocre ones.
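Here is a minimal prompt-caching sketch against the Anthropic Messages API, following the documented pattern of marking a stable content block with `cache_control`. Note that only prompts above a minimum size (roughly 1,024 tokens for Sonnet) are cacheable; verify the details for your SDK version.

```python
# Mark the large, stable RAG context for caching so that repeat queries
# reuse it at the reduced cache-read rate.
import anthropic

client = anthropic.Anthropic()

def ask(question: str, shared_context: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=[
            {"type": "text",
             "text": "Answer strictly from the provided documents."},
            # The big, reusable block: written to the cache on the first
            # request, then billed at the cache-read rate on repeats.
            {"type": "text",
             "text": shared_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```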
Pricing Models (AWS Bedrock)
- On-Demand: pay per token. Best for variable or unpredictable traffic.
- Provisioned Throughput: commit to dedicated model units at a fixed hourly rate. Essential for consistent, high-volume production; a break-even sketch follows this list.
- Batch Inference: submit large sets of non-urgent queries asynchronously at a discount (historically around 50% of on-demand rates; check current pricing).
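A back-of-envelope way to compare the first two options is to compute the break-even daily volume. Every rate in this sketch is a placeholder assumption; substitute current Bedrock pricing for your model and region.

```python
# Break-even sketch: above what daily token volume does a provisioned
# unit beat on-demand billing? All prices below are placeholders.
ON_DEMAND_PER_MTOK = 3.00     # assumed $/million input tokens
PROVISIONED_PER_HOUR = 40.00  # assumed $/model-unit/hour (placeholder)

def breakeven_tokens_per_day(on_demand_per_mtok: float = ON_DEMAND_PER_MTOK,
                             provisioned_per_hour: float = PROVISIONED_PER_HOUR) -> float:
    """Daily token volume above which a provisioned unit is cheaper."""
    daily_provisioned_cost = provisioned_per_hour * 24
    return daily_provisioned_cost / on_demand_per_mtok * 1_000_000

print(f"Break-even: {breakeven_tokens_per_day():,.0f} tokens/day")
# With these placeholder rates: 320,000,000 tokens/day.
```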
Real-World Math
If each RAG query sends 5,000 tokens of context:
- 1,000 queries = 5 million input tokens.
- At $3 per million input tokens (Sonnet), that's $15.
- At 100,000 queries per day, that's $1,500/day on input tokens alone; output tokens ($15 per million on Sonnet) come on top. Optimization is not optional.
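The same arithmetic as a reusable helper, extended to include output tokens. Prices default to the Sonnet list rates from the table above; the per-query token counts are assumptions you should replace with measured values.

```python
# Daily cost estimate for a RAG workload, counting input and output tokens.
def daily_cost(queries_per_day: int,
               input_tokens_per_query: int = 5_000,
               output_tokens_per_query: int = 500,
               input_price_per_mtok: float = 3.00,
               output_price_per_mtok: float = 15.00) -> float:
    input_cost = queries_per_day * input_tokens_per_query / 1e6 * input_price_per_mtok
    output_cost = queries_per_day * output_tokens_per_query / 1e6 * output_price_per_mtok
    return input_cost + output_cost

print(f"${daily_cost(100_000):,.2f}/day")  # $2,250.00 with these assumptions
```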
Exercises
- Calculate the monthly cost of a RAG app for 500 employees, assuming each employee asks 5 questions per day.
- If you switch from Sonnet to Haiku, how much do you save?
- Use a tool like LangSmith to trace a single query. Where is the most time being spent?