
Prompt Routing: The Traffic Controller
Master the architecture of 'Dynamic Routing'. Learn how to build a router that sends simple queries to cheap models and complex queries to experts.
How do you know when a user's question is "Simple" vs. "Complex"? If the user says "Hello," you don't need GPT-4o. If the user says "Explain the security vulnerabilities in this C++ kernel module," you definitely do.
Prompt Routing is the automated logic that inspects a query and dispatches it to the most efficient model for that specific job.
In this lesson, we master Dynamic Routing Architectures. We’ll explore Semantic Routers, Keyword Routers, and the "Cascading Model" pattern.
1. The Semantic Router (The Intelligence Filter)
A semantic router uses a tiny embedding model (Module 8) to compare the user's query against "Known Intent Patterns."
- Pattern A: "Greeting/Small Talk" -> Route to Small Local Model.
- Pattern B: "Research Request" -> Route to Search Agent.
- Pattern C: "Code Review" -> Route to GPT-4o.
graph TD
U[User Query] --> R{Semantic Router}
R -->|Chat| L1[Llama 3 8B]
R -->|Logic| L2[GPT-4o mini]
R -->|Coding| L3[Claude 3.5 Sonnet]
style R fill:#f96,stroke-width:4px
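The routing logic above can be sketched in a few lines. This is a minimal, self-contained sketch: a toy bag-of-words "embedding" stands in for a real embedding model so the example runs without API keys, and all route names and sample phrases are illustrative.

```python
# Minimal semantic-router sketch. A real router would call a small
# embedding model; the toy word-count "embedding" here is a stand-in.
from collections import Counter
import math

ROUTES = {
    "chitchat": ["hi", "hello", "how are you", "good morning"],
    "coding":   ["write a function", "fix this bug", "review my code"],
}

def embed(text):
    # Toy embedding: word counts. Swap in a real embedding model in production.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def route(query):
    q = embed(query)
    # Score each route by its best-matching sample phrase.
    scores = {
        name: max(cosine(q, embed(s)) for s in samples)
        for name, samples in ROUTES.items()
    }
    return max(scores, key=scores.get)

print(route("hello there, how are you"))          # chitchat
print(route("please fix this bug in my function"))  # coding
```

The key design choice is scoring each route by its *best* matching sample rather than an average, so one strong match is enough to claim the query.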
2. The Keyword "Fast-Path"
Sometimes, you don't even need embeddings.
If a user's query contains http:// or www., they want a search. If it contains def or class, they are coding.
Regex Routing is the cheapest possible form of orchestration. It uses Zero Tokens and Zero Milliseconds to make a decision.
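A regex fast-path can sit in front of the semantic router as a first check. Here is a minimal sketch; the route names and fallback behavior are illustrative.

```python
# Zero-token "fast-path": route on obvious patterns before touching any model.
import re

FAST_PATHS = [
    (re.compile(r"https?://|www\."), "search"),       # URLs -> search agent
    (re.compile(r"\b(def|class|import)\b"), "coding"),  # code keywords -> coding model
]

def fast_path(query):
    for pattern, route in FAST_PATHS:
        if pattern.search(query):
            return route
    return None  # no obvious match: fall through to the semantic router

print(fast_path("summarize https://example.com"))  # search
print(fast_path("def add(a, b): return a + b"))    # coding
print(fast_path("tell me a joke"))                 # None
```

Returning None on a miss lets the caller fall through to the (slower, token-spending) semantic layer only when regex can't decide.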
3. Implementation: The Cascading Model Pattern
A "Cascade" is a series of models where each one either answers the query or "Elevates" it to the next, more expensive tier.
Python Code: The Router Wrapper
def cascade_completion(query):
    # 1. Tier 1: try the cheapest model first.
    # Use a tiny prompt to check if the mini model can handle it.
    res = call_mini_model("Can you answer this simple fact? Query: " + query)
    if "YES" in res:
        # 2. Complete with the cheap model.
        return call_mini_model(query)
    # 3. Tier 2: escalate to the expert.
    return call_expert_model(query)
Token Efficiency: You spend 20 tokens to see if you can save 2,000 tokens. This is a 100:1 ROI gamble that pays off in 90% of user sessions.
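To see the cascade run end to end, here is the same wrapper with stub model calls filled in. The stubs are fake by design: `call_mini_model` pretends it can only answer short factual queries, which is enough to demonstrate the escalation path.

```python
# Runnable demo of the cascade with stub "models".
def call_mini_model(prompt):
    # Stub: treat short queries as simple facts the mini model can handle.
    if prompt.startswith("Can you answer"):
        query = prompt.split("Query: ", 1)[1]
        return "YES" if len(query.split()) <= 6 else "NO"
    return "[mini] answer to: " + prompt

def call_expert_model(prompt):
    return "[expert] answer to: " + prompt

def cascade_completion(query):
    res = call_mini_model("Can you answer this simple fact? Query: " + query)
    if "YES" in res:
        return call_mini_model(query)   # Tier 1: cheap model handles it
    return call_expert_model(query)     # Tier 2: escalate to the expert

print(cascade_completion("What is 2+2"))
print(cascade_completion("Explain the security vulnerabilities in this C++ kernel module"))
```

The first query stays on the cheap tier; the second escalates to the expert.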
4. Libraries for Routing: Semantic Router
There are specialized libraries like semantic-router that manage this process for you using local vector search.
from semantic_router import Route
from semantic_router.encoders import OpenAIEncoder
from semantic_router.layer import RouteLayer

# Define our routes (utterances are example phrases for each intent)
chitchat = Route(name="chitchat", utterances=["hi", "how are you", "hello"])
coding = Route(name="coding", utterances=["write a function", "fix this bug"])

layer = RouteLayer(encoder=OpenAIEncoder(), routes=[chitchat, coding])

# ROUTE THE QUERY
route = layer("hello there!")
print(route.name)  # "chitchat" -> now call the tiny model!
5. Token Savings: Yearly Projection
In a production app with 10k users:
- No Routing: 100M tokens on GPT-4o = $30,000.
- With Routing: 80M tokens on Flash, 20M on GPT-4o = $6,100.
- Savings: 80%.
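The projection above can be reproduced with simple arithmetic. The per-million-token prices below are the illustrative figures implied by the lesson's numbers, not quoted vendor pricing.

```python
# Reproducing the yearly projection. Prices are illustrative, back-derived
# from the figures in this lesson (not real vendor pricing).
PRICE_PER_M = {"gpt4o": 300.0, "flash": 1.25}  # $ per 1M tokens

def cost(tokens_m, model):
    return tokens_m * PRICE_PER_M[model]

no_routing = cost(100, "gpt4o")
with_routing = cost(80, "flash") + cost(20, "gpt4o")
savings = 1 - with_routing / no_routing

print(f"No routing:   ${no_routing:,.0f}")    # $30,000
print(f"With routing: ${with_routing:,.0f}")  # $6,100
print(f"Savings: {savings:.0%}")              # 80%
```

Note that the cheap tier barely registers in the total: 80M tokens on the cheap model cost about as much as 0.3M tokens on the expert.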
6. Summary and Key Takeaways
- Classify first, Reason second: Spend tokens to decide which model to use.
- Regex is Free: Use traditional code to find obvious "Fast-Paths."
- Semantic Layers: Use tiny embeddings to map ambiguous queries to specific tiers.
- Cascading Logic: Start cheap and escalate only when necessary.
In the next lesson, Evaluating Model ROI for Specific Tasks, we look at how to measure the success of these routes.
Exercise: The Router Design
- Create a list of 10 random user queries.
- Manually route them to "Cheap" or "Expert."
- Check your reasoning:
- Did you route "What is 2+2" to the Expert? (If so, you failed).
- Did you route "Write a novel about space" to the Expert? (If so, you succeeded).
- Build a simple Python function that uses if "code" in query.lower() to route to a different model.
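One possible answer to that last step looks like this; the model names are placeholders.

```python
# Naive keyword router for the exercise. Model names are illustrative.
def pick_model(query):
    if "code" in query.lower():
        return "expert-coding-model"
    return "cheap-chat-model"

print(pick_model("Review this code for bugs"))  # expert-coding-model
print(pick_model("What is 2+2"))                # cheap-chat-model
```

It will misfire on queries like "what is a dress code?", which is exactly the gap the semantic router from section 1 is meant to close.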