Most AI agents use a single model for everything. The planning step, the tool calls, the summarization, the error recovery. This works for demos. In production, it's wasteful.
A planning step that requires deep reasoning doesn't need the same model as a JSON extraction step. A code generation task has different requirements than a classification task. Using Claude Opus 4.6 ($25/1M output tokens) to format a date string is like hiring a senior architect to paint a wall.
Here's how to build agents that route each step to the optimal model.
If you are working on the API layer rather than the agent layer, read Agent-First API Design and Why Teams Switch from Direct Model APIs to a Unified AI API alongside this page. Multi-model agents work best when the underlying API surface is stable enough to swap models without rewriting orchestration code.
The Multi-Model Agent Architecture
```
User Request
      │
      ▼
┌─────────────┐
│   Router    │  ← classifies task complexity
│ (fast model)│
└──────┬──────┘
       │
   ┌───┴───┐
   ▼       ▼
┌──────┐ ┌───────┐
│Simple│ │Complex│
│Model │ │Model  │
└──┬───┘ └───┬───┘
   │         │
   ▼         ▼
┌─────────────┐
│ Aggregator  │  ← combines results
│ (fast model)│
└─────────────┘
```
Three components:
- A router that classifies incoming tasks by complexity
- A pool of models matched to different task types
- An aggregator that combines results when needed
In practice, production agents usually need two more pieces:
- A fallback policy when the preferred model fails or slows down
- A telemetry layer that records model choice, latency, and cost per step
Without those two, a multi-model agent quickly turns into a black box with unpredictable behavior.
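As a sketch of those two extra pieces, here is a provider-agnostic fallback wrapper with per-attempt telemetry. The callables and record fields are illustrative assumptions, not any specific SDK's API; in a real agent, `primary` and `fallback` would wrap provider calls.

```python
import time

def call_with_fallback(primary, fallback, request, timeout_s=10.0, telemetry=None):
    """Try the preferred model first; fall back if it errors or is too slow.

    `primary` and `fallback` are callables (request -> response).
    `telemetry` collects one record per attempt so model choice,
    latency, and failures stay observable.
    """
    for name, fn in (("primary", primary), ("fallback", fallback)):
        start = time.monotonic()
        try:
            result = fn(request)
            latency = time.monotonic() - start
            if latency > timeout_s:
                # Treat a slow success as a failure so the fallback runs.
                raise TimeoutError(f"{name} took {latency:.1f}s")
            if telemetry is not None:
                telemetry.append({"model": name, "latency_s": latency, "ok": True})
            return result
        except Exception as exc:
            if telemetry is not None:
                telemetry.append({"model": name,
                                  "latency_s": time.monotonic() - start,
                                  "ok": False, "error": str(exc)})
    raise RuntimeError("all models failed")
```

The telemetry list is deliberately dumb: a dict per attempt is enough to answer "how often does the fallback fire, and what does it cost us?"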
Implementation with OpenAI SDK
Using a single API key through an aggregator, you can access all models without managing multiple SDKs:
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

# Model pool with cost/capability tiers
MODELS = {
    "router": "gpt-4.1-mini",          # $0.40/1M in - fast classification
    "simple": "gpt-4.1-mini",          # $0.40/1M in - extraction, formatting
    "reasoning": "claude-sonnet-4-6",  # $3.00/1M in - planning, analysis
    "complex": "gpt-4.1",              # $2.00/1M in - code gen, multi-step
    "budget": "deepseek-chat",         # $0.28/1M in - bulk processing
}

def route_task(task: str) -> str:
    """Use a cheap model to classify task complexity."""
    response = client.chat.completions.create(
        model=MODELS["router"],
        messages=[
            {"role": "system", "content": """Classify this task into one category:
- simple: data extraction, formatting, translation
- reasoning: analysis, planning, comparison
- complex: code generation, multi-step problem solving
- budget: bulk processing, non-critical tasks
Reply with just the category name."""},
            {"role": "user", "content": task}
        ],
        max_tokens=10
    )
    category = response.choices[0].message.content.strip().lower()
    return MODELS.get(category, MODELS["simple"])

def execute_task(task: str, context: str = "") -> str:
    """Route task to the appropriate model and execute."""
    model = route_task(task)
    messages = []
    if context:
        messages.append({"role": "system", "content": context})
    messages.append({"role": "user", "content": task})
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response.choices[0].message.content
```
Real-World Agent: Code Review Pipeline
Here's a practical multi-model agent that reviews pull requests:
```python
def review_pr(diff: str) -> dict:
    """Multi-model PR review pipeline."""
    # Step 1: Classify changes (cheap model)
    classification = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": f"Classify these code changes: {diff[:2000]}\n"
                       "Categories: bugfix, feature, refactor, docs, test"
        }],
        max_tokens=20
    ).choices[0].message.content

    # Step 2: Security scan (reasoning model)
    security = client.chat.completions.create(
        model="claude-sonnet-4-6",
        messages=[{
            "role": "system",
            "content": "You are a security reviewer. Check for: "
                       "SQL injection, XSS, auth bypass, secrets in code, "
                       "unsafe deserialization. Be specific about line numbers."
        }, {
            "role": "user",
            "content": f"Review this diff for security issues:\n{diff}"
        }]
    ).choices[0].message.content

    # Step 3: Code quality (general model)
    quality = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": f"Review code quality: naming, structure, "
                       f"error handling, test coverage.\n{diff}"
        }]
    ).choices[0].message.content

    # Step 4: Summary (cheap model)
    summary = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": f"Summarize this PR review in 3 bullet points:\n"
                       f"Type: {classification}\n"
                       f"Security: {security[:500]}\n"
                       f"Quality: {quality[:500]}"
        }]
    ).choices[0].message.content

    return {
        "classification": classification,
        "security": security,
        "quality": quality,
        "summary": summary
    }
```
Cost breakdown for a typical PR review (2K token diff):
| Step | Model | Input Tokens | Cost |
|---|---|---|---|
| Classify | GPT-4.1-mini | ~2,100 | $0.0008 |
| Security | Claude Sonnet 4.6 | ~2,500 | $0.0075 |
| Quality | GPT-4.1 | ~2,500 | $0.0050 |
| Summary | GPT-4.1-mini | ~1,200 | $0.0005 |
| Total | | ~8,300 | ~$0.014 |
Running all four steps on Claude Sonnet 4.6 would cost roughly $0.025 in input tokens alone (~8,300 tokens at $3.00/1M). The multi-model approach cuts that by about 45% while still using the strongest model where it matters most: the security review.
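To keep comparisons like this honest, it helps to encode the arithmetic. A minimal cost model, using the assumed input prices from the table above (not live rates):

```python
# Rough per-run cost model (input tokens only, prices in $ per 1M tokens).
PRICE_PER_M = {
    "gpt-4.1-mini": 0.40,
    "gpt-4.1": 2.00,
    "claude-sonnet-4-6": 3.00,
}

def run_cost(steps: list[tuple[str, int]]) -> float:
    """steps: (model, input_tokens) pairs -> total dollar cost."""
    return sum(PRICE_PER_M[model] * tokens / 1_000_000 for model, tokens in steps)

routed = run_cost([
    ("gpt-4.1-mini", 2_100),       # classify
    ("claude-sonnet-4-6", 2_500),  # security
    ("gpt-4.1", 2_500),            # quality
    ("gpt-4.1-mini", 1_200),       # summary
])
single = run_cost([("claude-sonnet-4-6", t) for t in (2_100, 2_500, 2_500, 1_200)])
```

Re-running this with your real token counts per step is the fastest sanity check on whether routing is still paying for itself.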
Routing by Capability, Not Just by Price
Many teams start multi-model routing with a simple rule: expensive tasks go to expensive models, cheap tasks go to cheap models.
That is a good first pass, but it is not enough.
A stronger routing policy looks at four dimensions:
- reasoning depth
- context length
- tool-use reliability
- latency sensitivity
That leads to better rules:
- planning and decomposition go to a reasoning-heavy model
- extraction and formatting go to a cheap, fast model
- code review goes to the model with the best bug-finding behavior
- repo-wide analysis goes to the model with the largest context window
This is the same reason the coding model comparison and the pricing comparison should inform your router rather than sit in a separate research folder.
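One way to sketch capability-aware routing is a small capability matrix plus a first-fit rule. The models, dimensions, and scores below are illustrative placeholders, not benchmark results:

```python
# Models listed in ascending cost order; scores are placeholder ratings
# (1-4) on the four routing dimensions, not measured benchmarks.
CAPABILITIES = {
    "gpt-4.1-mini":      {"reasoning": 2, "context": 2, "tools": 2, "speed": 3},
    "claude-sonnet-4-6": {"reasoning": 3, "context": 3, "tools": 3, "speed": 2},
    "gemini-2.5-pro":    {"reasoning": 3, "context": 4, "tools": 2, "speed": 2},
}

def pick_model(needs: dict[str, int]) -> str:
    """Return the cheapest model meeting every minimum requirement.

    Because CAPABILITIES is ordered by cost, the first match wins.
    """
    for model, caps in CAPABILITIES.items():
        if all(caps[dim] >= minimum for dim, minimum in needs.items()):
            return model
    raise ValueError("no model satisfies the requested capabilities")
```

So `pick_model({"speed": 3})` lands on the cheap model, while `pick_model({"context": 4})` routes repo-wide analysis to the long-context one, exactly the rules listed above.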
LangChain Integration
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Create model instances with different configs
fast = ChatOpenAI(
    model="gpt-4.1-mini",
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)
reasoning = ChatOpenAI(
    model="claude-sonnet-4-6",
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

# Use in LangChain chains
classify_chain = ChatPromptTemplate.from_template(
    "Classify: {input}"
) | fast

analyze_chain = ChatPromptTemplate.from_template(
    "Analyze in depth: {input}"
) | reasoning
```
When to Use Multi-Model Agents
Multi-model routing adds complexity. It's worth it when:
- Your agent handles diverse task types (not just chat)
- Monthly API costs exceed $100 (savings become meaningful)
- You need specific model strengths (Claude for code, Gemini for long context, GPT for speed)
- Latency matters for some steps but not others
For simple chatbots or single-purpose agents, a single model is fine. The overhead of routing isn't justified when every request needs the same capability.
The tipping point is usually one of these:
- you are paying for high-end reasoning on low-value tasks
- one provider's outages are now a real business risk
- context needs vary wildly across the workflow
- you need cheaper review / extraction / summarization stages around one expensive core stage
If none of those are true, a single model is still the right answer.
Common Failure Modes
Multi-model systems fail in predictable ways:
1. The router is too clever
If the router prompt becomes a giant taxonomy exercise, you spend too much on deciding what to do. Keep the router cheap and coarse.
2. Output contracts drift
One model returns clean JSON, another returns prose with a JSON block, and your downstream parser breaks. Use explicit schemas and validation at every handoff.
3. Fallback changes quality silently
Routing to a cheaper model during provider pressure can make the agent look flaky if the user sees a totally different quality profile. That is why rate limiting strategy belongs inside the design, not as an afterthought.
4. Cost reporting is missing
If you do not record per-step model choice, cost, and latency, you cannot tell whether the multi-model design is actually saving money.
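Failure mode 2 in particular is cheap to guard against. A minimal contract check at each handoff might look like this; the helper name and schema are assumptions, and a production version would use a proper schema validator:

```python
import json
import re

def parse_json_contract(raw: str, required_keys: set[str]) -> dict:
    """Validate a model response at a handoff boundary.

    Accepts either clean JSON or prose containing one JSON object,
    then checks the required keys are present. Raises ValueError
    otherwise, so the caller can retry or escalate to a stricter model.
    """
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

The point is that the downstream step never sees unvalidated prose: either it gets a dict with the agreed keys, or the failure is explicit and retryable.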
A Minimal Evaluation Loop
You do not need a giant eval platform to operate a multi-model agent responsibly.
Start with one sheet or one database table per run:
- user task category
- router decision
- final model used per step
- latency per step
- total token cost
- whether fallback was triggered
- whether the user accepted the answer
That gives you enough signal to answer the questions that matter:
- Is the router choosing the right expensive model often enough?
- Which step is consuming most of the budget?
- Are fallbacks rescuing runs or just hiding instability?
- Is the cheap path good enough for repetitive tasks?
This is also why a unified gateway helps. When model usage is spread across many providers, it is harder to assemble one comparable run ledger. When everything comes through one API layer, the instrumentation burden drops.
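A run ledger along these lines needs little more than a couple of dataclasses. This sketch assumes you fill in the records from your own orchestration code; the field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    step: str           # e.g. "classify", "security"
    model: str
    latency_s: float
    cost_usd: float
    fallback: bool = False

@dataclass
class RunLedger:
    category: str                                   # router decision
    steps: list[StepRecord] = field(default_factory=list)
    accepted: bool = False                          # did the user accept?

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.steps)

    def costliest_step(self) -> str:
        return max(self.steps, key=lambda s: s.cost_usd).step
```

One row per run in a database table built from these fields is enough to answer every question on the list above.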
Keep the Architecture Boring
The best multi-model agents do not feel exotic. They feel operationally boring.
That means:
- one stable request shape into your orchestration layer
- one place to define routing rules
- one place to inspect cost and latency
- one fallback policy per task family
- one source of truth for model allowlists
If your agent graph looks clever but your operators cannot explain why a request went to one model instead of another, the design is not finished.
When Not to Use a Multi-Model Agent
There are also clear cases where the simpler design wins.
Do not add routing just because the model catalog is large.
Stick to one model when:
- the product does one narrow task repeatedly
- quality differences between models are irrelevant to the user
- your traffic is too low for cost optimization to matter
- your ops surface is already under-instrumented
- you do not yet have evals strong enough to tell whether routing helped or hurt
A single well-chosen model with good retries, prompt hygiene, and observability often beats a flashy multi-model graph that nobody trusts.
The right question is not "can we route?" It is "does routing produce better quality, lower cost, or safer failure behavior for this workflow?"
If the answer is vague, keep the architecture simple until the workflow itself becomes more diverse.
Key Takeaways
- Use the cheapest model that handles each step well
- Reserve expensive models for tasks that genuinely need them
- Classification/routing steps should always use the cheapest available model
- Measure actual cost per agent run, not just per-token pricing
- An API aggregator with one key simplifies multi-model access significantly
Multi-model agents are not inherently better. They are better when the workflow genuinely contains different kinds of work.
Access every model through one API: LemonData provides 300+ models with a single API key. Build multi-model agents without managing multiple provider accounts or reinventing routing for every provider pair.
