Building AI Agents with Multiple Models: A Practical Architecture Guide
Most AI agents use a single model for everything. The planning step, the tool calls, the summarization, the error recovery. This works for demos. In production, it's wasteful.
A planning step that requires deep reasoning doesn't need the same model as a JSON extraction step. A code generation task has different requirements than a classification task. Using Claude Opus 4.6 ($25/1M output tokens) to format a date string is like hiring a senior architect to paint a wall.
Here's how to build agents that route each step to the optimal model.
The Multi-Model Agent Architecture
```
User Request
      │
      ▼
┌─────────────┐
│   Router    │ ← Classifies task complexity
│ (fast model)│
└──────┬──────┘
       │
   ┌───┴───┐
   ▼       ▼
┌──────┐ ┌───────┐
│Simple│ │Complex│
│Model │ │Model  │
└──┬───┘ └───┬───┘
   │         │
   ▼         ▼
┌─────────────┐
│ Aggregator  │ ← Combines results
│ (fast model)│
└─────────────┘
```
Three components:
- A router that classifies incoming tasks by complexity
- A pool of models matched to different task types
- An aggregator that combines results when needed
Implementation with OpenAI SDK
Using a single API key through an aggregator, you can access all models without managing multiple SDKs:
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

# Model pool with cost/capability tiers
MODELS = {
    "router": "gpt-4.1-mini",          # $0.40/1M in - fast classification
    "simple": "gpt-4.1-mini",          # $0.40/1M in - extraction, formatting
    "reasoning": "claude-sonnet-4-6",  # $3.00/1M in - planning, analysis
    "complex": "gpt-4.1",              # $2.00/1M in - code gen, multi-step
    "budget": "deepseek-chat",         # $0.28/1M in - bulk processing
}
```
```python
def route_task(task: str) -> str:
    """Use a cheap model to classify task complexity."""
    response = client.chat.completions.create(
        model=MODELS["router"],
        messages=[
            {"role": "system", "content": """Classify this task into one category:
- simple: data extraction, formatting, translation
- reasoning: analysis, planning, comparison
- complex: code generation, multi-step problem solving
- budget: bulk processing, non-critical tasks
Reply with just the category name."""},
            {"role": "user", "content": task}
        ],
        max_tokens=10
    )
    category = response.choices[0].message.content.strip().lower()
    return MODELS.get(category, MODELS["simple"])

def execute_task(task: str, context: str = "") -> str:
    """Route task to appropriate model and execute."""
    model = route_task(task)
    messages = []
    if context:
        messages.append({"role": "system", "content": context})
    messages.append({"role": "user", "content": task})
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response.choices[0].message.content
```
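One risk of routing is misclassification: a cheap model gets handed a task it can't handle. A common mitigation is to validate the output and escalate to a stronger tier on failure. A sketch of that pattern; `call_model` and `validate` are hypothetical caller-supplied hooks (in practice `call_model` would wrap `client.chat.completions.create` with the tier's model):

```python
ESCALATION = ["simple", "reasoning", "complex"]  # cheapest tier first

def execute_with_escalation(task, call_model, validate):
    """Try tiers cheapest-first; escalate when the output fails validation.

    call_model(tier, task) -> str and validate(output) -> bool are
    hypothetical hooks supplied by the caller.
    Returns (tier_used, output); falls back to the strongest tier's
    best-effort answer if nothing validates.
    """
    last = None
    for tier in ESCALATION:
        last = call_model(tier, task)
        if validate(last):
            return tier, last
    return ESCALATION[-1], last
```

The validator can be as simple as "parses as JSON" or "non-empty", which keeps most traffic on the cheap tier while capping worst-case quality.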
Real-World Agent: Code Review Pipeline
Here's a practical multi-model agent that reviews pull requests:
```python
def review_pr(diff: str) -> dict:
    """Multi-model PR review pipeline."""
    # Step 1: Classify changes (cheap model)
    classification = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": f"Classify these code changes: {diff[:2000]}\n"
                       "Categories: bugfix, feature, refactor, docs, test"
        }],
        max_tokens=20
    ).choices[0].message.content

    # Step 2: Security scan (reasoning model)
    security = client.chat.completions.create(
        model="claude-sonnet-4-6",
        messages=[{
            "role": "system",
            "content": "You are a security reviewer. Check for: "
                       "SQL injection, XSS, auth bypass, secrets in code, "
                       "unsafe deserialization. Be specific about line numbers."
        }, {
            "role": "user",
            "content": f"Review this diff for security issues:\n{diff}"
        }]
    ).choices[0].message.content

    # Step 3: Code quality (general model)
    quality = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": f"Review code quality: naming, structure, "
                       f"error handling, test coverage.\n{diff}"
        }]
    ).choices[0].message.content

    # Step 4: Summary (cheap model)
    summary = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": f"Summarize this PR review in 3 bullet points:\n"
                       f"Type: {classification}\n"
                       f"Security: {security[:500]}\n"
                       f"Quality: {quality[:500]}"
        }]
    ).choices[0].message.content

    return {
        "classification": classification,
        "security": security,
        "quality": quality,
        "summary": summary
    }
```
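Steps 2 and 3 don't depend on each other, so their API calls can run concurrently to cut wall-clock latency. A sketch using a thread pool; the two scan arguments are hypothetical callables that would wrap the chat completions above:

```python
from concurrent.futures import ThreadPoolExecutor

def run_independent_reviews(diff, security_scan, quality_scan):
    """Run the two independent review steps in parallel threads.

    security_scan and quality_scan are hypothetical callables taking the
    diff; in the pipeline above each would wrap one chat completion call.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        sec = pool.submit(security_scan, diff)
        qual = pool.submit(quality_scan, diff)
        return sec.result(), qual.result()
```

Threads are fine here because the work is network-bound; the summary step still has to wait for both results.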
Cost breakdown for a typical PR review (2K token diff):
| Step | Model | Input Tokens | Cost |
|---|---|---|---|
| Classify | GPT-4.1-mini | ~2,100 | $0.0008 |
| Security | Claude Sonnet 4.6 | ~2,500 | $0.0075 |
| Quality | GPT-4.1 | ~2,500 | $0.0050 |
| Summary | GPT-4.1-mini | ~1,200 | $0.0005 |
| Total | | ~8,300 | ~$0.014 |
Using Claude Sonnet 4.6 for all four steps would cost ~$0.028. The multi-model approach cuts costs by 50% while using the strongest model where it matters most (security review).
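As a sanity check, the comparison can be reproduced from the table's input-token figures alone (output tokens, omitted here, account for the gap between the single-model input estimate and the ~$0.028 cited above):

```python
# (input_tokens, input price per 1M tokens) per step, from the table above
steps = [(2100, 0.40), (2500, 3.00), (2500, 2.00), (1200, 0.40)]

# Multi-model pipeline: each step billed at its own tier's rate
multi = sum(tok * price / 1_000_000 for tok, price in steps)

# Same ~8,300 input tokens all sent to Claude Sonnet 4.6 at $3.00/1M
single = sum(tok for tok, _ in steps) * 3.00 / 1_000_000
```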
LangChain Integration
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Create model instances with different configs
fast = ChatOpenAI(
    model="gpt-4.1-mini",
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

reasoning = ChatOpenAI(
    model="claude-sonnet-4-6",
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

# Use in LangChain chains
classify_chain = ChatPromptTemplate.from_template(
    "Classify: {input}"
) | fast

analyze_chain = ChatPromptTemplate.from_template(
    "Analyze in depth: {input}"
) | reasoning
```
When to Use Multi-Model Agents
Multi-model routing adds complexity. It's worth it when:
- Your agent handles diverse task types (not just chat)
- Monthly API costs exceed $100 (savings become meaningful)
- You need specific model strengths (Claude for code, Gemini for long context, GPT for speed)
- Latency matters for some steps but not others
For simple chatbots or single-purpose agents, a single model is fine. The overhead of routing isn't justified when every request needs the same capability.
Key Takeaways
- Use the cheapest model that handles each step well
- Reserve expensive models for tasks that genuinely need them
- Classification/routing steps should always use the cheapest available model
- Measure actual cost per agent run, not just per-token pricing
- An API aggregator with one key simplifies multi-model access significantly
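Measuring cost per agent run, rather than per token, is easiest with a small accumulator fed from each response's usage data. A sketch; `RunCostMeter` is an illustrative helper (prices mirror the model pool comments above, and in practice `record` would be fed from `response.usage`):

```python
class RunCostMeter:
    """Accumulates approximate input-token spend across one agent run."""

    PRICE_PER_1M = {  # input price in dollars, per the model pool comments
        "gpt-4.1-mini": 0.40,
        "claude-sonnet-4-6": 3.00,
        "gpt-4.1": 2.00,
        "deepseek-chat": 0.28,
    }

    def __init__(self):
        self.calls = []

    def record(self, model: str, input_tokens: int) -> None:
        """Log one call; input_tokens would come from response.usage."""
        cost = self.PRICE_PER_1M[model] * input_tokens / 1_000_000
        self.calls.append((model, input_tokens, cost))

    def total(self) -> float:
        """Total approximate dollar cost of the run so far."""
        return sum(cost for _, _, cost in self.calls)
```

Logging totals per run surfaces misrouting quickly: a "simple" task that keeps landing on the reasoning tier shows up as an outlier.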
Access every model through one API: lemondata.cc provides 300+ models with a single API key. Build multi-model agents without managing multiple provider accounts.
