Building AI Agents with Multiple Models: A Practical Architecture Guide

LemonData · February 26, 2026
#ai-agents #multi-model #architecture #tutorial #langchain

Most AI agents use a single model for everything. The planning step, the tool calls, the summarization, the error recovery. This works for demos. In production, it's wasteful.

A planning step that requires deep reasoning doesn't need the same model as a JSON extraction step. A code generation task has different requirements than a classification task. Using Claude Opus 4.6 ($25/1M output tokens) to format a date string is like hiring a senior architect to paint a wall.

Here's how to build agents that route each step to the optimal model.

The Multi-Model Agent Architecture

User Request
     │
     ▼
┌──────────────┐
│    Router    │  ← Classifies task complexity
│ (fast model) │
└──────┬───────┘
       │
   ┌───┴───┐
   ▼       ▼
┌──────┐ ┌───────┐
│Simple│ │Complex│
│Model │ │Model  │
└──┬───┘ └───┬───┘
   │         │
   ▼         ▼
┌──────────────┐
│  Aggregator  │  ← Combines results
│ (fast model) │
└──────────────┘

Three components:

  1. A router that classifies incoming tasks by complexity
  2. A pool of models matched to different task types
  3. An aggregator that combines results when needed
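Before wiring in real API calls, the control flow of these three components can be sketched with stubbed "models" standing in for the API; the keyword-based `route` function and the stub functions here are illustrative placeholders, not the real router:

```python
# Minimal sketch of the router → model pool → aggregator flow.
# Stub functions stand in for actual model API calls.

def route(task: str) -> str:
    """Toy complexity classifier (a real router would call a fast model)."""
    if any(word in task.lower() for word in ("plan", "analyze", "debug")):
        return "complex"
    return "simple"

def simple_model(task: str) -> str:
    return f"[simple] {task}"

def complex_model(task: str) -> str:
    return f"[complex] {task}"

MODEL_POOL = {"simple": simple_model, "complex": complex_model}

def aggregate(results: list[str]) -> str:
    """Combine per-task results (a real aggregator might call a fast model)."""
    return "\n".join(results)

def run_agent(tasks: list[str]) -> str:
    results = [MODEL_POOL[route(task)](task) for task in tasks]
    return aggregate(results)
```

The shape is the same once real models are plugged in: only `route` and the entries of `MODEL_POOL` change.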

Implementation with OpenAI SDK

Using a single API key through an aggregator, you can access all models without managing multiple SDKs:

from openai import OpenAI

client = OpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

# Model pool with cost/capability tiers
MODELS = {
    "router": "gpt-4.1-mini",        # $0.40/1M in - fast classification
    "simple": "gpt-4.1-mini",         # $0.40/1M in - extraction, formatting
    "reasoning": "claude-sonnet-4-6",  # $3.00/1M in - planning, analysis
    "complex": "gpt-4.1",             # $2.00/1M in - code gen, multi-step
    "budget": "deepseek-chat",         # $0.28/1M in - bulk processing
}

def route_task(task: str) -> str:
    """Use a cheap model to classify task complexity."""
    response = client.chat.completions.create(
        model=MODELS["router"],
        messages=[
            {"role": "system", "content": """Classify this task into one category:
- simple: data extraction, formatting, translation
- reasoning: analysis, planning, comparison
- complex: code generation, multi-step problem solving
- budget: bulk processing, non-critical tasks
Reply with just the category name."""},
            {"role": "user", "content": task}
        ],
        max_tokens=10
    )
    category = response.choices[0].message.content.strip().lower()
    return MODELS.get(category, MODELS["simple"])

def execute_task(task: str, context: str = "") -> str:
    """Route task to appropriate model and execute."""
    model = route_task(task)
    messages = []
    if context:
        messages.append({"role": "system", "content": context})
    messages.append({"role": "user", "content": task})

    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response.choices[0].message.content
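One practical caveat: models sometimes decorate their reply ("Category: reasoning.") instead of returning the bare label, so the router's output is worth normalizing before the dictionary lookup. A hypothetical `parse_category` helper (not part of the code above) might look like this:

```python
# Defensive parsing of the router's reply: tolerate punctuation and
# extra words around the category label, falling back to "simple".
MODELS = {
    "router": "gpt-4.1-mini",
    "simple": "gpt-4.1-mini",
    "reasoning": "claude-sonnet-4-6",
    "complex": "gpt-4.1",
    "budget": "deepseek-chat",
}

def parse_category(raw: str) -> str:
    """Map a raw router reply to a model name."""
    category = raw.strip().strip(".").lower()
    for key in MODELS:
        if key in category:
            return MODELS[key]
    return MODELS["simple"]
```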

Real-World Agent: Code Review Pipeline

Here's a practical multi-model agent that reviews pull requests:

def review_pr(diff: str) -> dict:
    """Multi-model PR review pipeline."""

    # Step 1: Classify changes (cheap model)
    classification = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": f"Classify these code changes: {diff[:2000]}\n"
                       "Categories: bugfix, feature, refactor, docs, test"
        }],
        max_tokens=20
    ).choices[0].message.content

    # Step 2: Security scan (reasoning model)
    security = client.chat.completions.create(
        model="claude-sonnet-4-6",
        messages=[{
            "role": "system",
            "content": "You are a security reviewer. Check for: "
                       "SQL injection, XSS, auth bypass, secrets in code, "
                       "unsafe deserialization. Be specific about line numbers."
        }, {
            "role": "user",
            "content": f"Review this diff for security issues:\n{diff}"
        }]
    ).choices[0].message.content

    # Step 3: Code quality (general model)
    quality = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": f"Review code quality: naming, structure, "
                       f"error handling, test coverage.\n{diff}"
        }]
    ).choices[0].message.content

    # Step 4: Summary (cheap model)
    summary = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": f"Summarize this PR review in 3 bullet points:\n"
                       f"Type: {classification}\n"
                       f"Security: {security[:500]}\n"
                       f"Quality: {quality[:500]}"
        }]
    ).choices[0].message.content

    return {
        "classification": classification,
        "security": security,
        "quality": quality,
        "summary": summary
    }
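Steps 2 and 3 don't depend on each other, so they can run concurrently to cut wall-clock latency roughly in half for those stages. A sketch with stub functions in place of the API calls (the stubs are placeholders, not the real review functions):

```python
from concurrent.futures import ThreadPoolExecutor

def security_scan(diff: str) -> str:
    # Stub for the Claude Sonnet security-review call.
    return f"security findings for {len(diff)} chars"

def quality_review(diff: str) -> str:
    # Stub for the GPT-4.1 quality-review call.
    return f"quality notes for {len(diff)} chars"

def review_parallel(diff: str) -> dict:
    """Run the two independent review steps concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sec = pool.submit(security_scan, diff)
        qual = pool.submit(quality_review, diff)
        return {"security": sec.result(), "quality": qual.result()}
```

Threads are enough here because the work is I/O-bound (waiting on API responses), so the GIL is not a bottleneck.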

Cost breakdown for a typical PR review (2K token diff):

Step       Model               Input Tokens   Cost
Classify   GPT-4.1-mini        ~2,100         $0.0008
Security   Claude Sonnet 4.6   ~2,500         $0.0075
Quality    GPT-4.1             ~2,500         $0.0050
Summary    GPT-4.1-mini        ~1,200         $0.0005
Total                          ~8,300         ~$0.014

Using Claude Sonnet 4.6 for all four steps would cost ~$0.028. The multi-model approach cuts costs by 50% while using the strongest model where it matters most (security review).
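The arithmetic behind the table is easy to reproduce (input-token costs only, as in the table; output tokens would add to both figures):

```python
# Per-step cost = input_tokens × price_per_token.
PRICES = {  # $ per 1M input tokens
    "gpt-4.1-mini": 0.40,
    "claude-sonnet-4-6": 3.00,
    "gpt-4.1": 2.00,
}

STEPS = [
    ("gpt-4.1-mini", 2_100),       # classify
    ("claude-sonnet-4-6", 2_500),  # security
    ("gpt-4.1", 2_500),            # quality
    ("gpt-4.1-mini", 1_200),       # summary
]

def cost(model: str, tokens: int) -> float:
    return tokens * PRICES[model] / 1_000_000

multi = sum(cost(model, tokens) for model, tokens in STEPS)
single = sum(cost("claude-sonnet-4-6", tokens) for _, tokens in STEPS)
```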

LangChain Integration

from langchain_openai import ChatOpenAI

# Create model instances with different configs
fast = ChatOpenAI(
    model="gpt-4.1-mini",
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

reasoning = ChatOpenAI(
    model="claude-sonnet-4-6",
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

# Use in LangChain chains
from langchain_core.prompts import ChatPromptTemplate

classify_chain = ChatPromptTemplate.from_template(
    "Classify: {input}"
) | fast

analyze_chain = ChatPromptTemplate.from_template(
    "Analyze in depth: {input}"
) | reasoning

When to Use Multi-Model Agents

Multi-model routing adds complexity. It's worth it when:

  • Your agent handles diverse task types (not just chat)
  • Monthly API costs exceed $100 (savings become meaningful)
  • You need specific model strengths (Claude for code, Gemini for long context, GPT for speed)
  • Latency matters for some steps but not others

For simple chatbots or single-purpose agents, a single model is fine. The overhead of routing isn't justified when every request needs the same capability.

Key Takeaways

  1. Use the cheapest model that handles each step well
  2. Reserve expensive models for tasks that genuinely need them
  3. Classification/routing steps should always use the cheapest available model
  4. Measure actual cost per agent run, not just per-token pricing
  5. An API aggregator with one key simplifies multi-model access significantly
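For takeaway 4, a rough break-even check takes one line: multiply per-run savings by monthly volume. Using the PR-review figures from earlier (~$0.028 single-model vs. ~$0.014 routed) as illustrative inputs:

```python
def monthly_savings(requests: int, single_cost: float, multi_cost: float) -> float:
    """Dollar savings per month from routing vs. one model for everything."""
    return requests * (single_cost - multi_cost)

# At 10,000 reviews/month, routing saves ~$140/month.
savings_at_10k = monthly_savings(10_000, 0.028, 0.014)
```

If that number is smaller than the engineering time the routing layer costs you, stick with one model.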

Access every model through one API: lemondata.cc provides 300+ models with a single API key. Build multi-model agents without managing multiple provider accounts.
