Most teams overpay for AI API calls. Not because they picked the wrong model, but because they're ignoring three optimizations that require minimal code changes: prompt caching, smart model routing, and batch processing.
Here's a breakdown of each technique with real numbers.
If you are still deciding whether your current provider mix is the problem, read the pricing comparison first. If your biggest pain is retry storms or provider throttling rather than raw spend, pair this page with the rate limiting guide.
1. Prompt Caching: The Biggest Win
If your application sends the same system prompt with every request, you're paying full price for tokens the provider has already processed.
How It Works
OpenAI caches prompts automatically for inputs over 1,024 tokens. For most models, cached tokens cost 50% of the standard input price. You don't need to change anything in your code.
Anthropic uses explicit caching via cache_control breakpoints. The write cost is 25% higher than standard input, but reads cost 90% less. Cache TTL is 5 minutes, extended on each hit.
On newer OpenAI pricing, the practical discount can be better than teams expect. GPT-4.1 cached input is priced at one quarter of standard input, which means consistent prefixes create much larger savings than the old “nice to have” framing suggested.
The Math
Take a typical customer support bot:
- System prompt: 2,000 tokens
- User message: 200 tokens average
- 5,000 requests/day using Claude Sonnet 4.6
Without caching:
Daily input cost = 5,000 × 2,200 tokens × $3.00/1M = $33.00
With Anthropic prompt caching (assuming a 95% cache hit rate; only the 2,000-token system prompt is cached, while the user message is always billed at the standard rate):
Cache writes: 250 × 2,000 × $3.75/1M = $1.88
Cache reads: 4,750 × 2,000 × $0.30/1M = $2.85
User tokens: 5,000 × 200 × $3.00/1M = $3.00
Daily total = $7.73 (77% savings on input costs)
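The arithmetic is worth scripting so you can plug in your own volumes. A minimal sketch, charging the cached system prompt separately from the user tokens, which are always billed at the standard rate:

```python
PRICE_INPUT = 3.00    # standard input, $/1M tokens
PRICE_WRITE = 3.75    # cache write (+25%)
PRICE_READ = 0.30     # cache read (-90%)

SYSTEM_TOKENS = 2_000   # stable, cacheable system prompt
USER_TOKENS = 200       # varies per request, never cached
REQUESTS = 5_000
HIT_RATE = 0.95

misses = round(REQUESTS * (1 - HIT_RATE))   # requests that write the cache
hits = REQUESTS - misses                    # requests served from cache

write_cost = misses * SYSTEM_TOKENS * PRICE_WRITE / 1e6
read_cost = hits * SYSTEM_TOKENS * PRICE_READ / 1e6
user_cost = REQUESTS * USER_TOKENS * PRICE_INPUT / 1e6
total = write_cost + read_cost + user_cost
uncached = REQUESTS * (SYSTEM_TOKENS + USER_TOKENS) * PRICE_INPUT / 1e6
```

Swap in your own token counts and hit rate before committing to a caching strategy; the savings are very sensitive to hit rate.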
Implementation
```python
from anthropic import Anthropic

client = Anthropic(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc"
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...",
            "cache_control": {"type": "ephemeral"}  # This enables caching
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)

# Check cache performance in response.usage:
# cache_creation_input_tokens vs cache_read_input_tokens
```
For OpenAI models, caching is automatic. Just make sure your prompts exceed 1,024 tokens and keep the static prefix consistent across requests.
Where teams go wrong:
- putting timestamps or request IDs at the top of every prompt
- reordering system instructions on each call
- embedding variable user context before the stable prefix
If the prefix changes every time, the cache never helps. Treat prompt shape as a cost primitive, not just a prompt engineering detail.
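One simple guard against prefix drift is to assemble messages from a fixed prefix and append anything variable after it. A sketch (the helper name is ours):

```python
def build_messages(stable_system: str, dynamic_context: str, user_message: str) -> list:
    """Keep the long, stable prefix first so automatic prefix caching can match it;
    anything that varies per request goes after it."""
    return [
        {"role": "system", "content": stable_system},     # identical across requests
        {"role": "system", "content": dynamic_context},   # timestamps, user context, etc.
        {"role": "user", "content": user_message},
    ]
```

The point is structural: timestamps and request IDs still get sent, but they no longer sit in front of the cacheable prefix.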
2. Smart Model Routing: Use the Right Model for Each Task
Not every request needs your most expensive model. A classification task that GPT-4.1 handles for $2.00/1M input tokens works just as well with GPT-4.1-mini at $0.40/1M, a 5x cost reduction.
The Routing Strategy
| Task Type | Recommended Model | Input Cost/1M |
|---|---|---|
| Complex reasoning | Claude Opus 4.6 / GPT-4.1 | $5.00 / $2.00 |
| General chat | Claude Sonnet 4.6 / GPT-4.1 | $3.00 / $2.00 |
| Classification, extraction | GPT-4.1-mini / Claude Haiku 4.5 | $0.40 / $1.00 |
| Embeddings | text-embedding-3-small | $0.02 |
| Simple formatting | DeepSeek V3 | $0.28 |
Implementation
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

def route_request(task_type: str, messages: list) -> str:
    """Pick the cheapest model that handles this task well."""
    model_map = {
        "classification": "gpt-4.1-mini",
        "extraction": "gpt-4.1-mini",
        "summarization": "gpt-4.1-mini",
        "complex_reasoning": "gpt-4.1",
        "creative_writing": "claude-sonnet-4-6",
        "code_generation": "claude-sonnet-4-6",
    }
    model = model_map.get(task_type, "gpt-4.1-mini")
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response.choices[0].message.content
```
Real Savings
A coding assistant that routes 60% of requests (linting, formatting, simple completions) to GPT-4.1-mini and 40% (architecture, debugging) to Claude Sonnet 4.6:
Before (all Claude Sonnet 4.6):
1,000 req/day × 3K input × $3.00/1M = $9.00/day
After (60/40 split):
600 req × 3K × $0.40/1M = $0.72/day (mini)
400 req × 3K × $3.00/1M = $3.60/day (sonnet)
Total = $4.32/day (52% savings)
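The split above generalizes to any routing ratio. A small sketch for estimating blended daily cost, using the numbers from the example:

```python
def daily_input_cost(requests_per_day: int, tokens_per_request: int, price_per_m: float) -> float:
    """Daily input spend in dollars for one route."""
    return requests_per_day * tokens_per_request * price_per_m / 1_000_000

before = daily_input_cost(1_000, 3_000, 3.00)                 # all Sonnet
after = (daily_input_cost(600, 3_000, 0.40)                   # mini tier
         + daily_input_cost(400, 3_000, 3.00))                # premium tier
savings = 1 - after / before
```

Run it with your own traffic mix before committing; the savings scale with how much of your volume is genuinely simple.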
3. Batch Processing: Lower Prices for Non-Urgent Work
OpenAI offers a Batch API with a 50% discount on input and output tokens. The trade-off: results are delivered within 24 hours instead of in real time.
Anthropic also offers 50% batch discounts on supported models. If your workload is overnight, asynchronous, or review-oriented, there is rarely a good reason to pay real-time prices.
Good candidates for batching:
- Nightly content generation
- Bulk document classification
- Dataset labeling
- Scheduled report generation
```python
import json

# Create a batch file (JSONL format)
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1-mini",
            "messages": [
                {"role": "system", "content": "Classify this document..."},
                {"role": "user", "content": doc}
            ]
        }
    })

# Write JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Submit batch
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
```
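When the batch completes, the output file is also JSONL, with each line carrying your custom_id and a standard chat completion body. A small parser (a sketch, assuming the documented output shape) maps results back to documents:

```python
import json

def parse_batch_output(jsonl_text: str) -> dict:
    """Map each custom_id back to its completion text from a Batch API output file."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        body = record["response"]["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

# Once the batch reaches status "completed":
# output_text = client.files.content(batch.output_file_id).text
# labels = parse_batch_output(output_text)
```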
More good candidates inside a real product:
- overnight content refresh jobs
- support-ticket summarization
- embeddings backfills
- large codebase or document reviews
- low-priority user notifications
Bad candidates:
- chat replies
- interactive coding assistance
- workflows where the next user action depends on the answer immediately
4. Bonus: Reduce Token Count
Before optimizing at the API level, check if you're sending more tokens than necessary.
Common waste:
- Verbose system prompts that repeat instructions the model already follows
- Including full conversation history when only the last 3-5 turns matter
- Sending raw HTML/markdown when plain text would work
- Not using max_tokens to cap output length
A 30% reduction in prompt length directly translates to 30% lower input costs.
The easiest way to find waste is to log prompt length by route or feature. Most teams do not have a model-pricing problem. They have a “the same bloated prompt is sent 100,000 times a day” problem.
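Trimming conversation history is the mechanical version of the second bullet above. A sketch, assuming one turn is a user message followed by an assistant reply:

```python
def trim_history(messages: list, max_turns: int = 5) -> list:
    """Keep the system prompt plus only the most recent conversation turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns * 2:]   # one turn = user + assistant message
```

Combined with a max_tokens cap on output, this bounds both sides of the bill for every request.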
5. Add Cost Visibility Before You Optimize Blindly
Cost optimization fails when teams optimize from intuition.
Before changing routing rules, log:
- route or feature name
- model
- input tokens
- output tokens
- cache hit or miss
- retry count
- user-visible latency
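Those fields fit in one flat record per call. A sketch of a JSON-lines logger (the names are ours, not a standard schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CallRecord:
    route: str            # feature or endpoint name
    model: str
    input_tokens: int
    output_tokens: int
    cache_hit: bool
    retries: int
    latency_ms: float

def log_call(record: CallRecord) -> str:
    """Serialize one API call as a JSON line; aggregate later by route and model."""
    return json.dumps(asdict(record))
```

One JSON line per call is enough: a few group-bys over route and model turn this into a cost dashboard.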
That lets you answer the questions that matter:
- Which route is expensive because it is genuinely useful?
- Which route is expensive because the prompt is wasteful?
- Which route should move to batch?
- Which route should move to a cheaper model tier?
If you cannot answer those four questions, your “cost optimization” will just shift cost around.
6. A Real Optimization Order
The most effective order is usually:
- Remove obvious token waste.
- Turn on or fix caching.
- Split cheap tasks from expensive tasks.
- Batch anything non-urgent.
- Only then renegotiate provider mix.
That order matters because the biggest savings often come before provider switching. If you switch vendors without fixing prompt shape, you keep paying for the same inefficiency.
7. A Concrete Before-and-After Rollout
Take a support workflow that currently does this on every request:
- sends a 2,000-token system prompt
- calls one premium model for all requests
- retries the same request shape on temporary failures
- runs nightly summaries synchronously instead of in batch
The first version often feels “simple” because it has only one code path. Financially, it is doing four expensive things at once.
A more efficient rollout looks like this:
- Move the stable policy text to the front of the prompt so caching can actually hit.
- Route classification, extraction, and short summaries to a cheaper model tier.
- Reserve the premium model for escalation, complicated reasoning, or final answer synthesis.
- Push overnight summaries and backfills to batch.
- Review logs weekly for routes whose prompt shape drifted and killed cache efficiency.
That kind of rollout does not require a rewrite. It requires one week of instrumentation and a willingness to treat prompts and routing as production surfaces.
8. What Not to Do
The fastest way to waste a cost-optimization effort is to optimize the wrong thing.
Avoid these traps:
- switching providers before you measure prompt waste
- routing cheap tasks to cheap models without validating output quality
- enabling caching on prompts whose prefixes change every request
- batching user-facing work that actually needs real-time responses
- looking only at token price and ignoring retry, latency, and fallback overhead
Cost work is successful when the product still behaves well after the savings land. If the UX gets worse, the spreadsheet win is fake.
Putting It All Together
| Technique | Effort | Typical Savings |
|---|---|---|
| Prompt caching | Low (add cache_control) | 40-75% on input |
| Model routing | Medium (classify tasks) | 30-50% overall |
| Batch processing | Medium (async workflow) | 50% on batch jobs |
| Token reduction | Low (trim prompts) | 10-30% on input |
These techniques compound. A team that implements all four can realistically cut their monthly API bill from $3,000 to under $1,000 without any degradation in output quality.
The key insight: cost optimization in AI APIs isn't about finding cheaper providers. It's about using the right model, at the right price tier, with the right caching strategy, for each specific task.
If you are using multiple providers already, the operational side matters too. The migration guide and OpenRouter comparison help decide when it is time to centralize routing rather than keep patching separate integrations.
Start optimizing today: LemonData gives you access to 300+ models through one API key, with prompt caching support for OpenAI and Anthropic model families and one place to compare usage across them.
