How to Cut Your AI API Costs by 30% Without Changing Models
Most teams overpay for AI API calls. Not because they picked the wrong model, but because they're ignoring three optimizations that require minimal code changes: prompt caching, smart model routing, and batch processing.
Here's a breakdown of each technique with real numbers.
1. Prompt Caching: The Biggest Win
If your application sends the same system prompt with every request, you're paying full price for tokens the provider has already processed.
How It Works
OpenAI caches prompts automatically for inputs over 1,024 tokens. Cached tokens cost 50% of the standard input price. You don't need to change anything in your code.
Anthropic uses explicit caching via cache_control breakpoints. The write cost is 25% higher than standard input, but reads cost 90% less. Cache TTL is 5 minutes, extended on each hit.
The Math
Take a typical customer support bot:
- System prompt: 2,000 tokens
- User message: 200 tokens average
- 5,000 requests/day using Claude Sonnet 4.6
Without caching:
Daily input cost = 5,000 × 2,200 tokens × $3.00/1M = $33.00
With Anthropic prompt caching (assuming a 95% cache hit rate, with only the 2,000-token system prompt cached):
Cache writes: 250 × 2,000 × $3.75/1M = $1.88
Cache reads: 4,750 × 2,000 × $0.30/1M = $2.85
User tokens: 5,000 × 200 × $3.00/1M = $3.00
Daily total = $7.73 (77% savings on input costs)
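The arithmetic is easy to sanity-check in a few lines of Python. The only assumptions are the prices and the 95% hit rate stated above, plus the split between the cached 2,000-token system prompt and the user tokens billed at the standard rate:

```python
# Reproduce the caching cost estimate (prices are the assumed per-million
# rates: $3.00 input, $3.75 cache write, $0.30 cache read).
REQUESTS = 5_000
SYSTEM_TOKENS = 2_000   # the cached system prompt
USER_TOKENS = 200       # always billed at the standard input rate
HIT_RATE = 0.95

no_cache = round(REQUESTS * (SYSTEM_TOKENS + USER_TOKENS) * 3.00 / 1e6, 2)
writes = round(REQUESTS * (1 - HIT_RATE) * SYSTEM_TOKENS * 3.75 / 1e6, 2)
reads = round(REQUESTS * HIT_RATE * SYSTEM_TOKENS * 0.30 / 1e6, 2)
user = round(REQUESTS * USER_TOKENS * 3.00 / 1e6, 2)
with_cache = writes + reads + user

print(f"without caching: ${no_cache:.2f}/day")              # $33.00/day
print(f"with caching:    ${with_cache:.2f}/day")            # $7.73/day
print(f"input savings:   {1 - with_cache / no_cache:.0%}")  # 77%
```

Plug in your own request volume and prompt sizes to estimate whether caching is worth the (modest) integration effort.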
Implementation
```python
from anthropic import Anthropic

client = Anthropic(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc"
)

user_message = "Where can I check my order status?"

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...",
            "cache_control": {"type": "ephemeral"}  # This enables caching
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)

# Check cache performance in response.usage:
# cache_creation_input_tokens vs cache_read_input_tokens
```
For OpenAI models, caching is automatic. Just make sure your prompts exceed 1,024 tokens and keep the static prefix consistent across requests.
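To verify you are actually getting hits, the OpenAI SDK reports cached tokens on the response's usage object. A small helper, assuming the SDK's shape (usage.prompt_tokens and usage.prompt_tokens_details.cached_tokens); the stub below only stands in for a real response:

```python
from types import SimpleNamespace

def cached_fraction(usage) -> float:
    """Fraction of prompt tokens served from cache, given a
    chat.completions usage object (OpenAI SDK shape)."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    total = getattr(usage, "prompt_tokens", 0) or 0
    return cached / total if total else 0.0

# Quick check with a stub shaped like an SDK usage object
stub = SimpleNamespace(
    prompt_tokens=2_200,
    prompt_tokens_details=SimpleNamespace(cached_tokens=2_048),
)
print(f"{cached_fraction(stub):.0%} of prompt tokens came from cache")  # 93%
```

In production you would call it as cached_fraction(response.usage); a ratio near zero on repeated requests usually means the static prefix is drifting between calls.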
2. Smart Model Routing: Use the Right Model for Each Task
Not every request needs your most expensive model. A classification task that GPT-4.1 handles for $2.00/1M input tokens works just as well with GPT-4.1-mini at $0.40/1M, a 5x cost reduction.
The Routing Strategy
| Task Type | Recommended Model | Input Cost/1M |
|---|---|---|
| Complex reasoning | Claude Opus 4.6 / GPT-4.1 | $5.00 / $2.00 |
| General chat | Claude Sonnet 4.6 / GPT-4.1 | $3.00 / $2.00 |
| Classification, extraction | GPT-4.1-mini / Claude Haiku 4.5 | $0.40 / $1.00 |
| Embeddings | text-embedding-3-small | $0.02 |
| Simple formatting | DeepSeek V3 | $0.28 |
Implementation
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

def route_request(task_type: str, messages: list) -> str:
    """Pick the cheapest model that handles this task well."""
    model_map = {
        "classification": "gpt-4.1-mini",
        "extraction": "gpt-4.1-mini",
        "summarization": "gpt-4.1-mini",
        "complex_reasoning": "gpt-4.1",
        "creative_writing": "claude-sonnet-4-6",
        "code_generation": "claude-sonnet-4-6",
    }
    model = model_map.get(task_type, "gpt-4.1-mini")
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response.choices[0].message.content
```
Real Savings
A coding assistant that routes 60% of requests (linting, formatting, simple completions) to GPT-4.1-mini and 40% (architecture, debugging) to Claude Sonnet 4.6:
Before (all Claude Sonnet 4.6):
1,000 req/day × 3K input × $3.00/1M = $9.00/day
After (60/40 split):
600 req × 3K × $0.40/1M = $0.72/day (mini)
400 req × 3K × $3.00/1M = $3.60/day (sonnet)
Total = $4.32/day (52% savings)
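The 60/40 arithmetic generalizes to any traffic split. A small calculator, using the per-million input prices assumed above:

```python
def blended_daily_cost(requests_per_day: int, tokens_per_request: int,
                       split: dict) -> float:
    """split maps model name -> (fraction_of_traffic, price_per_1M_input_tokens).
    Returns the daily input cost in dollars."""
    return sum(
        requests_per_day * frac * tokens_per_request * price / 1e6
        for frac, price in split.values()
    )

before = blended_daily_cost(1_000, 3_000, {"claude-sonnet-4-6": (1.0, 3.00)})
after = blended_daily_cost(1_000, 3_000, {
    "gpt-4.1-mini": (0.60, 0.40),
    "claude-sonnet-4-6": (0.40, 3.00),
})
print(f"${before:.2f}/day -> ${after:.2f}/day ({1 - after / before:.0%} savings)")
# $9.00/day -> $4.32/day (52% savings)
```

Adjust the fractions to match your own traffic mix before committing to a routing scheme; the savings are very sensitive to how much volume the cheap tier can absorb.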
3. Batch Processing: Lower Prices for Non-Urgent Work
OpenAI offers a Batch API with 50% discount on input and output tokens. The trade-off: results are delivered within 24 hours instead of real-time.
Good candidates for batching:
- Nightly content generation
- Bulk document classification
- Dataset labeling
- Scheduled report generation
```python
import json

# Create a batch file (JSONL format): one request object per line
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1-mini",
            "messages": [
                {"role": "system", "content": "Classify this document..."},
                {"role": "user", "content": doc}
            ]
        }
    })

# Write the JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file, then submit the batch
batch_file = client.files.create(
    file=open("batch_input.jsonl", "rb"),
    purpose="batch"
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
```
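When the batch finishes, the output file is also JSONL, one result per line, keyed by custom_id. A sketch of pulling the replies back out, assuming the standard Batch API result shape (response.body is a chat completion object, error is set on failures):

```python
import json

def parse_batch_output(jsonl_text: str) -> dict:
    """Map each custom_id to the model's reply (None if the request errored)."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("error"):
            results[record["custom_id"]] = None
        else:
            body = record["response"]["body"]
            results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results
```

Once batch.status reports "completed" (poll with client.batches.retrieve(batch.id)), fetch the raw text via client.files.content(batch.output_file_id).text and pass it to this function.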
4. Bonus: Reduce Token Count
Before optimizing at the API level, check if you're sending more tokens than necessary.
Common waste:
- Verbose system prompts that repeat instructions the model already follows
- Including full conversation history when only the last 3-5 turns matter
- Sending raw HTML/markdown when plain text would work
- Not using max_tokens to cap output length
A 30% reduction in prompt length directly translates to 30% lower input costs.
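Of the items above, trimming conversation history is the easiest to automate. A minimal sketch that keeps any system messages plus the last N turns; keep_turns=3 is an assumption to tune per application, and the helper assumes the usual alternating user/assistant message list:

```python
def trim_history(messages: list, keep_turns: int = 3) -> list:
    """Keep system messages plus the last keep_turns user/assistant
    exchanges (2 messages per turn); older turns are dropped."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * keep_turns:]
```

Call it on the message list right before each API request so the window slides forward as the conversation grows.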
Putting It All Together
| Technique | Effort | Typical Savings |
|---|---|---|
| Prompt caching | Low (add cache_control) | 40-75% on input |
| Model routing | Medium (classify tasks) | 30-50% overall |
| Batch processing | Medium (async workflow) | 50% on batch jobs |
| Token reduction | Low (trim prompts) | 10-30% on input |
These techniques compound. A team that implements all four can realistically cut their monthly API bill from $3,000 to under $1,000 without any degradation in output quality.
The key insight: cost optimization in AI APIs isn't about finding cheaper providers. It's about using the right model, at the right price tier, with the right caching strategy, for each specific task.
Start optimizing today: lemondata.cc gives you access to 300+ models through one API key, with full prompt caching support for OpenAI and Anthropic models.
