Most teams overpay for AI API calls. Not because they picked the wrong model, but because they're ignoring three optimizations that require minimal code changes: prompt caching, smart model routing, and batch processing.
Here's a breakdown of each technique with real numbers.
If you are still deciding whether your current provider mix is the problem, read the pricing comparison first. If your biggest pain is retry storms or provider throttling rather than raw spend, pair this page with the rate limiting guide.
1. Prompt Caching: The Biggest Win
If your application sends the same system prompt with every request, you're paying full price for tokens the provider has already processed.
How It Works
OpenAI caches prompts automatically for inputs over 1,024 tokens. For most models, cached tokens cost 50% of the standard input price. You don't need to change anything in your code.
Anthropic uses explicit caching via cache_control breakpoints. The write cost is 25% higher than standard input, but reads cost 90% less. Cache TTL is 5 minutes, extended on each hit.
On newer OpenAI pricing, the practical discount can be better than teams expect. GPT-4.1 cached input is priced at one quarter of standard input, which means consistent prefixes create much larger savings than the old “nice to have” framing suggested.
The Math
Take a typical customer support bot:
- System prompt: 2,000 tokens
- User message: 200 tokens average
- 5,000 requests/day using Claude Sonnet 4.6
Without caching:
Daily input cost = 5,000 × 2,200 tokens × $3.00/1M = $33.00
With Anthropic prompt caching (assuming a 95% cache hit rate; only the 2,000-token system prompt is cached, while the user message is always billed at the standard rate):
Cache writes: 250 × 2,000 × $3.75/1M = $1.88
Cache reads: 4,750 × 2,000 × $0.30/1M = $2.85
User tokens: 5,000 × 200 × $3.00/1M = $3.00
Daily total = $7.73 (77% savings on input costs)
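The arithmetic is worth scripting so you can plug in your own volumes. A minimal sketch, charging the cached system prompt separately from the user tokens, which are always billed at the standard rate:

```python
PRICE_INPUT = 3.00    # standard input, $/1M tokens
PRICE_WRITE = 3.75    # cache write (+25%)
PRICE_READ = 0.30     # cache read (-90%)

SYSTEM_TOKENS = 2_000   # stable, cacheable system prompt
USER_TOKENS = 200       # varies per request, never cached
REQUESTS = 5_000
HIT_RATE = 0.95

misses = round(REQUESTS * (1 - HIT_RATE))   # requests that write the cache
hits = REQUESTS - misses                    # requests served from cache

write_cost = misses * SYSTEM_TOKENS * PRICE_WRITE / 1e6
read_cost = hits * SYSTEM_TOKENS * PRICE_READ / 1e6
user_cost = REQUESTS * USER_TOKENS * PRICE_INPUT / 1e6
total = write_cost + read_cost + user_cost
uncached = REQUESTS * (SYSTEM_TOKENS + USER_TOKENS) * PRICE_INPUT / 1e6
```

Swap in your own token counts and hit rate before committing to a caching strategy; the savings are very sensitive to hit rate.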
Implementation
```python
from anthropic import Anthropic

client = Anthropic(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc"
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...",
            "cache_control": {"type": "ephemeral"}  # This enables caching
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)

# Check cache performance in response.usage:
# cache_creation_input_tokens vs cache_read_input_tokens
```
For OpenAI models, caching is automatic. Just make sure your prompts exceed 1,024 tokens and keep the static prefix consistent across requests.
Where teams go wrong:
- putting timestamps or request IDs at the top of every prompt
- reordering system instructions on each call
- embedding variable user context before the stable prefix
If the prefix changes every time, the cache never helps. Treat prompt shape as a cost primitive, not just a prompt engineering detail.
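One simple guard against prefix drift is to assemble messages from a fixed prefix and append anything variable after it. A sketch (the helper name is ours):

```python
def build_messages(stable_system: str, dynamic_context: str, user_message: str) -> list:
    """Keep the long, stable prefix first so automatic prefix caching can match it;
    anything that varies per request goes after it."""
    return [
        {"role": "system", "content": stable_system},     # identical across requests
        {"role": "system", "content": dynamic_context},   # timestamps, user context, etc.
        {"role": "user", "content": user_message},
    ]
```

The point is structural: timestamps and request IDs still get sent, but they no longer sit in front of the cacheable prefix.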
2. Smart Model Routing: Use the Right Model for Each Task
Not every request needs your most expensive model. A classification task that GPT-4.1 handles for $2.00/1M input tokens works just as well with GPT-4.1-mini at $0.40/1M, a 5x cost reduction.
The Routing Strategy
| Task Type | Recommended Model | Input Cost/1M |
|---|---|---|
| Complex reasoning | Claude Opus 4.6 / GPT-4.1 | $5.00 / $2.00 |
| General chat | Claude Sonnet 4.6 / GPT-4.1 | $3.00 / $2.00 |
| Classification, extraction | GPT-4.1-mini / Claude Haiku 4.5 | $0.40 / $1.00 |
| Embeddings | text-embedding-3-small | $0.02 |
| Simple formatting | DeepSeek V3 | $0.28 |
Implementation
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

def route_request(task_type: str, messages: list) -> str:
    """Pick the cheapest model that handles this task well."""
    model_map = {
        "classification": "gpt-4.1-mini",
        "extraction": "gpt-4.1-mini",
        "summarization": "gpt-4.1-mini",
        "complex_reasoning": "gpt-4.1",
        "creative_writing": "claude-sonnet-4-6",
        "code_generation": "claude-sonnet-4-6",
    }
    model = model_map.get(task_type, "gpt-4.1-mini")
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response.choices[0].message.content
```
Real Savings
A coding assistant that routes 60% of requests (linting, formatting, simple completions) to GPT-4.1-mini and 40% (architecture, debugging) to Claude Sonnet 4.6:
Before (all Claude Sonnet 4.6):
1,000 req/day × 3K input × $3.00/1M = $9.00/day
After (60/40 split):
600 req × 3K × $0.40/1M = $0.72/day (mini)
400 req × 3K × $3.00/1M = $3.60/day (sonnet)
Total = $4.32/day (52% savings)
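The split above generalizes to any routing ratio. A small sketch for estimating blended daily cost, using the numbers from the example:

```python
def daily_input_cost(requests_per_day: int, tokens_per_request: int, price_per_m: float) -> float:
    """Daily input spend in dollars for one route."""
    return requests_per_day * tokens_per_request * price_per_m / 1_000_000

before = daily_input_cost(1_000, 3_000, 3.00)                 # all Sonnet
after = (daily_input_cost(600, 3_000, 0.40)                   # mini tier
         + daily_input_cost(400, 3_000, 3.00))                # premium tier
savings = 1 - after / before
```

Run it with your own traffic mix before committing; the savings scale with how much of your volume is genuinely simple.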
3. Batch Processing: Lower Prices for Non-Urgent Work
OpenAI offers a Batch API with a 50% discount on input and output tokens. The trade-off: results are delivered within 24 hours instead of in real time.
Anthropic also offers 50% batch discounts on supported models. If your workload is overnight, asynchronous, or review-oriented, there is rarely a good reason to pay real-time prices.
Good candidates for batching:
- Nightly content generation
- Bulk document classification
- Dataset labeling
- Scheduled report generation
```python
import json

# Create a batch file (JSONL format)
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1-mini",
            "messages": [
                {"role": "system", "content": "Classify this document..."},
                {"role": "user", "content": doc}
            ]
        }
    })

# Write JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Submit batch
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
```
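When the batch completes, the output file is also JSONL, with each line carrying your custom_id and a standard chat completion body. A small parser (a sketch, assuming the documented output shape) maps results back to documents:

```python
import json

def parse_batch_output(jsonl_text: str) -> dict:
    """Map each custom_id back to its completion text from a Batch API output file."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        body = record["response"]["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

# Once the batch reaches status "completed":
# output_text = client.files.content(batch.output_file_id).text
# labels = parse_batch_output(output_text)
```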
More good candidates inside a real product:
- overnight content refresh jobs
- support-ticket summarization
- embeddings backfills
- large codebase or document reviews
- low-priority user notifications
Bad candidates:
- chat replies
- interactive coding assistance
- workflows where the next user action depends on the answer immediately
4. Bonus: Reduce Token Count
Before optimizing at the API level, check if you're sending more tokens than necessary.
Common waste:
- Verbose system prompts that repeat instructions the model already follows
- Including full conversation history when only the last 3-5 turns matter
- Sending raw HTML/markdown when plain text would work
- Not using max_tokens to cap output length
A 30% reduction in prompt length directly translates to 30% lower input costs.
The easiest way to find waste is to log prompt length by route or feature. Most teams do not have a model-pricing problem. They have a “the same bloated prompt is sent 100,000 times a day” problem.
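Trimming conversation history is the mechanical version of the second bullet above. A sketch, assuming one turn is a user message followed by an assistant reply:

```python
def trim_history(messages: list, max_turns: int = 5) -> list:
    """Keep the system prompt plus only the most recent conversation turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns * 2:]   # one turn = user + assistant message
```

Combined with a max_tokens cap on output, this bounds both sides of the bill for every request.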
5. Add Cost Visibility Before You Optimize Blindly
Cost optimization fails when teams optimize from intuition.
Before changing routing rules, log:
- route or feature name
- model
- input tokens
- output tokens
- cache hit or miss
- retry count
- user-visible latency
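Those fields fit in one flat record per call. A sketch of a JSON-lines logger (the names are ours, not a standard schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CallRecord:
    route: str            # feature or endpoint name
    model: str
    input_tokens: int
    output_tokens: int
    cache_hit: bool
    retries: int
    latency_ms: float

def log_call(record: CallRecord) -> str:
    """Serialize one API call as a JSON line; aggregate later by route and model."""
    return json.dumps(asdict(record))
```

One JSON line per call is enough: a few group-bys over route and model turn this into a cost dashboard.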
That lets you answer the questions that matter:
- Which route is expensive because it is genuinely useful?
- Which route is expensive because the prompt is wasteful?
- Which route should move to batch?
- Which route should move to a cheaper model tier?
If you cannot answer those four questions, your “cost optimization” will just shift cost around.
6. A Real Optimization Order
The most effective order is usually:
- Remove obvious token waste.
- Turn on or fix caching.
- Split cheap tasks from expensive tasks.
- Batch anything non-urgent.
- Only then renegotiate provider mix.
That order matters because the biggest savings often come before provider switching. If you switch vendors without fixing prompt shape, you keep paying for the same inefficiency.
7. A Concrete Before-and-After Rollout
Take a support workflow that currently does this on every request:
- sends a 2,000-token system prompt
- calls one premium model for all requests
- retries the same request shape on temporary failures
- runs nightly summaries synchronously instead of in batch
The first version often feels “simple” because it has only one code path. Financially, it is doing four expensive things at once.
A more efficient rollout looks like this:
- Move the stable policy text to the front of the prompt so caching can actually hit.
- Route classification, extraction, and short summaries to a cheaper model tier.
- Reserve the premium model for escalation, complicated reasoning, or final answer synthesis.
- Push overnight summaries and backfills to batch.
- Review logs weekly for routes whose prompt shape drifted and killed cache efficiency.
That kind of rollout does not require a rewrite. It requires one week of instrumentation and a willingness to treat prompts and routing as production surfaces.
8. What Not to Do
The fastest way to waste a cost-optimization effort is to optimize the wrong thing.
Avoid these traps:
- switching providers before you measure prompt waste
- routing cheap tasks to cheap models without validating output quality
- enabling caching on prompts whose prefixes change every request
- batching user-facing work that actually needs real-time responses
- looking only at token price and ignoring retry, latency, and fallback overhead
Cost work is successful when the product still behaves well after the savings land. If the UX gets worse, the spreadsheet win is fake.
Putting It All Together
| Technique | Effort | Typical Savings |
|---|---|---|
| Prompt caching | Low (add cache_control) | 40-75% on input |
| Model routing | Medium (classify tasks) | 30-50% overall |
| Batch processing | Medium (async workflow) | 50% on batch jobs |
| Token reduction | Low (trim prompts) | 10-30% on input |
These techniques compound. A team that implements all four can realistically cut their monthly API bill from $3,000 to under $1,000 without any degradation in output quality.
The key insight: cost optimization in AI APIs isn't about finding cheaper providers. It's about using the right model, at the right price tier, with the right caching strategy, for each specific task.
If you are using multiple providers already, the operational side matters too. The migration guide and OpenRouter comparison help decide when it is time to centralize routing rather than keep patching separate integrations.
Start optimizing today: LemonData gives you access to 300+ models through one API key, with prompt caching support for OpenAI and Anthropic model families and one place to compare usage across them.
