AI API Rate Limiting: How It Works and How to Handle It
Every AI API has rate limits. Hit them in development and it's annoying. Hit them in production and your users see errors. Understanding how rate limits work across providers and building proper retry logic is the difference between a demo and a production application.
How Each Provider Limits You
OpenAI
OpenAI uses tiered rate limits based on your account's usage history and payment level.
| Tier | RPM (Requests/Min) | TPM (Tokens/Min) | How to Reach |
|---|---|---|---|
| Free | 3 | 40,000 | New account |
| Tier 1 | 500 | 200,000 | $5 paid |
| Tier 2 | 5,000 | 2,000,000 | $50 paid |
| Tier 3 | 5,000 | 10,000,000 | $100 paid |
| Tier 4 | 10,000 | 50,000,000 | $250 paid |
| Tier 5 | 10,000 | 300,000,000 | $1,000 paid |
Limits are per-model. Using GPT-4.1 doesn't consume your GPT-4.1-mini quota.
Anthropic
Anthropic uses a similar tier system with both RPM and TPM limits. They also enforce a daily token limit on lower tiers.
Google (Gemini)
Google AI Studio has per-model limits. The free tier is generous on daily requests but tight on per-minute rates (15 RPM for Gemini 2.5 Flash free tier).
Aggregator Platforms
Aggregators like LemonData and OpenRouter add their own rate limiting layer on top of upstream limits. LemonData uses role-based limits:
| Role | RPM |
|---|---|
| User | 1,000 |
| Partner | 3,000 |
| VIP | 10,000 |
The advantage: aggregator limits are typically higher than individual provider free tiers, and multi-channel routing means if one upstream channel is rate-limited, the request routes to another.
Reading Rate Limit Headers
All major providers return rate limit information in response headers:
x-ratelimit-limit-requests: 500
x-ratelimit-remaining-requests: 499
x-ratelimit-reset-requests: 60s
x-ratelimit-limit-tokens: 200000
x-ratelimit-remaining-tokens: 199500
Use these headers proactively. Don't wait for a 429 error to slow down.
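For example, a minimal pre-emptive throttle might pause when the remaining-request count runs low. This is only a sketch: it assumes you already have the response headers as a dict (the monitoring section below shows one way to get them), and the floor and pause values are arbitrary placeholders.
import time

def throttle_if_needed(headers, floor=5, pause=2.0):
    # Slow down before the request budget is exhausted instead of reacting to a 429
    remaining = headers.get("x-ratelimit-remaining-requests")
    if remaining is not None and int(remaining) <= floor:
        # Crude fixed pause; for a precise wait, parse x-ratelimit-reset-requests (e.g. "60s")
        time.sleep(pause)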
Building Retry Logic
The Wrong Way
# Don't do this
import time
def call_api(messages):
while True:
try:
return client.chat.completions.create(
model="gpt-4.1",
messages=messages
)
except Exception:
time.sleep(1) # Fixed delay, no backoff, catches everything
Problems: no exponential backoff, catches non-retryable errors, no max retry limit, no jitter.
The Right Way
import time
import random
from openai import OpenAI, RateLimitError, APIError, APIConnectionError

client = OpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)
def call_with_retry(messages, model="gpt-4.1", max_retries=3):
"""Retry with exponential backoff and jitter."""
for attempt in range(max_retries + 1):
try:
return client.chat.completions.create(
model=model,
messages=messages
)
except RateLimitError as e:
if attempt == max_retries:
raise
            # Use the retry-after header from the response if available
            retry_after = e.response.headers.get("retry-after")
            wait = float(retry_after) if retry_after else (2 ** attempt) + random.uniform(0, 1)
print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
time.sleep(wait)
except APIConnectionError:
if attempt == max_retries:
raise
wait = (2 ** attempt) + random.uniform(0, 1)
time.sleep(wait)
        except APIError as e:
            # Don't retry client errors (400, 401, 403); base APIError may lack status_code
            status = getattr(e, "status_code", None)
            if status and 400 <= status < 500:
raise
if attempt == max_retries:
raise
time.sleep((2 ** attempt) + random.uniform(0, 1))
Key principles:
- Exponential backoff: 1s, 2s, 4s, 8s
- Jitter: random 0-1s added to prevent thundering herd
- Respect the retry-after header when the server provides one
- Don't retry client errors (bad request, auth failure)
- Set a max retry count
Async Version
import asyncio
import random
from openai import AsyncOpenAI, RateLimitError
async_client = AsyncOpenAI(
api_key="sk-lemon-xxx",
base_url="https://api.lemondata.cc/v1"
)
async def call_with_retry_async(messages, model="gpt-4.1", max_retries=3):
for attempt in range(max_retries + 1):
try:
return await async_client.chat.completions.create(
model=model,
messages=messages
)
except RateLimitError:
if attempt == max_retries:
raise
wait = (2 ** attempt) + random.uniform(0, 1)
await asyncio.sleep(wait)
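Usage mirrors the sync helper. As a sketch, you can fan several prompts out concurrently with asyncio.gather (the prompts here are placeholders):
async def main():
    prompts = ["Summarize report A", "Summarize report B", "Summarize report C"]
    # Each call retries independently; gather collects the results in order
    results = await asyncio.gather(*[
        call_with_retry_async([{"role": "user", "content": p}])
        for p in prompts
    ])
    return results

asyncio.run(main())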
Advanced: Token Bucket Rate Limiter
For high-throughput applications, implement client-side rate limiting to avoid hitting server limits:
import time
import asyncio
class TokenBucket:
def __init__(self, rate: float, capacity: int):
self.rate = rate # tokens per second
self.capacity = capacity # max burst size
self.tokens = capacity
self.last_refill = time.monotonic()
async def acquire(self, tokens: int = 1):
while True:
now = time.monotonic()
elapsed = now - self.last_refill
self.tokens = min(
self.capacity,
self.tokens + elapsed * self.rate
)
self.last_refill = now
if self.tokens >= tokens:
self.tokens -= tokens
return
# Wait for enough tokens
wait = (tokens - self.tokens) / self.rate
await asyncio.sleep(wait)
# 500 requests per minute = ~8.3 per second
limiter = TokenBucket(rate=8.0, capacity=20)
async def rate_limited_call(messages, model="gpt-4.1"):
await limiter.acquire()
return await async_client.chat.completions.create(
model=model,
messages=messages
)
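Because every caller shares the same bucket, you can fan out a burst of calls and let acquire() pace them so the client never exceeds roughly 8 requests per second. A sketch reusing rate_limited_call from above:
async def run_batch(batches):
    # Each call waits on the shared bucket before hitting the API
    return await asyncio.gather(*[
        rate_limited_call(messages) for messages in batches
    ])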
Model Fallback on Rate Limits
When your primary model is rate-limited, fall back to an alternative:
FALLBACK_CHAIN = [
"claude-sonnet-4-6",
"gpt-4.1",
"gpt-4.1-mini",
]
async def call_with_fallback(messages):
for model in FALLBACK_CHAIN:
try:
return await async_client.chat.completions.create(
model=model,
messages=messages
)
except RateLimitError:
continue
raise Exception("All models rate limited")
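In practice you can combine the two patterns: retry each model a couple of times before moving down the chain. A sketch reusing call_with_retry_async from above, which only re-raises RateLimitError once its own retries are exhausted:
async def call_with_fallback_and_retry(messages):
    for model in FALLBACK_CHAIN:
        try:
            return await call_with_retry_async(messages, model=model, max_retries=2)
        except RateLimitError:
            continue  # this model is saturated; try the next one
    raise Exception("All models rate limited")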
This is where API aggregators shine. With 300+ models behind one endpoint, you always have a fallback available.
Monitoring Rate Limit Usage
Track your rate limit consumption to catch issues before they affect users:
import logging
def log_rate_limits(response):
headers = response.headers
remaining = headers.get("x-ratelimit-remaining-requests")
limit = headers.get("x-ratelimit-limit-requests")
    if remaining and limit and int(remaining) < int(limit) * 0.1:
logging.warning(
f"Rate limit warning: {remaining}/{limit} requests remaining"
)
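The plain create() call returns a parsed completion without headers. One way to get at them, assuming the openai-python v1 SDK, is the with_raw_response wrapper:
# Sketch: fetch the raw response for its headers, then parse the body as usual
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "ping"}]
)
log_rate_limits(raw)      # raw.headers carries the x-ratelimit-* values
completion = raw.parse()  # the normal ChatCompletion object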
Set alerts when remaining capacity drops below 10%. This gives you time to implement throttling before users see 429 errors.
Summary
| Strategy | When to Use |
|---|---|
| Exponential backoff | Always (baseline) |
| Client-side rate limiter | High-throughput apps (>100 RPM) |
| Model fallback | Production apps with SLA requirements |
| Proactive monitoring | Any production deployment |
| Batch API | Non-real-time workloads |
The goal isn't to avoid rate limits entirely. It's to handle them gracefully so your users never notice.
Build resilient AI applications: lemondata.cc provides multi-channel routing that automatically handles upstream rate limits. One API key, 300+ models.
