Every AI API has rate limits. Hit them in development and it is annoying. Hit them in production and your users see errors, partial streams, and timeouts that look random until you inspect the pattern.
The key mistake is treating rate limiting as a single problem. It is usually four different problems hiding behind the same 429:
- requests per minute
- tokens per minute
- concurrent in-flight requests
- account-level or project-level quota exhaustion
If you build for only one of those, the others still bite you.
If you are still in the provider migration stage, read the migration guide first. If you are evaluating whether a gateway helps with fallback and operational overhead, the OpenRouter comparison is the best companion read.
What Rate Limits Actually Mean
Request limits
This is the obvious one. You sent too many requests in a short time window.
Token limits
This is the one teams underestimate. A single long prompt can burn as much budget as many small requests. If you suddenly add a 20 KB system prompt, the request count may look healthy while the token budget is already gone.
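A rough rule of thumb for English text is about four characters per token, which is enough to catch this class of regression before the request leaves your service. Real tokenizers vary by model, so treat this sketch as an estimate, not an accounting tool:

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# A 20 KB system prompt is on the order of 5,000 tokens
# before the user has typed a single word.
```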
Concurrency limits
Some providers and gateways are perfectly happy with your per-minute average until you open fifty streams at once. The rate plan is fine. The burst shape is not.
Quota or balance exhaustion
This often surfaces as a “rate limit” symptom in dashboards because the operational result is the same: calls stop succeeding. But the remediation is different. Backoff is useless if the problem is zero balance.
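Because the remediation differs, it pays to classify the failure before choosing one. A minimal sketch, assuming an OpenAI-style error body where the code `insufficient_quota` marks billing exhaustion (check your provider's actual error codes):

```python
def classify_429(error_body: dict) -> str:
    """Map a 429-style error body to a remediation bucket."""
    code = (error_body.get("error") or {}).get("code", "")
    if code == "insufficient_quota":
        return "quota_exhausted"  # backoff will not help; fix billing instead
    return "rate_limited"         # transient; retry with backoff
```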
How Providers Commonly Enforce Limits
The exact numbers change over time, which is why hardcoding a public pricing-table-style chart into your application docs ages badly. The stable pattern is this:
- OpenAI-style providers usually expose request and token headers, and they adjust your ceiling based on account history or usage tier.
- Anthropic-style providers usually enforce both minute-level throughput and broader project limits, especially on high-end models.
- Google-style providers often separate free-tier behavior from paid-tier behavior and may vary limits sharply by model family.
- Aggregators add one more limit layer on top of upstream constraints, but in return they can route to other channels when one upstream is temporarily saturated.
Treat provider limits as live configuration, not constants.
Reading Rate Limit Headers
All major providers return rate limit information in response headers:
```
x-ratelimit-limit-requests: 500
x-ratelimit-remaining-requests: 499
x-ratelimit-reset-requests: 60s
x-ratelimit-limit-tokens: 200000
x-ratelimit-remaining-tokens: 199500
```
Use these headers proactively. Don't wait for a 429 error to slow down.
The operational habit you want is simple:
- Log the headers on success, not just on failure.
- Alert when remaining capacity falls below a threshold.
- Shape traffic before the next request crosses the line.
If you only look at headers after a failure, you are already behind.
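As a sketch of that habit, here is one way to parse the reset headers and decide when to throttle. The `60s`-style duration format matches OpenAI's headers; other providers may format reset values differently:

```python
import re

def parse_reset(value: str) -> float:
    """Parse an OpenAI-style reset duration such as '60s', '1m30s', or '250ms'."""
    total = 0.0
    for amount, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value):
        total += float(amount) * {"ms": 0.001, "s": 1, "m": 60, "h": 3600}[unit]
    return total

def should_throttle(remaining: str, limit: str, threshold: float = 0.1) -> bool:
    """Shape traffic before the next request crosses the line."""
    return int(remaining) < int(limit) * threshold
```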
Building Retry Logic
The Wrong Way
```python
# Don't do this
import time

def call_api(messages):
    while True:
        try:
            return client.chat.completions.create(
                model="gpt-4.1",
                messages=messages
            )
        except Exception:
            time.sleep(1)  # Fixed delay, no backoff, catches everything
```
Problems: no exponential backoff, catches non-retryable errors, no max retry limit, no jitter.
The Right Way
```python
import time
import random
from openai import OpenAI, RateLimitError, APIError, APIConnectionError

client = OpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

def call_with_retry(messages, model="gpt-4.1", max_retries=3):
    """Retry with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError as e:
            if attempt == max_retries:
                raise
            # Use the Retry-After header if the provider sent one (seconds)
            retry_after = e.response.headers.get("retry-after")
            if retry_after:
                wait = float(retry_after)
            else:
                wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
        except APIConnectionError:
            if attempt == max_retries:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except APIError as e:
            # Don't retry client errors (400, 401, 403)
            status = getattr(e, "status_code", None)
            if status and 400 <= status < 500:
                raise
            if attempt == max_retries:
                raise
            time.sleep((2 ** attempt) + random.uniform(0, 1))
```
Key principles:
- Exponential backoff: 1s, 2s, 4s, 8s
- Jitter: random 0-1s added to prevent thundering herd
- Respect the Retry-After header when provided
- Don't retry client errors (bad request, auth failure)
- Set a max retry count
Two extra production rules matter:
- never retry forever on a streaming endpoint
- never retry a request that is already tied to user-visible side effects unless the operation is idempotent
Chat completions are usually safe to retry. Tool-triggered side effects often are not.
Async Version
```python
import asyncio
import random
from openai import AsyncOpenAI, RateLimitError

async_client = AsyncOpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

async def call_with_retry_async(messages, model="gpt-4.1", max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return await async_client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(wait)
```
Shape Traffic Before It Becomes a Retry Storm
Retry logic is only half the solution. If your upstream is already overloaded, retries can turn one burst into a self-inflicted outage.
Three controls make the difference:
1. Queue by tenant or user
If one customer starts a massive batch job, you do not want every other customer to inherit the blast radius.
2. Cap concurrent streams
Streaming endpoints are easy to underestimate because each request “looks” cheap while it stays open for a long time.
3. Trim prompts before they hit the wire
Token limits are often the real ceiling. A prompt that is twice as long cuts the safe throughput roughly in half.
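One way to sketch that trimming, using character count as a crude stand-in for tokens. The message shape follows the chat-completions format, and the budget is an assumption to tune per model:

```python
def trim_messages(messages: list[dict], max_chars: int = 8000) -> list[dict]:
    """Keep the system prompt plus the most recent turns under a rough budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):          # walk from newest to oldest
        budget -= len(m["content"])
        if budget < 0:
            break                     # oldest turns fall off first
        kept.append(m)
    return system + list(reversed(kept))
```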
Client-Side Token Bucket
For high-throughput applications, implement client-side rate limiting to avoid hitting server limits:
```python
import time
import asyncio

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # max burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    async def acquire(self, tokens: int = 1):
        while True:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(
                self.capacity,
                self.tokens + elapsed * self.rate
            )
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            # Wait for enough tokens
            wait = (tokens - self.tokens) / self.rate
            await asyncio.sleep(wait)

# 500 requests per minute = ~8.3 per second
limiter = TokenBucket(rate=8.0, capacity=20)

async def rate_limited_call(messages, model="gpt-4.1"):
    await limiter.acquire()
    return await async_client.chat.completions.create(
        model=model,
        messages=messages
    )
```
Token buckets are good when you know your ceiling. They are even better when you tune them from observed header data instead of a hardcoded guess.
Model Fallback on Rate Limits
When your primary model is rate-limited, fall back to an alternative:
```python
FALLBACK_CHAIN = [
    "claude-sonnet-4-6",
    "gpt-4.1",
    "gpt-4.1-mini",
]

async def call_with_fallback(messages):
    for model in FALLBACK_CHAIN:
        try:
            return await async_client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError:
            continue
    raise RuntimeError("All models rate limited")
```
This is where model gateways help, but only if the fallback is deliberate. Do not silently jump from a premium reasoning model to a tiny budget model without thinking about the user impact.
A reasonable fallback chain is:
- same provider, smaller sibling model
- equivalent model family from another provider
- only then a cheaper or lower-context model
If you mix “fallback for availability” with “fallback for cost” in one step, debugging gets messy fast.
Monitoring Rate Limit Usage
Track your rate limit consumption to catch issues before they affect users:
```python
import logging

def log_rate_limits(response):
    headers = response.headers
    remaining = headers.get("x-ratelimit-remaining-requests")
    limit = headers.get("x-ratelimit-limit-requests")
    # Guard against missing headers before comparing
    if remaining is not None and limit is not None \
            and int(remaining) < int(limit) * 0.1:
        logging.warning(
            f"Rate limit warning: {remaining}/{limit} requests remaining"
        )
```
Set alerts when remaining capacity drops below 10%. This gives you time to implement throttling before users see 429 errors.
You should also log:
- request ID
- model
- input size estimate
- stream duration
- retry count
- final outcome (success, rate_limited, network_error, quota_exhausted)
Without those fields, rate-limit incidents become guesswork.
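A minimal shape for that per-call record (field names here are suggestions, not a standard):

```python
from dataclasses import dataclass, asdict

@dataclass
class CallRecord:
    request_id: str
    model: str
    input_chars: int      # rough input size estimate
    stream_seconds: float
    retry_count: int
    outcome: str          # success | rate_limited | network_error | quota_exhausted

record = CallRecord("req_123", "gpt-4.1", 4200, 1.8, 0, "success")
```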
A Simple Production Checklist
Before you call your chatbot or agent “rate-limit safe,” verify these five items:
- A bounded retry policy exists for both sync and async paths.
- You log rate-limit headers on successful responses.
- Per-user or per-tenant shaping exists before the upstream call.
- At least one validated fallback model exists.
- The frontend gets a clean error state instead of a hung stream.
If you are building the full application rather than just the retry primitive, the one-key chatbot guide shows how these pieces fit into a real FastAPI service.
Summary
Rate limiting is not a corner case. It is a normal operating condition for any AI product with real usage. The teams that handle it well do not have magical higher limits. They treat throughput, retries, and fallback as part of application design from the start.
Create an API key at LemonData, test your retry path before production traffic, and build for the next 429 before it arrives.
| Strategy | When to Use |
|---|---|
| Exponential backoff | Always (baseline) |
| Client-side rate limiter | High-throughput apps (>100 RPM) |
| Model fallback | Production apps with SLA requirements |
| Proactive monitoring | Any production deployment |
| Batch API | Non-real-time workloads |
The goal isn't to avoid rate limits entirely. It's to handle them gracefully so your users never notice.
Build resilient AI applications: lemondata.cc provides multi-channel routing that automatically handles upstream rate limits. One API key, 300+ models.
