AI API Rate Limiting: How It Works and How to Handle It

LemonData · February 26, 2026
#rate-limiting #production #error-handling #tutorial #best-practices

Every AI API has rate limits. Hit them in development and it's annoying. Hit them in production and your users see errors. Understanding how rate limits work across providers and building proper retry logic is the difference between a demo and a production application.

How Each Provider Limits You

OpenAI

OpenAI uses tiered rate limits based on your account's usage history and payment level.

Tier     RPM (Requests/Min)   TPM (Tokens/Min)   How to Reach
Free     3                    40,000             New account
Tier 1   500                  200,000            $5 paid
Tier 2   5,000                2,000,000          $50 paid
Tier 3   5,000                10,000,000         $100 paid
Tier 4   10,000               50,000,000         $250 paid
Tier 5   10,000               300,000,000        $1,000 paid

Limits are per-model: using GPT-4.1 doesn't consume your GPT-4.1-mini quota, and vice versa.

Anthropic

Anthropic uses a similar tier system with both RPM and TPM limits. They also enforce a daily token limit on lower tiers.

Google (Gemini)

Google AI Studio has per-model limits. The free tier is generous on daily requests but tight on per-minute rates (15 RPM for Gemini 2.5 Flash free tier).

Aggregator Platforms

Aggregators like LemonData and OpenRouter add their own rate limiting layer on top of upstream limits. LemonData uses role-based limits:

Role      RPM
User      1,000
Partner   3,000
VIP       10,000

The advantage: aggregator limits are typically higher than individual provider free tiers, and multi-channel routing means if one upstream channel is rate-limited, the request routes to another.

Reading Rate Limit Headers

All major providers return rate limit information in response headers:

x-ratelimit-limit-requests: 500
x-ratelimit-remaining-requests: 499
x-ratelimit-reset-requests: 60s
x-ratelimit-limit-tokens: 200000
x-ratelimit-remaining-tokens: 199500

Use these headers proactively. Don't wait for a 429 error to slow down.
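As a sketch, a small helper can decide when to slow down based on these headers before a 429 ever occurs (`should_throttle` and the 10% threshold are illustrative choices, not part of any SDK):

```python
def should_throttle(headers, threshold: float = 0.1) -> bool:
    """Return True when the remaining request budget drops below
    `threshold` of the per-window limit, based on rate limit headers."""
    remaining = headers.get("x-ratelimit-remaining-requests")
    limit = headers.get("x-ratelimit-limit-requests")
    if remaining is None or limit is None:
        return False  # headers absent: nothing to act on
    return int(remaining) < int(limit) * threshold

# With the header values shown above: 499 of 500 remaining
print(should_throttle({"x-ratelimit-remaining-requests": "499",
                       "x-ratelimit-limit-requests": "500"}))  # False: plenty left
```

With the OpenAI Python SDK, `client.chat.completions.with_raw_response.create(...)` returns an object whose `.headers` can be passed straight to a check like this.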

Building Retry Logic

The Wrong Way

# Don't do this
import time

def call_api(messages):
    while True:
        try:
            return client.chat.completions.create(
                model="gpt-4.1",
                messages=messages
            )
        except Exception:
            time.sleep(1)  # Fixed delay, no backoff, catches everything

Problems: no exponential backoff, catches non-retryable errors, no max retry limit, no jitter.

The Right Way

import time
import random
from openai import OpenAI, RateLimitError, APIError, APIConnectionError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_with_retry(messages, model="gpt-4.1", max_retries=3):
    """Retry with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError as e:
            if attempt == max_retries:
                raise
            # Use the Retry-After response header if the server sent one
            retry_after = e.response.headers.get("retry-after")
            if retry_after is not None:
                wait = float(retry_after)
            else:
                wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
        except APIConnectionError:
            if attempt == max_retries:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except APIError as e:
            # Don't retry client errors (400, 401, 403)
            status = getattr(e, "status_code", None)
            if status is not None and 400 <= status < 500:
                raise
            if attempt == max_retries:
                raise
            time.sleep((2 ** attempt) + random.uniform(0, 1))

Key principles:

  • Exponential backoff: 1s, 2s, 4s, 8s
  • Jitter: random 0-1s added to prevent thundering herd
  • Respect retry_after header when provided
  • Don't retry client errors (bad request, auth failure)
  • Set a max retry count
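The schedule those principles produce can be isolated into a pure function, which makes it easy to unit-test independently of any API client (`backoff_schedule` is an illustrative helper, not part of any SDK):

```python
import random

def backoff_schedule(retries: int = 4, base: float = 1.0,
                     jitter: float = 1.0, seed=None):
    """Return the list of wait times: base * 2**attempt plus uniform jitter."""
    rng = random.Random(seed)
    return [base * (2 ** attempt) + rng.uniform(0, jitter)
            for attempt in range(retries)]

# Four retries: roughly 1s, 2s, 4s, 8s, each nudged by up to 1s of jitter
print([round(w, 2) for w in backoff_schedule(4, seed=42)])
```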

Async Version

import asyncio
import random
from openai import AsyncOpenAI, RateLimitError

async_client = AsyncOpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

async def call_with_retry_async(messages, model="gpt-4.1", max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return await async_client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(wait)

Advanced: Token Bucket Rate Limiter

For high-throughput applications, implement client-side rate limiting to avoid hitting server limits:

import time
import asyncio

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # max burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    async def acquire(self, tokens: int = 1):
        while True:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(
                self.capacity,
                self.tokens + elapsed * self.rate
            )
            self.last_refill = now

            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            # Wait for enough tokens
            wait = (tokens - self.tokens) / self.rate
            await asyncio.sleep(wait)

# 500 requests per minute ≈ 8.3 per second; rate=8.0 stays safely under the limit
limiter = TokenBucket(rate=8.0, capacity=20)

async def rate_limited_call(messages, model="gpt-4.1"):
    await limiter.acquire()
    return await async_client.chat.completions.create(
        model=model,
        messages=messages
    )

Model Fallback on Rate Limits

When your primary model is rate-limited, fall back to an alternative:

FALLBACK_CHAIN = [
    "claude-sonnet-4-6",
    "gpt-4.1",
    "gpt-4.1-mini",
]

async def call_with_fallback(messages):
    for model in FALLBACK_CHAIN:
        try:
            return await async_client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError:
            continue
    raise RuntimeError("All models rate limited")

This is where API aggregators shine. With 300+ models behind one endpoint, you always have a fallback available.

Monitoring Rate Limit Usage

Track your rate limit consumption to catch issues before they affect users:

import logging

def log_rate_limits(response):
    # `response` must expose headers; with the OpenAI SDK, use
    # client.chat.completions.with_raw_response.create(...)
    headers = response.headers
    remaining = headers.get("x-ratelimit-remaining-requests")
    limit = headers.get("x-ratelimit-limit-requests")
    if remaining is not None and limit is not None \
            and int(remaining) < int(limit) * 0.1:
        logging.warning(
            f"Rate limit warning: {remaining}/{limit} requests remaining"
        )

Set alerts when remaining capacity drops below 10%. This gives you time to implement throttling before users see 429 errors.

Summary

Strategy                  When to Use
Exponential backoff       Always (baseline)
Client-side rate limiter  High-throughput apps (>100 RPM)
Model fallback            Production apps with SLA requirements
Proactive monitoring      Any production deployment
Batch API                 Non-real-time workloads

The goal isn't to avoid rate limits entirely. It's to handle them gracefully so your users never notice.


Build resilient AI applications: lemondata.cc provides multi-channel routing that automatically handles upstream rate limits. One API key, 300+ models.
