If you’ve ever scaled an AI-powered application past a few hundred daily users, you’ve likely run into the dreaded HTTP 429: Too Many Requests error.
Unlike traditional database APIs where rate limits are simple (e.g., 60 requests per minute), AI APIs use a two-dimensional limit schema: Requests Per Minute (RPM) and Tokens Per Minute (TPM).
Even if you only send 5 requests, a large document context can trigger a TPM rate limit error and crash your app.
This guide explains how rate limits are calculated across OpenAI, Gemini, and Claude, and shows you how to write bulletproof error handling code to keep your app online.
🧮 Calculate your token throughput: Use our AI API Pricing Calculator to project your expected token limits per minute based on user counts.
Understanding the 3 Types of Limits
AI providers throttle your app based on three distinct metrics:
- Requests Per Minute (RPM): How many times your code calls their endpoint in 60 seconds.
- Tokens Per Minute (TPM): The sum of all input and output tokens processed in 60 seconds.
- Requests Per Day (RPD): Daily cap (primarily enforced on free developer tiers).
Rate Limit Comparison (Tier 1 / Pay-As-You-Go)
Here are the typical starting limits for new developer accounts:
| Provider | Model | Default RPM | Default TPM |
|---|---|---|---|
| OpenAI | GPT-4o-mini | 500 RPM | 200,000 TPM |
| OpenAI | GPT-4o | 500 RPM | 30,000 TPM |
| Gemini 2.5 Flash | 2,000 RPM | 4,000,000 TPM | |
| Anthropic | Claude Sonnet | 50 RPM | 40,000 TPM |
The Winner: Google Gemini provides exceptionally high default limits, making it the most resilient provider for high-velocity startup traffic.
How to Fix Rate Limit Errors (Python)
1. Implement Exponential Backoff with Jitter
Do not immediately retry a failed request. Instead, wait, increasing the delay with each failure. Adding random “jitter” prevents all your concurrent requests from retrying at the exact same millisecond.
Here is the production-ready Python decorator using the tenacity library:
import random
import time
from google import genai
from google.genai.errors import APIError
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_exception_type
client = genai.Client()
# Retry up to 5 times with exponential backoff between 1 and 60 seconds
@retry(
wait=wait_random_exponential(min=1, max=60),
stop=stop_after_attempt(5),
retry=retry_if_exception_type(APIError)
)
def call_gemini_safely(prompt: str):
response = client.models.generate_content(
model='gemini-2.5-flash',
contents=prompt
)
return response.text
2. Read Rate Limit Headers Dynamically
Every time you make an API call, the provider returns headers indicating how close you are to your limits. You can parse these values to slow down your code proactively:
x-ratelimit-remaining-requestsx-ratelimit-remaining-tokensx-ratelimit-reset-requests(Time until RPM resets)x-ratelimit-reset-tokens(Time until TPM resets)
3. Implement Fallback Routing (Multi-Model Resiliency)
If your primary model provider is fully throttled, route the query to a fallback model.
def generate_text_with_fallback(prompt: str):
try:
# 1. Try OpenAI
return call_openai(prompt)
except Exception as e:
if "429" in str(e):
print("⚠️ OpenAI Throttled! Falling back to Gemini...")
# 2. Route to Gemini
return call_gemini_safely(prompt)