What is LLM Prompt Caching?

Prompt Caching is an API feature that stores frequently used prompts, instructions, or documents in memory. When you send a matching request, the provider uses the cached version, charging you up to 90% less for input tokens.

How much does the Batch API save?

Most providers (including OpenAI, Google, and xAI) offer a 50% discount on standard token prices if you submit your tasks in batches and allow up to 24 hours for completion.

Can I combine Prompt Caching and the Batch API?

Yes! Some providers allow you to benefit from both caching and batch pricing on the same requests, lowering your total API bill by over 95% compared to standard real-time pricing.

Which models support Prompt Caching?

Prompt Caching is supported on OpenAI (GPT-4.1, GPT-5.5), Google Gemini (Pro, Flash, Flash-Lite), Anthropic Claude (Opus, Sonnet, Haiku), and xAI Grok (all models).

How to Cut Your AI API Bill by 90% (Prompt Caching + Batch API Guide)

25 May 2026 (Updated: May 25, 2026) 📖 3 min read

For developers building production AI apps in 2026, API costs are often the single largest expense. However, many developers are still paying the “real-time tax” on every single request.

By implementing two core optimization strategies — Prompt Caching and Batch APIs — you can reduce your AI API bills by 50% to 90% overnight.

This guide explains exactly how these features work across Google Gemini, OpenAI, Anthropic Claude, and xAI Grok, with actionable strategies to implement them in your codebase today.

🧮 See the math in action: Use our AI API Pricing Calculator to toggle caching and batch modes and watch your estimated monthly bill drop instantly.

Part 1: Prompt Caching (Save up to 90% on Inputs)

When you make an API call, you pay for every token in your prompt. If you send the same system instructions, the same user profile data, or a massive 50K-word reference document with every message, you are paying for those identical tokens repeatedly.

Prompt Caching stores your input prefix in the provider’s memory. When subsequent requests share that same prefix, you only pay a fraction of the cost.

How Caching Rates Compare

Provider	Caching Support	Cost Reduction on Cached Tokens	Minimum Cache Size
Google Gemini	Yes (Manual)	~90% Off (approx. $0.05/M on Flash)	32,768 tokens
OpenAI	Yes (Automatic)	~75% Off ($0.50/M instead of $2.00/M)	1,024 tokens
Anthropic Claude	Yes (Manual)	~90% Off (approx. $0.30/M on Sonnet)	8,192 tokens
xAI Grok	Yes (Automatic)	~90% Off (approx. $0.13/M on Grok 4.3)	1,024 tokens

How to Implement Caching

1. Automatic Caching (OpenAI & Grok)

OpenAI and xAI require zero code changes for caching. If the prefix of your prompt matches a previous request (of at least 1,024 tokens), they automatically use the cache.

Rule for success: Keep your prompts structured with static content at the beginning (e.g., system prompt, reference documents) and dynamic user inputs at the very end.

[STABLE SYSTEM INSTRUCTIONS]  <-- Cached
[STATIC REFERENCE KNOWLEDGE] <-- Cached
[DYNAMIC USER QUESTION]      <-- Not Cached (computed standard rate)

2. Manual Caching (Anthropic Claude)

Anthropic requires you to explicitly tag which blocks should be cached in your JSON payload using "cache_control": {"type": "ephemeral"}:

{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Here is a huge document to analyze...",
          "cache_control": {"type": "ephemeral"}
        },
        {
          "type": "text",
          "text": "Summarize chapter 3."
        }
      ]
    }
  ]
}

Part 2: The Batch API (Save 50% on Everything)

If your application processes tasks that do not need immediate real-time responses (e.g., overnight report generation, database categorization, document indexing, translation pipelines), you should use the Batch API.

Instead of sending requests synchronously, you upload a file containing thousands of requests. The provider processes them asynchronously, returning the completed results within 24 hours.

The Benefit: All major providers offer a flat 50% discount on input and output tokens for batch requests.

Batch API Features compared

Provider	Turnaround Time	Cost Discount	Limit / Day
OpenAI Batch	≤ 24 hours	50% Off	50M tokens
Google Gemini Batch	≤ 24 hours	50% Off	100M tokens
xAI Grok Batch	≤ 24 hours	50% Off	50M tokens

Step-by-Step: Implementing OpenAI Batch API in Python

Here is a simple example of how to configure and execute batch workloads in Python:

import openai

# 1. Create a JSONL file with your tasks
# Each line represents one independent API call
tasks = [
    {"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify this email: ..."}]}},
    {"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify this email: ..."}]}}
]

with open("batch_tasks.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")

# 2. Upload the file to OpenAI
batch_file = openai.files.create(
    file=open("batch_tasks.jsonl", "rb"),
    purpose="batch"
)

# 3. Create the batch job
batch_job = openai.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch Job Created! ID: {batch_job.id}")

Once the status changes to completed, you can download the output file containing all completed completions.

Combining Both: The Ultimate Savings Setup

If you structure your code correctly, you can combine these two strategies:

Structure your data to isolate static system instructions and reference materials at the beginning of the prompt context (enabling Prompt Caching).
Queue the requests into a batch queue to be processed overnight (enabling the Batch API 50% discount).

By combining these two features, you can reduce standard API charges by over 95%.

Always verify feature availability and specific token rates in official developer documentation.

EXCEL / SHEETS TEMPLATE

Download the 2026 AI API Cost Optimization Spreadsheet

A complete, ready-to-use template to model, calculate, and project your API bills for Gemini, OpenAI, Grok, and Claude.

« Grok 4.3 vs Gemini 3.1 Pro vs Claude 4.6: Which Flagship API Wins? [2026]

I Built the Same App with 5 Different AI APIs — Here's What Each One Cost Me »

Professor XAI Follow ML Engineer passionate about advancing AI technologies and building intelligent systems.

How to Cut Your AI API Bill by 90% (Prompt Caching + Batch API Guide)

Part 1: Prompt Caching (Save up to 90% on Inputs)

How Caching Rates Compare

How to Implement Caching

1. Automatic Caching (OpenAI & Grok)

2. Manual Caching (Anthropic Claude)

Part 2: The Batch API (Save 50% on Everything)

Batch API Features compared

Step-by-Step: Implementing OpenAI Batch API in Python

Combining Both: The Ultimate Savings Setup

Download the 2026 AI API Cost Optimization Spreadsheet

🧮 Quick Tools

Newsletter

Popular Categories

How to Cut Your AI API Bill by 90% (Prompt Caching + Batch API Guide)

Part 1: Prompt Caching (Save up to 90% on Inputs)

How Caching Rates Compare

How to Implement Caching

1. Automatic Caching (OpenAI & Grok)

2. Manual Caching (Anthropic Claude)

Part 2: The Batch API (Save 50% on Everything)

Batch API Features compared

Step-by-Step: Implementing OpenAI Batch API in Python

Combining Both: The Ultimate Savings Setup

Related Guides

Download the 2026 AI API Cost Optimization Spreadsheet

🧮 Quick Tools

Newsletter

Get weekly AI insights & pricing updates delivered to your inbox

Popular Categories