How to Cut Your AI API Bill by 90% (Prompt Caching + Batch API Guide)

For developers building production AI apps in 2026, API costs are often the single largest expense. However, many developers are still paying the “real-time tax” on every single request.

By implementing two core optimization strategies — Prompt Caching and Batch APIs — you can reduce your AI API bills by 50% to 90% overnight.

This guide explains exactly how these features work across Google Gemini, OpenAI, Anthropic Claude, and xAI Grok, with actionable strategies to implement them in your codebase today.

🧮 See the math in action: Use our AI API Pricing Calculator to toggle caching and batch modes and watch your estimated monthly bill drop instantly.


Part 1: Prompt Caching (Save up to 90% on Inputs)

When you make an API call, you pay for every token in your prompt. If you send the same system instructions, the same user profile data, or a massive 50K-word reference document with every message, you are paying for those identical tokens repeatedly.

Prompt Caching stores your input prefix in the provider’s memory. When subsequent requests share that same prefix, you only pay a fraction of the cost.

How Caching Rates Compare

Provider | Caching Support | Cost Reduction on Cached Tokens | Minimum Cache Size
---------|-----------------|---------------------------------|-------------------
Google Gemini | Yes (Manual) | ~90% off (approx. $0.05/M on Flash) | 32,768 tokens
OpenAI | Yes (Automatic) | ~75% off ($0.50/M instead of $2.00/M) | 1,024 tokens
Anthropic Claude | Yes (Manual) | ~90% off (approx. $0.30/M on Sonnet) | 1,024 tokens on Sonnet (higher on some models)
xAI Grok | Yes (Automatic) | ~90% off (approx. $0.13/M on Grok 4.3) | 1,024 tokens
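To see what these discounts mean for a single request, here is a minimal sketch of the arithmetic. The function name and the 20,000-token workload are illustrative assumptions; the two rates are the OpenAI figures from the table above.

```python
def input_cost_usd(total_tokens, cached_tokens, base_rate_per_m, cached_rate_per_m):
    """Return the input cost in USD for one request.

    Rates are in USD per million tokens. Tokens served from the cache are
    billed at cached_rate_per_m; the remaining tokens at base_rate_per_m.
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    uncached_tokens = total_tokens - cached_tokens
    total = uncached_tokens * base_rate_per_m + cached_tokens * cached_rate_per_m
    return total / 1_000_000


# Hypothetical workload: a 20,000-token prompt whose first 19,000 tokens
# are a shared prefix that hits the cache.
without_cache = input_cost_usd(20_000, 0, 2.00, 0.50)
with_cache = input_cost_usd(20_000, 19_000, 2.00, 0.50)
savings = 1 - with_cache / without_cache

print(f"no cache: ${without_cache:.4f}  cached: ${with_cache:.4f}  saved: {savings:.1%}")
```

The saving on the whole request is smaller than the headline discount because the uncached tail of the prompt is still billed at the full rate.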

How to Implement Caching

1. Automatic Caching (OpenAI & Grok)

OpenAI and xAI require zero code changes for caching. If the prefix of your prompt matches a previous request (of at least 1,024 tokens), they automatically use the cache.

Rule for success: Keep your prompts structured with static content at the beginning (e.g., system prompt, reference documents) and dynamic user inputs at the very end.

[STABLE SYSTEM INSTRUCTIONS]  <-- Cached
[STATIC REFERENCE KNOWLEDGE] <-- Cached
[DYNAMIC USER QUESTION]      <-- Not Cached (billed at the standard rate)
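The layout above could be sketched in Python roughly as follows. The constants and helper names are hypothetical. The `cached_fraction` helper assumes the usage object has the shape OpenAI's chat completions return, with `prompt_tokens` and `prompt_tokens_details.cached_tokens`, here passed as a plain dict.

```python
# Hypothetical static content; in a real app this is your long system
# prompt and reference material, byte-for-byte identical on every request.
STATIC_SYSTEM_PROMPT = "You are a support assistant for ExampleCo."
STATIC_REFERENCE = "Reference manual text goes here..."


def build_messages(user_question):
    """Put stable content first and the per-request question last."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": STATIC_REFERENCE},
        {"role": "user", "content": user_question},
    ]


def shared_prefix_length(messages_a, messages_b):
    """Count how many leading messages are identical in two requests."""
    count = 0
    for a, b in zip(messages_a, messages_b):
        if a != b:
            break
        count += 1
    return count


def cached_fraction(usage):
    """Share of prompt tokens served from cache, given a usage dict."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    if prompt_tokens == 0:
        return 0.0
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens", 0) / prompt_tokens


first = build_messages("How do I reset my password?")
second = build_messages("How do I export my data?")
print(shared_prefix_length(first, second))  # only the final message differs
```

Logging `cached_fraction` per request is a cheap way to confirm the cache is being hit; a value stuck at zero usually means something dynamic (a timestamp, a user ID) has crept into the prefix.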

2. Manual Caching (Anthropic Claude)

Anthropic requires you to explicitly tag which blocks should be cached in your JSON payload using "cache_control": {"type": "ephemeral"}:

{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Here is a huge document to analyze...",
          "cache_control": {"type": "ephemeral"}
        },
        {
          "type": "text",
          "text": "Summarize chapter 3."
        }
      ]
    }
  ]
}
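A minimal sketch of building that payload in Python, so the cached block stays identical across requests while only the question changes. The helper names are hypothetical. `cache_status` assumes the response's usage object exposes `cache_read_input_tokens` and `cache_creation_input_tokens`, passed here as a plain dict.

```python
def build_cached_request(document_text, question,
                         model="claude-3-5-sonnet-20241022", max_tokens=1024):
    """Build a Messages API payload with the document marked as cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Static block: tagged so the provider caches the prefix
                    # up to and including this block.
                    {
                        "type": "text",
                        "text": document_text,
                        "cache_control": {"type": "ephemeral"},
                    },
                    # Dynamic block: changes per request, so it is not tagged.
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


def cache_status(usage):
    """Classify a response's usage dict as a cache 'hit', 'write', or 'none'."""
    if (usage.get("cache_read_input_tokens") or 0) > 0:
        return "hit"
    if (usage.get("cache_creation_input_tokens") or 0) > 0:
        return "write"
    return "none"


document = "Here is a huge document to analyze..."
req_a = build_cached_request(document, "Summarize chapter 3.")
req_b = build_cached_request(document, "List the main characters.")
```

The first request with a given prefix typically reports a cache write and later ones a hit, so `cache_status` is a quick check that the tag is placed on the right block.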

Part 2: The Batch API (Save 50% on Everything)

If your application processes tasks that do not need immediate real-time responses (e.g., overnight report generation, database categorization, document indexing, translation pipelines), you should use the Batch API.

Instead of sending requests synchronously, you upload a file containing thousands of requests. The provider processes them asynchronously, returning the completed results within 24 hours.

The Benefit: All major providers offer a flat 50% discount on input and output tokens for batch requests.

Batch API Features Compared

Provider | Turnaround Time | Cost Discount | Limit / Day
---------|-----------------|---------------|------------
OpenAI Batch | ≤ 24 hours | 50% off | 50M tokens
Google Gemini Batch | ≤ 24 hours | 50% off | 100M tokens
xAI Grok Batch | ≤ 24 hours | 50% off | 50M tokens
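As a rough illustration of what the flat discount is worth, the sketch below compares standard and batch pricing for one month of traffic. The token volumes and the $2.00/M input and $8.00/M output rates are hypothetical placeholders, not quotes from any provider.

```python
def batch_comparison_usd(input_tokens, output_tokens,
                         input_rate_per_m, output_rate_per_m,
                         batch_discount=0.50):
    """Return (standard_cost, batch_cost, saved) in USD.

    Rates are USD per million tokens. The batch discount applies to both
    input and output tokens.
    """
    standard = (input_tokens * input_rate_per_m
                + output_tokens * output_rate_per_m) / 1_000_000
    batch = standard * (1 - batch_discount)
    return standard, batch, standard - batch


# Hypothetical month: 30M input tokens and 5M output tokens.
standard, batch, saved = batch_comparison_usd(30_000_000, 5_000_000, 2.00, 8.00)
print(f"standard: ${standard:.2f}  batch: ${batch:.2f}  saved: ${saved:.2f}")
```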

Step-by-Step: Implementing OpenAI Batch API in Python

Here is a simple example of how to configure and execute batch workloads in Python:

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Create a JSONL file with your tasks
# Each line represents one independent API call
tasks = [
    {"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify this email: ..."}]}},
    {"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify this email: ..."}]}}
]

with open("batch_tasks.jsonl", "w", encoding="utf-8") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")

# 2. Upload the file to OpenAI (the with-block closes the file handle afterwards)
with open("batch_tasks.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# 3. Create the batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch Job Created! ID: {batch_job.id}")

Once the job's status changes to completed, you can download the output file. It contains one result per request, matched to your input by custom_id.
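Waiting for the job and reading the results could look roughly like this. Both helper names are hypothetical. The sketch assumes a client from the official `openai` package (v1 or later), where `client.batches.retrieve` returns an object with a `status` field, and it assumes each output line carries `custom_id`, `response.status_code`, `response.body`, and `error`.

```python
import json
import time

# Statuses after which the batch will not change any further.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60):
    """Poll until the batch reaches a terminal status, then return it."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        time.sleep(poll_seconds)


def parse_batch_output(jsonl_text):
    """Map each custom_id to its assistant text, or None if it failed."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response")
        failed = (record.get("error")
                  or not response
                  or response.get("status_code") != 200)
        if failed:
            results[record["custom_id"]] = None
            continue
        message = response["body"]["choices"][0]["message"]
        results[record["custom_id"]] = message["content"]
    return results


# Usage sketch (needs network access and an API key, so it is left commented):
# batch = wait_for_batch(client, batch_job.id)
# if batch.status == "completed":
#     text = client.files.content(batch.output_file_id).text
#     labels = parse_batch_output(text)
```

Keying results by `custom_id` matters because the output order is not guaranteed to match the input order.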


Combining Both: The Ultimate Savings Setup

If you structure your code correctly, you can combine these two strategies:

  1. Structure your data to isolate static system instructions and reference materials at the beginning of the prompt context (enabling Prompt Caching).
  2. Queue the requests into a batch queue to be processed overnight (enabling the Batch API 50% discount).

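By combining these two features, you can reduce the cost of cached input tokens by up to about 95% (the 90% cache discount with the 50% batch discount applied on top), provided your provider lets the two discounts stack. Uncached input and all output tokens still receive only the 50% batch discount.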
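The combined figure comes from multiplying the two discounts. Here is a minimal sketch of that arithmetic, assuming the discounts stack multiplicatively (which not every provider guarantees) and using a hypothetical helper name.

```python
def combined_input_cost_fraction(cached_share, cache_discount=0.90,
                                 batch_discount=0.50):
    """Fraction of the standard input price still paid after both discounts.

    cached_share is the portion of input tokens served from the cache (0 to 1).
    Assumes the batch discount is applied on top of the cached rate.
    """
    if not 0 <= cached_share <= 1:
        raise ValueError("cached_share must be between 0 and 1")
    after_cache = cached_share * (1 - cache_discount) + (1 - cached_share)
    return after_cache * (1 - batch_discount)


# Fully cached input is the best case; 80% cached is a more typical mix.
best_case = combined_input_cost_fraction(1.0)
typical = combined_input_cost_fraction(0.8)
print(f"best case saving: {1 - best_case:.0%}, 80% cached saving: {1 - typical:.0%}")
```

Output tokens are never cached, so they sit at the batch discount alone; the overall saving on a bill depends on how input-heavy the workload is.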


Always verify feature availability and specific token rates in official developer documentation.

Professor XAI
ML Engineer passionate about advancing AI technologies and building intelligent systems.