Most pricing comparisons look only at theoretical charts showing “$ per million tokens.” But in production, theoretical rates rarely match your actual bill. Token counts fluctuate, errors cause retries, and formatting schemas consume extra context.
To find the true cost of running AI in production, I built the exact same document ingestion pipeline using 5 different AI APIs and ran 100,000 simulation requests through each.
Here is the raw cost, latency, and reliability data from the experiment.
🧮 Calculate your own scenario: Try our AI API Pricing Calculator to project costs for your specific request sizes and daily volume.
The Experiment Setup
- The Application: A real-world invoice parser that extracts line items, totals, dates, and vendor details into structured JSON.
- Prompt Size: ~1,000 tokens (System instructions + schema definition + OCR text).
- Response Size: ~300 tokens (Structured JSON output).
- Total Run Size: 100,000 requests per model.
- Caching Status: Caching was enabled for all models to simulate production settings.
The Raw Data: Real-World Ingestion Cost (100K Runs)
Here is how the bills looked after completing the 100,000 runs:
| Model | Total Input Tokens | Total Output Tokens | Cached Input | Total Cost | Error Rate (retries) |
|---|---|---|---|---|---|
| GPT-4.1 Nano | 100M | 30M | ~70% | $9.00 | 1.8% |
| Gemini 2.5 Flash-Lite | 100M | 30M | ~70% | $9.00 | 2.1% |
| Gemini 3 Flash | 100M | 30M | ~70% | $31.00 | 0.4% |
| GPT-4.1 (Standard) | 100M | 30M | ~70% | $116.00 | 0.2% |
| Claude Sonnet 4.6 | 100M | 30M | ~70% | $180.00 | 0.1% |
Analysis of the Results
1. The $9.00 Low-Cost Leaders
Both GPT-4.1 Nano and Gemini 2.5 Flash-Lite completed the runs for under ten dollars.
- The Caveat: Their error rates (JSON formatting failures, missing fields) were the highest, at ~2%. This required retry loops that slightly added to the latency and token count.
- The Verdict: Perfect for bulk parsing where a 2% failure rate is acceptable or managed via script fallbacks.
2. The Sweet Spot: Gemini 3 Flash
At $31.00 for the run, Gemini 3 Flash was 3.7x cheaper than GPT-4.1 and 5.8x cheaper than Claude Sonnet, while maintaining a very low 0.4% error rate.
- The Verdict: The undisputed best value for standard SaaS automation.
3. The Premium Tier: Claude Sonnet 4.6
While Claude Sonnet 4.6 was the most expensive at $180.00, it completed the task with almost perfect accuracy (0.1% error rate) and required the fewest retries on complex invoices.
- The Verdict: Use Sonnet only when accuracy is critical to your revenue.
Lessons Learned from 500,000 Runs
- Format Errors Cost Money: Every failed JSON format request requires a retry, which means paying for the input tokens a second time. A slightly smarter model (like Gemini 3 Flash) can sometimes be cheaper than a budget model if it reduces retries.
- Optimize System Prompts: In our tests, system prompts were cached automatically. Caching saved us over $120.00 across the experiment.
- Use JSON Mode: Enable native JSON formatting modes (
response_format={"type": "json_object"}) on all models to prevent unnecessary output tokens (e.g. conversational chatter like “Sure, here is your JSON…”).