Building a Production Multimodal Invoice Parser: Pydantic AI, Gemini, and Cost Comparison vs. Legacy OCR
Document processing is one of the larger cost centers in enterprise operations. Handling accounts payable invoices, receipts, and purchase orders typically relies on legacy optical character recognition (OCR) engines, followed by custom heuristic rules to parse floating headers, line items, and dynamic dates.
These legacy pipelines are fragile, difficult to train, and costly to run. In 2026, natively multimodal vision-language models (VLMs) such as Google Gemini offer a practical alternative that can replace much of the traditional OCR stack for this kind of extraction.
This tutorial walks through building a complete, production-grade **Multimodal Invoice Parser** using Pydantic AI, FastAPI, uv, and Docker Compose. We will implement both a generic extraction pipeline and a dynamic, query-driven parser.
We also provide a worked cost and ROI comparison of Gemini against AWS Textract, Google Cloud Document AI, Nanonets, Mindee, and Rossum.ai.
---
## Technical Cost Comparison & ROI Analysis
Before diving into code, let us analyze why switching to a VLM-based parser is a major financial advantage for enterprise operations.
Traditional layout extraction requires paying separately for standard OCR processing, key-value pair detection, and line-item table parsing. Here is how specialized systems compare to the Google Gemini API, assuming 100,000 invoices per month with roughly 1,500 visual input tokens and 350 output tokens per document.
### Cost Matrix (Per 1,000 Invoices)
| Provider / Tool | OCR Cost | KV/Extraction Cost | Tables/Line Items Cost | Total Per 1k Documents | Monthly Cost (100k docs) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **AWS Textract** | $1.50 | $50.00 | $15.00 | **$66.50** | **$6,650.00** |
| **GCP Document AI** | $1.50 | $30.00 | $15.00 | **$46.50** | **$4,650.00** |
| **Nanonets** | Included | Included | Included | **$100.00** | **$10,000.00** |
| **Mindee** | Included | Included | Included | **$150.00** | **$15,000.00** |
| **Rossum.ai** | Included | Included | Included | **$400.00** | **$40,000.00** |
| **Gemini 2.5 Flash** | Included | Included | Included | **$0.22** | **$21.75** |
### Mathematical Breakdown: Gemini 2.5 Flash ROI
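For a document comprising **1,500 Input Tokens** (image + prompt) and returning **350 Output Tokens** (structured JSON), using the per-token rates assumed for this comparison (confirm them against the current Gemini pricing page, since rates differ by model tier):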
* **Input Token Price:** $0.075 / 1 Million Tokens
* **Output Token Price:** $0.30 / 1 Million Tokens
* **Cost Per Document:**
$$\text{Cost} = (1500 \times 0.000000075) + (350 \times 0.00000030) = 0.0001125 + 0.000105 = \$0.0002175$$
* **Cost Per 1,000 Invoices:** **$0.2175** (about $0.22)
Compared to **AWS Textract** ($66.50 per 1k), Gemini represents a cost reduction of more than **99.6%**, while also offering flexible layout understanding and the ability to answer contextual questions about a document.
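The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that takes the token counts and per-million-token rates stated in this section as given; the function names (`cost_per_document`, `reduction_vs`) are illustrative only, and the rates are assumptions to be replaced with whatever the pricing page lists for the model you deploy.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def cost_per_document(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: Decimal,
    output_price_per_million: Decimal,
) -> Decimal:
    """Return the USD cost of one request given per-million-token rates."""
    input_cost = Decimal(input_tokens) * input_price_per_million / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * output_price_per_million / TOKENS_PER_MILLION
    return input_cost + output_cost


def reduction_vs(baseline_per_1k: Decimal, candidate_per_1k: Decimal) -> Decimal:
    """Return the percentage saved by the candidate relative to the baseline."""
    return (Decimal(1) - candidate_per_1k / baseline_per_1k) * 100


# Figures taken from this section (assumed rates, not a live price lookup).
per_doc = cost_per_document(1500, 350, Decimal("0.075"), Decimal("0.30"))
per_1k = per_doc * 1000
monthly = per_doc * 100_000

if __name__ == "__main__":
    print(f"per document: ${per_doc}")
    print(f"per 1,000:    ${per_1k}")
    print(f"per 100,000:  ${monthly}")
    saving = reduction_vs(Decimal("66.50"), per_1k)
    print(f"saving vs. Textract at $66.50 per 1k: {saving:.2f}%")
```

Using `Decimal` rather than `float` keeps the small per-token amounts exact, which matters once they are multiplied up to monthly volumes.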
---
## System Architecture & Multi-Endpoint Design
Our application features two distinct extraction pipelines:
1. **Generic Extraction (`/api/v1/parser/generic`):** Extracts a standard structured model containing essential vendor metadata, totals, dates, and a complete nested list of individual line items.
2. **Query-Driven Extraction (`/api/v1/parser/query`):** Accepts an upload along with a custom list of query instructions defined dynamically by the client (e.g., "Check if there is a VAT number," "Extract the carbon tax percentage"). The system evaluates the image and returns exactly those requested fields.
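Because the query endpoint receives its instructions as a single string, the client has to serialize its list of requested fields somehow. One possible convention, sketched below, is a numbered list with one instruction per line. The helper name `format_queries` and the numbering scheme are assumptions for illustration, not something the API requires.

```python
def format_queries(queries: list[str]) -> str:
    """Join client-defined instructions into one numbered, newline-separated string.

    Blank entries are dropped and surrounding whitespace is trimmed, so the
    model sees exactly one instruction per line.
    """
    cleaned = [q.strip() for q in queries if q and q.strip()]
    if not cleaned:
        raise ValueError("At least one non-empty query is required.")
    return "\n".join(f"{i}. {q}" for i, q in enumerate(cleaned, start=1))


if __name__ == "__main__":
    print(format_queries([
        "Check if there is a VAT number",
        "  Extract the carbon tax percentage ",
        "",
    ]))
```

The resulting string would be sent as the `queries` form field alongside the uploaded file.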
---
## Step 1: Defining the Dual-Pipeline Schemas
Create the schema definitions in `app/parser_schemas.py`:
```python
from pydantic import BaseModel, Field
from typing import List, Dict, Any, Optional


# ----------------------------------------------------
# Pipeline 1: Generic Extraction Schemas
# ----------------------------------------------------
class InvoiceLineItem(BaseModel):
    description: str = Field(..., description="Description of the item or service rendered")
    quantity: Optional[float] = Field(None, description="Quantity of items purchased")
    unit_price: Optional[float] = Field(None, description="Price per unit of the item")
    amount: Optional[float] = Field(None, description="Total amount for this line item")


class GenericInvoiceData(BaseModel):
    vendor_name: str = Field(..., description="Name of the company or vendor issuing the invoice")
    vendor_address: Optional[str] = Field(None, description="Physical or email address of the vendor")
    invoice_number: Optional[str] = Field(None, description="Invoice or receipt identification number")
    invoice_date: Optional[str] = Field(None, description="Date of invoice issuance (YYYY-MM-DD)")
    due_date: Optional[str] = Field(None, description="Payment due date (YYYY-MM-DD)")
    subtotal: Optional[float] = Field(None, description="Invoice subtotal before taxes and shipping")
    tax_amount: Optional[float] = Field(None, description="Tax amount listed on the invoice")
    total_amount: float = Field(..., description="Total gross amount payable on the invoice")
    currency: str = Field("USD", description="Currency symbol or standard code")
    line_items: List[InvoiceLineItem] = Field(default_factory=list, description="List of items extracted")


# ----------------------------------------------------
# Pipeline 2: Query-Driven Custom Schemas
# ----------------------------------------------------
class CustomField(BaseModel):
    field_key: str = Field(..., description="Name of the custom field requested (e.g., 'vat_number')")
    extracted_value: Optional[str] = Field(None, description="Value extracted by the agent matching custom prompt")
    confidence_score: float = Field(..., description="Confidence rating between 0.0 and 1.0")


class DynamicInvoiceResponse(BaseModel):
    custom_extractions: List[CustomField] = Field(default_factory=list, description="List of custom requested fields")
```
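Since every numeric field in `GenericInvoiceData` is produced by a model, it can help to cross-check the figures against each other before trusting them. The sketch below works on the plain dictionary that `model_dump()` would return, so it needs only the standard library. The function name `reconcile_totals` and the one-cent tolerance are assumptions, and invoices with shipping or discounts would need extra terms.

```python
from decimal import Decimal
from typing import Any, Optional

TOLERANCE = Decimal("0.01")  # assumed rounding tolerance of one cent


def _to_decimal(value: Any) -> Optional[Decimal]:
    """Convert a possibly-missing float to Decimal via str to avoid binary noise."""
    return None if value is None else Decimal(str(value))


def reconcile_totals(invoice: dict[str, Any]) -> list[str]:
    """Return human-readable warnings for figures that do not add up."""
    warnings: list[str] = []
    subtotal = _to_decimal(invoice.get("subtotal"))
    tax = _to_decimal(invoice.get("tax_amount"))
    total = _to_decimal(invoice.get("total_amount"))

    # Check 1: line-item amounts should sum to the subtotal when both exist.
    amounts = [_to_decimal(item.get("amount")) for item in invoice.get("line_items", [])]
    if subtotal is not None and amounts and all(a is not None for a in amounts):
        line_sum = sum(amounts, Decimal(0))
        if abs(line_sum - subtotal) > TOLERANCE:
            warnings.append(f"line items sum to {line_sum}, subtotal is {subtotal}")

    # Check 2: subtotal plus tax should equal the total when all three exist.
    if subtotal is not None and tax is not None and total is not None:
        if abs(subtotal + tax - total) > TOLERANCE:
            warnings.append(f"subtotal + tax is {subtotal + tax}, total is {total}")

    return warnings
```

An empty list means the figures are mutually consistent; a non-empty one is a reasonable trigger for routing the document to manual review.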
---
## Step 2: Implementing Pydantic AI Multimodal Agents
Create `app/parser_agent.py`:
```python
from pydantic_ai import Agent, BinaryContent

from app.parser_schemas import GenericInvoiceData, DynamicInvoiceResponse

# 1. Initialize the Generic Extraction Agent.
# `output_type` makes Pydantic AI validate the model's reply against the schema;
# it pairs with the `result.output` attribute read below.
generic_parser_agent = Agent(
    model="google-gla:gemini-2.5-flash",
    output_type=GenericInvoiceData,
    system_prompt=(
        "You are an expert invoice processing agent. "
        "Your task is to analyze the provided invoice image, extract all key metadata, "
        "and list out the line items inside the defined schema."
    ),
)

# 2. Initialize the Dynamic Query-Driven Agent
query_parser_agent = Agent(
    model="google-gla:gemini-2.5-flash",
    output_type=DynamicInvoiceResponse,
    system_prompt=(
        "You are a flexible key-value extraction agent. "
        "Your task is to analyze the invoice image and extract only the custom fields "
        "specified in the user prompt. For each field, provide the key, the extracted value, "
        "and a confidence rating based on clarity."
    ),
)


async def run_generic_parser(image_bytes: bytes, media_type: str) -> GenericInvoiceData:
    """Invokes VLM to extract standard invoice schemas."""
    # The prompt is a list: a text instruction followed by the raw image bytes.
    result = await generic_parser_agent.run(
        [
            "Extract standard metadata and line items from this invoice image.",
            BinaryContent(data=image_bytes, media_type=media_type),
        ]
    )
    return result.output


async def run_query_parser(image_bytes: bytes, media_type: str, queries: str) -> DynamicInvoiceResponse:
    """Invokes VLM with dynamic instructions to pull customizable parameters."""
    result = await query_parser_agent.run(
        [
            f"Analyze this image and extract only the following custom fields:\n{queries}",
            BinaryContent(data=image_bytes, media_type=media_type),
        ]
    )
    return result.output
```
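Calls to a hosted model can fail transiently (rate limits, timeouts), so a production deployment would probably wrap `run_generic_parser` and `run_query_parser` in some retry policy. The following is a generic sketch using only `asyncio`. The helper name `with_retries`, the attempt count, and the delays are assumptions, and in practice the `except` clause should be narrowed to the specific transient errors you observe.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await `call()` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            # Re-raise the last failure once the retry budget is exhausted.
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * (2 ** attempt))
    # The loop always returns or raises; this line only satisfies type checkers.
    raise AssertionError("unreachable")
```

A call site might look like `await with_retries(lambda: run_generic_parser(image_bytes, media_type))`; the lambda matters because each retry needs a fresh coroutine.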
---
## Step 3: Setting Up the FastAPI Endpoints
Create `app/parser_main.py`:
```python
from fastapi import FastAPI, HTTPException, UploadFile, File, Form
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse

from app.parser_schemas import GenericInvoiceData, DynamicInvoiceResponse
from app.parser_agent import run_generic_parser, run_query_parser

app = FastAPI(
    title="Multimodal Invoice Processing Hub",
    version="1.0.0",
)

# Wide-open CORS is convenient for local development; restrict the origins
# before deploying. Credentials are disabled because the CORS specification
# does not allow the "*" wildcard origin together with credentials.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["*"],
)

ALLOWED_TYPES = {"image/jpeg", "image/png", "image/webp"}


def validate_image(file: UploadFile) -> None:
    """Reject uploads whose declared content type is not a supported image format."""
    if file.content_type not in ALLOWED_TYPES:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported file format: {file.content_type}. Use JPEG, PNG, or WebP.",
        )


@app.post("/api/v1/parser/generic", response_model=GenericInvoiceData)
async def extract_generic_fields(file: UploadFile = File(...)):
    """Generic invoice parser: extracts standard metadata and item tables."""
    validate_image(file)
    try:
        image_bytes = await file.read()
        extracted_data = await run_generic_parser(image_bytes, file.content_type)
        return extracted_data
    except Exception as e:
        # Chain the original exception so the traceback is preserved in logs.
        raise HTTPException(status_code=500, detail=f"Generic extraction error: {str(e)}") from e


@app.post("/api/v1/parser/query", response_model=DynamicInvoiceResponse)
async def extract_custom_fields(
    file: UploadFile = File(...),
    queries: str = Form(..., description="String containing custom fields instructions"),
):
    """Dynamic query parser: extracts only client-defined specific fields."""
    validate_image(file)
    try:
        image_bytes = await file.read()
        extracted_data = await run_query_parser(image_bytes, file.content_type, queries)
        return extracted_data
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Custom extraction error: {str(e)}") from e


# Serve Client Frontend Panel
app.mount("/static", StaticFiles(directory="app/static"), name="static")


@app.get("/")
async def read_root():
    return FileResponse("app/static/parser.html")
```
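`validate_image` trusts the `Content-Type` header, which is supplied by the client. A complementary check, sketched below, inspects the first bytes of the upload for the JPEG, PNG, and WebP signatures and enforces a size cap. The helper names (`sniff_image_type`, `check_upload`) and the 10 MB limit, borrowed from the upload hint in the frontend, are assumptions rather than part of the endpoints above.

```python
from typing import Optional

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed cap, matching the "up to 10MB" hint

JPEG_SIGNATURE = b"\xff\xd8\xff"
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_image_type(data: bytes) -> Optional[str]:
    """Guess the media type from the file's leading bytes, or return None."""
    if data.startswith(JPEG_SIGNATURE):
        return "image/jpeg"
    if data.startswith(PNG_SIGNATURE):
        return "image/png"
    # WebP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def check_upload(data: bytes, declared_type: str) -> str:
    """Return the verified media type, or raise ValueError with a reason."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError(f"File exceeds {MAX_UPLOAD_BYTES} bytes.")
    detected = sniff_image_type(data)
    if detected is None:
        raise ValueError("File content is not a recognized JPEG, PNG, or WebP image.")
    if detected != declared_type:
        raise ValueError(f"Declared type {declared_type} does not match content ({detected}).")
    return detected
```

Inside an endpoint, the `ValueError` would be caught right after `await file.read()` and converted into an `HTTPException` with status 400, so malformed uploads never reach the model.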
---
## Step 4: High-Fidelity shadcn-ui Frontend
For the client, let us build a single-page interface styled after **shadcn-ui** design conventions. It features a file upload area, side-by-side JSON previews, dynamic instruction fields, and a per-document cost indicator.
Create `app/static/parser.html`:
```html
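<!-- Visible text of the page only; the original markup, styles, and scripts are not reproduced here. -->
<title>Rogue Invoice — Multimodal Parser</title>

<header>
  <h1>Rogue Invoice</h1>
  <p>Vision Intelligence Hub</p>
  <p>Gemini Process Cost: $0.00021 / Doc</p>
</header>

<section>
  <h2>Upload Invoice Image</h2>
  <p>Drag and drop your invoice image here, or browse files</p>
  <p>JPEG, PNG, or WebP up to 10MB</p>
  <p>invoice.png</p>
</section>

<section>
  <h2>Custom Key Extraction</h2>
  <p>Define specific fields your business requires. The VLM will dynamically parse these targets.</p>
</section>

<section>
  <h2>Structured JSON Output</h2>
  <p>Ready to parse document. Upload an image and click an extraction pipeline.</p>
</section>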