Production Multimodal Vision AI with Pydantic AI, FastAPI, Docker, and uv

21 May 2026 (Updated: May 21, 2026) 📖 7 min read

Building AI applications that understand images is one of the most commercially valuable capabilities available to developers today. Document parsing, product cataloging, medical imaging triage, and security monitoring all rely on vision-language models (VLMs) to extract structured data from visual inputs.

However, taking a vision AI prototype from a Jupyter notebook to a production-grade, containerized microservice requires careful engineering across multiple layers: structured data validation, asynchronous web serving, reproducible dependency management, and container orchestration.

This guide walks through every layer of that stack. We will build a complete multimodal vision analysis service using Pydantic AI for structured agent orchestration, FastAPI for the async web layer, uv for lightning-fast dependency management, and Docker with Docker Compose for reproducible deployment.

Understanding the Technology Stack

Before writing code, let us clarify why each tool is essential and what role it plays in the architecture.

Pydantic AI

Pydantic AI is a Python agent framework built by the creators of Pydantic. It provides a clean, type-safe interface for interacting with LLMs. Its key advantage for vision applications is structured output extraction. You define a Pydantic model (e.g., InvoiceData or ProductLabel) and Pydantic AI forces the LLM to return data that strictly conforms to that schema. This eliminates the fragile regex parsing and JSON-fixing that plagues traditional LLM integrations.

For multimodal inputs, Pydantic AI provides two primitives:

ImageUrl: Pass a publicly accessible image URL directly to the agent.
BinaryContent: Pass raw image bytes (e.g., from a file upload) with a specified MIME type.

FastAPI

FastAPI is the standard Python framework for building high-performance asynchronous APIs. It natively integrates with Pydantic for request/response validation and generates interactive OpenAPI documentation automatically.

uv

uv is a Rust-based Python package manager that replaces pip, pip-tools, and virtualenv. It resolves and installs dependencies 10-100x faster than pip and produces a deterministic uv.lock file that guarantees identical environments across development, CI/CD, and production containers.

Docker and Docker Compose

Docker packages the application and all its dependencies into an isolated, portable container. Docker Compose orchestrates multiple containers (e.g., the API service and a Redis cache) with a single configuration file.

Step 1: Bootstrapping the Project with uv

Initialize a new Python project using uv. This creates the pyproject.toml manifest and pins the Python version.

# Create the project directory
mkdir vision-ai-service && cd vision-ai-service

# Initialize the project with uv
uv init .

# Pin the Python version
uv python pin 3.13

# Add production dependencies
uv add pydantic-ai fastapi uvicorn python-multipart httpx

# Add development dependencies
uv add --dev pytest ruff

After running these commands, your project will have the following structure:

vision-ai-service/
  .python-version
  pyproject.toml
  uv.lock
  main.py

The pyproject.toml file will contain all your dependencies, and uv.lock will contain the exact pinned versions for reproducible builds.

Step 2: Defining Structured Vision Schemas

The core advantage of Pydantic AI is forcing the LLM to return data that conforms to a strict schema. Let us define schemas for two common vision use cases: product label analysis and document/invoice extraction.

Create a file app/schemas.py:

from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum


class Confidence(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class ExtractedEntity(BaseModel):
    """A single named entity extracted from the image."""
    label: str = Field(..., description="The category of the entity (e.g., 'product_name', 'price', 'date')")
    value: str = Field(..., description="The extracted text value")
    confidence: Confidence = Field(..., description="Confidence level of the extraction")


class ProductLabelAnalysis(BaseModel):
    """Structured output for analyzing a product label or packaging image."""
    product_name: str = Field(..., description="The primary product name visible on the label")
    brand: Optional[str] = Field(None, description="The brand or manufacturer name")
    ingredients: List[str] = Field(default_factory=list, description="List of ingredients if visible")
    nutritional_claims: List[str] = Field(default_factory=list, description="Health or nutritional claims (e.g., 'organic', 'gluten-free')")
    weight_or_volume: Optional[str] = Field(None, description="Net weight or volume as printed")
    barcode_detected: bool = Field(False, description="Whether a barcode or QR code is visible")


class InvoiceExtraction(BaseModel):
    """Structured output for extracting data from an invoice or receipt image."""
    vendor_name: str = Field(..., description="The name of the vendor or company issuing the invoice")
    invoice_number: Optional[str] = Field(None, description="Invoice or receipt number")
    date: Optional[str] = Field(None, description="Date of the invoice in YYYY-MM-DD format")
    total_amount: Optional[str] = Field(None, description="Total amount including currency symbol")
    tax_amount: Optional[str] = Field(None, description="Tax amount if listed separately")
    line_items: List[ExtractedEntity] = Field(default_factory=list, description="Individual line items extracted")
    payment_method: Optional[str] = Field(None, description="Payment method if visible (e.g., 'Visa', 'Cash')")

These schemas guarantee that no matter how the LLM phrases its internal reasoning, the final output will always conform to strict field types, required values, and enumerated confidence levels.

Step 3: Building the Pydantic AI Vision Agent

Now we build the core agent layer. Pydantic AI agents accept multimodal inputs (BinaryContent for uploaded files) and return structured outputs (result_type).

Create a file app/agent.py:

from pydantic_ai import Agent, BinaryContent
from app.schemas import ProductLabelAnalysis, InvoiceExtraction


# Agent for Product Label Analysis
product_agent = Agent(
    model="google-gla:gemini-2.0-flash",
    result_type=ProductLabelAnalysis,
    system_prompt=(
        "You are a product label analysis specialist. "
        "Examine the provided image of a product label or packaging. "
        "Extract all visible information including the product name, brand, "
        "ingredients list, nutritional claims, weight/volume, and barcode presence. "
        "Be precise. Only extract information that is clearly visible in the image."
    ),
)


# Agent for Invoice/Receipt Extraction
invoice_agent = Agent(
    model="google-gla:gemini-2.0-flash",
    result_type=InvoiceExtraction,
    system_prompt=(
        "You are a financial document extraction specialist. "
        "Examine the provided image of an invoice, receipt, or bill. "
        "Extract the vendor name, invoice number, date, total amount, tax, "
        "individual line items, and payment method. "
        "Format dates as YYYY-MM-DD. Include currency symbols with amounts."
    ),
)


async def analyze_product_label(image_bytes: bytes, media_type: str) -> ProductLabelAnalysis:
    """Run the product label agent on the provided image bytes."""
    result = await product_agent.run(
        [
            "Analyze this product label image and extract all structured information.",
            BinaryContent(data=image_bytes, media_type=media_type),
        ]
    )
    return result.output


async def extract_invoice_data(image_bytes: bytes, media_type: str) -> InvoiceExtraction:
    """Run the invoice extraction agent on the provided image bytes."""
    result = await invoice_agent.run(
        [
            "Extract all structured data from this invoice or receipt image.",
            BinaryContent(data=image_bytes, media_type=media_type),
        ]
    )
    return result.output

Notice how the agent definition is clean and declarative. The result_type parameter tells Pydantic AI to enforce schema validation on the LLM output. If the model returns malformed data, Pydantic AI automatically retries the request with corrective instructions.

Step 4: Building the FastAPI Web Layer

Now we wire everything together with FastAPI endpoints that accept image uploads and return structured JSON responses.

Create the main application file app/main.py:

import os
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException, UploadFile, File
from fastapi.middleware.cors import CORSMiddleware
from app.agent import analyze_product_label, extract_invoice_data
from app.schemas import ProductLabelAnalysis, InvoiceExtraction


ALLOWED_MIME_TYPES = {"image/jpeg", "image/png", "image/webp", "image/gif"}


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Application lifespan: validate API key on startup."""
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError(
            "GEMINI_API_KEY environment variable is not set. "
            "Please configure it before starting the service."
        )
    print("Vision AI Service started successfully.")
    yield
    print("Vision AI Service shutting down.")


app = FastAPI(
    title="Vision AI Extraction Service",
    description="Production multimodal vision API powered by Pydantic AI and Gemini.",
    version="1.0.0",
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


def validate_upload(file: UploadFile) -> None:
    """Validate that the uploaded file is a supported image type."""
    if file.content_type not in ALLOWED_MIME_TYPES:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported file type: {file.content_type}. "
                   f"Supported types: {list(ALLOWED_MIME_TYPES)}"
        )


@app.post("/api/v1/analyze/product", response_model=ProductLabelAnalysis)
async def analyze_product(file: UploadFile = File(...)):
    """
    Upload a product label or packaging image.
    Returns structured extraction of product name, brand,
    ingredients, nutritional claims, and barcode detection.
    """
    validate_upload(file)
    try:
        image_bytes = await file.read()
        result = await analyze_product_label(image_bytes, file.content_type)
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/api/v1/analyze/invoice", response_model=InvoiceExtraction)
async def analyze_invoice(file: UploadFile = File(...)):
    """
    Upload an invoice, receipt, or bill image.
    Returns structured extraction of vendor, amounts,
    line items, dates, and payment method.
    """
    validate_upload(file)
    try:
        image_bytes = await file.read()
        result = await extract_invoice_data(image_bytes, file.content_type)
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
async def health_check():
    return {"status": "healthy", "service": "vision-ai-extraction"}

Step 5: Containerizing with Docker Multi-Stage Builds

A multi-stage Dockerfile ensures that build tools (like uv and compilers) never make it into the final production image, keeping it small and secure.

Create a Dockerfile in the project root:

# ============================================
# Stage 1: Builder
# ============================================
FROM python:3.13-slim AS builder

# Install uv from the official container image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv

# Configure uv environment
ENV VIRTUAL_ENV=/opt/venv
ENV UV_COMPILE_BYTECODE=1
ENV UV_LINK_MODE=copy

# Create virtual environment
RUN uv venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Copy dependency files first (layer caching optimization)
WORKDIR /build
COPY pyproject.toml uv.lock ./

# Install dependencies using frozen lockfile for reproducibility
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --no-install-project

# Copy the rest of the application code
COPY . .

# Install the project itself
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev

# ============================================
# Stage 2: Production Runtime
# ============================================
FROM python:3.13-slim

# Copy the virtual environment from the builder stage
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Set up the application directory
WORKDIR /app
COPY . /app

# Create a non-root user for security
RUN useradd --create-home --shell /bin/bash appuser
USER appuser

# Expose the FastAPI port
EXPOSE 8000

# Start the FastAPI server with uvicorn
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]

Why This Structure Matters

Layer caching: pyproject.toml and uv.lock are copied before the source code. If your code changes but dependencies do not, Docker reuses the cached dependency layer, saving minutes on every rebuild.
--frozen flag: Guarantees that uv sync uses exactly the versions in uv.lock without attempting to resolve new versions.
UV_COMPILE_BYTECODE=1: Pre-compiles all .py files to .pyc bytecode during the build. This eliminates the compilation overhead on first import in production, reducing cold-start latency.
Non-root user: The final image runs as appuser, not root, following container security best practices.

Step 6: Orchestrating with Docker Compose

Docker Compose simplifies running the service alongside supporting infrastructure. Create a docker-compose.yml file:

services:
  vision-api:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: vision-ai-service
    ports:
      - "8000:8000"
    environment:
      - GEMINI_API_KEY=${GEMINI_API_KEY}
    env_file:
      - .env
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"

Create a .env file to store your API key securely (add .env to your .gitignore):

GEMINI_API_KEY=your-api-key-here

Running the Service

# Build and start the service
docker compose up --build -d

# View logs
docker compose logs -f vision-api

# Stop the service
docker compose down

Step 7: Testing the API

Once the service is running, you can test it using curl or any HTTP client.

Test Product Label Analysis

curl -X POST http://localhost:8000/api/v1/analyze/product \
  -F "file=@./sample_product_label.jpg" \
  | python -m json.tool

Example Response:

{
  "product_name": "Organic Greek Yogurt",
  "brand": "Chobani",
  "ingredients": ["Cultured Pasteurized Nonfat Milk", "Live Active Cultures"],
  "nutritional_claims": ["Non-GMO", "No Artificial Flavors", "Gluten-Free"],
  "weight_or_volume": "5.3 oz (150g)",
  "barcode_detected": true
}

Test Invoice Extraction

curl -X POST http://localhost:8000/api/v1/analyze/invoice \
  -F "file=@./sample_invoice.png" \
  | python -m json.tool

Example Response:

{
  "vendor_name": "Acme Cloud Services Ltd.",
  "invoice_number": "INV-2026-04821",
  "date": "2026-05-15",
  "total_amount": "$1,247.50",
  "tax_amount": "$112.50",
  "line_items": [
    {"label": "compute", "value": "GPU Instance (A100) x 720 hrs - $900.00", "confidence": "high"},
    {"label": "storage", "value": "Object Storage 500GB - $235.00", "confidence": "high"}
  ],
  "payment_method": "Visa ending 4242"
}

Conclusion

This guide demonstrates how to architect a production-grade multimodal vision service from the ground up. Pydantic AI eliminates the fragility of unstructured LLM outputs by enforcing strict schema validation. FastAPI provides the high-performance async web layer. uv delivers deterministic, lightning-fast dependency resolution. And Docker multi-stage builds produce minimal, secure container images ready for deployment.

The patterns demonstrated here extend naturally to any domain that requires structured extraction from visual data: medical imaging reports, construction site inspection logs, retail shelf audits, and insurance claims processing.

« Orchestrating Multi-Step AI Agents: Integrating Pydantic AI and LangGraph with Gemini 3.1 Pro

Serving Lightweight Open-Source LLMs Locally on CPU: A Developer's Best Practices Guide »

Professor XAI Follow ML Engineer passionate about advancing AI technologies and building intelligent systems.

Production Multimodal Vision AI with Pydantic AI, FastAPI, Docker, and uv

Understanding the Technology Stack

Pydantic AI

FastAPI

uv

Docker and Docker Compose

Step 1: Bootstrapping the Project with uv

Step 2: Defining Structured Vision Schemas

Step 3: Building the Pydantic AI Vision Agent

Step 4: Building the FastAPI Web Layer

Step 5: Containerizing with Docker Multi-Stage Builds

Why This Structure Matters

Step 6: Orchestrating with Docker Compose

Running the Service

Step 7: Testing the API

Test Product Label Analysis

Test Invoice Extraction

Conclusion

Newsletter

Popular Categories

Production Multimodal Vision AI with Pydantic AI, FastAPI, Docker, and uv

Understanding the Technology Stack

Pydantic AI

FastAPI

uv

Docker and Docker Compose

Step 1: Bootstrapping the Project with uv

Step 2: Defining Structured Vision Schemas

Step 3: Building the Pydantic AI Vision Agent

Step 4: Building the FastAPI Web Layer

Step 5: Containerizing with Docker Multi-Stage Builds

Why This Structure Matters

Step 6: Orchestrating with Docker Compose

Running the Service

Step 7: Testing the API

Test Product Label Analysis

Test Invoice Extraction

Conclusion

Newsletter

Get weekly AI insights & pricing updates delivered to your inbox

Popular Categories