Spreadsheets are the lifeblood of business operations. Yet, for developers, they are a constant source of friction. In May 2026, companies still exchange millions of Excel sheets and CSVs filled with missing values, mismatched date formats, unstructured notes, and raw human errors.
Traditional approaches to spreadsheet automation rely heavily on Python libraries like pandas or openpyxl combined with rigid regular expressions. While this works for clean data, it catastrophically fails when dealing with unstructured text columns (such as sales call notes, support feedback, or custom address fields) that require human-level reasoning to categorize.
To solve this, we must build a type-safe, AI-powered spreadsheet parser.
In this guide, we will combine openpyxl to stream Excel rows, Pydantic to enforce strict type-level validation schemas, and PydanticAI + Google Gemini to autonomously extract, clean, and validate unstructured spreadsheet columns into database-ready records at high speeds.
The Core Problem with Spreadsheet Data
Let’s look at a typical messy Excel row from a lead-generation form:
| Lead Name | Company / Site | Contact Info | Interaction Notes | Estimated Budget |
|---|---|---|---|---|
| John D. | Rogue Marketing | “john@roguemkt.com or text +1 555-0199” | “Interested in the OCR parser, wants to spend around 5k/month starting June.” | “Around 5000” |
If you run this through standard regex, you will fail to:
- Isolate the primary email from the text block in the
Contact Infocolumn. - Extract the standard country code from the phone number.
- Parse the unstructured sentence in the
Interaction Notesinto a clean start date and product category. - Cast the budget string to a clean float.
By wrapping Gemini 1.5 Flash (highly optimized for fast, cheap inference) inside PydanticAI, we can resolve all these challenges in a single, type-safe execution pass.
System Prerequisites
Ensure you have a modern Python environment (3.10+). Install openpyxl (the standard library to read/write .xlsx files), Pydantic, and PydanticAI:
pip install openpyxl pydantic pydantic-ai google-genai
Set your API credential in your environment:
export GEMINI_API_KEY="your-gemini-api-key"
1. Designing the Validated Pydantic Schema
We must first define what a “clean” lead row should look like. We will enforce strict typing, email formats, and use Pydantic’s @field_validator to clean and normalize numbers.
# schemas.py
import re
from pydantic import BaseModel, Field, EmailStr, field_validator
from datetime import date
from typing import Optional
class CleanLeadRow(BaseModel):
name: str = Field(description="The primary name of the lead.")
company: str = Field(description="The name of the company.")
email: EmailStr = Field(description="A strictly validated primary email address.")
phone: Optional[str] = Field(description="The cleaned contact phone number in E.164 format (e.g. +15550199).")
product_interest: str = Field(description="The specific product category they are interested in (e.g. OCR, Video, Audio).")
target_start_date: date = Field(description="The parsed date they want to start working together.")
monthly_budget: float = Field(description="The parsed monthly budget, extracted as a clean float.")
@field_validator("phone")
@classmethod
def clean_phone_number(cls, v: Optional[str]) -> Optional[str]:
"""
Enforces a clean E.164 phone format by stripping non-numeric characters locally.
"""
if not v:
return None
# Strip brackets, hyphens, and spaces
cleaned = re.sub(r"[^\d+]", "", v)
if not cleaned.startswith("+"):
# Default to US/Canada country code if missing
cleaned = "+" + cleaned
return cleaned
2. Setting Up the Spreadsheet Agent with PydanticAI
Now, we will build the core AI reasoning agent. We configure the agent to use Gemini 1.5 Flash for sub-second, ultra-cheap execution, passing in our target schema structure.
# spreadsheet_agent.py
import os
from pydantic_ai import Agent
from pydantic_ai.models.gemini import GeminiModel
from schemas import CleanLeadRow
# Initialize Gemini Model
gemini_model = GeminiModel(
'gemini-1.5-flash',
api_key=os.environ.get("GEMINI_API_KEY")
)
# System instructions directing the model on how to parse messy inputs
parser_prompt = """
You are an elite, high-performance Data Operations Agent operating inside an enterprise CRM database.
Your task is to ingest unstructured, messy columns of Excel data and sanitize them into a strictly typed schema object.
Strict Extraction Guidelines:
1. Contact Info: Read the text block, isolate the primary email, and identify the phone number.
2. Interaction Notes: Parse the conversation context. Identify what product they want (e.g., OCR, Video, Audio) and determine the exact date they want to start (use May 2026 as the current time context if relative terms like 'next month' are used).
3. Budget: Isolate the budget number and convert it into a clean float value.
"""
# Initialize the PydanticAI Agent with Structured Output
parsing_agent = Agent(
model=gemini_model,
result_type=CleanLeadRow,
system_prompt=parser_prompt
)
class DataOperationsService:
@staticmethod
async def parse_row(row_dict: dict) -> CleanLeadRow:
"""
Ingests a dictionary representing a raw Excel row, validates it, and returns a CleanLeadRow instance.
"""
row_string = "\n".join([f"{k}: {v}" for k, v in row_dict.items()])
try:
result = await parsing_agent.run(
user_prompt=f"Please sanitize the following spreadsheet row:\n\n{row_string}"
)
# The result.data is guaranteed to be a fully populated, validated CleanLeadRow instance
return result.data
except Exception as e:
raise RuntimeError(f"Row validation failed: {str(e)}")
3. Streaming and Writing Excel Data with openpyxl
Now, let’s tie the AI parsing layer to the filesystem. We will write a Python script that loads an Excel sheet, streams each row into our PydanticAI agent, compiles the cleaned results, and writes them back into a new, sanitized sheet.
# excel_pipeline.py
import asyncio
import openpyxl
from openpyxl import Workbook
from spreadsheet_agent import DataOperationsService
async def process_spreadsheet(input_path: str, output_path: str):
# 1. Load the input workbook
wb = openpyxl.load_workbook(input_path)
sheet = wb.active
# Read headers
headers = [cell.value for cell in sheet[1]]
print(f"Loaded sheet with headers: {headers}")
# Initialize a new Workbook for clean data
out_wb = Workbook()
out_sheet = out_wb.active
out_sheet.title = "Cleaned Leads"
# Write clean headers
clean_headers = [
"Lead Name", "Company", "Email", "Phone",
"Product Interest", "Target Start Date", "Monthly Budget"
]
out_sheet.append(clean_headers)
# 2. Iterate and stream rows (skipping header)
row_count = 0
for r_idx in range(2, sheet.max_row + 1):
row_values = [cell.value for cell in sheet[r_idx]]
if not any(row_values):
continue # Skip empty rows
row_dict = dict(zip(headers, row_values))
print(f"\nProcessing Row {r_idx-1}...")
try:
# Parse row via PydanticAI + Gemini
clean_row = await DataOperationsService.parse_row(row_dict)
# Append sanitized values to the new output sheet
out_sheet.append([
clean_row.name,
clean_row.company,
clean_row.email,
clean_row.phone,
clean_row.product_interest,
clean_row.target_start_date.strftime("%Y-%m-%d"),
clean_row.monthly_budget
])
row_count += 1
print(f"Row {r_idx-1} successfully sanitized: {clean_row.email}")
except Exception as e:
print(f"❌ Error sanitizing Row {r_idx-1}: {str(e)}")
# Save the output workbook
out_wb.save(output_path)
print(f"\nSpreadsheet successfully automated! processed {row_count} rows. Saved to: {output_path}")
# ==========================================
# Mock Excel Generator & Pipeline Run
# ==========================================
def create_mock_excel(path: str):
"""
Helper function to generate a messy test spreadsheet.
"""
wb = Workbook()
sheet = wb.active
sheet.append(["Lead Name", "Company / Site", "Contact Info", "Interaction Notes", "Estimated Budget"])
# Messy mock data
sheet.append([
"John D.",
"Rogue Marketing",
"john@roguemkt.com or text +1 555-0199",
"Interested in the OCR parser, wants to spend around 5k/month starting June 1st.",
"Around 5000"
])
sheet.append([
"Alice Smith",
"Aiviewz SaaS",
"Reach out at alice@aiviewz.com",
"Needs help automating the video render pipeline starting May 20, 2026. Budget is tight, 1200 max.",
"1200"
])
wb.save(path)
if __name__ == "__main__":
mock_input = "messy_leads.xlsx"
clean_output = "sanitized_leads.xlsx"
# Generate mock sheet
create_mock_excel(mock_input)
# Run the pipeline
asyncio.run(process_spreadsheet(mock_input, clean_output))
Conclusion & Productivity Gains
Manually cleaning spreadsheets is slow, expensive, and error-prone. By combining the streaming ease of openpyxl with the type-safe constraints of Pydantic and the high-speed reasoning of Gemini, developers can automate data cleansing pipelines in seconds.
This architecture scales perfectly to support hundreds of parallel rows inside background workers, making it the ideal framework to power B2B SaaS CSV import engines, Salesforce updates, and CRM sync pipelines.
Are you building automated spreadsheet engines or custom database cleaners? Let’s discuss openpyxl parameters, cell styles, and schema validators in the comments below!
Download the Complete PydanticAI Document Parser Blueprint
Get the complete, type-safe invoice and ID card parsing codebase in Python + a ready-to-run Docker environment. 100% free.