Automating Spreadsheet Workflows: High-Speed Excel Data Parsing & Validation with Python, Gemini, and Pydantic

Automating Spreadsheet Workflows: High-Speed Excel Data Parsing & Validation with Python, Gemini, and Pydantic

(Updated: ) 📖 5 min read

Spreadsheets are the lifeblood of business operations. Yet, for developers, they are a constant source of friction. In May 2026, companies still exchange millions of Excel sheets and CSVs filled with missing values, mismatched date formats, unstructured notes, and raw human errors.

Traditional approaches to spreadsheet automation rely heavily on Python libraries like pandas or openpyxl combined with rigid regular expressions. While this works for clean data, it catastrophically fails when dealing with unstructured text columns (such as sales call notes, support feedback, or custom address fields) that require human-level reasoning to categorize.

To solve this, we must build a type-safe, AI-powered spreadsheet parser.

In this guide, we will combine openpyxl to stream Excel rows, Pydantic to enforce strict type-level validation schemas, and PydanticAI + Google Gemini to autonomously extract, clean, and validate unstructured spreadsheet columns into database-ready records at high speeds.


The Core Problem with Spreadsheet Data

Let’s look at a typical messy Excel row from a lead-generation form:

Lead Name Company / Site Contact Info Interaction Notes Estimated Budget
John D. Rogue Marketing “john@roguemkt.com or text +1 555-0199” “Interested in the OCR parser, wants to spend around 5k/month starting June.” “Around 5000”

If you run this through standard regex, you will fail to:

  1. Isolate the primary email from the text block in the Contact Info column.
  2. Extract the standard country code from the phone number.
  3. Parse the unstructured sentence in the Interaction Notes into a clean start date and product category.
  4. Cast the budget string to a clean float.

By wrapping Gemini 1.5 Flash (highly optimized for fast, cheap inference) inside PydanticAI, we can resolve all these challenges in a single, type-safe execution pass.


System Prerequisites

Ensure you have a modern Python environment (3.10+). Install openpyxl (the standard library to read/write .xlsx files), Pydantic, and PydanticAI:

pip install openpyxl pydantic pydantic-ai google-genai

Set your API credential in your environment:

export GEMINI_API_KEY="your-gemini-api-key"

1. Designing the Validated Pydantic Schema

We must first define what a “clean” lead row should look like. We will enforce strict typing, email formats, and use Pydantic’s @field_validator to clean and normalize numbers.

# schemas.py
import re
from pydantic import BaseModel, Field, EmailStr, field_validator
from datetime import date
from typing import Optional

class CleanLeadRow(BaseModel):
    name: str = Field(description="The primary name of the lead.")
    company: str = Field(description="The name of the company.")
    email: EmailStr = Field(description="A strictly validated primary email address.")
    phone: Optional[str] = Field(description="The cleaned contact phone number in E.164 format (e.g. +15550199).")
    product_interest: str = Field(description="The specific product category they are interested in (e.g. OCR, Video, Audio).")
    target_start_date: date = Field(description="The parsed date they want to start working together.")
    monthly_budget: float = Field(description="The parsed monthly budget, extracted as a clean float.")

    @field_validator("phone")
    @classmethod
    def clean_phone_number(cls, v: Optional[str]) -> Optional[str]:
        """
        Enforces a clean E.164 phone format by stripping non-numeric characters locally.
        """
        if not v:
            return None
        # Strip brackets, hyphens, and spaces
        cleaned = re.sub(r"[^\d+]", "", v)
        if not cleaned.startswith("+"):
            # Default to US/Canada country code if missing
            cleaned = "+" + cleaned
        return cleaned

2. Setting Up the Spreadsheet Agent with PydanticAI

Now, we will build the core AI reasoning agent. We configure the agent to use Gemini 1.5 Flash for sub-second, ultra-cheap execution, passing in our target schema structure.

# spreadsheet_agent.py
import os
from pydantic_ai import Agent
from pydantic_ai.models.gemini import GeminiModel
from schemas import CleanLeadRow

# Initialize Gemini Model
gemini_model = GeminiModel(
    'gemini-1.5-flash',
    api_key=os.environ.get("GEMINI_API_KEY")
)

# System instructions directing the model on how to parse messy inputs
parser_prompt = """
You are an elite, high-performance Data Operations Agent operating inside an enterprise CRM database.
Your task is to ingest unstructured, messy columns of Excel data and sanitize them into a strictly typed schema object.

Strict Extraction Guidelines:
1. Contact Info: Read the text block, isolate the primary email, and identify the phone number.
2. Interaction Notes: Parse the conversation context. Identify what product they want (e.g., OCR, Video, Audio) and determine the exact date they want to start (use May 2026 as the current time context if relative terms like 'next month' are used).
3. Budget: Isolate the budget number and convert it into a clean float value.
"""

# Initialize the PydanticAI Agent with Structured Output
parsing_agent = Agent(
    model=gemini_model,
    result_type=CleanLeadRow,
    system_prompt=parser_prompt
)

class DataOperationsService:
    @staticmethod
    async def parse_row(row_dict: dict) -> CleanLeadRow:
        """
        Ingests a dictionary representing a raw Excel row, validates it, and returns a CleanLeadRow instance.
        """
        row_string = "\n".join([f"{k}: {v}" for k, v in row_dict.items()])
        try:
            result = await parsing_agent.run(
                user_prompt=f"Please sanitize the following spreadsheet row:\n\n{row_string}"
            )
            # The result.data is guaranteed to be a fully populated, validated CleanLeadRow instance
            return result.data
        except Exception as e:
            raise RuntimeError(f"Row validation failed: {str(e)}")

3. Streaming and Writing Excel Data with openpyxl

Now, let’s tie the AI parsing layer to the filesystem. We will write a Python script that loads an Excel sheet, streams each row into our PydanticAI agent, compiles the cleaned results, and writes them back into a new, sanitized sheet.

# excel_pipeline.py
import asyncio
import openpyxl
from openpyxl import Workbook
from spreadsheet_agent import DataOperationsService

async def process_spreadsheet(input_path: str, output_path: str):
    # 1. Load the input workbook
    wb = openpyxl.load_workbook(input_path)
    sheet = wb.active
    
    # Read headers
    headers = [cell.value for cell in sheet[1]]
    print(f"Loaded sheet with headers: {headers}")
    
    # Initialize a new Workbook for clean data
    out_wb = Workbook()
    out_sheet = out_wb.active
    out_sheet.title = "Cleaned Leads"
    
    # Write clean headers
    clean_headers = [
        "Lead Name", "Company", "Email", "Phone", 
        "Product Interest", "Target Start Date", "Monthly Budget"
    ]
    out_sheet.append(clean_headers)
    
    # 2. Iterate and stream rows (skipping header)
    row_count = 0
    for r_idx in range(2, sheet.max_row + 1):
        row_values = [cell.value for cell in sheet[r_idx]]
        if not any(row_values):
            continue  # Skip empty rows
            
        row_dict = dict(zip(headers, row_values))
        print(f"\nProcessing Row {r_idx-1}...")
        
        try:
            # Parse row via PydanticAI + Gemini
            clean_row = await DataOperationsService.parse_row(row_dict)
            
            # Append sanitized values to the new output sheet
            out_sheet.append([
                clean_row.name,
                clean_row.company,
                clean_row.email,
                clean_row.phone,
                clean_row.product_interest,
                clean_row.target_start_date.strftime("%Y-%m-%d"),
                clean_row.monthly_budget
            ])
            row_count += 1
            print(f"Row {r_idx-1} successfully sanitized: {clean_row.email}")
            
        except Exception as e:
            print(f"❌ Error sanitizing Row {r_idx-1}: {str(e)}")
            
    # Save the output workbook
    out_wb.save(output_path)
    print(f"\nSpreadsheet successfully automated! processed {row_count} rows. Saved to: {output_path}")

# ==========================================
# Mock Excel Generator & Pipeline Run
# ==========================================
def create_mock_excel(path: str):
    """
    Helper function to generate a messy test spreadsheet.
    """
    wb = Workbook()
    sheet = wb.active
    sheet.append(["Lead Name", "Company / Site", "Contact Info", "Interaction Notes", "Estimated Budget"])
    
    # Messy mock data
    sheet.append([
        "John D.", 
        "Rogue Marketing", 
        "john@roguemkt.com or text +1 555-0199", 
        "Interested in the OCR parser, wants to spend around 5k/month starting June 1st.", 
        "Around 5000"
    ])
    sheet.append([
        "Alice Smith", 
        "Aiviewz SaaS", 
        "Reach out at alice@aiviewz.com", 
        "Needs help automating the video render pipeline starting May 20, 2026. Budget is tight, 1200 max.", 
        "1200"
    ])
    
    wb.save(path)

if __name__ == "__main__":
    mock_input = "messy_leads.xlsx"
    clean_output = "sanitized_leads.xlsx"
    
    # Generate mock sheet
    create_mock_excel(mock_input)
    
    # Run the pipeline
    asyncio.run(process_spreadsheet(mock_input, clean_output))

Conclusion & Productivity Gains

Manually cleaning spreadsheets is slow, expensive, and error-prone. By combining the streaming ease of openpyxl with the type-safe constraints of Pydantic and the high-speed reasoning of Gemini, developers can automate data cleansing pipelines in seconds.

This architecture scales perfectly to support hundreds of parallel rows inside background workers, making it the ideal framework to power B2B SaaS CSV import engines, Salesforce updates, and CRM sync pipelines.

Are you building automated spreadsheet engines or custom database cleaners? Let’s discuss openpyxl parameters, cell styles, and schema validators in the comments below!

FREE CODE TEMPLATE

Download the Complete PydanticAI Document Parser Blueprint

Get the complete, type-safe invoice and ID card parsing codebase in Python + a ready-to-run Docker environment. 100% free.

Professor XAI
Professor XAI ML Engineer passionate about advancing AI technologies and building intelligent systems.
comments powered by Disqus