How much does Gemini's Text-to-Speech (TTS) cost?

Gemini 3.5 Flash TTS costs $0.15/1M input text tokens and $6.00/1M output audio tokens (approx. $0.09 per 10 minutes of audio). Gemini 3.1 Flash TTS costs $1.00/1M input and $20.00/1M output tokens (approx. $0.31 per 10 minutes).

How many tokens are used per second of audio output?

For Gemini TTS and Live API models, audio output corresponds to a rate of 25 tokens per second. This means 1 million output tokens yields approximately 11 hours of audio.

What is the cheapest way to generate text-to-speech using Google APIs?

Gemini 3.5 Flash TTS is the cheapest option at $6.00 per million output audio tokens. Additionally, the Batch API offers a 50% discount, bringing the cost down to $3.00/1M.

How much does the Gemini Live API cost?

The real-time bidirectional Gemini Live API costs $1.00/1M input tokens (both text and audio) and $20.00/1M output tokens for audio outputs ($6.00/1M for text outputs).

How does Gemini Audio transcription compare to Google Cloud Speech-to-Text?

Traditional Google Cloud STT (Chirp) costs $0.016 per minute. Gemini Audio Understanding is billed at standard Flash rates ($0.50/$3.00 per 1M tokens), which can be much cheaper for short summaries or Q&A.

Google Gemini TTS & Speech API Pricing June 2026 — Gemini 3.1 Flash, 3.5 Pro TTS & Live API Costs

17 May 2026 (Updated: Jun 3, 2026) 📖 5 min read

Google now offers voice and speech capabilities through multiple distinct services, each with its own specialized API architecture and cost calculation method.

For developers building real-time voice assistants, automated podcast narration pipelines, high-volume customer service routing, or offline video transcription engines, navigating these models is critical to keeping bills under control.

In this developer’s guide, we will analyze the pricing structures of the new Gemini TTS (Text-to-Speech) models, break down the economics of the bidirectional Gemini Live API, evaluate traditional Google Cloud Speech-to-Text (STT) Chirp costs, and perform a head-to-head comparison against OpenAI’s Realtime audio pricing.

🧮 Calculate your exact audio & speech costs: Use our AI API Pricing Calculator to estimate token usage and monthly fees for text, image, audio, and video across all major models.

1. The Google Speech AI Landscape

Google’s speech tools are split across three service platforms:

Gemini TTS API: Performs text-to-speech generation natively within the LLM model architecture. Billed per input text token and output audio token.
Gemini Live API: Handles real-time, low-latency, bidirectional voice conversations using WebSockets. Billed per input audio/text token and output audio/text token.
Google Cloud Speech-to-Text (Chirp): A specialized, heavy-duty transcription model hosted on Google Cloud Platform (GCP). Billed per minute of processed audio.
Gemini Audio Understanding: Transcription and audio analysis performed on standard Gemini Flash/Pro reasoning endpoints. Billed per audio input token and text output token.

2. Gemini TTS Models (Text-to-Speech)

Traditional TTS engines synthesize speech by stitching phonemes together, which often sounds mechanical. Gemini’s TTS models generate audio waveforms natively in the neural network’s output layer.

A. Gemini 3.1 Flash TTS (Preview)

Optimized for high-fidelity speech synthesis across 70+ languages, offering natural breathing pauses, emphasis, and context-aware pronunciation.

Model ID: gemini-3.1-flash-tts-preview

Cost Type	Price per 1 Million Tokens	Batch API (50% Off)
Input (text)	$1.00	$0.50
Output (audio)	$20.00	$10.00

📊 Audio Output tokenization Rate: Gemini TTS output translates to exactly 25 tokens per second of synthesized audio. This means 1 million output tokens yields approximately 11.1 hours of continuous audio.

Podcast Narration Cost Example:

Suppose you want to generate a 10-minute podcast narration (~6,000 words input, generating 600 seconds of audio output):

Input text tokens: ~8,000 tokens × $1.00/M = $0.008
Output audio tokens: 600 seconds × 25 tokens/sec = 15,000 tokens × $20.00/M = $0.300
Total Cost: $0.308 (~$0.31 per 10-minute episode)

B. Gemini 3.5 Flash TTS (Preview)

A highly cost-effective, low-latency speech synthesis option designed for high-velocity user interfaces.

Model ID: gemini-3.5-flash-preview-tts

Cost Type	Price per 1 Million Tokens	Batch API (50% Off)
Input (text)	$0.15	$0.075
Output (audio)	$6.00	$3.000

At just $6.00 per million output tokens, a 10-minute narration costs only $0.09—making Gemini 3.5 Flash TTS one of the most budget-friendly high-fidelity speech generation models on the market.

C. Gemini 3.5 Pro TTS (Preview)

Google’s premium voice synthesis option. It provides highly natural prosody, variable speed adjustments, and better emotional steering.

Model ID: gemini-3.5-pro-preview-tts

Cost Type	Price per 1 Million Tokens	Batch API (50% Off)
Input (text)	$1.25	$0.625
Output (audio)	$20.00	$10.000

3. Gemini Live API: Real-Time Bidirectional Voice

For building low-latency, interactive voice agents (similar to a phone call experience), developers use the Gemini Live API via WebSockets. It supports streaming both input and output audio concurrently.

Model ID: gemini-3.1-flash-live-preview

Cost Type	Price per 1 Million Tokens
Text Input	$1.00
Audio Input	$1.00
Text Output	$6.00
Audio Output	$20.00

Audio Input tokenization Rate: 32 tokens per second of raw audio.
Audio Output tokenization Rate: 25 tokens per second of raw audio.

4. Google Cloud Speech-to-Text (Chirp) vs. Gemini Audio Understanding

For transcribing long recorded files (like podcasts, call center recordings, or medical dictations), developers must choose between traditional GCP STT Chirp and Gemini Audio Understanding.

A. Google Cloud STT Chirp (Traditional)

Billed strictly by the minute:

Volume Tier	Price per Minute
First 15,000 minutes	$0.016 ($0.96 per hour)
High Volume (500K+ minutes)	$0.012 ($0.72 per hour)
Enterprise (2M+ minutes)	$0.004 ($0.24 per hour)

B. Gemini Audio Understanding (LLM-based)

Billed per input token. For Gemini 3.5 Flash, the rate is $0.50/M input tokens.

Token rate: 1 second of audio = 32 input tokens.
1 hour of audio = 115,200 tokens.
Cost: 115,200 × $0.50/1M = $0.057 per hour.

💡 The Optimization Strategy: If you only need a quick transcription or summary of an audio file, using Gemini 3.5 Flash costs just $0.057 per hour, whereas traditional Google Cloud Chirp costs $0.96 per hour. Gemini is 16.8x cheaper for basic transcription workloads. However, Chirp remains superior for massive multi-hour audio files that would exceed LLM rate limits.

5. Head-to-Head: Google Gemini Live vs. OpenAI Realtime API

The most critical pricing battle in Voice AI is between Google Gemini Live API and OpenAI Realtime API. Let’s examine the raw cost differences for real-time conversational audio processing:

Cost Metric	Google Gemini Live	OpenAI Realtime (GPT-4o)	Google Savings Ratio
Audio Input / 1M Tokens	$1.00	$32.00	32x Cheaper
Audio Output / 1M Tokens	$20.00	$64.00	3.2x Cheaper
Text Input / 1M Tokens	$1.00	$5.00	5x Cheaper
Text Output / 1M Tokens	$6.00	$20.00	3.3x Cheaper

The Startup Financial Comparison

Let’s model a startup running a customer voice assistant that handles 2,000 hours of conversational support calls monthly (averaging 50% user input talk time, 50% agent output talk time):

Conversational Metrics (per hour):
- Input Audio: 1,800 seconds (57,600 tokens)
- Output Audio: 1,800 seconds (45,000 tokens)
Monthly Token Totals (for 2,000 hours):
- Audio Inputs: 115.2 Million tokens
- Audio Outputs: 90.0 Million tokens

Monthly Operational Bills:

OpenAI Realtime API:
- Inputs: 115.2M tokens × $32.00/M = $3,686.40
- Outputs: 90.0M tokens × $64.00/M = $5,760.00
- Total Monthly Cost: $9,446.40
Google Gemini Live API:
- Inputs: 115.2M tokens × $1.00/M = $115.20
- Outputs: 90.0M tokens × $20.00/M = $1,800.00
- Total Monthly Cost: $1,915.20

🏆 Pricing Winner: Google Gemini Live is $7,531.20 cheaper per month than OpenAI Realtime. For high-volume voice applications, Google’s 32x input price advantage is a massive cost-saving factor.

FREE BOILERPLATE

Get the Gemini Live WebSocket Audio Starter Boilerplate

A production-grade Voice AI FastAPI backend and real-time HTML/JS sandbox template to jumpstart your audio agent projects.

« OpenAI API Pricing June 2026 — GPT-5.5, GPT-4.1, o3 Per-Million-Token Costs & Calculator

Google Imagen 4 & Nano Banana Pricing 2026: Midjourney API Killers? »

Professor XAI Follow ML Engineer passionate about advancing AI technologies and building intelligent systems.

Google Gemini TTS & Speech API Pricing June 2026 — Gemini 3.1 Flash, 3.5 Pro TTS & Live API Costs

1. The Google Speech AI Landscape

2. Gemini TTS Models (Text-to-Speech)

A. Gemini 3.1 Flash TTS (Preview)

Podcast Narration Cost Example:

B. Gemini 3.5 Flash TTS (Preview)

C. Gemini 3.5 Pro TTS (Preview)

3. Gemini Live API: Real-Time Bidirectional Voice

4. Google Cloud Speech-to-Text (Chirp) vs. Gemini Audio Understanding

A. Google Cloud STT Chirp (Traditional)

B. Gemini Audio Understanding (LLM-based)

5. Head-to-Head: Google Gemini Live vs. OpenAI Realtime API

The Startup Financial Comparison

Monthly Operational Bills:

Get the Gemini Live WebSocket Audio Starter Boilerplate

🧮 Quick Tools

Newsletter

Popular Categories

Google Gemini TTS & Speech API Pricing June 2026 — Gemini 3.1 Flash, 3.5 Pro TTS & Live API Costs

1. The Google Speech AI Landscape

2. Gemini TTS Models (Text-to-Speech)

A. Gemini 3.1 Flash TTS (Preview)

Podcast Narration Cost Example:

B. Gemini 3.5 Flash TTS (Preview)

C. Gemini 3.5 Pro TTS (Preview)

3. Gemini Live API: Real-Time Bidirectional Voice

4. Google Cloud Speech-to-Text (Chirp) vs. Gemini Audio Understanding

A. Google Cloud STT Chirp (Traditional)

B. Gemini Audio Understanding (LLM-based)

5. Head-to-Head: Google Gemini Live vs. OpenAI Realtime API

The Startup Financial Comparison

Monthly Operational Bills:

Related Guides

Get the Gemini Live WebSocket Audio Starter Boilerplate

🧮 Quick Tools

Newsletter

Get weekly AI insights & pricing updates delivered to your inbox

Popular Categories