Traditionally, building voice-enabled applications required developer teams to glue together multiple disconnected services. You would transcribe user speech using a Speech-to-Text (STT) model like Whisper, pass the text to a Large Language Model (LLM) to generate a response, and then convert that response back to audio using a third-party Text-to-Speech (TTS) engine.
This modular pipeline creates issues. It introduces latency, causes cascade errors (where a single transcription mistake throws off the entire response), and completely strips away the emotional nuances of human voice, like tone, pitch, and speed.
With Google Gemini native multimodal capabilities, you can interact with audio directly. Gemini processes audio inputs and outputs natively. This guide explains how to build a unified STT and TTS microservice using the google-genai SDK and FastAPI.
The Power of Native Multimodal Audio
To understand why native audio processing is a major architectural shift, consider the difference between text-based translation and direct audio translation.
Native Speech-to-Text (STT)
Instead of feeding audio into an external transcription module, Gemini accepts raw audio waveforms as primary context tokens. The model can:
- Transcribe multilingual audio.
- Identify speakers and summarize conversations.
- Understand background sounds (e.g., sirens, clicks, music) and incorporate them into its textual analysis.
Native Text-to-Speech (TTS)
When configured for audio output, Gemini does not synthesize a text string into mechanical speech using phone lists. Instead, the neural network directly generates audio waveforms in its output layers. This preserves human speech characteristics like natural breathing pauses, emphasis, and context-aware pronunciation.
Preparing the Development Environment
To begin, you will need the updated Google GenAI SDK and FastAPI. Install the required libraries inside your Python environment:
pip install google-genai fastapi uvicorn pydantic python-multipart
Ensure your Gemini API key is configured as an environment variable:
export GEMINI_API_KEY="your-api-key-here"
Implementation: Unified Audio API Service
Below is a complete, production-ready FastAPI microservice that implements two core endpoints:
/api/v1/transcribe: Accepts an uploaded audio file (MP3, WAV, etc.), uploads it to the Gemini File API, and returns an accurate text transcription./api/v1/synthesize: Accepts a text string and a target voice profile, requests native audio output from Gemini, and streams the synthesized audio file back to the client.
import os
import io
import shutil
from fastapi import FastAPI, HTTPException, UploadFile, File, Form
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from google import genai
from google.genai import types
# Initialize FastAPI application
app = FastAPI(
title="Gemini Audio API Service",
description="Unified Speech-to-Text and Text-to-Speech microservice utilizing Gemini native audio capabilities.",
version="1.0.0"
)
# Initialize the official Google GenAI Client
# It automatically picks up the GEMINI_API_KEY environment variable.
try:
client = genai.Client()
except Exception as e:
raise RuntimeError(
"Failed to initialize GenAI client. Ensure GEMINI_API_KEY is configured."
) from e
# Define available prebuilt voices for TTS
# Recommended voices: Puck, Charon, Aoede, Fenrir, Kore
SUPPORTED_VOICES = {"puck", "charon", "aoede", "fenrir", "kore"}
# ----------------------------------------------------
# 1. Text-to-Speech Request Schema
# ----------------------------------------------------
class SynthesisRequest(BaseModel):
text: str
voice: str = "Puck"
# ----------------------------------------------------
# 2. Endpoint: Speech-to-Text (Transcribe)
# ----------------------------------------------------
@app.post("/api/v1/transcribe")
async def transcribe_audio(
file: UploadFile = File(..., description="Audio file to transcribe (mp3, wav, m4a)")
):
"""
Accepts an audio file upload, transfers it to the Gemini File API,
and returns a textual transcription generated natively by the model.
"""
# Create a temporary directory locally to write the uploaded file
temp_dir = "./temp_audio"
os.makedirs(temp_dir, exist_ok=True)
temp_filepath = os.path.join(temp_dir, file.filename)
try:
# Save uploaded file to disk
with open(temp_filepath, "wb") as buffer:
shutil.copyfileobj(file.file, buffer)
# Upload the file to Gemini File API (required for larger media payloads)
print(f"Uploading {file.filename} to Gemini File API...")
uploaded_file = client.files.upload(file=temp_filepath)
# Request transcription using Gemini 2.0 Flash
print("Invoking model for native transcription...")
response = client.models.generate_content(
model="gemini-2.0-flash",
contents=[
uploaded_file,
"Provide an exact, verbatim transcription of this audio. "
"Do not summarize. Do not add commentary."
]
)
# Clean up file from Gemini cloud storage once processing is complete
client.files.delete(name=uploaded_file.name)
return {
"filename": file.filename,
"transcription": response.text.strip()
}
except Exception as e:
raise HTTPException(status_code=500, detail=f"Transcription failed: {str(e)}")
finally:
# Always clean up the temporary local file
if os.path.exists(temp_filepath):
os.remove(temp_filepath)
# ----------------------------------------------------
# 3. Endpoint: Text-to-Speech (Synthesize)
# ----------------------------------------------------
@app.post("/api/v1/synthesize")
async def synthesize_speech(request: SynthesisRequest):
"""
Accepts a text string and voice profile, requests Gemini to synthesize
native audio data, and streams the resulting audio file back to the client.
"""
# Normalize and validate voice profile
selected_voice = request.voice.lower()
if selected_voice not in SUPPORTED_VOICES:
raise HTTPException(
status_code=400,
detail=f"Unsupported voice profile. Supported: {list(SUPPORTED_VOICES)}"
)
# Capitalize first letter to match API expectations (e.g., 'Puck')
api_voice_name = selected_voice.capitalize()
try:
# Configure Gemini generation for raw audio modality
config = types.GenerateContentConfig(
response_modalities=["AUDIO"],
speech_config=types.SpeechConfig(
voice_config=types.VoiceConfig(
prebuilt_voice_config=types.PrebuiltVoiceConfig(
voice_name=api_voice_name
)
)
)
)
print(f"Synthesizing speech using voice profile: {api_voice_name}...")
response = client.models.generate_content(
model="gemini-2.0-flash",
contents=request.text,
config=config
)
# Extract binary audio data from response parts
audio_data = None
for candidate in response.candidates:
if candidate.content and candidate.content.parts:
for part in candidate.content.parts:
if part.inline_data:
audio_data = part.inline_data.data
break
if not audio_data:
raise HTTPException(
status_code=502,
detail="Inference completed, but no inline audio data was returned by the model."
)
# Stream binary audio data back to the client as an MP3 file
return StreamingResponse(
io.BytesIO(audio_data),
media_type="audio/mp3",
headers={"Content-Disposition": f"attachment; filename=synthesized_speech.mp3"}
)
except Exception as e:
raise HTTPException(status_code=500, detail=f"Speech synthesis failed: {str(e)}")
Critical Execution Best Practices
To ensure high-quality audio interactions, follow these production guidelines:
1. File Lifecycle Management
When using the client.files.upload method, files are stored temporarily in your Google Cloud Developer account.
- Media files should be deleted immediately after processing using
client.files.delete(name=uploaded_file.name)to prevent memory leaks and protect user data privacy.
2. Audio Format Support
Gemini native audio input supports common formats like WAV, MP3, AAC, and FLAC.
- For the best transcription accuracy, use raw, uncompressed formats (like 16-kHz or 48-kHz WAV files). This preserves the clear acoustic features needed for speaker identification and background analysis.
3. Voice Selection profiles
Gemini offers several voice profiles optimized for different applications:
- Puck: Energetic and casual, ideal for assistive chat interfaces.
- Charon: Clear and formal, suited for enterprise customer support.
- Kore: Warm and conversational, ideal for audio narration and voice-over scripts.
Conclusion
By implementing native audio processing directly within the model architecture, Gemini eliminates the complexity and latency of traditional STT/TTS pipelines. Developing a unified voice API requires minimal code, providing developers with high performance, reduced architectural overhead, and natural, expressive vocal output.