Building Speech-to-Text and Text-to-Speech APIs with Gemini Native Audio

21 May 2026 (Updated: 21 May 2026)

Traditionally, building voice-enabled applications required developer teams to glue together multiple disconnected services. You would transcribe user speech using a Speech-to-Text (STT) model like Whisper, pass the text to a Large Language Model (LLM) to generate a response, and then convert that response back to audio using a third-party Text-to-Speech (TTS) engine.

This modular pipeline has three drawbacks. It adds latency at every hop, it propagates errors (a single transcription mistake can derail the entire response), and it strips away the nuances of the human voice, such as tone, pitch, and pacing.

With Google Gemini's native multimodal capabilities, you can work with audio directly: the model accepts audio as input and can generate audio as output. This guide explains how to build a unified STT and TTS microservice using the google-genai SDK and FastAPI.

The Power of Native Multimodal Audio

To see why native audio processing is a significant architectural shift, consider what changes on each side of the conversation: how speech gets into the model, and how speech comes out of it.

Native Speech-to-Text (STT)

Instead of feeding audio into an external transcription module, Gemini accepts audio directly in the prompt and encodes it into tokens alongside any text. The model can:

Transcribe multilingual audio.
Identify speakers and summarize conversations.
Understand background sounds (e.g., sirens, clicks, music) and incorporate them into its textual analysis.
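
Because audio enters the model as tokens, clip length translates directly into context usage. The sketch below estimates that cost for a WAV clip. It assumes the rate of 32 tokens per second of audio cited in the Gemini API documentation at the time of writing; the helper names are illustrative and not part of any SDK.

```python
import io
import wave

# Assumption: the Gemini API docs cite 32 tokens per second of audio.
# Verify against the current documentation before relying on this number.
AUDIO_TOKENS_PER_SECOND = 32


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory PCM WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate the context tokens consumed by a clip of the given length."""
    return round(duration_seconds * AUDIO_TOKENS_PER_SECOND)


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit WAV of silence, used here only as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    clip = make_silent_wav(60)
    print(estimate_audio_tokens(wav_duration_seconds(clip)))  # 1920
```

Under that assumption, a one-minute clip costs roughly 1,920 tokens, so an hour-long recording fits in the context window but is far from free.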

Native Text-to-Speech (TTS)

When configured for audio output, Gemini does not hand a text string to a separate synthesizer that assembles speech from phoneme sequences. Instead, the model generates the audio itself, which helps preserve human speech characteristics such as natural pauses, emphasis, and context-aware pronunciation.

Preparing the Development Environment

To begin, you will need the updated Google GenAI SDK and FastAPI. Install the required libraries inside your Python environment:

pip install google-genai fastapi uvicorn pydantic python-multipart

Ensure your Gemini API key is configured as an environment variable:

export GEMINI_API_KEY="your-api-key-here"
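
A missing key otherwise surfaces only when the client is constructed or the first request is made. As a minimal sketch, a fail-fast check at startup could look like the following; `find_api_key` is a hypothetical helper, not an SDK function.

```python
import os
from typing import Mapping, Optional


def find_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Gemini API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set. Export it before starting the service."
        )
    return key
```

Calling `find_api_key()` once at import time turns a confusing runtime failure into an immediate, readable startup error.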

Implementation: Unified Audio API Service

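Below is a compact FastAPI microservice that implements two core endpoints. Treat it as a starting point rather than a finished product: authentication, rate limiting, and upload size limits are left out.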

/api/v1/transcribe: Accepts an uploaded audio file (MP3, WAV, etc.), uploads it to the Gemini File API, and returns the model's text transcription.
/api/v1/synthesize: Accepts a text string and a target voice profile, requests native audio output from Gemini, and streams the synthesized audio file back to the client.

import os
import io
import shutil
from fastapi import FastAPI, HTTPException, UploadFile, File, Form
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from google import genai
from google.genai import types

# Initialize FastAPI application
app = FastAPI(
    title="Gemini Audio API Service",
    description="Unified Speech-to-Text and Text-to-Speech microservice utilizing Gemini native audio capabilities.",
    version="1.0.0"
)

# Initialize the official Google GenAI Client
# It automatically picks up the GEMINI_API_KEY environment variable.
try:
    client = genai.Client()
except Exception as e:
    raise RuntimeError(
        "Failed to initialize GenAI client. Ensure GEMINI_API_KEY is configured."
    ) from e

# Define available prebuilt voices for TTS
# Recommended voices: Puck, Charon, Aoede, Fenrir, Kore
SUPPORTED_VOICES = {"puck", "charon", "aoede", "fenrir", "kore"}

# ----------------------------------------------------
# 1. Text-to-Speech Request Schema
# ----------------------------------------------------
class SynthesisRequest(BaseModel):
    text: str
    voice: str = "Puck"

# ----------------------------------------------------
# 2. Endpoint: Speech-to-Text (Transcribe)
# ----------------------------------------------------
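# Declared with plain `def` so FastAPI runs the blocking SDK and file calls
# in its threadpool instead of stalling the event loop.
@app.post("/api/v1/transcribe")
def transcribe_audio(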
    file: UploadFile = File(..., description="Audio file to transcribe (mp3, wav, m4a)")
):
    """
    Accepts an audio file upload, transfers it to the Gemini File API,
    and returns a textual transcription generated natively by the model.
    """
    # Create a temporary directory locally to write the uploaded file
    temp_dir = "./temp_audio"
    os.makedirs(temp_dir, exist_ok=True)
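    # Keep only the base name so a crafted filename cannot escape temp_dir
    safe_name = os.path.basename(file.filename or "")
    if safe_name in ("", ".", ".."):
        safe_name = "upload.bin"
    temp_filepath = os.path.join(temp_dir, safe_name)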

    uploaded_file = None
    try:
        # Save uploaded file to disk
        with open(temp_filepath, "wb") as buffer:
            shutil.copyfileobj(file.file, buffer)

        # Upload the file to Gemini File API (required for larger media payloads)
        print(f"Uploading {file.filename} to Gemini File API...")
        uploaded_file = client.files.upload(file=temp_filepath)

        # Request transcription using Gemini 2.0 Flash
        print("Invoking model for native transcription...")
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                uploaded_file,
                "Provide an exact, verbatim transcription of this audio. "
                "Do not summarize. Do not add commentary."
            ]
        )

        return {
            "filename": file.filename,
            # response.text can be None if the model returned no text parts
            "transcription": (response.text or "").strip()
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Transcription failed: {str(e)}")

    finally:
        # Delete the remote copy even when transcription fails (best effort)
        if uploaded_file is not None:
            try:
                client.files.delete(name=uploaded_file.name)
            except Exception as cleanup_error:
                print(f"Could not delete remote file: {cleanup_error}")

        # Always clean up the temporary local file
        if os.path.isfile(temp_filepath):
            os.remove(temp_filepath)

# ----------------------------------------------------
# 3. Endpoint: Text-to-Speech (Synthesize)
# ----------------------------------------------------
# Plain `def` for the same reason as above: the SDK call below is blocking.
@app.post("/api/v1/synthesize")
def synthesize_speech(request: SynthesisRequest):
    """
    Accepts a text string and voice profile, requests Gemini to synthesize
    native audio data, and streams the resulting audio file back to the client.
    """
    # Normalize and validate voice profile
    selected_voice = request.voice.lower()
    if selected_voice not in SUPPORTED_VOICES:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported voice profile. Supported: {list(SUPPORTED_VOICES)}"
        )

    # Capitalize first letter to match API expectations (e.g., 'Puck')
    api_voice_name = selected_voice.capitalize()

    try:
        # Configure Gemini generation for raw audio modality
        config = types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=api_voice_name
                    )
                )
            )
        )

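        print(f"Synthesizing speech using voice profile: {api_voice_name}...")
        # Audio output through generate_content requires a TTS-capable model.
        # The ID below is the preview name at the time of writing and may
        # change, so check the current model list before deploying.
        response = client.models.generate_content(
            model="gemini-2.5-flash-preview-tts",
            contents=request.text,
            config=config
        )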

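        # Extract binary audio data from the first part that carries inline data
        audio_data = None
        for candidate in response.candidates or []:
            if candidate.content and candidate.content.parts:
                for part in candidate.content.parts:
                    if part.inline_data and part.inline_data.data:
                        audio_data = part.inline_data.data
                        break
            if audio_data:
                break

        if not audio_data:
            raise HTTPException(
                status_code=502,
                detail="Inference completed, but no inline audio data was returned by the model."
            )

        # Gemini returns raw PCM samples rather than MP3. Wrap them in a WAV
        # container, assuming the documented 16-bit mono output at 24 kHz.
        import wave  # local import keeps this step self-contained
        wav_buffer = io.BytesIO()
        with wave.open(wav_buffer, "wb") as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)
            wav_file.setframerate(24000)
            wav_file.writeframes(audio_data)
        wav_buffer.seek(0)

        # Stream the WAV data back to the client
        return StreamingResponse(
            wav_buffer,
            media_type="audio/wav",
            headers={"Content-Disposition": "attachment; filename=synthesized_speech.wav"}
        )

    except HTTPException:
        # Preserve deliberate HTTP errors (such as the 502 above)
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Speech synthesis failed: {str(e)}")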
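
To try the synthesis endpoint from a client, a small standard-library sketch such as the one below may help. It assumes the service is running locally on uvicorn's default port (for example via `uvicorn main:app`, if the code is saved as main.py); `BASE_URL` and both function names are illustrative.

```python
import json
import urllib.request

# Assumption: the service listens on uvicorn's default host and port.
BASE_URL = "http://127.0.0.1:8000"


def build_synthesize_request(text: str, voice: str = "Puck") -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/synthesize."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, voice: str, out_path: str) -> int:
    """Send the request and save the returned audio bytes; returns bytes written."""
    with urllib.request.urlopen(build_synthesize_request(text, voice)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

The JSON body mirrors the SynthesisRequest schema: a required `text` field and an optional `voice` field that defaults to "Puck".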

Critical Execution Best Practices

To ensure high-quality audio interactions, follow these production guidelines:

1. File Lifecycle Management

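When you call client.files.upload, the file is stored by the Gemini File API under the project tied to your API key. At the time of writing, uploaded files expire automatically after 48 hours.

Even so, delete media files as soon as processing finishes, using client.files.delete(name=uploaded_file.name), to stay within your storage quota and protect user privacy. Run the deletion on failure paths as well, so that a failed request does not leave user audio behind.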
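
One way to make the cleanup hard to forget is a context manager that pairs every upload with a delete. This is a sketch, not an SDK feature: `temporary_upload` is a hypothetical helper, and it only assumes that the client exposes `files.upload(file=...)` and `files.delete(name=...)` as used in the service code.

```python
from contextlib import contextmanager


@contextmanager
def temporary_upload(client, path: str):
    """Upload a file, yield its handle, and delete the remote copy on exit."""
    uploaded = client.files.upload(file=path)
    try:
        yield uploaded
    finally:
        try:
            client.files.delete(name=uploaded.name)
        except Exception as exc:
            # Best-effort cleanup: report the failure without masking
            # any error raised inside the with-block.
            print(f"Warning: could not delete {uploaded.name}: {exc}")
```

With this in place, the transcription call can sit inside `with temporary_upload(client, temp_filepath) as uploaded_file:` and the remote file is removed whether the request succeeds or raises.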

2. Audio Format Support

Gemini native audio input supports common formats like WAV, MP3, AAC, and FLAC.

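For the best transcription accuracy, start from a clean, lossless source where file size allows (such as 16-kHz or 48-kHz WAV files) rather than a heavily compressed one. Compression artifacts can blur the acoustic detail needed for speaker identification and background analysis, although very high sample rates are unlikely to add much because the service resamples audio internally.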
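
Rejecting unsupported files before uploading them saves a round trip. The sketch below maps the four extensions named above to the MIME types the Gemini API documentation lists for them; the table is deliberately limited to those four, and `audio_mime_type` is an illustrative helper.

```python
import os

# Assumption: MIME strings as listed in the Gemini API audio documentation.
# Only the formats named in this section are included.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aac": "audio/aac",
    ".flac": "audio/flac",
}


def audio_mime_type(filename: str) -> str:
    """Return the MIME type for a supported audio file, or raise ValueError."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio extension: {ext or '(none)'}")
    return AUDIO_MIME_TYPES[ext]
```

In the transcribe endpoint, a ValueError from this check could be turned into an HTTP 400 response before any file is written to disk.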

3. Voice Selection Profiles

Gemini offers several prebuilt voices. The characterizations below are subjective impressions, so audition each voice with your own content before committing to one:

Puck: Energetic and casual, ideal for assistive chat interfaces.
Charon: Clear and formal, suited for enterprise customer support.
Kore: Warm and conversational, ideal for audio narration and voice-over scripts.

Conclusion

By handling audio inside the model itself, Gemini removes much of the complexity and latency of traditional STT/TTS pipelines. A unified voice API takes relatively little code and gives developers fewer moving parts to operate, along with natural, expressive vocal output.

Professor XAI: ML Engineer passionate about advancing AI technologies and building intelligent systems.
