The audio and speech AI landscape has undergone a radical transformation with the arrival of large-scale foundation models like Whisper and neural TTS systems capable of near-human voice synthesis. Building production voice agents now requires orchestrating ASR, language understanding, TTS, and real-time streaming infrastructure into coherent pipelines. This article provides an engineering-depth exploration of modern speech AI systems, from model architectures to production deployment patterns.
OpenAI's Whisper (Radford et al., 2023) is an encoder-decoder Transformer trained on 680,000 hours of multilingual, weakly-supervised audio data from the internet. Its architecture follows a straightforward sequence-to-sequence design:
Audio Preprocessing: Raw audio is resampled to 16kHz, converted to an 80-channel log-Mel spectrogram (128 channels for large-v3) with a 25ms window and 10ms hop size, then normalized. The spectrogram is processed in 30-second chunks.
Encoder: A small CNN stem (two 1D convolution layers with GELU activation) processes the Mel spectrogram, followed by sinusoidal positional embeddings and standard Transformer encoder blocks. The encoder converts 30 seconds of audio into a sequence of 1500 embeddings (one per 20ms frame).
Decoder: A standard Transformer decoder with learned positional embeddings autoregressively generates output tokens. Special tokens control task behavior:
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> Hello, world <|endoftext|>
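The preprocessing and token-controlled decoding can be exercised directly through the lower-level API of the openai-whisper package. A minimal sketch (model size and file path are placeholders):

import whisper

model = whisper.load_model("base")

# Load, resample to 16kHz, and pad/trim to a 30-second window
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Log-Mel spectrogram with the channel count the checkpoint expects
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Language identification and task selection map onto the special-token prefix
_, probs = model.detect_language(mel)
options = whisper.DecodingOptions(language=max(probs, key=probs.get), task="transcribe")
print(whisper.decode(model, mel, options).text)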
The genius of Whisper lies not in architectural novelty but in training data scale and multitask formulation. The same model handles transcription, translation, language identification, and timestamp prediction, selected by special tokens.
import whisper
model = whisper.load_model("large-v3")
# Basic transcription
result = model.transcribe("audio.mp3")
print(result["text"])
# With word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
for word in segment["words"]:
print(f"[{word['start']:.2f} - {word['end']:.2f}] {word['word']}")
# Translation (any language -> English)
result = model.transcribe("german_audio.mp3", task="translate")
| Model | Parameters | English WER | Multilingual WER | Speed Factor |
|---|---|---|---|---|
| tiny | 39M | 7.6% | 14.2% | 32x |
| base | 74M | 5.0% | 10.6% | 16x |
| small | 244M | 3.4% | 8.1% | 6x |
| medium | 769M | 2.9% | 6.5% | 2x |
| large-v3 | 1.55B | 2.5% | 5.2% | 1x |
| large-v3-turbo | 809M | 2.5% | 5.3% | 3x |
Word Error Rate (WER) on LibriSpeech test-clean. Speed factors are approximate relative to real-time on GPU.
Whisper v3-turbo deserves special mention: released by OpenAI in late 2024, it uses a pruned decoder (4 layers instead of 32) while retaining the full large-v3 encoder. The result is near-identical accuracy to large-v3 at roughly 3x the speed and half the parameter count. For most production workloads, v3-turbo is the recommended starting point: it offers the best accuracy-to-latency ratio in the Whisper family.
Tip: For new projects, start with large-v3-turbo unless you have a specific reason to use the full large-v3. The English WER is essentially identical (2.5% for both), and the multilingual gap (5.3% vs 5.2%) is negligible for most domains.
Fine-tuning Whisper for domain-specific audio (medical dictation, legal proceedings, accented speech) follows standard seq2seq fine-tuning with some speech-specific considerations:
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import torch
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
# Freeze encoder for efficiency (optional - helps with small datasets)
for param in model.model.encoder.parameters():
param.requires_grad = False
training_args = Seq2SeqTrainingArguments(
output_dir="./whisper-finetuned",
per_device_train_batch_size=8,
gradient_accumulation_steps=2,
learning_rate=1e-5,
warmup_steps=500,
max_steps=5000,
fp16=True,
predict_with_generate=True,
generation_max_length=225,
)
# Data collator handles padding and feature extraction
# Dataset should provide audio arrays and transcription text
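# One common collator implementation, following the standard Hugging Face
# Whisper fine-tuning recipe (assumed here; adapt to your dataset format).
# Audio features and token labels are padded separately, and label padding
# is replaced with -100 so it is ignored by the loss.
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Pad the log-Mel input features to a uniform length
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the tokenized transcripts and mask padding out of the loss
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Drop the leading BOS token if the tokenizer already prepended it
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)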
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=data_collator,
tokenizer=processor.feature_extractor,
)
trainer.train()
Key fine-tuning tips: keep the learning rate low (around 1e-5) to avoid catastrophic forgetting of the pretrained weights, freeze the encoder when the dataset is small, evaluate with WER on a held-out set rather than loss alone, and make sure training audio matches Whisper's expected 16kHz input.
CTranslate2-based Faster Whisper achieves up to a 4x speedup over the reference implementation through INT8 quantization, batched beam search, and optimized attention kernels:
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
segments, info = model.transcribe(
"audio.mp3",
beam_size=5,
vad_filter=True, # Voice Activity Detection to skip silence
vad_parameters=dict(
min_silence_duration_ms=500,
speech_pad_ms=400,
),
)
for segment in segments:
print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
WhisperX (Bain et al., 2023) adds forced alignment for precise word-level timestamps and speaker diarization, making it suitable for production transcription pipelines.
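A typical WhisperX pipeline chains batched transcription, forced alignment, and diarization. The sketch below follows the WhisperX README; exact arguments and module paths can differ between versions:

import whisperx

device = "cuda"
audio = whisperx.load_audio("meeting.mp3")

# 1. Transcribe with the batched CTranslate2 backend
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Forced alignment for precise word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels to aligned words
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_token", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)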
Modern TTS has moved far beyond concatenative synthesis. Current approaches fall into several categories:
Autoregressive token-based: Models like VALL-E (Wang et al., 2023) and Bark treat speech synthesis as a language modeling problem over discrete audio tokens (from neural codecs like EnCodec). They generate audio codes token-by-token, enabling zero-shot voice cloning from a short reference.
Non-autoregressive / Flow-based: Models like VITS (Kim et al., 2021) and its successors use variational inference and normalizing flows for parallel synthesis, achieving much faster inference. StyleTTS 2 combines style diffusion with adversarial training for state-of-the-art naturalness.
Diffusion-based: Models like Grad-TTS and NaturalSpeech 2/3 apply diffusion processes to speech generation, producing highly natural audio at the cost of iterative denoising steps.
ElevenLabs represents the state of the art in commercial TTS, offering multilingual synthesis, instant voice cloning from short samples, fine-grained voice settings, and low-latency streaming output:
from elevenlabs import ElevenLabs
client = ElevenLabs(api_key="your-key")
# Standard TTS
audio = client.text_to_speech.convert(
voice_id="21m00Tcm4TlvDq8ikWAM", # Rachel voice
text="Hello, this is a test of the text to speech system.",
model_id="eleven_multilingual_v2",
voice_settings={
"stability": 0.5,
"similarity_boost": 0.75,
"style": 0.0,
"use_speaker_boost": True,
},
)
# Streaming TTS for low-latency applications
audio_stream = client.text_to_speech.convert_as_stream(
voice_id="21m00Tcm4TlvDq8ikWAM",
text="Streaming reduces time to first audio byte.",
model_id="eleven_turbo_v2_5",
)
Bark (Suno AI): A GPT-style model that generates audio from text prompts, capable of producing speech, music, and sound effects. It uses a three-stage architecture: text-to-semantic tokens, semantic-to-coarse acoustic tokens, coarse-to-fine acoustic tokens.
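Basic Bark usage is compact; a brief sketch following its README (the bracketed tags are Bark's own convention for non-speech sounds):

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Downloads and caches all three stages on first run
preload_models()

audio_array = generate_audio("Hello, my name is Suno. [laughs] And, uh, I like pizza.")
write_wav("bark_out.wav", SAMPLE_RATE, audio_array)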
Coqui XTTS: Coqui, the company behind the popular TTS library, shut down in late 2023. However, the XTTS v2 model and the broader Coqui TTS library remain community-maintained on GitHub and continue to be widely used. XTTS v2 supports cross-lingual voice cloning from a 6-second reference:
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
# Zero-shot voice cloning
tts.tts_to_file(
text="This is a cloned voice speaking.",
speaker_wav="reference_audio.wav", # 6+ seconds of reference
language="en",
file_path="output.wav",
)
Note that while the library still works and the community has kept it functional, active development has slowed. For new projects, evaluate alternatives like F5-TTS, Parler-TTS, or commercial APIs.
Piper: A lightweight, fast TTS engine optimized for edge deployment (Raspberry Pi, mobile). Uses VITS architecture and can run in real-time on CPU. Ideal for offline voice assistants.
Neural audio codecs like Meta's EnCodec and Google's SoundStream are foundational to modern TTS. They compress audio into discrete tokens that can be modeled with language model architectures:
Audio (24kHz) -> EnCodec Encoder -> 8 codebook streams at 75Hz
-> Each codebook: 1024 entries
-> Total: ~6kbps at high quality
Codebook hierarchy:
- Codebook 1: Coarse acoustic structure (pitch, phonemes)
- Codebook 2-4: Fine acoustic details (timbre, prosody)
- Codebook 5-8: High-frequency details (breathiness, micro-prosody)
This tokenization enables treating speech generation as a next-token prediction problem, unifying the training framework with text LLMs.
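A minimal sketch of this tokenization using the encodec package (file name and bandwidth are placeholders); at 6kbps the 24kHz model emits 8 parallel codebook streams at 75Hz:

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6kbps -> 8 codebooks at 75Hz

wav, sr = torchaudio.load("speech.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))

# Discrete tokens a language model can predict: (batch, n_codebooks, n_frames)
codes = torch.cat([frame_codes for frame_codes, _ in encoded_frames], dim=-1)
print(codes.shape)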
A modern voice agent combines ASR, LLM, and TTS in a pipeline optimized for low latency:
User Speech (Microphone/WebRTC) -> ASR -> Text -> LLM -> Response Text -> TTS -> Audio Response (Speaker/WebRTC)
Latency budget (target < 1 second):
- ASR: 200-400ms (streaming) or 500-1500ms (batch)
- LLM: 200-500ms (time to first token)
- TTS: 100-300ms (time to first audio chunk)
- Network/buffering: 100-200ms
Batch ASR (transcribing the complete utterance after the user stops speaking) introduces unacceptable latency for conversational agents. Streaming ASR processes audio incrementally:
import asyncio
import json
import os

import websockets

API_KEY = os.environ["DEEPGRAM_API_KEY"]  # Deepgram API key for the streaming endpoint
class StreamingASR:
"""WebSocket-based streaming ASR client"""
def __init__(self, model="nova-2"):
self.model = model
self.buffer = []
self.final_transcript = ""
async def stream_audio(self, audio_source):
async with websockets.connect(
f"wss://api.deepgram.com/v1/listen"
f"?model={self.model}&smart_format=true"
f"&interim_results=true&endpointing=300",
extra_headers={"Authorization": f"Token {API_KEY}"},
) as ws:
# Send audio chunks as they arrive
async def send_audio():
async for chunk in audio_source:
await ws.send(chunk)
await ws.send(json.dumps({"type": "CloseStream"}))
# Receive transcription results
async def receive_results():
async for msg in ws:
result = json.loads(msg)
transcript = result["channel"]["alternatives"][0]["transcript"]
if result["is_final"]:
self.final_transcript += transcript + " "
yield {"type": "final", "text": transcript}
else:
yield {"type": "interim", "text": transcript}
send_task = asyncio.create_task(send_audio())
async for result in receive_results():
yield result
await send_task
Key streaming ASR providers and their characteristics:
| Provider | Strengths | Notes |
|---|---|---|
| Deepgram Nova-2 | Fast, accurate | Good endpointing (~300ms latency) |
| Google Cloud STT v2 | Strong multilingual | Chirp model |
| AssemblyAI Universal-2 | Balanced | Good accuracy/latency tradeoff |
| Whisper-streaming | Open-source | Streaming wrapper around Whisper |
One of the hardest problems in voice agents is detecting when the user has finished speaking (endpointing). Approaches include fixed silence thresholds from a voice activity detector (VAD), prosodic cues such as falling pitch, and semantic completeness prediction on the partial transcript, as in the hybrid sketch below.
Note: Silence-based endpointing alone produces a poor experience; users pause mid-sentence all the time. Hybrid approaches that combine VAD, silence duration, and semantic completeness are more reliable in practice.
class SmartEndpointer:
    """Hybrid endpointer combining VAD silence duration with semantic completeness.

    SileroVAD and load_completion_predictor are placeholders for a VAD wrapper
    and a sentence-completeness classifier supplied by the application.
    """

    def __init__(self):
        self.vad = SileroVAD()
        self.silence_threshold_ms = 700
        self.semantic_model = load_completion_predictor()
def should_trigger(self, audio_buffer, partial_transcript):
silence_duration = self.vad.get_silence_duration(audio_buffer)
# Quick trigger: long silence
if silence_duration > 1500:
return True
# Medium silence + complete sentence
if silence_duration > self.silence_threshold_ms:
completeness = self.semantic_model.predict(partial_transcript)
if completeness > 0.8:
return True
return False
Speaker diarization answers "who spoke when" - essential for multi-party conversations:
pyannote.audio (Bredin et al., 2023) is the leading open-source diarization toolkit:
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token="hf_token",
)
# Process audio file
diarization = pipeline("meeting.wav")
# Iterate over speaker turns
for turn, _, speaker in diarization.itertracks(yield_label=True):
print(f"[{turn.start:.1f}s - {turn.end:.1f}s] {speaker}")
# [0.2s - 3.1s] SPEAKER_00
# [3.5s - 7.8s] SPEAKER_01
# [8.1s - 12.4s] SPEAKER_00
Modern diarization pipelines combine voice activity detection, speaker embedding extraction (e.g., ECAPA-TDNN or x-vector models), and clustering of the embeddings into speaker identities.
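To turn diarization output into a speaker-attributed transcript, a common post-processing step assigns each ASR word to the speaker turn that overlaps it. A minimal sketch, assuming the word-timestamp and speaker-turn formats shown earlier:

from typing import Dict, List

def assign_speakers(words: List[Dict], turns: List[Dict]) -> List[Dict]:
    """words: [{"word", "start", "end"}, ...]; turns: [{"speaker", "start", "end"}, ...]"""
    attributed = []
    for w in words:
        midpoint = (w["start"] + w["end"]) / 2
        # Pick the speaker whose turn contains the word's midpoint
        speaker = next(
            (t["speaker"] for t in turns if t["start"] <= midpoint <= t["end"]),
            "UNKNOWN",
        )
        attributed.append({**w, "speaker": speaker})
    return attributed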
WebRTC provides the transport layer for browser-based voice agents. The architecture involves capturing microphone audio with getUserMedia, negotiating a peer connection through an SDP offer/answer exchange, and carrying transcripts and agent events over a data channel:
// Client-side WebRTC setup for voice agent
class VoiceAgentClient {
constructor(serverUrl) {
this.pc = new RTCPeerConnection({
iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
});
this.serverUrl = serverUrl;
}
async start() {
// Get microphone access
const stream = await navigator.mediaDevices.getUserMedia({
audio: {
echoCancellation: true,
noiseSuppression: true,
autoGainControl: true,
sampleRate: 16000,
}
});
// Add audio track to peer connection
stream.getAudioTracks().forEach(track => {
this.pc.addTrack(track, stream);
});
// Data channel for transcripts and metadata
this.dataChannel = this.pc.createDataChannel('agent', {
ordered: true,
});
this.dataChannel.onmessage = (event) => {
const msg = JSON.parse(event.data);
if (msg.type === 'transcript') {
this.onTranscript(msg.text, msg.is_final);
} else if (msg.type === 'agent_response') {
this.onAgentResponse(msg.text);
}
};
// Create and send offer
const offer = await this.pc.createOffer();
await this.pc.setLocalDescription(offer);
const response = await fetch(`${this.serverUrl}/offer`, {
method: 'POST',
body: JSON.stringify({ sdp: offer.sdp, type: offer.type }),
});
const answer = await response.json();
await this.pc.setRemoteDescription(answer);
}
}
OpenAI's Realtime API provides a WebSocket-based interface for speech-to-speech interaction, where the model processes audio directly without a separate ASR step:
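A minimal sketch of a Realtime API session over a raw WebSocket is shown below; the event names follow the API as first released in late 2024 and may evolve, so treat the details as illustrative rather than definitive:

import asyncio
import base64
import json
import os

import websockets

async def realtime_reply(pcm16_audio: bytes):
    """Stream one user utterance in, yield the model's spoken reply as audio chunks."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Speech in / speech out, with server-side turn detection
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"modalities": ["audio", "text"], "turn_detection": {"type": "server_vad"}},
        }))
        # Append microphone audio to the input buffer
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        # Collect audio deltas from the model's spoken response
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                yield base64.b64decode(event["delta"])
            elif event["type"] == "response.done":
                break

For teams composing their own ASR, LLM, and TTS components instead of using speech-to-speech models, open-source frameworks such as LiveKit Agents handle the orchestration: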
# LiveKit-based voice agent (popular open-source framework)
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, openai, silero
async def entrypoint(ctx: JobContext):
await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
assistant = VoiceAssistant(
vad=silero.VAD.load(),
stt=deepgram.STT(model="nova-2"),
llm=openai.LLM(model="gpt-4o"),
tts=openai.TTS(voice="alloy"),
# Interrupt handling
allow_interruptions=True,
interrupt_speech_duration=0.5,
)
assistant.start(ctx.room)
await assistant.say("Hello! How can I help you today?")
if __name__ == "__main__":
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
Models like OpenAI's GPT-4o and Google's Gemini can understand non-speech audio - music, environmental sounds, and audio events. Audio Language Models (ALMs) are emerging that can answer questions about audio content, caption soundscapes, and reason jointly over speech and non-speech audio.
CLAP (Contrastive Language-Audio Pretraining) by LAION applies the CLIP paradigm to audio-text pairs, creating a shared embedding space for audio retrieval and classification.
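CLAP's shared audio-text embedding space supports zero-shot audio classification by scoring audio against arbitrary text labels. A brief sketch via the transformers pipeline (checkpoint name, file name, and labels are illustrative):

from transformers import pipeline

classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)
result = classifier(
    "street_recording.wav",
    candidate_labels=["dog barking", "car horn", "rain", "human speech"],
)
print(result[0])  # highest-scoring label with its similarity score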
Music AI has seen parallel advances, with text-to-music models (such as Meta's MusicGen) generating full musical passages from text prompts.
Voice cloning capabilities raise serious ethical concerns. Production systems should implement consent verification, watermarking of generated audio, and audit logging, as sketched below:
from datetime import datetime

class VoiceCloningSafeguards:
    """Illustrative safeguards wrapper. ConsentDatabase, DeepfakeDetector,
    AuditLog, and the TTS model are application-supplied components."""

    def __init__(self, tts_model):
        self.consent_db = ConsentDatabase()
        self.deepfake_detector = DeepfakeDetector()
        self.audit_log = AuditLog()
        self.tts_model = tts_model
def clone_voice(self, reference_audio, speaker_id, consent_proof):
# 1. Verify explicit consent
if not self.consent_db.verify_consent(speaker_id, consent_proof):
raise ConsentError("Explicit consent required for voice cloning")
# 2. Add audio watermark to all generated speech
cloned_voice = self.tts_model.clone(reference_audio)
cloned_voice.enable_watermarking(
method="audioseal", # Meta's AudioSeal
payload=f"synthetic:{speaker_id}:{timestamp}",
)
# 3. Log all generations for audit
self.audit_log.record(
speaker_id=speaker_id,
timestamp=datetime.now(),
text_generated=None, # Logged separately per request
)
return cloned_voice
AudioSeal (San Roman et al., 2024) from Meta provides proactive watermarking that embeds imperceptible identifiers in generated audio, enabling detection of synthetic speech even after compression or editing.
Note: Regulatory requirements around synthetic voice are evolving rapidly. Build consent and watermarking infrastructure early; retrofitting these controls into a production system is significantly harder than designing them in from the start.
The end-to-end latency of a voice agent is the sum of its pipeline stages. Key optimization strategies: stream at every stage, flush LLM output to TTS at sentence boundaries for natural prosody, and run the LLM and TTS stages concurrently:
async def streaming_voice_pipeline(audio_stream):
    """Process speech with minimum latency by streaming at every stage.

    streaming_asr, streaming_llm, and streaming_tts are placeholders for
    provider-specific async generators.
    """
# Stage 1: Streaming ASR
transcript = ""
async for partial in streaming_asr(audio_stream):
if partial.is_final:
transcript = partial.text
break
# Stage 2: Streaming LLM - start TTS as tokens arrive
tts_input_queue = asyncio.Queue()
async def llm_to_tts():
sentence_buffer = ""
async for token in streaming_llm(transcript):
sentence_buffer += token
# Flush to TTS at sentence boundaries for natural prosody
if token in ".!?":
await tts_input_queue.put(sentence_buffer)
sentence_buffer = ""
if sentence_buffer:
await tts_input_queue.put(sentence_buffer)
await tts_input_queue.put(None) # Signal completion
# Stage 3: Streaming TTS - play audio as it's generated
async def tts_to_audio():
while True:
text = await tts_input_queue.get()
if text is None:
break
async for audio_chunk in streaming_tts(text):
yield audio_chunk
# Run LLM and TTS concurrently
asyncio.create_task(llm_to_tts())
async for audio in tts_to_audio():
yield audio
Traditional voice agents chain three separate systems: ASR transcribes audio to text, an LLM generates a text response, and TTS synthesizes audio output. Each handoff loses information (prosody, tone, emphasis, non-verbal cues) and adds latency. Speech-to-speech models collapse this pipeline into a single model that reasons directly over audio representations.
GPT-4o Audio: OpenAI's GPT-4o natively accepts and produces audio tokens alongside text tokens. Rather than transcribing speech to text internally, the model operates on audio token representations derived from a neural codec, preserving prosodic and paralinguistic information. This enables capabilities impossible with pipeline architectures: the model can sing, laugh, whisper, and modulate emotion in its responses based on the emotional tone of the input. The Realtime API exposes this capability via WebSocket connections with sub-300ms response latency.
Gemini Live: Google's Gemini models support real-time bidirectional audio conversation. Gemini processes audio natively within its multimodal architecture, enabling it to understand tone, pacing, and non-verbal audio cues. Gemini Live specifically targets always-on conversational interaction, supporting extended multi-turn dialogues with persistent context.
Moshi (Kyutai, 2024): An open-source speech-to-speech model that processes full-duplex audio; it can listen and speak simultaneously, much like a human in natural conversation. Moshi uses a dual-stream architecture where one stream models the user's speech and another models the system's speech, with cross-attention between them. The model generates audio tokens from a Mimi neural codec (a derivative of EnCodec) at 12.5 Hz, enabling real-time streaming inference. Moshi represents a significant milestone as the first fully open-source end-to-end speech-to-speech model.
End-to-end speech models generally follow one of two patterns:
Audio token LLMs: Extend a text LLM's vocabulary with discrete audio tokens from a neural codec (EnCodec, SoundStream, Mimi). The model generates interleaved text and audio tokens autoregressively. VALL-E, SpeechGPT, and GPT-4o follow this approach. The advantage is leveraging pretrained language model capabilities; the challenge is the long sequence lengths audio tokens require (a codec frame rate of 12.5-75 Hz multiplied across several codebook levels yields on the order of 100-600 tokens per second of audio).
Parallel audio-text generation: Models like Moshi and SpiRit-LM generate audio and text tokens in parallel streams rather than interleaving them. This reduces sequence length and allows the model to "think in text" while simultaneously producing natural audio output.
Natural human conversation involves much more than alternating monologues. Speakers overlap, interrupt, backchannel ("mm-hmm," "right"), and use prosodic cues to signal turn-taking intentions. Conventional voice agents handle none of this well: they wait for silence, process the utterance, and respond, producing an unnatural call-center-like interaction.
Note: The gap between pipeline-based voice agents and full-duplex speech models is most noticeable in user perception. Even at the same factual quality, full-duplex systems feel dramatically more human.
Full-duplex models listen and generate simultaneously, enabling natural conversational behaviors: backchanneling while the user is still speaking, yielding gracefully when interrupted, and handling overlapping speech.
Moshi's dual-stream architecture is the most complete open-source implementation of full-duplex conversation. The model maintains separate "inner monologue" text streams for both speakers, with the audio generation conditioned on these text representations.
End-to-end speech models fundamentally change the latency equation: there is no separate ASR or TTS stage to budget for, and a single model can begin emitting audio within a few hundred milliseconds of the user's turn ending, as with the Realtime API's sub-300ms responses.
For a broader treatment of designing conversational interactions around these latency characteristics, see Article 52 on conversational AI.
Speech carries emotional information through prosodic features (pitch contour, speaking rate, energy, voice quality) independent of lexical content. Emotion recognition from speech (Speech Emotion Recognition, or SER) has moved from hand-crafted feature extraction (MFCCs, jitter, shimmer) to end-to-end learned representations, typically obtained by fine-tuning self-supervised encoders such as wav2vec 2.0 or HuBERT on labeled emotion corpora.
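A brief sketch of SER with a fine-tuned wav2vec 2.0 checkpoint via the transformers audio-classification pipeline; the model id is one publicly available example (IEMOCAP-style labels) and the file name is a placeholder:

from transformers import pipeline

ser = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")
predictions = ser("customer_call.wav", top_k=4)
for p in predictions:
    print(f"{p['label']}: {p['score']:.2f}")  # e.g. neutral / happy / angry / sad scores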
Generating emotionally appropriate speech is the synthesis counterpart to emotion recognition. Modern approaches include:
Style tokens and emotion embeddings: Models like GST-Tacotron learn a bank of "Global Style Tokens" from training data, where each token captures a distinct speaking style (angry, sad, cheerful). At inference time, selecting or interpolating between tokens controls the emotional quality of generated speech.
Prompt-based emotion control: Models like Parler-TTS and ElevenLabs' API accept natural language descriptions of desired speaking style ("speak warmly and slowly, with a gentle tone"), giving users intuitive control over emotional expression without needing to understand the underlying representation.
Reference-based synthesis: Extracting a style embedding from a reference audio clip and using it to condition generation ("speak this text in the same emotional tone as this audio sample"). This approach powers voice cloning with emotional transfer, though it raises additional ethical considerations around consent (see the voice cloning ethics section above and Article 44 on guardrails and content filtering).
For voice agent applications, emotion recognition and expressive synthesis create a feedback loop: the agent detects user frustration from prosodic cues and responds with a calmer, more empathetic tone, improving user experience in customer service and support contexts.
Tip: Prompt-based emotion control (as in Parler-TTS and ElevenLabs) is generally easier to iterate on than style-token approaches, since domain experts can describe desired tone in plain language rather than engineering embedding vectors.
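A brief sketch of prompt-based control with Parler-TTS, following its model-card usage (the checkpoint id and prompt wording are illustrative):

import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1")
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# The description controls speaking style; the prompt is the text to speak
description = "A calm female speaker delivers the line warmly and slowly, with a gentle tone."
prompt = "Thanks for calling. I can definitely help you sort this out."

input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)

sf.write("empathetic_reply.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)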
Building a production voice agent from individual ASR, LLM, and TTS components requires significant infrastructure work: WebRTC/telephony integration, latency optimization, interruption handling, fallback logic, and monitoring. A growing ecosystem of managed platforms abstracts this complexity.
Vapi: A developer-focused voice agent platform that provides a unified API for building phone and web voice agents. Vapi handles the real-time audio pipeline (transport, ASR, TTS, interruption detection) and lets developers focus on conversation logic. Key features include pluggable ASR/TTS/LLM providers, function calling during conversations, call transfer, and detailed analytics. Vapi supports both inbound and outbound calling via SIP/PSTN and WebRTC.
Bland.ai: Targets enterprise telephony use cases โ outbound sales calls, appointment scheduling, customer service. Bland provides a higher-level abstraction where developers define conversation flows (sometimes called "pathways") and the platform handles voice synthesis, call management, and CRM integration. The platform emphasizes call quality metrics and A/B testing of conversation strategies.
Retell AI: Positions between Vapi's developer flexibility and Bland's enterprise focus. Retell provides a voice agent SDK with strong emphasis on low latency and natural conversation flow. It offers both a hosted platform and self-hostable components, along with built-in support for knowledge bases and custom LLM integration.
| Factor | Vapi | Bland.ai | Retell AI |
|---|---|---|---|
| Primary use case | General voice agents | Enterprise telephony | Balanced |
| Developer control | High (API-first) | Medium (flow-based) | High |
| Telephony | SIP + WebRTC | PSTN-native | SIP + WebRTC |
| Custom LLM support | Yes | Limited | Yes |
| Self-hosting | No | No | Partial |
| Latency optimization | Manual tuning | Managed | Managed |
For teams with strong engineering capacity that need fine-grained control over every pipeline component, building on LiveKit (open-source WebRTC framework) with pluggable ASR/TTS remains viable. For teams that want to ship a voice agent quickly and iterate on conversation design rather than infrastructure, managed platforms provide a compelling tradeoff. See Article 52 on conversational AI for guidance on designing the conversation logic itself.