How Voxtral Realtime Works

An interactive exploration of the architecture behind real-time streaming speech-to-text

mistralai/Voxtral-Mini-4B-Realtime-2602 · 4B params · Apache 2.0

1 The Pipeline

Click each stage to see what happens at that step. Audio flows top-to-bottom, producing text tokens as it streams in.

🎤 Audio Input 16 kHz PCM

Raw audio is captured at a 16 kHz sample rate. For streaming, audio chunks are sent as they are recorded; there is no need to wait for the speaker to finish.

In streaming mode, audio is sent via WebSocket in small chunks. Each chunk is appended to the running session — the model never needs to "start over."
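For a rough idea of the client side, here is a minimal sketch of sending PCM chunks over a WebSocket with Python's websockets library. The URL, chunk size, and message framing are placeholders, not the actual Voxtral Realtime protocol.

```python
# Minimal sketch of streaming audio chunks over a WebSocket.
# The endpoint URL and framing are placeholders, not the real protocol.
import asyncio
import websockets  # pip install websockets

SAMPLE_RATE = 16_000                                   # 16 kHz PCM
CHUNK_MS = 80                                          # chunk size is a free choice here
BYTES_PER_CHUNK = SAMPLE_RATE * 2 * CHUNK_MS // 1000   # 16-bit mono samples

async def stream_audio(pcm_bytes: bytes, url: str = "wss://example.invalid/transcribe"):
    async with websockets.connect(url) as ws:
        # Each chunk is appended to the running session; nothing restarts.
        for start in range(0, len(pcm_bytes), BYTES_PER_CHUNK):
            await ws.send(pcm_bytes[start:start + BYTES_PER_CHUNK])
            await asyncio.sleep(CHUNK_MS / 1000)        # pace like live capture
```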
📊 Mel Spectrogram 128 mel bins

Raw audio is converted into a mel-frequency spectrogram — a 2D representation of energy across frequency bands over time.

Computed causally — each frame only uses current and past audio samples, allowing incremental computation as audio arrives.
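A minimal sketch of a 128-bin log-mel front end using torchaudio, with centered padding disabled so a frame never uses samples beyond its own window. The exact FFT and hop sizes here are assumptions, not the model's published configuration.

```python
# Sketch of a 128-bin log-mel front end with no future context.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000,
    n_fft=400,        # 25 ms window (assumed)
    hop_length=160,   # 10 ms hop (assumed)
    n_mels=128,
    center=False,     # no centered padding, so a frame never looks ahead
)

def log_mel(chunk: torch.Tensor) -> torch.Tensor:
    """chunk: (num_samples,) waveform -> (128, num_frames) log-mel features."""
    return torch.log(mel(chunk) + 1e-6)
```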
🧠 Causal Audio Encoder ~970M · 32 layers

Uses causal attention: each audio frame only attends to itself and past frames — never future ones. This is what makes streaming possible.

Sliding window attention (window = 750, ~15s of audio) bounds memory regardless of audio length.

Traditional encoders (like Whisper's) use bidirectional attention — every frame sees every other frame. That gives better accuracy but requires full audio upfront. Causal attention trades a small amount of accuracy for the ability to stream.
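Here is a small sketch of the kind of mask a causal, sliding-window encoder layer would apply: each frame may attend only to itself and the previous frames inside the window. The dimensions are illustrative.

```python
# Sketch of a causal, sliding-window attention mask for the encoder.
import torch

def causal_sliding_window_mask(seq_len: int, window: int = 750) -> torch.Tensor:
    """True where attention is allowed: frame i may attend to frame j
    only if j <= i (causal) and j > i - window (sliding window)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)
```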
🔍 4x Downsampling Adapter 4:1 compress

Encoder output is compressed 4x along the time axis, reducing sequence length before it enters the decoder.

After downsampling, 1 decoder token = ~80ms of audio. This is why the minimum configurable delay is 80ms — the resolution of one token.
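One common way to build such an adapter is to stack every 4 consecutive encoder frames and project them down, as sketched below. Whether Voxtral uses concatenation plus a linear projection or something like a strided convolution is an assumption here.

```python
# Sketch of a 4x downsampling adapter: stack 4 consecutive frames, project.
import torch
import torch.nn as nn

class DownsampleAdapter(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, factor: int = 4):
        super().__init__()
        self.factor = factor
        self.proj = nn.Linear(enc_dim * factor, dec_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, enc_dim); drop trailing frames that don't fill a group
        b, t, d = x.shape
        t = t - t % self.factor
        x = x[:, :t].reshape(b, t // self.factor, d * self.factor)
        return self.proj(x)  # (batch, time/4, dec_dim): one position per ~80ms
```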
⚙️ LLM Decoder ~3.4B · 26 layers

Grouped-Query Attention (32 query heads, 8 KV heads) with sliding window of 8,192 tokens. Reads encoded audio and generates text.

AdaRMSNorm conditions the decoder on the configured transcription delay — same model, different latency-accuracy points, no retraining.

GQA uses fewer KV heads than query heads, reducing memory footprint while maintaining quality. Critical for streaming where the KV cache grows continuously.
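A sketch of the GQA bookkeeping: 32 query heads share 8 KV heads, so the cache stores a quarter of the heads a standard multi-head layer would. The head dimension is illustrative.

```python
# Sketch of grouped-query attention head bookkeeping.
import torch

N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128   # head dim is illustrative
GROUP = N_Q_HEADS // N_KV_HEADS                # 4 query heads per KV head

def expand_kv(kv: torch.Tensor) -> torch.Tensor:
    """kv: (batch, N_KV_HEADS, seq, HEAD_DIM) -> (batch, N_Q_HEADS, seq, HEAD_DIM).
    The KV cache only ever stores the 8 KV heads, cutting its size 4x."""
    return kv.repeat_interleave(GROUP, dim=1)
```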
📝 Tekken Tokenizer 131K vocab

Decoder output logits are sampled into token IDs, which the Tekken tokenizer (131K vocabulary, 13 languages) maps back to text.

In streaming mode, tokens are emitted as deltas — small text fragments appended to the running transcription. Earlier tokens can be revised as more audio context arrives.
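A sketch of one way a loop could turn per-step logits into text deltas: pick a token, re-decode the running sequence, and emit only the new suffix (ignoring the revision case for simplicity). The tokenizer.decode interface here is assumed to look like a typical BPE tokenizer, not the exact Tekken API.

```python
# Sketch: turn per-step logits into streamed text deltas.
import torch

def emit_deltas(logits_stream, tokenizer):
    token_ids: list[int] = []
    prev_text = ""
    for logits in logits_stream:                      # one (vocab_size,) tensor per step
        token_ids.append(int(torch.argmax(logits)))   # greedy pick for simplicity
        text = tokenizer.decode(token_ids)            # re-decode the running sequence
        delta, prev_text = text[len(prev_text):], text
        yield delta                                   # only the new suffix is emitted
```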
💬 Streaming Text Output real-time

Text is streamed token-by-token via WebSocket. At a 480ms delay, transcription runs faster than real-time: one token covers 80ms of audio, so real-time corresponds to 12.5 tokens per second, and the decoder exceeds that rate.

The entire pipeline processes continuously. No "wait for audio to finish" step anywhere — audio in, text out, in parallel, indefinitely.
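Putting the pieces together, a client can run the send and receive loops concurrently on one connection, as sketched below. As before, the URL and message framing are placeholders; chunks is assumed to be an async iterator of PCM byte chunks, for example from a microphone.

```python
# Sketch: audio goes up and text deltas come back on one connection,
# concurrently and indefinitely. URL and framing are placeholders.
import asyncio
import websockets

async def transcribe(chunks, url: str = "wss://example.invalid/transcribe"):
    async with websockets.connect(url) as ws:
        async def send():
            async for chunk in chunks:      # e.g. PCM bytes from a microphone
                await ws.send(chunk)

        async def receive():
            async for delta in ws:          # text fragments as they arrive
                print(delta, end="", flush=True)

        await asyncio.gather(send(), receive())  # both run until the session closes
```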

2 Causal vs Bidirectional Attention

The core difference enabling streaming. Hover over cells to compare. Click "Animate" to see each attention type build row by row.

Bidirectional (Whisper-style)

Every token sees every other token. Needs full audio.

Causal (Voxtral-style)

Each token only sees past tokens. Enables streaming.

Legend: can attend · cannot attend · highlighted
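In code, the two matrices above boil down to two boolean masks; a tiny illustrative sketch:

```python
# The two matrices above, as boolean masks (True = can attend).
import torch

n = 6
bidirectional = torch.ones(n, n, dtype=torch.bool)       # every frame sees every frame
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))  # frame i sees frames 0..i only
```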

3 The Delay-Accuracy Tradeoff

Configure how far behind the audio the transcription lags. More delay = more context = better accuracy.

Example setting: 480ms delay
Tokens behind: 6 (1 token = 80ms of audio)
Approx. WER (English): ~4.9% (Word Error Rate on FLEURS)
Use case fit: Recommended (best balance of speed and accuracy)
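The arithmetic behind the slider is simple: the configured delay is a whole number of 80ms tokens that the transcript lags behind the audio. A small sketch, using the figures quoted on this page:

```python
# Delay-to-tokens bookkeeping, using the figures quoted on this page.
TOKEN_MS = 80                         # one decoder token covers ~80ms of audio

def tokens_behind(delay_ms: int) -> int:
    return delay_ms // TOKEN_MS

assert tokens_behind(480) == 6        # recommended setting shown above
assert tokens_behind(80) == 1         # minimum: one token of context
assert tokens_behind(2400) == 30      # top of the configurable range
```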

4 Sliding Window Attention

Fixed-size attention windows bound memory. As new tokens arrive, old ones slide out, so audio of any length can be processed.

Encoder window: 750 frames (~15s) · Position: frame 8 · In window: 8 / 8 · Memory: constant
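One way to picture the constant-memory property is a fixed-capacity rolling buffer: once it is full, every new frame pushes the oldest one out. A sketch using Python's collections.deque:

```python
# Constant memory via a fixed-capacity rolling buffer of encoder frames.
from collections import deque

WINDOW = 750                    # ~15s of encoder frames
cache = deque(maxlen=WINDOW)    # once full, appending evicts the oldest frame

def on_new_frame(frame):
    cache.append(frame)         # the oldest frame slides out automatically
    return list(cache)          # what the newest frame is allowed to attend to
```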

5 Key Takeaways

Causal Attention

Each frame only sees the past, enabling output before hearing the full audio. The fundamental enabler of streaming.

Sliding Window

Fixed-size windows bound memory and compute. A 3-hour meeting uses the same peak memory as a 5-minute call.

Configurable Delay

AdaRMSNorm lets one model operate at different latency-accuracy points (80ms to 2.4s) without retraining.

Open Weights

Apache 2.0 license. Run it on cloud GPUs via vLLM, on Apple Silicon via MLX, or in the browser on Hugging Face Spaces.