How Voxtral Realtime Works

An interactive exploration of the architecture behind real-time streaming speech-to-text

mistralai/Voxtral-Mini-4B-Realtime-2602 · 4B params · Apache 2.0

1 The Pipeline

Click each stage to see what happens at that step. Audio flows top-to-bottom, producing text tokens as it streams in.

🎤 Audio Input 16 kHz PCM

Raw audio is captured at a 16 kHz sample rate. For streaming, audio chunks are sent as they are recorded; there is no need to wait for the speaker to finish.

In streaming mode, audio is sent via WebSocket in small chunks. Each chunk is appended to the running session — the model never needs to "start over."
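For a rough idea of the client side, here is a minimal sketch of sending PCM chunks over a WebSocket with Python's websockets library. The URL, chunk size, and message framing are placeholders, not the actual Voxtral Realtime protocol.

```python
# Minimal sketch of streaming audio chunks over a WebSocket.
# The endpoint URL and framing are placeholders, not the real protocol.
import asyncio
import websockets  # pip install websockets

SAMPLE_RATE = 16_000                                   # 16 kHz PCM
CHUNK_MS = 80                                          # chunk size is a free choice here
BYTES_PER_CHUNK = SAMPLE_RATE * 2 * CHUNK_MS // 1000   # 16-bit mono samples

async def stream_audio(pcm_bytes: bytes, url: str = "wss://example.invalid/transcribe"):
    async with websockets.connect(url) as ws:
        # Each chunk is appended to the running session; nothing restarts.
        for start in range(0, len(pcm_bytes), BYTES_PER_CHUNK):
            await ws.send(pcm_bytes[start:start + BYTES_PER_CHUNK])
            await asyncio.sleep(CHUNK_MS / 1000)        # pace like live capture
```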
📊 Mel Spectrogram 128 mel bins

Raw audio is converted into a mel-frequency spectrogram — a 2D representation of energy across frequency bands over time.

Computed causally — each frame only uses current and past audio samples, allowing incremental computation as audio arrives.
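A minimal sketch of a 128-bin log-mel front end using torchaudio, with centered padding disabled so a frame never uses samples beyond its own window. The exact FFT and hop sizes here are assumptions, not the model's published configuration.

```python
# Sketch of a 128-bin log-mel front end with no future context.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000,
    n_fft=400,        # 25 ms window (assumed)
    hop_length=160,   # 10 ms hop (assumed)
    n_mels=128,
    center=False,     # no centered padding, so a frame never looks ahead
)

def log_mel(chunk: torch.Tensor) -> torch.Tensor:
    """chunk: (num_samples,) waveform -> (128, num_frames) log-mel features."""
    return torch.log(mel(chunk) + 1e-6)
```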
🧠 Causal Audio Encoder ~970M · 32 layers

Uses causal attention: each audio frame only attends to itself and past frames — never future ones. This is what makes streaming possible.

Sliding window attention (window = 750, ~15s of audio) bounds memory regardless of audio length.

Traditional encoders (like Whisper's) use bidirectional attention — every frame sees every other frame. That gives better accuracy but requires full audio upfront. Causal attention trades a small amount of accuracy for the ability to stream.
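Here is a small sketch of the kind of mask a causal, sliding-window encoder layer would apply: each frame may attend only to itself and the previous frames inside the window. The dimensions are illustrative.

```python
# Sketch of a causal, sliding-window attention mask for the encoder.
import torch

def causal_sliding_window_mask(seq_len: int, window: int = 750) -> torch.Tensor:
    """True where attention is allowed: frame i may attend to frame j
    only if j <= i (causal) and j > i - window (sliding window)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)
```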
🔍 4x Downsampling Adapter 4:1 compress

Encoder output is compressed 4x along the time axis, reducing sequence length before it enters the decoder.

After downsampling, 1 decoder token = ~80ms of audio. This is why the minimum configurable delay is 80ms — the resolution of one token.
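One common way to build such an adapter is to stack every 4 consecutive encoder frames and project them down, as sketched below. Whether Voxtral uses concatenation plus a linear projection or something like a strided convolution is an assumption here.

```python
# Sketch of a 4x downsampling adapter: stack 4 consecutive frames, project.
import torch
import torch.nn as nn

class DownsampleAdapter(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, factor: int = 4):
        super().__init__()
        self.factor = factor
        self.proj = nn.Linear(enc_dim * factor, dec_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, enc_dim); drop trailing frames that don't fill a group
        b, t, d = x.shape
        t = t - t % self.factor
        x = x[:, :t].reshape(b, t // self.factor, d * self.factor)
        return self.proj(x)  # (batch, time/4, dec_dim): one position per ~80ms
```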
⚙️ LLM Decoder ~3.4B · 26 layers

Grouped-Query Attention (32 query heads, 8 KV heads) with sliding window of 8,192 tokens. Reads encoded audio and generates text.

AdaRMSNorm conditions the decoder on the configured transcription delay — same model, different latency-accuracy points, no retraining.

GQA uses fewer KV heads than query heads, reducing memory footprint while maintaining quality. Critical for streaming where the KV cache grows continuously.
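A sketch of the GQA bookkeeping: 32 query heads share 8 KV heads, so the cache stores a quarter of the heads a standard multi-head layer would. The head dimension is illustrative.

```python
# Sketch of grouped-query attention head bookkeeping.
import torch

N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128   # head dim is illustrative
GROUP = N_Q_HEADS // N_KV_HEADS                # 4 query heads per KV head

def expand_kv(kv: torch.Tensor) -> torch.Tensor:
    """kv: (batch, N_KV_HEADS, seq, HEAD_DIM) -> (batch, N_Q_HEADS, seq, HEAD_DIM).
    The KV cache only ever stores the 8 KV heads, cutting its size 4x."""
    return kv.repeat_interleave(GROUP, dim=1)
```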
📝 Tekken Tokenizer 131K vocab

Decoder output logits are sampled into token IDs, which the Tekken tokenizer (131K vocabulary, 13 languages) maps back to text.

In streaming mode, tokens are emitted as deltas — small text fragments appended to the running transcription. Earlier tokens can be revised as more audio context arrives.
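A sketch of one way a loop could turn per-step logits into text deltas: pick a token, re-decode the running sequence, and emit only the new suffix (ignoring the revision case for simplicity). The tokenizer.decode interface here is assumed to look like a typical BPE tokenizer, not the exact Tekken API.

```python
# Sketch: turn per-step logits into streamed text deltas.
import torch

def emit_deltas(logits_stream, tokenizer):
    token_ids: list[int] = []
    prev_text = ""
    for logits in logits_stream:                      # one (vocab_size,) tensor per step
        token_ids.append(int(torch.argmax(logits)))   # greedy pick for simplicity
        text = tokenizer.decode(token_ids)            # re-decode the running sequence
        delta, prev_text = text[len(prev_text):], text
        yield delta                                   # only the new suffix is emitted
```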
💬 Streaming Text Output real-time

Text is streamed token-by-token via WebSocket. At a 480ms delay, transcription runs faster than real-time: one token covers 80ms of audio, so real-time corresponds to 12.5 tokens per second, and the decoder exceeds that rate.

The entire pipeline processes continuously. No "wait for audio to finish" step anywhere — audio in, text out, in parallel, indefinitely.
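Putting the pieces together, a client can run the send and receive loops concurrently on one connection, as sketched below. As before, the URL and message framing are placeholders; chunks is assumed to be an async iterator of PCM byte chunks, for example from a microphone.

```python
# Sketch: audio goes up and text deltas come back on one connection,
# concurrently and indefinitely. URL and framing are placeholders.
import asyncio
import websockets

async def transcribe(chunks, url: str = "wss://example.invalid/transcribe"):
    async with websockets.connect(url) as ws:
        async def send():
            async for chunk in chunks:      # e.g. PCM bytes from a microphone
                await ws.send(chunk)

        async def receive():
            async for delta in ws:          # text fragments as they arrive
                print(delta, end="", flush=True)

        await asyncio.gather(send(), receive())  # both run until the session closes
```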

2 Causal vs Bidirectional Attention

The core difference enabling streaming. Hover over cells to compare. Click "Animate" to see each attention type build row by row.

Bidirectional (Whisper-style)

Every token sees every other token. Needs full audio.

Causal (Voxtral-style)

Each token only sees past tokens. Enables streaming.

Legend: can attend · cannot attend · highlighted
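In code, the two matrices above boil down to two boolean masks; a tiny illustrative sketch:

```python
# The two matrices above, as boolean masks (True = can attend).
import torch

n = 6
bidirectional = torch.ones(n, n, dtype=torch.bool)       # every frame sees every frame
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))  # frame i sees frames 0..i only
```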

3 The Delay-Accuracy Tradeoff

Configure how far behind the audio the transcription lags. More delay = more context = better accuracy.

Example setting: 480ms delay
Tokens behind: 6 (1 token = 80ms of audio)
Approx. WER (English): ~4.9% (Word Error Rate on FLEURS)
Use case fit: Recommended (best balance of speed and accuracy)
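The arithmetic behind the slider is simple: the configured delay is a whole number of 80ms tokens that the transcript lags behind the audio. A small sketch, using the figures quoted on this page:

```python
# Delay-to-tokens bookkeeping, using the figures quoted on this page.
TOKEN_MS = 80                         # one decoder token covers ~80ms of audio

def tokens_behind(delay_ms: int) -> int:
    return delay_ms // TOKEN_MS

assert tokens_behind(480) == 6        # recommended setting shown above
assert tokens_behind(80) == 1         # minimum: one token of context
assert tokens_behind(2400) == 30      # top of the configurable range
```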

4 Sliding Window Attention

Fixed-size attention windows bound memory. As new tokens arrive, old ones slide out, so audio of any length can be processed.

Encoder window: 750 frames (~15s) · Position: frame 8 · In window: 8 / 8 · Memory: constant
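One way to picture the constant-memory property is a fixed-capacity rolling buffer: once it is full, every new frame pushes the oldest one out. A sketch using Python's collections.deque:

```python
# Constant memory via a fixed-capacity rolling buffer of encoder frames.
from collections import deque

WINDOW = 750                    # ~15s of encoder frames
cache = deque(maxlen=WINDOW)    # once full, appending evicts the oldest frame

def on_new_frame(frame):
    cache.append(frame)         # the oldest frame slides out automatically
    return list(cache)          # what the newest frame is allowed to attend to
```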

5 Key Takeaways

Causal Attention

Each frame only sees the past, enabling output before hearing the full audio. The fundamental enabler of streaming.

Sliding Window

Fixed-size windows bound memory and compute. A 3-hour meeting uses the same peak memory as a 5-minute call.

Configurable Delay

AdaRMSNorm lets one model operate at different latency-accuracy points (80ms to 2.4s) without retraining.

Open Weights

Apache 2.0 license. Run it on cloud GPUs via vLLM, on Apple Silicon via MLX, or in the browser on Hugging Face Spaces.