An interactive exploration of the architecture behind real-time streaming speech-to-text
mistralai/Voxtral-Mini-4B-Realtime-2602 · 4B params · Apache 2.0
Click each stage to see what happens at that step. Audio flows top-to-bottom, producing text tokens as it streams in.
Raw audio is captured at a 16 kHz sample rate. For streaming, chunks are sent as they are recorded; there is no need to wait for the speaker to finish.
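A minimal sketch of the client side, assuming the usual fixed-size-chunk pattern; the 80 ms chunk size and the numpy representation are illustrative choices, not requirements of the model:

```python
import numpy as np

SAMPLE_RATE = 16_000                  # 16 kHz mono, as the model expects
CHUNK_MS = 80                         # illustrative chunk size
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000   # 1,280 samples per chunk

def stream_chunks(audio: np.ndarray):
    """Yield fixed-size chunks of a mono 16 kHz signal, as a live mic would."""
    for start in range(0, len(audio), CHUNK_SAMPLES):
        yield audio[start:start + CHUNK_SAMPLES]

# Each chunk can be shipped to the server the moment it exists; nothing
# waits for the end of the utterance.
```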
Raw audio is converted into a mel-frequency spectrogram — a 2D representation of energy across frequency bands over time.
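A sketch of this front-end using librosa; the window, hop, and mel-band counts below are typical ASR settings, not confirmed values for this model:

```python
import numpy as np
import librosa

sr = 16_000
audio = np.random.randn(sr)   # one second of dummy audio

mel = librosa.feature.melspectrogram(
    y=audio, sr=sr,
    n_fft=400,        # 25 ms analysis window
    hop_length=160,   # 10 ms hop -> 100 frames per second
    n_mels=128,       # number of mel frequency bands
)
log_mel = librosa.power_to_db(mel)   # log-compress, as ASR front-ends typically do
print(log_mel.shape)                 # (128 mel bands, ~101 frames for 1 s of audio)
```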
Uses causal attention: each audio frame only attends to itself and past frames — never future ones. This is what makes streaming possible.
Sliding window attention (window = 750, ~15s of audio) bounds memory regardless of audio length.
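These two constraints combine into a single boolean mask. A small numpy sketch (the real encoder uses window = 750; here it is 3 so the pattern is visible):

```python
import numpy as np

def causal_sliding_mask(seq_len: int, window: int) -> np.ndarray:
    """True where frame i may attend to frame j: j <= i and i - j < window."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Tiny example: 6 frames, window of 3. Row i shows what frame i can see.
print(causal_sliding_mask(6, 3).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```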
Encoder output is compressed 4x along the time axis, reducing sequence length before it enters the decoder.
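One simple way to get a 4x reduction is to stack adjacent frames; whether the model stacks, pools, or convolves here is not specified above, so treat this as a sketch of the idea:

```python
import numpy as np

def downsample_4x(frames: np.ndarray) -> np.ndarray:
    """Stack every 4 consecutive frames into one wider vector: (T, d) -> (T // 4, 4 * d)."""
    t, d = frames.shape
    t = t - t % 4                      # drop the ragged tail
    return frames[:t].reshape(t // 4, 4 * d)

encoded = np.random.randn(100, 512)    # 100 encoder frames, d=512 (dummy values)
print(downsample_4x(encoded).shape)    # (25, 2048): 4x fewer steps for the decoder
```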
Grouped-Query Attention (32 query heads, 8 KV heads) with a sliding window of 8,192 tokens. Reads encoded audio and generates text.
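A bare-bones sketch of the grouped-query part: the 8 KV heads are stored once and each is broadcast to 4 of the 32 query heads, cutting the KV cache 4x. Shapes are illustrative:

```python
import torch

n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 64, 10
group = n_q_heads // n_kv_heads        # 4 query heads share each KV head

q = torch.randn(n_q_heads, seq, head_dim)
k = torch.randn(n_kv_heads, seq, head_dim)
v = torch.randn(n_kv_heads, seq, head_dim)

# Broadcast each KV head to its group of query heads, then attend as usual.
k = k.repeat_interleave(group, dim=0)      # (32, seq, head_dim)
v = v.repeat_interleave(group, dim=0)
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1) @ v
print(attn.shape)                          # torch.Size([32, 10, 64])
```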
AdaRMSNorm conditions the decoder on the configured transcription delay — same model, different latency-accuracy points, no retraining.
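A sketch of the general AdaRMSNorm technique, with the per-channel gain predicted from a conditioning vector (here, an embedding of the chosen delay); the conditioning network and its placement are assumptions, not the model's exact layout:

```python
import torch
import torch.nn as nn

class AdaRMSNorm(nn.Module):
    """RMSNorm whose gain is predicted from a conditioning vector."""
    def __init__(self, dim: int, cond_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.to_gain = nn.Linear(cond_dim, dim)   # condition -> per-channel gain

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1 + self.to_gain(cond))

norm = AdaRMSNorm(dim=512, cond_dim=16)
x = torch.randn(2, 7, 512)                # (batch, tokens, channels)
delay_emb = torch.randn(2, 1, 16)         # embedding of the configured delay
print(norm(x, delay_emb).shape)           # torch.Size([2, 7, 512])
```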
Decoder logits are mapped to token IDs, which the Tekken tokenizer (131K vocabulary, 13 languages) turns back into text.
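The final step, sketched with greedy decoding; `tokenizer` below is a stand-in for any Tekken-compatible decoder, not a specific API:

```python
import numpy as np

VOCAB = 131_072   # Tekken vocabulary size (approximate)

def greedy_step(logits: np.ndarray) -> int:
    """Pick the most likely next token ID from one step of decoder logits."""
    return int(np.argmax(logits))

logits = np.random.randn(VOCAB)         # dummy logits for one decode step
token_id = greedy_step(logits)
# text = tokenizer.decode([token_id])   # `tokenizer` stands in for a
#                                       # Tekken-compatible decoder object
```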
Text is streamed token-by-token via WebSocket. At a 480 ms delay, transcription runs faster than real time (>12.5 tokens/second).
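What a receiving client might look like with the `websockets` package; the endpoint URL and message format are placeholders, not a documented API:

```python
import asyncio
import websockets  # pip install websockets

async def transcribe(uri: str = "ws://localhost:8000/transcribe"):
    """Hypothetical client: prints tokens as the server emits them."""
    async with websockets.connect(uri) as ws:
        # In a real client, a second task would be sending audio chunks here.
        async for message in ws:        # tokens arrive one at a time
            print(message, end="", flush=True)

asyncio.run(transcribe())
```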
The core difference enabling streaming. Hover over cells to compare. Click "Animate" to see each attention type build row by row.
Every token sees every other token. Needs full audio.
Each token only sees past tokens. Enables streaming.
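The same contrast in a few lines of numpy: row i of each mask marks which positions token i may attend to.

```python
import numpy as np

n = 4
bidirectional = np.ones((n, n), dtype=int)   # every token sees every token
causal = np.tril(bidirectional)              # row i sees only columns j <= i
print(bidirectional)
print(causal)
```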
Configure how far behind the audio the transcription lags. More delay = more context = better accuracy.
Fixed-size attention windows bound memory. As new tokens arrive, old ones slide out, enabling arbitrarily long audio.
Each frame attends only to the past, so output can begin before the full audio has been heard. The fundamental enabler of streaming.
Fixed-size windows bound memory and compute. A 3-hour meeting uses the same peak memory as a 5-minute call.
AdaRMSNorm lets one model operate at different latency-accuracy points (80ms to 2.4s) without retraining.
Apache 2.0 license. Cloud GPUs via vLLM, Apple Silicon via MLX, or the browser on Hugging Face Spaces.
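Wherever you run it, the weights themselves come from the Hub; a minimal fetch with `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Fetch the Apache-2.0 weights locally; serving them (vLLM, MLX, ...) is
# a separate step that depends on your backend.
path = snapshot_download("mistralai/Voxtral-Mini-4B-Realtime-2602")
print(path)
```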