vllm - 💡(How to fix) Fix [Feature]: Expose Word-Level Timestamps in `/v1/realtime` API for Voxtral Realtime [1 participants]

Summary

The vLLM Realtime API (/v1/realtime) currently returns only text in transcription.delta and transcription.done events. However, the Voxtral Realtime model already has enough internal information to derive per-word timestamps via [STREAMING_WORD] token positions and the known transcription_delay. Exposing these timestamps in the WebSocket protocol would unlock critical downstream use cases without any model changes.

Motivation

Current behavior

{"type": "transcription.delta", "delta": "Hello world"}
{"type": "transcription.done", "text": "Hello world."}

Clients receive text but have no timing information. Applications that need to know when a word was spoken must resort to coarse heuristics (e.g., wall-clock arrival time of deltas) or run a separate forced-alignment model.

Why this matters

| Use case | How timestamps help |
| --- | --- |
| Speaker diarization | When a speaker change is detected mid-sentence, precise word boundaries allow splitting the transcript at the exact audio position instead of an approximate byte offset. |
| Live subtitling | Word-level sync enables karaoke-style highlighting where each word lights up as it is spoken. |
| Forced alignment | Eliminates the need for a separate alignment model (e.g., wav2vec2-based CTC) when timestamps are already available from the ASR model. |
| Meeting analytics | Accurate per-utterance timing for speaker talk-time metrics, overlap detection, and silence measurement. |

Technical Background

The Voxtral Realtime model uses a [STREAMING_WORD] token as a word boundary marker in its token vocabulary. The position (index) of each [STREAMING_WORD] token in the generated sequence is directly tied to the audio frame timeline.

From tekken.json:

"audio": {
  "sampling_rate": 16000,
  "frame_rate": 12.5,
  "audio_encoding_config": {
    "num_mel_bins": 128,
    "hop_length": 160,
    "window_size": 400
  },
  "transcription_delay_ms": 480,
  "streaming_look_ahead_ms": 2.5,
  "streaming_look_back_ms": 52.5,
  "streaming_n_left_pad_tokens": 32,
  "transcription_format": "streaming"
}

Key parameters:

frame_rate = 12.5 frames/sec → 1 frame = 1000 / 12.5 = 80 ms
Each text token corresponds to one frame, i.e. 80 ms of audio
transcription_delay_ms = 480 → delay in tokens: 480 / 80 = 6 tokens
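
The arithmetic behind these parameters can be checked with a short sketch. It assumes only the three fields quoted from tekken.json above; the helper names `frame_duration_ms` and `delay_in_tokens` are hypothetical and exist only for illustration.

```python
import json

# Subset of the "audio" section quoted from tekken.json above.
AUDIO_CONFIG = json.loads("""
{
  "sampling_rate": 16000,
  "frame_rate": 12.5,
  "transcription_delay_ms": 480
}
""")


def frame_duration_ms(audio_cfg: dict) -> float:
    """Duration of one frame (and, per the issue, one text token) in ms."""
    return 1000.0 / audio_cfg["frame_rate"]


def delay_in_tokens(audio_cfg: dict) -> int:
    """Transcription delay expressed as a whole number of tokens."""
    return round(audio_cfg["transcription_delay_ms"] / frame_duration_ms(audio_cfg))


if __name__ == "__main__":
    print(frame_duration_ms(AUDIO_CONFIG))  # 80.0
    print(delay_in_tokens(AUDIO_CONFIG))    # 6
```

Deriving both values from the config rather than hard-coding 80 and 6 would keep the calculation valid if a model ships with a different frame rate or delay.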

Deriving a word's end timestamp

end_frame_idx = streaming_word_token_idx - transcription_delay_in_tokens
end_time_sec  = end_frame_idx / frame_rate
end_time_ms   = end_time_sec * 1000

For example, if [STREAMING_WORD] appears at token index 25:

end_frame_idx = 25 - 6 = 19
end_time_ms   = (19 / 12.5) * 1000 = 1520 ms

This computation is trivial and can be performed in the serving layer without any model modifications.
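
As a minimal sketch, the formula can be written as one function. The function name and the clamp to zero are assumptions added here, not part of the issue. The sketch also assumes the token index is counted so that index 0 lines up with the start of the audio; whether `streaming_n_left_pad_tokens` shifts that origin is not addressed in the issue.

```python
def word_end_time_ms(
    streaming_word_token_idx: int,
    transcription_delay_in_tokens: int = 6,
    frame_rate: float = 12.5,
) -> int:
    """End time in ms of the word closed by a [STREAMING_WORD] token.

    Defaults mirror the values quoted from tekken.json in this issue.
    """
    end_frame_idx = streaming_word_token_idx - transcription_delay_in_tokens
    # Assumption: a marker emitted before the delay has elapsed maps to 0 ms
    # instead of a negative timestamp.
    end_frame_idx = max(end_frame_idx, 0)
    end_time_sec = end_frame_idx / frame_rate
    return round(end_time_sec * 1000)


if __name__ == "__main__":
    print(word_end_time_ms(25))  # 1520
```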

Proposed API Change

Option A: Add `end_time_ms` to delta events (minimal)

{
  "type": "transcription.delta",
  "delta": "world",
  "end_time_ms": 1520
}

A single timestamp per emission group: the end time of the last word in the delta. This is simple, low-overhead, and sufficient for most use cases.

Option B: Add `word_timestamps` array (detailed)

{
  "type": "transcription.delta",
  "delta": "Hello world",
  "word_timestamps": [
    {"word": "Hello", "end_ms": 960},
    {"word": "world", "end_ms": 1520}
  ]
}
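
To make the relationship between the two options concrete, here is a hedged sketch of a payload builder. `build_delta_event` is a hypothetical helper, not an existing vLLM function, and joining words with a single space is an assumption about how the delta text is assembled.

```python
import json
from typing import List, Tuple


def build_delta_event(words: List[Tuple[str, int]], detailed: bool) -> dict:
    """Build a transcription.delta payload from (word, end_ms) pairs.

    detailed=False yields the Option A shape, detailed=True the Option B shape.
    """
    if not words:
        raise ValueError("at least one word is required")
    event = {
        "type": "transcription.delta",
        "delta": " ".join(word for word, _ in words),
    }
    if detailed:
        event["word_timestamps"] = [
            {"word": word, "end_ms": end_ms} for word, end_ms in words
        ]
    else:
        # Option A carries only the end time of the last word in the delta.
        event["end_time_ms"] = words[-1][1]
    return event


if __name__ == "__main__":
    words = [("Hello", 960), ("world", 1520)]
    print(json.dumps(build_delta_event(words, detailed=False)))
    print(json.dumps(build_delta_event(words, detailed=True)))
```

Under this framing, Option A is a lossy projection of Option B: a client receiving Option B can always recover the Option A value from the last array entry.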

Backward compatibility

Both options are fully backward-compatible because the new fields are additive. Existing clients that ignore unknown fields will continue to work unchanged. Optionally, timestamp inclusion could be gated behind a session parameter (e.g., "include_timestamps": true in session.update).
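
The additive nature of the change can be illustrated from the client side. The following is a sketch of a tolerant handler, assuming the field names proposed above; `parse_delta` is a hypothetical client helper and the timestamp fields do not exist in the current API.

```python
import json
from typing import Optional


def parse_delta(raw_message: str) -> Optional[dict]:
    """Parse a transcription.delta message, treating timestamps as optional."""
    event = json.loads(raw_message)
    if event.get("type") != "transcription.delta":
        return None
    return {
        "text": event["delta"],
        # Missing on today's servers; present only if the proposal is adopted.
        "end_time_ms": event.get("end_time_ms"),
        "word_timestamps": event.get("word_timestamps", []),
    }


if __name__ == "__main__":
    current = '{"type": "transcription.delta", "delta": "Hello world"}'
    proposed = '{"type": "transcription.delta", "delta": "world", "end_time_ms": 1520}'
    print(parse_delta(current))
    print(parse_delta(proposed))
```

A client written this way works against both the current and the proposed message shapes, which is the property the backward-compatibility argument relies on.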

Implementation Notes

The change is localized to vLLM's realtime endpoint handler where generated tokens are decoded and streamed to the client:

1. During token-by-token generation, detect [STREAMING_WORD] token IDs.
2. Record each one's position in the generated sequence.
3. Apply the delay offset: position - transcription_delay_in_tokens.
4. Convert to milliseconds: adjusted_position * 80.
5. Attach the result to the outgoing WebSocket message.

No changes to the model, tokenizer, or audio encoder are required.
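
The steps above can be sketched as a generator over the generated token IDs. This is an illustration under stated assumptions, not vLLM's actual handler code: the function name is hypothetical, the ID of [STREAMING_WORD] must be looked up from the tokenizer (the value 99 below is a placeholder), and positions are assumed to be 0-based from the first generated token.

```python
from typing import Iterable, Iterator

MS_PER_TOKEN = 80   # 1000 / frame_rate, from tekken.json as quoted above
DELAY_TOKENS = 6    # transcription_delay_ms / MS_PER_TOKEN


def streaming_word_end_times(
    token_ids: Iterable[int], streaming_word_id: int
) -> Iterator[int]:
    """Yield an end time in ms for every [STREAMING_WORD] token seen."""
    for position, token_id in enumerate(token_ids):
        if token_id != streaming_word_id:
            continue
        # Apply the delay offset, clamping at zero (an assumption).
        adjusted_position = max(position - DELAY_TOKENS, 0)
        yield adjusted_position * MS_PER_TOKEN


if __name__ == "__main__":
    PLACEHOLDER_WORD_ID = 99  # hypothetical ID, not the real token ID
    # Word markers at positions 18 and 25; other IDs stand in for text tokens.
    tokens = [0] * 18 + [PLACEHOLDER_WORD_ID] + [0] * 6 + [PLACEHOLDER_WORD_ID]
    print(list(streaming_word_end_times(tokens, PLACEHOLDER_WORD_ID)))
    # [960, 1520]
```

Because the generator keeps no state beyond the running position, it could in principle sit alongside the existing decode loop and hand each value to whatever code assembles the outgoing message.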

References

Model: mistralai/Voxtral-Mini-4B-Realtime-2602
Technical report: arxiv:2602.11298
vLLM Realtime API docs: vLLM Realtime
Related HF discussion: #2 — transcription_delay param
vLLM blog on streaming input: [blog.vllm.ai](https://blog.vllm.ai/2026/01/31/streaming-r

extent analysis

TL;DR

To unlock critical downstream use cases, modify the vLLM Realtime API to include per-word timestamps in the WebSocket protocol by utilizing the [STREAMING_WORD] token positions and the known transcription_delay.

Guidance

1. Identify the position of each [STREAMING_WORD] token in the generated sequence and record its index.
2. Apply the delay offset to each token position using the formula end_frame_idx = streaming_word_token_idx - transcription_delay_in_tokens.
3. Convert the adjusted position to milliseconds using the frame rate and attach the timestamp to the outgoing WebSocket message.
4. Consider implementing one of the proposed API changes, such as adding an end_time_ms field to delta events or including a word_timestamps array.

Example

{
  "type": "transcription.delta",
  "delta": "world",
  "end_time_ms": 1520
}

This example shows the addition of an end_time_ms field to a delta event, providing the timestamp for the last word in the delta.

Notes

The implementation of this change is localized to the vLLM realtime endpoint handler and does not require any modifications to the model, tokenizer, or audio encoder. The proposed API changes are fully backward-compatible, allowing existing clients to continue working unchanged.

Recommendation

Implement the change in the vLLM Realtime API serving layer so that delta events include per-word timestamps. This unlocks the downstream use cases listed above without requiring any model changes.
