vllm: [Feature]: Expose Word-Level Timestamps in `/v1/realtime` API for Voxtral Realtime

GitHub stats
vllm-project/vllm#39735 · Fetched 2026-04-15 06:20:41
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 1


Summary

The vLLM Realtime API (/v1/realtime) currently returns only text in transcription.delta and transcription.done events. However, the Voxtral Realtime model already has enough internal information to derive per-word timestamps via [STREAMING_WORD] token positions and the known transcription_delay. Exposing these timestamps in the WebSocket protocol would unlock critical downstream use cases without any model changes.

Motivation

Current behavior

{"type": "transcription.delta", "delta": "Hello world"}
{"type": "transcription.done", "text": "Hello world."}

Clients receive text but have no timing information. Applications that need to know when a word was spoken must resort to coarse heuristics (e.g., wall-clock arrival time of deltas) or run a separate forced-alignment model.

Why this matters

Use case | How timestamps help
Speaker diarization | When a speaker change is detected mid-sentence, precise word boundaries allow splitting the transcript at the exact audio position instead of an approximate byte offset.
Live subtitling | Word-level sync enables karaoke-style highlighting where each word lights up as it is spoken.
Forced alignment | Eliminates the need for a separate alignment model (e.g., wav2vec2-based CTC) when timestamps are already available from the ASR model.
Meeting analytics | Accurate per-utterance timing for speaker talk-time metrics, overlap detection, and silence measurement.

Technical Background

The Voxtral Realtime model uses a [STREAMING_WORD] token as a word boundary marker in its token vocabulary. The position (index) of each [STREAMING_WORD] token in the generated sequence is directly tied to the audio frame timeline.

From tekken.json:

"audio": {
  "sampling_rate": 16000,
  "frame_rate": 12.5,
  "audio_encoding_config": {
    "num_mel_bins": 128,
    "hop_length": 160,
    "window_size": 400
  },
  "transcription_delay_ms": 480,
  "streaming_look_ahead_ms": 2.5,
  "streaming_look_back_ms": 52.5,
  "streaming_n_left_pad_tokens": 32,
  "transcription_format": "streaming"
}

Key parameters:

  • frame_rate = 12.5 frames/sec → 1 frame = 1000 / 12.5 = 80 ms
  • Each text token corresponds to one frame, i.e. 80 ms of audio
  • transcription_delay_ms = 480 → delay in tokens: 480 / 80 = 6 tokens
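
The same arithmetic can be sketched in Python. This is only an illustration: the helper names `frame_duration_ms` and `delay_in_tokens` are made up for this example, and the dictionary holds just the subset of the tekken.json audio block that the calculation needs.

```python
import json

# Subset of the "audio" block from the tekken.json excerpt above.
AUDIO_CONFIG = json.loads("""
{
  "sampling_rate": 16000,
  "frame_rate": 12.5,
  "transcription_delay_ms": 480
}
""")


def frame_duration_ms(audio_cfg: dict) -> float:
    """Milliseconds of audio covered by one frame (hypothetical helper)."""
    return 1000.0 / audio_cfg["frame_rate"]


def delay_in_tokens(audio_cfg: dict) -> int:
    """Transcription delay as a whole number of tokens (hypothetical helper)."""
    tokens = audio_cfg["transcription_delay_ms"] / frame_duration_ms(audio_cfg)
    if tokens != int(tokens):
        # Guard against configs where the delay is not frame-aligned.
        raise ValueError("transcription delay is not a whole number of frames")
    return int(tokens)


print(frame_duration_ms(AUDIO_CONFIG), delay_in_tokens(AUDIO_CONFIG))
```

Reading the values from the config rather than hard-coding 80 and 6 keeps the calculation valid if a different delay is configured.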

Deriving a word's end timestamp

end_frame_idx = streaming_word_token_idx - transcription_delay_in_tokens
end_time_sec  = end_frame_idx / frame_rate
end_time_ms   = end_time_sec * 1000

For example, if [STREAMING_WORD] appears at token index 25:

end_frame_idx = 25 - 6 = 19
end_time_ms   = (19 / 12.5) * 1000 = 1520 ms

This computation is trivial and can be performed in the serving layer without any model modifications.
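
A minimal Python sketch of the formula, assuming the token index is counted the way the issue describes. The function name `word_end_ms` is hypothetical, and the issue does not say whether `streaming_n_left_pad_tokens` shifts the index, so that is left out here.

```python
def word_end_ms(token_idx: int, delay_tokens: int = 6, frame_rate: float = 12.5) -> int:
    """End time in ms of the word whose [STREAMING_WORD] token sits at token_idx.

    Hypothetical helper: defaults mirror the tekken.json values quoted above.
    """
    end_frame_idx = token_idx - delay_tokens
    if end_frame_idx < 0:
        # A marker inside the initial delay window has no valid audio position.
        raise ValueError("token index is smaller than the transcription delay")
    return round(end_frame_idx * 1000 / frame_rate)


print(word_end_ms(25))
```

For the worked example, `word_end_ms(25)` evaluates to 1520, matching the figure above.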

Proposed API Change

Option A: Add end_time_ms to delta events (minimal)

{
  "type": "transcription.delta",
  "delta": "world",
  "end_time_ms": 1520
}

Each delta carries a single timestamp: the end time of its last word. This is simple and low-overhead, but a delta containing several words exposes only the final word's boundary.
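
As a rough illustration of how a client might consume Option A, the sketch below turns a stream of delta events into timed text segments. The function name is hypothetical, and treating the previous delta's end time as the next segment's start is an assumption of this example (it folds any silence into the following segment); the proposal itself only defines end times.

```python
from typing import Dict, Iterable, List, Tuple


def segments_from_deltas(events: Iterable[Dict]) -> List[Tuple[int, int, str]]:
    """Return (start_ms, end_ms, text) tuples from Option A delta events."""
    segments = []
    prev_end = 0
    for event in events:
        if event.get("type") != "transcription.delta":
            continue  # ignore transcription.done and any other event types
        end = event.get("end_time_ms")
        if end is None:
            continue  # server did not attach a timestamp to this delta
        # Assumption: the segment starts where the previous one ended.
        segments.append((prev_end, end, event["delta"]))
        prev_end = end
    return segments


events = [
    {"type": "transcription.delta", "delta": "Hello", "end_time_ms": 960},
    {"type": "transcription.delta", "delta": " world", "end_time_ms": 1520},
    {"type": "transcription.done", "text": "Hello world."},
]
print(segments_from_deltas(events))
```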

Option B: Add word_timestamps array (detailed)

{
  "type": "transcription.delta",
  "delta": "Hello world",
  "word_timestamps": [
    {"word": "Hello", "end_ms": 960},
    {"word": "world", "end_ms": 1520}
  ]
}
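
One way the server side could assemble an Option B message is sketched below. Everything here is illustrative: `build_word_timestamps_delta` is a hypothetical name, the words are joined with single spaces for simplicity, and the token index 18 for "Hello" is back-derived from the 960 ms value in the example rather than taken from real model output.

```python
from typing import Dict, List, Tuple

FRAME_MS = 80      # 1000 / frame_rate (12.5)
DELAY_TOKENS = 6   # transcription_delay_ms (480) / FRAME_MS


def build_word_timestamps_delta(words: List[Tuple[str, int]]) -> Dict:
    """Build an Option B event from (word, streaming_word_token_idx) pairs."""
    stamps = [
        {"word": word, "end_ms": (token_idx - DELAY_TOKENS) * FRAME_MS}
        for word, token_idx in words
    ]
    return {
        "type": "transcription.delta",
        "delta": " ".join(word for word, _ in words),
        "word_timestamps": stamps,
    }


message = build_word_timestamps_delta([("Hello", 18), ("world", 25)])
print(message)
```

The last entry of `word_timestamps` carries the same value Option A would report as `end_time_ms`, so Option A can be derived from Option B but not the other way round.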

Backward compatibility

Both options only add fields, so existing clients that ignore unknown fields will continue to work unchanged. Clients that validate messages against a strict schema could still be affected, so timestamp inclusion could optionally be gated behind a session parameter (e.g., "include_timestamps": true in session.update).
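
The sketch below shows what opt-in and tolerant parsing might look like on the client. The placement of `include_timestamps` at the top level of the `session.update` message is an assumption made for illustration (the issue only names the parameter), and both function names are hypothetical.

```python
import json
from typing import Any, Dict


def session_update(include_timestamps: bool) -> str:
    """Serialize a session.update message (field placement is an assumption)."""
    return json.dumps(
        {"type": "session.update", "include_timestamps": include_timestamps}
    )


def read_delta(raw: str) -> Dict[str, Any]:
    """Parse a delta event, tolerating servers that send no timestamps."""
    event = json.loads(raw)
    return {
        "text": event.get("delta", ""),
        "end_time_ms": event.get("end_time_ms"),            # None if absent
        "word_timestamps": event.get("word_timestamps", []),  # empty if absent
    }


old_style = read_delta('{"type": "transcription.delta", "delta": "Hello world"}')
new_style = read_delta(
    '{"type": "transcription.delta", "delta": "world", "end_time_ms": 1520}'
)
print(old_style, new_style)
```

A client written this way behaves identically against a server with or without timestamp support.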

Implementation Notes

The change is localized to vLLM's realtime endpoint handler where generated tokens are decoded and streamed to the client:

  1. During token-by-token generation, detect [STREAMING_WORD] token IDs
  2. Record their position in the sequence
  3. Apply the delay offset: position - transcription_delay_tokens
  4. Convert to milliseconds: adjusted_position * 80
  5. Attach to the outgoing WebSocket message

No changes to the model, tokenizer, or audio encoder are required.
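
The five steps above can be sketched as follows. This is not vLLM code: the function names are hypothetical, the marker's token ID is passed in by the caller because the issue does not state it, and clamping negative positions to 0 is an assumption of this example.

```python
from typing import Dict, Iterable, List


def word_end_times_ms(
    token_ids: Iterable[int],
    streaming_word_id: int,
    delay_tokens: int = 6,
    frame_ms: int = 80,
) -> List[int]:
    """Steps 1-4: find [STREAMING_WORD] positions and convert them to ms."""
    end_times = []
    for position, token_id in enumerate(token_ids):
        if token_id != streaming_word_id:
            continue  # step 1: only marker tokens matter
        # Steps 2-3: record the position and apply the delay offset.
        adjusted = max(position - delay_tokens, 0)  # clamp: an assumption
        # Step 4: convert frames to milliseconds.
        end_times.append(adjusted * frame_ms)
    return end_times


def attach_timestamp(message: Dict, end_ms: int) -> Dict:
    """Step 5: return a copy of the outgoing message with end_time_ms set."""
    return {**message, "end_time_ms": end_ms}


# Toy sequence: 9 stands for any non-marker token, 1 for the marker.
toy_ids = [9] * 18 + [1] + [9] * 6 + [1]
times = word_end_times_ms(toy_ids, streaming_word_id=1)
print(attach_timestamp({"type": "transcription.delta", "delta": "world"}, times[-1]))
```

Keeping the position tracking separate from message construction means the same helper could feed either proposed event shape.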

extent analysis

TL;DR

To unlock critical downstream use cases, modify the vLLM Realtime API to include per-word timestamps in the WebSocket protocol by utilizing the [STREAMING_WORD] token positions and the known transcription_delay.

Guidance

  • Identify the position of each [STREAMING_WORD] token in the generated sequence and record their indices.
  • Apply the delay offset to each token position using the formula end_frame_idx = streaming_word_token_idx - transcription_delay_in_tokens.
  • Convert the adjusted position to milliseconds using the frame rate and attach the timestamp to the outgoing WebSocket message.
  • Consider implementing one of the proposed API changes, such as adding an end_time_ms field to delta events or including a word_timestamps array.

Example

{
  "type": "transcription.delta",
  "delta": "world",
  "end_time_ms": 1520
}

This example shows the addition of an end_time_ms field to a delta event, providing the timestamp for the last word in the delta.

Notes

The implementation of this change is localized to the vLLM realtime endpoint handler and does not require any modifications to the model, tokenizer, or audio encoder. The proposed API changes only add fields, so existing clients that ignore unknown fields continue to work unchanged.

Recommendation

Add per-word timestamps to the vLLM Realtime API in the serving layer. This is a feature addition, and it enables the downstream use cases listed under "Why this matters" without requiring any model changes.
