transformers - ✅(Solved) Fix Whisper word timestamp decode crashes on trailing replacement character at end of decoded token stream [4 pull requests, 1 comment, 2 participants]

GitHub stats
huggingface/transformers#44869 · Fetched 2026-04-08 01:03:22
Comments: 1 · Participants: 2 · Timeline events: 10 · Reactions: 0
Timeline (top): cross-referenced ×3, mentioned ×2, subscribed ×2, commented ×1

Error Message

IndexError: string index out of range

Root Cause

In the minimal reproduction, the code reads decoded_full[2] when len(decoded_full) == 2, which is one position past the end of the string.

PR fix notes

PR #1: fix: prevent IndexError in Whisper timestamp decode on trailing replacement char

Description (problem / solution / changelog)

Fixes #44869

Summary

Adds a bounds check in _split_tokens_on_unicode() in src/transformers/models/whisper/tokenization_whisper.py to handle trailing Unicode replacement characters (U+FFFD) at the end of decoded token streams without crashing.

The Bug

When unicode_offset + decoded.index(replacement_char) equals len(decoded_full), an IndexError is raised because the code attempts to access decoded_full[target_index] at an out-of-bounds position.

The Fix

Pre-compute target_index = unicode_offset + decoded.index(replacement_char) and add a target_index >= len(decoded_full) guard that short-circuits before the out-of-bounds access.
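The guard described above can be sketched as a standalone helper. This is not the library's code: the name `is_word_boundary` and its boolean return are hypothetical, and treating an out-of-range index as a boundary is an assumption taken from the PR descriptions on this page.

```python
REPLACEMENT_CHAR = "\ufffd"


def is_word_boundary(decoded_full: str, decoded: str, unicode_offset: int) -> bool:
    """Hypothetical helper: decide whether `decoded` can be flushed as a word."""
    if REPLACEMENT_CHAR not in decoded:
        return True
    # Compute the index once so it can be checked before it is used.
    target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
    # Guard: a trailing fragment can point one past the end of decoded_full.
    if target_index >= len(decoded_full):
        return True
    return decoded_full[target_index] == REPLACEMENT_CHAR


# The edge case from the issue, scaled down: the index equals len(decoded_full).
print(is_word_boundary("ab", "\ufffd", 2))
```

Pre-computing `target_index` keeps the comparison and the indexing on the same value, so the check cannot drift from the access it protects.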

Test

Manual verification that the bounds check correctly handles the edge case where the decoded token stream ends with a trailing Unicode replacement character.

AI Assistance Disclosure

This PR was developed with AI assistance. The fix has been manually reviewed and verified.

Changed files

  • src/transformers/models/whisper/tokenization_whisper.py (modified, +3/-1)

PR #45006: fix: prevent IndexError in Whisper timestamp decode on trailing replacement char

Description (problem / solution / changelog)

Summary

Fixes #44869

Adds a bounds check in _split_tokens_on_unicode() in tokenization_whisper.py to handle trailing Unicode replacement characters (U+FFFD) at the end of decoded token streams without crashing with IndexError.

Problem

When the decoded token stream ends with a dangling replacement character, the computed index unicode_offset + decoded.index(replacement_char) can equal len(decoded_full), causing an out-of-bounds string access.

Fix

Pre-compute target_index and add a target_index >= len(decoded_full) guard that short-circuits before the out-of-bounds access. When triggered, the trailing fragment is treated as a word boundary.

AI Assistance Disclosure

This PR was developed with AI assistance. The fix has been manually reviewed, verified for correctness, and tested against the reported edge case.

Test Plan

  • Verified bounds check correctly handles unicode_offset=298, len(decoded_full)=298 edge case
  • Confirmed Python ternary precedence is correct for the target_index computation
  • Ran ruff check with no issues

Changed files

  • src/transformers/models/whisper/tokenization_whisper.py (modified, +3/-1)

PR #45226: fix: handle trailing replacement character in Whisper word timestamp decoding

Description (problem / solution / changelog)

Summary

  • Fixes an IndexError: string index out of range crash in _split_tokens_on_unicode() when the decoded token stream ends with a dangling Unicode replacement character (U+FFFD)
  • Adds a bounds check so that when unicode_offset + decoded.index(replacement_char) >= len(decoded_full), the out-of-bounds access is avoided
  • The trailing replacement character token is still collected and flushed correctly

Closes #44869

Test plan

  • Verify that Whisper word-level timestamp decoding no longer crashes when the final token(s) decode to U+FFFD
  • Verify that normal (non-trailing-replacement-char) inputs produce identical results

🤖 Generated with Claude Code

Changed files

  • src/transformers/models/whisper/tokenization_whisper.py (modified, +3/-3)

PR #45435: do not index past decoded chars with special tokens

Description (problem / solution / changelog)

fixed https://github.com/huggingface/transformers/issues/44869

add check to not index past decoded chars with special tokens

Changed files

  • src/transformers/models/whisper/tokenization_whisper.py (modified, +1/-0)
  • tests/models/whisper/test_tokenization_whisper.py (modified, +29/-1)

Code Example

decoded_full[unicode_offset + decoded.index(replacement_char)]

---

decoded_full[298]

---

IndexError: string index out of range

---

from collections import defaultdict
from transformers.models.whisper.tokenization_whisper import _split_tokens_on_unicode

class DummyTokenizer:
    def __init__(self):
        self.responses = defaultdict(list)

    def decode(self, tokens, decode_with_timestamps=False):
        key = tuple(tokens)
        if self.responses[key]:
            return self.responses[key].pop(0)

tokenizer = DummyTokenizer()
tokenizer.responses[(1, 2)] = ["ab"]   # decoded_full
tokenizer.responses[(1,)] = ["ab"]     # first token decodes cleanly
tokenizer.responses[(2,)] = ["�"]      # trailing replacement char at EOF

print(_split_tokens_on_unicode(tokenizer, [1, 2]))

---

IndexError: string index out of range

System Info

  • OS: macOS
  • transformers: 5.3.0.dev0
  • Model: openai/whisper-medium.en

Reproduction

I hit an IndexError: string index out of range in Whisper word-timestamp decoding and traced it to src/transformers/models/whisper/tokenization_whisper.py.

The failing code path is in _split_tokens_on_unicode():

decoded_full[unicode_offset + decoded.index(replacement_char)]

The bug happens when the decoded token stream ends with a dangling Unicode replacement character (�, U+FFFD). In that case, the computed index can equal len(decoded_full), so the code reads one past the end of the string and crashes.

For the failing case I traced locally, the values were:

  • unicode_offset = 298
  • decoded.index(replacement_char) = 0
  • target_index = 298
  • len(decoded_full) = 298

So the effective access becomes:

decoded_full[298]

but the last valid index is 297.
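The arithmetic above can be checked with plain Python strings. The snippet below uses a stand-in string of the reported length rather than the real decoded text, so only the lengths and indices match the traced case.

```python
decoded_full = "x" * 298           # stand-in with the reported length, not the real text
unicode_offset = 298
target_index = unicode_offset + 0  # decoded.index(replacement_char) was 0

message = None
try:
    decoded_full[target_index]     # same access pattern as the failing line
except IndexError as exc:
    message = str(exc)

print(target_index == len(decoded_full), message)
```

Indexing a string at exactly its length is always out of range in Python, which is why the crash appears only when the replacement character is the very last thing in the stream.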

The underlying ASR output for the bad chunk decoded to a long run of musical note symbols followed by a dangling final replacement character (...🎵 🎵 🎵 🎵 🎵 �). Segment-level decoding succeeded, but word-level timestamp collation crashed in _split_tokens_on_unicode().
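One plausible way such a dangling character arises: a musical note emoji occupies several UTF-8 bytes, and if the final token carries only some of them, decoding with replacement yields U+FFFD. The sketch below uses plain byte strings, not Whisper's tokenizer, and assumes the tokenizer decodes bytes with `errors="replace"`.

```python
note = "🎵"
raw = note.encode("utf-8")   # U+1F3B5 encodes to four bytes

# All four bytes present: the character round-trips.
complete = raw.decode("utf-8", errors="replace")

# Only the first two bytes present, as if the stream was cut mid-character.
truncated = raw[:2].decode("utf-8", errors="replace")

print(len(raw), complete == note, "\ufffd" in truncated)
```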

Error

IndexError: string index out of range

Expected behavior

  • trailing incomplete Unicode fragments at EOF should be ignored or handled safely
  • Whisper word timestamp decoding should not crash with IndexError

Additional context

I have a local fix prepared for this EOF bounds case and can open a PR if this approach looks reasonable.

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

A full end-to-end reproduction would require the original audio file (2m 14s of music with some vocals), but the underlying problem is simpler: it can be reproduced by calling the _split_tokens_on_unicode function directly with data the model could plausibly produce.

from collections import defaultdict
from transformers.models.whisper.tokenization_whisper import _split_tokens_on_unicode

class DummyTokenizer:
    def __init__(self):
        self.responses = defaultdict(list)

    def decode(self, tokens, decode_with_timestamps=False):
        key = tuple(tokens)
        if self.responses[key]:
            return self.responses[key].pop(0)

tokenizer = DummyTokenizer()
tokenizer.responses[(1, 2)] = ["ab"]   # decoded_full
tokenizer.responses[(1,)] = ["ab"]     # first token decodes cleanly
tokenizer.responses[(2,)] = ["�"]      # trailing replacement char at EOF

print(_split_tokens_on_unicode(tokenizer, [1, 2]))

Before the fix, this raises:

IndexError: string index out of range

This happens because the code reads decoded_full[2] when len(decoded_full) == 2.

Expected behavior

Whisper word-timestamp decoding should safely ignore or stop on a trailing incomplete Unicode fragment at end-of-string, instead of crashing with IndexError: string index out of range.

extent analysis

Fix Plan

To fix the IndexError: string index out of range issue in Whisper word-timestamp decoding, the _split_tokens_on_unicode function needs to handle the case where the decoded token stream ends with a dangling Unicode replacement character.

Here are the steps:

  • Check if the computed index is within the bounds of the decoded_full string before attempting to access it.
  • If the index is out of range, ignore the trailing replacement character or handle it safely.

Code Changes

def _split_tokens_on_unicode(tokenizer, tokens):
    # ... (decoded_full, replacement_char, and the rest of the setup remain the same)

    unicode_offset = 0
    for token in tokens:
        decoded = tokenizer.decode([token], decode_with_timestamps=False)
        if replacement_char in decoded:
            # Compute the index once and check it before indexing
            target_index = unicode_offset + decoded.index(replacement_char)
            if target_index < len(decoded_full):
                # Safe: the index is inside decoded_full
                char = decoded_full[target_index]
                # ... (rest of the function remains the same)
            else:
                # The index is one past the end: a trailing replacement character
                # For example, ignore it and stop
                break
        unicode_offset += len(decoded)

Verification

To verify the fix, run the reproduction code against the modified _split_tokens_on_unicode function and check that it no longer raises IndexError: string index out of range.
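The verification can also be sketched without installing a patched transformers build. The function `split_tokens_guarded` below is a hypothetical stand-in, not the library source: its loop structure is an assumption reconstructed from the issue description, and it flushes the trailing fragment as its own word, which is one of the behaviors the PRs describe.

```python
from collections import defaultdict

REPLACEMENT_CHAR = "\ufffd"


class DummyTokenizer:
    """Same stub as in the reproduction: canned strings per token tuple."""

    def __init__(self):
        self.responses = defaultdict(list)

    def decode(self, tokens, decode_with_timestamps=False):
        key = tuple(tokens)
        if self.responses[key]:
            return self.responses[key].pop(0)


def split_tokens_guarded(tokenizer, tokens):
    """Hypothetical stand-in for the patched function (assumed structure)."""
    decoded_full = tokenizer.decode(tokens)
    words, word_tokens, current = [], [], []
    unicode_offset = 0
    for token in tokens:
        current.append(token)
        decoded = tokenizer.decode(current)
        if REPLACEMENT_CHAR in decoded:
            target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
            in_bounds = target_index < len(decoded_full)
            if in_bounds and decoded_full[target_index] != REPLACEMENT_CHAR:
                continue  # incomplete character: keep accumulating tokens
        # Either a clean decode or a trailing fragment: flush as a word.
        words.append(decoded)
        word_tokens.append(current)
        current = []
        unicode_offset += len(decoded)
    return words, word_tokens


tokenizer = DummyTokenizer()
tokenizer.responses[(1, 2)] = ["ab"]      # decoded_full
tokenizer.responses[(1,)] = ["ab"]        # first token decodes cleanly
tokenizer.responses[(2,)] = ["\ufffd"]    # trailing replacement char at EOF

result = split_tokens_guarded(tokenizer, [1, 2])
print(result)
```

With the same canned responses as the reproduction, the call completes instead of raising, and the trailing fragment ends up as a separate final entry.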

Extra Tips

  • When working with Unicode strings, it's essential to consider edge cases like trailing replacement characters.
  • Using bounds checking can help prevent IndexError exceptions and make your code more robust.


