transformers - ✅(Solved) Fix Whisper word timestamp decode crashes on trailing replacement character at end of decoded token stream [4 pull requests, 1 comment, 2 participants]

transformers · 2026-03-20 01:25:49


GitHub stats

huggingface/transformers#44869 • Fetched 2026-04-08 01:03:22


Author

chromatic-descension

Participants

chromatic-descension

howardpen9

Timeline (top)

cross-referenced ×3 · mentioned ×2 · subscribed ×2 · commented ×1

Error Message

IndexError: string index out of range

Root Cause

_split_tokens_on_unicode() tries to read decoded_full[2] when len(decoded_full) == 2, which is one position past the last valid index.

PR fix notes

PR #1: fix: prevent IndexError in Whisper timestamp decode on trailing replacement char

Repository: Krishnachaitanyakc/transformers
Author: Krishnachaitanyakc
State: closed | merged: True
Link: https://github.com/Krishnachaitanyakc/transformers/pull/1

Description (problem / solution / changelog)

Fixes #44869

Summary

Adds a bounds check in _split_tokens_on_unicode() in src/transformers/models/whisper/tokenization_whisper.py to handle trailing Unicode replacement characters (U+FFFD) at the end of decoded token streams without crashing.

The Bug

When unicode_offset + decoded.index(replacement_char) equals len(decoded_full), an IndexError is raised because the code attempts to access decoded_full[target_index] at an out-of-bounds position.

The Fix

Pre-compute target_index = unicode_offset + decoded.index(replacement_char) and add a target_index >= len(decoded_full) guard that short-circuits before the out-of-bounds access.

Test

Manual verification that the bounds check correctly handles the edge case where the decoded token stream ends with a trailing Unicode replacement character.

AI Assistance Disclosure

This PR was developed with AI assistance. The fix has been manually reviewed and verified.

Changed files

src/transformers/models/whisper/tokenization_whisper.py (modified, +3/-1)

PR #45006: fix: prevent IndexError in Whisper timestamp decode on trailing replacement char

Repository: huggingface/transformers
Author: Krishnachaitanyakc
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45006

Description (problem / solution / changelog)

Summary

Fixes #44869

Adds a bounds check in _split_tokens_on_unicode() in tokenization_whisper.py to handle trailing Unicode replacement characters (U+FFFD) at the end of decoded token streams without crashing with IndexError.

Problem

When the decoded token stream ends with a dangling replacement character, the computed index unicode_offset + decoded.index(replacement_char) can equal len(decoded_full), causing an out-of-bounds string access.

Fix

Pre-compute target_index and add a target_index >= len(decoded_full) guard that short-circuits before the out-of-bounds access. When triggered, the trailing fragment is treated as a word boundary.
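
The guard described in this PR can be sketched as a standalone helper. This is a minimal illustration, not the submitted patch: the helper name is_word_boundary and its signature are hypothetical, and the surrounding "no replacement character means boundary" branch is an assumption about how the check is used. Only decoded_full, decoded, unicode_offset, replacement_char, and target_index come from the PR text.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the replacement character named in the PR


def is_word_boundary(decoded_full: str, decoded: str, unicode_offset: int) -> bool:
    """Hypothetical helper: decide whether `decoded` closes a word."""
    if REPLACEMENT_CHAR not in decoded:
        # Assumption: a clean decode always ends the current word.
        return True
    # Pre-compute the index once so it can be checked before it is used.
    target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
    if target_index >= len(decoded_full):
        # Trailing fragment at end of stream: treat it as a word boundary
        # instead of indexing past the end of decoded_full.
        return True
    # In range: the fragment is a boundary only if the full decode also
    # contains the replacement character at that position.
    return decoded_full[target_index] == REPLACEMENT_CHAR


# Edge case from the issue's reproduction: offset 2, string of length 2.
print(is_word_boundary("ab", REPLACEMENT_CHAR, 2))
```

Checking the bound before the comparison is what keeps the lookup from ever running with an out-of-range index.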

AI Assistance Disclosure

This PR was developed with AI assistance. The fix has been manually reviewed, verified for correctness, and tested against the reported edge case.

Test Plan

Verified bounds check correctly handles unicode_offset=298, len(decoded_full)=298 edge case
Confirmed Python ternary precedence is correct for the target_index computation
Ran ruff check with no issues

Changed files

src/transformers/models/whisper/tokenization_whisper.py (modified, +3/-1)

PR #45226: fix: handle trailing replacement character in Whisper word timestamp decoding

Repository: huggingface/transformers
Author: akhilc08
State: closed | merged: False
Link: https://github.com/huggingface/transformers/pull/45226

Description (problem / solution / changelog)

Summary

Fixes an IndexError: string index out of range crash in _split_tokens_on_unicode() when the decoded token stream ends with a dangling Unicode replacement character (U+FFFD)
Adds a bounds check so that when unicode_offset + decoded.index(replacement_char) >= len(decoded_full), the out-of-bounds access is avoided
The trailing replacement character token is still collected and flushed correctly

Closes #44869

Test plan

Verify that Whisper word-level timestamp decoding no longer crashes when the final token(s) decode to U+FFFD
Verify that normal (non-trailing-replacement-char) inputs produce identical results

🤖 Generated with Claude Code

Changed files

src/transformers/models/whisper/tokenization_whisper.py (modified, +3/-3)

PR #45435: do not index past decoded chars with special tokens

Repository: huggingface/transformers
Author: itazap
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45435

Description (problem / solution / changelog)

fixed https://github.com/huggingface/transformers/issues/44869

add check to not index past decoded chars with special tokens

Changed files

src/transformers/models/whisper/tokenization_whisper.py (modified, +1/-0)
tests/models/whisper/test_tokenization_whisper.py (modified, +29/-1)


System Info

OS: macOS
transformers: 5.3.0.dev0
Model: openai/whisper-medium.en

Reproduction

I hit an IndexError: string index out of range in Whisper word-timestamp decoding and traced it to src/transformers/models/whisper/tokenization_whisper.py.

The failing code path is in _split_tokens_on_unicode():

decoded_full[unicode_offset + decoded.index(replacement_char)]

The bug happens when the decoded token stream ends with a dangling Unicode replacement character (�, U+FFFD). In that case, the computed index can equal len(decoded_full), so the code reads one past the end of the string and crashes.

For the failing case I traced locally, the values were:

unicode_offset = 298
decoded.index(replacement_char) = 0
target_index = 298
len(decoded_full) = 298

So the effective access becomes:

decoded_full[298]

but the last valid index is 297.
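
The reported values can be replayed in a few lines. The string here is a stand-in (an assumption, since the real transcript is not included in the issue); only its length matters for the failure.

```python
decoded_full = "x" * 298   # stand-in text with the reported length
unicode_offset = 298       # value traced in the issue
index_in_decoded = 0       # decoded.index(replacement_char) from the issue
target_index = unicode_offset + index_in_decoded

try:
    decoded_full[target_index]
    outcome = "ok"
except IndexError as exc:
    outcome = str(exc)

last_valid_index = len(decoded_full) - 1
print(target_index, last_valid_index, outcome)
# → 298 297 string index out of range
```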

The underlying ASR output for the bad chunk decoded to a long run of musical note symbols followed by a dangling final replacement character (...🎵 🎵 🎵 🎵 🎵 �). Segment-level decoding succeeded, but word-level timestamp collation crashed in _split_tokens_on_unicode().
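
One plausible way such a dangling character arises is sketched below. The mechanism is an assumption, as the issue does not state it: if the token stream stops partway through a multi-byte UTF-8 character such as 🎵, decoding the incomplete bytes with errors="replace" produces U+FFFD at the end of the text.

```python
note = "\U0001F3B5"  # the musical note symbol quoted in the issue
raw = (note + " " + note).encode("utf-8")

# Simulate a stream that ends mid-character: drop the last two bytes
# of the final four-byte note.
truncated = raw[:-2]

text = truncated.decode("utf-8", errors="replace")
ends_with_replacement = text.endswith("\ufffd")
print(repr(text), ends_with_replacement)
```

The first note decodes cleanly, while the cut-off second note is replaced, which matches the shape of the output reported above.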

Error

IndexError: string index out of range

Expected behavior

Trailing incomplete Unicode fragments at EOF should be ignored or handled safely.
Whisper word timestamp decoding should not crash with IndexError.

Additional context

I have a local fix prepared for this EOF bounds case and can open a PR if this approach looks reasonable.

Who can help?

@ArthurZucker @itazap


Reproduction

A full end-to-end reproduction would need the original audio file (2m 14s of music with some vocals), but the underlying problem is simpler: it can be reproduced by calling the _split_tokens_on_unicode function directly with data the model could plausibly produce.

from collections import defaultdict
from transformers.models.whisper.tokenization_whisper import _split_tokens_on_unicode

class DummyTokenizer:
    def __init__(self):
        self.responses = defaultdict(list)

    def decode(self, tokens, decode_with_timestamps=False):
        key = tuple(tokens)
        if self.responses[key]:
            return self.responses[key].pop(0)

tokenizer = DummyTokenizer()
tokenizer.responses[(1, 2)] = ["ab"]   # decoded_full
tokenizer.responses[(1,)] = ["ab"]     # first token decodes cleanly
tokenizer.responses[(2,)] = ["�"]      # trailing replacement char at EOF

print(_split_tokens_on_unicode(tokenizer, [1, 2]))

Before the fix, this raises:

IndexError: string index out of range

Because it tries to read decoded_full[2] when len(decoded_full) == 2.

Expected behavior

Whisper word-timestamp decoding should safely ignore or stop on a trailing incomplete Unicode fragment at end-of-string, instead of crashing with IndexError: string index out of range.

extent analysis

Fix Plan

To fix the IndexError: string index out of range in Whisper word-timestamp decoding, the _split_tokens_on_unicode function needs to handle the case where the decoded token stream ends with a dangling Unicode replacement character.

Here are the steps:

Check if the computed index is within the bounds of the decoded_full string before attempting to access it.
If the index is out of range, ignore the trailing replacement character or handle it safely.

Code Changes

def _split_tokens_on_unicode(tokenizer, tokens):
    # Module-level function (no `self`), matching how the reproduction calls it.
    # ... (rest of the function remains the same)

    replacement_char = "\ufffd"  # U+FFFD
    # Full decode of all tokens, as labelled in the reproduction
    decoded_full = tokenizer.decode(tokens, decode_with_timestamps=False)
    unicode_offset = 0
    for token in tokens:
        decoded = tokenizer.decode([token], decode_with_timestamps=False)
        if replacement_char in decoded:
            # Compute the index once, then check it against the string length
            target_index = unicode_offset + decoded.index(replacement_char)
            if target_index < len(decoded_full):
                # In range: safe to read the character at the target index
                char = decoded_full[target_index]
                # ... (rest of the function remains the same)
            else:
                # Out of range: trailing replacement character at end of stream.
                # This sketch stops here; PR #45006 instead treats the
                # trailing fragment as a word boundary.
                break
        unicode_offset += len(decoded)

Verification

To verify that the fix worked, you can run the reproduction code with the modified _split_tokens_on_unicode method and check that it no longer raises an IndexError: string index out of range.
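
For a check that runs without installing transformers, the verification step can be approximated with a self-contained sketch. Everything here is illustrative: split_tokens_on_unicode_sketch is a hypothetical re-implementation under the assumptions stated in its comments, not the upstream function, and DummyTokenizer is a simplified version of the one in the reproduction.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD


class DummyTokenizer:
    """Stand-in tokenizer returning canned strings, as in the reproduction."""

    def __init__(self, responses):
        self.responses = responses

    def decode(self, tokens, decode_with_timestamps=False):
        return self.responses[tuple(tokens)]


def split_tokens_on_unicode_sketch(tokenizer, tokens):
    """Hypothetical re-implementation with the bounds check applied."""
    decoded_full = tokenizer.decode(tokens)
    words, word_tokens = [], []
    current_tokens = []
    unicode_offset = 0

    for token in tokens:
        # Assumption: tokens accumulate until they decode to valid text.
        current_tokens.append(token)
        decoded = tokenizer.decode(current_tokens)

        if REPLACEMENT_CHAR in decoded:
            target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
            # `or` short-circuits, so decoded_full is never indexed out of range.
            boundary = (
                target_index >= len(decoded_full)
                or decoded_full[target_index] == REPLACEMENT_CHAR
            )
        else:
            boundary = True

        if boundary:
            words.append(decoded)
            word_tokens.append(current_tokens)
            current_tokens = []
            unicode_offset += len(decoded)

    return words, word_tokens


# Same canned data as the reproduction: "ab" followed by a trailing U+FFFD.
tokenizer = DummyTokenizer({(1, 2): "ab", (1,): "ab", (2,): REPLACEMENT_CHAR})
words, word_tokens = split_tokens_on_unicode_sketch(tokenizer, [1, 2])
print(words, word_tokens)
```

With the guard in place the trailing fragment is collected as its own entry rather than raising, which is the behavior PR #45006 and PR #45226 describe.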

Extra Tips

When working with Unicode strings, it's essential to consider edge cases like trailing replacement characters.
Using bounds checking can help prevent IndexError exceptions and make your code more robust.


