transformers - ✅(Solved) Fix Unexpected behaviour of helper function `_get_feat_extract_output_lengths` in qwen3_omni_moe [2 pull requests, 2 comments, 3 participants]

GitHub stats
huggingface/transformers#45083 (fetched 2026-04-08 01:45:20)
Comments: 2
Participants: 3
Timeline: 11
Reactions: 0
Timeline (top): mentioned ×3, subscribed ×3, commented ×2, cross-referenced ×2

Fix Action

Fixed

PR fix notes

PR #45088: fix audio encoder output length formula in qwen3_omni_moe

Description (problem / solution / changelog)

Corrects the conv output-length calculation in _get_feat_extract_output_lengths, which was computing wrong values for the audio encoder. Fixes #45083.

Changed files

  • src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py (modified, +2/-3)
  • src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py (modified, +2/-3)
  • src/transformers/models/qwen3_omni_moe/processing_qwen3_omni_moe.py (modified, +2/-3)

PR #45091: Fix _get_feat_extract_output_lengths in qwen3_omni_moe

Description (problem / solution / changelog)

This PR fixes the unexpected behaviour of helper function _get_feat_extract_output_lengths in qwen3_omni_moe as reported in #45083.

Problem

The current implementation incorrectly calculates the output length of the convolutional layers by:

  1. Taking modulo 100 of input lengths
  2. Adding a correction factor of (input_lengths // 100) * 13

This does not align with the official PyTorch Conv2d formula.
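To make the discrepancy concrete, the sketch below evaluates both formulas on a few lengths. The names `old_len` and `new_len` are hypothetical labels for the two implementations quoted on this page, and plain Python integers are assumed as input.

```python
def old_len(n):
    """Formula before the PRs: modulo 100, plus 13 per full block of 100."""
    leave = n % 100
    feat = (leave - 1) // 2 + 1
    return ((feat - 1) // 2 + 1 - 1) // 2 + 1 + (n // 100) * 13


def new_len(n):
    """Formula proposed in the issue: three stride-2 reductions of the full length."""
    feat = (n - 1) // 2 + 1
    return ((feat - 1) // 2 + 1 - 1) // 2 + 1


for n in (100, 150, 200, 300):
    print(n, old_len(n), new_len(n))
# 100 13 13
# 150 20 19
# 200 26 25
# 300 39 38
```

The two formulas agree at 100 and diverge for the longer inputs shown here.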

Fix

Updated the function to correctly calculate the output length based on the PyTorch Conv2d formula:

  • For Conv2d with kernel_size=3, stride=2, padding=1: output = (input - 1) // 2 + 1
  • Applied sequentially for the 3 conv layers in the audio encoder
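As a sketch of where `(input - 1) // 2 + 1` comes from: the output-size formula in the PyTorch Conv2d documentation, applied along one dimension with kernel_size=3, stride=2, padding=1, reduces to that expression. Here dilation=1 is an assumption, and the helper names are hypothetical rather than part of transformers.

```python
def conv_out_len(n, kernel_size=3, stride=2, padding=1, dilation=1):
    """Output length along one dimension, per the PyTorch Conv2d docs formula."""
    return (n + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def three_conv_out_len(n):
    """Apply the stride-2 convolution length formula three times in sequence."""
    for _ in range(3):
        n = conv_out_len(n)
    return n


# For these hyperparameters the general formula equals the shortcut from the PR.
for n in range(1, 1000):
    assert conv_out_len(n) == (n - 1) // 2 + 1

print(three_conv_out_len(100))  # 100 -> 50 -> 25 -> 13
```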

Files Changed

  • src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py
  • src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py
  • src/transformers/models/qwen3_omni_moe/processing_qwen3_omni_moe.py

Fixes #45083

Changed files

  • src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py (modified, +2/-3)
  • src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py (modified, +2/-3)
  • src/transformers/models/qwen3_omni_moe/processing_qwen3_omni_moe.py (modified, +2/-3)

Code Example (current implementation first, expected implementation after the separator)

def _get_feat_extract_output_lengths(input_lengths):
    """
    Computes the output length of the convolutional layers and the output length of the audio encoder
    """

    input_lengths_leave = input_lengths % 100
    feat_lengths = (input_lengths_leave - 1) // 2 + 1
    output_lengths = ((feat_lengths - 1) // 2 + 1 - 1) // 2 + 1 + (input_lengths // 100) * 13
    return output_lengths

---

def _get_feat_extract_output_lengths(input_lengths):
    """
    Computes the output length of the convolutional layers and the output length of the audio encoder
    """

    feat_lengths = (input_lengths - 1) // 2 + 1
    output_lengths = ((feat_lengths - 1) // 2 + 1 - 1) // 2 + 1
    return output_lengths

System Info

  • transformers version: 5.0.0
  • Platform: Linux-6.6.113+-x86_64-with-glibc2.35
  • Python version: 3.12.13
  • Huggingface_hub version: 1.7.1
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cpu (NA)
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The function defined at https://github.com/huggingface/transformers/blob/9a9997fd73c5eb29fb3677d3c489f5d3cd0765f6/src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py#L117 computes the output length of the audio encoder, but its implementation does not match the output-size formula in the PyTorch Conv2d documentation. The audio encoder convolutions are defined at https://github.com/huggingface/transformers/blob/9a9997fd73c5eb29fb3677d3c489f5d3cd0765f6/src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py#L871.

Expected behavior

Current implementation is

def _get_feat_extract_output_lengths(input_lengths):
    """
    Computes the output length of the convolutional layers and the output length of the audio encoder
    """

    input_lengths_leave = input_lengths % 100
    feat_lengths = (input_lengths_leave - 1) // 2 + 1
    output_lengths = ((feat_lengths - 1) // 2 + 1 - 1) // 2 + 1 + (input_lengths // 100) * 13
    return output_lengths

and the expected implementation is

def _get_feat_extract_output_lengths(input_lengths):
    """
    Computes the output length of the convolutional layers and the output length of the audio encoder
    """

    feat_lengths = (input_lengths - 1) // 2 + 1
    output_lengths = ((feat_lengths - 1) // 2 + 1 - 1) // 2 + 1
    return output_lengths
(Image attached to the original issue: https://github.com/user-attachments/assets/f85260b3-d698-459a-a5ec-26faf85d899b)

extent analysis

Fix Plan

To fix the issue, update _get_feat_extract_output_lengths so that it applies the Conv2d output-length formula to the full input length, without the modulo-100 split.

Here are the steps:

  • In modular_qwen3_omni_moe.py, replace the existing _get_feat_extract_output_lengths with the corrected implementation.
  • The corrected function should be:
def _get_feat_extract_output_lengths(input_lengths):
    """
    Computes the output length of the convolutional layers and the output length of the audio encoder
    """

    feat_lengths = (input_lengths - 1) // 2 + 1
    output_lengths = ((feat_lengths - 1) // 2 + 1 - 1) // 2 + 1 
    return output_lengths
  • Apply the same replacement in modeling_qwen3_omni_moe.py and processing_qwen3_omni_moe.py, which both PRs above also modify.

Verification

To verify that the fix worked, you can test the updated function with different input lengths and compare the output with the expected results.

You can add test cases like this:

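input_lengths = [100, 200, 300]
# Three stride-2 reductions: 100 -> 50 -> 25 -> 13, 200 -> 100 -> 50 -> 25, 300 -> 150 -> 75 -> 38.
# (The previous formula returned 13, 26, 39 for these inputs.)
expected_output_lengths = [13, 25, 38]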

for input_length, expected_output_length in zip(input_lengths, expected_output_lengths):
    output_length = _get_feat_extract_output_lengths(input_length)
    assert output_length == expected_output_length, f"Expected output length {expected_output_length} but got {output_length}"

If all test cases pass, the updated function matches the Conv2d output-length formula for these inputs.
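Beyond a handful of fixed cases, the closed form can be cross-checked against a brute-force count of kernel positions. The sketch below assumes kernel_size=3, stride=2, padding=1 for each of the three layers, as stated in the PR notes. All function names are hypothetical.

```python
def brute_force_out_len(n, kernel_size=3, stride=2, padding=1):
    """Count the kernel start positions that fit inside the zero-padded input."""
    padded = n + 2 * padding
    return len(range(0, padded - kernel_size + 1, stride))


def brute_force_three_layers(n):
    """Chain the brute-force count through three identical layers."""
    for _ in range(3):
        n = brute_force_out_len(n)
    return n


def closed_form(n):
    """Three stride-2 reductions, as in the corrected function."""
    for _ in range(3):
        n = (n - 1) // 2 + 1
    return n


# The closed form and the brute-force count should agree for every length.
for n in range(1, 2000):
    assert closed_form(n) == brute_force_three_layers(n)
```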

Extra Tips

Once a release containing the fix is available, update transformers to that version. It is also good practice to write unit tests for critical helpers such as _get_feat_extract_output_lengths so that regressions are caught early.


