vLLM [Bug] (solved): Qwen3.5 trust_remote_code loading intermittently fails with KeyError 'qwen3_5_moe' in worker init [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#40249 · Fetched 2026-04-19 15:04:46
Comments: 0 · Participants: 1 · Timeline events: 1 (labeled ×1) · Reactions: 0

Intermittent KeyError: 'qwen3_5_moe' in transformers.CONFIG_MAPPING at vLLM worker startup when loading Qwen/Qwen3.5-397B-A17B with --trust-remote-code, even though the same sweep's other concurrency levels of the same model load successfully.

Error Message

The failing worker subprocess exits during init while resolving the tokenizer. The underlying cause is a KeyError: 'qwen3_5_moe' raised from transformers.CONFIG_MAPPING. The traceback is not reproduced on this page; the issue attaches it as server_log.txt.

Root Cause

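According to the notes for PR #40299, vLLM parses the model config successfully, but worker-side tokenizer init (cached_tokenizer_from_config()) reads the tokenizer config again. At that point the already-parsed HF config class has not been registered with AutoConfig, so the lookup of 'qwen3_5_moe' in transformers.CONFIG_MAPPING can raise KeyError. Neither the issue nor the PR notes explain why the failure is intermittent: other concurrency levels in the same sweep load the same model successfully.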

PR fix notes

PR #40299: Register parsed config classes before tokenizer init

Description (problem / solution / changelog)

Summary

Register the already-parsed HF config class with AutoConfig before cached_tokenizer_from_config() reads the tokenizer config again.

This fixes #40249, where worker-side tokenizer init could still fail on qwen3_5_moe even though vLLM had already parsed the model config successfully.
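
The ordering described above can be illustrated with a toy model. This is a sketch under stated assumptions, not vLLM's or transformers' code: `CONFIG_REGISTRY`, `ToyMoeConfig`, `register_config`, `init_tokenizer`, and `worker_init` are all hypothetical names, with a plain dict standing in for `transformers.CONFIG_MAPPING`. In the real library, the corresponding registration call is `AutoConfig.register(model_type, config_class)`.

```python
# Toy model of the ordering issue; none of these names are vLLM or transformers APIs.
CONFIG_REGISTRY: dict[str, type] = {}  # stands in for transformers.CONFIG_MAPPING


class ToyMoeConfig:
    """Stand-in for a config class that the parser has already resolved."""

    model_type = "toy_moe"


def register_config(config_cls: type) -> None:
    """Make the parsed class visible to later lookups by model_type."""
    CONFIG_REGISTRY.setdefault(config_cls.model_type, config_cls)


def init_tokenizer(model_type: str) -> str:
    """Stand-in for tokenizer init, which resolves the config class again."""
    config_cls = CONFIG_REGISTRY[model_type]  # KeyError if never registered
    return f"tokenizer for {config_cls.__name__}"


def worker_init(register_first: bool) -> str:
    """Simulate one worker: fresh registry, parsed config, then tokenizer init."""
    CONFIG_REGISTRY.clear()  # each simulated worker starts with an empty registry
    parsed_cls = ToyMoeConfig  # the model config itself was parsed successfully
    if register_first:
        register_config(parsed_cls)  # the ordering the PR summary describes
    return init_tokenizer(parsed_cls.model_type)
```

Calling `worker_init(True)` succeeds, while `worker_init(False)` raises `KeyError: 'toy_moe'`, mirroring the shape of the reported failure. The toy does not attempt to model why the real failure is intermittent.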

Why this is not a duplicate

Checked before posting:

  • gh issue view 40249 --repo vllm-project/vllm --comments
  • gh pr list --repo vllm-project/vllm --state open --search "40249 in:body"
  • gh pr list --repo vllm-project/vllm --state open --search "qwen3_5_moe tokenizer trust_remote_code worker init"

Those checks did not find an open PR for this lane. This is also different from #39554: that change keeps HFConfigParser.parse() on the right config class, while this patch fixes the later tokenizer init path inside workers.

Tests

  • uvx ruff check vllm/transformers_utils/config.py vllm/tokenizers/registry.py tests/tokenizers_/test_registry.py -> passed
  • uv run --isolated python -m py_compile vllm/transformers_utils/config.py vllm/tokenizers/registry.py tests/tokenizers_/test_registry.py -> passed
  • uv run --isolated --with pytest python - with direct calls to test_customized_tokenizer() and test_cached_tokenizer_from_config_registers_local_config(...) -> PASS

AI assistance

I used AI assistance for issue triage and an initial patch/test draft, then reviewed the final diff and verified it locally.

Changed files

  • tests/tokenizers_/test_registry.py (modified, +64/-0)
  • vllm/tokenizers/registry.py (modified, +3/-1)
  • vllm/transformers_utils/config.py (modified, +17/-2)

Code Example

Environment (from the issue's collect_env.py output):

vLLM version:           0.19.1rc1.dev216+g17e787a77.cu130
Python:                 3.12
CUDA (vLLM build):      13.0
GPU:                    8xNVIDIA GB300
CPU:                    aarch64

Model:                  Qwen/Qwen3.5-397B-A17B
trust_remote_code:      True
dtype:                  bfloat16
max_model_len:          2176
tensor_parallel_size:   4
pipeline_parallel_size: 1
data_parallel_size:     1
enable_expert_parallel: True
language_model_only:    True
attention_backend:      FLASHINFER
quantization:           None (bf16)

Reproduce command:

VLLM_USE_FLASHINFER_MOE_FP16=1 \
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3.5-397B-A17B \
    --tensor-parallel-size 4 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --max-model-len 2176 \
    --no-enable-prefix-caching \
    --language-model-only \
    --async-scheduling \
    --attention-backend FLASHINFER \
    --enable-expert-parallel \
    --compilation_config.max_cudagraph_capture_size 2048 \
    --host 0.0.0.0 --port 60000

Server log

server_log.txt


extent analysis

TL;DR

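The most likely fix is the change in PR #40299, which registers the already-parsed HF config class with AutoConfig before worker-side tokenizer init. Because the same model loads successfully at other concurrency levels of the same sweep, a transformers release that simply lacks 'qwen3_5_moe' in CONFIG_MAPPING is an unlikely explanation, though it is cheap to rule out.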

Guidance

  • Confirm the failure pattern first: the issue reports an intermittent KeyError in a worker subprocess during tokenizer init, not a consistent load failure.
  • Rule out a version mismatch by checking whether 'qwen3_5_moe' is present in transformers.CONFIG_MAPPING in the serving environment. If it is missing everywhere, a transformers upgrade may be needed independently of this bug.
  • If the key is present or the model sometimes loads, try a vLLM build that includes PR #40299, which targets the worker-side tokenizer init path.
  • Treat setting trust_remote_code to False as a diagnostic only. It changes how the config is resolved, and nothing in the issue or PR shows that it works around the failure.

Example

The issue ships no code snippet beyond the reproduce command. The failure is a registry lookup inside transformers during worker init, not an error in user code.
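
To rule out a plain version mismatch, a check along these lines may help. It is a minimal sketch: `config_mapping_has` is a hypothetical helper name, and the only library object it relies on is `transformers.CONFIG_MAPPING`, the mapping named in the error. It returns `None` when transformers cannot be imported, so it can be run in any environment.

```python
import importlib.util


def config_mapping_has(model_type: str):
    """Return True or False if transformers is importable, otherwise None."""
    if importlib.util.find_spec("transformers") is None:
        return None  # transformers is not installed in this environment
    try:
        from transformers import CONFIG_MAPPING
    except ImportError:
        return None  # installed but not importable (e.g. a broken install)
    # Membership test only; this does not load any model or tokenizer.
    return model_type in CONFIG_MAPPING


if __name__ == "__main__":
    print(config_mapping_has("qwen3_5_moe"))
```

Per the PR notes, the failure occurs in worker-side tokenizer init, so a `True` result in an interactive process does not by itself rule out the bug. A `False` result points to the installed transformers version rather than to the ordering issue.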

Notes

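The report covers only Qwen/Qwen3.5-397B-A17B on the listed vLLM build; the source does not say whether other trust_remote_code models are affected. The PR notes distinguish this fix from #39554, which keeps HFConfigParser.parse() on the right config class, whereas #40299 addresses the later tokenizer init path inside workers.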

Recommendation

Apply the fix from PR #40299, or move to a vLLM build that includes it. Setting trust_remote_code to False is useful only as a diagnostic to see whether the failure persists; the source does not establish it as a workaround.
