vllm: Fix KimiK25ForConditionalGeneration failed to be inspected (SIGSEGV in registry subprocess during process exit, GB200)

vllm-project/vllm#40642 (fetched 2026-04-23 07:23:39)

Error Message

pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig Value error, Model architectures ['KimiK25ForConditionalGeneration'] failed to be inspected. Please check the logs for more details.

Root Cause

Root cause: nixl-cu12 1.0.1

Problem

vLLM's model registry inspection subprocess (python3 -m vllm.model_executor.models.registry) crashes with SIGSEGV (signal 11) during process exit, after the inspection has already completed successfully and written its output. The parent process sees exit code 139 (128 + 11, i.e. termination by signal 11), raises CalledProcessError, and reports a misleading error:

pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
  Value error, Model architectures ['KimiK25ForConditionalGeneration'] failed to be inspected.
  Please check the logs for more details.

The inspection did not fail — the subprocess wrote correct model metadata to its output file. The crash occurs during Python interpreter shutdown when native extension cleanup handlers (atexit / static destructors) run.
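How a signal death surfaces in the parent can be sketched in a few lines. This is an illustrative stand-in, not vLLM code: describe_exit and run_crashing_child are hypothetical helpers, and the child simply sends itself SIGSEGV after printing its result. Note that Python's subprocess module reports death by signal as a negative return code (-11 for SIGSEGV), while shells and container runtimes usually show the same event as 139.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a human-readable description."""
    if returncode == 0:
        return "clean exit"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name} (shell-style exit status {128 + sig.value})"
    return f"exited with status {returncode}"


def run_crashing_child() -> subprocess.CompletedProcess:
    """Spawn a child that finishes its work, then dies with SIGSEGV (hypothetical stand-in)."""
    child_code = (
        "import os, signal\n"
        "print('work finished', flush=True)\n"
        "os.kill(os.getpid(), signal.SIGSEGV)\n"
    )
    return subprocess.run([sys.executable, "-c", child_code], capture_output=True)


if __name__ == "__main__":
    completed = run_crashing_child()
    print(completed.stdout.decode().strip())
    print(describe_exit(completed.returncode))
    try:
        completed.check_returncode()
    except subprocess.CalledProcessError as exc:
        # This is the exception the parent turns into "failed to be inspected".
        print(f"check_returncode() raised: {exc}")
```

The child's output is intact even though check_returncode() raises, which mirrors the situation described above.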

Environment

  • vLLM: v0.19.0 (also reproduces on v0.19.1)
  • Hardware: NVIDIA GB200 (Blackwell), 4× GPU, aarch64
  • Model: Kimi-K2.5 (KimiK25ForConditionalGeneration) with trust_remote_code=True
  • Python: 3.10
  • CUDA: 12.9.1
  • PyTorch: 2.10.0
  • nixl-cu12: 1.0.1 (does NOT reproduce with 1.0.0 — see "Workarounds" below)

GDB Backtrace

The crash occurs in UCX's topology cleanup, called from libc's exit():

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x0000ffff66c16dcc in ucs_topo_release_devices () at sys/topo/base/topo.c:1015

#0  ucs_topo_release_devices () at sys/topo/base/topo.c:1015
#1  ucs_topo_cleanup () at sys/topo/base/topo.c:1102
#2  0x0000ffff66be2bec in ucs_cleanup () at sys/init.c:128
#3  0x0000fffff7fc5398 in ?? () from /lib/ld-linux-aarch64.so.1
#4  0x0000fffff7cfcde8 in ?? () from /usr/lib/aarch64-linux-gnu/libc.so.6
#5  0x0000fffff7cfcf0c in exit () from /usr/lib/aarch64-linux-gnu/libc.so.6
#6  0x0000fffff7ce7400 in ?? () from /usr/lib/aarch64-linux-gnu/libc.so.6
#7  0x0000fffff7ce74cc in __libc_start_main () from /usr/lib/aarch64-linux-gnu/libc.so.6

The libucs.so is bundled inside the nixl-cu12 wheel at nixl_cu12.libs/libucs.so. The md5sum of this file differs between nixl-cu12==1.0.0 (no crash) and nixl-cu12==1.0.1 (crash).
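To repeat the checksum comparison in two environments, something like the following could be run in each. It is a minimal sketch under assumptions: find_bundled_libucs and md5_of are hypothetical helper names, the nixl_cu12.libs directory is assumed to sit directly under site-packages, and the glob libucs* is deliberately loose because repaired wheels sometimes add a hash suffix to bundled library names.

```python
import hashlib
import sysconfig
from pathlib import Path


def find_bundled_libucs(search_roots=None):
    """Return files named libucs* inside any nixl_cu12.libs directory under the roots."""
    if search_roots is None:
        paths = sysconfig.get_paths()
        search_roots = {paths["purelib"], paths["platlib"]}
    found = []
    for root in sorted(search_roots):
        libs_dir = Path(root) / "nixl_cu12.libs"
        if libs_dir.is_dir():
            found.extend(sorted(p for p in libs_dir.glob("libucs*") if p.is_file()))
    return found


def md5_of(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    # usedforsecurity=False keeps this working on FIPS-restricted builds (Python 3.9+).
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    for lib in find_bundled_libucs():
        print(md5_of(lib), lib)
```

Comparing the printed digests between an image with nixl-cu12==1.0.0 and one with 1.0.1 should show which library build each image actually ships.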

The subprocess output immediately before the crash confirms the inspection succeeded:

[inner] fn returned: _ModelInfo(architecture='KimiK25ForConditionalGeneration',
  is_text_generation_model=True, supports_multimodal=True, supports_pp=True, ...)

Root cause: nixl-cu12 1.0.1

The consistent factor across all crashing configurations is nixl-cu12==1.0.1. Every build with nixl-cu12==1.0.0 works; every build with nixl-cu12==1.0.1 crashes (unless the crash is bypassed with os._exit(0)).

nixl-cu12 bundles its own libucs.so under nixl_cu12.libs/. The md5sum of this file differs between 1.0.0 and 1.0.1. The 1.0.1 version crashes in ucs_topo_release_devices() during process exit, which appears to be a static-destructor ordering issue where dependent data has already been freed.

The crash requires the inspection subprocess to load UCX (via nixl) during its import chain. Different transformers / hf-hub versions may affect which import paths execute during inspection, but we have not isolated whether transformers version independently matters — the nixl-cu12 version is the proven causal variable.
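One way to check whether a given import chain actually maps UCX into the process is to look at /proc/self/maps after the imports have run. The helper below is a Linux-only sketch; loaded_libraries is a hypothetical name, and which module to import beforehand is left to the reader since the exact import path is not stated in this report.

```python
from pathlib import Path


def loaded_libraries(substring, maps_path="/proc/self/maps"):
    """Return distinct mapped file paths whose file name contains `substring` (Linux only)."""
    paths = set()
    for line in Path(maps_path).read_text().splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        # Only file-backed mappings carry a sixth field with the path.
        if len(fields) == 6 and substring in Path(fields[5]).name:
            paths.add(fields[5])
    return sorted(paths)


if __name__ == "__main__":
    # Import the modules under suspicion before this call, then inspect the result.
    for path in loaded_libraries("libucs"):
        print(path)
```

An empty result after the inspection imports would suggest the process never loaded the bundled libucs.so, in which case the exit-time crash described here should not be reachable.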

Reproduction

  1. Build vLLM v0.19.0 or v0.19.1 with nixl-cu12==1.0.1 (default pip resolution on v0.19.1)
  2. Deploy on GB200 (Blackwell) hardware, aarch64
  3. Start vLLM serving Kimi-K2.5 with --trust-remote-code
  4. Observe: crashes immediately with the above error before serving any requests

Workarounds

Any of the following prevent the crash:

  1. Pin nixl-cu12==1.0.0 — avoids the specific libucs.so with the destructor-ordering issue. The 1.0.0 version does not crash during process exit. Confirmed working on both vLLM v0.19.0 and v0.19.1.

  2. Add os._exit(0) after the subprocess writes its output. This skips Python interpreter finalization and the C-level exit handlers (atexit callbacks and shared-library destructors), so the crashing UCX cleanup never runs:

    # In vllm/model_executor/models/registry.py, at the end of _run():
    with open(output_file, "wb") as f:
        f.write(pickle.dumps(result))
    os._exit(0)  # Skip shutdown to avoid SIGSEGV in native extension cleanup

  3. Use an environment where nixl-cu12==1.0.0 is resolved. Older pip resolutions (before 1.0.1 was released) naturally avoid the crash. Not a reliable long-term fix.
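Because a fresh pip resolution can silently pull in 1.0.1, a deployment might add a small preflight check next to the pin. The following is a sketch, not part of vLLM: KNOWN_BAD and check_known_bad are hypothetical names, and the only entry comes from the versions reported in this issue.

```python
from importlib import metadata

# Versions reported as crashing in this issue; extend as needed (assumption: exact-match is enough).
KNOWN_BAD = {"nixl-cu12": {"1.0.1"}}


def check_known_bad(installed_version_of=metadata.version):
    """Return 'name==version' strings for installed distributions listed in KNOWN_BAD."""
    problems = []
    for dist, bad_versions in KNOWN_BAD.items():
        try:
            version = installed_version_of(dist)
        except metadata.PackageNotFoundError:
            continue  # not installed, so nothing to flag
        if version in bad_versions:
            problems.append(f"{dist}=={version}")
    return problems


if __name__ == "__main__":
    for problem in check_known_bad():
        print(f"WARNING: {problem} is known to crash at process exit; pin nixl-cu12==1.0.0")
```

The lookup function is injectable only so the check can be exercised without installing the package.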

Suggested fix directions

In vLLM (most impactful, independent of upstream nixl fix)

The parent-side code in registry.py (_run_in_subprocess) currently calls returned.check_returncode() before attempting to read the output file. This means any non-zero exit — including a post-completion SIGSEGV during shutdown — is treated as an inspection failure, even when the subprocess completed its work successfully.

A more robust approach: attempt to read and unpickle the output file first. If the artifact is valid, log a warning about abnormal child exit but proceed. Only fail if the output is missing or corrupt. This would make model inspection resilient to exit-time crashes from native extensions, which are outside vLLM's control.
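The approach described above could look roughly like this. It is a minimal sketch and not the actual registry.py code: run_and_load and demo are hypothetical names, and the demo child is a stand-in that writes a pickle and then sends itself SIGSEGV. It also assumes the output path is fresh for each run (for example inside a new temporary directory), so a stale artifact from an earlier run cannot be mistaken for a result.

```python
import logging
import os
import pickle
import subprocess
import sys
import tempfile

logger = logging.getLogger(__name__)


def run_and_load(command, output_path):
    """Run `command`, then trust a valid pickled artifact over the exit status."""
    returned = subprocess.run(command, capture_output=True)
    try:
        with open(output_path, "rb") as f:
            result = pickle.load(f)
    except Exception as exc:
        # Missing or corrupt output: unpickling can raise many exception types,
        # so treat any failure here as a genuine subprocess failure.
        raise RuntimeError(
            f"Subprocess produced no usable output (return code {returned.returncode}):\n"
            f"{returned.stderr.decode(errors='replace')}"
        ) from exc
    if returned.returncode != 0:
        logger.warning(
            "Subprocess exited abnormally (return code %d) after writing a valid result; continuing.",
            returned.returncode,
        )
    return result


def demo():
    """Child writes its result, then crashes during 'shutdown' (simulated with SIGSEGV)."""
    child = (
        "import os, pickle, signal, sys\n"
        "with open(sys.argv[1], 'wb') as f:\n"
        "    f.write(pickle.dumps({'architecture': 'Demo'}))\n"
        "os.kill(os.getpid(), signal.SIGSEGV)\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "result.pkl")
        return run_and_load([sys.executable, "-c", child, out], out)


if __name__ == "__main__":
    print(demo())
```

The design choice is to make the artifact, rather than the exit status, the source of truth, while still logging the abnormal exit so the native crash is not hidden.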

In nixl (root cause of the native crash)

nixl-cu12==1.0.1 bundles a libucs.so that crashes in ucs_topo_release_devices() during __run_exit_handlers. This appears to be a static-destructor ordering issue where dependent data is freed before the topology cleanup runs. nixl-cu12==1.0.0 does not have this issue.

Testing summary

We tested 13+ image iterations systematically varying:

  • vLLM version (v0.19.0, v0.19.1, specific dev commits)
  • transformers version (4.57.6, 5.5.0, 5.5.4)
  • huggingface-hub version (0.x, 1.9.0, 1.10.2, 1.11.0)
  • compressed-tensors version (0.14.0.1, 0.15.0.1)
  • Overlay modifications (with/without patches, with/without additional pip packages)

After extensive bisection, the causal variable was identified as nixl-cu12:

  • Every build with nixl-cu12==1.0.0: no crash (tested with multiple transformers and vLLM versions)
  • Every build with nixl-cu12==1.0.1: SIGSEGV during subprocess exit (tested with multiple transformers and vLLM versions)
  • os._exit(0) patch: bypasses the crash even with nixl-cu12==1.0.1
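Why os._exit(0) sidesteps exit-time cleanup can be seen in a small experiment. The sketch below uses a Python-level atexit handler as a stand-in for the native cleanup; run_child and CHILD_TEMPLATE are hypothetical names. The same principle applies at the C level, since os._exit() terminates the process without running exit handlers or flushing stdio buffers.

```python
import subprocess
import sys

# Child program: registers an exit handler, does its "work", then exits in the requested way.
CHILD_TEMPLATE = (
    "import atexit, os, sys\n"
    "atexit.register(lambda: print('exit handler ran', flush=True))\n"
    "print('work done', flush=True)\n"
    "{exit_call}\n"
)


def run_child(exit_call):
    """Run the child with the given exit statement; return (returncode, stdout lines)."""
    code = CHILD_TEMPLATE.format(exit_call=exit_call)
    done = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
    return done.returncode, done.stdout.splitlines()


if __name__ == "__main__":
    print(run_child("sys.exit(0)"))  # normal shutdown: the handler runs
    print(run_child("os._exit(0)"))  # immediate exit: the handler is skipped
```

Because buffered output is also skipped, any code using this workaround should make sure the result file is closed (as the with block does) before calling os._exit(0).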

Extent analysis

TL;DR

The most likely fix for the SIGSEGV crash during subprocess exit is to pin nixl-cu12 to version 1.0.0 to avoid the specific libucs.so with the destructor-ordering issue.

Guidance

  • Pin nixl-cu12 to version 1.0.0 to prevent the crash.
  • Alternatively, add os._exit(0) after the subprocess writes its output to skip Python interpreter finalization and avoid the atexit handler crash.
  • Modify the parent-side code in registry.py to attempt to read and unpickle the output file first, and only fail if the output is missing or corrupt, to make model inspection resilient to exit-time crashes from native extensions.
  • The root cause of the native crash is in nixl-cu12==1.0.1, which bundles a libucs.so that crashes in ucs_topo_release_devices() during __run_exit_handlers.

Notes

The fix directions suggested are to either pin nixl-cu12 to version 1.0.0 or modify the parent-side code in registry.py to make model inspection resilient to exit-time crashes from native extensions. The root cause of the native crash is in nixl-cu12==1.0.1.

Recommendation

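Pin nixl-cu12 to 1.0.0 as the immediate workaround: it was confirmed on both vLLM v0.19.0 and v0.19.1 and needs no source patch, unlike adding os._exit(0), and it is more dependable than relying on an environment that happens to resolve 1.0.0. Pinning does not remove the underlying fragility; the durable fixes are the parent-side change in registry.py and a corrected libucs.so in nixl.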
