vllm - 💡(How to fix) Fix KimiK25ForConditionalGeneration failed to be inspected — SIGSEGV in registry subprocess during process exit (GB200) [1 participants]

vllm • 2026-04-22 18:57:33

vllm-project/vllm#40642•Fetched 2026-04-23 07:23:39

Author

c2w-sea

Error Message

pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig Value error, Model architectures ['KimiK25ForConditionalGeneration'] failed to be inspected. Please check the logs for more details.

Root Cause

Root cause: nixl-cu12 1.0.1

Problem

vLLM's model registry inspection subprocess (python3 -m vllm.model_executor.models.registry) crashes with SIGSEGV (signal 11) during process exit, after the inspection has already completed successfully and written its output. The parent process sees exit code 139, raises CalledProcessError, and reports a misleading error:

pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
  Value error, Model architectures ['KimiK25ForConditionalGeneration'] failed to be inspected.
  Please check the logs for more details.

The inspection did not fail — the subprocess wrote correct model metadata to its output file. The crash occurs during Python interpreter shutdown when native extension cleanup handlers (atexit / static destructors) run.
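The failure mode can be reproduced without vLLM or nixl. The following is a minimal sketch, not vLLM's code: a hypothetical child script writes its result and then dies from SIGSEGV, standing in for a native destructor that crashes at exit. The parent's check_returncode() raises even though the output file is intact. Note that Python's subprocess module reports a signal death as a negative return code (-11 here); the exit code 139 mentioned above corresponds to the shell convention 128 + 11.

```python
import os
import pickle
import signal
import subprocess
import sys
import tempfile

# Hypothetical child: writes its result, then kills itself with SIGSEGV,
# standing in for a native library that crashes during process exit.
CHILD = """
import os, pickle, signal, sys
with open(sys.argv[1], "wb") as f:
    f.write(pickle.dumps({"architecture": "demo"}))
os.kill(os.getpid(), signal.SIGSEGV)
"""


def run_child():
    """Return (returncode, unpickled output, whether check_returncode raised)."""
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "result.pkl")
        proc = subprocess.run([sys.executable, "-c", CHILD, out])
        try:
            proc.check_returncode()
            raised = False
        except subprocess.CalledProcessError:
            raised = True
        # The artifact is still readable despite the abnormal exit.
        with open(out, "rb") as f:
            result = pickle.load(f)
    return proc.returncode, result, raised


if __name__ == "__main__":
    rc, result, raised = run_child()
    print("killed by SIGSEGV:", rc == -signal.SIGSEGV)
    print("output intact:", result)
    print("check_returncode raised:", raised)
```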

Environment

vLLM: v0.19.0 (also reproduces on v0.19.1)
Hardware: NVIDIA GB200 (Blackwell), 4× GPU, aarch64
Model: Kimi-K2.5 (KimiK25ForConditionalGeneration) with trust_remote_code=True
Python: 3.10
CUDA: 12.9.1
PyTorch: 2.10.0
nixl-cu12: 1.0.1 (does NOT reproduce with 1.0.0 — see "Workarounds" below)

GDB Backtrace

The crash occurs in UCX's topology cleanup, called from libc's exit():

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x0000ffff66c16dcc in ucs_topo_release_devices () at sys/topo/base/topo.c:1015

#0  ucs_topo_release_devices () at sys/topo/base/topo.c:1015
#1  ucs_topo_cleanup () at sys/topo/base/topo.c:1102
#2  0x0000ffff66be2bec in ucs_cleanup () at sys/init.c:128
#3  0x0000fffff7fc5398 in ?? () from /lib/ld-linux-aarch64.so.1
#4  0x0000fffff7cfcde8 in ?? () from /usr/lib/aarch64-linux-gnu/libc.so.6
#5  0x0000fffff7cfcf0c in exit () from /usr/lib/aarch64-linux-gnu/libc.so.6
#6  0x0000fffff7ce7400 in ?? () from /usr/lib/aarch64-linux-gnu/libc.so.6
#7  0x0000fffff7ce74cc in __libc_start_main () from /usr/lib/aarch64-linux-gnu/libc.so.6

The libucs.so is bundled inside the nixl-cu12 wheel at nixl_cu12.libs/libucs.so. The md5sum of this file differs between nixl-cu12==1.0.0 (no crash) and nixl-cu12==1.0.1 (crash).
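To compare the bundled library between two environments, a small helper along these lines could be used. This is a sketch under assumptions: it takes the nixl_cu12.libs directory name from the report, looks for it in the interpreter's platlib directory, and matches files with the pattern "libucs*", since the exact file name inside the wheel may carry a suffix.

```python
import hashlib
import sysconfig
from pathlib import Path


def md5_of_bundled_libs(libs_dir, pattern="libucs*"):
    """Map each matching file name under libs_dir to its MD5 hex digest."""
    digests = {}
    for path in sorted(Path(libs_dir).glob(pattern)):
        if path.is_file():
            digests[path.name] = hashlib.md5(path.read_bytes()).hexdigest()
    return digests


if __name__ == "__main__":
    # Assumed location: <site-packages>/nixl_cu12.libs, as named in the report.
    libs = Path(sysconfig.get_paths()["platlib"]) / "nixl_cu12.libs"
    for name, digest in md5_of_bundled_libs(libs).items():
        print(digest, name)
```

Running it in a 1.0.0 environment and a 1.0.1 environment and diffing the output shows whether the two wheels ship different binaries.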

The subprocess output immediately before the crash confirms the inspection succeeded:

[inner] fn returned: _ModelInfo(architecture='KimiK25ForConditionalGeneration',
  is_text_generation_model=True, supports_multimodal=True, supports_pp=True, ...)

Root cause: nixl-cu12 1.0.1

The consistent factor across all crashing configurations is nixl-cu12==1.0.1. Every build with nixl-cu12==1.0.0 works; every build with nixl-cu12==1.0.1 crashes (unless the crash is bypassed with os._exit(0)).

nixl-cu12 bundles its own libucs.so under nixl_cu12.libs/. The md5sum of this file differs between 1.0.0 and 1.0.1. The 1.0.1 version crashes in ucs_topo_release_devices() during process exit. This appears to be a static-destructor ordering issue in which data that the topology cleanup depends on has already been freed.

The crash requires the inspection subprocess to load UCX (via nixl) during its import chain. Different transformers / hf-hub versions may affect which import paths execute during inspection, but we have not isolated whether transformers version independently matters — the nixl-cu12 version is the proven causal variable.
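One way to check whether a given import chain actually pulls UCX into the process is to inspect the process's memory map after the imports have run. The sketch below is Linux-only and is not part of vLLM; the function name and the "libucs" substring match are illustrative choices.

```python
from pathlib import Path


def mapped_libraries(maps_text, needle="libucs"):
    """Return distinct file-backed mappings whose file name contains needle."""
    found = []
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, [pathname].
        parts = line.split(maxsplit=5)
        if len(parts) == 6:
            path = parts[5]
            if needle in Path(path).name and path not in found:
                found.append(path)
    return found


if __name__ == "__main__":
    # Run this after the imports under investigation, in the same process.
    print(mapped_libraries(Path("/proc/self/maps").read_text()))
```

An empty list after the imports of interest would suggest that the import path does not load UCX, in which case the exit-time crash should not be reachable from that process.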

Reproduction

1. Build vLLM v0.19.0 or v0.19.1 with nixl-cu12==1.0.1 (default pip resolution on v0.19.1)
2. Deploy on GB200 (Blackwell) hardware, aarch64
3. Start vLLM serving Kimi-K2.5 with --trust-remote-code
4. Observe: vLLM crashes immediately with the above error, before serving any requests

Workarounds

Any of the following prevent the crash:

1. Pin nixl-cu12==1.0.0. This avoids the specific libucs.so with the destructor-ordering issue; the 1.0.0 version does not crash during process exit. Confirmed working on both vLLM v0.19.0 and v0.19.1.

2. Add os._exit(0) after the subprocess writes its output. This skips interpreter finalization and the process's exit handlers entirely, so the crashing cleanup never runs:

# In vllm/model_executor/models/registry.py, at the end of _run():
with open(output_file, "wb") as f:
    f.write(pickle.dumps(result))
os._exit(0)  # Skip shutdown to avoid SIGSEGV in native extension cleanup

3. Use an environment where nixl-cu12==1.0.0 is already resolved. Environments built before 1.0.1 was released avoid the crash, but this is not a reliable long-term fix, because a fresh install may resolve 1.0.1.
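The effect of os._exit(0) can be checked in isolation. The sketch below is an illustration, not vLLM code: a hypothetical child registers a Python atexit handler as a stand-in for exit-time cleanup, writes its output inside a with block, and optionally calls os._exit(0). The output survives in both cases because the file is closed before the exit call, while the handler runs only on a normal exit. os._exit also bypasses C-level exit handlers, which is where the UCX cleanup in the backtrace is invoked.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child: argv = [out_path, marker_path, "1" for hard exit].
CHILD = """
import atexit, os, sys
out, marker, hard = sys.argv[1], sys.argv[2], sys.argv[3] == "1"
# Stand-in for exit-time cleanup that we would like to skip.
atexit.register(lambda: open(marker, "w").close())
with open(out, "wb") as f:
    f.write(b"result")
if hard:
    os._exit(0)
"""


def run(hard_exit):
    """Return (returncode, output was written, exit handler ran)."""
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "out.bin")
        marker = os.path.join(tmp, "handler_ran")
        flag = "1" if hard_exit else "0"
        proc = subprocess.run([sys.executable, "-c", CHILD, out, marker, flag])
        return proc.returncode, os.path.exists(out), os.path.exists(marker)


if __name__ == "__main__":
    print("normal exit:", run(hard_exit=False))
    print("os._exit(0):", run(hard_exit=True))
```

Because os._exit does not flush buffered I/O, anything the child must persist has to be written and closed before the call, as the with block does here.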

Suggested fix directions

In vLLM (most impactful, independent of upstream nixl fix)

The parent-side code in registry.py (_run_in_subprocess) currently calls returned.check_returncode() before attempting to read the output file. This means any non-zero exit — including a post-completion SIGSEGV during shutdown — is treated as an inspection failure, even when the subprocess completed its work successfully.

A more robust approach: attempt to read and unpickle the output file first. If the artifact is valid, log a warning about abnormal child exit but proceed. Only fail if the output is missing or corrupt. This would make model inspection resilient to exit-time crashes from native extensions, which are outside vLLM's control.
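The approach described above could look roughly like the following. This is a sketch of the idea only, not the actual _run_in_subprocess implementation in vLLM; the function name run_and_load, its signature, and the choice of RuntimeError are hypothetical.

```python
import logging
import os
import pickle
import subprocess
import sys
import tempfile

logger = logging.getLogger(__name__)


def run_and_load(cmd, output_path):
    """Run cmd, then trust a valid output artifact over the exit status."""
    proc = subprocess.run(cmd, capture_output=True)
    try:
        with open(output_path, "rb") as f:
            result = pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError) as exc:
        # Missing or corrupt artifact: treat this as a real failure.
        raise RuntimeError(
            f"child produced no usable output (returncode={proc.returncode})"
        ) from exc
    if proc.returncode != 0:
        # The work finished; only the shutdown went wrong.
        logger.warning(
            "child exited abnormally (returncode=%s) after writing valid output",
            proc.returncode,
        )
    return result


if __name__ == "__main__":
    # Hypothetical child that succeeds, then crashes while exiting.
    child = (
        "import os, pickle, signal, sys\n"
        "with open(sys.argv[1], 'wb') as f:\n"
        "    f.write(pickle.dumps({'ok': True}))\n"
        "os.kill(os.getpid(), signal.SIGSEGV)\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "result.pkl")
        print(run_and_load([sys.executable, "-c", child, out], out))
```

A real implementation would also need to make sure the output file is fresh, for example by using a new temporary path per run, so that a stale artifact from an earlier run is never mistaken for success.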

In nixl (root cause of the native crash)

nixl-cu12==1.0.1 bundles a libucs.so that crashes in ucs_topo_release_devices() during __run_exit_handlers. This appears to be a static-destructor ordering issue where dependent data is freed before the topology cleanup runs. nixl-cu12==1.0.0 does not have this issue.

Testing summary

We tested 13+ image iterations systematically varying:

vLLM version (v0.19.0, v0.19.1, specific dev commits)
transformers version (4.57.6, 5.5.0, 5.5.4)
huggingface-hub version (0.x, 1.9.0, 1.10.2, 1.11.0)
compressed-tensors version (0.14.0.1, 0.15.0.1)
Overlay modifications (with/without patches, with/without additional pip packages)

After extensive bisection, the causal variable was identified as nixl-cu12:

Every build with nixl-cu12==1.0.0: no crash (tested with multiple transformers and vLLM versions)
Every build with nixl-cu12==1.0.1: SIGSEGV during subprocess exit (tested with multiple transformers and vLLM versions)
os._exit(0) patch: bypasses the crash even with nixl-cu12==1.0.1
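Given that the version of nixl-cu12 is the deciding variable, an image build or startup script could check it early. The following is a minimal sketch; the AFFECTED set reflects only the versions tested in this report, and the helper names are hypothetical.

```python
from importlib import metadata

# Only 1.0.0 (no crash) and 1.0.1 (crash) were tested in this report.
AFFECTED = {"1.0.1"}


def installed_version(dist="nixl-cu12"):
    """Return the installed version of dist, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def is_affected(version):
    """True if version is one that crashed at exit in the tests above."""
    return version in AFFECTED


if __name__ == "__main__":
    version = installed_version()
    if is_affected(version):
        print(f"nixl-cu12=={version} is installed; consider pinning 1.0.0")
    else:
        print(f"nixl-cu12 version: {version}")
```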

extent analysis

TL;DR

The most likely fix for the SIGSEGV crash during subprocess exit is to pin nixl-cu12 to version 1.0.0 to avoid the specific libucs.so with the destructor-ordering issue.

Guidance

Pin nixl-cu12 to version 1.0.0 to prevent the crash.
Alternatively, add os._exit(0) after the subprocess writes its output to skip Python interpreter finalization and avoid the atexit handler crash.
Modify the parent-side code in registry.py to attempt to read and unpickle the output file first, and only fail if the output is missing or corrupt, to make model inspection resilient to exit-time crashes from native extensions.
The root cause of the native crash is in nixl-cu12==1.0.1, which bundles a libucs.so that crashes in ucs_topo_release_devices() during __run_exit_handlers.

Example

# In vllm/model_executor/models/registry.py, at the end of _run():
with open(output_file, "wb") as f:
    f.write(pickle.dumps(result))
os._exit(0)  # Skip shutdown to avoid SIGSEGV in native extension cleanup

Notes

The fix directions suggested are to either pin nixl-cu12 to version 1.0.0 or modify the parent-side code in registry.py to make model inspection resilient to exit-time crashes from native extensions. The root cause of the native crash is in nixl-cu12==1.0.1.

Recommendation

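Pin nixl-cu12 to version 1.0.0 as the immediate workaround. It is more reliable than depending on an environment where pip happened to resolve 1.0.0, and unlike os._exit(0) it does not require patching vLLM source. The durable fixes are the parent-side change in registry.py and an upstream fix in nixl.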
