vllm - [Bug]: Kimi 2.6 on 8x A100 SXM4 leads to NVLink Crash Coredump

vllm-project/vllm#40652 (fetched 2026-04-23 07:23:36)

Error Message

We are getting a lot of kernel stack traces and a core dump.

Model: moonshotai/Kimi-K2.6
vLLM Container Version: vllm/vllm-openai:v0.19.1
Talos OS v1.12.6
NVIDIA Toolkit: nvidia-container-toolkit-lts=580.126.20-v1.18.2
Fabric Manager: nvidia-fabricmanager-lts=580.126.20
Drivers: nvidia-open-gpu-kernel-modules-lts=580.126.20-v1.12.6
nvidia-gdrdrv-device=v2.5.1

Information:

  • NVIDIA Device Plugin is running inside the cluster without any issues
  • nvidia-smi works without any issues
  • Other, smaller models are running without any issues

Root Cause

192.168.11.119: kern: warning: [2026-04-22T21:25:20.458998481Z]: NVRM: GPU at PCI:0000:48:00: GPU-5fc0bcec-c2e1-38dc-8994-2a9f9f3b709e
192.168.11.119: kern:     err: [2026-04-22T21:25:20.466440686Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Severity 1 Engine instance 12 Sub-engine instance 00
192.168.11.119: kern: warning: [2026-04-22T21:25:20.473962892Z]: NVRM: GPU Board Serial Number: 1563521020414
192.168.11.119: kern:     err: [2026-04-22T21:25:20.500049682Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Data {0x10000000, 0x10008100, 0x10000000, 0x10008100, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
192.168.11.119: kern: warning: [2026-04-22T21:25:20.505406921Z]: NVRM: Xid (PCI:0000:48:00): 62, 00002740 00002b08 00001126 0000117a 0000279f 0002a91a 00000011 00000000
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560106789Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560271408Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560426825Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560534061Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560650492Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e
192.168.11.119: kern: warning: [2026-04-22T21:25:20.56075873Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560896391Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010
192.168.11.119: kern: warning: [2026-04-22T21:25:20.561005312Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011
192.168.11.119: kern: warning: [2026-04-22T21:26:20.562739402Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.584539473Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599024717Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599137616Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599257853Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599366618Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599488292Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599595837Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599716692Z]: NVRM: Xid (PCI:0000:48:00): 74, pid=22884, name=VLLM::Worker_TP, NVLink: fatal error detected on link 8(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)
192.168.11.119: kern: warning: [2026-04-22T21:26:20.695845708Z]: NVRM: Xid (PCI:0000:48:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
192.168.11.119: kern: warning: [2026-04-22T21:27:51.261401334Z]: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
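
To get an overview of a log like the one above, it can help to count the Xid (GPU) and SXid (NVSwitch) events per PCI device. The following is a minimal sketch, not an official tool: the regular expression is an assumption about the line format seen in this issue, and the short labels are paraphrased from NVIDIA's public Xid documentation. SXid codes are deliberately left unlabeled here.

```python
import re
from collections import Counter

# Matches both GPU "Xid" and NVSwitch "SXid" kernel messages
# (pattern assumed from the log lines in this issue).
EVENT_RE = re.compile(r"\b(SXid|Xid) \(PCI:([0-9a-fA-F:.]+)\): (\d+),")

# Short labels paraphrased from NVIDIA's Xid documentation.
XID_LABELS = {
    45: "preemptive cleanup after an earlier error",
    62: "internal micro-controller halt",
    74: "NVLink error",
    119: "GSP RPC timeout",
    154: "GPU recovery action changed",
}


def summarize_events(lines):
    """Count (kind, pci_address, code) occurrences in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            kind, pci, code = match.group(1), match.group(2), int(match.group(3))
            counts[(kind, pci, code)] += 1
    return counts


# A few lines copied from the log above.
SAMPLE = [
    "192.168.11.119: kern:     err: [2026-04-22T21:25:20.466440686Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Severity 1 Engine instance 12 Sub-engine instance 00",
    "192.168.11.119: kern: warning: [2026-04-22T21:25:20.505406921Z]: NVRM: Xid (PCI:0000:48:00): 62, 00002740 00002b08 00001126 0000117a 0000279f 0002a91a 00000011 00000000",
    "192.168.11.119: kern: warning: [2026-04-22T21:25:20.560106789Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a",
    "192.168.11.119: kern: warning: [2026-04-22T21:25:20.560271408Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b",
    "192.168.11.119: kern: warning: [2026-04-22T21:26:20.599716692Z]: NVRM: Xid (PCI:0000:48:00): 74, pid=22884, name=VLLM::Worker_TP, NVLink: fatal error detected on link 8(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)",
    "192.168.11.119: kern: warning: [2026-04-22T21:26:20.695845708Z]: NVRM: Xid (PCI:0000:48:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)",
]

summary = summarize_events(SAMPLE)
for (kind, pci, code), n in sorted(summary.items()):
    if kind == "Xid":
        label = XID_LABELS.get(code, "see NVIDIA Xid documentation")
    else:
        label = "NVSwitch event, see Fabric Manager documentation"
    print(f"{kind} {code} on {pci}: {n}x ({label})")
```

Fed with the full kernel log instead of `SAMPLE`, the same function shows at a glance which GPU and which NVSwitch reported errors.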

Code Example

version 0.19.1
model   moonshotai/Kimi-K2.6

non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'kimi_k2', 'host': '0.0.0.0', 'port': 8080, 'model': 'moonshotai/Kimi-K2.6', 'trust_remote_code': True, 'max_model_len': 16000, 'served_model_name': ['Kimi-K2.6'], 'download_dir': '/mnt/llm-models', 'reasoning_parser': 'kimi_k2', 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.95, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'mm_encoder_tp_mode': 'data'}
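
The issue only shows the logged `non-default args`, not the command that was used to start the server. As a rough sketch for reproducing the setup, the logged dictionary can be turned back into a `vllm serve` command line. The underscore-to-hyphen mapping and the handling of booleans and lists below are assumptions about how the options are named on the CLI, so the result should be checked against `vllm serve --help` for the installed version.

```python
import ast
import shlex

# The "non-default args" dictionary exactly as logged in the issue.
LOGGED_ARGS = (
    "{'enable_auto_tool_choice': True, 'tool_call_parser': 'kimi_k2', "
    "'host': '0.0.0.0', 'port': 8080, 'model': 'moonshotai/Kimi-K2.6', "
    "'trust_remote_code': True, 'max_model_len': 16000, "
    "'served_model_name': ['Kimi-K2.6'], 'download_dir': '/mnt/llm-models', "
    "'reasoning_parser': 'kimi_k2', 'tensor_parallel_size': 8, "
    "'gpu_memory_utilization': 0.95, 'kv_cache_dtype': 'fp8', "
    "'enable_prefix_caching': True, 'mm_encoder_tp_mode': 'data'}"
)


def to_cli(args):
    """Build an argv list for `vllm serve` from a logged args dict."""
    cmd = ["vllm", "serve", args["model"]]
    for key, value in args.items():
        if key == "model":
            continue  # already passed as the positional argument
        flag = "--" + key.replace("_", "-")  # assumed naming convention
        if isinstance(value, bool):
            if value:
                cmd.append(flag)  # True becomes a bare flag; False is skipped
        elif isinstance(value, (list, tuple)):
            cmd.extend([flag, *map(str, value)])
        else:
            cmd.extend([flag, str(value)])
    return cmd


command = to_cli(ast.literal_eval(LOGGED_ARGS))
print(shlex.join(command))
```

Dropping individual flags from the generated command is a simple way to bisect which option, if any, is needed to trigger the crash.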

---

1. Model checkpoints are loading:

---

Loading safetensors checkpoint shards: 100% Completed | 64/64 [13:08<00:00, 12.33s/it]
(Worker_TP0 pid=370)
(Worker_TP0 pid=370) INFO 04-22 21:17:55 [default_loader.py:384] Loading weights took 788.87 seconds
(Worker_TP0 pid=370) INFO 04-22 21:18:02 [gpu_model_runner.py:4820] Model loading took 72.02 GiB memory and 798.933240 seconds
(Worker_TP0 pid=370) INFO 04-22 21:18:02 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 4225 tokens, and profiled with 1 vision_chunk items of the maximum feature size.
(Worker_TP0 pid=370) INFO 04-22 21:18:22 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/1924dfafa8/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=370) INFO 04-22 21:18:22 [backends.py:1111] Dynamo bytecode transform time: 13.40 s
(Worker_TP0 pid=370) INFO 04-22 21:18:27 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(Worker_TP0 pid=370) INFO 04-22 21:18:38 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 15.04 s
(Worker_TP0 pid=370) INFO 04-22 21:18:43 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/f088404224c4ce7d97f6740f4677a90f65e9ce57bb5c9143590028f4b2dd20cc/rank_0_0/model
(Worker_TP0 pid=370) INFO 04-22 21:18:43 [monitor.py:48] torch.compile took 34.01 s in total
(EngineCore pid=299) INFO 04-22 21:19:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=299) INFO 04-22 21:20:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=299) INFO 04-22 21:21:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP2 pid=372) ERROR 04-22 21:21:40 [multiproc_executor.py:949] WorkerProc hit an exception.
(...)
(Worker_TP2 pid=372) ERROR 04-22 21:21:40 [multiproc_executor.py:949] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

---

After that, the GPU is in reset mode, and the Fabric Manager and the GPUs are only usable again after 1-2 reboots.

Kernel error logs:

---

192.168.11.119: kern: warning: [2026-04-22T21:25:20.458998481Z]: NVRM: GPU at PCI:0000:48:00: GPU-5fc0bcec-c2e1-38dc-8994-2a9f9f3b709e
192.168.11.119: kern:     err: [2026-04-22T21:25:20.466440686Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Severity 1 Engine instance 12 Sub-engine instance 00
192.168.11.119: kern: warning: [2026-04-22T21:25:20.473962892Z]: NVRM: GPU Board Serial Number: 1563521020414
192.168.11.119: kern:     err: [2026-04-22T21:25:20.500049682Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Data {0x10000000, 0x10008100, 0x10000000, 0x10008100, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
192.168.11.119: kern: warning: [2026-04-22T21:25:20.505406921Z]: NVRM: Xid (PCI:0000:48:00): 62, 00002740 00002b08 00001126 0000117a 0000279f 0002a91a 00000011 00000000
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560106789Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560271408Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560426825Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560534061Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560650492Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e
192.168.11.119: kern: warning: [2026-04-22T21:25:20.56075873Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560896391Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010
192.168.11.119: kern: warning: [2026-04-22T21:25:20.561005312Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011
192.168.11.119: kern: warning: [2026-04-22T21:26:20.562739402Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.584539473Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599024717Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599137616Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599257853Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599366618Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599488292Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599595837Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599716692Z]: NVRM: Xid (PCI:0000:48:00): 74, pid=22884, name=VLLM::Worker_TP, NVLink: fatal error detected on link 8(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)
192.168.11.119: kern: warning: [2026-04-22T21:26:20.695845708Z]: NVRM: Xid (PCI:0000:48:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
192.168.11.119: kern: warning: [2026-04-22T21:27:51.261401334Z]: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
192.168.11.119: kern: warning: [2026-04-22T21:27:51.271840481Z]: NVRM: _kgspLogXid119: Note: Please also check logs above.
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288755476Z]: NVRM: GPU at PCI:0000:0b:00: GPU-ff3dfd6d-7042-cc8c-f3a6-84d9d6eb3a4b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288764025Z]: NVRM: GPU Board Serial Number: 1563521019267
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288777915Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s of waiting for RPC response from GPU3 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 4127 (0x20803039 0xb0).
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288813021Z]: NVRM: GPU3 GSP RPC buffer contains function 76 (GSP_RM_CONTROL) sequence 4127 and data 0x0000000020803039 0x00000000000000b0.
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288821491Z]: NVRM: GPU3 RPC history (CPU -> GSP):
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288824425Z]: NVRM:     entry function                     sequence data0              data1              ts_start           ts_end             duration actively_polling
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288830314Z]: NVRM:      0    76   GSP_RM_CONTROL              4127 0x0000000020803039 0x00000000000000b0 0x00065013160bb6e0 0x0000000000000000          y
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288836628Z]: NVRM:     -1    76   GSP_RM_CONTROL              4126 0x000000002080a026 0x0000000000000214 0x00065013118cf0d0 0x00065013118cf158    136us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288844547Z]: NVRM:     -2    76   GSP_RM_CONTROL              4125 0x000000002080a084 0x0000000000000004 0x00065013118cf063 0x00065013118cf0c2     95us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288848661Z]: NVRM:     -3    76   GSP_RM_CONTROL              4124 0x0000000020809001 0x0000000000000008 0x00065013118ceff5 0x00065013118cf053     94us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28885631Z]: NVRM:     -4    76   GSP_RM_CONTROL              4123 0x0000000020809064 0x0000000000000208 0x00065013118cef7b 0x00065013118cefe3    104us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288860424Z]: NVRM:     -5    76   GSP_RM_CONTROL              4122 0x000000002080a026 0x0000000000000214 0x00065013118ceecc 0x00065013118cef56    138us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288869094Z]: NVRM:     -6    76   GSP_RM_CONTROL              4121 0x000000002080a084 0x0000000000000004 0x00065013118cee5c 0x00065013118ceeb9     93us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288873188Z]: NVRM:     -7    76   GSP_RM_CONTROL              4120 0x0000000020809001 0x0000000000000008 0x00065013118cede3 0x00065013118cee4b    104us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288880739Z]: NVRM: GPU3 RPC event history (CPU <- GSP):
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288883742Z]: NVRM:     entry function                     sequence data0              data1              ts_start           ts_end             duration during_incomplete_rpc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288889732Z]: NVRM:      0    4098 GSP_RUN_CPU_SEQUENCER          0 0x00000000000001ea 0x0000000000003fe2 0x00065011fbc021a9 0x00065011fbc02a48   2207us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288900158Z]: CPU: 3 UID: 0 PID: 11833 Comm: gpu-feature-dis Tainted: G           O        6.18.18-talos #1 NONE 
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288909147Z]: Tainted: [O]=OOT_MODULE
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288912031Z]: Hardware name: HPE ProLiant XL675d Gen10 Plus/ProLiant XL675d Gen10 Plus, BIOS A47 08/07/2024
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28891896Z]: Call Trace:
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288924814Z]:  <TASK>
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288933332Z]:  dump_stack_lvl+0x5d/0x90
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288946276Z]:  _kgspRpcRecvPoll+0x725/0x870 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289079746Z]:  _issueRpcAndWait+0xdd/0x970 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289152639Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289162609Z]:  ? osGetCurrentThread+0x26/0x60 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289295522Z]:  ? rmDeviceGpuLockIsOwner+0x29/0x90 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289426652Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289431635Z]:  rpcRmApiControl_GSP+0x76f/0x940 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289507484Z]:  knvlinkExecGspRmRpc_IMPL+0x68/0x140 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289597368Z]:  knvlinkSyncLinkMasksAndVbiosInfo_IMPL+0xb7/0x1a0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289686107Z]:  nvlinkCtrlCmdBusGetNvlinkCaps+0x92/0x630 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28977Z]:  kceGetCeFromNvlinkConfig_IMPL+0x49/0xe0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28990033Z]:  knvlinkGetP2POptimalCEs_GP100+0x6c/0xf0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289993324Z]:  CliGetSystemP2pCaps+0x395/0x630 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290071203Z]:  ? CliGetSystemP2pCaps+0x11d/0x630 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290147077Z]:  cliresCtrlCmdSystemGetP2pCapsV2_IMPL+0xa2/0xf0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290223016Z]:  resControl_IMPL+0x1a9/0x1b0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290282909Z]:  serverControl+0x47e/0x590 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290345728Z]:  _rmapiRmControl+0x4f2/0x820 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290423802Z]:  rmapiControlWithSecInfo+0x79/0x140 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290497731Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290502725Z]:  rmapiControlWithSecInfoTls+0x8f/0xf0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290578033Z]:  _nv04ControlWithSecInfo+0x8d/0xa0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290649057Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290656345Z]:  ? cred_has_capability.isra.0+0xa4/0x170
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29066627Z]:  RmIoctl+0x90b/0xda0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290800879Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290805753Z]:  ? os_acquire_spinlock+0x12/0x30 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29087688Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290880774Z]:  ? portSyncSpinlockAcquire+0x18/0x30 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290942992Z]:  ? rm_ioctl+0x52/0x4f0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291069875Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291078766Z]:  rm_ioctl+0x66/0x4f0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29120365Z]:  ? __check_object_size+0x215/0x230
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291213998Z]:  nvidia_unlocked_ioctl+0x447/0x950 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291273871Z]:  __x64_sys_ioctl+0x9f/0x100
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291283179Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291288033Z]:  do_syscall_64+0x78/0x940
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291307702Z]:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291312635Z]: RIP: 0033:0x7f8079a2d67b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291320264Z]: Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d 57 0f 00 f7 d8 64 89 01 48
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291324508Z]: RSP: 002b:00007f8005baec88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291332297Z]: RAX: ffffffffffffffda RBX: 0000000000000020 RCX: 00007f8079a2d67b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291335531Z]: RDX: 00007f8005baedd0 RSI: 00000000c020462a RDI: 000000000000000b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291342611Z]: RBP: 00007f8005baece0 R08: 00007f8005baedd0 R09: 00007f8005baedec
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291345594Z]: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8005baedd0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291351073Z]: R13: 000000000000000b R14: 00000000c020462a R15: 00007f8005baeca0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291356167Z]:  </TASK>
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291364354Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291370308Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291378137Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291383081Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291539511Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc               : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291543565Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvCpuctl           : 00000040
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291549985Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqmask          : 00040040
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291553129Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqdest          : 00000040
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291559707Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrStat      : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291562791Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrInfo      : badf1500
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291568069Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrAddr      : 0000000001e19e20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291572173Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvHubErrStat       : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291577631Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconMailbox         : 0:00000000 1:00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291581675Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqstat         : 00009000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291587264Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqmode         : 0000fc24
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291590368Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifInstblk           : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291596327Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCtl               : 00000090
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291600721Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifThrottle          : 80000064
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29160622Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkBlk           : 0:48215480 1:50125638
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291609354Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkCtl           : 0:00000000 1:00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291616123Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCg1               : 0000000f
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291619217Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 00 = 0x0000000005c27ca8
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291625875Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 01 = 0x0000000005c366cc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29162931Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 02 = 0x000000000400a35c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29163573Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 03 = 0x0000000005c366c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291638994Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 04 = 0x0000000004d4b1c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291645653Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 05 = 0x0000000004d3f670
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291648737Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 06 = 0x000000000400a35c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291657622Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 07 = 0x0000000004d3f5c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291660856Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 08 = 0x0000000004d4b1a0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291666514Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 09 = 0x0000000005c37b14
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291670789Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 10 = 0x0000000005c39948
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291676137Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 11 = 0x0000000005a07fe0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291679812Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 12 = 0x000000000400a35c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291686132Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 13 = 0x0000000005a0804c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291689526Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 14 = 0x0000000005c398c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291698842Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 15 = 0x0000000005c39c60
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291703737Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 16 = 0x0000000005c39e78
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291791108Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 17 = 0x000000000535bbf8
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291794483Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 18 = 0x0000000004d9cab0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291805401Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 19 = 0x000000000535bbcc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291810409Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 20 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291829641Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 21 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291836572Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 22 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291844715Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 23 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29184809Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 24 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291855031Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 25 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291859535Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 26 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291867569Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 27 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291870923Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 28 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291879187Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 29 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291882381Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 30 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291891544Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 31 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291894608Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 32 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291903372Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 33 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291906316Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 34 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291912866Z]: NVRM: _kgspLogXid119: ********************************************************************************
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291919134Z]: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4127!
192.168.11.119: kern: warning: [2026-04-22T21:29:21.342163186Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s of waiting for RPC response from GPU3 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 4128 (0x20803039 0xb0).
192.168.11.119: kern: warning: [2026-04-22T21:29:21.360926643Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373933805Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373945078Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373963996Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373980741Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc               : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.37399493Z]: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4128!
192.168.11.119: kern: warning: [2026-04-22T21:30:51.494604223Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s of waiting for RPC response from GPU3 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 4129 (0x20803039 0xb0).
192.168.11.119: kern: warning: [2026-04-22T21:30:51.513376162Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526390847Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526401628Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526418955Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.52643656Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc               : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526447914Z]: NVRM: nvAssertFailedNoLog: Assertion failed: Back to back GSP RPC timeout detected! GPU marked for reset @ kernel_gsp.c:2387
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526483418Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: Core is booted.
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526530125Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: RSTAT3 0x0000000000000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526545409Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: RSTAT4 0x0000000000000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.795769959Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: [ERROR] ICD Halt command failed.
192.168.11.119: kern: warning: [2026-04-22T21:30:51.803749334Z]: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4129!
192.168.11.119: kern: warning: [2026-04-22T21:30:51.870495804Z]: NVRM: Xid (PCI:0000:0b:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
192.168.11.119: kern: warning: [2026-04-22T21:37:31.524565304Z]: NVRM: nvAssertOkFailedNoLog: Assertion failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10957
192.168.11.119: kern: warning: [2026-04-22T21:37:31.583305494Z]: NVRM: nvAssertOkFailedNoLog: Assertion failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10957
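
The spacing of the repeated Xid 119 timeouts in the log above can be measured with a short script. The following is a minimal Python sketch, assuming the bracketed kernel timestamps are UTC and carry a variable number of fractional digits (as they do in this report). The helper names `parse_kernel_ts` and `xid_gaps` are illustrative and not part of any NVIDIA or Talos tooling; the sample lines are abbreviated copies of lines from this log.

```python
import re
from datetime import datetime, timezone

# Bracketed ISO timestamp as printed in this report, e.g. [2026-04-22T21:29:21.342163186Z]
TS_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z\]")
# GPU Xid events only; the word boundary keeps NVSwitch "SXid" lines out.
XID_RE = re.compile(r"\bXid \(PCI:[^)]+\): (\d+),")


def parse_kernel_ts(line):
    """Return the line's timestamp as an aware datetime, or None if absent."""
    m = TS_RE.search(line)
    if m is None:
        return None
    day, clock, frac = m.groups()
    # Fractions have 5 to 9 digits here; normalise to microseconds.
    micros = (frac or "").ljust(6, "0")[:6]
    stamp = datetime.strptime(f"{day} {clock}.{micros}", "%Y-%m-%d %H:%M:%S.%f")
    return stamp.replace(tzinfo=timezone.utc)


def xid_gaps(lines, code):
    """Seconds between consecutive occurrences of the given Xid code."""
    stamps = []
    for line in lines:
        m = XID_RE.search(line)
        if m and int(m.group(1)) == code:
            ts = parse_kernel_ts(line)
            if ts is not None:
                stamps.append(ts)
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Abbreviated sample lines taken from the log above.
SAMPLE = [
    "192.168.11.119: kern: warning: [2026-04-22T21:29:21.342163186Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s",
    "192.168.11.119: kern: warning: [2026-04-22T21:30:51.494604223Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s",
    "192.168.11.119: kern: warning: [2026-04-22T21:30:51.870495804Z]: NVRM: Xid (PCI:0000:0b:00): 154, GPU recovery action changed",
]

print(xid_gaps(SAMPLE, 119))
```

On these two sample lines the gap is about 90 seconds. Feeding the full log to `xid_gaps` gives the spacing for every repeat.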

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
version 0.19.1
model   moonshotai/Kimi-K2.6

non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'kimi_k2', 'host': '0.0.0.0', 'port': 8080, 'model': 'moonshotai/Kimi-K2.6', 'trust_remote_code': True, 'max_model_len': 16000, 'served_model_name': ['Kimi-K2.6'], 'download_dir': '/mnt/llm-models', 'reasoning_parser': 'kimi_k2', 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.95, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'mm_encoder_tp_mode': 'data'}
</details>
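
To reproduce the launch, the non-default args above can be turned back into a command line. The following is a minimal Python sketch, assuming vLLM's usual convention that each CLI flag is the argument name with underscores replaced by dashes and that the model is passed positionally to `vllm serve`. The helper `build_serve_command` is hypothetical; verify the resulting flags against `vllm serve --help` for the installed version.

```python
import shlex

# Non-default args copied from the report above.
ARGS = {
    "enable_auto_tool_choice": True,
    "tool_call_parser": "kimi_k2",
    "host": "0.0.0.0",
    "port": 8080,
    "model": "moonshotai/Kimi-K2.6",
    "trust_remote_code": True,
    "max_model_len": 16000,
    "served_model_name": ["Kimi-K2.6"],
    "download_dir": "/mnt/llm-models",
    "reasoning_parser": "kimi_k2",
    "tensor_parallel_size": 8,
    "gpu_memory_utilization": 0.95,
    "kv_cache_dtype": "fp8",
    "enable_prefix_caching": True,
    "mm_encoder_tp_mode": "data",
}


def build_serve_command(args):
    """Render the args dict as a `vllm serve` command (assumed flag naming)."""
    args = dict(args)  # do not mutate the caller's dict
    parts = ["vllm", "serve", str(args.pop("model"))]
    for key, value in args.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            # True becomes a bare switch; False switches are dropped
            # (none occur in this report).
            if value:
                parts.append(flag)
        elif isinstance(value, (list, tuple)):
            parts.append(flag)
            parts.extend(str(v) for v in value)
        else:
            parts.extend([flag, str(value)])
    return " ".join(shlex.quote(p) for p in parts)


print(build_serve_command(ARGS))
```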

🐛 Describe the bug

We are running vLLM on Talos OS v1.12.6 in a bare-metal environment with 8x A100 SXM4 GPUs connected via NVLink. Loading the Kimi K2.6 model breaks the GPUs and the NVLink fabric, with severe impact.

Key kernel logs (full logs below):

192.168.11.119: kern: warning: [2026-04-22T21:25:20.458998481Z]: NVRM: GPU at PCI:0000:48:00: GPU-5fc0bcec-c2e1-38dc-8994-2a9f9f3b709e
192.168.11.119: kern:     err: [2026-04-22T21:25:20.466440686Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Severity 1 Engine instance 12 Sub-engine instance 00
192.168.11.119: kern: warning: [2026-04-22T21:25:20.473962892Z]: NVRM: GPU Board Serial Number: 1563521020414
192.168.11.119: kern:     err: [2026-04-22T21:25:20.500049682Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Data {0x10000000, 0x10008100, 0x10000000, 0x10008100, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
192.168.11.119: kern: warning: [2026-04-22T21:25:20.505406921Z]: NVRM: Xid (PCI:0000:48:00): 62, 00002740 00002b08 00001126 0000117a 0000279f 0002a91a 00000011 00000000
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560106789Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560271408Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560426825Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560534061Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560650492Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e
192.168.11.119: kern: warning: [2026-04-22T21:25:20.56075873Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560896391Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010
192.168.11.119: kern: warning: [2026-04-22T21:25:20.561005312Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011
192.168.11.119: kern: warning: [2026-04-22T21:26:20.562739402Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.584539473Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599024717Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599137616Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599257853Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599366618Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599488292Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599595837Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599716692Z]: NVRM: Xid (PCI:0000:48:00): 74, pid=22884, name=VLLM::Worker_TP, NVLink: fatal error detected on link 8(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)
192.168.11.119: kern: warning: [2026-04-22T21:26:20.695845708Z]: NVRM: Xid (PCI:0000:48:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
192.168.11.119: kern: warning: [2026-04-22T21:27:51.261401334Z]: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
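
The highlighted lines mix NVSwitch `SXid` events with GPU `Xid` events, so counting them per device and code gives a quick overview. The following is a minimal Python sketch whose regular expression is inferred only from the line shapes visible in this report. `summarize_xids` is an illustrative helper, not an NVIDIA tool, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches GPU "Xid" and NVSwitch "SXid" messages, for example
#   NVRM: Xid (PCI:0000:48:00): 45, pid=...
#   nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Severity 1 ...
XID_RE = re.compile(r"\b(SXid|Xid) \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def summarize_xids(lines):
    """Count events per (kind, PCI address, code)."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            kind, pci, code = m.groups()
            counts[(kind, pci, int(code))] += 1
    return counts


# Sample lines copied from the highlighted kernel log.
SAMPLE = [
    "192.168.11.119: kern:     err: [2026-04-22T21:25:20.466440686Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Severity 1 Engine instance 12 Sub-engine instance 00",
    "192.168.11.119: kern: warning: [2026-04-22T21:25:20.505406921Z]: NVRM: Xid (PCI:0000:48:00): 62, 00002740 00002b08 00001126 0000117a 0000279f 0002a91a 00000011 00000000",
    "192.168.11.119: kern: warning: [2026-04-22T21:25:20.560106789Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a",
    "192.168.11.119: kern: warning: [2026-04-22T21:26:20.562739402Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a caused by previous Xid 45",
    "192.168.11.119: kern: warning: [2026-04-22T21:26:20.599716692Z]: NVRM: Xid (PCI:0000:48:00): 74, pid=22884, name=VLLM::Worker_TP, NVLink: fatal error detected on link 8(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)",
    "192.168.11.119: kern: warning: [2026-04-22T21:26:20.695845708Z]: NVRM: Xid (PCI:0000:48:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)",
]

for (kind, pci, code), n in sorted(summarize_xids(SAMPLE).items()):
    print(f"{kind:4} {pci:13} {code:6} x{n}")
```

Sorting the summary by first-seen timestamp instead of by key would show which device reported an error first.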

We are getting a lot of kernel stack traces and a core dump.

  • Model: moonshotai/Kimi-K2.6
  • vLLM container version: vllm/vllm-openai:v0.19.1
  • OS: Talos OS v1.12.6
  • NVIDIA Toolkit: nvidia-container-toolkit-lts=580.126.20-v1.18.2
  • Fabric Manager: nvidia-fabricmanager-lts=580.126.20
  • Drivers: nvidia-open-gpu-kernel-modules-lts=580.126.20-v1.12.6, nvidia-gdrdrv-device=v2.5.1

Information:

  • The NVIDIA device plugin is running inside the cluster without any issues
  • nvidia-smi works without any issues
  • Other, smaller models run without any issues
  1. Model checkpoints are loading:
Loading safetensors checkpoint shards: 100% Completed | 64/64 [13:08<00:00, 12.33s/it]
(Worker_TP0 pid=370)
(Worker_TP0 pid=370) INFO 04-22 21:17:55 [default_loader.py:384] Loading weights took 788.87 seconds
(Worker_TP0 pid=370) INFO 04-22 21:18:02 [gpu_model_runner.py:4820] Model loading took 72.02 GiB memory and 798.933240 seconds
(Worker_TP0 pid=370) INFO 04-22 21:18:02 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 4225 tokens, and profiled with 1 vision_chunk items of the maximum feature size.
(Worker_TP0 pid=370) INFO 04-22 21:18:22 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/1924dfafa8/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=370) INFO 04-22 21:18:22 [backends.py:1111] Dynamo bytecode transform time: 13.40 s
(Worker_TP0 pid=370) INFO 04-22 21:18:27 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(Worker_TP0 pid=370) INFO 04-22 21:18:38 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 15.04 s
(Worker_TP0 pid=370) INFO 04-22 21:18:43 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/f088404224c4ce7d97f6740f4677a90f65e9ce57bb5c9143590028f4b2dd20cc/rank_0_0/model
(Worker_TP0 pid=370) INFO 04-22 21:18:43 [monitor.py:48] torch.compile took 34.01 s in total
(EngineCore pid=299) INFO 04-22 21:19:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=299) INFO 04-22 21:20:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=299) INFO 04-22 21:21:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP2 pid=372) ERROR 04-22 21:21:40 [multiproc_executor.py:949] WorkerProc hit an exception.
(...)
(Worker_TP2 pid=372) ERROR 04-22 21:21:40 [multiproc_executor.py:949] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
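
With eight tensor-parallel workers logging at once, it helps to pull out the first ERROR record per process. The following is a minimal Python sketch, assuming the line prefix `(process pid=N) LEVEL MM-DD HH:MM:SS [file.py:line]` seen in the output above. The pattern is inferred from this report rather than taken from vLLM's logging code, `first_errors` is an illustrative helper, and the sample lines are abbreviated.

```python
import re

# Prefix layout inferred from the vLLM lines in this report.
LOG_RE = re.compile(
    r"^\((?P<proc>\S+) pid=(?P<pid>\d+)\) "
    r"(?P<level>[A-Z]+) "
    r"(?P<stamp>\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<source>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def first_errors(lines):
    """Return the earliest ERROR record per process, in order of appearance."""
    seen = {}
    for line in lines:
        m = LOG_RE.match(line)
        if m and m["level"] == "ERROR" and m["proc"] not in seen:
            seen[m["proc"]] = m.groupdict()
    return seen


# Abbreviated sample lines from the vLLM output above.
SAMPLE = [
    "(Worker_TP0 pid=370) INFO 04-22 21:18:43 [monitor.py:48] torch.compile took 34.01 s in total",
    "(EngineCore pid=299) INFO 04-22 21:21:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.",
    "(Worker_TP2 pid=372) ERROR 04-22 21:21:40 [multiproc_executor.py:949] WorkerProc hit an exception.",
    "(Worker_TP2 pid=372) ERROR 04-22 21:21:40 [multiproc_executor.py:949] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED",
]

for proc, rec in first_errors(SAMPLE).items():
    print(proc, rec["stamp"], rec["source"], rec["message"])
```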

Kernel error logs:

After that, the GPU is in a reset-required state; the Fabric Manager and the GPUs become usable again only after 1-2 reboots.

192.168.11.119: kern: warning: [2026-04-22T21:25:20.458998481Z]: NVRM: GPU at PCI:0000:48:00: GPU-5fc0bcec-c2e1-38dc-8994-2a9f9f3b709e
192.168.11.119: kern:     err: [2026-04-22T21:25:20.466440686Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Severity 1 Engine instance 12 Sub-engine instance 00
192.168.11.119: kern: warning: [2026-04-22T21:25:20.473962892Z]: NVRM: GPU Board Serial Number: 1563521020414
192.168.11.119: kern:     err: [2026-04-22T21:25:20.500049682Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Data {0x10000000, 0x10008100, 0x10000000, 0x10008100, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
192.168.11.119: kern: warning: [2026-04-22T21:25:20.505406921Z]: NVRM: Xid (PCI:0000:48:00): 62, 00002740 00002b08 00001126 0000117a 0000279f 0002a91a 00000011 00000000
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560106789Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560271408Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560426825Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560534061Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560650492Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e
192.168.11.119: kern: warning: [2026-04-22T21:25:20.56075873Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560896391Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010
192.168.11.119: kern: warning: [2026-04-22T21:25:20.561005312Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011
192.168.11.119: kern: warning: [2026-04-22T21:26:20.562739402Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.584539473Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599024717Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599137616Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599257853Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599366618Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599488292Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599595837Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599716692Z]: NVRM: Xid (PCI:0000:48:00): 74, pid=22884, name=VLLM::Worker_TP, NVLink: fatal error detected on link 8(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)
192.168.11.119: kern: warning: [2026-04-22T21:26:20.695845708Z]: NVRM: Xid (PCI:0000:48:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
192.168.11.119: kern: warning: [2026-04-22T21:27:51.261401334Z]: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
192.168.11.119: kern: warning: [2026-04-22T21:27:51.271840481Z]: NVRM: _kgspLogXid119: Note: Please also check logs above.
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288755476Z]: NVRM: GPU at PCI:0000:0b:00: GPU-ff3dfd6d-7042-cc8c-f3a6-84d9d6eb3a4b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288764025Z]: NVRM: GPU Board Serial Number: 1563521019267
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288777915Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s of waiting for RPC response from GPU3 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 4127 (0x20803039 0xb0).
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288813021Z]: NVRM: GPU3 GSP RPC buffer contains function 76 (GSP_RM_CONTROL) sequence 4127 and data 0x0000000020803039 0x00000000000000b0.
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288821491Z]: NVRM: GPU3 RPC history (CPU -> GSP):
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288824425Z]: NVRM:     entry function                     sequence data0              data1              ts_start           ts_end             duration actively_polling
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288830314Z]: NVRM:      0    76   GSP_RM_CONTROL              4127 0x0000000020803039 0x00000000000000b0 0x00065013160bb6e0 0x0000000000000000          y
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288836628Z]: NVRM:     -1    76   GSP_RM_CONTROL              4126 0x000000002080a026 0x0000000000000214 0x00065013118cf0d0 0x00065013118cf158    136us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288844547Z]: NVRM:     -2    76   GSP_RM_CONTROL              4125 0x000000002080a084 0x0000000000000004 0x00065013118cf063 0x00065013118cf0c2     95us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288848661Z]: NVRM:     -3    76   GSP_RM_CONTROL              4124 0x0000000020809001 0x0000000000000008 0x00065013118ceff5 0x00065013118cf053     94us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28885631Z]: NVRM:     -4    76   GSP_RM_CONTROL              4123 0x0000000020809064 0x0000000000000208 0x00065013118cef7b 0x00065013118cefe3    104us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288860424Z]: NVRM:     -5    76   GSP_RM_CONTROL              4122 0x000000002080a026 0x0000000000000214 0x00065013118ceecc 0x00065013118cef56    138us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288869094Z]: NVRM:     -6    76   GSP_RM_CONTROL              4121 0x000000002080a084 0x0000000000000004 0x00065013118cee5c 0x00065013118ceeb9     93us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288873188Z]: NVRM:     -7    76   GSP_RM_CONTROL              4120 0x0000000020809001 0x0000000000000008 0x00065013118cede3 0x00065013118cee4b    104us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288880739Z]: NVRM: GPU3 RPC event history (CPU <- GSP):
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288883742Z]: NVRM:     entry function                     sequence data0              data1              ts_start           ts_end             duration during_incomplete_rpc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288889732Z]: NVRM:      0    4098 GSP_RUN_CPU_SEQUENCER          0 0x00000000000001ea 0x0000000000003fe2 0x00065011fbc021a9 0x00065011fbc02a48   2207us  
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288900158Z]: CPU: 3 UID: 0 PID: 11833 Comm: gpu-feature-dis Tainted: G           O        6.18.18-talos #1 NONE 
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288909147Z]: Tainted: [O]=OOT_MODULE
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288912031Z]: Hardware name: HPE ProLiant XL675d Gen10 Plus/ProLiant XL675d Gen10 Plus, BIOS A47 08/07/2024
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28891896Z]: Call Trace:
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288924814Z]:  <TASK>
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288933332Z]:  dump_stack_lvl+0x5d/0x90
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288946276Z]:  _kgspRpcRecvPoll+0x725/0x870 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289079746Z]:  _issueRpcAndWait+0xdd/0x970 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289152639Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289162609Z]:  ? osGetCurrentThread+0x26/0x60 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289295522Z]:  ? rmDeviceGpuLockIsOwner+0x29/0x90 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289426652Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289431635Z]:  rpcRmApiControl_GSP+0x76f/0x940 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289507484Z]:  knvlinkExecGspRmRpc_IMPL+0x68/0x140 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289597368Z]:  knvlinkSyncLinkMasksAndVbiosInfo_IMPL+0xb7/0x1a0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289686107Z]:  nvlinkCtrlCmdBusGetNvlinkCaps+0x92/0x630 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28977Z]:  kceGetCeFromNvlinkConfig_IMPL+0x49/0xe0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28990033Z]:  knvlinkGetP2POptimalCEs_GP100+0x6c/0xf0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289993324Z]:  CliGetSystemP2pCaps+0x395/0x630 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290071203Z]:  ? CliGetSystemP2pCaps+0x11d/0x630 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290147077Z]:  cliresCtrlCmdSystemGetP2pCapsV2_IMPL+0xa2/0xf0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290223016Z]:  resControl_IMPL+0x1a9/0x1b0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290282909Z]:  serverControl+0x47e/0x590 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290345728Z]:  _rmapiRmControl+0x4f2/0x820 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290423802Z]:  rmapiControlWithSecInfo+0x79/0x140 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290497731Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290502725Z]:  rmapiControlWithSecInfoTls+0x8f/0xf0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290578033Z]:  _nv04ControlWithSecInfo+0x8d/0xa0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290649057Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290656345Z]:  ? cred_has_capability.isra.0+0xa4/0x170
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29066627Z]:  RmIoctl+0x90b/0xda0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290800879Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290805753Z]:  ? os_acquire_spinlock+0x12/0x30 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29087688Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290880774Z]:  ? portSyncSpinlockAcquire+0x18/0x30 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290942992Z]:  ? rm_ioctl+0x52/0x4f0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291069875Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291078766Z]:  rm_ioctl+0x66/0x4f0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29120365Z]:  ? __check_object_size+0x215/0x230
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291213998Z]:  nvidia_unlocked_ioctl+0x447/0x950 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291273871Z]:  __x64_sys_ioctl+0x9f/0x100
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291283179Z]:  ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291288033Z]:  do_syscall_64+0x78/0x940
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291307702Z]:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291312635Z]: RIP: 0033:0x7f8079a2d67b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291320264Z]: Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d 57 0f 00 f7 d8 64 89 01 48
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291324508Z]: RSP: 002b:00007f8005baec88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291332297Z]: RAX: ffffffffffffffda RBX: 0000000000000020 RCX: 00007f8079a2d67b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291335531Z]: RDX: 00007f8005baedd0 RSI: 00000000c020462a RDI: 000000000000000b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291342611Z]: RBP: 00007f8005baece0 R08: 00007f8005baedd0 R09: 00007f8005baedec
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291345594Z]: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8005baedd0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291351073Z]: R13: 000000000000000b R14: 00000000c020462a R15: 00007f8005baeca0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291356167Z]:  </TASK>
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291364354Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291370308Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291378137Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291383081Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291539511Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc               : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291543565Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvCpuctl           : 00000040
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291549985Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqmask          : 00040040
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291553129Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqdest          : 00000040
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291559707Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrStat      : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291562791Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrInfo      : badf1500
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291568069Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrAddr      : 0000000001e19e20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291572173Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvHubErrStat       : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291577631Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconMailbox         : 0:00000000 1:00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291581675Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqstat         : 00009000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291587264Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqmode         : 0000fc24
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291590368Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifInstblk           : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291596327Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCtl               : 00000090
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291600721Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifThrottle          : 80000064
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29160622Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkBlk           : 0:48215480 1:50125638
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291609354Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkCtl           : 0:00000000 1:00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291616123Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCg1               : 0000000f
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291619217Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 00 = 0x0000000005c27ca8
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291625875Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 01 = 0x0000000005c366cc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29162931Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 02 = 0x000000000400a35c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29163573Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 03 = 0x0000000005c366c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291638994Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 04 = 0x0000000004d4b1c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291645653Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 05 = 0x0000000004d3f670
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291648737Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 06 = 0x000000000400a35c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291657622Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 07 = 0x0000000004d3f5c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291660856Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 08 = 0x0000000004d4b1a0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291666514Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 09 = 0x0000000005c37b14
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291670789Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 10 = 0x0000000005c39948
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291676137Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 11 = 0x0000000005a07fe0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291679812Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 12 = 0x000000000400a35c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291686132Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 13 = 0x0000000005a0804c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291689526Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 14 = 0x0000000005c398c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291698842Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 15 = 0x0000000005c39c60
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291703737Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 16 = 0x0000000005c39e78
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291791108Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 17 = 0x000000000535bbf8
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291794483Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 18 = 0x0000000004d9cab0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291805401Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 19 = 0x000000000535bbcc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291810409Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 20 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291829641Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 21 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291836572Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 22 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291844715Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 23 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29184809Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 24 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291855031Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 25 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291859535Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 26 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291867569Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 27 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291870923Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 28 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291879187Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 29 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291882381Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 30 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291891544Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 31 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291894608Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 32 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291903372Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 33 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291906316Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 34 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291912866Z]: NVRM: _kgspLogXid119: ********************************************************************************
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291919134Z]: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4127!
192.168.11.119: kern: warning: [2026-04-22T21:29:21.342163186Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s of waiting for RPC response from GPU3 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 4128 (0x20803039 0xb0).
192.168.11.119: kern: warning: [2026-04-22T21:29:21.360926643Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373933805Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373945078Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373963996Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373980741Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc               : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.37399493Z]: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4128!
192.168.11.119: kern: warning: [2026-04-22T21:30:51.494604223Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s of waiting for RPC response from GPU3 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 4129 (0x20803039 0xb0).
192.168.11.119: kern: warning: [2026-04-22T21:30:51.513376162Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526390847Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526401628Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526418955Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.52643656Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc               : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526447914Z]: NVRM: nvAssertFailedNoLog: Assertion failed: Back to back GSP RPC timeout detected! GPU marked for reset @ kernel_gsp.c:2387
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526483418Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: Core is booted.
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526530125Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: RSTAT3 0x0000000000000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526545409Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: RSTAT4 0x0000000000000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.795769959Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: [ERROR] ICD Halt command failed.
192.168.11.119: kern: warning: [2026-04-22T21:30:51.803749334Z]: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4129!
192.168.11.119: kern: warning: [2026-04-22T21:30:51.870495804Z]: NVRM: Xid (PCI:0000:0b:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
192.168.11.119: kern: warning: [2026-04-22T21:37:31.524565304Z]: NVRM: nvAssertOkFailedNoLog: Assertion failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10957
192.168.11.119: kern: warning: [2026-04-22T21:37:31.583305494Z]: NVRM: nvAssertOkFailedNoLog: Assertion failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10957
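
The Xid events are easy to lose among the core-dump trace lines above. The following is a minimal sketch, not part of the original report, that pulls them out of kernel log lines in this format and counts them per PCI address. The regular expression and the function name `summarize_xids` are assumptions made for illustration, and the sample lines are shortened copies of lines from the log above.

```python
import re
from collections import Counter

# Matches driver lines such as:
#   "... NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, ..."
# The pattern is an assumption based on the log lines in this report.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def summarize_xids(lines):
    """Count Xid events, keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Shortened copies of lines from the log above.
SAMPLE_LINES = [
    "192.168.11.119: kern: warning: [2026-04-22T21:27:51.291919134Z]: "
    "NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4127!",
    "192.168.11.119: kern: warning: [2026-04-22T21:29:21.342163186Z]: "
    "NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, "
    "Timeout after 45s of waiting for RPC response from GPU3 GSP!",
    "192.168.11.119: kern: warning: [2026-04-22T21:30:51.494604223Z]: "
    "NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, "
    "Timeout after 45s of waiting for RPC response from GPU3 GSP!",
    "192.168.11.119: kern: warning: [2026-04-22T21:30:51.870495804Z]: "
    "NVRM: Xid (PCI:0000:0b:00): 154, GPU recovery action changed from "
    "0x0 (None) to 0x1 (GPU Reset Required)",
]

xid_counts = summarize_xids(SAMPLE_LINES)
for (pci, code), count in sorted(xid_counts.items()):
    print(f"PCI {pci}: Xid {code} x{count}")
```

Run against the full log, a summary like this shows which GPU reports the timeouts and whether more than one device is affected.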

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

A suggested first workaround for the GPU and NVLink failures seen while loading the Kimi 2.6 model is to lower vLLM's gpu_memory_utilization setting. This is unverified: the log shows GSP RPC timeouts (Xid 119) followed by a required GPU reset (Xid 154), which lower memory pressure may not address.

Guidance

  • Review the gpu_memory_utilization value in the vLLM launch arguments (it is an engine argument, not part of the collect_env.py output) and consider lowering it, for example to 0.8. vLLM's default is 0.9.
  • Check the tensor_parallel_size parameter and consider reducing it, for example to 4. vLLM requires the value to divide the model's attention head count evenly, so 6 is unlikely to be accepted, and fewer GPUs may not have enough memory to hold the model.
  • Verify that the nvidia-container-toolkit-lts and nvidia-fabricmanager-lts versions are compatible with the nvidia-open-gpu-kernel-modules-lts version. Fabric Manager in particular must match the driver version.
  • Consider updating nvidia-open-gpu-kernel-modules-lts to a newer release, keeping the Fabric Manager package on the same driver version.
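
The version check in the third item can be sketched as follows. This is a minimal illustration, not an official compatibility tool: it assumes that the number directly after `=` in each package pin is the NVIDIA driver version, and the helper names `driver_version` and `all_match` are hypothetical. The pins are the ones listed in the issue's environment details.

```python
import re

# Package pins as listed in the issue's environment details.
PACKAGES = [
    "nvidia-container-toolkit-lts=580.126.20-v1.18.2",
    "nvidia-fabricmanager-lts=580.126.20",
    "nvidia-open-gpu-kernel-modules-lts=580.126.20-v1.12.6",
]


def driver_version(pin):
    """Return the version that follows '=' (assumed to be the driver version)."""
    match = re.search(r"=(\d+\.\d+\.\d+)", pin)
    if match is None:
        raise ValueError(f"no driver version found in {pin!r}")
    return match.group(1)


def versions_by_package(pins):
    """Map each package name to its extracted driver version."""
    return {pin.split("=", 1)[0]: driver_version(pin) for pin in pins}


def all_match(pins):
    """True when every pin carries the same driver version."""
    return len(set(versions_by_package(pins).values())) == 1


for name, version in versions_by_package(PACKAGES).items():
    print(f"{name}: {version}")
print("all packages on one driver version:", all_match(PACKAGES))
```

Under that assumption the three pins in the report carry the same driver version, which would make a plain version mismatch a less likely explanation than a problem within that driver release.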

Example

The suggested changes are configuration-level: they are passed as vLLM engine arguments (gpu_memory_utilization, tensor_parallel_size) rather than made in code.

Notes

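The analysis attributes the failure to overutilization of GPU resources, but the log does not confirm this. What the log does show is repeated Xid 119 events (45 s timeouts waiting for an RPC response from the GPU3 GSP), a "Back to back GSP RPC timeout detected" assertion, and then Xid 154 marking the GPU as requiring a reset. Lowering gpu_memory_utilization and tensor_parallel_size is cheap to try, but suitable values depend on the hardware and workload, and a driver or firmware cause remains possible.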

Recommendation

As a workaround, lower gpu_memory_utilization below vLLM's 0.9 default, for example to 0.8, and retry loading the Kimi 2.6 model. If the Xid 119 timeouts persist, treat the driver and Fabric Manager stack as the next suspect.
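
A minimal sketch of how that workaround might be passed to vLLM, assuming the server is started through the `vllm serve` command with the `--gpu-memory-utilization` and `--tensor-parallel-size` flags. The helper `build_vllm_command` and the `<model-id>` placeholder are hypothetical; substitute the model identifier actually used in the deployment.

```python
import shlex


def build_vllm_command(model, tensor_parallel_size=8, gpu_memory_utilization=0.8):
    """Build an argv list for `vllm serve` with the suggested settings."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# Placeholder: replace with the real model identifier.
MODEL = "<model-id>"

command = build_vllm_command(MODEL)
print(shlex.join(command))
```

When running the vllm/vllm-openai container, the same flags would typically be appended to the container arguments instead of being run as a shell command.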
