vllm - 💡(How to fix) Fix [Bug]: NemotronH Super 120B crashes mid-decode when two vLLM instances run concurrently on AGX Thor unified memory [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#38489Fetched 2026-04-08 01:49:01
View on GitHub
Comments
1
Participants
2
Timeline
6
Reactions
0
Author
Participants
Timeline (top)
mentioned ×2subscribed ×2commented ×1labeled ×1

Error Message

<p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">I'm running NemotronH Super 120B NVFP4 (vLLM, port 8000) alongside Nemotron Nano 30B NVFP4 (vLLM, port 8001). When both models generate responses simultaneously — triggered via Open WebUI's side-by-side comparison mode — Super crashes mid-decode with <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">CUDA error: an illegal instruction was encountered</code>. Nano is never affected.</p> ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md backdrop-blur-md _fill_56vq7_9 _ghost_56vq7_96" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-100 scale-100" aria-hidden="true" style="flex-shrink: 0;"><path d="M12.5 3A1.5 1.5 0 0 1 14 4.5V6h1.5A1.5 1.5 0 0 1 17 7.5v8a1.5 1.5 0 0 1-1.5 1.5h-8A1.5 1.5 0 0 1 6 15.5V14H4.5A1.5 1.5 0 0 1 3 12.5v-8A1.5 1.5 0 0 1 4.5 3zm1.5 9.5a1.5 1.5 0 0 1-1.5 1.5H7v1.5a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5H14zM4.5 4a.5.5 0 0 0-.5.5v8a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-0 scale-50" aria-hidden="true" style="flex-shrink: 0;"><path d="M15.188 5.11a.5.5 0 0 1 .752.626l-.056.084-7.5 9a.5.5 0 0 1-.738.033l-3.5-3.5-.064-.078a.501.501 0 0 1 .693-.693l.078.064 3.113 3.113 7.15-8.58z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>torch.AcceleratorError: CUDA error: an illegal instruction was encountered</span></span></code></pre></div></div> </span><span>torch.AcceleratorError: CUDA error: an illegal instruction was encountered</span></code></pre></div></div> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">nn.functional.silu</code> is a trivial elementwise operation. It cannot fail due to a kernel or architecture bug. When this crashes, the tensor <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">gate</code> is backed by freed or invalid GPU memory — this is memory corruption, not a math error.</p>

Root Cause

<p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>Component:</strong> vLLM — NemotronH model implementation / CUDA memory management on unified memory architectures<br> <strong>Severity:</strong> Fatal — engine crash, complete loss of in-flight response<br> <strong>Reproducibility:</strong> Consistent under concurrent dual-model load; does NOT reproduce solo under equivalent memory pressure</p> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Context (read this first)</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">NVIDIA AGX Thor is an edge AI developer kit with 122.8 GiB of unified memory — CPU and GPU share the same physical DRAM pool. Running two inference servers on this hardware is a natural and intended use case: a large "reasoning" model for complex tasks alongside a smaller "fast" model for interactive use.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">I'm running NemotronH Super 120B NVFP4 (vLLM, port 8000) alongside Nemotron Nano 30B NVFP4 (vLLM, port 8001). When both models generate responses simultaneously — triggered via Open WebUI's side-by-side comparison mode — Super crashes mid-decode with <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">CUDA error: an illegal instruction was encountered</code>. Nano is never affected.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>The key finding that makes this worth investigating:</strong> the crash does NOT reproduce when Super runs solo under equivalent or greater memory pressure. I ran Super solo at <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">--gpu-memory-utilization 0.88</code> (leaving the same ~5-6 GiB free headroom as the dual-model config) and sent multiple long prompts generating 3000+ tokens sequentially and concurrently to the same instance. All completed cleanly at stable 13.5 t/s. The crash is specifically triggered by two CUDA contexts running simultaneously against the same unified memory pool — not by memory pressure alone.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">This suggests the bug is in how vLLM or the CUDA runtime handles concurrent memory allocation from two separate processes sharing a unified physical address space, rather than a NemotronH kernel bug or a memory capacity issue.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">I'm filing this because the evidence is clean and the reproduction condition is well-characterized, even though I can't provide a minimal reproduction case. If this is a known limitation of running two CUDA processes on unified memory, I'd appreciate a pointer to the relevant documentation — I haven't found any warnings about this in vLLM's docs for Jetson/unified memory deployments.</p> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Environment</h2> <div class="overflow-x-auto w-full px-2 mb-6"> Field | Value -- | -- Hardware | NVIDIA AGX Thor Developer Kit GPU architecture | Blackwell SM110a (compute capability sm_100) Unified memory | 122.8 GiB (CPU and GPU share the same physical pool — no discrete VRAM) L4T / JetPack | R39.0 / JetPack 7.2 CUDA | 13.2 OS | Ubuntu 24.04 aarch64 (SBSA) vLLM | 0.19.0 FlashInfer | 0.6.7 Container | jetson-containers vLLM build
RAW_BUFFERClick to expand / collapse

Your current environment

<p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>Component:</strong> vLLM — NemotronH model implementation / CUDA memory management on unified memory architectures<br> <strong>Severity:</strong> Fatal — engine crash, complete loss of in-flight response<br> <strong>Reproducibility:</strong> Consistent under concurrent dual-model load; does NOT reproduce solo under equivalent memory pressure</p> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Context (read this first)</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">NVIDIA AGX Thor is an edge AI developer kit with 122.8 GiB of unified memory — CPU and GPU share the same physical DRAM pool. Running two inference servers on this hardware is a natural and intended use case: a large "reasoning" model for complex tasks alongside a smaller "fast" model for interactive use.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">I'm running NemotronH Super 120B NVFP4 (vLLM, port 8000) alongside Nemotron Nano 30B NVFP4 (vLLM, port 8001). When both models generate responses simultaneously — triggered via Open WebUI's side-by-side comparison mode — Super crashes mid-decode with <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">CUDA error: an illegal instruction was encountered</code>. Nano is never affected.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>The key finding that makes this worth investigating:</strong> the crash does NOT reproduce when Super runs solo under equivalent or greater memory pressure. I ran Super solo at <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">--gpu-memory-utilization 0.88</code> (leaving the same ~5-6 GiB free headroom as the dual-model config) and sent multiple long prompts generating 3000+ tokens sequentially and concurrently to the same instance. All completed cleanly at stable 13.5 t/s. The crash is specifically triggered by two CUDA contexts running simultaneously against the same unified memory pool — not by memory pressure alone.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">This suggests the bug is in how vLLM or the CUDA runtime handles concurrent memory allocation from two separate processes sharing a unified physical address space, rather than a NemotronH kernel bug or a memory capacity issue.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">I'm filing this because the evidence is clean and the reproduction condition is well-characterized, even though I can't provide a minimal reproduction case. If this is a known limitation of running two CUDA processes on unified memory, I'd appreciate a pointer to the relevant documentation — I haven't found any warnings about this in vLLM's docs for Jetson/unified memory deployments.</p> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Environment</h2> <div class="overflow-x-auto w-full px-2 mb-6"> Field | Value -- | -- Hardware | NVIDIA AGX Thor Developer Kit GPU architecture | Blackwell SM110a (compute capability sm_100) Unified memory | 122.8 GiB (CPU and GPU share the same physical pool — no discrete VRAM) L4T / JetPack | R39.0 / JetPack 7.2 CUDA | 13.2 OS | Ubuntu 24.04 aarch64 (SBSA) vLLM | 0.19.0 FlashInfer | 0.6.7 Container | jetson-containers vLLM build </div> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Models</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>Super (crashes):</strong></p> <ul class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2">Model: Nemotron-H Super 120B NVFP4+FP8 checkpoint</li> <li class="whitespace-normal break-words pl-2">Quantization: <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">modelopt_mixed</code> (NVFP4 weights, FP8 KV cache)</li> <li class="whitespace-normal break-words pl-2">Architecture: NemotronH hybrid — 23 Mamba2 SSM layers, 40 MoE layers, 6 attention layers</li> </ul> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>Nano (unaffected):</strong></p> <ul class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2">Model: Nemotron-H Nano 30B NVFP4</li> <li class="whitespace-normal break-words pl-2">Same quantization scheme, smaller model</li> </ul> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Launch Configuration</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>Super (port 8000):</strong></p> <div role="group" aria-label="Code" tabindex="0" class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg focus:outline-none focus-visible:ring-2 focus-visible:ring-accent-100"><div class="sticky opacity-0 group-hover/copy:opacity-100 group-focus-within/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative isolate shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md backdrop-blur-md _fill_56vq7_9 _ghost_56vq7_96" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-100 scale-100" aria-hidden="true" style="flex-shrink: 0;"><path d="M12.5 3A1.5 1.5 0 0 1 14 4.5V6h1.5A1.5 1.5 0 0 1 17 7.5v8a1.5 1.5 0 0 1-1.5 1.5h-8A1.5 1.5 0 0 1 6 15.5V14H4.5A1.5 1.5 0 0 1 3 12.5v-8A1.5 1.5 0 0 1 4.5 3zm1.5 9.5a1.5 1.5 0 0 1-1.5 1.5H7v1.5a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5H14zM4.5 4a.5.5 0 0 0-.5.5v8a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-0 scale-50" aria-hidden="true" style="flex-shrink: 0;"><path d="M15.188 5.11a.5.5 0 0 1 .752.626l-.056.084-7.5 9a.5.5 0 0 1-.738.033l-3.5-3.5-.064-.078a.501.501 0 0 1 .693-.693l.078.064 3.113 3.113 7.15-8.58z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>--model /data/models/hf/nvidia/nemotron/super-120b </span></span><span>--served-model-name nemotron-super </span><span>--quantization modelopt_mixed </span><span>--kv-cache-dtype fp8_e4m3 </span><span>--dtype bfloat16 </span><span>--max-model-len 16384 </span><span>--gpu-memory-utilization 0.58 </span><span>--attention-backend TRITON_ATTN </span><span>--trust-remote-code </span><span>--reasoning-parser super_v3</span></code></pre></div></div> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>Nano (port 8001):</strong></p> <div role="group" aria-label="Code" tabindex="0" class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg focus:outline-none focus-visible:ring-2 focus-visible:ring-accent-100"><div class="sticky opacity-0 group-hover/copy:opacity-100 group-focus-within/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative isolate shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md backdrop-blur-md _fill_56vq7_9 _ghost_56vq7_96" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-100 scale-100" aria-hidden="true" style="flex-shrink: 0;"><path d="M12.5 3A1.5 1.5 0 0 1 14 4.5V6h1.5A1.5 1.5 0 0 1 17 7.5v8a1.5 1.5 0 0 1-1.5 1.5h-8A1.5 1.5 0 0 1 6 15.5V14H4.5A1.5 1.5 0 0 1 3 12.5v-8A1.5 1.5 0 0 1 4.5 3zm1.5 9.5a1.5 1.5 0 0 1-1.5 1.5H7v1.5a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5H14zM4.5 4a.5.5 0 0 0-.5.5v8a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-0 scale-50" aria-hidden="true" style="flex-shrink: 0;"><path d="M15.188 5.11a.5.5 0 0 1 .752.626l-.056.084-7.5 9a.5.5 0 0 1-.738.033l-3.5-3.5-.064-.078a.501.501 0 0 1 .693-.693l.078.064 3.113 3.113 7.15-8.58z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>--model /data/models/hf/nvidia/nemotron/nano-30b </span></span><span>--served-model-name nemotron-nano-30b </span><span>--quantization modelopt_mixed </span><span>--kv-cache-dtype fp8_e4m3 </span><span>--dtype bfloat16 </span><span>--max-model-len 8192 </span><span>--gpu-memory-utilization 0.17 </span><span>--attention-backend TRITON_ATTN </span><span>--trust-remote-code</span></code></pre></div></div> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Combined pre-allocation: ~71 GB (Super) + ~21 GB (Nano) = ~92 GB of 122.8 GB, leaving ~5-6 GB free for OS and dynamic allocations.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Environment variables (both containers):</p> <div role="group" aria-label="Code" tabindex="0" class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg focus:outline-none focus-visible:ring-2 focus-visible:ring-accent-100"><div class="sticky opacity-0 group-hover/copy:opacity-100 group-focus-within/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative isolate shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md backdrop-blur-md _fill_56vq7_9 _ghost_56vq7_96" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-100 scale-100" aria-hidden="true" style="flex-shrink: 0;"><path d="M12.5 3A1.5 1.5 0 0 1 14 4.5V6h1.5A1.5 1.5 0 0 1 17 7.5v8a1.5 1.5 0 0 1-1.5 1.5h-8A1.5 1.5 0 0 1 6 15.5V14H4.5A1.5 1.5 0 0 1 3 12.5v-8A1.5 1.5 0 0 1 4.5 3zm1.5 9.5a1.5 1.5 0 0 1-1.5 1.5H7v1.5a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5H14zM4.5 4a.5.5 0 0 0-.5.5v8a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-0 scale-50" aria-hidden="true" style="flex-shrink: 0;"><path d="M15.188 5.11a.5.5 0 0 1 .752.626l-.056.084-7.5 9a.5.5 0 0 1-.738.033l-3.5-3.5-.064-.078a.501.501 0 0 1 .693-.693l.078.064 3.113 3.113 7.15-8.58z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>FLASHINFER_DISABLE_VERSION_CHECK=1 </span></span><span>NINJA_MAX_JOBS=1 </span><span>MAX_JOBS=1</span></code></pre></div></div> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Reproduction Steps</h2> <ol class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2">Start both containers as above</li> <li class="whitespace-normal break-words pl-2">Open Open WebUI (or any client that can send the same prompt to two endpoints simultaneously)</li> <li class="whitespace-normal break-words pl-2">Select both <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">nemotron-super</code> and <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">nemotron-nano-30b</code> in side-by-side mode</li> <li class="whitespace-normal break-words pl-2">Send any prompt that generates a moderately long response</li> <li class="whitespace-normal break-words pl-2">Super crashes mid-decode; Nano completes normally</li> </ol> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The crash is consistent — it has reproduced on every attempt at concurrent generation across multiple container restarts and configuration variations.</p> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">What the Crash Looks Like</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Super generates tokens for a while (anywhere from ~30 to ~318 tokens observed), then EngineCore aborts with:</p> <div role="group" aria-label="Code" tabindex="0" class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg focus:outline-none focus-visible:ring-2 focus-visible:ring-accent-100"><div class="sticky opacity-0 group-hover/copy:opacity-100 group-focus-within/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative isolate shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md backdrop-blur-md _fill_56vq7_9 _ghost_56vq7_96" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-100 scale-100" aria-hidden="true" style="flex-shrink: 0;"><path d="M12.5 3A1.5 1.5 0 0 1 14 4.5V6h1.5A1.5 1.5 0 0 1 17 7.5v8a1.5 1.5 0 0 1-1.5 1.5h-8A1.5 1.5 0 0 1 6 15.5V14H4.5A1.5 1.5 0 0 1 3 12.5v-8A1.5 1.5 0 0 1 4.5 3zm1.5 9.5a1.5 1.5 0 0 1-1.5 1.5H7v1.5a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5H14zM4.5 4a.5.5 0 0 0-.5.5v8a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-0 scale-50" aria-hidden="true" style="flex-shrink: 0;"><path d="M15.188 5.11a.5.5 0 0 1 .752.626l-.056.084-7.5 9a.5.5 0 0 1-.738.033l-3.5-3.5-.064-.078a.501.501 0 0 1 .693-.693l.078.064 3.113 3.113 7.15-8.58z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>torch.AcceleratorError: CUDA error: an illegal instruction was encountered</span></span></code></pre></div></div> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The crash location varies across runs — sometimes in the MoE expert GEMM path, sometimes in a Mamba SSM layer doing a simple elementwise operation:</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>Crash in Mamba layer (most diagnostic — trivial operation):</strong></p> <div role="group" aria-label="Code" tabindex="0" class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg focus:outline-none focus-visible:ring-2 focus-visible:ring-accent-100"><div class="sticky opacity-0 group-hover/copy:opacity-100 group-focus-within/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative isolate shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md backdrop-blur-md _fill_56vq7_9 _ghost_56vq7_96" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-100 scale-100" aria-hidden="true" style="flex-shrink: 0;"><path d="M12.5 3A1.5 1.5 0 0 1 14 4.5V6h1.5A1.5 1.5 0 0 1 17 7.5v8a1.5 1.5 0 0 1-1.5 1.5h-8A1.5 1.5 0 0 1 6 15.5V14H4.5A1.5 1.5 0 0 1 3 12.5v-8A1.5 1.5 0 0 1 4.5 3zm1.5 9.5a1.5 1.5 0 0 1-1.5 1.5H7v1.5a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5H14zM4.5 4a.5.5 0 0 0-.5.5v8a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-0 scale-50" aria-hidden="true" style="flex-shrink: 0;"><path d="M15.188 5.11a.5.5 0 0 1 .752.626l-.056.084-7.5 9a.5.5 0 0 1-.738.033l-3.5-3.5-.064-.078a.501.501 0 0 1 .693-.693l.078.064 3.113 3.113 7.15-8.58z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>nemotron_h.py:426 output = self.mixer(hidden_states) </span></span><span>mamba_mixer2.py:546 hidden_states = self.norm(ssm_output, gate) </span><span>mamba_mixer2.py:102 x = x * nn.functional.silu(gate.to(torch.float32)) </span><span> ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ </span><span>torch.AcceleratorError: CUDA error: an illegal instruction was encountered</span></code></pre></div></div> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">nn.functional.silu</code> is a trivial elementwise operation. It cannot fail due to a kernel or architecture bug. When this crashes, the tensor <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">gate</code> is backed by freed or invalid GPU memory — this is memory corruption, not a math error.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>Crash in MoE GEMM path:</strong></p> <div role="group" aria-label="Code" tabindex="0" class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg focus:outline-none focus-visible:ring-2 focus-visible:ring-accent-100"><div class="sticky opacity-0 group-hover/copy:opacity-100 group-focus-within/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative isolate shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md backdrop-blur-md _fill_56vq7_9 _ghost_56vq7_96" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-100 scale-100" aria-hidden="true" style="flex-shrink: 0;"><path d="M12.5 3A1.5 1.5 0 0 1 14 4.5V6h1.5A1.5 1.5 0 0 1 17 7.5v8a1.5 1.5 0 0 1-1.5 1.5h-8A1.5 1.5 0 0 1 6 15.5V14H4.5A1.5 1.5 0 0 1 3 12.5v-8A1.5 1.5 0 0 1 4.5 3zm1.5 9.5a1.5 1.5 0 0 1-1.5 1.5H7v1.5a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5H14zM4.5 4a.5.5 0 0 0-.5.5v8a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-0 scale-50" aria-hidden="true" style="flex-shrink: 0;"><path d="M15.188 5.11a.5.5 0 0 1 .752.626l-.056.084-7.5 9a.5.5 0 0 1-.738.033l-3.5-3.5-.064-.078a.501.501 0 0 1 .693-.693l.078.064 3.113 3.113 7.15-8.58z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>nemotron_h.py:250 self.experts(...) </span></span><span>flashinfer_cutlass_moe.py:369 flashinfer_cutlass_fused_moe() </span><span>RuntimeError: Failed to initialize cutlass TMA WS grouped gemm </span><span>TMA descriptor: gmem_address = 0 ← null pointer</span></code></pre></div></div> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Again, CUTLASS correctly rejected a null buffer — the pointer was corrupted before reaching the kernel.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><strong>The non-deterministic crash location across architecturally unrelated components (MoE GEMM and Mamba SSM) is the key diagnostic signal.</strong> A kernel bug fails at the same spot every time. Memory corruption fails wherever the bad pointer is first dereferenced.</p> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Evidence That This Is Not a Memory Capacity Issue</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Solo stress test performed after characterizing the dual-model crash:</p> <ul class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2">Super launched solo at <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">--gpu-memory-utilization 0.88</code> (~108 GB reserved, ~5-6 GB free — same headroom as dual-model config)</li> <li class="whitespace-normal break-words pl-2">Multiple long prompts sent requesting 3000 output tokens each</li> <li class="whitespace-normal break-words pl-2">Two requests sent concurrently to the same Solo Super instance</li> <li class="whitespace-normal break-words pl-2">All completed successfully at stable 13.5 t/s</li> <li class="whitespace-normal break-words pl-2">No crashes, no errors across multiple runs</li> </ul> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The only difference between "crashes" and "works" is whether a second vLLM CUDA context (Nano) is present in the same unified memory pool during generation.</p> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">What Has Been Ruled Out</h2> <ul class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2">✅ <strong>Cudagraph replay invoking wrong-arch kernel</strong> — <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">--enforce-eager</code> confirmed, crashes reproduced identically with <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">cudagraph_mode=NONE</code></li> <li class="whitespace-normal break-words pl-2">✅ <strong>SM80 kernel fallback</strong> — crash appears in Mamba layers unrelated to attention backends</li> <li class="whitespace-normal break-words pl-2">✅ <strong>Wrong CUTLASS tactic selection</strong> — crash also appears in Mamba where CUTLASS is not involved</li> <li class="whitespace-normal break-words pl-2">✅ <strong>Stale FlashInfer JIT cache</strong> — cleared, 5.3-hour fresh autotune performed, crashes continued</li> <li class="whitespace-normal break-words pl-2">✅ <strong>Memory capacity (total headroom)</strong> — solo stress test at equivalent headroom runs cleanly</li> </ul> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Hypothesis</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Two vLLM containers simultaneously allocating from the same unified physical memory pool create a condition the CUDA runtime or vLLM's allocator does not handle correctly. On discrete GPU hardware, each container gets isolated VRAM — there is no equivalent shared physical address space. On AGX Thor, both CUDA contexts read and write the same physical DRAM. Dynamic allocations during decode (Mamba SSM state updates, CUDA scratch space for GEMM kernels) from two competing contexts may produce race conditions or pointer corruption that manifests as <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">cudaErrorIllegalInstruction</code> at whichever layer first dereferences the bad address.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">This could be a vLLM issue (insufficient isolation between processes), a CUDA runtime issue (unified memory allocator not handling concurrent multi-context allocation safely), or a JetPack/driver issue specific to Thor's unified memory implementation.</p> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Real-Time Metrics at Time of Crash</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">From 30-second Prometheus polling (<code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">/metrics</code> endpoint) during a concurrent session:</p> <div role="group" aria-label="Code" tabindex="0" class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg focus:outline-none focus-visible:ring-2 focus-visible:ring-accent-100"><div class="sticky opacity-0 group-hover/copy:opacity-100 group-focus-within/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative isolate shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md backdrop-blur-md _fill_56vq7_9 _ghost_56vq7_96" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-100 scale-100" aria-hidden="true" style="flex-shrink: 0;"><path d="M12.5 3A1.5 1.5 0 0 1 14 4.5V6h1.5A1.5 1.5 0 0 1 17 7.5v8a1.5 1.5 0 0 1-1.5 1.5h-8A1.5 1.5 0 0 1 6 15.5V14H4.5A1.5 1.5 0 0 1 3 12.5v-8A1.5 1.5 0 0 1 4.5 3zm1.5 9.5a1.5 1.5 0 0 1-1.5 1.5H7v1.5a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5H14zM4.5 4a.5.5 0 0 0-.5.5v8a.5.5 0 0 0 .5.5h8a.5.5 0 0 0 .5-.5v-8a.5.5 0 0 0-.5-.5z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" class="transition-all opacity-0 scale-50" aria-hidden="true" style="flex-shrink: 0;"><path d="M15.188 5.11a.5.5 0 0 1 .752.626l-.056.084-7.5 9a.5.5 0 0 1-.738.033l-3.5-3.5-.064-.078a.501.501 0 0 1 .693-.693l.078.064 3.113 3.113 7.15-8.58z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(234, 236, 240); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>22:08:12Z Super: requests_running=1, total_gen_tokens=255 — generating normally </span></span><span>22:08:22Z Super: total_gen_tokens=301 — still generating (~4.6 t/s) </span><span>22:08:33Z Super: total_gen_tokens=346 </span><span>22:08:43Z Super: total_gen_tokens=380, mem_used dropped 126.3→122.7 GB — CRASH + exit </span><span>22:08:53Z Super: port unreachable. Nano: continuing normally at ~42 t/s</span></code></pre></div></div> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Nano was actively generating throughout and was completely unaffected.</p> <hr class="border-border-200 border-t-0.5 my-3 mx-1.5"> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Questions for Maintainers</h2> <ol class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2">Is concurrent multi-process vLLM on unified memory (Jetson) a tested or supported configuration?</li> <li class="whitespace-normal break-words pl-2">Does vLLM use any inter-process memory isolation or locking when multiple instances share a physical address space?</li> <li class="whitespace-normal break-words pl-2">Are Mamba SSM state buffers allocated from vLLM's pre-reserved KV cache pool or from the general CUDA heap? (The latter would make them vulnerable to concurrent allocation races.)</li> <li class="whitespace-normal break-words pl-2">Is there a known-safe way to run two vLLM instances on AGX Thor, such as explicit memory partitioning or process isolation flags?</li></ol>

🐛 Describe the bug

NemotronH Super 120B (NVFP4) crashes mid-decode with cudaErrorIllegalInstruction when two vLLM instances run concurrently on NVIDIA AGX Thor (unified memory). Crash does not reproduce when Super runs solo under equivalent memory pressure, isolating the trigger to concurrent CUDA contexts sharing the same physical memory pool rather than memory capacity. Full details, stack traces, and supporting metrics in the report below.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the issue of concurrent multi-process vLLM on unified memory (Jetson) causing crashes due to memory corruption, we will implement the following steps:

  • Memory Partitioning: Explicitly partition the unified memory to prevent concurrent allocation races between the two vLLM instances. This can be achieved by setting the --gpu-memory-utilization flag for each instance to ensure they do not overlap in memory usage.
  • Process Isolation: Utilize process isolation flags or mechanisms to prevent the two vLLM instances from accessing each other's memory spaces. This can be done using Linux namespaces or cgroups to isolate the processes.
  • Mamba SSM State Buffer Allocation: Modify the Mamba SSM state buffer allocation to use a pre-reserved KV cache pool instead of the general CUDA heap. This will reduce the likelihood of concurrent allocation races.

Example code for memory partitioning using the --gpu-memory-utilization flag:

# Launch Super instance with 60% GPU memory utilization
super_instance = subprocess.Popen([
    "vllm",
    "--model", "/data/models/hf/nvidia/nemotron/super-120b",
    "--gpu-memory-utilization", "0.6",
    # ... other flags ...
])

# Launch Nano instance with 20% GPU memory utilization
nano_instance = subprocess.Popen([
    "vllm",
    "--model", "/data/models/hf/nvidia/nemotron/nano-30b",
    "--gpu-memory-utilization", "0.2",
    # ... other flags ...
])

By implementing these measures, we can reduce the likelihood of memory corruption and crashes caused by concurrent multi-process vLLM on unified memory (Jetson).

Verification

To verify that the fix worked, we will:

  1. Launch the two vLLM instances with the modified memory partitioning and process isolation settings.
  2. Run concurrent inference workloads on both instances.
  3. Monitor the system for crashes or memory corruption issues.
  4. Verify that the instances can run concurrently without crashes or errors.

If the fix is successful, the instances should be able to run concurrently without crashes or memory corruption issues.

Extra Tips

To prevent similar issues in the future, it is recommended to:

  • Regularly test and validate concurrent multi-process vLLM workloads on unified memory (Jetson) configurations.
  • Implement robust memory management and allocation mechanisms to prevent concurrent allocation races.
  • Utilize process isolation and memory partitioning techniques to prevent memory corruption and crashes.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING