vllm [Bug]: Stale Triton kernel cache on DGX Spark (sm_121) produces silently garbled outputs; wiping ~/.triton/cache restores correctness

vllm-project/vllm#41871 (fetched 2026-05-07 03:32:15) · Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Summary

On NVIDIA DGX Spark (GB10, sm_121) running torch 2.11.0+cu130 + triton 3.6.0, gpt-oss-20b returned semantically corrupted outputs — foreign-script glyphs mixed into otherwise English text, malformed Harmony tool-call headers (e.g. <|constrain|>1 instead of <|constrain|>json), and intermittent 500s in the responses parser. Wiping ~/.triton/cache and restarting restored fully clean, well-formed outputs with zero garbled-token warnings.

Environment

  • Hardware: NVIDIA GB10 (Blackwell, compute capability 12.1 / sm_121)
  • Arch: aarch64 Linux
  • torch: 2.11.0+cu130 (torch.cuda.get_arch_list() returns [..., 'sm_120']; sm_121 is not present, so kernels fall back to sm_120 PTX and are JIT-compiled)
  • triton: 3.6.0
  • vLLM: 0.16.0rc1.dev2814+ge151cd6a6
  • Model: openai/gpt-oss-20b
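
The arch fallback noted in the torch entry can be checked programmatically. The following is a minimal sketch, not part of vLLM: `needs_ptx_fallback` and `report` are hypothetical helper names, and the torch import is deferred so the comparison logic can be exercised on a machine without a GPU.

```python
from typing import Iterable, Tuple


def needs_ptx_fallback(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True when the torch build has no native sm_XY entry for the device."""
    major, minor = capability
    native = f"sm_{major}{minor}"
    return native not in set(arch_list)


def report() -> None:
    """Print the device capability next to the architectures torch was built for."""
    # Deferred import: needs a CUDA-enabled torch build and a visible GPU.
    import torch

    cap = torch.cuda.get_device_capability()
    archs = torch.cuda.get_arch_list()
    print(f"device capability:   sm_{cap[0]}{cap[1]}")
    print(f"torch arch list:     {archs}")
    print(f"PTX fallback likely: {needs_ptx_fallback(cap, archs)}")
```

On the host described here, calling `report()` would be expected to show a capability of sm_121 against an arch list that stops at sm_120, which is the mismatch this report is about.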

Reproducer

  1. On a GB10 host with stale ~/.triton/cache (entries built by an older torch/triton combo), run gpt-oss-20b.
  2. Send any chat completion or /v1/responses request.
  3. Observe: outputs contain mixed-script garble; tool-call headers are malformed enough to trip the Harmony parser.
  4. Stop server, rm -rf ~/.triton/cache, restart.
  5. Outputs are clean and well-formed; zero garbled-token warnings in the same code paths.
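
Step 4 can also be done in a way that keeps the stale artifacts for later inspection. The sketch below renames the cache directory instead of deleting it, which is a deliberate variation on the `rm -rf` in the reproducer; the helper names are hypothetical. It assumes the cache lives at ~/.triton/cache unless the TRITON_CACHE_DIR environment variable points elsewhere. Run it only while the server is stopped.

```python
import os
import time
from pathlib import Path
from typing import Optional


def triton_cache_dir() -> Path:
    """Resolve the cache directory: TRITON_CACHE_DIR if set, else ~/.triton/cache."""
    override = os.environ.get("TRITON_CACHE_DIR")
    return Path(override) if override else Path.home() / ".triton" / "cache"


def set_aside_cache(cache_dir: Path) -> Optional[Path]:
    """Rename the cache directory so the next run recompiles every kernel.

    Returns the backup path, or None if there was no cache directory to move.
    """
    if not cache_dir.is_dir():
        return None
    # Timestamp suffix keeps repeated backups from colliding.
    backup = cache_dir.with_name(f"{cache_dir.name}.stale-{time.time_ns()}")
    cache_dir.rename(backup)
    return backup


if __name__ == "__main__":
    moved = set_aside_cache(triton_cache_dir())
    print(f"moved stale cache to {moved}" if moved else "no cache directory found")
```

Keeping the old directory makes it possible to diff stale and fresh artifacts afterwards, which may help anyone trying to confirm the cause upstream.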

Likely cause

Triton's autotune/JIT cache key does not reflect every variable that actually changes the compiled PTX → SASS path on sm_121 fallback. Artifacts compiled under one (torch, triton, driver, arch-fallback) combination get reused later under a slightly different combination, silently producing wrong code on sm_121.
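
To illustrate the kind of collision hypothesized above, here is a toy model of a cache key. It is not Triton's actual key derivation; the field names (`target_arch`, `device_cap`) and the `cache_key` helper are assumptions made up for the example. The point is only that a key which omits a field cannot distinguish two environments that differ in that field.

```python
import hashlib
import json


def cache_key(source: str, fields: dict) -> str:
    """Hash the kernel source together with the environment fields the key tracks."""
    payload = json.dumps({"src": source, "env": fields}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


SRC = "kernel body"

# Two environments that differ only in the physical device capability.
env_a = {"triton": "3.6.0", "target_arch": "sm_120", "device_cap": "12.0"}
env_b = {"triton": "3.6.0", "target_arch": "sm_120", "device_cap": "12.1"}


def narrow(env: dict) -> dict:
    """Drop device_cap, modelling a key that does not track it."""
    return {k: v for k, v in env.items() if k != "device_cap"}


# The narrow key maps both environments to the same cache entry.
collides = cache_key(SRC, narrow(env_a)) == cache_key(SRC, narrow(env_b))
# Including the extra field separates them.
distinct = cache_key(SRC, env_a) != cache_key(SRC, env_b)
print(collides, distinct)  # prints: True True
```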

How we discovered it

While testing gpt-oss-20b → /v1/responses with the file_search tool, the model emitted random non-English tokens mid-generation, and our Harmony parser raised on malformed tool-call headers (<|constrain|>1<|call|>...). Adding token-skip recovery in the responses/context.py paths kept the server from returning 500s, but the bad tokens still reached the output. After ruling out attention backend swaps (gpt-oss + mxfp4 only exposes TRITON_ATTN), we wiped ~/.triton/cache. On the next run the Harmony stream was clean, tool calls were valid, and there were no parser recovery warnings.
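
Both symptoms described above can be flagged with a lightweight triage check on raw model output. The sketch below is a heuristic only, not part of vLLM or the Harmony parser, and the function names are hypothetical. It assumes that "json" is the only well-formed constraint type, based on the single example in this report, and that the prompt was English-only, so any non-Latin letter is suspicious.

```python
import re
import unicodedata
from typing import List

# Assumption: a well-formed header reads <|constrain|>json; anything else is flagged.
CONSTRAIN_RE = re.compile(r"<\|constrain\|>(?!json\b)")


def has_malformed_constrain(text: str) -> bool:
    """True if a <|constrain|> marker is followed by something other than 'json'."""
    return CONSTRAIN_RE.search(text) is not None


def non_latin_letters(text: str) -> List[str]:
    """Return alphabetic characters whose Unicode name does not start with LATIN."""
    return [
        ch
        for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]


def looks_garbled(text: str) -> bool:
    """Combine both heuristics into a single flag for logging or alerting."""
    return has_malformed_constrain(text) or bool(non_latin_letters(text))
```

A check like this would misfire on legitimately multilingual output, so it is best limited to test prompts where English-only responses are expected.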

Proposed fix direction

Either (a) extend the Triton cache key to include device-capability-vs-target-arch info so sm_121-vs-sm_120 mismatches invalidate cache entries, or (b) have vLLM detect an arch mismatch at startup and proactively invalidate.
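
Option (b) could be approximated outside vLLM with a fingerprint file stored next to the cache directory. The following is a minimal sketch under stated assumptions: `ensure_cache_matches`, `stamp_path`, and the stamp file name are hypothetical, and the contents of the fingerprint are up to the caller. A cache with no stamp is treated as stale, which is the conservative choice.

```python
import json
import shutil
from pathlib import Path
from typing import Dict


def stamp_path(cache_dir: Path) -> Path:
    """Location of the fingerprint file: a sibling of the cache directory."""
    return cache_dir.with_name(cache_dir.name + ".fingerprint.json")


def ensure_cache_matches(cache_dir: Path, fingerprint: Dict[str, str]) -> bool:
    """Wipe cache_dir when the stored fingerprint differs from the current one.

    Returns True if an existing cache directory was invalidated.
    """
    stamp = stamp_path(cache_dir)
    stored = None
    if stamp.is_file():
        try:
            stored = json.loads(stamp.read_text())
        except json.JSONDecodeError:
            stored = None  # unreadable stamp: treat as a mismatch

    invalidated = False
    if cache_dir.is_dir() and stored != fingerprint:
        shutil.rmtree(cache_dir)
        invalidated = True

    # Recreate the directory and record the environment it now belongs to.
    cache_dir.mkdir(parents=True, exist_ok=True)
    stamp.write_text(json.dumps(fingerprint, sort_keys=True))
    return invalidated
```

A caller would build the fingerprint before the server starts, for example from `torch.__version__`, `triton.__version__`, `torch.cuda.get_device_capability()`, and `torch.cuda.get_arch_list()`, all converted to strings. This trades some recompilation time after every environment change for never reusing artifacts built under a different combination.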

Related (different symptoms, same hardware)

  • #36821 — No sm_121 support on aarch64
  • #33857 — Triton allocator error on DGX Spark
  • #37754 — FlashInfer + MTP crash on SM121

AI assistance was used to draft this report.
