vllm [Bug]: Stale Triton kernel cache on DGX Spark (sm_121) produces silently garbled outputs; wiping ~/.triton/cache restores correctness

franciscojavierarceo · 2026-05-06T20:45:57Z


vllm-project/vllm#41871•Fetched 2026-05-07 03:32:15

Summary

On NVIDIA DGX Spark (GB10, sm_121) running torch 2.11.0+cu130 + triton 3.6.0, gpt-oss-20b returned semantically corrupted outputs: foreign-script glyphs mixed into otherwise English text, malformed Harmony tool-call headers (e.g. `<|constrain|>1` instead of `<|constrain|>json`), and intermittent 500s in the responses parser. Wiping `~/.triton/cache` and restarting restored fully clean, well-formed outputs with zero garbled-token warnings.

Environment

- Hardware: NVIDIA GB10 (Blackwell, compute capability 12.1 / sm_121)
- Arch: aarch64 Linux
- torch: 2.11.0+cu130 (`torch.cuda.get_arch_list()` = `[..., 'sm_120']`; sm_121 is not present, so it falls back to sm_120 PTX → JIT)
- triton: 3.6.0
- vLLM: 0.16.0rc1.dev2814+ge151cd6a6
- Model: `openai/gpt-oss-20b`
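
The arch fallback noted in the torch entry can be checked programmatically. The following is a minimal sketch, not something taken from vLLM or torch: `capability_to_arch`, `native_arch_missing`, and `check_local_device` are hypothetical helper names. The only torch calls used are `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`, and torch is imported lazily so the comparison logic can be exercised on a machine without a GPU.

```python
from typing import Iterable, Tuple


def capability_to_arch(capability: Tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into an 'sm_XY' string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def native_arch_missing(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True when the torch build has no entry for the device's arch.

    Hypothetical helper: True means kernels run through a fallback path
    (for example sm_120 PTX that is JIT-compiled on an sm_121 device).
    """
    return capability_to_arch(capability) not in set(arch_list)


def check_local_device() -> bool:
    """Run the check on the first visible CUDA device (needs torch and a GPU)."""
    import torch  # imported lazily so the helpers above work without torch

    capability = torch.cuda.get_device_capability(0)
    return native_arch_missing(capability, torch.cuda.get_arch_list())
```

On the environment above, `native_arch_missing((12, 1), [..., "sm_120"])` would be expected to return `True`, which is the condition the report associates with the stale-cache problem.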

Reproducer

1. On a GB10 host with a stale `~/.triton/cache` (entries built by an older torch/triton combo), run gpt-oss-20b.
2. Send any chat completion or `/v1/responses` request.
3. Observe: outputs contain mixed-script garble; tool-call headers are malformed enough to trip the Harmony parser.
4. Stop the server, run `rm -rf ~/.triton/cache`, and restart.
5. Outputs are clean and well-formed, with zero garbled-token warnings in the same code paths.
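
Step 4 can also be done non-destructively, which keeps the stale artifacts available for later comparison. This is a sketch under assumptions: `triton_cache_dir` and `set_aside_cache` are hypothetical helpers, and the resolution order (the `TRITON_CACHE_DIR` environment variable first, then `~/.triton/cache`) is my understanding of Triton's default behaviour rather than something stated in the report.

```python
import os
import shutil
import time
from pathlib import Path
from typing import Optional


def triton_cache_dir() -> Path:
    """Resolve the cache location: TRITON_CACHE_DIR if set, else ~/.triton/cache."""
    override = os.environ.get("TRITON_CACHE_DIR")
    return Path(override) if override else Path.home() / ".triton" / "cache"


def set_aside_cache(cache_dir: Path) -> Optional[Path]:
    """Rename the cache directory so the next run recompiles every kernel.

    Returns the backup path, or None when there was no cache to move.
    """
    if not cache_dir.is_dir():
        return None
    # Nanosecond timestamp keeps repeated backups from colliding.
    backup = cache_dir.with_name(f"{cache_dir.name}.stale-{time.time_ns()}")
    shutil.move(str(cache_dir), str(backup))
    return backup
```

Calling `set_aside_cache(triton_cache_dir())` while the server is stopped has the same effect on the next start as `rm -rf ~/.triton/cache`, at the cost of extra disk space for the backup.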

Likely cause

Triton's autotune/JIT cache key does not reflect every variable that actually changes the compiled PTX → SASS path on sm_121 fallback. Artifacts compiled under one (torch, triton, driver, arch-fallback) combination get reused later under a slightly different combination, silently producing wrong code on sm_121.
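
To make the hypothesis concrete, here is an illustrative sketch of a cache key that covers every variable in the (torch, triton, driver, arch-fallback) tuple. It is not Triton's actual key derivation: `BuildFingerprint` and `cache_key` are hypothetical names, and the field set is an assumption drawn from the paragraph above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class BuildFingerprint:
    """Inputs the report suspects can change the generated code."""

    torch_version: str
    triton_version: str
    driver_version: str
    device_capability: Tuple[int, int]  # what the GPU reports, e.g. (12, 1)
    target_arch: str  # what the code was actually built for, e.g. "sm_120"


def cache_key(kernel_source: str, fingerprint: BuildFingerprint) -> str:
    """Hash the kernel source together with every fingerprint field."""
    payload = json.dumps(
        {"source": kernel_source, "env": asdict(fingerprint)}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a key of this shape, two environments that differ only in `device_capability` (sm_121 versus sm_120 hardware, both targeting `sm_120`) hash to different entries, so an artifact built under one cannot be picked up under the other.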

How we discovered it

While testing gpt-oss-20b → `/v1/responses` with the `file_search` tool, the model was emitting random non-English tokens mid-generation, and our Harmony parser was raising on malformed tool-call headers (`<|constrain|>1<|call|>...`). Adding token-skip recovery in `responses/context.py` paths kept the server from returning 500s, but the bad tokens still landed in the output. After ruling out attention backend swaps (gpt-oss + mxfp4 only exposes `TRITON_ATTN`), we wiped `~/.triton/cache`. Next run: clean Harmony stream, valid tool calls, no parser recovery warnings.
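
Because the corruption is silent, a cheap output check can help surface it before a parser trips. The sketch below is a heuristic of my own, not part of vLLM or the Harmony parser: `non_latin_letters` and `looks_garbled` are hypothetical helpers, and `max_ratio` is an arbitrary threshold. It only makes sense for prompts where English output is expected, since legitimate multilingual answers would also be flagged.

```python
import unicodedata
from typing import List


def non_latin_letters(text: str) -> List[str]:
    """Return alphabetic characters whose Unicode name is not a LATIN letter."""
    return [
        ch
        for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]


def looks_garbled(text: str, max_ratio: float = 0.02) -> bool:
    """Flag text where non-Latin letters exceed max_ratio of all letters.

    max_ratio is an illustrative threshold, not a tuned value.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    return len(non_latin_letters(text)) / len(letters) > max_ratio
```

Running `looks_garbled` over responses to a fixed set of English prompts before and after clearing the cache gives a rough, repeatable signal for the mixed-script symptom.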

Proposed fix direction

Either (a) extend the Triton cache key to include device-capability-vs-target-arch info so sm_121-vs-sm_120 mismatches invalidate cache entries, or (b) have vLLM detect an arch mismatch at startup and proactively invalidate.
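
Option (b) could look roughly like the following startup check. This is a sketch under assumptions, not vLLM code: `invalidate_if_changed` is a hypothetical function, the stamp file kept next to the cache directory is an invented convention, and `fingerprint` stands for any string that encodes the current torch, triton, driver, and arch combination.

```python
import shutil
from pathlib import Path


def invalidate_if_changed(cache_dir: Path, stamp_path: Path, fingerprint: str) -> bool:
    """Wipe cache_dir when the stored fingerprint differs from the current one.

    Returns True if an existing cache directory was discarded. A cache with
    no stamp is treated as stale, since nothing records what produced it.
    """
    previous = stamp_path.read_text(encoding="utf-8") if stamp_path.is_file() else None
    if previous == fingerprint:
        return False  # same environment as last run: keep the cache
    wiped = cache_dir.is_dir()
    if wiped:
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    stamp_path.parent.mkdir(parents=True, exist_ok=True)
    stamp_path.write_text(fingerprint, encoding="utf-8")
    return wiped
```

Treating an unstamped cache as stale means the first start after adopting such a check pays a full recompile, which seems an acceptable trade against silently wrong kernels.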

Related (different symptoms, same hardware)

- #36821: No sm_121 support on aarch64
- #33857: Triton allocator error on DGX Spark
- #37754: FlashInfer + MTP crash on SM121

AI assistance was used to draft this report.
