vllm - 💡(How to fix) Fix [Bug]: [GLM-5.1] [MTP] DeepGEMM context_lens.is_contiguous assertion in paged MQA metadata [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#41094Fetched 2026-04-29 06:12:26
View on GitHub
Comments
0
Participants
1
Timeline
6
Reactions
1
Author
Participants
Timeline (top)
renamed ×2subscribed ×2unsubscribed ×2

Error Message

RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

Root Cause

(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] EngineCore encountered a fatal error.
(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()', please check the stack trace above for the root cause
(APIServer pid=1) ERROR 04-28 04:40:10 [async_llm.py:699] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

Fix Action

Fix / Workaround

What we tried / mitigation

appears to avoid this crash path for us so far. That matches the workaround reported in #40926 (speculative_config=None).

Happy to test a patch/nightly image on B200/H200 if useful.

Code Example

RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

---

vllm serve \
  --model zai-org/GLM-5.1-FP8 \
  --served-model-name zai-org/glm-5.1 \
  --trust-remote-code \
  --enable-prompt-tokens-details \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 3 \
  --max-model-len 202752 \
  --gpu-memory-utilization 0.88 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --chat-template-content-format string \
  --tensor-parallel-size 8

---

mkdir -p /tmp/flashinfer-cubin /tmp/flashinfer-cubin-shadow/flashinfer_cubin
printf '%b' '__version__ = "0.6.8.post1"\nCUBIN_DIR = "/tmp/flashinfer-cubin"\ndef get_cubin_dir():\n    return CUBIN_DIR\n' > /tmp/flashinfer-cubin-shadow/flashinfer_cubin/__init__.py
export PYTHONPATH=/tmp/flashinfer-cubin-shadow:${PYTHONPATH:-}
pip install --upgrade 'transformers>=5.4.0'

---

(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 750, in sample_tokens
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]     return self.model_runner.sample_tokens(grammar_output)
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4245, in sample_tokens
...
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962] RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

---

(Worker_TP7 pid=797) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP4 pid=764) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP5 pid=775) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP3 pid=753) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP0 pid=731) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP6 pid=786) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP1 pid=736) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

---

(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] EngineCore encountered a fatal error.
(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()', please check the stack trace above for the root cause
(APIServer pid=1) ERROR 04-28 04:40:10 [async_llm.py:699] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

---

State:          Running
Last State:     Terminated
  Reason:       Completed
  Exit Code:    0
Restart Count:  2

---

--speculative-config.method mtp
--speculative-config.num_speculative_tokens 3
RAW_BUFFERClick to expand / collapse

Describe the bug

zai-org/GLM-5.1-FP8 crashes under vLLM V1 + MTP speculative decoding with a DeepGEMM paged MQA metadata assertion:

RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

This looks very similar to #40987, but on GLM-5.1-FP8 instead of DeepSeek-V4. It may also be related to #40926, since this is the same GLM-5.1 / DSA / MoE / MLA / MTP serving class.

In our deployment, short requests usually work, but a long-context chat request caused the engine to die. The pod restarted cleanly after the fatal EngineCore error. This does not look like OOM: Kubernetes reported the previous container state as Completed with exit code 0, and the first root-cause error was the DeepGEMM assertion above.

Environment

  • vLLM image: vllm/vllm-openai@sha256:46da022ce07aae43e4ffae844aeab467a223437e071abadf566555699fbf16c3 (v0.20.0 image digest)
  • Model: zai-org/GLM-5.1-FP8
  • Served model name: zai-org/glm-5.1
  • Hardware: 8x NVIDIA B200
  • Runtime: Kubernetes/k3s pod, TP=8
  • KV cache dtype: fp8
  • Prefix caching: enabled
  • MTP: enabled, num_speculative_tokens=3

Serve command

vllm serve \
  --model zai-org/GLM-5.1-FP8 \
  --served-model-name zai-org/glm-5.1 \
  --trust-remote-code \
  --enable-prompt-tokens-details \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 3 \
  --max-model-len 202752 \
  --gpu-memory-utilization 0.88 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --chat-template-content-format string \
  --tensor-parallel-size 8

The container also runs this before vllm serve:

mkdir -p /tmp/flashinfer-cubin /tmp/flashinfer-cubin-shadow/flashinfer_cubin
printf '%b' '__version__ = "0.6.8.post1"\nCUBIN_DIR = "/tmp/flashinfer-cubin"\ndef get_cubin_dir():\n    return CUBIN_DIR\n' > /tmp/flashinfer-cubin-shadow/flashinfer_cubin/__init__.py
export PYTHONPATH=/tmp/flashinfer-cubin-shadow:${PYTHONPATH:-}
pip install --upgrade 'transformers>=5.4.0'

Trigger

Observed under normal OpenAI-compatible chat-completions traffic. One user-facing trigger was a long-context chat request around 130k tokens; short requests usually worked. We can try to build a smaller synthetic repro if helpful.

Stack trace excerpt

(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 750, in sample_tokens
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]     return self.model_runner.sample_tokens(grammar_output)
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4245, in sample_tokens
...
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962] RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

The same assertion repeated across TP workers:

(Worker_TP7 pid=797) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP4 pid=764) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP5 pid=775) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP3 pid=753) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP0 pid=731) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP6 pid=786) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP1 pid=736) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

EngineCore then died:

(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] EngineCore encountered a fatal error.
(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()', please check the stack trace above for the root cause
(APIServer pid=1) ERROR 04-28 04:40:10 [async_llm.py:699] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

Pod state after failure

State:          Running
Last State:     Terminated
  Reason:       Completed
  Exit Code:    0
Restart Count:  2

What we tried / mitigation

Disabling MTP by removing:

--speculative-config.method mtp
--speculative-config.num_speculative_tokens 3

appears to avoid this crash path for us so far. That matches the workaround reported in #40926 (speculative_config=None).

Suspected cause

Our guess is that the GLM-5.1 MTP path can pass a non-contiguous context_lens/sequence-length tensor into DeepGEMM paged MQA metadata construction via the MLA indexer. The stack and assertion look aligned with #40987 and the draft fix in #40989, which makes seq_lens / context_lens contiguous before calling DeepGEMM metadata code.

Even if the .contiguous() fix resolves this assertion, #40926 suggests there may be a broader GLM-5.1 + MTP stability issue under sustained traffic.

Happy to test a patch/nightly image on B200/H200 if useful.

extent analysis

TL;DR

Disabling MTP or ensuring context_lens is contiguous may resolve the crash caused by the DeepGEMM assertion error.

Guidance

  1. Disable MTP: Remove --speculative-config.method mtp and --speculative-config.num_speculative_tokens 3 from the serve command to avoid the crash path.
  2. Verify contiguity: Ensure that context_lens is contiguous before passing it to DeepGEMM paged MQA metadata construction, potentially by applying a fix similar to the one mentioned in #40989.
  3. Monitor stability: Even if the assertion is fixed, monitor the system for broader GLM-5.1 + MTP stability issues under sustained traffic, as suggested by #40926.
  4. Test patches or nightly images: Consider testing patches or nightly images on B200/H200 hardware to resolve the issue and ensure stability.

Example

No explicit code example is provided due to the complexity of the issue and the need for a specific fix within the DeepGEMM or vLLM codebase.

Notes

The provided guidance is based on the information given in the issue and may not cover all possible scenarios or edge cases. The root cause seems related to the contiguity of context_lens when using MTP with GLM-5.1, but further investigation or patches may be necessary for a complete resolution.

Recommendation

Apply the workaround by disabling MTP until a more permanent fix is available, as it appears to avoid the crash path. This recommendation is based on the information provided in the issue and the temporary success of disabling MTP in mitigating the crash.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING