vllm - 💡(How to fix) Fix [Bug]: [GLM-5.1] [MTP] DeepGEMM context_lens.is_contiguous assertion in paged MQA metadata [1 participants]

vllm2026-04-28 04:57:18

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#41094•Fetched 2026-04-29 06:12:26

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Luew2

Participants

Luew2

Timeline (top)

renamed ×2subscribed ×2unsubscribed ×2

Error Message

RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

Root Cause

(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] EngineCore encountered a fatal error.
(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()', please check the stack trace above for the root cause
(APIServer pid=1) ERROR 04-28 04:40:10 [async_llm.py:699] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

Fix Action

Fix / Workaround

What we tried / mitigation

appears to avoid this crash path for us so far. That matches the workaround reported in #40926 (speculative_config=None).

Happy to test a patch/nightly image on B200/H200 if useful.

Code Example

RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

---

vllm serve \
  --model zai-org/GLM-5.1-FP8 \
  --served-model-name zai-org/glm-5.1 \
  --trust-remote-code \
  --enable-prompt-tokens-details \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 3 \
  --max-model-len 202752 \
  --gpu-memory-utilization 0.88 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --chat-template-content-format string \
  --tensor-parallel-size 8

---

mkdir -p /tmp/flashinfer-cubin /tmp/flashinfer-cubin-shadow/flashinfer_cubin
printf '%b' '__version__ = "0.6.8.post1"\nCUBIN_DIR = "/tmp/flashinfer-cubin"\ndef get_cubin_dir():\n    return CUBIN_DIR\n' > /tmp/flashinfer-cubin-shadow/flashinfer_cubin/__init__.py
export PYTHONPATH=/tmp/flashinfer-cubin-shadow:${PYTHONPATH:-}
pip install --upgrade 'transformers>=5.4.0'

---

(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 750, in sample_tokens
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]     return self.model_runner.sample_tokens(grammar_output)
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4245, in sample_tokens
...
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962] RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

---

(Worker_TP7 pid=797) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP4 pid=764) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP5 pid=775) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP3 pid=753) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP0 pid=731) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP6 pid=786) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP1 pid=736) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

---

(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] EngineCore encountered a fatal error.
(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()', please check the stack trace above for the root cause
(APIServer pid=1) ERROR 04-28 04:40:10 [async_llm.py:699] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

---

State:          Running
Last State:     Terminated
  Reason:       Completed
  Exit Code:    0
Restart Count:  2

---

--speculative-config.method mtp
--speculative-config.num_speculative_tokens 3

RAW_BUFFERClick to expand / collapse

Describe the bug

zai-org/GLM-5.1-FP8 crashes under vLLM V1 + MTP speculative decoding with a DeepGEMM paged MQA metadata assertion:

RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

This looks very similar to #40987, but on GLM-5.1-FP8 instead of DeepSeek-V4. It may also be related to #40926, since this is the same GLM-5.1 / DSA / MoE / MLA / MTP serving class.

In our deployment, short requests usually work, but a long-context chat request caused the engine to die. The pod restarted cleanly after the fatal EngineCore error. This does not look like OOM: Kubernetes reported the previous container state as Completed with exit code 0, and the first root-cause error was the DeepGEMM assertion above.

Environment

vLLM image: vllm/vllm-openai@sha256:46da022ce07aae43e4ffae844aeab467a223437e071abadf566555699fbf16c3 (v0.20.0 image digest)
Model: zai-org/GLM-5.1-FP8
Served model name: zai-org/glm-5.1
Hardware: 8x NVIDIA B200
Runtime: Kubernetes/k3s pod, TP=8
KV cache dtype: fp8
Prefix caching: enabled
MTP: enabled, num_speculative_tokens=3

Serve command

vllm serve \
  --model zai-org/GLM-5.1-FP8 \
  --served-model-name zai-org/glm-5.1 \
  --trust-remote-code \
  --enable-prompt-tokens-details \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 3 \
  --max-model-len 202752 \
  --gpu-memory-utilization 0.88 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --chat-template-content-format string \
  --tensor-parallel-size 8

The container also runs this before vllm serve:

mkdir -p /tmp/flashinfer-cubin /tmp/flashinfer-cubin-shadow/flashinfer_cubin
printf '%b' '__version__ = "0.6.8.post1"\nCUBIN_DIR = "/tmp/flashinfer-cubin"\ndef get_cubin_dir():\n    return CUBIN_DIR\n' > /tmp/flashinfer-cubin-shadow/flashinfer_cubin/__init__.py
export PYTHONPATH=/tmp/flashinfer-cubin-shadow:${PYTHONPATH:-}
pip install --upgrade 'transformers>=5.4.0'

Trigger

Observed under normal OpenAI-compatible chat-completions traffic. One user-facing trigger was a long-context chat request around 130k tokens; short requests usually worked. We can try to build a smaller synthetic repro if helpful.

Stack trace excerpt

(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 750, in sample_tokens
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]     return self.model_runner.sample_tokens(grammar_output)
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4245, in sample_tokens
...
(Worker_TP2 pid=742) ERROR 04-28 04:40:10 [multiproc_executor.py:962] RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

The same assertion repeated across TP workers:

(Worker_TP7 pid=797) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP4 pid=764) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP5 pid=775) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP3 pid=753) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP0 pid=731) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP6 pid=786) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()
(Worker_TP1 pid=736) RuntimeError: Assertion error (.../deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()

EngineCore then died:

(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] EngineCore encountered a fatal error.
(EngineCore pid=532) ERROR 04-28 04:40:10 [core.py:1138] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:201): context_lens.is_contiguous()', please check the stack trace above for the root cause
(APIServer pid=1) ERROR 04-28 04:40:10 [async_llm.py:699] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

Pod state after failure

State:          Running
Last State:     Terminated
  Reason:       Completed
  Exit Code:    0
Restart Count:  2

What we tried / mitigation

Disabling MTP by removing:

--speculative-config.method mtp
--speculative-config.num_speculative_tokens 3

appears to avoid this crash path for us so far. That matches the workaround reported in #40926 (speculative_config=None).

Suspected cause

Our guess is that the GLM-5.1 MTP path can pass a non-contiguous context_lens/sequence-length tensor into DeepGEMM paged MQA metadata construction via the MLA indexer. The stack and assertion look aligned with #40987 and the draft fix in #40989, which makes seq_lens / context_lens contiguous before calling DeepGEMM metadata code.

Even if the .contiguous() fix resolves this assertion, #40926 suggests there may be a broader GLM-5.1 + MTP stability issue under sustained traffic.

Happy to test a patch/nightly image on B200/H200 if useful.

extent analysis

TL;DR

Disabling MTP or ensuring context_lens is contiguous may resolve the crash caused by the DeepGEMM assertion error.

Guidance

Disable MTP: Remove --speculative-config.method mtp and --speculative-config.num_speculative_tokens 3 from the serve command to avoid the crash path.
Verify contiguity: Ensure that context_lens is contiguous before passing it to DeepGEMM paged MQA metadata construction, potentially by applying a fix similar to the one mentioned in #40989.
Monitor stability: Even if the assertion is fixed, monitor the system for broader GLM-5.1 + MTP stability issues under sustained traffic, as suggested by #40926.
Test patches or nightly images: Consider testing patches or nightly images on B200/H200 hardware to resolve the issue and ensure stability.

Example

No explicit code example is provided due to the complexity of the issue and the need for a specific fix within the DeepGEMM or vLLM codebase.

Notes

The provided guidance is based on the information given in the issue and may not cover all possible scenarios or edge cases. The root cause seems related to the contiguity of context_lens when using MTP with GLM-5.1, but further investigation or patches may be necessary for a complete resolution.

Recommendation

Apply the workaround by disabling MTP until a more permanent fix is available, as it appears to avoid the crash path. This recommendation is based on the information provided in the issue and the temporary success of disabling MTP in mitigating the crash.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #network issue #logging issue #authentication issue #prompt issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: [GLM-5.1] [MTP] DeepGEMM context_lens.is_contiguous assertion in paged MQA metadata [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

What we tried / mitigation

Code Example

Describe the bug

Environment

Serve command

Trigger

Stack trace excerpt

Pod state after failure

What we tried / mitigation

Suspected cause

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: [GLM-5.1] [MTP] DeepGEMM context_lens.is_contiguous assertion in paged MQA metadata [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

What we tried / mitigation

Code Example

Describe the bug

Environment

Serve command

Trigger

Stack trace excerpt

Pod state after failure

What we tried / mitigation

Suspected cause

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING