vllm [Bug]: /wake_up fails with "'list' object has no attribute 'zero_'" on hybrid-SWA / Mamba / DeltaNet models (SM120, NVFP4); only Gemma-4 interleaved-SWA survives [1 comment, 1 participant]

GitHub stats
vllm-project/vllm#41563 · Fetched 2026-05-04 04:58:49
Comments: 1 · Participants: 1 · Timeline events: 3 (closed ×1, commented ×1, cross-referenced ×1) · Reactions: 0

Error Message

AttributeError: 'list' object has no attribute 'zero_'

Root Cause

TL;DR: On RTX PRO 6000 Blackwell (SM120) with vllm/vllm-openai:cu129-nightly, /sleep succeeds for every model we tested, but /wake_up fails with AttributeError: 'list' object has no attribute 'zero_' on every architecture except Gemma-4 (interleaved-SWA). The same bug reproduces on hybrid-SWA (Qwen3.6), hybrid-SWA MoE (Qwen3.6 A3B), DeltaNet+SWA (Qwen3-Coder-Next), and Mamba+attention (Nemotron-Omni). Because Gemma-4 wakes cleanly on the same hardware, image, and quantization scheme, the failure is in the engine pause/resume tensor-restore path — not in any arch-specific kernel. Filing because no upstream issue currently tracks this.


Code Example

--enable-sleep-mode
--quantization compressed-tensors
--tensor-parallel-size 4
--gpu-memory-utilization 0.85

---

AttributeError: 'list' object has no attribute 'zero_'
Your current environment

<details> <summary>Output of <code>python collect_env.py</code></summary>

I will append the full python collect_env.py output as a follow-up comment within 24h. Filing the architectural matrix first because the failure pattern is what most needs maintainer eyes. Hardware/image summary inline below.

</details>

Hardware / image summary:

  • GPUs: 4× NVIDIA RTX PRO 6000 Blackwell Server Edition (SM 12.0)
  • OS: Ubuntu 24.04
  • NVIDIA driver: 580
  • Container images tested (both reproduce):
    • vllm/vllm-openai:cu129-nightly @ digest 8b49cf3a37eb (older)
    • vllm/vllm-openai:cu129-nightly @ digest a749a33d8d05 (newer)
  • Quantization: compressed-tensors (NVFP4) for all five models

🐛 Describe the bug

Summary

/wake_up consistently fails with AttributeError: 'list' object has no attribute 'zero_' after a successful /sleep on four out of five models tested. The one model that succeeds (Gemma-4-26B-A4B-NVFP4, interleaved SWA) shares the same hardware, container image, quantization scheme, and launch-flag subset as the failing models. This rules out arch-specific kernel paths and points at the engine-side sleep/wake tensor restoration logic.

Test matrix (validated 2026-05-02)

All five models successfully entered sleep (HTTP 200, VRAM dropped to ~4–5 GB residual per card). Only Gemma-4 woke back up.

| Model | Architecture | /sleep | /wake_up |
|---|---|---|---|
| Gemma-4-26B-A4B-NVFP4 | Interleaved SWA | OK (HTTP 200) | OK (HTTP 200) |
| Qwen3.6-27B-NVFP4 | Hybrid SWA | OK (HTTP 200) | 'list' object has no attribute 'zero_' |
| Qwen3.6-35B-A3B-NVFP4 | Hybrid SWA MoE | OK (HTTP 200) | 'list' object has no attribute 'zero_' |
| Qwen3-Coder-Next-80B-A3B-NVFP4 | DeltaNet + SWA | OK (HTTP 200) | 'list' object has no attribute 'zero_' |
| Nemotron-Omni-30B-A3B-NVFP4 | Mamba + attention | OK (HTTP 200) | 'list' object has no attribute 'zero_' |

Why this looks engine-side, not arch-specific

The failing set spans four distinct attention/state-space designs:

  • Hybrid SWA (dense)
  • Hybrid SWA + MoE
  • DeltaNet + SWA
  • Mamba + attention

…and the only architecture that survives is interleaved SWA (Gemma-4). If this were a kernel-level bug in any one of {SWA mask handling, MoE expert weights, DeltaNet recurrent state, Mamba SSM state}, we would expect a different exception per architecture. Instead, all four failures produce the identical AttributeError, which strongly suggests a single common code path in the sleep/wake engine glue is iterating over a structure it expects to be a Tensor but is in fact a list (likely a list of per-layer / per-expert / per-state tensors that the restore path forgot to unwrap or zip-restore).
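The failure mode hypothesized above can be illustrated without vLLM or PyTorch. The following is a minimal sketch, not the engine's actual code: `FakeTensor`, `naive_restore`, and `tolerant_restore` are hypothetical names invented for illustration. It shows how calling `.zero_()` on a list produces exactly the reported message, and how a restore loop that recurses into containers avoids it.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor; only models an in-place zero_()."""

    def __init__(self, values):
        self.values = list(values)

    def zero_(self):
        self.values = [0.0] * len(self.values)
        return self


def naive_restore(buffer):
    # Assumes every buffer is a single tensor-like object.
    buffer.zero_()


def tolerant_restore(buffer):
    # Recurse into lists/tuples and zero each contained tensor.
    if isinstance(buffer, (list, tuple)):
        for item in buffer:
            tolerant_restore(item)
    else:
        buffer.zero_()


single = FakeTensor([1.0, 2.0])
grouped = [FakeTensor([3.0]), FakeTensor([4.0, 5.0])]  # e.g. per-layer state

naive_restore(single)  # works: a single tensor-like object

message = None
try:
    naive_restore(grouped)  # fails: the container itself has no zero_()
except AttributeError as exc:
    message = str(exc)

tolerant_restore(grouped)  # works: each element is zeroed individually
print(message)
```

Whether the real restore path looks like `naive_restore` is unconfirmed; the sketch only shows that the symptom is consistent with that shape of bug.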

Launch flags (common subset; per-model values varied)

--enable-sleep-mode
--quantization compressed-tensors
--tensor-parallel-size 4
--gpu-memory-utilization 0.85

Plus VLLM_SERVER_DEV_MODE=1 in the container env to expose /sleep and /wake_up.

Reproduction steps

  1. Start any of the four broken models in the matrix above with --enable-sleep-mode and VLLM_SERVER_DEV_MODE=1.
  2. Wait for the server to be ready (/health returns 200).
  3. curl -X POST "http://<host>:8000/sleep?level=1" → HTTP 200, per-card VRAM drops to ~4–5 GB residual (weights paged out to host RAM as expected). The URL is quoted so that the shell does not treat the ? as a glob character.
  4. curl -X POST http://<host>:8000/wake_up → HTTP 500.
  5. Engine logs report AttributeError: 'list' object has no attribute 'zero_'.

The same sequence on Gemma-4-26B-A4B-NVFP4 succeeds: /wake_up returns 200 and inference resumes normally.
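The reproduction steps above can be scripted. This is a minimal sketch using only the Python standard library; the endpoint paths and the `level` query parameter come from the steps above, while the helper names (`build_url`, `http_status`, `sleep_wake_cycle`), the base URL, and the timeout are assumptions chosen for illustration.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_url(base, path, **params):
    """Join base and path, appending URL-encoded query parameters if any."""
    url = base.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def http_status(url, method="GET", timeout=600):
    """Return the HTTP status code, including 4xx/5xx responses."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def sleep_wake_cycle(base, level=1, status=http_status):
    """Check /health, POST /sleep, then POST /wake_up; return status codes."""
    return {
        "health": status(build_url(base, "/health")),
        "sleep": status(build_url(base, "/sleep", level=level), method="POST"),
        "wake_up": status(build_url(base, "/wake_up"), method="POST"),
    }
```

Calling `sleep_wake_cycle("http://<host>:8000")` against a server started with --enable-sleep-mode and VLLM_SERVER_DEV_MODE=1 runs the three requests in order. The `status` parameter is injectable so the sequence can be exercised without a live server.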

Stack trace

Will be attached as a follow-up comment within 24h. Captured terminating frame is uniform across all four failing models:

AttributeError: 'list' object has no attribute 'zero_'

The clean repro will be against Qwen3.6-27B-NVFP4 (smallest of the four failing models, fastest cycle). Filing the architectural matrix first so maintainers can weigh in on which subsystem to inspect — the uniform symptom across four otherwise-disjoint architectures is the load-bearing signal.

Operational note for anyone reproducing: cgroup memory ceiling

Sleep-mode containers need a cgroup memory_max of at least model_weights × 2 + working_set. We initially saw a confusing OOM-kill on Qwen3.6-27B (memory_max=20 GB, model ≈ 14 GB) that masked the actual wake-up bug — the container died at sleep time before we could even attempt /wake_up. After raising memory_max so the paged-out weights had headroom on the host side, /sleep succeeded cleanly and the wake-up failure became reproducible. Anyone trying to repro should size the cgroup generously to avoid the same red herring.
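The sizing rule above is simple arithmetic and can be checked before launching a container. A small sketch follows; the function names are hypothetical, and the 4 GB working-set figure is an assumed placeholder, since the issue does not state one.

```python
def min_memory_max_gb(model_weights_gb, working_set_gb):
    """Lower bound from the rule of thumb: weights x 2 + working set."""
    return 2 * model_weights_gb + working_set_gb


def has_headroom(memory_max_gb, model_weights_gb, working_set_gb):
    """True if the cgroup limit meets the rule-of-thumb lower bound."""
    return memory_max_gb >= min_memory_max_gb(model_weights_gb, working_set_gb)


# Figures from the report: ~14 GB model, 20 GB limit.
# The 4 GB working set is an assumption for illustration only.
required = min_memory_max_gb(14, 4)
ok = has_headroom(20, 14, 4)
print(required, ok)
```

Even with a working set of zero, the rule gives 28 GB for a 14 GB model, so the original 20 GB limit falls short under any working-set assumption.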

What we have not yet tested

We are filing this before exhausting every mitigation so that maintainers can weigh in early. Specifically:

  • --no-enable-prefix-caching workaround. Issue #16234 (closed 2025-04) reported wake_up corruption resolved by disabling prefix caching. We have not retested any of the four failing hybrid-SWA / Mamba / DeltaNet models with --no-enable-prefix-caching. If that workaround papers over this bug as well, it would suggest the prefix-cache restore path is the specific subsystem mishandling a list-of-tensors structure on these architectures — but until we test, this is conjecture.
  • Sleep level 2. All tests above used level=1. We have not tried level=2 (offload + discard) on the failing models.
  • Non-NVFP4 quantization. All five models in the matrix are NVFP4 / compressed-tensors. We have not tested FP8 or BF16 variants of the same architectures with sleep mode on this hardware.

We will follow up with results once we run those, but wanted the engine-side signal in front of maintainers now.
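The first two untested items combine into a small test matrix. The sketch below enumerates the remaining combinations of prefix caching and sleep level; the flags are the ones quoted in this report, while `BASE_FLAGS` and `untested_variants` are hypothetical names. Quantization variants are left out because they need different checkpoints rather than different flags.

```python
from itertools import product

# Common launch flags quoted in the report.
BASE_FLAGS = [
    "--enable-sleep-mode",
    "--quantization", "compressed-tensors",
    "--tensor-parallel-size", "4",
    "--gpu-memory-utilization", "0.85",
]


def untested_variants():
    """List flag/sleep-level combinations that the report has not yet run."""
    variants = []
    for disable_prefix_cache, level in product([False, True], [1, 2]):
        if not disable_prefix_cache and level == 1:
            continue  # prefix caching on + level=1 is the already-tested case
        flags = list(BASE_FLAGS)
        if disable_prefix_cache:
            flags.append("--no-enable-prefix-caching")
        variants.append({"flags": flags, "sleep_level": level})
    return variants


for variant in untested_variants():
    print(variant["sleep_level"], " ".join(variant["flags"]))
```

Running each variant against the smallest failing model first keeps the cycle time short.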

Related issues / PRs

  • #41519 — Xiaomi MiMo v2.5 broken on SM12x. Different bug (attention-backend-side, not sleep-related), but same hardware class. Mentioned for SM120 context.
  • #16234 — Closed 2025-04 /wake_up corruption; workaround was --no-enable-prefix-caching. Untested here on hybrid-SWA models; flagged as a potentially still-relevant mitigation path.
  • #40897 — Sleep level 3 PR. Tangential — appears to address weight retention, not the /wake_up tensor-restoration path implicated here.

Ask

Could a maintainer point at which subsystem currently owns /wake_up's tensor restoration — specifically the path that reconstructs per-layer / per-expert / per-state tensors after a level=1 offload? The uniform 'list' object has no attribute 'zero_' across four architecturally distinct models suggests a single restore loop is calling .zero_() on a container instead of on each contained tensor, but we'd rather hear from someone who knows the code than guess at a fix. Happy to test patches, gather additional logs, or run the --no-enable-prefix-caching and level=2 variants against any of the four failing models on request.

Thanks!

extent analysis

TL;DR

The most likely fix for the /wake_up failure is to modify the tensor restoration path in the sleep/wake engine glue to correctly handle lists of tensors.

Guidance

  • Investigate the tensor restoration path in the sleep/wake engine glue to identify where it is attempting to call .zero_() on a list object instead of individual tensors.
  • Consider testing the --no-enable-prefix-caching workaround to see if it resolves the issue, as it may be related to the prefix-cache restore path.
  • Review issue #16234 (closed 2025-04), which reported /wake_up corruption resolved by --no-enable-prefix-caching, and any fixes linked from it, to see whether they are relevant to this problem.
  • Test the level=2 sleep mode to see if the issue persists, as it may help isolate the problem to a specific part of the tensor restoration path.

Example

No code snippet is provided as the issue does not include specific code references.

Notes

The issue appears to be related to the tensor restoration path in the sleep/wake engine glue, but the exact cause and solution are unclear without further investigation. The uniform error message across four distinct architectures suggests a single common code path is at fault.

Recommendation

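Test --no-enable-prefix-caching and sleep level=2 on one of the failing models. Neither has been tried yet, so both are diagnostic steps rather than confirmed fixes: the first checks whether the prefix-cache restore path is involved, and the second checks whether the failure is specific to the level=1 offload path.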
