ollama - ✅(Solved) Fix qwen3.6:35b-a3b-coding-nvfp4 model file has corrupted K-projection weights (layers 0-1 entirely zero), making linear attention output zero [1 pull requests, 4 comments, 1 participants]

GitHub stats
ollama/ollama#15866 · Fetched 2026-04-29 06:11:38
Comments: 4 · Participants: 1 · Timeline: 5 · Reactions: 0
Timeline (top): commented ×4, cross-referenced ×1

Root Cause

The mlx-community model generates coherent text via mlx-lm generate (top next token for "What is 2+2?" → "Here", correctly leading into a thinking trace). The Ollama NVFP4 model cannot generate coherent text under any code fix because layers 0 and 1 of the linear-attention path produce K=0 → linear attention output = 0 → the residual stream loses the entire linear-attention contribution from those layers.

Fix Action

Fixed

PR fix notes

PR #15870: fix(mlxrunner): preserve fp32 precision in gated_delta_step recurrent state

Description (problem / solution / changelog)

Summary

Fixes incoherent generation in Qwen3.5/3.6 GatedDeltaNet (linear-attention) layers by preserving the fp32 recurrent-state accumulator across kernel invocations, matching MLX-LM's reference. Refs #15865, #15866.

The gated_delta_step Metal/CUDA kernel computed state in float (fp32) inside the inner loop but cast it back to InT (the input dtype, typically bf16) when writing to o_state. That cast discarded 16 of fp32's 23 mantissa bits on every recurrent step, leaving bf16's 7. Across 30 linear-attention layers and N tokens of a real prompt, the state degrades enough that generation becomes incoherent (e.g. "Copyright ofusr =" for a "What is 2+2?" prompt against mlx-community/Qwen3.6-35B-A3B-4bit).
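The effect of rounding a recurrent accumulator on every step can be illustrated with a toy decayed sum. This is a minimal sketch, not the gated_delta_step kernel: the helper names, the scalar recurrence, and the decay/update constants are all assumptions chosen for illustration, and bf16 is emulated by rounding an fp32 bit pattern to its top 16 bits.

```python
import struct

def round_to_fp32(x: float) -> float:
    """Round a Python float (fp64) to fp32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_to_bf16(x: float) -> float:
    """Emulate a bf16 store: keep sign, exponent, and the top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the dropped 16 bits
    bits &= 0xFFFF0000                    # discard the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def run_recurrence(store, steps=20000, decay=0.999, update=0.001):
    """Toy recurrence state = decay * state + update, storing via `store` each step.

    With these (hypothetical) constants the exact fixed point is 1.0.
    """
    state = 0.0
    for _ in range(steps):
        state = store(state * decay + update)
    return state

fp32_state = run_recurrence(round_to_fp32)  # state kept at fp32 between steps
bf16_state = run_recurrence(round_to_bf16)  # state truncated to bf16 between steps

# The bf16-stored state stalls once each update falls below half a bf16 ulp,
# so it ends far from the fixed point that the fp32-stored state approaches.
print(f"fp32-stored state: {fp32_state:.6f}")
print(f"bf16-stored state: {bf16_state:.6f}")
```

The real kernel updates a matrix-valued state per head rather than a scalar, but the failure mode is the same: small per-token updates are lost once they drop below the storage format's resolution.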

Changes

Two surgical changes in two files:

x/mlxrunner/mlx/gated_delta.go

  • Add an StT template arg to both Metal and CUDA kernels (separate from InT)
  • Cast state writes via static_cast<StT>(state[i]) instead of static_cast<InT>(state[i])
  • Loosen the state.DType() == dtype precondition so the kernel accepts fp32 state alongside bf16 inputs
  • Set the state output_arg dtype to state.DType() instead of the input dtype

x/mlxrunner/cache/recurrent.go

  • Hardcode deltaState to fp32 in ensure(). Conv state continues to track the activation dtype (typically bf16); only the recurrent accumulator is widened.
  • Document the reason in a comment pointing at the kernel side and the MLX-LM reference.

No API changes, no qwen3_5.go changes -- call sites still pass a single dtype to RecurrentCache.Get(b, dtype), which now applies only to conv state. Full-attention layers and other models that don't use RecurrentCache are unaffected.

Why hardcode fp32 vs. add a parameter

MLX-LM's reference always allocates the recurrent state as mx.float32. There's no current model that wants a different precision for it, and adding a knob would just push the decision to every call site without a use case to justify it. Easier to revisit if a model ever needs bf16 state for memory reasons.

Memory cost

Delta state is [B, num_v_heads, head_v_dim, head_k_dim]. For Qwen3.6-35B-A3B that is 32 v-heads x 128 x 128 per head x 30 linear layers x B=1, about 15.7M elements: ~63 MB at fp32 vs ~31 MB at bf16. Negligible relative to the model weights.
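The arithmetic can be checked directly. The short sketch below uses the dimensions quoted in the PR description (they are taken from the text, not read from a model config) and assumes 4 bytes per fp32 element and 2 bytes per bf16 element.

```python
# Dimensions as quoted in the PR description for Qwen3.6-35B-A3B.
num_v_heads, head_v_dim, head_k_dim = 32, 128, 128
linear_layers, batch = 30, 1

# One delta-state tensor per linear-attention layer: [B, num_v_heads, head_v_dim, head_k_dim].
elements = batch * num_v_heads * head_v_dim * head_k_dim * linear_layers

fp32_bytes = elements * 4  # fp32 is 4 bytes per element
bf16_bytes = elements * 2  # bf16 is 2 bytes per element

print(f"{elements:,} elements: fp32 {fp32_bytes / 1e6:.1f} MB, bf16 {bf16_bytes / 1e6:.1f} MB")
# → 15,728,640 elements: fp32 62.9 MB, bf16 31.5 MB
```

Widening the accumulator therefore costs roughly 31 MB of extra cache per sequence at B=1.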

Verification

  • go build ./x/... -- clean
  • go test ./x/mlxrunner/cache/... ./x/mlxrunner/mlx/... -- all pass

Functional verification (M4 Max / 36 GB / macOS 26.4) using mlx-community/Qwen3.6-35B-A3B-4bit imported via ollama create --experimental:

| Check | Before | After |
| --- | --- | --- |
| "What is 2+2?" | "\n\n// Copyright ofusr = ..." | "<think>\n\n</think>\n\nThe result of $2 + 2$ is **4**." |
| "Reverse a string in Go" | gibberish | clean idiomatic Go one-liner |
| Throughput | 110 tok/s | 110 tok/s |
| nomic-embed-text (regression check) | works | works |

mlx-lm's reference on the same checkpoint runs ~112 tok/s for comparison.

Test plan

  • Reviewer can reproduce on Apple Silicon by importing any mlx-community/Qwen3.* MoE checkpoint via ollama create --experimental and running a short prompt -- output should be coherent rather than the Copyright/static/-1999... pattern reported in #15866.
  • CI: existing cache + mlx tests should continue to pass (verified locally).

Generated with Claude Code

Changed files

  • x/mlxrunner/cache/recurrent.go (modified, +10/-2)
  • x/mlxrunner/mlx/gated_delta.go (modified, +22/-6)


qwen3.6:35b-a3b-coding-nvfp4 model file has corrupted K-projection weights for early linear-attention layers

What is the issue?

The NVFP4 packaging of qwen3.6:35b-a3b-coding-nvfp4 shipped from registry.ollama.ai/library/qwen3.6 has the K-projection portion of linear_attn.in_proj_qkv.weight entirely zeroed out for layer 0 and mostly zeroed for layer 1 — making coherent generation impossible regardless of any runner-side fix.

This is a model-packaging issue, not a runtime bug. The mlx-community/Qwen3.6-35B-A3B-4bit model (same architecture, same source weights, MLX 4-bit affine quant instead of NVFP4) has the expected natural sparsity (3-5%) in the same regions.

Reproducer

The linear_attn.in_proj_qkv.weight tensor is laid out per HF transformers as [Q rows 0..2047 | K rows 2048..4095 | V rows 4096..8191]. A direct read of the safetensors blob shows:

layer  0  K-zero-rows = 2048 / 2048   (100% — entire K projection is zero!)
layer  1  K-zero-rows = 1797 / 2048   ( 88%)
layer  2  K-zero-rows =   57 / 2048   (  3%)
layer  3  ... (full attention layer, no in_proj_qkv)
layer  4+ K-zero-rows ≤ 2 / 2048      (~natural sparsity)

For comparison, on mlx-community/Qwen3.6-35B-A3B-4bit (4-bit affine, same architecture):

layer 0  K-zero-rows =  56 / 2048
layer 1  K-zero-rows = 102 / 2048
layer 2  K-zero-rows =  56 / 2048

The mlx-community model generates coherent text via mlx-lm generate (top next token for "What is 2+2?" → "Here", correctly leading into a thinking trace). The Ollama NVFP4 model cannot generate coherent text under any code fix because layers 0 and 1 of the linear-attention path produce K=0 → linear attention output = 0 → the residual stream loses the entire linear-attention contribution from those layers.
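Why K=0 removes the whole linear-attention contribution can be seen in a simplified gated delta rule step. The sketch below assumes the commonly cited Gated DeltaNet update S_t = α·S_{t-1}·(I − β·k·kᵀ) + β·v·kᵀ with output o_t = S_t·q. It is an illustration in plain Python lists, not the Ollama or MLX kernel, and the function name and toy values are hypothetical.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One simplified gated delta rule step on a d_v x d_k state matrix S."""
    d_v, d_k = len(v), len(k)
    # S @ k: what the current state already predicts for this key.
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # alpha * S (I - beta k k^T) + beta v k^T, written element-wise.
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]
    # Output reads the updated state with the query.
    o = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, o

def run(keys, d_v=2, d_k=2):
    """Feed a sequence of keys through the step, starting from a zero state."""
    S = [[0.0] * d_k for _ in range(d_v)]
    o = [0.0] * d_v
    for k in keys:
        S, o = gated_delta_step(S, q=[1.0, 0.5], k=k, v=[0.3, -0.7], alpha=0.9, beta=0.5)
    return o

healthy_out = run([[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]])
zero_k_out = run([[0.0, 0.0]] * 3)  # K projection is all zero, as in layer 0

print("healthy:", healthy_out)
print("K = 0:  ", zero_k_out)
```

With every key equal to zero, both the erase term and the write term vanish, so the state never leaves zero and the output stays zero no matter what Q and V contain.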

Verification commands

# Reproduce against the cached blob (replace the SHA with whichever blob holds the layer-0 in_proj_qkv tensor):
import struct, json, numpy as np
blob = "/Users/<you>/.ollama/models/blobs/sha256-..."   # blob carrying layers.0.linear_attn.in_proj_qkv
with open(blob, 'rb') as f:
    hsz = struct.unpack('<Q', f.read(8))[0]
    hdr = json.loads(f.read(hsz))
    body = 8 + hsz
    info = hdr['language_model.model.layers.0.linear_attn.in_proj_qkv.weight']
    s, e = info['data_offsets']
    f.seek(body + s)
    arr = np.frombuffer(f.read(e-s), dtype=np.uint32).reshape(info['shape'])
    print('K rows zero:', sum(1 for r in arr[2048:4096] if not r.any()), '/ 2048')
# → 2048 / 2048 on the broken model
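To repeat the same check for every layer in a blob, the header can be walked for all matching tensors. This is a minimal standard-library sketch under stated assumptions: the function name is hypothetical, the K rows are assumed to sit at rows 2048..4095 in every linear-attention layer as in the layout above, and rows are assumed to be stored contiguously in row-major order.

```python
import json
import struct

def count_k_zero_rows(path, k_rows=(2048, 4096),
                      suffix="linear_attn.in_proj_qkv.weight"):
    """Map tensor name -> (zero K rows, K rows checked) for a safetensors file."""
    results = {}
    with open(path, "rb") as f:
        header_size = struct.unpack("<Q", f.read(8))[0]  # little-endian u64 header length
        header = json.loads(f.read(header_size))
        body = 8 + header_size                             # data offsets are relative to this
        for name, info in header.items():
            if name == "__metadata__" or not name.endswith(suffix):
                continue
            start, end = info["data_offsets"]
            row_bytes = (end - start) // info["shape"][0]  # bytes per row, any dtype
            lo, hi = k_rows
            f.seek(body + start + lo * row_bytes)
            data = f.read((hi - lo) * row_bytes)
            # A row is "zero" when every byte in it is zero.
            zero = sum(
                1 for r in range(hi - lo)
                if not any(data[r * row_bytes:(r + 1) * row_bytes])
            )
            results[name] = (zero, hi - lo)
    return results

# Usage (path is a placeholder, as in the script above):
# for name, (zero, total) in sorted(count_k_zero_rows("/Users/<you>/.ollama/models/blobs/sha256-...").items()):
#     print(f"{name}: K-zero-rows = {zero} / {total}")
```

Working on raw bytes avoids depending on the packed dtype, which matters for quantized tensors where one stored element holds several weights.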

Likely cause

The conversion from BF16 (or the source FP8/BF16 RedHatAI NVFP4) to Ollama's NVFP4 packaging picked a global scale per tensor that was too large to represent the small magnitudes in early-layer K projections — they all rounded to the FP4 zero codepoint.

For verification: the scales tensor for in_proj_qkv is non-zero across all rows (max-row-abs ≈ 1144 on K rows for layer 0), but the underlying FP4 nibbles are all zero, so dequantization yields zero regardless of the (non-zero) scale. That fingerprint matches "small-magnitude rows quantized below FP4 representable range" rather than "block of weights deleted in I/O".
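The fingerprint described above can be reproduced with a toy FP4 (E2M1) quantizer. This is a simplified sketch, not Ollama's converter or the full NVFP4 scheme: it uses a single floating-point scale per group, nearest-codepoint rounding, and hypothetical weight values. Only the E2M1 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6} is taken from the format itself.

```python
# Representable FP4 E2M1 magnitudes; the sign is stored separately.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize(values, scale):
    """Encode each value as (is_negative, code index) for the nearest codepoint of |v| / scale."""
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(len(FP4_E2M1)), key=lambda i: abs(FP4_E2M1[i] - mag))
        codes.append((v < 0, idx))
    return codes

def dequantize(codes, scale):
    """Decode (is_negative, code index) pairs back to floats."""
    return [(-1.0 if neg else 1.0) * FP4_E2M1[idx] * scale for neg, idx in codes]

large_weights = [3.0, -6.0, 1.5, 4.0]        # hypothetical large-magnitude rows
small_weights = [0.01, -0.02, 0.015, 0.005]  # hypothetical small-magnitude K rows

# A scale sized for the largest value in the whole tensor...
shared_scale = max(abs(v) for v in large_weights + small_weights) / 6.0
collapsed = dequantize(quantize(small_weights, shared_scale), shared_scale)

# ...versus a scale sized for the small rows alone.
own_scale = max(abs(v) for v in small_weights) / 6.0
preserved = dequantize(quantize(small_weights, own_scale), own_scale)

# Zero codes stay zero under any scale, including a large non-zero one.
zero_codes_scaled = dequantize([(False, 0)] * 4, 1144.0)

print("shared scale:", collapsed)
print("own scale:   ", preserved)
print("zero codes x 1144:", zero_codes_scaled)
```

Under the shared scale every small weight lands closer to the zero codepoint than to 0.5, which is the zero-collapse pattern; a scale chosen for the small rows keeps them non-zero.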

Suggested fixes

  1. Re-quantize with per-tensor (or per-row block) scales chosen to preserve the K projection's dynamic range — or fall back to a higher-precision quant for early-layer linear-attention weights.
  2. Check the same artefact for the mxfp8 and other variants in the library (qwen3.6:35b-a3b-coding-mxfp8, qwen3.6:35b-a3b-nvfp4, qwen3.6:27b-coding-nvfp4).
  3. Re-run the conversion using the per-tensor quant overrides path (e.g. matching the pattern in #15760, where mlp.gate and mlp.shared_expert_gate get an 8-bit override) — applying an 8-bit override (or skipping quant entirely) to linear_attn.in_proj_qkv would avoid the zero-collapse.

Related

  • #15865: independent recurrent-state precision bug in gated_delta_step (the kernel cast state to InT instead of StT). With only that fix applied and this corruption still present, the model reaches the sampler but the very first decode step gets wrong logits because layer-0 linear attention contributes zero. Together, both issues explain the "broken Qwen3.6 NVFP4 generation on Ollama".
  • #15822, #15834, #15700: user reports of broken generation on this model family on macOS; this issue likely explains the coding-nvfp4 sub-cases.
  • mlx-community / Qwen3.6-35B-A3B-4bit on HF works correctly via mlx-lm, demonstrating the architecture is otherwise sound.

Environment

  • Ollama version: v0.22.0 (as of 2026-04-28)
  • OS: macOS 26.4.1, Apple M4 Max
  • Source model: registry.ollama.ai/library/qwen3.6:35b-a3b-coding-nvfp4 (digest sha256:cd2692a833e6...)

extent analysis

TL;DR

Re-quantize the linear_attn.in_proj_qkv weights with per-tensor or per-row block scales to preserve the K projection's dynamic range and avoid zero-collapse.

Guidance

  • Verify the issue by checking the K-zero-rows count in the linear_attn.in_proj_qkv.weight tensor for layers 0 and 1 using the provided Python script.
  • Check other variants in the library (e.g., qwen3.6:35b-a3b-coding-mxfp8, qwen3.6:35b-a3b-nvfp4, qwen3.6:27b-coding-nvfp4) for similar issues.
  • Consider re-running the conversion using per-tensor quant overrides, such as applying an 8-bit override to linear_attn.in_proj_qkv to avoid the zero-collapse.
  • Review related issues (#15865, #15822, #15834, #15700) to ensure a comprehensive understanding of the problem.

Example

No code snippet is provided as the issue is related to model packaging and quantization, and the solution involves re-quantizing the weights rather than modifying code.

Notes

The issue is specific to the qwen3.6:35b-a3b-coding-nvfp4 model and its packaging, and the solution may not apply to other models or variants. The provided Python script can be used to verify the issue, but the actual fix requires re-quantizing the weights.

Recommendation

Re-quantize the linear_attn.in_proj_qkv weights with per-tensor or per-row block scales that preserve the K projection's dynamic range. This addresses the likely root cause directly: the K projection weights collapsed to zero because the quantization scale was too large for their small magnitudes.
