ollama - ✅(Solved) Fix qwen3.6:35b-a3b-coding-nvfp4 model file has corrupted K-projection weights (layers 0-1 entirely zero), making linear attention output zero [1 pull requests, 4 comments, 1 participants]

ollama, 2026-04-28 22:42:06

GitHub stats

ollama/ollama#15866•Fetched 2026-04-29 06:11:38

Author

andreinknv

Participants

andreinknv

Timeline (top)

commented ×4, cross-referenced ×1

Root Cause

The mlx-community model generates coherent text via mlx-lm generate (top next token for "What is 2+2?" → "Here", correctly leading into a thinking trace). The Ollama NVFP4 model cannot generate coherent text under any code fix because layers 0 and 1 of the linear-attention path produce K=0 → linear attention output = 0 → the residual stream loses the entire linear-attention contribution from those layers.
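To see why an all-zero K projection silences a linear-attention layer, the following is a minimal NumPy sketch of one gated delta-rule step in its textbook form. It is an illustration under assumptions, not Ollama's kernel: the function names, the scalar gates `alpha` and `beta`, and the toy dimensions are all hypothetical.

```python
import numpy as np

def gated_delta_step(state, q, k, v, alpha, beta):
    """One simplified gated delta-rule step (illustrative, not the real kernel).

    state: [head_v_dim, head_k_dim] recurrent memory
    q, k:  [head_k_dim] query and key; v: [head_v_dim] value
    alpha: scalar decay gate; beta: scalar write strength
    """
    state = alpha * state                              # decay the old memory
    recalled = state @ k                               # what the memory returns for k
    state = state + beta * np.outer(v - recalled, k)   # write the correction along k
    return state, state @ q                            # new state and this step's output

def run(keys, seed=0):
    """Feed a sequence of keys through the step, starting from an empty state."""
    rng = np.random.default_rng(seed)
    dv, dk = 4, keys.shape[1]
    state = np.zeros((dv, dk), dtype=np.float32)
    outputs = []
    for k in keys:
        q = rng.standard_normal(dk).astype(np.float32)
        v = rng.standard_normal(dv).astype(np.float32)
        state, out = gated_delta_step(state, q, k, v, alpha=0.9, beta=0.5)
        outputs.append(out)
    return np.stack(outputs)

healthy = run(np.ones((3, 8), dtype=np.float32))
zeroed = run(np.zeros((3, 8), dtype=np.float32))   # K dequantizes to all zeros
print("healthy max |out|:", np.abs(healthy).max())
print("zeroed  max |out|:", np.abs(zeroed).max())
```

With `k = 0` every write is an outer product with a zero vector, so a state that starts empty stays empty and the layer's output is zero no matter what Q and V contain.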

Fix Action

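Linked PR (open, not merged when fetched): fix(mlxrunner): preserve fp32 precision in gated_delta_step recurrent state (https://github.com/ollama/ollama/pull/15870)

The PR addresses the related recurrent-state precision bug (#15865) and references this issue; the corrupted model file itself still needs to be re-quantized.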

PR fix notes

PR #15870: fix(mlxrunner): preserve fp32 precision in gated_delta_step recurrent state

Repository: ollama/ollama
Author: andreinknv
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/15870

Description (problem / solution / changelog)

Summary

Fixes incoherent generation in Qwen3.5/3.6 GatedDeltaNet (linear-attention) layers by preserving the fp32 recurrent-state accumulator across kernel invocations, matching MLX-LM's reference. Refs #15865, #15866.

The gated_delta_step Metal/CUDA kernel computed state in float (fp32) inside the inner loop but cast it back to InT (the input dtype, typically bf16) when writing to o_state. That cast truncated the 23-bit fp32 mantissa to bf16's 7 bits, discarding 16 bits on every recurrent step. Across 30 linear-attention layers and N tokens of a real prompt, the state degrades enough that generation becomes incoherent (e.g. "Copyright ofusr =" for a "What is 2+2?" prompt against mlx-community/Qwen3.6-35B-A3B-4bit).
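The effect of writing the state back at bf16 precision can be illustrated with a toy decaying accumulator. This is a sketch under assumptions, not the gated_delta_step kernel: `round_to_bf16`, `drift`, the decay constant, and the update scale are hypothetical, and bf16 is emulated in NumPy by rounding fp32 bit patterns.

```python
import numpy as np

def round_to_bf16(x):
    """Round fp32 values to the nearest bf16 value (ties to even), returned as fp32."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    rounding = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + rounding) & np.uint32(0xFFFF0000)).view(np.float32)

DECAY = np.float32(0.99)

def drift(steps, state_in_bf16, seed=0):
    """Max abs error of a decaying accumulator versus a float64 reference."""
    rng = np.random.default_rng(seed)
    state = np.zeros(64, dtype=np.float32)
    ref = np.zeros(64, dtype=np.float64)
    for _ in range(steps):
        update = (0.01 * rng.standard_normal(64)).astype(np.float32)
        state = DECAY * state + update              # arithmetic in fp32, as in the inner loop
        ref = float(DECAY) * ref + update.astype(np.float64)
        if state_in_bf16:
            state = round_to_bf16(state)            # old behaviour: store the state as bf16
    return float(np.abs(state - ref).max())

err_fp32 = drift(200, state_in_bf16=False)
err_bf16 = drift(200, state_in_bf16=True)
print(f"fp32 state error: {err_fp32:.2e}, bf16 state error: {err_bf16:.2e}")
```

Because the rounding is applied to the carried state rather than to a one-off output, each step's error is fed into the next step, which is why widening only the accumulator is enough to matter.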

Changes

Two surgical changes in two files:

`x/mlxrunner/mlx/gated_delta.go`

Add an StT template arg to both Metal and CUDA kernels (separate from InT)
Cast state writes via static_cast<StT>(state[i]) instead of static_cast<InT>(state[i])
Loosen the state.DType() == dtype precondition so the kernel accepts fp32 state alongside bf16 inputs
Set the state output_arg dtype to state.DType() instead of the input dtype

`x/mlxrunner/cache/recurrent.go`

Hardcode deltaState to fp32 in ensure(). Conv state continues to track the activation dtype (typically bf16); only the recurrent accumulator is widened.
Documented why with a comment pointing at the kernel side and the MLX-LM reference.

No API changes, no qwen3_5.go changes -- call sites still pass a single dtype to RecurrentCache.Get(b, dtype), which now applies only to conv state. Full-attention layers and other models that don't use RecurrentCache are unaffected.

Why hardcode fp32 vs. add a parameter

MLX-LM's reference always allocates the recurrent state as mx.float32. There's no current model that wants a different precision for it, and adding a knob would just push the decision to every call site without a use case to justify it. Easier to revisit if a model ever needs bf16 state for memory reasons.

Memory cost

Delta state is [B, num_v_heads, head_v_dim, head_k_dim]. For Qwen3.6-35B-A3B that is 32 v-heads x 128 x 128 per head x 30 linear layers x B=1 = 15,728,640 elements, or ~63 MB at fp32 vs ~31 MB at bf16. Negligible relative to the model weights.
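The footprint can be checked directly from the dimensions quoted above. The helper name `delta_state_bytes` is hypothetical; the shape values are the ones stated in the PR description.

```python
def delta_state_bytes(batch, v_heads, head_v_dim, head_k_dim, linear_layers, bytes_per_elem):
    """Total bytes of recurrent delta state across all linear-attention layers."""
    elems = batch * v_heads * head_v_dim * head_k_dim * linear_layers
    return elems * bytes_per_elem

fp32_bytes = delta_state_bytes(1, 32, 128, 128, 30, 4)   # 4 bytes per fp32 element
bf16_bytes = delta_state_bytes(1, 32, 128, 128, 30, 2)   # 2 bytes per bf16 element
print(f"fp32: {fp32_bytes / 1e6:.1f} MB, bf16: {bf16_bytes / 1e6:.1f} MB")
# prints: fp32: 62.9 MB, bf16: 31.5 MB
```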

Verification

go build ./x/... -- clean
go test ./x/mlxrunner/cache/... ./x/mlxrunner/mlx/... -- all pass

Functional verification (M4 Max / 36 GB / macOS 26.4) using mlx-community/Qwen3.6-35B-A3B-4bit imported via ollama create --experimental:

| | Before | After |
|---|---|---|
| `"What is 2+2?"` | `"\n\n// Copyright ofusr = ..."` | `"<think>\n\n</think>\n\nThe result of $2 + 2$ is 4."` |
| `"Reverse a string in Go"` | gibberish | clean idiomatic Go one-liner |
| Throughput | 110 tok/s | 110 tok/s |
| `nomic-embed-text` (regression check) | works | works |

mlx-lm's reference on the same checkpoint runs ~112 tok/s for comparison.

Test plan

Reviewer can reproduce on Apple Silicon by importing any mlx-community/Qwen3.* MoE checkpoint via ollama create --experimental and running a short prompt -- output should be coherent rather than the Copyright/static/-1999... pattern reported in #15866.
CI: existing cache + mlx tests should continue to pass (verified locally).

Generated with Claude Code

Changed files

x/mlxrunner/cache/recurrent.go (modified, +10/-2)
x/mlxrunner/mlx/gated_delta.go (modified, +22/-6)

`qwen3.6:35b-a3b-coding-nvfp4` model file has corrupted K-projection weights for early linear-attention layers

What is the issue?

The NVFP4 packaging of qwen3.6:35b-a3b-coding-nvfp4 shipped from registry.ollama.ai/library/qwen3.6 has the K-projection portion of linear_attn.in_proj_qkv.weight entirely zeroed out for layer 0 and mostly zeroed for layer 1 — making coherent generation impossible regardless of any runner-side fix.

This is a model-packaging issue, not a runtime bug. The mlx-community/Qwen3.6-35B-A3B-4bit model (same architecture, same source weights, MLX 4-bit affine quant instead of NVFP4) has the expected natural sparsity (3-5%) in the same regions.

Reproducer

The linear_attn.in_proj_qkv.weight tensor is laid out per HF transformers as [Q rows 0..2047 | K rows 2048..4095 | V rows 4096..8191]. A direct read of the safetensors blob shows:

layer  0  K-zero-rows = 2048 / 2048   (100% — entire K projection is zero!)
layer  1  K-zero-rows = 1797 / 2048   ( 88%)
layer  2  K-zero-rows =   57 / 2048   (  3%)
layer  3  ... (full attention layer, no in_proj_qkv)
layer  4+ K-zero-rows ≤ 2 / 2048      (~natural sparsity)

For comparison, on mlx-community/Qwen3.6-35B-A3B-4bit (4-bit affine, same architecture):

layer 0  K-zero-rows =  56 / 2048
layer 1  K-zero-rows = 102 / 2048
layer 2  K-zero-rows =  56 / 2048

Verification commands

# Reproduce against the cached blob (replace the SHA with whichever blob holds the layer-0 in_proj_qkv tensor).
import json
import struct

import numpy as np

blob = "/Users/<you>/.ollama/models/blobs/sha256-..."   # blob carrying layers.0.linear_attn.in_proj_qkv
with open(blob, 'rb') as f:
    # safetensors layout: 8-byte little-endian header length, JSON header, then tensor bytes
    hsz = struct.unpack('<Q', f.read(8))[0]
    hdr = json.loads(f.read(hsz))
    body = 8 + hsz   # data_offsets are relative to the end of the header
    info = hdr['language_model.model.layers.0.linear_attn.in_proj_qkv.weight']
    s, e = info['data_offsets']
    f.seek(body + s)
    arr = np.frombuffer(f.read(e - s), dtype=np.uint32).reshape(info['shape'])
    # K occupies rows 2048..4095; a row counts as zero when every packed word is 0
    k_rows = arr[2048:4096]
    print('K rows zero:', int((~k_rows.any(axis=1)).sum()), '/ 2048')
# → 2048 / 2048 on the broken model
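The same check can be repeated across several layers to rebuild the per-layer table. The sketch below assumes that all requested tensors live in one safetensors-format blob, that the packed weights are stored as 2-D uint32 words as in the script above, and that full-attention layers simply lack the in_proj_qkv key; `read_header`, `count_zero_rows`, and `scan_layers` are hypothetical helper names.

```python
import json
import struct

import numpy as np

K_ROWS = slice(2048, 4096)   # K block, per the row layout quoted in the issue

def read_header(path):
    """Return (JSON header, byte offset where tensor data starts)."""
    with open(path, "rb") as f:
        (hsz,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(hsz)), 8 + hsz

def count_zero_rows(path, name, rows=K_ROWS):
    """Return (zero rows, total rows) in the given row window, or None if absent."""
    header, body = read_header(path)
    info = header.get(name)
    if info is None:          # e.g. a full-attention layer with no in_proj_qkv
        return None
    start, end = info["data_offsets"]
    with open(path, "rb") as f:
        f.seek(body + start)
        raw = f.read(end - start)
    arr = np.frombuffer(raw, dtype=np.uint32).reshape(info["shape"])
    block = arr[rows]
    return int((~block.any(axis=1)).sum()), block.shape[0]

def scan_layers(path, layers, rows=K_ROWS):
    """Map layer index -> (zero rows, total rows) or None for each requested layer."""
    results = {}
    for i in layers:
        name = f"language_model.model.layers.{i}.linear_attn.in_proj_qkv.weight"
        results[i] = count_zero_rows(path, name, rows)
    return results
```

For example, `scan_layers(blob, range(5))` with `blob` set as in the script above would report layers 0 through 4, returning `None` for any layer whose tensor is not in that blob.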

Likely cause

The conversion from BF16 (or the source FP8/BF16 RedHatAI NVFP4) to Ollama's NVFP4 packaging picked a global scale per tensor that was too large to represent the small magnitudes in early-layer K projections — they all rounded to the FP4 zero codepoint.

For verification: the scales tensor for in_proj_qkv is non-zero across all rows (max-row-abs ≈ 1144 on K rows for layer 0), but the underlying FP4 nibbles are all zero, so dequantization yields zero regardless of the (non-zero) scale. That fingerprint matches "small-magnitude rows quantized below FP4 representable range" rather than "block of weights deleted in I/O".
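The hypothesized zero-collapse can be reproduced on toy data. The sketch below is a simplified model, not Ollama's converter or the real NVFP4 block format: it rounds values to the eight FP4 (E2M1) magnitudes after dividing by a scale, and compares one scale shared by the whole tensor with one scale per row. The function name `fake_quantize_fp4` and the sample weights are hypothetical.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign is a separate bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quantize_fp4(x, scale):
    """Quantize to the nearest FP4 magnitude at the given scale, then dequantize."""
    scaled = np.abs(x) / scale
    nearest = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[nearest] * scale

# Row 0 mimics a small-magnitude K row; row 1 holds much larger values.
weights = np.array([[0.001, -0.002, 0.0015, 0.0005],
                    [3.0, -6.0, 1.5, 0.5]], dtype=np.float32)

global_scale = np.abs(weights).max() / 6.0                    # one scale for everything
row_scale = np.abs(weights).max(axis=1, keepdims=True) / 6.0  # one scale per row

global_q = fake_quantize_fp4(weights, global_scale)
row_q = fake_quantize_fp4(weights, row_scale)
print("global scale, small row: ", global_q[0])
print("per-row scale, small row:", row_q[0])
```

With the shared scale, every value in the small row falls below half of the smallest non-zero FP4 step and rounds to zero, while the per-row scale keeps all four entries non-zero.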

Suggested fixes

Re-quantize with per-tensor (or per-row block) scales chosen to preserve the K projection's dynamic range — or fall back to a higher-precision quant for early-layer linear-attention weights.
Check the same artefact for the mxfp8 and other variants in the library (qwen3.6:35b-a3b-coding-mxfp8, qwen3.6:35b-a3b-nvfp4, qwen3.6:27b-coding-nvfp4).
Re-run the conversion using the per-tensor quant overrides path (e.g. matching the pattern in #15760, where mlp.gate and mlp.shared_expert_gate get an 8-bit override) — applying an 8-bit override (or skipping quant entirely) to linear_attn.in_proj_qkv would avoid the zero-collapse.

Related

#15865: independent recurrent-state precision bug in gated_delta_step (the kernel cast state to InT instead of StT). With only that fix applied and this corruption still present, the model reaches the sampler, but the very first decode step gets wrong logits because layer-0 linear attention contributes zero. Together, the two issues explain the "broken Qwen3.6 NVFP4 generation on Ollama".
#15822, #15834, #15700: user reports of broken generation on this model family on macOS; this issue likely explains the coding-nvfp4 sub-cases.
mlx-community/Qwen3.6-35B-A3B-4bit on HF works correctly via mlx-lm, demonstrating that the architecture is otherwise sound.

Environment

Ollama version: v0.22.0 (as of 2026-04-28)
OS: macOS 26.4.1, Apple M4 Max
Source model: registry.ollama.ai/library/qwen3.6:35b-a3b-coding-nvfp4 (digest sha256:cd2692a833e6...)

Extent analysis

TL;DR

Re-quantize the linear_attn.in_proj_qkv weights with per-tensor or per-row block scales to preserve the K projection's dynamic range and avoid zero-collapse.

Guidance

Verify the issue by checking the K-zero-rows count in the linear_attn.in_proj_qkv.weight tensor for layers 0 and 1 using the provided Python script.
Check other variants in the library (e.g., qwen3.6:35b-a3b-coding-mxfp8, qwen3.6:35b-a3b-nvfp4, qwen3.6:27b-coding-nvfp4) for similar issues.
Consider re-running the conversion using per-tensor quant overrides, such as applying an 8-bit override to linear_attn.in_proj_qkv to avoid the zero-collapse.
Review related issues (#15865, #15822, #15834, #15700) to ensure a comprehensive understanding of the problem.

Example

No code snippet is provided as the issue is related to model packaging and quantization, and the solution involves re-quantizing the weights rather than modifying code.

Notes

The issue is specific to the qwen3.6:35b-a3b-coding-nvfp4 model and its packaging, and the solution may not apply to other models or variants. The provided Python script can be used to verify the issue, but the actual fix requires re-quantizing the weights.

Recommendation

Apply a workaround by re-quantizing the linear_attn.in_proj_qkv weights with per-tensor or per-row block scales to preserve the K projection's dynamic range. This approach is recommended because it directly addresses the root cause of the issue, which is the zero-collapse of the K projection weights due to inadequate quantization.
