vllm - 💡(How to fix) Fix [Bug]: Gemma 4 31B FP8_BLOCK checkpoint produces garbage repetitive output — logit saturation at softcap wall due to absorbed activation scales being double-applied [9 comments, 3 participants]

GitHub stats
vllm-project/vllm#39407 (fetched 2026-04-10 03:40:49)

Root Cause

Root cause identified via instrumented diagnostics: The FP8_BLOCK checkpoint has activation scales already absorbed into the weights at quantization time (llm-compressor's default behavior for FP8_BLOCK). However, process_weights_after_loading() in compressed_tensors_w8a8_fp8.py still applies dynamic per-token activation quantization at inference time. This double-scales activations, causing hidden state norms to blow up across layers until the logits saturate at the softcap wall: after 30 * tanh(x/30), the top logit pins at 23.625 in BF16. With the same token's logit pinned at ~23.625 on every step, the distribution collapses and the model greedily repeats a single token.


[Bug]: Gemma 4 31B FP8_BLOCK checkpoint produces garbage repetitive output (" a a a a") — logit saturation at softcap wall due to absorbed activation scales being double-applied

Describe the bug

When running Gemma 4 31B (dense, not MoE) with an FP8_BLOCK checkpoint produced by llm-compressor, vLLM produces garbage repetitive output consisting of a single token repeated indefinitely (e.g., " a a a a a a a a"). The model loads without errors and tokenizes correctly, but generation is completely broken.

Root cause identified via instrumented diagnostics: The FP8_BLOCK checkpoint has activation scales already absorbed into the weights at quantization time (llm-compressor's default behavior for FP8_BLOCK). However, process_weights_after_loading() in compressed_tensors_w8a8_fp8.py still applies dynamic per-token activation quantization at inference time. This double-scales activations, causing hidden state norms to blow up across layers until the logits saturate at the softcap wall: after 30 * tanh(x/30), the top logit pins at 23.625 in BF16. With the same token's logit pinned at ~23.625 on every step, the distribution collapses and the model greedily repeats a single token.

A secondary contributing factor: CUTLASS block FP8 GEMM is not supported on SM 12.0 (RTX 5060 Ti / Blackwell) with CUDA < 12.9, so vLLM falls back to a Triton kernel that has no tuned config for the RTX 5060 Ti and warns about using defaults. This fallback itself produced correct numerical output in our testing; the saturation issue is entirely in the scale application path, not the GEMM kernel.

Environment

vLLM version: 0.1.dev13+g6155bbd1d (main, ~April 9 2026)
Python: 3.13
CUDA: 12.8
GPU: 4× NVIDIA GeForce RTX 5060 Ti 16GB (SM 12.0 / Blackwell)
tensor_parallel_size: 4
Model: Gemma 4 31B (google/gemma-4-31b-it)
Checkpoint type: FP8_BLOCK (llm-compressor, static weight scales, absorbed activation scales)
attention_backend: TRITON_ATTN (enforce_eager=True)
dtype: bfloat16

Reproduction

vllm serve /data/models/gemma4-31b-fp8 \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --enforce-eager \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.94 \
  --attention-backend TRITON_ATTN \
  --served-model-name assistant gemma4-31b

# Any prompt produces garbage output:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"assistant","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":50}'
# Returns: " a a a a a a a a a a a a a a a a a a a a..."

Checkpoint was created with llm-compressor using QuantizationScheme(targets="Linear", weights=FP8_BLOCK, input_activations=FP8_BLOCK). The resulting checkpoint has per-block static weight scales in weight_scale tensors and activation scales absorbed into the weights (no separate input_scale tensors at runtime).

Diagnostic evidence

We added instrumentation to gemma4.py, compressed_tensors_w8a8_fp8.py, and fp8_utils.py to trace norms through the forward pass. Key findings:

1. Embedding output is correct

[EMBED] ids=[3689, 563, ...] embed_norm=3.286 result_norm=241.504 normalizer=73.500

Token embeddings look healthy — the issue is not in the embedding layer.
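
The logged embedding values can be cross-checked with a few lines of arithmetic. The sketch below assumes a hidden size of 5376 (inferred from the weight shape w=(4096, 5376) logged in finding 2) and assumes the normalizer is sqrt(hidden_size) rounded to bfloat16, as in earlier Gemma releases. `to_bf16` is a helper written for this illustration, not a vLLM function.

```python
import math
import struct


def to_bf16(value: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # bfloat16 keeps only the top 16 bits of the float32 encoding.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]


HIDDEN_SIZE = 5376  # assumption: taken from w=(4096, 5376) in the log
EMBED_NORM = 3.286  # from the [EMBED] log line

normalizer = to_bf16(math.sqrt(HIDDEN_SIZE))
expected_result_norm = EMBED_NORM * normalizer

print(f"normalizer={normalizer}")
print(f"expected result_norm={expected_result_norm:.3f} (logged: 241.504)")
```

Under these assumptions the computed normalizer is 73.5 and the expected norm is about 241.52, both in line with the logged values, which supports the conclusion that the embedding layer is not the source of the problem.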

2. FP8 GEMM output is correct

[TRITON_REAL] w=(4096, 5376) ws=(32, 42) ws_min=1.259e-04 ws_max=5.150e-04 x_norm=1505.39

The Triton block FP8 kernel itself produces numerically correct outputs. The activation norm of 1505 entering layer 0's attention projection is already inflated (should be ~O(10) for a healthy model), indicating scales have been applied upstream before the GEMM, and the GEMM then applies them again.

3. Hidden state norms blow up across layers

[H] layer=0  norm=185.27  scalar=0.08936  residual_norm=1963.36
[H] layer=59 norm=17.67   scalar=0.03638  (last layer — norms collapse after blowup)

Norms explode in early layers and collapse by layer 59, consistent with accumulated numerical overflow from double-scaling.

4. Logits saturate at the softcap ceiling

# Early tokens (before saturation):
[LOGIT] top5_vals=[5.906, 3.265, 3.25, 3.031, 2.937]  ← healthy distribution

# After a few tokens, full saturation:
[LOGIT] top5_vals=[23.625, 21.625, 21.25, 19.375, 18.875] top5_ids=[496, 236780, ...]
[LOGIT] top5_vals=[23.625, 21.5,   21.0,  19.125, 19.0  ] top5_ids=[496, 236780, ...]

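The top logit is pinned at 23.625 on every step, and the same token (id=496, a) wins each time. Note that 23.625 is exactly representable in bfloat16, and because 30 * tanh(x/30) is bounded by 30, this value corresponds to a pre-softcap logit of roughly 32 rather than to the asymptotic limit of the tanh function. The near-identical top-5 pattern across steps is what marks the collapse.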
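
The relationship between a logged (post-softcap) logit and its pre-softcap value can be checked numerically. This is a minimal sketch, assuming the softcap of 30 quoted in this report; `softcap` and `inverse_softcap` are illustrative helpers, not vLLM functions.

```python
import math

SOFTCAP = 30.0  # softcap value stated in the report


def softcap(x: float, cap: float = SOFTCAP) -> float:
    """Apply tanh soft-capping: the output is bounded by (-cap, cap)."""
    return cap * math.tanh(x / cap)


def inverse_softcap(y: float, cap: float = SOFTCAP) -> float:
    """Recover the pre-softcap value for a post-softcap logit |y| < cap."""
    return cap * math.atanh(y / cap)


pinned_logit = 23.625  # top logit from the [LOGIT] log lines
pre_softcap = inverse_softcap(pinned_logit)
print(f"pre-softcap value for {pinned_logit}: {pre_softcap:.2f}")

# Show how the capped value approaches the bound as the input grows.
for x in (10.0, 30.0, 100.0, 1000.0):
    print(f"softcap({x}) = {softcap(x):.3f}")
```

Inverting the softcap this way makes it easy to compare the healthy early logits (top value 5.906) with the pinned value and see how much the pre-softcap magnitude grew between the two.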

5. SM 12.0 CUTLASS fallback (separate issue, does not cause the saturation)

WARNING: SM 12.x requires CUDA >= 12.9  (×8, one per worker)
WARNING: Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal!
         Config file not found at .../NVIDIA_GeForce_RTX_5060_Ti,dtype=fp8_w8a8,block_shape=[128,128].json

vLLM falls back to Triton for block FP8 GEMMs on SM 12.0 with CUDA 12.8. The Triton fallback is numerically correct but untuned. This is a separate issue from the saturation bug.

Suspected fix location

vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py

Specifically in process_weights_after_loading(): when the checkpoint is FP8_BLOCK with absorbed activation scales (no input_scale tensors), the code should detect that activations are pre-scaled and skip dynamic per-token activation quantization at inference time. Currently it applies dynamic quantization regardless, treating the weights as if they still expect unscaled activations.

A potential detection heuristic: if the checkpoint has weight_scale tensors but no input_scale tensors and the quantization config indicates FP8_BLOCK with static=True for activations, assume scales are absorbed and disable dynamic activation quantization for those layers.
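
The heuristic above can be sketched as a standalone check over the tensor names in a checkpoint. The function name, its parameters, and the example tensor names are hypothetical and chosen for illustration; this is not how vLLM currently inspects checkpoints. In practice the names could come from something like the keys of a file opened with `safetensors.safe_open`.

```python
from typing import Iterable


def looks_like_absorbed_activation_scales(
    tensor_names: Iterable[str],
    activations_static: bool,
) -> bool:
    """Hypothetical heuristic: weight scales present, input scales absent,
    and the quantization config declares static activations."""
    names = list(tensor_names)
    has_weight_scale = any(n.endswith(".weight_scale") for n in names)
    has_input_scale = any(n.endswith(".input_scale") for n in names)
    return activations_static and has_weight_scale and not has_input_scale


# Example tensor names (made up for this sketch).
absorbed_checkpoint = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.weight_scale",
]
explicit_checkpoint = absorbed_checkpoint + [
    "model.layers.0.self_attn.q_proj.input_scale",
]

print(looks_like_absorbed_activation_scales(absorbed_checkpoint, True))
print(looks_like_absorbed_activation_scales(explicit_checkpoint, True))
```

A check like this only reports what the checkpoint contains. Whether absent input_scale tensors really mean the scales were absorbed would still need to be confirmed against the llm-compressor configuration used to produce the checkpoint.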

Additional context

  • Running with bfloat16 dtype and enforce_eager=True (no CUDA graphs) to isolate the issue.
  • Switching to llama.cpp (llama-turboquant) with the GGUF Q5_K_M checkpoint works correctly and produces coherent output at 20.6 tok/s — confirming the model weights themselves are not corrupt.
  • The issue is reproducible across multiple prompt types and temperatures; it is not a sampling artifact.
  • vLLM version tested: main branch ~April 9 2026 (post Gemma 4 initial support merge).

extent analysis

TL;DR

The most likely fix is to modify the process_weights_after_loading() function in compressed_tensors_w8a8_fp8.py to skip dynamic per-token activation quantization when the checkpoint has absorbed activation scales.

Guidance

  • Identify if the checkpoint has weight_scale tensors but no input_scale tensors and the quantization config indicates FP8_BLOCK with static=True for activations.
  • Modify the process_weights_after_loading() function to detect this condition and disable dynamic activation quantization for those layers.
  • Verify that the model produces correct output after applying the fix by running the reproduction script and checking the response.
  • Test the model with different prompts and temperatures to ensure the issue is fully resolved.
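
When verifying a fix across many prompts, a crude automated check for the single-token repetition symptom can help. The sketch below splits on whitespace rather than using the model tokenizer, and the function names and thresholds are assumptions picked for illustration.

```python
from collections import Counter


def dominant_token_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens equal to the most common one."""
    tokens = text.split()
    if not tokens:
        return 0.0
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens)


def looks_degenerate(text: str, threshold: float = 0.8, min_tokens: int = 10) -> bool:
    """Flag responses dominated by one repeated token (thresholds are arbitrary)."""
    return len(text.split()) >= min_tokens and dominant_token_ratio(text) >= threshold


broken_response = " a" * 20          # the symptom reported in this issue
healthy_response = "2 + 2 equals 4."  # a plausible coherent answer

print(looks_degenerate(broken_response))
print(looks_degenerate(healthy_response))
```

The response text would come from the `choices[0].message.content` field of the chat completions reply used in the reproduction; short answers fall below `min_tokens` and are never flagged.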

Example

# Hypothetical sketch: the signature, config fields, and attribute names are illustrative, not vLLM's actual API.
def process_weights_after_loading(self, weights, config):
    # ... existing code ...
    scales_absorbed = (
        config.quantization_scheme == 'FP8_BLOCK' and config.static_activations
        and 'weight_scale' in weights and 'input_scale' not in weights
    )
    if scales_absorbed:
        # Skip dynamic per-token activation quantization
        self.apply_dynamic_quantization = False
    # ... existing code ...

Notes

The fix assumes that the issue is caused by the double application of activation scales. If the problem persists after applying the fix, further investigation may be needed to identify other contributing factors.

Recommendation

Apply the workaround by modifying the process_weights_after_loading() function to skip dynamic per-token activation quantization when the checkpoint has absorbed activation scales. This should resolve the issue and produce correct output.
