vLLM [Bug]: Kimi-K2.6 intermittently outputs only "!!!!!!!!!!" in the reasoning field with content null


We are seeing intermittent bad generations from both moonshotai/Kimi-K2.6 and RedHatAI/Kimi-K2.6-NVFP4 served with vLLM. The response has content: null, while the reasoning field contains only repeated exclamation marks, e.g. "!!!!!!!!!!".

This looks very similar to the Kimi-K2.5 issue discussed in #36763, especially this comment about FA4 NaNs at KV length >= 8192: https://github.com/vllm-project/vllm/issues/36763#issuecomment-4170658649

The issue is not always immediate. In one deployment it appeared only after the service had been running for a few days. It seems to show up when the model has been running for a while, when it is under high traffic with large contexts, or both.

When the model enters this bad state, requesting logprobs exposes NaNs in the response path. The request fails with:

400 - {'error': {'message': 'Out of range float values are not JSON compliant: nan', ...}}

This happens for a request such as:

curl -X POST https://<HOST>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <TOKEN>" \
  -d '{
    "model": "kimi-k2.6",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, can you help me?"}
    ],
    "temperature": 0.7,
    "max_tokens": 20,
    "logprobs": true,
    "top_logprobs": 5
  }'
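
The wording of the 400 error matches the ValueError that Python's standard json module raises when it is asked to serialize NaN with allow_nan=False. That suggests, as an assumption not confirmed in this issue, that NaN logprobs reach the response serializer. A minimal sketch with a made-up payload (the dict below is hypothetical, not real vLLM output):

```python
import json

# Hypothetical payload shaped loosely like one logprobs entry; not real vLLM output.
payload = {"token": "!", "logprob": float("nan")}


def try_serialize(obj):
    """Return (ok, text) for strict JSON serialization of obj."""
    try:
        return True, json.dumps(obj, allow_nan=False)
    except ValueError as exc:
        # Strict mode rejects NaN and +/-Infinity.
        return False, str(exc)


ok, message = try_serialize(payload)
print(ok, message)

# The default (lenient) mode emits the non-standard token NaN instead of raising.
print(json.dumps(payload))
```

The exact suffix of the message (": nan") can vary between Python versions, so matching on the "not JSON compliant" part is safer than matching the whole string.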

We are still trying to isolate whether this is caused by FA4 MLA prefill, FLASHINFER_MLA decode, NVFP4 GEMM/MoE kernels, long-running state, high-concurrency scheduling, or a specific traffic pattern.
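
To correlate the onset of the bad state with uptime or traffic, it may help to classify each response automatically. The sketch below is only an illustration: classify_response is a hypothetical helper, and the "reasoning" / "reasoning_content" field names are assumptions based on the description above rather than a confirmed schema.

```python
def classify_response(status, body):
    """Classify a chat-completions result.

    status: HTTP status code (int); body: parsed JSON response (dict).
    Returns 'nan_error', 'degenerate', 'ok', or 'other_error'.
    """
    if status != 200:
        error = body.get("error")
        message = error.get("message", "") if isinstance(error, dict) else str(error)
        # Matches the NaN serialization failure quoted in this report.
        if "not JSON compliant" in message:
            return "nan_error"
        return "other_error"

    choices = body.get("choices") or []
    if not choices:
        return "other_error"

    msg = choices[0].get("message") or {}
    # Assumed field names: the report says "reasoning"; some versions may use "reasoning_content".
    reasoning = msg.get("reasoning") or msg.get("reasoning_content") or ""
    # Bad state as described: content is null and reasoning is nothing but "!".
    if msg.get("content") is None and reasoning and set(reasoning) == {"!"}:
        return "degenerate"
    return "ok"


sample = {"choices": [{"message": {"content": None, "reasoning": "!!!!!!!!!!"}}]}
print(classify_response(200, sample))
```

Logging the result alongside a timestamp, prompt length, and concurrent request count would give a first indication of whether long contexts, load, or uptime is the trigger.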


Your current environment

  • Models: moonshotai/Kimi-K2.6 and RedHatAI/Kimi-K2.6-NVFP4
  • Parsers: --reasoning-parser kimi_k2 --tool-call-parser kimi_k2
  • vLLM image tested: v0.20.0 and v0.18.1
  • GPUs: 8xB200

