openclaw - 💡(How to fix) Fix DeepSeek V4-Pro reasoning_effort benchmark: xhigh vs medium/high token consumption [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#75706Fetched 2026-05-02 05:31:21
View on GitHub
Comments
1
Participants
2
Timeline
2
Reactions
2
Author
Timeline (top)
closed ×1commented ×1
RAW_BUFFERClick to expand / collapse

Experiment

Testing reasoning_effort levels on DeepSeek V4-Pro via OpenClaw 4.29. Two rounds of controlled experiments.

Phase 1 · Capped (max_tokens=500) — Raft consensus question

efforttotal tokensreasoning tokenstime
low33019310.8s
medium50433313.9s
high39723012.6s
xhigh43221110.6s

medium consumed 17% more total tokens and 58% more reasoning tokens than xhigh

Phase 2 · Uncapped (max_tokens=4000) — 3-tier cache architecture design

| effort | total tokens | reasoning tokens | content chars | time | |---|---|---|---| | high (avg) | 2,084 | 434 | 3,886 | 64s | | xhigh | 4,126 | 2,496 | 4,107 | 126s |

xhigh reasoning = 5.7x high, but final answer only 5% longer

Observations

  1. medium anomaly: With output cap, medium is the most expensive — even more than xhigh. If low/medium both map to high (per docs), why does medium show distinctly different token patterns? Different reasoning paths?

  2. xhigh ROI: 5.7x more thinking, 5% more content. Where does the extra 2000+ reasoning tokens go? Has the team done A/B evaluation on the reasoning-to-quality ROI curve?

  3. Hidden effort tiers: Docs say 4 input levels → 2 actual levels (high/max). But token patterns suggest medium has independent behavior from low/high — possibly 3 or 4 distinct tiers internally.

Environment

  • OpenClaw 2026.4.29
  • deepseek-v4-pro (75% discount period, extended to May 31)
  • Test harness: Python + OpenAI SDK

Also curious

What's the rough direction for Pro pricing in H2? xhigh at $0.12/turn is perfectly fine right now — just wondering if we can make xhigh our daily driver long-term 😄


Benchmarked by 晚星 ✨

extent analysis

TL;DR

Investigate the medium effort level in OpenClaw 4.29 to understand its distinct token patterns and potential independent behavior from low and high levels.

Guidance

  • Review the OpenClaw documentation to clarify the expected behavior of low and medium effort levels, which are supposed to map to high according to the docs.
  • Analyze the token patterns in the experiment results to identify potential differences in reasoning paths between medium and other effort levels.
  • Consider conducting an A/B evaluation to assess the ROI curve of reasoning-to-quality for xhigh and other effort levels.
  • Verify if the observed behavior is specific to the deepseek-v4-pro model or if it occurs with other models as well.

Example

No code snippet is provided as the issue does not contain explicit code references.

Notes

The issue lacks information about the internal implementation of OpenClaw and DeepSeek, which might be necessary to fully understand the observed behavior. Additionally, the pricing discussion is not directly related to the technical issue at hand.

Recommendation

Apply workaround: Investigate and potentially adjust the usage of medium effort level in the current setup, as its behavior seems to deviate from the expected mapping to high. This might involve using low or high instead, or exploring other effort levels to achieve the desired performance.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix DeepSeek V4-Pro reasoning_effort benchmark: xhigh vs medium/high token consumption [1 comments, 2 participants]