openclaw - 💡(How to fix) Fix DeepSeek V4-Pro reasoning_effort benchmark: xhigh vs medium/high token consumption [1 comments, 2 participants]

openclaw2026-05-01 15:17:21

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#75706•Fetched 2026-05-02 05:31:21

View on GitHub

Comments

Participants

Timeline

Reactions

Author

17329971

Participants

17329971

clawsweeper[bot]

Timeline (top)

closed ×1commented ×1

RAW_BUFFERClick to expand / collapse

Experiment

Testing reasoning_effort levels on DeepSeek V4-Pro via OpenClaw 4.29. Two rounds of controlled experiments.

Phase 1 · Capped (max_tokens=500) — Raft consensus question

effort	total tokens	reasoning tokens	time
low	330	193	10.8s
medium	504	333	13.9s
high	397	230	12.6s
xhigh	432	211	10.6s

medium consumed 17% more total tokens and 58% more reasoning tokens than xhigh

Phase 2 · Uncapped (max_tokens=4000) — 3-tier cache architecture design

| effort | total tokens | reasoning tokens | content chars | time | |---|---|---|---| | high (avg) | 2,084 | 434 | 3,886 | 64s | | xhigh | 4,126 | 2,496 | 4,107 | 126s |

xhigh reasoning = 5.7x high, but final answer only 5% longer

Observations

medium anomaly: With output cap, medium is the most expensive — even more than xhigh. If low/medium both map to high (per docs), why does medium show distinctly different token patterns? Different reasoning paths?
xhigh ROI: 5.7x more thinking, 5% more content. Where does the extra 2000+ reasoning tokens go? Has the team done A/B evaluation on the reasoning-to-quality ROI curve?
Hidden effort tiers: Docs say 4 input levels → 2 actual levels (high/max). But token patterns suggest medium has independent behavior from low/high — possibly 3 or 4 distinct tiers internally.

Environment

OpenClaw 2026.4.29
deepseek-v4-pro (75% discount period, extended to May 31)
Test harness: Python + OpenAI SDK

Also curious

What's the rough direction for Pro pricing in H2? xhigh at $0.12/turn is perfectly fine right now — just wondering if we can make xhigh our daily driver long-term 😄

Benchmarked by 晚星 ✨

extent analysis

TL;DR

Investigate the medium effort level in OpenClaw 4.29 to understand its distinct token patterns and potential independent behavior from low and high levels.

Guidance

Review the OpenClaw documentation to clarify the expected behavior of low and medium effort levels, which are supposed to map to high according to the docs.
Analyze the token patterns in the experiment results to identify potential differences in reasoning paths between medium and other effort levels.
Consider conducting an A/B evaluation to assess the ROI curve of reasoning-to-quality for xhigh and other effort levels.
Verify if the observed behavior is specific to the deepseek-v4-pro model or if it occurs with other models as well.

Example

No code snippet is provided as the issue does not contain explicit code references.

Notes

The issue lacks information about the internal implementation of OpenClaw and DeepSeek, which might be necessary to fully understand the observed behavior. Additionally, the pricing discussion is not directly related to the technical issue at hand.

Recommendation

Apply workaround: Investigate and potentially adjust the usage of medium effort level in the current setup, as its behavior seems to deviate from the expected mapping to high. This might involve using low or high instead, or exploring other effort levels to achieve the desired performance.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#search optimization #API routing #API middleware #SSR setup #ISR setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix DeepSeek V4-Pro reasoning_effort benchmark: xhigh vs medium/high token consumption [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Experiment

Phase 1 · Capped (max_tokens=500) — Raft consensus question

Phase 2 · Uncapped (max_tokens=4000) — 3-tier cache architecture design

Observations

Environment

Also curious

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix DeepSeek V4-Pro reasoning_effort benchmark: xhigh vs medium/high token consumption [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Experiment

Phase 1 · Capped (max_tokens=500) — Raft consensus question

Phase 2 · Uncapped (max_tokens=4000) — 3-tier cache architecture design

Observations

Environment

Also curious

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING