hermes - ✅(Solved) Fix xiaomi/MiMo: reasoning_effort has no measurable server-side effect (4 efforts × N=5 on AIME 2025 II P2) [1 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#17314 (fetched 2026-04-30 06:48:27)
Comments: 0 · Participants: 1 · Reactions: 0 · Timeline events: 5 (labeled ×4, cross-referenced ×1)

Root Cause

run_agent.py:_supports_reasoning_extra_body() short-circuits to False for any base_url not on its whitelist (nousresearch.com, ai-gateway.vercel.sh, github models, openrouter with specific model prefixes, lmstudio with detected capability). The MiMo direct endpoint isn't on that list, and there's no is_xiaomi branch in agent/transports/chat_completions.py's top-level reasoning_effort block either. So neither top-level reasoning_effort nor extra_body.reasoning is forwarded to MiMo.

The conservative default is sensible in principle (avoids forwarding fields some upstreams reject), but I wasn't sure whether MiMo actually honors reasoning_effort server-side, so I tested it.

TL;DR — the conservative default is correct. MiMo accepts the field at the schema layer (rejects xhigh with Input should be 'low', 'medium' or 'high'), but the four levels (none / low / medium / high) produce statistically indistinguishable reasoning depth, accuracy, and reasoning content on a real AIME problem.

Fix Action

Fixed

PR fix notes

PR #17323: docs(run_agent): note xiaomi/MiMo empirical exclusion from reasoning whitelist (#17314)

Description (problem / solution / changelog)

Closes #17314 (the suggested follow-up #1 — inline doc).

Summary

#17314 empirically established that xiaomi/MiMo (mimo-v2.5-pro) accepts reasoning_effort at the schema layer but produces statistically indistinguishable reasoning depth, length, and accuracy across none / low / medium / high (4 efforts × N=5 on AIME 2025 II P2; Mann-Whitney U pairwise p > 0.1 on every pair; 100% accuracy on all 20 trials; identical solution paths in inspected reasoning_content).

The conservative default in _supports_reasoning_extra_body() (not whitelisting xiaomi) is therefore correct. Forwarding the field would just ship a no-op.

This PR adds a docstring note documenting the empirical test so a future PR doesn't "complete the list" by adding xiaomi without first re-verifying server-side behavior.

Changes

  • run_agent.py: extended _supports_reasoning_extra_body() docstring with an "Empirically excluded providers" section noting the xiaomi/MiMo result, the test methodology, and the link back to #17314.
  • No code-path changes. No behavior changes. AST-parses clean.

Not included (intentional)

The issue's optional follow-up #2 — surfacing a startup warning when agent.reasoning_effort is set in config.yaml for a provider that won't forward it — is broader-scope (touches every excluded provider, not just xiaomi) and is left as a separate optional follow-up so this PR stays a minimal, mergeable doc-only change.

Tests

  • python3 -c "import ast; ast.parse(open('run_agent.py').read())" clean.
  • Docs-only change; no functional test surface.

Changed files

  • run_agent.py (modified, +14/-0)


Context

run_agent.py:_supports_reasoning_extra_body() short-circuits to False for any base_url not on its whitelist (nousresearch.com, ai-gateway.vercel.sh, github models, openrouter with specific model prefixes, lmstudio with detected capability). The MiMo direct endpoint isn't on that list, and there's no is_xiaomi branch in agent/transports/chat_completions.py's top-level reasoning_effort block either. So neither top-level reasoning_effort nor extra_body.reasoning is forwarded to MiMo.

The conservative default is sensible in principle (avoids forwarding fields some upstreams reject), but I wasn't sure whether MiMo actually honors reasoning_effort server-side, so I tested it.

TL;DR — the conservative default is correct. MiMo accepts the field at the schema layer (rejects xhigh with Input should be 'low', 'medium' or 'high'), but the four levels (none / low / medium / high) produce statistically indistinguishable reasoning depth, accuracy, and reasoning content on a real AIME problem.

Setup

  • Model: mimo-v2.5-pro

  • Endpoint: MiMo OpenAI-compatible /v1/chat/completions, called directly with httpx (bypassing hermes — the goal was to characterize server-side behavior in isolation)

  • Prompt: AIME 2025 II Problem 2, sourced verbatim from HF yentinglin/aime_2025 row 16

    "Find the sum of all positive integers $n$ such that $n+2$ divides $3(n+3)(n^2+9)$."

    Ground truth: 49

  • 4 effort levels (none, low, medium, high) × N=5 trials each

  • temperature=0, max_tokens=16000

  • prompt_tokens_details.cached_tokens = 256/304 across all 20 trials, so prompt cache hit rate is constant — not a confound.

Results

effort    n    reasoning_tokens mean    std    range        accuracy
none      5    1948                     387    1285–2301    5/5
low       5    1700                     227    1392–1970    5/5
medium    5    1619                     162    1407–1789    5/5
high      5    1913                     623    1454–2871    5/5

Mann-Whitney U (two-tailed, normal approximation*) on reasoning_tokens distributions, all 6 pairs:

none vs low      p=0.117    indistinguishable
none vs medium   p=0.117    indistinguishable
none vs high     p=0.917    indistinguishable
low vs medium    p=0.602    indistinguishable
low vs high      p=0.917    indistinguishable
medium vs high   p=0.754    indistinguishable

*Normal approximation; exact p with N=5 vs N=5 would be slightly higher (~0.15 for the lowest U). No pair reaches p<0.05 either way. Within-effort std (especially high at 623) dwarfs between-effort mean spread (1619–1948 = 329).
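The normal-approximation arithmetic can be sketched in pure Python. The per-trial token counts are not published in the issue, so this sketch works from the U statistic alone. With 5 samples per group and no continuity or tie correction, U values of 5, 10, 11, and 12 give p-values that agree with the reported 0.117, 0.602, 0.754, and 0.917 to three decimals. That the author used exactly this variant of the test is an inference from that agreement, not something the issue states.

```python
import math


def mann_whitney_u(xs, ys) -> float:
    """Smaller of the two U statistics; ties count as half a win."""
    u_x = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    return min(u_x, len(xs) * len(ys) - u_x)


def normal_approx_p(u: float, n1: int, n2: int) -> float:
    """Two-tailed p-value, normal approximation, no continuity/tie correction."""
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = abs(u - mean) / sd
    return math.erfc(z / math.sqrt(2.0))


# p-values for candidate U statistics at n1 = n2 = 5.
for u in (5, 10, 11, 12):
    print(u, f"{normal_approx_p(u, 5, 5):.4f}")
```

Passing two lists of measured `reasoning_tokens` values to `mann_whitney_u` and feeding the result to `normal_approx_p` would repeat the pairwise comparison for any pair of effort levels.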

Reasoning content inspection

I read the first ~1500 chars of message.reasoning_content for high trial 0 (rt=2871, longest), high trial 1 (rt=1473, shortest of high), and low trial 2 (rt=1392, shortest of low). All three follow the same solution path within that window:

  1. Substitute d = n+2, so n = d-2
  2. Compute mod d: n+3 = d+1 ≡ 1 and n²+9 = d²-4d+13 ≡ 13, so 3(n+3)(n²+9) ≡ 3·1·13 = 39
  3. Conclude d | 39
  4. Enumerate divisors d ∈ {3, 13, 39}, sum n = 1+11+37 = 49
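The solution path above can be cross-checked by brute force. This is a small verification sketch, not part of the original experiment; the search bound of 10,000 is an arbitrary choice, and it is safe because the argument above shows n+2 must divide 39, so no solution exceeds 37.

```python
def divides(n: int) -> bool:
    """True when n+2 divides 3(n+3)(n^2+9)."""
    return (3 * (n + 3) * (n * n + 9)) % (n + 2) == 0


# Brute-force search over positive integers up to an arbitrary bound.
solutions = [n for n in range(1, 10_001) if divides(n)]
print(solutions, sum(solutions))  # [1, 11, 37] 49
```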

The extra tokens in high trial 0 look (from skimming) like verbose self-hedges ("But careful: Is that correct? ... However, there is a subtlety:") around the same argument structure rather than additional reasoning steps. The variance appears more consistent with sampling noise within a single reasoning policy than with effort-controlled depth.

(I also ran 5 trials per effort on a trivial retrievable prompt — 24-game with 8,8,3,3 — and saw the same indistinguishability there. So the flat result isn't an artifact of "AIME P2 was already easy enough for low effort"; it holds at both difficulty extremes I tested.)

Suggestion

The current behavior is correct — please don't add is_xiaomi to the top-level reasoning_effort block in agent/transports/chat_completions.py or extend _supports_reasoning_extra_body() to include xiaomi. Doing so would just ship a no-op field to MiMo.

Two minor follow-ups, no strong opinion:

  1. Inline comment in _supports_reasoning_extra_body() documenting that xiaomi/MiMo was empirically tested and excluded — would prevent a future PR from "completing" the list without checking server behavior.
  2. Optional UX nicety: if agent.reasoning_effort is set in config.yaml while model.provider is one for which the field won't be forwarded, surface a one-time warning at startup. Right now the config silently has no effect for these providers.

Caveats

  • N=5 per effort. Power to detect a small effect is limited; the no-effect conclusion rests on the combination of (a) flat means, (b) overlapping ranges, (c) identical solution paths in the inspected reasoning content, and (d) 100% accuracy across all efforts.
  • Single problem (AIME 2025 II P2) plus a trivial sanity-check prompt. Not tested on code, longer reasoning chains, or multi-turn agentic scenarios.
  • Tested only mimo-v2.5-pro; other MiMo variants not covered.
  • Server-side behavior may change. Tested 2026-04-29.
  • This experiment characterizes the MiMo endpoint in isolation. The hermes → MiMo end-to-end path was not exercised, since hermes already (correctly) doesn't forward the field.

extent analysis

TL;DR

The current behavior of not forwarding reasoning_effort to MiMo is correct and should not be changed.

Guidance

  • The empirical testing shows that MiMo does not distinguish between different reasoning_effort levels, so adding support for it would be unnecessary.
  • Consider adding an inline comment in _supports_reasoning_extra_body() to document that xiaomi/MiMo was tested and excluded to prevent future changes without verification.
  • Optionally, surface a one-time warning at startup if agent.reasoning_effort is set in config.yaml for a provider that doesn't support it.

Example

No code changes are necessary, but an example of the suggested inline comment could be:

def _supports_reasoning_extra_body(base_url):
    # xiaomi/MiMo was empirically tested and excluded, as it does not distinguish between reasoning effort levels
    whitelist = [...]

Notes

The conclusion is based on a limited number of trials (N=5 per effort) and a single problem, so further testing may be necessary to confirm the results.

Recommendation

Keep the current behavior of not forwarding reasoning_effort to MiMo; the empirical testing shows the field has no measurable effect.
