hermes - ✅(Solved) Fix xiaomi/MiMo: reasoning_effort has no measurable server-side effect (4 efforts × N=5 on AIME 2025 II P2) [1 pull requests, 1 participants]

TatsuKo-Tsukimi · 2026-04-29T06:35:05Z

[hermes] run agent.py: supports reasoning extra body short-circuits to False for any base url not on its whitelist nousresearch.com, ai-gateway.vercel.sh, gith… `run_agent.py:_supports_reasoning_extra_body()` short-circuits to `False` for any `base_url` not on its whitelist (nousresearch.com, ai-gateway.vercel.sh, github models, openrouter with specific model prefixes, lmstudio with detected capability). The MiMo direct endpoint isn't on that list, and there's no `is_xiaomi` branch in `agent/transports/chat_completions.py`'s top-level `reasoning_effort` block either. So neither top-level `reasoning_effort` nor `extra_body.reasoning` is forwarded to MiMo. The conservative default is sensible in principle (avoids forwarding fields some upstreams reject), but I wasn't sure whether MiMo actually honors `reasoning_effort` server-side, so I tested it. **TL;DR — the conservative default is correct.** MiMo accepts the field at the schema layer (rejects `xhigh` with `Input should be 'low', 'medium' or 'high'`), but the four levels (`none` / `low` / `medium` / `high`) produce statistically indistinguishable reasoning depth, accuracy, and reasoning content on a real AIME problem. # PR #17323: docs(run_agent): note xiaomi/MiMo empirical exclusion from reasoning whitelist (#17314) - Repository: NousResearch/hermes-agent - Author: Sanjays2402 - State: open | merged: False - Link: https://github.com/NousResearch/hermes-agent/pull/17323 ## Description (problem / solution / changelog) Closes #17314 (the suggested follow-up #1 — inline doc). ## Summary #17314 empirically established that xiaomi/MiMo (`mimo-v2.5-pro`) accepts `reasoning_effort` at the schema layer but produces **statistically indistinguishable** reasoning depth, length, and accuracy across `none` / `low` / `medium` / `high` (4 efforts × N=5 on AIME 2025 II P2; Mann-Whitney U pairwise p > 0.1 on every pair; 100% accuracy on all 20 trials; identical solution paths in inspected `reasoning_content`). The conservative default in `_supports_reasoning_extra_body()` — *not* whitelisting xiaomi — is therefore correct. Forwarding the field would just ship a no-op. This PR adds a docstring note documenting the empirical test so a future PR doesn't "complete the list" by adding xiaomi without first re-verifying server-side behavior. ## Changes - `run_agent.py`: extended `_supports_reasoning_extra_body()` docstring with an "Empirically excluded providers" section noting the xiaomi/MiMo result, the test methodology, and the link back to #17314. - No code-path changes. No behavior changes. AST-parses clean. ## Not included (intentional) The issue's optional follow-up #2 — surfacing a startup warning when `agent.reasoning_effort` is set in `config.yaml` for a provider that won't forward it — is broader-scope (touches every excluded provider, not just xiaomi) and is left as a separate optional follow-up so this PR stays a minimal, mergeable doc-only change. ## Tests - `python3 -c "import ast; ast.parse(open('run_agent.py').read())"` clean. - Docs-only change; no functional test surface. ## Changed files - `run_agent.py` (modified, +14/-0) ## Fixed - Fixed by PR: docs(run_agent): note xiaomi/MiMo empirical exclusion from reasoning whitelist (#17314) (https://github.com/NousResearch/hermes-agent/pull/17323) ## Context `run_agent.py:_supports_reasoning_extra_body()` short-circuits to `False` for any `base_url` not on its whitelist (nousresearch.com, ai-gateway.vercel.sh, github models, openrouter with specific model prefixes, lmstudio with detected capability). The MiMo direct endpoint isn't on that list, and there's no `is_xiaomi` branch in `agent/transports/chat_completions.py`'s top-level `reasoning_effort` block either. So neither top-level `reasoning_effort` nor `extra_body.reasoning` is forwarded to MiMo. The conservative default is sensible in principle (avoids forwarding fields some upstreams reject), but I wasn't sure whether MiMo actually honors `reasoning_effort` server-side, so I tested it. **TL;DR — the conservative default is correct.** MiMo accepts the field at the schema layer (rejects `xhigh` with `Input should be 'low', 'medium' or 'high'`), but the four levels (`none` / `low` / `medium` / `high`) produce statistically indistinguishable reasoning depth, accuracy, and reasoning content on a real AIME problem. ## Setup - Model: `mimo-v2.5-pro` - Endpoint: MiMo OpenAI-compatible `/v1/chat/completions`, called directly with `httpx` (bypassing hermes — the goal was to characterize server-side behavior in isolation) - Prompt: AIME 2025 II Problem 2, sourced verbatim from HF [`yentinglin/aime_2025`](https://huggingface.co/datasets/yentinglin/aime_2025) row 16 > "Find the sum of all positive integers $n$ such that $n+2$ divides $3(n+3)(n^2+9)$." Ground truth: **49** - 4 effort levels (`none`, `low`, `medium`, `high`) × N=5 trials each - `temperature=0`, `max_tokens=16000` - `prompt_tokens_details.cached_tokens` = 256/304 across all 20 trials, so prompt cache hit rate i

hermes2026-04-29 06:35:05

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

NousResearch/hermes-agent#17314•Fetched 2026-04-30 06:48:27

View on GitHub

Comments

Participants

Timeline

Reactions

Author

TatsuKo-Tsukimi

Participants

TatsuKo-Tsukimi

Timeline (top)

labeled ×4cross-referenced ×1

run_agent.py:_supports_reasoning_extra_body() short-circuits to False for any base_url not on its whitelist (nousresearch.com, ai-gateway.vercel.sh, github models, openrouter with specific model prefixes, lmstudio with detected capability). The MiMo direct endpoint isn't on that list, and there's no is_xiaomi branch in agent/transports/chat_completions.py's top-level reasoning_effort block either. So neither top-level reasoning_effort nor extra_body.reasoning is forwarded to MiMo.

The conservative default is sensible in principle (avoids forwarding fields some upstreams reject), but I wasn't sure whether MiMo actually honors reasoning_effort server-side, so I tested it.

TL;DR — the conservative default is correct. MiMo accepts the field at the schema layer (rejects xhigh with Input should be 'low', 'medium' or 'high'), but the four levels (none / low / medium / high) produce statistically indistinguishable reasoning depth, accuracy, and reasoning content on a real AIME problem.

Root Cause

The conservative default is sensible in principle (avoids forwarding fields some upstreams reject), but I wasn't sure whether MiMo actually honors reasoning_effort server-side, so I tested it.

Fix Action

Fixed

Fixed by PR: docs(run_agent): note xiaomi/MiMo empirical exclusion from reasoning whitelist (#17314) (https://github.com/NousResearch/hermes-agent/pull/17323)

PR fix notes

PR #17323: docs(run_agent): note xiaomi/MiMo empirical exclusion from reasoning whitelist (#17314)

Repository: NousResearch/hermes-agent
Author: Sanjays2402
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17323

Description (problem / solution / changelog)

Closes #17314 (the suggested follow-up #1 — inline doc).

Summary

#17314 empirically established that xiaomi/MiMo (mimo-v2.5-pro) accepts reasoning_effort at the schema layer but produces statistically indistinguishable reasoning depth, length, and accuracy across none / low / medium / high (4 efforts × N=5 on AIME 2025 II P2; Mann-Whitney U pairwise p > 0.1 on every pair; 100% accuracy on all 20 trials; identical solution paths in inspected reasoning_content).

The conservative default in _supports_reasoning_extra_body() — not whitelisting xiaomi — is therefore correct. Forwarding the field would just ship a no-op.

This PR adds a docstring note documenting the empirical test so a future PR doesn't "complete the list" by adding xiaomi without first re-verifying server-side behavior.

Changes

run_agent.py: extended _supports_reasoning_extra_body() docstring with an "Empirically excluded providers" section noting the xiaomi/MiMo result, the test methodology, and the link back to #17314.
No code-path changes. No behavior changes. AST-parses clean.

Not included (intentional)

The issue's optional follow-up #2 — surfacing a startup warning when agent.reasoning_effort is set in config.yaml for a provider that won't forward it — is broader-scope (touches every excluded provider, not just xiaomi) and is left as a separate optional follow-up so this PR stays a minimal, mergeable doc-only change.

Tests

python3 -c "import ast; ast.parse(open('run_agent.py').read())" clean.
Docs-only change; no functional test surface.

Changed files

run_agent.py (modified, +14/-0)

Code Example

none vs low      p=0.117    indistinguishable
none vs medium   p=0.117    indistinguishable
none vs high     p=0.917    indistinguishable
low vs medium    p=0.602    indistinguishable
low vs high      p=0.917    indistinguishable
medium vs high   p=0.754    indistinguishable

RAW_BUFFERClick to expand / collapse

Context

The conservative default is sensible in principle (avoids forwarding fields some upstreams reject), but I wasn't sure whether MiMo actually honors reasoning_effort server-side, so I tested it.

Setup

Model: mimo-v2.5-pro
Endpoint: MiMo OpenAI-compatible /v1/chat/completions, called directly with httpx (bypassing hermes — the goal was to characterize server-side behavior in isolation)
Prompt: AIME 2025 II Problem 2, sourced verbatim from HF yentinglin/aime_2025 row 16

"Find the sum of all positive integers $n$ such that $n+2$ divides $3(n+3)(n^2+9)$."

Ground truth: 49
4 effort levels (none, low, medium, high) × N=5 trials each
temperature=0, max_tokens=16000
prompt_tokens_details.cached_tokens = 256/304 across all 20 trials, so prompt cache hit rate is constant — not a confound.

Results

effort	n	reasoning_tokens mean	std	range	accuracy
none	5	1948	387	1285–2301	5/5
low	5	1700	227	1392–1970	5/5
medium	5	1619	162	1407–1789	5/5
high	5	1913	623	1454–2871	5/5

Mann-Whitney U (two-tailed, normal approximation*) on reasoning_tokens distributions, all 6 pairs:

none vs low      p=0.117    indistinguishable
none vs medium   p=0.117    indistinguishable
none vs high     p=0.917    indistinguishable
low vs medium    p=0.602    indistinguishable
low vs high      p=0.917    indistinguishable
medium vs high   p=0.754    indistinguishable

*Normal approximation; exact p with N=5 vs N=5 would be slightly higher (~0.15 for the lowest U). No pair reaches p<0.05 either way. Within-effort std (especially high at 623) dwarfs between-effort mean spread (1619–1948 = 329).

Reasoning content inspection

I read the first ~1500 chars of message.reasoning_content for high trial 0 (rt=2871, longest), high trial 1 (rt=1473, shortest of high), and low trial 2 (rt=1392, shortest of low). All three follow the same solution path within that window:

Substitute d = n+2, so n = d-2
Compute mod d: (d+1) ≡ 1, (d²-4d+13) ≡ 13
Conclude d | 39
Enumerate divisors d ∈ {3, 13, 39}, sum n = 1+11+37 = 49

The extra tokens in high trial 0 look (from skimming) like verbose self-hedges ("But careful: Is that correct? ... However, there is a subtlety:") around the same argument structure rather than additional reasoning steps. The variance appears more consistent with sampling noise within a single reasoning policy than with effort-controlled depth.

(I also ran 5 trials per effort on a trivial retrievable prompt — 24-game with 8,8,3,3 — and saw the same indistinguishability there. So the flat result isn't an artifact of "AIME P2 was already easy enough for low effort"; it holds at both difficulty extremes I tested.)

Suggestion

The current behavior is correct — please don't add is_xiaomi to the top-level reasoning_effort block in agent/transports/chat_completions.py or extend _supports_reasoning_extra_body() to include xiaomi. Doing so would just ship a no-op field to MiMo.

Two minor follow-ups, no strong opinion:

Inline comment in _supports_reasoning_extra_body() documenting that xiaomi/MiMo was empirically tested and excluded — would prevent a future PR from "completing" the list without checking server behavior.
Optional UX nicety: if agent.reasoning_effort is set in config.yaml while model.provider is one for which the field won't be forwarded, surface a one-time warning at startup. Right now the config silently has no effect for these providers.

Caveats

N=5 per effort. Power to detect a small effect is limited; the no-effect conclusion rests on the combination of (a) flat means, (b) overlapping ranges, (c) identical solution paths in the inspected reasoning content, and (d) 100% accuracy across all efforts.
Single problem (AIME 2025 II P2) plus a trivial sanity-check prompt. Not tested on code, longer reasoning chains, or multi-turn agentic scenarios.
Tested only mimo-v2.5-pro; other MiMo variants not covered.
Server-side behavior may change. Tested 2026-04-29.
This experiment characterizes the MiMo endpoint in isolation. The hermes → MiMo end-to-end path was not exercised, since hermes already (correctly) doesn't forward the field.

extent analysis

TL;DR

The current behavior of not forwarding reasoning_effort to MiMo is correct and should not be changed.

Guidance

The empirical testing shows that MiMo does not distinguish between different reasoning_effort levels, so adding support for it would be unnecessary.
Consider adding an inline comment in _supports_reasoning_extra_body() to document that xiaomi/MiMo was tested and excluded to prevent future changes without verification.
Optionally, surface a one-time warning at startup if agent.reasoning_effort is set in config.yaml for a provider that doesn't support it.

Example

No code changes are necessary, but an example of the suggested inline comment could be:

def _supports_reasoning_extra_body(base_url):
    # xiaomi/MiMo was empirically tested and excluded, as it does not distinguish between reasoning effort levels
    whitelist = [...]

Notes

The conclusion is based on a limited number of trials (N=5 per effort) and a single problem, so further testing may be necessary to confirm the results.

Recommendation

Apply the current workaround, which is to not forward reasoning_effort to MiMo, as the empirical testing shows it has no effect.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#tensor shape #autograd error #model save/load #optimization #mixed precision

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

hermes - ✅(Solved) Fix xiaomi/MiMo: reasoning_effort has no measurable server-side effect (4 efforts × N=5 on AIME 2025 II P2) [1 pull requests, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fixed

PR fix notes

PR #17323: docs(run_agent): note xiaomi/MiMo empirical exclusion from reasoning whitelist (#17314)

Description (problem / solution / changelog)

Summary

Changes

Not included (intentional)

Tests

Changed files

Code Example

Context

Setup

Results

Reasoning content inspection

Suggestion

Caveats

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

hermes - ✅(Solved) Fix xiaomi/MiMo: reasoning_effort has no measurable server-side effect (4 efforts × N=5 on AIME 2025 II P2) [1 pull requests, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fixed

PR fix notes

PR #17323: docs(run_agent): note xiaomi/MiMo empirical exclusion from reasoning whitelist (#17314)

Description (problem / solution / changelog)

Summary

Changes

Not included (intentional)

Tests

Changed files

Code Example

Context

Setup

Results

Reasoning content inspection

Suggestion

Caveats

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING