hermes - ✅(Solved) Fix Context progress bar stays at 0% with OpenAI-compat local servers (mlx_vlm) [3 pull requests, 1 participant]

GitHub stats
NousResearch/hermes-agent#14686 · Fetched 2026-04-24 06:15:22
Comments: 0 · Participants: 1 · Timeline: 6 · Reactions: 0
Timeline (top): cross-referenced ×3, labeled ×3

Root Cause

agent/usage_pricing.py:532-534, in the normalize_usage() else-branch (the default path for non-Anthropic, non-Codex providers), reads only OpenAI-schema field names:

else:
    prompt_total = _to_int(getattr(response_usage, "prompt_tokens", 0))
    output_tokens = _to_int(getattr(response_usage, "completion_tokens", 0))

mlx_vlm.server (and some other local OpenAI-compatible servers) emit the Anthropic-style field names input_tokens / output_tokens in the usage object of their chat-completion responses, even when served over /v1/chat/completions. The OpenAI Python client preserves unknown fields as attributes (ConfigDict(extra="allow")), so getattr(response_usage, "input_tokens", 0) returns the real number — but the else branch never looks there.

Result: CanonicalUsage.input_tokens is always 0 → canonical_usage.prompt_tokens == 0 → the dict built at run_agent.py:9843 passes prompt_tokens: 0 to context_compressor.update_from_response(...) → last_prompt_tokens stays 0 → progress bar stays at 0.
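
To make the failure mode concrete, here is a minimal sketch that mimics the pre-fix read. It assumes a SimpleNamespace as a stand-in for the usage object the OpenAI client returns; read_openai_names_only is a hypothetical helper written for illustration, not a Hermes function.

```python
from types import SimpleNamespace

# Stand-in for the usage object of an mlx_vlm response (values taken from the issue).
usage = SimpleNamespace(input_tokens=11477, output_tokens=235, total_tokens=11712)


def read_openai_names_only(response_usage):
    """Hypothetical helper mirroring the pre-fix else-branch."""
    prompt_total = int(getattr(response_usage, "prompt_tokens", 0) or 0)
    output_tokens = int(getattr(response_usage, "completion_tokens", 0) or 0)
    return prompt_total, output_tokens


# The getattr defaults hide the missing fields, so both counts come back as zero.
print(read_openai_names_only(usage))
# The real prompt count is present on the object, but nothing reads it.
print(getattr(usage, "input_tokens", 0))
```

The default argument of getattr is what makes the bug silent: no exception is raised, the counts are simply zero.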

Fix Action

Fixed

PR fix notes

PR #14698: fix(agent): fallback to input_tokens/output_tokens for OpenAI-compat local servers

Description (problem / solution / changelog)

What does this PR do?

Fixes normalize_usage() in agent/usage_pricing.py so that OpenAI-compatible local servers (e.g. mlx_vlm.server) that emit Anthropic-style input_tokens / output_tokens instead of OpenAI-style prompt_tokens / completion_tokens are handled correctly.

Previously, the else branch only read prompt_tokens and completion_tokens. When a local server returned input_tokens/output_tokens, both values were read as 0. This caused:

  • Context progress bar stuck at 0% forever
  • Auto-compression thresholds never triggering
  • Long sessions eventually OOM-ing the context window

The fix adds a defensive fallback: try OpenAI-style names first, then fall back to Anthropic-style names. OpenAI-style takes priority when both are present.

Fixes #14686

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • agent/usage_pricing.py: normalize_usage() else-branch now reads input_tokens/output_tokens as a fallback
  • tests/agent/test_usage_pricing.py — added 2 regression tests:
    • test_normalize_usage_openai_compat_fallback_to_anthropic_names (mlx_vlm shape)
    • test_normalize_usage_openai_compat_prefers_openai_names_when_both_present (priority defense)
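
The two regression cases listed above might look roughly like the following sketch. The real tests call normalize_usage() and are not reproduced here; read_token_counts and this _to_int are stand-ins written for illustration, and the token values in the second test are arbitrary.

```python
from types import SimpleNamespace


def _to_int(value):
    """Stand-in for the real helper: coerce to int, treating bad input as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def read_token_counts(response_usage):
    """Hypothetical extraction of the patched else-branch."""
    prompt_total = _to_int(getattr(response_usage, "prompt_tokens", 0)) or _to_int(
        getattr(response_usage, "input_tokens", 0)
    )
    output_tokens = _to_int(getattr(response_usage, "completion_tokens", 0)) or _to_int(
        getattr(response_usage, "output_tokens", 0)
    )
    return prompt_total, output_tokens


def test_fallback_to_anthropic_names():
    # mlx_vlm shape: only Anthropic-style names are present.
    usage = SimpleNamespace(input_tokens=11477, output_tokens=235)
    assert read_token_counts(usage) == (11477, 235)


def test_prefers_openai_names_when_both_present():
    # Priority defense: OpenAI-style values win when both shapes appear.
    usage = SimpleNamespace(
        prompt_tokens=100, completion_tokens=20, input_tokens=999, output_tokens=999
    )
    assert read_token_counts(usage) == (100, 20)
```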

How to Test

  1. pytest tests/agent/test_usage_pricing.py -v — all 11 tests pass
  2. The fix can be exercised by pointing Hermes at any local OpenAI-compatible server that returns input_tokens/output_tokens in usage (e.g. mlx_vlm.server).

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass (focused: tests/agent/test_usage_pricing.py)
  • I've added tests for my changes
  • I've tested on my platform: Ubuntu 24.04 aarch64

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Changed files

  • agent/usage_pricing.py (modified, +10/-2)
  • tests/agent/test_usage_pricing.py (modified, +34/-0)

PR #14860: fix(usage): fall back to Anthropic token names for OpenAI-compat servers

Description (problem / solution / changelog)

Summary

  • fall back to input_tokens when chat-completions usage does not include prompt_tokens
  • fall back to output_tokens when chat-completions usage does not include completion_tokens
  • add regression coverage for Anthropic-named usage objects while keeping OpenAI names preferred

Testing

  • python3 -m pytest -o addopts= tests/agent/test_usage_pricing.py

Closes #14686

Changed files

  • agent/usage_pricing.py (modified, +4/-0)
  • tests/agent/test_usage_pricing.py (modified, +28/-0)

PR #14903: fix(context-compressor): accept Anthropic-style usage keys as fallback

Description (problem / solution / changelog)

update_from_response() previously only recognised OpenAI-style field names (prompt_tokens/completion_tokens). OpenAI-compatible local servers such as mlx_vlm and NVIDIA NIM can return Anthropic-shaped usage dicts (input_tokens/output_tokens), causing last_prompt_tokens and last_completion_tokens to silently stay at zero. This broke the context progress bar and disabled auto-compression for those providers.

OpenAI-style keys still take priority when both are present.

Closes #14687
Related: #14686

What does this PR do?

ContextCompressor.update_from_response() now falls back to Anthropic-style usage keys (input_tokens/output_tokens) when OpenAI-style keys (prompt_tokens/completion_tokens) are absent. This is the correct fix because
OpenAI-compatible servers like NVIDIA NIM and mlx_vlm can return Anthropic-shaped usage objects, causing token counts to silently read as zero — which breaks the context progress bar and prevents auto-compression from ever triggering.
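
A dict-based version of the same fallback could be sketched as below. CompressorSketch and _first_present are hypothetical names; the actual ContextCompressor has more state and its patch may be written differently. This sketch falls back on key absence, which is one reading of "when OpenAI-style keys are absent".

```python
def _first_present(usage, *keys):
    """Return the first key's value that is present and not None, else 0."""
    for key in keys:
        value = usage.get(key)
        if value is not None:
            return int(value)
    return 0


class CompressorSketch:
    """Hypothetical stand-in that tracks only the two counters discussed here."""

    def __init__(self):
        self.last_prompt_tokens = 0
        self.last_completion_tokens = 0

    def update_from_response(self, usage):
        # OpenAI-style keys are listed first, so they win when both are present.
        self.last_prompt_tokens = _first_present(usage, "prompt_tokens", "input_tokens")
        self.last_completion_tokens = _first_present(
            usage, "completion_tokens", "output_tokens"
        )


compressor = CompressorSketch()
compressor.update_from_response({"input_tokens": 11477, "output_tokens": 235})
print(compressor.last_prompt_tokens, compressor.last_completion_tokens)
```

Checking for presence rather than truthiness means an explicit prompt_tokens of 0 is respected instead of falling through. Whether that is preferable depends on how the servers in question behave.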

Related Issue

Fixes #14687
Related: #14686, companion to #14698

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • agent/context_compressor.py — update_from_response(): fall back to
    input_tokens/output_tokens when OpenAI-style keys are absent; OpenAI-style takes priority when both present
  • tests/agent/test_context_compressor.py — two regression tests added to TestUpdateFromResponse

How to Test

  1. Configure Hermes with an OpenAI-compatible local server that returns
    Anthropic-style usage fields (e.g. mlx_vlm or NVIDIA NIM via integrate.api.nvidia.com)
  2. Run a multi-turn session and observe the context progress bar — it should now reflect actual token usage instead of staying at 0%
  3. Run uv run --extra dev pytest tests/agent/test_context_compressor.py -v — all 52 tests pass including the two new regression tests

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15.2 Apple Silicon

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — updated
    docstring on update_from_response()
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or
    workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide
    (https://github.com/NousResearch/hermes-agent/blob/main/CONTRIBUTING.md#cross-platform-compatibility): N/A, pure Python dict key lookup
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

  • tests/agent/test_context_compressor.py::TestUpdateFromResponse::test_updates_fields PASSED
  • tests/agent/test_context_compressor.py::TestUpdateFromResponse::test_missing_fields_default_zero PASSED
  • tests/agent/test_context_compressor.py::TestUpdateFromResponse::test_anthropic_style_keys PASSED
  • tests/agent/test_context_compressor.py::TestUpdateFromResponse::test_openai_keys_take_priority PASSED

Changed files

  • agent/context_compressor.py (modified, +10/-3)
  • tests/agent/test_context_compressor.py (modified, +20/-0)

Code Example

Qwen3.5-35B-A3B-4bit │ 0/131.1K │ [░░░░░░░░░░] 0% │ 1h 20m │ ⏲ 31s

---

else:
    prompt_total = _to_int(getattr(response_usage, "prompt_tokens", 0))
    output_tokens = _to_int(getattr(response_usage, "completion_tokens", 0))

---

"usage": {
  "input_tokens": 11477,
  "output_tokens": 235,
  "total_tokens": 11712,
  "prompt_tps": 223.4,
  "generation_tps": 29.3
}

---

else:
    # OpenAI-style names first; fall back to Anthropic-style
    # (input_tokens/output_tokens). Local OpenAI-compatible servers like
    # mlx_vlm.server emit the Anthropic names in chat_completions responses,
    # and the OpenAI Python client preserves them as extra attributes.
    prompt_total = _to_int(getattr(response_usage, "prompt_tokens", 0)) or _to_int(
        getattr(response_usage, "input_tokens", 0)
    )
    output_tokens = _to_int(getattr(response_usage, "completion_tokens", 0)) or _to_int(
        getattr(response_usage, "output_tokens", 0)
    )

Context progress bar stays at 0% with OpenAI-compatible local servers (mlx_vlm, etc.)

Environment

  • Hermes: v2026.4.16-1165-gce089169 (cli __version__ = 0.10.0)
  • Model provider: custom (OpenAI-compatible), pointed at a local MLX inference server (mlx_vlm.server from Blaizzy/mlx-vlm)
  • Python: 3.11 venv

Symptom

The status line shows 0/131.1K │ [░░░░░░░░░░] 0% forever, even after many turns. Context-window pressure never registers and auto-compression thresholds never trigger.

Display format (reproduced from the CLI status line):

 ⚕ Qwen3.5-35B-A3B-4bit │ 0/131.1K │ [░░░░░░░░░░] 0% │ 1h 20m │ ⏲ 31s

Root cause

agent/usage_pricing.py:532-534, in the normalize_usage() else-branch (the default path for non-Anthropic, non-Codex providers), reads only OpenAI-schema field names:

else:
    prompt_total = _to_int(getattr(response_usage, "prompt_tokens", 0))
    output_tokens = _to_int(getattr(response_usage, "completion_tokens", 0))

mlx_vlm.server (and some other local OpenAI-compatible servers) emit the Anthropic-style field names input_tokens / output_tokens in the usage object of their chat-completion responses, even when served over /v1/chat/completions. The OpenAI Python client preserves unknown fields as attributes (ConfigDict(extra="allow")), so getattr(response_usage, "input_tokens", 0) returns the real number — but the else branch never looks there.

Result: CanonicalUsage.input_tokens is always 0 → canonical_usage.prompt_tokens == 0 → the dict built at run_agent.py:9843 passes prompt_tokens: 0 to context_compressor.update_from_response(...) → last_prompt_tokens stays 0 → progress bar stays at 0.

Reproduction

  1. Run any OpenAI-compatible local server that emits input_tokens/output_tokens instead of prompt_tokens/completion_tokens (e.g. mlx_vlm.server --model mlx-community/Qwen3.5-35B-A3B-4bit --port 8083).
  2. Configure Hermes with a custom provider pointing at that endpoint.
  3. Start any session and send messages — the progress bar stays at 0% regardless of conversation length.

Raw usage object from mlx_vlm for reference:

"usage": {
  "input_tokens": 11477,
  "output_tokens": 235,
  "total_tokens": 11712,
  "prompt_tps": 223.4,
  "generation_tps": 29.3
}
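
As a quick sanity check on this payload, the sketch below parses it and works out roughly what the status line would be expected to report for that turn. The context length of 131,072 tokens is an assumption inferred from the "131.1K" shown in the status line, not a value stated in the issue.

```python
import json

raw = """
{"usage": {"input_tokens": 11477, "output_tokens": 235, "total_tokens": 11712,
           "prompt_tps": 223.4, "generation_tps": 29.3}}
"""
usage = json.loads(raw)["usage"]

# The payload is internally consistent: input + output equals the reported total.
assert usage["input_tokens"] + usage["output_tokens"] == usage["total_tokens"]

# Neither OpenAI-style name appears, which is why the pre-fix read sees nothing.
missing = [name for name in ("prompt_tokens", "completion_tokens") if name not in usage]

CONTEXT_LENGTH = 131_072  # assumption: displayed as "131.1K" in the status line
percent_used = 100 * usage["input_tokens"] / CONTEXT_LENGTH
print(f"missing OpenAI-style fields: {missing}")
print(f"{usage['input_tokens']}/{CONTEXT_LENGTH} tokens, {percent_used:.1f}% of the window")
```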

Proposed fix

Add an Anthropic-name fallback to the else-branch of normalize_usage:

else:
    # OpenAI-style names first; fall back to Anthropic-style
    # (input_tokens/output_tokens). Local OpenAI-compatible servers like
    # mlx_vlm.server emit the Anthropic names in chat_completions responses,
    # and the OpenAI Python client preserves them as extra attributes.
    prompt_total = _to_int(getattr(response_usage, "prompt_tokens", 0)) or _to_int(
        getattr(response_usage, "input_tokens", 0)
    )
    output_tokens = _to_int(getattr(response_usage, "completion_tokens", 0)) or _to_int(
        getattr(response_usage, "output_tokens", 0)
    )

OpenAI-style fields take priority when both are present. No known server emits both, so this ordering is purely defensive. Unit-tested locally: both shapes normalize correctly with no regression.

Impact

Affects every Hermes user running mlx_vlm, and possibly users of other local OpenAI-compat servers that use Anthropic-style usage naming (still to be audited: mlx_lm, some vLLM variants, LM Studio builds). The affected user base is likely to grow as local-LLM Hermes deployments increase.

Downstream effects beyond display cosmetics:

  • context_compressor.should_compress() never triggers (uses last_prompt_tokens)
  • auto-compression never runs → long sessions eventually OOM the context window
  • Cost/session accounting shows 0 throughout
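
Why a zero prompt count disables compression can be seen in a threshold check like the sketch below. The issue only states that should_compress() uses last_prompt_tokens; the class name, the 0.85 threshold, and the 131,072 context length are assumptions made for illustration.

```python
class CompressionGateSketch:
    """Hypothetical stand-in for the compression trigger, not the Hermes class."""

    def __init__(self, context_length, threshold=0.85):
        self.context_length = context_length
        self.threshold = threshold  # assumed fraction of the window that triggers compression
        self.last_prompt_tokens = 0

    def should_compress(self):
        return self.last_prompt_tokens >= self.threshold * self.context_length


gate = CompressionGateSketch(context_length=131_072)

# With the bug, the counter never moves off zero, so the gate never opens.
gate.last_prompt_tokens = 0
print(gate.should_compress())

# With real counts flowing through, a long session eventually crosses the threshold.
gate.last_prompt_tokens = 120_000
print(gate.should_compress())
```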

extent analysis

TL;DR

The progress bar issue can be fixed by adding an Anthropic-name fallback to the normalize_usage function to handle local OpenAI-compatible servers that emit input_tokens and output_tokens instead of prompt_tokens and completion_tokens.

Guidance

  • Verify that the normalize_usage function is the root cause of the issue by checking the response_usage object for the presence of input_tokens and output_tokens fields.
  • Apply the proposed fix to the normalize_usage function to add the Anthropic-name fallback.
  • Test the fix with different local OpenAI-compatible servers to ensure compatibility.
  • Monitor the progress bar and context window pressure to ensure the fix resolves the issue.
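
For the verification step, a small diagnostic along these lines could report which naming scheme a given usage payload follows. detect_usage_schema is a hypothetical helper, not part of Hermes or the OpenAI client; it accepts either a dict or an attribute-style object.

```python
from types import SimpleNamespace


def detect_usage_schema(usage):
    """Classify a usage payload (dict or attribute object) by its token field names."""

    def has(name):
        if isinstance(usage, dict):
            value = usage.get(name)
        else:
            value = getattr(usage, name, None)
        return value is not None

    openai_style = has("prompt_tokens") or has("completion_tokens")
    anthropic_style = has("input_tokens") or has("output_tokens")
    if openai_style and anthropic_style:
        return "both"
    if openai_style:
        return "openai"
    if anthropic_style:
        return "anthropic"
    return "none"


print(detect_usage_schema(SimpleNamespace(input_tokens=11477, output_tokens=235)))
print(detect_usage_schema({"prompt_tokens": 100, "completion_tokens": 20}))
```

A result of "anthropic" from a server reached over /v1/chat/completions would point to the situation described in this issue.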

Example

The proposed fix can be applied as follows:

else:
    prompt_total = _to_int(getattr(response_usage, "prompt_tokens", 0)) or _to_int(
        getattr(response_usage, "input_tokens", 0)
    )
    output_tokens = _to_int(getattr(response_usage, "completion_tokens", 0)) or _to_int(
        getattr(response_usage, "output_tokens", 0)
    )

Notes

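This fix does not require that only one naming scheme be present: when both appear, the OpenAI-style values take priority. Because the fallback is written with `or`, an OpenAI-style value of 0 is treated the same as a missing field and falls through to the Anthropic-style name.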

Recommendation

Apply the proposed fix by adding the Anthropic-name fallback to the normalize_usage function. The issue author reports that it was unit-tested locally and that both usage shapes normalize correctly with no regression.
