hermes - [Setup]: How do I get token usage tracking working with a local OpenAI-compatible provider (LocalAI) on v0.14.0?



What's Going Wrong?

What I'm trying to do

Get /usage to show real token counts when chatting against a LocalAI backend. Currently every counter reads 0 and Cost status reads unknown, even though the model responds correctly and the context bar displays the right ceiling (172K).

What I've tried

LocalAI requires stream_options: {include_usage: true} to emit the final usage chunk on streamed responses (standard OpenAI streaming spec behavior). I confirmed LocalAI honors this with a direct curl:

curl -sN https://localai.example.internal/v1/chat/completions \
  -H "Authorization: Bearer $LOCALAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"production","messages":[{"role":"user","content":"say hi"}],"stream":true,"stream_options":{"include_usage":true}}' | tail -2

Returns:

data: {"...","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":209,"total_tokens":221}}
data: [DONE]
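
For reference, the final chunk can also be picked out of a stream programmatically. The following is a minimal Python sketch (standard library only) that reads chat-completions SSE lines and keeps the last non-null usage object. The function name extract_usage and the sample stream are illustrative assumptions: the id and content values are made up, and only the usage numbers come from the curl output above. It is not a description of how Hermes parses streams.

```python
import json
from typing import Iterable, Optional


def extract_usage(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return the last non-null `usage` object seen in an SSE stream, or None."""
    usage = None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON chunk
        if isinstance(chunk, dict) and chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# Hypothetical stream: content chunks carry "usage": null, the last chunk carries the totals.
sample = [
    'data: {"id":"example","choices":[{"index":0,"delta":{"content":"hi"}}],"usage":null}',
    "",
    'data: {"id":"example","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":209,"total_tokens":221}}',
    "data: [DONE]",
]
print(extract_usage(sample))
```

If the request omits stream_options.include_usage, a stream like this would normally end without any usage-bearing chunk, and the function returns None.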

So LocalAI's side is fine. The question is how to get Hermes to send stream_options.include_usage on the main-model request.

I tried adding extra_body to my providers: block in two places (config below), but /usage still reads zero in both cases. The config loads without errors and hermes config show reflects the change.

Current config (/opt/data/config.yaml)

model:
  default: production
  provider: LocalAI
  context_length: 172000

providers:
  LocalAI:
    api_key: ${LOCALAI_API_KEY}
    base_url: https://localai.example.internal/v1
    extra_body:
      stream_options:
        include_usage: true
    models:
      production:
        context_length: 172000
        extra_body:
          stream_options:
            include_usage: true
      production-embedding:
        context_length: 8192
      production-tiny:
        context_length: 64000

I also tried extra_body.chat_template_kwargs, following the example in the providers documentation, with the same result: no effect on /usage.

What /usage shows

📊 Session Token Usage
────────────────────────────────────────
Model:                     production
Input tokens:                       0
[...all counters 0...]
API calls:                          1
Cost status:                 unknown
────────────────────────────────────────
Current context:  0 / 172,000 (0%)

LocalAI logs during a Hermes-initiated request show clean prompt processing (34,226 tokens through the slot) but no final usage chunk being emitted, which is consistent with include_usage being absent from the request body Hermes sends.
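
One way to check the request body directly, rather than inferring it from LocalAI logs, is a throwaway local endpoint that records whatever a client posts. The sketch below uses only the Python standard library. CaptureHandler, captured, and sends_include_usage are hypothetical names, and the stub reply is a placeholder rather than a real completion. The self-check at the bottom posts the same body as the curl command; it does not involve Hermes.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # request bodies seen by the stub, in arrival order


class CaptureHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        captured.append(json.loads(self.rfile.read(length) or b"{}"))
        # Placeholder non-streamed reply so the client gets an answer.
        reply = json.dumps({"choices": [], "usage": None}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # keep the console quiet


def sends_include_usage(body: dict) -> bool:
    """True if the recorded body asks for the final usage chunk."""
    return bool((body.get("stream_options") or {}).get("include_usage"))


server = HTTPServer(("127.0.0.1", 0), CaptureHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_address[1]}/v1"

# Self-check with the body from the curl command.
body = {
    "model": "production",
    "messages": [{"role": "user", "content": "say hi"}],
    "stream": True,
    "stream_options": {"include_usage": True},
}
request = urllib.request.Request(
    f"{base_url}/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
# Empty ProxyHandler: bypass any system proxy for the loopback call.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
opener.open(request).read()
server.shutdown()
server.server_close()

print(sends_include_usage(captured[-1]))
```

To inspect a real client, the server would be left running and the provider's base_url temporarily pointed at the printed address. A streaming client will probably reject the placeholder reply, but the body is recorded before the reply is sent, which is all this check needs.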

Environment

  • Hermes Agent v0.14.0 (2026.5.16), Docker, Up to date
  • Python 3.13.5, OpenAI SDK 2.24.0
  • LocalAI behind reverse proxy at https://localai.example.internal/v1
  • Model: custom production (llama-cpp backend, 172K context)

Questions

  1. Is extra_body under providers: supposed to be forwarded on the main-model request, or does the "auxiliary/compression/fallback only" note from the configuration docs also apply here?
  2. If it's not the right knob, is there a supported way to either (a) force stream_options.include_usage on streamed requests, or (b) disable HTTP-layer streaming so usage comes back in the non-streamed response body (which works fine on curl)?
  3. Is there an env var I'm missing (HERMES_*) that controls this?
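
To make options (a) and (b) in question 2 concrete, here is a small Python sketch of the two request bodies involved. build_chat_payload is a hypothetical helper, not a Hermes or OpenAI SDK function. It assumes the usual OpenAI-compatible behavior, where stream_options is only sent together with "stream": true and a non-streamed response carries usage in its body.

```python
import json
from typing import Any


def build_chat_payload(model: str, messages: list[dict[str, Any]], stream: bool) -> dict[str, Any]:
    """Build a chat-completions body that should yield usage in either mode."""
    payload: dict[str, Any] = {"model": model, "messages": messages, "stream": stream}
    if stream:
        # Option (a): ask the server to append a final usage chunk to the stream.
        payload["stream_options"] = {"include_usage": True}
    # Option (b): with stream=False nothing extra is added; usage is expected
    # in the regular response body.
    return payload


messages = [{"role": "user", "content": "say hi"}]
streamed = build_chat_payload("production", messages, stream=True)
plain = build_chat_payload("production", messages, stream=False)

print(json.dumps(streamed))
print(json.dumps(plain))
```

The streamed body matches the one used in the curl test above; the open question is which Hermes setting, if any, produces either shape on the main-model request.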

Happy to provide more logs / try patches. Thanks!

Steps Taken

See above.

Installation Method

Docker

Operating System

Ubuntu 24.04

Python Version

No response

Hermes Version

v0.14.0

Debug Report

Full Error Output

See above

What I've Already Tried

No response
