hermes - 💡(How to fix) Fix Bug Report: Session resume produces confusing output + startup hang when compression is unconfigured [1 pull requests]




Fix Action

Fixed


Bug Report: Session resume produces confusing output + startup hang when compression is unconfigured

Bug Summary

Two related issues when running hermes --resume <session-id>:

  1. Confusing output for empty sessions: Resuming a valid session that has zero message rows in the DB prints "Session found but has no messages. Starting fresh." and exits with code 1, even though the session exists in the sessions table. A session row with message_count = 0 and no rows in the messages table is a state that should either be impossible or clearly handled.

  2. ~5 second hang on every session start (including resume): _check_compression_model_feasibility() adds latency at startup even when no auxiliary compression provider is explicitly configured. The delay appears to come from the auxiliary client resolution chain, which iterates through all API-key providers in PROVIDER_REGISTRY and calls load_pool() for each, although the exact source has not been traced.


Environment

  • Hermes version: 0.14.0 (git commit 3034eee38)
  • Install: NousResearch/hermes-agent
  • Primary model: MiniMax-M2.7 via minimax-oauth provider
  • Config: compression.enabled: true (default), auxiliary.compression.provider: "auto" (default), no OPENROUTER_API_KEY
  • OS: Linux 6.18.7

Bug 1: Confusing "Session found but has no messages" on resume

Steps to reproduce

# 1. Identify a session with message_count=0 in sessions table
hermes sessions list
# Look for sessions with 0 user messages

# 2. Attempt to resume it
hermes --resume <session-id-with-zero-messages>
# Output: "Session found but has no messages. Starting fresh."
# Exit code: 1

Expected

Clear error: "Session has no messages and cannot be resumed. Delete it with hermes sessions delete <id>."

Actual

"Session found but has no messages. Starting fresh." — then exit 1, but no new session is started either (the CLI exits).

Root cause

In cli.py:4413-4415:

ChatConsole().print(
    f"[bold {_accent_hex()}]Session {_escape(self.session_id)} found but has no messages. Starting fresh.[/]"
)

The message says "starting fresh", but the code falls through and the CLI exits without creating a new session. The _init_agent() path continues to run_conversation(), but the conversation history is empty and the session's ended_at is reset to NULL, which leaves the session in an inconsistent state.

The underlying problem is in the DB: the sessions row has message_count = 0 and there are no rows in the messages table for that session. This can happen when:

  • Session was started but produced zero user messages before being interrupted/OOM-killed
  • Session created via --new but no message sent before hermes was killed

Fix options:

  1. Prevent the inconsistent state: When a session ends (or is interrupted), if message_count == 0, delete the session rather than leaving a zombie.
  2. Handle the inconsistent state at resume: If session_meta exists but restored is empty, either (a) auto-delete the orphan session and start fresh, or (b) don't allow resume and tell the user to delete it.
  3. Never allow message_count=0: Add a DB-level CHECK or a session-close hook that ensures at least one message row exists.

DB state at time of bug

-- The sessions table has the session:
SELECT id, message_count, title, ended_at, end_reason FROM sessions WHERE id = '20260517_063914_3b0bf6';
-- Result: id=..., message_count=0, title='', ended_at=<timestamp>, end_reason='normal'

-- But messages table is empty:
SELECT COUNT(*) FROM messages WHERE session_id = '20260517_063914_3b0bf6';
-- Result: 0

-- The problem state: a sessions row exists with message_count = 0 and no matching rows in messages
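
The two queries above can be combined into one check that lists every session without message rows. This is a minimal sketch, assuming a SQLite database and a hypothetical schema that contains only the columns named in this report (sessions.id, sessions.message_count, messages.session_id); the real Hermes schema may differ.

```python
import sqlite3


def find_empty_sessions(conn: sqlite3.Connection) -> list[str]:
    """Return ids of sessions that have no rows in the messages table."""
    rows = conn.execute(
        """
        SELECT s.id
        FROM sessions AS s
        LEFT JOIN messages AS m ON m.session_id = s.id
        GROUP BY s.id
        HAVING COUNT(m.session_id) = 0
        ORDER BY s.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def demo_empty_sessions() -> list[str]:
    # Hypothetical in-memory schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sessions (id TEXT PRIMARY KEY, message_count INTEGER);
        CREATE TABLE messages (session_id TEXT, content TEXT);
        INSERT INTO sessions VALUES ('a', 2), ('b', 0);
        INSERT INTO messages VALUES ('a', 'hi'), ('a', 'there');
        """
    )
    try:
        return find_empty_sessions(conn)
    finally:
        conn.close()
```

The LEFT JOIN keeps sessions that have no messages, and COUNT(m.session_id) ignores the NULLs those rows produce, so the count is derived from the messages table rather than trusted from sessions.message_count.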

Relevant code

cli.py:4399-4416 — where the empty session is detected and handled:

restored = self._session_db.get_messages_as_conversation(self.session_id)
if restored:
    # ... history loaded successfully ...
else:
    # This fires for empty sessions
    ChatConsole().print(
        f"[bold {_accent_hex()}]Session {_escape(self.session_id)} found but has no messages. Starting fresh.[/]"
    )
# Falls through — conversation_history is empty, session ended_at cleared

hermes_state.py:1162, where list_sessions_rich() sets message_count from an aggregate:

s["message_count"] = cursor.fetchone()[0]  # COUNT(*) from messages JOIN

Bug 2: _check_compression_model_feasibility() adds ~5s startup latency

Steps to reproduce

time hermes --resume <any-session-id>
# Observed: ~5 second delay before session prompt appears

Expected

Compression feasibility check completes in < 500ms when no auxiliary provider is explicitly configured.

Actual

~5 second hang on every session start (fresh, --continue, or --resume).

Root cause

run_agent.py:2496 calls _check_compression_model_feasibility() unconditionally during __init__. For users with compression.enabled: true (default) but no explicit auxiliary.compression.provider, the auxiliary client resolution chain runs through _resolve_auto():

  1. Step 1 tries the main provider (minimax-oauth in this case). If a client resolves, it returns quickly.
  2. Step 1 only succeeds when main_provider not in {"auto", ""} AND the client resolves. minimax-oauth has no direct branch in resolve_provider_client, so Step 1 does not succeed here.

The flow for minimax-oauth:

  • _resolve_auto() Step 1: main_provider = "minimax-oauth", main_model = "MiniMax-M2.7"
  • Condition: main_provider not in {"auto", ""} → True
  • resolve_provider_client("minimax-oauth", "MiniMax-M2.7", ...)
  • provider == "minimax-oauth" has no direct branch → _normalize_aux_provider("minimax-oauth") returns "minimax-oauth" → not in alias map
  • Falls through to logger.warning("resolve_provider_client: unknown provider %r", provider) → returns (None, None)

So Step 1 fails for minimax-oauth. Step 2 then iterates the fallback chain:

[("openrouter", _try_openrouter), ("nous", _try_nous), ("local/custom", _try_custom_endpoint), ("api-key", _resolve_api_key_provider)]

For each provider in the fallback chain, _select_pool_entry(provider) calls load_pool(provider), which reads and parses the credential pool JSON files. Providers that don't match (openrouter, nous, local/custom) are skipped quickly. _resolve_api_key_provider, however, iterates through the full PROVIDER_REGISTRY (20+ providers) and calls load_pool() for each. That is many file reads, but there is no network activity and therefore no network timeout.

So the ~5s delay must come from somewhere else, possibly a timeout in the OpenAI client creation or in the MiniMax API endpoint itself.

To recap the order: resolve_provider_client has no branch for minimax-oauth (_normalize_aux_provider returns it unchanged) and returns (None, None). _resolve_auto Step 2 then tries openrouter (no key), nous (not configured), local/custom (no env vars), and finally _resolve_api_key_provider.

In _resolve_api_key_provider, it iterates through PROVIDER_REGISTRY:

for provider_id, pconfig in PROVIDER_REGISTRY.items():
    if pconfig.auth_type != "api_key":
        continue
    pool_present, entry = _select_pool_entry(provider_id)
    if pool_present:
        api_key = _pool_runtime_api_key(entry)
        if not api_key:
            continue
        raw_base_url = _pool_runtime_base_url(entry, ...) or pconfig.inference_base_url
        ...
        _client = OpenAI(api_key=api_key, base_url=base_url, **extra)
        return _client, model

For minimax (auth_type = api_key), _select_pool_entry("minimax") finds the pool entry with the API key → client created → returns.

The delay could still come from _select_pool_entry doing something slow for providers that have no pool. Its implementation:

def _select_pool_entry(provider: str) -> Tuple[bool, Optional[Any]]:
    try:
        pool = load_pool(provider)  # Reads JSON from disk
    except Exception as exc:
        logger.debug("Auxiliary client: could not load pool for %s: %s", provider, exc)
        return False, None
    if not pool or not pool.has_credentials():
        return False, None
    try:
        return True, pool.select()
    except Exception as exc:
        logger.debug("Auxiliary client: could not select pool entry for %s: %s", provider, exc)
        return True, None

load_pool(provider) reads and parses JSON from ~/.hermes/credential_pool_<provider>.json. This is disk I/O, not network. Even if _resolve_api_key_provider calls load_pool() for all 20+ providers in PROVIDER_REGISTRY, that is only 20+ file reads, which should still be fast.

Another candidate is the OpenAI client doing work on creation. Here it is created with base_url = "https://api.minimax.io/anthropic/v1", but the OpenAI client constructor doesn't make network calls, so it does not probe the endpoint.

One suspect is the AnthropicAuxiliaryClient that wraps the OpenAI client. When _maybe_wrap_anthropic is called, it calls build_anthropic_client(), which imports the anthropic SDK and creates an Anthropic client with a 10s connect timeout. If the MiniMax endpoint is slow to respond to the initial connection, that could add latency. However, the MiniMax endpoint is called successfully in the main agent flow, so this is not a general connectivity issue.

Whatever the exact source, the fix is the same: _check_compression_model_feasibility() runs on every session start, even when no auxiliary provider is explicitly configured. The function is expensive and should be deferred or cached.

Fix recommendation

In _check_compression_model_feasibility(), add an early return when:

  1. compression.enabled: false (already there)
  2. No auxiliary provider configured AND no API key env vars for the main provider

Or better: cache the result of the feasibility check so it only runs once per session type, not every time.
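
Because every hermes invocation is a new process, a cache only helps startup if it survives on disk. The following is a rough sketch of that idea, assuming a JSON cache file keyed by a hash of the relevant config; the function names, the cache file and the TTL are all hypothetical and not part of Hermes.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Callable


def config_fingerprint(config: dict) -> str:
    """Stable hash of the config so a changed config invalidates the cache."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_feasibility(
    config: dict,
    check: Callable[[dict], bool],
    cache_path: Path,
    ttl_seconds: float = 86400.0,
) -> bool:
    """Run the expensive check only when no fresh cached result exists."""
    key = config_fingerprint(config)
    try:
        cache = json.loads(cache_path.read_text())
    except (OSError, ValueError):
        cache = {}  # Missing or corrupt cache file: start over.
    if not isinstance(cache, dict):
        cache = {}
    entry = cache.get(key)
    if isinstance(entry, dict) and time.time() - entry.get("checked_at", 0) < ttl_seconds:
        return bool(entry.get("feasible"))
    feasible = bool(check(config))
    cache[key] = {"feasible": feasible, "checked_at": time.time()}
    cache_path.write_text(json.dumps(cache))
    return feasible


def demo_cache_hits() -> int:
    """Return how many times the slow check ran across two lookups."""
    calls = []

    def slow_check(config: dict) -> bool:
        calls.append(config)
        return True

    config = {"provider": "auto", "model": ""}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "feasibility_cache.json"
        cached_feasibility(config, slow_check, path)
        cached_feasibility(config, slow_check, path)
    return len(calls)
```

Keying on the config hash means that changing the auxiliary provider or model forces a fresh check, while the TTL bounds how long a stale answer can persist if credentials change outside the config.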


Additional context

Config.yaml (relevant sections)

model:
  default: MiniMax-M2.7
  provider: minimax-oauth
  base_url: https://api.minimax.io/anthropic

compression:
  enabled: true
  threshold: 0.5
  target_ratio: 0.2

auxiliary:
  compression:
    provider: auto   # default — no explicit config
    model: ''
    base_url: ''
    api_key: ''
    timeout: 120

Providers in registry (from hermes_cli/auth.py)

PROVIDER_REGISTRY keys: nous, openai-codex, xai-oauth, qwen-oauth, google-gemini-cli,
lmstudio, copilot, copilot-acp, gemini, zai, kimi-coding, kimi-coding-cn, stepfun,
arcee, gmi, minimax, minimax-oauth, anthropic, alibaba, alibaba-coding-plan, minimax-cn,
deepseek, openrouter, ... (possibly more from plugins)

The _resolve_api_key_provider() loop iterates through all of these, calling load_pool() for each, before finding minimax which has the API key.

Suspicion: load_pool() for unknown providers may trigger network lookups

If load_pool() for a provider like lmstudio (default http://127.0.0.1:1234/v1) tried to connect to localhost to verify that the pool is alive, the failed connections could add up to the ~5s delay. lmstudio is an api_key provider, so the loop does visit it.

However, load_pool() does no such thing:

def load_pool(provider: str) -> CredentialPool:
    raw_entries = read_credential_pool(provider)  # Just reads JSON file
    entries = [PooledCredential.from_dict(provider, payload) for payload in raw_entries]
    # Seeding from singletons / env vars — no network calls
    if changed:
        write_credential_pool(...)
    return CredentialPool(provider, entries)

No network calls in load_pool(). So the delay is not from credential pool loading.

One remaining candidate is an Ollama-related probe: the file ~/.hermes/ollama_cloud_models_cache.json suggests some Ollama integration, and the auxiliary client path has not yet been searched for where Ollama is probed.

Another is this check in _resolve_auto:

if main_chain_label and _is_provider_unhealthy(main_chain_label):
    _log_skip_unhealthy(main_chain_label)
else:
    client, resolved = resolve_provider_client(...)

The _is_provider_unhealthy function might be doing something slow; whether it makes any network calls has not been checked.

The exact source of the 5s delay has not been definitively identified. The delay itself is observed, and the most likely cause is the _check_compression_model_feasibility() call chain, but more debugging is needed.
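
One way to narrow down the source is to time each step of the chain separately instead of reasoning about it. The sketch below is a generic timing helper, assuming nothing about Hermes internals; the step labels are placeholders, and the sleep stands in for whichever call turns out to be slow.

```python
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def timed(label: str, sink: list[tuple[str, float]]) -> Iterator[None]:
    """Record the wall-clock duration of the wrapped block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.append((label, time.perf_counter() - start))


def slowest_step(timings: list[tuple[str, float]]) -> str:
    """Return the label of the step that took longest."""
    return max(timings, key=lambda item: item[1])[0]


def demo_timing() -> str:
    timings: list[tuple[str, float]] = []
    with timed("load_pool_stub", timings):
        pass  # Placeholder for a fast disk read.
    with timed("client_build_stub", timings):
        time.sleep(0.05)  # Placeholder for a slow step.
    return slowest_step(timings)
```

Wrapping the calls inside _resolve_auto and _check_compression_model_feasibility() this way would show which one accounts for the delay. If the suspicion about importing the anthropic SDK is right, running the interpreter with `python -X importtime` would also show the import cost directly.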


Suggested fixes (summary)

For Bug 1 (empty session resume):

In cli.py:4413, when restored is empty but session_meta exists:

  • Don't print "starting fresh" (misleading)
  • Either: auto-delete the orphan session and start fresh, OR print a clear error and exit 1 with a helpful message
  • Add a session integrity check: if message_count == 0 and no message rows, delete the session as a cleanup step
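
The auto-delete variant could look roughly like the sketch below. It assumes SQLite and a hypothetical minimal schema built from the table and column names in this report; resume_session and ResumeOutcome are illustrative names, not existing Hermes functions.

```python
import sqlite3
from enum import Enum


class ResumeOutcome(Enum):
    RESUMED = "resumed"
    NOT_FOUND = "not_found"
    EMPTY_DELETED = "empty_deleted"


def resume_session(conn: sqlite3.Connection, session_id: str) -> tuple[ResumeOutcome, str]:
    """Decide what to do on resume and return (outcome, user-facing message)."""
    exists = conn.execute(
        "SELECT 1 FROM sessions WHERE id = ?", (session_id,)
    ).fetchone()
    if exists is None:
        return ResumeOutcome.NOT_FOUND, f"Session {session_id} not found."

    # Count real message rows instead of trusting sessions.message_count.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE session_id = ?", (session_id,)
    ).fetchone()
    if count == 0:
        conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))
        conn.commit()
        return (
            ResumeOutcome.EMPTY_DELETED,
            f"Session {session_id} had no messages and was removed. Starting a new session.",
        )
    return ResumeOutcome.RESUMED, f"Resumed session {session_id} ({count} messages)."


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical in-memory schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sessions (id TEXT PRIMARY KEY, message_count INTEGER);
        CREATE TABLE messages (session_id TEXT, content TEXT);
        INSERT INTO sessions VALUES ('full', 1), ('empty', 0);
        INSERT INTO messages VALUES ('full', 'hello');
        """
    )
    return conn
```

Returning an explicit outcome lets the CLI choose the exit code and the message in one place, so the printed text cannot promise "starting fresh" while the process exits.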

For Bug 2 (startup hang):

In run_agent.py:2496, add a fast path before calling _check_compression_model_feasibility():

if not self.compression_enabled:
    return
# Fast path: if auxiliary.compression.provider is "auto" AND no API key env vars 
# are set for any known provider, skip the expensive feasibility check

Or: make the feasibility check lazy (only run when compression is first triggered), not during session init.
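
A lazy check can be sketched with functools.cached_property, so that constructing the agent costs nothing and the check runs at most once, on first use. AgentSketch and its attributes are hypothetical stand-ins, not the real agent class in run_agent.py.

```python
from functools import cached_property
from typing import Callable


class AgentSketch:
    """Illustrative stand-in for an agent that defers the feasibility check."""

    def __init__(self, compression_enabled: bool, check: Callable[[], bool]) -> None:
        self.compression_enabled = compression_enabled
        self._check = check  # Stored, not called: __init__ stays fast.

    @cached_property
    def compression_feasible(self) -> bool:
        # Runs on first access only; the result is cached on the instance.
        if not self.compression_enabled:
            return False
        return bool(self._check())


def demo_lazy() -> tuple[int, int]:
    """Return (check calls after __init__, check calls after two accesses)."""
    calls = []

    def slow_check() -> bool:
        calls.append(1)
        return True

    agent = AgentSketch(True, slow_check)
    calls_after_init = len(calls)
    agent.compression_feasible
    agent.compression_feasible
    return calls_after_init, len(calls)
```

The trade-off is that any warning the check would have printed at startup now appears only when compression is first triggered, which may be mid-conversation.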


Questions for maintainers

  1. Is sessions.message_count ever supposed to be 0 with no message rows? Or is this always a bug/inconsistent state?
  2. What is the expected behavior when resuming a session that has no messages — should it start fresh, or refuse to resume?
  3. Is there a known 5s timeout somewhere in the MiniMax OAuth or API key path that could explain the startup delay?
  4. Is ollama_cloud_models_cache.json related to any network probing that could cause startup delays?
