hermes - ✅(Solved) Fix [Performance] Missing global conversation history state causes 314x slowdown on llama.cpp with Qwen3.5-27B [1 pull request, 3 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#13442 (fetched 2026-04-22 08:06:34)
Comments: 3 · Participants: 2 · Timeline events: 10 · Reactions: 0
Timeline (top): labeled ×4, commented ×3, cross-referenced ×2, renamed ×1

Fix Action

Fixed

PR fix notes

PR #4563: fix(gateway): strip internal fields from tool_calls on session reload to preserve KV cache

Description (problem / solution / changelog)

Fixes #4555

Problem

KV cache was fully invalidated on every new user message because session reload produced different tokens than the in-memory agentic loop. Three differences were identified.

Fix 1: Strip internal tool_call fields

call_id, response_item_id, and finish_reason are Hermes-internal fields that are not part of the OpenAI API spec. They are now stripped on session reload so that tool_calls are byte-identical to those produced in the agentic loop.

Fix 2: Normalize content whitespace

Trailing whitespace in assistant content is now stripped consistently in both the tool message path and the simple message path.

Result

Messages sent to API are now consistent between agentic loop iteration and session reload, allowing local backends (llama.cpp, lemonade) to reuse KV cache across turns.
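The two fixes described above can be sketched roughly as follows. This is not the actual diff to gateway/run.py: the function name `normalize_reloaded_message` and the message shape are assumptions made for illustration, and only the three field names come from the PR description.

```python
from typing import Any, Dict

# Field names taken from the PR description; everything else is illustrative.
INTERNAL_TOOL_CALL_FIELDS = ("call_id", "response_item_id", "finish_reason")


def normalize_reloaded_message(message: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a reloaded message with internal fields removed.

    Hypothetical helper: strips Hermes-internal keys from each tool call and
    trailing whitespace from assistant content, leaving the input untouched.
    """
    cleaned = dict(message)
    if cleaned.get("role") == "assistant":
        content = cleaned.get("content")
        if isinstance(content, str):
            cleaned["content"] = content.rstrip()
        tool_calls = cleaned.get("tool_calls")
        if tool_calls:
            cleaned["tool_calls"] = [
                {k: v for k, v in call.items() if k not in INTERNAL_TOOL_CALL_FIELDS}
                for call in tool_calls
            ]
    return cleaned
```

Applying the same normalization on both the reload path and the in-memory path is what keeps the serialized messages identical, so the backend sees the same leading tokens on the next turn.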

Changed files

  • gateway/run.py (modified, +22/-1)


[Performance] Missing global conversation history state causes 314x slowdown on llama.cpp with Qwen3.5-27B

System: RTX 3090 (24GB VRAM) + llama.cpp + Qwen3.5-27B-Q4_K_M + -np 1


🚨 Problem Summary

Every LLM request resends the system prompt and creates a new messages array, causing complete KV cache invalidation. This results in a 314x performance degradation (~122s observed versus ~0.39s expected with a warm cache) when using --keep -1 on llama.cpp backends.

Impact: Affects local LLM users running llama.cpp with a single-slot configuration (-np 1) who rely on prompt caching across turns (--cache-prompt, with --keep -1 to retain the full initial prompt).

Note: This analysis is based on testing with Qwen3.5-27B on llama.cpp. Other models/backends may behave differently.


🎯 Environment & Reproduction

Hardware/Software Stack:

  • GPU: NVIDIA RTX 3090 (24GB VRAM)
  • Backend: llama-cpp-turboquant:cuda (Docker)
  • Model: Qwen3.5-27B-Q4_K_M.gguf (~16GB)
  • Context: 32K-256K tokens
  • Configuration: -np 1 (single parallel slot)

llama.cpp Server Configuration:

docker run -d --name llama27b-turbo4 \
  --gpus all \
  -p 8089:8080 \
  -v /models:/models:ro \
  llama-cpp-turboquant:cuda \
    llama-server \
    -m /models/Qwen3.5-27B-Q4_K_M.gguf \
    --cache-prompt \
    --cache-reuse 1024 \
    --keep -1 \
    -ngl 99 \
    -c 262144 \
    --flash-attn on \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    --batch-size 2048 \
    --ubatch-size 512 \
    -np 1 \
    --host 0.0.0.0 \
    --port 8080

Reproduction Steps:

  1. Start llama.cpp server with --keep -1 and -np 1
  2. Run Hermes CLI with base_url=http://localhost:8089/v1
  3. Send first user message → ~122s response time
  4. Send second user message → ~122s response time (should be <1s with cache)
  5. Monitor llama.cpp logs for n_past values:
    docker logs llama27b-turbo4 2>&1 | grep "n_past"

Expected vs Actual:

Expected (with KV cache):

slot update_slots: id  0 | task XXXXX | n_past = 34576, slot.prompt.tokens.size() = 34775

Response time: <1s

Actual (cache invalid):

slot update_slots: id  0 | task XXXXX | n_past = 3, slot.prompt.tokens.size() = 298
erased invalidated context checkpoint (15 instances)

Response time: ~122s
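To compare the two cases over a longer log, the n_past lines can be parsed with a small script. The sketch below assumes the log format matches the two lines quoted above; treating n_past divided by the prompt size as the "reused fraction" is this sketch's interpretation of those fields, not something the llama.cpp documentation states.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches the fragment of the llama.cpp server log lines quoted in this issue.
SLOT_LINE = re.compile(r"n_past = (\d+), slot\.prompt\.tokens\.size\(\) = (\d+)")


def cache_reuse(lines: Iterable[str]) -> Iterator[Tuple[int, int, float]]:
    """Yield (n_past, prompt_size, n_past / prompt_size) for each matching line."""
    for line in lines:
        match = SLOT_LINE.search(line)
        if match:
            n_past, prompt_size = int(match.group(1)), int(match.group(2))
            ratio = n_past / prompt_size if prompt_size else 0.0
            yield n_past, prompt_size, ratio


sample = [
    "slot update_slots: id  0 | task XXXXX | n_past = 34576, slot.prompt.tokens.size() = 34775",
    "slot update_slots: id  0 | task XXXXX | n_past = 3, slot.prompt.tokens.size() = 298",
]
for n_past, size, ratio in cache_reuse(sample):
    print(f"n_past={n_past} prompt_tokens={size} reuse={ratio:.1%}")
```

The same function accepts any iterable of lines, so it could also be fed the output of the docker logs command from step 5.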


🔍 Root Cause Analysis

Problem Location: run_agent.py

Issue 1: No Persistent Global State (Line ~7783)

Current Code:

# Initialize conversation (copy to avoid mutating the caller's list)
messages = list(conversation_history) if conversation_history else []

Problem:

  • messages array is recreated from scratch on every LLM request
  • Only the initial conversation_history parameter is used
  • No persistent state between LLM requests within a single run_conversation() call

Issue 2: System Prompt Sent Every Request (Line ~8127)

Current Code:

if effective_system:
    api_messages = [{"role": "system", "content": effective_system}] + api_messages

Problem:

  • System prompt is prepended to api_messages on EVERY LLM request
  • Changes tokenization → KV cache invalidation
  • --keep -1 cannot work because prefix changes

Important Note: The code creates a copy called api_messages from messages (lines 8090-8113), then adds system prompt to api_messages. But api_messages is recreated fresh on every LLM request, so the system prompt is always added.
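For context, prefix-based prompt caching reuses the leading run of tokens that the previous prompt and the new prompt have in common. The toy sketch below illustrates why a difference near the start of the prompt leaves almost nothing to reuse, while a pure append reuses everything sent before. Whitespace-split words stand in for real tokens, and the prompts are invented for the example.

```python
from typing import List, Sequence


def common_prefix_len(previous: Sequence[str], current: Sequence[str]) -> int:
    """Number of leading tokens shared by both sequences."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


def toy_tokens(text: str) -> List[str]:
    # Stand-in tokenizer: a real backend uses the model's own tokenizer.
    return text.split()


first = toy_tokens("SYSTEM you are helpful USER list files")
appended = toy_tokens("SYSTEM you are helpful USER list files ASSISTANT done USER thanks")
changed = toy_tokens("SYSTEM you were helpful USER list files ASSISTANT done USER thanks")

print(common_prefix_len(first, appended))  # all 7 earlier tokens are reusable
print(common_prefix_len(first, changed))   # only the first 2 tokens match
```

A very small n_past such as 3 would be consistent with the second case, where the prompts diverge within the first few tokens.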

Issue 3: No Message Persistence (Lines ~7468, ~10021)

Current Code:

# Tool results
tool_msg = {"role": "tool", "content": function_result, "tool_call_id": tool_call.id}
messages.append(tool_msg)

# Assistant responses  
assistant_msg = {"role": "assistant", "content": final_response}
messages.append(assistant_msg)

Problem:

  • Messages appended to local messages list
  • List is discarded after LLM request returns
  • Next LLM request starts fresh → cache broken

📊 Performance Impact

Metrics:

| Metric | Current (Broken) | Expected (Fixed) | Degradation |
|---|---|---|---|
| Response Time | ~122s | ~0.39s | 314x slower |
| Cache Hit Rate | ~1% | ~99% | 98% worse |
| n_past (typical) | 3-10 | 30,000+ | Cache not used |
| Token Processing | 32K tokens/request | ~100 tokens/request | 320x waste |
| VRAM Usage | 21.9/24GB (89%) | ~20/24GB (83%) | Cache overflow |
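The headline ratios follow from the approximate figures reported above. A quick check, using only those rounded values (so the results are approximate as well):

```python
# Values copied from the reported metrics; all are approximate ("~") in the source.
observed_s = 122.0
expected_s = 0.39
tokens_per_request_now = 32_000
tokens_per_request_cached = 100

slowdown = observed_s / expected_s
token_waste = tokens_per_request_now / tokens_per_request_cached

print(f"slowdown ~ {slowdown:.0f}x")
print(f"token reprocessing ~ {token_waste:.0f}x")
```

The rounded inputs give roughly 313x, which is consistent with the reported 314x given that both timings are approximate.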

llama.cpp Log Evidence:

# Request 1 (first message)
slot 0 | n_past = 0, processing 32000 tokens... → 122s

# Request 2 (follow-up, should use cache)
slot 0 | n_past = 3, processing 32000 tokens... → 122s
erased invalidated context checkpoint (15 instances)

# Request 3 (same pattern)
slot 0 | n_past = 4, processing 32000 tokens... → 122s

🧪 What We Attempted

We attempted to implement a fix by adding global conversation history state to AIAgent. Here's what we tried:

Attempted Implementation:

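from typing import Any, Dict, List

class AIAgent:
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        # Existing constructor arguments elided
        # Add persistent state
        self._global_conversation_history: List[Dict[str, Any]] = []
        self._system_prompt_sent: bool = False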

Changes Made:

  1. Modified message initialization to use global state
  2. Added system prompt only on first LLM request
  3. Persisted messages to global history after each LLM request

Results:

  • ✅ Initial tests showed promise (n_past increased)
  • ❌ Integration with existing code caused errors
  • ❌ Not fully compatible with background review system
  • ❌ Had to revert all changes

See: IMPLEMENTATION-ATTEMPT-ANALYSIS.md for detailed attempt documentation.

Additional Issues Found:

  1. Speculative Decoding: Breaks KV cache on llama.cpp
  2. Cache Reuse: --cache-reuse 1024 too low for 32K context
  3. VRAM Pressure: 28GB needed, 24GB available (with speculative decoding)
  4. Parallel Requests: -np 1 forces single slot

🔗 Related Issues

Our research found several related but different issues:

Issue #4555: KV cache invalidation on new user message

  • Different: That issue is about session reload vs agentic loop message format
  • Our finding: Affects ALL LLM requests within a single run_conversation() call
  • Relationship: Our root cause may explain WHY #4555 happens

Issue #4319: KV cache invalidation on compression

  • Different: That issue is about compression triggering system prompt rebuild
  • Our finding: System prompt sent on EVERY LLM request (not just compression)
  • Relationship: Related concern, different scope

Issue #12089: Conversation-aware sliding cache breakpoints

  • Different: That's a proposal for future optimization
  • Our finding: Fundamental architecture issue preventing cache from working
  • Relationship: Our fix is prerequisite for #12089

Issue #8687: System prompt timestamp changes after compression

  • Related: Both about system prompt stability
  • Our finding: System prompt shouldn't change, but shouldn't be resent either
  • Relationship: Complementary issue

Issue #3353: Runtime metadata in cached system prompt

  • Related: System prompt caching optimization
  • Our finding: System prompt caching exists but is ineffective due to resend
  • Relationship: Our fix makes #3353 more impactful

🔗 Related PR

  • #4563: fix(gateway): strip internal fields from tool_calls on session reload
    • Author: ygd58
    • Status: OPEN (not merged yet)
    • Different scope: Fixes gateway → CLI handoff cache invalidation
    • Our issue: Fixes CLI agentic loop cache invalidation (within run_conversation)
    • Relationship: Complementary fixes - both needed for complete optimization
    • Combined impact: PR #4563 + Issue #13442 = full KV cache optimization

💡 Proposed Solution Approach

Core Concept: Persistent Conversation State

Add persistent state to AIAgent class to maintain conversation history across LLM requests.

Key Changes Needed:

  1. Message Initialization (line ~7783)
    • Use global state instead of recreating from parameter
    • Check if system prompt already exists

  2. System Prompt Injection (line ~8127)
    • Add system prompt only once (first LLM request)
    • Track with _system_prompt_sent flag

  3. Message Persistence (after each LLM response)
    • Append assistant/tool messages to global history
    • Ensure consistency with session DB

  4. Session Integration (save/load)
    • Persist global history to session DB
    • Restore on session reload
    • Handle context compression

  5. Background Review Compatibility (ORIGIN-ANALYSIS-REPORT.md)
    • Merge with origin/main's _spawn_background_review()
    • Ensure background review doesn't break cache

❓ Questions for Maintainers

We need maintainer guidance to properly implement this fix:

1. Design Intent:

  • Was the lack of global state a design decision?
  • If so, what are the trade-offs we're missing?

2. Background Review:

  • Origin/main has _spawn_background_review() that forks AIAgent
  • How should this interact with global conversation state?
  • Should background review have its own history?

3. Session Persistence:

  • Should global history be stored in session DB?
  • How to handle context compression with persistent state?

4. Model/Backend Compatibility:

  • Does this approach work with vLLM, Ollama, cloud providers?
  • Are there backends where this would break?
  • We only tested with llama.cpp + Qwen3.5-27B

5. Alternative Approaches:

  • Is there a better architectural solution?
  • Should we modify the message format instead?
  • What would you recommend?

📚 Research Documentation

We created detailed analysis documents. Available for review:

  1. LLM-MESSAGE-ARRAY-ISSUE-ANALYSIS.md - Initial problem identification
  2. ROOT-CAUSE-ANALYSIS.md - Root cause with 4 solution options
  3. ORIGIN-ANALYSIS-REPORT.md - Comparison with origin/main
  4. IMPLEMENTATION-ATTEMPT-ANALYSIS.md - Our attempted fix (unsuccessful)

Note: We can share these files if helpful for understanding our investigation.


🎯 Expected Impact

If this issue is resolved:

  • 314x performance improvement for llama.cpp users with -np 1
  • Proper KV cache utilization with --keep -1
  • Reduced token waste (system prompt sent once)
  • Better VRAM efficiency (cache doesn't overflow)
  • Improved UX for all local LLM users

User Impact: Affects anyone running:

  • llama.cpp with --keep -1 and -np 1
  • Long conversations (30K+ tokens)
  • Qwen3.5-27B or similar large models
  • RTX 3090 or similar 24GB VRAM GPUs

🏷️ Suggested Labels

  • performance
  • enhancement
  • backend:llama.cpp
  • KV-cache
  • priority:high
  • needs-maintainer-input

👤 Reporter

Levent Sunay (@lsunay1)
Date: 2026-04-21
System: RTX 3090 (24GB) + llama-cpp-turboquant + Qwen3.5-27B-Q4_K_M
Impact: 314x slowdown (~122s observed vs. ~0.39s expected)


Important Note: We attempted to implement a fix but were unable to complete it successfully due to integration issues with the existing codebase. We believe this is a fundamental architecture issue that needs maintainer input on the best approach.

We're sharing our detailed research and partial implementation attempt to:

  1. Document the problem we found
  2. Show our investigation and analysis
  3. Request maintainer guidance on proper implementation
  4. Offer to collaborate on testing and refinement

We're happy to help test any proposed solution! 🙏

extent analysis

TL;DR

To fix the 314x slowdown on llama.cpp with Qwen3.5-27B, implement a persistent global conversation history state in the AIAgent class to maintain conversation history across LLM requests.

Guidance

  1. Modify message initialization: Use global state instead of recreating the conversation history from the parameter on every LLM request.
  2. Add system prompt only once: Inject the system prompt only on the first LLM request and track it with a _system_prompt_sent flag to prevent cache invalidation.
  3. Persist messages to global history: Append assistant and tool messages to the global conversation history after each LLM response to ensure consistency with the session database.
  4. Integrate with session persistence: Store the global history in the session database and restore it on session reload, handling context compression appropriately.
  5. Ensure background review compatibility: Merge the global conversation state with the background review system, ensuring it doesn't break the cache.

Example

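from typing import Any, Dict, List, Optional


class AIAgent:
    def __init__(self) -> None:
        # Persistent state shared by every LLM request this agent makes
        self._global_conversation_history: List[Dict[str, Any]] = []
        self._system_prompt_sent: bool = False

    # Hypothetical helper for illustration; in Hermes this logic would live
    # inside run_conversation() rather than in a separate method.
    def build_api_messages(
        self, new_messages: List[Dict[str, Any]], effective_system: Optional[str]
    ) -> List[Dict[str, Any]]:
        # Add the system prompt to the history once, on the first LLM request,
        # so it keeps a fixed position and is never prepended again
        if effective_system and not self._system_prompt_sent:
            self._global_conversation_history.append(
                {"role": "system", "content": effective_system}
            )
            self._system_prompt_sent = True

        # Persist the new user/assistant/tool messages to the global history
        self._global_conversation_history.extend(new_messages)

        # Return a copy so callers cannot mutate the stored history
        return list(self._global_conversation_history)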

Notes

  • This solution requires careful integration with the existing codebase, particularly with the background review system and session persistence.
  • The proposed changes may have implications for model/backend compatibility and should be tested thoroughly.
  • Maintainer input is crucial for ensuring the solution aligns with the project's design intent and architecture.
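One way to test a candidate fix, independent of any particular backend, is to check that each request's message list only appends to the previous one. The sketch below is a hypothetical test helper (the name `first_divergence` and the sample payloads are not from the Hermes codebase) that reports the first message where two consecutive payloads differ.

```python
import json
from typing import Any, Dict, List

Message = Dict[str, Any]


def _canonical(message: Message) -> str:
    # Key order is normalized; string contents (including whitespace) are not.
    return json.dumps(message, sort_keys=True, ensure_ascii=False)


def first_divergence(previous: List[Message], current: List[Message]) -> int:
    """Index of the first message that differs, or -1 if `current` only appends."""
    for index, old in enumerate(previous):
        if index >= len(current) or _canonical(old) != _canonical(current[index]):
            return index
    return -1


turn_1 = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "List files"},
]
turn_2 = turn_1 + [
    {"role": "assistant", "content": "done"},
    {"role": "user", "content": "Thanks"},
]
print(first_divergence(turn_1, turn_2))  # -1: turn 2 only appends to turn 1
```

A result other than -1 points at the earliest message whose content changed between requests, which is where a prefix-based cache would stop matching.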

Recommendation

Apply the proposed workaround by implementing a persistent global conversation history state in the AIAgent class, as it addresses the root cause of the performance issue and has the potential to significantly improve the response time for llama.cpp users with -np 1.
