hermes - ✅(Solved) Fix [Performance] Missing global conversation history state causes 314x slowdown on llama.cpp with Qwen3.5-27B [1 pull requests, 3 comments, 2 participants]

hermes · 2026-04-21 09:23:08

GitHub stats

NousResearch/hermes-agent#13442•Fetched 2026-04-22 08:06:34

Author

lsunay

Participants

alt-glitch

lsunay

Timeline (top)

labeled ×4, commented ×3, cross-referenced ×2, renamed ×1

Fix Action

Fixed

Fixed by PR: fix(gateway): strip internal fields from tool_calls on session reload to preserve KV cache (https://github.com/NousResearch/hermes-agent/pull/4563)

PR fix notes

PR #4563: fix(gateway): strip internal fields from tool_calls on session reload to preserve KV cache

Repository: NousResearch/hermes-agent
Author: ygd58
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/4563

Description (problem / solution / changelog)

Fixes #4555

Problem

KV cache was fully invalidated on every new user message because session reload produced different tokens than the in-memory agentic loop. Three differences were identified.

Fix 1: Strip internal tool_call fields

call_id, response_item_id, and finish_reason are Hermes-internal fields that are not part of the OpenAI API spec. They are stripped on session reload so that tool_calls are byte-identical to those produced by the agentic loop.
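
A minimal sketch of this stripping step, assuming tool calls are stored as plain dicts in the OpenAI chat format. The helper name `strip_internal_tool_call_fields` and the sample values are hypothetical; only the three field names come from the PR description, and the actual change in gateway/run.py may be structured differently.

```python
from typing import Any, Dict, List

# Field names listed in the PR description as Hermes-internal.
INTERNAL_TOOL_CALL_FIELDS = ("call_id", "response_item_id", "finish_reason")


def strip_internal_tool_call_fields(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of `messages` with internal fields removed from every tool call."""
    cleaned = []
    for msg in messages:
        msg = dict(msg)  # shallow copy so the caller's dict is not mutated
        if msg.get("tool_calls"):
            msg["tool_calls"] = [
                {k: v for k, v in call.items() if k not in INTERNAL_TOOL_CALL_FIELDS}
                for call in msg["tool_calls"]
            ]
        cleaned.append(msg)
    return cleaned


# Hypothetical message as it might look after a session reload.
reloaded = [{
    "role": "assistant",
    "content": "",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "read_file", "arguments": "{}"},
        "call_id": "call_1",          # internal, should be dropped
        "response_item_id": "fc_1",   # internal, should be dropped
    }],
}]
print(strip_internal_tool_call_fields(reloaded)[0]["tool_calls"][0])
```

Building new dicts instead of deleting keys in place keeps the stored session data untouched, so the stripping only affects what is sent to the API.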

Fix 2: Normalize content whitespace

Trailing whitespace in assistant content is now stripped consistently in both the tool message path and the simple message path.
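
The normalization can be sketched as follows, assuming assistant content is a plain string. The helper name `normalize_assistant_content` is hypothetical and not taken from the PR.

```python
from typing import Any, Dict


def normalize_assistant_content(msg: Dict[str, Any]) -> Dict[str, Any]:
    """Strip trailing whitespace from assistant text content; leave other messages untouched."""
    if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
        return {**msg, "content": msg["content"].rstrip()}
    return msg


# The same reply as seen in the agentic loop (with trailing newlines) and after reload.
in_loop = normalize_assistant_content({"role": "assistant", "content": "Done.\n\n"})
reloaded = normalize_assistant_content({"role": "assistant", "content": "Done."})
print(in_loop == reloaded)
```

Applying the same function on both paths is what makes the two serializations match; stripping on only one path would reintroduce the difference.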

Result

Messages sent to the API are now consistent between agentic loop iterations and session reload, allowing local backends (llama.cpp, lemonade) to reuse the KV cache across turns.

Changed files

gateway/run.py (modified, +22/-1)


[Performance] Missing global conversation history state causes 314x slowdown on llama.cpp with Qwen3.5-27B

System: RTX 3090 (24GB VRAM) + llama.cpp + Qwen3.5-27B-Q4_K_M + -np 1

🚨 Problem Summary

Every LLM request resends the system prompt and creates a new messages array, causing complete KV cache invalidation. This results in a 314x performance degradation (~122s per response instead of the ~0.39s expected with a warm cache) when using --keep -1 on llama.cpp backends.

Impact: Affects local LLM users running llama.cpp with a single-slot configuration (-np 1) and --keep -1 (keep all tokens of the initial prompt when the context is shifted).

Note: This analysis is based on testing with Qwen3.5-27B on llama.cpp. Other models/backends may behave differently.

🎯 Environment & Reproduction

Hardware/Software Stack:

GPU: NVIDIA RTX 3090 (24GB VRAM)
Backend: llama-cpp-turboquant:cuda (Docker)
Model: Qwen3.5-27B-Q4_K_M.gguf (~16GB)
Context: 32K-256K tokens
Configuration: -np 1 (single parallel slot)

llama.cpp Server Configuration:

docker run -d --name llama27b-turbo4 \
  --gpus all \
  -p 8089:8080 \
  -v /models:/models:ro \
  llama-cpp-turboquant:cuda \
    llama-server \
    -m /models/Qwen3.5-27B-Q4_K_M.gguf \
    --cache-prompt \
    --cache-reuse 1024 \
    --keep -1 \
    -ngl 99 \
    -c 262144 \
    --flash-attn on \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    --batch-size 2048 \
    --ubatch-size 512 \
    -np 1 \
    --host 0.0.0.0 \
    --port 8080

Reproduction Steps:

1. Start llama.cpp server with --keep -1 and -np 1
2. Run Hermes CLI with base_url=http://localhost:8089/v1
3. Send first user message → ~122s response time
4. Send second user message → ~122s response time (should be <1s with cache)

Monitor llama.cpp logs for n_past values:

docker logs llama27b-turbo4 2>&1 | grep "n_past"

Expected vs Actual:

Expected (with KV cache):

slot update_slots: id  0 | task XXXXX | n_past = 34576, slot.prompt.tokens.size() = 34775

Response time: <1s

Actual (cache invalid):

slot update_slots: id  0 | task XXXXX | n_past = 3, slot.prompt.tokens.size() = 298
erased invalidated context checkpoint (15 instances)

Response time: ~122s
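
To compare runs without reading the log by eye, the n_past lines can be parsed with a short script. This is a minimal sketch, assuming the log line format shown above; the function name `cache_reuse` is hypothetical, and treating n_past divided by the prompt size as the reused fraction is a rough interpretation rather than an official llama.cpp metric.

```python
import re
from typing import Iterator, Tuple

# Matches lines such as:
# slot update_slots: id  0 | task 12 | n_past = 34576, slot.prompt.tokens.size() = 34775
N_PAST_RE = re.compile(r"n_past = (\d+), slot\.prompt\.tokens\.size\(\) = (\d+)")


def cache_reuse(log_text: str) -> Iterator[Tuple[int, int, float]]:
    """Yield (n_past, prompt_tokens, reused_fraction) for each matching log line."""
    for line in log_text.splitlines():
        m = N_PAST_RE.search(line)
        if m:
            n_past, total = int(m.group(1)), int(m.group(2))
            yield n_past, total, (n_past / total if total else 0.0)


# The two log lines from this report, with placeholder task ids.
sample = """\
slot update_slots: id  0 | task 1 | n_past = 34576, slot.prompt.tokens.size() = 34775
slot update_slots: id  0 | task 2 | n_past = 3, slot.prompt.tokens.size() = 298
"""
for n_past, total, frac in cache_reuse(sample):
    print(f"n_past={n_past} prompt={total} reused={frac:.1%}")
```

A reused fraction close to 1 suggests the cached prefix was kept, while a value near 0 matches the invalidation described in this report.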

🔍 Root Cause Analysis

Problem Location: `run_agent.py`

Issue 1: No Persistent Global State (Line ~7783)

Current Code:

# Initialize conversation (copy to avoid mutating the caller's list)
messages = list(conversation_history) if conversation_history else []

Problem:

- messages array is recreated from scratch on every LLM request
- Only the initial conversation_history parameter is used
- No persistent state between LLM requests within a single run_conversation() call

Issue 2: System Prompt Sent Every Request (Line ~8127)

Current Code:

if effective_system:
    api_messages = [{"role": "system", "content": effective_system}] + api_messages

Problem:

- System prompt is prepended to api_messages on EVERY LLM request
- Changes tokenization → KV cache invalidation
- --keep -1 cannot work because prefix changes

Important Note: The code builds api_messages as a copy of messages (lines 8090-8113) and then prepends the system prompt to that copy. Because api_messages is rebuilt on every LLM request, the system prompt is prepended every time.
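
Prompt caching mainly helps when consecutive requests share a long identical prefix, so one way to narrow down where the prefix diverges is to compare the message lists of two consecutive requests. The following is a minimal sketch, assuming the api_messages list of each request can be captured. The helper `shared_prefix_len` and the sample messages are hypothetical, and a message-level comparison is only a proxy for the token-level comparison the server performs.

```python
import json
from typing import Any, Dict, List


def shared_prefix_len(prev: List[Dict[str, Any]], curr: List[Dict[str, Any]]) -> int:
    """Count leading messages that serialize identically in two consecutive requests."""
    n = 0
    for a, b in zip(prev, curr):
        if json.dumps(a, sort_keys=True) != json.dumps(b, sort_keys=True):
            break
        n += 1
    return n


system = {"role": "system", "content": "You are a helpful agent."}
request_1 = [system, {"role": "user", "content": "hi"}]
request_2 = request_1 + [
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "list files"},
]
# Same history, but the system prompt text differs (for example an embedded timestamp).
request_2_changed = [
    {"role": "system", "content": "You are a helpful agent. 09:23"}
] + request_2[1:]

print(shared_prefix_len(request_1, request_2))
print(shared_prefix_len(request_1, request_2_changed))
```

If the count drops to zero between two requests, the first message differs and no later message can be reused as part of the prefix, which is the situation to look for when n_past stays in the single digits.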

Issue 3: No Message Persistence (Lines ~7468, ~10021)

Current Code:

# Tool results
tool_msg = {"role": "tool", "content": function_result, "tool_call_id": tool_call.id}
messages.append(tool_msg)

# Assistant responses  
assistant_msg = {"role": "assistant", "content": final_response}
messages.append(assistant_msg)

Problem:

- Messages appended to local messages list
- List is discarded after LLM request returns
- Next LLM request starts fresh → cache broken

📊 Performance Impact

Metrics:

| Metric | Current (Broken) | Expected (Fixed) | Degradation |
| --- | --- | --- | --- |
| Response Time | ~122s | ~0.39s | 314x slower |
| Cache Hit Rate | ~1% | ~99% | 98% worse |
| n_past (typical) | 3-10 | 30,000+ | Cache not used |
| Token Processing | 32K tokens/request | ~100 tokens/request | 320x waste |
| VRAM Usage | 21.9/24GB (89%) | ~20/24GB (83%) | Cache overflow |

llama.cpp Log Evidence:

# Request 1 (first message)
slot 0 | n_past = 0, processing 32000 tokens... → 122s

# Request 2 (follow-up, should use cache)
slot 0 | n_past = 3, processing 32000 tokens... → 122s
erased invalidated context checkpoint (15 instances)

# Request 3 (same pattern)
slot 0 | n_past = 4, processing 32000 tokens... → 122s

🧪 What We Attempted

We attempted to implement a fix by adding global conversation history state to AIAgent. Here's what we tried:

Attempted Implementation:

from typing import Any, Dict, List

class AIAgent:
    def __init__(self, *args: Any, **kwargs: Any) -> None:  # original parameters elided
        # Add persistent state
        self._global_conversation_history: List[Dict[str, Any]] = []
        self._system_prompt_sent: bool = False

Changes Made:

Modified message initialization to use global state
Added system prompt only on first LLM request
Persisted messages to global history after each LLM request

Results:

✅ Initial tests showed promise (n_past increased)
❌ Integration with existing code caused errors
❌ Not fully compatible with background review system
❌ Had to revert all changes

See: IMPLEMENTATION-ATTEMPT-ANALYSIS.md for detailed attempt documentation.

Additional Issues Found:

Speculative Decoding: Breaks KV cache on llama.cpp
Cache Reuse: --cache-reuse 1024 too low for 32K context
VRAM Pressure: 28GB needed, 24GB available (with speculative decoding)
Parallel Requests: -np 1 forces single slot

🔗 Related Issues

Our research found several related but different issues:

Issue #4555: KV cache invalidation on new user message

Different: That issue is about session reload vs agentic loop message format
Our finding: Affects ALL LLM requests within a single run_conversation() call
Relationship: Our root cause may explain WHY #4555 happens

Issue #4319: KV cache invalidation on compression

Different: That issue is about compression triggering system prompt rebuild
Our finding: System prompt sent on EVERY LLM request (not just compression)
Relationship: Related concern, different scope

Issue #12089: Conversation-aware sliding cache breakpoints

Different: That's a proposal for future optimization
Our finding: Fundamental architecture issue preventing cache from working
Relationship: Our fix is prerequisite for #12089

Issue #8687: System prompt timestamp changes after compression

Related: Both about system prompt stability
Our finding: System prompt shouldn't change, but shouldn't be resent either
Relationship: Complementary issue

Issue #3353: Runtime metadata in cached system prompt

Related: System prompt caching optimization
Our finding: System prompt caching exists but is ineffective due to resend
Relationship: Our fix makes #3353 more impactful

🔗 Related PR

#4563: fix(gateway): strip internal fields from tool_calls on session reload
- Author: ygd58
- Status: OPEN (not merged yet)
- Different scope: Fixes gateway → CLI handoff cache invalidation
- Our issue: Fixes CLI agentic loop cache invalidation (within run_conversation)
- Relationship: Complementary fixes - both needed for complete optimization
- Combined impact: PR #4563 + Issue #13442 = full KV cache optimization

💡 Proposed Solution Approach

Core Concept: Persistent Conversation State

Add persistent state to AIAgent class to maintain conversation history across LLM requests.

Key Changes Needed:

1. Message Initialization (line ~7783)
   - Use global state instead of recreating from parameter
   - Check if system prompt already exists
2. System Prompt Injection (line ~8127)
   - Add system prompt only once (first LLM request)
   - Track with _system_prompt_sent flag
3. Message Persistence (after each LLM response)
   - Append assistant/tool messages to global history
   - Ensure consistency with session DB
4. Session Integration (save/load)
   - Persist global history to session DB
   - Restore on session reload
   - Handle context compression
5. Background Review Compatibility (ORIGIN-ANALYSIS-REPORT.md)
   - Merge with origin/main's _spawn_background_review()
   - Ensure background review doesn't break cache

❓ Questions for Maintainers

We need maintainer guidance to properly implement this fix:

1. Design Intent:

Was the lack of global state a design decision?
If so, what are the trade-offs we're missing?

2. Background Review:

Origin/main has _spawn_background_review() that forks AIAgent
How should this interact with global conversation state?
Should background review have its own history?

3. Session Persistence:

Should global history be stored in session DB?
How to handle context compression with persistent state?

4. Model/Backend Compatibility:

Does this approach work with vLLM, Ollama, cloud providers?
Are there backends where this would break?
We only tested with llama.cpp + Qwen3.5-27B

5. Alternative Approaches:

Is there a better architectural solution?
Should we modify the message format instead?
What would you recommend?

📚 Research Documentation

We created detailed analysis documents. Available for review:

LLM-MESSAGE-ARRAY-ISSUE-ANALYSIS.md - Initial problem identification
ROOT-CAUSE-ANALYSIS.md - Root cause with 4 solution options
ORIGIN-ANALYSIS-REPORT.md - Comparison with origin/main
IMPLEMENTATION-ATTEMPT-ANALYSIS.md - Our attempted fix (unsuccessful)

Note: We can share these files if helpful for understanding our investigation.

🎯 Expected Impact

If this issue is resolved:

✅ 314x performance improvement for llama.cpp users with -np 1
✅ Proper KV cache utilization with --keep -1
✅ Reduced token waste (system prompt sent once)
✅ Better VRAM efficiency (cache doesn't overflow)
✅ Improved UX for all local LLM users

User Impact: Affects anyone running:

llama.cpp with --keep -1 and -np 1
Long conversations (30K+ tokens)
Qwen3.5-27B or similar large models
RTX 3090 or similar 24GB VRAM GPUs

🏷️ Suggested Labels

performance
enhancement
backend:llama.cpp
KV-cache
priority:high
needs-maintainer-input

👤 Reporter

Levent Sunay (@lsunay1)
Date: 2026-04-21
System: RTX 3090 (24GB) + llama-cpp-turboquant + Qwen3.5-27B-Q4_K_M
Impact: 314x slowdown (~122s observed vs. ~0.39s expected)

Important Note: We attempted to implement a fix but were unable to complete it successfully due to integration issues with the existing codebase. We believe this is a fundamental architecture issue that needs maintainer input on the best approach.

We're sharing our detailed research and partial implementation attempt to:

Document the problem we found
Show our investigation and analysis
Request maintainer guidance on proper implementation
Offer to collaborate on testing and refinement

We're happy to help test any proposed solution! 🙏

extent analysis

TL;DR

To fix the 314x slowdown on llama.cpp with Qwen3.5-27B, implement a persistent global conversation history state in the AIAgent class to maintain conversation history across LLM requests.

Guidance

Modify message initialization: Use global state instead of recreating the conversation history from the parameter on every LLM request.
Add system prompt only once: Inject the system prompt only on the first LLM request and track it with a _system_prompt_sent flag to prevent cache invalidation.
Persist messages to global history: Append assistant and tool messages to the global conversation history after each LLM response to ensure consistency with the session database.
Integrate with session persistence: Store the global history in the session database and restore it on session reload, handling context compression appropriately.
Ensure background review compatibility: Merge the global conversation state with the background review system, ensuring it doesn't break the cache.

Example

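from typing import Any, Dict, List, Optional


class AIAgent:
    def __init__(self, *args: Any, **kwargs: Any) -> None:  # original parameters elided
        # Add persistent state
        self._global_conversation_history: List[Dict[str, Any]] = []
        self._system_prompt_sent: bool = False

    # Hypothetical helper (name not from the issue) that groups the three steps
    def _build_api_messages(
        self,
        new_messages: List[Dict[str, Any]],
        effective_system: Optional[str] = None,
    ) -> List[Dict[str, Any]]:
        # Initialize conversation using global state, then add this turn's messages
        messages = self._global_conversation_history.copy()
        api_messages = messages + list(new_messages)

        # Add system prompt only on the first LLM request; afterwards it is
        # already the first entry of the persisted history
        if effective_system and not self._system_prompt_sent:
            api_messages = [{"role": "system", "content": effective_system}] + api_messages
            self._system_prompt_sent = True

        # Persist messages to global history (replace rather than extend,
        # since api_messages already contains the previous history)
        self._global_conversation_history = api_messages
        return api_messages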

Notes

This solution requires careful integration with the existing codebase, particularly with the background review system and session persistence.
The proposed changes may have implications for model/backend compatibility and should be tested thoroughly.
Maintainer input is crucial for ensuring the solution aligns with the project's design intent and architecture.

Recommendation

Apply the proposed workaround by implementing a persistent global conversation history state in the AIAgent class, as it addresses the root cause of the performance issue and has the potential to significantly improve the response time for llama.cpp users with -np 1.

#api #optimization #conversation history #LLM response #integration issue
