
openclaw - [Bug]: Session stuck after gateway restart during tool-use loop, stale lock not recovered [1 comment, 2 participants]

openclaw, 2026-04-23 09:26:54


GitHub stats

openclaw/openclaw#70555 • Fetched 2026-04-24 05:56:34

Author: ihedy

Participants: ihedy, rafiki270

Timeline (top): labeled ×2, commented ×1, cross-referenced ×1

After a gateway restart triggered by a config change, a session in an active tool-use loop is not resumed — the session remains permanently stuck in "running" state with tool results persisted but no follow-up assistant message generated.


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

1. The agent is in a tool-use loop with parallel web_search calls.
2. While tool results are returning, a config change (plugins.entries.tavily.enabled) triggers a gateway restart.
3. The gateway restarts and cleans up the stale session lock as dead-pid: removed stale session lock: ...jsonl.lock (dead-pid)
4. Tool results are persisted to the session transcript.
5. No follow-up assistant message is generated; the session remains in status: "running" indefinitely.
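The (dead-pid) tag in step 3 suggests the lock was judged stale because the process that owned it no longer exists. A minimal sketch of that kind of check follows. It assumes, hypothetically, that the lock file stores the owner's PID as plain text; the issue does not show OpenClaw's actual lock format or cleanup code, and the function names here are illustrative only.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_if_stale(lock_path: Path) -> bool:
    """Delete the lock when its recorded PID is dead; return True if removed.

    Assumption: the lock file contains only the owner's PID as text.
    """
    try:
        pid = int(lock_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # missing or unparseable lock: leave it alone
    if pid_is_alive(pid):
        return False
    lock_path.unlink(missing_ok=True)
    return True
```

Removing the lock this way only frees the session file. As the report shows, it does nothing to restart the interrupted run, so the session status can still read "running" afterwards.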

Expected behavior

After gateway restart, the session should:

1. Detect that tool results are available but no assistant follow-up exists
2. Resume the tool-use loop by sending the tool results + context to the model
3. Generate the assistant's follow-up response
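The detection step can be sketched as a scan of the session transcript for a trailing tool result. This is only an illustration: it assumes each line of the .jsonl transcript is a JSON object with a "role" key set to "user", "assistant", or "tool", which is a hypothetical schema and may not match what OpenClaw writes.

```python
import json
from typing import Iterable


def needs_resume(transcript_lines: Iterable[str]) -> bool:
    """True when the last recorded message is a tool result with no assistant reply after it.

    Assumption: every non-empty line is a JSON object with a "role" key whose
    value is "user", "assistant", or "tool". The real schema may differ.
    """
    last_role = None
    for line in transcript_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        last_role = json.loads(line).get("role")
    return last_role == "tool"
```

A session for which this returns True matches the state described in the report: both web_search results were written, but nothing followed them.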

Actual behavior

The session remains in "running" state with stale data. No recovery occurs. The session is effectively dead until manually reset. The user sees the message "I'm currently busy. Your new message has been queued, and I'll continue as soon as the previous one finishes." (original: "当前还在忙，你的新消息已经排队，上一条完成后我马上继续。") and cannot get a response.

OpenClaw version

2026.4.14 (323493f)

Operating system

macOS 25.3.0 (arm64)

Install method

npm global

Model

bailian/qwen3.6-plus

Provider / routing chain

openclaw -> DashScope compatible-mode API -> bailian/qwen3.6-plus

Additional provider/model setup details

Channel: dingtalk-connector (DM session)
Model accessed via DashScope compatible-mode API.

Logs, screenshots, and evidence

Gateway log snippet:
16:37:51.889 [reload] config change requires gateway restart (plugins.entries.tavily.enabled) — deferring until 4 operation(s), 2 reply(ies), 2 embedded run(s) complete
16:38:42.780 [gateway] removed stale session lock: /Users/maidou/.openclaw/agents/main/sessions/0a7cc9e9-...jsonl.lock (dead-pid)

Session state after restart:
{
  "status": "running",
  "abortedLastRun": false,
  "updatedAt": 1776846975307,
  "totalTokens": 89694
}

Timeline:
- 16:36:50 — Assistant sends last tool calls (2 × web_search)
- 16:37:50 — Both tool results returned
- 16:37:51 — Config change detected → gateway restart queued
- 16:38:42 — Gateway completes restart, stale lock removed
- After that — No further assistant messages (stuck 18+ hours)
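The session state above is enough to spot the stall mechanically: the status is still "running" while updatedAt stops advancing. A rough sketch of such a check, assuming updatedAt is Unix time in milliseconds (the 13-digit value suggests so, but the issue does not state it) and using a hypothetical 10-minute idle limit:

```python
from datetime import datetime, timedelta, timezone


def is_stalled(session: dict, now: datetime, max_idle: timedelta) -> bool:
    """True when a "running" session has not been updated within max_idle.

    Assumption: "updatedAt" is Unix time in milliseconds.
    """
    if session.get("status") != "running":
        return False
    updated = datetime.fromtimestamp(session["updatedAt"] / 1000, tz=timezone.utc)
    return now - updated > max_idle


# Session state copied from the report.
session = {"status": "running", "abortedLastRun": False,
           "updatedAt": 1776846975307, "totalTokens": 89694}

# Evaluate 18 hours after the recorded update, matching the reported stuck time.
later = datetime.fromtimestamp(1776846975307 / 1000, tz=timezone.utc) + timedelta(hours=18)
print(is_stalled(session, later, timedelta(minutes=10)))  # True
```

Note that abortedLastRun is false in the reported state, so a check keyed on that flag alone would not catch this case.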

Impact and severity

Affected: DingTalk DM users on OpenClaw 2026.4.14
Severity: High. Blocks all interaction in the affected session; the agent appears permanently busy.
Frequency: Reproduced in this instance; occurs when a gateway restart happens during an active tool-use loop.
Consequence: Users see "I'm currently busy. Your new message has been queued, and I'll continue as soon as the previous one finishes." (original: "当前还在忙，你的新消息已经排队，上一条完成后我马上继续。"). Session context accumulates tokens without progress, and work is lost unless the session is manually reset.

Additional information

Suggested fix:

1. On startup, scan all "running" sessions for incomplete tool-use loops
2. If tool results exist but no follow-up assistant message, resume the loop
3. Add a timeout mechanism: if a session has been "running" for > N minutes without progress, flag it for recovery or notify the user
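The three suggestions could be combined into a single startup pass that decides, per session, whether to resume or flag it. The sketch below is an assumption-laden illustration, not OpenClaw code: SessionSnapshot, its fields, and plan_recovery are hypothetical names, and the idle limit stands in for the "N minutes" above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SessionSnapshot:
    # All fields are hypothetical stand-ins for whatever the gateway stores.
    session_id: str
    status: str              # e.g. "running"
    last_message_role: str   # role of the final transcript entry
    idle_minutes: float      # time since the last recorded update


def plan_recovery(sessions: List[SessionSnapshot], max_idle_minutes: float) -> Dict[str, str]:
    """Map session_id to "resume" or "flag" for sessions that need attention."""
    plan: Dict[str, str] = {}
    for s in sessions:
        if s.status != "running":
            continue  # only sessions marked "running" can be stuck this way
        if s.last_message_role == "tool":
            plan[s.session_id] = "resume"  # tool results exist, no follow-up yet
        elif s.idle_minutes > max_idle_minutes:
            plan[s.session_id] = "flag"    # running too long without progress
    return plan
```

Checking for pending tool results before the timeout means an interrupted loop is resumed immediately after restart instead of waiting for the idle limit to expire.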

Current workaround: Manually reset the session or send a new message to trigger recovery. Neither is ideal for production use.

extent analysis

TL;DR

Implement a mechanism to scan and resume incomplete tool-use loops after a gateway restart to prevent sessions from getting stuck in a "running" state.

Guidance

Identify sessions in a "running" state after a gateway restart and check for incomplete tool-use loops by verifying the presence of tool results without a follow-up assistant message.
Resume the tool-use loop by sending the tool results and context to the model to generate the assistant's follow-up response.
Consider implementing a timeout mechanism to flag sessions for recovery or notify the user if a session has been "running" for an extended period without progress.
Review the suggested fix provided in the issue for a potential implementation approach.

Example

No code snippet is provided as the issue does not contain sufficient implementation details.

Notes

The provided guidance is based on the information given in the issue and may require adjustments based on the specific implementation and requirements of the OpenClaw system.

Recommendation

Until a permanent fix is available, apply the workaround from the issue: manually reset the session or send a new message to trigger recovery. The issue does not mention a fixed version to upgrade to.
