openclaw - ✅(Solved) Fix exec: PTY zombie sessions accept routed commands after session marked done [3 pull requests, 9 comments, 3 participants]

GitHub stats

openclaw/openclaw#49943 (fetched 2026-04-08 01:00:59)

  • Comments: 9
  • Participants: 3
  • Timeline events: 18
  • Reactions: 0
  • Timeline (top): commented ×9, cross-referenced ×3, mentioned ×3, subscribed ×3

Root Cause

Root Cause Theory

The gateway's session lifecycle has a mismatch:

  • "session done" marks the session as complete in the registry
  • The PTY subprocess is NOT killed at this point
  • New exec requests are routed to zombie PTY sessions instead of spawning fresh shells

Fix Action

Fix / Workaround

Operational Workaround

Restart the gateway to clear the session registry:

    openclaw gateway restart

Read operations are unaffected by this issue. For critical sequences, use file reads to verify state before and after exec operations.

Suggested Fix Directions

  1. Session lifecycle: "session done" should terminate the PTY subprocess immediately, not just mark the session complete. The session should not accept new routed work after being marked done.
  2. Process group isolation: pty=false exec requests should not share a process group with PTY sessions. A cascade kill of one PTY should not terminate explicitly non-PTY subprocesses.
  3. Routing logic: the dispatcher should not route new exec requests to sessions marked "done." Consider a strict active-session-only routing policy rather than reusing any available session.
  4. Zombie detection: add instrumentation to detect and log when a "done" session's PTY subprocess continues to receive routed work, which would surface this class of bug earlier.

Not a timeout tuning issue

This is not resolved by lowering idleTimeoutMs, maxDurationMs, or requestTimeoutMs. Zombie PTY sessions are not idle: they accept and process routed commands. The issue is in session lifecycle management, not timeout configuration.
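
A minimal sketch of why a shorter idle timeout does not help, assuming (the source does not confirm this) that the gateway resets a per-session idle clock whenever the session handles a command. The IdleTracker class and its method names are hypothetical and only illustrate the reasoning.

```javascript
// Hypothetical idle tracker: the idle clock restarts on every activity.
class IdleTracker {
  constructor(idleTimeoutMs, now) {
    this.idleTimeoutMs = idleTimeoutMs;
    this.lastActivity = now;
  }

  // Called whenever the session handles a routed command.
  touch(now) {
    this.lastActivity = now;
  }

  isIdle(now) {
    return now - this.lastActivity >= this.idleTimeoutMs;
  }
}

// A zombie session that receives a routed command every 60 s never
// reaches a 300 s idle timeout, even over a 12-minute lifetime.
const tracker = new IdleTracker(300000, 0);
let expired = false;
for (let t = 60000; t <= 720000; t += 60000) {
  if (tracker.isIdle(t)) expired = true;
  tracker.touch(t); // routed work resets the idle clock
}
console.log(expired); // false
```

Under this assumption, any timeout longer than the gap between routed commands is never reached, which matches the report's point that the defect lies in lifecycle handling rather than in the timeout values.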

PR fix notes

PR #50112: fix(exec): kill PTY subprocess when session marked done

Description (problem / solution / changelog)

Fixes #49943 — Bug 1. When a session is marked done, the PTY subprocess is now killed synchronously to prevent zombie PTY sessions from accepting routed commands.

Changed files

  • src/agents/bash-tools.exec-runtime.ts (modified, +4/-0)
  • src/process/supervisor/adapters/pty.ts (modified, +1/-0)

PR #50113: fix(exec): kill PTY subprocess synchronously on session exit

Description (problem / solution / changelog)

Fixes #49943 — Bug 1. When a PTY exec session exits, the PTY subprocess is now killed synchronously inside the supervisor before the run is marked fully exited, preventing zombie PTY sessions from accepting routed commands.

Changed files

  • src/process/supervisor/supervisor.ts (modified, +3/-0)

PR #50114: fix(exec): isolate PTY process group to prevent cascade kill

Description (problem / solution / changelog)

Fixes #49943 — Bug 2. Isolates PTY process groups to prevent a SIGKILL on a zombie PTY from propagating to unrelated pty=false subprocesses. Uses pty.kill() instead of killProcessTree because node-pty does not support a child_process.spawn-style detached option.

Changed files

  • src/process/supervisor/adapters/pty.ts (modified, +6/-4)

Code Example

09:18:47 — session started (witty-bison, PTY)
09:19:08 — session started (lobster, PTY)
09:19:47 — session done (lobster)        ← gateway marks done
09:19:47 — PTY output (witty-bison)      ← witty-bison receives lobster's zombie output
09:31:47 — PTY died (lobster)             ← zombie PTY dies 12 min AFTER "session done"
09:31:47 — process kill (lucky-lobster)  ← manual kill
09:31:47 — process killed (witty-bison)   ← cascade: witty-bison killed
09:31:47 — process killed (turbulent-trail) ← cascade: all routed sessions killed

Environment

  • OS: Ubuntu 24.04.4 LTS (WSL2, Linux 6.6.87.2-microsoft-standard-WSL2)
  • Node: v25.8.1
  • OpenClaw: 2026.3.13
  • Python: 3.x in venv at /tmp/sync-test-venv

Gateway config at time of issue

  • exec.requestTimeoutMs: 60000
  • exec.maxDurationMs: 300000
  • exec.idleTimeoutMs: 300000
  • exec.maxConcurrent: 3
  • heartbeatIntervalMs: 60000

Observed Behavior

  1. Exec command completes and returns a result, but the PTY subprocess remains alive.
  2. Subsequent exec requests are routed to the zombie PTY instead of spawning a fresh shell.
  3. Multiple commands stack on the same zombie PTY, causing output mixing (accumulated session history returned instead of current command output).
  4. When the zombie PTY is killed (manually or via gateway restart), all routed subprocesses — including those with pty=false — are terminated together (cascade kill).
  5. Read operations continue to work normally throughout; only exec routing is stuck.

Expected Behavior

When a session is marked "done," its PTY subprocess should be terminated. New exec requests should spawn fresh shells, not route to zombie PTY sessions.
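
The output mixing in observation 3 can be pictured with a small sketch. It assumes, purely for illustration, that a reused PTY session keeps one accumulating buffer; the SessionBuffer class and its methods are hypothetical and not taken from the OpenClaw code.

```javascript
// Hypothetical buffer that accumulates everything a PTY session prints.
class SessionBuffer {
  constructor() {
    this.data = "";
  }

  append(chunk) {
    this.data += chunk;
  }

  // Record where the next command's output will begin.
  mark() {
    return this.data.length;
  }

  // Return only the output produced since the given mark.
  since(mark) {
    return this.data.slice(mark);
  }
}

const buf = new SessionBuffer();
buf.append("output of command A\n"); // earlier command on the same PTY
const start = buf.mark();
buf.append("output of command B\n"); // current command

console.log(JSON.stringify(buf.data));         // whole history, A and B mixed
console.log(JSON.stringify(buf.since(start))); // "output of command B\n"
```

This only illustrates the symptom. The expected behavior stated in the report is a fresh shell per request, which makes offset tracking unnecessary.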

Controlled Test Matrix

Test Command | PTY   | Expected | Actual
date         | false | ~0s      | ~12s (buffered on zombie)
date         | true  | ~0s      | ~12s (buffered on zombie)
sleep 5      | true  | ~5s      | completed then cascade-killed
sleep 30     | true  | ~30s     | killed at ~12s by cascade
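
A small sketch of how rows like these could be checked automatically. The isAnomalous helper and its one-second tolerance are assumptions for illustration; the first two calls use durations from the matrix, and the third uses a made-up healthy run for contrast.

```javascript
// Flag a run whose duration deviates from the expectation
// by more than toleranceMs (hypothetical helper and threshold).
function isAnomalous(expectedMs, actualMs, toleranceMs = 1000) {
  return Math.abs(actualMs - expectedMs) > toleranceMs;
}

console.log(isAnomalous(0, 12000));     // true: `date` took ~12 s instead of ~0 s
console.log(isAnomalous(30000, 12000)); // true: `sleep 30` ended at ~12 s
console.log(isAnomalous(5000, 5200));   // false: hypothetical healthy `sleep 5`
```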

Key Log Evidence

All times are from a single log file. Session names are illustrative.

09:18:47 — session started (witty-bison, PTY)
09:19:08 — session started (lobster, PTY)
09:19:47 — session done (lobster)        ← gateway marks done
09:19:47 — PTY output (witty-bison)      ← witty-bison receives lobster's zombie output
09:31:47 — PTY died (lobster)             ← zombie PTY dies 12 min AFTER "session done"
09:31:47 — process kill (lucky-lobster)  ← manual kill
09:31:47 — process killed (witty-bison)   ← cascade: witty-bison killed
09:31:47 — process killed (turbulent-trail) ← cascade: all routed sessions killed

Critical contradiction: "session done" at 09:19:47, but PTY did not die until 09:31:47 — 12 minutes later. During those 12 minutes, the zombie PTY processed at least one routed exec command.
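
The 12-minute gap can be recomputed from the two relevant entries in the log excerpt. The sketch below uses a hypothetical event shape (time, type, session); only the timestamps and the session name come from the log.

```javascript
// Parse "HH:MM:SS" into seconds since midnight.
function toSeconds(hms) {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// The two entries for lobster, taken from the log excerpt.
const events = [
  { time: "09:19:47", type: "session done", session: "lobster" },
  { time: "09:31:47", type: "PTY died", session: "lobster" },
];

// For each session, measure how long the PTY outlived "session done".
function zombieGaps(eventList) {
  const doneAt = new Map();
  const result = [];
  for (const e of eventList) {
    if (e.type === "session done") {
      doneAt.set(e.session, toSeconds(e.time));
    } else if (e.type === "PTY died" && doneAt.has(e.session)) {
      result.push({
        session: e.session,
        seconds: toSeconds(e.time) - doneAt.get(e.session),
      });
    }
  }
  return result;
}

console.log(zombieGaps(events)); // [ { session: 'lobster', seconds: 720 } ]
```

A gap of 720 seconds is the 12 minutes cited in the report. In a healthy lifecycle the same check would be expected to report a gap at or near zero.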

Root Cause Theory

The gateway's session lifecycle has a mismatch:

  • "session done" marks the session as complete in the registry
  • The PTY subprocess is NOT killed at this point
  • New exec requests are routed to zombie PTY sessions instead of spawning fresh shells

Additionally, all routed exec requests (regardless of pty=false) appear to share a process group. When one zombie PTY is killed, all routed subprocesses in the group are terminated together — even those explicitly created with pty=false.
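
A sketch of the process-group mechanics behind a cascade kill, assuming a POSIX host. In Node.js, child_process.spawn with detached: true makes the child the leader of a new process group, and process.kill with a negative pid signals every process in that group. The helper names spawnIsolated and killGroup are hypothetical; spawn and kill are passed in so the example can run as a dry run without starting real processes.

```javascript
// Spawn a non-PTY command in its own process group (POSIX).
// `spawn` is expected to be child_process.spawn.
function spawnIsolated(spawn, command, args = []) {
  return spawn(command, args, {
    detached: true, // child becomes leader of a new process group
    stdio: ["ignore", "pipe", "pipe"],
  });
}

// Signal a whole process group. `kill` is expected to be process.kill;
// a negative pid addresses the group whose id is the absolute value.
function killGroup(kill, pgid, signal = "SIGTERM") {
  kill(-pgid, signal);
}

// Dry run with stand-ins that record their arguments.
const calls = [];
const fakeSpawn = (cmd, args, opts) => {
  calls.push({ cmd, args, opts });
  return { pid: 4242 }; // made-up pid
};
const fakeKill = (pid, sig) => {
  calls.push({ pid, sig });
};

const child = spawnIsolated(fakeSpawn, "date");
killGroup(fakeKill, child.pid);
console.log(calls[0].opts.detached, calls[1].pid); // true -4242
```

If every routed subprocess shares one group, a single group-wide signal reaches all of them, which is consistent with the cascade described above. Giving each pty=false subprocess its own group would confine such a signal to the intended target.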

Secondary contributing factor: maxConcurrent=3 appears to be bypassed by zombie session reuse, as zombie sessions may not count against the concurrent limit but still accept routed work.
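
A sketch of how such a bypass could arise, assuming (the report only says "appears") that the limiter counts sessions by their registry flag instead of by live subprocesses. The session records and both counting functions are hypothetical.

```javascript
// Hypothetical session records: `done` is the registry flag,
// `ptyAlive` says whether the subprocess still exists.
const sessions = [
  { id: "a", done: false, ptyAlive: true },
  { id: "b", done: true, ptyAlive: true }, // zombie
  { id: "c", done: true, ptyAlive: true }, // zombie
  { id: "d", done: false, ptyAlive: true },
];

// Counting by registry flag alone misses zombies.
const countByFlag = (list) => list.filter((s) => !s.done).length;

// Counting live subprocesses reflects the real load.
const countByProcess = (list) => list.filter((s) => s.ptyAlive).length;

const maxConcurrent = 3;
console.log(countByFlag(sessions) < maxConcurrent);    // true: a new exec is admitted
console.log(countByProcess(sessions) < maxConcurrent); // false: four shells already run
```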


extent analysis

Fix Plan

To address the issue, we need to modify the session lifecycle management and routing logic. Here are the concrete steps:

  • Terminate PTY subprocess on session completion:
    • Modify the sessionDone function to immediately terminate the PTY subprocess using process.kill() or equivalent.
    • Ensure the session is removed from the registry and marked as inactive to prevent further routing.
  • Isolate process groups for non-PTY sessions:
    • Start each non-PTY subprocess in its own process group. Node.js has no process.setpgid(); on POSIX, child_process.spawn with detached: true achieves this.
    • Ensure that non-PTY sessions do not share a process group with PTY sessions.
  • Implement strict active-session-only routing:
    • Modify the routing logic to only route new exec requests to active sessions.
    • Ignore sessions marked as "done" or inactive.
  • Add zombie detection instrumentation:
    • Log a warning or error when a "done" session's PTY subprocess receives routed work.
    • Use this instrumentation to detect and surface similar issues earlier.

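Example code snippet (Node.js); function and field names are illustrative:

const { spawn } = require("node:child_process");

// Hypothetical in-memory registry: session id -> session record
const sessions = new Map();

// Terminate PTY subprocess on session completion
function sessionDone(session) {
  session.done = true;
  session.active = false;
  if (session.pty) {
    session.pty.kill(); // terminate PTY subprocess
    session.pty = null;
  }
  sessions.delete(session.id); // done sessions can no longer be routed to
}

// Isolate process groups for non-PTY sessions.
// Node.js has no process.setpgid(); on POSIX, spawn() with detached: true
// makes the child the leader of a new process group.
function createNonPtySession(id, command, args = []) {
  const child = spawn(command, args, { detached: true });
  const session = { id, pty: null, child, active: true, done: false };
  sessions.set(id, session);
  return session;
}

// Implement strict active-session-only routing
function routeExecRequest(request) {
  const suitableSession = [...sessions.values()].find(
    (session) => session.active && !session.done
  );
  if (suitableSession) {
    return suitableSession; // route request to this session
  }
  return null; // caller creates a new session or returns an error
}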
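
The zombie detection step from the fix plan could be sketched as follows. This is a hedged illustration, not the project's implementation: the session record shape ({ id, done, write }) and the instrumentSession helper are hypothetical.

```javascript
// Wrap a session's write path so that routed work arriving after
// "done" is logged. The write is still forwarded, so this sketch only
// surfaces the problem and does not change behavior.
function instrumentSession(session, log = console.warn) {
  const originalWrite = session.write.bind(session);
  session.write = (data) => {
    if (session.done) {
      log(`zombie session ${session.id} received routed work after done`);
    }
    return originalWrite(data);
  };
  return session;
}

// Dry run with a stand-in session and an in-memory log.
const warnings = [];
const received = [];
const s = instrumentSession(
  { id: "s1", done: false, write: (d) => received.push(d) },
  (msg) => warnings.push(msg)
);

s.write("date\n"); // active: no warning
s.done = true;
s.write("date\n"); // done: warning recorded, write still forwarded

console.log(warnings.length, received.length); // 1 2
```

Once a strict routing policy is in place, the same check could be turned into a hard rejection instead of a warning.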

Verification

To verify the fix, test the following scenarios:

  • Execute a command with PTY enabled and verify that the PTY subprocess is terminated after the session is marked as "done".
  • Execute a command with PTY disabled and verify that it does not share a process group with PTY sessions.
  • Route multiple exec requests to different sessions and verify that each request is routed to a fresh, active session.
  • Verify that zombie detection instrumentation logs warnings or errors when a "done" session's PTY subprocess receives routed work.
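
The first and third scenarios could be exercised without a real gateway by using stand-ins. Everything below (makeFakePty, markDone, pickSession) is hypothetical scaffolding that would be replaced by the actual lifecycle and routing functions under test.

```javascript
// Stand-in for a PTY handle that records whether kill() was called.
function makeFakePty() {
  return {
    killed: false,
    kill() {
      this.killed = true;
    },
  };
}

// Hypothetical lifecycle under test: marking a session done must
// kill its PTY and make the session ineligible for routing.
function markDone(session) {
  session.done = true;
  if (session.pty) session.pty.kill();
}

function pickSession(sessionList) {
  return sessionList.find((s) => !s.done) ?? null;
}

// Scenario 1: the PTY is terminated when the session is marked done.
const first = { id: "s1", done: false, pty: makeFakePty() };
markDone(first);
console.assert(first.pty.killed === true, "PTY should be killed on done");

// Scenario 3: a done session is never selected for new work.
const second = { id: "s2", done: false, pty: makeFakePty() };
console.assert(pickSession([first, second]) === second, "router must skip done sessions");
console.assert(pickSession([first]) === null, "no active session: spawn a fresh one");
```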

Extra Tips

  • Regularly review and update session lifecycle management and routing logic to prevent similar issues.
  • Consider implementing additional instrumentation and logging to detect and surface other potential issues.
  • Test and verify fixes thoroughly to ensure that they do not introduce new issues or regressions.
