claude-code: main thread enters user-space spin loop after Bash tool child exits; SIGCHLD ignored; terminal unresponsive (2.1.118) [1 comment, 2 participants]

GitHub stats
anthropics/claude-code#52544 (fetched 2026-04-24 06:04:20)
Claude Code 2.1.118 — main thread enters user-space spin loop after Bash tool child exits, terminal becomes unresponsive

Repo to file under: https://github.com/anthropics/claude-code/issues
Filed by: user [email protected] (Marco)
Diagnosed by: new Claude Code session (PID 973745) on the same host, using gdb on the stuck PID.
Date observed: 2026-04-23


Context from the user filing this

This is the third confirmed bug we've hit in version 2.1.118 in roughly two days. The user's words, verbatim (translated from pt-BR):

"These last two days, we're developing something, and it breaks other points that should NOT have been touched. Tests aren't run — only syntax, no runtime, no logic."

I'm flagging this because the pattern (regressions in untouched code paths, plus what reads like syntax-only validation in the release pipeline) seems consistent across the bugs we've seen, not just specific to this hang. Worth a look beyond just the immediate fix.


TL;DR

After ~30+ minutes of normal interactive use, the claude CLI process became unresponsive in the terminal. The TUI no longer accepts input, but the process is alive: connected to the user's pty, with two ESTAB TLS sockets to api.anthropic.com still open. top shows it pinned at 50–60% CPU. There is exactly one orphaned [bash] <defunct> child of the claude process — a leftover from a Bash tool invocation that Claude never waitpid()'d. The main thread is spinning in a tight user-space loop inside the claude binary's .text section. All 20 worker threads (Bun pools, HeapHelpers, HTTP Client, File Watcher) are correctly idle.

This pattern (live process, idle workers, single zombie child not reaped, main thread in user-space spin) is consistent with a child-process lifecycle bug in the Bun-compiled runtime: an event/promise that resolves on the bash child's exit was never satisfied, and the event loop is busy-polling for it instead of either yielding or completing.


Environment

| Field | Value |
| --- | --- |
| Claude Code version | 2.1.118 |
| Binary path | /root/.local/share/claude/versions/2.1.118 |
| Invocation | claude --effort max --chrome |
| Runtime | Bun (visible in thread names: Bun Pool 0..9, HeapHelper, HTTP Client, File Watcher) |
| OS | Debian 13 |
| Kernel | Linux 6.17.13-2-pve |
| Arch | x86_64 |
| Run as | root |
| Terminal | inside tmux, on /dev/pts/4 |
| Network | LAN host 192.168.0.41, ESTAB sockets to 160.79.104.10:443 (api.anthropic.com) |

Symptoms (user-observable)

  • Terminal stops echoing input.
  • Cursor still in the Claude Code TUI prompt area; no error, no exit.
  • top shows the claude process at 50–60% CPU constantly.
  • Opening a second claude session in another tmux window works normally; the stuck one stays stuck indefinitely.

Process state at the moment of diagnosis (47 min into the hang)

PID    PPID    ELAPSED   %CPU  STAT  CMD
957988 624203  47:27     60.7  Rl+   claude --effort max --chrome
  • STAT = Rl+ — running, multi-threaded, foreground.
  • wchan = 0 — not blocked in any kernel syscall.
  • Threads = 21.
  • voluntary_ctxt_switches = 578328, nonvoluntary_ctxt_switches = 20495 over 38 min — the main thread keeps yielding briefly but nothing breaks the spin.

Syscall sampling (passive, no ptrace)

50 reads of /proc/957988/syscall over 2 s, sampled every 40 ms:

     50 running

100% of samples returned running, confirming the main thread is in user-space code, not in any kernel syscall.
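The sampling described above can be reproduced with a short script. This is a minimal sketch, not the tool that was actually used for the report: the function names `read_syscall_state` and `sample_syscall` are hypothetical, and it assumes a Linux host where the caller is allowed to read `/proc/<pid>/syscall`.

```python
import time
from collections import Counter


def read_syscall_state(pid):
    """Return the first field of /proc/<pid>/syscall.

    Per proc(5): "running" means the task is not blocked (user space),
    a number is the syscall it is currently inside, and "-1" means it is
    blocked outside of a syscall.
    """
    with open(f"/proc/{pid}/syscall") as f:
        return f.read().split()[0]


def sample_syscall(pid, samples=50, interval=0.04,
                   reader=read_syscall_state, sleep=time.sleep):
    """Sample the main thread's syscall state and count each value seen.

    The defaults mirror the report: 50 reads, 40 ms apart, about 2 s total.
    `reader` and `sleep` are injectable so the loop can be tested offline.
    """
    counts = Counter()
    for _ in range(samples):
        counts[reader(pid)] += 1
        sleep(interval)
    return counts


if __name__ == "__main__":
    import sys
    for state, n in sample_syscall(int(sys.argv[1])).most_common():
        print(f"{n:7d} {state}")
```

Because `/proc/<pid>/syscall` describes the task whose TID equals the PID, this only observes the main thread; other threads would need `/proc/<pid>/task/<tid>/syscall`.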


Child process — the smoking gun

$ pgrep -aP 957988
969673 [bash] <defunct>

There is exactly one child of the claude process, and it is a zombie — a bash that finished and is waiting to be reaped by waitpid() from the parent. The parent (claude, PID 957988) never picked it up, even though many minutes have passed.

The defunct child correlates with the user's last activity in the session: a Bash tool invocation that ran a shell command. The bash exited; the claude runtime missed the close/exit event for it.
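For the next occurrence, the same check can be scripted so that zombie children are listed with their state rather than inferred from the `<defunct>` label. The sketch below is an assumption-laden illustration (helper names `parse_stat` and `zombie_children` are hypothetical); it scans `/proc/*/stat` on Linux and is not part of Claude Code.

```python
import os


def parse_stat(stat_text):
    """Parse /proc/<pid>/stat into (pid, comm, state, ppid).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the first "(" and the last ")".
    """
    lpar = stat_text.index("(")
    rpar = stat_text.rindex(")")
    pid = int(stat_text[:lpar])
    comm = stat_text[lpar + 1:rpar]
    rest = stat_text[rpar + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def zombie_children(parent_pid, proc_root="/proc"):
    """Return [(pid, comm)] for children of parent_pid in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except OSError:
            continue  # process vanished or is unreadable; skip it
        if ppid == parent_pid and state == "Z":
            zombies.append((pid, comm))
    return zombies
```

Calling `zombie_children(957988)` on the affected host would be expected to list the defunct bash while the hang lasts.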


Backtrace (gdb 16.3 on the stripped binary)

gdb -batch -p 957988 -ex 'set pagination off' -ex 'thread apply all bt 30' -ex 'detach' -ex 'quit'

The binary is stripped (no DWARF symbols, no unwind info), so symbol resolution and stack unwinding are limited. Below is the relevant signal extracted from the full output.

Thread 1 — claude (LWP 957988), the spinning main thread

#0  0x00000000061db575 in ?? ()
#1  0x00007ffda552e940 in ?? ()
#2  0x0000000000000020 in ?? ()
#3  0x0000000000000000 in ?? ()
  • 0x00000000061db575 lies inside the r-xp text segment of the claude binary itself:
    02d2f000-068c0000 r-xp 02b2e000 00:33  /root/.local/share/claude/versions/2.1.118
    This is not libc, not libpthread, not JSC JIT — it is the AOT-compiled JS bundle of Claude Code (or a small native shim around it).
  • Frames #1–#3 are stack pointers / frame pointers, not return addresses — the unwinder cannot proceed without unwind tables.
  • Detach + re-check: process state remained Rl+, CPU climbed from 48% to 60% — gdb attach was non-destructive but did not break the loop.
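The claim that frame #0 lies in the binary's own text mapping can be checked mechanically. The following sketch (hypothetical helpers `parse_maps_range` and `locate`) parses one `/proc/<pid>/maps` line and, if the address falls inside it, also computes the corresponding file offset, which is what a disassembler or a symbolized build would need. It reads only the first three fields, so it accepts the abbreviated maps line quoted in this report.

```python
def parse_maps_range(line):
    """Return (start, end, perms, file_offset) from a /proc/<pid>/maps line."""
    fields = line.split()
    start_s, end_s = fields[0].split("-")
    return int(start_s, 16), int(end_s, 16), fields[1], int(fields[2], 16)


def locate(addr, line):
    """Return (perms, file_offset_of_addr) if addr is inside the mapping, else None."""
    start, end, perms, offset = parse_maps_range(line)
    if not (start <= addr < end):
        return None
    return perms, offset + (addr - start)


if __name__ == "__main__":
    maps_line = ("02d2f000-068c0000 r-xp 02b2e000 00:33  "
                 "/root/.local/share/claude/versions/2.1.118")
    hit = locate(0x61DB575, maps_line)
    print(hit and (hit[0], hex(hit[1])))
```

Frames #1 to #3 from the same backtrace do not fall inside this range, which is consistent with them being stack or frame values rather than return addresses.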

Thread 21 — claude aux (LWP 957989)

In pthread_cond_timedwait. Idle.

Threads 19–14 + 5–2 — Bun Pool 0..9

All in the same idle wait position deep in JSC code:

#0  0x0000000003062933 in ?? ()
#1  0x00000000030cfb59 in ?? ()
#2  0x0000000003c25ae9 in ?? ()
#3  0x00000000035deaf0 in ?? ()
#4  0x0000715994729b7b in ?? () from libc.so.6   (start_thread)
#5  0x00007159947a77f8 in ?? () from libc.so.6

Indistinguishable across all 10 pool threads → standard WTF::ThreadCondition::wait-style worker idle.

Thread 18 — HTTP Client

#0  0x0000000003f6a25a in ?? ()
#1  0x0000000004aba4d6 in ?? ()

Idle in network event loop.

Thread 13 — File Watcher

Blocked in read() on (presumably) an inotify fd. Normal.

Threads 12–6 — HeapHelper (7 threads)

All in pthread_cond_timedwait. Idle, waiting for a GC trigger.

Conclusion: every worker is idle exactly where it should be when the runtime has nothing to do. The 50%+ CPU is being burned entirely by Thread 1, in user-space, in the claude binary's text section.

Full gdb output saved on the host at /tmp/claude_957988_bt.txt (266 lines).


File descriptors and sockets

50 fds open. Highlights:

  • 0, 1, 2, 11, 12, 13 → /dev/pts/4 (the user's terminal, still attached).
  • 4, 14, 41 → anon_inode:[eventpoll] (Bun event loops).
  • 15, 43 → anon_inode:[timerfd].
  • 16, 45 → anon_inode:[eventfd].
  • 18 → ESTAB TCP 192.168.0.41:43064 → 160.79.104.10:443 (api.anthropic.com).
  • 44 → ESTAB TCP 192.168.0.41:57418 → 160.79.104.10:443 (api.anthropic.com).
  • 19–40 → various paths under /root/.claude/... (tasks, settings, sessions, plugins, shell-snapshots, credentials, etc).
  • 22 → /root/.claude/tasks/ae57ad32-03dd-4eab-beb1-220de8ca8732/.lock (still held).
  • 19 → /root/.claude/tasks/ae57ad32-03dd-4eab-beb1-220de8ca8732 (the task dir itself, with 6 task JSON files inside; see below).

API sockets are still open and (presumably) on the server side waiting. The session is not network-disconnected.

Task dir contents (TaskList state at hang)

Six task JSONs were active. The first was in_progress, the other five pending. None had been closed. All written between 17:13 and 17:14, ~17 min after session start (16:56).

1.json status=in_progress  "T01088: Finalize photos hash check + drop rc → 1.66"
2.json status=pending      "T01089: Fix dblclick-to-copy in history messages"
3.json status=pending      "T01090: Restore IP next to name in messages"
4.json status=pending      "T01091: Filter session jsonl raw leak"
5.json status=pending      "T01092: Fix Vega duplicated message 3s after response"
6.json status=pending      "T01093: Fix empty Microfone/Voices comboboxes via cloudflared tunnel"

This is consistent with the user's report that the session locked up while working on T01088, before getting to the others.


Cumulative I/O at hang

rchar:                 318808690   (~304 MB read)
wchar:                  14034018   (~13 MB written)
syscr:                   6168827
syscw:                     47201
read_bytes:              1822208
write_bytes:             7207172
cancelled_write_bytes:         0

Read syscall count is two orders of magnitude higher than write — consistent with active streaming reads from somewhere (probably the API socket and/or a tool output pipe) before the freeze.
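The ratio quoted above can be recomputed from the raw counters. This is a small sketch with a hypothetical helper name (`parse_proc_io`); it accepts the `/proc/<pid>/io` format, including the trailing annotations used in this report.

```python
def parse_proc_io(text):
    """Parse /proc/<pid>/io style text into a dict of integer counters.

    Anything after the number on a line (such as "(~304 MB read)") is ignored.
    """
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.split():
            counters[key.strip()] = int(value.split()[0])
    return counters


if __name__ == "__main__":
    sample = """\
rchar:                 318808690   (~304 MB read)
wchar:                  14034018   (~13 MB written)
syscr:                   6168827
syscw:                     47201
"""
    io = parse_proc_io(sample)
    print("read/write syscall ratio:", round(io["syscr"] / io["syscw"], 1))
    print("average bytes per read call:", round(io["rchar"] / io["syscr"], 1))
```

The second figure, the average bytes returned per read call, is not stated in the report; it is included only because it may help distinguish many small reads from a few large ones.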


Hypothesis

A Bash tool was invoked. The runtime spawned bash as a child via fork+exec (or Bun's equivalent), wired up pipes for stdout/stderr/stdin, and registered a Promise/callback to fire on the child's exit.

The bash exited. One of the following happened:

  1. The runtime's SIGCHLD handler missed the signal (handler not installed, masked, or coalesced with another SIGCHLD and only one was processed).
  2. The exit detection code path relies on observing EOF on the child's stdout pipe via epoll. The pipe FD never reported EPOLLHUP because of a race, or the EPOLLHUP was observed but the event handler errored out before resolving the awaiting promise.
  3. The runtime is polling waitpid(WNOHANG) in a tick callback and the call always returns 0, as if the kernel were not reporting the child's exit. Since wchan=0 and STAT=R argue against a kernel-side cause, this is unlikely.

The [bash] <defunct> child confirms waitpid() was never called successfully on it.
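Regarding the coalescing case in hypothesis 1: standard signals are not queued, so several child exits can produce a single SIGCHLD. The usual defence is to drain every exited child each time the handler (or its event-loop continuation) runs, instead of reaping one child per signal. The sketch below shows that pattern in Python as an illustration only; `reap_exited_children` is a hypothetical name and this is not Bun's or Claude Code's actual implementation.

```python
import os


def reap_exited_children():
    """Reap every child that has already exited, without blocking.

    Returns a list of (pid, exit_code). Intended to be called whenever
    SIGCHLD is observed, and looped until nothing is left, because one
    SIGCHLD may stand for several exited children.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped
```

A runtime that tracks children individually would wait on specific PIDs rather than -1, so that it does not steal exit statuses from other subsystems; the loop-until-empty structure is the part that matters here.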

The 50%+ CPU suggests an await / Promise.race / event loop predicate that the runtime keeps re-evaluating as "not ready" on every tick, never sleeping.

This kind of bug typically triggers when:

  • A long-running tool (Bash) outputs nothing for a while, then exits, and the user's terminal is being read on the same event loop tick.
  • A process-group / setsid configuration mismatch causes the shell child to exit through a path the parent isn't subscribed to.

Reproduction

I do not have a deterministic repro. The user reports it has happened more than once on this host, but not predictably.

Variables that may matter:

  • --effort max --chrome flags (chrome MCP server is part of the toolset).
  • Long session (30+ min, multiple tool calls before the hang).
  • Bash tool with output to a pty/stream.
  • Potentially: a Bash command that itself spawned children that survived the parent (creating reaping confusion).

If a deterministic repro is needed, I can collect more data on the next occurrence (e.g. install strace/perf/bpftrace ahead of time and capture the syscall stream live during the freeze).


What I recommend the Anthropic team do with this

  1. Audit child-process lifecycle in the Bash tool implementation — specifically the path that registers the "child has exited" event. Make sure both the SIGCHLD and the EOF-on-stdout paths converge correctly and that waitpid is always called.
  2. Add a watchdog in the event loop — if the loop runs N consecutive ticks at 100% with no I/O progress, log a warning (and ideally a stack dump) to /root/.claude/sessions/<id>/runtime.log. This would give post-mortem evidence next time.
  3. Ship the binary with at least minimal unwind info (-fasynchronous-unwind-tables, keep .eh_frame). The strip removed everything; a future user with gdb cannot give you a real backtrace.
  4. Reap orphan children defensively at session shutdown — even if the live-reaping path fails, a final sweep on SIGTERM/SIGINT would prevent zombies from accumulating.
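To make recommendation 2 concrete, here is a minimal sketch of the counting logic such a watchdog could use. Everything in it is hypothetical: the class name `SpinWatchdog`, the threshold, and the idea that the event loop can tell the watchdog whether a tick made I/O progress are all assumptions, and it is written in Python for illustration rather than in the runtime's own language.

```python
class SpinWatchdog:
    """Flag an event loop that keeps ticking without any I/O progress."""

    def __init__(self, max_idle_ticks):
        self.max_idle_ticks = max_idle_ticks
        self.idle_ticks = 0

    def tick(self, made_io_progress):
        """Record one loop iteration; return True when the loop looks stuck."""
        if made_io_progress:
            self.idle_ticks = 0  # any progress resets the counter
            return False
        self.idle_ticks += 1
        return self.idle_ticks >= self.max_idle_ticks


if __name__ == "__main__":
    import faulthandler
    import sys

    wd = SpinWatchdog(max_idle_ticks=3)
    for progress in (True, False, False, False):
        if wd.tick(progress):
            # In a real runtime this is where a warning and a stack dump
            # would be written to the session's log file.
            faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```

A counter like this only helps if the spin passes through the loop's tick boundary. If the main thread is stuck inside a single callback, the check would have to run from a separate thread or a timer signal instead.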

Mitigation tried by the user

| Step | Result |
| --- | --- |
| Open new tmux pane and start fresh claude session | Worked; that's how this report is being written. |
| kill -CHLD <pid> on the stuck process | No effect. Sent at 52:47 etime. 3 s later: STAT=Rl+, CPU=65.3% (was 65.2% before), wchan=0, child zombie still present, syscall sampling still 100% running. The signal was delivered (no kill error) but the runtime did not act on it. |
| kill -9 <pid> | Process exited cleanly; the orphan [bash] <defunct> (PID 969673) was immediately reaped by init (now its new parent). The user's tmux pane survived as expected. |

Adjunct finding: SIGCHLD non-effect is itself a clue

A correctly written runtime that detects child exit purely via pipe EOF would naturally ignore a forced SIGCHLD; that's expected. But if Bun installs SIGCHLD handling (this report assumes libuv-style uv_signal_t infrastructure, which has not been verified against Bun's source), an extra SIGCHLD should at minimum re-enter the child-reaping code path. The fact that absolutely nothing changed (zombie preserved, CPU unchanged, state unchanged) suggests one of:

  1. The SIGCHLD handler was never installed at the OS level for this session. A snapshot of the SigCgt line in /proc/<pid>/status would be informative next time: check whether the bit for SIGCHLD is set (signal 17 on Linux x86_64, i.e. mask 0x10000).
  2. The handler is installed but its userspace continuation never runs because the spinning main thread never returns to the event loop. The 100% running syscall sample supports this — the main thread is not calling epoll_wait, so a wakeup posted by a signal handler would have nowhere to be observed.

In other words, the runtime appears to be in a state where signal-driven recovery is structurally impossible, not just "ignored". This is what made kill -9 necessary.
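To tell the two cases above apart on the next occurrence, the signal masks in `/proc/<pid>/status` can be decoded directly. The sketch below uses a hypothetical helper name (`signal_in_mask`) and relies on the Linux convention that signal N corresponds to bit N-1 of the hexadecimal mask.

```python
def signal_in_mask(status_text, signum, field="SigCgt"):
    """Return True if `signum` is set in the given mask of /proc/<pid>/status.

    field is one of SigCgt (caught), SigIgn (ignored) or SigBlk (blocked).
    Signal N is stored as bit N-1 of the hexadecimal mask.
    """
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            mask = int(line.split(":", 1)[1].strip(), 16)
            return bool(mask & (1 << (signum - 1)))
    raise KeyError(field)


if __name__ == "__main__":
    import signal
    import sys

    with open(f"/proc/{sys.argv[1]}/status") as f:
        text = f.read()
    for field in ("SigCgt", "SigIgn", "SigBlk"):
        print(field, "has SIGCHLD:", signal_in_mask(text, signal.SIGCHLD, field))
```

If SIGCHLD shows up in SigCgt, case 1 is ruled out and the stalled continuation in case 2 becomes the more likely explanation. If it shows up in SigBlk, the signal is being held pending and never delivered.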


Attachments

  • claude_957988_bt.txt — full gdb backtrace output (266 lines).
  • This report.

Related files on the host (kept for reference)

  • /root/.claude/tasks/ae57ad32-03dd-4eab-beb1-220de8ca8732/{1..6}.json — TaskList at moment of hang.
  • /proc/957988/maps — full memory map (binary loaded at 00200000-069a1000).
  • /proc/957988/status — process status snapshot.

extent analysis

TL;DR

The issue can likely be fixed by auditing the child-process lifecycle in the Bash tool implementation and adding a watchdog in the event loop to detect and handle stuck processes.

Guidance

  1. Audit child-process lifecycle: Review the Bash tool implementation to ensure correct registration of the "child has exited" event and proper convergence of SIGCHLD and EOF-on-stdout paths.
  2. Add a watchdog in the event loop: Implement a mechanism to log warnings and stack dumps when the event loop runs consecutive ticks at 100% CPU with no I/O progress.
  3. Ship binary with unwind info: Include minimal unwind info in the binary to facilitate debugging and provide more informative backtraces.
  4. Reap orphan children defensively: Add a final sweep at session shutdown to prevent zombie processes from accumulating.

Example

No specific code example is provided due to the lack of explicit code references in the issue. However, the general approach would involve modifying the Bash tool implementation to correctly handle child process exit events and adding a watchdog mechanism to the event loop.

Notes

The provided information suggests a complex issue related to child process handling and event loop management. The proposed steps aim to address the root cause, but further investigation and testing may be necessary to fully resolve the issue.

Recommendation

Apply the proposed fix: audit the child-process lifecycle and add a watchdog in the event loop. This addresses the likely root cause of the issue and provides a foundation for further debugging.
