claude-code - ✅(Solved) Fix [BUG] Agents emit self-matching `pgrep -f` polling patterns that deadlock background tasks [1 pull requests, 2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#60716Fetched 2026-05-20 03:51:22
View on GitHub
Comments
2
Participants
2
Timeline
8
Reactions
0
Author
Timeline (top)
labeled ×4commented ×2closed ×1renamed ×1

Error Message

Error Messages/Logs

  1. Default system-prompt guidance — warn against pgrep -f 'PAT' as a completion poll and point at:

Root Cause

The root cause is the model picking a self-matching idiom, but the absence of a first-class "wait for background task" primitive in Claude Code is what pushes models toward hand-rolled polling in the first place.

Fix Action

Fix / Workaround

Workaround we applied in our repo: kelos-dev/kelos#1165 — added a bullet to our CLAUDE.md telling the model not to use the self-matching pgrep -f form. Works for our agents, but the bug will keep recurring for other users until Claude Code's defaults change.

PR fix notes

PR #1165: Warn agents not to poll with self-matching pgrep -f

Description (problem / solution / changelog)

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Adds a single bullet to the kelos-dev-agent (shared) and kelos-workers-agent agentsMD warning the agent away from pgrep -f 'CMD_NAME' as a process-completion poll, and pointing at two safe alternatives.

Background

A live kelos-workers-issue-comment-* pod was found wedged with two orphaned polling shells the agent had written:

bash -c "until [ -z \"\$(pgrep -f 'make verify')\" ]; do sleep 5; done; tail -60 OUT"
bash -c "until [ ! -e /proc/\$(pgrep -f 'go test ./internal/cli/') ]; do sleep 2; done; cat OUT | tail -30"

pgrep -f PAT matches any process whose /proc/*/cmdline contains PAT. The shell running pgrep -f 'make verify' has pgrep -f 'make verify' in its own argv, so pgrep finds the caller itself. The until condition is never empty, the loop runs forever, and the tail after the && never executes — so the BashOutput tool call the agent is awaiting never returns. Claude sits in do_epoll_wait holding pidfds for the wedged shells, the entrypoint waits on claude, and the pod cannot terminate.

The real make verify finished long before (OUT ends with Verification passed). Only the polling shell hangs.

Which issue(s) this PR is related to:

N/A

Special notes for your reviewer:

This is a targeted prompt fix, not a ban on run_in_background: true. Backgrounding is fine; only the self-matching polling form is forbidden. Two safe alternatives are suggested:

  • PID capture: cmd & PID=$!; while kill -0 "$PID" 2>/dev/null; do sleep N; done
  • Self-exclusion in pgrep: pgrep -f 'CMD_NAME' | grep -vw $$

The same bullet is added to kelos-workers-agent (only used by the workers spawner) and kelos-dev-agent (shared by kelos-pr-responder, kelos-config-update, kelos-triage, and kelos-squash-commits), so all coding agents pick it up.

Does this PR introduce a user-facing change?

NONE

Changed files

  • self-development/agentconfig.yaml (modified, +1/-0)
  • self-development/kelos-workers.yaml (modified, +1/-0)

Code Example

until [ -z "$(pgrep -f 'CMD_NAME')" ]; do sleep N; done; tail -K OUT

---

PID 1   entrypoint.sh                           wchan: do_wait
PID 9   claude --prompt "..."                   wchan: do_epoll_wait
        ├─ FD 25 = pidfd → PID 12586
        └─ FD 27 = pidfd → PID 7251
PID 12586  bash -c "until [ -z $(pgrep -f 'make verify') ]; do sleep 5; done; tail -60 OUT"
   └─ sleep 5  (respawning forever)
PID 7251   bash -c "until [ ! -e /proc/$(pgrep -f 'go test ./internal/cli/') ]; do sleep 2; done; cat OUT | tail -30"
   └─ sleep 2  (respawning forever)

$ pgrep -af 'make verify'
# returns the wrapper bash itself + the parent claude (its argv carries the prompt text)

---

claude --dangerously-skip-permissions --output-format stream-json --verbose \
     -p "Run \`make verify\` and \`make test\`, then summarize the result."
RAW_BUFFERClick to expand / collapse

Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report
  • I am using the latest version of Claude Code

What's Wrong?

When agents launch a long-running command via the Bash tool with run_in_background: true and want to wait for it to finish, they commonly improvise a polling shell like:

until [ -z "$(pgrep -f 'CMD_NAME')" ]; do sleep N; done; tail -K OUT

This pattern deadlocks. pgrep -f matches any process whose /proc/*/cmdline contains the pattern, and the shell running this command has pgrep -f 'CMD_NAME' in its own argv — so pgrep always finds the caller itself. The until condition is never empty, the loop runs forever (even after the real command has finished), the tail after the && never executes, the BashOutput tool call the agent is awaiting never returns, and claude blocks indefinitely in do_epoll_wait holding pidfds for the wedged wrappers.

The root cause is the model picking a self-matching idiom, but the absence of a first-class "wait for background task" primitive in Claude Code is what pushes models toward hand-rolled polling in the first place.

In headless sessions (Claude Code running inside a Kubernetes Job), the wedged wrapper holds claude in do_epoll_wait on the pidfd and the container never terminates.

What Should Happen?

The background-task lifecycle should not deadlock when the agent waits on it. Concretely, any of:

  • Claude Code's default system prompt warns the model away from the self-matching pgrep -f idiom.
  • A watchdog terminates a backgrounded shell whose process subtree is only sleep for >N minutes.
  • A first-class wait_for: <task_id> (or equivalent) primitive on the Bash tool removes the need for the model to hand-roll a polling loop.

Error Messages/Logs

PID 1   entrypoint.sh                           wchan: do_wait
PID 9   claude --prompt "..."                   wchan: do_epoll_wait
        ├─ FD 25 = pidfd → PID 12586
        └─ FD 27 = pidfd → PID 7251
PID 12586  bash -c "until [ -z $(pgrep -f 'make verify') ]; do sleep 5; done; tail -60 OUT"
   └─ sleep 5  (respawning forever)
PID 7251   bash -c "until [ ! -e /proc/$(pgrep -f 'go test ./internal/cli/') ]; do sleep 2; done; cat OUT | tail -30"
   └─ sleep 2  (respawning forever)

$ pgrep -af 'make verify'
# returns the wrapper bash itself + the parent claude (its argv carries the prompt text)

The real make verify finished long before — its output file ends with Verification passed. Only the polling shells hang.

Steps to Reproduce

  1. Run Claude Code non-interactively against a repo that has a long-running build command (e.g. make verify taking several minutes):
    claude --dangerously-skip-permissions --output-format stream-json --verbose \
      -p "Run \`make verify\` and \`make test\`, then summarize the result."
  2. Observe the agent launches the build via the Bash tool with run_in_background: true, then writes a second background command of the form until [ -z "$(pgrep -f 'make verify')" ]; do sleep N; done; tail OUT.
  3. The real make verify completes, but the polling shell keeps spinning. claude blocks waiting on its pidfd. The session never finishes.

This reproduces especially reliably when the user prompt itself contains the literal command name (because then claude's own argv also matches pgrep -f), but the wrapper-self-match alone is enough.

Claude Model

Opus

Is this a regression?

I don't know

Claude Code Version

2.1.143 (Claude Code)

Platform

Anthropic API

Operating System

Other Linux

Terminal/Shell

Non-interactive/CI environment

Additional Information

Suggested fixes:

  1. Default system-prompt guidance — warn against pgrep -f 'PAT' as a completion poll and point at:
    • PID capture: cmd & PID=$!; while kill -0 "$PID" 2>/dev/null; do sleep N; done
    • Self-exclusion: pgrep -f 'PAT' | grep -vw $$
  2. Background-task watchdog — if a backgrounded shell's process subtree is just sleep for >N minutes, surface a warning to the model or kill the wrapper.
  3. First-class wait primitive — a Bash tool flag like wait_for: <task_id> that handles polling correctly so the model doesn't hand-roll the loop.

Workaround we applied in our repo: kelos-dev/kelos#1165 — added a bullet to our CLAUDE.md telling the model not to use the self-matching pgrep -f form. Works for our agents, but the bug will keep recurring for other users until Claude Code's defaults change.

Environment notes:

  • Claude Code container image: ghcr.io/kelos-dev/claude-code:main (built on the official Claude Code)
  • Host: Kubernetes Job (linux/amd64)

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - ✅(Solved) Fix [BUG] Agents emit self-matching `pgrep -f` polling patterns that deadlock background tasks [1 pull requests, 2 comments, 2 participants]