openclaw - ✅(Solved) Fix pdf tool can hang indefinitely, blocking session and preventing /new /restart recovery [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#68649 (fetched 2026-04-19 15:09:07)
Comments: 0 · Participants: 1 · Timeline events: 1 (cross-referenced ×1) · Reactions: 0

Error Message

  • Symptom: the pdf tool call never returns (no result, no error, no timeout).
  • Expected: the pdf tool should time out (e.g., after 60-120 seconds) and return an error.

Fix Action

Fixed

PR fix notes

PR #68747: PDF tool: abort stalled remote fetch after 30s idle timeout

Description (problem / solution / changelog)

Summary

  • Problem: The pdf tool can hang indefinitely when fetching a large remote PDF whose body read stalls (reported with a 244‑page www-cdn.anthropic.com download). The entire agent session becomes a zombie — subsequent messages, /new, and /restart all queue behind the stuck tool call and only a gateway restart recovers it. lane=nested waitedMs=164407 queueAhead=0 in the diagnostic confirms the nested tool lane is wedged inside the fetch.
  • Why it matters: Complete, silent loss of the agent with no user-facing recovery path. Users commonly re‑send messages thinking the agent is slow, which deepens the queue.
  • What changed: The pdf tool now passes a 30 000 ms readIdleTimeoutMs to loadWebMediaRaw on both the sandbox and non-sandbox remote-fetch branches. A new optional readIdleTimeoutMs on WebMediaOptions is forwarded into fetchRemoteMedia, which already plumbs the value through its body-reader to abort on stalled chunks. Matches the existing Matrix/Discord/Telegram/Tlon media-download pattern.
  • What did NOT change (scope boundary): No config surface, no new capabilities, no SSRF/sandbox/maxBytes/workspaceOnly/localRoots changes, no changes to anthropicAnalyzePdf / geminiAnalyzePdf / complete() overall request deadlines, no changes to the gateway lane scheduler (/new / /restart interruption is a separate concern and is deliberately out of scope here).

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Skills / tool execution

Linked Issue/PR

  • Closes #68649
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: src/agents/tools/pdf-tool.ts called loadWebMediaRaw(url, { maxBytes, localRoots }) with no idle timeout. src/media/web-media.ts then called fetchRemoteMedia({ url, maxBytes, ssrfPolicy }) — again with no readIdleTimeoutMs. fetchRemoteMedia already supports that option and forwards it to readResponseWithLimit({ chunkTimeoutMs }), but the pdf path never opted in. A CDN TCP/TLS connection that accepts the request and then stops yielding body chunks therefore has no abort signal at any layer and the await never resolves.
  • Missing detection / guardrail: No existing regression test covered an idle-body stall for the pdf tool (no timeout/abort/hang assertions in pdf-tool.test.ts, pdf-native-providers.test.ts, or pdf-tool.helpers.test.ts).
  • Contributing context: Matrix, Discord, Telegram, and Tlon media downloaders already set readIdleTimeoutMs; the pdf tool was the odd one out.
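
The root cause above comes down to a body read with no per-chunk deadline. As a minimal sketch of the idle-abort idea (this is not the project's readResponseWithLimit, whose real signature is not shown on this page), the hypothetical readWithIdleTimeout below races every read() against a fresh timer and also enforces a byte cap. ChunkReader and every other name here are assumptions made for illustration.

```typescript
// Hypothetical sketch: ChunkReader and readWithIdleTimeout are illustrative
// names, not OpenClaw APIs.
interface ChunkReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
  cancel(): Promise<void>;
}

async function readWithIdleTimeout(
  reader: ChunkReader,
  idleTimeoutMs: number,
  maxBytes: number,
): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  try {
    for (;;) {
      // Race one read() against a fresh idle timer; the timer restarts per chunk.
      const result = await new Promise<{ done: boolean; value?: Uint8Array }>(
        (resolve, reject) => {
          const timer = setTimeout(
            () => reject(new Error(`body read idle for ${idleTimeoutMs} ms`)),
            idleTimeoutMs,
          );
          reader.read().then(
            (r) => { clearTimeout(timer); resolve(r); },
            (e) => { clearTimeout(timer); reject(e); },
          );
        },
      );
      if (result.done || result.value === undefined) break;
      total += result.value.byteLength;
      if (total > maxBytes) throw new Error(`body exceeds ${maxBytes} bytes`);
      chunks.push(result.value);
    }
  } catch (err) {
    // Release the underlying stream without waiting on it.
    void reader.cancel().catch(() => undefined);
    throw err;
  }
  // Concatenate the collected chunks into one buffer.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

A reader obtained from a fetch response via response.body.getReader() has this read()/cancel() shape, so a stalled CDN connection would surface as a rejected promise after idleTimeoutMs instead of an await that never settles.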

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
  • Target test or file: src/agents/tools/pdf-tool.test.ts (new case: passes a fetch idle timeout when loading remote PDFs).
  • Scenario the test should lock in: executing the pdf tool with an https:// URL must hand loadWebMediaRaw a positive readIdleTimeoutMs, so stalled body reads can be aborted.
  • Why this is the smallest reliable guardrail: it is the single seam at which the pdf tool becomes responsible for the idle-abort contract; everything below that call is already covered by fetchRemoteMedia's existing idle-timeout tests in src/media/fetch.test.ts.
  • Existing test that already covers this: None.
  • If no new test is added, why not: N/A.
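
The scenario in this test plan can be sketched without the real modules. The block below is an assumption-laden stand-in: Loader, LoadOptions, loadRemotePdf, and checkForwardsIdleTimeout are hypothetical names, and the actual test in pdf-tool.test.ts is not reproduced on this page. It only illustrates the shape of the assertion: a spy loader records the options it receives, and the check fails unless a positive readIdleTimeoutMs was forwarded.

```typescript
// Hypothetical stand-ins for the pdf tool and its loader; not the project's code.
interface LoadOptions {
  maxBytes: number;
  readIdleTimeoutMs?: number;
}
type Loader = (url: string, options: LoadOptions) => Promise<Uint8Array>;

// The PR describes a 30 000 ms idle timeout for remote PDF fetches.
const PDF_READ_IDLE_TIMEOUT_MS = 30_000;

async function loadRemotePdf(
  url: string,
  load: Loader,
  maxBytes: number,
): Promise<Uint8Array> {
  // The seam under test: the caller opts in to the idle-abort contract here.
  return load(url, { maxBytes, readIdleTimeoutMs: PDF_READ_IDLE_TIMEOUT_MS });
}

async function checkForwardsIdleTimeout(): Promise<void> {
  const calls: LoadOptions[] = [];
  const spy: Loader = async (_url, options) => {
    calls.push(options);
    return new Uint8Array();
  };
  await loadRemotePdf("https://example.com/doc.pdf", spy, 10 * 1024 * 1024);
  const timeout = calls[0]?.readIdleTimeoutMs;
  if (!(typeof timeout === "number" && timeout > 0)) {
    throw new Error("readIdleTimeoutMs was not forwarded to the loader");
  }
}
```

Asserting only that the value is positive, rather than exactly 30 000, keeps the guardrail stable if the constant is tuned later.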

User-visible / Behavior Changes

  • Remote PDF downloads that stall for 30 s without yielding additional body bytes now abort with a standard fetch error surfaced as a tool error, instead of wedging the session. Healthy downloads behave identically.

Diagram (if applicable)

Before:
pdf tool -> loadWebMediaRaw({ maxBytes }) -> fetchRemoteMedia({ maxBytes, ssrfPolicy })
                                                   |
                                                   v
                                             body read stalls -> await never resolves
                                             -> nested tool lane wedged -> session zombie

After:
pdf tool -> loadWebMediaRaw({ maxBytes, readIdleTimeoutMs: 30_000 })
         -> fetchRemoteMedia({ ..., readIdleTimeoutMs: 30_000 })
         -> readResponseWithLimit({ chunkTimeoutMs: 30_000 })
         -> stall aborts -> tool returns error -> lane freed

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (same requests, same SSRF guard, same hosts, same maxBytes)
  • Command/tool execution surface changed? No (same tool schema, same inputs, same outputs)
  • Data access scope changed? No

Security posture notes:

  • The new behavior is a pure runtime-enforced defense (body-read idle-abort on the fetch pipeline). It does not depend on prompt text and cannot be disabled from the model side.
  • The timeout is applied unconditionally to both remote-fetch branches (sandbox + non-sandbox) so there is no "bypass by path" case.
  • No existing security guard is relaxed: maxBytes, SSRF allowlist, workspaceOnly, localRoots, sandbox validation, and hostReadCapability checks all continue to run exactly as before.
  • Local-path reads are unaffected; readIdleTimeoutMs only applies to the https?:// branch of loadWebMediaInternal.

Repro + Verification

Environment

  • OS: macOS (darwin 25.3.0)
  • Runtime/container: Node 22, pnpm
  • Model/provider: any pdf-capable model (reproduced with openai-codex/gpt-5.4; original report also on claude-opus-4-6)
  • Integration/channel: WhatsApp (not channel-specific; any path that invokes the pdf tool)
  • Relevant config: defaults

Steps

  1. Ask the agent to analyze a large remote PDF (e.g. a 200+‑page document from a CDN that stalls mid‑download).
  2. Observe that without this patch, the tool call never returns and the session wedges.
  3. With this patch, the fetch aborts after 30 s of idle body reads and the tool returns a standard fetch error to the model, which can react normally.

Expected

  • Stalled PDF fetch aborts within ~30 s and the session remains responsive.

Actual (with fix)

  • Matches expected. Fixed.

Evidence

  • Failing test/log before + passing after

Logs/tests:

$ pnpm test src/agents/tools/pdf-tool.test.ts
 Test Files  1 passed (1)
      Tests  12 passed (12)

$ pnpm test src/media/web-media.test.ts src/media/fetch.test.ts
 Test Files  2 passed (2)
      Tests  36 passed (36)

Full pnpm check (tsgo:core, tsgo:core:test, tsgo:extensions, tsgo:extensions:test, oxlint, webhook/body-read lint, pairing/account-scope lints, import-cycle + madge checks) ran green via the pre-commit hook.

Human Verification (required)

  • Verified scenarios:
    • New unit test asserts the pdf tool forwards readIdleTimeoutMs to loadWebMediaRaw on a remote URL.
    • Existing pdf-tool, web-media, and fetch tests remain green.
    • pnpm check passes end-to-end.
  • Edge cases checked:
    • Sandbox branch and non-sandbox branch both receive the timeout (both loadWebMediaRaw call sites updated).
    • Local-path pdf reads are unaffected (the idle-timeout only applies under the https?:// branch of loadWebMediaInternal).
    • Callers that pass a plain maxBytes number to loadWebMedia/loadWebMediaRaw are unchanged (new field is optional).
  • What I did not verify:
    • End-to-end reproduction against the exact original upstream CDN URL (would require a reliably stalling remote). The behavior is covered by the unit test, the existing readResponseWithLimit chunk-timeout tests, and the matching pattern already live in Matrix/Discord/Telegram/Tlon media downloaders.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: a pathological but healthy remote server that legitimately goes quiet between body chunks for more than 30 s could trip the abort.
    • Mitigation: matches the idle-timeout values already used by Matrix (30 s) and other channels; well above typical TLS/HTTP liveness; the model receives a normal tool error and can retry. No change to maxBytes or overall request size caps.


Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/tools/pdf-tool.test.ts (modified, +22/-0)
  • src/agents/tools/pdf-tool.ts (modified, +5/-0)
  • src/media/web-media.ts (modified, +9/-1)


Bug Description

The pdf tool can hang indefinitely when fetching/processing a large remote PDF, causing the entire agent session to become a zombie. While the session is stuck:

  • No new messages from the user are processed (they queue behind the stuck tool call)
  • Slash commands (/new, /restart) are also enqueued and cannot interrupt the stuck run
  • /tasks reports "All clear" even though the session is effectively dead
  • The only recovery is a full gateway restart

Steps to Reproduce

  1. In a direct WhatsApp chat, ask the agent to research a topic
  2. Agent does multiple web_search calls (all succeed)
  3. Agent calls the pdf tool on a large remote PDF (in this case, a 244-page PDF from www-cdn.anthropic.com)
  4. The pdf tool call never returns — no result, no error, no timeout
  5. Session becomes a zombie
  6. All subsequent messages (including /new, /restart) queue behind the stuck tool call and are never processed

Evidence from Logs

# Last entry in session transcript — pdf tool call with no response
timestamp: 2026-04-18T16:41:04.934Z
role: assistant
toolCall: pdf
  pdf: "https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf"
  pages: "1-20"
  prompt: "Find the exact section(s) that mention..."
# NO toolResult follows — session ends here

# Gateway diagnostic warning
lane wait exceeded: lane=nested waitedMs=164407 queueAhead=0

Expected Behavior

  1. The pdf tool should have a timeout (e.g., 60-120 seconds) after which it returns an error
  2. /new and /restart should be able to interrupt a stuck tool execution, not queue behind it
  3. If a tool call hangs, the session should eventually recover on its own rather than requiring a gateway restart
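
Expectation 1 asks for an overall deadline on the tool call, which is different from the per-chunk idle timeout that the linked PR added. A minimal sketch of such a deadline follows, assuming a hypothetical withDeadline helper that is not part of OpenClaw. It rejects after deadlineMs and signals cooperative work to stop through an AbortSignal.

```typescript
// Hypothetical sketch: withDeadline is an illustrative helper, not an OpenClaw API.
function withDeadline<T>(
  run: (signal: AbortSignal) => Promise<T>,
  deadlineMs: number,
): Promise<T> {
  const controller = new AbortController();
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort(); // ask cooperative work (e.g. a fetch) to stop
      reject(new Error(`tool call exceeded ${deadlineMs} ms`));
    }, deadlineMs);
    // Start the work on a microtask so a synchronous throw also becomes a rejection.
    Promise.resolve()
      .then(() => run(controller.signal))
      .then(
        (value) => { clearTimeout(timer); resolve(value); },
        (err) => { clearTimeout(timer); reject(err); },
      );
  });
}
```

A caller could wrap a tool invocation as withDeadline((signal) => runTool(args, signal), 120_000), where runTool is again hypothetical. The wrapper frees the awaiting caller either way, but work that ignores the signal keeps running in the background, so passing the signal down to the fetch layer still matters.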

Environment

  • OpenClaw version: 2026.4.14 → 2026.4.15 (issue present in both)
  • OS: macOS 26.3 (arm64)
  • Channel: WhatsApp
  • Model: openai-codex/gpt-5.4 (session fell back from claude-opus-4-6 due to separate auth issue)

Impact

  • User-facing: complete loss of the agent with no way to recover except restarting the gateway (which affects all agents)
  • The session appears alive (gateway shows it as active, updated recently) but is actually dead
  • Users send multiple messages thinking the agent is slow, which only deepens the queue

Suggested Fix

  1. Add a configurable timeout to the pdf tool (default ~120s)
  2. Allow /new and /restart to bypass the run queue and force-reset the session
  3. Consider a session-level watchdog that detects tool calls that have been pending beyond a threshold
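
Suggestion 3 can be illustrated with a small bookkeeping structure. This is only a sketch under stated assumptions: ToolCallWatchdog and PendingCall are hypothetical names, the threshold is arbitrary, and nothing on this page says OpenClaw implements a watchdog this way. The caller passes in the current time so the logic stays deterministic.

```typescript
// Hypothetical sketch: PendingCall and ToolCallWatchdog are illustrative names.
interface PendingCall {
  tool: string;
  startedAtMs: number;
}

class ToolCallWatchdog {
  private pending = new Map<string, PendingCall>();
  private thresholdMs: number;

  constructor(thresholdMs: number) {
    this.thresholdMs = thresholdMs;
  }

  /** Record that a tool call started at the given time. */
  start(callId: string, tool: string, nowMs: number): void {
    this.pending.set(callId, { tool, startedAtMs: nowMs });
  }

  /** Record that a tool call returned (with a result or an error). */
  finish(callId: string): void {
    this.pending.delete(callId);
  }

  /** Return ids of calls that have been pending longer than the threshold. */
  findStuck(nowMs: number): string[] {
    const stuck: string[] = [];
    this.pending.forEach((call, callId) => {
      if (nowMs - call.startedAtMs > this.thresholdMs) stuck.push(callId);
    });
    return stuck;
  }
}
```

With a 120 000 ms threshold, a call that has waited 164 407 ms (the waitedMs value in the gateway diagnostic) would be reported by findStuck, at which point a session could surface an error or reset itself.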

extent analysis

TL;DR

Implement a configurable timeout for the pdf tool to prevent indefinite hangs and consider a session-level watchdog to detect and recover from stuck tool calls.

Guidance

  • Introduce a timeout mechanism for the pdf tool, allowing it to return an error after a specified time (e.g., 60-120 seconds) to prevent session hangs.
  • Modify the handling of /new and /restart commands to bypass the run queue and force-reset the session, enabling recovery from stuck tool executions.
  • Explore the implementation of a session-level watchdog to detect tool calls that have been pending beyond a threshold, triggering automatic recovery or notification.
  • Review the pdf tool's error handling to ensure it properly reports errors or timeouts, rather than silently failing and causing session hangs.
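
The second guidance item, letting /new and /restart bypass the run queue, can be sketched as a lane with a reset path that does not wait for the active run. The linked PR explicitly leaves the gateway lane scheduler untouched, so the code below is purely illustrative: SessionLane, Run, and reset are hypothetical names and do not describe OpenClaw's scheduler.

```typescript
// Hypothetical sketch: Run and SessionLane are illustrative, not OpenClaw APIs.
type Run = (signal: AbortSignal) => Promise<void>;

class SessionLane {
  private queue: Run[] = [];
  private active: AbortController | null = null;

  /** Normal messages wait their turn behind the active run. */
  enqueue(run: Run): void {
    this.queue.push(run);
    this.pump();
  }

  /** Control commands skip the queue: abort the active run, drop queued work. */
  reset(): number {
    const dropped = this.queue.length;
    this.queue = [];
    this.active?.abort();
    this.active = null; // free the lane even if the old run never settles
    return dropped;
  }

  private pump(): void {
    if (this.active !== null) return;
    const next = this.queue.shift();
    if (next === undefined) return;
    const controller = new AbortController();
    this.active = controller;
    const finish = (): void => {
      // Ignore completions from a run that reset() already abandoned.
      if (this.active === controller) {
        this.active = null;
        this.pump();
      }
    };
    next(controller.signal).then(finish, finish);
  }
}
```

The design choice worth noting is that reset() clears the lane immediately instead of awaiting the aborted run, so a run that ignores its signal can no longer block later messages.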

Example

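The issue itself gives no implementation details, so the following is only a conceptual sketch of wrapping a tool call in a timeout. It is written in Python for illustration (the files named in the PR are TypeScript) and uses the third-party timeout_decorator package, whose default signal-based mode works only on the main thread of Unix-like systems:

import timeout_decorator  # third-party: pip install timeout-decorator

@timeout_decorator.timeout(120)  # abort the call with a timeout error after 120 seconds
def call_pdf_tool(url, pages, prompt):
    # Placeholder: the existing pdf tool call logic would go here
    pass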

Notes

The exact implementation details of the pdf tool and the session management logic are not provided, so the guidance is based on the information given in the issue description. The introduction of a timeout and a watchdog mechanism should help mitigate the issue but may require adjustments based on the specific requirements and constraints of the system.

Recommendation

Apply a workaround by introducing a configurable timeout for the pdf tool, as this directly addresses the identified issue of indefinite hangs and provides a clear path to recovery. This approach allows for a controlled failure and recovery, improving the overall resilience of the system.
