openclaw - ✅(Solved) Fix pdf tool can hang indefinitely, blocking session and preventing /new /restart recovery [1 pull requests, 1 participants]

crisandrews · 2026-04-18T17:16:07Z

GitHub stats

openclaw/openclaw#68649•Fetched 2026-04-19 15:09:07

Error Message

The pdf tool call never returns — no result, no error, no timeout
The pdf tool should have a timeout (e.g., 60-120 seconds) after which it returns an error

Fix Action

Fixed

Fixed by PR: PDF tool: abort stalled remote fetch after 30s idle timeout (https://github.com/openclaw/openclaw/pull/68747)

PR fix notes

PR #68747: PDF tool: abort stalled remote fetch after 30s idle timeout

Repository: openclaw/openclaw
Author: neeravmakwana
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/68747

Description (problem / solution / changelog)

Summary

Problem: The pdf tool can hang indefinitely when fetching a large remote PDF whose body read stalls (reported with a 244‑page www-cdn.anthropic.com download). The entire agent session becomes a zombie — subsequent messages, /new, and /restart all queue behind the stuck tool call and only a gateway restart recovers it. lane=nested waitedMs=164407 queueAhead=0 in the diagnostic confirms the nested tool lane is wedged inside the fetch.
Why it matters: Complete, silent loss of the agent with no user-facing recovery path. Users commonly re‑send messages thinking the agent is slow, which deepens the queue.
What changed: The pdf tool now passes a 30 000 ms readIdleTimeoutMs to loadWebMediaRaw on both the sandbox and non-sandbox remote-fetch branches. A new optional readIdleTimeoutMs on WebMediaOptions is forwarded into fetchRemoteMedia, which already plumbs the value through its body-reader to abort on stalled chunks. Matches the existing Matrix/Discord/Telegram/Tlon media-download pattern.
What did NOT change (scope boundary): No config surface, no new capabilities, no SSRF/sandbox/maxBytes/workspaceOnly/localRoots changes, no changes to anthropicAnalyzePdf / geminiAnalyzePdf / complete() overall request deadlines, no changes to the gateway lane scheduler (/new / /restart interruption is a separate concern and is deliberately out of scope here).

Change Type (select all)

Bug fix

Scope (select all touched areas)

Skills / tool execution

Linked Issue/PR

Closes #68649
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: src/agents/tools/pdf-tool.ts called loadWebMediaRaw(url, { maxBytes, localRoots }) with no idle timeout. src/media/web-media.ts then called fetchRemoteMedia({ url, maxBytes, ssrfPolicy }) — again with no readIdleTimeoutMs. fetchRemoteMedia already supports that option and forwards it to readResponseWithLimit({ chunkTimeoutMs }), but the pdf path never opted in. A CDN TCP/TLS connection that accepts the request and then stops yielding body chunks therefore has no abort signal at any layer and the await never resolves.
Missing detection / guardrail: No existing regression test covered an idle-body stall for the pdf tool (no timeout/abort/hang assertions in pdf-tool.test.ts, pdf-native-providers.test.ts, or pdf-tool.helpers.test.ts).
Contributing context: Matrix, Discord, Telegram, and Tlon media downloaders already set readIdleTimeoutMs; the pdf tool was the odd one out.
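The root cause above comes down to a body reader that waits on each chunk with no deadline. A minimal sketch of a per-chunk idle timeout follows. The helper name `readWithIdleTimeout` and its option shape are assumptions made for illustration; this is not the actual `readResponseWithLimit` code from openclaw, only the general technique it is described as using.

```typescript
// Hypothetical helper: reads a web ReadableStream and aborts when no chunk
// arrives within chunkTimeoutMs, or when the body grows past maxBytes.
async function readWithIdleTimeout(
  body: ReadableStream<Uint8Array>,
  opts: { maxBytes: number; chunkTimeoutMs: number },
): Promise<Uint8Array> {
  const reader = body.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;
  try {
    for (;;) {
      // A fresh timer per read: the deadline resets whenever a chunk arrives.
      let timer: ReturnType<typeof setTimeout> | undefined;
      const idle = new Promise<never>((_, reject) => {
        timer = setTimeout(
          () => reject(new Error(`body read idle for ${opts.chunkTimeoutMs} ms`)),
          opts.chunkTimeoutMs,
        );
      });
      const result = await Promise.race([reader.read(), idle]).finally(() =>
        clearTimeout(timer),
      );
      if (result.done) break;
      total += result.value.byteLength;
      if (total > opts.maxBytes) {
        throw new Error(`body exceeds ${opts.maxBytes} bytes`);
      }
      chunks.push(result.value);
    }
  } catch (err) {
    // Release the underlying connection before surfacing the error.
    await reader.cancel(err).catch(() => {});
    throw err;
  }
  // Concatenate the collected chunks into one buffer.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

Without the `idle` promise in the race, a stream that stops enqueuing chunks leaves `reader.read()` pending forever, which is the hang described in the root cause.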

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
Target test or file: src/agents/tools/pdf-tool.test.ts (new case: passes a fetch idle timeout when loading remote PDFs).
Scenario the test should lock in: executing the pdf tool with an https:// URL must hand loadWebMediaRaw a positive readIdleTimeoutMs, so stalled body reads can be aborted.
Why this is the smallest reliable guardrail: it is the single seam at which the pdf tool becomes responsible for the idle-abort contract; everything below that call is already covered by fetchRemoteMedia's existing idle-timeout tests in src/media/fetch.test.ts.
Existing test that already covers this: None.
If no new test is added, why not: N/A.
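The scenario in the test plan can be sketched without any test framework by injecting a recording stub in place of the loader. Everything here is hypothetical: `executePdfToolSketch`, the `Loader` type, the constant name, and the 10 MiB `maxBytes` value are stand-ins, and the real test in `pdf-tool.test.ts` may be structured quite differently. Only the 30 000 ms value and the "positive `readIdleTimeoutMs`" assertion come from the PR description.

```typescript
// Hypothetical shapes, chosen only to illustrate the assertion.
type LoadOptions = { maxBytes: number; readIdleTimeoutMs?: number };
type Loader = (url: string, options: LoadOptions) => Promise<{ buffer: Uint8Array }>;

// 30 000 ms is the value named in the PR; the constant name is an assumption.
const PDF_REMOTE_READ_IDLE_TIMEOUT_MS = 30_000;

// Stand-in for the remote branch of the pdf tool, with the loader injected.
async function executePdfToolSketch(
  input: { pdf: string },
  loadWebMediaRaw: Loader,
): Promise<{ bytes: number }> {
  const media = await loadWebMediaRaw(input.pdf, {
    maxBytes: 10 * 1024 * 1024, // assumed cap, for illustration only
    readIdleTimeoutMs: PDF_REMOTE_READ_IDLE_TIMEOUT_MS,
  });
  return { bytes: media.buffer.byteLength };
}

// The regression check: the loader must receive a positive idle timeout.
async function checkIdleTimeoutIsForwarded(): Promise<number> {
  const calls: LoadOptions[] = [];
  const stub: Loader = async (_url, options) => {
    calls.push(options);
    return { buffer: new Uint8Array(4) };
  };
  await executePdfToolSketch({ pdf: "https://example.com/report.pdf" }, stub);
  const timeout = calls[0]?.readIdleTimeoutMs;
  if (calls.length !== 1 || typeof timeout !== "number" || timeout <= 0) {
    throw new Error("pdf tool must pass a positive readIdleTimeoutMs");
  }
  return timeout;
}
```

Asserting at the call into the loader keeps the test independent of network behavior; the stall handling itself is left to the lower-level fetch tests the plan mentions.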

User-visible / Behavior Changes

Remote PDF downloads that stall for 30 s without yielding additional body bytes now abort with a standard fetch error surfaced as a tool error, instead of wedging the session. Healthy downloads behave identically.

Diagram (if applicable)

Before:
pdf tool -> loadWebMediaRaw({ maxBytes }) -> fetchRemoteMedia({ maxBytes, ssrfPolicy })
                                                   |
                                                   v
                                             body read stalls -> await never resolves
                                             -> nested tool lane wedged -> session zombie

After:
pdf tool -> loadWebMediaRaw({ maxBytes, readIdleTimeoutMs: 30_000 })
         -> fetchRemoteMedia({ ..., readIdleTimeoutMs: 30_000 })
         -> readResponseWithLimit({ chunkTimeoutMs: 30_000 })
         -> stall aborts -> tool returns error -> lane freed

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No (same requests, same SSRF guard, same hosts, same maxBytes)
Command/tool execution surface changed? No (same tool schema, same inputs, same outputs)
Data access scope changed? No

Security posture notes:

The new behavior is a pure runtime-enforced defense (body-read idle-abort on the fetch pipeline). It does not depend on prompt text and cannot be disabled from the model side.
The timeout is applied unconditionally to both remote-fetch branches (sandbox + non-sandbox) so there is no "bypass by path" case.
No existing security guard is relaxed: maxBytes, SSRF allowlist, workspaceOnly, localRoots, sandbox validation, and hostReadCapability checks all continue to run exactly as before.
Local-path reads are unaffected; readIdleTimeoutMs only applies to the https?:// branch of loadWebMediaInternal.
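The last note, that the timeout only matters on the `https?://` branch, can be pictured with a small dispatcher. This is a sketch under assumptions: `loadMediaSketch`, `MediaOptions`, and the injected `fetchRemote` / `readLocal` functions are hypothetical and do not reproduce `loadWebMediaInternal`; they only show an optional field being forwarded on the remote branch and ignored on the local one.

```typescript
// Hypothetical option and dependency shapes for illustration.
type MediaOptions = { maxBytes: number; readIdleTimeoutMs?: number };
type MediaDeps = {
  fetchRemote: (params: {
    url: string;
    maxBytes: number;
    readIdleTimeoutMs?: number;
  }) => Promise<Uint8Array>;
  readLocal: (path: string, maxBytes: number) => Promise<Uint8Array>;
};

async function loadMediaSketch(
  source: string,
  options: MediaOptions,
  deps: MediaDeps,
): Promise<Uint8Array> {
  if (/^https?:\/\//i.test(source)) {
    // Remote branch: the optional idle timeout is passed straight through.
    return deps.fetchRemote({
      url: source,
      maxBytes: options.maxBytes,
      readIdleTimeoutMs: options.readIdleTimeoutMs,
    });
  }
  // Local branch: no network read, so the idle timeout is not consulted.
  return deps.readLocal(source, options.maxBytes);
}
```

Because the field is optional, existing callers that pass only `maxBytes` keep compiling and simply forward `undefined`.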

Repro + Verification

Environment

OS: macOS (darwin 25.3.0)
Runtime/container: Node 22, pnpm
Model/provider: any pdf-capable model (reproduced with openai-codex/gpt-5.4; original report also on claude-opus-4-6)
Integration/channel: WhatsApp (not channel-specific; any path that invokes the pdf tool)
Relevant config: defaults

Steps

Ask the agent to analyze a large remote PDF (e.g. a 200+‑page document from a CDN that stalls mid‑download).
Observe that without this patch, the tool call never returns and the session wedges.
With this patch, the fetch aborts after 30 s of idle body reads and the tool returns a standard fetch error to the model, which can react normally.

Expected

Stalled PDF fetch aborts within ~30 s and the session remains responsive.

Actual (with fix)

Matches expected. Fixed.

Evidence

Failing test/log before + passing after

Logs/tests:

$ pnpm test src/agents/tools/pdf-tool.test.ts
 Test Files  1 passed (1)
      Tests  12 passed (12)

$ pnpm test src/media/web-media.test.ts src/media/fetch.test.ts
 Test Files  2 passed (2)
      Tests  36 passed (36)

Full pnpm check (tsgo:core, tsgo:core:test, tsgo:extensions, tsgo:extensions:test, oxlint, webhook/body-read lint, pairing/account-scope lints, import-cycle + madge checks) ran green via the pre-commit hook.

Human Verification (required)

Verified scenarios:
- New unit test asserts the pdf tool forwards readIdleTimeoutMs to loadWebMediaRaw on a remote URL.
- Existing pdf-tool, web-media, and fetch tests remain green.
- pnpm check passes end-to-end.
Edge cases checked:
- Sandbox branch and non-sandbox branch both receive the timeout (both loadWebMediaRaw call sites updated).
- Local-path pdf reads are unaffected (the idle-timeout only applies under the https?:// branch of loadWebMediaInternal).
- Callers that pass a plain maxBytes number to loadWebMedia/loadWebMediaRaw are unchanged (new field is optional).
What I did not verify:
- End-to-end reproduction against the exact original upstream CDN URL (would require a reliably stalling remote). The behavior is covered by the unit test, the existing readResponseWithLimit chunk-timeout tests, and the matching pattern already live in Matrix/Discord/Telegram/Tlon media downloaders.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Risks and Mitigations

Risk: a pathological but healthy remote server that legitimately goes quiet between body chunks for more than 30 s could trip the abort.
- Mitigation: matches the idle-timeout values already used by Matrix (30 s) and other channels; well above typical TLS/HTTP liveness; the model receives a normal tool error and can retry. No change to maxBytes or overall request size caps.
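The risk above depends on the difference between an idle timeout and a total deadline. The small model below makes that concrete. It is a hypothetical helper written for this note, not code from the PR, and it assumes the timer starts when the body read begins, resets on every chunk, and that the last arrival completes the download.

```typescript
// Returns the time (ms since the body read began) at which an idle timeout
// would fire, or null if every gap between chunks stays within the limit.
function idleAbortAtMs(
  chunkArrivalsMs: readonly number[],
  idleTimeoutMs: number,
): number | null {
  let lastActivityMs = 0;
  for (const arrivalMs of chunkArrivalsMs) {
    if (arrivalMs - lastActivityMs > idleTimeoutMs) {
      return lastActivityMs + idleTimeoutMs;
    }
    lastActivityMs = arrivalMs; // each chunk resets the idle clock
  }
  return null;
}

// Slow but steady: one chunk every 20 s for 200 s in total, never idle for 30 s.
const steady = Array.from({ length: 10 }, (_, i) => (i + 1) * 20_000);
console.log(idleAbortAtMs(steady, 30_000)); // null

// Stalled: chunks at 1 s and 2 s, then nothing until 40 s.
console.log(idleAbortAtMs([1_000, 2_000, 40_000], 30_000)); // 32000
```

A 200 s download survives because no single gap exceeds 30 s, whereas a server that goes quiet for 38 s is cut off at the 32 s mark. Only the second pattern is exposed to the risk described.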

Made with Cursor

Changed files

CHANGELOG.md (modified, +1/-0)
src/agents/tools/pdf-tool.test.ts (modified, +22/-0)
src/agents/tools/pdf-tool.ts (modified, +5/-0)
src/media/web-media.ts (modified, +9/-1)

Bug Description

The pdf tool can hang indefinitely when fetching/processing a large remote PDF, causing the entire agent session to become a zombie. While the session is stuck:

No new messages from the user are processed (they queue behind the stuck tool call)
Slash commands (/new, /restart) are also enqueued and cannot interrupt the stuck run
/tasks reports "All clear" even though the session is effectively dead
The only recovery is a full gateway restart

Steps to Reproduce

In a direct WhatsApp chat, ask the agent to research a topic
Agent does multiple web_search calls (all succeed)
Agent calls the pdf tool on a large remote PDF (in this case, a 244-page PDF from www-cdn.anthropic.com)
The pdf tool call never returns — no result, no error, no timeout
Session becomes a zombie
All subsequent messages (including /new, /restart) queue behind the stuck tool call and are never processed

Evidence from Logs

# Last entry in session transcript — pdf tool call with no response
timestamp: 2026-04-18T16:41:04.934Z
role: assistant
toolCall: pdf
  pdf: "https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf"
  pages: "1-20"
  prompt: "Find the exact section(s) that mention..."
# NO toolResult follows — session ends here

# Gateway diagnostic warning
lane wait exceeded: lane=nested waitedMs=164407 queueAhead=0

Expected Behavior

The pdf tool should have a timeout (e.g., 60-120 seconds) after which it returns an error
/new and /restart should be able to interrupt a stuck tool execution, not queue behind it
If a tool call hangs, the session should eventually recover on its own rather than requiring a gateway restart

Environment

OpenClaw version: 2026.4.14 → 2026.4.15 (issue present in both)
OS: macOS 26.3 (arm64)
Channel: WhatsApp
Model: openai-codex/gpt-5.4 (session fell back from claude-opus-4-6 due to separate auth issue)

Impact

User-facing: complete loss of the agent with no way to recover except restarting the gateway (which affects all agents)
The session appears alive (gateway shows it as active, updated recently) but is actually dead
Users send multiple messages thinking the agent is slow, which only deepens the queue

Suggested Fix

Add a configurable timeout to the pdf tool (default ~120s)
Allow /new and /restart to bypass the run queue and force-reset the session
Consider a session-level watchdog that detects tool calls that have been pending beyond a threshold
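The third suggestion, a session-level watchdog, could start from a check as simple as the one below. This is a sketch only: `PendingToolCall` and `findStuckToolCalls` are hypothetical names, the gateway's real bookkeeping is not described in the issue, and the example treats the logged `waitedMs` value as the elapsed time of the tool call purely for illustration.

```typescript
// Hypothetical record of a tool call that has started but not yet returned.
interface PendingToolCall {
  sessionId: string;
  tool: string;
  startedAtMs: number;
}

// Returns the calls that have been pending longer than thresholdMs.
function findStuckToolCalls(
  pending: readonly PendingToolCall[],
  nowMs: number,
  thresholdMs: number,
): PendingToolCall[] {
  return pending.filter((call) => nowMs - call.startedAtMs > thresholdMs);
}

// Timestamp and wait time taken from the log excerpt in this issue.
const pdfStartedAtMs = Date.parse("2026-04-18T16:41:04.934Z");
const nowMs = pdfStartedAtMs + 164_407;

const stuck = findStuckToolCalls(
  [
    { sessionId: "session-a", tool: "pdf", startedAtMs: pdfStartedAtMs },
    { sessionId: "session-b", tool: "web_search", startedAtMs: nowMs - 5_000 },
  ],
  nowMs,
  120_000, // the ~120 s threshold suggested above
);
console.log(stuck.map((call) => call.tool)); // ["pdf"]
```

What the watchdog does with a flagged call (cancel it, notify the user, reset the session) is a separate design decision that the issue leaves open.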

extent analysis

TL;DR

Implement a configurable timeout for the pdf tool to prevent indefinite hangs and consider a session-level watchdog to detect and recover from stuck tool calls.

Guidance

Introduce a timeout mechanism for the pdf tool, allowing it to return an error after a specified time (e.g., 60-120 seconds) to prevent session hangs.
Modify the handling of /new and /restart commands to bypass the run queue and force-reset the session, enabling recovery from stuck tool executions.
Explore the implementation of a session-level watchdog to detect tool calls that have been pending beyond a threshold, triggering automatic recovery or notification.
Review the pdf tool's error handling to ensure it properly reports errors or timeouts, rather than silently failing and causing session hangs.

Example

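The issue itself gives no implementation details. As a language-agnostic illustration of the concept, a hard timeout around the pdf tool call could look like this in Python, using the third-party timeout_decorator package (the actual tool is TypeScript, in src/agents/tools/pdf-tool.ts):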

import timeout_decorator

@timeout_decorator.timeout(120)  # 120-second timeout
def call_pdf_tool(url, pages, prompt):
    # Existing pdf tool call logic here
    pass
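Since the pdf tool itself is TypeScript, the same idea can be sketched there with `Promise.race` and an `AbortController`. The helper name `withTimeout` and the callback shape are assumptions for illustration. Note that this is an overall deadline on the whole call, which is a different mechanism from the per-chunk idle timeout the PR adopted.

```typescript
// Hypothetical wrapper: rejects if `run` has not settled within timeoutMs,
// and signals cancellation so the wrapped work can stop.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`tool call exceeded ${timeoutMs} ms`);
      controller.abort(err); // let the wrapped work observe the cancellation
      reject(err);
    }, timeoutMs);
  });
  try {
    return await Promise.race([run(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // avoid a stray timer after the race settles
  }
}
```

A caller would pass the signal on to whatever does the waiting, for example `withTimeout((signal) => fetch(url, { signal }), 120_000)`, so that the underlying request is actually cancelled rather than merely abandoned.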

Notes

The exact implementation details of the pdf tool and the session management logic are not provided, so the guidance is based on the information given in the issue description. The introduction of a timeout and a watchdog mechanism should help mitigate the issue but may require adjustments based on the specific requirements and constraints of the system.

Recommendation

Apply a workaround by introducing a configurable timeout for the pdf tool, as this directly addresses the identified issue of indefinite hangs and provides a clear path to recovery. This approach allows for a controlled failure and recovery, improving the overall resilience of the system.
