n8n - 💡(How to fix) Fix External task runner: Code nodes time out for hours while /healthz reports healthy, no self-recovery [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
n8n-io/n8n#29372Fetched 2026-04-29 06:35:13
View on GitHub
Comments
1
Participants
2
Timeline
6
Reactions
0
Author
Timeline (top)
labeled ×3commented ×1mentioned ×1subscribed ×1

Error Message

Error: Task request timed out at LocalTaskRequester.requestExpired (/usr/.../task-requester.ts:309:22) at LocalTaskRequester.onMessage (/usr/.../task-requester.ts:272:10) at TaskBroker.handleRequestTimeout (/usr/.../task-broker.service.ts:120:50) at Timeout.<anonymous> (/usr/.../task-broker.service.ts:107:9)

Root Cause

Possible root causes (for triage)

Code Example

Error: Task request timed out
    at LocalTaskRequester.requestExpired (/usr/.../task-requester.ts:309:22)
    at LocalTaskRequester.onMessage    (/usr/.../task-requester.ts:272:10)
    at TaskBroker.handleRequestTimeout (/usr/.../task-broker.service.ts:120:50)
    at Timeout.<anonymous>             (/usr/.../task-broker.service.ts:107:9)
RAW_BUFFERClick to expand / collapse

Bug Description

In N8N_RUNNERS_MODE=external, the broker↔runner channel can enter a stuck state where every task request times out at 60 seconds while the runner process is still alive and its HTTP /healthz keeps returning 200. The launcher's HTTP-based health monitor cannot detect this state, and there is no self-recovery path: the system stays stuck until something external (a deploy, a process OOM, or the launcher panicking) forces a container restart.

This is the same observable symptom as #23381 (Roc-Jia7's comment), #26629, #27406, #28881 and #28625 — but unlike those reports, this one has a single controlled instance, full timeline data across multiple recurrences, and a clear differentiation between "runner alive but unreachable via task channel" and "runner dead". I am also the author of #25958 / #25959, which fixed a different mechanism that produced the same surface symptom on the same Cloud Run + external runner setup, so this is not a duplicate of that fix.

Reproduced across versions and across multiple recurrences

Examination of 7 days of Cloud Logging (2026-04-21 to 2026-04-28) reveals three distinct stuck windows with the same signature (Task request timed out sustained at ~24/hour for the duration):

WindowPeriod (UTC)DurationEnded byn8n version
12026-04-24 23:00 → 2026-04-25 11:01~12 hlauncher panic → container restart2.15.0
22026-04-25 17:00 → 2026-04-27 06:22~37 hmanual deploy (2.15.0 → 2.17.7)2.17.7 (started post-upgrade)
32026-04-27 20:00 → 2026-04-28 06:29~10 hn8n-main JS heap OOM → container restart2.17.7

Total stuck time over 7 days: ~59 hours, recovered three different ways but never via any self-healing mechanism in n8n or the launcher.

This rules out the most common dismissals upfront:

  • "Fixed in a newer version" — bug occurs on both 2.15.0 and 2.17.7.
  • "One-off / cold-start" — three distinct windows on the same instance, repeating at ~1.3 day cadence on average.
  • "Resource exhaustion is the cause" — the three recoveries had nothing in common with each other (panic, manual deploy, separate main OOM); during the 37-hour window 2 the runner did not exit or OOM at all, ruling out runner-side resource pressure as the trigger.
  • "Single environment / Cloud Run-specific" — happens repeatedly in our environment, and the same surface symptom has been reported across Cloud Run, self-hosted Kubernetes (#26629), and n8n Cloud (#28881).

Detailed observed timeline of stuck window 1 (best-instrumented)

Time (UTC)Event
2026-04-24 23:42First HTTP 504 (60.04 s) on a watchdog workflow that exercises a Code node
2026-04-24 23:43First Error: Task request timed out in the n8n-main container logs
~11 hours277 Task request timed out errors at ~24 / hour; every Code node call hits 60 s timeout
n8n-runners container alive throughout, /healthz returns 200 (launcher emits no "Found runner unresponsive" warning)
2026-04-25 11:01:52n8n-runners process exits on its own (suspected OOM); launcher's cmd.Process.Kill() returns os.ErrProcessDone and panics — see https://github.com/n8n-io/task-runner-launcher/pull/98
2026-04-25 11:02Container restarts; the channel re-establishes; timeouts stop

So for ~11 hours: runner process up + /healthz 200 + every task times out at exactly 60 s.

Stuck windows 2 and 3 produced the same hourly cadence of Task request timed out errors at 21–28 / hour, sustained until the next external recovery event. Full Cloud Logging exports of all three windows are available on request.

Stack trace (from n8n-main, Cloud Logging)

Error: Task request timed out
    at LocalTaskRequester.requestExpired (/usr/.../task-requester.ts:309:22)
    at LocalTaskRequester.onMessage    (/usr/.../task-requester.ts:272:10)
    at TaskBroker.handleRequestTimeout (/usr/.../task-broker.service.ts:120:50)
    at Timeout.<anonymous>             (/usr/.../task-broker.service.ts:107:9)

The 60 s figure matches the default N8N_RUNNERS_TASK_TIMEOUT. The broker genuinely never receives a runner response for any task request during the stuck window, while the runner process and /healthz listener stay up.

Why this report is distinct from prior closed reports

  • Same instance, same revision, no scaling/cold-start involvement — rules out the "runner not started yet" / "wrong sidecar" / "queue-mode worker config" causes that prior reports landed on.
  • /healthz is unambiguously up the entire time — observable by the absence of "Found runner unresponsive" warnings from the launcher during all three stuck windows.
  • The stuck state ends only via external intervention (panic, manual deploy, or unrelated main-side OOM). It never recovers on its own — strongly indicating a stale connection / state on the broker↔runner channel that the current health model cannot see.

Why this is not a resource-sizing or environment issue

To preempt the natural "is this just OOM / sizing?" reading:

  1. The stuck state observed before any process self-died is the bug. Stuck windows 1 and 3 ended via a process OOM on the runner side and main side respectively, but stuck window 2 ran for 37 hours with no OOM, no exit(134), no exit(2), and no resource-exhaustion signal anywhere — and was only broken by a manual deploy. Resource pressure is not the trigger.
  2. Window 1 ended when the runner self-died after 12 hours; window 2 ended without any runner death after 37 hours. If runner memory exhaustion were the trigger, the durations would scale with runner memory — they do not, and the runner did not OOM at all in window 2.
  3. Throughout all three stuck windows, /healthz returned 200 and the launcher emitted no "Found runner unresponsive" warnings — i.e. the system reported healthy while every task failed. This is a logical contradiction in the observability surface, regardless of what resources are allocated.
  4. The same surface symptom has been reported on Cloud Run, self-hosted Kubernetes (#26629), and n8n Cloud (#28881), arguing against a single-environment cause.
  5. The fix class previously accepted by the team in #25959 — broker /healthz returning a value that does not reflect the actual readiness/health of the task channel — is exactly this kind of bug, just at runtime instead of at startup.

Possible root causes (for triage)

  1. WebSocket / long-poll connection between broker and runner appearing alive at TCP level but no longer carrying messages (idle Cloud Run CPU throttling, missing keepalive, half-closed socket).
  2. Runner-side event loop blocked while /healthz answers from a separate handler thread.
  3. Internal lock / ack-tracking deadlock in task-broker.service.ts.

What I think the right fix shape is

The launcher and the orchestrator (Cloud Run / Kubernetes) already have everything needed to restart a stuck runner — the launcher's monitorRunnerHealth polls /healthz and kills the runner on 6 consecutive failures, and Cloud Run / K8s liveness probes will do the same. The reason recovery does not happen is that none of these health signals reflect the actual liveness of the task channel during this stuck state.

So I do not think the right fix is a new self-recovery subsystem. The right fix is to close the gap so that the existing health signals tell the truth. Concretely, the broker (which is the only component that can observe "task channel is dead") should treat N consecutive task timeouts to a given runner as a connection failure and disconnect that runner. Once the connection drops, the existing launcher and orchestrator paths take over and recovery happens within seconds rather than hours.

This matches the shape of #25959 — surfacing real readiness/health on the existing endpoints, rather than building parallel mechanisms.

Happy to send a PR if this framing is acceptable.

To Reproduce

We have not been able to reduce this to a deterministic short test case. The conditions under which it occurs in our environment are:

  • External runner mode on Cloud Run (sidecar container layout: nginx, n8n-main, n8n-runners on a single instance, maxScale=1, containerConcurrency=80).
  • n8n-main env: N8N_RUNNERS_ENABLED=true, N8N_RUNNERS_MODE=external, N8N_RUNNERS_BROKER_LISTEN_ADDRESS=127.0.0.1, N8N_NATIVE_PYTHON_RUNNER=true, N8N_ENDPOINT_HEALTH=health (Cloud Run reserves /healthz, fixed in #25959).
  • n8n-runners sidecar with N8N_RUNNERS_TASK_BROKER_URI=http://127.0.0.1:5679.
  • Continuous Code-node traffic at any frequency, including very low (e.g. one Code node call every 2 minutes from a watchdog workflow).
  • Sustained uptime of multiple hours.

Under these conditions in our environment, the stuck state has recurred on average once every ~1.3 days and persists 10–37 hours each time, until something external (deploy, OOM, panic) forces a container restart. Full Cloud Logging exports for all three windows are available on request.

Expected behavior

When the broker↔runner task channel becomes unable to deliver requests, n8n (or the launcher) should detect this within a small multiple of N8N_RUNNERS_TASK_TIMEOUT and recover automatically — either by resetting the broker connection for that runner or by terminating and restarting the runner — rather than allowing every Code node execution to time out indefinitely while the system reports healthy.

Debug Info

Debug info

core

  • n8nVersion: 2.17.7
  • platform: npm
  • nodeJsVersion: 24.14.1
  • nodeEnv: production
  • database: postgres
  • executionMode: regular
  • concurrency: -1
  • license: enterprise (production)
  • consumerId: def9e2f5-a092-4761-8f72-48e3a8a75628

storage

  • success: all
  • error: all
  • progress: false
  • manual: true
  • binaryMode: filesystem

pruning

  • enabled: true
  • maxAge: 336 hours
  • maxCount: 10000 executions

client

  • userAgent: mozilla/5.0 (macintosh; intel mac os x 10_15_7) applewebkit/537.36 (khtml, like gecko) chrome/147.0.0.0 safari/537.36
  • isTouchDevice: false

Generated at: 2026-04-28T08:16:59.017Z

Operating System

unknown

n8n Version

2.15.0 and 2.17.7 (bug observed on both)

Node.js Version

24.14.1

Database

PostgreSQL

Execution mode

main (default)

Hosting

self hosted

extent analysis

TL;DR

The broker-runner channel can enter a stuck state where every task request times out, and the system stays stuck until an external intervention occurs, which can be mitigated by modifying the broker to treat consecutive task timeouts as a connection failure.

Guidance

  • The issue is likely caused by a stale connection or state on the broker-runner channel that the current health model cannot detect.
  • To verify, check the broker logs for consecutive task timeouts and the runner's /healthz endpoint status.
  • Modify the broker to treat N consecutive task timeouts to a given runner as a connection failure and disconnect that runner, allowing the existing launcher and orchestrator paths to take over and recover.
  • Review the task-broker.service.ts and LocalTaskRequester code to ensure proper handling of task timeouts and connection failures.

Example

No code example is provided as the issue requires a deeper understanding of the n8n codebase and the specific modifications needed to address the stuck state.

Notes

The fix should focus on surfacing real readiness/health on the existing endpoints rather than building parallel mechanisms. The modifications should be made to the broker to detect and handle consecutive task timeouts, allowing for automatic recovery.

Recommendation

Apply a workaround by modifying the broker to treat consecutive task timeouts as a connection failure, as this approach matches the shape of a previously accepted fix and leverages the existing health signals to recover from the stuck state.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

When the broker↔runner task channel becomes unable to deliver requests, n8n (or the launcher) should detect this within a small multiple of N8N_RUNNERS_TASK_TIMEOUT and recover automatically — either by resetting the broker connection for that runner or by terminating and restarting the runner — rather than allowing every Code node execution to time out indefinitely while the system reports healthy.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING