n8n: WaitTracker polls terminal-state executions with stale waitTill, causing negative-timeout loop and silent main process exit (queue mode, 2.15.x) [2 comments, 3 participants]

GitHub stats
n8n-io/n8n#29370 (fetched 2026-04-29 06:35:15)
Comments: 2
Participants: 3
Timeline: 7
Reactions: 0
Timeline (top): labeled ×3, commented ×2, mentioned ×1, subscribed ×1


Bug Description

In queue mode, n8n main intermittently exits silently (exit code 0, no OOMKilled, no kernel OOM, no FATAL/uncaughtException stderr output) at irregular intervals (observed ~5-35 min apart on a busy instance). Docker restart: always brings it back, so externally it looks like a soft crash loop. Every restart in queue mode produces a noisy side-effect: in-flight executions are marked running in postgres → on next startup, registered error workflows fire WorkflowCrashedError: "Workflow did not finish, possible out-of-memory issue" for those rows even though the worker on the other side eventually completes them successfully. From the user's perspective this manifests as random false-positive crash alerts on workflows that actually succeeded.

After a multi-hour debug capture (N8N_LOG_LEVEL=debug + --trace-warnings --trace-uncaught on main, --unhandled-rejections=warn confirmed it is not a Promise rejection), the smoking gun was found in the WaitTracker poll loop, repeating every 60s:

debug | Querying database for waiting executions {"file":"wait-tracker.js","function":"getWaitingExecutions"}
debug | Found 1 executions. Setting timer for IDs: <id> {"file":"wait-tracker.js","function":"getWaitingExecutions"}
(node:N) TimeoutNegativeWarning: -622830095 is a negative number.
    at new Timeout (node:internal/timers:194:17)
    at WaitTracker.getWaitingExecutions (/usr/local/lib/node_modules/n8n/src/wait-tracker.ts:81:13)
debug | Resuming execution <id> {"file":"wait-tracker.js","function":"startExecution"}
debug | Execution <id> is already being resumed, skipping duplicate resume {"file":"wait-tracker.js","function":"startExecution"}

Root Cause

ExecutionRepository.getWaitingExecutions() returns rows where waitTill IS NOT NULL (with a lookahead window in newer versions) without filtering by status='waiting'. As a result, executions that hit a Wait node and then terminally fail (status='error' or 'crashed' — e.g. because the workflow itself errored on resume, or n8n crashed mid-execution and the row was promoted to crashed by recovery) retain their original waitTill in the database. Those zombie rows stay in the query result forever.

For each zombie, every poll cycle:

  1. WaitTracker computes setTimeout(callback, waitTill − now) → large negative number once waitTill is in the past.
  2. Node clamps to 1ms and emits TimeoutNegativeWarning.
  3. The timer fires almost immediately and startExecution(id) is called.
  4. The "already being resumed" guard prevents the resume from completing (the execution is in error/crashed, not waiting), but the row is not removed from the next poll.
  5. Cycle repeats every poll interval. Each iteration leaks something (event-loop pressure, finalizer churn, possibly handles); after a variable interval main exits with code 0 and no log signal.
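
The arithmetic in steps 1 and 2 can be sketched as follows. This is a minimal illustration and not n8n's WaitTracker code: `delayUntil` and `effectiveDelay` are hypothetical helper names, and the clamp simply models the behaviour described in step 2.

```typescript
// Hypothetical helpers illustrating steps 1-2; not taken from n8n's source.

/** Raw delay handed to setTimeout: negative once waitTill is in the past. */
function delayUntil(waitTill: Date, now: Date): number {
  return waitTill.getTime() - now.getTime();
}

/** Models the clamp described in step 2: a delay below 1 ms is treated as 1 ms. */
function effectiveDelay(rawDelayMs: number): number {
  return rawDelayMs < 1 ? 1 : rawDelayMs;
}

const now = new Date("2026-04-29T06:00:00Z");
// A waitTill roughly 7.2 days in the past, matching the value in the warning.
const staleWaitTill = new Date(now.getTime() - 622830095);
const raw = delayUntil(staleWaitTill, now);

console.log(raw);                 // -622830095
console.log(effectiveDelay(raw)); // 1, so the timer fires almost immediately
```

Because the stale row is returned again on the next poll, the same near-immediate timer is recreated every cycle.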

A direct DB query confirms the data shape:

SELECT id, status, "waitTill"
FROM execution_entity
WHERE "waitTill" IS NOT NULL
ORDER BY "waitTill";

A healthy instance returns only status='waiting' rows with waitTill in the future. Affected instances also return status ∈ {error, crashed, success, canceled} with waitTill in the (sometimes very distant) past — those are the zombies. On the instance where this was diagnosed, the oldest zombie was 7 days stale; total of 5 zombies accumulated over ~7 days of uptime.
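
The healthy/zombie distinction described above can also be expressed in code, for example when post-processing the query result. The sketch below assumes a row shape with the three selected columns; `ExecutionRow`, `TERMINAL_STATUSES`, and `findZombies` are illustrative names, not n8n identifiers.

```typescript
// Illustrative only: the row shape mirrors the three columns selected above.
type ExecutionRow = { id: string; status: string; waitTill: Date | null };

const TERMINAL_STATUSES = ["error", "crashed", "success", "canceled"];

/** A zombie is a terminal-state row that still carries a waitTill value. */
function findZombies(rows: ExecutionRow[]): ExecutionRow[] {
  return rows.filter(
    (r) => r.waitTill !== null && TERMINAL_STATUSES.indexOf(r.status) !== -1,
  );
}

const rows: ExecutionRow[] = [
  { id: "1", status: "waiting", waitTill: new Date("2030-01-01T00:00:00Z") },
  { id: "2", status: "error", waitTill: new Date("2020-01-01T00:00:00Z") },
  { id: "3", status: "success", waitTill: null },
];

// Only row "2" qualifies: terminal status with waitTill still populated.
console.log(findZombies(rows).map((r) => r.id));
```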

Reproduction (deterministic)

  1. Build any workflow with a Wait node followed by a node that always throws (e.g. a Code node with throw new Error('boom')).
  2. Trigger it. The execution enters status='waiting' with waitTill set.
  3. After the wait elapses, n8n attempts to resume → workflow throws → status flips to error; waitTill is NOT cleared (this is the bug).
  4. Inspect execution_entity: row has status='error' AND waitTill is populated.
  5. Restart n8n main. WaitTracker now finds the row on every poll, computes a negative setTimeout delta, emits TimeoutNegativeWarning, and re-attempts resume in a loop.
  6. With debug logging on, the repeating "Setting timer for IDs / Resuming execution / already being resumed" pattern is visible. Process eventually exits silently after a variable interval.

(I did not reliably reproduce the crash in a controlled environment — only the looping warning + duplicate-resume pattern. But on a production instance with real traffic, the loop correlates 1:1 with silent exits, and clearing the zombie rows eliminated 14 restarts in 14h → 0 restarts in 20h+.)

Workaround

Clean the zombies in postgres (safe — terminal-state executions cannot legitimately be resumed):

UPDATE execution_entity SET "waitTill" = NULL
WHERE status IN ('error', 'crashed', 'success', 'canceled')
  AND "waitTill" IS NOT NULL;

WaitTracker drops them at the next poll. No restart needed. The next iteration reports zero or only legitimate status='waiting' rows, and TimeoutNegativeWarning stops appearing.

A daily cron running this is a reasonable belt-and-braces preventive — the zombies regrow slowly (only when a Wait-using workflow terminally fails between resumes), but a few per week per active instance is plausible.
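
If the cleanup is scheduled from a script rather than from cron, it could look roughly like this. The sketch assumes a generic `QueryFn` callback standing in for whatever database client is in use; `QueryFn`, `clearZombieWaitTill`, and `scheduleCleanup` are hypothetical names, not part of n8n or any Postgres driver.

```typescript
// Sketch of a scheduled cleanup. `QueryFn` is a hypothetical stand-in for the
// database client in use; it is not an n8n or driver API.
type QueryFn = (sql: string) => Promise<{ rowCount: number }>;

// Same statement as the workaround above.
const CLEANUP_SQL = `
UPDATE execution_entity SET "waitTill" = NULL
WHERE status IN ('error', 'crashed', 'success', 'canceled')
  AND "waitTill" IS NOT NULL`;

/** Runs the workaround statement once and returns how many rows were cleared. */
async function clearZombieWaitTill(query: QueryFn): Promise<number> {
  const result = await query(CLEANUP_SQL);
  return result.rowCount;
}

/** Repeats the cleanup on a fixed interval (24 h by default); returns a stop function. */
function scheduleCleanup(
  query: QueryFn,
  intervalMs: number = 24 * 60 * 60 * 1000,
): () => void {
  const timer = setInterval(() => {
    clearZombieWaitTill(query).catch((err) => console.error("cleanup failed", err));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Passing the query function in as a parameter keeps the sketch independent of any particular client library and makes it easy to exercise with a fake.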

Suggested Fix

Two layers, both desirable:

  1. Filter the query. In ExecutionRepository.getWaitingExecutions(), add AND status = 'waiting' (or equivalent QueryBuilder clause) to the select. This is the smallest possible fix and addresses the root issue: a row in any terminal status should never be considered for resume regardless of waitTill.
  2. Clear waitTill on terminal status transitions. Wherever execution_entity.status is updated to a terminal value (error, crashed, success, canceled), also SET waitTill = NULL. Defense in depth — protects against any future query path that doesn't filter by status, and keeps the data shape coherent.
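
The two layers can be modelled in memory as a rough sketch. This is not n8n's ExecutionRepository code: `Execution`, `selectWaiting`, and `setStatus` are hypothetical names, and the status list is limited to the values mentioned in this issue.

```typescript
// In-memory model of the two suggested layers. All names are hypothetical
// and do not correspond to n8n's ExecutionRepository code.
type Status = "running" | "waiting" | "success" | "error" | "crashed" | "canceled";
interface Execution {
  id: string;
  status: Status;
  waitTill: Date | null;
}

const TERMINAL: Status[] = ["success", "error", "crashed", "canceled"];

/** Layer 1: only rows that are actually waiting are candidates for resume. */
function selectWaiting(rows: Execution[]): Execution[] {
  return rows.filter((r) => r.status === "waiting" && r.waitTill !== null);
}

/** Layer 2: moving to a terminal status also clears waitTill. */
function setStatus(row: Execution, status: Status): Execution {
  const terminal = TERMINAL.indexOf(status) !== -1;
  return { ...row, status, waitTill: terminal ? null : row.waitTill };
}

const waiting: Execution = { id: "42", status: "waiting", waitTill: new Date("2026-05-01T00:00:00Z") };
// A pre-existing zombie, as left behind by the current behaviour.
const zombie: Execution = { id: "7", status: "error", waitTill: new Date("2026-04-22T00:00:00Z") };
const failed = setStatus({ ...waiting, id: "43" }, "error");

console.log(failed.waitTill);                                        // null
console.log(selectWaiting([waiting, zombie, failed]).map((r) => r.id)); // only "42"
```

Layer 1 alone already excludes the existing zombie; layer 2 stops new ones from being created.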

Environment

  • n8n Version: 2.15.1 (pinned; tracking issue #28514 prevents upgrade to 2.16/2.17 in this deployment, but inspecting getWaitingExecutions() in the post-#27066 code suggests the same query — sans status filter — is still in place, so this likely affects newer versions too)
  • Node.js Version: 20.x and 24.x (default n8n image)
  • Database: PostgreSQL 16
  • Execution mode: queue (1 main + 3 workers + Redis + Postgres, Docker)
  • Hosting: self-hosted

Related

  • #27066 (merged) — reworked WaitTracker but did not add a status filter to the query, so this bug likely persists post-merge.
  • #28541 (closed) — adjacent dbTime crash on 2.16/2.17 Postgres. Different bug, same area.
  • #15123, #8136 — unrelated wait-tracker behaviour issues.

extent analysis

TL;DR

The most likely fix is to update the ExecutionRepository.getWaitingExecutions() query to filter by status = 'waiting' and clear waitTill on terminal status transitions.

Guidance

  1. Verify the issue: Run the provided SQL query to check for zombie rows in the execution_entity table.
  2. Apply the workaround: Run the provided SQL update query to clean the zombies in Postgres.
  3. Implement the suggested fix: Update the ExecutionRepository.getWaitingExecutions() query to filter by status = 'waiting' and clear waitTill on terminal status transitions.
  4. Monitor the instance: After applying the fix, monitor the instance for any further silent exits or TimeoutNegativeWarning messages.
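
For the monitoring step, one simple option is to count warning lines in captured log output. The helper below is a hypothetical sketch (`countNegativeTimeoutWarnings` is not an n8n utility); the sample lines are taken from the debug capture in the bug description.

```typescript
// Hypothetical log check for step 4: counts TimeoutNegativeWarning lines
// in captured output.
function countNegativeTimeoutWarnings(logText: string): number {
  return logText
    .split("\n")
    .filter((line) => line.indexOf("TimeoutNegativeWarning") !== -1).length;
}

const sample = [
  'debug | Querying database for waiting executions {"file":"wait-tracker.js","function":"getWaitingExecutions"}',
  "(node:N) TimeoutNegativeWarning: -622830095 is a negative number.",
  'debug | Resuming execution <id> {"file":"wait-tracker.js","function":"startExecution"}',
].join("\n");

console.log(countNegativeTimeoutWarnings(sample)); // 1
```

A count that stays at zero after the cleanup suggests that no stale rows are being picked up by the poll.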

Notes

The issue is specific to the queue execution mode and may affect newer versions of n8n. The suggested fix is a two-layer approach to prevent similar issues in the future.

Recommendation

Apply the workaround by running the SQL update query to clean the zombies in Postgres, and then implement the suggested fix to update the ExecutionRepository.getWaitingExecutions() query. This will prevent further silent exits and TimeoutNegativeWarning messages.
