n8n: WaitTracker polls terminal-state executions with stale waitTill, causing negative-timeout loop and silent main process exit (queue mode, 2.15.x) [2 comments, 3 participants]

n8n · 2026-04-28 08:24:51

n8n-io/n8n#29370•Fetched 2026-04-29 06:35:15

Error Message

In queue mode, n8n main intermittently exits silently (exit code 0, no OOMKilled, no kernel OOM, no FATAL/uncaughtException output on stderr) at irregular intervals, observed ~5-35 min apart on a busy instance. Docker's restart: always policy brings it back, so from the outside it looks like a soft crash loop.

Every restart in queue mode has a noisy side effect: in-flight executions are still marked running in Postgres, so on the next startup registered error workflows fire WorkflowCrashedError: "Workflow did not finish, possible out-of-memory issue" for those rows, even though the worker on the other side eventually completes them successfully. From the user's perspective this shows up as random false-positive crash alerts on workflows that actually succeeded.

Root Cause

ExecutionRepository.getWaitingExecutions() returns rows where waitTill IS NOT NULL (with a lookahead window in newer versions) without filtering by status='waiting'. As a result, executions that hit a Wait node and then terminally fail (status='error' or 'crashed' — e.g. because the workflow itself errored on resume, or n8n crashed mid-execution and the row was promoted to crashed by recovery) retain their original waitTill in the database. Those zombie rows stay in the query result forever.

For each zombie, every poll cycle:

1. WaitTracker computes setTimeout(callback, waitTill − now) → large negative number once waitTill is in the past.
2. Node clamps to 1ms and emits TimeoutNegativeWarning.
3. The timer fires almost immediately and startExecution(id) is called.
4. The "already being resumed" guard prevents the resume from completing (the execution is in error/crashed, not waiting), but the row is not removed from the next poll.
5. Cycle repeats every poll interval. Each iteration leaks something (event-loop pressure, finalizer churn, possibly handles); after a variable interval main exits with code 0 and no log signal.
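
The arithmetic behind the first and last steps can be sketched in a few lines. This is a minimal illustration, not n8n's WaitTracker code: the WaitingRow shape and both function names are assumptions, and the 60s default is the poll interval observed in the debug capture.

```typescript
// Hypothetical row shape for illustration; not n8n's actual entity type.
interface WaitingRow {
  id: string;
  waitTill: Date;
}

// The delay a poll would hand to setTimeout for this row: waitTill minus now.
function timerDelayMs(row: WaitingRow, now: Date): number {
  return row.waitTill.getTime() - now.getTime();
}

// Counts how many of `cycles` consecutive polls would compute a negative delay
// for the row, assuming the row keeps being returned by the query.
function negativeTimersOverPolls(
  row: WaitingRow,
  firstPoll: Date,
  cycles: number,
  pollIntervalMs: number = 60_000,
): number {
  let negatives = 0;
  for (let i = 0; i < cycles; i++) {
    const now = new Date(firstPoll.getTime() + i * pollIntervalMs);
    if (timerDelayMs(row, now) < 0) negatives++;
  }
  return negatives;
}
```

For a row whose waitTill lies a week in the past, every poll computes a negative delay, so one hour of uptime at a 60s interval means 60 near-immediate timer firings for that single row.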

A direct DB query confirms the data shape:

SELECT id, status, "waitTill"
FROM execution_entity
WHERE "waitTill" IS NOT NULL
ORDER BY "waitTill";

A healthy instance returns only status='waiting' rows with waitTill in the future. Affected instances also return status ∈ {error, crashed, success, canceled} with waitTill in the (sometimes very distant) past — those are the zombies. On the instance where this was diagnosed, the oldest zombie was 7 days stale; total of 5 zombies accumulated over ~7 days of uptime.
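
The healthy-versus-zombie distinction can be applied mechanically to the output of that query. The sketch below is illustrative only: QueryRow, Zombie and findZombies are hypothetical names, and the rows are assumed to have been fetched already by whatever Postgres client is in use.

```typescript
type ExecutionStatus = 'waiting' | 'running' | 'error' | 'crashed' | 'success' | 'canceled';

// One row of: SELECT id, status, "waitTill" FROM execution_entity WHERE "waitTill" IS NOT NULL
interface QueryRow {
  id: string;
  status: ExecutionStatus;
  waitTill: Date;
}

interface Zombie {
  id: string;
  status: ExecutionStatus;
  staleDays: number; // how far waitTill lies in the past; negative if it is still in the future
}

const TERMINAL_STATUSES: ExecutionStatus[] = ['error', 'crashed', 'success', 'canceled'];

// A zombie is any row in a terminal status that still carries a waitTill.
function findZombies(rows: QueryRow[], now: Date): Zombie[] {
  return rows
    .filter((row) => TERMINAL_STATUSES.indexOf(row.status) !== -1)
    .map((row) => ({
      id: row.id,
      status: row.status,
      staleDays: (now.getTime() - row.waitTill.getTime()) / 86_400_000,
    }));
}
```

An empty result corresponds to the healthy case; any entry is a row the unfiltered poll would keep picking up.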

Fix Action

Workaround

Clean the zombies in postgres (safe — terminal-state executions cannot legitimately be resumed):

UPDATE execution_entity SET "waitTill" = NULL
WHERE status IN ('error', 'crashed', 'success', 'canceled')
  AND "waitTill" IS NOT NULL;

WaitTracker drops them at the next poll. No restart needed. The next iteration reports zero or only legitimate status='waiting' rows, and TimeoutNegativeWarning stops appearing.

A daily cron running this is a reasonable belt-and-braces preventive — the zombies regrow slowly (only when a Wait-using workflow terminally fails between resumes), but a few per week per active instance is plausible.
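
One way to run that daily cleanup from a small sidecar script is sketched below. Everything except the SQL statement is an assumption: RunSql stands in for whatever Postgres client wrapper is available, and clearStaleWaitTill and msUntilNextRun are hypothetical helper names.

```typescript
// The workaround statement from above, unchanged.
const CLEANUP_SQL = `UPDATE execution_entity SET "waitTill" = NULL
WHERE status IN ('error', 'crashed', 'success', 'canceled')
  AND "waitTill" IS NOT NULL;`;

// Assumed interface: runs a SQL string and resolves to the number of affected rows.
type RunSql = (sql: string) => Promise<number>;

// Executes the cleanup once and reports how many zombie rows were cleared.
async function clearStaleWaitTill(runSql: RunSql): Promise<number> {
  return runSql(CLEANUP_SQL);
}

// Milliseconds from `now` until the next occurrence of hourUtc:00:00 UTC.
function msUntilNextRun(now: Date, hourUtc: number): number {
  const next = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hourUtc),
  );
  if (next.getTime() <= now.getTime()) {
    next.setUTCDate(next.getUTCDate() + 1); // today's slot has passed, use tomorrow's
  }
  return next.getTime() - now.getTime();
}
```

A caller could wait msUntilNextRun(new Date(), 3) milliseconds, call clearStaleWaitTill, and repeat. A plain cron entry that runs the same statement through psql would do the same job with less code.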

Bug Description

After a multi-hour debug capture (N8N_LOG_LEVEL=debug + --trace-warnings --trace-uncaught on main, --unhandled-rejections=warn confirmed it is not a Promise rejection), the smoking gun was found in the WaitTracker poll loop, repeating every 60s:

debug | Querying database for waiting executions {"file":"wait-tracker.js","function":"getWaitingExecutions"}
debug | Found 1 executions. Setting timer for IDs: <id> {"file":"wait-tracker.js","function":"getWaitingExecutions"}
(node:N) TimeoutNegativeWarning: -622830095 is a negative number.
    at new Timeout (node:internal/timers:194:17)
    at WaitTracker.getWaitingExecutions (/usr/local/lib/node_modules/n8n/src/wait-tracker.ts:81:13)
debug | Resuming execution <id> {"file":"wait-tracker.js","function":"startExecution"}
debug | Execution <id> is already being resumed, skipping duplicate resume {"file":"wait-tracker.js","function":"startExecution"}
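
The number in the warning is the delay passed to setTimeout in milliseconds, so it can be read as the age of the stale waitTill. A small sketch (the function name is hypothetical) that extracts it from a log line and converts it to days:

```typescript
// Matches the warning text exactly as it appears in the capture above.
const WARNING_RE = /TimeoutNegativeWarning: (-\d+) is a negative number/;

const MS_PER_DAY = 86_400_000;

// Returns how many days in the past the offending waitTill was,
// or null if the line is not a TimeoutNegativeWarning.
function staleDaysFromWarning(line: string): number | null {
  const match = WARNING_RE.exec(line);
  if (match === null) return null;
  return Math.abs(Number(match[1])) / MS_PER_DAY;
}
```

For the captured value of -622830095 ms this gives roughly 7.2 days, consistent with the oldest zombie on the affected instance being about 7 days stale.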

Reproduction (deterministic)

1. Build any workflow with a Wait node followed by a node that always throws (e.g. a Code node with throw new Error('boom')).
2. Trigger it. The execution enters status='waiting' with waitTill set.
3. After the wait elapses, n8n attempts to resume → workflow throws → status flips to error → waitTill is NOT cleared (this is the bug).
4. Inspect execution_entity: row has status='error' AND waitTill is populated.
5. Restart n8n main. WaitTracker now finds the row on every poll, computes a negative setTimeout delta, emits TimeoutNegativeWarning, and re-attempts resume in a loop.
6. With debug logging on, the repeating "Setting timer for IDs / Resuming execution / already being resumed" pattern is visible. Process eventually exits silently after a variable interval.

(I did not reliably reproduce the crash in a controlled environment, only the looping warning and duplicate-resume pattern. But on a production instance with real traffic, the loop correlates 1:1 with silent exits, and clearing the zombie rows took the instance from 14 restarts in 14h to 0 restarts in 20h+.)

Suggested Fix

Two layers, both desirable:

1. Filter the query. In ExecutionRepository.getWaitingExecutions(), add AND status = 'waiting' (or equivalent QueryBuilder clause) to the select. This is the smallest possible fix and addresses the root issue: a row in any terminal status should never be considered for resume regardless of waitTill.
2. Clear waitTill on terminal status transitions. Wherever execution_entity.status is updated to a terminal value (error, crashed, success, canceled), also SET waitTill = NULL. Defense in depth: protects against any future query path that doesn't filter by status, and keeps the data shape coherent.
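
Both layers can be illustrated on plain in-memory rows. This is a sketch of the idea under stated assumptions, not a patch against n8n: ExecutionRow, selectResumable and withStatus are hypothetical names, and lookaheadMs only mirrors the lookahead window mentioned for newer versions.

```typescript
type Status = 'running' | 'waiting' | 'error' | 'crashed' | 'success' | 'canceled';

// Hypothetical in-memory stand-in for an execution_entity row.
interface ExecutionRow {
  id: string;
  status: Status;
  waitTill: Date | null;
}

function isTerminal(status: Status): boolean {
  return status === 'error' || status === 'crashed' || status === 'success' || status === 'canceled';
}

// Layer 1: only rows that are actually waiting are candidates for resume.
function selectResumable(rows: ExecutionRow[], now: Date, lookaheadMs: number): ExecutionRow[] {
  const cutoff = now.getTime() + lookaheadMs;
  return rows.filter(
    (row) => row.status === 'waiting' && row.waitTill !== null && row.waitTill.getTime() <= cutoff,
  );
}

// Layer 2: moving to a terminal status also clears waitTill.
function withStatus(row: ExecutionRow, status: Status): ExecutionRow {
  return { ...row, status, waitTill: isTerminal(status) ? null : row.waitTill };
}
```

Either layer alone keeps a failed execution out of the resume set; with both in place, the stored row also stops carrying a waitTill that no longer means anything.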

Environment

n8n Version: 2.15.1 (pinned; tracking issue #28514 prevents upgrade to 2.16/2.17 in this deployment, but inspecting getWaitingExecutions() in the post-#27066 code suggests the same query — sans status filter — is still in place, so this likely affects newer versions too)
Node.js Version: 20.x and 24.x (default n8n image)
Database: PostgreSQL 16
Execution mode: queue (1 main + 3 workers + Redis + Postgres, Docker)
Hosting: self-hosted

Related

#27066 (merged): reworked WaitTracker but did not add a status filter to the query, so this bug likely persists post-merge.
#28541 (closed): adjacent dbTime crash on 2.16/2.17 Postgres. Different bug, same area.
#15123, #8136: unrelated wait-tracker behaviour issues.

extent analysis

TL;DR

The most likely fix is to update the ExecutionRepository.getWaitingExecutions() query to filter by status = 'waiting' and clear waitTill on terminal status transitions.

Guidance

Verify the issue: Run the provided SQL query to check for zombie rows in the execution_entity table.
Apply the workaround: Run the provided SQL update query to clean the zombies in Postgres.
Implement the suggested fix: Update the ExecutionRepository.getWaitingExecutions() query to filter by status = 'waiting' and clear waitTill on terminal status transitions.
Monitor the instance: After applying the fix, monitor the instance for any further silent exits or TimeoutNegativeWarning messages.

Example

No code snippet is provided as the issue is related to a specific query and database schema.

Notes

The issue is specific to the queue execution mode and may affect newer versions of n8n. The suggested fix is a two-layer approach to prevent similar issues in the future.

Recommendation

Apply the workaround by running the SQL update query to clean the zombies in Postgres, and then implement the suggested fix to update the ExecutionRepository.getWaitingExecutions() query. This will prevent further silent exits and TimeoutNegativeWarning messages.
