n8n: [Bug] Parent Workflow Intermittently Stuck in "Waiting" State After Sub-Workflow Completes in Queue/Scaling Mode [1 comment, 2 participants]

GitHub issue: n8n-io/n8n#28214 (fetched 2026-04-09 08:16:03)

Bug Description

When using the Execute Workflow node with waitForSubWorkflow: true (the default) in queue/scaling mode (EXECUTIONS_MODE=queue), the parent workflow intermittently gets permanently stuck in Waiting state with waitTill: 3000-01-01T00:00:00.000Z, even though the sub-workflow completed successfully.

This is an intermittent issue — the same workflow with the same data succeeds most of the time but occasionally fails. The failure correlates with system load and fast-returning sub-workflows (sub-workflows that complete in seconds rather than minutes).

This issue is specific to distributed/queue execution mode. The same workflows at the same volumes run reliably in single-process mode (EXECUTIONS_MODE=regular).

Further Analysis

The parent resume mechanism in both webhook-helpers.ts and wait-tracker.ts uses fire-and-forget promise chains (void ... .then(...).then(...)) with no .catch() handler and no retry logic.

The relevant code path (webhook-helpers.ts:802-822):

const { parentExecution } = runExecutionData;
if (WorkflowHelpers.shouldRestartParentExecution(parentExecution)) {
    const executionRepository = Container.get(ExecutionRepository);
    void executePromise                              // ← fire-and-forget
        .then(async (subworkflowResults) => {
            if (!subworkflowResults) return;
            if (subworkflowResults.status === 'waiting') return;
            await WorkflowHelpers.updateParentExecutionWithChildResults(
                executionRepository,
                parentExecution.executionId,
                subworkflowResults,
            );
            return subworkflowResults;
        })
        .then((subworkflowResults) => {
            if (!subworkflowResults) return;
            if (subworkflowResults.status === 'waiting') return;
            const waitTracker = Container.get(WaitTracker);
            void waitTracker.startExecution(parentExecution.executionId);  // ← also fire-and-forget
        });
        // ← NO .catch() — errors silently become unhandled rejections
}

The identical pattern exists in wait-tracker.ts:179-197.

In regular mode, the child runs in-process and the resume chain involves a short in-memory async path with no network hops. In queue mode, the resume chain crosses multiple process boundaries:

Main → BullMQ → Redis → Worker (child runs) → DB → Redis → BullMQ → Main
  → DB read (parent) → DB write (patch parent) → DB atomic update → BullMQ → Worker (parent resumes)

Each hop is an async boundary where a transient failure (DB connection pool exhaustion, Redis timeout, network blip) is silently swallowed by the void + no-.catch() pattern. Under high load, the probability of hitting a transient error at any single hop increases.
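
To make the effect of hop count concrete, here is a rough sketch that compounds a per-hop failure rate across the chain. It assumes every hop fails independently with the same probability, which is a simplification and not something the issue measures. The function name and the sample rates are illustrative only; the hop count of 12 is the number of arrows in the queue-mode path above.

```typescript
// Assumption: each hop fails independently with the same probability.
// Real hops (DB, Redis, BullMQ) will have different and correlated failure rates.
function chainFailureProbability(perHopFailure: number, hops: number): number {
  if (perHopFailure < 0 || perHopFailure > 1) {
    throw new RangeError('perHopFailure must be between 0 and 1');
  }
  if (!Number.isInteger(hops) || hops < 0) {
    throw new RangeError('hops must be a non-negative integer');
  }
  // The chain succeeds only if every hop succeeds.
  return 1 - Math.pow(1 - perHopFailure, hops);
}

// Number of arrows in the queue-mode path shown above.
const QUEUE_MODE_HOPS = 12;

// Illustrative per-hop failure rates, not measurements.
for (const p of [0.0001, 0.001, 0.01]) {
  const chance = chainFailureProbability(p, QUEUE_MODE_HOPS) * 100;
  console.log(`per-hop failure ${p}: chain failure ${chance.toFixed(2)}%`);
}
```

Even a small per-hop rate adds up across a dozen hops, which fits the observation that the failure shows up only in queue mode and mostly under load.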

Additional logs would at least help diagnose the problem, but having a catch/retry mechanism may make this more reliable.
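
A minimal sketch of what such a catch/retry mechanism could look like, written as a standalone helper. None of this is n8n code: `retryWithBackoff`, `RetryOptions`, and the `Logger` type are hypothetical names, and the attempt count and delays are placeholders. The idea is only that every failed attempt is logged and the final failure is rethrown so the caller can record it instead of losing it.

```typescript
// Hypothetical helper, not part of n8n.
type Logger = (message: string, error: unknown) => void;

interface RetryOptions {
  attempts: number;     // total tries, including the first one
  baseDelayMs: number;  // delay doubles after each failed try
  log?: Logger;
}

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { attempts, baseDelayMs, log = console.error } = options;
  if (!Number.isInteger(attempts) || attempts < 1) {
    throw new RangeError('attempts must be a positive integer');
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Log every failure so a lost resume signal leaves a trace.
      log(`resume attempt ${attempt}/${attempts} failed`, error);
      if (attempt < attempts) {
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed: rethrow so the caller can handle or record it.
  throw lastError;
}
```

In the resume chain, the call that restarts the parent could be wrapped as `retryWithBackoff(() => waitTracker.startExecution(id), { attempts: 3, baseDelayMs: 200 })`, assuming that call is safe to repeat; the issue does not confirm whether it is idempotent.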

To Reproduce

  1. Configure n8n in queue/scaling mode (EXECUTIONS_MODE=queue) with at least one worker.
  2. Create a sub-workflow that:
     • Receives input via Execute Sub-workflow Trigger
     • Makes an HTTP POST to an external API
     • Enters a Wait node (webhook resume mode) to wait for the API callback
     • Completes after the callback arrives
  3. Create a parent workflow that calls the sub-workflow via the Execute Workflow node with Wait for Sub-Workflow Completion enabled.
  4. Run the parent workflow under load (multiple concurrent executions).
  5. Observe: Most executions complete normally. Occasionally, the sub-workflow completes successfully (status: Success) but the parent workflow remains permanently stuck in Waiting state.
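
One way to generate the concurrent load in the "run under load" step is a small script that fires many parent runs at once. The sketch below assumes the parent workflow is exposed through a webhook trigger, which the issue does not state; `PARENT_WEBHOOK_URL`, `triggerParent`, and `runConcurrently` are hypothetical names, and the URL is a placeholder.

```typescript
interface LoadResult {
  succeeded: number;
  failed: number;
}

// Starts `count` triggers at once and counts how many resolved or rejected.
async function runConcurrently(
  count: number,
  trigger: (index: number) => Promise<unknown>,
): Promise<LoadResult> {
  const outcomes = await Promise.allSettled(
    Array.from({ length: count }, (_, index) => trigger(index)),
  );
  const succeeded = outcomes.filter((o) => o.status === 'fulfilled').length;
  return { succeeded, failed: count - succeeded };
}

// Hypothetical: assumes the parent workflow starts from a webhook at this URL.
const PARENT_WEBHOOK_URL = 'http://localhost:5678/webhook/parent-workflow';

async function triggerParent(index: number): Promise<void> {
  // Uses the global fetch available in Node.js 18 and later.
  const response = await fetch(PARENT_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ run: index }),
  });
  if (!response.ok) {
    throw new Error(`trigger ${index} failed with HTTP ${response.status}`);
  }
}

// Example (not executed here): runConcurrently(50, triggerParent).then(console.log);
```

A successful trigger only means the parent started. Whether it later got stuck still has to be checked in the executions list.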

The issue is more likely to manifest when the external API responds quickly (within seconds), but it can occur at any response time under sufficient system load.

Expected behavior

When a sub-workflow completes with shouldResume: true, the parent workflow should always be resumed — regardless of timing or system load.

Actual behavior:

  • Sub-workflow: Completes with Success status. parentExecution.shouldResume: true is correctly set.
  • Parent workflow: Stuck permanently with status: 'waiting', waitTill: '3000-01-01T00:00:00.000Z'.
  • The parent's nodeExecutionStack has 1 entry (the next node was queued but never executed).
  • No errors in logs. No crashes. No BullMQ failures. The resume signal is silently lost.
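
Because nothing is logged, stuck parents have to be found from execution data. The sketch below shows one way to flag candidates from an exported list of executions: a parent is flagged when it is still waiting until the year-3000 sentinel while one of its children has already succeeded. The `ExecutionRecord` shape is a simplified, hypothetical stand-in and not n8n's actual schema.

```typescript
// Hypothetical, simplified record shape; not n8n's real execution entity.
interface ExecutionRecord {
  id: string;
  status: 'waiting' | 'running' | 'success' | 'error';
  waitTill?: string;           // ISO timestamp, if the execution is waiting
  parentExecutionId?: string;  // set on sub-workflow executions
}

// Sentinel reported in the issue for an indefinite wait.
const INDEFINITE_WAIT = '3000-01-01T00:00:00.000Z';

function findStuckParentCandidates(executions: ExecutionRecord[]): string[] {
  const byId = new Map(
    executions.map((e): [string, ExecutionRecord] => [e.id, e]),
  );
  const stuck = new Set<string>();
  for (const child of executions) {
    if (child.status !== 'success' || !child.parentExecutionId) continue;
    const parent = byId.get(child.parentExecutionId);
    if (parent && parent.status === 'waiting' && parent.waitTill === INDEFINITE_WAIT) {
      stuck.add(parent.id);
    }
  }
  return Array.from(stuck);
}
```

These are candidates only: a parent that calls several sub-workflows may be waiting legitimately on a later one, so each hit still needs a manual check.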

Debug Info

core

  • n8nVersion: 2.13.3
  • platform: docker (self-hosted)
  • nodeJsVersion: 24.13.1
  • nodeEnv: production
  • database: postgres
  • executionMode: scaling (single-main)
  • concurrency: -1
  • license: enterprise (production)

storage

  • success: all
  • error: all
  • progress: false
  • manual: true
  • binaryMode: s3

pruning

  • enabled: true
  • maxAge: 336 hours
  • maxCount: 10000 executions

client

  • userAgent: mozilla/5.0 (macintosh; intel mac os x 10_15_7) applewebkit/537.36 (khtml, like gecko) chrome/146.0.0.0 safari/537.36
  • isTouchDevice: false

Generated at: 2026-04-08T18:45:48.748Z

  • Operating System: Alpine Linux 3.22
  • n8n Version: 2.13.3
  • Node.js Version: 24.13.1
  • Database: PostgreSQL
  • Execution mode: queue
  • Hosting: self hosted

extent analysis

TL;DR

Adding a .catch() handler and retry logic to the fire-and-forget promise chains in webhook-helpers.ts and wait-tracker.ts may resolve the intermittent issue of parent workflows getting stuck in the Waiting state.

Guidance

  • Identify and handle potential errors in the promise chains by adding .catch() handlers to prevent unhandled rejections.
  • Implement retry logic to handle transient failures that may occur during the execution of sub-workflows.
  • Consider adding logging to diagnose issues and understand the flow of executions.
  • Review the code paths in webhook-helpers.ts and wait-tracker.ts to ensure that all potential error scenarios are handled.

Example

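const { parentExecution } = runExecutionData;
if (WorkflowHelpers.shouldRestartParentExecution(parentExecution)) {
    const executionRepository = Container.get(ExecutionRepository);
    void executePromise
        .then(async (subworkflowResults) => {
            if (!subworkflowResults) return;
            if (subworkflowResults.status === 'waiting') return;
            await WorkflowHelpers.updateParentExecutionWithChildResults(
                executionRepository,
                parentExecution.executionId,
                subworkflowResults,
            );
            return subworkflowResults;
        })
        .then(async (subworkflowResults) => {
            if (!subworkflowResults) return;
            if (subworkflowResults.status === 'waiting') return;
            const waitTracker = Container.get(WaitTracker);
            // Awaited instead of void, so a failure here reaches the .catch() below
            await waitTracker.startExecution(parentExecution.executionId);
        })
        .catch((error) => {
            // Surface the failure instead of losing it; retry logic can be added here
            console.error('Error resuming parent execution:', error);
        });
}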

Notes

The provided solution focuses on handling errors and implementing retry logic. However, the root cause of the issue may be more complex and related to the distributed execution mode. Further analysis and debugging may be necessary to fully resolve the issue.

Recommendation

Apply a workaround by adding error handling and retry logic to the affected code paths, as this is likely to mitigate the intermittent issue.
