Expected behavior: When a sub-workflow completes with shouldResume: true, the parent workflow should always be resumed, regardless of timing or system load.

Actual behavior:

- Sub-workflow: Completes with Success status. parentExecution.shouldResume: true is correctly set.
- Parent workflow: Stuck permanently with status: 'waiting', waitTill: '3000-01-01T00:00:00.000Z'.
- The parent's nodeExecutionStack has 1 entry (the next node was queued but never executed).
- No errors in logs. No crashes. No BullMQ failures. The resume signal is silently lost.

Bug Description

When using the Execute Workflow node with waitForSubWorkflow: true (the default) in queue/scaling mode (EXECUTIONS_MODE=queue), the parent workflow intermittently gets permanently stuck in Waiting state with waitTill: 3000-01-01T00:00:00.000Z, even though the sub-workflow completed successfully.

This is an intermittent issue — the same workflow with the same data succeeds most of the time but occasionally fails. The failure correlates with system load and fast-returning sub-workflows (sub-workflows that complete in seconds rather than minutes).

This issue is specific to distributed/queue execution mode. The same workflows at the same volumes run reliably in single-process mode (EXECUTIONS_MODE=regular).

Further Analysis

The parent resume mechanism in both webhook-helpers.ts and wait-tracker.ts uses fire-and-forget promise chains (void ... .then(...).then(...)) with no .catch() handler and no retry logic.

The relevant code path (`webhook-helpers.ts:802-822`):

const { parentExecution } = runExecutionData;
if (WorkflowHelpers.shouldRestartParentExecution(parentExecution)) {
    const executionRepository = Container.get(ExecutionRepository);
    void executePromise                              // ← fire-and-forget
        .then(async (subworkflowResults) => {
            if (!subworkflowResults) return;
            if (subworkflowResults.status === 'waiting') return;
            await WorkflowHelpers.updateParentExecutionWithChildResults(
                executionRepository,
                parentExecution.executionId,
                subworkflowResults,
            );
            return subworkflowResults;
        })
        .then((subworkflowResults) => {
            if (!subworkflowResults) return;
            if (subworkflowResults.status === 'waiting') return;
            const waitTracker = Container.get(WaitTracker);
            void waitTracker.startExecution(parentExecution.executionId);  // ← also fire-and-forget
        });
        // ← NO .catch() — errors silently become unhandled rejections
}

The identical pattern exists in wait-tracker.ts:179-197.
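
The effect of this pattern can be reproduced outside n8n. The sketch below is self-contained TypeScript built on hypothetical stand-ins (`patchParent`, `resumeChain`, and the `ChildResult` type are illustrative names, not n8n's real functions). It only shows that when the first step of the chain rejects, the second step never runs, and that the failure is visible only if a handler is attached.

```typescript
type ChildResult = { status: 'success' | 'error' | 'waiting' };

// Hypothetical stand-in for the "patch parent" step; it can be told to
// fail in order to simulate a transient DB or Redis error.
async function patchParent(shouldFail: boolean): Promise<void> {
    if (shouldFail) throw new Error('simulated transient failure');
}

// Runs a two-step chain shaped like the one in the issue and records
// each step in `log`. `onError` is optional so both variants can be compared.
function resumeChain(
    child: Promise<ChildResult>,
    shouldFail: boolean,
    log: string[],
    onError?: (error: Error) => void,
): Promise<void> {
    const chain = child
        .then(async (result) => {
            await patchParent(shouldFail);
            log.push('parent patched');
            return result;
        })
        .then(() => {
            // Skipped entirely when the previous step rejected.
            log.push('parent restarted');
        });
    // Without a handler the rejection stays on the returned promise;
    // a caller that discards it with `void` never observes it.
    return onError ? chain.catch((e: unknown) => onError(e as Error)) : chain;
}

async function demo(): Promise<string[]> {
    const log: string[] = [];
    const child: ChildResult = { status: 'success' };
    await resumeChain(Promise.resolve(child), true, log, (e) => {
        log.push(`caught: ${e.message}`);
    });
    return log; // contains only the "caught" entry; neither step is recorded
}
```

In this sketch the parent is never restarted after the simulated failure, which matches the symptom reported here. Whether a transient failure is what actually happens inside n8n is not confirmed by this issue.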

In regular mode, the child runs in-process and the resume chain involves a short in-memory async path with no network hops. In queue mode, the resume chain crosses multiple process boundaries:

Main → BullMQ → Redis → Worker (child runs) → DB → Redis → BullMQ → Main
  → DB read (parent) → DB write (patch parent) → DB atomic update → BullMQ → Worker (parent resumes)

Each hop is an async boundary where a transient failure (DB connection pool exhaustion, Redis timeout, network blip) is silently swallowed by the void + no-.catch() pattern. Under high load, the probability of hitting a transient error at any single hop increases.

Additional logs would at least help diagnose the problem, but having a catch/retry mechanism may make this more reliable.
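
As a rough illustration of what a catch/retry mechanism could look like, here is a minimal TypeScript sketch of a retry wrapper with capped exponential backoff. The names `withRetry`, `backoffDelayMs`, and `sleep`, along with the default attempt count and delays, are assumptions made for this example and are not part of n8n.

```typescript
// Exponential backoff for a 0-based attempt number, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
    return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation` up to `maxAttempts` times. Every failure is passed to
// `onFailure` (for logging), and the last error is rethrown if all attempts fail.
async function withRetry<T>(
    operation: () => Promise<T>,
    maxAttempts = 3,
    onFailure: (error: unknown, attempt: number) => void = () => {},
    delayFor: (attempt: number) => number = backoffDelayMs,
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
            onFailure(error, attempt);
            // No delay after the final attempt; the error is rethrown below.
            if (attempt < maxAttempts - 1) await sleep(delayFor(attempt));
        }
    }
    throw lastError;
}
```

In the chain described above, `operation` would wrap the step that patches the parent and the step that restarts it. Retrying is only safe if those steps are idempotent, which this issue does not establish, so that would need to be checked before applying anything like this.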

To Reproduce

1. Configure n8n in queue/scaling mode (EXECUTIONS_MODE=queue) with at least one worker.
2. Create a sub-workflow that:
   - Receives input via Execute Sub-workflow Trigger
   - Makes an HTTP POST to an external API
   - Enters a Wait node (webhook resume mode) to wait for the API callback
   - Completes after the callback arrives
3. Create a parent workflow that calls the sub-workflow via the Execute Workflow node with Wait for Sub-Workflow Completion enabled.
4. Run the parent workflow under load (multiple concurrent executions).
5. Observe: Most executions complete normally. Occasionally, the sub-workflow completes successfully (status: Success) but the parent workflow remains permanently stuck in Waiting state.

The issue is more likely to manifest when the external API responds quickly (within seconds), but it can occur at any response time under sufficient system load.

Expected behavior

When a sub-workflow completes with shouldResume: true, the parent workflow should always be resumed — regardless of timing or system load.

Actual behavior:

- Sub-workflow: Completes with Success status. parentExecution.shouldResume: true is correctly set.
- Parent workflow: Stuck permanently with status: 'waiting', waitTill: '3000-01-01T00:00:00.000Z'.
- The parent's nodeExecutionStack has 1 entry (the next node was queued but never executed).
- No errors in logs. No crashes. No BullMQ failures. The resume signal is silently lost.
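
The symptoms above can be turned into a simple check for spotting affected executions. The TypeScript sketch below is a minimal illustration under stated assumptions: `ExecutionSnapshot`, `ChildSnapshot`, `looksStuck`, the lowercase `'success'` status string, and the 60-second grace period are all hypothetical and do not reflect n8n's actual entity types or API.

```typescript
// Hypothetical, simplified view of a parent execution; not n8n's entity type.
interface ExecutionSnapshot {
    status: string;
    waitTill: string | null; // ISO timestamp
    nodeExecutionStackLength: number;
}

// Hypothetical view of the finished child execution.
interface ChildSnapshot {
    status: string;
    stoppedAtMs: number; // epoch milliseconds
}

// The far-future waitTill value seen on the stuck parents in this report.
const INDEFINITE_WAIT = '3000-01-01T00:00:00.000Z';

// True when the parent still shows the reported symptoms some time after
// the child finished. The grace period avoids flagging a parent whose
// resume is simply still in flight.
function looksStuck(
    parent: ExecutionSnapshot,
    child: ChildSnapshot,
    nowMs: number,
    graceMs = 60000,
): boolean {
    return (
        child.status === 'success' &&
        parent.status === 'waiting' &&
        parent.waitTill === INDEFINITE_WAIT &&
        parent.nodeExecutionStackLength > 0 &&
        nowMs - child.stoppedAtMs > graceMs
    );
}
```

A check along these lines only detects the condition after the fact. It does not explain why the resume was lost.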

Debug Info

core

n8nVersion: 2.13.3
platform: docker (self-hosted)
nodeJsVersion: 24.13.1
nodeEnv: production
database: postgres
executionMode: scaling (single-main)
concurrency: -1
license: enterprise (production)

storage

success: all
error: all
progress: false
manual: true
binaryMode: s3

pruning

enabled: true
maxAge: 336 hours
maxCount: 10000 executions

client

userAgent: mozilla/5.0 (macintosh; intel mac os x 10_15_7) applewebkit/537.36 (khtml, like gecko) chrome/146.0.0.0 safari/537.36
isTouchDevice: false

Generated at: 2026-04-08T18:45:48.748Z

Operating System

Alpine Linux 3.22

n8n Version

2.13.3

Node.js Version

24.13.1

Database

PostgreSQL

Execution mode

queue

Hosting

self hosted

extent analysis

TL;DR

Adding a .catch() handler and retry logic to the fire-and-forget promise chains in webhook-helpers.ts and wait-tracker.ts may resolve the intermittent issue of parent workflows getting stuck in the Waiting state.

Guidance

- Add .catch() handlers to the promise chains so that failures are handled instead of becoming unhandled rejections.
- Implement retry logic for transient failures in the parent-resume steps (patching the parent execution and restarting it).
- Consider adding logging around these steps to diagnose failures and trace the flow of executions.
- Review the code paths in webhook-helpers.ts and wait-tracker.ts to ensure that all potential error scenarios are handled.

Example

const { parentExecution } = runExecutionData;
if (WorkflowHelpers.shouldRestartParentExecution(parentExecution)) {
    const executionRepository = Container.get(ExecutionRepository);
    executePromise
        .then(async (subworkflowResults) => {
            // ...
        })
        .then((subworkflowResults) => {
            // ...
        })
        .catch((error) => {
            // Handle error and potentially retry
            console.error('Error resuming parent execution:', error);
            // Retry logic can be added here
        });
}

Notes

The provided solution focuses on handling errors and implementing retry logic. However, the root cause of the issue may be more complex and related to the distributed execution mode. Further analysis and debugging may be necessary to fully resolve the issue.

Recommendation

Apply a workaround by adding error handling and retry logic to the affected code paths, as this is likely to mitigate the intermittent issue.

n8n - 💡(How to fix) Fix [Bug] Parent Workflow Intermittently Stuck in "Waiting" State After Sub-Workflow Completes in Queue/Scaling Mode [1 comments, 2 participants]
