dify: Async workflow time-slice scheduler for sandbox tier is incomplete (risk of stuck RUNNING state)

Source: langgenius/dify#35499 (fetched 2026-04-23 07:45:24)
On Dify Cloud, asynchronous workflows on the sandbox (free) queue use a CFS (completely fair scheduling) time-slice plan. The current AsyncWorkflowCFSPlanScheduler does not make a real “can this run continue” decision for sandbox: can_schedule() returns RESOURCE_LIMIT_REACHED for every run that is not on the professional or team queue. An explicit FIXME in the code notes the need to prevent a sandbox user’s workflow from remaining in a running state indefinitely.

This is a reliability / state-machine issue: incorrect or over-aggressive pause signaling can leave workflow runs in an inconsistent or stuck state relative to user expectations, and the scheduler logic is acknowledged as unfinished.

Severity: High (core workflow execution, long-running / async paths, Dify Cloud)

Area: api/tasks/workflow_cfs_scheduler/, api/core/app/layers/timeslice_layer.py, api/tasks/async_workflow_tasks.py


Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Evidence in codebase

  1. FIXME on sandbox scheduling (api/tasks/workflow_cfs_scheduler/cfs_scheduler.py):

    • can_schedule() returns SchedulerCommand.NONE only for PROFESSIONAL_QUEUE and TEAM_QUEUE.
    • For all other cases (including SANDBOX_QUEUE), it returns RESOURCE_LIMIT_REACHED.
    • Comment: “FIXME: avoid the sandbox user's workflow at a running state for ever”.
  2. Time-slice layer behavior (api/core/app/layers/timeslice_layer.py):

    • When can_schedule() == RESOURCE_LIMIT_REACHED, the background job sends GraphEngineCommand with CommandType.PAUSE and payload reason: resource_limit_reached.
    • This ties the scheduler’s result directly to pausing the graph: every RESOURCE_LIMIT_REACHED becomes a PAUSE command.
  3. When this path is active, Cloud only (api/tasks/workflow_cfs_scheduler/entities.py):

    • If EDITION == "CLOUD", AsyncWorkflowSystemStrategy is TimeSlice and separate Celery queues exist for professional, team, and sandbox.
    • If not Cloud, strategy is Nop, so the time-slice checker is not used the same way — this bug primarily concerns Dify Cloud and the sandbox queue.
  4. Related uncertainty on workflow run status (api/repositories/sqlalchemy_api_workflow_run_repository.py, pause path):

    • TODO notes that WorkflowRun.status persistence may occur in an order that interacts badly with graph execution (“before the execution of GraphLayer”), which can exacerbate user-visible “wrong status” symptoms when combined with aggressive pause/limit logic.
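
The evidence above can be condensed into a small, self-contained simulation. This is a sketch of the behavior as described in this report, not the Dify source: the enum members and queue constant names mirror the identifiers quoted above, while `can_schedule_as_described`, `layer_tick`, `count_pauses`, and the string values of the queues are hypothetical stand-ins.

```python
from enum import Enum
from typing import Optional


class SchedulerCommand(Enum):
    # Mirrors the two outcomes named in the report.
    NONE = "none"
    RESOURCE_LIMIT_REACHED = "resource_limit_reached"


# Hypothetical string values; only the constant names come from the report.
PROFESSIONAL_QUEUE = "professional"
TEAM_QUEUE = "team"
SANDBOX_QUEUE = "sandbox"


def can_schedule_as_described(queue: str) -> SchedulerCommand:
    """Behavior described in the report: only paid queues may continue."""
    if queue in (PROFESSIONAL_QUEUE, TEAM_QUEUE):
        return SchedulerCommand.NONE
    return SchedulerCommand.RESOURCE_LIMIT_REACHED


def layer_tick(queue: str) -> Optional[dict]:
    """One periodic check: return the pause command to send, or None."""
    if can_schedule_as_described(queue) is SchedulerCommand.RESOURCE_LIMIT_REACHED:
        return {"command_type": "PAUSE", "reason": "resource_limit_reached"}
    return None


def count_pauses(queue: str, ticks: int) -> int:
    """Number of pause commands emitted over `ticks` check intervals."""
    return sum(1 for _ in range(ticks) if layer_tick(queue) is not None)
```

Under these assumptions, a sandbox run receives one pause command per check interval, while the professional and team queues receive none. That is the every-interval pause pattern the report is concerned about.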

✔️ Expected Behavior

  • Sandbox tier workflows should either:
    • be throttled in a defined way (clear quotas, backoff, or bounded pause) without leaving runs stuck in RUNNING, or
    • if paused for quota, transition to a consistent status (PAUSED / failed / resumable) and recover predictably.
  • can_schedule() for sandbox should encode real resource / quota state, not a permanent RESOURCE_LIMIT_REACHED if that does not match product intent.

Exact product rules (limits, when to pause vs. queue) should be confirmed with the Dify Cloud team.
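
One way to make "consistent status" concrete is an explicit transition table. The sketch below is an assumption, not Dify's data model: the status names, the allowed transitions, and the helper names `transition` and `on_quota_pause` are illustrative only, and the real rules would need the confirmation mentioned above.

```python
from enum import Enum


class RunStatus(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Hypothetical transition table: a paused run must be able to resume or fail,
# and terminal states have no outgoing transitions.
ALLOWED = {
    RunStatus.RUNNING: {RunStatus.PAUSED, RunStatus.SUCCEEDED, RunStatus.FAILED},
    RunStatus.PAUSED: {RunStatus.RUNNING, RunStatus.FAILED},
    RunStatus.SUCCEEDED: set(),
    RunStatus.FAILED: set(),
}


def transition(current: RunStatus, target: RunStatus) -> RunStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def on_quota_pause(current: RunStatus) -> RunStatus:
    """A quota pause moves a RUNNING run to PAUSED; repeated pauses are no-ops."""
    if current is RunStatus.PAUSED:
        return current
    return transition(current, RunStatus.PAUSED)
```

Treating a repeated pause as a no-op matters here because, as described in this report, the time-slice layer may emit a pause on every check interval.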

❌ Actual Behavior

How to reproduce (indicative)

  1. Run Dify Cloud with async workflow execution routed to the sandbox Celery queue (execute_workflow_sandbox).
  2. Use a workflow whose execution uses the TimeSlice CFS plan (same as production Cloud config in entities.py).
  3. Observe graph scheduling: TimeSliceLayer will periodically call can_schedule(); for sandbox, it always gets RESOURCE_LIMIT_REACHED (non–professional/team), so pause commands are emitted on every check interval.
  4. Inspect workflow run and node run records for stuck or inconsistent RUNNING / pause states under load or over long runs.

Concrete UI/API steps depend on your Cloud environment and app configuration; backend logs for the sandbox queue and workflow run status fields are the primary signal.
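
For step 4, a simple staleness check over exported run records may help separate genuinely long runs from stuck ones. This is a minimal sketch under stated assumptions: `RunRecord` is a hypothetical projection of a workflow run row (not the Dify model), the `"running"` status string is assumed, and the age threshold is arbitrary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class RunRecord:
    # Hypothetical projection of a workflow run row.
    run_id: str
    status: str
    updated_at: datetime


def find_stuck_runs(
    runs: Iterable[RunRecord], now: datetime, max_age: timedelta
) -> List[str]:
    """Return IDs of runs still marked running whose last update is older than max_age."""
    return [
        r.run_id
        for r in runs
        if r.status == "running" and now - r.updated_at > max_age
    ]
```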


Suggested fix direction (non-prescriptive)

  1. Implement sandbox can_schedule() using real quota / usage (or a deliberate cooldown), and align pause/resume with WorkflowRun / pause entities.
  2. Add integration coverage for: sandbox + TimeSlice + long run + pause/resume, asserting terminal/suspended states are correct.
  3. Revisit the WorkflowRun status ordering TODO if pause/continue races appear in support tickets.
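
The "deliberate cooldown" option in item 1 could look roughly like the sketch below. Everything here is hypothetical: `SandboxBudget`, its parameters, and the `exhausted()` check are illustrative names, time is passed in as a plain number of seconds so the logic is easy to test, and the real limits are a product decision.

```python
from enum import Enum
from typing import Optional


class SchedulerCommand(Enum):
    NONE = "none"
    RESOURCE_LIMIT_REACHED = "resource_limit_reached"


class SandboxBudget:
    """Hypothetical per-run budget: run for one slice, then cool down."""

    def __init__(self, slice_seconds: float, cooldown_seconds: float, max_pauses: int):
        self.slice_seconds = slice_seconds
        self.cooldown_seconds = cooldown_seconds
        self.max_pauses = max_pauses
        self.pauses = 0
        self.slice_started_at = 0.0
        self.paused_at: Optional[float] = None  # None while the run is active

    def can_schedule(self, now: float) -> SchedulerCommand:
        if self.paused_at is not None:
            # Still cooling down: keep the run paused.
            if now - self.paused_at < self.cooldown_seconds:
                return SchedulerCommand.RESOURCE_LIMIT_REACHED
            # Cooldown over: start a fresh slice.
            self.paused_at = None
            self.slice_started_at = now
            return SchedulerCommand.NONE
        if now - self.slice_started_at < self.slice_seconds:
            return SchedulerCommand.NONE
        # Slice used up: record the pause and signal the limit.
        self.pauses += 1
        self.paused_at = now
        return SchedulerCommand.RESOURCE_LIMIT_REACHED

    def exhausted(self) -> bool:
        """True once the run has used all its pauses; a caller could then fail it."""
        return self.pauses >= self.max_pauses
```

Because the number of pauses is capped, a caller that moves the run to a terminal state once `exhausted()` is true gets a bounded lifetime for each run, which is the property the FIXME asks for.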

Related upstream issue (feature regression)

  • GitHub: #20725 — Knowledge retrieval node: structured_output_enabled is forced to false in api/core/workflow/nodes/knowledge_retrieval/entities.py as a temporary workaround; proper fix is architectural (node inheritance). Track separately if the “bug” you care about is RAG/structured output rather than async scheduling.

Metadata for filing on GitHub

  • Title: [Cloud] Sandbox async workflow CFS scheduler: FIXME re. stuck RUNNING; can_schedule always resource-limited for non–paid queues
  • Labels (suggested): bug, backend, workflow, cloud (if applicable)
  • Environment: Dify Cloud, sandbox queue, async workflow tasks

This file was generated from a static analysis of the repository at the time of writing, not from runtime reproduction on a live Dify Cloud tenant.

extent analysis

TL;DR

Implement a revised can_schedule() function for the sandbox queue that accurately reflects resource usage and quota, preventing workflows from becoming stuck in a running state indefinitely.

Guidance

  • Review the can_schedule() function in api/tasks/workflow_cfs_scheduler/cfs_scheduler.py to understand the current implementation and identify areas for improvement.
  • Implement a quota or cooldown-based approach for the sandbox queue, ensuring that can_schedule() returns a value that accurately reflects the current resource usage and availability.
  • Update the TimeSliceLayer to handle the revised can_schedule() output, pausing or resuming workflows as needed to maintain a consistent state.
  • Add integration tests to cover sandbox workflows with long runs, pause, and resume scenarios to ensure correct terminal or suspended states.

Example

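# Example revised can_schedule() (illustrative sketch, not the actual Dify code).
# current_usage and quota are hypothetical attributes; the queue constants and
# SchedulerCommand members follow the names quoted in the issue.
def can_schedule(self, queue):
    # Paid queues keep the behavior described in the issue: no command is issued.
    if queue in (PROFESSIONAL_QUEUE, TEAM_QUEUE):
        return SchedulerCommand.NONE
    if queue == SANDBOX_QUEUE:
        # Hypothetical quota check: let the run continue while under quota.
        if self.current_usage < self.quota:
            return SchedulerCommand.NONE
    # Over quota, or an unrecognized queue: signal the limit so the run is paused.
    return SchedulerCommand.RESOURCE_LIMIT_REACHED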

Notes

The provided issue description lacks specific details on the desired quota or cooldown implementation. Therefore, the example above is a simplified illustration of how the can_schedule() function could be revised. The actual implementation will depend on the specific requirements and constraints of the Dify Cloud sandbox queue.

Recommendation

Apply a workaround by implementing a revised can_schedule() function that accurately reflects resource usage and quota for the sandbox queue, as this will prevent workflows from becoming stuck in a running state indefinitely.
