Fix Action

Fixed

Fixed by PR: feat(api): add scheduled cleanup task for dataset_queries (https://github.com/Echo0ff/dify/pull/2)
Fixed by PR: feat(api): add scheduled cleanup task for dataset_queries (https://github.com/langgenius/dify/pull/35734)

PR fix notes

PR #2: feat(api): add scheduled cleanup task for dataset_queries

Echo0ff · 2026-04-30T08:29:40Z

Repository: Echo0ff/dify
Author: Echo0ff
State: closed | merged: True
Link: https://github.com/Echo0ff/dify/pull/2

Description (problem / solution / changelog)

Summary

Fixes langgenius/dify#35733

The dataset_queries table grows without bound because every RAG retrieval and hit-testing operation inserts a row. There is currently no periodic cleanup task for this table.

This PR adds a configurable Celery Beat task (clean_dataset_queries_task) that deletes rows older than a retention period in batches.

Changes

1. api/schedule/clean_dataset_queries_task.py: new task with a Redis lock and batch deletion, plus automatic retention clamping and warning logging when the configured retention falls below the safe threshold
2. api/configs/feature/__init__.py: 4 new config fields in CeleryScheduleTasksConfig:
   - ENABLE_CLEAN_DATASET_QUERIES_TASK (default: False)
   - CLEAN_DATASET_QUERIES_RETENTION_DAYS (default: 60)
   - CLEAN_DATASET_QUERIES_BATCH_SIZE (default: 500)
   - CLEAN_DATASET_QUERIES_LOCK_TTL (default: 3600)
3. api/extensions/ext_celery.py: register the task in beat_schedule at hour=5 to avoid collision with existing tasks at 0/2/3/4
4. api/models/dataset.py: add a created_at index declaration to DatasetQuery.__table_args__
5. api/migrations/versions/2026_04_30_1600-67b5709d7d0a_add_dataset_queries_created_at_idx.py: Alembic migration
6. api/tests/unit_tests/schedule/test_clean_dataset_queries_task.py: 4 unit tests
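
As a rough illustration of how the four settings and their defaults fit together, here is a minimal sketch that reads them from environment variables using only the standard library. The class name CleanDatasetQueriesSettings and the helpers _env_bool and _env_int are hypothetical; the PR itself adds the fields to CeleryScheduleTasksConfig, whose actual field definitions are not shown here. Treating the lock TTL as seconds is also an assumption.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean environment variable (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _env_int(name: str, default: int) -> int:
    """Parse an integer environment variable (hypothetical helper)."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


@dataclass(frozen=True)
class CleanDatasetQueriesSettings:
    """Hypothetical stand-in for the four fields added by the PR."""

    enabled: bool = False      # ENABLE_CLEAN_DATASET_QUERIES_TASK
    retention_days: int = 60   # CLEAN_DATASET_QUERIES_RETENTION_DAYS
    batch_size: int = 500      # CLEAN_DATASET_QUERIES_BATCH_SIZE
    lock_ttl: int = 3600       # CLEAN_DATASET_QUERIES_LOCK_TTL (assumed seconds)

    @classmethod
    def from_env(cls) -> "CleanDatasetQueriesSettings":
        # Defaults mirror the values listed in the PR description.
        return cls(
            enabled=_env_bool("ENABLE_CLEAN_DATASET_QUERIES_TASK", False),
            retention_days=_env_int("CLEAN_DATASET_QUERIES_RETENTION_DAYS", 60),
            batch_size=_env_int("CLEAN_DATASET_QUERIES_BATCH_SIZE", 500),
            lock_ttl=_env_int("CLEAN_DATASET_QUERIES_LOCK_TTL", 3600),
        )
```

With no variables set, from_env() yields a disabled task with the documented defaults, which matches the opt-in behaviour described in the PR.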

Key constraint

clean_unused_datasets_task reads DatasetQuery.created_at to determine whether a dataset has been queried recently (threshold = PLAN_SANDBOX_CLEAN_DAY_SETTING, default 30 days). The new task uses a default retention of 60 days (> 30). If a user manually sets retention below 30, the task clamps it and logs a warning.
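
The clamping rule can be sketched as follows. This is an illustration of the behaviour described above, not the PR's code: the function name effective_retention_days is hypothetical, and the floor of 30 stands in for the default value of PLAN_SANDBOX_CLEAN_DAY_SETTING.

```python
import logging

logger = logging.getLogger(__name__)


def effective_retention_days(configured_days: int, safe_floor_days: int = 30) -> int:
    """Return the retention actually used by the cleanup (hypothetical helper).

    safe_floor_days stands in for PLAN_SANDBOX_CLEAN_DAY_SETTING (default 30).
    Rows younger than the floor must survive so that clean_unused_datasets_task
    can still see recent queries.
    """
    if configured_days < safe_floor_days:
        logger.warning(
            "Configured retention of %d days is below the safe threshold of %d days; "
            "clamping to %d.",
            configured_days,
            safe_floor_days,
            safe_floor_days,
        )
        return safe_floor_days
    return configured_days
```

For example, effective_retention_days(60) leaves the default untouched, while effective_retention_days(7) is raised to 30 and a warning is logged.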

Test plan

- [x] make lint passes
- [x] basedpyright api/schedule/clean_dataset_queries_task.py: 0 errors
- [x] 4 unit tests all pass
- [ ] Manual verification: set ENABLE_CLEAN_DATASET_QUERIES_TASK=true, start celery beat, confirm the task is scheduled and deletes in batches

🤖 Generated with Claude Code

Changed files

api/configs/feature/__init__.py (modified, +19/-0)
api/extensions/ext_celery.py (modified, +6/-0)
api/migrations/versions/2026_04_30_1600-67b5709d7d0a_add_dataset_queries_created_at_idx.py (added, +25/-0)
api/models/dataset.py (modified, +1/-0)
api/schedule/clean_dataset_queries_task.py (added, +108/-0)
api/tests/unit_tests/schedule/__init__.py (added, +0/-0)
api/tests/unit_tests/schedule/test_clean_dataset_queries_task.py (added, +102/-0)

PR #35734: feat(api): add scheduled cleanup task for dataset_queries

Repository: langgenius/dify
Author: Echo0ff
State: open | merged: False
Link: https://github.com/langgenius/dify/pull/35734

Description (problem / solution / changelog)

Summary

Fixes #35733

The dataset_queries table grows without bound because every RAG retrieval and hit-testing operation inserts a row. There is currently no periodic cleanup task for this table — the only deletions happen when an entire dataset is removed via clean_dataset_task.

This PR adds a configurable Celery Beat task (clean_dataset_queries_task) that deletes rows older than a retention period in batches.

Changes

1. api/schedule/clean_dataset_queries_task.py: new task with a Redis lock and batch deletion, plus automatic retention clamping and warning logging when the configured retention falls below the safe threshold
2. api/configs/feature/__init__.py: 4 new config fields in CeleryScheduleTasksConfig:
   - ENABLE_CLEAN_DATASET_QUERIES_TASK (default: False)
   - CLEAN_DATASET_QUERIES_RETENTION_DAYS (default: 60)
   - CLEAN_DATASET_QUERIES_BATCH_SIZE (default: 500)
   - CLEAN_DATASET_QUERIES_LOCK_TTL (default: 3600)
3. api/extensions/ext_celery.py: register the task in beat_schedule at hour=5 to avoid collision with existing tasks at 0/2/3/4
4. api/models/dataset.py: add a created_at index declaration to DatasetQuery.__table_args__
5. api/migrations/versions/2026_04_30_1600-67b5709d7d0a_add_dataset_queries_created_at_idx.py: Alembic migration
6. api/tests/unit_tests/schedule/test_clean_dataset_queries_task.py: 4 unit tests

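Key constraint

clean_unused_datasets_task reads DatasetQuery.created_at to determine whether a dataset has been queried recently (threshold = PLAN_SANDBOX_CLEAN_DAY_SETTING, default 30 days). The new task uses a default retention of 60 days (> 30). If a user manually sets retention below 30, the task clamps it and logs a warning.
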
Test plan

make lint passes
basedpyright api/schedule/clean_dataset_queries_task.py — 0 errors
4 unit tests all pass
Manual verification: set ENABLE_CLEAN_DATASET_QUERIES_TASK=true, start celery beat, confirm the task is scheduled and deletes in batches

🤖 Generated with Claude Code

Changed files

api/configs/feature/__init__.py (modified, +19/-0)
api/extensions/ext_celery.py (modified, +6/-0)
api/migrations/versions/2026_04_30_1600-67b5709d7d0a_add_dataset_queries_created_at_idx.py (added, +25/-0)
api/models/dataset.py (modified, +1/-0)
api/schedule/clean_dataset_queries_task.py (added, +108/-0)
api/tests/unit_tests/schedule/__init__.py (added, +0/-0)
api/tests/unit_tests/schedule/test_clean_dataset_queries_task.py (added, +102/-0)

Problem

The dataset_queries table records every RAG retrieval and hit-testing operation (insertion points: hit_testing_service.py, index_tool_callback_handler.py, dataset_retrieval.py). There is no periodic cleanup task for this table — the only deletions happen when an entire dataset is removed via clean_dataset_task.

On long-running instances, this table grows without bound, consuming disk space and degrading query performance over time.

Proposed Solution

Add a configurable Celery Beat task (clean_dataset_queries_task) that:

Deletes rows older than a configurable retention period (default 60 days) in batches

Uses a Redis lock to prevent concurrent execution

Clamps retention to max(configured_days, PLAN_SANDBOX_CLEAN_DAY_SETTING) to avoid breaking clean_unused_datasets_task, which reads DatasetQuery.created_at to decide if a dataset has been queried recently

Adds a created_at index on dataset_queries to keep the delete scan performant

Is gated by ENABLE_CLEAN_DATASET_QUERIES_TASK, which defaults to False (opt-in, same as other cleanup tasks)

extent analysis

TL;DR

Implement a periodic cleanup task for the dataset_queries table to prevent unbounded growth and performance degradation.

Guidance

Identify the optimal retention period for dataset_queries rows based on specific use cases and performance requirements.
Consider adding a created_at index on dataset_queries to improve delete scan performance.
Evaluate the impact of the proposed clean_dataset_queries_task on existing tasks, such as clean_unused_datasets_task.
Assess the trade-offs of opting in via ENABLE_CLEAN_DATASET_QUERIES_TASK.

Example

No code snippet is provided as it is not explicitly supported by the issue.

Notes

The proposed solution assumes a self-hosted, long-running instance with high RAG retrieval volume. The effectiveness of the solution may vary depending on the specific environment and usage patterns.

Recommendation

Implement the proposed clean_dataset_queries_task with a configurable retention period. It addresses the root cause of the issue and gives operators a flexible way to manage dataset_queries table growth.

dify - ✅(Solved) Fix dataset_queries table grows without bound — no periodic cleanup task [2 pull requests, 1 participants]
