dify - ✅(Solved) Fix dataset_queries table grows without bound — no periodic cleanup task [2 pull requests, 1 participants]

GitHub stats
langgenius/dify#35733 · Fetched 2026-05-01 05:53:26
Comments: 0 · Participants: 1 · Timeline: 3 · Reactions: 1
Timeline (top): cross-referenced ×2, closed ×1

Fix Action

Fixed

PR fix notes

PR #2: feat(api): add scheduled cleanup task for dataset_queries

Description (problem / solution / changelog)

Summary

Fixes langgenius/dify#35733

The dataset_queries table grows without bound because every RAG retrieval and hit-testing operation inserts a row. There is currently no periodic cleanup task for this table.

This PR adds a configurable Celery Beat task (clean_dataset_queries_task) that deletes rows older than a retention period in batches.

Changes

  1. api/schedule/clean_dataset_queries_task.py — New task: Redis lock + batch deletion with automatic retention clamping and warning logging when the configured retention falls below the safe threshold
  2. api/configs/feature/__init__.py — 4 new config fields in CeleryScheduleTasksConfig:
    • ENABLE_CLEAN_DATASET_QUERIES_TASK (default: False)
    • CLEAN_DATASET_QUERIES_RETENTION_DAYS (default: 60)
    • CLEAN_DATASET_QUERIES_BATCH_SIZE (default: 500)
    • CLEAN_DATASET_QUERIES_LOCK_TTL (default: 3600)
  3. api/extensions/ext_celery.py — Register in beat_schedule, hour=5 to avoid collision with existing tasks at 0/2/3/4
  4. api/models/dataset.py — Add created_at index declaration to DatasetQuery.__table_args__
  5. api/migrations/versions/2026_04_30_1600-67b5709d7d0a_add_dataset_queries_created_at_idx.py — Alembic migration
  6. api/tests/unit_tests/schedule/test_clean_dataset_queries_task.py — 4 unit tests
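
The four settings listed in item 2 can be pictured with a small standard-library sketch. This is not the PR's code: the class name `CleanDatasetQueriesSettings`, the helper functions, and the environment-variable parsing rules are assumptions made for illustration. Only the variable names and default values come from the description above, and the unit of the lock TTL is not stated there.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean environment variable (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _env_int(name: str, default: int) -> int:
    """Parse an integer environment variable (hypothetical helper)."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


@dataclass(frozen=True)
class CleanDatasetQueriesSettings:
    """Hypothetical stand-in for the four new fields; defaults match the PR description."""

    enabled: bool = False       # ENABLE_CLEAN_DATASET_QUERIES_TASK
    retention_days: int = 60    # CLEAN_DATASET_QUERIES_RETENTION_DAYS
    batch_size: int = 500       # CLEAN_DATASET_QUERIES_BATCH_SIZE
    lock_ttl: int = 3600        # CLEAN_DATASET_QUERIES_LOCK_TTL (unit not stated in the PR)

    @classmethod
    def from_env(cls) -> "CleanDatasetQueriesSettings":
        return cls(
            enabled=_env_bool("ENABLE_CLEAN_DATASET_QUERIES_TASK", False),
            retention_days=_env_int("CLEAN_DATASET_QUERIES_RETENTION_DAYS", 60),
            batch_size=_env_int("CLEAN_DATASET_QUERIES_BATCH_SIZE", 500),
            lock_ttl=_env_int("CLEAN_DATASET_QUERIES_LOCK_TTL", 3600),
        )
```

With no variables set, `CleanDatasetQueriesSettings.from_env()` yields the defaults, so the task stays disabled until an operator opts in.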

Key constraint

clean_unused_datasets_task reads DatasetQuery.created_at to determine whether a dataset has been queried recently (threshold = PLAN_SANDBOX_CLEAN_DAY_SETTING, default 30 days). Deleting query rows younger than that threshold would make recently used datasets look unused, so the new task defaults to a retention of 60 days (> 30). If a user manually sets retention below the threshold, the task raises it to the threshold and logs a warning.
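
The clamping rule can be sketched as a small pure function. The function name `effective_retention_days` and the log message are hypothetical. The behaviour (never go below the sandbox threshold, warn when the configured value is too low) follows the constraint described above, with the threshold passed in as a parameter rather than read from the real configuration.

```python
import logging

logger = logging.getLogger(__name__)


def effective_retention_days(configured_days: int, sandbox_clean_days: int = 30) -> int:
    """Return the retention actually applied: never below sandbox_clean_days.

    sandbox_clean_days stands in for PLAN_SANDBOX_CLEAN_DAY_SETTING (default 30).
    """
    if configured_days < sandbox_clean_days:
        # Warn so the operator can see that the configured value was overridden.
        logger.warning(
            "dataset_queries retention of %d days is below the %d-day threshold; "
            "using %d days instead",
            configured_days,
            sandbox_clean_days,
            sandbox_clean_days,
        )
        return sandbox_clean_days
    return configured_days
```

This is equivalent to `max(configured_days, sandbox_clean_days)` with a warning added on the clamped path.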

Test plan

  • make lint passes
  • basedpyright api/schedule/clean_dataset_queries_task.py — 0 errors
  • 4 unit tests all pass
  • Manual verification: set ENABLE_CLEAN_DATASET_QUERIES_TASK=true, start celery beat, confirm the task is scheduled and deletes in batches

🤖 Generated with Claude Code

Changed files

  • api/configs/feature/__init__.py (modified, +19/-0)
  • api/extensions/ext_celery.py (modified, +6/-0)
  • api/migrations/versions/2026_04_30_1600-67b5709d7d0a_add_dataset_queries_created_at_idx.py (added, +25/-0)
  • api/models/dataset.py (modified, +1/-0)
  • api/schedule/clean_dataset_queries_task.py (added, +108/-0)
  • api/tests/unit_tests/schedule/__init__.py (added, +0/-0)
  • api/tests/unit_tests/schedule/test_clean_dataset_queries_task.py (added, +102/-0)

PR #35734: feat(api): add scheduled cleanup task for dataset_queries

Description (problem / solution / changelog)

Summary

Fixes #35733

The dataset_queries table grows without bound because every RAG retrieval and hit-testing operation inserts a row. There is currently no periodic cleanup task for this table — the only deletions happen when an entire dataset is removed via clean_dataset_task.

The rest of this description (solution summary, changes, key constraint, test plan, and changed files) is the same as in PR #2 above.

Problem

The dataset_queries table records every RAG retrieval and hit-testing operation (insertion points: hit_testing_service.py, index_tool_callback_handler.py, dataset_retrieval.py). There is no periodic cleanup task for this table — the only deletions happen when an entire dataset is removed via clean_dataset_task.

On long-running instances, this table grows without bound, consuming disk space and degrading query performance over time.

Proposed Solution

Add a configurable Celery Beat task (clean_dataset_queries_task) that:

  • Deletes rows older than a configurable retention period (default 60 days) in batches
  • Uses a Redis lock to prevent concurrent execution
  • Clamps retention to max(configured_days, PLAN_SANDBOX_CLEAN_DAY_SETTING) to avoid breaking clean_unused_datasets_task, which reads DatasetQuery.created_at to decide if a dataset has been queried recently
  • Adds a created_at index on dataset_queries to keep the delete scan performant
  • Is gated by ENABLE_CLEAN_DATASET_QUERIES_TASK, which defaults to False (opt-in, same as other cleanup tasks)

Environment

Self-hosted, long-running instance with high RAG retrieval volume.

extent analysis

TL;DR

Implement a periodic cleanup task for the dataset_queries table to prevent unbounded growth and performance degradation.

Guidance

  • Identify the optimal retention period for dataset_queries rows based on specific use cases and performance requirements.
  • Consider adding a created_at index on dataset_queries to improve delete scan performance.
  • Evaluate the impact of the proposed clean_dataset_queries_task on existing tasks, such as clean_unused_datasets_task.
  • Assess the trade-offs of opting in via ENABLE_CLEAN_DATASET_QUERIES_TASK, which is disabled by default.

Example

No code snippet is provided as it is not explicitly supported by the issue.

Notes

The proposed solution assumes a self-hosted, long-running instance with high RAG retrieval volume. The effectiveness of the solution may vary depending on the specific environment and usage patterns.

Recommendation

Implement the proposed clean_dataset_queries_task with a configurable retention period. It addresses the root cause (old rows are never deleted) and lets operators tune how long query history is kept.
