dify - ✅(Solved) Fix db-migration-test-mysql fails with 'Lost connection to MySQL server during query' ~3ms after migration starts [1 pull requests, 1 comments, 2 participants]

GitHub stats
langgenius/dify#35620 (fetched 2026-04-29 06:36:33)
Comments: 1 · Participants: 2 · Timeline events: 9 · Reactions: 1
Timeline (top): referenced ×3, assigned ×1, closed ×1, commented ×1

Error Message

Database migration failed: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

Root Cause

Confirmed in PR #35631: the compose action returns as soon as docker compose up -d returns and does not wait on healthcheck status, so the migration runs while the mysql:8.0 container is still in first-run InnoDB initialization, and the first SQLAlchemy connection is reset.

The hypotheses originally listed in the issue (healthcheck reporting ready too early, resource limits on the Depot runner, auth-plugin or wait_timeout mismatch) were flagged there as not validated.

Fix Action

Fixed

PR fix notes

PR #35631: fix(ci): wait for mysql to accept queries before db migration

Description (problem / solution / changelog)

Fixes #35620

Summary

Run DB Migration Test / db-migration-test-mysql failed reliably with pymysql.err.OperationalError (2013, 'Lost connection to MySQL server during query') ~3-5 ms after logging "Starting database migration." The Postgres counterpart in the same workflow passed.

Real root cause — confirmed by adding a diagnostic step that ran between Set up Middlewares and Run DB Migration:

06:36:48  Bringing up docker compose service(s)
06:36:57  docker compose service(s) are up        ← compose-action returned (9s)
06:36:57  docker ps  → db_mysql Up 1 second (health: starting)
06:36:57  db_mysql logs:  [InnoDB] InnoDB initialization has started.
06:37:02  flask upgrade-db                         → Lost connection during query

hoverkraft-tech/compose-action only waits for docker compose up -d to return (the container processes are running); it does not wait on healthcheck status. docker ps at the time of failure shows the mysql container still in health: starting, and the container's own logs are still at InnoDB initialization. mysql:8.0's first-run init takes 15-30 s (InnoDB recovery, system tables, root user setup, default db creation). In the trace above, the migration step starts about 5 s after the action returns (about 14 s after compose began bringing the services up), well inside that window, and the first SQLAlchemy connection is reset by the still-bootstrapping server.

Postgres is unaffected because pg starts in <5 s, comfortably finished by the time the next workflow steps (env prep, uv run startup, Flask app factory) finish — those steps already absorb most of the slack for pg, but not enough for mysql.

Fix

Two changes, ordered by importance:

  1. .github/workflows/db-migration-test.yml — primary fix. Add an explicit Wait for MySQL to accept queries step between the compose-action and Run DB Migration. Polls mysql -e 'SELECT 1' from the runner host once per second, up to 60 s. Independent of the compose-action's wait semantics, dumps container logs on timeout. This is what makes the job pass.

  2. docker/docker-compose.middleware.yaml — secondary, defensible-on-its-own improvement. Replace the existing mysqladmin ping healthcheck (which only verifies TCP handshake) with a mysql -e "SELECT 1" healthcheck (which verifies the server can actually process queries). Adds start_period: 20s so the bootstrap window does not consume the retry budget. This is correct in its own right — it makes docker compose ps and any future --wait-based workflow report mysql as healthy only when it actually is — but it does not fix the CI failure on its own, because the CI's compose-action does not wait on health at all.
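
The polling behaviour described in item 1 can be sketched roughly as follows. This is a minimal Python illustration of the wait-until-queries-succeed idea, not the workflow step from the PR (which runs the mysql client from the runner host in the workflow YAML). The function names, host, port, user, and password below are assumptions made for the example.

```python
import subprocess
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Poll `probe` until it returns True and return the elapsed seconds.

    Raises TimeoutError if the probe never succeeds within `timeout_s`.
    """
    start = clock()
    while True:
        try:
            if probe():
                return clock() - start
        except Exception:
            # A refused or reset connection just means "not ready yet".
            pass
        if clock() - start >= timeout_s:
            raise TimeoutError(f"service not ready after {timeout_s:.0f}s")
        sleep(interval_s)


def mysql_accepts_queries(
    host: str = "127.0.0.1",   # assumed: port published on the runner host
    port: int = 3306,          # assumed: default MySQL port
    user: str = "root",        # assumed credentials for illustration only
    password: str = "example",
) -> bool:
    """Hypothetical probe: True only if the server answers SELECT 1."""
    result = subprocess.run(
        [
            "mysql", f"--host={host}", f"--port={port}", "--protocol=TCP",
            f"--user={user}", f"--password={password}", "-e", "SELECT 1",
        ],
        capture_output=True,
        timeout=5,
    )
    return result.returncode == 0
```

Calling wait_until_ready(mysql_accepts_queries) before the migration would block until a real query succeeds or 60 s pass. Probing with an actual query rather than a connection check mirrors the distinction the PR draws between mysqladmin ping and SELECT 1.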

Why this PR has 4 commits

For full transparency: the first two commits chase the wrong root cause (assumed compose-action was waiting on health, tried to tighten the check). Commit 3 adds the diagnostic step that disproved that assumption. Commit 4 is the actual fix. Reviewers should look at the net file diff rather than the per-commit history.

If preferred I can squash on merge.

Screenshots

Before: Lost connection to MySQL server during query ~3 ms after migration start, on every PR that triggers db-migration-test-mysql
After: "MySQL ready after 14s", then "Database migration successful!" (this PR's CI run https://github.com/langgenius/dify/actions/runs/25038021419)

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran make lint && make type-check (backend) and cd web && pnpm exec vp staged (frontend) to appease the lint gods

Note: this change is to a docker-compose file and a workflow yaml (no Python / TypeScript). make lint passes. make type-check surfaces several pre-existing errors on main (app_factory.py:132, services/model_load_balancing_service.py:623, several in providers/trace/trace-tencent/...) that are unrelated to this change; identical on upstream/main HEAD.

The actual verification is the green Run DB Migration Test / db-migration-test-mysql check on this PR.

From Claude Code

Changed files

  • .github/workflows/db-migration-test.yml (modified, +22/-0)
  • docker/docker-compose.middleware.yaml (modified, +10/-4)


Self Checks

  • I have read the Contributing Guide and Language Policy.
  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report, otherwise it will be closed.
  • [Chinese users & non-English users] Please submit in English, otherwise the issue will be closed :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

N/A — CI failure on main (HEAD 2d6babeeb4). Reproduces against the workflow shipped in the repo, not against a deployed instance.

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

The reusable workflow Run DB Migration Test / db-migration-test-mysql (defined in .github/workflows/db-migration-test.yml) fails on every PR that triggers it. The Postgres counterpart in the same workflow passes consistently.

Reproduce by opening any PR whose changes match the path filter on Run DB Migration Test. Recent observed failures on PR #35515:

Both fail at the same point with the same timing.

Relevant log excerpt:

2026-04-28 02:58:03.834  Preparing database migration...
2026-04-28 02:58:03.891  Starting database migration.
2026-04-28 02:58:03.894  Database migration failed: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

The 3 ms gap between Starting and Lost connection suggests the failure is not from a long-running DDL but from the connection dropping immediately on one of the first queries.
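
The gaps can be read straight off the timestamps in the excerpt. A small sketch of that arithmetic (the helper name gaps_ms is made up for this example):

```python
from datetime import datetime

LOG = """\
2026-04-28 02:58:03.834  Preparing database migration...
2026-04-28 02:58:03.891  Starting database migration.
2026-04-28 02:58:03.894  Database migration failed: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
"""


def gaps_ms(log: str) -> list[int]:
    """Return the gap in milliseconds between consecutive log lines."""
    stamps = [
        # The first 23 characters hold "YYYY-MM-DD HH:MM:SS.mmm".
        datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
        for line in log.splitlines()
        if line.strip()
    ]
    return [
        round((later - earlier).total_seconds() * 1000)
        for earlier, later in zip(stamps, stamps[1:])
    ]


print(gaps_ms(LOG))  # [57, 3]
```

The 57 ms step is the preparation phase; the 3 ms step is too short for any DDL to have run, which is consistent with the connection being dropped on one of the first queries.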

✔️ Expected Behavior

db-migration-test-mysql runs the migration to completion against the mysql:8.0 middleware container, the same way db-migration-test-postgres does, and the job exits 0.

❌ Actual Behavior

The job fails roughly 3 ms after logging "Starting database migration." with pymysql.err.OperationalError (2013, 'Lost connection to MySQL server during query').

Why this has been latent: Skip Duplicate Checks plus the path filter on Run DB Migration Test cause main pushes and most PRs to skip this job entirely. Scanning the last 15+ Main CI Pipeline runs on main shows db-migration-test-mysql listed as skipped on every one of them. The job has effectively not been exercised in main's CI for a long time; the failure is reliably reproducible only on PRs that touch broader api/ paths and is now blocking required checks (API Tests, DB Migration Test) on those PRs.

Environment notes:

  • Runner migrated to Depot (depot-ubuntu-24.04) in d6dee43c09, image builds in 23648141c9. The Postgres job on the same runner passes, so the runner change alone does not explain the failure, but the migration may have changed timing/resources for the MySQL container.
  • Container: mysql:8.0, healthcheck mysqladmin ping -u root -p$DB_PASSWORD, interval 1s / timeout 3s / retries 30.
  • Compose action: hoverkraft-tech/compose-action (assumed to wait for healthy by default).
  • Loosely related: #32454 (move to Testcontainers).
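
To put the healthcheck numbers above in context, here is a rough estimate of how long a container that never becomes ready would stay in health: starting before Docker marks it unhealthy. This is a simplified model, not Docker's exact scheduler: it assumes failures during the start period are not counted and that each counted failure costs one interval plus the probe's own run time. The start_period value of 20 s is the one the PR adds; applying it to the unchanged interval and retries is an assumption, since the source does not show the final healthcheck block.

```python
def seconds_until_unhealthy(
    interval_s: float,
    retries: int,
    start_period_s: float = 0.0,
    probe_s: float = 0.0,
) -> float:
    """Approximate time before a never-ready container is marked unhealthy."""
    # Failures inside the start period are not counted, so the retry
    # budget only starts draining once that period has elapsed.
    return start_period_s + retries * (interval_s + probe_s)


# interval 1s / retries 30, probes failing immediately
print(seconds_until_unhealthy(1, 30))                     # 30.0
# same, but every probe runs into the 3 s timeout
print(seconds_until_unhealthy(1, 30, probe_s=3))          # 120.0
# assumed: same interval and retries with start_period: 20s
print(seconds_until_unhealthy(1, 30, start_period_s=20))  # 50.0
```

Under this model, a 30 s budget with fast-failing probes sits right at the upper end of the 15-30 s first-run init window reported for mysql:8.0, which is why a start period helps keep bootstrap time from consuming the retries.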

Plausible root causes (not validated, flagging for someone with CI infra context):

  1. MySQL 8 healthcheck reports ready before InnoDB recovery / buffer pool is fully usable; the first DDL hits a transient unavailability window.
  2. A specific migration in the chain (the dry-run trace shows ALTER TABLE messages ADD/DROP COLUMN error; and similar) crashes the container due to resource limits on the Depot runner.
  3. Auth-plugin or wait_timeout mismatch between mysql:8.0 defaults and pymysql / SQLAlchemy.

Impact: Any PR that hits the API path filter cannot get a green required DB Migration Test until this is resolved.

extent analysis

TL;DR

The db-migration-test-mysql job fails due to a lost connection to the MySQL server during query, likely caused by the MySQL container not being fully ready or a resource limit issue.

Guidance

  • Investigate the MySQL container's healthcheck and readiness, ensuring it accounts for InnoDB recovery and buffer pool initialization.
  • Review the resource limits on the Depot runner and consider increasing them to prevent crashes during migrations.
  • Verify the compatibility of mysql:8.0 defaults with pymysql and SQLAlchemy, focusing on auth-plugin and wait_timeout settings.
  • Consider adding a delay or retry mechanism to the db-migration-test-mysql job to account for transient unavailability windows.
  • Check the migration logs for any specific queries that may be causing the connection loss, such as the ALTER TABLE statements mentioned.

Example

No code snippet is provided as the issue is more related to the CI infrastructure and container configuration.

Notes

The issue may be specific to the Depot runner and mysql:8.0 container, and resolving it may require collaboration with someone familiar with the CI infrastructure.

Recommendation

Apply a workaround by adding a delay or retry mechanism to the db-migration-test-mysql job to account for transient unavailability windows, while investigating the root cause of the issue.
