dify - ✅(Solved) Fix db-migration-test-mysql fails with 'Lost connection to MySQL server during query' ~3ms after migration starts [1 pull requests, 1 comments, 2 participants]

lin-snow · 2026-04-28T03:34:44Z



GitHub stats

langgenius/dify#35620•Fetched 2026-04-29 06:36:33


Timeline (top)

referenced ×3, assigned ×1, closed ×1, commented ×1

Error Message

pymysql.err.OperationalError (2013, 'Lost connection to MySQL server during query'), raised roughly 3 ms after the log line "Starting database migration."

Root Cause

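Hypotheses listed in the original issue (not validated at filing time; the merged PR described below identifies the actual cause as the compose action returning before MySQL has finished initializing):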

MySQL 8 healthcheck reports ready before InnoDB recovery / buffer pool is fully usable; the first DDL hits a transient unavailability window.
A specific migration in the chain (the dry-run trace shows ALTER TABLE messages ADD/DROP COLUMN error; and similar) crashes the container due to resource limits on the Depot runner.
Auth-plugin or wait_timeout mismatch between mysql:8.0 defaults and pymysql / SQLAlchemy.

Fix Action

Fixed

Fixed by PR: fix(ci): wait for mysql to accept queries before db migration (https://github.com/langgenius/dify/pull/35631)

PR fix notes

PR #35631: fix(ci): wait for mysql to accept queries before db migration

Repository: langgenius/dify
Author: lin-snow
State: closed | merged: True
Link: https://github.com/langgenius/dify/pull/35631

Description (problem / solution / changelog)

Fixes #35620

Summary

Run DB Migration Test / db-migration-test-mysql failed reliably with pymysql.err.OperationalError (2013, 'Lost connection to MySQL server during query') ~3-5 ms after the "Starting database migration." log line. The Postgres counterpart in the same workflow passed.

Real root cause — confirmed by adding a diagnostic step that ran between Set up Middlewares and Run DB Migration:

06:36:48  Bringing up docker compose service(s)
06:36:57  docker compose service(s) are up        ← compose-action returned (9s)
06:36:57  docker ps  → db_mysql Up 1 second (health: starting)
06:36:57  db_mysql logs:  [InnoDB] InnoDB initialization has started.
06:37:02  flask upgrade-db                         → Lost connection during query

hoverkraft-tech/compose-action@v2.6.0 only waits for docker compose up -d to return (the container processes are running); it does not wait on healthcheck status. docker ps at the time of failure shows the mysql container still in health: starting, and the container's own logs are still at InnoDB initialization. mysql:8.0's first-run init takes 15-30 s (InnoDB recovery, system tables, root user setup, default db creation); the migration step starts ~14 s after the action returns, well inside that window, and the first SQLAlchemy connection is reset by the still-bootstrapping server.

Postgres is unaffected because pg starts in <5 s, comfortably finished by the time the next workflow steps (env prep, uv run startup, Flask app factory) finish — those steps already absorb most of the slack for pg, but not enough for mysql.
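A diagnostic step along these lines can confirm whether the database container is actually healthy when the migration step begins. This is a minimal sketch, not the step added in the PR: the container name db_mysql is taken from the trace above, while container_health and report_container are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print the Docker health status of a container ("starting", "healthy",
# "unhealthy"), or "none" when the container defines no healthcheck.
container_health() {
  docker inspect \
    --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
    "$1"
}

# Hypothetical helper: report the health status and the most recent log lines.
report_container() {
  local name="$1"
  echo "health of ${name}: $(container_health "$name")"
  docker logs --tail 20 "$name" 2>&1
}

# Example call (left commented out so the snippet can be sourced safely):
# report_container db_mysql
```

Running such a step between the compose action and the migration shows the state the migration will actually see, rather than the state assumed from the action's exit code.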

Fix

Two changes, ordered by importance:

1. .github/workflows/db-migration-test.yml (primary fix). Add an explicit "Wait for MySQL to accept queries" step between the compose-action and Run DB Migration. It polls mysql -e 'SELECT 1' from the runner host once per second for up to 60 s, is independent of the compose-action's wait semantics, and dumps container logs on timeout. This is what makes the job pass.
2. docker/docker-compose.middleware.yaml (secondary, defensible-on-its-own improvement). Replace the existing mysqladmin ping healthcheck (which only verifies TCP handshake) with a mysql -e "SELECT 1" healthcheck (which verifies the server can actually process queries). Adds start_period: 20s so the bootstrap window does not consume the retry budget. This is correct in its own right, since it makes docker compose ps and any future --wait-based workflow report mysql as healthy only when it actually is, but it does not fix the CI failure on its own, because the CI's compose-action does not wait on health at all.
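The primary fix described above can be sketched as a small polling loop. This is a minimal sketch under stated assumptions, not the exact step merged in the PR: the host, port, user, and the DB_PASSWORD variable are assumptions, and wait_until and mysql_accepts_queries are hypothetical helper names. The container name db_mysql comes from the diagnostic trace.

```shell
#!/usr/bin/env bash
# Poll a probe command once per second until it succeeds or the budget runs out.
# Usage: wait_until <max_seconds> <command> [args...]
wait_until() {
  local max_seconds="$1"
  shift
  local elapsed=0
  while [ "$elapsed" -lt "$max_seconds" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after ${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "not ready after ${max_seconds}s" >&2
  return 1
}

# Hypothetical probe: succeeds only when the server can execute a query.
# Connection parameters are assumptions, not copied from the workflow.
mysql_accepts_queries() {
  mysql -h 127.0.0.1 -P 3306 -u root -p"${DB_PASSWORD}" -e 'SELECT 1'
}

# Example call (left commented out so the snippet can be sourced safely):
# wait_until 60 mysql_accepts_queries || { docker logs db_mysql; exit 1; }
```

Keeping the probe separate from the loop means the same loop can wait on any readiness check, and the timeout branch is the natural place to dump container logs before failing the job.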

Why this PR has 4 commits

For full transparency: the first two commits chase the wrong root cause (assumed compose-action was waiting on health, tried to tighten the check). Commit 3 adds the diagnostic step that disproved that assumption. Commit 4 is the actual fix. Reviewers should look at the net file diff rather than the per-commit history.

If preferred I can squash on merge.

Screenshots

| Before | After |
|--------|-------|
| `Lost connection to MySQL server during query` ~3 ms after migration start, every PR that triggers `db-migration-test-mysql` | `MySQL ready after 14s` → `Database migration successful!` (this PR's CI run https://github.com/langgenius/dify/actions/runs/25038021419) |

Checklist

[ ] This change requires a documentation update, included: Dify Document (https://github.com/langgenius/dify-docs)
[x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
[x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
[ ] I've updated the documentation accordingly.
[x] I ran make lint && make type-check (backend) and cd web && pnpm exec vp staged (frontend) to appease the lint gods

Note: this change is to a docker-compose file and a workflow yaml (no Python / TypeScript). make lint passes. make type-check surfaces several pre-existing errors on main (app_factory.py:132, services/model_load_balancing_service.py:623, several in providers/trace/trace-tencent/...) that are unrelated to this change; identical on upstream/main HEAD.

The actual verification is the green Run DB Migration Test / db-migration-test-mysql check on this PR.

From Claude Code

Changed files

.github/workflows/db-migration-test.yml (modified, +22/-0)
docker/docker-compose.middleware.yaml (modified, +10/-4)

Code Example

2026-04-28 02:58:03.834  Preparing database migration...
2026-04-28 02:58:03.891  Starting database migration.
2026-04-28 02:58:03.894  Database migration failed: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')


Self Checks

I have read the Contributing Guide and Language Policy.
This is only for bug report, if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report, otherwise it will be closed.
[Chinese & non-English users] Please submit in English; otherwise the issue will be closed :)
Please do not modify this template :) and fill in all the required fields.

Dify version

N/A — CI failure on main (HEAD 2d6babeeb4). Reproduces against the workflow shipped in the repo, not against a deployed instance.

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

The reusable workflow Run DB Migration Test / db-migration-test-mysql (defined in .github/workflows/db-migration-test.yml) fails on every PR that triggers it. The Postgres counterpart in the same workflow passes consistently.

Reproduce by opening any PR whose changes match the path filter on Run DB Migration Test. Recent observed failures on PR #35515:

Both fail at the same point with the same timing.

Relevant log excerpt:

2026-04-28 02:58:03.834  Preparing database migration...
2026-04-28 02:58:03.891  Starting database migration.
2026-04-28 02:58:03.894  Database migration failed: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

The 3 ms gap between Starting and Lost connection suggests the failure is not from a long-running DDL but from the connection dropping immediately on one of the first queries.

✔️ Expected Behavior

db-migration-test-mysql runs the migration to completion against the mysql:8.0 middleware container, the same way db-migration-test-postgres does, and the job exits 0.

❌ Actual Behavior

The job fails roughly 3 ms after the "Starting database migration." log line with pymysql.err.OperationalError (2013, 'Lost connection to MySQL server during query').

Why this has been latent: Skip Duplicate Checks plus the path filter on Run DB Migration Test cause main pushes and most PRs to skip this job entirely. Scanning the last 15+ Main CI Pipeline runs on main shows db-migration-test-mysql listed as skipped on every one of them. The job has effectively not been exercised in main's CI for a long time; the failure is reliably reproducible only on PRs that touch broader api/ paths and is now blocking required checks (API Tests, DB Migration Test) on those PRs.

Environment notes:

Runner migrated to Depot (depot-ubuntu-24.04) in d6dee43c09, image builds in 23648141c9. The Postgres job on the same runner passes, so the runner change alone does not explain the failure, but the migration may have changed timing/resources for the MySQL container.
Container: mysql:8.0, healthcheck mysqladmin ping -u root -p$DB_PASSWORD, interval 1s / timeout 3s / retries 30.
Compose action: hoverkraft-tech/compose-action@v2.6.0 (waits for healthy by default).
Loosely related: #32454 (move to Testcontainers).
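The healthcheck settings above imply a rough time budget before Docker would mark the container unhealthy. The arithmetic below is a sketch based on Docker's documented behaviour (each probe is scheduled one interval after the previous probe completes, and failures during start_period do not count toward retries). The helper name health_budget is hypothetical, and the second call assumes the PR's start_period: 20s is combined with unchanged interval, timeout, and retries values.

```shell
#!/usr/bin/env bash
# Rough bounds (in seconds) on how long a container can keep failing its
# healthcheck before Docker marks it unhealthy.
# Usage: health_budget <interval> <timeout> <retries> [start_period]
health_budget() {
  local interval="$1" timeout="$2" retries="$3" start_period="${4:-0}"
  # Fastest case: every probe fails immediately.
  local min=$(( start_period + retries * interval ))
  # Slowest case: every probe runs until its timeout.
  local max=$(( start_period + retries * (interval + timeout) ))
  echo "${min} ${max}"
}

health_budget 1 3 30      # settings quoted in the issue → 30 120
health_budget 1 3 30 20   # assumed: same settings plus start_period 20s → 50 140
```

Against a first-run initialization of 15-30 s, the lower bound without start_period leaves little margin, which is consistent with the PR's stated reason for adding it.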

Plausible root causes (not validated, flagging for someone with CI infra context):

MySQL 8 healthcheck reports ready before InnoDB recovery / buffer pool is fully usable; the first DDL hits a transient unavailability window.
A specific migration in the chain (the dry-run trace shows ALTER TABLE messages ADD/DROP COLUMN error; and similar) crashes the container due to resource limits on the Depot runner.
Auth-plugin or wait_timeout mismatch between mysql:8.0 defaults and pymysql / SQLAlchemy.

Impact: Any PR that hits the API path filter cannot get a green required DB Migration Test until this is resolved.

extent analysis

TL;DR

The db-migration-test-mysql job fails due to a lost connection to the MySQL server during query, likely caused by the MySQL container not being fully ready or a resource limit issue.

Guidance

Investigate the MySQL container's healthcheck and readiness, ensuring it accounts for InnoDB recovery and buffer pool initialization.
Review the resource limits on the Depot runner and consider increasing them to prevent crashes during migrations.
Verify the compatibility of mysql:8.0 defaults with pymysql and SQLAlchemy, focusing on auth-plugin and wait_timeout settings.
Consider adding a delay or retry mechanism to the db-migration-test-mysql job to account for transient unavailability windows.
Check the migration logs for any specific queries that may be causing the connection loss, such as the ALTER TABLE statements mentioned.

Example

No code snippet is provided as the issue is more related to the CI infrastructure and container configuration.

Notes

The issue may be specific to the Depot runner and mysql:8.0 container, and resolving it may require collaboration with someone familiar with the CI infrastructure.

Recommendation

Apply a workaround by adding a delay or retry mechanism to the db-migration-test-mysql job to account for transient unavailability windows, while investigating the root cause of the issue.


#api #ssr #environment variable #network issue #logging issue
