vllm - ✅(Solved) Fix [Transformers v5] Distributed shutdown test timeout [50 pull requests, 7 comments, 4 participants]

GitHub stats
vllm-project/vllm#38384 (fetched 2026-04-08 01:41:42)
Comments: 7
Participants: 4
Timeline events: 31
Reactions: 0
Timeline (top): mentioned ×9, subscribed ×9, commented ×7, labeled ×2

PR fix notes

PR #38779: [Transformers v5] Add engine_core.shutdown() to LLMEngine.__del__

Description (problem / solution / changelog)

Purpose

fix https://github.com/vllm-project/vllm/issues/38384

huggingface_hub >= 1.0.0 switched its HTTP backend from requests to httpx. httpx creates non-daemon threads, which causes threading._shutdown() to hang when the Python process shuts down. These changes could lead to unexpected issues in the CI pipeline.

Issue https://github.com/vllm-project/vllm/issues/38384 mentions that "This test seems to be reliably failing in CI when v5 is installed, but I have not been able to reproduce it locally." This might be related to a specific combination of CI environment conditions (huggingface-hub >= 1.0.0 + no cache + Docker networking) that keeps the httpx threads alive when the child process exits.

Update: The current version of huggingface-hub in the main branch CI is 0.36.2. Transformers v5, however, requires huggingface_hub >= 1.0.0, and that upgrade brings the HTTP backend transition with it.
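The general Python behaviour behind this claim can be reproduced without httpx. The sketch below is not vLLM or huggingface_hub code: the `CHILD` script and `run_child` helper are hypothetical names used only for illustration. It starts a child interpreter whose only work is a sleeping background thread and measures how long the process takes to exit.

```python
import subprocess
import sys
import time

# Hypothetical child script: the main thread finishes immediately,
# leaving one background thread asleep.
CHILD = """
import threading, time
t = threading.Thread(target=time.sleep, args=({sleep},), daemon={daemon})
t.start()
"""


def run_child(daemon, sleep=1.5):
    """Run the child script and return its wall-clock exit time in seconds."""
    start = time.monotonic()
    subprocess.run(
        [sys.executable, "-c", CHILD.format(sleep=sleep, daemon=daemon)],
        check=True,
    )
    return time.monotonic() - start


if __name__ == "__main__":
    # Non-daemon: interpreter shutdown joins the thread, so exit waits for the sleep.
    print(f"non-daemon thread: {run_child(daemon=False):.2f}s")
    # Daemon: the interpreter does not wait for the thread.
    print(f"daemon thread:     {run_child(daemon=True):.2f}s")
```

If a non-daemon thread never finishes, the same join at interpreter shutdown blocks indefinitely, which is the kind of hang this PR description attributes to the httpx threads.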

Test Result

CI test result in https://github.com/vllm-project/vllm/pull/30566

link https://buildkite.com/vllm/ci/builds/59212/steps/canvas?jid=019d484e-3c50-4ece-a743-6073d55b5eb5:

[2026-04-01T09:46:39Z] (EngineCore pid=663) INFO 04-01 09:46:39 [core.py:1210] Shutdown initiated (timeout=0)
[2026-04-01T09:46:39Z] (EngineCore pid=663) INFO 04-01 09:46:39 [core.py:1233] Shutdown complete
[2026-04-01T09:46:40Z] gpu memory used/total (GiB): 0=0.45/22.49;
[2026-04-01T09:46:40Z] Done waiting for free GPU memory on devices devices=[0] (threshold='2.0 GiB') dur_s=0.00
[2026-04-01T09:46:40Z] PASSED

The GPU memory wasn't released at all after del llm. It stayed at 20.65 GiB, far above the 2.0 GiB clearance threshold, right up until the timeout:

[2026-04-01T09:48:18Z] gpu memory used/total (GiB): 0=20.65/22.49; 1=20.65/22.49;
[2026-04-01T09:48:23Z] gpu memory used/total (GiB): 0=20.65/22.49; 1=20.65/22.49;
[2026-04-01T09:48:28Z] gpu memory used/total (GiB): 0=20.65/22.49; 1=20.65/22.49;
[2026-04-01T09:48:33Z] gpu memory used/total (GiB): 0=20.65/22.49; 1=20.65/22.49;
[2026-04-01T09:48:38Z] gpu memory used/total (GiB): 0=20.65/22.49; 1=20.65/22.49;
[2026-04-01T09:48:43Z] gpu memory used/total (GiB): 0=20.65/22.49; 1=20.65/22.49;
[2026-04-01T09:48:48Z] gpu memory used/total (GiB): 0=20.65/22.49; 1=20.65/22.49;
[2026-04-01T09:48:53Z] gpu memory used/total (GiB): 0=20.65/22.49; 1=20.65/22.49;
[2026-04-01T09:48:58Z] gpu memory used/total (GiB): 0=20.65/22.49; 1=20.65/22.49;
[2026-04-01T09:48:59Z] +++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++

update: Notice this detail 👆: if the child process had actually been SIGKILLed, the OS should have reclaimed the GPU memory within a few seconds. Instead, the logs show it pinned at 20.65 GiB for a full 2 minutes. This strongly implies that the child process is still alive, rather than "killed but memory not released." It seems a GC delay might be preventing the SIGTERM from being dispatched. But here is the question: would this cause the test to fail reliably in CI while remaining un-reproducible locally?🤔

I took a look at the code related to Worker. I initially suspected a GC delay, but test_vllm_gc_ed confirms there are no circular references — del llm triggers cleanup immediately. The 2-minute memory hold is better explained by orphaned Worker processes: once EngineCore is SIGKILLed, its Worker children become orphans outside kill_process_tree's reach, and they continue holding GPU memory indefinitely.
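The difference between "del triggers cleanup immediately" and "cleanup waits for the cyclic GC" can be illustrated with a small standalone sketch. The `Engine` class below is a hypothetical stand-in, not vLLM's class; it only assumes that cleanup is registered through weakref.finalize, as this discussion describes.

```python
import gc
import weakref


class Engine:
    """Hypothetical stand-in for an object whose cleanup is tied to its lifetime."""

    def __init__(self, events):
        # The callback must not reference `self`, otherwise the object could never be freed.
        self._finalizer = weakref.finalize(self, events.append, "shutdown")


def demo():
    events = []
    gc.disable()  # leave only reference counting active, so the result is deterministic
    try:
        a = Engine(events)
        del a  # no cycle: refcount hits zero and the finalizer runs right here
        after_plain_del = list(events)

        b = Engine(events)
        b.self_ref = b  # artificial reference cycle
        del b  # refcount never reaches zero, so nothing happens yet
        after_cyclic_del = list(events)

        gc.collect()  # the cyclic collector finally frees `b`
        after_collect = list(events)
    finally:
        gc.enable()
    return after_plain_del, after_cyclic_del, after_collect


if __name__ == "__main__":
    print(demo())
```

The second case is the one that matters for a test with a deadline: nothing guarantees when the cyclic collector will next run.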

So, a more complete picture of the situation looks like this: EngineCore likely hangs at threading._shutdown() due to httpx background threads from huggingface_hub. The parent process's 5-second join timeout then expires, and it SIGKILLs EngineCore. At that point, _ensure_worker_termination() never runs, so Worker processes become orphaned and continue holding ~20 GiB of GPU memory indefinitely.

Adding os._exit() after engine_core.shutdown() bypasses threading._shutdown(), allowing EngineCore to exit cleanly within the parent's timeout. This preserves the entire shutdown chain: engine_core.shutdown() → model_executor.shutdown() → _ensure_worker_termination() gets to run, which SIGTERMs/SIGKILLs Workers while they are still child processes of EngineCore, so the OS properly reclaims their GPU memory.
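As a standalone illustration of why os._exit() sidesteps the hang (again generic Python, not the PR's actual change; `CHILD` and `run_child` are hypothetical names), the child below leaves a non-daemon thread sleeping for 30 seconds and still exits at once:

```python
import subprocess
import sys
import time

# Hypothetical child script: a long-lived non-daemon thread, then a hard exit.
CHILD = """
import os, threading, time
threading.Thread(target=time.sleep, args=(30,)).start()  # non-daemon by default
os._exit(0)  # skips threading._shutdown(), atexit handlers and stdio flushing
"""


def run_child(timeout=10):
    """Return (exit code, seconds until the child process was gone)."""
    start = time.monotonic()
    proc = subprocess.run([sys.executable, "-c", CHILD], timeout=timeout)
    return proc.returncode, time.monotonic() - start


if __name__ == "__main__":
    code, elapsed = run_child()
    print(f"exit code {code} after {elapsed:.2f}s")
```

The trade-off is that os._exit() also skips every other interpreter-level cleanup step, so anything that must run (here, the shutdown chain described above) has to finish before the call.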

cc @hmellor

update: weakref.finalize relies on the reference count reaching zero or garbage collection (GC). However, the global httpx client in huggingface_hub >= 1.0.0 may indirectly hold references to MPClient or its higher-level objects through mechanisms such as callbacks or caching. In a CI environment (no cache → triggers HTTP download → httpx client is created), this reference chain prevents the reference count from dropping to zero, causing weakref.finalize to never trigger.

Changed files

  • vllm/v1/engine/llm_engine.py (modified, +2/-0)

PR #4084: Fix reference cycle in hf_raise_for_status delaying object destruction

Description (problem / solution / changelog)

Summary

Commit 098091fe ("#3889") changed hf_raise_for_status() from inline raises to storing exceptions in local variables before raising:

# Before (v1.5.0) — no cycle
raise _format(RemoteEntryNotFoundError, message, response) from e

# After (v1.6.0) — creates cycle
entry_err = _format(RemoteEntryNotFoundError, message, response)
entry_err.repo_type = repo_type
entry_err.repo_id = repo_id
raise entry_err from e

This creates a CPython reference cycle:

  1. entry_err.__cause__ → e (the original HTTPStatusError)
  2. e.__traceback__ → traceback → tb_frame → hf_raise_for_status frame
  3. hf_raise_for_status frame → f_locals['entry_err'] → back to (1)

The cycle prevents the exception from being freed by refcounting when except blocks exit. The cyclic GC will eventually collect it, but the delay is long enough to cause real problems. When this exception propagates through callers (e.g. transformers.cached_files → LLM.__init__), the traceback chain holds a reference to self in the caller's frame, preventing deterministic cleanup.

In vLLM, this means del llm doesn't immediately trigger the weakref.finalize that sends SIGTERM to the EngineCore subprocess, so GPU memory isn't released until the cyclic GC eventually runs. Bisected to v1.6.0; v1.5.0 works fine. Related: vllm-project/vllm#38384.
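The two raising patterns can be compared in isolation. The following is a minimal sketch, not the huggingface_hub source: `HubError`, `_fail`, `raise_via_local`, `raise_inline` and `survives_refcounting` are hypothetical names, and a plain ValueError stands in for the HTTP error. With the cyclic GC disabled, a weak reference shows whether the exception is freed as soon as the except block ends.

```python
import gc
import weakref


class HubError(Exception):
    """Hypothetical stand-in for a huggingface_hub error class."""


def _fail():
    raise ValueError("simulated HTTP failure")


def raise_via_local():
    try:
        _fail()
    except ValueError as e:
        err = HubError("entry not found")  # stays in this frame's locals
        err.repo_id = "user/repo"
        raise err from e  # err.__traceback__ -> this frame -> err: a cycle


def raise_inline():
    try:
        _fail()
    except ValueError as e:
        raise HubError("entry not found") from e  # no local keeps the new exception


def survives_refcounting(raiser):
    """Return True if the exception is still alive after its except block exits."""
    gc.disable()  # leave only reference counting active
    try:
        try:
            raiser()
        except HubError as exc:
            ref = weakref.ref(exc)
        return ref() is not None
    finally:
        gc.enable()
        gc.collect()  # clean up whatever the cycle left behind


if __name__ == "__main__":
    print("local-variable pattern survives:", survives_refcounting(raise_via_local))
    print("inline pattern survives:        ", survives_refcounting(raise_inline))
```

In the local-variable variant the exception stays alive until a cyclic collection happens, together with everything its traceback frames reference.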

Fix

Move repo_type/repo_id/bucket_id assignment into helper functions (_format_with_repo_info, _format_with_bucket_info) so the exception object is never stored as a local variable in hf_raise_for_status's frame. This preserves the functionality added in #3889 while avoiding the reference cycle.

Test plan

  • Added unit tests for _format_with_repo_info and _format_with_bucket_info
  • Verified all existing hf_raise_for_status tests still pass
  • Verified vLLM's tests/v1/shutdown/test_delete.py passes (8/8) with this fix

[!NOTE] Low risk: refactors how Hub HTTP exceptions are constructed/raised while preserving error types and attached metadata; main risk is subtle differences in exception object lifetimes or attributes in edge cases.

Overview: Fixes hf_raise_for_status to avoid creating reference cycles when enriching raised HTTP exceptions with repo/bucket metadata.

This introduces _format_with_repo_info and _format_with_bucket_info helpers that set repo_type/repo_id/bucket_id on the error without keeping the exception in local variables, and updates the relevant raise paths to use them. Adds focused unit tests covering both helpers.

Changed files

  • src/huggingface_hub/utils/_http.py (modified, +38/-19)
  • tests/test_utils_http.py (modified, +56/-1)

PR #4092: Fix reference cycle in hf_raise_for_status causing delayed object destruction

Description (problem / solution / changelog)

Why? This PR is equivalent to https://github.com/huggingface/huggingface_hub/pull/4084 but slightly cleaner. It should solve the vLLM garbage collection problem (https://github.com/vllm-project/vllm/issues/38384). It seems that https://github.com/huggingface/huggingface_hub/pull/3889 introduced a reference cycle in hf_raise_for_status, which prevents memory from being freed promptly.

I have run the regression test introduced in this PR and confirmed that it fails on main.

Kudos to @yg7445 for investigating + suggesting the solution. This was not an easy bug to spot!


Summary

Fixes a CPython reference cycle in hf_raise_for_status() that prevents deterministic exception cleanup.

When exceptions are stored in local variables before raise ... from e, a reference cycle forms: err.__cause__ → e → e.__traceback__ → frame → f_locals['err'] → back to start. This delays garbage collection and causes real issues downstream (e.g., vLLM GPU memory not released until cyclic GC runs). See vllm-project/vllm#38384.

Alternative approach to #4084: instead of introducing two new helper functions, this extends the existing _format() with **attrs so attributes (like repo_type, repo_id, bucket_id) are set inside _format and the exception is never stored in the caller's frame.

Changes

  • Extended _format() to accept **attrs keyword arguments, set as attributes on the error
  • Converted all 5 call sites in hf_raise_for_status from local-variable pattern to inline raise _format(...) from e
  • Added regression test using weakref to verify no reference cycle exists
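The changes listed above can be sketched as follows. This is an illustration under stated assumptions, not the merged code: the `_format` signature here is simplified (the real helper also receives the HTTP response), and `HubError`, `raise_for_status_sketch` and `check_no_cycle` are hypothetical names. The check mirrors the idea of a weakref-based regression test.

```python
import gc
import weakref


class HubError(Exception):
    """Hypothetical stand-in for a huggingface_hub error class."""


def _format(error_type, message, **attrs):
    # Simplified, hypothetical signature for illustration only.
    err = error_type(message)
    for name, value in attrs.items():
        setattr(err, name, value)
    # This frame returns normally, so no traceback keeps it (or `err`) alive.
    return err


def raise_for_status_sketch():
    try:
        raise ValueError("simulated HTTP 404")
    except ValueError as e:
        # Inline raise: the new exception is never bound to a local in this frame.
        raise _format(HubError, "entry not found", repo_type="model", repo_id="user/repo") from e


def check_no_cycle():
    """Return (freed_by_refcounting, attached_attrs)."""
    gc.disable()  # leave only reference counting active
    try:
        try:
            raise_for_status_sketch()
        except HubError as exc:
            ref = weakref.ref(exc)
            attrs = (exc.repo_type, exc.repo_id)
        return ref() is None, attrs
    finally:
        gc.enable()


if __name__ == "__main__":
    print(check_no_cycle())
```

The attributes are still attached to the exception, but the only frame that ever held it in a local has already returned by the time the exception is raised.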

🤖 Generated with Claude Code


Changed files

  • src/huggingface_hub/utils/_http.py (modified, +14/-21)
  • tests/test_utils_http.py (modified, +37/-0)

PR #39695: Introduce De-dup/Similarity-Check in CI Workflow for PR/Issue

Description (problem / solution / changelog)

Co-Author: Trae + GPT5.3-Codex

Purpose

Example to explain https://github.com/vllm-project/vllm/issues/39694

Example Algorithm:

  • Scoring: 0.75 * text_similarity + 0.25 * file_overlap.
  • Threshold used for the report: 0.75.
  • Uses the GitHub Actions cache to temporarily store GitHub API results for the most recent 1000 PRs and 500 issues.
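A minimal sketch of the PR scoring rule stated above. The weights and threshold come from this description; the function names are hypothetical, and how text_similarity and file_overlap themselves are computed is not specified here, so they are taken as inputs in the range 0 to 1.

```python
TEXT_WEIGHT = 0.75       # weight of text similarity, from the description
FILE_WEIGHT = 0.25       # weight of changed-file overlap, from the description
REPORT_THRESHOLD = 0.75  # pairs at or above this score are reported


def combined_score(text_similarity, file_overlap):
    """Weighted score for a PR pair; both inputs are expected in [0, 1]."""
    return TEXT_WEIGHT * text_similarity + FILE_WEIGHT * file_overlap


def is_reported(text_similarity, file_overlap):
    """True if the pair would appear in the high-similarity report."""
    return combined_score(text_similarity, file_overlap) >= REPORT_THRESHOLD


if __name__ == "__main__":
    # Values from the 86% row of the PR results (text 98%, files 50%).
    print(f"{combined_score(0.98, 0.50):.2f}", is_reported(0.98, 0.50))
    # A made-up pair with identical files but weaker text similarity.
    print(f"{combined_score(0.60, 1.00):.2f}", is_reported(0.60, 1.00))
```

Because text carries three quarters of the weight, a pair can be reported with little file overlap, which is consistent with rows in the results that show high text similarity and low file overlap.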

Test Plan

Ran the similarity check on the 1000 most recent PRs:

High-similarity pairs (>= 0.75): 26

Test Result

PR Similarity

  • Repo: vllm-project/vllm
  • PR count: 1000
  • Candidate pairs: 17375
  • High-similarity pairs (>= 0.75): 26

| Score | Text | Files | PR A | PR B |
|---|---|---|---|---|
| 100% | 100% | 100% | #39553 Okakarpa shadow clone | #39577 Okakarpa shadow clone |
| 99% | 99% | 100% | #37929 [Core] Use standalone autograd_cache_key for compilation dedup optimization | #39517 [Core] Use standalone autograd_cache_key for compilation dedup optimization |
| 96% | 95% | 100% | #37947 [DRAFT][XPU] Upgrade torch 2.11 for xpu | #39257 [XPU] update triton version for torch 2.11 upgrade |
| 96% | 95% | 100% | #37947 [DRAFT][XPU] Upgrade torch 2.11 for xpu | #39313 [XPU] upgrade to triton-xpu 3.7.0 |
| 95% | 97% | 88% | #38249 [Misc] Organize NixlConnector into own directory | #39354 [KVConnector][NIXL] Organize NIXL connector into its own directory |
| 95% | 93% | 100% | #39410 [XPU] Disable fusion passes on XPU Platform | #39671 use spawn multiproc method on xpu |
| 94% | 92% | 100% | #38856 [LMCache] vLLM Block Allocation Event | #39719 fix(lmcache): correct store for cached requests while enable prefix cache |
| 94% | 91% | 100% | #39606 Pass extra_config to the constructor of LMCacheMPXXXAdapter | #39719 fix(lmcache): correct store for cached requests while enable prefix cache |
| 94% | 91% | 100% | #39257 [XPU] update triton version for torch 2.11 upgrade | #39313 [XPU] upgrade to triton-xpu 3.7.0 |
| 91% | 100% | 67% | #39432 Gfx1250 wip | #39437 Gfx1250 wip rebase test |
| 90% | 92% | 85% | #36823 [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace | #38775 [vLLM IR] 4/N Compile native implementation |
| 90% | 86% | 100% | #39402 [kv_offload+HMA[10/N]: Support load with multiple KV groups | #39403 [kv_offload+HMA][11/N]: Support store with multiple KV groups |
| 86% | 98% | 50% | #23995 Feature/deepseek v31 lora support | #39661 [DOC] Update Gemma 4 |
| 82% | 76% | 100% | #39110 [Core] Disable HMA for eagle/MTP with sliding window models | #39376 [Core] Disable HMA for eagle/MTP with sliding window models |
| 82% | 76% | 100% | #39401 [kv_offload+HMA][9/N]: Support lookup with multiple KV groups | #39402 [kv_offload+HMA[10/N]: Support load with multiple KV groups |
| 82% | 76% | 100% | #39401 [kv_offload+HMA][9/N]: Support lookup with multiple KV groups | #39403 [kv_offload+HMA][11/N]: Support store with multiple KV groups |
| 80% | 96% | 33% | #26583 add log for request trace | #39646 V0.12.0 support n sampling delay split to eliminate redundant prefill computation and memory |
| 79% | 97% | 22% | #35721 [LoRA] Support dual CUDA streams-Linear Layer | #37297 [LoRA] Support FP8 LoRA E2E inference-dense model |
| 79% | 94% | 32% | #39153 [Frontend][4/n] Improve pooling entrypointspooling. | |
| 79% | 74% | 91% | #38775 [vLLM IR] 4/N Compile native implementation | #39453 Port activations to IR op 1/3 |
| 79% | 88% | 50% | #39312 [Mergify] Update model vendor auto-label rules | #39429 [CI/Build] Update auto-rebase rule |
| 78% | 100% | 13% | #39723 [SimpleCPUOffloadConnector]: Add support for reset_cache() | #39726 [SimpleCPUOffloadConnector]: Add support for reset_cache() |
| 77% | 98% | 14% | #38780 [vLLM IR][RMSNorm] Port GemmaRMSNorm to vLLM IR Ops | #38798 [vLLM IR][RMSNorm] Port RMSNormGated to vLLM IR Ops |
| 77% | 69% | 100% | #39744 [v1] Expose num_prompt_tokens in CommonAttentionMetadata | #39745 [v1] Expose num_prompt_tokens in CommonAttentionMetadata |
| 77% | 81% | 62% | #23133 Split compressed_tensors_moe.py into separate wna16, int8, fp8, nvfp4 | #29427 [Refactor] Split up compressed_tensors_moe.py into separate files per method |
| 76% | 82% | 59% | #39267 [vllm IR] 1/N Port FP8 Quantization to vLLM IR Ops | #39481 [vllm IR] Port FP8 Quantization to vLLM IR Ops |

Similar Issues:

  • Repo: vllm-project/vllm
  • Issue count: 500
  • Candidate pairs: 9909
  • High-similarity pairs (>= 0.75): 12

| Match Score | Desc Similarity | Title Overlap | Issue A | Issue B |
|---|---|---|---|---|
| 100% | 100% | 100% | #39270 [Bug]: Qwen3.5 crashes when using suffix-decoding | #39271 [Bug]: Qwen3.5 crashes when using suffix-decoding |
| 100% | 100% | 100% | #39372 [Bug]: | #39373 [Bug]: |
| 100% | 100% | 100% | #39372 [Bug]: | #39374 [Bug]: |
| 100% | 100% | 100% | #39373 [Bug]: | #39374 [Bug]: |
| 100% | 100% | 100% | #39433 RFC: Add logit_scale to PoolerConfig for Affine Score Calibration (Platt Scaling) | #39434 [RFC]: Add logit_scale to PoolerConfig for Affine Score Calibration (Platt Scaling) |
| 100% | 100% | 100% | #39299 [Performance] DSV3.2 Indexer: Overlap indexer k+w path | |
| 81% | 95% | 25% | #31888 [Usage]: rollout slow | #38642 [Usage]: 模型返回值reasoning_content (model return value reasoning_content) |
| 80% | 88% | 50% | #38734 [Transformers v5] SarvamMLAForCausalLM | #38740 [Transformers v5] NemotronParseForConditionalGeneration |
| 79% | 94% | 20% | #29245 [Usage]: 启动 qwen3 vl 超级超级超级慢,sglang 启动很快,可能的原因是什么? (starting qwen3 vl is extremely slow while sglang starts quickly; what could be the cause?) | #38642 [Usage]: 模型返回值reasoning_content (model return value reasoning_content) |
| 77% | 92% | 17% | #29245 [Usage]: 启动 qwen3 vl 超级超级超级慢,sglang 启动很快,可能的原因是什么? (starting qwen3 vl is extremely slow while sglang starts quickly; what could be the cause?) | #31888 [Usage]: rollout slow |
| 77% | 89% | 29% | #38384 [Transformers v5] Distributed shutdown test timetout | #38740 [Transformers v5] NemotronParseForConditionalGeneration |
| 76% | 88% | 31% | #31661 [Bug]: jina-reranker-m0 [image_index] IndexError: list index out of range | #32151 [Bug]: jina-reranker-m0 infer error |


Changed files

  • .github/workflows/detect-duplicate-issues.yml (added, +64/-0)
  • .github/workflows/detect-duplicate-prs.yml (added, +55/-0)
  • .github/workflows/scripts/detect_duplicate_issues.py (added, +453/-0)
  • .github/workflows/scripts/detect_duplicate_prs.py (added, +317/-0)


This is a sub-issue forming part of the work in https://github.com/vllm-project/vllm/issues/38379. Please read the description of that issue before beginning work on this one.

Which test is failing?

This test seems to be reliably failing in CI when v5 is installed, but I have not been able to reproduce it locally.

$ pytest tests/v1/shutdown/test_delete.py::test_llm_delete[False-True-2-hmellor/tiny-random-LlamaForCausalLM]
...
Failed: Timeout >120.0s

How to configure my environment?

It's very important that you install both vLLM and Transformers from source so that your test results reflect the current state of both libraries.

# Or your fork
git clone https://github.com/huggingface/transformers.git
git clone https://github.com/vllm-project/vllm.git

cd vllm
VLLM_USE_PRECOMPILED=1 uv pip install -e .
uv pip install -e ../transformers

extent analysis

Fix Plan

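Raising the timeout or speeding up the test would only mask the symptom. The PR notes above attribute the failure to a reference cycle in huggingface_hub's hf_raise_for_status (bisected to v1.6.0), which delays the weakref.finalize cleanup after del llm and leaves GPU memory held until the timeout.

Steps to Fix

  • Preferably, run the test against a huggingface_hub build that includes the reference-cycle fix (huggingface_hub PR #4084 or #4092). PR #4084 reports tests/v1/shutdown/test_delete.py passing (8/8) with that fix.
  • As a stopgap only, increase the timeout value in the test_delete.py file:
    • Locate the test_llm_delete function
    • Add or modify the timeout parameter, for example:
      import pytest
      
      @pytest.mark.timeout(300)  # raise the limit to 5 minutes (requires the pytest-timeout plugin)
      def test_llm_delete():
          ...  # existing test body; keep the existing parametrization and arguments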

Verification

  • Run the test again using pytest tests/v1/shutdown/test_delete.py::test_llm_delete[False-True-2-hmellor/tiny-random-LlamaForCausalLM]
  • Verify that the test completes within the new timeout value or runs faster without timing out

Extra Tips

  • Consider adding more specific error handling to the test to provide more informative error messages in case of failures
  • Review the test environment and dependencies to ensure they are properly configured and up-to-date
