pytorch - ✅(Solved) Fix [Test] Environment variable leak in NCCLTraceTestBase causes subsequent tests to fail with named pipe error [1 pull requests, 2 comments, 2 participants]

GitHub stats
pytorch/pytorch#179556 · Fetched 2026-04-08 03:00:28
Comments: 2 · Participants: 2 · Timeline: 37 · Reactions: 0
Timeline (top): mentioned ×13, subscribed ×13, labeled ×5, commented ×2

Error Message

The C++ backend throws an error because it tries to create a named pipe in a directory that no longer exists:

terminate called after throwing an instance of 'c10::Error'
  what(): Error creating named pipe /tmp/…/xxx.pipe, Error: No such file or directory

This is caused by mkfifo failing in DumpPipe::DumpPipe because the parent directory (from the stale env var) was deleted.
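
The same failure mode can be reproduced outside PyTorch with Python's os.mkfifo, which wraps the same POSIX call. This is only an illustration, not the DumpPipe code: the function name and the file name xxx.pipe are hypothetical, and the sketch assumes a POSIX system.

```python
import errno
import os
import shutil
import tempfile


def mkfifo_in_deleted_dir():
    """Try to create a named pipe under a directory that has been removed.

    Returns the errno of the resulting OSError, or None if no error occurred.
    """
    tmp_dir = tempfile.mkdtemp()
    pipe_path = os.path.join(tmp_dir, "xxx.pipe")  # hypothetical file name
    shutil.rmtree(tmp_dir)  # the directory is gone, but the path string survives

    try:
        os.mkfifo(pipe_path)
    except OSError as exc:
        return exc.errno
    os.remove(pipe_path)  # only reached if the pipe was unexpectedly created
    return None


if __name__ == "__main__":
    code = mkfifo_in_deleted_dir()
    print("no error" if code is None else f"{errno.errorcode[code]}: {os.strerror(code)}")
```

The path string plays the role of the stale environment variable: it still looks valid, but its parent directory no longer exists when the pipe is created.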

Root Cause

NCCLTraceTestBase sets TORCH_NCCL_DEBUG_INFO_PIPE_FILE in setUp to a path inside a temporary directory, and tearDown deletes that directory without unsetting the variable. A later test in the same process environment reads the stale value, and mkfifo fails in DumpPipe::DumpPipe because the parent directory no longer exists.

Fix Action

Fixed

PR fix notes

PR #179557: [test] Fix env var leak in NCCLTraceTestBase causing named pipe errors

Description (problem / solution / changelog)

Fixes https://github.com/pytorch/pytorch/issues/179556

Fix a test environment leak in NCCLTraceTestBase where environment variables set in setUp were not cleaned up in tearDown.

The setUp method sets TORCH_NCCL_DEBUG_INFO_PIPE_FILE to a path inside a temporary directory. However, tearDown destroys this temporary directory without unsetting the environment variable.

This leak causes subsequent tests to fail during DumpPipe initialization. The C++ backend reads the stale environment variable and attempts to create a named pipe via mkfifo in the now non-existent directory. This results in a TORCH_CHECK failure from ProcessGroupNCCL.hpp: 'Error creating named pipe ... No such file or directory'.

This patch ensures these variables are removed from os.environ during teardown to prevent test pollution.

Test Plan: Verified that environment variables are cleared after test execution and subsequent tests no longer encounter named pipe creation errors.
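
One way to check the behaviour described in the test plan is to compare os.environ before and after running a test case. The sketch below uses a hypothetical stand-in class, FixedTraceTest, instead of the real NCCLTraceTestBase, so it runs without NCCL or GPUs. Only the two variable names come from the issue; the file names and the helper leaked_vars are assumptions made for illustration.

```python
import os
import shutil
import tempfile
import unittest

VARS = ("TORCH_NCCL_DEBUG_INFO_TEMP_FILE", "TORCH_NCCL_DEBUG_INFO_PIPE_FILE")


class FixedTraceTest(unittest.TestCase):
    """Hypothetical stand-in for a test that sets and then unsets the variables."""

    def setUp(self):
        self.tmp_dir = tempfile.mkdtemp()
        os.environ[VARS[0]] = os.path.join(self.tmp_dir, "trace_")
        os.environ[VARS[1]] = os.path.join(self.tmp_dir, "trace_pipe")

    def tearDown(self):
        shutil.rmtree(self.tmp_dir, ignore_errors=True)
        for name in VARS:
            os.environ.pop(name, None)  # None default: no KeyError if already unset

    def test_noop(self):
        self.assertTrue(os.path.isdir(self.tmp_dir))


def leaked_vars(test_case_class):
    """Run a TestCase class and return the names of variables that are
    present in os.environ afterwards but were not present before."""
    before = set(os.environ)
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case_class)
    result = unittest.TestResult()
    suite.run(result)
    assert result.wasSuccessful(), result.errors + result.failures
    return sorted(set(os.environ) - before)


if __name__ == "__main__":
    print(leaked_vars(FixedTraceTest))
```

The helper only detects newly added names, not values that were overwritten, which is enough for the leak described here.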

Changed files

  • test/distributed/test_c10d_nccl.py (modified, +2/-0)

🐛 Describe the bug

🐛 Bug Description

The NCCLTraceTestBase test class sets environment variables TORCH_NCCL_DEBUG_INFO_TEMP_FILE and TORCH_NCCL_DEBUG_INFO_PIPE_FILE in setUp(), but fails to clean them up in tearDown().

This leads to test pollution. After the test finishes, the temporary directory pointed to by these variables is deleted, but the environment variables remain. When subsequent tests run (or if the test harness reuses the process), the NCCL initialization logic attempts to read these stale variables.

Reproduction Steps

  1. Run tests involving NCCLTraceTestBase.
  2. Ensure the temporary directory created in step 1 is cleaned up.
  3. Run any subsequent test or logic that initializes DumpPipe (or triggers NCCL debug info setup) in the same process environment.
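
The three steps can be simulated in a single Python process without NCCL. In the sketch below, LeakyTraceTest is a hypothetical stand-in for the unfixed test class and later_consumer is a hypothetical stand-in for whatever later reads the variable. Neither calls any PyTorch code, so the sketch shows the stale path rather than the actual c10::Error.

```python
import os
import shutil
import tempfile
import unittest

PIPE_VAR = "TORCH_NCCL_DEBUG_INFO_PIPE_FILE"


class LeakyTraceTest(unittest.TestCase):
    """Hypothetical stand-in: sets the variable in setUp, never unsets it."""

    def setUp(self):
        self.tmp_dir = tempfile.mkdtemp()
        os.environ[PIPE_VAR] = os.path.join(self.tmp_dir, "trace_pipe")

    def tearDown(self):
        # Step 2: the directory is removed, the variable is left in place.
        shutil.rmtree(self.tmp_dir, ignore_errors=True)

    def test_noop(self):
        pass


def later_consumer():
    """Step 3 stand-in: read the variable and check whether its directory exists."""
    path = os.environ.get(PIPE_VAR)
    if path is None:
        return "unset"
    return "ok" if os.path.isdir(os.path.dirname(path)) else "stale"


def reproduce():
    """Run the leaky test (step 1), then report what a later reader sees."""
    saved = os.environ.get(PIPE_VAR)
    try:
        suite = unittest.defaultTestLoader.loadTestsFromTestCase(LeakyTraceTest)
        suite.run(unittest.TestResult())
        return later_consumer()
    finally:
        # Restore the caller's environment so the demo itself does not leak.
        if saved is None:
            os.environ.pop(PIPE_VAR, None)
        else:
            os.environ[PIPE_VAR] = saved


if __name__ == "__main__":
    print(reproduce())
```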

Error Message

The C++ backend throws an error because it tries to create a named pipe in a directory that no longer exists:

terminate called after throwing an instance of 'c10::Error'
  what(): Error creating named pipe /tmp/…/xxx.pipe, Error: No such file or directory

This is caused by mkfifo failing in DumpPipe::DumpPipe because the parent directory (from the stale env var) was deleted.

Expected Behavior

Environment variables should be unset in tearDown to ensure they do not affect subsequent tests.

Suggested Fix

Add os.environ.pop(...) calls in the tearDown method of NCCLTraceTestBase.

I have a PR ready to fix this.

Versions

Version details cannot be disclosed due to NDA

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @mruberry

extent analysis

TL;DR

The most likely fix is to add os.environ.pop() calls in the tearDown method of NCCLTraceTestBase to clean up environment variables.

Guidance

  • Identify the environment variables TORCH_NCCL_DEBUG_INFO_TEMP_FILE and TORCH_NCCL_DEBUG_INFO_PIPE_FILE set in setUp() and add corresponding os.environ.pop() calls in tearDown() to prevent test pollution.
  • Confirm that tearDown still removes the temporary directory and that the variables are unset at the same point, so that neither variable outlives the path it refers to.
  • Test the fix by running the reproduction steps and checking for the error message indicating the failure to create a named pipe.

Example

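def tearDown(self):
    super().tearDown()  # keep the existing cleanup; this is a method of NCCLTraceTestBase
    # Requires `import os`. The None default avoids a KeyError if a variable is already unset.
    os.environ.pop('TORCH_NCCL_DEBUG_INFO_TEMP_FILE', None)
    os.environ.pop('TORCH_NCCL_DEBUG_INFO_PIPE_FILE', None)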
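
As an alternative to explicit pop calls, unittest.mock.patch.dict combined with addCleanup can restore os.environ automatically. This is not what the PR does; the sketch below only shows the pattern on a hypothetical class, ScopedEnvTest, with a hypothetical file name and helper.

```python
import os
import shutil
import tempfile
import unittest
from unittest import mock

PIPE_VAR = "TORCH_NCCL_DEBUG_INFO_PIPE_FILE"


class ScopedEnvTest(unittest.TestCase):
    """Hypothetical test class whose environment changes are undone automatically."""

    def setUp(self):
        self.tmp_dir = tempfile.mkdtemp()
        # Cleanups run after tearDown, in reverse order of registration,
        # and they also run if setUp fails after this point.
        self.addCleanup(shutil.rmtree, self.tmp_dir, ignore_errors=True)

        patcher = mock.patch.dict(
            os.environ, {PIPE_VAR: os.path.join(self.tmp_dir, "trace_pipe")}
        )
        patcher.start()
        self.addCleanup(patcher.stop)  # restores os.environ to its prior state

    def test_var_is_set(self):
        self.assertIn(PIPE_VAR, os.environ)


def run_and_check():
    """Run the test and return (test passed, variable restored to prior value)."""
    before = os.environ.get(PIPE_VAR)
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScopedEnvTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful(), os.environ.get(PIPE_VAR) == before


if __name__ == "__main__":
    print(run_and_check())
```

Registering the cleanup next to the line that sets the variable keeps the two together, so a later edit to setUp is less likely to reintroduce the leak.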

Notes

The provided fix assumes that the issue is solely caused by the stale environment variables. However, without version details, it's uncertain if this fix applies to all affected versions.

Recommendation

Apply the suggested fix by adding os.environ.pop() calls in the tearDown method. It directly addresses the identified cause of the issue and prevents test pollution.
