pytorch - 💡(How to fix) Fix DISABLED test_fused_linear_cel (__main__.AutoChunkerTest) [1 comments, 1 participants]

GitHub stats
pytorch/pytorch#180588 · Fetched 2026-04-17 08:26:17
Comments: 1 · Participants: 1 · Timeline events: 42 · Reactions: 0
Timeline (top): mentioned ×18, subscribed ×18, labeled ×5, commented ×1

Error Message

Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/inductor/test_auto_chunker.py", line 151, in test_fused_linear_cel
    expect = (f(x, y), x.grad, mod.linear.weight.grad, mod.linear.bias.grad)
  File "/var/lib/jenkins/workspace/test/inductor/test_auto_chunker.py", line 143, in f
    loss.backward()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 631, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 379, in backward
    _engine_run_backward(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 882, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.07 GiB. GPU 0 has a total capacity of 22.03 GiB of which 14.36 GiB is free. Process 7785 has 186.00 MiB memory in use. Process 7849 has 188.00 MiB memory in use. Process 10596 has 186.00 MiB memory in use. Process 10636 has 186.00 MiB memory in use. Process 10758 has 342.00 MiB memory in use. Including non-PyTorch memory, this process has 6.58 GiB memory in use. 6.39 GiB allowed; Of the allocated memory 6.31 GiB is allocated by PyTorch, and 45.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
Exception raised from malloc at /var/lib/jenkins/workspace/c10/cuda/CUDACachingAllocator.cpp:1779 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::malloc(void**, signed char, unsigned long, CUstream_st*) from :0
#7 c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::allocate(unsigned long) from :0
#8 at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optional<c10::MemoryFormat>) from ??:0
#9 at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, std::optional<c10::Device>, std::optional<c10::MemoryFormat>) from ??:0
#10 at::detail::empty_cuda(c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) from ??:0
#11 at::detail::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&) from ??:0
#12 at::(anonymous namespace)::create_out(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions const&) from RegisterCUDA_0.cpp:0
#13 at::(anonymous namespace)::structured_log_softmax_backward_cuda_out_functional::set_output_raw_strided(long, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions, c10::ArrayRef<at::Dimname>) from RegisterCUDA_0.cpp:0
#14 at::meta::structured__log_softmax_backward_data::meta(at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from ??:0
#15 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, long, c10::ScalarType), &at::(anonymous namespace)::wrapper_CUDA__log_softmax_backward_data>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, long, c10::ScalarType> >, at::Tensor (at::Tensor const&, at::Tensor const&, long, c10::ScalarType)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from RegisterCUDA_0.cpp:0
#16 at::_ops::_log_softmax_backward_data::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from ??:0
#17 torch::autograd::VariableType::(anonymous namespace)::_log_softmax_backward_data(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from VariableType_1.cpp:0
#18 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType), &torch::autograd::VariableType::(anonymous namespace)::_log_softmax_backward_data>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from VariableType_1.cpp:0
#19 at::_ops::_log_softmax_backward_data::call(at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from ??:0
#20 torch::autograd::generated::LogSoftmaxBackward0_apply_functional(std::vector<at::Tensor, std::allocator<at::Tensor> >&&, std::array<bool, 1ul>, long&, c10::ScalarType&, at::Tensor&) from Functions.cpp:0
#21 torch::autograd::generated::LogSoftmaxBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from ??:0
#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) from ??:0
#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
#27 std::error_code::default_error_condition() const from ??:0
#28 start_thread from ./nptl/pthread_create.c:442
#29 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

To execute this test, run the following from the base repo dir: python test/inductor/test_auto_chunker.py AutoChunkerTest.test_fused_linear_cel

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Root Cause

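This test was disabled because it is failing intermittently in CI; see the recent examples and the most recent trunk workflow logs linked from the issue. The sample failure is a CUDA out-of-memory error raised inside loss.backward() (in _log_softmax_backward_data) while the test builds its expect tuple. The allocator was asked for 3.07 GiB while the process already had 6.58 GiB in use against "6.39 GiB allowed", even though GPU 0 still had 14.36 GiB free. That wording suggests the process was running under a per-process memory limit rather than exhausting the device.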
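
To make the numbers in the error message concrete, the following minimal sketch extracts the sizes and compares the failed request against both the free device memory and the reported allowance. The helper name summarize_oom is hypothetical, and reading "6.39 GiB allowed" as a per-process cap is an assumption based on the wording of the message, not something the issue states.

```python
import re

# Excerpt of the OOM message from the issue (process list omitted for brevity).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 3.07 GiB. "
    "GPU 0 has a total capacity of 22.03 GiB of which 14.36 GiB is free. "
    "Including non-PyTorch memory, this process has 6.58 GiB memory in use. "
    "6.39 GiB allowed; Of the allocated memory 6.31 GiB is allocated by PyTorch, "
    "and 45.53 MiB is reserved by PyTorch but unallocated."
)


def _gib(pattern, text):
    """Return the first captured GiB figure as a float, or None if absent."""
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


def summarize_oom(text):
    """Hypothetical helper: pull the key sizes out of a CUDA OOM message."""
    requested = _gib(r"Tried to allocate ([\d.]+) GiB", text)
    free = _gib(r"of which ([\d.]+) GiB is free", text)
    in_use = _gib(r"this process has ([\d.]+) GiB memory in use", text)
    allowed = _gib(r"([\d.]+) GiB allowed", text)
    return {
        "requested_gib": requested,
        "free_gib": free,
        "in_use_gib": in_use,
        "allowed_gib": allowed,
        # Would the request fit in the memory the device reports as free?
        "fits_on_device": requested <= free,
        # Assumption: "allowed" is a cap on this process's total usage.
        "fits_under_cap": in_use + requested <= allowed,
    }


summary = summarize_oom(OOM_MESSAGE)
print(summary)
```

Under that assumption, the request fits in the device's free memory but not under the process's allowance, which is why adding GPU memory alone may not resolve the failure.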


Platforms: linux

This test was disabled because it is failing in CI. See recent examples and the most recent trunk workflow logs.

Over the past 6 hours, it has been determined flaky in 3 workflow(s) with 3 failures and 3 successes.

Debugging instructions (after clicking on the recent samples link): DO NOT ASSUME THINGS ARE OKAY IF THE CI IS GREEN. We now shield flaky tests from developers so CI will thus be green but it will be harder to parse the logs. To find relevant log snippets:

  1. Click on the workflow logs linked above
  2. Click on the Test step of the job so that it is expanded. Otherwise, the grepping will not work.
  3. Grep for test_fused_linear_cel
  4. There should be several instances run (as flaky tests are rerun in CI) from which you can study the logs.
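
The grep step above can also be done on a downloaded log with a few lines of Python. This is only a sketch: the function name find_test_lines is hypothetical, and the sample log lines are made up for illustration and do not reproduce the real CI log format.

```python
def find_test_lines(log_text, test_name="test_fused_linear_cel"):
    """Return (line_number, line) pairs for every log line mentioning the test."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if test_name in line
    ]


# Hypothetical log excerpt, invented for illustration only.
SAMPLE_LOG = """\
inductor/test_auto_chunker.py::AutoChunkerTest::test_fused_linear_cel FAILED
some unrelated log line
inductor/test_auto_chunker.py::AutoChunkerTest::test_fused_linear_cel PASSED
"""

hits = find_test_lines(SAMPLE_LOG)
for number, line in hits:
    print(f"{number}: {line}")
```

Because flaky tests are rerun in CI, several hits for the same test are expected; comparing a failing run with a passing one is usually the quickest way to spot what differs.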

Test file path: inductor/test_auto_chunker.py

For all disabled tests (by GitHub issue), see https://hud.pytorch.org/disabled.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

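The test fails with a CUDA out-of-memory error during the backward pass. The most likely fix is to lower the test's peak memory use or relax the memory limit the process runs under; the GPU itself still had 14.36 GiB free, so adding GPU memory alone may not help.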

Guidance

  • The error is a CUDA out-of-memory error: a 3.07 GiB allocation failed with 6.58 GiB already in use by the process and 6.39 GiB allowed, while GPU 0 reported 14.36 GiB of 22.03 GiB free. The request appears to exceed the process's allowance, not the device's capacity.
  • The message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True when reserved but unallocated memory is large. Here only 45.53 MiB is reserved but unallocated, so fragmentation is probably not the main cause and this setting may have little effect.
  • Check how much memory is free on the GPU and whether other processes (five are listed, using 186 to 342 MiB each) or a per-process limit constrain the test.
  • Consider reducing the test's memory footprint, for example with smaller inputs, or splitting it into smaller tests.
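
One way to check available GPU memory before running the test is sketched below. It relies on torch.cuda.is_available() and torch.cuda.mem_get_info(), which return availability and a (free, total) pair in bytes; the wrapper name gpu_headroom_gib is hypothetical, and the function returns None when PyTorch or a CUDA device is not present.

```python
def gpu_headroom_gib(device=0):
    """Return (free_gib, total_gib) for a CUDA device, or None if unavailable.

    Hypothetical helper: it only reports device-level memory and does not
    account for any per-process limit the CI job may apply.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    gib = 1024 ** 3
    return free_bytes / gib, total_bytes / gib


headroom = gpu_headroom_gib()
if headroom is None:
    print("No CUDA device (or PyTorch) available")
else:
    print(f"free: {headroom[0]:.2f} GiB of {headroom[1]:.2f} GiB")
```

A device-level figure like this would have looked healthy in the failing run (14.36 GiB free), so it is a necessary check rather than a sufficient one.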

Example

The issue does not include a code fix. The error message only points to environment variables and test configuration as places to reduce memory pressure.
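
As an illustration of the environment-variable route, the sketch below builds the repro command quoted in the issue together with an environment that sets PYTORCH_CUDA_ALLOC_CONF. The helper names build_repro and run_repro are hypothetical; whether the setting actually avoids the failure is untested here.

```python
import os
import subprocess
import sys

# Repro arguments quoted in the issue; run from the base repo dir.
REPRO_ARGS = [
    "test/inductor/test_auto_chunker.py",
    "AutoChunkerTest.test_fused_linear_cel",
]


def build_repro(base_env=None):
    """Return (command, environment) for rerunning the test.

    The allocator setting must be in the environment before the test process
    starts, so it is passed to the child process rather than set afterwards.
    An existing PYTORCH_CUDA_ALLOC_CONF value is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    command = [sys.executable, *REPRO_ARGS]
    return command, env


def run_repro():
    """Run the test in a child process and return its exit code."""
    command, env = build_repro()
    return subprocess.run(command, env=env).returncode
```

Calling run_repro() from a PyTorch checkout reruns the single test with the setting applied; build_repro() alone is enough to inspect the command and environment first.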

Notes

The error message and stack trace point to memory pressure during the backward pass, but the issue does not show the test's input sizes or how the CI job limits memory, so a more specific fix cannot be derived from it alone. The suggestions here follow from the error message and need to be confirmed against the CI environment.

Recommendation

As a low-cost first step, try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which the error message itself suggests. Because little memory is reserved but unallocated (45.53 MiB), it may not be enough on its own; if the failure persists, reduce the test's peak memory or revisit the memory limit the CI job applies.
