pytorch - 💡(How to fix) Fix DISABLED test_fused_linear_cel (__main__.AutoChunkerTest) [1 comment, 1 participant]

pytorch-bot[bot] · 2026-04-16T19:02:47Z

[pytorch] Platforms: linux

This test was disabled because it is failing in CI. See recent examples https://hud.pytorch.org/flakytest?name=test fused linear cel…


GitHub stats

pytorch/pytorch#180588•Fetched 2026-04-17 08:26:17


Author: pytorch-bot[bot]

Participants: pytorch-bot[bot]

Timeline (top): mentioned ×18, subscribed ×18, labeled ×5, commented ×1

Error Message

Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/inductor/test_auto_chunker.py", line 151, in test_fused_linear_cel
    expect = (f(x, y), x.grad, mod.linear.weight.grad, mod.linear.bias.grad)
  File "/var/lib/jenkins/workspace/test/inductor/test_auto_chunker.py", line 143, in f
    loss.backward()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 631, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 379, in backward
    _engine_run_backward(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 882, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.07 GiB. GPU 0 has a total capacity of 22.03 GiB of which 14.36 GiB is free. Process 7785 has 186.00 MiB memory in use. Process 7849 has 188.00 MiB memory in use. Process 10596 has 186.00 MiB memory in use. Process 10636 has 186.00 MiB memory in use. Process 10758 has 342.00 MiB memory in use. Including non-PyTorch memory, this process has 6.58 GiB memory in use. 6.39 GiB allowed; Of the allocated memory 6.31 GiB is allocated by PyTorch, and 45.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
Exception raised from malloc at /var/lib/jenkins/workspace/c10/cuda/CUDACachingAllocator.cpp:1779 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::malloc(void**, signed char, unsigned long, CUstream_st*) from :0
#7 c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::allocate(unsigned long) from :0
#8 at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optional<c10::MemoryFormat>) from ??:0
#9 at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, std::optional<c10::Device>, std::optional<c10::MemoryFormat>) from ??:0
#10 at::detail::empty_cuda(c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) from ??:0
#11 at::detail::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&) from ??:0
#12 at::(anonymous namespace)::create_out(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions const&) from RegisterCUDA_0.cpp:0
#13 at::(anonymous namespace)::structured_log_softmax_backward_cuda_out_functional::set_output_raw_strided(long, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions, c10::ArrayRef<at::Dimname>) from RegisterCUDA_0.cpp:0
#14 at::meta::structured__log_softmax_backward_data::meta(at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from ??:0
#15 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, long, c10::ScalarType), &at::(anonymous namespace)::wrapper_CUDA__log_softmax_backward_data>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, long, c10::ScalarType> >, at::Tensor (at::Tensor const&, at::Tensor const&, long, c10::ScalarType)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from RegisterCUDA_0.cpp:0
#16 at::_ops::_log_softmax_backward_data::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from ??:0
#17 torch::autograd::VariableType::(anonymous namespace)::_log_softmax_backward_data(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from VariableType_1.cpp:0
#18 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType), &torch::autograd::VariableType::(anonymous namespace)::_log_softmax_backward_data>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from VariableType_1.cpp:0
#19 at::_ops::_log_softmax_backward_data::call(at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from ??:0
#20 torch::autograd::generated::LogSoftmaxBackward0_apply_functional(std::vector<at::Tensor, std::allocator<at::Tensor> >&&, std::array<bool, 1ul>, long&, c10::ScalarType&, at::Tensor&) from Functions.cpp:0
#21 torch::autograd::generated::LogSoftmaxBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from ??:0
#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) from ??:0
#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
#27 std::error_code::default_error_condition() const from ??:0
#28 start_thread from ./nptl/pthread_create.c:442
#29 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

To execute this test, run the following from the base repo dir:
    python test/inductor/test_auto_chunker.py AutoChunkerTest.test_fused_linear_cel

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Root Cause

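This test was disabled because it is failing in CI. See recent examples and the most recent trunk workflow logs.

The issue itself does not state a root cause. The sample failure is a CUDA out-of-memory error raised while the test computes what appear to be its expected values: loss.backward() (test_auto_chunker.py line 143, called from line 151) fails to allocate 3.07 GiB for the output of _log_softmax_backward_data. The GPU reports 14.36 GiB free of 22.03 GiB, but the process is allowed 6.39 GiB and already has 6.58 GiB in use, so the failure appears to come from the per-process allowance rather than from total GPU capacity.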

RAW_BUFFER

Platforms: linux

This test was disabled because it is failing in CI. See recent examples and the most recent trunk workflow logs.

Over the past 6 hours, it has been determined flaky in 3 workflow(s) with 3 failures and 3 successes.

Debugging instructions (after clicking on the recent samples link): DO NOT ASSUME THINGS ARE OKAY IF THE CI IS GREEN. We now shield flaky tests from developers so CI will thus be green but it will be harder to parse the logs. To find relevant log snippets:

1. Click on the workflow logs linked above.
2. Click on the Test step of the job so that it is expanded. Otherwise, the grepping will not work.
3. Grep for test_fused_linear_cel.
4. There should be several instances run (as flaky tests are rerun in CI) from which you can study the logs.
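
For a log that has been downloaded to disk, the grep step above can be sketched in Python. This is only an illustration: the helper name find_test_lines and the idea of saving the log to a local file are assumptions, not part of the PyTorch CI tooling.

```python
from pathlib import Path

TEST_NAME = "test_fused_linear_cel"


def find_test_lines(log_text, test_name=TEST_NAME, context=1):
    """Return (line_number, text) pairs for lines that mention test_name,
    plus `context` lines before and after each hit."""
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if test_name in line:
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return [(i + 1, lines[i]) for i in sorted(keep)]


def scan_log_file(path, test_name=TEST_NAME, context=1):
    """Read a saved log file (hypothetical local path) and scan it."""
    text = Path(path).read_text(errors="replace")
    return find_test_lines(text, test_name, context)
```

Because flaky tests are rerun, the same test name usually appears several times; comparing the hits from a passing rerun and a failing one is often the quickest way to see what differs.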

<details><summary>Sample error message</summary>

Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/inductor/test_auto_chunker.py", line 151, in test_fused_linear_cel
    expect = (f(x, y), x.grad, mod.linear.weight.grad, mod.linear.bias.grad)
  File "/var/lib/jenkins/workspace/test/inductor/test_auto_chunker.py", line 143, in f
    loss.backward()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 631, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 379, in backward
    _engine_run_backward(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 882, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.07 GiB. GPU 0 has a total capacity of 22.03 GiB of which 14.36 GiB is free. Process 7785 has 186.00 MiB memory in use. Process 7849 has 188.00 MiB memory in use. Process 10596 has 186.00 MiB memory in use. Process 10636 has 186.00 MiB memory in use. Process 10758 has 342.00 MiB memory in use. Including non-PyTorch memory, this process has 6.58 GiB memory in use. 6.39 GiB allowed; Of the allocated memory 6.31 GiB is allocated by PyTorch, and 45.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
Exception raised from malloc at /var/lib/jenkins/workspace/c10/cuda/CUDACachingAllocator.cpp:1779 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::malloc(void**, signed char, unsigned long, CUstream_st*) from :0
#7 c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::allocate(unsigned long) from :0
#8 at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optional<c10::MemoryFormat>) from ??:0
#9 at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, std::optional<c10::Device>, std::optional<c10::MemoryFormat>) from ??:0
#10 at::detail::empty_cuda(c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) from ??:0
#11 at::detail::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&) from ??:0
#12 at::(anonymous namespace)::create_out(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions const&) from RegisterCUDA_0.cpp:0
#13 at::(anonymous namespace)::structured_log_softmax_backward_cuda_out_functional::set_output_raw_strided(long, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions, c10::ArrayRef<at::Dimname>) from RegisterCUDA_0.cpp:0
#14 at::meta::structured__log_softmax_backward_data::meta(at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from ??:0
#15 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, long, c10::ScalarType), &at::(anonymous namespace)::wrapper_CUDA__log_softmax_backward_data>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, long, c10::ScalarType> >, at::Tensor (at::Tensor const&, at::Tensor const&, long, c10::ScalarType)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from RegisterCUDA_0.cpp:0
#16 at::_ops::_log_softmax_backward_data::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from ??:0
#17 torch::autograd::VariableType::(anonymous namespace)::_log_softmax_backward_data(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from VariableType_1.cpp:0
#18 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType), &torch::autograd::VariableType::(anonymous namespace)::_log_softmax_backward_data>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from VariableType_1.cpp:0
#19 at::_ops::_log_softmax_backward_data::call(at::Tensor const&, at::Tensor const&, long, c10::ScalarType) from ??:0
#20 torch::autograd::generated::LogSoftmaxBackward0_apply_functional(std::vector<at::Tensor, std::allocator<at::Tensor> >&&, std::array<bool, 1ul>, long&, c10::ScalarType&, at::Tensor&) from Functions.cpp:0
#21 torch::autograd::generated::LogSoftmaxBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from ??:0
#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) from ??:0
#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
#27 std::error_code::default_error_condition() const from ??:0
#28 start_thread from ./nptl/pthread_create.c:442
#29 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81


To execute this test, run the following from the base repo dir:
    python test/inductor/test_auto_chunker.py AutoChunkerTest.test_fused_linear_cel

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

</details>

Test file path: inductor/test_auto_chunker.py

For all disabled tests (by GitHub issue), see https://hud.pytorch.org/disabled.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

The most likely fix for the failing test is to reduce its peak GPU memory use or to raise the memory available to the test process, so that the backward pass no longer hits a CUDA out-of-memory error.

Guidance

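The error message indicates a CUDA out-of-memory error: the backward pass tried to allocate 3.07 GiB while the process already had 6.58 GiB in use against an allowance of 6.39 GiB.
The error message suggests setting the PYTORCH_CUDA_ALLOC_CONF environment variable to expandable_segments:True to avoid memory fragmentation. That advice applies when reserved but unallocated memory is large; here it is only 45.53 MiB, so fragmentation is unlikely to be the main cause and this setting may not be enough on its own.
Verify how much memory the test process may actually use. The GPU reports 14.36 GiB free of 22.03 GiB, and five other processes hold between 186.00 MiB and 342.00 MiB each, so the binding limit appears to be the 6.39 GiB allowance rather than total GPU capacity.
Consider optimizing the test to use less memory or splitting it into smaller tests to reduce the memory requirements.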
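
The figures in the points above can be pulled out of the allocator message and compared mechanically. The sketch below uses only the standard library; the function names and the fragmentation rule of thumb (reserved but unallocated memory must be at least as large as the request for fragmentation alone to explain the failure) are assumptions made for illustration, not PyTorch behavior.

```python
import re

# Excerpt of the allocator message quoted in this issue.
SAMPLE = (
    "CUDA out of memory. Tried to allocate 3.07 GiB. "
    "GPU 0 has a total capacity of 22.03 GiB of which 14.36 GiB is free. "
    "Including non-PyTorch memory, this process has 6.58 GiB memory in use. "
    "6.39 GiB allowed; Of the allocated memory 6.31 GiB is allocated by PyTorch, "
    "and 45.53 MiB is reserved by PyTorch but unallocated."
)

_SIZE = r"([\d.]+) (KiB|MiB|GiB)"
_TO_GIB = {"KiB": 1 / 1024**2, "MiB": 1 / 1024, "GiB": 1.0}
_PATTERNS = {
    "requested": rf"Tried to allocate {_SIZE}",
    "gpu_free": rf"of which {_SIZE} is free",
    "process_in_use": rf"this process has {_SIZE} memory in use",
    "allowed": rf"{_SIZE} allowed",
    "reserved_unallocated": rf"{_SIZE} is reserved by PyTorch but unallocated",
}


def parse_oom(message):
    """Extract the sizes named in _PATTERNS from an OOM message, in GiB."""
    stats = {}
    for key, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            stats[key] = float(match.group(1)) * _TO_GIB[match.group(2)]
    return stats


def fragmentation_could_explain(stats):
    """Heuristic (assumption): fragmentation alone can only explain the OOM
    if the reserved-but-unallocated pool is at least as big as the request."""
    return stats["reserved_unallocated"] >= stats["requested"]


if __name__ == "__main__":
    stats = parse_oom(SAMPLE)
    for key, gib in stats.items():
        print(f"{key}: {gib:.2f} GiB")
    print("fragmentation plausible:", fragmentation_could_explain(stats))
```

For the message in this issue the heuristic returns False: about 0.04 GiB is reserved but unallocated, far less than the 3.07 GiB requested.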

Example

The issue provides no fix. The error message only suggests adjusting the test configuration or environment variables to reduce memory pressure.
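
One way to try the environment-variable suggestion is to launch the repro command from the issue with PYTORCH_CUDA_ALLOC_CONF set. The sketch below is a minimal wrapper under stated assumptions: the helper names repro_env and run_repro are hypothetical, and the command must be started from the base directory of a PyTorch checkout.

```python
import os
import subprocess
import sys

# Repro command quoted in the issue, run with the current interpreter.
REPRO_CMD = [
    sys.executable,
    "test/inductor/test_auto_chunker.py",
    "AutoChunkerTest.test_fused_linear_cel",
]


def repro_env(base=None):
    """Copy the environment and enable expandable segments.

    Note: this overwrites any existing PYTORCH_CUDA_ALLOC_CONF value.
    """
    env = dict(os.environ if base is None else base)
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    return env


def run_repro(repo_dir):
    """Run the test from repo_dir and return its exit code."""
    result = subprocess.run(REPRO_CMD, cwd=repo_dir, env=repro_env(), check=False)
    return result.returncode
```

Calling run_repro(".") from the repository root returns 0 when the test passes. Because the failure is intermittent, a single passing run does not show that the setting helped; several runs with and without it are needed for a fair comparison.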

Notes

The provided error message and stack trace suggest a memory-related issue, but without more information about the test or the environment, it's difficult to provide a more specific solution. The suggested fix is based on the error message and may require further investigation to resolve the issue.

Recommendation

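As a first, low-cost experiment, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which the error message suggests for avoiding memory fragmentation. Because only 45.53 MiB is reserved but unallocated in the sample failure, this may not be sufficient; if the error persists, reducing the test's peak memory or raising the memory allowance of the test process is more likely to let the test run successfully.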
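
To judge whether a change actually lowers memory use, the peak allocation of a step can be measured with PyTorch's CUDA memory statistics. The sketch below is an illustration, not code from the test: measure_peak_gib and bytes_to_gib are hypothetical helpers, and fn stands for whatever callable runs the forward and backward pass under study.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def measure_peak_gib(fn, device=0):
    """Run fn() and return the peak memory PyTorch allocated on `device`, in GiB.

    Requires PyTorch with a CUDA device; torch is imported lazily so that
    bytes_to_gib stays usable without it.
    """
    import torch

    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)
    fn()
    torch.cuda.synchronize(device)
    return bytes_to_gib(torch.cuda.max_memory_allocated(device))
```

A measured peak close to or above the 6.39 GiB allowance reported in the error would be consistent with the intermittent failure. Note that this counts only memory allocated by PyTorch, not the non-PyTorch memory that the error message also includes in its in-use total.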
