pytorch - 💡(How to fix) Fix Failing test TestLocalDTensorOpsCPU.test_dtensor_op_db_linalg_multi_dot_cpu_float32 & test_dtensor_op_db_mv_cpu_float32 [1 participants]

pytorch 2026-05-19 11:15:53


GitHub stats

pytorch/pytorch#184350•Fetched 2026-05-20 03:39:10

Author

Flamefire

Participants

Flamefire

Timeline (top)

mentioned ×16, subscribed ×16, labeled ×5, cross-referenced ×1

Error Message

Exception: Tensor-likes are not close!

Root Cause

Caused by sample input at index 6: SampleInput(input=TensorList[Tensor[size=(2, 4), device="cpu", dtype=torch.float32], Tensor[size=(4, 3), device="cpu", dtype=torch.float32], Tensor[size=(3, 5), device="cpu", dtype=torch.float32], Tensor[size=(5, 3), device="cpu", dtype=torch.float32], Tensor[size=(3, 2), device="cpu", dtype=torch.float32]], args=(), kwargs={}, broadcasts_input=False, name='')

Fix Action

Fix / Workaround

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7352 24-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 74%
CPU max MHz: 2300.0000
CPU min MHz: 1500.0000
BogoMIPS: 4600.34
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization: AMD-V
L1d cache: 1.5 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 24 MiB (48 instances)
L3 cache: 256 MiB (16 instances)
NUMA node(s): 4
NUMA node0 CPU(s): 0-11,48-59
NUMA node1 CPU(s): 12-23,60-71
NUMA node2 CPU(s): 24-35,72-83
NUMA node3 CPU(s): 36-47,84-95
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected


🐛 Describe the bug

Running PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/distributed/tensor/test_dtensor_ops.py TestLocalDTensorOpsCPU.test_dtensor_op_db_linalg_multi_dot_cpu_float32 fails with:

Exception: Tensor-likes are not close!

Mismatched elements: 1 / 4 (25.0%)
Greatest absolute difference: 0.00390625 at index (0, 0) (up to 1e-05 allowed)
Greatest relative difference: 2.372718427068321e-06 at index (0, 0) (up to 1.3e-06 allowed)

Failed to run: torch.linalg.multi_dot, with (*[[DTensor(local_tensor=LocalTensor(
  0: tensor([[ 3.4669, -7.7058,  4.6590,  7.5694],
        [ 1.1421, -5.1504, -2.7776, -7.4891]]),
  1: tensor([[ 3.4669, -7.7058,  4.6590,  7.5694],
        [ 1.1421, -5.1504, -2.7776, -7.4891]]),
  2: tensor([[ 3.4669, -7.7058,  4.6590,  7.5694],
        [ 1.1421, -5.1504, -2.7776, -7.4891]]),
  3: tensor([[ 3.4669, -7.7058,  4.6590,  7.5694],
        [ 1.1421, -5.1504, -2.7776, -7.4891]])
), device_mesh=DeviceMesh((4,), 'cpu', stride=(1,)), placements=(Replicate(),)), DTensor(local_tensor=LocalTensor(
  0: tensor([[ 5.2068, -8.2851,  0.3633]]),
  1: tensor([[2.8720, 1.0462, 2.9117]]),
  2: tensor([[ 8.5645, -1.2671, -5.9610]]),
  3: tensor([[-2.5924,  0.3652, -0.8183]])
), device_mesh=DeviceMesh((4,), 'cpu', stride=(1,)), placements=(Shard(dim=0),)), DTensor(local_tensor=LocalTensor(
  0: tensor([[ 7.7910, -5.0327, -4.3949, -5.5762,  7.9628],
        [ 7.6850,  1.1760, -4.7881, -1.1749, -1.7137],
        [ 0.2945, -2.0396,  2.0495, -2.6512,  4.0498]]),
  1: tensor([[ 7.7910, -5.0327, -4.3949, -5.5762,  7.9628],
        [ 7.6850,  1.1760, -4.7881, -1.1749, -1.7137],
        [ 0.2945, -2.0396,  2.0495, -2.6512,  4.0498]]),
  2: tensor([[ 7.7910, -5.0327, -4.3949, -5.5762,  7.9628],
        [ 7.6850,  1.1760, -4.7881, -1.1749, -1.7137],
        [ 0.2945, -2.0396,  2.0495, -2.6512,  4.0498]]),
  3: tensor([[ 7.7910, -5.0327, -4.3949, -5.5762,  7.9628],
        [ 7.6850,  1.1760, -4.7881, -1.1749, -1.7137],
        [ 0.2945, -2.0396,  2.0495, -2.6512,  4.0498]])
), device_mesh=DeviceMesh((4,), 'cpu', stride=(1,)), placements=(Replicate(),)), DTensor(local_tensor=LocalTensor(
  0: tensor([[ 6.7551,  5.9732,  3.1151],
        [-0.5791, -7.1602, -6.7908],
        [ 4.4103,  0.5778,  6.9939],
        [-0.5896, -6.5964, -4.5201],
        [-7.8206, -7.5208,  8.9973]]),
  1: tensor([[ 6.7551,  5.9732,  3.1151],
        [-0.5791, -7.1602, -6.7908],
        [ 4.4103,  0.5778,  6.9939],
        [-0.5896, -6.5964, -4.5201],
        [-7.8206, -7.5208,  8.9973]]),
  2: tensor([[ 6.7551,  5.9732,  3.1151],
        [-0.5791, -7.1602, -6.7908],
        [ 4.4103,  0.5778,  6.9939],
        [-0.5896, -6.5964, -4.5201],
        [-7.8206, -7.5208,  8.9973]]),
  3: tensor([[ 6.7551,  5.9732,  3.1151],
        [-0.5791, -7.1602, -6.7908],
        [ 4.4103,  0.5778,  6.9939],
        [-0.5896, -6.5964, -4.5201],
        [-7.8206, -7.5208,  8.9973]])
), device_mesh=DeviceMesh((4,), 'cpu', stride=(1,)), placements=(Replicate(),)), DTensor(local_tensor=LocalTensor(
  0: tensor([[ 4.2510,  1.6292],
        [ 2.9927, -4.2197],
        [-6.7983,  5.6341]]),
  1: tensor([[ 4.2510,  1.6292],
        [ 2.9927, -4.2197],
        [-6.7983,  5.6341]]),
  2: tensor([[ 4.2510,  1.6292],
        [ 2.9927, -4.2197],
        [-6.7983,  5.6341]]),
  3: tensor([[ 4.2510,  1.6292],
        [ 2.9927, -4.2197],
        [-6.7983,  5.6341]])
), device_mesh=DeviceMesh((4,), 'cpu', stride=(1,)), placements=(Replicate(),))]], **{})

Caused by sample input at index 6: SampleInput(input=TensorList[Tensor[size=(2, 4), device="cpu", dtype=torch.float32], Tensor[size=(4, 3), device="cpu", dtype=torch.float32], Tensor[size=(3, 5), device="cpu", dtype=torch.float32], Tensor[size=(5, 3), device="cpu", dtype=torch.float32], Tensor[size=(3, 2), device="cpu", dtype=torch.float32]], args=(), kwargs={}, broadcasts_input=False, name='')

The differences are identical across multiple runs, suggesting this is a general issue.
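Differences that repeat exactly from run to run are consistent with a deterministic change in floating-point accumulation order, for example when a sharded computation sums per-shard partial results instead of accumulating sequentially. The issue does not confirm this as the cause, so the following is only an illustrative sketch. The helpers `f32`, `sum_f32`, and `sharded_sum_f32` are hypothetical names, and float32 rounding is emulated with the standard library rather than with PyTorch.

```python
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate left to right, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def sharded_sum_f32(values, num_shards):
    """Sum each contiguous shard separately, then sum the partial results.

    Assumes len(values) is divisible by num_shards (illustration only).
    """
    size = len(values) // num_shards
    partials = [
        sum_f32(values[i * size:(i + 1) * size]) for i in range(num_shards)
    ]
    return sum_f32(partials)


# Values chosen so that float32 rounding depends on the grouping.
values = [1e8, 1.0, -1e8, 1.0]

sequential = sum_f32(values)
sharded = sharded_sum_f32(values, num_shards=2)

print("sequential:", sequential)
print("sharded:   ", sharded)
```

Both results are deterministic, yet they differ (1.0 versus 0.0) because float32 addition is not associative. The magnitudes here are exaggerated on purpose; the differences in the reported tests are far smaller.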

Similar for TestMultiThreadedDTensorOpsCPU.test_dtensor_op_db_mv_cpu_float32:

Greatest absolute difference: 1.1444091796875e-05 at index (3,) (up to 1e-05 allowed)
Greatest relative difference: 1.1317124517518096e-05 at index (3,) (up to 1.3e-06 allowed)

Failed to run: torch.mv, with (*[DTensor(local_tensor=tensor([[ 4.1761, -0.2984,  2.4235,  2.9058,  4.6064,  2.9942,  7.9766,  4.3033,
          7.8944, -1.5713,  3.5244,  6.0316],
        [ 7.8078,  6.6162,  8.1755, -8.0849, -6.7513,  8.2102, -5.7861,  8.3222,
         -5.1234,  2.9601, -7.2162,  4.3643],
        [ 6.3112,  3.0157, -2.5416, -7.6653, -5.2010,  7.5854,  2.9170, -2.8340,
          6.3382,  0.8583, -1.6771,  6.4638],
        [-2.7521, -5.7925, -2.0701, -6.8787,  6.3040, -2.1614,  8.2091, -2.3018,
          7.1533,  8.1187,  0.2556, -1.4198]], requires_grad=True), device_mesh=DeviceMesh((4,), 'cpu', stride=(1,)), placements=(Replicate(),)), DTensor(local_tensor=tensor([-4.6842, -0.3703,  0.8937], requires_grad=True), device_mesh=DeviceMesh((4,), 'cpu', stride=(1,)), placements=(Shard(dim=0),))], **{})

Caused by sample input at index 0: SampleInput(input=Tensor[size=(4, 12), device="cpu", dtype=torch.float32], args=TensorList[Tensor[size=(12,), device="cpu", dtype=torch.float32]], kwargs={}, broadcasts_input=False, name='')
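To put the two reports in perspective, the documented `torch.testing.assert_close` criterion is `|actual - expected| <= atol + rtol * |expected|`, and the "allowed" values in the messages match the float32 defaults (`rtol=1.3e-6`, `atol=1e-5`). The sketch below recomputes the effective bound from the reported differences. The helper `check` is a hypothetical name, and the expected magnitude is inferred as absolute difference divided by relative difference, which assumes both figures refer to the same element, as the matching indices in the messages suggest.

```python
# Default float32 tolerances, matching the "allowed" values in the messages.
RTOL = 1.3e-6
ATOL = 1e-5


def check(abs_diff, rel_diff, rtol=RTOL, atol=ATOL):
    """Return (expected magnitude, allowed abs difference, passes?).

    The relative difference is abs_diff / |expected|, so |expected| can be
    recovered from the two reported numbers.
    """
    expected_mag = abs_diff / rel_diff
    allowed = atol + rtol * expected_mag
    return expected_mag, allowed, abs_diff <= allowed


# (greatest absolute difference, greatest relative difference) as reported.
reports = {
    "linalg.multi_dot": (0.00390625, 2.372718427068321e-06),
    "mv": (1.1444091796875e-05, 1.1317124517518096e-05),
}

for name, (abs_diff, rel_diff) in reports.items():
    mag, allowed, ok = check(abs_diff, rel_diff)
    print(f"{name}: |expected| ~ {mag:.4g}, allowed {allowed:.4g}, "
          f"observed {abs_diff:.4g}, passes={ok}")
```

Under these assumptions the `multi_dot` element has a magnitude of roughly 1646 and exceeds a combined bound of about 0.00215, while the `mv` element has a magnitude close to 1 and misses its bound of about 1.13e-05 only narrowly.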

In addition, TestMultiThreadedDTensorOpsCPU.test_dtensor_op_db_index_fill_cpu_float32 reports an unexpected success.

Versions

PyTorch version: 2.12.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Rocky Linux 9.6 (Blue Onyx) (x86_64)
GCC version: (GCC) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.34

Python version: 3.12.3 (main, May 13 2025, 17:56:01) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.14.0-570.49.1.el9_6.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 580.65.06
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] numpy==2.4.6
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.12.0
[pip3] triton==3.7.0
[conda] Could not collect

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @weifengpy @tianyu-l @XilunWu @SherlockNoMad @ppwwyyxx
