pytorch - 💡(How to fix) Fix Different results for with and without torch.compile on flux.1-dev model [1 participants]

🐛 Describe the bug

When testing the diffusers flux.1-dev model, I see a large difference between outputs generated with and without torch.compile. Other flux models, such as flux.1-schnell, do not show this difference. Is it expected that some models produce different results depending on whether torch.compile is used?

Here is a repro script: mre_compile_issue.py

Error logs

The results generated with and without torch.compile have an LPIPS score of 0.34, which is a fairly drastic difference.
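LPIPS is a single perceptual score on the final images, so it does not show where the two runs start to drift apart. A minimal sketch of elementwise metrics that could be applied to intermediate values (for example, latents dumped at each denoising step and flattened to lists of floats) is given below. The helper names are hypothetical and are not part of the repro script; only the Python standard library is used.

```python
import math
from typing import Sequence


def _check_lengths(ref: Sequence[float], out: Sequence[float]) -> None:
    if len(ref) != len(out):
        raise ValueError("sequences must have the same length")


def max_abs_diff(ref: Sequence[float], out: Sequence[float]) -> float:
    """Largest elementwise absolute difference (0.0 for empty inputs)."""
    _check_lengths(ref, out)
    return max((abs(r - o) for r, o in zip(ref, out)), default=0.0)


def relative_l2_error(ref: Sequence[float], out: Sequence[float]) -> float:
    """||out - ref||_2 / ||ref||_2; falls back to the absolute norm if ref is all zeros."""
    _check_lengths(ref, out)
    diff = math.sqrt(sum((o - r) ** 2 for r, o in zip(ref, out)))
    norm = math.sqrt(sum(r * r for r in ref))
    return diff / norm if norm > 0 else diff


def psnr(ref: Sequence[float], out: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for values in [0, peak]; inf if identical."""
    _check_lengths(ref, out)
    if not ref:
        raise ValueError("sequences must not be empty")
    mse = sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref)
    return math.inf if mse == 0 else 10 * math.log10(peak * peak / mse)
```

Tracking these values per step for the eager and compiled runs would show whether the outputs differ from the first step or drift apart gradually, which a single LPIPS number cannot distinguish.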

Versions

(oss_dev_flashv3_diffusers) [[email protected] ~]$ with-proxy curl -sL https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py | python3
Collecting environment information...
PyTorch version: 2.12.0a0+gitc7e314f
Is debug build: False
CUDA used to build PyTorch: 12.9
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-14)
Clang version: Could not collect
CMake version: version 3.31.10
Libc version: glibc-2.34

Python version: 3.12.0 | packaged by Anaconda, Inc. | (main, Oct 2 2023, 17:29:18) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.4.3-0_fbk15_hardened_2630_gf27365f948db-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.9.86
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA H100
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 46
On-line CPU(s) list: 0-45
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 1
Core(s) per socket: 46
Socket(s): 1
Stepping: 1
BogoMIPS: 4792.80
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm flush_l1d arch_capabilities
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2.9 MiB (46 instances)
L1i cache: 2.9 MiB (46 instances)
L2 cache: 23 MiB (46 instances)
L3 cache: 736 MiB (46 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-45
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flake8==7.3.0
[pip3] flake8-bugbear==24.12.12
[pip3] flake8-comprehensions==3.16.0
[pip3] flake8-executable==2.1.3
[pip3] flake8-logging-format==2024.24.12
[pip3] flake8-pyi==25.5.0
[pip3] flake8_simplify==0.22.0
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.1.0
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] optree==0.17.0
[pip3] torch==2.11.0a0+gitbf59aea
[pip3] torch_c_dlpack_ext==0.1.4
[pip3] torchao==0.16.0+gita8fa9e554
[pip3] torchvision==0.25.0
[pip3] triton==3.6.0
[conda] numpy 2.1.0 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.8.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.10.2.21 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.3.83 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.9.90 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.3.90 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.8.93 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.7.1 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.27.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.8.90 pypi_0 pypi
[conda] optree 0.17.0 pypi_0 pypi
[conda] torch 2.11.0a0+gitbf59aea pypi_0 pypi
[conda] torch-c-dlpack-ext 0.1.4 pypi_0 pypi
[conda] torchao 0.16.0+gita8fa9e554 pypi_0 pypi
[conda] torchfix 0.4.0 pypi_0 pypi
[conda] torchvision 0.25.0 pypi_0 pypi
[conda] triton 3.6.0 pypi_0 pypi

cc @chauhang @penguinwu

extent analysis

TL;DR

The discrepancy appears to be specific to flux.1-dev, and its root cause has not been identified. Possible mitigations are adjusting the torch.compile settings or switching to a model such as flux.1-schnell that does not show the discrepancy.

Guidance

Investigate the flux.1-dev model's implementation to understand why it behaves differently with torch.compile enabled.
Compare the results of flux.1-dev with other models like flux.1-schnell to identify any patterns or commonalities in the discrepancies.
Consider adjusting the torch.compile settings (for example, a different backend or mode argument) to see whether the results change.
Run the repro script mre_compile_issue.py with different models and compilation settings to gather more data on the issue.
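One way to act on the guidance above is to rerun the same input through progressively deeper stages of the torch.compile stack and note the first stage whose output departs from uncompiled execution. The sketch below assumes the three backend names "eager", "aot_eager", and "inductor" accepted by torch.compile; the function names, the tolerance, and the adapter are hypothetical and are not taken from mre_compile_issue.py. torch is imported lazily so the comparison logic can be exercised without it.

```python
from typing import Callable, List, Optional, Sequence

# Stages of the torch.compile stack, from least to most transformation:
#   "eager"     -> Dynamo graph capture only
#   "aot_eager" -> Dynamo + AOTAutograd, no Inductor code generation
#   "inductor"  -> the default backend (full compilation)
BACKENDS = ("eager", "aot_eager", "inductor")


def first_divergent_backend(
    run: Callable[[Optional[str]], Sequence[float]],
    backends: Sequence[str] = BACKENDS,
    tol: float = 1e-3,
) -> Optional[str]:
    """Return the first backend whose output differs from the reference by more than tol.

    run(None) must return the uncompiled output as a flat sequence of floats;
    run(name) must return the output obtained with torch.compile(backend=name).
    Returns None if every backend stays within tol.
    """
    reference = run(None)
    for name in backends:
        out = run(name)
        err = max((abs(r - o) for r, o in zip(reference, out)), default=0.0)
        print(f"{name}: max abs diff = {err:.3e}")
        if err > tol:
            return name
    return None


def make_torch_runner(module, example_inputs) -> Callable[[Optional[str]], List[float]]:
    """Hypothetical adapter: wrap a torch module so it can be passed as `run` above."""
    import torch
    import torch._dynamo

    def run(backend: Optional[str]) -> List[float]:
        torch._dynamo.reset()  # discard graphs compiled for a previous backend
        fn = module if backend is None else torch.compile(module, backend=backend)
        with torch.no_grad():
            return fn(*example_inputs).flatten().float().tolist()

    return run
```

For example, first_divergent_backend(make_torch_runner(transformer, inputs)) would report which stage first exceeds the tolerance, where transformer and inputs stand in for whatever module and tensors the repro script uses. A divergence that appears only at "inductor" would point toward the generated kernels, while one that already appears at "eager" or "aot_eager" would point toward graph capture or the AOTAutograd stage. If the module draws random numbers, the same seed should be set before every run so that sampling noise is not mistaken for a compilation difference.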

Example

The issue does not include inline code. The attached repro script, mre_compile_issue.py, is the starting point for investigation.

Notes

The issue seems to be model-specific, and the root cause may be related to the implementation of the flux.1-dev model or the interaction between the model and torch.compile. Further investigation is needed to determine the exact cause and find a solution.

Recommendation

Apply workaround: Use a different model like flux.1-schnell that does not exhibit the discrepancy, or adjust the torch.compile settings to mitigate the issue, as the root cause is unclear and may require further investigation.
