ollama - ✅(Solved) Fix Vulkan/AMD performance: vendored llama.cpp (b7437, Dec 2025) missing Wave32 FA (#19625) and graphics queue (#20551) — ~56% t/s gap vs standalone llama.cpp [2 pull requests, 9 comments, 4 participants]

GitHub stats
ollama/ollama#15601 · Fetched 2026-04-17 08:23:24
Comments: 9 · Participants: 4 · Timeline events: 22 · Reactions: 0
Timeline (top): commented ×9, subscribed ×7, mentioned ×6

Root Cause

Ollama's vendored llama.cpp is currently at b7437 (Dec 16, 2025). Two significant Vulkan/AMD performance PRs landed in llama.cpp after that and have not yet been picked up by Ollama:

| PR | Description | Merged into llama.cpp |
|---|---|---|
| ggml-org/llama.cpp#19625 | Vulkan: scalar flash attention refactor + Wave32 on AMD | Feb 24, 2026 |
| ggml-org/llama.cpp#20551 | Vulkan: use graphics queue on AMD | Mar 15, 2026 |

Fix Action

Fix / Workaround

| PR | Description | Merged into llama.cpp |
|---|---|---|
| ggml-org/llama.cpp#19625 | Vulkan: scalar flash attention refactor + Wave32 on AMD | Feb 24, 2026 |
| ggml-org/llama.cpp#20551 | Vulkan: use graphics queue on AMD | Mar 15, 2026 |

Current Workaround

Building llama.cpp from source and serving it with llama-server behind llama-swap works, but it is a significant workaround: model management, multi-model serving, and automatic updates all have to be handled manually. Ollama fixing this would make the workaround unnecessary.

PR fix notes

PR #19625: Vulkan Scalar Flash Attention Refactor

Description (problem / solution / changelog)

This started out as an attempt to go through the scalar FA version and add proper float16 support to improve AMD and Intel performance and went quite a bit further. @jeffbolznv Sorry about the amount of changes, let me know if there's something I can do to make the review easier. Please also let me know if you have architectural concerns. Flash Attention has so many dimensions and making it work well on so much hardware and models is pretty hard. I had to spend quite a lot of time figuring out and fixing regressions on specific configurations.

<details> <summary>AI-generated summary of changes</summary>

Scalar Flash Attention Core Optimizations

  • Implemented row splitting within workgroups (row_split = 1 or 4) for better subgroup utilization
  • Added shared memory staging for K and V loads on Nvidia GPUs when head sizes < 256
  • Cached Q values in registers for KQ computation when HSK_per_thread > 16
  • Fused loop for Lf accumulation and Of scaling by eMf
  • Changed to vectorized vec4 stores for output
  • Optimized masksh layout with stride padding (Br + 1) and removed unnecessary barrier

Row Size Tiering

  • Replaced binary small_rows/large_rows with three-tier system: FA_ROWS_1, FA_ROWS_SMALL, FA_ROWS_LARGE
  • Dynamic Br selection based on head sizes, device vendor, and architecture
  • FA_ROWS_1 uses Br=1 for N=1, FA_ROWS_SMALL uses Br=8, FA_ROWS_LARGE uses Br=16
  • Device-specific adjustments: AMD GCN uses smaller Br, Intel uses Br=8 maximum

Vendor-Specific Optimizations

  • AMD RDNA: Use wave32 subgroup size for scalar FA when N=1
  • Intel: Added shader core count lookup table for Alchemist and Battlemage GPUs
  • Intel: Disable subgroup operations in favor of shared memory reductions
  • Intel Alchemist: Apply 2x shader core count multiplier for split_k calculation
  • Adjusted workgroup sizes per vendor and head size combinations

split_k Enhancements

  • Relaxed split_k conditions to support non-GQA workloads
  • Fixed dispatch logic to handle both GQA and non-GQA cases correctly
  • Improved split_k calculation based on total workgroup count and shader cores

Device Compatibility

  • Added FP32 shader variants (_fp32 suffix) for devices without FP16 support
  • Made FLOAT_TYPE conditional on device capabilities
  • Updated dequantize4 functions to use FLOAT_TYPE instead of hardcoded float

Shared Memory Management

  • Dynamic tmpsh sizing based on row_split and subgroup configuration
  • Added kvsh buffer for K/V staging (size conditional on SHMEM_STAGING flag)
  • Improved Qf buffer stride calculation
  • Fixed tmpsh size calculation for split_k temporaries

Code Path Selection

  • Switch from coopmat1 to scalar when N=1 or rows=FA_ROWS_1
  • Improved shared memory size checks for scalar path fallback
  • Better alignment checking and stride validation

Shader Compilation

  • Made coopmat1/coopmat2 pipeline creation conditional on device FP16 support
  • Added subgroup size configuration per code path and row configuration
  • Removed hardcoded subgroup size assumptions
</details>

Benchmarks

<details> <summary>AMD Radeon Pro VII</summary>
| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 1003.15 ± 0.89 | 800.28 ± 1.41 | 827.57 ± 0.74 | +3.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 85.12 ± 1.39 | 98.55 ± 0.55 | 97.83 ± 0.47 | -0.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 689.31 ± 0.64 | 174.36 ± 0.42 | 388.72 ± 3.37 | +122.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 69.91 ± 0.20 | 55.97 ± 0.20 | 72.24 ± 0.34 | +29.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 525.25 ± 1.68 | 84.33 ± 0.11 | 247.07 ± 1.51 | +193.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 60.48 ± 0.17 | 41.46 ± 0.12 | 57.70 ± 0.57 | +39.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1061.99 ± 7.85 | 1319.64 ± 7.82 | 1321.90 ± 6.90 | +0.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 110.86 ± 0.97 | 136.10 ± 0.27 | 127.75 ± 0.88 | -6.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 745.39 ± 1.25 | 757.62 ± 3.94 | 740.88 ± 4.66 | -2.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 101.64 ± 0.41 | 116.38 ± 0.17 | 113.37 ± 0.93 | -2.6% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 577.95 ± 3.32 | 509.10 ± 3.64 | 484.85 ± 2.85 | -4.8% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 99.23 ± 0.21 | 107.31 ± 0.68 | 102.88 ± 1.13 | -4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 351.98 ± 3.24 | 749.40 ± 5.15 | 759.11 ± 4.74 | +1.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 68.83 ± 0.11 | 95.12 ± 0.22 | 93.94 ± 0.45 | -1.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 295.91 ± 3.09 | 207.63 ± 0.63 | 312.17 ± 5.34 | +50.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 60.01 ± 0.77 | 55.87 ± 0.35 | 73.73 ± 0.68 | +32.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 247.76 ± 0.77 | 114.90 ± 0.42 | 191.18 ± 1.32 | +66.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 55.69 ± 0.30 | 44.11 ± 0.11 | 61.76 ± 0.63 | +40.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 641.90 ± 2.66 | 657.73 ± 3.46 | 740.63 ± 1.78 | +12.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 47.72 ± 0.13 | 64.38 ± 0.19 | 65.54 ± 0.32 | +1.8% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 293.28 ± 0.54 | 83.15 ± 0.33 | 129.38 ± 0.69 | +55.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 38.76 ± 0.07 | 35.93 ± 0.20 | 37.94 ± 0.33 | +5.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 189.33 ± 0.18 | 41.62 ± 0.24 | 70.77 ± 0.49 | +70.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 31.80 ± 0.08 | 24.39 ± 0.36 | 26.41 ± 0.22 | +8.3% |
</details> <details> <summary>AMD 8060S</summary>
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 994.34 ± 34.50 | 947.41 ± 7.78 | -4.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 45.14 ± 0.44 | 44.86 ± 0.42 | -0.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 418.71 ± 11.10 | 397.77 ± 8.90 | -5.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 35.83 ± 0.09 | 35.68 ± 0.08 | -0.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 234.05 ± 5.66 | 246.05 ± 11.58 | +5.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 30.53 ± 0.08 | 30.13 ± 0.11 | -1.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1263.73 ± 34.96 | 1208.77 ± 37.78 | -4.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 73.19 ± 0.13 | 72.68 ± 0.10 | -0.7% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 920.01 ± 4.93 | 919.00 ± 4.71 | -0.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 66.74 ± 0.45 | 66.42 ± 0.13 | -0.5% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 670.22 ± 4.61 | 670.46 ± 5.07 | +0.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 61.53 ± 0.78 | 61.78 ± 1.08 | +0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 945.03 ± 32.97 | 992.30 ± 11.33 | +5.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 91.76 ± 0.06 | 91.60 ± 0.53 | -0.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 487.96 ± 2.76 | 479.56 ± 4.25 | -1.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 66.47 ± 0.33 | 66.13 ± 0.27 | -0.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 302.07 ± 1.01 | 286.72 ± 1.03 | -5.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 50.54 ± 0.19 | 49.64 ± 0.88 | -1.8% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 | 924.97 ± 10.45 | 923.58 ± 4.06 | -0.2% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 | 61.52 ± 0.34 | 61.43 ± 0.41 | -0.1% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 306.02 ± 0.84 | 297.15 ± 0.91 | -2.9% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 38.31 ± 0.20 | 39.20 ± 0.17 | +2.3% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 192.72 ± 0.35 | 182.25 ± 0.82 | -5.4% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 27.83 ± 0.16 | 28.83 ± 0.01 | +3.6% |
</details> <details> <summary>AMD 8060S (Without Coopmat)</summary>
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 815.03 ± 7.22 | 822.68 ± 4.39 | +0.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 44.96 ± 0.22 | 45.36 ± 0.30 | +0.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 67.06 ± 4.00 | 190.34 ± 2.98 | +183.8% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 31.53 ± 0.13 | 35.31 ± 0.28 | +12.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 28.05 ± 0.85 | 78.89 ± 4.18 | +181.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 25.53 ± 0.17 | 29.71 ± 0.08 | +16.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1249.96 ± 37.10 | 1187.02 ± 15.67 | -5.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 73.17 ± 0.06 | 72.39 ± 0.23 | -1.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 681.99 ± 1.44 | 681.63 ± 2.60 | -0.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 66.34 ± 0.35 | 66.37 ± 0.21 | +0.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 438.09 ± 2.70 | 408.44 ± 7.02 | -6.8% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 61.46 ± 0.62 | 61.54 ± 0.76 | +0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 617.33 ± 13.14 | 614.00 ± 6.22 | -0.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 94.84 ± 0.20 | 92.14 ± 0.22 | -2.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 179.49 ± 0.92 | 227.94 ± 1.12 | +27.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 57.91 ± 0.39 | 67.14 ± 0.11 | +15.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 86.39 ± 0.78 | 128.04 ± 0.64 | +48.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 43.22 ± 0.18 | 51.58 ± 0.14 | +19.3% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 | 727.26 ± 4.81 | 810.87 ± 5.13 | +11.5% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 | 61.59 ± 0.70 | 61.90 ± 0.12 | +0.5% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 105.57 ± 0.50 | 178.01 ± 0.22 | +68.6% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 38.58 ± 0.19 | 39.50 ± 0.33 | +2.4% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 52.56 ± 0.29 | 94.60 ± 0.41 | +80.0% |
| deepseek2 30B.A3B Q4_0 | 16.03 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 28.02 ± 0.18 | 28.98 ± 0.06 | +3.4% |
</details> <details> <summary>Intel A770</summary>
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 818.22 ± 0.63 | 812.84 ± 1.85 | -0.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 32.64 ± 0.07 | 32.45 ± 0.05 | -0.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d2048 | 97.15 ± 0.05 | 550.81 ± 1.20 | +467.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d2048 | 21.67 ± 0.02 | 27.75 ± 0.02 | +28.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d4096 | 43.79 ± 2.97 | 405.21 ± 0.78 | +825.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d4096 | 17.28 ± 0.00 | 25.06 ± 0.01 | +45.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 930.73 ± 3.24 | 898.65 ± 3.47 | -3.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 41.29 ± 0.07 | 37.53 ± 0.11 | -9.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d2048 | 701.16 ± 3.52 | 670.17 ± 4.91 | -4.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d2048 | 31.19 ± 0.06 | 31.73 ± 0.03 | +1.7% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d4096 | 545.63 ± 1.16 | 495.18 ± 0.71 | -9.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d4096 | 28.83 ± 0.09 | 29.27 ± 0.04 | +1.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 640.10 ± 3.55 | 657.27 ± 3.54 | +2.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 33.43 ± 0.08 | 30.04 ± 0.03 | -10.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d2048 | 60.27 ± 4.78 | 281.25 ± 1.21 | +366.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d2048 | 20.16 ± 0.02 | 22.98 ± 0.03 | +14.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d4096 | 26.38 ± 0.63 | 310.19 ± 1.68 | +1075.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d4096 | 18.27 ± 0.03 | 23.61 ± 0.08 | +29.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 167.35 ± 0.17 | 66.63 ± 0.23 | -60.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 19.23 ± 0.01 | 20.38 ± 0.03 | +6.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d2048 | 26.23 ± 1.02 | 25.38 ± 0.01 | -3.2% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d2048 | 5.95 ± 0.00 | 13.59 ± 0.01 | +128.4% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d4096 | 25.54 ± 0.02 | 25.29 ± 0.04 | -1.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d4096 | 3.64 ± 0.00 | 10.37 ± 0.00 | +184.9% |
</details> <details> <summary>Nvidia RTX 3090 (Coopmat2)</summary>
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 4666.60 ± 19.46 | 4721.23 ± 12.32 | +1.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 144.71 ± 1.53 | 147.49 ± 0.52 | +1.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 3426.64 ± 19.29 | 3428.98 ± 22.04 | +0.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 114.85 ± 0.97 | 115.92 ± 0.34 | +0.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 2695.37 ± 16.65 | 2692.89 ± 16.34 | -0.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 99.65 ± 0.73 | 99.82 ± 0.29 | +0.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 4520.31 ± 33.68 | 4513.71 ± 30.22 | -0.1% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 177.65 ± 0.75 | 177.15 ± 0.77 | -0.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 4040.47 ± 78.90 | 4049.94 ± 174.56 | +0.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 156.59 ± 1.58 | 155.91 ± 0.78 | -0.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 3546.97 ± 21.35 | 3529.89 ± 36.63 | -0.5% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 147.96 ± 0.76 | 145.37 ± 0.48 | -1.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 3469.59 ± 17.36 | 3465.49 ± 34.45 | -0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 178.72 ± 0.64 | 177.48 ± 2.05 | -0.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 2508.75 ± 42.02 | 2500.37 ± 34.47 | -0.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 141.66 ± 0.54 | 141.16 ± 0.65 | -0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 1942.67 ± 15.90 | 1936.24 ± 20.12 | -0.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 123.39 ± 0.72 | 123.21 ± 0.29 | -0.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 2287.89 ± 11.77 | 2289.12 ± 9.34 | +0.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 116.47 ± 0.80 | 114.38 ± 3.56 | -1.8% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 1047.29 ± 9.19 | 1047.12 ± 9.51 | -0.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 90.74 ± 0.34 | 90.44 ± 0.37 | -0.3% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 647.46 ± 3.70 | 644.65 ± 3.78 | -0.4% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 81.92 ± 0.81 | 82.07 ± 0.20 | +0.2% |
</details> <details> <summary>Nvidia RTX 3090 (Coopmat1)</summary>
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 4117.11 ± 10.81 | 4052.19 ± 17.94 | -1.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 145.98 ± 1.84 | 144.04 ± 0.74 | -1.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 2182.12 ± 11.97 | 2359.95 ± 10.14 | +8.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 115.72 ± 0.56 | 116.46 ± 0.62 | +0.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 1486.54 ± 4.89 | 1671.90 ± 9.35 | +12.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 99.15 ± 0.74 | 101.36 ± 0.32 | +2.2% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 3062.95 ± 94.07 | 3090.31 ± 33.32 | +0.9% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 175.29 ± 0.83 | 175.87 ± 0.88 | +0.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 2439.28 ± 32.02 | 2494.98 ± 47.57 | +2.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 148.99 ± 14.70 | 154.40 ± 2.18 | +3.6% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 1964.74 ± 21.60 | 2098.26 ± 19.00 | +6.8% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 147.55 ± 0.70 | 147.66 ± 0.69 | +0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 2839.27 ± 26.12 | 2837.32 ± 30.26 | -0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 174.78 ± 1.25 | 176.05 ± 1.26 | +0.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 1505.57 ± 14.41 | 1639.74 ± 14.94 | +8.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 137.34 ± 0.86 | 139.22 ± 2.10 | +1.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 1010.90 ± 10.49 | 1146.23 ± 14.19 | +13.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 119.58 ± 0.71 | 121.95 ± 0.88 | +2.0% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 1968.30 ± 10.15 | 1954.94 ± 33.29 | -0.7% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 114.35 ± 0.87 | 115.05 ± 0.80 | +0.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 554.73 ± 1.56 | 555.49 ± 1.82 | +0.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 62.50 ± 0.51 | 63.21 ± 0.34 | +1.1% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 314.59 ± 0.93 | 315.91 ± 1.26 | +0.4% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 43.01 ± 0.10 | 43.98 ± 0.15 | +2.3% |
</details> <details> <summary>Nvidia RTX 3090 (Without Coopmat)</summary>
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 2129.81 ± 5.52 | 2081.00 ± 42.53 | -2.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 145.98 ± 0.24 | 144.26 ± 0.53 | -1.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 997.77 ± 3.31 | 1048.43 ± 25.28 | +5.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 110.19 ± 0.54 | 112.16 ± 0.12 | +1.8% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 637.54 ± 1.09 | 701.26 ± 11.14 | +10.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 94.33 ± 0.22 | 95.27 ± 0.31 | +1.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 2410.79 ± 15.88 | 2331.15 ± 89.00 | -3.3% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 176.60 ± 0.74 | 173.28 ± 0.72 | -1.9% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 1582.99 ± 17.17 | 1429.18 ± 11.60 | -9.7% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 153.60 ± 1.60 | 150.58 ± 0.91 | -2.0% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d16384 | 1114.36 ± 154.82 | 1009.61 ± 23.16 | -9.4% |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d16384 | 146.14 ± 0.64 | 143.19 ± 1.18 | -2.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 1159.21 ± 12.74 | 1137.29 ± 13.35 | -1.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 177.45 ± 1.07 | 175.96 ± 1.95 | -0.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 592.47 ± 4.68 | 620.55 ± 6.11 | +4.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 130.00 ± 0.58 | 135.84 ± 1.70 | +4.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d16384 | 387.10 ± 1.89 | 425.32 ± 0.85 | +9.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d16384 | 113.49 ± 0.51 | 117.90 ± 0.71 | +3.9% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 | 1050.83 ± 17.39 | 1092.14 ± 16.92 | +3.9% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 | 114.66 ± 2.79 | 115.36 ± 3.33 | +0.6% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d8192 | 281.20 ± 1.84 | 342.26 ± 2.76 | +21.7% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d8192 | 63.73 ± 0.06 | 63.90 ± 0.37 | +0.3% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | pp512 @ d16384 | 159.38 ± 1.00 | 202.89 ± 2.03 | +27.3% |
| deepseek2 30B.A3B Q3_K - Small | 12.37 GiB | 29.94 B | 99 | 1 | tg128 @ d16384 | 43.40 ± 0.05 | 44.22 ± 0.09 | +1.9% |
</details>

Changed files

  • ggml/src/ggml-vulkan/ggml-vulkan.cpp (modified, +360/-212)
  • ggml/src/ggml-vulkan/vulkan-shaders/flash_attn.comp (modified, +354/-176)
  • ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_base.glsl (modified, +40/-23)
  • ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm1.comp (modified, +135/-89)
  • ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp (modified, +35/-7)
  • ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp (modified, +46/-34)

PR #20551: vulkan: use graphics queue on AMD

Description (problem / solution / changelog)

I'm not sure why, but the graphics queue is slightly faster in tg on AMD than the compute queue, and this also fixes the partial offload issue I fixed in #19976, so the second queue no longer has to be enabled by default. I got the idea from @zedbytes reporting that tg goes up when running with RADV_DEBUG=nocompute.

<details> <summary>AMD RX 9070 XT</summary>
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 20 | 1 | pp512 | 2288.04 ± 2.42 | 2225.76 ± 2.31 | -2.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 20 | 1 | tg128 | 24.33 ± 0.04 | 24.58 ± 0.05 | +1.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 4886.26 ± 105.08 | 4901.77 ± 102.66 | +0.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 115.78 ± 0.02 | 121.39 ± 0.02 | +4.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 20 | 1 | pp512 | 736.21 ± 9.37 | 735.19 ± 7.51 | -0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 20 | 1 | tg128 | 39.53 ± 0.10 | 40.36 ± 0.21 | +2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 3383.58 ± 29.26 | 3425.38 ± 28.68 | +1.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 200.45 ± 1.89 | 220.41 ± 1.46 | +10.0% |
</details> <details> <summary>AMD Radeon Pro VII</summary>
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 20 | 1 | pp512 | 636.62 ± 9.07 | 615.62 ± 0.79 | -3.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 20 | 1 | tg128 | 38.35 ± 0.09 | 38.20 ± 0.01 | -0.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 830.30 ± 1.51 | 834.44 ± 1.05 | +0.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 102.45 ± 0.64 | 100.28 ± 0.24 | -2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 20 | 1 | pp512 | 289.76 ± 3.59 | 287.75 ± 3.10 | -0.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 20 | 1 | tg128 | 34.57 ± 0.32 | 34.05 ± 1.20 | -1.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 749.65 ± 5.42 | 762.52 ± 5.89 | +1.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 94.70 ± 0.46 | 97.55 ± 0.20 | +3.0% |
</details>

Changed files

  • ggml/src/ggml-vulkan/ggml-vulkan.cpp (modified, +5/-5)

Summary

Ollama's vendored llama.cpp is currently at b7437 (Dec 16, 2025). Two significant Vulkan/AMD performance PRs landed in llama.cpp after that and have not yet been picked up by Ollama:

| PR | Description | Merged into llama.cpp |
|---|---|---|
| ggml-org/llama.cpp#19625 | Vulkan: scalar flash attention refactor + Wave32 on AMD | Feb 24, 2026 |
| ggml-org/llama.cpp#20551 | Vulkan: use graphics queue on AMD | Mar 15, 2026 |

Measured Impact

Benchmarked on the same hardware, same model, same flags (-ngl 99 -fa 1 --no-mmap):

| Setup | gemma4:26b Q4_K_XL tg128 | gemma4:e4b Q4_K_XL tg128 |
|---|---|---|
| Ollama v0.20.5 (llama.cpp b7437) | ~34 t/s | ~34 t/s |
| llama.cpp b8765 (has both PRs) | 52.3 t/s | 56.2 t/s |
| Windows LM Studio (same hardware) | ~56 t/s | ~56 t/s |

That's a ~56% throughput improvement from two Vulkan-specific commits that Ollama simply hasn't vendored yet. Standalone llama.cpp b8765 on Linux/Vulkan is now at parity with Windows LM Studio on the same machine. This is not a hardware/driver issue — the gap disappears entirely when running standalone llama.cpp.
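As a rough check on the headline figure, the relative gain can be recomputed from the measured values. This is a minimal sketch: `gain_pct` is a hypothetical helper, not part of Ollama or llama.cpp, and the Ollama baseline is the approximate ~34 t/s reported above, so the results are only as precise as that reading.

```shell
# gain_pct BASE NEW: relative throughput gain of NEW over BASE, in percent.
# Hypothetical helper for illustration only.
gain_pct() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (cur / base - 1) * 100 }'
}

gain_pct 34 52.3   # gemma4:26b, Ollama (~34 t/s) vs llama.cpp b8765
gain_pct 34 56.2   # gemma4:e4b, Ollama (~34 t/s) vs llama.cpp b8765
```

With these inputs the two models come out at roughly 54% and 65%, so the ~56% quoted in the issue is best read as an approximate figure rather than an exact ratio.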

Token speed vs context depth (llama.cpp b8765, tg128)

For reference, full context-depth profile on this hardware:

| Context depth | gemma4:26b | gemma4:e4b |
|---|---|---|
| d0 (fresh) | 52.3 t/s | 56.2 t/s |
| d8k | 45.6 t/s | ~50 t/s |
| d32k | 40.1 t/s | 42.5 t/s |
| d64k | 35.1 t/s | 35.0 t/s |
| d128k | 17.0 t/s | 26.1 t/s |

With Ollama (b7437) you're stuck at ~34 t/s even at d0 — below what standalone llama.cpp delivers at d64k.
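The comparison against the ~34 t/s Ollama baseline can be checked mechanically. The sketch below is illustrative only: `depths_above` is a hypothetical helper, and the input is the gemma4:26b column from the depth profile, typed in by hand as "depth t/s" pairs.

```shell
# depths_above BASELINE: read "depth t/s" pairs on stdin and print the depths
# whose throughput is still above BASELINE. Hypothetical helper.
depths_above() {
  awk -v base="$1" '$2 > base { print $1 }'
}

# gemma4:26b on llama.cpp b8765, compared with Ollama's ~34 t/s at d0
printf '%s\n' 'd0 52.3' 'd8k 45.6' 'd32k 40.1' 'd64k 35.1' 'd128k 17.0' | depths_above 34
```

For this column every depth up to and including d64k is listed, which matches the statement that standalone llama.cpp at d64k still edges out Ollama at d0.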

Current Workaround

Due to this gap, I switched from Ollama to llama-swap + llama.cpp built from source. llama-swap is a lightweight proxy that hot-swaps llama-server instances on a single port, making it a drop-in Ollama replacement (same port 11434, OpenAI-compatible API).

Setup:

# Build llama.cpp from source with Vulkan
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp   # the cmake commands below must run inside the checkout
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Run llama-server directly
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  ./build/bin/llama-server \
  --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --port 11434 -ngl 99 -fa on --no-mmap

This works, but it's a significant workaround — model management, multi-model serving, and automatic updates all have to be handled manually. Ollama fixing this would make the workaround unnecessary.

System

Hardware: Minisforum MS-S1 Max (AMD Ryzen AI MAX+ 395 / Radeon 8060S, Strix Halo)
GPU arch: gfx1151, 128 GB unified memory (iGPU shares system RAM)
OS: Ubuntu 24.04.4 LTS
Kernel: 6.19.11
Vulkan driver: RADV (Mesa 25.2.8), radeon_icd.json
Ollama version: v0.20.5 (llama.cpp b7437, Dec 16 2025)
Standalone llama.cpp: b8765 (Apr 2026), built from source with Vulkan

Steps to Reproduce

# With Ollama (v0.20.5); --verbose prints timing statistics after each reply
ollama run gemma4:26b --verbose
# observe ~34 t/s in generation (the "eval rate" line)

# With standalone llama.cpp b8765 (same model, same quant, same hardware)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  llama-bench -m gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 99 -mmp 0 -fa 1 -p 0 -n 128
# observe 52+ t/s

Request

Please update the vendored llama.cpp to a commit that includes both PRs (any commit ≥ b8500 / after Mar 15, 2026 should include both). The ROCm 7.2.1 update in v0.20.7 is appreciated, but the Vulkan path (which is what iGPU/APU users rely on — ROCm doesn't support Strix Halo yet) is still stuck on December code.

Users with AMD APUs (Strix Halo, Phoenix, Hawk Point) running Vulkan are leaving ~56% performance on the table compared to what's already available in upstream llama.cpp.
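One way to confirm that a given llama.cpp checkout already contains both PRs is to search its history for the PR numbers. The sketch below assumes that upstream squash-merges PRs with a `(#NNNN)` suffix in the commit message and that the checkout has enough history fetched (a shallow clone may miss the commits). `has_pr` is a hypothetical helper name, and `./llama.cpp` in the usage comment is a placeholder path.

```shell
# has_pr REPO_DIR PR_NUMBER: succeed if some commit reachable from HEAD
# mentions "(#PR_NUMBER)" in its message. Hypothetical helper; relies on the
# assumed "(#NNNN)" squash-merge naming convention.
has_pr() {
  [ -n "$(git -C "$1" log -1 --format=%H --fixed-strings --grep="(#$2)" 2>/dev/null)" ]
}

# Usage (placeholder path):
#   has_pr ./llama.cpp 19625 && has_pr ./llama.cpp 20551 && echo "both PRs present"
```

Searching by PR number avoids having to map build tags such as b8500 to commits, at the cost of depending on the commit-message convention.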

extent analysis

TL;DR

Update the vendored llama.cpp in Ollama to a commit that includes both performance-enhancing PRs (≥ b8500 / after Mar 15, 2026) to unlock the ~56% throughput improvement.

Guidance

  • Identify the current vendored llama.cpp version in Ollama (b7437) and recognize it lacks crucial Vulkan performance updates.
  • Understand that updating to a version that includes PRs #19625 and #20551 (e.g., b8765 or later) is necessary for the performance boost.
  • Consider building llama.cpp from source with Vulkan support as a temporary workaround, as demonstrated in the issue.
  • Note that simply updating Ollama to use a newer version of llama.cpp that includes these PRs should resolve the performance gap without needing manual model management or other workarounds.

Example

# Example of building and running llama.cpp from source for comparison
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp   # the cmake commands below must run inside the checkout
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  ./build/bin/llama-server \
  --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --port 11434 -ngl 99 -fa on --no-mmap

Notes

The performance improvement is specific to Vulkan on AMD hardware, particularly for users with integrated graphics (iGPU) like those found in AMD APUs. The workaround using llama-swap and manually built llama.cpp indicates the issue is with Ollama's vendored version of llama.cpp, not the hardware or drivers.

Recommendation

Apply the workaround of building llama.cpp from source and using it with llama-swap until Ollama updates its vendored llama.cpp version, as this provides a functional, albeit manual, solution to achieve the performance benefits of the newer llama.cpp versions.
