ollama - ✅(Solved) Fix Vulkan/AMD performance: vendored llama.cpp (b7437, Dec 2025) missing Wave32 FA (#19625) and graphics queue (#20551) — ~56% t/s gap vs standalone llama.cpp [2 pull requests, 9 comments, 4 participants]

sagar-kale · 2026-04-15T08:07:25Z

[ollama] Ollama's vendored llama.cpp is currently at b7437 Dec 16, 2025 . Two significant Vulkan/AMD performance PRs landed in llama.cpp after that and have no… Ollama's vendored llama.cpp is currently at **b7437 (Dec 16, 2025)**. Two significant Vulkan/AMD performance PRs landed in llama.cpp after that and have not yet been picked up by Ollama: | PR | Description | Merged into llama.cpp | |---|---|---| | [ggml-org/llama.cpp#19625](https://github.com/ggml-org/llama.cpp/pull/19625) | Vulkan: scalar flash attention refactor + Wave32 on AMD | Feb 24, 2026 | | [ggml-org/llama.cpp#20551](https://github.com/ggml-org/llama.cpp/pull/20551) | Vulkan: use graphics queue on AMD | Mar 15, 2026 | # PR #19625: Vulkan Scalar Flash Attention Refactor - Repository: ggml-org/llama.cpp - Author: 0cc4m - State: closed | merged: True - Link: https://github.com/ggml-org/llama.cpp/pull/19625 ## Description (problem / solution / changelog) This started out as an attempt to go through the scalar FA version and add proper float16 support to improve AMD and Intel performance and went quite a bit further. @jeffbolznv Sorry about the amount of changes, let me know if there's something I can do to make the review easier. Please also let me know if you have architectural concerns. Flash Attention has so many dimensions and making it work well on so much hardware and models is pretty hard. I had to spend quite a lot of time figuring out and fixing regressions on specific configurations. AI-generated summary of changes ### Scalar Flash Attention Core Optimizations - Implemented row splitting within workgroups (row_split = 1 or 4) for better subgroup utilization - Added shared memory staging for K and V loads on Nvidia GPUs when head sizes 16 - Fused loop for Lf accumulation and Of scaling by eMf - Changed to vectorized vec4 stores for output - Optimized masksh layout with stride padding (Br + 1) and removed unnecessary barrier ### Row Size Tiering - Replaced binary small_rows/large_rows with three-tier system: FA_ROWS_1, FA_ROWS_SMALL, FA_ROWS_LARGE - Dynamic Br selection based on head sizes, device vendor, and architecture - FA_ROWS_1 uses Br=1 for N=1, FA_ROWS_SMALL uses Br=8, FA_ROWS_LARGE uses Br=16 - Device-specific adjustments: AMD GCN uses smaller Br, Intel uses Br=8 maximum ### Vendor-Specific Optimizations - AMD RDNA: Use wave32 subgroup size for scalar FA when N=1 - Intel: Added shader core count lookup table for Alchemist and Battlemage GPUs - Intel: Disable subgroup operations in favor of shared memory reductions - Intel Alchemist: Apply 2x shader core count multiplier for split_k calculation - Adjusted workgroup sizes per vendor and head size combinations ### split_k Enhancements - Relaxed split_k conditions to support non-GQA workloads - Fixed dispatch logic to handle both GQA and non-GQA cases correctly - Improved split_k calculation based on total workgroup count and shader cores ### Device Compatibility - Added FP32 shader variants (_fp32 suffix) for devices without FP16 support - Made FLOAT_TYPE conditional on device capabilities - Updated dequantize4 functions to use FLOAT_TYPE instead of hardcoded float ### Shared Memory Management - Dynamic tmpsh sizing based on row_split and subgroup configuration - Added kvsh buffer for K/V staging (size conditional on SHMEM_STAGING flag) - Improved Qf buffer stride calculation - Fixed tmpsh size calculation for split_k temporaries ### Code Path Selection - Switch from coopmat1 to scalar when N=1 or rows=FA_ROWS_1 - Improved shared memory size checks for scalar path fallback - Better alignment checking and stride validation ### Shader Compilation - Made coopmat1/coopmat2 pipeline creation conditional on device FP16 support - Added subgroup size configuration per code path and row configuration - Removed hardcoded subgroup size assumptions ## Benchmarks AMD Radeon Pro VII | model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff | |--------------------------------|-----------|---------|-----|----|----------------|----------------|----------------|----------------|---------| | llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 1003.15 ± 0.89 | 800.28 ± 1.41 | 827.57 ± 0.74 | +3.4% | | llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 85.12 ± 1.39 | 98.55 ± 0.55 | 97.83 ± 0.47 | -0.7% | | llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 689.31 ± 0.64 | 174.36 ± 0.42 | 388.72 ± 3.37 | +122.9% | | llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 69.91 ± 0.20 | 55.97 ± 0.20 | 72.24 ± 0.34 | +29.1% | | llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d16384 | 525.25 ± 1.68 | 84.33 ± 0.11 | 247.07 ± 1.51 | +193.0% | | llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d16384 | 60.48 ± 0.17 | 41.46 ± 0.12 | 57.70 ± 0.57 | +39.2% | | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 99

ollama2026-04-15 08:07:25

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

ollama/ollama#15601•Fetched 2026-04-17 08:23:24

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×9subscribed ×7mentioned ×6

Ollama's vendored llama.cpp is currently at b7437 (Dec 16, 2025). Two significant Vulkan/AMD performance PRs landed in llama.cpp after that and have not yet been picked up by Ollama:

PR	Description	Merged into llama.cpp
ggml-org/llama.cpp#19625	Vulkan: scalar flash attention refactor + Wave32 on AMD	Feb 24, 2026
ggml-org/llama.cpp#20551	Vulkan: use graphics queue on AMD	Mar 15, 2026

Root Cause

Ollama's vendored llama.cpp is currently at b7437 (Dec 16, 2025). Two significant Vulkan/AMD performance PRs landed in llama.cpp after that and have not yet been picked up by Ollama:

PR	Description	Merged into llama.cpp
ggml-org/llama.cpp#19625	Vulkan: scalar flash attention refactor + Wave32 on AMD	Feb 24, 2026
ggml-org/llama.cpp#20551	Vulkan: use graphics queue on AMD	Mar 15, 2026

Fix Action

Fix / Workaround

PR	Description	Merged into llama.cpp
ggml-org/llama.cpp#19625	Vulkan: scalar flash attention refactor + Wave32 on AMD	Feb 24, 2026
ggml-org/llama.cpp#20551	Vulkan: use graphics queue on AMD	Mar 15, 2026

Current Workaround

This works, but it's a significant workaround — model management, multi-model serving, and automatic updates all have to be handled manually. Ollama fixing this would make the workaround unnecessary.

PR fix notes

PR #19625: Vulkan Scalar Flash Attention Refactor

Repository: ggml-org/llama.cpp
Author: 0cc4m
State: closed | merged: True
Link: https://github.com/ggml-org/llama.cpp/pull/19625

Description (problem / solution / changelog)

This started out as an attempt to go through the scalar FA version and add proper float16 support to improve AMD and Intel performance and went quite a bit further. @jeffbolznv Sorry about the amount of changes, let me know if there's something I can do to make the review easier. Please also let me know if you have architectural concerns. Flash Attention has so many dimensions and making it work well on so much hardware and models is pretty hard. I had to spend quite a lot of time figuring out and fixing regressions on specific configurations.

<details> <summary>AI-generated summary of changes</summary>

Scalar Flash Attention Core Optimizations

Implemented row splitting within workgroups (row_split = 1 or 4) for better subgroup utilization
Added shared memory staging for K and V loads on Nvidia GPUs when head sizes < 256
Cached Q values in registers for KQ computation when HSK_per_thread > 16
Fused loop for Lf accumulation and Of scaling by eMf
Changed to vectorized vec4 stores for output
Optimized masksh layout with stride padding (Br + 1) and removed unnecessary barrier

Row Size Tiering

Replaced binary small_rows/large_rows with three-tier system: FA_ROWS_1, FA_ROWS_SMALL, FA_ROWS_LARGE
Dynamic Br selection based on head sizes, device vendor, and architecture
FA_ROWS_1 uses Br=1 for N=1, FA_ROWS_SMALL uses Br=8, FA_ROWS_LARGE uses Br=16
Device-specific adjustments: AMD GCN uses smaller Br, Intel uses Br=8 maximum

Vendor-Specific Optimizations

AMD RDNA: Use wave32 subgroup size for scalar FA when N=1
Intel: Added shader core count lookup table for Alchemist and Battlemage GPUs
Intel: Disable subgroup operations in favor of shared memory reductions
Intel Alchemist: Apply 2x shader core count multiplier for split_k calculation
Adjusted workgroup sizes per vendor and head size combinations

split_k Enhancements

Relaxed split_k conditions to support non-GQA workloads
Fixed dispatch logic to handle both GQA and non-GQA cases correctly
Improved split_k calculation based on total workgroup count and shader cores

Device Compatibility

Added FP32 shader variants (_fp32 suffix) for devices without FP16 support
Made FLOAT_TYPE conditional on device capabilities
Updated dequantize4 functions to use FLOAT_TYPE instead of hardcoded float

Shared Memory Management

Dynamic tmpsh sizing based on row_split and subgroup configuration
Added kvsh buffer for K/V staging (size conditional on SHMEM_STAGING flag)
Improved Qf buffer stride calculation
Fixed tmpsh size calculation for split_k temporaries

Code Path Selection

Switch from coopmat1 to scalar when N=1 or rows=FA_ROWS_1
Improved shared memory size checks for scalar path fallback
Better alignment checking and stride validation

Shader Compilation

Made coopmat1/coopmat2 pipeline creation conditional on device FP16 support
Added subgroup size configuration per code path and row configuration
Removed hardcoded subgroup size assumptions

</details>

Benchmarks

<details> <summary>AMD Radeon Pro VII</summary>

model	size	params	ngl	fa	test	t/s (ROCm)	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	1003.15 ± 0.89	800.28 ± 1.41	827.57 ± 0.74	+3.4%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	85.12 ± 1.39	98.55 ± 0.55	97.83 ± 0.47	-0.7%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	689.31 ± 0.64	174.36 ± 0.42	388.72 ± 3.37	+122.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	69.91 ± 0.20	55.97 ± 0.20	72.24 ± 0.34	+29.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	525.25 ± 1.68	84.33 ± 0.11	247.07 ± 1.51	+193.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	60.48 ± 0.17	41.46 ± 0.12	57.70 ± 0.57	+39.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	1061.99 ± 7.85	1319.64 ± 7.82	1321.90 ± 6.90	+0.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	110.86 ± 0.97	136.10 ± 0.27	127.75 ± 0.88	-6.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	745.39 ± 1.25	757.62 ± 3.94	740.88 ± 4.66	-2.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	101.64 ± 0.41	116.38 ± 0.17	113.37 ± 0.93	-2.6%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	577.95 ± 3.32	509.10 ± 3.64	484.85 ± 2.85	-4.8%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	99.23 ± 0.21	107.31 ± 0.68	102.88 ± 1.13	-4.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	351.98 ± 3.24	749.40 ± 5.15	759.11 ± 4.74	+1.3%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	68.83 ± 0.11	95.12 ± 0.22	93.94 ± 0.45	-1.2%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	295.91 ± 3.09	207.63 ± 0.63	312.17 ± 5.34	+50.3%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	60.01 ± 0.77	55.87 ± 0.35	73.73 ± 0.68	+32.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	247.76 ± 0.77	114.90 ± 0.42	191.18 ± 1.32	+66.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	55.69 ± 0.30	44.11 ± 0.11	61.76 ± 0.63	+40.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	641.90 ± 2.66	657.73 ± 3.46	740.63 ± 1.78	+12.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	47.72 ± 0.13	64.38 ± 0.19	65.54 ± 0.32	+1.8%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d8192	293.28 ± 0.54	83.15 ± 0.33	129.38 ± 0.69	+55.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d8192	38.76 ± 0.07	35.93 ± 0.20	37.94 ± 0.33	+5.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d16384	189.33 ± 0.18	41.62 ± 0.24	70.77 ± 0.49	+70.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d16384	31.80 ± 0.08	24.39 ± 0.36	26.41 ± 0.22	+8.3%

</details> <details> <summary>AMD 8060S</summary>

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	994.34 ± 34.50	947.41 ± 7.78	-4.7%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	45.14 ± 0.44	44.86 ± 0.42	-0.6%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	418.71 ± 11.10	397.77 ± 8.90	-5.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	35.83 ± 0.09	35.68 ± 0.08	-0.4%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	234.05 ± 5.66	246.05 ± 11.58	+5.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	30.53 ± 0.08	30.13 ± 0.11	-1.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	1263.73 ± 34.96	1208.77 ± 37.78	-4.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	73.19 ± 0.13	72.68 ± 0.10	-0.7%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	920.01 ± 4.93	919.00 ± 4.71	-0.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	66.74 ± 0.45	66.42 ± 0.13	-0.5%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	670.22 ± 4.61	670.46 ± 5.07	+0.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	61.53 ± 0.78	61.78 ± 1.08	+0.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	945.03 ± 32.97	992.30 ± 11.33	+5.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	91.76 ± 0.06	91.60 ± 0.53	-0.2%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	487.96 ± 2.76	479.56 ± 4.25	-1.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	66.47 ± 0.33	66.13 ± 0.27	-0.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	302.07 ± 1.01	286.72 ± 1.03	-5.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	50.54 ± 0.19	49.64 ± 0.88	-1.8%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512	924.97 ± 10.45	923.58 ± 4.06	-0.2%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128	61.52 ± 0.34	61.43 ± 0.41	-0.1%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512 @ d8192	306.02 ± 0.84	297.15 ± 0.91	-2.9%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128 @ d8192	38.31 ± 0.20	39.20 ± 0.17	+2.3%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512 @ d16384	192.72 ± 0.35	182.25 ± 0.82	-5.4%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128 @ d16384	27.83 ± 0.16	28.83 ± 0.01	+3.6%

</details> <details> <summary>AMD 8060S (Without Coopmat)</summary>

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	815.03 ± 7.22	822.68 ± 4.39	+0.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	44.96 ± 0.22	45.36 ± 0.30	+0.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	67.06 ± 4.00	190.34 ± 2.98	+183.8%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	31.53 ± 0.13	35.31 ± 0.28	+12.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	28.05 ± 0.85	78.89 ± 4.18	+181.2%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	25.53 ± 0.17	29.71 ± 0.08	+16.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	1249.96 ± 37.10	1187.02 ± 15.67	-5.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	73.17 ± 0.06	72.39 ± 0.23	-1.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	681.99 ± 1.44	681.63 ± 2.60	-0.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	66.34 ± 0.35	66.37 ± 0.21	+0.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	438.09 ± 2.70	408.44 ± 7.02	-6.8%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	61.46 ± 0.62	61.54 ± 0.76	+0.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	617.33 ± 13.14	614.00 ± 6.22	-0.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	94.84 ± 0.20	92.14 ± 0.22	-2.8%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	179.49 ± 0.92	227.94 ± 1.12	+27.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	57.91 ± 0.39	67.14 ± 0.11	+15.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	86.39 ± 0.78	128.04 ± 0.64	+48.2%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	43.22 ± 0.18	51.58 ± 0.14	+19.3%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512	727.26 ± 4.81	810.87 ± 5.13	+11.5%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128	61.59 ± 0.70	61.90 ± 0.12	+0.5%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512 @ d8192	105.57 ± 0.50	178.01 ± 0.22	+68.6%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128 @ d8192	38.58 ± 0.19	39.50 ± 0.33	+2.4%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	pp512 @ d16384	52.56 ± 0.29	94.60 ± 0.41	+80.0%
deepseek2 30B.A3B Q4_0	16.03 GiB	29.94 B	99	1	tg128 @ d16384	28.02 ± 0.18	28.98 ± 0.06	+3.4%

</details> <details> <summary>Intel A770</summary>

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	818.22 ± 0.63	812.84 ± 1.85	-0.7%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	32.64 ± 0.07	32.45 ± 0.05	-0.6%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d2048	97.15 ± 0.05	550.81 ± 1.20	+467.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d2048	21.67 ± 0.02	27.75 ± 0.02	+28.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d4096	43.79 ± 2.97	405.21 ± 0.78	+825.3%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d4096	17.28 ± 0.00	25.06 ± 0.01	+45.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	930.73 ± 3.24	898.65 ± 3.47	-3.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	41.29 ± 0.07	37.53 ± 0.11	-9.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d2048	701.16 ± 3.52	670.17 ± 4.91	-4.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d2048	31.19 ± 0.06	31.73 ± 0.03	+1.7%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d4096	545.63 ± 1.16	495.18 ± 0.71	-9.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d4096	28.83 ± 0.09	29.27 ± 0.04	+1.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	640.10 ± 3.55	657.27 ± 3.54	+2.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	33.43 ± 0.08	30.04 ± 0.03	-10.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d2048	60.27 ± 4.78	281.25 ± 1.21	+366.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d2048	20.16 ± 0.02	22.98 ± 0.03	+14.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d4096	26.38 ± 0.63	310.19 ± 1.68	+1075.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d4096	18.27 ± 0.03	23.61 ± 0.08	+29.2%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	167.35 ± 0.17	66.63 ± 0.23	-60.2%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	19.23 ± 0.01	20.38 ± 0.03	+6.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d2048	26.23 ± 1.02	25.38 ± 0.01	-3.2%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d2048	5.95 ± 0.00	13.59 ± 0.01	+128.4%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d4096	25.54 ± 0.02	25.29 ± 0.04	-1.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d4096	3.64 ± 0.00	10.37 ± 0.00	+184.9%

</details> <details> <summary>Nvidia RTX 3090 (Coopmat2)</summary>

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	4666.60 ± 19.46	4721.23 ± 12.32	+1.2%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	144.71 ± 1.53	147.49 ± 0.52	+1.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	3426.64 ± 19.29	3428.98 ± 22.04	+0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	114.85 ± 0.97	115.92 ± 0.34	+0.9%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	2695.37 ± 16.65	2692.89 ± 16.34	-0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	99.65 ± 0.73	99.82 ± 0.29	+0.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	4520.31 ± 33.68	4513.71 ± 30.22	-0.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	177.65 ± 0.75	177.15 ± 0.77	-0.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	4040.47 ± 78.90	4049.94 ± 174.56	+0.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	156.59 ± 1.58	155.91 ± 0.78	-0.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	3546.97 ± 21.35	3529.89 ± 36.63	-0.5%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	147.96 ± 0.76	145.37 ± 0.48	-1.8%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	3469.59 ± 17.36	3465.49 ± 34.45	-0.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	178.72 ± 0.64	177.48 ± 2.05	-0.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	2508.75 ± 42.02	2500.37 ± 34.47	-0.3%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	141.66 ± 0.54	141.16 ± 0.65	-0.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	1942.67 ± 15.90	1936.24 ± 20.12	-0.3%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	123.39 ± 0.72	123.21 ± 0.29	-0.1%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	2287.89 ± 11.77	2289.12 ± 9.34	+0.1%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	116.47 ± 0.80	114.38 ± 3.56	-1.8%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d8192	1047.29 ± 9.19	1047.12 ± 9.51	-0.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d8192	90.74 ± 0.34	90.44 ± 0.37	-0.3%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d16384	647.46 ± 3.70	644.65 ± 3.78	-0.4%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d16384	81.92 ± 0.81	82.07 ± 0.20	+0.2%

</details> <details> <summary>Nvidia RTX 3090 (Coopmat1)</summary>

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	4117.11 ± 10.81	4052.19 ± 17.94	-1.6%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	145.98 ± 1.84	144.04 ± 0.74	-1.3%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	2182.12 ± 11.97	2359.95 ± 10.14	+8.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	115.72 ± 0.56	116.46 ± 0.62	+0.6%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	1486.54 ± 4.89	1671.90 ± 9.35	+12.5%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	99.15 ± 0.74	101.36 ± 0.32	+2.2%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	3062.95 ± 94.07	3090.31 ± 33.32	+0.9%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	175.29 ± 0.83	175.87 ± 0.88	+0.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	2439.28 ± 32.02	2494.98 ± 47.57	+2.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	148.99 ± 14.70	154.40 ± 2.18	+3.6%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	1964.74 ± 21.60	2098.26 ± 19.00	+6.8%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	147.55 ± 0.70	147.66 ± 0.69	+0.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	2839.27 ± 26.12	2837.32 ± 30.26	-0.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	174.78 ± 1.25	176.05 ± 1.26	+0.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	1505.57 ± 14.41	1639.74 ± 14.94	+8.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	137.34 ± 0.86	139.22 ± 2.10	+1.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	1010.90 ± 10.49	1146.23 ± 14.19	+13.4%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	119.58 ± 0.71	121.95 ± 0.88	+2.0%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	1968.30 ± 10.15	1954.94 ± 33.29	-0.7%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	114.35 ± 0.87	115.05 ± 0.80	+0.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d8192	554.73 ± 1.56	555.49 ± 1.82	+0.1%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d8192	62.50 ± 0.51	63.21 ± 0.34	+1.1%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d16384	314.59 ± 0.93	315.91 ± 1.26	+0.4%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d16384	43.01 ± 0.10	43.98 ± 0.15	+2.3%

</details> <details> <summary>Nvidia RTX 3090 (Without Coopmat)</summary>

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	2129.81 ± 5.52	2081.00 ± 42.53	-2.3%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	145.98 ± 0.24	144.26 ± 0.53	-1.2%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d8192	997.77 ± 3.31	1048.43 ± 25.28	+5.1%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d8192	110.19 ± 0.54	112.16 ± 0.12	+1.8%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512 @ d16384	637.54 ± 1.09	701.26 ± 11.14	+10.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128 @ d16384	94.33 ± 0.22	95.27 ± 0.31	+1.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	2410.79 ± 15.88	2331.15 ± 89.00	-3.3%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	176.60 ± 0.74	173.28 ± 0.72	-1.9%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d8192	1582.99 ± 17.17	1429.18 ± 11.60	-9.7%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d8192	153.60 ± 1.60	150.58 ± 0.91	-2.0%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512 @ d16384	1114.36 ± 154.82	1009.61 ± 23.16	-9.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128 @ d16384	146.14 ± 0.64	143.19 ± 1.18	-2.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	1159.21 ± 12.74	1137.29 ± 13.35	-1.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	177.45 ± 1.07	175.96 ± 1.95	-0.8%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d8192	592.47 ± 4.68	620.55 ± 6.11	+4.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d8192	130.00 ± 0.58	135.84 ± 1.70	+4.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512 @ d16384	387.10 ± 1.89	425.32 ± 0.85	+9.9%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128 @ d16384	113.49 ± 0.51	117.90 ± 0.71	+3.9%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512	1050.83 ± 17.39	1092.14 ± 16.92	+3.9%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128	114.66 ± 2.79	115.36 ± 3.33	+0.6%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d8192	281.20 ± 1.84	342.26 ± 2.76	+21.7%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d8192	63.73 ± 0.06	63.90 ± 0.37	+0.3%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	pp512 @ d16384	159.38 ± 1.00	202.89 ± 2.03	+27.3%
deepseek2 30B.A3B Q3_K - Small	12.37 GiB	29.94 B	99	1	tg128 @ d16384	43.40 ± 0.05	44.22 ± 0.09	+1.9%

</details>

Changed files

ggml/src/ggml-vulkan/ggml-vulkan.cpp (modified, +360/-212)
ggml/src/ggml-vulkan/vulkan-shaders/flash_attn.comp (modified, +354/-176)
ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_base.glsl (modified, +40/-23)
ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm1.comp (modified, +135/-89)
ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp (modified, +35/-7)
ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp (modified, +46/-34)

PR #20551: vulkan: use graphics queue on AMD

Repository: ggml-org/llama.cpp
Author: 0cc4m
State: closed | merged: True
Link: https://github.com/ggml-org/llama.cpp/pull/20551

Description (problem / solution / changelog)

I'm not sure why, but the graphics queue is slightly faster in tg on AMD than the compute queue, and this also fixes the partial offload issue I fixed in #19976, so the second queue no longer has to be enabled by default. I got the idea from @zedbytes reporting that tg goes up when running with RADV_DEBUG=nocompute.

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	20	1	pp512	2288.04 ± 2.42	2225.76 ± 2.31	-2.7%
llama 8B Q4_0	4.33 GiB	8.03 B	20	1	tg128	24.33 ± 0.04	24.58 ± 0.05	+1.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	4886.26 ± 105.08	4901.77 ± 102.66	+0.3%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	115.78 ± 0.02	121.39 ± 0.02	+4.8%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	20	1	pp512	736.21 ± 9.37	735.19 ± 7.51	-0.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	20	1	tg128	39.53 ± 0.10	40.36 ± 0.21	+2.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	3383.58 ± 29.26	3425.38 ± 28.68	+1.2%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	200.45 ± 1.89	220.41 ± 1.46	+10.0%

</details> <details> <summary>AMD Radeon Pro VII</summary>

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	20	1	pp512	636.62 ± 9.07	615.62 ± 0.79	-3.3%
llama 8B Q4_0	4.33 GiB	8.03 B	20	1	tg128	38.35 ± 0.09	38.20 ± 0.01	-0.4%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	830.30 ± 1.51	834.44 ± 1.05	+0.5%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	102.45 ± 0.64	100.28 ± 0.24	-2.1%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	20	1	pp512	289.76 ± 3.59	287.75 ± 3.10	-0.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	20	1	tg128	34.57 ± 0.32	34.05 ± 1.20	-1.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	749.65 ± 5.42	762.52 ± 5.89	+1.7%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	94.70 ± 0.46	97.55 ± 0.20	+3.0%

</details>

Changed files

ggml/src/ggml-vulkan/ggml-vulkan.cpp (modified, +5/-5)

Code Example

# Build llama.cpp from source with Vulkan
git clone https://github.com/ggml-org/llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Run llama-server directly
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  ./build/bin/llama-server \
  --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --port 11434 -ngl 99 -fa on --no-mmap

---

# With Ollama (v0.20.5)
ollama run gemma4:26b
# observe ~34 t/s in generation

# With standalone llama.cpp b8765 (same model, same quant, same hardware)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  llama-bench -m gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 99 -mmp 0 -fa 1 -p 0 -n 128
# observe 52+ t/s

RAW_BUFFERClick to expand / collapse

Summary

Ollama's vendored llama.cpp is currently at b7437 (Dec 16, 2025). Two significant Vulkan/AMD performance PRs landed in llama.cpp after that and have not yet been picked up by Ollama:

PR	Description	Merged into llama.cpp
ggml-org/llama.cpp#19625	Vulkan: scalar flash attention refactor + Wave32 on AMD	Feb 24, 2026
ggml-org/llama.cpp#20551	Vulkan: use graphics queue on AMD	Mar 15, 2026

Measured Impact

Benchmarked on the same hardware, same model, same flags (-ngl 99 -fa 1 --no-mmap):

Setup	gemma4:26b Q4_K_XL tg128	gemma4:e4b Q4_K_XL tg128
Ollama v0.20.5 (llama.cpp b7437)	~34 t/s	~34 t/s
llama.cpp b8765 (has both PRs)	52.3 t/s	56.2 t/s
Windows LM Studio (same hardware)	~56 t/s	~56 t/s

That's a ~56% throughput improvement from two Vulkan-specific commits that Ollama simply hasn't vendored yet. Standalone llama.cpp b8765 on Linux/Vulkan is now at parity with Windows LM Studio on the same machine. This is not a hardware/driver issue — the gap disappears entirely when running standalone llama.cpp.

Token speed vs context depth (llama.cpp b8765, tg128)

For reference, full context-depth profile on this hardware:

Context depth	gemma4:26b	gemma4:e4b
d0 (fresh)	52.3 t/s	56.2 t/s
d8k	45.6 t/s	~50 t/s
d32k	40.1 t/s	42.5 t/s
d64k	35.1 t/s	35.0 t/s
d128k	17.0 t/s	26.1 t/s

With Ollama (b7437) you're stuck at ~34 t/s even at d0 — below what standalone llama.cpp delivers at d64k.

Current Workaround

Due to this gap, I switched from Ollama to llama-swap + llama.cpp built from source. llama-swap is a lightweight proxy that hot-swaps llama-server instances on a single port, making it a drop-in Ollama replacement (same port 11434, OpenAI-compatible API).

Setup:

# Build llama.cpp from source with Vulkan
git clone https://github.com/ggml-org/llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Run llama-server directly
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  ./build/bin/llama-server \
  --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --port 11434 -ngl 99 -fa on --no-mmap

System


Hardware	Minisforum MS-S1 Max (AMD Ryzen AI MAX+ 395 / Radeon 8060S, Strix Halo)
GPU arch	gfx1151, 128 GB unified memory (iGPU shares system RAM)
OS	Ubuntu 24.04.4 LTS
Kernel	6.19.11
Vulkan driver	RADV (Mesa 25.2.8), `radeon_icd.json`
Ollama version	v0.20.5 (llama.cpp b7437, Dec 16 2025)
Standalone llama.cpp	b8765 (Apr 2026), built from source with Vulkan

Steps to Reproduce

# With Ollama (v0.20.5)
ollama run gemma4:26b
# observe ~34 t/s in generation

# With standalone llama.cpp b8765 (same model, same quant, same hardware)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  llama-bench -m gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 99 -mmp 0 -fa 1 -p 0 -n 128
# observe 52+ t/s

Request

Please update the vendored llama.cpp to a commit that includes both PRs (any commit ≥ b8500 / after Mar 15, 2026 should include both). The ROCm 7.2.1 update in v0.20.7 is appreciated, but the Vulkan path (which is what iGPU/APU users rely on — ROCm doesn't support Strix Halo yet) is still stuck on December code.

Users with AMD APUs (Strix Halo, Phoenix, Hawk Point) running Vulkan are leaving ~56% performance on the table compared to what's already available in upstream llama.cpp.

extent analysis

TL;DR

Update the vendored llama.cpp in Ollama to a commit that includes both performance-enhancing PRs (≥ b8500 / after Mar 15, 2026) to unlock the ~56% throughput improvement.

Guidance

Identify the current vendored llama.cpp version in Ollama (b7437) and recognize it lacks crucial Vulkan performance updates.
Understand that updating to a version that includes PRs #19625 and #20551 (e.g., b8765 or later) is necessary for the performance boost.
Consider building llama.cpp from source with Vulkan support as a temporary workaround, as demonstrated in the issue.
Note that simply updating Ollama to use a newer version of llama.cpp that includes these PRs should resolve the performance gap without needing manual model management or other workarounds.

Example

# Example of building and running llama.cpp from source for comparison
git clone https://github.com/ggml-org/llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  ./build/bin/llama-server \
  --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --port 11434 -ngl 99 -fa on --no-mmap

Notes

The performance improvement is specific to Vulkan on AMD hardware, particularly for users with integrated graphics (iGPU) like those found in AMD APUs. The workaround using llama-swap and manually built llama.cpp indicates the issue is with Ollama's vendored version of llama.cpp, not the hardware or drivers.

Recommendation

Apply the workaround of building llama.cpp from source and using it with llama-swap until Ollama updates its vendored llama.cpp version, as this provides a functional, albeit manual, solution to achieve the performance benefits of the newer llama.cpp versions.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #logging issue #authentication issue #prompt issue #agent setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

ollama - ✅(Solved) Fix Vulkan/AMD performance: vendored llama.cpp (b7437, Dec 2025) missing Wave32 FA (#19625) and graphics queue (#20551) — ~56% t/s gap vs standalone llama.cpp [2 pull requests, 9 comments, 4 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Current Workaround

PR fix notes

PR #19625: Vulkan Scalar Flash Attention Refactor

Description (problem / solution / changelog)

Scalar Flash Attention Core Optimizations

Row Size Tiering

Vendor-Specific Optimizations

split_k Enhancements

Device Compatibility

Shared Memory Management

Code Path Selection

Shader Compilation

Benchmarks

Changed files

PR #20551: vulkan: use graphics queue on AMD

Description (problem / solution / changelog)

Changed files

Code Example

Summary

Measured Impact

Token speed vs context depth (llama.cpp b8765, tg128)

Current Workaround

System

Steps to Reproduce

Request

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING