ollama - 💡(How to fix) Fix Qwen3.5 crashes on NVIDIA Turing GPUs (RTX 2080 Ti) [1 comment, 2 participants]

GitHub stats
ollama/ollama#14715 (fetched 2026-04-08 00:32:42)
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): closed ×1, commented ×1, labeled ×1, subscribed ×1


What is the issue?

Title: [Bug] Qwen3.5 crashes on NVIDIA Turing GPUs (RTX 2080 Ti) with Xid 43/31; Compiler warning in llama-graph.cpp suggests undefined behavior

Description:

Summary

Running Qwen3.5 models (e.g., qwen3.5:9b) on NVIDIA Turing architecture GPUs (specifically the RTX 2080 Ti) causes immediate system instability, driver resets (Xid 43, Xid 31), or silent process termination during inference. Additionally, compiling the latest source code triggers a severe GCC warning (-Waggressive-loop-optimizations) in llama-graph.cpp, indicating potential undefined behavior (UB) in the computation graph logic. While newer architectures (Ampere/Ada) seem unaffected, Turing cards fail consistently.

Environment

  • OS: Linux (Ubuntu/Debian based)
  • GPU: NVIDIA GeForce RTX 2080 Ti (22GB VRAM, modified)
  • Architecture: Turing (Compute Capability 7.5)
  • Driver Version: not specified (the report leaves the template placeholder unfilled; 535.xx and 550.xx appear only as examples)
  • Ollama Version: latest source build (post-v0.17.5) / v0.17.5 binary
  • Model: qwen3.5:9b (GGUF)
  • Compiler: GCC (version not specified; 11.4.0 appears only as an example)

Symptoms

  • Driver Crash: Upon initiating inference (often after the first token or during KV cache expansion), the GPU drops off the bus. dmesg logs show NVRM: Xid (PCI:0000:xx:xx.x): 43, pid=xxxx, Ch 00, [...] or Xid 31. The system often requires a hard reboot; nvidia-smi fails to respond.
  • Silent Termination: In some cases, the ollama_llama_server process dies without a clear error message in the application log, just before the driver reset.
  • Compilation Warning: Building from source reveals a warning that points to a logic flaw:

github.com/ollama/ollama/llama/llama.cpp/src
llama-graph.cpp: In member function ‘virtual void llm_graph_input_attn_cross::set_input(const llama_ubatch*)’:
llama-graph.cpp:473:9: warning: iteration 2147483645 invokes undefined behavior [-Waggressive-loop-optimizations]
  | for (int i = n_tokens; i < n_tokens; ++i) {
  | ^~~
llama-graph.cpp:473:34: note: within this loop
  | for (int i = n_tokens; i < n_tokens; ++i) {
  | ~~^~~~~~~~~~

Steps to Reproduce

  1. Install Ollama on a machine with an RTX 2080 Ti (Turing).
  2. Pull the model: ollama pull qwen3.5:9b.
  3. Run a simple generation: ollama run qwen3.5:9b "Hello".
  4. Observe the system hang, driver reset, or process crash.
  5. (Optional) Compile from source to see the llama-graph.cpp warning.

Technical Analysis & Hypothesis

  • The Loop Logic: The code for (int i = n_tokens; i < n_tokens; ++i) is logically a no-op, because the condition is false on entry. However, the GCC warning about "iteration 2147483645" suggests that the compiler sees a path where signed integer overflow or aggressive optimization leads to undefined behavior.
  • Impact on Turing: In C++, UB can cause the compiler to generate optimized machine code that behaves unpredictably. The hypothesis is that Turing GPUs (or the CUDA kernels generated for CC 7.5) are extremely sensitive to the resulting control flow or memory layout, leading to illegal memory accesses or invalid kernel launches. This link has not been confirmed.
  • Qwen3.5 Specifics: This model uses new attention mechanisms (Hybrid/MROPE). The llm_graph_input_attn_cross class is likely heavily utilized. If the graph construction is flawed due to this UB, the resulting CUDA graph sent to the Turing GPU may contain invalid instructions, causing the Xid 43 error (documented by NVIDIA as "GPU stopped processing"; "GPU has fallen off the bus" is Xid 79).
  • Why not Ampere?: Newer architectures might have more robust error handling, or the specific instruction sequence generated by the optimizer may happen to be "safe enough" on CC 8.0+, masking the underlying bug.
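The claim that the flagged loop header never executes its body can be checked in isolation. The following is a minimal sketch, not code from llama.cpp: count_flagged_iterations and check_flagged_loop_is_noop are hypothetical helpers that reuse only the loop header quoted in the warning.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper (not part of llama.cpp): counts how many times a loop
// with the flagged header would execute its body for a given n_tokens.
int count_flagged_iterations(int n_tokens) {
    int count = 0;
    // Same header as the loop reported at llama-graph.cpp:473.
    // i starts equal to n_tokens, so i < n_tokens is false on entry
    // and ++i is never evaluated.
    for (int i = n_tokens; i < n_tokens; ++i) {
        ++count;
    }
    return count;
}

// Hypothetical self-check covering typical and extreme token counts.
void check_flagged_loop_is_noop() {
    assert(count_flagged_iterations(0) == 0);
    assert(count_flagged_iterations(1) == 0);
    assert(count_flagged_iterations(512) == 0);
    assert(count_flagged_iterations(INT_MAX) == 0);
}
```

Calling check_flagged_loop_is_noop() from a main() function should pass silently. This only shows that the loop as written does nothing at the source level; it does not by itself confirm or rule out the hypothesis about the GPU fault.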

Expected Behavior

The model should run stably on Turing GPUs, utilizing the available 22GB VRAM. No compiler warnings regarding undefined behavior should exist in critical graph construction paths.

Suggested Fix

  • Immediate Code Fix: Inspect and correct line 473 in llama-graph.cpp. If the loop is intended to be empty, remove it entirely or wrap it in an explicit if (false) block to prevent compiler misinterpretation.

// Current problematic code:
// for (int i = n_tokens; i < n_tokens; ++i) { ... }

// Proposed fix:
// Remove the loop if it serves no purpose, or fix the logic if it was meant to iterate.

  • Turing-Specific Testing: Add CI tests or manual verification steps specifically for Compute Capability 7.5 (Turing) when running Qwen3.5 series models.
  • Kernel Validation: Ensure that the computed graph splits and memory offsets do not exceed 32-bit integer limits or align poorly on older architectures.
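The "Kernel Validation" suggestion about 32-bit limits can be illustrated with a small range check. This is a sketch under stated assumptions: last_index_fits_int32 is a hypothetical helper, and the row/column model of a buffer is chosen only for illustration; it is not taken from the ollama or llama.cpp sources.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: returns true when the last element index of an
// n_rows x n_cols buffer (n_rows * n_cols - 1) fits in a signed 32-bit int.
bool last_index_fits_int32(std::int64_t n_rows, std::int64_t n_cols) {
    if (n_rows <= 0 || n_cols <= 0) {
        return true; // empty buffer: no index is ever formed
    }
    const std::int64_t max_i32 = std::numeric_limits<std::int32_t>::max();
    // n_rows * n_cols - 1 <= max_i32  is equivalent to
    // n_rows <= (max_i32 + 1) / n_cols  for positive integers.
    // Dividing instead of multiplying keeps the check itself overflow-free.
    return n_rows <= (max_i32 + 1) / n_cols;
}
```

For example, a 4096 x 4096 buffer passes the check, while a 65536 x 65536 buffer does not. A check like this would typically sit in a debug assertion near the code that computes offsets, so that oversized shapes fail loudly on the host instead of producing a wrapped index.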

Logs

Journalctl / Ollama log snippet (before crash):

Mar 08 19:00:30 aiserver ollama[6612]: level=DEBUG source=ggml.go:852 msg="compute graph" nodes=16775 splits=4
Mar 08 19:00:30 aiserver ollama[6612]: level=INFO source=ggml.go:494 msg="offloaded 33/33 layers to GPU"
Mar 08 19:00:33 aiserver ollama[6612]: level=INFO source=server.go:1388 msg="llama runner started in 5.53 seconds"
... Log cuts off abruptly or followed by Xid error in dmesg ...

dmesg error:

NVRM: Xid (PCI:0000:09:00.0): 43, pid=XXXX, Ch 00, [XXX]
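When triaging reports like this one, it can help to pull the Xid code out of the kernel log line programmatically. The sketch below assumes the line format shown in the dmesg excerpt above; parse_xid_code is a hypothetical helper, not an NVIDIA or ollama API.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the Xid code from an NVRM kernel log line,
// or -1 if the line does not match the format shown in the report.
// Note: std::stoi throws std::out_of_range for absurdly long digit runs,
// which real Xid codes never produce.
int parse_xid_code(const std::string& line) {
    // Matches e.g. "NVRM: Xid (PCI:0000:09:00.0): 43, pid=..."
    static const std::regex pattern(R"(NVRM: Xid \(PCI:[^)]*\): (\d+))");
    std::smatch match;
    if (!std::regex_search(line, match, pattern)) {
        return -1;
    }
    return std::stoi(match[1].str());
}
```

Applied to the dmesg line from the report, the function returns 43; applied to an ordinary ollama log line, it returns -1.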

Relevant log output: (none)

OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.17.0

extent analysis

Fix Plan

To resolve the issue, we need to address the undefined behavior in the llama-graph.cpp file. The problematic code is:

for (int i = n_tokens; i < n_tokens; ++i) { ... }

This loop is logically a no-op, but the compiler warning suggests that it may cause undefined behavior.

Step-by-Step Solution

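  1. Remove the loop: If the loop serves no purpose, remove it entirely.
// Remove the following loop, including its body
// for (int i = n_tokens; i < n_tokens; ++i) { ... }
  2. Fix the logic: If the loop was meant to iterate, correct its bounds. Starting at 0 is shown only as an illustration; it changes behavior, so confirm the intended range against the upstream code before applying it.
// Example: fix the loop bounds
for (int i = 0; i < n_tokens; ++i) { ... }
  3. Add a check: Optionally guard the loop so that it runs only when n_tokens is positive. The guard is redundant for correctness, because the loop condition already skips the body when n_tokens <= 0, but it makes the intent explicit.
// Example: add a check
if (n_tokens > 0) {
    for (int i = 0; i < n_tokens; ++i) { ... }
}
  4. Verify the fix: Recompile, confirm that the warning is gone, and run the Qwen3.5 model on the affected GPU to check whether the crash still occurs.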
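The options in the list above are not equivalent, which the following sketch tries to make concrete. It assumes n_tokens >= 0 and uses a hypothetical stand-in (touch_row) for the elided loop body; the real body in llama-graph.cpp is not shown in the report and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the elided loop body: marks row i as touched.
static void touch_row(std::vector<int>& rows, int i) {
    rows[static_cast<std::size_t>(i)] = 1;
}

// Loop header exactly as flagged by the warning. Assumes n_tokens >= 0.
std::vector<int> run_flagged(int n_tokens) {
    std::vector<int> rows(static_cast<std::size_t>(n_tokens), 0);
    for (int i = n_tokens; i < n_tokens; ++i) {
        touch_row(rows, i);
    }
    return rows;
}

// Loop removed entirely.
std::vector<int> run_removed(int n_tokens) {
    return std::vector<int>(static_cast<std::size_t>(n_tokens), 0);
}

// Loop bounds changed to start at 0.
std::vector<int> run_from_zero(int n_tokens) {
    std::vector<int> rows(static_cast<std::size_t>(n_tokens), 0);
    for (int i = 0; i < n_tokens; ++i) {
        touch_row(rows, i);
    }
    return rows;
}
```

For any positive n_tokens, run_flagged and run_removed return the same untouched rows, while run_from_zero touches every row. In other words, removing the loop preserves the current behavior, whereas changing the start index to 0 is a functional change that needs separate justification.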

Code Example

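If the loop was meant to iterate over all tokens (an assumption; the intended bounds are not confirmed by the report), the corrected code could look like this: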

// llama-graph.cpp
void llm_graph_input_attn_cross::set_input(const llama_ubatch* input) {
    // ...
    if (n_tokens > 0) {
        for (int i = 0; i < n_tokens; ++i) {
            // ...
        }
    }
    // ...
}

Verification

To verify the fix, rebuild and confirm that the undefined behavior warning no longer appears, then run the Qwen3.5 model on the Turing GPU and check the application log and dmesg for crashes or Xid errors. A clean build alone does not show that the crash is resolved.

Extra Tips

  • Always check for compiler warnings and address them promptly to prevent undefined behavior.
  • Build with compiler flags such as gcc -Wall -Wextra to enable additional warnings and catch potential issues early.
  • Test your code thoroughly on different architectures and platforms to ensure compatibility and stability.
