ollama - 💡(How to fix) Fix Qwen3.5 crashes on NVIDIA Turing GPUs (RTX 2080 Ti) [1 comments, 2 participants]

Error Message

In some cases, the ollama_llama_server process dies without a clear error message in the application log. The log cuts off abruptly or is followed by an Xid error in dmesg:

NVRM: Xid (PCI:0000:09:00.0): 43, pid=XXXX, Ch 00, [XXX]

What is the issue?

Title: [Bug] Qwen3.5 crashes on NVIDIA Turing GPUs (RTX 2080 Ti) with Xid 43/31; Compiler warning in llama-graph.cpp suggests undefined behavior

Description:

1. Summary

Running Qwen3.5 models (e.g., qwen3.5:9b) on NVIDIA Turing architecture GPUs (specifically RTX 2080 Ti) causes immediate system instability, driver resets (Xid 43, Xid 31), or silent process termination during inference. Additionally, compiling the latest source code triggers a severe GCC warning (-Waggressive-loop-optimizations) in llama-graph.cpp, indicating potential Undefined Behavior (UB) in the computation graph logic. While newer architectures (Ampere/Ada) seem unaffected, Turing cards fail consistently.

2. Environment

OS: Linux (Ubuntu/Debian based)
GPU: NVIDIA GeForce RTX 2080 Ti (22GB VRAM, Modified)
Architecture: Turing (Compute Capability 7.5)
Driver Version: [Insert your driver version, e.g., 535.xx or 550.xx]
Ollama Version: Latest source build (post-v0.17.5) / v0.17.5 binary
Model: qwen3.5:9b (GGUF)
Compiler: GCC (version [e.g., 11.4.0])

3. Symptoms

Driver Crash: Upon initiating inference (often after the first token or during KV cache expansion), the GPU drops off the bus. dmesg logs show: NVRM: Xid (PCI:0000:xx:xx.x): 43, pid=xxxx, Ch 00, [...] or Xid 31. The system often requires a hard reboot; nvidia-smi fails to respond.
Silent Termination: In some cases, the ollama_llama_server process dies without a clear error message in the application log, just before the driver reset.
Compilation Warning: Building from source reveals a critical logic flaw warning:

github.com/ollama/ollama/llama/llama.cpp/src
llama-graph.cpp: In member function ‘virtual void llm_graph_input_attn_cross::set_input(const llama_ubatch*)’:
llama-graph.cpp:473:9: warning: iteration 2147483645 invokes undefined behavior [-Waggressive-loop-optimizations]
  |         for (int i = n_tokens; i < n_tokens; ++i) {
  |         ^~~
llama-graph.cpp:473:34: note: within this loop
  |         for (int i = n_tokens; i < n_tokens; ++i) {
  |                                ~~^~~~~~~~~~

4. Steps to Reproduce

Install Ollama on a machine with an RTX 2080 Ti (Turing).
Pull the model: ollama pull qwen3.5:9b.
Run a simple generation: ollama run qwen3.5:9b "Hello".
Observe the system hang, driver reset, or process crash.
(Optional) Compile from source to see the llama-graph.cpp warning.

5. Technical Analysis & Hypothesis

The Loop Logic: The code for (int i = n_tokens; i < n_tokens; ++i) is logically a no-op, because its condition is false on entry. However, the GCC warning about "iteration 2147483645" suggests the compiler detects a path where integer overflow or aggressive optimization leads to Undefined Behavior.
Impact on Turing: In C++, UB can cause the compiler to generate optimized machine code that behaves unpredictably. It appears that Turing GPUs (or the specific CUDA kernel generation for CC 7.5) are extremely sensitive to this malformed control flow or the resulting memory layout, leading to illegal memory access or invalid kernel launches. This link between the host-side warning and the GPU fault is a hypothesis and has not been confirmed.
Qwen3.5 Specifics: This model uses new attention mechanisms (Hybrid/MROPE). The llm_graph_input_attn_cross class is likely heavily utilized. If the graph construction is flawed due to this UB, the resulting CUDA graph sent to the Turing GPU may contain invalid instructions, causing the Xid 43 error. (NVIDIA's Xid documentation lists Xid 43 as "GPU stopped processing" and Xid 31 as a GPU memory page fault; "GPU has fallen off the bus" is Xid 79.)
Why not Ampere?: Newer architectures might have more robust error handling, or the specific instruction sequence generated by the optimizer happens to be "safe enough" on CC 8.0+, masking the underlying bug.

6. Expected Behavior

The model should run stably on Turing GPUs, utilizing the available 22GB VRAM. No compiler warnings regarding undefined behavior should exist in critical graph construction paths.

7. Suggested Fix

Immediate Code Fix: Inspect and correct line 473 in llama-graph.cpp. If the loop is intended to be empty, remove it entirely or wrap it in an explicit if (false) block to prevent compiler misinterpretation.

// Current problematic code:
// for (int i = n_tokens; i < n_tokens; ++i) { ... }

// Proposed fix:
// Remove the loop if it serves no purpose, or fix the logic if it was meant to iterate.

Turing-Specific Testing: Add CI tests or manual verification steps specifically for Compute Capability 7.5 (Turing) when running Qwen3.5 series models.
Kernel Validation: Ensure that the computed graph splits and memory offsets do not exceed 32-bit integer limits or align poorly on older architectures.

8. Logs

Journalctl / Ollama Log Snippet (before crash):

Mar 08 19:00:30 aiserver ollama[6612]: level=DEBUG source=ggml.go:852 msg="compute graph" nodes=16775 splits=4
Mar 08 19:00:30 aiserver ollama[6612]: level=INFO source=ggml.go:494 msg="offloaded 33/33 layers to GPU"
Mar 08 19:00:33 aiserver ollama[6612]: level=INFO source=server.go:1388 msg="llama runner started in 5.53 seconds"
... Log cuts off abruptly or followed by Xid error in dmesg ...

dmesg Error:

NVRM: Xid (PCI:0000:09:00.0): 43, pid=XXXX, Ch 00, [XXX]

Relevant log output

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.17.0

extent analysis

Fix Plan

To resolve the issue, we need to address the loop that the compiler flags as potential undefined behavior in llama-graph.cpp. The problematic code is:

for (int i = n_tokens; i < n_tokens; ++i) { ... }

This loop is logically a no-op, but the compiler warning suggests that it may cause undefined behavior.

Step-by-Step Solution

Remove the loop: If the loop serves no purpose, remove it entirely.

// Remove the following line
// for (int i = n_tokens; i < n_tokens; ++i) { ... }

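Fix the logic: If the loop was meant to iterate, correct its bounds so that it covers the intended range and cannot cause undefined behavior. The bounds below are only an example; the intended range has to be confirmed against the upstream code before changing it.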

// Example: fix the loop condition
for (int i = 0; i < n_tokens; ++i) { ... }

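Add a check: Guard the loop so that it runs only when n_tokens is positive. The loop condition already covers this case, so the guard does not change behavior; it only makes the intent explicit.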

// Example: add a check
if (n_tokens > 0) {
    for (int i = 0; i < n_tokens; ++i) { ... }
}

Verify the fix: Compile the code and run the Qwen3.5 model to verify that the issue is resolved.

Code Example

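If the loop was meant to iterate over all tokens, the corrected code could look like this:

// llama-graph.cpp
void llm_graph_input_attn_cross::set_input(const llama_ubatch* input) {
    // ... (n_tokens is assumed to have been read from the batch above)
    if (n_tokens > 0) {  // explicit guard; the loop condition also covers this
        for (int i = 0; i < n_tokens; ++i) {
            // ... per-token work
        }
    }
    // ...
}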

Verification

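To verify that the fix worked, rebuild from source and confirm that the -Waggressive-loop-optimizations warning for llama-graph.cpp no longer appears. Then run the Qwen3.5 model (for example, ollama run qwen3.5:9b "Hello") on the Turing GPU and check dmesg for Xid 43 or Xid 31 entries. If the crashes persist after the warning is gone, the loop was not the cause of the GPU fault.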

Extra Tips

Always check for compiler warnings and address them promptly to prevent undefined behavior.
Use tools like gcc -Wall -Wextra to enable additional warnings and catch potential issues early.
Test your code thoroughly on different architectures and platforms to ensure compatibility and stability.
