ollama - 💡(How to fix) Fix Qwen3.5 (GGUF via Ollama 0.18.0) causes reproducible GPU driver instability under moderate context load on RTX 3090 (WSL2) [3 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15024Fetched 2026-04-08 01:21:52
View on GitHub
Comments
3
Participants
3
Timeline
5
Reactions
0
Author
Timeline (top)
commented ×3labeled ×2

Root Cause

Eliminated as root causes:

Fix Action

Fix / Workaround

Workaround

RAW_BUFFERClick to expand / collapse

What is the issue?

Summary

Running Qwen3.5 (Q4_K_M GGUF via Ollama 0.18.0) on an RTX 3090 (24 GB) in WSL2 (Ubuntu 24.04) on Windows 11 causes reproducible GPU driver instability when the model processes moderately populated input (~12k characters).

The issue persists after:

clean driver reinstall (DDU)

CMOS reset

fresh Ollama state (~/.ollama reset)

model re-download

The instability escalates beyond application failure and can corrupt GPU initialization state, leading to:

black screen (no POST)

required CMOS reset + GPU reseat to recover


Environment

GPU: NVIDIA RTX 3090 (24 GB VRAM)

Driver: 595.79 (Studio, clean install via DDU)

OS: Windows 11 + WSL2

WSL distro: Ubuntu 24.04

Ollama: 0.18.0

Model: qwen3.5:latest (GGUF, Q4_K_M, ~9.7B)

Control model: llama3.2:3b (stable)


Reproduction (minimal)

  1. Start Ollama (WSL)

ollama serve

  1. Confirm model works with tiny prompt

ollama run qwen3.5:latest "Reply with exactly: ok"

→ Works reliably


  1. Populate moderate input (~12k chars)

head -c 12000 filler.txt | ollama run qwen3.5:latest

→ Observed:

GPU instability

driver crash (nvlddmkm Event ID 14 / 153)

in some cases system-level failure


Observations

Tiny prompts: stable

Declared large context (e.g., 32k) with tiny input: stable

Moderate populated input (~12k): triggers crash

llama3.2:3b under same conditions: stable

CUDA container benchmarks: stable


Isolation Performed

Eliminated as root causes:

❌ OpenClaw (crash reproduced in Ollama directly)

❌ Corrupted model weights (re-pulled)

❌ Persistent Ollama user state (~/.ollama reset)

❌ Windows vs WSL confusion (confirmed WSL server)

❌ Short-lived cache/session persistence (crashes after reboot)

❌ General GPU instability (other workloads stable)


Critical Finding

The failure is triggered by actual context population, not declared context size.

Model loads fine

Model runs small prompts fine

Model crashes when KV-cache / attention memory expands under real input


Escalation Behavior

Repeated triggering led to:

GPU driver crashes (TDR)

eventual failure to initialize GPU at boot

required:

GPU removal

CMOS reset

reseating


Working Hypothesis

Likely issue in:

Qwen3.5 inference path under GGUF/Ollama runtime, specifically during KV-cache expansion / attention memory allocation

Possible contributing factors:

CUDA kernel behavior under memory pressure

KV-cache shifting / reprocessing logic

interaction with WSL GPU virtualization layer


Workaround

Stable operation achieved by:

FROM qwen3.5:latest PARAMETER num_ctx 8192 PARAMETER num_batch 32 PARAMETER num_gpu 1

Plus:

avoiding large single-shot prompt ingestion

using incremental context accumulation


Questions for Maintainers

  1. Is Qwen3.5 GGUF known to have instability under KV-cache expansion in Ollama?

  2. Are there known issues with:

KV-cache shifting

large prompt ingestion

CUDA kernels in Qwen models?

  1. Is there a recommended safe configuration for:

context size vs VRAM

batch size

memory strategy?

Impact

This is not just an application crash:

The workload can escalate into GPU driver failure and hardware initialization instability

Reproducibility

✔ Reproducible after full system reset ✔ Independent of user-level state ✔ Specific to Qwen3.5 (not observed with Llama)

Relevant log output

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

Fix Plan

To address the GPU driver instability issue with Qwen3.5 in Ollama, follow these steps:

  • Reduce context size: Limit the num_ctx parameter to 8192 to prevent excessive KV-cache expansion.
  • Batch size adjustment: Set num_batch to 32 to balance memory usage and processing efficiency.
  • GPU allocation: Ensure num_gpu is set to 1 to prevent over-allocation of GPU resources.
  • Incremental context accumulation: Instead of ingesting large prompts at once, use incremental context accumulation to reduce memory pressure.

Example configuration:

FROM qwen3.5:latest
PARAMETER num_ctx 8192
PARAMETER num_batch 32
PARAMETER num_gpu 1

Additionally, consider the following code snippet to implement incremental context accumulation:

def incremental_context_accumulation(prompt, max_ctx=8192):
    chunks = [prompt[i:i+max_ctx] for i in range(0, len(prompt), max_ctx)]
    for chunk in chunks:
        # Process chunk with Qwen3.5 model
        output = ollama_run(qwen3.5:latest, chunk)
        # Accumulate output
        accumulated_output += output
    return accumulated_output

Verification

To verify the fix, run the modified configuration and incremental context accumulation code with a moderately populated input (~12k characters). Monitor the system for GPU driver instability and crashes. If the issue persists, further tuning of the num_ctx, num_batch, and num_gpu parameters may be necessary.

Extra Tips

  • Regularly update the Ollama and Qwen3.5 models to ensure you have the latest stability fixes.
  • Consider using CUDA container benchmarks to monitor GPU stability and performance.
  • Be cautious when increasing context size or batch size, as this can lead to GPU driver instability and crashes.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING