ollama - 💡(How to fix) Fix Qwen3.5:9b concurrent call BUG [12 comments, 5 participants]

GitHub stats
ollama/ollama#14621 (fetched 2026-04-08 00:33:42)
Comments: 12 · Participants: 5 · Timeline events: 25 · Reactions: 0
Timeline (top): commented ×12, subscribed ×8, mentioned ×3, labeled ×1

Error Message

level=WARN source=sched.go:450 msg="model architecture does not currently support parallel requests" architecture=qwen35

Code Example

#### 1. The Architecture Warning (Parallelism Blocked):

level=WARN source=sched.go:450 msg="model architecture does not currently support parallel requests" architecture=qwen35
level=INFO source=runner.go:1302 msg=load request="{Operation:fit ... Parallel:1 ... FlashAttention:Enabled KvSize:4096 ...}"

#### 2. The Backend Crash (SIGABRT):

SIGABRT: abort
PC=0xfe7256047608 m=11 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 20 [syscall]:
runtime.cgocall(...)
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_reserve(0xfe71e10b9950, 0xfe6f5eceda10)
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0x400043c100)
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0x400024f0e0, 0x1)

#### 3. Hardware Environment:

level=INFO source=types.go:42 msg="inference compute" id=GPU-... library=CUDA compute=12.1 name=CUDA0 description="NVIDIA GB10" libdirs=ollama,cuda_v13 driver=13.0 pci_id=000f:01:00.0 type=iGPU total="119.7 GiB" available="61.4 GiB"

What is the issue?

Summary:

Despite having 128GB of Unified Memory on an ARM-based NVIDIA DGX Spark (GB10) and setting OLLAMA_NUM_PARALLEL correctly, Ollama (v0.17.6) fails to handle concurrent requests for the qwen3.5 architecture.

Details:

  1. Parallelism Downgraded: The server logs show the warning "model architecture does not currently support parallel requests" for architecture=qwen35. The scheduler then overrides the OLLAMA_NUM_PARALLEL environment variable and forces Parallel:1.

  2. Crash (SIGABRT): When attempting to force concurrent calls or during the model loading phase for parallel execution, the runner crashes with a SIGABRT during ggml_backend_sched_reserve.

  3. Platform Specificity: This issue occurs on the NVIDIA DGX Spark (ARM64). The device is detected as iGPU with 119.7 GiB VRAM. Similar configurations work on Apple Silicon (macOS) but fail here, suggesting a backend/scheduling bug in the Linux-ARM64 CUDA runner for this specific architecture.

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.17.6

extent analysis

Fix Plan

To address Ollama failing to handle concurrent requests for the qwen35 architecture on an ARM-based NVIDIA DGX Spark, three steps are worth trying:

  • Update Ollama to the latest release; the reported version (0.17.6) may have known parallelism issues on specific architectures.
  • Investigate the reservation path that crashes. The stack trace shows reserveWorstCaseGraph calling (*Context).Reserve, which aborts inside the cgo call to ggml_backend_sched_reserve. That path needs to handle the qwen35 architecture.
  • Confirm that the CUDA runner is correctly configured for the Linux-ARM64 platform (the logs show libdirs=ollama,cuda_v13 and driver=13.0).

Code Changes

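The crash surfaces in the C function ggml_backend_sched_reserve, which the stack trace shows being called from the Go wrapper (*Context).Reserve. The sketch below illustrates where an architecture-specific guard could sit in that wrapper. It is not the actual Ollama code: the Graph parameter, its Architecture field, the parallelism field, and reserveQwen35Graph are hypothetical names.

// Hypothetical sketch for package github.com/ollama/ollama/ml/backend/ggml
// (the package named in the stack trace). Graph, Graph.Architecture,
// Context.parallelism, and reserveQwen35Graph are illustrative names only.
func (c *Context) Reserve(graph *Graph) error {
    // Route the qwen35 architecture to a dedicated reservation path.
    if graph.Architecture == "qwen35" {
        return c.reserveQwen35Graph(graph)
    }
    // ... existing reservation logic, unchanged ...
    return nil
}

// reserveQwen35Graph reserves the graph for a single sequence, matching the
// Parallel:1 value the scheduler already forces for this architecture.
func (c *Context) reserveQwen35Graph(graph *Graph) error {
    c.parallelism = 1
    // ... reserve the worst-case graph for one sequence ...
    return nil
}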

Configuration Changes

Set OLLAMA_NUM_PARALLEL to a value the qwen35 architecture supports. According to the scheduler warning, that is currently 1: a higher value is overridden to Parallel:1 when the model loads.

Temporary Workaround

If updating Ollama or modifying the code is not feasible, a temporary workaround is to set OLLAMA_NUM_PARALLEL to 1 explicitly. This may prevent the crash, but it also disables parallelism.
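One way to apply this workaround without touching the shell profile is to start the server from a small launcher that pins the variable. The following is a minimal sketch under stated assumptions: it assumes an `ollama` binary on `PATH`, and `withEnv` and `newServeCmd` are hypothetical helper names. It defaults to a dry run so that it is safe to execute; pass `-run` to actually start `ollama serve`.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// withEnv returns a copy of env with key set to value, dropping any
// existing entry for the same key.
func withEnv(env []string, key, value string) []string {
	prefix := key + "="
	out := make([]string, 0, len(env)+1)
	for _, kv := range env {
		if !strings.HasPrefix(kv, prefix) {
			out = append(out, kv)
		}
	}
	return append(out, prefix+value)
}

// newServeCmd builds an `ollama serve` command with OLLAMA_NUM_PARALLEL
// pinned to 1, inheriting every other variable from env.
func newServeCmd(env []string) *exec.Cmd {
	cmd := exec.Command("ollama", "serve")
	cmd.Env = withEnv(env, "OLLAMA_NUM_PARALLEL", "1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	run := flag.Bool("run", false, "start the server instead of printing the command")
	flag.Parse()

	cmd := newServeCmd(os.Environ())
	if !*run {
		// Dry run: show the command and the variable that would be set.
		fmt.Println("would run:", strings.Join(cmd.Args, " "), "with", cmd.Env[len(cmd.Env)-1])
		return
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "ollama serve exited:", err)
		os.Exit(1)
	}
}
```

Replacing any inherited OLLAMA_NUM_PARALLEL entry, rather than appending a second one, avoids relying on how duplicate variables are resolved.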

Verification

To verify the fix, run the Ollama server with the modified code and configuration, then send concurrent requests to a qwen35 model. In the server logs, check that the sched.go warning about parallel requests no longer appears, that the load request reports the expected Parallel value, and that the runner no longer aborts with SIGABRT in ggml_backend_sched_reserve.
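A small client can generate the concurrent load for this check. The following is a minimal sketch, assuming a server on Ollama's default address http://localhost:11434 and the model tag qwen3.5:9b taken from the issue title; `generateOnce` and `fireConcurrent` are hypothetical helper names. It posts non-streaming requests to the `/api/generate` endpoint in parallel and records status, latency, and any error per request.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// result records the outcome of one request.
type result struct {
	Status  int
	Elapsed time.Duration
	Err     error
}

// generateOnce sends a single non-streaming request to /api/generate.
func generateOnce(client *http.Client, baseURL, model, prompt string) result {
	body, err := json.Marshal(map[string]any{
		"model":  model,
		"prompt": prompt,
		"stream": false,
	})
	if err != nil {
		return result{Err: err}
	}
	start := time.Now()
	resp, err := client.Post(baseURL+"/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		return result{Elapsed: time.Since(start), Err: err}
	}
	defer resp.Body.Close()
	// Drain the body so the elapsed time covers the full response.
	_, err = io.Copy(io.Discard, resp.Body)
	return result{Status: resp.StatusCode, Elapsed: time.Since(start), Err: err}
}

// fireConcurrent issues n requests at the same time and waits for all of them.
func fireConcurrent(baseURL, model string, n int) []result {
	client := &http.Client{Timeout: 5 * time.Minute}
	results := make([]result, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			results[i] = generateOnce(client, baseURL, model, fmt.Sprintf("Count to %d.", i+1))
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	for i, r := range fireConcurrent("http://localhost:11434", "qwen3.5:9b", 4) {
		fmt.Printf("request %d: status=%d elapsed=%s err=%v\n",
			i, r.Status, r.Elapsed.Round(time.Millisecond), r.Err)
	}
}
```

If requests are still being serialized, the elapsed times will tend to climb in steps from one request to the next. Connection errors or 5xx statuses during the run are a hint to look for the SIGABRT trace in the server log.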
