ollama - 💡(How to fix) Fix Qwen3.5:9b concurrent call BUG [12 comments, 5 participants]

ollama · 2026-03-04 19:20:27

GitHub stats

ollama/ollama#14621 • Fetched 2026-04-08 00:33:42

Timeline (top): commented ×12, subscribed ×8, mentioned ×3, labeled ×1

Error Message

level=WARN source=sched.go:450 msg="model architecture does not currently support parallel requests" architecture=qwen35

What is the issue?

Summary:

Despite having 128GB of Unified Memory on an ARM-based NVIDIA DGX Spark (GB10) and setting OLLAMA_NUM_PARALLEL correctly, Ollama (v0.17.6) fails to handle concurrent requests for the qwen3.5 architecture.

Details:

Parallelism downgraded: The server logs the warning "model architecture does not currently support parallel requests" for architecture=qwen35. It overrides the environment variable and forces Parallel:1.
Crash (SIGABRT): When concurrent calls are forced, or during the model loading phase for parallel execution, the runner crashes with a SIGABRT inside ggml_backend_sched_reserve.
Platform specificity: The issue occurs on the NVIDIA DGX Spark (ARM64), where the device is detected as an iGPU with 119.7 GiB of total memory. Similar configurations work on Apple Silicon (macOS) but fail here, suggesting a backend or scheduling bug in the Linux-ARM64 CUDA runner for this specific architecture.
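To confirm from a log file that the downgrade described above is what the server is doing, the warning line can be matched mechanically. The following is a minimal sketch, assuming the log format quoted in this report; the function name parallelDowngrade is hypothetical and not part of Ollama.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// warningText is the scheduler message quoted in this report's server log.
const warningText = "model architecture does not currently support parallel requests"

// archField matches the architecture=<name> key/value pair on a log line.
var archField = regexp.MustCompile(`architecture=(\S+)`)

// parallelDowngrade (hypothetical helper) reports whether a log line carries
// the scheduler warning and, if so, which architecture it names.
func parallelDowngrade(line string) (arch string, ok bool) {
	if !strings.Contains(line, warningText) {
		return "", false
	}
	m := archField.FindStringSubmatch(line)
	if m == nil {
		return "", true // warning present, but no architecture field found
	}
	return m[1], true
}

func main() {
	line := `level=WARN source=sched.go:450 msg="model architecture does not currently support parallel requests" architecture=qwen35`
	if arch, ok := parallelDowngrade(line); ok {
		fmt.Printf("parallel requests disabled for architecture %q\n", arch)
	}
}
```

Feeding each line of the server log through this helper shows whether the Parallel:1 override applies to the loaded model before any concurrent load is attempted.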

Relevant log output

#### 1. The Architecture Warning (Parallelism Blocked):

level=WARN source=sched.go:450 msg="model architecture does not currently support parallel requests" architecture=qwen35
level=INFO source=runner.go:1302 msg=load request="{Operation:fit ... Parallel:1 ... FlashAttention:Enabled KvSize:4096 ...}"

#### 2. The Backend Crash (SIGABRT):

SIGABRT: abort
PC=0xfe7256047608 m=11 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 20 [syscall]:
runtime.cgocall(...)
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_reserve(0xfe71e10b9950, 0xfe6f5eceda10)
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0x400043c100)
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0x400024f0e0, 0x1)

#### 3. Hardware Environment:

level=INFO source=types.go:42 msg="inference compute" id=GPU-... library=CUDA compute=12.1 name=CUDA0 description="NVIDIA GB10" libdirs=ollama,cuda_v13 driver=13.0 pci_id=000f:01:00.0 type=iGPU total="119.7 GiB" available="61.4 GiB"

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.17.6

extent analysis

Fix Plan

To address Ollama's failure to handle concurrent requests for the qwen3.5 architecture on an ARM-based NVIDIA DGX Spark, the following steps are suggested:

Update Ollama to the latest release, since the reported version (0.17.6) may have known parallelism limitations for specific architectures.
Investigate the Go wrapper around ggml_backend_sched_reserve (Context.Reserve in the stack trace), where the SIGABRT is raised, and how it handles the qwen35 architecture.
Confirm that the CUDA runner is correctly configured for the Linux-ARM64 platform.

Code Changes

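ggml_backend_sched_reserve itself is a C function reached through cgo, so a change would more plausibly go in the Go wrapper that calls it. The following is only a sketch of such a special case:

// Hypothetical sketch: Graph.Architecture, the parallelism field, the
// Reserve(graph) signature, and reserveQwen35Graph are illustrative names,
// not Ollama's actual API. Per the stack trace, the real wrapper lives in
// package github.com/ollama/ollama/ml/backend/ggml.
package sketch

// Graph and Context are stand-ins so the sketch compiles on its own.
type Graph struct{ Architecture string }

type Context struct{ parallelism int }

func (c *Context) Reserve(graph *Graph) error {
    // Route the qwen35 architecture to a dedicated reservation path.
    if graph.Architecture == "qwen35" {
        return c.reserveQwen35Graph(graph)
    }
    // ... the existing reservation logic would remain here
    return nil
}

func (c *Context) reserveQwen35Graph(graph *Graph) error {
    // Reserve for a single sequence, matching the Parallel:1 value the
    // scheduler already forces for this architecture.
    c.parallelism = 1
    // ... the actual reservation would follow
    return nil
}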

Configuration Changes

Set the OLLAMA_NUM_PARALLEL environment variable to a value the qwen3.5 architecture supports. According to the scheduler warning, that is currently 1; higher values are overridden.

Temporary Workaround

If updating Ollama or modifying the code is not feasible, a temporary workaround is to set OLLAMA_NUM_PARALLEL to 1 explicitly. This matches the Parallel:1 value the scheduler already forces for this architecture; it may prevent the crash, but it also disables parallelism.
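As an illustration of the workaround, the sketch below prepares an `ollama serve` process with OLLAMA_NUM_PARALLEL pinned to 1. The helper withNumParallel is a hypothetical name introduced here; whether this setting avoids the SIGABRT on the GB10 is not confirmed by the report.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

const parallelVar = "OLLAMA_NUM_PARALLEL"

// withNumParallel (hypothetical helper) returns a copy of env in which
// OLLAMA_NUM_PARALLEL is set to n, replacing any existing entry.
func withNumParallel(env []string, n int) []string {
	out := make([]string, 0, len(env)+1)
	for _, kv := range env {
		if !strings.HasPrefix(kv, parallelVar+"=") {
			out = append(out, kv)
		}
	}
	return append(out, parallelVar+"="+strconv.Itoa(n))
}

func main() {
	cmd := exec.Command("ollama", "serve")
	cmd.Env = withNumParallel(os.Environ(), 1)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr

	// The server is not started here; this only prints what would be run.
	fmt.Println("would start:", strings.Join(cmd.Args, " "), "with", cmd.Env[len(cmd.Env)-1])
	// Calling cmd.Run() at this point would start the server and block
	// until it exits.
}
```

Replacing an existing entry rather than appending a second one avoids relying on how duplicate environment variables are resolved.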

Verification

To verify the fix, run the Ollama server with the modified code and configuration, then send concurrent requests to a qwen3.5 model. Check the server logs for the "does not currently support parallel requests" warning and for any SIGABRT in ggml_backend_sched_reserve.
