ollama: Fix concurrent processing with Qwen 3.5 family models

GitHub stats
ollama/ollama#14879 · Fetched 2026-04-08 00:48:10
Comments: 0 · Participants: 1 · Timeline: 2 · Reactions: 0
Timeline (top): closed ×1, labeled ×1

What is the issue?

Summary

Qwen 3.5 models appear to ignore or bypass OLLAMA_NUM_PARALLEL, resulting in effectively single-request inference even when parallelism is configured and hardware resources are available.

Other models (e.g., Llama family) run concurrently under the same configuration.

Environment

Hardware: Apple Mac Studio (M3 Ultra)

RAM:

Test system (Colossus): 128 GB

Production systems (Hal / Sal): 512 GB

Backend: Ollama (Metal acceleration)

Frontend: Open WebUI

Ollama configuration:

OLLAMA_NUM_PARALLEL=8

Environment variable confirmed active in the running process.

Observed Behavior

When running Qwen 3.5 models (tested with both qwen3.5:35b and qwen3.5:122b):

Requests appear to serialize rather than execute concurrently

Additional requests wait until the active generation completes

Effective parallelism behaves as 1

This occurs even though:

sufficient RAM is available

the model is fully loaded in GPU memory

OLLAMA_NUM_PARALLEL is set and confirmed active.

Control Test

Under the same system and configuration, Llama models behave as expected:

Multiple requests generate simultaneously

Concurrency matches OLLAMA_NUM_PARALLEL

Token streaming occurs from multiple requests at once.

This suggests the issue is specific to the Qwen runner or architecture handling in Ollama.

Reproduction Example

Concurrent requests using the Ollama API:

curl http://localhost:11434/api/generate ...
curl http://localhost:11434/api/generate ...

Expected:

Both requests generate tokens simultaneously.

Observed with Qwen:

second request waits

generation starts only after first request finishes or partially completes.
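
The reproduction above can be scripted so that overlap is measured rather than judged by eye. What follows is a minimal sketch, not the reporter's actual test harness. It assumes the streaming response of /api/generate arrives as one JSON object per line with a `done` flag. The helper names `parse_chunk`, `stream_once`, `probe`, and `looks_serialized` are hypothetical. Nothing runs at import time; the model tag and prompt are supplied by the caller.

```python
import json
import threading
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint from the reproduction


def parse_chunk(line: bytes) -> tuple[str, bool]:
    """Decode one streamed JSON line into (text, done)."""
    obj = json.loads(line)
    return obj.get("response", ""), bool(obj.get("done", False))


def stream_once(model: str, prompt: str, result: dict) -> None:
    """Send one streaming request; record when the first chunk and the end arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            if not line.strip():
                continue
            # The first non-empty line marks the start of generation for this request
            result.setdefault("first_chunk", time.monotonic())
            _, done = parse_chunk(line)
            if done:
                result["finished"] = time.monotonic()
                break


def probe(model: str, prompt: str, n: int = 2) -> list[dict]:
    """Fire n requests at the same time; return one timing dict per request."""
    results = [{} for _ in range(n)]
    threads = [
        threading.Thread(target=stream_once, args=(model, prompt, r)) for r in results
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def looks_serialized(results: list[dict]) -> bool:
    """True if some request only started streaming after another had finished."""
    complete = [r for r in results if "first_chunk" in r and "finished" in r]
    return any(
        a["first_chunk"] >= b["finished"]
        for a in complete
        for b in complete
        if a is not b
    )
```

Calling `looks_serialized(probe("qwen3.5:35b", "Write a haiku."))` and then repeating the call with a Llama model tag gives a like-for-like comparison under the same OLLAMA_NUM_PARALLEL setting. Use a prompt long enough that the first generation is still running when the second request arrives.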

Additional Observations

Token generation speed and TTFT are excellent for Qwen models.

GPU layers fully offload to Metal.

Memory pressure remains low.

The issue appears to be scheduling / concurrency, not performance.

Expected Behavior

Qwen 3.5 models should respect the configured parallelism:

OLLAMA_NUM_PARALLEL > 1

and allow multiple concurrent inference streams similar to Llama models.

Impact

This prevents Qwen models from being used effectively in multi-user inference environments, even when hardware capacity exists.

Relevant log output

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

Fix Plan

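The report states that OLLAMA_NUM_PARALLEL is set and confirmed active, and that Llama models run concurrently under the same configuration, so a configuration mistake is unlikely to be the cause. The remaining candidate is how Ollama schedules or runs the Qwen 3.5 architecture. The plan below therefore re-checks the configuration first and then looks at the runner.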

Step-by-Step Solution

  1. Verify Ollama Configuration: Confirm that OLLAMA_NUM_PARALLEL is set in the environment of the running Ollama server process, not only in an interactive shell.
  2. Update Qwen Model Runner: If the configuration is correct, the Qwen model runner or the scheduler would need to honor OLLAMA_NUM_PARALLEL for this architecture. The issue does not identify where the limit is applied, so this step is a hypothesis rather than a confirmed fix.
  3. Implement Concurrent Inference: The runner's inference logic would need to serve multiple requests at once. Ollama itself is written in Go, so the Python example below, built on concurrent.futures, only illustrates the bounded worker-pool pattern and is not a patch to Ollama.

Example Code (Python)

import concurrent.futures
import os

def generate_tokens(request):
    # Placeholder: token generation logic for one request goes here
    pass

def handle_requests(concurrency_level, requests):
    # Run at most concurrency_level requests at once on a bounded thread pool
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency_level) as executor:
        futures = {executor.submit(generate_tokens, request): request for request in requests}
        for future in concurrent.futures.as_completed(futures):
            request = futures[future]
            try:
                future.result()
            except Exception as e:
                # Report the failure instead of silently discarding it
                print(f"request {request!r} failed: {e}")

# Fall back to 1 when the variable is not set
concurrency_level = int(os.environ.get('OLLAMA_NUM_PARALLEL', '1'))
requests = [...]  # List of incoming requests
handle_requests(concurrency_level, requests)

Verification

To verify that the fix worked, you can test the Qwen model with multiple concurrent requests using the Ollama API, as described in the reproduction example. Both requests should generate tokens simultaneously, and the effective parallelism should match the configured OLLAMA_NUM_PARALLEL value.
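
To compare "effective parallelism" against the configured value, it helps to compute it from recorded timings. The sketch below assumes you have collected one (start, end) pair per request, in seconds, marking when that request was streaming tokens. The function names and the sample numbers are illustrative, not measurements from the issue.

```python
def peak_concurrency(intervals):
    """Return the largest number of requests streaming at the same instant."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a request begins streaming
        events.append((end, -1))    # a request stops streaming
    # Sort by time; at equal timestamps, ends (-1) are processed before starts (+1)
    events.sort(key=lambda e: (e[0], e[1]))
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


def mean_concurrency(intervals):
    """Total streaming time divided by the wall-clock span of the whole run."""
    if not intervals:
        return 0.0
    wall = max(end for _, end in intervals) - min(start for start, _ in intervals)
    busy = sum(end - start for start, end in intervals)
    return busy / wall if wall > 0 else 0.0


# Illustrative timings only: three requests handled one after another...
serialized = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
# ...versus three requests that stream side by side.
overlapping = [(0.0, 10.0), (0.5, 11.0), (1.0, 12.0)]
```

A peak of 1 with several requests in flight matches the serialized behavior described in the report. A peak close to the number of simultaneous requests, up to OLLAMA_NUM_PARALLEL, matches the behavior reported for Llama models.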

Extra Tips

  • Ensure that the Ollama version and model runner are compatible and support parallelism.
  • Monitor system resources (e.g., CPU, GPU, and memory) to ensure that they are not bottlenecking the concurrent execution of requests.
  • Consider implementing logging and metrics to track the performance and concurrency of the Qwen model runner.
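
For the logging and metrics tip, one lightweight option is a client-side gauge that counts how many requests are in flight. This is a sketch under assumptions: `InFlightGauge` is a hypothetical helper that wraps whatever code issues the requests. It measures what the client has outstanding, not what the Ollama server is actually executing.

```python
import threading
from contextlib import contextmanager


class InFlightGauge:
    """Thread-safe counter of requests currently in flight, with a high-water mark."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    @contextmanager
    def track(self):
        # Count the request on entry and remember the highest value seen
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        try:
            yield
        finally:
            # Always decrement, even if the wrapped request raised
            with self._lock:
                self.current -= 1
```

Wrapping each request in `with gauge.track():` and logging `gauge.peak` at the end of a run shows how many requests were outstanding at once. Comparing that figure with the number of streams producing tokens at the same time indicates whether requests are being queued behind one another.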
