ollama - 💡(How to fix) Fix nemotron-cascade-2 not working in parallel [2 comments, 2 participants]

GitHub stats
ollama/ollama#15017 · Fetched 2026-04-08 01:17:14
Comments: 2 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×2, closed ×1, labeled ×1

Error Message

time=2026-03-22T23:50:53.988Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-03-22T23:50:54.084Z level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=nemotron_h_moe


What is the issue?

When using Docker Compose with the following environment variables, nemotron-cascade-2 still processes requests serially rather than in parallel. I have tested gpt-oss:20b and it works fine in parallel, so the problem appears to be specific to nemotron-cascade-2.

    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - OLLAMA_KEEP_ALIVE=99999999m
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KV_CACHE_TYPE=f16 # q8_0
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_HOST=0.0.0.0:11434

Relevant log output

time=2026-03-22T23:50:53.437Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 42335"
time=2026-03-22T23:50:53.988Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-03-22T23:50:54.084Z level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=nemotron_h_moe
time=2026-03-22T23:50:54.183Z level=INFO source=server.go:246 msg="enabling flash attention"
time=2026-03-22T23:50:54.183Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-9e0c827cfd6a6d000032be3da3d0914668b0c1112977e927186d29c4487466c4 --port 37627"
time=2026-03-22T23:50:54.184Z level=INFO source=sched.go:484 msg="system memory" total="62.7 GiB" free="33.1 GiB" free_swap="7.2 GiB"
time=2026-03-22T23:50:54.184Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 library=CUDA available="23.1 GiB" free="23.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-22T23:50:54.184Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6 library=CUDA available="22.9 GiB" free="23.3 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-22T23:50:54.184Z level=INFO source=server.go:757 msg="loading model" "model layers"=53 requested=-1
time=2026-03-22T23:50:54.207Z level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-03-22T23:50:54.207Z level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:37627"
time=2026-03-22T23:50:54.217Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType:f16 NumThreads:16 GPULayers:53[ID:GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 Layers:53(0..52)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-22T23:50:54.276Z level=INFO source=ggml.go:136 msg="" architecture=nemotron_h_moe file_type=Q4_K_M name="" description="" num_tensors=401 num_key_values=45
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2026-03-22T23:50:54.522Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-03-22T23:50:55.729Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType:f16 NumThreads:16 GPULayers:53[ID:GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 Layers:27(0..26) ID:GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6 Layers:26(27..52)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-22T23:50:56.718Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType:f16 NumThreads:16 GPULayers:53[ID:GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 Layers:27(0..26) ID:GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6 Layers:26(27..52)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-22T23:50:57.891Z level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType:f16 NumThreads:16 GPULayers:53[ID:GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 Layers:27(0..26) ID:GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6 Layers:26(27..52)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-22T23:50:57.891Z level=INFO source=ggml.go:482 msg="offloading 52 repeating layers to GPU"
time=2026-03-22T23:50:57.891Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-03-22T23:50:57.891Z level=INFO source=ggml.go:494 msg="offloaded 53/53 layers to GPU"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="10.8 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="11.5 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:245 msg="model weights" device=CPU size="231.0 MiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.6 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="1.1 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="852.5 MiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="418.0 MiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.2 MiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:272 msg="total memory" size="26.5 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-03-22T23:50:57.891Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-03-22T23:50:57.892Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-03-22T23:51:00.654Z level=INFO source=server.go:1388 msg="llama runner started in 6.47 seconds"

OS: Docker
GPU: Nvidia
CPU: Intel
Ollama version: 0.18.2

extent analysis

Fix Plan

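The environment variables are not the problem: OLLAMA_NUM_PARALLEL=4 is already set, yet every load request in the log shows Parallel:1. The scheduler warning (sched.go:423, "model architecture does not currently support parallel requests", architecture=nemotron_h_moe) indicates that Ollama 0.18.2 overrides the requested value for this architecture, so no compose setting will enable parallel requests for nemotron-cascade-2 on this version.

  • OLLAMA_NUM_PARALLEL=4 can stay as is; it still applies to models whose architecture supports parallel requests, such as gpt-oss:20b, which the reporter confirmed works in parallel.
  • CUDA_VISIBLE_DEVICES=0,1 is already effective: the log finds both RTX 3090s and splits the 53 layers across them (27 and 26).
  • OLLAMA_KEEP_ALIVE controls how long the model stays loaded and has no bearing on parallelism.

The reporter's environment, for reference: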

environment:
  - CUDA_VISIBLE_DEVICES=0,1
  - OLLAMA_KEEP_ALIVE=99999999m
  - OLLAMA_FLASH_ATTENTION=1
  - OLLAMA_KV_CACHE_TYPE=f16
  - OLLAMA_NUM_PARALLEL=4
  - OLLAMA_HOST=0.0.0.0:11434

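The log already answers whether nemotron-cascade-2 supports parallel requests: on Ollama 0.18.2 it does not. Until a release adds parallel support for the nemotron_h_moe architecture, the practical options are to accept serial processing for this model or to use a model that supports parallel requests.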
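The scheduler's decision can be read straight from the server log. The following is a minimal sketch, not part of Ollama, that scans log lines for the parallel-requests warning and for the Parallel:N field of the runner load requests. The function name scan_log and the regular expressions are illustrative assumptions based only on the log format shown in this issue; other Ollama versions may format these lines differently.

```python
import re

# Matches logfmt-style key=value pairs; values are bare tokens or
# double-quoted strings (backslash escapes allowed inside the quotes).
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
# The runner load request embeds the applied slot count as "Parallel:N".
_PARALLEL = re.compile(r'\bParallel:(\d+)')
# Warning text as it appears in the log from this issue.
_WARNING = "does not currently support parallel requests"


def scan_log(lines):
    """Return (architectures that triggered the warning, Parallel values seen)."""
    blocked, parallel = set(), set()
    for line in lines:
        if _WARNING in line:
            fields = dict(_PAIR.findall(line))
            blocked.add(fields.get("architecture", "unknown"))
        match = _PARALLEL.search(line)
        if match:
            parallel.add(int(match.group(1)))
    return blocked, parallel


# Two lines taken from the log in this issue (the load request is shortened).
sample = [
    'time=2026-03-22T23:50:54.084Z level=WARN source=sched.go:423 '
    'msg="model architecture does not currently support parallel requests" '
    'architecture=nemotron_h_moe',
    'time=2026-03-22T23:50:54.217Z level=INFO source=runner.go:1284 '
    'msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512}"',
]
print(scan_log(sample))  # ({'nemotron_h_moe'}, {1})
```

A result with a non-empty first set, or with 1 as the only Parallel value while OLLAMA_NUM_PARALLEL is higher, suggests the scheduler reduced the setting for that model.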

Verification

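To verify the behavior, check the load request lines in the server log: the Parallel field shows the value the scheduler actually applied (Parallel:1 in the log above, despite OLLAMA_NUM_PARALLEL=4), and the sched.go warning names any architecture that was forced to serial processing. GPU utilization from nvidia-smi is a weaker signal, because the model's layers are split across both GPUs whether or not requests run in parallel.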

Example command to check GPU utilization:

nvidia-smi --query-gpu=utilization.gpu --format=csv

Utilization that rises under concurrent load is consistent with parallel processing, but it does not prove it.
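A more direct check is to time the server under concurrent load. The sketch below is a rough heuristic, not an official test: it times one request to Ollama's /api/generate endpoint, then n identical requests sent at once, and compares the two. The server address, model tag, prompt, and the helper names generate, concurrency_speedup, and run_probe are assumptions for illustration. Prompt caching and model warm-up can skew the numbers, so treat the result as indicative only.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed values: adjust the address and model tag to your setup.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "nemotron-cascade-2"


def generate(prompt):
    """Send one non-streaming request with a fixed output length."""
    body = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 64},  # cap output so runs are comparable
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def concurrency_speedup(single_s, batch_s, n):
    """Ratio of the serial estimate (n * single_s) to the measured batch time.

    Close to 1.0 suggests requests were queued; values approaching n
    suggest they ran in parallel.
    """
    return (n * single_s) / batch_s


def run_probe(n=4, prompt="Count from 1 to 50."):
    """Time one request, then n concurrent requests, and return the speedup."""
    start = time.monotonic()
    generate(prompt)
    single_s = time.monotonic() - start

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(generate, [prompt] * n))
    batch_s = time.monotonic() - start
    return concurrency_speedup(single_s, batch_s, n)
```

Calling run_probe(4) against a running server returns the ratio. Running it once for nemotron-cascade-2 and once with MODEL set to gpt-oss:20b gives a side-by-side comparison of the two models the reporter tested.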

Extra Tips

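  • Ensure that the Docker container has access to all available GPUs, for example with the --gpus flag of docker run. In this report both GPUs are already visible to the container, as the ggml_cuda_init lines in the log show.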
  • Monitor system resources (e.g., CPU, memory, and GPU utilization) to ensure that the container is not resource-constrained.
  • Consider updating the Ollama version to the latest available, as newer versions may include bug fixes or performance improvements.
