ollama - 💡(How to fix) Fix `/api/generate` returns `response: ""` for one concurrent request when model is loading (`raw: true`) [1 pull requests]

What is the issue?

Send four concurrent requests to /api/generate with raw: true and num_ctx: 2048 against a cold model and one will come back HTTP 200 with response: "" and done: true. No error. No retry. Zero tokens, while the other three succeed.

Expected: All requests queue behind the model load and return non-empty responses.

Actual: One request completes silently with response: "" during initialisation.

Reproducer — requires pip install ollama, runs against a local Ollama server:

#!/usr/bin/env python3
"""
Reproducer: concurrent /api/generate with raw=True against cold model.
Run multiple times — failure rate is ~1-in-8 on Apple Silicon.
"""
import asyncio
import ollama

MODEL = "granite4:micro"  # replace with any available model


async def evict() -> None:
    client = ollama.AsyncClient()
    await client.generate(model=MODEL, prompt="", keep_alive=0)
    print(f"Evicted {MODEL!r}.")


async def generate_one(client: ollama.AsyncClient, idx: int) -> tuple[int, str]:
    resp = await client.generate(
        model=MODEL,
        prompt=f"What is {idx}+{idx}? Answer briefly.",
        stream=False,
        raw=True,
        options={"num_predict": 100, "num_ctx": 2048},
    )
    return idx, resp.response


async def main() -> None:
    await evict()
    print("Firing 4 concurrent requests against cold model...")
    client = ollama.AsyncClient()
    results = await asyncio.gather(*[generate_one(client, i) for i in range(1, 5)])
    any_empty = False
    for idx, r in results:
        if not r:
            print(f"  Request {idx}: EMPTY  ← bug")
            any_empty = True
        else:
            print(f"  Request {idx}: OK — {r[:60]!r}")
    print("\nBug triggered." if any_empty else "\nAll OK (try again — intermittent).")


asyncio.run(main())

Key observation from server log during a failure:

time=...  source=runner.go:895  msg=load  request="...Parallel:1...KvSize:2048..."
time=...  source=server.go:1385 msg="waiting for llama runner to start responding"
time=...  source=sched.go:561   msg="loaded runners" count=1
time=...  source=server.go:1385 msg="waiting for llama runner to start responding"   ← second wait, after runners loaded
[GIN] 200 |  680ms  | POST /api/generate   ← returned BEFORE model was generating
[GIN] 200 |  701ms  | POST /api/generate   ← also fast — one of these is the empty response
[GIN] 200 | 1519ms  | POST /api/generate
[GIN] 200 | 2344ms  | POST /api/generate

Two of the four responses return in ~680ms and ~701ms. Normal generation latency for this model is 1.5–2.3s. Both fast responses land before the model was generating anything; one of them is empty. With Parallel:1 in the load request, two concurrent completions shouldn't happen, which points to a race in the runner initialisation path.
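To spot the same pattern in other runs, the latency check described above can be sketched as a small log filter. This is only a rough aid, not part of Ollama: the regular expression matches the abbreviated `[GIN]` lines as quoted in this report (real service logs carry extra fields such as timestamp and client IP, so the pattern would need adjusting), and the names `parse_latencies`, `suspiciously_fast`, and the 1000 ms cutoff are assumptions chosen for illustration.

```python
import re

# Matches the abbreviated form quoted in this report, e.g.
# "[GIN] 200 |  1,519ms | POST /api/generate".
# Assumption: full Ollama logs have more columns and need a different pattern.
GIN_LINE = re.compile(r"\[GIN\]\s+(\d{3})\s*\|\s*([\d,]+)ms\s*\|\s*(\S+)\s+(\S+)")

# Hypothetical cutoff: the report puts normal latency at roughly 1.5-2.3 s.
SUSPECT_BELOW_MS = 1000


def parse_latencies(log_text: str) -> list[int]:
    """Return the latency in ms of every /api/generate line, in log order."""
    latencies = []
    for line in log_text.splitlines():
        match = GIN_LINE.search(line)
        if match and match.group(4) == "/api/generate":
            # Strip thousands separators such as "1,519".
            latencies.append(int(match.group(2).replace(",", "")))
    return latencies


def suspiciously_fast(latencies: list[int], cutoff_ms: int = SUSPECT_BELOW_MS) -> list[int]:
    """Return the latencies that fall below the cutoff."""
    return [ms for ms in latencies if ms < cutoff_ms]


SAMPLE = """\
[GIN] 200 |  680ms  | POST /api/generate
[GIN] 200 |  701ms  | POST /api/generate
[GIN] 200 | 1519ms  | POST /api/generate
[GIN] 200 | 2344ms  | POST /api/generate
"""

if __name__ == "__main__":
    found = parse_latencies(SAMPLE)
    print("latencies (ms):", found)
    print("below cutoff:", suspiciously_fast(found))
```

A fast response is only a hint: a short latency alone does not prove the body was empty, so the flagged requests still need to be matched against the client-side results.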

Notes:

Without raw: true (same num_ctx: 2048), no failure in 20+ attempts, so the bug appears to be specific to raw mode.
OLLAMA_NUM_PARALLEL is unset (default).
Not a client timeout — done: true is set in the body and HTTP status is 200.
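Because the empty result arrives as a normal HTTP 200 with done: true, a client cannot distinguish it from success by status alone. Until the server-side race is fixed, one possible client-side mitigation is to retry when the returned text is empty. The sketch below is a workaround under stated assumptions, not a fix and not part of the ollama package: `generate_with_retry` is a hypothetical helper, and it treats any empty string as a failure, which would be wrong for prompts where an empty completion is legitimate.

```python
import asyncio
from typing import Awaitable, Callable


async def generate_with_retry(
    call: Callable[[], Awaitable[str]],
    max_attempts: int = 3,
    delay_s: float = 0.5,
) -> str:
    """Await call() until it returns non-empty text or attempts run out.

    `call` is any zero-argument coroutine function returning the response
    text, for example a wrapper around client.generate(...).response.
    Returns the last result, which may still be empty.
    """
    text = ""
    for attempt in range(1, max_attempts + 1):
        text = await call()
        if text:
            return text
        if attempt < max_attempts:
            # Short pause so the retry lands after the model has loaded.
            await asyncio.sleep(delay_s)
    return text


if __name__ == "__main__":
    # Stand-in for the real request: empty on the first call, text afterwards.
    state = {"calls": 0}

    async def fake_generate() -> str:
        state["calls"] += 1
        return "" if state["calls"] == 1 else "2"

    result = asyncio.run(generate_with_retry(fake_generate, delay_s=0))
    print(f"result={result!r} after {state['calls']} calls")
```

In the reproducer, this would wrap the body of generate_one, with the lambda or inner coroutine returning resp.response. Retrying hides the symptom from the caller, so it is worth logging each empty attempt to keep the underlying failure visible.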

Relevant log output

(Captured with OLLAMA_DEBUG not set — standard Homebrew service log)

time=...  source=server.go:433  msg="starting runner"
time=...  source=sched.go:484   msg="system memory"  total="64.0 GiB"
time=...  source=sched.go:491   msg="gpu memory"  available="51.3 GiB"
time=...  source=server.go:532  msg="loading model"  "model layers"=41
time=...  source=runner.go:895  msg=load  request="...Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:2048..."
time=...  source=server.go:1385 msg="waiting for llama runner to start responding"
time=...  source=server.go:1428 msg="waiting for server to become available"  status="llm server loading model"
time=...  source=sched.go:561   msg="loaded runners" count=1
time=...  source=server.go:1385 msg="waiting for llama runner to start responding"
[GIN] 200 |    680ms | POST /api/generate
[GIN] 200 |    701ms | POST /api/generate
[GIN] 200 |  1,519ms | POST /api/generate
[GIN] 200 |  2,344ms | POST /api/generate

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.24.0
