ollama: `/api/generate` returns `response: ""` for one concurrent request when model is loading (`raw: true`) (1 linked pull request)

Fix Action

Fixed


What is the issue?

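Send four concurrent requests to /api/generate with raw: true and num_ctx: 2048 against a cold (unloaded) model, and one of them can come back as HTTP 200 with response: "" and done: true. There is no error and no retry: that request returns zero tokens while the other three succeed. The failure is intermittent, roughly 1 run in 8 on Apple Silicon.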

Expected: All requests queue behind the model load and return non-empty responses.

Actual: One request completes silently with response: "" during initialisation.
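
On a server that shows this behaviour, a client can at least refuse to accept the silent empty reply. The following is a minimal sketch of such a guard, not a fix for the underlying race. The helper names is_silent_empty and generate_with_retry are hypothetical, and it assumes the reply body is available as a mapping with the response and done fields described above. An empty completion can also be legitimate for some prompts, so the check is a heuristic.

```python
import asyncio
from typing import Any, Awaitable, Callable, Mapping


def is_silent_empty(body: Mapping[str, Any]) -> bool:
    """True when a reply claims completion (done: true) but carries no text."""
    return bool(body.get("done")) and not body.get("response")


async def generate_with_retry(
    call: Callable[[], Awaitable[Mapping[str, Any]]],
    attempts: int = 3,
    delay_s: float = 0.5,
) -> Mapping[str, Any]:
    """Re-issue the request while the reply is silently empty.

    `call` is any zero-argument coroutine function that performs one
    /api/generate request and returns the parsed reply body.
    """
    for attempt in range(attempts):
        body = await call()
        if not is_silent_empty(body):
            return body
        # Give the model a moment to finish loading before trying again.
        if attempt < attempts - 1:
            await asyncio.sleep(delay_s)
    raise RuntimeError(f"empty response after {attempts} attempts")
```

With the ollama Python client, one option is to have the callable wrap client.generate(...) and return {"response": resp.response, "done": resp.done}; treat that field access as an assumption to check against the installed client version.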

Reproducer (requires pip install ollama; runs against a local Ollama server):

#!/usr/bin/env python3
"""
Reproducer: concurrent /api/generate with raw=True against cold model.
Run multiple times — failure rate is ~1-in-8 on Apple Silicon.
"""
import asyncio
import ollama

MODEL = "granite4:micro"  # replace with any available model


async def evict() -> None:
    client = ollama.AsyncClient()
    await client.generate(model=MODEL, prompt="", keep_alive=0)
    print(f"Evicted {MODEL!r}.")


async def generate_one(client: ollama.AsyncClient, idx: int) -> tuple[int, str]:
    resp = await client.generate(
        model=MODEL,
        prompt=f"What is {idx}+{idx}? Answer briefly.",
        stream=False,
        raw=True,
        options={"num_predict": 100, "num_ctx": 2048},
    )
    return idx, resp.response


async def main() -> None:
    await evict()
    print("Firing 4 concurrent requests against cold model...")
    client = ollama.AsyncClient()
    results = await asyncio.gather(*[generate_one(client, i) for i in range(1, 5)])
    any_empty = False
    for idx, r in results:
        if not r:
            print(f"  Request {idx}: EMPTY  ← bug")
            any_empty = True
        else:
            print(f"  Request {idx}: OK — {r[:60]!r}")
    print("\nBug triggered." if any_empty else "\nAll OK (try again — intermittent).")


asyncio.run(main())

Key observation from server log during a failure:

time=...  source=runner.go:895  msg=load  request="...Parallel:1...KvSize:2048..."
time=...  source=server.go:1385 msg="waiting for llama runner to start responding"
time=...  source=sched.go:561   msg="loaded runners" count=1
time=...  source=server.go:1385 msg="waiting for llama runner to start responding"   ← second wait, after runners loaded
[GIN] 200 |  680ms  | POST /api/generate   ← returned BEFORE model was generating
[GIN] 200 |  701ms  | POST /api/generate   ← also fast — one of these is the empty response
[GIN] 200 | 1519ms  | POST /api/generate
[GIN] 200 | 2344ms  | POST /api/generate

Two of the four responses return in ~680ms and ~701ms, while normal generation latency for this model is 1.5–2.3s. Both fast responses land before the model has started generating, and one of them is empty. With Parallel:1 in the load request, two requests should not complete at nearly the same time, which points to a race in the runner initialisation path.
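
The latency comparison above can be automated when scanning a longer log. This is a rough sketch that assumes the abbreviated [GIN] line shape quoted in this report (status, latency, method, path); real server logs carry extra fields, so the pattern may need adjusting. The 1500 ms floor is taken from the 1.5–2.3s normal latency stated above, and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches the abbreviated form used in this report, e.g.
#   [GIN] 200 |  1,519ms | POST /api/generate
GIN_LINE = re.compile(
    r"\[GIN\]\s+(?P<status>\d{3})\s*\|\s*(?P<num>[\d,.]+)\s*(?P<unit>ms|s)\s*\|"
    r"\s*(?P<method>[A-Z]+)\s+(?P<path>\S+)"
)


def parse_gin(line: str) -> Optional[Tuple[int, float, str]]:
    """Return (status, latency in ms, path), or None for non-GIN lines."""
    m = GIN_LINE.search(line)
    if m is None:
        return None
    value = float(m["num"].replace(",", ""))
    latency_ms = value * 1000.0 if m["unit"] == "s" else value
    return int(m["status"]), latency_ms, m["path"]


def suspiciously_fast(
    lines: Iterable[str], floor_ms: float = 1500.0, path: str = "/api/generate"
) -> List[float]:
    """Latencies of HTTP 200 replies on `path` that finished below the floor."""
    fast = []
    for line in lines:
        parsed = parse_gin(line)
        if parsed is None:
            continue
        status, latency_ms, req_path = parsed
        if status == 200 and req_path == path and latency_ms < floor_ms:
            fast.append(latency_ms)
    return fast


SAMPLE = [
    'time=...  source=sched.go:561   msg="loaded runners" count=1',
    "[GIN] 200 |  680ms  | POST /api/generate",
    "[GIN] 200 |  701ms  | POST /api/generate",
    "[GIN] 200 | 1519ms  | POST /api/generate",
    "[GIN] 200 |  2,344ms | POST /api/generate",
]

if __name__ == "__main__":
    print(suspiciously_fast(SAMPLE))
```

A fast reply is only a hint: it identifies candidates for the empty response, and the response body still has to be checked to confirm which one was empty.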

Notes:

  • Without raw: true (same num_ctx: 2048), no failure in 20+ attempts. The bug appears specific to requests that set raw: true.
  • OLLAMA_NUM_PARALLEL is unset (default).
  • Not a client timeout — done: true is set in the body and HTTP status is 200.
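
How much weight the "20+ attempts" note carries depends on the failure rate. As a back-of-the-envelope check, assume runs are independent and take the ~1-in-8 per-run failure rate from the reproducer's docstring. If the non-raw path failed at that same rate, 20 clean runs would still happen by chance roughly 7% of the time, and about 23 runs are needed to see at least one failure with 95% probability. The function names below are hypothetical.

```python
import math


def p_all_clean(p_fail: float, runs: int) -> float:
    """Probability that `runs` independent runs all succeed."""
    return (1.0 - p_fail) ** runs


def runs_for_confidence(p_fail: float, confidence: float = 0.95) -> int:
    """Smallest number of runs giving at least `confidence` chance of one failure."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    p = 1 / 8  # per-run failure rate quoted in the reproducer (an estimate)
    print(f"P(20 clean runs) = {p_all_clean(p, 20):.3f}")
    print(f"runs for 95% chance of a failure = {runs_for_confidence(p)}")
```

So the absence of failures without raw: true is suggestive rather than conclusive; a few dozen more clean runs would make the contrast much stronger.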

Relevant log output

(Captured with OLLAMA_DEBUG not set — standard Homebrew service log)

time=...  source=server.go:433  msg="starting runner"
time=...  source=sched.go:484   msg="system memory"  total="64.0 GiB"
time=...  source=sched.go:491   msg="gpu memory"  available="51.3 GiB"
time=...  source=server.go:532  msg="loading model"  "model layers"=41
time=...  source=runner.go:895  msg=load  request="...Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:2048..."
time=...  source=server.go:1385 msg="waiting for llama runner to start responding"
time=...  source=server.go:1428 msg="waiting for server to become available"  status="llm server loading model"
time=...  source=sched.go:561   msg="loaded runners" count=1
time=...  source=server.go:1385 msg="waiting for llama runner to start responding"
[GIN] 200 |    680ms | POST /api/generate
[GIN] 200 |    701ms | POST /api/generate
[GIN] 200 |  1,519ms | POST /api/generate
[GIN] 200 |  2,344ms | POST /api/generate

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.24.0
