ollama - ✅(Solved) Fix generate API ignores think=false for qwen3.5 (chat API works) [2 pull requests, 4 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#14793Fetched 2026-04-08 00:31:39
View on GitHub
Comments
4
Participants
3
Timeline
12
Reactions
0
Timeline (top)
commented ×4cross-referenced ×4closed ×1labeled ×1

Error Message

This asymmetry causes silent failures for applications using the generate endpoint with thinking models — they get empty responses with no error.

Fix Action

Workaround

Switch from generate to chat API and pass think=False as a top-level parameter (not inside options).

PR fix notes

PR #3: Extend syllabus

Description (problem / solution / changelog)

<!-- This is an auto-generated description by cubic. -->

Summary by cubic

Expands the curriculum to 23 modules (~115 lessons) with the full v2 syllabus, and marks the expansion as Done across plans and features. Foundation modules (0–2) are rewritten for LocalForge with Ollama, a cleaner webhook flow, Telegram support, and clearer onboarding.

  • New Features

    • Added docs/curriculum/SYLLABUS.md and PRD docs/exec-plans/04-curriculum-expansion_prd.md; updated PRODUCT_PLAN.md, docs/exec-plans/README.md, and docs/features/README.md to mark Curriculum Expansion v2 and Challenges/Gamification as Done.
    • Rewrote modules 0–2 for LocalForge: onboarding and project structure; Docker with Ollama, Langfuse and ngrok; env setup; first-message walkthrough; webhook pipeline; and LLM integration via OllamaClient (async httpx) with resilient error handling.
    • Added MDX for modules 3–22 covering: DB/state and repository pattern; memory + context builder and token budgeting; multimodal & formatting (audio with faster-whisper, vision via Ollama llava, WhatsApp/Telegram formatting and message splitting); guardrails & observability (Langfuse v3, trace context/recorder/scores); eval, prompt versioning, and auto-evolution; agents (reactive loop, planner–worker, robustness/fallbacks); security (policy engine, audit trail, HITL, shell controls); platform abstraction (PlatformClient) with Telegram client; MCP with hot-reload; knowledge graph; rule engine; and performance/ops (asyncio, caching, SQLite tuning, Docker, CI, eval in CI, monitoring).
  • Bug Fixes

    • Updated WhatsApp Graph API to v22.0, tightened HMAC verification guidance, and fixed Telegram HTML formatting/escape order.
    • Minor copy/formatting fixes across docs (README heading, module titles, LocalForge references).

<sup>Written for commit 24a07831951e46a72bf2a5ebfa4834dc86138f7f. Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->

Changed files

  • PRODUCT_PLAN.md (modified, +73/-27)
  • README.md (modified, +1/-1)
  • docs/curriculum/SYLLABUS.md (added, +364/-0)
  • docs/exec-plans/04-curriculum-expansion_prd.md (added, +136/-0)
  • docs/exec-plans/README.md (modified, +2/-2)
  • docs/features/README.md (modified, +4/-1)
  • frontend/content/modules/module-0/00-bienvenida.mdx (modified, +88/-13)
  • frontend/content/modules/module-0/01-estructura-proyecto.mdx (modified, +182/-44)
  • frontend/content/modules/module-0/02-docker-setup.mdx (modified, +210/-56)
  • frontend/content/modules/module-0/03-entorno-local.mdx (modified, +191/-54)
  • frontend/content/modules/module-0/04-primer-mensaje.mdx (modified, +205/-37)
  • frontend/content/modules/module-1/00-intro-webhooks.mdx (modified, +140/-30)
  • frontend/content/modules/module-1/01-firma-hmac.mdx (modified, +187/-40)
  • frontend/content/modules/module-1/02-parsear-payload.mdx (modified, +216/-57)
  • frontend/content/modules/module-1/03-responder-mensajes.mdx (modified, +230/-68)
  • frontend/content/modules/module-1/04-testing-webhook.mdx (modified, +300/-80)
  • frontend/content/modules/module-10/00-presupuesto-de-tokens.mdx (added, +201/-0)
  • frontend/content/modules/module-10/01-context-builder.mdx (added, +234/-0)
  • frontend/content/modules/module-10/02-conversation-context-build.mdx (added, +276/-0)
  • frontend/content/modules/module-10/03-fact-extraction.mdx (added, +252/-0)
  • frontend/content/modules/module-10/04-ensamblando-el-prompt.mdx (added, +299/-0)
  • frontend/content/modules/module-11/00-audio-faster-whisper.mdx (added, +208/-0)
  • frontend/content/modules/module-11/01-imagenes-vision-llm.mdx (added, +254/-0)
  • frontend/content/modules/module-11/02-markdown-a-whatsapp.mdx (added, +208/-0)
  • frontend/content/modules/module-11/03-markdown-a-telegram.mdx (added, +234/-0)
  • frontend/content/modules/module-11/04-message-splitting.mdx (added, +252/-0)
  • frontend/content/modules/module-12/00-por-que-guardrails.mdx (added, +122/-0)
  • frontend/content/modules/module-12/01-checks-deterministicos.mdx (added, +164/-0)
  • frontend/content/modules/module-12/02-checks-llm.mdx (added, +168/-0)
  • frontend/content/modules/module-12/03-pipeline-y-remediacion.mdx (added, +190/-0)
  • frontend/content/modules/module-12/04-guardrails-a-dataset.mdx (added, +151/-0)
  • frontend/content/modules/module-13/00-intro-observabilidad.mdx (added, +132/-0)
  • frontend/content/modules/module-13/01-langfuse-v3-sdk.mdx (added, +138/-0)
  • frontend/content/modules/module-13/02-trace-context.mdx (added, +149/-0)
  • frontend/content/modules/module-13/03-trace-recorder.mdx (added, +180/-0)
  • frontend/content/modules/module-13/04-scores-y-metricas.mdx (added, +153/-0)
  • frontend/content/modules/module-14/00-dataset-vivo.mdx (added, +249/-0)
  • frontend/content/modules/module-14/01-senales-de-usuario.mdx (added, +270/-0)
  • frontend/content/modules/module-14/02-llm-as-judge.mdx (added, +285/-0)
  • frontend/content/modules/module-14/03-prompt-versioning.mdx (added, +314/-0)
  • frontend/content/modules/module-14/04-auto-evolucion.mdx (added, +335/-0)
  • frontend/content/modules/module-15/00-que-es-un-agente.mdx (added, +201/-0)
  • frontend/content/modules/module-15/01-reactive-loop.mdx (added, +248/-0)
  • frontend/content/modules/module-15/02-session-persistence.mdx (added, +245/-0)
  • frontend/content/modules/module-15/03-loop-detection.mdx (added, +213/-0)
  • frontend/content/modules/module-15/04-planner-orchestrator.mdx (added, +352/-0)
  • frontend/content/modules/module-16/00-patron-planner-worker.mdx (added, +202/-0)
  • frontend/content/modules/module-16/01-workers-especializados.mdx (added, +234/-0)
  • frontend/content/modules/module-16/02-replanificacion.mdx (added, +226/-0)
  • frontend/content/modules/module-16/03-synthesis-y-task-memory.mdx (added, +226/-0)
  • frontend/content/modules/module-16/04-robustez-y-fallbacks.mdx (added, +239/-0)
  • frontend/content/modules/module-17/00-defense-in-depth.mdx (added, +121/-0)
  • frontend/content/modules/module-17/01-policy-engine.mdx (added, +263/-0)
  • frontend/content/modules/module-17/02-audit-trail.mdx (added, +226/-0)
  • frontend/content/modules/module-17/03-human-in-the-loop.mdx (added, +259/-0)
  • frontend/content/modules/module-17/04-shell-controls.mdx (added, +231/-0)
  • frontend/content/modules/module-18/00-platform-protocol.mdx (added, +193/-0)
  • frontend/content/modules/module-18/01-telegram-client.mdx (added, +260/-0)
  • frontend/content/modules/module-18/02-mcp-basico.mdx (added, +216/-0)
  • frontend/content/modules/module-18/03-hot-reload-mcp.mdx (added, +217/-0)
  • frontend/content/modules/module-18/04-integracion-completa.mdx (added, +219/-0)
  • frontend/content/modules/module-19/00-knowledge-graph-basics.mdx (added, +190/-0)
  • frontend/content/modules/module-19/01-entity-registry.mdx (added, +169/-0)
  • frontend/content/modules/module-19/02-graph-traversal.mdx (added, +213/-0)
  • frontend/content/modules/module-19/03-data-provenance.mdx (added, +196/-0)
  • frontend/content/modules/module-19/04-memory-versions.mdx (added, +220/-0)
  • frontend/content/modules/module-2/00-intro-llms.mdx (modified, +145/-20)
  • frontend/content/modules/module-2/01-anthropic-sdk.mdx (modified, +199/-60)
  • frontend/content/modules/module-2/02-prompt-engineering.mdx (modified, +173/-51)
  • frontend/content/modules/module-2/03-streaming.mdx (modified, +166/-68)
  • frontend/content/modules/module-2/04-integracion-completa.mdx (modified, +210/-91)
  • frontend/content/modules/module-20/00-rule-engine.mdx (added, +179/-0)
  • frontend/content/modules/module-20/01-conditions.mdx (added, +216/-0)
  • frontend/content/modules/module-20/02-actions.mdx (added, +221/-0)
  • frontend/content/modules/module-20/03-builtin-rules.mdx (added, +208/-0)
  • frontend/content/modules/module-20/04-custom-rules.mdx (added, +239/-0)
  • frontend/content/modules/module-21/00-profiling-critical-path.mdx (added, +157/-0)
  • frontend/content/modules/module-21/01-paralelizacion-asyncio.mdx (added, +204/-0)
  • frontend/content/modules/module-21/02-caching-estrategico.mdx (added, +182/-0)
  • frontend/content/modules/module-21/03-sqlite-performance.mdx (added, +194/-0)
  • frontend/content/modules/module-21/04-event-loop-hygiene.mdx (added, +219/-0)
  • frontend/content/modules/module-22/00-docker-para-ai.mdx (added, +203/-0)
  • frontend/content/modules/module-22/01-health-checks.mdx (added, +198/-0)
  • frontend/content/modules/module-22/02-ci-pipeline.mdx (added, +263/-0)
  • frontend/content/modules/module-22/03-eval-en-ci.mdx (added, +236/-0)
  • frontend/content/modules/module-22/04-monitoreo-produccion.mdx (added, +254/-0)
  • frontend/content/modules/module-3/00-sqlite-async.mdx (added, +172/-0)
  • frontend/content/modules/module-3/01-schema-design.mdx (added, +206/-0)
  • frontend/content/modules/module-3/02-repository-pattern.mdx (added, +219/-0)
  • frontend/content/modules/module-3/03-dedup-atomico.mdx (added, +170/-0)
  • frontend/content/modules/module-3/04-vector-storage.mdx (added, +0/-0)
  • frontend/content/modules/module-4/00-conversation-manager.mdx (added, +0/-0)
  • frontend/content/modules/module-4/01-historial-ventana.mdx (added, +0/-0)
  • frontend/content/modules/module-4/02-summarizer-background.mdx (added, +0/-0)
  • frontend/content/modules/module-4/03-conversation-state.mdx (added, +0/-0)
  • frontend/content/modules/module-4/04-compactacion-inteligente.mdx (added, +0/-0)
  • frontend/content/modules/module-5/00-memorias-semanticas.mdx (added, +0/-0)
  • frontend/content/modules/module-5/01-daily-logs.mdx (added, +0/-0)
  • frontend/content/modules/module-5/02-memory-md-sync.mdx (added, +0/-0)
  • frontend/content/modules/module-5/03-consolidacion-llm.mdx (added, +0/-0)

PR #193: feat: Ollama qwen3 upgrade + WhisperFlow voice pipeline

Description (problem / solution / changelog)

Summary

  • Ollama model swap: hermes3:8b → qwen3:8b (primary) + qwen3:14b (complex/planning), with intent-based /think vs /no_think mode selection
  • PicoClaw upgrade: qwen2.5-coder:1.5b → qwen2.5-coder:3b
  • Voice pipeline: 3 new Docker sidecars — Kokoro TTS (highest quality), Piper TTS (fast fallback), whisper.cpp STT (local transcription)
  • TTS fallback chain: Kokoro → Piper → edge-tts with health checks + 30s cache
  • STT fallback chain: whisper.cpp (local) → Groq Whisper (cloud)
  • Frontend: Server-side TTS hook (useTTS.ts) + VoiceChatPage with playback controls
  • Config: ollamaMaxTokens 512→4096, timeout 45s→90s, all new env vars documented

New Docker services

ServicePortRAM limitPurpose
kokoro-tts51012GBNeural TTS (ONNX CPU)
piper-tts5100512MBFast CPU TTS fallback
whisper-stt5102512MBLocal speech-to-text

Memory budget (32GB VPS)

Typical (qwen3:8b loaded): ~16.5GB used, ~15.5GB free Peak (qwen3:14b + voice sidecars): ~20GB used, ~12GB free

Test plan

  • ollama pull qwen3:8b && ollama pull qwen3:14b && ollama pull qwen2.5-coder:3b
  • docker compose build kokoro-tts piper-tts whisper-stt
  • docker compose up -d — verify all 3 sidecars start healthy
  • Health checks: curl localhost:5100/health && curl localhost:5101/health && curl localhost:5102/health
  • TTS test: curl -X POST localhost:5101/tts -d '{"text":"Hello"}' -H 'Content-Type: application/json' -o test.wav
  • STT test: curl -X POST localhost:5102/transcribe -F '[email protected]'
  • Send complex query → verify routing trace shows qwen3:14b with /think
  • Fallback test: stop Kokoro → Piper takes over → stop Piper → edge-tts takes over

https://claude.ai/code/session_017nK3obD6HhVkPkFuecwxPp

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

Summary by CodeRabbit

  • New Features

    • Local voice pipeline with Kokoro/Piper/Whisper sidecars, automatic fallback, and UI labels showing which engine handled each message.
    • Dual LLM model support (standard vs. complex) with optional "thinking" mode and intent-aware selection.
    • Client prefers server-side TTS with browser fallback and exposes TTS/STT engine info.
  • Chores

    • Updated default models, increased token limits, adjusted timeouts, and added env settings for voice services and models.
    • Added containerized sidecar services and runner setup script; CI deploys moved to self-hosted.
  • Tests

    • Updated tests to assert presence of engine-related fields.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Changed files

  • .env.example (modified, +17/-5)
  • .env.staging.example (modified, +12/-3)
  • .github/workflows/ci.yml (modified, +102/-121)
  • docker-compose.yml (modified, +122/-5)
  • kokoro-tts/Dockerfile (added, +25/-0)
  • kokoro-tts/download-model.sh (added, +14/-0)
  • kokoro-tts/server.py (added, +107/-0)
  • package-lock.json (modified, +9667/-19076)
  • picoclaw/index.js (modified, +1/-1)
  • piper-tts/Dockerfile (added, +19/-0)
  • piper-tts/download-voices.sh (added, +29/-0)
  • piper-tts/server.py (added, +124/-0)
  • scripts/setup-runner.sh (added, +25/-0)
  • server/src/config.ts (modified, +12/-4)
  • server/src/modules/agent/routes/streaming.ts (modified, +1/-1)
  • server/src/modules/agent/services/llm.ts (modified, +47/-15)
  • server/src/modules/media/index.ts (modified, +2/-0)
  • server/src/modules/media/routes/voice.ts (modified, +8/-13)
  • server/src/modules/media/services/voice.ts (modified, +255/-90)
  • server/src/routes/webhooks.ts (modified, +6/-4)
  • server/src/services/proactive-engine.ts (modified, +2/-2)
  • server/src/services/voice.ts (modified, +2/-0)
  • server/src/test/api/phase110.test.ts (modified, +2/-2)
  • server/src/test/api/phase80.test.ts (modified, +2/-2)
  • src/dashboard/pages/VoiceChatPage.tsx (modified, +22/-12)
  • src/hooks/useTTS.ts (modified, +169/-107)
  • whisper-stt/Dockerfile (added, +37/-0)
  • whisper-stt/download-model.sh (added, +16/-0)
  • whisper-stt/server.py (added, +126/-0)

Code Example

import ollama

# Test 1: generate API — think=false IGNORED
r1 = ollama.generate(
    model='qwen3.5:9b',
    prompt='Say hello in 2 words. Reply with ONLY the words.',
    options={'num_predict': 30, 'think': False}
)
print('generate think=false:')
print('  response:', repr(r1['response']))        # '' (empty!)
print('  thinking:', repr(r1['thinking'][:80]))    # still populated
print('  eval_count:', r1['eval_count'])           # 30 (all tokens burned on thinking)

# Test 2: chat API — think=False WORKS
r2 = ollama.chat(
    model='qwen3.5:9b',
    messages=[{'role': 'user', 'content': 'Say hello in 2 words. Reply with ONLY the words.'}],
    think=False,
    options={'num_predict': 30}
)
print('chat think=False:')
print('  content:', repr(r2['message']['content']))  # 'Hello there!' (actual output)
print('  eval_count:', r2['eval_count'])              # 3 (no wasted tokens)

---

generate think=false:
  response: ''
  thinking: 'Thinking Process:\n\n1.  **Analyze the Request:**\n    *   Task: Say hello.\n    *   Constraint 1:'
  eval_count: 30

chat think=False:
  content: 'Hello there!'
  eval_count: 3
RAW_BUFFERClick to expand / collapse

What is the issue?

The generate API completely ignores think: false passed in options for qwen3.5:9b. The model still produces thinking tokens that consume the entire num_predict budget, resulting in an empty response field. The chat API with think=False as a top-level parameter works correctly.

This asymmetry causes silent failures for applications using the generate endpoint with thinking models — they get empty responses with no error.

OS / Ollama version

  • OS: Linux 6.17.0-14-generic (Ubuntu)
  • Ollama: 0.17.7
  • Model: qwen3.5:9b

Steps to reproduce

import ollama

# Test 1: generate API — think=false IGNORED
r1 = ollama.generate(
    model='qwen3.5:9b',
    prompt='Say hello in 2 words. Reply with ONLY the words.',
    options={'num_predict': 30, 'think': False}
)
print('generate think=false:')
print('  response:', repr(r1['response']))        # '' (empty!)
print('  thinking:', repr(r1['thinking'][:80]))    # still populated
print('  eval_count:', r1['eval_count'])           # 30 (all tokens burned on thinking)

# Test 2: chat API — think=False WORKS
r2 = ollama.chat(
    model='qwen3.5:9b',
    messages=[{'role': 'user', 'content': 'Say hello in 2 words. Reply with ONLY the words.'}],
    think=False,
    options={'num_predict': 30}
)
print('chat think=False:')
print('  content:', repr(r2['message']['content']))  # 'Hello there!' (actual output)
print('  eval_count:', r2['eval_count'])              # 3 (no wasted tokens)

Actual output

generate think=false:
  response: ''
  thinking: 'Thinking Process:\n\n1.  **Analyze the Request:**\n    *   Task: Say hello.\n    *   Constraint 1:'
  eval_count: 30

chat think=False:
  content: 'Hello there!'
  eval_count: 3

Expected behavior

think: false should disable thinking on BOTH generate and chat endpoints. When thinking is disabled, all num_predict tokens should be available for the visible response.

Additional context

  • Non-thinking models (e.g. gemma2:9b) are unaffected — think: false is harmlessly ignored
  • The model's Modelfile uses RENDERER qwen3.5 / PARSER qwen3.5, so thinking is handled by Ollama's built-in renderer, not the template
  • Increasing num_predict to 200+ doesn't help on generate — the model just thinks longer, still empty response
  • Related: #14502, #14612, #14645

Workaround

Switch from generate to chat API and pass think=False as a top-level parameter (not inside options).

extent analysis

Fix Plan

To fix the issue with the generate API ignoring think: false in options for qwen3.5:9b, we need to modify the ollama.generate function to properly handle the think option.

Here are the steps:

  • Update the ollama.generate function to accept think as a top-level parameter.
  • Modify the function to pass think=False to the underlying model when generating text.

Example code:

import ollama

def generate(model, prompt, options, think=None):
    # ... existing code ...
    if think is not False:
        think = True  # default to True if not specified
    # Pass think to the underlying model
    response = ollama._generate(model, prompt, options, think=think)
    return response

# Usage
r1 = ollama.generate(
    model='qwen3.5:9b',
    prompt='Say hello in 2 words. Reply with ONLY the words.',
    options={'num_predict': 30},
    think=False
)

Verification

To verify that the fix worked, run the following test:

r1 = ollama.generate(
    model='qwen3.5:9b',
    prompt='Say hello in 2 words. Reply with ONLY the words.',
    options={'num_predict': 30},
    think=False
)
print('generate think=False:')
print('  response:', repr(r1['response']))        
print('  thinking:', repr(r1['thinking'][:80]))    
print('  eval_count:', r1['eval_count'])

The output should show a non-empty response and a lower eval_count.

Extra Tips

  • Make sure to update the ollama library to the latest version to ensure that the fix is included.
  • If you are using a custom model, ensure that it is compatible with the updated ollama.generate function.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

think: false should disable thinking on BOTH generate and chat endpoints. When thinking is disabled, all num_predict tokens should be available for the visible response.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING