openclaw - PRD: Model-aware thinking controls for vLLM and Pi OpenAI harness

openclaw · 2026-05-24 15:51:50

Fix Action

Solution

Add a model-aware thinking capability seam for the vLLM provider and the Pi OpenAI-compatible harness so each model exposes the correct thinking mode surface.

For Qwen-style vLLM models, OpenClaw should expose a binary thinking profile: off and on. Internally, it should continue using the existing durable thinking state: off remains off, and on is represented by the existing lowest supported non-off thinking level. The vLLM request mapper should keep converting that durable state into the Qwen payload shape the model requires, either a top-level enable_thinking field or enable_thinking inside the chat-template kwargs.
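The mapping described above can be sketched as a small pure function. This is an illustration under stated assumptions, not OpenClaw's actual mapper: the names ThinkingLevel, QwenThinkingFormat, and applyQwenThinking are hypothetical, and the sketch assumes the only OpenAI-style effort field to strip is reasoning_effort.

```typescript
// Hypothetical types; the real OpenClaw names may differ.
type ThinkingLevel = "off" | "low" | "medium" | "high" | "xhigh";
type QwenThinkingFormat = "top-level" | "chat-template";
type ChatRequestBody = Record<string, unknown>;

// Convert the durable thinking level into a Qwen-style payload.
// Any non-off level means "on"; the level itself is never sent.
function applyQwenThinking(
  body: ChatRequestBody,
  level: ThinkingLevel,
  format: QwenThinkingFormat,
): ChatRequestBody {
  const enabled = level !== "off";

  // Copy so the caller's body is not mutated, then drop the
  // OpenAI-style effort field (assumed here to be reasoning_effort).
  const next: ChatRequestBody = { ...body };
  delete next["reasoning_effort"];

  if (format === "top-level") {
    next["enable_thinking"] = enabled;
    return next;
  }

  // Chat-template format: merge into any existing template kwargs.
  const existing = next["chat_template_kwargs"];
  const kwargs =
    typeof existing === "object" && existing !== null
      ? (existing as Record<string, unknown>)
      : {};
  next["chat_template_kwargs"] = { ...kwargs, enable_thinking: enabled };
  return next;
}
```

Because the function only reads the level and the configured format, the same durable state can drive both payload shapes without a second boolean.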

For OpenAI-style reasoning models, OpenClaw should keep using level-based reasoning effort where the provider/model compatibility contract supports it.

For reasoning output visibility, OpenClaw should keep /reasoning separate from /think. /think controls model-side thinking. /reasoning controls whether reasoning content is displayed or streamed back to the user.

The implementation should preserve the existing session/default persistence model and add provider-owned capability policy rather than adding a parallel thinkEnabled state.
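One way to picture a provider-owned binary profile without a parallel thinkEnabled flag is sketched below. All names (ThinkingProfile, BINARY_PROFILE, labelFor, levelForLabel) are hypothetical, and the sketch assumes "low" is the lowest supported non-off level that backs the on label.

```typescript
// Hypothetical shapes; not the actual OpenClaw provider contract.
type ThinkingLevel = "off" | "low" | "medium" | "high" | "xhigh";

interface ThinkingChoice {
  label: string;        // what the slash command and UI show
  level: ThinkingLevel; // what is persisted in the durable state
}

interface ThinkingProfile {
  kind: "binary" | "levels";
  choices: ThinkingChoice[];
}

// Binary profile for Qwen-style models: "on" is stored as "low" (assumption).
const BINARY_PROFILE: ThinkingProfile = {
  kind: "binary",
  choices: [
    { label: "off", level: "off" },
    { label: "on", level: "low" },
  ],
};

// Level-based profile for OpenAI-style reasoning models.
const LEVEL_PROFILE: ThinkingProfile = {
  kind: "levels",
  choices: (["off", "low", "medium", "high", "xhigh"] as ThinkingLevel[]).map(
    (level) => ({ label: level, level }),
  ),
};

// Label to display for a stored level, so "low" never leaks on a binary model.
function labelFor(profile: ThinkingProfile, stored: ThinkingLevel): string {
  if (profile.kind === "binary") return stored === "off" ? "off" : "on";
  return stored;
}

// Stored level for a user-supplied label, or undefined if unsupported.
function levelForLabel(
  profile: ThinkingProfile,
  label: string,
): ThinkingLevel | undefined {
  return profile.choices.find((choice) => choice.label === label)?.level;
}
```

With this shape, /think on against the binary profile persists "low", and the display layer turns any stored non-off level back into "on".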


Problem Statement

Users running local OpenAI-compatible models through the vLLM plugin and the Pi harness need thinking controls that match what each model actually supports. Today, OpenClaw has a durable thinking state, but the user-facing control does not consistently reflect provider-specific semantics.

For Qwen-style models, thinking is effectively binary: enable or disable template thinking. Those models do not accept OpenAI-style reasoning effort levels such as low, medium, high, or xhigh. In the current vLLM path, OpenClaw can persist a thinking level, but vLLM only emits Qwen thinking payload fields when model/provider params opt into a Qwen thinking format. There is no provider-owned profile that makes the UI and slash command expose the correct off/on behavior for vLLM Qwen-like models.

The result is a mismatch: users see or configure level-based thinking controls, while the underlying vLLM/Qwen request seam only supports binary enablement. Users also have to know where to configure request-shaping details, and it is easy to confuse model-side thinking with reasoning output visibility.


User Stories

As a local model user, I want Qwen-style vLLM models to show off/on thinking controls, so that I do not choose unsupported reasoning levels.
As a local model user, I want /think off to persist for my vLLM Qwen session, so that the model stays in non-thinking mode until I change it.
As a local model user, I want /think on to persist for my vLLM Qwen session, so that the model stays in thinking mode across turns.
As a local model user, I want vLLM Qwen thinking to survive Pi harness turn rebuilds, so that the request payload stays consistent during a session.
As a local model user, I want OpenClaw to send Qwen's expected enable_thinking field, so that vLLM does not receive unsupported OpenAI reasoning effort fields.
As a local model user, I want Qwen chat-template models to receive thinking as chat-template kwargs, so that the model server applies the correct chat template behavior.
As a local model user, I want Qwen top-level thinking models to receive thinking as a top-level request field, so that the model server applies the correct request behavior.
As a local model user, I want model-side thinking to be separate from reasoning output visibility, so that I can enable thinking without necessarily streaming internal reasoning text.
As a local model user, I want /reasoning to continue controlling output visibility, so that existing reasoning display workflows do not change.
As a local model user, I want unsupported stored thinking values to be remapped safely, so that switching from a level-based model to a binary model does not leave the session in an invalid state.
As a local model user, I want model switches to refresh thinking choices, so that the command surface follows the active provider/model.
As a local model user, I want the default thinking state to be configurable per model, so that Qwen-style local models can default to off when that is the desired operating mode.
As a local model user, I want the default thinking state to be configurable per agent, so that different agents can use different defaults with the same vLLM server.
As a local model user, I want the request payload to avoid both enable_thinking and reasoning_effort conflicts, so that vLLM does not reject or misinterpret requests.
As a local model user, I want late extra body configuration to have documented precedence, so that explicit advanced overrides remain understandable.
As an OpenClaw maintainer, I want vLLM to own its thinking capability policy, so that core stays provider-agnostic.
As an OpenClaw maintainer, I want provider thinking profiles to describe binary model controls, so that slash command choices come from the same contract as runtime behavior.
As an OpenClaw maintainer, I want the vLLM request mapper to stay provider-local, so that Qwen-specific payload details do not leak into core logic.
As an OpenClaw maintainer, I want the Pi harness to consume prepared thinking state through existing params, so that no duplicate state model is introduced.
As an OpenClaw maintainer, I want OpenAI-style reasoning effort models to keep level-based controls, so that this fix does not regress OpenAI, Anthropic-compatible, or other reasoning providers.
As an OpenClaw maintainer, I want vLLM Qwen behavior tested at the payload boundary, so that tests prove external behavior rather than implementation details.
As an OpenClaw maintainer, I want provider profile tests for binary thinking choices, so that the UI and slash command behavior cannot drift from runtime behavior.
As an OpenClaw maintainer, I want persisted off/on behavior tested through the directive/session layer, so that user-facing persistence is proven.
As an OpenClaw maintainer, I want a single durable model config surface for Qwen thinking format, so that generic OpenAI-compatible compatibility and vLLM-specific params do not diverge silently.
As an OpenClaw maintainer, I want existing configured vLLM users to keep working, so that the change is backward-compatible.
As an OpenClaw maintainer, I want unsupported legacy values to be downgraded predictably, so that old sessions and configs do not break after the provider profile becomes binary.
As a plugin author, I want the provider thinking profile seam to be clear enough for binary providers, so that third-party OpenAI-compatible plugins can adopt the same pattern.
As a plugin author, I want request-shaping hooks to receive the active thinking state, so that provider-specific payloads can be produced without core branching.
As a docs reader, I want vLLM thinking docs to state which config field enables Qwen payload shaping, so that setup does not require source reading.
As a docs reader, I want the docs to distinguish /think from /reasoning, so that I do not configure the wrong control.
As an operator, I want model-specific defaults to work without editing every agent, so that local model fleets can be configured centrally.
As an operator, I want agent-specific overrides to still work, so that high-reasoning and low-reasoning agents can coexist.
As an operator, I want discovery/config gaps to fail transparently, so that a model missing reasoning metadata does not silently pretend to support Qwen thinking.
As an operator, I want request payload behavior to be inspectable in tests or debug logs without secrets, so that local vLLM failures can be diagnosed safely.
As an agent runner user, I want the Pi harness to preserve the selected thinking mode across run setup, stream wrapping, and subscription, so that behavior does not change between the first and later turns.
As an agent runner user, I want binary Qwen thinking to use intuitive labels, so that the stored internal level does not leak as low when the model only supports on.
As an agent runner user, I want non-Qwen vLLM reasoning models to keep their appropriate controls, so that binary Qwen support does not flatten all vLLM models.
As a contributor, I want the tests to show prior art from existing provider stream wrapper and thinking profile tests, so that the implementation is easy to review.
As a contributor, I want the implementation to avoid broad core special cases, so that provider behavior remains modular.
As a contributor, I want the changelog/docs impact to be obvious, so that users know how to configure the new behavior.
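Several stories above involve layered defaults (per model, per agent, per session). A minimal sketch of one plausible resolution order follows. The order itself (session, then agent, then model, then a global fallback) is an assumption inferred from the stories, and ThinkingSources and resolveThinking are hypothetical names.

```typescript
type ThinkingLevel = "off" | "low" | "medium" | "high" | "xhigh";

// Each layer is optional; undefined means "not configured at this layer".
interface ThinkingSources {
  session?: ThinkingLevel;      // persisted by /think for the current session
  agentDefault?: ThinkingLevel; // agent-specific override
  modelDefault?: ThinkingLevel; // central per-model default
}

// The most specific configured layer wins (assumed precedence).
function resolveThinking(
  sources: ThinkingSources,
  fallback: ThinkingLevel = "off",
): ThinkingLevel {
  return (
    sources.session ??
    sources.agentDefault ??
    sources.modelDefault ??
    fallback
  );
}
```

Under this order, an operator can set a model default of off for a Qwen fleet while a single agent still opts into thinking, and a user's /think choice outranks both for the session.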

Implementation Decisions

Preserve the existing durable thinking state. Do not introduce a separate boolean thinkEnabled state.
Treat Qwen-style thinking as a binary provider profile: off plus an on label backed by an existing non-off thinking level.
Add or extend a provider-owned vLLM thinking policy so the command surface can expose binary choices for Qwen-like vLLM reasoning models.
Keep vLLM-specific request shaping in the vLLM provider layer. Core should not learn Qwen payload keys.
Keep the Pi harness passing the active thinking state through its existing run params and prepared extra params flow.
Keep reasoning output visibility separate from model-side thinking. /reasoning must not be treated as a model-side thinking enablement control.
Continue removing OpenAI-style reasoning effort fields from Qwen-style payloads when Qwen template thinking is applied.
Preserve support for both known vLLM Qwen payload formats: chat-template kwargs and top-level request body fields.
Keep OpenAI-style effort mapping for models that actually support OpenAI reasoning effort levels.
Make unsupported persisted thinking values remap to the nearest supported provider/profile value instead of failing the run.
Prefer one durable model config surface for Qwen thinking format. If both generic OpenAI-compatible compatibility and vLLM-specific params remain supported, document precedence and add tests for it.
Consider extending the provider thinking policy context additively if exact profile selection needs resolved model params. This avoids brittle model-id-only inference when config already contains the stronger fact.
Preserve backward compatibility for existing vLLM configs that already use provider params for Qwen thinking format.
Keep late explicit extra body overrides advanced and documented. If they overwrite provider-injected fields, that should be intentional and test-covered.
Update user-facing vLLM thinking documentation if behavior or config contracts change.
Add a changelog entry if this ships as a behavior change.
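Two of the decisions above, remapping unsupported persisted values and late extra body precedence, can be sketched as follows. Both functions are hypothetical. The remap rule assumes every profile includes off and that a non-off value should never be silently remapped to off when a non-off choice exists; the merge rule assumes late overrides win with a shallow merge. The PRD leaves the actual rules to the implementation and its documentation.

```typescript
type ThinkingLevel = "off" | "low" | "medium" | "high" | "xhigh";

// Levels ordered from least to most effort, used to measure "nearest".
const ORDER: ThinkingLevel[] = ["off", "low", "medium", "high", "xhigh"];

// Remap a persisted level to the nearest value the active profile supports.
function remapToSupported(
  stored: ThinkingLevel,
  supported: ThinkingLevel[],
): ThinkingLevel {
  if (supported.includes(stored)) return stored;

  const nonOff = supported.filter((level) => level !== "off");
  if (stored === "off" || nonOff.length === 0) return "off";

  const target = ORDER.indexOf(stored);
  const distance = (level: ThinkingLevel) =>
    Math.abs(ORDER.indexOf(level) - target);

  // Strict "<" keeps the earlier candidate on ties.
  let best: ThinkingLevel | undefined;
  for (const candidate of nonOff) {
    if (best === undefined || distance(candidate) < distance(best)) {
      best = candidate;
    }
  }
  return best ?? "off";
}

// Shallow merge where explicit late overrides replace provider-injected fields.
function mergeLateExtraBody(
  providerBody: Record<string, unknown>,
  extraBody: Record<string, unknown> = {},
): Record<string, unknown> {
  return { ...providerBody, ...extraBody };
}
```

For example, a session that stored high on an OpenAI-style model and then switches to a binary profile supporting off and low would resolve to low, which the binary profile displays as on. Note that a shallow merge replaces a nested object such as chat_template_kwargs wholesale, which is exactly the kind of behavior the precedence tests should pin down.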

Testing Decisions

Good tests should assert observable behavior: available thinking choices, persisted off/on state, and emitted request payload fields. They should not assert private helper call order.
Test the vLLM provider profile behavior for Qwen-like models: choices should be off/on, with on backed by the existing non-off stored level.
Test default resolution for vLLM Qwen models: configured default off remains off, configured on resolves to the binary on choice, and unsupported levels remap predictably.
Test slash/directive persistence behavior: /think off and /think on should persist through session state and model run setup.
Test Pi harness integration through its public run/setup seams where practical: active thinking state should reach prepared extra params and provider wrappers.
Test vLLM Qwen payload mapping for chat-template format: off sends enable_thinking false in template kwargs; on sends true; OpenAI reasoning effort fields are absent.
Test vLLM Qwen payload mapping for top-level format: off sends enable_thinking false; on sends true; OpenAI reasoning effort fields are absent.
Test that non-reasoning vLLM models are not patched with Qwen thinking payload fields unless the model capability contract says they should be.
Test that OpenAI-style reasoning effort behavior still works for models with OpenAI reasoning compatibility.
Test that /reasoning visibility choices do not change the model-side Qwen enable_thinking value.
Reuse prior art from provider stream wrapper tests, provider thinking policy tests, directive/default resolution tests, and Pi embedded runner extra-param tests.
In Codex worktrees, use the repo's lightweight Vitest wrapper for narrow file proof and Testbox/Crabbox for broader gates.
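The payload-boundary tests described above could take roughly the following shape. This sketch is framework-free so it runs standalone; in the repo the same cases would presumably be written as Vitest cases. checkQwenPayloadContract, BuildPayload, and both mappers are hypothetical stand-ins, not the real provider seam, and reasoning_effort is assumed to be the OpenAI-style field that must be absent.

```typescript
type Level = "off" | "low";
type Format = "top-level" | "chat-template";
type Body = Record<string, unknown>;
type BuildPayload = (level: Level, format: Format) => Body;

// Read enable_thinking from wherever the given format is expected to place it.
function readEnableThinking(body: Body, format: Format): unknown {
  if (format === "top-level") return body["enable_thinking"];
  const kwargs = body["chat_template_kwargs"];
  return typeof kwargs === "object" && kwargs !== null
    ? (kwargs as Body)["enable_thinking"]
    : undefined;
}

// Check observable payload behavior only; returns human-readable failures.
function checkQwenPayloadContract(build: BuildPayload): string[] {
  const failures: string[] = [];
  const formats: Format[] = ["top-level", "chat-template"];
  const levels: Level[] = ["off", "low"];

  for (const format of formats) {
    for (const level of levels) {
      const body = build(level, format);
      const expected = level !== "off";
      if (readEnableThinking(body, format) !== expected) {
        failures.push(`${format}/${level}: enable_thinking should be ${expected}`);
      }
      if ("reasoning_effort" in body) {
        failures.push(`${format}/${level}: reasoning_effort must be absent`);
      }
    }
  }
  return failures;
}

// Stub mapper standing in for the real vLLM request mapper.
const stubMapper: BuildPayload = (level, format) =>
  format === "top-level"
    ? { enable_thinking: level !== "off" }
    : { chat_template_kwargs: { enable_thinking: level !== "off" } };

// A deliberately broken mapper that leaks an OpenAI-style effort field.
const leakyMapper: BuildPayload = (level, format) => ({
  ...stubMapper(level, format),
  reasoning_effort: "high",
});

console.log(checkQwenPayloadContract(stubMapper));
console.log(checkQwenPayloadContract(leakyMapper));
```

The checker never inspects how the mapper is built, only the emitted body, which matches the stated goal of asserting observable behavior rather than private helper call order.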

Out of Scope

Adding new model server backends beyond vLLM.
Changing Qwen model semantics to support multiple thinking levels. Qwen-style thinking is treated as binary unless the upstream model/server contract changes.
Reworking the entire OpenAI-compatible transport stack.
Changing reasoning trace rendering, summarization, or storage behavior beyond preserving the /reasoning separation.
Adding a new user-facing state model separate from existing thinking defaults and session persistence.
Removing existing vLLM Qwen config aliases without a migration plan.
Changing unrelated provider thinking policies.
Running live GPU/vLLM validation as part of this PRD. Implementation should include it where feasible.

Further Notes

The exploration found that the current architecture already has most of the needed seams: durable thinking state, provider thinking profiles, provider stream wrappers, prepared extra params, and Pi harness propagation. The main product gap is alignment: vLLM Qwen-like models need a provider-owned binary profile so the command surface, persistence, and payload behavior all describe the same capability.

The likely implementation path is small and provider-local: add vLLM thinking policy, keep the current vLLM payload mapper, add tests that prove binary off/on behavior, and document the durable config path for Qwen thinking format.
