hermes - Bug: web_crawl schema lets models auto-guess "instructions" instead of asking the user via clarify [1 pull request]


Fix Action

Fixed

Bug Description

The web_crawl tool's schema (added in #32608 via tools/web_tools.py:1566) marks instructions as optional with a neutral description:

"instructions": {
    "type": "string",
    "description": (
        "Natural-language description of what to extract "
        "from each crawled page (e.g. 'list all product "
        "pages with name and price')."
    ),
},

Observed model behavior (gpt-5.5, fresh session, message: crawl https://tamsta.lt): the model auto-formulates a value for instructions from the URL alone, without asking the user what they want extracted. Example log line from agent.log:

plugins.web.oxylabs.provider: Oxylabs crawl: ... user_prompt='Crawl the site
  and extract the main visible page content, page title, important links,
  event/product listings, dates, prices, contact details, and any
  navigation/category structure.' ...

The user's entire message was "crawl" plus a URL. The model invented a long, multi-part extraction instruction without ever asking what the user actually wanted to extract. The crawl runs and produces results, but those results may have nothing to do with what the user had in mind.

Why the current behavior is suboptimal

instructions is the goal-shaping parameter for crawl. Wrong instructions → wrong crawl. The model's auto-formulated instruction is a guess about the user's intent — sometimes right, sometimes wrong, always taken without confirmation.

For crawl specifically, getting instructions right matters more than for web_search (where query is usually the user's verbatim ask) or web_extract (where the user typically passes URLs). A crawl is a multi-page traversal — the cost of running the wrong one is multiplied across every page returned.

Steps to Reproduce

  1. Start a fresh Hermes session with any crawl-capable backend configured.
  2. Send a minimal message: crawl <url> (no instructions, no extraction goal).
  3. Check agent.log for the actual instructions passed to the crawl provider.

Observed: model invents a multi-clause instruction. The user is never asked.
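Step 3 can be partly automated. The following is a minimal sketch that pulls the forwarded instruction out of the log text. It assumes the provider logs the value as user_prompt='...' in single quotes, as in the Oxylabs line quoted earlier; the helper name extract_user_prompt is hypothetical, and other backends may log a different field name.

```python
import re
from typing import Optional

# Assumption: the value is logged as user_prompt='...' (single-quoted and
# possibly wrapped across lines), matching the Oxylabs log excerpt.
_USER_PROMPT = re.compile(r"user_prompt='(.*?)'", re.DOTALL)


def extract_user_prompt(log_text: str) -> Optional[str]:
    """Return the first user_prompt value found in the log text, or None."""
    match = _USER_PROMPT.search(log_text)
    if match is None:
        return None
    # Collapse the whitespace introduced by line wrapping in the log.
    return " ".join(match.group(1).split())
```

A value that the user never typed is the signal that the model formulated the instruction on its own. Note that the non-greedy pattern stops at the first single quote, so a prompt containing an apostrophe would be truncated.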

Expected Behavior

When the user's message doesn't specify what to extract, the model should call clarify to ask the user — not guess. The schema is the natural place to push this behavior, because it's what the model reads at tool-selection time.
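The expected behavior can be phrased as a check over the ordered tool calls of a session. This is a sketch only: it assumes the tool names are available as a plain list of strings, and clarify_precedes_crawl is a hypothetical helper, not part of Hermes.

```python
from typing import Sequence


def clarify_precedes_crawl(tool_calls: Sequence[str]) -> bool:
    """True if no web_crawl call occurs before the first clarify call."""
    asked = False
    for name in tool_calls:
        if name == "clarify":
            asked = True
        elif name == "web_crawl" and not asked:
            # The model crawled without asking the user first.
            return False
    return True
```

For the minimal "crawl <url>" message, a session that calls web_crawl directly would fail this check, while one that calls clarify and then web_crawl would pass. A session where the user already stated an extraction goal should not be held to this check.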

Proposed Fix

Tighten the description of the instructions field to be directive about asking the user when their goal isn't explicit:

"instructions": {
    "type": "string",
    "description": (
        "What to extract from each crawled page (e.g. 'list all "
        "product pages with name and price'). Must reflect the "
        "user's stated goal. If the user did not say what to extract, "
        "ASK them via the `clarify` tool before calling — do not "
        "guess on their behalf."
    ),
},

Schema-level wording is more effective than a runtime error envelope because the model reads it during tool selection (before deciding whether to call), not just after a failed call.
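Because the fix lives entirely in description text, a later edit could silently revert it. One way to guard the wording is a small check like the sketch below. The schema shape, the REQUIRED_PHRASES tuple, and the helper name are assumptions for illustration; only the description string is taken from this issue (here, the current neutral one).

```python
from typing import List

# Phrases taken from the proposed description; treating them as the
# wording that must survive is an assumption of this sketch.
REQUIRED_PHRASES = ("`clarify`", "do not guess")

# Hypothetical minimal schema shape carrying the current neutral description.
NEUTRAL_SCHEMA = {
    "properties": {
        "instructions": {
            "type": "string",
            "description": (
                "Natural-language description of what to extract "
                "from each crawled page (e.g. 'list all product "
                "pages with name and price')."
            ),
        },
    },
}


def missing_directive_phrases(schema: dict) -> List[str]:
    """List the required phrases absent from the instructions description."""
    description = schema["properties"]["instructions"]["description"]
    return [phrase for phrase in REQUIRED_PHRASES if phrase not in description]
```

Run against the neutral schema, the helper reports both phrases as missing; with the proposed description it should return an empty list.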

Cross-backend impact assessment

Three providers support crawl: Firecrawl, Tavily, Oxylabs.

| Provider | API behavior for instructions | Effect of schema directive |
| --- | --- | --- |
| Tavily | Optional pass-through (provider.py:251-252). If present, sent to Tavily API; if absent, vendor defaults apply. | Strict improvement: user-supplied instructions give Tavily a sharper target than its default. |
| Oxylabs | Required by API. Without it, the plugin's crawl() short-circuits with a typed hint. | Strict improvement: no wasted credits on instructions-less calls; user in control. |
| Firecrawl | The current plugins/web/firecrawl/provider.py:610-619 logs and drops instructions, based on a comment claiming /crawl doesn't accept natural-language guidance. That assumption is stale (see sidenote below). | Asking the user a question Firecrawl currently ignores is a mild UX cost (one wasted turn). Becomes strict improvement if the Firecrawl plugin is updated to forward instructions. |
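The three behaviors can be modeled in one function, which makes the asymmetry easier to see. This is a simplified model, not the plugins' actual code: plan_crawl, the returned dictionary shape, and the hint text are all hypothetical. The user_prompt field name comes from the Oxylabs log line in this issue, and the Tavily payload key is assumed to be instructions.

```python
from typing import Any, Dict, Optional


def plan_crawl(provider: str, url: str, instructions: Optional[str]) -> Dict[str, Any]:
    """Model what each backend currently does with instructions."""
    text = (instructions or "").strip()
    if provider == "tavily":
        # Optional pass-through: sent only when present.
        payload: Dict[str, Any] = {"url": url}
        if text:
            payload["instructions"] = text
        return {"action": "call", "payload": payload}
    if provider == "oxylabs":
        # Required by the API: short-circuit instead of spending credits.
        if not text:
            return {"action": "hint", "hint": "instructions is required"}
        return {"action": "call", "payload": {"url": url, "user_prompt": text}}
    if provider == "firecrawl":
        # Current behavior: instructions are logged and dropped.
        return {"action": "call", "payload": {"url": url}, "dropped": bool(text)}
    raise ValueError(f"unknown provider: {provider}")
```

Only the Firecrawl branch discards what the user said, which is why the schema directive carries a UX cost there until the plugin forwards the value.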

Sidenote: Firecrawl plugin could honor instructions today

Not in scope for this issue (this is a schema change, not a Firecrawl plugin change), but worth flagging since it removes the "asks a useless question" UX cost from the Firecrawl row above.

The pinned SDK (firecrawl-py==4.17.0) exposes a prompt parameter on Firecrawl.crawl() (the v2 client):

# firecrawl.v2.FirecrawlClient.crawl signature
def crawl(self, url: str, *,
    prompt: Optional[str] = None,   # ← natural-language guidance
    exclude_paths: Optional[List[str]] = None,
    ...
)

The SDK also ships a crawl_params_preview(url, prompt) helper whose docstring is "Derive crawl parameters from natural-language prompt." — explicit confirmation that /crawl accepts prompts in the current API. Per-version: v2 has prompt; v1 (firecrawl.v1.V1FirecrawlApp) does not. Hermes uses v2 (from firecrawl import Firecrawl at provider.py:88 resolves to the v2 client).
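Rather than relying on a comment or a pinned version number, a plugin could confirm at runtime that the installed client's crawl method accepts prompt. The sketch below uses only the standard library; accepts_keyword and crawl_stub are hypothetical names, and the stub merely mirrors the signature quoted above so the example runs without the SDK installed.

```python
import inspect
from typing import Callable


def accepts_keyword(func: Callable, name: str) -> bool:
    """True if func can be called with name=... as a keyword argument."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind in (
            inspect.Parameter.KEYWORD_ONLY,
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
        )
    # A **kwargs catch-all would also accept the keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-in mirroring the quoted v2 signature; against a real install one
# would pass the client's bound crawl method instead.
def crawl_stub(url: str, *, prompt=None, exclude_paths=None):
    return None
```

With the stub, accepts_keyword(crawl_stub, "prompt") is true and accepts_keyword(crawl_stub, "instructions") is false, which is the distinction the stale comment in the plugin misses.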

The "log + drop" behavior in the plugin was likely correct against an older SDK; the comment hasn't been updated. The one-line fix would be:

# in plugins/web/firecrawl/provider.py, where crawl_params is built
if instructions:
    crawl_params["prompt"] = instructions

…instead of the current log-and-discard. With that change, all three crawl backends would uniformly honor instructions, and the schema directive proposed in this issue becomes a strict improvement everywhere.
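A slightly fuller sketch of the same idea, isolated as a function so it can be tested on its own. The helper build_crawl_params and the limit argument are assumptions for illustration; the real provider builds crawl_params inline and may include other keys. The only behavior taken from this issue is setting "prompt" when instructions is present.

```python
from typing import Any, Dict, Optional


def build_crawl_params(limit: int, instructions: Optional[str] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for the crawl call (hypothetical helper)."""
    crawl_params: Dict[str, Any] = {"limit": limit}
    # Forward only non-blank instructions so an empty string never
    # reaches the API as a prompt.
    if instructions and instructions.strip():
        crawl_params["prompt"] = instructions.strip()
    return crawl_params
```

Skipping blank strings goes one step beyond the one-line fix; it keeps a whitespace-only value from being forwarded as if the user had stated a goal.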

Happy to bundle the Firecrawl fix with the schema PR if maintainers prefer, or leave it as a follow-up.

Impact

  • Without this change: crawl results reflect the model's guess about user intent, not the user's actual goal. Costly per-page (and per-credit for paid backends) when the guess is wrong.
  • With this change: one extra round-trip (model asks via clarify, user answers, model calls with the answer) — but every crawl is intentional. Matches what clarify was built for.
  • Adjacent benefit: if/when the Firecrawl plugin forwards prompt (see sidenote), the schema directive benefits all three crawl backends uniformly.

Notes

  • Surfaced while integrating a new Oxylabs plugin (#32600). The Oxylabs API rejects empty user_prompt with HTTP 400; the plugin's short-circuit returns a typed hint when instructions is missing. But because the model auto-formulates instructions on the first call, the short-circuit rarely fires — the model just guesses. Plugin-level fixes can't override what the model reads at tool-selection time; only the schema can.
  • This is purely a schema text change — no new code, no behavior changes for callers who already pass instructions. Strictly a model-direction improvement.

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR
