openclaw - ✅(Solved) Fix [Bug]: web_fetch returns mojibake for non-UTF-8 pages [2 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#72916 · Fetched 2026-04-28 06:30:19
Comments: 0 · Participants: 1 · Timeline events: 3 · Reactions: 0
Timeline (top): cross-referenced ×2, labeled ×1

web_fetch appears to decode HTTP response bodies as UTF-8 unconditionally. Pages encoded with legacy charsets such as Shift_JIS, Big5, or GBK come back as mojibake / replacement characters instead of readable text.

Root Cause

web_fetch appears to decode HTTP response bodies as UTF-8 unconditionally. Pages encoded with legacy charsets such as Shift_JIS, Big5, or GBK come back as mojibake / replacement characters instead of readable text.

Fix Action

Fixed

PR fix notes

PR #73103: Fix web_fetch legacy charset decoding

Description (problem / solution / changelog)

Summary

  • Decode web_fetch responses from raw bytes with charset detection from Content-Type
  • Sniff a bounded first 4KB HTML meta charset before falling back to UTF-8
  • Keep bounded response reads by accumulating capped bytes before decoding
  • Add regressions for ISO-8859-1 via Content-Type and HTML <meta charset>

Fixes #72916.
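
The "accumulating capped bytes before decoding" point in the summary can be illustrated with a small buffer that collects raw chunks up to a limit and only hands them to a decoder at the end. This is a minimal sketch, not the PR's code: the class name CappedByteBuffer and its methods are hypothetical.

```typescript
// Hypothetical helper (not taken from the PR): collects raw chunks up to a byte cap.
class CappedByteBuffer {
  private readonly maxBytes: number;
  private chunks: Uint8Array[] = [];
  private size = 0;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  /** Appends a chunk, trimming it at the cap. Returns false once the cap is reached. */
  push(chunk: Uint8Array): boolean {
    const remaining = this.maxBytes - this.size;
    if (remaining <= 0) return false;
    const slice = chunk.byteLength > remaining ? chunk.subarray(0, remaining) : chunk;
    this.chunks.push(slice);
    this.size += slice.byteLength;
    return this.size < this.maxBytes;
  }

  /** Concatenates everything collected so far into one contiguous array. */
  toBytes(): Uint8Array {
    const out = new Uint8Array(this.size);
    let offset = 0;
    for (const c of this.chunks) {
      out.set(c, offset);
      offset += c.byteLength;
    }
    return out;
  }
}

// Usage: feed chunks (for example from a response body reader), stop at the cap.
const buffer = new CappedByteBuffer(5);
const incoming = [new Uint8Array([1, 2, 3]), new Uint8Array([4, 5, 6]), new Uint8Array([7])];
for (const chunk of incoming) {
  if (!buffer.push(chunk)) break;
}
const capped = buffer.toBytes();
```

Decoding once over the collected bytes, rather than chunk by chunk, means the charset can be chosen after the first bytes have been inspected. Note that the cap may cut through a multi-byte character at the very end of the buffer.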

Validation

  • pnpm exec oxfmt --write --threads=1 src/agents/tools/web-shared.ts src/agents/tools/web-tools.fetch.test.ts
  • pnpm exec vitest run --config test/vitest/vitest.agents.config.ts src/agents/tools/web-tools.fetch.test.ts (16 tests passed)
  • pnpm exec tsgo --ignoreConfig --noEmit --pretty false --target ES2022 --module NodeNext --moduleResolution NodeNext --skipLibCheck src/agents/tools/web-shared.ts

Note: full pnpm exec tsgo -p tsconfig.core.test.json --noEmit --pretty false currently fails on upstream src/plugins/contracts/host-hooks.contract.test.ts missing ../../../test/helpers/plugins/contracts-testkit.js plus implicit-any errors; unrelated to this scoped change.

Changed files

  • src/agents/tools/web-shared.ts (modified, +76/-9)
  • src/agents/tools/web-tools.fetch.test.ts (modified, +58/-0)

PR #8: fix(web-fetch): detect response charset from Content-Type and HTML meta

Description (problem / solution / changelog)

Summary

  • web_fetch decoded all HTTP response bodies as UTF-8 unconditionally, producing mojibake for legacy-charset pages (Shift_JIS, Big5, GBK, ISO-8859-1, etc.)
  • Root cause: readResponseText() used new TextDecoder() (UTF-8) in the streaming path and res.text() (also UTF-8 per WHATWG Fetch spec) in the non-streaming fallback -- neither respects the declared charset
  • Fix: collect raw bytes before decoding; resolve charset from the Content-Type: charset= parameter; if absent and content is HTML, scan the first 4 KB for a <meta charset> or <meta http-equiv="Content-Type" content="...charset=..."> declaration; decode with TextDecoder(detectedCharset), falling back to UTF-8 for unknown/missing labels

Files changed

  • src/agents/tools/web-shared.ts -- charset helpers + reworked streaming and fallback decode paths
  • src/agents/tools/web-shared.charset.test.ts -- 7 new regression tests (all pass)

Test plan

  • 7 new tests in web-shared.charset.test.ts covering: Content-Type charset, HTML meta charset, http-equiv meta, UTF-8 fallback, non-HTML content, maxBytes truncation with charset
  • All 26 existing web-fetch tests still pass
  • pnpm check clean

Closes #72916


Generated by Claude Code

Changed files

  • src/agents/tools/web-shared.charset.test.ts (added, +76/-0)
  • src/agents/tools/web-shared.ts (modified, +70/-5)
  • src/cron/service.armtimer-tight-loop.test.ts (modified, +112/-0)
  • src/cron/service/timer.ts (modified, +3/-0)


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

web_fetch appears to decode HTTP response bodies as UTF-8 unconditionally. Pages encoded with legacy charsets such as Shift_JIS, Big5, or GBK come back as mojibake / replacement characters instead of readable text.

Suspected cause

The shared response reader appears to decode as UTF-8 by default.

In src/agents/tools/web-shared.ts, readResponseText() uses:

const decoder = new TextDecoder();

and later:

const text = await res.text();

Both paths default to UTF-8. This causes non-UTF-8 pages to be decoded incorrectly before HTML extraction/readability processing.
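
The effect can be reproduced without any network access. The sketch below decodes the Shift_JIS bytes for "日本" twice, once with the default decoder and once with an explicit label. It assumes a runtime whose TextDecoder supports the shift_jis label (for Node.js, a build with full ICU data); the variable names are illustrative only.

```typescript
// Shift_JIS encoding of the two characters "日本".
const sjisBytes = new Uint8Array([0x93, 0xfa, 0x96, 0x7b]);

// new TextDecoder() defaults to UTF-8, so the invalid bytes become U+FFFD.
const asUtf8 = new TextDecoder().decode(sjisBytes);

// Passing the correct label recovers the original text.
const asShiftJis = new TextDecoder("shift_jis").decode(sjisBytes);

console.log(JSON.stringify(asUtf8)); // replacement characters plus a stray "{"
console.log(asShiftJis);
```

The stray ASCII character appears because the second byte of a Shift_JIS character can fall in the ASCII range, which matches the mix of replacement characters and ASCII letters in the reported output.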

Suggested fix

Decode from raw bytes instead of calling res.text() directly:

  1. Read response as ArrayBuffer / raw bytes.
  2. Detect charset from Content-Type, e.g.:
     Content-Type: text/html; charset=Shift_JIS
  3. If missing, scan the first few KB of HTML for:
     <meta charset="...">
     or:
     <meta http-equiv="Content-Type" content="text/html; charset=...">
  4. Decode with:
     new TextDecoder(charset)
  5. Fall back to UTF-8 only if no charset can be determined.
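
The steps above can be sketched as three small functions. This is an illustration under stated assumptions, not the code that was merged: the names charsetFromContentType, charsetFromHtmlPrefix and decodeBody, and the 4096-byte sniff limit, are hypothetical, and the regular expressions are simplified compared with a full HTML encoding-sniffing algorithm.

```typescript
const SNIFF_BYTES = 4096; // "first few KB"; the exact limit is an assumption

/** Extracts the charset parameter from a Content-Type header value, if present. */
function charsetFromContentType(header: string | null): string | undefined {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(header ?? "");
  return match?.[1];
}

/** Looks for a charset declaration inside a <meta> tag in the leading bytes. */
function charsetFromHtmlPrefix(bytes: Uint8Array): string | undefined {
  // Charset declarations are ASCII, so map each byte to one char for matching.
  let prefix = "";
  const end = Math.min(bytes.length, SNIFF_BYTES);
  for (let i = 0; i < end; i++) prefix += String.fromCharCode(bytes[i]);
  // Matches both <meta charset="x"> and <meta http-equiv=... content="...; charset=x">.
  const match = /<meta[^>]+charset\s*=\s*["']?\s*([\w.:-]+)/i.exec(prefix);
  return match?.[1];
}

/** Decodes raw bytes with the declared charset, falling back to UTF-8. */
function decodeBody(bytes: Uint8Array, contentType: string | null): string {
  const label = charsetFromContentType(contentType) ?? charsetFromHtmlPrefix(bytes) ?? "utf-8";
  let decoder: TextDecoder;
  try {
    decoder = new TextDecoder(label); // throws RangeError for unknown labels
  } catch {
    decoder = new TextDecoder("utf-8");
  }
  return decoder.decode(bytes);
}
```

For simplicity the sketch sniffs the body for any content type when the header carries no charset; restricting the scan to HTML responses would be a small extra check on the Content-Type value. The try/catch matters because new TextDecoder(label) throws on a label it does not recognise.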

Steps to reproduce

Use a known Shift_JIS page:

http://www.aozora.gr.jp/cards/000081/files/46268_23911.html

Call:

web_fetch({
  url: "http://www.aozora.gr.jp/cards/000081/files/46268_23911.html",
  extractMode: "text"
})

Expected behavior

The page should be decoded according to its declared charset and return readable Japanese text.

Actual behavior

Output contains mojibake, for example:

�{�V���� ���߂���̉�
...

OpenClaw version

2026.04.24

Operating system

ubuntu 24.04.4 LTS

Install method

No response

Model

GPT-5.4

Provider / routing chain

Telegram → OpenClaw Gateway → model router / OpenAI API → gpt-5.4

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response

extent analysis

TL;DR

Decode HTTP response bodies using the detected charset instead of defaulting to UTF-8 to fix mojibake issues with non-UTF-8 encoded pages.

Guidance

  • Detect the charset from the Content-Type header or the HTML meta tags to determine the correct encoding.
  • Read the response as ArrayBuffer or raw bytes instead of calling res.text() directly to avoid default UTF-8 decoding.
  • Use new TextDecoder(charset) to decode the response body with the detected charset, falling back to UTF-8 only if no charset can be determined.
  • Verify the fix by checking the output of the web_fetch function with a known non-UTF-8 encoded page, such as the provided Shift_JIS example.
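
One rough way to automate the verification step above is to measure how much of the returned text consists of the Unicode replacement character U+FFFD. The helper below is a hypothetical check, not part of web_fetch, and any threshold applied to its result would be an assumption.

```typescript
/** Fraction of UTF-16 code units in `text` that are U+FFFD (0 for empty input). */
function replacementCharRatio(text: string): number {
  if (text.length === 0) return 0;
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === 0xfffd) count++;
  }
  return count / text.length;
}

// A correctly decoded page should score at or near 0.
const cleanScore = replacementCharRatio("日本");
// Text decoded with the wrong charset tends to score much higher.
const brokenScore = replacementCharRatio("\uFFFD\uFFFD\uFFFD{");
```

This only detects failures that produce replacement characters. Bytes decoded with the wrong single-byte charset yield valid but incorrect characters and would not be flagged.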

Example

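const response = await fetch(url);
const contentType = response.headers.get('Content-Type');
// Read raw bytes first so nothing is implicitly decoded as UTF-8.
const bytes = new Uint8Array(await response.arrayBuffer());
const charset = getCharsetFromContentType(contentType) || getCharsetFromHtml(bytes) || 'utf-8';
let decoder;
try {
  decoder = new TextDecoder(charset); // throws RangeError for an unknown label
} catch {
  decoder = new TextDecoder('utf-8');
}
const text = decoder.decode(bytes);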

Notes

This fix assumes that the getCharsetFromContentType and getCharsetFromHtml functions are implemented to extract the charset from the Content-Type header and HTML meta tags, respectively.

Recommendation

Decode the response body with the detected charset instead of unconditionally as UTF-8, since the unconditional UTF-8 decoding is the root cause of the mojibake. The issue is marked as fixed, and both linked PRs take this approach.
