openclaw - ✅(Solved) Fix [Bug]: web_fetch returns mojibake for non-UTF-8 pages [2 pull requests, 1 participants]

nickyhk · 2026-04-27T16:29:12Z



GitHub stats

openclaw/openclaw#72916 • Fetched 2026-04-28 06:30:19

Author: nickyhk
Participants: nickyhk
Timeline (top): cross-referenced ×2, labeled ×1

web_fetch appears to decode HTTP response bodies as UTF-8 unconditionally. Pages encoded with legacy charsets such as Shift_JIS, Big5, GBK, etc. return mojibake / replacement characters instead of readable text.

Fixed

Fixed by PR: Fix web_fetch legacy charset decoding (https://github.com/openclaw/openclaw/pull/73103)

PR fix notes

PR #73103: Fix web_fetch legacy charset decoding

Repository: openclaw/openclaw
Author: pfrederiksen
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/73103

Description (problem / solution / changelog)

Summary

Decode web_fetch responses from raw bytes with charset detection from Content-Type
Sniff the first 4 KB of HTML for a meta charset declaration before falling back to UTF-8
Keep bounded response reads by accumulating capped bytes before decoding
Add regressions for ISO-8859-1 via Content-Type and HTML <meta charset>

Fixes #72916.
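
The "accumulate capped bytes before decoding" item above could look roughly like the sketch below. It is not the code from PR #73103; `readCappedBytes` and its return shape are hypothetical names chosen for illustration, and only the standard `ReadableStream` reader API is assumed.

```typescript
// Hypothetical helper: collect at most maxBytes from a byte stream, undecoded.
async function readCappedBytes(
  body: ReadableStream<Uint8Array>,
  maxBytes: number,
): Promise<{ bytes: Uint8Array; truncated: boolean }> {
  const reader = body.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;
  let truncated = false;

  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (total + value.byteLength > maxBytes) {
      // Keep only the part of this chunk that still fits under the cap.
      chunks.push(value.subarray(0, maxBytes - total));
      total = maxBytes;
      truncated = true;
      break;
    }
    chunks.push(value);
    total += value.byteLength;
  }
  if (truncated) await reader.cancel(); // stop pulling data we will not use

  // Join the chunks into one buffer so charset detection sees contiguous bytes.
  const bytes = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    bytes.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return { bytes, truncated };
}
```

Keeping the data as bytes until the charset is known means the cap is applied in bytes rather than characters, so a truncated multi-byte sequence at the cut point is still possible and is left to the decoder to handle.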

Validation

pnpm exec oxfmt --write --threads=1 src/agents/tools/web-shared.ts src/agents/tools/web-tools.fetch.test.ts
pnpm exec vitest run --config test/vitest/vitest.agents.config.ts src/agents/tools/web-tools.fetch.test.ts (16 tests passed)
pnpm exec tsgo --ignoreConfig --noEmit --pretty false --target ES2022 --module NodeNext --moduleResolution NodeNext --skipLibCheck src/agents/tools/web-shared.ts

Note: full pnpm exec tsgo -p tsconfig.core.test.json --noEmit --pretty false currently fails on upstream src/plugins/contracts/host-hooks.contract.test.ts missing ../../../test/helpers/plugins/contracts-testkit.js plus implicit-any errors; unrelated to this scoped change.

Changed files

src/agents/tools/web-shared.ts (modified, +76/-9)
src/agents/tools/web-tools.fetch.test.ts (modified, +58/-0)

PR #8: fix(web-fetch): detect response charset from Content-Type and HTML meta

Repository: suboss87/openclaw
Author: suboss87
State: open | merged: False
Link: https://github.com/suboss87/openclaw/pull/8

Description (problem / solution / changelog)

Summary

web_fetch decoded all HTTP response bodies as UTF-8 unconditionally, producing mojibake for legacy-charset pages (Shift_JIS, Big5, GBK, ISO-8859-1, etc.)
Root cause: readResponseText() used new TextDecoder() (UTF-8) in the streaming path and res.text() (also UTF-8 per WHATWG Fetch spec) in the non-streaming fallback -- neither respects the declared charset
Fix: collect raw bytes before decoding; resolve charset from the Content-Type: charset= parameter; if absent and content is HTML, scan the first 4 KB for a <meta charset> or <meta http-equiv="Content-Type" content="...charset=..."> declaration; decode with TextDecoder(detectedCharset), falling back to UTF-8 for unknown/missing labels
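
A minimal sketch of the charset resolution order described above: header first, then a bounded HTML meta scan. The function names, the regular expressions, and the HTML content-type check are assumptions for illustration and are not taken from either PR.

```typescript
const SNIFF_BYTES = 4096; // the 4 KB bound mentioned in both PRs

// Pull the charset parameter out of a Content-Type header value, if present.
function charsetFromContentType(contentType: string | null): string | undefined {
  if (!contentType) return undefined;
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);
  return match?.[1]?.toLowerCase();
}

// Look for <meta charset=...> or the http-equiv form in the first 4 KB.
function charsetFromHtmlPrefix(bytes: Uint8Array): string | undefined {
  // A single-byte decoder keeps ASCII markup readable whatever the real
  // encoding is (UTF-16 pages aside), which is enough to find the declaration.
  const head = new TextDecoder("latin1").decode(bytes.subarray(0, SNIFF_BYTES));
  const match = /<meta[^>]+charset\s*=\s*["']?([^"'\s\/>;]+)/i.exec(head);
  return match?.[1]?.toLowerCase();
}

// Header wins; the meta scan only runs for HTML responses without a header charset.
function resolveCharset(contentType: string | null, bytes: Uint8Array): string | undefined {
  const fromHeader = charsetFromContentType(contentType);
  if (fromHeader) return fromHeader;
  const isHtml = /^\s*(text\/html|application\/xhtml\+xml)\b/i.test(contentType ?? "");
  return isHtml ? charsetFromHtmlPrefix(bytes) : undefined;
}
```

Returning `undefined` rather than `"utf-8"` leaves the fallback decision to the caller, which matches the "fall back to UTF-8 for unknown/missing labels" step in the summary.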

Files changed

src/agents/tools/web-shared.ts -- charset helpers + reworked streaming and fallback decode paths
src/agents/tools/web-shared.charset.test.ts -- 7 new regression tests (all pass)

Test plan

7 new tests in web-shared.charset.test.ts covering: Content-Type charset, HTML meta charset, http-equiv meta, UTF-8 fallback, non-HTML content, maxBytes truncation with charset
All 26 existing web-fetch tests still pass
pnpm check clean

Closes #72916

Generated by Claude Code

Changed files

src/agents/tools/web-shared.charset.test.ts (added, +76/-0)
src/agents/tools/web-shared.ts (modified, +70/-5)
src/cron/service.armtimer-tight-loop.test.ts (modified, +112/-0)
src/cron/service/timer.ts (modified, +3/-0)


Bug type

Behavior bug (incorrect output/state without crash)

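Beta release blocker

No

Summary

web_fetch appears to decode HTTP response bodies as UTF-8 unconditionally. Pages encoded with legacy charsets such as Shift_JIS, Big5, GBK, etc. return mojibake / replacement characters instead of readable text.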

Suspected cause

The shared response reader appears to decode as UTF-8 by default.

In src/agents/tools/web-shared.ts, readResponseText() uses:

const decoder = new TextDecoder();

and later:

const text = await res.text();

Both paths default to UTF-8. This causes non-UTF-8 pages to be decoded incorrectly before HTML extraction/readability processing.
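
The effect can be reproduced without any network access. The snippet below is an illustration, not code from the repository; it assumes a runtime whose `TextDecoder` supports the `shift_jis` label (for example, a Node.js build with full ICU data).

```typescript
// "日本" encoded as Shift_JIS: 0x93 0xFA (日), 0x96 0x7B (本).
const shiftJisBytes = new Uint8Array([0x93, 0xfa, 0x96, 0x7b]);

// What the reported code paths effectively do: decode as UTF-8.
const asUtf8 = new TextDecoder().decode(shiftJisBytes);

// Decoding with the declared charset instead.
const asShiftJis = new TextDecoder("shift_jis").decode(shiftJisBytes);

console.log(JSON.stringify(asUtf8)); // contains U+FFFD replacement characters
console.log(asShiftJis);
```

The trailing byte 0x7B is ASCII `{`, so it survives UTF-8 decoding while the other bytes become replacement characters. That is consistent with the `�{` fragment in the output reported in this issue.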

Suggested fix

Decode from raw bytes instead of calling res.text() directly:

1. Read response as ArrayBuffer / raw bytes.
2. Detect charset from Content-Type, e.g.:

   Content-Type: text/html; charset=Shift_JIS

3. If missing, scan the first few KB of HTML for:

   <meta charset="...">

   or:

   <meta http-equiv="Content-Type" content="text/html; charset=...">

4. Decode with:

   new TextDecoder(charset)

5. Fall back to UTF-8 only if no charset can be determined.
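
One detail the steps above leave implicit is that the `TextDecoder` constructor throws a `RangeError` for a label it does not recognise, so a declared but unsupported charset needs the same UTF-8 fallback as a missing one. A minimal sketch, with `decodeWithFallback` as a hypothetical name:

```typescript
// Hypothetical helper: decode bytes with a declared label, falling back to UTF-8.
function decodeWithFallback(bytes: Uint8Array, label: string | undefined): string {
  let decoder: TextDecoder;
  try {
    // Throws a RangeError when the label is not a known encoding.
    decoder = new TextDecoder(label ?? "utf-8");
  } catch {
    decoder = new TextDecoder("utf-8");
  }
  return decoder.decode(bytes);
}

// "café" in ISO-8859-1: the final byte 0xE9 is é.
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);
console.log(decodeWithFallback(latin1Bytes, "iso-8859-1"));
```

Whether an unknown label should fall back silently or be surfaced to the caller is a design choice; this sketch falls back silently, as the suggested fix describes.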

Steps to reproduce

Use a known Shift_JIS page:

http://www.aozora.gr.jp/cards/000081/files/46268_23911.html

Call:

web_fetch({
  url: "http://www.aozora.gr.jp/cards/000081/files/46268_23911.html",
  extractMode: "text"
})

Expected behavior

The page should be decoded according to its declared charset and return readable Japanese text.

Actual behavior

Output contains mojibake, for example:

�{�V���� ���߂���̉�
...

OpenClaw version

2026.04.24

Operating system

ubuntu 24.04.4 LTS

Install method

No response

Model

GPT-5.4

Provider / routing chain

Telegram → OpenClaw Gateway → model router / OpenAI API → gpt-5.4

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response

extent analysis

TL;DR

Decode HTTP response bodies using the detected charset instead of defaulting to UTF-8 to fix mojibake issues with non-UTF-8 encoded pages.

Guidance

Detect the charset from the Content-Type header or the HTML meta tags to determine the correct encoding.
Read the response as ArrayBuffer or raw bytes instead of calling res.text() directly to avoid default UTF-8 decoding.
Use new TextDecoder(charset) to decode the response body with the detected charset, falling back to UTF-8 only if no charset can be determined.
Verify the fix by checking the output of the web_fetch function with a known non-UTF-8 encoded page, such as the provided Shift_JIS example.

Example

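const response = await fetch(url);
const contentType = response.headers.get('Content-Type');
// Read raw bytes first so the HTML sniff can inspect them before decoding.
const bytes = new Uint8Array(await response.arrayBuffer());
// getCharsetFromContentType and getCharsetFromHtml are assumed helpers (see Notes).
const charset = getCharsetFromContentType(contentType) || getCharsetFromHtml(bytes) || 'utf-8';
const decoder = new TextDecoder(charset); // throws RangeError on an unknown label
const text = decoder.decode(bytes);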

Notes

This fix assumes that the getCharsetFromContentType and getCharsetFromHtml functions are implemented to extract the charset from the Content-Type header and HTML meta tags, respectively.

Recommendation

Decode the response body with the detected charset. The root cause of the mojibake is that response bodies are decoded as UTF-8 unconditionally.
