hermes - ✅(Solved) Fix FTS5 unicode61 tokenizer silently drops CJK characters, LIKE fallback only triggers on zero results [1 pull request, 1 participant]

GitHub stats
NousResearch/hermes-agent#14829 (fetched 2026-04-24 06:14:34)
Comments: 0 · Participants: 1 · Timeline: 4 · Reactions: 1
Timeline (top): labeled ×3, cross-referenced ×1

The search_messages() method in hermes_state.py uses FTS5 with the default unicode61 tokenizer for session search. This tokenizer silently drops many CJK characters, causing Chinese/Japanese/Korean queries to return incomplete results. The existing LIKE fallback only activates when FTS5 returns zero matches, so it misses the common case where FTS5 returns some results but misses many others.

Root Cause

Two compounding issues:

1. FTS5 unicode61 drops CJK characters

The unicode61 tokenizer does not properly tokenize CJK characters — many are silently discarded as if they were punctuation. This is a known SQLite limitation. Example from a real database:

| Query | FTS5 matches | LIKE matches | Coverage |
|---|---|---|---|
| 昨晚 | 2 | 16 | 12.5% |
| 半夜 | 0 | 2 | 0% |
| 中欧红利 | 37 | 211 | 17.5% |

Individual character analysis shows certain CJK chars are completely absent from the FTS5 index:

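| Character | FTS5 hits | LIKE hits | Status |
|---|---|---|---|
| (not preserved) | 0 | 169 | ❌ Dropped |
| (not preserved) | 0 | 133 | ❌ Dropped |
| (not preserved) | 0 | 1358 | ❌ Dropped |
| (not preserved) | 25266 (FTS5 and LIKE counts fused; split unclear) | | ⚠️ Partial |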
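
The coverage gap in the first table can be reproduced in a few lines. The following is a minimal sketch, assuming Python's bundled sqlite3 module was built with FTS5; the table name msgs and the three sample rows are made up for illustration and are not taken from hermes_state.py.

```python
import sqlite3

# In-memory database with an FTS5 table using the default unicode61 tokenizer.
# Assumption: the local SQLite build includes FTS5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE msgs USING fts5(content, tokenize='unicode61')")
conn.executemany(
    "INSERT INTO msgs(content) VALUES (?)",
    [("昨晚",), ("昨晚我们讨论了部署",), ("我昨晚没睡好",)],
)


def count_hits(sql, param):
    """Run a COUNT(*) query with one bound parameter and return the integer."""
    return conn.execute(sql, (param,)).fetchone()[0]


# Double quotes make FTS5 treat the term as a string rather than query syntax.
fts_hits = count_hits("SELECT count(*) FROM msgs WHERE msgs MATCH ?", '"昨晚"')
# Substring match over the raw text, independent of how it was tokenized.
like_hits = count_hits("SELECT count(*) FROM msgs WHERE content LIKE ?", "%昨晚%")

print(f"FTS5: {fts_hits}, LIKE: {like_hits}")
```

LIKE matches all three rows regardless of tokenization. MATCH can only hit rows where the tokenizer produced a token equal to the query, so it is expected to report fewer.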

2. LIKE fallback condition is too narrow

Current logic (line 1248):

if not matches and self._contains_cjk(query):
    # LIKE fallback
    ...

This only triggers when FTS5 returns zero results. But as shown above, FTS5 often returns some results for CJK queries — just far fewer than it should. The fallback is never reached in those cases.
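
To make the gap concrete, the sketch below compares the zero-result condition with an always-supplement condition for CJK queries. contains_cjk() is a hypothetical stand-in for _contains_cjk(), whose real implementation is not shown here, and the Unicode ranges it checks are an assumption, not an exhaustive list.

```python
def contains_cjk(text: str) -> bool:
    """Hypothetical stand-in for _contains_cjk(): True if any character
    falls in a few common CJK ranges (not exhaustive)."""
    ranges = (
        (0x4E00, 0x9FFF),  # CJK Unified Ideographs
        (0x3040, 0x30FF),  # Hiragana and Katakana
        (0xAC00, 0xD7AF),  # Hangul Syllables
    )
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in ranges)


def like_runs_old(matches: list, query: str) -> bool:
    # Zero-result fallback: LIKE runs only when FTS5 found nothing.
    return not matches and contains_cjk(query)


def like_runs_new(matches: list, query: str) -> bool:
    # Supplement: LIKE runs for every CJK query, whatever FTS5 returned.
    return contains_cjk(query)


partial = [{"id": 1}, {"id": 2}]  # e.g. FTS5 found 2 rows where LIKE finds 16

print(like_runs_old(partial, "昨晚"), like_runs_new(partial, "昨晚"))  # False True
print(like_runs_old([], "昨晚"), like_runs_new([], "昨晚"))            # True True
print(like_runs_old(partial, "deploy"), like_runs_new(partial, "deploy"))  # False False
```

The first line of output is the problematic case: a partial FTS5 result suppresses the fallback under the old condition.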

Fix Action

Fixed

PR fix notes

PR #14842: fix(session_search): supplement FTS5 with LIKE for CJK partial results

Description (problem / solution / changelog)

Summary

Fixes #14829

FTS5's unicode61 tokenizer silently drops certain CJK characters, causing queries like 昨晚 to return only a fraction of actual matches. The existing LIKE fallback (added in 8826d9c for #11511) only triggers when FTS5 returns zero results, but the more common case is FTS5 returning some results while missing many others.

Changes

  • hermes_state.py: Change the LIKE path from a zero-result fallback to an always-run supplement for CJK queries. Results are merged with deduplication by message id, preserving FTS5 results while LIKE fills in the gaps.
  • tests/test_hermes_state.py: Add two regression tests:
    • test_cjk_partial_fts5_results_supplemented_by_like: verifies LIKE supplements partial FTS5 results
    • test_cjk_like_dedup_no_duplicates: verifies no duplicate results when both FTS5 and LIKE match
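
The merge described in the hermes_state.py change can be sketched as follows. This is an illustration only, not the code from the PR: merge_by_message_id() is a hypothetical helper, and it assumes each result row is dict-like with an "id" key.

```python
def merge_by_message_id(fts_rows, like_rows):
    """Return FTS5 rows first, then LIKE rows whose id was not already seen."""
    merged = []
    seen = set()
    for row in list(fts_rows) + list(like_rows):
        if row["id"] in seen:
            continue  # already contributed by FTS5 (or an earlier LIKE row)
        seen.add(row["id"])
        merged.append(row)
    return merged


# Made-up sample rows: ids 3 and 7 are found by both paths, id 9 only by LIKE.
fts_rows = [{"id": 3, "content": "昨晚"}, {"id": 7, "content": "昨晚 deploy"}]
like_rows = [
    {"id": 7, "content": "昨晚 deploy"},
    {"id": 9, "content": "我昨晚没睡好"},
    {"id": 3, "content": "昨晚"},
]

merged = merge_by_message_id(fts_rows, like_rows)
print([row["id"] for row in merged])  # [3, 7, 9]
```

Iterating in order and tracking seen ids keeps the FTS5 ordering intact, which a plain set() over the rows would not, and it also works when rows are unhashable dicts.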

Before / After

| Scenario | Before | After |
|---|---|---|
| CJK query, FTS5 returns 0 | ✅ LIKE fallback | ✅ LIKE runs |
| CJK query, FTS5 returns partial | ❌ LIKE skipped | ✅ LIKE supplements |
| English query | ✅ FTS5 only | ✅ FTS5 only (unchanged) |
| CJK + English mixed | ✅ LIKE fallback on 0 | ✅ LIKE always supplements |

Testing

pytest tests/test_hermes_state.py::TestCJKSearchFallback -v
# 14 passed (12 existing + 2 new)

pytest tests/test_hermes_state.py -v
# 176 passed

Changed files

  • hermes_state.py (modified, +11/-4)
  • scripts/release.py (modified, +1/-0)
  • tests/test_hermes_state.py (modified, +27/-0)

Bug: CJK full-text search returns incomplete results

Impact

  • Users in CJK locales (Chinese, Japanese, Korean) get unreliable session_search results
  • The agent reports "no matching sessions found" for conversations that clearly exist
  • This is especially impactful for Feishu/WeChat/DingTalk users whose messages are predominantly CJK

Suggested Fix

For CJK queries, skip FTS5 entirely and go straight to LIKE (or always run LIKE as a supplement). Example:

# Option A: CJK queries bypass FTS5 entirely
if self._contains_cjk(original_query):
    # go straight to LIKE fallback
    ...

# Option B: Always supplement FTS5 with LIKE for CJK queries
if self._contains_cjk(original_query):
    # merge FTS5 + LIKE results (dedup by message id)
    ...

Environment

  • Hermes Agent v0.11.0 (2026.4.23)
  • SQLite 3.x with FTS5 (default unicode61 tokenizer)
  • Affects all platforms where CJK session content is stored

Related Code

  • hermes_state.py: search_messages() (line 1164), _contains_cjk() (line 1150), _sanitize_fts5_query() (line 1096)

extent analysis

TL;DR

The most likely fix is to modify the search_messages() method to either bypass FTS5 entirely for CJK queries or supplement FTS5 results with LIKE queries.

Guidance

  • Identify CJK queries using the _contains_cjk() method and apply a different search strategy, such as using LIKE or a combination of FTS5 and LIKE.
  • Consider merging FTS5 and LIKE results to ensure comprehensive search coverage, deduplicating by message ID to avoid duplicates.
  • Review the _sanitize_fts5_query() method to ensure it does not inadvertently drop CJK characters.
  • Test the modified search functionality with various CJK queries to verify improved results.

Example

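if self._contains_cjk(original_query):
    # Merge FTS5 + LIKE results, deduplicating by message id.
    # _search_fts5() and _search_like() are hypothetical helpers assumed to
    # return lists of dict-like rows with an "id" key.
    fts5_results = self._search_fts5(original_query)
    like_results = self._search_like(original_query)
    seen_ids = {row["id"] for row in fts5_results}
    # Keep FTS5 rows first, then append LIKE rows not already present.
    extra = [row for row in like_results if row["id"] not in seen_ids]
    return fts5_results + extra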

Notes

The unicode61 tokenizer limitation in SQLite's FTS5 is a known issue, and using LIKE as a fallback or supplement can help mitigate the problem. However, this may impact performance, and further optimization may be necessary.
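
On the performance point: a LIKE pattern with a leading % cannot use an ordinary index, so the supplement scans the table. One hedged way to keep it predictable is to escape wildcards in the user query and cap the number of rows returned. In the sketch below, the messages table, the like_pattern() helper, and the limit of 50 are all assumptions for illustration, not details from hermes_state.py.

```python
import sqlite3


def like_pattern(query: str, escape: str = "\\") -> str:
    """Build a '%...%' pattern, escaping LIKE wildcards in the user query."""
    escaped = (
        query.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )
    return f"%{escaped}%"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO messages(content) VALUES (?)",
    [("红利100%到账",), ("红利1000到账",), ("no match here",)],
)

# ESCAPE '\' tells SQLite which character marks a literal % or _ in the pattern.
rows = conn.execute(
    "SELECT id, content FROM messages WHERE content LIKE ? ESCAPE '\\' "
    "ORDER BY id DESC LIMIT ?",
    (like_pattern("100%"), 50),
).fetchall()

print(rows)  # [(1, '红利100%到账')]
```

Without the escaping, the % in the query would act as a wildcard and the second row would match as well.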

Recommendation

Apply workaround: Modify the search_messages() method to supplement FTS5 results with LIKE queries for CJK searches, as this approach can provide more comprehensive results without requiring significant changes to the existing infrastructure.
