hermes: [holographic] _extract_entities is ASCII-only; CJK / non-English facts produce zero entities

Holographic memory plugin: _extract_entities is ASCII-only, breaks compositional retrieval for non-English users

Summary

The holographic memory plugin's entity extractor (plugins/memory/holographic/store.py:_extract_entities) is ASCII-only. For users whose facts are predominantly Chinese, Japanese, Korean, Cyrillic, etc., zero entities are ever extracted, which silently degrades probe / related / reason from compositional HRR retrieval down to FTS5 keyword fallback.

The bug is silent — facts add fine, search works fine, list works fine. Only probe / related / reason quietly underperform, and there's no log line to point you at the entity table being empty.

Affected version

  • Plugin: plugins/memory/holographic/ (version 0.1.0 per plugin.yaml)
  • Files: plugins/memory/holographic/store.py lines 84–91 (regex defs) and lines 394–427 (_extract_entities body)

Reproduction

# Fresh memory store, add some Chinese facts
python -c "
import importlib.util
spec = importlib.util.spec_from_file_location('s','plugins/memory/holographic/store.py')
m = importlib.util.module_from_spec(spec); spec.loader.exec_module(m)
fs = m.MemoryStore('/tmp/test.db')
fs.add_fact('飞书白兔 App 已于 2026-5-10 接入完成')
fs.add_fact('Coco 香港插班项目计划')
fs.add_fact('用户公司日常用「白兔」/「白兔控股」,不要用工商执照名「成都抖咖」')
"

sqlite3 /tmp/test.db "SELECT COUNT(*) FROM facts;        -- 3
                      SELECT COUNT(*) FROM entities;    -- 0  ← bug
                      SELECT COUNT(*) FROM fact_entities; -- 0  ← bug
                      SELECT COUNT(*) FROM facts WHERE hrr_vector IS NOT NULL; -- 3 (but encoded with empty entity list)"

The HRR vectors are computed with entities=[] since the linker found nothing, so the bind(entity, role) → bank → unbind pipeline in retrieval.py:probe() has no structural signal to find.
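Because nothing in the logs points at the empty entity table, a small check can make the condition visible. The following is a minimal sketch, not part of the plugin: the helper name entity_graph_is_empty and the column layout of the stand-in tables are assumptions, and only the table names facts and entities come from the queries above.

```python
import sqlite3


def entity_graph_is_empty(conn: sqlite3.Connection) -> bool:
    """Return True when facts are stored but no entity was ever extracted."""
    n_facts = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
    n_entities = conn.execute("SELECT COUNT(*) FROM entities").fetchone()[0]
    return n_facts > 0 and n_entities == 0


# Stand-in in-memory database. The columns are hypothetical; only the
# table names match the reproduction queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO facts (content) VALUES ('Coco 香港插班项目计划')")

if entity_graph_is_empty(conn):
    print("warning: facts are stored but the entity table is empty")
```

Pointing the same function at a real store file (for example /tmp/test.db from the reproduction) should work as long as both tables exist.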

Root cause

The four patterns in store.py:84–91:

_RE_CAPITALIZED  = re.compile(r'\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b')
_RE_DOUBLE_QUOTE = re.compile(r'"([^"]+)"')         # ASCII " only
_RE_SINGLE_QUOTE = re.compile(r"'([^']+)'")          # ASCII ' only
_RE_AKA          = re.compile(r'(\w+(?:\s+\w+)*)\s+(?:aka|also known as)\s+(\w+(?:\s+\w+)*)', re.I)

For CJK input:

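  • _RE_CAPITALIZED: pure-CJK text has no Latin upper/lowercase letters, so it never fires; it also needs two or more consecutive capitalized words, so a lone "Coco" in a mixed fact is skipped as well
  • _RE_DOUBLE_QUOTE / _RE_SINGLE_QUOTE: CJK users type 「」 《》 “” ‘’ instead of ASCII quotes, so nothing is captured
  • _RE_AKA: "aka" / "also known as" is an English idiom that does not appear in CJK text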

Result: pure-CJK facts yield zero candidates, the entity graph stays empty, and HRR vectors are effectively content-only.
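The claim can be checked without the plugin by running the four patterns directly against the facts from the reproduction. This is a rough sketch: raw_candidates is a hypothetical helper that only collects regex captures, and it does not reproduce whatever post-processing the real _extract_entities applies.

```python
import re

# The four patterns, copied from the listing above.
_RE_CAPITALIZED = re.compile(r'\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b')
_RE_DOUBLE_QUOTE = re.compile(r'"([^"]+)"')
_RE_SINGLE_QUOTE = re.compile(r"'([^']+)'")
_RE_AKA = re.compile(
    r'(\w+(?:\s+\w+)*)\s+(?:aka|also known as)\s+(\w+(?:\s+\w+)*)', re.I
)


def raw_candidates(text):
    """Hypothetical helper: every string captured by any of the four patterns."""
    found = []
    for pattern in (_RE_CAPITALIZED, _RE_DOUBLE_QUOTE, _RE_SINGLE_QUOTE, _RE_AKA):
        for match in pattern.finditer(text):
            found.extend(match.groups())
    return found


cjk_facts = [
    '飞书白兔 App 已于 2026-5-10 接入完成',
    'Coco 香港插班项目计划',
    '用户公司日常用「白兔」/「白兔控股」,不要用工商执照名「成都抖咖」',
]
for fact in cjk_facts:
    print(len(raw_candidates(fact)), fact)

# English control sentence (made up for this sketch) to show the patterns do fire.
print(raw_candidates('Ada Lovelace aka Countess'))
```

All three facts yield no captures, while the English control sentence does.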

Impact

Anyone whose fact base is mostly non-English silently loses compositional retrieval. They still get FTS5 keyword search (which is fine), but the selling point of this plugin — entity-aware HRR algebra — is dark. They probably won't notice until they specifically test probe / reason and wonder why the scores look flat.

Proposed fix

Add three CJK-aware rules to _extract_entities. Important design choice: prefer explicit-marker rules over bare-character heuristics, because a sliding regex over CJK runs (e.g. [\u4e00-\u9fff]{2,6}) produces too many cross-word fragments without a dictionary ("把视觉模型识", "校跟用户口头"). Stay conservative.
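To see why a bare character-class rule is noisy, here is a small illustration. The sentence is made up for this sketch (assembled from words in the reproduction facts) and the pattern name _RE_BARE_CJK_RUN is hypothetical.

```python
import re

# Hypothetical bare-run rule: any 2 to 6 consecutive CJK ideographs.
_RE_BARE_CJK_RUN = re.compile(r'[\u4e00-\u9fff]{2,6}')

sentence = '飞书白兔已于今天接入完成'
fragments = _RE_BARE_CJK_RUN.findall(sentence)
print(fragments)
```

With no spaces to anchor on, the greedy quantifier simply cuts the run into six-character chunks, so the captures follow chunk length rather than word boundaries.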

Suggested rule set:

  1. CJK brackets/quotes (high signal): 「…」 『…』 《…》 “…” ‘…’
  2. Mixed-script identifiers (high signal): [A-Za-z][A-Za-z0-9_.\-]+ with an optional \s+\d+(?:\.\d+)* version suffix. Captures lark-cli, GPT-5.5, Gemini 3.1 Pro, baitugroup.com, IHMS, etc.
  3. Bare CJK runs of 2–6 chars (low signal; recommend leaving off by default, or behind a config flag): useful only with a small stopword list to filter pronouns/generic terms.

These rules are paired with a small _CN_STOPWORDS set (~80 common pronouns/auxiliaries/generic nouns). The set is applied only to pure-CJK candidates, so English entities like "Project" are not filtered out by accident.
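Rules 1 and 2 plus the stopword filter could look roughly like the sketch below. This is not the patch itself: the names _RE_CJK_QUOTED, _RE_IDENTIFIER, _RE_PURE_CJK and extract_cjk_aware are hypothetical, the stopword entries are placeholders, and the lookbehind and trailing-punctuation trimming are assumptions added so the identifier rule behaves sensibly next to CJK text.

```python
import re

# Rule 1: text enclosed in CJK brackets or curly quotes.
_RE_CJK_QUOTED = re.compile(
    r'「([^「」]+)」|『([^『』]+)』|《([^《》]+)》|“([^“”]+)”|‘([^‘’]+)’'
)
# Rule 2: ASCII identifier with an optional numeric version suffix.
# The lookbehind (an assumption) avoids starting in the middle of a token.
_RE_IDENTIFIER = re.compile(
    r'(?<![A-Za-z0-9_])[A-Za-z][A-Za-z0-9_.\-]+(?:\s+\d+(?:\.\d+)*)?'
)
_RE_PURE_CJK = re.compile(r'[\u4e00-\u9fff]+')

# Placeholder entries only; the issue describes a list of about 80 terms.
_CN_STOPWORDS = {'我们', '你们', '这个', '那个'}


def extract_cjk_aware(text):
    """Collect rule 1 and rule 2 candidates, then filter pure-CJK stopwords."""
    candidates = []
    for match in _RE_CJK_QUOTED.finditer(text):
        candidates.extend(group for group in match.groups() if group)
    for match in _RE_IDENTIFIER.finditer(text):
        # Drop sentence punctuation that the character class also accepts.
        candidates.append(match.group(0).rstrip('.-_'))

    entities = []
    for cand in candidates:
        cand = cand.strip()
        if not cand or cand in entities:
            continue
        # The stopword check applies only to candidates made purely of ideographs.
        if _RE_PURE_CJK.fullmatch(cand) and cand in _CN_STOPWORDS:
            continue
        entities.append(cand)
    return entities


print(extract_cjk_aware('用户公司日常用「白兔」/「白兔控股」,不要用工商执照名「成都抖咖」'))
print(extract_cjk_aware('用 lark-cli 调用 GPT-5.5'))
```

Rule 3 is deliberately left out here, matching the recommendation to keep it off by default. As written, the version suffix in rule 2 ends at the number, so a trailing word after the version is not part of the capture.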

I have a working patch

I patched this locally on my install and verified it on 11 Chinese facts:

Metric | Before | After
entities | 0 | 67
fact_entities links | 0 | 77
facts with HRR vector | 11 | 11 (recomputed with real entities)
memory_banks | 0 | 5
probe(entity="Coco") signal | flat (FTS fallback) | finds the 3 Coco-linked facts via the graph
compositional reason(["Coco","Kanban"]) | n/a | returns the 2 facts linked to both

The patch is ~50 lines (regex defs + stopword set + 3 additions to _extract_entities + a small _add tweak that strips CJK punctuation and skips stopwords only for pure-CJK candidates). English behavior is fully unchanged — all four existing rules and the original _add semantics are preserved verbatim.

Happy to send a PR if it would be welcome — let me know the preferred shape:

  • (a) The narrow patch as described (additive, no config knobs)
  • (b) Same patch but with a cjk_run_extraction: bool config flag in plugin.yaml so adventurous users can enable the noisy Rule 3
  • (c) A more ambitious refactor that makes _extract_entities pluggable (a [extractor] block in plugin.yaml) so future contributors can drop in jieba/spaCy/LLM-based extractors without touching core

I'd default to (a) unless you'd rather start with the more pluggable design.

Side note: same root cause affects HRR vector quality

Since _compute_hrr_vector (store.py:470) reads from fact_entities, the bug also means CJK users' HRR vectors are computed with entities=[]. Facts added after the fix get correct vectors automatically through the existing add_fact path. Facts already stored in an existing DB do not: their entities must be backfilled and their vectors recomputed, which needs a one-shot migration. Worth a note in the release.


Environment:

  • Debian 13 (trixie), Python 3.13
  • Hermes-Agent venv at ~/.hermes/hermes-agent/
  • Plugin path plugins/memory/holographic/
  • Fact base 100% Chinese / Chinese-English mixed
