llamaIndex - ✅(Solved) Fix [Bug]: SimplePropertyGraphStore can't persist utf-8 encoded chars on Windows [4 pull requests, 2 comments, 2 participants]

GitHub stats
run-llama/llama_index#21109 (fetched 2026-04-08 01:17:12)
Comments: 2
Participants: 2
Timeline: 11
Reactions: 0
Timeline (top): cross-referenced ×3, commented ×2, labeled ×2, referenced ×2

Error Message

Stack trace of crash during store (likely due to corruption from previous run failing, not necessarily a bug in LlamaIndex):

[18/329] Processing: chunk_00052_00054.txt...
Applying transformations: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 22.97it/s]
Traceback (most recent call last):
  File "C:\Users\manue\Documents\Dictadura\kg_builder.py", line 130, in <module>
    index.storage_context.persist(persist_dir=PERSIST_DIR)
  File "C:\Users\manue\Documents\Dictadura\.venv310\lib\site-packages\llama_index\core\storage\storage_context.py", line 187, in persist
    self.property_graph_store.persist(persist_path=pg_graph_store_path, fs=fs)
  File "C:\Users\manue\Documents\Dictadura\.venv310\lib\site-packages\llama_index\core\graph_stores\simple_labelled.py", line 171, in persist
    f.write(self.graph.model_dump_json())
  File "C:\Users\manue\AppData\Local\Python\pythoncore-3.10-64\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2139192-2139193: character maps to <undefined>

Stack trace of crash during load:

(.venv310) PS C:\Users\manue\Documents\Dictadura> python .\kg_builder.py
Loading index from storage...
Traceback (most recent call last):
  File "C:\Users\manue\Documents\Dictadura\kg_builder.py", line 120, in <module>
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
  File "C:\Users\manue\Documents\Dictadura\.venv310\lib\site-packages\llama_index\core\storage\storage_context.py", line 126, in from_defaults
    or SimplePropertyGraphStore.from_persist_dir(persist_dir, fs=fs)
  File "C:\Users\manue\Documents\Dictadura\.venv310\lib\site-packages\llama_index\core\graph_stores\simple_labelled.py", line 196, in from_persist_dir
    return cls.from_persist_path(persist_path, fs=fs)
  File "C:\Users\manue\Documents\Dictadura\.venv310\lib\site-packages\llama_index\core\graph_stores\simple_labelled.py", line 184, in from_persist_path
    data = json.loads(f.read())
  File "C:\Users\manue\AppData\Local\Python\pythoncore-3.10-64\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\manue\AppData\Local\Python\pythoncore-3.10-64\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\manue\AppData\Local\Python\pythoncore-3.10-64\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
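A JSONDecodeError at "char 0" is what parsing an empty file produces. One plausible way to get there, sketched below with the standard library only, is a write that fails on encoding after the file has already been truncated. This is an illustration under assumptions, not the library's code: the file name, the sample payload, and the simulate_failed_persist helper are all hypothetical.

```python
import json
import os
import tempfile


def simulate_failed_persist(encoding: str = "cp1252") -> str:
    """Write non-ASCII JSON with a narrow codec and report what a reload sees."""
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tempfile.mkdtemp(), "graph_store.json")
    payload = json.dumps({"text": "定义 图"}, ensure_ascii=False)

    try:
        # Opening in "w" mode truncates the file before anything is written.
        with open(path, "w", encoding=encoding) as f:
            f.write(payload)  # raises: cp1252 has no mapping for these characters
    except UnicodeEncodeError:
        pass  # the file exists but holds no JSON

    try:
        with open(path, "r", encoding=encoding) as f:
            json.loads(f.read())
    except json.JSONDecodeError as exc:
        return f"{os.path.getsize(path)} bytes on disk; reload failed: {exc}"
    return "reload succeeded"


print(simulate_failed_persist())
```

If this is what happened, deleting the empty file and rebuilding the index after applying the encoding fix should clear the load error.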

Root Cause

On Windows, Python's default text encoding is the locale code page (cp1252 on the reporter's machine), not UTF-8, which is likely the root cause of this error. SimplePropertyGraphStore.persist neither specifies an encoding nor allows one to be specified, and the same applies to from_persist_path.
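The encoding mismatch can be seen without LlamaIndex at all. The short sketch below uses only the standard library; the sample string is borrowed from the issue's reproduction text, and the variable names are illustrative.

```python
import locale

sample = "定义 图"  # characters taken from the issue's reproduction text

# UTF-8 can represent every Unicode code point, so this always works.
utf8_bytes = sample.encode("utf-8")

# cp1252 is a single-byte Western European code page with no CJK mappings.
try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Typically what open() falls back to when no encoding= is passed
# (platform dependent; often cp1252 on Windows, UTF-8 elsewhere).
default_encoding = locale.getpreferredencoding(False)

print(f"utf-8 bytes: {len(utf8_bytes)}, cp1252 ok: {cp1252_ok}, default: {default_encoding}")
```

Running this on the affected machine shows which default encoding a bare open() call would pick up.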

Fix / Workaround

The reporter worked around the problem by monkey patching SimplePropertyGraphStore.persist and SimplePropertyGraphStore.from_persist_path so that both open the file with encoding="utf-8". The PRs listed under PR fix notes apply the same change inside the library.

PR fix notes

PR #21111: fix: add explicit UTF-8 encoding to persistence layer fs.open() calls

Description (problem / solution / changelog)

Description

On Windows, the default file encoding is locale-specific (cp1252, GBK, etc.) rather than UTF-8. This causes UnicodeEncodeError when persisting data containing non-ASCII characters (Chinese, Japanese, special Unicode symbols).

Add encoding='utf-8' to all text-mode fs.open() calls in:

  • SimplePropertyGraphStore (persist + from_persist_path)
  • SimpleKVStore (persist)
  • SimpleGraphStore (persist)
  • SimpleVectorStore (persist)
  • SimpleChatStore (persist + from_persist_path)

Also adds a round-trip test with Chinese, Japanese, and special Unicode characters to verify the fix.

Fixes #21109 Related: #17846, #19564
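The round-trip idea mentioned in this PR description can be sketched with the standard library alone. This is not the PR's actual test: the sample strings, the file name, and the round_trip helper are assumptions made for illustration, and plain open() stands in for fs.open().

```python
import json
import os
import tempfile

# Illustrative samples: Chinese, Japanese, accented Latin, and symbols.
SAMPLES = ["定义", "日本語のテキスト", "naïve café €", "∑ ≠ ✓"]


def round_trip(text: str) -> str:
    """Persist text as JSON with explicit UTF-8 and load it back."""
    path = os.path.join(tempfile.mkdtemp(), "store.json")  # hypothetical name
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the raw characters in the file, so the
        # file encoding actually matters for the result.
        f.write(json.dumps({"text": text}, ensure_ascii=False))
    with open(path, "r", encoding="utf-8") as f:
        return json.loads(f.read())["text"]


def test_unicode_round_trip() -> None:
    for text in SAMPLES:
        assert round_trip(text) == text


test_unicode_round_trip()
```

Because both the write and the read name the encoding explicitly, the check behaves the same regardless of the platform's default code page.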

Version Bump?

llama-index-core doesn't need version bump

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • I added new unit tests to cover this change

Changed files

  • llama-index-core/llama_index/core/graph_stores/simple.py (modified, +1/-1)
  • llama-index-core/llama_index/core/graph_stores/simple_labelled.py (modified, +2/-2)
  • llama-index-core/llama_index/core/storage/chat_store/simple_chat_store.py (modified, +2/-2)
  • llama-index-core/llama_index/core/storage/kvstore/simple_kvstore.py (modified, +1/-1)
  • llama-index-core/llama_index/core/vector_stores/simple.py (modified, +1/-1)
  • llama-index-core/tests/graph_stores/test_simple_lpg.py (modified, +33/-0)

PR #21115: Fix: Add encoding=utf-8 to SimplePropertyGraphStore for Windows compatibility

Description (problem / solution / changelog)

Description

Fixes #21109

On Windows, files are opened with cp1252 encoding by default, causing errors when persisting UTF-8 content.

Changes

Added encoding="utf-8" parameter to file open operations in simple_labelled.py:

  1. persist() method: fs.open(persist_path, "w", encoding="utf-8")
  2. from_persist_path() method: fs.open(persist_path, "r", encoding="utf-8")

This ensures consistent UTF-8 encoding across all platforms, fixing the Windows compatibility issue described in #21109.

Changed files

  • llama-index-core/llama_index/core/graph_stores/simple_labelled.py (modified, +2/-2)

PR #21210: fix: resolve multiple bugs in core and openai integrations (Issues #21109, #21159, #21150, #21124)

Description (problem / solution / changelog)

This PR aggregates fixes for multiple reported issues across the llama-index-core, llama-index-llms-openai, and llama-index-postprocessor-google-rerank packages:

  • Fixes #21109: Modified SimplePropertyGraphStore to correctly persist utf-8 encoded characters on Windows environments by adding explicit encoding="utf-8" arguments to file handlers.
  • Fixes #21159: Fixed an issue in QueryFusionRetriever where the asynchronous _aretrieve method blocked the event loop by synchronously calling _get_queries(). Introduced and awaited an _aget_queries equivalent instead.
  • Fixes #21150: Resolved production gRPC failures in GoogleRerank caused by thread/event-loop mismatches by lazy-loading the RankServiceAsyncClient during async execution instead of in the __init__ constructor.
  • Fixes #21124: Addressed multiple edge cases in the OpenAI LLM integration regarding O1/O3 reasoning models and serialization:
    • Retained assistant text content when tool calls are present.
    • Parses, tracks, and serializes the phase property for commentary and reasoning workflows.
    • Used prefix matching instead of rigid dictionary checks for O1/O3 models to automatically support emerging variants (like gpt-5).
    • Properly distributes reasoning tokens universally across multiple ThinkingBlock elements.
    • Enforces valid JSON string serialization for ToolCallBlock tool arguments where dicts were previously failing.

Changed files

  • llama-index-core/llama_index/core/graph_stores/simple_labelled.py (modified, +2/-2)
  • llama-index-core/llama_index/core/retrievers/fusion_retriever.py (modified, +18/-1)
  • llama-index-integrations/postprocessor/llama-index-postprocessor-google-rerank/llama_index/postprocessor/google_rerank/base.py (modified, +15/-4)
  • llama-index-integrations/postprocessor/llama-index-postprocessor-google-rerank/tests/test_postprocessor_google_rerank.py (modified, +7/-3)

PR #21360: fix: add encoding="utf-8" to graph store persist/load methods

Description (problem / solution / changelog)

Description

Fixes #21109

SimplePropertyGraphStore.persist() / from_persist_path() and SimpleGraphStore.persist() / from_persist_path() open files without specifying encoding="utf-8". On Windows, Python's open() defaults to the system encoding (often cp1252), which causes UnicodeDecodeError / UnicodeEncodeError when the graph data contains non-ASCII characters.

Changes

  • Added encoding="utf-8" to all fs.open() calls in persist() and from_persist_path() for both SimplePropertyGraphStore and SimpleGraphStore
  • Fixed SimpleGraphStore.from_persist_path() using binary mode ("rb") instead of text mode — json.load() expects a text stream, and binary mode is inconsistent with the write path which uses "w"

Files changed

  • llama-index-core/llama_index/core/graph_stores/simple_labelled.py (SimplePropertyGraphStore)
  • llama-index-core/llama_index/core/graph_stores/simple.py (SimpleGraphStore)

Related issue

Closes #21109 — same root cause as PR #21111 (stale since March 2025), but this PR also fixes the older SimpleGraphStore class and the "rb" mode inconsistency.

Changed files

  • llama-index-core/llama_index/core/graph_stores/simple.py (modified, +2/-2)
  • llama-index-core/llama_index/core/graph_stores/simple_labelled.py (modified, +2/-2)

Code Example

import json

import fsspec as _fsspec
import llama_index.core.graph_stores.simple_labelled as _sg

def _persist_utf8(self, persist_path, fs=None):
    # Write the graph JSON with an explicit UTF-8 encoding instead of
    # relying on the platform default.
    if fs is None:
        fs = _fsspec.filesystem("file")
    with fs.open(persist_path, "w", encoding="utf-8") as f:
        f.write(self.graph.model_dump_json())

_sg.SimplePropertyGraphStore.persist = _persist_utf8

@classmethod
def _from_persist_path_utf8(cls, persist_path, fs=None):
    # Read the file back with the same explicit UTF-8 encoding.
    if fs is None:
        fs = _fsspec.filesystem("file")
    with fs.open(persist_path, "r", encoding="utf-8") as f:
        data = json.loads(f.read())
    return cls.from_dict(data)

_sg.SimplePropertyGraphStore.from_persist_path = _from_persist_path_utf8

Bug Description

I'm trying to create a knowledge graph from documents that were OCR'd using DeepSeek. During OCR it is common to get some non-ASCII characters, sometimes due to hallucinations and sometimes simply because they are valid Chinese characters. While creating the graph and trying to persist it, I got the following stack trace:

[18/329] Processing: chunk_00052_00054.txt...
Applying transformations: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 22.97it/s]
Traceback (most recent call last):
  File "C:\Users\manue\Documents\Dictadura\kg_builder.py", line 130, in <module>
    index.storage_context.persist(persist_dir=PERSIST_DIR)
  File "C:\Users\manue\Documents\Dictadura\.venv310\lib\site-packages\llama_index\core\storage\storage_context.py", line 187, in persist
    self.property_graph_store.persist(persist_path=pg_graph_store_path, fs=fs)
  File "C:\Users\manue\Documents\Dictadura\.venv310\lib\site-packages\llama_index\core\graph_stores\simple_labelled.py", line 171, in persist
    f.write(self.graph.model_dump_json())
  File "C:\Users\manue\AppData\Local\Python\pythoncore-3.10-64\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2139192-2139193: character maps to <undefined>

On Windows, the default encoding is cp1252, not UTF-8, which is likely the root cause of this error. I've noticed that SimplePropertyGraphStore.persist neither specifies an encoding nor allows one to be specified. I've solved this issue by monkey patching it. The same applies to from_persist_path.

import json

import fsspec as _fsspec
import llama_index.core.graph_stores.simple_labelled as _sg

def _persist_utf8(self, persist_path, fs=None):
    # Write the graph JSON with an explicit UTF-8 encoding instead of
    # relying on the platform default.
    if fs is None:
        fs = _fsspec.filesystem("file")
    with fs.open(persist_path, "w", encoding="utf-8") as f:
        f.write(self.graph.model_dump_json())

_sg.SimplePropertyGraphStore.persist = _persist_utf8

@classmethod
def _from_persist_path_utf8(cls, persist_path, fs=None):
    # Read the file back with the same explicit UTF-8 encoding.
    if fs is None:
        fs = _fsspec.filesystem("file")
    with fs.open(persist_path, "r", encoding="utf-8") as f:
        data = json.loads(f.read())
    return cls.from_dict(data)

_sg.SimplePropertyGraphStore.from_persist_path = _from_persist_path_utf8

This seems to have solved my issue, although it is obviously not a desirable solution.

Version

0.14.18

Steps to Reproduce

  1. Create a document with lots of utf-8 characters like -定义 \[ \text{图} \]
  2. Be on Windows
  3. Create a PropertyGraphIndex and store it
  4. The code should crash if the text in the graph nodes contains such characters.

extent analysis

Fix Plan

To fix the UnicodeEncodeError, we need to specify the encoding when writing to a file. We can do this by modifying the persist method in SimplePropertyGraphStore to use utf-8 encoding.

Here are the steps:

  • Modify the persist method to use utf-8 encoding:
import json

import fsspec as _fsspec
import llama_index.core.graph_stores.simple_labelled as _sg

def _persist_utf8(self, persist_path, fs=None):
    # Write the graph JSON with an explicit UTF-8 encoding.
    if fs is None:
        fs = _fsspec.filesystem("file")
    with fs.open(persist_path, "w", encoding="utf-8") as f:
        f.write(self.graph.model_dump_json())

_sg.SimplePropertyGraphStore.persist = _persist_utf8
  • Modify the from_persist_path method to use utf-8 encoding:
@classmethod
def _from_persist_path_utf8(cls, persist_path, fs=None):
    # Read the file back with the same explicit UTF-8 encoding.
    if fs is None:
        fs = _fsspec.filesystem("file")
    with fs.open(persist_path, "r", encoding="utf-8") as f:
        data = json.loads(f.read())
    return cls.from_dict(data)

_sg.SimplePropertyGraphStore.from_persist_path = _from_persist_path_utf8

Alternatively, you can also submit a pull request to the llama_index repository to add encoding support to the persist and from_persist_path methods.

Verification

To verify that the fix worked, you can try running your code again and check if the UnicodeEncodeError is resolved. You can also test the persist and from_persist_path methods separately to ensure they are working correctly.
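One way to make the verification step concrete is to inspect the persisted file directly. The helper below is a minimal sketch using only the standard library; check_persisted_json is a hypothetical name, and the path you pass in depends on your own persist directory.

```python
import json
from pathlib import Path


def check_persisted_json(path: str) -> str:
    """Return "ok" if the file is non-empty, valid UTF-8, and valid JSON."""
    raw = Path(path).read_bytes()
    if not raw:
        return "empty file (a previous persist probably failed)"
    try:
        text = raw.decode("utf-8")  # strict: raises on bytes that are not UTF-8
    except UnicodeDecodeError as exc:
        return f"not UTF-8: {exc}"
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc}"
    return "ok"
```

Call it with the path of the graph store file inside your persist directory after a successful persist. An "empty file" result suggests a leftover from an earlier failed run, which would need to be deleted and rebuilt.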

Extra Tips

  • When working with text data, it's essential to specify the encoding to avoid Unicode-related issues.
  • If you're using a library that doesn't support encoding, consider submitting a pull request or using a different library.
  • Always test your code thoroughly to ensure it works correctly with different types of input data.
