langchain: HTMLSemanticPreservingSplitter processes malformed and unsafe links

With preserve_links=True, HTMLSemanticPreservingSplitter currently processes malformed or unsafe links, such as empty href values and javascript: pseudo-links, the same way as valid ones.

This can generate invalid Markdown links and unintentionally preserve unsafe pseudo-links in the split output.

Expected behavior:

  • Empty or malformed links should be skipped
  • javascript: pseudo-links should not be preserved
  • Valid links should continue to work normally
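
The expected behavior above can be sketched as a small href check. This is only an illustration, not the library's actual fix: `is_safe_href` and `ALLOWED_SCHEMES` are hypothetical names, and the scheme allowlist is an assumption you would tune for your own documents.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical allowlist (an assumption, not part of LangChain).
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_safe_href(href: Optional[str]) -> bool:
    """Return True if href looks like a link worth preserving."""
    if href is None:
        return False
    href = href.strip()
    if not href:
        # Empty or whitespace-only href: skip.
        return False
    try:
        parts = urlsplit(href)
    except ValueError:
        # urlsplit rejects some malformed URLs, e.g. an unclosed IPv6 bracket.
        return False
    if parts.scheme:
        # Explicit scheme: keep only allowlisted ones, so javascript: is dropped.
        return parts.scheme.lower() in ALLOWED_SCHEMES
    # No scheme: treat as a relative link such as "/docs" or "#section".
    return True
```

An allowlist is used here rather than a blocklist of `javascript:` alone, so that other non-navigational schemes are also skipped by default.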

Code Example

from langchain_text_splitters.html import HTMLSemanticPreservingSplitter

# Three anchors: an empty href, a javascript: pseudo-link, and a valid URL.
html = """
<html>
    <body>
        <p><a href="">Empty Link</a></p>
        <p><a href="javascript:void(0)">Bad Link</a></p>
        <p><a href="https://example.com">Valid Link</a></p>
    </body>
</html>
"""

splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[("h1", "Header 1")],
    preserve_links=True,  # keep <a> tags as links in the output
)

docs = splitter.split_text(html)

# Per this report, the empty and javascript: links are kept alongside the valid one.
for doc in docs:
    print(doc.page_content)
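
Until the splitter filters these links itself, one possible workaround is to clean the output text afterwards. The sketch below assumes the splitter renders preserved links as Markdown inline links of the form `[text](href)`, which the description's mention of "invalid markdown links" suggests but which you should verify against your installed version. `strip_unsafe_links` is a hypothetical helper, not a LangChain API.

```python
import re

# Matches Markdown inline links "[text](target)", allowing one level of
# nested parentheses in the target so "javascript:void(0)" is captured whole.
_MD_LINK = re.compile(r"\[([^\]]*)\]\(((?:[^()\s]|\([^()]*\))*)\)")


def strip_unsafe_links(markdown: str) -> str:
    """Replace links with empty or javascript: targets by their plain text."""

    def _replace(match: re.Match) -> str:
        text, target = match.group(1), match.group(2).strip()
        if not target or target.lower().startswith("javascript:"):
            # Keep the visible text, drop the unusable target.
            return text
        # Leave every other link untouched.
        return match.group(0)

    return _MD_LINK.sub(_replace, markdown)
```

It could be applied to each chunk as `strip_unsafe_links(doc.page_content)`. Post-processing text is a stopgap: it does not handle every Markdown link form, and filtering inside the splitter would be more robust.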

Submission checklist

  • This is a bug, not a usage question.
  • I added a clear and descriptive title that summarizes this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
  • This is not related to the langchain-community package.
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Related Issues / PRs

No response

System Info

PS C:\Users\Acer> python -m langchain_core.sys_info
C:\Python314\Lib\site-packages\langchain_core\_api\deprecation.py:25: UserWarning: Core Pydantic V1 functionality isn't compatible with Python 3.14 or greater.
  from pydantic.v1.fields import FieldInfo as FieldInfoV1

System Information

OS: Windows
OS Version: 10.0.26200
Python Version: 3.14.0 (tags/v3.14.0:ebf955d, Oct 7 2025, 10:15:03) [MSC v.1944 64 bit (AMD64)]

Package Information

langchain_core: 1.2.28
langchain: 1.2.15
langsmith: 0.7.29
langchain_anthropic: 1.4.0
langchain_google_genai: 4.2.1
langchain_groq: 1.1.2
langchain_huggingface: 1.2.1
langchain_openai: 1.1.12
langgraph_sdk: 0.3.13

Optional packages not installed

deepagents
deepagents-cli

Other Dependencies

anthropic: 0.92.0
filetype: 1.2.0
google-genai: 1.71.0
groq: 0.37.1
httpx: 0.28.1
huggingface-hub: 1.9.2
jsonpatch: 1.33
langgraph: 1.1.6
openai: 2.31.0
orjson: 3.11.8
packaging: 26.0
pydantic: 2.12.4
pyyaml: 6.0.3
requests: 2.32.5
requests-toolbelt: 1.0.0
rich: 14.3.3
tenacity: 9.1.4
tiktoken: 0.12.0
tokenizers: 0.22.2
transformers: 5.5.1
typing-extensions: 4.15.0
uuid-utils: 0.14.1
websockets: 15.0.1
xxhash: 3.6.0
zstandard: 0.25.0
