langchain - ✅ (Solved) Fix core: _convert_openai_format_to_data_block hard-codes mime_type on base64 file blocks [2 pull requests, 2 comments, 1 participant]

GitHub stats
langchain-ai/langchain#36939 (fetched 2026-04-23 07:23:09)
Comments: 2 · Participants: 1 · Timeline events: 7 · Reactions: 0
Timeline (top): labeled ×3, cross-referenced ×2, commented ×1, issue_type_added ×1

Error Message

Error Message and Stack Trace (if applicable)

None provided. The file is silently relabeled, so the issue reports no error message or stack trace.

Root Cause

In langchain_core/messages/block_translators/openai.py, _convert_openai_format_to_data_block has two base64 branches that look symmetrical: one for image_url, one for file.

The image branch reads the MIME type from the parsed data URI (parsed["mime_type"]). The file branch hard-codes "application/pdf".

The repro passes a CSV via the OpenAI base64 file block shape that the OpenAI docs prescribe. The resulting v1 content block has mime_type="application/pdf" instead of "text/csv", even though the data URI explicitly says text/csv. Any non-PDF file attached this way (CSV, plain text, spreadsheets, office docs) gets silently relabeled the same way.

Since _normalize_messages calls this translator on every chat model's input path, the wrong MIME type propagates to downstream integrations that consume content_blocks.

Expected: mime_type matches the data URI (text/csv in the example). Actual: mime_type is always application/pdf.

_parse_data_uri already returns None if the MIME type is missing, so the fix is to use parsed["mime_type"] like the image branch does, no extra None check needed.
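
The asymmetry between the two branches can be illustrated with a small self-contained sketch. Everything in it is a simplified stand-in rather than the langchain-core source: parse_data_uri, convert_file_block_buggy, and convert_file_block_fixed are hypothetical names, and the returned dicts only approximate the shape of a v1 content block.

```python
import re
from typing import Optional

# Hypothetical stand-in for _parse_data_uri: returns None unless the URI
# declares a MIME type and carries a base64 payload.
_DATA_URI = re.compile(r"^data:(?P<mime_type>[^;,]+);base64,(?P<data>.+)$")


def parse_data_uri(uri: str) -> Optional[dict]:
    match = _DATA_URI.match(uri)
    if match is None:
        return None
    return {"mime_type": match.group("mime_type"), "data": match.group("data")}


def convert_file_block_buggy(block: dict) -> Optional[dict]:
    parsed = parse_data_uri(block["file"]["file_data"])
    if parsed is None:
        return None
    # Mirrors the reported bug: the declared MIME type is ignored.
    return {"type": "file", "base64": parsed["data"], "mime_type": "application/pdf"}


def convert_file_block_fixed(block: dict) -> Optional[dict]:
    parsed = parse_data_uri(block["file"]["file_data"])
    if parsed is None:
        return None
    # Mirrors the proposed fix: reuse the MIME type from the data URI.
    return {"type": "file", "base64": parsed["data"], "mime_type": parsed["mime_type"]}


csv_block = {
    "type": "file",
    "file": {"filename": "sheet.csv", "file_data": "data:text/csv;base64,aGVsbG8="},
}
buggy = convert_file_block_buggy(csv_block)
fixed = convert_file_block_fixed(csv_block)
```

Because the stand-in parser returns None when no MIME type is declared, the fixed variant never reaches the parsed["mime_type"] lookup with a missing value, which is the property the issue relies on to skip an extra None check.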

Fix Action

Fix / Workaround

In the file branch of _convert_openai_format_to_data_block, replace the hard-coded "application/pdf" with parsed["mime_type"], as the image branch already does. No extra None check is needed, because _parse_data_uri returns None when the MIME type is missing. Both linked PRs (#36937 and #36940) make this one-line change in libs/core/langchain_core/messages/block_translators/openai.py.

PR fix notes

PR #36937: core[patch]: preserve MIME type on base64 file blocks in openai translator

Description (problem / solution / changelog)

Fixes #36939.

The file branch of _convert_openai_format_to_data_block hard-codes mime_type="application/pdf", while the image branch right above it uses parsed["mime_type"] from the data URI. So a CSV sent via the OpenAI file block shape comes out with mime_type="application/pdf" in the v1 content block.

One-line change to read it off the parsed data URI, same as the image branch. _parse_data_uri returns None when the mime_type is missing, so parsed["mime_type"] is always set inside this branch.

Test added with a CSV and a text/plain data URI. Existing tests still pass since they use data:application/pdf;....

Changed files

  • libs/core/langchain_core/messages/block_translators/openai.py (modified, +1/-1)
  • libs/core/tests/unit_tests/messages/block_translators/test_openai.py (modified, +43/-0)

PR #36940: core[patch]: use parsed mime_type for base64 file blocks in openai translator

Description (problem / solution / changelog)

Fixes #36939.

The description is identical to that of PR #36937: the same one-line change in the file branch, with a test covering a CSV and a text/plain data URI.

Changed files

  • libs/core/langchain_core/messages/block_translators/openai.py (modified, +1/-1)
  • libs/core/tests/unit_tests/messages/block_translators/test_openai.py (modified, +37/-0)

Code Example

from langchain_core.messages import HumanMessage

msg = HumanMessage(content=[
    {
        "type": "file",
        "file": {
            "filename": "sheet.csv",
            "file_data": "data:text/csv;base64,aGVsbG8=",
        },
    },
])

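# Print each normalized v1 content block. The issue reports
# mime_type="application/pdf" here even though the data URI declares text/csv.
for block in msg.content_blocks:
    print(block)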
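
On a version without the fix, a caller can detect the mislabel by comparing the MIME type declared in the data URI against the one a content block reports. The helpers below are a hypothetical sketch: declared_mime_type and mime_type_mismatch are not langchain APIs, they use only the standard library, and they operate on plain dicts shaped like the repro input.

```python
from typing import Optional


def declared_mime_type(data_uri: str) -> Optional[str]:
    """Return the MIME type a data URI declares, or None if it has none."""
    if not data_uri.startswith("data:"):
        return None
    header, sep, _payload = data_uri[len("data:"):].partition(",")
    if not sep:
        return None
    mime_type = header.split(";", 1)[0].strip()
    return mime_type or None


def mime_type_mismatch(openai_block: dict, content_block: dict) -> bool:
    """True when the content block's mime_type differs from the data URI's."""
    declared = declared_mime_type(openai_block["file"]["file_data"])
    return declared is not None and content_block.get("mime_type") != declared


openai_block = {
    "type": "file",
    "file": {"filename": "sheet.csv", "file_data": "data:text/csv;base64,aGVsbG8="},
}
# Illustrative block carrying the mislabel described in this issue
# (an assumption for the example, not captured program output).
reported_block = {"type": "file", "base64": "aGVsbG8=", "mime_type": "application/pdf"}
has_mismatch = mime_type_mismatch(openai_block, reported_block)
```

The check only compares declared values. It does not inspect the decoded payload, so it cannot tell whether the data URI itself was labeled correctly.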

---
Submission checklist

  • This is a bug, not a usage question.
  • I added a clear and descriptive title that summarizes this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
  • This is not related to the langchain-community package.
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Package (Required)

  • langchain
  • langchain-openai
  • langchain-anthropic
  • langchain-classic
  • langchain-core
  • langchain-model-profiles
  • langchain-tests
  • langchain-text-splitters
  • langchain-chroma
  • langchain-deepseek
  • langchain-exa
  • langchain-fireworks
  • langchain-groq
  • langchain-huggingface
  • langchain-mistralai
  • langchain-nomic
  • langchain-ollama
  • langchain-openrouter
  • langchain-perplexity
  • langchain-qdrant
  • langchain-xai
  • Other / not sure / general

Related Issues / PRs

No response

Error Message and Stack Trace (if applicable)

System Info

System Information

  • OS: Darwin
  • OS Version: Darwin Kernel Version 25.4.0: Thu Mar 19 19:33:25 PDT 2026; root:xnu-12377.101.15~1/RELEASE_ARM64_T6041
  • Python Version: 3.14.2 (main, Dec 5 2025, 16:49:16) [Clang 17.0.0 (clang-1700.4.4.1)]

Package Information

  • langchain_core: 1.3.0
  • langchain_community: 0.4.1
  • langsmith: 0.7.31
  • langchain_classic: 1.0.3
  • langchain_text_splitters: 1.1.1
  • langgraph_sdk: 0.3.13

Optional packages not installed

  • deepagents
  • deepagents-cli

Other Dependencies

  • aiohttp: 3.13.5
  • dataclasses-json: 0.6.7
  • google-adk: 1.30.0
  • httpx: 0.28.1
  • httpx-sse: 0.4.3
  • jsonpatch: 1.33
  • numpy: 2.4.4
  • opentelemetry-api: 1.38.0
  • opentelemetry-exporter-otlp-proto-http: 1.38.0
  • opentelemetry-sdk: 1.38.0
  • orjson: 3.11.8
  • packaging: 26.1
  • pydantic: 2.12.5
  • pydantic-settings: 2.13.1
  • pytest: 9.0.3
  • PyYAML: 6.0.3
  • pyyaml: 6.0.3
  • requests: 2.33.1
  • requests-toolbelt: 1.0.0
  • rich: 15.0.0
  • SQLAlchemy: 2.0.49
  • sqlalchemy: 2.0.49
  • tenacity: 9.1.4
  • typing-extensions: 4.15.0
  • uuid-utils: 0.14.1
  • vcrpy: 8.1.1
  • websockets: 15.0.1
  • wrapt: 1.17.3
  • xxhash: 3.6.0
  • zstandard: 0.25.0

extent analysis

TL;DR

The issue can be fixed by using the parsed MIME type from the data URI in the file branch of _convert_openai_format_to_data_block instead of hard-coding "application/pdf".

Guidance

  • Identify the _convert_openai_format_to_data_block function in langchain_core/messages/block_translators/openai.py and locate the file branch.
  • Replace the hard-coded "application/pdf" with parsed["mime_type"] to use the MIME type from the data URI.
  • Verify that the mime_type in the resulting content_blocks matches the expected type (e.g., "text/csv" for a CSV file).
  • Test the change with different file types to ensure the correct MIME type is propagated.
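
The last step can be prepared by generating OpenAI-format base64 file blocks for several file types. This is a sketch under stated assumptions: make_file_block and the SAMPLES payloads are hypothetical and exist only for illustration, while the block shape follows the repro in this issue.

```python
import base64

# Sample (filename, MIME type, payload) triples; the payloads are placeholders.
SAMPLES = [
    ("sheet.csv", "text/csv", b"a,b\n1,2\n"),
    ("notes.txt", "text/plain", b"hello"),
    ("doc.pdf", "application/pdf", b"%PDF-1.4"),
]


def make_file_block(filename: str, mime_type: str, payload: bytes) -> dict:
    """Build an OpenAI-format base64 file block like the one in the repro."""
    encoded = base64.b64encode(payload).decode("ascii")
    return {
        "type": "file",
        "file": {
            "filename": filename,
            "file_data": f"data:{mime_type};base64,{encoded}",
        },
    }


blocks = [make_file_block(*sample) for sample in SAMPLES]
expected_mime_types = [mime_type for _, mime_type, _ in SAMPLES]
```

Each block can then be wrapped in HumanMessage(content=[block]) as in the repro, and the mime_type of the resulting content_blocks entry compared with the matching entry in expected_mime_types.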

Example

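# In _convert_openai_format_to_data_block (sketch; surrounding code elided).
# `block` is the incoming OpenAI-format dict and `parsed` is the result of
# _parse_data_uri on its file_data data URI.
if block["type"] == "file":
    # ...
    mime_type = parsed["mime_type"]  # was hard-coded to "application/pdf"
    # ...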

Notes

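The fix relies on _parse_data_uri returning None when the data URI has no MIME type, so parsed["mime_type"] is always set inside this branch and no extra None check is needed. The value is the MIME type the data URI declares, so a mislabeled data URI is passed through as-is.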

Recommendation

Apply the fix by changing the file branch of _convert_openai_format_to_data_block to use the parsed MIME type, which is the change both linked PRs make. This stops non-PDF files from being relabeled as application/pdf in content_blocks.
