langchain - ✅(Solved) Fix core: _convert_openai_format_to_data_block hard-codes mime_type="application/pdf" on base64 file blocks [1 pull requests, 1 comments, 2 participants]

GitHub stats
langchain-ai/langchain#36938 · Fetched 2026-04-23 07:23:11
Comments: 1 · Participants: 2 · Timeline: 4 · Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1, labeled ×1

In libs/core/langchain_core/messages/block_translators/openai.py, _convert_openai_format_to_data_block has two symmetrical base64 branches: one for images (image_url) and one for files.

The image branch correctly reads the MIME type from the parsed data URI:

# base64-style image block
if (block["type"] == "image_url") and (
    parsed := _parse_data_uri(block["image_url"]["url"])
):
    ...
    return types.create_image_block(
        base64=parsed["data"],
        mime_type=parsed["mime_type"],   # from data URI
        **all_extras,
    )

The file branch hard-codes PDF, discarding the parsed value:

# base64-style file block
if (block["type"] == "file") and (
    parsed := _parse_data_uri(block["file"]["file_data"])
):
    ...
    return types.create_file_block(
        base64=parsed["data"],
        mime_type="application/pdf",     # hard-coded
        filename=filename,
        **all_extras,
    )

Effect: any non-PDF file delivered via the OpenAI base64 file block shape (CSV, plain text, spreadsheets, office docs, etc.) is silently relabeled as application/pdf on the way into v1 content blocks. Since _normalize_messages calls this translator on every chat model's _astream, the wrong MIME type propagates to every downstream integration that consumes content_blocks.

_parse_data_uri already guarantees mime_type is non-empty whenever parsed is truthy (returns None otherwise), so the fix is a one-line change: use parsed["mime_type"] like the image branch does. No extra None check needed.
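To make that contract concrete, here is a minimal sketch of a parser with the same guarantee: it returns None unless the data URI carries a non-empty MIME type and a base64 payload. The name parse_data_uri, the ParsedDataUri type, and the regular expression are assumptions for illustration, not the langchain-core implementation.

```python
import re
from typing import Optional, TypedDict


class ParsedDataUri(TypedDict):
    mime_type: str
    data: str


# Hypothetical pattern: a non-empty MIME type followed by the ";base64," marker.
_DATA_URI = re.compile(
    r"^data:(?P<mime_type>[^;,]+);base64,(?P<data>.*)$", re.DOTALL
)


def parse_data_uri(uri: str) -> Optional[ParsedDataUri]:
    """Return the MIME type and payload, or None if the URI does not match."""
    match = _DATA_URI.match(uri)
    if match is None:
        return None
    return {"mime_type": match["mime_type"], "data": match["data"]}


# A truthy result always has a non-empty mime_type, so callers can index it directly.
parsed = parse_data_uri("data:text/csv;base64,aGVsbG8=")
if parsed:
    print(parsed["mime_type"])
```

With a parser shaped like this, any branch guarded by a truthy parsed value can read parsed["mime_type"] without a separate None check, which is the property the one-line fix relies on.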

I've opened PR #36937 with the one-line fix and a regression test covering CSV and plain-text base64 file blocks. The PR was auto-closed for the missing-issue-link check. Happy to reopen once this issue is approved and assigned.

Found while tracing a separate production crash (PDF attachments through langchain-litellm to Anthropic-via-Vertex). This is a secondary correctness bug caught in passing.

Error Message

No exception is raised. The translated file block silently carries mime_type 'application/pdf' instead of 'text/csv'; the full output is listed under Code Example.

Fix Action

Fixed

PR fix notes

PR #36937: core[patch]: preserve MIME type on base64 file blocks in openai translator

Description (problem / solution / changelog)

Fixes #36939.

The file branch of _convert_openai_format_to_data_block hard-codes mime_type="application/pdf", while the image branch right above it uses parsed["mime_type"] from the data URI. So a CSV sent via the OpenAI file block shape comes out with mime_type="application/pdf" in the v1 content block.

One-line change to read it off the parsed data URI, same as the image branch. _parse_data_uri returns None when the mime_type is missing, so parsed["mime_type"] is always set inside this branch.

Test added with a CSV and a text/plain data URI. Existing tests still pass since they use data:application/pdf;....
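The test added in the PR is not reproduced on this page. As a rough sketch of what such a regression test could look like, the example below checks that the MIME type survives translation for CSV, plain-text, and PDF payloads. convert_file_block is a hypothetical stand-in for the fixed file branch, and make_block is a hypothetical helper; neither is langchain-core code.

```python
import base64


def convert_file_block(block: dict) -> dict:
    """Hypothetical stand-in for the fixed file branch (not the langchain-core code)."""
    header, _, payload = block["file"]["file_data"].partition(";base64,")
    mime_type = header.removeprefix("data:")
    return {
        "type": "file",
        "base64": payload,
        "mime_type": mime_type,  # taken from the data URI, not hard-coded
        "extras": {"filename": block["file"]["filename"]},
    }


def make_block(filename: str, mime_type: str, raw: bytes) -> dict:
    """Build an OpenAI-style base64 file block for the given payload."""
    encoded = base64.b64encode(raw).decode("ascii")
    return {
        "type": "file",
        "file": {
            "filename": filename,
            "file_data": f"data:{mime_type};base64,{encoded}",
        },
    }


def test_file_block_preserves_mime_type() -> None:
    cases = [
        ("sheet.csv", "text/csv", b"a,b\n1,2\n"),
        ("notes.txt", "text/plain", b"hello"),
        ("doc.pdf", "application/pdf", b"%PDF-1.4"),
    ]
    for filename, mime_type, raw in cases:
        result = convert_file_block(make_block(filename, mime_type, raw))
        assert result["mime_type"] == mime_type
        assert base64.b64decode(result["base64"]) == raw


test_file_block_preserves_mime_type()
```

Keeping the PDF case in the same table guards against a fix that repairs CSV and text inputs but regresses the path the existing tests already cover.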

Changed files

  • libs/core/langchain_core/messages/block_translators/openai.py (modified, +1/-1)
  • libs/core/tests/unit_tests/messages/block_translators/test_openai.py (modified, +43/-0)

Code Example

from langchain_core.messages import HumanMessage

# CSV attached via the OpenAI Chat Completions file block shape
# (same shape OpenAI docs prescribe for file inputs)
msg = HumanMessage(content=[
    {
        "type": "file",
        "file": {
            "filename": "sheet.csv",
            "file_data": "data:text/csv;base64,aGVsbG8=",
        },
    },
])

print(msg.content_blocks)

Output:

[{'type': 'file', 'id': 'lc_...', 'base64': 'aGVsbG8=',
  'mime_type': 'application/pdf',   <-- silently wrong
  'extras': {'filename': 'sheet.csv'}}]
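For code that has to run against an affected version, one possible stopgap is to re-derive the MIME type from the filename kept in extras. The sketch below assumes the block shape shown in the output above; repair_file_mime_types is a hypothetical helper, not a LangChain API, and guessing from the extension is only a heuristic (it depends on the platform's MIME database and cannot recover a type when the filename is missing or misleading).

```python
import mimetypes


def repair_file_mime_types(blocks: list[dict]) -> list[dict]:
    """Return copies of blocks with a PDF label corrected from the filename, where possible."""
    repaired = []
    for block in blocks:
        filename = block.get("extras", {}).get("filename")
        is_pdf_labeled_file = (
            block.get("type") == "file"
            and block.get("mime_type") == "application/pdf"
        )
        if is_pdf_labeled_file and filename:
            guessed, _ = mimetypes.guess_type(filename)
            # Only override when the extension points at something other than PDF.
            if guessed and guessed != "application/pdf":
                block = {**block, "mime_type": guessed}
        repaired.append(block)
    return repaired
```

The helper leaves the input blocks untouched and only rewrites blocks labeled application/pdf, so genuine PDFs and blocks without a filename pass through unchanged.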


System Info

langchain-core: master (also reproduced on 1.2.7)
platform: macOS
python version: 3.13

extent analysis

TL;DR

The most likely fix is to change the create_file_block call in the file branch of _convert_openai_format_to_data_block so that it passes the MIME type parsed from the data URI instead of hard-coding "application/pdf".

Guidance

  • Review _convert_openai_format_to_data_block in libs/core/langchain_core/messages/block_translators/openai.py and confirm how the base64 file branch sets mime_type.
  • In that branch, pass mime_type=parsed["mime_type"] to create_file_block instead of the hard-coded "application/pdf"; create_file_block itself does not need to change.
  • Verify the fix with non-PDF inputs such as CSV and plain text, and check that the resulting content blocks carry the MIME type from the data URI.
  • Consider regression tests for further file types, and keep an application/pdf case so the existing behavior stays covered.

Example

# base64-style file block
if (block["type"] == "file") and (
    parsed := _parse_data_uri(block["file"]["file_data"])
):
    ...
    return types.create_file_block(
        base64=parsed["data"],
        mime_type=parsed["mime_type"],  # use parsed MIME type
        filename=filename,
        **all_extras,
    )

Notes

The fix is a one-line change, and the _parse_data_uri function already guarantees a non-empty mime_type when parsed is truthy, so no extra None check is needed.

Recommendation

Apply the fix by passing parsed["mime_type"] to create_file_block in the file branch. This stops the silent relabeling of non-PDF files, so downstream integrations receive the MIME type declared in the data URI.
