LlamaIndex - [Feature Request]: Multimodal Query Engines

Feature Description

Background

Now that multimodal synthesis is broadly supported, the next goal is to make multimodal retrieval and synthesis pipelines easy to configure from high-level objects such as query engines, where relevant. The function get_response_synthesizer (response_synthesizers/factory.py), which is widely used to configure query engines, can now be updated to accept a multimodal: bool flag plus the matching chat-content templates and chat_prompt_helper.

When multimodal=True, the chosen synthesizer can be wired with RichPromptTemplate-based prompts that iterate over context_messages[].blocks and emit text / image / audio / video blocks individually. BaseSynthesizer.synthesize already auto-branches on self._multimodal and constructs each ChatMessage from node.get_content_blocks(MetadataMode.LLM). All nine ResponseModes are now supported under multimodal=True. This issue tracks which query engines can be updated to expose multimodal=True and which currently cannot, with the rationale for each.
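The branching described above can be sketched without the library. Everything below is a simplified stand-in: TextBlock, ImageBlock, Node, and ChatMessage are hypothetical classes, not the llama_index types, and build_llm_input is an illustrative name. The sketch only shows how a multimodal flag could switch between flattening nodes to text and building one message per node from its content blocks.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TextBlock:
    text: str


@dataclass
class ImageBlock:
    url: str


Block = Union[TextBlock, ImageBlock]


@dataclass
class Node:
    """Hypothetical stand-in for a retrieved node."""
    blocks: List[Block]

    def get_content_blocks(self) -> List[Block]:
        return list(self.blocks)

    def get_text(self) -> str:
        return " ".join(b.text for b in self.blocks if isinstance(b, TextBlock))


@dataclass
class ChatMessage:
    """Hypothetical stand-in for a chat message made of content blocks."""
    role: str
    blocks: List[Block] = field(default_factory=list)


def build_llm_input(nodes: List[Node], multimodal: bool):
    """Branch on the flag: block-based messages, or plain text chunks."""
    if multimodal:
        return [ChatMessage(role="user", blocks=n.get_content_blocks()) for n in nodes]
    return [n.get_text() for n in nodes]


nodes = [Node([TextBlock("A chart of sales."), ImageBlock("file://chart.png")])]
text_input = build_llm_input(nodes, multimodal=False)
message_input = build_llm_input(nodes, multimodal=True)
```

In the text-only branch the image block is silently dropped, while the multimodal branch keeps it next to the text in the same message.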

Good Candidates For Update ✅

Engines that can accept multimodal=True (and the relevant chat-content templates) and plumb them through to get_response_synthesizer:

  1. CitationQueryEngine: full multimodal support. Adds CITATION_CHAT_CONTENT_QA_TEMPLATE / CITATION_CHAT_CONTENT_REFINE_TEMPLATE and a multimodal-aware _create_citation_nodes that uses ChatMessage.split(...) to chunk multimodal source nodes per "Source N:" while preserving image/audio/video blocks.
  2. RetrieverQueryEngine: from_args(multimodal=True, ...) forwards directly to the factory. No internal branching is needed because the engine just delegates to the synthesizer.
  3. RouterQueryEngine + ToolRetrieverRouterQueryEngine: __init__ / from_defaults accept multimodal, summary_template, chat_summary_template, and chat_prompt_helper, which are passed through to the default TreeSummarize. combine_responses / acombine_responses now branch on summarizer._multimodal and route through get_response_from_messages (using chat_summary_template) when multimodal. Caveat: those ChatMessages are built from str(response) of each sub-engine, so the multimodal aggregator prompt is in use but its inputs are still text; media surfaced by sub-engines does not propagate into the aggregation.
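The router caveat can be illustrated with a small, library-free sketch. Response, Message, and combine_inputs are hypothetical names invented for this example; they are not the RouterQueryEngine internals. The only point is that building aggregator inputs from str(response) keeps the text and loses any media attached to the response.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Response:
    """Hypothetical sub-engine response: text plus any media it surfaced."""
    text: str
    media: List[str] = field(default_factory=list)

    def __str__(self) -> str:
        return self.text


@dataclass
class Message:
    """Hypothetical chat message holding text and media references."""
    text: str
    media: List[str] = field(default_factory=list)


def combine_inputs(responses: List[Response]) -> List[Message]:
    # Mirrors the caveat: each message is built from str(response) only,
    # so media attached to the response is not carried over.
    return [Message(text=str(r)) for r in responses]


responses = [
    Response("The logo is blue.", media=["logo.png"]),
    Response("Revenue grew in Q3."),
]
messages = combine_inputs(responses)
```

Propagating media would require building the messages from the responses' source nodes or content blocks rather than from their string form.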

To Be Deprecated

  1. SimpleMultiModalQueryEngine (multi_modal.py) — no longer necessary; superseded by RetrieverQueryEngine.from_args(retriever=..., llm=..., multimodal=True) (or CitationQueryEngine.from_args(...,
    multimodal=True) for cited responses). Migration notes are embedded in the deprecation reason: multi_modal_llm= → llm=; separate text_qa_template / image_qa_template → unified chat_content_qa_template (+ optional chat_content_refine_template).
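The migration notes can be read as a keyword-argument mapping. The helper below is a hypothetical illustration only: migrate_kwargs is not part of the library, and the placeholder string values stand in for real retriever and LLM objects. It renames multi_modal_llm to llm and refuses to guess how two per-modality templates should be merged, since that step needs a hand-written chat_content_qa_template.

```python
from typing import Any, Dict


def migrate_kwargs(legacy: Dict[str, Any]) -> Dict[str, Any]:
    """Translate legacy SimpleMultiModalQueryEngine-style kwargs (illustrative only)."""
    migrated = dict(legacy)
    # multi_modal_llm= becomes llm=
    if "multi_modal_llm" in migrated:
        migrated["llm"] = migrated.pop("multi_modal_llm")
    # The two per-modality templates have no mechanical equivalent: they
    # must be rewritten by hand as one chat_content_qa_template.
    dropped = [k for k in ("text_qa_template", "image_qa_template") if k in migrated]
    for key in dropped:
        migrated.pop(key)
    if dropped and "chat_content_qa_template" not in migrated:
        raise ValueError(
            f"Replace {dropped} with a unified chat_content_qa_template"
        )
    migrated["multimodal"] = True
    return migrated


new_kwargs = migrate_kwargs({"retriever": "my_retriever", "multi_modal_llm": "my_llm"})
```

The resulting dictionary corresponds to the arguments named in the deprecation note: retriever=..., llm=..., multimodal=True.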

Already Deprecated — Skip

  1. RetrieverRouterQueryEngine (router_query_engine.py:264) — docstring marks it deprecated in favor of ToolRetrieverRouterQueryEngine. Doesn't use a synthesizer.
  2. KnowledgeGraphQueryEngine (knowledge_graph_query_engine.py) — already @deprecated.deprecated in favor of PropertyGraphIndex. KG triples are inherently text anyway.
  3. JSONalyzeQueryEngine (jsonalyze/) — moved to llama-index-experimental.
  4. PandasQueryEngine (pandas/) — moved to llama-index-experimental.

Wrappers — Already Inherit Multimodal For Free, No Change Needed

These don't synthesize themselves; they delegate query / synthesize to an inner engine. If the inner engine is multimodal, the wrapper is multimodal.

  1. TransformQueryEngine (transform_query_engine.py:11) — applies a BaseQueryTransform to the query and forwards to self._query_engine.synthesize/query. Pure passthrough.
  2. RetryQueryEngine (retry_query_engine.py:22) — wraps a query_engine, runs an evaluator, retries on the same wrapped engine. No own synthesis.
  3. RetryGuidelineQueryEngine (retry_query_engine.py:78) — same delegation pattern with a guideline-based query transform.
  4. ComposableGraphQueryEngine (graph_query_engine.py:15) — orchestrates per-index sub-engines and calls each one's synthesize. Multimodality is whatever the leaf engines provide.
  5. RetrySourceQueryEngine (retry_source_query_engine.py:24) — wraps a RetrieverQueryEngine, uses self._llm only for evaluation/relevance scoring of sources (a text task). Synthesis runs through the
    wrapped RetrieverQueryEngine, which is now multimodal-aware via from_args(multimodal=True).

For all of these, the user gets multimodal behavior by passing in an already-multimodal inner engine — no API change is needed or useful.
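The delegation pattern behind these wrappers can be sketched as follows. InnerEngine and TransformWrapper are hypothetical stand-ins, not the library classes; the example only shows that a wrapper with no synthesis of its own returns whatever the inner engine produces, media included.

```python
from typing import Callable, List


class InnerEngine:
    """Hypothetical engine whose answers may include media references."""

    def __init__(self, multimodal: bool) -> None:
        self.multimodal = multimodal

    def query(self, text: str) -> dict:
        media: List[str] = ["figure.png"] if self.multimodal else []
        return {"answer": f"answer to: {text}", "media": media}


class TransformWrapper:
    """Hypothetical wrapper: rewrites the query, then delegates synthesis."""

    def __init__(self, engine: InnerEngine, transform: Callable[[str], str]) -> None:
        self._engine = engine
        self._transform = transform

    def query(self, text: str) -> dict:
        # No synthesis here; the inner engine decides what the answer contains.
        return self._engine.query(self._transform(text))


wrapped = TransformWrapper(InnerEngine(multimodal=True), str.strip)
result = wrapped.query("  what does the figure show?  ")
```

Swapping in InnerEngine(multimodal=False) changes the output without touching the wrapper, which is why these classes need no multimodal parameter of their own.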

Possible Future Candidates: Easy Plumb-Through, Limited Semantic Gain

  1. SubQuestionQueryEngine (sub_question_query_engine.py:37): from_defaults builds the synthesizer via get_response_synthesizer. Same caveat as MultiStepQueryEngine (below): each sub-question's response is wrapped as a plain text node before final synthesis, so the synthesizer would only ever see text. Adding multimodal=True would let users pick a chat-content RichPromptTemplate for the aggregator, but no media would flow through. Could be done for API consistency; minimal semantic value. Deferred from this PR.
  2. MultiStepQueryEngine: __init__ accepts multimodal, chat_content_qa_template, chat_content_refine_template. Caveat: _query_multistep aggregates each sub-step as TextNode(text=...), so even with multimodal=True the final synthesizer only sees the text Q&A pairs from sub-queries; media is preserved on additional_source_nodes for attribution but does not reach the LLM input. multimodal=True here mainly enables using a chat-content RichPromptTemplate for the aggregator prompt.
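The aggregation caveat shared by both engines can be sketched in isolation. SourceNode, StepResult, and aggregate are hypothetical names for this example, and the "Question: / Answer:" text format is an assumption rather than the library's exact string. The sketch shows sub-step results being flattened to text-only nodes for the final synthesizer while the original sources, media included, are kept separately for attribution.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SourceNode:
    """Hypothetical source node; media holds image/audio/video references."""
    text: str
    media: List[str] = field(default_factory=list)


@dataclass
class StepResult:
    question: str
    answer: str
    sources: List[SourceNode]


def aggregate(steps: List[StepResult]) -> Tuple[List[SourceNode], List[SourceNode]]:
    """Return (nodes fed to the final synthesizer, nodes kept for attribution)."""
    synth_nodes: List[SourceNode] = []
    attribution: List[SourceNode] = []
    for step in steps:
        # Each sub-step is flattened to a text-only Q&A node.
        synth_nodes.append(
            SourceNode(text=f"Question: {step.question}\nAnswer: {step.answer}")
        )
        # Original sources, media included, are kept only for attribution.
        attribution.extend(step.sources)
    return synth_nodes, attribution


steps = [StepResult("What is shown?", "A bar chart.", [SourceNode("fig", ["chart.png"])])]
synth_nodes, attribution = aggregate(steps)
```

Under this shape, a multimodal prompt template changes how the aggregator is prompted but not what it sees, which matches the limited semantic gain described above.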

Not Candidates: Wrong Synthesis Shape, Fundamentally Text-Based, and/or External API

These bypass get_response_synthesizer entirely and call the LLM directly on inherently text tasks (SQL strings, code, JSON, lookahead reasoning), or go through an external service. Adding a multimodal flag doesn't fit the domain.

  1. SQLJoinQueryEngine (sql_join_query_engine.py:167) — calls self._llm.predict / self._llm.stream to fuse a SQL result with a vector answer. SQL results are tabular text.
  2. SQLAutoVectorQueryEngine (sql_vector_query_engine.py:53) — subclasses SQLJoinQueryEngine. Same shape.
  3. FLAREInstructQueryEngine (flare/base.py:98) — iterative lookahead generation via self._llm.predict. Internal loop is text-only by design.
  4. SQLAugmentQueryTransform (sql_join_query_engine.py:109) — query transform, not an engine; transforms the natural-language query via LLM.
  5. CogniswitchQueryEngine (cogniswitch_query_engine.py:9) — proxies queries to the external Cogniswitch HTTP API. No local LLM call to add multimodal to.

No Update Needed: User-Defined

  1. CustomQueryEngine (custom.py:16) — abstract BaseModel + BaseQueryEngine requiring custom_query from subclasses. The user owns the synthesis path; we have nothing to plumb.

Reason

No response

Value of Feature

No response
