llamaIndex: [Feature Request]: Multimodal Query Engines

StepCodex · 2026-05-21T20:16:29Z


Feature Description

Background

Now that multimodal synthesis is broadly supported, the next goal is to make multimodal retrieval and synthesis pipelines easily configurable from high-level objects such as query engines, where relevant. The function get_response_synthesizer (response_synthesizers/factory.py), which is widely used to configure query engines, can now be updated to accept a multimodal: bool flag plus the matching chat-content templates and chat_prompt_helper.

When multimodal=True, the chosen synthesizer can be wired with RichPromptTemplate-based prompts that iterate over context_messages[].blocks and emit text / image / audio / video blocks individually. BaseSynthesizer.synthesize already auto-branches on self._multimodal and constructs each ChatMessage from node.get_content_blocks(MetadataMode.LLM). All nine ResponseModes are now supported under multimodal=True. This issue tracks which query engines can be updated to expose multimodal=True and which currently cannot, with the rationale for each.
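The branching described above can be illustrated without the library. The following is a minimal, stdlib-only sketch: TextBlock, ImageBlock, Node, ChatMessage, and build_context are hypothetical stand-ins written for this example, not the llama_index classes, and the real synthesizer logic may differ.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TextBlock:
    text: str


@dataclass
class ImageBlock:
    url: str


Block = Union[TextBlock, ImageBlock]


@dataclass
class Node:
    # Toy stand-in for a retrieved node holding mixed content.
    blocks: List[Block] = field(default_factory=list)

    def get_content_blocks(self) -> List[Block]:
        return list(self.blocks)

    def get_text(self) -> str:
        return " ".join(b.text for b in self.blocks if isinstance(b, TextBlock))


@dataclass
class ChatMessage:
    role: str
    blocks: List[Block]


def build_context(nodes: List[Node], multimodal: bool) -> Union[List[ChatMessage], List[str]]:
    """Branch on the flag: one message per node with all blocks, or plain text chunks."""
    if multimodal:
        return [ChatMessage(role="user", blocks=n.get_content_blocks()) for n in nodes]
    return [n.get_text() for n in nodes]


nodes = [Node([TextBlock("A chart of Q3 revenue."), ImageBlock("file:///tmp/q3.png")])]
text_context = build_context(nodes, multimodal=False)
message_context = build_context(nodes, multimodal=True)
```

In the text-only branch the image block is dropped; in the multimodal branch it survives as a separate block that a chat-content template could iterate over.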

Good Candidates For Update ✅

Engines that can accept multimodal=True (and the relevant chat-content templates) and plumb them through to get_response_synthesizer:

1. CitationQueryEngine: full multimodal support. Adds CITATION_CHAT_CONTENT_QA_TEMPLATE / CITATION_CHAT_CONTENT_REFINE_TEMPLATE and a multimodal-aware _create_citation_nodes that uses ChatMessage.split(...) to chunk multimodal source nodes per "Source N:" while preserving image/audio/video blocks.
2. RetrieverQueryEngine: from_args(multimodal=True, ...) forwards directly to the factory. No internal branching is needed because the engine just delegates to the synthesizer.
3. RouterQueryEngine + ToolRetrieverRouterQueryEngine: __init__ / from_defaults accept multimodal, summary_template, chat_summary_template, chat_prompt_helper, which are passed through to the default TreeSummarize. combine_responses / acombine_responses now branch on summarizer._multimodal and route through get_response_from_messages (using chat_summary_template) when multimodal. Caveat: those ChatMessages are built from str(response) of each sub-engine, so the multimodal aggregator prompt is in use but its inputs are still text; media surfaced by sub-engines does not propagate into the aggregation.
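The router caveat can be made concrete with a small sketch. ToyResponse and aggregator_inputs are hypothetical names used only for illustration; the point is that anything built from str(response) carries text and nothing else.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyResponse:
    # Hypothetical stand-in for a sub-engine response that may carry media.
    text: str
    image_urls: List[str] = field(default_factory=list)

    def __str__(self) -> str:
        return self.text


def aggregator_inputs(responses: List[ToyResponse]) -> List[str]:
    """Mimic building aggregation inputs from str(response) only."""
    return [str(r) for r in responses]


responses = [
    ToyResponse("The diagram shows three services.", ["file:///tmp/arch.png"]),
    ToyResponse("Latency dropped after the cache was added."),
]
inputs = aggregator_inputs(responses)

# Check whether any media reference survived the str() conversion.
joined = " ".join(inputs)
media_reaching_aggregator = [u for r in responses for u in r.image_urls if u in joined]
```

The first sub-engine returned an image, but the aggregator only ever sees the two strings.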

To Be Deprecated

1. SimpleMultiModalQueryEngine (multi_modal.py): no longer necessary; superseded by RetrieverQueryEngine.from_args(retriever=..., llm=..., multimodal=True) (or CitationQueryEngine.from_args(..., multimodal=True) for cited responses). Migration notes are embedded in the deprecation reason: multi_modal_llm= → llm=; separate text_qa_template / image_qa_template → unified chat_content_qa_template (+ optional chat_content_refine_template).
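As a rough aid for call-site migration, the keyword changes listed above could be captured in a helper like the one below. migrate_kwargs is a hypothetical function, not part of llama_index; it only renames multi_modal_llm to llm and flags the two old templates for manual merging, since two separate templates cannot be combined mechanically.

```python
from typing import Any, Dict, List, Tuple

# Keyword arguments named in the migration notes above.
RENAMED = {"multi_modal_llm": "llm"}
NEEDS_MANUAL_MERGE = ("text_qa_template", "image_qa_template")


def migrate_kwargs(old: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (new_kwargs, todo_notes) for a hypothetical call-site migration."""
    new: Dict[str, Any] = {"multimodal": True}
    todo: List[str] = []
    for key, value in old.items():
        if key in RENAMED:
            new[RENAMED[key]] = value
        elif key in NEEDS_MANUAL_MERGE:
            # Old per-modality templates are dropped and reported instead.
            todo.append(f"fold {key} into chat_content_qa_template")
        else:
            new[key] = value
    return new, todo


new_kwargs, todo = migrate_kwargs(
    {"retriever": "my_retriever", "multi_modal_llm": "my_llm", "text_qa_template": "..."}
)
```

The string values stand in for real retriever, LLM, and template objects; the resulting dictionary is what one would pass to RetrieverQueryEngine.from_args once the flagged templates have been rewritten by hand.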

Already Deprecated — Skip

1. RetrieverRouterQueryEngine (router_query_engine.py:264): docstring marks it deprecated in favor of ToolRetrieverRouterQueryEngine. Doesn't use a synthesizer.
2. KnowledgeGraphQueryEngine (knowledge_graph_query_engine.py): already marked @deprecated.deprecated in favor of PropertyGraphIndex. KG triples are inherently text anyway.
3. JSONalyzeQueryEngine (jsonalyze/): moved to llama-index-experimental.
4. PandasQueryEngine (pandas/): moved to llama-index-experimental.

Wrappers — Already Inherit Multimodal For Free, No Change Needed

These don't synthesize themselves; they delegate query / synthesize to an inner engine. If the inner engine is multimodal, the wrapper is multimodal.

1. TransformQueryEngine (transform_query_engine.py:11): applies a BaseQueryTransform to the query and forwards to self._query_engine.synthesize/query. Pure passthrough.
2. RetryQueryEngine (retry_query_engine.py:22): wraps a query_engine, runs an evaluator, retries on the same wrapped engine. No synthesis of its own.
3. RetryGuidelineQueryEngine (retry_query_engine.py:78): same delegation pattern with a guideline-based query transform.
4. ComposableGraphQueryEngine (graph_query_engine.py:15): orchestrates per-index sub-engines and calls each one's synthesize. Multimodality is whatever the leaf engines provide.
5. RetrySourceQueryEngine (retry_source_query_engine.py:24): wraps a RetrieverQueryEngine and uses self._llm only for evaluation/relevance scoring of sources (a text task). Synthesis runs through the wrapped RetrieverQueryEngine, which is now multimodal-aware via from_args(multimodal=True).

For all of these, the user gets multimodal behavior by passing in an already-multimodal inner engine — no API change is needed or useful.
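The delegation pattern behind these wrappers can be sketched in a few lines. InnerEngine and RetryWrapper are toy classes invented for this example, not the library's RetryQueryEngine; the sketch only shows that the wrapper returns whatever the inner engine produces, media included.

```python
from typing import Callable


class InnerEngine:
    """Toy engine whose answers may carry media."""

    def __init__(self, multimodal: bool) -> None:
        self.multimodal = multimodal
        self.calls = 0

    def query(self, q: str) -> dict:
        self.calls += 1
        media = ["file:///tmp/fig.png"] if self.multimodal else []
        return {"text": f"answer {self.calls} to {q}", "media": media}


class RetryWrapper:
    """Delegates to the inner engine and retries until the evaluator passes."""

    def __init__(
        self,
        engine: InnerEngine,
        evaluator: Callable[[dict], bool],
        max_retries: int = 3,
    ) -> None:
        self._engine = engine
        self._evaluator = evaluator
        self._max_retries = max_retries

    def query(self, q: str) -> dict:
        response = self._engine.query(q)
        for _ in range(self._max_retries):
            if self._evaluator(response):
                break
            # No synthesis here: just ask the same inner engine again.
            response = self._engine.query(q)
        return response


inner = InnerEngine(multimodal=True)
# Hypothetical evaluator: reject the first attempt, accept the second.
wrapper = RetryWrapper(inner, evaluator=lambda r: r["text"].startswith("answer 2"))
result = wrapper.query("what does the figure show?")
```

Because the wrapper never builds a response itself, swapping in a multimodal inner engine is the only change required.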

Possible Future Candidates: Easy Plumb-Through, Limited Semantic Gain

1. SubQuestionQueryEngine (sub_question_query_engine.py:37): from_defaults builds the synthesizer via get_response_synthesizer. Same caveat as MultiStepQueryEngine (below): each sub-question's response is wrapped as a plain text node before final synthesis, so the synthesizer would only ever see text. Adding multimodal=True would let users pick a chat-content RichPromptTemplate for the aggregator, but no media would flow through. Could be done for API consistency; minimal semantic value. Deferred from this PR.
2. MultiStepQueryEngine: __init__ accepts multimodal, chat_content_qa_template, chat_content_refine_template. Caveat: _query_multistep aggregates each sub-step as TextNode(text=...), so even with multimodal=True the final synthesizer only sees the text Q&A pairs from sub-queries; media is preserved on additional_source_nodes for attribution but does not reach the LLM input. multimodal=True here mainly enables using a chat-content RichPromptTemplate for the aggregator prompt.
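The split between what the final synthesizer sees and what is kept for attribution can be sketched as follows. ToyNode and run_steps are hypothetical, and the "Question/Answer" text layout is an assumption made for the example rather than the engine's actual format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyNode:
    text: str
    image_url: Optional[str] = None


def run_steps(
    steps: List[Tuple[str, str, List[ToyNode]]]
) -> Tuple[List[ToyNode], List[ToyNode]]:
    """Each step is (sub_question, answer_text, retrieved_nodes)."""
    synth_nodes: List[ToyNode] = []
    additional_source_nodes: List[ToyNode] = []
    for question, answer, retrieved in steps:
        # The sub-step is flattened to a text-only Q&A node for the final synthesizer.
        synth_nodes.append(ToyNode(text=f"Question: {question}\nAnswer: {answer}"))
        # Retrieved nodes, media included, are kept only for attribution.
        additional_source_nodes.extend(retrieved)
    return synth_nodes, additional_source_nodes


steps = [
    ("What is in the photo?", "A red bridge.", [ToyNode("photo caption", "file:///tmp/bridge.jpg")])
]
synth_nodes, sources = run_steps(steps)
```

The image is still reachable from the source list, but none of the nodes handed to the final synthesizer carries it.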

Not Candidates: Wrong Synthesis Shape, Fundamentally Text-Based, and/or External API

These bypass get_response_synthesizer entirely and call the LLM directly on inherently text tasks (SQL strings, code, JSON, lookahead reasoning), or go through an external service. Adding a multimodal flag doesn't fit the domain.

1. SQLJoinQueryEngine (sql_join_query_engine.py:167): calls self._llm.predict / self._llm.stream to fuse a SQL result with a vector answer. SQL results are tabular text.
2. SQLAutoVectorQueryEngine (sql_vector_query_engine.py:53): subclasses SQLJoinQueryEngine. Same shape.
3. FLAREInstructQueryEngine (flare/base.py:98): iterative lookahead generation via self._llm.predict. Internal loop is text-only by design.
4. SQLAugmentQueryTransform (sql_join_query_engine.py:109): query transform, not an engine; transforms the natural-language query via LLM.
5. CogniswitchQueryEngine (cogniswitch_query_engine.py:9): proxies queries to the external Cogniswitch HTTP API. No local LLM call to add multimodal to.

No Update Needed: User-Defined

1. CustomQueryEngine (custom.py:16): abstract BaseModel + BaseQueryEngine requiring custom_query from subclasses. The user owns the synthesis path; we have nothing to plumb.

Reason

No response

Value of Feature

No response
