litellm - ✅(Solved) Fix [Bug]: TTFT not captured for /v1/messages requests (Anthropic, Bedrock, Vertex AI, Azure AI, Minimax) [1 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
BerriAI/litellm#25598Fetched 2026-04-12 13:24:45
View on GitHub
Comments
0
Participants
1
Timeline
6
Reactions
0
Participants
Timeline (top)
labeled ×3referenced ×2cross-referenced ×1

Root Cause

Suspected Root Cause

Fix Action

Fixed

PR fix notes

PR #25599: fix(streaming): capture TTFT for /v1/messages (Anthropic, Bedrock, Vertex AI)

Description (problem / solution / changelog)

The pass-through streaming path for /v1/messages logged completion_start_time only after the full stream finished. async_success_handler then fell back to end_time, making TTFT equal to total duration or null in the UI and Prometheus.

Record the timestamp of the first chunk in async_sse_wrapper and propagate it to litellm_logging_obj.completion_start_time / model_call_details before the logging handler runs.

Fixes #25598

Changed files

  • litellm/llms/anthropic/experimental_pass_through/messages/streaming_iterator.py (modified, +12/-0)

Code Example

# litellm/llms/anthropic/experimental_pass_through/messages/streaming_iterator.py
async def async_sse_wrapper(self, completion_stream):
    collected_chunks = []
    async for chunk in completion_stream:
        collected_chunks.append(self._convert_chunk_to_sse_format(chunk))
        yield chunk                                    # chunks go to client here

    await self._handle_streaming_logging(collected_chunks)  # logging fires after last chunk

async def _handle_streaming_logging(self, collected_chunks):
    end_time = datetime.now()                          # captured AFTER all chunks sent
    asyncio.create_task(
        PassThroughStreamingHandler._route_streaming_logging_to_handler(
            ...
            end_time=end_time,
        )
    )

---

# litellm/litellm_core_utils/litellm_logging.py
if self.completion_start_time is None:
    self.completion_start_time = end_time   # falls back to end_time, not first-chunk time
    self.model_call_details["completion_start_time"] = self.completion_start_time

---

# litellm/litellm_core_utils/streaming_handler.py
if self.logging_obj.completion_start_time is None:
    self.logging_obj._update_completion_start_time(
        completion_start_time=datetime.datetime.now()
    )

---
RAW_BUFFERClick to expand / collapse

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

Bug Description

Time to First Token (TTFT) is never correctly captured for requests routed through the /v1/messages endpoint when using providers that have a native BaseAnthropicMessagesConfig implementation (Anthropic, Bedrock, Vertex AI Claude, Azure AI Claude, Minimax).

The result is:

  1. Usage logs UI: TTFT column shows - (null) or displays a value equal to the total request duration — neither is the real first-token time.

    <img width="1484" height="418" alt="Image" src="https://github.com/user-attachments/assets/261e67c6-92e7-4451-8993-44d8ecc0f305" />
  2. litellm_llm_api_time_to_first_token_metric (Prometheus): Emitted for streaming requests, but the value equals the full API call duration, not the actual TTFT.

    <img width="2860" height="1034" alt="Image" src="https://github.com/user-attachments/assets/ef06a2fe-ee12-49f7-94ae-22f4ee0f48ef" />
<img width="2864" height="1087" alt="Image" src="https://github.com/user-attachments/assets/c3dfa551-0ef3-4a10-b6e1-66656de6f98a" />

Expected Behaviour

completion_start_time should be set to the timestamp when the first non-empty chunk is received from the upstream provider, consistent with how the standard /chat/completions streaming path works.

Actual Behaviour

completion_start_time is set to end_time (after the last chunk), making reported TTFT equal to total duration.

Affected Providers

Any provider that returns a non-None value from ProviderConfigManager.get_provider_anthropic_messages_config() uses the pass-through path and is affected:

ProviderScope
anthropicAll models
bedrockClaude models only
vertex_aiClaude models only
azure_aiClaude models only
minimaxAll models

Providers not affected: any provider that returns None from get_provider_anthropic_messages_config (e.g. Mistral, Cohere, Groq) falls through to LiteLLMMessagesToCompletionTransformationHandler, which uses the standard CustomStreamWrapper and correctly captures completion_start_time on the first streaming chunk.

Suspected Root Cause

The /v1/messages endpoint for the affected providers routes through a pass-through streaming pipeline (BaseAnthropicMessagesStreamingIterator.async_sse_wrapper) that collects all chunks and fires a single logging call after the stream is complete:

# litellm/llms/anthropic/experimental_pass_through/messages/streaming_iterator.py
async def async_sse_wrapper(self, completion_stream):
    collected_chunks = []
    async for chunk in completion_stream:
        collected_chunks.append(self._convert_chunk_to_sse_format(chunk))
        yield chunk                                    # chunks go to client here

    await self._handle_streaming_logging(collected_chunks)  # logging fires after last chunk

async def _handle_streaming_logging(self, collected_chunks):
    end_time = datetime.now()                          # captured AFTER all chunks sent
    asyncio.create_task(
        PassThroughStreamingHandler._route_streaming_logging_to_handler(
            ...
            end_time=end_time,
        )
    )

Because the logging call only receives end_time (after streaming completes), completion_start_time is never set during the stream. Inside async_success_handler, the fallback triggers:

# litellm/litellm_core_utils/litellm_logging.py
if self.completion_start_time is None:
    self.completion_start_time = end_time   # falls back to end_time, not first-chunk time
    self.model_call_details["completion_start_time"] = self.completion_start_time

completion_start_time ends up equal to end_time, so:

  • In the DB: completionStartTime ≈ endTimeTTFT ≈ Duration or "-" (UI filters it)
  • In Prometheus: litellm_llm_api_time_to_first_token_metric = completion_start_time − api_call_start_time = full API call duration

Contrast with the standard /chat/completions streaming path, where _update_completion_start_time() is called on the first chunk:

# litellm/litellm_core_utils/streaming_handler.py
if self.logging_obj.completion_start_time is None:
    self.logging_obj._update_completion_start_time(
        completion_start_time=datetime.datetime.now()
    )

Steps to Reproduce

  1. Configure LiteLLM proxy with a Bedrock Claude model (or any other affected provider above).
  2. Send a streaming request to /v1/messages.
  3. Observe the usage logs UI: the TTFT column is either - or shows a value equal to the Duration column.
  4. Observe Prometheus: litellm_llm_api_time_to_first_token_metric value matches litellm_llm_api_latency_metric rather than being a fraction of it.

Related

  • PR #9688 / Issue #9210 — fixed the same completion_start_time = end_time fallback for /chat/completions streaming, but the /v1/messages pass-through path was not covered.

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.82.3-stable

Twitter / LinkedIn details

No response

extent analysis

TL;DR

Update the /v1/messages endpoint's pass-through streaming pipeline to set completion_start_time when the first non-empty chunk is received from the upstream provider.

Guidance

  • Identify the async_sse_wrapper function in streaming_iterator.py and modify it to set completion_start_time when the first chunk is received.
  • Update the _handle_streaming_logging function to accept completion_start_time as a parameter and use it for logging instead of end_time.
  • Verify that completion_start_time is correctly set by checking the usage logs UI and Prometheus metrics after applying the changes.
  • Consider backporting the fix to earlier versions of LiteLLM to ensure consistency across different releases.

Example

async def async_sse_wrapper(self, completion_stream):
    collected_chunks = []
    completion_start_time = None
    async for chunk in completion_stream:
        collected_chunks.append(self._convert_chunk_to_sse_format(chunk))
        if completion_start_time is None and chunk:
            completion_start_time = datetime.now()
        yield chunk

    await self._handle_streaming_logging(collected_chunks, completion_start_time)

async def _handle_streaming_logging(self, collected_chunks, completion_start_time):
    # Use completion_start_time for logging instead of end_time
    asyncio.create_task(
        PassThroughStreamingHandler._route_streaming_logging_to_handler(
            ...
            completion_start_time=completion_start_time,
        )
    )

Notes

The provided code snippet assumes that the completion_stream yields chunks in the correct order and that the first non-empty chunk is the one that should trigger the setting of completion_start_time. Additional error handling and edge cases may need to be considered depending on the specific requirements of the LiteLLM proxy.

Recommendation

Apply the workaround by updating the /v1/messages endpoint's pass-through streaming pipeline to set completion_start_time when the first non-empty chunk is received from the upstream provider. This should fix the issue with incorrect TTFT values in the usage logs UI and Prometheus metrics.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING