LlamaIndex - ✅ (Solved) Fix [Feature Request]: Add token usage tracking to GoogleGenAI structured_predict methods [1 pull request, 4 comments, 3 participants]

GitHub stats
run-llama/llama_index#21106 · Fetched 2026-04-08 01:08:08
Comments: 4
Participants: 3
Timeline: 7
Reactions: 0
Timeline (top): commented ×4, labeled ×2, cross-referenced ×1

Fix Action

Fix / Workaround

  1. Workaround using chat() - While technically possible to pass generation_config={"response_schema": MyModel} to chat(), this requires manual JSON parsing, loses the convenience of structured_predict() returning typed objects, is undocumented, and creates API inconsistency.

PR fix notes

PR #21135: feat(GoogleGenAI): add token tracking for GoogleGenAI structured predict methods

Description (problem / solution / changelog)

Description

Adds token usage tracking to all structured prediction methods in the GoogleGenAI LLM integration:

  • structured_predict
  • astructured_predict
  • stream_structured_predict
  • astream_structured_predict

Previously, these methods returned only the parsed Pydantic model and discarded usage_metadata from the Gemini API response. This created a gap in observability and cost tracking compared to chat() and complete() methods, which already expose token usage.

This change introduces a shared utility (extract_token_usage_from_response) to extract token counts (prompt_tokens, completion_tokens, total_tokens, and thoughts_token_count when available) and attaches them to the active OpenTelemetry span via llm.token_usage.* attributes. This brings structured prediction methods to parity with existing LLM methods in terms of tracing and monitoring.
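
The PR diff is not reproduced on this page, so the following is only a minimal sketch of what a helper like extract_token_usage_from_response might do. The function name comes from the PR description; the signature, the returned dict layout, and the SimpleNamespace stand-in for a Gemini response are assumptions made for illustration.

```python
from types import SimpleNamespace
from typing import Any, Dict

# Maps Gemini usage_metadata fields to the key names used in the issue.
_FIELD_MAP = {
    "prompt_token_count": "prompt_tokens",
    "candidates_token_count": "completion_tokens",
    "total_token_count": "total_tokens",
    "thoughts_token_count": "thoughts_token_count",
}


def extract_token_usage_from_response(response: Any) -> Dict[str, int]:
    """Return the token counts present on response.usage_metadata.

    Assumed signature: the real helper in PR #21135 may differ.
    Missing metadata and missing/None fields are skipped, not zero-filled.
    """
    usage = getattr(response, "usage_metadata", None)
    if usage is None:
        return {}
    counts: Dict[str, int] = {}
    for source_field, key in _FIELD_MAP.items():
        value = getattr(usage, source_field, None)
        if value is not None:
            counts[key] = value
    return counts


# Stand-in for a Gemini response; a real one would come from generate_content().
fake_response = SimpleNamespace(
    usage_metadata=SimpleNamespace(
        prompt_token_count=12,
        candidates_token_count=30,
        total_token_count=50,
        thoughts_token_count=8,
    )
)
usage = extract_token_usage_from_response(fake_response)
```

Skipping absent fields rather than reporting zero keeps "not reported" distinguishable from "zero tokens", which matters for thoughts_token_count on models that do not reason.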

Streaming variants capture token usage from the final response chunk to ensure accurate reporting.
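
Reading usage from the final chunk could look roughly like the sketch below. UsageCapturingStream is a hypothetical wrapper, not part of LlamaIndex or the Gemini SDK, and the chunks are SimpleNamespace stand-ins; it assumes only that some chunks carry usage_metadata and that the last one seen is the most complete.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator, Optional


class UsageCapturingStream:
    """Wraps a chunk iterator and remembers the last usage_metadata seen."""

    def __init__(self, chunks: Iterable[Any]) -> None:
        self._chunks = chunks
        self.final_usage: Optional[Any] = None

    def __iter__(self) -> Iterator[str]:
        for chunk in self._chunks:
            usage = getattr(chunk, "usage_metadata", None)
            if usage is not None:
                # Later chunks overwrite earlier ones, so the value left
                # after iteration comes from the final reporting chunk.
                self.final_usage = usage
            yield chunk.text


# Stand-in chunks: only the last one reports usage.
chunks = [
    SimpleNamespace(text='{"name": ', usage_metadata=None),
    SimpleNamespace(
        text='"Ada"}',
        usage_metadata=SimpleNamespace(
            prompt_token_count=9,
            candidates_token_count=5,
            total_token_count=14,
        ),
    ),
]
stream = UsageCapturingStream(chunks)
text = "".join(stream)
```

The usage is only meaningful once the stream has been fully consumed, which is why it is exposed as an attribute rather than yielded alongside each chunk.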

Motivation:

  • Restore observability parity across all LLM interfaces
  • Enable accurate cost tracking for structured prediction workflows
  • Support reasoning-aware models by exposing thoughts_token_count
  • Ensure compatibility with downstream tools (e.g. tracing/callback systems)

No additional dependencies were introduced.

Fixes #21106

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • [] I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

Changed files

  • llama-index-integrations/llms/llama-index-llms-google-genai/llama_index/llms/google_genai/base.py (modified, +72/-0)
  • llama-index-integrations/llms/llama-index-llms-google-genai/llama_index/llms/google_genai/utils.py (modified, +21/-0)
  • llama-index-integrations/llms/llama-index-llms-google-genai/tests/test_base_cleanup.py (modified, +221/-0)

Code Example

if response.usage_metadata:
    raw["usage_metadata"] = response.usage_metadata.model_dump()
    additional_kwargs["prompt_tokens"] = response.usage_metadata.prompt_token_count
    additional_kwargs["completion_tokens"] = response.usage_metadata.candidates_token_count
    additional_kwargs["total_tokens"] = response.usage_metadata.total_token_count

---

# response.usage_metadata exists but is discarded
if isinstance(response.parsed, BaseModel):
    return response.parsed  # No token metadata attached

Feature Description

The GoogleGenAI LLM integration should expose token usage metadata for structured prediction methods (structured_predict, astructured_predict, stream_structured_predict, astream_structured_predict).

Token tracking works correctly for chat(), achat(), complete(), and acomplete() via the chat_from_gemini_response() utility function which extracts usage_metadata and populates additional_kwargs with prompt_tokens, completion_tokens, and total_tokens.

However, structured prediction methods bypass this utility and return only the parsed Pydantic model, discarding all token usage information from the API response.

Expected behavior: Token usage should be accessible for all LLM methods, including structured predictions.

| Method | Returns | Token Tracking? | Raw Response Access? |
| --- | --- | --- | --- |
| chat() | ChatResponse | ✅ Yes | response.raw |
| achat() | ChatResponse | ✅ Yes | response.raw |
| complete() | CompletionResponse | ✅ Yes | response.raw |
| acomplete() | CompletionResponse | ✅ Yes | response.raw |
| stream_chat() | ChatResponseGen | ✅ Yes | response.raw |
| astream_chat() | ChatResponseAsyncGen | ✅ Yes | response.raw |
| structured_predict() | Model (Pydantic) | ❌ No | ❌ No |
| astructured_predict() | Model (Pydantic) | ❌ No | ❌ No |
| stream_structured_predict() | Model (yielded) | ❌ No | ❌ No |
| astream_structured_predict() | Model (yielded) | ❌ No | ❌ No |

<details> <summary>📁 Code reference</summary>

Working implementation: chat_from_gemini_response() in utils.py lines 167-178

if response.usage_metadata:
    raw["usage_metadata"] = response.usage_metadata.model_dump()
    additional_kwargs["prompt_tokens"] = response.usage_metadata.prompt_token_count
    additional_kwargs["completion_tokens"] = response.usage_metadata.candidates_token_count
    additional_kwargs["total_tokens"] = response.usage_metadata.total_token_count

Missing implementation: structured_predict() in base.py lines 584-644

# response.usage_metadata exists but is discarded
if isinstance(response.parsed, BaseModel):
    return response.parsed  # No token metadata attached
</details>

Reason

What is stopping LlamaIndex from supporting this feature today?

The structured_predict() implementation calls self._client.models.generate_content() directly and returns response.parsed without extracting usage_metadata from the response object. The data exists in the response, but is discarded.

What existing approaches have not worked for you?

  1. Phoenix/Arize observability - Transactions using structured_predict() appear in traces without token counts, making it impossible to benchmark across methods.

  2. TokenCountingHandler - Cannot count tokens for structured predictions, breaking cost analysis.

  3. Workaround using chat() - While technically possible to pass generation_config={"response_schema": MyModel} to chat(), this requires manual JSON parsing, loses the convenience of structured_predict() returning typed objects, is undocumented, and creates API inconsistency.

  4. Thinking models - Gemini 3.1, 3, and 3 Flash Lite have reasoning/thinking capabilities with thoughts_token_count. Engineers are unaware that structured predictions have untracked thinking tokens.
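
To make the cost of the chat() workaround in item 3 concrete, here is a rough sketch of the manual parsing it implies. Nothing here calls LlamaIndex: FakeChatResponse is a hypothetical stand-in for a chat response, Invoice is a dataclass standing in for the Pydantic model, and only the additional_kwargs key names are taken from the issue.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class Invoice:
    """Stand-in for the Pydantic model structured_predict() would return."""
    vendor: str
    total: float


@dataclass
class FakeChatResponse:
    """Hypothetical stand-in for a chat response: JSON text plus token counts."""
    content: str
    additional_kwargs: Dict[str, Any] = field(default_factory=dict)


def parse_with_usage(response: FakeChatResponse) -> Tuple[Invoice, Dict[str, Any]]:
    """Manually parse the JSON body and read token counts from additional_kwargs."""
    payload = json.loads(response.content)
    model = Invoice(**payload)  # raises TypeError on missing or unexpected keys
    usage = {
        key: response.additional_kwargs.get(key)
        for key in ("prompt_tokens", "completion_tokens", "total_tokens")
    }
    return model, usage


response = FakeChatResponse(
    content='{"vendor": "Acme", "total": 41.5}',
    additional_kwargs={"prompt_tokens": 20, "completion_tokens": 11, "total_tokens": 31},
)
invoice, usage = parse_with_usage(response)
```

Every caller has to repeat this parsing and validation step, which is the convenience that structured_predict() otherwise provides.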


Value of Feature

  1. Observability parity - Phoenix OpenInference traces for structured predictions currently lack token counts, making it difficult for engineers to benchmark across different methods.

  2. Cost tracking - Teams allocating costs by token usage cannot accurately track structured prediction usage.

  3. Thinking/reasoning models - Modern Gemini models (3.1, 3, 3 Flash Lite) perform reasoning with thoughts_token_count. Without tracking, engineers cannot measure reasoning token efficiency or optimize prompt strategies.

  4. API consistency - Users expect all LLM methods to return consistent metadata. The current gap creates confusion—why does chat() show token counts but structured_predict() doesn't?

  5. Downstream tool compatibility - Tools like MLFlow, Phoenix, and custom callback handlers expect token counts in additional_kwargs. Structured predictions break this contract.
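
As an illustration of the cost-tracking point, the sketch below accumulates token counts across calls and turns them into an estimate. The prices are made-up placeholders, UsageTotals is a hypothetical helper, and billing thinking tokens at the output rate is an assumption to check against the provider's pricing.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class UsageTotals:
    """Running totals of token counts across several LLM calls."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    thoughts_tokens: int = 0

    def add(self, usage: Dict[str, Any]) -> None:
        # Missing or None counts are treated as zero for aggregation.
        self.prompt_tokens += usage.get("prompt_tokens") or 0
        self.completion_tokens += usage.get("completion_tokens") or 0
        self.thoughts_tokens += usage.get("thoughts_token_count") or 0

    def cost(self, input_price_per_million: float, output_price_per_million: float) -> float:
        # Assumption: thinking tokens are billed like output tokens.
        billed_output = self.completion_tokens + self.thoughts_tokens
        return (
            self.prompt_tokens * input_price_per_million
            + billed_output * output_price_per_million
        ) / 1_000_000


totals = UsageTotals()
totals.add({"prompt_tokens": 1000, "completion_tokens": 200, "thoughts_token_count": 300})
totals.add({"prompt_tokens": 500, "completion_tokens": 100})
estimated = totals.cost(input_price_per_million=1.0, output_price_per_million=4.0)
```

If structured predictions report no usage, their calls contribute nothing to these totals, so the estimate silently undercounts.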

Impact if not fixed:

  • Engineers may not realize structured predictions aren't being tracked
  • Production systems have incomplete observability data
  • Cost allocation for structured workflows is impossible
  • Comparison benchmarks between methods are incomplete

Related Issues

  • #20218 - Missing token usage information in GoogleGenAI metadata for MLFlow Tracing (Closed - Fixed for chat()/achat() only, structured_predict() methods not addressed)
  • #17736 - StructuredLLM - Add raw completion response alongside structured output (Open - Broader request for all LLMs and all raw response fields)
  • #19293 - No Input/Output Token count for Gemini 2.5 models (Open - May be related; reports missing token counts for Gemini 2.5 in instrumentation)
  • #19662 - Get thoughts_token_count from gemini response (Closed - Fixed in chat_from_gemini_response() but thoughts_token_count still not extracted in structured_predict())

#20218 is the closest predecessor - it was closed after fixing token tracking for chat() methods, but the fix did not extend to structured_predict() methods. This issue effectively completes the work started in #20218.

extent analysis

Fix Plan

To fix the missing token usage metadata for structured prediction methods, structured_predict() needs to read usage_metadata from the response object before returning the parsed model, and surface it where callers and tracing tools can see it. Pydantic models reject assignment to undeclared attributes by default, so the counts should not be set on the returned model; PR #21135 records them on the active OpenTelemetry span instead.

Step-by-Step Solution

  1. Modify the structured_predict() method: Read usage_metadata from the response object before returning response.parsed.
  2. Record the token counts: Take prompt_tokens, completion_tokens, and total_tokens (plus thoughts_token_count when present) from usage_metadata and set them as llm.token_usage.* attributes on the active span.
  3. Update the base.py file: Apply the same change to astructured_predict(), stream_structured_predict(), and astream_structured_predict().

Example Code

# In base.py, update the structured_predict() method.
# Sketch only: elided parts stay as "...", and the attribute suffixes
# after "llm.token_usage." are assumed, not copied from the PR.
from opentelemetry import trace

def structured_predict(self, ...):
    # ...
    response = self._client.models.generate_content(...)
    if isinstance(response.parsed, BaseModel):
        # usage_metadata may be None, so guard before reading counts
        usage_metadata = response.usage_metadata
        if usage_metadata:
            counts = {
                "prompt_tokens": usage_metadata.prompt_token_count,
                "completion_tokens": usage_metadata.candidates_token_count,
                "total_tokens": usage_metadata.total_token_count,
            }
            # Record on the active span rather than on the Pydantic model,
            # which would raise on an undeclared attribute.
            span = trace.get_current_span()
            for name, value in counts.items():
                if value is not None:
                    span.set_attribute(f"llm.token_usage.{name}", value)
        return response.parsed

Verification

To verify that the fix worked, you can:

  1. Test the structured_predict() method: Call the method with a sample input and check that the active span carries the expected llm.token_usage.* attributes.
  2. Check the token counts: Compare the recorded counts with the values in the response's usage_metadata.
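
A check along these lines can be sketched with a mock span. It assumes, following the PR description, that the counts land on a span through set_attribute; record_usage_on_span and the exact attribute suffixes are hypothetical names for illustration, and the response is a SimpleNamespace stand-in rather than a real Gemini response.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def record_usage_on_span(span, response) -> None:
    """Hypothetical helper: copy token counts from a response onto a span."""
    usage = getattr(response, "usage_metadata", None)
    if usage is None:
        return
    span.set_attribute("llm.token_usage.prompt_tokens", usage.prompt_token_count)
    span.set_attribute("llm.token_usage.completion_tokens", usage.candidates_token_count)
    span.set_attribute("llm.token_usage.total_tokens", usage.total_token_count)


def test_usage_is_recorded() -> None:
    span = MagicMock()
    response = SimpleNamespace(
        usage_metadata=SimpleNamespace(
            prompt_token_count=7,
            candidates_token_count=3,
            total_token_count=10,
        )
    )
    record_usage_on_span(span, response)
    # Collect every (name, value) pair passed to set_attribute.
    recorded = {c.args[0]: c.args[1] for c in span.set_attribute.call_args_list}
    assert recorded == {
        "llm.token_usage.prompt_tokens": 7,
        "llm.token_usage.completion_tokens": 3,
        "llm.token_usage.total_tokens": 10,
    }


def test_missing_usage_is_ignored() -> None:
    span = MagicMock()
    record_usage_on_span(span, SimpleNamespace(usage_metadata=None))
    span.set_attribute.assert_not_called()
```

Both functions follow pytest naming, so they can be collected as-is; the second covers the case where the API returns no usage_metadata.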

Extra Tips

  • Make sure to update the documentation to reflect the changes to the structured_predict() method.
  • Consider adding error handling to handle cases where the usage_metadata is missing or invalid.
  • Review the related issues and consider applying similar fixes to other methods that may be affected by the same issue.
