llamaIndex - ✅(Solved) Fix [Feature Request]: Add token usage tracking to GoogleGenAI structured_predict methods [1 pull requests, 4 comments, 3 participants]

linchun3 · 2026-03-21T04:10:13Z

# PR #21135: feat(GoogleGenAI): add token tracking for GoogleGenAI structured predict methods

- Repository: run-llama/llama_index
- Author: Pavan-Rana
- State: open | merged: False
- Link: https://github.com/run-llama/llama_index/pull/21135

## Description (problem / solution / changelog)

Adds token usage tracking to all structured prediction methods in the GoogleGenAI LLM integration:

- structured_predict
- astructured_predict
- stream_structured_predict
- astream_structured_predict

Previously, these methods returned only the parsed Pydantic model and discarded usage_metadata from the Gemini API response. This created a gap in observability and cost tracking compared to the chat() and complete() methods, which already expose token usage.

This change introduces a shared utility (extract_token_usage_from_response) to extract token counts (prompt_tokens, completion_tokens, total_tokens, and thoughts_token_count when available) and attaches them to the active OpenTelemetry span via llm.token_usage.* attributes. This brings structured prediction methods to parity with existing LLM methods in terms of tracing and monitoring.

Streaming variants capture token usage from the final response chunk to ensure accurate reporting.

Motivation:

- Restore observability parity across all LLM interfaces
- Enable accurate cost tracking for structured prediction workflows
- Support reasoning-aware models by exposing thoughts_token_count
- Ensure compatibility with downstream tools (e.g. tracing/callback systems)

No additional dependencies were introduced.

Fixes #21106

## New Package?

Did I fill in the `tool.llamahub` section in the `pyproject.toml` and provide a detailed README.md for my new integration or package?

- [ ] Yes
- [X] No

## Version Bump?

Did I bump the version in the `pyproject.toml` file of the package I am updating? (Except for the `llama-index-core` package)

- [ ] Yes
- [X] No

## Type of Change

Please delete options that are not relevant.

- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update

## How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

- [X] I added new unit tests to cover this change
- [ ] I believe this change is already covered by existing unit tests

## Suggested Checklist:

- [X] I have performed a self-review of my own code
- [X] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added Google Colab support for the newly added notebooks.
- [X] My changes generate no new warnings
- [X] I have added tests that prove my fix is effective or that my feature works
- [X] New and existing unit tests pass locally with my changes
- [X] I ran `uv run make format; uv run make lint` to appease the lint gods

## Changed files

- `llama-index-integrations/llms/llama-index-llms-google-genai/llama_index/llms/google_genai/base.py` (modified, +72/-0)
- `llama-index-integrations/llms/llama-index-llms-google-genai/llama_index/llms/google_genai/utils.py` (modified, +21/-0)
- `llama-index-integrations/llms/llama-index-llms-google-genai/tests/test_base_cleanup.py` (modified, +221/-0)
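The mechanism the PR description outlines (read the counts, set them on the active span, and for streaming take them from the final chunk) can be sketched roughly as follows. This is not the PR's code: `RecordingSpan`, `record_usage_on_span`, and `stream_with_usage` are hypothetical names, the response chunks are stand-in objects, and the attribute suffixes after `llm.token_usage.` are assumed, since the description only gives the prefix.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Iterator, Optional


class RecordingSpan:
    """Stand-in span; a real one would come from opentelemetry.trace.get_current_span()."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


def record_usage_on_span(span: Any, usage_metadata: Optional[Any]) -> None:
    """Copy whichever token counts are present onto the span."""
    if usage_metadata is None:
        return
    counts = {
        "prompt_tokens": getattr(usage_metadata, "prompt_token_count", None),
        "completion_tokens": getattr(usage_metadata, "candidates_token_count", None),
        "total_tokens": getattr(usage_metadata, "total_token_count", None),
        "thoughts_token_count": getattr(usage_metadata, "thoughts_token_count", None),
    }
    for name, value in counts.items():
        if value is not None:
            # Suffixes are assumed; the PR text only says llm.token_usage.*
            span.set_attribute(f"llm.token_usage.{name}", value)


def stream_with_usage(chunks: Iterable[Any], span: Any) -> Iterator[Any]:
    """Yield chunks unchanged, then record usage from the last chunk that has it."""
    latest = None
    for chunk in chunks:
        metadata = getattr(chunk, "usage_metadata", None)
        if metadata is not None:
            latest = metadata
        yield chunk
    record_usage_on_span(span, latest)


span = RecordingSpan()
chunks = [
    SimpleNamespace(text='{"name": "A', usage_metadata=None),
    SimpleNamespace(
        text='da"}',
        usage_metadata=SimpleNamespace(
            prompt_token_count=40, candidates_token_count=7, total_token_count=47
        ),
    ),
]
received = list(stream_with_usage(chunks, span))
print(span.attributes)
```

Note that in this sketch the counts are recorded only once the consumer has exhausted the stream; a caller that stops iterating early would record nothing.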

run-llama/llama_index#21106•Fetched 2026-04-08 01:08:08

Feature Description

The GoogleGenAI LLM integration should expose token usage metadata for structured prediction methods (structured_predict, astructured_predict, stream_structured_predict, astream_structured_predict).

Token tracking works correctly for chat(), achat(), complete(), and acomplete() via the chat_from_gemini_response() utility function which extracts usage_metadata and populates additional_kwargs with prompt_tokens, completion_tokens, and total_tokens.

However, structured prediction methods bypass this utility and return only the parsed Pydantic model, discarding all token usage information from the API response.

Expected behavior: Token usage should be accessible for all LLM methods, including structured predictions.

| Method | Returns | Token Tracking? | Raw Response Access? |
|--------|---------|-----------------|----------------------|
| `chat()` | `ChatResponse` | ✅ Yes | ✅ `response.raw` |
| `achat()` | `ChatResponse` | ✅ Yes | ✅ `response.raw` |
| `complete()` | `CompletionResponse` | ✅ Yes | ✅ `response.raw` |
| `acomplete()` | `CompletionResponse` | ✅ Yes | ✅ `response.raw` |
| `stream_chat()` | `ChatResponseGen` | ✅ Yes | ✅ `response.raw` |
| `astream_chat()` | `ChatResponseAsyncGen` | ✅ Yes | ✅ `response.raw` |
| `structured_predict()` | `Model` (Pydantic) | ❌ No | ❌ No |
| `astructured_predict()` | `Model` (Pydantic) | ❌ No | ❌ No |
| `stream_structured_predict()` | `Model` (yielded) | ❌ No | ❌ No |
| `astream_structured_predict()` | `Model` (yielded) | ❌ No | ❌ No |

<details> <summary>📁 Code reference</summary>

Working implementation: chat_from_gemini_response() in utils.py lines 167-178

if response.usage_metadata:
    raw["usage_metadata"] = response.usage_metadata.model_dump()
    additional_kwargs["prompt_tokens"] = response.usage_metadata.prompt_token_count
    additional_kwargs["completion_tokens"] = response.usage_metadata.candidates_token_count
    additional_kwargs["total_tokens"] = response.usage_metadata.total_token_count

Missing implementation: structured_predict() in base.py lines 584-644

# response.usage_metadata exists but is discarded
if isinstance(response.parsed, BaseModel):
    return response.parsed  # No token metadata attached

</details>

Reason

What is stopping LlamaIndex from supporting this feature today?

The structured_predict() implementation calls self._client.models.generate_content() directly and returns response.parsed without extracting usage_metadata from the response object. The data exists in the response, but is discarded.
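To make concrete what is being dropped, here is a small sketch with a stand-in response object. It only illustrates that `parsed` and `usage_metadata` sit side by side on the same response; `split_response` and `Invoice` are hypothetical names, a dataclass stands in for the Pydantic model, and returning a tuple is an illustration rather than the approach the linked PR takes.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Dict, Optional, Tuple


@dataclass
class Invoice:
    """Hypothetical output type standing in for a Pydantic model."""

    vendor: str
    total: float


def split_response(response: Any) -> Tuple[Any, Optional[Dict[str, int]]]:
    """Return (parsed, usage) rather than parsed alone."""
    metadata = getattr(response, "usage_metadata", None)
    if metadata is None:
        return response.parsed, None
    usage = {
        "prompt_tokens": metadata.prompt_token_count,
        "completion_tokens": metadata.candidates_token_count,
        "total_tokens": metadata.total_token_count,
    }
    # Only reasoning-capable models report this field, so treat it as optional.
    thoughts = getattr(metadata, "thoughts_token_count", None)
    if thoughts is not None:
        usage["thoughts_token_count"] = thoughts
    return response.parsed, usage


# Stand-in for the object returned by generate_content(); values are made up.
fake_response = SimpleNamespace(
    parsed=Invoice(vendor="Acme", total=12.5),
    usage_metadata=SimpleNamespace(
        prompt_token_count=120, candidates_token_count=18, total_token_count=138
    ),
)
parsed, usage = split_response(fake_response)
print(parsed, usage)
```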

What existing approaches have not worked for you?

1. Phoenix/Arize observability - Transactions using structured_predict() appear in traces without token counts, making it impossible to benchmark across methods.
2. TokenCountingHandler - Cannot count tokens for structured predictions, breaking cost analysis.
3. Workaround using chat() - While technically possible to pass generation_config={"response_schema": MyModel} to chat(), this requires manual JSON parsing, loses the convenience of structured_predict() returning typed objects, is undocumented, and creates API inconsistency.
4. Thinking models - Gemini 3.1, 3, and 3 Flash Lite have reasoning/thinking capabilities with thoughts_token_count. Engineers are unaware that structured predictions have untracked thinking tokens.

Value of Feature

- Observability parity - Phoenix OpenInference traces for structured predictions currently lack token counts, making it difficult for engineers to benchmark across different methods.
- Cost tracking - Teams allocating costs by token usage cannot accurately track structured prediction usage.
- Thinking/reasoning models - Modern Gemini models (3.1, 3, 3 Flash Lite) perform reasoning with thoughts_token_count. Without tracking, engineers cannot measure reasoning token efficiency or optimize prompt strategies.
- API consistency - Users expect all LLM methods to return consistent metadata. The current gap creates confusion: why does chat() show token counts when structured_predict() doesn't?
- Downstream tool compatibility - Tools like MLFlow, Phoenix, and custom callback handlers expect token counts in additional_kwargs. Structured predictions break this contract.

Impact if not fixed:

- Engineers may not realize structured predictions aren't being tracked
- Production systems have incomplete observability data
- Cost allocation for structured workflows is impossible
- Comparison benchmarks between methods are incomplete

Related Issues

- #20218: Missing token usage information in GoogleGenAI metadata for MLFlow Tracing (Closed; fixed for chat()/achat() only, structured_predict() methods not addressed)
- #17736: StructuredLLM - Add raw completion response alongside structured output (Open; broader request for all LLMs and all raw response fields)
- #19293: No Input/Output Token count for Gemini 2.5 models (Open; may be related, reports missing token counts for Gemini 2.5 in instrumentation)
- #19662: Get thoughts_token_count from gemini response (Closed; fixed in chat_from_gemini_response() but thoughts_token_count still not extracted in structured_predict())

#20218 is the closest predecessor - it was closed after fixing token tracking for chat() methods, but the fix did not extend to structured_predict() methods. This issue effectively completes the work started in #20218.

extent analysis

Fix Plan

To fix the missing token usage for structured prediction methods, structured_predict() needs to read usage_metadata from the response object before it returns response.parsed. The return type stays the parsed Pydantic model, so the counts have to be reported through a side channel; PR #21135 records them on the active OpenTelemetry span as llm.token_usage.* attributes.

Step-by-Step Solution

1. Modify the structured_predict() method: After the generate_content() call, read usage_metadata from the response object instead of discarding it.
2. Record the token counts: Map prompt_token_count, candidates_token_count, and total_token_count (plus thoughts_token_count when available) to prompt_tokens, completion_tokens, and total_tokens, and set them on the active span. Assigning a new attribute to the returned model is not a safe option, because a Pydantic model by default rejects fields it does not declare.
3. Update the base.py file: Apply the same change to structured_predict(), astructured_predict(), stream_structured_predict(), and astream_structured_predict() in base.py.

Example Code

# In base.py, update the structured_predict() method (sketch; arguments elided)
from opentelemetry import trace

def structured_predict(self, *args, **kwargs):
    # ...
    response = self._client.models.generate_content(...)
    if isinstance(response.parsed, BaseModel):
        # Read the usage metadata that was previously discarded
        usage_metadata = response.usage_metadata
        if usage_metadata:
            counts = {
                "prompt_tokens": usage_metadata.prompt_token_count,
                "completion_tokens": usage_metadata.candidates_token_count,
                "total_tokens": usage_metadata.total_token_count,
            }
            # Attribute suffixes are assumed; the PR only states llm.token_usage.*
            span = trace.get_current_span()
            for name, value in counts.items():
                if value is not None:
                    span.set_attribute(f"llm.token_usage.{name}", value)
        return response.parsed

Verification

To verify that the fix worked, you can:

1. Test the structured_predict() method: Call the method with a sample input inside a traced span and check that the span carries the expected llm.token_usage.* attributes.
2. Check the token counts: Compare the recorded counts with the usage_metadata values in the raw API response.
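One way to check the behaviour without calling the Gemini API is a mocked client. The sketch below assumes the counts are recorded on a tracing span, as PR #21135 describes; `structured_call` is a hypothetical stand-in for the patched method body, the model name and contents are placeholders, and the attribute suffixes are assumed rather than taken from the PR.

```python
import unittest
from types import SimpleNamespace
from unittest.mock import MagicMock


def structured_call(client, span):
    """Hypothetical stand-in for the patched structured_predict() body."""
    response = client.models.generate_content(model="placeholder", contents="placeholder")
    metadata = response.usage_metadata
    if metadata is not None:
        counts = {
            "prompt_tokens": metadata.prompt_token_count,
            "completion_tokens": metadata.candidates_token_count,
            "total_tokens": metadata.total_token_count,
        }
        for name, value in counts.items():
            span.set_attribute(f"llm.token_usage.{name}", value)
    return response.parsed


class TokenUsageTest(unittest.TestCase):
    def _client(self, usage_metadata):
        # The mocked client returns a response carrying the given usage metadata.
        client = MagicMock()
        client.models.generate_content.return_value = SimpleNamespace(
            parsed={"name": "x"}, usage_metadata=usage_metadata
        )
        return client

    def test_counts_are_recorded(self):
        span = MagicMock()
        usage = SimpleNamespace(
            prompt_token_count=10, candidates_token_count=5, total_token_count=15
        )
        result = structured_call(self._client(usage), span)
        self.assertEqual(result, {"name": "x"})
        span.set_attribute.assert_any_call("llm.token_usage.prompt_tokens", 10)
        span.set_attribute.assert_any_call("llm.token_usage.total_tokens", 15)
        self.assertEqual(span.set_attribute.call_count, 3)

    def test_missing_usage_is_ignored(self):
        span = MagicMock()
        structured_call(self._client(None), span)
        span.set_attribute.assert_not_called()
```

Saved to a file, the two cases can be run with `python -m unittest`.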

Extra Tips

- Update the documentation to reflect the changes to the structured_predict() method.
- Handle the cases where usage_metadata is missing or an individual count is None, so that tracing never breaks a prediction call.
- Review the related issues and consider applying similar fixes to other methods affected by the same gap.
