[Feature Request]: Add token usage tracking to GoogleGenAI structured_predict methods #21106

@linchun3

Description

Feature Description

The GoogleGenAI LLM integration should expose token usage metadata for structured prediction methods (structured_predict, astructured_predict, stream_structured_predict, astream_structured_predict).

Token tracking works correctly for chat(), achat(), complete(), and acomplete() via the chat_from_gemini_response() utility function which extracts usage_metadata and populates additional_kwargs with prompt_tokens, completion_tokens, and total_tokens.

However, structured prediction methods bypass this utility and return only the parsed Pydantic model, discarding all token usage information from the API response.

Expected behavior: Token usage should be accessible for all LLM methods, including structured predictions.

| Method | Returns | Token Tracking? | Raw Response Access? |
|--------|---------|-----------------|----------------------|
| `chat()` | `ChatResponse` | ✅ Yes | `response.raw` |
| `achat()` | `ChatResponse` | ✅ Yes | `response.raw` |
| `complete()` | `CompletionResponse` | ✅ Yes | `response.raw` |
| `acomplete()` | `CompletionResponse` | ✅ Yes | `response.raw` |
| `stream_chat()` | `ChatResponseGen` | ✅ Yes | `response.raw` |
| `astream_chat()` | `ChatResponseAsyncGen` | ✅ Yes | `response.raw` |
| `structured_predict()` | `Model` (Pydantic) | ❌ No | ❌ No |
| `astructured_predict()` | `Model` (Pydantic) | ❌ No | ❌ No |
| `stream_structured_predict()` | `Model` (yielded) | ❌ No | ❌ No |
| `astream_structured_predict()` | `Model` (yielded) | ❌ No | ❌ No |
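For the tracked methods, the counts land in `additional_kwargs` on the returned message. A minimal sketch of how a caller reads them back out — the dict here is a mock with the shape `chat_from_gemini_response()` produces, not a real `ChatResponse`:

```python
# Sketch: reading token counts the way the tracked chat()/complete()
# paths expose them. The input dict mimics additional_kwargs as
# populated by chat_from_gemini_response(); values are illustrative.

def read_token_usage(additional_kwargs: dict) -> dict:
    """Pull the three standard counters, defaulting to 0 when absent."""
    return {
        "prompt_tokens": additional_kwargs.get("prompt_tokens", 0),
        "completion_tokens": additional_kwargs.get("completion_tokens", 0),
        "total_tokens": additional_kwargs.get("total_tokens", 0),
    }

mock_kwargs = {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}
usage = read_token_usage(mock_kwargs)
print(usage["total_tokens"])  # 42
```

The structured prediction methods give callers no equivalent hook, which is the gap this issue describes.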
📁 Code reference

Working implementation: chat_from_gemini_response() in utils.py lines 167-178

```python
if response.usage_metadata:
    raw["usage_metadata"] = response.usage_metadata.model_dump()
    additional_kwargs["prompt_tokens"] = response.usage_metadata.prompt_token_count
    additional_kwargs["completion_tokens"] = response.usage_metadata.candidates_token_count
    additional_kwargs["total_tokens"] = response.usage_metadata.total_token_count
```

Missing implementation: structured_predict() in base.py lines 584-644

```python
# response.usage_metadata exists but is discarded
if isinstance(response.parsed, BaseModel):
    return response.parsed  # No token metadata attached
```
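One possible shape for the missing extraction step, sketched with a `SimpleNamespace` mock in place of the real google-genai response types. The field names (`prompt_token_count`, `candidates_token_count`, `total_token_count`, `thoughts_token_count`) follow the working snippet above; the helper name and the mock are hypothetical:

```python
from types import SimpleNamespace


def extract_usage(usage_metadata) -> dict:
    """Map Gemini usage_metadata fields onto the standard keys,
    mirroring what chat_from_gemini_response() already does for chat()."""
    if usage_metadata is None:
        return {}
    usage = {
        "prompt_tokens": usage_metadata.prompt_token_count,
        "completion_tokens": usage_metadata.candidates_token_count,
        "total_tokens": usage_metadata.total_token_count,
    }
    # Thinking models also report reasoning tokens; surface them when present.
    thoughts = getattr(usage_metadata, "thoughts_token_count", None)
    if thoughts is not None:
        usage["thoughts_tokens"] = thoughts
    return usage


# Mock of the relevant slice of a generate_content() response:
meta = SimpleNamespace(prompt_token_count=10, candidates_token_count=25,
                       total_token_count=35, thoughts_token_count=0)
print(extract_usage(meta))
```

How the resulting dict should reach the caller (callback event, wrapper object, or an out-parameter) is a design decision for the fix; the point is that the data is already on the response.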

Reason

What is stopping LlamaIndex from supporting this feature today?

The structured_predict() implementation calls self._client.models.generate_content() directly and returns response.parsed without extracting usage_metadata from the response object. The data exists in the response, but is discarded.

What existing approaches have not worked for you?

  1. Phoenix/Arize observability - Transactions using structured_predict() appear in traces without token counts, making it impossible to benchmark across methods.

  2. TokenCountingHandler - Cannot count tokens for structured predictions, breaking cost analysis.

  3. Workaround using chat() - While technically possible to pass generation_config={"response_schema": MyModel} to chat(), this requires manual JSON parsing, loses the convenience of structured_predict() returning typed objects, is undocumented, and creates API inconsistency.

  4. Thinking models - Gemini 3.1, 3, and 3 Flash Lite have reasoning/thinking capabilities reported via thoughts_token_count. Because structured predictions discard usage metadata, these thinking tokens go entirely untracked, often without engineers realizing it.
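The chat() workaround in point 3 boils down to requesting JSON and deserializing it by hand. A stdlib-only sketch of that manual parsing step — a dataclass stands in for the Pydantic model, and the response text is mocked rather than coming from a real chat() call:

```python
import json
from dataclasses import dataclass


@dataclass
class Invoice:
    """Stand-in for the typed model structured_predict() would return."""
    customer: str
    total: float


def parse_structured(response_text: str) -> Invoice:
    """The manual step structured_predict() normally hides:
    deserialize the model's JSON output into a typed object."""
    data = json.loads(response_text)
    return Invoice(**data)


# What the chat() workaround would get back from the API (mocked):
raw = '{"customer": "Acme", "total": 99.5}'
obj = parse_structured(raw)
print(obj.customer, obj.total)
```

This recovers token counts (because chat() populates them) but reimplements exactly the parsing and validation that structured_predict() exists to provide.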


Value of Feature

  1. Observability parity - Phoenix OpenInference traces for structured predictions currently lack token counts, making it difficult for engineers to benchmark across different methods.

  2. Cost tracking - Teams allocating costs by token usage cannot accurately track structured prediction usage.

  3. Thinking/reasoning models - Modern Gemini models (3.1, 3, 3 Flash Lite) perform reasoning with thoughts_token_count. Without tracking, engineers cannot measure reasoning token efficiency or optimize prompt strategies.

  4. API consistency - Users expect all LLM methods to return consistent metadata. The current gap creates confusion: why does chat() show token counts while structured_predict() doesn't?

  5. Downstream tool compatibility - Tools like MLFlow, Phoenix, and custom callback handlers expect token counts in additional_kwargs. Structured predictions break this contract.
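The cost-tracking point is simple arithmetic once the counters exist, which is what makes the missing metadata a hard blocker. A sketch with placeholder per-million-token rates (not real Gemini pricing):

```python
def estimate_cost(usage: dict, prompt_price: float, completion_price: float) -> float:
    """Per-call cost from token counts; prices are per 1M tokens.
    The rates used below are placeholders, not real Gemini pricing."""
    return (usage.get("prompt_tokens", 0) * prompt_price
            + usage.get("completion_tokens", 0) * completion_price) / 1_000_000


usage = {"prompt_tokens": 1200, "completion_tokens": 400}
print(estimate_cost(usage, prompt_price=0.10, completion_price=0.40))
```

For structured_predict() calls today, `usage` is simply unavailable, so this calculation cannot be run at all.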

Impact if not fixed:

  • Engineers may not realize structured predictions aren't being tracked
  • Production systems have incomplete observability data
  • Cost allocation for structured workflows is impossible
  • Comparison benchmarks between methods are incomplete

Related Issues

#20218 is the closest predecessor - it was closed after fixing token tracking for chat() methods, but the fix did not extend to structured_predict() methods. This issue effectively completes the work started in #20218.

Metadata

Labels: enhancement (New feature or request), triage (Issue needs to be triaged/prioritized)