### Feature Description
The GoogleGenAI LLM integration should expose token usage metadata for the structured prediction methods (`structured_predict`, `astructured_predict`, `stream_structured_predict`, `astream_structured_predict`).

Token tracking works correctly for `chat()`, `achat()`, `complete()`, and `acomplete()` via the `chat_from_gemini_response()` utility function, which extracts `usage_metadata` and populates `additional_kwargs` with `prompt_tokens`, `completion_tokens`, and `total_tokens`.

However, the structured prediction methods bypass this utility and return only the parsed Pydantic model, discarding all token usage information from the API response.

Expected behavior: token usage should be accessible for all LLM methods, including structured predictions.
| Method | Returns | Token Tracking? | Raw Response Access? |
|---|---|---|---|
| `chat()` | `ChatResponse` | ✅ Yes | ✅ `response.raw` |
| `achat()` | `ChatResponse` | ✅ Yes | ✅ `response.raw` |
| `complete()` | `CompletionResponse` | ✅ Yes | ✅ `response.raw` |
| `acomplete()` | `CompletionResponse` | ✅ Yes | ✅ `response.raw` |
| `stream_chat()` | `ChatResponseGen` | ✅ Yes | ✅ `response.raw` |
| `astream_chat()` | `ChatResponseAsyncGen` | ✅ Yes | ✅ `response.raw` |
| `structured_predict()` | Model (Pydantic) | ❌ No | ❌ No |
| `astructured_predict()` | Model (Pydantic) | ❌ No | ❌ No |
| `stream_structured_predict()` | Model (yielded) | ❌ No | ❌ No |
| `astream_structured_predict()` | Model (yielded) | ❌ No | ❌ No |
#### 📁 Code reference

Working implementation: `chat_from_gemini_response()` in `utils.py`, lines 167-178:

```python
if response.usage_metadata:
    raw["usage_metadata"] = response.usage_metadata.model_dump()
    additional_kwargs["prompt_tokens"] = response.usage_metadata.prompt_token_count
    additional_kwargs["completion_tokens"] = response.usage_metadata.candidates_token_count
    additional_kwargs["total_tokens"] = response.usage_metadata.total_token_count
```

Missing implementation: `structured_predict()` in `base.py`, lines 584-644:

```python
# response.usage_metadata exists but is discarded
if isinstance(response.parsed, BaseModel):
    return response.parsed  # No token metadata attached
```

### Reason
**What is stopping LlamaIndex from supporting this feature today?**

The `structured_predict()` implementation calls `self._client.models.generate_content()` directly and returns `response.parsed` without extracting `usage_metadata` from the response object. The data exists in the response, but is discarded.
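A fix could extract that metadata before the parsed model is returned. The helper below is a hypothetical sketch, not the repo's actual code (`extract_usage` and the `Fake*` mocks are invented for illustration); it mirrors the field names already used by `chat_from_gemini_response()` and also picks up `thoughts_token_count` for thinking models:

```python
def extract_usage(response) -> dict:
    """Hypothetical helper: pull token counts out of a Gemini-style response
    before structured_predict() discards the raw object."""
    usage = getattr(response, "usage_metadata", None)
    if usage is None:
        return {}
    counts = {
        "prompt_tokens": getattr(usage, "prompt_token_count", None),
        "completion_tokens": getattr(usage, "candidates_token_count", None),
        "total_tokens": getattr(usage, "total_token_count", None),
        # thinking models also report reasoning tokens
        "thoughts_tokens": getattr(usage, "thoughts_token_count", None),
    }
    return {k: v for k, v in counts.items() if v is not None}


class FakeUsage:
    prompt_token_count = 10
    candidates_token_count = 5
    total_token_count = 18
    thoughts_token_count = 3


class FakeResponse:
    parsed = {"ok": True}
    usage_metadata = FakeUsage()


print(extract_usage(FakeResponse()))
```

Where the extracted dict ends up (dispatched to callbacks, stashed on the LLM instance, or attached to the returned model) is a design choice for maintainers; the point is only that the extraction must happen before `response.parsed` is returned.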
**What existing approaches have not worked for you?**

- **Phoenix/Arize observability** - Transactions using `structured_predict()` appear in traces without token counts, making it impossible to benchmark across methods.
- **TokenCountingHandler** - Cannot count tokens for structured predictions, breaking cost analysis.
- **Workaround using `chat()`** - While it is technically possible to pass `generation_config={"response_schema": MyModel}` to `chat()`, this requires manual JSON parsing, loses the convenience of `structured_predict()` returning typed objects, is undocumented, and creates API inconsistency.
- **Thinking models** - Gemini 3.1, 3, and 3 Flash Lite have reasoning/thinking capabilities with `thoughts_token_count`. Engineers are unaware that structured predictions have untracked thinking tokens.
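To illustrate the manual JSON parsing the `chat()` workaround forces on callers, here is a stdlib-only sketch (`fake_chat` and `Person` are invented stand-ins; a real call would go through the GoogleGenAI LLM with `generation_config={"response_schema": ...}`):

```python
import json
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


def fake_chat(prompt: str) -> dict:
    # Stand-in for chat(): raw JSON text plus token counts in additional_kwargs
    return {
        "text": '{"name": "Ada", "age": 36}',
        "additional_kwargs": {"prompt_tokens": 9, "completion_tokens": 11, "total_tokens": 20},
    }


def structured_via_chat(prompt: str):
    """The workaround: token counts survive, but the typed object
    must be rebuilt by hand instead of arriving pre-parsed."""
    response = fake_chat(prompt)
    # The manual step that structured_predict() is supposed to hide:
    parsed = Person(**json.loads(response["text"]))
    return parsed, response["additional_kwargs"]


person, tokens = structured_via_chat("Extract the person.")
print(person)                  # Person(name='Ada', age=36)
print(tokens["total_tokens"])  # 20
```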
### Value of Feature

- **Observability parity** - Phoenix OpenInference traces for structured predictions currently lack token counts, making it difficult for engineers to benchmark across different methods.
- **Cost tracking** - Teams allocating costs by token usage cannot accurately track structured prediction usage.
- **Thinking/reasoning models** - Modern Gemini models (3.1, 3, 3 Flash Lite) perform reasoning with `thoughts_token_count`. Without tracking, engineers cannot measure reasoning token efficiency or optimize prompt strategies.
- **API consistency** - Users expect all LLM methods to return consistent metadata. The current gap creates confusion: why does `chat()` show token counts but `structured_predict()` doesn't?
- **Downstream tool compatibility** - Tools like MLflow, Phoenix, and custom callback handlers expect token counts in `additional_kwargs`. Structured predictions break this contract.
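As a concrete illustration of that contract, here is a toy aggregator (a pure-Python stand-in, not llama-index's actual `TokenCountingHandler`) that can only count what `additional_kwargs` actually carries:

```python
class TokenTally:
    """Toy aggregator standing in for TokenCountingHandler-style tools:
    it can only count what additional_kwargs actually contains."""

    def __init__(self):
        self.total = 0

    def on_llm_response(self, additional_kwargs: dict) -> None:
        self.total += additional_kwargs.get("total_tokens", 0)


tally = TokenTally()
tally.on_llm_response({"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20})  # chat()
tally.on_llm_response({})  # structured_predict(): nothing to count
print(tally.total)  # 20 -- the structured call silently vanishes from cost reports
```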
Impact if not fixed:
- Engineers may not realize structured predictions aren't being tracked
- Production systems have incomplete observability data
- Cost allocation for structured workflows is impossible
- Comparison benchmarks between methods are incomplete
### Related Issues

- [Bug] #20218 - Missing token usage information in GoogleGenAI metadata for MLFlow Tracing (Closed - fixed for `chat()`/`achat()` only; the `structured_predict()` methods were not addressed)
- [Feature Request] #17736 - StructuredLLM - Add raw completion response alongside structured output (Open - broader request covering all LLMs and all raw response fields)
- [Bug] #19293 - No Input/Output Token count for Gemini 2.5 models (Open - may be related; reports missing token counts for Gemini 2.5 in instrumentation)
- [Feature Request] #19662 - Get thoughts_token_count from gemini response (Closed - fixed in `chat_from_gemini_response()`, but `thoughts_token_count` is still not extracted in `structured_predict()`)

#20218 is the closest predecessor: it was closed after fixing token tracking for the `chat()` methods, but the fix did not extend to the `structured_predict()` methods. This issue effectively completes the work started in #20218.