Bug Description
LlamaIndex OpenAI Responses API: Round-Trip & Reasoning Bugs
Affected versions: llama-index-llms-openai==0.7.3, llama-index-core==0.14.18
Files: llama_index/llms/openai/utils.py, llama_index/llms/openai/responses.py
We use the Responses API with GPT-5.x models in a multi-turn tool-calling workflow. Several bugs in the serialization and parsing paths cause data loss when assistant messages are round-tripped through to_openai_responses_message_dict and _parse_response_output. Below is a summary of each issue with minimal reproduction.
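The invariant at stake can be stated concretely with a tiny helper. The dicts below are illustrative payloads, not the actual Responses API wire format; they just show the shape of the loss described in the bugs that follow:

```python
def lost_fields(original: dict, round_tripped: dict) -> set:
    """Keys present before a round trip but missing afterwards."""
    return set(original) - set(round_tripped)

# Illustrative payloads only: in the buggy paths below, an assistant
# turn with text and a phase tag effectively keeps only its role.
before = {"role": "assistant", "content": "I'll search now.", "phase": "commentary"}
after = {"role": "assistant"}
print(sorted(lost_fields(before, after)))  # ['content', 'phase']
```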
1. Assistant text silently dropped when tool calls are present
Location: utils.py — to_openai_responses_message_dict(), lines 676-684
When an assistant message contains both text blocks and tool calls, the serializer discards the text entirely and returns only the tool call items.
```python
# Upstream code (simplified)
if "tool_calls" in message.additional_kwargs:
    message_dicts = [tc.model_dump() for tc in message.additional_kwargs["tool_calls"]]
    return [*reasoning, *message_dicts]  # <-- text content is lost
elif tool_calls:
    return [*reasoning, *tool_calls]  # <-- same here
```
The content / message_dict built earlier in the function is never included in these return paths.
Reproduction
```python
from llama_index.core.base.llms.types import ChatMessage, MessageRole, TextBlock, ToolCallBlock
from llama_index.llms.openai.utils import to_openai_responses_message_dict

msg = ChatMessage(
    role=MessageRole.ASSISTANT,
    blocks=[
        TextBlock(text="I'll search for that information now."),
        ToolCallBlock(tool_name="search", tool_call_id="call_1", tool_kwargs='{"q": "test"}'),
    ],
)
result = to_openai_responses_message_dict(msg, model="o3-mini")
print(result)
# Actual:
# [{"type": "function_call", "arguments": ..., "call_id": "call_1", "name": "search"}]
#
# Expected:
# [{"role": "assistant", "content": "I'll search for that information now."},
#  {"type": "function_call", "arguments": ..., "call_id": "call_1", "name": "search"}]
```
Impact
In multi-turn tool-use workflows, the model's pre-tool-call commentary (e.g. "I'll inspect the vendor site first") is stripped from conversation history. GPT-5.4 uses these preambles as its primary reasoning-before-action mechanism, so losing them degrades follow-up quality.
Suggested fix
Include the assistant message_dict in the returned list when it has non-empty content:
```python
if tool_calls:
    items = [*reasoning]
    if message_dict.get("content") not in (None, "", []):
        items.append(message_dict)
    items.extend(tool_calls)
    return items
```
2. ResponseOutputMessage.phase not preserved when parsing responses
Location: responses.py — OpenAIResponses._parse_response_output(), lines 471-530
The OpenAI Responses API returns phase ("commentary" or "final_answer") on ResponseOutputMessage objects for GPT-5.4-style flows. The upstream parser never reads this field.
```python
# Upstream code (simplified)
for item in output:
    if isinstance(item, ResponseOutputMessage):
        for part in item.content:
            if hasattr(part, "text"):
                blocks.append(TextBlock(text=part.text))
        # item.phase is never accessed
```
Reproduction
```python
from unittest.mock import MagicMock

from llama_index.llms.openai import OpenAIResponses

# Simulate a response with phase="commentary"
mock_msg = MagicMock()
mock_msg.type = "message"
mock_msg.role = "assistant"
mock_msg.phase = "commentary"

mock_text = MagicMock()
mock_text.text = "Let me think about this..."
mock_text.annotations = []
mock_text.refusal = None
mock_msg.content = [mock_text]

# Need to make isinstance() work — use the real type
from openai.types.responses import ResponseOutputMessage
mock_msg.__class__ = ResponseOutputMessage

result = OpenAIResponses._parse_response_output([mock_msg])
print(result.message.additional_kwargs)
# Actual:   {"built_in_tool_calls": []}
# Expected: {"built_in_tool_calls": [], "phase": "commentary"}
```
Impact
When replaying conversation history, commentary turns lose their phase distinction and appear as regular assistant messages. The model can no longer tell which of its prior turns were intermediate reasoning vs. final answers.
Suggested fix
Read phase from each ResponseOutputMessage and attach it to the ChatMessage:
```python
if isinstance(item, ResponseOutputMessage):
    item_phase = getattr(item, "phase", None)
    if item_phase in ("commentary", "final_answer"):
        phase = item_phase
    # ... existing content parsing ...

# After the loop:
if phase is not None:
    message.additional_kwargs["phase"] = phase
```
3. phase not included when serializing assistant messages back to the API
Location: utils.py — to_openai_responses_message_dict(), lines 730-739
Even if phase were preserved on parsing (bug #2), the serializer never writes it back out. The message_dict for assistant messages is built without checking for phase in additional_kwargs.
```python
# Upstream code
message_dict = {
    "role": message.role.value,
    "content": ...,
}
# No phase handling for assistant messages
```
Suggested fix
```python
if message.role == MessageRole.ASSISTANT:
    phase = message.additional_kwargs.get("phase")
    if phase in ("commentary", "final_answer"):
        message_dict["phase"] = phase
```
Combined with the fix for bug #2, this completes the round-trip so that phase survives parse -> serialize -> parse cycles.
4. Reasoning support and sampling-param stripping use exact model-name lookup
Location: responses.py — __init__() line 318, _get_model_kwargs() lines 424-435
Both the constructor and _get_model_kwargs gate reasoning behavior on self.model in O1_MODELS, which is an exact dictionary lookup. Any model snapshot not pre-listed in O1_MODELS (e.g. a newly released dated version like gpt-5.4-2026-03-05) silently falls through: reasoning options are not forwarded, and temperature/top_p are sent to models that reject them.
```python
# Upstream __init__
if model in O1_MODELS:  # exact match only
    temperature = 1.0

# Upstream _get_model_kwargs
if self.model in O1_MODELS and self.reasoning_options is not None:
    model_kwargs["reasoning"] = self.reasoning_options
if self.reasoning_options is not None or self.model in O1_MODELS:
    for param in params_to_exclude_for_reasoning:
        model_kwargs.pop(param, None)
```
Sub-issues
4a. Constructor forces temperature=1.0 for all O1/GPT-5 models. GPT-5.2 and GPT-5.4 support custom temperature when reasoning_effort="none", but the constructor overwrites it unconditionally.
4b. Sampling params are stripped too aggressively. The upstream logic strips temperature, top_p, presence_penalty, and frequency_penalty for all O1-family models regardless of reasoning effort. GPT-5.2+ models accept these params when reasoning is disabled.
Reproduction
```python
from llama_index.llms.openai import OpenAIResponses

# A valid dated snapshot not in O1_MODELS
llm = OpenAIResponses(
    model="gpt-5.4-2026-03-05",
    temperature=0.3,
    reasoning_options={"effort": "low", "summary": "concise"},
    api_key="sk-test",
)
kwargs = llm._get_model_kwargs()
print("reasoning" in kwargs)    # False — reasoning options silently dropped
print("temperature" in kwargs)  # True — should have been stripped for this model
```
Suggested fix
Use prefix matching instead of exact dict lookups:
```python
def _supports_reasoning(model: str) -> bool:
    return model.startswith(("o1-", "o3-", "o4-", "gpt-5-", "gpt-5."))
```
Then apply more nuanced sampling-param rules per model family — e.g. GPT-5.2+ allows sampling params when reasoning_effort is "none".
5. Reasoning token count duplicated across all ThinkingBlocks
Location: responses.py — _chat(), lines 554-559
When the response contains multiple ThinkingBlocks, the total reasoning_tokens count is assigned to every block instead of being divided among them.
```python
# Upstream code
if hasattr(response.usage.output_tokens_details, "reasoning_tokens"):
    for block in chat_response.message.blocks:
        if isinstance(block, ThinkingBlock):
            block.num_tokens = response.usage.output_tokens_details.reasoning_tokens
            # ^ Every block gets the TOTAL count
```
Reproduction
```python
from llama_index.core.base.llms.types import ThinkingBlock

# Simulate: 3 reasoning blocks, 900 total reasoning tokens
blocks = [ThinkingBlock(content=f"step {i}") for i in range(3)]

# Upstream assigns 900 to each:
total = 900
for block in blocks:
    block.num_tokens = total
print([b.num_tokens for b in blocks])
# Actual:   [900, 900, 900] (total appears to be 2700)
# Expected: [300, 300, 300] (evenly distributed)
```
Suggested fix
```python
reasoning_blocks = [b for b in chat_response.message.blocks if isinstance(b, ThinkingBlock)]
if reasoning_blocks:
    total = response.usage.output_tokens_details.reasoning_tokens or 0
    per_block = total // len(reasoning_blocks)
    remainder = total % len(reasoning_blocks)
    for i, block in enumerate(reasoning_blocks):
        block.num_tokens = per_block + (1 if i < remainder else 0)
```
6. Tool call arguments not consistently serialized to JSON strings
Location: utils.py — to_openai_responses_message_dict(), line 666
ToolCallBlock.tool_kwargs can be either a str or a dict. The serializer passes it through as-is, but the Responses API expects arguments to be a JSON string.
```python
# Upstream code
tool_calls.extend([{
    "type": "function_call",
    "arguments": block.tool_kwargs,  # <-- may be a dict
    "call_id": block.tool_call_id,
    "name": block.tool_name,
}])
```
Similarly, when additional_kwargs["tool_calls"] contains tool calls, the arguments are passed through without serialization (line 678).
Suggested fix
```python
import json

def _serialize_arguments(value):
    return value if isinstance(value, str) else json.dumps(value)
```
Apply this to both the ToolCallBlock path and the additional_kwargs["tool_calls"] path.
Summary Table
| # | Bug | Severity | Component |
|---|---|---|---|
| 1 | Assistant text dropped alongside tool calls | High | utils.py serializer |
| 2 | phase lost when parsing responses | High | responses.py parser |
| 3 | phase lost when serializing messages | High | utils.py serializer |
| 4 | Reasoning/sampling gated on exact model names | Medium | responses.py init + kwargs |
| 5 | Reasoning tokens duplicated across blocks | Low | responses.py _chat |
| 6 | Tool call arguments not JSON-serialized | Medium | utils.py serializer |
Bugs 1-3 compound: together they break the round-trip for GPT-5.4-style multi-turn tool workflows where the model emits commentary + tool calls in the same turn, tagged with phase.
Version
llama-index-llms-openai==0.7.3, llama-index-core==0.14.18
Steps to Reproduce
Steps in the description.