[Bug]: Multiple bugs in llama-index-llms-openai #21124

@DimaDidmanidze

Bug Description

LlamaIndex OpenAI Responses API: Round-Trip & Reasoning Bugs

Affected versions: llama-index-llms-openai==0.7.3, llama-index-core==0.14.18
Files: llama_index/llms/openai/utils.py, llama_index/llms/openai/responses.py

We use the Responses API with GPT-5.x models in a multi-turn tool-calling workflow. Several bugs in the serialization and parsing paths cause data loss when assistant messages are round-tripped through to_openai_responses_message_dict and _parse_response_output. Below is a summary of each issue with a minimal reproduction.


1. Assistant text silently dropped when tool calls are present

Location: utils.py, to_openai_responses_message_dict(), lines 676-684

When an assistant message contains both text blocks and tool calls, the serializer discards the text entirely and returns only the tool call items.

# Upstream code (simplified)
if "tool_calls" in message.additional_kwargs:
    message_dicts = [tc.model_dump() for tc in message.additional_kwargs["tool_calls"]]
    return [*reasoning, *message_dicts]          # <-- text content is lost
elif tool_calls:
    return [*reasoning, *tool_calls]             # <-- same here

The content / message_dict built earlier in the function is never included in these return paths.

Reproduction

from llama_index.core.base.llms.types import ChatMessage, MessageRole, TextBlock, ToolCallBlock
from llama_index.llms.openai.utils import to_openai_responses_message_dict

msg = ChatMessage(
    role=MessageRole.ASSISTANT,
    blocks=[
        TextBlock(text="I'll search for that information now."),
        ToolCallBlock(tool_name="search", tool_call_id="call_1", tool_kwargs='{"q": "test"}'),
    ],
)

result = to_openai_responses_message_dict(msg, model="o3-mini")
print(result)
# Actual:
#   [{"type": "function_call", "arguments": ..., "call_id": "call_1", "name": "search"}]
#
# Expected:
#   [{"role": "assistant", "content": "I'll search for that information now."},
#    {"type": "function_call", "arguments": ..., "call_id": "call_1", "name": "search"}]

Impact

In multi-turn tool-use workflows, the model's pre-tool-call commentary (e.g. "I'll inspect the vendor site first") is stripped from conversation history. GPT-5.4 uses these preambles as its primary reasoning-before-action mechanism, so losing them degrades follow-up quality.

Suggested fix

Include the assistant message_dict in the returned list when it has non-empty content:

if tool_calls:
    items = [*reasoning]
    if message_dict.get("content") not in (None, "", []):
        items.append(message_dict)
    items.extend(tool_calls)
    return items
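As a standalone sketch of the suggested fix (the function name and its three arguments are hypothetical stand-ins for the `reasoning`, `message_dict`, and `tool_calls` locals built earlier in to_openai_responses_message_dict):

```python
def build_assistant_items(reasoning, message_dict, tool_calls):
    """Sketch of the fixed return path: keep the assistant's text
    alongside its tool calls instead of silently dropping it."""
    items = [*reasoning]
    if message_dict.get("content") not in (None, "", []):
        items.append(message_dict)
    items.extend(tool_calls)
    return items

msg = {"role": "assistant", "content": "I'll search for that information now."}
call = {"type": "function_call", "call_id": "call_1", "name": "search"}
assert build_assistant_items([], msg, [call]) == [msg, call]  # text preserved
assert build_assistant_items([], {"content": ""}, [call]) == [call]  # empty text skipped
```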

2. ResponseOutputMessage.phase not preserved when parsing responses

Location: responses.py, OpenAIResponses._parse_response_output(), lines 471-530

The OpenAI Responses API returns phase ("commentary" or "final_answer") on ResponseOutputMessage objects for GPT-5.4-style flows. The upstream parser never reads this field.

# Upstream code (simplified)
for item in output:
    if isinstance(item, ResponseOutputMessage):
        for part in item.content:
            if hasattr(part, "text"):
                blocks.append(TextBlock(text=part.text))
        # item.phase is never accessed

Reproduction

from unittest.mock import MagicMock
from llama_index.llms.openai import OpenAIResponses

# Simulate a response with phase="commentary"
mock_msg = MagicMock()
mock_msg.type = "message"
mock_msg.role = "assistant"
mock_msg.phase = "commentary"

mock_text = MagicMock()
mock_text.text = "Let me think about this..."
mock_text.annotations = []
mock_text.refusal = None
mock_msg.content = [mock_text]

# Need to make isinstance() work — use the real type
from openai.types.responses import ResponseOutputMessage
mock_msg.__class__ = ResponseOutputMessage

result = OpenAIResponses._parse_response_output([mock_msg])
print(result.message.additional_kwargs)
# Actual:   {"built_in_tool_calls": []}
# Expected: {"built_in_tool_calls": [], "phase": "commentary"}

Impact

When replaying conversation history, commentary turns lose their phase distinction and appear as regular assistant messages. The model can no longer tell which of its prior turns were intermediate reasoning vs. final answers.

Suggested fix

Read phase from each ResponseOutputMessage and attach it to the ChatMessage:

if isinstance(item, ResponseOutputMessage):
    item_phase = getattr(item, "phase", None)
    if item_phase in ("commentary", "final_answer"):
        phase = item_phase
    # ... existing content parsing ...

# After the loop:
if phase is not None:
    message.additional_kwargs["phase"] = phase
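The extraction step can be exercised in isolation; this standalone sketch (function name is illustrative, and SimpleNamespace stands in for real ResponseOutputMessage objects) shows the last-phase-wins behavior of the loop above:

```python
from types import SimpleNamespace

def extract_phase(output_items):
    """Sketch of the phase-extraction loop: return the last recognized
    phase value seen on any output item, or None if none carry one."""
    phase = None
    for item in output_items:
        item_phase = getattr(item, "phase", None)
        if item_phase in ("commentary", "final_answer"):
            phase = item_phase
    return phase

items = [SimpleNamespace(phase="commentary"), SimpleNamespace()]
assert extract_phase(items) == "commentary"
assert extract_phase([SimpleNamespace()]) is None  # no phase attribute -> no kwarg
```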

3. phase not included when serializing assistant messages back to the API

Location: utils.py, to_openai_responses_message_dict(), lines 730-739

Even if phase were preserved on parsing (bug #2), the serializer never writes it back out. The message_dict for assistant messages is built without checking for phase in additional_kwargs.

# Upstream code
message_dict = {
    "role": message.role.value,
    "content": ...,
}
# No phase handling for assistant messages

Suggested fix

if message.role == MessageRole.ASSISTANT:
    phase = message.additional_kwargs.get("phase")
    if phase in ("commentary", "final_answer"):
        message_dict["phase"] = phase

Combined with the fix for bug #2, this completes the round-trip so that phase survives parse -> serialize -> parse cycles.
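A minimal sketch of that round trip, with both sides reduced to standalone helpers (names are illustrative, not upstream API):

```python
def parse_phase(raw_phase, additional_kwargs):
    """Parse side (bug 2): record a recognized phase on the ChatMessage kwargs."""
    if raw_phase in ("commentary", "final_answer"):
        additional_kwargs["phase"] = raw_phase
    return additional_kwargs

def serialize_phase(additional_kwargs, message_dict):
    """Serialize side (bug 3): write the recorded phase back onto the API dict."""
    phase = additional_kwargs.get("phase")
    if phase in ("commentary", "final_answer"):
        message_dict["phase"] = phase
    return message_dict

kwargs = parse_phase("commentary", {})
out = serialize_phase(kwargs, {"role": "assistant", "content": "..."})
assert out["phase"] == "commentary"  # phase survives parse -> serialize -> parse
```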


4. Reasoning support and sampling-param stripping use exact model-name lookup

Location: responses.py, __init__() (line 318) and _get_model_kwargs() (lines 424-435)

Both the constructor and _get_model_kwargs gate reasoning behavior on self.model in O1_MODELS, which is an exact dictionary lookup. Any model snapshot not pre-listed in O1_MODELS (e.g. a newly released dated version like gpt-5.4-2026-03-05) silently falls through: reasoning options are not forwarded, and temperature/top_p are sent to models that reject them.

# Upstream __init__
if model in O1_MODELS:         # exact match only
    temperature = 1.0

# Upstream _get_model_kwargs
if self.model in O1_MODELS and self.reasoning_options is not None:
    model_kwargs["reasoning"] = self.reasoning_options

if self.reasoning_options is not None or self.model in O1_MODELS:
    for param in params_to_exclude_for_reasoning:
        model_kwargs.pop(param, None)

Sub-issues

4a. Constructor forces temperature=1.0 for all O1/GPT-5 models. GPT-5.2 and GPT-5.4 support custom temperature when reasoning_effort="none", but the constructor overwrites it unconditionally.

4b. Sampling params are stripped too aggressively. The upstream logic strips temperature, top_p, presence_penalty, and frequency_penalty for all O1-family models regardless of reasoning effort. GPT-5.2+ models accept these params when reasoning is disabled.

Reproduction

from llama_index.llms.openai import OpenAIResponses

# A valid dated snapshot not in O1_MODELS
llm = OpenAIResponses(
    model="gpt-5.4-2026-03-05",
    temperature=0.3,
    reasoning_options={"effort": "low", "summary": "concise"},
    api_key="sk-test",
)

kwargs = llm._get_model_kwargs()
print("reasoning" in kwargs)  # False — reasoning options silently dropped
print("temperature" in kwargs)  # True — should have been stripped for this model

Suggested fix

Use prefix matching instead of exact dict lookups:

def _supports_reasoning(model: str) -> bool:
    return model.startswith(("o1-", "o3-", "o4-", "gpt-5-", "gpt-5."))

And apply more nuanced sampling-param rules per model family — e.g. GPT-5.2+ allows sampling params when reasoning_effort is "none".
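To illustrate, the suggested predicate classifies dated snapshots correctly (the model strings below are examples only, not an exhaustive list of what O1_MODELS covers):

```python
def _supports_reasoning(model: str) -> bool:
    # Prefix match so newly released dated snapshots are covered
    # without updating a hardcoded O1_MODELS dict.
    return model.startswith(("o1-", "o3-", "o4-", "gpt-5-", "gpt-5."))

assert _supports_reasoning("gpt-5.4-2026-03-05")  # dated snapshot now matches
assert _supports_reasoning("o3-mini")
assert not _supports_reasoning("gpt-4o")          # non-reasoning family excluded
```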


5. Reasoning token count duplicated across all ThinkingBlocks

Location: responses.py, _chat(), lines 554-559

When the response contains multiple ThinkingBlocks, the total reasoning_tokens count is assigned to every block instead of being divided among them.

# Upstream code
if hasattr(response.usage.output_tokens_details, "reasoning_tokens"):
    for block in chat_response.message.blocks:
        if isinstance(block, ThinkingBlock):
            block.num_tokens = response.usage.output_tokens_details.reasoning_tokens
            # ^ Every block gets the TOTAL count

Reproduction

from llama_index.core.base.llms.types import ThinkingBlock

# Simulate: 3 reasoning blocks, 900 total reasoning tokens
blocks = [ThinkingBlock(content=f"step {i}") for i in range(3)]

# Upstream assigns 900 to each:
total = 900
for block in blocks:
    block.num_tokens = total

print([b.num_tokens for b in blocks])
# Actual:   [900, 900, 900]  (total appears to be 2700)
# Expected: [300, 300, 300]  (evenly distributed)

Suggested fix

reasoning_blocks = [b for b in chat_response.message.blocks if isinstance(b, ThinkingBlock)]
if reasoning_blocks:
    total = response.usage.output_tokens_details.reasoning_tokens or 0
    per_block = total // len(reasoning_blocks)
    remainder = total % len(reasoning_blocks)
    for i, block in enumerate(reasoning_blocks):
        block.num_tokens = per_block + (1 if i < remainder else 0)
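The split logic can be factored into a small helper for clarity (the function is a sketch, not upstream code); note how a non-divisible total is handled without inventing or losing tokens:

```python
def distribute_tokens(total, n_blocks):
    """Split `total` reasoning tokens across `n_blocks` ThinkingBlocks,
    giving the first `remainder` blocks one extra token each."""
    per_block, remainder = divmod(total, n_blocks)
    return [per_block + (1 if i < remainder else 0) for i in range(n_blocks)]

assert distribute_tokens(900, 3) == [300, 300, 300]
assert distribute_tokens(901, 3) == [301, 300, 300]
assert sum(distribute_tokens(901, 3)) == 901  # totals always reconcile
```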

6. Tool call arguments not consistently serialized to JSON strings

Location: utils.py, to_openai_responses_message_dict(), line 666

ToolCallBlock.tool_kwargs can be either a str or a dict. The serializer passes it through as-is, but the Responses API expects arguments to be a JSON string.

# Upstream code
tool_calls.extend([{
    "type": "function_call",
    "arguments": block.tool_kwargs,   # <-- may be a dict
    "call_id": block.tool_call_id,
    "name": block.tool_name,
}])

Similarly, when additional_kwargs["tool_calls"] contains tool calls, the arguments are passed through without serialization (line 678).

Suggested fix

def _serialize_arguments(value):
    return value if isinstance(value, str) else json.dumps(value)

Apply to both the ToolCallBlock path and the additional_kwargs["tool_calls"] path.
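The helper is idempotent across both input shapes, so applying it in either path is safe:

```python
import json

def _serialize_arguments(value):
    # Pass already-serialized JSON strings through untouched;
    # dump dicts to the JSON string the Responses API expects.
    return value if isinstance(value, str) else json.dumps(value)

assert _serialize_arguments('{"q": "test"}') == '{"q": "test"}'  # str: unchanged
assert _serialize_arguments({"q": "test"}) == '{"q": "test"}'    # dict: dumped
```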


Summary Table

| # | Bug | Severity | Component |
|---|-----|----------|-----------|
| 1 | Assistant text dropped alongside tool calls | High | utils.py serializer |
| 2 | phase lost when parsing responses | High | responses.py parser |
| 3 | phase lost when serializing messages | High | utils.py serializer |
| 4 | Reasoning/sampling gated on exact model names | Medium | responses.py init + kwargs |
| 5 | Reasoning tokens duplicated across blocks | Low | responses.py _chat |
| 6 | Tool call arguments not JSON-serialized | Medium | utils.py serializer |

Bugs 1-3 compound: together they break the round-trip for GPT-5.4-style multi-turn tool workflows where the model emits commentary + tool calls in the same turn, tagged with phase.

Version

llama-index-llms-openai==0.7.3, llama-index-core==0.14.18

Steps to Reproduce

Steps in the description.

Relevant Logs/Tracebacks
