[Bug]: Multiple bugs in llama-index-llms-openai #21124

@DimaDidmanidze

Bug Description

LlamaIndex OpenAI Responses API: Round-Trip & Reasoning Bugs

Affected versions: llama-index-llms-openai==0.7.3, llama-index-core==0.14.18
Files: llama_index/llms/openai/utils.py, llama_index/llms/openai/responses.py

We use the Responses API with GPT-5.x models in a multi-turn tool-calling workflow. Several bugs in the serialization and parsing paths cause data loss when assistant messages are round-tripped through to_openai_responses_message_dict and _parse_response_output. Below is a summary of each issue with a minimal reproduction.


1. Assistant text silently dropped when tool calls are present

Location: utils.py, to_openai_responses_message_dict(), lines 676-684

When an assistant message contains both text blocks and tool calls, the serializer discards the text entirely and returns only the tool call items.

# Upstream code (simplified)
if "tool_calls" in message.additional_kwargs:
    message_dicts = [tc.model_dump() for tc in message.additional_kwargs["tool_calls"]]
    return [*reasoning, *message_dicts]          # <-- text content is lost
elif tool_calls:
    return [*reasoning, *tool_calls]             # <-- same here

The content / message_dict built earlier in the function is never included in these return paths.

Reproduction

from llama_index.core.base.llms.types import ChatMessage, MessageRole, TextBlock, ToolCallBlock
from llama_index.llms.openai.utils import to_openai_responses_message_dict

msg = ChatMessage(
    role=MessageRole.ASSISTANT,
    blocks=[
        TextBlock(text="I'll search for that information now."),
        ToolCallBlock(tool_name="search", tool_call_id="call_1", tool_kwargs='{"q": "test"}'),
    ],
)

result = to_openai_responses_message_dict(msg, model="o3-mini")
print(result)
# Actual:
#   [{"type": "function_call", "arguments": ..., "call_id": "call_1", "name": "search"}]
#
# Expected:
#   [{"role": "assistant", "content": "I'll search for that information now."},
#    {"type": "function_call", "arguments": ..., "call_id": "call_1", "name": "search"}]

Impact

In multi-turn tool-use workflows, the model's pre-tool-call commentary (e.g. "I'll inspect the vendor site first") is stripped from conversation history. GPT-5.4 uses these preambles as its primary reasoning-before-action mechanism, so losing them degrades follow-up quality.

Suggested fix

Include the assistant message_dict in the returned list when it has non-empty content:

if tool_calls:
    items = [*reasoning]
    if message_dict.get("content") not in (None, "", []):
        items.append(message_dict)
    items.extend(tool_calls)
    return items
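As a standalone sketch of the suggested fix (the function name and its three arguments are hypothetical stand-ins for the `reasoning`, `message_dict`, and `tool_calls` locals built earlier in to_openai_responses_message_dict):

```python
def build_assistant_items(reasoning, message_dict, tool_calls):
    """Sketch of the fixed return path: keep the assistant's text
    alongside its tool calls instead of silently dropping it."""
    items = [*reasoning]
    if message_dict.get("content") not in (None, "", []):
        items.append(message_dict)
    items.extend(tool_calls)
    return items

msg = {"role": "assistant", "content": "I'll search for that information now."}
call = {"type": "function_call", "call_id": "call_1", "name": "search"}
assert build_assistant_items([], msg, [call]) == [msg, call]  # text preserved
assert build_assistant_items([], {"content": ""}, [call]) == [call]  # empty text skipped
```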

2. ResponseOutputMessage.phase not preserved when parsing responses

Location: responses.py, OpenAIResponses._parse_response_output(), lines 471-530

The OpenAI Responses API returns phase ("commentary" or "final_answer") on ResponseOutputMessage objects for GPT-5.4-style flows. The upstream parser never reads this field.

# Upstream code (simplified)
for item in output:
    if isinstance(item, ResponseOutputMessage):
        for part in item.content:
            if hasattr(part, "text"):
                blocks.append(TextBlock(text=part.text))
        # item.phase is never accessed

Reproduction

from unittest.mock import MagicMock
from llama_index.llms.openai import OpenAIResponses

# Simulate a response with phase="commentary"
mock_msg = MagicMock()
mock_msg.type = "message"
mock_msg.role = "assistant"
mock_msg.phase = "commentary"

mock_text = MagicMock()
mock_text.text = "Let me think about this..."
mock_text.annotations = []
mock_text.refusal = None
mock_msg.content = [mock_text]

# Need to make isinstance() work — use the real type
from openai.types.responses import ResponseOutputMessage
mock_msg.__class__ = ResponseOutputMessage

result = OpenAIResponses._parse_response_output([mock_msg])
print(result.message.additional_kwargs)
# Actual:   {"built_in_tool_calls": []}
# Expected: {"built_in_tool_calls": [], "phase": "commentary"}

Impact

When replaying conversation history, commentary turns lose their phase distinction and appear as regular assistant messages. The model can no longer tell which of its prior turns were intermediate reasoning vs. final answers.

Suggested fix

Read phase from each ResponseOutputMessage and attach it to the ChatMessage:

if isinstance(item, ResponseOutputMessage):
    item_phase = getattr(item, "phase", None)
    if item_phase in ("commentary", "final_answer"):
        phase = item_phase
    # ... existing content parsing ...

# After the loop:
if phase is not None:
    message.additional_kwargs["phase"] = phase
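The extraction step can be exercised in isolation; this standalone sketch (function name is illustrative, and SimpleNamespace stands in for real ResponseOutputMessage objects) shows the last-phase-wins behavior of the loop above:

```python
from types import SimpleNamespace

def extract_phase(output_items):
    """Sketch of the phase-extraction loop: return the last recognized
    phase value seen on any output item, or None if none carry one."""
    phase = None
    for item in output_items:
        item_phase = getattr(item, "phase", None)
        if item_phase in ("commentary", "final_answer"):
            phase = item_phase
    return phase

items = [SimpleNamespace(phase="commentary"), SimpleNamespace()]
assert extract_phase(items) == "commentary"
assert extract_phase([SimpleNamespace()]) is None  # no phase attribute -> no kwarg
```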

3. phase not included when serializing assistant messages back to the API

Location: utils.py, to_openai_responses_message_dict(), lines 730-739

Even if phase were preserved on parsing (bug #2), the serializer never writes it back out. The message_dict for assistant messages is built without checking for phase in additional_kwargs.

# Upstream code
message_dict = {
    "role": message.role.value,
    "content": ...,
}
# No phase handling for assistant messages

Suggested fix

if message.role == MessageRole.ASSISTANT:
    phase = message.additional_kwargs.get("phase")
    if phase in ("commentary", "final_answer"):
        message_dict["phase"] = phase

Combined with the fix for bug #2, this completes the round-trip so that phase survives parse -> serialize -> parse cycles.
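A minimal sketch of that round trip, with both sides reduced to standalone helpers (names are illustrative, not upstream API):

```python
def parse_phase(raw_phase, additional_kwargs):
    """Parse side (bug 2): record a recognized phase on the ChatMessage kwargs."""
    if raw_phase in ("commentary", "final_answer"):
        additional_kwargs["phase"] = raw_phase
    return additional_kwargs

def serialize_phase(additional_kwargs, message_dict):
    """Serialize side (bug 3): write the recorded phase back onto the API dict."""
    phase = additional_kwargs.get("phase")
    if phase in ("commentary", "final_answer"):
        message_dict["phase"] = phase
    return message_dict

kwargs = parse_phase("commentary", {})
out = serialize_phase(kwargs, {"role": "assistant", "content": "..."})
assert out["phase"] == "commentary"  # phase survives parse -> serialize -> parse
```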


4. Reasoning support and sampling-param stripping use exact model-name lookup

Location: responses.py, __init__() (line 318) and _get_model_kwargs() (lines 424-435)

Both the constructor and _get_model_kwargs gate reasoning behavior on self.model in O1_MODELS, which is an exact dictionary lookup. Any model snapshot not pre-listed in O1_MODELS (e.g. a newly released dated version like gpt-5.4-2026-03-05) silently falls through: reasoning options are not forwarded, and temperature/top_p are sent to models that reject them.

# Upstream __init__
if model in O1_MODELS:         # exact match only
    temperature = 1.0

# Upstream _get_model_kwargs
if self.model in O1_MODELS and self.reasoning_options is not None:
    model_kwargs["reasoning"] = self.reasoning_options

if self.reasoning_options is not None or self.model in O1_MODELS:
    for param in params_to_exclude_for_reasoning:
        model_kwargs.pop(param, None)

Sub-issues

4a. Constructor forces temperature=1.0 for all O1/GPT-5 models. GPT-5.2 and GPT-5.4 support custom temperature when reasoning_effort="none", but the constructor overwrites it unconditionally.

4b. Sampling params are stripped too aggressively. The upstream logic strips temperature, top_p, presence_penalty, and frequency_penalty for all O1-family models regardless of reasoning effort. GPT-5.2+ models accept these params when reasoning is disabled.

Reproduction

from llama_index.llms.openai import OpenAIResponses

# A valid dated snapshot not in O1_MODELS
llm = OpenAIResponses(
    model="gpt-5.4-2026-03-05",
    temperature=0.3,
    reasoning_options={"effort": "low", "summary": "concise"},
    api_key="sk-test",
)

kwargs = llm._get_model_kwargs()
print("reasoning" in kwargs)  # False — reasoning options silently dropped
print("temperature" in kwargs)  # True — should have been stripped for this model

Suggested fix

Use prefix matching instead of exact dict lookups:

def _supports_reasoning(model: str) -> bool:
    return model.startswith(("o1-", "o3-", "o4-", "gpt-5-", "gpt-5."))

And apply more nuanced sampling-param rules per model family — e.g. GPT-5.2+ allows sampling params when reasoning_effort is "none".
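To illustrate, the suggested predicate classifies dated snapshots correctly (the model strings below are examples only, not an exhaustive list of what O1_MODELS covers):

```python
def _supports_reasoning(model: str) -> bool:
    # Prefix match so newly released dated snapshots are covered
    # without updating a hardcoded O1_MODELS dict.
    return model.startswith(("o1-", "o3-", "o4-", "gpt-5-", "gpt-5."))

assert _supports_reasoning("gpt-5.4-2026-03-05")  # dated snapshot now matches
assert _supports_reasoning("o3-mini")
assert not _supports_reasoning("gpt-4o")          # non-reasoning family excluded
```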


5. Reasoning token count duplicated across all ThinkingBlocks

Location: responses.py, _chat(), lines 554-559

When the response contains multiple ThinkingBlocks, the total reasoning_tokens count is assigned to every block instead of being divided among them.

# Upstream code
if hasattr(response.usage.output_tokens_details, "reasoning_tokens"):
    for block in chat_response.message.blocks:
        if isinstance(block, ThinkingBlock):
            block.num_tokens = response.usage.output_tokens_details.reasoning_tokens
            # ^ Every block gets the TOTAL count

Reproduction

from llama_index.core.base.llms.types import ThinkingBlock

# Simulate: 3 reasoning blocks, 900 total reasoning tokens
blocks = [ThinkingBlock(content=f"step {i}") for i in range(3)]

# Upstream assigns 900 to each:
total = 900
for block in blocks:
    block.num_tokens = total

print([b.num_tokens for b in blocks])
# Actual:   [900, 900, 900]  (total appears to be 2700)
# Expected: [300, 300, 300]  (evenly distributed)

Suggested fix

reasoning_blocks = [b for b in chat_response.message.blocks if isinstance(b, ThinkingBlock)]
if reasoning_blocks:
    total = response.usage.output_tokens_details.reasoning_tokens or 0
    per_block = total // len(reasoning_blocks)
    remainder = total % len(reasoning_blocks)
    for i, block in enumerate(reasoning_blocks):
        block.num_tokens = per_block + (1 if i < remainder else 0)
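The split logic can be factored into a small helper for clarity (the function is a sketch, not upstream code); note how a non-divisible total is handled without inventing or losing tokens:

```python
def distribute_tokens(total, n_blocks):
    """Split `total` reasoning tokens across `n_blocks` ThinkingBlocks,
    giving the first `remainder` blocks one extra token each."""
    per_block, remainder = divmod(total, n_blocks)
    return [per_block + (1 if i < remainder else 0) for i in range(n_blocks)]

assert distribute_tokens(900, 3) == [300, 300, 300]
assert distribute_tokens(901, 3) == [301, 300, 300]
assert sum(distribute_tokens(901, 3)) == 901  # totals always reconcile
```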

6. Tool call arguments not consistently serialized to JSON strings

Location: utils.py, to_openai_responses_message_dict(), line 666

ToolCallBlock.tool_kwargs can be either a str or a dict. The serializer passes it through as-is, but the Responses API expects arguments to be a JSON string.

# Upstream code
tool_calls.extend([{
    "type": "function_call",
    "arguments": block.tool_kwargs,   # <-- may be a dict
    "call_id": block.tool_call_id,
    "name": block.tool_name,
}])

Similarly, when additional_kwargs["tool_calls"] contains tool calls, the arguments are passed through without serialization (line 678).

Suggested fix

def _serialize_arguments(value):
    return value if isinstance(value, str) else json.dumps(value)

Apply to both the ToolCallBlock path and the additional_kwargs["tool_calls"] path.
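The helper is idempotent across both input shapes, so applying it in either path is safe:

```python
import json

def _serialize_arguments(value):
    # Pass already-serialized JSON strings through untouched;
    # dump dicts to the JSON string the Responses API expects.
    return value if isinstance(value, str) else json.dumps(value)

assert _serialize_arguments('{"q": "test"}') == '{"q": "test"}'  # str: unchanged
assert _serialize_arguments({"q": "test"}) == '{"q": "test"}'    # dict: dumped
```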


Summary Table

| # | Bug | Severity | Component |
|---|-----|----------|-----------|
| 1 | Assistant text dropped alongside tool calls | High | utils.py serializer |
| 2 | phase lost when parsing responses | High | responses.py parser |
| 3 | phase lost when serializing messages | High | utils.py serializer |
| 4 | Reasoning/sampling gated on exact model names | Medium | responses.py init + kwargs |
| 5 | Reasoning tokens duplicated across blocks | Low | responses.py _chat |
| 6 | Tool call arguments not JSON-serialized | Medium | utils.py serializer |

Bugs 1-3 compound: together they break the round-trip for GPT-5.4-style multi-turn tool workflows where the model emits commentary + tool calls in the same turn, tagged with phase.

Version

llama-index-llms-openai==0.7.3, llama-index-core==0.14.18

Steps to Reproduce

Steps in the description.

Relevant Logs/Tracebacks
