Description
I’d like to propose adding a per-turn index to the GenAI span naming emitted by PydanticAI agent/model tracing, similar in spirit to the per-LLM-turn spans described in Promptfoo’s tracing documentation. Today the model spans are emitted with names like chat {gen_ai.request.model} and the corresponding semantic attributes include gen_ai.operation.name="chat" and gen_ai.request.model="gpt-4.1-mini" (or similar), which is useful, but it does not expose the turn number in a way that is easy to assert against during evaluation.
My use case is evaluation with Promptfoo on top of OpenTelemetry traces outputed by PydanticAI agent run. I need to verify not just that the correct tools were called, but that the agent batched tool calls in the correct turn order. For example, in a workflow where the agent must paginate sequentially and then fan out tool calls in parallel, it is important to distinguish:
- a single LLM turn that emits multiple tool calls in parallel, versus
- multiple LLM turns that emit tools sequentially across generations.
Right now, that distinction is hard to make reliably from the existing trace shape because the visible model span name is model-based and the agent span name is separate. In practice, the trace is great for observability, but it is not sufficient for clean automated assertions about turn structure when tool execution order matters. I ended up needing custom wrapper spans in my provider just to create a stable span name for counting turns, which feels like a workaround for something that could be built into the framework itself.
I’d like to suggest that PydanticAI emit an incrementing turn index on every model request, for example by extending the current span naming or attributes to include something like:
- span name:
gen_ai.operation.name + gen_ai.operation.turn + gen_ai.request.model
That would make trace-based evaluation much cleaner because it would let tools like Promptfoo distinguish turn boundaries directly from the trace, instead of inferring them indirectly from parent-child spans or from model-specific span titles. It would also make it easier to assert sequential versus parallel tool calling behavior, which is a common requirement for latency-sensitive agent workflows.
References
Description
I’d like to propose adding a per-turn index to the GenAI span naming emitted by PydanticAI agent/model tracing, similar in spirit to the per-LLM-turn spans described in Promptfoo’s tracing documentation. Today the model spans are emitted with names like
chat {gen_ai.request.model}and the corresponding semantic attributes includegen_ai.operation.name="chat"andgen_ai.request.model="gpt-4.1-mini"(or similar), which is useful, but it does not expose the turn number in a way that is easy to assert against during evaluation.My use case is evaluation with Promptfoo on top of OpenTelemetry traces outputed by PydanticAI agent run. I need to verify not just that the correct tools were called, but that the agent batched tool calls in the correct turn order. For example, in a workflow where the agent must paginate sequentially and then fan out tool calls in parallel, it is important to distinguish:
Right now, that distinction is hard to make reliably from the existing trace shape because the visible model span name is model-based and the agent span name is separate. In practice, the trace is great for observability, but it is not sufficient for clean automated assertions about turn structure when tool execution order matters. I ended up needing custom wrapper spans in my provider just to create a stable span name for counting turns, which feels like a workaround for something that could be built into the framework itself.
I’d like to suggest that PydanticAI emit an incrementing turn index on every model request, for example by extending the current span naming or attributes to include something like:
gen_ai.operation.name + gen_ai.operation.turn + gen_ai.request.modelThat would make trace-based evaluation much cleaner because it would let tools like Promptfoo distinguish turn boundaries directly from the trace, instead of inferring them indirectly from parent-child spans or from model-specific span titles. It would also make it easier to assert sequential versus parallel tool calling behavior, which is a common requirement for latency-sensitive agent workflows.
References