@@ -0,0 +1,85 @@
"""
Agent Tool Trajectory Experiment
This example demonstrates Traceloop's agent tool trajectory evaluator:
- Agent Tool Trajectory: Validates the agent tool trajectory
This evaluator helps ensure your AI agents perform optimally and follow the expected tool trajectory.
"""

import asyncio
from traceloop.sdk import Traceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition

# Initialize Traceloop
client = Traceloop.init()


def agent_evaluators_task(row):
    executed_tool_calls = row.get("actual", "")
    default_expected = (
        "[{'name': 'search', 'input': {'query': 'weather'}}, "
        "{'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}}, "
        "{'name': 'get_confirmation', 'input': {'confirmation': 'flight booked'}}]"
    )
    expected_tool_calls = row.get("expected", default_expected)

    return {
        "executed_tool_calls": executed_tool_calls,
        "expected_tool_calls": expected_tool_calls,
    }
Comment on lines +18 to +30
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n -A5 -B5 "tool_calls" --type=py

Repository: traceloop/openllmetry

Length of output: 50378


🏁 Script executed:

# First, let's look at the file in question to understand the context
cat -n packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py

# Search for the evaluator that uses agent_evaluators_task
rg -n "agent_evaluators_task\|AgentToolTrajectoryEvaluator\|agent_tool_trajectory" --type=py packages/sample-app/

Repository: traceloop/openllmetry

Length of output: 3431


🏁 Script executed:

# Search for the agent_tool_trajectory evaluator implementation
rg -n "agent_tool_trajectory" --type=py -A10 -B5 | head -100

# Also search for EvaluatorMadeByTraceloopDefinition
fd -e py -x grep -l "agent_tool_trajectory\|AgentToolTrajectory" {} \;

Repository: traceloop/openllmetry

Length of output: 6137


🏁 Script executed:

# Get the full definition of agent_tool_trajectory evaluator
sed -n '109,135p' packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py

# Also check the response.py to understand how it processes tool calls
cat packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py

Repository: traceloop/openllmetry

Length of output: 6133


🏁 Script executed:

# Search for other examples of agent_tool_trajectory evaluator usage
rg -n "executed_tool_calls\|expected_tool_calls" --type=py -B5 -A5

# Also search for any tests or documentation
fd -e md -e txt -x grep -l "agent.*tool.*trajectory\|executed_tool_calls" {} \;

Repository: traceloop/openllmetry

Length of output: 47


🏁 Script executed:

# Search for any README or documentation files related to experiments
fd -e md "README\|GUIDE\|EXAMPLE" packages/sample-app/ | head -20

# Check if there are any other evaluator examples in sample-app
ls -la packages/sample-app/sample_app/experiment/made_by_traceloop/

# Search for how row data is structured or passed
rg -n "agent-tool-trajectory" --type=py -B3 -A3

Repository: traceloop/openllmetry

Length of output: 5601


🏁 Script executed:

# Check other agent experiment files to see how they handle tool calls
cat packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py

# Also check travel_agent_exp.py
cat packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py

Repository: traceloop/openllmetry

Length of output: 12825


🏁 Script executed:

# Search for the actual evaluator implementation or any processing logic
find packages/traceloop-sdk -name "*.py" -type f -exec grep -l "agent_tool_trajectory\|AgentToolTrajectory" {} \;

# Check if there's any validation or parsing code for tool_calls
rg -n "json.loads\|ast.literal_eval\|parse.*tool" --type=py packages/traceloop-sdk/ | head -50

Repository: traceloop/openllmetry

Length of output: 416


🏁 Script executed:

# Check the request.py to see AgentToolTrajectoryRequest structure
cat packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py

Repository: traceloop/openllmetry

Length of output: 13294


🏁 Script executed:

# Check if there are any examples or tests showing proper format
rg -n "executed_tool_calls\|expected_tool_calls" --type=py -B2 -A2 packages/

Repository: traceloop/openllmetry

Length of output: 47


Fix JSON string format for tool calls to use valid JSON syntax.

Lines 20-24 define default_expected as a string, which is correct. However, the string uses single quotes instead of double quotes, creating invalid JSON that will fail parsing. Update to proper JSON format:

Corrected code
default_expected = (
    '[{"name": "search", "input": {"query": "weather"}}, '
    '{"name": "book_flight", "input": {"flight": "NYC to Paris"}}, '
    '{"name": "get_confirmation", "input": {"confirmation": "flight booked"}}]'
)

The agent-tool-trajectory evaluator expects both executed_tool_calls and expected_tool_calls as JSON strings (not Python lists).

🤖 Prompt for AI Agents
In
@packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
around lines 18-30: the default_expected string in agent_evaluators_task uses
Python-style single quotes and will not be valid JSON; update the
default_expected value to a JSON-valid string using double quotes (e.g., change
the list/dict literals to use double quotes for keys and string values) so that
expected_tool_calls (and executed_tool_calls) remain JSON strings parsable by
downstream evaluators; ensure you only modify the default_expected literal
inside the agent_evaluators_task function.
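One way to avoid this class of quoting bug entirely is to keep the default trajectory as Python objects and serialize it with json.dumps, which always produces double-quoted, valid JSON. The sketch below is illustrative only; it reuses the "actual"/"expected" row keys and tool names from the file above, and the exact dataset schema is an assumption.

# Illustrative sketch: build the expected trajectory as Python objects and
# serialize to a JSON string, so the quoting is always valid JSON.
import json

EXPECTED_TRAJECTORY = [
    {"name": "search", "input": {"query": "weather"}},
    {"name": "book_flight", "input": {"flight": "NYC to Paris"}},
    {"name": "get_confirmation", "input": {"confirmation": "flight booked"}},
]


def agent_evaluators_task(row):
    # Row keys mirror the example above; adjust to your dataset schema.
    executed_tool_calls = row.get("actual", "")
    expected_tool_calls = row.get("expected", json.dumps(EXPECTED_TRAJECTORY))
    return {
        "executed_tool_calls": executed_tool_calls,
        "expected_tool_calls": expected_tool_calls,
    }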



async def run_agent_tool_trajectory_experiment():
    print("\n" + "="*80)
    print("AGENT TOOL TRAJECTORY EXPERIMENT")
    print("="*80 + "\n")
    print("This experiment will test the agent tool trajectory with the agent tool trajectory evaluator:\n")
    print("1. Agent Tool Trajectory - Validates the agent tool trajectory")
    print("\n" + "-"*80 + "\n")

    # Configure agent evaluators
    evaluators = [
        EvaluatorMadeByTraceloopDefinition.agent_tool_trajectory(
            input_params_sensitive=True,
            mismatch_sensitive=False,
            order_sensitive=False,
            threshold=0.7,
        ),
    ]

    print("Running experiment with evaluators:")
    for evaluator in evaluators:
        print(f" - {evaluator.slug}")

    print("\n" + "-"*80 + "\n")

    # Run the experiment
    # Note: You'll need to create a dataset with appropriate test cases for agents
    results, errors = await client.experiment.run(
        dataset_slug="agent-tool-trajectory",  # Set a dataset slug that exists in the traceloop platform
        dataset_version="v1",
        task=agent_evaluators_task,
        evaluators=evaluators,
        experiment_slug="agent-tool-trajectory-exp",
        stop_on_error=False,
        wait_for_results=True,
    )

    print("\n" + "="*80)
    print("Agent tool trajectory experiment completed!")
    print("="*80 + "\n")

    print("Results summary:")
    print(f" - Total rows processed: {len(results) if results else 0}")
    print(f" - Errors encountered: {len(errors) if errors else 0}")

    if errors:
        print("\nErrors:")
        for error in errors:
            print(f" - {error}")


if __name__ == "__main__":
    print("\nAgent Tool Trajectory Experiment\n")

    asyncio.run(run_agent_tool_trajectory_experiment())
@@ -15,7 +15,7 @@
import os
from openai import AsyncOpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition

# Initialize Traceloop
client = Traceloop.init()
@@ -135,11 +135,11 @@ async def run_agents_experiment():

# Configure agent evaluators
evaluators = [
EvaluatorMadeByTraceloop.agent_goal_accuracy(),
EvaluatorMadeByTraceloop.agent_tool_error_detector(),
EvaluatorMadeByTraceloop.agent_flow_quality(),
EvaluatorMadeByTraceloop.agent_efficiency(),
EvaluatorMadeByTraceloop.agent_goal_completeness(),
EvaluatorMadeByTraceloopDefinition.agent_goal_accuracy(),
EvaluatorMadeByTraceloopDefinition.agent_tool_error_detector(),
EvaluatorMadeByTraceloopDefinition.agent_flow_quality(),

⚠️ Potential issue | 🔴 Critical

Missing required parameters for agent_flow_quality().

According to the method signature in definitions.py (lines 42-62), agent_flow_quality() requires two mandatory parameters: conditions (list[str]) and threshold (float). Calling it without arguments will cause a TypeError.

🐛 Proposed fix

Add the required parameters:

-        EvaluatorMadeByTraceloopDefinition.agent_flow_quality(),
+        EvaluatorMadeByTraceloopDefinition.agent_flow_quality(
+            conditions=["Agent should not repeat questions", "Agent should complete task efficiently"],
+            threshold=0.7,
+        ),
🤖 Prompt for AI Agents
In @packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py at
line 140: the call to EvaluatorMadeByTraceloopDefinition.agent_flow_quality() is
missing its required parameters and will raise a TypeError; update the call in
agents_exp.py to pass a list of condition strings for the first arg (conditions)
and a float for the second arg (threshold) per the signature defined in
definitions.py (agent_flow_quality(conditions: list[str], threshold: float));
either supply literal values that match the evaluator expectations (e.g.,
["conditionA", "conditionB"], 0.8) or pass through appropriately named variables
(e.g., conditions, threshold) that are defined earlier in the module.

EvaluatorMadeByTraceloopDefinition.agent_efficiency(),
EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(),

⚠️ Potential issue | 🔴 Critical

Missing required parameter for agent_goal_completeness().

According to the method signature in definitions.py (lines 76-94), agent_goal_completeness() requires a mandatory threshold (float) parameter. Calling it without arguments will cause a TypeError.

🐛 Proposed fix

Add the required parameter:

-        EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(),
+        EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(
+            threshold=0.8,
+        ),
🤖 Prompt for AI Agents
In @packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py at
line 142: the call to
EvaluatorMadeByTraceloopDefinition.agent_goal_completeness() is missing its
required threshold parameter; update the call in agents_exp.py to pass a float
threshold (e.g., 0.8 or the appropriate value for your tests) so it matches the
signature defined in definitions.py and avoids the TypeError.

]

print("Running experiment with evaluators:")
@@ -13,7 +13,7 @@
import os
from openai import AsyncOpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition

# Initialize Traceloop
client = Traceloop.init()
@@ -70,9 +70,9 @@ async def run_content_compliance_experiment():

# Configure content compliance evaluators
evaluators = [
EvaluatorMadeByTraceloop.profanity_detector(),
EvaluatorMadeByTraceloop.toxicity_detector(threshold=0.7),
EvaluatorMadeByTraceloop.sexism_detector(threshold=0.7),
EvaluatorMadeByTraceloopDefinition.profanity_detector(),
EvaluatorMadeByTraceloopDefinition.toxicity_detector(threshold=0.7),
EvaluatorMadeByTraceloopDefinition.sexism_detector(threshold=0.7),
]

print("Running experiment with content safety evaluators:")
@@ -13,7 +13,7 @@
import os
from openai import AsyncOpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition

# Initialize Traceloop
client = Traceloop.init()
@@ -76,8 +76,8 @@ async def run_correctness_experiment():

# Configure correctness evaluators
evaluators = [
EvaluatorMadeByTraceloop.answer_relevancy(),
EvaluatorMadeByTraceloop.faithfulness(),
EvaluatorMadeByTraceloopDefinition.answer_relevancy(),
EvaluatorMadeByTraceloopDefinition.faithfulness(),
]

print("Running experiment with evaluators:")
@@ -15,7 +15,7 @@
import os
from openai import AsyncOpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition

# Initialize Traceloop
client = Traceloop.init()
@@ -104,17 +104,17 @@ async def run_formatting_experiment():
}'''

evaluators = [
EvaluatorMadeByTraceloop.json_validator(
EvaluatorMadeByTraceloopDefinition.json_validator(
enable_schema_validation=True,
schema_string=json_schema
),
EvaluatorMadeByTraceloop.sql_validator(),
EvaluatorMadeByTraceloop.regex_validator(
EvaluatorMadeByTraceloopDefinition.sql_validator(),
EvaluatorMadeByTraceloopDefinition.regex_validator(
regex=r"^\d{3}-\d{2}-\d{4}$", # SSN format
should_match=True,
case_sensitive=True
),
EvaluatorMadeByTraceloop.placeholder_regex(
EvaluatorMadeByTraceloopDefinition.placeholder_regex(
regex=r"^user_.*",
placeholder_name="username",
should_match=True
@@ -15,7 +15,7 @@
import os
from openai import AsyncOpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition

# Initialize Traceloop
client = Traceloop.init()
@@ -88,10 +88,10 @@ async def run_advanced_quality_experiment():

# Configure advanced quality evaluators
evaluators = [
EvaluatorMadeByTraceloop.perplexity(),
EvaluatorMadeByTraceloop.agent_goal_accuracy(),
EvaluatorMadeByTraceloop.semantic_similarity(),
EvaluatorMadeByTraceloop.topic_adherence(),
EvaluatorMadeByTraceloopDefinition.perplexity(),
EvaluatorMadeByTraceloopDefinition.agent_goal_accuracy(),
EvaluatorMadeByTraceloopDefinition.semantic_similarity(),
EvaluatorMadeByTraceloopDefinition.topic_adherence(),
]

print("Running experiment with advanced quality evaluators:")
@@ -14,7 +14,7 @@
import os
from openai import AsyncOpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition

# Initialize Traceloop
client = Traceloop.init()
@@ -73,9 +73,9 @@ async def run_security_experiment():

# Configure security evaluators
evaluators = [
EvaluatorMadeByTraceloop.pii_detector(probability_threshold=0.7,),
EvaluatorMadeByTraceloop.secrets_detector(),
EvaluatorMadeByTraceloop.prompt_injection(threshold=0.6),
EvaluatorMadeByTraceloopDefinition.pii_detector(probability_threshold=0.7,),
EvaluatorMadeByTraceloopDefinition.secrets_detector(),
EvaluatorMadeByTraceloopDefinition.prompt_injection(threshold=0.6),
]

print("\n" + "-"*80 + "\n")
@@ -12,7 +12,7 @@
import os
from openai import AsyncOpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition

# Initialize Traceloop
client = Traceloop.init()
@@ -74,10 +74,10 @@ async def run_style_experiment():

# Configure metrics evaluators
evaluators = [
EvaluatorMadeByTraceloop.char_count(),
EvaluatorMadeByTraceloop.word_count(),
EvaluatorMadeByTraceloop.char_count_ratio(),
EvaluatorMadeByTraceloop.word_count_ratio(),
EvaluatorMadeByTraceloopDefinition.char_count(),
EvaluatorMadeByTraceloopDefinition.word_count(),
EvaluatorMadeByTraceloopDefinition.char_count_ratio(),
EvaluatorMadeByTraceloopDefinition.word_count_ratio(),
]

print("Running experiment with metrics evaluators:")
@@ -18,7 +18,7 @@
from pathlib import Path

from traceloop.sdk import Traceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition
from traceloop.sdk.experiment.utils import run_with_span_capture

# Add the agents directory to sys.path for imports
@@ -94,13 +94,13 @@ async def run_travel_agent_experiment():

# Configure agent evaluators
evaluators = [
EvaluatorMadeByTraceloop.agent_goal_accuracy(),
EvaluatorMadeByTraceloop.agent_flow_quality(
EvaluatorMadeByTraceloopDefinition.agent_goal_accuracy(),
EvaluatorMadeByTraceloopDefinition.agent_flow_quality(
threshold=0.7,
conditions=["create_itinerary tool should be called last"],
),
EvaluatorMadeByTraceloop.agent_efficiency(),
EvaluatorMadeByTraceloop.agent_goal_completeness(),
EvaluatorMadeByTraceloopDefinition.agent_efficiency(),
EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(),
]

print("Running experiment with evaluators:")
@@ -3,7 +3,7 @@
from openai import AsyncOpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.guardrails.guardrails import guardrail
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition


Traceloop.init(app_name="medical-chat-example")
@@ -39,7 +39,7 @@ def handle_medical_evaluation(evaluator_result, original_result):


@guardrail(
evaluator=EvaluatorMadeByTraceloop.pii_detector(probability_threshold=0.8),
evaluator=EvaluatorMadeByTraceloopDefinition.pii_detector(probability_threshold=0.8),
on_evaluation_complete=handle_medical_evaluation,
)
async def get_doctor_response_with_pii_check(patient_message: str) -> dict:
@@ -11,7 +11,7 @@
sys.path.insert(0, str(agents_dir))

from traceloop.sdk.guardrails.guardrails import guardrail # noqa: E402
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop # noqa: E402
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition # noqa: E402

⚠️ Potential issue | 🟡 Minor

Remove unused noqa directive.

The # noqa: E402 comment is not needed as the E402 check is not enabled or not triggering on this line.

🧹 Proposed cleanup
-from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition  # noqa: E402
+from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition
🧰 Tools
🪛 Ruff (0.14.10)

14-14: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents
In @packages/sample-app/sample_app/guardrail_travel_agent_example.py at line 14:
the import line for EvaluatorMadeByTraceloopDefinition contains an unnecessary
noqa directive; remove the trailing "# noqa: E402" from the import statement
(the symbol is EvaluatorMadeByTraceloopDefinition) so the line becomes a normal
import without the unused suppression comment.


# Import the travel agent function
try:
@@ -62,7 +62,7 @@ def handle_pii_detection(evaluator_result, original_result):


@guardrail(
evaluator=EvaluatorMadeByTraceloop.pii_detector(probability_threshold=0.7),
evaluator=EvaluatorMadeByTraceloopDefinition.pii_detector(probability_threshold=0.7),
on_evaluation_complete=handle_pii_detection
)
async def guarded_travel_agent(query: str) -> dict:
5 changes: 2 additions & 3 deletions packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
@@ -1,10 +1,9 @@
from .evaluator import Evaluator
from .config import EvaluatorDetails
from .evaluators_made_by_traceloop import EvaluatorMadeByTraceloop, create_evaluator
from ..generated.evaluators.definitions import EvaluatorMadeByTraceloopDefinition

__all__ = [
"Evaluator",
"EvaluatorDetails",
"EvaluatorMadeByTraceloop",
"create_evaluator",
"EvaluatorMadeByTraceloopDefinition",
]
3 changes: 2 additions & 1 deletion packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
@@ -28,7 +28,8 @@ def _validate_evaluator_input(slug: str, input: Dict[str, str]) -> None:
request_model = get_request_model(slug)
if request_model:
try:
request_model(**input)
# Request models expect data nested under 'input' field
request_model(input=input)
except ValidationError as e:
raise ValueError(f"Invalid input for '{slug}': {e}") from e
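
To illustrate why the nesting matters: a request model that declares a nested input field validates {"input": {...}} but rejects the same data passed as flattened keyword arguments. The sketch below uses made-up model names (ToyEvaluatorInput, ToyEvaluatorRequest) rather than the actual generated classes, and assumes Pydantic-style models like the ones returned by get_request_model.

# Hypothetical sketch of the nested-'input' request shape (made-up model names).
from pydantic import BaseModel, ValidationError


class ToyEvaluatorInput(BaseModel):
    text: str


class ToyEvaluatorRequest(BaseModel):
    input: ToyEvaluatorInput  # payload lives under a nested 'input' field


payload = {"text": "example completion to evaluate"}

ToyEvaluatorRequest(input=payload)  # ok: dict is validated into the nested model

try:
    ToyEvaluatorRequest(**payload)  # old call style: required 'input' field is missing
except ValidationError as exc:
    print("flattened kwargs fail validation:", exc.errors()[0]["type"])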
