[open-swe] feat: Add LangSmith tool calling dataset and evaluation script #110

open-swe · 2025-06-19T19:27:12Z

Fixes #109

This PR adds a comprehensive script for setting up and evaluating LangSmith datasets for tool calling functionality.

What's Added

python/openevals/tool_calling_dataset.py: Complete tool calling evaluation framework including:
- Dataset creation function that formats examples for LangSmith
- Tool call extraction functions that process various message formats
- Evaluator that compares expected vs actual tool calls with detailed scoring
- Integration with existing openevals utilities for seamless evaluation workflow

Handles multiple input formats (ChatCompletionMessage, BaseMessage, dict)
Extracts tool calls with names and arguments from AI responses
Provides accurate scoring based on exact matches of tool names and arguments
Integrates with existing _run_evaluator and _normalize_to_openai_messages_list utilities
Includes comprehensive error handling and edge case management

The script follows the existing openevals patterns and provides a complete solution for testing tool calling functionality with LangSmith datasets.

open-swe-dev[bot] added 3 commits June 19, 2025 19:22

Apply patch

378eea0

Apply patch

682e821

Apply patch

b9b1d7b

open-swe bot added the open-swe label Jun 19, 2025