Skip to content

Conversation

@open-swe
Copy link

@open-swe open-swe bot commented Jun 19, 2025

Fixes #109

This PR adds a comprehensive script for setting up and evaluating LangSmith datasets for tool calling functionality.

What's Added

  • python/openevals/tool_calling_dataset.py: Complete tool calling evaluation framework including:
    • Dataset creation function that formats examples for LangSmith
    • Tool call extraction functions that process various message formats
    • Evaluator that compares expected vs actual tool calls with detailed scoring
    • Integration with existing openevals utilities for seamless evaluation workflow

Key Features

  • Handles multiple input formats (ChatCompletionMessage, BaseMessage, dict)
  • Extracts tool calls with names and arguments from AI responses
  • Provides accurate scoring based on exact matches of tool names and arguments
  • Integrates with existing _run_evaluator and _normalize_to_openai_messages_list utilities
  • Includes comprehensive error handling and edge case management

The script follows the existing openevals patterns and provides a complete solution for testing tool calling functionality with LangSmith datasets.

@open-swe open-swe bot added the open-swe label Jun 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

New Open SWE Request

0 participants