Skip to content

Commit 95ad743

Browse files
korenLazarKoren Lazarelronbandel
authored
Add ReflectionToolCallingMetric and update related metrics (#1931)
* Add ReflectionToolCallingMetric and update related metrics - Introduced ReflectionToolCallingMetric for assessing syntactic and semantic validity of tool calls. - Updated MultiTurnToolCallingMetric description for clarity. - Added reflection.json to catalog with appropriate descriptions. - Enhanced test coverage for ReflectionToolCallingMetric and its reduction logic. * removed redundant import which makes tests fail. * Minor fix for mock provider name in llmevalkit * Update descriptions for ReflectionToolCallingMetric and ReflectionToolCallingMetricSyntactic; enhance clarity and detail on evaluation criteria and installation instructions. * Refactor ReflectionToolCallingMetric to use settings directly instead of unitxt.settings; update provider name format for watsonx. * Fixed minor bugs to support different tasks. * Fixed requirements issue, general import bug, and added some guards for the provider. * Fixed pre-commit issues. * made sure that we reinstall libraries from git. * Add logging for installation URL and version info in internal pip action * minor change * fixed assignment of mock provider * Update .github/actions/install-internal-pip/action.yml Co-authored-by: Elron Bandel <[email protected]> * removed two unittests that were causing problems and fixed assertEqual to assertFalse/assertTrue. --------- Co-authored-by: Koren Lazar <[email protected]> Co-authored-by: Elron Bandel <[email protected]>
1 parent c6da61e commit 95ad743

File tree

9 files changed

+703
-239
lines changed

9 files changed

+703
-239
lines changed

.github/actions/install-internal-pip/action.yml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,4 +30,5 @@ runs:
3030
else
3131
URL="git+ssh://git@${{ inputs.host }}/${{ inputs.repo }}.git"
3232
fi
33-
pip install "$URL" ${{ inputs.pip-extra-args }}
33+
echo "Installing from URL: $URL"
34+
pip install --no-cache-dir --force-reinstall "$URL" ${{ inputs.pip-extra-args }}

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,7 @@ repos:
2828
hooks:
2929
- id: enforce-relative-imports
3030
name: Enforce Relative Imports
31-
entry: python utils/enforce_relative_imports.py
31+
entry: python3 utils/enforce_relative_imports.py
3232
language: system
3333
# Adjust the files pattern to match your needs
3434
files: ^src/.*\.py$
@@ -40,7 +40,7 @@ repos:
4040
hooks:
4141
- id: enforce-library-imports
4242
name: Enforce Library Imports
43-
entry: python utils/enforce_library_imports.py
43+
entry: python3 utils/enforce_library_imports.py
4444
language: system
4545
# Adjust the files pattern to match your needs
4646
exclude: (^src/.*\.py$)|utils/enforce_library_imports.py|utils/enforce_relative_imports.py

examples/evaluate_tool_calling_with_reflection.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -64,7 +64,10 @@
6464
test_set=data,
6565
split="test",
6666
format="formats.chat_api",
67-
metrics=["metrics.tool_calling.reflection.syntactic"],
67+
metrics=[
68+
"metrics.tool_calling.reflection.syntactic",
69+
"metrics.tool_calling.reflection",
70+
],
6871
max_test_instances=10,
6972
)
7073

prepare/metrics/tool_calling.py

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
from unitxt.catalog import add_to_catalog
22
from unitxt.metrics import (
33
MultiTurnToolCallingMetric,
4+
ReflectionToolCallingMetric,
45
ReflectionToolCallingMetricSyntactic,
56
ToolCallingMetric,
67
ToolCallKeyValueExtraction,
@@ -48,15 +49,23 @@
4849

4950
add_to_catalog(
5051
MultiTurnToolCallingMetric(
51-
__description__="""Metric that evaluates tool call predictions for the validity with regards to the tools schema."""
52+
__description__="""A metric that assesses tool call predictions for their conformity to the tool schema."""
5253
),
5354
"metrics.tool_calling.multi_turn.validity",
5455
overwrite=True,
5556
)
5657

58+
add_to_catalog(
59+
ReflectionToolCallingMetric(
60+
__description__="""A metric that assesses tool call predictions for both syntactic correctness and semantic validity, using predefined checks combined with LLM-based evaluations. For each instance, it returns a score reflecting its overall validity, as well as a breakdown of the specific checks/metrics that passed or failed, including hallucination check, value format alignment, function selection and agentic constraints satisfaction. Each metric also contains an evidence from the input, an explanation describing the reflection decision, a confidence, and a validity score with a range of 1-5 (higher score -> more valid)."""
61+
),
62+
"metrics.tool_calling.reflection",
63+
overwrite=True,
64+
)
65+
5766
add_to_catalog(
5867
ReflectionToolCallingMetricSyntactic(
59-
__description__="""This metric evaluates whether a model's tool call outputs are structurally valid by checking their compliance with the provided tool schema. For each instance, it returns a binary score (True for valid, False for invalid), and aggregates these into a global percentage across all instances. The evaluation covers a wide range of possible issues, including nonexistent functions or parameters, incorrect parameter types, missing required parameters, values outside allowed ranges, JSON schema violations, invalid or empty API specifications, and malformed tool calls. The main reported score, overall_valid (aliased as score), reflects the proportion of calls that are fully valid, making the metric a measure of syntactic and schema-level correctness rather than semantic accuracy."""
68+
__description__="""This metric evaluates whether a model's tool call outputs are structurally valid by checking their compliance with the provided tool schema. For each instance, it returns a binary score (True for valid, False for invalid), and aggregates these into a global percentage across all instances. The evaluation covers a wide range of possible issues, including nonexistent functions or parameters, incorrect parameter types, missing required parameters, values outside allowed ranges, JSON schema violations, invalid or empty API specifications, and malformed tool calls. The main reported score, overall_valid (aliased as score), reflects the proportion of calls that are fully valid, making the metric a measure of syntactic and schema-level correctness rather than semantic accuracy. Each metric also contains an explanation describing the errors that it detected (if no errors were found - the explanation will be None)."""
6069
),
6170
"metrics.tool_calling.reflection.syntactic",
6271
overwrite=True,
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
11
{
22
"__type__": "multi_turn_tool_calling_metric",
3-
"__description__": "Metric that evaluates tool call predictions for the validity with regards to the tools schema."
3+
"__description__": "A metric that assesses tool call predictions for their conformity to the tool schema."
44
}
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
{
2+
"__type__": "reflection_tool_calling_metric",
3+
"__description__": "A metric that assesses tool call predictions for both syntactic correctness and semantic validity, using predefined checks combined with LLM-based evaluations. For each instance, it returns a score reflecting its overall validity, as well as a breakdown of the specific checks/metrics that passed or failed, including hallucination check, value format alignment, function selection and agentic constraints satisfaction. Each metric also contains an evidence from the input, an explanation describing the reflection decision, a confidence, and a validity score with a range of 1-5 (higher score -> more valid)."
4+
}
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
11
{
22
"__type__": "reflection_tool_calling_metric_syntactic",
3-
"__description__": "This metric evaluates whether a model's tool call outputs are structurally valid by checking their compliance with the provided tool schema. For each instance, it returns a binary score (True for valid, False for invalid), and aggregates these into a global percentage across all instances. The evaluation covers a wide range of possible issues, including nonexistent functions or parameters, incorrect parameter types, missing required parameters, values outside allowed ranges, JSON schema violations, invalid or empty API specifications, and malformed tool calls. The main reported score, overall_valid (aliased as score), reflects the proportion of calls that are fully valid, making the metric a measure of syntactic and schema-level correctness rather than semantic accuracy."
3+
"__description__": "This metric evaluates whether a model's tool call outputs are structurally valid by checking their compliance with the provided tool schema. For each instance, it returns a binary score (True for valid, False for invalid), and aggregates these into a global percentage across all instances. The evaluation covers a wide range of possible issues, including nonexistent functions or parameters, incorrect parameter types, missing required parameters, values outside allowed ranges, JSON schema violations, invalid or empty API specifications, and malformed tool calls. The main reported score, overall_valid (aliased as score), reflects the proportion of calls that are fully valid, making the metric a measure of syntactic and schema-level correctness rather than semantic accuracy. Each metric also contains an explanation describing the errors that it detected (if no errors were found - the explanation will be None)."
44
}

0 commit comments

Comments
 (0)