This repository contains the code and evaluation framework for studying the tool-call generation subtask in agentic workflows. It compares a larger Hugging Face model (LLM baseline) against a smaller one (SLM replacement) to measure trade-offs in quality, latency, and cost.
The model is tasked with:
- Choosing exactly one tool from a fixed schema.
- Emitting a machine-readable JSON tool call.
- Providing the required arguments for that tool.
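As an illustration of the three requirements above, here is a minimal sketch of a fixed tool schema and a validity check for an emitted call. The tool names (`get_weather`, `search_docs`), the `tool` / `arguments` keys, and the `validate_tool_call` helper are all hypothetical; the schema actually used in this repository may differ.

```python
import json

# Hypothetical fixed schema: tool name -> set of required argument names.
TOOL_SCHEMA = {
    "get_weather": {"city"},
    "search_docs": {"query", "top_k"},
}


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Check that `raw` is JSON naming one known tool with all required arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(call, dict):
        return False, "not a JSON object"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOL_SCHEMA:
        return False, "unknown tool"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be an object"
    missing = TOOL_SCHEMA[tool] - set(args)
    if missing:
        return False, "missing arguments: " + ", ".join(sorted(missing))
    return True, "ok"


print(validate_tool_call('{"tool": "get_weather", "arguments": {"city": "Oslo"}}'))
```

A call that names an unknown tool, omits a required argument, or is not valid JSON is rejected with a short reason string, which makes failures easy to tally per category.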
The default comparison uses:
- LLM: Qwen/Qwen2.5-14B-Instruct
- SLM: Qwen/Qwen2.5-3B-Instruct
submission/
  configs/            # Experiment configuration YAMLs
  data/               # Generated benchmark data
  scripts/            # Helper scripts for data generation and evaluation
  src/                # Core source code (agent_subtask_swap_hf)
  tests/              # Unit tests for parsing and execution
  requirements.txt    # Project dependencies
  setup.py            # Installation script
- Create a virtual environment (optional but recommended).
- Install dependencies:
pip install -r requirements.txt
- Install the package in editable mode:
pip install -e . --no-build-isolation
If you wish to regenerate the synthetic benchmark data:
python -m agent_subtask_swap_hf.cli make-data
This populates data/generated/ with train.jsonl and test.jsonl.
To run the full evaluation pipeline using the Hugging Face models specified in the config:
python -m agent_subtask_swap_hf.cli evaluate-config --config configs/qwen_14b_vs_3b.yaml
The evaluation will:
- Load the test split.
- Run predictions using both the LLM and SLM.
- Parse and execute the predicted tool calls.
- Calculate metrics (accuracy, latency, cost proxy).
- Save results and plots to the outputs/ directory.
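The predict, parse, and execute steps can be pictured with the sketch below. It is not the repository's implementation: the `add` tool, the `TOOLS` registry, the `tool` / `arguments` keys, and the `generate` callable (a stand-in for a model call) are assumptions made for illustration.

```python
import json
import time


# Hypothetical tool implementation; the repository's real tools may differ.
def add(a: float, b: float) -> float:
    return a + b


TOOLS = {"add": add}


def parse_tool_call(text: str):
    """Extract the outermost {...} span from model output and parse it as JSON."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        call = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) else None


def execute_tool_call(call) -> dict:
    """Run a parsed call and report whether it executed without error."""
    try:
        result = TOOLS[call["tool"]](**call["arguments"])
        return {"executed": True, "result": result}
    except Exception as exc:  # unknown tool, bad arguments, or tool failure
        return {"executed": False, "error": repr(exc)}


def evaluate_output(generate, prompt: str) -> dict:
    """Time one generation, then record JSON validity and execution success."""
    start = time.perf_counter()
    text = generate(prompt)
    latency = time.perf_counter() - start
    call = parse_tool_call(text)
    record = {"latency_s": latency, "json_valid": call is not None}
    if call is not None:
        record.update(execute_tool_call(call))
    else:
        record["executed"] = False
    return record


# Stub generator standing in for a model; a real run would call the LLM or SLM.
stub = lambda prompt: 'Sure: {"tool": "add", "arguments": {"a": 2, "b": 3}}'
print(evaluate_output(stub, "What is 2 + 3?"))
```

Keeping JSON validity and execution success as separate fields lets a run distinguish output that fails to parse from output that parses but names an unknown tool or passes bad arguments.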
The framework reports:
- Tool Accuracy: Correctness of the tool selection.
- Argument Recall/Match: Accuracy of the extracted arguments.
- JSON Validity: Rate at which the model produced valid JSON.
- Execution Success: Rate at which the tool call was actually executable.
- Latency: Time taken for model generation.
- Cost Proxy: A parameter-aware metric (params_billion * total_tokens / 1000).
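As a worked example of the cost proxy formula, the snippet below evaluates it for both models. The parameter counts (14 and 3) are the nominal sizes taken from the model names, and the 500-token total is an arbitrary illustrative figure; the values in the actual configuration may differ.

```python
def cost_proxy(params_billion: float, total_tokens: int) -> float:
    """Cost proxy as defined above: params_billion * total_tokens / 1000."""
    return params_billion * total_tokens / 1000


# Nominal sizes from the model names; 500 tokens is an illustrative total.
llm_cost = cost_proxy(14, 500)
slm_cost = cost_proxy(3, 500)

print(llm_cost)  # 7.0
print(slm_cost)  # 1.5
```

For an equal token count the proxy scales linearly with model size, so under these nominal figures the SLM costs 3/14 of the LLM per call.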
- This repository is designed for local execution with transformers.
- Ensure your hardware can accommodate the models specified in the configuration (e.g., the 14B Qwen model requires significant VRAM).
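For a rough sense of the hardware requirement, the memory needed for the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes nominal sizes (14B and 3B) and 16-bit weights (2 bytes per parameter); it ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    # params_billion * 1e9 params * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


print(weight_memory_gb(14))  # 28 (GB for 14B parameters at 16-bit precision)
print(weight_memory_gb(3))   # 6  (GB for 3B parameters at 16-bit precision)
```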