Agent Subtask Swap with Hugging Face Models

This repository contains the code and evaluation framework for studying the tool-call generation subtask in agentic workflows. It compares a larger Hugging Face model (LLM baseline) against a smaller one (SLM replacement) to measure trade-offs in quality, latency, and cost.

Overview

Each model under comparison is tasked with:

Choosing exactly one tool from a fixed schema.
Emitting a machine-readable JSON tool call.
Providing the required arguments for that tool.
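
The three requirements above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical two-tool schema (`get_weather`, `send_email`) and a `{"tool": ..., "arguments": {...}}` call format. Neither the tool names nor the call format is taken from this repository; the actual schema is defined in the package under src/.

```python
import json

# Hypothetical schema (not the repository's): tool name -> required argument names.
TOOL_SCHEMA = {
    "get_weather": {"city"},
    "send_email": {"to", "subject", "body"},
}


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for a raw model output string."""
    # Requirement 2: the output must be machine-readable JSON.
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid_json"
    if not isinstance(call, dict):
        return False, "not_an_object"

    # Requirement 1: exactly one tool, and it must come from the fixed schema.
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOL_SCHEMA:
        return False, "unknown_tool"

    # Requirement 3: every required argument for that tool must be present.
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "missing_arguments"
    missing = TOOL_SCHEMA[tool] - set(args)
    if missing:
        return False, "missing_required:" + ",".join(sorted(missing))
    return True, "ok"


print(validate_tool_call('{"tool": "get_weather", "arguments": {"city": "Oslo"}}'))
```

Returning a reason string alongside the boolean makes it easier to separate JSON failures from schema failures when aggregating results.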

The default comparison uses:

LLM: Qwen/Qwen2.5-14B-Instruct
SLM: Qwen/Qwen2.5-3B-Instruct

Repository Structure

submission/
  configs/           # Experiment configuration YAMLs
  data/              # Generated benchmark data
  scripts/           # Helper scripts for data generation and evaluation
  src/               # Core source code (agent_subtask_swap_hf)
  tests/             # Unit tests for parsing and execution
  requirements.txt   # Project dependencies
  setup.py           # Installation script

Installation

Create a virtual environment (optional but recommended).
Install dependencies:
```
pip install -r requirements.txt
```
Install the package in editable mode:
```
pip install -e . --no-build-isolation
```

Usage

1. Generate Data

If you wish to regenerate the synthetic benchmark data:

python -m agent_subtask_swap_hf.cli make-data

This populates data/generated/ with train.jsonl and test.jsonl.
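
To inspect the generated files, a generic JSON Lines reader is enough. This is a small sketch that makes no assumption about the record fields, which are defined by the data generator; `load_jsonl` is an illustrative helper, not a function from this package.

```python
import json
from pathlib import Path


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    test_path = Path("data/generated/test.jsonl")
    if test_path.exists():
        rows = load_jsonl(str(test_path))
        print(f"{len(rows)} test records; first record keys: {sorted(rows[0]) if rows else []}")
```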

2. Run Evaluation

To run the full evaluation pipeline using the Hugging Face models specified in the config:

python -m agent_subtask_swap_hf.cli evaluate-config --config configs/qwen_14b_vs_3b.yaml

The evaluation will:

Load the test split.
Run predictions using both the LLM and SLM.
Parse and execute the predicted tool calls.
Calculate metrics (accuracy, latency, cost proxy).
Save results and plots to the outputs/ directory.
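
The core of these steps can be sketched as a single loop. The example below is an illustration only: the record fields `prompt` and `expected_tool`, the `tool` key in the model output, and the `evaluate` function are assumptions, and a stub stands in for the Hugging Face models so the sketch runs without a GPU.

```python
import json
import time
from typing import Callable


def evaluate(records: list[dict], predict: Callable[[str], str]) -> dict:
    """Run `predict` over the records and aggregate simple metrics."""
    n = len(records)
    valid_json = 0
    correct_tool = 0
    total_latency = 0.0
    for rec in records:
        # Time only the generation call, not parsing or scoring.
        start = time.perf_counter()
        raw = predict(rec["prompt"])
        total_latency += time.perf_counter() - start

        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against both JSON validity and tool accuracy
        valid_json += 1
        if isinstance(call, dict) and call.get("tool") == rec["expected_tool"]:
            correct_tool += 1

    return {
        "json_validity": valid_json / n if n else 0.0,
        "tool_accuracy": correct_tool / n if n else 0.0,
        "mean_latency_s": total_latency / n if n else 0.0,
    }


def stub_predict(prompt: str) -> str:
    """Stand-in for a real model: answers weather prompts, fails otherwise."""
    if "weather" in prompt:
        return '{"tool": "get_weather", "arguments": {"city": "Oslo"}}'
    return "I cannot help with that."


records = [
    {"prompt": "What is the weather in Oslo?", "expected_tool": "get_weather"},
    {"prompt": "Email Bob the report.", "expected_tool": "send_email"},
]
print(evaluate(records, stub_predict))
```

Passing the predictor as a callable lets the same loop score the LLM and the SLM in turn, so both are measured under identical conditions.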

Metrics

The framework reports:

Tool Accuracy: Correctness of the tool selection.
Argument Recall/Match: Accuracy of the extracted arguments.
JSON Validity: Rate at which the model produced valid JSON.
Execution Success: Rate at which the tool call was actually executable.
Latency: Time taken for model generation.
Cost Proxy: A parameter-aware metric (params_billion * total_tokens / 1000).
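
The cost proxy formula above is simple enough to work through by hand. In this sketch the parameter counts (14 and 3) are the nominal sizes from the model names and the token count of 2000 is an illustrative value, so the configured values may differ.

```python
def cost_proxy(params_billion: float, total_tokens: int) -> float:
    """Cost proxy as stated above: params_billion * total_tokens / 1000."""
    return params_billion * total_tokens / 1000


# Same hypothetical workload of 2000 total tokens for both models.
llm_cost = cost_proxy(14, 2000)  # 28.0
slm_cost = cost_proxy(3, 2000)   # 6.0
print(llm_cost, slm_cost)
```

Because the proxy is linear in both factors, the smaller model only wins on cost if it does not need proportionally more tokens to complete the same calls.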

Notes

This repository is designed for local execution with transformers.
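Ensure your hardware can accommodate the models specified in the configuration. For example, the 14B Qwen model needs on the order of 28 to 30 GB of memory for its weights alone when loaded in 16-bit precision, before accounting for activations and the KV cache.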
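
A rough lower bound on memory can be estimated from the parameter count and the numeric precision. This is a back-of-the-envelope sketch, not a measurement: it uses the nominal sizes from the model names, counts weights only, and ignores activations, the KV cache, and framework overhead.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    One billion parameters at N bytes each occupy N GB, so the estimate
    is simply params_billion * bytes_per_param.
    """
    return params_billion * bytes_per_param


# 16-bit weights (2 bytes per parameter), nominal sizes from the model names.
print(weight_memory_gb(14))  # 28
print(weight_memory_gb(3))   # 6
```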
