mofadel/LLM_to_SLM_Conversion

Agent Subtask Swap with Hugging Face Models

This repository contains the code and evaluation framework for studying the tool-call generation subtask in agentic workflows. It compares a larger Hugging Face model (LLM baseline) against a smaller one (SLM replacement) to measure trade-offs in quality, latency, and cost.

Overview

For each request, the model under test (LLM or SLM) must:

  1. Choosing exactly one tool from a fixed schema.
  2. Emitting a machine-readable JSON tool call.
  3. Providing the required arguments for that tool.
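The three requirements above can be checked mechanically. The following is a minimal sketch, not the repository's parser: the schema contents, the `tool` and `arguments` field names, and the `parse_tool_call` helper are all assumptions made for illustration.

```python
import json

# Hypothetical fixed schema: tool name -> required argument names.
TOOL_SCHEMA = {
    "get_weather": {"city"},
    "convert_currency": {"amount", "from_currency", "to_currency"},
}


def parse_tool_call(raw_output: str) -> dict:
    """Parse a model's raw output and validate it against TOOL_SCHEMA."""
    # Requirement 2: the output must be machine-readable JSON.
    # json.loads raises json.JSONDecodeError (a ValueError) otherwise.
    call = json.loads(raw_output)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    # Requirement 1: exactly one tool, taken from the fixed schema.
    tool = call.get("tool")
    if tool not in TOOL_SCHEMA:
        raise ValueError(f"unknown tool: {tool!r}")

    # Requirement 3: every required argument must be present.
    arguments = call.get("arguments")
    if not isinstance(arguments, dict):
        raise ValueError("'arguments' must be a JSON object")
    missing = TOOL_SCHEMA[tool] - set(arguments)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")

    return {"tool": tool, "arguments": arguments}
```

A call such as `parse_tool_call('{"tool": "get_weather", "arguments": {"city": "Paris"}}')` returns the validated dictionary, while malformed JSON or an unknown tool raises `ValueError`.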

The default comparison uses:

  • LLM: Qwen/Qwen2.5-14B-Instruct
  • SLM: Qwen/Qwen2.5-3B-Instruct

Repository Structure

submission/
  configs/           # Experiment configuration YAMLs
  data/              # Generated benchmark data
  scripts/           # Helper scripts for data generation and evaluation
  src/               # Core source code (agent_subtask_swap_hf)
  tests/             # Unit tests for parsing and execution
  requirements.txt   # Project dependencies
  setup.py           # Installation script

Installation

  1. Create a virtual environment (optional but recommended).
  2. Install dependencies:
    pip install -r requirements.txt
  3. Install the package in editable mode:
    pip install -e . --no-build-isolation

Usage

1. Generate Data

If you wish to regenerate the synthetic benchmark data:

python -m agent_subtask_swap_hf.cli make-data

This populates data/generated/ with train.jsonl and test.jsonl.
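To inspect the generated files, a generic JSONL reader is enough. This sketch assumes only that each non-empty line holds one JSON object. The `load_jsonl` helper is hypothetical, and the record fields are not documented here, so the code does not rely on any.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return a list with one parsed JSON value per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `load_jsonl("data/generated/test.jsonl")` would return the test split as a list of dictionaries.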

2. Run Evaluation

To run the full evaluation pipeline using the Hugging Face models specified in the config:

python -m agent_subtask_swap_hf.cli evaluate-config --config configs/qwen_14b_vs_3b.yaml

The evaluation will:

  • Load the test split.
  • Run predictions using both the LLM and SLM.
  • Parse and execute the predicted tool calls.
  • Calculate metrics (accuracy, latency, cost proxy).
  • Save results and plots to the outputs/ directory.
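The predict, parse, and measure steps above can be sketched as a single loop. This is an illustration under stated assumptions, not the package's implementation: `predict` stands in for any callable that maps a prompt to raw model text, and the `prompt`, `expected_tool`, and `tool` field names are hypothetical.

```python
import json
import time


def evaluate(examples, predict):
    """Run `predict` on each example and record per-example outcomes."""
    rows = []
    for example in examples:
        # Time only the generation call.
        start = time.perf_counter()
        raw = predict(example["prompt"])
        latency_s = time.perf_counter() - start

        # Parse the prediction; invalid JSON counts as a failure.
        try:
            call = json.loads(raw)
            valid_json = isinstance(call, dict)
        except json.JSONDecodeError:
            call, valid_json = None, False

        tool_correct = bool(
            valid_json and call.get("tool") == example["expected_tool"]
        )
        rows.append(
            {
                "valid_json": valid_json,
                "tool_correct": tool_correct,
                "latency_s": latency_s,
            }
        )
    return rows
```

Running the same loop once per model yields two comparable lists of rows, which can then be aggregated into the reported metrics.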

Metrics

The framework reports:

  • Tool Accuracy: Correctness of the tool selection.
  • Argument Recall/Match: Accuracy of the extracted arguments.
  • JSON Validity: Rate at which the model produced valid JSON.
  • Execution Success: Rate at which the tool call was actually executable.
  • Latency: Time taken for model generation.
  • Cost Proxy: A parameter-aware metric (params_billion * total_tokens / 1000).
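The cost proxy formula above is simple enough to work through by hand. The sketch below implements it as stated and adds a hypothetical `rate` helper for the rate-style metrics. The token count and the nominal 14B and 3B parameter sizes are illustrative assumptions, not measured values.

```python
def cost_proxy(params_billion: float, total_tokens: int) -> float:
    """Cost proxy as defined above: params_billion * total_tokens / 1000."""
    return params_billion * total_tokens / 1000


def rate(flags) -> float:
    """Fraction of True values, e.g. for JSON validity or tool accuracy."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


# Illustrative numbers only: 500 total tokens on each model.
llm_cost = cost_proxy(14, 500)  # 7.0
slm_cost = cost_proxy(3, 500)   # 1.5
print(f"LLM proxy: {llm_cost}, SLM proxy: {slm_cost}")
print(f"Validity rate: {rate([True, True, False, True])}")  # 0.75
```

Because the proxy is linear in both factors, the ratio between two models at equal token counts is just the ratio of their parameter counts.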

Notes

  • This repository is designed for local execution with transformers.
  • Ensure your hardware can accommodate the models specified in the configuration (e.g., the 14B Qwen model needs roughly 28 GB of VRAM for its weights alone in 16-bit precision).
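A back-of-the-envelope estimate of weight memory helps when checking whether a model fits. This sketch assumes memory is approximately parameter count times bytes per parameter, uses the nominal 14B and 3B sizes, and ignores activations, KV cache, and framework overhead, so real usage will be higher. The `weight_memory_gb` helper is hypothetical.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


# Nominal sizes; precision options are given as bytes per parameter.
for name, size in [("14B", 14), ("3B", 3)]:
    for label, nbytes in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
        print(f"{name} @ {label}: ~{weight_memory_gb(size, nbytes)} GB")
```

Treat the output as a lower bound when sizing hardware.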
