
A tool for automated testing of translations of planning problems using the Google API.


esadorinha/HTN-LLM-testing-tool


HTN LLM Testing Tool

Testing chatbots' ability to provide assistance with HTN planning problems

This repository provides a simple interface for obtaining chatbot responses to a restricted set of planning problems, as well as analyzing the quality of these responses using the PANDA planning engine.

Setting up the environment

On Linux/MacOS:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

On Windows:

python3 -m venv venv
venv\Scripts\activate
pip install -r requirements.txt

Features

The analysis pipeline requires the user to provide a set of problems represented in PDDL, along with the domain name and the corresponding domain definition files in both PDDL and HTN. Each problem is converted into a prompt for a given LLM, which is instructed to produce a “translation” of the problem into the HTN language (HDDL). All LLM responses are parsed and submitted to the PANDA planning engine. The results are analyzed and stored in the results folder.

For example, if the input folder contains 10 PDDL problems, the results folder will include 10 subfolders (one for each problem), containing the prompt used, the parsed LLM output, and all PANDA-related files. File names are dynamic: {problem_name}.prompt and {problem_name}.hddl.
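As an illustration of the layout described above, the following sketch builds the per-problem output paths (the helper name is hypothetical; only the {problem_name}.prompt / {problem_name}.hddl naming scheme comes from this README):

```python
from pathlib import Path

def result_paths(results_dir: str, problem_name: str) -> dict:
    """Build the per-problem output paths following the
    {problem_name}.prompt / {problem_name}.hddl naming scheme."""
    sub = Path(results_dir) / problem_name
    return {
        "prompt": sub / f"{problem_name}.prompt",  # prompt sent to the LLM
        "hddl": sub / f"{problem_name}.hddl",      # parsed LLM response
    }
```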

Prompts

The interface allows the user to set up a fixed prompt structure that will be applied to all problems in the provided set. For instance, if there are 10 PDDL problems, 10 prompts will be generated, each following the same structure defined by the arguments passed to main.py.

The core directive of every prompt is: “Provide a Hierarchical Task Network representation for a given PDDL problem.” Each prompt must include at least three files for context, so every application call must contain:

  • the name of the domain (--domain_name);
  • the PDDL domain definition filename (--domain_pddl);
  • the corresponding HTN domain definition filename (--domain_htn);
  • the path to the folder that contains PDDL problems for that domain (--prob_path).

From the problem folder, this application only takes files whose names start with 'p' and end with '.pddl' (to adjust this, see the build_prompts() function in main.py, lines 111–125).
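The selection rule can be sketched as follows (a minimal stand-in for illustration, not the actual build_prompts() code):

```python
from pathlib import Path

def select_problem_files(prob_path: str) -> list:
    """Keep only files whose names start with 'p' and end with '.pddl',
    mirroring the filter described above."""
    return sorted(
        f.name
        for f in Path(prob_path).iterdir()
        if f.is_file() and f.name.startswith("p") and f.name.endswith(".pddl")
    )
```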

Additional context can also be included with optional arguments:

  • a role for the model (using '--add_role true');
  • a constraint to provide an initial task network (using '--add_net_constraint true');
  • a constraint to add a goal to the translated problem (using '--add_goal_constraint true');
  • a concern about PANDA's case-sensitivity (using '--add_concern_cs true');
  • a concern about the correctness of predicate requirements (using '--add_concern_pr true').

Prompts may also include two types of examples:

  • a syntax example (an HTN problem filename only, using --example_htn),
  • or a full translation example (both HTN and its corresponding PDDL problem, using --example_pddl and --example_htn).

Providing only a PDDL example is not supported, since it is not useful on its own.

Finally, users may opt to include a partial translation of the problem in the prompt. This is generated automatically, assuming consistent predicate/type names between the PDDL and HTN domains. If used with problems outside the benchmark set, the user must ensure all PDDL predicate names correspond to either predicates or type names in the HTN domain. This feature is enabled with --reduced_query true. In this mode, fragment-related options are ignored, as the reduced query uses a fixed template with a role, explicit constraints, and a case-sensitivity warning.

For more details, run:

python3 src/main.py --help

Models

The user can choose between different LLMs:

  • ChatGPT (model 'gpt-4o-mini');
  • DeepSeek (model 'deepseek/deepseek-r1:free');
  • Gemini (model 'gemini-2.5-flash-lite').

Model definitions are stored in src/models/. Each API request uses a key retrieved from an environment variable:

  • 'gpt.py' searches for OPENAI_API_KEY;
  • 'deepseek.py' searches for DEEPSEEK_API_KEY;
  • 'gemini.py' searches for GEMINI_API_KEY.

The default model is Gemini. The user can specify a different one with --llm_name (options: gpt, gemini, deepseek). Make sure to define the corresponding environment variable before execution.
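The key lookup can be sketched as below (a hypothetical helper; only the environment-variable names and the default model are taken from this README):

```python
import os

# Environment variable expected for each --llm_name option (per this README).
ENV_KEYS = {
    "gpt": "OPENAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

def get_api_key(llm_name: str = "gemini") -> str:
    """Return the API key for the chosen model, failing early if it is unset."""
    var = ENV_KEYS[llm_name]
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable before execution.")
    return key
```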

Benchmarks and examples

The benchmarks folder contains a collection of PDDL problems organized by domain. Each domain folder also includes both PDDL and HTN domain definitions, as well as two example problems (PDDL and its HTN translation). The problems were adapted from an unofficial collection of IPC benchmark instances.

Examples of commands that should work on a Linux machine after the setup (and after transferring the domain and example files from 'benchmarks' to the main folder):

  • Minimal example:
python3 "{path_to_repository_folder}/src/main.py" --domain_name 'blocksworld' --domain_pddl "domain.pddl" --domain_htn "domain.hddl" --prob_path "{path_to_repository_folder}/benchmarks/Blocksworld-GTOHP"
  • With translation example:
python3 "{path_to_repository_folder}/src/main.py" --domain_name 'blocksworld' --domain_pddl "domain.pddl" --domain_htn "domain.hddl" --example_pddl 'example.pddl' --example_htn 'example.hddl' --prob_path "{path_to_repository_folder}/benchmarks/Blocksworld-GTOHP"
  • With example and no role:
python3 "{path_to_repository_folder}/src/main.py" --domain_name 'blocksworld' --domain_pddl "domain.pddl" --domain_htn "domain.hddl" --example_pddl 'example.pddl' --example_htn 'example.hddl' --prob_path "{path_to_repository_folder}/benchmarks/Blocksworld-GTOHP" --add_role false
  • Reduced query with syntax example:
python3 "{path_to_repository_folder}/src/main.py" --domain_name 'blocksworld' --domain_pddl "domain.pddl" --domain_htn "domain.hddl" --example_htn 'example.hddl' --prob_path "{path_to_repository_folder}/benchmarks/Blocksworld-GTOHP" --reduced_query true

Experiments, Tests, and Extras

The repository also contains sets of experiments conducted for research purposes. Experiments are organized by domain, command type, and iteration, and were executed using the scripts in the extras folder. This folder also includes verification tests and supporting files related to the research (such as prompt structure history).

Verifying results through PANDA engine

Saved results from LLM calls are automatically tested using the PANDA engine executables; for this purpose, the repository includes third-party executables for HTN planning.
These binaries are distributed under the BSD 3-Clause License.
