Welcome to JaxAHT! This is a Jax-based benchmark repository for Ad Hoc Teamwork. For a quick introduction to the benchmark, please see our Colab tutorial notebook.
If you find this repository useful for your research, please cite:

```bibtex
@misc{jaxaht2025,
  author = {Learning Agents Research Group},
  title = {JaxAHT},
  year = {2025},
  month = {September},
  note = {Version 1.0.0},
  url = {https://github.com/LARG/jax-aht},
}
```

The JaxAHT library is designed to (1) facilitate research across the entire lifecycle of ad hoc teamwork, and (2) ease the evaluation of ad hoc agents (ego agents) on commonly used AHT benchmark tasks. As such, the benchmark includes:
- AHT algorithms
- Environments
- Evaluation teammates for each environment
The library includes a variety of MARL/AHT algorithms, as AHT research often requires orchestrating multiple types of algorithms:
- An ego agent training algorithm
- A teammate generation algorithm
- A multi-agent reinforcement learning (MARL) algorithm
- A single-agent reinforcement learning method to act as a best response operator
Research focused on one type of algorithm can require other types for training/evaluation purposes. For example, to evaluate a teammate generation method, an ego agent training method is necessary. Similarly, an ego agent training method requires a set of training teammates, which may be generated by a teammate generation algorithm or a MARL algorithm.
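To make this orchestration concrete, the sketch below chains the three stages end to end. Every function in it is a stand-in stub invented for illustration; none of these names are actual jax-aht APIs.

```python
"""Illustrative sketch only: stub functions showing how teammate generation,
ego agent training, and heldout evaluation chain together in an AHT workflow.
None of these names correspond to actual jax-aht APIs."""
from typing import Dict, List


def generate_teammates(num_teammates: int) -> List[str]:
    # Stub: a teammate generation or MARL algorithm would return trained policies.
    return [f"training_teammate_{i}" for i in range(num_teammates)]


def train_ego_agent(training_teammates: List[str]) -> str:
    # Stub: an ego agent training algorithm learns against the training population.
    return f"ego_agent_trained_vs_{len(training_teammates)}_teammates"


def evaluate(ego_agent: str, heldout_teammates: List[str]) -> Dict[str, float]:
    # Stub: heldout evaluation pairs the ego agent with unseen evaluation teammates.
    return {partner: 0.0 for partner in heldout_teammates}


if __name__ == "__main__":
    population = generate_teammates(num_teammates=8)
    ego = train_ego_agent(population)
    print(evaluate(ego, heldout_teammates=["heuristic_a", "heuristic_b"]))
```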
This codebase aims to provide a unified interface for the above AHT procedures, to enable research on both individual procedures and combinations thereof. On the other hand, to facilitate fast iteration, we take inspiration from the single-file model used by projects such as JaxMARL and CleanRL and minimally modularize the code. Algorithms are largely implemented in a single-file format to enable researchers to easily understand and build upon existing methods. However, the agent interface is shared by all methods, to allow agents trained by one algorithm to easily be used by another algorithm---a common workflow for AHT research.
Our modularization is restricted to environments, agents, and populations, which allows us to cleanly interface the algorithm types above, while placing most of the logic for any single algorithm within a single file.
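As an illustration of what a shared interface enables, here is a minimal sketch of a policy interface in the spirit described above. The names (`AgentPolicy`, `init_hstate`, `get_action`) and the exact signature are assumptions for exposition, not the interface actually defined in `agents/`.

```python
# Minimal sketch of a shared, recurrent-friendly policy interface (illustrative
# names only; not the actual interface defined in agents/).
from typing import Any, NamedTuple

import jax
import jax.numpy as jnp


class AgentPolicy(NamedTuple):
    """Bundle of pure functions so any algorithm can query any trained agent."""
    init_hstate: Any   # () -> initial recurrent state (or None for feedforward policies)
    get_action: Any    # (params, obs, done, hstate, rng) -> (action, new_hstate)


def make_random_policy(num_actions: int) -> AgentPolicy:
    """A trivial policy implementing the interface, useful as a placeholder."""

    def init_hstate():
        return None  # no recurrent state

    def get_action(params, obs, done, hstate, rng):
        action = jax.random.randint(rng, shape=(), minval=0, maxval=num_actions)
        return action, hstate

    return AgentPolicy(init_hstate=init_hstate, get_action=get_action)


if __name__ == "__main__":
    policy = make_random_policy(num_actions=6)
    act, _ = policy.get_action(None, jnp.zeros(10), False, policy.init_hstate(),
                               jax.random.PRNGKey(0))
    print(act)
```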
| Category | Algorithm | Description | Paper |
|---|---|---|---|
| Ego Agent Training | PPO Ego | Trains a PPO agent against a population of homogeneous partner agents. | - |
| | LIAM Ego | Trains a LIAM agent against a population of homogeneous partner agents. | Papoudakis et al. 2021 |
| | MeLIBA Ego | Trains a MeLIBA agent against a population of homogeneous partner agents. | Zintgraf et al. 2022 |
| Teammate Generation | FCP (Fictitious Co-Play) | Generates diverse teammates using varying seeds and checkpoints of IPPO. | Strouse et al. 2021 |
| | BRDiv | Generates diverse teammates using the best response diversity (BRDiv) metric. | Rahman et al. 2022 |
| | LBRDiv | Generates diverse teammates by emulating the minimum coverage set. | Rahman et al. 2024 |
| | CoMeDi | Generates diverse teammates by optimizing mixed-play. | Sarkar et al. 2023 |
| MARL | IPPO | Multi-agent reinforcement learning using independent PPO agents with parameter sharing. | Yu et al. 2022 |
| Open-Ended Training | ROTATE | Open-ended training using cooperative regret maximization. | Wang et al. 2025 |
| | PAIRED | Open-ended training based on the PAIRED algorithm from the unsupervised environment design literature. | Dennis et al. 2020 |
| | Open-Ended Minimax | Open-ended training baseline using minimax return optimization. | - |
| Environment | Source | Description | Variants | Evaluation Teammates |
|---|---|---|---|---|
| Level-Based Foraging (LBF) | Jumanji | Cooperative foraging environment where agents must work together to collect food | 7x7 grid with full observability | ✅ |
| Overcooked-v1 | JaxMARL | Cooperative cooking environment where agents must coordinate to prepare and serve dishes | asymm_advantages, coord_ring, counter_circuit, cramped_room, forced_coord | ✅ |
- Installation Guide
- Getting Started
- Code Overview
- Project Structure
- License
- See Also
Follow the instructions at `docs/install_instructions.md` to install the necessary libraries.
Evaluating trained agents against the heldout evaluation set requires downloading the evaluation agents. We also provide the best returns achieved against each evaluation agent in our experiments. Directories containing both can be obtained by running the provided data download script:
```bash
python download_eval_data.py
```

For a quick introduction to the benchmark, please see our Colab tutorial notebook.
Algorithms are sorted into four main directories in this codebase.
- `ego_agent_training/`: contains algorithms for training an AHT agent against a pre-specified set of teammates
- `marl/`: contains MARL algorithms for training a team of agents from scratch
- `open_ended_training/`: contains open-ended AHT algorithms
- `teammate_generation/`: contains teammate generation algorithms
Each directory contains a `run.py` that serves as its entry point.
For open-ended and teammate generation methods, we provide an `experiments.sh` script that runs the algorithm specified at the top of `experiments.sh` on the LBF and Overcooked tasks.
JaxMARL follows a single-script training paradigm, which enables jit-compiling the entire RL training loop and makes it simple for researchers to modify algorithms. We follow a similar paradigm, but use agent and population interfaces, along with some common utility functions to avoid code duplication.
The code makes the following assumptions:
- Agent policies are assumed to handle "done" signals and reset internally.
- Environments have homogeneous agents and discrete actions.
- Environments are assumed to "auto-reset": when an episode is done, the step function should check for this and reset the environment if needed.
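The auto-reset convention can be implemented with a standard JAX pattern, sketched below. The `reset`/`step` signatures are assumptions chosen for illustration and are not the exact wrappers in `envs/`.

```python
# Generic auto-reset sketch (assumed reset/step signatures; not the envs/ wrappers).
import jax
import jax.numpy as jnp


class AutoResetWrapper:
    """Resets the underlying environment inside step() whenever done is True."""

    def __init__(self, env):
        self._env = env

    def reset(self, key):
        return self._env.reset(key)  # assumed to return (obs, state)

    def step(self, key, state, action):
        obs, state, reward, done, info = self._env.step(key, state, action)
        # On episode end, swap in a freshly reset observation/state so callers
        # never need to call reset() themselves (done is assumed to be a scalar).
        reset_obs, reset_state = self._env.reset(key)
        obs = jax.tree_util.tree_map(lambda r, o: jnp.where(done, r, o), reset_obs, obs)
        state = jax.tree_util.tree_map(lambda r, s: jnp.where(done, r, s), reset_state, state)
        return obs, state, reward, done, info
```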
Gotchas
- The metric `returned_episode_returns` is automatically tracked and logged by the LogWrapper. It corresponds to summing the reward returned by `env.step()` over an episode. Thus, if an environment returns a shaped reward, this metric corresponds to the shaped return.
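The bookkeeping behind this metric follows a standard pattern, sketched generically below. This is not the LogWrapper source, only an illustration of the summation described above.

```python
# Generic sketch of per-episode return tracking; not the actual LogWrapper code.
import jax.numpy as jnp


def update_return_stats(running_return, returned_episode_return, reward, done):
    """Sum env.step() rewards within an episode; freeze the sum when done is True."""
    new_running = running_return + reward  # shaped rewards are summed as-is
    returned = jnp.where(done, new_running, returned_episode_return)
    running = jnp.where(done, 0.0, new_running)  # restart the accumulator after an episode
    return running, returned
```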
The project structure is described here. Additional notes about some folders are provided below.
- `agents/`: Contains agent-related implementations.
- `common/`: Shared utilities and common code.
- `envs/`: Environment implementations and wrappers.
- `evaluation/`: Evaluation and visualization scripts.
- `ego_agent_training/`: All ego agent learning implementations (PPO, LIAM, and MeLIBA).
- `marl/`: MARL algorithm implementations. Currently only supports IPPO.
- `open_ended_training/`: Open-ended learning methods (ROTATE, PAIRED, Minimax Return).
- `teammate_generation/`: Teammate generation algorithms (BRDiv, FCP, CoMeDi).
- `tests/`: Test scripts used during development.
The algorithms in this codebase are divided into four categories, and each is stored in its own directory:
- MARL algorithms, located at `marl/`
- AHT (Ad Hoc Teamwork) algorithms
  - Ego agent training methods, located at `ego_agent_training/`
  - Two-stage teammate generation methods, located at `teammate_generation/`
  - Open-ended AHT methods, located at `open_ended_training/`
Note that algorithms from the `marl/` and `ego_agent_training/` categories are called as subroutines in the other two categories.
For example:
- FCP uses the `marl/ippo` implementation as the teammate generation subroutine.
- Two-stage teammate generation methods use `ego_agent_training/ppo_ego.py` as the ego agent training routine.
Within each directory, there is a `run.py` that serves as the entry point for all algorithms implemented in that directory.
We use Hydra to manage algorithm and task configurations.
In each directory above, there is a `configs/` directory with the following subdirectories:
- `configs/algorithm/`: Contains algorithm configs for each algorithm and task combination.
- `configs/hydra/`: Contains Hydra settings.
- `configs/task/`: Contains environment configs necessary to specify a task.
Given an algorithm and task, Hydra retrieves the appropriate configs from the subdirectories above
and merges them into the master config found in `configs/base_config_<method_type>.yaml` (e.g., `configs/base_config_teammate_generation.yaml`).
The algorithm and task may be manually specified by modifying the master config, or by using
Hydra's command line argument support.
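Concretely, a Hydra-based entry point has roughly the following shape. This is a minimal sketch assuming the master config name mentioned above; the actual contents of each `run.py` may differ.

```python
# Minimal Hydra entry-point sketch; the real run.py scripts may differ.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="configs",
            config_name="base_config_teammate_generation")
def main(cfg: DictConfig) -> None:
    # cfg holds the merged task + algorithm config groups, including any
    # command-line overrides (e.g., algorithm.TOTAL_TIMESTEPS=1e5).
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```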
For example, the following command runs Fictitious Co-Play on the Level-Based Foraging (LBF) task:
```bash
python teammate_generation/run.py task=lbf algorithm=fcp/lbf
```

Note that Hydra allows the user to modify any config value specified in the algorithm/task config files from the command line. For example, to set the number of training interactions for FCP, use the following command:
```bash
python teammate_generation/run.py task=lbf algorithm=fcp/lbf algorithm.TOTAL_TIMESTEPS=1e5
```

By default, results are logged to a local `results/` directory, as specified within the `configs/hydra/hydra_simple.yaml` file for each method type, and to the Weights & Biases (wandb) project specified in the master config.
All metrics are logged using wandb and can be viewed using the wandb web interface.
Please see the wandb documentation for general information about wandb.
Logging settings in each master config allow the user to enable or disable logging.
The `agents/` directory contains:
- Heuristic agents for Overcooked and LBF environments.
- Various actor-critic architectures.
- Population and agent interfaces for RL agents.
You can test the Overcooked heuristic agents by running `python tests/test_overcooked_agents.py`, and the LBF heuristic agents by running `python tests/test_lbf_agents.py`.
Certain workflows within this project (namely, ego agent training and heldout evaluation) require teammate policies as inputs. The user provides these teammate policies by specifying a partner config, which may point to heuristic or RL-based partner policies.
By default, the heldout evaluation workflow uses the downloaded evaluation teammates, and the corresponding partner config is specified at `evaluation/configs/global_heldout_settings.yaml`; thus, no user intervention is necessary to perform heldout evaluations.
However, the ego agent training workflow requires the user to specify a partner agent config. A quick example of how to run an ego agent training algorithm with a particular partner config is provided in our tutorial notebook. More details on how to specify the partner config are provided at the top of the ego agent training scripts.
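To give a sense of the idea, the snippet below builds a partner config programmatically with OmegaConf. The keys and values are entirely hypothetical and do not match the repository's actual schema; consult the ego agent training scripts for the real format.

```python
# Hypothetical partner config for illustration only; the keys below are NOT the
# schema used by jax-aht (see the ego agent training scripts for the real format).
from omegaconf import OmegaConf

partner_cfg = OmegaConf.create({
    "partners": [
        {"type": "heuristic", "name": "example_heuristic"},         # hypothetical
        {"type": "rl", "checkpoint_path": "path/to/partner_ckpt"},  # hypothetical
    ]
})
print(OmegaConf.to_yaml(partner_cfg))
```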
This codebase uses the Jumanji LBF implementation. The wrapper for the Jumanji LBF environment is stored in the `envs/` directory, at `envs/lbf/lbf_wrapper.py`. A corresponding test script is stored at `tests/test_lbf_wrapper.py`.
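As background, adapting a Jumanji environment generally means translating its `(state, TimeStep)` API into a JaxMARL-style tuple. The skeleton below only illustrates that pattern under assumed method signatures; it is not the code in `envs/lbf/lbf_wrapper.py`.

```python
# Illustrative skeleton of adapting Jumanji's (state, TimeStep) API; not the
# actual envs/lbf/lbf_wrapper.py implementation.
class JumanjiToMARLWrapper:
    def __init__(self, env):
        # env is assumed to be a Jumanji environment, e.g. Level-Based Foraging.
        self._env = env

    def reset(self, key):
        state, timestep = self._env.reset(key)
        return timestep.observation, state

    def step(self, key, state, action):
        state, timestep = self._env.step(state, action)
        done = timestep.last()  # Jumanji TimeSteps mark terminal steps via last()
        return timestep.observation, state, timestep.reward, done, {}
```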
We made some modifications to the JaxMARL Overcooked environment to improve the functionality and ensure environments are solvable.
- Initialization randomization: Previously, setting `random_reset` would lead to random initial agent positions and randomized initial object states (e.g., a pot might be initialized with onions already in it, agents might be initialized holding plates, etc.). We separate the functionality of the `random_reset` argument into two arguments, `random_reset` and `random_obj_state`, where `random_reset` only controls the initial positions of the two agents.
- Agent initial positions: Previously, in a map with disconnected components, it was possible for both agents to be spawned in the same component, making it impossible to solve the task. The Overcooked-v1 environment initializes agents such that one is always spawned on each side of the map.
This project is licensed under the MIT License - see the LICENSE file for details.
This project was inspired by the following Jax-based RL repositories. Please check them out!
- JaxMARL: a library with Jax-based MARL algorithms and environments
- Jumanji: a library with Jax implementations of several MARL environments
- Minimax: a library with Jax implementations of single-agent UED algorithms
- ROTATE: code for the ROTATE paper (Wang et al. 2025), on which this benchmark is built.
