augment-atoms

augment-atoms is a tool for augmenting datasets of atomic configurations via a model-driven, GPU-accelerated, rattle-relax-repeat procedure.

For each structure in the starting dataset, augment-atoms uses the provided potential energy surface (PES) model to generate a "family tree" of new structures. In the beginning, the tree consists of the single starting structure. To generate a new "child" structure, augment-atoms:

  1. selects a "parent" structure from the tree,
  2. rattles the atomic positions and unit cell,
  3. relaxes using the PES model to get a new structure,
  4. labels the child structure with the PES model, and
  5. inserts the child structure into the tree.

For precise details of each of these steps, see the Details section below.

Installation

pip install augment-atoms

This installs the augment-atoms command line tool, which requires Python 3.9+ (see pyproject.toml for the dependencies). Using uv is recommended: it installs augment-atoms with the correct dependencies in under 20 seconds starting from scratch.

There are no specific hardware requirements for augment-atoms. If a GPU is available, and the PES model supports it, the GPU will be used to accelerate structure generation. augment-atoms has been tested on both Linux and macOS.

Usage

augment-atoms config.yaml

where config.yaml is a YAML file containing the following:

data:
  # an ase-readable file containing the starting structures
  input: input.xyz 

  # an ase-writeable path to append the new structures to
  output: output.xyz

config:
  # number of augmentations per starting structure
  n_per_structure: 10
  
  # the temperature
  T: 300  # units are Kelvin

  # the explore-vs-exploit trade-off (see below)
  beta: 0.5

  # the range of values from which to sample a 
  # standard deviation to rattle with at each step
  sigma_range: [0.01, 0.1]  # units are Å

  # the random seed to use (for reproducibility)
  seed: 42

  # the standard deviation of the cell perturbation
  # if null, no cell perturbation is applied
  cell_sigma: null  # units are Å
  
  # the units of the energies generated by the PES model
  units: eV

  # the maximum force magnitude to relax to
  max_force: 30  # units are (energy / Å)

  # the minimum separation between atoms to consider
  min_separation: 0.5  # units are Å

  # the maximum number of relaxations to perform per iteration
  max_relax_steps: 20

  # the threshold for considering a structure too similar to the existing pool
  similarity_threshold: 0.1  # units are Å

model:
  # the calculator to use to generate the PES model
  calculator: +lennard_jones()

In-built options for the calculator are:

  • a Lennard-Jones calculator:

model:
  calculator: +lennard_jones()

  • any model from the graph-pes package. If a GPU is available, it will be used to accelerate the PES model.

model:
  calculator:
    +graph_pes_calculator:
      path: path/to/model.pt

Alternatively, you are free to point to any instance of an ase.Calculator object. If you have my_function in my_file.py that returns an ase.Calculator object, you can use it as follows:

model:
  calculator: +my_file.my_function()

Details

1. Selecting a parent structure

To choose a new parent structure, we randomly sample from all structures in the tree, such that structure $i$ has a probability of being picked given by

$$\mathbb{P}_i = \beta \cdot \frac{e^{-E_i / kT}}{\sum_j e^{-E_j / kT}} + (1-\beta) \cdot \frac{G_i}{\sum_j G_j}$$

where $E_i$ is the energy of structure $i$, $G_i \in \mathbb{Z}^+$ is the "generation" of the structure, $k$ is the Boltzmann constant, $T$ is the temperature and $\beta \in [0, 1]$. Small values of $\beta$ favour the sampling of "younger" structures in the family tree, and hence a greater degree of exploration. Large values of $\beta$ favour the sampling of lower energy structures, and hence a denser sampling of the PES around energy minima.
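
As a rough illustration of this mixture, the sketch below computes the selection probabilities using only the standard library. The function name, the example tree, the value of the Boltzmann constant in eV/K (which assumes energies in eV), and the shift by the minimum energy are choices made for this example rather than details taken from the augment-atoms source.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K (assumes energies in eV)


def selection_probabilities(energies, generations, temperature, beta):
    """Return P_i = beta * Boltzmann_i + (1 - beta) * G_i / sum_j G_j."""
    kT = K_B * temperature
    # Shifting by the minimum energy cancels on normalisation and avoids overflow.
    e_min = min(energies)
    boltzmann = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(boltzmann)
    g_total = sum(generations)
    return [
        beta * b / z + (1.0 - beta) * g / g_total
        for b, g in zip(boltzmann, generations)
    ]


# Hypothetical tree: three structures with made-up energies and generations.
energies = [-10.00, -10.02, -9.95]
generations = [1, 2, 3]
probs = selection_probabilities(energies, generations, temperature=300, beta=0.5)

# Draw one parent index according to these probabilities.
rng = random.Random(42)
parent_index = rng.choices(range(len(probs)), weights=probs)[0]
```

Setting beta to 1 in this sketch reduces it to pure Boltzmann sampling, while beta equal to 0 weights structures by generation alone.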

2. Rattling the atomic positions and unit cell

To create a "child" from this parent structure, we perform the following transformation:

$$\begin{aligned} R^\prime &\leftarrow [(A + I) \times R] + B \\ C^\prime &\leftarrow (A + I) \times C_0 \end{aligned}$$

where

  • $R$ are the atomic positions
  • $C_0$ is the unit cell of the original seed structure
  • $A \in \mathbb{R}^{3\times 3}$ has entries sampled from $\mathcal{N}(0, \sigma_{A})$ where $\sigma_{A} \in [0, \rm{cell \_ sigma}]$
  • $B \in \mathbb{R}^{N \times 3}$ has entries sampled from $\mathcal{N}(0, \sigma_{B})$ where $\sigma_{B} \in \rm{sigma \_ range}$

In the case of isolated structures, we only rattle the positions (i.e. $A = 0^{3 \times 3}$).
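
The transformation above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's implementation: the function name is hypothetical, positions and lattice vectors are plain tuples, $(A + I)$ is applied to each position and each lattice vector as a column vector, and $\sigma_A$ and $\sigma_B$ are passed in already sampled. The cell argument stands for $C_0$, the unit cell of the original seed structure.

```python
import random


def rattle(positions, cell, sigma_a, sigma_b, rng):
    """Apply R' = (A + I) r + b to every atom and C' = (A + I) c to every lattice vector."""
    # (A + I): identity plus Gaussian noise of standard deviation sigma_a.
    # Pass sigma_a = 0.0 for isolated structures, so that A = 0.
    strain = [
        [(1.0 if i == j else 0.0) + rng.gauss(0.0, sigma_a) for j in range(3)]
        for i in range(3)
    ]

    def apply(vector):
        return tuple(sum(strain[i][j] * vector[j] for j in range(3)) for i in range(3))

    # B: an independent Gaussian displacement for every Cartesian component.
    new_positions = [
        tuple(x + rng.gauss(0.0, sigma_b) for x in apply(r)) for r in positions
    ]
    new_cell = [apply(c) for c in cell]
    return new_positions, new_cell


rng = random.Random(42)
positions = [(0.0, 0.0, 0.0), (1.5, 1.5, 1.5)]
cell = [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 3.0)]
new_positions, new_cell = rattle(positions, cell, sigma_a=0.01, sigma_b=0.05, rng=rng)
```

With both standard deviations set to zero, the sketch returns the input positions and cell unchanged.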

3. Relaxing the rattled child structure

To relax the rattled child structure, we use energies and forces generated by the PES model using a scheme inspired by the Robbins-Monro algorithm.

Step $x$ of this relaxation involves updating the atomic positions according to:

$$R^\prime \leftarrow R + \frac{\sigma_B}{x} \cdot \frac{F}{||F||}$$

where $F/||F||$ is the unit vector along the force on each atom. We perform up to $M$ relaxation steps (config.max_relax_steps), but stop early with probability $\min(0.25, e^{-\Delta E / kT})$, provided that the maximum force magnitude is less than config.max_force. Here $\Delta E$ is the energy difference between the relaxed child and its starting parent structure. We reject all final structures that have any pair of atoms closer than config.min_separation Å.
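
A single relaxation step and the early-stopping probability can be sketched as below. The function names are hypothetical, forces are normalised per atom, the Boltzmann constant is given in eV/K (assuming energies in eV), and the handling of zero forces and of very negative $\Delta E$ is an assumption added to keep the example well-defined.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K (assumes energies in eV)


def relax_step(positions, forces, sigma_b, step):
    """Move each atom a distance sigma_b / step along its own force direction."""
    step_length = sigma_b / step
    new_positions = []
    for r, f in zip(positions, forces):
        norm = math.sqrt(sum(c * c for c in f))
        if norm == 0.0:
            # No force on this atom, so there is no direction to move along.
            new_positions.append(tuple(r))
            continue
        new_positions.append(
            tuple(ri + step_length * fi / norm for ri, fi in zip(r, f))
        )
    return new_positions


def early_stop_probability(delta_e, temperature):
    """Return min(0.25, exp(-delta_e / kT)), avoiding overflow for very negative delta_e."""
    exponent = -delta_e / (K_B * temperature)
    if exponent >= math.log(0.25):
        return 0.25
    return math.exp(exponent)


# One atom with force (3, 0, 4): unit direction (0.6, 0, 0.8), step length 0.1 / 2.
moved = relax_step([(0.0, 0.0, 0.0)], [(3.0, 0.0, 4.0)], sigma_b=0.1, step=2)
```

Because the step length shrinks as $\sigma_B / x$, later steps in this sketch move the atoms progressively less.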

Demo

This demo uses structures and a model taken from this repo's sister repository, found here.

We include a stand-alone demo usage in the demo directory. This takes 3 water structures as input and uses a PaiNN model to generate and label 27 new structures, for a total of 30 structures.

The demo directory has the following files:

  • input.xyz contains 3 starting water structures
  • config.yaml contains the configuration for the demo
  • model.pt is a PaiNN model trained on water structures from ...
  • output.xyz is the augmented dataset output.

To run this demo yourself:

# clone the repository
git clone https://github.com/jla-gardner/augment-atoms.git
cd augment-atoms/demo
# remove the output file if it exists
rm -rf output.xyz
# run the demo
augment-atoms config.yaml

This entire script took under 10 seconds on my M1 MacBook Pro.

Citation

If you use augment-atoms in your research, please cite the following pre-print:

@misc{Gardner-25-06,
  title = {Distillation of Atomistic Foundation Models across Architectures and Chemical Domains},
  author = {Gardner, John L. A. and du Toit, Daniel F. Thomas and Mahmoud, Chiheb Ben and Beaulieu, Zo{\'e} Faure and Juraskova, Veronika and Pa{\c s}ca, Laura-Bianca and Rosset, Louise A. M. and Duarte, Fernanda and Martelli, Fausto and Pickard, Chris J. and Deringer, Volker L.},
  year = {2025},
  number = {arXiv:2506.10956},
  doi = {10.48550/arXiv.2506.10956},
}
