
Training on data with different number of atoms. #131

Closed
Arijit-Majumdar opened this issue Feb 7, 2025 · 1 comment

Comments

@Arijit-Majumdar

Hi,
I am trying to train the HIP-NN model on DFT data for water. I have two datasets, one with 24 atoms and the other with 96 atoms. When I trained on the 24-atom data, I converted the positions, forces, energies, and cell sizes into numpy arrays and used the hippynn.databases.Database class to read the arrays, so the position array had shape [num configurations x 24 x 3]. How can I train the model on both the 24-atom and 96-atom datasets? Do I need to store the data in something other than numpy arrays?

Thanks

@lubbersnick
Collaborator

The documentation is here, although it could be made more thorough. We accept pull requests!

We use padding. Every atom-wise array should have shape (n_sys, 96, ...), where 96 is the largest atom count among your systems. The species array should contain zeros wherever a system has no atom.
e.g.

# species for one data point with 1 water, and one with 2 waters
z = [
    [8,1,1,0,0,0],
    [8,1,1,8,1,1],
]

For other atom-based arrays such as forces and positions, the values at entries where z=0 are irrelevant. The padding is removed while the NN and the loss function are evaluated, so these entries don't cost you significant computational time or change the meaning of the metrics; you could pad the whole array to length 300 atoms if you wanted.
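
As a rough illustration of the padding described above, here is a minimal numpy-only sketch that merges a 24-atom and a 96-atom dataset into arrays of shape (n_sys, 96, ...). The helper `pad_atoms`, the array names, and the random placeholder data are all assumptions made for the example; they are not part of hippynn, and the names your database expects may differ.

```python
import numpy as np

def pad_atoms(arr, n_max):
    """Zero-pad axis 1 (the atom axis) of `arr` up to `n_max` entries.

    Hypothetical helper for illustration; not a hippynn function.
    """
    pad_width = [(0, 0)] * arr.ndim
    pad_width[1] = (0, n_max - arr.shape[1])
    return np.pad(arr, pad_width, mode="constant", constant_values=0)

# Placeholder data standing in for the two DFT datasets.
rng = np.random.default_rng(0)
n_small, n_large, n_max = 5, 3, 96

Z_24 = np.tile([8, 1, 1], (n_small, 8))    # species, shape (5, 24)
Z_96 = np.tile([8, 1, 1], (n_large, 32))   # species, shape (3, 96)
R_24 = rng.normal(size=(n_small, 24, 3))   # positions
R_96 = rng.normal(size=(n_large, 96, 3))
F_24 = rng.normal(size=(n_small, 24, 3))   # forces
F_96 = rng.normal(size=(n_large, 96, 3))
E_24 = rng.normal(size=(n_small,))         # energies are per system: no padding
E_96 = rng.normal(size=(n_large,))

# Pad the atom axis of the smaller dataset, then stack along the system axis.
Z = np.concatenate([pad_atoms(Z_24, n_max), Z_96])   # (8, 96)
R = np.concatenate([pad_atoms(R_24, n_max), R_96])   # (8, 96, 3)
F = np.concatenate([pad_atoms(F_24, n_max), F_96])   # (8, 96, 3)
E = np.concatenate([E_24, E_96])                     # (8,)

# Number of real (non-padding) atoms in each system.
n_atoms = (Z != 0).sum(axis=1)
```

Per-system quantities such as the energy and the cell have no atom axis, so they are simply concatenated. The padded positions and forces are set to zero here only for convenience, since their values are irrelevant wherever the species is 0.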

You can also put all the oxygens first, or order the atoms however you like; the atom order will not affect the results. The one exception is finding pairs, which might currently depend on the total array length and/or the position of the padding. With systems of 96 atoms, this should not be a concern. I think that datasets in the ANI format require having the padding at the end of the array, rather than in an arbitrary place.
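
If you want the padding at the end of every system regardless of how the atoms were originally ordered, one possible approach is a stable sort on the "is padding" flag, applied identically to every atom-wise array. The function `padding_last` below is a hypothetical numpy helper written for this example, not something hippynn provides.

```python
import numpy as np

def padding_last(z, *atom_arrays):
    """Reorder atoms so entries with z == 0 come last in each system.

    `z` has shape (n_sys, n_atoms); each array in `atom_arrays` has shape
    (n_sys, n_atoms, ...). The relative order of real atoms is preserved.
    Hypothetical helper for illustration; not a hippynn function.
    """
    # False (real atom) sorts before True (padding); a stable sort keeps
    # the original order within each group.
    order = np.argsort(z == 0, axis=1, kind="stable")
    z_sorted = np.take_along_axis(z, order, axis=1)
    reordered = []
    for a in atom_arrays:
        # Add trailing singleton axes so the index broadcasts over xyz etc.
        idx = order.reshape(order.shape + (1,) * (a.ndim - 2))
        reordered.append(np.take_along_axis(a, idx, axis=1))
    return (z_sorted, *reordered)

# Example: padding scattered through the first system.
z = np.array([
    [0, 8, 1, 0, 1, 0],
    [8, 1, 1, 8, 1, 1],
])
r = np.arange(2 * 6 * 3, dtype=float).reshape(2, 6, 3)

z_new, r_new = padding_last(z, r)
```

Here `z_new[0]` becomes `[8, 1, 1, 0, 0, 0]` and the rows of `r_new[0]` are permuted the same way, while the second system, which has no padding, is left unchanged.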
