@misko misko commented Sep 22, 2025

Refactor and update gputil functions

  • Move as much as possible outside of our custom functions, so the compiler can optimize it
  • Optimize the padding operation
  • Use reduce_scatter, which is faster than all_reduce, in the all_gather backward pass
  • Implement a special all_gather backward for GLOO only, for CPU GP + GP tests
  • Update test cases to use these functions
  • Add an asynchronous version of all_gather in preparation for any overlapped compute + comms implementation
  • Combine two sequential all_reduce calls in the EFS head into one all_reduce call
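
The reduce_scatter item in the list above can be sanity-checked without a process group. The sketch below simulates the collectives in plain Python. The helpers `all_reduce_sum` and `reduce_scatter_sum` are hypothetical stand-ins, not functions from this PR or from torch.distributed, and the sketch assumes every rank contributes an equally sized chunk. It only illustrates that, for the all_gather backward, reduce_scatter yields the same local gradient as all_reduce followed by a slice, while each rank receives just its own chunk.

```python
# Pure-Python simulation of collectives; no torch.distributed involved.
# Each "rank" is represented by one list inside `per_rank`.

def all_reduce_sum(per_rank):
    """Every rank ends up with the elementwise sum over all ranks."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter_sum(per_rank, chunk):
    """Rank r ends up with only chunk r of the elementwise sum."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(per_rank))]


# Forward all_gather: each rank contributed `chunk` values and received
# world * chunk values. In the backward pass, each rank therefore holds
# a gradient for the full gathered vector.
world, chunk = 3, 2
grads = [[float(r * 10 + i) for i in range(world * chunk)] for r in range(world)]

# Option A: all_reduce the full gradient, then keep the local slice.
via_all_reduce = [
    full[r * chunk:(r + 1) * chunk]
    for r, full in enumerate(all_reduce_sum(grads))
]

# Option B: reduce_scatter hands each rank its local slice directly.
via_reduce_scatter = reduce_scatter_sum(grads, chunk)

print(via_all_reduce == via_reduce_scatter)
```

With unequal chunk sizes the chunks would first need padding to a common length, which is presumably where the padding optimization listed above comes in.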

Change layers to always output atom embeddings for the full system. Currently, in GP mode, we output a tensor sized for only the local atoms, which makes it hard to implement anything that overlaps communication and computation.

  • If we keep the current implementation, we cannot chunk the communication and do compute in between.
  • In this new implementation, an additional all_reduce is required to synchronize energy across systems (a systems x 1 float tensor)
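
As a rough illustration of the extra energy synchronization mentioned above, the sketch below again simulates the collective in plain Python. It assumes, purely for illustration, that each rank sums per-atom energy contributions for its own atoms into a per-system vector and that one all_reduce then produces the full per-system energies on every rank. The atom partition, the values, and the helper `all_reduce_sum` are hypothetical and not taken from the PR.

```python
# Pure-Python simulation; no torch.distributed involved.

def all_reduce_sum(per_rank):
    """Every rank ends up with the elementwise sum over all ranks."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


# Hypothetical batch: 5 atoms spread over 2 systems.
system_of_atom = [0, 0, 1, 1, 1]
atom_energy = [1.0, 2.0, 0.5, 0.25, 0.25]
num_systems = 2

# Hypothetical split of the atoms across 2 graph-parallel ranks.
atoms_on_rank = [[0, 2, 4], [1, 3]]

# Each rank accumulates energy only for the atoms it owns.
partial = []
for atoms in atoms_on_rank:
    per_system = [0.0] * num_systems
    for atom in atoms:
        per_system[system_of_atom[atom]] += atom_energy[atom]
    partial.append(per_system)

# One small all_reduce (num_systems values) synchronizes the energies.
energies = all_reduce_sum(partial)
print(energies)
```

The payload of this all_reduce scales with the number of systems rather than the number of atoms, so it is small compared with gathering embeddings.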

@meta-cla meta-cla bot added the cla signed label Sep 22, 2025
@misko misko added enhancement New feature or request minor Minor version release labels Sep 22, 2025
@misko misko marked this pull request as ready for review September 23, 2025 23:17
@misko misko requested review from lbluque and rayg1234 September 23, 2025 23:17