
aggregators in a distributed future #987

@mahf708

Description


Originally posted by @mahf708 in #975 (comment)

thoughts from @mcgibbon

  • Could we "implement" spatial aggregators for now by doing a spatial gather at the top level of the aggregator, and then running the existing aggregators with a reduction along the data-parallel dim (so that the spatial roots gather to the global root naturally)?
  • That's basically what #975 does for the gradient magnitude percent difference, but we could use it as a temporary solution to make all aggregators work. It should be enough for the case we're discussing, where we can store global output data but can't store the entire latent space and its gradients globally (i.e. during training).
  • Honestly, we may want to do this long-term, except that we would want to spatially gather one variable at a time, run all the aggregators on it, and then iterate over the variables. Even at 3 km resolution, that's only ~300 MB per variable per timestep at float32 precision.
  • The bigger issue is that some (many?) metrics are probably faster to compute by doing the work on all the nodes instead of having the root do all the compute.
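The per-variable gather-then-aggregate loop described above could be sketched roughly as follows. This is a minimal, hedged sketch, not the repo's actual API: `gather_then_aggregate` and the aggregator dict are hypothetical names, and the in-process `np.concatenate` stands in for a real collective gather (e.g. `torch.distributed.gather` to the root rank). The point it illustrates is bounding root memory to roughly one global field at a time:

```python
import numpy as np

def gather_then_aggregate(rank_chunks, aggregators):
    """Sketch of the proposed pattern: for each variable, spatially
    gather the per-rank shards into one global field, run every
    aggregator on it, then discard it before the next variable.

    rank_chunks: list of {var_name: 1-D spatial shard}, one dict per rank
                 (stand-in for data living on separate ranks).
    aggregators: {metric_name: callable} applied to each global field.
    """
    results = {}
    for var in rank_chunks[0]:
        # Stand-in for a distributed gather to root: concatenate the
        # spatial shards along the spatial dimension.
        global_field = np.concatenate([chunk[var] for chunk in rank_chunks])
        # Root runs all aggregators on the single gathered variable;
        # memory stays ~one global field, not the whole latent state.
        results[var] = {name: fn(global_field) for name, fn in aggregators.items()}
    return results

# Toy data: 4 "ranks", each holding a spatial shard of two variables.
chunks = [{"T": np.full(8, float(r)), "q": np.arange(8.0) + r} for r in range(4)]
aggs = {"mean": np.mean, "max": np.max}
out = gather_then_aggregate(chunks, aggs)
```

A real version would also have non-root ranks skip the aggregator calls after contributing their shard, which is exactly the trade-off raised in the last bullet: the root does all the metric compute while the other ranks idle.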
