Add spatial parallel regression tests and fix differentiable reduce #995
Conversation
Introduces test_regression.py with a RegressionCase base class that defines initialize (data + module), reduce (default: spatial-aware sum), and lr (for SGD). The test does forward -> backward -> SGD step -> forward and compares all outputs against a single-rank baseline, catching gradient bugs that Adam-based tests mask. Includes a linear (Conv2d 1x1) case that currently catches the known gradient hook bug under spatial parallelism. Co-Authored-By: Claude Opus 4.6 <[email protected]>
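The forward -> backward -> SGD step -> forward pattern described above can be sketched as follows. This is an illustrative stand-in, not the PR's code: `run_case` and the inline `reduce_fn` are hypothetical names, and the Conv2d shapes merely mirror the linear case. Plain SGD applies gradients directly, so any per-rank gradient discrepancy shows up in the post-step outputs, whereas Adam's normalization can mask it.

```python
import torch

def run_case(module, data, reduce_fn, lr=0.1):
    """Hypothetical sketch: one SGD step, comparing outputs before/after."""
    opt = torch.optim.SGD(module.parameters(), lr=lr)
    out_before = module(data)
    loss = reduce_fn(out_before)
    loss.backward()           # a gradient bug corrupts this step directly...
    opt.step()                # ...because SGD applies raw gradients,
    opt.zero_grad()           # unlike Adam's normalized updates
    out_after = module(data)
    return out_before.detach(), loss.detach(), out_after.detach()

torch.manual_seed(0)
module = torch.nn.Conv2d(4, 4, kernel_size=1)  # mirrors the linear case
data = torch.randn(1, 4, 8, 16)                # (batch, channels, H, W)
before, loss, after = run_case(module, data, lambda t: t.sum())
```

In the real test each of the three tensors (pre-step output, loss, post-step output) would be compared against a stored single-rank baseline.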
Add _AutogradAllReduce, a torch.autograd.Function that wraps all_reduce with an identity backward, replacing the in-place all_reduce in spatial_reduce_sum that broke the autograd graph. Add spatial gradient hooks in wrap_module that all-reduce parameter gradients across spatial ranks after backward, so each rank applies the same weight update. Also set broadcast_buffers=False in DDP to prevent corruption of SHT/iSHT Legendre polynomial buffers. Based on work by mahf708 and peterdschwartz in E3SM-Project/ace PR #993. Co-Authored-By: mahf708 <[email protected]> Co-Authored-By: peterdschwartz <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>
```python
@staticmethod
@custom_fwd(device_type="cuda")
```
Is this strictly a GPU test? Does it get skipped on CPU?
Ah yes, I had fixed this locally but not pushed. Now it's pushed.
```python
"""1x1 Conv2d applied channel-wise; verifies basic gradient flow."""

n_channels: int = 4
_img_shape: tuple[int, int] = (8, 16)
```
[nit] can you just have this as img_shape and not have a property?
> torchrun. Generating baselines under the same backend you test against does not validate cross-backend correctness.
[nit] I don't understand this comment to the agent; we don't allow the agent to generate or delete the baseline .pt, right?
We do, or at least I do: if the file is deleted (which you can let it do on a per-command basis) and the test is re-run, the file gets re-generated.
Also, if Claude adds a new test, it needs to generate the baseline pt file.
```python
# If we want mean gradient instead of sum, we want:
# reduced /= (self._h_size * self._w_size)
```
We always want the sum, right? Doesn't each rank get a partial gradient that is already area-weighted, so summing them should be the correct mean?
Not always, right? Some averages aren't weighted... or maybe I misread the code. For the production ones, like AreaWeightedMSE, we are covered.
I kept it in the other PR because I wasn't 100% sure which losses are used and what general case we should support.
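The claim in this thread (that per-rank partial gradients of a globally normalized weighted mean sum to the full-domain gradient) can be checked numerically without any distributed setup. This is an illustrative toy, not code from the PR: `theta`, `w`, and the slice-based "ranks" are hypothetical, and the key assumption is that each rank normalizes by the global weight sum.

```python
import torch

torch.manual_seed(0)
w = torch.rand(8)                            # per-point area weights
data = torch.rand(8)
theta = torch.randn((), requires_grad=True)  # a shared parameter

def loss(sl):
    # Area-weighted MSE over a slice of the domain, normalized by the
    # GLOBAL weight sum, as a spatially decomposed weighted mean is.
    return (w[sl] * (theta * data[sl]) ** 2).sum() / w.sum()

# Full-domain gradient vs. the sum of two "ranks'" partial gradients.
(g_full,) = torch.autograd.grad(loss(slice(None)), theta)
g_sum = sum(torch.autograd.grad(loss(sl), theta)[0]
            for sl in (slice(0, 4), slice(4, 8)))
```

By linearity of differentiation the two agree exactly; losses that normalize by a per-rank (local) weight sum instead would need the mean-style correction discussed above.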
custom_fwd/custom_bwd are for AMP dtype management and not needed here since all_reduce and identity don't require dtype casting. The clone() in backward is also unnecessary since there's no in-place mutation. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Remove _img_shape indirection in _LinearCase dataclass - a plain attribute satisfies the abstract property. Remove the commented-out mean gradient alternative since sum is always correct here (each rank's gradient is already a partial sum of area-weighted values). Co-Authored-By: Claude Opus 4.6 <[email protected]>
Fix spatial parallelism backward pass and add regression test infrastructure.
`spatial_reduce_sum` used an in-place `all_reduce` that broke the autograd graph, preventing gradients from flowing through the loss computation path. Spatial ranks also did not aggregate parameter gradients, so each rank applied different weight updates.
Changes:
- `fme.core.distributed.model_torch_distributed._AutogradAllReduce`: new `torch.autograd.Function` wrapping `all_reduce` with identity backward, making `spatial_reduce_sum` differentiable
- `fme.core.distributed.model_torch_distributed.ModelTorchDistributed.wrap_module`: register per-parameter gradient hooks that all-reduce across spatial ranks; set `broadcast_buffers=False` to protect SHT/iSHT Legendre buffers
- `fme.core.distributed.parallel_tests.test_regression`: new parameterized regression test framework (`RegressionCase` base class) that validates forward → backward → forward correctness across spatial decompositions
- `AGENTS.md`: document parallel test commands, env vars, and baseline workflow