
add backward pass for spatial parallelism #993

Open
mahf708 wants to merge 5 commits into ai2cm:main from E3SM-Project:spatial-parallel-training
Conversation

mahf708 (Contributor) commented Mar 19, 2026

add backward pass for spatial parallelism with three changes:

  • an autograd version of spatial_reduce_sum to handle the loss path
  • hook up the parameters with a grad hook
  • set broadcast_buffers to False, since DDP otherwise corrupts/mutates buffers in place

there might be some assumptions about which types of loss we support so far, and a careful audit will be required. An alternative to this formulation would be explicit and clear spatial awareness in the loss calculation elsewhere ... but that may be invasive?

  • Tests added
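For concreteness, the first change (a differentiable spatial_reduce_sum) might look roughly like the sketch below. This is an illustrative reconstruction, not the PR's exact code: the class and argument names are assumptions, and the custom_fwd/custom_bwd AMP decorators discussed later in the thread are omitted.

```python
import torch
import torch.distributed as dist


class _AllReduceSum(torch.autograd.Function):
    """Differentiable all-reduce sum (illustrative sketch)."""

    @staticmethod
    def forward(ctx, input: torch.Tensor, group=None) -> torch.Tensor:
        # Clone first: an in-place all_reduce on `input` would mutate a
        # tensor that the autograd graph still references.
        output = input.clone()
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(output, group=group)
        return output

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Identity backward: the forward sum already gives every rank the
        # same value, so the incoming gradient is the same on all ranks.
        return grad_output, None


# Single-process smoke run: the reduce is a no-op without an initialized
# process group, but the op stays differentiable end to end.
x = torch.ones(3, requires_grad=True)
_AllReduceSum.apply(x, None).sum().backward()
```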

Comment on lines +217 to +218
atol=1e-2,
rtol=1e-2,
Contributor Author
these tolerances seem too loose, so I have low confidence in all of this for now

Comment on lines +65 to +72
output = input.clone()
torch.distributed.all_reduce(output, group=group)
return output

@staticmethod
@custom_bwd(device_type="cuda")
def backward(ctx, grad_output: torch.Tensor):
return grad_output.clone(), None
Contributor Author

the cloning is probably unnecessary here

Comment on lines +390 to +391
# If we want mean gradient instead of sum, we want:
# reduced /= (self._h_size * self._w_size)
mahf708 (Contributor Author) Mar 19, 2026

I can't quite wrap my head around which one we really need tbh, and I think this is linked with how we do losses here...
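The sum-vs-mean question comes down to where the loss is normalized. A small numeric illustration (shard counts and values here are made up): a sum-reduce normalized once by the global element count agrees with a mean of per-rank local means only when all shards have the same size, which is why the choice has to be made consistently with the loss code.

```python
import torch

n_ranks = 4
local_count = 8                              # hypothetical elements per rank
local_sums = [torch.tensor(2.0)] * n_ranks   # stand-in per-rank partial sums

# Option 1: sum-reduce the partial sums, normalize once by the global count.
global_mean = sum(local_sums) / (n_ranks * local_count)

# Option 2: each rank takes a local mean, then the reduction divides by the
# rank count (the commented-out `reduced /= ...` in the diff).
mean_of_means = sum(s / local_count for s in local_sums) / n_ranks

# Equal here because every shard has the same size; with uneven shards
# (e.g. latitude bands covering different areas) the two diverge.
```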

@custom_bwd(device_type="cuda")
def backward(ctx, grad_output: torch.Tensor):
return grad_output.clone(), None

Contributor Author

identity in backward may become an issue... we may need an all-reduce in backward, but that's kind of what the hook below is doing

backwards step

Add test that verifies consistency between NonDistribute and
TorchModelDistributed for loss and gradient calculation using simple
SHT/iSHT transforms
if not dist.is_root():
return

if not BASELINE_FILE.exists():
Contributor

Please use the regression helper in fme/core/testing/regression.py for this, to reduce duplication of this kind of regression logic. It should be general-purpose and support this use case.

Contributor

A "bug" here is that this will pass when no baseline exists.

TIMESTEP = datetime.timedelta(hours=6)


def get_dataset_info(
Contributor

This is a lot of duplicated testing logic, do we need a new testing file or can we make the existing test in test_single_module.py parallel-enabled / put this parallel-enabled test in that file to use its helpers?

Contributor Author

That test uses the legacy SFNO --- could we change it to use CSFNO?

Contributor

I'd like to avoid new ways of executing tests, if we need something can we put this in the Makefile, or if it's temporary for this PR can you leave it uncommitted? I think the current make targets support this in two executions (in serial and then in parallel), using the TEST_PATH argument.

mcgibbon (Contributor) left a comment

It seems like the core contribution of this script is that it adds proper gradient reductions to spatial_reduce_sum. It would be nice to see a unit test added that fails before you add those changes and passes after you add those changes. I think we could and should merge a PR with just those changes.

It feels like the large rtol and atol are needed because the test isn't really passing, and that we'll need some finer-grained tests to tell us what part of the code is causing the failure. Splitting the spatial_reduce_sum changes out will let us get them merged without blocking on this, which could take a while and will need a lot more test code to be written/reviewed.

mcgibbon added a commit that referenced this pull request Mar 20, 2026
Add _AutogradAllReduce, a torch.autograd.Function that wraps
all_reduce with an identity backward, replacing the in-place
all_reduce in spatial_reduce_sum that broke the autograd graph.

Add spatial gradient hooks in wrap_module that all-reduce parameter
gradients across spatial ranks after backward, so each rank applies
the same weight update. Also set broadcast_buffers=False in DDP to
prevent corruption of SHT/iSHT Legendre polynomial buffers.

Based on work by mahf708 and peterdschwartz in E3SM-Project/ace PR #993.

Co-Authored-By: mahf708 <naser.mahfouz@pnnl.gov>
Co-Authored-By: peterdschwartz <peterdschwartz83@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
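The gradient-hook part of the commit message above might be sketched as follows. This is a minimal reconstruction under stated assumptions: the function name and the process-group guard are illustrative, not copied from the branch.

```python
import torch
import torch.distributed as dist


def register_spatial_grad_hooks(module: torch.nn.Module, group=None) -> None:
    """All-reduce each parameter's gradient across spatial ranks after
    backward, so every rank applies the same weight update (sketch)."""

    def hook(grad: torch.Tensor) -> torch.Tensor:
        reduced = grad.clone()
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(reduced, group=group)
        return reduced

    for param in module.parameters():
        if param.requires_grad:
            param.register_hook(hook)


# Single-process usage: the hooks fire, the reduce is a no-op.
model = torch.nn.Linear(4, 2)
register_spatial_grad_hooks(model)
model(torch.ones(1, 4)).sum().backward()
```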
mahf708 (Contributor Author) commented Mar 20, 2026

@mcgibbon note that the large tol is only needed for GPU/CPU inconsistency, but it is ok otherwise (i.e., it is ok if we run everything on cpu or gpu --- that may simplify the hunt or be reassuring? idk)

also, @mcgibbon, thoughts on this potential gotcha: besides the forward ops that we want to be differentiable, etc., I don't think we want this thing to be differentiable, right? If so, we can potentially change this to use a bare all_reduce here. What do you think? I think this is largely ok for forward, but if we ever add backward to the spatial_reduce_sum, it may cause problems?

    def gather_spatial(
        self, data: dict[str, torch.Tensor], img_shape: tuple[int, int]
    ) -> dict[str, torch.Tensor]:
        """Gather local spatial chunks back to global tensors via all-reduce."""
        return {k: self.gather_spatial_tensor(v, img_shape) for k, v in data.items()}

    def gather_spatial_tensor(
        self, tensor: torch.Tensor, img_shape: tuple[int, int]
    ) -> torch.Tensor:
        """Reassemble a spatially-sharded tensor on every rank via all-reduce.

        Args:
            tensor: Local spatial shard.
            img_shape: Global ``(H, W)`` spatial dimensions.
        """
        if img_shape == tensor.shape[-2:]:
            return tensor
        global_shape = (*tensor.shape[:-2], *img_shape)
        slices = self.get_local_slices(img_shape)
        buf = torch.zeros(global_shape, dtype=tensor.dtype, device=tensor.device)
        buf[(..., *slices)] = tensor
        return self.spatial_reduce_sum(buf)

"""

@staticmethod
@custom_fwd(device_type="cuda")
Contributor

Why is custom_fwd needed? Don't you need this on CPU? Is this why CPU and GPU are giving different results?

Contributor Author

No idea, I saw this in the makani/physicsnemo repos and I copied it blindly (testing with it and without didn't really make any difference, or at least I didn't notice it)

Contributor

They don’t have spatial parallelism on cpu, so it wouldn’t cause issues for them, but we should remove it. I did in my branch incorporating this code.

mcgibbon (Contributor):

> @mcgibbon note that the large tol is only needed for GPU/CPU inconsistency, but it is ok otherwise (i.e., it is ok if we run everything on cpu or gpu --- that may simplify the hunt or be reassuring? idk)

We should be able to get pretty close results on both. Just make sure that initializations always happen on CPU, and then get transferred to get_device(), rather than being initialized directly on-device. I marked one location that might be causing this difference with a line comment.
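A sketch of that initialize-on-CPU-then-transfer pattern (the repo's get_device() helper is replaced here by an explicit device argument to keep the example self-contained):

```python
import torch


def init_on_cpu(shape: tuple, device: torch.device) -> torch.Tensor:
    # Draw random weights on CPU so the RNG sequence is identical no matter
    # where the model eventually runs, then transfer to the target device.
    torch.manual_seed(0)
    weights = torch.randn(shape)  # CPU initialization
    return weights.to(device)


a = init_on_cpu((2, 3), torch.device("cpu"))
b = init_on_cpu((2, 3), torch.device("cpu"))
```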

> also, @mcgibbon, thoughts on this potential gotcha: besides the forward ops that we want to be differentiable, etc., I don't think we want this thing to be differentiable, right? If so, we can potentially change this to use a bare all_reduce here. What do you think? I think this is largely ok for forward, but if we ever add backward to the spatial_reduce_sum, it may cause problems?

    def gather_spatial(
        self, data: dict[str, torch.Tensor], img_shape: tuple[int, int]
    ) -> dict[str, torch.Tensor]:
        """Gather local spatial chunks back to global tensors via all-reduce."""
        return {k: self.gather_spatial_tensor(v, img_shape) for k, v in data.items()}

    def gather_spatial_tensor(
        self, tensor: torch.Tensor, img_shape: tuple[int, int]
    ) -> torch.Tensor:
        """Reassemble a spatially-sharded tensor on every rank via all-reduce.

        Args:
            tensor: Local spatial shard.
            img_shape: Global ``(H, W)`` spatial dimensions.
        """
        if img_shape == tensor.shape[-2:]:
            return tensor
        global_shape = (*tensor.shape[:-2], *img_shape)
        slices = self.get_local_slices(img_shape)
        buf = torch.zeros(global_shape, dtype=tensor.dtype, device=tensor.device)
        buf[(..., *slices)] = tensor
        return self.spatial_reduce_sum(buf)

Does your added code make anything slower? I don't really expect so? When we are in non-differentiable places we use a no_grad context to handle not running extra logic. I'd say don't worry about using the differentiable version.
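The no_grad point above can be seen in a two-liner: inside the context no autograd graph is recorded, so a differentiable reduce wrapper costs nothing extra on non-differentiable paths.

```python
import torch

x = torch.ones(3, requires_grad=True)
with torch.no_grad():
    # No graph is recorded here, so even a "differentiable" all-reduce
    # wrapper would add no backward bookkeeping on this path.
    y = (x * 2).sum()
```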

3 participants