Add separated channel module registry and step implementation by mcgibbon · Pull Request #957 · ai2cm/ace

mcgibbon · 2026-03-11T19:26:30Z

Adds a new module interface where modules explicitly take separate forcing and prognostic tensors as input and return separate prognostic and diagnostic tensors as output, instead of a single concatenated tensor. This makes channel semantics explicit at the module level.

Changes:

fme.core.registry.SeparatedModuleConfig, SeparatedModule, SeparatedModuleSelector: new module registry mirroring ModuleSelector but with separated channel interface
fme.core.registry.LegacyModuleAdapter, LegacyWrapper: adapter wrapping old-style single-tensor modules for backwards compatibility
fme.core.step.SeparatedModuleStepConfig, SeparatedModuleStep: new StepABC implementation using the separated channel interface, with explicit forcing_names, prognostic_names, and diagnostic_names fields
StepConfigABC.prognostic_names and StepABC.prognostic_names: refactored from @final @property to get_prognostic_names() getter method, enabling subclasses to use prognostic_names as a dataclass field name
Tests added
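To make the separated channel interface concrete, here is a minimal pure-Python sketch. It is not the code from this PR: `Channels`, `SeparatedForward`, and `ToySeparatedModule` are hypothetical names, and plain lists of floats (one value per channel) stand in for tensors. The only parts taken from the description are the call shape (forcing and prognostic in, prognostic and diagnostic out) and the optional labels argument.

```python
from typing import Optional, Protocol

# Stand-in for a tensor: one float per channel (assumption for illustration only).
Channels = list[float]


class SeparatedForward(Protocol):
    """Call shape of a separated-channel module (hypothetical name)."""

    def __call__(
        self,
        forcing: Channels,
        prognostic: Channels,
        labels: Optional[Channels] = None,
    ) -> tuple[Channels, Channels]: ...


class ToySeparatedModule:
    """Toy module: forcing acts as context, prognostics get a residual update."""

    def __init__(self, n_diagnostic: int):
        self.n_diagnostic = n_diagnostic

    def __call__(self, forcing, prognostic, labels=None):
        # Summarize the forcing channels into a single context value.
        context = sum(forcing) / len(forcing) if forcing else 0.0
        # Residual update: each prognostic channel is its input plus a correction.
        new_prognostic = [p + 0.5 * context for p in prognostic]
        # Diagnostics are derived outputs with no matching input channel.
        diagnostic = [sum(new_prognostic)] * self.n_diagnostic
        return new_prognostic, diagnostic


module: SeparatedForward = ToySeparatedModule(n_diagnostic=1)
prognostic_out, diagnostic_out = module(forcing=[2.0], prognostic=[1.0, 2.0])
```

Because the module receives the two groups separately, it can treat them differently without the caller encoding that distinction in channel order.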

Add SeparatedModuleConfig, SeparatedModule, and SeparatedModuleSelector for modules that take separate forcing/prognostic input tensors and return separate prognostic/diagnostic output tensors. Include a LegacyModuleAdapter that wraps old-style single-tensor modules via LegacyWrapper for use in the new interface. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Add SeparatedModuleStepConfig/SeparatedModuleStep as a new StepABC implementation that works with the separated channel module interface. Uses explicit forcing_names, prognostic_names_, and diagnostic_names instead of deriving them from in/out name set operations. Remove state_dict/load_state_dict overrides from LegacyWrapper to fix an asymmetry where PyTorch's parent recursion calls child state_dict() overrides but not child load_state_dict() overrides, causing key mismatches when wrapped by DummyWrapper. Includes parametrized test coverage in test_step.py and dedicated tests with numerical equivalence regression tests vs SingleModuleStep. Co-Authored-By: Claude Opus 4.6 <[email protected]>

mcgibbon · 2026-03-11T19:28:43Z

fme/core/registry/separated_module.py

+    @classmethod
+    def register(
+        cls, type_name: str
+    ) -> Callable[[Type[SeparatedModuleConfig]], Type[SeparatedModuleConfig]]:  # noqa: UP006


Question: Why is a noqa needed here?

It's because of type being an attribute, so the type built-in is no longer accessible.

Would it work to define SeparatedModuleConfigType = type[SeparatedModuleConfig] at the module level to avoid the clash?
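A sketch of the module-level alias idea, using hypothetical stand-in classes (`ExampleModuleConfig`, `ExampleSelector`, `ToyConfig`) rather than the real registry. The alias is created before any class declares a field named `type`, so annotations inside the selector never need to spell `type[...]` or fall back to `typing.Type`.

```python
import dataclasses
from collections.abc import Callable
from typing import Any, ClassVar


@dataclasses.dataclass
class ExampleModuleConfig:
    """Hypothetical base class for registrable configs."""


# Defined at module level, where `type` still unambiguously means the built-in.
ExampleModuleConfigType = type[ExampleModuleConfig]


@dataclasses.dataclass
class ExampleSelector:
    type: str  # field name that clashes with the built-in inside this class body
    config: dict[str, Any]
    registry: ClassVar[dict[str, ExampleModuleConfigType]] = {}

    @classmethod
    def register(
        cls, type_name: str
    ) -> Callable[[ExampleModuleConfigType], ExampleModuleConfigType]:
        def decorator(config_cls: ExampleModuleConfigType) -> ExampleModuleConfigType:
            # Record the config class under its type name and return it unchanged.
            cls.registry[type_name] = config_cls
            return config_cls

        return decorator

    def build_config(self) -> ExampleModuleConfig:
        # Look up the registered class and construct it from the config dict.
        return self.registry[self.type](**self.config)


@ExampleSelector.register("toy")
@dataclasses.dataclass
class ToyConfig(ExampleModuleConfig):
    width: int = 8
```

With this layout, `ExampleSelector(type="toy", config={"width": 4}).build_config()` returns a `ToyConfig` and no lint suppression is needed on the `register` signature.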

mcgibbon · 2026-03-11T19:30:17Z

fme/core/registry/separated_module.py

+from fme.core.dataset_info import DatasetInfo
+from fme.core.labels import BatchLabels, LabelEncoding
+
+from .module import CONDITIONAL_BUILDERS, ModuleSelector


Question: Is it appropriate to re-use the CONDITIONAL_BUILDERS from .module? Or should we be defining these independently for SeparatedModule? Or should all new module types here support conditioning (we don't need backwards compatibility)? At a minimum labels can always be given as broadcasted input variables, right?

Answer: Good question. Right now we reuse it because the only way to get a conditional separated module is through LegacyModuleAdapter, which delegates to ModuleSelector, so the same builder types apply. When we register native separated modules that support conditioning, we should define an independent CONDITIONAL_BUILDERS list (or a different mechanism) for this registry. For now, reusing it is correct since LegacyModuleAdapter is the only registered type.

mcgibbon · 2026-03-11T19:34:12Z

fme/core/step/separated_module.py

+        if "secondary_decoder" in state:
+            self.secondary_decoder.load_module_state(state["secondary_decoder"])


Question: Do we need this check? There aren't any existing checkpoints to maintain backwards compatibility for.

Suggestion: Remove this check if you can.

mcgibbon · 2026-03-11T19:34:30Z

fme/core/step/test_separated_module.py

+from fme.core.step.step import StepSelector
+
+IMG_SHAPE = (16, 32)
+TIMESTEP = __import__("datetime").timedelta(hours=6)


Suggestion: Properly import timedelta instead of this weird line.
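For reference, the conventional form of the two constants, assuming nothing else in the test file depends on the `__import__` call:

```python
from datetime import timedelta

IMG_SHAPE = (16, 32)
TIMESTEP = timedelta(hours=6)
```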

mcgibbon · 2026-03-11T19:35:22Z

fme/core/step/test_separated_module.py

+
+        return single_step, separated_step, in_names
+
+    def test_equivalence(self):


Suggestion: Rename to test_equivalence_with_(something), e.g. test_equivalence_with_single_step?

mcgibbon · 2026-03-11T19:36:45Z

fme/core/step/test_separated_module.py

+        single_sfno = single_step.module.torch_module.module
+        separated_sfno = separated_step.module.torch_module.module.inner


Issue: I don't like digging into these internal properties, is it possible to do this with public API like the .modules attribute of StepABC?

It wasn't really avoidable.

…tics Validate that prognostic_names_ is non-empty in SeparatedModuleStepConfig post-init, since prognostic variables are required. Add test cases for configurations with no forcings, no diagnostics, and both empty together. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Change StepConfigABC.prognostic_names and StepABC.prognostic_names from @final @property to @final getter methods, enabling subclasses to use prognostic_names as a dataclass field name in a follow-up commit. Updates all callers across StepperConfig, Stepper, CoupledStepper, and tests. Co-Authored-By: Claude Opus 4.6 <[email protected]>

…nfig Now that the base class property was refactored to get_prognostic_names(), the field can use the cleaner name without conflict. Co-Authored-By: Claude Opus 4.6 <[email protected]>

- Fix timedelta import in test (use datetime.timedelta properly)
- Remove unnecessary backwards-compat check in load_state
- Rename test_equivalence to test_equivalence_with_single_step

Co-Authored-By: Claude Opus 4.6 <[email protected]>

mcgibbon · 2026-03-11T20:29:40Z

I ended up going the route of enabling prognostic/forcing/output tensors with a new StepABC implementation. Thinking about it, it's just too much of a fundamental change in the sense that it makes no sense to run an existing checkpoint with the new code - the input and output packers are fundamentally changed, and there's no guarantee the variable name ordering for existing checkpoints is valid under the new system.

mcgibbon · 2026-03-11T20:38:50Z

There are a bunch of weird questions for the Legacy wrapper, like conditional being set in both the inside and outside module selectors. I don't think we actually need this support - I'll just register a new separated builder when we have the local model that uses this feature. So for now this PR will just have a class/config used for testing.

- Remove LegacyWrapper and LegacyModuleAdapter from separated module registry
- Remove conditional field from SeparatedModuleSelector; all separated modules are expected to accept an optional labels argument
- Replace legacy adapter usage in tests with SimpleSeparatedBuilder
- Remove equivalence test class (was specific to legacy wrapper)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

…ace into feature/separated-module-registry

- Remove unused _no_optimization / NullOptimization from SingleModuleStep, SeparateRadiationStep, FCN3Step, and SeparatedModuleStep
- Move SimpleSeparatedModule and its builder into separated_module.py with deferred registration via register_test_types(), eliminating test-to-test import dependency
- Fix SeparatedModuleConfig.build docstring to document labels parameter

Co-Authored-By: Claude Opus 4.6 <[email protected]>

mcgibbon · 2026-03-12T14:38:43Z

fme/ace/step/fcn3.py

        self.module = dist.wrap_module(module)
        self._img_shape = dataset_info.img_shape
        self._config = config
-        self._no_optimization = NullOptimization()


This was unused.

mcgibbon · 2026-03-12T14:39:13Z

fme/ace/inference/inference.py

        initial_condition = get_initial_condition(
            config.initial_condition.get_dataset(),
-            stepper_config.prognostic_names,
+            stepper_config.get_prognostic_names(),


This property method needed to be refactored to a getter so that prognostic_names could be a dataclass attribute instead, on the new Step.

~~What do you think about making this refactor a separate PR? Fairly minor change, but does touch a lot of files.~~

Feel free to ignore, I don't think it's really necessary to separate off.

I also went back and forth on this. Will keep it.
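A small sketch of the pattern this refactor enables. The class names and the derivation inside `get_prognostic_names()` are assumptions for illustration, not the real StepConfigABC: the base class exposes a final getter, which leaves the attribute name `prognostic_names` free for a dataclass field on the subclass. The non-empty check mirrors the post-init validation mentioned in the commit messages.

```python
import abc
import dataclasses
from typing import final


class StepConfigSketch(abc.ABC):
    """Hypothetical stand-in for the base step config."""

    @property
    @abc.abstractmethod
    def input_names(self) -> list[str]: ...

    @property
    @abc.abstractmethod
    def output_names(self) -> list[str]: ...

    @final
    def get_prognostic_names(self) -> list[str]:
        # Assumed derivation: names that are both inputs and outputs.
        outputs = set(self.output_names)
        return [name for name in self.input_names if name in outputs]


@dataclasses.dataclass
class SeparatedStepConfigSketch(StepConfigSketch):
    forcing_names: list[str]
    prognostic_names: list[str]  # allowed because the base uses a getter method
    diagnostic_names: list[str]

    def __post_init__(self):
        if not self.prognostic_names:
            raise ValueError("prognostic_names must be non-empty")

    @property
    def input_names(self) -> list[str]:
        return self.forcing_names + self.prognostic_names

    @property
    def output_names(self) -> list[str]:
        return self.prognostic_names + self.diagnostic_names


config = SeparatedStepConfigSketch(
    forcing_names=["solar"],
    prognostic_names=["temperature"],
    diagnostic_names=["precipitation"],
)
```

Callers that previously read the property now call the method, as in the `stepper_config.get_prognostic_names()` change shown in the diff.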

mcgibbon · 2026-03-12T14:40:26Z

fme/core/registry/separated_module.py

+        forward(
+            forcing: Tensor,
+            prognostic: Tensor,
+            labels: Tensor | None = None,


I went with requiring all modules of this new type take in a labels argument, since there's no need for backwards compatibility with ones that don't. For network types that don't have conditioning, we can broadcast these to the domain size and just treat them as additional input variables.

mcgibbon · 2026-03-12T14:42:34Z

fme/core/registry/separated_module.py

+        )
+
+
+class SeparatedModule:


This handles conversion of BatchLabels to Tensor, and also provides a strongly typed interface (unlike nn.Module which is treated as Any by mypy). We do the same thing for Module.

fme/core/step/separated_module.py

jpdunc23

This is looking good. My main concern is the handling of combinations of missing labels and encoding.

When I heard you talking about this PR I had a somewhat different idea of what the changes would look like. I thought that you would implement a multi-module rather than single-module approach. The multi-module approach might be impractical for certain architectures, e.g. if you want a shared embedding layer for all inputs. But it would also be more readily able to take advantage of existing nn.Modules. The StepABC would just be responsible for piping the appropriate args to the appropriate nn.Module without any single nn.Module having to know about "business logic" (i.e., distinctions between forcing, prognostic, and diagnostic variables).

Anyways, the approach here is also reasonable.

jpdunc23 · 2026-03-12T16:59:11Z

fme/core/registry/separated_module.py

+    @classmethod
+    def register(
+        cls, type_name: str
+    ) -> Callable[[Type[SeparatedModuleConfig]], Type[SeparatedModuleConfig]]:  # noqa: UP006


Would it work to define SeparatedModuleConfigType = type[SeparatedModuleConfig] at the module level to avoid the clash?

jpdunc23 · 2026-03-12T17:18:24Z

fme/ace/inference/inference.py

        initial_condition = get_initial_condition(
            config.initial_condition.get_dataset(),
-            stepper_config.prognostic_names,
+            stepper_config.get_prognostic_names(),


~~What do you think about making this refactor a separate PR? Fairly minor change, but does touch a lot of files.~~

jpdunc23 · 2026-03-12T18:28:41Z

fme/core/registry/separated_module.py

+        if labels is not None and self._label_encoding is not None:
+            encoded_labels = labels.conform_to_encoding(self._label_encoding)
+            return self._module(forcing, prognostic, labels=encoded_labels.tensor)


Similar to fme.core.registry.module.Module, should this raise an error if labels is not None and self._label_encoding is None or vice versa?

Also, I think this branch is untested.
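One way the mismatch check could look, sketched with hypothetical stand-ins. `LabelsSketch` imitates only the two members used in the diff (`conform_to_encoding` and `.tensor`), its reordering logic is invented for the example, and a list of label names stands in for the encoding. The choice of `ValueError` is also an assumption; the real `Module` may raise something else.

```python
from typing import Optional


class LabelsSketch:
    """Hypothetical stand-in for BatchLabels: named label values."""

    def __init__(self, names: list[str], values: list[float]):
        self.names = names
        self.tensor = values

    def conform_to_encoding(self, encoding: list[str]) -> "LabelsSketch":
        # Reorder values to match the encoding, filling unknown names with 0.0.
        lookup = dict(zip(self.names, self.tensor))
        return LabelsSketch(encoding, [lookup.get(name, 0.0) for name in encoding])


class SeparatedModuleSketch:
    """Hypothetical wrapper that validates labels against the encoding."""

    def __init__(self, module, label_encoding: Optional[list[str]] = None):
        self._module = module
        self._label_encoding = label_encoding

    def __call__(self, forcing, prognostic, labels: Optional[LabelsSketch] = None):
        # Reject the two mismatched combinations instead of silently ignoring them.
        if labels is not None and self._label_encoding is None:
            raise ValueError("labels were given but the module has no label encoding")
        if labels is None and self._label_encoding is not None:
            raise ValueError("module has a label encoding but no labels were given")
        if labels is None:
            return self._module(forcing, prognostic, labels=None)
        encoded = labels.conform_to_encoding(self._label_encoding)
        return self._module(forcing, prognostic, labels=encoded.tensor)
```

A test for the currently untested branch would pass both labels and an encoding and assert that the inner module receives the conformed values.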

fme/core/step/separated_module.py

jpdunc23 · 2026-03-12T18:55:12Z

fme/core/registry/separated_module.py

+        ):
+            return _SimpleSeparatedModule(
+                n_forcing_channels, n_prognostic_channels, n_diagnostic_channels
+            )


Could this be moved to fme/core/registry/test_separated_module.py? I think so, since we have existing tests where the registry is updated in the test file (e.g., see MockStepConfig in fme/core/step/test_step_registry.py). I think it would be cleaner to keep the testing modules isolated to the test files.

Claude wanted to avoid one test file importing from another test file, which I agreed with. The issue being this is used in more than one test file.

Perhaps a separate fme/core/registry/testing.py module?

I think that would make sense.
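A rough sketch of what such a shared testing module might contain. Everything here is hypothetical except the `register_test_types()` name, which appears in the commit messages: `REGISTRY` stands in for the selector's registry and the type name is made up. The point is that registration happens only when a test calls the function, not as a side effect of importing production code.

```python
# Hypothetical stand-in for the selector's registry of type names to builders.
REGISTRY: dict[str, type] = {}


class SimpleSeparatedBuilderSketch:
    """Hypothetical test-only builder."""


def register_test_types() -> None:
    """Register test-only types; safe to call from several test modules."""
    # setdefault keeps repeated calls from different test files harmless.
    REGISTRY.setdefault("simple_separated_test", SimpleSeparatedBuilderSketch)
```

Each test file that needs the type would import this function from the shared module and call it once at import time, which avoids one test file importing from another.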

jpdunc23 · 2026-03-12T19:47:16Z

fme/core/step/separated_module.py

+    ocean: OceanConfig | None = None
+    corrector: AtmosphereCorrectorConfig | CorrectorSelector = dataclasses.field(
+        default_factory=lambda: AtmosphereCorrectorConfig()
+    )


Since we don't need to support backwards compatibility I wish we could avoid this bit of atmosphere specificity:

Suggested change

-    ocean: OceanConfig | None = None
-    corrector: AtmosphereCorrectorConfig | CorrectorSelector = dataclasses.field(
-        default_factory=lambda: AtmosphereCorrectorConfig()
-    )
+    corrector: CorrectorSelector | None = None

I'm working on some refactors now to make this change possible, though I don't necessarily think this PR should be held up by it.

If you’re able to get those changes in soon, it wouldn’t be unreasonable to break backwards compatibility on this in the very near future before we have many checkpoints.

At the least I think we can require CorrectorSelector now? I'm less clear on how we can run without an OceanConfig.

mcgibbon · 2026-03-12T20:31:54Z

without any single nn.Module having to know about "business logic" (i.e., distinctions between forcing, prognostic, and diagnostic variables).

The issue where this breaks down is that we want to use the architecture differently for these types of variables. For example, the way Nvidia does residual updates for prognostic variables in the Makani code, or the way forcing variables are used as context instead of normal inputs. These features are impossible to implement without the module knowing about this distinction. Well, maybe not impossible, the Step could do most of it, but these really do feel like architectural choices and not like physical choices.

I do think it’s good practice to continue using modules that take single tensors as much as possible when they don’t need to make this distinction, and concatenating to call them.
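The "concatenate to call them" approach can be sketched as a small helper. This is an illustration under stated assumptions, not code from the PR: lists of per-channel values stand in for tensors, list concatenation stands in for concatenation along the channel dimension, and the output is assumed to be ordered with prognostic channels first and diagnostic channels after.

```python
def call_single_tensor_module(module, forcing, prognostic):
    """Call a single-tensor module from separated inputs (hypothetical helper)."""
    # Concatenate the two groups into the single input the module expects.
    combined = list(forcing) + list(prognostic)
    output = module(combined)
    # Assumed output layout: prognostic channels first, then diagnostics.
    n_prognostic = len(prognostic)
    return output[:n_prognostic], output[n_prognostic:]
```

The wrapped module stays unaware of the forcing/prognostic/diagnostic distinction; only the caller knows the channel layout.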

Co-authored-by: James Duncan <[email protected]>

mcgibbon · 2026-03-12T20:51:53Z

Hmm, as I'm working in tandem on the Local network, I'm running into an interesting question that is leading me to see the appeal of doing this stuff in Step... I'll need to think on it some more.

- Replace typing.Type with type alias to remove noqa comments
- Move test utilities to fme/core/registry/testing.py
- Add labels/encoding mismatch validation in SeparatedModule
- Use public .modules API in tests instead of internal properties
- Properly import timedelta in test file

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Co-Authored-By: Claude Opus 4.6 <[email protected]>

The build_corrector method was removed from VerticalCoordinate on main. Use self.corrector.get_corrector(dataset_info) instead, matching SingleModuleStepConfig. Co-Authored-By: Claude Opus 4.6 <[email protected]>

mcgibbon and others added 2 commits March 11, 2026 18:49

mcgibbon commented Mar 11, 2026

mcgibbon and others added 4 commits March 11, 2026 19:53

Rename prognostic_names_ to prognostic_names in SeparatedModuleStepCo…

1d88b0b

…nfig Now that the base class property was refactored to get_prognostic_names(), the field can use the cleaner name without conflict. Co-Authored-By: Claude Opus 4.6 <[email protected]>

mcgibbon changed the title ~~Feature/separated module registry~~ Add separated channel module registry and step implementation Mar 11, 2026

Merge branch 'main' into feature/separated-module-registry

231f3a9

mcgibbon and others added 3 commits March 11, 2026 20:44

Merge branch 'feature/separated-module-registry' of github.com:ai2cm/…

56d838e

…ace into feature/separated-module-registry

mcgibbon marked this pull request as ready for review March 12, 2026 14:34

Merge branch 'main' into feature/separated-module-registry

9858f11

mcgibbon commented Mar 12, 2026

jpdunc23 reviewed Mar 12, 2026

Update fme/core/step/separated_module.py

5bd9e08

Co-authored-by: James Duncan <[email protected]>

mcgibbon and others added 5 commits March 20, 2026 20:40

reorder arguments

6d6df2e

Fix register_test_types import in test_step.py

91c2910

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Merge branch 'main' into feature/separated-module-registry

c67142f

Merge branch 'main' into feature/separated-module-registry

b4e49c7

Fix corrector building to use new get_corrector API

a9fbef2

The build_corrector method was removed from VerticalCoordinate on main. Use self.corrector.get_corrector(dataset_info) instead, matching SingleModuleStepConfig. Co-Authored-By: Claude Opus 4.6 <[email protected]>
