
Add per variable loss to Stepper and Log using TrainAggs#981

Open
Arcomano1234 wants to merge 11 commits into main from feature/per-channel-loss-train-agg

Conversation


@Arcomano1234 Arcomano1234 commented Mar 16, 2026

Currently we do not log the individual components of the loss (e.g., each variable's contribution to the overall loss), although we do log weighted RMSE for training, validation, and inference (which is very rarely the actual loss). This can make diagnosing overfitting difficult. This PR adds a per_channel (per-variable) loss method to fme/core/loss.py, which the Stepper calls inside _accumulate_loss when running the TrainAggregator (not during regular training).

Example of a run on wandb.

Changes:

  • Add per channel loss to fme/core/loss.py

  • Log per channel loss to metrics inside _accumulate_loss of the Stepper when called by TrainAggregator

  • Add per channel metrics to TrainAggregator

  • Tests added

Resolves #485
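For illustration, a minimal sketch of what such a per-variable loss could look like (the function name `per_channel_loss`, the unweighted MSE reduction, and the variable names are assumptions for this sketch, not the PR's actual implementation):

```python
import torch


def per_channel_loss(
    gen: dict[str, torch.Tensor],
    target: dict[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
    """Return each variable's contribution to the loss as a scalar tensor.

    Assumed reduction: unweighted MSE per variable (the real loss object
    may apply weights and step-dependent scaling).
    """
    return {
        name: torch.nn.functional.mse_loss(gen[name], target[name])
        for name in gen
    }


# Example: two hypothetical variables, logged under "loss/<var>" keys.
gen = {"T2m": torch.zeros(4), "Q": torch.ones(4)}
target = {"T2m": torch.ones(4), "Q": torch.ones(4)}
metrics = {f"loss/{k}": v.item() for k, v in per_channel_loss(gen, target).items()}
# metrics["loss/T2m"] == 1.0 (all-zero prediction vs all-ones target)
# metrics["loss/Q"] == 0.0 (perfect prediction)
```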

@Arcomano1234 Arcomano1234 changed the title from "claude first attempt to add per variable loss to TrainAgg" to "Add per variable loss to TrainAggs" on Mar 16, 2026
from fme.core.typing_ import TensorMapping

# Metric key prefix for per-variable loss (must match stepper's metrics["loss/<var>"]).
PER_CHANNEL_LOSS_PREFIX = "loss/"
Collaborator

not ideal to have this coupling with the naming in the stepper metrics, but this already exists for the other loss terms so I think it's okay.

Contributor Author

Yeah, I was not a fan of this at all either, but Claude and I couldn't think of a good way around it. I guess the one thing I can do is make this an aggregator itself and decouple it from the stepper entirely. That would also reduce the need to record it when we aren't using it during training.

Contributor Author

The problem then becomes getting the loss function to the aggregator, which has its own complications. I defer to you or Jeremy on whether it's worth decoupling this from the stepper and just passing a loss_fn to a "PerChannelLossAggregator".

Collaborator

Yeah, like I said it's a pre-existing issue so I don't think we should worry about decoupling in this PR. But open to other thoughts from @mcgibbon on this.

Contributor

I'd suggest using a new attribute on the TrainOutput instead of a string label.

)
step_loss = self._loss_obj(gen_step, target_step, step=step)
metrics[f"loss_step_{step}"] = step_loss.detach()
per_channel = self._loss_obj.forward_per_channel(
Collaborator

Is there a way to avoid computing this when it is not needed? I.e. it seems we only should do this calculation when computing metrics for the train/val aggregator, but not when actually training.

Contributor Author

Agreed, it's a little hacky, but the newest commit has a way around this.

@Arcomano1234 Arcomano1234 changed the title from "Add per variable loss to TrainAggs" to "Add per variable loss to Stepper and Log using TrainAggs" on Mar 18, 2026
stepped = self.stepper.train_on_batch(
    batch,
    self._no_optimization,
    compute_per_channel_metrics=compute_per_channel,
Collaborator

I think it would be fine to just hard-code this to True instead of coupling to the aggregator. The important thing is that it's only done for the train_evaluation_batches, not the full training dataset.

Contributor Author

Yeah, I agree; that should simplify the code.

Contributor Author

Updated

target_data: BatchData,
optimization: OptimizationABC,
metrics: dict[str, float],
*,
Collaborator

why *?

Contributor Author

Claude has been trying to add this back in every time I've made changes; for some reason it likes this style.

Contributor

This forces kwargs to be passed as kwargs and not as positional arguments, which IMO is nice though not something we enforce.
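For illustration, a bare `*` in a signature makes every parameter after it keyword-only (the parameter names here mirror the PR's, but the function body is a stand-in stub):

```python
def train_on_batch(batch, optimization, *, compute_per_channel_metrics=False):
    # Everything after the bare * must be passed by keyword.
    return compute_per_channel_metrics


# Keyword form works:
train_on_batch("batch", "opt", compute_per_channel_metrics=True)  # -> True

# Positional form is rejected at call time:
try:
    train_on_batch("batch", "opt", True)
except TypeError:
    print("positional flag rejected")
```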

Contributor Author

Yeah, at least for me Claude really wants us to use this style. There is some benefit IMO to this way of passing kwargs, but that's outside the scope of this PR.


@Arcomano1234 Arcomano1234 marked this pull request as ready for review March 18, 2026 22:50
Contributor

@mcgibbon mcgibbon left a comment

I think we can iterate on this design to reduce the amount of code needed to handle this concern, and avoid re-computing the loss N times (with N GPU kernel dispatches). I'd suggest making a substantial change to the API of the loss objects, having them return 1-D vectors across channels, to avoid most of the code here as well as to avoid recomputing the loss (which can be expensive; I expect the GPUs to have very low occupancy during these calls). If we need to support cross-channel losses that don't fit nicely into per-channel losses, we can iterate later by having the loss return a data type allowing both per-channel and scalar losses (each optional).

Also, it feels a little odd to have TrainAggregator responsible so much for these logs. At the least, the validation aggregator should also log them, and it should be easy in principle (even if we don't want to run it that way as a matter of course) to include these in the per-batch metrics we get during training. The suggested change(s) would make this easier to implement if we choose, as it avoids the low-level prefix arithmetic in the aggregator.

Comment on lines +1639 to +1651
if compute_per_channel_metrics:
    per_channel = self._loss_obj.forward_per_channel(
        gen_step, target_step, step=step
    )
    if per_channel_sum is None:
        per_channel_sum = {
            k: v.detach().clone() for k, v in per_channel.items()
        }
    else:
        for k in per_channel_sum:
            per_channel_sum[k] = (
                per_channel_sum[k] + per_channel[k].detach()
            )
Contributor

Rather than re-computing the per-channel metrics, I suggest updating the code so _loss_obj returns a 1-D vector over the channel dimension. Then at this level you can use the unpacker to turn that into a dict of per-channel losses, while still using step_loss.sum() as the optimized/accumulated loss. This should make it cheap enough that you can just always "compute" these per-channel metrics, instead of needing new boolean flags.

Additionally, right now you have low-level accumulation code in the middle of the optimize function. I suggest instead using a simple Aggregator implementation that takes in a dict of values and keeps a running mean of those values (maybe the existing batch metrics aggregator already handles this) to accumulate these channel losses. You could either attach them as an additional attribute on TrainOutput that then gets passed to this aggregator above this scope (probably better? at least more consistent with what we currently do), or you could take the aggregator as an Optional input argument and record them in this scope.
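A sketch of the suggested shape (all names here are illustrative, not the repo's actual classes): the loss returns a 1-D tensor over channels, `sum()` gives the scalar used for optimization, and a small aggregator keeps a running mean of the unpacked per-channel values.

```python
import torch


def channel_loss(gen: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # gen/target: [batch, channel, ...]; reduce over every dim except channel,
    # yielding a 1-D vector of per-channel losses (assumed MSE here).
    dims = [d for d in range(gen.dim()) if d != 1]
    return ((gen - target) ** 2).mean(dim=dims)  # shape: [channel]


class RunningMeanAggregator:
    """Keeps a running mean of each recorded scalar (hypothetical helper)."""

    def __init__(self):
        self._sums: dict[str, float] = {}
        self._count = 0

    def record(self, values: dict[str, float]) -> None:
        self._count += 1
        for k, v in values.items():
            self._sums[k] = self._sums.get(k, 0.0) + v

    def get_logs(self) -> dict[str, float]:
        return {k: s / self._count for k, s in self._sums.items()}


channel_names = ["T2m", "Q"]  # illustrative variable names
agg = RunningMeanAggregator()
for _ in range(2):  # two batches
    per_channel = channel_loss(torch.zeros(4, 2, 8), torch.ones(4, 2, 8))
    scalar_loss = per_channel.sum()  # still the optimized/accumulated loss
    # "Unpack" the vector into named per-channel losses for logging.
    agg.record({f"loss/{n}": v.item() for n, v in zip(channel_names, per_channel)})
logs = agg.get_logs()
```

This keeps the low-level accumulation out of the optimize path: the optimizer only ever sees `scalar_loss`, and the aggregator owns the per-channel bookkeeping.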

Contributor

Make sure that when refactoring this, you maintain the sum-ness of loss across steps (perhaps make sure this is tested).


        return self.loss(predict_tensors, target_tensors)

    def call_per_channel(
Contributor

I think this function can be avoided if loss_obj were to return a 1D vector of per-channel losses.

        step_weight = (1.0 + self.sqrt_loss_decay_constant * step) ** (-0.5)
        return self.loss(predict_dict, target_dict) * step_weight

    def forward_per_channel(
Contributor

What's the meaning of "forward" here? In any case, I think this function can be avoided if loss_obj were to return a 1D vector of per-channel losses.



Development

Successfully merging this pull request may close these issues.

Add metrics to track variable-specific contributions to the training loss
