
Conversation

@Jubeku (Contributor) commented Oct 9, 2025

Description

This PR is based on @sophie-xhonneux's log_grad_norm branch in #685, modified to allow logging gradients when running in parallel on multiple GPUs with FSDP2.

Issue Number

Closes #688

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions bot added the "model" label (Related to model training or definition (not generic infra)) on Oct 9, 2025
"""
self.last_grad_norm = (
total_norm.full_tensor().item() if self.cf.world_size > 1 else total_norm.item()
)
Jubeku (Contributor, Author):

As mentioned here, full_tensor().item() is needed in parallel runs with FSDP2.

I tested this by logging both ways of calculating the norm:

000 : 00010/02048 : 000010 : loss = 1.0287E+00 (lr=1.64E-06, gradient norm=0.983, gradient norm FT=1.403, s/sec=0.236)

ERA5 : 1.0287E+00 


000 : 00020/02048 : 000020 : loss = 1.0101E+00 (lr=3.34E-06, gradient norm=0.587, gradient norm FT=0.817, s/sec=0.435)

ERA5 : 1.0101E+00 
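
For context, a minimal sketch of the pattern (assuming a recent PyTorch where DTensor is exposed under torch.distributed.tensor; the helper name is only for illustration). Instead of branching on world_size as in the diff, one could also detect the DTensor type directly:

```python
import torch
from torch.distributed.tensor import DTensor  # torch >= 2.4; older releases expose it under torch.distributed._tensor

def norm_to_scalar(total_norm: torch.Tensor) -> float:
    """Turn a (possibly sharded) norm tensor into a plain Python float."""
    # Under FSDP2 the norm result is a DTensor; calling .item() on it only
    # reflects the local shard, so gather the full tensor before converting.
    if isinstance(total_norm, DTensor):
        total_norm = total_norm.full_tensor()
    return total_norm.item()
```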

if self.cf.world_size > 1
else param.grad.norm().item()
)

Jubeku (Contributor, Author):

Same as above: we also need .full_tensor().item() here in multi-GPU mode.

Tested it by printing both versions on 2 GPUs:

print(".item():", param.grad.norm().item())
print(".full_tensor().item()", param.grad.norm().full_tensor().item())

.item(): 0.028306283056735992
.item(): 0.022433193400502205
.full_tensor().item() 0.03611777722835541
.full_tensor().item() 0.03611777722835541
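
A quick check on those numbers (just arithmetic on the values printed above): with the parameter sharded across the two ranks, the per-rank .item() values are norms of disjoint shards, so the global norm should be their root-sum-square.

```python
import math

# Per-rank .item() values from the two-GPU run above
local_norms = [0.028306283056735992, 0.022433193400502205]
global_norm = math.sqrt(sum(n * n for n in local_norms))
print(global_norm)  # ≈ 0.036118, matching .full_tensor().item() on both ranks
```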

@Jubeku Jubeku self-assigned this Oct 9, 2025
grad_norms["grad_norm_" + name] = (
    param.grad.norm().full_tensor().item()
    if self.cf.world_size > 1
    else param.grad.norm().item()
)
Collaborator:

Shouldn't you divide by the number of items in the gradient? Otherwise, if every component in the gradient is equal, you are biased by batching computations.

Contributor:

Not sure I follow. But as far as I know the gradient norm logging is correct, and people do not commonly account for the number of dimensions; as for batching, this is handled in the forward pass and thus automatically dealt with during backprop.
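
To make the two quantities under discussion concrete, a small illustration (variable names are only for this sketch): the raw L2 norm that is logged grows with the number of elements when all components are equal, while a size-normalized variant along the lines of the question above does not.

```python
import torch

g = torch.full((1000,), 0.01)                 # gradient whose components are all equal
l2_norm = g.norm().item()                     # 0.01 * sqrt(1000) ≈ 0.316, scales with size
rms = (g.norm() / g.numel() ** 0.5).item()    # ≈ 0.01, independent of the element count
print(l2_norm, rms)
```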

@sophie-xhonneux (Contributor) left a comment:

Minor things to fix in the comments; I trust they will happen and thus already approve the PR. If logging is off, there should be no effect on the runs.


ae_local_dim_embed: 1024
ae_local_num_blocks: 2
ae_local_num_blocks: 0
Contributor:

Can we, as always, revert back to the original default config, please?

Log instantaneous grad norms, we do not average because of the cost and because we want to
measure the actual values
TODO test DDP case
Contributor:

Can you remove the TODO and copy your FSDP2 comment into the code? Thank you!
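
One way the per-parameter logging could read with the TODO dropped and the FSDP2 explanation inlined (a sketch only; the function wrapper and argument names are hypothetical, the expression itself follows the diff):

```python
import torch

def collect_grad_norms(model: torch.nn.Module, world_size: int) -> dict[str, float]:
    """Log instantaneous per-parameter grad norms; no averaging, we want the actual values."""
    grad_norms = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        # Under FSDP2 the gradient is a DTensor, so .full_tensor() is needed to
        # gather the shards; plain .item() would only return this rank's shard norm.
        grad_norms["grad_norm_" + name] = (
            param.grad.norm().full_tensor().item()
            if world_size > 1
            else param.grad.norm().item()
        )
    return grad_norms
```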

@Jubeku (Contributor, Author) commented Oct 13, 2025

@sophie-xhonneux, @tjhunter, should we keep plot_grad_norms.py in the utils/ folder, or should we rather move it to the private repo, or add a wiki page on options for plotting the grad norms?


Labels

model Related to model training or definition (not generic infra)

Development

Successfully merging this pull request may close these issues:

Log gradient norms to understand and debug model better (#688)

3 participants