
Conversation

@Jubeku (Contributor) commented Oct 9, 2025

Description

This PR is based on @sophie-xhonneux's log_grad_norm branch in #685, modified to allow logging gradients when running in parallel on multiple GPUs with FSDP2.

Issue Number

Closes #688

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions bot added the "model" label (Related to model training or definition (not generic infra)) on Oct 9, 2025
"""
self.last_grad_norm = (
total_norm.full_tensor().item() if self.cf.world_size > 1 else total_norm.item()
)
Jubeku (Contributor, Author):

As mentioned here, full_tensor().item() is needed in parallel runs with FSDP2.

I tested this by logging both ways of calculating the norm:

000 : 00010/02048 : 000010 : loss = 1.0287E+00 (lr=1.64E-06, gradient norm=0.983, gradient norm FT=1.403, s/sec=0.236)

ERA5 : 1.0287E+00 


000 : 00020/02048 : 000020 : loss = 1.0101E+00 (lr=3.34E-06, gradient norm=0.587, gradient norm FT=0.817, s/sec=0.435)

ERA5 : 1.0101E+00 
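
For context, a minimal sketch of the pattern (assuming a recent PyTorch where DTensor is exposed under torch.distributed.tensor; the helper name is only for illustration). Instead of branching on world_size as in the diff, one could also detect the DTensor type directly:

```python
import torch
from torch.distributed.tensor import DTensor  # torch >= 2.4; older releases expose it under torch.distributed._tensor

def norm_to_scalar(total_norm: torch.Tensor) -> float:
    """Turn a (possibly sharded) norm tensor into a plain Python float."""
    # Under FSDP2 the norm result is a DTensor; calling .item() on it only
    # reflects the local shard, so gather the full tensor before converting.
    if isinstance(total_norm, DTensor):
        total_norm = total_norm.full_tensor()
    return total_norm.item()
```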

if self.cf.world_size > 1
else param.grad.norm().item()
)

Jubeku (Contributor, Author):

Same as above: we also need .full_tensor().item() here in multi-GPU mode.

Tested it by printing both versions on 2 GPUs:

print(".item():", param.grad.norm().item())
print(".full_tensor().item()", param.grad.norm().full_tensor().item())

.item(): 0.028306283056735992
.item(): 0.022433193400502205
.full_tensor().item() 0.03611777722835541
.full_tensor().item() 0.03611777722835541
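
A quick check on those numbers (just arithmetic on the values printed above): with the parameter sharded across the two ranks, the per-rank .item() values are norms of disjoint shards, so the global norm should be their root-sum-square.

```python
import math

# Per-rank .item() values from the two-GPU run above
local_norms = [0.028306283056735992, 0.022433193400502205]
global_norm = math.sqrt(sum(n * n for n in local_norms))
print(global_norm)  # ≈ 0.036118, matching .full_tensor().item() on both ranks
```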

@Jubeku Jubeku self-assigned this Oct 9, 2025
grad_norms["grad_norm_" + name] = (
    param.grad.norm().full_tensor().item()
    if self.cf.world_size > 1
    else param.grad.norm().item()
)
Collaborator:

Shouldn't you divide by the number of items in the gradient? Otherwise, if every component in the gradient is equal, you are biased by batching computations.

Contributor:

Not sure I follow. But as far as I know the gradient norm logging is correct, and people do not commonly account for the number of dimensions; as for batching, this is handled in the forward pass and thus automatically dealt with during backprop.
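
To make the two quantities under discussion concrete, a small illustration (variable names are only for this sketch): the raw L2 norm that is logged grows with the number of elements when all components are equal, while a size-normalized variant along the lines of the question above does not.

```python
import torch

g = torch.full((1000,), 0.01)                 # gradient whose components are all equal
l2_norm = g.norm().item()                     # 0.01 * sqrt(1000) ≈ 0.316, scales with size
rms = (g.norm() / g.numel() ** 0.5).item()    # ≈ 0.01, independent of the element count
print(l2_norm, rms)
```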

@sophie-xhonneux (Contributor) left a comment:

Minor things to fix in the comments; I trust they will happen and thus already approve the PR. If logging is off, there should be no effect on the runs.


ae_local_dim_embed: 1024
ae_local_num_blocks: 2
ae_local_num_blocks: 0
Contributor:

Can we, as always, revert back to the original default config, please?

Log instantaneous grad norms, we do not average because of the cost and because we want to
measure the actual values
TODO test DDP case
Contributor:

Can you remove the TODO and copy your FSDP2 comment into the code? Thank you!
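
One way the per-parameter logging could read with the TODO dropped and the FSDP2 explanation inlined (a sketch only; the function wrapper and argument names are hypothetical, the expression itself follows the diff):

```python
import torch

def collect_grad_norms(model: torch.nn.Module, world_size: int) -> dict[str, float]:
    """Log instantaneous per-parameter grad norms; no averaging, we want the actual values."""
    grad_norms = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        # Under FSDP2 the gradient is a DTensor, so .full_tensor() is needed to
        # gather the shards; plain .item() would only return this rank's shard norm.
        grad_norms["grad_norm_" + name] = (
            param.grad.norm().full_tensor().item()
            if world_size > 1
            else param.grad.norm().item()
        )
    return grad_norms
```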

@Jubeku (Contributor, Author) commented Oct 13, 2025

@sophie-xhonneux, @tjhunter, should we keep plot_grad_norms.py in the utils/ folder, or should we rather move it to the private repo, or add a wiki page on options for plotting the grad norms?


Labels

model Related to model training or definition (not generic infra)

Development

Successfully merging this pull request may close these issues:

Log gradient norms to understand and debug model better (#688)

3 participants