
Conversation

@kacpnowak (Contributor)

Description

Fixes a crash when using the --forecast_finetune flag by materializing new modules before loading the checkpoint.
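
For context, a minimal sketch of the idea, assuming a plain PyTorch model whose new modules still hold meta-device parameters (the helper name and the key-diffing heuristic are illustrative, not the PR's actual code):

    import torch.nn as nn

    def materialize_new_modules(model: nn.Module, ckpt: dict, device: str = "cuda") -> None:
        # Parameters present in the model but absent from the checkpoint
        # belong to the newly added modules.
        new_keys = set(model.state_dict()) - set(ckpt)
        meta_params = [n for n, p in model.named_parameters()
                       if n in new_keys and p.is_meta]
        # Materialize each owning module exactly once.
        for owner_path in {n.rsplit(".", 1)[0] if "." in n else "" for n in meta_params}:
            owner = model.get_submodule(owner_path)
            # Replace meta storage with real, uninitialized storage;
            # loading a checkpoint into meta tensors is what crashes.
            owner.to_empty(device=device, recurse=False)
            if hasattr(owner, "reset_parameters"):
                owner.reset_parameters()  # fresh init for the new weights
        # Now the partial checkpoint can be loaded safely.
        model.load_state_dict(ckpt, strict=False)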

Issue Number

Closes #1029

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the Mattermost channels and/or a design doc
    • for changes of dependencies: the Mattermost software development channel

@clessig (Collaborator) commented Oct 6, 2025

The --forecast_finetune flag is deprecated.

@clessig (Collaborator) commented Oct 13, 2025

Is this ready for review?

@kacpnowak (Contributor, Author)

Yes

Review thread on the diff (context from the PR):

    model_state_dict = self.model.state_dict()
    params = {
Comment from a Contributor:

Can you add a comment explaining this?
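
For readers without the diff open, the pattern being asked about is roughly the following; the filtering condition and the checkpoint_state_dict name are assumptions, not the PR's actual code:

    model_state_dict = self.model.state_dict()
    # Keep only checkpoint entries that still exist in the current model,
    # so newly materialized modules keep their fresh initialization.
    params = {
        k: v for k, v in checkpoint_state_dict.items()
        if k in model_state_dict
    }
    self.model.load_state_dict(params, strict=False)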

Comment from a Contributor:

I want to test this with a couple of configs, and then I will approve it.

@sophie-xhonneux (Contributor) left a review:

This is a good fix. It is not yet fully complete: new encoders/decoders whose parameter shapes match the existing model's will still cause issues, but that cannot be solved until embedding engines, prediction heads, etc. are named after the streams they belong to. That issue is left for a new PR.
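
A toy illustration of the remaining gap (stream names and module layout are hypothetical):

    import torch
    import torch.nn as nn

    # Checkpoint trained with a single ERA5 embedding network.
    ckpt = {"embeds.0.weight": torch.randn(128, 64)}

    # Finetuning model where slot 0 is now a *new* SYNOP embedding of the
    # same shape: key and shape both match, so load_state_dict() silently
    # assigns the ERA5 weights to the SYNOP stream.
    model = nn.Module()
    model.embeds = nn.ModuleList([nn.Linear(64, 128, bias=False)])
    model.load_state_dict(ckpt, strict=False)  # no error, wrong weights

    # Naming modules by stream (e.g. "embeds.synop.weight") would turn
    # this silent mix-up into a detectable key mismatch.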

@sophie-xhonneux (Contributor)

Note: I tested with pre-training on ERA5 and continuing with ERA5, NPPATMS, & SYNOP on 2 GPUs.

@sophie-xhonneux merged commit aae0b8a into ecmwf:develop on Oct 13, 2025
5 checks passed
sophie-xhonneux pushed a commit that referenced this pull request Oct 17, 2025
* Add materialisation of new modules before loading checkpoint

* Initialize new modules in load_model

* Fix adding new embedding networks

Labels: none yet

Projects: Done

Development

Merging this pull request closes: Finetuning for forecasting leads to error with parameter materialization in FSDP2 (#1029)
