
Conversation

@moritzhauschulz (Contributor) commented Sep 23, 2025

Description

DRAFT PR for diffusion forecasting engine...

I summarise the changes below, with some open questions (Q) and some notes (NB).

Arguments:

(see diff_config.yml)

```yaml
forecast_policy: diffusion

loss_fcts_lat:
  -
    - "mse"
    - 1.0

fe_diff_sigma_min: 0.02
fe_diff_sigma_max: 88
fe_diff_sigma_data: 1  # I think in gencast this is hard coded to 1...
```

Components:

  • some changes copied (or adapted) from [Kerem’s PR](https://iffchat.fz-juelich.de/mlesm-dev/messages/@kerem_tezcan) in:
    • datasets/data_reader_base.py
    • datasets/multi_stream_data_sampler.py
    • datasets/stream_data.py
    • datasets/tokeniser_forecast.py
    • datasets/utils.py
    • train/loss_calculator.py
    • train/loss.py
    • train/trainer.py
    • train/trainer_base.py
    • model/model.py (for encoding target variables)
  • new class LinearNormConditioning in attention
    • this acts as a conditional scale-and-offset layer whose parameters depend on the noise level (a minimal sketch is given after this list)
    • added as noise_conditioning to MultiSelfAttentionHeadLocal and MultiSelfAttentionHeadGlobal
      • called after layer norm and prior to attention
        • Q: Should this replace the layer norm?
  • in ForecastingEngine added…
    • …new parameters:
      • sigma_min
      • sigma_max
      • sigma_data (this is set to 1, which assumes that the latent channels are normalised to unit variance; this is currently not implemented, since it clearly depends on the encoder, unlike in the GenCast case where normalisation is simply applied to the dataset; note, however, that we do apply this transformation prior to encoding)
    • …new layers:
      • map_noise, which embeds the noise level via a PositionalEmbedding layer
        • Q: Should we instead use the alternative embedding used in GenCast?
      • map_layer0 and map_layer1, two linear layers post-processing the embedding
        • Q: Why?
      • Q: Are we correctly freezing the model parameters for the new layers?
    • in Model class
      • add attributes P_mean, P_std, which determine the noise levels used to noise the target in training (NB: the noise used at sampling is separate)
        • Q: Why is that, and what is the difference?
      • in forward, add cases for forecast_policy == diffusion, distinguishing between training mode and inference mode
        • in both cases, the target (in latent space) has to be computed as the residual between source and target tokens
          • Q: What would be the impact if source and target tokens are based on different datasets?
        • in training mode: calls edm_denoise to obtain predicted denoised residuals
        • in non-training mode: calls edm_sampler to obtain predicted denoised residuals
        • NB: loss should be computed in latent space, so the relevant comparison is between the items in tokens_all and tokens_targets – yet, if loss_fcts are passed, physical losses can also be computed and added up.
        • also carrying over target embedding functions from Kerem (see above)
        • in forecast, added an optional argument noise_conditioning, which contains the noise that is passed to the blocks in diffusion mode
        • two new methods edm_denoise (called during training) and edm_sampler (called during inference)
          • edm_denoise
            • samples sigma as the exponential of a normal variable (mean P_mean, standard deviation P_std), i.e. sigma is log-normally distributed; sigma is the noise level per sample in the batch
            • sigma is then multiplied by a per-sample/cell/channel standard normal sample; this is the noise added to the target token
            • a per-sample weight is computed according to the formula from the EDM paper; this is later passed to the loss function (see the training-time sketch after this list)
            • finally calls edm_preconditioning , see below
          • edm_sampler
            • first computes the spacing of t_steps
            • then applies a (2nd-order) Heun sampling scheme using the conditioning tokens from the previous time step concatenated with normal noise (see the sampler sketch after this list)
              • note that at sampling time we are denoising pure noise
              • parameters taken from gencast supplementary
              • note also that this is taken from the EDM code; GenCast uses a variation of it…
          • both then call edm_preconditioning (see below) for the actual denoising
            • NB: might need to rename functions here to avoid confusion.
        • edm_preconditioning
          • this concatenates the conditioning tokens (i.e. the tokens from the previous time steps) with noised target (or the pure noise at inference time) and embeds the noise via the forecasting engine’s map_noise (PositionalEmbedding) layer
          • these are passed to the forecast method before the final output is obtained via the following formula, known from the EDM and GenCast papers:
            • $D_{\theta}\!\left( \mathbf{Z}^t_{\sigma}; \mathbf{X}^{t-1}, \sigma \right) := c_{\text{skip}}(\sigma) \cdot \mathbf{Z}^t_{\sigma} + c_{\text{out}}(\sigma) \cdot f_{\theta}\!\left( c_{\text{in}}(\sigma)\, \mathbf{Z}^t_{\sigma}; \mathbf{X}^{t-1}, c_{\text{noise}}(\sigma) \right)$
              • NB: the current model is not able to condition on $\mathbf{X}^{t-2}$ – this requires further work on the data pipeline…
      • small changes in loss_calculator.py
        • introduced weights_samples, which are computed in edm_denoise
          • NB: currently only implemented in mse_channel_location_weighted
        • some refactoring, mainly replacing preds with more comprehensive out
          • out contains preds, posteriors, weights, tokens_all, tokens_targets
            • where tokens_all are the predicted latents for each fstep and tokens_targets are the true encoded targets for each fstep
        • when training the diffusion model only in latent space, the loss calculation happens via the block following if self.loss_fcts_lat:
        • NB: caution should be taken not to specify loss_fcts in that case, though of course losses in physical space and latent space can be combined…
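
For orientation, here is a minimal sketch of the kind of conditional scale-and-offset layer that LinearNormConditioning is described as implementing. The class name is taken from the PR, but the constructor arguments, tensor shapes and the zero-initialisation are illustrative assumptions, not the actual implementation.

```python
import torch
from torch import nn

class LinearNormConditioning(nn.Module):
    """Sketch: per-channel scale and offset conditioned on a noise embedding.

    Intended to be applied after the layer norm and before attention, as
    described above; signature and shapes are illustrative assumptions.
    """

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # One linear layer produces both the scale and the offset.
        self.proj = nn.Linear(cond_dim, 2 * dim)
        # Zero-init so the module starts out as the identity (assumed choice).
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor, noise_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), noise_emb: (batch, cond_dim)
        scale, offset = self.proj(noise_emb).chunk(2, dim=-1)
        return x * (1.0 + scale.unsqueeze(1)) + offset.unsqueeze(1)
```

Regarding the question of whether this should replace the layer norm: one common pattern is to disable the layer norm's own affine parameters and let a module like this supply the scale and offset instead.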
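
To make the edm_denoise / edm_preconditioning steps concrete, below is a minimal sketch using the standard formulas from the EDM paper (noise sampling, loss weight, and the $c_{\text{skip}}, c_{\text{out}}, c_{\text{in}}, c_{\text{noise}}$ coefficients). The function name, the net callable and the default values for P_mean / P_std are placeholders, not the PR's actual interface.

```python
import torch

def edm_training_step(net, target, conditioning, p_mean=-1.2, p_std=1.2, sigma_data=1.0):
    """Sketch of EDM training-time noising and preconditioning (assumed interface)."""
    b = target.shape[0]
    # sigma = exp(N(P_mean, P_std^2)): one log-normal noise level per sample.
    sigma = (p_mean + p_std * torch.randn(b, device=target.device)).exp()
    sigma = sigma.view(b, *([1] * (target.ndim - 1)))  # broadcast over cells/channels

    # Per-sample loss weight lambda(sigma) from the EDM paper.
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2

    # Noise the latent target with a per-sample/cell/channel Gaussian sample.
    noised = target + sigma * torch.randn_like(target)

    # EDM preconditioning coefficients.
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4.0

    # D_theta(Z_sigma; X, sigma) = c_skip * Z_sigma + c_out * f_theta(c_in * Z_sigma; X, c_noise)
    denoised = c_skip * noised + c_out * net(c_in * noised, conditioning, c_noise)
    return denoised, weight
```

The returned weight corresponds to what loss_calculator.py consumes as weights_samples; in EDM the training loss is this weight times the squared error between the denoised output and the clean (latent) target.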
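
Similarly, a minimal sketch of the deterministic 2nd-order (Heun) sampler that edm_sampler is described as implementing, with the Karras-style spacing of t_steps. The denoise callable, the step count and rho are illustrative assumptions (the PR states its parameters come from the GenCast supplementary); sigma_min and sigma_max match the config above.

```python
import torch

@torch.no_grad()
def edm_heun_sampler(denoise, conditioning, shape, num_steps=20,
                     sigma_min=0.02, sigma_max=88.0, rho=7.0, device="cpu"):
    """Sketch of the deterministic 2nd-order (Heun) EDM sampler (assumed interface)."""
    # Karras-style spacing of noise levels between sigma_max and sigma_min.
    i = torch.arange(num_steps, device=device)
    t_steps = (sigma_max ** (1 / rho)
               + i / (num_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    t_steps = torch.cat([t_steps, torch.zeros(1, device=device)])  # final step to sigma = 0

    # At sampling time we start from pure noise at the largest noise level.
    x = torch.randn(shape, device=device) * t_steps[0]

    for t_cur, t_next in zip(t_steps[:-1], t_steps[1:]):
        # Euler step using the denoised estimate at the current noise level.
        d_cur = (x - denoise(x, conditioning, t_cur)) / t_cur
        x_next = x + (t_next - t_cur) * d_cur
        # 2nd-order (Heun) correction, skipped on the last step where t_next == 0.
        if t_next > 0:
            d_next = (x_next - denoise(x_next, conditioning, t_next)) / t_next
            x_next = x + (t_next - t_cur) * 0.5 * (d_cur + d_next)
        x = x_next
    return x
```

In terms of the PR, the denoise call corresponds to the preconditioned forecast (with the conditioning tokens from the previous time step), and the returned sample is the predicted latent residual that is added back to the previous state.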

Issue Number

Closes #702

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@moritzhauschulz changed the title from Issue702 to Diffusion v1 (closes #702) on Sep 25, 2025
@moritzhauschulz (Author) commented Sep 25, 2025

Description will be updated regularly.
