
Conversation

@susanbao (Collaborator) commented Sep 26, 2025

  • Implement the new eval pipeline so that it matches the one in MLPerf (a sketch of the computation follows this list).

    ALGORITHM: Validation Loss Computation
    
    INPUT:
      - validation_samples: set of validation data samples
    
    INITIALIZE:
      - sum[8]: array of zeros for accumulating losses
      - count[8]: array of zeros for counting samples per timestep
    
    FOR each (sample, t) in validation_samples:    # t is the timestep bucket index, 0..7
        loss = forward_pass(sample, timestep=t/8)
        sum[t] += loss
        count[t] += 1
    
    mean_per_timestep = sum / count
    validation_loss = mean(mean_per_timestep)
    
    RETURN validation_loss
    
  • Add a new parameter enable_ssim to help with debugging.
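
A minimal sketch of the validation-loss computation above, assuming 8 timestep buckets and keeping the helper name forward_pass from the pseudocode (illustrative only, not the PR's actual code):

    import numpy as np

    NUM_BUCKETS = 8  # timestep buckets, as in the pseudocode

    def validation_loss(validation_samples, forward_pass):
        """validation_samples yields (sample, t) pairs with integer t in [0, NUM_BUCKETS)."""
        loss_sum = np.zeros(NUM_BUCKETS)
        count = np.zeros(NUM_BUCKETS)
        for sample, t in validation_samples:
            # Evaluate the loss at the normalized timestep for this bucket.
            loss_sum[t] += forward_pass(sample, timestep=t / NUM_BUCKETS)
            count[t] += 1
        mean_per_timestep = loss_sum / count          # element-wise per-bucket mean
        return float(np.mean(mean_per_timestep))      # average of the per-bucket means

Averaging within each bucket before averaging across buckets keeps the metric stable even when the buckets receive different numbers of samples.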

@susanbao requested a review from coolkp on October 2, 2025 16:45
@coolkp (Collaborator) commented Oct 3, 2025

An idea for optimizing this PR: if I understand correctly, we have a nested loop where we split RNG keys first in the training loop and then again in the loop wrapping the loss function. It would be good to generate the keys upfront and move the key splitting out of the jitted code; since eval_step is a jitted function, splitting inside it also increases compile time. https://stackoverflow.com/a/75339951 . We should vectorize that and move the key generation out. You can give it a shot or create a TODO for me, and I can take a look at it later.
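A rough sketch of what the suggestion could look like, with placeholder names (eval_step, params, batches here are illustrative, not the PR's actual code): pre-split all RNG keys outside the jitted function and pass one key per step in as an argument, so no key splitting is traced inside eval_step.

    import jax
    import jax.numpy as jnp

    @jax.jit
    def eval_step(params, batch, rng):
        # The per-step key arrives as an argument; no jax.random.split inside the jitted code.
        noise = jax.random.normal(rng, batch.shape)
        return jnp.mean((batch + noise - params) ** 2)  # placeholder loss

    def run_eval(params, batches, rng):
        # Generate every per-step key upfront, outside any jitted/traced code.
        keys = jax.random.split(rng, len(batches))
        losses = [eval_step(params, batch, key) for batch, key in zip(batches, keys)]
        return jnp.mean(jnp.stack(losses))

If the batches stack to a uniform shape, the Python loop could further be replaced by jax.vmap over the stacked batches and keys, which is the vectorization mentioned above.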
