
Add dry-run data preflight mode in train_model#511

Open
AR10129 wants to merge 2 commits into mllam:main from AR10129:feat/dry-run-data-preflight

Conversation


@AR10129 AR10129 commented Mar 24, 2026

Describe your changes

This PR adds an optional dry-run data preflight path in the training CLI to fail fast on dataset/configuration errors before model and trainer initialization.

The new --dry_run_data flag validates one batch from the relevant dataloader(s), checking batch structure, expected tensor dimensions, finite values, forcing-window consistency, and strictly increasing target times.

This reduces late pipeline failures and debugging time for invalid data/window settings.

No new runtime dependencies are introduced.
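To make the described checks concrete, here is a minimal sketch of what such a one-batch preflight could look like. All names (`preflight_validate_batch`, the 4-tuple batch layout, the axis used for the forcing window) are assumptions for illustration, not the PR's actual implementation, and NumPy arrays stand in for the dataloader's torch tensors:

```python
import numpy as np


def preflight_validate_batch(batch, forcing_window_size):
    """Sketch of one-batch preflight checks (hypothetical names/shapes).

    NumPy arrays stand in for torch tensors; a real implementation
    would run the same checks on the dataloader's output.
    """
    # Structural check: assume the dataloader yields a 4-tuple
    # (init_states, target_states, forcing, target_times).
    if not (isinstance(batch, tuple) and len(batch) == 4):
        raise ValueError(f"expected a 4-tuple batch, got {type(batch)}")
    init_states, target_states, forcing, target_times = batch

    # Finite-value check: NaN/Inf should fail fast here, not deep
    # inside the first training step.
    for name, arr in [("init_states", init_states),
                      ("target_states", target_states),
                      ("forcing", forcing)]:
        if not np.isfinite(arr).all():
            raise ValueError(f"non-finite values in {name}")

    # Forcing-window consistency: assume (purely for illustration) that
    # the window length is the second axis of the forcing tensor.
    if forcing.shape[1] != forcing_window_size:
        raise ValueError("forcing window length mismatch")

    # Strictly increasing target times along the time axis.
    if not np.all(np.diff(target_times) > 0):
        raise ValueError("target_times are not strictly increasing")
```

Calling this once on the first batch before model/trainer construction is what makes the failure "early": the error points at the data contract rather than at an op deep inside the first training step.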

Issue Link

closes #510

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug
    • maintenance: when your contribution relates to repo maintenance, e.g. CI/CD or documentation

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • (if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone. If it is not, propose it for a future milestone.
  • author has added an entry to the changelog (and designated the change as added, changed, fixed or maintenance)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@Sir-Sloth-The-Lazy
Contributor

Thanks for this! A few observations from reading through:

  • Lightning already does this: num_sanity_val_steps (default 2) runs validation batches before training begins. Shape errors, NaNs, and dtype issues surface there with full stack traces pointing to the exact failing operation. What specific failure does this catch that sanity validation misses?

  • The checks hardcode assumptions: the batch must be a 4-tuple, init_states.shape[1] must be 2, target_times must be integer dtype. If the data pipeline evolves (e.g. more history steps, ensemble dims), this validation code will need updating alongside the actual pipeline code. That's extra maintenance burden.

  • Some checks can't fail in practice: WeatherDataset produces target times by slicing contiguous time steps, so they're monotonic by construction. The forcing_window_size > 0 check also can't fail, since the value is computed as past + future + 1.

  • The unsqueeze logic (lines 39-48) is concerning: validation code probably shouldn't be reshaping inputs to handle multiple input formats.

This is just to take some load off the reviewers! Hope you find this constructive! @AR10129 Pardon me if I missed something; I would be grateful to learn!
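The "can't fail by construction" point can be illustrated with a small sketch (indices and variable names here are illustrative stand-ins, not the actual WeatherDataset code): target times taken by slicing a contiguous time axis are strictly increasing for any valid offset, so a monotonicity check on them is vacuous.

```python
import numpy as np

# Stand-in for a contiguous time axis; WeatherDataset's target times
# come from slicing such a run, so monotonicity holds by construction.
all_times = np.arange(100)

for start in range(0, 95):        # any hypothetical sample offset
    ar_steps = 5                  # hypothetical rollout length
    target_times = all_times[start : start + ar_steps]
    # The slice inherits the ordering of the underlying axis, so this
    # assertion can never fire for contiguous slicing.
    assert np.all(np.diff(target_times) > 0)
```

The same reasoning applies to forcing_window_size: a value computed as past + future + 1 with non-negative past/future is at least 1, so a "> 0" check on it never triggers.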

@AR10129
Author

AR10129 commented Mar 24, 2026

Thanks for the thorough review, these are fair points. Let me address each one:

  1. On num_sanity_val_steps: agreed there is overlap. The intended difference is that --dry_run_data validates data contracts before model and trainer setup and supports explicit preflight workflows. I will keep only checks that add value beyond Lightning sanity validation.

  2. On hardcoded assumptions: fair point. I’ll remove assumptions that are not strict data contracts and derive expectations from runtime config/datastore where possible.

  3. On checks that can’t fail in current construction: agreed. I’ll remove the monotonic-time and forcing-window-positive checks.

  4. On unsqueeze logic: agreed. I’ll drop compatibility reshaping and validate canonical batched format only.

I’ll push a follow-up cleanup commit with these changes.
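One way to implement point 2 above: derive the check parameters from the runtime config instead of hardcoding them. This is a hypothetical sketch (the dataclass, function, and parameter names are illustrative, not the repo's actual config API):

```python
from dataclasses import dataclass


@dataclass
class PreflightExpectations:
    """Check parameters derived from config, not hardcoded."""
    num_init_steps: int   # e.g. follows the model's history length
    forcing_window: int   # past + future + 1, per the forcing config


def expectations_from_config(num_past_forcing: int,
                             num_future_forcing: int,
                             num_history_steps: int) -> PreflightExpectations:
    """Build preflight expectations from config values so the
    validation code tracks the pipeline automatically."""
    return PreflightExpectations(
        num_init_steps=num_history_steps,
        forcing_window=num_past_forcing + num_future_forcing + 1,
    )
```

With this shape, a change such as adding more history steps updates the validation thresholds through the config rather than requiring a parallel edit in the check code.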



Development

Successfully merging this pull request may close these issues.

Add --dry_run_data preflight mode in train_model for early dataset validation failures
