
Add dry-run data preflight mode in train_model#511

Open
AR10129 wants to merge 2 commits into mllam:main from AR10129:feat/dry-run-data-preflight

Conversation


@AR10129 AR10129 commented Mar 24, 2026

Describe your changes

This PR adds an optional dry-run data preflight path in the training CLI to fail fast on dataset/configuration errors before model and trainer initialization.

The new --dry_run_data flag validates one batch from the relevant dataloader(s), checking batch structure, expected tensor dimensions, finite values, forcing-window consistency, and strictly increasing target times.

This reduces late pipeline failures and debugging time for invalid data/window settings.

No new runtime dependencies are introduced.
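To make the described checks concrete, here is a minimal sketch of what such a one-batch preflight could look like. All names (`preflight_validate_batch`, the 4-tuple batch layout, the axis used for the forcing window) are assumptions for illustration, not the PR's actual implementation, and NumPy arrays stand in for the dataloader's torch tensors:

```python
import numpy as np


def preflight_validate_batch(batch, forcing_window_size):
    """Sketch of one-batch preflight checks (hypothetical names/shapes).

    NumPy arrays stand in for torch tensors; a real implementation
    would run the same checks on the dataloader's output.
    """
    # Structural check: assume the dataloader yields a 4-tuple
    # (init_states, target_states, forcing, target_times).
    if not (isinstance(batch, tuple) and len(batch) == 4):
        raise ValueError(f"expected a 4-tuple batch, got {type(batch)}")
    init_states, target_states, forcing, target_times = batch

    # Finite-value check: NaN/Inf should fail fast here, not deep
    # inside the first training step.
    for name, arr in [("init_states", init_states),
                      ("target_states", target_states),
                      ("forcing", forcing)]:
        if not np.isfinite(arr).all():
            raise ValueError(f"non-finite values in {name}")

    # Forcing-window consistency: assume (purely for illustration) that
    # the window length is the second axis of the forcing tensor.
    if forcing.shape[1] != forcing_window_size:
        raise ValueError("forcing window length mismatch")

    # Strictly increasing target times along the time axis.
    if not np.all(np.diff(target_times) > 0):
        raise ValueError("target_times are not strictly increasing")
```

Calling this once on the first batch before model/trainer construction is what makes the failure "early": the error points at the data contract rather than at an op deep inside the first training step.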

Issue Link

closes #510

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug
    • maintenance: when your contribution relates to repo maintenance, e.g. CI/CD or documentation

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • (if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone. If it is not, propose it for a future milestone.
  • author has added an entry to the changelog (and designated the change as added, changed, fixed or maintenance)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@Sir-Sloth-The-Lazy
Contributor

Thanks for this! A few observations from reading through:

  • Lightning already does this: num_sanity_val_steps (default 2) runs validation batches before training begins. Shape errors, NaNs, and dtype issues surface there with full stack traces pointing to the exact failing operation. What specific failure does this catch that sanity validation misses?

  • The checks hardcode assumptions: the batch must be a 4-tuple, init_states.shape[1] must be 2, target_times must be integer dtype. If the data pipeline evolves (e.g. more history steps, ensemble dims), this validation code will need updating alongside the actual pipeline code. That's extra maintenance burden.

  • Some checks can't fail in practice: WeatherDataset produces target times by slicing contiguous time steps, so they're monotonic by construction. The forcing_window_size > 0 check also can't fail, since the value is computed as past + future + 1.

  • The unsqueeze logic (lines 39-48) is concerning: validation code probably shouldn't be reshaping inputs to handle multiple input formats.

This is just to take some load off the reviewers! Hope you find this constructive! @AR10129 Pardon me if I missed something; I would be grateful to learn!
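The "can't fail by construction" point can be illustrated with a small sketch (indices and variable names here are illustrative stand-ins, not the actual WeatherDataset code): target times taken by slicing a contiguous time axis are strictly increasing for any valid offset, so a monotonicity check on them is vacuous.

```python
import numpy as np

# Stand-in for a contiguous time axis; WeatherDataset's target times
# come from slicing such a run, so monotonicity holds by construction.
all_times = np.arange(100)

for start in range(0, 95):        # any hypothetical sample offset
    ar_steps = 5                  # hypothetical rollout length
    target_times = all_times[start : start + ar_steps]
    # The slice inherits the ordering of the underlying axis, so this
    # assertion can never fire for contiguous slicing.
    assert np.all(np.diff(target_times) > 0)
```

The same reasoning applies to forcing_window_size: a value computed as past + future + 1 with non-negative past/future is at least 1, so a "> 0" check on it never triggers.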

@AR10129
Author

AR10129 commented Mar 24, 2026

Thanks for the thorough review, these are fair points. Let me address each one:

  1. On num_sanity_val_steps: agreed there is overlap. The intended difference is that --dry_run_data validates data contracts before model and trainer setup and supports explicit preflight workflows. I will keep only checks that add value beyond Lightning sanity validation.

  2. On hardcoded assumptions: fair point. I’ll remove assumptions that are not strict data contracts and derive expectations from runtime config/datastore where possible.

  3. On checks that can’t fail in current construction: agreed. I’ll remove the monotonic-time and forcing-window-positive checks.

  4. On unsqueeze logic: agreed. I’ll drop compatibility reshaping and validate canonical batched format only.

I’ll push a follow-up cleanup commit with these changes.
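One way to implement point 2 above: derive the check parameters from the runtime config instead of hardcoding them. This is a hypothetical sketch (the dataclass, function, and parameter names are illustrative, not the repo's actual config API):

```python
from dataclasses import dataclass


@dataclass
class PreflightExpectations:
    """Check parameters derived from config, not hardcoded."""
    num_init_steps: int   # e.g. follows the model's history length
    forcing_window: int   # past + future + 1, per the forcing config


def expectations_from_config(num_past_forcing: int,
                             num_future_forcing: int,
                             num_history_steps: int) -> PreflightExpectations:
    """Build preflight expectations from config values so the
    validation code tracks the pipeline automatically."""
    return PreflightExpectations(
        num_init_steps=num_history_steps,
        forcing_window=num_past_forcing + num_future_forcing + 1,
    )
```

With this shape, a change such as adding more history steps updates the validation thresholds through the config rather than requiring a parallel edit in the check code.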



Development

Successfully merging this pull request may close these issues.

Add --dry_run_data preflight mode in train_model for early dataset validation failures
