Skip to content

refactor: model class hierarchy into Forecaster/StepPredictor layers#513

Closed
parth000007 wants to merge 1 commit intomllam:mainfrom
parth000007:claude/refactor-model-class-hierarchy-again
Closed

refactor: model class hierarchy into Forecaster/StepPredictor layers#513
parth000007 wants to merge 1 commit intomllam:mainfrom
parth000007:claude/refactor-model-class-hierarchy-again

Conversation

@parth000007
Copy link
Copy Markdown

@parth000007 parth000007 commented Mar 24, 2026

Refactors the monolithic ARModel class into a composable hierarchy:

  • ForecasterModule (pl.LightningModule): Training loop, metrics, plotting
  • ARForecaster (nn.Module): Auto-regressive unrolling
  • StepPredictor (nn.Module): Single-step prediction interface
  • BaseGraphModel inherits StepPredictor instead of ARModel

This separation enables:

  • Non-autoregressive forecasters
  • New step predictor architectures (e.g. Vision Transformers)
  • Ensemble strategies without modifying training infrastructure

Also fixes two pre-existing bugs:

  • interior_mask_bool shape (1,) → (N,) for correct loss masking
  • all_gather_cat dimension collapse on single-device runs

Refs #49

Describe your changes

< Summary of the changes.>

< Please also include relevant motivation and context. >

< List any dependencies that are required for this change. >

Issue Link

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug
    • maintenance: when your contribution is relates to repo maintenance, e.g. CI/CD or documentation

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • (if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone. If it is not, propose it for a future milestone.
  • author has added an entry to the changelog (and designated the change as added, changed, fixed or maintenance)
  • Once the PR is ready to be merged, squash commits and merge the PR.

Refactors the monolithic ARModel class into a composable hierarchy:
- ForecasterModule (pl.LightningModule): Training loop, metrics, plotting
- ARForecaster (nn.Module): Auto-regressive unrolling
- StepPredictor (nn.Module): Single-step prediction interface
- BaseGraphModel inherits StepPredictor instead of ARModel

This separation enables:
- Non-autoregressive forecasters
- New step predictor architectures (e.g. Vision Transformers)
- Ensemble strategies without modifying training infrastructure

Also fixes two pre-existing bugs:
- interior_mask_bool shape (1,) → (N,) for correct loss masking
- all_gather_cat dimension collapse on single-device runs

Refs mllam#49

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
@parth000007
Copy link
Copy Markdown
Author

hi @j6k4m8 @leifdenby @kshirajahere can you please check this PR its benefits it attach here by:-
Uploading Screenshot 2026-03-24 at 7.04.36 PM.png…

@kshirajahere
Copy link
Copy Markdown
Contributor

Hey @parth000007 there is an issue with the PR description it has parts of the placeholder of template, please get rid of them..
Also I dont know if it is just issue with my internet but i am not able to see the screenshot.

Copy link
Copy Markdown
Contributor

@kshirajahere kshirajahere Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is undoing some recent changes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT : Meant to tag ./Agents.md

@kshirajahere
Copy link
Copy Markdown
Contributor

This seems too large and overlapping to review as a separate parallel refactor PR. Since it targets the same #49 hierarchy lane as #444, I think consolidation there would be better than keeping two full refactor paths open.

@Sir-Sloth-The-Lazy
Copy link
Copy Markdown
Contributor

Sir-Sloth-The-Lazy commented Mar 24, 2026

Hi @parth000007, I noticed this PR overlaps significantly with #208 which covers the same ARModel → ForecasterModule/ARForecaster/StepPredictor refactor and is currently under review. Could you clarify how this differs or if it was intended to supersede #208? cc @joeloskarsson @sadamov @leifdenby @observingClouds

uses: astral-sh/setup-uv@v7
- name: Build with uv
run: uv build
- uses: actions/setup-python@v5
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot understand why we have to do this ?

f"SLURM_JOB_NODELIST is set to {repr(nodelist)}, but "
"'scontrol show hostnames' returned no hostnames. "
"Please check your SLURM job configuration."
master_node = (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the previous version provided more security against a command injection vulnerability

@sadamov
Copy link
Copy Markdown
Collaborator

sadamov commented Mar 25, 2026

@parth000007 such a massive PR certainly warrants some discussion in #49 before implementation. Especially why you think we need another approach to #208. I'd ask you to contribute to #208 instead and introduce your ideas through code reviews, comments or PRs into that branch instead. thanks!

@sadamov sadamov closed this Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants