
Fix probabilistic variance underflow#527

Draft
Panchadip-128 wants to merge 2 commits into mllam:main from Panchadip-128:fix-probabilistic-variance-underflow

Conversation


@Panchadip-128 Panchadip-128 commented Mar 27, 2026

Describe your changes

This PR resolves two critical numerical and logical flaws in the probabilistic forecasting (--output_std) engine, and adds regression tests covering both:

  1. Fixed softplus underflow (NaN crashes): In neural_lam/models/base_graph_model.py, the standard deviation output is now clamped to a minimum of 1e-6. This prevents the network from producing a machine-zero variance, which previously caused division-by-zero errors in the NLL and CRPS metrics and led to irreversible NaN training losses.
  2. Corrected Ensemble Sampling: The ARModel._sample_ensemble method in neural_lam/models/ar_model.py was previously discarding the model's predicted uncertainty map in favor of a hardcoded 0.01 noise fallback. This has been refactored to utilize the model's dynamically predicted pred_std for physically grounded ensemble generation.
  3. Regression Testing: Added test_base_graph_model_prevents_softplus_underflow_nans and test_ar_model_ensemble_samples_from_pred_std to tests/test_probabilistic_forecasting.py to assert that these mathematical stability and logic requirements are met.
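To illustrate the underflow fix in item 1, here is a minimal self-contained sketch of the clamping pattern. The function name and signature are hypothetical, not the actual neural_lam/models/base_graph_model.py code; it only demonstrates why softplus alone is unsafe and how a 1e-6 floor repairs it.

```python
import torch
import torch.nn.functional as F

def predicted_std(raw_output: torch.Tensor, min_std: float = 1e-6) -> torch.Tensor:
    """Map raw network output to a strictly positive standard deviation.

    softplus(x) = log(1 + exp(x)) underflows to exactly 0.0 in float32 for
    sufficiently negative x, so the NLL's division by sigma (and log(sigma))
    produces inf/NaN. Clamping to a small floor keeps the loss finite.
    """
    std = F.softplus(raw_output)
    return torch.clamp(std, min=min_std)

# Unclamped softplus underflows to machine zero for very negative inputs:
raw = torch.tensor([-1000.0, 0.0, 5.0])
unsafe = F.softplus(raw)   # first element is exactly 0.0
safe = predicted_std(raw)  # floored at 1e-6, log(safe) stays finite
```

The floor value trades a tiny bias in the predicted uncertainty for guaranteed finiteness of log-based losses; 1e-6 (as chosen in the PR) is far below any physically meaningful spread for normalized atmospheric variables.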
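The ensemble-sampling fix in item 2 can be sketched as follows. This is an illustrative stand-alone function, not the actual ARModel._sample_ensemble implementation; the point is that perturbations are scaled by the model's predicted pred_std rather than a hardcoded constant.

```python
import torch

def sample_ensemble(pred_mean: torch.Tensor,
                    pred_std: torch.Tensor,
                    num_members: int) -> torch.Tensor:
    """Draw ensemble members from N(pred_mean, pred_std**2).

    The old behavior added fixed 0.01 noise to every member, flattening the
    ensemble spread regardless of the model's own confidence. Scaling by
    pred_std makes low-confidence regions spread wider than confident ones.
    """
    # One i.i.d. standard-normal draw per member, broadcast over state dims
    eps = torch.randn((num_members,) + pred_mean.shape)
    return pred_mean.unsqueeze(0) + pred_std.unsqueeze(0) * eps
```

A quick sanity check: with pred_std = [0.1, 10.0], the empirical spread across members should track those values per variable instead of collapsing to a uniform 0.01.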
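The regression tests in item 3 presumably assert along these lines; the sketch below is a simplified, self-contained analogue (not the actual tests/test_probabilistic_forecasting.py code) showing how one can verify that the clamped std keeps the Gaussian NLL finite even when the raw output would underflow.

```python
import torch

def test_clamped_std_keeps_nll_finite():
    """NLL must stay finite when softplus would underflow to zero variance."""
    # Raw outputs negative enough that softplus underflows to exactly 0.0
    raw = torch.full((4,), -1000.0)
    std = torch.clamp(torch.nn.functional.softplus(raw), min=1e-6)
    mean = torch.zeros(4)
    target = torch.zeros(4)
    # Gaussian negative log-likelihood; would be inf/NaN with std == 0
    nll = torch.distributions.Normal(mean, std).log_prob(target).neg().mean()
    assert torch.isfinite(nll)
```

Pinning the behavior in a test like this guards against a future refactor silently dropping the clamp.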

Motivation and Context: These bugs made probabilistic training unstable and produced deceptively uniform ensemble spreads that ignored the model's internal confidence.

Dependencies: No new dependencies.

Issue Link

solves #526

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch (upstream/main rebased and force-pushed).
  • I have performed a self-review of my code.
  • For any new/modified functions/classes I have added docstrings.
  • I have placed in-line comments to clarify intent.
  • I have updated the README (Not applicable as these are internal bug fixes).
  • I have added tests that prove my fix is effective.
  • I have given the PR a name that clearly describes the change.

Author checklist after completed review

  • I have added a line to the CHANGELOG.md describing this change.
    • fixes: softplus variance underflow and ensemble sampling logic.

@Panchadip-128 Panchadip-128 force-pushed the fix-probabilistic-variance-underflow branch from e6e74b2 to bb30e20 on March 27, 2026 at 20:37
@Panchadip-128 Panchadip-128 marked this pull request as ready for review March 27, 2026 20:48
@Panchadip-128 Panchadip-128 marked this pull request as draft March 27, 2026 20:48
@ronilmitra7

Nice catch on the NaN crashes! I’ve definitely been there with PINNs and seen how one zero variance can just blow up a whole training run. Clamping at 1e-6 is a solid move for keeping the NLL/CRPS stable.

Also, good to see that 0.01 noise replaced with the actual pred_std in the ensemble logic. I’m curious, did you run into these NaN crashes mostly during the initial epochs or during longer auto-regressive rollouts?

