Fix checkpoint resume for Monte Carlo GCS #100

@mmaclay

Description

Problem

Checkpoint resume for Monte Carlo GCS does not currently preserve nadir-equivalent error, because that value is only calculated in the subsequent error-stats step and is never written to the NetCDF files. Without the saved error data, error statistics cannot be aggregated accurately across multiple runs.

Proposed Improvements

  1. Calculate and Write Nadir-Equivalent Error:
    • Implement or port the nadir-equivalent error calculation logic from geolocation_error_stats.py into image_match.py.
    • Ensure this error statistic is computed for each checkpoint and written to the NetCDF output during simulation.
  2. Loop Integration:
    • Integrate the calculation step directly into the main simulation loop so that every checkpoint carries up-to-date nadir-equivalent error values.
    • At the end of the simulation batch, run the full error statistics aggregation for all completed and resumed runs.
  3. Support Checkpoint Resume:
    • When resuming from a checkpoint, read previously saved nadir-equivalent errors from NetCDF to incorporate them into aggregated statistics.
    • Ensure aggregation gives correct results whether the resumed batch covers only part of the runs or all of them.
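
The flow proposed above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `run_batch`, `load_saved_errors`, `save_errors` and the `compute_ne_error` callable are hypothetical names, a JSON file stands in for the NetCDF checkpoint so the example stays self-contained, and the count/mean/RMS summary is a placeholder for whatever the full error-stats step actually reports.

```python
import json
import math
from pathlib import Path
from typing import Callable, Dict


def load_saved_errors(path: Path) -> Dict[str, float]:
    """Return {iteration index: error} from an earlier run, or {} if none exists."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def save_errors(path: Path, errors: Dict[str, float]) -> None:
    """Write to a temporary file first so an interruption cannot leave a half-written checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(errors))
    tmp.replace(path)


def run_batch(
    n_iterations: int,
    checkpoint_path: Path,
    compute_ne_error: Callable[[int], float],  # hypothetical stand-in for the real calculation
) -> Dict[str, float]:
    if n_iterations <= 0:
        raise ValueError("n_iterations must be positive")

    # Resume: start from whatever an earlier run already saved.
    errors = load_saved_errors(checkpoint_path)

    for i in range(n_iterations):
        key = str(i)  # JSON object keys are strings
        if key in errors:
            continue  # already computed before the interruption
        errors[key] = compute_ne_error(i)  # calculated inside the loop
        save_errors(checkpoint_path, errors)  # checkpoint includes the new value

    # Full aggregation runs once, over saved and newly computed values alike.
    values = [errors[str(i)] for i in range(n_iterations)]
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "rms": math.sqrt(sum(v * v for v in values) / len(values)),
    }
```

Because the aggregate is built from the checkpoint contents rather than from in-memory state, an interrupted batch and an uninterrupted one end up with the same inputs to the final statistics.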

Implementation Plan

  • Audit existing error metrics code in geolocation_error_stats.py and image_match.py to identify what needs porting or integration.
  • Refactor the code so nadir-equivalent error calculation is available and called within the simulation loop where checkpoints are written.
  • Update NetCDF output handling to include nadir-equivalent error for each checkpoint.
  • Develop logic to load and aggregate previous checkpoint errors when resuming simulations.
  • Add or update unit tests to cover resuming from a checkpoint and aggregating error statistics across resumed runs.
  • Document the new error calculation workflow and update relevant usage guides.
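
One property the unit tests could check is that a batch which is interrupted and then resumed aggregates to the same result as a batch that runs straight through. The sketch below shows that shape with `unittest`. Everything in it is hypothetical: `simulate` and `aggregate` are in-memory stand-ins for the simulation loop and the error-stats step, and the error values are placeholders rather than real nadir-equivalent errors.

```python
import unittest
from typing import Dict, Optional


def simulate(n_runs: int, saved: Dict[int, float],
             stop_after: Optional[int] = None) -> Dict[int, float]:
    """Fill in errors for runs missing from `saved`.

    `stop_after` limits how many new runs are computed, imitating an interruption.
    """
    done = 0
    for i in range(n_runs):
        if i in saved:
            continue  # already present from an earlier (checkpointed) run
        if stop_after is not None and done >= stop_after:
            break
        saved[i] = 0.5 * i  # placeholder value, not the real metric
        done += 1
    return saved


def aggregate(errors: Dict[int, float]) -> Dict[str, float]:
    """Placeholder aggregate statistics over all available runs."""
    values = [errors[k] for k in sorted(errors)]
    return {
        "count": len(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


class ResumeAggregationTest(unittest.TestCase):
    def test_resumed_matches_uninterrupted(self):
        full = aggregate(simulate(10, {}))
        partial = simulate(10, {}, stop_after=4)
        self.assertEqual(len(partial), 4)
        resumed = aggregate(simulate(10, partial))
        self.assertEqual(resumed, full)

    def test_partial_aggregation(self):
        partial = simulate(10, {}, stop_after=4)
        self.assertEqual(aggregate(partial)["count"], 4)
```

Saved in a test module, this runs with `python -m unittest`. In the real suite the stand-ins would be replaced by the actual loop and by checkpoints read back from NetCDF.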

Background:
Aggregate statistics in post-processing are built from the nadir-equivalent error metric. Once that metric is stored with each checkpoint, error values from previous runs can be included in the aggregate directly. Tying the calculation to periodic checkpointing should make resume more robust and simplify downstream error analysis.


Additional Notes:

  • Review whether any non-obvious dependencies between error calculation and NetCDF writing need explicit handling.
  • Confirm compatibility with the current checkpoint format and resume workflow.
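
On the compatibility point: checkpoints written before this change will presumably lack the nadir-equivalent error field, so the resume path may need to tell them apart from newer ones. A minimal sketch of that check follows, under stated assumptions: each checkpoint is modelled as a plain mapping from variable name to values, the variable name `nadir_equiv_error` is invented for illustration, and what to do with older checkpoints (recompute or exclude) is left open.

```python
from typing import List, Mapping, Sequence, Tuple

NE_VAR = "nadir_equiv_error"  # hypothetical variable name, not taken from the codebase


def split_checkpoints(
    checkpoints: Sequence[Mapping[str, Sequence[float]]],
) -> Tuple[List[float], List[int]]:
    """Return (errors that can be reused, indices of checkpoints lacking the field)."""
    reusable: List[float] = []
    missing: List[int] = []
    for idx, variables in enumerate(checkpoints):
        if NE_VAR in variables:
            reusable.extend(variables[NE_VAR])
        else:
            # Written before the field existed: needs recomputation or exclusion.
            missing.append(idx)
    return reusable, missing


old_style = {"lat": [1.0], "lon": [2.0]}
new_style = {"lat": [1.0], "lon": [2.0], NE_VAR: [12.5, 7.25]}
errors, needs_recompute = split_checkpoints([old_style, new_style])
print(errors, needs_recompute)
```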

Original issue below for context.


We need nadir-equivalent error to be calculated and written to NetCDF so that we can resume and include previous runs in the error stats calculations.

We should explore porting the nadir-equivalent error calculation step from geolocation_error_stats.py to image_match.py, or at least calling that calculation inside the loop, and then running the full error stats at the end of the loop (for aggregate stats).

Labels: bug