Skip to content

Fix MAG L1C gap-fill bugs for validation tests T015 and T016#2899

Open
sapols wants to merge 3 commits intoIMAP-Science-Operations-Center:devfrom
sapols:mag-l1c-fix
Open

Fix MAG L1C gap-fill bugs for validation tests T015 and T016#2899
sapols wants to merge 3 commits intoIMAP-Science-Operations-Center:devfrom
sapols:mag-l1c-fix

Conversation

@sapols
Copy link
Copy Markdown

@sapols sapols commented Apr 1, 2026

Summary

Six localized bug fixes to the existing MAG L1C gap-filling code. These fixes un-skip and pass validation tests T015 and T016 (both mago and magi) while preserving the existing passes on T013, T014, and T024. All 10 L1C validation cases now pass with exact row-count matches against the checked-in expected outputs.

This PR intentionally keeps the current production algorithm shape. It does not rewrite the gap-fill pipeline, add new dataclasses, or replace the simplified L1C flow that was developed collaboratively between SDC and the MAG team. It only patches the specific correctness bugs that were causing T015 and T016 to fail.

Background

What was failing and why

Current dev skips T015 and T016 because they produce wrong output. Running them un-skipped shows three concrete failures:

Test Sensor Failure Root cause fixes
T015 magi +5 extra rows Fix 5 alone
T016 mago -663 missing rows Fix 1 + Fix 3 + Fix 4 jointly
T016 magi +330 extra rows Fix 1 + Fix 3 + Fix 4 jointly

T015-mago and T024 (both sensors) already passed on dev.

The validation tests exercise increasingly complex mode transitions:

  • T013/T014: Simple NM-to-BM-to-NM transitions, no Config mode. Already passing because the NM rate is always 2 Hz (matching the old hardcoded cadence) and there are no Config-mode boundary artifacts.
  • T015: NM-to-BM-to-NM with a Config mode transition at the start of burst. NM rate changes from 2 Hz to 4 Hz (mago) or 2 Hz to 1 Hz (magi).
  • T016: NM-to-BM-to-NM with a Config mode transition at the end of burst. NM rate changes from 4 Hz to 2 Hz (mago) or 1 Hz to 2 Hz (magi).
  • T024: Burst-only, no normal data at all.

Why T015 magi failed (+5 rows): When MAG transitions through Config mode, the vectors_per_second metadata timestamp doesn't perfectly align with where the cadence actually changes in the data. The old find_all_gaps() split the timeline at the declared metadata timestamp, and find_gaps() misidentified the boundary misalignment as 1-sample phantom gaps, generating 5 spurious gap-fill rows. T015 mago was unaffected because its Config transition boundary happened to align cleanly. Fix 5 (_find_rate_segments walkback) eliminates these phantom gaps at the root.

Why T016 failed (-663 mago / +330 magi rows): Three bugs interacted to produce the T016 failures. No single fix alone was sufficient:

  • Fix 1 (rate-aware cadence): Gap-fill timestamps were generated at a hardcoded 0.5s (2 Hz) regardless of the actual NM rate. T016 mago has a 4 Hz pre-gap rate (needs 0.25s cadence) and T016 magi has a 1 Hz pre-gap rate (needs 1.0s cadence). The wrong cadence produced the wrong number of gap-fill timestamps. Reverting Fix 1 alone: T016 mago -662, T016 magi +332.
  • Fix 3 (off-by-one burst endpoint): The interpolation window silently dropped the last burst sample, producing wrong interpolated values near the burst-resume boundary. Reverting Fix 3 alone: row counts matched, but first value mismatch at index 2475 (mago) and 618 (magi).
  • Fix 4 (CIC filter clipping): The CIC delay compensation clipped the final interpolated timestamp at the tail of the burst region. Reverting Fix 4 alone: T016 mago -1, T016 magi -1.

Reverting all three together reproduces the exact dev failures (-663 mago, +330 magi), confirming they interact non-linearly.

Why T024 was already passing but needed Fix 6: T024 passed on dev because it's burst-only and doesn't exercise the broken transition paths. However, Fix 4 (extrapolate=True) causes T024 to produce one extra trailing sample. Fix 6 (burst-only trailing-edge trim) compensates for this, preventing a regression.

How this was investigated

This issue was investigated using two AI coding agents (Claude Code and Codex) in parallel, guided by manual review. The process was:

  1. Initial exploration: Both agents independently identified the same set of bugs from the failing tests. Their proposed fixes overlapped significantly but differed in approach.
  2. Spec review: The MAG Science Algorithm Document (IMAP-MAG-SW-009-01B, i5r1) and the SDC Data Validation Document (IMAP-OPS-TP-ICL-001) were consulted. Both agents initially proposed full spec-aligned rewrites of section 7.3.4 step 3, but this was scaled back after MAG team feedback (see below).
  3. MAG team feedback: @alastairtree and @maxinelasp clarified that the current code was developed in collaboration with the MAG team and that deviations from the algorithm document may be intentional simplifications. The team requested a conservative fix rather than a full rewrite.
  4. Minimal patches: Both agents produced minimal draft PRs that preserved the existing algorithm shape. These were iteratively compared and refined until both converged on identical code.
  5. CDF validation on flight data: The code was run on real commissioning data (2025-09-27, 2025-09-28, 2025-10-03, 2025-12-05, 2025-12-06) to compare outputs between dev and the patched code. An exhaustive search of the MAG L1B archive through 2026-03-31 found only 8 same-day norm+burst sensor-day cases. Unfortunately, none of these real-flight cases separated current dev from the patched code (they both produced identical CDFs; the available real data does not exercise the exact broken transition patterns).
  6. Semi-synthetic validation: Because real flight data was insufficient to demonstrate the fix, semi-synthetic input data was derived from real 2025-09-27 MAG transitions to create cases that specifically exercise the T015/T016 failure modes. These cases demonstrated clear semantic differences between dev and the patched code, with dev producing incorrect row counts and the patched code matching expectations.

Relationship to the algorithm document

The MAG Science Algorithm Document (section 7.3.4) describes a six-step L1C process. The current code implements a simplified version of these steps that was developed collaboratively between SDC and the MAG team. This PR preserves that simplified implementation. Specifically:

Step 3 (Generate a new NM time sequence) is the area with the largest gap between the document and the code. The document describes a burst-driven approach: find the BM anchor time tC closest to the gap start tA, decimate BM timestamps, shift by (tA - tC), and trim. The current code instead generates gap timestamps arithmetically from the gap boundaries at the appropriate NM cadence, then interpolates burst data onto those timestamps.

Both approaches produce a regular NM time sequence that bridges the gap. The current approach is simpler and was explicitly approved by the MAG team. The key fix in this PR is that the cadence used for gap-fill timestamps now respects the actual NM rate declared in the gap metadata, rather than always using 0.5s (2 Hz). This aligns with the document's intent that "the rate depends on the rate of the NM data before the gap."

Step 5 (Identify remaining gaps with >1.1s threshold) is not changed by this PR. The current implementation uses a per-rate tolerance check rather than a fixed 1.1s threshold, which is arguably more correct for non-2 Hz rates.

The six bugs and their fixes

Fix 1: Rate-aware gap cadence (generate_missing_timestamps)

Bug: Gap-fill timestamps were always generated at 0.5s intervals (2 Hz) regardless of the actual normal mode rate. For non-2 Hz gaps, this was wrong by construction.

Fix: Read the rate from gap[2] (the third element of each gap tuple) and compute the cadence as int(1e9 / rate). Fall back to 0.5s if no rate is provided. Use integer arithmetic throughout to avoid float precision loss at TTJ2000-scale nanosecond timestamps.

Which tests this fixed: T016 (jointly with Fixes 3 and 4). Reverting this fix alone produces T016 mago -662 rows and T016 magi +332 rows -- the dominant contributor to the T016 failure. Also covered by the new test_generate_missing_timestamps_uses_gap_rate() and the updated rate-aware case in test_generate_timeline().

Spec alignment: This directly implements the document's instruction that the interpolated rate should depend on "the rate of the NM data before the gap."

Fix 2: 2D array indexing in interpolate_gaps

Bug: filled_norm_timeline is an (n, 8) array, but the gap-timeline extraction compared the entire 2D array against scalar timestamps without specifying a column. The boolean mask broadcasted across all 8 columns instead of just the epoch column (index 0).

Fix: Extract norm_epochs = filled_norm_timeline[:, 0] once at the top of the function and filter against it explicitly.

Which tests this fixed: No isolated validation delta reproduced when this line was reverted in the current patch set. This is still a real correctness fix: it makes the gap-timeline mask operate on the epoch column by construction instead of relying on non-epoch columns being numerically unrelated to TTJ2000 timestamps.

Fix 3: Off-by-one burst slice endpoint

Bug: burst_end was computed as an inclusive index (via min(len-1, ...)), but then used in a Python slice burst_epochs[burst_start:burst_end] which is end-exclusive, silently dropping the last intended burst sample from the interpolation window.

Fix: Make burst_end exclusive from the start: burst_end = min(len(burst_epochs), burst_gap_end + burst_buffer + 1).

Which tests this fixed: T016. In isolated replay, reverting this fix preserved row counts but changed the first mismatched interpolated vector at the burst-resume boundary for both mago and magi.

Fix 4: CIC filter edge clipping (linear_filtered)

Bug: After the CIC filter compensates for group delay, the filtered timestamp range is shorter than the input by delay samples at the tail. linear() then calls remove_invalid_output_timestamps(), which clips output timestamps to this shortened filtered range, dropping valid gap timestamps near the burst boundary.

Fix: Pass extrapolate=True to linear() inside linear_filtered(). This matches the behavior of the spec's reference CIC implementation, which uses InterpolatedUnivariateSpline (extrapolates by default).

Which tests this fixed: T016. In isolated replay, reverting this fix alone causes both T016-mago and T016-magi to lose the final interpolated row.

Interaction with T024: Adding extrapolate=True alone would cause a +1 row regression on T024 (burst-only). Fix 6 below compensates for this.

Fix 5: Spurious micro-gaps at Config mode transitions (_find_rate_segments)

Bug: When MAG transitions between modes via Config mode, the vectors_per_second metadata records a transition timestamp that does not perfectly align with where the cadence actually changes in the data. The existing find_all_gaps() split the timeline at the declared metadata timestamps, and find_gaps() misidentified the boundary misalignment as 1-sample phantom gaps. These caused extra rows in the T015 output.

Fix: New _find_rate_segments() function that walks each configured transition backward while the observed cadence already matches the new rate. This aligns rate segment boundaries to the actual cadence in the data, preventing spurious micro-gaps from being generated in the first place. Also extracts the existing 0.075 tolerance magic number into a shared GAP_TOLERANCE constant used by find_gaps().

Which tests this fixed: T015-magi (had 5 extra rows due to phantom gaps at the Config boundary). Reverting the walkback logic alone reintroduces that exact +5 row failure.

Why not a post-hoc filter: An earlier approach filtered out gaps shorter than 2 cadence periods after detection. This is simpler (around 5 lines vs around 35 lines) but cannot distinguish Config-transition artifacts from genuine single-sample data loss. The root-cause fix via _find_rate_segments() preserves detection of legitimate short gaps while eliminating only the artifacts.

Fix 6: Burst-only trailing-edge trim for T024

Bug: Fix 4 (extrapolate=True) allows interpolation slightly past the CIC-filtered range, which is correct for T016 but causes T024 (burst-only, no normal data) to produce one extra trailing sample beyond the expected output.

Fix: Detect the burst-only case by checking whether any samples in the filled timeline have ModeFlags.NORM status (has_norm_context). When there is no normal-mode context, shorten the burst coverage upper bound by one output cadence to compensate for the CIC filter delay. This removes the extra trailing sample without affecting the normal+burst path.

Which tests this fixed: T024-mago and T024-magi (regressed to 687 rows with Fix 4 alone, expected 686). Reverting this trim alone reproduces that exact +1 row failure.

Additional cleanup: Dead np.delete removal

The existing code called np.delete(filled_norm_timeline, timeline_index) without assigning the result. Since np.delete() returns a new array, this was a no-op. Removed.

Defensive additions

  • Guard clause in find_gaps(): Returns empty when timeline_data has fewer than 2 elements, preventing crashes on single-timestamp rate segments.
  • Integer dtype consistency: Gap arrays use dtype=np.int64 throughout to prevent silent float precision loss on nanosecond timestamps.
  • generate_timeline() bug fix: Update last_index when generated timestamps are empty, preventing duplicate epoch data copies with adjacent gaps.

Test changes

  • Un-skip T015 and T016: Removed the pytest.skip() guard in test_mag_l1c_validation.
  • Exact row-count assertion: Added an assert checking len(expected_output.index) == len(l1c["epoch"].data) before the per-row value checks. This catches off-by-one regressions immediately rather than silently passing when the output has extra rows.
  • Updated test_process_mag_l1c expectations: The existing unit test now expects rate-aware cadence (17 timestamps instead of 15 in the 4 Hz gap region, with updated flag indices).
  • New: test_find_all_gaps_uses_observed_transition_boundary(): Regression test for the _find_rate_segments() micro-gap fix.
  • New: test_generate_missing_timestamps_uses_gap_rate(): Regression test for rate-aware cadence, including the 2 Hz fallback.
  • New: Rate-aware gap fill case in test_generate_timeline(): Verifies 4 Hz gap timestamps are spaced at 0.25s.
  • New: Adjacent-boundary case in test_generate_timeline(): Verifies no duplicate timestamps when adjacent gaps share a real epoch boundary, with a mixed-rate transition.

Files changed

File Lines What
imap_processing/mag/l1c/mag_l1c.py +131/-32 Fixes 1-3, 5-6, guard clauses, cleanup, and helper docstrings required by numpydoc
imap_processing/mag/l1c/interpolation_methods.py +1/-1 Fix 4 (extrapolate=True)
imap_processing/tests/mag/test_mag_l1c.py +73/-6 Updated expectations + 4 new regression tests
imap_processing/tests/mag/test_mag_validation.py +1/-4 Remove skip guard + add exact row-count assertion
poetry.lock +10/-0 Required poetry-lock hook refresh

What this PR does NOT change

  • No rewrite of the gap-fill pipeline architecture
  • No GapFillPlan dataclass or BM-anchor/shift/trim step-3 implementation
  • No day-boundary / inherited-timeline work (T017-T019 scope)
  • No changes to public API or CLI interfaces
  • No timestamp precision fixes outside the gap-fill path

Validation results

Rerun locally on mag-l1c-fix:

imap_processing/tests/mag/test_mag_l1c.py: 18 passed in 0.84s
compare_mag_l1c_validation.py on T013/T014/T015/T016/T024 (mago,magi): all 10 cases PASS
pre-commit run --all-files: PASS

Exact row-count validation (previously skipped T015/T016 now pass):

Test Sensor Expected Got Status
T013 mago 1856 1856 MATCH
T013 magi 1856 1856 MATCH
T014 mago 1792 1792 MATCH
T014 magi 1792 1792 MATCH
T015 mago 2498 2498 MATCH
T015 magi 1538 1538 MATCH
T016 mago 3052 3052 MATCH
T016 magi 1195 1195 MATCH
T024 mago 686 686 MATCH
T024 magi 686 686 MATCH

sapols and others added 3 commits April 1, 2026 01:12
Six localized bug fixes to the existing MAG L1C gap-filling algorithm,
plus targeted regression tests. Keeps the current production algorithm
shape — no architectural rewrite.

Fixes:
- Rate-aware gap cadence in generate_missing_timestamps()
- 2D array indexing in interpolate_gaps() (column 0 explicitly)
- Off-by-one burst slice endpoint
- CIC filter edge clipping via extrapolate=True in linear_filtered()
- Spurious micro-gaps at Config transitions via _find_rate_segments()
- Burst-only trailing-edge trim for T024 exact row count
- Remove dead np.delete() no-op

Refs IMAP-Science-Operations-Center#1993

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Codex <noreply@openai.com>
@sapols
Copy link
Copy Markdown
Author

sapols commented Apr 1, 2026

One scope clarification: My read is that this PR is enough to close the issue, but we haven't completed every historical MAG L1C TODO that came up in the discussion.

Relative to @maxinelasp’s earlier checklist:

  • done: one-input-file handling
  • done/effectively already present: CIC delay compensation and 2x buffer sizing
  • partial: beginning/end-of-day handling (synthetic day bounds are handled, but previous/next-day files are still not loaded)
  • not done here: using FILLVAL instead of zero placeholders

Relative to @alastairtree’s broader “L1C should” list:

  • this PR improves the validated transition/gap-fill behavior substantially
  • but it does not implement true previous/next-day context stitching
  • it does not add broader proof for out-of-order data handling
  • it does not claim comprehensive MAG OFF / no-science coverage

@sapols sapols marked this pull request as ready for review April 1, 2026 09:37
@sapols sapols requested a review from maxinelasp April 1, 2026 09:38
@sapols
Copy link
Copy Markdown
Author

sapols commented Apr 1, 2026

Another clarification: we're probably sweeping a lot under the rug by not counting day-boundary / inherited-timeline work (T017-T019) as in-scope for this ticket. E.g. Alastair's comment about gaps in the Dec 5-6 data appears to be directly related to L1C not treating leading/trailing burst-only intervals on mixed norm+burst days as fillable gaps. In those windows the parent burst L1B data does exist, but the current mixed-input L1C path collapses the output window to the existing normal-mode coverage instead of extending it across those burst-only spans.

@maxinelasp
Copy link
Copy Markdown
Contributor

@sapols Agreed, would you be able to open a follow-up ticket for the day boundary coverage if one does not exist?

@maxinelasp
Copy link
Copy Markdown
Contributor

Initial thoughts from reading through the PR description only:

  1. We should get a thumbs up from the MAG team around using extrapolation in the CIC filter only. Am I correct in understanding that these extrapolated data points are later clipped?
  2. for _find_rate_segements - sounds like a nice change. Does this still use the vecsec transitions attribute in any way?
  3. I'm a little confused by step 6 - what is the root of this extra timestamp getting produced? I worry this solution is a little too closely aligned with the validation test, will there ever be more than one extra timestamps?

Nice work! I'll review the code more thoroughly in a bit.

@sapols
Copy link
Copy Markdown
Author

sapols commented Apr 1, 2026

@maxinelasp

  1. Nah, they're not later clipped, and that's intentional. linear_filtered() passes extrapolate=True to linear(), which skips the remove_invalid_output_timestamps() clipping; that's the whole point of the fix. interpolate_gaps() bounds the candidate gap timestamps to the burst window via the short mask (lines 506/514) before interpolation, and within that window the boundary samples get extrapolated, filled, marked burst, and kept. So the T016 boundary sample is intentionally preserved.
    The MAG team should know that in this small window (less than one output cadence), the linear spline is extrapolating beyond the CIC-filtered input rather than interpolating within it.

  2. Yes, _find_rate_segments still uses vectors_per_second. process_mag_l1c() parses the attribute and passes it through find_all_gaps() into _find_rate_segments(), which iterates the configured (start_time, rate) pairs. The only change is that it walks each boundary backward while the observed cadence already matches the new rate—so it still uses the metadata, just not as an exact sample-level split point.

  3. The root of that extra timestamp is the CIC filter's group delay shortening the filtered range by delay = decimation_factor - 1 burst-rate samples at the tail. With extrapolate=True, the burst-only path can fill trailing NM timestamps past that shortened range. The delay in output-cadence terms is (decimation_factor - 1) / decimation_factor, which is always < 1.0 (e.g. 0.75 for 4x, 0.5 for 2x decimation). So at most one extra output timestamp can appear—this is a mathematical bound, not just empirically matched to the test.

input_filtered, vectors_filtered = cic_filter(
input_vectors, input_timestamps, output_timestamps, input_rate, output_rate
)
return linear(vectors_filtered, input_filtered, output_timestamps)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, we should add extrapolate=true to all the _filtered methods if we're doing it for one.


logger = logging.getLogger(__name__)

GAP_TOLERANCE = 0.075
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should go in mag/constants.py. Might need a clearer name too (L1C_GAP_TOLERANCE) and a comment describing what this is.

6-7 - compression flags.
"""
burst_epochs = burst_dataset["epoch"].data
norm_epochs = filled_norm_timeline[:, 0]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you rename this variable to make it clear it's the full or filled timeline?

Comment on lines +510 to +512
# In the burst-only fallback, CIC delay compensation shortens the usable
# filtered range at the trailing edge by roughly one output cadence.
short_end -= int(1e9 / norm_rate.value)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment in the main section about my concerns about this section.

]
gap_timeline = norm_epochs[(norm_epochs > gap[0]) & (norm_epochs < gap[1])]

short_end = burst_epochs[burst_end - 1]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this variable is kind of confusing to me

)
generated_timestamps = generate_missing_timestamps(gap)
if generated_timestamps.size == 0:
last_index = int(np.searchsorted(epoch_data, gap[1], side="left"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

difference_ns = 0.5 * 1e9
output: np.ndarray = np.arange(gap[0], gap[1], difference_ns)
difference_ns = int(0.5 * 1e9)
if len(gap) > 2:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reasoning for this if statement? Add a comment plz

@sapols
Copy link
Copy Markdown
Author

sapols commented Apr 6, 2026

@sapols Agreed, would you be able to open a follow-up ticket for the day boundary coverage if one does not exist?

Created #2925

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants