fix(cmor): make make_simple_varlist return unique variables across all datetimes #804
Replace the non-deterministic glob.iglob single-file probe with glob.glob to collect all *.*.nc files, count them by their datetime component, and pick the datetime with the most files. This avoids flaky behaviour caused by filesystem inode ordering (which changed after PR #786 merged).

Update test_make_simple_varlist_index_error_on_datetime to patch glob.glob (with an os.path.basename check) instead of the now-unused glob.iglob. Also tighten the stale comment in test_make_simple_varlist_dedup_across_datetimes.

Agent-Logs-Url: https://github.com/NOAA-GFDL/fre-cli/sessions/ef0227dc-6efa-4218-88d6-30d1d64cd2be
Co-authored-by: ilaflott <[email protected]>
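The selection rule described in this commit message can be sketched as follows. This is a minimal illustration, not the code in `cmor_finder.py`: the helper name `most_common_datetime`, the example paths, and the assumed `component.datetime.variable.nc` naming are all hypothetical.

```python
import os
from collections import Counter


def most_common_datetime(paths):
    """Return the datetime stamp shared by the most files, or None if none parse.

    Assumes names look like component.datetime.variable.nc (an assumption
    made for this sketch).
    """
    counts = Counter()
    for path in paths:
        parts = os.path.basename(path).split('.')
        if len(parts) >= 3:
            counts[parts[-3]] += 1  # third field from the end is the datetime
    if not counts:
        return None
    # Order by (-count, stamp) so ties resolve the same way on every run,
    # regardless of the order in which the filesystem listed the files.
    return min(counts, key=lambda stamp: (-counts[stamp], stamp))


# Hypothetical file list; the minority datetime deliberately comes first.
files = [
    "/data/model.19900201.temp.nc",
    "/data/model.19900101.temp.nc",
    "/data/model.19900101.salt.nc",
]
print(most_common_datetime(files))
```

The point of counting instead of probing is that the result depends only on the set of file names, not on which file the directory listing happens to return first.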
make_simple_varlist datetime selection deterministic
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #804      +/-   ##
==========================================
- Coverage   84.20%   84.18%   -0.03%
==========================================
  Files          71       71
  Lines        4971     4975       +4
==========================================
+ Hits         4186     4188       +2
- Misses        785      787       +2
since i'm seeing others with builds on `3.14`, let's toss this in
fre/cmor/cmor_finder.py (outdated)

    search_pattern = None
    try:
        one_datetime = os.path.basename(one_file).split('.')[-3]
        # Count files per datetime stamp and pick the most common one.
The only thing I'm not fully understanding is why the datetime with the most files is picked. I understand it's for the representative datetime, but is that just to have like a full list of variables across datetimes? Would other (maybe not representative) datetimes have other variables present at all?
in the new form, it is 100% possible that a datetime without the most files lacks a variable that is present under another datetime. the function isn't supposed to remove all ambiguity for the user, nor be an "ironclad" generator of variable lists, hence the function name, make_simple_varlist. it's supposed to get a user started
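To make the question in this thread concrete, here is a small sketch of how restricting the search to a single datetime can drop variables. The file names and the helper `variables_by_datetime` are invented for illustration, and the `component.datetime.variable.nc` layout is an assumption.

```python
def variables_by_datetime(names):
    """Group variable names by datetime stamp (assumed name layout:
    component.datetime.variable.nc)."""
    grouped = {}
    for name in names:
        parts = name.split('.')
        grouped.setdefault(parts[-3], set()).add(parts[-2])
    return grouped


# Hypothetical directory contents: "ice" exists only at the minority datetime.
names = [
    "model.19900101.temp.nc",
    "model.19900101.salt.nc",
    "model.19900101.sst.nc",
    "model.19900201.temp.nc",
    "model.19900201.ice.nc",
]

grouped = variables_by_datetime(names)
# Pick the datetime with the most variables (here, also the most files).
majority = max(sorted(grouped), key=lambda stamp: len(grouped[stamp]))
all_variables = set().union(*grouped.values())
missed = all_variables - grouped[majority]

print(majority, sorted(missed))
```

Any variable that lands in `missed` is one a user would not see in the generated list if only the majority datetime were searched.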
something is going on... i never felt like
second attempt that succeeded on rerun (no changes) here
first attempt that failed here
and error:
@copilot the feedback from @singhd789 has a good point, we can do better pretty easily. build the variable list from the unique variables found across all datetimes. update tests as needed
Per @ilaflott's feedback: instead of picking the most-common datetime and re-globbing only that slice, iterate over all *.*.nc files found by the initial glob and collect unique variable names directly via dict assignment. This is simpler, correct by construction, and catches variables that only appear at a minority datetime.

Remove the now-dead datetime-counting logic, the second glob.glob call, and the try/except IndexError around split('.')[-3].

In tests: remove test_make_simple_varlist_index_error_on_datetime (branch no longer exists), simplify the no_files_matching_pattern test, update stale docstrings, and add test_make_simple_varlist_minority_datetime_var_included to pin the new guarantee.

Agent-Logs-Url: https://github.com/NOAA-GFDL/fre-cli/sessions/762aa18f-e013-40d7-b215-163459373b68
Co-authored-by: ilaflott <[email protected]>
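A rough sketch of the all-datetimes approach this commit message describes is shown below. It is not the actual `make_simple_varlist` body: the function name `collect_unique_variables`, the `{name: name}` mapping shape, and the temporary files are assumptions made for the example.

```python
import glob
import os
import tempfile


def collect_unique_variables(dir_targ):
    """Collect unique variable names from every *.*.nc file in dir_targ.

    Hypothetical helper; assumes names look like
    component.datetime.variable.nc, so the variable is the second field
    from the end.
    """
    var_list = {}
    # Sort so the resulting key order does not depend on directory order.
    for path in sorted(glob.glob(os.path.join(dir_targ, "*.*.nc"))):
        var_name = os.path.basename(path).split('.')[-2]
        var_list[var_name] = var_name  # dict assignment drops duplicates
    return var_list


# Build a throwaway directory where "ice" appears only at a minority datetime.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "model.19900101.temp.nc",
        "model.19900101.salt.nc",
        "model.19900201.temp.nc",
        "model.19900201.ice.nc",
    ):
        open(os.path.join(tmp, name), "w").close()
    found = collect_unique_variables(tmp)

print(found)
```

Because the pattern `*.*.nc` already requires two dots, every matched basename splits into at least three fields, which is why no `IndexError` guard is needed around the `[-2]` lookup in this sketch.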
Done in 088b05b. The datetime-picking logic is gone entirely — the initial Also added |
Title changed from "make_simple_varlist datetime selection deterministic" to "make_simple_varlist return unique variables across all datetimes"
`make_simple_varlist` used `glob.iglob` to grab a single probe file and extracted its datetime component to build the search pattern `*{datetime}*.nc`. Because `iglob` returns files in filesystem inode-hash order, the probe was non-deterministic. If it landed on a datetime that covered only a subset of variables (e.g. `model.19900201.temp.nc` in a dir also containing `model.19900101.temp.nc` + `model.19900101.salt.nc`), the search missed variables present at other datetimes. PR #786's 42 commits recycled inodes in CI, flipping the glob order so the minority datetime was consistently selected first, turning a latent flake into a reproducible failure.

Describe your changes
`cmor_finder.py`: scan all files, collect unique variables across all datetimes
- Replace the `next(glob.iglob("*.*.nc"))` probe with `glob.glob("*.*.nc")` to collect all matching files up front
- Remove the second `glob.glob` re-search entirely; the initial glob already has every file

`test_cmor_finder_make_simple_varlist.py`: test hygiene
- Remove `test_make_simple_varlist_index_error_on_datetime`: the `try/except IndexError` datetime-extraction branch no longer exists
- Simplify `test_make_simple_varlist_no_files_matching_pattern`: it now patches the single `glob.glob` call directly
- Update the docstrings of `test_make_simple_varlist_deduplicates` and `test_make_simple_varlist_dedup_across_datetimes` to reflect the new all-datetimes behaviour
- Add `test_make_simple_varlist_minority_datetime_var_included` to explicitly pin that variables appearing at only a minority datetime are still returned

Issue ticket number and link (if applicable)
Checklist before requesting a review