Allow fix_file to return dataset objects #2579

Open: schlunma wants to merge 48 commits into main

schlunma (Contributor) commented Nov 12, 2024

Description

Currently, if Iris cannot read a file properly, we use fix_file to create a copy of that file and modify the copy with netCDF4.Dataset. This is slow and inefficient, because every affected file has to be copied before it can be loaded.

A much better approach is to read the file with ncdata or xarray and then use ncdata to convert that object to an Iris object. For this to work, fix_file must be allowed to return dataset objects (instead of paths), and load must be able to read dataset objects (instead of paths). This PR implements both.
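
The change in contract described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in this PR: `InMemoryDataset`, `FixResult`, the `needs_fix` flag, and the bodies of `fix_file` and `load` are all hypothetical stand-ins. A real fix would presumably return an xarray or ncdata object and `load` would convert it to Iris cubes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Union


@dataclass
class InMemoryDataset:
    """Hypothetical stand-in for an xarray.Dataset or ncdata.NcData object."""

    attributes: dict[str, str] = field(default_factory=dict)


# What a fix may hand back: a path (old behaviour) or a dataset object (new).
FixResult = Union[str, Path, InMemoryDataset]


def fix_file(path: Path, needs_fix: bool) -> FixResult:
    """Hypothetical fix: patch metadata in memory instead of copying the file."""
    if not needs_fix:
        return path  # nothing to repair, so pass the path through unchanged
    # A real fix would open `path` here; this sketch fabricates the content.
    dataset = InMemoryDataset(attributes={"source": str(path), "units": "k"})
    dataset.attributes["units"] = "K"  # the "fix" itself, applied in memory
    return dataset


def load(source: FixResult) -> str:
    """Hypothetical loader that dispatches on the type of its input."""
    if isinstance(source, InMemoryDataset):
        return f"converted in-memory dataset from {source.attributes['source']}"
    if isinstance(source, (str, Path)):
        return f"read file {source}"
    raise TypeError(f"unsupported input type: {type(source).__name__}")


print(load(fix_file(Path("broken.nc"), needs_fix=True)))
print(load(fix_file(Path("fine.nc"), needs_fix=False)))
```

The point of the sketch is only that no intermediate file is written: the fixed object flows directly from `fix_file` into `load`, while plain paths keep working as before.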

Closes #2129
Related to #674

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

schlunma added the "fix for dataset" label (Related to dataset-specific fix files) Nov 12, 2024
schlunma added this to the v2.12.0 milestone Nov 12, 2024
schlunma self-assigned this Nov 12, 2024

codecov bot commented Nov 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.16%. Comparing base (c2268dd) to head (0faf234).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2579      +/-   ##
==========================================
+ Coverage   95.14%   95.16%   +0.02%     
==========================================
  Files         259      259              
  Lines       15113    15157      +44     
==========================================
+ Hits        14379    14424      +45     
+ Misses        734      733       -1     

schlunma marked this pull request as ready for review December 11, 2024 13:44
schlunma mentioned this pull request Dec 11, 2024
sloosvel modified the milestones: v2.12.0, v2.13.0 Feb 12, 2025

schlunma (Contributor, Author) commented May 12, 2025

One more reason why it would be great to have this feature:

import iris
import netCDF4
import xarray as xr

path = "/path/to/real_EMAC_file_with_855_variables.nc"

%%timeit
iris.load(path)
# 1min 47s ± 446 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
iris.load_raw(path)
# 1.98 s ± 7.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
xr.open_dataset(path, chunks="auto")
# 277 ms ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
netCDF4.Dataset(path, mode="r")
# 5.2 ms ± 23.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This uses the most up-to-date versions of all these packages (fresh install from 2025-05-09):

  • Iris: 3.12.1 (not the most up-to-date, but it seems that 3.12.2 only adds support for the latest Dask version)
  • Xarray: 2025.4.0
  • netCDF4: 1.7.2

Since we sometimes need to read hundreds of those files (this example only contains a single time step), ESMValTool can spend hours just loading these datasets.
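
To make the "hours" estimate concrete, the per-file timings above can be extrapolated with some quick arithmetic. The snippet below is a back-of-the-envelope sketch: the file count of 100 is an assumed value (the comment only says "hundreds"), the scaling is assumed to be linear, and the four calls do different amounts of work, so the numbers indicate orders of magnitude rather than a like-for-like comparison.

```python
# Mean single-file timings reported above, in seconds.
seconds_per_file = {
    "iris.load": 107.0,  # 1 min 47 s
    "iris.load_raw": 1.98,
    "xr.open_dataset": 0.277,
    "netCDF4.Dataset": 0.0052,
}

N_FILES = 100  # assumed file count ("hundreds" in the comment); not measured


def total_hours(seconds: float, n_files: int) -> float:
    """Extrapolate a per-file time linearly to n_files and return hours."""
    return seconds * n_files / 3600.0


estimates = {
    name: total_hours(seconds, N_FILES)
    for name, seconds in seconds_per_file.items()
}

for name, hours in estimates.items():
    print(f"{name:>16}: {hours:8.4f} h for {N_FILES} files")
```

Under these assumptions, iris.load alone would take roughly three hours for 100 such files, while the other calls stay at a few minutes or less.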

schlunma changed the title from "Allow fix_file to return Cube and CubeList objects" to "Allow fix_file to return dataset objects" May 14, 2025
schlunma removed their assignment May 14, 2025
Labels: fix for dataset (Related to dataset-specific fix files)
Projects: No open projects
Development: Successfully merging this pull request may close these issues: Rethinking fix_file
4 participants