
Streamline parcels-benchmarks #42

Merged

VeckoTheGecko merged 44 commits into main from improvements
Apr 2, 2026

Conversation

@VeckoTheGecko
Contributor

This PR reworks parcels-benchmarks in a way that (I hope) is much easier to work with. Follow the README and let me know what you think.

Changes:

  • Replaces the parcels_benchmarks internal package (which provided the CLI tool for adding dataset hashes etc.). Now instead:

    • An intake-xarray catalog is defined in catalogs/parcels-benchmarks/catalog.yml. The top of the file has a comment which contains the link to the ZIP to be downloaded.
      • This streamlines our approach, making it easier for the benchmarking scripts to go straight from data on disk to xarray dataset.
      • We can use other options available via intake
      • This approach allows us to get familiar with intake which will likely be used for our HPC systems after v4 is released.
    • A script (scripts/download-catalog.py) downloads the data for a catalog and takes an output_dir (both via CLI args). This uses curl to download the dataset, and then unzips all nested zip files (deleting the original zips). This script also copies the catalog file into the output_dir (which is good, since the datasets in the catalog are defined relative to this catalog file).
      • If a catalog is already downloaded (i.e., if the folder already exists), it's skipped
      • Pro: The use of curl here means this approach is quite transparent - one can easily see download speeds and decide to cancel
      • Con: There is no longer the concept of "known hashes" - this is something we can get back if we want in the future [1]
    • Pixi is used, via the setup-data task, to download all the datasets.
      • This makes our data approach much more flexible should we want to change it in future
  • Requires a PARCELS_BENCHMARKS_DATA_FOLDER environment variable to be explicitly set, which then acts as the working space for the data. This environment variable is used in the download and benchmarking code.
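For concreteness, an intake-xarray catalog entry of the kind described above might look like the sketch below. The source name, description, and file path are hypothetical, not the real entries in catalogs/parcels-benchmarks/catalog.yml; `{{ CATALOG_DIR }}` is intake's template variable for the directory containing the catalog file, which is why copying the catalog next to the data matters.

```yaml
# Illustrative sketch only - the dataset name and paths are made up.
# The top of the real file carries a comment with the ZIP download link.
sources:
  example_dataset:
    description: Example hydrodynamic fields for benchmarking
    driver: netcdf
    args:
      # Paths resolve relative to the catalog file itself.
      urlpath: "{{ CATALOG_DIR }}/example_dataset/*.nc"
      xarray_kwargs:
        combine: by_coords
```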

We needed the following things to ease development:

  • Download all datasets before running benchmarks

  • Make the download progress of datasets transparent

Footnotes

  1. Given we are the sole owners of our data sources, I don't think this is a concern.

@VeckoTheGecko
Contributor Author

Not all the benchmarks are running. Once this is merged I'll fix the rest in #40 .

Let me know what you think of this @fluidnumerics-joe

@VeckoTheGecko
Contributor Author

Oh, and since Parcels is now a submodule, I think you'll need to run `git submodule update --init --recursive` (if you aren't doing a fresh clone from the README)

@fluidnumerics-joe
Contributor

Is it intentional to not have the FESOM and ICON datasets in the catalog? I'm confused about where they went.

@VeckoTheGecko
Contributor Author

I should have mentioned it in the PR description: I was planning on adding them in a future PR (I wanted to avoid conflicts with the other reworking of the ingestion code)

@VeckoTheGecko
Contributor Author

Also, I need to figure out exactly how intake integrates with Uxarray. The fact that Uxarray doesn't initialise from an xarray dataset (i.e., it has `uxr.open_mfdataset` as the main entry point) slightly complicates things

Member


What's the difference between the catalogues in parcels-benchmarks and the parcels-examples? They seem to be the same now?

Contributor Author


Yes, to be updated in a future PR (mainly focussing on the actual downloading of the datasets - will fix the catalogs and ingestion at the same time)

- Refactor path variables and move catalogue definitions to a separate file
- Since the downloading script now relies on the
Using the `pyproject.toml` file to specify the project dependencies (including the local dependency on Parcels). This helps ensure that the dependencies are also available to ASV.
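Under pixi's pyproject.toml convention, the local Parcels dependency and the setup-data task might be declared roughly as below. This is a hedged sketch: the table names follow pixi's documented `[tool.pixi.*]` schema, but the specific values (project name, platforms, submodule path, task command) are hypothetical.

```toml
[project]
name = "parcels-benchmarks"
version = "0.1.0"

[tool.pixi.project]
channels = ["conda-forge"]
platforms = ["linux-64"]

[tool.pixi.pypi-dependencies]
# Local checkout of Parcels (the git submodule), installed editable
parcels = { path = "./parcels", editable = true }

[tool.pixi.tasks]
# Downloads all datasets into $PARCELS_BENCHMARKS_DATA_FOLDER
setup-data = "python scripts/download-catalog.py catalogs/parcels-benchmarks/catalog.yml"
```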

Maybe this ASV <-> pixi interaction needs to be investigated further... ASV's own environment management is something I find confusing.
Looks like stripping the environments out of ASV is on the roadmap (airspeed-velocity/asv#1581), which will mean that we can fully manage them with pixi (which would be great)
@VeckoTheGecko VeckoTheGecko merged commit 17c74e1 into main Apr 2, 2026
1 check passed
@VeckoTheGecko VeckoTheGecko deleted the improvements branch April 2, 2026 13:45