
Use obstore in data processing workflows#988

Open
spencerkclark wants to merge 1 commit into main from feature/obstore-in-data-processing

Conversation


@spencerkclark spencerkclark commented Mar 19, 2026

This follows the approach we have taken in the ERA5 data processing workflow of using obstore when reading from and writing to zarr stores in the cloud. This is expected to provide a meaningful performance improvement, particularly when reading from stores with small inner chunks, namely in the stats and time coarsening parts of our workflow.

Test workflows on the same 10-year dataset:

  • Without obstore: compute-fme-dataset-ensemble-fzl9k
  • With obstore: compute-fme-dataset-ensemble-cjp5g

Timing results:

|                 | Dataset computation | Stats computation |
| --------------- | ------------------- | ----------------- |
| Without obstore | 48m                 | 34m               |
| With obstore    | 38m                 | 35m               |

Surprisingly, we do not see any meaningful change in the stats computation time, though the dataset computation step is noticeably faster.

Changes:

  • Adds a `get_zarr_store` function in `compute_dataset.py` and `get_stats.py` and uses it in `compute_stats.py`, `get_stats.py`, and `time_coarsen.py`.
  • Updates the processing image to include obstore. The new image is tagged `v2026.03.0`, and the Argo workflow now uses it in all steps.
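For context, a minimal sketch of what a helper like `get_zarr_store` might look like, assuming zarr v3's `ObjectStore` adapter and obstore's `from_url` constructor; the actual implementation in this PR may differ in supported schemes and configuration:

```python
from urllib.parse import urlparse


def get_zarr_store(path: str):
    """Return a store suitable for zarr read/write calls.

    Cloud URLs are wrapped in an obstore-backed store; local paths are
    returned unchanged, since zarr accepts filesystem paths directly.
    Hypothetical sketch, not the exact helper added in this PR.
    """
    scheme = urlparse(path).scheme
    if scheme in ("gs", "s3", "az"):
        # obstore.store.from_url infers the backend from the URL scheme;
        # zarr.storage.ObjectStore adapts it to the zarr v3 store interface.
        import obstore.store
        import zarr.storage

        return zarr.storage.ObjectStore(obstore.store.from_url(path))
    return path
```

Routing all store construction through one function keeps the obstore dependency in a single place, so the reading and writing call sites in the individual workflow scripts stay unchanged.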

