GetObservations in parallel #721

Open
mranst wants to merge 5 commits into develop from feature/mranst/parallel_get_obs
@mranst
Collaborator

mranst commented Mar 4, 2026

Addresses #718. This restructures GetObservations to fetch all files in parallel, similar to how eva observations works. It also changes how empty obs files are fetched: an empty obs file is fetched from r2d2 only once, and that file is then copied for the other empty observations in the experiment, which avoids fetching the same file repeatedly. (I'm assuming this works as long as the filename matches the target file, unless r2d2 is doing something additional in the fetch stage that a plain copy wouldn't cover.)

With the number of processes set to 4, execution time for the default 3dvar experiment drops from 150 seconds to 100 seconds, and for 3dfgat_atmos from 255 seconds to 220 seconds. I believe this is still not nearly as quick as it used to be, but maybe r2d2-client changes can improve that further? I'm sure more processes would speed this up further, but I don't have a good sense of what number we should use. GetObs can't run on a compute node, and I imagine I/O load might start to be a concern.

Tier1 tests pass
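
For readers following along, the approach described in this comment can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: `fetch_from_store` is a hypothetical stand-in for the real r2d2 fetch, `get_observations` and its parameters are invented names, and threads are used for simplicity (fetching is I/O bound) even though the PR talks about processes.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_from_store(target: Path) -> None:
    """Hypothetical stand-in for the real r2d2 fetch; writes a placeholder file."""
    target.write_text(f"data for {target.name}\n")


def get_observations(targets, empty_targets, n_workers=4, fetch=fetch_from_store):
    """Fetch regular targets in parallel; fetch one empty file and copy it for the rest."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() waits for every fetch and re-raises the first failure, if any
        list(pool.map(fetch, targets))

    if empty_targets:
        first, *rest = empty_targets
        fetch(first)  # single fetch of the empty obs file
        for target in rest:
            # Assumption: a plain copy under the target filename is equivalent to a fetch
            shutil.copyfile(first, target)


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    obs = [workdir / f"obs_{i}.nc4" for i in range(6)]
    empty = [workdir / f"empty_{i}.nc4" for i in range(3)]
    get_observations(obs, empty)
    print(sorted(p.name for p in workdir.iterdir()))
```

The `n_workers` argument plays the role of the "number of processes" setting mentioned above; what value is appropriate on a login node is exactly the open question raised in the comment.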

Contributor

mer-a-o left a comment

Changes look good to me. I only have one minor comment regarding this PR. Thanks, @mranst.

I noticed that in Cylc the get-observations task is a single task, so if fetching one of the observations fails, it is difficult to see in the TUI. I think it is also not possible to rerun only the failed observation. Separate logs would also make things cleaner and easier for experiments with lots of observations. Would it make more sense to have a Cylc family for getting observations, with each observation as a sub-task under that family?

@mranst
Collaborator Author

mranst commented Mar 5, 2026

> I noticed that in Cylc the get-observations task is a single task, so if fetching one of the observations fails, it is difficult to see in the TUI. I think it is also not possible to rerun only the failed observation. Separate logs would also make things cleaner and easier for experiments with lots of observations. Would it make more sense to have a Cylc family for getting observations, with each observation as a sub-task under that family?

That's an interesting concept. I can try this out. The only thing I'm unsure about is the impact on execution time and I/O.

@jeromebarre
Contributor

jeromebarre commented Mar 5, 2026

> > I noticed that in Cylc the get-observations task is a single task, so if fetching one of the observations fails, it is difficult to see in the TUI. I think it is also not possible to rerun only the failed observation. Separate logs would also make things cleaner and easier for experiments with lots of observations. Would it make more sense to have a Cylc family for getting observations, with each observation as a sub-task under that family?
>
> That's an interesting concept. I can try this out. The only thing I'm unsure about is the impact on execution time and I/O.

To draw the parallel with Skylab/Ewok: over there, separate tasks are created under the getObservation family like so:
https://github.com/JCSDA-internal/ewok/blob/develop/src/ewok/tasks/getObs.py
and the number of obs fetch tasks running at the same time is controlled by
maxlogintasks, which can be very debatable...
https://github.com/JCSDA-internal/ewok/blob/04c875303e8e58d4cc14480bd46c80e15f5ec962/src/ewok/workflows/ecflow/ecflow.py#L66
or
https://github.com/JCSDA-internal/ewok/blob/04c875303e8e58d4cc14480bd46c80e15f5ec962/src/ewok/workflows/cylc/cylc.py#L220
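
As a rough in-process analogue of the idea discussed in this thread (one unit of work per observation, a cap on how many run at once, and visibility into which ones failed), the sketch below may help. It is plain Python, not Cylc or ecFlow configuration, and not taken from Ewok or this PR: `fetch_all`, `demo_fetch`, `max_tasks`, and the observation names are all hypothetical, with `max_tasks` only loosely mirroring what `maxlogintasks` controls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(names, fetch_one, max_tasks=4):
    """Run fetch_one(name) for each name, at most max_tasks at a time.

    Returns (succeeded, failed), where failed maps name -> error message.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_tasks) as pool:
        futures = {pool.submit(fetch_one, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                succeeded.append(name)
            except Exception as exc:
                # Keep going so one failure does not hide the others
                failed[name] = str(exc)
    return sorted(succeeded), failed


def demo_fetch(name):
    """Hypothetical fetch that fails for one observation type."""
    if name == "sondes":
        raise FileNotFoundError(f"no file for {name}")


if __name__ == "__main__":
    ok, bad = fetch_all(["aircraft", "sondes", "amsua_n19"], demo_fetch, max_tasks=2)
    print("succeeded:", ok)
    print("failed:", bad)
    # Rerun only what failed; here the retry fetch is a no-op that always succeeds
    ok_retry, bad_retry = fetch_all(list(bad), lambda name: None)
    print("retry succeeded:", ok_retry)
```

With separate workflow tasks per observation, the scheduler would provide the per-observation status, logs, and reruns that this sketch only imitates with a returned dictionary.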

3 participants