
Use geomad instead of hdstats. #207

Open
SpacemanPaul wants to merge 12 commits into develop from use-geomad

Conversation

@SpacemanPaul
Contributor

Use new open source Geomad package instead of hdstats.

Remove dependency on packages.dea.ga.gov.au python package index.

@SpacemanPaul SpacemanPaul requested review from Ariana-B, cbur24, omad and robbibt and removed request for Ariana-B September 5, 2025 01:12
Member

@omad left a comment


Nifty!

@SpacemanPaul
Contributor Author

Tests are hanging due to worker memory blowing out. Not sure what the issue is at this stage.

@codecov

codecov bot commented Sep 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.52%. Comparing base (5b85e93) to head (39ab72a).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #207   +/-   ##
========================================
  Coverage    81.52%   81.52%           
========================================
  Files           53       53           
  Lines         4849     4849           
========================================
  Hits          3953     3953           
  Misses         896      896           

☔ View full report in Codecov by Sentry.


@omad
Member

omad commented Sep 5, 2025

That memory limit argument tells dask how much memory it is allowed to use (and maybe also feeds into chunk sizes, I'm not sure).

I think GitHub actions runners have 16GB of RAM, so dask needs to use substantially less to not be killed.

The log messages from dask in the failed runs indicated it had decided to use about 14GB, so it must use a heuristic that assumes it can use most of the available RAM, which is too much in this case.
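As a hedged sketch of the workaround (the helper names, worker counts, and the 8GB budget are illustrative, not values from this PR), worker memory can be pinned explicitly instead of relying on dask's total-RAM heuristic:

```python
# Sketch: split a fixed memory budget across dask workers instead of
# letting dask size workers from total system RAM (its default heuristic).
from dask.distributed import Client, LocalCluster

def per_worker_limit(total_budget_gb: float, n_workers: int) -> str:
    """Hypothetical helper: divide a budget into a dask memory_limit string."""
    return f"{total_budget_gb / n_workers:.1f}GB"

def make_ci_client(n_workers: int = 2, total_budget_gb: float = 8.0) -> Client:
    """Hypothetical helper: build a client whose workers stay inside the budget."""
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=1,
        memory_limit=per_worker_limit(total_budget_gb, n_workers),  # e.g. "4.0GB"
    )
    return Client(cluster)

if __name__ == "__main__":
    client = make_ci_client()
    print(client)
    client.close()
```

With an explicit `memory_limit`, workers spill and pause based on the stated budget rather than on what the runner appears to have.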

@SpacemanPaul
Contributor Author

SpacemanPaul commented Sep 17, 2025

Comments:

I'm having to leave this where it is for now, as I'm off on a much-needed 4 week holiday.

Status

  • Geomad seems to be working hunky dory in odc-algo.
  • integration test failures here appear to be happening BEFORE the execution gets to the geomad code. I think it's got something to do with the dask scheduling of the output COG writing.
  • integration tests have been running against the last released version of odc-stats code (i.e. the code from the previous successfully merged PR), rather than the code of the current PR, making it very difficult to figure out where things broke. I have fixed this in this PR.
  • I have reverted to core's internal (deprecated) to_cog() method, which gives clearer error messages (odc-geo's to_cog just runs out of memory and the container gets killed). With datacube's internal to_cog() I get pickling errors referencing _HLGExprSequence, which I have seen before; it indicates that an internal Dask object (e.g. a Client) is being passed to a delayed dask function, but I haven't been able to spot where this might be happening.
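That pickling symptom can be reproduced in miniature with any unpicklable handle dragged into task state; this sketch uses a hypothetical FakeClient stand-in rather than a real dask Client:

```python
# Sketch of the failure mode: an object holding an unpicklable resource
# (a real dask Client holds sockets and locks) ends up inside task state
# that dask must serialize, and pickling blows up.
import pickle
import threading

class FakeClient:
    """Hypothetical stand-in for a dask Client."""
    def __init__(self):
        self._lock = threading.Lock()  # thread locks cannot be pickled

def bad_payload(client, x):
    # The client rides along inside the result, so it must be pickled too.
    return {"client": client, "value": x + 1}

def good_payload(x):
    # Plain data only: serializes cleanly for delayed/submit.
    return {"value": x + 1}

try:
    pickle.dumps(bad_payload(FakeClient(), 1))
except TypeError as exc:
    print("pickling failed:", exc)

print(pickle.loads(pickle.dumps(good_payload(1))))  # {'value': 2}
```

In real dask code the usual fix is to call dask.distributed.get_client() inside the task, rather than capturing a Client from the enclosing scope.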

There are some unrelated test errors happening due to the switch to numpy 2. These should be relatively straightforward to resolve.
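The thread doesn't identify the specific numpy 2 breakages; one common class of fixes (removed 1.x aliases) looks like this sketch:

```python
# Sketch of typical numpy 2.0 migration fixes: several 1.x aliases
# (np.NaN, np.float_, np.Inf, ...) were removed in 2.0; the spellings
# below work on both numpy 1.x and 2.x.
import numpy as np

arr = np.array([1.0, np.nan, 3.0], dtype=np.float64)  # np.nan, not np.NaN
print(np.nanmean(arr))  # 2.0
print(bool(np.isinf(np.inf)))  # np.inf, not np.Inf
```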

@omad
Member

omad commented Sep 17, 2025

In addition to @SpacemanPaul's suspicion about the different to_cog() dask functions, I'm wondering whether the yxbt_sink and reshape_yxbt features in odc-stats are still necessary.

My understanding is that they were very important a few years ago for scaling efficiently across dask workers, especially when running all-of-time statistical summaries. Those summaries are embarrassingly parallel spatially, but (I think) the dask scheduler at the time was causing major problems in how it split the work up.

In the last year there have been a few significant improvements in the dask/distributed scheduler that should address some or all of these issues, and the custom memory sinks may just be getting in the way.

It may be possible to use a much simpler implementation based on the geomad kernel, with xarray code something like the following, which ran better against one of these new tests:

import geomad  # assumed importable as `geomad`
import xarray as xr

def xr_geomedian(x, dim="time"):
    return xr.apply_ufunc(
        geomad.nangeomedian_pcm,
        x,
        input_core_dims=[[dim]],  # reduce over `dim` rather than hard-coding "time"
        dask="parallelized",
        dask_gufunc_kwargs={"allow_rechunk": True},
    )

# `src` is the loaded Dataset; keep time and band as single chunks per block
src_dataarray = src.to_dataarray(dim="band").chunk({"x": 800, "y": 800, "time": -1, "band": -1})
result = xr_geomedian(src_dataarray)

It also shouldn't be necessary to drop down to low-level .map_blocks() calls, since .apply_ufunc() can do simple dimension reductions like geomad on its own.


I'm also curious as to whether there are custom dask loading functions implemented here in odc-stats that would be better replaced by the code in odc-loader.

@cbur24
Contributor

cbur24 commented Sep 19, 2025

@SpacemanPaul @omad
I ran Landsat 8 geomedians using the code in this branch to see if the memory leak issue you're reporting is replicated outside of the unit testing. On the unstable sandbox image, the landsat geomedian plugin (gm-ls) worked as expected without memory spikes. The outputs also look correct. Maybe this is only an issue with the unit tests and not with the geomedian code per se?

I've pushed the Jupyter notebook and config file that I used for this testing into odc-stats/docs within this branch so you can test for yourselves - we can delete these later once you're happy. ODC and Datacube versions tested:

UPDATE: this was run on xarray=2025.6.1

[image: ODC and Datacube version details]
