Replace PBC handling in feature detection with label function from dask-image#562
w-k-jones wants to merge 7 commits into tobac-project:RC_v1.6.x from
Conversation
Linting results by Pylint: Your code has been rated at 8.36/10 (previous run: 8.36/10, +0.00)
844f3d1 to ffd8833
Codecov Report: ❌ Patch coverage is
@@ Coverage Diff @@
##           RC_v1.6.x     #562      +/-   ##
=============================================
- Coverage      64.84%   64.46%   -0.38%
=============================================
  Files             27       27
  Lines           3985     3923      -62
=============================================
- Hits            2584     2529      -55
+ Misses          1401     1394       -7
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Testing with the low cloud tracking notebook shows a slight slowdown in performance. Previous: I will test on a larger example, but I think this is a reasonable trade-off for simpler code.
On further investigation with larger datasets, it seems there is a fairly large overhead to using the dask-image label function for moderately large feature detection jobs (where feature detection takes ~10 seconds to 1 minute), but a performance benefit for larger tasks (e.g. multiple minutes). I will investigate a bit more to see whether varying the number of time steps vs. the spatial array size gives different results. Note that this is using the default dask settings, i.e. without increasing the number of workers or chunking the data.
Some test results:

- MCSMIP Obs (geo-ir), 1200x3600 domain, PBC hdim_2: [on HPC]
- EUREC4A LES, 1524x1524 domain, PBC both: [on laptop]

What's puzzling is that the relative performance changes with the number of timesteps, even though the labelling works independently per timestep. Dask continues to be an enigma to me; do you have any ideas on this, @freemansw1?
@w-k-jones I'm certainly not a dask expert (and I'm looking at this relatively quickly), but I'm guessing it has to do with how the dask graph is constructed/how big the dask graph gets. Even if it's operating independently per timestep, unless

This is very exciting, though. I wonder how performance scales by number of workers.
Currently I have it running compute every timestep immediately for the labelling, to avoid having to change any of the surrounding code, and given the array isn't chunked it will likely run slower than the scipy/scikit-image labelling just due to overhead. From a quick inspection, dask-image should have substitutes for most of the image functions we need for feature detection. I'll see if I can get lazy execution of the per-timestep feature detection working in a simple manner, and check whether running with an actual dask client and changing the number of workers affects things.
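The eager, per-timestep pattern described above can be sketched as follows, using plain scipy labelling for illustration (function and variable names here are hypothetical, not tobac internals):

```python
import numpy as np
from scipy import ndimage


def label_per_timestep(field_3d):
    """Label each 2D (y, x) slice of a (time, y, x) boolean array
    independently, so no label merging is needed across timesteps."""
    structure = ndimage.generate_binary_structure(2, 2)  # full (8-) connectivity
    labels = np.zeros(field_3d.shape, dtype=np.int32)
    counts = []
    for t, frame in enumerate(field_3d):
        labels[t], n = ndimage.label(frame, structure=structure)
        counts.append(n)
    return labels, counts


field = np.zeros((2, 3, 3), dtype=bool)
field[0, 0, 0] = field[0, 1, 1] = True  # one diagonal feature at t=0
field[1, 0, 0] = field[1, 2, 2] = True  # two separate features at t=1
labels, counts = label_per_timestep(field)
```

A lazy variant would wrap the per-timestep call in dask delayed/map_blocks instead of computing each slice immediately, which is where the worker-count experiments come in.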
I think the best way forward here is to keep both: a "small data" approach based on the current method, for performance on smaller problems, and a "big data" alternative using dask-image, used when a dask array is passed to feature detection.
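A minimal sketch of that dispatch, assuming a duck-typed check on the input array (the function name and the module-name check are illustrative assumptions, not tobac's actual API):

```python
import numpy as np
from scipy import ndimage


def label_features(field):
    """Dispatch labelling: dask-image for dask arrays, scipy otherwise."""
    structure = ndimage.generate_binary_structure(field.ndim, field.ndim)
    if type(field).__module__.startswith("dask"):
        # "big data" path: lazy, chunk-aware labelling (requires dask-image);
        # this path also accepts wrap_axes for periodic boundaries.
        from dask_image import ndmeasure

        return ndmeasure.label(field, structure=structure)
    # "small data" path: eager scipy labelling
    return ndimage.label(field, structure=structure)


mask = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=bool)
labels, n_features = label_features(mask)
# With full connectivity the diagonal cells form a single feature.
```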
The `dask_image.ndmeasure.label` function replicates the `scipy.ndimage.label` function, but adds a `wrap_axes` keyword that can be used to perform label detection across periodic boundaries (this requires the same logic as labelling regions across tiles). Replacing the current feature detection PBC code with this both simplifies our codebase and starts to integrate more dask features into the tobac pipeline, so that eventually it can be fully dask enabled. I have set up the `label` function to replicate the handling of connectivity identically to the scikit-image function it is replacing, meaning that full connectivity is used rather than square connectivity. However, in future I think we should make this an option and change the default to square (1) connectivity to match segmentation (see #481).

Currently in draft as I am not sure whether to introduce this on its own or as part of a larger refactor of feature detection.