Timeboxed push for simplifying work stealing

Work stealing is a known source of problems. Its current implementation is overly complex and has a couple of known issues, some of which are *almost* fixed.

Specifically, I propose to timebox this to ~1-2 weeks and try to wrap up a few known issues while pushing for a drastic simplification of the implementation. Once the dust settles, we can reevaluate how this feature has to evolve.

- Remove ["steal from public"](https://github.com/dask/distributed/blob/acf607832c7191cc496a9b4a81760170de85062c/distributed/stealing.py#L468-L495). I strongly believe this is responsible for all the aggressiveness we're seeing, e.g. https://github.com/dask/distributed/issues/5243. This may need an adjustment to how we define saturated workers, see also https://github.com/dask/distributed/pull/6614/files/36a60a5e358ea2a5d16597651126ac5892203b01#r952608704
- Fix dashboard https://github.com/dask/distributed/pull/4920#issuecomment-865134055
- Finish https://github.com/dask/distributed/pull/4920
- Improve measurements of bandwidth https://github.com/dask/distributed/pull/6115#issuecomment-1098961206
- Potentially revisit logic about cost_multiplier + level

The short- to mid-term target of this effort should be to drastically reduce the number of steal requests, so that we can afford to spend more time on "good" decisions (e.g. reusing the actual scheduler `decide_worker` logic or something even better).

Even if we want to get rid of work stealing entirely, there is some need for it to balance inhomogeneous workloads and allow cluster upscaling, see https://github.com/dask/distributed/issues/6600. The most valuable component of the current implementation is the handshake mechanism `move_task_request / move_task_confirm`, which ensures consistent transitions without recomputing a key. I believe that by tearing down the infrastructure around this handshake piece by piece, we can iterate towards a more stable and maintainable implementation.
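To illustrate why the handshake is the part worth keeping, here is a minimal toy model of a request/confirm steal. It is a sketch under assumptions, not the code in `distributed/stealing.py`: the `Worker` and `Scheduler` classes, the `"ready"`/`"executing"` states, the `in_flight` bookkeeping and the method signatures are all hypothetical, and the exchange is synchronous, whereas the real one is an asynchronous message exchange. Only the two method names are borrowed from above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker holding tasks in a local state."""

    name: str
    # key -> "ready" (queued, not started) or "executing"
    tasks: dict[str, str] = field(default_factory=dict)

    def handle_steal_request(self, key: str) -> str | None:
        """Victim side: release the task only if it has not started yet."""
        state = self.tasks.get(key)
        if state == "ready":
            del self.tasks[key]
        return state


@dataclass
class Scheduler:
    """Hypothetical scheduler tracking assignments and pending steals."""

    assignment: dict[str, str] = field(default_factory=dict)  # key -> worker name
    in_flight: dict[str, tuple[str, str]] = field(default_factory=dict)

    def move_task_request(self, key: str, victim: Worker, thief: Worker) -> bool:
        """Ask the victim to give up `key`; nothing is reassigned yet."""
        if key in self.in_flight:
            return False  # never run two steals for the same key at once
        self.in_flight[key] = (victim.name, thief.name)
        state = victim.handle_steal_request(key)
        return self.move_task_confirm(key, state, thief)

    def move_task_confirm(self, key: str, state: str | None, thief: Worker) -> bool:
        """Reassign only after the victim reported that it released the task."""
        self.in_flight.pop(key)
        if state != "ready":
            return False  # victim already started (or lost) the task: keep as is
        thief.tasks[key] = "ready"
        self.assignment[key] = thief.name
        return True


victim = Worker("a", {"x": "ready", "y": "executing"})
thief = Worker("b")
scheduler = Scheduler({"x": "a", "y": "a"})

moved_x = scheduler.move_task_request("x", victim, thief)  # queued: can move
moved_y = scheduler.move_task_request("y", victim, thief)  # running: stays put
```

The point of the two-step shape is that the scheduler never reassigns a key on its own guess about the victim's state. In this model a task is either released by the victim and handed to the thief, or left untouched, so it cannot end up running on both workers.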

Previously I approached changes to this logic very carefully due to the lack of repeatable benchmarks. Therefore, I suggest that this effort should utilize benchmarks in [coiled-runtime](https://github.com/coiled/coiled-runtime) to the best of our abilities.

- [x] https://github.com/coiled/coiled-runtime/issues/305
- [x] https://github.com/dask/distributed/issues/7002
- [ ] https://github.com/dask/distributed/issues/7003
- [ ] https://github.com/dask/distributed/issues/7004

Timeboxed push for simplifying work stealing #6993
