Skip to content

Stealing balance not accounting for recent decisions #7002

Closed
@fjetter

Description

@fjetter

The Stealing.balance implementation is currently using a greedy for-loop with multiple early termination conditions. These break statements are de-facto dead code causing too greedy and potentially destructive work stealing.

Stealing is highly sensitive to the definition of idle and saturated workers which is defined in Scheduler.check_idle_saturated.

Explanation of saturated

See also https://github.com/dask/distributed/pull/6614/files/36a60a5e358ea2a5d16597651126ac5892203b01#r952608704

I think the new definition of idle to incorporate whether there are free slots makes a lot of sense.

I didn't change the saturated definition because I don't honestly understand what it's trying to measure:

pending: float = occ * (p - nc) / (p * nc)
if 0.4 < pending > 1.9 * avg:
saturated.add(ws)

off-topic but I spent some time on this before and it's worth sharing

you can rewrite this statement to

pending: float = occ * (p - nc) / (p * nc)
pending = occ / nc - occ / p
pending = occupancy / n_threads - occupancy / processing
pending = avg_occ_per_thread - avg_occ_currently_processing

i.e. pending is "average occupancy per thread that is in state ready / not, yet executing"
or put into context of stealing, which is the only place that saturated matters, "occupancy per thread that could be stolen"
I guess what's written there is more stable in terms of floating point arithmetic.

So, the definition of saturated is spelled out something like

A worker is classified as saturated if it's ready occupancy is at least 0.4s per thread and at least 1.9 times the average cluster wide occupancy.

i.e. "there is something queued up / in ready and it is about two times more than average"

Simplified, this is how the algorithm is intended to work:
Stealing.balance iterates over saturated workers in descending occupancy and victimizes them for stealing. An idle worker will act as the thief and work will be rebalanced. This continues until there are no longer saturated workers or there are no longer idle workers.

However, this pseudo algorithm is no longer implemented. The current code only shows hints at how this was formerly achieved and it likely strongly depended on dicts, sets, etc. being passed by reference such that any changes to the fundamental collections are immediately propagated to the balance loop which then allows for early termination.

Stealing.maybe_move_task is calling Scheduler. check_idle_saturated with an additional keyword occ which is only used here. This keyword allows to override the actual occupancy of the worker which would allow us to reclassify thieves and victims based on their combined_occupancy, i.e. the Scheduler.idle and Scheduler.saturated collections are updated under the assumption that in-flight tasks are stolen successfully.

s.check_idle_saturated(victim, occ=occ_victim)
s.check_idle_saturated(thief, occ=occ_thief)

Assuming that idle in the balance for-loop was referencing Scheduler.idle and not be a derivative, this would allow us to break out of the loop at various places, e.g. here, here or here. However, the loop variable idle is always defined as a list since #5665 and even before is sometimes a list due to sorting ever since #754
This effectively renders any if not idle; break statements dead code. This can be confirmed by codecoverage measurements

image

Going back to #750 it appears that a similar mechanism existed for early termination if saturated became empty. The implementation changed drastically and I do not see any leftover termination mechanism that ensures a potential victim is still saturated if one accounts for combined_occupancy.

I suspect this lack of early termination to be a major contributor to at least the following tickets

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething is brokenp1Affects a large population and inhibits workregressionschedulerschedulingstabilityIssue or feature related to cluster stability (e.g. deadlock)stealing

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions