Poor adaptive target for empty clusters #6962

Open
gjoseph92 opened this issue Aug 26, 2022 · 2 comments
Labels: adaptive (All things relating to adaptive scaling), enhancement (Improve existing functionality or make things work better), scheduling

Comments

gjoseph92 (Collaborator) commented Aug 26, 2022

If a cluster hasn't run any work yet, it will only recommend 1 worker initially, regardless of how many tasks are queued on the scheduler:

# gen_cluster, inc, and async_wait_for come from distributed's test utilities
from distributed.utils_test import async_wait_for, gen_cluster, inc


@gen_cluster(
    client=True,
    nthreads=[],
    config={"distributed.scheduler.default-task-durations": {"inc": 1}},
)
async def test_adaptive_target_empty_cluster(c, s):
    assert s.adaptive_target() == 0

    f = c.submit(inc, -1)
    await async_wait_for(lambda: s.tasks, timeout=5)
    assert s.adaptive_target() == 1

    fs = c.map(inc, range(1000))
    await async_wait_for(lambda: len(s.tasks) == len(fs) + 1, timeout=5)
    print(s.total_occupancy)
>   assert s.adaptive_target() > 1
E   AssertionError: assert 1 > 1

The scheduler's adaptive target is based on its total_occupancy. But occupancy is only updated once tasks are actually scheduled onto workers (moved into processing). So if there are no workers, no tasks can be scheduled, and occupancy stays at 0 even with thousands of tasks sitting in unrunnable.
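
For context, a simplified sketch of the occupancy-based part of the target (paraphrased, not the exact source; target_duration comes from the distributed.adaptive.target-duration config, 5s by default at the time of writing):

import math

def cpu_target(total_occupancy: float, target_duration: float = 5.0) -> int:
    # total_occupancy only sums the expected runtime of tasks already assigned
    # to workers (i.e. in processing); unrunnable/queued tasks contribute nothing.
    return math.ceil(total_occupancy / target_duration)

cpu_target(0.0)  # -> 0 with no workers, no matter how many tasks are queued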

I would expect total_occupancy to also include the expected runtime of all unrunnable/queued tasks. That would usually give faster scale-up from zero. Some deployment systems can be quite slow to scale: you might wait a few minutes to get 1 worker, only then realize you need more, and wait a few minutes again. It would be better to ask for more up front.
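
Something roughly like this, purely as a hypothetical sketch (occupancy_for_scaling is not an existing method; it assumes unrunnable holds TaskState objects and that get_task_duration falls back to default-task-durations or the default guess):

def occupancy_for_scaling(self) -> float:
    # Occupancy of tasks already assigned to workers...
    occ = self.total_occupancy
    # ...plus the expected runtime of tasks that cannot run yet because there
    # are no (suitable) workers, so scale-up from zero sees the real backlog.
    for ts in self.unrunnable:
        occ += self.get_task_duration(ts)
    return occ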

This is what ensures we get at least one worker; otherwise we'd never scale up at all:

if self.unrunnable and not self.workers:
    cpu = max(1, cpu)
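
In the reproducer above, that guard is exactly what fires: total_occupancy is still 0 with no workers, so the occupancy-based target is 0, the guard bumps it to 1, and assert s.adaptive_target() > 1 fails with 1 > 1 even though 1001 tasks are sitting in unrunnable.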

gjoseph92 added the enhancement and scheduling labels Aug 26, 2022
fjetter added the adaptive label Aug 26, 2022
jacobtomlinson (Member) commented

Yeah, I've thought about this before. Typically we recommend folks set a minimum of 1 worker anyway, because scale-up latency is probably significant (especially for dask-cloudprovider or dask-jobqueue), so it's best to have 1 worker idling that can pick up tasks immediately; the scheduler can then make a more realistic recommendation.
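
(For reference, a minimal sketch of that recommendation using the deployment-agnostic adapt API; the cluster object stands in for whatever dask-cloudprovider / dask-jobqueue cluster class is in use, and maximum is arbitrary:)

# cluster = FargateCluster(...) / SLURMCluster(...) / ...
cluster.adapt(minimum=1, maximum=20)  # keep one worker warm so the scheduler has occupancy to work from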

Do you have a specific use case where scaling to 0 is desirable?

gjoseph92 (Collaborator, Author) commented

I pointed it out more as an edge case. I definitely think scaling from 0 is desirable if it works. It's just odd to allow it but then not handle that case properly.

Task rebalancing during scale-up is usually pretty bad anyway, so I wonder how much of a difference a 0 vs 1 minimum would even make (if we had a good scale-up metric from 0).
