Integration tests for adaptive scaling #211
Conversation
Notably, the worker logs contain this:
@ntabris: Does this mean something to you? It feels like upgrades shouldn't cause an issue with workflows within 20 minutes of firing up a cluster. See https://gitlab.com/coiled/cloud/-/issues/5060
Another issue I've run into is that [...]. In case it helps, these are the related logs:
From what I can tell, the cluster determined that two of the processes exited for "some" reason, even though I only requested that one worker be stopped. This might explain the ominous 9.
For some reason [...]
It looks like the [...]
CI failures can be grouped as follows:
(force-pushed from 23b5a85 to 8a0d566)
Do you have any cluster IDs for this? I'd like to see what the AWS shutdown reason was.
I also think the unattended-upgrades thing is a red herring; I suspect this shutdown was either requested by adaptive scaling or AWS killed your instance for other reasons.
@shughes-uk looks like this might be the cluster in question: https://cloud.coiled.io/dask-engineering/clusters/44038/details. Where do we capture the AWS shutdown reason?
It would be in the instance stop reason; it looks like these were shut down intentionally by adaptive scaling.
@shughes-uk could you say in more detail what's telling you this? Maybe I should already know, but this isn't clear to me. Looking at the logs, I see this on the scheduler:
and this on one of the first workers to go down ([...]):
On the control-plane side we don't see the dask worker process (and then the instance) stopping until 10:01:44 UTC. None of this looks to me like a clean shutdown of a worker by adaptive scaling, but it's possible the logs are just very misleading / I don't know what to look for.
(force-pushed from feadd86 to a5c05fa)
This one has been solved; I missed an [...].
Sorry for the late reply; this appears to be the cluster with the ominous [...]. From a first look into the logs, it seems like it did pretty much the same as the cluster @ntabris found.
An overall question: the intervals we're waiting for adaptive scaling to happen (even scaling down) are pretty slow, on the order of 420 seconds. Would it be possible to set some scaling parameters differently so these tests would run faster? I also get that the slow runtime is broadly part of why a lot of these will be skipped (and maybe, even more broadly, why adaptive scaling isn't very useful to actually use right now).
minimum = 0
maximum = 10
fan_out_size = 100
with Cluster(
Would it be possible to reuse the same cluster via a module-level fixture here, or is it important that both tests use standalone clusters?
I think I'd prefer a clean cluster here to make sure that I test its scaling behavior in isolation. In addition to that, I'd at the very least have to spin up workers again to end up with maximum workers in the beginning, which might diminish the benefit we gain from reusing a cluster since that will take up quite some time.
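For reference, a minimal sketch of what a module-scoped cluster fixture could look like, assuming coiled.Cluster and dask.distributed.Client as used elsewhere in these tests (fixture names here are hypothetical, not from this PR):

```python
import uuid

import pytest
from coiled import Cluster
from dask.distributed import Client


@pytest.fixture(scope="module")
def shared_cluster():
    # One cluster shared by all tests in this module; each test would still
    # need to reset the worker count itself, which is the drawback noted above.
    with Cluster(
        name=f"test_adaptive_scaling-{uuid.uuid4().hex}",
        n_workers=10,
    ) as cluster:
        yield cluster


@pytest.fixture
def shared_client(shared_cluster):
    with Client(shared_cluster) as client:
        yield client
```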
""" | ||
maximum = 10 | ||
with Cluster( | ||
name=f"test_adaptive_scaling-{uuid.uuid4().hex}", |
Thanks, didn't know you had added this in the meantime!
@pytest.mark.stability
@pytest.mark.parametrize("minimum,threshold", [(0, 300), (1, 150)])
def test_scale_up_on_task_load(minimum, threshold):
Do we want to benchmark these as well? Not sure if they're appropriate to benchmark or not.
I discussed this with @fjetter and for now the idea seems to be focusing on stability, but eventually we should benchmark these, yes. This test in particular might also be a candidate that could be moved to the benchmarks as is (maybe referencing it in this file with a module-level docstring).
fan_out = [clog(i, ev=ev_fan_out) for i in range(fan_out_size)]
barrier = clog(delayed(sum)(fan_out), ev=ev_barrier)
final_fan_out = [
    clog(i, ev=ev_final_fan_out, barrier=barrier)
What's the barrier used for here?
The idea is to start the final_fan_out tasks once the barrier task, i.e., scaling down, is done. Would replacing barrier with reduction make this clearer?
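Since clog is not shown in this hunk, here is a hypothetical sketch of the pattern being described: a delayed task that blocks on a distributed Event and accepts an otherwise unused barrier argument purely to create a dependency edge.

```python
from dask import delayed


@delayed
def clog(x, ev, barrier=None):
    # `barrier` is never read; passing the reduction result in simply makes
    # this task depend on it, so the final fan-out only becomes runnable
    # once the reduction (and the scale-down phase it drives) has finished.
    ev.wait()  # block until the test sets the distributed.Event
    return x
```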
# Initialize array on workers to avoid adaptive scale-down
arr = client.persist(
    da.random.random(
        (8 * 1024, 8 * 1024, 16, 16), chunks=(8 * 1024, 128, 2, 2)
Not at all necessary, but maybe it would be possible to write this in terms of scaled_array_shape? https://github.com/coiled/coiled-runtime/blob/78a8614b203971494520d273619383c08b002118/tests/utils_test.py#L14-L36
Fair point. With the latest shift of perspective on these tests as focusing on stability only, I'll give setting parameters to speed things up another pass. This will likely only benefit us in the case of scaling down, though.
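As a rough illustration (not the code in this PR), assuming Coiled's Cluster.adapt forwards keyword arguments to distributed's Adaptive, tightening those parameters could look something like the following; the values are purely illustrative.

```python
import uuid

from coiled import Cluster

with Cluster(name=f"test_adaptive_scaling-{uuid.uuid4().hex}", n_workers=1) as cluster:
    # Re-evaluate scaling decisions every second instead of every 5 seconds,
    # and require fewer consecutive "scale down" votes before retiring a
    # worker, so scale-down happens in seconds rather than minutes.
    cluster.adapt(
        minimum=0,
        maximum=10,
        interval="1s",
        wait_count=6,
    )
```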
xref: dask/distributed#6962
CI failure for Python 3.8, Runtime 0.1.0 (https://github.com/coiled/coiled-runtime/runs/8195319805?check_suite_focus=true#step:6:102) appears to be caused by #306:
distributed=2022.6.0, used by v0.1.0, contains regression #306.
PR Contents
This PR adds a suite of integration tests for adaptive scaling. Generally, these tests assert that the cluster scales up/down as desired and does so quickly enough. For now, quickly is quite loosely defined as scaling within a few minutes. We may want to iterate on this, but using the Coiled default settings, this appears to be the closest bound we should use if we want the tests to pass reliably.
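As a rough sketch of the kind of check these tests perform (helper name and timings here are illustrative, not the exact code in this PR):

```python
import time

from dask.distributed import Client


def wait_for_worker_count(client: Client, expected: int, timeout: float = 420.0) -> float:
    """Block until the scheduler reports `expected` workers; return the elapsed seconds."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if len(client.scheduler_info()["workers"]) == expected:
            return time.monotonic() - start
        time.sleep(1)
    raise TimeoutError(f"Cluster did not reach {expected} workers within {timeout}s")
```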
Test Statistics
General
Measure: Latency of scaling (up|down) from the point where the runnable tasks indicate that the cluster should scale.
Sample size: 10
test_scale_up_on_task_load
[minimum=0, scatter=True]
Fails with TimeoutError: No valid workers found due to dask/distributed#6686.
Note:
Scaling up from an empty cluster takes ~2x as long as from a non-empty cluster. This is due to the fact that the cluster first spins up a single worker and only scales further once that is done.
test_adapt_to_changing_workload
Note:
Scaling down takes about 5 minutes (due to interval='5s' and wait_count=60, i.e., 60 consecutive 5-second checks, roughly 300 s, before a worker is retired). IMO, while this is not snappy in terms of reaction time to changing workload patterns, it makes sense given the long startup time of containers. Essentially, we wait for about 3x the startup time before scaling down to smooth out the pattern and avoid constantly firing up new containers if we over-reacted.
test_adapt_to_memory_intensive_workload
TL;DR: The behavior of this test is very unreliable due to the long duration of scaling up, which leads to significant spilling and either the workload slowing down tremendously or spilling workers getting OOM-killed.
Note:
Due to the long time it takes for this test to scale up, the first worker generates too much data in the beginning, leading to swapping and slowing down the processing.
When running the memory-intensive task in isolation, it takes ~30 s on a cluster that consists of 10 workers. On an adaptive cluster that starts out with a single worker and can scale up to 10 workers, this becomes highly variable. If it finishes within 10 minutes, it appears to be due to an OOM error killing the worker that had spilled to disk. As a consequence, the data was reshuffled and workers were able to progress much faster.
Due to this, I would currently avoid testing that nothing gets recomputed. As things stand, recomputing tasks might be necessary to drive workloads over the finish line where workers spilled too much.