Integration tests for adaptive scaling #211

Open · hendrikmakait wants to merge 35 commits into main from adaptive-scaling-tests

Conversation

@hendrikmakait (Member) commented Jul 20, 2022

PR Contents

This PR adds a suite of integration tests for adaptive scaling. Generally, these tests assert that the cluster scales up/down as desired and does so quickly enough. For now, "quickly" is loosely defined as scaling within a few minutes. We may want to iterate on this, but using the Coiled default settings, this appears to be the tightest bound we can use if we want the tests to pass reliably.
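For orientation, a test in this suite looks roughly like the following sketch. This is a simplified illustration, not the PR's actual code: the event-based helpers, the exact assertions, and the workload shown here are assumptions.

```python
import time
import uuid

import pytest
from coiled import Cluster
from dask.distributed import Client


@pytest.mark.stability
@pytest.mark.parametrize("minimum,threshold", [(0, 300), (1, 150)])
def test_scale_up_on_task_load(minimum, threshold):
    """Assert that the cluster scales up to `maximum` workers quickly enough
    once there are more runnable tasks than the current workers can handle."""
    maximum = 10
    with Cluster(
        name=f"test_adaptive_scaling-{uuid.uuid4().hex}",
        n_workers=minimum,
    ) as cluster:
        with Client(cluster) as client:
            cluster.adapt(minimum=minimum, maximum=maximum)
            # Enough long-running, embarrassingly parallel tasks to justify
            # scaling all the way up to `maximum`.
            futures = client.map(lambda i: time.sleep(30) or i, range(100))
            start = time.monotonic()
            client.wait_for_workers(n_workers=maximum, timeout=600)
            duration = time.monotonic() - start
            assert duration < threshold
            del futures
```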

Test Statistics

General

Measure: Latency of scaling (up|down) from the point where the runnable tasks indicate that the cluster should scale.
Sample size: 10
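The latency is measured roughly along these lines (a hypothetical helper for illustration, not the PR's exact code):

```python
import time

from dask.distributed import Client


def wait_for_scaling(client: Client, expected_workers: int, timeout: float) -> float:
    """Return how long it takes until exactly `expected_workers` are connected,
    starting from the moment the runnable tasks warrant scaling up or down."""
    start = time.monotonic()
    while len(client.scheduler_info()["workers"]) != expected_workers:
        if time.monotonic() - start > timeout:
            raise TimeoutError(
                f"Cluster did not reach {expected_workers} workers within {timeout}s"
            )
        time.sleep(1)
    return time.monotonic() - start
```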

test_scale_up_on_task_load

| params | min (s) | max (s) | mean (s) | stdev |
| --- | --- | --- | --- | --- |
| [minimum=1, scatter=False] | 100.32 | 116.69 | 104.88 | 5.51 |
| [minimum=1, scatter=True] | 102.94 | 130.39 | 111.31 | 8.65 |
| [minimum=0, scatter=False] | 196.83 | 202.75 | 198.72 | 1.74 |

[minimum=0, scatter=True] fails with `TimeoutError: No valid workers found` due to dask/distributed#6686.

Note:
Scaling up from an empty cluster takes ~2x as long as from a non-empty cluster. This is because the cluster first spins up a single worker and only scales further once that worker is up.

test_adapt_to_changing_workload

| params | measure | min (s) | max (s) | mean (s) | stdev |
| --- | --- | --- | --- | --- | --- |
| [minimum=0] | scale up from minimum | 97.21 | 106.66 | 100.84 | 3.16 |
| [minimum=0] | scale down to 1 | 295.82 | 300.14 | 298.11 | 1.69 |
| [minimum=0] | scale up from 1 | 101.69 | 117.10 | 106.17 | 4.68 |
| [minimum=0] | scale down to minimum | 296.30 | 300.05 | 298.14 | 1.39 |
| [minimum=1] | scale up from minimum | 101.02 | 110.23 | 106.46 | 2.89 |
| [minimum=1] | scale down to 1 | 296.59 | 314.44 | 299.90 | 5.26 |
| [minimum=1] | scale up from 1 | 110.89 | 189.64 | 125.23 | 24.20 |
| [minimum=1] | scale down to minimum | 295.37 | 299.81 | 297.32 | 1.41 |

Note:

Scaling down takes about 5 minutes (due to interval='5s' and wait_count=60). IMO, while this is not snappy in terms of reaction time to changing workload patterns, it makes sense given the long startup time of containers: essentially, we wait for about 3x the startup time before scaling down, which smooths the pattern and avoids constantly firing up new containers if we over-reacted.
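For context, `interval` and `wait_count` are the knobs behind this ~5 minute delay: the adaptive controller re-evaluates its worker target every `interval` and only scales down after `wait_count` consecutive downscale recommendations, so the delay is roughly `interval * wait_count`. A sketch of how they could be passed, assuming Coiled's `Cluster.adapt` forwards these keyword arguments to distributed's `Adaptive` the same way `distributed.deploy.Cluster.adapt` does:

```python
import uuid

from coiled import Cluster

with Cluster(name=f"adaptive-demo-{uuid.uuid4().hex}", n_workers=1) as cluster:
    cluster.adapt(
        minimum=1,
        maximum=10,
        interval="5s",  # how often the worker target is re-evaluated
        wait_count=60,  # consecutive downscale recommendations before acting (~5 minutes)
    )
```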

test_adapt_to_memory_intensive_workload

TL;DR: The behavior of this test is very unreliable because scaling up takes so long, which leads to significant spilling; the workload then either slows down tremendously or the spilling workers get OOM-killed.

Note:
Due to the long time it takes for this test to scale up, the first worker generates too much data in the beginning, leading to swapping and slowing down the processing.

When running the memory-intensive task in isolation, it takes ~30 s on a cluster of 10 workers. On an adaptive cluster that starts out with a single worker and can scale up to 10 workers, the runtime becomes highly variable. When it does finish within 10 minutes, this appears to be because an OOM error killed the worker that had spilled to disk; as a consequence, the data was reshuffled and the remaining workers were able to progress much faster.

Due to this, I would currently avoid testing that nothing gets recomputed. As things stand, recomputing tasks might be necessary to drive workloads over the finish line where workers spilled too much.
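For illustration, the kind of workload meant here looks roughly like the following made-up example (not the PR's actual test): a dataset that fits comfortably across 10 workers but forces heavy spilling on the single worker that exists while the cluster is still scaling up.

```python
import dask.array as da
from dask.distributed import Client


def run_memory_intensive_workload(client: Client):
    # ~29 GB of float64 data: fine spread over 10 workers, far too much for one.
    arr = da.random.random((60_000, 60_000), chunks=(10_000, 10_000))
    # The rechunk forces all-to-all communication and keeps lots of
    # intermediate data in memory, amplifying spilling on a small cluster.
    result = arr.rechunk((60_000, 1_000)).sum()
    return client.compute(result).result()
```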

@hendrikmakait (Member, Author) commented Jul 26, 2022

test_adapt_to_changing_workload behaves oddly when run multiple times. At some point, it tends to fail with:

KilledWorker                              Traceback (most recent call last)
Input In [2], in <cell line: 2>()
      1 results = []
      2 for _ in range(10):
----> 3     results.append(test_adapt_to_changing_workload(1))

File ~/projects/coiled/coiled-runtime/tests/stability/test_adaptive_scaling.py:133, in test_adapt_to_changing_workload(minimum)
    130 assert adapt.log[-1][1]["status"] == "up"
    132 ev_final_fan_out.set()
--> 133 client.gather(final_fan_out)
    135 # Scale down to minimum
    136 start = time.monotonic()

File /opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/client.py:2174, in Client.gather(self, futures, errors, direct, asynchronous)
   2172 else:
   2173     local_worker = None
-> 2174 return self.sync(
   2175     self._gather,
   2176     futures,
   2177     errors=errors,
   2178     direct=direct,
   2179     local_worker=local_worker,
   2180     asynchronous=asynchronous,
   2181 )

File /opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/utils.py:320, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    318     return future
    319 else:
--> 320     return sync(
    321         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    322     )

File /opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/utils.py:387, in sync(loop, func, callback_timeout, *args, **kwargs)
    385 if error:
    386     typ, exc, tb = error
--> 387     raise exc.with_traceback(tb)
    388 else:
    389     return result

File /opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/utils.py:360, in sync.<locals>.f()
    358         future = asyncio.wait_for(future, callback_timeout)
    359     future = asyncio.ensure_future(future)
--> 360     result = yield future
    361 except Exception:
    362     error = sys.exc_info()

File /opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File /opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/client.py:2037, in Client._gather(self, futures, errors, direct, local_worker)
   2035         exc = CancelledError(key)
   2036     else:
-> 2037         raise exception.with_traceback(traceback)
   2038     raise exc
   2039 if errors == "skip":

KilledWorker: ('clog-6b2594aecb77f0bcba80ce7afa0b4cc7', <WorkerState 'tls://10.0.10.50:44135', name: test_adaptive_scaling-4f590d701c1e4c2e85570f705e4ee3fb-worker-066874a249, status: closed, memory: 0, processing: 50>)

Notably, the worker logs contain this:

Jul 26 13:14:49 ip-10-0-10-50 cloud-init[1267]: 2022-07-26 13:14:49,156 - distributed.core - INFO - Starting established connection
Jul 26 13:14:53 ip-10-0-10-50 systemd-logind[545]: Power key pressed.
Jul 26 13:14:53 ip-10-0-10-50 systemd-logind[545]: System is powering down.
Jul 26 13:14:53 ip-10-0-10-50 systemd[1]: unattended-upgrades.service: Succeeded.
Jul 26 13:14:53 ip-10-0-10-50 systemd[1]: Stopped Unattended Upgrades Shutdown.

@ntabris: Does this mean something to you? It feels like upgrades shouldn't cause an issue with workflows within 20 minutes of firing up a cluster. See https://gitlab.com/coiled/cloud/-/issues/5060

@hendrikmakait (Member, Author)

Another issue I've run into is that Num Workers on cloud.coiled.io is confusing to me. For example, I've had a situation where the cluster decided to scale down from 10 to 7 workers. When that finished, Num Workers continued to show 7 / 9. To me, this signalled that the cluster would scale back to 9 workers, but that never happened and I also did not see it in the logs.

In case it helps, these are the related logs:

 Worker              | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-400b43ea8d             | stopping    | 2022-07-27T10:01:44UTC | Dask process exiting
 Worker              | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-77169aea7c             | stopping    | 2022-07-27T10:01:44UTC | Dask process exiting
 Worker              | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-fd511b087b             | stopping    | 2022-07-27T10:01:45UTC | hendrikmakait requested
 Worker Instance     | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-77169aea7c-instance    | stopping    | 2022-07-27T10:01:45UTC | Instance self terminating
 Worker Instance     | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-400b43ea8d-instance    | stopping    | 2022-07-27T10:01:45UTC | Instance self terminating
 Worker Instance     | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-fd511b087b-instance    | stopping    | 2022-07-27T10:01:46UTC | Instance self terminating
 Worker Instance     | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-400b43ea8d-instance    | stopped     | 2022-07-27T10:02:16UTC | Dask graceful shutdown
 Worker Instance     | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-77169aea7c-instance    | stopped     | 2022-07-27T10:02:16UTC | Dask graceful shutdown
 Worker              | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-400b43ea8d             | stopped     | 2022-07-27T10:02:18UTC | Dask process stopped
 Worker              | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-77169aea7c             | stopped     | 2022-07-27T10:02:18UTC | Dask process stopped
 Worker Instance     | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-fd511b087b-instance    | stopped     | 2022-07-27T10:02:18UTC | hendrikmakait requested
 Worker              | test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-fd511b087b             | stopped     | 2022-07-27T10:02:19UTC | hendrikmakait requested

From what I can tell, the cluster determined that two of the processes exited for "some" reason, while I only requested that one worker stop. This might explain the ominous 9.

hendrikmakait marked this pull request as ready for review on July 28, 2022 16:23
@hendrikmakait (Member, Author)

For some reason, test_adapt_to_changing_workload fails very reliably on CI but hasn't caused much trouble locally.

@hendrikmakait (Member, Author) commented Aug 3, 2022

It looks like the comm is timing out after 30s trying to connect to the scheduler while adapting (https://github.com/coiled/coiled-runtime/runs/7566591563?check_suite_focus=true#step:7:757). It's odd to see this happen so reliably.

@hendrikmakait (Member, Author) commented Aug 3, 2022

CI failures can be grouped as follows:

1. `OSError: Timed out trying to connect to [...] after 30 s` (during `_wait_for_workers`)
2. `distributed.comm.core.CommClosedError: in <TLS (closed) ConnectionPool.identity local=[...] remote=[...]>: Stream is closed` (during `_wait_for_workers`)

[ubuntu-latest, Python 3.7, Runtime 0.0.3] test_adapt_to_changing_workload[1] seems to contain both:
https://github.com/coiled/coiled-runtime/runs/7566592734?check_suite_focus=true#step:7:554
https://github.com/coiled/coiled-runtime/runs/7566592734?check_suite_focus=true#step:7:581

hendrikmakait force-pushed the adaptive-scaling-tests branch from 23b5a85 to 8a0d566 on August 4, 2022 12:33
@shughes-uk (Contributor)

Do you have any cluster ids for this? I'd like to see what the AWS shutdown reason was

@shughes-uk (Contributor)

I also think the unattended-upgrades thing is a red herring; I suspect this shutdown was either requested by adaptive scaling or AWS killed your instance for other reasons.

@ntabris (Member) commented Aug 8, 2022

@shughes-uk looks like this might be the cluster in question:

https://cloud.coiled.io/dask-engineering/clusters/44038/details

Where do we capture the AWS shutdown reason?

@shughes-uk (Contributor) commented Aug 9, 2022

It would be in the instance stop reason; it looks like these were shut down intentionally by adaptive scaling.

@ntabris (Member) commented Aug 9, 2022

> It would be in the instance stop reason; it looks like these were shut down intentionally by adaptive scaling.

@shughes-uk could you say in more detail what's telling you this? Maybe I should already know but this isn't clear to me.

Looking at logs, I see this on the scheduler:

Jul 27 10:56:44 ip-10-0-15-96 cloud-init[1268]: 2022-07-27 10:56:44,660 - distributed.scheduler - INFO - Remove client Client-e0203af0-0d91-11ed-bf83-be79fecea867
Jul 27 10:56:44 ip-10-0-15-96 cloud-init[1268]: 2022-07-27 10:56:44,734 - distributed.scheduler - INFO - Remove client Client-e0203af0-0d91-11ed-bf83-be79fecea867
Jul 27 10:56:44 ip-10-0-15-96 cloud-init[1268]: 2022-07-27 10:56:44,740 - distributed.scheduler - INFO - Close client connection: Client-e0203af0-0d91-11ed-bf83-be79fecea867
Jul 27 10:56:49 ip-10-0-15-96 systemd-logind[550]: Power key pressed.
Jul 27 10:56:49 ip-10-0-15-96 systemd-logind[550]: System is powering down.
Jul 27 10:56:49 ip-10-0-15-96 systemd[1]: unattended-upgrades.service: Succeeded.

and this on one of the first workers to go down (test_adaptive_scaling-fbb0985b93b0441b9789022bd770ed0a-worker-77169aea7c):

Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]: 2022-07-27 09:56:11,038 - distributed.worker - ERROR - Worker stream died during communication: tls://10.0.6.231:42563
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]: Traceback (most recent call last):
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:   File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:     n = await stream.read_into(chunk)
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]: tornado.iostream.StreamClosedError: Stream is closed
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]: The above exception was the direct cause of the following exception:
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]: Traceback (most recent call last):
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:   File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 3296, in gather_dep
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:     response = await get_data_from_worker(
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:   File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 4680, in get_data_from_worker
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:     return await retry_operation(_get_data, operation="get_data_from_worker")
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:   File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 381, in retry_operation
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:     return await retry(
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:   File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 366, in retry
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:     return await coro()
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:   File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 4660, in _get_data
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:     response = await send_recv(
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:   File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 748, in send_recv
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:     response = await comm.read(deserializers=deserializers)
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:   File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 242, in read
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:     convert_stream_closed_error(self, e)
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:   File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 150, in convert_stream_closed_error
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]:     raise CommClosedError(f"in {obj}: {exc}") from exc
Jul 27 09:56:11 ip-10-0-6-130 cloud-init[1262]: distributed.comm.core.CommClosedError: in <TLS (closed) Ephemeral Worker->Worker for gather local=tls://10.0.6.130:40162 remote=tls://10.0.6.231:42563>: Stream is closed
Jul 27 09:58:57 ip-10-0-6-130 PackageKit: daemon quit
Jul 27 09:58:57 ip-10-0-6-130 systemd[1]: packagekit.service: Succeeded.
Jul 27 09:59:02 ip-10-0-6-130 gnome-shell[1467]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
Jul 27 09:59:02 ip-10-0-6-130 gnome-shell[1467]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
Jul 27 10:01:42 ip-10-0-6-130 cloud-init[1262]: 2022-07-27 10:01:42,941 - distributed.worker - INFO - Stopping worker at tls://10.0.6.130:41201
Jul 27 10:01:42 ip-10-0-6-130 cloud-init[1262]: 2022-07-27 10:01:42,957 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-bf7290fe-6e60-40f9-b2eb-83c47ba061a8 Address tls://10.0.6.130:41201 Status: Status.closing
Jul 27 10:01:42 ip-10-0-6-130 cloud-init[1262]: 2022-07-27 10:01:42,964 - distributed.nanny - INFO - Worker closed
Jul 27 10:01:43 ip-10-0-6-130 cloud-init[1262]: 2022-07-27 10:01:43,301 - distributed.nanny - INFO - Closing Nanny at 'tls://10.0.6.130:42941'.

On the control plane side we don't see the dask worker process (and then instance) stopping until 10:01:44UTC.

None of this looks to me like a clean shutdown of a worker by adaptive scaling, but it's possible the logs are just very misleading or that I don't know what to look for.

hendrikmakait force-pushed the adaptive-scaling-tests branch from feadd86 to a5c05fa on August 18, 2022 14:59
@hendrikmakait (Member, Author)

> CI failures can be grouped as follows:

This one has been solved: I had missed an assert inside a function that was sent over to the worker and rewritten by pytest. This also explains why it worked in a notebook but not when run with pytest.
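For posterity, the failure mode was along these lines (a simplified, hypothetical reconstruction; `check_worker_count` is a made-up name): a helper defined in the test module contained an `assert` and was shipped to the cluster, so it ran as pytest-rewritten code there, behaving differently than the same function defined in a notebook.

```python
def check_worker_count(dask_scheduler, n):
    # Because this function lives in a test module, pytest rewrites its
    # `assert` before the function is pickled and shipped off, so its
    # behavior under `pytest` differs from an interactive notebook session.
    assert len(dask_scheduler.workers) >= n


# Hypothetical usage inside a test:
# client.run_on_scheduler(check_worker_count, n=10)
```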

@hendrikmakait (Member, Author)

> Do you have any cluster ids for this? I'd like to see what the AWS shutdown reason was

Sorry for the late reply; this appears to be the cluster with the ominous `Power key pressed.` log message:
https://cloud.coiled.io/dask-engineering/clusters/43743/details

From a first look at the logs, it did pretty much the same thing as the cluster @ntabris found.

@hendrikmakait (Member, Author)

test_adapt_to_changing_workload shows some interesting flaky behavior on CI and throws KilledWorker.

hendrikmakait requested a review from fjetter on August 19, 2022 15:25
@gjoseph92 (Contributor) left a comment

An overall question: the intervals we're waiting for adaptive scaling to happen (even scaling down) are pretty long (like 420 seconds). Would it be possible to set some scaling parameters differently so these would run faster? I also get that the slow runtime is broadly part of why a lot of these will be skipped (and maybe, even more broadly, why adaptive scaling isn't very useful to actually use right now).

minimum = 0
maximum = 10
fan_out_size = 100
with Cluster(
Contributor:

Would it be possible to reuse the same cluster via a module-level fixture here, or is it important that both tests use standalone clusters?

@hendrikmakait (Member, Author):

I think I'd prefer a clean cluster here to make sure I test its scaling behavior in isolation. On top of that, I'd at the very least have to spin the workers up again so that each test starts with maximum workers, which would take quite some time and diminish the benefit of reusing a cluster.
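For reference, the module-level fixture suggested above could look something like this hypothetical sketch (`shared_cluster` is a made-up name; the PR deliberately does not use this pattern, for the reasons given):

```python
import uuid

import pytest
from coiled import Cluster
from dask.distributed import Client


@pytest.fixture(scope="module")
def shared_cluster():
    # One cluster shared by all tests in the module; trades isolation for speed.
    with Cluster(
        name=f"test_adaptive_scaling-{uuid.uuid4().hex}",
        n_workers=10,
    ) as cluster:
        yield cluster


def test_using_shared_cluster(shared_cluster):
    with Client(shared_cluster) as client:
        assert client.scheduler_info()["workers"]
```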

"""
maximum = 10
with Cluster(
name=f"test_adaptive_scaling-{uuid.uuid4().hex}",
Contributor:

@hendrikmakait (Member, Author):

Thanks, I didn't know you had added this in the meantime!


@pytest.mark.stability
@pytest.mark.parametrize("minimum,threshold", [(0, 300), (1, 150)])
def test_scale_up_on_task_load(minimum, threshold):
Contributor:

Do we want to benchmark these as well? Not sure if they're appropriate to benchmark or not.

@hendrikmakait (Member, Author):

I discussed this with @fjetter, and for now the idea is to focus on stability, but eventually we should benchmark these, yes. This test in particular might also be a candidate for moving to the benchmarks as is (maybe referencing it in this file with a module-level docstring).

fan_out = [clog(i, ev=ev_fan_out) for i in range(fan_out_size)]
barrier = clog(delayed(sum)(fan_out), ev=ev_barrier)
final_fan_out = [
clog(i, ev=ev_final_fan_out, barrier=barrier)
Contributor:

What's the barrier used for here?

@hendrikmakait (Member, Author):

The idea is to start the final_fan_out tasks once the barrier task, i.e., scaling down, is done. Would replacing barrier with reduction make this clearer?
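For context, a `clog`-style helper along these lines would give that ordering. This is a hypothetical sketch of what the helper might look like, not the PR's actual implementation:

```python
from dask import delayed
from dask.distributed import Event


@delayed
def clog(x, ev: Event, barrier=None):
    # `barrier` is unused except as an upstream dependency that orders execution:
    # none of the final_fan_out tasks can start before the reduction has finished.
    ev.wait()  # block until the test sets the event, keeping the task "running"
    return x
```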

# Initialize array on workers to avoid adaptive scale-down
arr = client.persist(
da.random.random(
(8 * 1024, 8 * 1024, 16, 16), chunks=(8 * 1024, 128, 2, 2)
Contributor:

Not at all necessary, but maybe it would be possible to write this in terms of scaled_array_shape?
https://github.com/coiled/coiled-runtime/blob/78a8614b203971494520d273619383c08b002118/tests/utils_test.py#L14-L36

@hendrikmakait (Member, Author)

> An overall question: the intervals we're waiting for adaptive scaling to happen (even scaling down) are pretty long (like 420 seconds). Would it be possible to set some scaling parameters differently so these would run faster?

Fair point. With the latest shift in perspective toward treating these tests as stability-only, I'll take another pass at setting parameters to speed things up. This will likely only help for scaling down, though.
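One hypothetical way to do that pass: distributed's `Adaptive` falls back to the `distributed.adaptive.*` config when no explicit arguments are given, so a faster test run could lower `wait-count`. Whether this configuration actually reaches whatever drives adaptivity on Coiled (the scheduler/control plane rather than the local client) is an assumption here.

```python
import dask

# Shrink the scale-down delay from roughly 5 minutes (5s * 60) to ~30 seconds (5s * 6).
with dask.config.set({
    "distributed.adaptive.interval": "5s",
    "distributed.adaptive.wait-count": 6,
}):
    ...  # create the cluster / call cluster.adapt() within this context
```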

@hendrikmakait (Member, Author)

xref: dask/distributed#6962

@hendrikmakait (Member, Author)

CI failure for Python 3.8, Runtime 0.1.0 (https://github.com/coiled/coiled-runtime/runs/8195319805?check_suite_focus=true#step:6:102) appears to be caused by #306:

Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]: 2022-09-05 15:38:18,830 - distributed.core - INFO - Starting established connection
--
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]: 2022-09-05 15:38:18,923 - distributed.core - ERROR - keywords must be strings
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]: Traceback (most recent call last):
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 667, in handle_stream
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:     handler(**merge(extra, msg))
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 1930, in _
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:     event = cls(**kwargs)
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "<string>", line 13, in __init__
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "/opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 627, in __post_init__
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:     self.run_spec = SerializedTask(**self.run_spec)
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]: TypeError: keywords must be strings
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]: 2022-09-05 15:38:18,926 - distributed.worker - ERROR - keywords must be strings
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]: Traceback (most recent call last):
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 197, in wrapper
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:     return await method(self, *args, **kwargs)
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 1280, in handle_scheduler
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:     await self.handle_stream(comm)
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 667, in handle_stream
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:     handler(**merge(extra, msg))
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 1930, in _
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:     event = cls(**kwargs)
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "<string>", line 13, in __init__
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:   File "/opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 627, in __post_init__
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]:     self.run_spec = SerializedTask(**self.run_spec)
Sep  5 15:38:18 ip-10-0-1-93 cloud-init[1128]: TypeError: keywords must be strings

Successfully merging this pull request may close these issues: Integration tests: Adaptive scaling