Skip to content

Failing test_adapt_to_changing_workload #266

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
hendrikmakait opened this issue Aug 19, 2022 · 0 comments
Open

Failing test_adapt_to_changing_workload #266

hendrikmakait opened this issue Aug 19, 2022 · 0 comments

Comments

@hendrikmakait
Copy link
Member

This test implemented in #211 shows flaky behavior on CI (https://github.com/coiled/coiled-runtime/runs/7915655906?check_suite_focus=true#step:6:107). From the logs, it looks like the worker is restarting after being being shutdown by the scheduler due to adaptive scaling. The worker is then assigned new tasks and for some reason removed again at 10:00:44, at which point it has 19 tasks in processing that are then marked as failed. The worker logs don't seem to include much useful information, in fact their last timestamp is at 10:00:34.

Scheduler logs:
(Note: I removed interleaving ones that did not seem relevant)

Aug 19 09:55:07 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 09:55:07,950 - distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.3.187:43629', name: test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-6524cda773, status: init, memory: 0, processing: 0>
Aug 19 10:00:14 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:14,344 - distributed.scheduler - INFO - Retire worker names ('test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-6524cda773', 'test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-e75620b317', 'test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-c934f9680e', 'test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-230ce99634', 'test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-a853bf1b66', 'test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-7e392fb66b', 'test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-992711e88b', 'test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-1e55f55428', 'test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-d15a737502')
Aug 19 10:00:14 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:14,413 - distributed.scheduler - INFO - Remove worker <WorkerState 'tls://10.0.3.187:43629', name: test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-6524cda773, status: closing_gracefully, memory: 0, processing: 0>
Aug 19 10:00:15 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:15,894 - distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.3.187:40733', name: test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-6524cda773, status: init, memory: 0, processing: 0>

## We decide to remove this worker with 19 tasks in processing

Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,553 - distributed.scheduler - INFO - Remove worker <WorkerState 'tls://10.0.3.187:40733', name: test_adaptive_scaling-4762fcb51b5f4682b7a2ea182d9279e4-worker-6524cda773, status: running, memory: 1, processing: 19>
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,554 - distributed.core - INFO - Removing comms to tls://10.0.3.187:40733

## 19 tasks are getting marked as failed

Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,554 - distributed.scheduler - INFO - Task clog-fab0f037-8d03-4dc6-b526-1cfd5f1da1cd marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,555 - distributed.scheduler - INFO - Task clog-206d6d99-c3dc-456a-bb9a-1e957258a8a4 marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,556 - distributed.scheduler - INFO - Task clog-2a71f486-bb7f-4047-ab99-d467092357c1 marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,556 - distributed.scheduler - INFO - Task clog-056c8917-6cae-4916-b0e6-9a71e100c72a marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,557 - distributed.scheduler - INFO - Task clog-f176afb3-3e88-45cd-99c8-614802ef6f9b marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,557 - distributed.scheduler - INFO - Task clog-1a96245b-24fa-465a-8311-5f19bd20c76f marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,557 - distributed.scheduler - INFO - Task clog-7ff270ed-131a-4626-9aef-5bdf7c166066 marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,557 - distributed.scheduler - INFO - Task clog-bbfbe82e-70fc-4c48-a561-3311e742ba90 marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,558 - distributed.scheduler - INFO - Task clog-5fb6ba82-1beb-43a6-9605-9c0f23886bfe marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,558 - distributed.scheduler - INFO - Task clog-82d4601d-4f86-4b1b-9a07-f0fbda35711d marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,558 - distributed.scheduler - INFO - Task clog-05754c8b-67cd-4396-92d4-8e6303ce259a marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,559 - distributed.scheduler - INFO - Task clog-04895ba3-aa7e-4811-8788-a74606635d58 marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,559 - distributed.scheduler - INFO - Task clog-17f86541-2346-40ce-912f-5031a57cb57c marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,559 - distributed.scheduler - INFO - Task clog-4eff4416-97ad-442b-b86a-1afc83b4d17c marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,559 - distributed.scheduler - INFO - Task clog-a3d0bf98-ae36-47ee-b079-e3650ab2bb9c marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,559 - distributed.scheduler - INFO - Task clog-22942b17-6bfc-46c2-b8bb-301c76b869c0 marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,559 - distributed.scheduler - INFO - Task clog-37630efe-b6a7-48be-ac72-81ca525ab4e4 marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,560 - distributed.scheduler - INFO - Task clog-ef713a34-613a-4905-aee8-c46a6b749143 marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,560 - distributed.scheduler - INFO - Task clog-94253182-d6c3-46bd-aba8-8e9104dbf1bb marked as failed because 0 workers died while trying to run it
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,560 - distributed.scheduler - INFO - Remove client Client-worker-b9973b50-1fa5-11ed-8028-02e28ee669ec
Aug 19 10:00:44 ip-10-0-4-136 cloud-init[1168]: 2022-08-19 10:00:44,562 - distributed.scheduler - INFO - Close client connection: Client-worker-b9973b50-1fa5-11ed-8028-02e28ee669ec

Worker logs:

Aug 19 09:55:07 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 09:55:07,953 - distributed.core - INFO - Starting established connection
Aug 19 09:55:12 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 09:55:12,996 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
Aug 19 09:55:12 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 09:55:12,996 - distributed.worker - INFO - coiled updating aws sts token environment variables
Aug 19 09:55:13 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 09:55:13,296 - distributed.worker - INFO - coiled cleared s3fs/boto session

## Note that the worker shuts down because the scheduler was not aware of it anymore

Aug 19 10:00:14 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:14,505 - distributed.worker - ERROR - Scheduler was unaware of this worker 'tls://10.0.3.187:43629'. Shutting down.
Aug 19 10:00:14 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:14,505 - distributed.worker - INFO - Stopping worker at tls://10.0.3.187:43629
Aug 19 10:00:14 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:14,521 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-caa22ba5-6e42-45a3-bdd4-74ee69182048 Address tls://10.0.3.187:43629 Status: Status.closing
Aug 19 10:00:14 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:14,535 - distributed.nanny - INFO - Worker closed
Aug 19 10:00:14 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:14,537 - distributed.nanny - ERROR - Worker process died unexpectedly

## The worker restarts

Aug 19 10:00:14 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:14,807 - distributed.nanny - WARNING - Restarting worker
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,632 - distributed.worker - INFO -       Start worker at:     tls://10.0.3.187:40733
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,632 - distributed.worker - INFO -          Listening to:     tls://10.0.3.187:40733
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,632 - distributed.worker - INFO -          dashboard at:           10.0.3.187:40285
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,632 - distributed.worker - INFO - Waiting to connect to:      tls://10.0.4.136:8786
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,633 - distributed.worker - INFO - -------------------------------------------------
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,633 - distributed.worker - INFO -               Threads:                          2
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,633 - distributed.worker - INFO -                Memory:                   3.28 GiB
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,633 - distributed.worker - INFO -       Local Directory: /scratch/dask-worker-space/worker-kac2mfar
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,633 - distributed.worker - INFO - -------------------------------------------------
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,896 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
Aug 19 10:00:15 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:15,897 - distributed.worker - INFO - coiled updating aws sts token environment variables
Aug 19 10:00:16 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:16,106 - distributed.worker - INFO - coiled cleared s3fs/boto session
Aug 19 10:00:16 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:16,107 - distributed.worker - INFO -         Registered to:      tls://10.0.4.136:8786
Aug 19 10:00:16 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:16,107 - distributed.worker - INFO - -------------------------------------------------
Aug 19 10:00:16 ip-10-0-3-187 cloud-init[1119]: 2022-08-19 10:00:16,109 - distributed.core - INFO - Starting established connection
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant