Skip to content

⚠️ CI failed ⚠️ #254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
github-actions bot opened this issue Aug 17, 2022 · 6 comments
Closed

⚠️ CI failed ⚠️ #254

github-actions bot opened this issue Aug 17, 2022 · 6 comments
Labels
ci-failure duplicate This issue or pull request already exists

Comments

@github-actions
Copy link

Workflow Run URL

@ian-r-rose
Copy link
Contributor

This timed out after a client.restart(). @gjoseph92 I feel like I've seen that happen not infrequently, any insights?

@gjoseph92
Copy link
Contributor

I'm seeing OSError: Timed out trying to connect to tls://18.191.218.164:8786 after 30 s

  1. Trying to connect to the cluster (Client(cluster))
  2. Trying to call the scheduler.identity() RPC as part of wait_for_workers

Neither of those look like a restart to me?

That does feel like a common failure mode though. I don't know why it happens. I imagine either the scheduler event loop is blocked for some reason, or there's some messed up asyncio thing going on.

@ian-r-rose
Copy link
Contributor

Ah, my mistake, the time out was right before the restart(), rather than right after. But yeah, I feel like I've regularly seen timeouts of various sorts when trying to clear the decks for some new computation (whether in a wait_for_workers() or a restart()).

@gjoseph92
Copy link
Contributor

Yeah, I think it's definitely a real upstream issue, just maybe not with restart().

@ntabris
Copy link
Member

ntabris commented Aug 17, 2022

In case it's helpful, this was https://cloud.coiled.io/dask-engineering/clusters/51941/details

Scheduler logs show it shutdown itself from idle timeout:

Aug 17 21:35:16 ip-10-0-1-239 cloud-init[1128]: distributed.preloading - INFO - Downloading preload at https://cloud.coiled.io/api/v2/cluster_facing/preload/scheduler
Aug 17 21:35:16 ip-10-0-1-239 cloud-init[1128]: distributed.preloading - INFO - Downloading preload at https://cloud.coiled.io/api/v2/cluster_facing/preload/aws-creds-loader
Aug 17 21:35:16 ip-10-0-1-239 cloud-init[1128]: distributed.preloading - INFO - Downloading preload at https://cloud.coiled.io/api/v2/analytics/preload
Aug 17 21:35:17 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Clear task state
Aug 17 21:35:17 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO -   Scheduler at:     tls://10.0.1.239:8786
Aug 17 21:35:17 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO -   dashboard at:                     :8787
Aug 17 21:35:17 ip-10-0-1-239 cloud-init[1128]: distributed.preloading - INFO - Run preload setup function: https://cloud.coiled.io/api/v2/cluster_facing/preload/scheduler
Aug 17 21:35:17 ip-10-0-1-239 cloud-init[1128]: distributed.preloading - INFO - Run preload setup function: https://cloud.coiled.io/api/v2/cluster_facing/preload/aws-creds-loader
Aug 17 21:35:17 ip-10-0-1-239 cloud-init[1128]: distributed.preloading - INFO - Run preload setup function: https://cloud.coiled.io/api/v2/analytics/preload
Aug 17 21:35:17 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Connecting to coiled - I am i-09295bebecdd01f56
Aug 17 21:35:17 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - my auth token ends with cIzA
Aug 17 21:35:22 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.8.103:42149', name: test_shuffle-3de34d7d-worker-286eae706e, status: undefined, memory: 0, processing: 0>
Aug 17 21:35:22 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.8.103:42149
Aug 17 21:35:22 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:35:23 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Receive client connection: Client-806337c8-1e74-11ed-8a2e-000d3a56315f
Aug 17 21:35:23 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:35:24 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.13.65:35837', name: test_shuffle-3de34d7d-worker-9ae52ee3a9, status: undefined, memory: 0, processing: 0>
Aug 17 21:35:24 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.13.65:35837
Aug 17 21:35:24 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:35:25 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.8.201:33557', name: test_shuffle-3de34d7d-worker-e6c9935d74, status: undefined, memory: 0, processing: 0>
Aug 17 21:35:25 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.8.201:33557
Aug 17 21:35:25 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:35:26 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.0.248:33717', name: test_shuffle-3de34d7d-worker-f49244ae10, status: undefined, memory: 0, processing: 0>
Aug 17 21:35:26 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.0.248:33717
Aug 17 21:35:26 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:35:27 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.8.216:35557', name: test_shuffle-3de34d7d-worker-304a19c0af, status: undefined, memory: 0, processing: 0>
Aug 17 21:35:27 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.8.216:35557
Aug 17 21:35:27 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:35:27 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.3.198:45679', name: test_shuffle-3de34d7d-worker-206d80fedc, status: undefined, memory: 0, processing: 0>
Aug 17 21:35:27 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.3.198:45679
Aug 17 21:35:27 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:35:29 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.2.81:45163', name: test_shuffle-3de34d7d-worker-779b4d911f, status: undefined, memory: 0, processing: 0>
Aug 17 21:35:29 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.2.81:45163
Aug 17 21:35:29 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:35:29 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.4.4:38827', name: test_shuffle-3de34d7d-worker-0b629e79c8, status: undefined, memory: 0, processing: 0>
Aug 17 21:35:29 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.4.4:38827
Aug 17 21:35:29 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:35:30 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Register worker <WorkerState 'tls://10.0.13.193:37545', name: test_shuffle-3de34d7d-worker-62d118aa36, status: undefined, memory: 0, processing: 0>
Aug 17 21:35:30 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.13.193:37545
Aug 17 21:35:30 ip-10-0-1-239 cloud-init[1128]: distributed.core - INFO - Starting established connection
Aug 17 21:48:56 ip-10-0-1-239 systemd[1]: Starting Cleanup of Temporary Directories...
Aug 17 21:48:56 ip-10-0-1-239 systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Aug 17 21:48:56 ip-10-0-1-239 systemd[1]: Finished Cleanup of Temporary Directories.
Aug 17 21:55:16 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Scheduler closing after being idle for 1200.00 s
Aug 17 21:55:16 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Scheduler closing...
Aug 17 21:55:16 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Connecting to coiled - I am i-09295bebecdd01f56
Aug 17 21:55:16 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - my auth token ends with cIzA
Aug 17 21:55:17 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Scheduler closing after being idle for 1200.00 s
Aug 17 21:55:17 ip-10-0-1-239 cloud-init[1128]: distributed.scheduler - INFO - Scheduler closing all comms

@jrbourbeau
Copy link
Member

Closing as a duplicate of #214

@jrbourbeau jrbourbeau added the duplicate This issue or pull request already exists label Aug 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-failure duplicate This issue or pull request already exists
Projects
None yet
Development

No branches or pull requests

4 participants