Description
The utils_test.cluster()
contextmanager creates a lightweight cluster using subprocesses. It's used in a number of tests directly, as well as via the client
, a
, b
, etc. fixtures.
In a finally
block of the contextmanager, it tries to open an RPC to all subprocesses that are still alive, and use that RPC to call close
on the server.
However, a few tests have a pattern where they call terminate
on one of the processes themselves right before exiting:
distributed/distributed/tests/test_failed_workers.py
Lines 37 to 44 in 4af2d0a
Subprocess.terminate
just sends SIGTERM
; it doesn't block until the process has actually shut down. So what can happen:
- The test calls
terminate
on worker A, but the process is still running - The test finishes and returns control to the contextmanager, which looks at which workers are still alive:
distributed/distributed/utils_test.py
Lines 687 to 691 in 4af2d0a
- Worker A actually shuts down
- Calling the
terminate
RPC times out trying to connect to the now-dead worker Adistributed/distributed/utils_test.py
Lines 715 to 722 in 4af2d0a
suppress(CommClosedError)
, but it relies on the RPC's internal timeout being shorter than the timeout on thewait_for
distributed/distributed/utils_test.py
Line 724 in 4af2d0a
asyncio.TimeoutError
would be raised. This would not be the case after Only set 5s connect timeout ingen_cluster
tests #6822.
I propose that we entirely remove the "RPC to the server and call close
on it" logic.
Because we're already adding a callback to the ExitStack
to terminate and join every subprocess:
distributed/distributed/utils_test.py
Line 621 in 4af2d0a
So the RPC method is
- Belt-and-suspenders (we have another mechanism to shut down the suppresses)
- Superfluous (the clean
close
may start via RPC before the SIGTERM, but there are no handlers registered for SIGTERM by default, so the SIGTERM will then forcibly terminate the subprocess in the middle of itsclose
) - Way more brittle (connecting to a subprocess as we're terminating)
- Rather pointless (the RPC doesn't block until the server is actually shut down, so it has no benefit compared to sending a signal)
In general when working with subprocesses, using signals and join
to shut them down seems way simpler and more reliable than RPCs.