Race condition connecting to vs shutting down subprocesses in tests using cluster()
#6828
Labels: flaky test (intermittent failures on CI)
The `utils_test.cluster()` contextmanager creates a lightweight cluster using subprocesses. It's used in a number of tests directly, as well as via the `client`, `a`, `b`, etc. fixtures.

In a `finally` block of the contextmanager, it tries to open an RPC to all subprocesses that are still alive, and uses that RPC to call `close` on the server.

However, a few tests have a pattern where they call `terminate` on one of the processes themselves right before exiting:

distributed/distributed/tests/test_failed_workers.py, lines 37 to 44 in 4af2d0a
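`Subprocess.terminate` just sends `SIGTERM`; it doesn't block until the process has actually shut down. So what can happen:

1. The test calls `terminate` on worker A, but the process is still running.
2. The `cluster()` contextmanager exits; worker A still counts as alive, so it is included in the processes to RPC to (distributed/distributed/utils_test.py, lines 687 to 691 in 4af2d0a).
3. Worker A shuts down from the `terminate`.
4. The RPC times out trying to connect to the now-dead worker A (distributed/distributed/utils_test.py, lines 715 to 722 in 4af2d0a).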
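
The window between sending the signal and the process actually exiting can be observed with a small sketch. It uses `subprocess.Popen` from the standard library as a stand-in for the worker processes (an assumption; the helper name `terminate_then_wait` is hypothetical and not part of `distributed`).

```python
import subprocess
import sys


def terminate_then_wait(timeout: float = 10.0) -> tuple[bool, int]:
    """Terminate a child and report whether it still looked alive afterwards."""
    # Hypothetical stand-in for a worker: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )

    child.terminate()  # sends SIGTERM on POSIX and returns immediately

    # poll() returns None while the child has not exited yet, so this may
    # well be True: that is the race window described in the steps above.
    still_running = child.poll() is None

    # Only wait() blocks until the child has really exited.
    returncode = child.wait(timeout=timeout)
    return still_running, returncode


still_running, returncode = terminate_then_wait()
print(f"looked alive right after terminate(): {still_running}")
print(f"return code after wait(): {returncode}")
```

Whether the first line prints `True` depends on timing, which is exactly why the failure is intermittent.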
There is a `suppress(CommClosedError)`, but it relies on the RPC's internal timeout being shorter than the timeout on the `wait_for`:

distributed/distributed/utils_test.py, line 724 in 4af2d0a
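
A minimal sketch of why the relative order of the two timeouts matters. The `CommClosedError` class and the `fake_rpc_close` and `shutdown` functions here are hypothetical stand-ins, not the code in `utils_test.py`; the sketch only assumes an inner operation that fails after its own connect timeout, wrapped in an outer `asyncio.wait_for`.

```python
import asyncio
from contextlib import suppress


class CommClosedError(Exception):
    """Hypothetical stand-in for distributed's CommClosedError."""


async def fake_rpc_close(connect_timeout: float) -> None:
    """Pretend to connect to a dead worker, giving up after connect_timeout."""
    await asyncio.sleep(connect_timeout)
    raise CommClosedError("could not connect to worker")


async def shutdown(connect_timeout: float, outer_timeout: float) -> str:
    try:
        with suppress(CommClosedError):
            await asyncio.wait_for(fake_rpc_close(connect_timeout), outer_timeout)
        return "suppressed"
    except asyncio.TimeoutError:
        # Caught here only to report it; in a test teardown it would
        # propagate and fail the test.
        return "TimeoutError"


# Inner timeout shorter than the outer one: the error is suppressed.
print(asyncio.run(shutdown(connect_timeout=0.05, outer_timeout=1.0)))
# Outer timeout shorter than the inner one: suppress() does not help.
print(asyncio.run(shutdown(connect_timeout=1.0, outer_timeout=0.05)))
```

The first call prints `suppressed` and the second prints `TimeoutError`, since `suppress(CommClosedError)` only covers the exception raised by the inner operation.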
If the `wait_for` timeout fires first, `asyncio.TimeoutError` would be raised instead. The RPC's internal timeout would no longer be the shorter one after "Only set 5s connect timeout in `gen_cluster` tests" #6822.

I propose that we entirely remove the "RPC to the server and call `close` on it" logic, because we're already adding a callback to the `ExitStack` to terminate and join every subprocess:

distributed/distributed/utils_test.py, line 621 in 4af2d0a
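
A rough sketch of what signal-and-join shutdown via `ExitStack` callbacks looks like. This is not the code at the referenced line; it uses `subprocess.Popen` as a stand-in for the worker processes, and the `stop` and `run_children` helpers are hypothetical names.

```python
import subprocess
import sys
from contextlib import ExitStack


def stop(child: subprocess.Popen, timeout: float = 10.0) -> None:
    """Signal the child, then block until it has actually exited."""
    child.terminate()  # does nothing if the child has already exited
    child.wait(timeout=timeout)


def run_children(n: int) -> list[subprocess.Popen]:
    children: list[subprocess.Popen] = []
    with ExitStack() as stack:
        for _ in range(n):
            child = subprocess.Popen(
                [sys.executable, "-c", "import time; time.sleep(60)"]
            )
            # Registered callbacks run on exit, even if the body raises.
            stack.callback(stop, child)
            children.append(child)

        # Simulate a test that terminates one child itself before exiting.
        children[0].terminate()
    return children


children = run_children(2)
print([child.poll() is not None for child in children])
```

Because shutdown never needs to connect to the children, a child that a test has already terminated does not cause a timeout here.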
So the RPC method is redundant. (`close` may start via RPC before the SIGTERM, but there are no handlers registered for SIGTERM by default, so the SIGTERM will then forcibly terminate the subprocess in the middle of its `close`.)

In general when working with subprocesses, using signals and `join` to shut them down seems way simpler and more reliable than RPCs.