
Race condition connecting to vs shutting down subprocesses in tests using cluster() #6828


Closed
gjoseph92 opened this issue Aug 4, 2022 · 0 comments · Fixed by #6829
Labels
flaky test Intermittent failures on CI.

Comments

@gjoseph92
Collaborator

The utils_test.cluster() contextmanager creates a lightweight cluster using subprocesses. It's used in a number of tests directly, as well as via the client, a, b, etc. fixtures.

In a finally block of the contextmanager, it tries to open an RPC to every subprocess that is still alive and uses that RPC to call close on the server.
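For orientation, here is a minimal, self-contained sketch of that teardown shape. It is not the real utils_test implementation; FakeWorker, rpc_close, and cluster_sketch are illustrative stand-ins:

from contextlib import contextmanager
from dataclasses import dataclass

@dataclass
class FakeWorker:
    address: str
    alive: bool = True

    def is_alive(self) -> bool:
        return self.alive

def rpc_close(address: str) -> None:
    # Stand-in for "open an RPC to the server and call close on it"
    print(f"RPC close -> {address}")

@contextmanager
def cluster_sketch():
    workers = [FakeWorker("tcp://127.0.0.1:1234"), FakeWorker("tcp://127.0.0.1:1235")]
    try:
        yield workers
    finally:
        # Snapshot of which workers look alive *right now*; this is where
        # the race with a worker the test just terminated begins
        for w in workers:
            if w.is_alive():
                rpc_close(w.address)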

However, a few tests have a pattern where they call terminate on one of the processes themselves right before exiting:

def test_submit_after_failed_worker_sync(loop):
    with cluster() as (s, [a, b]):
        with Client(s["address"], loop=loop) as c:
            L = c.map(inc, range(10))
            wait(L)
            a["proc"]().terminate()
            total = c.submit(sum, L)
            assert total.result() == sum(map(inc, range(10)))

Subprocess.terminate just sends SIGTERM; it doesn't block until the process has actually shut down. So what can happen:

  1. The test calls terminate on worker A, but the process is still running.
  2. The test finishes and returns control to the contextmanager, which looks at which workers are still alive:

         alive_workers = [
             w["address"]
             for w in workers_by_pid.values()
             if w["proc"].is_alive()
         ]

     Worker A is still alive, so it's in the list.
  3. Worker A actually shuts down
  4. Calling the terminate RPC times out trying to connect to the now-dead worker A:

         async with rpc(addr, **rpc_kwargs) as w:
             # If the worker was killed hard (e.g. sigterm) during test runtime,
             # we do not know at this point and may not be able to connect
             with suppress(EnvironmentError, CommClosedError):
                 # Do not request a reply since comms will be closed by the
                 # worker before a reply can be made and we will always trigger
                 # the timeout
                 await w.terminate(reply=False)

     The comment notes this possibility. In theory this would be fine thanks to the suppress(CommClosedError), but it relies on the RPC's internal timeout being shorter than the timeout on the wait_for that wraps the disconnect:

         await asyncio.wait_for(do_disconnect(), timeout=timeout)

     Otherwise an asyncio.TimeoutError is raised instead. That would no longer be the case after #6822 (Only set 5s connect timeout in gen_cluster tests); see the sketch after this list.
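To make the timeout interaction in step 4 concrete, here is a standalone sketch using plain asyncio. It is not distributed's code; FakeCommError, connect_to_dead_worker, and do_disconnect are made up for illustration, standing in for CommClosedError and the real disconnect logic:

import asyncio
from contextlib import suppress

class FakeCommError(OSError):
    """Stand-in for CommClosedError / EnvironmentError."""

async def connect_to_dead_worker(connect_timeout: float) -> None:
    # Simulate the RPC trying to reach a worker that has already exited:
    # it hangs for its internal connect timeout, then raises a comm error
    await asyncio.sleep(connect_timeout)
    raise FakeCommError("connection failed")

async def do_disconnect(connect_timeout: float) -> None:
    with suppress(FakeCommError):
        await connect_to_dead_worker(connect_timeout)

async def main() -> None:
    # Inner timeout shorter than the outer one: the comm error is raised
    # and suppressed before wait_for fires -> clean teardown
    await asyncio.wait_for(do_disconnect(connect_timeout=0.1), timeout=0.5)

    # Inner timeout longer than the outer one: wait_for cancels first and
    # raises asyncio.TimeoutError, which nothing suppresses -> flaky failure
    try:
        await asyncio.wait_for(do_disconnect(connect_timeout=0.5), timeout=0.1)
    except asyncio.TimeoutError:
        print("TimeoutError escaped the suppress block")

asyncio.run(main())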

I propose that we entirely remove the "RPC to the server and call close on it" logic.

We're already adding a callback to the ExitStack that terminates and joins every subprocess:

stack.callback(_terminate_join, scheduler)

So the RPC method is:

  1. Belt-and-suspenders (we have another mechanism to shut down the subprocesses)
  2. Superfluous (the clean close may start via RPC before the SIGTERM, but there are no handlers registered for SIGTERM by default, so the SIGTERM will then forcibly terminate the subprocess in the middle of its close)
  3. Way more brittle (connecting to a subprocess as we're terminating)
  4. Rather pointless (the RPC doesn't block until the server is actually shut down, so it has no benefit compared to sending a signal)

In general when working with subprocesses, using signals and join to shut them down seems way simpler and more reliable than RPCs.
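As a rough illustration of that simpler approach, here is a self-contained sketch using plain multiprocessing and ExitStack. The _terminate_join below is a simplified stand-in, not the actual helper from utils_test:

import time
from contextlib import ExitStack
from multiprocessing import Process

def _worker() -> None:
    # Placeholder subprocess body; a real worker would run an event loop
    time.sleep(60)

def _terminate_join(proc: Process) -> None:
    # Send SIGTERM, then block until the process has actually exited, so
    # teardown never races against a process that is still dying
    proc.terminate()
    proc.join(timeout=30)

if __name__ == "__main__":
    with ExitStack() as stack:
        procs = [Process(target=_worker) for _ in range(2)]
        for proc in procs:
            proc.start()
            stack.callback(_terminate_join, proc)
        # ... the test body would run here ...
    # On exit the callbacks run in reverse order, terminating and joining
    # every subprocess without opening any network connections to them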

@gjoseph92 gjoseph92 added the flaky test Intermittent failures on CI. label Aug 4, 2022