Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When using TCP transport, StopRun is never received #1187

Open
CharliePoole opened this issue Jun 12, 2022 · 12 comments
Open

When using TCP transport, StopRun is never received #1187

CharliePoole opened this issue Jun 12, 2022 · 12 comments
Labels
Bug High Priority V4 All issues related to V4 - use -label:V4 to get non-V4 issues
Milestone

Comments

@CharliePoole
Copy link
Member

NOTE: I'm calling this a bug, because it is one, but some of the fallout may lead to major changes to our basic runner interfaces. So I'm looking for comments from @nunit/engine-team before I actually do anything.

Up to now, we have been dealing with the problem of cancelling a run using StopRun(true) as it relates to .NET Core not supporting killing threads. However, I just discovered an additional problem: when we use the TCP transport, the StopRun command can't be received while the test is running.

The reason for this is that the command loop, which receives and executes commands, runs those commands synchronously. So the command to stop the run won't be seen until the run has already completed. One fix for this - maybe the easiest - would be for the command-loop to create a task to run the tests when the Run command is received.

Thinking about it a bit more, I wondered what would happen if we used RunAsync instead of Run, so I did an experiment. It didn't work and I still saw synchronous Run commands being issued. I realized (or reminded myself) that MasterTestRunner translates RunAsync into a task, which basically calls Run. None of the lower-level runners RunAsync methods are ever actually used.

I'm sure we (or I) had a good reason for this back in the day, but right now it doesn't make much sense. We should either not have lower-level RunAsync methods or we should actually use them. That's one of the big decisions I think we should make as a team.

Of course, this also gets into the interfaces we use in 4.0, which we have never finalized. I'm currently thinking that we should eliminate RunAsync from ITestRunner and make Run an async method. The calling runner could then decide whether or not to await the call.
That gets beyond the scope of a bug, of course, but we should keep in mind what we ultimately want to have as we work this issue.

@rprouse @jnm2 @mikkelbu @manfred-brands What do you think? Where should we go with this?

@rprouse
Copy link
Member

rprouse commented Jun 13, 2022

We should either not have lower-level RunAsync methods or we should actually use them

I think we should use them. If I am calling RunAsync then I expect it to be async all the way down.

interfaces we use in 4.0, which we have never finalized. I'm currently thinking that we should eliminate RunAsync from ITestRunner and make Run an async method.

Personally, I'd prefer to keep the async method and deprecate or remove the run method. Normally I'm not a fan of naming methods XxxAsync, but changing the current synchronous Run method to async is a major change and many people don't read the docs. If it is going to be a breaking change, then I think we should be as explicit as possible and break them in the most obvious way possible.

@CharliePoole
Copy link
Member Author

@rprouse When you say "keep the async method" do you man keep it as it is... i.e. not true async and therefore not awaitable?

On the general principal of not changing too much in this major release, considering the breaking changes we've already made, I feel that ship has pretty much sailed. :-)

@CharliePoole
Copy link
Member Author

Bear in mind, we're talking about a major release, with lots breaking changes already in place. It's our third major upgrade in 22 years, so nobody can say we break things indiscriminately!

Assuming there won't be a 5.0 release for 7 to 10 years, I think now is the time to change anything we don't want to live with for that long.

That said, the particular thing we're talking about is only relevant to those who write runners. I can keep using the RunAsync pattern, which I'm used to from my work in the previous century. :-)

Regarding "async all the way down" I agree with you. It will need lots of testing. Regarding deprecating Run, I wouldn't want to do that because it's the simplest possible way to use the engine: you call it and it returns a result. For anyone using the engine from a script, it's ideal.

@CharliePoole
Copy link
Member Author

BTW, one reason this problem exists in the first place is that the console runner never uses RunAsync. When I started working on the TestCentric GUI, I found I needed to use it in order to keep the display updating and responsive. That worked for a long time and it was only when I started dealing with cancelling a run that I discovered the negative impact of using Run in the lower level runners.

@CharliePoole
Copy link
Member Author

@nunit/engine-team @OsirisTerje I'm wondering what the true priority of the issue ought to be for this project.

I made it "high" because my GUI needs it to be fixed if I'm ever going to stop having my own engine and use NUnit's. OTOH, it's hard to see how to show this is a problem when using the console runner or the adapter. Can anyone think of a way to make it visible in either of those contexts?

@jnm2
Copy link
Collaborator

jnm2 commented Jun 13, 2022

The priority seems high to me even if it's contrived or impossible to demonstrate the problem with the console runner or adapter. If my memory is right, the same issue will also come back in the console runner if it begins to support connecting to agents which the console doesn't start and stop. Possibly an agent running in another machine or in an Android emulator.

I'd like true async. 4.0 seems like an ideal time to clean up the interface.

It's also technically feasible to run the TCP message loop on a different thread from the thread on which ITestRunner calls are dispatched, like you're saying, and if that is somehow more practical in the short term then that seems like a valid option.

@jnm2
Copy link
Collaborator

jnm2 commented Jun 13, 2022

A side note on the GUI: I would be concerned about running the engine on the UI thread. I'd be vaguely afraid of side effects. One being that the responsiveness of the GUI is now at the mercy of the engine. All the way up to, maybe the engine or an addin or something sets some thread local or variable which now affects the GUI. I'd even be afraid that the engine might set static configuration settings in .NET, like supported TLS level or something, which subtly interact with the GUI. For my own peace of mind, I'd be tempted to have the GUI start the engine out of process unless the engine provides guarantees that it won't do anything that could possibly influence the runtime behavior of the host app, within reason. Anyway, only interacting with the engine on a background thread would be where I'd probably start, and then you don't have the concern about keeping the display updating and responsive.

@CharliePoole
Copy link
Member Author

@jnm2

The current setup is that the thread used by the runner to invoke the engine is blocked in the agent, waiting for any commands to be executed. A thread is created, which connects to the caller either through remoting or TCP and waits to be told what to do. In the case of remoting, it's told via a call to a method - the magic is behind the scenes. For TCP, a separate command loop thread is started, which tries to read messages. That's where this issue arises.

In all cases, the action taken when a message is received is synchronous. (As a reminder, possible actions are Load, Reload, Unload, Explore, CountTestCases, Run, RunAsync and StopRun.) This is by design, since no actions were intended to be invoked in parallel with other actions.

What I missed in designing this is that StopRun, which is ony allowed while a Run is in process, cannot be received because the command loop calls Run directly. In my working code, that's fixed by creating a thread on which to perform the run. That allows StopRun to be received and call the corresponding engine commands on the executing runner.

Regarding the GUI, the engine is not run on the UI thread. However, the synchronicity of the command loop prevents messages from being sent to the GUI at certain times. For a short-term fix, I need to fix that WRT Run and RunAsync. Longer term, I think it will be better if each action listed above is performed on a separate thread, even if I don't currently plan to have more than one of them going at one time.

I have to solve the problem of a hard cancel for the GUI. But for the console and engine, I guess it depends on the consensus of the team. We've managed to expose a difference of opinion there. On top of that, see my last comment in nunit/nunit#3276. Let's wait to see what others think about both issues.

@rprouse
Copy link
Member

rprouse commented Jun 14, 2022

@rprouse When you say "keep the async method" do you man keep it as it is... i.e. not true async and therefore not awaitable?

I meant keep the method named RunAsync but make it properly async. There will be many breaking changes, i'm just suggesting that they be as explicit as possible.

@manfred-brands
Copy link
Member

If the TCP transport is expected to handle command whilst running, by definition the run must be asynchronous.
I presume we only support a single run at a time, so the TCP transport must reject all but stop until the run is finished.

@CharliePoole
Copy link
Member Author

That's quite true from an implementation point of view. I think the discussion here relates to whether the call to the public API blocks or not. Currently, we have both types of calls for running tests and I think we need to retain both.

But on the agent side of the communication link, asynchronicity is needed if we want to support stopping an action. Currently, we only support stopping a test run but it's conceivable that we might eventually support cancelling an explore or load request as well, so I'd say that most operations should be run on a separate thread from the command loop.

@CharliePoole CharliePoole added this to the 3.19.0 milestone Sep 19, 2024
@CharliePoole CharliePoole modified the milestones: 3.19.0, 4.0 Dec 3, 2024
@CharliePoole
Copy link
Member Author

There seems to be no non-breaking way to improve the current implementation, so I have moved this to V4.

@CharliePoole CharliePoole modified the milestones: 4.0, 4.0.0-beta.1 Dec 29, 2024
@CharliePoole CharliePoole added the V4 All issues related to V4 - use -label:V4 to get non-V4 issues label Feb 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug High Priority V4 All issues related to V4 - use -label:V4 to get non-V4 issues
Projects
None yet
Development

No branches or pull requests

4 participants