Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide useful error message when agent crashes with a stack overflow exception #391

Closed
ChrisMaddock opened this issue Mar 23, 2018 · 18 comments

Comments

@ChrisMaddock
Copy link
Member

When the agent exits uncleanly, we currently look at the exit code and provide a more suitable error message.

private void OnAgentExit(object sender, EventArgs e)
{
var process = sender as Process;
if (process == null)
return;
var agentRecord = _agentData.GetDataForProcess(process);
agentRecord.Status = AgentStatus.Terminated;
string errorMsg;
switch (process.ExitCode)
{
case AgentExitCodes.OK:
return;
case AgentExitCodes.PARENT_PROCESS_TERMINATED:
errorMsg = "Remote test agent believes agency process has exited.";
break;
case AgentExitCodes.UNEXPECTED_EXCEPTION:
errorMsg = "Unhandled exception on remote test agent. " +
"To debug, try running with the --inprocess flag, or using --trace=debug to output logs.";
break;
case AgentExitCodes.FAILED_TO_START_REMOTE_AGENT:
errorMsg = "Failed to start remote test agent.";
break;
case AgentExitCodes.DEBUGGER_SECURITY_VIOLATION:
errorMsg = "Debugger could not be started on remote agent due to System.Security.Permissions.UIPermission not being set.";
break;
case AgentExitCodes.DEBUGGER_NOT_IMPLEMENTED:
errorMsg = "Debugger could not be started on remote agent as not available on platform.";
break;
default:
errorMsg = $"Remote test agent exited with non-zero exit code {process.ExitCode}";
break;
}
throw new NUnitEngineException(errorMsg);
}

This currently only checks for NUnit-specific error codes, but we could also check for -1073741571 - which I believe is the .NET exit code specific to Stack Overflow exceptions - an exception which could be thrown by user code, which we have no other way of handing.

@ChrisMaddock ChrisMaddock changed the title Provide useful error message when agent crashes with a stack overflow error message Provide useful error message when agent crashes with a stack overflow exception Mar 23, 2018
@jnm2
Copy link
Collaborator

jnm2 commented Mar 23, 2018

Do we know of any socket exceptions that are still happening to people that are not stack overflows?

For any socket exception except our own shutdown, we could print out a suggestion to use --inprocess to diagnose the issue.

@ChrisMaddock
Copy link
Member Author

@jnm2 Good idea, as a point of last fallback.

I'd like to work on catching as many 'know agent crash' cases as we can, and providing useful debug information. The two I know about right now are Stack Overflow, and none CLR exceptions. But yes, adding this for the cases we don't know about yet sounds sensible. 🙂

@CharliePoole
Copy link
Member

CharliePoole commented Mar 24, 2018

@jnm2@ChrisMaddock "none CLR exceptions" ?

If that's "unknown" then it should be completely preventable. The primary engine should never start a process that requests a CLR that isn't installed. It's supposed to see the problem and report the error without even attempting to start the process.

@ChrisMaddock
Copy link
Member Author

Whoops - CLS, not CLR! 😅 I was meaning non-CLS compliment exceptions, I.e. from unmanaged code. Am I right that they can currently crash the agent, or am I remembering incorrectly?

@CharliePoole
Copy link
Member

I'm not sure. I wouldn't be surprised. Easy to test though. Should maybe be tested in the framework first. If the framework let's it slip through then my guess is that the engine/agent won't make things any better.

@mano-si
Copy link
Contributor

mano-si commented Apr 11, 2020

@ChrisMaddock I could look into it.

@ChrisMaddock
Copy link
Member Author

Yes please! Be careful with this one, we think there's a wider bug at the moment which is causing the general error reporting functionality here not to work at all. See #700.

Have a look - you might be able to solve both at the same time? 🙂

@mano-si
Copy link
Contributor

mano-si commented Apr 19, 2020

This is what I found.

  • The console does report any exception thrown by the tests. (including StackOverflow)
    throw exception

  • If environment exitCode was not OK(0), it is also being reported on the console by the OnAgentExit method. Nothing seems to be wrong there. ( #700 could be closed in that case).

  • The only potential problem that might have resulted in this thread is : when a test invokes some bad method that goes in recursion hell or something, the console just prints Remote test agent exited with non-zero exit code -1073741571(default case on OnAgentExit ). This is an interesting scenario as test result of other tests are also not reported and results in kind-of like total failure. So, adding another condition in OnAgentExit method will not be of much help.

recursion hell

The class tested had two methods that were tested:

        public int CreateProblem(int a)
        {
            return a / 0;
        }

        public int ThrowStackOverflow()
        {
            throw new StackOverflowException();
        }

After adding this method and running a test on it, resulted in the problematic scenario

        public int RecursionHell()
        {
            return RecursionHell();
        }

@CharliePoole
Copy link
Member

Aren't both kinds of errors (Exceptions and bad exit code) received via events, called on different threads? Seems like this is a potential race condition.

@ChrisMaddock
Copy link
Member Author

Hey @mano-si.

So - your third bullet point was the one I was intending this issue to tackle, how we can better inform the user what's happening when it's their code that's crashing the console. This was my thinking behind adding this particular code to OnAgentExit - that we can inform the user there is "recursion hell" in there code (or a Stack Overflow - in other words. 🙂)

If I understand correctly, we should be able to detect this specific case by the exit code "-1073741571", and show the user a friendly error message as to what the problem might be. (And also suppress all the stack traces that are shown at the moment - which are really just NUnit internals, as far as a user is concerned)

Does help explain my thought process from when I created this issue? If I'm honest, I'm not entirely sure it will be possible - but I'm hoping it will work! 😁

@mano-si
Copy link
Contributor

mano-si commented Apr 20, 2020

Aren't both kinds of errors (Exceptions and bad exit code) received via events, called on different threads? Seems like this is a potential race condition.

Yes, they are received via events. I am not very familiar with the internals yet and won't be able to confirm you for sure if its race condition or not. It is possible.
One observation is that, even when I ran the tests from vs nunit test adapter, it froze too without passing/failing any of the tests.

@mano-si
Copy link
Contributor

mano-si commented Apr 20, 2020

Hey @mano-si.

So - your third bullet point was the one I was intending this issue to tackle, how we can better inform the user what's happening when it's their code that's crashing the console. This was my thinking behind adding this particular code to OnAgentExit - that we can inform the user there is "recursion hell" in there code (or a Stack Overflow - in other words. 🙂)

If I understand correctly, we should be able to detect this specific case by the exit code "-1073741571", and show the user a friendly error message as to what the problem might be. (And also suppress all the stack traces that are shown at the moment - which are really just NUnit internals, as far as a user is concerned)

Does help explain my thought process from when I created this issue? If I'm honest, I'm not entirely sure it will be possible - but I'm hoping it will work! 😁

Yes I was able to understand that's what you might have intended. 😄

We could do that I assume. Provide a meaningful information like 'StackOverflow in one the tests.' But, I don't think we could be able to provide more insightful information (eg. TestName etc. ) .

Regarding hiding the server stackTrace, that's something I need to look into.

@ChrisMaddock
Copy link
Member Author

No, that makes sense. But any infos better than what we currently have here, in my opinion. 😊

@mano-si
Copy link
Contributor

mano-si commented Apr 22, 2020

I think @CharliePoole was right about the race condition. I put a Thread.Sleep(50000); on the OnAgentExit() method and now I can see the full stackTrace of NUnit.Engine.NUnitEngineUnloadException. OnAgentExit() was unable to report its exception in this case. This could be the reason for #700 if there was any such incident. Not entirely sure though.

@ChrisMaddock I was able to generate the actual 'StackOverflow' exception on the console by changing p.StartInfo.CreateNoWindow = false; in the LaunchAgentProcess() method of TestAgency. Is this something that could work ?

exception

stack trace

@ChrisMaddock
Copy link
Member Author

Great work on the race condition! I wonder if there's anything we can do to solve that?

I don't think changing CreateNoWindow will work - I think that would create a new console window for every agent process, and there could potentially be hundreds of those.

I'm afraid I'm not following why adding the special case in OnAgentExit won't work - won't that allow us to show a nicer error message to the user? It would be nice to clean up the socket exception as well - but we can leave that bit, and treat it as a separate problem another time.

@CharliePoole
Copy link
Member

@mano-si That makes sense. Code in ProcessRunner is waiting for the call to Run to return or an exception to be thrown. The process terminates, calling OnExit first. Once OnExit returns,it's over.

@mano-si
Copy link
Contributor

mano-si commented Apr 23, 2020

Great work on the race condition! I wonder if there's anything we can do to solve that?

I don't think changing CreateNoWindow will work - I think that would create a new console window for every agent process, and there could potentially be hundreds of those.

I'm afraid I'm not following why adding the special case in OnAgentExit won't work - won't that allow us to show a nicer error message to the user? It would be nice to clean up the socket exception as well - but we can leave that bit, and treat it as a separate problem another time.

Adding the special case in OnAgentExit would definitely work. But only if the main thread isn't killed first. This exception is caught in RunTests() method of ProcessRunner. And to me it looks like its intentionally designed in a way that all exceptions caught there would go through the same exception handling. It prints out to console and also to the output xml after formatting. If we did a special handling here for this socket exception, I think we can predictably process the situation.

And as you suggested, we could also add this condition in OnAgentExit as well. But, by only doing that would we be able to solve it, is something am not very sure of. :)

@CharliePoole
Copy link
Member

IIRC the code in ProcessRunner was the original way we handled exceptions. However, we were not seeing some errors, probably those that were thrown in other threads created by the framework or in user code. To handle that issue OnAgentExit was created. It guarantees we get some message if the agent terminates abnormally, but unfortunately the default message is not very useful.

@ChrisMaddock ChrisMaddock added this to the 3.12 milestone Apr 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants