Provide useful error message when agent crashes with a stack overflow exception #391

ChrisMaddock · 2018-03-23T14:57:33Z

When the agent exits uncleanly, we currently look at the exit code and provide a more suitable error message.

nunit-console/src/NUnitEngine/nunit.engine/Services/TestAgency.cs

Lines 287 to 324 in ae37176

    
           private void OnAgentExit(object sender, EventArgs e) 
        
           { 
        
               var process = sender as Process; 
        
               if (process == null) 
        
                   return; 
        
               var agentRecord = _agentData.GetDataForProcess(process); 
        
               agentRecord.Status = AgentStatus.Terminated; 
        
               string errorMsg; 
        
               switch (process.ExitCode) 
        
               { 
        
                   case AgentExitCodes.OK: 
        
                       return; 
        
                   case AgentExitCodes.PARENT_PROCESS_TERMINATED: 
        
                       errorMsg = "Remote test agent believes agency process has exited."; 
        
                       break; 
        
                   case AgentExitCodes.UNEXPECTED_EXCEPTION: 
        
                       errorMsg = "Unhandled exception on remote test agent. " + 
        
                                  "To debug, try running with the --inprocess flag, or using --trace=debug to output logs."; 
        
                       break; 
        
                   case AgentExitCodes.FAILED_TO_START_REMOTE_AGENT: 
        
                       errorMsg = "Failed to start remote test agent."; 
        
                       break; 
        
                   case AgentExitCodes.DEBUGGER_SECURITY_VIOLATION: 
        
                       errorMsg = "Debugger could not be started on remote agent due to System.Security.Permissions.UIPermission not being set."; 
        
                       break; 
        
                   case AgentExitCodes.DEBUGGER_NOT_IMPLEMENTED: 
        
                       errorMsg = "Debugger could not be started on remote agent as not available on platform."; 
        
                       break; 
        
                   default: 
        
                       errorMsg = $"Remote test agent exited with non-zero exit code {process.ExitCode}"; 
        
                       break; 
        
               } 
        
               throw new NUnitEngineException(errorMsg); 
        
           }

This currently only checks for NUnit-specific error codes, but we could also check for -1073741571 - which I believe is the .NET exit code specific to Stack Overflow exceptions - an exception which could be thrown by user code, which we have no other way of handing.

jnm2 · 2018-03-23T15:22:08Z

Do we know of any socket exceptions that are still happening to people that are not stack overflows?

For any socket exception except our own shutdown, we could print out a suggestion to use --inprocess to diagnose the issue.

ChrisMaddock · 2018-03-24T14:51:59Z

@jnm2 Good idea, as a point of last fallback.

I'd like to work on catching as many 'know agent crash' cases as we can, and providing useful debug information. The two I know about right now are Stack Overflow, and none CLR exceptions. But yes, adding this for the cases we don't know about yet sounds sensible. 🙂

CharliePoole · 2018-03-24T18:31:36Z

~~@jnm2~~@ChrisMaddock "none CLR exceptions" ?

If that's "unknown" then it should be completely preventable. The primary engine should never start a process that requests a CLR that isn't installed. It's supposed to see the problem and report the error without even attempting to start the process.

ChrisMaddock · 2018-03-25T01:07:14Z

Whoops - CLS, not CLR! 😅 I was meaning non-CLS compliment exceptions, I.e. from unmanaged code. Am I right that they can currently crash the agent, or am I remembering incorrectly?

CharliePoole · 2018-03-25T03:47:32Z

I'm not sure. I wouldn't be surprised. Easy to test though. Should maybe be tested in the framework first. If the framework let's it slip through then my guess is that the engine/agent won't make things any better.

mano-si · 2020-04-11T08:20:21Z

@ChrisMaddock I could look into it.

ChrisMaddock · 2020-04-11T14:01:20Z

Yes please! Be careful with this one, we think there's a wider bug at the moment which is causing the general error reporting functionality here not to work at all. See #700.

Have a look - you might be able to solve both at the same time? 🙂

mano-si · 2020-04-19T12:10:32Z

This is what I found.

The console does report any exception thrown by the tests. (including StackOverflow)
If environment exitCode was not OK(0), it is also being reported on the console by the OnAgentExit method. Nothing seems to be wrong there. ( #700 could be closed in that case).
The only potential problem that might have resulted in this thread is : when a test invokes some bad method that goes in recursion hell or something, the console just prints Remote test agent exited with non-zero exit code -1073741571(default case on OnAgentExit ). This is an interesting scenario as test result of other tests are also not reported and results in kind-of like total failure. So, adding another condition in OnAgentExit method will not be of much help.

The class tested had two methods that were tested:

        public int CreateProblem(int a)
        {
            return a / 0;
        }

        public int ThrowStackOverflow()
        {
            throw new StackOverflowException();
        }

After adding this method and running a test on it, resulted in the problematic scenario

        public int RecursionHell()
        {
            return RecursionHell();
        }

CharliePoole · 2020-04-19T15:59:43Z

Aren't both kinds of errors (Exceptions and bad exit code) received via events, called on different threads? Seems like this is a potential race condition.

ChrisMaddock · 2020-04-19T21:38:43Z

Hey @mano-si.

So - your third bullet point was the one I was intending this issue to tackle, how we can better inform the user what's happening when it's their code that's crashing the console. This was my thinking behind adding this particular code to OnAgentExit - that we can inform the user there is "recursion hell" in there code (or a Stack Overflow - in other words. 🙂)

If I understand correctly, we should be able to detect this specific case by the exit code "-1073741571", and show the user a friendly error message as to what the problem might be. (And also suppress all the stack traces that are shown at the moment - which are really just NUnit internals, as far as a user is concerned)

Does help explain my thought process from when I created this issue? If I'm honest, I'm not entirely sure it will be possible - but I'm hoping it will work! 😁

mano-si · 2020-04-20T15:09:21Z

Aren't both kinds of errors (Exceptions and bad exit code) received via events, called on different threads? Seems like this is a potential race condition.

Yes, they are received via events. I am not very familiar with the internals yet and won't be able to confirm you for sure if its race condition or not. It is possible.
One observation is that, even when I ran the tests from vs nunit test adapter, it froze too without passing/failing any of the tests.

mano-si · 2020-04-20T15:43:25Z

Hey @mano-si.

So - your third bullet point was the one I was intending this issue to tackle, how we can better inform the user what's happening when it's their code that's crashing the console. This was my thinking behind adding this particular code to OnAgentExit - that we can inform the user there is "recursion hell" in there code (or a Stack Overflow - in other words. 🙂)

If I understand correctly, we should be able to detect this specific case by the exit code "-1073741571", and show the user a friendly error message as to what the problem might be. (And also suppress all the stack traces that are shown at the moment - which are really just NUnit internals, as far as a user is concerned)

Does help explain my thought process from when I created this issue? If I'm honest, I'm not entirely sure it will be possible - but I'm hoping it will work! 😁

Yes I was able to understand that's what you might have intended. 😄

We could do that I assume. Provide a meaningful information like 'StackOverflow in one the tests.' But, I don't think we could be able to provide more insightful information (eg. TestName etc. ) .

Regarding hiding the server stackTrace, that's something I need to look into.

ChrisMaddock · 2020-04-20T22:02:12Z

No, that makes sense. But any infos better than what we currently have here, in my opinion. 😊

mano-si · 2020-04-22T17:58:17Z

I think @CharliePoole was right about the race condition. I put a Thread.Sleep(50000); on the OnAgentExit() method and now I can see the full stackTrace of NUnit.Engine.NUnitEngineUnloadException. OnAgentExit() was unable to report its exception in this case. This could be the reason for #700 if there was any such incident. Not entirely sure though.

@ChrisMaddock I was able to generate the actual 'StackOverflow' exception on the console by changing p.StartInfo.CreateNoWindow = false; in the LaunchAgentProcess() method of TestAgency. Is this something that could work ?

ChrisMaddock · 2020-04-22T23:29:52Z

Great work on the race condition! I wonder if there's anything we can do to solve that?

I don't think changing CreateNoWindow will work - I think that would create a new console window for every agent process, and there could potentially be hundreds of those.

I'm afraid I'm not following why adding the special case in OnAgentExit won't work - won't that allow us to show a nicer error message to the user? It would be nice to clean up the socket exception as well - but we can leave that bit, and treat it as a separate problem another time.

CharliePoole · 2020-04-23T00:41:02Z

@mano-si That makes sense. Code in ProcessRunner is waiting for the call to Run to return or an exception to be thrown. The process terminates, calling OnExit first. Once OnExit returns,it's over.

mano-si · 2020-04-23T17:04:57Z

Great work on the race condition! I wonder if there's anything we can do to solve that?

I don't think changing CreateNoWindow will work - I think that would create a new console window for every agent process, and there could potentially be hundreds of those.

I'm afraid I'm not following why adding the special case in OnAgentExit won't work - won't that allow us to show a nicer error message to the user? It would be nice to clean up the socket exception as well - but we can leave that bit, and treat it as a separate problem another time.

Adding the special case in OnAgentExit would definitely work. But only if the main thread isn't killed first. This exception is caught in RunTests() method of ProcessRunner. And to me it looks like its intentionally designed in a way that all exceptions caught there would go through the same exception handling. It prints out to console and also to the output xml after formatting. If we did a special handling here for this socket exception, I think we can predictably process the situation.

And as you suggested, we could also add this condition in OnAgentExit as well. But, by only doing that would we be able to solve it, is something am not very sure of. :)

CharliePoole · 2020-04-23T19:21:39Z

IIRC the code in ProcessRunner was the original way we handled exceptions. However, we were not seeing some errors, probably those that were thrown in other threads created by the framework or in user code. To handle that issue OnAgentExit was created. It guarantees we get some message if the agent terminates abnormally, but unfortunately the default message is not very useful.

ChrisMaddock added Enhancement good first issue pri:normal help wanted labels Mar 23, 2018

ChrisMaddock mentioned this issue Mar 23, 2018

SocketException thrown by nunit3-console.exe --explore option #225

Closed

ChrisMaddock changed the title ~~Provide useful error message when agent crashes with a stack overflow error message~~ Provide useful error message when agent crashes with a stack overflow exception Mar 23, 2018

ChrisMaddock mentioned this issue May 9, 2018

How to detect if test crashed from stack overflow by examining console runner output? nunit/docs#266

Closed

JVimes mentioned this issue Aug 2, 2019

Planning: replacing remoting with custom protocol #266

Closed

ChrisMaddock mentioned this issue Aug 6, 2019

Better handling of FailFast #666

Open

ChrisMaddock assigned mano-si Apr 11, 2020

ChrisMaddock added the closed:done label Apr 27, 2020

ChrisMaddock added this to the 3.12 milestone Apr 27, 2020

ChrisMaddock closed this as completed Apr 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide useful error message when agent crashes with a stack overflow exception #391

Provide useful error message when agent crashes with a stack overflow exception #391

ChrisMaddock commented Mar 23, 2018

jnm2 commented Mar 23, 2018

ChrisMaddock commented Mar 24, 2018

CharliePoole commented Mar 24, 2018 •

edited

Loading

ChrisMaddock commented Mar 25, 2018

CharliePoole commented Mar 25, 2018

mano-si commented Apr 11, 2020 •

edited

Loading

ChrisMaddock commented Apr 11, 2020

mano-si commented Apr 19, 2020 •

edited

Loading

CharliePoole commented Apr 19, 2020

ChrisMaddock commented Apr 19, 2020

mano-si commented Apr 20, 2020 •

edited

Loading

mano-si commented Apr 20, 2020

ChrisMaddock commented Apr 20, 2020

mano-si commented Apr 22, 2020

ChrisMaddock commented Apr 22, 2020

CharliePoole commented Apr 23, 2020

mano-si commented Apr 23, 2020

CharliePoole commented Apr 23, 2020

Provide useful error message when agent crashes with a stack overflow exception #391

Provide useful error message when agent crashes with a stack overflow exception #391

Comments

ChrisMaddock commented Mar 23, 2018

jnm2 commented Mar 23, 2018

ChrisMaddock commented Mar 24, 2018

CharliePoole commented Mar 24, 2018 • edited Loading

ChrisMaddock commented Mar 25, 2018

CharliePoole commented Mar 25, 2018

mano-si commented Apr 11, 2020 • edited Loading

ChrisMaddock commented Apr 11, 2020

mano-si commented Apr 19, 2020 • edited Loading

CharliePoole commented Apr 19, 2020

ChrisMaddock commented Apr 19, 2020

mano-si commented Apr 20, 2020 • edited Loading

mano-si commented Apr 20, 2020

ChrisMaddock commented Apr 20, 2020

mano-si commented Apr 22, 2020

ChrisMaddock commented Apr 22, 2020

CharliePoole commented Apr 23, 2020

mano-si commented Apr 23, 2020

CharliePoole commented Apr 23, 2020

CharliePoole commented Mar 24, 2018 •

edited

Loading

mano-si commented Apr 11, 2020 •

edited

Loading

mano-si commented Apr 19, 2020 •

edited

Loading

mano-si commented Apr 20, 2020 •

edited

Loading