-
Notifications
You must be signed in to change notification settings - Fork 146
Description
Hi!
Recently I have stumbled upon very tricky issue in my program, that cost me several months of debugging, and was caused by a mistake in the OTL. I would like to share my findings with You in order to issue some sort of fix and so others would also know about the problem.
I can't create a minimal reproducible example, but I'm pretty sure you can understand what is what from my explanation. Also I'' le shed some light on how some could re-create it.
Issue I was facing and that made me to begin debugging was Windows Service being stuck in an unusual place.
Analyzing memory dump has shown me, that thread is sleeping inside of the WaitForSingleObject and it is waiting for LockSemaphore object in the critical section. In other words, thread was sleeping inside of the EnterCriticalSection. It might be just fine, if the critical section hasn't been.... unlocked at the time.
So, dumping the !cs xyz in WinDbg has shown me, that CS is unlocked, but the thread is waiting for it. So how could it happen?
-----------------------------------------
Critical section = 0x01c92ee4 (+0x1C92EE4)
DebugInfo = 0x060787a8
NOT LOCKED
LockSemaphore = 0xBA8
SpinCount = 0x020007d0
Side note 1: the problem can be seen only on Windows up to 8.1/Windows Server 2012, that use events in the Critical Sections.
The CS thread in question was waiting for the MonitorLock here:
function TOmniTaskControl.RemoveMonitor: IOmniTaskControl;
begin
{$IFDEF MSWINDOWS}
if assigned(otcSharedInfo.Monitor) then begin
EnsureCommChannel;
otcSharedInfo.CommChannel.Endpoint2.Writer.ContainerSubject.Detach(
otcSharedInfo.Monitor, coiNotifyOnAllInserts);
otcSharedInfo.MonitorLock.Acquire;
After long debugging, many build-a-hypothesis-check-it-come-up-with-a-new-one iterations, requesting any ideas on SO for further debugging, I created a special facility to verify integrity of the CS events. So. I have intercepted on the ntdll level:
InitializeCriticalSection
DeleteCriticalSection
registering CS's known to me
EnterCriticalSection
TryEnterCriticalSection
LeaveCriticalSection
registering threads that are inside of these areas
WaitForSingleObject
WaitForMultipleObjects
SetEvent
ResetEvent
ClearEvent
registering attempts to manipulate event of the known critical section while not being inside of Enter/LeaveCriticalSection code.
As a result, I have found that this is a OTL task handling code that is responsible for the failure.
Shortly, this code:
procedure TOmniTask.InternalExecute(calledFromTerminate: boolean);
...
(1) eventTerminate := otSharedInfo_ref.TerminatedEvent;
otSharedInfo_ref.Stopped := true;
// with internal monitoring this will not be processed if the task controller owner is also shutting down
sync := nil; // to remove the warning in the 'finally' clause below
otSharedInfo_ref.MonitorLock.Acquire;
try
sync := otSharedInfo_ref.MonitorLock.SyncObj;
{$IFDEF MSWINDOWS}
if assigned(otSharedInfo_ref.Monitor) then
{$IFDEF CPUX64}
otSharedInfo_ref.Monitor.Send(COmniTaskMsg_Terminated, WPARAM(UniqueID), 0);
{$ELSE}
(2) otSharedInfo_ref.Monitor.Send(COmniTaskMsg_Terminated, WPARAM(Int64Rec(UniqueID).Lo), LPARAM(Int64Rec(UniqueID).Hi));
{$ENDIF}
{$ENDIF MSWINDOWS}
otSharedInfo_ref := nil;
finally sync.Release; end;
end;
finally
//Task controller could die any time now. Make sure we're not using shared
//structures anymore.
otExecutor_ref := nil;
otParameters_ref := nil;
otSharedInfo_ref := nil;
(3) SetEvent(eventTerminate);
end;
....
destructor TOmniTaskControl.Destroy;
begin
...
{$IFDEF MSWINDOWS}
if otcSharedInfo.TerminatedEvent <> 0 then begin
(4) CloseHandle(otcSharedInfo.TerminatedEvent);
otcSharedInfo.TerminatedEvent := 0;
end;
is responsible.
Notice: first, we send the message that task is terminated, and then call SetEvent to signal yet again to various listeners, that thread is terminated.
The problem is that between .Send and SetEvent whole century of events passes, and when we approach SetEvent, the event is no longer valid.
So what, you might ask - it is not valid, not a big deal.
The conceptual problem that it manages to CloseHandle on this event and then create new one reusing same handle!!!
And the worst thing that handle is reused for another critical section that gets created, challenged, so event is allocated for it, with the handle we remembered.
Sequence is following (see code above):
(1) we write down the handle value
(2) we send a message, the main thread (it is happening in the perallel loop) dispatches it and detaches from monitor
...
Task destructor is called
...
(4) task is destroyed, its terminated event gets closed. But we still remember the handle in the eventTerminate variable
...
Event gets reused for whatever there is (CS in my case)
...
(3) we kick absolutely unrelated event
And we're doomed. Per my experiments, with the worst unluck, it could be even event inside of the internal ntdll critical section guarding list of user critical sections, making app completely stuck.
In Windows 10/2019 Server + CSs do not use events anymore, but whatever - it could be any other event that we're erroneously kicking here.
So, my solution was - create a wrapper:
THandleWrapper = class(TInterfacedObject, IOmniHandleObject)
FValue: THandle
with a destructor calling CloseHandle. Then I changed all the places to be not TOmniTransitionEvent, but IOmniHandleObject (including OmniCommChanel etc), so event is closed only after last one who wants to touch it is destroyed.
As I can see from the comment:
//At some point this type will be dropped and all the codebase will use
//IOmniEvent or something similar.
TOmniTransitionEvent = {$IFDEF MSWINDOWS}THandle{$ELSE}IOmniEvent{$ENDIF};
Probably it is about the time to do so :) Or use my approach.
Side note 2: previously, 11 years ago, in the commit 3354ae6
it was changed so that first SetEvent is used, then Post is done (if I'm not mistaken), but then it was reverted back again, and it is where the issue build up.
Side note 3: to recreate the issue, you probably can do following, for example, to not recreate my heavy facility:
- Create Parallel.For for set of tasks every 1 second, each task doing something short
- Create simplier facility to track the seqeuence - TDictionary for example. Add an item in the TOmniTaskControl.Initialize when SharedInfo.TerminatedEvent is created, together with the some GUID (CLSID); delete in in the .Destroy, when the CloseHandle is called; before the SetEvent call check, that event is still in the dictionary and it is of same version as when we remembered it in the eventTerminate.
You may want to add some eosrt of Sleep(1) before the SetEvent to make probablity higher - Run it for some time and see that once you'd try to kick the deallocated event.
Note that you need to remember not just handle value, but the guid, exactly because handle values are reused.