Operating System Design Review is a modern exploration of operating system architecture focusing primarily on user-mode, starting at its origin: the loader, and investigating other subsystems from there.
The intentions of this write-up are to:
- Compare the Windows, Linux, and MacOS user-mode environments
- Provide perspective on architectural and ecosystem differences, how they coincide with the loader and the broader system, then draw conclusions and create solutions based on our findings
- Focus on the concurrent design and properties of subsystems
- Include formal documentation on how the modern Windows loader functions in contrast to current open source Windows implementations, including Wine and ReactOS (they lack support for the "parallel loading" ability present in a modern Windows loader)
- Educate, satisfy curiosity, and help fellow reverse engineers
- If you're looking for information on anything in particular, give this document a Ctrl + F or ⌘ + F
All of the information contained here covers Windows 10 22H2 and glibc 2.38 on Linux. In certain cases, facts were also verified on a fully up-to-date release of Windows 11. Some sections of this document additionally touch on MacOS, and now other operating systems, too.
Author: Elliot Killick (@ElliotKillick)
- Operating System Design Review
- Table of Contents
- Parallel Loader Overview
- High-Level Loader Synchronization
- Windows Loader Module State Transitions Overview
- Constructors and Destructors Overview
- The Root of `DllMain` Problems
- The Problem with How Windows Uses DLLs
- Dependency Breakdown
- Further Research on Windows' Usage of DLLs
- Library Loading Locations Across Operating Systems
- `LoadLibrary` vs `dlopen` Return Type
- Investigating the Idea of MT-Safe Library Initialization
- The Problem with How Windows Uses Threads
- Process Meltdown
- Further Research on Windows' Usage of Threads
- DLL Thread Routines Anti-Feature
- Flimsy Thread-Local Data
- The PEB Problem
- Procedure/Symbol Lookup Comparison (Windows `GetProcAddress` vs POSIX `dlsym` GNU Implementation)
- ELF Flat Symbol Namespace (GNU Namespaces and `STB_GNU_UNIQUE`)
- How Does `GetProcAddress`/`dlsym` Handle Concurrent Library Unload?
- Lazy Linking Synchronization
- Library Lazy Loading and Lazy Linking Overview
- GNU Loader Lock Hierarchy and Synchronization Strategy
- A Concurrency Bug in the Windows Loader!
- `GetProcAddress` Can Perform Module Initialization
- Windows Loader Initialization Locking Requirements
- Investigating the COM Server Deadlock from `DllMain`
- On Making COM from `DllMain` Safe
- Loader Enclaves
- Module Information Data Structures
- Loader Components
- Component Model Technology Overview
- COMplications
- Computer History Perspective
- Microsoft Windows Complaints
- The Process Lifetime
- Defining Loader and Linker Terminology
- What is Concurrency and Parallelism?
- ABBA Deadlock
- ABA Problem
- Dining Philosophers Problem
- Reverse Engineered Windows Loader Functions
- License
When a library load contains more than one work item (i.e. a library with at least one dependency that is not already loaded into the process), the Windows loader will use its parallel loading ability to speed up library loading. The first work item of a load will always happen in series, on the same thread that called `LoadLibrary`, because the loader must begin to map and snap one library before it can find dependencies that it also needs to map and snap. To start, see what a trace of one library with no new dependencies looks like.

Put simply, the parallel loader is a layer on top of the regular loader that calls `ntdll!LdrpQueueWork` to offload library loading work to other loader threads:
# "call ntdll!LdrpQueueWork" <NTDLL_ADDRESS> L9999999
ntdll!LdrpSignalModuleMapped+0x54:
00007ffa`56b208e0 e83bebffff call ntdll!LdrpQueueWork (00007ffa`56b1f420)
ntdll!LdrpMapAndSnapDependency+0x20d:
00007ffa`56b27b9d e87e78ffff call ntdll!LdrpQueueWork (00007ffa`56b1f420)
ntdll!LdrpLoadDependentModule+0xd63:
00007ffa`56b28943 e8d86affff call ntdll!LdrpQueueWork (00007ffa`56b1f420)
ntdll!LdrpLoadContextReplaceModule+0x126:
00007ffa`56b718e2 e839dbfaff call ntdll!LdrpQueueWork (00007ffa`56b1f420)
Everything else is infrastructure to support this work offloading mechanism.
The `ntdll!LdrpQueueWork` function is how modules are added to the `ntdll!LdrpWorkQueue` linked list data structure. Work processors (i.e. callers of `ntdll!LdrpProcessWork`), such as loader worker threads or the `ntdll!LdrpDrainWorkQueue` function (for instance, in a concurrent `LoadLibrary`), access the `ntdll!LdrpWorkQueue` list to pick up a work item. Access to the `ntdll!LdrpWorkQueue` shared data structure is protected by the `ntdll!LdrpWorkQueueLock` critical section lock.
Each list entry in the `ntdll!LdrpWorkQueue` data structure is a `LDRP_LOAD_CONTEXT` structure. This structure is undocumented by Microsoft because its contents are not in the public debug symbols. Each `LDRP_LOAD_CONTEXT` structure relates directly to one module because a module's `LDR_DATA_TABLE_ENTRY` structure is allocated at the same time as its `LDRP_LOAD_CONTEXT` structure in the `LdrpAllocatePlaceHolder` function. In addition, the first member of each `LDRP_LOAD_CONTEXT` structure is a `UNICODE_STRING` containing the `BaseDllName` of the module that it relates to.
Loader worker threads are dedicated threads that are part of a thread pool for parallelizing loader work. These threads can be identified by checking whether the `LoaderWorker` flag is present in the `TEB.SameTebFlags` of a thread.
Only mapping and snapping work can be offloaded for parallelized processing because module initialization routines must execute in series.
The high-level loader synchronization mechanisms responsible for controlling the loader are the `LdrpLoadCompleteEvent` and `LdrpWorkCompleteEvent` loader events in NTDLL.

When the loader sets the `LdrpLoadCompleteEvent` event, it is signalling the completion of a full library load or unload, or the completion of loader thread initialization. When `LdrpLoadCompleteEvent` is signalled, it directly correlates with `ntdll!LdrpWorkInProgress` equalling zero and the decommissioning of the current thread as the load owner (`LoadOwner` flag in `TEB.SameTebFlags`). Here is a minimal reverse engineering of the `ntdll!LdrpDropLastInProgressCount` function showing this:
NTSTATUS LdrpDropLastInProgressCount()
{
    // Remove thread's load owner flag
    PTEB CurrentTeb = NtCurrentTeb();
    CurrentTeb->SameTebFlags &= ~LoadOwner; // 0x1000

    // Load/unload is now complete
    RtlEnterCriticalSection(&LdrpWorkQueueLock);
    LdrpWorkInProgress = 0;
    RtlLeaveCriticalSection(&LdrpWorkQueueLock);

    // Signal completion of load/unload to any waiting threads
    return NtSetEvent(LdrpLoadCompleteEvent, NULL);
}
When the loader sets the `LdrpWorkCompleteEvent` event, it is signalling that the loader has completed the mapping and snapping work on the entire work queue across all of the currently processing loader worker threads. When a loader worker thread starts, it atomically increments `ntdll!LdrpWorkInProgress` (in the `ntdll!LdrpWorkCallback` function), and when a loader worker thread ends, it atomically decrements `ntdll!LdrpWorkInProgress` (at the end of the `ntdll!LdrpProcessWork` function). This means that every increment to the `ntdll!LdrpWorkInProgress` reference counter past `1`, since that is the value `ntdll!LdrpDrainWorkQueue` initially sets `ntdll!LdrpWorkInProgress` to, indicates another loader worker thread processing a work item in parallel. Here is a minimal reverse engineering of where the `ntdll!LdrpProcessWork` function returns showing this:
// Second argument of LdrpProcessWork: isCurrentThreadLoadOwner
// If the current thread is a loader worker (i.e. not a load owner)
if (!isCurrentThreadLoadOwner)
{
    RtlEnterCriticalSection(&LdrpWorkQueueLock);
    // If the work queue is empty AND we are the last loader worker thread processing work
    // There were some double negatives I had to sort out here in the reverse engineering
    BOOL doSetEvent = &LdrpWorkQueue == LdrpWorkQueue.Flink && --LdrpWorkInProgress == 1;
    Status = RtlLeaveCriticalSection(&LdrpWorkQueueLock);
    if (doSetEvent)
        return NtSetEvent(LdrpWorkCompleteEvent, NULL);
}

return Status;
Here are all the loader's usages of `LdrpLoadCompleteEvent` and `LdrpWorkCompleteEvent`:
0:000> # "ntdll!LdrpLoadCompleteEvent" <NTDLL_ADDRESS> L9999999
ntdll!LdrpDropLastInProgressCount+0x38:
00007ffd`2896d9c4 488b0db5e91000 mov rcx,qword ptr [ntdll!LdrpLoadCompleteEvent (00007ffd`28a7c380)]
ntdll!LdrpDrainWorkQueue+0x2d:
00007ffd`2896ea01 4c0f443577d91000 cmove r14,qword ptr [ntdll!LdrpLoadCompleteEvent (00007ffd`28a7c380)]
ntdll!LdrpCreateLoaderEvents+0x12:
00007ffd`2898e182 488d0df7e10e00 lea rcx,[ntdll!LdrpLoadCompleteEvent (00007ffd`28a7c380)]
0:000> # "ntdll!LdrpWorkCompleteEvent" <NTDLL_ADDRESS> L9999999
ntdll!LdrpDrainWorkQueue+0x18:
00007ffd`2896e9ec 4c8b35bdd91000 mov r14,qword ptr [ntdll!LdrpWorkCompleteEvent (00007ffd`28a7c3b0)]
ntdll!LdrpProcessWork+0x1e4:
00007ffd`2896ede0 488b0dc9d51000 mov rcx,qword ptr [ntdll!LdrpWorkCompleteEvent (00007ffd`28a7c3b0)]
ntdll!LdrpCreateLoaderEvents+0x35:
00007ffd`2898e1a5 488d0d04e20e00 lea rcx,[ntdll!LdrpWorkCompleteEvent (00007ffd`28a7c3b0)]
ntdll!LdrpProcessWork$fin$0+0x7c:
00007ffd`289b5ad7 488b0dd2680c00 mov rcx,qword ptr [ntdll!LdrpWorkCompleteEvent (00007ffd`28a7c3b0)]
The `ntdll!LdrpCreateLoaderEvents` function creates both events. Only the `ntdll!LdrpDrainWorkQueue` function can wait (calling `ntdll!NtWaitForSingleObject`) on the `LdrpLoadCompleteEvent` or `LdrpWorkCompleteEvent` loader events. Only the `ntdll!LdrpDropLastInProgressCount` function sets `LdrpLoadCompleteEvent`. Only the `ntdll!LdrpProcessWork` function sets `LdrpWorkCompleteEvent`.
At event creation (`ntdll!NtCreateEvent`), `LdrpLoadCompleteEvent` and `LdrpWorkCompleteEvent` are configured to be auto-reset events. The loader never manually resets the `LdrpLoadCompleteEvent` and `LdrpWorkCompleteEvent` events (with `ntdll!NtResetEvent`).
The `ntdll!LdrpDrainWorkQueue` function takes one argument. This argument is a boolean indicating whether the function should wait on the `LdrpLoadCompleteEvent` or `LdrpWorkCompleteEvent` loader event before draining the work queue. Please see my reverse engineering of the `ntdll!LdrpDrainWorkQueue` function.

What follows documents the parts of the loader that call `ntdll!LdrpDrainWorkQueue` (data gathered by searching disassembly for calls to the `ntdll!LdrpDrainWorkQueue` function) as either a load owner or a load worker:
ntdll!LdrUnloadDll+0x80: OWNER
ntdll!RtlQueryInformationActivationContext+0x43c: OWNER
ntdll!LdrShutdownThread+0x98: OWNER
ntdll!LdrpInitializeThread+0x86: OWNER
ntdll!LdrpLoadDllInternal+0xbe: OWNER
ntdll!LdrpLoadDllInternal+0x144: WORKER
ntdll!LdrpLoadDllInternal$fin$0+0x38: WORKER
ntdll!LdrGetProcedureAddressForCaller+0x270: OWNER
ntdll!LdrEnumerateLoadedModules+0xa7: OWNER
ntdll!RtlExitUserProcess+0x23: OWNER or WORKER
- Depends on `TEB.SameTebFlags`: typically `OWNER` if the `LoadOwner` and `LoaderWorker` flags are absent, `WORKER` if either of these flags is present
ntdll!RtlPrepareForProcessCloning+0x23: OWNER
ntdll!LdrpFindLoadedDll+0x9127a: OWNER
ntdll!LdrpFastpthReloadedDll+0x9033a: OWNER
ntdll!LdrpInitializeImportRedirection+0x46d44: OWNER
ntdll!LdrInitShimEngineDynamic+0x3c: OWNER
ntdll!LdrpInitializeProcess+0x130a: OWNER
ntdll!LdrpInitializeProcess+0x1d0d: OWNER
ntdll!LdrpInitializeProcess+0x1e22: WORKER
ntdll!LdrpInitializeProcess+0x1f33: OWNER
ntdll!RtlCloneUserProcess+0x71: OWNER
Calls to the `ntdll!LdrpDrainWorkQueue` function do not always result in synchronizing on the relevant loader event. Notably, there are many more instances of the loader potentially synchronizing on the entire load's completion rather than just the completion of mapping and snapping work. For example, thread initialization (`ntdll!LdrpInitializeThread`) always synchronizes on the `LdrpLoadCompleteEvent` loader event. The only parts of the loader that may synchronize on `LdrpWorkCompleteEvent` are `ntdll!LdrpLoadDllInternal`, `ntdll!LdrpInitializeProcess`, and `ntdll!RtlExitUserProcess`.
Here are the places where the loader completes all loader work (the `ntdll!LdrpDropLastInProgressCount` function), which is where `LdrpLoadCompleteEvent` is set. Many of these are edge cases; the invocations by `ntdll!LdrpLoadDllInternal`, and by loader thread initialization/deinitialization in the `ntdll!LdrpInitializeThread` and `ntdll!LdrShutdownThread` functions, are the most common:
0:000> # "call ntdll!LdrpDropLastInProgressCount" <NTDLL_ADDRESS> L9999999
ntdll!LdrUnloadDll+0x99:
00007ffa`56b1fc89 e8eef10400 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!RtlQueryInformationActivationContext+0x463:
00007ffa`56b23243 e834bc0400 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrShutdownThread+0x20b:
00007ffa`56b2765b e81c780400 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeThread+0x218:
00007ffa`56b27950 e827750400 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpLoadDllInternal+0x24b:
00007ffa`56b2fc5f e818f20300 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrGetProcedureAddressForCaller+0x275:
00007ffa`56b40035 e842ee0200 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrEnumerateLoadedModules+0xae:
00007ffa`56b6ee6e e809000000 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrShutdownThread$fin$2+0x1e:
00007ffa`56bb4f95 e8e29efbff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeThread$fin$2+0x15:
00007ffa`56bb4ff4 e8839efbff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpLoadDllInternal$fin$0+0x47:
00007ffa`56bb526e e8099cfbff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrEnumerateLoadedModules$fin$0+0x1b:
00007ffa`56bb5ee9 e88e8ffbff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpFindLoadedDll+0x917ae:
00007ffa`56bbf2ce e8a9fbfaff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpFastpthReloadedDll+0x90862:
00007ffa`56bc04e2 e895e9faff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeImportRedirection+0x464cf:
00007ffa`56bd89b3 e8c464f9ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrInitShimEngineDynamic+0xe8:
00007ffa`56be0528 e84fe9f8ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeProcess+0x183c:
00007ffa`56be358c e8ebb8f8ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeProcess+0x1eda:
00007ffa`56be3c2a e84db2f8ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeProcess+0x1f8e:
00007ffa`56be3cde e899b1f8ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
Here are the few places where the loader processes mapping and snapping work (the `ntdll!LdrpProcessWork` function), which is where `LdrpWorkCompleteEvent` is set:
0:000> # "call ntdll!LdrpProcessWork" <NTDLL_ADDRESS> L9999999
ntdll!LdrpLoadDependentModule+0x184c:
00007ffa`56b2942c e8bb6c0400 call ntdll!LdrpProcessWork (00007ffa`56b700ec)
ntdll!LdrpLoadDllInternal+0x13a:
00007ffa`56b2fb4e e899050400 call ntdll!LdrpProcessWork (00007ffa`56b700ec)
ntdll!LdrpDrainWorkQueue+0x17f:
00007ffa`56b70043 e8a4000000 call ntdll!LdrpProcessWork (00007ffa`56b700ec)
ntdll!LdrpWorkCallback+0x6e:
00007ffa`56b700ce e819000000 call ntdll!LdrpProcessWork (00007ffa`56b700ec)
`LDR_DDAG_NODE.State` or `LDR_DDAG_STATE` tracks a module's entire lifetime from beginning to end. With this analysis, I intend to extrapolate information based on the known types given to us by Microsoft (the `dt _LDR_DDAG_STATE` command in WinDbg).

Each state represents a stage of loader work on a module. This table comprehensively documents where these state changes occur throughout the loader and which locks are present.

A typical library load ranges from `LdrModulesPlaceHolder` to `LdrModulesReadyToRun` (and may also include `LdrModulesMerged`), and a typical library unload ranges from `LdrModulesUnloading` to `LdrModulesUnloaded`.
| `LDR_DDAG_STATE` State | State Changing Function(s) | Remarks |
|---|---|---|
| `LdrModulesMerged` (-5) | `LdrpMergeNodes` | `LdrpModuleDatatableLock` is held during this state change. See the `LdrModulesCondensed` state for more information. |
| `LdrModulesInitError` (-4) | `LdrpInitializeGraphRecurse` | During `DLL_PROCESS_ATTACH`, if a module's `DllMain` returns `FALSE` for failure then this module state is set (any other return value counts as success). `LdrpLoaderLock` is held here. |
| `LdrModulesSnapError` (-3) | `LdrpCondenseGraphRecurse` | This function may set this state on a module if a snap error occurs. See the `LdrModulesCondensed` state for more information. |
| `LdrModulesUnloaded` (-2) | `LdrpUnloadNode` | Before setting this state, `LdrpUnloadNode` may walk `LDR_DDAG_NODE.Dependencies`, holding `LdrpModuleDatatableLock` to call `LdrpDecrementNodeLoadCountLockHeld`, thus decrementing the `LDR_DDAG_NODE.LoadCount` of dependencies and recursively calling `LdrpUnloadNode` to unload dependencies. Loader lock (`LdrpLoaderLock`) is held here. |
| `LdrModulesUnloading` (-1) | `LdrpUnloadNode` | Set near the start of this function. This function checks for the `LdrModulesInitError`, `LdrModulesReadyToInit`, and `LdrModulesReadyToRun` states before setting this new state. After setting the state, this function calls `LdrpProcessDetachNode`. Loader lock (`LdrpLoaderLock`) is held here. |
| `LdrModulesPlaceHolder` (0) | `LdrpAllocateModuleEntry` | The loader directly calls `LdrpAllocateModuleEntry` until parallel loader initialization (`LdrpInitParallelLoadingSupport`) occurs at process startup. From that point on (with the exception of directly calling `LdrpAllocateModuleEntry` once more soon after parallel loader initialization to allocate a module entry for the EXE), the loader calls `LdrpAllocatePlaceHolder` (this function first allocates a `LDRP_LOAD_CONTEXT` structure), which calls through to `LdrpAllocateModuleEntry` (this function places a pointer to this module's `LDRP_LOAD_CONTEXT` structure at `LDR_DATA_TABLE_ENTRY.LoadContext`). The `LdrpAllocateModuleEntry` function, along with creating the module's `LDR_DATA_TABLE_ENTRY` structure, allocates its `LDR_DDAG_NODE` structure with zero-initialized heap memory. The module's data structures have been allocated with basic initialization. |
| `LdrModulesMapping` (1) | `LdrpMapCleanModuleView` | I've never seen this function get called; the state typically jumps from 0 to 2. Only the `LdrpGetImportDescriptorForSnap` function may call this function, which itself may only be called by `LdrpMapAndSnapDependency` (according to a disassembly search). `LdrpMapAndSnapDependency` typically calls `LdrpGetImportDescriptorForSnap`; however, `LdrpGetImportDescriptorForSnap` doesn't typically call `LdrpMapCleanModuleView`. This state is set before mapping a memory section (`NtMapViewOfSection`). Mapping is the process of loading a file from disk into memory. |
| `LdrModulesMapped` (2) | `LdrpProcessMappedModule` | `LdrpModuleDatatableLock` is held during this state change. Mapping is complete. |
| `LdrModulesWaitingForDependencies` (3) | `LdrpLoadDependentModule` | This state isn't typically set, but during a trace, I was able to observe the loader set it by launching a web browser (Google Chrome) under WinDbg, which triggered the watchpoint in this function when loading the app compatibility DLL `C:\Windows\System32\ACLayers.dll`. Interestingly, the `LDR_DDAG_STATE` decreases by one here from `LdrModulesSnapping` to `LdrModulesWaitingForDependencies`; the only time I've observed this. `LdrpModuleDatatableLock` is held during this state change. |
| `LdrModulesSnapping` (4) | `LdrpSignalModuleMapped` or `LdrpMapAndSnapDependency` | In the `LdrpMapAndSnapDependency` case, a jump from `LdrModulesMapped` to `LdrModulesSnapping` may happen. `LdrpModuleDatatableLock` is held during the state change in `LdrpSignalModuleMapped`, but not in `LdrpMapAndSnapDependency`. Snapping is the process of resolving the library's import address table (module imports and exports) to addresses in memory. |
| `LdrModulesSnapped` (5) | `LdrpSnapModule` or `LdrpMapAndSnapDependency` | In the `LdrpMapAndSnapDependency` case, a jump from `LdrModulesMapped` to `LdrModulesSnapped` may happen, which indicates the loader doesn't always bother recording the in-between `LdrModulesSnapping` state transition. `LdrpModuleDatatableLock` isn't held here in either case. Snapping is complete. |
| `LdrModulesCondensed` (6) | `LdrpCondenseGraphRecurse` | This function receives a `LDR_DDAG_NODE` as its first argument and recursively calls itself to walk `LDR_DDAG_NODE.Dependencies`. On every recursion, this function checks whether it can remove the passed `LDR_DDAG_NODE` from the graph. If so, this function acquires `LdrpModuleDatatableLock` to call the `LdrpMergeNodes` function, which receives the same first argument, then releases `LdrpModuleDatatableLock` after it returns. `LdrpMergeNodes` discards the unneeded node from the `LDR_DDAG_NODE.Dependencies` and `LDR_DDAG_NODE.IncomingDependencies` DAG adjacency lists of any modules starting from the given parent node (first function argument), decrements `LDR_DDAG_NODE.LoadCount` to zero, and calls `RtlFreeHeap` to deallocate `LDR_DDAG_NODE` DAG nodes. After `LdrpMergeNodes` returns, `LdrpCondenseGraphRecurse` calls `LdrpDestroyNode` to deallocate any DAG nodes in the `LDR_DDAG_NODE.ServiceTagList` list of the parent `LDR_DDAG_NODE`, then deallocates the parent `LDR_DDAG_NODE` itself. `LdrpCondenseGraphRecurse` sets the state to `LdrModulesCondensed` before returning. Note: The `LdrpCondenseGraphRecurse` function and its callees rely heavily on all members of the `LDR_DDAG_NODE` structure, which needs further reverse engineering to fully understand the inner workings and "whys" of what's occurring here. Condensing is the process of discarding unnecessary nodes from the dependency graph. |
| `LdrModulesReadyToInit` (7) | `LdrpNotifyLoadOfGraph` | This state is set immediately before this function calls `LdrpSendPostSnapNotifications` to run post-snap DLL notification callbacks. As the loader initializes nodes (i.e. modules) in the dependency graph (while loader lock is held), each node's state will transition to `LdrModulesInitializing` then `LdrModulesReadyToRun` (or `LdrModulesInitError` if initialization fails). The module is mapped and snapped but pending initialization (which includes any form of running code from the module). |
| `LdrModulesInitializing` (8) | `LdrpInitializeNode` | Set at the start of this function, immediately before linking a module into the `InInitializationOrderModuleList` list. After linking the module into the initialization order list, the loader calls the module's `LDR_DATA_TABLE_ENTRY.EntryPoint`. Loader lock (`LdrpLoaderLock`) is held here. Initializing is the process of running a module's initialization routines (i.e. the module initializer, including the Windows `DllMain`). |
| `LdrModulesReadyToRun` (9) | `LdrpInitializeNode` | Set at the end of this function, before it returns. Loader lock (`LdrpLoaderLock`) is held here. The module is ready for use. |
Findings were gathered by tracing all `LDR_DDAG_STATE.State` values at load-time and tracing a library unload, as well as by searching disassembly. See what a `LDR_DDAG_STATE` trace log looks like (be aware of the warnings).
Constructors and destructors exist to facilitate dynamic initialization. Dynamic initialization is custom code that runs before accessing a resource. In the module scope, this code executes before the `main()` function or when a module is loaded.

Module constructors and destructors are the operating system and language agnostic terms for describing this feature. On Unix, these may be referred to as initialization and finalization or termination routines/functions. In Windows DLLs, the functionally equivalent idea exists as `DLL_PROCESS_ATTACH` and `DLL_PROCESS_DETACH` calls to the `DllMain` function. Initialization and deinitialization/uninitialization routines, or simply initializer and finalizer, are also common terminology.
In addition to module load and unload, the Windows loader may call each module's `DllMain` at `DLL_THREAD_ATTACH` and `DLL_THREAD_DETACH` times. The Windows loader only calls these routines at thread start and exit. Windows doesn't run the `DLL_THREAD_ATTACH` of a DLL following `DLL_PROCESS_ATTACH`. Additionally, a DLL loaded after thread start won't preempt that thread to run its `DllMain` with `DLL_THREAD_ATTACH`. These calls can be disabled per-library as a performance optimization by calling `DisableThreadLibraryCalls` at `DLL_PROCESS_ATTACH` time.
Compilers commonly provide access to module initialization/deinitialization functions through compiler-specific syntax. In GCC or Clang, a programmer can create module constructors/destructors using the `__attribute__((constructor))` and `__attribute__((destructor))` attributes or, historically, the `_init` and `_fini` functions. Modern GCC or Clang module constructors and destructors support specifying a priority, like `__attribute__((constructor(101)))` or `__attribute__((destructor(101)))`, in case a particular execution order is desired.
In C++, the constructor of an object is invoked whenever an instance of a class is created. Creating an instance of a class returns an object pointing to that instance. If an object is created in the global scope (C++ terminology) or the module scope (OS terminology), then its constructor is called during program or library initialization (code example). If an object is created in a local scope, like in a function, its constructor is called when program execution creates that object in the function. A constructor or class itself is neither inherently global nor local; it entirely depends on the context in which the object is created.
Common use cases for dynamic initialization include:

- Communication with another process (for instance, the Windows API relies on a system-wide `csrss.exe` server, which requires dynamic initialization on the side of the client)
- Creating an inter-process synchronization mechanism (Windows commonly uses inter-process event synchronization objects even when predominantly or only intra-process synchronization is or should be required)
- Initializing an implementation-dependent data structure such as a critical section (rather than storing the internal POD directly in your module, which would necessitate that the ABI remain backward compatible forever or be versioned). The apparent reason Microsoft never provided a method for statically initializing a Windows critical section is that developers butchered the original POD definition, and when Microsoft wanted to go back and change it to something more sensible (i.e. simply initializing to all zeros by default, like GNU does), they couldn't without breaking bug compatibility (there is also the kernel's view on needing to keep track of mutex objects in its memory, which is obsolete ever since registrationless futex and futex-like mechanisms became a thing). In the common case where default mutex attributes are appropriate, POSIX mutexes can be statically initialized with the `PTHREAD_MUTEX_INITIALIZER` macro; otherwise, or when creating a mutex in dynamically allocated memory, dynamic initialization with the `pthread_mutex_init` function is necessary. A POSIX mutex is equivalent to a Windows critical section, whereas a Windows mutex object differs due to being an inter-process synchronization mechanism.
- Setting up a thread pool, background thread, or event loop to prepare early for concurrent operations
- Reading configuration data from some persistent data source (e.g. an environment variable, a file, or a registry key)
- Tracing or logging events, or setting up tracing/logging
- Controlling resource lifecycle (at initialization and destruction time)
- Various domain-specific initialization and registration tasks

Generally, constructors effectively address cross-cutting concerns in initialization (especially in the module scope when functionality is split across many tightly coupled libraries, as in Windows).
Due to the useful position of constructors and destructors when run in the global scope, they may sometimes be used outside of dynamic initialization, such as for auditing or hooking purposes. The GNU loader specially provides the `LD_AUDIT` and `LD_PRELOAD` mechanisms for these purposes (with the latter having broad support across Unix-like systems). The GNU loader calls into an `LD_AUDIT` library at library load/unload and symbol resolution times to allow for hooking or monitoring. `LD_PRELOAD` allows easily hooking global scope symbol resolution. Windows allows for DLL notification registration, offering similar functionality, though it is more limited. The always-and-early execution style of constructors in the module scope also makes them an attractive target for attackers.
Constructors and destructors originate from object-oriented programming (OOP), a programming paradigm first introduced by the Simula language in the 1960s. C++, a modern object-oriented language, was originally designed in the early 1980s as an extension of C and received initial standardization in 1998. Constructors and destructors do not exist in the C standard. On Unix systems, the concept of code that runs when a module loads and unloads goes back to the 1990 System V Application Binary Interface Version 4 (the `DT_INIT` and `DT_FINI` dynamic entry types, as well as the `.init` and `.fini` special section names).
In the ELF executable format, module constructors and destructors are standardized by the System V ABI to be in the `.init` and `.fini` sections. Modern systems use the non-standard but common and generally agreed-upon `.init_array`/`.fini_array` sections or, before that, the deprecated `.ctors`/`.dtors` sections. Modern GCC-built binaries only include `.init_array`/`.fini_array` and `.init`/`.fini` sections; they don't include the `.ctors`/`.dtors` sections (verified with `objdump -h` and `readelf --sections`). Individually exposing each initialization/finalization routine in an array within the ELF file grants the loader more control than calling an opaque function that handles all initialization/finalization. A Unix-like loader loops through these routines contained in the ELF file. Todo: Create a code test to check whether this feature makes module initializers safely interruptible between these routines in the case of circular dependencies, and make the equivalent test for Windows.
The PE (Windows) executable format standard doesn't define any sections specific to module initialization; instead, a `DllMain` function or any module constructors/destructors are included with the rest of the program code in the `.text` section. MSVC optionally provides the `init_seg` pragma to specify a section name with module constructors to run first when compiling C++ code. However, such a section is only used if this pragma is explicitly specified by the programmer (unlikely) or in the niche cases where MSVC will generate one itself (as stated by the documentation). The granularity this pragma provides is low, with only `compiler`, `lib`, and `user` options. In contrast, the `.init_array`/`.fini_array` sections and `__attribute__((constructor(priority)))`/`__attribute__((destructor(priority)))` on Unix-like systems serve as a modular and robust means for controlling dynamic initialization order.
The Windows loader calls a module's LDR_DATA_TABLE_ENTRY.EntryPoint
at module initialization or deinitialization with the respective fdwReason
argument (DLL_PROCESS_ATTACH
or DLL_PROCESS_DETACH
); it has no knowledge of DllMain
or C++ constructors/destructors in the module scope. Merging these into one callable EntryPoint
is the job of a compiler. For instance, MSVC compiles a stub into your DLL (dllmain_dispatch
) that calls any module constructors followed by DllMain
with the DLL_PROCESS_ATTACH
argument (and destructors, of course, in the reverse order). Constructors other than DllMain
, of course, initialize in the order they are laid out in code. The word Main
in DllMain
indicates that DllMain
will run as the last constructor in the module similar to how the main
function of a program runs after all constructors. Still, I find DllMain
to generally be a bad name because it may lead people to use constructors in ways that one might use the main
function of a program due to the similar name (like DllMain
is just main
but in a DLL, which is not the case). I also find Microsoft's use of the term "entry point" (e.g. in LDR_DATA_TABLE_ENTRY.EntryPoint
) to describe calling a module's constructor and destructor routines bad because an entry point has a specific definition that refers to the start of program execution. The reason for this name stems from both an EXE and its DLLs having a LDR_DATA_TABLE_ENTRY
. Especially since the Windows loader does accurately set the EXE's EntryPoint
to the program's main function (then just above EntryPoint
is the DllBase
member of LDR_DATA_TABLE_ENTRY
, which conflates EXEs with DLLs but the other way around). So, a good question is why the LDR_DATA_TABLE_ENTRY
structure definition should be shared between an EXE and its DLLs at all, seeing as the GNU loader does not conflate these concepts because, besides both being some code with data that is mapped into memory, these are completely different things. Up until one point in Windows history, the LDR_DATA_TABLE_ENTRY
structure definition was even shared between kernel and user-mode modules until it was separated into the KLDR_DATA_TABLE_ENTRY
structure: "The LDR_DATA_TABLE_ENTRY
structure is NTDLL’s record of how a DLL is loaded into a process. In early Windows versions, this structure is similarly the kernel’s record of each module that is loaded for kernel-mode execution. The different demands of kernel and user modes eventually led to the separate definition of a KLDR_DATA_TABLE_ENTRY
." The GNU loader calls legacy init
before going through the init_array
functions (the opposite of Windows where DllMain
comes last after all other constructors, similar to how a main
function would). All of these facts come together to paint a picture of Windows being too kernel-centric and monolithic, failing to consider the unique requirements of user mode or to correctly distinguish between execution environments.
The name DllMain
is inherited from LibMain
, which, along with the Windows Exit Procedure (WEP) for exit, was its name in the 16-bit DLLs used by Windows 3.x (non-NT). When Windows was built for 16-bit applications (before Windows NT 3.1 and Windows 95, non-NT), multitasking was cooperative, not preemptive, so predictable scheduling meant there was no need for synchronization mechanisms such as loader lock. System libraries were also typically already loaded in the single shared address space, and that was the level at which tasks (what we now call "processes" were widely referred to as "tasks" before each application had an independent address space and execution context) would reference-count them (Nitpick: Through the use of a virtual machine, early Windows versions on MS-DOS could run a given application in protected mode, albeit with no true privilege separation, and with multithreading, but applications had to specifically be written to support this functionality and DOS was borked, which is why it got abandoned). Still, the MS-DOS EXE format allowed for a pseudo-DLL to specify whether module initialization should be "Global" or "Per-Process" (this information can be gathered using the old exehdr
tool), which would have only lessened the room for module initialization issues in the global case. There was no dynamic linker in MS-DOS because the pseudo-DLLs of that time did not support imports as they did starting with Windows NT. This fact would have made it unnecessary to hold a lock while running a module initialization routine due to having no other DLLs depend on the initialization code being complete, at least not that the loader could have been aware of (a flag likely existed to ensure FreeLibrary
could not unload a pseudo-DLL before it was done loading, but that is it). Obviously then, delay loading did not exist in Windows 3.x (so the loader was naturally at the top of the lock hierarchy). Delayed DLL loading was added to Visual C++ 6.0 (1998) as the public header delayimp.h
(although, the MSVC compiler Microsoft internally used to build Windows may have supported delay loading earlier). Delay loading a DLL was and still is done by the /DELAYLOAD
linker option, which under the hood uses LoadLibrary
/GetProcAddress
with address caching to implement the functionality. Later, delay loading was also integrated into the native loader (When?). DLL thread initializers/deinitializers DLL_THREAD_ATTACH
and DLL_THREAD_DETACH
didn't exist until Windows NT 3.1. NtTerminateProcess
was also introduced with Windows NT 3.1 seemingly as an incredibly poor and hasty but deliberate "design" decision. Anyway, the MS-DOS API likely was not spawning threads all the time like Win32 does since it was originally designed for use in a cooperatively multitasking system. COM (which tightly couples with the loader by placing itself at the top of the lock hierarchy in CoFreeUnusedLibraries
and potentially other places) wasn't a foundational Windows technology used pervasively within the Windows API until Windows NT 4.0 (released in 1996). These properties of older systems largely mitigated issues arising especially from LibMain
on Windows versions prior to Windows NT 3.1 and DllMain
in later Windows versions. Official Windows 3.x (non-NT) books at the time (specifically "Windows Programmers Reference Volume 2 Functions", released in 1992) provided no guidance on LibMain
besides that the "LibMain
function is called by the system to initialize a dynamic-link library (DLL)". However, there was a note for WEP that explicitly stated "The FreeLibrary
function should not be called from within a WEP function". Additionally, we know from Matt Pietrek's Windows Internals book (released in 1993, shortly before Windows NT 3.1 came out and long before the author later became a Microsoft employee) that "A common problem programmers encounter is that functions like MessageBox()
won't work inside the LibMain()
of an implicitly-linked DLL". The reason is that creating a window to show a message box requires initialization of the USER application message queue by the InitApp()
function in USER. This message queue is not initialized in the LibMain
of USER but by some setup work done before calling WinMain
in the EXE (the book provides the relevant reverse engineered pseudocode of C0W.ASM
to prove this): "For EXEs, the important parts of the startup code involves calling InitTask()
and then InitApp()
, which we cover momentarily. After those functions have been called, the EXE is completely initialized and ready to start its work as a Windows program." The book notes that initialization is done this way because a DLL "cannot own things that Windows associates with a task, like message queues" (i.e. a DLL may not exist for the full application lifetime) so it cannot own the application message queue. However, the core issue here is that each task had a single, global application message queue and a DLL couldn't create and tear down its own, independent message queue instance to perform a GUI operation detached from the application (obviously, this is no longer the case in modern Windows). Instead of the application lifetime (that of the EXE), the message box can live in the instance lifetime (from birth when the call to MessageBox
is made to death when it returns, since MessageBox
is a synchronous function), or for more complex GUI operations that continue in the DLL outside of its module initializer, in the lifetime of the DLL (from birth at DLL_PROCESS_ATTACH
to death at DLL_PROCESS_DETACH
for modern DllMain
, since our DLL depends on the GUI subsystem). Thus, GUI operations not working from the LibMain
of implicitly-linked DLLs was a consequence of tight coupling between the GUI subsystem and the operating system. See here for information on Windows and Windows NT history.
The CLR loader uses a module's .cctor
section to initialize .NET assemblies. A .NET assembly is a layer of abstraction over an underlying native library. Each module .cctor
section is the "managed module initializer" (i.e. assembly initializer). Microsoft uses the managed module initializer to work around Windows issues surrounding loader lock in .NET applications.
A static constructor in C# is distinct from its C++ counterpart because C# specifies that a static constructor, even when instance creation happens at the module scope, will initialize on-demand instead of at the start of a program or library:
The static constructor for a closed class executes at most once in a given application domain. The execution of a static constructor is triggered by the first of the following events to occur within an application domain:
- An instance of the class is created.
- Any of the static members of the class are referenced.
If a class contains the
Main
method (§7.1) in which execution begins, the static constructor for that class executes before the Main
method is called.
C# also has finalizers (historically referred to as destructors in C#). The finalizer of an object will run if the garbage collector decides it can destroy the given object. Unlike low-level languages with manual memory management like C++, finalization is not typically necessary because the garbage collector traces memory allocations to do cleanup. Garbage collectors delay resource cleanup like freeing memory as a function of how they work; it is a trade-off they make in exchange for easier programming. This delay extends to finalizers or destructors, where these routines will not run until the garbage collector destroys the object. For unmanaged or system resources such as "windows, files, and network connections" (e.g. closing a database connection), Microsoft documentation condones the use of finalizers, saying "you should use finalizers to free those resources". However, starting with .NET 5, finalizers are not run at application exit. The decision not to call destructors or finalizers at .NET runtime exit appears to have come down to an issue with reachable objects plus unjoined background threads still using those objects (Java appears to have fixed this issue by replacing finalizers with cleaners, which will only call the cleanup action of an object once it becomes unreachable). The root issue, in the described case, is the unjoined thread that is still running when .NET shutdown occurs (similar to what we explored in "The Problem with How Windows Uses Threads"). Also, for releasing unmanaged resources (i.e. external to the .NET runtime so they won't be garbage collected, like a Windows API file handle), an application can register for the AppDomain.ProcessExit
event to perform cleanup before the .NET runtime exits in the process and a library assembly can use the AppDomain.DomainUnload
event to get the same functionality for its lifetime (this works because a .NET assembly cannot unload without unloading the entire domain). Starting with .NET 5, an assembly can be dynamically loaded into an AssemblyLoadContext
, which on Unload
, frees all the assemblies in that load context and calls Unloading
events for cleanup. Assemblies in the default assembly load context cannot be unloaded. Due to the nature of garbage collected languages, the cleanup of especially expensive or contested system resources is best performed by prescribing that users of your subsystem call a Shutdown
, Close
, Disconnect
, etc. method on the relevant object when they are done using it, if possible. Although this approach cannot scale to libraries, since they depend on each other and must be destructed in the reverse order they were constructed (even if you hack it by employing expensive reference counting at the individual resource level, this approach falls apart with circular references or reference cycles), applications can use this technique. If you find your application consuming lots of limited or contended system resources, though, then you may want to reconsider using a garbage collected language, since so-called two-phase initialization (or cleanup) is an anti-pattern.
C# supports the ModuleInitializer
attribute for initialization code that runs when the assembly loads, even when that assembly is a library (like traditional static constructors). Presumably, C# module initializers are protected by a global CLR initialization lock. In C# 9 and .NET 5 (released together in 2020), module initializers were added to the language and runtime out of necessity.
The unexpected initialization time of C# static constructors can cause unforeseen problems similar to how Windows delay loading does for operating system initializers. For instance, a static constructor "call is made in a locked region based on the specific type of the class". So, if creating an instance of a class for the first time happens at an unexpected time (perhaps via proxy through another call) like when the thread is holding a lock, and there exists a static constructor (of the same class type) that acquires the same external lock in the reverse order, then lock order inversion and consequently ABBA deadlock can occur. CLR lazy loading/initialization does have a couple of significant mitigating factors that make it safer than native library lazy loading, namely: Firstly, lazy initialization can only occur upon instance creation, which is necessarily more expected because it's already known that typical per-instance object constructors will run at instance creation time (unlike native library lazy loading, where the initialization can potentially happen on every call to a DLL import). Though, this does leave the other, less common, static constructor trigger of referencing a member in a static class somewhat up in the air as to its safety at the given time. Secondly, static constructors are split into their own routines and initialize with granular, per-instance MT-safe synchronization instead of a broadly serializing "CLR static constructor lock", thus decreasing the chance of trying to reenter initialization or deadlocks. Lazy initialization can still become problematic if your lazy initializer routine accidentally tries to lazily initialize itself again (this issue is typically an artifact of circular dependencies). In regard to library loading, a synchronized, lazily initializing global type (e.g. 
a C# static constructor) should never load or unload libraries (or higher level .NET assemblies) to ensure that the OS loader (also the CLR loader for .NET assemblies) sensibly remains at the top of the lock hierarchy. This steadfast rule must be in place to maintain lock hierarchy. If some data is only accessed from a single thread, though, then lazy initialization may not require synchronization (synchronization is mandatory for C# static constructors and is the default for Lazy<T>
types). Note that Microsoft documentation breaks this sensible idea on lock hierarchy by recommending programmers call LoadLibrary
from lazy static constructors. Regardless of synchronization, modules with significant cross-cutting concerns should never lazily initialize, instead initializing at module load-time, or preferably initializing at compile-time if possible while having little to no dependencies. From purely a performance point of view, lazy initializers could introduce "measurable overhead" because the language or runtime must internally perform atomic or synchronized checks to decide whether or not the initializer needs to be run on every pass (this cost is at odds with the benefit of potentially never having to run the initializer if, for example, the application goes down a different code path or errors out early... I'm looking at you, Rust lazy_static
and once_cell
). With all these factors in mind, it can generally be safe to use a synchronized, lazily initializing global type as long as an application has a clear structure that ensures a lazy initializer routine will not depend on itself through some means (directly or indirectly), and that this thinking extends to subsystems that your code depends on (e.g. the OS loader).
The CLR loader, particularly the fact that it intentionally runs outside the OS loader, is a hack because only one of these two components can be at the top of the lock hierarchy, and since the OS loader starts first, it should take precedence. By placing itself higher in the lock hierarchy than the OS loader, the CLR becomes tightly coupled with the OS loader. Ideally, the CLR under C# should be able to, as a modular subsystem, safely abstract from the OS without worrying about low-level concerns within the native loader. In particular, it should ideally be possible for C# to use the same constructors and destructors as C++ because Microsoft has tightly integrated .NET into Windows, thus making it possible to accidentally utilize the technology when the programmer didn't intend to, such as via COM interop (there are likely some cases where the Windows API internally uses .NET through COM interop in an in-process server).
The Windows loader, in contrast to Unix-like loaders, is more vulnerable to correctness issues and deadlock or crash scenarios for a variety of architectural reasons. "The Root of DllMain
Problems" (or, more casually, "DllMain
Rules Rewritten") provides a fundamental understanding of DllMain
hurdles and why they exist. It improves on Microsoft's "DLL Best Practices", often referred to as the "DllMain
rules" informally. "DLL Best Practices" originally dates back to a technologically ancient 2006 Microsoft document providing guidance on what actions are safe to perform from DllMain
, as well as other module initializers and finalizers, or constructors and destructors running in the module scope of a library. The architectural reasons for DllMain
issues on Windows (specifically Windows NT) include:
- Windows uniquely positions the loader at the bottom of any external lock hierarchy
- This placement is completely backwards because the loader is the first thing to start in a process
- Indeed, it is not a deliberate design decision, but rather one made as an afterthought due to Windows' misuse of DLLs
- As a result, the threat of ABBA deadlock makes it unsafe to acquire any external lock (not used by NTDLL) from inside the loader without knowing how that lock is implemented
- Windows is the ultimate monolith
- The broadness of the Windows API (thousands of DLLs in
C:\Windows\System32
, including everything from file creation to WinHTTP) in combination with its lack of a clear separation between components leads to operating-system-wide dependency breakdown
- Despite Windows prioritizing libraries and shared processes over programs and small processes at the operating-system level, its library dependency infrastructure is significantly less robust and more tightly coupled than its Unix counterpart
- The Windows threading implementation meshes with the loader at thread startup and exit (
DLL_THREAD_ATTACH
and DLL_THREAD_DETACH
)
- The synchronization requirement this added to threads broke the library subsystem lifetime, which led to Microsoft condoning thread termination as a synchronization model and Windows leaving the process in an inconsistent state at process exit, thus breaking module destructors
- Despite Windows prioritizing multithreading over multiprocessing at the operating-system-level, its threading implementation is significantly less robust and more prone to deadlocks than its Unix counterpart
- The monolithic architecture of the Windows API may cause the loader's lock hierarchy to become nested within the lock hierarchy of a separate subsystem; if this nesting interleaves with another thread nesting in the opposite order, ABBA deadlock is the result
- The COM and loader subsystems exhibit tight coupling whereby Microsoft's implementation of COM may interact with the loader while holding the COM lock, an issue that becomes increasingly problematic due to the Windows API's extensive use of COM behind the scenes (including much of the Windows User API, Windows Shell, and WinHTTP AutoProxy to name a few)
- The heavy use of thread-local data throughout the Windows API can lock its users to the unspecified thread that loaded the library
- Windows kernel mode and user mode closely integrate (NT and NTDLL), whereas Unix began with modularity as a core value
- This value carried through to the formalization of Unix in the POSIX and C standards, and the System V ABI specification
- Windows overrelies on dynamic initialization and dynamic operations in general
- It is always best practice for robustness and performance to initialize statically (i.e. at compile time) over dynamically (using module initializers and finalizers including Windows
DllMain
) if feasible - Windows commonly requires dynamic initialization even for core system functionality, such as initializing a critical section
- The Windows Process Environment Block (PEB), along with its common use by getter functions like
GetProcessHeap
, artificially enforces dynamic initialization - The multithreading-first design of Windows can increase contention when accessing popular shared resources such as the process heap, which may require some Windows components dynamically create their own resources
- In contrast, POSIX data structures commonly provide a static initialization option, and Unix prioritizes multiprocessing through minimal and pluggable processes
- Unexpected library loading
- Inherently, delay loading may unexpectedly cause library loading when a programmer didn't intend it, thus leading to an array of potential issues that could deadlock or crash a process
- MacOS previously supported lazy loading until Apple removed it, likely due to scenarios where it becomes an anti-feature and any performance gains not being worth the trade-off
- The Windows API does dynamic library loading in situations that are inappropriate in the context of an operating system
- Windows dictates that creating a process can load libraries into the existing process
- The modern Universal C Runtime (UCRT) in Windows loads libraries at process exit
- Windows runtime libraries commonly ship poor thread-safe implementations that restrict concurrency
- Notable runtime components in Windows such as
atexit
registration and its callbacks are not designed with deadlock-free thread safety in mind (while runtime components are not directly part of the loader, their implementations may use or integrate with it)
- Historical library loader issues
- Poor ability for reentrancy during module initialization
- The Windows API's heavy reliance on dynamic library loading, especially past its intended use case of loading extension libraries, requires a loader with robust reentrancy capabilities
- Microsoft mostly built the legacy loader to be reentrant; however, how it performed module initialization was subject to crashes or correctness issues due to the loader's poor ability to enforce the correct order of operations when initializing modules upon being reentered (this issue was at the heart of the previous "
DllMain
Best Practices") - Starting with Windows 8, the loader maintains a dependency graph throughout its operation thus significantly resolving out-of-order module initialization problems that can occur when loading a library from
DllMain
, but there could potentially still be issues due to circular dependencies ("dependency loops") - The legacy loader could only walk the dependency graph while immediately collapsing it into a linked list thus giving the module initialization order (this list was only formed by walking the import address tables (IATs) at the start of a load and was not able to dynamically adjust to the reentrant case of someone calling
LoadLibrary
fromDllMain
)
- The Windows
GetModuleHandle
function was broken
- GetModuleHandle
fromDllMain
can be problematic because it assumes a DLL is already loaded when it may not be yet or has only partially loaded (since the loader cannot know in advance that a given DLL depends on another DLL if it does not dynamically link to it)
- With the release of an
Ex
function and other patchwork, Microsoft has mostly fixed this issue, but it is still not as robust as its GNU loader counterpart or POSIX, which only defines dlopen
with no flag for this functionality
- Backward and forward compatibility
- The loader runs
DllMain
code under that DLL's activation context (LDR_DATA_TABLE_ENTRY.EntryPointActivationContext
) for application compatibility, if it has one, which could cause unforeseen issues due to conflicting requirements laid out by another activation context
The result of these architectural facts and faults, among other consequences, is that subsystems are often unable to construct and destruct safely. The outcomes of this simple fact are far-reaching, with impact that can be seen throughout all facets of the operating system. Subsystems are often left employing delicate hacks (including, if you are Microsoft, making subsystems more reliant on the kernel and broader system, leading to tighter coupling), introducing anti-patterns, and falling back on design decisions that are deficient in performance to work around what is a fundamental failing of the operating system. Alternatively, a subsystem could unknowingly perform an unsafe action in its module routines, which may work until a rare but possible race condition is met, a single point of failure that constantly challenges the soundness and robustness of Windows with every additional module. On an ad-hoc basis, Microsoft adds blockers to ensure that common actions which can fail or may be risky from a Windows module routine cannot proceed. However, the checks necessary to implement these blockers are only possible due to the tight coupling that is prevalent in Windows, can add up to have a negative effect on performance when done at run-time, and can never create a fully correct system as long as the root issues persist. It is unfortunate that while a DLL is the executable unit Windows is architected around, DllMain
exists as one of the most fragile parts of Windows.
With a newfound understanding of DllMain
woes and the greater perspective gained from admiring how Unix-like operating systems get it right, you can reason about the safety of performing a given action from DllMain
, module initializers and finalizers, or other constructors and destructors running in the module scope of a library.
Alpha Notice: This work is currently considered to be of alpha quality. The document, including this section, is incomplete: there are still strong arguments I need to add and some sections of this document could probably be written better.
A DLL or library is modular code that processes can load to use the contained functionality. A linker can connect libraries together to create dependencies between them. Defining dependencies between libraries requires careful management of the dependency tree to avoid creating conflicts such as circular dependencies.
The way the Windows operating system scatters functionality across multiple libraries leads to the uncontrolled creation of dependencies. In particular, DLLs on Windows lack a clear separation of components, causing nearly everything to depend on everything else (if not directly, then by proxy through a dependent DLL). It's this lack of organization between Windows libraries that dooms what a library is supposed to be and transforms the Windows API into a monolithic beast.
As a hack to work around this root issue, Microsoft (ab)uses the "delay loading" Windows feature to stop dependency loops. However, delay loading, or library lazy loading, is an inherently broken feature at the operating system level. Thus, delay loading only moves the issue, turning it into an equally bad but more manageable problem. This delay loading hack is pervasive throughout virtually all parts of the Windows API. We will now give a quick walkthrough of common DLLs, core to Windows' functioning, which exhibit the described hack:
> dumpbin /imports C:\Windows\System32\kernel32.dll
...
Section contains the following delay load imports:
RPCRT4.dll
00000001 Characteristics
00000001800B7A48 Address of HMODULE
00000001800BF000 Import Address Table
000000018009D0E0 Import Name Table
000000018009D268 Bound Import Name Table
0000000000000000 Unload Import Name Table
0 time date stamp
0000000180025D2D 16C RpcAsyncCompleteCall
0000000180025D09 211 RpcStringBindingComposeW
0000000180025CF7 176 RpcBindingFromStringBindingW
0000000180025C6C 16E RpcAsyncInitializeHandle
0000000180025D1B 2E I_RpcExceptionFilter
0000000180025D3F 186 RpcBindingSetAuthInfoExW
0000000180025D87 94 Ndr64AsyncClientCall
0000000180025D63 16B RpcAsyncCancelCall
0000000180025D75 174 RpcBindingFree
0000000180025D51 215 RpcStringFreeW
...
The most common Windows DLL after NTDLL.dll
, KERNEL32.dll
, contains one of these hacks for loading RPCRT4.dll
, the RPC runtime. RPCRT4.dll
immediately depends on KERNEL32.dll
, and Microsoft chose KERNEL32.dll
as the DLL to break the immediate dependency loop. Additionally, KERNEL32.dll
delays the loading of its RPCRT4.dll
dependency to ensure the RPC runtime and its dependencies aren't unnecessarily loaded into all processes that load KERNEL32.dll
(which is all standard Windows processes, not including pico processes).
Worse, KERNEL32.dll
immediately depends on KernelBase.dll
, which in turn depends on ntdll.dll
starting with Windows 7. In modern Windows, we can see KernelBase.dll
is stuffed with delay loading hacks that lead back to an astounding 18 DLLs including: KERNEL32.dll
(a direct circular dependency), advapi32.dll
, apisethost.appexecutionalias.dll
, appxdeploymentclient.dll
, bcryptPrimitives.dll
, capauthz.dll
, daxexec.dll
, deviceaccess.dll
, efswrt.dll
, feclient.dll
, gpapi.dll
, mrmcorer.dll
, ntdsapi.dll
, sechost.dll
, twnapi.appcore.dll
, user32.dll
, windows.staterepositoryclient.dll
, windows.staterepositorycore.dll
, and windows.storage.dll
.
Here is the same hack in a couple more DLLs central to the Windows API, including the core DLL of the User API (which encompasses many other Windows APIs):
user32.dll Delay Loads:
api-ms-win-power-setting-l1-1-0.dll -> powrprof.dll
api-ms-win-power-base-l1-1-0.dll -> powrprof.dll
api-ms-win-service-private-l1-1-0.dll -> sechost.dll
MSIMG32.dll
WINSTA.dll
ext-ms-win-edputil-policy-l1-1-0.dll -> edputil.dll
And some more, this time in the Advanced Windows 32 Base API DLL, used for security calls and to provide access to the Windows Registry:
advapi32.dll Delay Loads:
CRYPTSP.dll
WINTRUST.dll
CRYPTBASE.dll
SspiCli.dll
USER32.dll
CRYPT32.dll
bcrypt.dll
api-ms-win-security-lsalookup-l1-1-0.dll -> sechost.dll
api-ms-win-security-credentials-l1-1-0.dll -> sechost.dll
api-ms-win-security-credentials-l2-1-0.dll -> sechost.dll
api-ms-win-security-provider-l1-1-0.dll -> ntmarta.dll
api-ms-win-devices-config-l1-1-1.dll -> cfgmgr32.dll
Practically every DLL you look at in the Windows API is swamped with these delay loading hacks. Specifically, there are ~3000 DLLs (.dll
files) in C:\Windows\System32
(not including subdirectories). Of those approximately 3000 DLLs, some are "resource-only DLLs" (e.g. imageres.dll
), which can be excluded (I have not bothered though, since the figure is already staggering). By my measurement, this means over half of Windows DLLs (1663 exactly, within C:\Windows\System32
not including subdirectories) exhibit a delay loading hack. Note that this figure only includes DLLs that directly include a delay load, not DLLs that immediately depends on another DLL that includes a delay load. For a comprehensive list of affected DLLs, see the final output of the dumpbin-delay-loads.ps1
script.
The vast quantity of circular dependencies all throughout the DLLs that make up the Windows API breaks the vital and commonly ascribed modularity benefit of the DLL. This tight coupling leads to a variety of poor outcomes for the operating system and software running on it. For obvious reasons, it would be undesirable to load so many libraries for even the simplest "Hello, World!" class of applications. As such, Microsoft needed a remedy and decided to move the problem, instead of fixing it, with delay loading.
NOTE: Work on the Dependency Breakdown section is pending to separate the definition of lazy library loading from arguments against it in this document. The work here is INCOMPLETE, ALPHA QUALITY, and I still have my strongest arguments to add.
For further research on Windows' misuse of DLLs, see here.
Identifying issues is important, but it's even more valuable to pair that with ideas for solutions. So, let's come up with some actionable solutions for the root issue we explored here!
With Windows 7 came the introduction of API Sets. API sets are an application compatibility mechanism designed as an alternative to activation contexts for finding the correctly versioned DLL to load (i.e. to help in the fight against DLL Hell). API sets are promising because they neatly sort the Windows API into smaller and more modular units.
As it stands, an API set is merely an alias that maps to a real DLL on disk. In this solution, we propose that API set DLL names become or more closely imitate real DLLs. Perhaps they could be called "virtual DLLs". With enough granularity, the hope is that Windows DLLs would naturally lose their circular dependencies, since many of those cycles exist only because two DLLs each use separate parts of the same monolithic DLL.
A caveat to this solution exists if the circular dependency is formed because a specific API in one "real DLL A" requires functionality from "real DLL B" while "real DLL B" also requires functionality from "real DLL A" within the scope of that API call. In this case, no level of API granularity could break the circular dependency. Before reaching per-API granularity, there could also be other practical refactoring limitations due to the underlying implementation.
In cases where increasing the granularity in the set of APIs provided by a library fails to remove circular dependencies, it may be warranted to reorganize by merging multiple libraries into one. Generally, it is always possible to remove a dependency cycle by decoupling subsystems and employing cycle breaking strategies.
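To make the cycle-breaking discussion concrete, here is a minimal sketch of detecting a dependency cycle with a depth-first search, the same check a loader or build tool could run to verify that a library graph is actually a DAG. The module names below are invented for illustration and are not a real Windows import map:

```python
# Hypothetical DLL dependency graph (names illustrative, not real import data)
deps = {
    "kernelbase": ["ntdll", "kernel32"],   # circular: kernel32 -> kernelbase
    "kernel32":   ["ntdll", "kernelbase"],
    "ntdll":      [],
    "user32":     ["kernelbase"],
}

def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if the graph is a DAG."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, []):
            if color[dep] == GRAY:               # back edge: cycle found
                return stack[stack.index(dep):] + [dep]
            if color[dep] == WHITE:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        color[node] = BLACK
        stack.pop()
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

print(find_cycle(deps))  # ['kernelbase', 'kernel32', 'kernelbase']
```

A loader that enforces a DAG could run this check at link or load time and refuse (or at least flag) any library whose import closure contains a cycle.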
With the realization that the Windows API more closely resembles a dog chasing its own tail, planets orbiting an NT kernel star, or simply a web of libraries more than it does a directed acyclic graph (DAG) of modular subsystems, reimplementing large parts of the Windows API to use a different backend becomes a viable solution.
Wine is already making progress here by translating DirectX into Vulkan (DXVK) and implementing Windows audio/video APIs on top of GStreamer or FFmpeg. This solution also comes with other benefits, such as typically improving performance and efficiency over the Microsoft Windows alternatives.
Solving the problem we explored here will likely require a multifaceted resolution involving all three solutions. Once the Windows API is organized, it will be up to Microsoft Windows developers to remain conscientious about the dependencies their subsystems create. The vastness of the Windows API doesn't make coming up with the best solution to its circular dependency problem easy, but I maintain that it is possible and that application compatibility can come along for the ride.
NOTE: Work in progress.
A DLL or library is modular code that processes can load to use the contained functionality. If this were the extent of how Windows, like any other operating system, utilized DLLs, then all would be well. However, Windows' usage of DLLs goes far beyond their intended use. Introducing: the DLL host.
In Windows, DLL hosts are programs that serve only to host other DLLs that provide the core functionality of an application or service. Common DLL hosts include rundll32.exe
, svchost.exe
, taskhostw.exe
, and COM surrogates such as dllhost.exe
.
DLL hosts are prevalent throughout Windows, with svchost.exe alone accounting for over half (55%, or 70/126 processes by my measurement) of all processes on the system upon booting up Windows.
Clearly, Windows really likes DLL hosts and specifically shared service processes. But why? No other operating system has the concept of a DLL host and they seem to get along just fine.
Well, for a start, we know processes are more expensive on Windows than on Unix systems. Looking at the Private Bytes consumed by even the most minimal of processes in Process Explorer verifies this to be the case:
AggregatorHost.exe | 912K
smss.exe | 1072K
svchost.exe | 1284K
svchost.exe | 1292K
svchost.exe | 1384K
Even the smallest processes are eating up about 1 MiB or more of memory each! The plethora of highly interconnected DLLs making up the Windows API would also certainly contribute to slower process start times.
Going further, another reason for shared services could be that being in the same process allows for faster communication between similar services (especially since the base overhead of a Windows system call, as well as the number of system calls required, is generally known to be higher on Windows than on Unix systems). This was the same motivation for in-process COM servers. By running `tasklist /svc | findstr ,`, we can find shared service hosts containing multiple services:
lsass.exe 860 KeyIso, SamSs, VaultSvc
svchost.exe 984 BrokerInfrastructure, DcomLaunch, PlugPlay,
Power, SystemEventsBroker
svchost.exe 928 RpcEptMapper, RpcSs
svchost.exe 2672 BFE, mpssvc
svchost.exe 456 OneSyncSvc_59a4b,
PimIndexMaintenanceSvc_59a4b,
UnistoreSvc_59a4b, UserDataSvc_59a4b
That's interesting: out of all the shared service processes, only five (including lsass.exe) are actually hosting multiple services in one process. This is in stark contrast to the large number of unrelated services that previous Windows versions packed into one process:
1280 svchost.exe Svcs: AudioSrv,BITS,CryptSvc,Dhcp,dmserver,ERSvc,EventSystem,helpsvc,lanmanserver,lanmanworkstatio
n,Netman,Nla,RasMan,Schedule,seclogon,SENS,SharedAccess,ShellHWDetection,srservice,TapiSrv,Themes,W32Time,winmgmt,wuause
rv,WZCSVC
The simplest explanation for Microsoft no longer packing many services into one process like they used to is valuing the robustness of a separate virtual address space for each service over the expense that comes with it. One megabyte of memory, while not nothing, isn't nearly as valuable as it was when the average system sported far fewer gigabytes of RAM than it does today. As a result, Windows shared services mostly appear to be a relic of the past and I wouldn't be surprised if Microsoft does away with them entirely at some point. Put another way, using a separate process for each service brings Windows closer to a microservice architecture because "one component's failure won't break the whole app" (broadly speaking; the term microservice can take on more meaning in the cloud context).
Beyond robustness, multiple DLLs operating independently in a process with their own threads could actually hurt performance by causing unnecessary contention on in-demand resources like the process heap lock. Windows DLLs or threads sometimes use a private or local heap to help with this issue (see heaps in WinDbg with the !heap command). However, Windows API calls that implicitly create heap allocations often make full heap separation unattainable in practice. Concerns regarding process heap lock contention are especially pertinent because the Windows NT Heap implementation doesn't implement any measure to reduce blocking like the glibc heap does with per-thread arenas (and Microsoft's attempts at implementing a more concurrent and performant heap, like the Segment Heap, have not worked out).
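As a point of comparison, the per-thread arena idea glibc uses can be sketched in a few lines. This is a toy model, not glibc's actual implementation (glibc caps the arena count and reuses arenas, which the sketch skips): each thread lazily gets its own "arena" with its own lock, so allocations from different threads never contend on a single process-wide heap lock:

```python
import threading

class Arena:
    """A trivial bump 'allocator': each arena has its own lock, so threads
    mapped to different arenas never contend with each other."""
    def __init__(self):
        self.lock = threading.Lock()
        self.offset = 0

    def alloc(self, size):
        with self.lock:   # only contended by threads sharing this arena
            addr = self.offset
            self.offset += size
            return addr

_tls = threading.local()

def thread_arena():
    # Lazily give each thread its own arena on first allocation.
    if not hasattr(_tls, "arena"):
        _tls.arena = Arena()
    return _tls.arena

def malloc(size):
    return thread_arena().alloc(size)
```

With this layout, a hot allocation path only ever takes a lock that its own thread (or a small set of threads) uses, which is precisely the blocking-reduction measure the NT Heap lacks.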
Another victim of the DLL host whose importance cannot be overstated is ease of debugging. There will always be bugs, so it's crucial to be proactive in maximizing correctness and minimizing complexity so bugs can be fixed as quickly as they're spotted. A DLL host stands in the way of debugging for multiple reasons. Most obviously, a shared address space makes determining the source of a memory corruption bug challenging if multiple components or services operate in a single address space. But also, Microsoft won't be able to send Windows Error Reporting (WER) reports for crashes because crash reporting is tracked by the EXE hosting the DLL (as well as other notable concerns like making application compatibility more difficult).
Shared service processes use service DLLs. Since a service DLL exists solely to allow multiple services to exist in one process, one would not expect DLLs to take a dependency on a service DLL. A service DLL is more like an EXE in that svchost.exe delegates control of the application lifetime to it. So, depending on a DLL that works like it's an EXE is surely a recipe for circular dependencies, which are bad. Alas, upon searching, I did find some DLLs depending on service DLLs (this search only covering C:\Windows\System32, not including subdirectories).
Once again, DLLs provide a false promise of modularity. Bad Microsoft—bad.
Windows will load and execute a DLL from practically anywhere, which, as you can imagine, does not fare well for the security of the operating system and frequently introduces security vulnerabilities that could never exist on other systems.
See here for more information.
Today, Windows still does not support per-process address space layout randomization (ASLR) of libraries. Its absence effectively makes this crucial exploit mitigation useless for defending against privilege escalation, including sandbox escape (e.g. from a web browser), on Windows. This weakness markedly tips the scales in favor of the attacker (e.g. in a ROP attack).
This requirement exists because of how Windows works: I believe the operating system's heavy usage of shared memory, along with historical reasons, mandates that all image mappings be at the same address in virtual memory across processes.
See here for more information.
Microsoft confused memory-mapped files with libraries, thus giving us the resource-only DLL.
Turning a pointer into a search through a lookup table for that pointer is a diabolical level of bloat.
See here for more information.
The Windows loader searches for DLLs to load in a vast (and growing) number of places. Strangely, Windows uses the PATH environment variable for locating DLLs as well as programs (the latter being similar to Unix-like systems). Microsoft's decision to retain the current working directory ("the current folder") in this list of places (this one originates from backward compatibility with CP/M DOS) is an accident, or worse, a security incident, waiting to happen, particularly when running applications from untrusted CWDs in a shell (e.g. CMD or PowerShell). This Microsoft documentation still doesn't cover all the possible locations, though, because while debugging the loader during a LoadLibrary
, I saw LdrpSendPostSnapNotifications
eventually calls through to SbpRetrieveCompatibilityManifest
(this isn't part of a notification callback). This Sbp
-prefixed function searches for application compatibility shims in SDB files which may result in a compat DLL loading. Also to do with application compatibility, WinSxS and activation contexts (DLLs in C:\Windows\WinSxS
) exist to load versioned DLLs typically based on the application's manifest (these are usually embedded in the binary). A process calling the CreateProcess
family of functions or WinExec
is subject to loading AppCert DLLs. When secure boot is disabled in Windows 8 or greater, AppInit DLLs can load DLLs into any process. The plethora of possible search locations contributes to DLL Hell and DLL hijacking (also known as DLL preloading or DLL sideloading) problems in Windows, the latter of which makes vulnerabilities due to a privileged process accidentally loading an attacker-controlled library more likely (I've personally seen how common these and similar Windows-specific vulnerabilities are, especially in the line-of-business applications that enterprises use).
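To illustrate why the current working directory's position in the search list matters, here is a toy first-match-wins resolver. The paths and search orders are invented for the example, and the real Windows rules involve many more locations and knobs (such as SafeDllSearchMode):

```python
from pathlib import PurePosixPath

def resolve_dll(name, search_path, file_exists):
    """Return the first directory in search_path containing name; first match wins."""
    for directory in search_path:
        candidate = str(PurePosixPath(directory) / name)
        if file_exists(candidate):
            return candidate
    return None

# A tiny fake filesystem: the real DLL plus one planted by an attacker
# in an untrusted current working directory.
system32 = {"/windows/system32/version.dll"}
cwd_planted = {"/home/untrusted/version.dll"}
fs = system32 | cwd_planted

# CWD searched ahead of the system directory => hijack succeeds
order_unsafe = ["/home/untrusted", "/windows/system32"]
print(resolve_dll("version.dll", order_unsafe, fs.__contains__))
# /home/untrusted/version.dll  (attacker's DLL wins)

# System directory searched first => hijack fails
order_safe = ["/windows/system32", "/home/untrusted"]
print(resolve_dll("version.dll", order_safe, fs.__contains__))
# /windows/system32/version.dll
```

The entire class of hijacking bugs comes down to this ordering question: any attacker-writable directory that appears before the trusted one lets the attacker's file win the race.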
The GNU/Linux ecosystem differs due to system package managers (e.g. APT or DNF). All programs are built against the same system libraries (this is possible because all the packages are open source). Proprietary apps are generally statically linked (typically using musl and not glibc as the libc implementation) or come with all their necessary libraries. The trusted directories for loading libraries and the configuration file for adding directories to the search path can be found in the ldconfig
manual. Beyond that, you can set the LD_LIBRARY_PATH
environment variable to choose other places the loader should search for libraries and LD_PRELOAD
or LD_AUDIT
to specify libraries to load before any other library (including libc
) with the difference being that libraries specified by the latter run first and can receive callbacks to monitor the loader's actions. Loading libraries based on environment variables is a default feature that may optionally be turned off during compilation (and is always disabled for setuid
binaries). Binaries can include an rpath
to specify additional run-time library search paths.
On Windows, statically linking system DLLs is unsupported and a copyright infringement because the Windows software license doesn't permit bundling Microsoft's libraries with your own application. Bringing your own system DLLs (e.g. from a different Windows version) is also unsupported because the internals of how they interact with the operating system and other tightly coupled DLLs can change. Microsoft keeps userland backward compatibility by ensuring Windows system libraries stay the same in their APIs and relevant internals. Since static linking and bringing your own libraries is a strong suit of Unix systems, I suggest capitalizing on that advantage by employing this linking or library distribution approach for third-party, business, enterprise, or proprietary applications (of course, some things like an audio client and server pair still need to communicate compatibly, but that should easily be solved by versioning the protocol, and because Linux has now converged on PipeWire for audio and video... also, great libraries like SDL have existed for a long time now). User-mode API stability on GNU/Linux has historically been a problem for adoption (e.g. since glibc has no problem with breaking backward compatibility) but it is a non-issue when taking a typical Unix system's other strengths into account. Linus Torvalds is very adamant about the kernel not breaking userland and thinks shared libraries are not good, anyway. As a result, static linking or shipping all the required dependencies with an application is indeed a great solution for companies that want to bring proprietary software, especially professional applications that people need for their work, to all platforms, including Linux. There is no reason to depend on any Microsoft/Windows-only APIs or frameworks when superior vendor-neutral options exist.
In addition, companies and developers are already familiar with managing/upgrading all their own dependencies as is commonly done to an extreme extent in the popular JavaScript ecosystem.
On Windows, LoadLibrary
returns an officially opaque HMODULE
, which is implemented as the base address of the loaded module. Windows searches for this module handle in a lookup table to obtain a pointer to that module's LDR_DATA_TABLE_ENTRY
. A pointer was made for pointing, so this extra layer of indirection amounts to nothing more than bloat on top of a pointer with no benefit.
In POSIX, dlopen
returns a symbol table handle. On my GNU/Linux system, this handle is a pointer to the object's own link_map
structure located in the heap. (The returned handle is opaque, meaning you must not access the contents behind it directly since they could change between versions and it is implementation-dependent; instead, only pass this handle to other dl*
functions.)
How cool is that? That's like if Windows serviced your LoadLibrary
request by handing you back a pointer to the module's LDR_DATA_TABLE_ENTRY
!
Microsoft's reasoning behind the return type of LoadLibrary
stems from the 16-bit Windows API (going back to Windows 1.0 when the function was introduced), where libraries became conflated with data files or memory-mapped files meaning "libraries" could exist for the sole purpose of containing data with no code. In the first release of Windows NT (i.e. Windows NT 3.1), Microsoft carried forward this unique mistake while also adding an extension for specifying that a "library" must load only as a memory-mapped file and how to open this file. Consequently, the modern Windows loader is stuck with maintaining a red-black tree at ntdll!LdrpModuleBaseAddressIndex
for speeding up base address ➜ LDR_DATA_TABLE_ENTRY
lookups (the legacy Windows loader slowly iterated the InLoadOrderModuleList
linked list in the PEB to do these lookups). This indirection is a contributing factor to slow process creation on Windows (simply set a read watchpoint on ntdll!LdrpModuleBaseAddressIndex during process startup to see what a hot data structure this is). The performance of Windows delay loading is also negatively affected by this indirection, and the synchronization required to access the shared data structure, because ntdll!LdrResolveDelayLoadedAPI
calls ntdll!LdrpFindLoadedDllByHandle
every time it runs.
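The cost of the handle indirection can be modeled with a toy loader. The sketch below uses hypothetical names, and a sorted list stands in for the loader's red-black tree (both give logarithmic lookup). It contrasts an HMODULE-style handle, which pays for a base-address search on every use, with a dlopen-style handle that simply is the entry:

```python
import bisect

class ModuleEntry:
    def __init__(self, name, base):
        self.name, self.base = name, base

class Loader:
    """Toy module table: a sorted list of base addresses stands in for
    the base-address index (ntdll!LdrpModuleBaseAddressIndex is a
    red-black tree; both are O(log n))."""
    def __init__(self):
        self._bases = []    # sorted base addresses
        self._entries = {}  # base -> ModuleEntry

    def load(self, name, base):
        entry = ModuleEntry(name, base)
        bisect.insort(self._bases, base)
        self._entries[base] = entry
        return base         # HMODULE-style handle: just the base address

    def entry_from_handle(self, hmodule):
        # Every use of the handle pays for this extra lookup.
        i = bisect.bisect_left(self._bases, hmodule)
        if i < len(self._bases) and self._bases[i] == hmodule:
            return self._entries[hmodule]
        return None

loader = Loader()
hmod = loader.load("example.dll", 0x7FF600000000)

# Windows-style: handle -> search -> entry
entry = loader.entry_from_handle(hmod)
assert entry.name == "example.dll"

# dlopen-style: the handle *is* (a pointer to) the entry; no search needed
handle = entry
assert handle.name == "example.dll"
```

In the dlopen-style case the lookup structure still exists for other purposes, but the common path of "I have a handle, give me the module" costs a pointer dereference instead of a tree walk under a lock.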
A reasonable person could expect that given Windows' focus on dynamic loading of executable modules—unlike the prevalent static linking of the time—it would have ensured a solid design for its core dynamic loading API. Especially since Multics had already pioneered the idea of dynamically linking libraries, used memory-mapped files to share data, and supported dynamic loading years before Microsoft was even founded. Indeed, the module handle implementation quirk in the library loader exemplifies an avoidable misstep—one that, like most Windows API oversights, exists as an outcome of Microsoft's expedient development style when creating their core technologies.
For new code, Microsoft could fix this issue by introducing a new set of library loader functions including LibraryOpen
and LibraryClose
(these names would also be more accurate since loading a library simply increases its reference count if a given library is already loaded); however, the Windows loader would internally still have to expend effort maintaining the legacy data structures for compatibility with the older library loader functions, at least in a compatibility mode. These are changes that should have happened in the transition to Windows NT, with the old LoadLibrary
, FreeLibrary
, and other functions being designated as existing only for compatibility with 16-bit Windows (as Microsoft does in other places), but there is no time like the present.
An excerpt from Windows Internals: System architecture, processes, threads, memory management, and more, Part 1 (7th edition) states this regarding the ntdll!LdrpModuleBaseAddressIndex
data structure (and ntdll!LdrpMappingInfoIndex
):
Additionally, because lookups in linked lists are algorithmically expensive (being done in linear time), the loader also maintains two red-black trees, which are efficient binary lookup trees. The first is sorted by base address, while the second is sorted by the hash of the module’s name. With these trees, the searching algorithm can run in logarithmic time, which is significantly more efficient and greatly speeds up process-creation performance in Windows 8 and later. Additionally, as a security precaution, the root of these two trees, unlike the linked lists, is not accessible in the PEB. This makes them harder to locate by shell code, which is operating in an environment where address space layout randomization (ASLR) is enabled.
While the message on performance is a true and prudent point to make, I also find that statement alone lacks relevant perspective on the fact that ntdll!LdrpModuleBaseAddressIndex
only exists to begin with as a workaround for Microsoft's blunder with the LoadLibrary
function API. The point regarding security is dubious because if the module linked lists are already in the PEB, and must remain there indefinitely for backward compatibility since Microsoft chose to share one of these lists in the public winternl.h
header, then excluding the red-black trees has no effect because security comes down to the lowest common denominator. The background information on trouble that arises from module linked lists residing in the PEB is nice (of course, there are a variety of ways to find other modules in the process but those methods would be a bit "harder" and likely not universal). Again though, there is a more relevant point to make that is not addressed by the book especially since it does cover the associated Windows history in some places, just not here.
Possible MT-safe approaches:
- Per-library
- Per-routine
Goal: Deadlock avoidance
It should, in theory, be doable to implement an MT-safe synchronization strategy for the native loader's library initialization stage similar to what the CLR loader has with per-instance (lazy) static
protection. Although, a distinction between the C# static
type and library routines is that there is also library deinitialization to account for in the latter case (and the loader thread routines on Windows).
The first part of this plan is decoupling library mapping and snapping from library initialization. These two pieces would take place under separate locks that avoid nesting with each other: a mapping and snapping lock (like what already exists in the Windows loader with ntdll!LdrpWorkCompleteEvent
) and the MT-safe synchronization with locks at a per-library or per-routine granularity.
During library initialization, the loader would acquire the lock for each library/routine, then run the initialization for the library/routine, then mark the library/routine as initialized, before releasing that lock.
The increased granularity in locking would overall make it less likely for lock order inversion to occur and reduce loader lock contention.
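A minimal sketch of that scheme, assuming a clean dependency DAG, uses one lock plus an initialized flag per library. The library names and initializers below are invented for the example:

```python
import threading

class Library:
    def __init__(self, name, deps, initializer):
        self.name = name
        self.deps = deps              # libraries this one depends on
        self.initializer = initializer
        self.lock = threading.Lock()  # per-library initialization lock
        self.initialized = False

def initialize(lib):
    """Initialize dependencies first (DAG order), each under its own lock."""
    for dep in lib.deps:
        initialize(dep)
    with lib.lock:
        if not lib.initialized:       # checked under the lock, so it runs once
            lib.initializer()
            lib.initialized = True

order = []
ntdll = Library("ntdll", [], lambda: order.append("ntdll"))
kernel = Library("kernel", [ntdll], lambda: order.append("kernel"))
app = Library("app", [kernel, ntdll], lambda: order.append("app"))

initialize(app)
print(order)  # ['ntdll', 'kernel', 'app'] -- each initializer runs exactly once
```

Because each lock is held only around one library's initializer and dependencies are initialized before their dependents, two threads loading disjoint subtrees of the DAG never block each other, which is exactly the contention reduction described above.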
Problems & Solutions:
- Circular dependencies (root issue)
- A directed acyclic graph (DAG) formation is a must-have for any MT-safe library and deinitialization system to be workable.
- The tight coupling of Windows libraries would significantly reduce the utility of even a per-library MT-safe initialization mechanism. A hypothetical mechanism of this kind would likely lock each library as it initializes, starting from the base or root library, then unlock each library once it is in an initialized state. A circular dependency, which in Windows typically manifests as a delay loading hack, would nest initializer locks if the delay loading accidentally gets triggered. If enough delay loads occur like this, our MT-safe synchronization would degrade into something no better than a broadly serializing lock.
- Circular dependencies would mess with the order in which we acquire locks because everything could depend on everything else, thus we get lock order inversion
- If there were a coherent dependency DAG then we could initialize as expected with the root dependencies first, and if two DLLs are equally root or at the same level then we could start by initializing the DLL with the smallest base address first, or perhaps whichever DLL comes first in the alphabet (so ASLR does not change the ordering, because determinism is preferable in case one library initializer does something really funny), to naturally give the same initialization order for equal libraries
- If a DLL or individual routine depends on itself (directly or indirectly) then that's a big problem because it is already initializing, so it's not safe to simply reenter an initialization routine (since it is custom code, where would we enter?). If we have per-routine granularity then the best we could do is optimistically let that routine stay uninitialized for now and continue with initializing the rest of the routines (it's still not fully safe depending on what the partially initialized routine has yet to initialize, but it could work).
- Library initializer nesting the mapping and snapping lock due to performing a dynamic library load
- If nesting does occur because a library initializer routine dynamically loads a library, then the mapping and snapping lock should be at the bottom of the lock hierarchy. Additionally, the start of a library load should check whether it's being nested inside a library mapping and snapping operation by some detoured or hotpatched code and, if so, fail in order to enforce the lock hierarchy.
- Ideally, dynamic loading (e.g. LoadLibrary) should not take place from a library initializer (Windows and ReactOS do this unconditionally in a few DLL_PROCESS_ATTACH routines, presumably to give some order to when circular dependencies are loaded, so these hacks will have to cease)
- If library loads do take place from a library initializer, though, then they should be manageable because we ensure the mapping and snapping lock is completely unable to nest while a library initialization lock is held. We just have to make sure code patchers do not try doing anything unwise.
- Library load and free per-library locking requirement
- If per-routine MT-safety is to be implemented, there will still have to be a per-library lock to ensure a library is never initializing as it deinitializes (it could be a readers-writer lock that LoadLibrary acquires in read/shared mode and FreeLibrary acquires in write/exclusive mode)
- Library deinitializers must run in the opposite order of library initializers, but if we lock and unlock before and after each individual library initialization/deinitialization in DAG formation then that is not a problem. The only unique danger would be a circular dependency between initialization and deinitialization routines due to some dynamic library load or free operation.
- Library free would start at the top of the given dependency chain. It would acquire the first per-library lock at the top of the chain. It would decrement the library's reference count. If it hit zero, it would deinitialize that library. We unlock the per-library lock. We repeat this process for dependencies of that node. We avoid a race condition by reference counting on a node-by-node basis so we can hold the per-library lock while decrementing the library's reference counter and then possibly deinitializing that library in one protected breath.
- Windows lacks a prerequisite for per-routine MT-safety
- The Unix loaders split initialization into separate routines in the .init_array section, called by the loader
- DllMain will have to be split up into four individual routines, one for each of its call reasons
- Especially necessary to improve deadlock avoidance when DLL_THREAD_ATTACH and DLL_THREAD_DETACH run on loaded DLLs
- Performance considerations
- From purely a performance perspective, such a mechanism would increase synchronization overhead. However, it comes with the upside of greater concurrency and could therefore improve overall performance depending on the workload.
- Parallelized library initialization: Per-library initialization could be parallelized in the case that one initializing library does not depend on another initializing library (if this extra feature were desired)
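The node-by-node reference counting described above for library free can be sketched as follows, again with invented library names; the deinitializer here is a stand-in that just records its run:

```python
import threading

deinit_order = []   # records the order deinitializers run in

class Library:
    def __init__(self, name, deps, refcount):
        self.name = name
        self.deps = deps              # libraries this one depends on
        self.lock = threading.Lock()  # per-library lock
        self.refcount = refcount

def free_library(lib):
    # Decrement and possibly deinitialize "in one protected breath" under
    # the per-library lock, then release references on dependencies only
    # if this library actually unloaded. This yields deinitialization in
    # the reverse of initialization order.
    with lib.lock:
        lib.refcount -= 1
        unloaded = lib.refcount == 0
        if unloaded:
            deinit_order.append(lib.name)   # stand-in for the deinitializer
    if unloaded:
        for dep in lib.deps:
            free_library(dep)

ntdll = Library("ntdll", [], refcount=2)          # referenced by kernel and app
kernel = Library("kernel", [ntdll], refcount=1)   # referenced by app
app = Library("app", [kernel, ntdll], refcount=1)

free_library(app)
print(deinit_order)  # ['app', 'kernel', 'ntdll']
```

Note that ntdll survives the first decrement (kernel's reference) and is only deinitialized once the last reference drops, with each decrement-and-maybe-deinitialize step happening atomically under that library's own lock.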
When all considerations are taken into account, I stand by MT-safe library initialization as a viable strategy for making this process more flexible and preventing deadlocks. Another benefit is the significant increase in concurrency for dynamic library loading calls, especially for module initializers that may run long (for non-initialization actions throughout the process lifetime, you can always spawn a thread, which is sufficiently performant as long as it's not waited on, then join back in the library destructor without any MT-safety being necessary... of course, DLL subsystem lifetimes are broken by the whole NtTerminateProcess
situation on Windows, though).
At the same time, there are reasons why some Unix loaders (e.g. musl) simply choose to forgo dynamic loading altogether, but this option is not on the table for Windows.
NOTE: This section contains incomplete work and is subject to change.
A thread is the smallest unit of execution managed by an operating system. It runs code independently, sharing the address space with other threads in the same process. A process controls the lifetime of its threads: when the process exits, so do all its threads. Thus, creating a new thread requires cooperation with the relevant subsystem or application in the process, typically the starting thread, so work can complete before the process exits.
How the Windows operating system creates threads causes the starting thread, particularly the main thread, to be unaware of other threads currently working in the process. It's this lack of awareness regarding other active threads, spawned internally by the Windows API, that dooms running code to being abruptly terminated mid-operation thereby resulting in an inconsistent or potentially corrupted state outside the process upon process exit.
As a hack to work around this root issue so module destructors within the process can run without crashing it, Microsoft forcefully kills all but the exiting thread (typically the main thread) at process exit initiation. In any case, a process is the container for threads, so process termination will cause forceful thread termination if threads do not operate within the scope of process lifetime.
WORK IN PROGRESS!
Unwilling to leave any problem unsolved, let's take it full circle by coming up with some potential solutions to the root issue we explored here!
-- I've solved this problem. Sharing my conclusion though will require monetary compensation as I'm unwilling to work for a trillion dollar tech company for free. --
NOTE: This section contains incomplete work and is subject to change.
Process exit on Windows is broken. In The Problem with How Windows Uses Threads, we talked about how it got this way and the immediate consequences. In this section, we discuss Microsoft's best efforts to salvage process exit as it pertains to the execution of module destructors.
As we covered, the inner workings of the Windows API can leave running threads in the process at the time of process exit. Process exit includes running module destructors to perform cleanup work, so if threads are still running while the process is being destroyed, an unpredictable crash by memory corruption would likely be the result.
Instead of addressing the root issue, Microsoft chose to mitigate the problem as best they could starting with NtTerminateProcess
. NtTerminateProcess
iterates through each thread in the process, abruptly terminating each one, except for the thread requesting process termination. Thus, the process is reduced to a single thread.
However, abruptly terminating threads is impossible to do safely, and Microsoft says as much in their own documentation. Terminating threads is unsafe because those threads could be modifying a shared resource, holding onto a lock, or doing some other important task at the time of termination, thereby leaving those resources in a corrupt state, orphaning locked synchronization mechanisms, or interrupting critical work. Thus, solving the first problem only transformed the issue and created a new problem.
Post `NtTerminateProcess` (with a first argument of `0`), if the process tries to wait on one of these orphaned locks then it would cause the process to hang open, never exiting. Microsoft would like to avoid this, so their second mitigation is for Windows API locks to check if the process is shutting down before waiting on them, and if so, forcefully terminating the process without calling the remaining module destructors. Here is how Windows implements these mitigations for a few synchronization mechanisms:
To begin with, the `LdrShutdownProcess` function (called by `RtlExitUserProcess`) sets `PEB_LDR_DATA.ShutdownInProgress` to true. Since `RtlExitUserProcess` acquires load/loader lock (i.e. `ntdll!LdrpLoadCompleteEvent` and `ntdll!LdrpLoaderLock`) before setting `PEB_LDR_DATA.ShutdownInProgress`, code run past this point also runs under load/loader lock. `PEB_LDR_DATA.ShutdownInProgress` covers all module destructors at process exit, including FLS callbacks, TLS destructors, and the `DLL_PROCESS_DETACH` routine of `DllMain` functions (which includes C++ destructors running at module scope, DLL `atexit` routines, and others).
For a critical section, upon calling `EnterCriticalSection` on a contended lock, the `ntdll!RtlpWaitOnCriticalSection` function checks if `PEB_LDR_DATA.ShutdownInProgress` is true. If so, the function jumps to calling `NtTerminateProcess`, passing the flag `-1` as the first argument, thereby forcefully terminating the process.
For a slim reader/writer (SRW) lock, upon calling `AcquireSRWLockExclusive` or `AcquireSRWLockShared` on a contended lock, the `ntdll!RtlpWaitCouldDeadlock` function checks if `PEB_LDR_DATA.ShutdownInProgress` is true. If so, that function returns true and `ntdll!RtlAcquireSRWLockExclusive` calls `NtTerminateProcess` with a first argument of `-1` to immediately kill the process.
As another minor mitigation, prior to calling `NtTerminateProcess` (with a first argument of `0`), Windows (specifically the `RtlExitUserProcess` function) acquires a couple of locks on top of load/loader lock. These include the PEB lock (`ntdll!FastPebLock`) and the process heap lock (via `ntdll!RtlLockHeap`), which works as a quick fix to at least ensure consistency of these core data structures for module destructors. These two locks are unlocked following `NtTerminateProcess`. All together, the locks are acquired in the following order, thus forming a lock hierarchy: load/loader lock ➜ PEB lock ➜ process heap lock. While Microsoft typically defines the loader as being at the bottom of any lock hierarchy, these locks are internal to NTDLL (which contains the loader), and I've confirmed that Microsoft is consistent with this order of lock acquisition throughout NTDLL and presumably other DLLs.
There are also other little things Windows will do following the initial `NtTerminateProcess`, like blocking thread creation with `CreateThread` at the kernel level or failing the thread pool API functions by checking `PEB_LDR_DATA.ShutdownInProgress` in user mode (since these functions are otherwise unaware that the threads in their pool have been killed).
With all mitigations applied, the fallout for DLL destructors in a Windows process is as follows:
The consistency of all data structures in use by the Windows API or your program, aside from the ones we specifically mentioned, becomes a gamble as to whether they are left in a corrupt state after running `NtTerminateProcess`. Common data structures that could be left in a corrupt state include private heaps or heaps using a custom allocator, CRT state (there are many locks here), internal `KERNEL32`/`KERNELBASE`/`NTDLL` state (e.g. various global list locks like the heaps list lock at `ntdll!RtlpProcessHeapsListLock` or the thread pools list lock at `ntdll!TppPoolpListLock`, and WIL locks part of `KERNELBASE`), and generally tons of other obscure locks. Even FLS state could be corrupted due to the `ntdll!RtlpFlsDataCleanup` function at process exit (after the initial `NtTerminateProcess`) acquiring the global FLS lock and per-FLS locks, thereby forfeiting the process before any `DLL_PROCESS_DETACH` destructors get the chance to run. Even calling a function as simple as `printf` or `puts` from a module destructor is unsafe because the CRT stdio critical section lock could be orphaned. Doing virtually anything inside the process is unsafe at this stage.
These inconsistencies form because the process is still operating when it calls `NtTerminateProcess`. As an example, here are all the threads that still exist following a `ShellExecute` on the main thread (after `ShellExecute` returns and we get back control flow):
. 0 Id: 1e88.398 ntterminateprocess_test_harness!test5
1 Id: 1e88.18f0 ntdll!TppWorkerThread (ntdll!LdrpWorkCallback)
2 Id: 1e88.1d78 ntdll!TppWorkerThread (ntdll!LdrpWorkCallback)
3 Id: 1e88.cf8 ntdll!TppWorkerThread (ntdll!LdrpWorkCallback)
4 Id: 1e88.1dec SHCORE!_WrapperThreadProc
5 Id: 1e88.8e0 ntdll!TppWorkerThread (SHCORE!ExecuteWorkItemThreadProc)
6 Id: 1e88.1b90 ntdll!TppWorkerThread (windows_storage!_CallWithTimeoutThreadProc)
7 Id: 1e88.1498 combase!CRpcThreadCache::RpcWorkerThreadEntry
8 Id: 1e88.1098 ntdll!TppWorkerThread (RPCRT4!LrpcIoComplete)
9 Id: 1e88.1f90 ntdll!TppWorkerThread (shared thread pool)
10 Id: 1e88.1158 SHCORE!<lambda_9844335fc14345151eefcc3593dd6895>::<lambda_invoker_cdecl>
Windows never joins these background threads back to the main thread or allows them to exit before process exit, as is best practice. But, as long as all these threads are guaranteed to stay waiting, then this configuration is workable; however, this is not the case. In particular, the `SHCORE!_WrapperThreadProc` thread is still actively working to shut down an in-process COM server (here we see `CoUninitialize` is left running, where it could still be processing outstanding messages and is currently cleaning up so it can shut down):
0:000> k
# Child-SP RetAddr Call Site
00 00000041`7f8ff218 00007ffe`9a12e939 win32u!NtUserGetProp+0x14
01 00000041`7f8ff220 00007ffe`9a12e843 uxtheme!CThemeWnd::RemoveWindowProperties+0xa5
02 00000041`7f8ff250 00007ffe`9a133b3e uxtheme!CThemeWnd::Detach+0x5f
03 00000041`7f8ff280 00007ffe`9e96ef98 uxtheme!ThemePostWndProc+0x4be
04 00000041`7f8ff360 00007ffe`9e96e8cc USER32!UserCallWinProcCheckWow+0x548
05 00000041`7f8ff4f0 00007ffe`9e9870c8 USER32!DispatchClientMessage+0x9c
06 00000041`7f8ff550 00007ffe`9f191374 USER32!_fnNCDESTROY+0x38
07 00000041`7f8ff5b0 00007ffe`9d042384 ntdll!KiUserCallbackDispatcherContinue
08 00000041`7f8ff638 00007ffe`9d541c47 win32u!NtUserDestroyWindow+0x14
09 00000041`7f8ff640 00007ffe`9d541bf4 combase!UninitMainThreadWnd+0x47 [onecore\com\combase\objact\mainthrd.cxx @ 323]
0a 00000041`7f8ff670 00007ffe`9d492b6a combase!OXIDEntry::CleanupRemoting+0x12c [onecore\com\combase\dcomrem\ipidtbl.cxx @ 1365]
0b 00000041`7f8ff6a0 00007ffe`9d492a7f combase!CComApartment::CleanupRemoting+0xd2 [onecore\com\combase\dcomrem\aprtmnt.cxx @ 1078]
0c 00000041`7f8ff830 00007ffe`9d492ec1 combase!ChannelThreadUninitialize+0x37 [onecore\com\combase\dcomrem\channelb.cxx @ 993]
0d 00000041`7f8ff860 00007ffe`9d4a85bd combase!ApartmentUninitialize+0x131 [onecore\com\combase\class\compobj.cxx @ 2680]
0e 00000041`7f8ff8e0 00007ffe`9d4a7a34 combase!wCoUninitialize+0x209 [onecore\com\combase\class\compobj.cxx @ 4037]
0f 00000041`7f8ff950 00007ffe`9d84c955 combase!CoUninitialize+0x104 [onecore\com\combase\class\compobj.cxx @ 3957]
10 00000041`7f8ffa40 00007ffe`9eb5be4e ole32!OleUninitialize+0x45 [com\ole32\ole232\base\ole2.cpp @ 557]
11 00000041`7f8ffa70 00007ffe`9d357374 SHCORE!_WrapperThreadProc+0x21e
12 00000041`7f8ffb50 00007ffe`9f13cc91 KERNEL32!BaseThreadInitThunk+0x14
13 00000041`7f8ffb80 00000000`00000000 ntdll!RtlUserThreadStart+0x21
At the center of a crucial Windows component, the Shell, lies one case where Windows fails to control a thread within the lifecycle of the application, instead leaving the thread running to be consumed by `NtTerminateProcess` if it doesn't happen to exit in time. Additionally, if a background thread is waiting on stimuli from outside the process to start working, then an external process giving work could cause currently waiting threads to start working at any time. The `SHCORE!<lambda_9844335fc14345151eefcc3593dd6895>::<lambda_invoker_cdecl>` thread meets this criterion because it is listening to a window object in a Windows message loop (confirmed by decompiling the code). The `combase!CRpcThreadCache::RpcWorkerThreadEntry` thread is waiting on a timer, which means it can also run at an arbitrary time past our initial `ShellExecute`. On Windows, waiting can be "alertable", thus allowing the kernel to run custom code (in the form of APCs) in a given process as it waits. The `SHCORE!<lambda_9844335fc14345151eefcc3593dd6895>::<lambda_invoker_cdecl>` thread's wait with `MsgWaitForMultipleObjectsEx` passes the alertable flag, which is another means through which this wait is not guaranteed. Generally, only a thread waiting on an intra-process synchronization mechanism is safe, because an inter-process synchronization mechanism like a Win32 event object could be set by another process without regard to the process lifecycle if the application never joins the thread back. Information on thread running states can be gathered in Process Explorer (although this information doesn't include whether the waiting thread is alertable, since that requires decompilation).
One practice I've noticed in the Windows API is for it to hold a lock even while waiting, presumably for a message. This means that even if a programmer were to generously wait for some time to increase the odds that threads belonging to the Windows API aren't working when `NtTerminateProcess` kills threads, locks will still become orphaned, thereby always leaving the process in an inconsistent state. After sleeping for an extended amount of time, over a minute, some background threads will wind themselves down to save resources (a stack memory allocation on Windows consumes at least 64 KiB of physical memory). If this winding down happens to occur when process exit is happening, then these threads will be killed mid-operation. In particular, thread pool workers call a thread pool's cleanup callback when winding down, thus allowing for its unexpected interruption. A significant time after the original call into a Windows API function, worker threads can also be found lingering for some time in the process without regard to process lifetime, for instance the `RPCRT4!PerformGarbageCollection` thread, which is likely a mechanism for cleaning up idle asynchronous connections among other resources for the RPC subsystem.
On Windows, the library subsystem lifetime for threads is broken by contention between `DLL_THREAD_DETACH` and `DLL_PROCESS_DETACH` synchronization, and is also impacted by circular dependencies, since a subsystem's worker thread could shut down while another subsystem still circularly depends on it.
As a result of all these contributing factors, threads can be terminated in the middle of what they are doing, which often leads to inconsistencies within the process. These in-process inconsistencies can unexpectedly make destructors unable to deinitialize and clean up safely, which easily results in some or all DLLs never having their destructors run, thus causing resources never to be cleaned up and graceful shutdown routines to go uncalled.
Even with all mitigations applied, it's possible for a process to hang open due to deadlocking on an orphaned synchronization mechanism in a module destructor. Specifically, hanging open is possible with a Win32 event object, which is a synchronization mechanism that the Windows API commonly utilizes throughout its operation. An event object has two unique properties that cause the anti-deadlock logic Microsoft employs for other synchronization mechanisms, like critical sections and mutex objects, to break down: no owning thread and inter-process synchronization support. An event object works in cases where the thread that reset the event has exited, by design. Combine this with event objects supporting inter-process synchronization (so, the entire process that reset the event can legally no longer exist) and handle inheritance causing events to easily be shared between processes, and implementing an anti-deadlock mitigation for event objects post-`NtTerminateProcess` becomes infeasible (not that Microsoft likely wants to, since user-mode event objects are inherited from the kernel and are supposed to work the same way). Microsoft makes no mention of this danger in their official documentation.
Likewise, other inter-process synchronization objects without an owning thread are vulnerable to this deadlock scenario. Custom synchronization mechanisms without an owning thread, likely designed using the Windows futex-like API, are also at risk. For instance, asynchronous C++ semantics on VC++ internally use `SleepConditionVariableSRW` (which is implemented with the futex-like `ntdll!NtWaitForAlertByThreadId` API) when getting the result of a future object (`std::future`) returned by an asynchronous routine (`std::async`), thus hanging the process indefinitely (I have tested this).
A fate worse than deadlock: spinlocks are yet another victim of `NtTerminateProcess`. A spinlock is useful for protecting hot data structures (often a flag, as Windows does in a few places) that observe short access times with the lowest overhead possible. An orphaned spinlock is a huge problem because attempting to acquire it will infinitely busy-loop the CPU, thereby completely degrading system performance on that CPU core and wasting power. Knowing about `NtTerminateProcess` does at least mean a programmer can mitigate deadlock concerns in their spinlock by checking `PEB_LDR_DATA.ShutdownInProgress` before spinning (although frequently performing this check on a hot code path like a synchronization mechanism could impact performance, especially since the branch has an immediate data dependency).
Generally, there also exist other edge cases involving Windows APIs with the ability to wait (e.g. a blocking read waiting forever for I/O that will never happen after `NtTerminateProcess` unexpectedly kills relevant threads, or corner cases with file locks), which can hang the process open.
One of the side effects of thread termination is that it causes the underlying thread object in the kernel to become signaled:
The state of the thread object becomes signaled, releasing any other threads that had been waiting for the thread to terminate. The thread's termination status changes from `STILL_ACTIVE` to the value of the `dwExitCode` parameter.
This side effect can result in incorrect behavior or crashes if a library destructor uses the synchronization provided by the thread object as a signal to operate on some state that the thread owned without further protection, or on a variable that the thread was only going to set before exiting. A common occurrence of this risk being realized is when getting the return value of a thread, here in cross-platform C++. The outcomes of this side effect could have security implications.
TODO: Create Rust example
Further, subsystems that employ the efficient thread-local storage strategy for thread synchronization typically work by waiting for threads to exit, then checking an accumulator to get the final result. Thread termination breaks this synchronization approach by causing threads to die before they have finished their part of the work, thus leaving the accumulator in an incorrect or partial state, which could result in memory corruption or incorrect behavior.
Generally, thread termination by `NtTerminateProcess` kills threads that some subsystems could still hold a reference to or believe to exist. Windows tries to compensate for this scenario in some cases by raising an exception when interacting with these subsystems. In particular, a predictable crash like this can occur upon trying to use thread pool internals, which raise a `ntdll!TppRaiseInvalidParameter` exception because the relevant functions inspect `PEB_LDR_DATA.ShutdownInProgress` before proceeding with typical operation. Beyond synchronization mechanisms, there are lots of places where Windows checks `PEB_LDR_DATA.ShutdownInProgress` (setting a read watchpoint here gleans a lot of information) prior to proceeding with typical operation, which is presumably to prevent undefined behavior that could result in a crash or incorrect behavior.
A memory access violation crash can occur if one thread tries to access the stack memory allocation of another thread that it assumes still exists (I've confirmed in WinDbg that this memory mapping is removed as soon as a thread exits, and the behavior for thread termination is documented to be the same). While not a typical action, walking between stacks (sometimes performed along with stack walking) is still commonly done by garbage collectors, anti-malware software, and in other specialized cases.
Beyond these scenarios, assuming applications respond correctly to Windows API failure statuses by failing closed, a crash should not occur. However, software vendors are not always known to robustly check for failure with each Windows API call. As a result, in practice, crashes can still occur in cases like ignoring the `WAIT_ABANDONED` mutex error status (only mutex objects can return this error status) or ignoring thread creation failure.
Here is a short list containing some of the out-of-process effects Process Meltdown could have on module cleanup and graceful shutdown routines:
- Thread termination could interrupt the process' communication with another process or endpoint leading to an inconsistent data stream when destructors run
- If a destructor reuses that connection to communicate again (e.g. to gracefully send a connection end message) then undefined behavior outside the process could result. Buffered I/O provides a great example: when communicating in a TLV protocol, for instance, if transmission is interrupted midway through sending a packet, then the server will never receive the full packet length of data, causing it to wait until timeout or forever, which could leak the connection. A destructor trying to send new data over that socket would break the application-layer message boundary, thus forfeiting or corrupting the connection.
- Destructors that were supposed to clean up external resources such as stored data including temporary files or registry entries may never run due to the process forcefully exiting early (e.g. in the orphaned lock case); these resources will be permanently and persistently leaked to the system
- On Windows, `atexit` routines are included in a library's module deinitialization code when registered from a DLL, and the typical use case for an `atexit` routine is to clean up a resource outside the process, like a file. Strictly speaking, calls to `atexit` should not be made by a library since its lifetime is not necessarily tied to the process lifetime; however, the pervasive "DLL host" Windows architecture, where everything is a library and programs only exist to load libraries, significantly increases this risk and the chance that all Process Meltdown risks are realized across the board
- System resources like kernel objects can leak if inconsistent process state following `NtTerminateProcess` leads to another process not closing handles because, for instance, the other process was waiting for some communication that it never got and may never receive
- If an event object was in use between multiple processes and the thread in a process that last put it into a waiting state gets terminated before setting it again, then indefinite hangs can occur in other processes
We talk in-depth about the immediate out-of-process impact of permanent thread interruption in "The Problem with How Windows Uses Threads".
When a process ends, the kernel will close any resources including handles left to kernel objects by the process. However, this procedure can be costly in terms of bookkeeping, state checks, and synchronization. So, if the system does not immediately need more resources, it's likely to work on higher priority tasks instead of reclaiming resources (Note: Pending verification on whether process cleanup is specially deprioritized in some way or if it's just the large amount of I/O operations that are done in series which is responsible for process cleanup slowness).
Windows killing threads, whether they were running or not, can leave lots of extra resources behind that would have been normally cleaned up by user-mode code (outside of module destructor cleanup) had threads not been abruptly ended. These resources take up physical memory until the kernel gets around to cleaning them up, which is a low priority task. Thus, priority inversion and thrashing can occur in particular when system resources are being utilized to their fullest.
When the resources left behind for the kernel to clean up include shared resources, then interrupting the use of or orphaning those resources can also lead to performance degradation in the form of extra work at process exit, which could also lead to priority inversion. This issue, for instance, can be seen with files including all of the following I/O devices: "file, file stream, directory, physical disk, volume, console buffer, tape drive, communications resource, mailslot, and pipe" objects, especially when opened in the default exclusive mode. Inter-process communication I/O is another case that shows how expensive leaving threads around can be, because another process could be waiting on communications from our process that it won't get until the kernel gets around to reading its message and then closing the connection. File locks are yet another great example: the `LockFile`/`LockFileEx` documentation specifically calls out that "the time it takes for the operating system to unlock these locks depends upon available system resources". To avoid leaving files locked for an extended period of time, the documentation therefore recommends unlocking files before process exit; however, this may not be doable if the Windows API is not correctly managing its thread lifetimes within the scope of the process and external Windows vendors have adopted the same poor practice. Exclusive access or another access type is a property tied to the kernel object itself, and the kernel calls object-specific "Okay To Close" or `CloseProcedure` routines before cleaning up an object. Additionally, if an orphaned intra-process synchronization mechanism causes forceful process termination before libraries get the chance to run module destructors, then that will generally create more resources for the kernel to clean up in series.
Lastly, it's notable that the kernel blocks kernel APCs from running on the current thread for the duration of process handle table cleanup. These APCs, a form of cooperative multitasking, could be important operations that the process started before termination, like I/O completion by `NtWriteFile` or other system routines. ReactOS warns: "Any caller of this routine should call `KeLeaveCriticalRegion` as quickly as possible". So, lessening handle cleanup would be more than ideal in this scenario.
Note: Parts of this analysis covers the ReactOS implementation. Modern Windows could have changed some technical details but the overall message stays the same.
In the end, Windows is stuck implementing process exit deadlock and resource cleanup heuristics at the cost of performance on hot code paths, all the while it remains impossible for the operating system to achieve correctness as long as it's killing threads. Over a long enough period, resource leaks in memory and on disk can accumulate until a system restart, reinstallation, or manual cleanup is the only solution. The complete lack of design here is hostile towards cross-platform software that wishes to reasonably rely on destructors for their intended purpose without writing code natively for Windows, low-level programming languages that provide access to module destructors through their features, and correct operating system designs that coexist with Windows. In its current state, the phenomenon that occurs every time an application finishes running on Windows would most accurately be described as Process Meltdown.
In Windows, threads are a securable resource independent of the host process:
A thread can assume a different security context than that of its process. This mechanism is called impersonation. When a thread is impersonating, security validation mechanisms use the thread’s security context instead of that of the thread’s process. When a thread isn’t impersonating, security validation falls back on using the security context of the thread’s owning process.
Windows Internals: System architecture, processes, threads, memory management, and more, Part 1 (7th edition)
Where Windows uses thread impersonation to execute code as another user, the equivalent functionality on a Unix-like system can be accomplished by forking a new process (optimized through copy-on-write memory) and using `setuid` along with the `CAP_SETUID` privilege (this is the Linux privilege; it can vary on other Unix-like OSs) to permanently change the process' user ID (UID). An OpenSSH server, for example, works in this fashion. Process creation is fast on Unix, but forking an already existing and set-up process is faster and more efficient.
The Principle of Least Privilege states that a user or entity should only have access to the specific data, resources, and privileges necessary to complete a required task.
Securable threads violate the Principle of Least Privilege because threads with different identities have access to each other by residing in the same address space within a process. As per Windows Internals, this shared access includes handles to kernel objects:
It’s important to keep in mind that all the threads in a process share the same handle table, so when a thread opens an object—even if it’s impersonating—all the threads of the process have access to the object.
Essentially, impersonation is the opposite of a proactive security design. Instead of limiting attack surface, securable threads often maximize it as much as possible.
Thread impersonation is also highly complex and doesn't compose. For thread impersonation to work, every layer of a subsystem within the Windows API has to specially support its usage (e.g. COM cloaking). And if there's even a single occurrence of a Windows API function being called that does not support impersonation, or that you forget to pass the impersonation token to, then that's a vulnerability. Failing to correctly use and control for the consequences of thread impersonation has long been a source of security bugs in Windows (with Microsoft now implementing hacks in the kernel to work around this delicate security model).
For all these reasons and more, I find the Windows securable thread model to be insecure and fragile. Securable processes, or per-process security, is inherently more robust and secure, and this is the model that Unix-like systems are built on.
Everybody knows that, on Windows, process creation is slow. However, this fact is commonly attributed to an architectural preference of Windows favoring multithreading over multiprocessing (i.e. one process housing multiple threads over separate single-threaded processes). It follows then, that Windows would be better optimized for creating threads and multithreaded workloads than Unix-like operating systems.
Let's get some numbers on Windows vs. Linux thread creation and join times for 10,000 threads (benchmark source code for Linux and Windows):
| System | Native Create Thread (seconds) | Native Create Thread with 50 FLS Allocations (seconds) | Native Create Thread without Loader Initialization (seconds) |
|---|---|---|---|
| Linux | 0.45 | N/A | N/A |
| Windows | 1.43 | 1.54 | 1.33 |
Benchmark systems details: Both Xen HVMs, Intel i5 4590, 4 vCPUs each, 8 GiBs of memory each, up-to-date Windows 10 22H2 and Fedora 39 on Linux 6.1. Tests performed while the host system and other virtual machines were suspended or turned off.
Linux native thread creation and join times come out firmly ahead, averaging speeds 3.2x faster than Windows. However, outside of some server applications that may correspond each client connection to a new thread, quickly creating 10,000 threads is not a realistic workload. Upon booting Windows, Process Explorer shows that there are about 1,000 threads between all processes on the system. So, Windows thread creation time is unlikely to become a performance bottleneck in practice, especially because Windows typically keeps threads alive and waiting as worker threads for some time instead of immediately deleting them, in case new work comes along. Windows threads run their `DLL_THREAD_ATTACH` and `DLL_THREAD_DETACH` routines at thread startup and exit, which requires the same synchronization as `LoadLibrary` (including the `DLL_PROCESS_ATTACH` routine) and `FreeLibrary` (including the `DLL_PROCESS_DETACH` routine) operations. Therefore, significant variance or unexpected stutters could be present in thread startup and exit times if overlapping thread creation/exit or library load/free operations occur. Interestingly, by looking at these numbers we can see that Windows thread creation overhead primarily comes from the time it takes for the NT kernel to spawn the thread itself and not from any action in user mode (in the future, we may run tests to see how performance changes with two threads simultaneously creating threads, since the synchronization requirement of thread loader initialization could have a greater effect then). Another minor note is that not joining threads on Linux yields around a 25% performance improvement, although this didn't seem to have a noticeable effect on Windows (joining means running the thread until its end, so any performance impact here would be due to the scheduler).
Next, we will review the difference in resource consumption between Windows and Unix threads. Each thread requires its own stack memory allocation. On Windows, the default reservation size of this memory mapping is 1 MiB; however, only 64 KiB of that reservation is consumed from physical memory. On Linux, these sizes are 8 MiB and the architecture's page size (typically 4 KiB on modern x86-based and ARM systems), respectively. Since Linux and other Unix-like systems follow the system's page size (the smallest possible memory mapping size as set by the MMU) when creating memory mappings, each thread's stack memory mapping consumes significantly less memory on Unix systems than on Windows, an attribute which is certainly desirable for a general-purpose computer. Specifically, with a page size of 4 KiB, Linux threads are 16x more lightweight in memory than Windows threads. These facts only account for user-mode threads because kernel-mode threads do not exist in virtual memory. Kernel-mode threads are fully committed into physical memory with a fixed-size stack (typically 8 KiB on Linux or 16 KiB on Windows). Generally, less memory mapping granularity also makes guard pages less effective at catching memory overrun bugs.
Checking in Process Explorer, a freshly booted Windows system has around 1,000 threads between all processes. 1,000 × 64 = 64,000 KiB, or 62.5 MiB, of memory spent just on thread stacks. In contrast, `htop` (since `ps` and `top` default to including kernel threads) shows a typical freshly booted Linux system has around 100 threads (although this number increases to ~150 when starting an XFCE desktop). Let's compare apples to apples, since Windows has its desktop open: 150 × 4 = 600 KiB in thread stacks. Let's assume that thread stacks stay within 4 KiB because stacks mostly consist of pointers and thus rarely grow to be very large. By our measurement, the end result is that threads themselves on a typical Linux desktop system consume roughly 107x less memory than on Windows.
With more threads, in particular running threads, comes greater context switching overhead. The cost for a kernel to switch between threads is expensive and not a negligible factor in performance. This price includes saving and restoring CPU state, CPU cache (L1, L2, and L3) eviction leading to cache misses (which have the largest potential performance impact), TLB (translation lookaside buffer) flushes invalidating virtual-to-physical address translation caches, and general kernel bookkeeping.
Checking in Performance Monitor, a freshly booted Windows system does around 700 context switches per second while idling. In contrast, checking vmstat
shows that an idling desktop Linux system sees around 150 context switches per second. Linux, probably due to less overall background work going on, performs around 5 times fewer context switches while idling. These differences are significant and could lead to notable baseline performance and battery life differences (although Windows may take steps to reduce background work when running on a battery for devices like laptops).
A kernel's scheduler plays a large role in the performance of a multithreading program or threads within a system. A scheduler decides which thread the kernel should switch to when a context switch occurs. As of version 6.6 (2023), the Linux kernel uses a new scheduler called earliest eligible virtual deadline first (EEVDF) that employs an algorithm by the same name. This scheduler takes into account multiple parameters, including virtual time, eligible time, virtual requests, and virtual deadlines, for determining scheduling priority. This scheduler replaces the Completely Fair Scheduler (CFS), likely because overly striving for equal run-time distribution over thread readiness factors can cause lock convoys in practice. According to Windows Internals: System architecture, processes, threads, memory management, and more, Part 1 (7th edition), "Windows implements a priority-driven, preemptive scheduling system". Indeed, the Windows scheduler is dynamic, relying heavily on "priority boosts" to optimize for the foreground window, user interface interactiveness, multimedia applications, and lock ownership for locks that fully rely on the kernel (e.g. an event object allowing execution to proceed). The Windows Internals book talks in-depth about the scheduler in its "Thread scheduling" subchapter. The scheduler on Windows and Linux is best optimized for the workloads common to each system (similar to a heap memory allocator, there are no settings that will be best optimized for every possible workload).
Departing from synthetic benchmarks, Linux is also better equipped to take advantage of modern CPUs with high core counts in real-world applications, with increasing margins for higher numbers of cores.
Based on our findings, we can conclude that Windows threads, like processes, are significantly more expensive and heavyweight than their typical Unix counterpart.
Threads are a significant source of non-determinism in computers. Multithreading allows the execution of separate threads to overlap at unspecified times while accessing shared/global state or data structures. These interactions are inherently complex.
Security is first and foremost about minimizing attack surface: the number of things that can go wrong.
Multithreading works counter to security by introducing an entire new class of bugs, concurrency bugs, for attackers to find and exploit.
Nowhere has this huge attack surface come to light more than in the 2022 paper "COMRace: Detecting Data Race Vulnerabilities in COM Objects", which revealed 26 privilege escalation vulnerabilities (most of these also being sandbox escapes). Windows uses COM everywhere, and managing concurrent access, particularly for MTA apartments since requests to an STA server are serialized, is error-prone even for experienced developers. COM components that come with Windows are mostly free threaded (supporting STA or MTA), with most uses being MTA to ensure high performance (this information is visible by looking at registered COM components in the registry and tracing Windows APIs). These vulnerabilities are only the tip of the iceberg, with concurrency bugs in complex, multi-threaded software being an immense landscape for security issues to hide in.
Concurrency bugs are difficult to catch in code review or fuzzing. As a result, multithreading, especially when combined with any sizable amount of shared state, should be avoided in security-sensitive contexts.
In contrast to Windows, Unix architecture tends to avoid this entire class of bugs through modular design that allows (single-threaded) multi-processing instead of multi-threading where data crosses security boundaries.
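As a minimal sketch of this Unix pattern, the following runs work in a forked child process so no mutable state is shared with the parent, and data crosses the process boundary only through an explicit pipe (`run_in_child` is a hypothetical helper for illustration, not any real Unix API):

```python
import os

def run_in_child(func, arg):
    """Run func(arg) in a forked child process; return its result via a pipe.

    The child has its own copy-on-write address space, so there is no shared
    mutable state with the parent and thus no data races by construction.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        os.close(r)
        result = func(arg)
        os.write(w, str(result).encode())
        os._exit(0)  # exit without touching parent state
    # parent process
    os.close(w)
    data = os.read(r, 4096)
    os.close(r)
    os.waitpid(pid, 0)
    return data.decode()

print(run_in_child(lambda n: n * n, 12))  # the child computes 12 * 12
```

This is the shape of classic single-threaded multi-processing: each unit of work is isolated in its own address space, and the only communication channel is one the programmer declared explicitly.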
The Windows loader is tightly coupled with the threading implementation to provide a feature known as DLL thread routines or notifications. The notifications run a callback in each DLL at thread startup and exit times. By default, all Windows DLLs are registered for this callback and can define custom actions by handling the DLL_THREAD_ATTACH
or DLL_THREAD_DETACH
call reasons in DllMain
.
DLL thread routines are an anti-feature that should have never existed because their synchronization breaks the library subsystem lifetime for threads.
DLL thread notifications themselves are effectively useless, and well-written libraries often disable them to improve performance by calling DisableThreadLibraryCalls
(a dynamic operation that must be called in DLL_PROCESS_ATTACH
). Dynamically allocated thread-local data is already a fragile mechanism for managing state and integrating it with the loader causes breakage.
There is no good trade-off that justifies the existence of these notifications and they come with far more disadvantages than advantages. Additionally, other operating systems work just fine without requiring that loaded libraries receive thread creation and exit notifications.
DLL thread routines are for initializing per-thread data, so one might question the need for process-wide synchronization here. However, these routines run as part of the loader's state machine and a module is accessible from the global scope, so synchronization is necessary for a few reasons:
- Protect from concurrent library unload
- Thread startup and exit acquiring the same locks needed for library load and free fully protects all DLLs from being unloaded while
DLL_THREAD_ATTACH
orDLL_THREAD_DETACH
routines run - The loader could also ensure a module is not concurrently unloaded by incrementing the reference count on each module, running its DLL thread routine, then decrementing that module's reference count, and repeating for all modules. But, that could be taxing on performance, and if a reference count concurrently drops to zero then actual library unload (acquiring the typical locks) must occur anyway.
- Avoid running
DLL_THREAD_ATTACH
for partially loaded libraries- It is likely a desired trait that the
DLL_PROCESS_ATTACH
of modules that the current module depends on or all modules is finished running before theirDLL_THREAD_ATTACH
routines run because thread attach routines could depend on initialization done by the process attach routines - Presumably a module will have sufficiently initialized itself before spawning a thread in its module initializer. In the case of desiring fully initialized dependencies for a module spawning a thread from its
DLL_PROCESS_ATTACH
beforeDLL_THREAD_ATTACH
routines run on the new thread, circular dependencies get in the way of that.
- Module list protection
- The loader thread initialization function walks a module list to initialize each module; this list is a global data structure that requires protection to walk between nodes (Windows fails to protect this access)
- There could be a lock specific to just protecting a module list's access; however, Windows groups the protection of module lists in with broader locks
- If a consistent snapshot of the loaded modules at a given point in time is desired then the lock must remain held while all the callbacks are run, a trait which is desirable because the loader does not want to run the
DLL_THREAD_ATTACH
of a module while it is still in itsLdrModulesInitializing
state due to a concurrent library load but it likely still wants to run theDLL_THREAD_ATTACH
of aLdrModulesInitializing
library that is responsible for starting the new thread
- Full load owner protection requirement
- Technically, only acquiring
ntdll!LdrpLoaderLock
is necessary to protect from concurrent library unload because unloading libraries must first deinitialize by callingDLL_PROCESS_DETACH
routines, which requiresntdll!LdrpLoaderLock
protection, before any further unloading steps can occur - Still, full loader owner protection by the
ntdll!LdrpLoadCompleteEvent
andntdll!LdrpLoaderLock
synchronization mechanisms is necessary to prevent lock hierarchy violation in the case that aDLL_THREAD_ATTACH
orDLL_THREAD_DETACH
routine loads a library perhaps accidentally due to Windows delay loading - This fact makes no difference to DLL thread routines breaking the library subsystem lifetime but some extra performance in concurrent library load and thread startup scenarios could be eeked out if not for requiring full load owner protection
Thread-local data is a fragile mechanism that introduces unnecessary failure points, is often a symptom of poor design, and is unfit for use in subsystems, particularly at the operating-system-level, for a variety of reasons:
- Thread-affinity issues
- A subsystem that creates thread-local data ties itself to the lifetime of the thread it created the data on, once that thread exits the thread-local data is invalid to use on any thread
- Subsystems should avoid imposing strict threading requirements on other subsystems or the application
- Keeping track of the caller is always best performed by explicitly passing a context structure around, as is commonly done by C libraries like SQLite with its
sqlite3
structure, rather than assuming the caller's identity is tied to the thread it's on like COM does with itsCoInitialize
/CoInitializeEx
functions - This fact combined with dynamic library loading, especially through delay loading, being common on Windows makes thread-local data a source of module initialization routine issues on Windows
- Conversely, POSIX recommends creating thread-local data in a module initialization routine as a valid approach to creating thread-local data because Unix systems typically load all their libraries from the main thread at process startup and the main thread should live until the end of the process
- Dynamically allocated thread-local storage can quickly run out of indexes
- TLS has 1088 maximum slots per-process and FLS has 128 maximum slots per-process
- glibc thread-specific data has 1024 maximum keys per-process
- Thread-local storage has subpar performance
- Thread-local storage slots may not be as close together in memory as they would be in a single contiguous allocation using
alloca
, which could lessen the locality of reference performance benefit for memory accesses - Thread-local storage retrieval and storage can introduce unnecessary overhead in the form of lookup cost and extra function calls
- On Windows, a dynamic thread-local storage allocation with
TlsAlloc
acquires the shared PEB lock on every allocation because it works by modifying the process-widePeb->TlsBitmap
data structure - On Windows, a dynamic thread-local storage allocation with
FlsAlloc
returns an FLS index which is implemented as a key in a binary array, which means that retrieving thread-local data requires searching a binary array each time (it's bloat on top of a pointer or index) - glibc simply implements a thread-specific data key as an index stored as an unsigned integer and uses a per-key MT-safe atomic swap to create keys
- Thread-local data can complicate library unload or even make correctly unloading a library impossible
- Global thread-local/thread-specific data always prevents a library from being safely unloaded (e.g. C/C++
thread_local
or using WindowsDLL_THREAD_ATTACH
)- The library's lifetime may be shorter or longer than the thread's lifetime
- Local thread-local/thread-specific data is safe to use from a library, without preventing its unload, as long as the given library owns the thread it created the data on and will join that thread before unloading
- Joining a thread from a module destructor is unsafe on Windows, which generally makes safe library unloading more difficult to achieve
- For example: If a worker thread dynamically loads a library, uses the contained subsystem which calls
FlsAlloc
to create thread-local data with a callback into its library code, then the library is unloaded, and later the thread exits then a crash will occur! Simply callingFlsFree
fromDLL_PROCESS_DETACH
will not solve the problem because the thread could have exited before your library was unloaded. - Another example: If the
DLL_THREAD_ATTACH
of a dynamically loaded library allocates some per-thread data, a new thread is created anywhere in the process, then the library is unloaded before that thread exits, the resources that theDLL_THREAD_DETACH
of the now unloaded DLL would have cleaned up, will be leaked - Unfortunately, MacOS has already been hit with this issue
- It is easy to accidentally use a thread-local data slot that is not yours thus creating an instability or an application compatibility issue where memory sanitizers would have been able to proactively catch an issue with traditional pointers
- Thread-local data is easy to misuse outside of its one valid use case: giving each thread its own isolated instance of some data
- Valid use cases: Passing a key created by
pthread_key_create
between threads so each thread has their own instance of some data,thread_local int thread_local_counter = 0;
in an application (because global thread-local data can prevent libraries from being unloaded), orerrno
by the operating system - An application programmer may misuse thread-local data in a way that unnecessarily extends memory lifetime until the end of a thread, which could be a waste when typical stack memory provides fine-grained memory lifetime management and can also automatically clean up resources with a great pattern like RAII
- Instead of creating a structured function-oriented program or subsystem that cleanly passes through values and keeps track of memory in block scope, a programmer can easily use thread-local storage to be lazy by allocating FLS slots with a custom cleanup FLS callback at thread exit when keeping the data in a smaller block scope would have created a cleaner and more coherent codebase
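To illustrate the one valid use case named above, here is a minimal sketch using Python's `threading.local` (an analog of a `pthread_key_create` key or a Windows TLS slot): each thread gets its own isolated instance of the data, and the data is bound to the thread that set it.

```python
import threading

# Each thread sees its own instance of `state` (analogous to a pthread_key_create
# key or a Windows TLS slot): the one valid use case for thread-local data.
state = threading.local()

results = {}

def worker(name, value):
    state.value = value   # private to this thread
    # ...other calls on this thread can read state.value without passing it...
    results[name] = state.value

threads = [threading.Thread(target=worker, args=(f"t{i}", i * 10)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)                  # each thread stored only its own value
print(hasattr(state, "value"))  # the main thread never set it, demonstrating thread affinity
```

Note the thread-affinity property discussed above: once a thread exits, its instance of the data is gone, and no other thread (including the main thread here) ever sees it.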
In Windows, the Process Environment Block (PEB) stores data that is global to the process. As opposed to linking with data exports or symbols made available by libraries, the PEB is a bad model for several reasons:
Instead of resolving data references at link time, the PEB, and its various getter functions such as GetProcessHeap
, force developers to initialize dynamically via the module constructor. Insisting on dynamic initialization when none should be necessary is suboptimal for performance. In addition, while dependency cycles should never exist as a core principle, reducing or eliminating a module's reliance on dynamic initialization can work as a mitigating factor for preventing crashes or other poor outcomes due to these loops, should they crop up.
Every process must initialize its PEB, a large and complex data structure, when it starts up. This inefficient burden contributes to the slow process creation times on Windows. Since the PEB also enforces dynamic initialization on the side of modules, the PEB essentially gives way to two-sided or double dynamic initialization.
By its very definition, the PEB centralizes state into one process-wide block. The contents of this block are dictated only by Microsoft. Clumping together unrelated data like this makes the operating system more monolithic as opposed to an improved way of functioning whereby libraries, each of which is its own subsystem, simply choose data symbols to export.
Centralizing state is also bad because it encourages coarse-grained locking, thus increasing lock contention, as is true for the PEB with its single PEB.FastPEBLock
for synchronizing access even to unrelated members in this data structure. Windows calls this critical section lock "fast" presumably because it has a spin count attached to it that optimizes for the typically short acquisition times of this lock. However, a heuristic that improves performance by busy waiting is not a better solution than implementing fine-grained locking, thus reducing waits to begin with. Common scenarios where acquiring the PEB lock is necessary includes creating thread-local data and accessing environment variables, whereas Unix systems typically use purpose-built locks or low-level atomic operations, often including MT-safe synchronization, in these cases.
The PEB's definition is well-known and Microsoft must stay compatible with its exact layout through Windows versions. In contrast, data symbols do not require positioning into any exact layout and can easily be versioned.
Through a thinly veiled segment register, the PEB exposes pointers to various valuable targets and data structures. Most prominently, using its contained PEB_LDR_DATA
structure, the PEB makes it trivial for shellcode to extract the base address of any module in the process, serving as a universal technique for breaking the ASLR of all modules in the process even if an attacker starts out by only knowing where one module is located. Accessing PEB_LDR_DATA
is a part of virtually all Windows shellcodes.
Accessing the PEB requires going through the TEB to reach it, which adds a layer of indirection. Alternatively, one could call the NTDLL exported function of RtlGetCurrentPeb
to make PEB access slower by also introducing a function call plus the slight overhead of dynamic linking.
On the microarchitectural level, accessing addresses with a non-zero segment base is documented to increase load latency, at least on the x86-64 architecture. Registers outside of the general-purpose registers are also less likely to be cached in the CPU pipeline, leading to a slowdown.
The PEB should never have existed. Despite Microsoft first adding the PEB in Windows NT 3.1 (the first Windows NT release), which is also when the operating system first gained the ability for dynamic linking, how the PEB works is a throwback to the times before dynamic linking, when data was centralized into one location and required manual symbol resolution to obtain. In modern Windows, the PEB continues to exist and Microsoft is not shy about expanding it, undermining dynamic linking with each new member it receives.
The Windows GetProcAddress
and POSIX dlsym
functions are platform equivalents because they resolve a procedure/symbol name to its address. They differ because GetProcAddress
can only resolve function export symbols, whereas dlsym
can resolve any global (RTLD_GLOBAL
) or local (RTLD_LOCAL
) symbol. There's also this difference: The first argument of GetProcAddress
requires passing in a module handle. In contrast, the first argument of dlsym
can take a module handle, but it also accepts one of the RTLD_DEFAULT
or RTLD_NEXT
flags (or "pseudo handles" in more Windows terminology). Let's discuss how GetProcAddress
functions first.
GetProcAddress
receives an HMODULE
(a module's base address) as its first argument. The loader maintains a red-black tree sorted by each module's base address called ntdll!LdrpModuleBaseAddressIndex
. GetProcAddress
➜ LdrGetProcedureAddressForCaller
➜ LdrpFindLoadedDllByAddress
(this is a call chain) searches this red-black tree for the matching module base address to ensure a valid DLL handle. Searching the LdrpModuleBaseAddressIndex
red-black tree mandates acquiring the ntdll!LdrpModuleDataTableLock
lock. If locating the module fails, GetProcAddress
sets the thread error code (retrieved with the GetLastError
function) to ERROR_MOD_NOT_FOUND
and returns early. GetProcAddress
receives a procedure name as a string for its second argument. GetProcAddress
➜ LdrGetProcedureAddressForCaller
➜ LdrpResolveProcedureAddress
➜ LdrpGetProcedureAddress
calls RtlImageNtHeaderEx
to get the NT header (IMAGE_NT_HEADERS
) of the PE image. IMAGE_NT_HEADERS
contains optional headers (IMAGE_OPTIONAL_HEADER
), including the image data directory (IMAGE_DATA_DIRECTORY
, this is in the .rdata
section). The data directory (IMAGE_DATA_DIRECTORY
) includes multiple directory entries, including IMAGE_DIRECTORY_ENTRY_EXPORT
, IMAGE_DIRECTORY_ENTRY_IMPORT
, IMAGE_DIRECTORY_ENTRY_RESOURCE
, and more. LdrpGetProcedureAddress
gets the PE's export directory entry. The compiler sorted the PE export directory entries alphabetically by procedure name ahead of time. LdrpGetProcedureAddress
performs a binary search over the sorted procedure names, looking for a name matching the procedure name passed into GetProcAddress
. If locating the procedure fails, GetProcAddress
sets the thread error code to ERROR_PROC_NOT_FOUND
. Locking isn't required while searching for an export because PE image exports are resolved once during library load and remain unchanged (this doesn't cover delay loading). In classic Windows monolithic fashion, GetProcAddress
may do much more than find a procedure on edge cases. Still, for the sake of our comparison, we only need to know how GetProcAddress
works at its core.
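The core of that lookup can be sketched as a binary search over an alphabetically sorted export name table (the export names and addresses below are made up for illustration; this is not the real PE parsing code):

```python
import bisect

# Hypothetical export name table, sorted alphabetically ahead of time by the
# compiler/linker, plus a parallel table of addresses (as in a PE export directory).
export_names = ["CloseHandle", "CreateFileW", "GetProcAddress", "LoadLibraryW"]
export_addrs = [0x1000, 0x1200, 0x1480, 0x1600]  # made-up addresses

def get_proc_address(name):
    """Binary search the sorted export names, as GetProcAddress does at its core."""
    i = bisect.bisect_left(export_names, name)
    if i < len(export_names) and export_names[i] == name:
        return export_addrs[i]
    return None  # the real API sets ERROR_PROC_NOT_FOUND here

print(hex(get_proc_address("GetProcAddress")))
print(get_proc_address("NoSuchExport"))
```

No locking is needed in the search itself, mirroring the real loader: the sorted name table is immutable once the image is mapped.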
GNU's POSIX-compliant dlsym
implementation firstly differs from GetProcAddress
because the former will not validate a correct module handle before searching for a symbol. Pass in an invalid module handle, and the program will crash; to be fair, you deserve to crash if you do that. Also, not validating the module handle provides a great performance boost. Depending on the flags
passed to dlopen
and the handle
passed to dlsym
, the GNU loader searches for symbols in a few ways. Most commonly, a symbol lookup occurs in the global scope (RTLD_DEFAULT
handle to dlsym
), which requires iterating the searchlist
in the main link map then matching on the first symbol found within a library. Here, we will cover the most straightforward case when calling dlsym
with a handle to a library (i.e. dlsym(myLibraryHandle, "myfunc")
). The ELF standard specifies the use of a hash table for searching symbols. do_sym
(called by _dl_sym
) calls _dl_lookup_symbol_x
to find the symbol in our specified library (also referred to as an object). _dl_lookup_symbol_x
calls _dl_new_hash
to hash our symbol name with the djb2 hash function (look for the magic numbers). Recent versions of the GNU loader use this djb2-based hash function, which differs from the old standard ELF hash function based on the PJW hash function. At its introduction, this new hash function improved dynamic linking time by 50%. Though commonly agreed upon across Unix-like systems, this hash function is formally a GNU extension that has yet to be standardized. It's also worth noting that, in 2023, someone caught an overflow bug in the original hash function described by the System V Application Binary Interface. _dl_lookup_symbol_x
calls do_lookup_x
, where the real searching begins. do_lookup_x
filters on our library's link_map
to see if it should disregard searching it for any reason. Passing that check, do_lookup_x
gets pointers into our library's DT_SYMTAB
and DT_STRTAB
ELF tables (the latter for use later while searching for matching symbol names). Based on our symbol's hash (calculated in _dl_new_hash
), do_lookup_x
selects a bucket from the hash table to search for symbols from. l_gnu_buckets
is an array of buckets in our hash table to choose from. At build time during the linking phase, the linker builds each ELF image's hash table with a number of buckets that adjusts depending on how many symbols are in the binary. With a bucket selected, do_lookup_x
fetches the l_gnu_chain_zero
chain for the given bucket and puts a reference to the chain in the hasharr
pointer variable for easy access. l_gnu_chain_zero
is an array of GNU hash entries containing standard ELF symbol indexes. The ELF symbol indexes inside are what's relevant to us right now. Each of these indexes points to a symbol table entry in the DT_SYMTAB
table. do_lookup_x
iterates through the symbol table entries in the chain until the selected symbol table entry holds the desired name or hits STN_UNDEF
(i.e. 0
), marking the end of the array. l_gnu_buckets
and l_gnu_chain_zero
inherit similarities in structure from the original ELF standardized l_buckets
and l_chain
arrays which the GNU loader still implements for backward compatibility. For the memory layout of the symbol hash table, see this diagram. There appears to be a fast path to quickly eliminate any entries that we can know early on won't match. Passing the fast path check, do_lookup_x
calls ELF_MACHINE_HASH_SYMIDX
to extract the offset to the standard ELF symbol table index from within the GNU hash entry. The GNU hash entry is a layer on top of the standard ELF symbol table entry; in the old but standard hash table implementation, you can see that the chain is directly an array of symbol table indexes. Having obtained the symbol table index, do_lookup_x
passes the address to DT_SYMTAB
at the offset of our symbol table index to the check_match
function. The check_match
function then examines the symbol name cross-referencing with DT_STRTAB
where the ELF binary stores strings to see if we've found a matching symbol. Upon finding a symbol, check_match
looks if the symbol requires a certain version (i.e. dlvsym
GNU extension). In the unversioned case with dlsym
, check_match
determines if this is a hidden symbol (e.g. -fvisibility=hidden
compiler option in GCC/Clang), and if so does not return the symbol. This loop restarts at the fast path check in do_lookup_x
until check_match
finds a matching symbol or there are no more chain entries to search. Having found a matching symbol, do_lookup_x
determines the symbol type; in this case, it's STB_GLOBAL
and returns the successfully found symbol address. Finally, the loader internals pass this value back through the call chain until dlsym
returns the symbol address to the user!
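The djb2-style hash at the heart of this lookup (`_dl_new_hash`) is simple enough to sketch, with bucket selection being a modulo over the bucket count fixed at link time (the bucket count below is hypothetical):

```python
def dl_new_hash(name: str) -> int:
    """The djb2-style hash the GNU loader uses (_dl_new_hash): h = h*33 + c,
    starting from the magic number 5381, in 32-bit arithmetic."""
    h = 5381
    for c in name.encode():
        h = (h * 33 + c) & 0xFFFFFFFF  # keep to 32 bits like the real code
    return h

# Bucket selection is simply hash % nbuckets; the linker sizes nbuckets per
# binary based on its symbol count (17 here is a made-up value).
nbuckets = 17
h = dl_new_hash("printf")
print(h, "-> bucket", h % nbuckets)
```

From the selected bucket, the loader then walks the bucket's chain of symbol table indexes until the names match, as described above.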
Note that the glibc source code follows a Hungarian notation that prefixes link_map
structure members with l_
(for organization and likely because GNU partially exposes the link_map
implementation detail with the dlinfo
GNU extension) and structures in general with r_
(e.g. see link_map
structure).
Also, note that, unlike the Windows ntdll!LdrpHashTable
(which serves an entirely different purpose), the hash table in each ELF DT_SYMTAB
is made up of arrays instead of linked lists for each chain (each bucket has a chain). Using arrays (size determined during binary compilation) is possible because the ELF images aren't dynamically allocated structures like the LDR_DATA_TABLE_ENTRY
structures ntdll!LdrpHashTable
keeps track of. Arrays are significantly faster than linked lists because following links is a relatively expensive operation (e.g. you lose locality of reference). In general, the hash table is a commonly used data structure because they typically outperform other means of storing and retrieving data.
Due to the increased locality of reference and a hash table being O(1) average and amortized time complexity vs a binary search being O(log n) time complexity, I believe that searching a hash table (bucket count optimized at compile time) and then iterating through the also optimally sized array as done by the GNU loader's dlsym
is faster than the binary search approach employed by GetProcAddress
in Windows for finding symbol/procedure addresses.
The Windows PE (EXE) executable format, and the MacOS Mach-O format starting with OS X v10.1, store library symbols in a two-dimensional namespace, effectively making every library its own namespace.
On the other hand, the Linux ELF executable format specifies a flat namespace such that two libraries having the same symbol name can collide. For example, if two libraries expose a malloc
function, then the dynamic linker won't be able to differentiate between them. As a result, the dynamic linker recognizes the first malloc
symbol definition it sees as the malloc
, ignoring any malloc
definitions that come later. This refers to cases of loading a library (dlopen
) with RTLD_GLOBAL
; with RTLD_LOCAL
, the newly loaded library's symbols aren't made available to other libraries in the process.
These namespace collisions have been the source of some bugs, and as a result, there have been workarounds to fix them. The most straightforward being: dlsym(mySpecificLibraryHandle, "malloc")
. However, there are some cases where that doesn't cut it, so GNU devised some solutions of their own.
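The straightforward workaround can be sketched with Python's `ctypes`, which wraps `dlopen`/`dlsym`: passing a specific library handle pins the symbol search to that one library instead of the global scope (this assumes a glibc-based Linux system where `libm.so.6` is present):

```python
import ctypes

# ctypes.CDLL wraps dlopen, and attribute access wraps dlsym on that specific
# handle, i.e. dlsym(libm_handle, "cos"), sidestepping any same-named symbol
# another library may have brought into the global scope.
libm = ctypes.CDLL("libm.so.6")  # assumes glibc; the SONAME varies per libc

cos = libm.cos
cos.restype = ctypes.c_double
cos.argtypes = [ctypes.c_double]

print(cos(0.0))  # cos(0) == 1.0
```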
Let's talk about GNU loader namespaces. Since 2004, the glibc loader has supported a feature known as loader namespaces for separating symbols contained within a loaded library into a separate namespace. Creating a new namespace for a loading library requires calling dlmopen
(this is a GNU extension). Loader namespaces allow a programmer to isolate the symbols of a module to its own namespace. There are various reasons GNU gives for why a developer might want to open a library in a separate namespace and why RTLD_LOCAL
isn't a substitute for this, mainly regarding loading an external library with an unwieldy number of generic symbol names polluting your namespace. It's worth noting that there's a hard limit of 16 on the number of GNU loader namespaces a process can have. Some might say this is just a bandage patch around the core issue of flat ELF symbol namespaces, but it doesn't hurt to exist as an option.
In 2009, the GNU loader received a new symbol type called STB_GNU_UNIQUE
. In the dl_lookup_x
function, determining the symbol type is the final step after successfully locating a symbol. We previously saw this occur when our symbol was determined to be type STB_GLOBAL
. STB_GNU_UNIQUE
is another one of those symbol types, and as the name implies, it's a GNU extension. The purpose of this new symbol type was to fix symbol collision problems in C++ by giving precedence to STB_GNU_UNIQUE
symbols even above RTLD_LOCAL
symbols within the same module. In theory, the C++ one definition rule (ODR) works perfectly with global symbols because it enforces a single definition for each symbol name. In practice though, the real world can be messier, with different versions or implementations of the same C++ symbol name requiring per-library linkage instead of one that's agreed upon process-wide. While STB_GNU_UNIQUE
solves this problem, it may not have been the best or most holistic solution to the problem. Due to the global nature of STB_GNU_UNIQUE
symbols and their lack of reference counting (per-symbol reference counting could impact performance), their usage in a library makes unloading impossible by design. On top of this, STB_GNU_UNIQUE
introduced significant bloat to the GNU loader by adding a special, dynamically allocated hash table known as the _ns_unique_sym_table
just for handling this one new symbol type, which, along with a lock for controlling access to this new data structure, is included in the global _rtld_global
. rtld_global
(run-time dynamic linker) is the structure that defines all variables global to the loader. There's also a performance hit because locating a STB_GNU_UNIQUE
symbol requires a lookup in two separate hash tables.
These workarounds indicate that ELF symbol namespaces should have been two-dimensional long ago. Standards organizations could still update the ELF standard to support two-dimensional namespaces (although it may require a new version or a slight ABI compatibility hack).
But, what exactly would be necessary to make 2D namespaces a reality on GNU/Linux? Well, let's take a page from someone who's done it (in 2001, no less). According to Apple's documentation, their dynamic linker simply "adds the module name as part of the symbol name of the symbols defined within it" to accomplish 2D namespaces. As a result, this could be a very easy feature to add, even on a module opt-in basis (plus, 2D namespaces allow for an optimization that improves symbol resolution performance in the RTLD_DEFAULT
case). When mixing 1D and 2D namespace symbols (if that is desirable), a per-symbol flag (like STB_GNU_UNIQUE
) could be added to differentiate the two symbol types.
Note that a one-dimensional symbol namespace doesn't change the fact that a binary still needs to specify the library files it depends on. The symbol namespace only changes how the dynamic linker searches symbols once those libraries are loaded.
Side note: We also need to kill ELF interposition ASAP, then we can do away with the temporary compilation flag hacks currently working around it.
Suggestion: Since going 2D will require some way to denote module + symbol pairs in GDB (like Windows does in WinDbg, for instance, ntdll!NtOpenFile
), I suggest using the /
character as the separator. The only two characters that are illegal in a file name on Unix platforms are /
and the null byte, so choosing slash for this purpose (e.g. libc.so/printf
) will not limit any file names and is the most natural (where symbols in a module are sort of like a virtual file system), in my opinion. LLDB on MacOS uses a `
(backtick) for the same purpose, but I do not like that because it conflicts with markdown.
The dlsym
function can locate RTLD_LOCAL
symbols (i.e. libraries that have been dlopen
ed with the RTLD_LOCAL
flag). Internally, the dlsym_implementation
function acquires dl_load_lock
at its entry. Since dlclose
must also acquire dl_load_lock
to unload a library, this prevents a library from unloading while another thread searches for a symbol in that same (or any) library. dlsym
eventually calls into _dl_lookup_symbol_x
to perform the symbol lookup.
Answering this question on Windows requires a more in-depth two-part investigation of GetProcAddress
and FreeLibrary
internals:
For GetProcAddress
, LdrGetProcedureAddressForCaller
(this is the NTDLL function GetProcAddress
calls through to) acquires the LdrpModuleDatatableLock
lock, searches for our module LDR_DATA_TABLE_ENTRY
structure in the ntdll!LdrpModuleBaseAddressIndex
red-black tree, checks if our DLL was dynamically loaded, and, if so, atomically increments LDR_DATA_TABLE_ENTRY.ReferenceCount
. LdrGetProcedureAddressForCaller
releases the LdrpModuleDatatableLock
lock. Before acquiring the LdrpModuleDatatableLock
lock, note there's a special path for NTDLL whereby LdrGetProcedureAddressForCaller
checks if the passed base address matches ntdll!LdrpSystemDllBase
(this holds NTDLL's base address) and lets it skip close to the meat of the function where LdrpFindLoadedDllByAddress
and LdrpResolveProcedureAddress
occur. Towards the end of LdrGetProcedureAddressForCaller
, it calls LdrpDereferenceModule
, passing the LDR_DATA_TABLE_ENTRY
of the module it was searching for a procedure in. LdrpDereferenceModule
, assuming the module isn't pinned (LDR_ADDREF_DLL_PIN
) or a static import (ProcessStaticImport
in LDR_DATA_TABLE_ENTRY.Flags
), atomically decrements the same LDR_DATA_TABLE_ENTRY.ReferenceCount
of the searched module. If LdrpDereferenceModule
senses the LDR_DATA_TABLE_ENTRY.ReferenceCount
reference counter has dropped to zero (this could only occur due to a concurrent thread decrementing the reference count), it will delete the necessary module information data structures and unmap the module. By reference counting, GetProcAddress
ensures a module isn't unmapped midway through its search.
The Windows loader maintains LDR_DATA_TABLE_ENTRY.ReferenceCount
and LDR_DDAG_NODE.LoadCount
reference counters for each module. The loader ensures there are no references to a module's LDR_DDAG_NODE
(LDR_DDAG_NODE.LoadCount
= 0) before there are no references to the same module's LDR_DATA_TABLE_ENTRY
(LDR_DATA_TABLE_ENTRY.ReferenceCount
= 0). This sequence is the correct order of operations for decrementing these reference counts. The LdrUnloadDll
NTDLL function (public FreeLibrary
calls this) calls LdrpDecrementModuleLoadCountEx
which typically decrements a module's LDR_DDAG_NODE.LoadCount
then, if it hits zero, runs DLL_PROCESS_DETACH
. Lastly, LdrUnloadDll
calls LdrpDereferenceModule
which decrements the module's LDR_DATA_TABLE_ENTRY.ReferenceCount
. Unloading a library (when LoadCount
decrements for the last time from 1
to 0
) requires becoming the load owner (LdrUnloadDll
calls LdrpDrainWorkQueue
). Once the thread is appointed as the load owner (only one thread can be a load owner at a time), LdrUnloadDll
calls LdrpDecrementModuleLoadCountEx
again with the DontCompleteUnload
argument set to FALSE
thus allowing actual module unload to occur instead of just decrementing the LDR_DDAG_NODE.LoadCount
reference counter. With LDR_DDAG_NODE.LoadCount
now at zero but the thread still being the load owner, LdrpDecrementModuleLoadCountEx
calls LdrpUnloadNode
to run the module's DLL_PROCESS_DETACH
routine (LdrpUnloadNode
can also walk the dependency graph to unload other now unused libraries). LdrUnloadDll
then calls LdrpDropLastInProgressCount
to decommission the current thread as the load owner followed by calling LdrpDereferenceModule
to remove the remaining module information data structures and unmap the module. The LdrpDereferenceModule
function acquires the LdrpModuleDatatableLock
lock while removing the module from the global module information data structures. Since GetProcAddress
also appropriately acquires the LdrpModuleDatatableLock
when searching global module information data structures and correctly utilizes module reference counts, this prevents a library from unloading while another thread searches for a symbol in that same library. Note that many functions all throughout the loader call LdrpDereferenceModule
(including GetProcAddress
internally) so there can't be a race condition that causes a module now with a reference count of zero to remain loaded.
The coarse-grained locking approach of GNU dlsym
is the only place where the Windows loader approach is superior for maximizing concurrency and preventing deadlocks. I recommend that glibc switch to a more fine-grained locking approach like it already has with RTLD_GLOBAL
symbols and their global scope locking system (GSCOPE). Note that acquiring dl_load_lock
also allows dlsym
to safely search link maps in the RTLD_DEFAULT
and RTLD_NEXT
pseudo-handle scenarios without acquiring dl_load_write_lock
. However, that goal could also be accomplished by shortly acquiring the dl_load_write_lock
lock in the _dl_find_dso_for_object
function, so this is only a helpful side effect. Simply protecting symbol resolution through per-module reference counting is likely not viable due to global symbols. Although, might easier protection be doable by combining global scope locking (for global symbols) with module reference counting (for local symbols)?
Keep in mind that if your program frees libraries whose exports/symbols are still in use (after locating them with GetProcAddress
or dlsym
), then you can expect your application to crash due to your own negligence. In other words, GetProcAddress
/dlsym
only protects from internal data races (within the loader). However, you, the programmer, are responsible for guarding against external data races. Put another way: the loader doesn't know what your program might do, so it can only maintain consistency with itself.
The Procedure/Symbol Lookup Comparison (Windows GetProcAddress
vs POSIX dlsym
GNU Implementation) and How Does GetProcAddress
/dlsym
Handle Concurrent Library Unload? sections cover, including synchronization, how the GetProcAddress
/dlsym
functions resolve a symbol name to an address. Lazy linking also does this, and so they typically share many internals. However, given that the program may lazily resolve a library symbol at any time in execution, one must be careful to design a system that is flexible to concurrent scenarios.
On the GNU loader, dlsym
internally calls into the same internals as lazy linking. However, the GNU loader's approach to supporting the additional functionality POSIX specifies dlsym
to support creates some notable differences that you can read about in the aforementioned sections.
Windows lazy linking, which is merely a part of Windows delay loading, is a different beast that I've only barely researched, and there's not much public information on it. See Library Lazy Loading and Lazy Linking Overview for some context on Windows delay loading.
Stepping into a lazy linked function on the GNU loader reveals that it eventually calls the familiar _dl_lookup_symbol_x
function to locate a symbol. During dynamic linking, _dl_lookup_symbol_x
can resolve global symbols. The GNU loader uses GSCOPE, the global scope system, to ensure consistent access to STB_GLOBAL
symbols (as it's known in the ELF standard; this maps to the RTLD_GLOBAL
flag of POSIX dlopen
). The global scope is the searchlist
in the main link map. The main link map refers to the program's link map structure (not one of the libraries'). In the TCB (the generic term for what Windows calls the TEB) of each thread is an atomic state flag (not a reference count), able to hold one of three states, known as the gscope_flag
that keeps track of which threads are currently depending on the global scope for their operations. A thread uses the THREAD_GSCOPE_SET_FLAG
macro (internally calls THREAD_SETMEM
) to atomically set this flag and the THREAD_GSCOPE_RESET_FLAG
macro to atomically unset this flag. When the GNU loader requires synchronization of the global scope, it uses the THREAD_GSCOPE_WAIT
macro to call __thread_gscope_wait
. Note there are two implementations of __thread_gscope_wait
, one for the Native POSIX Threads Library (NPTL) used on Linux systems and the other for the Hurd Threads Library (HTL) which was previously used on GNU Hurd systems (with the GNU Mach microkernel). GNU Hurd has since switched to using NPTL. For our purposes, only NPTL is relevant. The __thread_gscope_wait
function iterates through the gscope_flag
of all (user and system) threads, signalling to them that it's waiting (THREAD_GSCOPE_FLAG_WAIT
) to synchronize. GSCOPE is a custom approach to locking that works by the GNU loader creating its own synchronization mechanisms based on low-level locking primitives. Creating your own synchronization mechanisms can similarly be done on Windows using the WaitOnAddress
and WakeByAddressSingle
/WakeByAddressAll
functions. Note that the THREAD_SETMEM
and THREAD_GSCOPE_RESET_FLAG
macros don't prepend a lock
prefix to the assembly instruction when atomically modifying a thread's gscope_flag
. These modifications are still atomic because xchg
is atomic by default on x86, and a single aligned load or store (e.g. mov
) operation is also atomic by default on x86 up to 64 bits. If gscope_flag
were a reference count, the assembly instruction would require a lock
prefix (e.g. lock inc
) because incrementing internally requires two memory operations, one load and one store. The GNU loader must still use a locking assembly instruction/prefix on architectures where memory consistency doesn't automatically guarantee this (e.g. AArch64/ARM64). Also, note that all assembly in the glibc source code is in AT&T syntax, not Intel syntax.
Understanding the fundamentals of how the GNU loader uses low-level locking to perform global scope synchronization requires knowing how a modern futex works. A modern POSIX mutex and Windows critical section are implemented using a futex and a futex-like mechanism under the hood, respectively. Additionally, one must grasp the thread-local storage strategy for thread synchronization, although how the GNU loader employs this strategy comes with a twist because it uses a predefined field in the TCB as the per-thread flag that it makes atomic modifications to and acquires a stack lock (dl_stack_cache_lock
) when accumulating values versus typical thread-local synchronization using, for example, a thread_local
(C++) variable, to which non-atomic modifications are made, followed by joining the relevant threads before reading the accumulated value.
The Windows loader resolves lazy linkage by calling LdrResolveDelayLoadedAPI
(an NTDLL export). This function makes heavy use of LDR_DATA_TABLE_ENTRY.LoadContext
. Symbols for the LoadContext
structure aren't publicly available, and some members require reverse engineering. In the protected load case, LdrpHandleProtectedDelayload
➜ LdrpWriteBackProtectedDelayLoad
(LdrResolveDelayLoadedAPI
may call this or LdrpHandleUnprotectedDelayLoad
) acquires the LDR_DATA_TABLE_ENTRY.Lock
lock and never calls through to LdrpResolveProcedureAddress
. The unprotected case eventually calls through to the same LdrpResolveProcedureAddress
function that GetProcAddress
uses; this is not true for the protected case. From glancing over some of the functions LdrResolveDelayLoadedAPI
calls, a module's LDR_DATA_TABLE_ENTRY.ReferenceCount
is atomically modified quite a bit. Setting a breakpoint on both LdrpHandleProtectedDelayload
and LdrpHandleUnprotectedDelayLoad
shows that a typical process will only ever call the former function (experimented with a ShellExecute
).
The above paragraph is only an informal glance; I or someone else must thoroughly look into how delay loading is implemented since its integration into the modern Windows loader itself.
Lazy Loading and Lazy Linking OS Comparison
NOTE: This section contains incomplete work and is subject to change.
Update: Pending rewrite.
In a loader or dynamic linker, loading refers to the entire process of setting up a module, from mapping it into memory, to linking, to initialization. Linking (sometimes referred to as binding or snapping) is the part of the library loading process that resolves symbol names to addresses in memory. Doing either of these operations lazily means that it's done on an as-needed basis instead of all at process startup or library load-time.
Windows collectively refers to lazy loading and lazy linking as "delay loading". However, we favor the distinct terms throughout this section.
Lazy library loading is a first-class citizen on Windows but not on Unix systems. While the optimization could improve performance, architectural differences make it less needed on Unix systems than on Windows. Windows prefers fewer processes with more threads, which get their functionality from tightly coupled libraries, thus one library easily brings in many others. In contrast, Unix libraries are loosely coupled and the architecture favors more processes with fewer threads, leading to, on average, fewer libraries per process. As a result, the minimal process model common to Unix would give a lazy library loading optimization less upside.
On Windows, lazy loading/linking exists as both a feature of the MSVC linker and, in later Windows versions, the Windows loader with the NTDLL exported functions LdrResolveDelayLoadsFromDll
, LdrResolveDelayLoadedAPI
, and LdrQueryOptionalDelayLoadedAPI
(more functions internally, not exposed as exports). A lazy loaded DLL also has its imported functions lazy linked.
On Unix systems, lazy loading can be effectively achieved by doing dlopen
and then dlsym
at run-time. There are techniques to get more seamless lazy loading on Unix systems, but they can come with caveats. One could also employ a proxy design pattern in their code to achieve the effect.
MacOS used to support lazy loading until Apple removed this feature from the default OSX linker (old OSX ld man page ➜ up-to-date OSX ld man page showing that the -lazy_library
option has disappeared). Apple likely removed lazy loading because it's an inherently incorrect feature—when initializing or deinitializing libraries, the order of operations matters and is not necessarily safely interruptible. For instance, an application could unknowingly call a lazy-loaded function while the process is trying to deinitialize libraries and shut down. Or worse yet, accidentally loading a library while another library is in the process of initializing or deinitializing, which could lead to unintended use of the partially initialized/uninitialized library if the newly loading library depends on it due to a circular dependency, likely causing a crash. Lazy loading turns every function call to another library into a potential minefield. Perhaps the most widespread and damning consequence of lazy loading is its effect on lock hierarchies. Due to the fact that lazy loading can cause a library to load at any time, it can disrupt lock hierarchies thereby forcing the loader into being at the bottom of any lock hierarchy (except for locks used by the loader itself, since a loader doesn't typically have any dependencies). However, this is completely backwards when considering that the loader is the first thing a process invokes when it starts up. The loader being at the bottom of the lock hierarchy renders it potentially unsafe to acquire any lock from a module initializer/deinitializer (e.g. DllMain
) or constructor/destructor in the module scope. As a result, the threat of ABBA deadlock makes the safety of even seemingly simple actions from a module constructor/destructor questionable at best. Using lazy loading (or dynamic loading in general) as a solution to circular dependencies can also make module deinitializers (e.g. Windows DLL_PROCESS_DETACH
) safety impossible due to the issue of one library deinitializing before libraries that depend on it deinitialize, typically leading to a crash (even though the loader correctly deinitializes libraries in the reverse order it initialized them). In general, lazy loading can also cause deadlock if a module constructor/destructor waits on a thread (this is safe on a non-Windows OS), then that new thread accidentally loads a library due to lazy loading. Changing a library that was once immediately loaded into being lazy loaded could also cause applications to unexpectedly break due to all the gotchas lazy loading, an incredibly delicate loader feature, or rather anti-feature, comes with. An accurate description of lazy loading would be that it's the library loader equivalent of randomly calling TerminateThread
then praying you come out on the other side okay.
It is important to distinguish between lazy loading code and lazy loading data because, in high-level scenarios, lazy loading data is a great thing and can often work wonders (e.g. ad-hoc fetching of web resources or postponing the parsing of some data). However, in high or low level scenarios, lazy loading code or lazily initializing code loaded into memory is when lazy loading can become problematic, because that code likely requires initialization by some custom code at a now-unspecified time while also holding a lock to ensure code that depends on the initializing code cannot run at an overlapping time—and this is where the trouble begins.
So, is lazy loading libraries always bad? At the operating system level, yes. Low-level code forming the core of the operating system runs with specific parameters in mind, and a high-level action such as lazy loading a library can easily break its assumptions. Further, when module initializers come into the picture, they often need to fulfill a diverse set of requirements and should ideally be as flexible as reasonably possible, which necessitates that library initialization works robustly without the danger of being interrupted unexpectedly or loaded at an unsafe point in the process' run-time. All in all, I cannot think of a non-contrived case in which library lazy loading would be a good idea over dynamic loading at a known safe time, or better yet loading dependencies before application run-time whenever possible.
From a purely performance perspective, as opposed to library lazy loading or otherwise dynamic loading, the GNU loader has taken to improving performance by merging multiple smaller or commonly used libraries into one since each individual library load is a sizable expense (similar to the work musl has already done). Meanwhile, Windows is heading in the opposite direction with a parallelized loader and many merely lazily loaded libraries that can never hope to make large Windows processes as fast and inexpensive as they are on Unix systems. Here, library lazy loading functions as a performance hack when what Windows really wants are more loosely coupled libraries that don't have lots of, especially circular, dependencies to begin with (you don't need to lazy load libraries you don't have) and overall more lightweight processes (i.e. Unix architecture). The minimal process and modular library architecture of Unix systems like GNU/Linux and MacOS really shines here.
On the other hand, lazy linking is a first-class citizen only on POSIX-compliant systems. Unix systems can achieve lazy linking simply by calling dlopen
with the RTLD_LAZY
flag or providing the -z lazy
flag to the linker.
On Windows, lazy linking is only supported as a part of lazy loading (hence Microsoft collectively refers to it as "delay loading"). There's no way to make imported functions from a DLL link lazily without having the entire DLL load lazily. Windows doesn't readily expose a lazy linking option through LoadLibrary
like POSIX-compliant systems do with dlopen
. The lack of distinct lazy linking on Windows is especially unfortunate because symbol resolution on Unix-like systems is significantly faster than it is on Windows (this fact is relevant because a dynamic linker resolves only the symbols depended on by a program and its libraries at process startup).
While upstream GNU sets lazy linking (also known as lazy binding) as the default (see ld
manual), GNU/Linux distributions commonly default to using an exploit mitigation known as full RELRO, which has the effect of disabling lazy linking. In my testing, GCC on these distributions still typically compiles with lazy linking by default. Lazy linking on Windows poses the same potential security risk. Although, the risk is unavoidable on Windows since it doesn't have an option to disable lazy linking entirely. Unlike Unix lazy linking (when it's enabled), Microsoft opts to take the performance hit of changing memory protection to writable then back (two system calls) upon resolving each lazy linked function or delay loaded API, which leaves only a small time window for an attacker to hypothetically modify a writable code pointer to their own liking (e.g. when exploiting a memory corruption issue). Take this defense-in-depth measure with a grain of salt though, because programs often expose writable code pointers in other ways—with those potentially exposed by the dynamic linker being only one of them. For instance, C++ virtual method tables commonly expose writable code pointers and C++ is the default low-level language on Windows. Although lazy linking is often disabled on modern Unix-like operating systems for security, its minimal process architecture means that there will be fewer symbols for a dynamic linker to resolve at process startup, anyway.
Laziness such as library lazy loading and lazy linking can introduce unexpected latency and jitter into a system that is unsuitable for a real-time operating system (RTOS). Since Windows heavily relies on delay loading to hold itself together, Windows can never hope to function as a hard RTOS in its current state. The soft RTOS performance of Windows IoT (sufficient for tasks that aren't safety or mission critical) can also be impacted by delay loading occurring in user-mode. As well as forcing the loader to the bottom of a lock hierarchy, delay loading while holding a lock can cause priority inversion (e.g. due to waiting for disk I/O, initialization tasks, or loader lock contention), which can be poor for the performance and responsiveness of any system but catastrophic for a real-time application.
Similarly, laziness is also unsuitable when constant time execution is mandatory like in security applications. Notably, Windows DLLs lazy load various security-related Windows DLLs (e.g. CRYPT32.dll
, bcrypt.dll
, SspiCli.dll
, and secur32.dll
). As an example, if a security application validates a correct username, then reaches into the Windows security APIs to perform password validation or cryptographic operations for the first time, that could create a significant delay. This delay could then be used to leak sensitive information in a timing attack, which in this case results in, or greatly helps, the exploitation of a username enumeration vulnerability. The small delay imposed by lazy linking could also be problematic depending on the attack scenario, such as whether the attacker is remote or not.
In summary, library lazy loading is a fundamentally broken feature at the operating system level that can reduce process startup expense at the cost of the correctness, robustness, security, and run-time performance of the system, as well as its flexibility to specialized operating system designs. If library lazy loading is used to restore some order to the load-time of circular dependencies and to lessen library loads for the simplest-case applications (ensuring the simplest case does not require the overhead of the most complex case is a key tenet of API design), as it is on Windows, then library lazy loading is more than just a problematic solution, but a hack. There is never a time when library lazy loading is a better solution than the programmer controlling dynamic library load time or loading libraries before application run-time, along with minimizing dependencies and properly managing dependency chains to avoid circular dependencies. Lazy API linking is an acceptable feature for a dynamic linker to employ on general-purpose systems, albeit at the cost of some security since the lazy loaded data are pointers to code that could have otherwise permanently resided in read-only memory.
dl_load_lock
is the general lock that protects the loader from concurrent access.
dl_load_write_lock
is, as the name implies, a write lock that protects against writing to the list of link maps (i.e. adding/removing a node to/from the list).
dl_load_lock
is the highest lock in the loader's lock hierarchy, with its position being above both dl_load_write_lock
and GSCOPE locking.
There are occurrences where the GNU loader walks the list of link maps (a read operation) while only holding the dl_load_lock
lock. For instance, walking the link map list while only holding dl_load_lock
but not dl_load_write_lock
occurs in _dl_sym_find_caller_link_map
➜ _dl_find_dso_for_object
. This action is safe because the only occurrence where the loader modifies the list of link maps (following early loader startup) is when it loads/unloads a library, which means acquiring dl_load_lock
. Linking a new link map into the list of link maps (a write operation) requires acquiring dl_load_lock
then dl_load_write_lock
(e.g. see the GDB log from the load-library
experiment). It's unsafe to modify the link map list without also acquiring dl_load_write_lock
because, for instance, the dl_iterate_phdr
function acquires dl_load_write_lock
without first acquiring dl_load_lock
to ensure the list of link maps remains consistent while the function walks between nodes (a read operation).
Note that the dl_iterate_phdr
function is a GNU/BSD extension, it's not POSIX. Also, since dl_iterate_phdr
calls your callback while holding dl_load_write_lock
, lock hierarchy layout makes it unsafe to load a library from within this callback. It makes sense to keep dl_load_write_lock
held while running these callbacks so they can obtain a consistent snapshot of information on the loaded libraries at the given point in time. There is a trade-off here between maximizing concurrency from not acquiring the broad dl_load_lock
while calling these callbacks and the library loading/unloading limitation within these callbacks. Although, I deem this trade-off acceptable since a program should only want to use these callbacks to search the libraries and find information about a given loaded library. A program could call dlopen
with the RTLD_NOLOAD
flag on the library name it wants to find loaded before iterating libraries so as to increase that library's reference count thus allowing its safe use outside of the callback after finding it in memory. Alternatively, if one wants to use dl_iterate_phdr
to gather information from the ELF program headers (phdr
, which is this function's intended use) of all loaded libraries, then that can be done in its entirety within the callback safely, without loading a library or having to worry about library reference counts, due to holding dl_load_write_lock
thus making sure the library is not unloaded by a concurrent thread.
Interestingly, dl_load_write_lock
is a standard POSIX mutex instead of a POSIX readers-writer or rwlock
(acquired with the pthread_rwlock_rdlock
and pthread_rwlock_wrlock
locking functions). While a readers-writer lock would extend concurrency to multiple threads iterating the link map list with dl_iterate_phdr
at once (these being read operations), that ability may be unwanted if it's expected that libraries should be able to safely modify the iterated ELF program headers (a write operation) in their callbacks (this action would require changing memory protection).
This section was supposed to be titled "Windows Loader Lock Hierarchy and Synchronization Strategy" to supplement the analysis we have covering the GNU loader. However, plans changed upon realizing that thread creation is not thread-safe on Windows!
In the ntdll!LdrpInitializeThread
function, the loader acquires the load/loader lock (ntdll!LdrpLoadCompleteEvent
+ ntdll!LdrpLoaderLock
) then iterates over modules in the PEB_LDR_DATA.InLoadOrderModuleList
list to run DLL_THREAD_ATTACH
routines. This function does not acquire ntdll!LdrpModuleDataTableLock
to walk the load order list and somewhat peculiarly decides to walk it over the seemingly more fitting PEB_LDR_DATA.InInitializationOrderModuleList
(i.e. DLLs that have had their DLL_PROCESS_ATTACH
called).
By taking these actions, the loader's approach to thread safety assumes that load/loader lock protects against write operations to PEB_LDR_DATA.InLoadOrderModuleList
. However, this assumption is wrong. Setting a breakpoint on ntdll!LdrpInsertDataTableEntry
reveals that there are instances where a data table entry will be inserted, thereby writing a new entry into the PEB_LDR_DATA.InLoadOrderModuleList
, without acquiring load/loader lock. During ntdll!LdrpInsertDataTableEntry
, the ntdll!LdrpModuleDataTableLock
lock will be exclusively acquired. However, this does not matter because ntdll!LdrpInitializeThread
does not also acquire ntdll!LdrpModuleDataTableLock
on its side.
That's right, we have a thread-safety bug in the Windows loader!
I was able to confirm the lack of safety in WinDbg and after searching around for instances where thread-unsafe use of PEB_LDR_DATA.InLoadOrderModuleList
was causing a crash, I was able to confirm my suspicions with a real crash in the wild!
On the legacy and modern Windows loaders, GetProcAddress
holds the ability to initialize a module if it's uninitialized. In this section, we learn why GetProcAddress
can perform module initialization and compare how GetProcAddress
determines if module initialization should occur across loader versions.
GetProcAddress
may perform module initialization to fix the issue of someone doing GetModuleHandle
➜ GetProcAddress
on a partially loaded (i.e. uninitialized) library. This problem can arise when GetModuleHandle
is used from DllMain
and possibly in concurrent GetModuleHandle
cases. GetModuleHandle
is also problematic because it doesn't increment a library's reference count, thus leaving your program vulnerable to using a library that no longer exists in the process. The core idea of GetModuleHandle
being a function that can get a handle to a library without loading it if it's currently unloaded is alright; however, these two implementation problems are what break GetModuleHandle
. Furthermore, the LoadLibrary
name chosen by Microsoft is slightly misleading; it would better be called LibraryOpen
/OpenLibrary
, similar to the POSIX dlopen
. This mirrors how someone would call the fopen
and fclose
functions in the standard C API to work with files: opening (not "loading") a handle to the file if it's not open, or simply increasing the reference count if some other code in the process has already opened it, all without knowing whether your call will be the one to initially "load" it into the process (okay, this is a nitpick). On Windows, the complex dependency chains between DLLs, in combination with delay loading, make GetModuleHandle
risky in the case that an immediately loaded library changes status to a lazily loaded library across Windows versions, thereby breaking applications. The GetModuleHandle
function arguably shouldn't exist, as it doesn't in POSIX. Or if it is to exist, it should have been implemented right (e.g. as a flag to LoadLibraryEx
). But, since GetModuleHandle
exists and is commonly used throughout the Windows ecosystem, Microsoft had to implement their best hack to work around its pitfalls. A better workaround than making GetProcAddress
initialize on behalf of GetModuleHandle
might have been to make GetModuleHandle
not return the module's handle if the module isn't fully loaded, or to simply turn GetModuleHandle
into a LoadLibrary
call. Although, the latter solution could introduce concurrency issues for old code relying on the inherently faulty GetModuleHandle
function because GetModuleHandle
doesn't typically require load owner protection using ntdll!LdrpLoadCompleteEvent
nor did it typically require loader lock protection on the legacy loader (GetModuleHandleW
calls legacy BasepGetModuleHandleExW
passing NoLock = TRUE
). However, Microsoft clearly didn't choose either of those solutions because I tested doing GetModuleHandle
on a partially loaded library from DllMain
(I verified the module I was getting a handle of was uninitialized in WinDbg) and it successfully gave me a handle to the uninitialized library.
The GNU loader implements a form of GetModuleHandle
within dlopen
as the RTLD_NOLOAD
flag. However, after running an experiment, I found that the GNU loader will initialize an uninitialized library before returning a handle to it, even with the RTLD_NOLOAD
flag. This behavior resolves my greatest concern regarding GetModuleHandle
on the GNU/Linux side of things. Next, testing if RTLD_NOLOAD
increments the library's reference count reveals that it indeed does. Perfect; great job on this, GNU developers!
For Windows, as long as you use GetModuleHandleEx
to do reference counting and don't stray from the public GetProcAddress
API for getting the address of a procedure to call, you will probably be okay. Prefer LoadLibrary
whenever possible, since that also generally solves the problem of GetModuleHandle
assuming a library is in the process at all (as previously mentioned, Windows presents a higher risk here). Don't forget to balance out calls to functions that increment the library reference count with FreeLibrary
. All around, the GNU loader and Unix architecture clearly prove more robust to any usage here.
On with the GetProcAddress
analysis, the public GetProcAddress
function internally calls through to the LdrGetProcedureAddressForCaller
(in NTDLL) on the modern loader or the LdrpGetProcedureAddress
function on the legacy loader.
In the legacy loader, when GetProcAddress
internally resolves an exported procedure to an address, it checks if the loader has performed initialization on the containing module. If the module requires initialization, then GetProcAddress
initializes the module (i.e. calling its DllMain
) before returning a procedure address to the caller.
In the ReactOS code for LdrpGetProcedureAddress
, we see this happen:
/* Acquire lock unless we are initing */
/* MY COMMENT: This refers to the loader initing; ignore this for now */
if (!LdrpInLdrInit) RtlEnterCriticalSection(&LdrpLoaderLock);
...
/* Finally, see if we're supposed to run the init routines */
/* MY COMMENT: ExecuteInit is a function argument which GetProcAddress always passes as true */
if ((NT_SUCCESS(Status)) && (ExecuteInit))
{
/*
* It's possible a forwarded entry had us load the DLL. In that case,
* then we will call its DllMain. Use the last loaded DLL for this.
*/
Entry = NtCurrentPeb()->Ldr->InInitializationOrderModuleList.Blink;
LdrEntry = CONTAINING_RECORD(Entry,
LDR_DATA_TABLE_ENTRY,
InInitializationOrderLinks);
/* Make sure we didn't process it yet*/
/* MY COMMENT: If module initialization hasn't already run... */
if (!(LdrEntry->Flags & LDRP_ENTRY_PROCESSED))
{
/* Call the init routine */
_SEH2_TRY
{
Status = LdrpRunInitializeRoutines(NULL);
}
_SEH2_EXCEPT(EXCEPTION_EXECUTE_HANDLER)
{
/* Get the exception code */
Status = _SEH2_GetExceptionCode();
}
_SEH2_END;
}
}
...
Now, let's see how the modern Windows loader handles performing module initialization. In LdrGetProcedureAddressForCaller
, there's an instance where module initialization may occur without the LdrGetProcedureAddressForCaller
function itself acquiring loader lock (what follows is a marked-up IDA decompilation):
...
ReturnValue = LdrpResolveProcedureAddress(
(unsigned int)v24,
(unsigned int)Current_LDR_DATA_TABLE_ENTRY,
(unsigned int)Heap,
v38,
v30,
(char **)&v34
);
...
// If LdrpResolveProcedureAddress succeeds
if ( ReturnValue != NULL )
{
// Test for the searched module having a LDR_DDAG_STATE of LdrModulesReadyToInit
if ( Current_Module_LDR_DDAG_STATE == 7
// Test whether module initialization should occur
// LdrGetProcedureAddressForCaller receives this as its fifth argument; KERNELBASE!GetProcAddressForCaller (called by KERNELBASE!GetProcAddress) always sets it to TRUE
&& ExecuteInit
// Test for the current thread having the LoadOwner flag in TEB.SameTebFlags
// LoadOwner = 0x1000
&& (NtCurrentTeb()->SameTebFlags & LoadOwner) != 0
// Test for the current thread not holding the LdrpDllNotificationLock lock (i.e. we're not executing a DLL notification callback)
&& !(unsigned int)RtlIsCriticalSectionLockedByThread(&LdrpDllNotificationLock) )
{
DdagNode = *(_QWORD *)(Current_LDR_DATA_TABLE_ENTRY + 0x98);
v33[0] = 0;
// Perform module initialization
ReturnValue = LdrpInitializeGraphRecurse(DdagNode, 0i64, v33);
}
...
}
Huh? LdrGetProcedureAddressForCaller
didn't acquire loader lock, yet it is performing module initialization! How could that be safe?
Checking for the LoadOwner
flag in TEB.SameTebFlags
ensures that a given thread has the necessary protection to safely perform module initialization because the loader only sets this flag on the calling thread of a LoadLibrary
operation and unsets it once library loading is complete. The level of protection necessary is typically of that imposed by the ntdll!LdrpLoadCompleteEvent
+ ntdll!LdrpLoaderLock
synchronization mechanisms. However, during loader initialization, Windows restricts the process to only having one load owner thread (by making new threads block on ntdll!LdrpInitCompleteEvent
), thus allowing for module initialization without the protection that is traditionally necessary while the process is single-threaded. As you can see, these state checks also examine other information regarding the execution context and module state where it could be undesirable or unsafe to proceed with module initialization. Additionally, now is a good time to remind you that loader lock (ntdll!LdrpLoaderLock
) alone is not enough protection to run any DllMain
routine on the modern Windows loader.
On Windows, loader initialization includes process initialization (e.g. critical data structures like the PEB), as well as fully loading the library dependencies of the application. Once loader initialization is complete, the loader can proceed with running the application.
Reading the ReactOS code for LdrpLoadDll
(the internal NTDLL function called by LoadLibrary
), we see this code:
NTSTATUS NTAPI LdrpLoadDll(...)
{
// MY COMMENT: Get the value of global variable LdrpInLdrInit into a local variable
BOOLEAN InInit = LdrpInLdrInit;
...
/* Check for init flag and acquire lock */
/* MY COMMENT: This refers to the loader initing */
if (!InInit) RtlEnterCriticalSection(&LdrpLoaderLock);
...
}
The loader won't acquire loader lock during library loading (including during module initialization by the LdrpRunInitializeRoutines
function) when the loader is initializing (at process startup). What's up with that? It's a startup performance optimization to forgo locking during process startup.
Not acquiring loader lock here is safe because the legacy loader, like the modern loader, includes a mechanism for blocking new threads spawned into the process until loader initialization is complete. The legacy loader waits in the LdrpInit
function using a ntdll!LdrpProcessInitialized
spinlock and sleeping with ZwDelayExecution
. While the loader is initializing, the LdrpInit
function sets LdrpInLdrInit
to TRUE
, initializes the process by calling LdrpInitializeProcess
, then upon returning, LdrpInit
sets LdrpInLdrInit
to FALSE
. After, the loader, will unlock the ntdll!LdrpProcessInitializing
spinlock. Hence, during loader initialization, one can safely forgo acquiring loader lock.
The modern loader optimizes waiting by using the ntdll!LdrpInitCompleteEvent
event object instead of sleeping for a set time. The modern loader also includes a ntdll!LdrpProcessInitialized
spinlock. However, the loader may (in the unlikely occurrence of a remote thread spawning in early) only spin on it until event creation (NtCreateEvent
), at which point the loader waits solely using that synchronization object. ZwDelayExecution
is still called to slow the spin. While ntdll!LdrInitState
is 0
, it's safe not to acquire any locks. This includes accessing shared module information data structures without acquiring ntdll!LdrpModuleDataTableLock
lock and performing module initialization/deinitialization. ntdll!LdrInitState
changes to 1
immediately after LdrpInitializeProcess
calls LdrpEnableParallelLoading
which creates the loader worker threads (LoaderWorker
flag in TEB.SameTebFlags
). However, these loader worker threads won't have any work yet, so it should still be safe not to acquire locks during this time. Once these loader worker threads receive work, though, the loader would have to start acquiring the ntdll!LdrpModuleDataTableLock
lock to ensure thread safety when accessing module information data structures. Additionally, since loader worker threads are naturally limited to only performing mapping and snapping, the loader can currently still forgo acquiring any locks associated with being a load owner (LoadOwner
flag in TEB.SameTebFlags
) such as performing module initialization.
The LdrpInitShimEngine
function is a good example of the loader performing a bunch of loader operations without locking. The loader may call LdrpInitShimEngine
shortly before it calls LdrpEnableParallelLoading
to start spawning loader worker threads (not always; it happens when running Google Chrome under WinDbg). The LdrpInitShimEngine
function calls LdrpLoadShimEngine
, which does a whole bunch of typically unsafe actions like module initialization (calls LdrpInitializeNode
directly and calls LdrpInitializeShimDllDependencies
, which in turn calls LdrpInitializeGraphRecurse
) without loader lock and walking the PEB_LDR_DATA.InLoadOrderModuleList
without acquiring ntdll!LdrpModuleDataTableLock
. Of course, all these actions are safe due to the unique circumstances of loader initialization. Note that the shim engine initialization function may still acquire the ntdll!LdrpDllNotificationLock
lock not for thread safety but because the loader branches on its state using the RtlIsCriticalSectionLockedByThread
function.
The modern loader explicitly checks ntdll!LdrInitState
to optionally perform locking as an optimization in a few places. Notably, the ntdll!RtlpxLookupFunctionTable
function opts to skip locking the ntdll!LdrpInvertedFunctionTableSRWLock
lock before accessing the ntdll!LdrpInvertedFunctionTable
shared data structure if ntdll!LdrInitState
equals 3
(i.e. just before "loader initialization is done"). Similarly, the ntdll!LdrLockLoaderLock
function only acquires loader lock if loader initialization is done.
Be aware that for both the legacy and modern loaders, this improvement in startup performance comes with a trade-off in run-time performance: after loader initialization is complete, those branches on ntdll!LdrpInLdrInit
or ntdll!LdrInitState
become nothing but dead weight.
Beyond a slight startup performance improvement through reduced synchronization overhead, there is another reason Microsoft may want to disallow new (potential load owner) threads from running during process startup, particularly in regard to module initializers: Windows places loader lock (or the modern equivalents) at the bottom of any lock hierarchy that is external to the loader. Thus, restricting additional threads from running in this stage works as a quick fix to mitigate ABBA deadlock risk from module initializers at process startup. Of course, this risk reduction does not extend to libraries that are dynamically loaded at process run-time.
The GNU loader doesn't implement any such startup performance hack to forgo locking on process startup. The absence of any such mechanism by the GNU loader enables threads to start and exit at process startup or within a module initializer. The same is true for process exit. Therefore, the GNU loader is more flexible in this respect.
Trying to connect to a COM server under loader lock fails deterministically. For instance, running this code from DllMain
on DLL_PROCESS_ATTACH
will deadlock:
// Ensure valid LNK file with this CMD command:
// explorer "C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Accessories\Notepad.lnk"
LPCSTR linkFilePath = "C:\\ProgramData\\Microsoft\\Windows\\Start Menu\\Programs\\Accessories\\Notepad.lnk";
WCHAR resolvedPath[MAX_PATH];
HRESULT hres;
HWND hwnd = GetDesktopWindow();
hres = CoInitializeEx(NULL, 0);
if (SUCCEEDED(hres)) {
// Resolve LNK file to its target file
// Implementation: https://learn.microsoft.com/en-us/windows/win32/shell/links#resolving-a-shortcut
ResolveIt(hwnd, linkFilePath, resolvedPath, MAX_PATH);
}
CoUninitialize();
// Output: C:\Windows\system32\notepad.exe
wprintf(L"%ls\r\n", resolvedPath);
Here we see the deadlocked call stack, which shows that NtAlpcSendWaitReceivePort
is waiting for something (this function only exists to make the NtAlpcSendWaitReceivePort
system call):
0:000> k
# Child-SP RetAddr Call Site
00 00000080`f96fcde8 00007ffd`26c93f8f ntdll!NtAlpcSendWaitReceivePort+0x14
01 00000080`f96fcdf0 00007ffd`26ca94d7 RPCRT4!LRPC_BASE_CCALL::SendReceive+0x12f
02 00000080`f96fcec0 00007ffd`26c517c0 RPCRT4!NdrpSendReceive+0x97
03 00000080`f96fcef0 00007ffd`26c524bf RPCRT4!NdrpClientCall2+0x5d0
04 00000080`f96fd510 00007ffd`28491ce5 RPCRT4!NdrClientCall2+0x1f
05 (Inline Function) --------`-------- combase!ServerAllocateOXIDAndOIDs+0x73 [onecore\com\combase\idl\internal\daytona\objfre\amd64\lclor_c.c @ 313]
06 00000080`f96fd540 00007ffd`28491acd combase!CRpcResolver::ServerRegisterOXID+0xd5 [onecore\com\combase\dcomrem\resolver.cxx @ 1056]
07 00000080`f96fd600 00007ffd`28494531 combase!OXIDEntry::RegisterOXIDAndOIDs+0x71 [onecore\com\combase\dcomrem\ipidtbl.cxx @ 1642]
08 (Inline Function) --------`-------- combase!OXIDEntry::AllocOIDs+0xc2 [onecore\com\combase\dcomrem\ipidtbl.cxx @ 1696]
09 00000080`f96fd710 00007ffd`2849438f combase!CComApartment::CallTheResolver+0x14d [onecore\com\combase\dcomrem\aprtmnt.cxx @ 693]
0a 00000080`f96fd8c0 00007ffd`284abc2f combase!CComApartment::InitRemoting+0x25b [onecore\com\combase\dcomrem\aprtmnt.cxx @ 991]
0b (Inline Function) --------`-------- combase!CComApartment::StartServer+0x52 [onecore\com\combase\dcomrem\aprtmnt.cxx @ 1214]
0c 00000080`f96fd930 00007ffd`2849c285 combase!InitChannelIfNecessary+0xbf [onecore\com\combase\dcomrem\channelb.cxx @ 1028]
0d 00000080`f96fd960 00007ffd`2849a644 combase!CGIPTable::RegisterInterfaceInGlobalHlp+0x61 [onecore\com\combase\dcomrem\giptbl.cxx @ 815]
0e 00000080`f96fda10 00007ffd`21b86399 combase!CGIPTable::RegisterInterfaceInGlobal+0x14 [onecore\com\combase\dcomrem\giptbl.cxx @ 776]
0f 00000080`f96fda50 00007ffd`21b5adb3 PROPSYS!CApartmentLocalObject::_RegisterInterfaceInGIT+0x81
10 00000080`f96fda90 00007ffd`21b842e6 PROPSYS!CApartmentLocalObject::_SetApartmentObject+0x7b
11 00000080`f96fdac0 00007ffd`21b5c1fc PROPSYS!CApartmentLocalObject::TrySetApartmentObject+0x4e
12 00000080`f96fdaf0 00007ffd`21b5bde6 PROPSYS!CreateObjectWithCachedFactory+0x2bc
13 00000080`f96fdbd0 00007ffd`21b5d16c PROPSYS!CreateMultiplexPropertyStore+0x46
14 00000080`f96fdc30 00007ffd`241d3235 PROPSYS!PSCreateItemStoresFromDelegate+0xbfc
15 00000080`f96fde90 00007ffd`2422892f windows_storage!CShellItem::_GetPropertyStoreWorker+0x2d5
16 00000080`f96fe3d0 00007ffd`2422b7e7 windows_storage!CShellItem::GetPropertyStoreForKeys+0x14f
17 00000080`f96fe6a0 00007ffd`2415f2b6 windows_storage!CShellItem::GetCLSID+0x67
18 00000080`f96fe760 00007ffd`2415eb0b windows_storage!GetParentNamespaceCLSID+0xde
19 00000080`f96fe7c0 00007ffd`241772fb windows_storage!CShellLink::_LoadFromStream+0x2d3
1a 00000080`f96feaf0 00007ffd`2417709c windows_storage!CShellLink::LoadFromPathHelper+0x97
1b 00000080`f96feb40 00007ffd`24177039 windows_storage!CShellLink::_LoadFromFile+0x48
1c 00000080`f96febd0 00007ffd`21aa10e2 windows_storage!CShellLink::Load+0x29
1d (Inline Function) --------`-------- TestDLL!ResolveIt+0x8c [C:\Users\user\source\repos\TestDLL\TestDLL\dllmain.cpp @ 110]
1e 00000080`f96fec00 00007ffd`21aa143b TestDLL!DllMain+0xd2 [C:\Users\user\source\repos\TestDLL\TestDLL\dllmain.cpp @ 170]
1f 00000080`f96ff4f0 00007ffd`28929a1d TestDLL!dllmain_dispatch+0x8f [d:\a01\_work\20\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 281]
20 00000080`f96ff550 00007ffd`2897c2c7 ntdll!LdrpCallInitRoutine+0x61
21 00000080`f96ff5c0 00007ffd`2897c05a ntdll!LdrpInitializeNode+0x1d3
22 00000080`f96ff710 00007ffd`2894d947 ntdll!LdrpInitializeGraphRecurse+0x42
23 00000080`f96ff750 00007ffd`2892fbae ntdll!LdrpPrepareModuleForExecution+0xbf
24 00000080`f96ff790 00007ffd`289273e4 ntdll!LdrpLoadDllInternal+0x19a
25 00000080`f96ff810 00007ffd`28926af4 ntdll!LdrpLoadDll+0xa8
26 00000080`f96ff9c0 00007ffd`260156b2 ntdll!LdrLoadDll+0xe4
27 00000080`f96ffab0 00007ff7`8fda1022 KERNELBASE!LoadLibraryExW+0x162
28 00000080`f96ffb20 00007ff7`8fda1260 TestProject!main+0x12 [C:\Users\user\source\repos\TestProject\TestProject\source.c @ 82]
29 (Inline Function) --------`-------- TestProject!invoke_main+0x22 [d:\a01\_work\20\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 78]
2a 00000080`f96ffb50 00007ffd`26f37344 TestProject!__scrt_common_main_seh+0x10c [d:\a01\_work\20\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288]
2b 00000080`f96ffb90 00007ffd`289626b1 KERNEL32!BaseThreadInitThunk+0x14
2c 00000080`f96ffbc0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Running the same code but swapping the CoCreateInstance
execution context from CLSCTX_INPROC_SERVER
(an in-process DLL server, the most common execution context) to CLSCTX_LOCAL_SERVER
(an out-of-process EXE server on the same machine) yields a similar deadlock (the CLSID_ShellLink
COM component doesn't support the latter execution context, but that's beside the point):
0:000> k
# Child-SP RetAddr Call Site
00 000000ea`aed1c908 00007ff8`d3b41b4f ntdll!NtAlpcSendWaitReceivePort+0x14
01 000000ea`aed1c910 00007ff8`d3b5c357 RPCRT4!LRPC_BASE_CCALL::SendReceive+0x12f
02 000000ea`aed1c9e0 00007ff8`d3b01610 RPCRT4!NdrpSendReceive+0x97
03 000000ea`aed1ca10 00007ff8`d3b0102f RPCRT4!NdrpClientCall2+0x5d0
04 000000ea`aed1d030 00007ff8`d379d801 RPCRT4!NdrClientCall2+0x1f
05 (Inline Function) --------`-------- combase!ServerAllocateOXIDAndOIDs+0x73 [onecore\com\combase\idl\internal\daytona\objfre\amd64\lclor_c.c @ 313]
06 000000ea`aed1d060 00007ff8`d379d67d combase!CRpcResolver::ServerRegisterOXID+0xd5 [onecore\com\combase\dcomrem\resolver.cxx @ 1056]
07 000000ea`aed1d120 00007ff8`d379ded1 combase!OXIDEntry::RegisterOXIDAndOIDs+0x71 [onecore\com\combase\dcomrem\ipidtbl.cxx @ 1642]
08 (Inline Function) --------`-------- combase!OXIDEntry::AllocOIDs+0xc2 [onecore\com\combase\dcomrem\ipidtbl.cxx @ 1696]
09 000000ea`aed1d230 00007ff8`d374a103 combase!CComApartment::CallTheResolver+0x14d [onecore\com\combase\dcomrem\aprtmnt.cxx @ 693]
0a 000000ea`aed1d3e0 00007ff8`d37476fe combase!CComApartment::InitRemoting+0x25b [onecore\com\combase\dcomrem\aprtmnt.cxx @ 991]
0b 000000ea`aed1d450 00007ff8`d3717d87 combase!CComApartment::StartServer+0x2a [onecore\com\combase\dcomrem\aprtmnt.cxx @ 1214]
0c (Inline Function) --------`-------- combase!InitChannelIfNecessary+0x1f [onecore\com\combase\dcomrem\channelb.cxx @ 1028]
0d 000000ea`aed1d480 00007ff8`d3717708 combase!CRpcResolver::BindToSCMProxy+0x2b [onecore\com\combase\dcomrem\resolver.cxx @ 1733]
0e 000000ea`aed1d4c0 00007ff8`d37c6d66 combase!CRpcResolver::DelegateActivationToSCM+0x12c [onecore\com\combase\dcomrem\resolver.cxx @ 2243]
0f 000000ea`aed1d690 00007ff8`d3717315 combase!CRpcResolver::CreateInstance+0x1a [onecore\com\combase\dcomrem\resolver.cxx @ 2507]
10 000000ea`aed1d6c0 00007ff8`d372cb30 combase!CClientContextActivator::CreateInstance+0x135 [onecore\com\combase\objact\actvator.cxx @ 616]
11 000000ea`aed1d970 00007ff8`d372581a combase!ActivationPropertiesIn::DelegateCreateInstance+0x90 [onecore\com\combase\actprops\actprops.cxx @ 1983]
12 000000ea`aed1da00 00007ff8`d37242c0 combase!ICoCreateInstanceEx+0x90a [onecore\com\combase\objact\objact.cxx @ 2032]
13 000000ea`aed1e8d0 00007ff8`d372401c combase!CComActivator::DoCreateInstance+0x240 [onecore\com\combase\objact\immact.hxx @ 392]
14 (Inline Function) --------`-------- combase!CoCreateInstanceEx+0xd1 [onecore\com\combase\objact\actapi.cxx @ 177]
15 000000ea`aed1ea30 00007ff8`c41210a7 combase!CoCreateInstance+0x10c [onecore\com\combase\objact\actapi.cxx @ 121]
16 (Inline Function) --------`-------- TestDLL!ResolveIt+0x24 [C:\Users\user\source\repos\TestDLL\TestDLL\dllmain.cpp @ 42]
17 000000ea`aed1ead0 00007ff8`c412145b TestDLL!DllMain+0x77 [C:\Users\user\source\repos\TestDLL\TestDLL\dllmain.cpp @ 304]
18 000000ea`aed1f3c0 00007ff8`d4209a1d TestDLL!dllmain_dispatch+0x8f [d:\a01\_work\20\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 281]
19 000000ea`aed1f420 00007ff8`d425d307 ntdll!LdrpCallInitRoutine+0x61
1a 000000ea`aed1f490 00007ff8`d425d09a ntdll!LdrpInitializeNode+0x1d3
1b 000000ea`aed1f5e0 00007ff8`d422d947 ntdll!LdrpInitializeGraphRecurse+0x42
1c 000000ea`aed1f620 00007ff8`d420fbae ntdll!LdrpPrepareModuleForExecution+0xbf
1d 000000ea`aed1f660 00007ff8`d42073e4 ntdll!LdrpLoadDllInternal+0x19a
1e 000000ea`aed1f6e0 00007ff8`d4206af4 ntdll!LdrpLoadDll+0xa8
1f 000000ea`aed1f890 00007ff8`d1b32612 ntdll!LdrLoadDll+0xe4
20 000000ea`aed1f980 00007ff6`ff831012 KERNELBASE!LoadLibraryExW+0x162
21 000000ea`aed1f9f0 00007ff6`ff831240 TestProject!main+0x12 [C:\Users\user\source\repos\TestProject\TestProject\source.c @ 175]
22 (Inline Function) --------`-------- TestProject!invoke_main+0x22 [d:\a01\_work\20\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 78]
23 000000ea`aed1fa20 00007ff8`d2ab7374 TestProject!__scrt_common_main_seh+0x10c [d:\a01\_work\20\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288]
24 000000ea`aed1fa60 00007ff8`d423cc91 KERNEL32!BaseThreadInitThunk+0x14
25 000000ea`aed1fa90 00000000`00000000 ntdll!RtlUserThreadStart+0x21
The NtAlpcSendWaitReceivePort
function is indeed waiting for something, thus the reason for our deadlock.
Note that the in-process COM server deadlock occurs on the first call to a COM method in the former case (hres = ppf->Load(wsz, STGM_READ)
), not when we initially call CoCreateInstance
to create an in-process COM server (CLSCTX_INPROC_SERVER
). The deadlock occurs here because an in-process COM server's startup is lazy. Lazy server launch occurs for the most common CLSCTX_INPROC_SERVER
execution context, but not necessarily for other execution contexts such as CLSCTX_LOCAL_SERVER
(out-of-process, same machine). The deadlock occurs in the CoCreateInstance
function itself in the CLSCTX_LOCAL_SERVER
execution context.
The call stack is similar to the deadlock we receive from ShellExecute
under DllMain
. As we know from "Perfect DLL Hijacking", loader lock (ntdll!LdrpLoaderLock
) is the root cause of this deadlock and releasing this lock allows execution to continue (also, see the DEBUG NOTICE
in the LdrLockLiberator project for further potential blockers). However, setting a read watchpoint on loader lock reveals that the user-mode code within our process never checks the state of the loader lock. This finding leads me to believe that the local procedure call by ntdll!NtAlpcSendWaitReceivePort
causes a remote process (csrss.exe
?) to introspect on the state of loader lock within our process, thus explaining why WinDbg never hits our user-mode watchpoint. This introspection is likely done using shared memory (i.e. NtMapViewOfSection
targeting a mapping in a different process on Windows or shm_open
on POSIX-compliant systems). Starting a COM server (combase!CComApartment::StartServer
) involves calling NdrClientCall2
to perform a local procedure call (technically "LRPC" which is RPC done locally). COM must support real RPC (to another machine) in the case of DCOM, which was eventually integrated into COM (this is what combase!CComApartment::InitRemoting
in the call stack refers to). In particular, the combase!ServerAllocateOXIDAndOIDs
function does a local/remote procedure call according to the lclor
interface.
Microsoft explains this deadlock (emphasis mine):
To do it, the Windows loader uses a process-global critical section (often called the "loader lock") that prevents unsafe access during module initialization.
The CLR will attempt to automatically load that assembly, which may require the Windows loader to block on the loader lock. A deadlock occurs since the loader lock is already held by code earlier in the call sequence.
Note that Microsoft wrote this documentation to resolve Windows issues surrounding loader lock that high-level developers may come across with the CLR (the .NET runtime environment); however, it also roughly applies to COM. As we know, Microsoft uniquely places the loader at the bottom of any lock hierarchy in the system (outside of NTDLL), which is already reason enough for Microsoft to choose to turn a probabilistic ABBA deadlock into a deterministic blocker. In particular, the sentence "A deadlock occurs since the loader lock is already held by code earlier in the call sequence." refers to the lock hierarchy nesting that can lead to ABBA deadlock. Additionally, I reason that initializing a COM object, like a CLR assembly, may take "actions that are invalid under loader lock", such as spawning and waiting on a thread, which causes a deadlock on Windows. The combase!CRpcThreadCache::RpcWorkerThreadEntry
thread that spawns in the case of both execution contexts stands out. So, instead of allowing for non-determinism that could cause an unlikely or hard-to-diagnose deadlock/crash at some other point, Microsoft took steps to make connecting to a COM server from DllMain
, which is in effect using COM, deadlock deterministically.
Emphasis on the word "block" because it indicates that, in this case, Windows treats loader lock as a readers-writer lock (SRW lock in Windows) that's been acquired in exclusive/write mode instead of a critical section (a thread synchronization mechanism), the latter of which allows recursive acquisition. Reacquiring a lock on the same thread requires the surrounding code to have a reentrant design. Nesting the acquisition of different locks, from nested subsystems in this case, requires that they agree on a lock hierarchy. These facts align with what we see when starting a COM server from the DLL_PROCESS_ATTACH
of DllMain
and our diagnosis.
When building a .NET project, compiling a binary with the /clr
option causes the MSVC compiler to put module initializers and finalizers in the .cctor
section. Now, the high-level CLR loader will perform module initialization and deinitialization instead of the native loader, thus working around "loader lock issues" that are symptomatic of Windows architecture. Of course, DllMain
is still run by the native loader under loader lock.
Please note that the above analysis is a strong theory based on the surrounding evidence and my work. However, verifying it with full certainty would require ascertaining whether the remote process is checking loader lock in our process. After conducting some research, I may be able to debug precisely what is happening internally with the !alpc
WinDbg command, information in the "Advanced Windows Debugging" book, and RPC View or perhaps some (partially implemented) Wireshark dissectors; however, I've not gotten around to doing so.
In this section, we will analyze why doing COM initialization or starting a COM server from DllMain
is currently unsafe and more interestingly, how these actions could be made safe.
Note: Throughout this section, I refer to the synchronization mechanism that is at the top of a modern Windows loader's lock hierarchy colloquially as "loader lock" (as does Microsoft). However, internally, this title actually goes to the LdrpLoadCompleteEvent
loader event.
In the documentation for CoInitializeEx
lies this statement:
Because there is no way to control the order in which in-process servers are loaded or unloaded, do not call CoInitialize, CoInitializeEx, or CoUninitialize from the DllMain function.
This warning concerns the risk of nesting the native loader ("loader lock") and COM lock hierarchies within a thread. Realizing this risk would require at least two threads acquiring the locks in different orders to interleave, thus causing an ABBA deadlock. Ideally, Microsoft would support this by ensuring a clean lock hierarchy between these subsystems or controlling load/unload order to some extent. Microsoft has come across this issue in the wild (note that starting with OLE 2, OLE is built on top of COM, so they share internals). Drawing from ReactOS code, Microsoft has also realized this issue in their own code (I've verified these components aren't specific to ReactOS, for example, appwiz.cpl
, desk.cpl
and joy.cpl
are all real Microsoft Windows CPL applets, and I've verified that Windows application may load CPL library files during their run-time, for instance, I've seen explorer.exe
do this in WinDbg). The simplest method for fixing this ABBA deadlock would be to acquire the loader lock before the COM lock, thereby keeping a clean lock hierarchy. However, this method would decrease concurrency, thus reducing performance. Another solution would be to specifically target COM functions that interact with the native loader while already holding the COM lock, such as CoFreeUnusedLibraries
. CoFreeUnusedLibraries
would leave COM's shared data structures/components in a consistent state before unlocking the COM lock, do library loader work thus taking the loader lock, and then perform consistency checks after reacquiring the COM lock. COM architecture might precisely track each component's state to support its concurrent design. A state tracking mechanism could work like LDR_DDAG_STATE
in the native loader. My inspiration for the second approach partially came from the GNU loader. In both solutions, the process maintains a single coherent lock hierarchy. Lastly, Microsoft could try controlling load/unload order, perhaps by deferring actual library load/unload work until after COM operations are complete or before they begin (when all COM locks are unlocked). COM could accomplish this goal by maintaining a list of all the libraries that need freeing and then using checkpoints to appropriately decide when it can, while now being unlocked, free the libraries (likely easier said than done given the monolithic nature of it all). Thus, any of these solutions would solve the potential ABBA deadlock issue upon COM performing a CoFreeUnusedLibraries
when combined with some module's DllMain
interacting with COM at the same time.
Given how monolithic Windows tends to be, the second and likely the third solutions to the ABBA deadlock would be challenging; however, concurrency is hardly ever easy, and Microsoft engineers have taken on more difficult problems: it's possible to make it work. The monolithic architecture of Windows is often what unexpectedly causes problems between these separate lock hierarchies to arise in the first place (e.g. due to a Windows API function internally doing COM without the programmer's explicit knowledge).
Note (from the future): Solutions that aim to keep loader lock above the COM lock in the lock hierarchy must account for Windows delay loading. Delay loading can cause library loading any time one DLL calls an import of another DLL. As a result, COM must be highly cautious not to call any delay loaded imports while holding the COM lock. Recursively resolving delay loads of combase.dll
(possibly also ole32.dll
) to load immediately would likely be infeasible due to delay loading being the hack that holds Windows together. While Windows architecture makes this goal challenging, it should be possible to achieve with lots of dependency chain micromanaging (on the import level, not DLL level).
Note: I do not claim to know every requirement COM may have in every situation, and outside of a select few people at Microsoft, neither does anyone else. So, here I just try to exhaust any possible issue that could arise.
COM may perform other operations that cause a deadlock:
If some COM operation requires spawning and waiting on a new thread, then spawn that thread with the THREAD_CREATE_FLAGS_SKIP_LOADER_INIT
flag and avoid use of DLL_THREAD_ATTACH
/DLL_THREAD_DETACH
in the relevant Windows API subsystems, or generally remove loader thread blockers. If these objectives are unattainable (likely due to backward compatibility with rash design choices), then we will provide an alternative solution below, although it requires some extra work because loader thread routines typically create thread-local data.
If some COM operation further requires the waited-on new thread to load additional libraries (Is this something that can happen? It is a naturally tricky situation that even the GNU loader will deadlock under.) then things get hard enough that the only solution is to employ some small amount of cooperative multitasking. And even then, there are unknowns that prevent full cooperation. As a thought experiment though, we will give it our best shot. We are currently on "thread A" and the thread we just spawned is "thread B". Thread A is already the load owner. So, our approach to avoiding deadlock should be to attempt offloading library loading work from thread B to thread A (thread A would have to wait to see if it picks up any work) or just have thread A wait so thread B can act as the load owner for a bit (this means "loader lock" would no longer be a thread-synchronization mechanism in the typical sense, which could get complicated). We will go for the second option. After thread A does the system call to spawn thread B, thread A (in NtCreateThreadEx
) should check if it's under loader lock. If so, thread A must wait until it receives a signal from thread B that it's completed its loader work. Thread B should assign its thread as the load owner (LoadOwner
flag in TEB.SameTebFlags
; the global LdrpWorkInProgress
state is already 1
) at the same time as thread A is already the load owner. There should be some special state that allows thread B to bypass load owner locks, which is safe because thread A is waiting for us. With thread A waiting, thread B can now safely load libraries as normal; this works because the loader is reentrant and all the loader data structures are global. When thread B finishes loader work, it can signal thread A to proceed. If removing thread blockers is doable, the first solution would probably be simpler and more tenable; however, with DLL_THREAD_ATTACH
/DLL_THREAD_DETACH
then you likely need to use the second, more complex solution due to thread-local data.
As with any approach, there are glaring issues with this method of avoiding deadlock. The first issue is that thread B cannot know when it will be done with loader work (if it is to do any following DLL_THREAD_ATTACH
/DLL_THREAD_DETACH
) because the loader cannot forsee dynamic loading operations (LoadLibrary
/FreeLibrary
or delay loading) ahead of time. Some Unix-like loader implementations don't support dynamic loading due to the deadlock problems that can occur when libraries load unexpectedly and the performance benefits you can achieve from not having to program for that scenario. The ideal situation is always to load all a program's dependencies once at process startup then never load/unload libraries again for the lifetime of the process. This issue is already a showstopper, but in the name of education, we will continue. The second issue comes from the fact that thread A is in a DLL's DllMain
or module initializer performing its initialization. As a result, that DLL could be partially initialized, which may become problematic in a complex operation due to circular dependencies running rampant in the Windows API (the loader initializes dependency DLLs before the dependent DLL except when it cannot due to circular dependencies). If circular dependencies exist, then the loader cannot know which module to initialize first (Microsoft uses delay loading as a dodgy workaround for this issue, which is a solution that falls apart if one of the DLL's delay loads is accidentally triggered from DLL_PROCESS_ATTACH
/DLL_PROCESS_DETACH
, something that can easily happen due to the tight coupling of Windows DLLs). Dependency loops are a fundamental problem and a complex operation in a module initializer/finalizer, constructor/destructor running in the module context, or DllMain
could realize risk stemming from that problem. The third issue is that waiting on a thread for any amount of time when not explicitly told to (WaitForSingleObject(hThread)
) leads to poor performance.
Having a dedicated load owner thread at all times or one that spawns in with a timer (similar to the current loader worker threads), provided suitable optimizations are in place, is likely the most realistic and workable solution. However, Microsoft would have to create a mechanism for threaded operations (e.g. FlsAlloc
, FlsGetValue
, and FlsSetValue
functions) done at DLL_THREAD_ATTACH
and DLL_THREAD_DETACH
to affect and access the thread requesting the DLL_THREAD_ATTACH
/DLL_THREAD_DETACH
instead of the dedicated load owner thread itself (also, the GetCurrentThread
and GetCurrentThreadId
functions may have to lie to maintain application compatibility). Process exit (ntdll!LdrShutdownProcess
) reliably occurring on this dedicated load owner thread (while not pretending to be another thread) would also resolve the thread affinity issue that can occur when using Windows APIs that store thread-local data (e.g. CoInitialize
/CoInitializeEx
).
DllMain
can now safely run COM code (in theory).
Note that applying this solution to the CLR may mean removing the CLR loader because its continued presence means foreign (third-party) code could violate the loader lock ➜ CLR assembly initialization lock (I'm assuming the CLR loader has an assembly initialization lock similar to the native loader's loader lock) lock hierarchy without our knowledge. Or rather, the CLR loader could continue existing but its assembly initialization actions would happen in DllMain
and since the initialization of a CLR assembly file maps 1:1 to the initialization of a DLL file, you can safely do away with any CLR assembly initialization lock. See this information covering C# and .NET constructors for more information. See here for investigating the idea of MT-safe library initialization.
Note that the solutions I provide above are mostly thought experiments. Due to Windows' architectural issues, I can only offer bandage patches.
An enclave is a security feature that isolates a region of data or code within an application's address space. Enclaves utilize one of three backing technologies to provide this security feature: Intel Software Guard Extensions (SGX), AMD Secure Encrypted Virtualization (SEV), or Virtualization-based Security (VBS). The Intel and AMD solutions are memory encryptors; they safeguard sensitive memory by encrypting it at the hardware level. VBS securely isolates sensitive memory by containing it in virtual secure mode where even the NT kernel cannot access it.
Within SGX, there are SGX1 and SGX2. SGX1 only allows using statically allocated memory with a size set before enclave initialization; SGX2 adds support for an enclave to allocate memory dynamically. Due to this SGX1 limitation, putting a library into an enclave on SGX1 requires that the library be statically linked. On the other hand, SGX2 supports dynamically loading/linking libraries.
A Windows VBS-based enclave requires Microsoft's signature as the root in the chain of trust or as a countersignature on a third party's certificate of an Authenticode-signed DLL. The signature must contain a specific extended key usage (EKU) value that permits running as an enclave. Thanks to the Windows Internals: System architecture, processes, threads, memory management, and more, Part 1 (7th edition) book for this tidbit on VBS signing requirements. Enabling test signing on your system can get an unsigned enclave running. VBS enclaves may call enclave-compatible versions of Windows API functions from the enclave version of the same library.
Both Windows and Linux support loading libraries into enclaves. An enclave library requires special preparation; a programmer cannot load any generic library into an enclave.
Windows integrates enclaves as part of the native loader. One may CreateEnclave
to make an enclave and then call LoadEnclaveImage to load a library into an enclave. Internally, CreateEnclave
(public API) calls ntdll!LdrCreateEnclave
, which then calls ntdll!NtCreateEnclave
; this function only does the system call to create an enclave, and then calls LdrpCreateSoftwareEnclave
to initialize and link the new enclave entry into the ntdll!LdrpEnclaveList
list. ntdll!LdrpEnclaveList
is the list head, its list entries of type LDR_SOFTWARE_ENCLAVE
are allocated onto the heap and linked into the list. Compiling an enclave library requires special compilation steps (e.g. compiling for Intel SGX). The Windows loader only supports Intel SGX enclaves (for Intel CPUs); it does not support AMD SEV enclaves (for AMD CPUs).
The GNU loader has no knowledge of enclaves. Intel provides the Intel SGX SDK and the Intel SGX Platform Software (PSW) necessary for using SGX on Linux. The SGX driver awaits upstreaming into the Linux source tree. Here is some sample code for loading an SGX module. Linux currently has no equivalent to Windows Virtualization-based Security; however, Hypervisor-Enforced Kernel Integrity (Heki), including patches to the kernel, is on its way. Developers at Microsoft are introducing this new feature into Linux.
If you are interested in the CPU-based enclave technologies themselves, here is a comparison of Intel SGX and AMD SEV.
Open Enclave is a cross-platform and hardware-agnostic open source library for utilizing enclaves. Here is some sample code calling into an enclave library.
Side Note regarding the Windows LdrpObtainLockedEnclave
Function:
The Windows loader uses ntdll!LdrpObtainLockedEnclave
to obtain the enclave lock for a module, doing per-node locking. The LdrpObtainLockedEnclave
function acquires the ntdll!LdrpEnclaveListLock
lock and then searches from the ntdll!LdrpEnclaveList
list head to find an enclave for the DLL image base address this function receives as its first argument. If LdrpObtainLockedEnclave
finds a matching enclave, it atomically increments a reference count and enters a critical section stored in the enclave structure before returning that enclave's address to the caller. Typically (unless your process uses enclaves), the list at ntdll!LdrpEnclaveList
will be empty, effectively making LdrpObtainLockedEnclave
a no-op.
The LdrpObtainLockedEnclave
function is called every time GetProcAddress
➜ LdrGetProcedureAddressForCaller
runs (bloat alert), so I wanted to give it a look.
A Windows module list is a circular doubly linked list of type LDR_DATA_TABLE_ENTRY
. The loader maintains multiple LDR_DATA_TABLE_ENTRY
lists containing the same entries but in different link orders. These lists include InLoadOrderModuleList
, InMemoryOrderModuleList
, and InInitializationOrderModuleList
, which are the list heads that can be found in ntdll!PEB_LDR_DATA
. Each LDR_DATA_TABLE_ENTRY
structure houses InLoadOrderLinks
, InMemoryOrderLinks
, and InInitializationOrderLinks
, which are LIST_ENTRY
structures (containing both Flink
and Blink
pointers), thus building the module lists between LDR_DATA_TABLE_ENTRY
nodes.
The glibc (on Linux) module list is a linear (i.e. non-circular) doubly linked list of type link_map
. link_map
contains both the next
and prev
pointers used to link the module information structures together into a list. glibc makes the list of loaded modules accessible for debugging purposes through the r_debug->r_map
symbol it exposes, which is a list head for the list of link_map
structures.
This section only covers the module information data structure that is central to each loader.
Operating-system-level synchronization mechanisms come in many kinds, including thread synchronization locks (e.g. a Windows critical section or POSIX mutex), readers-writer locks (e.g. Windows slim reader/writer locks or POSIX rwlock locks, which can be acquired in exclusive/write or shared/read mode), and inter-process synchronization locks. "Operating-system-level" means these locks may make a system call (a syscall
instruction on x86-64) to perform a non-busy wait when the synchronization mechanism is owned, contended, or being waited on.
An intra-process OS lock uses an atomic flag as its locking primitive when there is no contention (e.g. implemented with the lock cmpxchg
instruction on x86). Inter-process locks such as Win32 event synchronization objects must rely entirely on system calls to provide synchronization (e.g. the Windows event object NtSetEvent
and NtResetEvent
functions are just stubs containing a syscall
instruction).
Some lock varieties are a mix of an OS lock and a spinlock (i.e. busy loop). For example, both a Windows critical section and a GNU mutex (not POSIX; this is a GNU extension) support specifying a spin count. When there is contention on a lock, its spin count is a potential performance optimization for avoiding the expensive context switch between user mode and kernel mode that occurs when performing a system call.
Windows LdrpModuleDatatableLock
- Performs full blocking access to its respective module information data structures
- This includes two linked lists (
InLoadOrderModuleList
andInMemoryOrderModuleList
), a hash table, two red-black trees, and two directed acyclic graphs- Note: The DAGs only require
LdrpModuleDatatableLock
protection to ensure synchronization/consistency with other module information data structures during write operations (e.g. adding/deleting nodes)- In addition to acquiring the
LdrpModuleDatatableLock
lock, safely modifying the DAGs requires that your thread be the loader owner (LoadOwner
inTEB.SameTebFlags
, incrementingntdll!LdrpWorkInProgress
, andntdll!LdrpLoadCompleteEvent
) - With load owner status, read operations (e.g. walking between nodes) are naturally protected depending on where the load owner is in the loading process (see: Windows Loader Module State Transitions Overview)
- This lock also protects some structure members contained within these data structures (e.g. the
LDR_DDAG_NODE.LoadCount
reference counter)
- Windows briefly acquires
LdrpModuleDatatableLock
many times (I counted 17 exactly) for everyLoadLibrary
(tested with a full path to an empty DLL; a completely fresh Visual Studio DLL project)- Acquiring this lock so many times could create contention on
LdrpModuleDatatableLock
, even if the lock is only held for short sprints - Monitor changes to
LdrpModuleDatatableLock
by setting a watchpoint:ba w8 ntdll!LdrpModuleDatatableLock
(ensure you don't count unlocks)- Note: There are a few occurrences of this lock's data being modified directly for unlocking instead of calling
RtlReleaseSRWLockExclusive
(this is likely done as a performance optimization on hot paths)
- Implemented as a slim read/write (SRW) lock, and the Windows loader only ever acquires it in the exclusive/write locking mode
Linux (GNU loader) dl_load_write_lock
- Performs full blocking (exclusive/write) access to its respective module data structures
- On Linux, this is only a linked list (the
link_map
list) - Linux shortly acquires
dl_load_write_lock
once on everydlopen
from the_dl_add_to_namespace_list
internal function (see the GDB log for evidence)- Other functions that acquire
dl_load_write_lock
(not called duringdlopen
) include thedl_iterate_phdr
function, which is for iterating over the module list- According to glibc source code, this lock is acquired to: "keep
__dl_iterate_phdr
from inspecting the list of loaded objects while an object is added to or removed from that list." - On Windows, acquiring the equivalent
LdrpModuleDatatableLock
is required to iterate the module list safely (e.g. when calling theLdrpFindLoadedDllByNameLockHeld
function)
Windows LdrpLoaderLock
- Blocks concurrent module initialization/deinitialization and protects the
InInitializationOrderModuleList
linked list- Safely running a module's
DLL_PROCESS_ATTACH
requires holding loader lock- At process exit,
RtlExitUserProcess
acquires loader lock before running the library deinitialization case (DLL_PROCESS_DETACH
) ofDllMain
- This protection includes DLL thread initialization/deinitialization during
DLL_THREAD_ATTACH
/DLL_THREAD_DETACH
- On the modern Windows loader at
DLL_PROCESS_ATTACH
, the loader lock remains locked as each full dependency chain of a loading DLL, including the loading DLL itself, is initialized (i.e. the loader locks loader lock once before startingLdrpInitializeGraphRecurse
and unlocks after returning from that function) - Loader lock protects against concurrent module initialization because the initialization routine of one module may depend on another module already being initialized, hence why library initialization must be serialized (i.e. done in series; not in parallel)
- Implemented as a critical section
Linux (GNU loader) dl_load_lock
- This lock is acquired right at the start of a
_dl_open
,dlclose
, and other loader functionsdlopen
eventually calls _dl_open
after some preparation work (which shows in the call stack) like setting up an exception handler, at which point the loader is committed to doing some loader work
- According to glibc source code, this lock's purpose is to: "Protect against concurrent loads and unloads."
- This protection includes concurrent module initialization similar to how a modern Windows
ntdll!LdrpLoaderLock
does - For example,
dl_load_lock
protects a concurrentdlclose
from running a library's module destructors before that library's module initializers have finished running
dl_load_lock
is at the top of the loader's lock hierarchy- Since
dl_load_lock
protects the entire library loading/unloading process from beginning to end, the closest modern Windows loader equivalent synchronization mechanism would be theLdrpLoadCompleteEvent
loader event (this is when the soon-to-be load owner thread increments ntdll!LdrpWorkInProgress from 0 to 1) combined with holding thentdll!LdrpLoaderLock
lock
Linux (GNU loader) _ns_unique_sym_table.lock
- This is a per-namespace lock for protecting that namespace's unique (
STB_GNU_UNIQUE
) symbol hash table STB_GNU_UNIQUE
symbols are a type of symbol that a module can expose; they are considered a misfeature of the GNU loader- As standardized by the ELF executable format, the GNU loader uses a per-module statically allocated (at compile time) symbol table for locating symbols within a module (
.so
shared object file); however, the_ns_unique_sym_table.lock
lock protects a separate dynamically allocated hash table specially forSTB_GNU_UNIQUE
symbols- See the Procedure/Symbol Lookup Comparison (Windows
GetProcAddress
vs POSIXdlsym
GNU Implementation) section for more information on how the GNU loader typically locates symbols - In Windows terminology, the closest approximation to a symbol would be a DLL's function exports (there's no mention of the word "export" in the
objdump
manual) - Use
readelf
to dump all the symbols, including unique symbols, of an ELF file:readelf --symbols --file <FILE>
(Note: Thereadelf
tool is preferable overobjdump
)
- Internally, the call chain for looking up a
STB_GNU_UNIQUE
symbol starting withdlsym
goesdlsym
➜dl_lookup_symbol_x
➜do_lookup_x
➜do_lookup_unique
where finally,_ns_unique_sym_table.lock
is acquired - For more information on
STB_GNU_UNIQUE
symbols, see theSTB_GNU_UNIQUE
section
This is where the loader's locks end for the GNU loader on Linux. Other than that, there's only a lock specific to thread-local storage (TLS) if your module uses that, as well as global scope (GSCOPE) locking.
However, Windows has more synchronization mechanisms that control the loader, including:
- Loader events (Win32 event objects), including:
ntdll!LdrpInitCompleteEvent
- This event being set indicates loader initialization is complete
- Loader initialization includes process initialization
- This event is only set (
NtSetEvent
) by theLdrpProcessInitializationComplete
function soon afterLdrpInitializeProcess
returns, at which point it's never set/unset again
- Thread startup waits on this
- This event is not an auto-reset event and is created in the nonsignaled (i.e. waiting) state
- The
LdrpInitialize
(_LdrpInitialize
) function creates this event- This event is created before loader initialization begins (early at process creation)
ntdll!LdrpLoadCompleteEvent
- This event being set indicates an entire load/unload has completed or that running
DllMain
routines has completed- This event is set (
NtSetEvent
) in theLdrpDropLastInProgressCount
function before relinquishing control as the load owner (LoadOwner
flag inTEB.SameTebFlags
)
- Thread startup waits on this
- This event is an auto-reset event and is created in the nonsignaled (i.e. waiting) state
- Created by
LdrpInitParallelLoadingSupport
callingLdrpCreateLoaderEvents
LdrpWorkCompleteEvent
- This event being set indicates the loader has completed processing (i.e. mapping and snapping) the entire work queue and all loader work threads are finished processing work
- This event is set (
NtSetEvent
) immediately before theLdrpProcessWork
function returns if the work queue is now empty or all current loader worker threads are finished processing work
- This event is an auto-reset event and is created in the nonsignaled (i.e. waiting) state
- Created by
LdrpInitParallelLoadingSupport
callingLdrpCreateLoaderEvents
- All of these events are created by NTDLL at loader/process startup using
NtCreateEvent
- For in-depth information on the latter two events, see the High-Level Loader Synchronization section
ntdll!LdrpWorkQueueLock
- Implemented as a critical section
- Used in
LdrpDrainWorkQueue
so only one thread can access theLdrpWorkQueue
queue at a time
ntdll!LdrpDllNotificationLock
- This is a critical section; it's typically locked (
RtlEnterCriticalSection
) in thentdll!LdrpSendPostSnapNotifications
function (called byntdll!PrepareModuleForExecution
➜ntdll!LdrpNotifyLoadOfGraph
; this happens in advance ofntdll!PrepareModuleForExecution
callingLdrpInitializeGraphRecurse
to do module initialization)
LdrpSendPostSnapNotifications
function acquires this lock to ensure consistency with other functions that run DLL notification callbacks before fetching app compatibility data (SbUpdateSwitchContextBasedOnDll
function) and potentially running post snap DLL notification callbacks (look forcall qword ptr [ntdll!__guard_dispatch_icall_fptr (<ADDRESS>)]
in disassembly)- Typically,
LdrpSendPostSnapNotifications
doesn't run any callback functions - Other functions that directly send DLL notifications:
LdrpSendShimEngineInitialNotifications
(called byLdrpLoadShimEngine
andLdrpDynamicShimModule
)
- The
ntdll!LdrpSendNotifications
function (called byLdrpSendPostSnapNotifications
) recursively acquires this lock to safely access the LdrpDllNotificationList
list so it can call the callback functions stored inside- By default, the
LdrpDllNotificationList
list is empty (so theLdrpSendDllNotifications
function doesn't send any callbacks) - Notifications callbacks are registered with
LdrRegisterDllNotification
and are then sent withLdrpSendDllNotifications
(it runs the callback function) - Functions calling DLL notification callbacks must hold the
LdrpDllNotificationLock
lock during callback execution. This is similar to how executing module initialization/deinitialization code (DllMain
) requires holding theLdrpLoaderLock
lock. - Other functions that may call
LdrpSendDllNotifications
include:LdrpUnloadNode
andLdrpCorProcessImports
(may be called byLdrpMapDllWithSectionHandle
) LdrpSendDllNotifications
takes a pointer to aLDR_DATA_TABLE_ENTRY
structure as its first argument- In ReactOS,
LdrpSendDllNotifications
is referenced inLdrUnloadDll
sending a shutdown notification with aFIXME
comment (not implemented yet)://LdrpSendDllNotifications(CurrentEntry, 2, LdrpShutdownInProgress);
- ReactOS code analysis: If
ntdll!LdrpShutdownInProgress
(the internal NTDLL variable referencing the exposedPEB_LDR_DATA.ShutdownInProgress
) is set,LdrUnloadDll
skips deinitialization and releases loader lock early, presumably so shutdown happens faster (don't need to deinitialize a module if the whole process is about to not exist)
- Reading loader disassembly, you may see quite a few places where loader functions check the
LdrpDllNotificationLock
lock like so:RtlIsCriticalSectionLockedByThread(&LdrpDllNotificationLock)
- For instance, in the
ntdll!LdrpAllocateModuleEntry
,ntdll!LdrGetProcedureAddressForCaller
,ntdll!LdrpPrepareModuleForExecution
, andntdll!LdrpMapDllWithSectionHandle
functions - These checks detect if the current thread is executing a DLL notification callback and then implement special logic for that edge case (for this reason, they can generally be ignored)
- By putting Google Chrome under WinDbg, I found an instance where the loader ran a post-snap DLL notification callback. The callback ran the
apphelp!SE_DllLoaded
function.
LDR_DATA_TABLE_ENTRY.Lock
- Starting with Windows 10, each
LDR_DATA_TABLE_ENTRY
has aPVOID
Lock
member (it replaced aSpare
slot) - The
LdrpWriteBackProtectedDelayLoad
function uses this per-node lock to implement some level of protection during delay loading- This function calls
NtProtectVirtualMemory
, so it likely protects the process of setting and resetting memory protection on a module's Import Address Table (IAT) during the lazy linking part of Windows delay loading
PEB.TppWorkerpListLock
- This SRW lock (typically acquired exclusively) exists in the PEB to control access to the member immediately below it, which is the
TppWorkerpList
doubly linked list - This list keeps track of all the threads belonging to any thread pool in the process
- These threads show up as
ntdll!TppWorkerThread
threads in WinDbg - There's a list head, after which each
LIST_ENTRY
points into the stack memory of a thread owned by a thread pool- The
TpInitializePackage
function (called byLdrpInitializeProcess
) initializes the list head, then the mainntdll!TppWorkerThread
function of each new thread belonging to a thread pool adds itself to the list
- The threads in this list include threads belonging to the loader worker thread pool (
LoaderWorker
inTEB.SameTebFlags
)
- This is a bit out of scope since it relates to thread pool internals, not loader internals (however, the loader relies on thread pool internals to implement parallelism for loader workers)
- Searching symbols reveals more of the loader's locks:
x ntdll!Ldr*Lock
LdrpDllDirectoryLock
(SRW lock, sometimes acquired in shared mode),LdrpTlsLock
(SRW lock, sometimes acquired in shared mode),LdrpEnclaveListLock
(critical section lock),LdrpPathLock
(SRW lock, only acquired exclusive mode),LdrpInvertedFunctionTableSRWLock
(SRW lock, sometimes acquired in shared mode, high contention, locking and unlocking functions are inlined),LdrpVehLock
(SRW lock, only acquired exclusive mode),LdrpForkActiveLock
(SRW lock, sometimes acquired in shared mode),LdrpCODScenarioLock
(SRW lock, only acquired exclusive mode, COD stands for component on demand and it is an application compatibility mechanism that integrates with the Program Compatibility Assistant service),LdrpMrdataLock
(SRW lock, only acquired exclusive mode), andLdrpVchLock
(SRW lock, only acquired exclusive mode)
- The Windows loader may also dynamically create and destroy temporary synchronization objects in some cases (e.g. see the Windows loader's calls to
ntdll!ZwCreateEvent
)
A state value may either be shared state (also known as global state) or local state (i.e. whether separate threads may access the state). Access to shared state is protected by one of the aforementioned locks; local state does not require protection. Only a few key pieces of Windows loader state I came across are listed here.
LDR_DDAG_NODE.State
(pointed to byLDR_DATA_TABLE_ENTRY.DdagNode
)- Each module has a
LDR_DDAG_NODE
structure with aState
member containing 15 possible states -5 through 9 LDR_DDAG_NODE.State
tracks a module's entire lifetime from allocating module information data structures (LdrpAllocatePlaceHolder
) and loading to unload and subsequent deallocation of its module information structures- In my opinion, this makes the combined
LDR_DDAG_NODE.State
values of all modules the most important piece of loader state
- Performing each state change may necessitate acquiring the
LdrpModuleDatatableLock
lock to ensure consistency between module information data structures- The specific state changes requiring
LdrpModuleDatatableLock
protection (i.e. consistency between all module information data structures) are documented in the link below
- The specific state changes requiring
- Please see the Windows Loader Module State Transitions Overview for more information
- `ntdll!LdrpWorkInProgress`
  - This reference counter is a key piece of loader state (zero meaning work is not in progress, and anything above zero meaning work is in progress)
    - It's not modified atomically with `lock`-prefixed instructions
  - Acquiring the `LdrpWorkQueueLock` lock is a requirement for safely modifying the `LdrpWorkInProgress` state and the `LdrpWorkQueue` linked list
    - I verified this by setting a watchpoint on `LdrpWorkInProgress` and noticing that `LdrpWorkQueueLock` is always locked while checking/modifying the `LdrpWorkInProgress` state (also, I searched the disassembly code)
      - The `LdrpDropLastInProgressCount` function makes this clear because it briefly acquires `LdrpWorkQueueLock` around only the single assembly instruction that sets `LdrpWorkInProgress` to zero
  - Please see the Windows Loader Module State Transitions Overview for more information
- `ntdll!LdrInitState`
  - This value is not modified atomically with `lock`-prefixed instructions
    - Loader initialization is a procedural process only occurring once and on one thread, so this value doesn't require protection
  - In ReactOS code, the equivalent value is `LdrpInLdrInit`, which the code declares as a `BOOLEAN` value
  - In a modern Windows loader, this is a 32-bit integer (likely an `enum`) ranging from zero to three; here are the state transitions:
    - `LdrpInitialize` initializes `LdrInitState` to zero (loader is uninitialized)
    - `LdrpInitializeProcess` calls `LdrpEnableParallelLoading` and immediately after sets `LdrInitState` to one (mapping and snapping the dependency graph)
    - `LdrpInitializeProcess` sets `LdrInitState` to two (initializing the dependency graph)
      - The `DLL_PROCESS_ATTACH` routines of `DllMain` at process startup run with this state active
    - `LdrpInitialize` (after `LdrpInitializeProcess` has returned), shortly before calling `LdrpProcessInitializationComplete`, sets `LdrInitState` to three (loader initialization is done)
- `LDR_DDAG_NODE.LoadCount`
  - This is the reference counter for a `LDR_DDAG_NODE` structure; safely modifying it requires acquiring the `LdrpModuleDatatableLock` lock
- `TEB.WaitingOnLoaderLock` is thread-specific data set when a thread is waiting for loader lock
  - `RtlpWaitOnCriticalSection` (`RtlEnterCriticalSection` calls `RtlpEnterCriticalSectionContended`, which calls this function) checks if the contended critical section is `LdrpLoaderLock` and, if so, sets `TEB.WaitingOnLoaderLock` equal to one
    - This branch condition runs every time any contended critical section gets waited on, which is interesting (monolithic much?)
  - `RtlpNotOwnerCriticalSection` (called from `RtlLeaveCriticalSection`) also checks `LdrpLoaderLock` (and some other information from `PEB_LDR_DATA`) for special handling
    - However, this is only for error handling and debugging because a thread that doesn't own a critical section should never have attempted to leave it in the first place
- Flags in `TEB.SameTebFlags`, including: `LoadOwner`, `LoaderWorker`, and `SkipLoaderInit`
  - All of these were introduced in Windows 10 (`SkipLoaderInit` only in 1703 and later)
  - `LoadOwner` (flag mask `0x1000`) is state that a thread uses to inform itself that it's the one responsible for completing the work in progress (`ntdll!LdrpWorkInProgress`)
    - The `LdrpDrainWorkQueue` function sets the `LoadOwner` flag on the current thread immediately after setting `ntdll!LdrpWorkInProgress` to `1`, thus directly connecting these two pieces of state
    - The `LdrpDropLastInProgressCount` function unsets this flag along with `ntdll!LdrpWorkInProgress`
    - Any thread doing loader work (e.g. `LoadLibrary`) will temporarily receive this TEB flag
    - This state is local to the thread (in the TEB), so it doesn't require the protection of a lock
  - `LoaderWorker` (flag mask `0x2000`) identifies loader worker threads
    - These show up as `ntdll!TppWorkerThread` in WinDbg
    - On thread creation, `LdrpInitialize` checks if the thread is a loader worker and, if so, handles it specially
    - This flag can be set on a new thread using `NtCreateThreadEx`
  - `SkipLoaderInit` (flag mask `0x4000`) tells the spawning thread to skip all loader initialization
    - In the `LdrpInitialize` IDA decompilation, you can see `SameTebFlags` being tested for `0x4000`, and if the flag is present, loader initialization is completely skipped (`_LdrpInitialize` is never called)
    - This could be useful for creating new threads without being blocked by loader events
    - This flag can be set on a new thread using `NtCreateThreadEx`
- `ntdll!LdrpMapAndSnapWork`
  - An undocumented, global structure that loader worker (`LoaderWorker` flag in `TEB.SameTebFlags`) threads read from to get mapping and snapping work
  - The first member of this undocumented structure is an atomic reference counter that gets incremented whenever work is enqueued (`ntdll!LdrpQueueWork` function) onto the global work queue (`ntdll!LdrpWorkQueue` linked list) and decremented whenever a loader worker thread consumes work
  - The counter is incremented by the `ntdll!TppWorkPost` function (called by `ntdll!LdrpQueueWork`) and decremented by the `ntdll!TppIopCallbackEpilog` function (called by thread pool internals in a loader worker thread), which means this undocumented structure belongs to thread pool internals and so is a bit out of scope here
An atomic state value is modified using a single assembly instruction. On an SMP operating system (e.g. Windows and very likely your build of Linux; check with the `uname -a` command) with a multi-core processor, this instruction must include a `lock` prefix (on x86) so the processor knows to synchronize that memory access across CPU cores. The x86 ISA mandates that a single, aligned memory read or write is atomic by default. It is only when combining multiple reads or writes into one operation that the `lock` prefix is necessary to guarantee atomicity (e.g. the read + write operation, typically performed atomically, that is needed to increment or decrement a reference counter). Only a few key pieces of the Windows loader atomic state I came across are listed here.
- `ntdll!LdrpProcessInitialized`
  - This value is modified atomically with a `lock cmpxchg` instruction
  - As the name implies, it indicates whether process initialization has been completed (`LdrpInitializeProcess`)
  - This is an enum ranging from zero to two; here are the state transitions:
    - NTDLL compile-time initialization starts `LdrpProcessInitialized` with a value of zero (process is uninitialized)
    - `LdrpInitialize` increments `LdrpProcessInitialized` to one early on (initialization event created)
      - If the process is still initializing, newly spawned threads jump to calling `NtWaitForSingleObject`, waiting on the `LdrpInitCompleteEvent` loader event before proceeding
      - Before the loader calls `NtCreateEvent` to create `LdrpInitCompleteEvent` at process startup, spawning a thread into a process causes it to use `LdrpProcessInitialized` as a spinlock (i.e. a busy loop)
        - For example, if a remote process calls `CreateRemoteThread` and the thread spawns before the creation of `LdrpInitCompleteEvent` (an unlikely but possible race condition)
    - `LdrpProcessInitializationComplete` increments `LdrpProcessInitialized` to two (process initialization is done)
      - This happens immediately before setting the `LdrpInitCompleteEvent` loader event so other threads can run
      - After `LdrpProcessInitializationComplete` returns, `NtTestAlert` processes the asynchronous procedure call (APC) queue and, finally, `NtContinue` yields code execution of the current thread to `KERNEL32!BaseThreadInitThunk`, which eventually runs our program's `main` function
- `LDR_DATA_TABLE_ENTRY.ReferenceCount`
  - This is a reference counter for `LDR_DATA_TABLE_ENTRY` structures
  - On load/unload, this state is modified by `lock inc`/`lock dec` instructions, respectively (although, on the initial allocation of a `LDR_DATA_TABLE_ENTRY` before linking it into any shared data structures, of course, no locking is necessary)
  - The `LdrpDereferenceModule` function atomically decrements `LDR_DATA_TABLE_ENTRY.ReferenceCount` by passing `0xffffffff` to a `lock xadd` assembly instruction, causing the 32-bit integer to wrap around to one less than what it was (x86 assembly doesn't have an `xsub` instruction, so this is the standard way of doing it)
    - Note that the `xadd` instruction isn't the same as the `add` instruction because the former also atomically exchanges (hence the "x") the previous memory value into the source operand
      - This is at the assembly level; in code, Microsoft is likely using the `InterlockedExchangeSubtract` macro to do this
    - The `LdrpDereferenceModule` function tests (among other things) whether the previous memory value was 1 (meaning post-decrement, it's now zero; i.e. nobody is referencing this `LDR_DATA_TABLE_ENTRY` anymore) and takes that as a cue to unmap the entire module from memory (calling the `LdrpUnmapModule` function, deallocating memory structures, etc.)
- `ntdll!LdrpLoaderLockAcquisitionCount`
  - This value is modified atomically with a `lock xadd` instruction
  - It was only ever used as part of cookie generation (web archive, because Doxygen links can change) in the `LdrLockLoaderLock` function
    - On both older and modern loaders, `LdrLockLoaderLock` adds to `LdrpLoaderLockAcquisitionCount` every time it acquires the loader lock (it's never decremented)
  - In a legacy (Windows Server 2003) Windows loader, the `LdrLockLoaderLock` function (an NTDLL export) was often used internally by NTDLL, even in `Ldr`-prefixed functions. However, in a modern Windows loader, it's mostly phased out in favor of the `LdrpAcquireLoaderLock` function
  - In a modern loader, the only places where I see `LdrLockLoaderLock` called are from non-`Ldr`-prefixed functions, specifically: `TppWorkCallbackPrologRelease` and `TppIopExecuteCallback` (thread pool internals, still in NTDLL)
Indeed, the Windows loader is an intricate and monolithic state machine. Its complexity and deep ties into the rest of the operating system stand out when put side-by-side with the comparatively simple glibc loader on Linux.
A component model is an object-oriented framework for creating modular, reusable components and facilitating communication between them. Objects are useful because they encapsulate information (hiding implementation details), divide systems into distinct and modular components, are reusable across applications and programming languages, and for many more reasons common to object-oriented programming. Cons of objects include overhead (in low-level programming, allocating objects and making method calls can be taxing and may introduce unnecessary complexity) and, potentially, lessened security (type confusion vulnerabilities, typically occurring with complex object types, are an especially dangerous vulnerability class that tends to bypass exploit mitigations because no overflow into other memory occurs and no creation of a fake pointer is necessary). A framework for transporting objects, calls, or messages is useful to ease communication between components and abstract away implementation details that may separate them. If used with intent to solve the right problem, such a framework is a fantastic tool with no cons other than some acceptable overhead.
The centerpiece of any component model is that each connection by a client corresponds to an instance of a component or object on the server. As we will see, there are other common elements, but all component models revolve around this core idea of a connection to an instance (e.g. `CoCreateInstance` on Windows or `NSXPCConnection` on Mac) of an object, thus enabling interaction with the services, methods, or interfaces exposed by that component.
Component Object Model (COM) is Microsoft's component framework. A component in COM is described by its binary interface, thus creating an ABI to communicate with. A programmer writes interface descriptions in the Microsoft Interface Definition Language (MIDL), which turns into that component's binary interface. COM works intra-process, inter-process, and between machines (with the addition of DCOM, which has since been integrated into COM). DCOM internally relies on Microsoft RPC. COM is deeply integrated into Windows, with many Windows APIs being implemented in COM (e.g. the Task Scheduler API) or using COM internally. The COM base interface class is `IUnknown`, which is the root interface from which all other interfaces are derived. COM and DCOM are openly specified protocols. COM is for Windows and isn't a cross-platform technology. Note that XPCOM, by Mozilla, bears no direct relation to COM.
Common Object Request Broker Architecture (CORBA) is an open, vendor-neutral standard that defines a framework for object-oriented communication across different platforms and programming languages. CORBA was developed by the Object Management Group (OMG), a consortium that creates and maintains standards for distributed computing. CORBA interfaces fit an exact binary description (like a C structure) that a programmer describes in OMG IDL. CORBA can be used intra-process, inter-process, and between machines. Inter-process and machine-to-machine communication is done using the General Inter-ORB Protocol (GIOP). The object request broker (ORB) is the central piece in CORBA, responsible for brokering requests between clients and server objects (similar to how an object-relational mapping or ORM framework, a technology that came after component model technology, allows interacting with an SQL database using the natural features of a given programming language, except directly between languages for elegant, platform-agnostic communication). ORBs allow for mapping objects between programming languages. The base interface class in CORBA is simply named `Object`. CORBA was one of the first component frameworks; it gained traction throughout the 1990s, but after inspiring many other early component frameworks, it fell by the wayside for a variety of reasons (including the rise of simple but powerful technologies, where applicable, such as REST).
There doesn't strictly exist a component framework common to GNU/Linux systems. Historically, Bonobo, based on CORBA, was the component framework of choice for the GNOME desktop. GNOME officially deprecated Bonobo in 2009 (for context, GNOME has existed since 1999) and has since switched to simpler and more modular technologies for doing the job of a component framework. These include D-Bus for inter-process communication, GObject for object-oriented development, GIO for location transparency (abstracting network file locations on the file system), and GTK technology for embedding application views. D-Bus is the Desktop Bus; it was designed specifically for communication between desktop apps, the desktop, and the operating system. The KDE graphical shell uses KParts as its component framework. KDE Frameworks is based on Qt. As a result, KParts is unique in that interfaces are described not in IDL but in the Qt Meta-Object System, which is a C++ class including the `Q_OBJECT` macro. KParts' use of the Qt Meta-Object System can be dynamic (evaluated at run-time) and cannot be reduced to a binary interface. `KParts::Part` is the base interface class from which all other interfaces inherit. KParts splits network transparency into its own KIO component (like GNOME's GIO). Indeed, GUI frameworks common to GNU/Linux were once, and in some parts still are, implemented in terms of components belonging to a component framework. After all, GNOME stands for GNU Network Object Model Environment, which speaks to its component model roots. The GNOME desktop was also heavily inspired by KDE, sharing some code in the beginning. The X Window System and protocol is well-known for its built-in network transparency (a common component model feature) because, throughout the 1980s and 1990s, expensive computing resources were often hosted centrally and shared between thin clients. Wayland didn't carry on the network transparency feature of X11. However, it's important to highlight that, in all cases, this complex and all-encompassing component model technology was only ever used for creating user interface components (justifiable by the inherently complex nature of a GUI framework). In general, Unix-like systems (including MacOS) commonly use simple but powerful Berkeley/POSIX sockets for inter-process communication.
On MacOS, component model technology exists as distributed objects (DO). The original design focus of component model and distributed object technologies varies in that the former focuses on modularity and encapsulation, whereas the latter focuses on remote object-oriented communication. However, in practice, implementations largely overlap to fulfill both purposes. Distributed objects are implemented using `NSXPCConnection` in modern MacOS. A distributed objects interface is described with an Objective-C protocol, making it dynamic, unlike an IDL-based component framework. `NSObject` is the root object from which all other objects inherit. An `NSPort` provides location transparency. Let's do a small walk through the history of component model technology on MacOS and put this technology in context. Cocoa is a general object-oriented framework for developing native applications targeting the MacOS platform. Developing for the Cocoa framework is typically done in Objective-C or Swift; however, other language bindings also exist. Cocoa eases interaction with core MacOS frameworks, including the Foundation framework. Within the Foundation framework, there exists the legacy and deprecated `NSConnection` API that "forms the backbone of the distributed objects mechanism". A new API for doing IPC, XPC, was internally added to MacOS in its 10.7 Lion release (2011). XPC streamlines inter-object and simple inter-process communication with its modular, lightweight, secure design. In MacOS 10.8 Mountain Lion (2012), `NSXPCConnection` was introduced to provide inter-object communication based on the new XPC backend. Apple superseded the `NSConnection` API for distributed objects with this new `NSXPCConnection` API. In MacOS 10.10 Yosemite (2014), XPC was published as its own inter-process communication mechanism. Bringing it together, on MacOS, XPC exists in two distinct forms: XPC provided by the Foundation framework for building object-oriented APIs (component model technology) and XPC provided by libSystem for performing low-level messaging (typical IPC). Note that the low-level XPC documentation says it communicates in "objects"; however, these objects are more like structures in that they can only store primitive data types along with some custom binary types. Moreover, a low-level XPC connection doesn't map to an instance of any object on the server side, which is what makes it more comparable to typical IPC.
Apple NSXPC Woes: The dynamic nature of interfaces supported by Apple's distributed objects may be too dynamic for its own good (not that COM would be a good technology for communicating with a sandboxed process, either). An interesting observation is that Apple iOS didn't publicly support low-level XPC until iOS 17.4 (significantly later than MacOS), which was only released in March 2024.
Microsoft Aggressively Promoted COM: This promotion included popular COM-based technologies at the time like MTS and ActiveX. In particular, COM on Windows NT was marketed as superior to the Unix platform.
Is COM Dead? (2000) by COM founder Don Box: COM was already falling out of fashion in 2000 but lives on inside of .NET (the CLR) and within Windows.
Brief NeXTSTEP History: The `NS` prefix on some Macintosh APIs refers to NeXT, the company Apple acquired to merge the NeXTSTEP operating system with the classic Mac OS, thus giving us the Mac OS X we know today (that's where the "X" comes from). As a result, Apple inherited lots of NeXTSTEP technology, including distributed objects.
The Microsoft Windows operating system is a large proponent of component model technology. COM is a binary-interface technology for creating software components as transparent objects. The technology is prominent throughout Windows, with many public Windows APIs being implemented in terms of COM interfaces or utilizing COM internally. For information about component model technology on other operating systems, see Component Model Technology Overview. Component Object Model (COM) technology exhibits a multitude of technical issues that can be split up into three groups: problems with COM, problems with how Windows uses COM, and problems with using component model technology at the operating-system-level.
Notice: Everything in this section is subject to change. Nothing in this section is final, and it is currently incomplete.
!!! WORK IN PROGRESS !!!
When investigating particularly complex and monolithic Windows components, it's important to remember the KISS design principle true of all engineering. The genius sees elegance in simplicity; the fool is captivated by convolution.
This section provides perspective between distinct systems on the history of key operating system components that are common to all modern computers today, as well as the operating systems themselves:
The first MS-DOS release was in 1980 (with version 1.0 coming out in 1981). MS-DOS was based on 86-DOS (originally named QDOS, the Quick and Dirty Operating System), which was infamously purchased for a total of $75,000 from Seattle Computer Products (initially $25,000 for a non-exclusive license, later followed by another $50,000 for an exclusive license). 86-DOS was primarily written by Tim Paterson. In the broad sense, a disk operating system (DOS) refers to any operating system that resides on a disk storage device (e.g. a floppy disk or hard disk). However, DOS typically refers to a family of simple operating systems that are 16-bit, single-user, and single-tasking, with a CLI and basic file/disk management capabilities. DOS systems didn't typically offer advanced features such as process scheduling, protected memory, permissions, and networking that Unix systems had. Other popular DOS operating systems around at the time, which often took inspiration from each other, included CP/M (1974), Apple DOS (1978), and DR-DOS by Novell (1988). OS/2 by Microsoft and IBM (before being called OS/2, the system was developed only by Unix and DOS programmers within Microsoft; by 1987, IBM had joined Microsoft in OS/2 development, and starting in 1992 it was developed exclusively by IBM) was a replacement for DOS that introduced advanced operating system features like protected mode.
This writing is a condensed history detailing Microsoft's Unix beginnings, how and why they departed from Unix, and how MS-DOS took off.
First created in 1969, UNIX was developed at Bell Laboratories, which at the time was part of the Bell System (1877-1984) and operated as the research and development arm within AT&T. The Bell System had been under antitrust scrutiny by US regulators ever since the 1910s and one of its effects was the 1956 Consent Decree. This legal agreement restricted members of the Bell System from manufacturing and selling computer products and required them to license their technology to all applicants in exchange for the payment of reasonable royalties.
In 1978, Microsoft purchased a license from AT&T, a Bell System member, for Version 7 Unix with the goal of porting it to microcomputers (Unix was originally developed for larger minicomputers like the PDP-7 and PDP-11). This is back when Microsoft was shipping a Unix-based operating system and believed Unix to be the "future desktop operating system, when machines got powerful enough to run something good" (thanks to Gordon Letwin, one of the first Microsoft employees, for this information). Xenix, the first microcomputer adaptation of Unix, was a huge success. By the mid-1980s, Xenix had become "the most widely installed Unix-based microcomputer operating system", and since microcomputers were taking over in this time period and throughout the end of the 1980s, it likely became the most popular Unix distribution overall.
In 1980, after failing to strike a deal with Digital Research (DR), IBM approached Microsoft in need of an operating system for the upcoming IBM PC. Xenix was unsuitable because of the hardware limitations of the IBM PC. Microsoft did not have an operating system for the IBM PC yet, but promised to deliver one. And they did, by purchasing 86-DOS for a total of $75,000 from Seattle Computer Products. The IBM PC and MS-DOS, which IBM rebranded to PC-DOS, were released on the same day: August 12, 1981.
Within a year of MS-DOS launching on the IBM PC, Microsoft had licensed the operating system to over 70 other companies. Microsoft quickly realized the commercial advantage of having complete control over the operating system they sold. And they no longer wished to compete with other Unix systems, nor with the developers of Unix at all, anymore. During 1982, in an anti-competitive move, Microsoft quietly ceased new work on Xenix while still promoting the system, thus gradually turning it into vaporware that continued to gain popularity even though Microsoft had stopped developing the operating system beyond keeping up to date with the latest underlying Unix version and maintaining basic stability. Much later, in 1987, Microsoft lost interest in maintaining Xenix altogether, selling it to the Santa Cruz Operation (SCO).
In 1984, the Bell System was broken up due to its monopoly power (telecommunication companies were the Big Tech of their day). Following the split, AT&T continued to develop System V under its newly restructured AT&T Technologies division. No longer a part of the Bell System, AT&T was free to sell UNIX, having released Unix System V in 1983 as its flagship commercial Unix operating system. AT&T was now selling Unix in direct competition with Microsoft, and since AT&T had the original Unix developers working for them, they were sure to set the precedent for the standard to which the operating system conformed. These circumstances reinforced Microsoft's plan to develop MS-DOS and abandon Unix.
The primary reason Microsoft gave up on Unix is that they wanted to control the operating system standard, what would eventually become the Windows API. Another reason was hardware requirements: MS-DOS was single-tasking and single-user compared to Xenix being multitasking and multi-user, which gave the latter system significantly higher hardware requirements for the time. MS-DOS required a minimum of 32K of RAM while Xenix required a minimum of 256K of RAM (384K for the full install, which totals 8x or 12x more memory than MS-DOS); this difference was impactful to each system's adoption among home users. Even "RAM-hungry" versions of MS-DOS at the time consumed around 2.75x less memory than a base installation of Xenix. Microsoft attaining a partnership with IBM ("Big Blue", the most dominant tech company at the time) to license and ship MS-DOS (rebranded to PC-DOS by IBM) as the default operating system on affordable IBM PCs (released August 12, 1981, the same day as MS-DOS) was the last big domino to fall in securing Microsoft's place in the market. Pivotally, Microsoft licensed, not sold, MS-DOS to IBM so they could retain the rights to the operating system. That's right: before Microsoft was setting defaults, they became the default through a strategic business partnership with IBM. After the era of the IBM PC, Microsoft went on to form many strategic licensing agreements with OEMs (including some anti-competitive deals). Ironically, less than two decades after the Bell System's breakup, Microsoft would itself be deemed a monopoly.
The commonly cited reason for what ultimately led to Microsoft's departure from Unix is that AT&T was introduced as competition in a context where Microsoft perceived them as likely to control the operating system standard. While this reasoning is factual, I see there was also another, unchosen option that Microsoft could have taken to remain dominant, if not more so, while still shipping a competent Unix-based system. Here is how:
First, Microsoft undervalued the power they held as a highly popular and growing Unix-based system in the face of AT&T. We can see clear evidence of the weight Microsoft carried by how, in 1987, they were able to convince AT&T to pay them royalties for code that allowed the broader Unix ecosystem to stay compatible with Xenix (i.e. the opposite of Microsoft's fear, because AT&T wanted their Unix to stay compatible with Xenix, not the other way around). Microsoft could have bankrolled MS-DOS compatibility into Xenix as a compatibility mode (the opposite of MS-DOS staying compatible with Xenix, since that was the "second most important feature of MS-DOS 2.0"), created their own user-friendly GUI, and designed proprietary subsystems for Xenix to champion. Basically, Microsoft could have done anything in their infamous "embrace, extend, and extinguish" (EEE) mantra minus the most controversial "extinguish" part.
Second, one of Microsoft's fears was that, if they stuck with Unix, they would be at the mercy of anti-competitive business practices by AT&T: "They might sell it [Unix] for cheaper than we had to pay them in royalties!" However, ex-Bell System members like AT&T were already well on the radar of US antitrust regulators. As a result, if Microsoft had simply gone to regulators with their concerns, or gone to them right away if AT&T tried anything nasty, there is a very good chance the government would have sided with Microsoft.
Third, if Microsoft still did not like the idea of ever having to pay AT&T royalties, then there were certainly solutions while sticking to a technologically sound and well-designed base. They could have developed their own Unix from scratch based on the open specification. Or, more likely, switched to permissively licensed code such as that from 386BSD down the line, which was released a year before Windows NT 3.1. Unix started as a research operating system with lots of academic interest, so it was foreseeable that Unix rewrites would come out that would escape AT&T's royalties. Either of these actions would have solved the royalty problem since only using AT&T code came with royalties; the specification was always open. This strategy would have been highly feasible, seeing as Apple followed a similar timeline: starting with Apple DOS, switching to A/UX (based on AT&T Unix), and later switching again to create MacOS based on NeXTSTEP (a system they obtained by acquiring NeXT), which was itself based on the permissively licensed 4.3BSD Unix.
Fourth, sticking with Unix may have allowed Microsoft to avoid their own time-consuming United States v. Microsoft Corp. monopoly case (concluded in 2001) because Microsoft would never have begun making the anti-competitive deals they did with OEMs over Windows, and so would not have gone on to attempt the same kind of anti-competitive deals with OEMs over Internet Explorer. Microsoft's Xenix popularity did not come from anti-competitive deals but rather from being first to the Unix microcomputer market, supporting lots of hardware, and, as a result, having a growing ecosystem of applications for Xenix. In other words, Microsoft innovated and just made a good product. Bill Gates has remarked on the significance of the antitrust investigations into Microsoft:
There’s no doubt the antitrust lawsuit was bad for Microsoft, and we would have been more focused on creating the phone operating system, and so instead of using Android today, you would be using Windows Mobile if it hadn’t been for the antitrust case.
We’re in the field of doing operating systems for personal computers. We knew the mobile phone would be very popular, and so we were doing what was called Windows Mobile. We missed being the dominant mobile operating system by a very tiny amount. We were distracted during our antitrust trial. We didn’t assign the best people to do the work. So it’s the biggest mistake I made in terms of something that was clearly within our skill set. We were clearly the company that should have achieved that, and we didn’t. We allowed this Motorola design win, and therefore the software momentum to go to Android, and so it became the dominant non-Apple mobile phone operating system globally.
Fifth and perhaps most importantly: opportunity cost. If Microsoft had redirected all the effort that went into Windows NT 3.1 (the first Windows NT release, a full operating system written from scratch) into capturing the mobile market with Unix as a strong base, then Bill Gates could have had both the personal computer and mobile markets. The same goes for Microsoft's web browser: if they had stuck with Unix instead of taking on the long drawn-out task of creating a new operating system from nothing, then Internet Explorer (released August 1995) could have beat Netscape (released October 1994) to the market by a wide margin. 386BSD even beat Microsoft to being the first platform to support protected mode on the i386 processor. There would have been nothing stopping Microsoft from switching to Unix at this point, especially because, by the time Microsoft was done working on Windows NT 3.1 and released it in 1993, the average home computer was significantly more powerful and fully capable of running a Unix operating system (in fact, Windows NT 3.1 was notorious for consuming lots of memory in contrast to Unix systems).
Microsoft could have had all these things, but instead they focused far too much time and attention on the personal computer, and ultimately decided to harm progress by becoming the leading anti-competitive company in the computer space.
Windows: GUI first appeared as a graphical extension to MS-DOS. The first Windows version, Windows 1.0, was released in 1985 and provided a basic graphical user interface (GUI) on top of MS-DOS. Originally, starting Windows required running the `WIN.COM` program in the MS-DOS command prompt. Later, by Windows 95 (still based on MS-DOS), Windows started by default so manually running the `WIN.COM` program was no longer necessary. Windows Me was the last version of Windows based on MS-DOS.
Windows NT (New Technology) was introduced in 1993 as a separate line of operating systems built from the ground up. Unlike the DOS-based Windows versions, Windows NT was designed as a fully 32-bit, multi-user, and multitasking operating system with a focus on security and stability. All Windows operating systems from Windows NT 3.1 onward are based on the NT kernel and components. Windows NT 3.1 (1993) was the first Windows operating system based on the NT kernel. The GUI became part of the Windows API with facilities we know today like `CreateWindow`, `GetMessage`, and the message loop. In Windows NT 4.0, Microsoft sacrificed some stability and security for performance by moving the GDI and USER subsystems (in charge of graphics and windowing tasks) from user-mode to kernel-mode (`win32k.sys`).
Windows 1.0 had color support; however, it was limited to 4-bit color (16 colors) due to its reliance on the IBM EGA graphics adapter (or fewer colors at the same time if the older IBM CGA was used). Windows 3.0 introduced 256 possible colors on supporting VGA graphics hardware. Windows NT 3.1 added SVGA support, thus providing 24-bit color, at which point the number of supported colors mostly depended on a system's graphics hardware.
Try out Windows 1.0 yourself here (or on v86), I recommend daily driving it for a week before making the switch.
Note that Windows 3.1 (1992) and Windows NT 3.1 (1993) are distinct operating systems. Windows 3.10 received a few minor point upgrades, which as a group are referred to as Windows 3.1x (still on MS-DOS, not NT). You can use 86Box or v86 to easily run Windows NT 3.1.
On Unix-like OSs: The X Window System first appeared in 1984. The X Window System initially received color support in 1985. Graphics hardware support for color and a mature color scheme gradually improved over X releases.
Macintosh: The first Macintosh system to have a GUI was the Apple Macintosh 128K (hardware), which came with System 1 (1984). However, color didn't come until the Macintosh II with System 4 (1987).
Other: Amiga OS (1985, just before Windows 1.0 released) had a GUI with color support.
Windows NT 3.1 (1993) was well received and posed major competition to Apple's System 7 (1991), the desktop consumer operating system holding the largest market share. The X Window System on Unix-like operating systems was mostly used for academic, research, and certain commercial purposes.
Windows: Virtual memory first appeared in Windows 3.0 (1990), which supported the 386 Enhanced mode on Intel 386 processors. However, Windows 3.x (including Windows 3.0 and non-NT Windows 3.1 with a few point release upgrades) only supported using the virtual memory feature for extra memory (i.e. a page file or swap memory). All applications still ran in a single shared virtual address space because the Windows 3.x kernel and all of userland was still 16-bit. It wasn't until Windows NT 3.1 (1993, the first Windows NT release) that the Windows operating system was fully 32-bit and each process was isolated to its own address space. Windows 95 (1995) was the first DOS-based Windows to support a per-process virtual address space for 32-bit processes, while also still having a single shared address space for 16-bit application compatibility and parts of the operating system that were still 16-bit.
Unix-like OS: First appeared in 3BSD (1978) on VAX (this system did not take advantage of the virtual memory capabilities of VAX for swapping to disk, though). On Intel 386, the first Unix-like OS to support virtual address spaces was 386BSD (March 1992).
Macintosh: First appeared in System 7 (1991) using the Motorola 68k architecture (Apple only shipped Macintosh with their own hardware and didn't move to Intel until 2006). While modern macOS is Unix-like, Apple developed their own proprietary GUI solution for their operating system.
Other: OS/2 1.0 (1987) supported parts of Intel 386 protected mode through a compatibility layer, but it was limited by its 16-bit architecture. It wouldn't be until OS/2 2.0 (April 1992, after the IBM and Microsoft split) that OS/2 would receive full Intel 386 Enhanced mode support (protected mode with per-process virtual address spaces + virtual memory with demand paging + preemptive multitasking).
There were many operating systems written for the Virtual Address eXtension (VAX) instruction set architecture (ISA). These include the original OpenVMS OS (1977) which was not Unix-like and later some Unix-like OSs. VAX was ahead of its time, featuring full 32-bit protected mode support. However, factors such as being closely tied to the non-portable OpenVMS OS, strategic missteps of Digital Equipment Corporation (DEC) like being late to microcomputers with the MicroVAX I releasing in late 1984, the high expense of VAX systems, and competition from other architectures ultimately led to VAX not catching on.
Academia: First appeared in the Atlas (1962), which supported virtual address spaces as part of its virtual memory system. The computer was awarded an IEEE Milestone award for inventing virtual memory.
On a 32-bit microprocessor, 386BSD (March 1992) was impressively the first major operating system to fully utilize protected mode on the Intel 386. This release date beats Windows NT 3.1 by a year. 386BSD was closely followed by OS/2 2.0 (April 1992), with only one month between their releases (it's too bad OS/2 was abandoned, first by Microsoft, who left IBM to develop the operating system on their own in 1990). On the VAX architecture, OpenVMS (1977) beat everybody to supporting virtual address spaces by a long shot.
The first Portable Operating System Interface (POSIX) release came in 1988.
Threads in POSIX didn't come until 1995 because Unix architecture favored multiprocessing over multithreading (i.e. a preference for single-threaded applications and minimal processes that used `fork`). The complex nature of threads, enabling overlapping operations within the same address space, also required special care. So, the IEEE 1003 working group took their time developing this part of the standard.
Disclaimer: I wasn't alive when most of these events unfolded. Hence why I'm writing this section: to gain perspective. In addition to broad perspective, my aim is to include comprehensive background information and create a timeline of events. I've done diligent research to ensure the facts presented here are correct and emphasized to the optimal degree. If you believe there are inaccuracies or missed nuances, please let me know via email or GitHub Issue/PR. Thank you.
I've provided Microsoft lots of constructive criticism and ways to fix their systems throughout this writeup, but now I need to vent. Everyone has their personal list of Windows complaints, so here's mine. Microsoft is unlikely to fix many of these issues, due either to remaining backward compatible with rash decisions or because it wouldn't benefit them as a business. Although I personally run Linux on my own devices, I have no choice but to use or support Windows in some cases, so I have the right to complain about it. Note that my complaints are directed at the Microsoft Corporation as a company or the specific team at Microsoft (or elsewhere) responsible for each shortcoming. No one person at Microsoft or any big company oversees (or even has access to, for security reasons) all the code. With that said:
- Complaint #1: Downloading Windows
- I had to make an entire project to work around this Microsoft bloatware (stop wasting people's time, Microsoft)
- Complaint #2: String Encoding
- A classic tale of Unix getting it right
- UTF-8 was invented by Ken Thompson (Unix co-founder) and Rob Pike (former Unix developer at Bell Labs and now a Google engineer) in 1992. As Raymond notes, the first version of Windows to ship with UCS-2 unicode support (the predecessor to UTF-16) was Windows NT 3.1 in 1993.
- UTF-8 is simply the best: Backwards compatible with ASCII, 2x smaller for Latin text than its UTF-16 counterpart (with its small size being crucial for the web), and no horrendous BOM
- Microsoft is now stuck with separate unicode functions bloating the Windows API forever, format specifier disagreements between standard C compilers and the VC++ compiler, and things like PowerShell annoyingly default to UTF-16 LE with mandatory BOM
- Complaint #3: GUI/Console Kernel Program Types
- The problem with monoliths is that once you "design" one tightly coupled set of components, it becomes harder to make all the other parts modular, so it spirals out of control quickly until you're left with Windows
- Complaint #4: UAC is Broken
- This is an issue (one fundamentally of poor design)
- Complaint #5: The Win32 Driver Model Sucked
- It wasn't until 2019 that Microsoft replaced it with Windows Driver Frameworks (WDF)
- Looks like I found out why my work-provided laptop fails to wake up from sleep sometimes, thus wasting valuable company time (to be fair, maybe the company I work for should also preload less low-quality software/drivers on their systems... the battery life on a fresh and modern laptop from them is a whopping 2 hours, and the thing used to burn me until I disabled some background tasks so it wouldn't run so hot)
- Complaint #6: File Names and Paths
- Windows reserved filenames continue to be problematic in the Linux source tree (or even when a filename just includes a `:` or `?`, or worst of all `\` path separators, because programming languages recognize them as escape sequences)
- I understand Windows inherited reserved filename issues from MS-DOS, which inherited them from 86-DOS/QDOS (which itself inherited the issue because CP/M DOS originally had no file system hierarchy, causing the developers to taint the global file namespace), but Microsoft knew the technology they were adopting was inferior when they made the switch from Xenix (a Unix-like OS) to a DOS-based operating system. As a result, Microsoft remains accountable.
- Complaint #7: Registry Strings
- Well, nobody thought that one through (even a single step ahead)... seems like that's the story that everything Microsoft and Windows follows
- Complaint #8: Locked EXEs and DLLs
- Running into that whole "This action can't be completed because the file is open in <Application Name>" error when trying to do things on Windows is the worst (the error message makes it sound kind of reasonable but in practice it just sucks)
- Among other things, this poor design is the root cause for installers being unable to remove themselves, which leads to them using hacks
- I've looked around and apparently, if you don't want to use a polling script (which is not guaranteed to work in locked down corporate environments), you can do this trick (at the cost of looking like malware, of course), which appears to be documented
- Complaint #9: PowerShell Null Comparison "Design"
- Hey, this is how it "works by-design", the PowerShell designers really went back to the drawing board to engineer some state of the art stuff here (nothing against the PowerShell devs personally, of course, but you're making this too easy for me)
- This one always gets a laugh out of me
- To be fair, this masterpiece could be from the goated Microsoft employees who always come through to slip some hilariously bad design into Windows. In that case, bravo—you’ve really outdone yourselves this time!
- Complaint #10
- Complaint #11: Forced Recall
- Nobody wants Recall, literally no one
- Oh well, Unix already has the server and mobile markets. Microsoft Windows controls the desktop, but once we get it firmly into a VM or sufficiently translated then it's only a matter of time. That Redmond-based business will still be making boat loads of money from Azure (which runs on Linux and mostly ships Linux VMs), anyway. So, as long as one company in particular doesn't become a greedy sore loser, I think it will work out well for everyone in the end.
- Complaint #12: CMD Variables
- The shell is my favorite, and yours sucks
- Complaint #13: Kernel-Mode GUI
- Ah yes, Windows: The Ultimate Monolith (and how is it that GUI is still faster on OSs that don't put it in the kernel, speaking from experience)
- Complaint #14: COM is a COMplexity Nightmare
- A wise man once said: "If you can't explain it simply, you don't understand it well enough." I don't think Einstein ever anticipated the atrocity that is COM when he made that statement
- Complaint #15: ALPC is Bloat
- Sprawling complexity is sprawling
- POSIX message queues seek to provide message-oriented communication like ALPC does, but are much simpler. For a start, message queues are exposed as special files in `/dev/mqueue`, allowing them to benefit from the simple Unix file permission model. Asynchronous communication? How about a non-blocking read? Also, ALPC is built on top of LRPC, which already has too many message types (`tagLrpcMessageTypes`) that appear to go beyond message-oriented communication. This kind of layered + all-in-one bloat is exactly what makes too much Microsoft technology unbearable. Just compare how many lines of code it takes to do <insert anything here> with ALPC (code preview viewable in the slideshow link) versus a POSIX message queue.
- Moreover, if you look into the `/dev/mqueue` of a typical GNU/Linux system (or at least my GNU/Linux systems), you will see there aren't any existing POSIX message queues. This is because message-oriented communication, where a programmer controls the priority of each individual message sent/received, is a kernel feature that's only useful in specialized low-level scenarios. The Windows operating system is monolithic right through to user-mode (as "The ALPC Security Reality" slide presents, so many things are implemented in DCOM, which internally does ALPC communication). As a result, ALPC is heavily used by Windows user-mode while the closest POSIX message queue equivalent on GNU/Linux is rarely used.
- Instead, GNU/Linux user-mode commonly uses simpler stream-based communication (`read` and `write` system calls on a socket). One major factor contributing to the complexity of ALPC is its implementation of custom security controls and security descriptors. In contrast, Unix domain sockets (based on Berkeley/POSIX sockets) benefit from the simple but powerful "everything is a file" paradigm, which allows them to obtain security through the existing file permission model. This approach keeps the IPC mechanism lightweight and modular. And once again, asynchronous communication is just a non-blocking read (set the `O_NONBLOCK` flag on the socket with `fcntl`, use something like `select` or `epoll` to notify the application when new data arrives, then read from the socket). It's worth noting that the priority (not per-message priority) of a POSIX socket is typically inherited from thread priority to avoid priority inversion (or given by the `SO_PRIORITY` Linux socket option). Like many Windows components, I find ALPC does too much.
- Also, Microsoft has now released yet another IPC mechanism to Windows called Windows Notification Facility (WNF) (although, I do like the publish–subscribe communication model)
- Side note: I've seen, following a `ShellExecute`, Windows spawn timed, dedicated WNF threads (`ntdll!TppWorkerThread`, seemingly spawned by the kernel?) that subscribe to a notification with the callback being some lambda function (e.g. in `combase.dll`, but I've also seen this with `iertutil.dll`, `urlmon.dll`, `SHCORE.dll`, and `KERNELBASE.dll`) that, when called, will exclusively acquire an SRW lock and call `combase!wil::details::FeatureStateManager::FlushUsage` (some API implemented in WIL). This function checks to ensure `PEB_LDR_DATA.ShutdownInProgress` is false, even though it should always be false for the non-exiting thread before `NtTerminateProcess`, then calls `combase!wil::details::FeatureStateManager::RecordUsage`, which judging by the name is part of a state manager like those high-level JavaScript state manager frameworks (but for low-level notifications)... or maybe these threads in particular are just part of some advanced telemetry. It then unlocks the SRW lock (all very interesting, but also a level of bloat that most people would not be able to conceive of at the system level)
- This is turning into its entire own section on inter-process communication that I might formalize at one point
- A lot of things can "work", that doesn't make it the best option or even a good option
- It's also worth noting that like misinformation, over-engineered software or bloatware follows Brandolini's law
- Complaint #16: A Warning Sign?
- Rude, not helpful, and suspiciously vague (also, security through obscurity vibes)
- This error message and the surrounding context of trying to run a non-Microsoft signed binary when Microsoft's signature is enforced has a very AARD code feel to it, not good
- The most generous interpretation of this message is that it's generic or could be a remnant of the now discontinued Windows S Edition, which I've never personally used (however, that doesn't make the message any more helpful or less obscure and I still don't like it)
- Complaint #17: PatchGuard was Never Robust
- Security through obscurity (having an arbitrary read-write primitive on the kernel, such as when you're a driver, will always defeat PatchGuard)
- Alas, the only thing that could break Microsoft's relationship with security through obscurity would be if they made the NT kernel open source (or at least source available) as Apple does for their XNU kernel, and we all know Microsoft doesn't have the balls to do that
- In addition, while I don't think this group policy warrants its own complaint, meaningless and superficial controls that give organizations and users a false sense of security arguably do more harm than good (unless you value compliance box-checking for the sake of checking boxes)
- Complaint #18: The AARD Code
- Microsoft got sued and paid the price for this, so I'm not holding a grudge; however, this is some interesting Microsoft history
- Complaint #19: NTFS Alternate Data Streams
- It's just not a good feature, it violates POLS (the principle of least surprise)
- Complaint #20: Microsoft Locking Down the Kernel
- All kernel-mode drivers now require Microsoft's signature (not just a signature from any trusted root authority), it's clear Microsoft will continue locking down Windows until it becomes a Fisher-Price toy if we let them get away with it
- Complaint #21: Process Creation Search Path
- This search path blunder is only a little less bad than the one we looked at for DLLs
- In particular, searching the current working directory (CWD) by default makes it impossible to safely use the CMD shell upon navigating to an untrusted directory (PowerShell implements a high-level hack around this issue to provide the Unix shell `./` behavior)
- Again, this current working directory mess started with CP/M DOS and then MS-DOS being backward compatible, but Microsoft is accountable because they knew DOS was technologically inferior to Unix but still switched to a DOS-based operating system for business reasons
- Complaint #22: Native Loader Tight Coupling with .NET Loader
- Can someone please explain to me why someone thought it would be a good idea to integrate .NET with the native loader? Imagine if every high-level language did this. (In Windows decompiled code, `LdrpInitializeProcess` may call `LdrpCorInitialize` to do a bunch of .NET-specific process initialization)
- While we're at it, I for one would welcome the JVM's integration into the Windows loader (assuming it's not some crippled J/Direct version, and how about we go for the Amazon Corretto Java distribution, that would be good)
- Complaint #23: Internet Explorer Lost Decade
- Imagine how much further along the world's web technology could be if not for Internet Explorer unfairly monopolizing the market for so long (I wasn't alive then but probably a good decade ahead, if what I've heard from some people is correct... a lost decade of web development)
- To be fair, Microsoft's new venture with Edge has been contributing significant improvements back to the base Chromium project, which should be celebrated. At the same time, we as people, including people working at Microsoft, must proceed with caution whenever one company controls an entire vertical
- Complaint #24: Windows Communication Foundation (WCF) Bloatware Utopia
- Windows Communication Foundation (WCF) is some horrendous Microsoft technology based on the WS-* web service standards, probably some of the grossest standards ever devised
- WS-* is all highly complex, slow, poorly interoperable, and over-engineered technology
- Microsoft played the primary role in creating these standards then Windows completely bought into all the WS-* technologies
- WCF is layered bloat on top of .NET, which is just agonizing
- SOAP, made by Microsoft for web APIs and later WS-*, is bloat on top of XML-RPC (also by Microsoft), which itself is bloat on top of XML
- Microsoft also helped the W3C standardize XML itself in 1998, so I'm not dedicating a separate complaint to this
- Of course, WCF solved a problem at the time, but as I've emphasized prior, that doesn't make it the best solution or even a good one. Even at the time, people knew that WS-* was poor quality tech (luckily for us, this common sense didn't require 20/20 hindsight)
- Microsoft eventually came to their senses on this matter and now recommends Google's gRPC as a WCF alternative
- Thanks to my dad (who holds an equally low opinion of WCF) for letting me know that this now antiquated WCF technology exists so I can complain about it in his place
- Complaint #25: CIM_DATETIME Reinvents the Wheel
- There's this great thing called ISO 8601, circa 1988; you should try it (ISO 8601 also supports high precision like `CIM_DATETIME`)
- I have a POSIX-compliant shell one-liner for parsing these out to a Unix timestamp and I would share, but I need to find it
- Complaint #26: Unix UGO > Windows ACLs
- UGO achieves permission granularity by defining clear groups and then assigning users to any number of those groups where each file can have one group, while ACLs work by tacking permissions directly on files themselves
- Unlike Windows ACLs, UGO groups cannot be nested, which seems like it could be a weakness until you realize that the flat model for groups avoids nasty side effects such as the possibility of group cycles while avoiding excessive permissions and keeping file access times low (nested groups can be simulated by assigning users to more groups, anyway)
- In practice, ACLs are unmanageable, inefficient, and insecure (in terms of auditability for preventing backdoors) because a file system can have tons of files and each file can have up to about 1,820 permissions, meanwhile the UGO model is so elegant and clean, especially with every state of each letter fitting squarely into its own octal digit for perfect memory efficiency
- Changing security information on Windows is always slow as molasses in my experience with those dialog boxes that have no progress indicator
- On top of UGO, modern Unix systems also optionally support ACLs if throwing permissions directly on files is what you want to do, but UGO undoubtedly works best as the base permission model for an operating system
- Like most everything, Unix got this right before MS-DOS or Windows even existed (Microsoft would know, pro tip: enter `fd /xenix.fd` as the boot program then run `ls -l`)
- Bonus
- Here's a Wikipedia page full of extra Microsoft and Windows criticisms, in case you were ever in need of some more
- Bonus #2
- Okay, so I just found out that Microsoft is screwing other cloud platforms by unfairly licensing Windows Server and SQL Server to be 5x more expensive on non-Azure platforms, and I couldn't not include this (Google complained at the end of September 2024)
- Microsoft trying not to be anti-competitive challenge (impossible)
- It's almost as if Microsoft needs to be separated from Windows
- Bonus #3
- Linux developers are trying to take out this Microsoft RNDIS trash because it is some impossible to secure, tightly coupled, junk—let's hope they succeed
- Don't worry NT kernel, I would never forget about you
These are far from all my critiques about Microsoft and Windows. Other critiques are mentioned in meaningful contexts throughout the document. I'm also holding onto many more good candidates for making this list, but here is a start.
In user-mode, code runs in the lifetime of a process. Within a process there are three kinds of lifetimes:
- The application lifetime
- Birth: The `main` function or process entrypoint is called, starting with constructors in the program before `main`
- Death: The `main` function returns, ending with destructors in the program after `main`
- Library subsystem lifetimes
- Birth: Library initializers/constructors (e.g. Windows `DLL_PROCESS_ATTACH` or legacy Unix `_init`)
- Death: Library finalizers/destructors (e.g. Windows `DLL_PROCESS_DETACH` or legacy Unix `_fini`)
- Stack lifetimes
- Birth: Data is pushed onto the stack
- Death: Data is popped off the stack
- The stack is an abstract LIFO data structure, which is in theory of infinite size
- Beyond that, its implementation can technically be anything, but on modern systems a stack is implemented by adding and subtracting to a stack pointer register
- Also referred to as block scope or automatic storage duration
All other lifetimes occur as a result of these three lifetimes. For instance, heap allocation lifetimes inherit from these three types of lifetime because one of them must keep a reference to a given heap allocation within its scope. Thread-local data is just stack memory with the lifetime of the entire stack. Abstracting further, it could be said that the stack lifetime of the main thread is the parent of all lifetimes, especially in a single-threaded application, but also in a multithreaded application because references to threads are stored in the stack or are indirectly referenced by the stack, excluding threads remotely spawned by the outside environment. These three points, though, define the optimal level of abstraction for defining lifetime as it relates to modern user-mode processes.
A loader and linker are two components essential for program execution or building.
As the first component to run code when a process starts, a loader is responsible for setting up a new process and bringing in dependencies of the application as they are required. Setting up a process includes tasks like initializing critical data structures, loading dependencies, and running the program. The term loader is often interchangeable with dynamic linker except that loader is also a catch all term for any operations done at process startup.
A linker can be dynamic, working at application run-time to resolve dependencies between executable modules. A linker can also work at build-time as the next step after compilation, where it is used to write information into a program or library about where a dynamic linker or loader can find its dependencies at process load-time/run-time, or to stitch executable modules together to create a runnable program.
Dynamic linking resolves dependencies between executable modules. Dynamic linking typically occurs at process load-time; however, it can also occur later due to a lazy linking optimization.
Static linking is when, at build-time, a linker (e.g. the `ld` program on GNU/Linux or the `link.exe` program used as part of building with MSVC) stitches object files together into a single executable. Compilers like GCC, Clang, and MSVC commonly invoke linkers (the GCC driver program invokes the linker for you), although linkers can also be used as standalone tools. Note that some Microsoft sources conflate "statically linked" DLLs with dynamically linked DLLs; however, this use of terminology is incorrect.
Dynamic loading refers to loading a library at run-time, such as with `dlopen`/`LoadLibrary` or library lazy loading functionality.
Static loading refers to loading that occurs due to dependencies found during dynamic linking. "Static loading" is Microsoft-specific terminology. The term is a bit misleading because loading is inherently a dynamic operation regardless of how the dependencies to load are found.
A dynamic library is a library that is compiled for use in dynamic linking or dynamic loading. A static library is a library for use in static linking.
Concurrency is the property of a system that allows multiple running tasks to interact with it at overlapping times and reach the same outcome. A "running task" typically refers to a thread of execution within the process or kernel. Concurrency on its own is easy; once you introduce shared resources or state, it can become challenging to navigate. Protecting shared resources or state is where locks and atomic primitives become relevant in concurrency.
At its most fundamental level, the loader is a state machine. Threads may call upon different parts of the loader at overlapping times and when this happens it's the job of the loader to ensure each request is serviced while maintaining a consistent state to produce a consistent result.
Parallelism and concurrency are related but different. Parallelism stipulates a processor with multiple cores running tasks at the same time. Meanwhile, concurrency could be a single-core processor multitasking by continually starting and stopping different tasks. Parallelism is what's needed for heavily CPU-bound workloads because making a single-core processor multitask to complete the same work would be slower than that same processor executing the work item to completion in series due to the additional overhead.
The modern Windows loader employs "loader worker" threads to parallelize its mapping and "snapping" work (the latter referring to dynamic linking): mapping requires making many slow system calls (each user-mode thread corresponds to a kernel-mode thread), and snapping is a purely CPU-bound operation (not requiring any I/O) in which the loader resolves procedure import names depended on by a module to function addresses in the dependent DLL. The Windows loader spawns these loader worker threads together in a thread pool at process startup; it is then the kernel's job to decide which core of a multicore processor runs each thread.
Requiring a strict order of operations, such as when the loader runs module initialization or finalization routines (code within these routines may depend on other modules), makes concurrent or parallelized processing infeasible.
An ABBA deadlock is a deadlock due to lock order inversion. Whether lock order inversion results in an ABBA deadlock is probabilistic because it depends on whether at least two threads interleave while acquiring the locks in a different order. Agreeing on an order to acquire locks in, thereby avoiding lock order inversion, prevents ABBA deadlock. Failure to follow an agreed upon order for acquiring locks is known as a lock hierarchy violation.
A system can realize ABBA deadlock due to lock order inversion in a couple of ways. First, lock order inversion can occur within the lock hierarchy of a single subsystem, either because it is poorly programmed or in distinct cases where there is an intentional goal of maximizing concurrency at the cost of making some subsystem operations unsafe for external code to perform at that time. Second, an ABBA deadlock can occur in the more complex case whereby separate subsystems nest their respective lock hierarchies within a thread. The latter case can be tricky because there may not be a defined order for interacting with separate subsystems, each of which imposes its own lock hierarchy and, when nested, together they form one grand lock hierarchy.
Microsoft refers to an ABBA deadlock as a "deadlock caused by lock order inversion". These are synonyms. However, I prefer the more concrete term common throughout the Linux ecosystem.
It's even possible to ABBA deadlock in Rust because "fearless concurrency" only extends to memory safety and not logical issues in synchronization.
For a conceptual exploration of synchronization issues that can occur in multithreaded applications and how to resolve them, refer to the Dining Philosophers Problem.
The ABA problem (sometimes written as the A-B-A problem) is a low-level concept that can cause data structure inconsistency in lock-free code (i.e. code that relies entirely on atomic assembly instructions to ensure data structure consistency):
In multithreaded computing, the ABA problem occurs during synchronization, when a location is read twice, has the same value for both reads, and the read value being the same twice is used to conclude that nothing has happened in the interim; however, another thread can execute between the two reads and change the value, do other work, then change the value back, thus fooling the first thread into thinking nothing has changed even though the second thread did work that violates that assumption.
A common case of the ABA problem is encountered when implementing a lock-free data structure. If an item is removed from the list, deleted, and then a new item is allocated and added to the list, it is common for the allocated object to be at the same location as the deleted object due to MRU memory allocation. A pointer to the new item is thus often equal to a pointer to the old item, causing an ABA problem.
This description explains how an atomic compare-and-swap (CAS) instruction (e.g. `lock cmpxchg` on x86) can mix up list items on concurrent deletion and creation: a new list item allocated at the same address in memory as a just-deleted list item (e.g., in a linked list) could interleave, thus causing the ABA problem.
As a result, whether naively programmed lock-free code realizes the ABA problem is probabilistic. It is especially chance-dependent with dynamically allocated memory, particularly when considering that modern heap allocators avoid returning the same block of memory in too close succession as a form of exploit mitigation (i.e. modern heap allocators may not do MRU memory allocation as the ABA problem Wikipedia page suggests).
The shortcoming of CAS is that it only atomically checks whether two pointers match. If, for instance, a `malloc` reuses a memory address to store data for a lock-free data structure, then the two pointers can match, causing CAS to believe the underlying data is unchanged, which is not always true (even when, of course, using a synchronized heap).
Here's a minimal demonstration of how the ABA problem can manifest when using a single CAS operation to modify a singly linked list:
- Initial state
  - Node `A` points to node `B` (i.e., `A.next` = `B`).
- ABA scenario
  - Thread 1: Reads `A.next` and sees it points to `B`.
  - Thread 2: Removes node `B` from the list and adds a new node `C` in place of `B` (i.e., `A.next` is updated to `C`).
  - Thread 2: Later, it removes node `C` and adds node `B` back to the list (i.e., `A.next` is updated back to `B`).
- Result
  - Thread 1 attempts to perform a CAS operation to change `A.next` to `D`, assuming `A.next` is still `B`.
  - Since `A.next` has been restored to `B`, the CAS operation succeeds, even though the node that Thread 1 was expecting (`B`) may have a different underlying value.
The problem is that while `A.next` still holds the same pointer value, the underlying value of `A.next` (or `B`) could have been changed to something entirely different by Thread 2 in the interim, thus causing the ABA problem when the CAS incorrectly succeeds. It's possible to create the ABA problem even in safe Rust code.
Modern instruction set architectures (ISAs), such as AArch64 (ARM64), support load-link/store-conditional (LL/SC) instructions, which provide a stronger synchronization primitive than compare-and-swap (CAS) instructions. LL/SC solves the ABA problem by failing the conditional store (SC) if the data at the address referenced by the LL is modified (this can be detected at the ISA level). In other words, it includes the initial read as part of the atomic modification. LL/SC also cannot livelock (an infinite busy loop between two or more threads) because one or more LL/SC pairs failing implies another succeeded.
On older architectures (e.g. x86), one must use a workaround to create correct lock-free code that avoids the ABA problem. However, these workarounds, such as tagged pointers or hazard pointers, can be complex and are often difficult to verify the correctness of.
The dining philosophers problem is a scenario originally devised by Dijkstra to illustrate synchronization issues that occur in concurrent algorithms and how to resolve them. Here's the problem statement.
The simplest solution is that philosophers pick a side (e.g., the left side) and then agree always to take a fork from that side first. If a philosopher picks up the left fork and then fails to pick up the right fork because someone else is holding it, he puts the left fork back down. Otherwise, now holding both forks, he eats some spaghetti and then puts both forks back down. By controlling the order in which philosophers pick up forks, you effectively create (lock) hierarchies between the left and right forks, thus preventing deadlock.
What we just described is the resource hierarchy solution. I encourage you to explore other solutions. Anyone who has studied concurrency in a computer science program will be familiar with this classic problem.
The LdrpDrainWorkQueue
function is responsible for the high-level synchronization of the loader and for helping to process work when it is immediately available. See the Parallel Loader Overview section for contextual information.
// Variable names are my own
typedef enum {
LoadOwner,
LoaderWorker
} LoadType;
// The caller decides whether the context LdrpDrainWorkQueue should work under
// From the perspective of module initialization, the call to LdrpDrainWorkQueue that acquires a loader event before running module initialization routine must work under the load owner context unless the routine is reentering the loader. In the reentrant case, LdrpLoadCompleteEvent will have already been acquired so it cannot be acquired again (an event object is not a reentrant synchronization mechanism)
// This behavior can be seen in the ntdll!LdrpLoadDllInternal function, if it is to call LdrpDrainWorkQueue then it will always go in with the load owner context if the current thread is not already the load owner (it checks for `LoadOwner` in `TEB.SameTebFlags`); otherwise, there are additional conditions that must pass for ntdll!LdrpLoadDllInternal to call LdrpDrainWorkQueue with the loader worker context
PTEB LdrpDrainWorkQueue(LoadType LoadContext)
{
HANDLE EventHandle;
BOOL CompleteRetryOrReturn;
BOOL LdrpDetourExistAtStart;
PLIST_ENTRY LdrpWorkQueueEntry;
//PLIST_ENTRY LinkHolderCheckCorruptionTemp; // This variable is inlined by the call to RtlpCheckListEntry
PTEB CurrentTeb;
PLIST_ENTRY LdrpRetryQueueEntry;
PLIST_ENTRY LdrpRetryQueueBlink;
CompleteRetryOrReturn = FALSE;
EventHandle = (LoadContext == LoadOwner) ? LdrpLoadCompleteEvent : LdrpWorkCompleteEvent;
while ( TRUE )
{
while ( TRUE )
{
RtlEnterCriticalSection(&LdrpWorkQueueLock);
// LdrpDetourExists relates to LdrpCriticalLoaderFunctions, find a list of these functions within this repo
LdrpDetourExistAtStart = LdrpDetourExist;
if ( !LdrpDetourExist || LoadContext == LoaderWorker )
{
LdrpWorkQueueEntry = &LdrpWorkQueue;
// Corruption check on LdrpWorkQueue list: https://www.alex-ionescu.com/new-security-assertions-in-windows-8/
RtlpCheckListEntry(LdrpWorkQueueEntry);
LdrpWorkQueueEntry = LdrpWorkQueue.Flink;
// Test if LdrpWorkQueue is empty
if ( &LdrpWorkQueue == LdrpWorkQueueEntry ) {
if ( LdrpWorkInProgress == LoadContext ) {
LdrpWorkInProgress = 1;
CompleteRetryOrReturn = TRUE;
}
} else {
if ( !LdrpDetourExistAtStart )
++LdrpWorkInProgress;
// LdrpUpdateStatistics is a very small function with one branch on whether we're a loader worker thread
LdrpUpdateStatistics();
}
}
else
{
if ( LdrpWorkInProgress == LoadContext ) {
LdrpWorkInProgress = 1;
CompleteRetryOrReturn = TRUE;
}
LdrpWorkQueueEntry = &LdrpWorkQueue;
}
RtlLeaveCriticalSection(&LdrpWorkQueueLock);
// We only synchronize on a loader event if both conditions are met:
// 1. CompleteRetryOrReturn is FALSE
// 2. The work queue is not empty
// The LdrpDrainWorkQueue function can return without synchronizing on a loader event (I have seen this happen in testing with the LdrpLoadCompleteEvent loader event)
if ( CompleteRetryOrReturn )
break;
// Test if LdrpWorkQueue is empty
if ( &LdrpWorkQueue == LdrpWorkQueueEntry )
{
// No mapping and snapping work left to do, just wait our turn
NtWaitForSingleObject(EventHandle, 0, NULL);
}
else
{
// Help process the work while we're here
// LdrpProcessWork processes (i.e. mapping and snapping) the specified work item
// (PBYTE)LdrpWorkQueueEntry - 8: Navigate from the list link back to the start of the containing undocumented LDRP_LOAD_CONTEXT structure
LdrpProcessWork((PBYTE)LdrpWorkQueueEntry - 8, LdrpDetourExistAtStart);
}
}
// Test if we were called in the LoadOwner context OR if LdrpRetryQueue is empty
//
// WinDbg disassembly (IDA disassembly with "cs:" and decompilation is poor here):
// lea rbx, [ntdll!LdrpRetryQueue (7ffb5bebc3a0)]
// cmp qword ptr [ntdll!LdrpRetryQueue (7ffb5bebc3a0)], rbx
// je ntdll!LdrpDrainWorkQueue+0xb1 (7ffb5bdaea85)
// https://stackoverflow.com/a/68702967
if ( LoadContext == LoadOwner || &LdrpRetryQueue == LdrpRetryQueue.Flink )
break;
RtlEnterCriticalSection(&LdrpWorkQueueLock);
// Complete a retried mapping and snapping operation
// Add a work item to LdrpWorkQueue from LdrpRetryQueue then clear LdrpRetryQueue
// Reverse engineered based on WinDbg disassembly due to the IDA issue described above
// TODO: Use proper list modification macros
// r12 is ntdll!LdrpWorkQueue
// rbx is ntdll!LdrpRetryQueue
// Add first entry of LdrpRetryQueue to LdrpWorkQueue and remove that entry from LdrpRetryQueue
LdrpRetryQueueEntry = LdrpRetryQueue.Flink; // mov rax, qword ptr [ntdll!LdrpRetryQueue (7ffb5bebc3a0)]
// lea rcx, [ntdll!LdrpWorkQueueLock (7ffb5bebc3c0)]
// xorps xmm0, xmm0 (any value xor'd with itself is zero)
LdrpRetryQueueEntry->Blink = &LdrpWorkQueue; // mov qword ptr [rax+8], r12
LdrpWorkQueue.Flink = LdrpRetryQueueEntry; // mov qword ptr [ntdll!LdrpWorkQueue (7ffb5bebc3f0)], rax
LdrpRetryQueueBlink = LdrpRetryQueue.Blink; // mov rax, qword ptr [ntdll!LdrpRetryQueue+0x8 (7ffb5bebc3a8)]
LdrpRetryQueueBlink->Flink = &LdrpWorkQueue; // mov qword ptr [rax], r12
LdrpWorkQueue.Blink = LdrpRetryQueueBlink; // mov qword ptr [ntdll!LdrpWorkQueue+0x8 (7ffb5bebc3f8)], rax
// Clear the LdrpRetryQueue list
LdrpRetryQueue.Blink = &LdrpRetryQueue; // mov qword ptr [ntdll!LdrpRetryQueue+0x8 (7ffb5bebc3a8)], rbx
LdrpRetryQueue.Flink = &LdrpRetryQueue; // mov qword ptr [ntdll!LdrpRetryQueue (7ffb5bebc3a0)], rbx
// Clear ntdll!LdrpRetryingModuleIndex
// Global used by ntdll!LdrpCheckForRetryLoading function (may be called during mapping by ntdll!LdrpMinimalMapModule or ntdll!LdrpMapDllNtFileName functions)
// ntdll!LdrpRetryingModuleIndex is a red-black tree (LdrpCheckForRetryLoading modifies it by calling RtlRbInsertNodeEx)
// Each entry in ntdll!LdrpRetryingModuleIndex is a structure of the undocumented LDRP_LOAD_CONTEXT type
LdrpRetryingModuleIndex = NULL; // movdqu xmmword ptr [ntdll!LdrpRetryingModuleIndex (7ffb5bebd350)], xmm0
RtlLeaveCriticalSection(&LdrpWorkQueueLock);
CompleteRetryOrReturn = FALSE;
}
// Give context to this thread as the LoadOwner, which is state used by the loader
// A thread can have the LoadOwner flag while not owning the ntdll!LdrpLoadCompleteEvent lock
CurrentTeb = NtCurrentTeb();
CurrentTeb->SameTebFlags |= LoadOwner; // 0x1000
return CurrentTeb;
}
Reversing LdrpDecrementModuleLoadCountEx
was necessary for my investigation on how GetProcAddress
handles concurrent library unload.
NTSTATUS LdrpDecrementModuleLoadCountEx(PLDR_DATA_TABLE_ENTRY Entry, BOOL DontCompleteUnload)
{
PLDR_DDAG_NODE Node;
NTSTATUS Status;
BOOL CanUnloadNode;
DWORD_PTR LdrpReleaseLoaderLockReserved; // Not used or even initialized, so I still consider this function fully reverse engineered (also, LdrpReleaseLoaderLock never touches this parameter)
// If the next reference counter decrement will drop the LDR_DDAG_NODE into having zero references then we may want to retry later
// Specifying DontCompleteUnload = FALSE requires that the caller has made this thread the load owner
if ( DontCompleteUnload && Entry->DdagNode->LoadCount == 1 )
{
// Retry when we're the load owner
// NTSTATUS code 0xC000022D: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
return STATUS_RETRY;
}
RtlAcquireSRWLockExclusive(&LdrpModuleDatatableLock);
Node = Entry->DdagNode;
Status = LdrpDecrementNodeLoadCountLockHeld(Node, DontCompleteUnload, &CanUnloadNode);
RtlReleaseSRWLockExclusive(&LdrpModuleDatatableLock);
if ( CanUnloadNode )
{
LdrpAcquireLoaderLock();
// LdrpUnloadNode runs a module's DLL_PROCESS_DETACH
// It also walks the dependency graph potentially to unload other now unused modules
LdrpUnloadNode(Node);
LdrpReleaseLoaderLock(LdrpReleaseLoaderLockReserved, 8); // Second parameter is an ID for use in log messages
}
return Status;
}
Please see the High-Level Loader Synchronization section for the reverse engineering of this function.
Please see the High-Level Loader Synchronization section for a partial reverse engineering of this function.
The document you just read is under a CC BY-SA License.
This repo's code is triple licensed under the MIT License, GPLv2, or GPLv3 at your choice.
Copyright (C) 2023-2025 Elliot Killick [email protected]
Big thanks to Microsoft for successfully nerd sniping me!