CoreCLR EventPipe CPU+memory leak #111368

Open
dqsully opened this issue Jan 13, 2025 · 0 comments
Labels
area-Tracing-coreclr, untriaged

Description

EventPipe sessions that aren't file or IPC sessions never garbage collect the entries in their "thread session state list" for exited threads. This results in a memory leak proportional to the number of threads that have started since the EventPipe session began. It also results in a single-threaded CPU leak: for each EventPipe event, the reader iterates over the ever-growing thread session state linked list when looking for the next event.

This behavior has been present since at least .NET 6.0, when EventPipe was rewritten in C, and is still present in the latest main branch as of writing this.

Reproduction Steps

Create an EventListener subclass that enables CLR events, something like this:

using System.Diagnostics.Tracing;

class ClrEventListener : EventListener
{
    // Called once for every EventSource, including ones that already exist.
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name.Equals("Microsoft-Windows-DotNETRuntime"))
        {
            // 0x8000 enables exception events in this case
            EnableEvents(eventSource, EventLevel.Informational, (EventKeywords) 0x8000);
        }
    }
}

Instantiate the event listener subclass, spawn thousands of short-lived threads, and then do something that creates CLR events (e.g. throw and catch exceptions).

As the number of threads ever spawned increases, the .NET Long Runni thread (the thread name as truncated by Linux's 15-character limit) that processes these CLR events will get slower and slower and tend towards 100% CPU time, even if the number of currently running threads stays constant.
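The steps above can be sketched as a small console program. This is a minimal sketch, not the code from the linked repo: the `CountingClrEventListener` and `Repro` names, the round sizes, and the choice to have every short-lived thread throw one exception (so that it is sure to write at least one event) are all assumptions made for illustration.

```csharp
using System;
using System.Diagnostics.Tracing;
using System.Threading;

// Hypothetical listener: enables runtime exception events and counts what it receives.
sealed class CountingClrEventListener : EventListener
{
    private long _eventsSeen;

    public long EventsSeen => Interlocked.Read(ref _eventsSeen);

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            // 0x8000 is the exception keyword used in the report.
            EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x8000);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        Interlocked.Increment(ref _eventsSeen);
    }
}

static class Repro
{
    // Throws and catches `count` exceptions on the calling thread; returns how many were caught.
    public static int ThrowAndCatch(int count)
    {
        int caught = 0;
        for (int i = 0; i < count; i++)
        {
            try { throw new InvalidOperationException("repro"); }
            catch (InvalidOperationException) { caught++; }
        }
        return caught;
    }

    // Starts `count` threads one after another; each throws once and exits.
    // Assumption: one event per thread is enough for the session to track that thread.
    public static void SpawnShortLivedThreads(int count)
    {
        for (int i = 0; i < count; i++)
        {
            var thread = new Thread(() => ThrowAndCatch(1));
            thread.Start();
            thread.Join();
        }
    }

    public static void Main()
    {
        using var listener = new CountingClrEventListener();
        for (int round = 1; round <= 5; round++)
        {
            SpawnShortLivedThreads(1_000);
            ThrowAndCatch(100_000);
            Console.WriteLine($"round {round}: {listener.EventsSeen} events received so far");
        }
    }
}
```

While this runs, the CPU time of the dispatcher thread can be watched with a per-thread view such as `top -H`. The number of live threads stays constant across rounds, so any growth in per-event cost would come from the threads that have already exited.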

I made a GitHub repo to fully demonstrate and reproduce this issue.

Expected behavior

EventPipe session performance should scale with the number of currently running threads, not the number of threads that have ever existed during the session.

EventPipe "thread session states" should be garbage collected in all EventPipe session types, so that the length of the thread session state linked-list approximates the number of currently running threads.

Actual behavior

After a non-file, non-IPC EventPipe session is created, any new threads created will cause the .NET Long Runni thread to consume more and more CPU for each event it collects.

Regression?

I haven't tried this on .NET 5.0, so I'm not sure, but it's been around since at least .NET 6.0 when EventPipe was rewritten in C.

Known Workarounds

If the EventPipe session is closed and reopened, the new session will start with an empty "thread session state list" and the old one will have cleaned everything up. This also happens if an active EventPipe session is reconfigured, which ends up closing the session and opening a new one anyway.
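The reconfiguration workaround above could look roughly like the following. This is only a sketch under the assumption that disabling and re-enabling events on the listener counts as a reconfiguration that recreates the underlying session; `RecyclingClrEventListener`, `Recycle`, and `StartRecycling` are hypothetical names, and events raised while the session is being swapped may be lost.

```csharp
#nullable enable
using System;
using System.Diagnostics.Tracing;
using System.Threading;

// Hypothetical listener that periodically disables and re-enables runtime events.
sealed class RecyclingClrEventListener : EventListener
{
    private const EventKeywords ExceptionKeyword = (EventKeywords)0x8000;

    // Field initializers run before the base constructor, which may already
    // call OnEventSourceCreated, so only initializers are used here.
    private readonly object _recycleLock = new object();
    private volatile EventSource? _runtimeSource;
    private Timer? _timer;
    private int _recycleCount;

    public int RecycleCount => Volatile.Read(ref _recycleCount);

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name != "Microsoft-Windows-DotNETRuntime") return;
        _runtimeSource = eventSource;
        EnableEvents(eventSource, EventLevel.Informational, ExceptionKeyword);
    }

    // Disables and re-enables the runtime source. Returns false if the
    // runtime source has not been seen yet.
    public bool Recycle()
    {
        EventSource? source = _runtimeSource;
        if (source is null) return false;

        lock (_recycleLock) // serialize overlapping timer callbacks
        {
            DisableEvents(source);
            EnableEvents(source, EventLevel.Informational, ExceptionKeyword);
            Interlocked.Increment(ref _recycleCount);
        }
        return true;
    }

    // Calls Recycle on a fixed interval until the listener is disposed.
    public void StartRecycling(TimeSpan interval)
    {
        _timer = new Timer(_ => Recycle(), null, interval, interval);
    }

    public override void Dispose()
    {
        _timer?.Dispose();
        base.Dispose();
    }
}
```

A caller would create the listener once and call something like `listener.StartRecycling(TimeSpan.FromMinutes(10))`. The interval is a trade-off between how long the list is allowed to grow and how often events can be dropped during the swap.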

Another workaround would be to only use file or IPC-type EventPipe sessions if possible, as these session types do run garbage collection on the "thread session state list".

Configuration

Tested locally on .NET 8.0.404 and .NET 9.0.101, on a Linux laptop running Ubuntu 24.04 (kernel 6.8.0) with an Intel i7-13800H processor.

Have seen this in production in containers on both Intel and AMD processors, all x86, on various Linux kernels. I don't believe this issue is specific to any particular system or architecture.

Other information

When I test creating 5,000 synchronous, short-lived threads, and then throwing 5,000,000 exceptions on .NET 9.0.101 on my Linux laptop, perf reports the following "Self" times with libcoreclr.so offsets:

26.23%  .NET Long Runni  libcoreclr.so  [.] 0x00000000004f9eda
13.61%  .NET Long Runni  libcoreclr.so  [.] 0x00000000004f9ed6
 7.88%  .NET Long Runni  libcoreclr.so  [.] 0x00000000004f9ed2

For some reason, on my laptop, the perf addresses are off by exactly 0x1000. I couldn't tell you why, but I have validated this behavior with .NET 8.0.404 on my laptop, and I've seen this exact same issue in at least 5 different production systems all on different .NET versions with correct perf addresses.

Unfortunately, libcoreclr.so isn't built with debug symbols, and no debuginfo is published for it, so I've had to reverse-engineer libcoreclr.so to match the disassembly with the source code to figure out what is going on here. Below are the .NET 9.0 slow instructions in question:

004faec5  4d8b6d08           mov     r13, qword [r13+0x8]
004faec9  4d85ed             test    r13, r13
004faecc  0f848f000000       je      0x4faf61

004faed2  498b4500           mov     rax, qword [r13]
004faed6  488b5818           mov     rbx, qword [rax+0x18]
004faeda  4c8b7b10           mov     r15, qword [rbx+0x10]
004faede  4d85ff             test    r15, r15
004faee1  74e2               je      0x4faec5

This is the section from buffer_manager_move_next_event_any_thread that iterates each thread session state to find a single event, which will then be sent back through to the EventListener.
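To make the cost concrete, here is a simplified model of that traversal. It is not the runtime's code: `ThreadSessionState`, `BufferManagerModel`, and the `pruneExited` flag are hypothetical stand-ins, and the real cleanup conditions in EventPipe are not modeled. The sketch only shows that finding one event costs a visit to every list entry, and that removing entries for exited threads with nothing left to read keeps the list near the number of live threads.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one thread's state within a session.
sealed class ThreadSessionState
{
    public Queue<long> PendingTimestamps { get; } = new Queue<long>();
    public bool ThreadExited { get; set; }
}

sealed class BufferManagerModel
{
    private readonly LinkedList<ThreadSessionState> _states = new LinkedList<ThreadSessionState>();

    public long NodesVisited { get; private set; }
    public int StateCount => _states.Count;

    public ThreadSessionState Register()
    {
        var state = new ThreadSessionState();
        _states.AddLast(state);
        return state;
    }

    // Returns the oldest pending event timestamp across all threads, or null if none.
    // Every call walks the whole list, whether or not a thread has anything pending.
    public long? MoveNextEventAnyThread(bool pruneExited)
    {
        ThreadSessionState? oldest = null;
        LinkedListNode<ThreadSessionState>? node = _states.First;
        while (node != null)
        {
            LinkedListNode<ThreadSessionState>? next = node.Next;
            ThreadSessionState state = node.Value;
            NodesVisited++;

            if (state.PendingTimestamps.Count == 0)
            {
                // Nothing to read here: skip it, and optionally drop it for good.
                if (pruneExited && state.ThreadExited) _states.Remove(node);
            }
            else if (oldest == null ||
                     state.PendingTimestamps.Peek() < oldest.PendingTimestamps.Peek())
            {
                oldest = state;
            }
            node = next;
        }
        return oldest?.PendingTimestamps.Dequeue();
    }
}

static class Demo
{
    // Builds a list with `exited` drained, exited threads and one live thread
    // holding `events` events, reads every event, and returns total nodes visited.
    public static long Run(int exited, int events, bool pruneExited)
    {
        var manager = new BufferManagerModel();
        for (int i = 0; i < exited; i++) manager.Register().ThreadExited = true;

        ThreadSessionState live = manager.Register();
        for (long t = 0; t < events; t++) live.PendingTimestamps.Enqueue(t);

        while (manager.MoveNextEventAnyThread(pruneExited) != null) { }
        return manager.NodesVisited;
    }

    public static void Main()
    {
        Console.WriteLine($"without pruning: {Run(5_000, 1_000, false)} node visits");
        Console.WriteLine($"with pruning:    {Run(5_000, 1_000, true)} node visits");
    }
}
```

In this model, reading 1,000 events with 5,000 exited threads on the list takes a little over five million node visits without pruning and about six thousand with it. The first pruning pass pays the full walk once, and every later lookup only sees the live thread.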

dotnet-policy-service bot added the untriaged label Jan 13, 2025