-
Notifications
You must be signed in to change notification settings - Fork 2.2k
use clone3 for exec process creation to reduce cgroup lock contention #4782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Currently, the runc exec process creates child processes by first cloning the child process and then writing its PID into cgroup.procs. This approach leads to high lock contention on the cgroup_threadgroup_rwsem read-write lock under conditions of high container density and numerous exec probes, potentially causing system hang. This change introduces the usage of the clone3 system call within the setnsProcess.start function to merge the application of the cgroup into the clone operation (assuming cgroup v2 is in use). By doing so, it avoids the need to write PIDs to cgroup.procs directly, thereby bypassing the requirement for taking the write lock and reducing the risk of lock contention. Signed-off-by: jinda.ljd <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using clone3 might be a good idea (with a fallback, of course), thanks!
Do you have perf numbers to share? To better understand when this is problematic and how much help this provides on kernels that support it.
I wonder if @kolyshkin that, IIRC, wrote the patch for golang had ideas already.
@@ -203,6 +204,28 @@ func (p *setnsProcess) start() (retErr error) { | |||
|
|||
// Get the "before" value of oom kill count. | |||
oom, _ := p.manager.OOMKillCount() | |||
useClone3 := false | |||
if cgroups.IsCgroup2UnifiedMode() && p.initProcessPid != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start() is already complex, let's move this to a function with a clear name, so it's more readable.
procPid := p.pid() | ||
if useClone3 { | ||
procPid = -1 | ||
} | ||
if err := cgroups.WriteCgroupProc(path, procPid); err != nil && !p.rootlessCgroups { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if it's -1 it's not written? Let's just useClone3 for the condition, explaining that if we are using clone3, then it's already set in the cgroup.
p.cmd.SysProcAttr.UseCgroupFD = true | ||
p.cmd.SysProcAttr.CgroupFD = int(fd.Fd()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
man clone3 says this is available since linux 5.7. You are setting useClone3 to true, but I don't think that is detecting CLONE_INTO_CGROUP
is supported in this kernel, right? We will need to improve the detection. Not sure about the golang wrapper, but IIRC @kolyshkin wrote that for Go. He might have tips :)
Note: This PR is only for discussion, and the code is for demonstration. If it is confirmed that there is no problem, the code structure may need to be optimized
Currently, the runc exec process creates child processes by first cloning the child process and then writing its PID into
cgroup.procs
. This approach leads to high lock contention on thecgroup_threadgroup_rwsem
read-write lock under conditions of high container density and numerous exec probes, potentially causing system hang.This change introduces the usage of the
clone3
system call within thesetnsProcess.start
function to merge the application of the cgroup into the clone operation (assuming cgroup v2 is in use). By doing so, it avoids the need to write PIDs to cgroup.procs directly, thereby bypassing the requirement for taking the write lock and reducing the risk of lock contention.