Skip to content

use clone3 for exec process creation to reduce cgroup lock contention #4782

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lujinda
Copy link

@lujinda lujinda commented Jun 13, 2025

Note: This PR is only for discussion, and the code is for demonstration. If it is confirmed that there is no problem, the code structure may need to be optimized

Currently, the runc exec process creates child processes by first cloning the child process and then writing its PID into cgroup.procs. This approach leads to high lock contention on the cgroup_threadgroup_rwsem read-write lock under conditions of high container density and numerous exec probes, potentially causing system hang.

This change introduces the usage of the clone3 system call within the setnsProcess.start function to merge the application of the cgroup into the clone operation (assuming cgroup v2 is in use). By doing so, it avoids the need to write PIDs to cgroup.procs directly, thereby bypassing the requirement for taking the write lock and reducing the risk of lock contention.

Currently, the runc exec process creates child processes by first cloning the child process and then writing its PID into cgroup.procs. This approach leads to high lock contention on the cgroup_threadgroup_rwsem read-write lock under conditions of high container density and numerous exec probes, potentially causing system hang.

This change introduces the usage of the clone3 system call within the setnsProcess.start function to merge the application of the cgroup into the clone operation (assuming cgroup v2 is in use). By doing so, it avoids the need to write PIDs to cgroup.procs directly, thereby bypassing the requirement for taking the write lock and reducing the risk of lock contention.

Signed-off-by: jinda.ljd <[email protected]>
Copy link
Member

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using clone3 might be a good idea (with a fallback, of course), thanks!

Do you have perf numbers to share? To better understand when this is problematic and how much help this provides on kernels that support it.

I wonder if @kolyshkin that, IIRC, wrote the patch for golang had ideas already.

@@ -203,6 +204,28 @@ func (p *setnsProcess) start() (retErr error) {

// Get the "before" value of oom kill count.
oom, _ := p.manager.OOMKillCount()
useClone3 := false
if cgroups.IsCgroup2UnifiedMode() && p.initProcessPid != 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start() is already complex, let's move this to a function with a clear name, so it's more readable.

Comment on lines +258 to +262
procPid := p.pid()
if useClone3 {
procPid = -1
}
if err := cgroups.WriteCgroupProc(path, procPid); err != nil && !p.rootlessCgroups {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if it's -1 it's not written? Let's just useClone3 for the condition, explaining that if we are using clone3, then it's already set in the cgroup.

Comment on lines +223 to +224
p.cmd.SysProcAttr.UseCgroupFD = true
p.cmd.SysProcAttr.CgroupFD = int(fd.Fd())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

man clone3 says this is available since linux 5.7. You are setting useClone3 to true, but I don't think that is detecting CLONE_INTO_CGROUP is supported in this kernel, right? We will need to improve the detection. Not sure about the golang wrapper, but IIRC @kolyshkin wrote that for Go. He might have tips :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants