
Conversation

@opsiff opsiff commented Oct 21, 2025

…t workloads

mainline inclusion
from mainline-v6.13-rc1
category: performance

commit 223baf9 ("sched: Fix performance regression introduced by mm_cid") introduced a per-mm/cpu current concurrency id (mm_cid), which keeps a reference to the concurrency id allocated for each CPU. This reference expires shortly after a 100ms delay.

These per-CPU references keep the per-mm-cid data cache-local in situations where threads are running at least once on each CPU within each 100ms window, thus keeping the per-cpu reference alive.

However, intermittent workloads behaving in bursts spaced by more than 100ms on each CPU exhibit bad cache locality and degraded performance compared to purely per-cpu data indexing, because concurrency IDs are allocated over various CPUs and cores, therefore losing cache locality of the associated data.

Introduce the following changes to improve per-mm-cid cache locality:

  • Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep track of which mm_cid value was last used, and use it as a hint to attempt re-allocating the same concurrency ID the next time this mm/cpu needs to allocate a concurrency ID (a condensed sketch of the resulting allocation path follows this list),

  • Add a per-mm CPUs allowed mask, which keeps track of the union of CPUs allowed for all threads belonging to this mm. This cpumask is only set during the lifetime of the mm, never cleared, so it represents the union of all the CPUs allowed since the beginning of the mm lifetime (note that the mm_cpumask() is really arch-specific and tailored to the TLB flush needs, and is thus not a viable approach for this),

  • Add a per-mm nr_cpus_allowed to keep track of the weight of the per-mm CPUs allowed mask (for fast access),

  • Add a per-mm max_nr_cid to keep track of the highest number of concurrency IDs allocated for the mm. This is used for expanding the concurrency ID allocation within the upper bound defined by:

    min(mm->nr_cpus_allowed, mm->mm_users)

    When the next unused CID value reaches this threshold, stop trying to expand the cid allocation and use the first available cid value instead.

    Spreading allocation to use all the cid values within the range

    [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]

    improves cache locality while preserving mm_cid compactness within the expected user limits,

  • In __mm_cid_try_get, only return cid values within the range [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This prevents allocating cids above the number of allowed cpus in rare scenarios where cid allocation races with a concurrent remote-clear of the per-mm/cpu cid. This improvement is made possible by the addition of the per-mm CPUs allowed mask,

  • In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than t->nr_cpus_allowed. This criterion was really meant to compare the number of mm->mm_users to the number of CPUs allowed for the entire mm. Therefore, the prior comparison worked fine when all threads shared the same CPUs allowed mask, but not so much in scenarios where those threads have different masks (e.g. each thread pinned to a single CPU). This improvement is made possible by the addition of the per-mm CPUs allowed mask.
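
The combined effect on allocation can be seen in a condensed sketch of __mm_cid_try_get, adapted from the patch (locking and memory-ordering details simplified; not the verbatim kernel code):

static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
{
	struct cpumask *cidmask = mm_cidmask(mm);
	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
	int cid, max_nr_cid, allowed_max_nr_cid;

	/* Shrink max_nr_cid back under min(nr_cpus_allowed, mm_users). */
	max_nr_cid = atomic_read(&mm->max_nr_cid);
	for (;;) {
		allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed),
					   atomic_read(&mm->mm_users));
		if (max_nr_cid <= allowed_max_nr_cid)
			break;
		/* atomic_try_cmpxchg reloads max_nr_cid on failure. */
		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid,
				       allowed_max_nr_cid)) {
			max_nr_cid = allowed_max_nr_cid;
			break;
		}
	}

	/* 1) Re-use the recently used cid if it is still free. */
	cid = __this_cpu_read(pcpu_cid->recent_cid);
	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
	    !cpumask_test_and_set_cpu(cid, cidmask))
		return cid;

	/* 2) Expand the allocation while below the upper bound. */
	cid = max_nr_cid;
	while (cid < READ_ONCE(mm->nr_cpus_allowed) &&
	       cid < atomic_read(&mm->mm_users)) {
		/* atomic_try_cmpxchg reloads cid on failure. */
		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
			continue;
		if (!cpumask_test_and_set_cpu(cid, cidmask))
			return cid;
	}

	/* 3) Fall back to the first free cid within the allowed range. */
	cid = cpumask_first_zero(cidmask);
	if (cid >= READ_ONCE(mm->nr_cpus_allowed))
		return -1;
	if (cpumask_test_and_set_cpu(cid, cidmask))
		return -1;
	return cid;
}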

* Benchmarks

Each thread increments 16kB worth of 8-bit integers in bursts, with a configurable delay between each thread's execution. The threads run one after the other (no two threads run concurrently). The order of thread execution in the sequence is random. The thread execution sequence begins again after all threads have executed. The 16kB areas are allocated with rseq_mempool and indexed by either cpu_id, mm_cid (not cache-local), or cache-local mm_cid. Each thread is pinned to its own core.
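
As a rough illustration of the workload shape only (a hypothetical sketch: the buffer layout is simplified to a plain array instead of rseq_mempool, and run_burst/NR_SLOTS are made-up names, not the actual benchmark code):

#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS  384              /* one area per possible index value */
#define AREA_SIZE (16 * 1024)      /* 16 kB of 8-bit integers per area */

static uint8_t areas[NR_SLOTS][AREA_SIZE];

/*
 * One burst: increment every byte of the area selected by the index
 * (cpu_id, mm_cid, or cache-local mm_cid, depending on the variant).
 */
static void run_burst(unsigned int index)
{
	uint8_t *area = areas[index];

	for (size_t i = 0; i < AREA_SIZE; i++)
		area[i]++;
}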

Testing configurations:

8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:

CPU(s):                   384
  On-line CPU(s) list:    0-383
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 9654 96-Core Processor
    Thread(s) per core:   2
    Core(s) per socket:   96
    Socket(s):            2
Caches (sum of all):
  L1d:                    6 MiB (192 instances)
  L1i:                    6 MiB (192 instances)
  L2:                     192 MiB (192 instances)
  L3:                     768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup is calculated as: (cache-local mm_cid) / (mm_cid).
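For example, in the 200ms table below, the 8-core/1-L3 row yields 19289 / 1336 ≈ 14.4x.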

Intermittent workload delay: 200ms

                     per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
                         (ns)      (ns)                  (ns)
8-core/1-L3             1374      19289                  1336            14.4x
24-core/24-L3           2423      26721                  1594            16.7x
192-core/24-L3          2291      15826                  2153             7.3x
384-thread/24-L3        1874      13234                  1907             6.9x

Intermittent workload delay: 10ms

                     per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
                         (ns)      (ns)                  (ns)
8-core/1-L3               662       756                   686             1.1x
24-core/24-L3            1378      3648                  1035             3.5x
192-core/24-L3           1439     10833                  1482             7.3x
384-thread/24-L3         1503     10570                  1556             6.8x

[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
patch series with a simpler and more general approach. ]

[ This patch applies on top of v6.12-rc1. ]

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
(cherry picked from commit 7e019dc)
Signed-off-by: Wentao Guan <[email protected]>

Summary by Sourcery

Improve cache locality for sched RSEQ concurrency IDs by introducing a per-mm hint and bounding allocations by the union of allowed CPUs and thread count.

Enhancements:

  • Add a recent_cid hint in per-mm/cpu state to prioritize reuse of the last used concurrency ID
  • Introduce a per-mm cpus_allowed mask with its weight (nr_cpus_allowed) and an atomic max_nr_cid to guide and limit ID allocation
  • Modify __mm_cid_try_get to first retry recent_cid, expand allocations up to min(nr_cpus_allowed, mm_users), and search within the allowed CPU range
  • Update sched_mm_cid_migrate_to and related APIs to consider recent_cid and per-mm constraints when migrating or clearing IDs
  • Accumulate thread CPU affinity into the mm’s cpus_allowed mask on set_cpus_allowed to keep mm-wide constraints up to date


sourcery-ai bot commented Oct 21, 2025

Reviewer's Guide

Improve cache locality of scheduler RSEQ concurrency IDs by adding a recent_cid hint, tracking per-mm CPU usage and bounds, constraining and expanding CID allocations to allowed CPUs, and updating allocation and migration routines to use per-mm context.

Sequence diagram for improved CID allocation with recent_cid and per-mm cpus_allowed

sequenceDiagram
    participant T["task_struct (thread)"]
    participant MM["mm_struct (memory map)"]
    participant CID["mm_cid (per-cpu)"]
    participant SCHED["Scheduler"]
    T->>SCHED: Request CID allocation
    SCHED->>MM: Access mm_cidmask and mm_cpus_allowed
    SCHED->>CID: Check recent_cid for reuse
    alt recent_cid available
        CID->>SCHED: Return recent_cid as CID
    else recent_cid not available
        SCHED->>MM: Check max_nr_cid, nr_cpus_allowed, mm_users
        alt Can expand CID allocation
            SCHED->>MM: Increment max_nr_cid
            SCHED->>CID: Allocate new CID
        else Cannot expand
            SCHED->>MM: Find first available CID in allowed range
            SCHED->>CID: Allocate found CID
        end
    end
    SCHED->>CID: Update recent_cid
    SCHED->>T: Return allocated CID

Sequence diagram for updating per-mm cpus_allowed on task affinity change

sequenceDiagram
    participant T["task_struct (thread)"]
    participant MM["mm_struct (memory map)"]
    participant SCHED["Scheduler"]
    T->>SCHED: Change CPU affinity
    SCHED->>MM: Call mm_set_cpus_allowed(mm, new_mask)
    MM->>MM: Update mm_cpus_allowed (union with new_mask)
    MM->>MM: Update nr_cpus_allowed
    SCHED->>T: Complete affinity change

Sequence diagram for CID migration between CPUs with recent_cid update

sequenceDiagram
    participant SRC["Source CPU (src_rq)"]
    participant DST["Destination CPU (dst_rq)"]
    participant MM["mm_struct"]
    participant CID_SRC["mm_cid (src)"]
    participant CID_DST["mm_cid (dst)"]
    SRC->>CID_SRC: Try to clear src_cid and recent_cid
    DST->>CID_DST: Check if dst_cid or recent_cid is set
    alt dst_cid or recent_cid is set and thread count >= nr_cpus_allowed
        DST->>SRC: Abort migration
    else dst_cid/recent_cid not set
        DST->>CID_DST: Move src_cid to dst_cid
        DST->>CID_DST: Update recent_cid
    end
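A minimal sketch of the migrate-to criterion the diagram describes (condensed from the patch; surrounding context in kernel/sched/core.c elided):

	/*
	 * Do not move the cid if the destination already has one (current
	 * or recent) and the mm has at least as many users as CPUs in its
	 * allowed mask: every allowed CPU can then keep its own cid.
	 */
	if ((!mm_cid_is_unset(dst_cid) ||
	     !mm_cid_is_unset(READ_ONCE(dst_pcpu_cid->recent_cid))) &&
	    atomic_read(&mm->mm_users) >= READ_ONCE(mm->nr_cpus_allowed))
		return;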

Class diagram for updated mm_struct and mm_cid structures

classDiagram
    class mm_cid {
        u64 time
        int cid
        int recent_cid
    }
    class mm_struct {
        unsigned long mm_cid_next_scan
        unsigned int nr_cpus_allowed
        atomic_t max_nr_cid
        raw_spinlock_t cpus_allowed_lock
        cpumask_t cpu_bitmap
        cpumask_t* mm_cpus_allowed()
        cpumask_t* mm_cidmask()
        void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
        int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p)
        void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
    }
    mm_struct "1" o-- "*" mm_cid : per-cpu
    mm_struct "1" o-- "1" cpumask_t : mm_cpus_allowed
    mm_struct "1" o-- "1" cpumask_t : mm_cidmask

File-Level Changes

Change: Add recent_cid hint to reuse the last used concurrency ID for cache locality
  • Introduce recent_cid field in mm_cid struct and initialize to UNSET
  • Store recent_cid on allocation and migration paths
  • Try reusing recent_cid in __mm_cid_try_get before scanning the mask
  Files: include/linux/mm_types.h, kernel/sched/sched.h, kernel/sched/core.c

Change: Track per-mm CPU usage and bound concurrency ID allocations (sketch below)
  • Add cpus_allowed mask, nr_cpus_allowed counter, cpus_allowed_lock and max_nr_cid in mm_struct
  • Implement mm_cpus_allowed() accessor and mm_set_cpus_allowed() helper
  • Initialize and update per-mm mask and counters in mm_init_cid, mm_alloc_cid and on affinity changes
  Files: include/linux/mm_types.h, kernel/sched/core.c

Change: Limit CID allocation to currently allowed CPUs and expand the range intelligently
  • Use mm->nr_cpus_allowed as the upper bound instead of nr_cpu_ids in __mm_cid_try_get
  • Expand cid search up to min(nr_cpus_allowed, mm_users) using max_nr_cid
  • Fall back to the first available slot within the permitted range
  Files: kernel/sched/sched.h

Change: Update migration logic to respect per-mm constraints and recent_cid
  • Use mm->nr_cpus_allowed rather than task affinity for the migration threshold
  • Consider both cid and recent_cid when deciding to clear or reuse IDs
  • Clear recent_cid on steal and set recent_cid on the destination cpu
  Files: kernel/sched/core.c

Change: Refactor function signatures to include task context for allocation paths
  • Change __mm_cid_try_get, __mm_cid_get and mm_cid_get to accept struct task_struct
  • Update all callers in sched.h, core.c, exec.c and fork.c to pass the task pointer
  Files: kernel/sched/sched.h, kernel/sched/core.c, fs/exec.c, kernel/fork.c
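
An illustrative sketch of the mm_set_cpus_allowed() helper the second change above describes (simplified; not the verbatim kernel code):

static inline void mm_set_cpus_allowed(struct mm_struct *mm,
				       const struct cpumask *cpumask)
{
	struct cpumask *mm_allowed;

	if (!mm)
		return;
	mm_allowed = mm_cpus_allowed(mm);
	/* The mask is only ever widened: accumulate the union. */
	raw_spin_lock(&mm->cpus_allowed_lock);
	cpumask_or(mm_allowed, mm_allowed, cpumask);
	WRITE_ONCE(mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
	raw_spin_unlock(&mm->cpus_allowed_lock);
}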


Copilot AI left a comment


Pull Request Overview

This pull request improves cache locality of RSEQ concurrency IDs for intermittent workloads by implementing smarter CID allocation strategies. The changes address performance degradation in workloads with bursts spaced by more than 100ms, where the previous implementation lost cache locality.

Key changes:

  • Added tracking of recently used CIDs and per-mm CPU masks to improve CID reuse and allocation decisions
  • Modified CID allocation to attempt reusing recent CIDs and expand allocations within bounds of allowed CPUs
  • Updated migration logic to consider both current and recent CIDs when deciding whether to move CIDs between CPUs

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • kernel/sched/sched.h: Modified CID allocation functions to accept a task_struct parameter and implement recent-CID reuse logic with bounded allocation
  • kernel/sched/core.c: Updated migration and affinity-change handling to track recent CIDs and use the per-mm CPU mask
  • kernel/fork.c: Updated the mm_alloc_cid call to pass the task_struct parameter
  • include/linux/mm_types.h: Added the recent_cid field, per-mm CPU tracking fields (nr_cpus_allowed, max_nr_cid, cpus_allowed_lock), and helper functions
  • fs/exec.c: Updated the mm_init_cid call to pass the task_struct parameter


 static inline unsigned int mm_cid_size(void)
 {
-	return cpumask_size();
+	return 2 * cpumask_size();	/* mm_cpus_allowed(), mm_cidmask(). */

Copilot AI Oct 21, 2025


Corrected spelling of 'cidmask' to match the actual function name 'mm_cidmask()'.

Suggested change:
-return 2 * cpumask_size(); /* mm_cpus_allowed(), mm_cidmask(). */
+return 2 * cpumask_size(); /* mm_cpus_allowed(), mm_cidmask. */


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `kernel/sched/core.c:2708` </location>
<code_context>
 		put_prev_task(rq, p);

 	p->sched_class->set_cpus_allowed(p, ctx);
+	mm_set_cpus_allowed(p->mm, ctx->new_mask);

 	if (queued)
</code_context>

<issue_to_address>
**issue (bug_risk):** mm_set_cpus_allowed is called unconditionally, but p->mm may be NULL.

Add a check to ensure p->mm is not NULL before calling mm_set_cpus_allowed to prevent a possible NULL pointer dereference.
</issue_to_address>
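
A minimal sketch of the guard this comment asks for, in the quoted context:

	p->sched_class->set_cpus_allowed(p, ctx);
	/* Kernel threads have no mm: guard before touching the per-mm mask. */
	if (p->mm)
		mm_set_cpus_allowed(p->mm, ctx->new_mask);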


@deepin-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from opsiff. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
