
Conversation

@opsiff opsiff commented Oct 21, 2025

…t workloads

mainline inclusion
from mainline-v6.13-rc1
category: performance

commit 223baf9 ("sched: Fix performance regression introduced by mm_cid") introduced a per-mm/cpu current concurrency id (mm_cid), which keeps a reference to the concurrency id allocated for each CPU. This reference expires shortly after a 100ms delay.

These per-CPU references keep the per-mm-cid data cache-local in situations where threads are running at least once on each CPU within each 100ms window, thus keeping the per-cpu reference alive.

However, intermittent workloads behaving in bursts spaced by more than 100ms on each CPU exhibit bad cache locality and degraded performance compared to purely per-cpu data indexing, because concurrency IDs are allocated over various CPUs and cores, therefore losing cache locality of the associated data.

Introduce the following changes to improve per-mm-cid cache locality:

  • Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep track of which mm_cid value was last used, and use it as a hint to attempt re-allocating the same concurrency ID the next time this mm/cpu needs to allocate a concurrency ID (a condensed sketch of the resulting allocation path follows this list),

  • Add a per-mm CPUs allowed mask, which keeps track of the union of CPUs allowed for all threads belonging to this mm. This cpumask is only set during the lifetime of the mm, never cleared, so it represents the union of all the CPUs allowed since the beginning of the mm lifetime (note that the mm_cpumask() is really arch-specific and tailored to the TLB flush needs, and is thus not a viable approach for this),

  • Add a per-mm nr_cpus_allowed to keep track of the weight of the per-mm CPUs allowed mask (for fast access),

  • Add a per-mm max_nr_cid to keep track of the highest number of concurrency IDs allocated for the mm. This is used for expanding the concurrency ID allocation within the upper bound defined by:

    min(mm->nr_cpus_allowed, mm->mm_users)

    When the next unused CID value reaches this threshold, stop trying to expand the cid allocation and use the first available cid value instead.

    Spreading allocation to use all the cid values within the range

    [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]

    improves cache locality while preserving mm_cid compactness within the expected user limits,

  • In __mm_cid_try_get, only return cid values within the range [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This prevents allocating cids above the number of allowed cpus in rare scenarios where cid allocation races with a concurrent remote-clear of the per-mm/cpu cid. This improvement is made possible by the addition of the per-mm CPUs allowed mask,

  • In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than t->nr_cpus_allowed. This criterion was really meant to compare the number of mm->mm_users to the number of CPUs allowed for the entire mm. Therefore, the prior comparison worked fine when all threads shared the same CPUs allowed mask, but not so much in scenarios where those threads have different masks (e.g. each thread pinned to a single CPU). This improvement is made possible by the addition of the per-mm CPUs allowed mask.
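
The combined effect on allocation can be seen in a condensed sketch of __mm_cid_try_get, adapted from the patch (locking and memory-ordering details simplified; not the verbatim kernel code):

static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
{
	struct cpumask *cidmask = mm_cidmask(mm);
	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
	int cid, max_nr_cid, allowed_max_nr_cid;

	/* Shrink max_nr_cid back under min(nr_cpus_allowed, mm_users). */
	max_nr_cid = atomic_read(&mm->max_nr_cid);
	for (;;) {
		allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed),
					   atomic_read(&mm->mm_users));
		if (max_nr_cid <= allowed_max_nr_cid)
			break;
		/* atomic_try_cmpxchg reloads max_nr_cid on failure. */
		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid,
				       allowed_max_nr_cid)) {
			max_nr_cid = allowed_max_nr_cid;
			break;
		}
	}

	/* 1) Re-use the recently used cid if it is still free. */
	cid = __this_cpu_read(pcpu_cid->recent_cid);
	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
	    !cpumask_test_and_set_cpu(cid, cidmask))
		return cid;

	/* 2) Expand the allocation while below the upper bound. */
	cid = max_nr_cid;
	while (cid < READ_ONCE(mm->nr_cpus_allowed) &&
	       cid < atomic_read(&mm->mm_users)) {
		/* atomic_try_cmpxchg reloads cid on failure. */
		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
			continue;
		if (!cpumask_test_and_set_cpu(cid, cidmask))
			return cid;
	}

	/* 3) Fall back to the first free cid within the allowed range. */
	cid = cpumask_first_zero(cidmask);
	if (cid >= READ_ONCE(mm->nr_cpus_allowed))
		return -1;
	if (cpumask_test_and_set_cpu(cid, cidmask))
		return -1;
	return cid;
}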

* Benchmarks

Each thread increments 16kB worth of 8-bit integers in bursts, with a configurable delay between each thread's execution. The threads run one after the other (no two threads run concurrently). The order of thread execution in the sequence is random. The thread execution sequence begins again after all threads have executed. The 16kB areas are allocated with rseq_mempool and indexed by either cpu_id, mm_cid (not cache-local), or cache-local mm_cid. Each thread is pinned to its own core.
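
As a rough illustration of the workload shape only (a hypothetical sketch: the buffer layout is simplified to a plain array instead of rseq_mempool, and run_burst/NR_SLOTS are made-up names, not the actual benchmark code):

#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS  384              /* one area per possible index value */
#define AREA_SIZE (16 * 1024)      /* 16 kB of 8-bit integers per area */

static uint8_t areas[NR_SLOTS][AREA_SIZE];

/*
 * One burst: increment every byte of the area selected by the index
 * (cpu_id, mm_cid, or cache-local mm_cid, depending on the variant).
 */
static void run_burst(unsigned int index)
{
	uint8_t *area = areas[index];

	for (size_t i = 0; i < AREA_SIZE; i++)
		area[i]++;
}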

Testing configurations:

8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:

CPU(s):                   384
  On-line CPU(s) list:    0-383
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 9654 96-Core Processor
    Thread(s) per core:   2
    Core(s) per socket:   96
    Socket(s):            2
Caches (sum of all):
  L1d:                    6 MiB (192 instances)
  L1i:                    6 MiB (192 instances)
  L2:                     192 MiB (192 instances)
  L3:                     768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup is calculated as: (cache-local mm_cid) / (mm_cid).
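For example, in the 200ms table below, the 8-core/1-L3 row yields 19289 / 1336 ≈ 14.4x.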

Intermittent workload delay: 200ms

                     per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
                         (ns)      (ns)                  (ns)
8-core/1-L3             1374      19289                  1336            14.4x
24-core/24-L3           2423      26721                  1594            16.7x
192-core/24-L3          2291      15826                  2153             7.3x
384-thread/24-L3        1874      13234                  1907             6.9x

Intermittent workload delay: 10ms

                     per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
                         (ns)      (ns)                  (ns)
8-core/1-L3               662       756                   686             1.1x
24-core/24-L3            1378      3648                  1035             3.5x
192-core/24-L3           1439     10833                  1482             7.3x
384-thread/24-L3         1503     10570                  1556             6.8x

[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
patch series with a simpler and more general approach. ]

[ This patch applies on top of v6.12-rc1. ]

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
(cherry picked from commit 7e019dc)
Signed-off-by: Wentao Guan <[email protected]>

Summary by Sourcery

Improve cache locality for sched RSEQ concurrency IDs by introducing a per-mm hint and bounding allocations by the union of allowed CPUs and thread count.

Enhancements:

  • Add a recent_cid hint in per-mm/cpu state to prioritize reuse of the last used concurrency ID
  • Introduce a per-mm cpus_allowed mask with its weight (nr_cpus_allowed) and an atomic max_nr_cid to guide and limit ID allocation
  • Modify __mm_cid_try_get to first retry recent_cid, expand allocations up to min(nr_cpus_allowed, mm_users), and search within the allowed CPU range
  • Update sched_mm_cid_migrate_to and related APIs to consider recent_cid and per-mm constraints when migrating or clearing IDs
  • Accumulate thread CPU affinity into the mm’s cpus_allowed mask on set_cpus_allowed to keep mm-wide constraints up to date


sourcery-ai bot commented Oct 21, 2025

Reviewer's Guide

Improve cache locality of scheduler RSEQ concurrency IDs by adding a recent_cid hint, tracking per-mm CPU usage and bounds, constraining and expanding CID allocations to allowed CPUs, and updating allocation and migration routines to use per-mm context.

Sequence diagram for improved CID allocation with recent_cid and per-mm cpus_allowed

sequenceDiagram
    participant T["task_struct (thread)"]
    participant MM["mm_struct (memory map)"]
    participant CID["mm_cid (per-cpu)"]
    participant SCHED["Scheduler"]
    T->>SCHED: Request CID allocation
    SCHED->>MM: Access mm_cidmask and mm_cpus_allowed
    SCHED->>CID: Check recent_cid for reuse
    alt recent_cid available
        CID->>SCHED: Return recent_cid as CID
    else recent_cid not available
        SCHED->>MM: Check max_nr_cid, nr_cpus_allowed, mm_users
        alt Can expand CID allocation
            SCHED->>MM: Increment max_nr_cid
            SCHED->>CID: Allocate new CID
        else Cannot expand
            SCHED->>MM: Find first available CID in allowed range
            SCHED->>CID: Allocate found CID
        end
    end
    SCHED->>CID: Update recent_cid
    SCHED->>T: Return allocated CID

Sequence diagram for updating per-mm cpus_allowed on task affinity change

sequenceDiagram
    participant T["task_struct (thread)"]
    participant MM["mm_struct (memory map)"]
    participant SCHED["Scheduler"]
    T->>SCHED: Change CPU affinity
    SCHED->>MM: Call mm_set_cpus_allowed(mm, new_mask)
    MM->>MM: Update mm_cpus_allowed (union with new_mask)
    MM->>MM: Update nr_cpus_allowed
    SCHED->>T: Complete affinity change

Sequence diagram for CID migration between CPUs with recent_cid update

sequenceDiagram
    participant SRC["Source CPU (src_rq)"]
    participant DST["Destination CPU (dst_rq)"]
    participant MM["mm_struct"]
    participant CID_SRC["mm_cid (src)"]
    participant CID_DST["mm_cid (dst)"]
    SRC->>CID_SRC: Try to clear src_cid and recent_cid
    DST->>CID_DST: Check if dst_cid or recent_cid is set
    alt dst_cid or recent_cid is set and thread count >= nr_cpus_allowed
        DST->>SRC: Abort migration
    else dst_cid/recent_cid not set
        DST->>CID_DST: Move src_cid to dst_cid
        DST->>CID_DST: Update recent_cid
    end
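A minimal sketch of the migrate-to criterion the diagram describes (condensed from the patch; surrounding context in kernel/sched/core.c elided):

	/*
	 * Do not move the cid if the destination already has one (current
	 * or recent) and the mm has at least as many users as CPUs in its
	 * allowed mask: every allowed CPU can then keep its own cid.
	 */
	if ((!mm_cid_is_unset(dst_cid) ||
	     !mm_cid_is_unset(READ_ONCE(dst_pcpu_cid->recent_cid))) &&
	    atomic_read(&mm->mm_users) >= READ_ONCE(mm->nr_cpus_allowed))
		return;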

Class diagram for updated mm_struct and mm_cid structures

classDiagram
    class mm_cid {
        u64 time
        int cid
        int recent_cid
    }
    class mm_struct {
        unsigned long mm_cid_next_scan
        unsigned int nr_cpus_allowed
        atomic_t max_nr_cid
        raw_spinlock_t cpus_allowed_lock
        cpumask_t cpu_bitmap
        cpumask_t* mm_cpus_allowed()
        cpumask_t* mm_cidmask()
        void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
        int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p)
        void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
    }
    mm_struct "1" o-- "*" mm_cid : per-cpu
    mm_struct "1" o-- "1" cpumask_t : mm_cpus_allowed
    mm_struct "1" o-- "1" cpumask_t : mm_cidmask

File-Level Changes

Change: Add recent_cid hint to reuse the last used concurrency ID for cache locality
  • Introduce recent_cid field in mm_cid struct and initialize to UNSET
  • Store recent_cid on allocation and migration paths
  • Try reusing recent_cid in __mm_cid_try_get before scanning the mask
  Files: include/linux/mm_types.h, kernel/sched/sched.h, kernel/sched/core.c

Change: Track per-mm CPU usage and bound concurrency ID allocations (sketch below)
  • Add cpus_allowed mask, nr_cpus_allowed counter, cpus_allowed_lock and max_nr_cid in mm_struct
  • Implement mm_cpus_allowed() accessor and mm_set_cpus_allowed() helper
  • Initialize and update per-mm mask and counters in mm_init_cid, mm_alloc_cid and on affinity changes
  Files: include/linux/mm_types.h, kernel/sched/core.c

Change: Limit CID allocation to currently allowed CPUs and expand the range intelligently
  • Use mm->nr_cpus_allowed as the upper bound instead of nr_cpu_ids in __mm_cid_try_get
  • Expand cid search up to min(nr_cpus_allowed, mm_users) using max_nr_cid
  • Fall back to the first available slot within the permitted range
  Files: kernel/sched/sched.h

Change: Update migration logic to respect per-mm constraints and recent_cid
  • Use mm->nr_cpus_allowed rather than task affinity for the migration threshold
  • Consider both cid and recent_cid when deciding to clear or reuse IDs
  • Clear recent_cid on steal and set recent_cid on the destination cpu
  Files: kernel/sched/core.c

Change: Refactor function signatures to include task context for allocation paths
  • Change __mm_cid_try_get, __mm_cid_get and mm_cid_get to accept struct task_struct
  • Update all callers in sched.h, core.c, exec.c and fork.c to pass the task pointer
  Files: kernel/sched/sched.h, kernel/sched/core.c, fs/exec.c, kernel/fork.c
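
An illustrative sketch of the mm_set_cpus_allowed() helper the second change above describes (simplified; not the verbatim kernel code):

static inline void mm_set_cpus_allowed(struct mm_struct *mm,
				       const struct cpumask *cpumask)
{
	struct cpumask *mm_allowed;

	if (!mm)
		return;
	mm_allowed = mm_cpus_allowed(mm);
	/* The mask is only ever widened: accumulate the union. */
	raw_spin_lock(&mm->cpus_allowed_lock);
	cpumask_or(mm_allowed, mm_allowed, cpumask);
	WRITE_ONCE(mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
	raw_spin_unlock(&mm->cpus_allowed_lock);
}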


Copilot AI left a comment


Pull Request Overview

This pull request improves cache locality of RSEQ concurrency IDs for intermittent workloads by implementing smarter CID allocation strategies. The changes address performance degradation in workloads with bursts spaced by more than 100ms, where the previous implementation lost cache locality.

Key changes:

  • Added tracking of recently used CIDs and per-mm CPU masks to improve CID reuse and allocation decisions
  • Modified CID allocation to attempt reusing recent CIDs and expand allocations within bounds of allowed CPUs
  • Updated migration logic to consider both current and recent CIDs when deciding whether to move CIDs between CPUs

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • kernel/sched/sched.h: Modified CID allocation functions to accept a task_struct parameter and implement recent-CID reuse logic with bounded allocation
  • kernel/sched/core.c: Updated migration and affinity-change handling to track recent CIDs and use the per-mm CPU mask
  • kernel/fork.c: Updated the mm_alloc_cid call to pass the task_struct parameter
  • include/linux/mm_types.h: Added the recent_cid field, per-mm CPU tracking fields (nr_cpus_allowed, max_nr_cid, cpus_allowed_lock), and helper functions
  • fs/exec.c: Updated the mm_init_cid call to pass the task_struct parameter


 static inline unsigned int mm_cid_size(void)
 {
-	return cpumask_size();
+	return 2 * cpumask_size();	/* mm_cpus_allowed(), mm_cidmask(). */

Copilot AI Oct 21, 2025


Corrected spelling of 'cidmask' to match the actual function name 'mm_cidmask()'.

Suggested change:
-return 2 * cpumask_size(); /* mm_cpus_allowed(), mm_cidmask(). */
+return 2 * cpumask_size(); /* mm_cpus_allowed(), mm_cidmask. */


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `kernel/sched/core.c:2708` </location>
<code_context>
 		put_prev_task(rq, p);

 	p->sched_class->set_cpus_allowed(p, ctx);
+	mm_set_cpus_allowed(p->mm, ctx->new_mask);

 	if (queued)
</code_context>

<issue_to_address>
**issue (bug_risk):** mm_set_cpus_allowed is called unconditionally, but p->mm may be NULL.

Add a check to ensure p->mm is not NULL before calling mm_set_cpus_allowed to prevent a possible NULL pointer dereference.
</issue_to_address>
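
A minimal sketch of the guard this comment asks for, in the quoted context:

	p->sched_class->set_cpus_allowed(p, ctx);
	/* Kernel threads have no mm: guard before touching the per-mm mask. */
	if (p->mm)
		mm_set_cpus_allowed(p->mm, ctx->new_mask);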


@deepin-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from opsiff. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
